# Machine Learning Pipelines

* Advantages of Machine Learning Pipelines
* Scikit-learn Pipeline
* Scikit-learn Feature Union
* Pipelines and Grid Search
* Case Study

**CASE STUDY:**
#### Corporate Messaging Case Study
This corporate message data is from one of the free datasets provided on the [Figure Eight Platform](https://www.figure-eight.com/data-for-everyone/), licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).

```python
import nltk
nltk.download(['punkt', 'wordnet'])
import re
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score


def load_data():
    df = pd.read_csv('corporate_messaging.csv', encoding='latin-1')
    df = df[(df["category:confidence"] == 1) & (df['category'] != 'Exclude')]
    X = df.text.values
    y = df.category.values
    
    return X, y


def tokenize(text: str) -> list:
    """
    Function to clean text
    
    - Replaces URLs with "urlplaceholder"
    - Tokenizes
    - Lemmatizes
    - Removes extra whitespace
    - Transforms to lowercase
    
    :param text (str): string data
    
    :return clean_tokens (lst): list of cleaned tokens
    """
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    
    detected_urls = re.findall(url_regex, text)    
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    clean_tokens = [lemmatizer.lemmatize(token).lower().strip() for token in tokens]
    clean_tokens = [lemmatizer.lemmatize(token, pos='v') for token in clean_tokens]

    return clean_tokens


def transformer(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
    # 2: fit the vectorizer and tf-idf transformer on the training data
    vect = CountVectorizer(tokenizer=tokenize)
    tfidf = TfidfTransformer()
    
    X_train_count = vect.fit_transform(X_train)
    X_train_tfidf = tfidf.fit_transform(X_train_count)
    # 3: apply the fitted transformers to the test data (transform only, no refitting)
    X_test_count = vect.transform(X_test)
    X_test_tfidf = tfidf.transform(X_test_count)
    
    return X_train_tfidf, y_train, X_test_tfidf, y_test


def trainer(X_train_tfidf,
            y_train,
            X_test_tfidf,
            clf):
    clf.fit(X_train_tfidf, y_train)
    y_pred = clf.predict(X_test_tfidf)
    
    return y_pred


def display_results(y_test, y_pred):
    labels = np.array(list(set(y_test)), dtype='object')
    confusion_mat = confusion_matrix(y_test, y_pred, labels=labels)
    accuracy = accuracy_score(y_test, y_pred)
    
    # confusion_matrix puts true labels on rows and predicted labels on columns
    display(pd.DataFrame(confusion_mat,
            index=[lab + "_true" for lab in labels],
            columns=[lab + "_pred" for lab in labels]))
    print("Accuracy:", round(accuracy, 4))
    
    return None


def main(clf):
    #1
    X, y = load_data()
    #2, 3
    X_train_tfidf, y_train, X_test_tfidf, y_test = transformer(X, y)
    y_pred = trainer(X_train_tfidf, y_train, X_test_tfidf, clf)
    #4
    display_results(y_test, y_pred)
    
    return None
```
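
The URL replacement step inside `tokenize` can be checked on its own, without any NLTK downloads. The sketch below assumes the widely used URL pattern written as a raw string; the helper name `replace_urls` and the sample sentence are made up for illustration.

```python
import re

# Assumed URL pattern, written as a raw string so the backslashes reach `re` unchanged.
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def replace_urls(text: str, placeholder: str = "urlplaceholder") -> str:
    """Replace every detected URL in `text` with `placeholder` (hypothetical helper)."""
    for url in re.findall(URL_REGEX, text):
        text = text.replace(url, placeholder)
    return text


cleaned = replace_urls("Visit https://example.com/page today")
print(cleaned)  # Visit urlplaceholder today
```

Replacing URLs before tokenizing keeps each link as a single token instead of a scatter of fragments.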

### Pipeline: SKLearn function
Instead of writing each step as its own function, we can use scikit-learn's `Pipeline` class.

Below, you'll find a simple example of a machine learning workflow where we generate features from text data using count vectorizer and tf-idf transformer, and then fit it to a random forest classifier. Before we get into using pipelines, let's first use this example to go over some scikit-learn terminology.

* **Estimator:** An estimator is any object that learns from data, whether it's a classification, regression, or clustering algorithm, or a transformer that extracts or filters useful features from raw data. Since estimators learn from data, they each must have a `fit` method that takes a dataset. In the example below, the `CountVectorizer`, `TfidfTransformer`, and `RandomForestClassifier` are all estimators, and each has a `fit` method.

* **Transformer:** A transformer is a specific type of estimator that has a `fit` method to learn from training data, and then a `transform` method to apply a transformation model to new data. These transformations can include cleaning, reducing, expanding, or generating features. In the example below, `CountVectorizer` and `TfidfTransformer` are transformers.

* **Predictor:** A predictor is a specific type of estimator that has a predict method to predict on test data based on a supervised learning algorithm, and has a fit method to train the model on training data. The final estimator, RandomForestClassifier, in the example below is a predictor.
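
One rough way to see these three roles is to check which methods each object exposes. This is only a sketch: the dictionary names are made up for illustration, and `hasattr` is a quick check rather than an official scikit-learn test for estimator types.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

estimators = {
    "CountVectorizer": CountVectorizer(),
    "TfidfTransformer": TfidfTransformer(),
    "RandomForestClassifier": RandomForestClassifier(),
}

# Record which of the three core methods each object exposes.
capabilities = {
    name: {method: hasattr(est, method) for method in ("fit", "transform", "predict")}
    for name, est in estimators.items()
}

for name, methods in capabilities.items():
    print(name, methods)
```

All three have `fit`, the two text transformers also have `transform`, and only the classifier has `predict`.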

In machine learning tasks, it's pretty common to have a very specific sequence of transformers to fit to data before applying a final estimator, such as this classifier. And normally, we'd have to initialize all the estimators, fit and transform the training data for each of the transformers, and then fit to the final estimator. Next, we'd have to call transform for each transformer again to the test data, and finally call predict on the final estimator.

**NOTE** Every step of the `Pipeline()` has to be a transformer *EXCEPT* for the last step

#### Without `Pipeline()`:
```python
from sklearn.ensemble import RandomForestClassifier

vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = RandomForestClassifier()

# train classifier
X_train_counts = vect.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)
clf.fit(X_train_tfidf, y_train)

# predict on test data
X_test_counts = vect.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_counts)
y_pred = clf.predict(X_test_tfidf)
```

#### With `Pipeline()`:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# build pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(random_state=0))
])
# train classifier
pipeline.fit(X_train, y_train)
# predict on test data
y_pred = pipeline.predict(X_test)
```

Now, by fitting our pipeline to the training data, we're accomplishing exactly what we would by fitting and transforming each of these steps on our training data one by one. Similarly, when we call `predict` on our pipeline with our test data, we're accomplishing what we would by calling `transform` on each of our transformer objects with our test data and then calling `predict` on our final estimator. Not only does this make our code much shorter and simpler, it has other great advantages, which we'll cover next.

Note that every step of this pipeline has to be a transformer, except for the last step, which can be of an estimator type. Pipeline takes on all the methods of whatever the last estimator in its sequence is. For example, here, since the final estimator of our pipeline is a classifier, the pipeline object can be used as a classifier, taking on the `fit` and `predict` methods of its last step. Alternatively, if the last estimator was a transformer, then pipeline would be a transformer.
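
To see how the last step decides what the pipeline can do, here is a small sketch on a made-up four-document corpus. The documents, labels, and the small `n_estimators` value are assumptions chosen only to keep the example quick to run.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Made-up toy data for illustration.
docs = ["great product", "terrible service", "great service", "terrible product"]
labels = [1, 0, 1, 0]

# Last step is a transformer, so the pipeline behaves like a transformer.
text_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])
features = text_pipeline.fit_transform(docs)
print(features.shape)  # (4, 4): four documents, four distinct words

# Last step is a classifier, so the pipeline behaves like a classifier.
clf_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
clf_pipeline.fit(docs, labels)
predictions = clf_pipeline.predict(["great"])
print(predictions)
```

The first pipeline exposes `transform` but not `predict`, while the second exposes `predict`.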

### Pipeline: Advantages
#### 1. Simplicity and Convenience
* **Automates repetitive steps** - Chaining all of your steps into one estimator allows you to fit and predict on all steps of your sequence automatically with one call. It handles smaller steps for you, so you can focus on implementing higher level changes swiftly and efficiently.

* **Easily understandable workflow** - Not only does this make your code more concise, it also makes your workflow much easier to understand and modify. Without Pipeline, your model can easily turn into messy spaghetti code from all the adjustments and experimentation required to improve your model.

* **Reduces mental workload** - Because Pipeline automates the intermediate actions required to execute each step, it reduces the mental burden of having to keep track of all your data transformations. Using Pipeline may require some extra work at the beginning of your modeling process, but it prevents a lot of headaches later on.

#### 2. Optimizing Entire Workflow
* **Grid Search:** A method that automates the process of testing different hyperparameters to optimize a model.
* By running grid search on your pipeline, you're able to optimize your entire workflow, including data transformation and modeling steps. This accounts for any interactions among the steps that may affect the final metrics.
* Without grid search, tuning these parameters can be painfully slow, incomplete, and messy.

#### 3. Preventing Data Leakage
* Using Pipeline, all transformations for data preparation and feature extractions occur within each fold of the cross validation process.
* This prevents common mistakes where you’d allow your training process to be influenced by your test data - for example, if you used the entire training dataset to normalize or extract features from your data.

## Pipeline and Feature Unions
* **Feature Union:** `FeatureUnion` is a class in scikit-learn’s `pipeline` module that allows us to perform steps in parallel and take the union of their results for the next step.
* A pipeline performs a list of steps in a linear sequence, while a feature union performs a list of steps in parallel and then combines their results.
* In more complex workflows, multiple feature unions are often used within pipelines, and multiple pipelines are used within feature unions.
<img src='ml_feat_un_0.png'>

Sometimes, you won't have all the data transformation steps you need in scikit-learn's library, which is why it is possible to actually create your own custom transformers. Keep in mind that `TextLengthExtractor` is a custom transformer that is already built in a separate file and imported for this example.
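
The source file for `TextLengthExtractor` is not shown in these notes. The class below is a hypothetical sketch of what such a transformer might look like, assuming it simply returns the character count of each document as a single column.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical sketch: turns each document into its character count."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the object can be chained.
        return self

    def transform(self, X):
        # Return a 2-D column so it can be stacked next to other features.
        return np.array([len(text) for text in X]).reshape(-1, 1)


lengths = TextLengthExtractor().fit_transform(["short", "a longer message"])
print(lengths.ravel())  # [ 5 16]
```

Returning a column of shape `(n_samples, 1)` matters later, because a feature union stacks the outputs of its transformers side by side.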

### Using Feature Union
[VIDEO](https://youtu.be/QmE6CMGar1U)
Taking the example from the previous video, let's say you wanted to extract two different kinds of features from the same text column: tfidf values and the length of the text. Your first approach might be to create an additional column from the `text` column called `txt_length`, as in the code below. Then both `text` and `txt_length` can be part of your feature matrix. But now your pipeline would break: you can't run `CountVectorizer` on a NumPy array that mixes strings and integers.

```python
df['txt_length'] = df['text'].apply(len)
X = df[['text', 'txt_length']].values
y = df['label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier()),
])

# train classifier (this fails: CountVectorizer expects a sequence of strings, not a 2-D array)
pipeline.fit(X_train, y_train)

# predict on test data
predicted = pipeline.predict(X_test)
```

Let's say you had a custom transformer called `TextLengthExtractor`. Now, you could leave `X_train` as just the original text column, if you could figure out how to add the text length extractor to your pipeline. If only you could fit it on the original text data, rather than the output of the previous transformer. But you need the outputs of both `TfidfTransformer` and `TextLengthExtractor` to feed into the classifier as input.

```python
X = df['text'].values
y = df['label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('txt_length', TextLengthExtractor()),
    ('clf', RandomForestClassifier()),
])

# train classifier (still broken: txt_length receives the tf-idf matrix, not the raw text)
pipeline.fit(X_train, y_train)

# predict on test data
predicted = pipeline.predict(X_test)
```

* Feature unions are super helpful for handling these situations, where we need to run two steps in parallel on the same data and combine their results to pass into the next step.

* Like pipelines, **feature unions** are built using a list of `(key, value)` pairs, where the key is the string that you want to name a step, and the value is the estimator object. Also like pipelines, feature unions combine a list of estimators to become a single estimator. However, a feature union runs its estimators in parallel, rather than in a sequence as a pipeline does. In this example, the estimators run in parallel are `nlp_pipeline` and `txt_len`. Notice we use a pipeline in this feature union to make sure the count vectorizer and tfidf transformer steps are still running in sequence:

```python
from sklearn.pipeline import Pipeline, FeatureUnion

X = df['text'].values
y = df['label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
    ('features', FeatureUnion([

        ('nlp_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer())
        ])),
        ('txt_len', TextLengthExtractor())
    ])),

    ('clf', RandomForestClassifier())
])

# train classifier
pipeline.fit(X_train, y_train)

# predict on test data
predicted = pipeline.predict(X_test)
```

* Now, our pipeline doesn't break and uses both features! This would be equivalent to this code.

```python
from scipy.sparse import hstack

vect = CountVectorizer(tokenizer=tokenize)
tfidf = TfidfTransformer()
txt_len = TextLengthExtractor()
clf = RandomForestClassifier()

# train classifier
X_train_counts = vect.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)

X_train_len = txt_len.fit_transform(X_train)
X_train_features = hstack([X_train_tfidf, X_train_len])
clf.fit(X_train_features, y_train)

# predict on test data
X_test_counts = vect.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_counts)

X_test_len = txt_len.transform(X_test)
X_test_features = hstack([X_test_tfidf, X_test_len])
y_pred = clf.predict(X_test_features)
```

* The **text pipeline** (count vectorizer followed by tfidf transformer) and the **text length extractor** are each fit on the same input, in this case the raw text, independently of each other. Because neither depends on the other, they can run in parallel. Their outputs are then combined column-wise and passed to the next estimator, in this case the classifier.
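
The column-wise combination can be checked by looking at the shape of the union's output. In this sketch the corpus is made up, and a `FunctionTransformer` wrapping a hypothetical `text_length` function stands in for `TextLengthExtractor`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer


def text_length(X):
    # Stand-in for TextLengthExtractor: one column holding each text's length.
    return np.array([len(text) for text in X]).reshape(-1, 1)


docs = ["pipelines are neat", "unions run in parallel", "neat"]

union = FeatureUnion([
    ("nlp_pipeline", Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
    ])),
    ("txt_len", FunctionTransformer(text_length)),
])

combined = union.fit_transform(docs)

# Look up the fitted vectorizer to count how many tf-idf columns were produced.
fitted_branch = dict(union.transformer_list)["nlp_pipeline"]
vocab_size = len(fitted_branch.named_steps["vect"].vocabulary_)

print(vocab_size, combined.shape)
```

The combined matrix has one column per vocabulary word plus one extra column for the text length.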

Read more about feature unions in Scikit-learn's [user guide](http://scikit-learn.org/stable/modules/pipeline.html#feature-union).

## Creating Custom Transformer
In the last section, you used a custom transformer, `TextLengthExtractor`, that was already built for you. You can implement a **custom transformer** yourself by extending scikit-learn's base classes, `BaseEstimator` and `TransformerMixin`. Let's take a look at a very simple example that multiplies the input data by ten.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class TenMultiplier(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X * 10
```

**Remember, all estimators have a *fit method*, and since this is a transformer, it also has a *transform method*.**

* **Fit Method:** This takes in a 2-D array `X` for the feature data and a 1-D array `y` for the target labels. Inside the `fit` method, we simply return `self`. This allows us to chain methods together, since the result of calling `fit` on the transformer is still the transformer object. This method is required to be compatible with `sklearn`.
* **Transform Method:** The `transform` method is where we include the code that transforms the data. In this case, we return the data in `X` multiplied by 10. This `transform` method also takes a 2-D array `X`.
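
As a small illustration of why `fit` returns `self`, the sketch below chains `fit` and `transform` in a single expression and compares the result with `fit_transform`, which `TransformerMixin` provides. The input array is made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class TenMultiplier(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X * 10


X = np.array([[1, 2], [3, 4]])
multiplier = TenMultiplier()

# fit returns the transformer itself, so calls can be chained.
chained = multiplier.fit(X).transform(X)

# TransformerMixin supplies fit_transform, built from fit and transform.
combined = multiplier.fit_transform(X)

print(chained)
```

Both calls give the same array, which is what lets a pipeline drive any transformer through the same two methods.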

___
**EXAMPLE**

Let's test our new transformer, by entering the code below:

```python
multiplier = TenMultiplier()

X = np.array([6, 3, 7, 4, 7])
multiplier.transform(X)
```

`OUTPUT:`
```python
>>> array([60, 30, 70, 40, 70])
```

___

Next, we'll create a custom transformer that has a bit more significance. Let's build a text case normalizer, which converts all text to lowercase.

* We aren't setting anything in our `__init__` method, so we can remove it
* We can leave our `fit` method as is
* The change is in the `transform` method
* We can lowercase all the values in `X` by applying a `lambda` function that calls `lower` on each value
* We'll have to wrap `X` in a pandas `Series` to be able to use `apply`

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class CaseNormalizer(BaseEstimator, TransformerMixin):
    
    def fit(self, X, y=None):
        return self
    
    
    def transform(self, X):
        return pd.Series(X).apply(lambda x: x.lower()).values
```
___
**EXAMPLE**
    
```python
case_normalizer = CaseNormalizer()
X = np.array(['Here', 'Are', 'SOme', 'Words', 'transFORMed', 'by', 'SCIKIT-LEARN'])
case_normalizer.transform(X)
```
`OUTPUT:`
```python
>>> array(['here', 'are', 'some', 'words', 'transformed', 'by', 'scikit-learn'], dtype=object)
```

___

Knowing how to write your own custom transformers gives you more control and flexibility with your machine learning pipelines.

Another way to create custom transformers is by using `FunctionTransformer` from `sklearn`'s preprocessing module. This allows you to wrap an existing function to become a transformer. This provides less flexibility, but is much simpler.

Read more about using FunctionTransformer to create custom transformers [here](http://scikit-learn.org/stable/modules/preprocessing.html#custom-transformers) and [here](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html#sklearn.preprocessing.FunctionTransformer).
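
As a rough sketch of that approach, the case normalizer from above can be rebuilt by wrapping a plain function. The function name `to_lower` and the sample words are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer


def to_lower(X):
    # Lowercase every string in the input array.
    return np.array([text.lower() for text in X], dtype=object)


case_normalizer = FunctionTransformer(to_lower)

words = np.array(['Here', 'Are', 'SOme', 'Words'], dtype=object)
lowered = case_normalizer.fit_transform(words)
print(lowered)
```

The wrapped function has no state to learn, so this route fits best when the transformation does not depend on the training data.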

___
<details><summary>**EXAMPLE**</summary>
    
```python
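import re
import nltk
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion

# pos_tag needs the tagger model in addition to punkt and wordnet
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])


class StartingVerbExtractor(BaseEstimator, TransformerMixin):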


    def starting_verb(self, text):
        # tokenize by sentences
        sentence_list = nltk.sent_tokenize(text)
        
        for sentence in sentence_list:
            # tokenize each sentence into words and tag part of speech
            pos_tags = nltk.pos_tag(tokenize(sentence))

            # skip sentences that produce no tokens (avoids an IndexError below)
            if not pos_tags:
                continue

            # index pos_tags to get the first word and part of speech tag
            first_word, first_tag = pos_tags[0]
            
            # return true if the first word is an appropriate verb or RT for retweet
            if first_tag in ['VB', 'VBP'] or first_word == 'RT':
                return True

        return False


    def fit(self, x, y=None):
        return self


    def transform(self, X):
        # apply starting_verb function to all values in X
        X_tagged = pd.Series(X).apply(self.starting_verb)

        return pd.DataFrame(X_tagged)
```

**IMPLEMENTING**
```python
def load_data():
    df = pd.read_csv('corporate_messaging.csv', encoding='latin-1')
    df = df[(df["category:confidence"] == 1) & (df['category'] != 'Exclude')]
    X = df.text.values
    y = df.category.values
    return X, y


url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def tokenize(text):
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens


def model_pipeline():
    pipeline = Pipeline([
        ('features', FeatureUnion([

            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ])),

            ('starting_verb', StartingVerbExtractor())
        ])),

        ('clf', RandomForestClassifier())
    ])

    return pipeline


def display_results(y_test, y_pred):
    labels = np.unique(y_pred)
    confusion_mat = confusion_matrix(y_test, y_pred, labels=labels)
    accuracy = (y_pred == y_test).mean()

    print("Labels:", labels)
    print("Confusion Matrix:\n", confusion_mat)
    print("Accuracy:", accuracy)


def main():
    X, y = load_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    model = model_pipeline()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    display_results(y_test, y_pred)

main()
```
</details>

## Pipelines and Grid Search

<img src="gs_pipe_0.png">

A powerful benefit to using `pipeline` is the ability to perform a `grid search` on your **entire workflow**.

Most machine learning algorithms have a set of parameters that need tuning. `Grid search` is a tool that allows you to define a “grid” of parameters, or a set of values to check. Your computer automates the process of trying out all possible combinations of values. `Grid search` scores each combination with `cross validation`, and uses the `cross validation` score to determine the parameters that produce the best model.

Running `grid search` on your `pipeline` allows you to try many parameter values thoroughly and conveniently, for both your data transformations and estimators.

And again, although you can also run `grid search` on just a single classifier, **running it on your whole pipeline helps you test multiple parameter combinations across your entire pipeline**. This accounts for interactions among parameters not just in your model, but data preparation steps as well.
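
A minimal sketch of searching over both transformation and model parameters is shown here. The toy corpus, the labels, and the specific parameter values are assumptions chosen so the example runs in a few seconds.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to make the sketch runnable.
docs = [
    "great product", "love this service", "great support", "love the product",
    "terrible product", "hate this service", "terrible support", "hate the product",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Keys follow the pattern <step name>__<parameter name>.
parameters = {
    "vect__ngram_range": [(1, 1), (1, 2)],  # data transformation step
    "tfidf__use_idf": [True, False],        # data transformation step
    "clf__n_estimators": [10, 20],          # modeling step
}

search = GridSearchCV(pipeline, param_grid=parameters, cv=2)
search.fit(docs, labels)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # 2 * 2 * 2 = 8 combinations
```

Every combination is scored with the vectorizer, the tf-idf step, and the classifier refit together, so interactions between the steps are reflected in the scores.
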
___

As you may have seen before, `grid search` can be used to optimize the hyperparameters of a model. Here is a simple example that uses `grid search` to find parameters for a `support vector classifier`.

All you need to do is create a dictionary of `parameters` to search, using keys for the names of the parameters and values for the list of parameter values to check.

Then, pass the model and parameter grid to the `grid search` object. Now when you call fit on this grid search object, it will run cross validation on all different combinations of these parameters to find the best combination of parameters for the model.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

parameters = {
    'kernel': ['linear', 'rbf'],
    'C':[1, 10]
}

svc = SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(X_train, y_train)
```
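
Once the search has been fit, the winning combination and its score can be read from the search object. The sketch below uses the iris dataset bundled with scikit-learn as a stand-in, since `X_train` and `y_train` are not defined in the snippet above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data: the iris dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)

parameters = {"kernel": ["linear", "rbf"], "C": [1, 10]}
clf = GridSearchCV(SVC(), parameters)
clf.fit(X_train, y_train)

print(clf.best_params_)           # winning combination
print(round(clf.best_score_, 3))  # its mean cross-validated accuracy

# Accuracy of the refit best model on the held-out test set.
test_accuracy = clf.score(X_test, y_test)
print(round(test_accuracy, 3))
```

By default the search refits the best combination on the whole training set, which is why `clf` can be used directly to score or predict.
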
___

Consider if we had a data preprocessing step, where we standardized the data using `StandardScaler` like this.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(X_train)

parameters = {
    'kernel': ['linear', 'rbf'],
    'C':[1, 10]
}

svc = SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(scaled_data, y_train)
```

This may seem okay at first, but if you standardize your whole training dataset **and** then use cross validation in grid search to evaluate your model, **you've got data leakage**.

Grid search uses cross validation to score your model, **meaning it splits your training data into folds of train and validation sets**
1. trains your model on the train set
2. scores it on the validation set
3. and does so multiple times.

However, in each fold, the **model already has knowledge of the validation set because all the data was rescaled based on the distribution of the whole training dataset**. Important statistics such as the mean and standard deviation **are computed from the whole dataset**, validation rows included.

*This means the model scores better during cross validation than it really would on unseen data, since **information about the validation set is always baked into the rescaled values** of your train dataset.*

The way to fix this is to fit the standard scaler only on the training part of each cross validation fold, and then apply that fitted scaler to the validation part.
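
Written out by hand, leak-free scaling inside cross validation looks roughly like the loop below. This is a sketch on the iris dataset with five stratified folds; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in folds.split(X, y):
    scaler = StandardScaler()
    # Mean and standard deviation come from the training part of the fold only.
    X_tr = scaler.fit_transform(X[train_idx])
    # The validation part is rescaled with those training statistics.
    X_val = scaler.transform(X[val_idx])

    model = SVC().fit(X_tr, y[train_idx])
    scores.append(model.score(X_val, y[val_idx]))

print(np.round(scores, 3))
```

Repeating this bookkeeping for every preprocessing step and every parameter combination quickly becomes tedious and error prone.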

### Pipelines allow you to do just this.
___
```python
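from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', SVC())
])

# keys are <step name>__<parameter name>
parameters = {
    'scaler__with_mean': [True, False],
    'clf__kernel': ['linear', 'rbf'],
    'clf__C': [1, 10]
}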

cv = GridSearchCV(pipeline, param_grid=parameters)

cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)
```

Now, since the rescaling is included as part of the pipeline, the **standardization doesn't happen until we run grid search**.

This means that in each fold of cross validation, the rescaling is fit only on the data that the model is trained on, **preventing leakage from the validation set**. As you can see, pipelines are very valuable for removing the risk of data leakage during the data preparation process.

**Note on Run Time**

Running grid search can take a while, especially if you are searching over a lot of parameters! If you want to reduce it to a few minutes, try commenting out some of your parameters to grid search over just 1 or 2 parameters with a small number of values each. Once you know that works, feel free to add more parameters and see how well your final model can perform!

## Hyper-parameters:
There are a lot of options available for tuning, and some are more important than others. Calling `get_params()` on the pipeline lists every parameter you can address:
```python
pipeline = Pipeline([
    ('features', FeatureUnion([

        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer())
        ])),

        ('starting_verb', StartingVerbExtractor())
    ])),

    ('clf', RandomForestClassifier())
])

pipeline.get_params()
```
`OUTPUT`:
```python
>>> {'memory': None,
 'steps': [('features',
   FeatureUnion(transformer_list=[('text_pipeline',
                                   Pipeline(steps=[('vect',
                                                    CountVectorizer(tokenizer=<function tokenize at 0x7f853a3fb680>)),
                                                   ('tfidf',
                                                    TfidfTransformer())])),
                                  ('starting_verb', StartingVerbExtractor())])),
  ('clf', RandomForestClassifier())],
 'verbose': False,
 'features': FeatureUnion(transformer_list=[('text_pipeline',
                                 Pipeline(steps=[('vect',
                                                  CountVectorizer(tokenizer=<function tokenize at 0x7f853a3fb680>)),
                                                 ('tfidf',
                                                  TfidfTransformer())])),
                                ('starting_verb', StartingVerbExtractor())]),
 'clf': RandomForestClassifier(),
 'features__n_jobs': None,
 'features__transformer_list': [('text_pipeline',
   Pipeline(steps=[('vect',
                    CountVectorizer(tokenizer=<function tokenize at 0x7f853a3fb680>)),
                   ('tfidf', TfidfTransformer())])),
  ('starting_verb', StartingVerbExtractor())],
 'features__transformer_weights': None,
 'features__verbose': False,
 'features__text_pipeline': Pipeline(steps=[('vect',
                  CountVectorizer(tokenizer=<function tokenize at 0x7f853a3fb680>)),
                 ('tfidf', TfidfTransformer())]),
 'features__starting_verb': StartingVerbExtractor(),
 'features__text_pipeline__memory': None,
 'features__text_pipeline__steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x7f853a3fb680>)),
  ('tfidf', TfidfTransformer())],
 'features__text_pipeline__verbose': False,
 'features__text_pipeline__vect': CountVectorizer(tokenizer=<function tokenize at 0x7f853a3fb680>),
 'features__text_pipeline__tfidf': TfidfTransformer(),
 'features__text_pipeline__vect__analyzer': 'word',
 'features__text_pipeline__vect__binary': False,
 'features__text_pipeline__vect__decode_error': 'strict',
 'features__text_pipeline__vect__dtype': numpy.int64,
 'features__text_pipeline__vect__encoding': 'utf-8',
 'features__text_pipeline__vect__input': 'content',
 'features__text_pipeline__vect__lowercase': True,
 'features__text_pipeline__vect__max_df': 1.0,
 'features__text_pipeline__vect__max_features': None,
 'features__text_pipeline__vect__min_df': 1,
 'features__text_pipeline__vect__ngram_range': (1, 1),
 'features__text_pipeline__vect__preprocessor': None,
 'features__text_pipeline__vect__stop_words': None,
 'features__text_pipeline__vect__strip_accents': None,
 'features__text_pipeline__vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'features__text_pipeline__vect__tokenizer': <function __main__.tokenize(text)>,
 'features__text_pipeline__vect__vocabulary': None,
 'features__text_pipeline__tfidf__norm': 'l2',
 'features__text_pipeline__tfidf__smooth_idf': True,
 'features__text_pipeline__tfidf__sublinear_tf': False,
 'features__text_pipeline__tfidf__use_idf': True,
 'clf__bootstrap': True,
 'clf__ccp_alpha': 0.0,
 'clf__class_weight': None,
 'clf__criterion': 'gini',
 'clf__max_depth': None,
 'clf__max_features': 'auto',
 'clf__max_leaf_nodes': None,
 'clf__max_samples': None,
 'clf__min_impurity_decrease': 0.0,
 'clf__min_impurity_split': None,
 'clf__min_samples_leaf': 1,
 'clf__min_samples_split': 2,
 'clf__min_weight_fraction_leaf': 0.0,
 'clf__n_estimators': 100,
 'clf__n_jobs': None,
 'clf__oob_score': False,
 'clf__random_state': None,
 'clf__verbose': 0,
 'clf__warm_start': False}
```
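
The names in this listing can be used directly with `set_params` or as keys in a grid search dictionary. The sketch below uses a reduced version of the pipeline above, leaving out `tokenize` and `StartingVerbExtractor` so that it runs on its own; the chosen parameter values are arbitrary.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

# Reduced version of the pipeline above, without the custom pieces.
pipeline = Pipeline([
    ("features", FeatureUnion([
        ("text_pipeline", Pipeline([
            ("vect", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
        ])),
    ])),
    ("clf", RandomForestClassifier()),
])

# Keep only the classifier's settings from the full listing.
clf_params = sorted(name for name in pipeline.get_params() if name.startswith("clf__"))
print(clf_params[:3])

# The same double-underscore names work with set_params ...
pipeline.set_params(
    features__text_pipeline__vect__ngram_range=(1, 2),
    clf__n_estimators=50,
)

# ... and as keys of a grid search dictionary.
parameters = {
    "features__text_pipeline__vect__max_df": [0.5, 1.0],
    "clf__min_samples_split": [2, 4],
}
```

Each double underscore steps one level deeper into the nesting: pipeline step, then union branch, then inner pipeline step, then the parameter itself.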