# Machine Learning Pipelines

* Advantages of Machine Learning Pipelines
* Scikit-learn Pipeline
* Scikit-learn Feature Union
* Pipelines and Grid Search
* Case Study

**CASE STUDY:**
#### Corporate Messaging Case Study
This corporate message data is from one of the free datasets provided on the [Figure Eight Platform](https://www.figure-eight.com/data-for-everyone/), licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).

```python
import nltk
nltk.download(['punkt', 'wordnet'])
import re
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score


def load_data():
    df = pd.read_csv('corporate_messaging.csv', encoding='latin-1')
    df = df[(df["category:confidence"] == 1) & (df['category'] != 'Exclude')]
    X = df.text.values
    y = df.category.values
    
    return X, y


def tokenize(text: str) -> list:
    """
    Function to clean text
    
    - Replaces URLs with "urlplaceholder"
    - Tokenizes
    - Lemmatizes
    - Removes extra whitespace
    - Transforms to lowercase
    
    :param text (str): string data
    
    :return clean_tokens (lst): list of cleaned tokens
    """
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    
    detected_urls = re.findall(url_regex, text)    
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    clean_tokens = [lemmatizer.lemmatize(token).lower().strip() for token in tokens]
    clean_tokens = [lemmatizer.lemmatize(token, pos='v') for token in clean_tokens]

    return clean_tokens


def transformer(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)
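    # 2. Fit the vectorizer and tf-idf transformer on the training data only
    vect = CountVectorizer(tokenizer=tokenize)
    tfidf = TfidfTransformer()
    
    X_train_count = vect.fit_transform(X_train)
    X_train_tfidf = tfidf.fit_transform(X_train_count)
    # 3. Reuse the fitted transformers on the test data (transform, not fit_transform)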
    X_test_count = vect.transform(X_test)
    X_test_tfidf = tfidf.transform(X_test_count)
    
    return X_train_tfidf, y_train, X_test_tfidf, y_test


def trainer(X_train_tfidf,
            y_train,
            X_test_tfidf,
            clf):
    clf.fit(X_train_tfidf, y_train)
    y_pred = clf.predict(X_test_tfidf)
    
    return y_pred


def display_results(y_test, y_pred):
    labels = np.array(list(set(y_test)), dtype='object')
    confusion_mat = confusion_matrix(y_test, y_pred, labels=labels)
    accuracy = accuracy_score(y_test, y_pred)
    
    # confusion_matrix puts true labels on the rows and predicted labels on the columns
    display(pd.DataFrame(confusion_mat,
            index=[lab + "_true" for lab in labels],
            columns=[lab + "_pred" for lab in labels]))
    print("Accuracy:", round(accuracy, 4))
    
    return None


def main(clf):
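    # 1. Load the messages and their category labels
    X, y = load_data()
    # 2, 3. Split the data, fit the transformers on the training set, and transform the test set
    X_train_tfidf, y_train, X_test_tfidf, y_test = transformer(X, y)
    # Fit the classifier and predict on the test set
    y_pred = trainer(X_train_tfidf, y_train, X_test_tfidf, clf)
    # 4. Display the confusion matrix and accuracy
    display_results(y_test, y_pred)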
    
    return None
```

### Pipeline: Scikit-learn Class
Instead of writing each step as a separate function, we can use scikit-learn's `Pipeline` class.

Below, you'll find a simple example of a machine learning workflow where we generate features from text data using a count vectorizer and a tf-idf transformer, and then fit a random forest classifier to those features. Before we get into using pipelines, let's first use this example to go over some scikit-learn terminology.

* **Estimator:** An estimator is any object that learns from data, whether it's a classification, regression, or clustering algorithm, or a transformer that extracts or filters useful features from raw data. Since estimators learn from data, each must have a `fit` method that takes a dataset. In the example below, `CountVectorizer`, `TfidfTransformer`, and `RandomForestClassifier` are all estimators, and each has a `fit` method.

* **Transformer:** A transformer is a specific type of estimator that has a `fit` method to learn from training data, and a `transform` method to apply the learned transformation to new data. These transformations can include cleaning, reducing, expanding, or generating features. In the example below, `CountVectorizer` and `TfidfTransformer` are transformers.

* **Predictor:** A predictor is a specific type of estimator that has a predict method to predict on test data based on a supervised learning algorithm, and has a fit method to train the model on training data. The final estimator, RandomForestClassifier, in the example below is a predictor.
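One way to see these three roles is to check which methods each object exposes. The sketch below is illustrative only; `describe` is a hypothetical helper introduced here, not part of scikit-learn.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def describe(estimator):
    """Report which of the fit / transform / predict methods an object exposes."""
    return {
        "fit": hasattr(estimator, "fit"),
        "transform": hasattr(estimator, "transform"),
        "predict": hasattr(estimator, "predict"),
    }


steps = {
    "CountVectorizer": CountVectorizer(),
    "TfidfTransformer": TfidfTransformer(),
    "RandomForestClassifier": RandomForestClassifier(),
}

# Every object has fit (estimator); the first two also have transform
# (transformers); only the classifier has predict (predictor).
roles = {name: describe(est) for name, est in steps.items()}
for name, role in roles.items():
    print(name, role)
```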

In machine learning tasks, it's common to apply a specific sequence of transformers to the data before a final estimator, such as this classifier. Normally, we'd have to initialize all the estimators, call `fit` and `transform` on the training data for each transformer in turn, and then fit the final estimator on the result. Next, we'd have to call `transform` again for each transformer on the test data, and finally call `predict` on the final estimator.

**NOTE:** Every step of a `Pipeline` has to be a transformer *EXCEPT* for the last step, which can be any estimator.

#### Without `Pipeline()`:
```python
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = RandomForestClassifier()

# train classifier
X_train_counts = vect.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)
clf.fit(X_train_tfidf, y_train)

# predict on test data
X_test_counts = vect.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_counts)
y_pred = clf.predict(X_test_tfidf)
```

#### With `Pipeline()`:
```python
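# imports needed in addition to those in the case study code above
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)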

# build pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', RandomForestClassifier(random_state=0))
])
# train classifier
pipeline.fit(X_train, y_train)
# predict on test data
y_pred = pipeline.predict(X_test)
```

Now, by fitting our pipeline to the training data, we accomplish exactly what we would by calling `fit_transform` on each transformer in turn and then `fit` on the final estimator. Similarly, when we call `predict` on our pipeline with our test data, we accomplish what we would by calling `transform` on each transformer and then `predict` on the final estimator. Not only does this make our code much shorter and simpler, it has other advantages, which are covered in the Advantages section below.
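To make that equivalence concrete, here is a minimal sketch that runs both versions on a tiny made-up corpus and compares the predictions. The documents and the `info` / `chat` labels are invented for illustration and are not taken from the corporate messaging data; the default tokenizer is used so the snippet does not depend on NLTK.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy data, invented for this example
train_docs = [
    "our quarterly earnings call is tomorrow",
    "tell us what you think about the new logo",
    "the annual report is now available online",
    "thanks for the feedback we love hearing from you",
]
train_labels = ["info", "chat", "info", "chat"]
test_docs = ["tell us what you think", "the earnings report is available"]

# Manual version: fit/transform each step by hand
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = RandomForestClassifier(n_estimators=10, random_state=0)
train_tfidf = tfidf.fit_transform(vect.fit_transform(train_docs))
clf.fit(train_tfidf, train_labels)
manual_pred = clf.predict(tfidf.transform(vect.transform(test_docs)))

# Pipeline version: the same three steps chained into one estimator
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
pipeline.fit(train_docs, train_labels)
pipeline_pred = pipeline.predict(test_docs)

# With the same random_state, both versions should agree
print(np.array_equal(manual_pred, pipeline_pred))
```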

Note that every step of this pipeline has to be a transformer, except for the last step, which can be any type of estimator. A Pipeline takes on the methods of the last estimator in its sequence. For example, here, since the final estimator of our pipeline is a classifier, the pipeline object can be used as a classifier, taking on the `fit` and `predict` methods of its last step. Alternatively, if the last estimator were a transformer, then the pipeline would be a transformer.
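A quick way to check this behavior is to build two pipelines that differ only in their last step and inspect which methods they expose. This is a small sketch; the variable names are chosen here for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Last step is a classifier, so the pipeline behaves like a classifier
clf_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier()),
])

# Last step is a transformer, so the pipeline behaves like a transformer
feature_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])

clf_methods = (hasattr(clf_pipeline, "predict"), hasattr(clf_pipeline, "transform"))
feature_methods = (hasattr(feature_pipeline, "predict"), hasattr(feature_pipeline, "transform"))
print("classifier pipeline (predict, transform):", clf_methods)
print("feature pipeline (predict, transform):", feature_methods)
```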

### Pipeline: Advantages
#### 1. Simplicity and Convenience
* **Automates repetitive steps** - Chaining all of your steps into one estimator allows you to fit and predict on all steps of your sequence automatically with one call. It handles smaller steps for you, so you can focus on implementing higher level changes swiftly and efficiently.

* **Easily understandable workflow** - Not only does this make your code more concise, it also makes your workflow much easier to understand and modify. Without Pipeline, your model can easily turn into messy spaghetti code from all the adjustments and experimentation required to improve your model.

* **Reduces mental workload** - Because Pipeline automates the intermediate actions required to execute each step, it reduces the mental burden of having to keep track of all your data transformations. Using Pipeline may require some extra work at the beginning of your modeling process, but it prevents a lot of headaches later on.
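As a small illustration of how easy a pipeline is to modify, the sketch below swaps the classifier and changes a vectorizer setting without touching the rest of the workflow. The choice of `AdaBoostClassifier` and of `ngram_range` is just an example of the kind of change you might experiment with.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Steps are addressable by name
print(list(pipeline.named_steps))

# Replace a whole step by name, or set a parameter of a step
# with the <step name>__<parameter name> syntax
pipeline.set_params(clf=AdaBoostClassifier(random_state=0))
pipeline.set_params(vect__ngram_range=(1, 2))
print(pipeline.named_steps["clf"])
```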

#### 2. Optimizing Entire Workflow
* **Grid Search:** A method that automates the process of testing different hyperparameters to optimize a model.
* By running grid search on your pipeline, you're able to optimize your entire workflow, including data transformation and modeling steps. This accounts for any interactions among the steps that may affect the final metrics.
* Without grid search, tuning these parameters can be painfully slow, incomplete, and messy.
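The sketch below shows one way this might look with `GridSearchCV`. The toy corpus, the `info` / `chat` labels, and the particular parameter values are all invented for illustration; a real search would use the case study data and a grid chosen for that problem. Parameters of a step are addressed as `<step name>__<parameter name>`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data, invented for this example
docs = [
    "our quarterly earnings call is tomorrow",
    "tell us what you think about the new logo",
    "the annual report is now available online",
    "thanks for the feedback we love hearing from you",
    "new store opening in chicago next month",
    "what is your favorite product this season",
    "register for our upcoming webinar today",
    "happy friday everyone how was your week",
]
labels = ["info", "chat", "info", "chat", "info", "chat", "info", "chat"]

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# The grid covers both data transformation and modeling steps
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "tfidf__use_idf": [True, False],
    "clf__n_estimators": [5, 10],
}

search = GridSearchCV(pipeline, param_grid=param_grid, cv=2)
search.fit(docs, labels)

print(search.best_params_)
```

After fitting, `search` can be used like the pipeline itself: `search.predict(new_docs)` uses the best parameter combination, refit on all of the data passed to `fit`.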

#### 3. Preventing Data Leakage
* When a Pipeline is cross-validated (for example, inside grid search), all data preparation and feature extraction steps are refit on the training portion of each fold, and only applied to that fold's validation portion.
* This prevents a common mistake where the training process is influenced by held-out data, for example by normalizing or extracting features with statistics computed on the entire dataset before it is split.
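A minimal sketch of the difference, using an invented toy corpus: fitting a vectorizer on all documents lets vocabulary from the held-out documents leak into the features, whereas passing the whole pipeline to `cross_val_score` refits every step on each fold's training portion.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data, invented for this example
docs = [
    "our quarterly earnings call is tomorrow",
    "tell us what you think about the new logo",
    "the annual report is now available online",
    "thanks for the feedback we love hearing from you",
    "new store opening in chicago next month",
    "what is your favorite product this season",
    "register for our upcoming webinar today",
    "happy friday everyone how was your week",
]
labels = ["info", "chat", "info", "chat", "info", "chat", "info", "chat"]

# Leaky: the vocabulary is learned from every document,
# including the last two that we intend to hold out
train_docs, held_out_docs = docs[:6], docs[6:]
leaky_vect = CountVectorizer().fit(docs)
clean_vect = CountVectorizer().fit(train_docs)
leaked_terms = set(leaky_vect.vocabulary_) - set(clean_vect.vocabulary_)
print("terms learned from held-out documents:", sorted(leaked_terms))

# Leak-free: cross_val_score clones the pipeline and refits all of its
# steps on the training portion of each fold
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
scores = cross_val_score(pipeline, docs, labels, cv=2)
print("number of fold scores:", len(scores))
```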

## Pipeline and Feature Unions
* **Feature Union:** `FeatureUnion` is a class in scikit-learn's `sklearn.pipeline` module that allows us to perform steps in parallel and take the union of their results for the next step.
* A pipeline performs a list of steps in a linear sequence, while a feature union performs a list of steps in parallel and then combines their results.
* In more complex workflows, multiple feature unions are often used within pipelines, and multiple pipelines are used within feature unions.
<img src='ml_feat_un_0.png'>
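As a rough sketch of the idea, the example below runs a word-level tf-idf pipeline and a character-level count vectorizer side by side and concatenates their outputs. The two branches and the toy documents are chosen here purely for illustration.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy data, invented for this example
docs = [
    "our quarterly earnings call is tomorrow",
    "tell us what you think about the new logo",
    "the annual report is now available online",
]

# Branch 1: a pipeline nested inside the feature union
word_tfidf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])
# Branch 2: a single transformer
char_counts = CountVectorizer(analyzer="char")

features = FeatureUnion([
    ("word_tfidf", word_tfidf),
    ("char_counts", char_counts),
])
X = features.fit_transform(docs)

# The union's output is the two branches' outputs placed side by side
word_width = clone(word_tfidf).fit_transform(docs).shape[1]
char_width = clone(char_counts).fit_transform(docs).shape[1]
print(X.shape, word_width, char_width)
```

A `FeatureUnion` is itself a transformer, so it can be used as a step inside a larger `Pipeline`, followed by a classifier.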

Sometimes scikit-learn's library won't include all the data transformation steps you need, which is why you can create your own custom transformers. Keep in mind that `TextLengthExtractor` is a custom transformer that is already built in a separate file and imported for this example.
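The actual `TextLengthExtractor` lives in that separate file and is not shown here. As a rough idea of what a custom transformer of this kind could look like, here is a hypothetical stand-in named `SimpleTextLength`; the real implementation may differ.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SimpleTextLength(BaseEstimator, TransformerMixin):
    """Hypothetical custom transformer: one feature holding each text's length."""

    def fit(self, X, y=None):
        # Nothing to learn from the data; return self so calls can be chained
        return self

    def transform(self, X):
        # One row per document, one column with its character count
        return np.array([len(text) for text in X]).reshape(-1, 1)


lengths = SimpleTextLength().fit_transform(["abc", "hello"])
print(lengths.tolist())
```

Because the class defines `fit` and `transform`, it can be placed in a `Pipeline` or `FeatureUnion` like any built-in transformer. Inheriting from `TransformerMixin` supplies `fit_transform`, and `BaseEstimator` supplies `get_params` and `set_params`.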