# <font color='#B31B1B'> Scikit-Learn </font>
We've already seen scikit-learn in action in class as a tool for fitting different machine learning models. But scikit-learn has even more to offer! In this section we'll review some basic syntax for fitting models in scikit-learn and explore tools for data pre-processing and hyperparameter tuning.

In [None]:
from IPython.display import Image
import pandas as pd
import numpy as np
import seaborn as sns
sns.set()

import warnings
warnings.simplefilter("ignore")

### <font color='#B31B1B'> Toy Data </font>

We'll use a custom 'toy' dataset that contains a nice mix of different data types.

In [None]:
data = pd.read_csv("data/data_processing.csv")

In [None]:
data.head()

Our goal is to predict `col3`, a column of real-valued numbers. This makes it **what kind of machine learning problem?**

### <font color='#B31B1B'> Pre-Processing Data </font>
For models like linear regression, we need all of our features to be numeric and not have any missing values. Luckily we've already seen some methods in class and homework to convert non-numeric features to a nice form. In scikit-learn we can connect all the steps to pre-process data into one operation called a pipeline.

For example, a possible pipeline for our synthetic data could be:

![pipeline](pipeline.png)

We are going to build this pipeline using scikit-learn's [pipelines](http://scikit-learn.org/stable/modules/pipeline.html).

A scikit-learn [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) lets us chain together any number of transformers (which transform data) and finish with an estimator (which predicts).

In [None]:
#Let's grab our different data types
target_variable = "col3"
numerical_cols =  data.drop(columns=target_variable).select_dtypes(np.number).columns
categorical_col = ['categorical_col']
ordinal_col = ["ordinal_col"]
text_col = ['text_col']

### <font color='#B31B1B'> Numerical Pre-processing </font>

We will first create a pipeline for each kind of data, then see how to put them together into a single Pipeline.

We can start by pre-processing our numeric data. Some steps we might want to include:

- **Imputation**: Filling in missing values 

- **Scaling**: Normalizing our values (some options are min-max scaling, or standardization: subtracting the mean and dividing by the standard deviation)

We can access both of these operations in the scikit-learn impute and preprocessing modules.
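
To make the two steps above concrete, here is a minimal sketch that does them by hand with NumPy on a made-up array (the values are hypothetical, not taken from the course dataset). It fills the missing entry with the mean of the observed values, then shows both scaling options.

```python
import numpy as np

# Hypothetical toy values with one missing entry
values = np.array([2.0, 4.0, np.nan, 6.0])

# Imputation: replace NaN with the mean of the observed values
observed_mean = np.nanmean(values)
imputed = np.where(np.isnan(values), observed_mean, values)

# Standardization: subtract the mean, divide by the standard deviation
standardized = (imputed - imputed.mean()) / imputed.std()

# Min-max scaling: map the smallest value to 0 and the largest to 1
min_max = (imputed - imputed.min()) / (imputed.max() - imputed.min())

print(imputed)
print(standardized)
print(min_max)
```

After standardization the values have mean 0 and standard deviation 1; after min-max scaling they lie between 0 and 1.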

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

imputer = SimpleImputer(strategy="mean") #Can set other strategies (check the help doc!)
scaler = StandardScaler()

If we wanted to run these alone (not in a pipeline), the general syntax is 'fit' (fitting parameters for transformation, like mean/median/mode) then 'transform' (applying transform with fitted parameters). Let's take a look at applying the imputer to the 'missing_col'.

In [None]:
#Before imputation
data[['missing_col']].isna().sum()

In [None]:
#Using imputation

imputed_col = imputer.fit(data[['missing_col']]).transform(data[['missing_col']])
#Or perform fit/transform together
imputed_col = imputer.fit_transform(data[['missing_col']])

pd.Series(imputed_col.reshape(-1)).isna().sum()
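
The reason `fit` and `transform` are separate is that the parameters learned on one dataset can be reused on another. The sketch below uses two small made-up arrays (hypothetical, not the course data): the imputer learns the mean from the first and uses it to fill the gap in the second.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy arrays, each with a single column
train = np.array([[1.0], [3.0], [np.nan], [5.0]])
new = np.array([[np.nan], [10.0]])

toy_imputer = SimpleImputer(strategy="mean")
toy_imputer.fit(train)                      # learns the mean of the observed training values
learned_mean = toy_imputer.statistics_[0]   # the fitted parameter, one per column

new_imputed = toy_imputer.transform(new)    # fills the gap in `new` with the training mean
print(learned_mean)
print(new_imputed)
```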

Where scikit-learn really shines is putting these operations together into a pipeline!

In [None]:
from sklearn.pipeline import make_pipeline

A sklearn Pipeline is defined as a sequence of steps.

For example, for the numerical variables our pipeline will have 2 steps:
- first we will impute,
- then we will scale. 

We can create this numerical pipeline like this:

In [None]:
numerical_pipeline = make_pipeline(
    imputer,
    scaler
)

In [None]:
numerical_pipeline

Now we can use `fit` and `transform` with the pipeline and it will apply all the transformers sequentially.

In [None]:
numerical_pipeline.fit_transform(data[numerical_cols])
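
As a sanity check on what "sequentially" means, this sketch compares a pipeline against the same two steps applied by hand. The array `X_toy` is a hypothetical stand-in for the numeric columns.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with missing values
X_toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])

# Manual version: impute first, then scale the imputed result
manual = StandardScaler().fit_transform(
    SimpleImputer(strategy="mean").fit_transform(X_toy)
)

# Pipeline version: the same two steps chained together
piped = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler()).fit_transform(X_toy)

same = np.allclose(manual, piped)
print(same)
```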

This is awesome: we don't need to chain the processing steps manually. But we still have a problem: we can't apply this numerical pipeline to the whole dataframe.

In [None]:
numerical_pipeline.fit_transform(data)

The pipeline fails when it tries to apply the imputer and the scaler to the non numerical columns.

We can fix this by adding a step that selects only the numerical columns before applying the imputer and the scaler. The package [mlxtend](https://github.com/rasbt/mlxtend/pull/378) (which provides additional functionality for scikit-learn) includes a ColumnSelector, a transformer that selects columns.

For example:

In [None]:
from mlxtend.feature_selection import ColumnSelector

numerical_col_selector = ColumnSelector(cols=numerical_cols)

We can apply this selector to the whole dataframe, and it will keep only the numerical columns:

In [None]:
numerical_col_selector.fit_transform(data)

Now we can create a numerical pipeline that takes care of selecting the appropriate columns:

In [None]:
numerical_pipeline = make_pipeline(
    numerical_col_selector,
    imputer,
    scaler
)

And now we can apply this pipeline to the whole dataset:

In [None]:
numerical_pipeline.fit_transform(data)[:5]

### <font color='#B31B1B'> Text Pipeline </font>

There are a number of ways to process text features. One of the most popular is TF-IDF (Term frequency-inverse document frequency). The high level idea is that we'll create one feature per word in our dataset, and the value for this new feature for each row is going to be the multiplication of its 'term-frequency' (i.e. how often the word occurs in the text column for this row) by the word's inverse document frequency (i.e. 1/the fraction of rows that have that word). 
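
Here is a minimal sketch of that high-level idea in plain Python, using three made-up sentences (hypothetical, not the course data). It follows the simplified definition above, multiplying a word's count in a row by 1 / (fraction of rows containing the word). Note that scikit-learn's `TfidfVectorizer` uses a smoothed, log-scaled inverse document frequency and normalizes each row by default, so its numbers will differ from these.

```python
from collections import Counter

# Hypothetical toy "text column", one document per row
docs = ["the cat sat", "the dog sat", "the cat ran fast"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many rows does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf_row(tokens):
    """Term count times 1 / (fraction of rows containing the word)."""
    counts = Counter(tokens)
    return {word: count * (n_docs / doc_freq[word]) for word, count in counts.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for row in rows:
    print(row)
```

A word that appears in every row, like "the", gets the lowest weight, while a word unique to one row, like "dog", gets the highest.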

In scikit-learn, `TfidfVectorizer` computes these features and returns a sparse matrix. We add a [DenseTransformer](https://rasbt.github.io/mlxtend/user_guide/preprocessing/DenseTransformer/) from mlxtend after it to convert that sparse output into a dense numpy array.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 
from mlxtend.preprocessing import DenseTransformer

text_pipeline = make_pipeline(
    ColumnSelector(cols=text_col, drop_axis=True),
    TfidfVectorizer(),
    DenseTransformer()
)

In [None]:
text_pipeline.fit_transform(data)

### <font color='#B31B1B'> Categorical Pipeline </font>
For categorical variables, we'll use **one-hot encoding**, which means we will create k binary columns (where k is the number of unique values of the variable). In each row, all of these columns are 0 except the one that represents the row's actual categorical value, which is 1.

We can use the transformer `OneHotEncoder` that is part of the package [`category_encoders`](http://contrib.scikit-learn.org/categorical-encoding/), which is a package that adds additional categorical encoders that are compatible with scikit-learn.

In [None]:
from category_encoders import OneHotEncoder

As before, we select the column and then apply one-hot encoding:

In [None]:
categorical_pipeline = make_pipeline(
     ColumnSelector(cols=categorical_col),
     OneHotEncoder()
)

categorical_pipeline.fit_transform(data)[:5]
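
To see the shape of a one-hot encoding without any extra packages, here is a small sketch using `pandas.get_dummies` on a made-up column (the `color` column is hypothetical). It is an illustration of the encoding itself, not a replacement for the `OneHotEncoder` transformer used in the pipeline.

```python
import pandas as pd

# Hypothetical categorical column with k = 3 unique values
toy = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# One binary column per unique value; exactly one of them is 1 in each row
one_hot = pd.get_dummies(toy["color"], dtype=int)
print(one_hot)
```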

### <font color='#B31B1B'> Ordinal Pipeline </font>
To encode the ordinal column, we first select it with ColumnSelector and then apply OrdinalEncoder. The encoder's `mapping` must name the column it applies to; because the output of ColumnSelector is a numpy array rather than a dataframe, we refer to that column by its position, 0.

In [None]:
from category_encoders import OrdinalEncoder

# ColumnSelector's output is an array, so we use the column 0 for ordinal encoder
ordinal_encoder = OrdinalEncoder(mapping=[
    {"col": 0, 
      "mapping": {
        "very bad": 0,
        "bad": 1,
        "normal": 2,
        "good": 3,
        "very good": 4
      } 
     }
])
ordinal_pipeline = make_pipeline(
    ColumnSelector(cols=ordinal_col),
    ordinal_encoder
)

ordinal_pipeline.fit_transform(data)[:5]
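
The encoding itself is just a lookup that preserves the order of the categories. As a sketch of the idea outside of a pipeline, the same mapping can be applied to a made-up pandas Series (the `ratings` values are hypothetical):

```python
import pandas as pd

# Same ordering as in the encoder above
quality_order = {"very bad": 0, "bad": 1, "normal": 2, "good": 3, "very good": 4}

# Hypothetical ordinal values
ratings = pd.Series(["good", "very bad", "normal", "very good"])

# Each label is replaced by its rank, so comparisons like "good > normal" still hold
encoded = ratings.map(quality_order)
print(encoded.tolist())
```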

## <font color='#B31B1B'> Pipeline Union </font>

Now we have the individual pipelines that process each different datatype.

Now we just need a [FeatureUnion](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html#sklearn.pipeline.FeatureUnion) to join all of the outputs together. A feature union doesn't apply its steps sequentially. Instead, it passes the same input to every step independently and concatenates their outputs side by side.

![pipeline](pipeline.png)

In [None]:
from sklearn.pipeline import make_union

In [None]:
processing_pipeline = make_union(
    numerical_pipeline,
    text_pipeline,
    categorical_pipeline,
    ordinal_pipeline
)

In [None]:
processing_pipeline

Now we can use this FeatureUnion to transform the whole dataset:

In [None]:
fit_data = processing_pipeline.fit_transform(data)

fit_data
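
To check how a union combines its branches, here is a small sketch with two simple transformers on a made-up array (`X_toy` is hypothetical, and the squaring branch is only there for illustration). Each branch sees the same input, and the outputs are stacked as columns.

```python
import numpy as np
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical input with 2 columns
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

toy_union = make_union(
    StandardScaler(),                # branch 1: standardized copy of the input
    FunctionTransformer(np.square),  # branch 2: element-wise square of the input
)
combined = toy_union.fit_transform(X_toy)

# 2 columns from each branch, placed side by side
print(combined.shape)
print(combined)
```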

Finally, we just have to add an estimator to the end of the pipeline, so it can be trained with the transformed data. Recall that the general syntax for training a model (outside of a pipeline) is fit then predict. For example, if we wanted to fit a linear regression on our pre-processed data we would:

In [None]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(fit_data, data[target_variable])

#Now our model's fit and we can either pull out the coefficients or use it to make predictions!

We can embed this learning process into our pipeline

In [None]:
estimator = LinearRegression()
estimator_pipeline = make_pipeline(
    processing_pipeline,
    estimator
)

Now that we added an estimator to the pipeline we can do `fit` and `predict`

In [None]:
estimator_pipeline.fit(data.drop(target_variable, axis=1), data[target_variable])

In [None]:
estimator_pipeline.predict(data)[:5]

Pipelines do more than organize the different steps: we can also use them with the rest of scikit-learn's tools. For example, we can run cross-validation with the estimator pipeline we just made.

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
X = data.drop(columns=target_variable)
y = data[target_variable]

In [None]:
mae_cv = cross_val_score(estimator_pipeline, 
                X, 
                y,
                scoring='neg_mean_absolute_error', 
                cv=5
)
mae_cv

In [None]:
mae_cv.mean()
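
One detail worth noticing: scikit-learn scorers follow a "higher is better" convention, so `neg_mean_absolute_error` returns the negative of the MAE. The sketch below, on synthetic data generated only for illustration, flips the sign to recover the usual error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = 2.0 * X_toy[:, 0] - X_toy[:, 1] + rng.normal(scale=0.1, size=100)

neg_mae = cross_val_score(
    LinearRegression(), X_toy, y_toy, scoring="neg_mean_absolute_error", cv=5
)

# One score per fold, all <= 0; negate to get the mean absolute error
mae = -neg_mae
print(neg_mae)
print(mae.mean())
```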

## <font color='#B31B1B'> Hyperparameter Optimization </font>

Because pipelines follow the scikit-learn convention of implementing the methods fit, transform and predict, we can pass a pipeline to a hyperparameter search just like any other estimator.

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
predictive_pipeline = Pipeline(
    [
     ("processing", processing_pipeline),
     ("estimator", LinearRegression())   
    ])

This pipeline has a lot of hyperparameters we can tweak: essentially every parameter of every step in the pipeline. We can see their names like this:

In [None]:
sorted(predictive_pipeline.get_params().keys())

We'll focus on whether or not we want to fit an intercept in our linear regression model.

In [None]:
param_dist_random = {
    "estimator__fit_intercept": [True, False],
}
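
The key `estimator__fit_intercept` follows the naming convention `<step name>__<parameter name>`, with a double underscore between the two. As a sketch, here is the same convention on a small stand-in pipeline (the `scaler` step and `toy_pipeline` are hypothetical, chosen to avoid depending on the dataset):

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("estimator", LinearRegression()),
])

# "<step name>__<parameter>" reaches into a named step of the pipeline
toy_pipeline.set_params(scaler__with_mean=False, estimator__fit_intercept=False)

params = toy_pipeline.get_params()
print(params["scaler__with_mean"], params["estimator__fit_intercept"])
```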

To add hyperparameter tuning, we pass our predictive pipeline as the estimator of a search object. Scikit-learn offers two main tools for automated hyperparameter selection: RandomizedSearchCV and GridSearchCV. Both take a set of candidate values, but the first tries a fixed number of random selections of those hyperparameters, while the second evaluates every possible combination. Grid search is therefore guaranteed to find the best combination among the candidates, but it can take much longer (for those interested, check out this <a href='https://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf'> great paper </a> that compares the two).

We'll use randomized search, but the syntax is nearly the same for grid search (except that you pass a grid of explicit values through `param_grid`, rather than lists or distributions through `param_distributions`).

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
random_search_pipeline = RandomizedSearchCV(
    estimator=predictive_pipeline,
    param_distributions=param_dist_random,
    scoring="neg_mean_squared_error",
    n_jobs=-1,
    n_iter=10,
)

Now we can run the randomized search over the whole pipeline. Under the hood, it tries a number of hyperparameter settings, uses cross-validation to estimate the performance of each, and keeps the one with the best CV score.

In [None]:
%%time
random_search_pipeline.fit(X, y)

Let's take a peek at our best model and its score.

In [None]:
print(random_search_pipeline.best_score_)
print(random_search_pipeline.best_estimator_)

We can even see what final steps we've applied in our best model (this is super handy if we're trying out different hyperparameters during data pre-processing)

In [None]:
random_search_pipeline.best_estimator_.steps[0]
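
For comparison, here is a sketch of the grid-search version of the same idea. It uses a small stand-in pipeline and synthetic data (both hypothetical) so that it runs on its own; the data is generated with a large intercept, so fitting an intercept should win.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a clear non-zero intercept (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 1))
y_toy = 3.0 * X_toy[:, 0] + 10.0 + rng.normal(scale=0.1, size=60)

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("estimator", LinearRegression()),
])

# GridSearchCV takes `param_grid` and tries every combination in it
grid_search = GridSearchCV(
    estimator=toy_pipeline,
    param_grid={"estimator__fit_intercept": [True, False]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid_search.fit(X_toy, y_toy)

print(grid_search.best_params_)
print(grid_search.best_score_)
```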

## <font color='#B31B1B'> Exporting Pipelines </font>

The great thing about scikit-learn pipelines is that we can save them and reuse them **without having to retrain them**. We can use the library joblib to do so.

For example, to save the pipeline we have built:

In [None]:
from joblib import dump, load
dump(estimator_pipeline, 'pipeline.joblib') 

In [None]:
reloaded = load("pipeline.joblib")

Now we can predict directly!

In [None]:
reloaded.predict(X)[:10]
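
As a quick check that nothing is lost in the round trip, this sketch saves and reloads a small stand-in pipeline and compares predictions. The data, the pipeline, and the file name `toy_pipeline.joblib` are all hypothetical, and a temporary directory is used so no file is left behind.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small fitted pipeline (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + 4.0

toy_pipeline = make_pipeline(StandardScaler(), LinearRegression())
toy_pipeline.fit(X_toy, y_toy)

# Save to a temporary file, then load it back without refitting
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.joblib")
    dump(toy_pipeline, path)
    reloaded_toy = load(path)

same_predictions = np.allclose(toy_pipeline.predict(X_toy), reloaded_toy.predict(X_toy))
print(same_predictions)
```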