## Scikit-learn Pipelines and Preprocessing Steps
Now that we know several scikit-learn functions, let's learn how to tie the different steps together!

By the end of this tutorial you'll know:
1. Different pre-processing steps which can be done on our data
2. How to tie the transformers and estimators together into a pipeline
3. How to make your own transformer
4. Finally, how to tie it all up together!

### Scikit-learn Pipelines
As per the documentation, a `Pipeline` will:

> "Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be 'transforms', that is, they must implement fit and transform methods. The final estimator only needs to implement fit."

For more information, see the source: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
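
To make the quoted contract concrete, here is a minimal sketch on synthetic data. The toy arrays and the step names `scaler` and `model` are assumptions made for illustration; they are not part of this tutorial's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (assumption): y is an exact linear function of two features.
rng = np.random.RandomState(0)
X_toy = rng.rand(20, 2)
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 1.0

toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),    # intermediate step: implements fit and transform
    ("model", LinearRegression()),   # final step: only needs fit (predict is used afterwards)
])

# fit() runs fit_transform on every intermediate step, then fit on the final estimator.
toy_pipeline.fit(X_toy, y_toy)
toy_predictions = toy_pipeline.predict(X_toy)
print(toy_pipeline.score(X_toy, y_toy))
```

Calling `predict` pushes the input through each fitted transformer before it reaches the final estimator, so the same preprocessing is applied at training and prediction time.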

Let's start by loading our data:

In [4]:
import pandas as pd 
import numpy as np

data = pd.read_csv('data/data.csv', index_col=0)
print(data.dtypes)
print()
print('Summary Statistics for Target Variable: \n', data['Absenteeism time in hours'].describe())
print(data.shape)
# we have a mix of categorical, numeric, and string data.
data.head(10)

ID                                   int64
Reason for absence                  object
Month of absence                     int64
Day of the week                     object
Distance from Residence to Work    float64
Service time                       float64
Age                                float64
Work load Average/day              float64
Hit target                           int64
Disciplinary failure                 int64
Education                           object
Number of Children                   int64
Social drinker                       int64
Social smoker                        int64
Pet                                  int64
Weight                               int64
Height                               int64
Body mass index                      int64
Absenteeism time in hours            int64
dtype: object

Summary Statistics for Target Variable: 
 count    749.000000
mean       8.080107
std       17.001698
min        0.000000
25%        2.000000
50%        3.000000
75%   

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Number of Children,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,Patient follow-up,7,Tuesday,36.0,13.0,33.0,239.554,97,0,High school,2,1,0,1,90,172,30,4
1,36,No reason given,7,Tuesday,13.0,18.0,50.0,239.554,97,1,High school,1,1,0,0,98,178,31,0
2,3,Blood donation,7,Wednesday,51.0,18.0,38.0,239.554,97,0,High school,0,1,0,0,89,170,31,2
3,7,Diseases of the eye and adnexa,7,Thursday,,14.0,39.0,239.554,97,0,High school,2,1,1,0,68,168,24,4
4,11,Blood donation,7,Thursday,36.0,13.0,33.0,239.554,97,0,High school,2,1,0,1,90,172,30,2
5,3,Blood donation,7,Friday,51.0,18.0,38.0,239.554,97,0,High school,0,1,0,0,89,170,31,2
6,10,Medical consultation,7,Friday,52.0,3.0,28.0,239.554,97,0,High school,1,1,0,4,80,172,27,8
7,20,Blood donation,7,Friday,50.0,11.0,36.0,239.554,97,0,High school,4,1,0,0,65,168,23,4
8,14,"Injury, poisoning, and certain other consequen...",7,Monday,12.0,14.0,34.0,239.554,97,0,High school,2,1,0,0,95,196,25,40
9,1,Medical consultation,7,Monday,11.0,14.0,37.0,239.554,97,0,Postgraduate,1,0,0,1,88,172,29,8


## Preprocessing Steps

Before building a pipeline, let's walk through some preprocessing steps, starting with a check for missing values.

In [5]:
print(data.isna().sum())

ID                                 0
Reason for absence                 0
Month of absence                   0
Day of the week                    0
Distance from Residence to Work    1
Service time                       3
Age                                6
Work load Average/day              0
Hit target                         0
Disciplinary failure               0
Education                          0
Number of Children                 0
Social drinker                     0
Social smoker                      0
Pet                                0
Weight                             0
Height                             0
Body mass index                    0
Absenteeism time in hours          0
dtype: int64


Let's jot down the transformations we need to apply to the data before training:
1. Impute missing values
2. Convert categorical columns to numerical values
3. Scaling/Discretization/Binarization

### 1. Imputation of missing values:

We will use SimpleImputer, which can impute both numerical and categorical values.

Documentation of SimpleImputer can be found here:
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

and Categorical Imputer can be found here:


In [6]:
import sklearn
print(sklearn.__version__)
from sklearn.impute import SimpleImputer

0.21.3


Now let's select the columns which need imputation

In [7]:
impute_columns = ["Distance from Residence to Work", "Service time", "Age"]

Apply imputers on columns and check the results

In [8]:
imp = SimpleImputer(strategy="mean")
impute_df = pd.DataFrame(imp.fit_transform(data[impute_columns]),columns=impute_columns)

In [9]:
print(impute_df.isna().sum())

Distance from Residence to Work    0
Service time                       0
Age                                0
dtype: int64
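
The cells above only impute numeric columns. As a hedged sketch of the categorical case, `SimpleImputer` with `strategy="most_frequent"` fills gaps with the most common value. The tiny frame below is invented for illustration; the `Education` column in this dataset has no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (assumption): one missing categorical value.
toy = pd.DataFrame({"Education": ["High school", np.nan, "Graduate", "High school"]})

# most_frequent works on strings as well as numbers.
cat_imputer = SimpleImputer(strategy="most_frequent")
toy_imputed = pd.DataFrame(cat_imputer.fit_transform(toy), columns=toy.columns)
print(toy_imputed)
```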


### 2. Convert Categorical columns to numeric columns

scikit-learn itself provides ordinal and one-hot encoding (`OrdinalEncoder` and `OneHotEncoder` in `sklearn.preprocessing`).

The category_encoders library (http://contrib.scikit-learn.org/categorical-encoding/index.html) offers many more encoding techniques!

Check out this cool article for WOE encoding:
https://medium.com/@sundarstyles89/weight-of-evidence-and-information-value-using-python-6f05072e83eb

Let's start by making a list of the columns which need categorical encoding. We'll try two encodings for now: one-hot and ordinal (label).

In [10]:
label_encode_column = ['Reason for absence']
one_hot_encode_column = ['Education', 'Day of the week']

In [11]:
!pip install category_encoders



In [12]:
from category_encoders.ordinal import OrdinalEncoder
from category_encoders.one_hot import OneHotEncoder

In [13]:
one_hot = OneHotEncoder(use_cat_names=True)
one_hot_encoded_df = one_hot.fit_transform(data[one_hot_encode_column])

In [14]:
one_hot_encoded_df.head()

Unnamed: 0,Education_High school,Education_Postgraduate,Education_Graduate,Education_Master and Doctor,Day of the week_Tuesday,Day of the week_Wednesday,Day of the week_Thursday,Day of the week_Friday,Day of the week_Monday
0,1,0,0,0,1,0,0,0,0
1,1,0,0,0,1,0,0,0,0
2,1,0,0,0,0,1,0,0,0
3,1,0,0,0,0,0,1,0,0
4,1,0,0,0,0,0,1,0,0


In [15]:
ordinal_encoder = OrdinalEncoder()
ordinal_encoded_df = ordinal_encoder.fit_transform(data[label_encode_column])

In [16]:
ordinal_encoded_df.head()

Unnamed: 0,Reason for absence
0,1
1,2
2,3
3,4
4,3
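
For comparison, here is a sketch of the same two encodings using scikit-learn's own encoders on a toy column. The imports are aliased (`SkOrdinalEncoder`, `SkOneHotEncoder`, names chosen here) so they do not shadow the category_encoders classes used in the rest of this notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder as SkOneHotEncoder
from sklearn.preprocessing import OrdinalEncoder as SkOrdinalEncoder

# Toy column (assumption), not the full dataset.
toy = pd.DataFrame({"Day of the week": ["Tuesday", "Friday", "Tuesday", "Monday"]})

# scikit-learn sorts the categories, so Friday=0, Monday=1, Tuesday=2.
sk_ordinal = SkOrdinalEncoder()
ordinal_codes = sk_ordinal.fit_transform(toy)

# The default output is a sparse matrix; toarray() makes it dense for printing.
sk_one_hot = SkOneHotEncoder()
one_hot_matrix = sk_one_hot.fit_transform(toy).toarray()

print(ordinal_codes.ravel())
print(one_hot_matrix)
```

Note the difference from the category_encoders output above, where codes start at 1 and appear to follow the order in which values are first seen.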


### 3. Scaling/Discretization/Binarization

Let's identify the columns for binning, scaling, and discretization.

a. Discretization: Discretization, also known as quantization or binning, divides a continuous feature into a pre-specified number of categories (bins), and thus makes the data discrete.

Sklearn provides a KBinsDiscretizer class that can take care of this. The only things you have to specify are the number of bins (n_bins) for each feature and how to encode these bins (ordinal, onehot or onehot-dense).

Let's try to discretize on some columns:

In [17]:
from sklearn.preprocessing import KBinsDiscretizer

discretize_column = ["Weight"]
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', 
                        strategy='uniform')
discrete_df = pd.DataFrame(disc.fit_transform(data[discretize_column]),columns=discretize_column)
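
To see what the `uniform` strategy does, here is a small sketch on invented weights (the values are assumptions chosen so the bin edges are easy to read). With three equal-width bins between 50 and 110, the edges fall at 70 and 90.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Toy weights (assumption): min 50, max 110, so each of 3 bins is 20 wide.
toy_weights = np.array([[50.0], [60.0], [70.0], [80.0], [90.0], [110.0]])

toy_disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
toy_bins = toy_disc.fit_transform(toy_weights)

print(toy_disc.bin_edges_[0])   # edges learned for the single feature
print(toy_bins.ravel())         # bin index assigned to each row
```

The fitted `bin_edges_` attribute is worth inspecting on the real `Weight` column too, since it shows exactly where the cut points ended up.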

So far we have done the following steps:

1. Imputation........done
2. Categorical and numerical encoding..........done
3. Discretization.......done

Time to tie it all together... But how?

## Split train and test

In [18]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestRegressor
from math import sqrt
import matplotlib.pyplot as plt

In [19]:
target = data.loc[:,'Absenteeism time in hours']
features = data.drop('Absenteeism time in hours', axis=1)

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=10)

In [20]:
all_columns = list(features.columns)

## Generate our own transformer to select columns from dataframe

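Remember TransformerMixin and BaseEstimator from the previous class?
As a reminder: TransformerMixin adds `fit_transform` on top of our own `fit` and `transform`, and BaseEstimator adds `get_params` and `set_params`, which cloning and grid search rely on.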

In [21]:
from sklearn.base import TransformerMixin, BaseEstimator

In [22]:
class ColumnSelector(BaseEstimator,TransformerMixin):
    
    def __init__(self, columns):
        self.columns = columns
        
    def fit(self, x, y = None):
        return self
    
    def transform(self, x):
        return x[self.columns]
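
A quick check of what the two base classes contribute, sketched on a toy frame (the data is an assumption; the class is repeated only so the snippet runs on its own):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    # Same class as above, repeated so this snippet is self-contained.
    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x[self.columns]

toy = pd.DataFrame({"Age": [33, 50], "Pet": [1, 0], "Weight": [90, 98]})
selector = ColumnSelector(columns=["Age", "Weight"])

selected = selector.fit_transform(toy)   # fit_transform comes from TransformerMixin
params = selector.get_params()           # get_params comes from BaseEstimator

print(selected)
print(params)
```

`get_params` works because every `__init__` argument is stored on `self` under the same name, which is the convention scikit-learn expects from custom estimators.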

To make the column selection process simpler, let's make lists of columns which we need to select

In [3]:
numeric_columns = ['Distance from Residence to Work',
                   'Service time',
                   'Work load Average/day ',
                   'Hit target',
                   'Height',
                   'Body mass index']

discrete_columns = ['ID',
                  'Age',
                  'Month of absence',
                  'Disciplinary failure',
                  'Number of Children',
                  'Social drinker', 
                  'Social smoker', 
                  'Pet']

bin_column = ['Weight']

impute_columns = ["Distance from Residence to Work", "Service time", "Age"]

label_encode_column = ['Reason for absence']
one_hot_encode_column = ['Education', 'Day of the week']

Build a pipeline for each required transformation and stack them up!

In [21]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.pipeline import make_pipeline 

# Use Pipeline and FeatureUnion to stack the individual pipelines
impute_pipeline = Pipeline([
    ('selector', ColumnSelector(impute_columns)),
    ('imputer', SimpleImputer(strategy="median")),
    ])
bin_pipeline = Pipeline([
    ('selector', ColumnSelector(bin_column)),
    ('Binning', KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')),
    ])
label_encode = Pipeline([
    ('selector', ColumnSelector(label_encode_column)),
    ('LabelEncoder', OrdinalEncoder()),
    ])
one_hot_encode = Pipeline([
    ('selector', ColumnSelector(one_hot_encode_column)),
    ('OneHotEncoder', OneHotEncoder()),
    ])
scaler_pipeline = Pipeline([
    ('selector', ColumnSelector(numeric_columns)),
    ('Scaler', StandardScaler()),
    ])

processing_pipeline = FeatureUnion(transformer_list=[
    ("impute_pipeline", impute_pipeline),
    ("bin_pipeline", bin_pipeline),
    ("label_encode", label_encode),
    ("one_hot_encode", one_hot_encode),
    ])

# full_pipeline = FeatureUnion(transformer_list=[
#     ("processing", processing_pipeline),
#     ("scaler_pipeline", scaler_pipeline),
#     ])
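
As a rough illustration of what `FeatureUnion` does with its branches, the sketch below runs two transformers on the same toy array and shows that their outputs are concatenated column-wise. The array and the branch names `scaled` and `squared` are assumptions for illustration.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler

X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

toy_union = FeatureUnion(transformer_list=[
    ("scaled", StandardScaler()),                  # branch 1: 2 standardized columns
    ("squared", FunctionTransformer(np.square)),   # branch 2: 2 squared columns
])

# Each branch sees the full input; results are stacked side by side.
stacked = toy_union.fit_transform(X_toy)
print(stacked.shape)
```

In the notebook's pipelines each branch starts with a `ColumnSelector`, so a column reaches the model only if some branch selects it; columns listed in no branch are dropped.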

Now, let's add a final estimator (a random forest regressor, since the target is numeric) to complete this pipeline

In [22]:
#finalpipeline = (make_pipeline(processing_pipeline, RandomForestRegressor(random_state=1, 
#                                                                          n_jobs=-1, 
#                                                                          n_estimators=100)))
finalpipeline = Pipeline([
    ('processing_data', processing_pipeline),
    ('regressor', RandomForestRegressor(random_state=1, n_jobs=-1, n_estimators=100)),
    ])
# Fitting the pipeline
finalpipeline.fit(x_train, y_train)

Pipeline(memory=None,
         steps=[('processing_data',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('impute_pipeline',
                                                 Pipeline(memory=None,
                                                          steps=[('selector',
                                                                  ColumnSelector(columns=['Distance '
                                                                                          'from '
                                                                                          'Residence '
                                                                                          'to '
                                                                                          'Work',
                                                                                          'Service '
                                                                                    

In [23]:
y_pred = finalpipeline.predict(x_test)
print(y_pred)
# scikit-learn metrics take (y_true, y_pred) in that order
print(mean_squared_error(y_test, y_pred))

[ 3.40463209  5.137       4.2577619   1.57494444  2.39816667  4.7720119
  1.09257143  3.24675458  3.0523981   7.72833333  8.15285714  7.8805
  5.992       5.09916667  1.26842857  3.21165873  2.55283333 23.54107143
  1.58418615  2.67547006  2.12961039  2.87536538  2.53075649  0.99
  5.1725      0.11625     2.00622944 41.805       5.992      11.193
  5.9625      8.34195382  2.94333333  3.40463209  2.60630744  4.9363254
  5.09916667  6.10266667  2.67547006 63.21833333 84.35        8.23166667
 25.19333333  2.81603175  2.37538844  0.68       19.08166667  2.95346429
  2.26252056 14.21233333  2.90066667 25.19333333  2.60630744  2.90066667
  1.57494444  1.730869   26.4         7.592       6.6595      1.730869
  7.19666667  5.2329261   7.102       7.365       0.54        5.63180952
  2.12961039 14.16358333  5.689       4.10783333  3.33816126  8.28216667
 55.112       2.16696609  3.79535195  9.77216667  4.3945      4.9363254
  2.23166667  6.97033333  2.08868254  9.90533333  5.38416667  7.4786666
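
The cell above reports mean squared error, which is in squared hours and is dominated by a few large misses. A sketch of two easier-to-read companions, RMSE and MAE, on invented numbers (the arrays below are assumptions, not the notebook's predictions):

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy targets and predictions (assumption): one large miss of 10 hours.
y_true_toy = np.array([2.0, 4.0, 8.0, 40.0])
y_pred_toy = np.array([3.0, 4.0, 6.0, 30.0])

mse = mean_squared_error(y_true_toy, y_pred_toy)   # squared hours
rmse = sqrt(mse)                                   # back in hours
mae = mean_absolute_error(y_true_toy, y_pred_toy)  # hours, less sensitive to outliers

print(mse, rmse, mae)
```

On the real data the same three lines can be applied to `y_test` and `y_pred`.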

## Question 1: Add scaling of numeric data and check performance

In [24]:
# %load solutions/scalarSol.py
from sklearn.pipeline import Pipeline, FeatureUnion

impute_pipeline = Pipeline([
    ('selector', ColumnSelector(impute_columns)),
    ('imputer', SimpleImputer(strategy="median")),
    ])
bin_pipeline = Pipeline([
    ('selector', ColumnSelector(bin_column)),
    ('Binning', KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')),
    ])
label_encode = Pipeline([
    ('selector', ColumnSelector(label_encode_column)),
    ('LabelEncoder', OrdinalEncoder()),
    ])
one_hot_encode = Pipeline([
    ('selector', ColumnSelector(one_hot_encode_column)),
    ('OneHotEncoder', OneHotEncoder()),
    ])
numeric_pipeline = Pipeline([
    ('selector', ColumnSelector(numeric_columns)),
    ('imputer', SimpleImputer(strategy="median")),
    ('Scaler', StandardScaler()),
    ])


full_pipeline = FeatureUnion(transformer_list=[
    ("numeric_pipeline", numeric_pipeline),
    ("bin_pipeline", bin_pipeline),
    ("label_encode", label_encode),
    ("one_hot_encode", one_hot_encode),
    ])

finalpipeline = (make_pipeline(full_pipeline, RandomForestRegressor(random_state=1, 
                                                                          n_jobs=-1, 
                                                                        n_estimators=100)))
# Fitting the pipeline
finalpipeline.fit(x_train, y_train)

Pipeline(memory=None,
         steps=[('featureunion',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('numeric_pipeline',
                                                 Pipeline(memory=None,
                                                          steps=[('selector',
                                                                  ColumnSelector(columns=['Distance '
                                                                                          'from '
                                                                                          'Residence '
                                                                                          'to '
                                                                                          'Work',
                                                                                          'Service '
                                                                                      

## Question 2: Add your own transformer to binarize the input

In [25]:
# %load solutions/binarySol.py
class BinarizeTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self, threshold = 0):
        self.threshold = threshold
        
    def fit(self, x, y = None):
        return self
    
    def transform(self, x):
        # Build a new 0/1 array instead of overwriting x in place, so the
        # caller's data (e.g. a DataFrame slice) is left untouched.
        return (np.asarray(x) > self.threshold).astype(int)



In [26]:
## Test on:
X = np.arange(10)
binarizer = BinarizeTransformer(threshold=5)
binarizer.fit_transform(X)

array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
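
For reference, scikit-learn ships a `Binarizer` in `sklearn.preprocessing` that applies the same rule (values strictly greater than the threshold become 1). A minimal sketch on the same test input, reshaped because `Binarizer` expects a 2-D array:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X_toy = np.arange(10).reshape(-1, 1)   # one feature, ten rows

builtin_binarizer = Binarizer(threshold=5)
binarized = builtin_binarizer.fit_transform(X_toy)

print(binarized.ravel())
```

Writing the transformer by hand, as in the exercise, is still a useful way to practise the `fit`/`transform` interface.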

## Question 3: Create binarized transformation for weight instead of binning

In [27]:
# %load solutions/binaryTransform.py
from sklearn.pipeline import Pipeline, FeatureUnion

impute_pipeline = Pipeline([
    ('selector', ColumnSelector(impute_columns)),
    ('imputer', SimpleImputer(strategy="median")),
    ])
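bin_pipeline = Pipeline([
    ('selector', ColumnSelector(bin_column)),
    # NOTE: the weights shown in data.head() are all far above 5, so with
    # threshold=5 every row maps to 1; a threshold inside the Weight range is more informative.
    ('Binarize', BinarizeTransformer(threshold=5)),
    ])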
label_encode = Pipeline([
    ('selector', ColumnSelector(label_encode_column)),
    ('LabelEncoder', OrdinalEncoder()),
    ])
one_hot_encode = Pipeline([
    ('selector', ColumnSelector(one_hot_encode_column)),
    ('OneHotEncoder', OneHotEncoder()),
    ])
numeric_pipeline = Pipeline([
    ('selector', ColumnSelector(numeric_columns)),
    ('imputer', SimpleImputer(strategy="median")),
    ('Scaler', StandardScaler()),
    ])


full_pipeline = FeatureUnion(transformer_list=[
    ("numeric_pipeline", numeric_pipeline),
    ("bin_pipeline", bin_pipeline),
    ("label_encode", label_encode),
    ("one_hot_encode", one_hot_encode),
    ])

finalpipeline = (make_pipeline(full_pipeline, RandomForestRegressor(random_state=1, 
                                                                          n_jobs=-1, 
                                                                        n_estimators=100)))
# Fitting the pipeline
finalpipeline.fit(x_train, y_train)


Pipeline(memory=None,
         steps=[('featureunion',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('numeric_pipeline',
                                                 Pipeline(memory=None,
                                                          steps=[('selector',
                                                                  ColumnSelector(columns=['Distance '
                                                                                          'from '
                                                                                          'Residence '
                                                                                          'to '
                                                                                          'Work',
                                                                                          'Service '
                                                                                      

## Question 4: Perform grid search on this final pipeline

Hint: Refer to the alternative solution of assignment 2
* Use RandomForestRegressor
* Try a couple of values for each hyperparameter and compare them with cross-validation
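
One more hint, sketched on a throwaway pipeline: parameter names in a grid over a pipeline follow the pattern `<step name>__<parameter>`, and `make_pipeline` names each step after its lowercased class name. The two-step pipeline below is an assumption for illustration, not the tutorial's `finalpipeline`.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=10))

# make_pipeline generates the step names automatically.
step_names = [name for name, _ in toy_pipeline.steps]
print(step_names)

# The same "<step>__<param>" keys work for set_params and for a search grid.
toy_pipeline.set_params(randomforestregressor__max_depth=5)
print(toy_pipeline.named_steps["randomforestregressor"].max_depth)
```

If the pipeline is built with explicit names via `Pipeline([...])`, the prefix is whatever name was given to the step instead.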

In [28]:
# %load solutions/gridSearch.py
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Number of trees in random forest
n_estimators = [100, 500, 750, 1000]
# Number of features to consider at every split
max_features = [3, 4, 5]
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 110, num = 4)]
max_depth.append(None)

# Create the parameter grid (GridSearchCV tries every combination)
random_grid = {'randomforestregressor__n_estimators': n_estimators,
               'randomforestregressor__max_features': max_features,
               'randomforestregressor__max_depth': max_depth,
              }

print(random_grid)

print("Grid search")
print('\n')

scoring = 'neg_mean_absolute_error'
clf = GridSearchCV(finalpipeline, random_grid, n_jobs=-1, verbose=True, scoring=scoring)
clf.fit(x_train, y_train)

clf_preds = clf.predict(x_test)
clf_preds = pd.Series(clf_preds)
clf_preds = clf_preds.rename("Grid Search Predicted values")


{'randomforestregressor__n_estimators': [100, 500, 750, 1000], 'randomforestregressor__max_features': [3, 4, 5], 'randomforestregressor__max_depth': [5, 40, 75, 110, None]}
Grid search


Fitting 3 folds for each of 60 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   11.2s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:   50.8s finished

In [29]:
clf.best_params_

{'randomforestregressor__max_depth': 40,
 'randomforestregressor__max_features': 3,
 'randomforestregressor__n_estimators': 500}

In [30]:
clf.best_score_

-6.7311019747920415
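
The score is negative because scikit-learn scorers follow a "greater is better" convention, so `neg_mean_absolute_error` reports the MAE with its sign flipped; `best_score_` is the mean of that score across the cross-validation folds for the best candidate. A sketch on synthetic data, using a `Ridge` model and an `alpha` grid chosen here only for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem (assumption).
rng = np.random.RandomState(0)
X_toy = rng.rand(60, 3)
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

toy_search = GridSearchCV(Ridge(), {"alpha": [0.01, 1.0, 100.0]},
                          scoring="neg_mean_absolute_error", cv=3)
toy_search.fit(X_toy, y_toy)

# Flip the sign to read the score as an ordinary (non-negative) MAE.
best_mae = -toy_search.best_score_
print(toy_search.best_params_, best_mae)
```

Read the same way, the value above corresponds to a cross-validated mean absolute error of about 6.73 hours.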

In [None]:
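# Evaluate the tuned pipeline on the test set
# (y_pred was produced earlier by the first, untuned pipeline)
print(mean_squared_error(y_test, clf_preds))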