## Scikit-learn Pipelines and Preprocessing Steps
Now that we know several scikit-learn functions, let's learn how to tie the different steps together!

By the end of this tutorial you'll know:
1. Which preprocessing steps we can apply to our data
2. How to tie transformers and estimators together into a pipeline
3. How to write your own transformer
4. Finally, how to tie it all together!

### Scikit-learn Pipelines
As per the documentation, a pipeline will:
"Sequentially apply a list of transforms and a final estimator. Intermediate steps of the pipeline must be 'transforms', that is, they must implement fit and transform methods. The final estimator only needs to implement fit."

For more information, see the source: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
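To make the quoted contract concrete, here is a minimal sketch on synthetic data. The arrays and the choice of `StandardScaler` and `LinearRegression` are illustrative assumptions and are not part of the absenteeism example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Intermediate steps implement fit/transform; the final step only needs fit.
pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

pipe.fit(X, y)           # fits the scaler, transforms X, then fits the model
preds = pipe.predict(X)  # applies the fitted scaler, then predicts
print(preds.shape)
```

Calling `fit` once on the pipeline fits every step in order, so the same preprocessing is replayed automatically at prediction time.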

Let's start by loading our data:

In [385]:
import pandas as pd 
import numpy as np

data = pd.read_csv('data/data.csv', index_col=0)
print(data.dtypes)
print()
print('Summary Statistics for Target Variable: \n', data['Absenteeism time in hours'].describe())
print(data.shape)
# we have a mix of categorical, numeric, and string data.
data.head(10)

ID                                   int64
Reason for absence                  object
Month of absence                     int64
Day of the week                     object
Distance from Residence to Work    float64
Service time                       float64
Age                                float64
Work load Average/day              float64
Hit target                           int64
Disciplinary failure                 int64
Education                           object
Number of Children                   int64
Social drinker                       int64
Social smoker                        int64
Pet                                  int64
Weight                               int64
Height                               int64
Body mass index                      int64
Absenteeism time in hours            int64
dtype: object

Summary Statistics for Target Variable: 
 count    749.000000
mean       8.080107
std       17.001698
min        0.000000
25%        2.000000
50%        3.000000
75%   

Unnamed: 0,ID,Reason for absence,Month of absence,Day of the week,Distance from Residence to Work,Service time,Age,Work load Average/day,Hit target,Disciplinary failure,Education,Number of Children,Social drinker,Social smoker,Pet,Weight,Height,Body mass index,Absenteeism time in hours
0,11,Patient follow-up,7,Tuesday,36.0,13.0,33.0,239.554,97,0,High school,2,1,0,1,90,172,30,4
1,36,No reason given,7,Tuesday,13.0,18.0,50.0,239.554,97,1,High school,1,1,0,0,98,178,31,0
2,3,Blood donation,7,Wednesday,51.0,18.0,38.0,239.554,97,0,High school,0,1,0,0,89,170,31,2
3,7,Diseases of the eye and adnexa,7,Thursday,,14.0,39.0,239.554,97,0,High school,2,1,1,0,68,168,24,4
4,11,Blood donation,7,Thursday,36.0,13.0,33.0,239.554,97,0,High school,2,1,0,1,90,172,30,2
5,3,Blood donation,7,Friday,51.0,18.0,38.0,239.554,97,0,High school,0,1,0,0,89,170,31,2
6,10,Medical consultation,7,Friday,52.0,3.0,28.0,239.554,97,0,High school,1,1,0,4,80,172,27,8
7,20,Blood donation,7,Friday,50.0,11.0,36.0,239.554,97,0,High school,4,1,0,0,65,168,23,4
8,14,"Injury, poisoning, and certain other consequen...",7,Monday,12.0,14.0,34.0,239.554,97,0,High school,2,1,0,0,95,196,25,40
9,1,Medical consultation,7,,11.0,14.0,37.0,239.554,97,0,Postgraduate,1,0,0,1,88,172,29,8


# Preprocessing Steps

Before creating a pipeline, let's walk through some preprocessing steps.

In [386]:
print(data.isna().sum())

ID                                 0
Reason for absence                 0
Month of absence                   0
Day of the week                    1
Distance from Residence to Work    1
Service time                       3
Age                                6
Work load Average/day              0
Hit target                         0
Disciplinary failure               0
Education                          0
Number of Children                 0
Social drinker                     0
Social smoker                      0
Pet                                0
Weight                             0
Height                             0
Body mass index                    0
Absenteeism time in hours          0
dtype: int64


Let's jot down the transformations we need to apply to the data before training:
1. Impute missing values
2. Convert categorical columns to numerical values
3. Scaling/Discretization/Binarization

### 1. Imputation of missing values:

We will learn techniques to impute numerical values and categorical values using SimpleImputer.

Documentation of SimpleImputer can be found here:
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

and Categorical Imputer can be found here:


In [387]:
from sklearn.impute import SimpleImputer

Now let's select the columns which need imputation

In [388]:
impute_columns = ["Day of the week", "Distance from Residence to Work", "Service time", "Age"]

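Apply the imputer to these columns and check the results. Note that strategy="most_frequent" fills every selected column, numeric or not, with its most frequent value.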

In [389]:
imp = SimpleImputer(strategy="most_frequent")
impute_df = pd.DataFrame(imp.fit_transform(data[impute_columns]),columns=impute_columns)

In [390]:
print(impute_df.isna().sum())

Day of the week                    0
Distance from Residence to Work    0
Service time                       0
Age                                0
dtype: int64
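The cell above uses one strategy for every column. A possible refinement, sketched here on a made-up toy frame, is to give numeric and categorical columns their own `SimpleImputer`. The column split and the choice of `median` for numeric data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "Day of the week": pd.Series(["Monday", "Friday", np.nan, "Friday"], dtype=object),
    "Age": [30.0, np.nan, 40.0, 50.0],
})

numeric_cols = ["Age"]
categorical_cols = ["Day of the week"]

num_imp = SimpleImputer(strategy="median")         # numeric: fill with the median
cat_imp = SimpleImputer(strategy="most_frequent")  # categorical: fill with the mode

toy_imputed = toy.copy()
toy_imputed[numeric_cols] = num_imp.fit_transform(toy[numeric_cols])
toy_imputed[categorical_cols] = cat_imp.fit_transform(toy[categorical_cols])
print(toy_imputed)
```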


### 2. Convert Categorical columns to numeric columns

Scikit-learn itself provides ordinal and one-hot encoders (OrdinalEncoder and OneHotEncoder in sklearn.preprocessing).

The category_encoders library (http://contrib.scikit-learn.org/categorical-encoding/index.html) offers many more encoding techniques!

Check out this article on weight-of-evidence (WOE) encoding:
https://medium.com/@sundarstyles89/weight-of-evidence-and-information-value-using-python-6f05072e83eb

Let's start by listing the columns that need categorical encoding. We'll try two encodings for now: one-hot and ordinal (label).

In [391]:
label_encode_column = ['Reason for absence']
one_hot_encode_column = ['Education', 'Day of the week']

In [392]:
from category_encoders.ordinal import OrdinalEncoder
from category_encoders.one_hot import OneHotEncoder

In [393]:
one_hot = OneHotEncoder(use_cat_names=True)
one_hot_encoded_df = one_hot.fit_transform(data[one_hot_encode_column])

In [394]:
one_hot_encoded_df.head()

Unnamed: 0,Education_High school,Education_Postgraduate,Education_Graduate,Education_Master and Doctor,Day of the week_Tuesday,Day of the week_Wednesday,Day of the week_Thursday,Day of the week_Friday,Day of the week_Monday,Day of the week_nan
0,1,0,0,0,1,0,0,0,0,0
1,1,0,0,0,1,0,0,0,0,0
2,1,0,0,0,0,1,0,0,0,0
3,1,0,0,0,0,0,1,0,0,0
4,1,0,0,0,0,0,1,0,0,0


In [395]:
ordinal_encoder = OrdinalEncoder()
ordinal_encoded_df = ordinal_encoder.fit_transform(data[label_encode_column])

In [396]:
ordinal_encoded_df.head()

Unnamed: 0,Reason for absence
0,1
1,2
2,3
3,4
4,3


### 3. Scaling/Discretization/Binarization

Let's identify the columns for scaling, binning (discretization), and binarization.

a. Discretization:
    Discretization, also known as quantization or binning, divides a continuous feature into a pre-specified number of categories (bins), and thus makes the data discrete.
    
Scikit-learn provides a KBinsDiscretizer class that takes care of this. The main things to specify are the number of bins (n_bins) for each feature and how to encode these bins (ordinal, onehot, or onehot-dense).

Let's try to discretize on some columns:

In [397]:
from sklearn.preprocessing import KBinsDiscretizer

discretize_column = ["Weight"]
disc = KBinsDiscretizer(n_bins=3, encode='ordinal', 
                        strategy='uniform')
discrete_df = pd.DataFrame(disc.fit_transform(data[discretize_column]),columns=discretize_column)
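To see what the discretizer actually learned, you can inspect its `bin_edges_` attribute after fitting. The sketch below uses made-up weights rather than the real column, with the same settings as the cell above.

```python
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Made-up weights, evenly spread so the bins are easy to check by eye.
toy = pd.DataFrame({"Weight": [50, 60, 70, 80, 90, 100, 110]})

toy_disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
toy_bins = toy_disc.fit_transform(toy[["Weight"]])

# bin_edges_ holds one array of edges per feature; 'uniform' gives
# equal-width bins between the column's minimum and maximum.
print(toy_disc.bin_edges_[0])
print(toy_bins.ravel())
```

With `strategy='uniform'` the bins have equal width, so a skewed column can leave some bins nearly empty. `strategy='quantile'` is the option to consider when roughly equal counts per bin are wanted.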

So far we have done the following steps:

1. Imputation........done
2. Categorical-to-numerical encoding..........done
3. Discretization.......done

Time to tie it all together... But how?

In [398]:
all_columns = list(data.columns)

In [399]:
from sklearn.pipeline import make_pipeline, make_union

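# NOTE: ColumnSelector is the custom transformer defined in cell In [405] further down;
# run that cell first.
# Columns that need no imputation, kept in their original order
# (a set difference does not give a stable column order).
discrete_columns = [col for col in all_columns if col not in impute_columns]

## Preprocessing pipeline
impute_pipeline = make_pipeline(
    # If using make_union, we first have to select all the columns we will pull from.
    ColumnSelector(all_columns),
    make_union(
        # First, select and 'hold out' the columns that need no imputation;
        # they pass through unchanged.
        make_pipeline(ColumnSelector(discrete_columns)),
        # Impute the columns that contain missing values
        make_pipeline(
            ColumnSelector(impute_columns),
            SimpleImputer(strategy='most_frequent')
        )
    )
)

# make_union concatenates its outputs in the order listed, so the column
# labels must follow that same order rather than the original one.
processed_data = pd.DataFrame(impute_pipeline.fit_transform(data),
                              columns=discrete_columns + impute_columns)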

In [400]:
len(processed_data.columns)

19

In [401]:
len(data.columns)

19
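Matching column counts do not guarantee matching column order. The sketch below, on a made-up three-column frame, shows that `make_union` stacks its outputs side by side in the order the branches are listed and returns a plain NumPy array. The `select` helper is hypothetical and only stands in for a column-selecting transformer.

```python
import pandas as pd
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})


def select(columns):
    """Hypothetical helper: a transformer that keeps only `columns`."""
    return FunctionTransformer(lambda frame: frame[columns])


# The branch selecting "c" is listed first, so "c" comes first in the output.
union = make_union(select(["c"]), select(["a", "b"]))
stacked = union.fit_transform(toy)

print(type(stacked).__name__)
print(stacked.tolist())
```

Because the result carries no column names, any labels attached afterwards have to be listed in the same branch order.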

In [407]:
# Columns that neither the encoders nor the discretizer touch, in their original order
transformed_columns = discretize_column + label_encode_column + one_hot_encode_column
discrete_columns = [col for col in all_columns if col not in transformed_columns]
# Note: this pipeline does not impute, so missing values in the held-out
# columns pass through unchanged.
processing_pipeline = make_pipeline(
    # If using make_union, we first have to select all the columns we will pull from.
    ColumnSelector(all_columns),
    make_union(
        # First, select and 'hold out' the columns that need no further work.
        make_pipeline(ColumnSelector(discrete_columns)),
        # Ordinal-encode the label column
        make_pipeline(
            ColumnSelector(label_encode_column),
            OrdinalEncoder()
        ),
        # One-hot encode the remaining categorical columns
        make_pipeline(
            ColumnSelector(one_hot_encode_column),
            OneHotEncoder()
        ),
        # Discretize the numeric column into three equal-width bins
        make_pipeline(
            ColumnSelector(discretize_column),
            KBinsDiscretizer(n_bins=3, encode='ordinal',
                             strategy='uniform')
        )
    )
)

final_processed_data = processing_pipeline.fit_transform(data)
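As an alternative to the select-then-union pattern, scikit-learn also ships `ColumnTransformer`, which routes named columns to transformers directly. This is a different technique from the one used in this notebook. The sketch below runs on a made-up toy frame and uses scikit-learn's own encoders under aliases chosen here, in place of the category_encoders classes.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.preprocessing import OneHotEncoder as SkOneHotEncoder
from sklearn.preprocessing import OrdinalEncoder as SkOrdinalEncoder

# Toy frame (made-up rows) mirroring a few columns of the real data.
toy = pd.DataFrame({
    "Reason for absence": pd.Series(
        ["Blood donation", "Medical consultation", "Blood donation", "No reason given"],
        dtype=object),
    "Education": pd.Series(
        ["High school", "Graduate", "High school", "Postgraduate"], dtype=object),
    "Weight": [60, 75, 90, 105],
    "Age": [30.0, 41.0, 35.0, 52.0],
})

column_transformer = ColumnTransformer(
    transformers=[
        ("ordinal", SkOrdinalEncoder(), ["Reason for absence"]),
        ("one_hot", SkOneHotEncoder(handle_unknown="ignore"), ["Education"]),
        ("bins", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform"),
         ["Weight"]),
    ],
    remainder="passthrough",  # untouched columns (here: Age) are appended last
    sparse_threshold=0,       # always return a dense array
)

toy_processed = column_transformer.fit_transform(toy)
print(toy_processed.shape)
```

The output columns follow the order of the `transformers` list, with the passthrough columns at the end, so no separate hold-out branch is needed.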

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestRegressor
from math import sqrt
import matplotlib.pyplot as plt

In [None]:
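finalpipeline = make_pipeline(
    processing_pipeline,
    RandomForestRegressor(random_state=1, n_jobs=-1, n_estimators=100),
)
# Fitting the pipeline
# NOTE: x_train and y_train are not defined anywhere in this notebook; they are
# assumed to come from a train/test split of `data`. Because processing_pipeline
# starts with ColumnSelector(all_columns), x_train must contain all of those columns.
finalpipeline.fit(x_train, y_train)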
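The metrics imported above are not used in the cells shown. A minimal sketch of how they could be applied after a train/test split follows. It runs on synthetic data, so the variable names (`X_toy`, `y_toy`, `x_tr`, `x_te`, and so on) and the split ratio are assumptions rather than the notebook's actual setup.

```python
from math import sqrt

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 4))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=200)

# Hold out 25% of the rows for evaluation.
x_tr, x_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1
)

model = RandomForestRegressor(random_state=1, n_estimators=100)
model.fit(x_tr, y_tr)
test_preds = model.predict(x_te)

rmse = sqrt(mean_squared_error(y_te, test_preds))  # root mean squared error
mae = mean_absolute_error(y_te, test_preds)
r2 = r2_score(y_te, test_preds)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R2={r2:.3f}")
```

The same three calls work unchanged when `model` is a full pipeline, since a fitted pipeline exposes `predict` just like a bare estimator.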

In [408]:
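# processing_pipeline returns a NumPy array (make_union stacks its outputs), so
# len(final_processed_data.columns) raises
# AttributeError: 'numpy.ndarray' object has no attribute 'columns'.
# Use .shape to count the columns instead.
final_processed_data.shape[1]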

In [404]:
from sklearn.base import TransformerMixin, BaseEstimator

In [405]:
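class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns so selection can be a pipeline step."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        # Nothing to learn: column selection is stateless
        return self

    def transform(self, x):
        # Return only the requested columns
        return x[self.columns]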
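Here is a short usage sketch of this custom transformer on a made-up toy frame. The class is repeated so the snippet runs on its own; pairing it with a mean `SimpleImputer` is an illustrative choice.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Same transformer as above, repeated so this snippet is self-contained."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x[self.columns]


toy = pd.DataFrame({"Age": [30.0, None, 50.0], "Pet": [0, 1, 2]})

# Select one column, then impute it; the other column never reaches the imputer.
age_pipeline = make_pipeline(ColumnSelector(["Age"]), SimpleImputer(strategy="mean"))
age_filled = age_pipeline.fit_transform(toy)
print(age_filled.ravel())

# Inheriting from BaseEstimator provides get_params/set_params for free.
print(ColumnSelector(["Age"]).get_params())
```

Storing the constructor argument unchanged on `self` is what lets `BaseEstimator` report and clone the parameters, which matters once the transformer is used inside a grid search.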

In [406]:
## TODO: function to write weight of evidence encoding
## Assignment: write a custom imputer for categorical values
from sklearn.base import BaseEstimator, TransformerMixin


class SeriesImputer(BaseEstimator, TransformerMixin):
    """Impute missing values in a pandas Series.

    If the Series is of dtype object, impute with the most frequent value.
    Otherwise, impute with the mean.
    """

    def fit(self, X, y=None):
        # np is NumPy, imported at the top of the notebook
        if X.dtype == np.dtype('O'):
            self.fill = X.value_counts().index[0]
        else:
            self.fill = X.mean()
        return self

    def transform(self, X, y=None):
        return X.fillna(self.fill)
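One possible way to extend the same idea from a single Series to a whole DataFrame is sketched below. The class name `DataFrameImputer` and the toy data are hypothetical, and the sketch chooses the fill value by checking whether each column is numeric rather than by comparing against the object dtype.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameImputer(BaseEstimator, TransformerMixin):
    """Hypothetical per-column imputer: mean for numeric columns, mode otherwise."""

    def fit(self, X, y=None):
        # Learn one fill value per column.
        self.fill_ = {
            col: X[col].mean() if is_numeric_dtype(X[col])
            else X[col].value_counts().index[0]
            for col in X.columns
        }
        return self

    def transform(self, X):
        # fillna with a dict fills each column with its own value.
        return X.fillna(value=self.fill_)


toy = pd.DataFrame({
    "Day of the week": pd.Series(["Monday", "Friday", None, "Friday"], dtype=object),
    "Age": [30.0, np.nan, 50.0, 40.0],
})

filled = DataFrameImputer().fit_transform(toy)
print(filled)
```

Because the fill values are learned in `fit` and only applied in `transform`, values computed on training data can be reused on new data without recomputing them.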