# Data Pipelines with Sklearn

We've seen it time and time again: data is amazing... ly messy. You can safely bet that you will always need to massage your data into the format your analysis requires (unless you're using the Iris dataset, that is).



Data pipelines are a way to streamline your workflow. They let you automate several steps of data processing and model creation, and they make your code cleaner and easier to understand and reproduce.

## Major steps in data processing

- Data Preprocessing

- Data Normalization and Standardization

- Feature Engineering and Selection

- Splitting Data into Train and Test Datasets

- Setting up an Algorithm and Model Fitting

- Model Evaluation and Selection

All of these steps can be added to your data pipeline object but not all are necessary. We'll look at a few examples of how to do this and why.

## Data preprocessing

Data preprocessing is an umbrella term that covers all of the transformations that must be applied to the data you're using.

We start by verifying the quality of the data: inspecting the numerical and categorical values, dealing with outliers and missing values, and resolving duplicate or inconsistent values. Encoding or bucketing of values may be needed, as well as normalizing the data and selecting or engineering features.

The final goal is to get your data into a state that your algorithm can easily interpret and use to create a model.

In [12]:
import pandas as pd

In [13]:
#Import the data :
df_loan = pd.read_csv('loan-dataset/loan_data.csv')
df_loan.drop('Loan_ID', axis=1, inplace=True)
df_loan.head()

#What is the type for the Dependents column ?

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [4]:
df_loan.Dependents.unique()

#Mmmmm.... we'll deal with this later

array(['0', '1', '2', '3+', nan], dtype=object)

In [5]:
#Check out the different types :
df_loan.dtypes

Gender                object
Married               object
Dependents            object
Education             object
Self_Employed         object
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History       float64
Property_Area         object
Loan_Status           object
dtype: object

In [6]:
#Let's see how many unique values we have in each of our columns :
df_loan.nunique()

Gender                 2
Married                2
Dependents             4
Education              2
Self_Employed          2
ApplicantIncome      505
CoapplicantIncome    287
LoanAmount           203
Loan_Amount_Term      10
Credit_History         2
Property_Area          3
Loan_Status            2
dtype: int64

In [7]:
#Check if we have any null values :
df_loan.isnull().values.any()

True

In [8]:
#Since we do, let's see how many in each column
df_loan.isnull().sum()

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

OK, now that we've peeked at our data, let's start transforming it into something we can feed to an algorithm that predicts whether the loan will be approved or not. This is a classification problem with only two possible outcomes : Yes or No (the bank can't be coy and say maybe...)

The first thing we need to do is transform the value '3+' in the Dependents column to just '3'. The column stays a string column and is treated as categorical later on.

To do this we construct a custom transformer function that handles the data the way we want.

In [14]:
from sklearn.preprocessing import FunctionTransformer

#We define a custom function to do the transformation for us.
#It returns the transformed DataFrame, which is what FunctionTransformer (and any Pipeline) expects :
def custom_transformation(df):
    df = df.copy()
    df['Dependents'] = df['Dependents'].replace('3+', '3')
    return df

#We create the FunctionTransformer object that we can then pass into our pipeline object :
ft = FunctionTransformer(func=custom_transformation, validate=False)

#We can fit the transform right away by itself or insert it into a Pipeline object.
df_loan = ft.fit_transform(df_loan)

In [15]:
df_loan.head(10)

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
5,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y
6,Male,Yes,0,Not Graduate,No,2333,1516.0,95.0,360.0,1.0,Urban,Y
7,Male,Yes,3,Graduate,No,3036,2504.0,158.0,360.0,0.0,Semiurban,N
8,Male,Yes,2,Graduate,No,4006,1526.0,168.0,360.0,1.0,Urban,Y
9,Male,Yes,1,Graduate,No,12841,10968.0,349.0,360.0,1.0,Semiurban,N
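
The same replacement can also live inside a Pipeline instead of being applied by hand. Below is a minimal sketch on a made-up toy frame (not the loan dataset); the function name `replace_three_plus` and the step name `fix_dependents` are hypothetical and only chosen for illustration.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def replace_three_plus(df):
    """Return a copy of df with '3+' in 'Dependents' replaced by '3'."""
    df = df.copy()
    df["Dependents"] = df["Dependents"].replace("3+", "3")
    return df


# Toy frame made up for illustration.
toy = pd.DataFrame({"Dependents": ["0", "3+", "1", "2"]})

# A one-step pipeline; more steps could be appended after this one.
clean_step = Pipeline(steps=[
    ("fix_dependents", FunctionTransformer(replace_three_plus, validate=False)),
])

cleaned = clean_step.fit_transform(toy)
print(cleaned["Dependents"].tolist())
print(toy["Dependents"].tolist())  # the input frame is left untouched
```

Returning a copy rather than mutating the input keeps the step safe to reuse, for example inside cross-validation where the same frame is transformed several times.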


## Data Normalization and Standardization

Data normalization and standardization are routinely done to improve the numerical stability of our models. Certain models can be very sensitive to the scale of the features or to outliers, or make assumptions about the data, such as its distribution.
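
To make the difference concrete, here is a small sketch comparing the two on a single made-up income column. It assumes `StandardScaler` for standardization (zero mean, unit variance) and `MinMaxScaler` for normalization to the 0 to 1 range; the numbers are invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy incomes (made up); a single column, so the shape is (n_samples, 1).
incomes = np.array([[2000.0], [3000.0], [4000.0], [11000.0]])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = StandardScaler().fit_transform(incomes)

# Normalization: rescale linearly so the minimum maps to 0 and the maximum to 1.
normalized = MinMaxScaler().fit_transform(incomes)

print("standardized mean/std:", standardized.mean(), standardized.std())
print("normalized min/max:", normalized.min(), normalized.max())
```

Note how the large value (11000) still dominates in both cases: neither scaler removes outliers, they only change the scale.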

In [16]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

Let's discuss what these classes are and why we care.

Pipeline : This lets you create the actual Pipeline object itself, which is a way to chain all of the preprocessing, model fitting and selection into a single (hopefully) smooth process.

SimpleImputer : This class deals with missing values and gives you several strategies for filling them in (mean, median, most frequent value or a constant).

ColumnTransformer : This class applies different transforms to the columns we choose. Notice that we could have added our 'Dependents' data transformation inside of our pipeline here instead of applying it by itself.

OneHotEncoder : This encodes our categorical variables into 1s and 0s. Remember that one-hot encoding should be used on categorical variables that are nominal or binary only. (If you're dealing with ordinal categorical features you can use OrdinalEncoder from sklearn.preprocessing; LabelEncoder is meant for the target labels, not for the features.)

StandardScaler : This standardizes each feature by removing its mean and scaling it to unit variance. If you instead want to squeeze your data into a fixed range, min_val to max_val (most commonly 0 to 1), use MinMaxScaler. There are many scalers and the right one will depend on your application (great non-answer again). If you would like to read up on some of the scalers available, [this website](https://benalexkeen.com/feature-scaling-with-scikit-learn/) does a pretty good job of walking you through them.
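
As a quick illustration of the imputing and encoding classes above, here is a sketch on two made-up columns. The `size` column and its ordering are hypothetical and only there to show an ordinal feature; they are not part of the loan dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Nominal feature with a missing value: fill it, then one-hot encode it.
area = pd.DataFrame({"Property_Area": ["Urban", "Rural", np.nan, "Urban"]})
area_filled = SimpleImputer(strategy="most_frequent").fit_transform(area)
area_onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(area_filled).toarray()
print(area_onehot)  # one column per category, in sorted order: Rural, Urban

# Ordinal feature: pass the category order explicitly so the codes respect it.
size = pd.DataFrame({"size": ["small", "large", "medium", "small"]})
size_codes = OrdinalEncoder(categories=[["small", "medium", "large"]]).fit_transform(size)
print(size_codes.ravel())
```

Passing `categories` explicitly matters for ordinal data: without it the encoder sorts the categories alphabetically, which would put "large" before "medium".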

In [17]:
#Now that we know what these classes are doing, let's select the data we want to transform.
#Let's extract the numeric and categorical column names from our dataframe.

#Numeric columns :
numFeatures = df_loan.select_dtypes(include=['int64', 'float64']).columns

#Categorical columns :
catFeatures = df_loan.select_dtypes(include=['object']).drop(['Loan_Status'], axis=1).columns

print(catFeatures)

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'Property_Area'],
      dtype='object')


In [18]:
#Now we can finally build our first Pipeline. 
#Let's create the Pipeline for transforming our numeric data. We have to provide the steps we want to 
#execute in order. First replace NaNs and then scale the data.
numTransformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

#There is also : sklearn.pipeline.make_pipeline which can also be used to create a Pipeline
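
For comparison, a short sketch of the `make_pipeline` shortcut: it builds the same kind of object but names the steps for you, using the lowercased class names.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same two steps as the numeric pipeline, without naming them by hand.
num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# The generated step names are the lowercased class names.
step_names = [name for name, _ in num_pipeline.steps]
print(step_names)
```

Explicit names via `Pipeline(steps=[...])` are usually nicer once you start tuning parameters, since the step name becomes part of the parameter name.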

In [19]:
#Now that we have the Pipeline for numerical data, let's do the same for categorical data.
catTransformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

In [20]:
#Now we combine the numerical and categorical Pipelines so that each one is applied to its own set of columns :
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numTransformer, numFeatures),
        ('cat', catTransformer, catFeatures)])

In [10]:
# df_loan.isnull().sum()
preprocessor.fit_transform(df_loan)

array([[ 0.07299082, -0.55448733, -0.21124125, ...,  0.        ,
         0.        ,  1.        ],
       [-0.13441195, -0.03873155, -0.21124125, ...,  1.        ,
         0.        ,  0.        ],
       [-0.39374734, -0.55448733, -0.94899647, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 0.43717437, -0.47240418,  1.27616847, ...,  0.        ,
         0.        ,  1.        ],
       [ 0.35706382, -0.55448733,  0.49081614, ...,  0.        ,
         0.        ,  1.        ],
       [-0.13441195, -0.55448733, -0.15174486, ...,  0.        ,
         1.        ,  0.        ]])
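
To see what a fitted preprocessor like this does with data it has never seen, here is a small sketch on a made-up two-column frame. The column names, values and the unseen category "Suburb" are invented for illustration; `sparse_threshold=0` is only set so the result is always a dense array that is easy to print.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (made up), with one missing value per column.
train = pd.DataFrame({
    "income": [5000.0, np.nan, 3000.0, 4000.0],
    "area": ["Urban", "Rural", np.nan, "Urban"],
})

num_pipe = Pipeline(steps=[("imputer", SimpleImputer(strategy="median")),
                           ("scaler", StandardScaler())])
cat_pipe = Pipeline(steps=[("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
                           ("onehot", OneHotEncoder(handle_unknown="ignore"))])

toy_preprocessor = ColumnTransformer(
    transformers=[("num", num_pipe, ["income"]), ("cat", cat_pipe, ["area"])],
    sparse_threshold=0,  # always return a dense array
)

train_out = toy_preprocessor.fit_transform(train)
print(train_out.shape)  # 1 scaled numeric column + 3 one-hot columns

# A new row with a missing income and a category never seen during fit.
new_row = pd.DataFrame({"income": [np.nan], "area": ["Suburb"]})
new_out = toy_preprocessor.transform(new_row)
print(new_out)
```

The missing income is filled with the median learned from the training data, and because of `handle_unknown='ignore'` the unseen category becomes a row of zeros in the one-hot columns instead of raising an error.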

## Feature Engineering, Selection and Dimensionality Reduction

This is the part where you would do variable/feature selection, which essentially just means choosing a subset of features from the ones you have available. When done well, this helps improve accuracy, makes your model train faster and helps against overfitting.

- Feature Engineering : creates a new, small set of features that capture the phenomena we're interested in in our data.
- Feature Selection : keeps a subset of the original features.

Feature Engineering/Selection can help your algorithm go from not great to pretty amazing. This is particularly important when the number of features available is very large, especially if there aren't that many data points available. 

There are two types of feature selection strategies :

- Univariate Feature Selection : More manual; check each feature individually for correlation with the outcome variable. This is easier to do if you have some domain knowledge about your data.

- Multivariate Feature Selection : When there are too many features to check one by one, we evaluate groups of features at once. If you don't know a lot about the domain, this technique is more trustworthy.
    - There are 3 categories for Multivariate Feature Selection :
        - Filter Methods : check variance (set threshold), Pearson's correlation test and Linear Discriminant Analysis (LDA)
        
        - Wrapper Methods : Forward Selection/Backward Elimination
        
        - Embedded Methods : Lasso Regularization, Gradient Boosting Machine (GBM), Random Forest (RF)
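
As a rough sketch of what a filter method can look like in code, the example below chains a variance threshold with a univariate test. It assumes synthetic data from `make_classification` and uses the ANOVA F-test (`f_classif`) as the scoring function, which is one choice among several and not something the notes above prescribe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline

# Synthetic data: 10 features, 3 of them informative.
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, random_state=0)
# Append a constant column, which carries no information at all.
X_demo = np.hstack([X_demo, np.ones((200, 1))])

selector = Pipeline(steps=[
    ("variance", VarianceThreshold(threshold=0.0)),      # drops the constant column
    ("kbest", SelectKBest(score_func=f_classif, k=3)),   # keeps the 3 best-scoring features
])

X_selected = selector.fit_transform(X_demo, y_demo)
print(X_demo.shape, "->", X_selected.shape)
```

Because the selector is itself a Pipeline, it could be placed between the preprocessor and the classifier as an extra step.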




- Linear Discriminant Analysis (LDA) :

    - LDA is used in supervised learning, when the data points are labelled.
    
- Principal Component Analysis (PCA) : The main purpose of PCA is to identify patterns in the data and use them to reduce the dimensions of the dataset with minimal loss of information.

    - PCA tries to reduce dimensionality by exploring how one feature of the data is expressed in terms of the other features (linear dependency). It does not look at the target; feature selection, in contrast, takes the target into consideration.
    
    - PCA works best on datasets with 3 or more dimensions, because with higher dimensions it becomes increasingly difficult to interpret the resulting cloud of data.
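
A minimal sketch of the linear-dependency idea, on made-up data: the third feature is (almost) the sum of the first two, so two principal components should capture nearly all of the variance. The data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Third feature: a linear combination of the first two plus a little noise.
third = base[:, [0]] + base[:, [1]] + 0.01 * rng.normal(size=(100, 1))
X_demo = np.hstack([base, third])

# Scale first, since PCA is sensitive to the scale of each feature.
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = pca_pipeline.fit_transform(X_demo)

explained = pca_pipeline.named_steps["pca"].explained_variance_ratio_.sum()
print(X_demo.shape, "->", X_reduced.shape)
print("variance kept:", explained)
```

Note that `y` never appears here: PCA only looks at the features, which is the contrast with feature selection mentioned above.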

We will cover Feature Engineering/Selection more in the next couple of weeks.

## Splitting Data into Train and Test Datasets

In [28]:
#sklearn has a function that lets you split your data into train/test sets :
from sklearn.model_selection import train_test_split

#We need to create our independent features matrix and our dependent variable:
#Dependent variable : Loan Status (approved or not)
y = df_loan['Loan_Status']

#Remaining features are our independent variables :
X = df_loan.drop('Loan_Status', axis=1)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
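
One option worth knowing for classification problems is `stratify`, which keeps the class proportions the same in the train and test sets. The sketch below uses made-up labels with a 70/30 split between 'Y' and 'N'; it is an optional variation, not what the split above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up labels: 70 approvals and 30 rejections.
y_demo = np.array(["Y"] * 70 + ["N"] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# With stratify, the 20 test rows keep the 70/30 proportion.
print("share of 'Y' in test set:", (y_te == "Y").mean())
```

Without `stratify`, a small test set can end up with noticeably different class proportions purely by chance, which makes scores harder to compare.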

## Setting up an Algorithm and Model Fitting

Let's bring it all together ! We are going to use a Random Forest Classifier on our data.

In [29]:
from sklearn.ensemble import RandomForestClassifier

rf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', RandomForestClassifier())])

In [30]:
X_train.columns

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area'],
      dtype='object')

In [31]:
rf.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  Index(['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History'],
      dtype='object')),
                                                 ('cat',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                          

In [37]:
import numpy as np

#Sanity check : after preprocessing there should be no NaN or infinite values left
X_train_processed = preprocessor.fit_transform(X_train)

print(np.any(np.isnan(X_train_processed)))
print(np.all(np.isfinite(X_train_processed)))

False
True


## Model Evaluation and Selection

The Pipeline approach streamlines the preprocessing of the data, which lets us fit lots of different models on our data very quickly.

In [20]:
y_pred = rf.predict(X_test)
y_pred

array(['Y', 'Y', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'N', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'Y',
       'Y', 'Y', 'N', 'N', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'N', 'Y',
       'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'N',
       'Y', 'N', 'N', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y', 'Y',
       'Y', 'Y', 'N', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y',
       'Y', 'Y', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'Y', 'Y', 'N', 'N', 'Y',
       'Y', 'Y', 'Y', 'Y', 'N', 'Y'], dtype=object)
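
Once you have predictions, a few functions from `sklearn.metrics` turn them into numbers you can compare. The sketch below uses short made-up label lists rather than the real `y_test` and `y_pred`, so the values mean nothing beyond showing the calls.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up true labels and predictions.
y_true_demo = ["Y", "Y", "N", "Y", "N", "N", "Y", "Y"]
y_pred_demo = ["Y", "Y", "Y", "Y", "N", "Y", "Y", "N"]

# Fraction of predictions that match the true labels.
acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true labels, columns are predicted labels, in the order given by `labels`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["N", "Y"])

print("accuracy:", acc)
print(cm)
print(classification_report(y_true_demo, y_pred_demo))
```

The confusion matrix is often more informative than accuracy alone here: it shows whether the errors are approvals predicted for rejected loans or the other way around.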

In [21]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier()
    ]

for classifier in classifiers:
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', classifier)])
    pipe.fit(X_train, y_train)
    print(classifier)
    print("model score: %.3f" % pipe.score(X_test, y_test), "\n\n")

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
model score: 0.756 


SVC(C=0.025, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=True, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
model score: 0.732 


NuSVC(cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
      kernel='rbf', max_iter=-1, nu=0.5, probability=True, random_state=None,
      shrinking=True, tol=0.001, verbose=False)
model score: 0.829 


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
model score: 0.780 


AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=None)
model score: 0.813 


GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1

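We can also use cross-validation with our pipeline. Because the whole pipeline is passed to cross_val_score, the preprocessing steps are re-fit on each training fold, which avoids leaking information from the validation fold :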

In [26]:
from sklearn.model_selection import cross_val_score

for classifier in classifiers:
    pipe = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', classifier)])
    print(classifier)
    #cross_val_score clones and fits the pipeline on each fold, so no separate fit is needed
    print(cross_val_score(pipe, X_train, y_train, cv=3,
                          scoring='f1_micro'))

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
[0.75       0.7804878  0.76687117]
SVC(C=0.025, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=True, random_state=None,
    shrinking=True, tol=0.001, verbose=False)




[0.67682927 0.67682927 0.67484663]
NuSVC(cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
      kernel='rbf', max_iter=-1, nu=0.5, probability=True, random_state=None,
      shrinking=True, tol=0.001, verbose=False)




[0.81707317 0.80487805 0.78527607]
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
[0.73780488 0.63414634 0.67484663]
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
[0.73170732 0.67682927 0.7607362 ]
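
Cross-validation also combines nicely with hyperparameter search. Here is a sketch using `GridSearchCV` on synthetic data; the grid values are arbitrary and the pipeline is a simplified stand-in for the one above. The point is the `step__parameter` naming, which is how you reach a parameter of a step inside a Pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data.
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

pipe_demo = Pipeline(steps=[("scaler", StandardScaler()),
                            ("classifier", RandomForestClassifier(random_state=0))])

# "<step name>__<parameter name>" addresses a parameter of that step.
param_grid = {
    "classifier__n_estimators": [10, 50],
    "classifier__max_depth": [2, None],
}

search = GridSearchCV(pipe_demo, param_grid, cv=3, scoring="f1_micro")
search.fit(X_demo, y_demo)

print(search.best_params_)
print("best mean CV score:", search.best_score_)
```

After fitting, `search.best_estimator_` is a full pipeline refit on all the data with the best parameters, so it can be used directly for predictions.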


Last but not least, let's look at what an NLP pipeline might look like

In [None]:
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline


#X_train/X_test are lists of strings representing documents
#y_train/y_test are the labels
#someDataSet is a placeholder for your own pre-split dataset
X_train,X_test,y_train,y_test = someDataSet

# #calculates vector of term frequencies :
# vect = CountVectorizer()

# #normalizes term frequencies :
# tfidf = TfidfTransformer()

# #a linear SVM classifier :
# clf = LinearSVC()

#creating pipeline :
pipeline = Pipeline([
    ('vect',CountVectorizer()),
    ('tfidf',TfidfTransformer()),
    ('clf',LinearSVC())
])

#Using cross-validation :
scores = cross_val_score(pipeline,X_train,y_train,cv=3,
    scoring='f1_micro')

#Fitting the model :
pipeline.fit(X_train,y_train)

#Predicting on unseen data :
y_preds = pipeline.predict(X_test)
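
Since the cell above relies on a placeholder dataset, here is a self-contained sketch of the same pipeline on a handful of made-up sentences. The documents and the "pos"/"neg" labels are invented and far too few to learn anything meaningful; the example only shows that the pieces fit together.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus with made-up labels.
docs = [
    "the loan was approved quickly",
    "great service and fast approval",
    "loan rejected without explanation",
    "terrible service, application denied",
    "approved and happy with the bank",
    "denied again, very disappointed",
]
labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

text_pipeline = Pipeline([
    ("vect", CountVectorizer()),    # raw text -> term counts
    ("tfidf", TfidfTransformer()),  # term counts -> tf-idf weights
    ("clf", LinearSVC()),           # linear SVM classifier
])

text_pipeline.fit(docs, labels)

# The pipeline takes raw strings at prediction time too.
new_docs = ["application approved", "service was terrible"]
preds = text_pipeline.predict(new_docs)
print(list(preds))
```

The nice part is that the vectorizer's vocabulary is learned during `fit` and reused automatically in `predict`, so there is no separate bookkeeping to do.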
