# Preprocessing and Pipelines
---
In this notebook, we discuss what pipelines are and how `sklearn` lets transformers and estimators be chained together and used as a single unit. We will also learn some preprocessing techniques that can improve the performance of our models.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Dealing with categorical variables

Categorical variables are those whose values are strings representing the different classes of that variable, for example sex (male, female) or color (red, blue). `sklearn` estimators generally expect numeric input, so we need to encode these strings numerically.

We can convert such a feature into **dummy variables**, one for each category: 0 means the observation does not belong to that category (class) and 1 means it does. For example, in a dataset about cars we have a categorical feature *Origin* that takes one of 3 values: US, Europe, Asia. Converting this variable to dummy variables gives:

![](assets/images/dummy.png)

Now for each row (observation), exactly one of these columns is 1 and the others are 0.

***NOTE:*** If the car's origin isn't US or Asia, then it must be Europe. Thus, we can discard one of these columns, since it can be inferred from the others.

We can generate dummy variables in 2 ways:

* From `sklearn` we can use `OneHotEncoder()`
* From `pandas` we can use `get_dummies()`
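
To make the two options concrete, here is a minimal sketch on a made-up frame (`toy_df` is hypothetical and only stands in for the cars data). It suggests that both approaches produce the same indicator columns; the main difference is that `OneHotEncoder` is a transformer that remembers the categories it saw during `fit`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data standing in for the cars dataset
toy_df = pd.DataFrame({"origin": ["US", "Asia", "Europe", "US"]})

# pandas: returns a new DataFrame with one indicator column per category
dummies = pd.get_dummies(toy_df, columns=["origin"], dtype=int)

# sklearn: learns the categories in fit() and encodes them in transform();
# the result is a sparse matrix by default, so we densify it for display
encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy_df[["origin"]]).toarray()

print(dummies.columns.tolist())  # column names created by pandas
print(encoder.categories_[0])    # categories learned by the encoder
print(encoded)
```

Because the encoder stores its categories, it can later be applied to new data (and placed inside a pipeline), which `get_dummies` cannot do on its own.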

In [2]:
cars_df = pd.read_csv("assets/data/cars.csv")
cars_df.head()

Unnamed: 0,mpg,displ,hp,weight,accel,origin,size
0,18.0,250.0,88,3139,14.5,US,15.0
1,9.0,304.0,193,4732,18.5,US,20.0
2,36.1,91.0,60,1800,16.4,Asia,10.0
3,18.5,250.0,98,3525,19.0,US,15.0
4,34.3,97.0,78,2188,15.8,Europe,10.0


For the cars dataset, the target variable is `mpg`. We want to create dummy variables from the `origin` feature.

In [3]:
df_origin = pd.get_dummies(cars_df)
df_origin.head()

Unnamed: 0,mpg,displ,hp,weight,accel,size,origin_Asia,origin_Europe,origin_US
0,18.0,250.0,88,3139,14.5,15.0,0,0,1
1,9.0,304.0,193,4732,18.5,20.0,0,0,1
2,36.1,91.0,60,1800,16.4,10.0,1,0,0
3,18.5,250.0,98,3525,19.0,15.0,0,0,1
4,34.3,97.0,78,2188,15.8,10.0,0,1,0


As noted above, it's better to drop one of the dummy variables. Let's say we want to drop `origin_Asia`. No offense, Asia.

In [4]:
df_origin.drop(columns=['origin_Asia'], inplace=True)
df_origin.head()

Unnamed: 0,mpg,displ,hp,weight,accel,size,origin_Europe,origin_US
0,18.0,250.0,88,3139,14.5,15.0,0,1
1,9.0,304.0,193,4732,18.5,20.0,0,1
2,36.1,91.0,60,1800,16.4,10.0,0,0
3,18.5,250.0,98,3525,19.0,15.0,0,1
4,34.3,97.0,78,2188,15.8,10.0,1,0


An alternative is to set the argument `drop_first` to `True` in the `get_dummies` function.
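
A small sketch of the `drop_first` route, again on a hypothetical `toy_df` rather than the real cars file. With `drop_first=True`, pandas drops the first category in sorted order, which here happens to be `Asia`.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the cars dataset
toy_df = pd.DataFrame({
    "mpg": [18.0, 36.1, 34.3],
    "origin": ["US", "Asia", "Europe"],
})

# Encode the string column and drop the first category ("Asia") in one call
encoded_df = pd.get_dummies(toy_df, drop_first=True, dtype=int)
print(encoded_df.columns.tolist())
```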

Now that all of our features are numeric, we can fit a model to them.

In [6]:
X = df_origin.drop(columns=['mpg']).values
y = df_origin.mpg.values.reshape(-1, 1)

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

ridge = Ridge(alpha=0.5, normalize=True)
ridge.fit(X_train, y_train)
ridge.score(X_test, y_test)

0.7190645190217895
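
One caveat: the `normalize` argument used above was deprecated in scikit-learn 1.0 and removed in 1.2, so that cell only runs on older versions. A rough sketch of the usual replacement is to standardize the features in a pipeline before the ridge step (pipelines and scaling are both covered later in this notebook). The data here (`X_demo`, `y_demo`) is synthetic and made up for illustration, and this setup is not numerically identical to `normalize=True`, so the score on the cars data would differ from the one above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumption, not the cars data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3)) * np.array([1.0, 50.0, 1000.0])
y_demo = X_demo @ np.array([2.0, 0.1, 0.01]) + rng.normal(size=100)

# Scale first, then fit the ridge regression on the scaled features
model = make_pipeline(StandardScaler(), Ridge(alpha=0.5))
model.fit(X_demo, y_demo)
r2 = model.score(X_demo, y_demo)  # R^2 on the data used for fitting
print(r2)
```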

## Handling missing data

We say a dataset has missing data when there is no reasonable value for a given feature in a particular row. This can occur for many reasons:

* There may be no observation recorded
* There may have been a transcription (data entry) error
* The data may have been corrupted

In [8]:
diabetes_df = pd.read_csv("assets/data/diabetes.csv")
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pregnancies  768 non-null    int64  
 1   glucose      768 non-null    int64  
 2   diastolic    768 non-null    int64  
 3   triceps      768 non-null    int64  
 4   insulin      768 non-null    int64  
 5   bmi          768 non-null    float64
 6   dpf          768 non-null    float64
 7   age          768 non-null    int64  
 8   diabetes     768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Hmm, for every feature `pandas` reports 768 non-null values, which suggests there is no missing data. However, missing data can be encoded in different ways, such as 0, '?', or -1.
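
One quick way to look for such placeholder values is to count them per column. The sketch below uses a made-up frame (`demo_df` is hypothetical) in which 0 stands for "not measured"; the same comparison could be applied to `diabetes_df`.

```python
import pandas as pd

# Hypothetical data in which 0 encodes a missing measurement
demo_df = pd.DataFrame({
    "glucose": [148, 0, 183, 89],
    "insulin": [0, 0, 94, 168],
    "age": [50, 31, 32, 21],
})

# isna() finds nothing, because the zeros are ordinary numbers to pandas
print(demo_df.isna().sum())

# Counting the suspicious placeholder directly reveals the hidden gaps
zero_counts = (demo_df == 0).sum()
print(zero_counts)
```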

In [9]:
diabetes_df.describe(include="all")

Unnamed: 0,pregnancies,glucose,diastolic,triceps,insulin,bmi,dpf,age,diabetes
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


Looking at the summary, we can see that the minimum of some features (`glucose`, `diastolic`, `triceps`, `insulin` and `bmi`) is 0, which is physiologically implausible. Let's replace these zeros with `np.nan`, the standard marker for missing values (`nan` means **n**ot **a** **n**umber), and check the info again.

In [12]:
# Columns in which a 0 actually encodes a missing measurement
zero_as_missing = ["glucose", "diastolic", "triceps", "insulin", "bmi"]
# Assign the result back rather than calling replace(..., inplace=True) on a single
# column, which newer pandas versions may not propagate to the DataFrame
diabetes_df[zero_as_missing] = diabetes_df[zero_as_missing].replace(0, np.nan)

In [13]:
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pregnancies  768 non-null    int64  
 1   glucose      763 non-null    float64
 2   diastolic    733 non-null    float64
 3   triceps      541 non-null    float64
 4   insulin      394 non-null    float64
 5   bmi          757 non-null    float64
 6   dpf          768 non-null    float64
 7   age          768 non-null    int64  
 8   diabetes     768 non-null    int64  
dtypes: float64(6), int64(3)
memory usage: 54.1 KB


One way of dealing with missing data is to drop all rows that contain any missing data. We can do this using the dataframe method `dropna()`.

In [14]:
df_missing_dropped = diabetes_df.dropna()
df_missing_dropped.shape

(392, 9)

After dropping every row with missing data we are left with **only** 392 rows, about half of the original dataset. Losing that much data is undesirable, so we need a more robust method.

Another way is to ***impute*** the missing data. Imputing means making educated guesses for the missing values. One common strategy for continuous features is to replace each missing value with the mean of the non-missing values in that feature.

We can use the `SimpleImputer` class from `sklearn.impute` like this:

```python
from sklearn.impute import SimpleImputer

imp = SimpleImputer(
    missing_values = np.nan, # how the missing values are encoded in the data
    strategy = "mean", # what strategy should we use to replace the missing data
)

imp.fit(X) # fit the imputer to our features, let it learn the means

X = imp.transform(X) # transform the features into their imputed version
```

Imputers are said to be **transformers** because they are objects that learn from the data (`fit`) and then change it into another version (`transform`).
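
To see what `fit` and `transform` each do, here is a minimal sketch on a tiny hand-made array (`X_demo` is hypothetical). After fitting, the learned column means are available in the `statistics_` attribute, and `transform` uses them to fill the gaps.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny made-up feature matrix with one missing value per column
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan]])

imp_demo = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_demo.fit(X_demo)                   # learns the mean of each column, ignoring NaN
print(imp_demo.statistics_)            # column means: 2.0 and 15.0

X_filled = imp_demo.transform(X_demo)  # replaces each NaN with its column mean
print(X_filled)
```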

In [20]:
from sklearn.impute import SimpleImputer

X = diabetes_df.drop(columns=['diabetes']).values
y = diabetes_df.diabetes.values.reshape(-1, 1)

imp = SimpleImputer(missing_values = np.nan, strategy = "mean")

imp.fit(X)

X_imputer = imp.transform(X)

## Pipelines

A pipeline is an object that combines transformers (like `SimpleImputer`) and a machine learning model (like `LogisticRegression`) into a single object. When fitted, it transforms our data with each transformer in turn and then fits the model to the transformed data; the fitted pipeline can then be used for prediction like any trained model.

```python
from sklearn.pipeline import Pipeline

# we create a list that will hold the transformers and the machine learning models that will be fit to the data
steps = [
    # each element of the list is a tuple of 2 elements, the first is our name for that step, the second is the object
    ("imputation", imp), # name of the step is "imputation", and we pass `imp` that imputes the missing data
    ("log_reg", log_reg) # name of the step is "log_reg", and we pass `log_reg` as our logistic regression model
]

pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
pipeline.score(X_test, y_test)
```

In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


X = diabetes_df.drop(columns=['diabetes']).values
y = diabetes_df.diabetes.values

log_reg = LogisticRegression(max_iter=1000)

steps = [
    ("imputation", imp),
    ("log_reg", log_reg)
]

pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
pipeline.score(X_test, y_test)

0.7402597402597403

***NOTE:*** Every step in the pipeline except the last **must** be a transformer, and the last step must be an estimator.
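
As a sanity check on what a pipeline does internally, the sketch below compares a pipeline against the same steps carried out by hand. The data is synthetic (`X_demo`, `y_demo` and the 10% missing rate are assumptions made for illustration); the point is only that both routes should produce the same predictions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic data with roughly 10% of the values knocked out
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan
X_tr, X_te, y_tr = X_demo[:150], X_demo[150:], y_demo[:150]

# Route 1: let the pipeline chain the steps
pipe = Pipeline([
    ("imputation", SimpleImputer(strategy="mean")),
    ("log_reg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
pipe_pred = pipe.predict(X_te)

# Route 2: the same steps by hand (fit on train, only transform on test)
imp_manual = SimpleImputer(strategy="mean")
clf_manual = LogisticRegression(max_iter=1000)
clf_manual.fit(imp_manual.fit_transform(X_tr), y_tr)
manual_pred = clf_manual.predict(imp_manual.transform(X_te))

print(np.array_equal(pipe_pred, manual_pred))
```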

## Centering and Scaling

In [36]:
red_wine_df = pd.read_csv("assets/data/red-wine.csv", sep=";")
red_wine_df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [37]:
red_wine_df.describe(include="all")

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,8.319637,0.527821,0.270976,2.538806,0.087467,15.874922,46.467792,0.996747,3.311113,0.658149,10.422983,5.636023
std,1.741096,0.17906,0.194801,1.409928,0.047065,10.460157,32.895324,0.001887,0.154386,0.169507,1.065668,0.807569
min,4.6,0.12,0.0,0.9,0.012,1.0,6.0,0.99007,2.74,0.33,8.4,3.0
25%,7.1,0.39,0.09,1.9,0.07,7.0,22.0,0.9956,3.21,0.55,9.5,5.0
50%,7.9,0.52,0.26,2.2,0.079,14.0,38.0,0.99675,3.31,0.62,10.2,6.0
75%,9.2,0.64,0.42,2.6,0.09,21.0,62.0,0.997835,3.4,0.73,11.1,6.0
max,15.9,1.58,1.0,15.5,0.611,72.0,289.0,1.00369,4.01,2.0,14.9,8.0


As we can see, the features have very different ranges. Density lies roughly in [0.99, 1.004] while total sulfur dioxide lies in [6, 289].

Some ML models, like KNN, use distances in their calculations, so features on larger scales may unduly influence the model. That's why we want our features to be on similar scales. To achieve this, we center and scale the data.
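
To get a feel for how strongly a large-scale feature can dominate, here is a small sketch that computes the Euclidean distance between the first two wines shown above, using only their density and total sulfur dioxide values.

```python
import numpy as np

# [density, total sulfur dioxide] for the first two rows of the table above
wine_a = np.array([0.9978, 34.0])
wine_b = np.array([0.9968, 67.0])

squared_diffs = (wine_a - wine_b) ** 2
distance = np.sqrt(squared_diffs.sum())

# Fraction of the squared distance contributed by each feature
share = squared_diffs / squared_diffs.sum()
print(distance)
print(share)
```

Almost all of the distance comes from total sulfur dioxide, so on unscaled data the density difference has practically no say in which neighbors KNN picks.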

There are different ways to center and scale the data:

* Standardization : Subtract the mean and divide by the standard deviation. This makes every feature centered around 0 with a variance of 1.
* Normalization : Subtract the minimum and divide by the range. This makes the minimum 0 and the maximum 1.
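
Both recipes are short enough to write out by hand. The sketch below applies them with NumPy to a made-up array (`X_demo` is hypothetical) and checks the results against `StandardScaler` and `MinMaxScaler` from `sklearn.preprocessing`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: two features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0],
                   [4.0, 900.0]])

# Standardization: subtract the column mean, divide by the column standard deviation
standardized = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Normalization: subtract the column minimum, divide by the column range
col_min, col_max = X_demo.min(axis=0), X_demo.max(axis=0)
normalized = (X_demo - col_min) / (col_max - col_min)

# Compare with the sklearn transformers
matches_standard = np.allclose(standardized, StandardScaler().fit_transform(X_demo))
matches_minmax = np.allclose(normalized, MinMaxScaler().fit_transform(X_demo))
print(matches_standard, matches_minmax)
```

Note that `np.std` defaults to the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.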

For standardization, we use `StandardScaler` from `sklearn.preprocessing`. `StandardScaler` is also a transformer that has `fit` and `transform` methods.

In [40]:
from sklearn.preprocessing import StandardScaler

X = red_wine_df.drop(columns=["quality"]).values
y = red_wine_df.quality.values


scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

In [44]:
pd.DataFrame(X_scaled, columns=red_wine_df.columns[:-1]).describe(include="all")

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
count,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0,1599.0
mean,3.554936e-16,1.733031e-16,-8.887339000000001e-17,-1.244227e-16,3.732682e-16,-6.221137e-17,4.4436690000000005e-17,-3.473172e-14,2.861723e-15,6.754377e-16,1.066481e-16
std,1.000313,1.000313,1.000313,1.000313,1.000313,1.000313,1.000313,1.000313,1.000313,1.000313,1.000313
min,-2.137045,-2.27828,-1.391472,-1.162696,-1.603945,-1.4225,-1.230584,-3.538731,-3.700401,-1.936507,-1.898919
25%,-0.7007187,-0.7699311,-0.9293181,-0.4532184,-0.371229,-0.8487156,-0.7440403,-0.6077557,-0.6551405,-0.6382196,-0.8663789
50%,-0.2410944,-0.04368911,-0.05636026,-0.240375,-0.1799455,-0.1793002,-0.2574968,0.001760083,-0.007212705,-0.2251281,-0.2093081
75%,0.5057952,0.6266881,0.7652471,0.04341614,0.05384542,0.4901152,0.4723184,0.5768249,0.5759223,0.4240158,0.6354971
max,4.355149,5.877976,3.743574,9.195681,11.12703,5.367284,7.375154,3.680055,4.528282,7.918677,4.202453


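The ranges are much closer now: every feature has a mean of approximately 0 and a standard deviation of approximately 1. (The std row shows 1.000313 rather than exactly 1 because `describe` reports the sample standard deviation, while `StandardScaler` divides by the population standard deviation.)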

Now, we can build a pipeline with the scaler and an estimator like KNN.

In [49]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

steps = [
    ("scaling", scaler),
    ("knn", KNeighborsClassifier())
]

pipeline = Pipeline(steps)

X = red_wine_df.drop(columns=["quality"]).values
y = red_wine_df.quality.values
y = y <= 5 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
pipeline.score(X_test, y_test)

0.746875

In [50]:
knn_unscaled = KNeighborsClassifier()
knn_unscaled.fit(X_train, y_train)
knn_unscaled.score(X_test, y_test)

0.646875

As we can see, KNN on the unscaled data scores lower (0.646875 versus 0.746875).

## Cross validation in a pipeline

When we applied grid search cross validation to a single estimator, we specified the parameter grid as follows:

```python
param_grid = {
    "n_neighbors": range(1, 50)
}
```

For pipelines it is the same, except that each hyperparameter name must take the form `{step_name_in_pipeline}__{hyperparameter_name}`. So here it will be `"knn__n_neighbors"`.

***NOTE:*** Between the step name and the hyperparameter name there are 2 underscores, not just one.
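
If you are unsure which names a pipeline accepts, one option is to list them with `get_params()`. The sketch below builds a stand-alone pipeline (`demo_pipeline` is a hypothetical name, with the same step names as above) and prints every parameter that uses the double-underscore form.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("scaling", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Every key containing "__" is a step-level hyperparameter usable in a grid
tunable = [name for name in demo_pipeline.get_params() if "__" in name]
print(tunable)

# The same names work with set_params
demo_pipeline.set_params(knn__n_neighbors=7)
print(demo_pipeline.named_steps["knn"].n_neighbors)
```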

>Why use the scaler in a pipeline instead of scaling once and then applying cross validation?

Because if we scale all the data at once and then use it in cross validation, the scaling statistics (mean and standard deviation) are computed using the test folds as well. This ***data leakage*** passes information about the test fold to the estimator, which can make the cross-validation scores overly optimistic. The scaler should be fitted on the training folds only, without any interaction with the test fold.
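
The difference can be made visible with a small sketch on synthetic data (`X_demo` is made up for illustration). A scaler fitted on the full data learns one mean, whereas a scaler refitted inside each cross-validation split, which is what a pipeline does, learns a slightly different mean per split because it never sees the held-out fold.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(50, 1))

# Scaling once up front: the mean is computed from every row,
# including rows that will later serve as test folds
full_mean = StandardScaler().fit(X_demo).mean_[0]

# Scaling inside cross validation: the scaler only sees the training folds
fold_means = []
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    fold_scaler = StandardScaler().fit(X_demo[train_idx])
    fold_means.append(fold_scaler.mean_[0])

print(full_mean)
print(fold_means)
```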

In [51]:
from sklearn.model_selection import GridSearchCV

parameters = {
    "knn__n_neighbors": range(1, 50)
}

pipeline_cv = GridSearchCV(pipeline, parameters, cv=5)

pipeline_cv.fit(X_train, y_train)

pipeline_cv.best_params_, pipeline_cv.best_score_

({'knn__n_neighbors': 1}, 0.7365042892156863)

In [54]:
pipeline.score(X_test, y_test)

0.746875
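
Note that the last cell scores `pipeline`, the pipeline fitted earlier with the default `n_neighbors`, which is why the result matches the earlier 0.746875. `GridSearchCV` fits clones of the estimator it is given, so the tuned model lives on the search object instead. A minimal sketch on synthetic data (`make_classification` and all `demo`/`search` names are assumptions for illustration) of how the tuned model could be evaluated:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, not the wine dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_pipeline = Pipeline([("scaling", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(demo_pipeline, {"knn__n_neighbors": range(1, 20)}, cv=5)
search.fit(X_tr, y_tr)

# The original pipeline object is untouched and keeps its default setting
print(demo_pipeline.get_params()["knn__n_neighbors"])

# With the default refit=True, the search refits the best pipeline on all
# training data; score() and predict() on the search object delegate to it
print(search.best_params_)
tuned_score = search.score(X_te, y_te)
print(tuned_score)
```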