# Using Scikit-Learn's `Pipeline()` class for filling missing data and encoding categorical data

This notebook extends the code in the "Putting It All Together" section with some more explanation of what's going on behind the scenes.

More specifically, it covers what happens inside the `Pipeline()` class as we use it to impute missing values and encode categorical values (turn them into numbers).

The main takeaways:
- Fill and encode (turn into numbers) your **categorical data** on the whole dataset, if necessary.
- Split your data (into train/test) and always keep your training & test data separate.
- Fill/transform the **numerical data** on the training and test sets separately.
- Using [`Pipeline()`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) helps to ensure all of this is done for you.

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

**Note:** Since this notebook is running on Colab, the data has been imported directly from GitHub. It is the same data used in the videos.

Data on GitHub: https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended-missing-data.csv

In [None]:
car_sales_missing = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [None]:
# Check missing values
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [None]:
# Drop the rows with no labels
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [None]:
# Split data into X & y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

Now that we've dropped the rows with no labels and split our data into `X` and `y`, let's create a [`Pipeline()`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) (or a few of them) to fill the rest of the missing values, encode them if necessary (turn them into numbers) and fit a model to them.

A `Pipeline()` in Scikit-Learn is a class which allows us to put multiple steps, such as filling data and then modelling it, together sequentially.
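To make "multiple steps together sequentially" concrete, here's a minimal sketch on made-up toy data. The arrays and the `toy_pipeline` name are assumptions for illustration only, and `LinearRegression()` is just a lightweight stand-in for the model we use later.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data (made up for illustration): one feature with a missing value
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0])

# Steps run in the order they're listed: fill missing values, then model
toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("regressor", LinearRegression()),
])
toy_pipeline.fit(X_toy, y_toy)

# Each step is stored as a (name, estimator) tuple
step_names = [name for name, _ in toy_pipeline.steps]
print(step_names)

# New data passes through the imputer before it reaches the regressor,
# so missing values at prediction time are handled too
toy_predictions = toy_pipeline.predict(np.array([[np.nan], [3.0]]))
print(toy_predictions.shape)
```

Calling `fit()` or `predict()` once on the pipeline runs every step in order, so there's no need to call the imputer and the model separately.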

More specifically, we'll go through the following steps:
1. Define categorical, door and numeric features.
2. Build transformer `Pipeline()`'s for imputing missing data and encoding data.
3. Combine our transformer `Pipeline()`'s with [`ColumnTransformer()`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html).
4. Build a `Pipeline()` to preprocess and model our data with the `ColumnTransformer()` and [`RandomForestRegressor()`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).
5. Split the data into train and test using `train_test_split()`.
6. Fit the preprocessing and modelling `Pipeline()` on the training data.
7. Score the preprocessing and modelling `Pipeline()` on the test data.


Let's start with steps 1. and 2.

We'll build the following:
* A categorical transformer to fill our categorical values with the value `'missing'` and then one-hot encode them.
* A door transformer to fill the door column missing values with the value `4`.
* A numeric transformer to fill the numeric column missing values with the mean of the rest of the column.

All of these will be done with the `Pipeline()` class.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer # this will help us fill missing values
from sklearn.preprocessing import OneHotEncoder # this will help us turn our categorical variables into numbers

# Define categorical columns
categorical_features = ["Make", "Colour"]
# Create categorical transformer (imputes missing values, then one hot encodes them)
categorical_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
  ('onehot', OneHotEncoder(handle_unknown='ignore'))                                         
])

# Define door feature
door_feature = ["Doors"]
# Create door transformer (fills all door missing values with 4)
door_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='constant', fill_value=4)),
])

# Define numeric features
numeric_features = ["Odometer (KM)"]
# Create a transformer for filling all missing numeric values with the mean
numeric_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='mean'))  
])
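To see what the categorical transformer produces, here's a small sketch on a made-up three-row DataFrame. The `toy_cars` data and the `toy_categorical_transformer` name are assumptions for illustration, not part of the car sales dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy categorical data (made up for illustration) with missing values
toy_cars = pd.DataFrame({"Make": ["Honda", np.nan, "BMW"],
                         "Colour": ["White", "Blue", np.nan]})

# Same two steps as the categorical transformer: fill, then one-hot encode
toy_categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# OneHotEncoder returns a sparse matrix by default, so convert it to view it
encoded_array = toy_categorical_transformer.fit_transform(toy_cars).toarray()

# The filled-in 'missing' value becomes its own category in each column
categories = toy_categorical_transformer.named_steps["onehot"].categories_
print(categories)
print(encoded_array.shape)
```

Each column ends up with three categories (its two real values plus `'missing'`), so the two input columns turn into six one-hot columns.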

Wonderful! Now that we've got a way to fill our missing values and turn them into numbers (one-hot encode the categorical variables), let's take care of step 3.

3. Combine our transformer `Pipeline()`'s with `ColumnTransformer()`.


In [None]:
from sklearn.compose import ColumnTransformer

# Create a column transformer which combines all of the other transformers 
# into one step
preprocessor = ColumnTransformer(
    transformers=[
      # (name, transformer_to_use, features_to_transform)
      ('categorical', categorical_transformer, categorical_features),
      ('door', door_transformer, door_feature),
      ('numerical', numeric_transformer, numeric_features)
])
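As a rough sketch of how a `ColumnTransformer()` routes columns, here's a toy version. The `toy` DataFrame, its `Notes` column and the `toy_preprocessor` name are made up for illustration. Columns that aren't listed in any transformer are dropped, because `remainder='drop'` is the default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up for illustration); "Notes" isn't listed in any transformer
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", np.nan, "Honda"],
    "Doors": [4.0, np.nan, 4.0, 5.0],
    "Odometer (KM)": [1000.0, 2000.0, np.nan, 3000.0],
    "Notes": ["a", "b", "c", "d"],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("categorical",
     Pipeline(steps=[
         ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
         ("onehot", OneHotEncoder(handle_unknown="ignore")),
     ]),
     ["Make"]),
    ("door", SimpleImputer(strategy="constant", fill_value=4), ["Doors"]),
    ("numerical", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

out = toy_preprocessor.fit_transform(toy)
# The result can be sparse or dense depending on its density, so normalise it
out = out.toarray() if hasattr(out, "toarray") else out

# Outputs are stacked in transformer order:
# 3 one-hot columns for Make, then Doors, then Odometer (KM)
print(out.shape)
print(out[2])
```

The third row had a missing `Make` and a missing `Odometer (KM)`, so it gets the `'missing'` category and the column mean (2000.0) respectively.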

Okay, we've got a `ColumnTransformer()` saved to `preprocessor`. Let's make another `Pipeline()` to combine it with a `RandomForestRegressor()` model and take care of step 4.

4. Build a `Pipeline()` to preprocess and model our data with the `ColumnTransformer()` and `RandomForestRegressor()`.

In [None]:
from sklearn.ensemble import RandomForestRegressor

# Create the preprocessing and modelling pipeline
model = Pipeline(steps=[('preprocessor', preprocessor), # this will fill our missing data and make sure it's all numbers
                        ('regressor', RandomForestRegressor())]) # this will model our data

Creating the `model` pipeline like we did is like saying to Scikit-Learn, "Hey, preprocess all of the data we pass you first (using `preprocessor`) and then model it with `RandomForestRegressor()`."

Let's now do steps 5 and 6.

5. Split the data into train and test using `train_test_split()`.
6. Fit the preprocessing and modelling `Pipeline()` on the training data.


In [None]:
from sklearn.model_selection import train_test_split

# Split data into train and test sets
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit the model on the training data 
# (note: when fit() is called with a Pipeline(), fit_transform() is used for transformers)
model.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('preprocessor',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('categorical',
                                                  Pipeline(memory=None,
                                                           steps=[('imputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value='missing',
                                                                                 missing_values=nan,
                                                                                 strategy='constant',
                                              

If it all worked, we should've seen a print out of the steps in the modelling `Pipeline()`.

We've got 1 step left and that's to evaluate our model pipeline on the test data.

7. Score the preprocessing and modelling `Pipeline()` on the test data.

In [None]:
# Score the model on the test data
# (note: when score() or predict() is called with a Pipeline(), transform() is used for transformers)
model.score(X_test, y_test)

0.22188417408787875

Nice! Our `Pipeline()` steps seem to have worked for the test dataset as well.

There are a few things going on behind the scenes here, the main one being how the `Pipeline()` class deals with data transformation.

## Pipeline behind the scenes

When filling **numerical data**, it's important **not** to use values from the test set to fill values in the training set. Since we're trying to predict on the test set, this would be like taking information from the future to fill in the past.

Let's look at an example.

In our case, the `Odometer (KM)` column is missing values. We could fill every value in the column (before splitting it into train and test) with the `mean()`. But this would result in using information from the test set to fill the training set (because we fill the whole column before the split).

Instead, we split the data into train and test sets first (still with missing values). Then we calculate the `mean()` of the `Odometer (KM)` column on the training set and use it to fill the **training set** missing values *as well as* the **test set** missing values.

Now you might be asking, how does this happen?

Well, behind the scenes, `Pipeline()` calls a couple of methods:
1. `fit_transform()` - in our case, this computes the `mean()` of the `Odometer (KM)` column on the **training data** and then uses it to fill the missing values in that column. It also stores the `mean()` in memory.
2. `transform()` - uses the saved `mean()` of the `Odometer (KM)` column and transforms the **test** values.

Wait, wait, wait. This is confusing... how does the `Pipeline()` know what the training and test data are? We never told it?

You're right.

The magic trick is:
* `fit_transform()` is only ever used when calling `fit()` on your `Pipeline()` (in our case, when we used `model.fit(X_train, y_train)`).
* `transform()` is only ever used when calling `score()` or `predict()` on your `Pipeline()` (in our case, `model.score(X_test, y_test)`).

![what's happening with Pipeline](https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/images/sklearn-whats-happening-with-pipeline.png)

This means that when the fill values for our missing **numerical values** get calculated (using `fit_transform()`), the calculation only uses the training data (as long as you only pass `X_train` and `y_train` to `model.fit()`).

And since they get saved in memory, when we call `model.score(X_test, y_test)` and subsequently `transform()`, the test data gets preprocessed with information from the training set (using the past to try and predict the future, not the other way round).
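Here's a minimal sketch of those two calls using a `SimpleImputer()` on its own. The odometer numbers are made up for illustration. The mean learned from the training values is stored on the imputer (in its `statistics_` attribute) and reused for the test values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy odometer readings (made up for illustration)
train_odometer = np.array([[10_000.0], [20_000.0], [np.nan], [30_000.0]])
test_odometer = np.array([[np.nan], [90_000.0]])

imputer = SimpleImputer(strategy="mean")

# fit_transform(): learn the mean from the training values and fill them
train_filled = imputer.fit_transform(train_odometer)

# transform(): reuse the stored training mean, nothing is re-learned
test_filled = imputer.transform(test_odometer)

print(imputer.statistics_)  # mean of the non-missing training values
print(test_filled[0, 0])    # the missing test value, filled with that mean
```

The missing test value is filled with the training mean (20000.0), even though the only known test value is 90000.0. The test set never influences what gets filled in.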

### What about categorical values?

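Since categorical values usually don't depend on each other, it's okay to fill them across sets and examples. In our case, we fill them with the constant value `'missing'`, which isn't calculated from the data, so no information leaks from the test set.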
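One related detail: inside a `Pipeline()`, the one-hot encoder is still only fitted on the training data, so the test set might contain a category it has never seen. The sketch below (with made-up makes) shows what `handle_unknown='ignore'` does in that situation.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up for illustration): "Nissan" only appears in the test set
train_makes = pd.DataFrame({"Make": ["Honda", "BMW", "Toyota"]})
test_makes = pd.DataFrame({"Make": ["Honda", "Nissan"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_makes)

# Categories come from the training data only
print(encoder.categories_)

# The unseen category is encoded as all zeros instead of raising an error
test_encoded = encoder.transform(test_makes).toarray()
print(test_encoded)
```

Without `handle_unknown='ignore'`, the default behaviour is to raise an error when `transform()` meets an unknown category.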

Okay, knowing all this, let's cross-validate our model pipeline using [`cross_val_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html).

In [None]:
from sklearn.model_selection import cross_val_score

# Cross-validate our pipeline model
cross_val_score(model, X, y).mean()

0.22152985494597335

Since our `model` is an instance of `Pipeline()`, the same steps as we discussed above happen here with the `cross_val_score()`.
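As a rough illustration on made-up data (the `X_toy`, `y_toy` and `toy_model` names are assumptions, and `LinearRegression()` is a stand-in to keep it quick), `cross_val_score()` fits a fresh copy of the pipeline on each training fold and scores it on the matching held-out fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy regression data (made up for illustration)
rng = np.random.RandomState(42)
X_toy = rng.uniform(0, 100, size=(50, 1))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 1, size=50)
X_toy[::10, 0] = np.nan  # knock out every 10th feature value

toy_model = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("regressor", LinearRegression()),
])

# One score per fold; for each fold the imputer's mean is learned
# from that fold's training portion only
scores = cross_val_score(toy_model, X_toy, y_toy, cv=5)
print(scores)
print(scores.mean())

# cross_val_score() works on clones, so toy_model itself stays unfitted
print(hasattr(toy_model.named_steps["imputer"], "statistics_"))
```

Because the whole pipeline is cross-validated rather than just the model, the imputation is redone for every split.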

## Next steps

If you'd like to dig deeper into what's happening here, I'd suggest the following reading resources and next steps.
* **Reading:** [Scikit-Learn Pipeline() documentation](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).
* **Reading:** [Imputing missing values before building an estimator](https://scikit-learn.org/stable/auto_examples/impute/plot_missing_values.html) (compares different methods of imputing values).
* **Practice:** Try [tuning model hyperparameters with a `Pipeline()` and `GridSearchCV()`](https://scikit-learn.org/stable/modules/grid_search.html#composite-estimators-and-parameter-spaces).
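As a starting point for the practice item, here's a hedged sketch of tuning a pipeline with `GridSearchCV()` on made-up data. The toy data, the `toy_model` name and the grid values are assumptions for illustration. Parameters of a step are addressed as `<step name>__<parameter name>`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy regression data (made up for illustration)
rng = np.random.RandomState(42)
X_toy = rng.uniform(0, 100, size=(60, 1))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(0, 1, size=60)
X_toy[::10, 0] = np.nan  # knock out every 10th feature value

toy_model = Pipeline(steps=[
    ("imputer", SimpleImputer()),
    ("regressor", RandomForestRegressor(random_state=42)),
])

# Keys use the step name, two underscores, then the parameter name
param_grid = {
    "imputer__strategy": ["mean", "median"],
    "regressor__n_estimators": [10, 50],
}

grid = GridSearchCV(toy_model, param_grid, cv=3)
grid.fit(X_toy, y_toy)
print(grid.best_params_)
```

For the nested `model` pipeline in this notebook, the same naming chains through each level, so the numeric imputer's strategy would be `preprocessor__numerical__imputer__strategy`.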

How about all in 1 cell?

(the code below runs the same as the code above, just in 1 cell)

In [None]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Import data from GitHub
car_sales_missing = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended-missing-data.csv")

# Drop the rows with no labels
car_sales_missing.dropna(subset=["Price"], inplace=True)

# Split data into X (data) & y (labels)
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

# Define categorical columns
categorical_features = ["Make", "Colour"]
# Create categorical transformer (imputes missing values, then one hot encodes them)
categorical_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
  ('onehot', OneHotEncoder(handle_unknown='ignore'))                                         
])

# Define door feature
door_feature = ["Doors"]
# Create door transformer (fills all door missing values with 4)
door_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='constant', fill_value=4)),
])

# Define numeric features
numeric_features = ["Odometer (KM)"]
# Create a transformer for filling all missing numeric values with the mean
numeric_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='mean'))  
])

# Create a column transformer which combines all of the other transformers 
# into one step
preprocessor = ColumnTransformer(
    transformers=[
      ('categorical', categorical_transformer, categorical_features),
      ('door', door_transformer, door_feature),
      ('numerical', numeric_transformer, numeric_features)
])

# Create the model pipeline
model = Pipeline(steps=[('preprocessor', preprocessor), # this will fill our missing data and make sure it's all numbers
                        ('regressor', RandomForestRegressor())]) # this will model our data

# Split data into train and test sets
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit the model on the training data 
# (note: when fit() is called with a Pipeline(), fit_transform() is used for transformers)
model.fit(X_train, y_train)

# Score the model on the test data
# (note: when score() or predict() is called with a Pipeline(), transform() is used for transformers)
model.score(X_test, y_test)


0.22188417408787875