# Scikit-Learn Pipelines for Data Preprocessing with Python

This notebook is a code-along for the following YouTube video: https://www.youtube.com/watch?v=2_7vRKawvEU&t=657

The video is by Nicholas Renotte.

## Authenticate to Kaggle

In [1]:
# import sys

In [2]:
# !pip install kaggle

In [3]:
# !pip list

In [4]:
# !kaggle competitions download -c house-prices-advanced-regression-techniques

Note: Before you can download the competition dataset from Kaggle, make sure you have completed the steps below:

1. Install the `kaggle` Python package
2. Sign up for a Kaggle account
3. Create an API token in Kaggle (a `kaggle.json` file will be downloaded)
4. Place the `kaggle.json` file in the `.kaggle` folder in your home directory
5. Join the Kaggle competition
6. Download the dataset
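
Step 4 is the one that most often goes wrong, so here is a minimal sketch for checking it before running the download. It assumes the default token location of `.kaggle/kaggle.json` under your home directory; the helper names `kaggle_token_path` and `token_is_present` are made up for this example and are not part of the `kaggle` package.

```python
from pathlib import Path


def kaggle_token_path(home: Path) -> Path:
    """Return the assumed default location of the Kaggle API token."""
    return Path(home) / ".kaggle" / "kaggle.json"


def token_is_present(home: Path) -> bool:
    """Return True if a kaggle.json file exists under the given home directory."""
    return kaggle_token_path(home).is_file()


# Check the current user's home directory.
print("Token found:", token_is_present(Path.home()))
```

If this prints `False`, move the downloaded `kaggle.json` into place before running the download command.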

Reminder: Make sure you comment out the code after you finish downloading!

In [5]:
# !dir

Now, extract the archive file downloaded from Kaggle and place it in the data folder:

In [6]:
# import zipfile

# with zipfile.ZipFile("house-prices-advanced-regression-techniques.zip") as f:
#     f.extractall(path="./data/")    

## Now, you're all set!

# Pipeline Practice

First, load up our dataset into pandas.

In [7]:
import pandas as pd

train_df = pd.read_csv("./data/train.csv", index_col="Id")

test_df = pd.read_csv("./data/test.csv", index_col="Id")

Let's take a look at our data:

In [8]:
train_df.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


In [9]:
test_df.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,Gar2,12500,6,2010,WD,Normal
1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,MnPrv,,0,3,2010,WD,Normal
1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,6,2010,WD,Normal
1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,Inside,...,144,0,,,,0,1,2010,WD,Normal


In [10]:
print(f"There are {train_df.shape[0]} rows and {train_df.shape[1]} columns in our training dataset.")
print(f"There are {test_df.shape[0]} rows and {test_df.shape[1]} columns in our test dataset.")

There are 1460 rows and 80 columns in our training dataset.
There are 1459 rows and 79 columns in our test dataset.


In [11]:
print(f"Here are our columns:\n{train_df.columns}")

Here are our columns:
Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
  

In [12]:
test_df.columns

Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'Wo

In [13]:
# Check for NaN
train_df.isna().values.sum(), test_df.isna().values.sum()

(7829, 7878)
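
The totals above do not show which columns are responsible. As a minimal sketch, the per-column breakdown below uses a small made-up frame (`toy_df`) in place of `train_df`; the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (values are made up for illustration).
toy_df = pd.DataFrame(
    {
        "LotFrontage": [65.0, np.nan, 68.0, np.nan],
        "Alley": [np.nan, np.nan, np.nan, "Grvl"],
        "LotArea": [8450, 9600, 11250, 9550],
    }
)

# Count missing values per column and keep only the columns that have any.
nan_counts = toy_df.isna().sum()
nan_counts = nan_counts[nan_counts > 0].sort_values(ascending=False)
print(nan_counts)

# The grand total matches toy_df.isna().values.sum().
total_nan = int(nan_counts.sum())
print(total_nan)
```

Running the same two lines on `train_df` and `test_df` shows which columns would lose rows under `dropna()`.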

In [14]:
# Check for duplicates
train_df.duplicated().sum(), test_df.duplicated().sum()

(0, 0)

In [15]:
train_df.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


In [16]:
# Select the columns to be used for the machine learning model, remove the rows with NaN and save it in a new dataframe
select_df = train_df[["MSSubClass", 
                "MSZoning", 
                "LotFrontage", 
                "LotArea", 
                "Street", 
                "LotShape", 
                "LandContour", 
                "Utilities", 
                "MiscVal", 
                "MoSold", 
                "YrSold", 
                "SaleType", 
                "SaleCondition", 
                "SalePrice"]].dropna()

In [17]:
# Select the columns to be used for the test dataset to make predictions, remove the rows with NaN and save it in a new dataframe
select_test_df = test_df[["MSSubClass", 
                "MSZoning", 
                "LotFrontage", 
                "LotArea", 
                "Street", 
                "LotShape", 
                "LandContour", 
                "Utilities", 
                "MiscVal", 
                "MoSold", 
                "YrSold", 
                "SaleType", 
                "SaleCondition"]].dropna()

Import our necessary sklearn classes:

In [18]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

In [19]:
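# Split the data into features (X) and labels (y)
X_train = pd.get_dummies(select_df.drop("SalePrice", axis=1)) # use get_dummies to one-hot encode the categorical features
y_train = select_df.SalePrice

# Encode the test features, then align them to the training columns so that both
# frames have the same dummy columns in the same order (missing ones are filled with 0)
X_test = pd.get_dummies(select_test_df).reindex(columns=X_train.columns, fill_value=0)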
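
One caveat with calling `pd.get_dummies` separately on the training and test frames: if a category appears in only one of them, the two results end up with different columns, and a model fitted on one cannot predict on the other. The sketch below reproduces the problem on made-up toy frames and shows one possible way to align the columns with `reindex`.

```python
import pandas as pd

# Toy frames (made up): the test frame lacks the "NoSeWa" category seen in training.
toy_train = pd.DataFrame(
    {"LotArea": [8450, 9600, 11250], "Utilities": ["AllPub", "NoSeWa", "AllPub"]}
)
toy_test = pd.DataFrame(
    {"LotArea": [11622, 14267], "Utilities": ["AllPub", "AllPub"]}
)

X_toy_train = pd.get_dummies(toy_train)
X_toy_test = pd.get_dummies(toy_test)
print(list(X_toy_train.columns))  # ['LotArea', 'Utilities_AllPub', 'Utilities_NoSeWa']
print(list(X_toy_test.columns))   # ['LotArea', 'Utilities_AllPub']

# Align the test columns to the training columns; dummies missing from the
# test frame are added and filled with 0.
X_toy_test_aligned = X_toy_test.reindex(columns=X_toy_train.columns, fill_value=0)
print(list(X_toy_test_aligned.columns))
```

Encoding inside the pipeline with `OneHotEncoder` avoids this mismatch, because the categories are learned once during `fit`.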

In [20]:
X_train.head()

Id,MSSubClass,LotFrontage,LotArea,MiscVal,MoSold,YrSold,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
1,60,65.0,8450,0,2,2008,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False
2,20,80.0,9600,0,5,2007,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False
3,60,68.0,11250,0,9,2008,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False
4,70,60.0,9550,0,2,2006,False,False,False,True,...,False,False,False,True,True,False,False,False,False,False
5,60,84.0,14260,0,12,2008,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False


In [21]:
pipeline = make_pipeline(StandardScaler(), RandomForestRegressor())

In [22]:
import numpy as np

np.random.seed(42)

pipeline.fit(X_train, y_train)

In [23]:
pipeline.predict(X_train)

array([205017.9 , 164175.64, 220616.  , ..., 237176.5 , 149410.75,
       151174.5 ])

In [24]:
# Use the test data to make predictions and save the predicted values in a new column in the existing test dataframe
select_test_df["SalePrice"] = pipeline.predict(X_test)

In [25]:
y_preds = select_test_df.SalePrice

In [26]:
select_test_df.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1461,20,RH,80.0,11622,Pave,Reg,Lvl,AllPub,0,6,2010,WD,Normal,189820.92
1462,20,RL,81.0,14267,Pave,IR1,Lvl,AllPub,12500,6,2010,WD,Normal,260743.26
1463,60,RL,74.0,13830,Pave,IR1,Lvl,AllPub,0,3,2010,WD,Normal,296599.86
1464,60,RL,78.0,9978,Pave,IR1,Lvl,AllPub,0,6,2010,WD,Normal,229946.29
1465,120,RL,43.0,5005,Pave,IR1,HLS,AllPub,0,1,2010,WD,Normal,235052.4


In [27]:
# save the predicted values as a csv file for submission
select_test_df["SalePrice"].to_csv("./data/submission.csv")

# Save the Pipeline

In [28]:
import pickle

In [29]:
with open("pipelinemodel.pkl", "wb") as f:
    pickle.dump(pipeline, f)

# Reload the Pipeline

In [30]:
with open("pipelinemodel.pkl", "rb") as f:
    reloaded_model = pickle.load(f)

In [31]:
reloaded_model
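
A quick way to confirm that a reloaded pipeline behaves like the original is to compare their predictions. The sketch below uses synthetic data and an in-memory round trip (`pickle.dumps` / `pickle.loads`) instead of a file; the names `demo_pipeline` and `restored` are placeholders for this example.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)

demo_pipeline = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=10, random_state=0)
)
demo_pipeline.fit(X_demo, y_demo)

# Serialize to bytes and back instead of touching the filesystem.
restored = pickle.loads(pickle.dumps(demo_pipeline))

# The restored pipeline should reproduce the original predictions.
same_predictions = np.allclose(demo_pipeline.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickled models are generally only safe to reload with the same scikit-learn version that saved them, and pickle files from untrusted sources should not be loaded.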

In [32]:
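# Access an individual step of the pipeline (index 1 is the RandomForestRegressor).
# Note: calling the regressor directly skips the StandardScaler step, so these
# predictions differ from pipeline.predict(X_train).
reloaded_model.steps[1][1].predict(X_train)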



array([362675. , 362675. , 381962.5, ..., 362675. , 362675. , 362675. ])
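
The nearly constant output above is most likely because `steps[1][1]` is the fitted `RandomForestRegressor` on its own: it was trained on scaled features, but here it receives the unscaled `X_train`. The sketch below illustrates the difference on synthetic data. The variable names are placeholders, and the pipeline slicing (`demo_pipeline[:-1]`) assumes a reasonably recent scikit-learn version.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Features on a large scale so that skipping the scaler is clearly visible.
X_demo = rng.normal(loc=10_000, scale=2_000, size=(60, 2))
y_demo = X_demo[:, 0] * 0.5 + rng.normal(size=60)

demo_pipeline = make_pipeline(
    StandardScaler(), RandomForestRegressor(n_estimators=10, random_state=0)
)
demo_pipeline.fit(X_demo, y_demo)

# Steps can be fetched by name as well as by position.
forest = demo_pipeline.named_steps["randomforestregressor"]

# Consistent: run every step before the model, then predict.
X_scaled = demo_pipeline[:-1].transform(X_demo)
manual = forest.predict(X_scaled)

# Inconsistent: raw features go straight to a model trained on scaled inputs.
bypassed = forest.predict(X_demo)

matches_pipeline = np.allclose(manual, demo_pipeline.predict(X_demo))
bypass_matches = np.allclose(bypassed, demo_pipeline.predict(X_demo))
print(matches_pipeline, bypass_matches)
```

In practice, calling `predict` on the whole pipeline is the safer default; accessing a single step is mainly useful for inspection, such as reading `feature_importances_`.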

# Why do we use `make_pipeline` and not `Pipeline`?

In [33]:
from sklearn.pipeline import Pipeline

In [34]:
# with the Pipeline class
custom_pipeline = Pipeline([("scaling", StandardScaler()), ("rfmodel", RandomForestRegressor())])

In [35]:
custom_pipeline

With `Pipeline`, you assign the name of each step manually.

In [36]:
# with the make_pipeline function
make_pipeline_model = make_pipeline(StandardScaler(), RandomForestRegressor())

In [37]:
make_pipeline_model

With `make_pipeline`, the steps are named automatically, using the lowercased class name of each estimator.
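
To make the naming difference concrete, the short sketch below builds both variants and lists their step names. The variables `named` and `auto` are placeholders for this example.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

named = Pipeline([("scaling", StandardScaler()), ("rfmodel", RandomForestRegressor())])
auto = make_pipeline(StandardScaler(), RandomForestRegressor())

named_step_names = [name for name, _ in named.steps]
auto_step_names = [name for name, _ in auto.steps]
print(named_step_names)  # ['scaling', 'rfmodel']
print(auto_step_names)   # ['standardscaler', 'randomforestregressor']

# Step names matter when addressing hyperparameters as <step>__<param>.
auto.set_params(randomforestregressor__n_estimators=50)
print(auto.named_steps["randomforestregressor"].n_estimators)  # 50
```

Explicit names are handy when the same estimator class appears more than once, or when shorter keys make a parameter grid easier to read.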

# Column Transformer Practice

## Create the preprocessing pipelines for both numeric and categorical data

In [38]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [39]:
# Numeric features
numeric_features = select_df.drop("SalePrice", axis=1).select_dtypes(exclude="object").columns
numeric_features

Index(['MSSubClass', 'LotFrontage', 'LotArea', 'MiscVal', 'MoSold', 'YrSold'], dtype='object')

In [40]:
# Create a pipeline for the numeric features and apply StandardScaler()
numeric_pipeline = Pipeline([("scaler", StandardScaler())])

In [41]:
# Categorical features
categorical_features = select_df.select_dtypes("object").columns
categorical_features

Index(['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
       'SaleType', 'SaleCondition'],
      dtype='object')

In [42]:
# Create a pipeline for the categorical features and apply OneHotEncoder()
categorical_pipeline = Pipeline([("onehot", OneHotEncoder())])

In [43]:
transformer = ColumnTransformer([("numeric_preprocessing", numeric_pipeline, numeric_features),
                                ("categorical_preprocessing", categorical_pipeline, categorical_features)])

In [44]:
transformer
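
As a hedged variant of the transformer above, the sketch below adds two things the original notebook does not use: `SimpleImputer`, so rows with missing values need not be dropped, and `OneHotEncoder(handle_unknown="ignore")`, so categories unseen during fitting do not raise an error at prediction time. The toy frames and the names `numeric_steps`, `categorical_steps`, and `toy_transformer` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (made up): a numeric gap in training, an unseen category at predict time.
toy_train = pd.DataFrame(
    {"LotFrontage": [65.0, np.nan, 68.0, 60.0], "MSZoning": ["RL", "RL", "RM", "FV"]}
)
toy_new = pd.DataFrame(
    {"LotFrontage": [np.nan, 80.0], "MSZoning": ["RH", "RL"]}  # "RH" was never seen
)

numeric_steps = Pipeline(
    [("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
categorical_steps = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

toy_transformer = ColumnTransformer(
    [
        ("numeric_preprocessing", numeric_steps, ["LotFrontage"]),
        ("categorical_preprocessing", categorical_steps, ["MSZoning"]),
    ]
)

toy_transformer.fit(toy_train)
transformed = toy_transformer.transform(toy_new)
# The result may be sparse or dense depending on its density, so normalize it.
transformed = transformed.toarray() if hasattr(transformed, "toarray") else np.asarray(transformed)
print(transformed.shape)  # one scaled numeric column plus three one-hot columns
```

The unseen category `"RH"` is encoded as all zeros in the one-hot columns, and the missing `LotFrontage` is filled with the training median before scaling.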

In [45]:
ml_pipeline = Pipeline([("all_column_preprocessing", transformer), ("randforestregressor", RandomForestRegressor())])

In [46]:
ml_pipeline

In [47]:
X = select_df.drop("SalePrice", axis=1)
y = select_df.SalePrice

In [48]:
ml_pipeline.fit(X, y)

In [49]:
ml_pipeline.predict(X)

array([203135.        , 170325.17714286, 221294.4       , ...,
       228590.25      , 143965.        , 154168.5       ])

In [50]:
with open("columntransformermodel.pkl", "wb") as f:
    pickle.dump(ml_pipeline, f)

In [51]:
with open("columntransformermodel.pkl", "rb") as f:
    reloaded_ml_pipeline = pickle.load(f)

In [52]:
reloaded_ml_pipeline