# Scikit-Learn Pipelines for Data Preprocessing with Python

This notebook is a code-along for a YouTube video by Nicholas Renotte: https://www.youtube.com/watch?v=2_7vRKawvEU&t=657

## Authenticate to Kaggle

In [27]:
# import sys

In [28]:
# !pip install kaggle

In [29]:
# !pip list

In [30]:
# !kaggle competitions download -c house-prices-advanced-regression-techniques

Note: Before you can download the competition dataset from Kaggle, complete the steps below:

1. Install the kaggle Python package
2. Sign up for a Kaggle account
3. Create an API token in Kaggle (a kaggle.json file will be downloaded)
4. Place the kaggle.json file in the .kaggle folder in your home directory
5. Join the Kaggle competition
6. Download the dataset

Reminder: Make sure you comment out the code after you finish downloading!
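
Before running the download cell, it can help to confirm that the API token is where the Kaggle client looks for it. The sketch below is only an illustration: the helper name `kaggle_token_path` is made up for this notebook, and it assumes the default location (`.kaggle/kaggle.json` under your home directory).

```python
from pathlib import Path


def kaggle_token_path(home=None):
    """Return the assumed default location of kaggle.json (hypothetical helper)."""
    base = Path(home) if home is not None else Path.home()
    return base / ".kaggle" / "kaggle.json"


token = kaggle_token_path()
if token.exists():
    print(f"Found API token at {token}")
else:
    print(f"No API token at {token}; create one in your Kaggle account settings.")
```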

In [31]:
# !dir

Now, extract the archive file downloaded from Kaggle and place it in the data folder:

In [32]:
# import zipfile

# with zipfile.ZipFile("house-prices-advanced-regression-techniques.zip") as f:
#     f.extractall(path="./data/")    

## Now, you're all set!

# Pipeline Practice

First, load the dataset into a pandas DataFrame.

In [33]:
import pandas as pd

df = pd.read_csv("./data/train.csv")

Let's take a look at our data:

In [34]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [35]:
df.tail()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125
1459,1460,20,RL,75.0,9937,Pave,,Reg,Lvl,AllPub,...,0,,,,0,6,2008,WD,Normal,147500


In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

In [41]:
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in our dataset.")

There are 1460 rows and 81 columns in our dataset.


In [42]:
print(f"Here are our columns:\n{df.columns}")

Here are our columns:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'Ga

In [43]:
df.describe()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
count,1460.0,1460.0,1201.0,1460.0,1460.0,1460.0,1460.0,1460.0,1452.0,1460.0,...,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0,1460.0
mean,730.5,56.89726,70.049958,10516.828082,6.099315,5.575342,1971.267808,1984.865753,103.685262,443.639726,...,94.244521,46.660274,21.95411,3.409589,15.060959,2.758904,43.489041,6.321918,2007.815753,180921.19589
std,421.610009,42.300571,24.284752,9981.264932,1.382997,1.112799,30.202904,20.645407,181.066207,456.098091,...,125.338794,66.256028,61.119149,29.317331,55.757415,40.177307,496.123024,2.703626,1.328095,79442.502883
min,1.0,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0,34900.0
25%,365.75,20.0,59.0,7553.5,5.0,5.0,1954.0,1967.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,2007.0,129975.0
50%,730.5,50.0,69.0,9478.5,6.0,5.0,1973.0,1994.0,0.0,383.5,...,0.0,25.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0,163000.0
75%,1095.25,70.0,80.0,11601.5,7.0,6.0,2000.0,2004.0,166.0,712.25,...,168.0,68.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0,214000.0
max,1460.0,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,...,857.0,547.0,552.0,508.0,480.0,738.0,15500.0,12.0,2010.0,755000.0


In [57]:
# Count the total number of missing (NaN) values across all columns
df.isna().values.sum()

7829

In [61]:
# Check for duplicates
df.duplicated().sum()

0

In [63]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [65]:
# Select the columns for the model, drop rows that contain NaN, and store the result in a new DataFrame
select_df = df[["MSSubClass", 
                "MSZoning", 
                "LotFrontage", 
                "LotArea", 
                "Street", 
                "LotShape", 
                "LandContour", 
                "Utilities", 
                "MiscVal", 
                "MoSold", 
                "YrSold", 
                "SaleType", 
                "SaleCondition", 
                "SalePrice"]].dropna()

Import the scikit-learn tools we need:

In [62]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

In [66]:
# Split the data into features (X) and target (y)
X = pd.get_dummies(select_df.drop("SalePrice", axis=1)) # one-hot encode the categorical columns with get_dummies
y = select_df.SalePrice

In [67]:
X.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,MiscVal,MoSold,YrSold,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,60,65.0,8450,0,2,2008,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False
1,20,80.0,9600,0,5,2007,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False
2,60,68.0,11250,0,9,2008,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False
3,70,60.0,9550,0,2,2006,False,False,False,True,...,False,False,False,True,True,False,False,False,False,False
4,60,84.0,14260,0,12,2008,False,False,False,True,...,False,False,False,True,False,False,False,False,True,False
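
One thing to keep in mind with `pd.get_dummies` is that it only creates columns for the categories present in the frame it is given. The sketch below uses made-up toy frames (not the Kaggle data) to show how new rows can end up with fewer columns than the training data, and one possible way to realign them with `reindex`.

```python
import pandas as pd

# Hypothetical toy data: a "training" frame and some new rows to predict on.
train = pd.DataFrame({"MSZoning": ["RL", "RM", "FV"]})
new = pd.DataFrame({"MSZoning": ["RL", "RL"]})

train_encoded = pd.get_dummies(train)
new_encoded = pd.get_dummies(new)

print(list(train_encoded.columns))  # one column per category seen in train
print(list(new_encoded.columns))    # only the categories seen in new

# Realign the new rows to the training columns, filling absent categories with 0.
new_aligned = new_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(list(new_aligned.columns))
```

A model fitted on `train_encoded` expects all three columns, so the unaligned `new_encoded` frame would not match its input.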


In [68]:
pipeline = make_pipeline(StandardScaler(), RandomForestRegressor())

In [69]:
pipeline.fit(X, y)

In [70]:
pipeline.predict(X)

array([199280.  , 167462.  , 220837.  , ..., 220002.  , 142382.75,
       156176.5 ])

# Save the Pipeline

In [71]:
import pickle

In [73]:
with open("pipelinemodel.pkl", "wb") as f:
    pickle.dump(pipeline, f)

# Reload the Pipeline

In [74]:
with open("pipelinemodel.pkl", "rb") as f:
    reloaded_model = pickle.load(f)

In [75]:
reloaded_model
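
A quick way to check that a pickled pipeline survives the round trip is to compare predictions before and after reloading. The sketch below does this in memory with `pickle.dumps` / `pickle.loads` on synthetic data; the variable names and the data are assumptions for illustration, not part of the notebook's dataset.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (made up for this example).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=50)

demo_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)
demo_pipeline.fit(X_demo, y_demo)

# Serialize to bytes and restore, mirroring the save/reload cells above.
restored = pickle.loads(pickle.dumps(demo_pipeline))

same = np.allclose(demo_pipeline.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickle files should only be loaded from sources you trust, and it is safest to reload them with the same scikit-learn version that created them.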

In [78]:
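# Access an individual step: steps[1][1] is the fitted RandomForestRegressor.
# Calling it directly skips the StandardScaler, so these predictions differ from pipeline.predict(X).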
reloaded_model.steps[1][1].predict(X)



array([352484., 352484., 406624., ..., 352484., 352484., 352484.])
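
The predictions above differ from `pipeline.predict(X)` because the regressor was trained on scaled features but is being handed unscaled ones. The sketch below, on synthetic data with assumed variable names, shows that applying the scaler step first and then the regressor reproduces what the full pipeline does.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with a large offset so that scaling matters.
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=100.0, scale=20.0, size=(60, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(size=60)

pipe = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=10, random_state=0),
)
pipe.fit(X_demo, y_demo)

# Look steps up by their auto-generated names.
scaler = pipe.named_steps["standardscaler"]
forest = pipe.named_steps["randomforestregressor"]

full = pipe.predict(X_demo)                           # scaler + forest
manual = forest.predict(scaler.transform(X_demo))     # same two steps by hand
raw = forest.predict(X_demo)                          # forest alone, scaler skipped

matches = np.allclose(full, manual)
print(matches)
print(raw[:3], full[:3])
```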

# Why do we use `make_pipeline` and not `Pipeline`?

In [79]:
from sklearn.pipeline import Pipeline

In [80]:
# with the Pipeline class
custom_pipeline = Pipeline([("scaling", StandardScaler()), ("rfmodel", RandomForestRegressor())])

In [82]:
custom_pipeline

With `Pipeline`, you assign the name of each step yourself.

In [81]:
# with the make_pipeline function
make_pipeline_model = make_pipeline(StandardScaler(), RandomForestRegressor())

In [83]:
make_pipeline_model

With `make_pipeline`, the steps are named for you automatically, using the lowercased class name of each estimator.
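
To see the difference in step names concretely, the short sketch below builds both kinds of pipeline and lists the names stored in `.steps`. The variable names are chosen for this example only.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

named = Pipeline([("scaling", StandardScaler()), ("rfmodel", RandomForestRegressor())])
auto = make_pipeline(StandardScaler(), RandomForestRegressor())

named_names = [name for name, _ in named.steps]
auto_names = [name for name, _ in auto.steps]

print(named_names)  # ['scaling', 'rfmodel']
print(auto_names)   # ['standardscaler', 'randomforestregressor']

# Step names are also the prefix for setting parameters on a step.
named.set_params(rfmodel__n_estimators=50)
print(named.named_steps["rfmodel"].n_estimators)  # 50
```

Explicit names can be handy when you later refer to a step by name, for example in `set_params` as shown here.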

# Column Transformer Practice

## Create the preprocessing pipelines for both numeric and categorical data

In [87]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


In [99]:
# Numeric features
numeric_features = select_df.drop("SalePrice", axis=1).select_dtypes(exclude="object").columns
numeric_features

Index(['MSSubClass', 'LotFrontage', 'LotArea', 'MiscVal', 'MoSold', 'YrSold'], dtype='object')

In [100]:
# Create a pipeline for the numeric features that applies StandardScaler()
numeric_pipeline = Pipeline([("scaler", StandardScaler())])

In [101]:
# Categorical features
categorical_features = select_df.select_dtypes("object").columns
categorical_features

Index(['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
       'SaleType', 'SaleCondition'],
      dtype='object')

In [102]:
# Create a pipeline for the categorical features and apply OneHotEncoder()
categorical_pipeline = Pipeline([("onehot", OneHotEncoder())])

In [103]:
transformer = ColumnTransformer([("numeric_preprocessing", numeric_pipeline, numeric_features),
                                ("categorical_preprocessing", categorical_pipeline, categorical_features)])

In [104]:
transformer
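
To get a feel for what the `ColumnTransformer` produces, the sketch below applies the same kind of setup to a tiny made-up frame. The toy rows are illustrative only, and `handle_unknown="ignore"` is an optional setting added here (not used in the notebook) so that categories unseen during fitting are encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two numeric columns and one categorical column (made-up rows).
toy = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250, 9550],
        "YrSold": [2008, 2007, 2008, 2006],
        "MSZoning": ["RL", "RM", "RL", "FV"],
    }
)

toy_transformer = ColumnTransformer(
    [
        ("numeric_preprocessing", Pipeline([("scaler", StandardScaler())]), ["LotArea", "YrSold"]),
        (
            "categorical_preprocessing",
            Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))]),
            ["MSZoning"],
        ),
    ]
)

transformed = toy_transformer.fit_transform(toy)

# 2 scaled numeric columns + 3 one-hot columns (FV, RL, RM) = 5 columns.
print(transformed.shape)  # (4, 5)
```

Each branch only sees the columns listed for it, and the results are concatenated side by side in the order the branches are declared.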

In [108]:
ml_pipeline = Pipeline([("all_column_preprocessing", transformer), ("randforestregressor", RandomForestRegressor())])

In [109]:
ml_pipeline

In [110]:
X = select_df.drop("SalePrice", axis=1)
y = select_df.SalePrice

In [111]:
ml_pipeline.fit(X, y)

In [112]:
ml_pipeline.predict(X)

array([204712.  , 165983.  , 218544.22, ..., 221465.5 , 144502.  ,
       155659.  ])

In [113]:
with open("columntransformermodel.pkl", "wb") as f:
    pickle.dump(ml_pipeline, f)

In [115]:
with open("columntransformermodel.pkl", "rb") as f:
    reloaded_ml_pipeline = pickle.load(f)

In [116]:
reloaded_ml_pipeline