# <u>Data Science Essentials</u>

## <u>Topic</u>: Streamlining preprocessing using Scikit-learn Pipelines

## <u>Category</u>: Data Preprocessing

### <u>Created By</u>: Mohammed Misbahullah Sheriff
- [LinkedIn](https://www.linkedin.com/in/mohammed-misbahullah-sheriff/)
- [GitHub](https://github.com/MisbahullahSheriff)

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd

import sklearn

from sklearn.preprocessing import (
    StandardScaler,
    OneHotEncoder,
)

from sklearn.impute import SimpleImputer

from sklearn.pipeline import Pipeline

In [None]:
sklearn.set_config(transform_output="pandas")

## Getting the Data

In [None]:
path = "/content/car-details.csv"

df = pd.read_csv(path)
print("Data Shape:", df.shape)
df.head()

Data Shape: (6926, 16)


Unnamed: 0,name,company,model,edition,year,owner,fuel,seller_type,transmission,km_driven,mileage_mpg,engine_cc,max_power_bhp,torque_nm,seats,selling_price
0,Maruti Swift Dzire VDI,Maruti,Swift,Dzire VDI,2014,First,Diesel,Individual,Manual,145500,55.0,1248.0,74.0,190.0,5.0,450000
1,Skoda Rapid 1.5 TDI Ambition,Skoda,Rapid,1.5 TDI Ambition,2014,Second,Diesel,Individual,Manual,120000,49.7,1498.0,103.52,250.0,5.0,370000
2,Honda City 2017-2020 EXi,Honda,City,2017-2020 EXi,2006,Third,Petrol,Individual,Manual,140000,41.6,1497.0,78.0,124.544455,5.0,158000
3,Hyundai i20 Sportz Diesel,Hyundai,i20,Sportz Diesel,2010,First,Diesel,Individual,Manual,127000,54.06,1396.0,90.0,219.66896,5.0,225000
4,Maruti Swift VXI BSIII,Maruti,Swift,VXI BSIII,2007,First,Petrol,Individual,Manual,120000,37.84,1298.0,88.2,112.776475,5.0,130000


# Demo 1 - Handling Numeric Columns

### Plan of Action:

- Impute all columns with their respective `mean` values
- Perform `standardization` to bring all the columns to a similar scale
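
To make the two planned steps concrete, here is a minimal sketch on a hypothetical one-column toy array (the values are illustrative and not taken from the car dataset). It shows what mean imputation and standardization each compute; `np.asarray` is used only so the sketch behaves the same whether or not pandas output is enabled.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column with one missing value (not from the car dataset)
toy = np.array([[1.0], [2.0], [np.nan], [3.0]])

# Mean imputation: the NaN is replaced by the mean of the observed values
toy_imputed = np.asarray(SimpleImputer(strategy="mean").fit_transform(toy))

# Standardization: subtract the column mean, then divide by the column
# standard deviation, so the result has mean 0 and standard deviation 1
toy_scaled = np.asarray(StandardScaler().fit_transform(toy_imputed))

print(toy_imputed.ravel())
print(toy_scaled.mean(), toy_scaled.std())
```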

In [None]:
X_num = df.select_dtypes(include="number")
X_num.head()

Unnamed: 0,year,km_driven,mileage_mpg,engine_cc,max_power_bhp,torque_nm,seats,selling_price
0,2014,145500,55.0,1248.0,74.0,190.0,5.0,450000
1,2014,120000,49.7,1498.0,103.52,250.0,5.0,370000
2,2006,140000,41.6,1497.0,78.0,124.544455,5.0,158000
3,2010,127000,54.06,1396.0,90.0,219.66896,5.0,225000
4,2007,120000,37.84,1298.0,88.2,112.776475,5.0,130000


### Without Pipeline

#### 1. Imputing the missing values

In [None]:
X_num.isna().sum()

year               0
km_driven          0
mileage_mpg      208
engine_cc        208
max_power_bhp    209
torque_nm        209
seats            208
selling_price      0
dtype: int64

In [None]:
X_num.describe().loc[["min", "max"], :]

Unnamed: 0,year,km_driven,mileage_mpg,engine_cc,max_power_bhp,torque_nm,seats,selling_price
min,1983.0,1.0,0.0,624.0,32.8,47.07192,2.0,29999.0
max,2020.0,2360457.0,98.7,3604.0,400.0,1863.2635,14.0,10000000.0


In [None]:
imputer = SimpleImputer(strategy="mean")

X_num_imputed = imputer.fit_transform(X_num)

#### 2. Scaling the features

In [None]:
scaler = StandardScaler()

X_num_scaled = scaler.fit_transform(X_num_imputed)

In [None]:
X_num_scaled.isna().sum()

year             0
km_driven        0
mileage_mpg      0
engine_cc        0
max_power_bhp    0
torque_nm        0
seats            0
selling_price    0
dtype: int64

In [None]:
X_num_scaled.describe().loc[["min", "max"], :]

Unnamed: 0,year,km_driven,mileage_mpg,engine_cc,max_power_bhp,torque_nm,seats,selling_price
min,-7.459627,-1.268033,-4.280849,-1.660303,-1.759484,-1.212547,-3.543562,-0.937549
max,1.613466,39.182677,5.091654,4.471506,9.994715,16.044495,8.836943,18.24551


- The features were imputed with their respective mean values
- The features were then scaled using standardization
- The resulting dataset has no missing values, and every column now has zero mean and unit variance
- The downside is the amount of intermediate code: each step had to be fitted and applied separately, in order, with its output passed to the next step by hand
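
The cost of the manual approach grows when new rows arrive, because every fitted object has to be reapplied in the same order. The following is a rough sketch of that bookkeeping, assuming hypothetical toy arrays `X_fit` and `X_new` that stand in for the real data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for X_num (values are illustrative)
X_fit = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_new = np.array([[np.nan, 30.0]])

imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()

# Fit each step in order, feeding the output of one into the next
X_fit_scaled = scaler.fit_transform(imputer.fit_transform(X_fit))

# New rows must pass through the same fitted objects, in the same order
X_new_scaled = np.asarray(scaler.transform(imputer.transform(X_new)))
print(X_new_scaled)
```

Forgetting a step, or applying the steps in a different order, would silently produce differently scaled values.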

### With Pipeline

In [None]:
X_num.isna().sum()

year               0
km_driven          0
mileage_mpg      208
engine_cc        208
max_power_bhp    209
torque_nm        209
seats            208
selling_price      0
dtype: int64

In [None]:
X_num.describe().loc[["min", "max"], :]

Unnamed: 0,year,km_driven,mileage_mpg,engine_cc,max_power_bhp,torque_nm,seats,selling_price
min,1983.0,1.0,0.0,624.0,32.8,47.07192,2.0,29999.0
max,2020.0,2360457.0,98.7,3604.0,400.0,1863.2635,14.0,10000000.0


In [None]:
imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()

pipe = Pipeline(steps=[("imputer", imputer),
                       ("scaler", scaler)])

X_preprocessed = pipe.fit_transform(X_num)

In [None]:
X_preprocessed.head()

Unnamed: 0,year,km_driven,mileage_mpg,engine_cc,max_power_bhp,torque_nm,seats,selling_price
0,0.142153,1.225357,0.941924,-0.376327,-0.440658,0.145524,-0.448435,-0.129434
1,0.142153,0.788368,0.438639,0.138086,0.504288,0.71563,-0.448435,-0.28336
2,-1.819597,1.131104,-0.330533,0.136029,-0.312616,-0.47642,-0.448435,-0.691265
3,-0.838722,0.908326,0.852662,-0.071794,0.071508,0.427431,-0.448435,-0.562352
4,-1.574378,0.788368,-0.687581,-0.273444,0.013889,-0.588237,-0.448435,-0.745139


In [None]:
X_preprocessed.isna().sum()

year             0
km_driven        0
mileage_mpg      0
engine_cc        0
max_power_bhp    0
torque_nm        0
seats            0
selling_price    0
dtype: int64

In [None]:
X_preprocessed.describe().loc[["min", "max"], :]

Unnamed: 0,year,km_driven,mileage_mpg,engine_cc,max_power_bhp,torque_nm,seats,selling_price
min,-7.459627,-1.268033,-4.280849,-1.660303,-1.759484,-1.212547,-3.543562,-0.937549
max,1.613466,39.182677,5.091654,4.471506,9.994715,16.044495,8.836943,18.24551


- The features were all imputed and scaled
- All the preprocessing steps were performed with a single `fit_transform` call on the pipeline
- Under the hood, the steps still ran sequentially: the imputer's output was passed to the scaler
- The `min` and `max` values match those of the manual approach, so the pipeline reproduces the same result with less code and no intermediate variables
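
The equivalence between the two approaches can also be checked directly. This is a small sketch on a hypothetical toy array `X_toy` (not the car data); it compares the manual two-step result with the pipeline result and then reads a fitted step back out of the pipeline by name.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with missing values in both columns
X_toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 40.0]])

# Manual approach: impute first, then scale the imputed result
manual = StandardScaler().fit_transform(
    SimpleImputer(strategy="mean").fit_transform(X_toy)
)

# Pipeline approach: the same two steps, chained
pipe = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")),
                       ("scaler", StandardScaler())])
piped = pipe.fit_transform(X_toy)

# Both routes should give the same numbers
same = np.allclose(np.asarray(manual), np.asarray(piped))
print(same)

# Fitted steps stay accessible by the names given in `steps`
print(pipe.named_steps["imputer"].statistics_)
```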

# Demo 2 - Handling Categorical Columns

### Plan of Action:

- Impute all columns with their respective `mode` values
- Perform `one-hot encoding` to encode all the categories to numbers

In [None]:
X_obj = df.select_dtypes(exclude="number")
X_obj.head()

Unnamed: 0,name,company,model,edition,owner,fuel,seller_type,transmission
0,Maruti Swift Dzire VDI,Maruti,Swift,Dzire VDI,First,Diesel,Individual,Manual
1,Skoda Rapid 1.5 TDI Ambition,Skoda,Rapid,1.5 TDI Ambition,Second,Diesel,Individual,Manual
2,Honda City 2017-2020 EXi,Honda,City,2017-2020 EXi,Third,Petrol,Individual,Manual
3,Hyundai i20 Sportz Diesel,Hyundai,i20,Sportz Diesel,First,Diesel,Individual,Manual
4,Maruti Swift VXI BSIII,Maruti,Swift,VXI BSIII,First,Petrol,Individual,Manual


In [None]:
imputer = SimpleImputer(strategy="most_frequent")
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

# Name the second step after what it does: it is an encoder, not a scaler
pipe = Pipeline(steps=[("imputer", imputer),
                       ("encoder", encoder)])

X_preprocessed = pipe.fit_transform(X_obj)

X_preprocessed.head()

Unnamed: 0,name_Ambassador CLASSIC 1500 DSL AC,name_Ambassador Classic 2000 DSZ AC PS,name_Ambassador Grand 1500 DSZ BSIII,name_Ambassador Grand 2000 DSZ PW CL,name_Ashok Leyland Stile LE,name_Audi A3 35 TDI Premium Plus,name_Audi A3 40 TFSI Premium,name_Audi A4 1.8 TFSI,name_Audi A4 2.0 TDI,name_Audi A4 2.0 TDI 177 Bhp Premium Plus,...,owner_Third,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,seller_type_Dealer,seller_type_Individual,seller_type_Trustmark Dealer,transmission_Automatic,transmission_Manual
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
