# <u>Data Science Essentials</u>

## <u>Topic</u>: Streamlining preprocessing using Scikit-learn Column Transformer

## <u>Category</u>: Data Preprocessing

### <u>Created By</u>: Mohammed Misbahullah Sheriff
- [LinkedIn](https://www.linkedin.com/in/mohammed-misbahullah-sheriff/)
- [GitHub](https://github.com/MisbahullahSheriff)

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd

import sklearn

from sklearn.preprocessing import (
    StandardScaler,
    OneHotEncoder,
)

from sklearn.impute import SimpleImputer

from sklearn.pipeline import Pipeline

from sklearn.compose import ColumnTransformer

In [None]:
sklearn.set_config(transform_output="pandas")

## Getting the Data

In [None]:
path = "/content/car-details.csv"

df = pd.read_csv(path)
print("Data Shape:", df.shape)
df.head()

Data Shape: (6926, 16)


| | name | company | model | edition | year | owner | fuel | seller_type | transmission | km_driven | mileage_mpg | engine_cc | max_power_bhp | torque_nm | seats | selling_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maruti Swift Dzire VDI | Maruti | Swift | Dzire VDI | 2014 | First | Diesel | Individual | Manual | 145500 | 55.0 | 1248.0 | 74.0 | 190.0 | 5.0 | 450000 |
| 1 | Skoda Rapid 1.5 TDI Ambition | Skoda | Rapid | 1.5 TDI Ambition | 2014 | Second | Diesel | Individual | Manual | 120000 | 49.7 | 1498.0 | 103.52 | 250.0 | 5.0 | 370000 |
| 2 | Honda City 2017-2020 EXi | Honda | City | 2017-2020 EXi | 2006 | Third | Petrol | Individual | Manual | 140000 | 41.6 | 1497.0 | 78.0 | 124.544455 | 5.0 | 158000 |
| 3 | Hyundai i20 Sportz Diesel | Hyundai | i20 | Sportz Diesel | 2010 | First | Diesel | Individual | Manual | 127000 | 54.06 | 1396.0 | 90.0 | 219.66896 | 5.0 | 225000 |
| 4 | Maruti Swift VXI BSIII | Maruti | Swift | VXI BSIII | 2007 | First | Petrol | Individual | Manual | 120000 | 37.84 | 1298.0 | 88.2 | 112.776475 | 5.0 | 130000 |


## Demo

### Plan of Action:

- Impute and scale the numeric features using pipelines
- Impute and encode the categorical features using pipelines
- Perform the above transformations on the dataset in a single step using Column Transformer

In [None]:
# numeric pipeline

imputer = SimpleImputer(strategy="mean")
scaler = StandardScaler()

num_pipe = Pipeline(steps=[("imputer", imputer),
                           ("scaler", scaler)])

In [None]:
# categorical pipeline

imputer = SimpleImputer(strategy="most_frequent")
# dense output so the encoded columns can be returned as a pandas DataFrame
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

obj_pipe = Pipeline(steps=[("imputer", imputer),
                           ("encoder", encoder)])

In [None]:
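# separating the columns by dtype
# note: num_cols also picks up the target column selling_price,
# so it gets imputed and scaled together with the numeric features

num_cols = df.select_dtypes(include="number").columns
obj_cols = df.select_dtypes(exclude="number").columns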

In [None]:
# column transformer

ct = ColumnTransformer(transformers=[("num", num_pipe, num_cols),
                                     ("obj", obj_pipe, obj_cols)])

ct.fit_transform(df).head()

| | num__year | num__km_driven | num__mileage_mpg | num__engine_cc | num__max_power_bhp | num__torque_nm | num__seats | num__selling_price | obj__name_Ambassador CLASSIC 1500 DSL AC | obj__name_Ambassador Classic 2000 DSZ AC PS | ... | obj__owner_Third | obj__fuel_CNG | obj__fuel_Diesel | obj__fuel_LPG | obj__fuel_Petrol | obj__seller_type_Dealer | obj__seller_type_Individual | obj__seller_type_Trustmark Dealer | obj__transmission_Automatic | obj__transmission_Manual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.142153 | 1.225357 | 0.941924 | -0.376327 | -0.440658 | 0.145524 | -0.448435 | -0.129434 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.142153 | 0.788368 | 0.438639 | 0.138086 | 0.504288 | 0.71563 | -0.448435 | -0.28336 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 2 | -1.819597 | 1.131104 | -0.330533 | 0.136029 | -0.312616 | -0.47642 | -0.448435 | -0.691265 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 3 | -0.838722 | 0.908326 | 0.852662 | -0.071794 | 0.071508 | 0.427431 | -0.448435 | -0.562352 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | -1.574378 | 0.788368 | -0.687581 | -0.273444 | 0.013889 | -0.588237 | -0.448435 | -0.745139 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |


- A single `fit_transform` call carried out all the preprocessing steps for both the numeric and the categorical features
- Each pipeline is applied only to its own columns; in the output, the transformed numeric columns come first, followed by the encoded categorical columns, matching the order of the `transformers` list
- Thus, a Column Transformer can apply a different preprocessing pipeline to each subset of features within one object
- This makes it possible to preprocess the whole dataset in one go