# Column Transformer
ColumnTransformer is a preprocessing tool in scikit-learn that allows you to apply different transformers to different columns of a dataset. It is commonly used to transform heterogeneous data or to apply different preprocessing steps to different subsets of features.

For example, a dataset may mix numerical and categorical columns that each need different preprocessing. With ColumnTransformer you list each transformer together with the columns it applies to, and the transformed columns are then concatenated into a single output array.
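The idea can be sketched on a tiny, made-up DataFrame. The column names `height_cm` and `color` and the choice of `StandardScaler` are assumptions for illustration only; they are not part of the dataset used later in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numerical and one categorical column.
toy = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "color": ["red", "blue", "red", "green"],
})

ct = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), ["height_cm"]),  # numerical column
        ("onehot", OneHotEncoder(), ["color"]),      # categorical column
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

toy_out = ct.fit_transform(toy)
print(toy_out.shape)  # 1 scaled column + 3 one-hot columns
```

Each entry in `transformers` is a `(name, transformer, columns)` tuple, which is the same pattern used with the real data further down.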

# Why is this used?
The advantage of using a ColumnTransformer is that it allows you to specify a preprocessing pipeline for each subset of features in a concise and modular way. This makes it easy to experiment with different preprocessing strategies and to modify your pipeline as your data or modeling needs change.

Overall, ColumnTransformer simplifies preprocessing code and, because every step is fitted once on the training data and then reused on new data, helps keep training and test preprocessing consistent.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("covid_toy.csv")

In [3]:
df.head()

Unnamed: 0,age,gender,fever,cough,city,has_covid
0,60,Male,103.0,Mild,Kolkata,No
1,27,Male,100.0,Mild,Delhi,Yes
2,42,Male,101.0,Mild,Delhi,No
3,31,Female,98.0,Mild,Kolkata,No
4,65,Female,101.0,Mild,Mumbai,No


In [4]:
# nominal categorical data
df["gender"].value_counts() # 2 unique values

Female    59
Male      41
Name: gender, dtype: int64

In [5]:
# ordinal categorical data
df["cough"].value_counts() # 2 unique values

Mild      62
Strong    38
Name: cough, dtype: int64

In [6]:
# nominal categorical data
df["city"].value_counts() # 4 unique values

Kolkata      32
Bangalore    30
Delhi        22
Mumbai       16
Name: city, dtype: int64

In [7]:
# nominal categorical data (target column)
df["has_covid"].value_counts() # 2 unique values

No     55
Yes    45
Name: has_covid, dtype: int64

# Without column transformer
Here is what we would have to do without ColumnTransformer: transform each group of columns separately and then concatenate the results again.

In [8]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
import warnings
warnings.filterwarnings("ignore")

In [9]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.drop(columns=['has_covid']),df['has_covid'],
                                                test_size=0.2)

In [10]:
X_train

Unnamed: 0,age,gender,fever,cough,city
64,42,Male,104.0,Mild,Mumbai
9,64,Female,101.0,Mild,Delhi
27,33,Female,102.0,Strong,Delhi
14,51,Male,104.0,Mild,Bangalore
43,22,Female,99.0,Mild,Bangalore
...,...,...,...,...,...
70,68,Female,101.0,Strong,Delhi
68,54,Female,104.0,Strong,Kolkata
49,44,Male,104.0,Mild,Mumbai
39,50,Female,103.0,Mild,Kolkata


In [11]:
# impute missing values in the fever column (SimpleImputer defaults to the mean)
si = SimpleImputer()
X_train_fever = si.fit_transform(X_train[['fever']])

# test data: reuse the statistics learned from the training data, do not refit
X_test_fever = si.transform(X_test[['fever']])

X_train_fever.shape

(80, 1)

In [12]:
# OrdinalEncoding -> cough (Mild=0, Strong=1)
oe = OrdinalEncoder(categories=[['Mild','Strong']])
X_train_cough = oe.fit_transform(X_train[['cough']])

# test data: reuse the fitted encoder, do not refit
X_test_cough = oe.transform(X_test[['cough']])

X_train_cough.shape

(80, 1)

In [13]:
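# OneHotEncoding -> gender,city
# sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
ohe = OneHotEncoder(drop='first',sparse_output=False)
X_train_gender_city = ohe.fit_transform(X_train[['gender','city']])

# test data: reuse the categories learned from the training data, do not refit
X_test_gender_city = ohe.transform(X_test[['gender','city']])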

X_train_gender_city.shape

(80, 4)

In [14]:
# Extracting Age
X_train_age = X_train.drop(columns=['gender','fever','cough','city']).values

# also the test data
X_test_age = X_test.drop(columns=['gender','fever','cough','city']).values

X_train_age.shape

(80, 1)

In [15]:
X_train_transformed = np.concatenate((X_train_age,X_train_fever,X_train_gender_city,X_train_cough),axis=1)
# also the test data
X_test_transformed = np.concatenate((X_test_age,X_test_fever,X_test_gender_city,X_test_cough),axis=1)

X_train_transformed.shape

(80, 7)
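One detail of the manual approach is worth illustrating: an encoder should learn its categories from the training split only and then be reused on the test split. The sketch below uses hypothetical splits (not the notebook's data) in which the test split happens to contain a single city, to show how refitting on the test data changes the number of output columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical splits: the test split happens to contain only one city.
train_city = pd.DataFrame({"city": ["Delhi", "Mumbai", "Kolkata", "Delhi"]})
test_city = pd.DataFrame({"city": ["Mumbai", "Mumbai"]})

# Fit on the training split, then reuse the learned categories on the test split.
enc = OneHotEncoder()
train_cols = enc.fit_transform(train_city).shape[1]
test_cols = enc.transform(test_city).shape[1]

# Refitting on the test split learns a different (smaller) category set.
refit_cols = OneHotEncoder().fit_transform(test_city).shape[1]

print(train_cols, test_cols, refit_cols)  # 3 3 1
```

With the refit encoder, the test matrix no longer lines up column for column with the training matrix, so a model trained on one cannot be applied to the other.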

# With column transformer

In [16]:
from sklearn.compose import ColumnTransformer

In [17]:
transformer = ColumnTransformer(transformers=[
    ('tnf1',SimpleImputer(),['fever']),
    ('tnf2',OrdinalEncoder(categories=[['Mild','Strong']]),['cough']),
    # sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
    ('tnf3',OneHotEncoder(sparse_output=False,drop='first'),['gender','city'])
],remainder='passthrough') # remaining columns (here: age) are passed through unchanged

In [18]:
transformer.fit_transform(X_train).shape

(80, 7)

In [19]:
transformer.transform(X_test).shape

(20, 7)