## Day 28 — ColumnTransformer (using `covid_toy.csv`)


In this lecture, we introduce **ColumnTransformer**, a key component in building
clean and scalable machine learning pipelines.

The dataset `covid_toy.csv` contains **mixed feature types**, which makes it a
perfect example to understand why ColumnTransformer is needed.

---

### Dataset Overview

**Features:**
- `age` → Numerical  
- `fever` → Numerical  
- `gender` → Categorical (Nominal)  
- `cough` → Categorical (Ordinal: Mild < Strong)  
- `city` → Categorical (Nominal)

**Target:**
- `has_covid` → Yes / No

---

### Why ColumnTransformer?

Different columns require different preprocessing steps:
- Numerical features may need **scaling**
- Categorical features need **encoding**

Without ColumnTransformer, we would have to:
- preprocess columns separately
- manually combine them
- risk data leakage and column mismatch

ColumnTransformer solves this by applying **column-wise transformations** side by side, each to its own subset of columns, and concatenating the results into a single feature matrix.

---

### Core Idea
> Apply different transformations to different columns in a single unified step.

---

### Defining Column Groups

```python
categorical_cols = ["gender", "cough", "city"]
numerical_cols = ["age", "fever"]
```

In [3]:
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

In [4]:
df = pd.read_csv("covid_toy.csv")

In [5]:
df

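|  | age | gender | fever | cough | city | has_covid |
|---|---|---|---|---|---|---|
| 0 | 60 | Male | 103.0 | Mild | Kolkata | No |
| 1 | 27 | Male | 100.0 | Mild | Delhi | Yes |
| 2 | 42 | Male | 101.0 | Mild | Delhi | No |
| 3 | 31 | Female | 98.0 | Mild | Kolkata | No |
| 4 | 65 | Female | 101.0 | Mild | Mumbai | No |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 12 | Female | 104.0 | Mild | Bangalore | No |
| 96 | 51 | Female | 101.0 | Strong | Kolkata | Yes |
| 97 | 20 | Female | 101.0 | Mild | Bangalore | No |
| 98 | 5 | Female | 98.0 | Strong | Mumbai | No |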


In [7]:
df["cough"].value_counts()

cough
Mild      62
Strong    38
Name: count, dtype: int64

In [8]:
df["city"].value_counts()

city
Kolkata      32
Bangalore    30
Delhi        22
Mumbai       16
Name: count, dtype: int64

In [11]:
# Categorical columns: gender, city, cough
#   nominal (no natural order): gender, city
#   ordinal (Mild < Strong): cough

In [12]:
df.isnull().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64

In [13]:
X = df.drop("has_covid", axis=1)
y = df["has_covid"]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [14]:
X_train

|  | age | gender | fever | cough | city |
|---|---|---|---|---|---|
| 55 | 81 | Female | 101.0 | Mild | Mumbai |
| 88 | 5 | Female | 100.0 | Mild | Kolkata |
| 26 | 19 | Female | 100.0 | Mild | Kolkata |
| 42 | 27 | Male | 100.0 | Mild | Delhi |
| 69 | 73 | Female | 103.0 | Mild | Delhi |
| ... | ... | ... | ... | ... | ... |
| 60 | 24 | Female | 102.0 | Strong | Bangalore |
| 71 | 75 | Female | 104.0 | Strong | Delhi |
| 14 | 51 | Male | 104.0 | Mild | Bangalore |
| 92 | 82 | Female | 102.0 | Strong | Kolkata |


## Without ColumnTransformer

### Imputing fever values

In [16]:
from sklearn.impute import SimpleImputer

si = SimpleImputer(strategy="mean")

# fit ONLY on training data
X_train_fever = si.fit_transform(X_train[['fever']])

# transform test data using SAME mean
X_test_fever = si.transform(X_test[['fever']])

In [18]:
X_train_fever.shape

(80, 1)
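
To make the "same mean" point concrete, here is a small sketch on made-up fever readings (the values below are assumptions, not rows from `covid_toy.csv`). It shows that the imputer stores the training mean in `statistics_` and reuses it when transforming the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical fever readings with missing values (assumption).
train = pd.DataFrame({"fever": [100.0, 102.0, np.nan, 104.0]})
test = pd.DataFrame({"fever": [np.nan, 99.0]})

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the mean from the training rows only
train_filled = imputer.fit_transform(train)

# transform fills test gaps with that stored training mean
test_filled = imputer.transform(test)

print(imputer.statistics_)   # [102.]
print(test_filled.ravel())   # [102.  99.]
```

Calling `fit_transform` on the test set instead would compute a new mean from test rows, which is exactly the leakage the split is meant to prevent.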

### Ordinal Encoding on cough column

In [19]:
from sklearn.preprocessing import OrdinalEncoder

In [20]:
cough_order = [["Mild", "Strong"]]
oe = OrdinalEncoder(categories=cough_order)

In [21]:
X_train_cough = oe.fit_transform(X_train[["cough"]])

In [24]:
X_train_cough[:6]

array([[0.],
       [0.],
       [0.],
       [0.],
       [0.],
       [1.]])

In [25]:
np.unique(X_train_cough)

array([0., 1.])

In [26]:
X_test_cough = oe.transform(X_test[['cough']])

In [27]:
X_train_cough.shape

(80, 1)
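
For `cough`, alphabetical order happens to match the intended order (Mild, then Strong), so the explicit `categories` argument may look redundant here. The sketch below uses a hypothetical three-level `severity` column (not part of the dataset) to illustrate why passing the order explicitly is the safer habit.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column (assumption, for illustration only).
levels = pd.DataFrame({"severity": ["Low", "High", "Medium"]})

# Default: categories are sorted alphabetically -> High, Low, Medium
default_enc = OrdinalEncoder()
default_codes = default_enc.fit_transform(levels).ravel()

# Explicit: codes follow the order we state -> Low, Medium, High
explicit_enc = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
explicit_codes = explicit_enc.fit_transform(levels).ravel()

print(default_codes)    # [1. 0. 2.]  (High=0, Low=1, Medium=2)
print(explicit_codes)   # [0. 2. 1.]  (Low=0, Medium=1, High=2)
```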

### One-Hot Encoding: gender, city

In [29]:
from sklearn.preprocessing import OneHotEncoder

In [30]:
# drop='first' removes one level per feature (Female, Bangalore) to avoid redundant columns
ohe = OneHotEncoder(drop='first', sparse_output=False)

In [33]:
X_train_cat = ohe.fit_transform(X_train[["gender","city"]])
X_test_cat = ohe.transform(X_test[["gender","city"]])

In [34]:
ohe.get_feature_names_out(["gender", "city"])

array(['gender_Male', 'city_Delhi', 'city_Kolkata', 'city_Mumbai'],
      dtype=object)

In [37]:
X_train_cat.shape

(80, 4)

### Concatenation

In [48]:
X_train_age = X_train[["age"]].values
X_test_age  = X_test[["age"]].values

In [49]:
X_train_final = np.hstack([
    X_train_age,
    X_train_fever,
    X_train_cough,
    X_train_cat
])

X_test_final = np.hstack([
    X_test_age,
    X_test_fever,
    X_test_cough,
    X_test_cat
])

In [50]:
X_train_final.shape

(80, 7)

In [51]:
columns = []

# numerical
columns.append("age")
columns.append("fever")

# ordinal
columns.append("cough")

# one-hot encoded
columns.extend(
    ohe.get_feature_names_out(["gender", "city"])
)

In [52]:
X_test_final_df = pd.DataFrame(
    X_test_final,
    columns=columns
)

X_test_final_df.head()

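|  | age | fever | cough | gender_Male | city_Delhi | city_Kolkata | city_Mumbai |
|---|---|---|---|---|---|---|---|
| 0 | 17.0 | 104.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 83.0 | 98.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | 68.0 | 101.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 72.0 | 99.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 20.0 | 102.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |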


In [53]:
X_train_final_df = pd.DataFrame(
    X_train_final,
    columns=columns
)

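X_train_final_df.head()

|  | age | fever | cough | gender_Male | city_Delhi | city_Kolkata | city_Mumbai |
|---|---|---|---|---|---|---|---|
| 0 | 81.0 | 101.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 5.0 | 100.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 19.0 | 100.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 27.0 | 100.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 4 | 73.0 | 103.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |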


In [54]:
print(X_train_final_df.shape)
print(X_test_final_df.shape)

(80, 7)
(20, 7)


## With ColumnTransformer

### Building with ColumnTransformer

In [72]:
transformer = ColumnTransformer(
    transformers=[
        ('tnf1', SimpleImputer(), ['fever']),
        ('tnf2', OrdinalEncoder(categories=[['Mild','Strong']]), ['cough']),
        ('tnf3', OneHotEncoder(sparse_output=False, drop='first'), ['gender','city'])
    ],
    remainder='passthrough'
)

In [77]:
X_train_transformed = transformer.fit_transform(X_train)

In [84]:
X_test_transformed = transformer.transform(X_test)

In [78]:
pd.DataFrame(X_train_transformed).head()

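|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 101.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 81.0 |
| 1 | 100.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 5.0 |
| 2 | 100.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 19.0 |
| 3 | 100.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 27.0 |
| 4 | 103.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 73.0 |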


In [79]:
feature_names = []

feature_names.append("fever")
feature_names.append("cough")

feature_names.extend(
    transformer.named_transformers_['tnf3']
    .get_feature_names_out(['gender','city'])
)

feature_names.append("age")

feature_names

['fever',
 'cough',
 'gender_Male',
 'city_Delhi',
 'city_Kolkata',
 'city_Mumbai',
 'age']

In [80]:
pd.DataFrame(X_train_transformed, columns=feature_names).head()

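|  | fever | cough | gender_Male | city_Delhi | city_Kolkata | city_Mumbai | age |
|---|---|---|---|---|---|---|---|
| 0 | 101.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 81.0 |
| 1 | 100.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 5.0 |
| 2 | 100.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 19.0 |
| 3 | 100.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 27.0 |
| 4 | 103.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 73.0 |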


In [85]:
pd.DataFrame(X_test_transformed, columns=feature_names).head()

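|  | fever | cough | gender_Male | city_Delhi | city_Kolkata | city_Mumbai | age |
|---|---|---|---|---|---|---|---|
| 0 | 104.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 17.0 |
| 1 | 98.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 83.0 |
| 2 | 101.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 68.0 |
| 3 | 99.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 72.0 |
| 4 | 102.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 20.0 |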


##  What We Learned Today — Day 28 (ColumnTransformer)

Today’s lecture focused on **ColumnTransformer**, a core tool for building
clean, safe, and scalable machine learning preprocessing pipelines.

### Key concepts covered
- Why **manual preprocessing + concatenation** is error-prone
- Importance of **train–test split before preprocessing**
- Handling **missing values** using `SimpleImputer`
- Difference between:
  - **Nominal categorical features** → OneHotEncoder
  - **Ordinal categorical features** → OrdinalEncoder
- Encoding:
  - `cough` as an **ordinal feature**
  - `gender`, `city` as **nominal features**
- Using `remainder='passthrough'` to keep untransformed columns like `age` (the default, `remainder='drop'`, discards them)

### ColumnTransformer in practice
- Applied **different transformers to different columns**
- Combined imputation, ordinal encoding, and one-hot encoding in one step
- Automatically concatenated all transformed features
- Ensured **no data leakage** by fitting only on training data
- Verified transformations using shapes and feature inspection

### Major takeaways
- All preprocessing must be **learned from training data only**
- ColumnTransformer replaces manual feature extraction and `np.hstack`
- It reduces the risk of missing columns and shape mismatches, and avoids leakage as long as it is fit on the training data only
- It is the foundation of **production-ready ML pipelines**

> ColumnTransformer turns messy, manual preprocessing into a single,
reliable, and reusable step.

---


