# Column Transformer - Feature Engineering Tutorial

This notebook demonstrates how to use scikit-learn's ColumnTransformer to efficiently handle different preprocessing steps for different columns in a dataset.

---

## Step 1: Import Required Libraries

**What we're doing:** Loading necessary libraries for data manipulation, visualization, and machine learning preprocessing.

---

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Step 2: Load and Explore the Dataset

**What we're doing:** Loading the COVID toy dataset from a CSV file and displaying a random sample of rows to understand the data structure.

**Findings:** The dataset contains features like age, gender, city, fever, cough, and a target variable (has_covid).

---

In [2]:
df = pd.read_csv("covid_toy.csv")
df.sample(5)

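| | age | gender | fever | cough | city | has_covid |
|---|---|---|---|---|---|---|
| 75 | 5 | Male | 102.0 | Mild | Kolkata | Yes |
| 17 | 40 | Female | 98.0 | Strong | Delhi | No |
| 9 | 64 | Female | 101.0 | Mild | Delhi | No |
| 31 | 83 | Male | 103.0 | Mild | Kolkata | No |
| 27 | 33 | Female | 102.0 | Strong | Delhi | No |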


## Step 3: Split Data into Train and Test Sets

**What we're doing:** Dividing the dataset into training (80%) and testing (20%) sets for model validation.

**Findings:** The training set has 80 rows and the test set has 20, each with 5 feature columns.

---

In [4]:
from sklearn.model_selection import train_test_split

# train_test_split returns X_train, X_test, y_train, y_test, in that order
X_train,X_test,Y_train,Y_test = train_test_split(df.drop(columns="has_covid"),df["has_covid"],test_size=0.2)

X_train.shape,X_test.shape

((80, 5), (20, 5))
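
The return order of `train_test_split` is easy to mix up, so here is a minimal sketch on a hypothetical 10-row frame (the `toy` data and the `random_state` value are assumptions for illustration, not part of the original notebook):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for covid_toy.csv
toy = pd.DataFrame({
    "age": [5, 40, 64, 83, 33, 65, 25, 82, 20, 38],
    "has_covid": ["Yes", "No", "No", "No", "No", "Yes", "Yes", "No", "Yes", "No"],
})

# Return order: X_train, X_test, y_train, y_test
X_tr, X_te, y_tr, y_te = train_test_split(
    toy.drop(columns="has_covid"),
    toy["has_covid"],
    test_size=0.2,     # 20% of the rows go to the test set
    random_state=42,   # fixed seed so the split is reproducible
)

print(X_tr.shape, X_te.shape)  # (8, 1) (2, 1)
```

Fixing `random_state` is optional; without it, each run produces a different split.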

## Step 4: Examine Training Data

**What we're doing:** Inspecting the training data to understand column types and data distribution.

**Findings:** The data contains mixed types: numerical (age, fever), ordinal categorical (cough), and nominal categorical (gender, city). The fever column has some missing values.

---

In [5]:
X_train

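| | age | gender | fever | cough | city |
|---|---|---|---|---|---|
| 81 | 65 | Male | 99.0 | Mild | Delhi |
| 86 | 25 | Male | 104.0 | Mild | Bangalore |
| 92 | 82 | Female | 102.0 | Strong | Kolkata |
| 7 | 20 | Female | NaN | Strong | Mumbai |
| 36 | 38 | Female | 101.0 | Mild | Bangalore |
| ... | ... | ... | ... | ... | ... |
| 78 | 11 | Male | 100.0 | Mild | Bangalore |
| 6 | 14 | Male | 101.0 | Strong | Bangalore |
| 94 | 79 | Male | NaN | Strong | Kolkata |
| 69 | 73 | Female | 103.0 | Mild | Delhi |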
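
Column types and missing-value counts can also be checked programmatically rather than by eye. The sketch below uses a hypothetical four-row frame (`toy`) that mirrors the tutorial's columns; it is not the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical rows mirroring the tutorial's columns (not the real covid_toy.csv)
toy = pd.DataFrame({
    "age": [65, 25, 82, 20],
    "gender": ["Male", "Male", "Female", "Female"],
    "fever": [99.0, 104.0, 102.0, np.nan],
    "cough": ["Mild", "Mild", "Strong", "Strong"],
    "city": ["Delhi", "Bangalore", "Kolkata", "Mumbai"],
})

print(toy.dtypes)            # storage type of each column
missing = toy.isna().sum()   # number of missing values per column
print(missing)

# Numeric columns can be selected by dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)          # ['age', 'fever']
```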


## Step 5: Handle Missing Values - Method 1

**What we're doing:** Using pandas `fillna()` with mean imputation to handle missing fever values.

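**Findings:** Missing values in the fever column of `df` are filled with the column mean. Because this runs on the full `df` after the split, the mean includes test rows, and the existing `X_train` and `X_test` are not modified.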

---

In [9]:
df["fever"] = df["fever"].fillna(df["fever"].mean())
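
One way to keep the fill value independent of the test rows is to compute the mean on the training split only and reuse it for both splits. This is a minimal sketch with hypothetical `train` and `test` frames, not the notebook's actual data:

```python
import numpy as np
import pandas as pd

# Hypothetical train/test frames with missing fever readings
train = pd.DataFrame({"fever": [99.0, 104.0, np.nan, 101.0]})
test = pd.DataFrame({"fever": [np.nan, 100.0]})

# Learn the fill value from the training rows only (NaN is skipped by default)
train_mean = train["fever"].mean()

# Apply the same value to both splits
train_filled = train["fever"].fillna(train_mean)
test_filled = test["fever"].fillna(train_mean)

print(train_mean)
print(test_filled.tolist())
```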

## Step 6: Handle Missing Values - Method 2

**What we're doing:** Using scikit-learn's SimpleImputer for a more scalable approach to handle missing values.

**Findings:** SimpleImputer stores the learned fill value (the column mean by default), so the same imputation can be applied to both training and test data. The output is a 2-D array of shape (n_samples, 1).

---

In [11]:
from sklearn.impute import SimpleImputer

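si = SimpleImputer()  # default strategy="mean"

# Fit on the training data only, then reuse the learned mean for the test data
X_train_Fever = si.fit_transform(X_train[["fever"]])
X_test_Fever = si.transform(X_test[["fever"]])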

X_train_Fever.shape

(80, 1)
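
To see what the imputer actually learned, the fitted `statistics_` attribute can be inspected. The sketch below uses small hypothetical frames so the numbers are easy to verify by hand:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: training mean of fever is (98 + 102 + 100) / 3 = 100
train = pd.DataFrame({"fever": [98.0, 102.0, np.nan, 100.0]})
test = pd.DataFrame({"fever": [np.nan, 103.0]})

imputer = SimpleImputer()                             # default strategy="mean"
train_out = imputer.fit_transform(train[["fever"]])   # learns the mean
test_out = imputer.transform(test[["fever"]])         # reuses the learned mean

print(imputer.statistics_)   # [100.]
print(test_out.ravel())      # [100. 103.]
```

The missing test value is filled with 100.0, the training mean, not with a value derived from the test rows.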

## Step 7: Encode Ordinal Categorical Features

**What we're doing:** Using OrdinalEncoder to convert ordinal categorical feature (cough: Mild/Strong) to numerical values.

**Findings:** OrdinalEncoder maps Mild→0, Strong→1, preserving the ordinal relationship in the data.

---

In [14]:
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder(categories=[["Mild","Strong"]])

# Fit on the training data, then apply the same mapping to the test data
X_train_cough = oe.fit_transform(X_train[["cough"]])
X_test_cough = oe.transform(X_test[["cough"]])

X_train_cough.shape, X_test_cough.shape

((80, 1), (20, 1))
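
The mapping depends on the order of the list passed to `categories`. A small sketch with a hypothetical `coughs` frame shows the encoded values and how to map them back:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical cough values
coughs = pd.DataFrame({"cough": ["Strong", "Mild", "Mild", "Strong"]})

# The position of each label in the list becomes its encoded value
enc = OrdinalEncoder(categories=[["Mild", "Strong"]])
codes = enc.fit_transform(coughs[["cough"]])

print(enc.categories_)   # the category order used for encoding
print(codes.ravel())     # [1. 0. 0. 1.]

# inverse_transform maps the codes back to the original labels
labels = enc.inverse_transform(codes)
print(labels.ravel())
```

Without an explicit `categories` argument, the encoder sorts the labels, which may not match the intended ordinal order for other features.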

## Step 8: Encode Nominal Categorical Features

**What we're doing:** Using OneHotEncoder to convert nominal categorical features (gender, city) to binary encoded columns.

**Findings:** OneHotEncoder creates one binary column per category and drops the first category of each feature to avoid multicollinearity. Gender (2 categories) yields 1 column and city (4 categories) yields 3, for 4 columns in total.

---

In [15]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(drop="first",sparse_output=False,dtype=np.int32)

# Fit on the training data, then encode the test data with the same categories
X_train_gender_city = ohe.fit_transform(X_train[["gender","city"]])

X_test_gender_city = ohe.transform(X_test[["gender","city"]])

X_train_gender_city.shape, X_test_gender_city.shape

((80, 4), (20, 4))

## Step 9: Extract Numerical Features

**What we're doing:** Selecting numerical features that don't need encoding (age column).

**Findings:** Age is a continuous numerical feature that can be used directly without transformation.

---

In [17]:
X_train_age = X_train.drop(columns=["gender","fever","cough","city"]).values

X_test_age = X_test.drop(columns=["gender","fever","cough","city"]).values

X_train_age.shape, X_test_age.shape

((80, 1), (20, 1))
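
Dropping every other column works, but selecting the column directly gives the same array. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

# Hypothetical rows with the tutorial's column layout
toy = pd.DataFrame({
    "age": [65, 25],
    "gender": ["Male", "Female"],
    "fever": [99.0, 104.0],
    "cough": ["Mild", "Strong"],
    "city": ["Delhi", "Mumbai"],
})

by_drop = toy.drop(columns=["gender", "fever", "cough", "city"]).values
by_select = toy[["age"]].values   # double brackets keep the result 2-D

print(by_select.shape)            # (2, 1)
print(toy["age"].values.shape)    # (2,)  single brackets give a 1-D array
```

The 2-D shape matters later because `np.concatenate(..., axis=1)` needs every piece to have a column axis.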

## Step 10: Concatenate All Processed Features

**What we're doing:** Combining all transformed features (age, fever, gender, city, cough) into a single feature matrix with `np.concatenate` along `axis=1`.

**Findings:** The final feature matrix has 7 columns (1 age + 1 fever + 4 one-hot + 1 cough) and is ready for machine learning models.

---

In [18]:
X_train_transformed = np.concatenate((X_train_age,X_train_Fever,X_train_gender_city,X_train_cough),axis=1)

X_test_transformed = np.concatenate((X_test_age,X_test_Fever,X_test_gender_city,X_test_cough),axis=1)

X_train_transformed.shape, X_test_transformed.shape

((80, 7), (20, 7))
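
Column-wise concatenation requires every array to have the same number of rows, and the result takes a common dtype. A small sketch with hypothetical arrays:

```python
import numpy as np

# Hypothetical pieces, each with 3 rows
age = np.array([[65], [25], [82]])             # shape (3, 1), integers
fever = np.array([[99.0], [104.0], [102.0]])   # shape (3, 1), floats
onehot = np.array([[1, 0], [1, 1], [0, 0]])    # shape (3, 2), integers

# axis=1 places the arrays side by side
combined = np.concatenate((age, fever, onehot), axis=1)

print(combined.shape)   # (3, 4)
print(combined.dtype)   # float64, because integers are upcast to match fever
```

The column order in the result is simply the order of the tuple, which is worth remembering when comparing against other preprocessing routes.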

## Step 11: Using ColumnTransformer (Simplified Approach)

**What we're doing:** Creating a ColumnTransformer that bundles all of the preprocessing steps above into a single object.

**Findings:** ColumnTransformer applies a different transformer to each listed set of columns and concatenates the results. Setting `remainder="passthrough"` keeps the unlisted columns (here, age) unchanged instead of dropping them.

---

In [23]:
from sklearn.compose import ColumnTransformer

Transformer = ColumnTransformer(
    transformers=[
        ("tnf1", SimpleImputer(), ["fever"]),                                             # mean-impute fever
        ("tnf2", OrdinalEncoder(categories=[["Mild","Strong"]]), ["cough"]),              # Mild→0, Strong→1
        ("tnf3", OneHotEncoder(drop="first",sparse_output=False), ["gender","city"]),     # one-hot, first category dropped
    ],
    remainder="passthrough",  # keep unlisted columns (age) as they are
)

## Step 12: Transform Training Data

**What we're doing:** Fitting the ColumnTransformer on training data and applying all transformations in one step.

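**Findings:** ColumnTransformer produces the same output shape, (80, 7), as the manual concatenation. The column order differs: outputs follow the order of the transformers list (fever, cough, gender/city), with passthrough columns (age) appended last.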

---

In [25]:
Transformer.fit_transform(X_train).shape

(80, 7)

## Step 13: Transform Test Data

**What we're doing:** Applying the fitted ColumnTransformer to test data using the learned parameters from training.

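**Findings:** Calling `transform` (not `fit_transform`) reuses the imputation mean and category encodings learned from the training data, so the test set is processed consistently and no test-set statistics leak into preprocessing.

---

In [26]:
Transformer.transform(X_test).shape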

(20, 7)
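
The difference between `transform` and `fit_transform` on test data does not show up in the output shape, only in the values. The sketch below uses small hypothetical frames and a single imputation step to make the difference visible:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical data: training fever mean is 100.0, test fever mean is 104.0
train = pd.DataFrame({"age": [30, 40, 50], "fever": [98.0, 100.0, 102.0]})
test = pd.DataFrame({"age": [60, 70], "fever": [np.nan, 104.0]})

ct = ColumnTransformer(
    [("impute", SimpleImputer(), ["fever"])],
    remainder="passthrough",
)
ct.fit(train)

# transform: the gap is filled with the training mean
consistent = ct.transform(test)

# fit_transform on the test set: the mean is re-learned from the test rows
refit = ColumnTransformer(
    [("impute", SimpleImputer(), ["fever"])],
    remainder="passthrough",
).fit_transform(test)

print(consistent[0, 0])   # 100.0
print(refit[0, 0])        # 104.0
```

Both outputs have shape (2, 2), yet the imputed values differ, which is why a shape check cannot detect this mistake.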