# Module 07: Data Preprocessing and Feature Engineering

This notebook supports **Module 07** and contains all code and explanatory text for:

- **Part 1**: Missing values, encoding categorical variables, scaling/normalization
- **Part 2**: Outlier detection, feature transformation, domain-driven features,
  preprocessing pipelines, and a quick sanity-check model

Datasets used:
- **Titanic dataset** (from Kaggle) – for demonstrating missing values
- **Heart Failure / Heart Disease dataset** – for the main preprocessing pipeline

Please upload or mount the CSV files in your environment as needed.

---
## Part 1: Core Preprocessing Concepts

In Part 1 we cover:
- Why preprocessing is needed
- How to handle missing values
- How to encode categorical variables
- How to scale / normalize numeric features

### Handling Missing Values (Titanic Dataset)

Real-life analogy: **attendance sheet with blank cells**.

- Some students have `P` (present), some have `A` (absent), and some cells are blank.
- If we ignore those blanks, the final attendance calculation will be wrong.
- We must decide how to handle the blanks using logic.

In the Titanic dataset:
- `Age` has missing values (numeric)
- `Embarked` has a few missing values (categorical)
- `Cabin` has many missing values (often dropped in simple demos)

We will:
1. Inspect missing values
2. Fill numeric column (`Age`) with the **median**
3. Fill categorical column (`Embarked`) with the **mode**
4. Drop `Cabin` because it is mostly missing
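
The four steps above can be tried on a tiny made-up table before touching the real file. This is a minimal sketch: the `toy` frame and its values are hypothetical and are not taken from the Titanic CSV. It also shows why the median is a reasonable choice for a skewed column: a single large value pulls the mean up far more than the median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (NOT the Titanic file), built to mimic its missing-value pattern
toy = pd.DataFrame({
    "Age": [22.0, 25.0, np.nan, 30.0, 80.0],
    "Embarked": ["S", "C", "S", None, "S"],
    "Cabin": [None, None, "C85", None, None],
})

# Step 1: inspect the share of missing values per column
missing_share = toy.isnull().mean()

# The single large value (80) pulls the mean up; the median stays near the bulk of the data
age_mean = toy["Age"].mean()
age_median = toy["Age"].median()

# Step 2: fill the numeric column with the median
toy["Age"] = toy["Age"].fillna(age_median)

# Step 3: fill the categorical column with the mode (most frequent value)
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Step 4: drop the column that is mostly missing
toy = toy.drop(columns=["Cabin"])

print(missing_share)
print("mean:", age_mean, "median:", age_median)
print(toy)
```

The point at which a column is "mostly missing" and should be dropped is a judgment call. This sketch drops `Cabin` only because the steps above say so, not because it applies a fixed threshold.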

In [1]:
import pandas as pd

# Load Titanic dataset
# NOTE: Make sure Titanic-Dataset.csv exists at this path, or update the path accordingly.
# Download: https://www.kaggle.com/datasets/yasserh/titanic-dataset
titanic_path = "/content/sample_data/Titanic-Dataset.csv"  # change if needed
df_titanic = pd.read_csv(titanic_path)

print("First 10 rows of Titanic dataset:")
df_titanic.head(10)

In [None]:
df_titanic.shape

In [None]:
print("Unique values per column:")
df_titanic.nunique()

In [None]:
print("Missing values per column:")
df_titanic.isnull().sum()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

print("Distribution of Age column:")
plt.figure(figsize=(8, 6))
sns.histplot(df_titanic["Age"], kde=True, bins=20)
plt.title("Age Distribution of Titanic Passengers")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()
# Age is right-skewed, so we impute with the median,
# which is less sensitive to extreme values than the mean.

In [None]:
print("Distribution of Embarked column:")
plt.figure(figsize=(8, 6))
sns.countplot(data=df_titanic, x="Embarked")
plt.title("Embarked Distribution of Titanic Passengers")
plt.xlabel("Embarked")
plt.ylabel("Frequency")
plt.show()
# Embarked is categorical, so we impute with the mode (the most frequent value).

In [None]:
# 1. Handle numeric missing values: Age
age_median = df_titanic["Age"].median()
df_titanic["Age"] = df_titanic["Age"].fillna(age_median)

# 2. Handle categorical missing values: Embarked
embarked_mode = df_titanic["Embarked"].mode()[0]
df_titanic["Embarked"] = df_titanic["Embarked"].fillna(embarked_mode)

# 3. Drop Cabin (too many missing values)
df_titanic = df_titanic.drop(columns=["Cabin"])

In [None]:
print("Missing values after handling:")
df_titanic.isnull().sum()

### Encoding Categorical Variables (Heart Dataset)

Real-life analogy: **canteen token system**.

- The canteen menu has items like *Tehari*, *Chowmein*, *Biriyani*.
- The billing machine cannot understand these strings; it needs numeric codes.
- However, assigning `Tehari = 1`, `Chowmein = 2`, `Biriyani = 3` does **not**
  mean Biriyani is greater than Tehari. The numbers are **labels, not ranks**.

In the Heart dataset, we will:
- Use **Label Encoding** for binary categories like `Sex` and `ExerciseAngina`
- Use **One-Hot Encoding** for nominal categories like `ChestPainType`,
  `RestingECG`, and `ST_Slope`
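
The canteen analogy can be sketched on a toy column to compare the two encodings side by side. The `menu` frame below is made up for illustration and is not part of the Heart dataset. `LabelEncoder` assigns codes in sorted order of the labels, so the numbers carry no ranking. One-hot encoding avoids any implied order by giving each item its own 0/1 column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical canteen orders (illustrative only)
menu = pd.DataFrame({"Item": ["Tehari", "Chowmein", "Biriyani", "Tehari"]})

# Label encoding: one integer per category, assigned in sorted (alphabetical) order
le_item = LabelEncoder()
menu["Item_code"] = le_item.fit_transform(menu["Item"])
code_map = dict(zip(le_item.classes_, range(len(le_item.classes_))))

# One-hot encoding: one 0/1 column per category, so no order is implied
one_hot = pd.get_dummies(menu["Item"], prefix="Item", dtype=int)

print(code_map)
print(menu)
print(one_hot)
```

For a binary column, label encoding produces a single 0/1 column and no misleading order arises. This is why the plan above reserves one-hot encoding for the columns with more than two categories.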

In [None]:
from sklearn.preprocessing import LabelEncoder

# Load Heart dataset
heart_path = "/content/sample_data/heart.csv"  # change if needed
df_heart = pd.read_csv(heart_path)

print("First 10 rows of Heart dataset:")
display(df_heart.head(10))

print("\nColumn data types:")
display(df_heart.dtypes)

In [None]:
# Categorical feature exploration: bar chart of value counts per column
categorical_cols = ["Sex", "ChestPainType", "RestingECG",
                    "ExerciseAngina", "ST_Slope"]
for c in categorical_cols:
    plt.figure(figsize=(5, 4))
    df_heart[c].value_counts().plot(kind="bar")
    plt.title(f"Value counts for {c}")
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()

In [None]:
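# Label encoding for binary categorical columns: Sex and ExerciseAngina.
# LabelEncoder assigns integer codes in sorted order of the labels.
# Note: `le` is refit for each column, so afterwards it only remembers the last mapping.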
le = LabelEncoder()
df_heart["Sex"] = le.fit_transform(df_heart["Sex"])
df_heart["ExerciseAngina"] = le.fit_transform(df_heart["ExerciseAngina"])

In [None]:
df_heart.head(10)

In [None]:
# One-hot encoding for nominal categorical columns
cat_cols = ["ChestPainType", "RestingECG", "ST_Slope"]

df_heart_encoded = pd.get_dummies(
    df_heart,
    columns=cat_cols,
    dtype=int,
)

In [None]:
df_heart_encoded.head(10)

### Normalization and Scaling

Real-life analogy: **comparing salary and height**.

- Height might range from 150 to 190 cm.
- Salary might range from 20,000 to 700,000.
- If we feed these two features directly into a distance-based model,
  salary will dominate the calculation.

To fix this, we **scale** numeric features so they are on a comparable range.
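
The effect can be seen on a small example. This sketch uses two hypothetical people (`a` and `b`, values invented for illustration) and the ranges quoted above. It computes their Euclidean distance before and after rescaling each feature to [0, 1].

```python
import numpy as np

# Hypothetical people, each as [height_cm, salary]
a = np.array([150.0, 20000.0])
b = np.array([190.0, 20500.0])

# Raw Euclidean distance: the 500-unit salary gap swamps the 40 cm height gap
raw_distance = np.linalg.norm(a - b)
salary_share = (a[1] - b[1]) ** 2 / raw_distance ** 2

# Rescale each feature to [0, 1] using the ranges from the text above
mins = np.array([150.0, 20000.0])
maxs = np.array([190.0, 700000.0])
a_scaled = (a - mins) / (maxs - mins)
b_scaled = (b - mins) / (maxs - mins)
scaled_distance = np.linalg.norm(a_scaled - b_scaled)

print(f"raw distance: {raw_distance:.1f}")
print(f"share of squared distance due to salary: {salary_share:.4f}")
print(f"scaled distance: {scaled_distance:.4f}")
```

In raw units, almost all of the distance comes from salary, even though the two people sit at opposite ends of the height range and earn nearly the same. After rescaling, the height difference drives the distance, which matches the intuition.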

Common approaches:
- **StandardScaler**: transforms features to have mean 0 and standard deviation 1
- **MinMaxScaler**: rescales features to a fixed range, usually [0, 1]

Always fit the scaler on the **training set only**, then transform both
training and test sets using the same scaler.
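
As a rough check of what "fit on the training set only" means, the sketch below computes both transformations by hand from training statistics and compares them with scikit-learn's scalers. The `train` and `test` arrays are invented for illustration. `StandardScaler` uses the population standard deviation (`ddof=0`), which is NumPy's default for `std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical [height_cm, salary] rows (illustrative only)
train = np.array([[150.0, 20000.0],
                  [170.0, 50000.0],
                  [190.0, 700000.0]])
test = np.array([[160.0, 100000.0]])

# Standardization: z = (x - mean_train) / std_train
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses
test_std_manual = (test - mu) / sigma

# Min-max scaling: x' = (x - min_train) / (max_train - min_train)
lo = train.min(axis=0)
hi = train.max(axis=0)
test_mm_manual = (test - lo) / (hi - lo)

# Same result via scikit-learn: fit on train, then transform test
std_scaler = StandardScaler().fit(train)
mm_scaler = MinMaxScaler().fit(train)

print(np.allclose(std_scaler.transform(test), test_std_manual))
print(np.allclose(mm_scaler.transform(test), test_mm_manual))
print(test_mm_manual)
```

Because the statistics come from the training rows alone, no information from the test set leaks into the transformation. A test value outside the training range would simply land outside [0, 1] under min-max scaling.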

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Assume df_heart_encoded is our working dataframe with target 'HeartDisease'
target_col = "HeartDisease"

X = df_heart_encoded.drop(columns=[target_col])
y = df_heart_encoded[target_col]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Standard Scaling
scaler_sd = StandardScaler()
X_train_std = scaler_sd.fit_transform(X_train)  # mean and SD are calculated from X_train only
X_test_std = scaler_sd.transform(X_test)        # reuse the training mean and SD

# MinMax Scaling
scaler_mm = MinMaxScaler()
X_train_mm = scaler_mm.fit_transform(X_train)
X_test_mm = scaler_mm.transform(X_test)

print("\n--- Displaying Standard Scaled Data ---")
# Convert scaled arrays back to DataFrames so the column names are kept for display
X_train_std_df = pd.DataFrame(X_train_std, columns=X_train.columns, index=X_train.index)
X_test_std_df = pd.DataFrame(X_test_std, columns=X_test.columns, index=X_test.index)
print("\nFirst 5 rows of X_train and X_test (Standard Scaled):")
display(X_train_std_df.head())
display(X_test_std_df.head())

print("\n--- Displaying MinMax Scaled Data ---")
# Convert scaled arrays back to DataFrames so the column names are kept for display
X_train_mm_df = pd.DataFrame(X_train_mm, columns=X_train.columns, index=X_train.index)
X_test_mm_df = pd.DataFrame(X_test_mm, columns=X_test.columns, index=X_test.index)
print("\nFirst 5 rows of X_train and X_test (MinMax Scaled):")
display(X_train_mm_df.head())
display(X_test_mm_df.head())