# 02 – Preprocessing and Encoding

This notebook reproduces the preprocessing pipeline described in the lung cancer study, using the synthetic dataset `synthetic_lung_cancer_data.csv`.

We will:
- Load the raw CSV data.
- Demonstrate dropping rows with missing values in key clinical and symptom columns.
- Apply **ordinal (label) encoding** to ordered symptom and exposure variables.
- Apply **one-hot encoding** to nominal categorical variables.
- Split the data into training and testing sets (with stratification by the `Lung Cancer` label).

In the Python modules under `src/`, the same logic is implemented in reusable functions so that the modelling notebook can simply import and call them.
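As a minimal sketch of the row-dropping step, the toy frame below shows how `DataFrame.dropna(subset=...)` only looks at the listed columns. The frame and the `key_columns` list are illustrative assumptions, not the contents of `MISSING_VALUE_COLUMNS` or of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three rows, with gaps in different columns.
toy = pd.DataFrame(
    {
        "Dyspnea": ["Mild", np.nan, "Severe"],
        "Allergy": ["Yes", "No", "No"],
        "Fatigue": [np.nan, "Yes", "No"],  # not treated as a key column here
    }
)

# Only these columns decide whether a row is dropped (illustrative subset).
key_columns = ["Dyspnea", "Allergy"]

cleaned = toy.dropna(subset=key_columns)

# Row 1 is removed (missing Dyspnea); row 0 stays even though Fatigue is missing.
print(cleaned)
```

A row with a gap outside the key columns survives the filter, so any remaining missing values still have to be handled by later steps.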


In [None]:
from pathlib import Path
import sys

import pandas as pd

ROOT_DIR = Path("..").resolve()
if str(ROOT_DIR) not in sys.path:
    sys.path.insert(0, str(ROOT_DIR))

from src.preprocessing import (
    MISSING_VALUE_COLUMNS,
    ONE_HOT_COLUMNS,
    ORDINAL_COLUMNS,
    ORDINAL_ORDER,
    TARGET_COLUMN,
    load_data,
    preprocess_data,
    train_test_split_wrapped,
)

DATA_PATH = ROOT_DIR / "data" / "synthetic_lung_cancer_data.csv"

raw_df = load_data(str(DATA_PATH))
raw_df.head()


Unnamed: 0,Gender,Age,Smoking,Family History of Cancer,Dyspnea,Chest Pain,Weight Loss,Coughing,Previous Lung Disease,Occupational Hazards,Pollution Level in Residence City,Allergy,Coughing Blood,Immediate Family Smokers,Fatigue,Hoarseness of Voice,Lung Cancer
0,Male,34,Never Smoker,No,Moderate,,,Yes,Yes,,High,No,No,Yes,No,,No
1,Female,55,Never Smoker,Yes,,Severe,,No,Yes,,High,Yes,No,No,No,,No
2,Male,68,Never Smoker,No,,,,No,No,High,High,Yes,No,No,No,Mild,No
3,Male,65,Never Smoker,Yes,,Mild,,No,No,High,Moderate,No,No,Yes,No,Mild,Yes
4,Female,24,Former Smoker,No,Moderate,,,Yes,No,Low,High,No,No,Yes,No,Mild,No


In [None]:
# Demonstrate dropping rows with missing values in key columns

print("Columns used for missing-value filtering:")
print(MISSING_VALUE_COLUMNS)

print("\nMissing values before dropping:")
print(raw_df[MISSING_VALUE_COLUMNS].isna().sum())

# The synthetic dataset has no missing values in these columns (all counts
# above are 0), so dropna is a no-op here. We still call it to mirror the
# original study's approach.

df_clean = raw_df.dropna(subset=MISSING_VALUE_COLUMNS)
print("\nShape before dropping:", raw_df.shape)
print("Shape after dropping:", df_clean.shape)


Columns used for missing-value filtering:
['Family History of Cancer', 'Dyspnea', 'Chest Pain', 'Weight Loss', 'Previous Lung Disease', 'Occupational Hazards', 'Allergy', 'Immediate Family Smokers', 'Hoarseness of Voice']

Missing values before dropping:
Family History of Cancer    0
Dyspnea                     0
Chest Pain                  0
Weight Loss                 0
Previous Lung Disease       0
Occupational Hazards        0
Allergy                     0
Immediate Family Smokers    0
Hoarseness of Voice         0
dtype: int64

Shape before dropping: (230, 17)
Shape after dropping: (230, 17)


### Encoding Strategy Recap

Following the original study, we apply two types of encoding:

- **Ordinal (label) encoding** for variables with a natural order:
  - `Dyspnea`, `Chest Pain`, `Weight Loss`, `Occupational Hazards`, `Pollution Level in Residence City`, `Hoarseness of Voice`.
- **One-hot encoding** for nominal categorical variables with no intrinsic order:
  - `Gender`, `Smoking`, `Family History of Cancer`, `Coughing`, `Previous Lung Disease`, `Allergy`, `Coughing Blood`, `Immediate Family Smokers`, `Fatigue`.

The `Lung Cancer` column is converted to a binary 0/1 target for modelling (Yes = 1, No = 0).


In [None]:
# Apply the full preprocessing pipeline using the helper function

X, y, encoder, one_hot_feature_names = preprocess_data(raw_df)

print("Preprocessed feature matrix shape:", X.shape)
print("Target vector length:", len(y))

print("\nExample of ordinal columns and their order:")
for col in ORDINAL_COLUMNS:
    print(col, "->", ORDINAL_ORDER[col])

print("\nNumber of one-hot encoded features:", len(one_hot_feature_names))


Preprocessed feature matrix shape: (230, 26)
Target vector length: 230

Example of ordinal columns and their order:
Dyspnea -> ['None', 'Mild', 'Moderate', 'Severe']
Chest Pain -> ['None', 'Mild', 'Moderate', 'Severe']
Weight Loss -> ['None', 'Mild', 'Marked']
Occupational Hazards -> ['None', 'Low', 'Moderate', 'High']
Pollution Level in Residence City -> ['Low', 'Moderate', 'High']
Hoarseness of Voice -> ['None', 'Mild', 'Moderate', 'Severe']

Number of one-hot encoded features: 19


In [None]:
# Train/test split
import numpy as np

X_train, X_test, y_train, y_test = train_test_split_wrapped(X, y)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

# Quick check of class balance in train and test

print("\nClass distribution in y_train (0 = No, 1 = Yes):")
unique, counts = np.unique(y_train, return_counts=True)
print(dict(zip(unique, counts)))

print("\nClass distribution in y_test (0 = No, 1 = Yes):")
unique, counts = np.unique(y_test, return_counts=True)
print(dict(zip(unique, counts)))


Training set shape: (184, 26)
Test set shape: (46, 26)

Class distribution in y_train (0 = No, 1 = Yes):
{np.int64(0): np.int64(121), np.int64(1): np.int64(63)}

Class distribution in y_test (0 = No, 1 = Yes):
{np.int64(0): np.int64(30), np.int64(1): np.int64(16)}


### Summary

In this notebook we:

- Loaded the synthetic dataset and demonstrated the **row-dropping** strategy for handling missing values used in the original clinical study (no rows were removed, because the key columns contain no missing values).
- Applied **ordinal encoding** to ordered symptom and exposure variables, and **one-hot encoding** to nominal categorical variables.
- Created stratified training and test sets using an 80/20 split (184 training rows and 46 test rows).
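For reference, a stratified 80/20 split can be sketched with scikit-learn's `train_test_split`. The class counts below (151 No, 79 Yes) are the train and test counts printed earlier added together; the placeholder features and `random_state=42` are assumptions, and `train_test_split_wrapped` may use different settings internally.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as this notebook: 151 No (0), 79 Yes (1).
y_toy = np.array([0] * 151 + [1] * 79)
# Placeholder single-column feature matrix; only its length matters here.
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class ratio (roughly) equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)

print("Train:", X_tr.shape, train_counts.tolist())
print("Test:", X_te.shape, test_counts.tolist())
```

Without `stratify`, a small test set such as this one could end up with noticeably more or fewer positive cases than the training set, which makes evaluation metrics harder to compare across runs.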

The next notebook rebuilds these arrays (`X_train`, `X_test`, `y_train`, `y_test`) by calling the same helper functions from `src/`, then uses them to train and evaluate Logistic Regression, Decision Tree, KNN, Random Forest, XGBoost, Gradient Boosting, CatBoost, and AdaBoost models.
