# Data Processing Experiments
This notebook captures hands-on experiments for data ingestion, cleaning tweaks, and preprocessing design decisions before they are codified in `src/data_preprocessing.py`.

## Environment Setup

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Markdown, display

PROJECT_ROOT = Path('..')
RAW_PATH = PROJECT_ROOT / 'data' / 'raw' / 'heart_data.csv'
PROCESSED_DIR = PROJECT_ROOT / 'data' / 'processed'
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
print(f'Raw dataset: {RAW_PATH.resolve()}')

Raw dataset: /Users/peter/Desktop/AI_MLProjects_Research_Project/health_xai_project/data/raw/heart_data.csv


## Scratch Cleaning Tests

In [2]:
# Load a sample of the raw data to inspect quirks before pipeline runs
raw_sample = pd.read_csv(RAW_PATH, nrows=5_000)
display(raw_sample.head())
raw_sample.info()

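|  | Unnamed: 0 | cntry | happy | sclmeet | inprdsc | health | ctrlife | etfruit | eatveg | dosprt | ... | slprl | wrhpp | fltlnl | enjlf | fltsd | hltprhc | hltprhb | hltprdi | gndr | paccnois |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | AT | 8.0 | 4.0 | 1.0 | 3 | 8.0 | 3.0 | 3.0 | 3.0 | ... | 1.0 | 3.0 | 1.0 | 3.0 | 1.0 | 0 | 1 | 0 | 1 | 0 |
| 1 | 2 | AT | 9.0 | 7.0 | 4.0 | 2 | 8.0 | 1.0 | 1.0 | 5.0 | ... | 3.0 | 3.0 | 3.0 | 4.0 | 2.0 | 1 | 0 | 0 | 2 | 0 |
| 2 | 3 | AT | 9.0 | 4.0 | 4.0 | 1 | 9.0 | 4.0 | 3.0 | 3.0 | ... | 3.0 | 3.0 | 1.0 | 3.0 | 1.0 | 0 | 0 | 0 | 2 | 0 |
| 3 | 4 | AT | 7.0 | 6.0 | 3.0 | 3 | 8.0 | 2.0 | 2.0 | 3.0 | ... | 3.0 | 2.0 | 2.0 | 2.0 | 2.0 | 0 | 1 | 1 | 2 | 0 |
| 4 | 5 | AT | 9.0 | 5.0 | 4.0 | 2 | 9.0 | 3.0 | 3.0 | 3.0 | ... | 1.0 | 3.0 | 1.0 | 3.0 | 1.0 | 0 | 0 | 0 | 1 | 0 |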


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 26 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  5000 non-null   int64  
 1   cntry       5000 non-null   object 
 2   happy       4989 non-null   float64
 3   sclmeet     4996 non-null   float64
 4   inprdsc     4990 non-null   float64
 5   health      5000 non-null   int64  
 6   ctrlife     4982 non-null   float64
 7   etfruit     4993 non-null   float64
 8   eatveg      4970 non-null   float64
 9   dosprt      4934 non-null   float64
 10  cgtsmok     4995 non-null   float64
 11  alcfreq     4990 non-null   float64
 12  height      5000 non-null   int64  
 13  weighta     5000 non-null   int64  
 14  fltdpr      4983 non-null   float64
 15  flteeff     4988 non-null   float64
 16  slprl       4987 non-null   float64
 17  wrhpp       4978 non-null   float64
 18  fltlnl      4987 non-null   float64
 19  enjlf       4975 non-null  

### Missing Value Prototypes

In [3]:
missing_pct = (raw_sample.isna().mean().sort_values(ascending=False) * 100).round(2)
display(missing_pct.head(10).to_frame('missing_pct'))

# Placeholder for experimenting with alternative imputers
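# e.g., raw_sample['height'] = raw_sample['height'].fillna(raw_sample['height'].median())
# (assign the result back; calling fillna(..., inplace=True) on a selected column is chained assignment)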

|  | missing_pct |
|---|---|
| dosprt | 1.32 |
| eatveg | 0.6 |
| enjlf | 0.5 |
| wrhpp | 0.44 |
| fltsd | 0.38 |
| ctrlife | 0.36 |
| fltdpr | 0.34 |
| fltlnl | 0.26 |
| slprl | 0.26 |
| flteeff | 0.24 |
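
The imputer placeholder in the cell above could be fleshed out along the following lines. This is a minimal sketch rather than the project's actual experiment: the `toy` frame and the `impute_with` helper are hypothetical, and the values are made up to mimic a skewed survey score with one missing answer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-in for a skewed survey score; the values are invented for illustration.
toy = pd.DataFrame({'score': [1.0, 1.0, 2.0, 2.0, 3.0, 10.0, np.nan]})


def impute_with(frame: pd.DataFrame, strategy: str) -> pd.DataFrame:
    """Return a copy of `frame` with missing values filled by SimpleImputer."""
    imputer = SimpleImputer(strategy=strategy)
    filled = imputer.fit_transform(frame)
    return pd.DataFrame(filled, columns=frame.columns, index=frame.index)


median_filled = impute_with(toy, 'median')
mean_filled = impute_with(toy, 'mean')

# Compare the value each strategy writes into the missing last row.
print('median fill:', median_filled['score'].iloc[-1])
print('mean fill:  ', mean_filled['score'].iloc[-1])
```

On this toy column the single large value pulls the mean fill above most of the observed answers, while the median fill stays at a typical answer. Running the same comparison on the real columns would show whether the difference matters at the low missingness rates listed above.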


### Encoding Experiments

In [4]:
categorical_cols = raw_sample.select_dtypes(include=['object']).columns.tolist()
print(f'Categorical columns: {categorical_cols}')

# Example: one-hot encode a small subset to inspect column expansion
encoded_preview = pd.get_dummies(raw_sample[categorical_cols[:2]], drop_first=False)
display(encoded_preview.head())

Categorical columns: ['cntry']


|  | cntry_AT | cntry_BE | cntry_BG |
|---|---|---|---|
| 0 | True | False | False |
| 1 | True | False | False |
| 2 | True | False | False |
| 3 | True | False | False |
| 4 | True | False | False |


## Notes & Decisions
- Drop the `Unnamed: 0` index column that appears in the raw CSV before any downstream processing; it is a row counter that duplicates the DataFrame index and would otherwise be treated as a numeric feature.

- Use median imputation for the numeric wellness scores (e.g., happiness, control-over-life): their histograms are skewed, so the median is less sensitive to the tail than the mean. Align the production pipeline with `SimpleImputer(strategy="median")`.

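- Use most-frequent imputation for survey-style categorical fields (country, smoking, alcohol) to preserve valid label sets without introducing new categories. Note that `cgtsmok` and `alcfreq` load as `float64` codes, so they must be listed as categorical explicitly; `select_dtypes(include=['object'])` only picks up `cntry`.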

- Standardise all numeric features with `StandardScaler` so that logistic regression and the neural network converge quickly; tree-based models are insensitive to feature scale and can still be trained on unscaled data if needed.

- Retain full one-hot encoding for `cntry` and similar categorical questions, relying on `OneHotEncoder(handle_unknown="ignore")` plus the `feature_names.csv` mapping to keep documentation in sync.

- Persist exploratory statistics (missingness, outliers, feature counts) into `results/metrics/` during notebook runs so changes can be compared before updating `src/data_preprocessing.py`.
