# EDA and Preprocessing Walkthrough

This notebook performs exploratory data analysis (EDA) on the synthetic hospital readmission dataset and documents the preprocessing pipeline used for the AI Development Workflow assignment.


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path

sns.set_theme(style="whitegrid")

DATA_PATH = Path("../data/synthetic_patients.csv")
df = pd.read_csv(DATA_PATH)
df.head()


## Dataset Summary

The dataset contains demographic, clinical, and social-determinant features. The target column `readmitted` indicates whether the patient was readmitted within 30 days of discharge (1) or not (0).


In [None]:
df.describe(include="all").transpose()
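
Since the preprocessing described later imputes missing values, it can help to check how much is missing per column first. The sketch below uses a small toy frame; its column names (other than `readmitted`) are hypothetical and may not match the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the patient data; "age" and "insurance_type"
# are hypothetical column names used only for illustration.
toy = pd.DataFrame({
    "age": [71, 64, np.nan, 58],
    "insurance_type": ["public", None, "private", "public"],
    "readmitted": [0, 1, 0, 0],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

In the notebook itself, the same expression could be applied to `df` instead of `toy`.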


In [None]:
class_counts = df['readmitted'].value_counts().rename({0: 'No Readmit', 1: 'Readmit'})
class_counts


In [None]:
sns.barplot(x=class_counts.index, y=class_counts.values)
plt.title("Readmission Class Distribution")
plt.ylabel("Count")
plt.xlabel("Class")
plt.show()
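
Raw counts show the imbalance visually; a relative view can make it easier to judge how strong it is. The following is a minimal sketch on made-up labels, not the actual class distribution of the dataset.

```python
import pandas as pd

# Hypothetical labels; in the notebook this would be df["readmitted"].
labels = pd.Series([0, 0, 0, 0, 0, 0, 1, 1], name="readmitted")

# Share of each class, plus the majority-to-minority ratio.
class_share = labels.value_counts(normalize=True)
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(f"Majority/minority ratio: {imbalance_ratio:.1f}")
```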


## Preprocessing Steps

See `src/data_pipeline.py` for the full column transformer and SMOTE pipeline. Key operations:

1. Impute missing values (median for numeric, most frequent for categorical).
2. Scale numeric features and one-hot encode categoricals.
3. Vectorize discharge summaries with TF-IDF (bigrams, max 256 features).
4. Apply SMOTE to balance readmission classes.
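
Steps 1 to 3 can be sketched with scikit-learn's `ColumnTransformer`. This is an illustration under stated assumptions, not the code in `src/data_pipeline.py`: the column names and toy rows are hypothetical, and `ngram_range=(1, 2)` (unigrams plus bigrams) is one possible reading of "bigrams". SMOTE (step 4) is left out here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; the real ones live in the project configuration.
numeric_cols = ["age", "length_of_stay"]
categorical_cols = ["insurance_type"]
text_col = "discharge_summary"

# Step 1 + 2 for numeric features: median imputation, then scaling.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Step 1 + 2 for categoricals: most-frequent imputation, then one-hot encoding.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Step 3: TF-IDF over unigrams and bigrams, capped at 256 features (assumed reading).
text_steps = TfidfVectorizer(ngram_range=(1, 2), max_features=256)

preprocessor = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
    # A single column name (not a list) hands the vectorizer a 1-D series of strings.
    ("txt", text_steps, text_col),
])

# Tiny made-up frame, only to show that the transformer runs end to end.
toy = pd.DataFrame({
    "age": [71, 64, np.nan, 58],
    "length_of_stay": [3, np.nan, 7, 2],
    "insurance_type": pd.Series(["public", np.nan, "private", "public"], dtype=object),
    "discharge_summary": [
        "stable at discharge",
        "follow up in two weeks",
        "stable follow up advised",
        "discharged home stable",
    ],
})

features = preprocessor.fit_transform(toy)
print(features.shape)
```

Keeping the three branches inside one `ColumnTransformer` means the imputation statistics, scaling parameters, and TF-IDF vocabulary are all learned in a single `fit` call, which makes it straightforward to fit on the training split only.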

The following cell illustrates how to instantiate the preprocessing pipeline directly from the configuration file.


In [None]:
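import sys

# Assumption: the notebook runs from a subdirectory of the project root
# (as the "../" paths suggest), so the root is added to make `src` importable.
sys.path.append("..")

from src.utils import load_config
from src.data_pipeline import prepare_datasets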

config = load_config("../config/experiment.yaml")
dataset_splits, preprocessing_pipeline = prepare_datasets(config)
preprocessing_pipeline
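
Step 4 of the preprocessing list relies on SMOTE, which creates synthetic minority samples by interpolating between a minority row and one of its nearest minority neighbours. The NumPy sketch below illustrates only that core idea; `smote_like_oversample` is a hypothetical helper written for this walkthrough, not the implementation used by the project or by the imbalanced-learn library.

```python
import numpy as np


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours for every row.
    k = min(k, len(minority) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base row and one of its neighbours for each new sample.
    base_idx = rng.integers(0, len(minority), size=n_new)
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    # Place each synthetic point somewhere on the segment between the two rows.
    gaps = rng.random((n_new, 1))
    return minority[base_idx] + gaps * (minority[nb_idx] - minority[base_idx])


# Made-up minority-class rows with two numeric features.
minority_rows = np.array([[1.0, 2.0], [1.5, 2.5], [2.0, 2.0], [1.2, 1.8]])
synthetic = smote_like_oversample(minority_rows, n_new=6)
print(synthetic.shape)
```

Because every synthetic row lies between two existing minority rows, the new samples stay inside the region the minority class already occupies rather than duplicating rows exactly.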


## Next Steps

- Run `src/train_model.py` to train LightGBM with the processed features.
- Review `docs/ai_workflow_report.md` for the comprehensive narrative.
- Publish synthesized insights to the PLP Academy Community using `docs/plp_article_post.md` as a template.
