# Data Processing Notebook

This companion notebook demonstrates how the AutoML Pro pipeline imports data, runs the shared sanitisation utilities, and inspects whether additional cleaning is necessary before modelling.

## 0. Environment setup
Install the minimal requirements if you're running this outside the repo's virtualenv.

In [None]:
!pip install -r Project/requirements_min.txt

## 1. Import helpers
Reuse the exact Python modules used by `scripts/run_all.py` so results stay aligned with the main pipeline.

In [None]:

from Project.utils.io import load_dataset, guess_target_column
from Project.utils.sanitize import sanitize_columns
from Project.utils.splits import resolve_split_plan
import pandas as pd


## 2. Load & sanitise dataset
Set `CSV_PATH` to point at a specific CSV, or rely on the built-in discovery logic to pick the dataset.

In [None]:

# Optional: override with a specific CSV; no override is given here,
# so the built-in discovery logic picks the dataset
data = sanitize_columns(load_dataset(max_rows=2000))
target_col = guess_target_column(data)
print(f"Detected target column: {target_col}")
data.head()
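To give a sense of what a column sanitisation step usually involves, here is a minimal sketch using only the standard library. It is not the implementation of `sanitize_columns`; the function name `sanitize_names` and its rules (snake_case, a `col` fallback, numeric suffixes for repeats) are assumptions made for illustration.

```python
import re

def sanitize_names(names):
    """Illustrative only: normalise column names to snake_case and suffix repeats."""
    seen = {}
    cleaned = []
    for raw in names:
        # Lowercase, then collapse every run of non-alphanumerics into one underscore.
        name = re.sub(r"[^0-9a-z]+", "_", str(raw).strip().lower()).strip("_")
        name = name or "col"  # hypothetical fallback when nothing usable remains
        # Append a numeric suffix if this name was already produced.
        count = seen.get(name, 0)
        seen[name] = count + 1
        cleaned.append(name if count == 0 else f"{name}_{count}")
    return cleaned

print(sanitize_names(["Age (years)", " Income$ ", "age years", ""]))
# prints ['age_years', 'income', 'age_years_1', 'col']
```

Comparing `data.columns` before and after the real `sanitize_columns` call is a quick way to see which rules the pipeline actually applies.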


## 3. Missing values / dtype audit

In [None]:

profile = pd.DataFrame({
    'dtype': data.dtypes,
    'missing_pct': data.isna().mean() * 100,
    'unique_vals': data.nunique()
}).sort_values('missing_pct', ascending=False)
profile.head(20)
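One way to turn a profile like this into a decision about further cleaning is to flag columns that are mostly empty or that carry a single value. The sketch below runs on a made-up toy frame; the 40% threshold and the names `toy`, `too_sparse`, and `constant_cols` are assumptions for illustration, not pipeline settings.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data`; the columns and values are made up.
toy = pd.DataFrame({
    "age": [34, np.nan, 29, 41],
    "city": ["Oslo", "Bergen", None, None],
    "id": [1, 2, 3, 4],
    "constant": ["x", "x", "x", "x"],
})

# Same profile layout as the audit cell above.
toy_profile = pd.DataFrame({
    "dtype": toy.dtypes,
    "missing_pct": toy.isna().mean() * 100,
    "unique_vals": toy.nunique(),
})

MISSING_THRESHOLD = 40.0  # hypothetical cut-off, in percent

# Columns with too many gaps, or with a single distinct value, are cleaning candidates.
too_sparse = toy_profile.index[toy_profile["missing_pct"] > MISSING_THRESHOLD].tolist()
constant_cols = toy_profile.index[toy_profile["unique_vals"] <= 1].tolist()

print("too sparse:", too_sparse)
print("constant:", constant_cols)
```

The same two filters can be applied to `profile` directly; a sensible threshold depends on the dataset and the model being trained.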


## 4. Train/validation splits
Check that the registry-driven split plan produces the expected number of folds and sensible train/validation sizes.

In [None]:

split_plan = resolve_split_plan(csv_path=None)
X = data.drop(columns=[target_col])
y = data[target_col].tolist()
# is_classification is hard-coded to True here; adjust it if the target is continuous
folds = list(split_plan.split(X, y, n_splits=3, seed=42, is_classification=True))
print(f"Generated {len(folds)} folds via {split_plan.strategy} strategy")
[f"train={len(tr)}, valid={len(val)}" for tr, val in folds]
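For readers unfamiliar with what a classification split yields, the following standard-library sketch builds stratified folds as `(train_idx, valid_idx)` pairs. It is not how `resolve_split_plan` or `split_plan.split` work internally; `stratified_folds` is a hypothetical helper, and stratification is assumed here only because the cell above passes `is_classification=True`.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits=3, seed=42):
    """Illustrative only: yield (train_idx, valid_idx) pairs with balanced classes."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled indices round-robin into the validation buckets.
    buckets = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            buckets[pos % n_splits].append(idx)

    # Each bucket serves once as the validation set; the rest is training data.
    all_idx = set(range(len(labels)))
    for valid in buckets:
        train = sorted(all_idx - set(valid))
        yield train, sorted(valid)

labels = ["a"] * 6 + ["b"] * 3
folds_demo = list(stratified_folds(labels))
print([f"train={len(tr)}, valid={len(val)}" for tr, val in folds_demo])
# prints ['train=6, valid=3', 'train=6, valid=3', 'train=6, valid=3']
```

Every row lands in exactly one validation fold, and each fold keeps the 2:1 class ratio of the toy labels.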


## 5. Optional: pipe into analysis utilities
Call the Python modules directly: `Project.analysis.summarize_all.main()` regenerates the summary CSVs, and `Project.analysis.plot_comparisons.main()` produces the figures showcased in the README.

In [None]:

from Project.analysis import summarize_all, plot_comparisons

# Uncomment to regenerate reports/metrics for the current dataset selection
# summarize_all.main()
# plot_comparisons.main()


Feel free to duplicate this notebook inside `notebooks/data_processing/` and tailor it per dataset; each variant then documents the data preparation steps for that dataset, ready to use in your presentation.