# Training Pipelines Walkthrough (Days 1–5)

This notebook documents and demonstrates a modular ML training pipeline built on Hydra (configuration) and MLflow (experiment tracking). It follows the day-by-day plan:

- Day 1: Define pipeline structure (load → preprocess → train → eval) and configs
- Day 2: Implement Logistic Regression & Random Forest and log metrics
- Day 3: Add cross-validation and hyperparameter search (GridSearchCV)
- Day 4: Add Gradient Boosting (optionally XGBoost)
- Day 5: Demo run and reproducibility notes

Prerequisites:
- Ensure `requirements.txt` is installed in your Python environment.
- Run from the project root so paths resolve correctly.



## Day 1: Pipeline structure and configs

- Source layout:
  - `main.py`: Hydra entrypoint; orchestrates ingestion → split → preprocess → train → log
  - `src/Training_pipeline/components`: ingestion, transformation, model training
  - `src/Training_pipeline/pipeline`: preprocessing & training pipeline wrappers
  - `configs/`: Hydra config groups: `dataset`, `model`, `training`, `preprocessing`
- Key configs:
  - `configs/config.yaml` sets defaults and MLflow tracking
  - `configs/dataset/cal_housing.yaml` and `cal_housing_classification.yaml`
  - `configs/model/*.yaml`: random_forest, logistic_regression, gradient_boosting (xgboost optional)
  - `configs/training/base.yaml`: CV and search settings
  - `configs/preprocessing/standard.yaml`: imputer/scaler/encoder
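
Two kinds of command-line override appear throughout this notebook: group overrides such as `model=random_forest`, which select a file from a config group, and dotted overrides such as `training.search.cv=3`, which set a single nested value. The sketch below models only the dotted-key part using plain dictionaries. It is a simplified illustration, not Hydra's actual parser, and the default values in it are hypothetical.

```python
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Apply 'dotted.key=value' overrides to a nested dict (simplified illustration)."""
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = config
        for part in parents:
            # Walk down the nesting, creating intermediate levels as needed
            node = node.setdefault(part, {})
        # Integers are converted; everything else stays a string (Hydra's typing is richer)
        node[leaf] = int(raw) if raw.lstrip("-").isdigit() else raw
    return config


# Hypothetical defaults, for illustration only; the real values live in configs/
defaults = {"dataset": "cal_housing", "model": "random_forest", "training": {"search": {"cv": 5}}}
composed = apply_overrides(defaults, ["model=gradient_boosting", "training.search.cv=3"])
print(composed)
```

In the real pipeline, `model=gradient_boosting` would load the whole `configs/model/gradient_boosting.yaml` file rather than store a string; the sketch keeps only the name to stay short.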



In [None]:
# Kaggle setup: install dependencies and set dataset path
import os

# Install requirements (safe to re-run)
%pip install -q -r ../requirements.txt

# Define the dataset CSV path for Kaggle; update this if your dataset path differs
# Example for a Kaggle dataset named "california-housing-prices":
# /kaggle/input/california-housing-prices/housing.csv
DATASET_PATH = os.environ.get(
    "DATASET_PATH",
    "/kaggle/input/california-housing-prices/housing.csv"
)
print("Using DATASET_PATH=", DATASET_PATH)

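# Helper to build a train command with the dataset path override

def train_cmd(dataset_group: str, model: str, extra_overrides: str = "") -> str:
    """Return a train.py command line with Hydra overrides for dataset, model, and path."""
    overrides = f"dataset={dataset_group} model={model} dataset.paths.dataset=\"{DATASET_PATH}\""
    if extra_overrides:
        overrides += " " + extra_overrides
    # Plain relative path: the Windows-style ".\train.py" is mangled by POSIX shells such as Kaggle's
    return f"python train.py {overrides}"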
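
An alternative to building one shell string is to build an argument list, which sidesteps shell quoting when the dataset path contains spaces or special characters. The following is a minimal sketch of that idea; `train_args` is a hypothetical helper that is not part of the project, and it assumes `train.py` sits in the current working directory.

```python
import sys
from typing import List, Sequence


def train_args(dataset_group: str, model: str, dataset_path: str,
               extra_overrides: Sequence[str] = ()) -> List[str]:
    """Build an argv list for train.py; a list needs no shell quoting."""
    return [
        sys.executable, "train.py",  # same interpreter as the notebook kernel
        f"dataset={dataset_group}",
        f"model={model}",
        f"dataset.paths.dataset={dataset_path}",
        *extra_overrides,
    ]


args = train_args(
    "cal_housing",
    "random_forest",
    "/kaggle/input/california-housing-prices/housing.csv",
    ["training.search.cv=3"],
)
print(args[1:])
# To execute: subprocess.run(args, check=True)
```

Using `sys.executable` instead of a bare `python` keeps the child process in the same environment in which the requirements were installed.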



In [None]:
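import os, subprocess
from pathlib import Path

PROJECT_ROOT = Path(os.getcwd())
print("Project root:", PROJECT_ROOT)

# Utility: run a shell command, echo its output line by line, and fail loudly

def run(cmd: str):
    print(f"\n$ {cmd}")
    # Pipe stdout/stderr through Python so the output appears in the notebook cell
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    for line in proc.stdout:
        print(line, end="")
    proc.wait()
    if proc.returncode != 0:
        raise RuntimeError(f"Command failed with code {proc.returncode}")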



## Day 2: Baselines (Logistic Regression, Random Forest)

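We'll run two quick training jobs: a Random Forest regressor on `cal_housing`, and a Logistic Regression classifier on `cal_housing_classification`, where the continuous target is turned into class labels by quantile.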
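
Quantile labeling turns a continuous target into roughly equal-sized classes by cutting it at its quantiles. The pipeline's own binning code is not shown in this notebook, so the sketch below is an assumption about the general technique, using only the standard library; `quantile_labels` and the sample values are hypothetical.

```python
from bisect import bisect_right
from statistics import quantiles
from typing import List, Sequence


def quantile_labels(values: Sequence[float], n_classes: int = 3) -> List[int]:
    """Label each value 0..n_classes-1 according to the quantile bin it falls into."""
    cuts = quantiles(values, n=n_classes)  # returns n_classes - 1 cut points
    # The label is the number of cut points at or below the value
    return [bisect_right(cuts, v) for v in values]


# Made-up target values, for illustration only
targets = [0.5, 1.2, 1.8, 2.4, 3.1, 3.9, 4.5, 5.0, 1.0]
labels = quantile_labels(targets)
print(labels)
```

With three classes the labels can be read as low, medium, and high; the classifier is then trained on these labels instead of the raw values.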



### Kaggle-friendly runs (use DATASET_PATH override)
These cells use `train_cmd()` so the dataset path under `/kaggle/input/...` is passed to Hydra.



In [None]:
# Regression baseline (Random Forest)
run(train_cmd("cal_housing", "random_forest"))

# Classification baseline (Logistic Regression)
run(train_cmd("cal_housing_classification", "logistic_regression"))



## Day 3: Hyperparameter Search + Cross-Validation

`GridSearchCV` and the cross-validation folds are configured in `configs/training/base.yaml`. Override `training.search.cv`, `training.runtime.n_jobs`, or the scoring keys on the command line to adjust them.
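
The cost of a grid search grows quickly: `GridSearchCV` fits every parameter combination once per fold, plus one final refit on the full training set when `refit=True` (the default). The sketch below enumerates a grid with the standard library to show how the number of fits is derived. The grid values are hypothetical; the project's actual grids live in its model configs.

```python
from itertools import product
from typing import Any, Dict, Iterator, List


def expand_grid(param_grid: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Yield every parameter combination in the grid, one dict per candidate."""
    keys = list(param_grid)
    for combo in product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, combo))


# Hypothetical grid, for illustration only
param_grid = {"n_estimators": [100, 200], "max_depth": [None, 10, 20]}
cv = 3

candidates = list(expand_grid(param_grid))
total_fits = len(candidates) * cv  # one fit per candidate per fold
print(f"{len(candidates)} candidates x {cv} folds = {total_fits} fits")
```

Lowering `training.search.cv` or trimming the grid are the two direct ways to shorten a search.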



In [None]:
# Example: override the number of CV folds and raise the logging verbosity
run("python train.py dataset=cal_housing model=random_forest training.search.cv=3 training.runtime.verbose=2")



## Day 4: Add Gradient Boosting (and optional XGBoost)

- To compare models, simply change the `model` group override.
- If you re-enable XGBoost in the trainer and `configs/model/xgboost.yaml` exists, you can run it similarly.
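
Since only the `model` and `dataset` overrides change between runs, a comparison can be generated as a small matrix of commands. This is a sketch under the assumption that every listed model works with both dataset configs, as the commands in this notebook suggest for `random_forest` and `gradient_boosting`.

```python
from itertools import product

# Group names taken from the configs described in this notebook
datasets = ["cal_housing", "cal_housing_classification"]
models = ["random_forest", "gradient_boosting"]

commands = [
    f"python train.py dataset={dataset} model={model}"
    for dataset, model in product(datasets, models)
]
for cmd in commands:
    print(cmd)  # pass each command to the notebook's run() helper to execute it
```

Each run is logged to MLflow separately, so the results can be compared side by side in the MLflow UI.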



In [None]:
# Compare Gradient Boosting on regression
run("python train.py dataset=cal_housing model=gradient_boosting")

# Compare Gradient Boosting on classification
run("python train.py dataset=cal_housing_classification model=gradient_boosting")



## Day 5: Demo + Reproducibility

- Demo command (regression): `python train.py dataset=cal_housing model=random_forest`
- Demo command (classification): `python train.py dataset=cal_housing_classification model=random_forest`
- MLflow UI: `mlflow ui --backend-store-uri mlruns` (or your absolute mlruns path)

Artifacts:
- Data: `artifacts/`
- Models: `models/`
- Preprocessor: `artifacts/preprocessor.pkl`
- Metrics: `artifacts/metrics.json`

The run logs contain the composed config so you can reproduce a run by reusing the logged parameters and the config overrides shown here.
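
A quick way to confirm that a rerun reproduced the original is to compare the saved metrics. The sketch below reads `artifacts/metrics.json` and prints its entries. It assumes the file holds a flat JSON object of metric names to values, which this notebook does not confirm; `load_metrics` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_metrics(path: Path) -> Dict[str, Any]:
    """Read a metrics JSON file; return an empty dict if the file is missing."""
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


# Assumed layout: a single JSON object such as {"metric_name": value, ...}
metrics = load_metrics(Path("artifacts") / "metrics.json")
for name, value in sorted(metrics.items()):
    print(f"{name}: {value}")
```

Saving a copy of this file before a rerun lets you compare the two dictionaries directly.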

