# MEM Shirt Size Model — Training Notebook

This notebook trains a **shirt size prediction model** that outputs one of:

- `S`
- `M`
- `L`
- `XL`
- `XXL`

The dataset used here is **synthetic** (generated), so you can iterate quickly without collecting data first.

For production accuracy, replace the synthetic generator with real measurements and size-chart labels from your actual merch vendor.

## 1) Notebook Runtime & Reproducibility Setup (seeds, device, paths)

In [None]:
from __future__ import annotations

import os
import platform
import random
from pathlib import Path

import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Assumes the notebook is started from the repository root.
REPO_ROOT = Path.cwd()
MODEL_PATH = REPO_ROOT / "app" / "model.joblib"

print("Python:", platform.python_version())
print("CWD:", Path.cwd())
print("Model output path:", MODEL_PATH)

# Note: scikit-learn wheels may not exist yet for very new Python versions.
# If imports fail later, run this notebook with a Python 3.11 kernel or use Docker.
# Docker approach (from the repo root). `pwd -W` is for Git Bash on Windows;
# on Linux/macOS use "$(pwd)" instead:
#   docker build -t mem-shirt-size:local .
#   docker run --rm -v "$(pwd -W)":/code -w /code mem-shirt-size:local python train_model.py

## 2) Install/Verify Dependencies (VS Code + Jupyter kernel)

This cell prints the library versions so a run can be reproduced later.

In [None]:
import joblib
import sklearn

print("numpy:", np.__version__)
print("joblib:", joblib.__version__)
print("scikit-learn:", sklearn.__version__)

## 3) Load Data (generated) + Basic Validation

We generate a synthetic dataset using `train_model.generate_synthetic_dataset()`.

Features:
- `height_cm`
- `weight_kg`
- `age`
- `gender`
- `fit_preference`
- `build`

Label:
- `size` in `{S, M, L, XL, XXL}`
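
The "basic validation" step can be as simple as checking each generated row against the feature list above. The following is a minimal sketch of that idea. `validate_row` is a hypothetical helper, not a function from `train_model.py`, and the allowed categorical values are assumed to match the `enum` lists in the feature schema used later in this notebook.

```python
from numbers import Real

# Hypothetical helper for illustration; it is not part of train_model.py.
NUMERIC_FEATURES = ("height_cm", "weight_kg", "age")
CATEGORICAL_FEATURES = {
    "gender": {"female", "male", "other"},
    "fit_preference": {"slim", "regular", "oversized"},
    "build": {"lean", "average", "athletic", "curvy"},
}
SIZES = ("S", "M", "L", "XL", "XXL")


def validate_row(row: dict, label: str) -> list[str]:
    """Return a list of problems found in one (row, label) pair; empty means OK."""
    problems = []
    for name in NUMERIC_FEATURES:
        value = row.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            problems.append(f"{name}: expected a number, got {value!r}")
    for name, allowed in CATEGORICAL_FEATURES.items():
        if row.get(name) not in allowed:
            problems.append(f"{name}: unexpected value {row.get(name)!r}")
    if label not in SIZES:
        problems.append(f"size: unexpected label {label!r}")
    return problems
```

Running it over `zip(rows, y)` and printing the first few non-empty results is usually enough to catch a broken generator early.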

In [None]:
from collections import Counter

from train_model import SIZES, generate_synthetic_dataset

rows, y = generate_synthetic_dataset(n=30_000, seed=SEED)

print("Samples:", len(rows))
print("Labels:", SIZES)
print("Label distribution:")
print(Counter(y))

print("Example row:")
print(rows[0], "=>", y[0])

## 4) Train/Validation/Test Split

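The split is stratified and happens inside `train_model.train_model()`. Stratification keeps the label proportions the same in each part; the `seed` argument passed to that function is there to make the result reproducible.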
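
The actual split lives in `train_model.py` and is not shown in this notebook. To illustrate what a stratified three-way split does, here is a minimal standard-library sketch. The function name `stratified_split` and the 70/15/15 fractions are assumptions made for this example only.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Split indices into train/val/test so each part keeps the label proportions."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    train, val, test = [], [], []
    for label in sorted(by_label):  # sorted for a deterministic iteration order
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever is left goes to test
    return train, val, test
```

With scikit-learn, `train_test_split(..., stratify=y)` does the same job for a two-way split, and calling it twice yields three parts.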

## 5) Preprocessing Pipeline (scaling/encoding)

We use a scikit-learn `Pipeline` with:
- numeric scaling (`StandardScaler`)
- categorical one-hot encoding (`OneHotEncoder`)

This is already implemented in `train_model.train_model()`.
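
For readers who want to see the shape of such a pipeline without opening `train_model.py`, the sketch below builds the two preprocessing branches with a `ColumnTransformer`. It is an assumption about how this could be wired up, not a copy of the repo's implementation (which uses its own transformers from `app/feature_utils.py`). The helper `rows_to_array` and the column order are hypothetical.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["height_cm", "weight_kg", "age"]
CATEGORICAL = ["gender", "fit_preference", "build"]


def rows_to_array(rows):
    """Turn a list of feature dicts into a 2-D object array with a fixed column order."""
    return np.array([[row[name] for name in NUMERIC + CATEGORICAL] for row in rows], dtype=object)


preprocess = ColumnTransformer(
    transformers=[
        # Columns 0-2 are numeric: shift to mean 0 and scale to unit variance.
        ("num", StandardScaler(), [0, 1, 2]),
        # Columns 3-5 are categorical: one 0/1 column per category seen during fit.
        ("cat", OneHotEncoder(handle_unknown="ignore"), [3, 4, 5]),
    ],
    sparse_threshold=0,  # always return a dense array, which is easier to inspect
)

rows = [
    {"height_cm": 176, "weight_kg": 78, "age": 24, "gender": "male", "fit_preference": "regular", "build": "average"},
    {"height_cm": 162, "weight_kg": 55, "age": 19, "gender": "female", "fit_preference": "slim", "build": "lean"},
]
features = preprocess.fit_transform(rows_to_array(rows))
print(features.shape)  # 3 scaled numeric columns plus the one-hot columns
```

`handle_unknown="ignore"` makes a category that was never seen during training encode as all zeros instead of raising an error at prediction time.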

## 6) Dataset/DataLoader Construction

This project uses scikit-learn (not PyTorch), so we don’t need a DataLoader.

Our “dataset” is a list of Python dicts (`rows`).

## 7) Define Model Architecture

We use a **multinomial Logistic Regression** baseline (fast + interpretable) via scikit-learn.

Once real labeled data is available, Gradient Boosting is a reasonable next model to try; it often performs better on tabular data.
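
To make "multinomial" concrete: the model computes one linear score per size and turns the scores into probabilities with a softmax. The sketch below shows only that prediction step in plain Python. The weights and biases are placeholders supplied by the caller; scikit-learn learns the real ones inside `.fit()`.

```python
import math


def softmax(scores):
    """Convert per-class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(features, weights, biases):
    """One linear score per class (w . x + b), then softmax over the classes."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + bias
        for class_weights, bias in zip(weights, biases)
    ]
    return softmax(scores)
```

Because each class has its own weight vector, the learned weights can be read directly, which is where the interpretability of this baseline comes from.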

## 8) Define Loss, Metrics, Optimizer, Scheduler

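For scikit-learn Logistic Regression:
- the loss (log loss) and the solver are handled internally by `.fit()`, so there is no separate optimizer or scheduler to configure
- metrics are computed after training (accuracy + per-class precision/recall/F1)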
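
The per-class metrics listed above are simple counts over true and predicted labels. The sketch below spells them out in plain Python for illustration. The function names are made up for this example; in practice `sklearn.metrics.classification_report` and `accuracy_score` compute the same quantities.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Compute precision, recall and F1 for each label from two label sequences."""
    report = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        # Guard against division by zero when a label never occurs or is never predicted.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Per-class numbers matter here because a size model can look accurate overall while doing poorly on the rarer sizes.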


## 9) Training Loop (epochs, batching, checkpointing)

We train in one call to scikit-learn’s `.fit()` and save artifacts as a single `joblib` file in `app/model.joblib`.
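
The definition of `ModelArtifact` is in `train_model.py` and is not reproduced here. Judging only from the fields this notebook passes to it, it could look roughly like the dataclass below. `ModelArtifactSketch` is a hypothetical stand-in and the real class may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class ModelArtifactSketch:
    """Hypothetical stand-in for train_model.ModelArtifact (the real definition may differ)."""

    version: str
    trained_at: str
    labels: list[str]
    feature_schema: dict[str, Any]
    pipeline: Any
    metrics: dict[str, Any] = field(default_factory=dict)


artifact = ModelArtifactSketch(
    version="shirt-size-v1",
    trained_at="2024-01-01T00:00:00+00:00",  # placeholder timestamp
    labels=["S", "M", "L", "XL", "XXL"],
    feature_schema={"type": "object"},
    pipeline=None,  # the fitted scikit-learn pipeline would go here
)

# asdict() returns a plain dict; this is the kind of object handed to joblib.dump().
payload = asdict(artifact)
print(sorted(payload))
```

Saving a plain dict rather than the dataclass instance means the loader can read the file without importing the dataclass. Note that `asdict()` deep-copies field values, so the stored pipeline is a copy of the in-memory one.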

In [None]:
from dataclasses import asdict
from datetime import datetime, timezone

from train_model import ModelArtifact, train_model

pipeline, metrics = train_model(rows, y, seed=SEED)
print("Holdout accuracy:", metrics["accuracy"])

artifact = ModelArtifact(
    version="shirt-size-v1",
    trained_at=datetime.now(timezone.utc).isoformat(),
    labels=["S", "M", "L", "XL", "XXL"],
    feature_schema={
        "type": "object",
        "required": ["height_cm", "weight_kg", "age", "gender", "fit_preference", "build"],
        "properties": {
            "height_cm": {"type": "number", "minimum": 120, "maximum": 230},
            "weight_kg": {"type": "number", "minimum": 30, "maximum": 250},
            "age": {"type": "integer", "minimum": 10, "maximum": 100},
            "gender": {"type": "string", "enum": ["female", "male", "other"]},
            "fit_preference": {"type": "string", "enum": ["slim", "regular", "oversized"]},
            "build": {"type": "string", "enum": ["lean", "average", "athletic", "curvy"]},
        },
    },
    pipeline=pipeline,
    metrics=metrics,
)

MODEL_PATH.parent.mkdir(parents=True, exist_ok=True)
joblib.dump(asdict(artifact), MODEL_PATH)
print("Saved:", MODEL_PATH)

## 10) Evaluation Loop (metrics + confusion matrix/plots)

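We compute a row-normalized confusion matrix on a fresh stratified 80/20 split. This split is for visualization only and is separate from the one made inside `train_model()`.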
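
The plotting call in this section uses `normalize="true"`, which divides each row of the matrix by the number of samples with that true label. The plain-Python sketch below shows that computation; the function name is invented for this example.

```python
def normalized_confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predictions; each non-empty row sums to 1."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # A label with no true samples gets a row of zeros instead of a division error.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

Read this way, the diagonal entry of each row is the recall for that size, and off-diagonal entries show which neighbouring sizes it gets confused with.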

In [None]:
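import matplotlib.pyplot as plt
from sklearn.base import clone
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

# Index through a NumPy array so this works whether `y` is a list or an array.
y_arr = np.asarray(y)
idx = np.arange(len(rows))
train_idx, test_idx = train_test_split(idx, test_size=0.2, random_state=SEED, stratify=y_arr)

X_train = [rows[i] for i in train_idx]
X_test = [rows[i] for i in test_idx]
y_train = y_arr[train_idx]
y_test = y_arr[test_idx]

# Fit a fresh copy so the pipeline returned by train_model() is left untouched.
eval_pipeline = clone(pipeline)
eval_pipeline.fit(X_train, y_train)
y_pred = eval_pipeline.predict(X_test)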

disp = ConfusionMatrixDisplay.from_predictions(y_test, y_pred, labels=list(SIZES), normalize="true", cmap="Blues")
disp.ax_.set_title("Normalized confusion matrix")
plt.show()

## 11) Hyperparameter Tuning (small grid/random search)

A tiny grid search over the Logistic Regression regularization parameter `C` (smaller `C` means stronger regularization). This is intentionally small.
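
A single train/validation split can favour one value of `C` by chance. As a hedged alternative, the sketch below runs the same grid through `GridSearchCV`, which averages accuracy over several folds. It uses stand-in data from `make_classification` so that it runs on its own, and the step name `clf` is an assumption chosen to match the `clf__C` parameter used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in numeric data so the sketch is self-contained; it is not the shirt-size dataset.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)

# Illustrative pipeline; the real one comes from train_model.train_model().
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])

search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.2, 0.5, 1.0, 2.0]},
    cv=3,  # 3-fold cross-validation (stratified by default for classifiers)
    scoring="accuracy",
)
search.fit(X, y)

best_C = search.best_params_["clf__C"]
print("Best C:", best_C, "mean CV accuracy:", round(search.best_score_, 3))
```

The trade-off is run time: each candidate is fitted once per fold, which is still cheap for a four-value grid.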

In [None]:
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

C_grid = [0.2, 0.5, 1.0, 2.0]
results = []

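y_arr = np.asarray(y)  # works whether `y` is a list or an array
idx = np.arange(len(rows))
train_idx, val_idx = train_test_split(idx, test_size=0.2, random_state=SEED, stratify=y_arr)
X_train = [rows[i] for i in train_idx]
X_val = [rows[i] for i in val_idx]
y_train = y_arr[train_idx]
y_val = y_arr[val_idx]

for C in C_grid:
    model = clone(pipeline)  # unfitted copy with the same hyperparameters
    model.set_params(clf__C=C)  # assumes the classifier step is named "clf"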
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    results.append({"C": C, "val_accuracy": float(accuracy_score(y_val, pred))})

results_sorted = sorted(results, key=lambda r: r["val_accuracy"], reverse=True)
print("Top configs:")
for r in results_sorted:
    print(r)

best_C = results_sorted[0]["C"]
print("Best C:", best_C)

## 12) Save/Load Artifacts (model + preprocessors)

The API expects a single `joblib` file at `app/model.joblib` containing:
- preprocessing + classifier pipeline
- labels
- schema
- metadata (version, trained_at)
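
Since the API depends on these entries being present, a small check right after loading can fail fast with a clear message. The sketch below is one way to write it. The key names are taken from the fields passed to `ModelArtifact` earlier in this notebook, and both helper functions are hypothetical.

```python
# Key names assumed from the ModelArtifact fields used when saving.
REQUIRED_KEYS = ("pipeline", "labels", "feature_schema", "version", "trained_at")


def missing_artifact_keys(artifact: dict) -> list[str]:
    """Return the required keys that are absent from a loaded artifact dict."""
    return [key for key in REQUIRED_KEYS if key not in artifact]


def check_artifact(artifact: dict) -> None:
    """Raise a descriptive error if the artifact cannot serve predictions."""
    missing = missing_artifact_keys(artifact)
    if missing:
        raise ValueError(f"artifact is missing keys: {missing}")
    if not hasattr(artifact["pipeline"], "predict_proba"):
        raise TypeError("artifact['pipeline'] must expose predict_proba()")
```

Calling `check_artifact(joblib.load(MODEL_PATH))` at API start-up would surface a stale or partial file before the first request arrives.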

We can also reload it here to confirm everything works.

In [None]:
loaded = joblib.load(MODEL_PATH)
print("Loaded keys:", list(loaded.keys()))
print("Loaded version:", loaded.get("version"))
print("Loaded labels:", loaded.get("labels"))

loaded_pipeline = loaded["pipeline"]
print("Pipeline steps:", list(loaded_pipeline.named_steps.keys()))

## 13) Inference on New Samples (single + batch)

This mirrors what the FastAPI `/predict` endpoint does.

In [None]:
sample = {
    "height_cm": 176,
    "weight_kg": 78,
    "age": 24,
    "gender": "male",
    "fit_preference": "regular",
    "build": "average",
}

proba = loaded_pipeline.predict_proba([sample])[0]
labels = loaded_pipeline.classes_

probs = {str(lbl): float(p) for lbl, p in zip(labels, proba)}
recommended = max(probs.items(), key=lambda kv: kv[1])[0]

print("Recommended:", recommended)
print("Probabilities:")
for k, v in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  {k}: {v:.3f}")

batch = [
    sample,
    {"height_cm": 162, "weight_kg": 55, "age": 19, "gender": "female", "fit_preference": "slim", "build": "lean"},
    {"height_cm": 182, "weight_kg": 105, "age": 33, "gender": "male", "fit_preference": "regular", "build": "curvy"},
]
print("Batch predictions:", list(loaded_pipeline.predict(batch)))

## 14) Export Core Code to `.py` Module for Reuse

This repo already does that:
- training logic is in [train_model.py](MEM-shirt-size-predictor-model/train_model.py)
- picklable transformers are in [app/feature_utils.py](MEM-shirt-size-predictor-model/app/feature_utils.py)

The notebook imports and calls those functions so you don’t end up with two divergent implementations.

## 15) Add Unit Tests (smoke tests for data + model)

Below is the content of a minimal `pytest` file that you can copy into `tests/test_model_smoke.py` for automated checks.

(The cell only prints the file content; it does not create the file.)

In [None]:
print(r"""
# tests/test_model_smoke.py

import joblib

from train_model import generate_synthetic_dataset, train_model


def test_training_runs():
    rows, y = generate_synthetic_dataset(n=2000, seed=123)
    pipeline, metrics = train_model(rows, y, seed=123)
    assert metrics["accuracy"] > 0.5
    assert hasattr(pipeline, "predict")


def test_artifact_loads(tmp_path):
    rows, y = generate_synthetic_dataset(n=2000, seed=123)
    pipeline, _ = train_model(rows, y, seed=123)
    out = tmp_path / "model.joblib"
    joblib.dump({"pipeline": pipeline, "labels": ["S", "M", "L", "XL", "XXL"]}, out)
    loaded = joblib.load(out)
    assert "pipeline" in loaded
""")

## 16) VS Code Execution Cells + Output Pane Logging (structured logs)

Quick pattern: use Python’s `logging` module so output is consistent across cells and scripts.

In [None]:
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    force=True,  # replace handlers already installed by Jupyter or an earlier cell (Python 3.8+)
)
logger = logging.getLogger("mem-shirt-size")

logger.info("Notebook ready. Model path: %s", MODEL_PATH)
logger.info("Tip: run sections top-to-bottom.")