# Lesson 8 - ML pipelines: Structure and automation

**Assignment: Build a minimal training pipeline**

Instructions:
1. Create a small end-to-end pipeline in code
2. Save outputs and metrics
3. Keep short comments explaining design choices

## Task 1: Pipeline in code
Build a small end-to-end pipeline with preprocessing and a model.

In [1]:
# TODO: Build a scikit-learn Pipeline with:
# - StandardScaler
# - Model of choice (LogisticRegression or SVC)

# A pipeline is a sequence of steps that run in order.
# In ML, pipelines are commonly used for preprocessing steps
# such as standardization, transformation, and reshaping,
# but also for training.
#
# Today we build a pipeline that standardizes the data and creates a model.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Build the pipeline.
# It combines the preprocessing step
# and the creation of the model into a single run,
# which makes the same workflow easy to reproduce.
pipeline = Pipeline(
    steps = [
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000))
    ]
)
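
To make concrete what the pipeline bundles, here is a minimal sketch that compares it with the same two steps done by hand. It assumes the full Iris data is used for both fitting and predicting, purely to compare the two approaches, not to evaluate the model. The names `manual_model` and `pipe` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Manual version: fit the scaler, transform the data, then fit the model.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
manual_model = LogisticRegression(max_iter=1000).fit(X_scaled, y)
manual_preds = manual_model.predict(scaler.transform(X))

# Pipeline version: a single fit call runs the same two steps in order.
pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)
pipe.fit(X, y)
pipe_preds = pipe.predict(X)

# Both approaches should give identical predictions.
same_predictions = bool((manual_preds == pipe_preds).all())
print("Same predictions:", same_predictions)

# Fitted steps stay accessible by name, e.g. the per-feature means the scaler learned.
print(pipe.named_steps["scaler"].mean_)
```

The practical benefit is that the scaler can never be fitted on one dataset and accidentally applied with different settings elsewhere: the pipeline object carries both steps together.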




In [2]:
# TODO: Train and evaluate on a dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

data = load_iris(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25, random_state=42)


In [6]:
print("X_test shape:", X_test.shape)
print("X_train shape:", X_train.shape)

X_test shape: (38, 4)
X_train shape: (112, 4)


# See L4_assignment_CLASSROOM.ipynb for a detailed EDA:

[L4_assignment_CLASSROOM.ipynb](../L4/L4_assignment_CLASSROOM.ipynb)

[Web version](https://github.com/AndreasFurth/ML-Frameworks/blob/main/L4/L4_assignment_CLASSROOM.ipynb)

In [None]:
# Training:


# Fit the pipeline on the training data, then evaluate the model on the test set

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, preds),
    "f1_macro": f1_score(y_test, preds, average="macro")
}

print(metrics)


{'accuracy': 1.0, 'f1_macro': 1.0}
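
A perfect score on a 38-sample test set comes from one particular split. As a hedged complement, the sketch below runs 5-fold cross-validation on the same kind of pipeline to show how the score varies between folds. The choice of `cv=5` is an assumption that mirrors the value used with `GridSearchCV` later in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

# Each fold refits the whole pipeline, so the scaler only ever
# sees the training part of that fold.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```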


## Task 2: Automate training
Wrap the workflow into a reusable experiment function.

In [None]:
# TODO: Wrap training in a function run_experiment(config)

# A function like this risks growing very long.
# We have to balance generality
# against brevity/readability.
# If the goal is a reusable experiment function,
# the extra length is a pill we may have to swallow.
def run_experiment(config=None):
    # Build the defaults inside the function: a mutable default argument
    # would share the same estimator objects between calls.
    if config is None:
        config = {"scaler": StandardScaler(), "model": LogisticRegression(max_iter=1000), "params": None}

    # This pipeline is general,
    # BUT it may not be maximally convenient,
    # because the caller has to keep track of
    # and pass in every separate step themselves.
    pipeline = Pipeline(
        steps=[
            ("scaler", config["scaler"]),
            ("model", config["model"]),
        ]
    )

    # "params" holds hyperparameters, not a pipeline step, so apply it
    # with set_params (keys use the step prefix, e.g. "model__C").
    if config.get("params"):
        pipeline.set_params(**config["params"])

    return pipeline


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import get_scorer

def run_experiment(config):
    """
    Run a general ML experiment entirely from a config dict.

    config: dict
        {
            "X": feature matrix,
            "y": target vector,
            "test_size": float (optional, default=0.2),
            "random_state": int (optional, default=42),
            "preprocessing": list of (name, transformer) tuples (optional),
            "model": sklearn estimator,
            "params": param grid for GridSearchCV, keyed by step name, e.g. "model__C" (optional),
            "scoring": str or callable metric (optional, default='accuracy')
        }
    """
    
    # Extract from config with defaults
    X = config["X"]
    y = config["y"]
    test_size = config.get("test_size", 0.2)
    random_state = config.get("random_state", 42)
    preprocessing = config.get("preprocessing", [])
    model = config["model"]
    params = config.get("params", None)
    scoring = config.get("scoring", "accuracy")
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    
    # Build pipeline
    steps = preprocessing + [("model", model)]
    pipeline = Pipeline(steps=steps)
    
    # Wrap with GridSearchCV if params provided
    if params:
        pipeline = GridSearchCV(pipeline, params, cv=5, n_jobs=-1, scoring=scoring)
    
    # Fit
    pipeline.fit(X_train, y_train)
    
    # Evaluate
    scorer = get_scorer(scoring)
    score = scorer(pipeline, X_test, y_test)
    
    print(f"{scoring} on test set: {score:.4f}")
    
    return pipeline, score
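
The `params` entry in the config is handed to `GridSearchCV` together with a pipeline, so its keys have to follow the `<step name>__<parameter>` convention. The sketch below shows that convention on its own, outside `run_experiment`. The grid values for `C` are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

# "model__C" targets parameter C of the step named "model".
# The candidate values are illustrative only.
param_grid = {"model__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```

A key without the step prefix, such as plain `"C"`, would be rejected because the pipeline itself has no parameter with that name.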

In [None]:
# TODO: Save metrics to metrics.json
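
One possible way to approach this TODO, as a minimal sketch: the helper `save_metrics` is a hypothetical name, and the demo writes to a scratch directory rather than next to the notebook.

```python
import json
import tempfile
from pathlib import Path


def save_metrics(metrics, path):
    """Write a metrics dict to a JSON file and return the path."""
    path = Path(path)
    # Cast to float so NumPy scalars serialize cleanly.
    serializable = {name: float(value) for name, value in metrics.items()}
    path.write_text(json.dumps(serializable, indent=2))
    return path


# Assumption: a scratch directory keeps the demo tidy; use Path(".")
# to write metrics.json next to the notebook instead.
out_dir = Path(tempfile.mkdtemp())

# Values copied from the evaluation output earlier in this notebook.
metrics = {"accuracy": 1.0, "f1_macro": 1.0}

metrics_path = save_metrics(metrics, out_dir / "metrics.json")
print(metrics_path.read_text())
```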

In [None]:
# TODO: Save the trained model with joblib
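
A hedged sketch of persisting a fitted pipeline with `joblib.dump` and restoring it with `joblib.load`. The file name `model.joblib` and the scratch directory are assumptions for illustration. The model is fitted on the full Iris data only to have something to save.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
).fit(X, y)

# Assumption: scratch directory and file name chosen for the demo.
model_path = Path(tempfile.mkdtemp()) / "model.joblib"

# Saving the whole pipeline keeps the fitted scaler and model together.
joblib.dump(pipe, model_path)

# Reload and check that the restored pipeline predicts the same labels.
restored = joblib.load(model_path)
matches = bool((restored.predict(X) == pipe.predict(X)).all())
print("Restored model matches:", matches)
```

Only load joblib files from sources you trust, since loading a pickled object can execute arbitrary code.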

## Task 3: CLI (optional but recommended)
Parameterize runs and log chosen values.

In [None]:
# TODO: Use argparse to pass model params (e.g., C, max_iter)
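
A minimal sketch of the argparse part. The flag names `--C` and `--max-iter` follow the parameters mentioned in the TODO; the defaults and the helper name `parse_args` are assumptions. An explicit argument list is passed so the example also runs inside a notebook, where `sys.argv` contains Jupyter's own arguments.

```python
import argparse


def parse_args(argv=None):
    """Parse model hyperparameters from a list of CLI arguments."""
    parser = argparse.ArgumentParser(
        description="Train a LogisticRegression pipeline."
    )
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse regularization strength.")
    parser.add_argument("--max-iter", type=int, default=1000,
                        help="Maximum number of solver iterations.")
    # argv=None makes argparse read sys.argv, which is what a script wants.
    return parser.parse_args(argv)


# Explicit list for notebook use; in a script, call parse_args() instead.
args = parse_args(["--C", "0.5", "--max-iter", "200"])

# argparse turns "--max-iter" into the attribute name max_iter.
model_params = {"C": args.C, "max_iter": args.max_iter}
print(model_params)
```

The resulting dict can be unpacked straight into the estimator, for example `LogisticRegression(**model_params)`.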

In [None]:
# TODO: Log chosen params into metrics.json
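
One way to keep parameters and metrics together is to write them as two sections of the same JSON file. This is a sketch under assumptions: `log_run` is a hypothetical helper, and the parameter and metric values below are placeholders, not results from an actual run.

```python
import json
import tempfile
from pathlib import Path


def log_run(params, metrics, path):
    """Store the chosen params and the resulting metrics in one JSON file."""
    record = {"params": params, "metrics": metrics}
    Path(path).write_text(json.dumps(record, indent=2, sort_keys=True))
    return record


# Placeholder values for illustration only.
params = {"C": 0.5, "max_iter": 200}
metrics = {"accuracy": 0.0, "f1_macro": 0.0}

# Assumption: scratch directory; point this at metrics.json in the
# project folder for real runs.
out_path = Path(tempfile.mkdtemp()) / "metrics.json"
record = log_run(params, metrics, out_path)
print(out_path.read_text())
```

Storing the params next to the metrics means every result file documents how it was produced.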

In [None]:
print("Done! You created a small reproducible ML pipeline.")