# **Machine Learning For Beginners**

## 1. Foundations You Actually Need

| Goal                                           | What you’ll cover                                                                                      | Resources                                                 |
| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------ | --------------------------------------------------------- |
| **Python for Data**                            | Installing Anaconda / Miniconda, Jupyter, VS Code; the *pandas* & *NumPy* 101 you really need          | Official tutorials + my cheat-sheet                       |
| **Math Refresher**                             | Vectors & matrices (linear algebra), derivatives (very light calculus), basic probability & statistics | 3Blue1Brown “Essence of LA”, Khan Academy micro-playlists |
| **Version Control** (optional but recommended) | Git basics, GitHub repo setup for your projects                                                        | Try Git interactive tutorial                              |


## 2. Core Machine-Learning Concepts

What counts as “learning”? Datasets, features vs labels

Train / validation / test splits, random vs time-series splitting

Loss functions and why optimisation ≠ memorisation

Bias–variance trade-off (your first mental model of overfitting)

## 3. Supervised Learning — Regression & Classification

| Algorithm                       | Why it’s useful                          | Key hyper-params            | scikit-learn class                       |
| ------------------------------- | ---------------------------------------- | --------------------------- | ---------------------------------------- |
| Linear & Logistic Regression    | Gold-standard baselines                  | regularisation strength     | `LinearRegression`, `LogisticRegression` |
| k-Nearest Neighbours            | Intuitive, non-parametric                | `n_neighbors`               | `KNeighborsClassifier`/`Regressor`       |
| Decision Trees & Random Forests | Handle non-linear relations, little prep | `max_depth`, `n_estimators` | `DecisionTree*`, `RandomForest*`         |
| Gradient Boosting / XGBoost     | State-of-the-art tabular                 | learning rate, trees        | `XGBClassifier`/`Regressor`              |


## 4. Unsupervised Learning

Clustering: k-Means, DBSCAN, Agglomerative

Dimensionality Reduction: PCA, t-SNE, UMAP

## 5. Model Evaluation & Tuning

Metrics cheat-sheet (RMSE, MAE, Accuracy, Precision-Recall, ROC-AUC)

Cross-validation & nested CV

Hyper-parameter search (Grid, Random, Bayesian)

Intro to pipelines so you stop leaking data

## 6. Feature Engineering & Data Wrangling

Handling missing data, text, dates, categories

Encoding schemes (one-hot, ordinal, target, embeddings)

Feature selection & importance (permutation, SHAP)

## 7. Neural Networks & Deep Learning (Starter Pack)

How a perceptron becomes a deep net

Popular architectures: MLP → CNN → RNN/Transformer basics

Framework tour: Keras/TensorFlow vs PyTorch (we’ll pick one)

## 8. Model Deployment & MLOps Basics

Saving models (joblib, pickle, ONNX)

Serving with FastAPI or Flask (containerised via Docker)

Monitoring drift & performance in production

## 9. Ethics, Fairness & Interpretability

Data bias, fairness metrics, responsible AI guidelines

Explainers: LIME, SHAP

Privacy (GDPR basics, differential privacy intro)

### Applying Models to Real Files — Your “Universal Pattern”

Below is the repeatable recipe we’ll use again and again (a template for now: swap in your own file name and target column; we’ll flesh out real examples when you’re ready to run them).

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 1. Load
df = pd.read_csv("your_data.csv")          # or read_excel, read_json, read_sql...

# 2. Define X & y
X = df.drop("target", axis=1)
y = df["target"]

# 3. Train/val split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)   # fixed seed = reproducible split

# 4. Preprocess
num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(include="object").columns

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)
])

# 5. Model
model = RandomForestClassifier(n_estimators=300, random_state=42)

# 6. Pipeline = preprocess + model
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", model)
])

pipe.fit(X_train, y_train)

# 7. Evaluate
preds = pipe.predict(X_val)
print(classification_report(y_val, preds))

# Module 1 – Set-up & “Python for Data” from Absolute Zero

### 1. Pick & install your Python distribution

| Option                       | Best for                                                     | Quick steps                                                                                                                                                                                                                |
| ---------------------------- | ------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Anaconda (≈ 3 GB)**        | You want “batteries-included” and a GUI launcher             | 1. Download the **Anaconda 3** installer for your OS from [https://www.anaconda.com/download](https://www.anaconda.com/download). <br>2. Run the installer → leave *Add Anaconda to PATH* unticked if offered (the Windows installer marks it as not recommended) and launch from Anaconda Prompt or Navigator instead. |
| **Miniconda (≈ 80 MB)**      | You prefer lightweight installs & only the packages you need | 1. Grab **Miniconda** from [https://docs.conda.io/en/latest/miniconda.html](https://docs.conda.io/en/latest/miniconda.html). <br>2. Run installer → accept defaults.                                                       |
| **System Python + pip/venv** | You already use Homebrew / apt / Chocolatey                  | 1. Install Python 3.10+ via your package manager. <br>2. Run `python -m venv ml-env`, then `source ml-env/bin/activate` (Linux/macOS) or `.\ml-env\Scripts\activate` (Win).                                                |


### 2. Create a clean ML environment (conda users)

In [None]:
# Open Terminal (macOS/Linux) or Anaconda Prompt / PowerShell (Windows)
conda update -n base -c defaults conda             # keep conda itself current
conda create -n ml101 python=3.11 numpy pandas jupyterlab matplotlib seaborn scikit-learn
conda activate ml101

(If you went with venv/pip, just run `pip install numpy pandas jupyterlab matplotlib seaborn scikit-learn` inside the virtual-env.)

### 3. Launch JupyterLab

In [None]:
jupyter lab              # opens JupyterLab in your browser
# Alternatives, if the classic notebook package is installed:
# jupyter notebook
# python -m notebook

### 4. Your first 15 lines of “data Python”

In [None]:
import pandas as pd

# 1-liner: download a tiny CSV (the Auto MPG fuel-efficiency sample from seaborn-data) straight from GitHub
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
cars = pd.read_csv(url)

cars.head()          # show first 5 rows

In [None]:
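# A notebook cell only displays its last expression, so print each result
print(cars.shape)                       # (rows, columns)
print(cars.describe())                  # numeric summary stats
print(cars['origin'].value_counts())    # rows per region of origin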

### 5. Mini-exercise (15 min)

Pick a dataset you care about (CSV, Excel, JSON… anything tabular).
• Example sources: Kaggle “Titanic”, UCI Machine Learning Repository, your own spreadsheets.

Load it into a new notebook just like we did with cars.

Answer three basic questions about it, e.g.
How many rows? • Which columns are numeric vs. text? • What’s the mean of one interesting column?

(Optional but recommended) Push the notebook to a GitHub repo called ml-learning-journey.

# Module 2 – Core Machine-Learning Concepts

### 1. What “learning” means in ML

- Data = rows of examples. Each row is called an instance / sample / observation.

- Features (X) are the input columns we give to the algorithm.

- Target / label (y) is the output column we want it to predict.

- Supervised learning means we already know y for past data and train a model to map X → y.

In [None]:
X = cars.drop('mpg', axis=1)   # features
y = cars['mpg']                # target
print(X.columns.tolist()[:5], '...', len(X.columns), 'features total')

### 2. Train / Validation / Test split (why & how)

| Split          | Size (rule-of-thumb) | Purpose                               | Kept hidden from…                      |
| -------------- | -------------------- | ------------------------------------- | -------------------------------------- |
| **Train**      | 60-80 %              | Fit the model                         | Nobody                                 |
| **Validation** | 10-20 %              | Tune hyper-parameters, early stopping | The model (during fit)                 |
| **Test**       | 10-20 %              | Final, unbiased report                | You, until *everything* else is frozen |


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_temp, y_train, y_temp = train_test_split(
        X, y, test_size=0.3, random_state=42)        # 70 % train

X_val, X_test, y_val, y_test = train_test_split(
        X_temp, y_temp, test_size=0.5, random_state=42)  # 15 % + 15 %

Why `random_state`? It lets others reproduce your exact split.
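
To make the reproducibility point concrete, here is a minimal sketch on made-up toy data (the array `data` and the seed values are assumptions for illustration only). It checks that two calls with the same seed return the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)   # toy data, invented for this illustration

# Same seed -> identical split on every call
a_train, a_test = train_test_split(data, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.3, random_state=42)
print("same seed, same test rows:     ", np.array_equal(a_test, b_test))

# A different seed will usually pick different rows
c_train, c_test = train_test_split(data, test_size=0.3, random_state=7)
print("different seed, same test rows:", np.array_equal(a_test, c_test))
```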

### 3. Loss functions ≠ evaluation metrics

| Task                  | Typical **loss** (optimised during training) | Typical **metric** (reported to humans) |
| --------------------- | -------------------------------------------- | --------------------------------------- |
| Regression            | Mean-Squared-Error (MSE)                     | RMSE, MAE, R²                           |
| Binary classification | Log-loss (a.k.a. cross-entropy)              | Accuracy, F1, ROC-AUC                   |
| Multiclass            | Categorical cross-entropy                    | Accuracy, macro-F1                      |


**Rule: the algorithm never sees your metric; it only minimises loss.**

That’s why you occasionally get models that score great on loss but mediocre on your business KPI—choose the right metric and monitor both.
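
A small sketch of the loss-versus-metric gap, using two hypothetical sets of predicted probabilities (the numbers are invented for illustration). Both "models" classify every example correctly at a 0.5 threshold, so accuracy cannot tell them apart, yet their log-loss differs clearly.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = np.array([1, 1, 0, 0])

# Predicted probability of class 1 from two hypothetical models
proba_confident = np.array([0.95, 0.90, 0.05, 0.10])
proba_hesitant = np.array([0.55, 0.60, 0.45, 0.40])

for name, proba in [("confident", proba_confident), ("hesitant", proba_hesitant)]:
    acc = accuracy_score(y_true, (proba >= 0.5).astype(int))  # metric: thresholded
    ll = log_loss(y_true, proba)                              # loss: uses raw probabilities
    print(f"{name:10s} accuracy: {acc:.2f}   log-loss: {ll:.3f}")
```

The metric only looks at which side of the threshold a prediction falls on, while the loss also rewards how confident the correct predictions are.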

### 4. Bias–Variance Trade-off (your first mental model)

Imagine fitting dots with a line vs a wiggly curve:

- High bias = model too simple → under-fits, large error on both train & val.

- High variance = model too complex → memorises train set, but val error explodes.

Goal: sit in the “Goldilocks” zone where train error is low and the gap to val error is small.
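
The line-versus-wiggly-curve picture can be sketched with polynomial fits on synthetic data. Everything here (the noisy sine wave, the 30/30 split, the degrees 1, 3 and 10) is an assumption chosen for illustration; exact numbers will vary with the seed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy sine wave, split into 30 train and 30 validation points
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)
x_train, y_train = x[:30], y[:30]
x_val, y_val = x[30:], y[30:]

def rmse(y_true, y_pred):
    """Root-mean-squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

results = {}
for degree in (1, 3, 10):            # straight line, gentle curve, wiggly curve
    coefs = np.polyfit(x_train, y_train, degree)
    results[degree] = (rmse(y_train, np.polyval(coefs, x_train)),
                       rmse(y_val, np.polyval(coefs, x_val)))
    print(f"degree {degree:2d}  RMSE train: {results[degree][0]:.3f}"
          f"  val: {results[degree][1]:.3f}")
```

Train error can only go down as the degree grows; the thing to watch is whether the gap between train and validation error widens.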

### 5. Mini-project – “Hello, Overfitting!”

Dataset: the same cars CSV.

Objective: Predict mpg (fuel efficiency).

### **Steps**

#### Quick-and-dirty numeric-only subset

In [None]:
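numeric_cols = cars.select_dtypes('number').columns
cars_num = cars[numeric_cols].dropna()   # horsepower has a few missing values; most sklearn models reject NaN
X_num = cars_num.drop('mpg', axis=1)
y = cars_num['mpg']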

#### Create three models of increasing complexity

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

models = {
    "Linear": LinearRegression(),
    "Tree_depth2": DecisionTreeRegressor(max_depth=2, random_state=0),
    "RandomForest_100": RandomForestRegressor(n_estimators=100, random_state=0)
}

#### Train/val split (80 % / 20 %).

#### Fit & print RMSE on both splits for each model.

In [None]:
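from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np

# 80 % / 20 % split of the numeric-only features from the first step
X_train, X_val, y_train, y_val = train_test_split(
        X_num, y, test_size=0.2, random_state=42)

for name, m in models.items():
    m.fit(X_train, y_train)
    rmse_train = np.sqrt(mean_squared_error(y_train, m.predict(X_train)))
    rmse_val   = np.sqrt(mean_squared_error(y_val,   m.predict(X_val)))
    print(f"{name:15}  RMSE train: {rmse_train:5.2f}  val: {rmse_val:5.2f}")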

#### Interpret

Which model is high-bias? (look for big errors everywhere)

Which is high-variance? (tiny train error, much worse val error)

### 6. (Quick) Metrics cheat-sheet

**Regression**

- RMSE = √MSE → penalises large errors

- MAE = mean |error| → more robust to outliers

- R² = 1 − (SS_res / SS_tot) → fraction of variance explained
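
As a quick check on the three regression formulas, this sketch computes them by hand on four made-up values and compares the results with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

err = y_true - y_pred
rmse = np.sqrt(np.mean(err ** 2))                 # square root of MSE
mae = np.mean(np.abs(err))                        # mean absolute error
ss_res = np.sum(err ** 2)                         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print(f"RMSE {rmse:.3f}  MAE {mae:.3f}  R2 {r2:.3f}")
print("sklearn agrees:",
      np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred))),
      np.isclose(mae, mean_absolute_error(y_true, y_pred)),
      np.isclose(r2, r2_score(y_true, y_pred)))
```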

**Classification**

- Accuracy = (TP+TN)/all

- Precision = TP/(TP+FP) (“When I say spam, how often am I right?”)

- Recall = TP/(TP+FN) (“How many actual spams did I catch?”)

- F1 = harmonic mean of precision & recall (balance)

- ROC-AUC = probability a random positive ranks above a random negative
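
The classification formulas can be verified the same way. The ten labels and scores below are invented for illustration; the sketch derives each metric from the confusion-matrix counts and computes ROC-AUC directly as the share of (positive, negative) pairs that are ranked correctly.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Invented labels and model scores
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.35, 0.2, 0.1, 0.05])
y_pred = (y_score >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# ROC-AUC as a ranking probability: compare every positive with every negative
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
auc_pairs = np.mean(pos[:, None] > neg[None, :])

print(f"accuracy {accuracy:.2f}  precision {precision:.2f}  "
      f"recall {recall:.2f}  F1 {f1:.2f}  AUC {auc_pairs:.3f}")
```

Note that accuracy, precision, recall and F1 depend on the 0.5 threshold, while ROC-AUC only depends on the ordering of the scores.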

### 7. Stretch: Cross-validation in one line

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingRegressor

gb = GradientBoostingRegressor(random_state=0)
rmse_scores = -cross_val_score(gb, X_num, y,
                               cv=5,       # 5-fold
                               scoring='neg_root_mean_squared_error')
print("5-fold RMSE:", rmse_scores.round(2), "mean", rmse_scores.mean().round(2))

### 8. You’re ready for Module 3

Move on once you’ve:

✔️ split data,

✔️ observed bias/variance, and

✔️ calculated at least one metric on train & val.

# Module 3 – First Real Models: Supervised Learning

### 1. Why start with Linear & Logistic Regression?

They’re the simplest learners that already illustrate training ⇢ evaluation ⇢ interpretation. Everything you meet later (trees, boosting, neural nets) is just a fancier way to:

- Define a score to minimise (loss)

- Search parameter space for the minimum

- Generalise to new data

### 2. Linear Regression (predicting numbers)

| Idea            | One-sentence reminder                                              | In `scikit-learn`    |
| --------------- | ------------------------------------------------------------------ | -------------------- |
| **Model**       | ŷ = β₀ + β₁x₁ + … + βₖxₖ                                           | `LinearRegression()` |
| **Loss**        | Minimise Mean-Squared-Error                                        | built-in             |
| **Assumptions** | roughly linear relation, independent errors, low multicollinearity | check residual plots |

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

X_train, X_val, y_train, y_val = train_test_split(X_num, y, test_size=0.2, random_state=42)

lin = LinearRegression()
lin.fit(X_train, y_train)

rmse = np.sqrt(mean_squared_error(y_val, lin.predict(X_val)))
print(f"RMSE: {rmse:.2f}")

# peek at coefficients
for name, coef in zip(X_num.columns, lin.coef_):
    print(f"{name:15s} {coef:8.3f}")

Exercise: plot predicted vs. actual; perfect predictions would lie on the 45° line.

### 3. Logistic Regression (predicting classes)

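Same linear score as in linear regression, but passed through the sigmoid so the output is a probability between 0 and 1; the model is fitted by minimising log-loss instead of MSE.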
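
A minimal sketch of the squashing step, assuming a one-feature model with hypothetical coefficients `b0` and `b1` (the values are chosen only to make the output easy to read).

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients: linear score z = b0 + b1 * x
b0, b1 = -4.0, 2.0
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

z = b0 + b1 * x      # same linear part as linear regression
p = sigmoid(z)       # probability of the positive class

for xi, zi, pi in zip(x, z, p):
    print(f"x = {xi:.0f}   score = {zi:+.0f}   P(class 1) = {pi:.3f}")
```

A score of 0 maps to a probability of exactly 0.5, which is where the default decision boundary sits.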

In [None]:
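from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Logistic regression needs class labels, and mpg is continuous.
# Assumption for illustration: label a car "efficient" (1) if its mpg is above the median.
y_cls = (y > y.median()).astype(int)
Xc_train, Xc_val, yc_train, yc_val = train_test_split(
        X_num, y_cls, test_size=0.2, stratify=y_cls, random_state=42)

logit = LogisticRegression(max_iter=1000)      # default solver = LBFGS
logit.fit(Xc_train, yc_train)

y_pred = logit.predict(Xc_val)
print("Accuracy:", accuracy_score(yc_val, y_pred))
print(confusion_matrix(yc_val, y_pred))
print(classification_report(yc_val, y_pred))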

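Key hyper-parameters

- `penalty` = `"l2"` (ridge) or `"l1"` (lasso). The default LBFGS solver only supports `"l2"`; for `"l1"` switch to `solver="liblinear"` or `solver="saga"`.

- `C` = inverse regularisation strength (smaller = stronger penalty)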
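
To see what "inverse regularisation strength" means in practice, this sketch fits the same model with three values of `C` and reports the size of the coefficient vector. The breast-cancer dataset bundled with scikit-learn and the specific `C` values are assumptions picked for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_std = StandardScaler().fit_transform(X)   # penalties assume comparable feature scales

# L2 penalty: smaller C = stronger penalty = smaller coefficients
norms = {}
for C in (0.01, 1.0, 100.0):
    clf = LogisticRegression(C=C, max_iter=5000).fit(X_std, y)
    norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C = {C:6.2f}   L2 norm of coefficients: {norms[C]:.2f}")

# L1 penalty needs a solver that supports it, and pushes some coefficients to exactly 0
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_std, y)
n_zero = int((l1.coef_ == 0).sum())
print(f"L1 penalty zeroed {n_zero} of {l1.coef_.size} coefficients")
```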

### 4. Three Non-parametric Baselines

| Algorithm                | When it shines                             | `sklearn` class                  | Must-tune                                   |
| ------------------------ | ------------------------------------------ | -------------------------------- | ------------------------------------------- |
| **k-Nearest-Neighbours** | data is small, decision boundary irregular | `KNeighborsClassifier/Regressor` | `n_neighbors`, distance metric              |
| **Decision Tree**        | easy interpretability, handles mixed types | `DecisionTree*`                  | `max_depth`, `min_samples_leaf`             |
| **Random Forest**        | strong default on tabular data             | `RandomForest*`                  | `n_estimators`, `max_depth`, `max_features` |


**Scaling note** – KNN needs scaled features; trees/forests do not.
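
The scaling note is easy to check empirically. This sketch uses the wine dataset bundled with scikit-learn (an assumption chosen because its features live on very different scales) and compares 5-fold accuracy of the same KNN model with and without standardisation.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Raw features: distances are dominated by the columns with the largest units
raw = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean()

# Scaler inside a pipeline, so it is re-fitted on each training fold
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled = cross_val_score(scaled_knn, X, y, cv=5).mean()

print(f"KNN accuracy, raw features:    {raw:.3f}")
print(f"KNN accuracy, scaled features: {scaled:.3f}")
```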

### 5. Gradient Boosting / XGBoost

Ensemble of shallow trees trained sequentially; each new tree fixes the predecessor’s errors.
    
Why you care: a frequent winner in Kaggle competitions on tabular data; XGBoost handles missing values natively; tree splits are unaffected by feature scaling.

In [None]:
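# Requires the separate xgboost package: pip install xgboost
from xgboost import XGBClassifier
xgb = XGBClassifier(
        n_estimators=300,       # trees
        learning_rate=0.05,
        max_depth=4,
        subsample=0.7,          # fraction of rows sampled per tree
        colsample_bytree=0.8,   # fraction of columns sampled per tree
        random_state=42
)
# XGBClassifier expects integer class labels (0, 1, ...), not a continuous target like mpg
xgb.fit(X_train, y_train)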

Tune **learning_rate ↔ n_estimators** together (small LR needs more trees).

### 6. End-to-End Example: Titanic Survival

#### Get the data

In [None]:
import pandas as pd
import seaborn as sns
df = sns.load_dataset("titanic")     # one-liner; or use Kaggle CSV

#### Define target & basic preprocessing pipeline

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

y = df["survived"]
# "alive" is the target again as yes/no, so keeping it would leak the answer
X = df.drop(columns=["survived", "alive"])

num_cols = X.select_dtypes(include="number").columns
cat_cols = X.select_dtypes(exclude="number").columns

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler())
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, num_cols),
    ("cat", categorical_pipe, cat_cols)
])

#### Swap in any model

In [None]:
models = {
    "Logistic":   LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
    "XGB":        XGBClassifier(n_estimators=300, learning_rate=0.05,
                                max_depth=4, subsample=0.7,
                                colsample_bytree=0.8, random_state=0)
}

from sklearn.model_selection import train_test_split
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=0)

for name, model in models.items():
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    pipe.fit(X_tr, y_tr)
    acc = pipe.score(X_val, y_val)
    print(f"{name:12s} accuracy: {acc:.3f}")

#### Interpret results

Use `classification_report` for precision/recall; for tree-based models print `feature_importances_`.
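
Here is a minimal sketch of both interpretation steps on a synthetic dataset. The data and the column names (`signal_*`, `noise_*`) are made up so that it is known in advance which features carry information; on real data you would read the importances without that safety net.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic data: with shuffle=False the informative columns come first
cols = ["signal_0", "signal_1", "noise_0", "noise_1", "noise_2", "noise_3"]
X_arr, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                               n_redundant=0, shuffle=False, random_state=0)
X = pd.DataFrame(X_arr, columns=cols)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2,
                                            stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Per-class precision / recall / F1
print(classification_report(y_val, rf.predict(X_val)))

# Impurity-based importances, highest first (they sum to 1)
importances = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(importances.round(3))
```

Inside a `Pipeline`, the fitted model is reachable through `pipe.named_steps["model"]`, and its importances refer to the transformed (for example one-hot encoded) columns.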

### 7. Mini-project (your turn)

Goal: build & compare at least two models on a CSV you choose (ideas: loan default, heart-disease, customer churn).

**Checklist**

- Load CSV into df

- Identify target column and split features/label

- Build preprocess with numeric+categorical pipelines

- Train (a) Logistic Regression (baseline) and (b) Random Forest or XGBoost (stronger)

- Report accuracy + confusion matrix + classification report

- Briefly explain which features matter most (coef or importances)

### 8. What’s Next

Once you’re comfortable training, evaluating, and interpreting these first models we’ll move on to Module 4: Unsupervised Learning (clustering & dimensionality reduction) and tackle data without labels.

Just ping me when:

- you finish the mini-project (even partially), or

- something breaks and you need help.

# Module 4 – Unsupervised Learning

### 1. Why “unsupervised”?

No target column y to guide the algorithm—just raw features X.

You ask your model to discover structure (clusters, manifolds, directions of maximum variance).
Typical use-cases: customer segmentation, anomaly detection, data exploration & visualisation.

### 2. Clustering Algorithms in One Glance

| Algorithm                        | Intuition                                         | Pros                                           | Watch-outs                                            | `sklearn` class           | Must-tune                                         |
| -------------------------------- | ------------------------------------------------- | ---------------------------------------------- | ----------------------------------------------------- | ------------------------- | ------------------------------------------------- |
| **k-Means**                      | “Find *k* centroids, assign points to nearest.”   | Fast, scales to 10⁶ rows                       | Needs k; hates outliers & non-spherical shapes        | `KMeans`                  | `n_clusters`, `init`, `n_init`                    |
| **DBSCAN**                       | “Core points in dense areas → expand clusters.”   | Detects arbitrary shapes, auto-finds #clusters | Sensitive to ε & minPts; struggles in varying density | `DBSCAN`                  | `eps`, `min_samples`                              |
| **Agglomerative (Hierarchical)** | “Start as singletons, iteratively merge closest.” | Dendrogram = visual story; no k upfront        | O(n²) memory; large datasets heavy                    | `AgglomerativeClustering` | `n_clusters` *or* `distance_threshold`, `linkage` |


**Rule of thumb**

If you know roughly how many segments you want, start with k-Means.

If you suspect weird shapes / noise, try DBSCAN.

If you need a hierarchy, go agglomerative.
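
To illustrate the "weird shapes" case, this sketch runs k-Means and DBSCAN on two interleaved half-moons and scores each against the known grouping with the adjusted Rand index. The dataset, `eps=0.3` and `min_samples=5` are assumptions chosen for this toy example, not recommended defaults.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved half-moons: clearly two groups, but not spherical ones
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
X_std = StandardScaler().fit_transform(X)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_std)   # label -1 marks noise

ari_km = adjusted_rand_score(y_true, km_labels)   # 1.0 = perfect agreement
ari_db = adjusted_rand_score(y_true, db_labels)

print(f"k-Means ARI: {ari_km:.2f}")
print(f"DBSCAN  ARI: {ari_db:.2f}   clusters found: {len(set(db_labels) - {-1})}")
```

k-Means cuts the plane with a straight boundary, so it splits each moon; DBSCAN follows the dense curves instead.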

#### 2.1 k-Means in Code (Iris example)

In [None]:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import pandas as pd

iris   = load_iris(as_frame=True)
X      = iris.data
scaler = StandardScaler()
X_std  = scaler.fit_transform(X)

km = KMeans(n_clusters=3, n_init='auto', random_state=42)
labels = km.fit_predict(X_std)

print("Silhouette:", silhouette_score(X_std, labels).round(3))
pd.crosstab(labels, iris.target)       # compare to true species (just for curiosity)

### 3. Dimensionality Reduction

#### 3.1 PCA – the workhorse

Goal: find orthogonal directions (principal components) that capture maximal variance.

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=0)
X_pca = pca.fit_transform(X_std)
print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))

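**Rule:** standardise numeric data before PCA whenever features are on different scales; otherwise the feature with the largest variance dominates the components.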
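
The effect of skipping standardisation shows up directly in the explained-variance ratios. This sketch uses the wine dataset bundled with scikit-learn, chosen here (as an assumption for illustration) because one of its columns has far larger values than the rest.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data

# PCA on raw features: the column with the biggest numbers soaks up the variance
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA on standardised features: every column starts with unit variance
X_std = StandardScaler().fit_transform(X)
std_ratio = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print("raw features:         ", raw_ratio.round(3))
print("standardised features:", std_ratio.round(3))
```

On the raw data the first component is little more than a copy of the largest-scale column, which is rarely the structure you are looking for.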

#### 3.2 Visualise clusters after PCA

In [None]:
import matplotlib.pyplot as plt

plt.scatter(X_pca[:,0], X_pca[:,1], c=labels, s=30)
plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.title("k-Means clusters on Iris")
plt.show()

(Don’t worry if PC 1 + PC 2 explain less than 80 % of the variance; for a 2-D plot, visual clarity is what you want.)

#### 3.3 t-SNE & UMAP – for non-linear manifolds

| Method    | Use when                                    | Notes                                                                                       |
| --------- | ------------------------------------------- | ------------------------------------------------------------------------------------------- |
| **t-SNE** | you only care about 2-D/3-D visual clusters | Great visuals, but slow > 10 k rows; distances **not** meaningful beyond nearest neighbours |
| **UMAP**  | you want speed & can keep more dimensions   | Preserves global & local structure better; supports `n_components>3`                        |


In [None]:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=40, random_state=0)
X_tsne = tsne.fit_transform(X_std)

### 4. Putting It All Together – Typical Workflow

In [None]:
# 0. Load / clean
df = pd.read_csv("your_data.csv")
X  = df.select_dtypes(include="number")    # or mix in encoded categoricals

# 1. Scale
X_std = StandardScaler().fit_transform(X)

# 2. Pick k with the silhouette score (higher is better)
sil_scores = {}
for k in range(2, 11):
    lab = KMeans(n_clusters=k, n_init='auto', random_state=0).fit_predict(X_std)
    sil_scores[k] = silhouette_score(X_std, lab)
best_k = max(sil_scores, key=sil_scores.get)

# 3. Final model
kmeans = KMeans(n_clusters=best_k, n_init='auto', random_state=0)
cluster_labels = kmeans.fit_predict(X_std)

# 4. Attach labels back
df["cluster"] = cluster_labels
df.groupby("cluster").mean(numeric_only=True).round(1)      # cluster profile (numeric columns only)

### 5. Mini-Project – Customer Segmentation

Dataset suggestion: “Mall Customers” (CSV, 200 rows, Age / Annual Income / Spending Score).

Download from https://github.com/vincentarelbundock/Rdatasets/raw/master/csv/datasets/mall.csv (if that link does not resolve, the same data is on Kaggle as “Mall Customer Segmentation Data”), or use your own e-commerce data.

**Steps**

- EDA – check missing values, basic stats.

- Scale numeric columns.

- Find k: plot the elbow (inertia vs. k) and compute the silhouette for k = 2…10.

- Fit k-Means with the best k.

- Visualise clusters in 2-D with PCA or t-SNE.

- Profile clusters – mean age, income, spend; label groups (e.g. “High-income low-spend”).

- (Optional) Try DBSCAN—does it split any cluster further or flag outliers?
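
For the elbow step, a minimal sketch on synthetic blobs (four made-up clusters stand in for the customer data) collects the inertia for each k. Printing the values is enough to spot the bend; plotting `inertia` against k with matplotlib gives the usual elbow chart.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for customer data: 4 blobs in 2-D
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)
X_std = StandardScaler().fit_transform(X)

inertia = {}
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_std)
    inertia[k] = km.inertia_      # sum of squared distances to the nearest centroid
    print(f"k = {k}   inertia = {inertia[k]:8.1f}")
```

Inertia always falls as k grows, so look for the k after which the drop flattens out rather than for the minimum.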

### 6. Where This Fits in the Bigger Picture

You now know **two toolkits:**

Supervised → learn from labels (Modules 2–3).

Unsupervised → explore structure without labels (Module 4).

Real projects bounce between both: cluster customers → build a supervised model to predict cluster membership for new sign-ups, etc.

# Module 5 – Model Evaluation & Hyper-parameter Tuning

### 1. Why this matters

A model’s default settings are rarely optimal.
    
If you don’t measure properly you’ll:

- Over-estimate performance (data leakage, cherry-picked split).

- Waste hours tuning the wrong thing (e.g. maximising accuracy when the business cares about recall).

Module 5 gives you the repeatable recipe:

split ➜ cross-validate ➜ tune ➜ lock the test-set ➜ report

### 2. Cross-validation (CV) essentials

| Strategy              | When to use                                                  | `sklearn` splitter | Typical folds |
| --------------------- | ------------------------------------------------------------ | ------------------ | ------------- |
| **k-Fold**            | i.i.d. tabular data, regression targets                      | `KFold` (default for regressors when `cv` is an int)            | 5 or 10       |
| **Stratified k-Fold** | classification, especially with imbalanced classes           | `StratifiedKFold` (default for classifiers when `cv` is an int) | 5 or 10       |
| **Group k-Fold**      | samples grouped by user / product, no leakage between groups | `GroupKFold`       | 5 (at most #groups) |
| **TimeSeriesSplit**   | chronological data                                           | `TimeSeriesSplit`  | 3–5           |


In [None]:
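from sklearn.model_selection import cross_val_score, StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# model: any classifier or Pipeline; X, y: features and binary labels
# (e.g. the Titanic X, y from Module 3), since "roc_auc" here assumes two classes
scores = cross_val_score(model, X, y,
                         cv=cv,
                         scoring="roc_auc")
print("AUC per fold:", scores.round(3),
      "mean:", scores.mean().round(3))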

**Rule of thumb (a rough heuristic, not a formal test):** if the fold-to-fold std. dev exceeds about 0.02 on AUC/Accuracy, treat the estimate as unstable → get more data or try a simpler model.
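
To see how the splitters in the table differ, this sketch prints the row indices each one produces on twelve dummy rows. The `groups` array is hypothetical (four users with three rows each).

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)     # 12 dummy rows, assumed to be in time order

# TimeSeriesSplit: training rows always come before validation rows
ts_folds = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, val_idx in ts_folds:
    print("time-series  train:", train_idx, " val:", val_idx)

# GroupKFold: all rows of one group land on the same side of the split
groups = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3])   # hypothetical user ids
g_folds = list(GroupKFold(n_splits=4).split(X, groups=groups))
for train_idx, val_idx in g_folds:
    print("group k-fold val groups:", sorted(set(groups[val_idx])))
```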

### 3. Data-leakage killers: Pipelines

Without a pipeline you risk “peeking” at validation data during scaling / imputation.

Pipeline & ColumnTransformer ensure every CV fold runs:

- Fit transforms on train-fold only

- Transform both train & val folds

- Fit model on transformed train-fold

In [None]:
from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ("prep", preprocess),          # from Module 3
    ("model", RandomForestClassifier(n_estimators=300,
                                     max_depth=None,
                                     random_state=0))
])

You’ll pass this pipe straight into CV or grid-search, with no extra work.
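
Leakage from scaling is usually subtle, so this sketch swaps in a more dramatic case: feature selection on pure noise. The data is random by construction (an assumption made so that the honest accuracy is known to be about 50 %), and the only difference between the two runs is whether the selection step sits inside the pipeline.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))     # pure noise features
y = rng.integers(0, 2, size=100)     # random labels: nothing to learn

# Leaky: pick the "best" 20 features using ALL rows, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: selection is part of the pipeline, so it is re-fitted on each train fold
honest_pipe = make_pipeline(SelectKBest(f_classif, k=20),
                            LogisticRegression(max_iter=1000))
honest = cross_val_score(honest_pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")
print(f"honest CV accuracy: {honest:.2f}")
```

The leaky score looks impressive only because the selection step already saw the validation labels.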

### 4. Hyper-parameter search methods

| Method                    | `sklearn` class                     | Typical when…                                    | Pros                                 | Cons                                 |
| ------------------------- | ----------------------------------- | ------------------------------------------------ | ------------------------------------ | ------------------------------------ |
| **GridSearch**            | `GridSearchCV`                      | you have ≤ 4 params × a few values each          | exhaustive                           | combinatorial blow-up                |
| **RandomSearch**          | `RandomizedSearchCV`                | bigger spaces, limited budget                    | explores wide range; early good hits | may miss small sweet-spots           |
| **Bayesian / Sequential** | `skopt.BayesSearchCV` or **Optuna** | medium-large spaces, want fewer runs than random | learns from past trials              | extra dependency, slightly more code |


#### 4.1 GridSearch example (Random Forest)

In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "model__n_estimators": [100, 300, 500],
    "model__max_depth": [None, 8, 16],
    "model__max_features": ["sqrt", 0.5, 0.8]
}

gcv = GridSearchCV(pipe,
                   param_grid=param_grid,
                   cv=5,
                   scoring="roc_auc",
                   n_jobs=-1)
gcv.fit(X_train, y_train)

print("Best AUC:", gcv.best_score_.round(3))
print("Best params:", gcv.best_params_)
best_model = gcv.best_estimator_

**The `model__` prefix drills down into the Pipeline step named `model`.**
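
If you are unsure which names a search can tune, the pipeline will list them for you. A small sketch, using a stand-in pipeline (`demo_pipe` is a hypothetical name, with a plain `StandardScaler` in place of the Module 3 preprocessor):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([
    ("prep", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])

# Every tunable name is "<step name>__<parameter name>"
tunable = sorted(name for name in demo_pipe.get_params() if name.startswith("model__"))
print(tunable[:5], "...")

# The same names work with set_params, which is what the search calls internally
demo_pipe.set_params(model__n_estimators=300, model__max_depth=8)
print(demo_pipe.named_steps["model"].n_estimators)
```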

#### 4.2 RandomisedSearch example (XGBoost)

In [None]:
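from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
from xgboost import XGBClassifier

# The search space below is XGBoost-specific, so the "model" step must be an XGBClassifier
xgb_pipe = Pipeline([
    ("prep", preprocess),          # from Module 3
    ("model", XGBClassifier(random_state=0))
])

param_dist = {
    "model__n_estimators": randint(200, 800),      # integers 200..799
    "model__max_depth":    randint(3, 10),         # integers 3..9
    "model__learning_rate": uniform(0.01, 0.2),    # uniform(loc, scale): 0.01 to 0.21
    "model__subsample":    uniform(0.6, 0.4)       # 0.6 to 1.0
}

rscv = RandomizedSearchCV(xgb_pipe,
                          param_distributions=param_dist,
                          n_iter=40,
                          cv=5,
                          scoring="f1",
                          random_state=0,
                          n_jobs=-1)
rscv.fit(X_train, y_train)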
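
One detail that trips people up: SciPy's `uniform(a, b)` takes a start and a *width*, not a start and an end, and `randint` excludes its upper bound. This short sketch checks the ranges that the two kinds of distribution actually cover.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale]
lr_dist = uniform(0.01, 0.2)
lr_low, lr_high = lr_dist.support()
print(f"learning_rate range: {lr_low:.2f} to {lr_high:.2f}")

# randint(low, high) draws integers low .. high - 1
depth_dist = randint(3, 10)
depth_samples = depth_dist.rvs(size=1000, random_state=0)
print("max_depth range:", depth_samples.min(), "to", depth_samples.max())
```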

### 5. Nested CV – the gold-standard

When the dataset is small and you need an honest performance claim:

- Inner loop: tune hyper-params with `GridSearchCV` or `RandomizedSearchCV`.

- Outer loop: estimate generalisation error.

In [None]:
from sklearn.model_selection import cross_val_score

outer_scores = cross_val_score(rscv, X, y, cv=5, scoring="f1")
print("Nested CV F1:", outer_scores.mean().round(3), "±", outer_scores.std().round(3))

**Time-consuming, but the outer score is not biased by the tuning, and the test-set stays pristine.**

### 6. Final lock-down & model card

- Refit the best model on train + val (all data except test).

- Evaluate once on the test-set → final report.
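
The lock-down can be sketched end to end in a few lines. The dataset (scikit-learn's bundled breast-cancer data), the logistic-regression model and the small `C` grid are all assumptions picked to keep the example quick; the pattern is what matters.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1. Set the test set aside before any tuning
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 2. Tune with CV on the development data only
search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    param_grid={"logisticregression__C": [0.01, 0.1, 1, 10]},
    cv=5, scoring="roc_auc")
search.fit(X_dev, y_dev)   # refit=True (the default) retrains the best config on all of X_dev

# 3. Touch the test set exactly once
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("best params:", search.best_params_)
print(f"CV AUC: {search.best_score_:.3f}   test AUC: {test_auc:.3f}")
```

Because `refit=True` already retrains on every development row, "train + val" here is simply everything that is not in the test set.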

**Document in a model card**

- data source & time-range

- metrics (+ confidence intervals)

- fairness checks (e.g. by gender/region)

- limitations & caveats

### 7. Mini-Project – Tune & Benchmark

**Choose one**

Titanic survival (classification) – continue from Module 3

California housing (regression) – sklearn.datasets.fetch_california_housing

Your own CSV

**Tasks**

Split train / test (80/20).

Build a Pipeline with preprocessing + base model.

`RandomizedSearchCV` over ≥ 4 hyper-params, `cv=5`, `n_iter=40`.

Record the search object’s `best_params_` & `best_score_`.

Refit on the full training set (`refit=True` does this for you), test once, print confusion matrix or RMSE.

Optional: Nested CV – compare outer CV mean to test score.

Summarise in 5 lines: “Model, tuned params, train/val CV metric, test metric, key insight”.

### 8. Coming up

You now have the machinery to build reliable models.
Next module we’ll dive into Feature Engineering & Interpretability:

- missing-value tricks, target encoding

- permutation importance, SHAP plots

- lifting your metric with smarter features (a gain of ≥ 10 % is the aim, not a guarantee).

# Module 6 – Feature Engineering & Interpretability

### 1. Why features beat fancy algorithms

A mediocre model fed useful features usually outperforms a state-of-the-art model fed raw data.

Feature work also helps you understand the problem domain, setting you up for interpretability and fairness checks later.

### 2. Systematic Feature-Engineering Checklist

| Category                               | Typical tricks                                                                                               | scikit-learn / Python tools                          |
| -------------------------------------- | ------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------- |
| **Numerical**                          | • log / square-root transforms for skewed data  <br>• binning into quantiles (captures non-linear jumps)     | `FunctionTransformer`, `KBinsDiscretizer`            |
| **Dates & times**                      | • extract year, month, day-of-week, hour  <br>• cyclic encoding for hour-of-day / day-of-week (`sin`, `cos`) | pandas `.dt` accessor, custom `FunctionTransformer` |
| **Categorical**                        | • One-hot encoding (baseline)  <br>• Frequency or **target encoding** for high-cardinality columns           | `OneHotEncoder`, `TargetEncoder` (category_encoders) |
| **Text**                               | • TF-IDF, n-grams ≤ 3  <br>• Sentence embeddings (e.g. Sentence-BERT) if you need context                    | `TfidfVectorizer`, `sentence-transformers`           |
| **Images**                             | • Pre-trained CNN embeddings (ResNet, EfficientNet…)                                                         | PyTorch/TensorFlow Hub                               |
| **Interaction / polynomial**           | • products & ratios (price / income)  <br>• polynomial features of degree 2 or 3                             | `PolynomialFeatures`                                 |
| **Aggregations** (temporal or grouped) | • mean / count per user, rolling 7-day sum                                                                   | pandas `groupby`, `rolling`                          |


**Rule of thumb:** engineer with domain logic first, then try automated methods (polynomial, feature selection).
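As an illustration of the cyclic-encoding trick from the checklist, the sketch below maps hour-of-day onto a circle so that 23:00 and 00:00 end up close together. The helper name `add_cyclic` and the three timestamps are made up for the example.

```python
import numpy as np
import pandas as pd

def add_cyclic(df, col, period):
    """Add sin/cos columns so values at both ends of the cycle land close together."""
    angle = 2 * np.pi * df[col] / period
    out = df.copy()
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out

ts = pd.to_datetime(pd.Series(["2024-01-01 23:00", "2024-01-02 00:00", "2024-01-02 12:00"]))
hours = pd.DataFrame({"hour": ts.dt.hour})
encoded = add_cyclic(hours, "hour", period=24)
print(encoded.round(3))

# Compare distances in the encoded space
xy = encoded[["hour_sin", "hour_cos"]].to_numpy()
d_23_00 = np.linalg.norm(xy[0] - xy[1])   # 23:00 vs 00:00
d_00_12 = np.linalg.norm(xy[1] - xy[2])   # 00:00 vs 12:00
print(f"23h-00h: {d_23_00:.3f} | 00h-12h: {d_00_12:.3f}")
```

With the raw integer hour, 23 and 0 are 23 units apart; on the circle they are neighbours, which is usually what a distance-based or linear model needs.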

### 3. Pipelines for Safe Feature Engineering

In [None]:
from sklearn.preprocessing  import (OneHotEncoder, FunctionTransformer,
                                    StandardScaler, PolynomialFeatures)
from sklearn.compose        import ColumnTransformer
from sklearn.pipeline       import Pipeline
from sklearn.ensemble       import GradientBoostingRegressor
import numpy as np

# Example: house prices
def split_date(df):
    # ColumnTransformer passes a one-column DataFrame, so take the Series first
    s = df.iloc[:, 0]
    return np.c_[s.dt.year, s.dt.month, s.dt.dayofweek]

date_feat = FunctionTransformer(
        split_date,
        feature_names_out=lambda _, f: ["year", "month", "dow"])

# NOTE: drop "price_per_sqft" if the target is the sale price (it would leak the label)
num_cols  = ["sqft", "beds", "baths", "price_per_sqft"]
cat_cols  = ["city", "home_type"]
date_col  = ["sold_date"]

preprocess = ColumnTransformer([
    ("num",  Pipeline([
        ("scale", StandardScaler()),
        ("poly", PolynomialFeatures(degree=2, include_bias=False))
    ]), num_cols),
    ("cat",  OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("date", date_feat, date_col)
])
pipe = Pipeline([("prep", preprocess),
                 ("model", GradientBoostingRegressor())])

This guarantees that every fold in cross-validation sees transformations fitted only on its training slice—no leakage.
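A quick way to see why this matters is to select features on pure noise. The sketch below uses made-up random data in which the labels are unrelated to the features, so any accuracy clearly above chance comes from leakage.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_noise = rng.normal(size=(100, 2000))      # pure noise features
y_noise = rng.integers(0, 2, size=100)      # labels unrelated to X

# Leaky: pick the "best" features using ALL rows, then cross-validate
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000),
                            X_leaky, y_noise, cv=5).mean()

# Safe: the selector is refitted inside every training fold
safe_pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                      ("clf", LogisticRegression(max_iter=1000))])
safe_acc = cross_val_score(safe_pipe, X_noise, y_noise, cv=5).mean()

print(f"leaky CV accuracy: {leaky_acc:.2f} | pipeline CV accuracy: {safe_acc:.2f}")
```

The leaky score looks impressive only because the selector has already seen the validation rows; the pipeline version should hover near chance.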

### 4. Feature Selection & Importance

#### 4.1 Filter Methods (fast, model-agnostic)

Variance Threshold – drop near-constant columns.

Univariate tests – e.g. SelectKBest with chi-square/F-score.
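The two filter steps can be chained as in this sketch. The three-column toy matrix (one informative column, one noise column, one constant column) is an assumption made up for the example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

rng = np.random.default_rng(0)
n = 200
y_f = rng.integers(0, 2, size=n)
X_f = np.column_stack([
    y_f + rng.normal(scale=0.5, size=n),   # informative column
    rng.normal(size=n),                    # noise column
    np.ones(n),                            # constant column
])

# Step 1: drop columns whose variance is (near) zero
vt = VarianceThreshold(threshold=1e-8)
X_var = vt.fit_transform(X_f)

# Step 2: keep the k columns with the highest ANOVA F-score
skb = SelectKBest(f_classif, k=1)
X_best = skb.fit_transform(X_var, y_f)

kept_after_variance = vt.get_support(indices=True)
kept_final = kept_after_variance[skb.get_support(indices=True)]
print("kept after variance filter:", kept_after_variance, "| final:", kept_final)
```

In a real project both steps belong inside the `Pipeline`, so they are refitted per fold like any other transformer.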

#### 4.2 Embedded in the Model

L1-regularised Logistic / Linear Regression → many coefficients driven to zero.

Tree-based models → feature_importances_.
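For the L1 route, `SelectFromModel` can wrap the regularised model and keep only the columns with non-zero weight. This is a sketch on synthetic data; the value `C=0.05` is an arbitrary choice for illustration and would normally be tuned.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 columns, only 3 of them informative (assumption for illustration)
X_l1, y_l1 = make_classification(n_samples=400, n_features=20, n_informative=3,
                                 n_redundant=0, random_state=0)
X_l1 = StandardScaler().fit_transform(X_l1)

# A small C means strong L1 regularisation, which pushes weak coefficients to zero
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
l1_selector = SelectFromModel(l1_model).fit(X_l1, y_l1)

n_nonzero = int((l1_selector.estimator_.coef_ != 0).sum())
n_selected = int(l1_selector.get_support().sum())
print(f"non-zero coefficients: {n_nonzero} of {X_l1.shape[1]}; selected: {n_selected}")
```

Scaling first matters here, because the L1 penalty treats all coefficients alike and unscaled columns would be penalised unevenly.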

#### 4.3 Wrapper Methods

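RFE / RFECV (Recursive Feature Elimination) – iteratively drop the weakest features according to an estimator; RFECV keeps the feature count with the best cross-validated score.

In [None]:
from sklearn.feature_selection import RFECV
from sklearn.ensemble import GradientBoostingRegressor
# RFECV needs an estimator exposing coef_ / feature_importances_, so wrap the model, not the whole pipe
selector = RFECV(GradientBoostingRegressor(), step=1, cv=5,
                 scoring="neg_root_mean_squared_error", n_jobs=-1)
select_pipe = Pipeline([("prep", preprocess), ("select", selector)])
select_pipe.fit(X, y)
print("Optimal number of features:", select_pipe["select"].n_features_)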

### 5. Interpretability: looking inside black boxes

#### 5.1 Global Explanations

| Technique                                 | Works with                     | What you see                                    |
| ----------------------------------------- | ------------------------------ | ----------------------------------------------- |
| **Permutation Importance**                | any model                      | metric drop when a column’s values are shuffled |
| **Mean Decrease in Impurity** (built-ins) | tree ensembles                 | native `feature_importances_` plot              |
| **Partial Dependence (PDP)**              | any model (slower on big data) | curve of ŷ vs. a feature while averaging others |


In [None]:
import pandas as pd
from sklearn.inspection import permutation_importance, PartialDependenceDisplay

# "f1" suits a classifier; for a regressor use e.g. scoring="r2"
result = permutation_importance(pipe, X_val, y_val, scoring="f1", n_repeats=20)
# The pipeline is permuted on its raw input columns, so index by X_val.columns
importances = pd.Series(result.importances_mean, index=X_val.columns)
importances.sort_values(ascending=False).head(10).plot.barh()
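To see what "metric drop when a column's values are shuffled" means mechanically, here is a hand-rolled version on synthetic data. The helper `manual_permutation_importance` and the data-generating formula are assumptions for illustration; in practice the scikit-learn function is the better choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def manual_permutation_importance(model, X, y, metric, n_repeats=10, seed=0):
    """Mean drop in `metric` after shuffling each column of X in turn."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model.predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # break the link between column j and y
            drops[j] += baseline - metric(y, model.predict(X_perm))
    return drops / n_repeats

# Synthetic data: column 0 matters most, column 1 a little, columns 2-3 not at all
rng = np.random.default_rng(0)
X_pi = rng.normal(size=(500, 4))
y_pi = 3.0 * X_pi[:, 0] + 1.0 * X_pi[:, 1] + rng.normal(scale=0.5, size=500)

X_fit, y_fit = X_pi[:350], y_pi[:350]      # training rows
X_hold, y_hold = X_pi[350:], y_pi[350:]    # held-out rows used for the importance

reg = LinearRegression().fit(X_fit, y_fit)
pi_scores = manual_permutation_importance(reg, X_hold, y_hold, r2_score)
print(np.round(pi_scores, 3))
```

Computing the drop on held-out rows, as here, measures what the model relies on for generalisation rather than what it memorised.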

#### 5.2 Local Explanations

| Technique                        | Library | Good for                                                 |
| -------------------------------- | ------- | -------------------------------------------------------- |
| **LIME**                         | `lime`  | single prediction “which features nudged this decision?” |
| **SHAP** (TreeSHAP / KernelSHAP) | `shap`  | both global & local; consistent attribution              |


In [None]:
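import shap
X_trans = best_model["prep"].transform(X_val.iloc[:200])
if hasattr(X_trans, "toarray"):          # densify sparse one-hot output
    X_trans = X_trans.toarray()
explainer = shap.TreeExplainer(best_model["model"])
shap_values = explainer.shap_values(X_trans)
shap.summary_plot(shap_values, X_trans,
                  feature_names=best_model["prep"].get_feature_names_out())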

**(For XGBoost/CatBoost/LightGBM, TreeSHAP runs fast; for other models use KernelSHAP on a sample.)**

### 6. Mini-Project – Boost Your Model by ≥ 10 %

Pick a previous supervised task (Titanic, housing, your own).

Baseline: score from Module 5 after tuning.

**Engineer at least 3 new features, e.g.**

- log-transform skewed numerics

- age buckets (cut into generations)

- interaction term (price / sqft)

Retrain (same CV & model).

Compare CV metric – aim for ≥ 10 % relative improvement.

**Explain**

- Plot top-10 permutation importances.

- Generate one SHAP force plot for an interesting single prediction (e.g. a mis-classified passenger).

Write 5 bullet points summarising which features moved the needle and any surprising insights.

# Module 7 – Neural Networks & Deep Learning Fundamentals

### 0. Pick your framework & install

| Choice                 | Why pick it                                                                | Stable version (May 2025)                   | Quick install                                                                                       |
| ---------------------- | -------------------------------------------------------------------------- | ------------------------------------------- | --------------------------------------------------------------------------------------------------- |
| **PyTorch**            | Python-first, flexible training loops, huge research community             | **2.7.0** released Apr 23 2025 ([PyPI][pypi-torch])  | `pip install torch torchvision torchaudio` *(add the `--index-url` for your CUDA build from the pytorch.org selector)* |
| **TensorFlow / Keras** | Easiest “model-build-compile-fit”, production extras (TF-Lite, TF-Serving) | **2.19.0** released Mar 12 2025 ([PyPI][pypi-tf]) | `pip install tensorflow` *(CPU build; on Linux use `pip install "tensorflow[and-cuda]"` for NVIDIA GPUs, on Apple Silicon add the `tensorflow-metal` plugin)* |

[pypi-torch]: https://pypi.org/project/torch/ "torch - PyPI"
[pypi-tf]: https://pypi.org/project/tensorflow/ "TensorFlow - PyPI"

The official install pages (the pytorch.org “Get Started” selector and tensorflow.org/install) list the current stable wheels and the supported CUDA/cuDNN combinations.


### 1. Building blocks you must know

| Concept              | 1-sentence memory hook                                                                |
| -------------------- | ------------------------------------------------------------------------------------- |
| **Neuron**           | Weighted sum → non-linear activation (ReLU, GELU, Sigmoid…).                          |
| **Layer / MLP**      | Stack of neurons; depth lets you model non-linear relations.                          |
| **Forward pass**     | Compute ŷ for a batch.                                                                |
| **Loss**             | Scalar that says “how wrong?” (cross-entropy for classification, MSE for regression). |
| **Back-propagation** | Auto-diff calculates ∂Loss/∂Weights; optimiser (SGD, Adam, Lion) updates weights.     |
| **Epoch**            | One full sweep over training data.                                                    |
| **Batch size**       | Rows processed before an optimiser step—trade-off RAM vs noisy gradients.             |

Once you can recite that loop (forward → loss → backward → step), you can follow the training procedure behind most neural-net papers.
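The loop can be written out by hand in NumPy, which makes each step visible before a framework hides it. This is a minimal sketch: the XOR-like toy data, the 16-unit hidden layer, the learning rate and the epoch count are all arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
# XOR-like target: positive when both inputs share a sign (not linearly separable)
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)

# One hidden layer of 16 ReLU units, sigmoid output
W1 = rng.normal(scale=0.5, size=(2, 16))
b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1))
b2 = np.zeros(1)
lr = 0.5
losses = []

for epoch in range(300):                      # full-batch: one step per epoch
    # forward pass
    z1 = X @ W1 + b1
    h = np.maximum(z1, 0.0)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))

    # loss: binary cross-entropy
    eps = 1e-9
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    losses.append(loss)

    # backward pass (chain rule by hand)
    dlogits = (p - y) / len(X)
    dW2 = h.T @ dlogits
    db2 = dlogits.sum(axis=0)
    dh = dlogits @ W2.T
    dz1 = dh * (z1 > 0)
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)

    # optimiser step: plain gradient descent
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

In PyTorch the backward section collapses to `loss.backward()` and the update to `optimiser.step()`, but the four stages are the same.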

### 2. Key architectures on one slide

| Family               | Typical problem                   | Core idea                                                      |
| -------------------- | --------------------------------- | -------------------------------------------------------------- |
| **MLP (Dense)**      | Tabular numbers, basic NLP        | Fully-connected layers; good baseline.                         |
| **CNN**              | Images, audio spectrograms        | Convolution filters share weights → spatial pattern detection. |
| **RNN / LSTM / GRU** | Time-series, small-scale language | Hidden state carries information along sequence.               |
| **Transformer**      | Large-scale NLP, vision (ViT)     | Self-attention learns pairwise token relations in parallel.    |

You’ll implement an MLP today; CNN and Transformer quick-starts are optional stretch goals.
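To make the Transformer row more concrete, here is single-head scaled dot-product self-attention in NumPy. It is a bare sketch with random weights and no masking, multi-head split or positional encoding; the shapes (4 tokens, width 8) are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a (tokens, d_model) matrix."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (tokens, tokens) pairwise similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens, d_model, d_head = 4, 8, 8
X_tok = rng.normal(size=(tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))

attn_out, attn_weights = self_attention(X_tok, Wq, Wk, Wv)
print(attn_out.shape, attn_weights.sum(axis=1))
```

Every token's output is a weighted mix of all tokens' values, and the whole computation is a few matrix products, which is why it parallelises well.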

### 3. PyTorch Crash-course (40 lines)

In [None]:
import torch, torchvision
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# 1. Data pipeline (MNIST 28×28 → tensor [0,1])
train_ds = datasets.MNIST(root="data", train=True, download=True,
                          transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)

# 2. Model (simple 2-hidden-layer MLP)
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.seq = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 256), nn.ReLU(),
            nn.Linear(256, 64),   nn.ReLU(),
            nn.Linear(64, 10)
        )
    def forward(self, x): return self.seq(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = Net().to(device)

# 3. Loss & optimiser
loss_fn   = nn.CrossEntropyLoss()
optimiser = torch.optim.AdamW(model.parameters(), lr=3e-4)

# 4. Training loop (3 epochs)
for epoch in range(3):
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)   # nn.Module has no .device attribute
        optimiser.zero_grad()
        pred = model(x)
        loss = loss_fn(pred, y)
        loss.backward()
        optimiser.step()
    print(f"Epoch {epoch+1}: last-batch loss {loss.item():.4f}")

**Add a torch.no_grad() validation block after each epoch and hook in TensorBoard:**

In [None]:
pip install tensorboard; tensorboard --logdir runs
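A validation block along those lines might look like the sketch below. The function name `evaluate` is made up, and the random tensors and one-layer model are stand-ins so the example runs without downloading MNIST.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, loss_fn, device):
    """Return (mean loss, accuracy) over a loader without tracking gradients."""
    model.eval()                              # switch dropout / batch-norm to inference mode
    total_loss, correct, seen = 0.0, 0, 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            total_loss += loss_fn(logits, y).item() * len(y)
            correct += (logits.argmax(dim=1) == y).sum().item()
            seen += len(y)
    return total_loss / seen, correct / seen

# Stand-in data and model (assumptions for illustration)
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
val_ds = TensorDataset(torch.rand(64, 1, 28, 28), torch.randint(0, 10, (64,)))
val_loader = DataLoader(val_ds, batch_size=32)
demo_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).to(device)

val_loss, val_acc = evaluate(demo_model, val_loader, nn.CrossEntropyLoss(), device)
print(f"val loss {val_loss:.4f} | val acc {val_acc:.3f}")
```

In the real loop you would call this once per epoch on a loader built from `datasets.MNIST(..., train=False)` and log the result with `SummaryWriter().add_scalar("val/loss", val_loss, epoch)` from `torch.utils.tensorboard`.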

### 4. Keras (TensorFlow) same idea in 12 lines

In [None]:
import tensorflow as tf
from tensorflow.keras import layers

(x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
x_train, x_val = x_train/255.0, x_val/255.0

model = tf.keras.Sequential([
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(64,  activation="relu"),
    layers.Dense(10,  activation="softmax")
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=3, batch_size=128)

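Keras writes TensorBoard logs when you pass `callbacks=[tf.keras.callbacks.TensorBoard()]` to `model.fit`. A GPU is used automatically once a GPU-enabled build is installed (NVIDIA via `tensorflow[and-cuda]`, Apple Silicon via the `tensorflow-metal` plugin, AMD via the ROCm build).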

### 5. From CSVs & images to tensors

**Tabular:** pandas → .values.astype(np.float32) → torch.tensor or tf.convert_to_tensor.

**Images:** torchvision.datasets.ImageFolder or tf.keras.utils.image_dataset_from_directory.

**Text:** Hugging Face datasets.load_dataset("imdb") → AutoTokenizer → tensors.

Batching and shuffling happen in the DataLoader (PyTorch) or tf.data.Dataset (TensorFlow).

### 6. Monitoring & preventing over-fit

| Tool                       | Works with                                              | Shows                                      |
| -------------------------- | ------------------------------------------------------- | ------------------------------------------ |
| **TensorBoard**            | both TF & PyTorch (`torch.utils.tensorboard`)           | loss/metric curves, GPU stats, embeddings. |
| **EarlyStopping callback** | Keras `tf.keras.callbacks.EarlyStopping`                | stop when val-loss stagnates.              |
| **LR schedulers**          | `torch.optim.lr_scheduler` or Keras `ReduceLROnPlateau` | auto-tune learning rate.                   |
| **Regularisers**           | Dropout, weight-decay (L2), data-augmentation           | baked into layers / optimiser.             |
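
The early-stopping rule in the table is simple enough to write by hand, which helps when a framework has no ready-made callback. The class name `EarlyStopper` and the list of validation losses are made up for the example.

```python
class EarlyStopper:
    """Signal a stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Made-up validation losses: improvement stalls after the fourth epoch
val_losses = [0.90, 0.70, 0.62, 0.60, 0.61, 0.60, 0.63, 0.59]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print("stopped at epoch", stopped_at, "with best val loss", stopper.best)
```

Note the trade-off in `patience`: the run above stops before the slightly better loss in the final epoch, so a larger value costs compute but can catch late improvements.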

### 7. Mini-projects (pick one)

**🖼️ Image – CIFAR-10 with Keras**

- tf.keras.datasets.cifar10.load_data()

- Normalise to [0,1], one-hot labels.

- Replace MLP with a small CNN: Conv2D → MaxPool → Flatten → Dense.

- Train for 15 epochs and aim for validation accuracy > 70 %.

- Plot 5 mis-classified images; use Grad-CAM (gradients of the class score with respect to the last conv layer, e.g. via `tf.GradientTape`) for insight.

**✍️ Text – IMDb sentiment with Transformers (PyTorch)**

- pip install transformers datasets accelerate

- Load:

In [None]:
from datasets import load_dataset
ds = load_dataset("imdb")

- AutoTokenizer.from_pretrained("distilbert-base-uncased") → encode.

- AutoModelForSequenceClassification → Trainer API.

- Evaluate accuracy on test split; aim > 90 %.

- Use the third-party `transformers-interpret` package or `captum` for token-level importance on one review.

### 8. Where we’re headed

You now have three engines:

- Classic ML (Modules 2–6)

- Deep-learning basics (this module)

- Everything ready to deploy (Module 8 – MLOps & serving)

# Module 8 – Model Deployment & MLOps Basics

### 0. Why shipping your model is a skill of its own

A model that lives only in a notebook has zero real-world value. Deployment turns it into a service that other code—or people—can actually call. MLOps then keeps that service fast, reliable and trustworthy over time (versioning, monitoring, rollback).

### 1. Save & version the trained model

| Format              | Good for                                  | How                                                                                   |
| ------------------- | ----------------------------------------- | ------------------------------------------------------------------------------------- |
| **joblib / pickle** | small scikit-learn or XGBoost objects     | `joblib.dump(pipe, "model.joblib")`                                                   |
| **ONNX**            | language-agnostic, runs in C#/Java/mobile | `skl2onnx.convert_sklearn(pipe)`                                                      |
| **MLflow**          | experiment tracking + artefact registry   | `mlflow.sklearn.log_model(pipe, "model")` (MLflow 2.22 out Apr 24 2025) ([MLflow][mlflow]) |

[mlflow]: https://mlflow.org/ "MLflow | MLflow"

**Rule:** store every artefact (model file, scaler, config) in a versioned directory named after the Git commit or MLflow run ID.
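One way to follow that rule is a small helper that writes everything into a directory named after the version and refuses to overwrite it. This is a sketch using only the standard library; the function name `save_artifacts`, the folder layout and the placeholder version string are assumptions.

```python
import json
import pickle
import tempfile
from pathlib import Path

def save_artifacts(model, config, version, root="artifacts"):
    """Write model + config into root/<version>/ and refuse to overwrite an existing version."""
    out_dir = Path(root) / version
    out_dir.mkdir(parents=True, exist_ok=False)   # fail loudly rather than overwrite
    with open(out_dir / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    (out_dir / "config.json").write_text(json.dumps(config, indent=2))
    return out_dir

# Demo in a temporary folder; "3f2a9c1" is a placeholder for a Git short SHA or MLflow run ID
demo_root = tempfile.mkdtemp()
saved_dir = save_artifacts({"weights": [0.1, 0.2]},
                           {"model": "logreg", "C": 1.0},
                           version="3f2a9c1", root=demo_root)
print(sorted(p.name for p in saved_dir.iterdir()))
```

In a real project the version string would come from `git rev-parse --short HEAD` or the MLflow run, and `joblib.dump` would replace `pickle.dump` for scikit-learn pipelines.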

### 2. Pick a serving stack

| Path                                       | When to choose it                                        | Key tool/version                        |
| ------------------------------------------ | -------------------------------------------------------- | --------------------------------------- |
| **Pure FastAPI** REST                      | lightweight, Python-only microservice                    | FastAPI 0.115 (Mar 23 2025) ([PyPI][pypi-fastapi]) |
| **Flask** REST                             | smallest footprint, simple scripts                       | Flask 3.x                               |
| **BentoML**                                | turnkey “model → Docker/OCI image” pipeline, multi-model | BentoML 1.4 (Feb 2025) ([BentoML][bentoml-14])   |
| **Serverless** (AWS Lambda, GCP Cloud Run) | bursty traffic, pay-per-use                              | Container image or zip-deploy           |
| **Spark/Databricks MLflow**                | large-scale batch scoring                                | MLflow model registry                   |

[pypi-fastapi]: https://pypi.org/project/fastapi/ "fastapi - PyPI"
[bentoml-14]: https://www.bentoml.com/blog/announcing-bentoml-1-4 "Announcing BentoML 1.4"

### 3. FastAPI service in 25 lines

In [None]:
# app.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import pandas as pd

class Passenger(BaseModel):          # tailor to your feature schema
    pclass: int
    sex: str
    age: float
    fare: float

app = FastAPI(title="Titanic Survival API")
model = joblib.load("model.joblib")   # trained Pipeline from Module 5/6

@app.post("/predict")
def predict(p: Passenger):
    X = pd.DataFrame([p.model_dump()])   # Pydantic v2; use p.dict() on Pydantic v1
    proba = model.predict_proba(X)[0, 1]
    return {"survival_probability": round(float(proba), 3)}

#### **Local test**

In [None]:
# 1. Install deps
pip install fastapi "uvicorn[standard]" joblib pandas scikit-learn

# 2. Run dev server
uvicorn app:app --reload

# 3. Call
curl -X POST http://127.0.0.1:8000/predict \
     -H "Content-Type: application/json" \
     -d '{"pclass":2,"sex":"female","age":28,"fare":28.0}'

**Swagger UI auto-appears at /docs—great for QA without writing a line of front-end code.**

### 4. Containerise with Docker

Create a plain text file named Dockerfile in the project root:

In [None]:
# syntax=docker/dockerfile:1
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN  pip install -r requirements.txt            # scikit-learn, fastapi, uvicorn,...

COPY . .
ENV PORT=80
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]

#### **Bash**

In [None]:
# Build & run
docker build -t titanic-api:1.0 .
docker run -d -p 80:80 titanic-api:1.0

Docker Engine 25.0 shipped in January 2024 and newer major versions have followed; use a currently supported release for the latest security patches and BuildKit improvements.

### 5. Continuous Delivery (CD) with GitHub Actions

(one-file minimal example)

In [None]:
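# .github/workflows/deploy.yml
name: CI-CD

on:
  push:
    branches: [ main ]

permissions:
  contents: read
  packages: write            # needed to push to ghcr.io

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set tag
        run: echo "TAG=${GITHUB_SHA::7}" >> "$GITHUB_ENV"
      - name: Log in to GHCR
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build
        uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ env.TAG }}
          context: .
      - name: Deploy to staging
        env:                   # repository secrets you define yourself
          STAGING_TRIGGER_URL: ${{ secrets.STAGING_TRIGGER_URL }}
          TOKEN: ${{ secrets.TOKEN }}
        run: |
          curl -X POST "$STAGING_TRIGGER_URL" \
               -H "Authorization: Bearer $TOKEN" \
               -d "{\"image\":\"ghcr.io/$GITHUB_REPOSITORY:$TAG\"}"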


Replace $STAGING_TRIGGER_URL with your cloud’s image-update endpoint (Fly.io, Render, ECS, etc.).

### 6. Monitoring & Drift Detection

| Concern                      | Open-source helper                                     |
| ---------------------------- | ------------------------------------------------------ |
| **Service uptime / latency** | Prometheus + Grafana                                   |
| **Model & data drift**       | Evidently (0.5 + drift dashboards) ([Evidently AI][evidently]) |
| **Experiment lineage**       | MLflow tracking server                                 |
| **Alerting**                 | Grafana alerts ➜ Slack / Teams                         |

[evidently]: https://www.evidentlyai.com/ "Evidently AI - AI Testing & LLM Evaluation Platform"

#### **Example: Evidently quick-check**

In [None]:
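import pandas as pd
# Import paths follow the Evidently 0.4–0.6 API; newer releases reorganised them
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, RegressionPreset

ref = pd.read_parquet("2024-12-reference.parquet")
cur = pd.read_parquet("2025-05-current.parquet")

# RegressionPreset expects "target" and "prediction" columns in both frames
rpt = Report(metrics=[DataDriftPreset(), RegressionPreset()])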
rpt.run(reference_data=ref, current_data=cur)
rpt.save_html("drift.html")

Serve drift.html behind Basic-Auth or drop it in S3 for your data team.
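If you want a single drift number per feature without a dashboard, the Population Stability Index (PSI) is a common choice. The sketch below is a plain NumPy version, not Evidently's implementation; the helper name `psi`, the bin count and the simulated samples are assumptions.

```python
import numpy as np

def psi(reference, current, n_bins=10):
    """Population Stability Index between two 1-D samples, using quantile bins of the reference."""
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(inner_edges, current), minlength=n_bins)
    # Clip to avoid log(0) when a bin is empty
    ref_frac = np.clip(ref_counts / len(reference), 1e-6, None)
    cur_frac = np.clip(cur_counts / len(current), 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
ref_sample = rng.normal(0.0, 1.0, size=5000)
same_sample = rng.normal(0.0, 1.0, size=5000)      # same distribution as the reference
shifted_sample = rng.normal(1.0, 1.0, size=5000)   # mean shifted by one standard deviation

psi_same = psi(ref_sample, same_sample)
psi_shifted = psi(ref_sample, shifted_sample)
print(f"PSI same: {psi_same:.3f} | PSI shifted: {psi_shifted:.3f}")
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but the right alert threshold depends on the feature and should be calibrated on your own data.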

### 7. Mini-project – Put Your Model on the Internet 🌐

- Pick any trained model from earlier modules.

- Serialize → model.joblib.

- Wrap with FastAPI (see §3).

- Add requirements.txt (FastAPI, uvicorn, pandas, scikit-learn, joblib).

- Containerise with Dockerfile (§4).

- Run locally and hit /predict with curl or Postman.

- (Stretch) Push to GitHub, enable GH Actions CD, deploy to free tier on Fly.io, Render or Railway.

- Smoke-test one happy path and one error path (missing field).

- Share: post your service URL or Docker Hub tag for a review of security headers, latency and observability.

### 8. Where we go from here

You can now ship models, reproduce builds, and watch them in production—a huge milestone!

The final module will cover Responsible AI & Fairness: bias audits, SHAP for fairness, model cards, and privacy basics.

Once your FastAPI endpoint returns a 200 OK, continue to the final module, which closes the ML A-to-Z journey with ethics and responsible deployment.

# Module 9 – Responsible AI: Ethics, Fairness, Privacy & Transparency

### 1. Why this matters

Even a perfectly tuned, high-accuracy model can harm people by amplifying bias, leaking personal data or behaving unpredictably once it meets the real world. Binding rules (the EU AI Act, FTC enforcement) and voluntary standards (NIST AI RMF, ISO/IEC 42001) mean responsible-AI practices are increasingly expected, and in some settings legally required.

### 2. The regulatory landscape you must track (2024-25)

| Regulation / Standard                    | Scope & key duties                                                                                                                                                                                                                                                                                                                        | In force                                              |
| ---------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------- |
| **EU AI Act**                            | *Risk-based tiers:* minimal → limited → **high-risk** → prohibited. High-risk systems (e.g. hiring, credit, health) need risk and quality management, transparency & human oversight; certain deployers must also run a **fundamental-rights impact assessment**. General-purpose (foundation) model providers must document energy use, and those with systemic risk must evaluate and mitigate it. ([Artificial Intelligence Act][ai-act]) | In force since Aug 2024; prohibitions from Feb 2025, general-purpose model rules from Aug 2025, most high-risk duties from Aug 2026 |
| **NIST AI RMF 1.0**                      | Voluntary U.S. framework with four core functions (GOVERN, MAP, MEASURE, MANAGE) to systematise trustworthy-AI risk control across the lifecycle. ([NIST][nist-rmf], [NIST Technical Series Publications][nist-rmf-pdf]) | Jan 2023 |
| **ISO/IEC 42001**                        | First certifiable **AI management-system** standard (akin to ISO 27001 for security) covering ethics, transparency, continuous risk review. ([ISO][iso-42001], [KPMG][kpmg-42001]) | Dec 2023 |
| **AI Safety Summits (Bletchley, Paris)** | Multilateral “Bletchley Declaration”: 28 countries plus the EU commit to cooperate on safety research and testing for frontier models. ([GOV.UK][bletchley], [European Payments Council][epc-paris]) | Nov 2023 (Bletchley), Feb 2025 (Paris) |
| **FTC enforcement**                      | U.S. regulator brings enforcement actions over deceptive or unsubstantiated AI claims, including claims of being bias-free (e.g., IntelliVision settlement Jan 2025). ([Lathrop GPM][lathrop-ftc], [Federal Trade Commission][ftc-ai]) | Active |

[ai-act]: https://artificialintelligenceact.eu/high-level-summary/ "High-level summary of the AI Act | EU Artificial Intelligence Act"
[nist-rmf]: https://www.nist.gov/itl/ai-risk-management-framework "AI Risk Management Framework | NIST"
[nist-rmf-pdf]: https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf "[PDF] Artificial Intelligence Risk Management Framework (AI RMF 1.0)"
[iso-42001]: https://www.iso.org/standard/81230.html "ISO/IEC 42001:2023 - AI management systems"
[kpmg-42001]: https://kpmg.com/ch/en/insights/artificial-intelligence/iso-iec-42001.html "ISO/IEC 42001: The latest AI management system standard"
[bletchley]: https://www.gov.uk/government/publications/ai-safety-summit-2023-the-bletchley-declaration/the-bletchley-declaration-by-countries-attending-the-ai-safety-summit-1-2-november-2023 "The Bletchley Declaration by Countries Attending the AI Safety ..."
[epc-paris]: https://epc.eu/publication/The-Paris-Summit-Au-Revoir-global-AI-Safety-61ea68/ "The Paris Summit: Au Revoir, global AI Safety?"
[lathrop-ftc]: https://www.lathropgpm.com/insights/transparency-and-ai-ftc-launches-enforcement-actions-against-businesses-promoting-deceptive-ai-product-claims/ "Transparency and AI: FTC Launches Enforcement Actions Against ..."
[ftc-ai]: https://www.ftc.gov/policy/advocacy-research/tech-at-ftc/2025/01/ai-risk-consumer-harm "AI and the Risk of Consumer Harm | Federal Trade Commission"


### 3. Fairness fundamentals

| Concept                    | Quick definition                                                                                                 | Typical metric                                              |
| -------------------------- | ---------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------- |
| **Sensitive attribute**    | Feature legally / ethically protected (sex, race, disability, age…)                                              | —                                                           |
| **Group fairness**         | Outcome rates are comparable across sensitive groups                                                             | *Demographic parity*, *Equalised odds*, *Equal opportunity* |
| **Individual fairness**    | “Similar individuals receive similar outcomes”                                                                   | Counter-factual fairness distance                           |
| **Bias mitigation** stages | *Pre-processing* (re-weighing), *In-processing* (adversarial debias), *Post-processing* (threshold optimisation) | —                                                           |

Trade-offs: perfect parity can lower overall accuracy; document the business-ethics decision.
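The two most used group-fairness gaps are easy to compute by hand, which helps when reading library output later. The sketch below follows the usual definitions (selection-rate gap; larger of the TPR and FPR gaps); the helper names and the eight-person toy data are made up for the example.

```python
import numpy as np

def selection_rate(y_pred):
    """Share of positive predictions."""
    return float(np.mean(y_pred))

def demographic_parity_diff(y_pred, group):
    """Largest gap in positive-prediction rate between groups."""
    rates = [selection_rate(y_pred[group == g]) for g in np.unique(group)]
    return max(rates) - min(rates)

def equalized_odds_diff(y_true, y_pred, group):
    """Larger of the true-positive-rate gap and the false-positive-rate gap between groups."""
    tprs, fprs = [], []
    for g in np.unique(group):
        m = group == g
        tprs.append(np.mean(y_pred[m & (y_true == 1)]))
        fprs.append(np.mean(y_pred[m & (y_true == 0)]))
    return float(max(max(tprs) - min(tprs), max(fprs) - min(fprs)))

# Toy data: 8 people in two groups (values made up for the example)
group  = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0])

dpd = demographic_parity_diff(y_pred, group)
eod = equalized_odds_diff(y_true, y_pred, group)
print(f"demographic parity difference: {dpd:.2f}, equalised odds difference: {eod:.2f}")
```

Here group `a` is selected 75 % of the time and group `b` 25 %, so both gaps come out at 0.5; a value of 0 would mean parity on that metric.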

### 4. Interpretability & transparency toolkit

| Goal                  | Reality-check questions                         | Tools / libraries                                        |
| --------------------- | ----------------------------------------------- | -------------------------------------------------------- |
| **Global insight**    | “Which features drive decisions overall?”       | Permutation importance, SHAP summary plot                |
| **Local explanation** | “Why did *this* person get denied?”             | SHAP force plot, LIME                                    |
| **Robustness**        | “Does a small input tweak flip the prediction?” | Adversarial tests, sensitivity analysis                  |
| **Documentation**     | “Can a non-tech auditor understand the system?” | *Model cards*, *datasheets for datasets*, *system cards* |


### 5. Privacy & security check-list

- Minimise data: drop columns you would not show on your dashboard.

- Pseudonymise IDs, use k-anonymity + differential privacy when releasing stats.

- Secure the supply chain: signed Docker images, dependency-pinning.

- Red-team generative models: jailbreak prompts, prompt-injection, data-leak tests (adversarial testing is required for systemic-risk models under EU AI Act Article 55).
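Two items from the check-list can be tried in a few lines: measuring k-anonymity over quasi-identifiers and releasing a count with Laplace noise for differential privacy. This is a teaching sketch, not a vetted privacy library; the helper names, the toy table and `epsilon=1.0` are assumptions.

```python
import numpy as np
import pandas as pd

def k_anonymity(df, quasi_identifiers):
    """Smallest group size when rows are grouped by the quasi-identifier columns."""
    return int(df.groupby(quasi_identifiers).size().min())

def dp_count(n, epsilon, rng):
    """Release a count with Laplace noise; a counting query has sensitivity 1."""
    return n + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Toy table (made-up rows)
people = pd.DataFrame({
    "age_band":    ["20-29", "20-29", "30-39", "30-39", "30-39", "20-29"],
    "zip3":        ["941",   "941",   "100",   "100",   "100",   "941"],
    "income_high": [0, 1, 1, 0, 1, 0],
})
k = k_anonymity(people, ["age_band", "zip3"])

rng = np.random.default_rng(0)
true_count = int(people["income_high"].sum())
noisy_count = dp_count(true_count, epsilon=1.0, rng=rng)
print(f"k = {k}, true count = {true_count}, released count = {noisy_count:.2f}")
```

A smaller `epsilon` adds more noise and gives stronger privacy; for production use, an audited library is a safer choice than hand-rolled noise.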

### 6. Hands-on mini-project – Fairness audit & mitigation

#### 6.1 Dataset

Adult Income (a.k.a. “Census Income”, 48 k rows, predict income > $50 k). Contains sex and race attributes—classic for bias demos.

#### 6.2 Objectives

- Baseline: Train a tuned Logistic Regression or Random Forest (Module 5 pipeline).

- Audit: Compute accuracy plus demographic parity difference (DPD) & equalised odds difference (EOD) for sex and race.

- Mitigate: Use Fairlearn’s ExponentiatedGradient or GridSearch to reduce both gaps while keeping ≥ 95 % of baseline accuracy.

- Report: One chart (e.g. a bar plot of `MetricFrame.by_group`; the former Fairlearn dashboard now lives in `raiwidgets`) + 5 bullets answering: Which metric improved? • What trade-off did you accept? • Any residual bias? • How will you monitor in prod? • Next mitigation ideas.

#### 6.3 Code pseudostarter

In [None]:
# pip install fairlearn==0.11.0

from fairlearn.metrics import (MetricFrame, selection_rate,
                               demographic_parity_difference,
                               equalized_odds_difference)
from fairlearn.reductions import ExponentiatedGradient, DemographicParity
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("adult.csv")                       # load data
y = (df["income"] == ">50K").astype(int)            # adjust if your CSV pads the labels
X = df.drop(columns=["income"])

sens = df["sex"]                                    # try also 'race'
X_tr, X_te, y_tr, y_te, s_tr, s_te = train_test_split(
        X, y, sens, stratify=y, test_size=0.2, random_state=0)

pre = ColumnTransformer([
        ("num", StandardScaler(), X.select_dtypes("number").columns),
        ("cat", OneHotEncoder(handle_unknown="ignore"),
                X.select_dtypes("object").columns)
], sparse_threshold=0)                              # dense output for Fairlearn

base = Pipeline([("prep", pre),
                 ("clf",  LogisticRegression(max_iter=2000))])
base.fit(X_tr, y_tr)

def audit(name, y_pred):
    # Per-group metrics go in MetricFrame; the *_difference helpers return one gap each
    mf = MetricFrame(metrics=dict(acc=accuracy_score, sel_rate=selection_rate),
                     y_true=y_te, y_pred=y_pred, sensitive_features=s_te)
    print(name, mf.by_group, sep="\n")
    print("DPD:", demographic_parity_difference(y_te, y_pred, sensitive_features=s_te),
          "EOD:", equalized_odds_difference(y_te, y_pred, sensitive_features=s_te))

audit("Baseline", base.predict(X_te))

# Mitigation: the reduction passes sample_weight to fit(), so wrap the bare
# classifier and feed it the already-transformed features
mitigator = ExponentiatedGradient(LogisticRegression(max_iter=2000),
                                  constraints=DemographicParity())
mitigator.fit(base["prep"].transform(X_tr), y_tr, sensitive_features=s_tr)
audit("Mitigated", mitigator.predict(base["prep"].transform(X_te)))


### 7. Production guard-rails

| Stage            | Must-have guard-rail                                                                           |
| ---------------- | ---------------------------------------------------------------------------------------------- |
| **Pre-launch**   | *Fairness report* signed-off by legal/ethics lead; attach to Model Card v1.                    |
| **Daily batch**  | Drift dashboard (Evidently) auto-checks sensitive-attribute distributions; alert at 3 σ shift. |
| **Realtime API** | Log inputs & outputs (hashed IDs) for traceability; sample 0.5 % for human review.             |
| **Versioning**   | Tag model + data snapshot; keep older version live until new one passes A/B guard-rail.        |
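
The "hashed IDs" and "sample 0.5 %" guard-rails from the table can share one mechanism: a keyed hash of the user ID. The sketch below uses only the standard library; the salt value, the helper names and the synthetic user IDs are placeholders.

```python
import hashlib
import hmac

SECRET_SALT = b"replace-me"   # placeholder; load from a secret store in practice

def hash_id(user_id: str) -> str:
    """Keyed hash so raw IDs never reach the logs but records stay linkable."""
    return hmac.new(SECRET_SALT, user_id.encode(), hashlib.sha256).hexdigest()

def in_review_sample(hashed_id: str, rate: float = 0.005) -> bool:
    """Deterministically pick roughly `rate` of IDs by mapping the hash to [0, 1)."""
    bucket = int(hashed_id[:8], 16) / 16**8
    return bucket < rate

# Synthetic user IDs for the demo
hashed = [hash_id(f"user-{i}") for i in range(20000)]
sampled = sum(in_review_sample(h) for h in hashed)
print(f"{sampled} of {len(hashed)} requests flagged for human review")
```

Because the decision depends only on the hash, the same user is always in or out of the review sample, which keeps audits reproducible; sample per request instead if you need coverage across all users.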


### 8. Reflection & wrap-up

- Deliver your audit notebook (or errors) and we’ll iterate on tougher mitigations (counter-factual, Causal ML).

- Update your model card to include: intended use, out-of-scope use, ethical risks, fairness metrics, mitigations, contact for redress.

- Re-train with privacy enhancements (e.g., 𝜖 ≤ 1 differential-privacy logistic regression) if data is sensitive.

You’ve now completed the A → Z ML journey:

Foundations → Classic ML → Unsupervised → Tuning → Feature-craft → Deep-learning → Deployment → Responsible AI.