# Lecture 0 — Python Crash Course + Simple ML Workflow

**Course:** BME i9400 — Special Topics in Machine Learning (Graduate, Biomedical Engineering)  
**Environment:** JupyterHub (CPU-only friendly)  
**Date generated:** 2025-08-24 18:43 UTC  

This notebook is a hands-on crash course designed to ensure every student is ready for the rest of the course.  
It covers Python essentials, NumPy, Pandas, plotting, and a complete ML workflow (regression & classification) with scikit-learn.

> **Micro‑Deliverable (end):** Produce ROC & PR curves for a logistic model, include a 3–5 sentence interpretation, and a short _model card_ (data, metrics, limitations, ethics).

## 0. Reproducibility & Versions

In [None]:
import sys, platform, random
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn

# Fix seeds so the simulated data and the train/test splits are repeatable
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

print("Python:", sys.version.split()[0])
print("Platform:", platform.platform())
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)
print("Matplotlib:", matplotlib.__version__)
print("scikit-learn:", sklearn.__version__)

## 1. Python Essentials

### 1.1 Variables, Types, and Collections

- Numeric types: `int`, `float`
- Text: `str`
- Boolean: `bool`
- Collections: `list`, `tuple`, `dict`, `set`

In [None]:
age = 34               # int
height_m = 1.78        # float
name = "Ada Lovelace"  # str
is_patient = False     # bool

ages = [29, 33, 41, 37, 29]     # list
vitals = ("HR", "BP", "Temp")   # tuple
record = {"id": 101, "age": 33, "bp_sys": 122}  # dict
unique_ids = {101, 102, 103}    # set

print(type(age), type(height_m), type(name), type(is_patient))
print(ages[:3], vitals[0], record["bp_sys"], 101 in unique_ids)

### 1.2 Control Flow & Comprehensions

In [None]:
# if/elif/else
if height_m > 1.9:
    category = "tall"
elif height_m > 1.7:
    category = "medium"
else:
    category = "short"
category

In [None]:
# for loop + enumerate
for i, a in enumerate(ages):
    if i < 3:
        print(i, a)

In [None]:
# list comprehension
ages_plus_one = [a + 1 for a in ages]
ages_even = [a for a in ages if a % 2 == 0]  # empty here: every age in the list is odd
ages_plus_one[:5], ages_even[:5]

### 1.3 Functions

In [None]:
def bmi(weight_kg: float, height_m: float) -> float:
    """Compute Body Mass Index."""
    return weight_kg / (height_m ** 2)

def classify_bmi(bmi_value: float) -> str:
    return ("underweight" if bmi_value < 18.5 else
            "normal" if bmi_value < 25 else
            "overweight" if bmi_value < 30 else
            "obese")

val = bmi(72, 1.78)
val, classify_bmi(val)

## 2. NumPy Essentials

Key concepts: arrays, shapes, broadcasting, vectorization, random generation.

In [None]:
import numpy as np
np.random.seed(SEED)

# Simulate systolic blood pressure (BP) as a normal distribution
bp = np.random.normal(loc=120, scale=12, size=1000)  # mean 120, sd 12
bp.shape, bp.mean().round(2), bp.std(ddof=1).round(2)

In [None]:
# Vectorized operations: z-score normalize
# Note: bp.std() defaults to ddof=0 (population SD), unlike the ddof=1 summary above
bp_z = (bp - bp.mean()) / bp.std()
bp_z[:5].round(3)

In [None]:
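# Elementwise addition: two same-shaped (5,) arrays, one offset per patient
offsets = np.random.uniform(-5, 5, size=5)
bp5 = bp[:5]
print(bp5 + offsets)

# Broadcasting: a (5, 1) column plus a (3,) row expands to a (5, 3) grid
shifts = np.array([-5.0, 0.0, 5.0])  # the same three shifts applied to every patient
bp5[:, None] + shifts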

## 3. Pandas Essentials

We'll create a small synthetic EHR-like table: age, BMI, systolic BP, outcome.

In [None]:
import pandas as pd
np.random.seed(SEED)

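N = 400
# randint's upper bound is exclusive, so ages span 18-84 (this rebinds `age` from Section 1.1)
age = np.random.randint(18, 85, size=N)
bmi_vals = np.random.normal(26, 4.5, size=N).clip(15, 45)
# Systolic BP rises with age and BMI, plus Gaussian noise (sd 10 mmHg)
bp_sys = np.random.normal(120 + 0.4*(age-50) + 0.8*(bmi_vals-25), 10, size=N)
# Binary outcome (e.g., hypertension diagnosis) with logistic link to bp_sys + age
logit = -12 + 0.06*bp_sys + 0.015*age
prob = 1 / (1 + np.exp(-logit))             # sigmoid maps the logit to a probability
y = (np.random.rand(N) < prob).astype(int)  # one Bernoulli draw per patient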

df = pd.DataFrame({
    "age": age,
    "bmi": bmi_vals,
    "bp_sys": bp_sys,
    "y": y
})
df.head()
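
To get a feel for the logistic link used in the cell above, the sketch below plugs two illustrative patients into the same coefficients (`-12`, `0.06`, `0.015`). The helper `outcome_probability` and both example patients are assumptions made for illustration; they are not rows from `df`. The result suggests that a typical patient has a low outcome probability, so positives are likely to be rare in this synthetic dataset.

```python
import math

def outcome_probability(bp_sys_value: float, age_value: float) -> float:
    """Probability of y=1 under the simulation's logistic link."""
    # Coefficients copied from the data-generating cell above
    logit_value = -12 + 0.06 * bp_sys_value + 0.015 * age_value
    return 1.0 / (1.0 + math.exp(-logit_value))

# Hypothetical patients chosen for illustration (not drawn from df)
p_typical = outcome_probability(bp_sys_value=120, age_value=50)
p_high = outcome_probability(bp_sys_value=160, age_value=70)

print(f"typical patient (BP 120, age 50): {p_typical:.3f}")
print(f"high-BP patient (BP 160, age 70): {p_high:.3f}")
```

Comparing these values with `df["y"].mean()` is a quick sanity check on the simulated prevalence.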

In [None]:
df.info()

In [None]:
df.describe().T

## 4. Plotting with Matplotlib

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
# (column, panel title, x-axis unit) for each histogram
panels = [("age", "Age", "years"), ("bmi", "BMI", "kg/m²"), ("bp_sys", "Systolic BP", "mmHg")]
for ax, (col, title, unit) in zip(axes, panels):
    ax.hist(df[col], bins=20)
    ax.set_title(title)
    ax.set_xlabel(unit)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(4.5,3.5))
plt.scatter(df["bmi"], df["bp_sys"], s=10, alpha=0.6)
plt.xlabel("BMI (kg/m²)")
plt.ylabel("Systolic BP (mmHg)")
plt.title("BP vs BMI")
plt.show()

## 5. Simple ML Workflow Overview

We'll demonstrate two end-to-end tasks:

1. **Regression**: Predict `bp_sys` from `age` and `bmi`.  
2. **Classification**: Predict `y` (hypertension diagnosis) from features.

We will cover a **train/test split**, **preprocessing pipelines**, **metrics** (MAE and R² for regression; ROC-AUC, PR-AUC, and F1 for classification), and **simple model cards**.
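
As a reference for the regression metrics, the sketch below writes MAE and R² out by hand on a few made-up values. The function names and the toy numbers are assumptions for illustration; the formulas are the standard definitions, which `mean_absolute_error` and `r2_score` are expected to match.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """MAE: average absolute difference between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (fraction of variance explained)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy systolic BP values in mmHg, made up for illustration
toy_true = [110.0, 120.0, 130.0, 140.0]
toy_pred = [112.0, 118.0, 133.0, 137.0]

toy_mae = mean_absolute_error_manual(toy_true, toy_pred)
toy_r2 = r2_manual(toy_true, toy_pred)
print(f"MAE={toy_mae:.2f} mmHg, R2={toy_r2:.3f}")
```

MAE is in the units of the target (mmHg here), while R² is unitless and compares the model against always predicting the mean.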

### 5.1 Regression: Predicting Systolic BP

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge

X = df[["age", "bmi"]].copy()
y_reg = df["bp_sys"].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y_reg, test_size=0.25, random_state=SEED)

num_features = ["age", "bmi"]
preprocess = ColumnTransformer(
    transformers=[("num", StandardScaler(), num_features)],
    remainder="drop"
)

reg_model = Pipeline([
    ("prep", preprocess),
    ("lr", LinearRegression())
])

reg_model.fit(X_train, y_train)
pred = reg_model.predict(X_test)
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
mae, r2

In [None]:
# Compare with Ridge regression (regularization)
ridge = Pipeline([
    ("prep", preprocess),
    ("ridge", Ridge(alpha=1.0, random_state=SEED))
])
ridge.fit(X_train, y_train)
pred_r = ridge.predict(X_test)
mae_r = mean_absolute_error(y_test, pred_r)
r2_r = r2_score(y_test, pred_r)

print(f"LinearRegression  MAE={mae:.2f}, R2={r2:.3f}")
print(f"Ridge(alpha=1.0) MAE={mae_r:.2f}, R2={r2_r:.3f}")

In [None]:
# Plot predicted vs true
plt.figure(figsize=(4.5,4))
plt.scatter(y_test, pred, s=10, alpha=0.7, label="LR")
plt.scatter(y_test, pred_r, s=10, alpha=0.7, label="Ridge")
lims = [min(y_test.min(), pred.min(), pred_r.min()), max(y_test.max(), pred.max(), pred_r.max())]
plt.plot(lims, lims, 'k--', linewidth=1)
plt.xlabel("True BP (mmHg)")
plt.ylabel("Predicted BP (mmHg)")
plt.title("Regression: Predicted vs True")
plt.legend()
plt.tight_layout()
plt.show()

### 5.2 Classification: Predicting Hypertension Diagnosis

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    roc_auc_score,
    roc_curve,
    precision_recall_curve,
    average_precision_score,
    f1_score,
    confusion_matrix,
    classification_report,
)

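# Note: this cell rebinds X, X_train, X_test, y_train, and y_test from Section 5.1
# (pred and preprocess are rebound further down); rerun 5.1 before reusing its plots.
X = df[["age", "bmi", "bp_sys"]].copy()
y_cls = df["y"].copy()

# stratify keeps the positive rate the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_cls, test_size=0.25, random_state=SEED, stratify=y_cls)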

preprocess = ColumnTransformer(
    transformers=[("num", StandardScaler(), X.columns.tolist())],
    remainder="drop"
)

clf = Pipeline([
    ("prep", preprocess),
    ("logreg", LogisticRegression(max_iter=200, random_state=SEED))
])

clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)

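auc = roc_auc_score(y_test, proba)
ap = average_precision_score(y_test, proba)  # average precision, reported here as PR-AUC
# zero_division=0 avoids undefined-metric warnings if no positives are predicted at 0.5
f1 = f1_score(y_test, pred, zero_division=0)

print(f"ROC-AUC={auc:.3f} | PR-AUC={ap:.3f} | F1={f1:.3f}")
print(confusion_matrix(y_test, pred))  # rows: true class, columns: predicted class
print(classification_report(y_test, pred, digits=3, zero_division=0))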

In [None]:
# ROC curve
fpr, tpr, _ = roc_curve(y_test, proba)
plt.figure(figsize=(4.5,3.5))
plt.plot(fpr, tpr, label=f"LogReg (AUC={auc:.3f})")
plt.plot([0,1], [0,1], 'k--', alpha=0.6)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Precision-Recall curve
prec, rec, _ = precision_recall_curve(y_test, proba)
plt.figure(figsize=(4.5,3.5))
plt.plot(rec, prec, label=f"LogReg (AP={ap:.3f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.tight_layout()
plt.show()

## 6. Troubleshooting Tips

- **`ConvergenceWarning`** in logistic regression: increase `max_iter` or standardize inputs (we standardize in the pipeline).  
- **Data leakage**: fit preprocessing on the training data only, never on the test data; `Pipeline`/`ColumnTransformer` handle this when you call `fit` on the training set.  
- **Class imbalance**: check the positive rate; consider `class_weight='balanced'` or threshold tuning.  
- **Randomness**: set seeds so runs are comparable; results can still differ slightly across library versions and platforms.
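
For the class-imbalance tip, a quick check can be sketched as follows. The label list is made up for illustration; the weight formula `n_samples / (n_classes * count)` is the heuristic scikit-learn documents for `class_weight='balanced'`, written out here in plain Python as an assumption-labeled sketch rather than a call into the library.

```python
from collections import Counter

# Made-up labels for illustration: 2 positives out of 20
toy_labels = [0] * 18 + [1] * 2

counts = Counter(toy_labels)
n_samples = len(toy_labels)
n_classes = len(counts)

positive_rate = counts[1] / n_samples

# Balanced heuristic: rarer classes receive proportionally larger weights
balanced_weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

print(f"positive rate: {positive_rate:.2f}")
print("balanced weights:", balanced_weights)
```

On the notebook's data, the equivalent check is `y_cls.mean()` for the positive rate.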

## 7. Stretch: Cross-Validation & Threshold Sweeps

In [None]:
from sklearn.model_selection import cross_val_score, StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
# Passing the whole pipeline means the scaler is refit on each training fold (no leakage)
cv_auc = cross_val_score(clf, X, y_cls, cv=skf, scoring="roc_auc")
cv_auc.mean().round(3), cv_auc.std().round(3)

In [None]:
# Threshold sweep to inspect precision/recall trade-offs
y_true = y_test.to_numpy()  # plain array sidesteps pandas index alignment
thresholds = np.linspace(0.1, 0.9, 9)
rows = []
for th in thresholds:
    p = (proba >= th).astype(int)
    tp = ((p == 1) & (y_true == 1)).sum()
    fp = ((p == 1) & (y_true == 0)).sum()
    fn = ((p == 0) & (y_true == 1)).sum()
    # max(..., 1) guards against 0/0 when nothing is predicted (or labeled) positive
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    rows.append({"threshold": th, "precision": precision, "recall": recall})
pd.DataFrame(rows)

## 8. Exercises (Short)

1. **NumPy:** Create a 2D array of shape (50, 3) representing `[age, bmi, noise]` and compute the correlation matrix.  
2. **Pandas:** Add a new column `bmi_cat` with categories (`under`, `normal`, `over`, `obese`) using cut-points, then group by `bmi_cat` to compute mean `bp_sys`.  
3. **Visualization:** Plot boxplots of `bp_sys` by `bmi_cat`.  
4. **Regression:** Add `age*bmi` as an interaction term and compare R² to the baseline model.  
5. **Classification:** Use `class_weight='balanced'` in logistic regression and compare PR-AUC.

## 9. Micro‑Deliverable

**Task:** Using the classification pipeline above, produce **ROC** and **PR** curves, report **ROC‑AUC** and **PR‑AUC**, and write a **3–5 sentence interpretation** covering:
- What threshold you would choose and why
- Expected trade-offs (false positives vs false negatives) in a clinical context
- Two limitations of this synthetic experiment
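
One way to reason about the threshold choice and the false-positive versus false-negative trade-off is to attach a cost to each error type and pick the threshold with the lowest total. The sketch below does this on a handful of made-up scores. The 5:1 cost ratio (`COST_FN`, `COST_FP`) and the helper `total_cost` are hypothetical assumptions; in practice the costs would come from clinical judgment.

```python
# Made-up predicted probabilities and true labels, for illustration only
toy_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
toy_truth = [0, 0, 0, 1, 0, 1, 0, 1]

COST_FN = 5.0  # hypothetical cost of a missed diagnosis
COST_FP = 1.0  # hypothetical cost of a false alarm

def total_cost(threshold: float) -> float:
    """Sum of error costs when predicting positive at score >= threshold."""
    cost = 0.0
    for score, label in zip(toy_scores, toy_truth):
        predicted = int(score >= threshold)
        if predicted == 1 and label == 0:
            cost += COST_FP
        elif predicted == 0 and label == 1:
            cost += COST_FN
    return cost

candidate_thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
costs = {th: total_cost(th) for th in candidate_thresholds}
best_threshold = min(costs, key=costs.get)

print(costs)
print("lowest-cost threshold:", best_threshold)
```

Under these assumed costs, a threshold below 0.5 comes out ahead because each missed diagnosis is weighted more heavily than a false alarm.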

**Model Card (short):**
- **Data:** Synthetic EHR-like, N=400; features: age, BMI, systolic BP; label: binary hypertension diagnosis (simulated).  
- **Intended Use:** Classroom demonstration only, not for clinical decisions.  
- **Metrics:** ROC‑AUC, PR‑AUC, F1, confusion matrix.  
- **Limitations & Risks:** synthetic distribution; missing confounders; threshold sensitive; not calibrated.  
- **Ethics:** No PHI; avoid clinical claims; discuss bias & prevalence effects.
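
The prevalence effect mentioned under Ethics can be illustrated with Bayes' rule: at a fixed sensitivity and specificity, the positive predictive value (PPV) changes with how common the condition is. The sketch below uses a hypothetical operating point of 0.90 sensitivity and 0.90 specificity, which is an assumption for illustration and not a result from this notebook's model.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV via Bayes' rule: P(condition | positive prediction)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical operating point; only the prevalence changes between rows
for prevalence in (0.01, 0.10, 0.30):
    ppv = positive_predictive_value(0.90, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f} -> PPV={ppv:.3f}")
```

This is one reason precision measured on a test set may not transfer to a population with a different positive rate.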