# Day 4 — Machine Learning Foundations

**Duration:** 2 hours

**Objectives:**
- Understand supervised vs unsupervised learning
- Train a simple classifier and evaluate it
- Learn the basic ML workflow and evaluation metrics

## 1. Quick recap of ML concepts

- Supervised learning (regression, classification)
- Unsupervised learning (clustering, dimensionality reduction)
- Reinforcement learning (an agent learns from rewards)

Today we'll focus on supervised classification with logistic regression and a decision tree.
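
To make the supervised/unsupervised distinction concrete, here is a minimal sketch on made-up toy data (the arrays `X_toy` and `y_toy` are illustrative assumptions, not part of the Titanic workflow). The supervised model is given labels; the clustering model is not.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data (made up for illustration): two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                   rng.normal(5.0, 0.5, size=(20, 2))])
y_toy = np.array([0] * 20 + [1] * 20)  # labels are used only in the supervised case

# Supervised: the model sees both features and labels.
clf = LogisticRegression().fit(X_toy, y_toy)
supervised_acc = clf.score(X_toy, y_toy)

# Unsupervised: the model sees features only and must find structure itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)
cluster_ids = km.labels_  # cluster numbering is arbitrary (0 and 1 may be swapped)

print('Supervised training accuracy:', supervised_acc)
print('Cluster sizes:', np.bincount(cluster_ids))
```

Note that `fit` takes `X` and `y` in the supervised case but only `X` in the unsupervised one.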

## 2. Dataset: Titanic (classification task)

Goal: predict `survived` from four features: `pclass`, `sex`, `age`, and `fare`.

In [None]:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load and prepare data
titanic = sns.load_dataset('titanic')
df = titanic[['survived','pclass','sex','age','fare']].copy()
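# Simple preprocessing.
# Impute missing ages with the median. Note: fitting the imputer on the full
# dataset before splitting leaks a little test information into training.
# That is acceptable for this demo; in real projects, fit on the training split only.
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='median')
df['age'] = imp.fit_transform(df[['age']]).ravel()  # flatten the (n, 1) output to 1-D
# Encode sex as 0/1; pclass is already numeric
df['sex'] = df['sex'].map({'male':0,'female':1})
# Drop any remaining rows with missing values
df = df.dropna()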

X = df[['pclass','sex','age','fare']]
y = df['survived'].astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Train shape:', X_train.shape, 'Test shape:', X_test.shape)
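
The cell above imputes `age` before splitting, which is fine for a demo. A leak-free variant fits the imputer on the training split only. The sketch below shows the idea on a small made-up array standing in for `age` (the names `ages` and `labels` and their values are assumptions for illustration).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up single-feature data with missing values (np.nan), standing in for `age`.
ages = np.array([[22.0], [38.0], [np.nan], [35.0], [np.nan],
                 [54.0], [2.0], [27.0], [14.0], [np.nan]])
labels = np.array([0, 1, 1, 1, 0, 0, 0, 1, 1, 0])

ages_train, ages_test, lab_train, lab_test = train_test_split(
    ages, labels, test_size=0.2, random_state=42)

# Fit on the training split only, then reuse the learned median for the test split.
imputer = SimpleImputer(strategy='median')
ages_train_filled = imputer.fit_transform(ages_train)
ages_test_filled = imputer.transform(ages_test)

print('Median learned from training rows:', imputer.statistics_[0])
```

The same pattern (fit on train, transform on test) applies to scalers and encoders.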

## 3. Train Logistic Regression

In [None]:
# Logistic Regression
lr = LogisticRegression(max_iter=200)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print('Logistic Regression accuracy:', accuracy_score(y_test, y_pred_lr))
print('\nClassification Report:\n', classification_report(y_test, y_pred_lr))

## 4. Train Decision Tree

In [None]:
# Decision Tree
dt = DecisionTreeClassifier(max_depth=5, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print('Decision Tree accuracy:', accuracy_score(y_test, y_pred_dt))
print('\nClassification Report:\n', classification_report(y_test, y_pred_dt))

## 5. Confusion Matrix (LR)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

cm = confusion_matrix(y_test, y_pred_lr)
plt.figure(figsize=(5,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Logistic Regression Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
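
To read the heatmap, it helps to see how the usual metrics fall out of the four cells. The counts below are hypothetical, chosen only to keep the arithmetic easy; they are not the output of the cell above.

```python
# Hypothetical counts for illustration only; NOT results from this notebook.
# Layout matches sklearn's confusion_matrix: rows = actual, columns = predicted.
cm_example = [[50, 10],   # actual 0: 50 true negatives, 10 false positives
              [5, 35]]    # actual 1: 5 false negatives, 35 true positives

tn, fp = cm_example[0]
fn, tp = cm_example[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted survivors, the share who actually survived
recall = tp / (tp + fn)      # of actual survivors, the share the model found
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.3f} precision={precision:.3f} '
      f'recall={recall:.3f} f1={f1:.3f}')
```

These are the same quantities that `classification_report` prints for class 1.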

## 6. Feature importance / coefficients

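Inspect the logistic regression coefficients: a positive coefficient raises the predicted log-odds of survival, and a negative one lowers it. Because the features are on different scales, compare signs rather than raw magnitudes across features.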

In [None]:
coeffs = pd.DataFrame({'feature': X.columns, 'coef': lr.coef_[0]})
coeffs.sort_values('coef', ascending=False)
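
Coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to explain. The sketch below uses hypothetical coefficient values (an assumption for illustration); substitute `lr.coef_[0]` to get the real ones.

```python
import math

# Hypothetical coefficients for illustration; run the cell above for the real ones.
example_coefs = {'sex': 2.5, 'pclass': -1.0, 'age': -0.03}

# exp(coef) is the multiplicative change in the odds for a one-unit increase
# in the feature, holding the other features fixed.
odds_ratios = {feature: math.exp(coef) for feature, coef in example_coefs.items()}

for feature, ratio in odds_ratios.items():
    print(f'{feature}: coef={example_coefs[feature]:+.2f}, odds ratio={ratio:.2f}')
```

An odds ratio above 1 means the feature pushes toward survival; below 1 means it pushes against.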


## 7. Overfitting vs Underfitting (demo idea)

We capped the decision tree at `max_depth=5` to limit overfitting. Try increasing the depth: training accuracy should keep rising, while test accuracy typically plateaus or drops. For homework, experiment with tree depth and record train vs test accuracy.
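
One possible way to run the depth experiment is a simple loop over `max_depth`. The sketch below uses synthetic data from `make_classification` so it runs on its own (an assumption for illustration); in the notebook you can reuse `X_train`, `X_test`, `y_train`, `y_test` from Section 2 instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so this cell is self-contained.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

depth_results = []
for depth in [1, 2, 3, 5, 8, None]:   # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(Xd_train, yd_train)
    train_acc = tree.score(Xd_train, yd_train)
    test_acc = tree.score(Xd_test, yd_test)
    depth_results.append((depth, train_acc, test_acc))
    print(f'max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}')
```

A widening gap between the train and test columns is the usual sign of overfitting.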

## 8. Exercise (in-notebook)

1. Train a k-Nearest Neighbors classifier (k=5) on the same data. Compare its accuracy with the two models above.
2. Scale `age` and `fare` with StandardScaler and re-run Logistic Regression. Note whether the accuracy changes.
3. (Optional) Use cross-validation to estimate performance more reliably than a single train/test split.

In [None]:
# Exercise starters: KNN + scaling
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# KNN
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print('KNN accuracy:', accuracy_score(y_test, knn.predict(X_test)))

# Scaling + Logistic Regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

lr2 = LogisticRegression(max_iter=200)
lr2.fit(X_train_scaled, y_train)
print('LogReg (scaled) accuracy:', accuracy_score(y_test, lr2.predict(X_test_scaled)))
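
For the optional cross-validation exercise, one possible starting point is `cross_val_score` with a pipeline. The sketch uses synthetic stand-in data so it runs on its own (an assumption for illustration); in the notebook, pass the Titanic `X` and `y` from Section 2 instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data so this cell is self-contained.
X_cv, y_cv = make_classification(n_samples=500, n_features=4, n_informative=2,
                                 n_redundant=0, random_state=42)

# Putting the scaler inside the pipeline means it is re-fit on each training
# fold, so the held-out fold never influences the scaling.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
cv_scores = cross_val_score(model, X_cv, y_cv, cv=5, scoring='accuracy')

print('Fold accuracies:', cv_scores.round(3))
print(f'Mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})')
```

Reporting the mean together with the spread across folds gives a better sense of how stable the estimate is.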

## 9. Wrap-up & Reading

Suggested: Hands-On Machine Learning with Scikit-Learn & TensorFlow (Aurélien Géron). Tomorrow: Day 5 — Applied project & take-home assignment.