# Supervised Learning

This is the second session in our machine learning series. In the first
session, we introduced machine learning conceptually and demonstrated a
basic supervised learning workflow using logistic regression to predict
survivors on the Titanic. This session builds on the first by exploring
supervised learning in more detail, covering the different approaches to
supervised learning, and looking at some of the most common algorithms
for supervised tasks.

## Slides

Use the left ⬅️ and right ➡️ arrow keys to navigate through the slides
below. To view in a separate tab/window,
<a href="slides.html" target="_blank">follow this link</a>.

## What is Supervised Learning

Supervised learning means training a model on data where the outcome is
already known, in order to predict outcomes on new data where the answer
isn’t known. You provide features (input variables) and a target (the
outcome to predict), and the model learns the relationship between them.
This differs from unsupervised learning, where no target exists and the
goal is finding patterns or groupings, and from reinforcement learning,
where the model learns through trial and error.

The fundamental structure of supervised learning involves features and
targets. Features are the input variables that describe each
observation: patient age, diagnosis codes, medication count, previous
admissions. The target is what you want to predict: whether the patient
will be readmitted, how long they’ll stay, which treatment will work
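best. Features can be numerical (age, test results) or categorical
(diagnosis, sex). Most algorithms can work with both types, although
many implementations, including most scikit-learn estimators, expect
categorical features to be encoded as numbers first.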

There are two different kinds of supervised learning task, based on the
target type:

-   Classification - Predicting categories (often binary outcomes). Will
    this patient be readmitted (yes/no)? Which diagnosis applies (A, B,
    C, D)? Is this scan normal or abnormal?
-   Regression - Predicting continuous numbers. How many days will the
    patient stay? What will next month’s A&E attendances be? What is the
    predicted blood pressure?

Some algorithms work for both tasks. Decision trees, for instance, can
do classification by predicting the most common class in each leaf node,
or regression by predicting the average value in each leaf.
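
As a minimal sketch of that difference, the snippet below takes the target values of the training rows that landed in a single leaf and turns them into a prediction both ways. The leaf contents and the function names are made up for illustration; a real tree library does this internally.

```python
from collections import Counter
from statistics import mean

# Hypothetical target values for the training rows that ended up in one leaf.
leaf_classes = ["readmitted", "not readmitted", "readmitted", "readmitted"]
leaf_stay_days = [2.0, 3.0, 7.0, 4.0]


def classification_leaf_prediction(labels):
    """Return the most common class among the labels in a leaf."""
    return Counter(labels).most_common(1)[0][0]


def regression_leaf_prediction(values):
    """Return the average target value among the rows in a leaf."""
    return mean(values)


print(classification_leaf_prediction(leaf_classes))  # readmitted
print(regression_leaf_prediction(leaf_stay_days))    # 4.0
```

The splitting logic is the same in both cases; only the summary taken at the leaf changes.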

The workflow we introduced in Session 1 applies to all supervised
learning: get labelled data, split it into training and testing sets,
train a model on the training data, predict on the test data, and
evaluate performance. The training/testing split is critical because it
simulates real-world deployment. The model learns patterns from training
data, then we test whether those patterns generalise to new data.
Evaluating on the training data would give a misleadingly optimistic
score: the model has already seen those examples and optimised for them.

The central challenge in supervised learning is generalisation. Models
must learn patterns that work on new data, not just memorise the
training set. Underfitting occurs when the model is too simple and
misses important patterns, performing poorly on both training and test
data. Overfitting occurs when the model is too complex and learns noise
in the training data, performing well on training data but poorly on
test data. Finding the right complexity level is fundamental to building
useful models.

## Understanding How Models Learn

When analysts first encounter machine learning, the training process
often feels like magic. You pass data into a function, get a trained
model back, and somehow it can make predictions. Understanding what
happens during training transforms machine learning from magic into a
tool you can control and apply effectively.

Different algorithms approach learning in fundamentally different ways.
Logistic regression finds linear boundaries that separate classes.
Decision trees ask sequential yes/no questions. Random forests combine
many trees to get more stable predictions. Gradient boosting builds
trees that learn from previous mistakes. Each approach has strengths and
weaknesses that make it better suited to different types of problems.

This matters in practice because there is no single best algorithm. The
model that works best depends on your data structure, how much data you
have, whether you need to explain predictions to clinicians, and what
types of patterns exist in your features. Understanding how algorithms
work lets you make informed choices about which to try and how to
interpret their results.

## Comparing Multiple Models

We’ll train four different models on the Titanic dataset and compare
their performance. This demonstrates how different algorithms learn
different patterns from the same data.

### Setup

We’ll use the same libraries as Session 1, with additions for tree-based
models and visualisation.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# set random seed for reproducibility
np.random.seed(42)

### Data Preparation

We’ll follow the same workflow as Session 1, preparing the Titanic
dataset for modelling.

In [2]:
# load titanic data
df = sns.load_dataset('titanic')

# select features and target
X = df[['pclass', 'sex', 'age', 'fare']]
y = df['survived']

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# prepare features
def prepare_features(data):
    data = data.copy()
    data['age'] = data['age'].fillna(data['age'].median())
    data['sex'] = (data['sex'] == 'female').astype(int)
    return data

X_train = prepare_features(X_train)
X_test = prepare_features(X_test)
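
Note that `prepare_features` fills missing ages with the median of whichever frame it receives, so the test set is imputed with its own median. A stricter variant, sketched here as an assumption rather than the approach used in this session, learns the fill value from the training rows only and reuses it for the test rows. The helper name `prepare_features_with` and the tiny frames are made up for illustration; the accuracies reported in this session come from the original function.

```python
import pandas as pd

# Tiny made-up frames standing in for X_train and X_test.
train = pd.DataFrame({
    "age": [20.0, 30.0, None, 40.0],
    "sex": ["male", "female", "female", "male"],
})
test = pd.DataFrame({
    "age": [None, 70.0],
    "sex": ["female", "male"],
})


def prepare_features_with(data, age_fill_value):
    """Like prepare_features, but the fill value is passed in explicitly."""
    data = data.copy()
    data["age"] = data["age"].fillna(age_fill_value)
    data["sex"] = (data["sex"] == "female").astype(int)
    return data


# Learn the fill value from the training rows only, then reuse it everywhere.
train_age_median = train["age"].median()
train_prepared = prepare_features_with(train, train_age_median)
test_prepared = prepare_features_with(test, train_age_median)

print(test_prepared)
```

This mirrors deployment more closely, where a new patient or passenger arrives alone and there is no test-set median to compute.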

### Logistic Regression

We already covered logistic regression in Session 1. It finds a linear
boundary that best separates survivors from non-survivors in feature
space. Logistic regression works well when the relationship between
features and outcomes is approximately linear, and it produces
interpretable coefficients that show how each feature influences
predictions.

In [3]:
# train logistic regression
lr_model = LogisticRegression(max_iter=1000, random_state=42)
lr_model.fit(X_train, y_train)

# evaluate on test set
lr_pred = lr_model.predict(X_test)
lr_accuracy = accuracy_score(y_test, lr_pred)

print(f"Logistic Regression Accuracy: {lr_accuracy:.1%}")

Logistic Regression Accuracy: 79.5%
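
To make "interpretable coefficients" more concrete, here is a small sketch of how a logistic regression turns coefficients into a probability, and how a coefficient can be read as an odds ratio. The intercept and coefficient values are hypothetical, chosen only for illustration; they are not the fitted values of `lr_model`.

```python
import math

# Hypothetical values for illustration only; not the fitted lr_model.
intercept = -1.0
coefficients = {"sex": 2.5, "pclass": -1.0}


def predict_probability(features):
    """Logistic regression: sigmoid of the intercept plus weighted features."""
    log_odds = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-unit increase in that feature, holding the others fixed.
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

male_third = predict_probability({"sex": 0, "pclass": 3})
female_third = predict_probability({"sex": 1, "pclass": 3})

print(f"male, 3rd class:    {male_third:.3f}")
print(f"female, 3rd class:  {female_third:.3f}")
print(f"odds ratio for sex: {odds_ratios['sex']:.2f}")
```

For a fitted scikit-learn model, the equivalent numbers live in `lr_model.intercept_` and `lr_model.coef_`.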

### Decision Trees

Decision trees learn by recursively splitting the data based on feature
values. At each node, the tree asks a yes/no question about a feature
and splits passengers into two groups. The algorithm chooses splits that
best separate survivors from non-survivors, measured by metrics like
Gini impurity[1].

The tree continues splitting until it reaches stopping criteria: all
passengers in a node have the same outcome, the node contains too few
passengers to split further, or the tree reaches its maximum depth. The
result is a series of rules that can be followed from root to leaf to
make a prediction.

[1] Gini impurity measures how mixed the classes are in a node. A pure
node (all survivors or all deaths) has Gini = 0. A 50/50 split between
two classes has Gini = 0.5. The algorithm chooses the split whose child
nodes have the lowest impurity, weighted by the number of samples in
each.
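
The Gini values in the footnote can be checked by hand. The sketch below computes the impurity of a node as one minus the sum of squared class proportions, plus the size-weighted impurity of a candidate split. The function names and the example nodes are illustrative, not part of scikit-learn.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def weighted_gini(left_labels, right_labels):
    """Impurity of a split: child impurities weighted by child size."""
    n_left, n_right = len(left_labels), len(right_labels)
    total = n_left + n_right
    return (
        (n_left / total) * gini_impurity(left_labels)
        + (n_right / total) * gini_impurity(right_labels)
    )


pure_node = [1, 1, 1, 1]    # everyone survived
mixed_node = [0, 1, 0, 1]   # half survived, half died

print(gini_impurity(pure_node))                   # 0.0
print(gini_impurity(mixed_node))                  # 0.5
print(weighted_gini([0, 0, 0, 1], [1, 1, 1, 1]))  # 0.1875
```

A tree compares candidate splits by this weighted figure and keeps the lowest one.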

In [4]:
# train decision tree
tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_model.fit(X_train, y_train)

# evaluate on test set
tree_pred = tree_model.predict(X_test)
tree_accuracy = accuracy_score(y_test, tree_pred)

print(f"Decision Tree Accuracy: {tree_accuracy:.1%}")

Decision Tree Accuracy: 80.6%

We can visualise the tree structure to see exactly what rules it
learned.

In [5]:
plt.figure(figsize=(20, 10))
plot_tree(
    tree_model,
    feature_names=list(X_train.columns),  # plain list of column names
    class_names=['Died', 'Survived'],
    filled=True,
    fontsize=10
)
plt.title("Decision Tree Structure (max_depth=3)")
plt.tight_layout()
plt.show()

The tree shows the splitting logic clearly. Each box contains the
splitting rule, the Gini impurity, the number of samples, and the
predicted class. Following any path from root to leaf gives you the
sequence of decisions that lead to a prediction. For example, if a
passenger is female (sex ≤ 0.5 is false), the tree predicts survival
regardless of other features.

This interpretability is the key strength of decision trees. You can
explain exactly why a model made a specific prediction by showing the
decision path. However, trees have a significant weakness: they tend to
overfit the training data by learning overly specific rules that don’t
generalise well.

### Random Forests

Random forests address the overfitting problem by training many trees
and combining their predictions. Each tree in the forest is trained on a
bootstrap sample of the training data (rows drawn at random with
replacement), and at each split, only a random subset of features is
considered. This randomness ensures each tree learns slightly different
patterns.

When making predictions, each tree votes for a class, and the forest
returns the majority vote (scikit-learn’s implementation averages the
trees’ predicted class probabilities rather than counting hard votes).
Because individual trees make different mistakes, combining them
produces more reliable predictions that generalise better to new data.
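
To see the combining step directly, the sketch below fits a small forest on synthetic data (generated only so the example runs on its own), asks each individual tree for its class probabilities, and averages them. In scikit-learn this average is what the forest itself reports, so the two should agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only so the example runs on its own.
X, y = make_classification(n_samples=300, n_features=4, random_state=42)

forest = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=42)
forest.fit(X, y)

# Ask every fitted tree for its class probabilities, then average them.
per_tree_proba = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
averaged_proba = per_tree_proba.mean(axis=0)

# The class with the highest averaged probability is the ensemble's prediction.
manual_pred = forest.classes_[averaged_proba.argmax(axis=1)]

print("probabilities match:", np.allclose(averaged_proba, forest.predict_proba(X)))
print("share of matching predictions:", (manual_pred == forest.predict(X)).mean())
```

Individual trees can disagree with each other on a given row; the averaged result is what smooths those disagreements out.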

In [6]:
# train random forest
rf_model = RandomForestClassifier(
    n_estimators=100,  # number of trees
    max_depth=3,       # keep trees shallow to match single tree
    random_state=42
)
rf_model.fit(X_train, y_train)

# evaluate on test set
rf_pred = rf_model.predict(X_test)
rf_accuracy = accuracy_score(y_test, rf_pred)

print(f"Random Forest Accuracy: {rf_accuracy:.1%}")

Random Forest Accuracy: 78.7%

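Random forests typically outperform single decision trees because the
ensemble smooths out individual tree errors, although not in this run:
the single tree scored 80.6% against the forest’s 78.7%, a gap of five
passengers out of the 268 in the test set. The trade-off is reduced
interpretability: you can’t easily visualise 100 trees or explain a
prediction as a single decision path.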

### Gradient Boosting

Gradient boosting takes a different approach to combining trees. Instead
of training trees independently and averaging, it trains them
sequentially. Each new tree focuses on the mistakes made by previous
trees, gradually improving the model’s performance.

The process works as follows: train the first tree on the original data,
identify where it makes prediction errors, train the second tree to
correct those errors, and repeat. The final prediction is the sum of all
the trees’ contributions, each scaled by the learning rate, so later
trees usually make progressively smaller corrections as the remaining
errors shrink.
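
The sequential idea can be sketched by hand. The example below is a simplified version for a regression target with squared error, where "the mistakes" are simply the residuals; `GradientBoostingClassifier` applies the same idea to a classification loss, so treat this as an illustration of the principle rather than a reimplementation. The data is synthetic and the variable names are our own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only so the example runs on its own.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
n_trees = 50

# Start from a constant prediction: the mean of the target.
prediction = np.full_like(y, y.mean())
errors = [np.mean((y - prediction) ** 2)]

for _ in range(n_trees):
    # What the ensemble built so far still gets wrong.
    residuals = y - prediction
    # The new tree is trained to predict those remaining errors.
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X, residuals)
    # Add a scaled-down correction to the running prediction.
    prediction += learning_rate * tree.predict(X)
    errors.append(np.mean((y - prediction) ** 2))

print(f"training MSE before boosting: {errors[0]:.4f}")
print(f"training MSE after {n_trees} trees: {errors[-1]:.4f}")
```

A smaller `learning_rate` makes each correction more cautious, which usually means more trees are needed to reach the same training error.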

In [7]:
# train gradient boosting model
gb_model = GradientBoostingClassifier(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,  # how much each tree contributes
    random_state=42
)
gb_model.fit(X_train, y_train)

# evaluate on test set
gb_pred = gb_model.predict(X_test)
gb_accuracy = accuracy_score(y_test, gb_pred)

print(f"Gradient Boosting Accuracy: {gb_accuracy:.1%}")

Gradient Boosting Accuracy: 81.7%

Gradient boosting often achieves the best performance of these
algorithms, particularly on tabular data, and it scored highest here.
Gradient boosting libraries such as XGBoost and LightGBM are widely used
in Kaggle competitions and in production systems. The sequential
learning process allows it to capture complex patterns that other models
can miss. Like random forests, the trade-off is reduced interpretability
compared to single trees.

## Comparing Model Performance

Different models learned different patterns from the same training data.
Comparing their test set performance shows which patterns generalised
best to unseen data.

In [8]:
# create comparison dataframe
comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'Decision Tree', 'Random Forest', 'Gradient Boosting'],
    'Accuracy': [lr_accuracy, tree_accuracy, rf_accuracy, gb_accuracy]
})

# sort by accuracy
comparison = comparison.sort_values('Accuracy', ascending=False).reset_index(drop=True)

print(comparison.to_string(index=False))

              Model  Accuracy
  Gradient Boosting  0.817164
      Decision Tree  0.805970
Logistic Regression  0.794776
      Random Forest  0.787313

The results illustrate several principles. First, model rankings are
specific to the dataset and the split: gradient boosting came first
here, but no single model dominates across all problems. Second, more
complex models (random forest, gradient boosting) often outperform
simpler ones, but not always: the random forest finished last in this
comparison. Third, even small accuracy differences can be meaningful
depending on the application[2].

We can also examine what each model learned by looking at feature
importance. Tree-based models calculate feature importance based on how
much each feature reduces impurity when used for splitting.

[2] In healthcare applications, statistical significance doesn’t always
equal clinical significance. A 2% improvement in accuracy might be
meaningless for one use case and critically important for another.
Domain knowledge determines whether model differences matter.

In [9]:
# create dataframe for plotting
feature_names = X_train.columns
importance_data = pd.DataFrame({
    'Feature': list(feature_names) * 3,
    'Importance': list(tree_model.feature_importances_) + 
                  list(rf_model.feature_importances_) + 
                  list(gb_model.feature_importances_),
    'Model': ['Decision Tree'] * len(feature_names) + 
             ['Random Forest'] * len(feature_names) + 
             ['Gradient Boosting'] * len(feature_names)
})

# create faceted plot
g = sns.FacetGrid(importance_data, col='Model', height=4, aspect=0.8)
g.map_dataframe(sns.barplot, y='Feature', x='Importance', order=feature_names)
g.set_axis_labels('Importance', '')
g.set_titles(col_template='{col_name}')
plt.tight_layout()
plt.show()

The feature importance plots reveal what each model learned. All three
tree-based models identified sex as the most important feature, which
aligns with historical accounts of “women and children first” evacuation
procedures. However, the models weight other features differently,
showing they learned distinct patterns from the same data.

## Overfitting and Model Complexity

Model complexity is controlled through hyperparameters. For decision
trees, the most important hyperparameter is `max_depth`, which limits
how many sequential questions the tree can ask. Shallow trees underfit
by not learning enough patterns. Deep trees overfit by learning noise in
the training data.

In [10]:
# test different tree depths
depths = range(1, 16)
train_scores = []
test_scores = []

for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    
    train_scores.append(accuracy_score(y_train, tree.predict(X_train)))
    test_scores.append(accuracy_score(y_test, tree.predict(X_test)))

# create dataframe for plotting
overfitting_data = pd.DataFrame({
    'Tree Depth': list(depths) * 2,
    'Accuracy': train_scores + test_scores,
    'Dataset': ['Training Accuracy'] * len(depths) + ['Test Accuracy'] * len(depths)
})

# plot results
plt.figure(figsize=(10, 6))
sns.lineplot(data=overfitting_data, x='Tree Depth', y='Accuracy', hue='Dataset', 
             marker='o', markersize=6)
plt.title('Overfitting in Decision Trees')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

The plot demonstrates the classic overfitting pattern. As tree depth
increases, training accuracy continues to improve because the tree
learns increasingly specific rules about the training data. However,
test accuracy peaks at moderate depth then declines. Beyond that peak,
the tree is learning patterns that don’t generalise to new data.

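Finding the optimal complexity requires validation on held-out data.
Ideally that is a separate validation set or cross-validation on the
training data, because choosing a depth from test accuracy, as the plot
above does for illustration, makes the final test score optimistic.
This principle applies to all models: logistic regression regularisation
strength, random forest tree count and depth, gradient boosting learning
rate and iterations. The goal is maximising generalisation, not training
performance.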
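
One common way to choose a complexity level without touching the test set is cross-validation on the training data. The sketch below is one possible setup, not the method used in this session: it uses synthetic data so it runs on its own, scores each candidate depth with 5-fold cross-validation, and evaluates on the test set only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the example runs on its own.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, flip_y=0.1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Score each candidate depth with 5-fold cross-validation on the training set.
cv_scores = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cv_scores[depth] = cross_val_score(tree, X_train, y_train, cv=5).mean()

best_depth = max(cv_scores, key=cv_scores.get)

# Only now touch the test set, once, with the chosen depth.
final_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_tree.fit(X_train, y_train)

print(f"best depth by cross-validation: {best_depth}")
print(f"test accuracy at that depth: {final_tree.score(X_test, y_test):.1%}")
```

Because the test set played no part in picking `best_depth`, the final score is a fairer estimate of how the model would perform on new data.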

## Summary

Understanding how models learn transforms machine learning from a black
box into a practical tool. Different algorithms approach learning
differently: logistic regression finds linear boundaries, decision trees
ask sequential questions, random forests combine many trees through
voting, and gradient boosting trains trees that learn from previous
mistakes.

No algorithm works best in all situations. Simpler models like logistic
regression and shallow decision trees are more interpretable but may
miss complex patterns. Ensemble methods like random forests and gradient
boosting often achieve better performance but sacrifice
interpretability. The right choice depends on your data, your
constraints, and whether you need to explain predictions to
stakeholders.

The key principles from this session apply broadly. Test multiple models
and compare performance. Use validation data to detect overfitting.
Examine feature importance to understand what models learned. Control
complexity through hyperparameters. These practices form the foundation
of effective model development.

What we haven’t covered yet is how to systematically tune
hyperparameters, how to handle imbalanced classes, or how to evaluate
models using metrics beyond accuracy. We also haven’t explored feature
engineering, which often matters more than algorithm choice. These
topics build on the foundations we’ve established and will be covered in
future sessions.