# An Introduction to Machine Learning
## Session 2a: More Classification Models (Decision Trees and k-NN)

Welcome to Session 2a! Today, we’ll explore two additional classification models: Decision Trees and k-Nearest Neighbours (k-NN). Each model takes a different approach to classifying data, which we’ll examine in depth.

We’ll apply both models to the Titanic dataset and compare them to our previous Logistic Regression model, looking at accuracy, precision, recall, and model interpretability. By the end, you’ll have a broader toolkit for classification tasks and a better understanding of how different models handle the same data.

### 1. Importing packages and data prep

In [None]:
# Run this cell to import additional libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

In [None]:
# Load and preprocess Titanic dataset
titanic_data = pd.read_csv("../data/titanic_train.csv")
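# Assign the filled columns back rather than using inplace=True on a column
# selection, which recent pandas versions warn may not update the DataFrame
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(titanic_data['Embarked'].mode()[0])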
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1})
titanic_data = pd.get_dummies(titanic_data, columns=['Embarked'], drop_first=True)

# Define features and target
X = titanic_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Q', 'Embarked_S']]
y = titanic_data['Survived']

# Split the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

### 2. Decision Trees

In [None]:
# Initialise the Decision Tree Classifier
tree_model = DecisionTreeClassifier(random_state=42)

In [None]:
# EXERCISE: Train the Decision Tree model on X_train and y_train.
# Hint: Use .fit() method.

tree_model.fit(____, ____)

In [None]:
# EXERCISE: Predict on X_test and calculate accuracy, precision, and recall.

y_pred_tree = tree_model.predict(____)

accuracy_tree = accuracy_score(y_test, y_pred_tree)
precision_tree = precision_score(y_test, y_pred_tree)
recall_tree = recall_score(y_test, y_pred_tree)

print(f"Decision Tree Accuracy: {accuracy_tree}")
print(f"Decision Tree Precision: {precision_tree}")
print(f"Decision Tree Recall: {recall_tree}")

In [None]:
# Plot confusion matrix
cm_tree = confusion_matrix(y_test, y_pred_tree)
ConfusionMatrixDisplay(confusion_matrix=cm_tree, display_labels=['Did Not Survive', 'Survived']).plot(cmap='Blues')
plt.title('Decision Tree Confusion Matrix')
plt.show()
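
Part of a decision tree’s appeal is that its learned rules can be read directly. The sketch below is a minimal illustration on a small synthetic dataset (not the Titanic data); the feature names `feature_a` and `feature_b` and the `max_depth=2` setting are assumptions chosen for the example. It uses scikit-learn’s `export_text` to print the fitted splits as nested rules.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: the label depends only on whether feature_a exceeds 0.5
rng = np.random.default_rng(0)
X_demo = rng.random((200, 2))
y_demo = (X_demo[:, 0] > 0.5).astype(int)

# A shallow tree keeps the printed rules short
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Render the fitted splits as indented text rules
rules = export_text(demo_tree, feature_names=["feature_a", "feature_b"])
print(rules)
```

On the notebook’s fitted model, the analogous call would be `export_text(tree_model, feature_names=list(X_train.columns))`, though a tree with no depth limit can produce very long output.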

### 3. k-Nearest Neighbours (k-NN) Model

In [None]:
# Initialise the k-NN model with 5 neighbours
knn_model = KNeighborsClassifier(n_neighbors=5)

In [None]:
# EXERCISE: Train the k-NN model on X_train and y_train.

knn_model.fit(____, ____)

In [None]:
# EXERCISE: Predict on X_test using the k-NN model, then calculate accuracy, precision, and recall.

y_pred_knn = knn_model.predict(____)

accuracy_knn = accuracy_score(y_test, y_pred_knn)
precision_knn = precision_score(y_test, y_pred_knn)
recall_knn = recall_score(y_test, y_pred_knn)

print(f"k-NN Accuracy: {accuracy_knn}")
print(f"k-NN Precision: {precision_knn}")
print(f"k-NN Recall: {recall_knn}")

In [None]:
# Plot confusion matrix for k-NN
cm_knn = confusion_matrix(y_test, y_pred_knn)
ConfusionMatrixDisplay(confusion_matrix=cm_knn, display_labels=['Did Not Survive', 'Survived']).plot(cmap='Blues')
plt.title('k-NN Confusion Matrix')
plt.show()

### 4. Comparing model performance

In [None]:
# Create a performance comparison table for Decision Tree and k-NN
performance = pd.DataFrame({
    "Model": ["Decision Tree", "k-NN"],
    "Accuracy": [accuracy_tree, accuracy_knn],
    "Precision": [precision_tree, precision_knn],
    "Recall": [recall_tree, recall_knn]
})

performance

1. Which model has the highest accuracy? 
2. Do precision and recall tell a different story compared to accuracy?
3. Which model do you think would be more reliable for predicting survival? Why?
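
The comparison table in this section covers only the two models from this session. As a hedged sketch of how the earlier Logistic Regression model could be added to the same comparison, the example below evaluates three classifiers in one loop. It uses synthetic data from `make_classification` so that it runs on its own; the helper name `evaluate_models` and the `max_iter=1000` setting are assumptions, not part of the original notebooks.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(models, X_tr, X_te, y_tr, y_te):
    """Fit each model and collect accuracy, precision and recall in a table."""
    rows = []
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_te)
        rows.append({
            "Model": name,
            "Accuracy": accuracy_score(y_te, y_pred),
            "Precision": precision_score(y_te, y_pred),
            "Recall": recall_score(y_te, y_pred),
        })
    return pd.DataFrame(rows)


# Synthetic stand-in for the Titanic features (illustrative only)
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

comparison = evaluate_models(
    {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "k-NN": KNeighborsClassifier(n_neighbors=5),
    },
    X_tr, X_te, y_tr, y_te,
)
print(comparison)
```

Passing the notebook’s own `X_train`, `X_test`, `y_train` and `y_test` to the helper would give the three-way comparison on the Titanic data.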

### 5. Decision Boundaries

A decision boundary is a line or surface that separates different classes in a classification model. It helps to visualise how a model decides which class a data point belongs to based on its features. Imagine you’re classifying passengers on the Titanic as either “survived” or “did not survive” based on characteristics like ticket class and fare paid. A decision boundary shows where the model would classify one group differently from another. If a new data point falls on one side of the boundary, it gets classified into one category; if it falls on the other, it’s assigned to the other category.

Different models create decision boundaries in different ways. Decision Trees split the data sequentially on one feature at a time, so their boundaries are made of straight, axis-aligned segments that form box-like regions. These look less smooth, but each region corresponds to a readable sequence of rules. k-Nearest Neighbours (k-NN), on the other hand, classifies each point by a majority vote among the nearest training points, so its boundary follows the local distribution of the data. This can produce complex, irregular boundaries, especially when few neighbours are considered; increasing the number of neighbours averages over more points and smooths the boundary.
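
To make the effect of `n_neighbors` concrete, the sketch below fits k-NN with a small and a large k on a noisy synthetic dataset (`make_moons`, an assumption chosen because it has two features and overlapping classes; it is not the Titanic data). With k = 1 every training point is its own nearest neighbour, so the model reproduces the training labels exactly, which corresponds to a very jagged boundary. A larger k votes over more points and typically gives a smoother one.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Noisy two-class data in two dimensions (illustrative only)
X_moons, y_moons = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_moons, y_moons, test_size=0.25, random_state=0
)

scores = {}
for k in (1, 25):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)
    # Training accuracy shows how closely the boundary hugs the training points
    scores[k] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))
    print(f"k={k}: train accuracy={scores[k][0]:.3f}, test accuracy={scores[k][1]:.3f}")
```

A large gap between training and test accuracy at k = 1 is a sign that the boundary is fitting noise rather than structure.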

Decision boundaries are a helpful visual tool for understanding a model’s behaviour and limitations. When the classes are clearly separated, the boundary falls in the gap between them and few points end up on the wrong side. When the classes overlap, plotting the boundary shows where the model is likely to misclassify points. This helps us understand why a model makes certain predictions and where its accuracy might suffer.

In [None]:
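# Selecting two features for visualising decision boundaries (Pclass and Fare)
from matplotlib.colors import ListedColormap
from sklearn.base import clone

def plot_decision_boundary(model, X, y, title):
    X_selected = X[['Pclass', 'Fare']].values
    # Fit a fresh copy on the two selected features so the model passed in,
    # which was trained on all eight features, is left unchanged
    model = clone(model)
    model.fit(X_selected, y)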

    x_min, x_max = X_selected[:, 0].min() - 1, X_selected[:, 0].max() + 1
    y_min, y_max = X_selected[:, 1].min() - 1, X_selected[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, alpha=0.3, cmap=ListedColormap(['#FFAAAA', '#AAAAFF']))
    plt.scatter(X_selected[:, 0], X_selected[:, 1], c=y, edgecolor='k', s=20, cmap=ListedColormap(['#FF0000', '#0000FF']))
    plt.xlabel("Pclass")
    plt.ylabel("Fare")
    plt.title(title)
    plt.show()

# Plot decision boundary for Decision Tree
plot_decision_boundary(tree_model, X_train, y_train, "Decision Tree Decision Boundary")

# Plot decision boundary for k-NN
plot_decision_boundary(knn_model, X_train, y_train, "k-NN Decision Boundary")
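
The box-like regions in a Decision Tree plot correspond to the tree’s leaves, so limiting the depth of the tree limits how many boxes it can draw. The sketch below counts leaves for a shallow and an unrestricted tree on noisy synthetic data (`make_moons`, an assumption used so the example runs on its own; the depth values are also illustrative choices).

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Noisy two-feature synthetic data (illustrative only)
X_moons, y_moons = make_moons(n_samples=400, noise=0.3, random_state=0)

leaf_counts = {}
for depth in (2, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_moons, y_moons)
    # Each leaf corresponds to one axis-aligned rectangular region
    leaf_counts[depth] = tree.get_n_leaves()
    print(f"max_depth={depth}: {leaf_counts[depth]} leaves")
```

Passing `DecisionTreeClassifier(max_depth=2, random_state=42)` to `plot_decision_boundary` in place of `tree_model` would show the corresponding coarser boundary on the Titanic features.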