# Classification Examples - Titanic Dataset

This lesson uses the [Titanic dataset](course_datasets.md#titanic). It predicts survival from passenger title, class, gender, port of embarkation, fare band, and adult/child status, using logistic regression and decision tree classifiers.

Steps
* Load data into pandas
* Explore the data and remove any rows with missing values
* Select the feature columns and rename them
* Encode features (convert string columns into numbers, as required by the models): one-hot encoding for the nominal columns, ordinal encoding for passenger class
* Encode label column (Died -> 0, Survived -> 1)
* Split data into training and test sets
* Build logistic regression model, fit on training data and predict on test data
* Evaluate the model with a classification report and a confusion matrix
* Build decision tree model, fit on training data and predict on test data
* Show decision tree model graph
* Save the decision tree model and compare the two models


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, classification_report

from sklearn.tree import DecisionTreeClassifier, plot_tree
import joblib

Load the Titanic data from a CSV file at a public URL into a pandas DataFrame. This version of the dataset has already been cleaned to some extent: it has `Adult Or Child` and `Is Age Missing` columns, derived from the Age column, and a `FareBand` column derived from the Fare column. The original Name column has been split into `Title`, `Surname` and `Other Names` columns.

In [None]:
titanic_url = 'https://raw.githubusercontent.com/MarkWilcock/CourseDatasets/main/Misc%20Datasets/Titanic%20Passenger.csv'
df = pd.read_csv(titanic_url) # read the data
df.head(5) # show the first 5 rows

### Explore the Data

Before modelling, it is good practice to explore the dataset. This helps us understand:
- What columns are available and their data types
- Whether there are missing values (non-null counts below the total row count)
- The balance between classes (survived vs died) — an imbalanced dataset affects how we interpret accuracy

In [None]:
# Show column names, data types, and count of non-null values
df.info()

### Handle Missing Values

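Many scikit-learn estimators, including `LogisticRegression`, raise an error when the input contains `NaN` values. We drop every row that has a missing value. Note that `df.dropna()` checks all columns of the DataFrame, not only the ones we select as features later.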

In [None]:
df = df.dropna()
print(f"Rows remaining after dropping missing values: {len(df)}")
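
If the intent is to drop rows only when a *selected* column is missing, `dropna` accepts a `subset` argument. The sketch below uses a small made-up frame (the values and the `Age` column are assumptions for illustration, not taken from the course dataset) to show the difference between the two calls.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; 'Age' stands in for a column we do not model
toy = pd.DataFrame({
    "Survival": ["Died", "Survived", "Died", "Survived"],
    "Embarked": ["Southampton", np.nan, "Cherbourg", "Southampton"],
    "Age": [22.0, 38.0, np.nan, np.nan],
})

all_columns = toy.dropna()                                   # a NaN in any column drops the row
selected_only = toy.dropna(subset=["Survival", "Embarked"])  # only these columns are checked

print(len(all_columns), len(selected_only))
```

Restricting the check to the feature columns keeps rows whose only gap is in a column the model never sees.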

In [None]:
# Check class balance - are there roughly equal numbers of survivors and non-survivors?
df['Survival'].value_counts()
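
Class balance matters because it sets the accuracy a trivial model would reach by always predicting the majority class. A minimal sketch, using made-up labels rather than the real data:

```python
import pandas as pd

# Hypothetical labels: 6 died, 4 survived
survival = pd.Series(["Died"] * 6 + ["Survived"] * 4, name="Survival")

proportions = survival.value_counts(normalize=True)  # class shares, summing to 1
baseline_accuracy = proportions.max()                # accuracy of always predicting the majority class

print(proportions)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Any classifier worth keeping should beat this baseline on the test set.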

In [None]:
# Check the passenger class distribution
df['Passenger Class'].value_counts()

Keep only the columns that are likely to be predictive of survival. We leave out identifying columns such as `Surname` and `Ticket Number`, which are specific to individual passengers and would not generalise to new data.

In [None]:
# .copy() gives df_slim its own data, so later changes cannot affect df
df_slim = df[['Survival', 'Title', 'Passenger Class', 'Gender', 'Embarked', 'FareBand', 'Adult Or Child']].copy()
df_slim.head(5)

Rename columns to use a consistent `snake_case` style, which is the Python convention.

In [None]:
df_slim.columns = ['survival', 'title', 'pass_class', 'gender', 'embarked', 'fareband', 'adult_or_child']
df_slim.head(5)
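
Assigning a list to `.columns` relies on the names being in exactly the right order. An alternative sketch, assuming the same original column names, is `DataFrame.rename` with an explicit old-to-new mapping, which does not depend on column position:

```python
import pandas as pd

# Toy frame with two of the original column names, in a different order
toy = pd.DataFrame({"Passenger Class": ["1st"], "Survival": ["Died"]})

# Each column is renamed by name, so the order in the frame does not matter
renamed = toy.rename(columns={"Survival": "survival", "Passenger Class": "pass_class"})

print(list(renamed.columns))
```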

## Encoding Categorical Features

Machine learning models require numeric inputs. We need to convert our text columns into numbers.

**One-hot encoding** converts a categorical column into several binary (0/1) columns — one per category. For example, the `gender` column becomes two new columns:

| gender | → | gender_male | gender_female |
|--------|---|:-----------:|:-------------:|
| male   |   | 1           | 0             |
| female |   | 0           | 1             |

This avoids implying a numeric order or distance between categories, which integer codes such as 0, 1, 2 for the ports in `embarked` would incorrectly suggest to the model.

See [this explainer article](https://www.geeksforgeeks.org/ml-one-hot-encoding/) and the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

In [None]:
category_columns = ['title', 'gender', 'embarked', 'fareband', 'adult_or_child']
categorical_encoders = OneHotEncoder(sparse_output=False)
categorical_encoders

**Ordinal encoding** is used when a categorical column has a meaningful rank order. Passenger class is ordinal — 1st is "better" than 2nd, which is "better" than 3rd — so we encode it as numbers that preserve this order:

| pass_class | Encoded value |
|:----------:|:-------------:|
| 1st        | 0             |
| 2nd        | 1             |
| 3rd        | 2             |

We specify the order explicitly with `categories=[pass_class_values]` so the encoder knows the intended ranking rather than guessing.

In [None]:
ordinal_columns =  ['pass_class']
pass_class_values = ['1st', '2nd', '3rd']
ordinal_encoders = OrdinalEncoder(categories=[pass_class_values]) 
ordinal_encoders
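
The explicit order matters most when alphabetical order differs from the intended rank. The sketch below uses a hypothetical `band` column with values `low`, `medium`, `high` (not a column of the course dataset) to contrast the default ordering with an explicit one.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordinal column, for illustration only
toy = pd.DataFrame({"band": ["medium", "low", "high"]})

# Without `categories`, the encoder sorts the values alphabetically
default_order = list(OrdinalEncoder().fit(toy).categories_[0])

# With `categories`, the codes follow the order we state
explicit = OrdinalEncoder(categories=[["low", "medium", "high"]])
codes = explicit.fit_transform(toy).ravel()

print(default_order)
print(codes)
```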

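`ColumnTransformer` applies each encoder to its own list of columns and concatenates the results. `remainder='drop'` discards any column not listed (here, `survival`), and `set_output(transform='pandas')` makes the result a DataFrame with named columns rather than a NumPy array.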

In [None]:
ct = ColumnTransformer(transformers = [
        ('cat', categorical_encoders, category_columns),
        ('ord', ordinal_encoders, ordinal_columns)
        ], 
        remainder = 'drop')
ct.set_output(transform='pandas')
ct

`X` is the conventional name for the feature matrix (the independent variables). Notice that `X` has more columns than our original features: one-hot encoding expands each categorical column into multiple binary columns, one per unique category value.


In [None]:
X = ct.fit_transform(df_slim)
X

In [None]:
print(f"Original feature count: {len(df_slim.columns) - 1} columns")
print(f"Transformed feature count: {X.shape[1]} columns")
print("The increase is due to one-hot encoding creating one binary column per category value.")

Create an array of labels from the `survival` column. `survival` is a text column (with values `Died` and `Survived`); `LabelEncoder` sorts the class names alphabetically, so it maps them to 0 (Died) and 1 (Survived).

In [None]:
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_slim['survival'])
y[:5] # show the first 5 elements of y
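
To confirm which number stands for which class, the fitted encoder exposes `classes_`, and `inverse_transform` maps codes back to text. A minimal sketch on made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["Survived", "Died", "Died"])

print(encoder.classes_)                   # position in this array is the numeric code
print(codes)
print(encoder.inverse_transform([0, 1]))  # codes back to the original text labels
```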

## Train / Test Split

We split the data into two sets:

- **Training set (80%)** — the model learns patterns from this data
- **Test set (20%)** — used *after* training to measure how well the model generalises to data it has never seen

This prevents us from reporting inflated accuracy from a model that has simply memorised the training data (known as **overfitting**).

`random_state=42` fixes the random seed so the split is reproducible — everyone running this notebook gets the same split.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
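
With imbalanced classes, a purely random split can leave the training and test sets with different class proportions. One option, shown here as a sketch on synthetic labels rather than the Titanic data, is the `stratify` argument of `train_test_split`, which keeps the proportions the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 60 of class 0 and 40 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 60 + [1] * 40)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of class 1 in each set
print(y_tr.mean(), y_te.mean())
```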

Inspect the transformed feature columns. Notice that the column count is larger than our original 6 features — this is because one-hot encoding replaces each categorical column with multiple binary columns (one per category value).

In [None]:
print("Transformed feature columns:")
X_train.info()

## Logistic Regression Model

Build the logistic regression model, fit it on the training data, and predict on the test data.

In [None]:
model_LR = LogisticRegression()
model_LR.fit(X_train, y_train)
predictions_LR = model_LR.predict(X_test)
predictions_LR[:5] # show the first 5 predictions

### Evaluate the Logistic Regression Model

We use several metrics to understand model performance:

- **Accuracy** — proportion of all predictions that were correct
- **Precision** — of all passengers predicted to survive, what fraction actually did?
- **Recall** — of all passengers who actually survived, what fraction did we correctly predict?
- **F1 score** — the harmonic mean of precision and recall; useful when classes are imbalanced
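
The four metrics above can be worked through by hand from the counts of true/false positives and negatives. The counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical counts: true positives, false positives, false negatives, true negatives
tp, fp, fn, tn = 30, 10, 20, 40

accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct predictions / all predictions
precision = tp / (tp + fp)                          # correct among predicted survivors
recall = tp / (tp + fn)                             # found among actual survivors
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```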

In [None]:
print('Classification Report\n',  classification_report(y_test, predictions_LR))
print(f'f1 score\n {f1_score(y_test, predictions_LR):3.3f}')

### Confusion Matrix

The confusion matrix breaks down predictions into four categories:

|                      | Predicted: Died | Predicted: Survived |
|----------------------|:---------------:|:-------------------:|
| **Actual: Died**     | True Negative   | False Positive      |
| **Actual: Survived** | False Negative  | True Positive       |

- **False Positives** — predicted survived but actually died (Type I error)
- **False Negatives** — predicted died but actually survived (Type II error)

The diagonal (top-left, bottom-right) shows correct predictions. Off-diagonal cells are errors.

In [None]:
confusion_matrix(y_test, predictions_LR)
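
For a binary problem, the four cells can be unpacked by flattening the matrix row by row, which matches the layout of the table above. A sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = Died, 1 = Survived
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```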

## Decision Tree Model

A **decision tree** classifies passengers by learning a series of yes/no rules from the training data (e.g. *"Was the passenger female? If yes, did they travel in 1st or 2nd class?"*). The result is a tree structure that is intuitive and easy to visualise.

We limit the tree depth with `max_depth=4` to prevent **overfitting** — without a limit, the tree would memorise the training data by creating very specific rules that don't generalise to new passengers.

> **Try it:** Remove `max_depth=4` and re-run. How does the accuracy on the test set change? What does the tree look like?

In [None]:
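# random_state fixes the tie-breaking between equally good splits, so results are reproducible
model_DT = DecisionTreeClassifier(max_depth=4, random_state=42)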
model_DT.fit(X_train, y_train)
model_DT

In [None]:
predictions_DT = model_DT.predict(X_test)
predictions_DT[:5] # show the first 5 predictions

In [None]:
print('Decision Tree — Classification Report\n', classification_report(y_test, predictions_DT))
print('Decision Tree — Confusion Matrix')
print(confusion_matrix(y_test, predictions_DT))
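
The effect of `max_depth` on overfitting can be checked by comparing accuracy on the training set with accuracy on the test set. The sketch below uses a synthetic dataset from `make_classification` (an assumption made so the example is self-contained; the numbers will differ on the Titanic data).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded features and labels
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

scores = {}
for depth in (4, None):  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}  test={scores[depth][1]:.3f}")
```

A large gap between training and test accuracy is the usual sign that the tree has memorised the training data.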

Visualise the decision tree model using scikit-learn's built-in `plot_tree` function.

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(model_DT,
          feature_names=list(X.columns),  # plain list: some scikit-learn versions reject a pandas Index
          class_names=['Died', 'Survived'],
          filled=True,
          rounded=True,
          ax=ax)
plt.show()
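
When a plot is hard to read, scikit-learn's `export_text` prints the same rules as indented text. A minimal sketch on made-up data in which survival follows `gender_female` exactly (the feature names here mimic encoded columns and are assumptions for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy encoded features; the label equals gender_female in every row
X_toy = pd.DataFrame({
    "gender_female": [1, 1, 0, 0, 1, 0],
    "pass_class":    [0, 2, 0, 2, 1, 1],
})
y_toy = [1, 1, 0, 0, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)
rules = export_text(tree, feature_names=list(X_toy.columns))
print(rules)
```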

Persist the model so that it can be reloaded later without retraining.

In [None]:
import os
os.makedirs('outputs', exist_ok=True)
joblib.dump(model_DT, 'outputs/titanic_model.pkl')
print("Model saved to outputs/titanic_model.pkl")
print("Use joblib.load('outputs/titanic_model.pkl') to reload it later without retraining.")
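
A quick round-trip check, sketched on a toy model and a temporary directory (both are assumptions for illustration), confirms that a reloaded model gives the same predictions as the original. Note that a reloaded model expects input encoded the same way as its training data, so in practice the fitted `ColumnTransformer` would need to be saved as well.

```python
import os
import tempfile

import joblib
from sklearn.tree import DecisionTreeClassifier

# Toy model: the label equals the first feature
X_toy = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_toy = [0, 0, 1, 1]
model = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(model, path)       # write the fitted model to disk
    restored = joblib.load(path)   # read it back into a new object

same_predictions = bool((restored.predict(X_toy) == model.predict(X_toy)).all())
print(same_predictions)
```

Only load model files from sources you trust, since the file format is based on pickle and can execute code when loaded.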

### Model Comparison

Which model performed better? Consider:
- **Accuracy** tells you overall correctness, but can be misleading with imbalanced classes
- **F1 score** balances precision and recall — more informative here since more passengers died than survived
- **Interpretability** — decision trees are easy to explain to non-technical stakeholders; logistic regression coefficients are harder to visualise

In [None]:
print("=== Model Comparison ===\n")
print(f"Logistic Regression — Accuracy: {accuracy_score(y_test, predictions_LR):.3f}   F1: {f1_score(y_test, predictions_LR):.3f}")
print(f"Decision Tree       — Accuracy: {accuracy_score(y_test, predictions_DT):.3f}   F1: {f1_score(y_test, predictions_DT):.3f}")
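
Logistic regression coefficients can still be inspected as a sorted table: a positive coefficient pushes the prediction towards class 1 (Survived), a negative one towards class 0 (Died). The sketch below uses made-up encoded data, so the values themselves carry no meaning for the real dataset.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy encoded features; the label equals gender_female in every row
X_toy = pd.DataFrame({
    "gender_female": [1, 1, 0, 0, 1, 0, 1, 0],
    "pass_class":    [0, 2, 0, 2, 1, 1, 2, 0],
})
y_toy = [1, 1, 0, 0, 1, 0, 1, 0]

model = LogisticRegression().fit(X_toy, y_toy)

# One coefficient per feature, labelled with the column name
coefficients = pd.Series(model.coef_[0], index=X_toy.columns).sort_values()
print(coefficients)
```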

## Summary

In this notebook we:

1. **Loaded and explored** the Titanic passenger dataset
2. **Selected features** likely to be predictive of survival and dropped rows with missing values
3. **Encoded** categorical features (one-hot encoding) and ordinal features (ordinal encoding), and encoded the target label
4. **Split** the data into training (80%) and test (20%) sets
5. **Trained** two classifiers: Logistic Regression and Decision Tree
6. **Evaluated** both models using accuracy, F1 score, classification report, and confusion matrix

### What to Try Next

- Add or remove features (columns) and see how model performance changes
- Try `RandomForestClassifier` — an ensemble of many decision trees that often outperforms a single tree
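
As a starting point for the random forest suggestion, here is a minimal sketch on a synthetic dataset (an assumption made to keep the example self-contained). In the notebook, `X_train`, `X_test`, `y_train` and `y_test` would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and labels
X_syn, y_syn = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# 100 trees, each trained on a bootstrap sample of the training data
model_rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
rf_accuracy = accuracy_score(y_te, model_rf.predict(X_te))

print(f"Random forest accuracy: {rf_accuracy:.3f}")
```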
