# An Introduction to Machine Learning
## Session 1b: Classification Basics and Logistic Regression

Welcome to Session 1b! We’ll delve deeper into machine learning by exploring classification models. Classification models help us predict categorical outcomes, like whether a passenger on the Titanic survived or not.

We’ll introduce Logistic Regression, one of the most straightforward classification models, and use it to make predictions based on features in the Titanic dataset. You’ll also learn how to evaluate classification models using metrics like accuracy, precision, and recall. By the end of this session, you’ll have a solid foundation in training and evaluating a basic classification model.
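
As a rough intuition for what Logistic Regression does: it computes a weighted sum of the features and passes it through the sigmoid function, which gives a probability between 0 and 1. The sketch below uses made-up numbers; `weights`, `bias`, and `passenger` are illustrative assumptions and are not fitted to the Titanic data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for two features; not learned from any data
weights = np.array([1.5, -0.8])
bias = -0.2
passenger = np.array([1.0, 2.0])  # illustrative feature values

z = np.dot(weights, passenger) + bias  # linear score
probability = sigmoid(z)               # estimated probability of the positive class
prediction = int(probability >= 0.5)   # threshold at 0.5 to get a class label
print(probability, prediction)
```

Training the model amounts to finding the weights and bias that make these probabilities match the observed labels as closely as possible.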

### 1. Importing packages and pre-processing for classification data

In [None]:
# Run this cell to import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay

In [None]:
# Load the Titanic dataset and display the first few rows
titanic_data = pd.read_csv("../data/titanic_train.csv")
titanic_data.head()

In [None]:
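# EXERCISE: Fill missing values for 'Age' with the median and 'Embarked' with the mode.
# Hint: Use the .fillna() method for both columns and assign the result back to the column.
# (Recent pandas versions warn about calling .fillna(..., inplace=True) on a selected column.)
# Note that .mode() returns a Series, so take its first element.

titanic_data['Age'] = titanic_data['Age'].fillna(____)  # Replace ____ with appropriate value
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(____)  # Replace ____ with appropriate value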
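
If the pattern is unfamiliar, here is a minimal sketch of the same idea on a toy DataFrame. The frame `toy` and its columns `height` and `colour` are hypothetical and have nothing to do with the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data with one missing numeric value and one missing categorical value
toy = pd.DataFrame({
    "height": [150.0, np.nan, 170.0, 180.0],
    "colour": ["red", "blue", None, "red"],
})

# Numeric column: fill with the median of the observed values
toy["height"] = toy["height"].fillna(toy["height"].median())

# Categorical column: fill with the most frequent value (.mode() returns a Series)
toy["colour"] = toy["colour"].fillna(toy["colour"].mode()[0])

print(toy)
```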

In [None]:
# Encode 'Sex' as 0/1 and one-hot encode 'Embarked'
# (drop_first=True drops the first category, so only Embarked_Q and Embarked_S remain)
titanic_data['Sex'] = titanic_data['Sex'].map({'male': 0, 'female': 1})
titanic_data = pd.get_dummies(titanic_data, columns=['Embarked'], drop_first=True)
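
To see what one-hot encoding with `drop_first=True` produces, here is a small sketch on a hypothetical `port` column. The optional `dtype=int` argument is added here only so the output shows 0/1 rather than True/False.

```python
import pandas as pd

# Toy column with three categories: C, Q and S
toy = pd.DataFrame({"port": ["C", "Q", "S", "S"]})

# One column per category, minus the first one (C) because of drop_first=True
encoded = pd.get_dummies(toy, columns=["port"], drop_first=True, dtype=int)
print(encoded)
```

A row with zeros in both remaining columns corresponds to the dropped category, so no information is lost.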

In [None]:
# Define features and target
X = titanic_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Q', 'Embarked_S']]
y = titanic_data['Survived']

In [None]:
# EXERCISE: Split the dataset into training (80%) and testing (20%) sets.
# Use the train_test_split function with test_size=0.2 and random_state=42.
X_train, X_test, y_train, y_test = train_test_split(____, ____, test_size=____, random_state=____)
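
For reference, this is roughly how `train_test_split` behaves on a small made-up array. The names `features` and `labels` and the 70/30 split are illustrative assumptions, chosen to differ from the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each, and alternating labels
features = np.arange(20).reshape(10, 2)
labels = np.array([0, 1] * 5)

# Hold out 30% of the rows for testing; random_state makes the shuffle repeatable
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=0
)
print(f_train.shape, f_test.shape)
```

Rows stay paired with their labels, and fixing `random_state` means the same rows land in the test set on every run.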

### 2. Logistic regression classifier

In [None]:
# Initialise the Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)

In [None]:
# EXERCISE: Train the Logistic Regression model on the training data.
# Hint: Use the .fit() method with X_train and y_train.

log_reg.fit(____, ____)

In [None]:
# EXERCISE: Predict the survival on the test data using the trained model.
# Hint: Use the .predict() method with X_test.

y_pred = log_reg.predict(____)
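
It can help to know that, for a binary problem, `.predict()` effectively thresholds the predicted probability of the positive class at 0.5. A minimal sketch on made-up data (`X_toy`, `y_toy`, and `toy_model` are hypothetical names, separate from the Titanic model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# One toy feature; larger values tend to belong to class 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

proba = toy_model.predict_proba(X_toy)[:, 1]  # probability of class 1 for each row
labels = toy_model.predict(X_toy)             # hard class labels
manual = (proba >= 0.5).astype(int)           # thresholding the probabilities by hand

print(proba.round(3))
print(labels, manual)
```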

In [None]:
# Calculate accuracy, precision, and recall
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
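
To make the three metrics concrete, the sketch below computes them by hand from counts of true/false positives and negatives on a small made-up example, then checks the results against scikit-learn. The lists `true_labels` and `pred_labels` are illustrative and not taken from the Titanic data.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
pred_labels = [1, 0, 0, 1, 0, 1, 0, 0]

pairs = list(zip(true_labels, pred_labels))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # predicted 1, actually 1
tn = sum(t == 0 and p == 0 for t, p in pairs)  # predicted 0, actually 0
fp = sum(t == 0 and p == 1 for t, p in pairs)  # predicted 1, actually 0
fn = sum(t == 1 and p == 0 for t, p in pairs)  # predicted 0, actually 1

accuracy_manual = (tp + tn) / len(pairs)  # share of all predictions that are correct
precision_manual = tp / (tp + fp)         # of the predicted positives, how many are right
recall_manual = tp / (tp + fn)            # of the actual positives, how many were found

print(accuracy_manual, precision_manual, recall_manual)
print(accuracy_score(true_labels, pred_labels),
      precision_score(true_labels, pred_labels),
      recall_score(true_labels, pred_labels))
```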

In [None]:
# EXERCISE: Plot the confusion matrix.
# Hint: Use ConfusionMatrixDisplay with confusion_matrix(y_test, y_pred).

cm = confusion_matrix(____, ____)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Did Not Survive', 'Survived']).plot(cmap='Blues')
plt.show()
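
When reading the plot, remember that scikit-learn puts the true labels on the rows and the predicted labels on the columns. A small sketch with made-up labels (`true_labels` and `pred_labels` are illustrative) shows how the four cells can be unpacked:

```python
from sklearn.metrics import confusion_matrix

true_labels = [1, 0, 1, 1, 0, 0, 1, 0]
pred_labels = [1, 0, 0, 1, 0, 1, 0, 0]

# Rows are true classes (0 then 1); columns are predicted classes (0 then 1)
toy_cm = confusion_matrix(true_labels, pred_labels)
print(toy_cm)

# For a binary problem, ravel() flattens the matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = toy_cm.ravel()
print(tn, fp, fn, tp)
```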

### 3. Thinking about what we've seen

In this cell, write down:

1. Which metric (accuracy, precision, or recall) do you think is most important in predicting survival, and why?
2. What could you do to improve the model’s performance? Think of any additional features you might include or methods you might try.