In [None]:
!pip install nb_black
%load_ext nb_black

# Introduction

In the following exercises, you'll practice key skills in developing a logistic regression classifier. The dataset you'll use is from the classic [Titanic competition on Kaggle](https://www.kaggle.com/competitions/titanic/overview), containing information about passengers aboard the Titanic. The modelling objective is to use available features to predict whether a passenger survived or perished in the disaster. 

The format of this exercise notebook is similar to that of the linear regression notebook. You'll start by drawing your own decision boundary through feature space and then defining a logistic regression classifier. You'll then practice using the `sklearn` implementation of logistic regression and engineering more complex decision boundaries.

In [None]:
# Packages and functions you may find useful
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

In [None]:
# Load the titanic dataset
data = pd.read_csv("../titanic.csv")

# Explore and preprocess the data

**Exercise:** Spend a little time exploring the dataset. What features do you think are predictive of survival? Are there potential issues to flag, e.g. missing data? Imbalanced classes?
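
As a starting point, the two checks mentioned above (missing data and class imbalance) can be sketched as follows. This runs on a small made-up frame called `toy`, which is an assumption for illustration; the same calls should apply to `data` once it is loaded.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data; all values are made up.
toy = pd.DataFrame(
    {
        "Survived": [0, 1, 0, 0, 1],
        "Age": [22.0, np.nan, 35.0, np.nan, 4.0],
        "Fare": [7.25, 71.28, 8.05, 0.0, 16.7],
    }
)

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Fraction of observations in each class, to check for imbalance
class_balance = toy["Survived"].value_counts(normalize=True)

print(missing_counts)
print(class_balance)
```

A column with many missing values has to be imputed or dropped before fitting, and the class fractions give the accuracy of a model that always predicts the majority class.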

**Exercise:** Make a scatterplot of `Age` vs `Fare`, with points colored by survival status. Where would you draw a decision boundary? Consider applying a log transformation to `Fare` as a preprocessing step.

In [None]:
data = data.dropna(subset="Age")
data = data.assign(LogFare=np.log(data.Fare + 10))  # Add 10 to avoid taking log(0)

plt.figure()
sns.scatterplot(x=data.LogFare, y=data.Age, hue=data.Survived)
plt.title("Titanic survival")
plt.xlabel("log transformed fare")
plt.ylabel("age")
plt.show()

**Exercise:** Divide the data into training and test sets. A train-test split of at least 90-10 is recommended.

In [None]:
X = data[[col for col in data.columns if col != "Survived"]]
Y = data.Survived

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.1, random_state=42
)

# Logistic regression on two variables

In order to visualise the decision boundary in feature space, we'll limit the number of features to two: `Age` and `Fare`, with `Fare` log transformed as above. If we choose a linear decision boundary, our regression will be of the form
$$P(Survived=1 \mid Age, Fare, \theta) = \big( 1 + \exp(-(\theta_0 + \theta_1 \cdot Age + \theta_2 \cdot \log(Fare + 10))) \big)^{-1}.$$
You can then decide on an operating point, i.e. a probability threshold, to binarise the model output.
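
To make the role of the operating point concrete, here is a minimal sketch that evaluates the logistic function for a few passengers and thresholds the result. The parameter values in `theta` and the example ages and fares are made up for illustration and are not fitted to the data.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    """Map a linear score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical parameters (theta_0, theta_1 for Age, theta_2 for log fare)
theta = np.array([-3.0, -0.02, 1.0])

# Made-up example passengers
age = np.array([30.0, 5.0, 60.0])
log_fare = np.log(np.array([7.25, 30.0, 80.0]) + 10)

# Linear score, then probability of survival
z = theta[0] + theta[1] * age + theta[2] * log_fare
proba = sigmoid(z)

# Binarise the same probabilities at two different operating points
labels_default = (proba >= 0.5).astype(int)
labels_strict = (proba >= 0.7).astype(int)

print(proba.round(3), labels_default, labels_strict)
```

With an operating point of 0.5, the predicted label is 1 exactly when the linear score is non-negative, so the decision boundary is the line where the score equals zero. Raising the operating point shifts that boundary and predicts fewer survivors.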

## Drawing a decision boundary

**Exercise:** Based on the plot above, draw your own decision boundary, linear or otherwise. Then define a logistic regression classifier based on your decision boundary and a choice of operating point. Evaluate the classification accuracy of your model. Does it outperform the naive baseline model that just predicts no one survived?

In [None]:
def logistic_regression(
    x_1: pd.Series,
    x_2: pd.Series,
    slope: float,
    intercept: float,
    operating_point: float = 0.5,
    return_proba: bool = False,
) -> pd.Series:
    """A logistic regression classifier in two variables.

    Parameters
    ----------
    x_1: pd.Series
        Feature 1 observations.
    x_2: pd.Series
        Feature 2 observations.
    slope: float
        Slope of the linear parameterisation of x_2 by x_1 on
        the decision boundary.
    intercept: float
        Intercept of the linear parameterisation of x_2 by x_1 on
        the decision boundary.
    operating_point: float, default 0.5
        Threshold for turning probability into a binary output.
    return_proba: bool, default False
        Whether to return probability or a binary output.

    Returns
    -------
    pd.Series
        Class labels 0 or 1, or probabilities of class 1 if
        return_proba is True.
    """
    # Positive when x_2 lies below the line x_2 = slope * x_1 + intercept,
    # so points under the boundary get a probability above 0.5
    linear_fcn = -x_2 + slope * x_1 + intercept
    proba = (1 + np.exp(-linear_fcn)) ** (-1)

    if return_proba:
        return proba
    return (proba >= operating_point).astype(int)

In [None]:
SLOPE = 30
INTERCEPT = -95

x = np.linspace(0, 7, 100)
decision_boundary = SLOPE * x + INTERCEPT

plt.figure()
sns.scatterplot(x=X_train.LogFare, y=X_train.Age, hue=Y_train)
plt.plot(x, decision_boundary, color="black")
plt.title("Titanic survival")
plt.xlabel("log transformed fare")
plt.ylabel("age")
plt.ylim((0, 80))
plt.xlim((2, 7))
plt.show()

In [None]:
def calc_accuracy(y_true: pd.Series, y_pred: pd.Series) -> float:
    num_wrong = np.sum(np.absolute(y_true - y_pred))
    return 1 - num_wrong / len(y_true)


train_accuracy = calc_accuracy(
    Y_train, logistic_regression(X_train.LogFare, X_train.Age, SLOPE, INTERCEPT)
)
test_accuracy = calc_accuracy(
    Y_test, logistic_regression(X_test.LogFare, X_test.Age, SLOPE, INTERCEPT)
)
train_accuracy, test_accuracy

In [None]:
# Accuracy of the naive baseline model that predicts no one survived
baseline_train_accuracy = np.sum(Y_train == 0) / len(Y_train)
baseline_test_accuracy = np.sum(Y_test == 0) / len(Y_test)
baseline_train_accuracy, baseline_test_accuracy

## Sklearn implementation 

**Exercise:** Use the `sklearn` class `LogisticRegression` to fit a logistic regression model on `Age` and log transformed `Fare`, scaling the features beforehand if necessary. Plot the decision boundary from the fit. Evaluate the classification accuracy of the model and compare to the performance of your model above. Note that the default logistic regression in `sklearn` is L2 regularised, so you may want to experiment with different levels of regularisation.
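
In `LogisticRegression`, the regularisation level is set through the parameter `C`, which is the inverse of the regularisation strength. The sketch below, run on synthetic data rather than the Titanic set (the names `X_demo` and `y_demo` are made up for illustration), shows how the size of the fitted coefficients changes with `C`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data, made up for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
noise = rng.normal(0, 0.2, size=200)
y_demo = (X_demo[:, 1] - X_demo[:, 0] + noise > 0).astype(int)

# Smaller C means a stronger L2 penalty and smaller coefficients
coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    accuracy = model.score(X_demo, y_demo)
    print(f"C={C}: |coef|={coef_norms[C]:.2f}, train accuracy={accuracy:.2f}")
```

Strong regularisation shrinks the coefficients towards zero, which flattens the predicted probabilities. Whether that helps or hurts accuracy on held-out data is something to check empirically.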

In [None]:
# Scale features
scaler = MinMaxScaler()
features = ["Age", "LogFare"]

x_train = scaler.fit_transform(X_train[features])
x_test = scaler.transform(X_test[features])

In [None]:
# Fit model
logistic_reg = LogisticRegression()
logistic_reg.fit(x_train, Y_train)

coef = logistic_reg.coef_
intercept = logistic_reg.intercept_
print(coef, intercept)

x = np.linspace(0, 1, 100)
decision_boundary = -(coef[0, 1] * x + intercept[0]) / coef[0, 0]

plt.figure()
sns.scatterplot(x=x_train[:, 1], y=x_train[:, 0], hue=Y_train)
plt.plot(x, decision_boundary, c="black", label="decision boundary")
plt.ylim((-0.1, 1.1))
plt.title("Titanic survival")
plt.xlabel("log transformed fare")
plt.ylabel("age")
plt.show()

In [None]:
train_accuracy = logistic_reg.score(x_train, Y_train)
test_accuracy = logistic_reg.score(x_test, Y_test)
train_accuracy, test_accuracy

## Polynomial logistic regression

We may be able to achieve higher accuracy by introducing higher-order features. But, as always with a more complex model, we risk overfitting to the training data, so we may have to adjust the strength of regularisation.

**Exercise:** Engineer polynomial features in log `Fare` and/or `Age` and fit a logistic regression. You may want to experiment with the regularisation method and the strength of the regularisation parameter. Compare the contribution of each feature to the model prediction. Evaluate your model's performance, comparing to the performance of the model you developed above.
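
To see why higher-order features can help, consider the following sketch on synthetic one-dimensional data where the true class depends on `x**2`. The data and variable names are made up for illustration. A model fitted on `x` alone cannot separate the classes, while adding the squared feature lets the same linear classifier do so.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic data: class 1 when x**2 exceeds 1, up to some noise
rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=300)
y = (x**2 - 1 + rng.normal(0, 0.3, size=300) > 0).astype(int)

# Feature sets with and without the squared term, both scaled to [0, 1]
linear_feats = MinMaxScaler().fit_transform(x.reshape(-1, 1))
quad_feats = MinMaxScaler().fit_transform(np.column_stack([x, x**2]))

linear_model = LogisticRegression().fit(linear_feats, y)
quad_model = LogisticRegression().fit(quad_feats, y)

# Training accuracy only; in the exercise, compare on the test set too
acc_linear = linear_model.score(linear_feats, y)
acc_quad = quad_model.score(quad_feats, y)

# Coefficients on scaled features give a rough sense of each feature's contribution
coef_x, coef_x2 = quad_model.coef_[0]
print(f"linear: {acc_linear:.2f}, quadratic: {acc_quad:.2f}")
print(f"coef on x: {coef_x:.2f}, coef on x^2: {coef_x2:.2f}")
```

Because both features are scaled to the same range, comparing coefficient magnitudes is a reasonable first look at feature contribution, though it is only a heuristic when features are strongly correlated, as powers of the same variable usually are.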

In [None]:
# Add higher order features and scale


def add_poly_features(data: pd.DataFrame, feature: str, max_power: int) -> pd.DataFrame:
    df = data.copy()
    for n in range(2, max_power + 1):
        df.loc[:, f"{feature}{n}"] = df[feature] ** n
    return df


MAX_POWER = 11

scaler = MinMaxScaler()
features = ["Age", "LogFare"]

x_train = add_poly_features(X_train[features], "LogFare", MAX_POWER)
x_train = scaler.fit_transform(x_train)

x_test = add_poly_features(X_test[features], "LogFare", MAX_POWER)
x_test = scaler.transform(x_test)

In [None]:
# Fit model and plot decision boundary

# Use penalty=None for no regularisation; the string "none" was removed in scikit-learn 1.4
regression = LogisticRegression(penalty=None)
regression.fit(x_train, Y_train)
coef = regression.coef_
intercept = regression.intercept_


x = pd.DataFrame({"Age": np.zeros((100,)), "LogFare": np.linspace(2, 7, 100)})
x = add_poly_features(x, "LogFare", MAX_POWER)
x = scaler.transform(x)
decision_boundary = -intercept[0] / coef[0, 0]
for n in range(1, MAX_POWER + 1):
    decision_boundary += -(coef[0, n] / coef[0, 0]) * x[:, n]

plt.figure()
sns.scatterplot(x=x_train[:, 1], y=x_train[:, 0], hue=Y_train)
plt.plot(x[:, 1], decision_boundary, c="black", label="decision boundary")
plt.ylim((-0.1, 1.2))
plt.title("Titanic survival")
plt.xlabel("log transformed fare")
plt.ylabel("age")
plt.show()

In [None]:
train_accuracy = regression.score(x_train, Y_train)
test_accuracy = regression.score(x_test, Y_test)
train_accuracy, test_accuracy

# Extension: develop your own regression model

**Exercise:** Now that we've demonstrated the basics of logistic regression, develop your own model with all the features in the dataset at your disposal. Compare the performance of your model to the performance of the models we produced above. Keep in mind the following tips as you train and evaluate your model.
* Consider one-hot encoding categorical features.
* Features on different scales should be transformed so that they're comparable.
* If tuning a hyperparameter like the regularisation parameter, it's best practice to further divide your training set into training and validation sets or to cross validate.
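
The tips on one-hot encoding and cross-validation can be sketched as follows. The rows in `toy` are made up, and the column names `Sex` and `Embarked` are assumed to match the dataset; adapt them to the columns you actually find in `data`.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Made-up rows; column names are assumptions based on the Kaggle Titanic data
toy = pd.DataFrame(
    {
        "Sex": ["male", "female"] * 20,
        "Embarked": ["S", "C", "Q", "S"] * 10,
        "Survived": [0, 1] * 20,
    }
)

# One-hot encode the categorical columns; drop_first removes one
# redundant indicator column per feature
encoded = pd.get_dummies(toy[["Sex", "Embarked"]], drop_first=True, dtype=float)

# Compare regularisation strengths with 5-fold cross-validation
cv_scores = {}
for C in [0.1, 1.0, 10.0]:
    scores = cross_val_score(LogisticRegression(C=C), encoded, toy["Survived"], cv=5)
    cv_scores[C] = scores.mean()

print(encoded.columns.tolist())
print(cv_scores)
```

Cross-validating on the training set alone keeps the test set untouched until the final evaluation, so the reported test accuracy is not biased by the hyperparameter search.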