<span style="font-size: 24px;">The Titanic Problem.</span>

Key skills covered: binary classification, decision trees, hyperparameters for decision trees, Gini impurity, k-fold cross-validation, evaluation methods for classification problems, k-nearest neighbours, preprocessing with sklearn, feature engineering best practice

We're dealing with a binary classification problem. The dataset is relatively small (fewer than 1,000 rows) and the features are a mixture of categorical and numerical data, so a decision tree will most likely work relatively well. This is what we'll start with.

CORRECTION: high dimensionality is not good for decision trees, so categorical variables should not be one-hot encoded here. I originally used one-hot vectors and then switched to label encoding.
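To make the dimensionality point concrete, here is a minimal sketch on a made-up `Embarked` column (the `toy` frame is hypothetical, not the Titanic data). It uses pandas' `get_dummies` for one-hot encoding and category codes as a stand-in for `LabelEncoder`; both assign codes in sorted order of the category values.

```python
import pandas as pd

# Hypothetical toy column, for illustration only
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S", "C"]})

# One-hot encoding: one new column per category value
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked")

# Label encoding: a single integer column (C=0, Q=1, S=2)
label_encoded = toy["Embarked"].astype("category").cat.codes

print(one_hot.shape)           # (5, 3): three columns for three categories
print(label_encoded.tolist())  # [2, 0, 1, 2, 0]: still one column
```

One caveat: label encoding imposes an arbitrary ordering on the categories. A tree can work around that by splitting on thresholds, but distance-based models may be more sensitive to it.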

First the data needs cleaning and preparing.

In [13]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer
path_train = "/home/frances/Documents/Getting a JOB/ML preparation/Titanic_Classification/train.csv"
path_test = "/home/frances/Documents/Getting a JOB/ML preparation/Titanic_Classification/test.csv"

train_data = pd.read_csv(path_train)
test_data = pd.read_csv(path_test)

# Label-encode the 'Sex' and 'Embarked' columns.
# Each encoder is fitted on the training data only and then reused on the
# test data, so both sets are guaranteed to share the same mapping.
for col in ['Sex', 'Embarked']:
    label_encoder = LabelEncoder()
    train_data[col] = label_encoder.fit_transform(train_data[col])
    test_data[col] = label_encoder.transform(test_data[col])

X_train = train_data.drop(['Name', 'Parch', 'Ticket','Cabin'],axis=1) # remove unnecessary columns
X_test = test_data.drop(['Name', 'Parch', 'Ticket','Cabin'],axis=1) # remove unnecessary columns

# There are 177 na values in the train set, and 87 na values in the test data, which is very high so I don't want to drop them.

cols_with_missing_data = ['Age','Fare']

# There are 891 samples and sqrt(891) ≈ 30, so I'll use 30 nearest neighbours.

knn_imputer = KNNImputer(n_neighbors=30)

# Perform imputation on training data
imputed_train_data = X_train.copy()
imputed_train_data[cols_with_missing_data] = knn_imputer.fit_transform(X_train[cols_with_missing_data])

# Perform imputation on test data
imputed_test_data = X_test.copy()
imputed_test_data[cols_with_missing_data] = knn_imputer.transform(X_test[cols_with_missing_data])

X_train = imputed_train_data
X_test = imputed_test_data

# Separate the target column from the features
Y_train = X_train['Survived']
X_train = X_train.drop(['Survived'], axis=1)

X_train['SibSp'] = np.log(X_train['SibSp'] + 1)  # Adding 1 to handle zero values
X_train['Fare'] = np.log(X_train['Fare'] + 1)
X_test['SibSp'] = np.log(X_test['SibSp'] + 1) 
X_test['Fare'] = np.log(X_test['Fare'] + 1)

from sklearn.preprocessing import MinMaxScaler

# Create a MinMaxScaler instance
scaler = MinMaxScaler()

# Fit and transform the training data
X_train = scaler.fit_transform(X_train)

# Transform the test data using the same scaler
X_test= scaler.transform(X_test)

<span style="font-size: 20px;">Decision Trees</span>

Decision trees are easy to interpret and can handle both numerical and categorical data. A single tree is a simple model: a shallow tree cannot capture complex behaviour, while a deep one can easily overfit. Decision trees work by recursively (at each branch) picking the ‘best feature’ with which to separate the data. ‘Best’ is measured by ‘purity’ for classification tasks and by ‘variance’ for regression tasks.

<b> Explanation of Gini impurity </b>

Gini = 1 - ∑(p_i)^2

So for each class, you take the proportion of data points in the node belonging to it, square that value, sum the squares over all classes and subtract the sum from 1.

If the values of an input variable (e.g. Weather = Sunny, Weather = Cloudy) are fairly spread out across the classes (i.e. they fall into Yes or No more evenly), the variable isn't a very good predictor of the outcome. The class proportions within each group are then close to one another, so the sum of squares is small and the metric is close to its maximum of 1 - 1/K for K classes (0.5 for two classes), indicating impurity. A Gini value of 0 means every data point in the group belongs to a single class. The best variable to split on is the one that produces the purest groups.
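As a worked example of the formula, here is a small sketch in plain Python. The `Weather` groups and their Yes/No outcomes are made up for illustration, and `gini_impurity` and `weighted_gini` are hypothetical helper names, not sklearn functions. Note that for two classes the impurity ranges from 0 (pure) to 0.5 (evenly mixed).

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(groups):
    """Impurity of a split: each group's impurity weighted by its size."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * gini_impurity(group) for group in groups)

# Made-up outcomes grouped by the value of Weather
sunny = ["Yes", "Yes", "Yes", "No"]
cloudy = ["No", "No", "No", "Yes"]

print(gini_impurity(["Yes"] * 4))               # 0.0: a pure group
print(gini_impurity(["Yes", "No", "Yes", "No"]))  # 0.5: evenly mixed, two classes
print(weighted_gini([sunny, cloudy]))           # 0.375: splitting on Weather helps
```

A tree compares the weighted impurity of each candidate split and keeps the one with the lowest value.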

<b> Explanation of variance </b>

It’s similar for regression problems, except that the split is chosen to minimise the variance of the target values within each resulting group.
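The regression case can be sketched the same way. The example below tries every midpoint between neighbouring feature values and keeps the threshold with the lowest size-weighted variance. The data and the helper names (`variance`, `weighted_variance`, `best_threshold`) are invented for illustration; real implementations are more efficient.

```python
def variance(values):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

def weighted_variance(left, right):
    """Variance after a split: each side's variance weighted by its size."""
    total = len(left) + len(right)
    return len(left) / total * variance(left) + len(right) / total * variance(right)

def best_threshold(xs, ys):
    """Return (threshold, score) for the split with the lowest weighted variance."""
    unique_x = sorted(set(xs))
    best = (None, variance(ys))  # baseline: no split at all
    for low, high in zip(unique_x, unique_x[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = weighted_variance(left, right)
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up data with two obvious clusters
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]

threshold, score = best_threshold(xs, ys)
print(threshold)  # 6.5: the split falls between the two clusters
```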

<b> Hyperparameters </b>

Criterion: the measure used to decide how to split the data/create branches, e.g. Gini impurity.

Things which prevent overfitting/control tree architecture:
- Max Depth: It limits the number of levels the tree can grow. Setting this hyperparameter can help prevent overfitting.
- Min Samples Split: The minimum number of samples required to split an internal node. If a node has fewer samples than this value, it will not be split further.
- Min Samples Leaf: The minimum number of samples required to be at a leaf node. This hyperparameter helps control the size of the tree.

Splitter: This hyperparameter specifies the strategy used to choose the split at each node. Common values are "best" (choose the best split) and "random" (choose the best random split).

Class Weight: For imbalanced datasets, you can use this hyperparameter to assign different weights to different classes. It can help the decision tree model to give more importance to minority classes.
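As a rough illustration of how these hyperparameters are passed to sklearn, the sketch below fits one unconstrained tree and one constrained tree on synthetic data from `make_classification`. The data and the specific values (depth 3, 10 samples to split, 5 per leaf) are arbitrary assumptions chosen for the demo, not tuned settings for the Titanic problem.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic demo data (an assumption; not the Titanic features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

# Default settings: the tree grows until its leaves are pure
unconstrained = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# The same model with the hyperparameters described above set explicitly
constrained = DecisionTreeClassifier(
    criterion="gini",         # split quality measure
    splitter="best",          # evaluate all candidate splits at each node
    max_depth=3,              # at most 3 levels below the root
    min_samples_split=10,     # nodes with fewer samples are not split
    min_samples_leaf=5,       # every leaf keeps at least 5 samples
    class_weight="balanced",  # weight classes inversely to their frequency
    random_state=0,
).fit(X_demo, y_demo)

print("unconstrained:", unconstrained.get_depth(), unconstrained.get_n_leaves())
print("constrained:  ", constrained.get_depth(), constrained.get_n_leaves())
```

In practice these values would be chosen by comparing cross-validation scores, for example with sklearn's `GridSearchCV`.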

In [14]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, KFold

# X_train and Y_train are the preprocessed features and labels from the cell above

# Create a decision tree classifier with default hyperparameters
decision_tree = DecisionTreeClassifier()

# Perform k-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # You can adjust the random_state as needed

# Store the cross-validation scores
cv_scores = cross_val_score(decision_tree, X_train, Y_train, cv=kf)

# Print the cross-validation scores for each fold
for i, score in enumerate(cv_scores, start=1):
    print(f"Fold {i}: {score:.4f}")

# Calculate the mean and standard deviation of the cross-validation scores
mean_cv_score = np.mean(cv_scores)
std_cv_score = np.std(cv_scores)

print(f"Mean CV Score: {mean_cv_score:.4f}")
print(f"Standard Deviation of CV Scores: {std_cv_score:.4f}")

Fold 1: 0.7709
Fold 2: 0.7303
Fold 3: 0.7640
Fold 4: 0.7640
Fold 5: 0.7584
Mean CV Score: 0.7576
Standard Deviation of CV Scores: 0.0142


<span style="font-size: 20px;"> Explanation </span>

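We've used a technique called k-fold cross-validation. It gives a more reliable estimate of how well the model generalises than a single train/validation split, which makes overfitting and underfitting easier to detect. The data is divided into k distinct folds; the model is trained on k-1 of them and validated on the remaining fold, and this is repeated k times so that each fold is used for validation exactly once. k is typically 5 or 10; 5 is quicker and also sufficient for smaller training sets. Cross-validation can be computationally expensive, but the folds are independent of one another, so they can be run in parallel (for example with the n_jobs parameter of cross_val_score). You have to balance the desire for a reliable estimate against computational cost.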
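The fold mechanics can be sketched without sklearn. The generator below is a simplified, hypothetical stand-in for `KFold` (the name `k_fold_indices` is made up): it shuffles the sample indices, deals them into k folds, and yields each fold once as the validation set.

```python
import random

def k_fold_indices(n_samples, k, seed=42):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Fold {fold_number}: train on {len(train_idx)} samples, "
          f"validate on indices {sorted(val_idx)}")
```

Every sample is used for validation exactly once and for training k-1 times, which is why the mean of the fold scores is a steadier estimate than a single split.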

- Precision is the ratio of true positives to the sum of true and false positives. When it's high, this means that when the model predicts a positive class, it’s likely to be correct.
- Accuracy is the ratio of correct predictions to the total no. of data points.
- Recall (sensitivity) is the ratio of true positives to the total no. of actual positives (true positives plus false negatives). Recall can be very high if the model just classifies everything positively, which is why it should be read alongside precision.
- The F1 score is a way of balancing precision and recall, since you can usually increase one at the expense of the other. It takes the harmonic mean of the two, so it is only high when both are high.
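The four metrics can be computed by hand from the confusion counts. The sketch below uses made-up labels and a hypothetical helper, `classification_metrics`; the second example shows what happens when a model predicts the positive class for everything.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels: 4 positives and 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# A reasonable model: one missed positive, one false alarm
reasonable = classification_metrics(y_true, [1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
print(reasonable)  # accuracy 0.8, precision 0.75, recall 0.75, f1 0.75

# A model that predicts positive for everything
all_positive = classification_metrics(y_true, [1] * 10)
print(all_positive)  # recall is 1.0 but precision and accuracy are only 0.4
```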

Currently we predict correctly about 76% of the time (the mean CV score above), which seems pretty low. It may be due to the imbalanced class sizes (most people died), which decision trees don't handle particularly well. When one class has more data points, split measures like Gini impurity are dominated by that class, so the tree tends to choose variables which favour it. Techniques like class weighting, oversampling or undersampling (duplicating data points in minority classes and removing data points in majority classes, respectively) can help with this.
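Two of these techniques can be sketched in plain Python. The class counts below assume the standard Kaggle training set (549 non-survivors, 342 survivors); the helper names are hypothetical. The weight formula, n_samples / (n_classes * class_count), is the one sklearn documents for `class_weight="balanced"`.

```python
import random
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def random_oversample(rows, labels, seed=0):
    """Duplicate random minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Assumed class counts: 0 = died, 1 = survived
labels = [0] * 549 + [1] * 342
rows = list(range(len(labels)))  # stand-in for the feature rows

weights = balanced_class_weights(labels)
print(weights)  # the minority class (1) gets the larger weight

resampled_rows, resampled_labels = random_oversample(rows, labels)
print(Counter(resampled_labels))  # both classes now have 549 rows
```

Oversampling should be applied only to the training portion of each fold; duplicating rows before splitting would leak copies of the same passenger into the validation data.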

Decision trees do not scale particularly well. They are prone to overfitting (though not in this case). 

<span style="font-size: 20px;"> K-nearest neighbors </span> 

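K-nearest neighbours finds the k training points closest to a new data point and uses them to make a prediction: for classification, the new point is assigned the most common class among those neighbours.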
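The idea fits in a few lines of plain Python. This is a minimal sketch with made-up 2D points and a hypothetical `knn_predict` helper that uses Euclidean distance and a majority vote; sklearn's `KNeighborsClassifier` does the same job far more efficiently.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Sort training points by Euclidean distance to the query
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up training data: class 0 near the origin, class 1 near (5, 5)
train_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (0.5, 0.5)))  # 0
print(knn_predict(train_points, train_labels, (5.5, 5.5)))  # 1
```

Because the prediction depends entirely on distances, features on larger scales dominate the result, which is why scaling the features (as done with MinMaxScaler earlier) matters for KNN.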

In [11]:
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hold out 30% of the labelled training data for validation. New names are used
# so the preprocessed Kaggle test set stored in X_test is not overwritten.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, Y_train, test_size=0.3, random_state=42)

# Default settings: n_neighbors=5
knn = KNeighborsClassifier()

knn.fit(X_tr, y_tr)

y_pred_knn = knn.predict(X_val)

# Evaluate the model on the held-out data using different metrics
accuracy = accuracy_score(y_val, y_pred_knn)
precision = precision_score(y_val, y_pred_knn)
recall = recall_score(y_val, y_pred_knn)
f1 = f1_score(y_val, y_pred_knn)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1}")

Accuracy: 0.8134328358208955
Precision: 0.8351648351648352
Recall: 0.6846846846846847
F1-score: 0.7524752475247525


<span style="font-size: 20px;"> Explanation </span>

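KNN is also very easy to implement, and actually scores higher than the decision tree on this problem: 81% accuracy against a mean cross-validated accuracy of 76%. The comparison is only approximate, because the KNN score comes from a single 30% hold-out split rather than 5-fold cross-validation, and precision, recall and F1 were not computed for the decision tree. For KNN, precision (0.84) is noticeably higher than recall (0.68), so its predictions of survival are usually right but it misses roughly a third of the actual survivors.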

The next step would be to use an ensemble method like Random Forests or Gradient Boosting (XGBoost) which combine multiple decision trees to improve performance. We could also try logistic regression and support vector machines. A deep learning model would be more appropriate for a larger data set.