In [1]:
import joblib
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# header=0 replaces each file's own header row with the column names given here,
# so the header row no longer has to be sliced off afterwards.
raw_data = pd.read_csv('../data/raw/Titanic_dataset.csv', delimiter=',', header=0, names=['Name', 'PClass', 'Age', 'Sex', 'Survived'])
processed_data = passenger_data = pd.read_csv('../data/processed/processed_dataset.csv', delimiter=',', header=0, names=['Name', 'PClass', 'Age', 'Sex', 'Survived', 'Family', 'Title'])

x = processed_data[['PClass', 'Age', 'Sex', 'Family', 'Title']]
y = processed_data['Survived']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

## Choosing the model

Predicting whether a given passenger survived the sinking of the Titanic is a binary classification problem. Several models are suitable for it. We will try a few of them and compare their classification accuracy on the test set.

### Logistic regression

First we will try logistic regression, a commonly used classification algorithm and a good starting point. We will use GridSearchCV to find an approximate value for the hyperparameter C, which is the inverse of the regularization strength, and to decide which norm (L1 or L2) to use for the penalty. Each candidate is scored with 5-fold cross-validation.

In [2]:
np.random.seed(123)
c_range = np.random.normal(5, 1.5, 20).astype(float)
hyperparameters = {'penalty': ['l1', 'l2'], 'C': c_range}

clf = GridSearchCV(LogisticRegression(solver='liblinear'), hyperparameters, cv=5)
clf.fit(x_train, y_train)

print(clf.best_params_)

{'C': 1.359981134910389, 'penalty': 'l2'}
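
To illustrate what C controls, here is a minimal sketch on synthetic data (the `make_classification` data and the helper `coef_norm` are assumptions for illustration, not part of the Titanic pipeline). It fits an L2-penalized logistic regression with a small and a large C and compares the size of the learned weights. A smaller C means stronger regularization and should give smaller weights.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; these are not the Titanic features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0)


def coef_norm(c_value):
    """Fit an L2-penalized logistic regression and return the L2 norm of its weights."""
    model = LogisticRegression(solver='liblinear', penalty='l2', C=c_value)
    model.fit(X_demo, y_demo)
    return float(np.linalg.norm(model.coef_))


norm_strong_reg = coef_norm(0.01)   # small C -> strong regularization
norm_weak_reg = coef_norm(100.0)    # large C -> weak regularization

print('Weight norm with C=0.01:', round(norm_strong_reg, 3))
print('Weight norm with C=100: ', round(norm_weak_reg, 3))
```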


Based on these results, we narrow the search for C to values close to 1. This time we use RandomizedSearchCV to pick the value:

In [3]:
c_range = np.random.normal(1, 0.2, 20).astype(float)
hyperparameters = {'penalty': ['l1', 'l2'], 'C': c_range}

clf = RandomizedSearchCV(LogisticRegression(solver='liblinear'), param_distributions=hyperparameters, cv=5)
clf.fit(x_train, y_train)

print(clf.best_params_)

{'penalty': 'l1', 'C': 0.7140979262024134}
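
RandomizedSearchCV samples its candidates at random, so repeated runs can select different values unless the search is seeded. The sketch below shows one way to make such a search repeatable. It differs from the cell above: it uses synthetic data, it draws C from a continuous `scipy.stats.loguniform` distribution instead of a pre-sampled array, and the bounds 0.1 and 10 and the helper `run_search` are assumptions chosen for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; these are not the Titanic features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0)

# C is drawn from a log-uniform distribution; the penalty is drawn from a list.
param_distributions = {
    'penalty': ['l1', 'l2'],
    'C': loguniform(0.1, 10),
}


def run_search(seed):
    """Run one seeded randomized search and return the selected parameters."""
    search = RandomizedSearchCV(
        LogisticRegression(solver='liblinear', random_state=0),
        param_distributions=param_distributions,
        n_iter=10,
        cv=5,
        random_state=seed,  # fixes which candidates are sampled
    )
    search.fit(X_demo, y_demo)
    return search.best_params_


first_run = run_search(0)
second_run = run_search(0)
print(first_run)
print('Same result on rerun:', first_run == second_run)
```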


Finally, we fit a model with the penalty and C value selected by the randomized search (in the run shown above, the L1 penalty and C of approximately 0.714) and measure its performance on the test set:

In [4]:
best_penalty = clf.best_params_['penalty']
best_c = clf.best_params_['C']

logreg = LogisticRegression(solver='liblinear', penalty=best_penalty, C=best_c)
logreg.fit(x_train, y_train)

print(metrics.classification_report(y_test, logreg.predict(x_test)))
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(x_test, y_test)))
print('Accuracy of logistic regression classifier on training set: {:.2f}'.format(logreg.score(x_train, y_train)))

              precision    recall  f1-score   support

           0       0.82      0.98      0.89       175
           1       0.92      0.56      0.70        86

    accuracy                           0.84       261
   macro avg       0.87      0.77      0.79       261
weighted avg       0.85      0.84      0.83       261

Accuracy of logistic regression classifier on test set: 0.84
Accuracy of logistic regression classifier on training set: 0.81


The model reaches 84% accuracy on the test set, which is a satisfying result. The low recall for class 1 (survived), 0.56, is not surprising. The dataset is imbalanced because the majority of passengers did not survive (175 of the 261 test passengers belong to class 0), so the model tends to predict the majority class and misclassifies a sizable share of the passengers who actually survived.
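
The numbers in the report can be put in context with a little arithmetic. The sketch below uses only the support and recall values printed above. Because the report rounds recall to two decimals, the derived counts are approximations, not values read from the model.

```python
# Support and recall values copied from the classification report above.
support_died = 175       # class 0
support_survived = 86    # class 1
recall_died = 0.98
recall_survived = 0.56

total = support_died + support_survived

# Accuracy of a trivial model that always predicts the majority class (0).
majority_baseline = support_died / total

# Approximate counts, since the report rounds recall to two decimals.
true_positives = round(recall_survived * support_survived)
true_negatives = round(recall_died * support_died)
missed_survivors = support_survived - true_positives
approx_accuracy = (true_positives + true_negatives) / total

print(f'Majority-class baseline accuracy: {majority_baseline:.2f}')
print(f'Survivors predicted as non-survivors (approx.): {missed_survivors}')
print(f'Accuracy implied by the two recalls (approx.): {approx_accuracy:.2f}')
```

Always predicting "did not survive" would already score about 0.67 on this test set, so the 0.84 accuracy is best read against that baseline.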

### KNN

Next, we will try the k-nearest neighbors algorithm. The most important question is which value of k to pick. Using GridSearchCV, we will try every k from 1 to 30 with 10-fold cross-validation and pick the best one:

In [5]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)

classifier = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10, scoring='accuracy')
classifier.fit(x_train, y_train)
best_n_neighbors = classifier.best_params_['n_neighbors']

print('Optimal k value: ', best_n_neighbors)

Optimal k value:  8
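
The best k alone does not show how close the other candidates were. As a sketch on synthetic data (an assumption; the real features are not reproduced here), the fitted search object's `cv_results_` can be turned into a table of mean cross-validated accuracy per k:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; these are not the Titanic features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0)

search = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': list(range(1, 31))},
    cv=10,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

# One row per candidate k, with the mean and spread of its fold accuracies.
scores_by_k = pd.DataFrame({
    'k': [params['n_neighbors'] for params in search.cv_results_['params']],
    'mean_accuracy': search.cv_results_['mean_test_score'],
    'std_accuracy': search.cv_results_['std_test_score'],
})

top_five = scores_by_k.sort_values('mean_accuracy', ascending=False).head(5)
print(top_five.to_string(index=False))
```

If several values of k score within one standard deviation of each other, the choice between them is not strongly supported by the data.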


The best cross-validated accuracy is obtained for k=8, so we will use that value for our model:

In [6]:
knn = KNeighborsClassifier(n_neighbors=best_n_neighbors)
knn.fit(x_train, y_train)

y_pred = knn.predict(x_test)
train_pred = knn.predict(x_train)
print(metrics.classification_report(y_test, y_pred))
print('Accuracy on the test set: ', round(accuracy_score(y_test, y_pred)*100,2), '%')
print('Accuracy on the training set: ', round(accuracy_score(y_train, train_pred)*100,2), '%')

              precision    recall  f1-score   support

           0       0.83      0.95      0.88       175
           1       0.85      0.59      0.70        86

    accuracy                           0.83       261
   macro avg       0.84      0.77      0.79       261
weighted avg       0.83      0.83      0.82       261

Accuracy on the test set:  83.14 %
Accuracy on the training set:  82.32 %


The results are similar to those of the logistic regression model (83% vs. 84% test accuracy, and the same F1-score of 0.70 for class 1), so it is hard to tell which model is better for our problem.
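
When two models are this close on a single test split, comparing them over several folds gives a spread as well as a mean. The following sketch, on synthetic data and with illustrative model settings (both are assumptions), scores two classifiers on the same stratified folds:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; these are not the Titanic features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0)

# The same folds are reused for every model so the scores are comparable.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

candidates = {
    'logistic regression': LogisticRegression(solver='liblinear'),
    'knn (k=8)': KNeighborsClassifier(n_neighbors=8),
}

cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring='accuracy')
    cv_summary[name] = (scores.mean(), scores.std())
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

If the difference between the means is smaller than the fold-to-fold spread, the data does not clearly favor either model.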

### Support Vector Machine (SVM)

The next reasonable thing to try is a Support Vector Machine. As with the previous models, we begin with GridSearchCV to find approximately good hyperparameters:

In [7]:
from sklearn import svm

np.random.seed(123)
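# Note: uniform() takes (low, high); with the arguments in this order the samples fall between 2.5 and 7.
g_range = np.random.uniform(7, 2.5, 10).astype(float)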
c_range = np.random.normal(7, 2.5, 10).astype(float)

hyperparameters = {'kernel': ['linear', 'rbf', 'sigmoid'], 'C': c_range, 'gamma': g_range}

clf = GridSearchCV(svm.SVC(), hyperparameters, cv=5)
clf.fit(x_train, y_train)

best_kernel = clf.best_params_['kernel']
best_gamma = clf.best_params_['gamma']
best_c = clf.best_params_['C']

print(clf.best_params_)

{'C': 10.164840646763835, 'gamma': 2.5865611072692305, 'kernel': 'rbf'}


We will use the values found above as base values for RandomizedSearchCV to tune them further:

In [8]:
np.random.seed(123)
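# Note: uniform() takes (low, high), so gamma is sampled between 0.3 and 2.5 here, not within 0.3 of 2.5.
g_range = np.random.uniform(2.5, 0.3, 10).astype(float)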
c_range = np.random.normal(10, 0.3, 10).astype(float)

hyperparameters = {'kernel': ['linear', 'rbf', 'sigmoid'], 'C': c_range, 'gamma': g_range}

clf = RandomizedSearchCV(svm.SVC(), param_distributions=hyperparameters, cv=5)
clf.fit(x_train, y_train)

best_kernel = clf.best_params_['kernel']
best_gamma = clf.best_params_['gamma']
best_c = clf.best_params_['C']

print(clf.best_params_)

{'kernel': 'rbf', 'gamma': 1.870493463109165, 'C': 10.656035826692136}
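
Refining around a base value means drawing candidates from a narrow range centered on it. The sketch below shows one possible way to do that. The helper `candidates_around`, the 20% relative spread, and the seeds are hypothetical choices for illustration. The base values are the rounded grid search results printed earlier.

```python
import numpy as np


def candidates_around(base, relative_spread, n_candidates, seed=0):
    """Draw candidates uniformly from [base * (1 - spread), base * (1 + spread))."""
    rng = np.random.default_rng(seed)
    low = base * (1.0 - relative_spread)
    high = base * (1.0 + relative_spread)
    # uniform() expects the lower bound first, then the upper bound.
    return rng.uniform(low, high, n_candidates)


# Base values rounded from the grid search output (gamma ~ 2.59, C ~ 10.16).
gamma_candidates = candidates_around(base=2.59, relative_spread=0.2, n_candidates=10)
c_candidates = candidates_around(base=10.16, relative_spread=0.2, n_candidates=10, seed=1)

print('gamma range:', gamma_candidates.min().round(3), 'to', gamma_candidates.max().round(3))
print('C range:    ', c_candidates.min().round(3), 'to', c_candidates.max().round(3))
```

A relative spread keeps every candidate positive, which both C and gamma require.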


Using these values, we fit the model with the tuned parameters:

In [9]:
svc = svm.SVC(kernel = best_kernel, gamma=best_gamma, C=best_c, probability=True)
svc.fit(x_train, y_train)

y_pred = svc.predict(x_test)
print(metrics.classification_report(y_test, y_pred))
print('Accuracy on the test set: ', round(accuracy_score(y_test, y_pred)*100,2), '%')

              precision    recall  f1-score   support

           0       0.83      0.95      0.89       175
           1       0.86      0.59      0.70        86

    accuracy                           0.84       261
   macro avg       0.85      0.77      0.79       261
weighted avg       0.84      0.84      0.83       261

Accuracy on the test set:  83.52 %


## Ensemble learning

Now we will combine the models above into an ensemble and observe the results. A soft voting classifier will be used for this purpose:

In [10]:
from sklearn.ensemble import VotingClassifier

voting_classifier = VotingClassifier(estimators=[('lr', logreg), ('svc', svc), ('knn', knn)], voting='soft')
voting_classifier.fit(x_train, y_train)

print('Accuracy on training set: ', round(accuracy_score(y_train, voting_classifier.predict(x_train)) * 100, 2), '%')
print('Accuracy on test set: ', round(accuracy_score(y_test, voting_classifier.predict(x_test)) * 100, 2), '%')
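# Note: this report is computed from logreg's predictions, so it repeats the logistic regression report.
# Pass voting_classifier.predict(x_test) instead to report on the ensemble itself.
print(metrics.classification_report(y_test, logreg.predict(x_test)))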

Accuracy on training set:  83.0 %
Accuracy on test set:  83.52 %
              precision    recall  f1-score   support

           0       0.82      0.98      0.89       175
           1       0.92      0.56      0.70        86

    accuracy                           0.84       261
   macro avg       0.87      0.77      0.79       261
weighted avg       0.85      0.84      0.83       261
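
For reference, soft voting with equal weights averages the class probabilities predicted by each member and picks the class with the highest mean. That is why SVC needs `probability=True` to take part. The sketch below checks this by hand on synthetic data, with default or illustrative hyperparameters (both are assumptions, not the tuned models from this notebook).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; these are not the Titanic features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(solver='liblinear')),
        ('svc', SVC(probability=True, random_state=0)),
        ('knn', KNeighborsClassifier(n_neighbors=8)),
    ],
    voting='soft',
)
ensemble.fit(X_demo, y_demo)

# Average the class probabilities of the fitted members by hand.
member_probas = [member.predict_proba(X_demo) for member in ensemble.estimators_]
mean_proba = np.mean(member_probas, axis=0)
manual_pred = ensemble.classes_[np.argmax(mean_proba, axis=1)]

matches = np.array_equal(manual_pred, ensemble.predict(X_demo))
print('Manual averaging reproduces the ensemble predictions:', matches)
```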



The results are similar to those of the individual models: the ensemble reaches 83.52% accuracy on the test set, the same as the SVM alone. The similarity between the models' predictions is probably a consequence of the simple dataset with a small number of available features. The ensemble adds little, very likely because all of its members make similar predictions on the test data.
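
One way to check whether the models really predict alike is to measure how often each pair agrees on the test set. The following sketch does this on synthetic data with illustrative hyperparameters (both are assumptions). On the real data, the fitted `logreg`, `knn`, and `svc` models and `x_test` would take their place.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; these are not the Titanic features).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

models = {
    'lr': LogisticRegression(solver='liblinear'),
    'knn': KNeighborsClassifier(n_neighbors=8),
    'svc': SVC(kernel='rbf'),
}

# Fit each model and keep its test-set predictions.
predictions = {name: model.fit(X_tr, y_tr).predict(X_te) for name, model in models.items()}

# Fraction of test samples on which each pair of models predicts the same class.
agreement = {}
for first, second in combinations(predictions, 2):
    agreement[(first, second)] = float(np.mean(predictions[first] == predictions[second]))
    print(f'{first} vs {second}: {agreement[(first, second)]:.2%} agreement')
```

Agreement close to 100% for every pair would leave the voting classifier with few cases in which the members can correct one another.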