In this notebook I use a support vector machine (SVM) classifier on the Titanic dataset. SVMs are naturally formulated for binary classification, so this model should hopefully perform better than a random forest classifier. As in the random forest classifier notebook, the exploratory data analysis is copied from the logistic regression notebook.

In [27]:
import numpy as np
import pandas as pd
from math import log
%matplotlib inline 
import matplotlib.pyplot as plt
import warnings
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder,OneHotEncoder
warnings.filterwarnings('ignore')


train_data = pd.read_csv("Titanic/train.csv")
test_data = pd.read_csv("Titanic/test.csv")

pd.set_option('display.max_columns', None)
train_data = train_data.copy()
test_data = test_data.copy()
train_data = train_data.set_index("PassengerId")
test_data = test_data.set_index("PassengerId")
train_data.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [28]:
features = ['Survived', 'Pclass', 'Sex', 'Fare', 'Embarked', 'FamilyOnBoard', 'Age']
train_data['FamilyOnBoard'] = train_data['Parch'] + train_data['SibSp']
test_data['FamilyOnBoard'] = test_data['Parch'] + test_data['SibSp']
train_data_features = train_data[features]
train_data_features.groupby('Pclass', as_index = False)['Age'].describe()

,Pclass,count,mean,std,min,25%,50%,75%,max
0,1,186.0,38.233441,14.802856,0.92,27.0,37.0,49.0,80.0
1,2,173.0,29.87763,14.001077,0.67,23.0,29.0,36.0,70.0
2,3,355.0,25.14062,12.495398,0.42,18.0,24.0,32.0,74.0


In [29]:
y = train_data_features['Survived']
X = train_data_features.drop('Survived', axis = 1)



In [30]:
OHE_encoder = Pipeline(
    steps = [
        ("imputer", SimpleImputer(strategy = 'most_frequent')),
        ("encoder", OneHotEncoder())
    ]
)
num_pipeline = Pipeline(
    steps = [
        ("imputer", SimpleImputer(strategy = 'mean')),
        ("scaler", StandardScaler())
    ]
)
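# Note: columns not listed in any transformer (here 'Pclass') are dropped,
# because ColumnTransformer uses remainder='drop' by default.
preprocessor = ColumnTransformer(transformers = [
    ("num", num_pipeline, ['Fare', 'Age', 'FamilyOnBoard']),  # impute with the mean, then standardise
    ("ord", OrdinalEncoder(), ['Sex']),                       # encode the two categories as 0/1
    ("OHE", OHE_encoder, ['Embarked'])                        # impute with the mode, then one-hot encode
])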

The hyperparameters we consider are $C$ and the 'class_weight' (the grid search below only varies $C$, since tuning class_weight turned out not to be useful). The SVC classifier is regularised with a squared l2 penalty and uses $C=1$ by default. The strength of the regularisation is inversely proportional to $C$, so varying $C$ adjusts how strongly the model is regularised. The class_weight controls how heavily errors on each class are penalised. With None, false positives and false negatives are penalised the same. Since there are significantly more negatives than positives in the training data, though, it seems reasonable that 'balanced' might help: it weights each class inversely proportionally to its frequency in the dataset, so errors on the rarer class cost more.
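To make the 'balanced' option concrete, here is a minimal sketch of how scikit-learn derives those weights, namely n_samples / (n_classes * count of each class). The label array `y_toy` is made up for illustration (62 negatives, 38 positives, roughly mirroring the survival rate) and is not the Titanic data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical toy labels: 62 non-survivors (0) and 38 survivors (1)
y_toy = np.array([0] * 62 + [1] * 38)

# Weights as computed by scikit-learn for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)

# The same quantity computed by hand: n_samples / (n_classes * class counts)
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(dict(zip([0, 1], weights)))
print(np.allclose(weights, manual))
```

The rarer class receives the larger weight, so under 'balanced' a misclassified survivor costs more than a misclassified non-survivor.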

In [31]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)

parameters = {'model__C' : np.logspace(-2,2,100)} # class_weight is no longer tuned here; I found it not to be useful.
model = SVC(gamma = 'auto', random_state = 42) 
my_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                              ('model', model)
                             ])
search = GridSearchCV(my_pipeline, param_grid=parameters, scoring = 'accuracy', cv=5, n_jobs = -1)
search.fit(X_train,y_train)

In [32]:
best_model = search.best_estimator_
print(search.best_params_)
svm_scores = cross_val_score(best_model, X_train, y_train, cv=5)  # 5-fold CV accuracy of the tuned SVM
print(svm_scores.mean())
y_pred = best_model.predict(X_train)
Acc_score = accuracy_score(y_train, y_pred)
print(Acc_score)
print(confusion_matrix(y_train, y_pred))

{'model__C': 2.2051307399030455}
0.8257953314291344
0.8356741573033708
[[399  45]
 [ 72 196]]


In [33]:
y_pred = best_model.predict(X_test)
Acc_score = accuracy_score(y_test, y_pred)
print(Acc_score)
print(confusion_matrix(y_test, y_pred))  # (y_true, y_pred) order, matching the training-set call above

0.8100558659217877
[[91 14]
 [20 54]]


Changing the class weights to be inversely proportional to the class frequency (i.e. 'balanced') didn't improve the accuracy: the predictions became less accurate, although the numbers of false positives and false negatives became about equal. With no class weighting, there are more false predictions of death than false predictions of survival (72 versus 45 on the training data, and 20 versus 14 on the held-out split). This makes sense since only about $38\%$ of people actually survived, so the model leans towards the majority class. This suggests that using no class_weight, but changing the threshold for predicting survival, might be a reasonable strategy.
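As a rough illustration of the threshold idea, the sketch below thresholds the SVM's `decision_function` scores at a value other than 0. It runs on synthetic data from `make_classification` rather than the Titanic features, and the names `X_demo`, `demo_pipeline`, `predict_with_threshold` and the shifted threshold of -0.25 are all hypothetical choices for illustration, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, mildly imbalanced data standing in for the real features (assumption)
X_demo, y_demo = make_classification(n_samples=600, n_features=5,
                                     weights=[0.62, 0.38], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.8, random_state=42)

demo_pipeline = Pipeline(steps=[("scaler", StandardScaler()),
                                ("model", SVC(gamma="auto", random_state=42))])
demo_pipeline.fit(X_tr, y_tr)

# Signed distance-like score for each test sample; larger means "more positive"
scores = demo_pipeline.decision_function(X_te)

def predict_with_threshold(scores, threshold):
    """Predict class 1 whenever the decision score exceeds the threshold."""
    return (scores > threshold).astype(int)

default_pred = predict_with_threshold(scores, 0.0)    # the usual cut-off at 0
shifted_pred = predict_with_threshold(scores, -0.25)  # lower cut-off: more positives predicted

print(confusion_matrix(y_te, default_pred))
print(confusion_matrix(y_te, shifted_pred))
```

Lowering the threshold can only move predictions from the negative to the positive class, trading false negatives for false positives. A sensible threshold would need to be chosen on validation data, not on the test split.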

Also, the same comment about data leakage as in the random forest classifier notebook applies here.

In [34]:
predictions = best_model.predict(test_data)
my_submission = pd.DataFrame({'PassengerId': test_data.index, 'Survived': predictions})
my_submission.to_csv('SVMClassifier.csv', index=False)