# 06. Support Vector Machine Classification
***

In this chapter, we build another well-known classifier: the Support Vector Machine (SVM). We do not cover the math behind SVMs.

Note that an SVM is called a _Discriminant Classifier_, because it learns a separating function (a decision boundary) between the classes. This differs from _Probabilistic Classifiers_, such as Logistic Regression, which base their predictions on estimated class probabilities.
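To make the distinction concrete, here is a minimal sketch on a tiny synthetic dataset (the names `X_toy`, `y_toy`, `svm_toy` and `logreg_toy` are illustrative and not part of this chapter's data). It assumes a scikit-learn installation and only contrasts the kind of output each model exposes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Tiny synthetic 1-D dataset (illustrative only, not the Titanic data).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

svm_toy = SVC(kernel='linear').fit(X_toy, y_toy)
logreg_toy = LogisticRegression().fit(X_toy, y_toy)

# SVM: one signed score per sample, measured relative to the separating
# boundary. The sign of the score decides the predicted class.
scores = svm_toy.decision_function(X_toy)

# Logistic Regression: one probability per class, each row sums to 1.
probas = logreg_toy.predict_proba(X_toy)

print(scores)
print(probas)
```

The SVM scores are not probabilities: they are unbounded and only their sign and relative magnitude are meaningful.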

In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('whitegrid')
sns.set_context('poster')

from matplotlib import rcParams

In [2]:
from sklearn.metrics import confusion_matrix
# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV lives in model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

***
## Loading data

In [3]:
data = pd.read_csv('./data/training_dataset.csv')
data.head(3)

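| Unnamed: 0 | SibSp | Parch | Fare | Gender | Boarding_C | Boarding_Q | Boarding_S | Age | Pclass_1 | Pclass_2 | Pclass_3 | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 13.0 | 0 | 0 | 0 | 1 | 24.0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 16.1 | 0 | 0 | 0 | 1 | 21.75 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 30.6958 | 1 | 1 | 0 | 0 | 56.0 | 1 | 0 | 0 | 0 |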


In [4]:
X = data.drop('Survived', axis = 1)
Y = data['Survived']

In [5]:
data_test = pd.read_csv('./data/testing_dataset.csv')
X_test = data_test.drop('Survived', axis = 1)
Y_test = data_test['Survived']

***
## Basic SVM classifier

We first build a basic SVM classifier using the default hyper-parameters.

In [6]:
svc = SVC()
svc.fit(X, Y)

# score() returns the mean accuracy on the given data
print("Accuracy on training data: %f" % svc.score(X, Y))
print("Accuracy on test data: %f" % svc.score(X_test, Y_test))

Accuracy on training data: 0.884669
Accuracy on test data: 0.719101
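To see which hyper-parameters a default `SVC()` uses, they can be read back from the estimator. This is a small sketch: the exact defaults depend on the installed scikit-learn version (the default `gamma` in particular has changed across releases), so no value is asserted for it here.

```python
from sklearn.svm import SVC

# Hyper-parameters of an SVC created without arguments.
defaults = SVC().get_params()

for name in ('C', 'kernel', 'gamma'):
    print("%s = %r" % (name, defaults[name]))
```

`C` controls the regularization strength and `kernel` the shape of the decision boundary. Both are natural candidates for tuning.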


The model performs poorly on unseen data: the test accuracy (0.72) is well below the training accuracy (0.88), which suggests overfitting. SVMs are sensitive to their hyper-parameters, so these must be chosen carefully. We do this with a grid search in the next section.

***
## Searching for the best SVM classifier

Here we run a grid search over the classifier hyper-parameters and select the best classifier in terms of cross-validated accuracy.

In [None]:
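# Fix C and search over the kernel type only
svc_c = SVC(C = 0.1)
param_grid = {'kernel': ['poly', 'linear', 'rbf']}

# Evaluate each candidate with 5-fold cross-validation
gs = GridSearchCV(svc_c, param_grid = param_grid, cv = 5)
gs.fit(X, Y) # Will take a lot of time...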

In [15]:
best_svc = gs.best_estimator_

In [16]:
print("Using SVM params %s produce the best model accuracy: %f" % (gs.best_params_, gs.best_score_))

Using SVM params {'kernel': 'linear', 'C': 100} produce the best model accuracy: 0.798875
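The best parameters reported above include `C`, while the grid in the cell shown only varies `kernel`, so the output presumably comes from a run whose grid also covered `C`. A grid over both hyper-parameters could be sketched as follows. The synthetic data and the candidate `C` values are assumptions made for illustration; they are not the values used to produce the output above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic dataset so the search runs quickly (not the Titanic data).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Every combination of kernel and C is evaluated: 2 x 3 = 6 candidates.
param_grid_demo = {
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10],  # assumed candidate values
}

gs_demo = GridSearchCV(SVC(), param_grid=param_grid_demo, cv=5)
gs_demo.fit(X_demo, y_demo)

print(gs_demo.best_params_)  # best combination found
print(gs_demo.best_score_)   # its mean cross-validated accuracy
```

Note that `best_score_` is an average over the validation folds of the training data, so it is not directly comparable to the accuracy on a separate test set.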


In [17]:
print("Accuracy on training data = %f" % best_svc.score(X, Y))
print("Accuracy on test data = %f" % best_svc.score(X_test, Y_test))

Accuracy on training data = 0.800281
Accuracy on test data = 0.752809


In [18]:
# The corresponding confusion matrix (rows: true classes, columns: predicted classes)
print(confusion_matrix(Y_test, best_svc.predict(X_test)))

[[89 20]
 [24 45]]
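The confusion matrix can be turned into the usual summary metrics. The sketch below hard-codes the matrix printed above and relies on scikit-learn's convention that rows are true classes and columns are predicted classes. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[89, 20],
               [24, 45]])

# For a binary problem the flattened order is TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / float(cm.sum())  # share of correct predictions
precision = tp / float(tp + fp)         # predicted survivors that did survive
recall = tp / float(tp + fn)            # actual survivors that were found

print("Accuracy  = %f" % accuracy)
print("Precision = %f" % precision)
print("Recall    = %f" % recall)
```

The accuracy computed this way, 134 / 178, matches the test accuracy reported above. The recall shows that roughly a third of the actual survivors are misclassified.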
