# **Support Vector Machines**

I have not seen anyone using a Support Vector Classifier in this competition yet, so how about we try it and see how it performs? 👀

If this method is new to you, I strongly recommend the following videos from the great **StatQuest** channel. They will give you a good grasp of what is going on.

* [Support Vector Machines: Main Ideas!!!](https://www.youtube.com/watch?v=efR1C6CvhmE)
* [Support Vector Machines: The Polynomial Kernel](https://www.youtube.com/watch?v=Toet3EiSFcM)
* [Support Vector Machines: The Radial (RBF) Kernel](https://www.youtube.com/watch?v=Qc5IyLW_hns)

If you are curious about other ML techniques as well, check out the other videos on the channel. They are clear and also funny. BAM!!! 😃

### Updates

**Version 4**: added 10-fold cross validation.

**Version 5**: added the *scikit-learn-intelex* extension, recommended by @napetrov (see [this notebook](https://www.kaggle.com/napetrov/tps04-svm-with-intel-extension-for-scikit-learn)).

## Enabling scikit-learn-intelex

In [None]:
!pip install scikit-learn-intelex --progress-bar off >> /tmp/pip_sklearnex.log
from sklearnex import patch_sklearn
patch_sklearn()

# Load libraries and data

In [None]:
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier

from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split

In [None]:
train = pd.read_csv('../input/tabular-playground-series-apr-2021/train.csv', index_col='PassengerId')
test = pd.read_csv('../input/tabular-playground-series-apr-2021/test.csv', index_col='PassengerId')
submission = pd.read_csv('../input/tabular-playground-series-apr-2021/sample_submission.csv', index_col='PassengerId')

target = train.pop('Survived')

By glancing at the train dataframe, it seems like we have missing data and some features that we may not need.

In [None]:
train.head()

## Preprocessing

The features **Name**, **Ticket** and **Cabin** don't seem to be useful, so let's drop them.

In [None]:
train.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
test.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

Now let's see if we have missing data.

In [None]:
train.info()

We do. Here we take a simple approach and just fill **Age** and **Fare** with their median value (using the train data). For **Embarked**, we fill the NAs with its most frequent value, which is 'S'.

In [None]:
test_prepared = test.copy()
train_prepared = train.copy()

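# Fill values are computed from the training data only and assigned back,
# since fillna(inplace=True) on a selected column is unreliable in recent pandas.
age_median = train['Age'].median()
fare_median = train['Fare'].median()

test_prepared['Age'] = test_prepared['Age'].fillna(age_median)
train_prepared['Age'] = train_prepared['Age'].fillna(age_median)

test_prepared['Fare'] = test_prepared['Fare'].fillna(fare_median)
train_prepared['Fare'] = train_prepared['Fare'].fillna(fare_median)

test_prepared['Embarked'] = test_prepared['Embarked'].fillna('S')
train_prepared['Embarked'] = train_prepared['Embarked'].fillna('S')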
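
The same idea can also be written with `SimpleImputer`, which is imported above but not used. This is only a minimal sketch on made-up toy frames (`toy_train`, `toy_test` and their values are hypothetical, not the competition data): the medians are learned from the training frame and then applied to the test frame.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for the real train/test frames.
toy_train = pd.DataFrame({
    "Age": [20.0, None, 40.0, 60.0],
    "Fare": [10.0, 30.0, None, 20.0],
    "Embarked": ["S", "C", None, "S"],
})
toy_test = pd.DataFrame({
    "Age": [None, 30.0],
    "Fare": [15.0, None],
    "Embarked": [None, "Q"],
})

numeric_cols = ["Age", "Fare"]

# Learn the medians from the training data only.
median_imputer = SimpleImputer(strategy="median")
median_imputer.fit(toy_train[numeric_cols])

# Apply the learned medians to the test data.
toy_test_filled = toy_test.copy()
toy_test_filled[numeric_cols] = median_imputer.transform(toy_test[numeric_cols])

# For the categorical column, use the most frequent training value.
most_frequent = toy_train["Embarked"].mode().iloc[0]
toy_test_filled["Embarked"] = toy_test["Embarked"].fillna(most_frequent)

print(median_imputer.statistics_)
print(toy_test_filled)
```

Fitting on the training frame and only transforming the test frame keeps the fill values identical for both sets, which is what the cell above does by hand.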

In [None]:
train_prepared.info()

In [None]:
test_prepared.info()

Now the data is complete. Let's encode the categorical variables **Pclass**, **Sex** and **Embarked**.

In [None]:
for col in ['Pclass', 'Sex', 'Embarked']:
    le = LabelEncoder()
    le.fit(train_prepared[col])
    train_prepared[col] = le.transform(train_prepared[col])
    test_prepared[col] = le.transform(test_prepared[col])
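
One thing to keep in mind: `LabelEncoder` maps the categories to integers in sorted order (for **Embarked**: C=0, Q=1, S=2), so the model will treat them as ordered numbers. A possible alternative, which I have not scored in this notebook, is one-hot encoding. Below is a minimal sketch on made-up port values (`train_ports` and `test_ports` are hypothetical arrays, not the real columns).

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical port values; each row is one passenger.
train_ports = np.array([["S"], ["C"], ["Q"], ["S"]])
test_ports = np.array([["Q"], ["S"]])

# Fit on the training values only; unseen categories become all-zero rows.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_ports)

# The default output is a sparse matrix, so convert it for display.
test_encoded = encoder.transform(test_ports).toarray()

print(encoder.categories_[0])  # column order of the one-hot output
print(test_encoded)
```

Each category gets its own 0/1 column, so no artificial ordering between ports is introduced. Whether this changes the score here is an open question.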

In [None]:
train_prepared.head()

In [None]:
train_prepared.describe()

The last preprocessing step is scaling all variables, because Support Vector Machines do not work well with variables on different scales.

In [None]:
train_prepared_scaled = train_prepared.copy()
test_prepared_scaled = test_prepared.copy()

scaler = StandardScaler()
scaler.fit(train_prepared)
train_prepared_scaled = scaler.transform(train_prepared_scaled)
test_prepared_scaled = scaler.transform(test_prepared_scaled)

train_prepared_scaled = pd.DataFrame(train_prepared_scaled, columns=train_prepared.columns)
test_prepared_scaled = pd.DataFrame(test_prepared_scaled, columns=train_prepared.columns)

In [None]:
train_prepared_scaled.describe()
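
To make the scaling step less of a black box, here is a small NumPy sketch of what `StandardScaler` computes with its default settings: subtract the training mean and divide by the training standard deviation of each column. The arrays `train_values` and `test_values` are made-up numbers for illustration.

```python
import numpy as np

# Hypothetical values: two features on very different scales.
train_values = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test_values = np.array([[2.0, 30.0]])

# Statistics come from the training data only.
mean = train_values.mean(axis=0)
std = train_values.std(axis=0)  # population standard deviation (ddof=0)

# The same mean and std are applied to both sets.
train_scaled = (train_values - mean) / std
test_scaled = (test_values - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, every training column has mean 0 and standard deviation 1, so no single feature dominates the distances that the SVM relies on.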

We are good to go! :)

Let's split the train set into 90% training and 10% validation data.

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(train_prepared_scaled, target, test_size=0.1, random_state=0)

# SVC with linear kernel

If we want to use a linear kernel, `LinearSVC` is the recommended class because of its speed. Here it runs in a fraction of a second!

In [None]:
%%time
linear_svc = LinearSVC(random_state=0, C=0.01, loss='hinge')
linear_svc.fit(X_train, y_train)
y_pred = linear_svc.predict(X_valid)
accuracy_score(y_valid, y_pred)  # scikit-learn convention: (y_true, y_pred)

In [None]:
%%time
final_pred = linear_svc.predict(test_prepared_scaled)

The Public LB score for the following submission file is **0.78505**. After some investigation I realized that this classifier is practically using only the variable **Sex** for classification. Specifically, it predicts that all women survive and all men die. So the SVC with a linear kernel does not help us at all. But hey, we had to try it!

In [None]:
submission['Survived'] = np.round(final_pred).astype(int)
submission.to_csv('svc_kernel_linear.csv')
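
One way to check whether a linear SVC leans on a single variable is to look at the absolute values of `coef_` and to compare the predictions with a one-variable rule. The sketch below does this on synthetic data, not on the competition data: feature 0 plays the role of a scaled binary variable such as **Sex**, the other two features are pure noise, and all names and values are made up for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_samples = 500

# Hypothetical data: feature 0 is a binary variable coded as -1/+1,
# features 1 and 2 are pure noise.
dominant = rng.choice([-1.0, 1.0], size=n_samples)
noise = rng.normal(size=(n_samples, 2))
X = np.column_stack([dominant, noise])

# The label follows feature 0, with roughly 10% of the labels flipped.
flip = rng.random(n_samples) < 0.1
y = np.where(flip, dominant < 0, dominant > 0).astype(int)

model = LinearSVC(random_state=0, C=0.01, loss="hinge")
model.fit(X, y)

# Larger absolute weights mean more influence (features are on similar scales).
weights = np.abs(model.coef_[0])

# Share of predictions that match the one-variable rule "feature 0 > 0".
rule_pred = (X[:, 0] > 0).astype(int)
agreement = np.mean(model.predict(X) == rule_pred)

print(weights)
print(agreement)
```

On the real data, the same two checks could be run with `linear_svc.coef_` and a rule based on the encoded **Sex** column.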

# SVC with RBF kernel

Let's now use the *kernel trick* with an RBF kernel. For this, we need to use the `SVC` class and set `kernel='rbf'`. The parameter `C` had no noticeable influence for some of the values I tested, but training ran faster with low values, which is why I chose 0.01.

Before Version 5 of this notebook, the following cell would take 10 minutes to run. With **scikit-learn-intelex** enabled, it takes less than a minute.

In [None]:
%%time
svc_kernel_rbf = SVC(kernel='rbf', random_state=0, C=0.01)
svc_kernel_rbf.fit(X_train, y_train)
y_pred = svc_kernel_rbf.predict(X_valid)
accuracy_score(y_valid, y_pred)  # scikit-learn convention: (y_true, y_pred)

Prediction is now very fast with scikit-learn-intelex. Without it, the following cell would take about 5 minutes.

In [None]:
%%time
final_pred = svc_kernel_rbf.predict(test_prepared_scaled)

The Public LB score for the following submission file is **0.79062**, "which is an improvement from our previous score". 😏

In [None]:
submission['Survived'] = np.round(final_pred).astype(int)
submission.to_csv('svc_kernel_rbf.csv')
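
For reference, the RBF kernel measures similarity between two points as `exp(-gamma * ||x - z||^2)`. The NumPy sketch below computes it on three made-up points. The helper `rbf_kernel_matrix` and the array `X_toy` are hypothetical names used only for illustration, and `gamma` follows the formula behind scikit-learn's default `gamma='scale'`, i.e. `1 / (n_features * X.var())`.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma):
    """Return the matrix of exp(-gamma * squared distance) between rows of A and B."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point error.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# Hypothetical points: the second is at distance 1 from the first, the third at distance 2.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

# Same formula as scikit-learn's gamma='scale'.
gamma = 1.0 / (X_toy.shape[1] * X_toy.var())

K = rbf_kernel_matrix(X_toy, X_toy, gamma)
print(K)
```

Every point has similarity 1 with itself, and the similarity decays as points move apart, which is what lets the classifier draw non-linear boundaries.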

# SVC with polynomial kernel

Another popular choice is to use a polynomial kernel. Here I tried a 3rd-degree polynomial, but other degrees could be used as well.

The **scikit-learn-intelex** extension did not speed up the computations in this case. The following cell will take several minutes to run.

In [None]:
%%time
svc_kernel_poly_3 = SVC(kernel='poly', degree=3, random_state=0, C=0.01)
svc_kernel_poly_3.fit(X_train, y_train)
y_pred = svc_kernel_poly_3.predict(X_valid)
accuracy_score(y_valid, y_pred)  # scikit-learn convention: (y_true, y_pred)

Prediction will also take a few minutes here.

In [None]:
%%time
final_pred = svc_kernel_poly_3.predict(test_prepared_scaled)

The Public LB score for the following submission file is **0.76709**, "which is *not* an improvement from our previous score". In fact, it's even worse than what we obtained with a linear kernel. 😐

In [None]:
submission['Survived'] = np.round(final_pred).astype(int)
submission.to_csv('svc_kernel_poly_3.csv')
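
As a reminder of what the polynomial kernel computes, scikit-learn defines it as `(gamma * <x, z> + coef0) ** degree`, with `coef0=0.0` by default. Here is a small NumPy sketch on made-up points. The helper `poly_kernel_matrix`, the array `X_toy` and the value `gamma=0.5` are arbitrary choices for illustration, not the values used by the model above.

```python
import numpy as np

def poly_kernel_matrix(A, B, degree=3, gamma=1.0, coef0=0.0):
    """Return (gamma * <a, b> + coef0) ** degree for every pair of rows."""
    return (gamma * (A @ B.T) + coef0) ** degree

# Hypothetical points, chosen so the dot products are easy to check by hand.
X_toy = np.array([[1.0, 2.0], [0.5, -1.0], [-2.0, 0.0]])

K = poly_kernel_matrix(X_toy, X_toy, degree=3, gamma=0.5, coef0=0.0)
print(K)
```

Unlike the RBF kernel, the values are not bounded between 0 and 1, and with an odd degree and `coef0=0.0` they can be negative.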

# Final submission with 10 folds

To finish off, let's predict survival with 10-fold cross-validation using the RBF kernel. The following cell will take over 2 hours to run.

In [None]:
%%time
n_folds = 10
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
y_pred = np.zeros(test.shape[0])

for fold, (train_index, valid_index) in enumerate(kf.split(train_prepared_scaled, target)):
    print("Running Fold {}".format(fold + 1))
    X_train, X_valid = pd.DataFrame(train_prepared_scaled.iloc[train_index]), pd.DataFrame(train_prepared_scaled.iloc[valid_index])
    y_train, y_valid = target.iloc[train_index], target.iloc[valid_index]
    svc_kernel_rbf = SVC(kernel='rbf', random_state=0, C=0.01)
    svc_kernel_rbf.fit(X_train, y_train)
    print("  Accuracy: {}".format(accuracy_score(y_valid, svc_kernel_rbf.predict(X_valid))))
    y_pred += svc_kernel_rbf.predict(test_prepared_scaled)

y_pred /= n_folds

print("")
print("Done!")

The Public LB score for the following submission file is **0.79062**, the same obtained with only one split.

In [None]:
submission['Survived'] = np.round(y_pred).astype(int)
submission.to_csv('svc_kernel_rbf_10_folds.csv')
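
Since each fold's model outputs hard 0/1 labels, averaging the predictions and rounding amounts to a majority vote across folds. The sketch below reproduces that logic on made-up votes for three passengers (`fold_preds` is a hypothetical array). It also shows a detail that is easy to miss: `np.round` rounds halves to the nearest even number, so a 5-5 tie becomes 0.

```python
import numpy as np

n_folds = 10

# Hypothetical 0/1 predictions: one row per fold, one column per passenger.
# Passenger A gets 7 votes, passenger B gets 2, passenger C gets exactly 5.
fold_preds = np.array(
    [[1, 0, 1]] * 5
    + [[1, 1, 0]] * 2
    + [[0, 0, 0]] * 3
)

# Same accumulation as in the cross-validation loop above.
y_pred = np.zeros(fold_preds.shape[1])
for preds in fold_preds:
    y_pred += preds
y_pred /= n_folds

# np.round sends the 0.5 tie to 0 (round half to even).
final = np.round(y_pred).astype(int)

# An explicit threshold would instead count the tie as a survival.
tie_as_survived = (y_pred >= 0.5).astype(int)

print(y_pred, final, tie_as_survived)
```

With an even number of folds, ties are possible, so the choice between `np.round` and an explicit threshold can change a few predictions.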

# Conclusion

I found it interesting that the Support Vector Classifier with a linear kernel seems to be influenced only by one of the variables. If you have an explanation (or a guess) for this fact, please leave a comment.

This was the first time that I used this technique and it was fun to play with it. Hope you enjoyed this brief notebook too! 😉