# PAMD week 10 - Support Vector Machines

This week we will look into implementing SVMs for classification. I uploaded our churn dataset to Learn again and we will try to predict whether a customer will churn or not based on their characteristics.

A lot of these steps will be familiar to you, so I will just demonstrate them today.

In [9]:
import pandas as pd
import numpy as np
import pprint as pp

df = pd.read_csv("churn_ibm.csv")
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


### 1.1 Pre-processing and data splitting

Our pre-processing steps are the usual:

- divide the whole dataframe into X and y
- create dummy variables for our categorical data in X
- create a dummy variable for our classification outcome y
- split into test/train for both X and y

In [10]:
from sklearn.model_selection import train_test_split

y = df["Churn"]
X = df.drop(["Churn", "customerID"], axis=1)

for column in X.columns:
    if X[column].dtype == object:
        X = pd.concat(
            [X, pd.get_dummies(X[column], prefix=column, drop_first=True)], axis=1
        ).drop([column], axis=1)

y = pd.get_dummies(y, prefix="churn", drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

Remember how SVMs operate: they find the best dividing hyperplane to classify data. Because this hyperplane is defined through distances between datapoints, the scale on which the X variables are measured matters: it determines how the data is distributed in our dataspace, and a variable with a large numeric range would otherwise dominate the others.

So, let's standardise our X, both for test and train.

In [11]:
from sklearn.preprocessing import StandardScaler

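# Fit the scaler on the training data only, then apply the same
# transformation to the test data so no test information leaks into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)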
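
As a minimal sketch of the scaling step, the example below uses synthetic data rather than the churn dataset (the arrays, names and random seed are made up for illustration). It follows the common convention of learning the mean and standard deviation from the training set only and then reusing them for the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X (an assumption): two features on very different scales
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(50, 20, size=200),    # a charge-like continuous feature
    rng.integers(0, 2, size=200),    # a 0/1 dummy variable
]).astype(float)

X_tr, X_te = train_test_split(X_demo, test_size=0.3, random_state=0)

# Learn mean and standard deviation from the training data only
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)

# Reuse the training statistics for the test data
X_te_scaled = scaler.transform(X_te)

print("train means:", X_tr_scaled.mean(axis=0).round(3))
print("train stds: ", X_tr_scaled.std(axis=0).round(3))
print("test means: ", X_te_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1. The test columns will usually be close to that, but not exactly, because they are scaled with the training statistics.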

### 1.2 Support Vector Machine implementation

We will now start our implementation through the [sklearn module 'svm'](https://scikit-learn.org/stable/modules/svm.html). Note the different available functions here: we have regression and classification functions. Within the classification functions there are three different ones:

- 'LinearSVC' which implements a linear SVM and is faster than 'SVC' for linear cases
- 'SVC' which allows for different shapes of Kernel, including linear but also Gaussian etc.
- 'NuSVC' which is similar to 'SVC' but replaces C with a parameter 'nu' that controls the number of support vectors: 'nu' is a lower bound on the fraction of support vectors and an upper bound on the fraction of margin errors.
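
To make the three options concrete, here is a minimal sketch that fits each of them on a synthetic dataset from `make_classification`. The data, the seed, `nu=0.5` and `max_iter=10000` are assumptions chosen for illustration, not settings for the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC, NuSVC

# Synthetic binary classification problem (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Standardise using the training statistics only
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

models = {
    "LinearSVC": LinearSVC(max_iter=10000),  # linear boundary only
    "SVC": SVC(),                            # default rbf kernel
    "NuSVC": NuSVC(nu=0.5),                  # nu takes the place of C
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = model.score(X_te, y_te)   # mean accuracy on the test set
    print(f"{name}: accuracy = {scores[name]:.3f}")
```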

We will focus on 'SVC' here.

Have a look at the [SVC documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) as usual. We will start really simple here, and then make it more complex in the following exercise by comparing different Kernels.

Start with a baseline model:

**TASK**
- Implement an SVC using all default settings
- Calculate the accuracy and AUC

In [12]:
from sklearn import svm, metrics
from sklearn.metrics import roc_auc_score

# Create a SVC classifier using default settings
clf = svm.SVC(probability=True)

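# Train the model using the training sets
# (y_train is a one-column DataFrame, so flatten it to the 1d array sklearn expects)
clf.fit(X_train, y_train.values.ravel())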

# Predict the response for test dataset
y_pred = clf.predict(X_test)

# Model Accuracy
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

# Model Precision
print("Precision:", metrics.precision_score(y_test, y_pred))

# Model Recall
print("Recall:", metrics.recall_score(y_test, y_pred))

# Calculate AUC from the predicted probability of the positive class
y_pred_proba = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print("AUC:", auc)



Accuracy: 0.7976303317535545
Precision: 0.700507614213198
Recall: 0.4717948717948718
AUC: 0.8023079725374808


You can learn more about the settings of your model by calling the get_params() method on it.

In [14]:
pp.pprint(clf.get_params())

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': True,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}


### 1.3 Comparison of different SVM Kernels

You might wonder how different Kernel choices impact the performance of your model.

Choosing the right Kernel for an SVM problem is a difficult task. In the majority of cases you will have to run different configurations and test them, comparing the performance and then choosing the best fit based on that. There are typically only two or three candidates which you would consider.

Check back into the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC) and note the different Kernel options you have. There's also a very good overview of [different Kernels](https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html#sphx-glr-auto-examples-svm-plot-svm-kernels-py) and their performance visualised.

- 'rbf': the default. It stands for radial basis function and is a Gaussian Kernel. This is a very common Kernel shape in practice.
- 'linear': Produces a linear SVC.
- 'poly': Polynomial Kernel. You can use the parameter 'degree' to change its degree; the default is 3. The parameter 'coef0' is the constant term added inside the Kernel function.
- 'sigmoid': This Kernel is less commonly used, as it only fits well in very specific cases where the data distribution can be approximated by a sigmoid shape. Again, 'coef0' is the constant term added inside the Kernel function.
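
To see what 'gamma', 'degree' and 'coef0' do, the sketch below evaluates each Kernel for a single pair of points using the helper functions in `sklearn.metrics.pairwise` and compares the result with the formula written out by hand. The two points and the parameter values are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel,
    polynomial_kernel,
    rbf_kernel,
    sigmoid_kernel,
)

# Two example points and parameter values (assumptions for illustration)
x = np.array([[1.0, 2.0]])
z = np.array([[2.0, 0.5]])
gamma, degree, coef0 = 0.5, 3, 1.0

dot = (x @ z.T).item()            # inner product <x, z>
sq_dist = np.sum((x - z) ** 2)    # squared Euclidean distance

k_linear = linear_kernel(x, z).item()
k_poly = polynomial_kernel(x, z, degree=degree, gamma=gamma, coef0=coef0).item()
k_rbf = rbf_kernel(x, z, gamma=gamma).item()
k_sigmoid = sigmoid_kernel(x, z, gamma=gamma, coef0=coef0).item()

# Each line prints the library value next to the hand-written formula
print("linear :", k_linear, "vs", dot)
print("poly   :", k_poly, "vs", (gamma * dot + coef0) ** degree)
print("rbf    :", k_rbf, "vs", np.exp(-gamma * sq_dist))
print("sigmoid:", k_sigmoid, "vs", np.tanh(gamma * dot + coef0))
```

In the polynomial and sigmoid Kernels 'coef0' enters as an additive constant next to the scaled inner product, while the rbf Kernel depends only on the distance between the two points.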

For many of these parameters, such as 'coef0', we would use cross-validation (for example via [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)) to find the best value.

**TASK**
- Implement one or more SVMs, but this time specify the Kernel choice
- Compare the performance to your model in 1.2 using accuracy and AUC

The default in 1.2 was Gaussian (rbf), so you can compare the performance to any of the other three options (linear, poly, sigmoid). You can also test what happens when you make changes to the other parameters, e.g. degree, gamma or coef0 which impact the shape of the boundaries. Remember that AUC is a pure comparison metric - we use it to evaluate different models against each other, and a higher AUC indicates a better performance.


In [3]:
# Create a SVC classifier using a linear kernel
clf_linear = svm.SVC(kernel="linear", probability=True)
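
One possible way to approach this task is sketched below. It is not a model solution: it runs on a synthetic dataset from `make_classification` (an assumption, so the numbers say nothing about the churn data) and loops over the four Kernel options, recording accuracy and AUC for each.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the churn data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Standardise using the training statistics only
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

results = {}
for kernel in ["rbf", "linear", "poly", "sigmoid"]:
    model = SVC(kernel=kernel, probability=True, random_state=0)
    model.fit(X_tr, y_tr)

    acc = accuracy_score(y_te, model.predict(X_te))
    # AUC is computed from the predicted probability of the positive class
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

    results[kernel] = (acc, auc)
    print(f"{kernel:8s} accuracy = {acc:.3f}  AUC = {auc:.3f}")
```

On the churn data you would replace the synthetic arrays with X_train, X_test, y_train and y_test from above.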

### 1.4 The role of the parameter 'C'

Think back to the lecture: There are two main decisions you have to make for an SVM. One is the Kernel choice, which we have just covered. The other is the parameter C.

C can be seen as a sensitivity or penalisation parameter, because it controls how strongly datapoints that violate our margin are penalised. The amount by which each point violates the margin is measured by a slack variable, and C sets the price of that slack:

- A lower value of C is more lenient, it draws wider margins around the boundary (more slack)
- A higher value of C is more strict, it draws more narrow margins around the boundary (less slack)

The default value for 'SVC' is 1.

You might wonder why we do not just choose a very high value for C to be very accurate, but this can lead to overfitting. We therefore have to carefully balance again.

**TASK**

- Check what happens when you decrease or increase the value for C for your best performing SVC from tasks 1.2 and 1.3. You can choose a number of different values (higher and lower than the default 1) and compare based on accuracy and/or AUC.

In [4]:
# ADD CODE HERE
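
A hedged sketch of how such a comparison could look, again on synthetic data from `make_classification` rather than the churn dataset. The grid of C values and the rbf Kernel are assumptions for illustration. Besides accuracy and AUC, it records the number of support vectors, which typically grows as C gets smaller and the margin gets wider.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the churn data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=2
)

# Standardise using the training statistics only
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

results = {}
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = SVC(C=C, kernel="rbf", probability=True, random_state=0)
    model.fit(X_tr, y_tr)

    acc = accuracy_score(y_te, model.predict(X_te))
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    n_sv = int(model.n_support_.sum())   # total support vectors over both classes

    results[C] = (acc, auc, n_sv)
    print(f"C = {C:<6} accuracy = {acc:.3f}  AUC = {auc:.3f}  support vectors = {n_sv}")
```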

### 1.5 Bringing it all together

Using [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) you can bring these different decisions on Kernel and C together and make a good overview comparison. This can be useful if you're not sure what to look for.

**OPTIONAL TASK**

If you have some more time, try implementing GridSearchCV to search for different Kernels and values of C. I will create a list of AUC values for combinations of Kernel and three values for C (0.2, 0.5, 1.0). How large is the impact of C? 

In [5]:
# ADD CODE HERE
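
A minimal sketch of such a grid search, under stated assumptions: it uses synthetic data from `make_classification` instead of the churn dataset, and it wraps the scaler and the SVC in a pipeline (via `make_pipeline`, which is not used elsewhere in this notebook) so that scaling is refitted inside every cross-validation fold. The grid covers the four Kernels and the three values of C mentioned above, scored by AUC.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the churn data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=3
)

# make_pipeline names the steps after their classes, hence the "svc__" prefix
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__kernel": ["rbf", "linear", "poly", "sigmoid"],
    "svc__C": [0.2, 0.5, 1.0],
}

# 5-fold cross-validation on the training set, scored by AUC
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(X_tr, y_tr)

# One row per Kernel/C combination with its mean cross-validated AUC
results = pd.DataFrame(search.cv_results_)[
    ["param_svc__kernel", "param_svc__C", "mean_test_score"]
]
print(results.sort_values("mean_test_score", ascending=False))
print("Best:", search.best_params_, "AUC:", round(search.best_score_, 3))
```

Comparing rows that share a Kernel but differ in C gives a first impression of how much C matters for that Kernel.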