# Support Vector Machines (SVM)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('cell_samples.csv')
df.head()

        ID  Clump  UnifSize  UnifShape  MargAdh  SingEpiSize  BareNuc  BlandChrom  NormNucl  Mit  Class
0  1000025      5         1          1        1            2        1           3         1    1      2
1  1002945      5         4          4        5            7       10           3         2    1      2
2  1015425      3         1          1        1            2        2           3         1    1      2
3  1016277      6         8          8        1            3        4           3         7    1      2
4  1017023      4         1          1        3            2        1           3         1    1      2


In [3]:
df.dtypes

ID              int64
Clump           int64
UnifSize        int64
UnifShape       int64
MargAdh         int64
SingEpiSize     int64
BareNuc        object
BlandChrom      int64
NormNucl        int64
Mit             int64
Class           int64
dtype: object

Notice that the *BareNuc* column has dtype `object` rather than `int64`, which means that not all of its values are integers. <br/>
Thus, we need to preprocess the data before it is fed to any model.

## Data Preprocessing

Let's drop the rows that have a *non-numeric* value in the **BareNuc** column.

In [4]:
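# Keep only rows whose BareNuc parses as a number; .copy() avoids a SettingWithCopyWarning below
df = df[pd.to_numeric(df['BareNuc'], errors='coerce').notnull()].copy()
df['BareNuc'] = df['BareNuc'].astype('int')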
df.dtypes

ID             int64
Clump          int64
UnifSize       int64
UnifShape      int64
MargAdh        int64
SingEpiSize    int64
BareNuc        int64
BlandChrom     int64
NormNucl       int64
Mit            int64
Class          int64
dtype: object
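
To see what the filter in cell [4] does step by step, here is a minimal sketch on a toy frame. The toy values, the names `toy`, `mask` and `clean`, and the `'?'` placeholder are assumptions made for illustration; the notebook does not show which non-numeric strings actually occur in *BareNuc*.

```python
import pandas as pd

# Hypothetical miniature of the real data: one BareNuc entry is not a number.
toy = pd.DataFrame({"Clump": [5, 5, 3, 6],
                    "BareNuc": ["1", "10", "?", "4"]})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
parsed = pd.to_numeric(toy["BareNuc"], errors="coerce")

# notnull() is True exactly for the rows that parsed successfully.
mask = parsed.notnull()

# Keep the good rows, then convert the column from strings to integers.
clean = toy.loc[mask].copy()
clean["BareNuc"] = clean["BareNuc"].astype("int")

print(mask.tolist())
print(clean)
```

Only the row holding the placeholder is dropped; the remaining rows keep their original index labels.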

In [5]:
X = df[['Clump', 'UnifSize', 'UnifShape', 'MargAdh', 'SingEpiSize', 'BareNuc', 'BlandChrom', 'NormNucl', 'Mit']]
X = np.asarray(X)
X

array([[ 5,  1,  1, ...,  3,  1,  1],
       [ 5,  4,  4, ...,  3,  2,  1],
       [ 3,  1,  1, ...,  3,  1,  1],
       ...,
       [ 5, 10, 10, ...,  8, 10,  2],
       [ 4,  8,  6, ..., 10,  6,  1],
       [ 4,  8,  8, ..., 10,  4,  1]])

In [6]:
y = df['Class']
y = np.asarray(y)
y[0:5]

array([2, 2, 2, 2, 2])

## Train-test split

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
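
The call above draws a different random split on every run, so the scores further down can change between runs. A minimal sketch of a reproducible split that also keeps the class ratio similar in both parts is shown below. The arrays `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`, and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 20 samples, 3 features, labels 2 and 4 as in the Class column.
X_demo = np.arange(60).reshape(20, 3)
y_demo = np.array([2] * 14 + [4] * 6)

def split(seed):
    # random_state fixes the shuffle; stratify keeps the 2/4 ratio similar in train and test.
    return train_test_split(X_demo, y_demo, test_size=0.2,
                            random_state=seed, stratify=y_demo)

X_tr_a, X_te_a, y_tr_a, y_te_a = split(42)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(42)

# The same seed yields the same partition on both calls.
print(np.array_equal(X_te_a, X_te_b), np.array_equal(y_tr_a, y_tr_b))
```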

## Modelling

The SVM algorithm offers a choice of kernel functions. Mapping the data into a higher-dimensional space, where the classes may become linearly separable, is called kernelling. The mathematical function used for this transformation is known as the kernel function, and it can be of different types, such as:

1. Linear
2. Polynomial
3. Radial basis function (RBF)
4. Sigmoid

Each of these functions has its own characteristics, pros and cons, and equation. Since there is no easy way of knowing which function performs best on a given dataset, we usually try different functions in turn and compare the results.
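
Trying the kernels in turn can be sketched as a simple loop. The example below is an illustration under assumptions, not the notebook's own experiment: it uses a synthetic dataset from `make_classification` as a stand-in for the cell samples, fixed seeds, `gamma="scale"`, and the hypothetical names `X_demo`, `y_demo` and `scores`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in: 300 samples with 9 numeric features and 2 classes.
X_demo, y_demo = make_classification(n_samples=300, n_features=9,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=0, stratify=y_demo)

scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, gamma="scale")  # all other parameters left at defaults
    clf.fit(X_tr, y_tr)
    scores[kernel] = f1_score(y_te, clf.predict(X_te), average="weighted")

# List the kernels from best to worst weighted F1 on the held-out part.
for kernel, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{kernel:8s} weighted F1 = {score:.3f}")
```

A single held-out split gives a noisy ranking; cross-validation would be a more robust way to make the same comparison.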

In [9]:
from sklearn import svm

In [10]:
model = svm.SVC()
model

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
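
The repr above lists `kernel='rbf'` together with `gamma`, `coef0` and `degree`. As a rough sketch of how those parameters enter the four kernels, following the definitions given in the scikit-learn user guide, each kernel can be written for a single pair of samples. The helper names `k_linear`, `k_poly`, `k_rbf` and `k_sigmoid` are hypothetical and not part of any library; the two vectors are the first two rows shown by `df.head()`.

```python
import numpy as np

def k_linear(x, z):
    # <x, z>
    return float(np.dot(x, z))

def k_poly(x, z, gamma, coef0=0.0, degree=3):
    # (gamma * <x, z> + coef0) ** degree
    return float((gamma * np.dot(x, z) + coef0) ** degree)

def k_rbf(x, z, gamma):
    # exp(-gamma * ||x - z||^2)
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def k_sigmoid(x, z, gamma, coef0=0.0):
    # tanh(gamma * <x, z> + coef0)
    return float(np.tanh(gamma * np.dot(x, z) + coef0))

x = np.array([5, 1, 1, 1, 2, 1, 3, 1, 1])    # feature values of sample 0
z = np.array([5, 4, 4, 5, 7, 10, 3, 2, 1])   # feature values of sample 1
gamma = 1.0 / x.size                         # gamma='auto' resolves to 1 / n_features

print(k_linear(x, z), k_poly(x, z, gamma), k_rbf(x, z, gamma), k_sigmoid(x, z, gamma))
```

The RBF value is 1 for identical samples and decays towards 0 as they move apart, which is why it acts as a similarity measure.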

In [11]:
model.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [12]:
y_hat = model.predict(X_test)
y_hat[0:5]

array([2, 2, 2, 4, 2])

## Evaluation

### Confusion Matrix

In [13]:
from sklearn import metrics

In [14]:
cnf_matrix = metrics.confusion_matrix(y_test, y_hat, labels = [2, 4])
print(cnf_matrix)

[[80  9]
 [ 0 48]]


In the call above, the labels `[2, 4]` stand for **Benign** and **Malignant**, respectively. Rows of the matrix are the true classes and columns are the predicted classes.
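
To make the matrix easier to read, per-class precision and recall can be derived from its counts by hand. This is a minimal sketch in plain Python that copies the four numbers printed above; the names `cm` and `report` are introduced here for illustration.

```python
# Counts copied from the printed confusion matrix (rows: true class, columns: predicted class).
cm = [[80, 9],
      [0, 48]]
labels = [2, 4]  # 2 = benign, 4 = malignant

report = {}
for i, label in enumerate(labels):
    true_pos = cm[i][i]
    predicted_as_label = sum(row[i] for row in cm)  # column sum
    actually_label = sum(cm[i])                     # row sum
    report[label] = (true_pos / predicted_as_label, true_pos / actually_label)
    print(f"class {label}: precision={report[label][0]:.3f} recall={report[label][1]:.3f}")
# class 2: precision=1.000 recall=0.899
# class 4: precision=0.842 recall=1.000
```

Read this way, the model caught every malignant sample in this particular split, at the cost of flagging 9 benign samples as malignant.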

### F1 score

In [15]:
from sklearn import metrics

In [16]:
print("F1 score of the model is %.9f" % metrics.f1_score(y_test, y_hat, average = 'weighted'))

F1 score of the model is 0.935372769
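
With `average = 'weighted'`, the reported value is the per-class F1 averaged with each class weighted by its number of true samples. As a sanity check, the sketch below recomputes it in plain Python from the confusion-matrix counts of this run; `cm`, `f1_for_class` and `weighted_f1` are names introduced for illustration.

```python
# Confusion-matrix counts from this run (rows: true class, columns: predicted class).
cm = [[80, 9],
      [0, 48]]

def f1_for_class(cm, i):
    tp = cm[i][i]
    fp = sum(row[i] for row in cm) - tp   # predicted as class i but actually another class
    fn = sum(cm[i]) - tp                  # actually class i but predicted as another class
    return 2 * tp / (2 * tp + fp + fn)

total = sum(sum(row) for row in cm)

# Weight each class's F1 by its support (the row sum) divided by the total sample count.
weighted_f1 = sum(f1_for_class(cm, i) * sum(cm[i]) / total for i in range(len(cm)))

print("%.9f" % weighted_f1)  # 0.935372769
```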


### Jaccard Similarity Score

In [17]:
from sklearn import metrics

In [18]:
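# jaccard_similarity_score exists only in scikit-learn < 0.23; later versions provide metrics.jaccard_score instead
print("Jaccard Similarity Score of the model is %.9f" % metrics.jaccard_similarity_score(y_test, y_hat))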

Jaccard Similarity Score of the model is 0.934306569
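
The value printed above equals (80 + 48) / 137, which is the plain accuracy of this run: for binary and multiclass labels, `jaccard_similarity_score` returned the same number as `metrics.accuracy_score`. The Jaccard index of a single class, TP / (TP + FP + FN), is a different and stricter quantity. The sketch below computes both in plain Python from the confusion-matrix counts of this run; the variable names are introduced for illustration.

```python
# Confusion-matrix counts from this run (rows: true class, columns: predicted class).
cm = [[80, 9],
      [0, 48]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Jaccard index for the malignant class (label 4): overlap of the true and predicted
# malignant sets divided by their union.
tp, fp, fn = cm[1][1], cm[0][1], cm[1][0]
jaccard_malignant = tp / (tp + fp + fn)

print("%.9f" % accuracy)           # 0.934306569
print("%.9f" % jaccard_malignant)  # 0.842105263
```

In newer scikit-learn versions, `metrics.jaccard_score(y_test, y_hat, pos_label=4)` should report the second number for the same predictions.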
