<a href="https://colab.research.google.com/github/IndraniMandal/CSC310-S20/blob/master/SVM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
###### Set Up #####
# verify our folder with the data and module assets is installed
# if it is installed make sure it is the latest
!test -e ds-assets && cd ds-assets && git pull && cd ..
# if it is not installed clone it
!test ! -e ds-assets && git clone https://github.com/IndraniMandal/ds-assets.git
# point to the folder with the assets
home = "ds-assets/assets/"
import sys
sys.path.append(home)

Cloning into 'ds-assets'...
remote: Enumerating objects: 205, done.
remote: Counting objects: 100% (58/58), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 205 (delta 54), reused 50 (delta 50), pack-reused 147 (from 1)
Receiving objects: 100% (205/205), 12.58 MiB | 9.34 MiB/s, done.
Resolving deltas: 100% (80/80), done.


# SVM Code Examples



In [2]:
# set up
import pandas as pd
import numpy as np
np.set_printoptions(formatter={'float_kind':"{:3.2f}".format})
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# get data
df = pd.read_csv(home+"wdbc.csv")
df = df.drop(['ID'],axis=1)
X  = df.drop(['Diagnosis'],axis=1)
y = df['Diagnosis']


# SVM model
model = SVC(kernel='rbf', C=0.001, max_iter=10000)

# do the 5-fold cross validation
scores = cross_val_score(model, X, y, cv=5)
print("Fold Accuracies: {}".format(scores))
print("Accuracy: {:3.2f}".format(scores.mean()))

Fold Accuracies: [0.62 0.62 0.63 0.63 0.63]
Accuracy: 0.63


## Kernel Functions
The `kernel` argument of `SVC` selects the kernel function. The named kernel functions are:

* linear (`'linear'`)
* polynomial (`'poly'`)
* rbf, the Gaussian kernel (`'rbf'`)
* sigmoid (`'sigmoid'`)
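
To get a feel for how the kernel choice affects accuracy, here is a minimal sketch that cross-validates each kernel in turn. It assumes scikit-learn's bundled `load_breast_cancer` data (the same Wisconsin diagnostic dataset, but with 0/1 labels) in place of `wdbc.csv`, and it standardizes the features first. Both choices, and the names `X_demo`, `y_demo`, and `kernel_scores`, are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# assumption: sklearn's bundled copy of the data stands in for wdbc.csv
X_demo, y_demo = load_breast_cancer(return_X_y=True)

kernel_scores = {}
for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    # standardize the features, then fit an SVM with the given kernel
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    kernel_scores[kernel] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()
    print("{:8s} accuracy: {:3.2f}".format(kernel, kernel_scores[kernel]))
```

All other hyperparameters are left at their defaults here, so the ranking of kernels can change once `C` and `gamma` are tuned.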

## SVM Grid Search

We can also run a grid search to find the best model among a set of candidate hyperparameter settings.

BEWARE: an exhaustive grid search over every SVM parameter is practically impossible. The number of combinations multiplies with each parameter added (combinatorial explosion).

Here we search over only a small number of hyperparameters: the kernel, `C`, and, for the rbf kernel, `gamma`.
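
To make the growth concrete, the following sketch counts the candidates in a small grid using scikit-learn's `ParameterGrid`. The grid contents and the names `demo_grid`, `n_candidates`, and `n_folds` are illustrative assumptions.

```python
from sklearn.model_selection import ParameterGrid

# a small illustrative grid: 4 linear candidates plus 4 x 2 rbf candidates
demo_grid = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]

n_candidates = len(ParameterGrid(demo_grid))
n_folds = 5  # each candidate is fit once per cross-validation fold

print("candidates: {}, fits with {}-fold CV: {}".format(
    n_candidates, n_folds, n_candidates * n_folds))
```

Every additional value list multiplies the candidate count within its dictionary, which is why the grid is kept deliberately small.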



In [3]:
# set up
import pandas as pd
import numpy as np
np.set_printoptions(formatter={'float_kind':"{:3.2f}".format})
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from confint import classification_confint

# get data
df = pd.read_csv(home+"wdbc.csv")
df = df.drop(['ID'],axis=1)
X  = df.drop(['Diagnosis'],axis=1)
actual_y = df['Diagnosis']

# SVM model
model = SVC(max_iter=10000)

# grid search
param_grid = [
  {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
 ]
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X, actual_y)
print("Grid Search: best parameters: {}".format(grid.best_params_))

# evaluate the best model on the data used for the search (an optimistic estimate)
best_model = grid.best_estimator_
predict_y = best_model.predict(X)
acc = accuracy_score(actual_y, predict_y)
lb,ub = classification_confint(acc,X.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

# build the confusion matrix
labels = ['M', 'B']
cm = confusion_matrix(actual_y, predict_y, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print("Confusion Matrix:\n{}".format(cm_df))



Grid Search: best parameters: {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
Accuracy: 0.99 (0.98,1.00)
Confusion Matrix:
     M    B
M  207    5
B    3  354
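
The interval printed next to the accuracy comes from `classification_confint` in the course's `confint` module, whose source is not shown here. The sketch below is therefore only one plausible reading: a 95% normal-approximation interval, under the assumption that the helper does something similar. The function name `normal_approx_confint` is hypothetical.

```python
import math

def normal_approx_confint(acc, n, z=1.96):
    """Normal-approximation interval for an accuracy measured on n samples."""
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    # clip to [0, 1] because an accuracy cannot leave that range
    return max(0.0, acc - half_width), min(1.0, acc + half_width)

# 207 + 354 = 561 of 569 correct, as in the confusion matrix above
acc_demo = 561 / 569
lb_demo, ub_demo = normal_approx_confint(acc_demo, 569)
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc_demo, lb_demo, ub_demo))
```

The interval narrows as the number of samples grows, since the half-width shrinks with the square root of `n`.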


In [4]:
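from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# refit the grid-search winner, this time standardizing the features first
# (C and gamma were tuned on unscaled data and are not re-tuned here)
model = make_pipeline(StandardScaler(), SVC(C=100, gamma=0.0001, kernel='rbf'))
model.fit(X, actual_y)

# evaluate the model on the training data (an optimistic estimate)
predict_y = model.predict(X)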
acc = accuracy_score(actual_y, predict_y)
lb,ub = classification_confint(acc,X.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

# build the confusion matrix
labels = ['M', 'B']
cm = confusion_matrix(actual_y, predict_y, labels=labels)
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
print("Confusion Matrix:\n{}".format(cm_df))

Accuracy: 0.98 (0.97,0.99)
Confusion Matrix:
     M    B
M  202   10
B    1  356


# Example: Handwritten Digit Classification

In [5]:
# we need UCI repo access
!pip install ucimlrepo

import numpy as np # we need numpy arrays
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl.metadata (5.5 kB)
Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7


In [6]:
# fetch dataset
digits = fetch_ucirepo(id=80)

# data (as pandas dataframes)
X = digits.data.features
X.head()

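|   | Attribute1 | Attribute2 | Attribute3 | Attribute4 | Attribute5 | Attribute6 | Attribute7 | Attribute8 | Attribute9 | Attribute10 | ... | Attribute55 | Attribute56 | Attribute57 | Attribute58 | Attribute59 | Attribute60 | Attribute61 | Attribute62 | Attribute63 | Attribute64 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 6 | 15 | 12 | 1 | 0 | 0 | 0 | 7 | ... | 0 | 0 | 0 | 0 | 6 | 14 | 7 | 1 | 0 | 0 |
| 1 | 0 | 0 | 10 | 16 | 6 | 0 | 0 | 0 | 0 | 7 | ... | 3 | 0 | 0 | 0 | 10 | 16 | 15 | 3 | 0 | 0 |
| 2 | 0 | 0 | 8 | 15 | 16 | 13 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 9 | 14 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 3 | 11 | 16 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 15 | 2 | 0 | 0 |
| 4 | 0 | 0 | 5 | 14 | 4 | 0 | 0 | 0 | 0 | 0 | ... | 12 | 0 | 0 | 0 | 4 | 12 | 14 | 7 | 0 | 0 |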


In [7]:
y = digits.data.targets
y.head()

|   | class |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 7 |
| 3 | 4 |
| 4 | 6 |


All 10 digits are represented, each with roughly the same number of samples.

In [8]:
y.value_counts()

| class | count |
|---|---|
| 3 | 572 |
| 1 | 571 |
| 4 | 568 |
| 7 | 566 |
| 9 | 562 |
| 5 | 558 |
| 6 | 558 |
| 2 | 557 |
| 0 | 554 |
| 8 | 554 |


Note that this is a 10-way classification problem. A single SVM can only discriminate between two classes. The sklearn `SVC` implementation handles multiclass problems with a one-vs-one scheme: it trains one SVM for each pair of classes (45 pairs for the 10 digits) and then aggregates their votes into a single classification.
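
As a quick check of how `SVC` decomposes a multiclass problem, the sketch below asks a fitted model for its raw pairwise decision values. scikit-learn documents `SVC` as using a one-vs-one scheme, so ten classes should give 10 * 9 / 2 = 45 binary classifiers. The sketch assumes sklearn's small bundled `load_digits` set as a stand-in for the UCI data, and the names `X_small`, `y_small`, `ovo_model`, and `pairwise` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

# assumption: sklearn's small bundled digits set stands in for the UCI data
X_small, y_small = load_digits(return_X_y=True)

# 'ovo' exposes the raw pairwise decision values instead of the
# default per-class ('ovr') summary of them
ovo_model = SVC(kernel='rbf', decision_function_shape='ovo').fit(X_small, y_small)
pairwise = ovo_model.decision_function(X_small[:1])

n_classes = len(ovo_model.classes_)
print("classes: {}, pairwise decision values: {}".format(
    n_classes, pairwise.shape[1]))
```

The `decision_function_shape` setting only changes how the decision values are reported; the underlying training scheme stays the same.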

In [9]:
# setting up training/testing data
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y['class'], # we want a series as target
    train_size=0.8,
    test_size=0.2,
    random_state=1
)

In [10]:
# train model

# model object
model = SVC(kernel='rbf')

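# grid search
# note: gamma has no effect on the linear kernel, so the 16 linear
# candidates contain only 4 distinct models
param_grid = {
      'kernel':['linear','rbf'],
      'C': [1, 10, 100, 1000],
      'gamma': [0.0001, 0.001, 0.01, 0.1]
}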
grid = GridSearchCV(model, param_grid, cv=3, verbose=10, n_jobs=-1)
grid.fit(X_train, y_train)
print("Grid Search: best parameters: {}".format(grid.best_params_))
best_model = grid.best_estimator_

Fitting 3 folds for each of 32 candidates, totalling 96 fits
Grid Search: best parameters: {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'}


In [11]:
# Evaluate the best model
predict_y = best_model.predict(X_test)
acc = accuracy_score(y_test, predict_y)
lb,ub = classification_confint(acc,X_test.shape[0])
print("Accuracy: {:3.2f} ({:3.2f},{:3.2f})".format(acc,lb,ub))

Accuracy: 0.99 (0.99,1.00)


In [12]:
# build the confusion matrix
cm = confusion_matrix(y_test, predict_y)
cm_df = pd.DataFrame(cm)
cm_df

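| actual \ predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 105 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 129 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 91 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 118 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 107 | 0 | 0 | 1 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 96 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 1 | 0 | 111 | 0 | 0 | 0 |
| 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 129 | 0 | 0 |
| 8 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 118 | 1 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 113 |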
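
A confusion matrix of this size is easier to read once it is reduced to per-class numbers. The sketch below copies the matrix shown above into a NumPy array and computes the recall for each digit, assuming sklearn's convention that rows are actual classes and columns are predicted classes. The names `cm_demo`, `recall`, and `overall` are illustrative.

```python
import numpy as np

# confusion matrix copied from the output above (rows: actual, columns: predicted)
cm_demo = np.array([
    [105,   0,  0,   0,   0,  0,   0,   0,   0,   0],
    [  0, 129,  0,   0,   0,  0,   0,   1,   0,   0],
    [  0,   0, 91,   0,   0,  0,   0,   0,   0,   0],
    [  0,   0,  0, 118,   0,  0,   0,   0,   0,   0],
    [  0,   0,  0,   0, 107,  0,   0,   1,   0,   0],
    [  0,   0,  0,   0,   0, 96,   0,   0,   0,   0],
    [  0,   0,  0,   0,   1,  0, 111,   0,   0,   0],
    [  0,   0,  0,   1,   0,  0,   0, 129,   0,   0],
    [  0,   1,  0,   0,   1,  0,   0,   0, 118,   1],
    [  0,   0,  0,   0,   0,  0,   0,   0,   0, 113],
])

# recall per digit: correct predictions divided by the row total
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)
# overall accuracy: all correct predictions divided by all test samples
overall = np.trace(cm_demo) / cm_demo.sum()

for digit, r in enumerate(recall):
    print("digit {}: recall {:.3f}".format(digit, r))
print("overall accuracy: {:.4f}".format(overall))
```

In this matrix the digit 8 accounts for three of the seven misclassifications, so it has the lowest recall.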
