# Section 1: SVM

This notebook shows the steps to train an SVM model. It is divided into 3 main sections:

## Contents

- <a href='#1'>1. Import libraries and data</a>
- <a href='#2'>2. Train SVM using pixel intensity as an input</a>
 - <a href='#2.1'>2.1 Find the appropriate kernel</a> 
 - <a href='#2.2'>2.2 Tune the parameters (C and/or gamma) for the best kernel only</a>
- <a href='#3'>3. Train SVM using HOG descriptor as an input</a>
 - <a href='#3.1'>3.1 Find the appropriate kernel</a> 
 - <a href='#3.2'>3.2 Tune the parameters (C and/or gamma) for the best kernel only</a>

## 1. Import libraries and data <a id='1'></a>

In [3]:
import os
import matplotlib.pyplot as plt
from skimage.feature import hog
import numpy as np
import pandas as pd
import pickle
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
import time
pd.set_option('expand_frame_repr', False)

In [2]:
# Helper Function to show the first 5 images
# Credit from INM427 Neural Computing Exercise
def plot_example(X, y):
    """Plot the first 5 images and their labels in a row."""
    # Use a separate loop variable so the argument y is not shadowed
    for i, (img, label) in enumerate(zip(X[:5].reshape(5, 28, 28), y[:5])):
        plt.subplot(151 + i)
        plt.imshow(img, cmap='gray')
        plt.xticks([])
        plt.yticks([])
        plt.title(label)

In [4]:
# Build the data file paths relative to the current working directory
cwd = os.getcwd()
data_path = os.path.join(cwd, 'Data2')
train_path = os.path.join(data_path, 'mnist_background_random_train.amat')
test_path = os.path.join(data_path, 'mnist_background_random_test.amat')

In [6]:
# Import Data
df_train = np.loadtxt(train_path)
df_test = np.loadtxt(test_path)

X_train = df_train[:,0:-1]
y_train = df_train[:,-1]
X_test = df_test[:,0:-1]
y_test = df_test[:,-1]
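
As a minimal sketch of the slicing above, assuming (as the code implies) that each row of the `.amat` file holds the pixel values followed by the class label in the last column. The two-row text below is made-up illustration data, not taken from the dataset, and the names `fake_file`, `X_demo` and `y_demo` are hypothetical.

```python
import io

import numpy as np

# Hypothetical two-row file: three "pixel" values followed by a label
fake_file = io.StringIO("0.1 0.2 0.3 7\n0.4 0.5 0.6 2\n")
data = np.loadtxt(fake_file)

X_demo = data[:, :-1]              # every column except the last: features
y_demo = data[:, -1].astype(int)   # last column: class label, cast from float

print(X_demo.shape, y_demo)
```

The notebook keeps the labels as floats, which scikit-learn accepts; the integer cast here is optional and only makes the printed labels easier to read.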

SVM training scales poorly with the number of samples.
To shorten the tuning process, we use only 40% of the training data for the grid searches (around 5000 observations, roughly 500 per class).

In [8]:
from sklearn.model_selection import train_test_split
_, X_tuning, _, y_tuning = train_test_split(X_train, y_train,
                                                    stratify=y_train, 
                                                    test_size=0.4,
                                                    random_state =1)
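
A small sketch of what `stratify` does, using made-up labels rather than the MNIST data: the held-out 40% keeps the same class proportions as the full set. The array sizes and the names `X_demo`, `y_demo`, `X_sub` and `y_sub` are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 samples, 10 balanced classes of 100 samples each
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.repeat(np.arange(10), 100)

# Keep only the "test" part of the split, as the notebook does
_, X_sub, _, y_sub = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.4, random_state=1)

# Count how many samples of each class ended up in the 40% subset
counts = np.bincount(y_sub)
print(len(y_sub), counts)
```

Without `stratify`, the per-class counts in the subset would vary with the random draw, which matters more as the subset gets smaller.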

## 2. Train SVM using pixel intensity as an input <a id='2'></a>

Training the SVM is divided into 2 steps:

1. Find the appropriate kernel
2. Tune the appropriate parameters for the best kernel only

### 2.1 Find the appropriate kernel <a id='2.1'></a> 

To find the appropriate kernel, two grid searches are conducted. <br>

* The first grid search runs through the 'linear', 'rbf' and 'sigmoid' kernels. <br>
* The second grid search runs through the 'poly' kernel with degrees 2, 3 and 4.

In [24]:
#Define kernel and general parameter to test
parameters = {'kernel':['linear', 'rbf', 'sigmoid']}
#Initiate SV class
svc = svm.SVC()

# First gridsearch
raw_model_1 = GridSearchCV(svc, parameters,
                    cv = 5,
                    scoring = 'accuracy',
                    n_jobs = -1,
                    verbose = 4,
                    return_train_score = True)

raw_model_1.fit(X_tuning, y_tuning)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  15 | elapsed:  2.1min remaining:   32.0s
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  2.8min finished


GridSearchCV(cv=5, estimator=SVC(), n_jobs=-1,
             param_grid={'kernel': ['linear', 'rbf', 'sigmoid']},
             return_train_score=True, scoring='accuracy', verbose=4)

In [25]:
#Display the result of the first grid search
col = ['param_kernel',
       'mean_test_score','std_test_score']
df_temp_1 = pd.DataFrame(raw_model_1.cv_results_)
df_temp_1 = df_temp_1[col]
print(df_temp_1)

#df_temp_1.to_csv('raw_kernel_1.csv')

  param_kernel  mean_test_score  std_test_score
0       linear         0.745417        0.014246
1          rbf         0.820000        0.007778
2      sigmoid         0.110833        0.000510


In [26]:
#Define kernel and general parameter to test
parameters = {'kernel':['poly'], 
              'degree' :[2,3,4]}
#Initiate SV class
svc = svm.SVC()

# The second gridsearch
raw_model_1_2 = GridSearchCV(svc, parameters,
                    cv = 5,
                    scoring = 'accuracy',
                    n_jobs = -1,
                    verbose = 4,
                    return_train_score = True)

raw_model_1_2.fit(X_tuning, y_tuning)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  15 | elapsed:  2.1min remaining:   31.0s
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  2.7min finished


GridSearchCV(cv=5, estimator=SVC(), n_jobs=-1,
             param_grid={'degree': [2, 3, 4], 'kernel': ['poly']},
             return_train_score=True, scoring='accuracy', verbose=4)

In [27]:
#Display the result of the second grid search
col = ['param_kernel','param_degree',
       'mean_test_score','std_test_score']
df_temp_2 = pd.DataFrame(raw_model_1_2.cv_results_)
df_temp_2 = df_temp_2[col]
print(df_temp_2)

  param_kernel param_degree  mean_test_score  std_test_score
0         poly            2         0.765417        0.016789
1         poly            3         0.778750        0.012476
2         poly            4         0.788542        0.013209


In [28]:
#Display the results of the first and second grid searches together
df_result_1 = pd.concat([df_temp_2,df_temp_1])
print(df_result_1)
df_result_1.to_csv('raw_kernel_result.csv')

  param_kernel param_degree  mean_test_score  std_test_score
0         poly            2         0.765417        0.016789
1         poly            3         0.778750        0.012476
2         poly            4         0.788542        0.013209
0       linear          NaN         0.745417        0.014246
1          rbf          NaN         0.820000        0.007778
2      sigmoid          NaN         0.110833        0.000510


The grid search results show that the 'rbf' kernel works best on the raw pixel intensities (mean CV accuracy 0.820, against 0.789 for the best polynomial kernel).<br>
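
The two kernel searches above could also be written as one search, since `GridSearchCV` accepts a list of parameter dictionaries and explores each one separately. The sketch below is not the notebook's code: it uses the small `load_digits` dataset bundled with scikit-learn as a stand-in so that it runs quickly, and `cv=3` and the 300-sample cut are arbitrary choices for the demonstration.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV

# Stand-in data (8x8 digits), truncated to keep the search fast
X_demo, y_demo = load_digits(return_X_y=True)
X_demo, y_demo = X_demo[:300], y_demo[:300]

# One dictionary per sub-grid: 'degree' is only varied for the poly kernel
param_grid = [
    {'kernel': ['linear', 'rbf', 'sigmoid']},
    {'kernel': ['poly'], 'degree': [2, 3, 4]},
]

search = GridSearchCV(svm.SVC(), param_grid, cv=3, scoring='accuracy')
search.fit(X_demo, y_demo)

# 3 candidates from the first grid plus 3 from the second
n_candidates = len(search.cv_results_['params'])
print(n_candidates, search.best_params_)
```

A single search puts all six candidates in one `cv_results_` table, so the manual `pd.concat` step is no longer needed.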

### 2.2 Tune the parameters (C and/or gamma) for the best kernel only <a id='2.2'></a>

The relevant parameters for the rbf kernel are C and gamma. We search both over a roughly exponential range.

In [30]:
#Define kernel and general parameter to test
parameters = {'kernel':['rbf'],
             'C': [0.1,0.3,1,3,10,30,100,300],
             'gamma':['scale',0.01,0.1,0.3,0.9,1,]}
#Initiate SV class
svc = svm.SVC()

raw_param_2 = GridSearchCV(svc, parameters,
                    cv = 5,
                    scoring = 'accuracy',
                    n_jobs = -1,
                    verbose = 4,
                    return_train_score = True)

raw_param_2.fit(X_tuning, y_tuning)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed: 25.8min
[Parallel(n_jobs=-1)]: Done 213 tasks      | elapsed: 67.4min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 75.2min finished


GridSearchCV(cv=5, estimator=SVC(), n_jobs=-1,
             param_grid={'C': [0.1, 0.3, 1, 3, 10, 30, 100, 300],
                         'gamma': ['scale', 0.01, 0.1, 0.3, 0.9, 1],
                         'kernel': ['rbf']},
             return_train_score=True, scoring='accuracy', verbose=4)

In [32]:
col = ['param_kernel','param_C','param_gamma',
       'mean_test_score','std_test_score']
df_temp_4 = pd.DataFrame(raw_param_2.cv_results_)
df_temp_4 = df_temp_4[col]
print(df_temp_4)

#df_temp_4.to_csv('raw_param_1.csv')

   param_kernel param_C param_gamma  mean_test_score  std_test_score
0           rbf     0.1       scale         0.492708        0.010437
1           rbf     0.1        0.01         0.585000        0.010512
2           rbf     0.1         0.1         0.110833        0.000510
3           rbf     0.1         0.3         0.110833        0.000510
4           rbf     0.1         0.9         0.110833        0.000510
5           rbf     0.1           1         0.110833        0.000510
6           rbf     0.3       scale         0.780625        0.009377
7           rbf     0.3        0.01         0.786250        0.012244
8           rbf     0.3         0.1         0.110833        0.000510
9           rbf     0.3         0.3         0.110833        0.000510
10          rbf     0.3         0.9         0.110833        0.000510
11          rbf     0.3           1         0.110833        0.000510
12          rbf       1       scale         0.820000        0.007778
13          rbf       1        0.0

In [36]:
# Rank the candidates: sort the results by mean CV accuracy
print(df_temp_4.sort_values('mean_test_score', ascending=False).head(15))

   param_kernel param_C param_gamma  mean_test_score  std_test_score
24          rbf      10       scale         0.820625        0.008053
18          rbf       3       scale         0.820625        0.008053
42          rbf     300       scale         0.820625        0.008053
36          rbf     100       scale         0.820625        0.008053
30          rbf      30       scale         0.820625        0.008053
12          rbf       1       scale         0.820000        0.007778
13          rbf       1        0.01         0.816875        0.007581
19          rbf       3        0.01         0.814583        0.009387
43          rbf     300        0.01         0.814375        0.009442
37          rbf     100        0.01         0.814375        0.009442
31          rbf      30        0.01         0.814375        0.009442
25          rbf      10        0.01         0.814375        0.009442
7           rbf     0.3        0.01         0.786250        0.012244
6           rbf     0.3       scal

Looking at the results, the gamma value that performs best is the default 'scale' setting. With 'scale', scikit-learn sets gamma to 1 / (n_features * X.var()), which is around 0.014 here.
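
The 'scale' formula can be checked numerically. The sketch below uses random made-up data (not the MNIST pixels) and compares an SVC fitted with `gamma='scale'` against one fitted with the value computed by hand; the names `X_demo`, `y_demo` and `gamma_scale` are illustrative.

```python
import numpy as np
from sklearn import svm

# Made-up data: 200 samples, 20 features, a simple two-class rule
rng = np.random.default_rng(0)
X_demo = rng.random((200, 20))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)

# gamma='scale' corresponds to 1 / (n_features * X.var())
gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())

clf_scale = svm.SVC(kernel='rbf', gamma='scale').fit(X_demo, y_demo)
clf_manual = svm.SVC(kernel='rbf', gamma=gamma_scale).fit(X_demo, y_demo)

# Both models should produce the same decision values
same = np.allclose(clf_scale.decision_function(X_demo),
                   clf_manual.decision_function(X_demo))
print(gamma_scale, same)
```

Because the value depends on the variance of the training data, it changes whenever the input representation or its scaling changes.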

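The choice of C has little effect once C is at least 1: with gamma set to 'scale', the mean accuracy stays between 0.8200 and 0.8206 for C from 1 to 300, while it drops to 0.78 at C = 0.3 and to 0.49 at C = 0.1.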

<b>Hence the final parameters for the SVM using pixel intensity as an input are</b>
* kernel: rbf
* C: 10
* gamma: scale (0.014)

Train the model using the above parameters and export

In [5]:

svm_1 = svm.SVC(kernel = 'rbf', C =10, gamma ='scale')

t0 = time.time()
svm_1.fit(X_train,y_train)
t1 = time.time()

print('Training Time',t1-t0)



Training Time 150.84240317344666


In [9]:
pkl_filename = "svm_1_raw_pixel.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(svm_1, file)
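
For completeness, a minimal sketch of the reverse step, loading a pickled model back and checking that it predicts the same labels. It uses a tiny made-up dataset and a temporary file named `svm_demo.pkl` (both hypothetical) instead of the real `svm_1_raw_pixel.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import svm

# Tiny made-up dataset, only to have a fitted model to save
rng = np.random.default_rng(0)
X_demo = rng.random((60, 4))
y_demo = (X_demo[:, 0] > 0.5).astype(int)
model = svm.SVC(kernel='rbf', C=10, gamma='scale').fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'svm_demo.pkl')
    # Save the fitted model, then read it back from disk
    with open(path, 'wb') as file:
        pickle.dump(model, file)
    with open(path, 'rb') as file:
        restored = pickle.load(file)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(model.predict(X_demo),
                                  restored.predict(X_demo))
print(same_predictions)
```

Pickled scikit-learn models are generally only expected to load reliably under the same scikit-learn version that wrote them, and pickle files from untrusted sources should not be loaded.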

## 3. Train SVM using HOG descriptor as an input <a id='3'></a>

Training the SVM is divided into 2 steps:

1. Find the appropriate kernel
2. Tune the appropriate parameters for the best kernel only

Before creating a model and searching for optimized parameters, we need to define a helper class and transform our data into HOG features.

In [9]:
# Reshape the flat pixel vectors back into 28x28 images for the HOG transformer
X_train = X_train.reshape(X_train.shape[0],28,28)
X_test = X_test.reshape(X_test.shape[0],28,28)
X_tuning = X_tuning.reshape(X_tuning.shape[0],28,28)

Create a class to transform the images to HOG features. In this process, the parameters of the HOG transformer are fixed.
This set of parameters transforms each 28*28 image into a feature vector of size 5408 (13*13 blocks, each holding 2*2 cells with 8 orientations), which should provide more information about the image.
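
The feature size of 5408 can be reproduced with a few lines of arithmetic. The helper below, `hog_feature_length`, is a hypothetical function written for this check and is not part of scikit-image; it assumes blocks slide over the cell grid one cell at a time, which is how `skimage.feature.hog` lays them out.

```python
def hog_feature_length(image_shape, pixels_per_cell, cells_per_block, orientations):
    """Return the length of a flattened HOG descriptor (illustrative helper)."""
    # Number of cells along each image axis
    cells = [size // ppc for size, ppc in zip(image_shape, pixels_per_cell)]
    # Blocks slide one cell at a time, so there are (cells - block + 1) positions
    blocks = [c - b + 1 for c, b in zip(cells, cells_per_block)]
    # Each block concatenates one orientation histogram per cell
    per_block = cells_per_block[0] * cells_per_block[1] * orientations
    return blocks[0] * blocks[1] * per_block


n_features = hog_feature_length(
    image_shape=(28, 28),
    pixels_per_cell=(2, 2),
    cells_per_block=(2, 2),
    orientations=8,
)
print(n_features)  # 14x14 cells -> 13x13 blocks -> 169 * 32 = 5408
```

With 5408 features against 784 raw pixels, each SVM fit works on roughly seven times more columns, which is consistent with the much longer grid search times reported in this section.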

In [7]:
# Create a class to transform image to HOG
# Credit: https://kapernikov.com/tutorial-image-classification-with-scikit-learn/
class HogTransformer(BaseEstimator, TransformerMixin):
    """
    Expects an array of 2d arrays (1 channel images)
    Calculates hog features for each img
    """

    def __init__(self, y=None, orientations=8,
                 pixels_per_cell=(2, 2),
                 cells_per_block=(2, 2)):
        self.y = y
        self.orientations = orientations
        self.pixels_per_cell = pixels_per_cell
        self.cells_per_block = cells_per_block

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):

        def local_hog(X):
            return hog(X,
                       orientations=self.orientations,
                       pixels_per_cell=self.pixels_per_cell,
                       cells_per_block=self.cells_per_block)

        # Compute the HOG descriptor of every image and stack the results
        return np.array([local_hog(img) for img in X])

### 3.1 Find the appropriate kernel <a id='3.1'></a> 

To find the appropriate kernel, two grid searches are conducted. <br>

* The first grid search runs through the 'linear', 'rbf' and 'sigmoid' kernels. <br>
* The second grid search runs through the 'poly' kernel with degrees 2, 3 and 4.

In this section, instead of a bare model, we use a pipeline to convert the images into HOG features, standardize them and fit the model in one step.

In [8]:
# Initiate a pipeline to turn the image to HOG, standardize and train
HOG_pipeline = Pipeline(
    [
    ('hogify', HogTransformer(
        pixels_per_cell=(2, 2),
        cells_per_block=(2, 2),
        orientations=8)
     ),
    ('scalify', StandardScaler()),
    ('classify',  svm.SVC()) ])
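
The grid searches that follow address the classifier's parameters through the pipeline using scikit-learn's `<step name>__<parameter>` convention. A minimal sketch of that naming, using a two-step pipeline without the HOG step (the name `demo_pipeline` is illustrative):

```python
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ('scalify', StandardScaler()),
    ('classify', svm.SVC()),
])

# '<step name>__<parameter>' sets a parameter on one step of the pipeline
demo_pipeline.set_params(classify__kernel='poly', classify__degree=2)
classifier = demo_pipeline.named_steps['classify']
print(classifier.kernel, classifier.degree)

# The same names are valid keys for a GridSearchCV parameter grid
available = demo_pipeline.get_params()
print('classify__C' in available)
```

Putting the scaler inside the pipeline also means it is refitted on the training part of each cross-validation fold, so the validation fold does not influence the scaling statistics.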

In [79]:
#Define kernel and general parameter to test
parameters = {'classify__kernel':['linear', 'rbf', 'sigmoid']}

# First gridsearch
hog_model_1 = GridSearchCV(HOG_pipeline, parameters,
                    cv = 5,
                    scoring = 'accuracy',
                    n_jobs = -1,
                    verbose = 4,
                    return_train_score = True)

hog_model_1.fit(X_tuning, y_tuning)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  15 | elapsed: 20.6min remaining:  5.1min
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed: 24.1min finished


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('hogify', HogTransformer()),
                                       ('scalify', StandardScaler()),
                                       ('classify', SVC())]),
             n_jobs=-1,
             param_grid={'classify__kernel': ['linear', 'rbf', 'sigmoid']},
             return_train_score=True, scoring='accuracy', verbose=4)

In [82]:
#Display result of the first gridsearch
col = ['param_classify__kernel',
       'mean_test_score','std_test_score']
df_temp_6 = pd.DataFrame(hog_model_1.cv_results_)
df_temp_6 = df_temp_6[col]
print(df_temp_6)

#df_temp_4.to_csv('raw_param_1.csv')

  param_classify__kernel  mean_test_score  std_test_score
0                 linear         0.719583        0.007666
1                    rbf         0.776667        0.009354
2                sigmoid         0.746042        0.008995


In [83]:
#Define kernel and general parameter to test
parameters = {'classify__kernel':['poly'], 
              'classify__degree' :[2,3,4]}


#Second gridsearch
hog_model_1_2 = GridSearchCV(HOG_pipeline, parameters,
                    cv = 5,
                    scoring = 'accuracy',
                    n_jobs = -1,
                    verbose = 4,
                    return_train_score = True)

hog_model_1_2.fit(X_tuning, y_tuning)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  12 out of  15 | elapsed: 27.9min remaining:  7.0min
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed: 34.7min finished


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('hogify', HogTransformer()),
                                       ('scalify', StandardScaler()),
                                       ('classify', SVC())]),
             n_jobs=-1,
             param_grid={'classify__degree': [2, 3, 4],
                         'classify__kernel': ['poly']},
             return_train_score=True, scoring='accuracy', verbose=4)

In [87]:
#Display the result of the second grid search
col = ['param_classify__kernel','param_classify__degree',
       'mean_test_score','std_test_score']
df_temp_7 = pd.DataFrame(hog_model_1_2.cv_results_)
df_temp_7 = df_temp_7[col]
print(df_temp_7)

  param_classify__kernel param_classify__degree  mean_test_score  std_test_score
0                   poly                      2         0.580833        0.011460
1                   poly                      3         0.114792        0.001215
2                   poly                      4         0.110833        0.000510


In [88]:
#Concat and display 
df_result_2 = pd.concat([df_temp_7,df_temp_6])
print(df_result_2)
df_result_2.to_csv('hog_kernel_result.csv')

  param_classify__kernel param_classify__degree  mean_test_score  std_test_score
0                   poly                      2         0.580833        0.011460
1                   poly                      3         0.114792        0.001215
2                   poly                      4         0.110833        0.000510
0                 linear                    NaN         0.719583        0.007666
1                    rbf                    NaN         0.776667        0.009354
2                sigmoid                    NaN         0.746042        0.008995


The grid search results show that the 'rbf' kernel works best on the HOG features (mean CV accuracy 0.777, against 0.746 for the sigmoid kernel).<br>


### 3.2 Tune the parameters (C and/or gamma) for the best kernel only <a id='3.2'></a>

The relevant parameters for the rbf kernel are C and gamma. We search both over a roughly exponential range.

In [None]:
#Define kernel and general parameter to test
parameters = {'classify__kernel':['rbf'],
             'classify__C': [0.1,0.3,1,3,10,30,100,300],
             'classify__gamma':['scale',0.01,0.1,0.3,0.9,1,]}



hog_param_1 = GridSearchCV(HOG_pipeline, parameters,
                    cv = 5,
                    scoring = 'accuracy',
                    n_jobs = -1,
                    verbose = 4,
                    return_train_score = True)

hog_param_1.fit(X_tuning, y_tuning)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed: 38.2min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed: 204.6min
[Parallel(n_jobs=-1)]: Done 213 tasks      | elapsed: 491.5min


In [93]:
col = ['param_classify__kernel','param_classify__C','param_classify__gamma',
       'mean_test_score','std_test_score']
df_temp_8 = pd.DataFrame(hog_param_1.cv_results_)
df_temp_8 = df_temp_8[col]
print(df_temp_8)

#df_temp_4.to_csv('raw_param_1.csv')

   param_classify__kernel param_classify__C param_classify__gamma  mean_test_score  std_test_score
0                     rbf               0.1                 scale         0.110833        0.000510
1                     rbf               0.1                  0.01         0.110833        0.000510
2                     rbf               0.1                   0.1         0.110833        0.000510
3                     rbf               0.1                   0.3         0.110833        0.000510
4                     rbf               0.1                   0.9         0.110833        0.000510
5                     rbf               0.1                     1         0.110833        0.000510
6                     rbf               0.3                 scale         0.588125        0.005129
7                     rbf               0.3                  0.01         0.110833        0.000510
8                     rbf               0.3                   0.1         0.110833        0.000510
9         

In [94]:
# Rank the candidates: sort the results by mean CV accuracy
print(df_temp_8.sort_values('mean_test_score', ascending=False).head(15))

   param_classify__kernel param_classify__C param_classify__gamma  mean_test_score  std_test_score
12                    rbf                 1                 scale         0.776667        0.009354
42                    rbf               300                 scale         0.775625        0.008165
30                    rbf                30                 scale         0.775625        0.008165
18                    rbf                 3                 scale         0.775625        0.008165
36                    rbf               100                 scale         0.775625        0.008165
24                    rbf                10                 scale         0.775625        0.008165
6                     rbf               0.3                 scale         0.588125        0.005129
34                    rbf                30                   0.9         0.110833        0.000510
28                    rbf                10                   0.9         0.110833        0.000510
29        

Looking at the results, the gamma value that performs best is again the default 'scale' setting. With 'scale', scikit-learn sets gamma to 1 / (n_features * X.var()), which is around 0.000018.

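The choice of C has little effect once C is at least 1: with gamma set to 'scale', the mean accuracy is 0.7767 at C = 1 and 0.7756 for C from 3 to 300, while it drops to 0.59 at C = 0.3 and to 0.11 at C = 0.1.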

<b>Hence the final parameters for the SVM using the HOG descriptor as an input are</b>
* kernel: rbf
* C: 1
* gamma: scale (0.000018)

Train the model using the above parameters and export

In [11]:
HOG_final = Pipeline(
    [
    ('hogify', HogTransformer(
        pixels_per_cell=(2, 2),
        cells_per_block=(2, 2),
        orientations=8)
     ),
    ('scalify', StandardScaler()),
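    # Final parameters written out explicitly (they match the SVC defaults)
    ('classify',  svm.SVC(kernel='rbf', C=1, gamma='scale')) ])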

t0 = time.time()
HOG_final.fit(X_train,y_train)
t1 = time.time()

print(t1-t0)

1895.2547478675842


In [12]:
pkl_filename = "svm_1_hog.pkl"
with open(pkl_filename, 'wb') as file:
    pickle.dump(HOG_final, file)