# Support Vector Machine - Image Classification


In this notebook we train SVM classifiers to classify images.

For comparison, we also evaluate a Logistic Regression (Softmax) classifier.

We use a dimensionality reduction technique, Principal Component Analysis (PCA), to project the features onto a lower-dimensional space and shorten the training time.


In raw pixel space, **image classes are generally not linearly separable**. Based on this we formulate the following hypotheses:
- The kernelized SVM models will perform significantly better than the linear SVM model.
- The RBF kernel based SVM will perform better than the Softmax regression classifier.
- Dimensionality reduction (retaining maximum variance) should improve the performance.

We will investigate these hypotheses by conducting the following experiments.


## Experiments

- Experiment 1: Support Vector Machine (LinearSVC) + PCA
- Experiment 2: Support Vector Machine (SVC with RBF Kernel) + PCA
- Experiment 3: Support Vector Machine (SVC with RBF Kernel) 
- Experiment 4: Logistic Regression (Softmax Regression) + PCA

## Dataset: MNIST


We will use the MNIST dataset, a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents.


Each image is 28x28 pixels, and each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black).

Thus, each image has 28 x 28 = 784 features.
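
As a small illustration of this layout, the sketch below reshapes a 784-element feature vector back into a 28x28 grid. The vector is random stand-in data (an assumption made so that the example runs without downloading MNIST), and the names `flat_image`, `image_2d` and `restored` are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for one MNIST row: 784 pixel intensities in [0, 255].
rng = np.random.default_rng(0)
flat_image = rng.integers(0, 256, size=784).astype(np.float64)

# Restore the 2-D layout: 28 rows of 28 pixels each.
image_2d = flat_image.reshape(28, 28)

# Flattening the grid row by row recovers the original feature vector.
restored = image_2d.reshape(-1)

print(image_2d.shape)
print(restored.shape)
```

With a real MNIST row in place of the random vector, `plt.imshow(image_2d, cmap="binary")` would display the digit.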

In [1]:
from time import time
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt


from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Load Data and Create Data Matrix (X) and the Label Vector (y)

First load the data and explore the feature names, target names, etc.

We may load the data from a local folder or fetch it directly from OpenML using Scikit-Learn.

In [2]:
# Load the data from the local folder "data"
# mnist = loadmat('data/mnist-original.mat')

# Create the data Matrix X and the target vector y
# X = mnist["data"].T.astype('float64')
# y = mnist["label"][0].astype('int64')


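# Load data using Scikit-Learn.
# as_frame=False returns NumPy arrays rather than a pandas DataFrame,
# so that X.shape and X.dtype can be used as below.
mnist = fetch_openml('mnist_784', cache=False, as_frame=False)

X = mnist["data"]
y = mnist["target"].astype('int64')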


print("\nNo. of Samples: ", X.shape)
print("No. of Labels: ", y.shape)


print("\nX Type: ", X.dtype)
print("y Type", y.dtype)


No. of Samples:  (70000, 784)
No. of Labels:  (70000,)

X Type:  float64
y Type int64


## Split Data Into Training and Test Sets

We use sklearn's train_test_split function to split the dataset into **random** training and test subsets. The data is shuffled by default before splitting.

For the **reproducibility of the results**, we need to use the same seed for the random number generator. The seed is set by the "random_state" parameter of the split function.

However, if we want different train and test subsets in repeated experiments, we drop the "random_state" parameter from the function call.
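
The effect of the seed can be checked with a minimal sketch on toy data. The arrays `X_demo` and `y_demo` are hypothetical and only serve to show that two calls with the same "random_state" return the same test subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed.
_, X_test_a, _, y_test_a = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
_, X_test_b, _, y_test_b = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# The same seed yields the same shuffled split.
same_split = np.array_equal(y_test_a, y_test_b)
print(same_split)
```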


In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Optimization Using Dimensionality Reduction

We can reduce the training time of the classifiers by reducing the number of features. Our assumption is that the essential content of the data does not span all dimensions. The technique for reducing the dimension of data is known as dimensionality reduction.

For a gentle introduction to various dimensionality reduction techniques, see the notebook "Dimensionality Reduction" in the GitHub repository.

We will use the Principal Component Analysis (PCA) dimensionality reduction technique to project the MNIST dataset (784 features) onto a lower-dimensional space while retaining maximum variance.

The goal is to see the improvement in training time due to this dimensionality reduction.

Before we apply the PCA, we need to standardize the data.

## Standardize the Data

PCA is influenced by the scale of the data. Thus we need to scale the features before applying PCA.

For understanding the negative effect of not scaling the data, see the following post:

https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html#sphx-glr-auto-examples-preprocessing-plot-scaling-importance-py

Note that we fit the scaler on the training set and transform on the training and test set. 

In [4]:
scaler = StandardScaler()

# Fit on training set only.
scaler.fit(X_train)

# Apply transform to both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
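
A minimal sketch of what this fit/transform split implies, on synthetic data (the arrays and the `_demo` names are hypothetical): the training features end up with mean 0 and standard deviation 1 exactly, while the test features are only approximately standardized, because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data drawn from the same distribution.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 4))
X_test_demo = rng.normal(loc=10.0, scale=3.0, size=(50, 4))

# Learn the mean and standard deviation from the training data only.
scaler_demo = StandardScaler().fit(X_train_demo)
X_train_scaled = scaler_demo.transform(X_train_demo)
X_test_scaled = scaler_demo.transform(X_test_demo)

# Training features are standardized exactly (up to floating point error).
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))
print(np.allclose(X_train_scaled.std(axis=0), 1.0))

# Test features are close to, but not exactly, zero mean.
print(X_test_scaled.mean(axis=0))
```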

## Apply PCA

When applying PCA we can set the number of principal components with the "n_components" parameter. More importantly, we can use this parameter to specify the fraction of variance we want to retain in the extracted features.

For example, if we set it to 0.95, sklearn will choose the **minimum number of principal components** such that 95% of the variance is retained.
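
The selection rule can be sketched by hand from the cumulative explained variance ratio. The example below uses synthetic data with decaying feature variances (an assumption chosen to keep it fast). The names `X_demo`, `pca_full`, `pca_95` and `k_manual` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical synthetic data: 500 samples, 20 features with decaying variance.
rng = np.random.default_rng(0)
scales = np.linspace(5.0, 0.1, 20)
X_demo = rng.normal(size=(500, 20)) * scales

# Fit with all components to inspect the full variance spectrum.
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.95.
k_manual = int(np.argmax(cumulative > 0.95) + 1)

# Let PCA pick the number of components from the variance threshold.
pca_95 = PCA(n_components=0.95).fit(X_demo)

print(k_manual, pca_95.n_components_)
```

Both numbers should agree, which is a quick way to sanity-check the threshold before running PCA on the full dataset.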

In [5]:
%%time
pca = PCA(n_components=0.95)

pca.fit(X_train)

Wall time: 6.66 s


PCA(copy=True, iterated_power='auto', n_components=0.95, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

## Number of Principal Components

We can find how many components PCA chose after fitting the model by using the following attribute: n_components_

We will see that retaining 95% of the variance requires **327 principal components**.

In [6]:
print("Number of Principal Components: ", pca.n_components_)

Number of Principal Components:  327


## Apply the Mapping (Transform) to both the Training Set and the Test Set

In [7]:
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

## Experiments

We will conduct the following experiments.

- Experiment 1: Support Vector Machine (LinearSVC) + PCA
- Experiment 2: Support Vector Machine (SVC with RBF Kernel) + PCA
- Experiment 3: Support Vector Machine (SVC with RBF Kernel) 
- Experiment 4: Logistic Regression (Softmax Regression) + PCA

## Support Vector Machine: Model Selection via Hyperparameter Tuning

Note that we do not perform a grid search in this notebook.

We simply reuse the best values of the two hyperparameters ($\gamma$ and $C$) for the SVC found in a prior grid search. In general, a grid search should be performed to fine-tune the hyperparameters for the data at hand.
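
For reference, a grid search over $C$ and $\gamma$ might look like the sketch below. It is not the search that produced the values used in this notebook. It runs on sklearn's small bundled digits dataset (8x8 images) as a stand-in for MNIST to keep it fast, and the grid values are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small bundled dataset used as a stand-in for MNIST (assumption).
X_small, y_small = load_digits(return_X_y=True)

# Scaling lives inside the pipeline, so it is refit on each training fold.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Hypothetical grid; make_pipeline names the SVC step "svc".
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.001, 0.01, 0.1],
}

# 3-fold cross-validated search over all 9 combinations.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_small, y_small)

print(search.best_params_)
print(search.best_score_)
```

On the full MNIST training set such a search is expensive, since every combination retrains an RBF SVM several times. Searching on a random subset first is a common compromise.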

## Experiment 1: LinearSVC + PCA

In [8]:
t0 = time()

linear_svc_pca = LinearSVC(loss='hinge', C=1, random_state=42)
linear_svc_pca.fit(X_train_pca, y_train)

training_LinearSVC = time() - t0

print("Training Complete in %0.3fs" % training_LinearSVC)

Training Complete in 182.533s




## Experiment 1: Evaluate LinearSVC + PCA on Test Data

In [9]:
%%time

y_test_predicted = linear_svc_pca.predict(X_test_pca)

accuracy_score_test_LinearSVC = np.mean(y_test_predicted == y_test)
print("\nTest Accuracy: ", accuracy_score_test_LinearSVC)

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))

print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))


Test Accuracy:  0.9075714285714286

Test Confusion Matrix:
[[1353    1    4    1    1    8   15    0    4    0]
 [   0 1538    9    6    3    8    2    3   11    0]
 [  17   16 1290   11   20    2   23   19   41    4]
 [   6    8   62 1233    3   48    5   15   33   22]
 [   5    3   11    1 1264    5    5    8    6   42]
 [  11    8   11   28   16 1079   29    7   33    9]
 [  13    8   15    7   12   25 1297    2    5    3]
 [   8    9   32    8   22    7    3 1328    7   34]
 [  10   40   15   27   12   52    8    5 1169   30]
 [   8    6   16   26   65   16    1   54   14 1155]]

Classification Report:
              precision    recall  f1-score   support

           0       0.95      0.98      0.96      1387
           1       0.94      0.97      0.96      1580
           2       0.88      0.89      0.89      1443
           3       0.91      0.86      0.89      1435
           4       0.89      0.94      0.91      1350
           5       0.86      0.88      0.87      1231
      

## Experiment 2: SVC (RBF Kernel) + PCA

In [10]:
t0 = time()

svm_clf_pca = SVC(C=1, gamma=0.001)
svm_clf_pca.fit(X_train_pca, y_train)

training_SVC_rbf_pca = time() - t0

print("Training Complete in %0.3fs" % training_SVC_rbf_pca)

Training Complete in 180.884s


## Experiment 2: Evaluate SVC (RBF Kernel) + PCA on Test Data

In [11]:
%%time

y_test_predicted = svm_clf_pca.predict(X_test_pca)

accuracy_score_test_svc_rbf_pca = np.mean(y_test_predicted == y_test)
print("\nTest Accuracy: ", accuracy_score_test_svc_rbf_pca)

print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))

print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))


Test Accuracy:  0.9612857142857143

Test Confusion Matrix:
[[1364    0    4    0    1    2    8    3    5    0]
 [   0 1564    5    0    2    0    0    6    2    1]
 [   6    4 1389    7    6    1    2   14   11    3]
 [   1    7   21 1350    2   18    0   13   20    3]
 [   2    2   11    0 1305    1    2    5    1   21]
 [   3    4    8   12    2 1172   13    8    8    1]
 [   6    1    5    0    6    9 1344   13    3    0]
 [   2    7   16    1    9    0    0 1402    1   20]
 [   1   15    6    8    5   14    5    5 1296   13]
 [   1    0    6   17   28    5    0   26    6 1272]]

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1387
           1       0.98      0.99      0.98      1580
           2       0.94      0.96      0.95      1443
           3       0.97      0.94      0.95      1435
           4       0.96      0.97      0.96      1350
           5       0.96      0.95      0.96      1231
      

## Experiment 3: SVC (RBF Kernel) 

We experiment with the SVC (RBF Kernel) without applying dimensionality reduction to the data.

In [12]:
t0 = time()

svm_clf = SVC(gamma=0.001)
svm_clf.fit(X_train, y_train)

training_SVC_rbf = time() - t0

print("Training Complete in %0.3fs" % training_SVC_rbf)

Training Complete in 391.316s


## Experiment 3: Evaluate SVC (RBF Kernel) on Test Data

In [13]:
%%time

y_test_predicted = svm_clf.predict(X_test)

accuracy_score_test_svc_rbf = np.mean(y_test_predicted == y_test)
print("\nAccuracy: ", accuracy_score_test_svc_rbf)


print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))


print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))


Accuracy:  0.9604285714285714

Test Confusion Matrix:
[[1364    0    4    0    1    2    8    3    5    0]
 [   0 1564    5    0    2    0    0    5    3    1]
 [   4    4 1389    8    7    1    2   14   11    3]
 [   1    7   19 1350    1   19    0   15   20    3]
 [   2    2   10    0 1303    1    2    5    1   24]
 [   3    4    8   13    2 1170   13    9    8    1]
 [   6    1    4    0    7    9 1343   14    3    0]
 [   2    6   16    1    9    0    0 1401    2   21]
 [   1   16    6    8    4   14    4    5 1297   13]
 [   2    0    7   17   29    6    1   28    6 1265]]

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      1387
           1       0.98      0.99      0.98      1580
           2       0.95      0.96      0.95      1443
           3       0.97      0.94      0.95      1435
           4       0.95      0.97      0.96      1350
           5       0.96      0.95      0.95      1231
           

## Experiment 4: Logistic Regression (Softmax Regression) + PCA

We use the best performing solver (i.e., lbfgs) from the previous notebook to train the logistic regression model on the PCA-transformed data.

In [14]:
t0 = time()

softmax_reg_pca = LogisticRegression(solver='lbfgs', multi_class='multinomial')

softmax_reg_pca.fit(X_train_pca, y_train)

training_logistic = time() - t0

print("Training Complete in %0.3fs" % training_logistic)

Training Complete in 7.420s




## Experiment 4: Evaluate Softmax Regression + PCA on Test Data

In [15]:
print("No. of Iterations:", softmax_reg_pca.n_iter_ )


y_test_predicted = softmax_reg_pca.predict(X_test_pca)
#print(y_test_predict)

accuracy_score_test_logistic = np.mean(y_test_predicted == y_test)
print("\nTest Accuracy: ", accuracy_score_test_logistic)


print("\nTest Confusion Matrix:")
print(confusion_matrix(y_test, y_test_predicted))


print("\nClassification Report:")
print(classification_report(y_test, y_test_predicted))

No. of Iterations: [100]

Test Accuracy:  0.9187142857142857

Test Confusion Matrix:
[[1347    0    4    1    2   10   16    0    6    1]
 [   0 1537   10    2    2    6    2    3   16    2]
 [   7   23 1291   23   19    6   16   17   36    5]
 [   2    8   51 1264    2   45    2   15   32   14]
 [   1    4    9    2 1261    2   18   11    5   37]
 [  10    6   16   40    7 1085   19    5   35    8]
 [  14    7   13    1   13   16 1315    0    8    0]
 [   3    6   22    6   16    5    0 1341    6   53]
 [   5   36   11   28   10   37   10    3 1209   19]
 [   5    3    6   15   49   11    1   48   11 1212]]

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.97      0.97      1387
           1       0.94      0.97      0.96      1580
           2       0.90      0.89      0.90      1443
           3       0.91      0.88      0.90      1435
           4       0.91      0.93      0.92      1350
           5       0.89      0.88  
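
The reported iteration count of [100] equals the default `max_iter` of LogisticRegression, which suggests that lbfgs may have stopped at the iteration cap rather than at convergence. The sketch below shows one way to check for this. It uses the small bundled digits dataset instead of MNIST and a deliberately low, hypothetical cap of 20 iterations, both assumptions made to keep the example fast.

```python
import warnings

import numpy as np
from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Small bundled dataset used as a stand-in for MNIST (assumption).
X_small, y_small = load_digits(return_X_y=True)
X_small = StandardScaler().fit_transform(X_small)

# Deliberately low cap so that the solver is likely to reach it.
max_iter = 20
clf = LogisticRegression(solver="lbfgs", max_iter=max_iter)

# Silence the warning sklearn raises when the cap is reached.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    clf.fit(X_small, y_small)

# n_iter_ equal to max_iter indicates the solver stopped at the cap.
hit_cap = bool(np.all(clf.n_iter_ >= max_iter))
print(clf.n_iter_, hit_cap)
```

If the cap is reached, raising `max_iter` lets the solver run longer. Whether that changes the test accuracy reported above would have to be measured.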

# Summary of Results from 4 Experiments

In [16]:
data = [["LinearSVC + PCA", accuracy_score_test_LinearSVC, training_LinearSVC], 
        ["SVM(RBF) + PCA", accuracy_score_test_svc_rbf_pca, training_SVC_rbf_pca],
        ["SVM(RBF)", accuracy_score_test_svc_rbf, training_SVC_rbf],
        ["Softmax + PCA", accuracy_score_test_logistic, training_logistic]]

pd.DataFrame(data, columns=["Classifier", "Accuracy", "Running-Time"])


| | Classifier | Accuracy | Running-Time |
|---|---|---|---|
| 0 | LinearSVC + PCA | 0.907571 | 182.532686 |
| 1 | SVM(RBF) + PCA | 0.961286 | 180.884422 |
| 2 | SVM(RBF) | 0.960429 | 391.315884 |
| 3 | Softmax + PCA | 0.918714 | 7.419869 |


## Comparative Understanding

We have conducted 4 experiments using SVM and Logistic Regression classifiers.

The first 3 experiments use 2 SVM algorithms (LinearSVC and SVC with RBF kernel) and examine the effect of PCA.

The experimental results support our hypotheses to varying degrees:
- The kernelized SVM models perform significantly better than the linear SVM model (about 96% vs. 90.8% test accuracy).
- The RBF kernel based SVM performs better than the Softmax regression classifier (about 96% vs. 91.9%).
- Dimensionality reduction (retaining 95% of the variance) leaves the accuracy of the RBF kernel based SVM essentially unchanged (96.13% vs. 96.04%) but roughly halves its training time (181s vs. 391s).

We make the following observations.
- The RBF kernel based SVM classifiers perform **significantly** better than the Softmax classifier, whereas the LinearSVC classifier performs slightly worse than it (90.8% vs. 91.9%).
- The SVM classifiers take much **longer** to train than the Softmax classifier (about 181s to 391s vs. 7.4s).
- The RBF kernel based SVM classifier performs better than the linear SVM classifier. This indicates that for this non-linear image classification problem the kernelized SVM is the most suitable of the algorithms compared.
- Dimensionality reduction improves the accuracy of the RBF kernel based SVM only slightly, while reducing its training time considerably.
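
The size of the PCA effect on the RBF kernel based SVM can be worked out directly from the summary table. The sketch below copies the four rows from that table. The dictionary and variable names are hypothetical.

```python
# (accuracy, training time in seconds), copied from the summary table.
results = {
    "LinearSVC + PCA": (0.907571, 182.532686),
    "SVM(RBF) + PCA": (0.961286, 180.884422),
    "SVM(RBF)": (0.960429, 391.315884),
    "Softmax + PCA": (0.918714, 7.419869),
}

acc_pca, time_pca = results["SVM(RBF) + PCA"]
acc_raw, time_raw = results["SVM(RBF)"]

# Absolute accuracy difference and training-time ratio due to PCA.
accuracy_gain = acc_pca - acc_raw
speedup = time_raw / time_pca

print(f"Accuracy gain from PCA: {accuracy_gain * 100:.2f} percentage points")
print(f"Training speedup from PCA: {speedup:.2f}x")
```

The accuracy gain is below a tenth of a percentage point, while training is a little more than twice as fast, so on these results the main benefit of PCA is the shorter training time.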

### Thus, for image classification problems like this one, an RBF kernel based SVM model should be used with dimensionality reduction.