# Clothes Classification with Support Vector Machines

In this notebook we are going to explore the use of Support Vector Machines (SVM) for image classification. We are going to use a new version of the famous MNIST dataset (the original is a dataset of handwritten digits). The version we are going to use is called Fashion MNIST (https://pravarmahajan.github.io/fashion/) and is a dataset of small images of clothes and accessories.



The dataset labels are the following:

| Label | Description |
| --- | --- |
| 0 | T-shirt/top |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat |
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle boot |

## TODO: Insert your surname, name and ID number

Student name: Pujatti Mattia <br>
ID: 1232236

In [None]:
#load the required packages

%matplotlib inline  

import numpy as np
import scipy as sp
import pandas as pd
import matplotlib.pyplot as plt

import sklearn
from sklearn.neural_network import MLPClassifier
import sklearn.metrics as skm

In [None]:
# helper function to load Fashion MNIST dataset
def load_mnist(path, kind='train'):
    import os
    import gzip
    import numpy as np
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte.gz' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte.gz' % kind)
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,offset=8)
    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,offset=16).reshape(len(labels), 784)
    return images, labels
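
The `offset=8` and `offset=16` arguments in the helper above skip the header of the IDX files: a 4-byte magic number followed by one 32-bit big-endian integer per dimension (one dimension for the labels, three for the images). The following is a minimal sketch of reading that header; `read_idx_header` and the demo file are hypothetical names introduced only for illustration, and the file is a tiny fake one rather than the real dataset.

```python
import gzip
import os
import struct
import tempfile

def read_idx_header(path):
    """Return (data type code, dimensions, header size in bytes) of a gzipped IDX file."""
    with gzip.open(path, 'rb') as f:
        # magic number: two zero bytes, one data type byte, one byte with the number of dimensions
        zero, dtype_code, ndim = struct.unpack('>HBB', f.read(4))
        # one big-endian 32-bit integer per dimension
        dims = struct.unpack('>' + 'I' * ndim, f.read(4 * ndim))
    return dtype_code, dims, 4 + 4 * ndim

# build a tiny fake image file (2 blank 28x28 images) just to exercise the parser
demo_path = os.path.join(tempfile.mkdtemp(), 'demo-images-idx3-ubyte.gz')
with gzip.open(demo_path, 'wb') as f:
    f.write(struct.pack('>HBBIII', 0, 0x08, 3, 2, 28, 28))
    f.write(bytes(2 * 28 * 28))

dtype_code, dims, header_size = read_idx_header(demo_path)
print(dtype_code, dims, header_size)
```

For an image file the header is 4 + 3·4 = 16 bytes, for a label file 4 + 1·4 = 8 bytes, which matches the two offsets used in `load_mnist`.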

In [None]:
#fix your ID ("numero di matricola") and the seed for random generator (as usual you can try different seeds)
ID = 1232236
np.random.seed(ID)

In [None]:
#load the Fashion MNIST dataset from the 'data' folder and let's normalize the features so that each value is in [0,1] 

X, y = load_mnist('data', kind='train')
# rescale the data
X, y = X / 255., y # original pixel values are between 0 and 255
print(X.shape, y.shape)

Now split into training and test. Make sure that each label is present at least 10 times
in training. If it is not, then keep adding permutations to the initial data until this 
happens.

In [None]:
#randomly permute the data and split into training and test, taking the first 500
#data samples as training and the rest as test
permutation = np.random.permutation(X.shape[0])

X = X[permutation]
y = y[permutation]

m_training = 500

X_train, X_test = X[:m_training], X[m_training:]
y_train, y_test = y[:m_training], y[m_training:]

labels, freqs = np.unique(y_train, return_counts=True)
print("Labels in training dataset: ", labels)
print("Frequencies in training dataset: ", freqs)
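
The cell above only prints the label frequencies; the "keep permuting until each label appears at least 10 times" check can be automated. This is a minimal sketch under stated assumptions: `permute_until_balanced` is a hypothetical helper, and random toy labels stand in for the Fashion MNIST targets.

```python
import numpy as np

def permute_until_balanced(y, m_training, min_count=10, rng=None, max_tries=1000):
    """Return a permutation whose first m_training labels contain every class at least min_count times."""
    rng = np.random.default_rng() if rng is None else rng
    classes = np.unique(y)
    for _ in range(max_tries):
        perm = rng.permutation(len(y))
        labels, freqs = np.unique(y[perm[:m_training]], return_counts=True)
        # accept only if no class is missing and the rarest class is frequent enough
        if len(labels) == len(classes) and freqs.min() >= min_count:
            return perm
    raise RuntimeError("no suitable permutation found")

# toy labels (10 classes) standing in for the real targets
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 10, size=5000)
perm = permute_until_balanced(y_demo, m_training=500, rng=rng)
labels, freqs = np.unique(y_demo[perm[:500]], return_counts=True)
print("Labels in training subset: ", labels)
print("Frequencies in training subset: ", freqs)
```

In the notebook the same permutation would then be applied to both `X` and `y` before splitting.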

In [None]:
#function for plotting an image and printing the corresponding label
def plot_input(X_matrix, labels, index):
    print("INPUT:")
    plt.imshow(
        X_matrix[index].reshape(28,28),
        cmap          = plt.cm.gray_r,
        interpolation = "nearest"
    )
    plt.show()
    print("LABEL: %i"%labels[index])
    return

In [None]:
#let's try the plotting function
plot_input(X_train,y_train,5)
plot_input(X_test,y_test,50)
plot_input(X_test,y_test,500)
plot_input(X_test,y_test,5000)

## TO DO 1
Use an SVM classifier with 4-fold cross-validation to pick a model. Let's start with a linear kernel:

In [None]:
#import SVC
from sklearn.svm import SVC
#import for Cross-Validation
from sklearn.model_selection import GridSearchCV

# parameters for linear SVM
parameters = {'C': [0.0005, 0.005, 0.05, 0.5, 5, 50, 500]}

#run linear SVM
svc_lin = SVC(kernel='linear')    
GridS = GridSearchCV(estimator=svc_lin,param_grid=parameters,cv=4)
GridS.fit(X_train,y_train)

print ('RESULTS FOR LINEAR KERNEL','\n')

print("Best parameters set found:")
print(GridS.best_params_,'\n')

print("Best model:")
print(GridS.best_estimator_,'\n')

print("Score with best parameters:")
print(GridS.best_score_,'\n')

#print("All scores on the grid:")
#print(pd.DataFrame(GridS.cv_results_))
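
The commented-out lines above print the whole `cv_results_` dictionary, which is hard to read. A possible way to inspect the grid is to keep only a few columns and sort by rank. The sketch below assumes a small synthetic dataset from `make_classification` in place of Fashion MNIST, and a shorter grid for `C`; only the `GridSearchCV` usage mirrors the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic 3-class data, used here only to keep the example fast and self-contained
X_demo, y_demo = make_classification(n_samples=200, n_features=20, n_informative=5,
                                     n_classes=3, random_state=0)

grid = GridSearchCV(estimator=SVC(kernel='linear'), param_grid={'C': [0.01, 0.1, 1]}, cv=4)
grid.fit(X_demo, y_demo)

# one row per parameter combination: mean and spread of the 4 validation scores
columns = ['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(grid.cv_results_)[columns]
print(results.sort_values('rank_test_score'))
```

The row with `rank_test_score` equal to 1 corresponds to `best_params_`, and its `mean_test_score` is the value reported as `best_score_`.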

## TO DO 2
Pick a model for the Polynomial kernel with degree=2:

In [None]:
# parameters for poly with degree 2 kernel
parameters = {'C': [0.05, 0.5, 5],'gamma':[0.05,0.5,5.]}

#run SVM with poly of degree 2 kernel

svc_poly2 = SVC(kernel='poly',degree=2)    
GridS_poly2 = GridSearchCV(estimator=svc_poly2,param_grid=parameters,cv=4)
GridS_poly2.fit(X_train,y_train)

print ('RESULTS FOR POLY DEGREE=2 KERNEL')

print("Best parameters set found:")
print(GridS_poly2.best_params_,'\n')

print("Best model:")
print(GridS_poly2.best_estimator_,'\n')

print("Score with best parameters:")
print(GridS_poly2.best_score_,'\n')

#print("All scores on the grid:")
#print(pd.DataFrame(GridS_poly2.cv_results_))

## TO DO 3

Now let's try a higher degree for the polynomial kernel.

In [None]:
# parameters for poly with higher degree kernel
parameters = {'C': [0.05, 0.5, 5],'gamma':[0.05,0.5,5.]}

#run SVM with poly of higher degree kernel
degree = 3

svc_polyn = SVC(kernel='poly',degree=degree)    
GridS_polyn = GridSearchCV(estimator=svc_polyn,param_grid=parameters,cv=4)
GridS_polyn.fit(X_train,y_train)

print ('RESULTS FOR POLY DEGREE=', degree, ' KERNEL')

print("Best parameters set found:")
print(GridS_polyn.best_params_,'\n')

print("Best model:")
print(GridS_polyn.best_estimator_,'\n')

print("Score with best parameters:")
print(GridS_polyn.best_score_,'\n')

#print("All scores on the grid:")
#print(pd.DataFrame(GridS_polyn.cv_results_))

## TO DO 4
Pick a model for the Radial Basis Function kernel:

In [None]:
# parameters for rbf SVM
parameters = {'C': [0.5, 5, 50, 500],'gamma':[0.005, 0.05, 0.5,5]}

#run SVM with rbf kernel

svc_rbf = SVC(kernel='rbf')    
GridS_rbf = GridSearchCV(estimator=svc_rbf,param_grid=parameters,cv=4)
GridS_rbf.fit(X_train,y_train)

print ('RESULTS FOR rbf KERNEL')

print("Best parameters set found:")
print(GridS_rbf.best_params_,'\n')

print("Best model:")
print(GridS_rbf.best_estimator_,'\n')

print("Score with best parameters:")
print(GridS_rbf.best_score_,'\n')

#print("All scores on the grid:")
#print(pd.DataFrame(GridS_rbf.cv_results_))

## TO DO 5
What do you observe when using RBF and polynomial kernels on this dataset?

The first thing we notice is that, for the polynomial kernel, the cross-validation score decreases as the degree increases. Typically, the RBF kernel is the best one to use. In this case, with this particular random seed, the linear, quadratic and Gaussian (RBF) kernels are almost equivalent, with similar scores. For this reason, although the linear kernel seems to give the best results, we will keep RBF for the following analysis.
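
To make the role of `degree` and `gamma` concrete, here is a minimal NumPy sketch of the two kernel functions as scikit-learn's `SVC` defines them: $(\gamma \langle x, x' \rangle + r)^d$ for `kernel='poly'` (with `coef0` $= r$, 0 by default) and $\exp(-\gamma \|x - x'\|^2)$ for `kernel='rbf'`. The function names `poly_kernel` and `rbf_kernel_np` are hypothetical, and the two-point input is a toy example, not notebook data.

```python
import numpy as np

def poly_kernel(A, B, degree=2, gamma=1.0, coef0=0.0):
    """Polynomial kernel matrix between the rows of A and the rows of B."""
    return (gamma * A @ B.T + coef0) ** degree

def rbf_kernel_np(A, B, gamma=1.0):
    """RBF (Gaussian) kernel matrix between the rows of A and the rows of B."""
    # squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dist = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2 * A @ B.T
    # clip tiny negative values caused by rounding before exponentiating
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

A = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K_poly = poly_kernel(A, A, degree=2, gamma=0.5)
K_rbf = rbf_kernel_np(A, A, gamma=0.5)
print(K_poly)
print(K_rbf)
```

Raising the degree amplifies differences between inner products, while in the RBF kernel a larger `gamma` makes the similarity decay faster with distance; both act together with `C` in the grids searched above.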

## TO DO 6
Report here the best SVM kernel and parameters

In [None]:
#get training and test error for the best SVM model from CV
best_SVM = GridS_rbf.best_estimator_

training_error = 1. - best_SVM.score(X_train,y_train)
test_error = 1. - best_SVM.score(X_test,y_test)

print ("Best SVM training error: %f" % training_error)
print ("Best SVM test error: %f" % test_error)

## More data
Now let's do the same but using more data points for training.


Choose a new number of data points.

In [None]:
X = X[permutation]
y = y[permutation]

m_training = 2000 # TODO number of data points, adjust depending on the capabilities of your PC

X_train, X_test = X[:m_training], X[m_training:]
y_train, y_test = y[:m_training], y[m_training:]

labels, freqs = np.unique(y_train, return_counts=True)
print("Labels in training dataset: ", labels)
print("Frequencies in training dataset: ", freqs)

Let's try to use SVM with parameters obtained from the best model for $m_{training} =  2000$. Since it may take a long time to run, you can decide to just let it run for some time and stop it if it does not complete. If you decide to do this, report it in the TO DO 9 cell below.

### TO DO 7

In [None]:
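#refit the best SVM model from CV (RBF kernel with its best parameters) on the new training set,
#then get its training and test error

best_SVM = SVC(kernel='rbf', **GridS_rbf.best_params_)
best_SVM.fit(X_train,y_train)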

training_error = 1. - best_SVM.score(X_train,y_train)
test_error = 1. - best_SVM.score(X_test,y_test)

print ("Best SVM training error: %f" % training_error)
print ("Best SVM test error: %f" % test_error)

Just for comparison, let's also use logistic regression (with standard parameters from scikit-learn, i.e. some regularization is included).

### TO DO 8 Try first without regularization (use a very large C)
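
As a reminder of why a very large `C` switches regularization off: in the binary case with labels $y_i \in \{-1, +1\}$, L2-penalized logistic regression as parameterized in scikit-learn can be written, up to an overall scaling, as

$$
\min_{w,\,b} \; \frac{1}{2}\|w\|_2^2 \;+\; C \sum_{i=1}^{m} \log\left(1 + e^{-y_i (w^\top x_i + b)}\right)
$$

so `C` is the inverse of the regularization strength. With `C=1e8` the penalty term is negligible compared with the data term, while `C=1` keeps it relevant. This is only a sketch of the binary objective; the multi-class problem solved in this notebook replaces the logistic loss with its multinomial counterpart.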

In [None]:
from sklearn import linear_model

logreg = linear_model.LogisticRegression(C=1e8,solver='newton-cg',penalty='l2',max_iter=1000,multi_class='auto')
logreg.fit(X_train,y_train)

prediction_training = logreg.predict(X_train)
differences_training = (y_train==prediction_training)
training_error = (differences_training==False).sum()/prediction_training.shape[0]

prediction_test = logreg.predict(X_test)
differences_test = (y_test==prediction_test)
test_error = (differences_test==False).sum()/prediction_test.shape[0]

print ("Best logistic regression training error: %f" % training_error)
print ("Best logistic regression test error: %f" % test_error)

### TO DO 9 Then also use some regularization

In [None]:
logreg_regu = linear_model.LogisticRegression(C=1,solver='newton-cg',penalty='l2',max_iter=1000,multi_class='auto')
logreg_regu.fit(X_train,y_train)

prediction_training_regu = logreg_regu.predict(X_train)
differences_training_regu = (y_train==prediction_training_regu)
training_error_regu = (differences_training_regu==False).sum()/prediction_training_regu.shape[0]

prediction_test_regu = logreg_regu.predict(X_test)
differences_test_regu = (y_test==prediction_test_regu)
test_error_regu = (differences_test_regu==False).sum()/prediction_test_regu.shape[0]


print ("Best regularized logistic regression training error: %f" % training_error_regu)
print ("Best regularized logistic regression test error: %f" % test_error_regu)

## TO DO 10
Compare and discuss:
- the results from SVM with m=500 and with m=2000 training data points. If you stopped the SVM, include such aspect in your comparison.
- the results of SVM and of Logistic Regression with and without regularization

As we can see, with m=2000 samples we obtain a higher training error than with m=500 (as expected); it is, however, closer to the test error, which remains almost the same for the two training sets.
With logistic regression, instead, we find a zero training error when no regularization is used, meaning that all the training samples are correctly classified.
As for the test errors, we find similar values for the best SVM and for logistic regression with and without regularization.

## TO DO 10
Plot an item of clothing that is misclassified by logistic regression and correctly classified by SVM.

In [None]:
LR_prediction = logreg_regu.predict(X)
LR_prediction_check = (LR_prediction==y)

SVM_prediction = best_SVM.predict(X)
SVM_prediction_check = (SVM_prediction==y)

def plot_random(lr_pred,svm_pred,lr_check,svm_check):
    #indices of the samples misclassified by logistic regression but correctly classified by SVM
    candidates = np.flatnonzero(~lr_check & svm_check)
    if candidates.size == 0:
        print('No sample is misclassified by logistic regression and correctly classified by SVM')
        return
    #pick one of them at random (avoids the unbounded recursion of a retry loop)
    n = np.random.choice(candidates)
    plot_input(X,y,n)
    print('Logistic Regression prediction: ',lr_pred[n])
    print('SVM prediction: ',svm_pred[n])

plot_random(LR_prediction,SVM_prediction,LR_prediction_check,SVM_prediction_check)

## TO DO 11
Plot the confusion matrix for the SVM classifier and for logistic regression.
The confusion matrix has one column for each predicted label and one row for each true label. 
For each class, the corresponding row shows how many samples belonging to that class get each possible output label.
Notice that the diagonal contains the correctly classified samples, while the other cells correspond to errors.
You can obtain it with the sklearn.metrics.confusion_matrix function (see the documentation).
Try also to normalize the confusion matrix by the number of samples in each class in order to measure the accuracy on each single class.
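
The row normalization described above can be illustrated on a toy example. This is a minimal sketch that builds the confusion matrix by hand with NumPy instead of calling `sklearn.metrics.confusion_matrix`; `confusion_matrix_np` and the nine toy labels are assumptions made for illustration only.

```python
import numpy as np

def confusion_matrix_np(y_true, y_pred, n_classes):
    """Rows are true labels, columns are predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # add 1 to cell (true, predicted) for every sample, handling repeated index pairs
    np.add.at(cm, (y_true, y_pred), 1)
    return cm

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2, 2])

cm = confusion_matrix_np(y_true, y_pred, n_classes=3)
# divide each row by the number of samples of that true class
cm_norm = cm / cm.sum(axis=1, keepdims=True)
per_class_accuracy = np.diag(cm_norm)

print(cm)
print(cm_norm)
print(per_class_accuracy)
```

After normalization every row sums to 1 and the diagonal gives the accuracy on each single class, provided the matrix and the per-class counts come from the same set of samples.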


In [None]:
# for better aligned printing of confusion matrix use floatmode='fixed' (not supported in all versions of NumPy)
np.set_printoptions(precision=2, suppress=True) 

u, counts = np.unique(y_test, return_counts=True)
print("Labels and frequencies in test set: ", u, counts)

# keep only the predictions on the test samples, so that the matrices and the counts refer to the same set
confusion_SVM = skm.confusion_matrix(y_test,SVM_prediction[m_training:])
print("\n Confusion matrix SVM  \n \n", confusion_SVM)
print("\n Confusion matrix SVM (normalized)   \n \n", confusion_SVM /counts[:,None] )

confusion_LR = skm.confusion_matrix(y_test,LR_prediction[m_training:])
print("\n Confusion matrix LR  \n \n", confusion_LR)
print("\n Confusion matrix LR (normalized)   \n \n", confusion_LR /counts[:,None] )

## TO DO 12
Have a look at the confusion matrices and comment on the obtained accuracies. Why do some classes have lower accuracies and others higher ones? Make some guesses on the possible causes.

As can be seen from the diagonal values, most of the samples are correctly classified by both methods, which also give similar results in terms of which clothes are recognized most often. In particular, the labels on which they differ most are 2 (pullover) and 9 (ankle boot), for which logistic regression performs better, and 4 (coat), for which SVM is better instead.
Some classes have lower accuracies than others, probably because the images are similar and this confuses the algorithms. In particular, the labels that are predicted least accurately are 3 (dress) and 6 (shirt), which both algorithms often classify as something else (for example 6 and 0, shirts and T-shirts, are often confused).