# Group exercise 2a
## Loïc Rosset, Nanae Aubry, Kilian Ruchti, Lionel Ieri

* Use the provided training set to build your SVM.
* Apply the trained SVM to classify the test set. 
* Investigate at least two different kernels and optimize the SVM parameters by means of cross-validation.

## Import libraries

References:
* [Cross validation from scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html)
* [Grid-Search Cross Validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

In [19]:
#csv to parse the dataset files
import csv
#numpy and pandas for data structures
import numpy as np
import pandas as pd
#sklearn for svm and for cross validation
from sklearn import svm
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

## Read the provided dataset

In [6]:
#path to dataset
mnist_train = "./../dataset/csv/mnist_train.csv"
mnist_test = "./../dataset/csv/mnist_test.csv"

In [7]:
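def read_data(filename):
    """Read a CSV file and return (samples, labels) as numpy arrays."""
    with open(filename, 'r') as csvfile:
        reader = csv.reader(csvfile)
        data = list(reader)
    #data into numpy array
    matrix = np.array(data, dtype = int)
    #each CSV row holds the label first, followed by the pixel values
    samples = matrix[:,1:]
    labels = matrix[:,0]
    return samples, labels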

In [8]:
# Load the training and the test set
training_data, training_labels = read_data(mnist_train)
test_data, test_labels = read_data(mnist_test)

Running the code in the section "SVM from Scikit-learn" on the whole training set takes a while, so a random sample of 1000 rows is drawn here and can be used for testing if needed:

In [14]:
#draw 1000 random row indices (with replacement, so duplicates are possible)
random_ids = np.random.randint(0,training_data.shape[0],1000)
sample_data = training_data[random_ids]
sample_labels = training_labels[random_ids]
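
Because `np.random.randint` draws each index independently, the sample above can contain the same row more than once. If distinct rows are preferred, a minimal sketch could look like the following. The helper name `draw_subset`, the seed and the stand-in arrays are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np

def draw_subset(data, labels, size, seed=0):
    """Return `size` distinct rows of data and their labels (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no row index is drawn twice
    ids = rng.choice(data.shape[0], size=size, replace=False)
    return data[ids], labels[ids], ids

# Stand-in arrays with 784 values per row; these are not the real dataset
demo_data = np.zeros((5000, 784), dtype=int)
demo_labels = np.arange(5000) % 10
subset_data, subset_labels, subset_ids = draw_subset(demo_data, demo_labels, 1000)
```

Fixing the seed makes the subset reproducible between runs, which helps when comparing parameter settings.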

## SVM from Scikit-learn

Reference : [A practical guide to support vector classification](https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf)

The SVM parameters (kernel, C and gamma) are selected by grid search, with each candidate scored by 4-fold cross-validation.
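
The guide referenced above recommends scaling the features before training an SVM, whereas this notebook passes the raw pixel values to the classifier. A hedged sketch of a simple rescaling to [0, 1] is given here, assuming 8-bit pixels with a maximum value of 255. The helper `scale_pixels` is hypothetical, and the same transformation would have to be applied to both the training and the test set.

```python
import numpy as np

def scale_pixels(samples, max_value=255.0):
    """Map integer pixel intensities to floats in [0, 1] (hypothetical helper)."""
    return np.asarray(samples, dtype=np.float64) / max_value

# Made-up pixel values, used only to show the effect of the scaling
demo_pixels = np.array([[0, 128, 255], [64, 32, 16]])
scaled = scale_pixels(demo_pixels)
```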

In [16]:
#parameter sequences of C and gamma to be tested (powers of 2)
c_seq = [pow(2,x) for x in range(-5, 15, 4)]
gamma_seq = [pow(2,x) for x in range(-15, 3, 4)]
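
To make the search space concrete, the two list comprehensions above can be expanded by hand. The short sketch below only evaluates them and counts the combinations. The names `kernels`, `n_candidates` and `n_fits` are introduced here for illustration, and the fold count of 4 is taken from the `nb_folds` setting used in the grid-search cell.

```python
# Exponents -5, -1, 3, 7, 11 for C and -15, -11, -7, -3, 1 for gamma
c_seq = [pow(2, x) for x in range(-5, 15, 4)]
gamma_seq = [pow(2, x) for x in range(-15, 3, 4)]
kernels = ('rbf', 'linear')

# Every kernel is combined with every (C, gamma) pair
n_candidates = len(kernels) * len(c_seq) * len(gamma_seq)
# Each candidate is fitted once per fold (the final refit is not counted)
n_fits = n_candidates * 4

print(c_seq)
print(gamma_seq)
print(n_candidates, n_fits)
```

With these settings the grid holds 50 candidates, so 4-fold cross-validation trains 200 models before the final refit on the whole sample.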

In [17]:
#'gamma' only affects the rbf kernel; the linear kernel ignores it
parameters = {'kernel':('rbf','linear'), 'C':c_seq, 'gamma':gamma_seq}
nb_folds = 4
#Support Vector Classification
svc = svm.SVC()
#Grid search scored by cross-validation
svmOptimized = GridSearchCV(svc, parameters, cv=nb_folds)
svmOptimized.fit(sample_data, sample_labels)
#per-candidate cross-validation results as a table
scores = pd.DataFrame.from_dict(svmOptimized.cv_results_)
print("--> Best parameters : ", svmOptimized.best_params_)
scores

--> Best parameters :  {'C': 0.03125, 'gamma': 3.0517578125e-05, 'kernel': 'linear'}


| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | param_gamma | param_kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.876596 | 0.007425 | 0.15835 | 0.004607 | 0.03125 | 3.05176e-05 | rbf | {'C': 0.03125, 'gamma': 3.0517578125e-05, 'ker... | 0.128 | 0.124 | 0.124 | 0.124 | 0.125 | 0.001732 | 41 |
| 1 | 0.244249 | 0.004496 | 0.0959 | 0.002548 | 0.03125 | 3.05176e-05 | linear | {'C': 0.03125, 'gamma': 3.0517578125e-05, 'ker... | 0.888 | 0.868 | 0.884 | 0.884 | 0.881 | 0.007681 | 1 |
| 2 | 0.901351 | 0.006444 | 0.167331 | 0.003841 | 0.03125 | 0.000488281 | rbf | {'C': 0.03125, 'gamma': 0.00048828125, 'kernel... | 0.128 | 0.124 | 0.124 | 0.124 | 0.125 | 0.001732 | 41 |
| 3 | 0.24357 | 0.003364 | 0.095651 | 0.000829 | 0.03125 | 0.000488281 | linear | {'C': 0.03125, 'gamma': 0.00048828125, 'kernel... | 0.888 | 0.868 | 0.884 | 0.884 | 0.881 | 0.007681 | 1 |
| 4 | 0.919993 | 0.007915 | 0.163589 | 0.00358 | 0.03125 | 0.0078125 | rbf | {'C': 0.03125, 'gamma': 0.0078125, 'kernel': '... | 0.128 | 0.124 | 0.124 | 0.124 | 0.125 | 0.001732 | 41 |
| 5 | 0.247719 | 0.009131 | 0.096783 | 0.001879 | 0.03125 | 0.0078125 | linear | {'C': 0.03125, 'gamma': 0.0078125, 'kernel': '... | 0.888 | 0.868 | 0.884 | 0.884 | 0.881 | 0.007681 | 1 |
| 6 | 0.902505 | 0.003064 | 0.164549 | 0.004906 | 0.03125 | 0.125 | rbf | {'C': 0.03125, 'gamma': 0.125, 'kernel': 'rbf'} | 0.128 | 0.124 | 0.124 | 0.124 | 0.125 | 0.001732 | 41 |
| 7 | 0.245843 | 0.005228 | 0.095901 | 0.001581 | 0.03125 | 0.125 | linear | {'C': 0.03125, 'gamma': 0.125, 'kernel': 'line... | 0.888 | 0.868 | 0.884 | 0.884 | 0.881 | 0.007681 | 1 |
| 8 | 0.909234 | 0.004494 | 0.165506 | 0.003685 | 0.03125 | 2.0 | rbf | {'C': 0.03125, 'gamma': 2, 'kernel': 'rbf'} | 0.128 | 0.124 | 0.124 | 0.124 | 0.125 | 0.001732 | 41 |
| 9 | 0.252053 | 0.00709 | 0.097651 | 0.002483 | 0.03125 | 2.0 | linear | {'C': 0.03125, 'gamma': 2, 'kernel': 'linear'} | 0.888 | 0.868 | 0.884 | 0.884 | 0.881 | 0.007681 | 1 |
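
The rows for the linear kernel above have identical scores for every value of `gamma`, because the linear kernel does not use that parameter. One way to avoid the repeated fits is to pass `GridSearchCV` a list of grids, one per kernel. The following is a sketch under assumptions: it uses scikit-learn's small `load_digits` dataset as a stand-in so that it runs quickly, not the MNIST files of this exercise, and the names `param_grid`, `search` and `n_candidates` are illustrative.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Small stand-in dataset (8x8 digit images), truncated to keep the run short
X, y = load_digits(return_X_y=True)
X, y = X[:300], y[:300]

c_seq = [pow(2, x) for x in range(-5, 15, 4)]
gamma_seq = [pow(2, x) for x in range(-15, 3, 4)]

# One grid per kernel: gamma is only searched where it has an effect
param_grid = [
    {'kernel': ['linear'], 'C': c_seq},
    {'kernel': ['rbf'], 'C': c_seq, 'gamma': gamma_seq},
]
n_candidates = len(ParameterGrid(param_grid))

search = GridSearchCV(svm.SVC(), param_grid, cv=4)
search.fit(X, y)
print(n_candidates, search.best_params_)
```

Splitting the grid this way reduces the search from 50 candidates to 30 (5 linear plus 25 rbf) without removing any distinct configuration.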


## Classify test set with trained SVM

In [18]:
labels_pred = svmOptimized.predict(test_data)
print("Accuracy of svm classifier with optimized parameter values: ", metrics.accuracy_score(test_labels, labels_pred)*100,"%")

Accuracy of svm classifier with optimized parameter values:  88.03 %
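
A single accuracy figure does not show which digits are confused with each other. As a possible follow-up, a confusion matrix can be computed with `sklearn.metrics`. The tiny label arrays below are made-up stand-ins for `test_labels` and `labels_pred`, used only to keep the sketch self-contained.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for three classes; not results from this exercise
true_demo = np.array([0, 1, 2, 2, 1, 0])
pred_demo = np.array([0, 2, 2, 2, 1, 0])

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(true_demo, pred_demo)
accuracy = metrics.accuracy_score(true_demo, pred_demo)
print(cm)
print(accuracy)
```

In the notebook itself, the same two calls could be applied to `test_labels` and `labels_pred` directly.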
