# Group exercise 2a
## Loïc Rosset, Nanae Aubry, Kilian Ruchti, Lionel Ieri

* Use the provided training set to build your SVM.
* Apply the trained SVM to classify the test set. 
* Investigate at least two different kernels and optimize the SVM parameters by means of cross-validation.

## Import libraries

References:
* [Cross validation from scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html)
* [Grid-Search Cross Validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

In [1]:
import csv
# numpy and pandas for data structures
import numpy as np
import pandas as pd
# sklearn for SVM and cross-validation
from sklearn import svm
from sklearn import metrics
from sklearn.model_selection import GridSearchCV

## Read the provided dataset

In [2]:
#path to dataset
mnist_train = "./../dataset/csv/mnist_train.csv"
mnist_test = "./../dataset/csv/mnist_test.csv"

In [3]:
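def read_data(filename):
    """Read an MNIST CSV file and return (samples, labels)."""
    # newline='' is the mode recommended for the csv module
    with open(filename, 'r', newline='') as csvfile:
        reader = csv.reader(csvfile)
        data = list(reader)
    # convert the rows into an integer numpy array
    matrix = np.array(data, dtype=int)
    # first column holds the label, the remaining columns hold the pixel values
    samples = matrix[:, 1:]
    labels = matrix[:, 0]
    return samples, labels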

In [4]:
# Load the training and the test set
training_data, training_labels = read_data(mnist_train)
test_data, test_labels = read_data(mnist_test)

Running the code in the "SVM from Scikit-learn" section on the whole training set takes a while, so here is a random sample of it that can be used for testing if needed:

In [5]:
# draw 1000 random row indices (with replacement, so a few rows may repeat)
random_ids = np.random.randint(0,training_data.shape[0],1000)
sample_data = training_data[random_ids]
sample_labels = training_labels[random_ids]
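
`np.random.randint` can pick the same row more than once. If a subset without duplicates and with a reproducible seed is preferred, one option is sketched below. The arrays `demo_data` and `demo_labels` are hypothetical stand-ins for `training_data` and `training_labels`, and their shapes are assumptions made for illustration only.

```python
import numpy as np

# Fixed seed so the same subset is drawn on every run
rng = np.random.default_rng(seed=0)

# Hypothetical stand-ins for training_data / training_labels
demo_data = rng.integers(0, 256, size=(5000, 784))
demo_labels = rng.integers(0, 10, size=5000)

# replace=False guarantees that no row index is drawn twice
subset_ids = rng.choice(demo_data.shape[0], size=1000, replace=False)
subset_data = demo_data[subset_ids]
subset_labels = demo_labels[subset_ids]

print(subset_data.shape, len(np.unique(subset_ids)))
```

Indexing data and labels with the same `subset_ids` keeps each sample paired with its label.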

## SVM from Scikit-learn

Reference : [A practical guide to support vector classification](https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf)

SVM trained with 4-fold cross-validation, with the parameters selected by grid search.

In [6]:
# candidate values of C and gamma to test (powers of 2)
c_seq = [pow(2,x) for x in range(-5, 15, 4)]
gamma_seq = [pow(2,x) for x in range(-15, 3, 4)]
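
To make the grid concrete, the two sequences can be printed. Because the `stop` argument of `range` is exclusive and the step is 4, the largest exponents actually tested are 11 for `C` and 1 for `gamma`.

```python
# Same expressions as in the cell above
c_seq = [pow(2, x) for x in range(-5, 15, 4)]
gamma_seq = [pow(2, x) for x in range(-15, 3, 4)]

print(c_seq)      # exponents -5, -1, 3, 7, 11
print(gamma_seq)  # exponents -15, -11, -7, -3, 1
```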

In [7]:
# gamma is only used by the rbf kernel, so the linear rows repeat for every gamma value
parameters = {'kernel':('rbf','linear'), 'C':c_seq, 'gamma':gamma_seq}
nb_folds = 4
# Support Vector Classification
svc = svm.SVC()
# Grid search with cross-validation, fitted on the 1000-sample subset
svmOptimized = GridSearchCV(svc, parameters, cv=nb_folds)
svmOptimized.fit(sample_data, sample_labels)
# Collect the cross-validation results in a DataFrame
scores = pd.DataFrame.from_dict(svmOptimized.cv_results_)
print("--> Best parameters : ", svmOptimized.best_params_)
scores[['param_C', 'param_gamma', 'param_kernel', 'mean_test_score', 'std_test_score', 'rank_test_score']]

--> Best parameters :  {'C': 0.03125, 'gamma': 3.0517578125e-05, 'kernel': 'linear'}


| | param_C | param_gamma | param_kernel | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|
| 0 | 0.03125 | 3.05176e-05 | rbf | 0.118 | 0.002 | 43 |
| 1 | 0.03125 | 3.05176e-05 | linear | 0.884 | 0.014142 | 1 |
| 2 | 0.03125 | 0.000488281 | rbf | 0.147 | 0.05141 | 26 |
| 3 | 0.03125 | 0.000488281 | linear | 0.884 | 0.014142 | 1 |
| 4 | 0.03125 | 0.0078125 | rbf | 0.117 | 0.001732 | 45 |
| 5 | 0.03125 | 0.0078125 | linear | 0.884 | 0.014142 | 1 |
| 6 | 0.03125 | 0.125 | rbf | 0.117 | 0.001732 | 45 |
| 7 | 0.03125 | 0.125 | linear | 0.884 | 0.014142 | 1 |
| 8 | 0.03125 | 2.0 | rbf | 0.117 | 0.001732 | 45 |
| 9 | 0.03125 | 2.0 | linear | 0.884 | 0.014142 | 1 |
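
The linear rows are identical for every `gamma` because the linear kernel does not use that parameter, so those candidates are fitted repeatedly for nothing. `GridSearchCV` also accepts a list of dictionaries as its parameter grid, which allows `gamma` to be paired with the rbf kernel only. The sketch below only counts the candidates of each grid with `ParameterGrid`. The name `split_grid` is an illustrative assumption, not the setup used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

c_seq = [pow(2, x) for x in range(-5, 15, 4)]
gamma_seq = [pow(2, x) for x in range(-15, 3, 4)]

# Grid as in the notebook: every kernel is combined with every gamma
flat_grid = {'kernel': ['rbf', 'linear'], 'C': c_seq, 'gamma': gamma_seq}

# Alternative (assumption): gamma is searched for the rbf kernel only
split_grid = [
    {'kernel': ['rbf'], 'C': c_seq, 'gamma': gamma_seq},
    {'kernel': ['linear'], 'C': c_seq},
]

n_flat = len(ParameterGrid(flat_grid))
n_split = len(ParameterGrid(split_grid))
print(n_flat, n_split)
```

Passing `split_grid` in place of `parameters` would leave the search over distinct models unchanged while fitting fewer candidates per fold.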


## Classify test set with trained SVM

In [8]:
labels_pred = svmOptimized.predict(test_data)
print("Accuracy of svm classifier with optimized parameter values: ", metrics.accuracy_score(test_labels, labels_pred)*100,"%")

Accuracy of svm classifier with optimized parameter values:  88.19 %
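
The rbf rows in the cross-validation results stay near 0.12, which is roughly what always predicting one class would score. A likely cause, though not verified here, is that the pixel values were left in the 0 to 255 range: squared distances between images are then so large that the rbf kernel value exp(-gamma * ||x - y||^2) is practically zero for every `gamma` in the grid. The guide referenced above recommends scaling the features before training. The sketch below illustrates the effect on two random vectors, which are hypothetical stand-ins for MNIST rows.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Two hypothetical 784-pixel images with values in 0..255 (not real MNIST rows)
x = rng.integers(0, 256, size=784).astype(float)
y = rng.integers(0, 256, size=784).astype(float)

def rbf_kernel_value(a, b, gamma):
    """RBF kernel exp(-gamma * ||a - b||^2) for two vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

gamma = 2 ** -15  # smallest gamma in the grid

# Raw pixels: the kernel value collapses towards zero
k_raw = rbf_kernel_value(x, y, gamma)
# Pixels divided by 255: the kernel value is in a usable range again
k_scaled = rbf_kernel_value(x / 255.0, y / 255.0, gamma)

print(k_raw, k_scaled)
```

If scaling is applied, the same division has to be applied to the training and the test data.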
