# Machine Learning Project: Number Classifier
### David García Allo - 21/02/2022 
### Latest update: 14/07/2022
#### Description
Download the MNIST handwritten digit data and build a Support Vector Machine classifier that learns to identify
handwritten digits with a given accuracy. Adjust the hyperparameters of the algorithm to obtain the best accuracy.
#### External modules needed
- numpy
- matplotlib
- scikit-learn (imported as `sklearn`)
- pandas (needed to obtain the downloaded data)
#### Comments
This program may take **several minutes** to finish, even without running a full grid search. A likely reason is that the
Support Vector Machine classifier is better suited to small datasets, so it is not very efficient on a dataset as large as this one.

### Needed modules and functions

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

### Download data

In [None]:
#Download the data
mnist = fetch_openml('mnist_784', version=1, cache=True) #pandas needed here
#Get the data and the target values
data   = mnist.data.values
number = mnist.target.to_numpy() #Labels: the digit that each image represents
number = number.reshape(len(number), 1) #Column vector, one label per row

### Visualize the data

In [None]:
#Show samples 1 to 25 in a 5x5 grid; each row of data is a flattened 28x28 image
for i in range(1,26):
    plt.subplot(5,5,i)
    plt.imshow(data[i].reshape((28,28)), cmap= cm.Greys_r)
    plt.axis('off')
plt.show()

### Splitting Train and Test samples

In [None]:
#Split Train and Test datasets (size of the test dataset: 20%)
#Passing data and labels together keeps them aligned, so no concatenation is needed
train_data, test_data, train_label, test_label = train_test_split(
    data, number.ravel(), test_size=0.2, random_state=42)
#Scale the data (each set is standardized with its own mean and standard deviation)
train_data = scale(train_data)
test_data  = scale(test_data)
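
The cell above standardizes the train and test sets independently, so each one uses its own mean and standard deviation. A common alternative is to compute the statistics on the training set only and reuse them for the test set; scikit-learn's `StandardScaler` offers this through `fit` and `transform`. The NumPy sketch below illustrates the idea on small made-up arrays. The helper names `fit_standardizer` and `apply_standardizer` and the `toy_*` arrays are hypothetical and not part of this notebook.

```python
import numpy as np

def fit_standardizer(train):
    """Return the per-feature mean and std computed on the training set only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # A constant feature (e.g. a pixel that is always zero) has std == 0;
    # use 1 instead to avoid dividing by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std

def apply_standardizer(x, mean, std):
    """Standardize x with statistics that were fitted elsewhere."""
    return (x - mean) / std

# Made-up data: 3 features, the last one constant
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
toy_train[:, 2] = 7.0
toy_test[:, 2] = 7.0

mean, std = fit_standardizer(toy_train)
toy_train_scaled = apply_standardizer(toy_train, mean, std)
toy_test_scaled = apply_standardizer(toy_test, mean, std)  # reuses train statistics
```

With this approach the test set is transformed exactly as the training set was, so its features will generally not have exactly zero mean and unit variance.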

### Training the algorithm and showing some predictions on the test sample

In [None]:
#Train the algorithm
classifier = SVC(kernel='rbf', gamma='scale') #Our classification algorithm
classifier.fit(train_data, train_label) #Train the algorithm on our data
#Show some predictions on the test sample
preds = classifier.predict(test_data[:50])
print('Predicted Labels:\n', preds)
print('Real Labels:\n', test_label[:50])
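
For reference, the `rbf` kernel used above compares two samples through $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$, and scikit-learn documents `gamma='scale'` as `1 / (n_features * X.var())`. The sketch below reproduces both formulas with NumPy on a small made-up array. `gamma_scale`, `rbf_kernel_matrix` and the `toy_*` names are hypothetical, and this only illustrates the formulas, not how `SVC` evaluates them internally.

```python
import numpy as np

def gamma_scale(X):
    """Value documented for gamma='scale': 1 / (n_features * X.var())."""
    return 1.0 / (X.shape[1] * X.var())

def rbf_kernel_matrix(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    # Squared Euclidean distance between every row of A and every row of B
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

# Made-up samples with 2 features each
toy_X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
toy_gamma = gamma_scale(toy_X)
toy_K = rbf_kernel_matrix(toy_X, toy_X, toy_gamma)
```

A larger `gamma` makes the kernel value fall off faster with distance, which is why it is one of the two hyperparameters tuned in the next section.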

### Finding the best hyperparameters C, gamma

In [None]:
params_ranges = {"gamma": np.linspace(0.0001,0.01,20), "C": np.linspace(3,10,14)} #Ranges to search for the best combination
search_cv = GridSearchCV(classifier, params_ranges, n_jobs=8, verbose=1, cv=3) #Search algorithm that we will use
"""WARNING: fitting on the full training set, search_cv.fit(train_data, train_label), may take several hours.
With only the first 5000 events, the grid search takes about half an hour on my PC.
For a fast run, note that 1000 events take only about 1 minute.
I estimate that the full training set (56000 events) would take several hours (quadratic scaling)."""
search_cv.fit(train_data[:1000], train_label[:1000])
print('Best Hyperparameters: ', search_cv.best_estimator_)
print('Cross-validation Accuracy Score: ', search_cv.best_score_)
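
The grid above has 20 values of `gamma` and 14 values of `C`, and `cv=3` fits each combination three times, so the search performs 840 fits before the final refit. The sketch below does that count and extrapolates the run time under the quadratic-scaling assumption stated in the warning, starting from the quoted 1 minute for 1000 events. The function names `count_grid_fits` and `estimated_minutes` are hypothetical, and the results are rough extrapolations, not measurements.

```python
def count_grid_fits(n_gamma, n_c, n_folds):
    """Number of model fits in an exhaustive grid search with cross-validation."""
    return n_gamma * n_c * n_folds

def estimated_minutes(n_samples, ref_samples=1000, ref_minutes=1.0, exponent=2):
    """Extrapolate run time assuming time grows as n_samples ** exponent.

    The reference point (1000 events, 1 minute) is the figure quoted in the
    warning above; the exponent of 2 is the assumed quadratic scaling.
    """
    return ref_minutes * (n_samples / ref_samples) ** exponent

n_fits = count_grid_fits(n_gamma=20, n_c=14, n_folds=3)
minutes_5000 = estimated_minutes(5000)
minutes_56000 = estimated_minutes(56000)

print('Fits in the grid search:', n_fits)
print('Estimated minutes for 5000 events:', minutes_5000)
print('Estimated hours for 56000 events: %.1f' % (minutes_56000 / 60))
```

Under this assumption 5000 events come out at 25 minutes, close to the half hour quoted in the warning, while the full 56000 events would take on the order of two days. Actual timings depend on the hardware, on `n_jobs`, and on how closely the scaling follows a square law.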

### Train the algorithm with the new hyperparameters and show the accuracy

In [None]:
#Retrain the algorithm with the best hyperparameters found by the grid search
classifier_best = SVC(kernel="rbf", C=search_cv.best_params_['C'], gamma=search_cv.best_params_['gamma'])
classifier_best.fit(train_data, train_label)
#Obtain the accuracy on the test sample
preds_best = classifier_best.predict(test_data)
test_accuracy = accuracy_score(test_label, preds_best)
print('Test Accuracy: %.2f%%'%(test_accuracy*100))
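
`accuracy_score` returns the fraction of predictions that match the true labels. Below is a minimal NumPy sketch of the same quantity on made-up labels, with a per-digit breakdown added for illustration. The helpers `accuracy` and `per_class_accuracy` and the `toy_*` arrays are hypothetical and not part of this notebook.

```python
import numpy as np

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that equal the true label."""
    return float(np.mean(np.asarray(true_labels) == np.asarray(predicted_labels)))

def per_class_accuracy(true_labels, predicted_labels):
    """For each true label, the fraction of its samples predicted correctly."""
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    return {
        str(label): float(np.mean(predicted_labels[true_labels == label] == label))
        for label in np.unique(true_labels)
    }

# Made-up labels: 4 of the 6 predictions are correct
toy_true = np.array(['3', '3', '7', '7', '7', '9'])
toy_pred = np.array(['3', '8', '7', '7', '1', '9'])

toy_accuracy = accuracy(toy_true, toy_pred)
toy_by_digit = per_class_accuracy(toy_true, toy_pred)
print('Toy accuracy: %.2f%%' % (toy_accuracy * 100))
```

A per-digit view like this can show whether a single overall accuracy figure hides digits that are confused more often than others.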