# MNIST Classifier

In this notebook you will create both an MNIST tabular dataset and a classifier.

## 1.- Import the operating system (os) module in Python and any other library you need

In [4]:
import os
from PIL import Image
import numpy as np
import pandas as pd
import time

from sklearn.model_selection import train_test_split, cross_val_score
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC



## 2.- As you can see, each class has its own folder (do this only for train).

    - Iterate folder by folder ( os.listdir() )
    - Inside each folder:
        1.- Read the image
        2.- Reshape it into a flat array (784,)
        3.- Save the data into a pandas dataframe, using the class as the column name
    - Save the data into a CSV

    Note: if it takes too long, try loading only 100 images per folder and ask the teacher for the CSV.

In [5]:
# Number of images to load per folder
n_images_to_load = 200

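# Collect one single-column DataFrame per image and concatenate once at the end;
# calling pd.concat inside the loop copies the growing frame on every iteration.
columns = []

for folder in os.listdir("data/trainingSet"):

    # List of images' names within a folder
    folder_path = os.path.join("data/trainingSet", folder)
    images = os.listdir(folder_path)
    for name in images[:n_images_to_load]:
        pth = os.path.join(folder_path, name)
        img = Image.open(pth)
        # Flatten the image into a (784,) vector
        img_arr = np.array(img, dtype=float).flatten()
        # The column name is the class label (the folder name)
        columns.append(pd.DataFrame(img_arr, columns=[folder]))

# Each column is one image; each row is one pixel position
df = pd.concat(columns, axis=1)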
        

In [6]:
# Shape of the dataset after transposing: (number of images, pixels per image)
print(df.T.shape)

(2000, 784)


In [7]:
# Saving to a csv file
df.T.reset_index().rename(columns={"index":'label'}).to_csv('data/images.csv', index=False)

## 3.- Load the CSV

In [8]:
data = pd.read_csv('data/images.csv')
print(data.shape)
data.head()

(2000, 785)


Unnamed: 0,label,0,1,2,3,4,5,6,7,8,...,774,775,776,777,778,779,780,781,782,783
0,0,3.0,0.0,0.0,3.0,7.0,3.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 4.- Create a dictionary of models (no preprocessing needed; it has already been done).

    Include both tree models and mult models.

In [10]:
# features and label separation
X = data.drop(columns='label')
y = data['label']
print(X.shape, y.shape)

(2000, 784) (2000,)


In [14]:
# Models
models = {'dt': DecisionTreeClassifier(random_state=0),
          'rf': RandomForestClassifier(random_state=0),
          'svc': SVC(random_state=0),
          'log': LogisticRegression(solver='saga', random_state=0)}

## 5.- Using either cross validation or stratification, find out which is the best model
    - Base your code on the examples from the previous two days

In [15]:
results = pd.DataFrame()

for name, model in models.items():

    # Time to fit the model on the full dataset
    start_time = time.time()
    model.fit(X, y)
    final_time = time.time() - start_time

    # Cross validation (5 stratified folds by default); run once and reuse the scores
    scores = cross_val_score(model, X, y)
    cross_val_mean = scores.mean()
    cross_val_std = scores.std()

    # Append to results
    to_append = pd.DataFrame({'model': [name],
                              'cross_val_mean': [cross_val_mean],
                              'cross_val_std': [cross_val_std],
                              'training_time': [final_time]})
    results = pd.concat([results, to_append])
    



In [17]:
results.sort_values('cross_val_mean', ascending = False)

|  | model | cross_val_mean | cross_val_std | training_time |
|---|---|---|---|---|
| 0 | svc | 0.929 | 0.010075 | 0.789289 |
| 0 | rf | 0.902 | 0.004848 | 1.863019 |
| 0 | log | 0.8755 | 0.015116 | 10.238482 |
| 0 | dt | 0.6455 | 0.025466 | 0.618236 |
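
Step 5 also allows stratification instead of cross validation. A minimal sketch of that route follows, under stated assumptions: scikit-learn's bundled `load_digits` data (8x8 digits) stands in for `data/images.csv` so the block runs on its own, and the 80/20 split and the `X_demo`/`y_demo` names are illustrative choices, not part of the notebook.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption): 8x8 digits bundled with scikit-learn,
# used here instead of the X, y loaded from data/images.csv.
X_demo, y_demo = load_digits(return_X_y=True)

# stratify=y_demo keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

svc = SVC(random_state=0)
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

# Accuracy on data the model never saw during fitting
acc = accuracy_score(y_test, y_pred)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

print(f"held-out accuracy: {acc:.3f}")
print(cm)
```

A single stratified split is cheaper than cross validation but gives one score rather than a mean and a spread, so small differences between models should be read with care.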


The best model is `svc`, with a mean cross-validation accuracy of 0.929. However, `log` reached the maximum number of iterations without converging, so fixing that might improve its performance.
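
One possible way to address the convergence issue is sketched here, under stated assumptions. The scikit-learn documentation notes that the `saga` solver converges quickly only when features are on a similar scale, so the sketch scales pixels to [0, 1] and raises `max_iter`. The `MinMaxScaler` choice, the value `max_iter=500`, and the `load_digits` stand-in data are illustrative and not taken from the notebook; whether this closes the gap with `svc` on the real data is untested.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in data (assumption): a 600-sample slice of scikit-learn's digits,
# kept small so the sketch runs quickly.
X_demo, y_demo = load_digits(return_X_y=True)
X_demo, y_demo = X_demo[:600], y_demo[:600]

# The scaler lives inside the pipeline, so it is re-fit on the training
# part of every fold and never sees the validation part.
log_scaled = make_pipeline(
    MinMaxScaler(),
    LogisticRegression(solver='saga', max_iter=500, random_state=0),
)

scores = cross_val_score(log_scaled, X_demo, y_demo)
print(scores.mean(), scores.std())
```

In the notebook itself, the same pipeline could replace the `'log'` entry of `models` so that it is compared under the same loop as the other models.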

## Optional: Can you rotate an image?
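
A minimal sketch of one answer, assuming the image is stored as a flat vector as in the CSV above: reshape it back to a square, rotate it with `np.rot90`, and flatten it again. The helper name `rotate_flat_image` and its parameters are hypothetical, and this approach only covers multiples of 90 degrees.

```python
import numpy as np

def rotate_flat_image(flat, k=1, side=28):
    """Rotate a flattened square image by k * 90 degrees counterclockwise."""
    img = np.asarray(flat).reshape(side, side)
    return np.rot90(img, k=k).flatten()

# Tiny 2x2 example so the effect is easy to see:
# [[0, 1],        [[1, 3],
#  [2, 3]]   ->    [0, 2]]
tiny = np.array([0, 1, 2, 3])
rotated = rotate_flat_image(tiny, k=1, side=2)
print(rotated)  # [1 3 0 2]
```

On a row of the notebook's feature matrix this would be called as `rotate_flat_image(X.iloc[0].to_numpy())`. For arbitrary angles, Pillow's `Image.rotate(angle)` can be applied to the opened image before it is flattened.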