# Clothes Classification with Neural Networks

In this notebook we explore neural networks for image classification. We use the same dataset as in the SVM notebook: Fashion MNIST (https://pravarmahajan.github.io/fashion/), a dataset of small images of clothes and accessories.

The dataset labels are the following:

| Label | Description |
| --- | --- |
| 0 | T-shirt/top |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat |
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle boot |

In [None]:
#load the required packages and check Scikit-learn version

%matplotlib inline  

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd

import sklearn
print ('scikit-learn version: ', sklearn.__version__)
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

In [None]:
# helper function to load Fashion MNIST dataset from disk
def load_mnist(path, kind='train'):
    import os
    import gzip
    import numpy as np
    labels_path = os.path.join(path, '%s-labels-idx1-ubyte.gz' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte.gz' % kind)
    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8,offset=8)
    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,offset=16).reshape(len(labels), 784)
    return images, labels

# TODO 
Set a seed for the random number generator (you can use your student ID number). Try changing the seed to see the impact of the randomization.
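The effect of the seed can be checked on a small permutation before running the whole notebook. The sketch below is only an illustration: `SEED` and the helper `shuffled_indices` are hypothetical names, not part of the assignment, and the seed value is an arbitrary example.

```python
import numpy as np

SEED = 1232236  # hypothetical example value; any integer in [0, 2**32 - 1] works


def shuffled_indices(seed, n=10):
    """Seed NumPy's global generator, then draw a permutation of range(n)."""
    np.random.seed(seed)
    return np.random.permutation(n)


same_a = shuffled_indices(SEED)
same_b = shuffled_indices(SEED)
other = shuffled_indices(SEED + 1)

# Re-seeding with the same value reproduces the same shuffle.
print("same seed, same order:", np.array_equal(same_a, same_b))
# A different seed will usually give a different order.
print("different seed, same order:", np.array_equal(same_a, other))
```

Because the training/test split below depends on `np.random.permutation`, a different seed selects a different set of 600 training samples.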

In [None]:
ID = 1232236
np.random.seed(ID)

In [None]:
# load the Fashion MNIST dataset and normalize the features so that each value is in [0,1]
X, y = load_mnist("data")
print("Number of samples in the Fashion MNIST dataset:", X.shape[0])
# rescale the data
X = X / 255.0

Now split the data into training and test sets. We start with a small training set of 600 samples to reduce computation time. Make sure that each label appears at least 10 times in the training set (check the printed frequencies).
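The "at least 10 times" condition can also be checked programmatically instead of by eye. This is a minimal sketch in plain Python; the helper name `labels_below_minimum` and the toy label vector are assumptions made for illustration.

```python
from collections import Counter


def labels_below_minimum(labels, n_classes=10, min_count=10):
    """Return the class labels that occur fewer than min_count times."""
    counts = Counter(int(label) for label in labels)
    return [c for c in range(n_classes) if counts[c] < min_count]


# Toy label vector: classes 0-8 appear 12 times each, class 9 only 3 times.
toy_labels = [c for c in range(9) for _ in range(12)] + [9] * 3
print(labels_below_minimum(toy_labels))
```

In the notebook, calling `labels_below_minimum(y_train)` after the split should return an empty list; if it does not, a different seed gives a different split.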

In [None]:
#randomly permute the data and split into training and test sets, taking the first 600
#samples as training and the rest as test
permutation = np.random.permutation(X.shape[0])

X = X[permutation]
y = y[permutation]

m_training = 600

X_train, X_test = X[:m_training], X[m_training:]
y_train, y_test = y[:m_training], y[m_training:]

labels, freqs = np.unique(y_train, return_counts=True)
print("Labels in training dataset: ", labels)
print("Frequencies in training dataset: ", freqs)

In [None]:
#function for plotting an image and printing the corresponding label
def plot_input(X_matrix, labels, index):
    print("INPUT:")
    plt.imshow(
        X_matrix[index].reshape(28,28),
        cmap          = plt.cm.gray_r,
        interpolation = "nearest"
    )
    plt.show()
    print("LABEL: %i"%labels[index])
    return

In [None]:
#let's try the plotting function
plot_input(X_train,y_train,10)
plot_input(X_test,y_test,100)
plot_input(X_test,y_test,10000)

## TO DO 1

Now use a feed-forward neural network for prediction. Use the multi-layer perceptron classifier with the following parameters: max_iter=300, alpha=1e-4, solver='sgd', tol=1e-4, learning_rate_init=.1, random_state=ID (this last parameter makes the run reproducible, so repeated runs give the same result). The alpha parameter is the strength of the L2 regularization term.

Then, using the default activation function, pick four or five architectures to consider, with different numbers of hidden layers and different layer sizes. There is no need to build huge neural networks: you can limit yourself to 3 hidden layers with at most 100 units each. Evaluate the architectures you chose using GridSearchCV with cv=5.

You can reduce the number of iterations if the running time is too long on your computer.


In [None]:
# these are sample values but feel free to change them as you like, try to experiment with different sizes!!
parameters = {'hidden_layer_sizes': [(10,), (20,), (40,), (40,20,), (40,30,20) ]}

mlp = MLPClassifier(max_iter=300, alpha=1e-4, solver='sgd',
                    tol=1e-4, random_state=ID,
                    learning_rate_init=.1)

GridS = GridSearchCV(estimator=mlp,param_grid=parameters,cv=5)
GridS.fit(X_train,y_train)

print ('RESULTS FOR NN\n')

print("Best parameters set found:")
print(GridS.best_params_,'\n')

print("Score with best parameters:")
print(GridS.best_score_,'\n')

print("\nAll scores on the grid:")
print(pd.DataFrame(GridS.cv_results_))

### TO DO 2

Now also try different batch sizes, while keeping the best NN architecture you found above. Remember that the batch size was previously set to the default value, i.e., min(200, n_samples). 
Recall that a batch size of 1 corresponds to baseline SGD, while using all 480 training samples (there are 600 samples, but in 5-fold cross-validation 1/5 of them is held out for validation at each round) corresponds to standard GD; any other mini-batch size lies between these two extremes.
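The arithmetic behind the 480 figure, and the number of weight updates each batch size implies per pass over the data, can be worked out directly. This is a sketch under the assumption that a final partial batch counts as one update; `updates_per_epoch` is a hypothetical helper, not an sklearn function.

```python
import math

n_samples = 600
n_folds = 5
# Each round holds out one fold (1/5 of the data), leaving 4/5 for fitting.
n_fit = n_samples * (n_folds - 1) // n_folds


def updates_per_epoch(n, batch_size):
    """Number of weight updates in one pass over n samples."""
    return math.ceil(n / batch_size)


print("samples used for fitting in each round:", n_fit)
for batch_size in (1, 32, 200, n_fit):
    print(batch_size, updates_per_epoch(n_fit, batch_size))
```

A batch size of 1 makes 480 noisy updates per epoch, 32 makes 15, the default of 200 makes 3, and 480 makes a single full-gradient update.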

In [None]:
# these are sample values corresponding to baseline SGD, a reasonable mini-batch size and standard GD
# again feel free to change them as you like, try to experiment with different batch sizes!!
parameters = {'batch_size': [1, 32, 480]}

# specify the standard k-fold split explicitly: with an integer cv, GridSearchCV uses stratified folds
# for classifiers, and the resulting splits can have different sizes
kf = sklearn.model_selection.KFold(n_splits=5)

# remember to pass cv=kf to GridSearchCV to use the k-fold subdivision seen in the lectures

GridS_kf = GridSearchCV(estimator=GridS.best_estimator_,param_grid=parameters,cv=kf)
GridS_kf.fit(X_train,y_train)

print ('RESULTS FOR NN\n')

print("Best parameters set found:")
print(GridS_kf.best_params_,'\n')

print("Score with best parameters:")
print(GridS_kf.best_score_,'\n')

print("\nAll scores on the grid:")
print(pd.DataFrame(GridS_kf.cv_results_))

### QUESTION 1

What do you observe for different architectures and batch sizes? How do the number of layers and their sizes affect the performances? What do you observe for different batch sizes, in particular what happens to the training convergence for different batch sizes (notice that the algorithm could not converge for some batch sizes)?

As we can see, using huge neural networks does not pay off for this kind of problem: in the first test the best-performing architectures are those with 1 or 2 hidden layers, and adding more layers gives worse performance. As for the batch size, we find similar results for the larger values, which suggests that standard GD (or a large mini-batch) is the best choice here. Smaller batch sizes give more variable results and carry the risk that training does not converge.
 
<br>
In any case, the computation time and the resources required increase considerably with more complex NNs.
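One rough way to quantify the complexity of the architectures in the grid is to count their trainable parameters. The sketch below assumes 784 inputs (28x28 pixels), 10 output units (one per class) and one bias per unit; `count_parameters` is a hypothetical helper written for this illustration.

```python
def count_parameters(hidden_layer_sizes, n_inputs=784, n_outputs=10):
    """Count the weights and biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


for architecture in [(10,), (20,), (40,), (40, 20), (40, 30, 20)]:
    print(architecture, count_parameters(architecture))
```

Under these assumptions most parameters sit in the first layer (785 * 40 = 31400 of the 33460 in the deepest network), so the width of the first hidden layer matters more for the parameter count than the number of layers.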

### TO DO 3

Now also try different learning rates, while keeping the best NN architecture and batch size you found above. Plot the learning curves (i.e., the variation of the loss over the steps; you can get it from the loss_curve_ attribute of the fitted sklearn estimator) for the different values of the learning rate.

In [None]:
import matplotlib.pyplot as plt


lr_list = [10**exp for exp in range(-3,0)]
scores = {}

fig, ax = plt.subplots(1, 1, figsize=(8, 8))
bestNN_model = GridS_kf.best_estimator_
bestNN_model.set_params(**{'max_iter':1000})

# standard (unstratified) 5-fold split, as in the batch-size search above
kf = sklearn.model_selection.KFold(n_splits=5)

GridS_lr = GridSearchCV(estimator=bestNN_model,param_grid={'learning_rate_init':lr_list},cv=kf)
GridS_lr.fit(X_train,y_train)

for learning_rate_init in lr_list:
    bestNN_model.set_params(**{'learning_rate_init':learning_rate_init})
    bestNN_model.fit(X_train,y_train)
    label = 'learning_rate ' + str(learning_rate_init)
    ax.plot(bestNN_model.loss_curve_,label=label)

ax.set_title('Learning curves for different learning rates')
ax.legend()

print ('RESULTS FOR NN\n')

print("Best parameters set found:")
print(GridS_lr.best_params_,'\n')

print("Score with best parameters:")
print(GridS_lr.best_score_,'\n')

### QUESTION 2

Comment on the learning curves (i.e. the variation of the loss over the steps). How does the curve change for different learning rates in terms of stability and speed of convergence?

Before answering question 2, note that I increased the maximum number of iterations to 1000, to give all the models the chance to show a clearer visual trend in the plot (even if they have not converged yet). The first thing we see in the plot is the irregularity of the curve with learning rate 0.1. The general behaviour of the curves tells us that performance is best for higher values of the learning rate. However, even though the model with the best results is the one with the highest learning rate, for the next computation we set the learning rate to 0.01, because it shows better stability. This choice is also supported by repeating the analysis with max_iter set to 300: in that case the best model is exactly the one with learning rate 0.01.
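The trade-off between speed and stability can be reproduced on a toy problem. The sketch below is not the MLP from this notebook: it runs plain gradient descent on the one-dimensional quadratic f(w) = 0.5 * curvature * w**2, with a curvature value chosen arbitrarily for illustration, using the same three learning rates.

```python
def gradient_descent(learning_rate, curvature=15.0, w0=1.0, n_steps=20):
    """Minimize f(w) = 0.5 * curvature * w**2 and return all iterates."""
    w = w0
    iterates = [w]
    for _ in range(n_steps):
        gradient = curvature * w
        w = w - learning_rate * gradient
        iterates.append(w)
    return iterates


for learning_rate in (0.001, 0.01, 0.1):
    ws = gradient_descent(learning_rate)
    # A sign flip means the step jumped over the minimum at w = 0.
    sign_flips = sum(a * b < 0 for a, b in zip(ws, ws[1:]))
    print(learning_rate, abs(ws[-1]), sign_flips)
```

On this toy function the largest rate ends closest to the minimum but overshoots it at every step, while the smaller rates approach it monotonically and more slowly. This mirrors, in a very simplified way, the irregular but fast curve seen for learning rate 0.1.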

### TO DO 4

Now get the training and test error for a NN with the best parameters (architecture, batch size and learning rate) from above. Plot the learning curve for this case as well.

In [None]:
#get training and test error for the best NN model from CV

final_bestNN_model = GridS_lr.best_estimator_
final_bestNN_model.set_params(**{'learning_rate_init':0.01})  # To have more stability on the learning curve
final_bestNN_model.fit(X_train,y_train)

training_error = 1. - final_bestNN_model.score(X_train,y_train)
test_error = 1. - final_bestNN_model.score(X_test,y_test)

print ('\nRESULTS FOR BEST NN\n')

print ("Best NN training error: %f" % training_error)
print ("Best NN test error: %f" % test_error)

fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.plot(final_bestNN_model.loss_curve_,label='learning_curve')
ax.legend()
ax.set_title('Learning curve for the best model')

## More data 
Now let's do the same using 5000 data points for training (or fewer if it takes too long on your machine). Use the same NN architecture as before, but you can try more if you like and have a powerful computer!

In [None]:
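# X and y were already shuffled above; indexing with the same permutation again
# reshuffles them (images and labels stay aligned)
X = X[permutation]
y = y[permutation]

# the text above suggests 5000 samples; 10000 are used here
m_training = 10000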

X_train, X_test = X[:m_training], X[m_training:]
y_train, y_test = y[:m_training], y[m_training:]

labels, freqs = np.unique(y_train, return_counts=True)
print("Labels in training dataset: ", labels)
print("Frequencies in training dataset: ", freqs)

## TO DO 5

Now train the NNs with the added data points using the optimal parameters found above. Optionally, feel free to try different architectures. We suggest using 'verbose=True' to get an idea of how long it takes to run 1 iteration (if needed, also reduce the number of iterations to 50).

In [None]:
# use best architecture and params from before
final_bestNN_model_moredata = sklearn.base.clone(final_bestNN_model)
final_bestNN_model_moredata.set_params(**{'verbose':True})
final_bestNN_model_moredata.fit(X_train,y_train)

# I know that 1000 iterations is a large number, but the computation stops after about 600 iterations, and on my
# PC it takes only 10 seconds (it becomes much longer with m_training = 30000)

training_error = 1. - final_bestNN_model_moredata.score(X_train,y_train)
test_error = 1. - final_bestNN_model_moredata.score(X_test,y_test)

print ('\nRESULTS FOR NN\n')

print ("NN training error: %f" % training_error)
print ("NN test error: %f" % test_error)

fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.plot(final_bestNN_model_moredata.loss_curve_,label='learning_curve')
ax.legend()
ax.set_title('Learning curve for the best model')

## QUESTION 3
Compare the train and test errors you got with a large number of samples with the best one you obtained with only 600 data points. Comment about the results you obtained.

With 600 samples the results were not very good, in particular the training error, 
which was zero: probably a case of overfitting. The test error, by contrast, was fairly good, about 0.21. 
We obtained better performance with a bigger dataset. As noted in the previous cell, 
I tried m_training equal to 10000 and 30000, which, with such a high number of iterations, 
can make the algorithm wasteful from a computational point of view. <br>
In fact, I obtained the following results:

m_training = 10000 <br>
time = 10 seconds <br><br>
NN training error: 0.029100 <br>
NN test error: 0.166160

m_training = 30000 <br>
time = 30/60 seconds <br><br>
NN training error: 0.059033 <br>
NN test error: 0.145633

So it is not worthwhile to use many more samples: the results do not improve much, while the computation time grows considerably.
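The comparison can be made more explicit by computing the gap between test and training error for each training-set size. The sketch below only reuses the numbers reported above (with the 600-sample test error taken as the quoted "about 0.21"); `generalization_gap` is a hypothetical helper name.

```python
# (training error, test error) for each training-set size, as reported above
results = {
    600: (0.0, 0.21),  # test error quoted only approximately
    10000: (0.029100, 0.166160),
    30000: (0.059033, 0.145633),
}


def generalization_gap(train_error, test_error):
    """Difference between test and training error."""
    return test_error - train_error


for m, (train_error, test_error) in results.items():
    print(m, round(generalization_gap(train_error, test_error), 6))
```

The gap shrinks as the training set grows, which is consistent with less overfitting, while the test error itself improves only by a few percentage points.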

### TO DO 7

Plot an example that was misclassified by the NN with m=600 training data points and is now correctly classified by the NN with m=5000 training data points.

In [None]:
# predictions of both networks on the whole (shuffled) dataset
NN_prediction = final_bestNN_model.predict(X)
NN_pred_check = NN_prediction == y

large_NN_prediction = final_bestNN_model_moredata.predict(X)
large_NN_pred_check = large_NN_prediction == y

def plot_random(nn_pred, large_nn_pred, NN_check, large_NN_check):
    # indices misclassified by the small-data NN but correctly classified by the large-data NN
    candidates = np.flatnonzero(~NN_check & large_NN_check)
    if candidates.size == 0:
        print('No sample satisfies the condition')
        return
    # pick one candidate at random (avoids unbounded recursion when candidates are rare)
    n = np.random.choice(candidates)
    plot_input(X, y, n)
    print('Neural Network prediction: ', nn_pred[n])
    print('Larger Neural Network prediction: ', large_nn_pred[n])

plot_random(NN_prediction,large_NN_prediction,NN_pred_check,large_NN_pred_check)

### TO DO 8

Let's plot the weights of the multi-layer perceptron classifier, for the best NN we get with 600 data points and with 5000 data points. The code is already provided; just fix the variable names (e.g., replace mlp, mlp_large with your estimators) so that it works with your implementation.



In [None]:
# The code is already provided, fix variable names in order to have it working with your implementation

mlp = final_bestNN_model
mlp_large = final_bestNN_model_moredata

print("Weights with 600 data points:")

fig, axes = plt.subplots(4, 4,figsize=(8,8))
vmin, vmax = mlp.coefs_[0].min(), mlp.coefs_[0].max()
for coef, ax in zip(mlp.coefs_[0].T, axes.ravel()):
    ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray, vmin=.5 * vmin,
               vmax=.5 * vmax)
    ax.set_xticks(())
    ax.set_yticks(())

plt.show()

print("Weights with 5000 data points:")

fig, axes = plt.subplots(4, 4,figsize=(8,8))
vmin, vmax = mlp_large.coefs_[0].min(), mlp_large.coefs_[0].max()
for coef, ax in zip(mlp_large.coefs_[0].T, axes.ravel()):
    ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray, vmin=.5 * vmin,
               vmax=.5 * vmax)
    ax.set_xticks(())
    ax.set_yticks(())
plt.show()

## QUESTION 4

Describe what you observe by looking at the weights.

The first difference to notice is that in the second case the parameters are much more uniformly distributed than in the first one. We also observe fewer irregularities.

### TO DO 9

Report the best SVM model and its parameters that you found in the last notebook (or check the solution on the course Moodle page). Fit it on a few data points and compute its training and test scores.

In [None]:
m_training = 5000

X_train, X_test = X[:m_training], X[m_training:2*m_training]
y_train, y_test = y[:m_training], y[m_training:2*m_training]

# use best parameters found in the SVM notebook, create SVM and perform fitting
SVM = SVC(C=5, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.005, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False) 

SVM.fit(X_train,y_train)

print ('RESULTS FOR SVM')

SVM_training_error = 1 - SVM.score(X_train,y_train)

# the printed values are errors (1 - accuracy), not scores
print("Training error SVM:")
print(SVM_training_error)

SVM_test_error = 1 - SVM.score(X_test,y_test)

print("Test error SVM:")
print(SVM_test_error)

## QUESTION 5
Compare the results of SVM and of NN. Which one would you prefer? 

As can be seen from the errors alone, the NN fits the training data better, but the performance on the test set is similar for the two methods. If I had to make a choice, I would prefer the SVM, since it is computationally less wasteful.