# Block 6 Exercise 1: Non-Linear Classification

## MNIST Data
We return to the MNIST data set of handwritten digits to compare non-linear classification algorithms ...

In [1]:
#imports 
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

In [2]:
# Load data from https://www.openml.org/d/554
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)


In [3]:
#the full MNIST data set contains 70k samples of digits 0-9 as 28*28 gray scale images (represented as 784 dim vectors)
np.shape(X)

(70000, 784)

In [4]:
#look at the min value in the data
X.min()

0.0

In [5]:
#look at the max value in the data
X.max()

255.0

### E1.1: Cross-Validation and Support Vector Machines
Train and optimize a C-SVM classifier on MNIST (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
* use an RBF kernel
* use *random search* with cross-validation to find the best settings for *gamma* and *C* (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)

In [6]:
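# RBF-kernel SVM; max_iter = 15 caps the solver at 15 iterations,
# so the fit most likely stops before convergence
SVCModel = SVC(kernel = "rbf", max_iter = 15)
# uniform(1, 5) samples C uniformly from the interval [1, 6]
distributions = dict(C=uniform(1,5), gamma = ["auto", "scale"])
# draw 5 random parameter settings, each scored with cross-validation
randomSearch = RandomizedSearchCV(SVCModel, distributions, n_iter = 5)
search = randomSearch.fit(X,y)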



In [7]:
search.best_params_

{'C': 1.4455781112239365, 'gamma': 'scale'}

In [8]:
search.best_score_

0.6917142857142857
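
A cross-validated accuracy of about 0.69 is low for an RBF SVM on MNIST. Two likely causes are `max_iter = 15`, which stops the solver very early, and the unscaled pixel range of 0 to 255. The following is a minimal sketch of an alternative setup, not the original solution. It assumes the small `load_digits` data set bundled with scikit-learn as a fast stand-in for MNIST, scales the data inside a pipeline, and samples *C* and *gamma* from log-uniform ranges. The ranges and the values of `n_iter`, `cv` and `random_state` are illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Small 8x8 digits data set as a fast stand-in for MNIST (assumption).
X_small, y_small = load_digits(return_X_y=True)

# The scaler is part of the pipeline, so it is refit on each CV training fold.
svm_pipe = make_pipeline(MinMaxScaler(), SVC(kernel="rbf"))

# Log-uniform sampling covers several orders of magnitude (illustrative ranges).
svm_distributions = {
    "svc__C": loguniform(1e-1, 1e2),
    "svc__gamma": loguniform(1e-3, 1e0),
}

svm_search = RandomizedSearchCV(
    svm_pipe, svm_distributions, n_iter=5, cv=3, random_state=0
)
svm_search.fit(X_small, y_small)

print(svm_search.best_params_)
print(svm_search.best_score_)
```

The parameter names carry the prefix `svc__` because `make_pipeline` names each step after its lower-cased class name.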

### E1.2: Pipelines and simple Neural Networks
Split the MNIST data into train and test sets and then train and evaluate a simple Multi Layer Perceptron (MLP) network. Since the non-linear activation functions of MLPs are sensitive to the scaling of the input (recall the *sigmoid* function), we need to scale all input values to [0,1].

* combine all steps of your training in a SKL pipeline (https://scikit-learn.org/stable/modules/compose.html#pipeline)
* use a SKL-scaler to scale the data (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)
* MLP Parameters: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
    * use a *SGD* solver
    * use *tanh* as activation function
    * compare networks with 1, 2 and 3 layers, use different numbers of neurons per layer
    * adjust training parameters *alpha* (regularization) and *learning rate* - how sensitive is the model to these parameters?
    * Hint: do not change all parameters at the same time, split into several experiments
* How hard is it to find the best parameters? How many experiments would you need to find the best parameters?
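
Note that the task text asks for inputs in [0,1], while the linked `StandardScaler` standardizes each feature to zero mean and unit variance instead. As a small illustration of the difference, the sketch below applies both `MinMaxScaler` and `StandardScaler` to a made-up column of pixel-like values. The array `pixels` is hypothetical and not part of the exercise data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical pixel-like values in the MNIST range 0..255 (one feature column).
pixels = np.array([[0.0], [51.0], [102.0], [255.0]])

min_max = MinMaxScaler().fit_transform(pixels)     # maps the column to [0, 1]
standard = StandardScaler().fit_transform(pixels)  # zero mean, unit variance

print(min_max.ravel())
print(standard.ravel())
```

Either scaler can serve as the first step of a pipeline. Only `MinMaxScaler` guarantees the [0,1] range on the data it was fitted on.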
    


## Import libraries

In [8]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
import sklearn.metrics as metrics

## Split Data

In [7]:
X_train, X_test, Y_train, Y_test = train_test_split(X,y, train_size=10000)
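
Since no `random_state` is set, every run of the split above yields different train and test sets, so scores are not exactly reproducible. A minimal sketch of a reproducible, class-balanced split follows. It assumes the bundled `load_digits` data set (1797 samples) as a stand-in for MNIST, and the names `X_demo`, `X_tr`, `X_te` as well as the seed value are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Small stand-in data set (assumption): 1797 samples with 64 features each.
X_demo, y_demo = load_digits(return_X_y=True)

# stratify keeps the class proportions equal in both parts;
# random_state fixes the shuffling so the split is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=1000, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)
```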

## 1 Layer - 10 Neurons

In [51]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (10,),activation = "tanh", solver = "sgd"))

In [52]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', hidden_layer_sizes=(10,),
                               solver='sgd'))])

In [53]:
pipe.score(X_test, Y_test)

0.8889833333333333

## 1 Layer - 20 Neurons

In [54]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (20,),activation = "tanh", solver = "sgd"))

In [55]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', hidden_layer_sizes=(20,),
                               solver='sgd'))])

In [56]:
pipe.score(X_test, Y_test)

0.9125666666666666

## 1 Layer - 30 Neurons

In [57]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (30,),activation = "tanh", solver = "sgd"))

In [58]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', hidden_layer_sizes=(30,),
                               solver='sgd'))])

In [59]:
pipe.score(X_test, Y_test)

0.92005

## 2 Layers - 10 Neurons

In [60]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (10,10),activation = "tanh", solver = "sgd"))

In [61]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', hidden_layer_sizes=(10, 10),
                               solver='sgd'))])

In [62]:
pipe.score(X_test, Y_test)

0.8840833333333333

## 2 Layers - 20 Neurons

In [63]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (20,20),activation = "tanh", solver = "sgd"))

In [64]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', hidden_layer_sizes=(20, 20),
                               solver='sgd'))])

In [65]:
pipe.score(X_test, Y_test)

0.9093166666666667

## 2 Layers - 30 Neurons

In [66]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (30,30),activation = "tanh", solver = "sgd"))

In [67]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', hidden_layer_sizes=(30, 30),
                               solver='sgd'))])

In [68]:
pipe.score(X_test, Y_test)

0.9168666666666667

## 3 Layers - 10 Neurons

In [69]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (10,10,10),activation = "tanh", solver = "sgd"))

In [70]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh',
                               hidden_layer_sizes=(10, 10, 10),
                               solver='sgd'))])

In [71]:
pipe.score(X_test, Y_test)

0.87345

## 3 Layers - 20 Neurons

In [72]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (20,20,20),activation = "tanh", solver = "sgd"))

In [73]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh',
                               hidden_layer_sizes=(20, 20, 20),
                               solver='sgd'))])

In [74]:
pipe.score(X_test, Y_test)

0.9063

## 3 Layers - 30 Neurons

In [75]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (30,30,30),activation = "tanh", solver = "sgd"))

In [76]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh',
                               hidden_layer_sizes=(30, 30, 30),
                               solver='sgd'))])

In [77]:
pipe.score(X_test, Y_test)

0.9135833333333333

## Results

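Increasing the number of neurons per layer raises the test accuracy at every depth (for example from 0.889 to 0.920 with one layer). Adding layers does not help: at each width, the two- and three-layer networks score slightly lower than the single-layer network, with the lowest score (0.873) for three layers of 10 neurons.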
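
The nine architecture experiments above repeat the same three cells. As a sketch, they could be written as a single loop that collects the scores in a dictionary. The example assumes the bundled `load_digits` data set as a fast stand-in for MNIST, so its scores are not comparable to the ones reported above. The names `results` and `layers` and the seed values are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in data set (assumption), split once for all experiments.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=1000, random_state=0
)

results = {}
for n_layers in (1, 2, 3):
    for n_neurons in (10, 20, 30):
        # e.g. (20, 20) for two hidden layers with 20 neurons each
        layers = (n_neurons,) * n_layers
        model = make_pipeline(
            StandardScaler(),
            MLPClassifier(hidden_layer_sizes=layers, activation="tanh",
                          solver="sgd", random_state=0),
        )
        model.fit(X_tr, y_tr)
        results[layers] = model.score(X_te, y_te)

for layers, score in results.items():
    print(layers, round(score, 4))
```

Fixing `random_state` in the classifier makes the runs repeatable, so that differences between architectures are not caused by different weight initializations.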

## Alpha

## 0.00005

In [79]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (30),activation = "tanh", solver = "sgd", alpha = 0.00005))

In [80]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', alpha=5e-05,
                               hidden_layer_sizes=30, solver='sgd'))])

In [81]:
pipe.score(X_test, Y_test)

0.91665

## 0.0001

In [82]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (30),activation = "tanh", solver = "sgd", alpha = 0.0001))

In [83]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', hidden_layer_sizes=30,
                               solver='sgd'))])

In [84]:
pipe.score(X_test, Y_test)

0.9186166666666666

## 0.001

In [85]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (30),activation = "tanh", solver = "sgd", alpha = 0.001))

In [86]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', alpha=0.001,
                               hidden_layer_sizes=30, solver='sgd'))])

In [87]:
pipe.score(X_test, Y_test)

0.9191833333333334

## 0.01

In [88]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (30),activation = "tanh", solver = "sgd", alpha = 0.01))

In [89]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', alpha=0.01,
                               hidden_layer_sizes=30, solver='sgd'))])

In [90]:
pipe.score(X_test, Y_test)

0.9164166666666667

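Raising alpha from the default value of 0.0001 to 0.001 increases the score slightly (0.9186 to 0.9192), while a too large alpha of 0.01 decreases it (0.9164). All four scores lie within 0.3 percentage points of each other, so the model is not very sensitive to alpha in this range.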
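
The differences between the alpha scores are small, and since no random seed is fixed, part of them may be run-to-run variation. One way to judge this is to look at the spread of cross-validation scores for each alpha. The sketch below does so on the bundled `load_digits` data set, which is an assumed stand-in for MNIST. The name `alpha_scores` and the choice of `cv=3` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in data set (assumption).
X_demo, y_demo = load_digits(return_X_y=True)

alpha_scores = {}
for alpha in (0.00005, 0.0001, 0.001, 0.01):
    model = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(30,), activation="tanh",
                      solver="sgd", alpha=alpha, random_state=0),
    )
    # One accuracy value per fold; the spread hints at the noise level.
    scores = cross_val_score(model, X_demo, y_demo, cv=3)
    alpha_scores[alpha] = (scores.mean(), scores.std())

for alpha, (mean, std) in alpha_scores.items():
    print(alpha, round(mean, 4), round(std, 4))
```

If the standard deviation across folds is larger than the gap between two alpha values, that gap should not be read as a real effect.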

## Learning rate

## Constant

In [91]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (30),activation = "tanh", solver = "sgd", alpha = 0.0001, learning_rate = "constant"))

In [93]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', hidden_layer_sizes=30,
                               solver='sgd'))])

In [94]:
pipe.score(X_test, Y_test)

0.91585

## Invscaling

In [96]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (30),activation = "tanh", solver = "sgd", alpha = 0.0001, learning_rate = "invscaling"))

In [97]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', hidden_layer_sizes=30,
                               learning_rate='invscaling', solver='sgd'))])

In [98]:
pipe.score(X_test, Y_test)

0.6279

## Adaptive

In [99]:
pipe = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes = (30),activation = "tanh", solver = "sgd", alpha = 0.0001, learning_rate = "adaptive"))

In [100]:
pipe.fit(X_train, Y_train)



Pipeline(steps=[('standardscaler', StandardScaler()),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', hidden_layer_sizes=30,
                               learning_rate='adaptive', solver='sgd'))])

In [101]:
pipe.score(X_test, Y_test)

0.9177333333333333

Constant and adaptive learning rates deliver nearly the same accuracy (0.916 and 0.918), in contrast to the invscaling schedule, which scores much lower (0.628).
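
A plausible explanation for the low invscaling score is how quickly that schedule shrinks the step size. The scikit-learn documentation gives the rule `effective_learning_rate = learning_rate_init / pow(t, power_t)`. The sketch below evaluates this formula with the `MLPClassifier` defaults `learning_rate_init = 0.001` and `power_t = 0.5`. The helper `invscaling_rate` is hypothetical, and how the library counts the time step `t` is left open here.

```python
learning_rate_init = 0.001  # MLPClassifier default
power_t = 0.5               # MLPClassifier default

def invscaling_rate(t):
    """Effective learning rate at time step t (t >= 1) under 'invscaling'."""
    return learning_rate_init / t ** power_t

# The rate falls by a factor of 10 for every factor of 100 in t.
for t in (1, 100, 10_000):
    print(t, invscaling_rate(t))
```

With a rate this small, the network may stop improving long before it has fitted the training data, which would match the score observed above.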

Tuning the parameters is quite time-consuming because the parameters can interact with each other. Even when varying one group of parameters at a time, as done here, we need at least 15 experiments to approach the best combination of the four parameters (number of layers, number of neurons, alpha, and learning rate).
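
To make the number of experiments concrete, the sketch below counts the settings tried in this notebook in two ways: varied one group at a time, and as a full grid over all combinations. The variable names are illustrative, and the values are the ones used in the experiments above.

```python
from itertools import product

layer_counts = (1, 2, 3)
neuron_counts = (10, 20, 30)
alphas = (0.00005, 0.0001, 0.001, 0.01)
schedules = ("constant", "invscaling", "adaptive")

# Full grid: every combination of all four parameters.
full_grid = list(product(layer_counts, neuron_counts, alphas, schedules))

# One group at a time: architectures first, then alpha, then the schedule.
one_at_a_time = (
    len(layer_counts) * len(neuron_counts) + len(alphas) + len(schedules)
)

print(len(full_grid))  # 108
print(one_at_a_time)   # 16
```

The one-at-a-time approach is much cheaper, but it can miss combinations in which two parameters only work well together.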