# Block 6 Exercise 1: Non-Linear Classification

## MNIST Data
We return to the MNIST data set of handwritten digits to compare non-linear classification algorithms.

In [0]:
#imports 
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_openml

In [0]:
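# Load data from https://www.openml.org/d/554
# as_frame=False returns NumPy arrays; newer scikit-learn versions return a pandas DataFrame by default
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)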


In [3]:
# The full MNIST data set contains 70k samples of the digits 0-9 as 28*28 grayscale images (represented as 784-dimensional vectors)
np.shape(X)

(70000, 784)

In [4]:
X.min()

0.0

In [5]:
# Look at the max/min values in the data
X.max()

255.0
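
Because each row is a flattened 28*28 image with intensities between 0 and 255, a row can be reshaped for plotting and divided by 255 to land in [0,1]. The following is a minimal sketch on a random stand-in vector, not on the real data; the names `fake_digit`, `image` and `scaled` are illustrative only.

```python
import numpy as np

# Stand-in for one MNIST row: 784 pixel intensities in [0, 255] (random, not real data)
rng = np.random.default_rng(0)
fake_digit = rng.integers(0, 256, size=784).astype(float)

# Restore the 2D layout of the image
image = fake_digit.reshape(28, 28)

# Map the pixel intensities from [0, 255] to [0, 1]
scaled = image / 255.0

print(image.shape)
print(scaled.min() >= 0.0, scaled.max() <= 1.0)
```

With the real data, `X[0]` would take the place of `fake_digit`, and `plt.imshow(image, cmap='gray')` would display the digit.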

### E1.1: Cross-Validation and Support Vector Machines
Train and optimize a C-SVM classifier on MNIST (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
* use an RBF kernel
* use *random search* with cross-validation to find the best settings for *gamma* and *C* (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)
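
One possible way to set up such a random search is to draw *C* and *gamma* from continuous log-uniform distributions instead of fixed lists. The sketch below is only an illustration under assumptions: it uses the small 8x8 `load_digits` data set as a fast stand-in for MNIST, and the distribution bounds, `n_iter` and the variable names are arbitrary choices, not values prescribed by the exercise.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Small stand-in data set (8x8 digits), used only to keep the sketch fast
X_small, y_small = load_digits(return_X_y=True)
X_small, y_small = X_small[:500], y_small[:500]

# Continuous log-uniform ranges; the bounds are illustrative assumptions
param_distributions = {
    "C": loguniform(1e-1, 1e3),
    "gamma": loguniform(1e-5, 1e-1),
}

search = RandomizedSearchCV(
    estimator=SVC(kernel="rbf"),
    param_distributions=param_distributions,
    n_iter=8,         # number of random settings to draw
    cv=3,             # 3-fold cross-validation per setting
    random_state=0,   # reproducible draws
)
search.fit(X_small, y_small)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Sampling on a log scale is a common choice here because useful values of *C* and *gamma* tend to span several orders of magnitude.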

In [0]:
import sklearn.model_selection
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.9)

In [0]:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV

model = SVC(kernel='rbf')
# Candidate values for C and gamma: 3 * 4 = 12 combinations in total
rand_list = {"C": [1, 10, 100], "gamma": ['scale', 0.0001, 1, 0.1]}

# n_iter=32 exceeds the 12 available combinations, so at most 12 settings are evaluated
clf = RandomizedSearchCV(estimator=model, param_distributions=rand_list, random_state=0, n_jobs=4, n_iter=32, cv=3)

In [0]:
%%time
# Fit the search on the first 1000 training samples only
search = clf.fit(X_train[:1000], y_train[:1000])
search.best_params_



CPU times: user 1.21 s, sys: 64.2 ms, total: 1.27 s
Wall time: 26.8 s


In [0]:
search.best_params_

{'C': 10, 'gamma': 'scale'}
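
To see how `gamma='scale'` relates to the fixed values in the search (0.0001, 0.1, 1), it can help to evaluate the heuristic that scikit-learn documents for it, `1 / (n_features * X.var())`. The sketch below does this on random stand-in data with the same value range and dimensionality as the unscaled pixels; the real MNIST variance will differ, so the number is only an order-of-magnitude illustration.

```python
import numpy as np

# Stand-in for unscaled pixel data: values in [0, 255], 784 features (random, not MNIST)
rng = np.random.default_rng(0)
X_fake = rng.integers(0, 256, size=(1000, 784)).astype(float)

# Documented heuristic behind gamma='scale'
n_features = X_fake.shape[1]
gamma_scale = 1.0 / (n_features * X_fake.var())

print(gamma_scale)
print(gamma_scale < 1e-4)
```

On data in this range the heuristic yields a value several orders of magnitude below 0.0001, which suggests that the fixed candidates may all be too large for unscaled inputs.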

In [0]:
search.cv_results_

{'mean_fit_time': array([1.23795462, 2.21893406, 2.18677457, 2.18830919, 1.41883755,
        2.207769  , 2.20731886, 2.18407178, 1.39622553, 2.25866787,
        2.16642372, 1.78776185]),
 'mean_score_time': array([0.49466236, 0.63097517, 0.58576608, 0.58494512, 0.4852097 ,
        0.63734357, 0.58997869, 0.60050519, 0.4879969 , 0.61797722,
        0.59142804, 0.47016231]),
 'mean_test_score': array([0.89400179, 0.11800123, 0.11800123, 0.11800123, 0.91400382,
        0.11800123, 0.11800123, 0.11800123, 0.91400382, 0.11800123,
        0.11800123, 0.11800123]),
 'param_C': masked_array(data=[1, 1, 1, 1, 10, 10, 10, 10, 100, 100, 100, 100],
              mask=[False, False, False, False, False, False, False, False,
                    False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_gamma': masked_array(data=['scale', 0.0001, 1, 0.1, 'scale', 0.0001, 1, 0.1,
                    'scale', 0.0001, 1, 0.1],
              mask=[False, False, False, False,
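
The raw `cv_results_` dictionary is hard to read. A short sketch of how the settings could be ranked by their mean cross-validation score follows; the arrays are copied from the output above, and the variable names are illustrative.

```python
import numpy as np

# Values copied from the cv_results_ output above
mean_test_score = np.array([
    0.89400179, 0.11800123, 0.11800123, 0.11800123,
    0.91400382, 0.11800123, 0.11800123, 0.11800123,
    0.91400382, 0.11800123, 0.11800123, 0.11800123,
])
param_C = [1, 1, 1, 1, 10, 10, 10, 10, 100, 100, 100, 100]
param_gamma = ["scale", 0.0001, 1, 0.1] * 3

# Sort settings from best to worst score; a stable sort keeps ties in their original order
order = np.argsort(-mean_test_score, kind="stable")
for i in order[:3]:
    print(f"C={param_C[i]}, gamma={param_gamma[i]}: {mean_test_score[i]:.3f}")
# C=10, gamma=scale: 0.914
# C=100, gamma=scale: 0.914
# C=1, gamma=scale: 0.894
```

Only the three `gamma='scale'` settings score well; every fixed *gamma* value ends up at about 0.118 regardless of *C*.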

In [0]:
# Retrain on the full training set with the best settings found by the search
clf_SVC = SVC(kernel='rbf', gamma='scale', C=10).fit(X_train, y_train)
clf_SVC.score(X_test, y_test)

0.986

In [0]:
# Baseline: SVC with default hyperparameters
clf_SVC2 = SVC(kernel='rbf').fit(X_train,y_train)
clf_SVC2.score(X_test,y_test)

0.9802857142857143

### E1.2: Pipelines and simple Neural Networks
Split the MNIST data into training and test sets, then train and evaluate a simple multi-layer perceptron (MLP) network. Since the non-linear activation functions of MLPs are sensitive to the scale of the input (recall the *sigmoid* function), we need to scale all input values to [0,1].
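
The sensitivity to input scale can be made concrete with the derivative of *tanh*, which is `1 - tanh(z)^2`. The sketch below compares the gradient for an unscaled and a scaled pixel value; the weight `w = 0.1` is a hypothetical value chosen only for illustration.

```python
import numpy as np

def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

w = 0.1                        # hypothetical weight
raw_pixel = 255.0              # unscaled input
scaled_pixel = 255.0 / 255.0   # the same input scaled to [0, 1]

grad_raw = tanh_grad(w * raw_pixel)        # pre-activation 25.5: unit is saturated
grad_scaled = tanh_grad(w * scaled_pixel)  # pre-activation 0.1: unit is in its linear range

print(grad_raw)
print(grad_scaled)
```

With the unscaled input the gradient is numerically zero, so gradient-based training makes almost no progress for that unit, while the scaled input keeps the gradient close to 1.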

* combine all steps of your training in a scikit-learn pipeline (https://scikit-learn.org/stable/modules/compose.html#pipeline)
* use a scikit-learn scaler to scale the data (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html); note that `StandardScaler` standardizes each feature to zero mean and unit variance, whereas `MinMaxScaler` maps it to [0,1]
* MLP Parameters: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier
    * use the *SGD* solver
    * use *tanh* as activation function
    * compare networks with 1, 2 and 3 layers, use different numbers of neurons per layer
    * adjust training parameters *alpha* (regularization) and *learning rate* - how sensitive is the model to these parameters?
    * Hint: do not change all parameters at the same time, split into several experiments
* How hard is it to find the best parameters? How many experiments would you need to find the best parameters?
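
One way to organize such an experiment series is to vary a single parameter in a loop and keep everything else fixed. The sketch below varies only the number of hidden layers. It is an illustration under assumptions: it uses the small `load_digits` data set as a fast stand-in for MNIST, swaps in `MinMaxScaler` to obtain inputs in [0,1], and the layer sizes, `learning_rate_init` and `max_iter` are arbitrary choices.

```python
import warnings

from sklearn.datasets import load_digits
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Small stand-in data set (8x8 digits) so the sketch runs in seconds
X_small, y_small = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_small, y_small, train_size=0.8, random_state=0
)

# One experiment series: vary only the architecture, keep everything else fixed
architectures = [(32,), (32, 32), (32, 32, 32)]  # illustrative choices
scores = {}
for hidden in architectures:
    pipe = make_pipeline(
        MinMaxScaler(),  # maps each feature of the training data to [0, 1]
        MLPClassifier(
            activation="tanh",
            solver="sgd",
            hidden_layer_sizes=hidden,
            alpha=1e-3,
            learning_rate_init=0.01,
            max_iter=100,
            random_state=0,
        ),
    )
    with warnings.catch_warnings():
        # Short training runs may stop before convergence; silence that warning here
        warnings.simplefilter("ignore", category=ConvergenceWarning)
        pipe.fit(X_tr, y_tr)
    scores[hidden] = pipe.score(X_te, y_te)

for hidden, score in scores.items():
    print(hidden, round(score, 3))
```

Fixing `random_state` keeps the runs comparable, so differences in the test score can be attributed to the architecture rather than to the weight initialization.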
    

In [7]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier

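# Pipeline: standardize the inputs, then train an MLP with two hidden layers of 64 neurons each
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(activation='tanh', solver='sgd', hidden_layer_sizes=(64, 64),
                  alpha=1e-3, learning_rate='constant'),
)
clf.fit(X_train, y_train)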



Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('mlpclassifier',
                 MLPClassifier(activation='tanh', alpha=0.001,
                               batch_size='auto', beta_1=0.9, beta_2=0.999,
                               early_stopping=False, epsilon=1e-08,
                               hidden_layer_sizes=(64, 64),
                               learning_rate='constant',
                               learning_rate_init=0.001, max_fun=15000,
                               max_iter=200, momentum=0.9, n_iter_no_change=10,
                               nesterovs_momentum=True, power_t=0.5,
                               random_state=None, shuffle=True, solver='sgd',
                               tol=0.0001, validation_fraction=0.1,
                               verbose=False, warm_start=False))],
         verbose=False)

In [8]:
print(clf.score(X_train, y_train))
print(clf.score(X_test, y_test))

0.9974920634920635
0.9621428571428572


Log: settings and test score

hidden_layer_sizes=(5, 2), alpha=1e-5, learning_rate='invscaling': 0.2097

hidden_layer_sizes=(5,4,3), alpha=1e-5, learning_rate='invscaling': 0.257

hidden_layer_sizes=(5,4,3), alpha=1e-3, learning_rate='invscaling': 0.3258

hidden_layer_sizes=(5,4,3), alpha=1e-3, learning_rate='adaptive': 0.8291

hidden_layer_sizes=(10,4,3), alpha=1e-3, learning_rate='adaptive': 0.8832

hidden_layer_sizes=(5), alpha=1e-3, learning_rate='constant': 0.8 - does not converge

hidden_layer_sizes=(5), alpha=1e-3, learning_rate='constant': 0.8847 - does not converge

hidden_layer_sizes=(1), alpha=1e-3, learning_rate='constant': 0.33 - does not converge
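
The settings recorded above cover only a handful of combinations. To get a feeling for how quickly an exhaustive search grows, the number of experiments in a full grid can be counted directly. The candidate values in the sketch below are hypothetical and chosen only to illustrate the arithmetic.

```python
from itertools import product

# Hypothetical candidate values for each hyperparameter
grid = {
    "hidden_layer_sizes": [(5,), (10,), (5, 4, 3), (10, 4, 3)],
    "alpha": [1e-5, 1e-4, 1e-3],
    "learning_rate": ["constant", "invscaling", "adaptive"],
    "learning_rate_init": [1e-3, 1e-2, 1e-1],
}

# Every combination of one value per hyperparameter is one experiment
combinations = list(product(*grid.values()))
print(len(combinations))  # 4 * 3 * 3 * 3 = 108
```

Even this small grid requires 108 training runs, and each additional hyperparameter multiplies the count again, which is one motivation for random search over exhaustive grids.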

