# Neural Networks with scikit-learn

Neural networks are among the most complex and flexible machine learning models, which lets them tackle very complex problems. That same flexibility also makes them easier to overfit, so it is the data scientist's job to find suitable hyperparameters for these models.

To do this, we'll use scikit-learn in the same way as before. Neural networks, however, have many more than one parameter. For the multi-layer perceptron models in scikit-learn, the parameters are:

* hidden_layer_sizes
* activation
* solver
* alpha
* batch_size
* learning_rate
* learning_rate_init
* power_t
* max_iter
* shuffle
* random_state
* tol
* verbose
* warm_start
* momentum
* nesterovs_momentum
* early_stopping
* validation_fraction
* beta_1
* beta_2
* epsilon
* n_iter_no_change

You can read more about each one [here](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).
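The same list can also be inspected from code. The following is a minimal sketch using `get_params()`, which every scikit-learn estimator provides; the default values printed may differ between scikit-learn versions.

```python
from sklearn.neural_network import MLPClassifier

# Build an unfitted classifier and list its constructor parameters with their defaults.
default_params = MLPClassifier().get_params()

for name, value in sorted(default_params.items()):
    print(f"{name} = {value!r}")
```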

## The data (MNIST)

Let's load the digit data from MNIST.

In [1]:
import pandas as pd


data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv") # Only for Kaggle

We convert the label column to the category type and take a look at the now-familiar data.

In [2]:
data.label = data.label.astype("category")

data.head()

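|  | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |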


We split the data set into a test and training set.

In [3]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(data.drop("label", axis = 1), data.label, test_size = 0.3)
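The split above is random, so the scores in this notebook change from run to run. As a hedged sketch on made-up toy data (not the digit CSV), passing `random_state` makes the split repeatable and `stratify` keeps the class proportions the same in both parts. Neither argument is used in the cell above; they are shown here only as an option.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the digit data: 100 samples, 10 classes with 10 samples each.
X = np.arange(200).reshape(100, 2)
y = np.repeat(np.arange(10), 10)

# random_state fixes the shuffle; stratify keeps class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))  # number of test samples per class
```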

## The network

This classifier is a dense neural network, which can be visualized as:

![By Glosser.ca - Own work, Derivative of File:Artificial neural network.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24913461](resources/nnet.png)

This network architecture is the **MLPClassifier** in scikit-learn. First we will try a network with one hidden layer and the default parameters.

In [4]:
from sklearn.neural_network import MLPClassifier

net = MLPClassifier(verbose = True)
net.fit(x_train, y_train)

Iteration 1, loss = 10.26146455
Iteration 2, loss = 1.83843986
Iteration 3, loss = 1.04772283
Iteration 4, loss = 0.67763113
Iteration 5, loss = 0.46146341
Iteration 6, loss = 0.34692434
Iteration 7, loss = 0.26143220
Iteration 8, loss = 0.20229604
Iteration 9, loss = 0.16285930
Iteration 10, loss = 0.13975919
Iteration 11, loss = 0.11227159
Iteration 12, loss = 0.11205778
Iteration 13, loss = 0.08230610
Iteration 14, loss = 0.07142090
Iteration 15, loss = 0.07541603
Iteration 16, loss = 0.07819111
Iteration 17, loss = 0.08094314
Iteration 18, loss = 0.07275204
Iteration 19, loss = 0.07556990
Iteration 20, loss = 0.06547626
Iteration 21, loss = 0.07011028
Iteration 22, loss = 0.07428769
Iteration 23, loss = 0.05611521
Iteration 24, loss = 0.06898180
Iteration 25, loss = 0.09274774
Iteration 26, loss = 0.09507757
Iteration 27, loss = 0.09063448
Iteration 28, loss = 0.07772569
Iteration 29, loss = 0.07285178
Iteration 30, loss = 0.06775419
Iteration 31, loss = 0.07997448
Iteration 32, lo

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=None, shuffle=True, solver='adam', tol=0.0001,
              validation_fraction=0.1, verbose=True, warm_start=False)

Notice the default parameters printed above, after the loss for each epoch (iteration).

Now we can check its accuracy on the test set.

In [5]:
net.score(x_test, y_test)

0.950952380952381
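For a classifier, `score` returns the mean accuracy, that is, the fraction of samples whose predicted label matches the true one. The sketch below illustrates this on the small 8x8 digits data set bundled with scikit-learn rather than the MNIST CSV, with a smaller network and fewer iterations chosen only to keep it quick.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small 8x8 digits data set that ships with scikit-learn (not the MNIST CSV used above).
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# A small network and few epochs keep the sketch fast; it may stop before converging.
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=50, random_state=0)
clf.fit(X_tr, y_tr)

# score() is the fraction of test samples whose predicted label is correct.
manual_accuracy = accuracy_score(y_te, clf.predict(X_te))
print(clf.score(X_te, y_te), manual_accuracy)
```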

Out of the box, this model will give us an accuracy of 95-96%. Let's explore some basic parameters and what they mean.

### Learning rate (learning_rate_init)

This parameter controls the size of the "steps" a network takes when performing gradient descent. If it is too low, the network will take very long to converge; if it is too high, it might diverge instead. Let's see some examples.
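Before training real networks, the effect can be sketched on a toy problem. The example below runs plain gradient descent on the one-dimensional function f(w) = w², whose minimum is at 0. It is only an illustration: the function, the step counts and the three rates are assumptions, and `MLPClassifier` uses the adam solver by default rather than plain gradient descent.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; the gradient is 2*w."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

w_too_low = gradient_descent(lr=0.001)   # barely moves away from the start
w_good = gradient_descent(lr=0.1)        # approaches the minimum at 0
w_too_high = gradient_descent(lr=1.5)    # each step overshoots and |w| grows

print(w_too_low, w_good, w_too_high)
```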

In [6]:
learning_rates = [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1]
scores = []

for lr in learning_rates:
    print(f"-----------Starting training with lr {lr}")
    model = MLPClassifier(learning_rate_init = lr, verbose = True, max_iter=80)
    model.fit(x_train, y_train)
    score = model.score(x_test, y_test)
    scores.append(score)
    print(f"-----------Finished lr {lr} with score {score}")

-----------Starting training with lr 1e-06
Iteration 1, loss = inf
Iteration 2, loss = 156.24378280
Iteration 3, loss = 145.88952965
Iteration 4, loss = 136.65493102
Iteration 5, loss = 128.35060788
Iteration 6, loss = 120.84511087
Iteration 7, loss = 114.02259303
Iteration 8, loss = 107.78673363
Iteration 9, loss = 102.07041656
Iteration 10, loss = 96.79393813
Iteration 11, loss = 91.88569928
Iteration 12, loss = 87.28857803
Iteration 13, loss = 82.99309977
Iteration 14, loss = 78.96251148
Iteration 15, loss = 75.15411581
Iteration 16, loss = 71.55755806
Iteration 17, loss = 68.17757104
Iteration 18, loss = 65.00437283
Iteration 19, loss = 62.02097122
Iteration 20, loss = 59.22087523
Iteration 21, loss = 56.59936060
Iteration 22, loss = 54.14548791
Iteration 23, loss = 51.83451477
Iteration 24, loss = 49.65551517
Iteration 25, loss = 47.59807189
Iteration 26, loss = 45.64806810
Iteration 27, loss = 43.80259151
Iteration 28, loss = 42.06786180
Iteration 29, loss = 40.43697886
Iteration



-----------Finished lr 1e-06 with score 0.7071428571428572
-----------Starting training with lr 1e-05
Iteration 1, loss = 120.38953030
Iteration 2, loss = 72.39161349
Iteration 3, loss = 50.07755297
Iteration 4, loss = 37.86612161
Iteration 5, loss = 30.16467350
Iteration 6, loss = 24.97445868
Iteration 7, loss = 21.32192646
Iteration 8, loss = 18.62356289
Iteration 9, loss = 16.53949361
Iteration 10, loss = 14.89907361
Iteration 11, loss = 13.56506011
Iteration 12, loss = 12.45956299
Iteration 13, loss = 11.52444344
Iteration 14, loss = 10.72148254
Iteration 15, loss = 10.02897873
Iteration 16, loss = 9.41851140
Iteration 17, loss = 8.87404891
Iteration 18, loss = 8.39365407
Iteration 19, loss = 7.95478608
Iteration 20, loss = 7.55543364
Iteration 21, loss = 7.19088843
Iteration 22, loss = 6.86566811
Iteration 23, loss = 6.56814635
Iteration 24, loss = 6.29650767
Iteration 25, loss = 6.04073482
Iteration 26, loss = 5.80400939
Iteration 27, loss = 5.58546586
Iteration 28, loss = 5.3790



-----------Finished lr 1e-05 with score 0.8918253968253969
-----------Starting training with lr 0.0001
Iteration 1, loss = 39.82718742
Iteration 2, loss = 9.60506738
Iteration 3, loss = 6.34727149
Iteration 4, loss = 4.81116778
Iteration 5, loss = 3.89284095
Iteration 6, loss = 3.23600823
Iteration 7, loss = 2.75015119
Iteration 8, loss = 2.37951866
Iteration 9, loss = 2.08290071
Iteration 10, loss = 1.83393093
Iteration 11, loss = 1.60649796
Iteration 12, loss = 1.42051504
Iteration 13, loss = 1.25654700
Iteration 14, loss = 1.12278612
Iteration 15, loss = 0.98957145
Iteration 16, loss = 0.88496243
Iteration 17, loss = 0.77112971
Iteration 18, loss = 0.68529236
Iteration 19, loss = 0.60306707
Iteration 20, loss = 0.53637999
Iteration 21, loss = 0.47121709
Iteration 22, loss = 0.41968900
Iteration 23, loss = 0.37116154
Iteration 24, loss = 0.32409773
Iteration 25, loss = 0.28549747
Iteration 26, loss = 0.25436936
Iteration 27, loss = 0.22106471
Iteration 28, loss = 0.19283029
Iteration



Iteration 1, loss = 9.39810668
Iteration 2, loss = 1.87911905
Iteration 3, loss = 1.02345366
Iteration 4, loss = 0.64248040
Iteration 5, loss = 0.44456914
Iteration 6, loss = 0.31535503
Iteration 7, loss = 0.23779426
Iteration 8, loss = 0.17982226
Iteration 9, loss = 0.15677850
Iteration 10, loss = 0.11214531
Iteration 11, loss = 0.09188056
Iteration 12, loss = 0.08868267
Iteration 13, loss = 0.08221899
Iteration 14, loss = 0.08372344
Iteration 15, loss = 0.08399587
Iteration 16, loss = 0.07581939
Iteration 17, loss = 0.07074609
Iteration 18, loss = 0.07462386
Iteration 19, loss = 0.08444302
Iteration 20, loss = 0.06912944
Iteration 21, loss = 0.06895907
Iteration 22, loss = 0.08391247
Iteration 23, loss = 0.10152134
Iteration 24, loss = 0.08389736
Iteration 25, loss = 0.08279019
Iteration 26, loss = 0.07106654
Iteration 27, loss = 0.05504027
Iteration 28, loss = 0.05303933
Iteration 29, loss = 0.07265740
Iteration 30, loss = 0.07223835
Iteration 31, loss = 0.09674595
Iteration 32, los

Now we visualize the results.

In [7]:
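import matplotlib.pyplot as plt

plt.xscale("log")  # learning rates span several orders of magnitude
plt.xlabel("learning_rate_init")
plt.ylabel("test accuracy")
plt.plot(learning_rates, scores)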

[<matplotlib.lines.Line2D at 0x22781b2f438>]

We can see that once the learning rate becomes too high, accuracy drops sharply because the model diverges. When the learning rate is very low, accuracy is also lower than with the moderate learning rates that follow it, likely because the network has not converged within the 80 iterations allowed.

Now let's try modifying the layers. We will try 1 hidden layer with 100 neurons (the default), 2 hidden layers with 20 and 10 neurons, 3 hidden layers with 30, 20 and 10 neurons, and 4 hidden layers with 100, 30, 20 and 10 neurons. We will use the default learning rate of 0.001.
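As a rough sketch of what `hidden_layer_sizes` controls, the example below fits a tiny network on the 8x8 digits data set bundled with scikit-learn (64 input features, 10 classes, not the MNIST CSV) and prints the shape of each weight matrix. The data set and the very low `max_iter` are assumptions made only to keep it fast.

```python
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

# 8x8 digits bundled with scikit-learn: 64 input features, 10 classes.
X, y = load_digits(return_X_y=True)

# Two hidden layers with 20 and 10 neurons; a few epochs are enough to build the weights.
model = MLPClassifier(hidden_layer_sizes=(20, 10), max_iter=5, random_state=0)
model.fit(X, y)

# One weight matrix per pair of consecutive layers: input->20, 20->10, 10->output.
weight_shapes = [w.shape for w in model.coefs_]
print(weight_shapes)
```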

In [8]:
# Each entry is a tuple of hidden layer sizes; note the trailing comma in (100,).
layers = [(100,), (20, 10), (30, 20, 10), (100, 30, 20, 10)]
scores = []

for lyr in layers:
    print(f"-----------Starting training with layers {lyr}")
    model = MLPClassifier(hidden_layer_sizes = lyr, verbose = True, max_iter=80, n_iter_no_change = 20)
    model.fit(x_train, y_train)
    score = model.score(x_test, y_test)
    scores.append(score)
    print(f"-----------Finished layers {lyr} with score {score}")

-----------Starting training with layers (100,)
Iteration 1, loss = 9.77190092
Iteration 2, loss = 1.84778452
Iteration 3, loss = 1.06915906
Iteration 4, loss = 0.65737945
Iteration 5, loss = 0.47222375
Iteration 6, loss = 0.33524773
Iteration 7, loss = 0.23453425
Iteration 8, loss = 0.19431840
Iteration 9, loss = 0.15883297
Iteration 10, loss = 0.11552765
Iteration 11, loss = 0.10408286
Iteration 12, loss = 0.08472054
Iteration 13, loss = 0.08800138
Iteration 14, loss = 0.07162908
Iteration 15, loss = 0.07725046
Iteration 16, loss = 0.11412627
Iteration 17, loss = 0.13420784
Iteration 18, loss = 0.11038861
Iteration 19, loss = 0.09104652
Iteration 20, loss = 0.08624993
Iteration 21, loss = 0.09540535
Iteration 22, loss = 0.07629342
Iteration 23, loss = 0.09543054
Iteration 24, loss = 0.09291153
Iteration 25, loss = 0.07364293
Iteration 26, loss = 0.07861657
Iteration 27, loss = 0.08636618
Iteration 28, loss = 0.08915543
Iteration 29, loss = 0.07590769
Iteration 30, loss = 0.07158298
Iteratio



Iteration 1, loss = 5.37796498
Iteration 2, loss = 1.88288373
Iteration 3, loss = 1.66145013
Iteration 4, loss = 1.52197386
Iteration 5, loss = 1.43069066
Iteration 6, loss = 1.36084339
Iteration 7, loss = 1.28264130
Iteration 8, loss = 1.21824234
Iteration 9, loss = 1.14936803
Iteration 10, loss = 1.08493784
Iteration 11, loss = 1.05027312
Iteration 12, loss = 1.01327585
Iteration 13, loss = 0.97116662
Iteration 14, loss = 0.93667574
Iteration 15, loss = 0.80650160
Iteration 16, loss = 0.72566648
Iteration 17, loss = 0.66646423
Iteration 18, loss = 0.63043847
Iteration 19, loss = 0.59619426
Iteration 20, loss = 0.56389703
Iteration 21, loss = 0.53846401
Iteration 22, loss = 0.51735129
Iteration 23, loss = 0.49437060
Iteration 24, loss = 0.47040671
Iteration 25, loss = 0.46020348
Iteration 26, loss = 0.43466796
Iteration 27, loss = 0.42085945
Iteration 28, loss = 0.41013924
Iteration 29, loss = 0.39098277
Iteration 30, loss = 0.37658932
Iteration 31, loss = 0.37192073
Iteration 32, los



Iteration 1, loss = 3.19995980
Iteration 2, loss = 1.25213142
Iteration 3, loss = 0.90707957
Iteration 4, loss = 0.70432685
Iteration 5, loss = 0.59491343
Iteration 6, loss = 0.52851372
Iteration 7, loss = 0.47636893
Iteration 8, loss = 0.42845556
Iteration 9, loss = 0.39029889
Iteration 10, loss = 0.37044696
Iteration 11, loss = 0.34068150
Iteration 12, loss = 0.32544436
Iteration 13, loss = 0.30509389
Iteration 14, loss = 0.29483925
Iteration 15, loss = 0.28080200
Iteration 16, loss = 0.26580297
Iteration 17, loss = 0.25951228
Iteration 18, loss = 0.24947619
Iteration 19, loss = 0.23922325
Iteration 20, loss = 0.23197527
Iteration 21, loss = 0.22344993
Iteration 22, loss = 0.22152229
Iteration 23, loss = 0.21365853
Iteration 24, loss = 0.20788750
Iteration 25, loss = 0.20860865
Iteration 26, loss = 0.19322361
Iteration 27, loss = 0.18753729
Iteration 28, loss = 0.18174643
Iteration 29, loss = 0.18099395
Iteration 30, loss = 0.17759237
Iteration 31, loss = 0.17341455
Iteration 32, los



Iteration 1, loss = 3.12481554
Iteration 2, loss = 1.14091593
Iteration 3, loss = 0.83582234
Iteration 4, loss = 0.55727279
Iteration 5, loss = 0.38138987
Iteration 6, loss = 0.29706384
Iteration 7, loss = 0.23998337
Iteration 8, loss = 0.20768927
Iteration 9, loss = 0.18221275
Iteration 10, loss = 0.17088626
Iteration 11, loss = 0.13527509
Iteration 12, loss = 0.12912826
Iteration 13, loss = 0.10910068
Iteration 14, loss = 0.10089850
Iteration 15, loss = 0.09657211
Iteration 16, loss = 0.09330860
Iteration 17, loss = 0.08313284
Iteration 18, loss = 0.06835692
Iteration 19, loss = 0.06487736
Iteration 20, loss = 0.06310378
Iteration 21, loss = 0.06426749
Iteration 22, loss = 0.06353525
Iteration 23, loss = 0.05415676
Iteration 24, loss = 0.05116692
Iteration 25, loss = 0.05750300
Iteration 26, loss = 0.05026858
Iteration 27, loss = 0.04011395
Iteration 28, loss = 0.04663917
Iteration 29, loss = 0.05186012
Iteration 30, loss = 0.04645203
Iteration 31, loss = 0.03586358
Iteration 32, los






We can see that all results are similar. Let's try something a bit deeper (this takes a long time to execute).

In [9]:
#model = MLPClassifier(hidden_layer_sizes = (2500, 2000, 1500, 1000, 500), verbose = True, max_iter = 200, n_iter_no_change = 20, tol = 0.00005)
#model.fit(x_train, y_train)
#model.score(x_test, y_test)
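To get a feel for why the cell above takes so long, we can count the weights and biases each architecture has to learn. This is a back-of-the-envelope sketch that assumes 784 pixel inputs and 10 digit classes; `count_parameters` is a helper written for this illustration, not a scikit-learn function.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs):
    """Count weights and biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has (inputs * outputs) weights plus one bias per output neuron.
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

# Assumed here: 784 pixel inputs and 10 digit classes.
default_net = count_parameters(784, (100,), 10)
deep_net = count_parameters(784, (2500, 2000, 1500, 1000, 500), 10)

print(default_net, deep_net)
```

Under these assumptions the default network has 79,510 parameters, while the deep one has 11,972,510, roughly 150 times more.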

To look for the right parameters without trying each combination by hand, let's use a scikit-learn class called **GridSearchCV**. It builds every combination of the parameter values we give it, trains the network with each one, and evaluates it with cross-validation.
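To make "every combination" concrete, scikit-learn's `ParameterGrid` can enumerate a grid without training anything. The grid below is a hypothetical two-parameter example, smaller than the one searched in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical two-parameter grid, smaller than the one searched in this notebook.
toy_grid = {
    "hidden_layer_sizes": [(100,), (20, 10)],
    "learning_rate_init": [0.001, 0.01],
}

# Every pairing of one value per parameter: 2 * 2 combinations.
combinations = list(ParameterGrid(toy_grid))
for params in combinations:
    print(params)

print(len(combinations), "combinations")
```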

In [None]:
from sklearn.model_selection import GridSearchCV

model = MLPClassifier(verbose = True)

# 3 * 3 * 4 * 1 * 2 = 72 combinations, each fitted 5 times (cv = 5).
parameters = {
    "hidden_layer_sizes": [(100,), (20, 10), (30, 20, 10)],
    "learning_rate_init": [0.0009, 0.001, 0.002],
    "batch_size": [30000, 1000, 200, 1],
    "max_iter": [100],
    "tol": [0.00001, 0.000005]
}

# Cross-validation makes its own splits, so we search on the full data set.
search = GridSearchCV(model, parameters, cv = 5)
search.fit(data.drop("label", axis = 1), data.label)
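Once a search has finished, the fitted object exposes the best combination and its score. The sketch below shows how these can be read. To stay quick it uses the small digits data set bundled with scikit-learn and a hypothetical, much smaller grid, so its numbers say nothing about the MNIST search above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Small bundled digits data set so the sketch finishes quickly (not the MNIST CSV).
X, y = load_digits(return_X_y=True)

# Hypothetical grid with 2 * 2 = 4 combinations.
small_grid = {
    "hidden_layer_sizes": [(32,), (16, 8)],
    "learning_rate_init": [0.001, 0.01],
}

small_search = GridSearchCV(
    MLPClassifier(max_iter=50, random_state=0), small_grid, cv=3
)
small_search.fit(X, y)

# Best combination and its mean cross-validated accuracy.
print(small_search.best_params_)
print(small_search.best_score_)

# By default the best model is refit on all the data and exposed as best_estimator_.
best_model = small_search.best_estimator_
```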