In [1]:
#adapted from https://iamtrask.github.io/2015/07/27/python-network-part2/
import numpy as np # linear algebra

In [8]:
alphas = [0.001,0.01,0.1,1,10,100,1000]

In [35]:
def sigmoid(x):
    return (1/(1+np.exp(-x)))

In [36]:
def sigmoid_derivative(sigmoid_output):
    return (sigmoid_output*(1-sigmoid_output))
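
Note that `sigmoid_derivative` expects the *output* of `sigmoid`, not the raw input. A quick way to convince yourself is to compare it against a finite-difference estimate. The sketch below is only an illustration: the names `x`, `h`, `numeric`, `analytic`, and `max_gap` are not part of the original notebook.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(sigmoid_output):
    # Expects the output of sigmoid(x), not x itself
    return sigmoid_output * (1 - sigmoid_output)

x = np.linspace(-4, 4, 9)   # illustrative sample points (assumption)
h = 1e-5                    # finite-difference step size (assumption)

# Central-difference estimate of d/dx sigmoid(x)
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
# Analytic slope, computed from the sigmoid output
analytic = sigmoid_derivative(sigmoid(x))

max_gap = np.max(np.abs(numeric - analytic))
print('Largest gap between numeric and analytic slope:', max_gap)
```

If the two agree to many decimal places, the derivative helper is being called the way the training loop calls it.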

In [11]:
X = np.array([[0,0,1],[0,1,1], [1,0,1], [1,1,1]])

In [12]:
y = np.array([[0],[1],[1],[0]])

In [17]:
np.random.seed(1)

In [18]:
#randomly initialize our weights with mean 0
synapse_0 = 2*np.random.random((3,4)) - 1
synapse_1 = 2*np.random.random((4,1)) - 1
print ('Synapse 0: \n', synapse_0,'\nSynapse 1: \n',synapse_1)

Synapse 0: 
 [[-0.16595599  0.44064899 -0.99977125 -0.39533485]
 [-0.70648822 -0.81532281 -0.62747958 -0.30887855]
 [-0.20646505  0.07763347 -0.16161097  0.370439  ]] 
Synapse 1: 
 [[-0.5910955 ]
 [ 0.75623487]
 [-0.94522481]
 [ 0.34093502]]


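We will be creating a two-layer neural network. The first weight matrix, `synapse_0`, has shape (3, 4): it connects the 3 input values to 4 hidden neurons. The second weight matrix, `synapse_1`, has shape (4, 1): it connects the 4 hidden neurons to a single output neuron.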
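
To make the layer sizes concrete, here is a minimal sketch that runs one forward pass and prints the shape at every stage. It reuses the notebook's `X`, seed, and weight initialization; nothing here trains the network.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])

np.random.seed(1)
synapse_0 = 2 * np.random.random((3, 4)) - 1   # 3 inputs -> 4 hidden neurons
synapse_1 = 2 * np.random.random((4, 1)) - 1   # 4 hidden neurons -> 1 output

layer_1 = sigmoid(np.dot(X, synapse_0))        # one row per training example
layer_2 = sigmoid(np.dot(layer_1, synapse_1))  # one prediction per example

print(X.shape, synapse_0.shape, layer_1.shape, synapse_1.shape, layer_2.shape)
# prints: (4, 3) (3, 4) (4, 4) (4, 1) (4, 1)
```

Each of the 4 training examples flows through as one row, so the final `layer_2` lines up row for row with `y`.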

In [41]:
for alpha in alphas:
    print('\nTraining With Alpha:',alpha)
    np.random.seed(1)
    #randomly initialize our weights with mean 0
    synapse_0 = 2*np.random.random((3,4)) - 1
    synapse_1 = 2*np.random.random((4,1)) - 1
    
    for step in range(60_000):
        # Feed forward through layers 0, 1, and 2
        layer_0 = X
        layer_1 = sigmoid(np.dot(layer_0,synapse_0))
        layer_2 = sigmoid(np.dot(layer_1,synapse_1))

        # calculate how much we are off by
        layer_2_error = layer_2 - y
        # what direction is the desired value?
        # decide how much to change based on how confident we are 
        layer_2_delta = layer_2_error*sigmoid_derivative(layer_2)

        if step%10000 == 0:
            print('Error after ',step,' iterations: ', np.mean(np.abs(layer_2_error)))
        # how much did each layer_1 value contribute to the layer_2 error (according to the weights)?
        layer_1_error = layer_2_delta.dot(synapse_1.T)
        # what direction is the target layer_1? scale the change by confidence
        layer_1_delta = layer_1_error*sigmoid_derivative(layer_1)
        # gradient descent update: step against the gradient, scaled by alpha
        synapse_1 = synapse_1 - alpha * (layer_1.T.dot(layer_2_delta))
        synapse_0 = synapse_0 - alpha * (layer_0.T.dot(layer_1_delta))
    if alpha == 10:
        print('Synapse 0: \n', synapse_0,'\nSynapse 1: \n',synapse_1)


Training With Alpha: 0.001
Error after  0  iterations:  0.49641003190272537
Error after  10000  iterations:  0.49516402549338606
Error after  20000  iterations:  0.4935960431880486
Error after  30000  iterations:  0.4916063585594306
Error after  40000  iterations:  0.48910016654420474
Error after  50000  iterations:  0.48597785784615843

Training With Alpha: 0.01
Error after  0  iterations:  0.49641003190272537
Error after  10000  iterations:  0.45743107444190134
Error after  20000  iterations:  0.35909720256339894
Error after  30000  iterations:  0.2393581371589725
Error after  40000  iterations:  0.14307065901337035
Error after  50000  iterations:  0.09859642980892715

Training With Alpha: 0.1
Error after  0  iterations:  0.49641003190272537
Error after  10000  iterations:  0.0428880170001158
Error after  20000  iterations:  0.024098994228521613
Error after  30000  iterations:  0.018110652146797846
Error after  40000  iterations:  0.01498761627221092
Error after  50000  iterations: 

# Analysis
## Alpha = 0.001 
With a very small alpha, the network barely converged! The weight updates were so small that they hardly changed anything: the mean error fell only from about 0.496 to about 0.486 by iteration 50,000.
## Alpha = 0.01
This alpha produced a rather pretty convergence. It was quite smooth over the course of the 60,000 iterations, but the error was still about 0.099 at iteration 50,000, so it ultimately didn't converge as far as some of the others.
## Alpha = 0.1
This alpha made progress very quickly (the error dropped to about 0.043 within the first 10,000 iterations) but then slowed down a bit. We need to increase alpha some more.
## Alpha = 1
As a sharp eye might suspect, this converged exactly as if we had no alpha at all! Multiplying our weight updates by 1 doesn't change anything.
## Alpha = 10
An alpha greater than 1 achieved the best score after only 10,000 iterations! This tells us that our weight updates were too conservative with the smaller alphas. With alphas below 10, the network's weights were generally headed in the right direction; they just needed to hurry up and get there!
## Alpha = 100
Now we can see that taking steps that are too large can be very counterproductive. The network's steps are so large that it can't find a reasonable low point on the error surface. The alpha is too big, so the network just jumps around on the error surface and never "settles" into a local minimum.
## Alpha = 1000
And with an extremely large alpha, we see a textbook example of divergence, with the error increasing instead of decreasing and hardlining at 0.5. This is a more extreme version of overstepping, where the network overcorrects wildly and ends up very far away from any local minimum.
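
The same too-small / about-right / too-large pattern shows up on a much simpler problem. The sketch below is a toy illustration, not the network above: it runs plain gradient descent on f(w) = w**2, and the helper name `descend`, the starting point, the step count, and the chosen alphas are all assumptions made for the example.

```python
def descend(alpha, steps=50, w=1.0):
    """Run gradient descent on f(w) = w**2 and return the final |w|."""
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w = w - alpha * grad  # same update rule as the synapse updates
    return abs(w)

# The minimum is at w = 0, so a smaller final |w| means better convergence.
for alpha in [0.001, 0.1, 1.0, 1.5]:
    print('alpha =', alpha, ' final |w| =', descend(alpha))
```

On this toy function, a tiny alpha barely moves from the starting point, a moderate alpha gets very close to the minimum, an alpha of 1.0 bounces back and forth between the same two points without settling, and a larger alpha moves further away on every step. The real network's error surface is far less regular, so the thresholds differ, but the qualitative behavior matches the analysis above.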
