In [1]:
import numpy as np

alphas = [0.001,0.01,0.1,1,10,100,1000]

# compute sigmoid nonlinearity
def sigmoid(x):
    output = 1/(1+np.exp(-x))
    return output

# convert output of sigmoid function to its derivative
def sigmoid_output_to_derivative(output):
    return output*(1-output)
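
A quick sanity check, not part of the original notebook: `sigmoid_output_to_derivative` takes the sigmoid's *output*, not its input, and relies on the identity that the slope of the sigmoid equals `output*(1-output)`. The sketch below compares that shortcut against a central finite difference; the sample grid `x` and step size `h` are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_output_to_derivative(output):
    return output * (1 - output)

# Illustrative sample points and finite-difference step (assumed values)
x = np.linspace(-5, 5, 11)
h = 1e-5

# Slope estimated numerically from the sigmoid itself
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

# Slope computed from the sigmoid's output, as the notebook does
analytic = sigmoid_output_to_derivative(sigmoid(x))

max_gap = np.max(np.abs(numeric - analytic))
print("largest disagreement:", max_gap)
```

The two estimates should agree to within numerical noise, which is why the backward pass can reuse `layer_1` and `layer_2` instead of recomputing anything.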

In [2]:
X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])
                
y = np.array([[0],
              [1],
              [1],
              [0]])

In [5]:
for alpha in alphas:
    print("\nTraining With Alpha: {}".format(alpha))
    np.random.seed(1)

    # randomly initialize our weights with mean 0
    synapse_0 = 2*np.random.random((3,4)) - 1
    synapse_1 = 2*np.random.random((4,1)) - 1

    for j in range(60000):

        # Feed forward through layers 0, 1, and 2
        layer_0 = X
        layer_1 = sigmoid(np.dot(layer_0,synapse_0))
        layer_2 = sigmoid(np.dot(layer_1,synapse_1))

        # how much did we miss the target value?
        layer_2_error = layer_2 - y

        if j % 10000 == 0:
            print("Error after {} iterations:{}".format(j, np.mean(np.abs(layer_2_error))))

        # in what direction is the target value?
        # were we really sure? if so, don't change too much.
        layer_2_delta = layer_2_error*sigmoid_output_to_derivative(layer_2)

        # how much did each l1 value contribute to the l2 error (according to the weights)?
        layer_1_error = layer_2_delta.dot(synapse_1.T)

        # in what direction is the target l1?
        # were we really sure? if so, don't change too much.
        layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)

        synapse_1 -= alpha * (layer_1.T.dot(layer_2_delta))
        synapse_0 -= alpha * (layer_0.T.dot(layer_1_delta))




Training With Alpha: 0.001
Error after 0 iterations:0.49641003190272537
Error after 10000 iterations:0.49516402549338606
Error after 20000 iterations:0.4935960431880486
Error after 30000 iterations:0.4916063585594306
Error after 40000 iterations:0.48910016654420474
Error after 50000 iterations:0.48597785784615843

Training With Alpha: 0.01
Error after 0 iterations:0.49641003190272537
Error after 10000 iterations:0.45743107444190134
Error after 20000 iterations:0.359097202563399
Error after 30000 iterations:0.23935813715897253
Error after 40000 iterations:0.1430706590133703
Error after 50000 iterations:0.09859642980892719

Training With Alpha: 0.1
Error after 0 iterations:0.49641003190272537
Error after 10000 iterations:0.042888017000115755
Error after 20000 iterations:0.02409899422852161
Error after 30000 iterations:0.018110652146797843
Error after 40000 iterations:0.014987616272210912
Error after 50000 iterations:0.013014490538142586

Training With Alpha: 1
Error after 0 iterations:0

# Findings

## Alpha = 0.001 
The network with a crazy small alpha hardly converged! The weight updates were so small that they barely changed anything: the error only moved from about 0.496 to about 0.486 over the first 50,000 of the 60,000 iterations. This is textbook Problem 3: When Slopes Are Too Small.
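
To see the effect in isolation, here is a minimal sketch on a toy problem rather than the network above. The function `descend`, the quadratic `f(w) = (w - target)**2`, and the step count are all assumptions made for illustration.

```python
def descend(alpha, steps=100, w=0.0, target=3.0):
    """Run plain gradient descent on the toy loss f(w) = (w - target)**2."""
    for _ in range(steps):
        gradient = 2 * (w - target)   # derivative of the toy loss
        w -= alpha * gradient         # same update rule, scaled by alpha
    return w

for alpha in (0.001, 0.1):
    distance = abs(descend(alpha) - 3.0)
    print("alpha={}: remaining distance to minimum = {:.6f}".format(alpha, distance))
```

With the smaller alpha, the iterate covers less than a fifth of the distance to the minimum in 100 steps, while the larger alpha lands essentially on top of it. The direction is right in both cases; only the step size differs.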

## Alpha = 1
Interestingly, this converged exactly as if we had no alpha at all! Multiplying our weight updates by 1 doesn't change anything. :)

## Alpha = 10
An alpha greater than 1 achieved the best score after only 10,000 iterations! This tells us that our weight updates were too conservative with the smaller alphas. For alphas less than 10, the network's weights were generally headed in the right direction; they just needed to hurry up and get there!

## Alpha = 1000
And with an extremely large alpha, we see an example of divergence, with the error increasing instead of decreasing and flatlining at 0.5. This is the opposite extreme from Problem 3: the updates are so large that the network overcorrects wildly and ends up very far away from any local minimum.
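
One plausible reading of the 0.5 plateau, offered as an assumption rather than something the run above confirms: an oversized update pushes the weights so far that the output sigmoid saturates for every training example. The sketch below uses a made-up pre-activation of 50 for all four rows (`saturated_input` is hypothetical) to show what that state looks like.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_output_to_derivative(output):
    return output * (1 - output)

# Same targets as the notebook
y = np.array([[0], [1], [1], [0]])

# Hypothetical pre-activations after an oversized weight update
saturated_input = np.full((4, 1), 50.0)
layer_2 = sigmoid(saturated_input)

# Every prediction is pinned near 1, so two of the four targets are missed
error = np.mean(np.abs(layer_2 - y))

# The slope at a saturated output is essentially zero
slope = sigmoid_output_to_derivative(layer_2)

print("mean absolute error:", error)
print("largest slope:", slope.max())
```

Under this assumption, the error sits at 0.5 because half the targets are wrong, and the near-zero slope means `layer_2_delta` vanishes, so even a huge alpha has almost nothing left to multiply.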