# Neural Learning: Gradient descent

Below is the main algorithm to *predict, compare and learn*.

- **epoch**: An epoch means training the neural network with all the training data for one cycle.
- **weight**: The initial random weight of the neuron for a single input.
- **goal_pred**: The true/target value to be predicted.
- **input**: The input data to be used to predict the target.
- **alpha**: The learning rate, used to prevent overcorrecting weight updates. The weight update is multiplied by a
fraction to make it smaller. Finding an appropriate alpha, even for state-of-the-art neural networks, is often done
by trial and error.

For each epoch, the algorithm performs the following steps:
1. **Predict**: It will make a prediction using the input and the weight. As the weight is randomly initialised it is very unlikely the prediction will be correct during the first epoch.
2. **Compare**: The Mean Squared Error (MSE) is used to calculate the error. With a single training example, as in the code below, this is simply the squared error (pred - goal_pred) ** 2. MSE is one possible performance function; a different one could have been used. Squaring makes the error non-negative, amplifies big errors (greater than 1) and shrinks small errors (less than 1).
3. **Learn**: At this step we update the weight. To do this we need to work out the direction in which to move the weight and by how much; formally, this is given by the derivative of the error with respect to the weight. For the squared error that derivative is 2 * input * (pred - goal_pred); the constant factor 2 is dropped in the code because it can be absorbed into the learning rate. The direction comes from the *pure error* (pred - goal_pred), which is multiplied by the input for scaling, negative reversal and stopping. Finally, the derivative is multiplied by the learning rate (alpha) to avoid overcorrecting.

    3.1. **Direction**: To find the direction, consider an example. If the true value is 1 and the prediction is 0.6, the prediction is too low, so (for a positive input) the weight should increase; if the true value were 0, the prediction would be too high and the weight should decrease. So when goal_pred > pred (1 > 0.6) the direction is positive and we increase the weight. Notice that the code below computes the opposite difference (pred - goal_pred), which is positive when pred > goal_pred, and therefore the weight delta is subtracted from the weight rather than added. In short: if the weight delta is calculated from (pred - goal_pred), it must be subtracted (weight -= weight_delta), but if (goal_pred - pred) is used, it must be added (weight += weight_delta).
    
    3.2. **Amount**: The pure error is multiplied by the input for these reasons:
        - Stopping: If the input is 0, the weight update is also 0, so there is nothing to learn from this input.
        - Negative reversal: Multiplying the pure error by the input flips the sign of the update if the input is negative, so the weight still moves in the correct direction.
        - Scaling: If the input is big, the weight update is also big. This can get out of control, which is why alpha is used.
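
The three effects listed above can be seen by holding the pure error fixed and varying only the input. The snippet below is a minimal sketch; the helper name `weight_delta` and the sample values are assumptions chosen for illustration, not part of the original notebook.

```python
def weight_delta(input_value, pred, goal_pred, alpha=0.1):
    """Return the weight update: alpha * input * (pred - goal_pred)."""
    return alpha * input_value * (pred - goal_pred)

# Same pure error (pred - goal_pred = 0.2) in every call; only the input changes.
stopping = weight_delta(0, pred=1.0, goal_pred=0.8)    # input 0: no update at all
positive = weight_delta(2, pred=1.0, goal_pred=0.8)    # positive input: positive update
reversal = weight_delta(-2, pred=1.0, goal_pred=0.8)   # negative input: sign is flipped
scaling = weight_delta(20, pred=1.0, goal_pred=0.8)    # 10x the input: 10x the update

print("stopping: %.4f" % stopping)
print("positive: %.4f" % positive)
print("reversal: %.4f" % reversal)
print("scaling:  %.4f" % scaling)
```

Because the update is subtracted from the weight (weight -= weight_delta), a positive value lowers the weight and a negative value raises it.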

In [8]:
epochs = 20
weight = 0.5
goal_pred = 0.8
input = 2
alpha = 0.1
for epoch in range(epochs):
    pred = input * weight
    error = (pred - goal_pred) ** 2
    derivative = input * (pred - goal_pred)
    weight_delta = alpha * derivative
    weight -= weight_delta
    print("Error: %.6f\tPrediction: %.4f\tDerivative: %.4f\tWeight Delta: %.4f"
          % (error, pred, derivative, weight_delta))

Error: 0.040000	Prediction: 1.0000	Derivative: 0.4000	Weight Delta: 0.0400
Error: 0.014400	Prediction: 0.9200	Derivative: 0.2400	Weight Delta: 0.0240
Error: 0.005184	Prediction: 0.8720	Derivative: 0.1440	Weight Delta: 0.0144
Error: 0.001866	Prediction: 0.8432	Derivative: 0.0864	Weight Delta: 0.0086
Error: 0.000672	Prediction: 0.8259	Derivative: 0.0518	Weight Delta: 0.0052
Error: 0.000242	Prediction: 0.8156	Derivative: 0.0311	Weight Delta: 0.0031
Error: 0.000087	Prediction: 0.8093	Derivative: 0.0187	Weight Delta: 0.0019
Error: 0.000031	Prediction: 0.8056	Derivative: 0.0112	Weight Delta: 0.0011
Error: 0.000011	Prediction: 0.8034	Derivative: 0.0067	Weight Delta: 0.0007
Error: 0.000004	Prediction: 0.8020	Derivative: 0.0040	Weight Delta: 0.0004
Error: 0.000001	Prediction: 0.8012	Derivative: 0.0024	Weight Delta: 0.0002
Error: 0.000001	Prediction: 0.8007	Derivative: 0.0015	Weight Delta: 0.0001
Error: 0.000000	Prediction: 0.8004	Derivative: 0.0009	Weight Delta: 0.0001

The example below is the same as the previous one, but the *pure error* is calculated the opposite way (goal_pred - pred), so the weight is updated by adding the weight_delta rather than subtracting it.

In [6]:
weight = 0.5
goal_pred = 0.8
input = 2
alpha = 0.1
for iteration in range(20):
    pred = input * weight
    error = (pred - goal_pred) ** 2
    derivative = input * (goal_pred - pred)
    weight_delta = alpha * derivative
    weight += weight_delta
    print("Error: %.6f\tPrediction: %.4f\tDerivative: %.4f\tWeight Delta: %.4f"
          % (error, pred, derivative, weight_delta))

Error: 0.040000	Prediction: 1.0000	Derivative: -0.4000	Weight Delta: -0.0400
Error: 0.014400	Prediction: 0.9200	Derivative: -0.2400	Weight Delta: -0.0240
Error: 0.005184	Prediction: 0.8720	Derivative: -0.1440	Weight Delta: -0.0144
Error: 0.001866	Prediction: 0.8432	Derivative: -0.0864	Weight Delta: -0.0086
Error: 0.000672	Prediction: 0.8259	Derivative: -0.0518	Weight Delta: -0.0052
Error: 0.000242	Prediction: 0.8156	Derivative: -0.0311	Weight Delta: -0.0031
Error: 0.000087	Prediction: 0.8093	Derivative: -0.0187	Weight Delta: -0.0019
Error: 0.000031	Prediction: 0.8056	Derivative: -0.0112	Weight Delta: -0.0011
Error: 0.000011	Prediction: 0.8034	Derivative: -0.0067	Weight Delta: -0.0007
Error: 0.000004	Prediction: 0.8020	Derivative: -0.0040	Weight Delta: -0.0004
Error: 0.000001	Prediction: 0.8012	Derivative: -0.0024	Weight Delta: -0.0002
Error: 0.000001	Prediction: 0.8007	Derivative: -0.0015	Weight Delta: -0.0001
Error: 0.000000	Prediction: 0.8004	Derivative: -0.0009	Weight Delta: -0.0001
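
To see why alpha matters, the same single-weight loop can be rerun with a much larger learning rate. This is a rough sketch: the function name `train` and the value alpha = 1.0 are assumptions picked to make the effect obvious. With input = 2, each update multiplies the distance between the prediction and the goal by (1 - alpha * input ** 2), which is 0.6 for alpha = 0.1 and -3 for alpha = 1.0.

```python
def train(alpha, epochs=20, weight=0.5, goal_pred=0.8, input_value=2):
    """Run the single-weight loop from above and return the error at each epoch."""
    errors = []
    for _ in range(epochs):
        pred = input_value * weight
        errors.append((pred - goal_pred) ** 2)
        weight -= alpha * input_value * (pred - goal_pred)
    return errors

small = train(alpha=0.1)  # error shrinks every epoch
large = train(alpha=1.0)  # error grows every epoch: each update overshoots the goal

print("alpha=0.1 first/last error: %.6f / %.6f" % (small[0], small[-1]))
print("alpha=1.0 first/last error: %.6f / %.6f" % (large[0], large[-1]))
```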

# Neural Learning: Learning the whole dataset

In [14]:
import numpy as np

a = np.array([0,1,2,1])
b = np.array([2,2,2,3])
c = np.array([[0,1,2,1], [2,2,2,3]]).T
print("a: %s" % a)
print("b: %s" % b)
print("c: %s" % c)

print("a * b: %s" % (a * b))      # elementwise multiplication
print("a + b: %s" % (a + b))      # elementwise addition
print("a * 0.5: %s" % (a * 0.5))  # vector-scalar multiplication
print("a + 0.5: %s" % (a + 0.5))  # vector-scalar addition
print("a . b: %s" % (a.dot(b)))   # dot product of two vectors (a scalar)
print("a . c: %s" % (a.dot(c)))   # vector-matrix product: one dot product per column of c

a: [0 1 2 1]
b: [2 2 2 3]
c: [[0 2]
 [1 2]
 [2 2]
 [1 3]]
a * b: [0 2 4 3]
a + b: [2 3 4 4]
a * 0.5: [0.  0.5 1.  0.5]
a + 0.5: [0.5 1.5 2.5 1.5]
a . b: 9
a . c: [6 9]


In [9]:
import numpy as np

weights = np.array([0.5,0.48,-0.7])
alpha = 0.1
epochs = 40

streetlights = np.array( [[ 1, 0, 1 ],
                          [ 0, 1, 1 ],
                          [ 0, 0, 1 ],
                          [ 1, 1, 1 ],
                          [ 0, 1, 1 ],
                          [ 1, 0, 1 ] ] )

walk_vs_stop = np.array( [ 0, 1, 0, 1, 1, 0 ] )

for epoch in range(epochs):
    error_for_all_lights = 0
    predictions = []
    # Visit every training example once per epoch and update the weights after each one
    for row_index in range(len(walk_vs_stop)):
        input = streetlights[row_index]
        goal_prediction = walk_vs_stop[row_index]

        # Predict: weighted sum of the three light values
        prediction = input.dot(weights)
        predictions.append(prediction)

        # Compare: squared error, accumulated over the whole dataset
        error = (goal_prediction - prediction) ** 2
        error_for_all_lights += error

        # Learn: delta is (prediction - goal), so the update is subtracted
        delta = prediction - goal_prediction
        weights = weights - (alpha * (input * delta))
    print("[Epoch: %s] Predictions: %s" % (epoch, np.round( [float(i) for i in predictions], 4)))
    print("[Epoch: %s] Error: %.4f\n" % (epoch, error_for_all_lights))

[Epoch: 0] Predictions: [-0.2    -0.2    -0.56    0.616   0.1728  0.1755]
[Epoch: 0] Error: 2.6561

[Epoch: 1] Predictions: [ 0.1404  0.3066 -0.3451  1.0066  0.4785  0.267 ]
[Epoch: 1] Error: 0.9629

[Epoch: 2] Predictions: [ 0.2136  0.5347 -0.2607  1.1319  0.6275  0.2543]
[Epoch: 2] Error: 0.5509

[Epoch: 3] Predictions: [ 0.2035  0.6562 -0.2219  1.1663  0.7139  0.2147]
[Epoch: 3] Error: 0.3645

[Epoch: 4] Predictions: [ 0.1718  0.7325 -0.1997  1.1698  0.772   0.173 ]
[Epoch: 4] Error: 0.2517

[Epoch: 5] Predictions: [ 0.1384  0.7865 -0.1837  1.1632  0.8149  0.1363]
[Epoch: 5] Error: 0.1780

[Epoch: 6] Predictions: [ 0.109   0.8274 -0.1704  1.1538  0.8482  0.1059]
[Epoch: 6] Error: 0.1286

[Epoch: 7] Predictions: [ 0.0848  0.8595 -0.1586  1.1438  0.8747  0.0815]
[Epoch: 7] Error: 0.0951

[Epoch: 8] Predictions: [ 0.0652  0.8851 -0.1477  1.1342  0.896   0.062 ]
[Epoch: 8] Error: 0.0719

[Epoch: 9] Predictions: [ 0.0496  0.9056 -0.1377  1.1251  0.9133  0.0465]
[Epoch: 9] Error: 0.0556
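
One way to check what the loop above has learned is to look at the final weights rather than only the predictions. The sketch below repeats the same update rule; the choice of 200 epochs is an assumption (more than in the run above) so that the weights have time to settle. In this dataset the target equals the middle light, so the weights would be expected to approach roughly [0, 1, 0].

```python
import numpy as np

streetlights = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [0, 0, 1],
                         [1, 1, 1],
                         [0, 1, 1],
                         [1, 0, 1]])
walk_vs_stop = np.array([0, 1, 0, 1, 1, 0])

weights = np.array([0.5, 0.48, -0.7])
alpha = 0.1

# Assumption: 200 epochs, chosen only so the weights have clearly converged.
for epoch in range(200):
    for light, goal in zip(streetlights, walk_vs_stop):
        delta = light.dot(weights) - goal          # prediction - goal
        weights = weights - alpha * light * delta  # same update rule as above

print("Learned weights:", np.round(weights, 3))
print("Predictions:", np.round(streetlights.dot(weights), 3))
```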



# Neural Learning: Backpropagation

## [ReLU](https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks) (Rectified linear activation unit)

In order to use stochastic gradient descent with backpropagation of errors to train deep neural networks, an activation function is needed that looks and acts like a linear function, but is, in fact, a nonlinear function allowing complex relationships in the data to be learned.

Nonlinear activation functions are preferred as they allow the nodes to learn more complex structures in the data. Traditionally, two widely used nonlinear activation functions are the sigmoid and hyperbolic tangent activation functions.

The sigmoid activation function, also called the logistic function, is traditionally a very popular activation function for neural networks. The input to the function is transformed into a value between 0.0 and 1.0. Inputs that are much larger than 1.0 are mapped to values very close to 1.0; similarly, inputs much smaller than 0.0 are mapped to values very close to 0.0. The shape of the function for all possible inputs is an S-shape from zero up through 0.5 to 1.0. For a long time, through the early 1990s, it was the default activation used in neural networks.

The hyperbolic tangent function, or tanh for short, is a similarly shaped nonlinear activation function that outputs values between -1.0 and 1.0. In the late 1990s and through the 2000s, the tanh function was preferred over the sigmoid activation function, as models that used it were easier to train and often had better predictive performance.

A problem shared by the sigmoid and tanh functions is that they saturate: for large positive or negative inputs the output barely changes, so the gradient becomes very small and deep networks are hard to train (the vanishing gradient problem). The solution had been bouncing around in the field for some time, although it was not highlighted until papers in 2009 and 2011 shone a light on it. The solution is to use the rectified linear activation function, or ReL for short. A node or unit that implements this activation function is referred to as a rectified linear activation unit, or ReLU for short. Often, networks that use the rectifier function for the hidden layers are referred to as rectified networks. Adoption of ReLU may easily be considered one of the few milestones in the deep learning revolution, i.e. the techniques that now permit the routine development of very deep neural networks.

For a long time, the default activation to use was the sigmoid activation function. Later, it was the tanh activation function. For modern deep learning neural networks, the default activation function is the rectified linear activation function.
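
The three activation functions discussed above can be compared side by side, together with their gradients. This is a minimal sketch with hand-written helpers (`sigmoid` and `relu` are defined here for illustration, and the sample inputs are arbitrary). It shows that the sigmoid and tanh gradients become tiny for large inputs, while the ReLU gradient stays at 1 for any positive input.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])  # arbitrary sample inputs

# Outputs: sigmoid lies in (0, 1), tanh in (-1, 1), ReLU in [0, inf)
sigmoid_out = sigmoid(x)
tanh_out = np.tanh(x)
relu_out = relu(x)

# Gradients with respect to the input
sigmoid_grad = sigmoid_out * (1.0 - sigmoid_out)
tanh_grad = 1.0 - tanh_out ** 2
relu_grad = (x > 0).astype(float)

print("sigmoid:", np.round(sigmoid_out, 4), "grad:", np.round(sigmoid_grad, 4))
print("tanh:   ", np.round(tanh_out, 4), "grad:", np.round(tanh_grad, 4))
print("relu:   ", np.round(relu_out, 4), "grad:", np.round(relu_grad, 4))
```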

It is recommended as the default for both Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). Because of the careful design of Recurrent Neural Networks (RNNs) such as the Long Short-Term Memory network (LSTM), ReLU was thought not to be appropriate for them by default. When using ReLU in your network, consider setting the bias to a small value, such as 0.1. Before training a neural network, the weights of the network must be initialized to small random values.
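
The initialization advice above could look roughly like the following. This is only a sketch under stated assumptions: the layer sizes, the seed, the weight range of ±0.1 and the names `bias_1` and `rng` are all hypothetical, and the network later in this section uses no bias term and draws its weights from -1 to 1 instead.

```python
import numpy as np

rng = np.random.default_rng(1)  # hypothetical seed, only for reproducibility

n_inputs, hidden_size = 3, 4    # assumed layer sizes

# Small random weights (assumed range: -0.1 to 0.1) and a small positive bias of 0.1
weights_0_1 = rng.uniform(-0.1, 0.1, size=(n_inputs, hidden_size))
bias_1 = np.full(hidden_size, 0.1)

def relu(x):
    return np.maximum(0.0, x)

layer_0 = np.array([[1, 0, 1]])                    # one example input row
layer_1 = relu(layer_0.dot(weights_0_1) + bias_1)  # hidden activations with bias

print("weights_0_1:\n", weights_0_1)
print("layer_1:", layer_1)
```

A small positive bias shifts the pre-activation values upward, which makes it more likely that a ReLU unit starts out active rather than stuck at zero.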

In [52]:
import numpy as np

np.random.seed(1)

def relu(x):
    return (x > 0) * x  # returns x if x > 0, 0 otherwise

def relu2deriv(output):
    return output > 0   # slope of ReLU: True (1) where output > 0, False (0) otherwise

epochs = 60
alpha = 0.2
hidden_size = 4

streetlights = np.array( [[ 1, 0, 1 ],
                          [ 0, 1, 1 ],
                          [ 0, 0, 1 ],
                          [ 1, 1, 1 ] ] )
walk_vs_stop = np.array([[ 1, 1, 0, 0]]).T

weights_0_1 = 2*np.random.random((3,hidden_size)) - 1  # Create random values between -1 and 1
weights_1_2 = 2*np.random.random((hidden_size,1)) - 1  # Create random values between -1 and 1

print("Weights")
print(weights_0_1)
print(weights_1_2)

print("\nStart")
for epoch in range(epochs):
    layer_2_error = 0
    predictions = []
    for i in range(len(streetlights)):
        # Forward pass: input -> hidden layer (ReLU) -> output
        layer_0 = streetlights[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        predictions.append(layer_2[0][0])

        layer_2_error += np.sum((layer_2 - walk_vs_stop[i:i+1]) ** 2)

        # Backward pass: output delta first, then the hidden delta through weights_1_2
        layer_2_delta = walk_vs_stop[i:i+1] - layer_2
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

        # The deltas use (goal - prediction), so the updates are added
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    if epoch % 10 == 9:
        print("[Epoch: %s] Predictions: %s" % (epoch, np.round( [float(i) for i in predictions], 4)))
        print("[Epoch: %s] Error: %.6f\n" % (epoch, layer_2_error))

Weights
[[-0.16595599  0.44064899 -0.99977125 -0.39533485]
 [-0.70648822 -0.81532281 -0.62747958 -0.30887855]
 [-0.20646505  0.07763347 -0.16161097  0.370439  ]]
[[-0.5910955 ]
 [ 0.75623487]
 [-0.94522481]
 [ 0.34093502]]

Start
[Epoch: 9] Predictions: [0.8478 0.4052 0.4261 0.2751]
[Epoch: 9] Error: 0.634231

[Epoch: 19] Predictions: [0.9919 0.5718 0.3627 0.2084]
[Epoch: 19] Error: 0.358384

[Epoch: 29] Predictions: [1.     0.8191 0.2162 0.0598]
[Epoch: 29] Error: 0.083018

[Epoch: 39] Predictions: [1.     0.9598 0.0695 0.0043]
[Epoch: 39] Error: 0.006467

[Epoch: 49] Predictions: [1.     0.9921 0.0163 0.    ]
[Epoch: 49] Error: 0.000329

[Epoch: 59] Predictions: [1.     0.9983 0.0035 0.    ]
[Epoch: 59] Error: 0.000015

