In [1]:
import numpy as np

Import the NumPy library.

To get familiar with what NumPy is and how to use it, I recommend going over this tutorial:
https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

In [14]:
# compute sigmoid nonlinearity
def nonlin(x, deriv=False):
    if deriv:
        # x is expected to already be a sigmoid output S, so this returns S*(1-S)
        return x*(1-x)
    return 1/(1+np.exp(-x))


- Formula of the sigmoid function: $$S(t) = \frac{1}{1+e^{-t}}$$
- The sigmoid function is an activation function that maps any input to a number between 0 and 1, which can be read as a probability.
- The input, x, to this activation function is the dot product of layer_0 (the actual data fed into the model) and synapse_0 (the weights applied to the corresponding features in the layer_0 data).
- This transforms a matrix whose values lie between -infinity and infinity into probability-like values between 0 and 1.
- The value returned by this function is the next layer, layer_1.
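
To make the mapping described above concrete, here is a minimal sketch that applies the same formula to a few hand-picked values. The helper name `sigmoid` and the sample values in `z` are assumptions chosen for illustration; they are not part of the notebook.

```python
import numpy as np

def sigmoid(x):
    # Same formula as the non-derivative branch of nonlin
    return 1 / (1 + np.exp(-x))

# Illustrative pre-activation values (assumed, not taken from the dataset)
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
s = sigmoid(z)

# Large negative inputs land near 0, large positive inputs near 1, and 0 maps to 0.5
print(s)
```

Every output stays between 0 and 1, and the ordering of the inputs is preserved.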

- When deriv=True, the input x is expected to be the output of an earlier sigmoid call (for example the layer_1 matrix), not the raw dot product.
- The magic formula x*(1-x) is the result of taking the derivative of the sigmoid function and simplifying it.
- The steps to reach the simplified derivative of the sigmoid:
$$S(x) = \frac{1}{1+e^{-x}}$$
$$f(x) = \frac{1}{S(x)} = 1+e^{-x}$$
$$f'(x) = -\frac{S'(x)}{S(x)^2}$$
$$f'(x) = -e^{-x} = 1 - f(x) = 1 - \frac{1}{S(x)} = \frac{S(x)-1}{S(x)}$$
$$-\frac{S'(x)}{S(x)^2} = \frac{S(x)-1}{S(x)}$$
$$S'(x) = S(x)^2\,\frac{1-S(x)}{S(x)}$$
$$S'(x) = S(x)(1-S(x))$$
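
One way to sanity-check the closed-form derivative is to compare it with a numerical estimate. The sketch below uses a central finite difference; the helper name `sigmoid`, the step size `h`, and the grid of test points are assumptions made for this check.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)  # assumed test points
h = 1e-5                    # assumed finite-difference step

# Numerical slope of the sigmoid at each point
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

# Closed form: the sigmoid output times one minus the sigmoid output
s = sigmoid(x)
analytic = s * (1 - s)

# Largest disagreement between the two estimates; it should be tiny
print(np.max(np.abs(numeric - analytic)))
```

Note that the closed form is written in terms of the sigmoid output `s`, which is why `nonlin(..., deriv=True)` is called on a layer that has already passed through the sigmoid.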

In [15]:
# input dataset
X = np.array([  [0,0,1],
                [0,1,1],
                [1,0,1],
                [1,1,1] ])

- Each inner [] represents a row of the table: one sample out of the set of samples.
- Each position inside an inner [] represents a column of the table: one feature out of the set of features.
- For example, [0, 1, 1] is a sample with value 0 for feature 1, value 1 for feature 2, and value 1 for feature 3.
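
The row and column reading of X can be checked directly with NumPy indexing. This is a small sketch using the same array as above; nothing beyond standard indexing is assumed.

```python
import numpy as np

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

print(X.shape)   # (4, 3): 4 samples (rows), 3 features (columns)
print(X[1])      # the second sample: [0 1 1]
print(X[:, 0])   # the first feature across all samples: [0 0 1 1]
```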

In [16]:
# output dataset            
y = np.array([[0,1,1,0]]).T
# same as y = np.array([[0],
#                       [1],
#                       [1],
#                       [0]])


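- Make sure the array is transposed: .T turns the 1x4 row into a 4x1 column, so that each row of y is the target output for the matching row (sample) of X.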
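
A short sketch of why the shape matters: subtracting a 4x1 network output from a 1x4 row would silently broadcast to a 4x4 matrix. The names `row` and `l2_like` are placeholders introduced for this illustration only.

```python
import numpy as np

row = np.array([[0, 1, 1, 0]])
y = row.T

print(row.shape)  # (1, 4)
print(y.shape)    # (4, 1)

# Placeholder with the same shape as the network output (4 samples, 1 output each)
l2_like = np.zeros((4, 1))

print((y - l2_like).shape)    # (4, 1): one error value per sample
print((row - l2_like).shape)  # (4, 4): broadcasting, not what we want
```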

In [17]:
# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)

In [18]:
# randomly initialize our weights with mean 0
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1

- np.random.random: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.random.html
- np.random.random((3,4)) generates a 3x4 matrix whose cells each contain a value in [0, 1).
- With 2*np.random.random((3,4)) - 1, the range of each cell becomes [-1, 1), which is centered on 0.
- Likewise, 2*np.random.random((4,1)) - 1 generates a 4x1 matrix whose cells each contain a value in [-1, 1).
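
The rescaling from [0, 1) to [-1, 1) can be verified on a sample draw. This is a minimal sketch; the intermediate names `r` and `w` are introduced here for illustration.

```python
import numpy as np

np.random.seed(1)

r = np.random.random((3, 4))  # uniform samples in [0, 1)
w = 2 * r - 1                 # stretched to width 2, then shifted down by 1

print(w.shape)
print(w.min(), w.max())       # both fall inside [-1, 1)
```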

In [19]:
for iter in range(100000):

    # Feed forward through layers 0, 1, and 2
    l0 = X
    l1 = nonlin(np.dot(l0,syn0)) # 4x3 * 3x4 = 4x4
    l2 = nonlin(np.dot(l1,syn1)) # 4x4 * 4x1 = 4x1


- l0: layer 0 of the neural net, which holds the source input from the dataset.
- l1: layer 1 of the neural net, the first hidden layer. It is the result of applying the weights of synapse 0 (the paths between layer 0 and layer 1) to the layer 0 values.
- l2: layer 2 of the neural net, the output layer. It is the result of applying the weights of synapse 1 (the paths between layer 1 and layer 2) to the layer 1 values.
- All values of l1 and l2 lie between 0 and 1, because of the nonlinear transformation applied by the activation function, in this example the sigmoid function.
- Think of layers as the states of pulses at a given time, and synapses as the paths that put weights on those pulses (strengthening them, weakening them, or leaving them unchanged).
- What we train in a neural net is the weights of the synapses (how much each synapse changes the pulses passing through it). We want to find the configuration of synapses that gives the right weight to each pulse state so that the network produces the target output.
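
The layer shapes described above can be traced with a single forward pass. This sketch reuses the notebook's data, seed, and weight initialization; the helper name `sigmoid` is an assumption standing in for the non-derivative branch of `nonlin`.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

np.random.seed(1)
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
syn0 = 2 * np.random.random((3, 4)) - 1  # weights between layer 0 and layer 1
syn1 = 2 * np.random.random((4, 1)) - 1  # weights between layer 1 and layer 2

l0 = X                          # 4 samples x 3 features
l1 = sigmoid(np.dot(l0, syn0))  # 4 samples x 4 hidden units
l2 = sigmoid(np.dot(l1, syn1))  # 4 samples x 1 output

print(l0.shape, l1.shape, l2.shape)
```

Each row is carried through the network independently, so the number of rows (samples) stays at 4 in every layer.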

In [20]:
    # how much did we miss the target value?
    l2_error = y - l2


- The difference between the given target output and the output computed by the neural net tells us how much we should adjust the weights of the synapses.
- This is where the direction of the error is computed. Each value of l2 lies between 0 and 1, while each value of y is either 0 or 1. If a value of l2 is 0.5 and the corresponding value of y is 0, the error is negative, so in the next iterations we should push that output down. If y is 1, the error is positive and we should push the output up.
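
The sign of the error can be seen on a two-sample toy case. The values of `y` and `l2` below are made up for illustration and mirror the 0.5 example in the text.

```python
import numpy as np

# Hypothetical targets and network outputs for two samples
y = np.array([[0],
              [1]])
l2 = np.array([[0.5],
               [0.5]])

l2_error = y - l2
print(l2_error)
# First sample: negative error, so the output should move down toward 0
# Second sample: positive error, so the output should move up toward 1
```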

In [21]:
    if (iter% 10000) == 0:
        print("Error:" + str(np.mean(np.abs(l2_error))))
        

In [22]:
    # in what direction is the target value?
    # were we really sure? if so, don't change too much.
    l2_delta = l2_error*nonlin(l2,deriv=True)


- Since l2 lies between 0 and 1, its derivative (computed as x * (1 - x)) lies between 0 and 0.25.
- Notice that the derivative reaches its maximum value, 0.25, when x is 0.5. This makes sense because the output is supposed to move toward either 0 or 1 during training. An output of 0.5 is the least confident state, so it calls for the largest adjustment to the synapses, while outputs already close to 0 or 1 are changed only slightly.
- Note again that the sign of l2_error gives the direction in which the weights should be adjusted.
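
To see how the derivative scales the update, the sketch below compares three hypothetical outputs: one near 0, one at 0.5, and one near 1. The values in `l2` and `y` are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical network outputs: confident-low, unsure, confident-high
l2 = np.array([[0.01],
               [0.5],
               [0.99]])
# Hypothetical targets: all three should be 1
y = np.array([[1],
              [1],
              [1]])

slope = l2 * (1 - l2)        # same expression as nonlin(l2, deriv=True)
l2_error = y - l2
l2_delta = l2_error * slope  # error scaled by how unsure the output is

print(slope.ravel())     # largest at 0.5, close to 0 near the extremes
print(l2_delta.ravel())
```

The first output is badly wrong, yet its delta is small because the slope there is almost flat; the unsure output at 0.5 receives the largest delta.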

In [23]:
    # how much did each l1 value contribute to the l2 error (according to the weights)?
    l1_error = l2_delta.dot(syn1.T)
    

In [24]:
    # in what direction is the target l1?
    # were we really sure? if so, don't change too much.
    l1_delta = l1_error * nonlin(l1,deriv=True)


In [25]:
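    # update the weights: each synapse changes by (its input layer).T dot (the delta of its output layer)
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)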
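
The forward pass, error, delta, and update steps only train the network when they all run inside the same loop body. As a sketch, the cells above could be assembled into one block like this. The loop variable `i`, the name `n_iterations`, the reduced iteration count, and the print interval are assumptions made to keep the example quick to run; the text above uses 100000 iterations.

```python
import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)  # x must already be a sigmoid output
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
y = np.array([[0, 1, 1, 0]]).T

np.random.seed(1)
syn0 = 2 * np.random.random((3, 4)) - 1
syn1 = 2 * np.random.random((4, 1)) - 1

n_iterations = 10000  # assumption: shortened run for illustration
for i in range(n_iterations):
    # Feed forward through layers 0, 1, and 2
    l0 = X
    l1 = nonlin(np.dot(l0, syn0))
    l2 = nonlin(np.dot(l1, syn1))

    # How far is the output from the target?
    l2_error = y - l2
    if i % 2000 == 0:
        print("Error:" + str(np.mean(np.abs(l2_error))))

    # Scale the error by the sigmoid slope, then propagate it back to layer 1
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1, deriv=True)

    # Update both sets of weights
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)

final_error = np.mean(np.abs(y - l2))
print("Final mean absolute error:", final_error)
```

With everything in one loop, the printed error should shrink as the iterations proceed.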
    

In [32]:
print("Output After Training:")
print(l2)

Output After Training:
[[ 0.47372957]
 [ 0.48895696]
 [ 0.54384086]
 [ 0.54470837]]
