# Binary addition
_What exactly will the RNN learn?_

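**The RNN is going to learn the carry bit on its own!**

The table below is the truth table of a one-bit full adder. The network only ever sees `input1` and `input2` and is trained on `sum`, so it has to keep the carry in its hidden state.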


| input1 | input2 | carry-in | sum | carry-out |
|:---:|:---:|:---:|:---:|:---:|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 |
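
As a point of comparison for what the network has to discover, the table can be written as a hand-coded full adder. This is a minimal sketch; the function name `full_adder` is hypothetical and not part of the notebook.

```python
def full_adder(bit_a, bit_b, carry_in):
    """Return (sum_bit, carry_out) for one column of a binary addition."""
    # the sum bit is 1 when an odd number of the three inputs is 1
    sum_bit = bit_a ^ bit_b ^ carry_in
    # the carry-out is 1 when at least two of the three inputs are 1
    carry_out = (bit_a & bit_b) | (carry_in & (bit_a ^ bit_b))
    return sum_bit, carry_out

# Print every row in the same column order as the table above.
for bit_a in (0, 1):
    for bit_b in (0, 1):
        for carry_in in (0, 1):
            print(bit_a, bit_b, carry_in, *full_adder(bit_a, bit_b, carry_in))
```

The RNN is never given `carry_in` explicitly, which is why a purely feed-forward mapping from two input bits to one output bit would not be enough.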

In [12]:
import numpy as np
from abc import ABC, abstractmethod

# deque is used to store predictions, copy to snapshot hidden states
import copy
from collections import deque

## Samples
To train and test our RNN, we need a dataset.
Each sample consists of three 8-bit binary arrays: the operands a and b and their sum c = a + b, written with the most significant bit first. The following rows show three such samples:

| a | b | c |
| :---: | :---: | :---: |
| [0 0 0 1 0 0 1 0] | [0 1 0 0 1 1 0 1] | [0 1 0 1 1 1 1 1] |
| [0 0 0 0 1 1 1 0] | [0 0 0 1 1 0 0 1] | [0 0 1 0 0 1 1 1] |
| [0 1 0 1 0 0 1 1] | [0 0 1 0 1 1 1 1] | [1 0 0 0 0 0 1 0] |
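
The encoding in the table can be reproduced with `np.unpackbits`. The sketch below uses the first row (18 + 77 = 95) and assumes 8-bit unsigned integers, which is what `unpackbits` operates on.

```python
import numpy as np

a, b = 18, 77          # operands of the first row of the table
c = a + b              # 95

# unpackbits expands each uint8 into 8 bits, most significant bit first
bits = np.unpackbits(np.array([[a], [b], [c]], dtype=np.uint8), axis=1)
print(bits)

# packbits is the inverse operation, so the round trip recovers the integers
recovered = np.packbits(bits, axis=1).ravel()
print(recovered)
```

Because the most significant bit comes first, the addition has to be processed from the last array index to the first.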

It's good practice to seed your random numbers. Your numbers will still be randomly distributed, but they'll be randomly distributed in exactly the same way each time you train. This makes it easier to see how your changes affect the network.

In [13]:
np.random.seed(0)
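
A short check of what seeding buys you: resetting the seed to the same value replays exactly the same sequence of random numbers. The array names are illustrative only.

```python
import numpy as np

np.random.seed(0)
first_run = np.random.randint(128, size=5)

np.random.seed(0)            # reset to the same seed
second_run = np.random.randint(128, size=5)

# both draws are identical because the generator restarted from the same state
print(np.array_equal(first_run, second_run))
```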

In [14]:
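class dataset_utility:
    
    @staticmethod
    def get_samples(samples_count, binary_dim):
        
        # note: the uint8 encoding below always yields 8 bits, so binary_dim should be 8
        largest_number = pow(2, binary_dim)
        
        samples = list()
        for i in range(samples_count):
            
            # operands stay below largest_number/2 so that a + b fits in binary_dim bits
            a = np.random.randint(largest_number // 2)
            b = np.random.randint(largest_number // 2)

            # true answer => summation
            c = a + b            
            
            # rows: a, b, c; columns: bits, most significant bit first
            int_sample = np.array([[a], [b], [c]], dtype=np.uint8)            
            binary_int_sample = np.unpackbits(int_sample, axis=1)
            
            samples.append(binary_int_sample)
            
        return samples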

# Network Architecture
The following architecture is used to add two bits at each time step:

<img src="./images/network_architecture.jpg" />
<center>Figure 1</center>


# Computation model

The following model is used to compute the addition of two bits at each time step:

<img src="./images/forward_one_step.jpg" />
<center>Figure 2</center>

# RNN gates

As can be seen in the computation model, there are two types of gates in our RNN:
    1. multiply gate
    2. add gate

In [15]:
class multiply_gate:
    
    @staticmethod
    def forward(inputs, weights):
        return np.dot(inputs, weights)
    
    @staticmethod
    def backward(inputs):
        return inputs.T

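class add_gate:
    
    @staticmethod
    def forward(input1, input2):
        return input1 + input2
    
    @staticmethod
    def backward(input1, input2):
        # d(input1 + input2)/d(input1) = d(input1 + input2)/d(input2) = 1,
        # so an add gate passes the upstream gradient through unchanged
        return np.ones_like(input1), np.ones_like(input2)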
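
To make the backward rule of the multiply gate concrete, the sketch below compares the analytic gradient `x.T . upstream` with a finite-difference estimate. The shapes, the `upstream` gradient and the scalar test loss `sum(net * upstream)` are assumptions chosen for illustration; they are not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((1, 2))          # one input row: [a(t), b(t)]
W = rng.random((2, 4))          # weights into 4 hidden units (illustrative size)
upstream = rng.random((1, 4))   # gradient arriving from later layers

# multiply gate: net = x . W, so dL/dW = x^T . upstream
analytic = x.T.dot(upstream)

# finite-difference estimate of the same gradient for L = sum(net * upstream)
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (np.sum(x.dot(W_plus) * upstream)
                         - np.sum(x.dot(W_minus) * upstream)) / (2 * eps)

print(np.allclose(analytic, numeric))
```

The transposed input in `multiply_gate.backward` is exactly the `x.T` factor used here.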

# Activation function

Our network uses the sigmoid function as its activation. A **sigmoid function** maps any real value to a value between 0 and 1.

forward

$$ \sigma(x) = \frac{1}{1+e^{-x}}$$

backward
$$ \frac{\partial \sigma(x)}{\partial x} =  \sigma(x)(1- \sigma(x))$$

In [16]:
class sigmoid_activation():
        
    @staticmethod
    def forward(x):
        return 1/(1 + np.exp(-x))
    
    @staticmethod
    def backward(output):
        # expects the sigmoid output sigma(x), not x itself
        return output*(1 - output)
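
The derivative formula can be checked numerically. This sketch defines its own `sigmoid` helper (an assumption, so that the block runs on its own) and compares the closed form with a central difference.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 9)
s = sigmoid(x)

# closed form, written in terms of the sigmoid output
analytic = s * (1.0 - s)

# central finite difference of the sigmoid itself
eps = 1e-5
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))
```

Writing the derivative in terms of the output is convenient for backpropagation, since the activations are already stored from the forward pass.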

# Network layers

According to the network architecture shown in Figure 1, there are three layers in our RNN:
    1. input layer
    2. hidden layer
    3. output layer

## Initialize weights

The initial weights of each layer are drawn uniformly at random from [-1, 1) with NumPy's `np.random.random`.

## Forward propagation

Every layer takes part in two passes: forward and backward. The forward pass is implemented as a `forward` method on each layer class.

In [17]:
class random_generator:
    
    @staticmethod
    def get_random_weight_matrix(input_dimension, output_dimension):
        return 2*np.random.random((input_dimension,output_dimension)) - 1

In [18]:
class network_layer(ABC):
    
    def __init__(self, input_dimension, output_dimension):        
        self.weights = random_generator.get_random_weight_matrix(input_dimension, output_dimension)
    
    @abstractmethod
    def forward(self, input):
        pass   

## Input layer

In this layer, the inputs a(t) and b(t) are packed into a vector named x, which is multiplied by the input layer weights in the forward pass.

**forward**

$ net_{input} = x \times  W_{input} $

In [20]:
class input_layer(network_layer):
    
    def forward(self, x):
        return multiply_gate.forward(x, self.weights)

## Hidden layer

### forward

In this layer, the following equations are implemented in forward propagation:

$ net_{hidden} = net_{input} + prev_{hidden} \times W_{hidden} $

Sigmoid is used as the activation function in this layer:

$ a_{hidden} = \sigma(net_{hidden}) $


In [None]:
class hidden_layer(network_layer):
    
    def forward(self, net_input, s_t_prev):        
        net_hidden = add_gate.forward(net_input, multiply_gate.forward(s_t_prev, self.weights))
        return sigmoid_activation.forward(net_hidden)

## Output layer

In this layer, the following equations are implemented in forward propagation:


### forward

$ net_{output} = a_{hidden} \times  W_{output} $

$ \hat{y}\ (predicted\ value) = a_{output} = a(net_{output}) = \sigma(net_{output}) $

The predicted value $\hat{y}$ is the single output bit of the RNN at this step, i.e. one bit of a + b.

In [None]:
class output_layer(network_layer):
    
    def forward(self, activation_hidden): 
        net_output = multiply_gate.forward(activation_hidden, self.weights)    
        return sigmoid_activation.forward(net_output)
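
The three forward equations can be traced for a single time step. The sketch below is self-contained and uses plain arrays instead of the layer classes; the hidden size of 16 and the variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
hidden_dim = 16                                  # illustrative size

W_input = 2 * rng.random((2, hidden_dim)) - 1
W_hidden = 2 * rng.random((hidden_dim, hidden_dim)) - 1
W_output = 2 * rng.random((hidden_dim, 1)) - 1

x = np.array([[1, 0]])                           # a(t) = 1, b(t) = 0
prev_hidden = np.zeros((1, hidden_dim))          # no history before the first step

net_input = x.dot(W_input)                       # shape (1, hidden_dim)
a_hidden = sigmoid(net_input + prev_hidden.dot(W_hidden))
y_hat = sigmoid(a_hidden.dot(W_output))          # shape (1, 1), value in (0, 1)

print(net_input.shape, a_hidden.shape, y_hat.shape)
```

Only `a_hidden` is carried over to the next step, so it is the only place where the carry can be remembered.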

# Binary addition RNN

The rest of the code assembles these pieces into the complete RNN.

## Initialization

As explained above, the RNN is designed to add two binary arrays, so the length of these arrays (`binary_dim`) is needed for initialization.

Because two bits are added at each step, the input layer dimension is 2. The result of each step is a single bit, so the output layer dimension is 1. There is no single optimal hidden layer dimension; it is a hyperparameter that you choose.

The last piece is the loss function. It measures the prediction error, which is then backpropagated through all layers. The **mean squared error** is used to compute this error.

In [None]:
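class mse_loss_function():
    
    @staticmethod
    def forward(target_value, predicted_value):
        return np.mean((target_value - predicted_value)**2)
    
    @staticmethod
    def backward(target_value, predicted_value):
        # derivative of 1/2*(target - predicted)^2 with respect to predicted
        return predicted_value - target_value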
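
The gradient returned by `backward` corresponds to the squared error scaled by 1/2 for one output bit. A small numerical check, with a hypothetical helper `half_squared_error` and made-up values for `y` and `y_hat`:

```python
import numpy as np

def half_squared_error(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

y, y_hat = 1.0, 0.3

# analytic derivative with respect to the prediction
analytic = y_hat - y

# central finite difference of the loss
eps = 1e-6
numeric = (half_squared_error(y, y_hat + eps)
           - half_squared_error(y, y_hat - eps)) / (2 * eps)

print(analytic, numeric)
```

The sign matters: a prediction below the target gives a negative gradient, so gradient descent pushes the prediction up.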

In [None]:
class binary_addition_rnn:
    
    def __init__(self, binary_dim, hidden_dimension):
        
        self.binary_dim = binary_dim
        input_layer_dimension = 2 # two numbers a, b
        self.hidden_dimension = hidden_dimension  # name used by feed_forward
        output_dimension = 1 # result of addition, c = a + b
        
        self.input_layer = input_layer(input_layer_dimension, hidden_dimension)
        self.hidden_layer = hidden_layer(hidden_dimension, hidden_dimension)
        self.output_layer = output_layer(hidden_dimension,output_dimension)
        
        self.loss_function = mse_loss_function()

## Forward propagation

The `feed_forward` method of `binary_addition_rnn` iteratively updates the hidden state through time and returns the hidden values of every step together with the predicted values. Figure 3 shows binary addition step by step: prediction starts at the least significant bit (LSB) and proceeds to the most significant bit (MSB).

<img src="./images/binary_addition_steps.gif" />

In [None]:
    def feed_forward(self, a, b, c):        
        
        hidden_values = list()
        hidden_values.append(np.zeros((1, self.hidden_dimension)))
        
        prediction_values = deque([])
        
        # Proceed from right-to-left, column-by-column, starting from last digit
        for column in range(self.binary_dim-1, -1, -1):
            
            # It is given two input digits at each time step. 
            X = np.array([[a[column], b[column]]])
            
            # input layer
            net_input_layer = self.input_layer.forward(X) # X*W_in
            
            # hidden layer
            s_t_prev = hidden_values[-1]
            activation_hidden = self.hidden_layer.forward(net_input_layer, s_t_prev)            
            # save activation_hidden for BPTT
            hidden_values.append(activation_hidden)
            
            # output layer
            prediction_value = self.output_layer.forward(activation_hidden)
            # save predicted values for BPTT
            prediction_values.appendleft(prediction_value)
            
        return prediction_values, hidden_values
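
The same loop can be run outside the class to see what gets stored. This is a sketch with untrained random weights and plain arrays; the operands 18 and 77 and the hidden size are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
binary_dim, hidden_dim = 8, 16
W_input = 2 * rng.random((2, hidden_dim)) - 1
W_hidden = 2 * rng.random((hidden_dim, hidden_dim)) - 1
W_output = 2 * rng.random((hidden_dim, 1)) - 1

a = np.unpackbits(np.array([18], dtype=np.uint8))
b = np.unpackbits(np.array([77], dtype=np.uint8))

hidden_values = [np.zeros((1, hidden_dim))]      # state before the first step
predictions = np.zeros(binary_dim)

for column in range(binary_dim - 1, -1, -1):     # LSB (last index) first
    x = np.array([[a[column], b[column]]])
    hidden = sigmoid(x.dot(W_input) + hidden_values[-1].dot(W_hidden))
    hidden_values.append(hidden)                 # kept for BPTT
    predictions[column] = sigmoid(hidden.dot(W_output))[0, 0]

print(len(hidden_values), predictions.round(2))
```

There is one more stored hidden state than there are time steps, because the list starts with the all-zero initial state.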

## Backpropagation through time (BPTT)

BPTT works by unrolling all input timesteps. Each timestep has one input time step, one output time step and one copy of the network. Then the errors are calculated and accumulated for each timestep. The network is then rolled back to update the weights.

The following weights are used to compute the predicted value, so BPTT must update them for the next iteration:

$ W_{output} $

$ W_{hidden} $

$ W_{input} $

### Chain rule
To update the weights we use the chain rule, which yields the following equations.

The loss for a single output bit is the squared error, scaled by 1/2 so that the derivative is simple:

$ l = \frac{1}{2}(y - \hat{y})^2 $

$ \frac{\partial l}{\partial \hat{y}} = \frac{\partial \frac{1}{2}(y - \hat{y})^2 }{\partial \hat{y}} = 2 \times \frac{1}{2} \times -1 \times (y - \hat{y}) = \hat{y} - y $

$ \frac{\partial l}{\partial net_{output}} = \frac{\partial l}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial net_{output}} $

$ \frac{\partial l}{\partial net_{output}} = \frac{\partial l}{\partial \hat{y}} \times \frac{\partial \sigma(net_{output})}{\partial net_{output}} $

$ \frac{\partial l}{\partial net_{output}} = (\hat{y} - y) \times \sigma(net_{output}) \bigodot(1-\sigma(net_{output})) $

$ \frac{\partial l}{\partial net_{output}} = (\hat{y} - y) \times \hat{y} \bigodot(1-\hat{y}) $

$ \frac{\partial l}{\partial net_{output}} $ appears in the following equations, so we define

$ \frac{\partial l}{\partial net_{output}} = \delta_{net_{output}}$

$$ \delta_{net_{output}} = (\hat{y} - y) \times \hat{y} \bigodot(1-\hat{y}) $$

**Updating $ W_{output} $**


**$ \frac{\partial l}{\partial W_{output}} = ? $**

$ \frac{\partial l}{\partial W_{output}} = \frac{\partial l}{\partial net_{output}} \times \frac{\partial net_{output}}{\partial W_{output}} $

$ \frac{\partial l}{\partial W_{output}} = 
\frac{\partial l}{\partial net_{output}} \times \frac{\partial (a_{hidden} \times  W_{output})}{\partial W_{output}} $

<div id="test" class="equation">
$$ \frac{\partial l}{\partial W_{output}} = 
\delta_{net_{output}} \times a_{hidden}^T
$$
</div>

For the hidden layer and the input layer, $\frac{\partial l}{\partial a_{hidden}}$ must be computed first.

$ \frac{\partial l}{\partial a_{hidden}} = ? $

$ \frac{\partial l}{\partial a_{hidden}} =  \frac{\partial l}{\partial net_{output}} \times \frac{\partial net_{output}}{\partial a_{hidden}} $

$ \frac{\partial l}{\partial a_{hidden}} =  \frac{\partial l}{\partial net_{output}} \times \frac{\partial (a_{hidden} \times  W_{output}) }{\partial a_{hidden}} $ 

$$ \frac{\partial l}{\partial a_{hidden}} = \delta_{net_{output}} \times W_{output}^T $$

$ \frac{\partial l}{\partial net_{hidden}} = ? $

$ \frac{\partial l}{\partial net_{hidden}} = \frac{\partial l(t)}{\partial net_{hidden}(t)} +  \frac{\partial l(t+1)}{\partial net_{hidden}(t)} $

$ \frac{\partial l(t)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t)}{\partial a_{hidden}(t)} \times 
\frac{\partial a_{hidden}(t)}{\partial net_{hidden}(t)} $

$ \frac{\partial l(t)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t)}{\partial a_{hidden}(t)} \times 
\frac{\partial \sigma(net_{hidden}(t))}{\partial net_{hidden}(t)} $

$ \frac{\partial l(t)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t)}{\partial a_{hidden}(t)} \times 
\sigma(net_{hidden}(t)) \bigodot(1-\sigma(net_{hidden}(t)))
$

$ \frac{\partial l}{\partial net_{hidden}} =  
\frac{\partial l}{\partial a_{hidden}} \times 
a_{hidden} \bigodot(1-a_{hidden})
$

$ \frac{\partial l(t)}{\partial net_{hidden}(t)} =  
(\delta_{net_{output}}(t) \times W_{output}^T) \times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t)))
$

$ \frac{\partial l(t)}{\partial net_{hidden}(t)} = \delta_{net_{hidden\_explicit}}(t) $

$$ \delta_{net_{hidden\_explicit}}(t) =  
(\delta_{net_{output}}(t) \times W_{output}^T) \times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t)))
$$

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} = ? $

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t+1)}{\partial net_{hidden}(t+1)} \times 
\frac{\partial net_{hidden}(t+1)}{\partial net_{hidden}(t)} $

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t+1)}{\partial net_{hidden}(t+1)} \times 
\frac{\partial (net_{input}(t+1) + a_{hidden}(t) \times W_{hidden}) }{\partial net_{hidden}(t)} $

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t+1)}{\partial net_{hidden}(t+1)} \times 
\frac{\partial (a_{hidden}(t) \times W_{hidden}) }{\partial net_{hidden}(t)} $

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t+1)}{\partial net_{hidden}(t+1)} \times 
\frac{\partial (a_{hidden}(t) \times W_{hidden}) }{\partial a_{hidden}(t)} \times 
\frac{\partial a_{hidden}(t) }{\partial net_{hidden}(t)} 
$

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} =  
\delta_{net_{hidden\_explicit}}(t+1) \times 
 W_{hidden}^T \times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t)))$

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} = \delta_{net_{hidden\_implicit}}(t) $

$$ \delta_{net_{hidden\_implicit}}(t) =   
\delta_{net_{hidden\_explicit}}(t+1) \times 
 W_{hidden}^T \times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t)))$$

By using the two equations computed above, we have:

$ \frac{\partial l}{\partial net_{hidden}} = \frac{\partial l(t)}{\partial net_{hidden}(t)} +  \frac{\partial l(t+1)}{\partial net_{hidden}(t)} $

$ \delta_{net_{hidden}(t)} = \delta_{net_{hidden\_explicit}}(t) +  \delta_{net_{hidden\_implicit}}(t) $

$ \delta_{net_{hidden}(t)} = 
(\delta_{net_{output}}(t) \times W_{output}^T) \times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t))) 
+
\delta_{net_{hidden\_explicit}}(t+1) \times 
 W_{hidden}^T \times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t)))
$

$$ \delta_{net_{hidden}(t)} = 
(\delta_{net_{output}}(t) \times W_{output}^T + \delta_{net_{hidden\_explicit}}(t+1) \times 
 W_{hidden}^T)
\times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t)))
$$

**$ W_{hidden} $**

$ \frac{\partial l}{\partial w_{hidden}} = ? $

$ \frac{\partial l(t)}{\partial w_{hidden}} = 
\frac{\partial l(t)}{\partial net_{hidden}(t)}
\times
\frac{\partial net_{hidden}(t)}{\partial w_{hidden}}
$

$ \frac{\partial l(t)}{\partial w_{hidden}} = 
\delta_{net_{hidden}(t)}
\times
\frac{\partial net_{hidden}(t)}{\partial w_{hidden}}
$

$ \frac{\partial net_{hidden}(t)}{\partial w_{hidden}} =
\frac{\partial \ (net_{input}(t) + prev_{hidden} \times W_{hidden})}{\partial w_{hidden}} $

$ \frac{\partial net_{hidden}(t)}{\partial w_{hidden}} =
\frac{\partial \ prev_{hidden} \times W_{hidden}}{\partial w_{hidden}} = prev_{hidden}^T
$

<div id="test" class="equation">
    
$$ \frac{\partial l(t)}{\partial w_{hidden}} = 
\delta_{net_{hidden}(t)}
\times
prev_{hidden}^T
$$

</div>

**$ W_{input} $**

$ \frac{\partial l}{\partial w_{input}} = ? $

$ \frac{\partial l(t)}{\partial w_{input}} = 
\frac{\partial l(t)}{\partial net_{hidden}(t)}
\times
\frac{\partial net_{hidden}(t)}{\partial w_{input}}
$

$ \frac{\partial l(t)}{\partial w_{input}} = 
\delta_{net_{hidden}(t)}
\times
\frac{\partial net_{hidden}(t)}{\partial w_{input}}
$

$ \frac{\partial net_{hidden}(t)}{\partial w_{input}} =
\frac{\partial \ (net_{input}(t) + prev_{hidden} \times W_{hidden})}{\partial w_{input}} $

$ \frac{\partial net_{hidden}(t)}{\partial w_{input}} =
\frac{\partial net_{input}(t)}{\partial w_{input}} $

$ \frac{\partial net_{hidden}(t)}{\partial w_{input}} = 
\frac{\partial (x(t) \times  W_{input})}{\partial w_{input}} = x(t)^T
$

<div id="test" class="equation">

$$ \frac{\partial l(t)}{\partial w_{input}} = 
\delta_{net_{hidden}(t)}
\times
x(t)^T
$$
    
</div>    
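
A gradient check is a useful way to test these formulas before relying on them. The sketch below implements them for a tiny RNN (3 hidden units, 4 time steps, random data, all chosen for illustration) and compares one entry of the analytic gradient of $W_{hidden}$ with a finite-difference estimate. Two assumptions to note: the products are written in the order that makes the row-vector shapes match (`a_prev.T.dot(delta)`), and the recursion passes the full hidden delta of step t+1 backward rather than only its explicit part. All function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(xs, W_in, W_h, W_out):
    """Run the RNN over all time steps; return hidden states and predictions."""
    hidden = [np.zeros((1, W_h.shape[0]))]
    y_hats = []
    for x in xs:
        hidden.append(sigmoid(x.dot(W_in) + hidden[-1].dot(W_h)))
        y_hats.append(sigmoid(hidden[-1].dot(W_out)))
    return hidden, y_hats

def total_loss(xs, ys, W_in, W_h, W_out):
    _, y_hats = forward(xs, W_in, W_h, W_out)
    return sum(0.5 * ((y - y_hat) ** 2).item() for y, y_hat in zip(ys, y_hats))

def bptt(xs, ys, W_in, W_h, W_out):
    hidden, y_hats = forward(xs, W_in, W_h, W_out)
    dW_in, dW_h, dW_out = (np.zeros_like(W) for W in (W_in, W_h, W_out))
    future_delta = np.zeros((1, W_h.shape[0]))       # hidden delta of step t+1
    for t in reversed(range(len(xs))):
        a_t, a_prev = hidden[t + 1], hidden[t]
        delta_out = (y_hats[t] - ys[t]) * y_hats[t] * (1 - y_hats[t])
        delta_hidden = (delta_out.dot(W_out.T) + future_delta.dot(W_h.T)) * a_t * (1 - a_t)
        dW_out += a_t.T.dot(delta_out)
        dW_h += a_prev.T.dot(delta_hidden)
        dW_in += xs[t].T.dot(delta_hidden)
        future_delta = delta_hidden
    return dW_in, dW_h, dW_out

rng = np.random.default_rng(0)
H, T = 3, 4
W_in, W_h, W_out = (2 * rng.random(s) - 1 for s in [(2, H), (H, H), (H, 1)])
xs = [rng.integers(0, 2, size=(1, 2)).astype(float) for _ in range(T)]
ys = [rng.integers(0, 2, size=(1, 1)).astype(float) for _ in range(T)]

# Compare one analytic entry of dl/dW_hidden with a finite-difference estimate.
dW_in, dW_h, dW_out = bptt(xs, ys, W_in, W_h, W_out)
eps = 1e-6
W_plus, W_minus = W_h.copy(), W_h.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (total_loss(xs, ys, W_in, W_plus, W_out)
           - total_loss(xs, ys, W_in, W_minus, W_out)) / (2 * eps)
print(dW_h[0, 1], numeric)
```

If the two printed numbers disagree beyond rounding, the usual suspects are a missing sigmoid derivative or a time index that is off by one.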

In [None]:
    def bptt(self, a, b, c, predicated_values, hidden_values):
        
        sigmoid = sigmoid_activation()
        
        # delta of the hidden layer at the following time step; zero after the last step
        future_hidden_delta = np.zeros_like(hidden_values[0])
        
        # Initialize Updated Weights Values
        W_output_update = np.zeros_like(self.output_layer.weights)
        W_hidden_update = np.zeros_like(self.hidden_layer.weights)
        W_input_update = np.zeros_like(self.input_layer.weights)
        
        # feed_forward visits the columns from binary_dim-1 down to 0,
        # so walking the columns from 0 upwards goes backwards through time
        for time_step in range(self.binary_dim):
            
            # s_t = h(t), s_t_prev = h(t-1)
            time_step_hidden_value_index = self.binary_dim - time_step
            s_t = hidden_values[time_step_hidden_value_index]
            s_t_prev = hidden_values[time_step_hidden_value_index - 1]
            
            # target value, inputs and prediction of this time step
            y = np.array([[c[time_step]]])
            X = np.array([[a[time_step], b[time_step]]])
            y_hat = predicated_values[time_step]
            
            # delta at output layer: (y_hat - y) * sigmoid'(net_output)
            delta_y_hat = self.loss_function.backward(y, y_hat)
            outputlayer_delta = delta_y_hat * sigmoid.backward(y_hat)
            
            # delta at hidden layer: error from the output plus error from the next time step
            hidden_delta = (outputlayer_delta.dot(self.output_layer.weights.T)
                            + future_hidden_delta.dot(self.hidden_layer.weights.T)) * sigmoid.backward(s_t)
            
            # accumulate the gradients of all weights
            W_output_update += s_t.T.dot(outputlayer_delta)
            W_hidden_update += s_t_prev.T.dot(hidden_delta)
            W_input_update += X.T.dot(hidden_delta)
            
            future_hidden_delta = hidden_delta
        
        return W_output_update, W_hidden_update, W_input_update

In [None]:
class hiddenLayerUnfold:
    
    def __init__(self, neuron_count):
        
        # Save the values obtained at Hidden Layer of current state in a list to keep track
        self.hidden_layer_values  = list()
        
        # Initially, there is no previous hidden state. So append "0" for that
        self.hidden_layer_values.append(np.zeros(neuron_count))
    
    def save_previous_hidden_layer_value(self, previous_hidden_layer_value):
        self.hidden_layer_values.append(copy.deepcopy(previous_hidden_layer_value))

In [4]:
class hidden_layer(network_layer):
    
    def __init__(self, neuron_count):
        super().__init__(neuron_count)
        #self.hiddenLayerUnfold = hiddenLayerUnfold(neuron_count)
        # Save the values obtained at Hidden Layer of current state in a list to keep track
        self.hidden_layer_values  = list()
        
        # Initially, there is no previous hidden state. So append "0" for that
        self.hidden_layer_values.append(np.zeros(neuron_count))
    
    def forward(self, input_layer_output, W_hidden):
        prev_hidden = self.hidden_layer_values[-1]      
        net_hidden = input_layer_output + np.dot(prev_hidden, W_hidden)
        sigmoid = sigmoid_activation()
        return sigmoid.forward(net_hidden)
    
    def backward(self, hidden_value_index, W_hidden, binary_dim):
        if hidden_value_index == -binary_dim:
            s_0 = self.hidden_layer_values[0]
            return s_0
        
        s_t = self.hidden_layer_values[hidden_value_index]
        t1 = s_t*(1-s_t)
        s_t_1 = self.hidden_layer_values[hidden_value_index-1]
        
        backward_prev = self.backward(hidden_value_index-1, W_hidden, binary_dim)
        t2 = backward_prev * W_hidden
        t3 = s_t_1 + t2
        return_value =  t1 * t3
        
        return return_value
    
    def save_previous_hidden_layer_value(self, previous_hidden_layer_value):
        #self.hiddenLayerUnfold.save_previous_hidden_layer_value(previous_hidden_layer_value)
        self.hidden_layer_values.append(copy.deepcopy(previous_hidden_layer_value))

In [None]:
class output_layer(network_layer):
    
    def forward(self, hidden_layer_output, W_output):
        net_output = np.dot(hidden_layer_output, W_output)
        sigmoid = sigmoid_activation()
        return sigmoid.forward(net_output)

In [None]:
class weight:
    
    @staticmethod
    def GetWeightMatrix(first_dimension, second_dimension):
        return 2*np.random.random((first_dimension,second_dimension)) - 1

In [None]:
class loss_function():
    
    @staticmethod
    def mse(target_value, predicted_value):
        return np.mean((target_value - predicted_value)**2)

In [None]:
class utility:
    
    @staticmethod
    def print_result(overallError, a_int, b_int, c, d):    
        print("Error:" + str(overallError))
        print("Pred:" + str(d))
        print("True:" + str(c))
        out = 0
        for index, x in enumerate(reversed(d)):
            out += x * pow(2, index)
        print(str(a_int) + " + " + str(b_int) + " = " + str(out))
        print("------------")

# Train

The general algorithm is:

   1. First, present the input pattern and propagate it through the network to get the output.
    
   2. Then compare the predicted output to the expected output and calculate the error.

   3. Then calculate the derivatives of the error with respect to the network weights.
    
   4. Finally, adjust the weights so that the error decreases.
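
The four steps above can be sketched as a compact, self-contained training loop. It uses plain arrays instead of the layer classes, samples operands below 128 so that the sum fits in 8 bits, and applies plain gradient descent. The hyperparameters (16 hidden units, learning rate 0.1, 10000 iterations) mirror the values used at the end of this notebook; the helper names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def to_bits(n):
    # 8-bit binary encoding, most significant bit first
    return np.unpackbits(np.array([n], dtype=np.uint8))

rng = np.random.default_rng(0)
binary_dim, hidden_dim, learning_rate = 8, 16, 0.1
W_in = 2 * rng.random((2, hidden_dim)) - 1
W_h = 2 * rng.random((hidden_dim, hidden_dim)) - 1
W_out = 2 * rng.random((hidden_dim, 1)) - 1

errors = []
for iteration in range(10000):
    a_int, b_int = (int(v) for v in rng.integers(0, 128, size=2))  # a + b < 256
    a, b, c = to_bits(a_int), to_bits(b_int), to_bits(a_int + b_int)

    # 1. forward pass, least significant bit (last column) first
    hidden = [np.zeros((1, hidden_dim))]
    y_hat = np.zeros(binary_dim)
    for col in range(binary_dim - 1, -1, -1):
        x = np.array([[a[col], b[col]]], dtype=float)
        hidden.append(sigmoid(x.dot(W_in) + hidden[-1].dot(W_h)))
        y_hat[col] = sigmoid(hidden[-1].dot(W_out))[0, 0]

    # 2. error between prediction and target
    errors.append(np.sum(np.abs(c - y_hat)))

    # 3. gradients via BPTT, walking the time steps in reverse
    dW_in, dW_h, dW_out = np.zeros_like(W_in), np.zeros_like(W_h), np.zeros_like(W_out)
    future_delta = np.zeros((1, hidden_dim))
    for col in range(binary_dim):
        a_t, a_prev = hidden[binary_dim - col], hidden[binary_dim - col - 1]
        x = np.array([[a[col], b[col]]], dtype=float)
        delta_out = np.array([[(y_hat[col] - c[col]) * y_hat[col] * (1 - y_hat[col])]])
        delta_hidden = (delta_out.dot(W_out.T) + future_delta.dot(W_h.T)) * a_t * (1 - a_t)
        dW_out += a_t.T.dot(delta_out)
        dW_h += a_prev.T.dot(delta_hidden)
        dW_in += x.T.dot(delta_hidden)
        future_delta = delta_hidden

    # 4. gradient descent step
    W_in -= learning_rate * dW_in
    W_h -= learning_rate * dW_h
    W_out -= learning_rate * dW_out

print("mean error, first 1000 iterations:", np.mean(errors[:1000]))
print("mean error, last 1000 iterations: ", np.mean(errors[-1000:]))
```

The error is the summed absolute difference over the 8 output bits, so a value near 4 means the network is still guessing and a value near 0 means it has learned to carry.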

## Forward propagation

This is how the sum of two bits is predicted at each step:

**input layer**

X = two input bits (a,b)

$ net_{input} = X \times  W_{input} $

No activation function is used in this layer

**hidden layer**

$ net_{hidden} = net_{input} + prev_{hidden} \times W_{hidden} $

The activation function used in this layer is the sigmoid:

$ A_{hidden} = A(net_{hidden}) = \sigma(net_{hidden}) $

**output layer**

$ net_{output} = A_{hidden} \times  W_{output} $

$ \hat{y}\ (predicted\ value) = A_{output} = A(net_{output}) = \sigma(net_{output}) $

The predicted value $\hat{y}$ is the single output bit of the RNN at this step, i.e. one bit of a + b.


Note that unlike the corresponding equation in the previous example, the equations for $ W_{hidden}$ and $W_{input}$ are **recursive**: $prev_{hidden}$ was itself computed with these weights, so the derivative at the current time depends on the derivative at the previous time.

**$ W_{hidden} $**

$$ \frac{\partial E}{\partial W_{hidden}} = \frac{\partial E}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial net_{output}} \times \frac{\partial net_{output}}{\partial A_{hidden}} \times \frac{\partial A_{hidden}}{\partial net_{hidden}} \times \frac{\partial net_{hidden}}{\partial W_{hidden}} $$

To compute the equation above, the following derivatives are needed:

$ \frac{\partial E}{\partial \hat{y}} = -(y - \hat{y}) $

$ \frac{\partial \hat{y}}{\partial net_{output}} = \hat{y}(1-\hat{y})$

$ \frac{\partial net_{output}}{\partial A_{hidden}} =  \frac{\partial A_{hidden} \times  W_{output}}{\partial  A_{hidden}} = w_{output}$

$ \frac{\partial A_{hidden}}{\partial net_{hidden}} = \frac{\partial \sigma (net_{hidden})} {\partial net_{hidden}} =  \sigma (net_{hidden}) (1-\sigma (net_{hidden})) = A_{hidden}(1-A_{hidden})$

$ \frac{\partial net_{hidden}}{\partial w_{hidden}} =  \frac{\partial \ (net_{input} + prev_{hidden} \times W_{hidden})}{\partial w_{hidden}} = prev_{hidden} + \frac{\partial prev_{hidden}}{\partial w_{hidden}}\times w_{hidden}$

By substituting these results into the main equation, we end up with the following equation:

$$ \frac{\partial E}{\partial W_{hidden}} = -(y - \hat{y}) \times \hat{y}(1-\hat{y}) \times w_{output} \times A_{hidden}(1-A_{hidden}) \times (prev_{hidden} + \frac{\partial prev_{hidden}}{\partial w_{hidden}}\times w_{hidden})$$

**$ W_{input} $**

$$ \frac{\partial E}{\partial W_{input}} = \frac{\partial E}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial net_{output}} \times \frac{\partial net_{output}}{\partial A_{hidden}} \times \frac{\partial A_{hidden}}{\partial net_{hidden}} \times \frac{\partial net_{hidden}}{\partial net_{input}} \times \frac{\partial net_{input}}{\partial W_{input}}$$

To compute the equation above, the following derivatives are needed:

$ \frac{\partial E}{\partial \hat{y}} = -(y - \hat{y}) $

$ \frac{\partial \hat{y}}{\partial net_{output}} = \hat{y}(1-\hat{y})$

$ \frac{\partial net_{output}}{\partial A_{hidden}} = w_{output}$

$ \frac{\partial A_{hidden}}{\partial W_{input}} = \frac{\partial A_{hidden}}{\partial net_{hidden}} \times \frac{\partial net_{hidden}}{\partial W_{input}} $

$ \frac{\partial A_{hidden}}{\partial net_{hidden}} = A_{hidden}(1-A_{hidden}) $

$ \frac{\partial net_{hidden}}{\partial W_{input}} =  \frac{\partial \ (net_{input} + prev_{hidden} \times W_{hidden})}{\partial W_{input}} =  \frac{\partial \ (X \times  W_{input} + prev_{hidden} \times W_{hidden})}{\partial W_{input}} = X +  \frac{\partial \ ( prev_{hidden} \times W_{hidden})}{\partial W_{input}} = X + \frac{\partial \ prev_{hidden}}{\partial W_{input} }\times W_{hidden} $

$ \frac{\partial A_{hidden}}{\partial W_{input}} = A_{hidden}(1-A_{hidden}) \times (X + \frac{\partial \ prev_{hidden}}{\partial W_{input} }\times W_{hidden}) $

By substituting these results into the main equation, we end up with the following equation:

$$ \frac{\partial E}{\partial W_{input}} = -(y - \hat{y}) \times \hat{y}(1-\hat{y}) \times w_{output} \times A_{hidden}(1-A_{hidden}) \times (X + \frac{\partial prev_{hidden}}{\partial W_{input} }\times W_{hidden}) $$

In [None]:
class simple_binary_addition_rnn:
    
    def __init__(self, binary_dim, hidden_dimension, learning_rate):
        
        self.binary_dim = binary_dim
        input_dimension = 2
        output_dimension = 1    
        
        # predicated_values
        self.predicated_values = np.zeros(self.binary_dim)        
        
        # layers
        self.input_layer = input_layer(input_dimension)
        self.hidden_layer = hidden_layer(hidden_dimension)
        self.output_layer = output_layer(output_dimension)
        
        # initialize weights
        self.W_input = weight.GetWeightMatrix(input_dimension, hidden_dimension)
        self.W_hidden = weight.GetWeightMatrix(hidden_dimension, hidden_dimension)
        self.W_output = weight.GetWeightMatrix(hidden_dimension, output_dimension)
        
        self.learning_rate = learning_rate
        self.overallError = 0
        
    def feed_forward(self, a, b, c):
        
        # Array to save predicted outputs (binary encoded)
        d = np.zeros_like(c)
    
        # position: time step, 0 .. binary_dim-1
        # location: index of the bit processed at that step (LSB first)
        for position in range(self.binary_dim):

            location = self.binary_dim - position - 1
            X = np.array([[a[location], b[location]]])           
            
            # ----------- forward ---------------
            # input_layer forward
            input_layer_output = self.input_layer.forward(X, self.W_input)            
            
            # hidden_layer forward
            hidden_layer_output = self.hidden_layer.forward(input_layer_output, self.W_hidden)
            
            # Save the hidden layer to be used in BPTT            
            self.hidden_layer.save_previous_hidden_layer_value(hidden_layer_output)
            
            # output_layer forward
            # predicated_value is the network's "guess" for the current output bit.
            # It is compared with the true bit later, in back_propagate.
            predicated_value = self.output_layer.forward(hidden_layer_output, self.W_output)          

            # Keep the raw prediction so that back_propagate can use it
            self.predicated_values[location] = predicated_value[0][0]

            # Round off the values to nearest "0" or "1" and save it to a list
            d[location] = np.round(predicated_value[0][0])   
            
        return d, self.predicated_values

    def back_propagate(self, a, b, c, predicated_values):
        
        # Initialize Updated Weights Values
        W_output_update = np.zeros_like(self.W_output)
        W_hidden_update = np.zeros_like(self.W_hidden)
        W_input_update = np.zeros_like(self.W_input)        
        
        # for position in range(self.binary_dim-1, -1, -1):  # binary_dim=8=> position: 7->0
        for position in range(self.binary_dim):           

            y = np.array([[c[position]]]).T        
            
            # sigmoid
            sigmoid = sigmoid_activation()
          
            hidden_value_index = -position-1
            A_hidden = self.hidden_layer.hidden_layer_values[hidden_value_index]
          
            # update W_output ----------------------------------------------------         
            y_hat = predicated_values[position]            
            dy_hat = (y-y_hat)
            
            # W_output---------------------
            dnet_output = dy_hat * sigmoid.backward(y_hat)
            dw_output = dnet_output* A_hidden.T           
            W_output_update += dw_output     

            # W_hidden ---------------------
            dA_hidden = dnet_output*self.W_output            
                    
            t3 = self.hidden_layer.backward(hidden_value_index, self.W_hidden, self.binary_dim)
            t4 = dA_hidden*t3            
            W_hidden_update += t4        
            
            # W_input ---------------------
            t_in_3 = self.input_layer.backward(a, b, hidden_value_index, self.W_hidden, self.binary_dim, self.hidden_layer.hidden_layer_values)
            t_in_4 = dA_hidden*t_in_3            
            W_input_update += t_in_4
            
            
        self.W_output += W_output_update * self.learning_rate
        self.W_hidden += W_hidden_update * self.learning_rate
        self.W_input += W_input_update * self.learning_rate
   
    
    def train(self, epochs_count):
        
        data = dataset(self.binary_dim)        
        
        # This for loop "iterates" multiple times over the training code to optimize our network to the dataset.
        for epoch in range(epochs_count):
            
            # sample a + b = c
            # for example: 2 + 3 = 5 => (a) 00000010 + (b) 00000011 = (c) 00000101
            a, b, c, a_int, b_int, c_int = data.get_sample_addition_problem()
            
            # d holds the rounded (binary encoded) predictions,
            # predicated_values the raw sigmoid outputs
            d, predicated_values = self.feed_forward(a, b, c)
            
            # total absolute error over all bits of this sample
            overallError = np.sum(np.abs(c - predicated_values))
            
            self.back_propagate(a, b, c, predicated_values)
    
            # Print out the Progress of the RNN
            if (epoch % 1000 == 0):
                utility.print_result(overallError, a_int, b_int, c, d)
                
 

In [None]:
binary_dim = 8
hidden_dimension = 16
learning_rate = 0.1
rnn = simple_binary_addition_rnn(binary_dim, hidden_dimension, learning_rate)
rnn.train(10000)
