# Binary addition
_What exactly will the RNN learn?_

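**The RNN is going to learn the carry bit on its own!** The training data never states the carry explicitly, so the network has to keep it in its hidden state. The truth table of one addition step is: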


| input1 | input2 | carry-in | sum | carry-out |
|:---:|:---:|:---:|:---:|:---:|
| 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 |
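
The truth table above is that of a full adder. As a minimal sketch (the helper name `full_adder` is an assumption made for illustration and is not part of the notebook), the same rows can be reproduced with integer arithmetic:

```python
def full_adder(bit1, bit2, carry_in):
    """Return (sum_bit, carry_out) for one column of binary addition."""
    total = bit1 + bit2 + carry_in
    return total % 2, total // 2


# Print every row of the truth table: inputs -> sum, carry-out.
for bit1 in (0, 1):
    for bit2 in (0, 1):
        for carry_in in (0, 1):
            sum_bit, carry_out = full_adder(bit1, bit2, carry_in)
            print(bit1, bit2, carry_in, "->", sum_bit, carry_out)
```

This is the function the RNN has to approximate at every time step, with the carry held in the hidden state instead of being passed explicitly.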

In [157]:
import numpy as np
from abc import ABC, abstractmethod

# importing "collections" for deque operations 
from collections import deque

## Samples
To train and test our RNN, a dataset is needed.
Each sample in the dataset consists of three 8-bit binary arrays: the inputs a and b and the target c = a + b. The following samples show how they are represented:

| a | b | c |
| :---: | :---: | :---: |
| [0 0 0 1 0 0 1 0] | [0 1 0 0 1 1 0 1] | [0 1 0 1 1 1 1 1] |
| [0 0 0 0 1 1 1 0] | [0 0 0 1 1 0 0 1] | [0 0 1 0 0 1 1 1] |
| [0 1 0 1 0 0 1 1] | [0 0 1 0 1 1 1 1] | [1 0 0 0 0 0 1 0] |
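
As a small sketch of how such a sample can be produced, the first row of the table (a = 18, b = 77) is encoded below with `np.unpackbits`, the same NumPy call the notebook uses. The concrete numbers are taken from the table.

```python
import numpy as np

a, b = 18, 77
c = a + b  # 95, the target of the addition

# One integer per row; unpackbits expands each uint8 into its 8 bits (MSB first).
int_array = np.array([[a], [b], [c]], dtype=np.uint8)
binary_array = np.unpackbits(int_array, axis=1)

print(binary_array)
```

Row 0 is a, row 1 is b and row 2 is c. Because the dtype is `uint8`, this encoding only works for 8-bit numbers.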

It's good practice to seed your random numbers. Your numbers will still be randomly distributed, but they'll be randomly distributed in exactly the same way each time you train. This makes it easier to see how your changes affect the network.
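
A short sketch of what seeding buys you: resetting the seed before two draws yields exactly the same numbers. The upper bound 128 and the sample size 5 are arbitrary values chosen for illustration.

```python
import numpy as np

np.random.seed(0)
first_draw = np.random.randint(128, size=5)

np.random.seed(0)  # reset the generator to the same state
second_draw = np.random.randint(128, size=5)

print(np.array_equal(first_draw, second_draw))  # True
```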

In [158]:
np.random.seed(0)

In [159]:
class dataset_utility:
    
    @staticmethod
    def get_data(samples_count, binary_dim):
        
        largest_number = pow(2,binary_dim)
        
        samples = list()
        for i in range(samples_count):
            
            # draw a and b below largest_number/2 so that a + b still fits in binary_dim bits
            a = np.random.randint(largest_number // 2)
            b = np.random.randint(largest_number // 2)
            # true answer => summation
            c = a + b            
                    
            int_array = np.array([[a], [b], [c]], dtype=np.uint8)
            
            binary_array = np.unpackbits(int_array, axis=1)
            
            samples.append(binary_array)
            
        return samples

    @staticmethod
    def get_inputs_and_target(sample):
         a = sample[0]
         b = sample[1]
         c = sample[2]
         return a, b, c  

In [160]:
class utility:
    
    @staticmethod
    def print_result(a, b, c, predicated_values, epoch):  
        d = np.zeros_like(c)
        for position in range(len(predicated_values)):             
             d[position] = np.round(predicated_values[position][0][0])
        
        print("epoch:", epoch)
        print("a:   " + str(a))
        print("b:   " + str(b))        
        print("----------------------")
        print("c:   " + str(c))
        print("Pred:" + str(d))
        print("============================")

# Network Architecture
The following architecture is used to add two bits at each step:

<img src="./images/network_architecture.jpg">
<center>Figure 1</center>


# Computation model

The following model is used to compute the addition of two bits at each step:

<img src="./images/forward_one_step.jpg">
<center>Figure 2</center>

# RNN gates

As can be seen in the computation model (Figure 2), there are two types of gates in our RNN:
    1. multiply gate
    2. add gate

In [161]:
class multiply_gate:
    
    @staticmethod
    def forward(inputs, weights):
        return np.dot(inputs, weights)
    
    @staticmethod
    def backward(weights):
        return weights.T

class add_gate:
    
    @staticmethod
    def forward(input1, input2):
        return input1 + input2
    
    @staticmethod
    def backward(input1, input2):
        # the local gradient of a sum with respect to each of its inputs is 1
        return np.ones_like(input1), np.ones_like(input2)

# Activation function

Our network uses the sigmoid function as its activation. A **sigmoid function** maps any real value to a value between 0 and 1.

forward

$$ \sigma(x) = \frac{1}{1+e^{-x}}$$

backward
$$ \frac{\partial \sigma(x)}{\partial x} =  \sigma(x)(1- \sigma(x))$$
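
The derivative formula can be checked numerically. The sketch below compares it with a central finite difference; the helper `sigmoid`, the sample points and the step size `eps` are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


x = np.linspace(-4.0, 4.0, 9)
eps = 1e-6

# Central finite difference versus the closed-form derivative.
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid(x) * (1.0 - sigmoid(x))

max_gap = np.max(np.abs(numeric - analytic))
print(max_gap)  # a very small number if the formula is right
```

Note that the closed form only needs the output of the sigmoid, which is why the backward pass can work from stored activations.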

In [162]:
class sigmoid_activation():
        
    @staticmethod
    def forward(net):
        return 1/(1 + np.exp(-net))
    
    @staticmethod
    def backward(output):
        return output*(1 - output)  

# Network layers

According to the network architecture shown in Figure 1, there are three layers in our RNN:
    1. input layer
    2. hidden layer
    3. output layer

## Initialize weights

Random weights for each layer are generated with NumPy's `np.random.random` and rescaled from [0, 1) to [-1, 1).
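
A quick sketch of this initialization, assuming a 2 x 16 weight matrix (the shape of the input layer when the hidden dimension is 16): `np.random.random` draws from [0, 1), so `2 * r - 1` lies in [-1, 1).

```python
import numpy as np

np.random.seed(0)

# Uniform values in [0, 1) rescaled to [-1, 1), centred on zero.
weights = 2 * np.random.random((2, 16)) - 1

print(weights.shape, weights.min(), weights.max())
```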

## Forward propagation

Every layer takes part in two passes, forward and backward. Each layer class therefore implements a `forward` method; the backward pass is implemented separately in `back_propagation.bptt`.

In [163]:
class random_generator:
    
    @staticmethod
    def get_random_weight_matrix(input_dimension, output_dimension):
        return 2*np.random.random((input_dimension,output_dimension)) - 1

In [164]:
class network_layer(ABC):
    
    def __init__(self, input_dimension, output_dimension):        
        self.weights = random_generator.get_random_weight_matrix(input_dimension, output_dimension)
    
    @abstractmethod
    def forward(self, input):
        pass   

## Input layer

In this layer, the inputs a(t) and b(t) are combined into a row vector x, which is multiplied by the input layer weights in the forward phase.

**forward**

$ net_{input} = x \times  W_{input} $

In [165]:
class input_layer(network_layer):
    
    def forward(self, x):
        return multiply_gate.forward(x, self.weights)

## Hidden layer

### forward

In this layer following equations should be implemented in forward propagation:

$ net_{hidden} = net_{input} + prev_{hidden} \times W_{hidden} $

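Here $ prev_{hidden} $ is the hidden activation of the previous time step (all zeros at the first step). The sigmoid is used as the activation function in this layer: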

$ a_{hidden} = \sigma(net_{hidden}) $
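
The two equations can be traced for a single time step. This is a minimal sketch with plain NumPy arrays instead of the layer classes; the hidden dimension of 4 and the input bits are assumptions chosen to keep the output small.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


np.random.seed(0)
hidden_dimension = 4  # illustrative; the notebook uses a larger value

W_input = 2 * np.random.random((2, hidden_dimension)) - 1
W_hidden = 2 * np.random.random((hidden_dimension, hidden_dimension)) - 1

x = np.array([[1, 0]])                            # a(t) = 1, b(t) = 0
prev_hidden = np.zeros((1, hidden_dimension))     # no history at the first step

net_input = x.dot(W_input)                          # shape (1, hidden_dimension)
net_hidden = net_input + prev_hidden.dot(W_hidden)  # add the recurrent term
a_hidden = sigmoid(net_hidden)

print(a_hidden.shape)
```

At the first step the recurrent term is zero, so `net_hidden` equals `net_input`; from the second step on, `prev_hidden` carries information such as the carry forward in time.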


In [166]:
class hidden_layer(network_layer):
    
    def forward(self, net_input, s_t_prev):        
        net_hidden = add_gate.forward(net_input, multiply_gate.forward(s_t_prev, self.weights))
        return sigmoid_activation.forward(net_hidden)

## Output layer

In this layer following equations should be implemented in forward propagation:


### forward

$ net_{output} = a_{hidden} \times  W_{output} $

$ \hat{y}\ (predicted\ value) = a_{output} = a(net_{output}) = \sigma(net_{output}) $

The predicted value is the single output bit of the RNN at the current step, i.e. one bit of a + b.
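
A small sketch of the output step follows, with hand-picked numbers so the arithmetic can be followed. The hidden activation and the weights are illustrative assumptions, not trained values.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


a_hidden = np.array([[0.2, 0.9, 0.5]])          # hidden activation, shape (1, 3)
W_output = np.array([[1.0], [-2.0], [0.5]])     # output weights, shape (3, 1)

net_output = a_hidden.dot(W_output)             # 0.2 - 1.8 + 0.25 = -1.35
y_hat = sigmoid(net_output)                     # about 0.21
predicted_bit = int(np.round(y_hat[0, 0]))      # rounds to bit 0

print(y_hat, predicted_bit)
```

Rounding the sigmoid output at 0.5 is how the notebook's `print_result` turns the continuous prediction into a bit.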

In [167]:
class output_layer(network_layer):
    
    def forward(self, activation_hidden): 
        net_output = multiply_gate.forward(activation_hidden, self.weights)    
        return sigmoid_activation.forward(net_output)

# Binary addition RNN

In the rest of the code, our RNN is going to be demonstrated.

## Initialization

As explained above, the RNN is designed to add two binary arrays, so the length of these arrays (`binary_dim`) is needed for initialization.

Because two bits are added at each step, the dimension of the input layer is 2. The result of each step is a single bit, so the dimension of the output layer is 1. There is no single optimal size for the hidden layer; it is a hyperparameter that you choose.

The last ingredient is the loss function, which measures the error that is backpropagated through all layers. The **mean squared error** is used for this purpose.

# Train

The general algorithm is

   1. First, present the input pattern and propagate it through the network to get the output.
    
   2. Then compare the predicted output to the expected output and calculate the error.

   3. Then calculate the derivatives of the error with respect to the network weights.
    
   4. Adjust the weights in the direction that reduces the error.
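
The four steps can be sketched on a toy problem. The example below is not the RNN: it trains a single sigmoid neuron on logical OR, and the data, learning rate and number of steps are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


np.random.seed(0)
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([[0], [1], [1], [1]], dtype=float)  # logical OR

weights = 2 * np.random.random((2, 1)) - 1
bias = np.zeros((1, 1))
learning_rate = 1.0

losses = []
for step in range(2000):
    # 1. forward pass: propagate the inputs to get the output
    predictions = sigmoid(inputs.dot(weights) + bias)

    # 2. compare the prediction with the target
    error = predictions - targets
    losses.append(0.5 * np.mean(error ** 2))

    # 3. derivatives of the error with respect to the parameters
    delta = error * predictions * (1 - predictions) / len(inputs)
    grad_weights = inputs.T.dot(delta)
    grad_bias = delta.sum(axis=0, keepdims=True)

    # 4. move the parameters against the gradient
    weights -= learning_rate * grad_weights
    bias -= learning_rate * grad_bias

print(losses[0], losses[-1])
print(np.round(predictions).ravel())
```

The RNN follows the same loop; the only difference is that step 3 has to propagate the error through all time steps.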

## Forward propagation

The `feed_forward` method of `binary_addition_rnn` iteratively updates the state through time and returns the hidden values of every step together with the predicted values. Figure 3 shows binary addition step by step: the prediction starts at the least significant bit (LSB) and proceeds to the most significant bit (MSB), which is the order in which a carry propagates.

<img src="./images/binary_addition_steps.gif" />
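
The right-to-left order can be sketched without any network: a plain ripple-carry addition walks the columns in the same order as `feed_forward`, with an explicit `carry` variable playing the role the hidden state has to learn. The helper name `add_bit_arrays` is an assumption made for illustration.

```python
def add_bit_arrays(a_bits, b_bits):
    """Add two equal-length bit lists (MSB first), dropping any final carry."""
    carry = 0
    result = [0] * len(a_bits)
    # Walk the columns from the last index (LSB) to index 0 (MSB).
    for column in range(len(a_bits) - 1, -1, -1):
        total = a_bits[column] + b_bits[column] + carry
        result[column] = total % 2
        carry = total // 2
    return result


a_bits = [0, 0, 0, 1, 0, 0, 1, 0]  # 18
b_bits = [0, 1, 0, 0, 1, 1, 0, 1]  # 77
print(add_bit_arrays(a_bits, b_bits))  # [0, 1, 0, 1, 1, 1, 1, 1], i.e. 95
```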

# Test

This phase is the same as the training phase, except that the back propagation step is skipped and the weights are not updated.

In [168]:
class mse_loss_function():
    
    @staticmethod
    def forward(target_value, predicted_value):
        return np.mean((target_value - predicted_value)**2)
    
    @staticmethod
    def backward(target_value, predicted_value):
        # derivative of 1/2*(y - y_hat)^2 with respect to y_hat
        return predicted_value - target_value

In [169]:
class binary_addition_rnn:
    
    def __init__(self, binary_dim, hidden_dimension, learning_rate):
        
        self.learning_rate = learning_rate
        self.binary_dim = binary_dim
        input_dimension = 2 # two numbers a, b
        self.hidden_dimension = hidden_dimension
        output_dimension = 1 # result of addition, c = a+b
        
        self.input_layer = input_layer(input_dimension, hidden_dimension)
        self.hidden_layer = hidden_layer(hidden_dimension, hidden_dimension)
        self.output_layer = output_layer(hidden_dimension,output_dimension)
        
         # predicated_values array
        self.predicated_values = np.zeros(self.binary_dim)

    def feed_forward(self, a, b, c):        
        
        hidden_values = list()
        hidden_values.append(np.zeros((1, self.hidden_dimension)))
        
        prediction_values = deque([])
        
        # Proceed from right-to-left, column-by-column, starting from last digit
        for column in range(self.binary_dim-1, -1, -1):
            
            # It is given two input digits at each time step. 
            X = np.array([[a[column], b[column]]])
            
            # input layer
            net_input_layer = self.input_layer.forward(X) # X*W_in
            
            # hidden layer
            s_t_prev = hidden_values[-1]
            activation_hidden = self.hidden_layer.forward(net_input_layer, s_t_prev)
            
            # save activation_hidden for BPTT
            hidden_values.append(activation_hidden)
            
            # output layer
            prediction_value = self.output_layer.forward(activation_hidden)
            prediction_values.appendleft(prediction_value)
            
        return prediction_values, hidden_values

    def back_propagate(self, a, b, c, predicated_values, hidden_values):
        
        # BPTT
        dl_dw_output, dl_dw_hidden, dl_dw_input = back_propagation.bptt(a, b, c, predicated_values, hidden_values, self.hidden_dimension, self.output_layer.weights, self.hidden_layer.weights, self.input_layer.weights, self.binary_dim)
        
        self.output_layer.weights -= dl_dw_output * self.learning_rate
        self.hidden_layer.weights -= dl_dw_hidden * self.learning_rate
        self.input_layer.weights -= dl_dw_input * self.learning_rate  

    def train(self, dataset_train):
           
        epochs_count = len(dataset_train)
        
        # Each iteration trains on a single sample, so an "epoch" here is one training example.
        for epoch in range(epochs_count):
            
            overallError = 0            
                             
            a, b, c = dataset_utility.get_inputs_and_target(dataset_train[epoch])           
            
            # feed forward propagation
            predicated_values, hidden_values = self.feed_forward(a, b, c)
            
            # back propagation
            self.back_propagate(a, b, c, predicated_values, hidden_values)
            
            # Print out the Progress of the RNN
            if (epoch % 1000 == 0):
                 utility.print_result(a, b, c, predicated_values, epoch)


    def test(self, dataset_test):

        epochs_count = len(dataset_test)
        
        # Run every test sample through the network without updating the weights.
        for epoch in range(epochs_count):
                        
            a, b, c = dataset_utility.get_inputs_and_target(dataset_test[epoch])
            
            # feed forward propagation
            predicated_values, hidden_values = self.feed_forward(a, b, c)
            
            # Print out the Progress of the RNN
            if (epoch % 1000 == 0):
                 utility.print_result(a, b, c, predicated_values, epoch)

## Back propagation through time (BPTT)

BPTT works by unrolling all input timesteps. Each timestep has one input time step, one output time step and one copy of the network. Then the errors are calculated and accumulated for each timestep. The network is then rolled back to update the weights.

The following weights are used to compute the predicted value, so they are the ones BPTT has to update for the next iteration:

$ W_{output} $

$ W_{hidden} $

$ W_{input} $

### Chain rule
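The chain rule is used to obtain the gradient of the loss with respect to each weight matrix. Applying it gives the following equations.

## Loss function
The squared error of one output bit, scaled by $\frac{1}{2}$ so that the factor 2 cancels in the derivative, is used as the loss function: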

$ l = \frac{1}{2}(y - \hat{y})^2 $

$ \frac{\partial l}{\partial \hat{y}} = \frac{\partial \frac{1}{2}(y - \hat{y})^2 }{\partial \hat{y}} = 2 \times \frac{1}{2} \times -1 \times (y - \hat{y}) = \hat{y} - y $

$ \frac{\partial l}{\partial net_{output}} = \frac{\partial l}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial net_{output}} $

$ \frac{\partial l}{\partial net_{output}} = \frac{\partial l}{\partial \hat{y}} \times \frac{\partial \sigma({net_{output})}}{\partial net_{output}} $

$ \frac{\partial l}{\partial net_{output}} = (\hat{y} - y) \times \sigma(net_{output}) \bigodot(1-\sigma(net_{output})) $

$ \frac{\partial l}{\partial net_{output}} = (\hat{y} - y) \times \hat{y} \bigodot(1-\hat{y}) $

$ \frac{\partial l}{\partial net_{output}} $ is needed in following equations. Therefore, it is assumed that

$ \frac{\partial l}{\partial net_{output}} = \delta_{net_{output}}$

$$ \delta_{net_{output}} = (\hat{y} - y) \times \hat{y} \bigodot(1-\hat{y}) $$
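
The expression for $\delta_{net_{output}}$ can be checked against a finite difference of the loss. This is a sketch for a single scalar output; the values of `net_output`, `y` and `eps` are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def loss(net_output, y):
    # l = 1/2 * (y - y_hat)^2 with y_hat = sigmoid(net_output)
    return 0.5 * (y - sigmoid(net_output)) ** 2


net_output, y = 0.3, 1.0
y_hat = sigmoid(net_output)

# Closed form from the derivation: (y_hat - y) * y_hat * (1 - y_hat)
analytic_delta = (y_hat - y) * y_hat * (1 - y_hat)

# Central finite difference of the loss with respect to net_output
eps = 1e-6
numeric_delta = (loss(net_output + eps, y) - loss(net_output - eps, y)) / (2 * eps)

print(analytic_delta, numeric_delta)
```

The two values should agree to many decimal places. The delta is negative here because the prediction is below the target, so increasing `net_output` reduces the loss.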

### Updating $ W_{output} $

**$ \frac{\partial l}{\partial W_{output}} = ? $**

$ \frac{\partial l}{\partial W_{output}} = \frac{\partial l}{\partial net_{output}} \times \frac{\partial net_{output}}{\partial W_{output}} $

$ \frac{\partial l}{\partial W_{output}} = 
\frac{\partial l}{\partial net_{output}} \times \frac{\partial (a_{hidden} \times  W_{output})}{\partial W_{output}} $

<div id="test" class="equation">
$$ \frac{\partial l}{\partial W_{output}} = 
a_{hidden}^T \times \delta_{net_{output}}
$$
</div>

For hidden layer and input layer, $\frac{\partial l}{\partial a_{hidden}}$ should be computed

$ \frac{\partial l}{\partial a_{hidden}} = ? $

$ \frac{\partial l}{\partial a_{hidden}} =  \frac{\partial l}{\partial net_{output}} \times \frac{\partial net_{output}}{\partial a_{hidden}} $

$ \frac{\partial l}{\partial a_{hidden}} =  \frac{\partial l}{\partial net_{output}} \times \frac{\partial (a_{hidden} \times  W_{output}) }{\partial a_{hidden}} $ 

$$ \frac{\partial l}{\partial a_{hidden}} = \delta_{net_{output}} \times W_{output}^T $$
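
The sketch below computes both gradients for one time step and checks the weight gradient against finite differences. It assumes row-vector activations, as in the notebook's code, so the weight gradient is formed as `a_hidden.T.dot(delta_net_output)`. The hidden dimension of 3 and the helper `loss_for` are illustrative assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def loss_for(W_output, a_hidden, y):
    y_hat = sigmoid(a_hidden.dot(W_output))
    return 0.5 * ((y - y_hat) ** 2).item()


np.random.seed(0)
a_hidden = np.random.random((1, 3))
W_output = 2 * np.random.random((3, 1)) - 1
y = np.array([[1.0]])

y_hat = sigmoid(a_hidden.dot(W_output))
delta_net_output = (y_hat - y) * y_hat * (1 - y_hat)   # shape (1, 1)
dl_dW_output = a_hidden.T.dot(delta_net_output)        # shape (3, 1), same as W_output
dl_da_hidden = delta_net_output.dot(W_output.T)        # shape (1, 3), same as a_hidden

# Finite-difference check of dl/dW_output, one weight at a time.
eps = 1e-6
numeric = np.zeros_like(W_output)
for i in range(W_output.shape[0]):
    shift = np.zeros_like(W_output)
    shift[i, 0] = eps
    numeric[i, 0] = (loss_for(W_output + shift, a_hidden, y)
                     - loss_for(W_output - shift, a_hidden, y)) / (2 * eps)

print(np.max(np.abs(numeric - dl_dW_output)))
```

A useful sanity check is that each gradient has the same shape as the quantity it differentiates with respect to.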

$ \frac{\partial l}{\partial net_{hidden}} = ? $

$ \frac{\partial l}{\partial net_{hidden}} = \frac{\partial l(t)}{\partial net_{hidden}(t)} +  \frac{\partial l(t+1)}{\partial net_{hidden}(t)} $

$ \frac{\partial l(t)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t)}{\partial a_{hidden}(t)} \times 
\frac{\partial a_{hidden}(t)}{\partial net_{hidden}(t)} $

$ \frac{\partial l(t)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t)}{\partial a_{hidden}(t)} \times 
\frac{\partial \sigma(net_{hidden}(t))}{\partial net_{hidden}(t)} $

$ \frac{\partial l(t)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t)}{\partial a_{hidden}(t)} \times 
\sigma(net_{hidden}(t)) \bigodot(1-\sigma(net_{hidden}(t)))
$

$ \frac{\partial l}{\partial net_{hidden}} =  
\frac{\partial l}{\partial a_{hidden}} \times 
a_{hidden} \bigodot(1-a_{hidden})
$

$ \frac{\partial l(t)}{\partial net_{hidden}(t)} =  
(\delta_{net_{output}(t)} \times W_{output}^T) \times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t)))
$

$ \frac{\partial l(t)}{\partial net_{hidden}(t)} = \delta_{net_{hidden\_explicit}(t)} $

$$ \delta_{net_{hidden\_explicit}}(t) =  
(\delta_{net_{output}}(t) \times W_{output}^T) \times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t)))
$$

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} = ? $

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t+1)}{\partial net_{hidden}(t+1)} \times 
\frac{\partial net_{hidden}(t+1)}{\partial net_{hidden}(t)} $

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t+1)}{\partial net_{hidden}(t+1)} \times 
\frac{\partial (net_{input}(t+1) + a_{hidden}(t) \times W_{hidden}) }{\partial net_{hidden}(t)} $

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t+1)}{\partial net_{hidden}(t+1)} \times 
\frac{\partial (a_{hidden}(t) \times W_{hidden}) }{\partial net_{hidden}(t)} $

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} =  
\frac{\partial l(t+1)}{\partial net_{hidden}(t+1)} \times 
\frac{\partial (a_{hidden}(t) \times W_{hidden}) }{\partial a_{hidden}(t)} \times 
\frac{\partial a_{hidden}(t) }{\partial net_{hidden}(t)} 
$

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} =  
\delta_{net_{hidden\_explicit}}(t+1) \times 
 W_{hidden}^T \times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t)))$

$ \frac{\partial l(t+1)}{\partial net_{hidden}(t)} = \delta_{net_{hidden\_implicit}(t)} $

$$ \delta_{net_{hidden\_implicit}(t)} =   
\delta_{net_{hidden\_explicit}}(t+1) \times 
 W_{hidden}^T \times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t)))$$

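Note that only the losses at steps $t$ and $t+1$ are taken into account, so the error is propagated a single step back through time; full BPTT would use $\delta_{net_{hidden}}(t+1)$ in place of $\delta_{net_{hidden\_explicit}}(t+1)$. Combining the two equations computed above, we have:

$ \frac{\partial l}{\partial net_{hidden}} = \frac{\partial l(t)}{\partial net_{hidden}(t)} +  \frac{\partial l(t+1)}{\partial net_{hidden}(t)} $

$ \delta_{net_{hidden}(t)} = \delta_{net_{hidden\_explicit}(t)} +  \delta_{net_{hidden\_implicit}(t)} $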

$ \delta_{net_{hidden}(t)} = 
(\delta_{net_{output}}(t) \times W_{output}^T) \times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t))) 
+
\delta_{net_{hidden\_explicit}}(t+1) \times 
 W_{hidden}^T \times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t)))
$

$$ \delta_{net_{hidden}(t)} = 
(\delta_{net_{output}}(t) \times W_{output}^T + \delta_{net_{hidden\_explicit}}(t+1) \times 
 W_{hidden}^T)
\times 
(a_{hidden}(t) \bigodot(1-a_{hidden}(t)))
$$
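
The final expression for the hidden delta can be evaluated directly. The sketch below uses a hidden dimension of 3 and made-up values for the output delta and the future explicit delta; all numbers are assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(0)
hidden_dimension = 3

a_hidden = np.random.random((1, hidden_dimension))
W_output = 2 * np.random.random((hidden_dimension, 1)) - 1
W_hidden = 2 * np.random.random((hidden_dimension, hidden_dimension)) - 1

delta_net_output = np.array([[0.1]])                    # delta of the output at step t
future_delta_explicit = np.array([[0.05, -0.02, 0.01]]) # explicit hidden delta of step t+1

sigmoid_slope = a_hidden * (1 - a_hidden)               # element-wise sigmoid derivative

# The two contributions derived above
explicit = delta_net_output.dot(W_output.T) * sigmoid_slope
implicit = future_delta_explicit.dot(W_hidden.T) * sigmoid_slope

# The factored form of the final equation
delta_net_hidden = (delta_net_output.dot(W_output.T)
                    + future_delta_explicit.dot(W_hidden.T)) * sigmoid_slope

print(np.allclose(delta_net_hidden, explicit + implicit))  # True
```

At the last time step there is no future step, so the future delta is a zero vector and only the explicit part remains.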

### Updating $ W_{hidden} $

$ \frac{\partial l}{\partial w_{hidden}} = ? $

$ \frac{\partial l(t)}{\partial w_{hidden}} = 
\frac{\partial l(t)}{\partial net_{hidden}(t)}
\times
\frac{\partial net_{hidden}(t)}{\partial w_{hidden}}
$

$ \frac{\partial l(t)}{\partial w_{hidden}} = 
\delta_{net_{hidden}(t)}
\times
\frac{\partial net_{hidden}(t)}{\partial w_{hidden}}
$

$ \frac{\partial net_{hidden}(t)}{\partial w_{hidden}} =
\frac{\partial \ (net_{input}(t) + prev_{hidden} \times W_{hidden})}{\partial w_{hidden}} $

$ \frac{\partial net_{hidden}(t)}{\partial w_{hidden}} =
\frac{\partial \ prev_{hidden} \times W_{hidden}}{\partial w_{hidden}} = prev_{hidden}^T
$

<div id="test" class="equation">
    
$$ \frac{\partial l(t)}{\partial w_{hidden}} = 
prev_{hidden}^T
\times
\delta_{net_{hidden}(t)}
$$

</div>

### Updating $ W_{input} $

$ \frac{\partial l}{\partial w_{input}} = ? $

$ \frac{\partial l(t)}{\partial w_{input}} = 
\frac{\partial l(t)}{\partial net_{hidden}(t)}
\times
\frac{\partial net_{hidden}(t)}{\partial w_{input}}
$

$ \frac{\partial l(t)}{\partial w_{input}} = 
\delta_{net_{hidden}(t)}
\times
\frac{\partial net_{hidden}(t)}{\partial w_{input}}
$

$ \frac{\partial net_{hidden}(t)}{\partial w_{input}} =
\frac{\partial \ (net_{input}(t) + prev_{hidden} \times W_{hidden})}{\partial w_{input}} $

$ \frac{\partial net_{hidden}(t)}{\partial w_{input}} =
\frac{\partial net_{input}(t)}{\partial w_{input}} $

$ \frac{\partial net_{hidden}(t)}{\partial w_{input}} = 
\frac{\partial (x(t) \times  W_{input})}{\partial w_{input}} = x(t)^T
$

<div id="test" class="equation">

$$ \frac{\partial l(t)}{\partial w_{input}} = 
x(t)^T
\times
\delta_{net_{hidden}(t)}
$$
    
</div>    
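
With row-vector activations, as in the notebook's code, both weight gradients are outer products of a transposed input with the hidden delta. The sketch below uses a hidden dimension of 3 and made-up values, all of which are illustrative assumptions.

```python
import numpy as np

delta_net_hidden = np.array([[0.1, -0.2, 0.05]])   # shape (1, 3)
prev_hidden = np.array([[0.5, 0.4, 0.9]])          # hidden activation of step t-1, shape (1, 3)
x = np.array([[1, 0]])                             # a(t) = 1, b(t) = 0, shape (1, 2)

dl_dW_hidden = prev_hidden.T.dot(delta_net_hidden)  # shape (3, 3), same as W_hidden
dl_dW_input = x.T.dot(delta_net_hidden)             # shape (2, 3), same as W_input

print(dl_dW_hidden.shape, dl_dW_input.shape)
print(dl_dW_input)
```

Because b(t) is 0 in this example, the second row of `dl_dW_input` is zero: a weight attached to an inactive input receives no gradient at this step.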

In [170]:
class back_propagation:
    
    @staticmethod
    def bptt(a, b, c, predicated_values, hidden_values, hidden_dimension, output_layer_weights, hidden_layer_weights, input_layer_weights, binary_dim):  
              
        future_delta_net_hidden_explicit = np.zeros(hidden_dimension)
        
        # Accumulators for the gradients of the loss with respect to each weight matrix
        dl_dw_output = np.zeros_like(output_layer_weights)
        dl_dw_hidden = np.zeros_like(hidden_layer_weights)
        dl_dw_input = np.zeros_like(input_layer_weights)
        
        for time_step in range(binary_dim):
            
            # s_t = h(t)
            # s_t_prev = h(t-1)
            time_step_hidden_value_index = binary_dim - time_step
            s_t = hidden_values[time_step_hidden_value_index]
            s_t_prev = hidden_values[time_step_hidden_value_index -1]            
            
            X = np.array([[a[time_step],b[time_step]]])            
            y = np.array([[c[time_step]]]).T
            y_hat = predicated_values[time_step]           
           
            # delta_net_output
            dl_d_y_hat = y_hat-y
            dy_hat_d_net_output = y_hat*(1-y_hat)
            delta_net_output = dl_d_y_hat * dy_hat_d_net_output          
            
            # delta_net_hidden(t)           
            delta_net_hidden = (delta_net_output.dot(output_layer_weights.T) + future_delta_net_hidden_explicit.dot(hidden_layer_weights.T))* sigmoid_activation.backward(s_t)
            
            # save delta_net_hidden_explicit as future_delta_net_hidden_explicit for next backpropagation step
            future_delta_net_hidden_explicit = delta_net_output.dot(output_layer_weights.T) * sigmoid_activation.backward(s_t)
                        
            # Accumulate the gradients of W_output, W_hidden and W_input for every bit
            dl_dw_output += np.atleast_2d(s_t).T.dot(delta_net_output)
            dl_dw_hidden += np.atleast_2d(s_t_prev).T.dot(delta_net_hidden) 
            dl_dw_input  += X.T.dot(delta_net_hidden)            
        
        return dl_dw_output, dl_dw_hidden, dl_dw_input

# Checking our network

The following code trains the network on 7000 samples and then tests it on the remaining 3000 to check that it has learned binary addition.

In [171]:
binary_dim = 8
hidden_dimension = 16
learning_rate = 0.1
    
# datasets config
dataset_size = 10000
train_dataset_size = 7000    
data = dataset_utility.get_data(dataset_size, binary_dim)
dataset_train = data[:train_dataset_size]
dataset_test = data[train_dataset_size:]
    
# initializing RNN
rnn = binary_addition_rnn(binary_dim, hidden_dimension, learning_rate)
    
# train
print("-------------- train --------------")
rnn.train(dataset_train)
    
# test
print("-------------- test --------------")    
rnn.test(dataset_test)

-------------- train --------------
epoch: 0
a:   [0 0 1 0 1 1 0 0]
b:   [0 0 1 0 1 1 1 1]
----------------------
c:   [0 1 0 1 1 0 1 1]
Pred:[0 0 0 0 0 0 0 0]
epoch: 1000
a:   [0 1 0 1 0 0 0 1]
b:   [0 0 0 1 1 1 0 1]
----------------------
c:   [0 1 1 0 1 1 1 0]
Pred:[0 1 1 1 1 1 1 1]
epoch: 2000
a:   [0 0 1 1 1 0 1 1]
b:   [0 0 1 0 1 0 1 1]
----------------------
c:   [0 1 1 0 0 1 1 0]
Pred:[0 1 1 1 0 1 1 1]
epoch: 3000
a:   [0 0 1 0 0 0 0 1]
b:   [0 1 0 1 0 1 0 0]
----------------------
c:   [0 1 1 1 0 1 0 1]
Pred:[1 0 1 1 0 1 0 1]
epoch: 4000
a:   [0 1 1 1 0 1 0 0]
b:   [0 0 1 1 1 1 0 1]
----------------------
c:   [1 0 1 1 0 0 0 1]
Pred:[1 0 1 1 0 0 0 1]
epoch: 5000
a:   [0 1 1 0 1 0 1 1]
b:   [0 0 0 0 1 0 0 1]
----------------------
c:   [0 1 1 1 0 1 0 0]
Pred:[0 1 1 1 0 1 0 0]
epoch: 6000
a:   [0 0 0 0 0 0 0 0]
b:   [0 1 1 1 0 0 1 1]
----------------------
c:   [0 1 1 1 0 0 1 1]
Pred:[0 1 1 1 0 0 1 1]
-------------- test --------------
epoch: 0
a:   [0 0 0 1 1 0 0 1]
b:   [0 1 0

The following cell loads the custom CSS styles used by this notebook.

In [172]:
from IPython.display import HTML
def css_styling():
    # read the stylesheet and close the file afterwards
    with open("./styles/custom.css", "r") as style_file:
        styles = style_file.read()
    return HTML(styles)
css_styling()