# Recurrent Neural Network 

A neural network typically takes an independent variable X (or a set of independent variables) and a dependent variable y, and learns the mapping between X and y; this process is called training. Once training is done, we feed the network a new value of the independent variable and it predicts the dependent variable.


Recurrent Neural Networks (RNNs) are very effective for Natural Language Processing and other sequence tasks because they have "memory". They read inputs $x^{\langle t \rangle}$ (such as words) one at a time and retain some information/context through the hidden layer activations that are passed from one time-step to the next. This allows a unidirectional RNN to use information from the past when processing later inputs. A bidirectional RNN can take context from both the past and the future.

**Notation**:
- Superscript $[l]$ denotes an object associated with the $l^{th}$ layer. 
    - Example: $a^{[4]}$ is the $4^{th}$ layer activation. $W^{[5]}$ and $b^{[5]}$ are the $5^{th}$ layer parameters.

- Superscript $(i)$ denotes an object associated with the $i^{th}$ example. 
    - Example: $x^{(i)}$ is the $i^{th}$ training example input.

- Superscript $\langle t \rangle$ denotes an object at the $t^{th}$ time-step. 
    - Example: $x^{\langle t \rangle}$ is the input x at the $t^{th}$ time-step. $x^{(i)\langle t \rangle}$ is the input at the $t^{th}$ timestep of example $i$.
    
- Subscript $i$ denotes the $i^{th}$ entry of a vector.
    - Example: $a^{[l]}_i$ denotes the $i^{th}$ entry of the activations in layer $l$.



## 1 - Forward propagation for the basic Recurrent Neural Network

The basic RNN implemented in this notebook has the structure below. In this example, $T_x = T_y$.

<img src="images/RNN.png" style="width:500px;height:300px;">
<caption><center> **Figure 1**: Basic RNN model </center></caption>

Here's how you can implement an RNN: 

**Steps**:
1. Implement the calculations needed for one time-step of the RNN.
2. Implement a loop over $T_x$ time-steps in order to process all the inputs, one at a time. Here $T_x$ is the length of the input sequence.

Let's go!

## 1.1 - RNN cell

A recurrent neural network can be seen as the repetition of a single cell. You are first going to implement the computations for a single time-step. The following figure describes the operations for a single time-step of an RNN cell.

<img src="images/rnn_step_forward.png" style="width:700px;height:300px;">
<caption><center> **Figure 2**: Basic RNN cell. Takes as input $x^{\langle t \rangle}$ (current input) and $a^{\langle t - 1\rangle}$ (previous hidden state containing information from the past), and outputs $a^{\langle t \rangle}$ which is given to the next RNN cell and also used to predict $y^{\langle t \rangle}$ </center></caption>

1. Compute the hidden state with tanh activation: $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$.
2. Using the new hidden state $a^{\langle t \rangle}$, compute the prediction $\hat{y}^{\langle t \rangle} = \mathrm{softmax}(W_{ya} a^{\langle t \rangle} + b_y)$. A `softmax` function is defined in the code of this notebook.
3. Store $(a^{\langle t \rangle}, a^{\langle t-1 \rangle}, x^{\langle t \rangle}, parameters)$ in a cache.
4. Return $a^{\langle t \rangle}$, $\hat{y}^{\langle t \rangle}$ and the cache.
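
The four steps above can be sketched as follows. This is a minimal illustration that assumes the common layout in which `xt` has shape `(n_x, m)` and `a_prev` has shape `(n_a, m)` for `m` examples; the names `rnn_cell_step` and `softmax_cols`, the key `Wya`, and all sizes are hypothetical and are not the notebook's own code, which uses a different shape convention.

```python
import numpy as np

def softmax_cols(z):
    """Column-wise softmax; subtracting the max keeps exp() from overflowing."""
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_step(xt, a_prev, params):
    """One time step. Assumed shapes: xt (n_x, m), a_prev (n_a, m)."""
    a_next = np.tanh(params["Waa"] @ a_prev + params["Wax"] @ xt + params["ba"])  # step 1
    y_pred = softmax_cols(params["Wya"] @ a_next + params["by"])                  # step 2
    cache = (a_next, a_prev, xt, params)                                          # step 3
    return a_next, y_pred, cache                                                  # step 4

# Hypothetical sizes: 10 input features, 2 hidden units, 10 classes, 3 examples
rng = np.random.default_rng(0)
n_x, n_a, n_y, m = 10, 2, 10, 3
params = {
    "Waa": rng.standard_normal((n_a, n_a)),
    "Wax": rng.standard_normal((n_a, n_x)),
    "Wya": rng.standard_normal((n_y, n_a)),
    "ba": np.zeros((n_a, 1)),
    "by": np.zeros((n_y, 1)),
}
a_next, y_pred, cache = rnn_cell_step(
    rng.standard_normal((n_x, m)), np.zeros((n_a, m)), params
)
print(a_next.shape, y_pred.shape)  # (2, 3) (10, 3)
```

Each column of `y_pred` is a probability distribution over the output classes for one example.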



# Data preparation

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Suppose we have three sentences
s1 = 'mango is yellow color'
s2 = 'banana is pink color'
s3 = 'hair has black color'

# Build the vocabulary; its size determines the shapes of the matrices used below
l = [*s1.split(),*s2.split(),*s3.split()]
l.append('<end>')
vocab = sorted(set(l))
one_hot_vector_vocab = np.array(pd.get_dummies(vocab))
print(pd.get_dummies(vocab))

   <end>  banana  black  color  hair  has  is  mango  pink  yellow
0      1       0      0      0     0    0   0      0     0       0
1      0       1      0      0     0    0   0      0     0       0
2      0       0      1      0     0    0   0      0     0       0
3      0       0      0      1     0    0   0      0     0       0
4      0       0      0      0     1    0   0      0     0       0
5      0       0      0      0     0    1   0      0     0       0
6      0       0      0      0     0    0   1      0     0       0
7      0       0      0      0     0    0   0      1     0       0
8      0       0      0      0     0    0   0      0     1       0
9      0       0      0      0     0    0   0      0     0       1
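
Because the vocabulary is sorted and every word gets its own column, the one-hot table above is simply an identity matrix. A minimal sketch of the same encoding using only NumPy is shown below; the names `words`, `word_to_idx` and `one_hot` are illustrative assumptions, not part of the notebook.

```python
import numpy as np

# Same three sentences as in the notebook, plus the end marker
words = "mango is yellow color banana is pink color hair has black color".split()
words.append("<end>")

vocab = sorted(set(words))                       # 10 unique tokens
word_to_idx = {w: i for i, w in enumerate(vocab)}
one_hot = np.eye(len(vocab), dtype=np.uint8)     # row i is the one-hot vector of vocab[i]

print(word_to_idx["mango"])           # 7
print(one_hot[word_to_idx["mango"]])  # [0 0 0 0 0 0 0 1 0 0]
```

Indexing `one_hot` with a list of word indices returns one row per word, which is what `one_hot_vector_vocab[[7,6,9,3]]` does later in the notebook.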


In [3]:
# Build two lookups: token -> index (char2idx) and index -> token (idx2char).
# Despite the names, the tokens here are whole words, not characters.

char2idx = {value:index for index,value in enumerate(vocab)}
idx2char = np.array(vocab)

In [4]:
# Maximum sequence length, i.e. the number of time steps

seq_length = max(len(s1.split()),len(s2.split()),len(s3.split()))
seq_length

4

# Preparing Input data for single batch 

In [5]:
# A single sample has shape (1*10*4), or (10*4) without the batch axis.
# We loop over the 4 time steps and feed one (1*10) one-hot vector per step.

##############################################################
# input  #       shape      #   target   #       shape       #
##############################################################
# mango  (1*10) onehotvector     is      (1*10) onehotvector #
# is     (1*10) onehotvector    yellow   (1*10) onehotvector #
# yellow (1*10) onehotvector    color    (1*10) onehotvector #
# color  (1*10) onehotvector    <end>    (1*10) onehotvector #
##############################################################

In [6]:
# Right now we pass a single sentence (a single batch), so the input size is
# (1*10*4): batch_size * vocab_size * sequence_length (number of time steps)
print([char2idx[char] for char in s1.split()])
print(one_hot_vector_vocab[[7,6,9,3]].T.shape)
one_hot_vector_vocab[[7,6,9,3]].T  # mango is yellow color OH


[7, 6, 9, 3]
(10, 4)


array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 1, 0]], dtype=uint8)

In [7]:
# Input array for single batch
np.array(s1.split()) # 4 words, each represented by a one-hot vector, so dim = (10*4)

array(['mango', 'is', 'yellow', 'color'], dtype='<U6')

In [8]:
# prepare input data for single batch pass
x1 = one_hot_vector_vocab[[char2idx[char] for char in s1.split()]].T #s1 = ' mango is yellow color '
x2 = one_hot_vector_vocab[[char2idx[char] for char in s2.split()]].T #s2 = ' banana is pink color  '
x3 = one_hot_vector_vocab[[char2idx[char] for char in s3.split()]].T #s3 = ' hair has black color  '
print(x1.shape) # these are
print(x2.shape) # three
print(x3.shape) # different samples

print(x1[:,0].shape) # 0th time step means we are passing mango as an input
x1[:,0]

(10, 4)
(10, 4)
(10, 4)
(10,)


array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0], dtype=uint8)

# Preparing target variable for single batch

In [9]:
y1 = s1.split()[1:] # Targets start at index 1 because each input word predicts the next word
y2 = s2.split()[1:]
y3 = s3.split()[1:]
y1.append('<end>') ; y2.append('<end>') ; y3.append('<end>') ;
y1 = np.array(y1)
y2 = np.array(y2)
y3 = np.array(y3)
print(y1) # these are
print(y2) # three 
print(y3) # different samples
[char2idx[i] for i in y1] # this is expected targets

['is' 'yellow' 'color' '<end>']
['is' 'pink' 'color' '<end>']
['has' 'black' 'color' '<end>']


[6, 9, 3, 0]

In [10]:
y1 = np.array([char2idx[i] for i in y1]) # this is expected targets
y2 = np.array([char2idx[i] for i in y2]) # this is expected targets
y3 = np.array([char2idx[i] for i in y3]) # this is expected targets
y3

array([5, 2, 3, 0])
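
The input/target relationship used above (each word predicts the next one, and the last word predicts `<end>`) can be sketched as a small helper. The function name `shifted_pairs` and its `end_token` parameter are hypothetical and only serve to illustrate the shift by one position.

```python
def shifted_pairs(sentence, end_token="<end>"):
    """Pair every token with the token that follows it; the last token maps to end_token."""
    tokens = sentence.split()
    targets = tokens[1:] + [end_token]
    return list(zip(tokens, targets))

print(shifted_pairs("mango is yellow color"))
# [('mango', 'is'), ('is', 'yellow'), ('yellow', 'color'), ('color', '<end>')]
```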

# Forward Propagation

## 1. Single time-step feedforward

In [11]:
def softmax(a):
    # Normalize along axis 1 (the vocabulary axis). keepdims keeps the sums broadcastable
    # for inputs with more than one row; subtracting the row max avoids overflow in exp.
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def rnn_cell_forward(xt, a_prev, parameters):
    """
    Compute one time step of the RNN.

    Arguments:
    xt         --  input at the current time step, (batch_unit_size , vocab_size)
    a_prev     --  previous hidden state, (hidden_unit_size, vocab_size)

    parameters -- python dictionary containing:
                        Wax -- (batch_unit_size , hidden_unit_size)
                        Waa -- (hidden_unit_size, hidden_unit_size)
                        Way -- (hidden_unit_size, output_unit_size)
                        ba --  (hidden_unit_size, 1)
                        by --  (output_unit_size, 1)
    Returns:
    a_prev -- the hidden state that was passed in, unchanged
    a_next -- new hidden state, (hidden_unit_size, vocab_size)
    y_cap  -- prediction for this time step, (output_unit_size, vocab_size)
    """
    Wax = parameters["Wax"]
    Waa = parameters["Waa"]
    Way = parameters["Way"]
    ba = parameters["ba"]
    by = parameters["by"]
    
    a_next = np.tanh(np.dot(Waa.T, a_prev) + np.dot(Wax.T, xt) + ba) # hidden state at the current time step
    y_cap = softmax(np.dot(Way.T, a_next) + by)  # output of current time step

    return (a_prev, a_next, y_cap)

In [12]:
x1[:,0] # mango
x1[:,1] # is 
x1[:,2] # yellow
x1[:,3] # color

array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=uint8)

In [13]:
#           Wax.T  *  Xt               Way.T  * a_current
# (1*10)    (2*1)  * (1*10)             (1*2) *  (2*10)
# mango ---------------------> (2*10) --------------------->  is = (1*10)


In [14]:
# we create 1 hidden layer with 2 neurons
np.random.seed(1)
xt     = x1[:,0].reshape(1,-1)  # (1*10)  ==> batch_size    * vocab_size 
Wax    = np.random.randn(1,2)   # (1*2)   ==> batch_size    * hidden_neuron

a_prev = np.random.randn(2,10)  # (2*10)  ==> hidden_neuron * vocab_size
Waa    = np.random.randn(2,2)   # (2*2)   ==> hidden_neuron * hidden_neuron
ba     = np.random.randn(2,1)   # (2*1)   ==> hidden_neuron * 1

Way    = np.random.randn(2,1)   # (2*1)   ==> hidden_neuron * output_neuron(or batch_size)
by     = np.random.randn(1,1)   # (1*1)   ==> output_neuron * 1

# In sentiment analysis, output_neuron would be the number of labels

parameters = {"Waa": Waa, "Wax": Wax, "Way": Way, "ba": ba, "by": by}

a_prev, a_next, y_cap = rnn_cell_forward(xt, a_prev, parameters)
print("a_next.shape = ", a_next.shape)
print("y_cap.shape = ", y_cap.shape)


a_next.shape =  (2, 10)
y_cap.shape =  (1, 10)
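
The prediction above is normalized along axis 1, the vocabulary axis. The sketch below illustrates, under the assumption that more than one row may be passed, why `keepdims=True` and a max shift are useful in such a row-wise softmax. The name `softmax_rows` and the sample values are hypothetical.

```python
import numpy as np

def softmax_rows(z):
    """Row-wise softmax that stays correct for any number of rows."""
    shifted = z - z.max(axis=1, keepdims=True)   # largest entry per row becomes 0
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)      # (n, 1) sums broadcast across each row

z = np.array([[1.0, 2.0, 3.0],
              [1000.0, 1000.0, 1000.0]])         # exp(1000) alone would overflow
p = softmax_rows(z)
print(p.sum(axis=1))   # every row sums to 1
print(p[1])            # equal logits give a uniform distribution
```

Without `keepdims=True`, the row sums would have shape `(n,)` and would be broadcast along the wrong axis whenever the input has more than one row.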


## 2. Multi time-step feedforward
#### Note: this is not a layer of several neurons as in a feed-forward network. It is a single cell whose activations are fed back in a loop.

You can see an RNN as the repetition of the cell you've just built. If your input sequence of data is carried over 4 time steps, then you will copy the RNN cell 4 times. Each cell takes as input the hidden state from the previous cell ($a^{\langle t-1 \rangle}$) and the current time-step's input data ($x^{\langle t \rangle}$). It outputs a hidden state ($a^{\langle t \rangle}$) and a prediction ($y^{\langle t \rangle}$) for this time-step.


<img src="images/rnn.png" style="width:800px;height:300px;">
<caption><center> **Figure 3**: Basic RNN. The input sequence $x = (x^{\langle 1 \rangle}, x^{\langle 2 \rangle}, ..., x^{\langle T_x \rangle})$  is carried over $T_x$ time steps. The network outputs $y = (y^{\langle 1 \rangle}, y^{\langle 2 \rangle}, ..., y^{\langle T_x \rangle})$. </center></caption>
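
The repetition of the cell over $T_x$ time steps can be sketched as a plain loop. This is a minimal illustration that assumes inputs of shape `(n_x, m, T_x)` and untransposed weights as in the step equations; the name `rnn_forward_sketch`, the argument order and all sizes are hypothetical and differ from the notebook's own shape convention.

```python
import numpy as np

def rnn_forward_sketch(x, a0, Waa, Wax, Wya, ba, by):
    """x: (n_x, m, T_x), a0: (n_a, m). Returns all hidden states and predictions."""
    n_a, m = a0.shape
    n_y, T_x = Wya.shape[0], x.shape[2]
    a = np.zeros((n_a, m, T_x))
    y_hat = np.zeros((n_y, m, T_x))
    a_t = a0                                   # the first step sees the initial state
    for t in range(T_x):
        a_t = np.tanh(Waa @ a_t + Wax @ x[:, :, t] + ba)   # new hidden state
        z = Wya @ a_t + by
        e = np.exp(z - z.max(axis=0, keepdims=True))       # stable softmax per column
        y_hat[:, :, t] = e / e.sum(axis=0, keepdims=True)
        a[:, :, t] = a_t                       # a_t is reused as the previous state at t+1
    return a, y_hat

# Hypothetical sizes: 10 features, 2 hidden units, 10 classes, 1 example, 4 time steps
rng = np.random.default_rng(1)
n_x, n_a, n_y, m, T_x = 10, 2, 10, 1, 4
a, y_hat = rnn_forward_sketch(
    rng.standard_normal((n_x, m, T_x)), np.zeros((n_a, m)),
    rng.standard_normal((n_a, n_a)), rng.standard_normal((n_a, n_x)),
    rng.standard_normal((n_y, n_a)), np.zeros((n_a, 1)), np.zeros((n_y, 1)),
)
print(a.shape, y_hat.shape)  # (2, 1, 4) (10, 1, 4)
```

The same weights are used at every step; only the hidden state changes from one iteration to the next.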

In [15]:
def rnn_forward(xt, a_prev, parameters):
    """
    Run the RNN cell over every time step of the input.

    Arguments:
    xt         --  (batch_unit_size , vocab_size , timestep)
    a_prev     --  initial hidden state, (hidden_unit_size, vocab_size)

    parameters -- python dictionary containing:
                        Wax -- (batch_unit_size , hidden_unit_size)
                        Waa -- (hidden_unit_size, hidden_unit_size)
                        Way -- (hidden_unit_size, output_unit_size)
                        ba --  (hidden_unit_size, 1)
                        by --  (output_unit_size, 1)
    Returns:
    a0          -- the initial hidden state (needed during backpropagation)
    activations -- hidden states of all time steps, (hidden_unit_size, vocab_size, timestep)
    y_caps      -- predictions of all time steps, (output_unit_size, vocab_size, timestep)
    """
    
    hidden_neuron , vocab_size         =  a_prev.shape
    batch_size , vocab_size , timestep =  xt.shape
    hidden_neuron, output_neuron       =  parameters["Way"].shape
    
    activations = np.zeros([hidden_neuron,vocab_size,timestep])  # hidden state at each time step
    y_caps      = np.zeros([output_neuron,vocab_size,timestep])  # output at each time step

    a0       = a_prev  # keep the initial hidden state for the backward pass
    a_next_t = a_prev  # the first step uses the initial hidden state as its previous state
    # loop over all time-steps
    for t in range(timestep):
        # the hidden state produced at step t becomes the previous state of step t+1
        _, a_next_t, y_cap_t = rnn_cell_forward(xt[:,:,t], a_next_t, parameters)
        activations[:,:,t]   = a_next_t
        y_caps[:,:,t]        = y_cap_t
    
    return a0, activations, y_caps

In [16]:
np.random.seed(1)
xt     = x1.reshape(1,10,4)     # (1*10*4) ==> batch_size    * vocab_size * timestep 
Wax    = np.random.randn(1,2)   # (1*2)    ==> batch_size    * hidden_neuron

a_prev = np.random.randn(2,10)  # (2*10)   ==> hidden_neuron * vocab_size
Waa    = np.random.randn(2,2)   # (2*2)    ==> hidden_neuron * hidden_neuron
ba     = np.random.randn(2,1)   # (2*1)    ==> hidden_neuron * 1

Way    = np.random.randn(2,1)   # (2*1)    ==> hidden_neuron * output_neuron(or batch_size)
by     = np.random.randn(1,1)   # (1*1)    ==> output_neuron * 1

parameters = {"Waa": Waa, "Wax": Wax, "Way": Way, "ba": ba, "by": by}

a0, activations, y_caps = rnn_forward(xt, a_prev, parameters)
print("activations.shape = ", activations.shape)
print("y_caps.shape      = ", y_caps.shape)
print("a_prev            = ", a0)

activations.shape =  (2, 10, 4)
y_caps.shape      =  (1, 10, 4)
a_prev            =  [[-0.52817175 -1.07296862  0.86540763 -2.3015387   1.74481176 -0.7612069
   0.3190391  -0.24937038  1.46210794 -2.06014071]
 [-0.3224172  -0.38405435  1.13376944 -1.09989127 -0.17242821 -0.87785842
   0.04221375  0.58281521 -1.10061918  1.14472371]]


# Cost Function

<img src="images/costfunction.jpeg" style="width:500px;height:300px;">
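
For reference, the cost that the `crossEntropy` function in this notebook computes can be written as follows. This is a sketch of the standard cross-entropy form inferred from that code (a sum over vocabulary entries $k$ at each time-step, averaged over the $T_x$ time-steps); it is an assumption that the image above shows the same expression.

$$
\mathcal{L}^{\langle t \rangle} = -\sum_{k} y_k^{\langle t \rangle} \log \hat{y}_k^{\langle t \rangle},
\qquad
J = \frac{1}{T_x} \sum_{t=1}^{T_x} \mathcal{L}^{\langle t \rangle}
$$

Here $y^{\langle t \rangle}$ is the one-hot target and $\hat{y}^{\langle t \rangle}$ is the softmax output at time-step $t$.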

# Backpropagation 

###  Basic RNN  backward pass

We will start by computing the backward pass for the basic RNN-cell.

<img src="images/rnn_cell_backprop.png" style="width:500px;height:300px;"> <br>
<caption><center> **Figure 6**: RNN-cell's backward pass. Just like in a fully-connected neural network, the derivative of the cost function $J$ backpropagates through the RNN by following the chain-rule from calculus. The chain-rule is also used to calculate $(\frac{\partial J}{\partial W_{ax}},\frac{\partial J}{\partial W_{aa}},\frac{\partial J}{\partial b_a})$ to update the parameters $(W_{ax}, W_{aa}, b_a)$. </center></caption>
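
The chain rule for a single cell can be written out explicitly. The sketch below uses the notation of the forward equation $a^{\langle t \rangle} = \tanh(W_{aa} a^{\langle t-1 \rangle} + W_{ax} x^{\langle t \rangle} + b_a)$ and assumes an upstream gradient $da^{\langle t \rangle} = \partial J / \partial a^{\langle t \rangle}$ is given. It is the standard derivation with untransposed weights, whereas the code in this notebook stores the weights transposed.

$$
\begin{aligned}
dz^{\langle t \rangle} &= \left(1 - \left(a^{\langle t \rangle}\right)^2\right) \odot da^{\langle t \rangle} \\
\frac{\partial J}{\partial W_{ax}} &= dz^{\langle t \rangle} \, x^{\langle t \rangle \top},
\qquad
\frac{\partial J}{\partial W_{aa}} = dz^{\langle t \rangle} \, a^{\langle t-1 \rangle \top},
\qquad
\frac{\partial J}{\partial b_a} = \sum_{\text{columns}} dz^{\langle t \rangle} \\
\frac{\partial J}{\partial a^{\langle t-1 \rangle}} &= W_{aa}^{\top} \, dz^{\langle t \rangle},
\qquad
\frac{\partial J}{\partial x^{\langle t \rangle}} = W_{ax}^{\top} \, dz^{\langle t \rangle}
\end{aligned}
$$

The factor $1 - (a^{\langle t \rangle})^2$ is the derivative of $\tanh$, and $\odot$ denotes the element-wise product.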

In [17]:
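def rnn_cell_backward(timestep, y_caps, y, activation, parameters,xt,a0):
    """
    Compute the gradients contributed by the loss at a single time step.

    Arguments:
    timestep   -- index of the time step whose loss is differentiated
    y_caps     -- predictions from rnn_forward, of shape  (output_unit,vocab_size, timestep)
    y          -- one-hot targets, same shape as y_caps
    activation -- hidden states from rnn_forward, of shape (hidden_unit,vocab_size, timestep)
    parameters -- python dictionary containing Wax, Waa, Way, ba, by
    xt         -- input data, of shape                     (batch_size ,vocab_size, timestep)
    a0         -- initial hidden state, of shape           (hidden_unit,vocab_size)

    Returns:
    gradients -- python dictionary containing:
                        dL_da0  -- Gradients of initial hidden state, of shape     (hidden_unit,vocab_size)
                        dL_dWax -- Gradients of input-to-hidden weights, of shape  (batch_size ,hidden_unit)
                        dL_dWaa -- Gradients of hidden-to-hidden weights, of shape (hidden_unit,hidden_unit)
                        dL_dWay -- Gradients of hidden-to-output weights, of shape (hidden_unit,output_unit)
                        dL_dba  -- Gradients of hidden bias vector, of shape       (hidden_unit, 1)
                        dL_dby  -- Gradients of output bias vector, of shape       (output_unit, 1)

    Note: the products through time below are formed element-wise, which is a
    simplification of the exact chain rule of backpropagation through time.
    """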
    Wax = parameters["Wax"] #(1*2)
    Waa = parameters["Waa"] #(2*2)
    Way = parameters["Way"] #(2*1)
    ba  = parameters["ba"]  #(2*1)
    by  = parameters["by"]  #(1*1)
    
    y_cap_ = y_caps[:,:,timestep]      # current timestep y_pred
    y_     = y[:,:,timestep]           # current timestep y_actual
    a_     = activation[:,:,timestep]  # current timestep a_next
    xt_    = xt[:,:,timestep]          # current timestep input
    a0_    = a0                        # a_prev for first timestep
    
    # compute derivative Way w.r.t to Loss
    #                (2*10) (1*10)-(1*10)
    dL_dWay = np.dot( a_,  (y_cap_ - y_).T)  # (2*1) 
    
    # dL_dWaa = dL_da * da__da_prev * da_da_prev__dWaa 
    dL_dWaa = 0
    for i in range(timestep+1):  # t = 2    ===> 0 1 
        #                    (2*1)  (1*10)-(1*10) 
        dL_da        = np.dot(Way, (y_cap_ - y_))   # (2*10)
        
        da__da_prev = 1
        for j in reversed(range(i+1,timestep+1)): # t = 2  ===> 1   
            #                               (2*10)
            da_dtanh     = (1 - np.square(activation[:,:,j]))          # (2*10)
            #                                  (2*2)  (2*10)   
            da__da_prev  = da__da_prev * np.dot(Waa , da_dtanh)        # (2*10)
            
        if i == 0: 
    
            #           (2*10)                    (2*10)  
            da_dWaa   =  a0_   *  (1 - np.square(activation[:,:,i])) # (2*10)        
        else:
            #                          (2*10)                               (2*10)
            da_dWaa   =  (1 - np.square(activation[:,:,i-1]))   *  (1 - np.square(activation[:,:,i])) # (2*10)
            
            #            (2*10)     (2*10)       (10*2)
        dL_dWaa += np.dot(dL_da * da__da_prev , da_dWaa.T)  # (2*2)
    
    # dL_dWax = dL_da * da__da_prev * da_da_prev__dWax 
    dL_dWax = 0
    for i in range(timestep+1):  # t = 1    ===> 0  
        dL_da       = np.dot(Way, (y_cap_ - y_))   # (2*10)
        da__da_prev = 1
        for j in reversed(range(i+1,timestep+1)): # t = 1  ===>     
            da_dtanh     = (1 - np.square(activation[:,:,j]))          # (2*10)
            da__da_prev  = da__da_prev * np.dot(Waa , da_dtanh)        # (2*10)
        
        da_dWax   =  xt[:,:,i]   *  (1 - np.square(activation[:,:,i])) # (2*10)
        
        #            (2*10)    (2*10)      (2*10)
        dL_dWax  +=  dL_da * da__da_prev * da_dWax     # (1*2)
    
    dL_dWax = dL_dWax.sum(axis=1).reshape(1,-1)
    # dL_da0 = dL_da * da__da_prev
    dL_da = np.dot(Way, (y_cap_ - y_))            # (2*10)
    da__da_prev = 1   # reset the running product before walking back to the first time step
    for i in reversed(range(timestep+1)): # timestep, ..., 1, 0
        da_dtanh     = (1 - np.square(activation[:,:,i]))          # (2*10)
        da__da_prev  = da__da_prev * np.dot(Waa , da_dtanh)        # (2*10)
    
    dL_da0 = dL_da * da__da_prev                                   # (2*10)
    #dL_dby = dL_ds * ds_dby
    #          (1*10) - (1*10)     
    dL_dby = (y_cap_ - y_).sum(axis = 1).reshape(-1,1)  #   (1*1)
                        
    #dL_dba = dL_da * da_dba
    #       (2*10)       (2*1) (1*10)-(1*10) 
    dL_dba = (a_ * np.dot(Way, (y_cap_ - y_))).sum(axis = 1 ).reshape(-1,1)   # (2*1) after summing over the vocabulary axis
    
    # Store the gradients in a python dictionary
    gradients = {"dL_da0" :dL_da0,
                 "dL_dWaa": dL_dWaa, 
                 "dL_dWax": dL_dWax,
                 "dL_dWay": dL_dWay,
                 "dL_dba" : dL_dba,
                 "dL_dby" : dL_dby
                 }
    
    return gradients

In [18]:
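def rnn_backward(a0, activations, y_caps, y1_ohe, parameters,xt):
    """
    Sum the gradients of rnn_cell_backward over all time steps.

    Arguments:
    a0          -- initial hidden state, of shape (hidden_unit, vocab_size)
    activations -- hidden states from rnn_forward, of shape (hidden_unit, vocab_size, timestep)
    y_caps      -- predictions from rnn_forward, of shape (output_unit, vocab_size, timestep)
    y1_ohe      -- one-hot targets, same shape as y_caps
    parameters  -- python dictionary containing Wax, Waa, Way, ba, by
    xt          -- input data, of shape (batch_size, vocab_size, timestep)
    
    Returns:
    gradients -- python dictionary containing:
                    da0 -- Gradient w.r.t the initial hidden state, numpy-array of shape (hidden_unit, vocab_size)
                    dWax-- Gradient w.r.t the input's weight matrix, numpy-array of shape (batch_size, hidden_unit)
                    dWaa-- Gradient w.r.t the hidden state's weight matrix, numpy-array of shape (hidden_unit,hidden_unit)
                    dWay-- Gradient w.r.t the output's weight matrix, numpy-array of shape (hidden_unit, output_unit)
                    dba -- Gradient w.r.t the hidden bias, of shape (hidden_unit, 1)
                    dby -- Gradient w.r.t the output bias, of shape (output_unit, 1)
    """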
    
    batch_size, vocab_size, timestep =  y_caps.shape
    hidden_unit = activations.shape[0]
    
    dWax = 0
    dWay = 0
    dWaa = 0
    dba  = 0
    da0  = 0
    dby  = 0
    
    # Loop through all the time steps
    
    for t in range(timestep):
        
        gradient = rnn_cell_backward(t, y_caps, y1_ohe, activations, parameters,xt,a0)
        
        dL_da0      = gradient["dL_da0"]
        dL_dWaa     = gradient["dL_dWaa"]
        dL_dWax     = gradient["dL_dWax"]
        dL_dWay     = gradient["dL_dWay"]   
        dL_dba      = gradient["dL_dba"]
        dL_dby      = gradient["dL_dby"]
        
        da0  += dL_da0
        dWaa += dL_dWaa
        dWax += dL_dWax
        dWay += dL_dWay
        dba  += dL_dba
        dby  += dL_dby
    
    gradients = {"da0": da0, "dWaa": dWaa, "dWax": dWax, "dWay": dWay,"dba": dba,'dby':dby}
    return gradients
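

As a cross-check for hand-written gradients like the ones above, the sketch below implements textbook backpropagation through time for a small RNN and compares one analytic gradient entry with a central finite difference. It assumes a conventional layout (one-hot columns of shape `(n_x, 1)` per step, untransposed weights, loss averaged over time steps), which differs from the notebook's shape convention; the names `forward`, `backward` and `p` are hypothetical.

```python
import numpy as np

def forward(x, y, a0, p):
    """x: (n_x, T) one-hot columns, y: (n_y, T) one-hot targets, a0: (n_a, 1)."""
    T = x.shape[1]
    a, y_hat, loss = {-1: a0}, {}, 0.0
    for t in range(T):
        a[t] = np.tanh(p["Waa"] @ a[t - 1] + p["Wax"] @ x[:, [t]] + p["ba"])
        z = p["Wya"] @ a[t] + p["by"]
        e = np.exp(z - z.max())
        y_hat[t] = e / e.sum()
        loss -= np.sum(y[:, [t]] * np.log(y_hat[t]))
    return loss / T, a, y_hat

def backward(x, y, a, y_hat, p):
    """Backpropagation through time for the loss returned by forward()."""
    T = x.shape[1]
    g = {k: np.zeros_like(v) for k, v in p.items()}
    da_next = np.zeros_like(a[0])              # gradient arriving from step t+1
    for t in reversed(range(T)):
        dz_y = (y_hat[t] - y[:, [t]]) / T      # softmax + cross-entropy gradient
        g["Wya"] += dz_y @ a[t].T
        g["by"] += dz_y
        da = p["Wya"].T @ dz_y + da_next       # from the output and from the future
        dz = (1.0 - a[t] ** 2) * da            # derivative of tanh
        g["Wax"] += dz @ x[:, [t]].T
        g["Waa"] += dz @ a[t - 1].T            # a[-1] is the initial state a0
        g["ba"] += dz
        da_next = p["Waa"].T @ dz
    return g

rng = np.random.default_rng(0)
n_x = n_y = 10
n_a, T = 2, 4
p = {"Waa": rng.standard_normal((n_a, n_a)), "Wax": rng.standard_normal((n_a, n_x)),
     "Wya": rng.standard_normal((n_y, n_a)), "ba": np.zeros((n_a, 1)),
     "by": np.zeros((n_y, 1))}
x = np.eye(n_x)[:, [7, 6, 9, 3]]   # mango is yellow color
y = np.eye(n_y)[:, [6, 9, 3, 0]]   # is yellow color <end>
a0 = np.zeros((n_a, 1))

_, a, y_hat = forward(x, y, a0, p)
grads = backward(x, y, a, y_hat, p)

# Central finite difference on a single entry of Waa
eps = 1e-6
p["Waa"][0, 1] += eps
loss_plus = forward(x, y, a0, p)[0]
p["Waa"][0, 1] -= 2 * eps
loss_minus = forward(x, y, a0, p)[0]
p["Waa"][0, 1] += eps                          # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)
print(abs(numeric - grads["Waa"][0, 1]))       # should be very close to zero
```

A large difference between the numeric and analytic values would indicate an error in the backward pass; the same check can be repeated for any other parameter entry.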

In [19]:
def crossEntropy(y_caps,y1):
    cost = []
    for i in range(y_caps.shape[2]):
        y_actual = (one_hot_vector_vocab[y1[i]]).reshape(1,-1)
        y_pred   = y_caps[:,:,i]
        cost.append(-np.sum(y_actual * np.log(y_pred)))
    return np.mean(cost,axis=0)

def one_hot_encode(y):
    # Build a (1, vocab_size, timestep) array of one-hot targets from the index array y
    one = np.zeros([1,10,4])
    for i in range(len(y)):
        one[:,:,i] = one_hot_vector_vocab[y[i]].reshape(1,-1)
    return one


In [20]:
x1 = x1  # change to x2, x3 for more results
y1 = y1  # change to y2, y3 for more results
np.random.seed(1)
x1  = x1.reshape(1,10,4)     # (1*10*4) ==> batch_size    * vocab_size * timestep 
Wax = np.random.randn(1,2)   # (1*2)    ==> batch_size    * hidden_neuron

a0  = np.random.randn(2,10)  # (2*10)   ==> hidden_neuron * vocab_size
Waa = np.random.randn(2,2)   # (2*2)    ==> hidden_neuron * hidden_neuron
ba  = np.random.randn(2,1)   # (2*1)    ==> hidden_neuron * 1

Way = np.random.randn(2,1)   # (2*1)    ==> hidden_neuron * output_neuron(or batch_size)
by  = np.random.randn(1,1)   # (1*1)    ==> output_neuron * 1

parameters = {"Waa": Waa, "Wax": Wax, "Way": Way, "ba": ba, "by": by}

y = one_hot_encode(y1)

#######################
# stochastic gradient #
#######################
alpha = 0.01  # learning rate
for i in range(100):
    a0, activations, y_caps = rnn_forward(x1, a0, parameters)
    gradients               = rnn_backward(a0, activations, y_caps, y, parameters, x1)
    print("Final cost: ",crossEntropy(y_caps,y1))  # cost at the current iteration
    parameters['Waa'] -= (alpha * gradients['dWaa'])
    parameters['Wax'] -= (alpha * gradients['dWax'])
    parameters['Way'] -= (alpha * gradients['dWay'])
    parameters['ba']  -= (alpha * gradients['dba'])
    parameters['by']  -= (alpha * gradients['dby'])
    a0                -= (alpha * gradients['da0'])

Final cost:  2.295090024207877
Final cost:  2.2922161782315897
Final cost:  2.2892558138074204
Final cost:  2.2862029606399026
Final cost:  2.2830518630043835
Final cost:  2.2797970480459386
Final cost:  2.2764334051634347
Final cost:  2.2729562760142246
Final cost:  2.269361553980236
Final cost:  2.265645791096242
Final cost:  2.261806309495847
Final cost:  2.2578413134529094
Final cost:  2.2537499971927333
Final cost:  2.2495326429509666
Final cost:  2.2451907034087877
Final cost:  2.2407268627488346
Final cost:  2.2361450712204274
Final cost:  2.2314505492544834
Final cost:  2.226649758713517
Final cost:  2.221750340607662
Final cost:  2.21676102032656
Final cost:  2.211691482927092
Final cost:  2.206552222156217
Final cost:  2.2013543676629865
Final cost:  2.1961094953477667
Final cost:  2.190829426142775
Final cost:  2.1855260188335137
Final cost:  2.180210962861414
Final cost:  2.1748955773452128
Final cost:  2.169590622695794
Final cost:  2.1643061310218
Final cost:  2.159051260

In [21]:
print(np.argmax(y_caps[:,:,0]))
print(np.argmax(y_caps[:,:,1]))
print(np.argmax(y_caps[:,:,2]))
print(np.argmax(y_caps[:,:,3]))

3
3
3
1


In [22]:
print("Actual_input\tActual_expected\t\t\tPredicted")
for i in range(4):
    print("{}\t\t\t{}\t\t\t{}".format(idx2char[np.argmax(x1[:,:,i])],idx2char[y1[i]],idx2char[np.argmax(y_caps[:,:,i])]))
    

Actual_input	Actual_expected			Predicted
mango			is			color
is			yellow			color
yellow			color			color
color			<end>			banana


In [23]:
def softmax(y_linear, temperature=1.0):
    # Softmax with temperature along axis 1, written with NumPy like the rest of this notebook
    lin = (y_linear - np.max(y_linear, axis=1).reshape((-1,1))) / temperature # shift each row of y_linear by its max
    exp = np.exp(lin)
    partition = np.sum(exp, axis=1).reshape((-1,1))
    return exp / partition
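

The effect of the `temperature` argument can be seen on a small example. The sketch below is a self-contained NumPy version of a temperature softmax; the name `softmax_t` and the sample logits are hypothetical. Lower temperatures sharpen the distribution toward the largest logit, while higher temperatures flatten it.

```python
import numpy as np

def softmax_t(logits, temperature=1.0):
    """Row-wise softmax of logits / temperature, shifted by the row max for stability."""
    z = (logits - logits.max(axis=1, keepdims=True)) / temperature
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.0]])
sharp = softmax_t(logits, temperature=0.5)   # more peaked
base = softmax_t(logits, temperature=1.0)    # ordinary softmax
flat = softmax_t(logits, temperature=5.0)    # closer to uniform

for name, probs in (("T=0.5", sharp), ("T=1.0", base), ("T=5.0", flat)):
    print(name, probs.round(3))
```

When sampling the next word from an RNN, a temperature below 1 makes the output more conservative and a temperature above 1 makes it more varied.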