<h1>Implementing a One-Layer Recurrent Neural Network from Scratch</h1>

The hidden state is computed as
$\large h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$.

The output is given by
$\large y_t = W_{hy} h_t + b_y$.
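
As a minimal sketch of these two equations for a single example and a single time step (the sizes, the random weights and the helper name `rnn_step` are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 3, 4, 2

# Hypothetical small weights, named after the symbols in the equations.
W_xh = rng.standard_normal((hidden_size, input_size))
W_hh = rng.standard_normal((hidden_size, hidden_size))
W_hy = rng.standard_normal((output_size, hidden_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

def rnn_step(x_t, h_prev):
    """One time step for one example: returns (h_t, y_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)  # hidden state equation
    y_t = W_hy @ h_t + b_y                           # output equation
    return h_t, y_t

h0 = np.zeros(hidden_size)            # no history before the first step
x1 = rng.standard_normal(input_size)
h1, y1 = rnn_step(x1, h0)
print(h1.shape, y1.shape)  # (4,) (2,)
```

Since h0 is all zeros, the first hidden state depends only on the input term and the bias.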


So remember, the hidden state is really the RNN part of this; y_t is just a linear layer on top.
In PyTorch, what nn.RNN returns is h_t from the last layer of the RNN (one per time step). We put a
linear layer after it so that we can make decisions. It is just like a CNN: the convolutional part
extracts the features and the linear layer does the rest. In an RNN, h_t remembers what is necessary
and the linear layer does what we want (classification or regression).

<h3>How the forward pass works</h3>
The input has shape X = (seq_len, batch_size, input_size). In the network, the input features are
first multiplied by Wxh, so it is like (seq_len, batch_size, input_size) * (input_size, hidden_size),
which gives (seq_len, batch_size, hidden_size). But we can't compute the whole network like this. If
you do one simple matrix multiplication over the whole sequence, the same hidden state is used for
every time step, and that is wrong, because we need the previous hidden state to calculate the
current one. That is why we use a for loop over the sequence and carry the hidden state along.
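
To make the point concrete, here is a small sketch (sizes, seeds and the helper name `run` are assumptions for illustration, and the bias is left out for brevity). The input projection can be done for every time step in one multiplication, but the recurrent term has to be computed step by step, and a change in the first input shows up in the last hidden state:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, batch_size, input_size, hidden_size = 5, 2, 3, 4

X = rng.standard_normal((seq_len, batch_size, input_size))
Whx = rng.standard_normal((hidden_size, input_size)) / np.sqrt(hidden_size)
Whh = rng.standard_normal((hidden_size, hidden_size)) / np.sqrt(hidden_size)

# The input term does not depend on other time steps, so one matmul covers
# all of them: (seq, batch, input) @ (input, hidden) -> (seq, batch, hidden).
input_proj = X @ Whx.T

def run(X):
    """Hidden states for every time step, shape (seq, batch, hidden)."""
    proj = X @ Whx.T
    h = np.zeros((X.shape[1], hidden_size))
    states = []
    for t in range(X.shape[0]):
        # The recurrent term needs h from the previous step, hence the loop.
        h = np.tanh(proj[t] + h @ Whh.T)
        states.append(h)
    return np.stack(states)

H = run(X)

# Change only the first time step and run again.
X2 = X.copy()
X2[0] += 1.0
H2 = run(X2)

print(input_proj.shape, H.shape)  # (5, 2, 4) (5, 2, 4)
print("change in last hidden state:", np.max(np.abs(H[-1] - H2[-1])))
```

The printed change is nonzero, which a per-time-step matrix multiplication on its own could never produce.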

We generate the random weights already transposed: instead of (input_size, hidden_size) we store
(hidden_size, input_size), just like PyTorch, so that the multiplication is easy. Otherwise we would
have to transpose the weights again every time.

Now about the hidden state and the for loop. In the loop we take one time step at a time, which has
shape (batch_size, input_size). We transpose it to (input_size, batch_size) so that we can multiply
Whx @ x, which gives shape (hidden_size, batch_size). When you think about it, these are the hidden
states of every batch member, one per column. That is why h = (hidden_size, batch_size). If I ever
forget the shape, this is the reason. After that we calculate Whh @ h, add both terms together with
the bias bh, and apply tanh. We store the hidden states one by one, and each new hidden state is
calculated from the previous one. That's the forward pass.
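
A quick shape check of the layout described above, as a sketch with made-up sizes. It compares the column layout (hidden_size, batch_size) used here with the row layout (batch_size, hidden_size) you would get without transposing the input:

```python
import numpy as np

rng = np.random.default_rng(2)
batch_size, input_size, hidden_size = 2, 3, 4

x = rng.standard_normal((batch_size, input_size))     # one time step of the input
Whx = rng.standard_normal((hidden_size, input_size))  # stored as (hidden, input)
bh = np.zeros((hidden_size, 1))                       # column vector

# Column layout: one column per batch member, bh broadcasts across columns.
cols = Whx @ x.T + bh    # (hidden, input) @ (input, batch) -> (hidden, batch)

# Row layout: one row per batch member, the weight is transposed instead.
rows = x @ Whx.T + bh.T  # (batch, input) @ (input, hidden) -> (batch, hidden)

print(cols.shape, rows.shape)     # (4, 2) (2, 4)
print(np.allclose(cols.T, rows))  # True
```

Both layouts hold the same numbers; the choice only decides which operand gets transposed.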





<h3>Back propagation </h3>
About backprop: you have to do the calculations by hand, and I don't know LaTeX well enough to write
them all out here. The way to approach it is to write the equations in the most abstract form, like
Loss = 1/2 (yp - y)**2, and then calculate the derivative with respect to everything except the
constants. It is the chain rule. What we usually do in class is simplify the whole expression first
and then take the derivative, but we can't do that here because the code would get messy. So take
the derivative of each step separately, for each weight and bias, and then multiply it with the
gradient coming from the previous step of the chain.
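
As a sketch of what the by-hand calculation looks like for the output layer, for a single example and assuming the loss is summed over the time steps. Here $y_t$ is the prediction from the output equation (yp in the loss above) and $y^{*}_t$ is the target; the symbol $y^{*}_t$ is introduced here only for illustration:

$\large L = \frac{1}{2} \sum_t \lVert y_t - y^{*}_t \rVert^2$

$\large \frac{\partial L}{\partial y_t} = y_t - y^{*}_t$

$\large \frac{\partial L}{\partial W_{hy}} = \sum_t (y_t - y^{*}_t)\, h_t^\top, \qquad \frac{\partial L}{\partial b_y} = \sum_t (y_t - y^{*}_t)$

Each line is one link of the chain: the first derivative is reused by everything that comes after it, which is the "multiply with the previous one" step.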


The new thing is dhlast, and it is hard to explain. Think about h_t. If h_t is not the last hidden
state, it causes two errors: first through the current output yp, and second through the next hidden
state h(t+1), which it helped to generate. (This is why it is called backpropagation through time,
which does not happen in feed-forward networks.) dhlast is the gradient that says how much h_t
contributed to the error through h(t+1). At the start of the backward pass, though, h is the last
hidden state. It has no next state, so it hasn't caused any error that way yet, but h(t-1) has
caused error through h_t. That is why we start dhlast with zeros. And remember that we are going
back in time here: we calculate the errors each hidden state generated, from the last time step to
the first, and we add them up in dWhh, since Whh is responsible for generating h_t from h(t-1).
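
Written as a recursion, the idea can be sketched like this, for a single example and assuming the squared-error loss above. Here $dy_t$ stands for prediction minus target at step $t$, $T$ is the last time step, and $\delta_t$ and $g_t$ are names introduced here only for illustration:

$\large \delta_t = W_{hy}^\top dy_t + W_{hh}^\top g_{t+1}, \qquad g_{T+1} = 0$

$\large g_t = (1 - h_t^2) \odot \delta_t$

$\large \frac{\partial L}{\partial W_{hh}} = \sum_t g_t\, h_{t-1}^\top, \qquad \frac{\partial L}{\partial W_{xh}} = \sum_t g_t\, x_t^\top, \qquad \frac{\partial L}{\partial b_h} = \sum_t g_t$

The term $W_{hh}^\top g_{t+1}$ is what the code calls dhlast, $\delta_t$ is dh and $g_t$ is dtanh. The condition $g_{T+1} = 0$ is the "start with zeros" part.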

In [48]:
import numpy as np
import matplotlib.pyplot as plt
import sys


In [52]:
class RNN():
    def __init__(self,input_size,hidden_state_size,output_size):
        self.input_size = input_size
        self.hidden_state_size = hidden_state_size
        self.output_size = output_size

        self.Whx = np.random.randn(self.hidden_state_size,self.input_size)/np.sqrt(self.hidden_state_size)
        self.Whh = np.random.randn(self.hidden_state_size,self.hidden_state_size)/np.sqrt(self.hidden_state_size)
        self.Wyh = np.random.randn(output_size, self.hidden_state_size) / np.sqrt(self.hidden_state_size)

        self.bh = np.zeros((self.hidden_state_size, 1))
        self.by = np.zeros((self.output_size, 1))

        self.inputs = None
        self.hidden_states = None


    def forward(self, inputs):
        self.inputs = inputs #(seq,batch,input)
        seq_len,batch_size,features = inputs.shape

        h = np.zeros((self.hidden_state_size,batch_size))
        self.hidden_states = [h]
        outputs = np.zeros((seq_len, batch_size, self.output_size))

        for t,x in enumerate(inputs):
            x = x.T #(input,batch)
            h = np.tanh(
                np.dot(self.Whh,h) + #(hidden,hidden) * (hidden,batch) = (hidden,batch)
                np.dot(self.Whx,x) + #(hidden,input) * (input,batch) = (hidden,batch)
                self.bh              #(hidden,1) column vector, broadcast across the batch
            )
            y = np.dot(self.Wyh,h) + self.by #(output,hidden) * (hidden,batch) = (output,batch)
            outputs[t] = y.T # (batch,output)
            self.hidden_states.append(h)

        return outputs # (seq,batch,output)

    def backward(self, outputs, targets, learning_rate=0.01):

        _, batch_size, _ = outputs.shape #(seq,batch,output)

        dWhx = np.zeros_like(self.Whx) # (hidden, input)
        dWhh = np.zeros_like(self.Whh) #(hidden,hidden)
        dWhy = np.zeros_like(self.Wyh) #(output * hidden)
        dbh = np.zeros_like(self.bh) # (hidden,1)
        dby = np.zeros_like(self.by) #(output,1)

        dhlast = np.zeros((self.hidden_state_size, batch_size))

        for t in reversed(range(len(outputs))):

            dy = outputs[t].T - targets[t].T #(output,batch)
            # hidden_states[0] is the initial zero state, so h_t is stored at index t+1
            dWhy += np.dot(dy, self.hidden_states[t+1].T) #(output,batch) * (batch,hidden) = (output,hidden)
            dby += np.sum(dy, axis=1, keepdims=True) # sum dy over the batch axis

            dh = np.dot(self.Wyh.T, dy) + dhlast #(hidden,output) * (output,batch) = (hidden,batch)
            dtanh = (1 - self.hidden_states[t+1] ** 2) * dh #(hidden,batch)
            dbh += np.sum(dtanh, axis=1, keepdims=True)  # sum dtanh over the batch axis
            dWhx += np.dot(dtanh, self.inputs[t])  #(hidden,batch) * (batch,input) = (hidden,input)
            # h_(t-1) is stored at index t (index 0 for the first step)
            dWhh += np.dot(dtanh, self.hidden_states[t].T) # (hidden,batch) * (batch,hidden) = (hidden,hidden)
            dhlast = np.dot(self.Whh.T, dtanh) #(hidden,hidden) * (hidden,batch) = (hidden,batch)

        self.Whx -= learning_rate * dWhx
        self.Whh -= learning_rate * dWhh
        self.Wyh -= learning_rate * dWhy
        self.bh -= learning_rate * dbh
        self.by -= learning_rate * dby
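
One way to check a hand-written backward pass is a numerical gradient check. The block below is a self-contained sketch, not the class above: the functions `forward`, `loss`, `backward` and `numeric_grad` and all sizes are written here for illustration. It uses the same equations and the same squared-error loss, and compares the backprop-through-time gradients with central finite differences. Note that the list of hidden states starts with the initial zeros, so h_t sits at index t + 1 and h(t-1) at index t.

```python
import numpy as np

def forward(params, X):
    """Run the RNN over X (seq, batch, input); return outputs and hidden states."""
    Whx, Whh, Wyh, bh, by = params
    h = np.zeros((Whh.shape[0], X.shape[1]))
    hs, ys = [h], []                       # hs[0] is the initial zero state
    for x in X:
        h = np.tanh(Whh @ h + Whx @ x.T + bh)
        hs.append(h)
        ys.append((Wyh @ h + by).T)
    return np.stack(ys), hs                # outputs: (seq, batch, output)

def loss(params, X, T):
    Y, _ = forward(params, X)
    return 0.5 * np.sum((Y - T) ** 2)

def backward(params, X, T):
    """Backprop through time; gradients in the same order as params."""
    Whx, Whh, Wyh, bh, by = params
    Y, hs = forward(params, X)
    grads = [np.zeros_like(p) for p in params]
    dhlast = np.zeros_like(hs[0])          # the last step has no next state
    for t in reversed(range(len(X))):
        dy = (Y[t] - T[t]).T               # (output, batch)
        grads[2] += dy @ hs[t + 1].T       # dWyh, h_t is at index t + 1
        grads[4] += dy.sum(axis=1, keepdims=True)
        dh = Wyh.T @ dy + dhlast           # error via y_t plus error via h_(t+1)
        dtanh = (1 - hs[t + 1] ** 2) * dh
        grads[3] += dtanh.sum(axis=1, keepdims=True)
        grads[0] += dtanh @ X[t]           # dWhx
        grads[1] += dtanh @ hs[t].T        # dWhh, h_(t-1) is at index t
        dhlast = Whh.T @ dtanh
    return grads

rng = np.random.default_rng(0)
seq_len, batch, n_in, n_hid, n_out = 4, 3, 2, 5, 2
params = [
    rng.standard_normal((n_hid, n_in)) / np.sqrt(n_hid),
    rng.standard_normal((n_hid, n_hid)) / np.sqrt(n_hid),
    rng.standard_normal((n_out, n_hid)) / np.sqrt(n_hid),
    np.zeros((n_hid, 1)),
    np.zeros((n_out, 1)),
]
X = rng.standard_normal((seq_len, batch, n_in))
T = rng.standard_normal((seq_len, batch, n_out))

analytic = backward(params, X, T)

def numeric_grad(i, eps=1e-5):
    """Central finite differences for params[i], one entry at a time."""
    p = params[i]
    g = np.zeros_like(p)
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps
        up = loss(params, X, T)
        p[idx] = old - eps
        down = loss(params, X, T)
        p[idx] = old                       # restore the original value
        g[idx] = (up - down) / (2 * eps)
    return g

max_err = max(np.max(np.abs(analytic[i] - numeric_grad(i)))
              for i in range(len(params)))
print("max abs difference:", max_err)
```

If the two sets of gradients agree to several decimal places, the indexing and the dhlast term are most likely right. A wrong index into the hidden state list usually shows up as a large difference in the Whh gradient.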

