# IMPLEMENTING RNN FROM SCRATCH IN PYTHON

## Implementation
We will implement a full recurrent neural network (RNN) from scratch in Python and use it to build a text generation model. The model is trained to predict the probability of the next character given the preceding characters, which makes it a generative model: given an existing sequence of characters, we sample the next character from the predicted probabilities and repeat the process until we have a full sentence. This implementation follows Andrej Karpathy's great post on building a character-level RNN. Here we discuss the implementation details step by step.
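
The sampling loop described above can be sketched independently of any particular model. This is a minimal illustration, not the post's own sampling code: `sample_text`, `next_char_probs`, and `toy_probs` are hypothetical names, and the fixed distribution stands in for a trained network's predictions.

```python
import numpy as np

def sample_text(next_char_probs, seed_char, length, chars, rng):
    """Generate `length` characters by repeatedly sampling from predicted probabilities."""
    generated = [seed_char]
    for _ in range(length):
        # One probability per entry of `chars`, given everything generated so far.
        probs = next_char_probs(generated)
        ix = rng.choice(len(chars), p=probs)
        generated.append(chars[ix])
    return "".join(generated)

# Hypothetical stand-in for a trained model: it ignores the context
# and always predicts the same distribution over three characters.
chars = ["a", "b", "c"]

def toy_probs(context):
    return np.array([0.5, 0.3, 0.2])

rng = np.random.default_rng(0)
print(sample_text(toy_probs, "a", 10, chars, rng))
```

In the real model, `next_char_probs` would run the forward pass and return the softmax output for the latest time step.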

General steps to follow:

1. Initialize the weight matrices U, V, W from a random distribution and the biases b, c with zeros.
2. Run forward propagation to compute predictions.
3. Compute the loss.
4. Run back-propagation to compute gradients.
5. Update the weights based on the gradients.
6. Repeat steps 2–5.


### Step 1: Initialize

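To start the implementation of the basic RNN cell, we first define the dimensions of the parameters U, V, W, b, c.

Dimensions: assume we pick a vocabulary size vocab_size = 8000 and a hidden layer size hidden_size = 100. Then we have:

- U: (hidden_size, vocab_size) = (100, 8000), input-to-hidden weights
- W: (hidden_size, hidden_size) = (100, 100), hidden-to-hidden weights
- V: (vocab_size, hidden_size) = (8000, 100), hidden-to-output weights
- b: (hidden_size, 1) = (100, 1), hidden bias
- c: (vocab_size, 1) = (8000, 1), output bias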

In [1]:
import numpy as np
import pandas as pd

In [3]:
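class RNN:
    def __init__(self, hidden_size, vocab_size, seq_length, learning_rate):
        # hyperparameters
        self.hidden_size = hidden_size
        self.vocab_size = vocab_size
        self.seq_length = seq_length
        self.learning_rate = learning_rate

        # model parameters, drawn uniformly from [-1/sqrt(n), 1/sqrt(n)],
        # where n is the number of incoming connections
        self.U = np.random.uniform(-np.sqrt(1./vocab_size), np.sqrt(1./vocab_size), (hidden_size, vocab_size))  # input to hidden
        self.V = np.random.uniform(-np.sqrt(1./hidden_size), np.sqrt(1./hidden_size), (vocab_size, hidden_size))  # hidden to output
        self.W = np.random.uniform(-np.sqrt(1./hidden_size), np.sqrt(1./hidden_size), (hidden_size, hidden_size))  # hidden to hidden
        self.b = np.zeros((hidden_size, 1))  # bias for hidden layer
        self.c = np.zeros((vocab_size, 1))  # bias for output

        # memory variables for the Adagrad update used in update_model
        self.mU = np.zeros_like(self.U)
        self.mW = np.zeros_like(self.W)
        self.mV = np.zeros_like(self.V)
        self.mb = np.zeros_like(self.b)
        self.mc = np.zeros_like(self.c)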

Proper initialization of the weights has an impact on training results, and there has been a lot of research in this area. The best initialization depends on the activation function (tanh in our case); one recommended approach is to initialize the weights randomly in the interval [-1/sqrt(n), 1/sqrt(n)], where n is the number of incoming connections from the previous layer.
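
As a quick check of the shapes and the initialization interval, here is a minimal sketch. The helper `uniform_init` and the use of `np.random.default_rng` are assumptions made for this illustration; the class itself calls `np.random.uniform` directly.

```python
import numpy as np

def uniform_init(fan_in, shape, rng):
    """Draw weights uniformly from [-1/sqrt(fan_in), 1/sqrt(fan_in))."""
    bound = np.sqrt(1.0 / fan_in)
    return rng.uniform(-bound, bound, shape)

rng = np.random.default_rng(0)
vocab_size, hidden_size = 8000, 100

U = uniform_init(vocab_size, (hidden_size, vocab_size), rng)    # input to hidden
W = uniform_init(hidden_size, (hidden_size, hidden_size), rng)  # hidden to hidden
V = uniform_init(hidden_size, (vocab_size, hidden_size), rng)   # hidden to output

print(U.shape, W.shape, V.shape)
# Every weight stays inside the interval implied by its fan-in.
print(np.abs(U).max() <= 1 / np.sqrt(vocab_size))
print(np.abs(W).max() <= 1 / np.sqrt(hidden_size))
```

Here n is the number of columns of each matrix, since each column corresponds to one incoming connection.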

### Step 2: Forward pass

The forward pass follows directly from the RNN equations: for each time step t, we compute the hidden state hs[t] and the output os[t], then apply softmax to get the probabilities for the next character.
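
The equations are not reproduced in this section. Read off the `forward` code, with $x_t$ the one-hot input, $h_t$ the hidden state, $o_t$ the output scores, and $\hat{y}_t$ the predicted probabilities, they would be:

$$
\begin{aligned}
h_t &= \tanh(U x_t + W h_{t-1} + b) \\
o_t &= V h_t + c \\
\hat{y}_t &= \mathrm{softmax}(o_t)
\end{aligned}
$$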

In [4]:
def forward(self, inputs, hprev):
    # method of the RNN class; inputs is a list of character indices
    xs, hs, os, ycap = {}, {}, {}, {}
    hs[-1] = np.copy(hprev)
    for t in range(len(inputs)):
        xs[t] = np.zeros((self.vocab_size, 1))
        xs[t][inputs[t]] = 1  # one-hot encoding
        hs[t] = np.tanh(np.dot(self.U, xs[t]) + np.dot(self.W, hs[t-1]) + self.b)  # hidden state
        os[t] = np.dot(self.V, hs[t]) + self.c  # unnormalized scores
        ycap[t] = self.softmax(os[t])  # probabilities for the next character
    return xs, hs, ycap

In [6]:
def softmax(x):
    exps = np.exp(x)
    return exps/ np.sum(exps)

In [9]:
softmax([1,2,3])

array([0.09003057, 0.24472847, 0.66524096])

In [10]:
softmax([1000,2000,3000])

NumPy emits runtime warnings here: np.exp overflows to inf for inputs this large, and inf / inf evaluates to nan.


array([nan, nan, nan])

In [None]:
# Subtracting the maximum before exponentiating leaves the result unchanged
# but keeps np.exp from overflowing, so we adjust the softmax function like this:

def softmax(self,x):
    p = np.exp(x - np.max(x))
    return p / np.sum(p)

In [None]:
import numpy as np

def softmax(x):
    p = np.exp(x - np.max(x))
    return p / np.sum(p)

# Example usage
print(softmax([1, 2, 3]))

[0.09003057 0.24472847 0.66524096]


In [None]:
import numpy as np

# The same function written as a method, which is how the RNN class calls it
class MyClass:
    def softmax(self, x):
        p = np.exp(x - np.max(x))
        return p / np.sum(p)

# Create an instance of the class
my_instance = MyClass()

# Example usage
print(my_instance.softmax([1, 2, 3]))

[0.09003057 0.24472847 0.66524096]
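
To confirm that the shifted version handles the inputs that broke the naive one, here is a small sketch. The name `stable_softmax` is chosen for this illustration only, to avoid clashing with the definitions above.

```python
import numpy as np

def stable_softmax(x):
    """Softmax computed after subtracting the maximum, so np.exp never overflows."""
    x = np.asarray(x, dtype=float)
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

large = stable_softmax([1000, 2000, 3000])
print(large)                       # finite values, no nan
print(np.isnan(large).any())       # False

# Shifting every input by the same constant does not change the result.
print(np.allclose(stable_softmax([1, 2, 3]), stable_softmax([101, 102, 103])))
```

The shift cancels between numerator and denominator, which is why the probabilities are unchanged.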


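### Step 3: Compute the loss

We use the cross-entropy loss, summed over the time steps of the sequence.

In [13]: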
def loss(self,ps,targets):
    '''loss for a sequence'''
    # cross entropy loss
    return sum(-np.log(ps[t][targets[t],0]) for t in range(self.seq_length))
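
A useful sanity check on this loss: if the model predicts a uniform distribution at every step, the loss equals seq_length * log(vocab_size). The sketch below uses a standalone function, `sequence_loss`, and toy sizes that are assumptions made for this illustration.

```python
import numpy as np

def sequence_loss(ps, targets):
    """Sum of -log p(target) over the time steps of one sequence."""
    return sum(-np.log(ps[t][targets[t], 0]) for t in range(len(targets)))

vocab_size, seq_length = 4, 3

# Uniform predictions: every character gets probability 1 / vocab_size.
uniform = np.full((vocab_size, 1), 1.0 / vocab_size)
ps = {t: uniform for t in range(seq_length)}
targets = [0, 2, 1]

loss = sequence_loss(ps, targets)
print(loss)
print(seq_length * np.log(vocab_size))  # same value
```

A freshly initialized network should start near this value, so a much larger initial loss usually points to a bug.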

### Step 4: Backward pass

In [17]:
def backward(self, xs, hs, ps, targets):
    # backward pass: compute gradients going backwards through time
    dU, dW, dV = np.zeros_like(self.U), np.zeros_like(self.W), np.zeros_like(self.V)
    db, dc = np.zeros_like(self.b), np.zeros_like(self.c)
    dhnext = np.zeros_like(hs[0])
    for t in reversed(range(self.seq_length)):
        dy = np.copy(ps[t])
        # through softmax
        dy[targets[t]] -= 1  # backprop into y
        # calculate dV, dc
        dV += np.dot(dy, hs[t].T)
        dc += dy
        # dh includes gradient from two sides, next cell and current output
        dh = np.dot(self.V.T, dy) + dhnext  # backprop into h
        # backprop through tanh non-linearity
        dhrec = (1 - hs[t] * hs[t]) * dh  # dhrec is the term used in many equations
        db += dhrec
        # calculate dU and dW
        dU += np.dot(dhrec, xs[t].T)
        dW += np.dot(dhrec, hs[t-1].T)
        # pass the gradient from this cell back to the previous time step
        dhnext = np.dot(self.W.T, dhrec)
    # clip to mitigate exploding gradients
    for dparam in [dU, dW, dV, db, dc]:
        np.clip(dparam, -5, 5, out=dparam)
    return dU, dW, dV, db, dc
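
For reference, the gradients accumulated in this loop (before clipping) can be written compactly. The notation is an assumption of this note: $\delta^{o}_t$ corresponds to `dy`, $\delta^{h}_t$ to `dhrec`, $\mathbf{1}_{\text{target}_t}$ is the one-hot vector of the target character, and $\delta^{h}_{t+1} = 0$ at the last time step.

$$
\begin{aligned}
\delta^{o}_t &= \hat{y}_t - \mathbf{1}_{\text{target}_t} \\
\frac{\partial L}{\partial V} &= \sum_t \delta^{o}_t h_t^\top, \qquad
\frac{\partial L}{\partial c} = \sum_t \delta^{o}_t \\
\delta^{h}_t &= \left(1 - h_t \odot h_t\right) \odot \left(V^\top \delta^{o}_t + W^\top \delta^{h}_{t+1}\right) \\
\frac{\partial L}{\partial U} &= \sum_t \delta^{h}_t x_t^\top, \qquad
\frac{\partial L}{\partial W} = \sum_t \delta^{h}_t h_{t-1}^\top, \qquad
\frac{\partial L}{\partial b} = \sum_t \delta^{h}_t
\end{aligned}
$$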
    

### Step 5: Update weights

In [18]:
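def update_model(self, dU, dW, dV, db, dc):
    # parameter update with Adagrad; self.mU, self.mW, self.mV, self.mb, self.mc
    # must be zero-initialized arrays with the same shapes as the parameters
    for param, dparam, mem in zip([self.U, self.W, self.V, self.b, self.c],
                                  [dU, dW, dV, db, dc],
                                  [self.mU, self.mW, self.mV, self.mb, self.mc]):
        mem += dparam * dparam  # accumulate squared gradients
        param += -self.learning_rate * dparam / np.sqrt(mem + 1e-8)  # Adagrad update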
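
To see what the Adagrad rule does to the step size, here is a minimal sketch on a single parameter. The function `adagrad_step` and the constant gradient are assumptions made for this illustration.

```python
import numpy as np

def adagrad_step(param, grad, mem, learning_rate=0.1, eps=1e-8):
    """Update `param` and `mem` in place and return the step that was applied."""
    mem += grad * grad
    step = -learning_rate * grad / np.sqrt(mem + eps)
    param += step
    return step

param = np.array([1.0])
mem = np.zeros_like(param)
grad = np.array([2.0])  # constant gradient, for illustration only

steps = [adagrad_step(param, grad, mem)[0] for _ in range(3)]
print(steps)  # step magnitudes shrink as squared gradients accumulate
```

Because `mem` only grows, parameters that keep receiving large gradients take progressively smaller steps.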

### Step 6: Repeat steps 2–5
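
One way to put the steps together is the loop sketched below. It is a compact, self-contained illustration rather than the notebook's own training code: the toy corpus, the hyperparameter values, and the names `uniform_init`, `loss_and_grads`, and `train` are all assumptions, and the forward pass, loss, backward pass, and Adagrad update are re-implemented inline instead of going through the `RNN` class.

```python
import numpy as np

rng = np.random.default_rng(0)
text = "hello world " * 20  # toy corpus, for illustration only
chars = sorted(set(text))
char_to_ix = {ch: i for i, ch in enumerate(chars)}
vocab_size, hidden_size, seq_length, learning_rate = len(chars), 16, 8, 0.1

def uniform_init(fan_in, shape):
    bound = np.sqrt(1.0 / fan_in)
    return rng.uniform(-bound, bound, shape)

params = {
    "U": uniform_init(vocab_size, (hidden_size, vocab_size)),
    "W": uniform_init(hidden_size, (hidden_size, hidden_size)),
    "V": uniform_init(hidden_size, (vocab_size, hidden_size)),
    "b": np.zeros((hidden_size, 1)),
    "c": np.zeros((vocab_size, 1)),
}
memory = {name: np.zeros_like(value) for name, value in params.items()}

def loss_and_grads(inputs, targets, hprev):
    """Steps 2-4: forward pass, loss, and clipped gradients for one sequence."""
    U, W, V, b, c = (params[k] for k in "UWVbc")
    xs, hs, ps = {}, {-1: np.copy(hprev)}, {}
    loss = 0.0
    for t in range(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(U @ xs[t] + W @ hs[t - 1] + b)
        o = V @ hs[t] + c
        p = np.exp(o - np.max(o))
        ps[t] = p / np.sum(p)
        loss += -np.log(ps[t][targets[t], 0])
    grads = {k: np.zeros_like(v) for k, v in params.items()}
    dhnext = np.zeros_like(hprev)
    for t in reversed(range(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1
        grads["V"] += dy @ hs[t].T
        grads["c"] += dy
        dhrec = (1 - hs[t] * hs[t]) * (V.T @ dy + dhnext)
        grads["b"] += dhrec
        grads["U"] += dhrec @ xs[t].T
        grads["W"] += dhrec @ hs[t - 1].T
        dhnext = W.T @ dhrec
    for g in grads.values():
        np.clip(g, -5, 5, out=g)
    return loss, grads, hs[len(inputs) - 1]

def train(num_iterations):
    """Step 6: repeat steps 2-5 while sliding a window over the text."""
    position, hprev = 0, np.zeros((hidden_size, 1))
    losses = []
    for _ in range(num_iterations):
        # Wrap around and reset the hidden state at the end of the text.
        if position + seq_length + 1 >= len(text):
            position, hprev = 0, np.zeros((hidden_size, 1))
        inputs = [char_to_ix[ch] for ch in text[position:position + seq_length]]
        targets = [char_to_ix[ch] for ch in text[position + 1:position + seq_length + 1]]
        loss, grads, hprev = loss_and_grads(inputs, targets, hprev)
        for name in params:  # step 5: Adagrad update
            memory[name] += grads[name] * grads[name]
            params[name] -= learning_rate * grads[name] / np.sqrt(memory[name] + 1e-8)
        position += seq_length
        losses.append(loss)
    return losses

losses = train(300)
print(f"first loss: {losses[0]:.2f}, last loss: {losses[-1]:.2f}")
```

The targets are the inputs shifted by one character, and the hidden state is carried from one window to the next so that the model sees context longer than seq_length.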