# LSTM From Scratch

```
Implement Simple character Level LSTM from Scratch using Numpy. It is trained in batches with Adam and learns words after few iteration
```

### Architecture of an LSTM Memory Cell
![image.png](attachment:image.png)

In [17]:
import numpy as np
import matplotlib.pyplot as plt
from random import uniform

```
We will use the Adam optimiser, as general empirical results demonstrate that it performs favourably compared to other optimisation methods, it converges fast and can effectively navigate local minima by adapting the learning rate for each parameter. The moving averages, \beta_{1} and \beta_{2} are initialised as suggested in the original paper.
```

```
After the learning rate, weight initialisation is the second most important setting for LSTMs and other recurrent networks; improper initialisation could slow down the training process to the point of impracticality. We will therefore use the high-performing Xavier initialisation, which involves randomly sampling weights from a distribution , where n is the number of neurons in the preceding layer.
```
<img src="https://s0.wp.com/latex.php?latex=%5Cmathcal%7BN%7D%280%2C+%5Cfrac%7B1%7D%7B%5Csqrt%7Bn%7D%7D%29&bg=ffffff&fg=404040&s=2&c=20201002">

```
The input to the LSTM, z has dimensions [vocab_size + n + 1]. Since, the LSTM paper wants to output n neurons.
Each weight should be of size[n, vocab_size + n] and each bias of size[n, 1]. 
Exception is the weight and bias at the output softmax layer (Wv, bv). The resulting output will be a probability distribution over all possible characters in the vocabulary, therefore of size [vocab_size, 1], hence Wv should be of size[vocab_size, n] and bv of size [n, 1].
```

In [10]:
class LSTM:
    def __init__(self, char_to_idx, idx_to_char, vocab_size, 
                 n=100, seq_len=25, epochs=10, lr=0.01, beta1=0.9, beta2=0.999):
        
        self.char_to_idx = char_to_idx # char to index mapping 
        self.idx_to_char = idx_to_char
        self.vocab_size = vocab_size # Number of unique chars in train data
        self.n = n # Number of units in hidden layer
        self.seq_len = seq_len # Number of time stamps, also size of mini batch
        self.epochs = epochs
        self.lr = lr
        self.beta1 = beta1 # Momentum Parameter (For Adam Optimizer)
        self.beta2 = beta2
        
        # Init weight and biases
        self.params = {}
        std = (1.0 / np.sqrt(self.vocab_size + self.n)) # Xavier Initialization
        
        # Forget Gate
        self.params['Wf'] = np.random.randn(self.n, self.n + self.vocab_size) * std
        self.params['bf'] = np.ones((self.n, 1))
        
        # Input Gate 
        self.params['Wi'] = np.random.randn(self.n, self.n + self.vocab_size) * std
        self.params['bi'] = np.ones((self.n, 1))
        
        # Cell Gate 
        self.params['Wc'] = np.random.randn(self.n, self.n + self.vocab_size) * std
        self.params['bc'] = np.zeros((self.n, 1))
        
        # Output Gate
        self.params['Wo'] = np.random.randn(self.n, self.n + self.vocab_size) * std
        self.params['bo'] = np.ones
        
        # Output 
        self.params['Wv'] = np.random.randn(self.n, self.n + self.vocab_size) * std
        self.params['bv'] = np.ones((self.n, 1))
        
        
        # init gradients and Adam params
        self.grads = {}
        self.optim_params = {}
        
        for key in self.params:
            self.grads['d' + key] = np.zeros_like(self.params[key])
            self.optim_params['m' + key] = np.zeros_like(self.params[key])
            self.optim_params['v' + key] = np.zeros_like(self.params[key])
        self.smooth_loss = -np.log(1.0 / self.vocab_size) * self.seq_len
        return
    
    # Utility Functions
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def softmax(self, x):
        e_x = np.exp(x - np.max(x))
        return e_x / np.sum(e_x)
    
    # Although Exploding Gradient is not as prevalent for LSTM as for RNN, we have to limit the gradients to a 
    # conservative value using clip_grads utility. After backpropogating through all LSTM cells, we have to reset 
    # grads.
    
    def clip_grads(self):
        for key in self.grads:
            np.clip(self.grads[key], -5, 5, out=self.grads[key])
        return
    
    def zero_grads(self):
        for key in self.grads:
            self.grads[key].fill(0)
        return
    
    # Update Params For the Optimizer
    # Here weights are updated using the accumulated gradients for all time stamps.
    def update(self, batch_num):
        for key in self.params:
            self.optim_params['m' + key] = self.optim_params['m' + key] * self.beta1 +\
                             (1 - self.beta1) * self.grads[d + 'key']
            self.optim_params['v' + key] = self.optim_params['v' + key] * self.beta2 +\
                             (1 - self.beta2) * self.grads['d' + key] ** 2
            
            m_correlated = self.optim_params['m' + key] / (1 - self.beta1 ** batch_num)
            v_correlated = self.optim_params['v' + key] / (1 - self.beta2 ** batch_num)
            self.params[key] -= self.lr * m_correlated / (np.sqrt(v_correlated) + 1e-8)
        
        return
    
    # Gradient Check 
    '''
    To Check Gradients and debug Neural Networks.
    
    Checks Magnitude of gradients against expected approximate values
    '''
    def gradient_check(self, x, y, h_prev, c_prev, num_checks=10, delta=1e-6):
        print('Gradient Check...')
        
        _, _, _ = self.forward_backward(x, y, h_prev, c_prev)
        grads_numerical = self.grads
        
        for key in self.params:
            print('----------', key, '-------------')
            test = True
            
            dims = self.params[key].shape
            grad_numerical = 0
            grad_analytical = 0
            
            for _ in range(num_checks): # Sample 10 neurons
                idx = int(uniform(0, self.params[key].size))
                old_val = self.params[key].flat[idx]
                
                self.params[key].flat[idx] = old_val + delta
                J_plus, _, _ = self.forward_backward(x, y, h_prev, c_prev)
                
                self.params[key].flat[idx] = old_val - delta
                J_minus, _, _ = self.forward_backward(x, y, h_prev, c_prev)

                self.params[key].flat[idx] = old_val

                grad_numerical += (J_plus - J_minus) / (2 * delta)
                grad_analytical += grads_numerical["d" + key].flat[idx]
            
            grad_numerical /= num_checks
            grad_analytical /= num_checks
            
            rel_error = abs(grad_analytical - grad_numerical) / abs(grad_analytical + grad_numerical)
            
            if (rel_error > 1e-2):
                if not (grad_analytical < 1e-6 and grad_numerical < 1e-6):
                    test = False
                    assert (test)
            
            print('Approximate: \t%e, Exact: \t%e =>  Error: \t%e' % (grad_numerical, grad_analytical, rel_error))
        print("\nTest successful!")
        print("**********************************\n")
        return
    
    # Forward Pass for a time-stamp
    '''
    Propogate through each LSTM cell using Forward Pass.
    
    An LSTM cell mainly depends on the previous cell's state. Forward Pass therefore takes as input the 
    previous hidden state (h_prev) and previous cell state (c_prev). 
    
    At the beginning of every epoch, the previous hidden states are initialised to zero (i.e at t = -1), 
    but for subsequent time-steps, they correspond to the hidden states at t-1, where t is the current time-stamp.
    '''
    def forward(self, x, h_prev, c_prev):
        z = np.row_stack((h_prev, x))
        
        f = self.sigmoid(np.dot(self.params['Wf'], z) + self.params['bf'])
        i = self.sigmoid(np.dot(self.params['Wi'], z) + self.params['bi'])
        c_bar = np.tanh(np.dot(self.params['Wc'], z) + self.params['bc'])
        
        c = f * c_bar + i * c_bar
        o = self.sigmoid(np.dot(self.params['Wo'], z) + self.params['bo'])
        h = o * np.tanh(c)
        
        v = np.dot(self.params['Wv'], h) + self.params['bv']
        y_hat = self.softmax(v)
        
        return y_hat, v, h, o, c, c_bar, i, f, z
    
    
    # Backward Pass for a time step
    '''
    After Forward Pass, pass the updated values of the last LSTM cell to the backward() and propogate 
    the gradient backwards to the first LSTM cell. 
    
    dh_next and dc_next are initialised to zero at t = -1, but take the values of dh_prev and dc_prev that backward()
    returns in subsequent time steps.
    
    1. As weights are shared by all times, the weight gradients are accumulated.
    2. We are adding dh_next to dh, because h is brached in Forward Pass in the softmax output layer and the 
       next LSTM cell, where it is concatenated with x. So, there are two gradients flowing back. This applies to dc
       also.
    3. There are 4 gradients flowing towards the input layer from the gates, therefore dz is the summation of 
       those gradients.
    '''
    def backward(self, y, y_hat, dh_next, dc_next, c_prev, z, f, i, c_bar, c, o, h):
        dv = np.copy(y_hat)
        dv[y] = 1 # y_hat - y
        
        self.grads['dWv'] += np.dot(dv, h.T)
        self.grads['dbv'] += dv
        
        dh = np.dot(self.params['Wv'].T, dv)
        dh += dh_next
        
        do = dh * np.tanh(c)
        da_o = do * o * (1-o)
        self.grads['dWo'] += np.dot(da_o, z.T)
        self.grads['dbo'] += da_o
        
        dc = dh * o * (1 - np.tanh(c)**2)
        dc += dc_next
        
        dc_bar = dc * i
        da_c = dc_bar * (1 - c_bar**2)
        self.grads['dWc'] += np.dot(da_c, z.T)
        self.grads['dbc'] += da_c
        
        di = dc * c_bar
        da_i = di * i * (1-i)
        self.grads['dWi'] += np.dot(da_i, z.T)
        self.grads['dbi'] += da_i
        
        df = dc * c_prev
        da_f = df * f * (1-f)
        self.grads['dWf'] += np.dot(da_f, z.T)
        self.grads['dbf'] += da_f
        
        dz = (np.dot(self.params['Wf'].T, da_f)
             + np.dot(self.params['Wi'].T, da_i)
             + np.dot(self.params['Wc'].T, da_c)
             + np.dot(self.params['Wo'].T, da_o))
        
        dh_prev = dz[:self.n, :]
        dc_prev = f * dc
        return dh_prev, dc_prev
    
    
    # Forward and Backward propogation for all time stamps
    '''
    Both the propogation will execute in this function.
    
    Iterate over all time stamps and store the results for each time step in dictonaries.
    1. In Forward propogation, we accumulate the cross entropy loss.
    
    This Function exports cross-entropy loss of the training batch, in addition to the hidden and cell states of 
    the last layer which are fed to the first LSTM cell as h_prev and prev of the next training batch.
    '''
    def forward_backward(self, x_batch, y_batch, h_prev, c_prev):
        x, z = {}, {}
        f, i, c_bar, c, o = {}, {}, {}, {}, {}
        y_hat, v, h  = {}, {}, {}
        
        # Values at t = -1
        h[-1] = h_prev
        c[-1] = c_prev
        
        loss = 0
        for t in range(self.seq_len):
            x[t] = np.zeros((self.vocab_size, 1))
            x[t][x_batch[t]] = 1    
            y_hat[t], v[t], h[t], o[t], c[t], c_bar[t], i[t], f[t], z[t] = self.forward(x[t], h[t-1], c[t-1])
            
            loss += -np.log(y_hat[t][y_batch[t], 0]) # Cross Entropy Loss
        
        self.zero_grads()
        
        dh_next = np.zeros_like(h[0])
        dc_next = np.zeros_like(c[0])
        
        for t in reversed(range(self.seq_len)):
            dh_next, dc_next = self.backward(y_batch[t], y_hat[t], dh_next[t], dc_next[t], 
                                             c[t-1], z[t], f[t], i[t], c_bar[t], c[t], o[t], h[t])
        
        return loss, h[self.seq_len-1], c[self.seq_len-1]
    
    # Sampling the Character Sequences
    '''
    A Sample Function to output a sequence of characters from the model, of length sample_size
    '''
    def sample(self, h_prev, c_prev, sample_size):
        x = np.zeros((self.vocab_size, 1))
        h = h_prev
        c = c_prev
        sample_string = ""
        
        for t in range(sample_size):
            y_hat, _, h, _, c, _, _, _, _ = self.forward(x, h, c)
            
            # Get a random idx within the probability distribution of y_hat(ravel())
            idx = np.random.choice(range(self.vocab_size), p=y_hat.ravel())
            x = np.zeros((self.vocab_size, 1))
            x[idx] = 1
            
            # Find the char with the sampled index and concat to the output string 
            char = self.idx_to_char[idx]
            sample_string += char
        return sample_string
    
    
    # Train
    '''
    In this Function, input -> text String(X) and output -> List of losses of each training batches as well as 
    the trained Parameters.
    
    We preprocess the input text to train it faster. num_batches is given by 
                        len(X) / no. of chars that we want in (seq_len) [User-Defined]
                        
    1. Trim the characters at end of the input text that don't form a full sequence.
    2. Slice the input text in batches of size (seq_len) when iterating over each training batch.
    3. Map each char in the input and output batch to idx, using idx_to_char, effectively converting input batch 
       to a list of Integers.
       
    In this, h_prev and c_prev are already set to zero at the beginning of every epoch.
    This means that the states for the samples of each batch will be reused as initial states of the samples in 
    the next batch.
    
    If each training batch is independent, then h_prev and c_prev should be reset after training each batch.
    '''
    def train(self, X, verbose=True):
        J = [] # Store losses.
        
        num_batches = len(X) // self.seq_len
        X_trimmed = X[:num_batches * self.seq_len] # trim input to have full sequences
        
        for epoch in range(self.epochs):
            h_prev = np.zeros((self.n, 1))
            c_prev = np.zeros((self.n, 1))
            
            for j in range(0, len(X_trimmed) - self.seq_len, self.seq_len):
                # Prepare batches
                x_batch = [self.char_to_idx[ch] for ch in X_trimmed[j: j + self.seq_len]]
                y_batch = [self.char_to_idx[ch] for ch in X_trimmed[j + 1: j + self.seq_len + 1]]
                
                loss, h_prev, c_prev = self.forward_backward(x_batch, y_batch, h_prev, c_prev)
                
                # smooth out loss and store it in the list (J)
                self.smooth_loss = self.smooth_loss * 0.999 + loss * 0.001
                J.append(self.smooth_loss)
                
                # Check gradients
                if epoch == 0 and j == 0:
                    self.gradient_check(x_batch, y_batch, h_prev, c_prev, num_checks=10, delta=1e-7)
                
                
                self.clip_grads()
                
                batch_num = epoch * self.epochs + j / self.seq_len + 1
                self.update(batch_num)
                
                # print out loss and the sample string 
                if verbose:
                    if j % 400000 == 0:
                        print('Epoch: ', epoch, '\tBatch: ', j, '-', j + self.seq_len, 
                              '\tLoss: ', round(self.smooth_loss, 2))
                        s = self.sample(h_prev, c_prev, sample_size=250)
                        print(s, '\n')
        return J, self.params

In [11]:
# Data
X = open('HP1.txt').read().lower()

chars = set(X)
vocab_size = len(chars)

print('data has %d characters, %d unique' % (len(X), vocab_size))

data has 25808 characters, 39 unique


In [12]:
# Create dict for mapping chars to ints and vice versa
char_to_idx = {w: i for i, w in enumerate(chars)}
idx_to_char = {i: w for i, w in enumerate(chars)}

In [13]:
char_to_idx

{'n': 0,
 'c': 1,
 '?': 2,
 '(': 3,
 '!': 4,
 '\n': 5,
 '"': 6,
 'm': 7,
 'l': 8,
 'o': 9,
 ' ': 10,
 "'": 11,
 '-': 12,
 'p': 13,
 'r': 14,
 'w': 15,
 'h': 16,
 'x': 17,
 'k': 18,
 'v': 19,
 'f': 20,
 'z': 21,
 'q': 22,
 'u': 23,
 ',': 24,
 'a': 25,
 't': 26,
 'i': 27,
 's': 28,
 'd': 29,
 ':': 30,
 'y': 31,
 '.': 32,
 ')': 33,
 'j': 34,
 ';': 35,
 'b': 36,
 'g': 37,
 'e': 38}

In [14]:
idx_to_char

{0: 'n',
 1: 'c',
 2: '?',
 3: '(',
 4: '!',
 5: '\n',
 6: '"',
 7: 'm',
 8: 'l',
 9: 'o',
 10: ' ',
 11: "'",
 12: '-',
 13: 'p',
 14: 'r',
 15: 'w',
 16: 'h',
 17: 'x',
 18: 'k',
 19: 'v',
 20: 'f',
 21: 'z',
 22: 'q',
 23: 'u',
 24: ',',
 25: 'a',
 26: 't',
 27: 'i',
 28: 's',
 29: 'd',
 30: ':',
 31: 'y',
 32: '.',
 33: ')',
 34: 'j',
 35: ';',
 36: 'b',
 37: 'g',
 38: 'e'}

In [15]:
# Model
model = LSTM(char_to_idx, idx_to_char, vocab_size, epochs=10, lr=0.0005)

In [16]:
# Train 
J, params = model.train(X)

TypeError: unsupported operand type(s) for +: 'float' and 'function'

### Above Class is Giving Error so trying with Scripting
```RUN LSTM_SCRATCH.py file```

SEE THE OTHER NOTEBOOK FOR IT