Please refer to these blog-posts for an indepth understanding of RNNS and LSTM's    
https://github.com/karpathy/char-rnn <br>
http://colah.github.io/posts/2015-08-Understanding-LSTMs/ <br>
https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21 <br>
https://towardsdatascience.com/understanding-gru-networks-2ef37df6c9be


In [1]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

In [2]:
# open text file and read in data as `text`
with open('HP1.txt', 'r' ,encoding="utf8", errors='ignore') as f:
    text = f.read()

In [3]:
text[:100]

"Harry Potter and the Sorcerer's Stone \n\nCHAPTER ONE \n\nTHE BOY WHO LIVED \n\nMr. and Mrs. Dursley, of n"

### Encoding the text

We will convert all the characters in the text to numbers to be fed into the network
> * First we index the set of characters we have in set
* Then we encode the text into numbers

In [4]:
# Get unique characters from text
chars = tuple(set(text))
# Get int to chars mapping
int2char = dict(enumerate(chars))
# Reverse the above to get numeric values for characters
char2int = {ch: ii for ii, ch in int2char.items()}
# encode the text
encoded = np.array([char2int[ch] for ch in text])

In [5]:
len(chars)

78

Check the double 'r' in Harry repeating as two 8's in the output

In [6]:
encoded[:100]

array([69, 39, 66, 66, 13, 24, 64, 53, 43, 43, 50, 66, 24, 39, 27, 73, 24,
       43, 51, 50, 24,  2, 53, 66,  7, 50, 66, 50, 66, 60, 28, 24,  2, 43,
       53, 27, 50, 24, 55, 55, 18, 69, 25, 64, 45, 52, 30, 24, 71, 32, 52,
       24, 55, 55, 45, 69, 52, 24, 44, 71, 65, 24,  5, 69, 71, 24, 61,  9,
       20, 52, 36, 24, 55, 55, 31, 66, 56, 24, 39, 27, 73, 24, 31, 66, 28,
       56, 24, 36, 74, 66, 28, 38, 50, 13, 11, 24, 53,  4, 24, 27])

>* To making this a multi-class classification with 78 classes (one of each unique character),
we convert every number encoded into an array of size 78 with one-hot encoding

In [7]:
def one_hot_encoding(input_seq, num_labels):
    
    # Initialize the the encoded array
    one_hot_vector = np.zeros((np.multiply(*input_seq.shape), num_labels), dtype=np.float32)
    # Fill rows with 1 only on the index that matches the number
    one_hot_vector[np.arange(one_hot_vector.shape[0]), input_seq.flatten()] = 1.  
    # Reshape original
    one_hot_vector = one_hot_vector.reshape((*input_seq.shape, num_labels))
    
    return one_hot_vector

In [8]:
# Testing function
input_seq = np.array([[1,2,3],[2,3,6],[2,5,6],[6,3,8],[3,4,6]])
num_labels = 10
temp = one_hot_encoding(input_seq, num_labels)

In [9]:
batch_size = 2

In [12]:
num_batches = len(input_seq.flatten())//(batch_size*seq_length)
input_seq = input_seq.flatten()[:num_batches*batch_size*seq_length]
input_seq = input_seq.reshape(batch_size,-1)

for i in range(0,input_seq.shape[1], seq_length):
    x = arr[:, i:i+seq_length]
    if i+ seq_length == input_seq.shape[1]:
    y = 
    
    

IndentationError: expected an indented block (<ipython-input-12-18fbfe91648d>, line 8)

In [13]:
input_seq.reshape(batch_size,-1)[:,2]

ValueError: cannot reshape array of size 15 into shape (2,newaxis)

### Generate DataLoader that yields batches everytime it is called

* check the Sentiment analysis notebook to know how to wrap data into a torch Dataset to be used by a torch data loader

In [14]:
def data_loader(input_seq, batch_size, seq_length):
    """
    This function generates (batch_size * seq_length) batches from
    input sequence
    """
    
    num_batches = len(input_seq.flatten())//(batch_size*seq_length)
    # Remove the extra characters from input_seq that donot fit batch size
    input_seq = input_seq.flatten()[:num_batches*batch_size*seq_length]
    input_seq = input_seq.reshape(batch_size,-1)
    
    for i in range(0,input_seq.shape[1], seq_length):
        x = input_seq[:, i:i+seq_length]
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], input_seq[:, i+seq_length]
        # feed the first column , when there is no more character to be assigned to y in the end 
        except IndexError:
            y[:, :-1], y[:, -1] = x[:, 1:], input_seq[:, 0]
            
        yield x, y

In [15]:
batches = data_loader(encoded, 8, 50)
x, y = next(batches)

In [16]:
x.shape, y.shape

((8, 50), (8, 50))

In [17]:
# check if GPU is available
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
    print('Training on GPU!')
else: 
    print('Training on CPU!')

Training on GPU!


### Define Class for character LSTM

https://pytorch.org/docs/stable/nn.html#lstm
> * n_layer --> number of layers of LSTM, for n_layers > 1, LSTMS are stacked
* n_hidden --> number of neurons in hidden layer
* dropout_prob --> Probabilty for drop out in LSTM
* batch_first --> the input and output tensors are provided as (batch, seq, feature)


> Dimensions:
* Input --> (batch_size, seq_length, input_size)
* hidden --> (n_layers, batch_size, hidden_dim)
* Output --> (batch_size, seq_length, hidden_size)

* After LSTM we have a fully connect layer that reduces the output dimension from hidden layer dim to input sequence unit dim 
* We return the hidden unit after forward pass to be fed to next LSTM
* We initilize initial weights for cell state and hidden state with zero's

In [28]:
class charLSTM(nn.Module):
    def __init__(self, chars, n_hidden=256, n_layers=2,
                               dropout_prob=0.5, lr=0.001):
        super(charLSTM, self).__init__()
        self.dropout_prob = dropout_prob
        self.n_layers = n_layers
        self.n_hidden = n_hidden
        self.lr = lr
        
        # Encode
        self.chars = chars
        self.int2char = dict(enumerate(self.chars))
        self.char2int = {ch: ii for ii, ch in self.int2char.items()}
        
        # check the LSTM parameters from pytorch
        self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers, 
                            dropout=dropout_prob, batch_first=True)
        
        # Fully connected layer to get out put in the dimension of encoding
        self.dropout = nn.Dropout(dropout_prob)
        self.fc = nn.Linear(n_hidden, len(self.chars))
      
    
    def forward(self, x, hidden):
        """
        Forward pass input and hidden state.
        Define hidden state intially.
        """      
        r_output, hidden = self.lstm(x, hidden)
        out = self.dropout(r_output)
        out = out.contiguous().view(-1, self.n_hidden)
        out = self.fc(out)
        
        return out, hidden
    
    
    def init_hidden(self, batch_size):
        """ 
        Initializes hidden state
        """
        
        # Create two new tensors with sizes n_layers x batch_size x n_hidden
        # with same type as weights initialized to zero, for hidden state and cell state of LSTM
        # check https://discuss.pytorch.org/t/when-to-initialize-lstm-hidden-state/2323/8 for clarification
        
        weight = next(self.parameters()).data
        
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
        
        return hidden
    

In [68]:
# define and print the net
n_hidden= 200
n_layers=4

net = charLSTM(chars, n_hidden, n_layers)
if(train_on_gpu):
        net = net.cuda()
        
print(net)

charLSTM(
  (lstm): LSTM(78, 200, num_layers=4, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5)
  (fc): Linear(in_features=200, out_features=78, bias=True)
)


### Training Network

>* Before training, we split the data into training and validation set
* Our perfomance metric here will just be the loss because, we take the absolute probabilities of varies characters in an output vector and sample from it for next

In [69]:
valid_percent = 0.2
valid_idx = int(len(encoded)*(1 - valid_percent))
train_encoded = encoded[:valid_idx]
valid_encoded = encoded[valid_idx:]

In [70]:
def train(train_on_gpu,net,train_data, valid_data, epochs = 1, seq_length = 30, 
          batch_size = 256, lr = 0.001, clip = 5, name = str('CharLSTM')):
    """
    Function to train the LSTM in batches
    """
    
    acc = 10^8
    char_len = len(net.chars)
    
    for j in range(epochs):
        # Set model to train mode
        net.train()
        total = 0
        sum_loss = 0
        
        # Initial hidden state filled with zeros
        h = net.init_hidden(batch_size)
        
        for i, (x, y) in enumerate(data_loader(train_data, batch_size, seq_length)):
            
            optim = torch.optim.Adam(net.parameters(), lr=lr)
            batch = y.shape[0]
            
            x = one_hot_encoding(x, char_len)
            x, y = torch.from_numpy(x), torch.from_numpy(y)
    
            if(train_on_gpu):
                x, y = x.cuda(), y.cuda()
                
            # Remove the hidden layer from history to avoid back propagating in time
            h = tuple([each.data for each in h])
            
            # Set gradients to 0, avoid gradient accumulation
            net.zero_grad()
            # store the hidden layer which could be used for the next epoch
            output, h = net(x, h)
            
            loss = F.cross_entropy(output, y.view(batch_size*seq_length))
            loss.backward()
            
            # Clip the gradient to avoid exploding gradients
            nn.utils.clip_grad_norm_(net.parameters(), clip)
            optim.step()
            
            total += batch
            sum_loss += batch*(loss.item())
            
        if j%5 == 0:
            print(f"train loss at epoch {j}: ", sum_loss/total)
        # Save the model with parameters that produces lowest validation error
        
        val_loss = validation(train_on_gpu , net, j, valid_data,batch_size)
        if val_loss < acc:
            torch.save(net.state_dict(), name)
        acc = val_loss


In [71]:
def validation(train_on_gpu, net, epoch, valid_data , seq_length = 30, batch_size = 256, lr = 0.001, clip = 5):
    """
    Function that returns validation loss.
    Called for every epoch.
    """
    val_h = net.init_hidden(batch_size)
    sum_loss = 0
    total = 0
    # Set the model to evaluation mode
    net.eval()

    val_h = net.init_hidden(batch_size)
    
    for i, (x, y) in enumerate(data_loader(valid_data, batch_size, seq_length)):
    
        batch = y.shape[0]
        # One-hot encode our data and make them Torch tensors
        x = one_hot_encoding(x, len(net.chars))

        x, y = torch.from_numpy(x), torch.from_numpy(y)
        if(train_on_gpu):
            x, y = x.cuda(), y.cuda()
        
        # Remove the hidden state parameter from history
        val_h = tuple([each.data for each in val_h])
        output, val_h = net(x, val_h)

        val_loss = F.cross_entropy(output, y.view(batch_size*seq_length))        
        sum_loss += batch*(val_loss.item())
        total += batch
    if epoch%5 == 0:
        print("val loss", sum_loss/total)
    return (sum_loss/total)

In [72]:
batch_size = 128
seq_length = 100
n_epochs = 100 # start smaller if you are just testing initial behavior

# train the model
train(train_on_gpu, net, train_encoded, valid_encoded, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.005)

train loss at epoch 0:  3.362160073386298
val loss 3.2157922983169556
train loss at epoch 5:  3.2859446825804532
val loss 3.184307813644409
train loss at epoch 10:  2.9332976694460267
val loss 2.8943209648132324
train loss at epoch 15:  2.486027655778108
val loss 2.364353656768799
train loss at epoch 20:  2.1597665592476174
val loss 2.037652015686035
train loss at epoch 25:  1.9363778078997578
val loss 1.7812238931655884
train loss at epoch 30:  1.7917820745044284
val loss 1.653004765510559
train loss at epoch 35:  1.6948643922805786
val loss 1.5593156218528748
train loss at epoch 40:  1.6270457329573456
val loss 1.5129769444465637
train loss at epoch 45:  1.5805056978155065
val loss 1.4686496257781982
train loss at epoch 50:  1.5432682081505105
val loss 1.4467105865478516
train loss at epoch 55:  1.5104516082339816
val loss 1.4244187474250793
train loss at epoch 60:  1.4857933123906453
val loss 1.4144509434700012
train loss at epoch 65:  1.4659521138226543
val loss 1.3977885246276855


In [202]:
batch_size = 128
seq_length = 100
epochs = 1 # start smaller if you are just testing initial behavior
lr=0.001
print_every=10
data = encoded
val_frac = 0.2
clip = 5

net.train()
    
opt = torch.optim.Adam(net.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()

# create training and validation data
val_idx = int(len(data)*(1-val_frac))
data, val_data = data[:val_idx], data[val_idx:]

if(train_on_gpu):
    net.cuda()

counter = 0
n_chars = len(net.chars)
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)
    x,y = next(data_loader(data, batch_size, seq_length))
        

    counter += 1

    # One-hot encode our data and make them Torch tensors
    x = one_hot_encoding(x, n_chars)


    inputs, targets = torch.from_numpy(x), torch.from_numpy(y)

    if(train_on_gpu):
        inputs, targets = inputs.cuda(), targets.cuda()

    # Creating new variables for the hidden state, otherwise
    # we'd backprop through the entire training history
    h = tuple([each.data for each in h])

    # zero accumulated gradients
    net.zero_grad()

    # get the output from the model
    output, h = net(inputs, h)
    print(output.shape)

    preds = torch.max(output, dim=1)[1]
    preds = preds.reshape(batch_size,-1)
    print((preds==targets).float().sum())
    print(output.shape, targets.view(batch_size*seq_length).shape)
    # calculate the loss and perform backprop
    loss = criterion(output, targets.view(batch_size*seq_length))
    print(loss)
    loss.backward()
    # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
    nn.utils.clip_grad_norm_(net.parameters(), clip)
    opt.step()



torch.Size([12800, 78])
tensor(509.)
torch.Size([12800, 78]) torch.Size([12800])
tensor(4.3452, grad_fn=<NllLossBackward>)


### Finally, Predictions !

> 1) We pass few characters through the model and do a softmax on the output.  <br>
2) This gives us probability distribution of for all the characters in our set. <br>
3) We sampe from this distribution randomly and feed it into the model again 
for predictions until we hit the threshold on generating characters


In [65]:
def predict(train_on_gpu, net, char, h=None, top_k=None):
        """
        Given a character, predict the next character.
        Returns the predicted character and the hidden state.
        """ 
                
        # tensor inputs
        x = np.array([[net.char2int[char]]])
        x = one_hot_encoding(x, len(net.chars))
        inputs = torch.from_numpy(x)
        
        if(train_on_gpu):
            inputs = inputs.cuda()
        
        # detach hidden state from history
        h = tuple([each.data for each in h])
        # get the output of the model
        out, h = net(inputs, h)

        # get the character probabilities
        p = F.softmax(out, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
        
        # get top characters
        if top_k is None:
            top_ch = np.arange(len(net.chars))
        else:
            p, top_ch = p.topk(top_k)
            top_ch = top_ch.numpy().squeeze()
        
        # select the likely next character with some element of randomness
        p = p.numpy().squeeze()
        char = np.random.choice(top_ch, p=p/p.sum())
        
        # return the encoded value of the predicted char and the hidden state
        return net.int2char[char], h

In [66]:
def sample(train_on_gpu, net, size, prime='The', top_k=None):
        
    if(train_on_gpu):
        net.cuda()
    else:
        net.cpu()
    
    net.eval() # eval mode
    
    # First off, run through the prime characters
    chars = [ch for ch in prime]
    h = net.init_hidden(1)
    for ch in prime:
        char, h = predict(train_on_gpu , net, ch, h, top_k=top_k)

    chars.append(char)
    
    # Now pass in the previous character and get a new one
    for ii in range(size):
        char, h = predict(train_on_gpu, net, chars[-1], h, top_k=top_k)
        chars.append(char)

    return ''.join(chars)

In [73]:
# prime = 'Harry' warms up the network with hidden states
print(sample(train_on_gpu, net, 1000, prime='Harry', top_k=5))

Harry on the stool on the second. Harry saw him stool and shill she seat an owl while though it had been a trink back the sight. Their fame to hungry to thinks they were so how to tell Harry in the starts of his things. 

"What wouldn't shat the some with the thing to the from the from you stor it. It's been worrying the frinned that him anything that he's back if they heard to this tonay." "It stopped him to gras in the telephone. 

"I were number of steak where the steps with however." 

"Wouldn't started off the that saying that had tree. It can saying you all?" 

The franticing to him. 

"It's the game of the still. I heard her store off and there's no one stelled it." 

Harry, and he was nosiced to his both of the fire and flound three of the floor in Harry's stop to be teacher of Harry started. 

"Who weren't to get them in his bess in though it." 

"What's you to him one of his back to the crowds on you? I'm been in the top so to him to get to some something, won't have got to s