
# Character Level LSTM in PyTorch

Here, I train a character-level LSTM with PyTorch. The network is a character-level RNN that can generate new text, one character at a time, in the style of the reference text given to it.

> I use PyTorch in Colab because the GPU library (`*.dll`) has an issue in my Miniconda environment.

## Load required libraries.

In [2]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

## Load Data 

In [3]:
# we have text data
with open('/content/anna.txt','r') as data:
  text = data.read()

# check first 100 characters of the book
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

## Tokenization
Tokenization is the process of converting each character to a numeric value. A neural network is a mathematical model, so the text must be converted to numbers before it can be used for training.

In other words, it is a kind of encoding.

In [4]:
chars = tuple(set(text))                              # collect the unique characters and store them as a tuple
int2char = dict(enumerate(chars))                     # map index -> character, e.g. {1: 'a'}
char2int = {ch: n for n, ch in int2char.items()}      # reverse mapping, character -> index, e.g. {'a': 1}

# encode the text
encoded = np.array([char2int[ch] for ch in text])

In [5]:
encoded[:100]

array([36, 76, 74, 34, 75,  9, 50, 28, 58, 68, 68, 68,  0, 74, 34, 34, 14,
       28,  2, 74, 53, 12, 59, 12,  9,  8, 28, 74, 50,  9, 28, 74, 59, 59,
       28, 74, 59, 12, 71,  9,  1, 28,  9,  7,  9, 50, 14, 28, 39,  5, 76,
       74, 34, 34, 14, 28,  2, 74, 53, 12, 59, 14, 28, 12,  8, 28, 39,  5,
       76, 74, 34, 34, 14, 28, 12,  5, 28, 12, 75,  8, 28, 11, 24,  5, 68,
       24, 74, 14, 13, 68, 68, 41,  7,  9, 50, 14, 75, 76, 12,  5])
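
To check that the mapping loses no information, the encoding can be reversed. The sketch below works on a short sample string rather than the book, and all names ending in `_demo`, plus `sample` and `vocab`, are hypothetical. It also sorts the character set so that the index assignment is reproducible, which the cell above does not do: the iteration order of a set of strings can differ between Python sessions.

```python
# Minimal round-trip sketch on a short sample string (not the full book).
sample = "happy families"

# Sorting makes the index assignment reproducible; tuple(set(...)) alone is not.
vocab = tuple(sorted(set(sample)))
int2char_demo = dict(enumerate(vocab))
char2int_demo = {ch: i for i, ch in int2char_demo.items()}

# Encode to integers, then decode back to text.
encoded_demo = [char2int_demo[ch] for ch in sample]
decoded_demo = "".join(int2char_demo[i] for i in encoded_demo)

print(encoded_demo)
print(decoded_demo == sample)  # True: decoding recovers the original text
```

A fixed ordering matters mainly when a trained model is saved and reloaded later, because the network's input and output positions are tied to these indices.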

## Preprocessing of data 

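An LSTM needs numeric input, but the integer encoding above is not suitable on its own, because it would suggest a numeric ordering between characters that has no meaning. Instead, each character is treated as a separate feature: every character becomes a vector that has 1 at the index of that character and 0 everywhere else (one-hot encoding).

Here is the function for that.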


In [88]:
def one_hot_encode(array, n_labels):
  # initialize array 
  one_hot = np.zeros((array.size,n_labels),dtype = np.float32)
  #print(one_hot.shape)

  # fill the elements with ones at the corresponding positions
  one_hot[np.arange(one_hot.shape[0]),array.flatten()] =1.

  # finally reshape to original array
  one_hot = one_hot.reshape((*array.shape,n_labels))

  return one_hot 

In [7]:
# check the above function 
test_seq = np.array([[3,5,1]])
one_hot_test = one_hot_encode(test_seq,8)

print(one_hot_test)

(3, 8)
[[[0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 1. 0. 0. 0. 0. 0. 0.]]]
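
An equivalent, more compact formulation indexes into an identity matrix. This is a sketch for comparison only; `one_hot_eye` is a hypothetical name, and it assumes every value in the input is a valid index below `n_labels`.

```python
import numpy as np

def one_hot_eye(array, n_labels):
    # Row i of the identity matrix is the one-hot vector for label i.
    # Indexing with an integer array appends an axis of length n_labels.
    return np.eye(n_labels, dtype=np.float32)[array]

test_seq = np.array([[3, 5, 1]])
result = one_hot_eye(test_seq, 8)
print(result.shape)  # (1, 3, 8)
```

The explicit version above avoids building an `n_labels x n_labels` matrix on every call, which may matter for large vocabularies.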


## Creating batches

Now we have one long sequence. It is more efficient to split it into `batch_size` rows of equal length, so that several subsequences are processed in parallel. The network also cannot be trained on the full length of a row at once, so we choose a sequence length (`seq_length`) and step through each row one window at a time.
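
As a worked example of the arithmetic, the sketch below uses a made-up text length of 10,100 characters (an assumption for illustration, not the size of the book) together with the batch size and sequence length used in the unit test further down.

```python
# Hypothetical text length, chosen only to illustrate the arithmetic.
n_chars_total = 10_100
batch_size, seq_length = 8, 50

chars_per_batch = batch_size * seq_length      # 400 characters consumed per batch
n_batches = n_chars_total // chars_per_batch   # 25 full batches
n_kept = n_batches * chars_per_batch           # 10000 characters are kept
n_dropped = n_chars_total - n_kept             # 100 trailing characters are dropped
row_length = n_kept // batch_size              # each of the 8 rows holds 1250 characters

print(n_batches, n_dropped, row_length)        # 25 100 1250
```

Each row of 1250 characters is then cut into 25 windows of 50 characters, one window per batch.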



In [8]:
def get_batches(array, batch_size, seq_length):
  batch_size_total = batch_size * seq_length                  # number of characters consumed by one batch
  n_batches = len(array)//batch_size_total                    # number of full batches that fit in the data

  array = array[:n_batches * batch_size_total]                # drop the remainder so that only full batches are kept
  array = array.reshape((batch_size,-1))                      # reshape into batch_size rows

  for n in range(0,array.shape[1],seq_length):                # iterate through the array one window of seq_length at a time
    x = array[:,n:n+seq_length]                               # features: one window of seq_length characters per row
    y = np.zeros_like(x)                                      # targets: the features shifted by one character
    try:                                                      # the last window has no following column...
      y[:,:-1],y[:,-1] = x[:,1:], array[:,n+seq_length]
    except IndexError:                                        # ...so wrap around to the first column instead
      y[:,:-1],y[:,-1] = x[:,1:], array[:,0]
    yield x,y

In [9]:
# Unit Test:
batches = get_batches(encoded, 8,50)
x,y = next(batches)

In [10]:
print('X',x[:10,:10])
print("==================")
print('y',y[:10,:10])
print(" ")

X [[36 76 74 34 75  9 50 28 58 68]
 [ 8 11  5 28 75 76 74 75 28 74]
 [ 9  5 82 28 11 50 28 74 28  2]
 [ 8 28 75 76  9 28 10 76 12  9]
 [28  8 74 24 28 76  9 50 28 75]
 [10 39  8  8 12 11  5 28 74  5]
 [28 43  5  5 74 28 76 74 82 28]
 [54 73 59 11  5  8 71 14 13 28]]
y [[76 74 34 75  9 50 28 58 68 68]
 [11  5 28 75 76 74 75 28 74 75]
 [ 5 82 28 11 50 28 74 28  2 11]
 [28 75 76  9 28 10 76 12  9  2]
 [ 8 74 24 28 76  9 50 28 75  9]
 [39  8  8 12 11  5 28 74  5 82]
 [43  5  5 74 28 76 74 82 28  8]
 [73 59 11  5  8 71 14 13 28 52]]
 


Here, `y` is `x` shifted by one step: each target is the character that follows the corresponding input.


----



## Defining network with PyTorch:

I use the [PyTorch documentation for LSTM](https://pytorch.org/docs/stable/nn.html#lstm) as a reference to build a custom character-level LSTM model.

In [54]:
class charLSTM(nn.Module):
  
  def __init__(self, tokens, n_hidden_size, n_layers,drop_prob, lr):
    super().__init__()
    self.drop_prob = drop_prob                        # dropout probability (reduces overfitting, helps the model generalize)
    self.n_layers = n_layers                          # number of stacked LSTM layers
    self.n_hidden_size = n_hidden_size                # hidden layer size
    self.lr = lr                                      # learning rate

    # create the character dictionaries
    self.chars = tokens
    self.int2char =  dict(enumerate(self.chars))
    self.char2int = {ch: n for n , ch in self.int2char.items()}

    # defining layers (LSTM, dropout, fully connected)
    self.lstm = nn.LSTM(len(self.chars),n_hidden_size,n_layers, dropout = drop_prob, batch_first=True)  
    self.dropout = nn.Dropout(drop_prob)
    self.fc = nn.Linear(n_hidden_size,len(self.chars))  # use the constructor argument, not a global; the output has one score per character

  # forward pass
  def forward(self,x,hidden):
    r_output, hidden = self.lstm(x,hidden)
    out = self.dropout(r_output)
    out = out.contiguous().view(-1,self.n_hidden_size)    # flatten (batch, seq, hidden) to (batch*seq, hidden) for the linear layer
    out = self.fc(out)
    return out, hidden 
  
  # create zero-initialized hidden and cell states, each of shape (n_layers, batch_size, n_hidden_size)
  def init_hidden(self, batch_size):
    weight = next(self.parameters()).data
    if (train_on_gpu):
      hidden = (weight.new(self.n_layers, batch_size, self.n_hidden_size).zero_().cuda(),
                    weight.new(self.n_layers, batch_size, self.n_hidden_size).zero_().cuda())
    else:
      hidden = (weight.new(self.n_layers, batch_size, self.n_hidden_size).zero_(),
                      weight.new(self.n_layers, batch_size, self.n_hidden_size).zero_())
        
    return hidden
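
To get a feel for the model size, the parameter count can be worked out by hand. The sketch below assumes the configuration used later in this notebook (83 input characters, 512 hidden units, 2 layers) and PyTorch's LSTM parameter layout, in which each layer holds input-to-hidden weights, hidden-to-hidden weights, and two bias vectors for each of the four gates. `lstm_layer_params` is a hypothetical helper, not part of PyTorch.

```python
def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights, and two biases.
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size)

n_chars, n_hidden, n_layers = 83, 512, 2   # values assumed from the run below

# The first layer sees one-hot characters; deeper layers see hidden states.
lstm_params = lstm_layer_params(n_chars, n_hidden)
lstm_params += (n_layers - 1) * lstm_layer_params(n_hidden, n_hidden)

# Fully connected layer: weight matrix plus bias.
fc_params = n_hidden * n_chars + n_chars

print(lstm_params + fc_params)  # 3366483
```

With real PyTorch installed, the same total should be obtainable from `sum(p.numel() for p in net.parameters())`.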

### Training Algorithm



In [40]:
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
  print("Training on GPU")
else:
  print("No GPU available, training on CPU; consider using a very small number of epochs")

Training on GPU


In [95]:
def train(net, data, epochs, batch_size, seq_length, lr, clips, val_fraction, print_every):
  net.train()
  optimizer = torch.optim.Adam(net.parameters(),lr=lr)
  criterion = nn.CrossEntropyLoss()

  # prepare training and validation data
  val_idx = int(len(data)* (1-val_fraction))    # e.g. for val_fraction=0.2, validation data starts at 80% of the data
  data,val_data = data[:val_idx],data[val_idx:]

  if(train_on_gpu):
    net.cuda()

  counter = 0
  n_chars = len(net.chars)
  for e in range(epochs):
    print("=========== New Epoch ===========")
    h = net.init_hidden(batch_size)

    for x,y in get_batches(data, batch_size, seq_length):
      counter +=1

      # one hot encoding (data preprocessing)
      x = one_hot_encode(x,n_chars)
      inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
      if(train_on_gpu):
        inputs, targets = inputs.cuda(), targets.cuda()
      
      h = tuple([each.data for each in h])                                      # detach the hidden state so backprop does not reach into earlier batches
      
      net.zero_grad()                                                           # zero accumulated gradients
      output, h = net(inputs,h)                                                 # get model output
      # loss and back propagation
      loss = criterion(output,targets.view(batch_size * seq_length).long())
      loss.backward()

      nn.utils.clip_grad_norm_(net.parameters(),clips)                           # clip the gradient norm to guard against exploding gradients
      optimizer.step()

      # calculate and report loss
      if counter%print_every ==0:
        # get validation loss
        val_h = net.init_hidden(batch_size)
        val_losses = []
        net.eval()
        for x,y in get_batches(val_data,batch_size,seq_length):
          x = one_hot_encode(x,n_chars)
          x,y = torch.from_numpy(x),torch.from_numpy(y)

          val_h = tuple([each.data for each in val_h])
          inputs, targets = x, y
          if(train_on_gpu):
            inputs,targets = inputs.cuda(), targets.cuda()
          output, val_h = net(inputs,val_h)
          val_loss = criterion(output,targets.view(batch_size * seq_length).long())
          val_losses.append(val_loss.item())

        net.train()                                                             # switch back to training mode after validation

        print("Epoch: {}/{}...".format(e+1, epochs),
              "steps: {}...".format(counter),
              "Loss: {:.4f}".format(loss.item()),
              "Val loss: {:.4f}".format(np.mean(val_losses)))
  


  

## Instantiating the model (define hyperparameters)


In [96]:
# define parameters
n_hidden = 512
n_layers = 2
batch_size = 128
seq_length = 100
lr = 0.001
print_every = 10
clips = 5
val_fraction = 0.1
n_epochs = 20
drop_prob = 0.5
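
The `clips` value is the threshold for gradient-norm clipping: when the combined L2 norm of all gradients exceeds it, every gradient is scaled down by the same factor so that the combined norm equals the threshold. The plain-Python sketch below shows the idea on a flat list of numbers. `clip_by_global_norm` is a hypothetical helper; `nn.utils.clip_grad_norm_` operates on parameter tensors and differs in implementation details.

```python
import math

def clip_by_global_norm(grads, max_norm):
    # Combined L2 norm over all gradient values.
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm      # small gradients pass through unchanged
    scale = max_norm / total_norm           # same factor for every value, so direction is preserved
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([6.0, 8.0], max_norm=5)
print(norm_before)  # 10.0
print(clipped)      # [3.0, 4.0]
```

Clipping limits how large a single update can be; it does not help with gradients that are too small.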

In [97]:
# train model 
net = charLSTM(chars, n_hidden,n_layers,drop_prob,lr)
print(net)
print("=====================================")
print("===============TRAINING==============")
print("======================================")
train(net,encoded, epochs = n_epochs, batch_size = batch_size, seq_length= seq_length, lr=lr , clips =clips, val_fraction = val_fraction, print_every = print_every)

charLSTM(
  (lstm): LSTM(83, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=512, out_features=83, bias=True)
)
Epoch: 1/20... steps: 10... Loss: 3.2671 Val loss: 3.2193
Epoch: 1/20... steps: 20... Loss: 3.1583 Val loss: 3.1437
Epoch: 1/20... steps: 30... Loss: 3.1478 Val loss: 3.1260
Epoch: 1/20... steps: 40... Loss: 3.1144 Val loss: 3.1200
Epoch: 1/20... steps: 50... Loss: 3.1456 Val loss: 3.1180
Epoch: 1/20... steps: 60... Loss: 3.1215 Val loss: 3.1162
Epoch: 1/20... steps: 70... Loss: 3.1106 Val loss: 3.1151
Epoch: 1/20... steps: 80... Loss: 3.1237 Val loss: 3.1122
Epoch: 1/20... steps: 90... Loss: 3.1200 Val loss: 3.1052
Epoch: 1/20... steps: 100... Loss: 3.0970 Val loss: 3.0881
Epoch: 1/20... steps: 110... Loss: 3.0607 Val loss: 3.0493
Epoch: 1/20... steps: 120... Loss: 2.9582 Val loss: 2.9975
Epoch: 1/20... steps: 130... Loss: 2.9095 Val loss: 2.8752
Epoch: 2/20... steps: 140... Loss: 2.7935 Val loss: 2.7580
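
For context on these numbers: a model that assigns equal probability to each of the 83 characters would have a cross-entropy of ln(83), and the exponential of the loss (the perplexity) can be read as the effective number of characters the model is choosing between. The sketch below applies this to the last validation loss in the log; the variable names are hypothetical.

```python
import math

n_chars = 83                       # vocabulary size from the model summary above
uniform_loss = math.log(n_chars)   # loss of a uniform guess, about 4.42

last_val_loss = 2.7580             # validation loss at step 140 in the log above
perplexity = math.exp(last_val_loss)

print(f"uniform baseline loss: {uniform_loss:.2f}")
print(f"perplexity at step 140: {perplexity:.1f}")
```

The logged losses are already below the uniform baseline after the first ten steps and are still falling at step 140, early in the second of 20 epochs.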

## Result Conclusion and Model Improvement Analysis
