<a href="https://colab.research.google.com/github/vedantdave77/project.Orca/blob/master/Recurrent_Neural_Network/Character_Level_LSTM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Character Level LSTM in PyTorch

Here, I will train a character-level LSTM with PyTorch. The network is a character-level RNN that learns from a reference text and can then generate new text in a similar style, one character at a time.

> I run PyTorch in Colab because the GPU libraries (`*.dll`) have issues in my local Miniconda environment.

## Load required libraries.

In [None]:
import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

## Load Data 

In [None]:
# we have text data
with open('/content/anna.txt','r') as data:
  text = data.read()

# check first 100 characters of the book
text[:100]

'Chapter 1\n\n\nHappy families are all alike; every unhappy family is unhappy in its own\nway.\n\nEverythin'

## Tokenization
Tokenization is the process of converting each character to a numeric value. A neural network is a mathematical model, so the data must be converted to numbers before the model can be trained on it.

So, it is a kind of encoding.

In [None]:
chars = tuple(set(text))                              # unique characters in the text, stored as a tuple
int2char = dict(enumerate(chars))                     # map index -> character, e.g. {1: 'a'}
char2int = {ch: n for n, ch in int2char.items()}      # reverse map character -> index, e.g. {'a': 1}

# encode the text
encoded = np.array([char2int[ch] for ch in text])

In [None]:
encoded[:100]

array([63, 46, 45,  6, 19, 80, 65, 39, 50, 82, 82, 82, 61, 45,  6,  6,  9,
       39, 22, 45, 72, 21, 42, 21, 80, 71, 39, 45, 65, 80, 39, 45, 42, 42,
       39, 45, 42, 21, 78, 80, 66, 39, 80, 10, 80, 65,  9, 39, 43, 12, 46,
       45,  6,  6,  9, 39, 22, 45, 72, 21, 42,  9, 39, 21, 71, 39, 43, 12,
       46, 45,  6,  6,  9, 39, 21, 12, 39, 21, 19, 71, 39, 26, 16, 12, 82,
       16, 45,  9, 47, 82, 82, 67, 10, 80, 65,  9, 19, 46, 21, 12])
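
A quick way to check that the encoding is reversible is to decode it again. The sketch below uses a short toy string instead of the book, and the `sample_*` names are illustrative only. It also uses `sorted()` so the mapping is the same on every run; with a plain `set`, the character-to-integer mapping can change between Python sessions, which is one reason to store the character tuple alongside any saved model.

```python
# Toy text standing in for the book (assumption: any string works the same way).
sample_text = "happy family"

# sorted() is used here only to make the mapping reproducible across runs.
sample_chars = tuple(sorted(set(sample_text)))
sample_int2char = dict(enumerate(sample_chars))
sample_char2int = {ch: n for n, ch in sample_int2char.items()}

# Encode to integers, then decode back to text.
sample_encoded = [sample_char2int[ch] for ch in sample_text]
sample_decoded = ''.join(sample_int2char[i] for i in sample_encoded)

print(sample_encoded)
print(sample_decoded)
```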

## Preprocessing of data 

The LSTM needs numeric input, but raw integer codes are not suitable because they imply an ordering between characters that does not exist. Instead, each character is one-hot encoded: it becomes a vector with one entry per unique character, holding 1 at the character's index and 0 everywhere else.

Here is the function for that.


In [None]:
def one_hot_encode(array, n_labels):
  # initialize array 
  one_hot = np.zeros((array.size,n_labels),dtype = np.float32)
  #print(one_hot.shape)

  # fill the elements with ones at the corresponding positions
  one_hot[np.arange(one_hot.shape[0]),array.flatten()] =1.

  # finally reshape to original array
  one_hot = one_hot.reshape((*array.shape,n_labels))

  return one_hot 

In [None]:
# check the above function 
test_seq = np.array([[3,5,1]])
one_hot_test = one_hot_encode(test_seq,8)

print(one_hot_test)

[[[0. 0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 1. 0. 0. 0. 0. 0. 0.]]]
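
For comparison, PyTorch ships its own one-hot helper, `torch.nn.functional.one_hot`. The sketch below applies it to the same test sequence; this is an alternative to the NumPy function above, not what the notebook uses. It returns integer values, so `.float()` is needed before feeding an LSTM.

```python
import torch
import torch.nn.functional as F

# Same test sequence as above, as an integer (long) tensor.
test_seq_t = torch.tensor([[3, 5, 1]])

# one_hot returns a LongTensor; convert to float for use as LSTM input.
one_hot_torch = F.one_hot(test_seq_t, num_classes=8).float()

print(one_hot_torch.shape)
print(one_hot_torch)
```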


## Creating batches

We now have one long encoded sequence. For efficient training we split it into `batch_size` rows of equal length and then slide a window of `seq_length` characters across those rows, so every batch has shape `(batch_size, seq_length)`. Training on the full sequence at once is impractical, which is why a sequence length has to be chosen. Characters that do not fill a complete batch are dropped.



In [None]:
def get_batches(array, batch_size, seq_length):
  batch_size_total = batch_size * seq_length                  # number of characters in one batch
  n_batches = len(array)//batch_size_total                    # number of full batches we can make

  array = array[:n_batches * batch_size_total]                # keep only enough characters to make full batches
  array = array.reshape((batch_size,-1))                      # reshape into batch_size rows

  for n in range(0,array.shape[1],seq_length):                # iterate through the array, one sequence window at a time
    x = array[:,n:n+seq_length]                               # features: a window of seq_length characters per row
    y = np.zeros_like(x)                                      # targets: the features shifted by one character
    try:                                                      # the last window has no next column, so wrap around
      y[:,:-1],y[:,-1] = x[:,1:], array[:,n+seq_length]
    except IndexError:
      y[:,:-1],y[:,-1] = x[:,1:], array[:,0]
    yield x,y

In [None]:
# Unit Test:
batches = get_batches(encoded, 8,50)
x,y = next(batches)

In [None]:
print('X',x[:10,:10])
print("==================")
print('y',y[:10,:10])
print(" ")

X [[63 46 45  6 19 80 65 39 50 82]
 [71 26 12 39 19 46 45 19 39 45]
 [80 12  2 39 26 65 39 45 39 22]
 [71 39 19 46 80 39 23 46 21 80]
 [39 71 45 16 39 46 80 65 39 19]
 [23 43 71 71 21 26 12 39 45 12]
 [39 52 12 12 45 39 46 45  2 39]
 [70 31 42 26 12 71 78  9 47 39]]
y [[46 45  6 19 80 65 39 50 82 82]
 [26 12 39 19 46 45 19 39 45 19]
 [12  2 39 26 65 39 45 39 22 26]
 [39 19 46 80 39 23 46 21 80 22]
 [71 45 16 39 46 80 65 39 19 80]
 [43 71 71 21 26 12 39 45 12  2]
 [52 12 12 45 39 46 45  2 39 71]
 [31 42 26 12 71 78  9 47 39 44]]
 


Here, `y` is `x` shifted by one step: each target is the character that follows the corresponding input character.
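
The shift is easier to see on a tiny array. The sketch below re-implements the same batching logic under the hypothetical name `toy_get_batches`, replacing the `try`/`except` with a modulo wrap-around that picks the same column, and runs it on the integers 0 to 19.

```python
import numpy as np

def toy_get_batches(array, batch_size, seq_length):
    """Illustrative re-implementation of the batching logic on a toy array."""
    per_batch = batch_size * seq_length
    n_batches = len(array) // per_batch
    # Drop leftover elements, then lay the data out as batch_size rows.
    array = array[:n_batches * per_batch].reshape((batch_size, -1))
    for n in range(0, array.shape[1], seq_length):
        x = array[:, n:n + seq_length]
        y = np.zeros_like(x)
        y[:, :-1] = x[:, 1:]
        # For the last window this wraps around to column 0.
        y[:, -1] = array[:, (n + seq_length) % array.shape[1]]
        yield x, y

toy = np.arange(20)
toy_pairs = list(toy_get_batches(toy, batch_size=2, seq_length=5))
for toy_x, toy_y in toy_pairs:
    print("x:\n", toy_x)
    print("y:\n", toy_y)
```

Each row of `y` holds the same values as the matching row of `x`, moved one position to the left, with the next character in that row appended at the end.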


----



## Defining network with PyTorch:

I will use the [PyTorch documentation for LSTM](https://pytorch.org/docs/stable/nn.html#lstm) to build a custom character-level LSTM model.

In [None]:
class charLSTM(nn.Module):
  
  def __init__(self, tockens, n_hidden_size, n_layers,drop_prob, lr):
    super().__init__()
    self.drop_prob = drop_prob                        # dropout probability (reduces overfitting, helps generalization)
    self.n_layers = n_layers                          # number of stacked LSTM layers
    self.n_hidden_size = n_hidden_size                # hidden state size
    self.lr = lr                                      # learning rate

    # creating character dictionaries
    self.chars = tockens
    self.int2char =  dict(enumerate(self.chars))
    self.char2int = {ch: n for n , ch in self.int2char.items()}

    # defining layers (LSTM, dropout, fully connected)
    self.lstm = nn.LSTM(len(self.chars),n_hidden_size,n_layers, dropout = drop_prob, batch_first=True)  
    self.dropout = nn.Dropout(drop_prob)
    self.fc = nn.Linear(n_hidden_size,len(self.chars))  # maps each hidden state to one score per character

  # forward pass
  def forward(self,x,hidden):
    r_output, hidden = self.lstm(x,hidden)
    out = self.dropout(r_output)
    out = out.contiguous().view(-1,self.n_hidden_size)    # flatten to (batch_size * seq_length, n_hidden_size) for the linear layer
    out = self.fc(out)
    return out, hidden 
  
  # create zero-initialized hidden and cell states, each of shape (n_layers, batch_size, n_hidden_size)
  def init_hidden(self, batch_size):
    weight = next(self.parameters()).data
    if (train_on_gpu):
      hidden = (weight.new(self.n_layers, batch_size, self.n_hidden_size).zero_().cuda(),
                    weight.new(self.n_layers, batch_size, self.n_hidden_size).zero_().cuda())
    else:
      hidden = (weight.new(self.n_layers, batch_size, self.n_hidden_size).zero_(),
                      weight.new(self.n_layers, batch_size, self.n_hidden_size).zero_())
        
    return hidden
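
To make the reshaping in `forward` concrete, the sketch below traces tensor shapes through a small stand-alone `nn.LSTM` and `nn.Linear`. The sizes (4 sequences, 7 steps, 10 characters, 16 hidden units) are arbitrary values chosen for illustration, not the notebook's settings.

```python
import torch
from torch import nn

# Arbitrary toy sizes, chosen only to make the shapes easy to read.
batch, seq_len, n_chars, hidden_size, layers = 4, 7, 10, 16, 2

lstm = nn.LSTM(n_chars, hidden_size, layers, batch_first=True)
fc = nn.Linear(hidden_size, n_chars)

x = torch.zeros(batch, seq_len, n_chars)            # (batch, seq, features) because batch_first=True
h0 = (torch.zeros(layers, batch, hidden_size),      # hidden state, one slice per layer
      torch.zeros(layers, batch, hidden_size))      # cell state, one slice per layer

r_output, (h_n, c_n) = lstm(x, h0)
flat = r_output.contiguous().view(-1, hidden_size)  # merge batch and time dimensions
scores = fc(flat)                                   # one row of character scores per time step

print(r_output.shape, h_n.shape, flat.shape, scores.shape)
```

Because the output is flattened to `batch * seq_len` rows, the targets must be flattened the same way before computing the loss.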

### Training Algorithm



In [None]:
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
  print("Training on GPU")
else:
  print("No GPU available, training on CPU; consider keeping the number of epochs very small")

Training on GPU


In [None]:
def train(net, data, epochs, batch_size, seq_length, lr, clips, val_fraction, print_every):
  net.train()
  optimizer = torch.optim.Adam(net.parameters(),lr=lr)
  criterian = nn.CrossEntropyLoss()

  # prepare training and validation data
  val_idx = int(len(data)* (1-val_fraction))    # index where the validation split starts, e.g. the last 20% for val_fraction=0.2
  data,val_data = data[:val_idx],data[val_idx:]

  if(train_on_gpu):
    net.cuda()

  counter = 0
  n_chars = len(net.chars)
  for e in range(epochs):
    print("=========== New Epoch ===========")
    h = net.init_hidden(batch_size)

    for x,y in get_batches(data, batch_size, seq_length):
      counter +=1

      # one hot encoding (data preprocessing)
      x = one_hot_encode(x,n_chars)
      inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
      if(train_on_gpu):
        inputs, targets = inputs.cuda(), targets.cuda()
      
      h = tuple([each.data for each in h])                                      # detach the hidden state so we do not backpropagate through earlier batches
      
      net.zero_grad()                                                           # zero accumulated gradients
      output, h = net(inputs,h)                                                 # get model output
      # loss and back propagation
      loss = criterian(output,targets.view(batch_size * seq_length).long())
      loss.backward()

      nn.utils.clip_grad_norm_(net.parameters(),clips)                           # clip the gradient norm to prevent exploding gradients
      optimizer.step()

      # calculate and update loss
      if counter%print_every ==0:
        # get validation loss
        val_h = net.init_hidden(batch_size)
        val_losses = []
        net.eval()
        for x,y in get_batches(val_data,batch_size,seq_length):
          x = one_hot_encode(x,n_chars)
          x,y = torch.from_numpy(x),torch.from_numpy(y)

          val_h = tuple([each.data for each in val_h])
          inputs, targets = x, y
          if(train_on_gpu):
            inputs,targets = inputs.cuda(), targets.cuda()
          output, val_h = net(inputs,val_h)
          val_loss = criterian(output,targets.view(batch_size * seq_length).long())
          val_losses.append(val_loss.item())
        net.train()                                                               # switch back to training mode

        print("Epoch: {}/{}...".format(e+1, epochs),
              "steps: {}...".format(counter),
              "Loss: {:.4f}".format(loss.item()),
              "Val loss: {:.4f}".format(np.mean(val_losses)))
  


  

## Instantiating the model (define hyper parameters)


In [None]:
# define parameters
print_every = 10
clips = 5
val_fraction = 0.1
n_epochs = 20
drop_prob = 0.5

# Hyperparameters
batch_size = 128         # number of sequences run through the network in one pass
seq_length = 100         # number of characters in each sequence; longer sequences let the network learn longer-range structure
lr = 0.001
n_hidden = 512
n_layers = 2
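
The `clips` value above is the maximum gradient norm passed to `nn.utils.clip_grad_norm_` in the training loop. As a minimal sketch of what that call does, the example below builds a deliberately large gradient on a small stand-alone layer (the layer and the loss are made up for illustration) and compares the gradient norm before and after clipping.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A small layer and an artificially large loss, used only to produce big gradients.
layer = nn.Linear(10, 10)
loss = (layer(torch.randn(32, 10)) * 1000).pow(2).mean()
loss.backward()

max_norm = 5.0
# clip_grad_norm_ rescales the gradients in place and returns the norm measured before clipping.
norm_before = nn.utils.clip_grad_norm_(layer.parameters(), max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in layer.parameters()))

print(float(norm_before), float(norm_after))
```

Clipping rescales all gradients by the same factor, so their direction is kept and only the step size is bounded.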



In [None]:
# train model 
net = charLSTM(chars, n_hidden,n_layers,drop_prob,lr)
print(net)
print("=====================================")
print("============== TRAINING =============")
print("======================================")
train(net,encoded, epochs = n_epochs, batch_size = batch_size, seq_length= seq_length, lr=lr , clips =clips, val_fraction = val_fraction, print_every = print_every)

charLSTM(
  (lstm): LSTM(83, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=512, out_features=83, bias=True)
)
Epoch: 1/20... steps: 10... Loss: 3.2635 Val loss: 3.2081
Epoch: 1/20... steps: 20... Loss: 3.1542 Val loss: 3.1374
Epoch: 1/20... steps: 30... Loss: 3.1478 Val loss: 3.1249
Epoch: 1/20... steps: 40... Loss: 3.1177 Val loss: 3.1200
Epoch: 1/20... steps: 50... Loss: 3.1425 Val loss: 3.1179
Epoch: 1/20... steps: 60... Loss: 3.1188 Val loss: 3.1152
Epoch: 1/20... steps: 70... Loss: 3.1090 Val loss: 3.1133
Epoch: 1/20... steps: 80... Loss: 3.1205 Val loss: 3.1081
Epoch: 1/20... steps: 90... Loss: 3.1116 Val loss: 3.0954
Epoch: 1/20... steps: 100... Loss: 3.0780 Val loss: 3.0655
Epoch: 1/20... steps: 110... Loss: 3.0218 Val loss: 3.0020
Epoch: 1/20... steps: 120... Loss: 2.9289 Val loss: 2.9729
Epoch: 1/20... steps: 130... Loss: 2.8562 Val loss: 2.8195
Epoch: 2/20... steps: 140... Loss: 2.7447 Val loss: 2.6858

## Result Conclusion and Model Improvement Analysis

From my reading, there are two main scenarios to watch for when tuning the model.
1. If the training loss is much lower than the validation loss, the model is overfitting. Probable solutions are:
- - use a smaller network
- - increase dropout (a probability between 0 and 1)
- - add more data so the model generalizes better

2. If the training and validation losses are about equal, the model is underfitting. Probable solutions are:

- - use a larger network (more layers or more hidden units)
- - train for more iterations

Some rules of thumb:
- A 1 MB text file holds approximately 1 million characters.
- With 100 MB of data (about 100 million characters) and an RNN with 150K parameters, the data is far larger than the model (100 million >> 0.15 million), so the model will likely underfit and a larger network should help.
- A larger network needs more computing time and power, and it also overfits more easily, so it is usually paired with dropout (for example 0.5).
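
As a rough check of the model-size discussion above, the parameters of a network can be counted directly. This sketch rebuilds only the layers with the sizes shown in the printed model (83 characters, 512 hidden units, 2 layers); the variable names are illustrative.

```python
from torch import nn

# Same layer sizes as the printed model summary.
lstm = nn.LSTM(83, 512, 2, dropout=0.5, batch_first=True)
fc = nn.Linear(512, 83)

# Total number of trainable values across both layers.
n_params = sum(p.numel() for m in (lstm, fc) for p in m.parameters())
print(n_params)
```

The result can be compared with `len(text)`, the number of characters in the training text, to judge which of the two scenarios is more likely.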

---

Following good ML practice:
* Keep track of your models by always saving them as checkpoints.



In [None]:
# use best practice
model_name = 'rnn_20_epoch.net'

checkpoint = {'n_hidden_size' : net.n_hidden_size,
              'n_layers' : net.n_layers,
              'state_dict' : net.state_dict(),
              'tokens' : net.chars}

with open(model_name, 'wb') as f:
  torch.save(checkpoint, f)
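
A checkpoint of this shape can be verified with a save-and-load round trip. The sketch below uses a small stand-alone `nn.LSTM` and an in-memory buffer instead of a file, both assumptions made to keep the example self-contained. `map_location` is a real `torch.load` argument that lets a checkpoint saved on a GPU be loaded on a CPU-only machine.

```python
import io
import torch
from torch import nn

# Small stand-in model and checkpoint with the same keys as above.
small = nn.LSTM(5, 8, 1, batch_first=True)
ckpt = {'n_hidden_size': 8,
        'n_layers': 1,
        'state_dict': small.state_dict(),
        'tokens': ('a', 'b', 'c', 'd', 'e')}

# Save to an in-memory buffer, then rewind it before reading.
buffer = io.BytesIO()
torch.save(ckpt, buffer)
buffer.seek(0)

# map_location forces all tensors onto the CPU regardless of where they were saved.
restored = torch.load(buffer, map_location=torch.device('cpu'))

clone = nn.LSTM(len(restored['tokens']), restored['n_hidden_size'],
                restored['n_layers'], batch_first=True)
clone.load_state_dict(restored['state_dict'])
clone.eval()  # disable dropout before using the restored model for sampling
```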

## Prediction 

The model is trained, so it can now predict the next most probable character. Of course, the loss is still fairly high, so the output will not be great, but the model can still generate somewhat random text.

In [24]:
def predict(net, char, h = None, top_k = None):
  # encode the input character as a one-hot tensor of shape (1, 1, n_chars)
  x = np.array([[net.char2int[char]]])
  x = one_hot_encode(x, len(net.chars))
  inputs = torch.from_numpy(x)

  if(train_on_gpu):
    inputs = inputs.cuda()

  # start from a zero hidden state if none was given
  if h is None:
    h = net.init_hidden(1)

  # detach hidden state from its history
  h = tuple([each.data for each in h])
  output,h = net(inputs,h)

  # convert character scores to probabilities
  p = F.softmax(output,dim=1).data
  if(train_on_gpu):
    p = p.cpu()                                 # move to the CPU so the probabilities can be converted to NumPy

  # consider either every character or only the top_k most likely ones
  if top_k is None:
    top_ch = np.arange(len(net.chars))
  else: 
    p,top_ch = p.topk(top_k)
    top_ch = top_ch.numpy().squeeze()

  # sample the next character from the (renormalized) probabilities
  p = p.numpy().squeeze()
  char = np.random.choice(top_ch, p = p/p.sum())

  return net.int2char[char],h
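
The top-k step can be seen in isolation on a hand-written probability vector. In the sketch below the probabilities, the seed, and the variable names are made up for illustration. Only the two most likely indices survive, and their probabilities are renormalized before sampling.

```python
import numpy as np
import torch

# Hand-written probabilities over 5 hypothetical characters.
probs = torch.tensor([[0.05, 0.50, 0.10, 0.30, 0.05]])

# Keep the 2 most likely characters and their probabilities.
top_p, top_idx = probs.topk(2)
top_p = top_p.numpy().squeeze().astype(np.float64)
top_idx = top_idx.numpy().squeeze()

# Renormalize so the kept probabilities sum to 1, then sample many times.
rng = np.random.default_rng(0)
draws = rng.choice(top_idx, size=1000, p=top_p / top_p.sum())

print(top_idx, top_p / top_p.sum())
print(np.unique(draws))
```

A small `top_k` keeps the text more conservative, while a larger value lets less likely characters through and makes the output more varied.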

## Priming and generating text



In [37]:
print(net)

charLSTM(
  (lstm): LSTM(83, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=512, out_features=83, bias=True)
)


In [28]:
def sample(net, size, prime = 'The', top_k = None):
  
  if(train_on_gpu):
    net.cuda()
  else:
    net.cpu()

  net.eval()

  chars = [ch for ch in prime]
  h = net.init_hidden(1)
  for ch in prime:
    char, h = predict(net,ch,h,top_k= top_k)
  chars.append(char)

  # pass previous character to generate next one
  for i in range(size):
    char, h = predict(net,chars[-1], h, top_k = top_k)
    chars.append(char)
  return ''.join(chars)

In [33]:
print(sample(net,1000,prime= 'Machine Learning',top_k= 5))

Machine Learning and the pate with his caparly and companion of the death.

"Ah! have you soon!" he answered to the carriage with a smile and taking his hand.

The doors with start of the measing on the course of some subjock thought in society, and happy ordinarily he had not a children sorry
of the breins to the carriage, at the strange was not the
father, and treed another shrupked and heart and straight about him. But alone without when he had
come. All the condition of the
painter there and
all the people and her hand was an interesting of the man, who came to see her face where was a minute all one on her arrangements, and asked her
husband's hungress,
with the church. And had chertiously did, he flowned his sole and terror of the princess who had supposed, he could not himself for me. And well, they
said steping too, as they were
calling his his housible from
something, that he had seen him. As he went in to the performable station that they
seemed so as to say a suffering. The 

## Loading a checkpoint

In [42]:
with open('rnn_20_epoch.net','rb') as f:
  checkpoint = torch.load(f)

loaded = charLSTM(checkpoint['tokens'],n_hidden_size=checkpoint['n_hidden_size'],n_layers = checkpoint['n_layers'],drop_prob=drop_prob,lr=lr)
loaded.load_state_dict(checkpoint['state_dict'])

<All keys matched successfully>

In [44]:
# another sample for loaded model
print(sample(loaded, 2000, top_k=5, prime= "He Said"))

He Said.

"I shall be the cheeks at once to
be a confusion."

"I am away one of this tone, and that was a man who was not in an insten in him that the
conversation of it as he came to be as a characteristic attitude of that
three there then was not
seeing
the messed. It all the
considerations in a corract, which all sense. She saw that the point there was in the face.

"I would have to said that?" he answered.

"I won't arrange
your fate."

Alexey Alexandrovitch, with the same correction. And
walked out of the crack of the day of the door that were close a mother in his hind that was said at the
matter, where a lad coleness, who were all
to the proportion with which he
had been taken out of the right sense of holsing. But he
was always before the men seemed as as to
the music and at all, the coupt in his honsess of her. And he could not have been a secret, who would
be treated to his sone in the second candle. She came upon a peasants.

"Ah, your humilatting. I am not so introducing th

## Try to improve network with more iterations

In [45]:
# define parameters
print_every = 10
clips = 5
val_fraction = 0.1
n_epochs = 100
drop_prob = 0.5

# Hyperparameters
batch_size = 128         # number of sequences run through the network in one pass
seq_length = 500         # number of characters in each sequence; longer sequences let the network learn longer-range structure
lr = 0.001
n_hidden = 512
n_layers = 2


In [46]:
# train model 
net = charLSTM(chars, n_hidden,n_layers,drop_prob,lr)
print(net)
print("=====================================")
print("============== TRAINING =============")
print("======================================")
train(net,encoded, epochs = n_epochs, batch_size = batch_size, seq_length= seq_length, lr=lr , clips =clips, val_fraction = val_fraction, print_every = print_every)

charLSTM(
  (lstm): LSTM(83, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=512, out_features=83, bias=True)
)
Epoch: 1/100... steps: 10... Loss: 3.2766 Val loss: 3.2041
Epoch: 1/100... steps: 20... Loss: 3.1804 Val loss: 3.1305
Epoch: 2/100... steps: 30... Loss: 3.1369 Val loss: 3.1208
Epoch: 2/100... steps: 40... Loss: 3.1304 Val loss: 3.1189
Epoch: 2/100... steps: 50... Loss: 3.1330 Val loss: 3.1176
Epoch: 3/100... steps: 60... Loss: 3.1113 Val loss: 3.1153
Epoch: 3/100... steps: 70... Loss: 3.1086 Val loss: 3.1125
Epoch: 3/100... steps: 80... Loss: 3.1113 Val loss: 3.1062
Epoch: 4/100... steps: 90... Loss: 3.0815 Val loss: 3.0893
Epoch: 4/100... steps: 100... Loss: 3.0574 Val loss: 3.0480
Epoch: 5/100... steps: 110... Loss: 2.9689 Val loss: 2.9682
Epoch: 5/100... steps: 120... Loss: 2.8825 Val loss: 2.8833
Epoch: 5/100... steps: 130... Loss: 2.8071 Val loss: 2.8066
Epoch: 6/100... steps: 140... Loss: 2.7181 Va

In [47]:
# sample from the retrained model
print(sample(net, 2000, top_k=5, prime= "He Said"))

He Said, he saw what he had been a personage, so that the clavirgation of her husband was at him on the mother. But to herself the doctor, who had sent the doctor's serious sufferings, which took his eyes and he was already
struggled again at the
possibility, to say that there
was a step of all these words wishing to the midst of her matter and her, that it was sitting to be a side. He shook her side of his statical smile. His house with the palicion at that start of her set had been all his honse for the first other portrait in the sense of the first of the life. He was nithing a
little bad and docross to his surdame in. He could not be sitting, and so in spite of his birst
and hearing her face and her paces of all, with the clasp she could, she was not saying what had supered to have the count of the servant was the sort of comprehense of the
companions, and that would have said this to speaking; he saw that this worst that his best was
as happy and so many first most only the same i

In [48]:
# use best practice
model_name = 'rnn_100_epoch.net'

checkpoint = {'n_hidden_size' : net.n_hidden_size,
              'n_layers' : net.n_layers,
              'state_dict' : net.state_dict(),
              'tokens' : net.chars}

with open(model_name, 'wb') as f:
  torch.save(checkpoint, f)