Let's first import all the necessary libraries and mount Google Drive in the notebook in order to read the text file and save the trained LSTM model.

In [None]:
import numpy as np
import torch
from torch import nn
import spacy
from collections import Counter
from IPython.display import clear_output

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


text is a variable that stores the contents of a txt file, which is then sliced in order to use only the first million characters of the huge corpus (the slice counts characters, not words).

In [None]:
%cd /content/drive/MyDrive/Language_Modeling/

with open('feedback_prize.txt', 'r') as f:
    text = f.read()

text = text[:1000000]
print(len(text))

/content/drive/MyDrive/Language_Modeling
1000000


In [None]:
text[:100]

'I think we should be able to play in a sport if we have a grade C. I think i would be not fear for s'

One of the important steps in Language Modeling is to create word-to-integer and integer-to-word mappings. Once these mappings exist, the corpus is converted to a sequence of integer values, since neural networks only work with numbers.

In [None]:
!python -m spacy download en_core_web_md
# !python -m spacy download en_core_web_lg  # optional: larger English model
nlp = spacy.load('en_core_web_md')
clear_output()

nlp.max_length = len(text)

word_tokens_list = [word.text for word in nlp(text)]

In [None]:
unique_word_counter = Counter(word_tokens_list)

In [None]:
sorted_unique_word_counter = sorted(unique_word_counter.items(), key = lambda x: x[1], reverse = True)
sorted_unique_keys = [i for i, j in sorted_unique_word_counter]

In [None]:
word2int = {word: idx for idx, word in enumerate(sorted_unique_keys)}
int2word = {idx: word for idx, word in enumerate(sorted_unique_keys)}

In [None]:
intarr = [word2int[i] for i in word_tokens_list]
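
The mapping cells above can be checked on a toy example. The sketch below uses a made-up token list (toy_tokens and the other toy_ names are hypothetical, not part of the notebook) and follows the same steps: count, sort by frequency, build both dictionaries, then encode and decode.

```python
from collections import Counter

# Toy token list standing in for word_tokens_list (hypothetical data)
toy_tokens = ["the", "cat", "saw", "the", "dog", "and", "the", "cat"]

# Most frequent tokens get the smallest ids, mirroring the cells above
toy_counts = Counter(toy_tokens)
toy_vocab = [w for w, _ in sorted(toy_counts.items(), key=lambda kv: kv[1], reverse=True)]

toy_word2int = {w: i for i, w in enumerate(toy_vocab)}
toy_int2word = {i: w for i, w in enumerate(toy_vocab)}

# Encode the tokens to integers and decode them back again
toy_encoded = [toy_word2int[w] for w in toy_tokens]
toy_decoded = [toy_int2word[i] for i in toy_encoded]
print(toy_encoded)
print(toy_decoded == toy_tokens)
```

Decoding the encoded list gives back the original tokens, which is the property the text generation step relies on later.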

Next, to exploit the computational efficiency provided by GPUs, get_batches is a function that splits the corpus into batches of the given batch_size and seq_length. Along with each input batch, the function also yields the expected output for that batch, which is the input shifted one word ahead.

In [None]:
def get_batches(arr: list, batch_size: int, seq_length: int):
  '''
  Function Description:
    The function takes in a list of integer values, representing words, and a couple of batch parameters, and yields batches of input words together with the corresponding output words.

  Parameters:
    arr: a list of words converted to integer values using the word2int mapping
    batch_size: number of sequences (rows) in a batch
    seq_length: length of each sequence in a batch
  
  Yields:
    A tuple (x, y) of 2 NumPy arrays of shape (batch_size, seq_length); x holds the input words and y holds the same words shifted one position ahead.
  '''

  arr = np.array(arr)
  number_of_full_batches = len(arr) // (batch_size * seq_length)
  new_arr_size = number_of_full_batches * batch_size * seq_length
  arr = arr[ : new_arr_size]
  arr = arr.reshape(batch_size, -1)
  for i in range(0, arr.shape[1], seq_length):
      x = arr[ : , i : i+seq_length]
      y = np.zeros_like(x)
      y[ : , : -1] = x[ : , 1 : ]
      try:
          y[ : , -1] = arr[ : , i+seq_length]
      except IndexError:
          y[ : , -1] = arr[ : , 0]
      yield x, y
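
To make the input/target relationship concrete, here is a minimal sketch on a toy array (the sizes and the names toy, rows, x and y are chosen only for illustration). It reproduces the reshape and the one-step shift for the first window only, not the whole generator.

```python
import numpy as np

batch_size, seq_length = 2, 3
toy = np.arange(24)                  # 24 tokens = 4 full batches of 2 * 3 tokens
rows = toy.reshape(batch_size, -1)   # shape (2, 12): each row is one contiguous stream

x = rows[:, 0:seq_length]            # first window of every row
y = rows[:, 1:seq_length + 1]        # same window shifted one token to the right
print(x)
print(y)
```

Each target in y is the token that follows the corresponding input in x, so the network is trained to predict the next word at every position.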

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

The recurrent neural network used for this language modeling project is an LSTM, which can learn long-range dependencies in a large text. The class wordRNN wraps the network to be trained and tested: a word embedding layer, an LSTM and a fully connected layer. The class also has methods to feed forward through the network and to create an initial hidden state for the LSTM.

In [None]:
class wordRNN(nn.Module):
  '''
  A class to instantiate a Long Short-Term Memory neural network.
  .forward() takes a batch of words mapped to integers with word2int and a hidden state, and outputs a score for every vocabulary word at each position together with a new hidden state.
  .init_hidden() creates an initial (all-zero) hidden state for the given batch_size.
  '''

  def __init__(self, tokens, embedding_dim = 50, n_hidden = 512, n_layers = 2, drop_prob = 0.5):
      super().__init__()
      self.drop_prob = drop_prob
      self.words = tokens
      self.embedding_dim = embedding_dim
      self.n_vocab = len(set(self.words))
      self.n_hidden = n_hidden
      self.n_layers = n_layers
      self.dropout = nn.Dropout(self.drop_prob)
      self.fc = nn.Linear(self.n_hidden, self.n_vocab)
      self.embedding = nn.Embedding(self.n_vocab, self.embedding_dim)
      self.lstm = nn.LSTM(self.embedding_dim, self.n_hidden, self.n_layers, dropout = self.drop_prob, batch_first = True)
      
  def forward(self, x, hidden):
      x = torch.from_numpy(x)
      x = x.to(device)
      embed = self.embedding(x)
      r_output, hidden = self.lstm(embed, hidden)
      out = self.dropout(r_output)
      # Reshape the dropout output (not the raw LSTM output) so that dropout is actually applied
      out = out.contiguous().view(-1, self.n_hidden)
      out = self.fc(out)
      return out, hidden
  
  def init_hidden(self, batch_size):
      
      hidden = (torch.zeros(self.n_layers, batch_size, self.n_hidden).to(device),
                torch.zeros(self.n_layers, batch_size, self.n_hidden).to(device))
    
      return hidden 

In [None]:
model = wordRNN(intarr)
model

wordRNN(
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=512, out_features=9355, bias=True)
  (embedding): Embedding(9355, 50)
  (lstm): LSTM(50, 512, num_layers=2, batch_first=True, dropout=0.5)
)
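
The tensor shapes flowing through these layers can be traced with a small stand-alone sketch. The sizes below are toy values chosen only for illustration, and the layers are built directly rather than through wordRNN.

```python
import torch
from torch import nn

# Toy sizes, chosen only for illustration
n_vocab, emb_dim, n_hidden, n_layers = 20, 8, 16, 2
batch, seq = 3, 5

embedding = nn.Embedding(n_vocab, emb_dim)
lstm = nn.LSTM(emb_dim, n_hidden, n_layers, batch_first=True)
fc = nn.Linear(n_hidden, n_vocab)

tokens = torch.randint(0, n_vocab, (batch, seq))   # integer word ids
h0 = torch.zeros(n_layers, batch, n_hidden)        # initial hidden state
c0 = torch.zeros(n_layers, batch, n_hidden)        # initial cell state

embedded = embedding(tokens)                       # (batch, seq, emb_dim)
lstm_out, (h_n, c_n) = lstm(embedded, (h0, c0))    # (batch, seq, n_hidden)
logits = fc(lstm_out.reshape(-1, n_hidden))        # (batch * seq, n_vocab)
print(embedded.shape, lstm_out.shape, logits.shape)
```

Flattening to (batch * seq, n_vocab) is what allows the logits to be compared against a flattened target vector by nn.CrossEntropyLoss.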

In [None]:
torch.cuda.empty_cache()

The model is now trained, and its performance (perplexity) on text it has not seen during training is checked using the validation batches. The model is saved whenever the validation loss reaches a new minimum.

In [None]:
# def train_loop(batches, model, optimizer, criterion):
batch_size = 10
seq_length = 50
learning_rate = 0.001
epochs = 250
optimizer = torch.optim.Adam(model.parameters(), lr = learning_rate)
criterion = nn.CrossEntropyLoss()
min_val_loss = np.inf
model = model.to(device)
clip = 5
    
validation_percent = 0.2
total_batches = len(list(get_batches(intarr, batch_size, seq_length)))
no_of_validation_batches = int(validation_percent * total_batches)
no_of_train_batches = total_batches - no_of_validation_batches

hidden = model.init_hidden(batch_size)
    
for epoch in range(epochs):
    val_loss = 0
    perplexity = 0
    print(f'----------------------epoch-{epoch+1}----------------------')
    batches = get_batches(intarr, batch_size, seq_length)
    for batch_no, (x, y) in enumerate(batches):
        
        
        if(batch_no+1 <= no_of_train_batches):
            model.train()
            
            hidden = tuple(each.data.to(device) for each in hidden)

            logits, hidden = model(x, hidden)

            # x is converted to a tensor inside model.forward(); only y is needed here
            y = torch.tensor(y).to(device)

            y = y.view(batch_size * seq_length).long()
            loss = criterion(logits, y)

            optimizer.zero_grad()
            # loss.backward(retain_graph=True)
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()

            if batch_no % 100 == 0:
                print(f'Training loss:{loss: 7f}         batch {batch_no+1} / {no_of_train_batches}')
                
        
        else:
            model.eval()
            # No gradients are needed for validation, so skip building the graph
            with torch.no_grad():
                logits, hidden = model(x, hidden)
                y = torch.tensor(y).to(device)
                y = y.view(batch_size * seq_length).long()
                val_loss += criterion(logits, y)
    
    avg_val_loss = val_loss / no_of_validation_batches
    perplexity = torch.exp(avg_val_loss)
    print(f'Validation: Average Loss = {avg_val_loss : .3f}    Average Perplexity = {perplexity: .3f}')
    
    if (avg_val_loss < min_val_loss):
        print('Saving the model..')
        min_val_loss = avg_val_loss
        torch.save(model, 'language_model.pth')
        
    print('\n')

----------------------epoch-1----------------------
Training loss: 9.147414         batch 1 / 336
Training loss: 5.892900         batch 201 / 336
Validation: Average Loss =  6.015    Average Perplexity =  409.697
Saving the model..


----------------------epoch-2----------------------
Training loss: 5.929358         batch 1 / 336
Training loss: 5.443862         batch 201 / 336
Validation: Average Loss =  5.644    Average Perplexity =  282.638
Saving the model..


----------------------epoch-3----------------------
Training loss: 5.475732         batch 1 / 336
Training loss: 5.043805         batch 201 / 336
Validation: Average Loss =  5.464    Average Perplexity =  235.923
Saving the model..


----------------------epoch-4----------------------
Training loss: 5.191070         batch 1 / 336
Training loss: 4.877835         batch 201 / 336
Validation: Average Loss =  5.360    Average Perplexity =  212.647
Saving the model..


----------------------epoch-5----------------------
Training los

KeyboardInterrupt: ignored
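
The numbers in this log can be sanity-checked by hand, since perplexity is the exponential of the average cross-entropy loss. The sketch below uses only values printed above (the vocabulary size 9355 and the rounded epoch-1 validation loss); the variable names are illustrative.

```python
import math

vocab_size = 9355                      # out_features of the final layer printed above
uniform_loss = math.log(vocab_size)    # cross-entropy of a uniform guess over the vocabulary

epoch1_val_loss = 6.015                # rounded value from the epoch-1 log
epoch1_perplexity = math.exp(epoch1_val_loss)

print(round(uniform_loss, 2), round(epoch1_perplexity, 1))
```

The uniform-guess loss is roughly 9.14, close to the very first training loss (9.147), as expected for an untrained model. The exponential of 6.015 is roughly 409.5, in line with the logged 409.697; the small gap is consistent with the logged loss being rounded to three decimals.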

In [None]:
model = torch.load('Copy of language_model.pth')
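
The cells above save and load the whole model object, which pickles the class along with the weights. A commonly used alternative is to save only the state_dict and load it into a freshly built model. The sketch below shows the pattern on a tiny stand-in model (toy_model, restored and the file name are hypothetical); it is not what the notebook does.

```python
import os
import tempfile

import torch
from torch import nn

# Tiny stand-in model; in the notebook this would be the wordRNN instance
toy_model = nn.Linear(4, 2)

path = os.path.join(tempfile.gettempdir(), "toy_language_model_state.pth")
torch.save(toy_model.state_dict(), path)      # weights only, no pickled class

restored = nn.Linear(4, 2)                    # rebuild the architecture first
restored.load_state_dict(torch.load(path))    # then load the saved weights into it
restored.eval()                               # inference mode for later sampling
```

With this pattern the model has to be constructed with the same arguments before the weights are loaded, so those arguments need to be kept alongside the file.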

Now that the model is trained, the model can be used to generate text. 

But before that, we create a function that feeds a word and a hidden state to the model and returns a predicted word and a new hidden state.

In [None]:
def predict(char: str, hidden: tuple) -> tuple:
  '''
  Function Description:
    Takes in a word and a hidden state to pass through the LSTM network and returns one of the top 5 predicted words, chosen uniformly at random, and a new hidden state. 

  Parameters:
    char: an input word (it must be present in word2int)
    hidden: a hidden state, i.e. a tuple of 2 tensors (hidden state, cell state)

  Returns:
    A new word (str) and a new hidden state.
  '''

  softmax = nn.Softmax(dim = 1)
  x = np.array([[word2int[char]]])

  next_word, next_hidden = model(x, hidden)
  next_word_probabilities = softmax(next_word)

  p, top_ch = next_word_probabilities.topk(5)
  top_ch = top_ch.cpu().numpy().squeeze()
  p = p.detach().cpu().numpy().squeeze()

  char = np.random.choice(top_ch)
  # Return the updated hidden state so that context carries over to the next word
  return int2word[char], next_hidden
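
predict picks uniformly among the top 5 candidates, so the probabilities returned by topk are not used. One possible variation is to weight the choice by those probabilities. The sketch below shows the idea with NumPy only; sample_top_k and toy_probs are hypothetical names and the probabilities are made up.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable

def sample_top_k(probabilities: np.ndarray, k: int = 5) -> int:
    """Pick one index among the k most likely, weighted by probability."""
    top_idx = np.argsort(probabilities)[-k:]   # indices of the k largest values
    top_p = probabilities[top_idx]
    top_p = top_p / top_p.sum()                # renormalise over the shortlist
    return int(rng.choice(top_idx, p=top_p))

# Made-up distribution over a 7-word vocabulary
toy_probs = np.array([0.02, 0.40, 0.05, 0.25, 0.03, 0.15, 0.10])
picked = sample_top_k(toy_probs, k=3)
print(picked)
```

Weighting the draw favours the most likely words while still leaving some variety, which may reduce the repetitive phrases that uniform sampling tends to produce.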

To generate text, the user simply provides a few words. The trained RNN reads them to build up its hidden state and then generates new text in the style of the corpus on which it was trained.

In [None]:
def sample(model: wordRNN, input_words: str, predict_words: int):
  '''
  Function Description:
    Using a sequence of input words, the function returns a string of predicted words.

  Parameters:
    model: An object of class wordRNN; an LSTM network
    input_words: A sequence of words wrapped in a string
    predict_words: Number of words to output as predicted words

  Returns:
    A string containing input_words and all the predicted words.
  '''
  model.eval()
  hidden = model.init_hidden(1)
  generated_text = []

  for word in input_words.split():
    next_word, hidden = predict(word, hidden)
    generated_text.append(word) 
  
  for i in range(predict_words):
      next_word, hidden = predict(generated_text[-1], hidden)
      generated_text.append(next_word)

  return ' '.join(generated_text)

In [None]:
sample(model, ' I think i would be not fear for student that have a good grade like', 100)

'I think i would be not fear for student that have a good grade like not the most people get a lot not just have a lot , they think I feel the way you think the next work . have just get a phone and do not a better " . . . \n I know the next . ( classroom world think it is good work and just do you know they do not just get better . . \n I do to do help . ( learning to have just just do you just do you just just get more more important and have to do you just just just get more than it help make one person is not one people do you think that you just not not one of school , and do you just do you just have more and do not just do n\'t just just do n\'t really one person just get more than you think it also just do not a good grades . have more than a more than it will not just just do you have more and the next school , school work that is good work , and just not a phone get more more than the car " This would be able help . \n The way the next school learning learning work . . \n I h