# NATURAL LANGUAGE PROCESSING WITH RNN & LSTM
**What are Recurrent Neural Networks (RNNs)?**<br>

RNNs are described (IBM, 2023) as a type of neural network that uses sequential data for predictive analysis. RNNs are used for Natural Language Processing, speech recognition, time series analysis, etc.<br>
**Sequential data** - This is data where a point in the dataset depends on one or more other points in the same dataset, e.g. a sentence. A sentence is a collection of words with a contextual meaning. In a meaningful sentence, the next word depends on the words that came before it.<br>
In order to predict the next word, an RNN needs to capture and remember information or patterns from earlier time steps in the sequence. A relationship between elements that are far apart in a sequence is called a 'long-term dependency'. The main issue with a simple RNN is that it struggles to learn long-term dependencies, so for long sentences it cannot reliably recall the earlier words needed to predict the next one. Specifically, during backpropagation through a long sequence the gradient is multiplied by a similar factor at every time step. If that factor is greater than 1, the gradient can grow exponentially with the number of steps (an exploding gradient); if it is less than 1, the gradient shrinks towards zero (a vanishing gradient), and the weights tied to early time steps barely get updated. Exploding and vanishing gradients are among the main problems with a simple RNN. To address them, a more advanced RNN such as a Long Short-Term Memory (LSTM) network can be used, which handles long sequences better. Gradient clipping is commonly used against exploding gradients, while regularisation tools like dropout mainly help against overfitting.
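
The effect of repeated multiplication can be seen with a small toy calculation. This is only a sketch: the factors 0.9 and 1.1, the 100 steps and the clipping threshold of 5.0 are made-up values, and a real RNN multiplies by matrices rather than single numbers.

```python
# Toy illustration: backpropagation through time multiplies the gradient
# by a similar factor at every step. All numbers here are illustrative.

def gradient_after(steps: int, factor: float, start: float = 1.0) -> float:
    """Return the gradient magnitude after `steps` repeated multiplications."""
    grad = start
    for _ in range(steps):
        grad *= factor
    return grad


def clip(grad: float, max_abs: float) -> float:
    """Clip a scalar gradient to the range [-max_abs, max_abs]."""
    return max(-max_abs, min(max_abs, grad))


vanishing = gradient_after(100, 0.9)   # factor below 1: shrinks towards 0
exploding = gradient_after(100, 1.1)   # factor above 1: grows very large
clipped = clip(exploding, 5.0)         # clipping caps the exploding value

print(f"factor 0.9 -> {vanishing:.2e}")
print(f"factor 1.1 -> {exploding:.2e}")
print(f"clipped    -> {clipped}")
```
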

**LSTM** is an advanced RNN that handles long-term dependencies well, and therefore generally performs better in prediction than a simple RNN. In an LSTM, data flows through gates (forget, input and output gates) that decide which information is kept in the cell's memory, which is discarded and which is passed on to the next step. The name "Long Short-Term Memory" refers to a short-term memory that can be retained over a long span of time steps. Because the cell state is updated largely by addition rather than by repeated multiplication, gradients are less prone to vanishing. As a result, backpropagation works better over long sequences and the model can update its weights (reducing the loss) using information from earlier in the sentence when predicting the next word.
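
To make the role of the gates concrete, here is a minimal sketch of one LSTM step using single numbers instead of vectors. The weights are hypothetical and chosen by hand (not learned) so that the forget gate stays almost open and the input gate almost closed; with that setting the stored memory value is largely preserved over several steps.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    `w` maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old memory to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the memory to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c = f * c_prev + i * g            # additive cell-state update
    h = o * math.tanh(c)              # new hidden state
    return h, c


# Hypothetical, hand-picked weights (illustration only).
weights = {
    "forget": (0.0, 0.0, 4.0),      # forget gate almost fully open
    "input": (0.0, 0.0, -4.0),      # input gate almost closed
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0   # start with a stored memory value of 1.0
for x in [0.5, -0.3, 0.8, 0.1, -0.6]:
    h, c = lstm_step(x, h, c, weights)

print("memory after 5 steps:", round(c, 3))
```
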







**The Dataset**<br>
RNNs are useful for various Natural Language Processing (NLP) applications, including text generation, autocomplete suggestions, and improving machine understanding of human languages.
The main goal in this assignment is to train an RNN to predict the most likely word to follow a given sequence of words in a text. A book is a good dataset for this model as it has long and complex sentences with dependencies between words that can span multiple paragraphs or chapters. This provides an opportunity to train the model to capture long-term contextual information, which is valuable for next-word prediction. Additionally, books contain natural language patterns, idiomatic expressions, and grammatical structures commonly found in written text. Training on a book can help the model learn these patterns, making it capable of generating text that adheres to proper language usage.<br>
For this assignment I am using a book titled 'The Life and Work of Susan B. Anthony (Volume 2 of 2)' by Ida Husted Harper, which is publicly available on the Project Gutenberg website (link below):<br>
https://www.gutenberg.org/ebooks/31125 <br>
For the purpose of this assignment (to keep the code practical to run), a portion of the book was extracted from the full text. After cleaning, this portion is about 157,000 characters long, or roughly 29,000 words.

**References**<br>
IBM. (2023). Recurrent Neural Networks. [Online]. Available at: https://www.ibm.com/topics/recurrent-neural-networks<br>
Sherif, A., & Ravindra, A. (2018). Apache Spark Deep Learning Cookbook. Packt Publishing.


**The Analysis**<br>
The analysis of this dataset focuses on the performance of the RNN model that will be built, i.e. how well it can predict the next word given a sequence of words. For this project, each input sequence consists of 30 words and the output is the word that follows. The analysis also compares the prediction accuracy of a simple RNN with that of an LSTM, which is an improved RNN. In detail, the analysis is performed in the following steps:

**1.Data Preparation:**<br>
This section consists of obtaining the dataset and data wrangling (preprocessing).<br>
This includes removing punctuation and tokenizing the text into single words using a Tokenizer, then
creating a vocabulary that consists of all unique words in the corpus and assigning each word a unique numerical identifier.<br>
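
The idea of a vocabulary with numerical identifiers can be sketched in plain Python. This is a simplified stand-in for what a tokenizer does, not the Keras implementation: the helper names `build_word_index` and `to_ids` and the sample sentence are assumptions made for illustration. Ids start at 1 and the most frequent words get the smallest ids, with 0 left free as a reserved id.

```python
from collections import Counter


def build_word_index(text: str) -> dict:
    """Map each unique lower-cased word to an integer id, starting at 1.

    More frequent words get smaller ids; ties keep first-seen order
    because Python's sort is stable.
    """
    counts = Counter(text.lower().split())
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    return {word: idx for idx, (word, _) in enumerate(ordered, start=1)}


def to_ids(text: str, word_index: dict) -> list:
    """Replace each known word by its id; unknown words are skipped."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


sample = "the ship left the bay and the tide turned"
word_index = build_word_index(sample)
ids = to_ids(sample, word_index)
vocab_size = len(word_index) + 1   # +1 because id 0 is reserved

print(word_index)
print(ids)
print("vocabulary size:", vocab_size)
```
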

**2.Sequence Generation:**<br>

In this step, input-output pairs for training the model are created. For each position in the text, a training example is created where the input is a fixed-length sequence of 30 words and the output is the word that follows the input sequence.
<br>
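
The sliding-window idea can be sketched on a tiny example. The helper `make_windows` and the toy ids are hypothetical, and the window is 3 instead of 30 to keep the output readable. A sequence of N tokens with window W gives N - W pairs, which is consistent with the 29049 tokens and 29019 sequences reported later in the notebook.

```python
def make_windows(token_ids, window):
    """Return (inputs, targets).

    Each input is `window` consecutive ids and the target is the id
    that immediately follows that window.
    """
    inputs, targets = [], []
    for end in range(window, len(token_ids)):
        inputs.append(token_ids[end - window:end])
        targets.append(token_ids[end])
    return inputs, targets


toy_ids = [1, 2, 3, 1, 4, 5, 1, 6, 7]
X_toy, y_toy = make_windows(toy_ids, window=3)

for inp, out in zip(X_toy, y_toy):
    print(inp, "->", out)
```
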

**3.Vectorization:**<br>

In this section the output tokens of the training examples are converted into one-hot vectors using the Keras `to_categorical` function. This is done to remove any ordinal implications that may arise from the integer ids, and to allow for the correct computation of the loss function and gradients during training.

<br>
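
What one-hot encoding produces can be shown with a small pure-Python sketch. The function `one_hot` below is a hypothetical helper written for illustration; it mirrors the idea behind `to_categorical` but is not the Keras code.

```python
def one_hot(class_ids, num_classes):
    """Turn a list of integer class ids into rows of 0.0/1.0 values.

    Row k has a single 1.0 in the column given by class_ids[k].
    """
    matrix = []
    for class_id in class_ids:
        row = [0.0] * num_classes
        row[class_id] = 1.0
        matrix.append(row)
    return matrix


encoded = one_hot([2, 0, 3], num_classes=4)
for row in encoded:
    print(row)
```

Because every row contains exactly one 1.0, the ids 2 and 3 are no longer "closer" to each other than 0 and 3, which is the ordinal implication the encoding removes.
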

**4.Training:**<br>
First, the network architecture is created with the specified parameters. The algorithms that will be used are:
1. A simple RNN
2. Long Short-Term Memory (LSTM), because it can capture long-term dependencies.

The model will then be trained on the prepared training data. The goal is to minimize the cross-entropy loss between the predicted word probabilities and the actual next words in the training examples.
Early stopping and dropout will be used to stabilise and improve training.<br>
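
As a rough illustration of the loss being minimised, the sketch below computes the cross-entropy for one prediction over a made-up 4-word vocabulary. The raw scores are hypothetical; in the real model they would come from the final layer.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)                       # subtract max for stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true next word."""
    return -math.log(probs[true_index])


scores = [2.0, 0.5, 0.1, -1.0]        # hypothetical scores for 4 words
probs = softmax(scores)

loss_if_word_0 = cross_entropy(probs, 0)   # true word has the highest score
loss_if_word_3 = cross_entropy(probs, 3)   # true word has the lowest score

print("probabilities:", [round(p, 3) for p in probs])
print("loss when the likely word is correct:  ", round(loss_if_word_0, 3))
print("loss when the unlikely word is correct:", round(loss_if_word_3, 3))
```

The loss is small when the model puts high probability on the word that actually follows, and large when it does not.
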

**5.Testing:**<br>

After training, the trained LSTM model will be used for next-word prediction,
starting with a sequence of words from the book and using the model to predict the next word,
and then with a sequence from outside the book to see how well the LSTM model predicts.<br>


**6.Evaluation:**<br>

The quality of the model's predictions can be evaluated using metrics like perplexity, BLEU score, or human judgment.
The model can then be fine-tuned by experimenting with hyperparameters to improve performance.
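
Perplexity, one of the metrics mentioned above, can be sketched as the exponential of the average negative log-probability that the model assigns to the true next words. The probability lists below are invented for illustration and are not results from this notebook.

```python
import math


def perplexity(true_word_probs):
    """exp(average negative log-probability of the true next words)."""
    n = len(true_word_probs)
    avg_nll = -sum(math.log(p) for p in true_word_probs) / n
    return math.exp(avg_nll)


# A model that guesses uniformly among 4 words.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])

# A hypothetical model that is usually confident and right.
confident = perplexity([0.9, 0.8, 0.95, 0.85])

print("uniform guess:  ", round(uniform, 3))
print("confident model:", round(confident, 3))
```

Lower is better: a perplexity close to the vocabulary size means the model is doing little better than guessing.
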


In [1]:
pip install spark


Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install pyspark

Note: you may need to restart the kernel to use updated packages.


In [3]:
#Importing a Spark session
from pyspark.sql import SparkSession

In [4]:
# getOrCreate() is needed to turn the builder into an actual SparkSession
spark = SparkSession.builder.appName("The Life of Susan Book Analysis").master('local').config('spark.driver.bindAddress', '127.0.0.1').getOrCreate()


In [5]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import pickle
from pickle import dump
from keras.utils import to_categorical
import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import SimpleRNN
from keras.layers import Dropout
from keras.callbacks import EarlyStopping
from keras.layers import Embedding

import seaborn as sns
import matplotlib.pyplot as plt






**1.DATA PREPARATION**


**1.1 Getting the data**<br>
In this step the data is loaded into the notebook and viewed.

In [6]:
DATA_PATH='Life of Susan Anthony.txt'

In [7]:
# Read the file line by line into a list (the file is closed automatically)
with open(DATA_PATH, "r", encoding="utf8") as file:
    lines = file.readlines()

# Convert the list to a single string (one join is enough)
data = ' '.join(lines)

In [8]:
#viewing the data that has been loaded
data

'CHAPTER XXXII.\n \n MISS ANTHONY\'S EUROPEAN LETTERS.\n \n 1883.\n \n \n No pen so well as Miss Anthony\'s own, can describe her delightful tour\n abroad, and although her letters were dashed off while travelling from\n point to point, or at the close of a hard day\'s sight-seeing, and the\n entries in the diary are a mere word, they tell in a unique way her\n personal impressions. Because of limited space descriptions of scenery\n will be omitted in order to leave room for opinions of people and\n events.\n \n                               ON BOARD THE BRITISH PRINCE, February 24.\n \n      MY DEAR MRS. SPOFFORD: Here we are at noon, Friday, steaming down\n      Delaware Bay. We got along nicely until 3 P. M. yesterday, when we\n      came to a standstill. "Stuck in the mud," was the report. There we\n      lay until eight, when with the incoming tide we made a fruitless\n      attempt to get over the bar; then had to steam back up the river to\n      anchor, and lie there until nine

In [9]:
#viewing the length of the dataset (in characters) with special characters and punctuation
len(data)

175535

**1.2 Cleaning the data**<br>
In this step the data is cleaned by removing all special characters, punctuation and unnecessary white spaces
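
As an alternative sketch of this kind of cleaning, `str.translate` can strip punctuation in one pass. Note the assumptions: this version removes every ASCII punctuation character plus a few curly quotes, so it is stricter than a hand-written list of characters (hyphens and exclamation marks are removed too). The helper name `clean_text` is hypothetical.

```python
import string

# Characters to delete: ASCII punctuation plus curly quotes (an assumption
# about which extra characters appear in the text).
EXTRA_CHARS = "“”‘’"
_DELETE_TABLE = str.maketrans("", "", string.punctuation + EXTRA_CHARS)


def clean_text(raw: str) -> str:
    """Remove punctuation, then collapse all whitespace to single spaces."""
    without_punctuation = raw.translate(_DELETE_TABLE)
    return " ".join(without_punctuation.split())


example = '"Stuck in the mud,"\n was the   report!'
print(clean_text(example))
```
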


In [10]:
#removing special characters (each listed string is deleted from the text, in order)
chars_to_remove = ['\n', '\r', '*', '“', '”', '.', ',', ':', '/', '@', '$', '&', '[', ']', '(', ')', '\'', '"', ';', ' _']
for ch in chars_to_remove:
    data = data.replace(ch, '')

#remove unnecessary spaces
data = data.split()
data = ' '.join(data)


In [11]:
#Now viewing the cleaned data
data

'CHAPTER XXXII MISS ANTHONYS EUROPEAN LETTERS 1883 No pen so well as Miss Anthonys own can describe her delightful tour abroad and although her letters were dashed off while travelling from point to point or at the close of a hard days sight-seeing and the entries in the diary are a mere word they tell in a unique way her personal impressions Because of limited space descriptions of scenery will be omitted in order to leave room for opinions of people and events ON BOARD THE BRITISH PRINCE February 24 MY DEAR MRS SPOFFORD Here we are at noon Friday steaming down Delaware Bay We got along nicely until 3 P M yesterday when we came to a standstill Stuck in the mud was the report There we lay until eight when with the incoming tide we made a fruitless attempt to get over the bar then had to steam back up the river to anchor and lie there until nine this morning--twenty-four hours almost in sight of the loved ones! It is a break from all fastenings to friends to be thus cut loose from the w

In [12]:
#Viewing the size of the data after it has been cleaned
len(data)

157536

**1.3 Tokenisation**<br>
In this step, the data is tokenised into single words using a Tokenizer, and each word is given a numerical representation using the `texts_to_sequences` method.

In [13]:
#Tokenising the data
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

# saving the tokenizer for prediction later on (the file is closed automatically)
with open('token.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

#Sequencing the data into its numerical representation
sequenced_data = tokenizer.texts_to_sequences([data])[0]

In [14]:
#viewing the sequenced data
sequenced_data

[812,
 2212,
 29,
 219,
 2213,
 173,
 813,
 55,
 814,
 35,
 174,
 21,
 29,
 219,
 148,
 81,
 1383,
 9,
 383,
 571,
 815,
 2,
 439,
 9,
 173,
 34,
 2214,
 250,
 128,
 2215,
 30,
 672,
 4,
 672,
 50,
 10,
 1,
 440,
 3,
 5,
 384,
 124,
 269,
 345,
 2,
 1,
 2216,
 6,
 1,
 220,
 41,
 5,
 1038,
 294,
 48,
 251,
 6,
 5,
 2217,
 221,
 9,
 385,
 2218,
 140,
 3,
 2219,
 1039,
 2220,
 3,
 673,
 49,
 26,
 1384,
 6,
 572,
 4,
 346,
 236,
 8,
 1040,
 3,
 104,
 2,
 816,
 18,
 270,
 1,
 817,
 818,
 573,
 819,
 32,
 78,
 19,
 820,
 66,
 17,
 41,
 10,
 674,
 675,
 2221,
 141,
 1385,
 676,
 17,
 441,
 498,
 2222,
 192,
 442,
 295,
 142,
 499,
 56,
 17,
 129,
 4,
 5,
 2223,
 1386,
 6,
 1,
 2224,
 12,
 1,
 206,
 46,
 17,
 500,
 192,
 574,
 56,
 14,
 1,
 2225,
 2226,
 17,
 88,
 5,
 2227,
 1041,
 4,
 155,
 82,
 1,
 1387,
 69,
 25,
 4,
 2228,
 158,
 65,
 1,
 1388,
 4,
 2229,
 2,
 1042,
 46,
 192,
 501,
 31,
 108,
 252,
 222,
 200,
 313,
 6,
 269,
 3,
 1,
 502,
 386,
 11,
 16,
 5,
 677,
 30,
 23,
 2230,
 4,
 1

In [15]:
#showing the length of the sequenced data
len(sequenced_data)

29049

In [16]:
#determining how many words are unique in this dataset
vocab_size=len(tokenizer.word_index) + 1
print('Vocabulary size %d' % vocab_size)


Vocabulary size 4976


**2.SEQUENCE GENERATION**<br>
In this section, input and output pairs are created where the input is a fixed-length sequence of 30 words, and the output is the word that follows the input sequence.

In [17]:
#setting up the sequence size
sequences = list()

for i in range(30, len(sequenced_data)):
    words = sequenced_data[i-30:i+1]
    sequences.append(words)

print("The Length of sequences are: ", len(sequences))


The Length of sequences are:  29019


In [18]:
#Showing 5 sequences where each sequence is made up of 31 tokens/words, 30 tokens being the input words and the last 1 being the output
sequences = np.array(sequences)
sequences[:5]

array([[ 812, 2212,   29,  219, 2213,  173,  813,   55,  814,   35,  174,
          21,   29,  219,  148,   81, 1383,    9,  383,  571,  815,    2,
         439,    9,  173,   34, 2214,  250,  128, 2215,   30],
       [2212,   29,  219, 2213,  173,  813,   55,  814,   35,  174,   21,
          29,  219,  148,   81, 1383,    9,  383,  571,  815,    2,  439,
           9,  173,   34, 2214,  250,  128, 2215,   30,  672],
       [  29,  219, 2213,  173,  813,   55,  814,   35,  174,   21,   29,
         219,  148,   81, 1383,    9,  383,  571,  815,    2,  439,    9,
         173,   34, 2214,  250,  128, 2215,   30,  672,    4],
       [ 219, 2213,  173,  813,   55,  814,   35,  174,   21,   29,  219,
         148,   81, 1383,    9,  383,  571,  815,    2,  439,    9,  173,
          34, 2214,  250,  128, 2215,   30,  672,    4,  672],
       [2213,  173,  813,   55,  814,   35,  174,   21,   29,  219,  148,
          81, 1383,    9,  383,  571,  815,    2,  439,    9,  173,   34,
        

In [19]:
#Separating input from output(x and y variables)
X = []
y = []

for i in sequences:
    X.append(i[0:30])
    y.append(i[30])

X = np.array(X)
y = np.array(y)

In [20]:
#Below the input and output variables are shown, where each X (input) row has 30 tokens and y (output) holds the token that follows
print("X: ", X[:2])
print("y: ", y[:2])

X:  [[ 812 2212   29  219 2213  173  813   55  814   35  174   21   29  219
   148   81 1383    9  383  571  815    2  439    9  173   34 2214  250
   128 2215]
 [2212   29  219 2213  173  813   55  814   35  174   21   29  219  148
    81 1383    9  383  571  815    2  439    9  173   34 2214  250  128
  2215   30]]
y:  [ 30 672]


**3.VECTORISATION**<br>
 In this section, the output variable which is a class vector is converted into a binary class matrix representation 

In [21]:
#vectoring the y variable into a binary matrix
X, y = sequences[:,:-1], sequences[:,-1] 
y = to_categorical(y, num_classes=vocab_size)



In [22]:
X.shape

(29019, 30)

In [23]:
y.shape

(29019, 4976)

**4.TRAINING THE MODEL(S)**<br>
In this section, two models will be trained:<br>
First, a simple RNN with a sigmoid activation function on the output layer,<br>
using early stopping and dropout to avoid overfitting,<br>
and the 'Adam' optimiser to update the weights from the backpropagated gradients.<br>
Then an LSTM, a more advanced model, will be trained for comparison with the traditional RNN. This model is expected to give better results because it handles long-term dependencies and can remember sequences better than a traditional RNN.
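
The patience rule behind early stopping can be sketched without any framework. The function `epochs_until_stop` and the validation losses below are hypothetical; the logic is the usual one, where training stops once the monitored value has failed to improve for `patience` epochs in a row. In Keras, monitoring `val_loss` additionally requires validation data to be supplied to `fit` (for example through its `validation_split` or `validation_data` arguments).

```python
def epochs_until_stop(val_losses, patience):
    """Return the 1-based epoch at which training would stop.

    Returns None if the loss keeps improving often enough that the
    patience limit is never reached.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Made-up validation losses: best value at epoch 3, then no improvement.
hypothetical_val_losses = [2.0, 1.6, 1.4, 1.45, 1.5, 1.48, 1.3]
stop_epoch = epochs_until_stop(hypothetical_val_losses, patience=3)
print("training would stop after epoch", stop_epoch)
```
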


**4.1 Training a traditional RNN**


In [24]:
#RNN Parameters
drop_value=0.2
early_stop=EarlyStopping(monitor='val_loss',patience=3)

In [25]:
model=Sequential()
model.add(Embedding(vocab_size,10,input_length=30))
model.add(SimpleRNN(128))
model.add(Dense(vocab_size,activation='sigmoid'))
model.add(Dropout(drop_value))
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 30, 10)            49760     
                                                                 
 simple_rnn (SimpleRNN)      (None, 128)               17792     
                                                                 
 dense (Dense)               (None, 4976)              641904    
                                                                 
 dropout (Dropout)           (None, 4976)              0         
                                                                 
Total params: 709456 (2.71 MB)
Trainable params: 709456 (2.71 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None


In [26]:
model.compile(loss= "categorical_crossentropy", optimizer='Adam',metrics=['accuracy'])

In [35]:
model.fit(X, y, epochs=70, callbacks=[early_stop])


Epoch 1/70
...
Epoch 70/70


<keras.src.callbacks.History at 0x7f51fdf6c760>

**Results of the model**<br>
As seen above, the simple RNN model trained to an accuracy of 38%. This means that the model does not remember the words in the sequences well enough to accurately predict the next word. For improved performance, an LSTM will be trained below.

It is worth noting that, as seen above, at 70 epochs the loss was still improving and therefore the model could probably improve further with more iterations. However, for the purpose of this assignment the accuracy and loss will be compared with those of an LSTM at the same number of epochs.

Testing how well it can predict

**4.2 Training an LSTM**<br>
This model will be trained with ReLU and softmax as activation functions.<br>
To prevent the model from overfitting, early stopping and dropout will be used.<br>
The 'Adam' optimiser is used to update the weights from the backpropagated gradients.<br>
To tune the model and reduce the loss, a learning rate hyperparameter of 0.001 will be used.
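
To give a feel for what the learning rate controls, here is a minimal sketch of the standard Adam update applied to a one-variable toy problem, minimising (x - 3)^2. The function `adam_minimise`, the toy objective and the comparison value 0.01 are assumptions for illustration; the beta and epsilon values are the commonly used defaults.

```python
import math


def adam_minimise(grad_fn, x0, learning_rate=0.001, steps=1000,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Run `steps` Adam updates on a single parameter and return it."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of squares
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x


def grad(x):
    """Gradient of the toy objective (x - 3)^2."""
    return 2.0 * (x - 3.0)


slow = adam_minimise(grad, x0=0.0, learning_rate=0.001)
fast = adam_minimise(grad, x0=0.0, learning_rate=0.01)

print("after 1000 steps with learning rate 0.001:", round(slow, 3))
print("after 1000 steps with learning rate 0.01: ", round(fast, 3))
```

With the smaller learning rate each update moves the parameter by roughly 0.001 at most, so 1000 steps cover only part of the distance to the minimum at 3; the larger rate gets much closer in the same number of steps.
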

In [28]:
from tensorflow.keras.optimizers import Adam

In [29]:
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint("next_words.h5", monitor='loss', verbose=1, save_best_only=True)

In [30]:
early_stop=EarlyStopping(monitor='val_loss',patience=3)

In [31]:
model = Sequential()
model.add(Embedding(vocab_size, 20, input_length=30))
model.add(LSTM(128, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(128, activation="relu"))
model.add(Dense(vocab_size, activation="softmax"))

model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])

In [32]:
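# checkpoint is passed as well so that the best model is saved to next_words.h5, which is loaded in section 5
model.fit(X, y, epochs=70, batch_size=30, callbacks=[early_stop, checkpoint])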

Epoch 1/70
...
Epoch 70/70


<keras.src.callbacks.History at 0x7f52347ccac0>

*As seen above, the LSTM model did not improve on the simple RNN, even though early stopping and dropout were applied and the learning rate of the optimiser was set explicitly.
This could be due to the data being too noisy for the model to learn the sequences; the pre-processing stage could be improved to clean the data further. It is also worth noting that early stopping monitors `val_loss` while no validation data is passed to `fit`, so the callback cannot trigger and all 70 epochs run.*

**5. Testing Prediction** <br>
The code below loads the saved model and tokenizer and uses them to predict the next word for sentences provided by the user.
Where the model cannot predict a word, the error caught in the code below is printed.

In [33]:
from tensorflow.keras.models import load_model

# Load the model and tokenizer
model = load_model('next_words.h5')
with open('token.pkl', 'rb') as f:
    tokenizer = pickle.load(f)

def Predict_Next_Words(model, tokenizer, text):
    # Convert the words to their numerical ids
    sequence = tokenizer.texts_to_sequences([text])
    sequence = np.array(sequence)
    # Index of the most probable next word
    preds = int(np.argmax(model.predict(sequence)))
    # Look up the word for that index (empty string if there is none)
    predicted_word = tokenizer.index_word.get(preds, "")

    print(predicted_word)
    return predicted_word

In [34]:
while(True):
  text = input("Enter your line: ")
   
  if text == "0":
      print("Execution completed.....")
      break
   
  else:
      try:
          text = text.split(" ")
          text = text[-30:]
          print(text)
         
          Predict_Next_Words(model, tokenizer, text)
           
      except Exception as e:
        print("Error occurred: ",e)
        continue
      

    

['', '', '', 'setting', 'foot', 'on', 'land,', 'I', 'could', 'get', 'up', 'no', 'spirit', 'to', 'write', 'or', 'think.', 'I', '', '', '', '', '', 'have', 'worn', 'the', 'old', 'velvet-trimmed', 'black', 'silk']
Error occurred:  in user code:

    File "/home/lab_services_student/anaconda3/lib/python3.9/site-packages/keras/src/engine/training.py", line 2341, in predict_function  *
        return step_function(self, iterator)
    File "/home/lab_services_student/anaconda3/lib/python3.9/site-packages/keras/src/engine/training.py", line 2327, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/lab_services_student/anaconda3/lib/python3.9/site-packages/keras/src/engine/training.py", line 2315, in run_step  **
        outputs = model.predict_step(data)
    File "/home/lab_services_student/anaconda3/lib/python3.9/site-packages/keras/src/engine/training.py", line 2283, in predict_step
        return self(x, training=False)
    File "/hom

**6. Evaluation**<br>
The model was tested with 3 sentences: 2 from the dataset and a 3rd new sequence that was not part of the dataset. None of these were predicted successfully; in each case the errors shown above were raised when the model was given a sequence. Poor training (low accuracy) limits the quality of any prediction, but because an exception is raised before a word is returned, the failure also points to a problem in how the input sequence is prepared for the model.
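
The traceback above is truncated, so the exact cause is not visible. One possible cause, assuming the model expects exactly 30 token ids per input, is that the typed text yields fewer ids once empty strings and words with attached punctuation are dropped. The sketch below shows one way the input could be prepared; `prepare_input`, the toy word index and the padding id 0 are hypothetical, and since the model was not trained on padded inputs, predictions for short inputs may still be poor.

```python
def prepare_input(text, word_index, window=30, pad_id=0):
    """Turn raw text into exactly `window` token ids.

    Punctuation (except hyphens) is removed, unknown words are skipped,
    only the last `window` ids are kept, and short inputs are left-padded.
    """
    cleaned = "".join(
        ch for ch in text.lower()
        if ch.isalnum() or ch.isspace() or ch == "-"
    )
    ids = [word_index[w] for w in cleaned.split() if w in word_index]
    ids = ids[-window:]
    return [pad_id] * (window - len(ids)) + ids


# Hypothetical word index; in the notebook this would be tokenizer.word_index.
toy_index = {"the": 1, "old": 2, "black": 3, "silk": 4}

short = prepare_input("  have worn the old velvet-trimmed black silk,",
                      toy_index, window=5)
full = prepare_input("the old black silk", toy_index)

print(short)
print(len(full))
```

The resulting list always has the same length, so it can be wrapped in an array of shape (1, 30) before being passed to the model.
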