### Next Word Prediction

About:

The project aims to develop a deep learning model that predicts the next word in a given sequence of words. The model is built using LSTM (Long Short-Term Memory) networks, which are well suited to sequence prediction tasks.

1) Data Collection: we will use Shakespeare's Hamlet as our dataset.

2) Data Preprocessing: The text data is tokenized, converted to sequences, and padded to ensure uniform input lengths. The data is then split into training and test sets.

3) Model Building: an LSTM model is constructed with an embedding layer, two LSTM layers (with a dropout layer between them), and a dense output layer with a softmax activation to predict the probability of the next word.

4) Model Training: the model is trained on the prepared sequences, with early stopping to prevent overfitting. Early stopping monitors a validation metric (the callback below is configured with `val_accuracy`) and stops training when that metric stops improving.

5) Model Evaluation: a set of example inputs is prepared to test the model's ability to predict the next word in a sequence.

6) Deployment: A Streamlit web application is developed so that users can enter a sequence of words and get the predicted next word in real time.
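
The early-stopping rule from step 4 can be sketched in plain Python. This is only an illustration of the patience logic, not the Keras implementation. The function name `early_stopping_epoch` and the sample scores are made up for the example, and it assumes a higher-is-better metric such as validation accuracy (for a loss, the comparison flips).

```python
def early_stopping_epoch(val_scores, patience=5, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based, for a higher-is-better metric.

    Hypothetical helper: it mimics the idea of patience-based early stopping.
    """
    best = float("-inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + min_delta:
            # improvement: remember it and reset the patience counter
            best, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # patience exhausted, stop here
    return len(val_scores), best_epoch  # ran through every epoch


# made-up validation accuracies, one per epoch
scores = [0.10, 0.20, 0.30, 0.29, 0.30, 0.28]
stop_epoch, best_epoch = early_stopping_epoch(scores, patience=3)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, as in the callback defined later, the weights kept after stopping would be those from `best_epoch` rather than from the final epoch.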

In [1]:
import nltk
import pandas as pd
nltk.download('gutenberg')
from nltk.corpus import gutenberg

data = gutenberg.raw('shakespeare-hamlet.txt')

with open('../data/hamlet.txt','w') as file_obj:
    file_obj.write(data)

[nltk_data] Downloading package gutenberg to C:\Users\Nitin
[nltk_data]     Flavier\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [2]:
# Data Preprocessing

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences 

from sklearn.model_selection import train_test_split




In [3]:
with open('../data/hamlet.txt','r') as file_obj:
    text = file_obj.read().lower()

# creating indexes for words 
tokenizer = Tokenizer() 
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1
print(total_words)
tokenizer.word_index

4818


{'the': 1,
 'and': 2,
 'to': 3,
 'of': 4,
 'i': 5,
 'you': 6,
 'a': 7,
 'my': 8,
 'it': 9,
 'in': 10,
 'that': 11,
 'ham': 12,
 'is': 13,
 'not': 14,
 'his': 15,
 'this': 16,
 'with': 17,
 'your': 18,
 'but': 19,
 'for': 20,
 'me': 21,
 'lord': 22,
 'as': 23,
 'what': 24,
 'he': 25,
 'be': 26,
 'so': 27,
 'him': 28,
 'haue': 29,
 'king': 30,
 'will': 31,
 'no': 32,
 'our': 33,
 'we': 34,
 'on': 35,
 'are': 36,
 'if': 37,
 'all': 38,
 'then': 39,
 'shall': 40,
 'by': 41,
 'thou': 42,
 'come': 43,
 'or': 44,
 'hamlet': 45,
 'good': 46,
 'do': 47,
 'hor': 48,
 'her': 49,
 'let': 50,
 'now': 51,
 'thy': 52,
 'how': 53,
 'more': 54,
 'they': 55,
 'from': 56,
 'enter': 57,
 'at': 58,
 'was': 59,
 'oh': 60,
 'like': 61,
 'most': 62,
 'there': 63,
 'well': 64,
 'know': 65,
 'selfe': 66,
 'would': 67,
 'them': 68,
 'loue': 69,
 'may': 70,
 "'tis": 71,
 'vs': 72,
 'sir': 73,
 'qu': 74,
 'which': 75,
 'did': 76,
 'why': 77,
 'laer': 78,
 'giue': 79,
 'thee': 80,
 'ile': 81,
 'must': 82,
 'hath': 

In [4]:
inputSequences=[]

# for each line we build every prefix of at least two tokens: [0 1], [0 1 2], [0 1 2 3], ...
# the last token of each prefix is the word the model learns to predict

for line in text.split('\n'):
    print(line)
    print(tokenizer.texts_to_sequences([line]))
    token_list = tokenizer.texts_to_sequences([line])[0]
    print(token_list)
    for i in range(1,len(token_list)):
        n_gram_sequence = token_list[:i+1]
        inputSequences.append(n_gram_sequence)
    print(inputSequences)
    break


[the tragedie of hamlet by william shakespeare 1599]
[[1, 687, 4, 45, 41, 1886, 1887, 1888]]
[1, 687, 4, 45, 41, 1886, 1887, 1888]
[[1, 687], [1, 687, 4], [1, 687, 4, 45], [1, 687, 4, 45, 41], [1, 687, 4, 45, 41, 1886], [1, 687, 4, 45, 41, 1886, 1887], [1, 687, 4, 45, 41, 1886, 1887, 1888]]


In [5]:
inputSequences=[]

for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1,len(token_list)):
        n_gram_sequence = token_list[:i+1]
        inputSequences.append(n_gram_sequence)

In [6]:
## Pad Sequences
maxSeqLen = max([len(x) for x in inputSequences])
print(maxSeqLen)

14


In [7]:
print(type(pad_sequences(inputSequences,maxlen=maxSeqLen)))
inputSequences = np.array(pad_sequences(inputSequences,maxlen=maxSeqLen,padding='pre'))
inputSequences

<class 'numpy.ndarray'>


array([[   0,    0,    0, ...,    0,    1,  687],
       [   0,    0,    0, ...,    1,  687,    4],
       [   0,    0,    0, ...,  687,    4,   45],
       ...,
       [   0,    0,    0, ...,    4,   45, 1047],
       [   0,    0,    0, ...,   45, 1047,    4],
       [   0,    0,    0, ..., 1047,    4,  193]])

In [8]:
import tensorflow as tf

X,y = inputSequences[:,:-1],inputSequences[:,-1]
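
To make the padding and the `X`/`y` split concrete, here is a small library-free sketch. `pad_pre` is a hypothetical stand-in for `pad_sequences(..., padding='pre')`, written in plain Python so it runs without TensorFlow. The three sample sequences are the first prefixes printed earlier.

```python
def pad_pre(seq, maxlen, pad_value=0):
    """Left-pad (or left-truncate) seq to exactly maxlen items. Hypothetical helper."""
    seq = list(seq)[-maxlen:]  # keep the most recent tokens if too long
    return [pad_value] * (maxlen - len(seq)) + seq


sequences = [[1, 687], [1, 687, 4], [1, 687, 4, 45]]
max_len = max(len(s) for s in sequences)
padded = [pad_pre(s, max_len) for s in sequences]

# every column except the last is the input; the last column is the target word
features = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

print(padded)
print(features)
print(labels)
```

Padding on the left keeps the most recent words next to the target, which is why the final column can be sliced off as the label.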

In [9]:
# y is converted to a one-hot (categorical) variable.
# Why?
#
# Each index stands for one word in the vocabulary, and the model's output
# is a probability for every unique word in tokenizer.word_index, so with
# categorical_crossentropy each target is a vector of length total_words.

print(y)
y = tf.keras.utils.to_categorical(y,num_classes=total_words)
print(y)

[ 687    4   45 ... 1047    4  193]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
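
The printed rows look all-zero most likely because the ellipsis hides the middle of each 4818-wide row, which is where the single 1 sits. A tiny sketch of the same encoding, using a hypothetical `one_hot` helper and a made-up vocabulary of 6 classes, makes the structure visible:

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position index. Hypothetical helper."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec


labels = [3, 1, 4]   # made-up word indexes
num_classes = 6      # made-up vocabulary size
encoded = [one_hot(i, num_classes) for i in labels]

for row in encoded:
    print(row)
```

Each row sums to 1, and the position of the 1 recovers the original word index.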


In [10]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.33,random_state=32)

### Explanation of the Parameters:

1) total_words: 
- Represents the vocabulary size: the number of unique words in the text data plus one, because index 0 is reserved for padding.
- Each word index in the input sequence is mapped to a corresponding vector of fixed size.

2) 100:
- This is the embedding dimension, i.e., the size of the vector representation of each word.
- Each word in the vocabulary is represented as a vector of 100 numbers in the embedding space.
- It is the number of features per word and determines how much information the vector representation can encode.

3) input_length=maxSeqLen-1:
- Refers to the length of the input sequences fed to the model.
- Every padded sequence has maxSeqLen tokens; the last one is the target word, so the model input is the first maxSeqLen-1 tokens.
- For instance, with maxSeqLen = 14 as computed above, each input has 13 tokens.


### What Does the Embedding Layer Do?  

- The embedding layer converts integer-encoded words (word indices) into dense vectors of fixed size (100 in this case).  
- It initializes the word embeddings randomly and updates them during training, learning a meaningful representation of words in the process.  
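
Conceptually, the embedding layer is a lookup table with one row per word index. The sketch below shows only that lookup, in plain Python. `make_embedding_table` and `embed` are hypothetical names, the sizes are made up, and the initial value range is an illustrative choice. Training, which would adjust the rows, is left out.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Build a vocab_size x dim table of small random numbers. Hypothetical helper."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(table, token_ids):
    """Replace each integer token id with its row from the table."""
    return [table[i] for i in token_ids]


table = make_embedding_table(vocab_size=10, dim=4)
vectors = embed(table, [0, 0, 1, 7])  # a padded sequence of 4 token ids

print(len(vectors), len(vectors[0]))  # sequence length, embedding dimension
```

The same index always maps to the same row, so the two padding tokens above share one vector.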

### Dropout Layer

- The Dropout layer is applied to the outputs of the first LSTM layer.

- After the first LSTM layer produces its outputs (a sequence of vectors because return_sequences=True), the Dropout layer randomly sets 20% (0.2) of those outputs to zero during training to prevent overfitting.

- LSTM itself has built-in arguments for dropout within its gates and recurrent connections:  
    1) dropout: Applies dropout to the input connections of the LSTM.  
    2) recurrent_dropout: Applies dropout to the recurrent connections (connections between time steps).  

    model.add(LSTM(150, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))  
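
As a rough illustration of what a dropout layer does to a vector of activations, here is a plain-Python sketch of inverted dropout. The helper name `dropout` and the sample values are made up. Each element is dropped independently with probability `rate`, so "20%" holds on average rather than exactly. Kept values are scaled by 1 / (1 - rate), which is the behaviour the Keras documentation describes for its Dropout layer.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout on a list of floats. Hypothetical helper, not the Keras code."""
    if not training or rate == 0.0:
        return list(values)  # at inference time the layer does nothing
    keep = 1.0 - rate
    # zero each element with probability `rate`, scale the survivors by 1/keep
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(42)
activations = [1.0] * 10
train_out = dropout(activations, rate=0.2, training=True, rng=rng)
infer_out = dropout(activations, rate=0.2, training=False, rng=rng)

print(train_out)
print(infer_out)
```

The scaling keeps the expected sum of the activations unchanged between training and inference.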

In [16]:
# Building our LSTM RNN (training happens in the next cell)
from tensorflow.keras.models import Sequential 
from tensorflow.keras.layers import Embedding,LSTM,Dense,Dropout
from tensorflow.keras.callbacks import EarlyStopping

earlyStoppingCallback = EarlyStopping(monitor='val_accuracy',patience=5,min_delta=0.00001,restore_best_weights=True)

model = Sequential() 
model.add(Embedding(total_words,100,input_length=maxSeqLen-1)) # as the last word is to be predicted
model.add(LSTM(150,return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dense(total_words,activation='softmax'))

# set the optimizer and the loss function
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 13, 100)           481800    
                                                                 
 lstm_4 (LSTM)               (None, 13, 150)           150600    
                                                                 
 dropout_2 (Dropout)         (None, 13, 150)           0         
                                                                 
 lstm_5 (LSTM)               (None, 100)               100400    
                                                                 
 dense_2 (Dense)             (None, 4818)              486618    
                                                                 
Total params: 1219418 (4.65 MB)
Trainable params: 1219418 (4.65 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
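
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are made up for the example, and the layer sizes are the ones from the model definition above.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary index."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four weight blocks (input, forget, and output gates plus the candidate cell state),
    each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


vocab_size = 4818
counts = {
    "embedding": embedding_params(vocab_size, 100),
    "lstm_1": lstm_params(100, 150),
    "lstm_2": lstm_params(150, 100),
    "dense": dense_params(100, vocab_size),
}
total = sum(counts.values())
print(counts, total)
```

These values match the `Param #` column and the total of 1219418 reported by `model.summary()`.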


In [18]:
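# train the model
# note: earlyStoppingCallback (defined above) is not passed to fit() here, so early stopping
# is inactive in this call; add callbacks=[earlyStoppingCallback] to enable it
history = model.fit(X_train,y_train,epochs=150,validation_data=(X_test,y_test),verbose=1)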

Epoch 1/150
...
Epoch 77/150
Epoch 78

### About the Model Output:

1) Model Output Shape:

    - The model's final layer is Dense(total_words, activation='softmax').  
    - This means the model predicts a probability distribution over total_words (the size of the vocabulary).  
    - For each input sequence (maxSeqLen-1 padded word indexes), the model returns one vector of probabilities (length: total_words).  

2) Shape of y_pred:

    If token_list contains 1 sequence, the shape of the output is (1, total_words):
    - 1: Number of sequences passed to model.predict (the batch size), not the number of words in the sequence.
    - total_words: Vocabulary size (number of classes).
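
To illustrate how a softmax output row turns into a predicted word index, here is a minimal plain-Python sketch. The logits are made-up numbers for a 4-word vocabulary, and `softmax` and `argmax` are small helpers written for this example, not Keras or NumPy functions.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [0.5, 2.0, -1.0, 0.1]   # made-up scores, one per vocabulary entry
probs = softmax(logits)
batch = [probs]                  # shape (1, vocab_size): one row per input sequence
predicted_index = argmax(batch[0])

print(probs)
print(predicted_index)
```

The predicted word is then the vocabulary entry whose index equals `predicted_index`.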

In [34]:
def predict_next_word(model, tokenizer, text, maxSeqLen):
    token_list = tokenizer.texts_to_sequences([text])[0]  # converts text to vocabulary indexes
    # keep at most the last maxSeqLen-1 words, then left-pad to a (1, maxSeqLen-1) array
    # so that long inputs are also passed to the model as a 2D batch
    token_list = token_list[-(maxSeqLen-1):]
    token_list = pad_sequences([token_list], maxlen=maxSeqLen-1, padding='pre')

    y_pred = model.predict(token_list, verbose=0)  # 2D array of shape (1, vocabulary size)
    print("the predicted values: ", y_pred)
    print("Length of values: ", len(y_pred[0]))

    # index of the most probable word for the single sequence in the batch
    predicted_word_index = int(np.argmax(y_pred, axis=1)[0])

    # index_word is the reverse of word_index; index 0 (padding) has no word, so None is returned
    return tokenizer.index_word.get(predicted_word_index)

### What are sequences ?

A sequence refers to a list (or array) of numbers, where:  
1) Each number corresponds to a word or token in your vocabulary.
2) A sequence typically represents a part of your input text (e.g., a sentence, phrase, or a fixed-length chunk of text).
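
As a rough illustration of how text becomes a sequence, the sketch below builds a frequency-ranked word index and maps a phrase to integers. It is a simplified stand-in for the Keras `Tokenizer`: it only lowercases and splits on whitespace, and `build_word_index` and `text_to_sequence` are hypothetical names. Index 0 is left unused, as it is reserved for padding.

```python
from collections import Counter


def build_word_index(text):
    """Map each word to an index starting at 1, most frequent word first."""
    counts = Counter(text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def text_to_sequence(text, word_index):
    """Convert text to a list of word indexes, skipping words not in the vocabulary."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


corpus = "to be or not to be that is the question"
word_index = build_word_index(corpus)
sequence = text_to_sequence("to be or not", word_index)

print(word_index)
print(sequence)
```

Words that never appeared in the corpus are dropped in this sketch, so an input made only of unknown words yields an empty sequence.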

In [39]:
input_text = "I am going to"
# input_text = "I haue seene"
# input_text = "I thinke I heare"
print(f"Input text: {input_text}")

max_sequence_len = model.input_shape[1]+1
print(max_sequence_len)
next_word = predict_next_word(model,tokenizer,input_text.lower(),max_sequence_len)
print(next_word)

Input text: I am going to
14
the predicted values:  [[5.4742877e-10 2.0345326e-03 1.1209423e-04 ... 6.9300030e-25
  5.8657768e-10 5.2906424e-10]]
Length of values:  4818
his


In [40]:
import pickle

# save the model 
model.save('../pickle_files/next_word_lstm.h5')

# save the tokenizer
with open('../pickle_files/tokenizer.pkl','wb') as file_obj:
    pickle.dump(tokenizer,file_obj,protocol=pickle.HIGHEST_PROTOCOL)

