## Project Description: Next Word Prediction Using LSTM
#### Project Overview:

This project aims to develop a deep learning model for predicting the next word in a given sequence of words. The model is built using Long Short-Term Memory (LSTM) networks, which are well-suited for sequence prediction tasks. The project includes the following steps:

1- Data Collection: We use the text of Shakespeare's "Hamlet" as our dataset. This rich, complex text provides a good challenge for our model.

2- Data Preprocessing: The text data is tokenized, converted into sequences, and padded to ensure uniform input lengths. The sequences are then split into training and testing sets.

3- Model Building: An LSTM model is constructed with an embedding layer, two LSTM layers, and a dense output layer with a softmax activation function to predict the probability of the next word.

4- Model Training: The model is trained using the prepared sequences, with early stopping implemented to prevent overfitting. Early stopping monitors the validation loss and stops training when the loss stops improving.

5- Model Evaluation: The model is evaluated using a set of example sentences to test its ability to predict the next word accurately.

6- Deployment: A Streamlit web application is developed to allow users to input a sequence of words and get the predicted next word in real time.

## Data Collection 

In [2]:
import nltk
from nltk.corpus import gutenberg

# Download the Gutenberg corpus if it is not already available locally
nltk.download('gutenberg')

# Load Data
data = gutenberg.raw('shakespeare-hamlet.txt')
with open('hamlet.txt','w') as file:
    file.write(data)

## Data Preprocessing

In [1]:
import numpy as np
from tensorflow.keras.preprocessing.text  import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split as tts




In [2]:
with open("hamlet.txt",'r') as file:
    text = file.read().lower()

In [3]:
# Creating indexes for words
 
tokenizer = Tokenizer() 
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index)+1
total_words   

4818
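
The `+1` in `total_words` is there because the Keras `Tokenizer` never assigns index 0 to a word; word indices start at 1 and follow descending frequency, and 0 is left free for padding. The idea can be sketched in plain Python. This is a simplified illustration, not the library's implementation: `build_word_index` is a hypothetical helper and it splits on whitespace only, whereas the real `Tokenizer` also filters punctuation.

```python
from collections import Counter

def build_word_index(text):
    """Map each word to an integer index, most frequent word first, starting at 1."""
    counts = Counter(text.lower().split())
    # most_common() sorts by count descending; ties keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

demo_index = build_word_index("to be or not to be")
demo_total = len(demo_index) + 1  # +1 because index 0 is never assigned to a word
print(demo_index)
print(demo_total)
```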

In [7]:
#tokenizer.word_index

In [9]:
#create input sequences
input_sequences = []
for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1,len(token_list)):
        n_gram_sequence  = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

In [10]:
input_sequences[:5]

[[1, 687],
 [1, 687, 4],
 [1, 687, 4, 45],
 [1, 687, 4, 45, 41],
 [1, 687, 4, 45, 41, 1886]]
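
Each line with n tokens contributes n-1 prefix sequences, and in every prefix the last token is the word to predict from the tokens before it. A minimal sketch of that expansion, using the first few token ids from the output above (`prefix_sequences` is a hypothetical helper name):

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

demo_tokens = [1, 687, 4, 45]  # token ids taken from the first output rows above
for seq in prefix_sequences(demo_tokens):
    # everything but the last token is the context; the last token is the target
    print(seq[:-1], "->", seq[-1])
```

A line with a single token yields no sequences, since there is no context to predict from.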

In [12]:
# pad sequences
max_sequence_length = max([len(x) for x in input_sequences])
max_sequence_length


14

In [13]:
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_length, padding='pre'))
input_sequences[:5]

array([[   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    1,  687],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           1,  687,    4],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    1,
         687,    4,   45],
       [   0,    0,    0,    0,    0,    0,    0,    0,    0,    1,  687,
           4,   45,   41],
       [   0,    0,    0,    0,    0,    0,    0,    0,    1,  687,    4,
          45,   41, 1886]])
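
With `padding='pre'` the zeros go on the left, so the real tokens stay right-aligned and the target word is always the last column. The NumPy sketch below mimics that behaviour for illustration only; `pad_pre` is a hypothetical stand-in, not the Keras function, and it assumes `maxlen` is at least 1.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad integer sequences with zeros; keep only the last maxlen tokens if longer."""
    padded = np.zeros((len(sequences), maxlen), dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]
        if trimmed:
            # write the tokens into the right-hand end of the row
            padded[row, maxlen - len(trimmed):] = trimmed
    return padded

demo_padded = pad_pre([[1, 687], [1, 687, 4]], maxlen=5)
print(demo_padded)
```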

In [11]:
# Create predictors and labels
import tensorflow as tf 
x,y  = input_sequences[:,:-1],input_sequences[:,-1]

In [12]:
x

array([[   0,    0,    0, ...,    0,    0,    1],
       [   0,    0,    0, ...,    0,    1,  687],
       [   0,    0,    0, ...,    1,  687,    4],
       ...,
       [   0,    0,    0, ...,  687,    4,   45],
       [   0,    0,    0, ...,    4,   45, 1047],
       [   0,    0,    0, ...,   45, 1047,    4]])

In [13]:
y

array([ 687,    4,   45, ..., 1047,    4,  193])

In [14]:
y = tf.keras.utils.to_categorical(y,num_classes = total_words)
y

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
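
The printed one-hot matrix looks like all zeros because each row has a single 1 among 4818 columns. A small NumPy sketch of the split-and-encode step, using made-up padded rows and an arbitrary class count of 1000 (`one_hot` is a hypothetical helper that plays the role of `to_categorical` here):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot float32 rows."""
    encoded = np.zeros((len(labels), num_classes), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

demo_sequences = np.array([[0, 0, 1, 687], [0, 1, 687, 4]])
# all columns but the last are the context; the last column is the label
demo_x, demo_y = demo_sequences[:, :-1], demo_sequences[:, -1]
demo_onehot = one_hot(demo_y, num_classes=1000)
print(demo_x)
print(demo_onehot.shape)
print(demo_onehot.argmax(axis=1))
```

An alternative that this notebook does not use is to keep the integer labels and train with the `sparse_categorical_crossentropy` loss, which avoids materializing the large one-hot matrix.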

In [15]:
Xtrain, Xtest, ytrain, ytest = tts(x,y,test_size=0.2,random_state=42)

In [58]:
#Train Our LSTM RNN model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding,LSTM,Dense,Dropout,GRU


In [23]:
# Defining the model
model =Sequential()
model.add(Embedding(total_words,100,input_length = max_sequence_length-1))
model.add(LSTM(150,return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dense(total_words,activation = 'softmax'))  

In [24]:
model.compile(optimizer = 'adam', loss = 'categorical_crossentropy',metrics = ['accuracy'])
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 13, 100)           481800    
                                                                 
 lstm_4 (LSTM)               (None, 13, 150)           150600    
                                                                 
 dropout_1 (Dropout)         (None, 13, 150)           0         
                                                                 
 lstm_5 (LSTM)               (None, 100)               100400    
                                                                 
 dense_1 (Dense)             (None, 4818)              486618    
                                                                 
Total params: 1219418 (4.65 MB)
Trainable params: 1219418 (4.65 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
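
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper names are hypothetical, and the layer sizes are the ones from the model definition above.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gate blocks, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

vocab = 4818
layer_counts = [
    embedding_params(vocab, 100),
    lstm_params(100, 150),
    lstm_params(150, 100),
    dense_params(100, vocab),
]
print(layer_counts, sum(layer_counts))
```

Most of the parameters sit in the embedding and the output layer, both of which scale with the vocabulary size.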


In [25]:
# train the model
history = model.fit(
    Xtrain,ytrain,
    validation_data = (Xtest,ytest),
    epochs = 45,
    verbose = 1,
)
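
The overview describes early stopping on validation loss, while the `fit` call above passes no callback and runs all 45 epochs. In Keras this is usually done with `tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=...)` passed through the `callbacks` argument of `fit`. The plain-Python sketch below only illustrates the patience rule behind it; `early_stopping_epoch` is a hypothetical helper and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never triggers."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # validation loss improved: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

demo_losses = [6.9, 6.4, 6.1, 6.2, 6.3, 6.5, 6.8]  # made-up values for illustration
print(early_stopping_epoch(demo_losses, patience=3))
```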

Epoch 1/45
...
Epoch 45/45


In [26]:
def predict_next_word(model,tokenizer,text,max_sequence_length):
    # Convert the input text to a list of token ids
    token_list = tokenizer.texts_to_sequences([text])[0]

    # Keep only the most recent tokens if the input exceeds the model's input length
    if len(token_list) >= max_sequence_length:
        token_list = token_list[-(max_sequence_length-1):]

    token_list = pad_sequences([token_list],maxlen=max_sequence_length-1,padding='pre')

    predicted = model.predict(token_list,verbose=1)

    # Index of the most probable word for the single input row
    predicted_index = int(np.argmax(predicted,axis=1)[0])
    for word,index in tokenizer.word_index.items():
        if index == predicted_index:
            return word
    return None
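
The function scans the whole vocabulary on every call to turn the predicted index back into a word. One possible alternative is to invert the word index once and look the prediction up directly. The sketch below uses a hypothetical three-word vocabulary and a hand-written probability row in place of `model.predict` output:

```python
import numpy as np

demo_word_index = {"the": 1, "and": 2, "to": 3}  # hypothetical tiny vocabulary
index_to_word = {index: word for word, index in demo_word_index.items()}

# Stand-in for model.predict output: one row of probabilities over 4 classes (class 0 is padding)
demo_probs = np.array([[0.05, 0.10, 0.15, 0.70]])
predicted_index = int(np.argmax(demo_probs, axis=1)[0])

# .get returns None for an index with no word, such as the padding index 0
print(index_to_word.get(predicted_index))
```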

In [30]:
input_text = 'To be or not to be'
print(f'Input text :{input_text}')
max_sequence_length = model.input_shape[1]+1
next_word = predict_next_word(model,tokenizer,input_text,max_sequence_length)
print(f'Predicted next Word is : {next_word}')


Input text :To be or not to be
Predicted next Word is : buried


In [32]:
model.save('Next_word_LSTM.h5')

import pickle
with open('tokenizer.pickle','wb') as handle :
    pickle.dump(tokenizer,handle,protocol = pickle.HIGHEST_PROTOCOL)

# USE OF PROTOCOL
#
# Efficiency: Newer pickle protocols often introduce more efficient ways of serializing data, resulting in smaller file sizes and potentially faster loading/dumping times.
#
# Features: Newer protocols might support the serialization of a wider range of Python objects or handle certain data types more robustly.
#
# Future-proofing (to some extent): By using the highest protocol, you're leveraging the latest advancements in the pickle module.
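
A quick way to see the protocol setting in action is an in-memory round trip. The sketch below pickles a plain dictionary as a stand-in for the fitted tokenizer (an assumption made to keep the example self-contained) and restores it:

```python
import pickle

demo_word_index = {"the": 1, "and": 2, "to": 3}  # stand-in for the fitted tokenizer

# Serialize with the newest protocol this Python version supports, then restore
payload = pickle.dumps(demo_word_index, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

print(restored == demo_word_index)
print(pickle.HIGHEST_PROTOCOL >= pickle.DEFAULT_PROTOCOL)
```

One trade-off to keep in mind: a file written with a newer protocol may not load on an older Python version that predates that protocol.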



In [57]:
input_text = 'That person is '
print(f'Input text :{input_text}')
max_sequence_length = model.input_shape[1]+1
next_word = predict_next_word(model,tokenizer,input_text,max_sequence_length)
print(f'Predicted next Word is : {next_word}')


Input text :That person is 
Predicted next Word is : cunning


In [None]:
# Defining the GRU model
model2 = Sequential()
model2.add(Embedding(total_words,100,input_length = max_sequence_length-1))
model2.add(GRU(150,return_sequences=True))
model2.add(Dropout(0.2))
model2.add(GRU(100))
model2.add(Dense(total_words,activation = 'softmax'))  
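
The GRU variant keeps the same layer sizes but uses three gate blocks per layer instead of four, so its recurrent layers are smaller. A rough comparison is sketched below. The GRU formula assumes two bias vectors per gate block, which corresponds to the Keras `reset_after=True` setting; treat that as an assumption about the installed version, and note that both helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gate blocks, one bias vector each
    return 4 * (units * (input_dim + units) + units)

def gru_params(input_dim, units):
    # 3 gate blocks; assumes two bias vectors per block (Keras reset_after=True)
    return 3 * (units * (input_dim + units) + 2 * units)

lstm_total = lstm_params(100, 150) + lstm_params(150, 100)
gru_total = gru_params(100, 150) + gru_params(150, 100)
print(lstm_total, gru_total)
```

`model2` is only defined here; it would still need to be compiled and fitted in the same way as the LSTM model before it can be compared on predictions.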