## Project Description: Next Word Prediction Using LSTM
#### Project Overview:

This project aims to develop a deep learning model for predicting the next word in a given sequence of words. The model is built using Long Short-Term Memory (LSTM) networks, which are well-suited for sequence prediction tasks. The project includes the following steps:

1- Data Collection: We use the text of Shakespeare's "Hamlet" as our dataset. This rich, complex text provides a good challenge for our model.

2- Data Preprocessing: The text is lowercased, stop words are removed, and the remaining words are tokenized. Each line is then expanded into n-gram prefix sequences, which are pre-padded to a uniform length and split into training and testing sets.

3- Model Building: An LSTM model is constructed with an embedding layer, two LSTM layers (each followed by batch normalization and dropout), and a dense output layer with a softmax activation that outputs a probability for every word in the vocabulary.

4- Model Training: The model is trained on the prepared sequences, with early stopping used to limit overfitting. Early stopping monitors the validation loss, stops training once it has not improved for 10 consecutive epochs, and restores the weights from the best epoch.

5- Model Evaluation: The model is evaluated using a set of example sentences to test its ability to predict the next word accurately.

6- Deployment: A Streamlit web application is developed to allow users to input a sequence of words and get the predicted next word in real time.

# Data Collection

In [1]:
import nltk
nltk.download("gutenberg")
nltk.download("stopwords")  # required by stopwords.words('english') used below
from nltk.corpus import gutenberg, stopwords
import pandas as pd

[nltk_data] Downloading package gutenberg to C:\Users\Ashutosh
[nltk_data]     Choudhari\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


## Load the dataset and save it to a .txt file

In [2]:
data = gutenberg.raw('shakespeare-hamlet.txt')

with open('hamlet.txt', "w") as file:
    file.write(data)

# Data preprocessing

In [3]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

## Load the dataset and remove stop words

In [4]:
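# NLTK's English stop-word list (all entries are lowercase)
stop_words = set(stopwords.words('english'))

with open('hamlet.txt', 'r') as file:
    text = file.read().lower()

# Split on whitespace and drop exact stop-word matches. Tokens with attached
# punctuation (e.g. "the,") do not match and therefore stay in the text.
words = text.split()
filtered_words = [word for word in words if word not in stop_words]
filtered_text = " ".join(filtered_words)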
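
To make the filtering step concrete, here is a minimal sketch that uses a small hand-picked stop-word set (an assumption for illustration; the notebook uses NLTK's full English list). It shows that splitting on whitespace leaves punctuation attached, so a token such as `be,` is not removed by the filter.

```python
# Hypothetical mini stop-word set; the notebook uses stopwords.words('english').
DEMO_STOP_WORDS = {"the", "to", "be", "or", "not", "is", "that"}

def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop exact stop-word matches."""
    words = text.lower().split()
    return " ".join(word for word in words if word not in stop_words)

sample = "To be, or not to be, that is the Question:"
filtered = remove_stop_words(sample, DEMO_STOP_WORDS)
print(filtered)
```

Because the Keras `Tokenizer` strips punctuation afterwards, some stop words that survived this way can still end up in the vocabulary.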

## Tokenize the text

#### Creating indexes for the words

In [5]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([filtered_text])

total_words = len(tokenizer.word_index) + 1
total_words

4789
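
The `+ 1` in `total_words` is there because index 0 is reserved for padding and never assigned to a word. The sketch below mimics the idea of a frequency-ranked word index in plain Python; it is a simplification (it does not strip punctuation as the Keras `Tokenizer` does), and `build_word_index` is a hypothetical helper, not part of Keras.

```python
from collections import Counter

def build_word_index(text):
    """Rank words by descending frequency; indices start at 1 (0 is padding)."""
    counts = Counter(text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

word_index = build_word_index("good night sweet prince good night")
vocab_size = len(word_index) + 1  # one extra slot for the padding index 0
print(word_index)
print(vocab_size)
```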

In [6]:
print(tokenizer.word_index)



## Create input sequences

In [7]:
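input_sequences = []

# Iterate over the lines of the lowercased text. The tokenizer was fitted on the
# filtered text without an OOV token, so words outside its vocabulary (mainly
# the removed stop words) are silently skipped by texts_to_sequences.
for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    # Build every prefix of length >= 2; the last token of each prefix is the label
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

input_sequences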

[[1847, 639],
 [1847, 639, 519],
 [1847, 639, 519, 8],
 [1847, 639, 519, 8, 1683],
 [1847, 639, 519, 8, 1683, 1848],
 [1847, 639, 519, 8, 1683, 1848, 1849],
 [1847, 639, 519, 8, 1683, 1848, 1849, 1850],
 [1135, 1851],
 [1135, 1851, 1852],
 [1135, 1851, 1852, 1853],
 [13, 349],
 [13, 349, 1161],
 [13, 349, 1161, 1136],
 [13, 349, 1161, 1136, 113],
 [13, 349, 1161, 1136, 113, 1854],
 [349, 1137],
 [349, 1137, 169],
 [350, 97],
 [350, 97, 318],
 [350, 97, 318, 24],
 [350, 97, 318, 24, 182],
 [350, 97, 318, 24, 182, 836],
 [395, 159],
 [395, 159, 183],
 [395, 159, 183, 1847],
 [395, 159, 183, 1847, 4],
 [350, 349],
 [395, 424],
 [350, 16],
 [350, 16, 7],
 [350, 16, 7, 1855],
 [350, 16, 7, 1855, 40],
 [350, 16, 7, 1855, 40, 515],
 [395, 25],
 [395, 25, 106],
 [395, 25, 106, 1856],
 [395, 25, 106, 1856, 516],
 [395, 25, 106, 1856, 516, 319],
 [395, 25, 106, 1856, 516, 319, 32],
 [395, 25, 106, 1856, 516, 319, 32, 211],
 [395, 25, 106, 1856, 516, 319, 32, 211, 1136],
 [350, 199],
 [350, 199, 

In [8]:
len(input_sequences)

24018
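
The prefix expansion can be sketched on its own, using the first four token ids from the output above. A line with n tokens yields n - 1 training sequences, which is how the text grows into 24018 samples. `prefix_sequences` is a hypothetical helper name used only for this illustration.

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2: ids[:2], ids[:3], ..., ids[:n]."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]

line_tokens = [1847, 639, 519, 8]
sequences = prefix_sequences(line_tokens)
for seq in sequences:
    print(seq)
```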

## Pad Sequences

In [9]:
max_sequence_len = max([len(x) for x in input_sequences])
max_sequence_len

13

In [10]:
# pad_sequences already returns a NumPy array, so no extra np.array() wrapper is needed
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

input_sequences

array([[   0,    0,    0, ...,    0, 1847,  639],
       [   0,    0,    0, ..., 1847,  639,  519],
       [   0,    0,    0, ...,  639,  519,    8],
       ...,
       [   0,    0,    0, ...,  519,    8, 1002],
       [   0,    0,    0, ...,    8, 1002,  519],
       [   0,    0,    0, ..., 1002,  519,  130]])
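
As a rough illustration of what `padding='pre'` does, the following pure-Python sketch left-pads each sequence with zeros and, if a sequence is too long, keeps only its last `maxlen` items. `pad_pre` is a hypothetical stand-in written for this example, not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop items from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

demo = pad_pre([[1847, 639], [1847, 639, 519]], maxlen=5)
for row in demo:
    print(row)
```

Padding on the left keeps the most recent words, and therefore the label, at the right edge of every row.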

## Create predictors and labels

In [11]:
import tensorflow as tf

# All columns except the last are the context; the last column is the word to predict
X, y = input_sequences[:, :-1], input_sequences[:, -1]

In [12]:
print(X.shape)
X

(24018, 12)


array([[   0,    0,    0, ...,    0,    0, 1847],
       [   0,    0,    0, ...,    0, 1847,  639],
       [   0,    0,    0, ..., 1847,  639,  519],
       ...,
       [   0,    0,    0, ...,  639,  519,    8],
       [   0,    0,    0, ...,  519,    8, 1002],
       [   0,    0,    0, ...,    8, 1002,  519]])

In [13]:
print(y.shape)
y

(24018,)


array([ 639,  519,    8, ..., 1002,  519,  130])
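
The slicing above can be illustrated on a tiny padded matrix in plain Python (the values are shortened rows in the style of the output above, with a length of 4 assumed for readability). Each row's last element becomes the label and the rest becomes the predictor.

```python
padded = [
    [0, 0, 1847, 639],
    [0, 1847, 639, 519],
    [1847, 639, 519, 8],
]

X_demo = [row[:-1] for row in padded]  # context tokens
y_demo = [row[-1] for row in padded]   # next-word label

print(X_demo)
print(y_demo)
```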

In [14]:
y = tf.keras.utils.to_categorical(y,num_classes=total_words)

In [15]:
print(y.shape)
y

(24018, 4789)


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
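
One-hot encoding turns each integer label into a row with a single 1 at the label's index. The sketch below is a plain-Python illustration (`one_hot` is a hypothetical helper, not `to_categorical` itself) and also computes how many entries the full label matrix shown above contains.

```python
def one_hot(labels, num_classes):
    """Return one row per label with 1.0 at the label index and 0.0 elsewhere."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

y_small = one_hot([2, 0, 3], num_classes=5)
for row in y_small:
    print(row)

# Number of entries in the (24018, 4789) label matrix
cells = 24018 * 4789
print(cells)
```

The matrix is almost entirely zeros. If memory becomes a concern, Keras also offers the `sparse_categorical_crossentropy` loss, which accepts integer labels directly; the notebook does not use it.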

## Split the data into training and testing sets

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [17]:
print(f"X_train.shape : {X_train.shape} | X_test.shape : {X_test.shape} | y_train.shape : {y_train.shape} | y_test.shape : {y_test.shape} | ")

X_train.shape : (16812, 12) | X_test.shape : (7206, 12) | y_train.shape : (16812, 4789) | y_test.shape : (7206, 4789) | 
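
The split sizes follow from simple arithmetic: the test share is rounded up and the remainder goes to training. The sketch below reproduces the printed shapes and also derives the number of batches per epoch, assuming the Keras default `batch_size` of 32 since `fit()` is later called without one.

```python
import math

n_samples = 24018
test_size = 0.3
batch_size = 32  # assumption: Keras default, not set explicitly in the notebook

n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_test, steps_per_epoch)
```

The resulting step count agrees with the `526/526` progress counter in the training log.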


# Train our LSTM RNN

## Importing necessary libraries

In [18]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam

## Defining the model

In [19]:
model = Sequential()
model.add(Embedding(input_dim=total_words, output_dim=200))
model.add(LSTM(150, return_sequences=True))
model.add(BatchNormalization())
model.add(Dropout(rate=0.5, seed=42))
model.add(LSTM(100))
model.add(BatchNormalization())
model.add(Dropout(rate=0.5, seed=42))
model.add(Dense(total_words, activation='softmax'))

## Compiling the model

In [20]:
optimizer = Adam()
model.compile(loss = "categorical_crossentropy", optimizer = optimizer, metrics = ["accuracy"])
model.summary()
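
The summary output is not shown here, so the following is a rough parameter count derived from the standard formulas for each layer type. It assumes default Keras settings (biases enabled, batch normalization with scale and center) and a 12-token input; the helper functions are hypothetical and written only for this estimate.

```python
def embedding_params(vocab, dim):
    return vocab * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def batchnorm_params(features):
    # gamma and beta (trainable) plus moving mean and variance (non-trainable)
    return 4 * features

def dense_params(input_dim, units):
    return input_dim * units + units

vocab_size = 4789
layer_params = {
    "embedding": embedding_params(vocab_size, 200),
    "lstm_1": lstm_params(200, 150),
    "batch_norm_1": batchnorm_params(150),
    "lstm_2": lstm_params(150, 100),
    "batch_norm_2": batchnorm_params(100),
    "dense": dense_params(100, vocab_size),
}
total_params = sum(layer_params.values())

for name, count in layer_params.items():
    print(f"{name:>12}: {count:,}")
print(f"{'total':>12}: {total_params:,}")
```

Under these assumptions, the embedding and the output layer account for most of the parameters because both scale with the vocabulary size.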

## Creating the early stopping callback

In [21]:
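from datetime import datetime
from tensorflow.keras.callbacks import EarlyStopping, TensorBoard

# Stop once val_loss has not improved by at least min_delta for 10 epochs,
# then roll back to the best weights seen so far
early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True, min_delta=0.00001)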

## Set up TensorBoard

In [22]:
log_dir = "logs/fit/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = TensorBoard(log_dir = log_dir, histogram_freq = 1)

## Train the model

In [23]:
history = model.fit(
    X_train,
    y_train,
    epochs=100,
    # The test split doubles as the validation set, so early stopping is tuned on it
    validation_data=(X_test, y_test),
    callbacks=[early_stopping, tensorboard_callback],
    verbose=1
)

Epoch 1/100
526/526 ━━━━━━━━━━━━━━━━━━━━ 20s 29ms/step - accuracy: 0.0181 - loss: 8.0731 - val_accuracy: 0.0401 - val_loss: 6.8937
Epoch 2/100
526/526 ━━━━━━━━━━━━━━━━━━━━ 14s 27ms/step - accuracy: 0.0473 - loss: 6.7231 - val_accuracy: 0.0550 - val_loss: 6.8464
Epoch 3/100
526/526 ━━━━━━━━━━━━━━━━━━━━ 14s 27ms/step - accuracy: 0.0595 - loss: 6.3879 - val_accuracy: 0.0602 - val_loss: 6.8425
Epoch 4/100
526/526 ━━━━━━━━━━━━━━━━━━━━ 18s 35ms/step - accuracy: 0.0725 - loss: 6.1513 - val_accuracy: 0.0663 - val_loss: 6.9038
Epoch 5/100
526/526 ━━━━━━━━━━━━━━━━━━━━ 16s 29ms/step - accuracy: 0.0807 - loss: 5.9282 - val_accuracy: 0.0724 - val_loss: 6.9465
Epoch 6/100
526/526 ━━━━━━━━━━━━━━━━━━━━ 14s 27ms/step - accuracy: 0.0943 - loss: 5.7375 - val_accuracy: 0.0701 - val_loss: 7.0129
Epoch 7/10
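
The validation losses logged above can be replayed through a simplified version of the patience rule to see how early stopping would treat them. This is a sketch of the general idea, not the Keras source; `early_stopping_epoch` is a hypothetical helper, and the log is cut off, so the epoch at which training actually stopped is not known from this output.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None if patience never runs out within the given losses.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement: remember it and reset the patience counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

logged = [6.8937, 6.8464, 6.8425, 6.9038, 6.9465, 7.0129]
stop_epoch, best_epoch = early_stopping_epoch(logged, patience=10, min_delta=0.00001)
print(stop_epoch, best_epoch)
```

With `patience=10`, the six logged epochs are not enough to trigger a stop, but the best validation loss so far is at epoch 3, while the training loss keeps falling. If that trend continued, `restore_best_weights=True` would roll the model back to the epoch 3 weights.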

## Prediction

### Function to predict the next word

In [24]:
def predict_next_word(model, tokenizer, text, max_sequence_len):
    # Mirror the training preprocessing: lowercase, then drop stop words
    stop_words = set(stopwords.words('english'))
    words = text.lower().split()
    filtered_words = [word for word in words if word not in stop_words]
    filtered_text = " ".join(filtered_words)
    token_list = tokenizer.texts_to_sequences([filtered_text])[0]
    # The model takes max_sequence_len - 1 tokens; keep only the most recent ones
    if len(token_list) >= max_sequence_len:
        token_list = token_list[-(max_sequence_len - 1):]
    # Pad to the model's input length (the training inputs X have max_sequence_len - 1 columns)
    token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
    prediction = model.predict(token_list, verbose=0)
    predicted_word_index = int(np.argmax(prediction, axis=1)[0])
    # index_word is the reverse of word_index; index 0 (padding) has no entry
    return tokenizer.index_word.get(predicted_word_index)

In [25]:
input_text = "to be or not to be"
print(f"Input Text: {input_text}")
max_sequence_len = model.input_shape[1] + 1
next_word = predict_next_word(model=model, tokenizer= tokenizer, text = input_text, max_sequence_len=max_sequence_len)
print(f"Next word prediction: {next_word}")

Input Text: to be or not to be
Next word prediction: i
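
This result is worth a closer look: every word in "to be or not to be" is in NLTK's English stop-word list, so the filter inside `predict_next_word` removes the whole input. The sketch below reproduces that with a hand-copied subset of the list (an assumption for illustration) and shows that the model then receives only padding.

```python
# Hand-copied subset of NLTK's English stop words that covers this query.
DEMO_STOP_WORDS = {"to", "be", "or", "not"}

def preprocess_query(text, stop_words):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_words]

remaining_words = preprocess_query("to be or not to be", DEMO_STOP_WORDS)

# With no words left there are no token ids, so pre-padding to the
# model's 12 input positions yields a vector of zeros.
token_ids = []
model_input = [0] * (12 - len(token_ids)) + token_ids

print(remaining_words)
print(model_input)
```

The predicted word for this query therefore appears to be the model's guess for an empty context rather than a continuation of the phrase.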


In [26]:
input_text = "Giue you good night"
print(f"Input Text: {input_text}")
max_sequence_len = model.input_shape[1] + 1
next_word = predict_next_word(model=model, tokenizer= tokenizer, text = input_text, max_sequence_len=max_sequence_len)
print(f"Next word prediction: {next_word}")

Input Text: Giue you good night
Next word prediction: the


## Save the model and tokenizer

In [27]:
import pickle

model.save("next_word_lstm.h5")

with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
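
For the deployment step, the saved artifacts have to be loaded again. The sketch below shows the pickle round trip with a plain dictionary standing in for the fitted tokenizer (an assumption, so the example runs without TensorFlow); the same `pickle.load` call applies to `tokenizer.pickle`.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the fitted tokenizer; any picklable object round-trips the same way.
demo_tokenizer_state = {"word_index": {"good": 1, "night": 2}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tokenizer.pickle"
    with open(path, "wb") as handle:
        pickle.dump(demo_tokenizer_state, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored)
```

The model file can be reloaded in a similar way with `tensorflow.keras.models.load_model("next_word_lstm.h5")`. Only unpickle files from a trusted source, since `pickle.load` can execute arbitrary code.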



## Load the TensorBoard extension

In [28]:
%load_ext tensorboard

In [30]:
%tensorboard --logdir logs/fit

Reusing TensorBoard on port 6006 (pid 17840), started 0:01:32 ago. (Use '!kill 17840' to kill it.)