## Project Description: Next Word Prediction Using LSTM
#### Project Overview:

This project aims to develop a deep learning model for predicting the next word in a given sequence of words. The model is built using Long Short-Term Memory (LSTM) networks, which are well-suited for sequence prediction tasks. The project includes the following steps:

1- Data Collection: We use the text of Shakespeare's "Hamlet" as our dataset. This rich, complex text provides a good challenge for our model.

2- Data Preprocessing: The text data is tokenized, converted into sequences, and padded to ensure uniform input lengths. The sequences are then split into training and testing sets.

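3- Model Building: A recurrent model is constructed with an embedding layer, two recurrent layers, and a dense output layer with a softmax activation function to predict the probability of the next word. The model trained in this notebook uses GRU layers; a larger bidirectional LSTM variant is defined at the end.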

4- Model Training: The model is trained using the prepared sequences, with early stopping implemented to prevent overfitting. Early stopping monitors the validation loss, stops training once the loss has not improved for three consecutive epochs, and restores the weights from the best epoch.

5- Model Evaluation: The model is evaluated using a set of example sentences to test its ability to predict the next word accurately.

6- Deployment: A Streamlit web application is developed to allow users to input a sequence of words and get the predicted next word in real-time.

In [27]:
## Data Collection
import nltk
nltk.download('gutenberg')
from nltk.corpus import gutenberg
import pandas as pd

## load the dataset
data=gutenberg.raw('shakespeare-hamlet.txt')
## save to a file
with open('hamlet.txt','w') as file:
    file.write(data)

[nltk_data] Error loading gutenberg: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1002)>


In [28]:
## Data Preprocessing
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer  # For converting text to numbers
from tensorflow.keras.preprocessing.sequence import pad_sequences  # For making sequences uniform length
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets

# Load the dataset
# Opens 'hamlet.txt' file and reads all text, converting to lowercase
# This helps standardize the text and reduce vocabulary size
with open('hamlet.txt', 'r') as file:
    text = file.read().lower()

# Tokenization Process
# Create a tokenizer object that will convert words to numerical indices
tokenizer = Tokenizer()

# fit_on_texts learns the vocabulary from the text
# It creates a word-to-index mapping dictionary where:
# - Each unique word gets assigned a unique integer index
# - More frequent words typically get smaller indices
# - The text is passed as a list containing one string
tokenizer.fit_on_texts([text])

# Calculate total number of unique words
# Add 1 to account for the 0 index which is reserved for padding
# tokenizer.word_index is a dictionary where:
#   - keys are words from the text
#   - values are their assigned integer indices
total_words = len(tokenizer.word_index) + 1

# Print the total number of unique words in the vocabulary
total_words

4818

In [29]:
# Show the first four entries of the word index (the most frequent words come first)
first_4 = dict(list(tokenizer.word_index.items())[:4])
first_4

{'the': 1, 'and': 2, 'to': 3, 'of': 4}
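The output above shows the most frequent words receiving the smallest indices. As a rough illustration of how such a mapping can be built, here is a plain-Python sketch. It only approximates the Keras `Tokenizer` and is not its actual implementation. The helper name `build_word_index`, the `FILTERS` string, and the toy sentence are assumptions made for this example.

```python
from collections import Counter

# Characters replaced by spaces before splitting; an assumption that
# approximates the default punctuation filter of the Keras Tokenizer.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts):
    """Return a word -> index dict, most frequent word first, indices from 1."""
    table = str.maketrans(FILTERS, " " * len(FILTERS))
    counts = Counter()
    for text in texts:
        # Lowercase, strip filtered characters, split on whitespace
        counts.update(text.lower().translate(table).split())
    # most_common() sorts by descending count; ties keep first-seen order
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Toy sentence (hypothetical, not taken from the dataset)
toy_index = build_word_index(["The king and the queen, and the prince."])
vocab_size = len(toy_index) + 1  # +1 because index 0 is reserved for padding
print(toy_index)
print(vocab_size)
```

The `+ 1` mirrors `total_words = len(tokenizer.word_index) + 1` in the notebook: no word is ever assigned index 0, so the output layer needs one more unit than there are words.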

In [30]:
# Create input sequences for training
# These sequences will be used to predict the next word
input_sequences = []

# Split the text into lines and process each line separately
for line in text.split('\n'):
    # Convert each line of text to sequences of integers
    # texts_to_sequences returns a list of lists, so we take [0] to get the first (and only) sequence
    # Example: "hello world" might become [45, 67]
    token_list = tokenizer.texts_to_sequences([line])[0]
    
    # Create n-gram (prefix) sequences
    # For each position i, keep the first i+1 tokens of the line;
    # the last token of each prefix later becomes the label
    # Example with "hello world" -> [45, 67]:
    # i = 1: [45, 67] -> input [45], label 67
    for i in range(1, len(token_list)):
        # Extract the sequence up to position i+1
        n_gram_sequence = token_list[:i+1]
        # Add this sequence to our list of input sequences
        input_sequences.append(n_gram_sequence)

In [31]:
input_sequences[:5]

[[1, 687],
 [1, 687, 4],
 [1, 687, 4, 45],
 [1, 687, 4, 45, 41],
 [1, 687, 4, 45, 41, 1886]]
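The prefix construction can be checked in isolation with a few lines of plain Python. This is a minimal sketch; `prefix_sequences` and the token values are hypothetical and chosen only for illustration.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that contains at least two tokens."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

toy_tokens = [45, 67, 12, 9]  # hypothetical indices for a four-word line
sequences = prefix_sequences(toy_tokens)

# Each prefix is later split into inputs (all but the last token)
# and a label (the last token)
pairs = [(seq[:-1], seq[-1]) for seq in sequences]
for inputs, label in pairs:
    print(inputs, "->", label)
```

A line with `n` tokens yields `n - 1` training examples, and lines with fewer than two tokens contribute none.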

In [32]:
## Pad Sequences
max_sequence_len=max([len(x) for x in input_sequences])
max_sequence_len

14

In [33]:
input_sequences=np.array(pad_sequences(input_sequences,maxlen=max_sequence_len,padding='pre'))
input_sequences[:4]

array([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   1,
        687],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   1, 687,
          4],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   1, 687,   4,
         45],
       [  0,   0,   0,   0,   0,   0,   0,   0,   0,   1, 687,   4,  45,
         41]], dtype=int32)
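To make the effect of `padding='pre'` concrete, the following plain-Python sketch reproduces the left-padding behaviour on lists. It is a simplified stand-in for `pad_sequences`, not a replacement. The name `pad_pre` is an assumption, and it assumes `maxlen` is at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; if a sequence is longer than
    `maxlen`, keep only its last `maxlen` tokens."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # drop tokens from the front if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[1, 687], [1, 687, 4]], maxlen=5))
print(pad_pre([[1, 2, 3, 4]], maxlen=3))
```

Padding on the left keeps the label in the last column of every row, which is what makes the later `[:, :-1]` / `[:, -1]` split work.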

In [34]:
## Create predictors and label
import tensorflow as tf
x,y=input_sequences[:,:-1],input_sequences[:,-1]

In [35]:
# Convert the target values (y) to one-hot encoded format
# to_categorical converts numbers to binary class matrices (one-hot encoding)
# Example: if total_words = 5, number 2 becomes [0, 0, 1, 0, 0]

# Parameters:
# - y: our target values (the next word indices we want to predict)
# - num_classes: total number of unique words in our vocabulary (total_words)
#   This ensures we have the correct number of columns in our one-hot matrix

y = tf.keras.utils.to_categorical(y, num_classes=total_words)

# The resulting 'y' is now a 2D array where:
# - Each row represents one training example
# - Each row is a binary vector with length total_words
# - In each row, only one position has 1 (the target word's index)
# - All other positions have 0

# Example:
# If y was [2, 3] and total_words = 5, y becomes:
# [[0, 0, 1, 0, 0],
#  [0, 0, 0, 1, 0]]
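The split into predictors and label, followed by one-hot encoding, can be traced by hand on a toy example. The sketch below uses plain lists instead of NumPy arrays. The `one_hot` helper and the toy rows are assumptions made for illustration.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

# Two hypothetical padded sequences with a toy vocabulary size of 5
toy_padded = [[0, 0, 1, 2],
              [0, 1, 2, 3]]
toy_total_words = 5

toy_x = [row[:-1] for row in toy_padded]  # all tokens except the last
toy_y = [row[-1] for row in toy_padded]   # the last token is the label
toy_y_onehot = one_hot(toy_y, toy_total_words)

print(toy_x)
print(toy_y)
print(toy_y_onehot)
```

With the real vocabulary, each one-hot row has `total_words` entries (4818 in the run shown above), so the label matrix is much wider than the input matrix.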

In [36]:
# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [37]:
# Define early stopping callback to prevent overfitting
from tensorflow.keras.callbacks import EarlyStopping

# Create early stopping object that:
# - Monitors validation loss 
# - Stops training if no improvement for 3 epochs
# - Keeps the model weights from best epoch
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
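The stopping rule can be simulated without training anything. The sketch below is a simplified model of the patience logic, not the Keras implementation. The function name and the loss values are made up for illustration.

```python
def simulate_early_stopping(val_losses, patience=3):
    """Return (stopped_epoch, best_epoch), both 0-based.

    stopped_epoch is None when the run finishes without triggering the rule.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# Hypothetical validation losses, one per epoch
toy_losses = [6.9, 6.5, 6.4, 6.45, 6.5, 6.6, 6.7]
stopped, best = simulate_early_stopping(toy_losses, patience=3)
print(stopped, best)
```

In this toy run the loss last improves at epoch index 2, so training stops three epochs later, and `restore_best_weights=True` would roll the model back to the weights from epoch index 2.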

In [38]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional, GRU
from tensorflow.keras.optimizers import Adam

# Define GRU (Gated Recurrent Unit) model
model = Sequential()

# Embedding layer parameters:
# - total_words: Size of the vocabulary
# - 100: Dimension of dense embedding (size of vector space for words)
# - input_length: Length of input sequences
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))

# First GRU layer parameters:
# - 150: Number of GRU units (dimensionality of output space)
# - return_sequences=True: Return full sequence for stacking layers
# GRU has fewer parameters per unit than LSTM, so it often trains faster with comparable accuracy
model.add(GRU(150, return_sequences=True))

# Dropout layer to prevent overfitting
# - 0.2: 20% of neurons will be randomly disabled during training
model.add(Dropout(0.2))

# Second GRU layer parameters:
# - 100: Number of GRU units
# - return_sequences=False by default for final sequence processing
model.add(GRU(100))

# Output layer parameters:
# - total_words: Number of units equals vocabulary size for word prediction
# - softmax: Activation function to get probability distribution over words
model.add(Dense(total_words, activation="softmax"))

# Model compilation:
# - categorical_crossentropy: Loss function for multi-class classification
# - adam: Optimizer with adaptive learning rate
# - accuracy: Metric to monitor during training
model.compile(loss="categorical_crossentropy", optimizer='adam', metrics=['accuracy'])

# Display model architecture summary
model.summary()

Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 13, 100)           481800    
                                                                 
 gru_6 (GRU)                 (None, 13, 150)           113400    
                                                                 
 dropout_9 (Dropout)         (None, 13, 150)           0         
                                                                 
 gru_7 (GRU)                 (None, 100)               75600     
                                                                 
 dense_7 (Dense)             (None, 4818)              486618    
                                                                 
Total params: 1157418 (4.42 MB)
Trainable params: 1157418 (4.42 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
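The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU layers use two bias vectors per gate (the `reset_after=True` default in TF 2.x), which is consistent with the numbers printed above. The helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def gru_params(input_dim, units):
    # Three weight sets (update gate, reset gate, candidate state); each has
    # an input kernel, a recurrent kernel, and two bias vectors
    return 3 * (units * (input_dim + units) + 2 * units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

total_words = 4818  # vocabulary size reported earlier in the notebook

layer_params = {
    "embedding": embedding_params(total_words, 100),
    "gru_1": gru_params(100, 150),
    "gru_2": gru_params(150, 100),
    "dense": dense_params(100, total_words),
}
total_params = sum(layer_params.values())
print(layer_params)
print(total_params)
```

The embedding and output layers account for most of the parameters, because both scale with the vocabulary size.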


In [39]:
## Train the model
history=model.fit(x_train,y_train,epochs=50,validation_data=(x_test,y_test),verbose=1,callbacks=[early_stopping])


Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50


In [40]:
# Function to predict the next word given a text input
def predict_next_word(model, tokenizer, text, max_sequence_len):
    # Convert input text to a sequence of word indices using the tokenizer
    token_list = tokenizer.texts_to_sequences([text])[0]

    # Keep only the last max_sequence_len-1 tokens if the input is too long
    if len(token_list) >= max_sequence_len:
        token_list = token_list[-(max_sequence_len-1):]

    # Pad the sequence to the fixed input length the model expects
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')

    # Get the model's prediction (probability distribution over the vocabulary)
    predicted = model.predict(token_list, verbose=0)

    # Index of the most probable word as a plain int (argmax returns a length-1 array)
    predicted_word_index = int(np.argmax(predicted, axis=1)[0])

    # Map the index back to a word; index 0 (padding) has no word, so this returns None
    return tokenizer.index_word.get(predicted_word_index)

In [41]:
input_text="To be or not to be"
print(f"Input text:{input_text}")
max_sequence_len=model.input_shape[1]+1
next_word=predict_next_word(model,tokenizer,input_text,max_sequence_len)
print(f"Next Word Prediction:{next_word}")

Input text:To be or not to be
Next Word Prediction:to
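`predict_next_word` returns only the single most likely word. When inspecting a model, it can help to look at several candidates and their probabilities. The sketch below shows one way to rank them with plain Python. The `top_k_words` helper and the toy probabilities are assumptions. With the real model, the probabilities would come from `model.predict(...)[0]` and the mapping from `tokenizer.index_word`.

```python
def top_k_words(probabilities, index_word, k=3):
    """Return up to k (word, probability) pairs, most probable first.

    Index 0 is skipped because it is reserved for padding.
    """
    ranked = sorted(range(1, len(probabilities)),
                    key=lambda i: probabilities[i],
                    reverse=True)
    return [(index_word[i], probabilities[i]) for i in ranked[:k] if i in index_word]

# Toy distribution over a five-entry vocabulary (index 0 = padding)
toy_probs = [0.0, 0.5, 0.1, 0.3, 0.1]
toy_index_word = {1: "the", 2: "and", 3: "to", 4: "of"}

top_two = top_k_words(toy_probs, toy_index_word, k=2)
print(top_two)
```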


In [42]:
## Save the model
model.save("next_word_gru.h5")
## Save the tokenizer
import pickle
with open('tokenizer.pickle','wb') as handle:
    pickle.dump(tokenizer,handle,protocol=pickle.HIGHEST_PROTOCOL)

In [43]:
input_text="  Barn. Last night of all,When yond same"
print(f"Input text:{input_text}")
max_sequence_len=model.input_shape[1]+1
next_word=predict_next_word(model,tokenizer,input_text,max_sequence_len)
print(f"Next Word Prediction:{next_word}")

Input text:  Barn. Last night of all,When yond same
Next Word Prediction:lord


### This would be a far better model, but it would take a long time to train on my system

In [44]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional

# Improved model architecture
model = Sequential([
    # Embedding layer
    Embedding(
        input_dim=total_words,        # Size of vocabulary
        output_dim=200,               # Increased from 100: Dimension of word vectors
        input_length=max_sequence_len-1  # Length of input sequences
    ),
    
    # Using Bidirectional LSTM for better context understanding
    Bidirectional(LSTM(
        units=256,                    # Increased number of units
        return_sequences=True,        # Return full sequence of outputs
        recurrent_dropout=0.1         # Dropout for recurrent connections
    )),
    
    Dropout(0.3),                     # Increased dropout for better regularization
    
    # Second Bidirectional LSTM
    Bidirectional(LSTM(
        units=128,                    # Number of LSTM units
        recurrent_dropout=0.1
    )),
    
    Dropout(0.3),
    
    # Add an intermediate Dense layer
    Dense(512, activation='relu'),    # Additional layer for more complex patterns
    Dropout(0.3),
    
    # Output layer
    Dense(total_words, activation='softmax')  # Softmax for word prediction
])

# Compile with an explicit Adam learning rate (no learning-rate schedule is configured here)
from tensorflow.keras.optimizers import Adam
optimizer = Adam(learning_rate=0.001)
model.compile(
    loss='categorical_crossentropy',
    optimizer=optimizer,
    metrics=['accuracy']
)
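A rough parameter count gives a sense of how much larger this architecture is than the GRU model. The sketch below assumes standard LSTM layers (four weight sets, one bias vector each) and the default `Bidirectional` behaviour of concatenating the forward and backward outputs. The helper names are hypothetical, and parameter count is only a proxy for training time.

```python
def lstm_params(input_dim, units):
    # Four weight sets (input, forget, output gates and cell candidate); each
    # has an input kernel, a recurrent kernel, and one bias vector
    return 4 * (units * (input_dim + units) + units)

def bidirectional_params(input_dim, units):
    # One forward and one backward LSTM with separate weights
    return 2 * lstm_params(input_dim, units)

def dense_params(input_dim, units):
    return input_dim * units + units

total_words = 4818  # vocabulary size reported earlier in the notebook

bilstm_layer_params = {
    "embedding": total_words * 200,
    "bilstm_1": bidirectional_params(200, 256),      # outputs 2 * 256 = 512 features
    "bilstm_2": bidirectional_params(512, 128),      # outputs 2 * 128 = 256 features
    "dense_hidden": dense_params(256, 512),
    "dense_output": dense_params(512, total_words),
}
bilstm_total = sum(bilstm_layer_params.values())
gru_total = 1157418  # total from the GRU model summary above
print(bilstm_layer_params)
print(bilstm_total, round(bilstm_total / gru_total, 2))
```

Under these assumptions the model has about 5.2 million parameters, roughly 4.5 times the GRU model. In TF 2.x, a non-zero `recurrent_dropout` also prevents the LSTM layers from using the faster cuDNN kernel on GPU, which can add to the training time.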
