# <span style="color: #3498db; font-size: 36px; font-family: 'Arial', sans-serif; font-weight: bold; display: block; text-align: center; position: absolute; top: 15%; left: 15%; transform: translate(-25%, -25%); background-color: #f2f2f2; padding: 10px; border-radius: 10px;">Next Word Prediction with LSTM Neural Networks</span>


<h2 style="color: #3498db;">Introduction</h2>

This project tackles next word prediction, a core task in **Natural Language Processing** (NLP): given the preceding words of a sentence, predict the word that follows. The task has practical applications such as autocomplete in text input systems and text generation.<br>

To achieve this, we'll use **Long Short-Term Memory** (LSTM) networks, a type of **recurrent neural network** (RNN) designed to learn long-range dependencies in sequences, which makes them well suited to text. We train the model on a text corpus so that it learns the patterns and structure of the language and can predict the next word of a given sequence.


![Image](NextwordPredict.png)


<h3 style="color: #3498db;">Importing Libraries for Text Processing and LSTM Modeling</h3>

In [23]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


<h3 style="color: #3498db;">Reading And Cleaning Text Data </h3>
In this project, we'll be using the text of Sherlock Holmes as our dataset. By processing and cleaning the text, we can prepare it for tasks such as next-word prediction and text generation, providing a foundation for our model to learn from the structure and patterns in the language.

In [25]:
# Reading the data and cleaning it.

with open('sherlock-holm.es_stories_plain-text_advs.txt', 'r', encoding='utf-8') as text_file:
    raw_text = text_file.read()
cleaned_text = "\n".join([line.strip() for line in raw_text.split('\n') if line.strip() != ""])
print(cleaned_text[:500])


THE ADVENTURES OF SHERLOCK HOLMES
Arthur Conan Doyle
Table of contents
A Scandal in Bohemia
The Red-Headed League
A Case of Identity
The Boscombe Valley Mystery
The Five Orange Pips
The Man with the Twisted Lip
The Adventure of the Blue Carbuncle
The Adventure of the Speckled Band
The Adventure of the Engineer's Thumb
The Adventure of the Noble Bachelor
The Adventure of the Beryl Coronet
The Adventure of the Copper Beeches
A SCANDAL IN BOHEMIA
Table of contents
Chapter 1
Chapter 2
Chapter 3
CHAP


The output shows that the data is well-organized, with clearly split lines and structured text. Each line represents distinct content, such as the title, author, table of contents, and story sections, indicating that the cleaning process was successful. This structure makes the data ready for tokenization and sequence generation in the next steps.

<h3 style="color: #3498db;">Tokenization </h3>
Here, we use a tokenizer to convert the text into numerical values by assigning a unique integer index to each word. The output shows the total vocabulary size and the first 10 word-to-index mappings; the most frequent words receive the lowest indices. This step matters because the model operates on numbers, not raw text.

In [29]:
# Creating a tokenizer based on our cleaned_text.
text_tokenizer = Tokenizer()
text_tokenizer.fit_on_texts([cleaned_text])
total_vocab = len(text_tokenizer.word_index) + 1
print(f"Total vocabulary size: {total_vocab}")
print(list(text_tokenizer.word_index.items())[:10])


Total vocabulary size: 8200
[('the', 1), ('and', 2), ('i', 3), ('to', 4), ('of', 5), ('a', 6), ('in', 7), ('that', 8), ('it', 9), ('he', 10)]
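
To make the word-to-index mapping concrete, here is a minimal pure-Python sketch of frequency-ranked indexing. It is a simplified stand-in for the Keras `Tokenizer`, not its implementation: the helper `build_word_index` and the toy sentence are hypothetical, and words are split on whitespace only, without the punctuation filtering Keras applies.

```python
from collections import Counter

def build_word_index(text):
    """Map each word to an integer index, most frequent word first (index 1)."""
    words = text.lower().split()
    counts = Counter(words)
    # Sort by descending count; ties keep first-seen order because sorted() is stable
    ranked = sorted(counts, key=lambda w: -counts[w])
    # Index 0 is left unused so it can serve as the padding value later
    return {word: i for i, word in enumerate(ranked, start=1)}

toy_text = "the game is afoot and the game is on"
word_index = build_word_index(toy_text)
vocab_size = len(word_index) + 1  # +1 accounts for the reserved index 0
print(word_index)
print(vocab_size)
```

The `+ 1` mirrors the `total_vocab` computation in the cell above: index 0 never maps to a word, but the model's output layer still needs a slot for it.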


<h3 style="color: #3498db;">Generating n-gram Sequences </h3>
Here, we generate n-gram sequences from the cleaned text. For each line, we take every prefix of two or more tokens, so each sequence is one word longer than the previous one. The last word of each sequence is the word the model learns to predict from the words before it.

In [32]:
# Generate n-gram sequences for training
sequence_data = []
for sentence in cleaned_text.split('\n'):
    token_sequence = text_tokenizer.texts_to_sequences([sentence])[0]
    for j in range(1, len(token_sequence)):
        n_gram_seq = token_sequence[:j+1]
        sequence_data.append(n_gram_seq)
print(sequence_data[:5])
print(f"Total number of sequences generated: {len(sequence_data)}")


[[1, 1561], [1, 1561, 5], [1, 1561, 5, 129], [1, 1561, 5, 129, 34], [647, 4498]]
Total number of sequences generated: 96314


From the output, we can see the first 5 sequences and the total number of sequences generated. This step is crucial for preparing the data for our text generation model.
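
The prefix construction can be illustrated on a single line of tokens. This is a small sketch rather than the notebook's code: `prefix_sequences` is a hypothetical helper, and the token ids are the ones printed for the first line of the text.

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2; the last token of each prefix is the target."""
    return [token_ids[:end] for end in range(2, len(token_ids) + 1)]

# Token ids taken from the first line of the output above
line_tokens = [1, 1561, 5, 129, 34]
for seq in prefix_sequences(line_tokens):
    context, target = seq[:-1], seq[-1]
    print(context, "->", target)
```

A line with five tokens yields four training sequences, which is why the fifth printed sequence already belongs to the next line of the text.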

In [35]:
# Padding our sequences to the same length
max_seq_length = max([len(sequence) for sequence in sequence_data])
print(f"Maximum sequence length: {max_seq_length}")
padded_sequences = pad_sequences(sequence_data, maxlen=max_seq_length, padding='pre')
print(padded_sequences[:5])


Maximum sequence length: 18
[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    1 1561]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    1 1561    5]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     1 1561    5  129]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    1
  1561    5  129   34]
 [   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0  647 4498]]


In this step, we pad all the sequences to ensure they have the same length. The maximum sequence length is calculated, and shorter sequences are padded with zeros at the beginning to match this length. This standardization is essential because models require input of uniform size. From the output, we can see the maximum sequence length and the first 5 padded sequences, all of which are now the same length.
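
As a rough illustration of what pre-padding does, the sketch below left-pads toy sequences with zeros in plain Python. `pad_pre` is a hypothetical helper written for this example. It assumes that sequences longer than `maxlen` are trimmed from the front, which matches the default truncation behavior of `pad_sequences`.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` tokens if longer."""
    padded = []
    for seq in sequences:
        trimmed = seq[-maxlen:]  # drop the oldest tokens when the sequence is too long
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

toy = [[1, 1561], [1, 1561, 5], [647, 4498, 9, 3, 2]]
for row in pad_pre(toy, maxlen=4):
    print(row)
```

Padding on the left keeps the most recent words, and the target word in the last position, aligned in the same columns for every row.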

<h3 style="color: #3498db;">Data Preparation for Model Training </h3>
Here, we split the padded sequences into features (X_features) and target labels (y_labels). The features include all columns except the last one, while the last column is used as the target label. We then perform one-hot encoding on the target labels to convert them into a format suitable for multi-class classification. 

In [39]:
X_features = padded_sequences[:, :-1]  # All columns except the last one
y_labels = padded_sequences[:, -1]     # The last column as the target

# Perform one-hot encoding on the target labels
y_labels = tf.keras.utils.to_categorical(y_labels, num_classes=total_vocab)
print(f"Shape of X_features: {X_features.shape}")  # (number of sequences, max sequence length - 1)
print(f"Shape of y_labels: {y_labels.shape}")      # (number of sequences, vocabulary size)


Shape of X_features: (96314, 17)
Shape of y_labels: (96314, 8200)


The output shows the shapes of X_features and y_labels after splitting the padded sequences. X_features has a shape of (96,314, 17), meaning there are 96,314 sequences, each of length 17 (excluding the target word). y_labels has a shape of (96,314, 8,200), where 8,200 represents the total vocabulary size, with each label one-hot encoded. This confirms that the data is correctly prepared for training the model.
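
The split and the one-hot encoding can be sketched on toy rows without TensorFlow. `split_and_one_hot` is a hypothetical helper, and the 4-bytes-per-element figure in the size estimate is an assumption; the actual dtype of `y_labels` may differ.

```python
def split_and_one_hot(rows, num_classes):
    """Split each padded row into (context, one-hot target)."""
    features, labels = [], []
    for row in rows:
        features.append(row[:-1])   # everything except the last token
        one_hot = [0] * num_classes
        one_hot[row[-1]] = 1        # the last token is the class to predict
        labels.append(one_hot)
    return features, labels

toy_rows = [[0, 0, 1, 3], [0, 1, 3, 2]]
X_toy, y_toy = split_and_one_hot(toy_rows, num_classes=5)
print(X_toy)
print(y_toy)

# Size of the dense one-hot label matrix for the shapes reported above,
# assuming 4 bytes per element (an assumption; the actual dtype may differ)
label_bytes = 96314 * 8200 * 4
print(f"{label_bytes / 1e9:.2f} GB")
```

Under that assumption the label matrix takes roughly 3 GB. One possible alternative, not used in this notebook, is to keep the integer labels and train with Keras's `sparse_categorical_crossentropy` loss, which avoids materializing the one-hot matrix.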

<h3 style="color: #3498db;">Defining our LSTM-Based Sequential Model </h3>
Here, we define our sequential model for text generation. The model consists of three main layers:

- An Embedding layer that transforms words into dense vectors of size 100.
- An LSTM layer with 150 units, which captures the sequential nature of the data.
- A Dense layer with total_vocab units and a softmax activation, which predicts the probability distribution of the next word in the sequence.

In [7]:
# We define our sequential model with a LSTM Layer of 150 units

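text_generation_model = Sequential([
    Input(shape=(max_seq_length - 1,)),  # 17 context tokens per sample
    Embedding(input_dim=total_vocab, output_dim=100),  # 100-dimensional word vectors
    LSTM(150),  # LSTM layer with 150 units
    Dense(total_vocab, activation='softmax')  # probability for every word in the vocabulary
])
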
text_generation_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(text_generation_model.summary())


None


The summary output shows the architecture of our model:

- **The Embedding layer** outputs vectors of size (17, 100) for each sequence (17 being the input sequence length).
- **The LSTM layer** outputs vectors of size (150) for each input sequence.
- **The Dense layer** predicts the next word with a vocabulary size of 8,200.

Additionally, the total number of trainable parameters is 2,208,800, indicating the complexity of the model. This well-structured architecture is now ready for training.
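
The parameter total can be reproduced by hand from the layer sizes. The sketch below assumes the standard LSTM formulation with four gates and one bias vector per gate; the variable names are chosen for this example only.

```python
vocab_size = 8200      # total_vocab from the tokenization step
embedding_dim = 100    # Embedding output_dim
lstm_units = 150       # LSTM units

# Embedding: one vector of size embedding_dim per vocabulary index
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Dense: weight matrix plus one bias per output class
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# prints: 820000 150600 1238200 2208800
```

More than half of the parameters sit in the final Dense layer, because it needs one row of weights for every word in the vocabulary.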

<h3 style="color: #3498db;">Model Training</h3>

Now that the model architecture is defined, we move on to training the model. We compile the model using:

- **Categorical_crossentropy** as the loss function, which is ideal for multi-class classification tasks.
- **Adam** as the optimizer for efficient and adaptive learning.
- **Accuracy** as the evaluation metric to track how well the model is performing during training.

The model is trained on X_features and y_labels for 100 epochs, during which it learns to predict the next word in a sequence.

In [8]:
text_generation_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
text_generation_model.fit(X_features, y_labels, epochs=100, verbose=1)


Epoch 1/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 53s 17ms/step - accuracy: 0.0608 - loss: 6.5504
Epoch 2/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 53s 18ms/step - accuracy: 0.1200 - loss: 5.5480
Epoch 3/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 54s 18ms/step - accuracy: 0.1487 - loss: 5.1128
Epoch 4/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 53s 18ms/step - accuracy: 0.1644 - loss: 4.7645
Epoch 5/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 52s 17ms/step - accuracy: 0.1861 - loss: 4.4462
Epoch 6/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 53s 17ms/step - accuracy: 0.2083 - loss: 4.1473
Epoch 7/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 52s 17ms/step - accuracy: 0.2383 - loss: 3.8565
Epoch 8/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 52s 17ms/step - accuracy: 0.2714 - loss: 3.5931


<keras.src.callbacks.history.History at 0x2899b15a0f0>

After **100** epochs of training, the model reaches a training accuracy of **87.54%** and a training loss of **0.4756**. Because no validation data was held out, these figures show how closely the model fits the training text rather than how well it generalizes to unseen text.
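
Two numbers in the training log can be sanity-checked with simple arithmetic: the `3010/3010` step counter and the scale of the first-epoch loss. The sketch assumes the Keras default batch size of 32, since `fit` is called without a `batch_size` argument, and it compares the loss against a uniform guess over the vocabulary.

```python
import math

num_sequences = 96314   # rows in X_features
vocab_size = 8200       # total_vocab
batch_size = 32         # Keras default for fit() when batch_size is not given

# Number of batches needed to cover all sequences once
steps_per_epoch = math.ceil(num_sequences / batch_size)
print(steps_per_epoch)  # prints: 3010

# Cross-entropy of a uniform guess over the vocabulary: -ln(1 / vocab_size)
uniform_loss = math.log(vocab_size)
print(round(uniform_loss, 2))  # prints: 9.01
```

A first-epoch loss of about 6.55 is already well below the uniform baseline of about 9.01, which suggests the model picks up word frequencies quickly.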

<h3 style="color: #3498db;">Generating Text Using Our Trained Model</h3>

In this part, we test the model's ability to generate text by providing it with an initial seed phrase. The model uses this input to predict and generate the next words, one at a time, based on the patterns it learned during training. Each predicted word is added to the text, creating a longer, coherent sequence. This iterative process allows us to see how well the model understands context and its ability to generate meaningful extensions of the provided input. The result is a combination of the original seed text and the newly generated words.

In [50]:
# Initial seed text for generating new words
seed_text = "if he leaves"
num_words_to_generate = 3  # Number of words to generate
original_text = seed_text

# Generate the next words iteratively
for _ in range(num_words_to_generate):
    # Convert the current text to a sequence of tokens
    tokenized_sequence = text_tokenizer.texts_to_sequences([seed_text])[0]
    tokenized_sequence = pad_sequences([tokenized_sequence], maxlen=max_seq_length - 1, padding='pre')
    predicted_idx = int(np.argmax(text_generation_model.predict(tokenized_sequence), axis=-1)[0])
    # Look up the predicted word; index 0 is the padding value and has no word
    predicted_word = text_tokenizer.index_word.get(predicted_idx, "")
    
    # Append the predicted word to the seed text
    seed_text += " " + predicted_word

print(f"Original text: '{original_text}'")
print(f"Generated text: '{seed_text}'")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 66ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 42ms/step
Original text: 'if he leaves'
Generated text: 'if he leaves room upon the'
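
The iterative generation loop is a form of greedy decoding: at every step the single most probable word is appended. The sketch below isolates that idea with a hypothetical lookup table in place of the trained model. `toy_next_word_probs` and `generate_greedy` are invented for illustration, and unlike the LSTM, the table looks only at the most recent word.

```python
# Hypothetical stand-in for the trained model: next-word probabilities
# keyed by the most recent word only (the real LSTM sees the whole padded context)
toy_next_word_probs = {
    "he": {"leaves": 0.6, "said": 0.4},
    "leaves": {"the": 0.7, "room": 0.3},
    "the": {"room": 0.8, "house": 0.2},
}

def generate_greedy(seed_text, num_words, next_word_probs):
    """Append the most probable next word up to num_words times (greedy decoding)."""
    words = seed_text.split()
    for _ in range(num_words):
        candidates = next_word_probs.get(words[-1])
        if not candidates:
            break  # no known continuation for this word
        # Greedy step: pick the highest-probability word, like np.argmax on the softmax output
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

print(generate_greedy("if he leaves", 3, toy_next_word_probs))
```

Because the choice is deterministic, the same seed always produces the same continuation, which is one reason greedy generation can fall into repetitive patterns.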


<h3 style="color: #3498db;">Conclusion</h3>

In this project, we successfully built and trained a text generation model using an LSTM-based neural network. Starting with raw textual data from "The Adventures of Sherlock Holmes," we cleaned, tokenized, and preprocessed the text to prepare it for training. By creating n-gram sequences and padding them to a uniform length, we ensured that the model could effectively learn the contextual relationships between words.

The model reached a training accuracy of **87.54%** and a training loss of **0.4756** after **100** epochs, which shows that it fits the next-word patterns of the training text closely. In the generation test, it extended the seed phrase "if he leaves" to "if he leaves room upon the", a sequence of plausible words, though not a fully grammatical sentence.

While the results are impressive, there are areas for further improvement. Using a larger and more diverse dataset, experimenting with advanced architectures like Transformer models, or implementing techniques to reduce repetitive patterns could enhance the quality of the generated text even further.

Overall, this project highlights the potential of deep learning in natural language processing. It serves as a solid foundation for developing more sophisticated applications, such as chatbots, text summarization, and creative writing tools, that rely on text generation capabilities.