# Lecture 55: LSTM and GRU End to End Deep Learning Project - Predicting next word

## Project Description: Next Word Prediction Using LSTM

**Project Overview:** This project aims to develop a deep learning model for predicting the next word in a given sequence of words. The model is built using Long Short-Term Memory (LSTM) networks, which are well-suited for sequence prediction tasks. The project includes the following steps:

1- **Data Collection:** We use the text of Shakespeare's "Hamlet" as our dataset. This rich, complex text provides a good challenge for our model.

2- **Data Preprocessing:** The text data is tokenized, converted into sequences, and padded to ensure uniform input lengths. The sequences are then split into training and testing sets.

3- **Model Building:** An LSTM model is constructed with an embedding layer, two LSTM layers, and a dense output layer with a softmax activation function to predict the probability of the next word.

4- **Model Training:** The model is trained using the prepared sequences, with early stopping implemented to prevent overfitting. Early stopping monitors the validation loss and stops training when the loss stops improving.

5- **Model Evaluation:** The model is evaluated using a set of example sentences to test its ability to predict the next word accurately.

## 1. Data Collection

**What is the Gutenberg corpus?**

**Gutenberg corpus:** A collection of public-domain texts from Project Gutenberg, including classic literature and religious works. NLTK ships a small selection of these texts. The range of genres and older writing styles makes it a challenging dataset for NLP.

In [1]:
import nltk
nltk.download('gutenberg') # Downloads the Gutenberg corpus from the NLTK library
from nltk.corpus import gutenberg
import pandas as pd

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


In [2]:

data  = gutenberg.raw('shakespeare-hamlet.txt')
# Import text of Shakespeare's Hamlet from the NLTK Gutenberg corpus
with open('hamlet.txt', 'w') as f:
    f.write(data)

## 2. Data Preprocessing

**What do `pad_sequences` and `Tokenizer` do? Explain `Tokenizer` in detail.**

| Name | Description |
| --- | --- |
| `pad_sequences` | Function that pads sequences (lists of integers) to a uniform length |
| `Tokenizer` | Class that vectorizes text into sequences of integers |

- `pad_sequences` is useful when input sequences have varying lengths and must all be brought to the same length before they are fed into a neural network.
- `Tokenizer` builds a vocabulary index based on word frequency and can then encode new texts using this index.
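To make the padding step concrete, here is a minimal pure-Python sketch of pre-padding. The helper `pad_pre` is hypothetical: it only mimics what `pad_sequences(..., padding='pre')` does with its default pre-truncation and is not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) each sequence to exactly `maxlen` items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


sequences = [[1, 687], [1, 687, 4], [1, 687, 4, 45]]
padded = pad_pre(sequences, maxlen=4)
print(padded)  # [[0, 0, 1, 687], [0, 1, 687, 4], [1, 687, 4, 45]]
```

Padding on the left keeps the most recent tokens, and the word to predict, at the end of every row.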

### Tokenizer Example:
```python
from tensorflow.keras.preprocessing.text import Tokenizer

texts = [
    "I love machine learning",
    "I love deep learning",
    "Neural networks are amazing"
]

# fit_on_texts returns None, so it cannot be chained; fit first, then read word_index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
print(tokenizer.word_index)
```
Output:
```json
{'i': 1, 'love': 2, 'learning': 3, 'machine': 4, 'deep': 5, 'neural': 6, 'networks': 7, 'are': 8, 'amazing': 9}
```

- Each word is assigned an integer based on its frequency, so the most frequent words get the lowest indices. Words with equal counts keep the order in which they first appear.
- Now we can use this tokenizer to convert new text to sequences of integers:
```python
new_text = "I love neural networks"
sequences = tokenizer.texts_to_sequences([new_text])
print(sequences)
```
Output:
```json
[[1, 2, 6, 7]]
```

> Each word in the new text is replaced by its corresponding integer from the vocabulary.
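The indexing rule can be sketched in plain Python. `build_word_index` and `encode` are hypothetical helpers that approximate the default `Tokenizer` behaviour (lowercasing, whitespace splitting, ranking by frequency, and dropping unknown words when no `oov_token` is set). Punctuation filtering is left out for brevity, so treat this as an illustration rather than the library's code.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; ties keep first-seen order. Indices start at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so equally frequent words stay in first-seen order
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def encode(text, word_index):
    """Map words to indices, silently skipping out-of-vocabulary words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


word_index = build_word_index(["the cat sat", "the dog sat", "the end"])
print(word_index)  # {'the': 1, 'sat': 2, 'cat': 3, 'dog': 4, 'end': 5}
print(encode("The bird sat", word_index))  # [1, 2]
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value.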


In [3]:
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from pprint import pprint

In [4]:
with open('hamlet.txt', 'r') as f:
    text = f.read().lower()

tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1

# Print the first 20 words and their indices
# pprint(dict(list(tokenizer.word_index.items())[:20]), width=80, compact=False)


**How do you process the text data for input sequences?**
1. Split the text into separate lines and iterate over them.
2. For each line, convert the text into a sequence of tokens (integers) using the tokenizer.

> From the above example, we know that `texts_to_sequences` returns a list of lists, where each inner list is the token sequence for one input text.
>
> Since we pass only one line at a time, the result is a list containing a single list of tokens.

3. Indexing with `[0]` extracts that single list, so `token_list` holds the token sequence for the current line rather than a list wrapping that sequence.
4. Finally, build n-gram sequences of increasing length for each line: the first two tokens, then the first three, and so on.

**Let's use a simple example to illustrate:**

Suppose we have this text: "I love coding"

After tokenization, it might look like this: [1, 2, 3]

The code will generate these sequences: [1, 2] and [1, 2, 3]

```python
input_sequences = []  # collects every prefix sequence
for line in text.split('\n'):
    # If line is "I love coding"
    token_list = tokenizer.texts_to_sequences([line])[0]
    # token_list might be [1, 2, 3]
    
    for i in range(1, len(token_list)):
        # i will be 1, then 2
        n_gram_sequence = token_list[:i+1]
        # When i=1: n_gram_sequence = [1, 2]
        # When i=2: n_gram_sequence = [1, 2, 3]
        input_sequences.append(n_gram_sequence)
```


**What do you mean by n-gram? Explain `n_gram_sequence` and `n_gram_frequency`.**

1. N-gram:
   - An n-gram is a contiguous sequence of n items (usually words or characters) from a given sample of text or speech.
   - N-grams are used to capture local patterns and to estimate how likely a word sequence is.
2. `n_gram_sequence`:
   - The sequence of tokens that makes up an n-gram. In this notebook it is a prefix of a line's token list: the first two tokens, then the first three, and so on.
3. `n_gram_frequency`:
   - A measure of how often an n-gram appears in a given text or corpus.
   - N-grams that are frequent in the training corpus are treated as more likely to occur in new text.
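As a small illustration of n-grams and their frequencies, the sketch below extracts fixed-size bigrams from a short phrase and counts them. The helper `ngrams` is a hypothetical name used only here. Note that the notebook's own loop builds prefixes of growing length rather than fixed-size windows.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of `n` tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "to be or not to be".split()
bigrams = ngrams(tokens, 2)
bigram_freq = Counter(bigrams)

print(bigrams)
print(bigram_freq[("to", "be")])  # 2
```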

In [5]:
# Creating input sequences

input_sequences = []
for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

# Print the first 10 input sequences (they cover the first two lines of the text)
pprint(input_sequences[:10], width=100, compact=False)

[[1, 687],
 [1, 687, 4],
 [1, 687, 4, 45],
 [1, 687, 4, 45, 41],
 [1, 687, 4, 45, 41, 1886],
 [1, 687, 4, 45, 41, 1886, 1887],
 [1, 687, 4, 45, 41, 1886, 1887, 1888],
 [1180, 1889],
 [1180, 1889, 1890],
 [1180, 1889, 1890, 1891]]


In [6]:
# Pad Sequences

max_sequence_len = max(len(x) for x in input_sequences)

# Left-pad with zeros so every sequence has length max_sequence_len
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

print(input_sequences)

[[   0    0    0 ...    0    1  687]
 [   0    0    0 ...    1  687    4]
 [   0    0    0 ...  687    4   45]
 ...
 [   0    0    0 ...    4   45 1047]
 [   0    0    0 ...   45 1047    4]
 [   0    0    0 ... 1047    4  193]]


**How do you prepare the input sequences for training the LSTM network?**

1. Split the input sequences into `x` and `y`, where `x` is the input sequence (`[:, :-1]`, all columns except the last) and `y` is the next word (`[:, -1]`, the last column).
2. One-hot encode the next-word column so that each label becomes a vector of length `total_words`.

Example:
If `input_sequences` is:

```json
[[1, 2, 3, 4],
 [2, 3, 4, 5],
 [3, 4, 5, 6]]
```
Then, x would be:
```json
[[1, 2, 3],
 [2, 3, 4],
 [3, 4, 5]]
```
And y would be:
```json
[4, 5, 6]
```
Now, convert y into one-hot encoding (here with 7 classes, indexed 0 to 6, so label 4 sets position 4):
```json
[[0, 0, 0, 0, 1, 0, 0],
 [0, 0, 0, 0, 0, 1, 0],
 [0, 0, 0, 0, 0, 0, 1]]
```
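The same split and encoding can be reproduced without any libraries. `split_features_labels` and `one_hot` are hypothetical helpers that stand in for the NumPy slicing and `to_categorical` used in the notebook. They are only a sketch of the idea.

```python
def split_features_labels(sequences):
    """Return (x, y): all tokens but the last, and the last token of each row."""
    x = [row[:-1] for row in sequences]
    y = [row[-1] for row in sequences]
    return x, y


def one_hot(labels, num_classes):
    """One-hot encode integer labels; label k sets position k (0-based)."""
    return [[1 if i == label else 0 for i in range(num_classes)] for label in labels]


example_sequences = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]
x, y = split_features_labels(example_sequences)
print(x)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(y)  # [4, 5, 6]
print(one_hot(y, num_classes=7)[0])  # [0, 0, 0, 0, 1, 0, 0]
```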



In [14]:
# Create Predictors and label

import tensorflow as tf

x, y = input_sequences[:,:-1], input_sequences[:,-1]
y = tf.keras.utils.to_categorical(y, num_classes=total_words)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

In [15]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

model = Sequential()
model.add(Embedding(total_words, 100))
model.add(LSTM(150, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dense(total_words, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.build((None, max_sequence_len - 1))  # x has max_sequence_len - 1 columns (the last one became y)
model.summary()
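The parameter counts reported by `model.summary()` can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM (with bias), and dense layers. The helper names are hypothetical, and `total_words = 5000` is a placeholder assumption, not the actual Hamlet vocabulary size.

```python
def embedding_params(vocab_size, embed_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights, and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


total_words = 5000  # placeholder vocabulary size, not the real Hamlet value

print(embedding_params(total_words, 100))  # 500000
print(lstm_params(100, 150))               # 150600
print(lstm_params(150, 100))               # 100400
print(dense_params(100, total_words))      # 505000
```

With a vocabulary of this size, the embedding and output layers hold most of the parameters, while the two LSTM layers stay the same size regardless of vocabulary.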

In [20]:
# Define Early stopping
from tensorflow.keras.callbacks import EarlyStopping
early_stopping = EarlyStopping(monitor='val_loss', patience = 10, restore_best_weights=True)
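The patience rule behind early stopping can be sketched in plain Python. `early_stopping_epoch` is a hypothetical helper that simplifies what the callback does (it ignores options such as `min_delta`), and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch


# Made-up validation losses: improvement stops after epoch 3
losses = [2.0, 1.5, 1.2, 1.3, 1.4, 1.25, 1.6]
print(early_stopping_epoch(losses, patience=3))  # (6, 3)
```

With `restore_best_weights=True`, the model would be rolled back to the weights from the best epoch (epoch 3 in this toy run).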

In [22]:
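# To enable early stopping, pass the callback defined above:
# history = model.fit(x_train, y_train, epochs=100, validation_data=(x_test, y_test), verbose=1, callbacks=[early_stopping])
# The call below omits the callback, so training is not stopped when val_loss stops improving.
history = model.fit(x_train, y_train, epochs=100, validation_data=(x_test, y_test), verbose=1)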

Epoch 1/100
644/644 ━━━━━━━━━━━━━━━━━━━━ 7s 10ms/step - accuracy: 0.3418 - loss: 3.0590 - val_accuracy: 0.0515 - val_loss: 10.3612
Epoch 2/100
644/644 ━━━━━━━━━━━━━━━━━━━━ 10s 9ms/step - accuracy: 0.3558 - loss: 2.9920 - val_accuracy: 0.0501 - val_loss: 10.4352
Epoch 3/100
644/644 ━━━━━━━━━━━━━━━━━━━━ 11s 11ms/step - accuracy: 0.3621 - loss: 2.9790 - val_accuracy: 0.0515 - val_loss: 10.5312
Epoch 4/100
644/644 ━━━━━━━━━━━━━━━━━━━━ 11s 11ms/step - accuracy: 0.3695 - loss: 2.9153 - val_accuracy: 0.0532 - val_loss: 10.5775
Epoch 5/100
644/644 ━━━━━━━━━━━━━━━━━━━━ 6s 9ms/step - accuracy: 0.3740 - loss: 2.8911 - val_accuracy: 0.0525 - val_loss: 10.6649
Epoch 6/100
644/644 ━━━━━━━━━━━━━━━━━━━━ 10s 9ms/step - accuracy: 0.3865 - loss: 2.8376 - val_accuracy: 0.0515 - val_loss: 10.7539
Epoch 7/1

In [24]:
# Function to predict the next word
def predict_next_word(model, tokenizer, text, max_sequence_len):
    token_list = tokenizer.texts_to_sequences([text])[0]
    # Keep at most the last max_sequence_len - 1 tokens, matching the model's input length
    token_list = token_list[-(max_sequence_len - 1):]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
    predicted = model.predict(token_list, verbose=0)
    # argmax returns a one-element array; take the scalar index
    predicted_word_index = int(np.argmax(predicted, axis=1)[0])
    # index_word is the tokenizer's reverse lookup; index 0 (padding) has no word, so None is returned
    return tokenizer.index_word.get(predicted_word_index)
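The final lookup step, from a probability vector back to a word, can be illustrated on its own. The vocabulary and the softmax output below are made up, and `argmax` is a small stand-in for `np.argmax`. This is a sketch of the idea, not the notebook's exact code.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


word_index = {"to": 1, "be": 2, "or": 3, "not": 4}  # toy vocabulary
index_word = {index: word for word, index in word_index.items()}

# Made-up softmax output; position 0 is reserved for padding
probabilities = [0.01, 0.10, 0.60, 0.19, 0.10]

predicted_index = argmax(probabilities)
print(predicted_index)                  # 2
print(index_word.get(predicted_index))  # be
```

Building the reverse dictionary once avoids scanning `word_index` on every prediction.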


In [25]:
# Prediction
input_text = "To be or not to be"
print(f"Input text:{input_text}")
max_sequence_len = model.input_shape[1] + 1
next_word = predict_next_word(model, tokenizer, input_text, max_sequence_len)
if next_word:
    print(f"The predicted next word is: {next_word}")

Input text:To be or not to be
The predicted next word is: that


In [26]:
import pickle
# Save the model
model.save('next_word_lstm.h5')
# Save the tokenizer
with open('tokenizer.pkl', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

