# LSTM Example 3: predict the next word in a sentence
## Notes

Below is a coding example that uses an LSTM model in TensorFlow to work with sequences of words. It shows how to preprocess a list of sentences and train an LSTM to predict the next word in a sentence, a common natural language processing task that underlies text generation.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

- NumPy (np) is used for numerical operations.
- TensorFlow (tf) is a deep learning library; this example uses its Keras API.
- Tokenizer and pad_sequences are used for text preprocessing.
- Sequential is the model container; Embedding, LSTM, and Dense are the layers used to build the neural network.

<img src="Naruto_Team7.jpg" width="600">

In [2]:
# In this example, we will use a simple list of sentences to train an LSTM model to predict the next word based on the previous words in the sequence.
# Sample list of sentences (input data)
sentences = [
    "Kakashi is reading a fiction",
    "Naruto is taking a walk",
    "Sasuke is looking for Naruto",
    "Naruto and Sasuke are friends",
    "Sakura likes Sasuke",
    "Sasuke has a brother",
]

In [3]:
# Tokenize the sentences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
total_words = len(tokenizer.word_index) + 1

- Tokenizer is used to convert the text into a numerical format (tokens).
- fit_on_texts builds a word index based on the sentences.
- total_words is the number of unique words plus one, because word indices start at 1 and index 0 is reserved for padding. Here total_words = 19.
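
The word index that fit_on_texts builds can be approximated in plain Python. This is only a sketch, not the Keras implementation: it assumes lowercasing, whitespace splitting, and ranking by descending frequency with ties kept in order of first appearance, which is enough to reproduce the mapping for these sentences. The helper name build_word_index is hypothetical.

```python
from collections import Counter

sentences = [
    "Kakashi is reading a fiction",
    "Naruto is taking a walk",
    "Sasuke is looking for Naruto",
    "Naruto and Sasuke are friends",
    "Sakura likes Sasuke",
    "Sasuke has a brother",
]

def build_word_index(texts):
    """Rank words by descending frequency; index 0 is left unused for padding."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so equally frequent words keep their first-seen order
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

word_index = build_word_index(sentences)
total_words = len(word_index) + 1  # +1 for the reserved index 0
print(word_index["sasuke"], word_index["brother"], total_words)  # prints: 1 18 19
```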

In [4]:
input_sequences = tokenizer.texts_to_sequences(sentences)
# Print the word index to see the mapping of words to their token indices
print("Word Index:", tokenizer.word_index)
print("Tokenize Results: ", sentences[0], input_sequences[0])

Word Index: {'sasuke': 1, 'is': 2, 'a': 3, 'naruto': 4, 'kakashi': 5, 'reading': 6, 'fiction': 7, 'taking': 8, 'walk': 9, 'looking': 10, 'for': 11, 'and': 12, 'are': 13, 'friends': 14, 'sakura': 15, 'likes': 16, 'has': 17, 'brother': 18}
Tokenize Results:  Kakashi is reading a fiction [5, 2, 6, 3, 7]


In [5]:
# Create input sequences and corresponding labels
input_sequences = []
for sentence in sentences:
    token_list = tokenizer.texts_to_sequences([sentence])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i + 1]
        input_sequences.append(n_gram_sequence)

- Each sentence is converted into a sequence of tokens.
- N-gram sequences are generated from each sentence to create training data. For example, "Kakashi is reading a fiction" yields [kakashi, is], [kakashi, is, reading], [kakashi, is, reading, a], and [kakashi, is, reading, a, fiction]. len(input_sequences) is now 21.
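
As a small illustration of the prefix expansion, the sketch below applies the same slicing to the first sentence's tokens and checks the total count. The helper name prefix_sequences is hypothetical; the sentence lengths are the word counts of the six sample sentences.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

tokens = [5, 2, 6, 3, 7]  # "Kakashi is reading a fiction"
prefixes = prefix_sequences(tokens)
print(prefixes)
# prints: [[5, 2], [5, 2, 6], [5, 2, 6, 3], [5, 2, 6, 3, 7]]

# A sentence with n words contributes n - 1 training sequences
sentence_lengths = [5, 5, 5, 5, 3, 4]
total_sequences = sum(n - 1 for n in sentence_lengths)
print(total_sequences)  # prints: 21
```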

In [6]:
# Pad sequences to ensure uniform length
max_sequence_len = max(len(seq) for seq in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre')

- Sequences are padded to ensure they all have the same length. Padding adds zeros at the beginning of sequences shorter than the maximum length. input_sequences.shape = (21, 5)
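
To make the effect of padding='pre' concrete, here is a plain-Python sketch of left-padding. It is not the pad_sequences implementation; the helper pad_pre is hypothetical and assumes the default behaviour of keeping the last maxlen tokens when a sequence is too long.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad seq with value up to maxlen; keep the last maxlen tokens if longer."""
    trimmed = seq[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

sequences = [[5, 2], [5, 2, 6], [5, 2, 6, 3], [5, 2, 6, 3, 7]]
max_len = max(len(s) for s in sequences)
padded = [pad_pre(s, max_len) for s in sequences]
for row in padded:
    print(row)
```

Left-padding keeps the most recent words next to the position being predicted, so the label is always the last column.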

In [7]:
print(f" Kakashi is -- are vectorized into: {input_sequences[0, :]}") 
print(f" Kakashi is reading -- are vectorized into: {input_sequences[1, :]}") 
print(f" Kakashi is reading a -- are vectorized into: {input_sequences[2, :]}") 
print(f" Kakashi is reading a fiction -- are vectorized into: {input_sequences[3, :]}") 

 Kakashi is -- are vectorized into: [0 0 0 5 2]
 Kakashi is reading -- are vectorized into: [0 0 5 2 6]
 Kakashi is reading a -- are vectorized into: [0 5 2 6 3]
 Kakashi is reading a fiction -- are vectorized into: [5 2 6 3 7]


In [8]:
# Split input sequences into features and labels
X = input_sequences[:, :-1]
y = input_sequences[:, -1]

# Convert labels to one-hot encoded format
y = tf.keras.utils.to_categorical(y, num_classes=total_words)

- The input sequences (X) consist of all but the last token of each padded sequence.
- The label (y) is the last token of each sequence, representing the word the model needs to predict.
- y is one-hot encoded into vectors of length total_words (19), with one position per possible word.
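
The split and the one-hot step can be traced on a single padded row. This sketch uses plain lists rather than NumPy and to_categorical; the helper split_and_one_hot is hypothetical.

```python
def split_and_one_hot(padded_row, num_classes):
    """Split a padded row into features and a one-hot encoded label."""
    features = padded_row[:-1]   # everything except the last token
    label = padded_row[-1]       # the word to predict
    one_hot = [0.0] * num_classes
    one_hot[label] = 1.0
    return features, one_hot

# Row for "Kakashi is reading": the label is token 6 ("reading")
features, target = split_and_one_hot([0, 0, 5, 2, 6], num_classes=19)
print(features)            # prints: [0, 0, 5, 2]
print(target.index(1.0))   # prints: 6
```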

In [9]:
X

array([[ 0,  0,  0,  5],
       [ 0,  0,  5,  2],
       [ 0,  5,  2,  6],
       [ 5,  2,  6,  3],
       [ 0,  0,  0,  4],
       [ 0,  0,  4,  2],
       [ 0,  4,  2,  8],
       [ 4,  2,  8,  3],
       [ 0,  0,  0,  1],
       [ 0,  0,  1,  2],
       [ 0,  1,  2, 10],
       [ 1,  2, 10, 11],
       [ 0,  0,  0,  4],
       [ 0,  0,  4, 12],
       [ 0,  4, 12,  1],
       [ 4, 12,  1, 13],
       [ 0,  0,  0, 15],
       [ 0,  0, 15, 16],
       [ 0,  0,  0,  1],
       [ 0,  0,  1, 17],
       [ 0,  1, 17,  3]])

In [10]:
# Build the LSTM model
model = Sequential()
model.add(Embedding(total_words, 64))
model.add(LSTM(50, activation='relu'))
model.add(Dense(total_words, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

- A sequential model is created.
- An Embedding layer is used to convert word indices into dense vectors of fixed size (64 dimensions).
- An LSTM layer with 50 units processes the sequences to learn patterns.
- A Dense layer with softmax activation is used for multi-class classification, outputting a probability distribution over all possible words.
- The model is compiled using the adam optimizer and categorical_crossentropy loss, suitable for multi-class classification.
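
Two quick sanity checks can be done by hand for this architecture. The sketch below assumes default Keras layers (an Embedding without bias, an LSTM with four gates and bias terms, a Dense layer with bias), so its parameter count is worth comparing against model.summary(). It also computes the cross-entropy of a uniform guess over 19 classes, which is roughly where the loss of an untrained model should start; the first-epoch loss in this notebook's training log (2.9447) is close to it.

```python
import math

vocab_size, embed_dim, lstm_units = 19, 64, 50

# One embed_dim-sized vector per vocabulary index
embedding_params = vocab_size * embed_dim
# Four gates, each with input weights, recurrent weights, and a bias vector
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# Weight matrix plus bias for the softmax output layer
dense_params = lstm_units * vocab_size + vocab_size
total_params = embedding_params + lstm_params + dense_params

# Cross-entropy of a uniform prediction over all classes
uniform_loss = math.log(vocab_size)

print(embedding_params, lstm_params, dense_params, total_params)
# prints: 1216 23000 969 25185
print(round(uniform_loss, 4))  # prints: 2.9444
```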

In [11]:
# Train the model: The model is trained on the input sequences (X) and labels (y) for 100 epochs.
history = model.fit(X, y, epochs=100, verbose=1)

# Function to generate text based on seed text
def generate_text(seed_text, next_words, max_sequence_len):
    for _ in range(next_words):
        # Encode the current text and left-pad it to the model's input length
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len - 1, padding='pre')
        # Pick the most probable next token (greedy decoding)
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        # Map the token index back to its word; index 0 (padding) has no word
        output_word = tokenizer.index_word.get(predicted, "")
        seed_text += " " + output_word
    return seed_text

Epoch 1/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 2s 2s/step - accuracy: 0.0476 - loss: 2.9447
Epoch 2/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step - accuracy: 0.0476 - loss: 2.9419
Epoch 3/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step - accuracy: 0.1429 - loss: 2.9393
Epoch 4/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step - accuracy: 0.1429 - loss: 2.9366
Epoch 5/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 36ms/step - accuracy: 0.1429 - loss: 2.9339
Epoch 6/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step - accuracy: 0.1429 - loss: 2.9311
Epoch 7/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step - accuracy: 0.1429 - loss: 2.9283
Epoch 8/100
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step - accuracy: 0.1429 - loss: 2.9254
Epoch 9/100
1/1 ━━━━━━━━━━━━━━━━━━━━

- The generate_text function extends the seed text one word at a time: it predicts the next word, appends it to the seed text, and repeats for the specified number of words (next_words).
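
The loop is a form of greedy decoding, and its control flow can be sketched without a trained model. In the sketch below, toy_predict_next is a hypothetical stand-in for model.predict followed by argmax: it looks up the next token from the last token in the context, using the transitions of the first training sentence. The names greedy_generate, toy_transitions, and index_word are likewise illustrative.

```python
def greedy_generate(seed_tokens, next_words, predict_next, window):
    """Append next_words tokens, each predicted from the last `window` tokens."""
    tokens = list(seed_tokens)
    for _ in range(next_words):
        context = tokens[-window:]
        # Left-pad with zeros so the context always has `window` entries
        context = [0] * (window - len(context)) + context
        tokens.append(predict_next(context))
    return tokens

# Hypothetical predictor: a lookup table instead of a neural network
toy_transitions = {5: 2, 2: 6, 6: 3, 3: 7}

def toy_predict_next(context):
    return toy_transitions.get(context[-1], 0)

# Reverse mapping from token index to word for the tokens used here
index_word = {5: "kakashi", 2: "is", 6: "reading", 3: "a", 7: "fiction"}

generated = greedy_generate([5, 2], 3, toy_predict_next, window=4)
text = " ".join(index_word[t] for t in generated)
print(text)  # prints: kakashi is reading a fiction
```

Always taking the most probable word makes the output deterministic for a given seed.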

In [12]:
# Test the text generation
seed_text = "the cat is"
next_words = 3
generated_text = generate_text(seed_text, next_words, max_sequence_len)
print(f"Generated Text: {generated_text}")

seed_text = "the cat and the dog"
next_words = 2
generated_text = generate_text(seed_text, next_words, max_sequence_len)
print(f"Generated Text: {generated_text}")

Generated Text: the cat is is reading a
Generated Text: the cat and the dog and sasuke


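- The text generation function is called with two seed texts made up mostly of words that are absent from the training sentences: "the cat is", extended by 3 words, and "the cat and the dog", extended by 2 words.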
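
These seeds are useful for seeing what happens to unknown words. A Tokenizer created without an oov_token skips words that are not in its word index, so the model only receives the tokens it knows. The sketch below mimics that filtering in plain Python; the helper to_known_tokens is hypothetical and the dictionary holds only a few entries of the word index printed earlier.

```python
# A few entries from the word index printed earlier
word_index = {"sasuke": 1, "is": 2, "a": 3, "naruto": 4, "and": 12}

def to_known_tokens(text, word_index):
    """Encode text, silently dropping words that are not in the vocabulary."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

first_seed = to_known_tokens("the cat is", word_index)
second_seed = to_known_tokens("the cat and the dog", word_index)
print(first_seed)   # prints: [2]
print(second_seed)  # prints: [12]
```

Under this assumption the first seed reaches the model as just "is" and the second as just "and", which is consistent with the continuations "reading a" and "sasuke" shown above.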

In [13]:
# Test the text generation - seed taken from a training sentence
seed_text = "Kakashi is"
next_words = 3
generated_text = generate_text(seed_text, next_words, max_sequence_len)
print(f"Generated Text: {generated_text}")

Generated Text: Kakashi is reading a fiction


In [14]:
# Test the text generation - seed taken from a training sentence
seed_text = "Naruto and Sasuke"
next_words = 2
generated_text = generate_text(seed_text, next_words, max_sequence_len)
print(f"Generated Text: {generated_text}")

Generated Text: Naruto and Sasuke are friends


In [14]:
# Test the text generation - Test 1
seed_text = "Sakura and Sasuke"
next_words = 2
generated_text = generate_text(seed_text, next_words, max_sequence_len)
print(f"Generated Text: {generated_text}")

Generated Text: Sakura and Sasuke are friends


In [15]:
# Test the text generation - Test 2
seed_text = "Naruto and Sakura"
next_words = 2
generated_text = generate_text(seed_text, next_words, max_sequence_len)
print(f"Generated Text: {generated_text}")

Generated Text: Naruto and Sakura are friends


In [16]:
# Test the text generation - Test 3
seed_text = "Kakashi and Sakura"
next_words = 2
generated_text = generate_text(seed_text, next_words, max_sequence_len)
print(f"Generated Text: {generated_text}")

Generated Text: Kakashi and Sakura are friends
