This code imports necessary libraries and reads a text file for further processing:

1. **Imports**:
   - `import numpy as np`: Imports the NumPy library for numerical operations, commonly used for handling arrays and matrices.
   - `import tensorflow as tf`: Imports TensorFlow for building and training machine learning models.
   - `from tensorflow.keras.preprocessing.text import Tokenizer`: Imports the `Tokenizer` class for converting text into sequences of integers.
   - `from tensorflow.keras.preprocessing.sequence import pad_sequences`: Imports a method to ensure input sequences have the same length by padding them.
   - `from tensorflow.keras.models import Sequential`: Imports the Sequential model, a linear stack of layers for building models.
   - `from tensorflow.keras.layers import Embedding, LSTM, Dense`: Imports specific layers for the model (Embedding for word representations, LSTM for sequence processing, Dense for fully connected layers).
   - `from tensorflow.keras.optimizers import Adam`: Imports the Adam optimizer used to train the model.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.optimizers import Adam

# Read the text file
with open('/kaggle/input/next-word-prediction/sherlock-holm.es_stories_plain-text_advs.txt', 'r', encoding='utf-8') as file:
    text = file.read()

This code fits a tokenizer on the text, which maps each unique word to an integer index starting at 1, and sets `total_words` to the vocabulary size plus one, because index 0 is reserved for padding.

In [2]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
total_words = len(tokenizer.word_index) + 1
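
To make the mapping concrete, here is a minimal pure-Python sketch of the kind of frequency-ranked index that `Tokenizer` builds. The helper `build_word_index` and the toy sentence are illustrative assumptions, not part of the notebook, and the sketch skips the punctuation filtering that the real class performs.

```python
from collections import Counter

def build_word_index(text):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(text.lower().split())
    # most_common() orders by descending count; ties keep first-seen order.
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

toy_index = build_word_index("the cat sat on the mat")
# Index 0 is never assigned to a word, so the vocabulary size is len + 1.
toy_total_words = len(toy_index) + 1
print(toy_index, toy_total_words)
```

Because no word ever receives index 0, that value stays free to act as the padding token later on.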

This code creates input sequences for training the model:

1. **`input_sequences = []`**: Initializes an empty list to store the sequences.

2. **`for line in text.split('\n'):`**: Loops through each line of the text (split by newline).

3. **`token_list = tokenizer.texts_to_sequences([line])[0]`**: Converts each line of text into a sequence of integers (tokens) based on the tokenizer's word index.

4. **`for i in range(1, len(token_list)):`**: Iterates `i` from 1 to `len(token_list) - 1`, so each prefix built in the next step contains at least two tokens (some context plus a target word).

5. **`n_gram_sequence = token_list[:i+1]`**: Creates an n-gram sequence by taking the first `i+1` tokens from the `token_list`.

6. **`input_sequences.append(n_gram_sequence)`**: Appends each n-gram sequence to the `input_sequences` list.

This process generates sequences of words (n-grams) that the model will use to learn context and predict the next word.

In [3]:
input_sequences = []
for line in text.split('\n'):
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
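
As a small illustration of what the loop produces for a single line, the sketch below applies the same slicing to a hand-written token list. The function name `prefix_sequences` and the token ids are hypothetical and chosen only for the example.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[: i + 1] for i in range(1, len(token_list))]

toy_tokens = [4, 7, 2, 9]  # hypothetical token ids for one four-word line
toy_sequences = prefix_sequences(toy_tokens)
print(toy_sequences)  # [[4, 7], [4, 7, 2], [4, 7, 2, 9]]
```

A line with `n` tokens contributes `n - 1` sequences, and lines with fewer than two tokens contribute none.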

This pads every sequence with leading zeros (`padding='pre'`) up to the length of the longest one, so all input sequences have equal length, which is required to stack them into a single array for training.

In [4]:
max_sequence_len = max([len(seq) for seq in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
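
The effect of pre-padding can be sketched without Keras. The helper `pad_pre` below is a hypothetical stand-in that only mimics the zero-filling behaviour of `pad_sequences(..., padding='pre')`; it does not handle truncation of sequences longer than `maxlen`.

```python
def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros up to maxlen (no truncation handled)."""
    return [[0] * (maxlen - len(seq)) + list(seq) for seq in sequences]

toy_sequences = [[4, 7], [4, 7, 2], [4, 7, 2, 9]]
toy_max_len = max(len(seq) for seq in toy_sequences)
toy_padded = pad_pre(toy_sequences, toy_max_len)
print(toy_padded)  # [[0, 0, 4, 7], [0, 4, 7, 2], [4, 7, 2, 9]]
```

Padding on the left keeps the most recent words, and therefore the target word, in the final positions of every row.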

This code splits the input sequences into features (X) and labels (y):

1. **`X = input_sequences[:, :-1]`**: X contains all tokens of each sequence except the last. Selecting every column but the last one (`:-1`) yields the context (previous words) for the prediction.

2. **`y = input_sequences[:, -1]`**: y contains only the last token of each sequence (`-1`), which is the target word that the model will predict based on the context (X).

In [5]:
X = input_sequences[:, :-1]
y = input_sequences[:, -1]
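
The same slicing on a tiny padded array shows how each row divides into context and target. The array values are made up for illustration.

```python
import numpy as np

# Three pre-padded sequences of length 4 (hypothetical token ids).
toy_padded = np.array([[0, 0, 4, 7],
                       [0, 4, 7, 2],
                       [4, 7, 2, 9]])
toy_X = toy_padded[:, :-1]  # every column except the last: the context
toy_y = toy_padded[:, -1]   # the last column: the word to predict
print(toy_X.tolist(), toy_y.tolist())
```

Each row of `toy_X` has one column fewer than the padded input, which is why the model's input length is `max_sequence_len-1`.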

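This step one-hot encodes the labels: each integer word index becomes a vector of length `total_words` with a 1 at that index and 0 elsewhere, which is the format that the `categorical_crossentropy` loss expects.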

In [6]:
y = np.array(tf.keras.utils.to_categorical(y, num_classes=total_words))
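
One-hot encoding can be reproduced with plain NumPy to see the shape of the result. The helper `one_hot` is a hypothetical sketch of the idea behind `to_categorical`, not its implementation, and the labels are invented.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return one row per label with a single 1.0 at the label's index."""
    # Indexing the identity matrix by label picks out the matching row.
    return np.eye(num_classes, dtype="float32")[np.asarray(labels)]

toy_labels = [1, 3, 0]
toy_one_hot = one_hot(toy_labels, num_classes=4)
print(toy_one_hot)
```

The encoded array has one column per vocabulary word, so its size grows with `len(y) * total_words`.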

This code defines, builds, and compiles a neural network model for next-word prediction:

1. **Model Definition**:
   - `model = Sequential()`: Creates a sequential model, where layers are stacked one after the other.
   - `model.add(Embedding(input_dim=total_words, output_dim=100, input_length=max_sequence_len-1))`: Adds an embedding layer that converts integer sequences into dense vectors of fixed size (100). The input dimension is the total number of words (vocabulary size), and the input length is the length of each sequence (one less than the max sequence length).
   - `model.add(LSTM(150))`: Adds a Long Short-Term Memory (LSTM) layer with 150 units. LSTM is used for processing sequential data.
   - `model.add(Dense(total_words, activation='softmax'))`: Adds a dense layer with `total_words` output units and a softmax activation function, which will output a probability distribution over the vocabulary for the predicted next word.

2. **Model Build**:
   - `model.build(input_shape=(None, max_sequence_len-1))`: Explicitly builds the model by specifying the shape of the input data (sequences of length `max_sequence_len-1`).

3. **Model Compilation**:
   - `adam = Adam(learning_rate=0.01)`: Initializes the Adam optimizer with a learning rate of 0.01.
   - `model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])`: Compiles the model with categorical crossentropy loss (for multi-class classification) and the Adam optimizer. The accuracy metric is used to evaluate the model.

4. **Model Summary**:
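   - `print(model.summary())`: `model.summary()` prints a summary of the model itself, including the number of parameters in each layer, and returns `None`, so wrapping it in `print` also emits a trailing `None`.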

This sets up the model for training next-word prediction based on the input sequences.

In [7]:
# Define the model
model = Sequential()
model.add(Embedding(input_dim=total_words, output_dim=100, input_length=max_sequence_len-1))
model.add(LSTM(150))
model.add(Dense(total_words, activation='softmax'))

# Build the model explicitly to avoid 'unbuilt' state
model.build(input_shape=(None, max_sequence_len-1))

# Compile the model
adam = Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])

# Print the model summary to verify parameter count
print(model.summary())




None
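
The parameter counts that the summary reports can be worked out by hand from the layer sizes. The sketch below does that arithmetic for the architecture above, assuming default Keras layer settings (biases enabled); the function name `count_parameters` and the vocabulary size of 1000 are hypothetical, since the real value of `total_words` depends on the corpus.

```python
def count_parameters(vocab_size, embed_dim=100, lstm_units=150):
    """Count trainable parameters for Embedding -> LSTM -> Dense(softmax)."""
    # One embed_dim-sized vector per vocabulary entry.
    embedding = vocab_size * embed_dim
    # Four gates, each with input weights, recurrent weights, and a bias.
    lstm = 4 * ((embed_dim + lstm_units) * lstm_units + lstm_units)
    # One weight per (LSTM unit, word) pair plus one bias per word.
    dense = lstm_units * vocab_size + vocab_size
    return {"embedding": embedding, "lstm": lstm, "dense": dense,
            "total": embedding + lstm + dense}

toy_counts = count_parameters(vocab_size=1000)  # 1000 is a hypothetical vocabulary size
print(toy_counts)
```

Only the LSTM count is independent of the vocabulary; the embedding and output layers both scale linearly with `total_words`.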


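This trains the model for 100 epochs to predict the next word based on context. Note that recompiling with `optimizer='adam'` replaces the optimizer configured in the previous cell, so training uses Adam's default learning rate (0.001) rather than 0.01.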

In [8]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=100, verbose=1)

Epoch 1/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 22s 7ms/step - accuracy: 0.0621 - loss: 6.5480
Epoch 2/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 20s 7ms/step - accuracy: 0.1166 - loss: 5.5604
Epoch 3/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 20s 6ms/step - accuracy: 0.1448 - loss: 5.1421
Epoch 4/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 20s 7ms/step - accuracy: 0.1653 - loss: 4.7772
Epoch 5/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 20s 7ms/step - accuracy: 0.1844 - loss: 4.4606
Epoch 6/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 20s 6ms/step - accuracy: 0.2050 - loss: 4.1594
Epoch 7/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 20s 7ms/step - accuracy: 0.2332 - loss: 3.8829
Epoch 8/100
3010/3010 ━━━━━━━━━━━━━━━━━━━━ 19s 6ms/step - accuracy: 0.2647 - loss: 3.6185
Epoch 9/

<keras.src.callbacks.history.History at 0x788fcd81b640>
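
The `3010/3010` counter in the log counts batches, not samples. Since `model.fit` is called without `batch_size`, the Keras default of 32 applies, and the number of training sequences can be bounded from the step count. The helper names below are hypothetical; the sketch only does the arithmetic.

```python
import math

def steps_per_epoch(num_samples, batch_size=32):
    """Number of batches needed to cover num_samples once."""
    return math.ceil(num_samples / batch_size)

def sample_range_for_steps(steps, batch_size=32):
    """Smallest and largest sample counts that yield exactly `steps` batches."""
    return (steps - 1) * batch_size + 1, steps * batch_size

low, high = sample_range_for_steps(3010)
print(low, high)  # 96289 96320
```

Under that assumption, the training set holds between 96,289 and 96,320 n-gram sequences.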

In [9]:
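seed_text = "Hi my name is "
next_words = 5  # number of words to generate

for _ in range(next_words):
    # Encode the current text and pad it to the model's input length
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # Greedy decoding: take the index of the most probable next word
    predicted = np.argmax(model.predict(token_list), axis=-1)[0]
    # Reverse lookup: find the word whose index matches the prediction
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word

print(seed_text)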

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 139ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 20ms/step
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step
Hi my name is  sherlock holmes it is my
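
The double space in the printed result comes from the seed text ending in a space while the loop also prepends one to each word. The loop also scans the whole `word_index` for every prediction. A possible variant, sketched here with a toy dictionary standing in for `tokenizer.word_index`, inverts the mapping once and trims whitespace before appending. The helper `append_word` is hypothetical; the Keras `Tokenizer` also exposes an `index_word` attribute that could likely replace the manual inversion.

```python
# Toy stand-in for tokenizer.word_index (hypothetical values).
word_index = {"sherlock": 1, "holmes": 2, "it": 3}
# Invert once so each lookup is a dictionary access instead of a scan.
index_word = {index: word for word, index in word_index.items()}

def append_word(seed_text, predicted_id):
    """Append the word for predicted_id, avoiding doubled or trailing spaces."""
    word = index_word.get(int(predicted_id), "")  # unknown ids (e.g. 0) add nothing
    return f"{seed_text.rstrip()} {word}".rstrip()

result = append_word("Hi my name is ", 1)
print(result)  # Hi my name is sherlock
```

An index with no matching word, such as the padding index 0, leaves the text unchanged apart from trimming.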
