# **1: Data Preprocessing**
This cell prepares the dataset before training. The dataset, Poirot Investigates by Agatha Christie, is loaded from a plain-text file. The code first strips the header and footer added by Project Gutenberg. It then cleans the text by removing every character that is not a letter or whitespace, converting everything to lowercase, and collapsing runs of whitespace into single spaces.
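
The cleaning steps can be tried on a tiny stand-in for the book. This is only a sketch: the sample text, the shortened marker strings, and the helper names `strip_gutenberg` and `clean` are assumptions made for illustration, not part of the notebook.

```python
import re

# Hypothetical miniature of a Project Gutenberg file (marker wording is assumed).
RAW = (
    "Some licence preamble\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK POIROT INVESTIGATES ***\n"
    "\"Mon ami,\" said Poirot,   \"it is   simple!\"\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK POIROT INVESTIGATES ***\n"
    "Some licence footer\n"
)

def strip_gutenberg(text, start_marker="*** START OF", end_marker="*** END OF"):
    """Return the text between the two marker lines, excluding the markers."""
    start = text.find(start_marker)
    end = text.find(end_marker)
    if start == -1 or end == -1:
        return text  # markers not found: leave the text untouched
    newline = text.find("\n", start)
    body_start = newline + 1 if newline != -1 else start
    return text[body_start:end]

def clean(text):
    """Keep letters and whitespace, lowercase, then collapse whitespace runs."""
    text = re.sub(r"[^a-zA-Z\s]", "", text).lower()
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean(strip_gutenberg(RAW))
print(cleaned)
```

Because punctuation is deleted rather than replaced by a space, contractions such as "don't" become "dont" under this scheme.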

To train the language model, the text is tokenized with the Keras `Tokenizer`, which assigns a unique integer to each word and caps the vocabulary at 30,000 words. The cap keeps only the most frequent words, although this book never reaches it: the run below reports a vocabulary size of 6,265. Pre-trained 100-dimensional GloVe word embeddings (`glove-wiki-gigaword-100`) are used to initialize the word representations, so the model starts from vectors that already encode some word similarity. The text is then broken into sequences of up to 100 consecutive words, each ending in the word the model should learn to predict. Shorter sequences are left-padded with zeros so that every input has the same length.
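
As a rough illustration of what a frequency-ranked word index with a vocabulary cap and an out-of-vocabulary token does, here is a small pure-Python sketch. It approximates the idea behind the Keras `Tokenizer`; it is not that class's implementation, and `build_word_index` and `encode` are hypothetical helpers.

```python
from collections import Counter

def build_word_index(text, max_vocab, oov_token="<OOV>"):
    """Rank words by frequency; index 0 is left for padding, index 1 for OOV."""
    counts = Counter(text.split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        if rank >= max_vocab:
            break  # words beyond the cap are treated as out-of-vocabulary
        index[word] = rank
    return index

def encode(text, index, oov_token="<OOV>"):
    """Map each word to its index, falling back to the OOV index."""
    return [index.get(word, index[oov_token]) for word in text.split()]

word_index = build_word_index("the case of the missing will the end", max_vocab=4)
encoded = encode("the case of poirot", word_index)
print(word_index, encoded)
```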

### **Key Steps:**

* Loads and cleans the dataset by removing unnecessary metadata.
* Removes special characters and converts the text to lowercase.
* Tokenizes words and limits vocabulary to 30,000 for efficiency.
* Uses pre-trained GloVe embeddings for better word understanding.
* Creates sequences of words and pads them to a fixed length.
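
The last step above, turning a token stream into fixed-length inputs and next-word labels, can be sketched without TensorFlow. The helper name `make_windows` and the toy token ids are assumptions; the notebook itself relies on `pad_sequences` for the padding.

```python
def make_windows(tokens, maxlen):
    """One row per position: up to `maxlen` tokens ending at that position,
    left-padded with 0 so that every row has exactly `maxlen` entries."""
    rows = []
    for i in range(1, len(tokens)):
        window = tokens[max(0, i + 1 - maxlen): i + 1]
        rows.append([0] * (maxlen - len(window)) + window)
    return rows

tokens = [5, 8, 3, 9, 4]          # toy token ids
rows = make_windows(tokens, maxlen=4)
X = [row[:-1] for row in rows]    # model input: everything but the last token
y = [row[-1] for row in rows]     # label: the token to predict
for inputs, label in zip(X, y):
    print(inputs, "->", label)
```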




In [None]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import re
import gensim.downloader as api
from tensorflow.keras import mixed_precision

# Enable Mixed Precision Training for GPU memory optimization
mixed_precision.set_global_policy("mixed_float16")

# Load the text file
file_path = "/content/61262-0.txt"
with open(file_path, "r", encoding="utf-8") as file:
    text = file.read()

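# Remove Gutenberg headers and footers (marker wording can differ between files)
start_idx = text.find("*** START OF THIS PROJECT GUTENBERG")
end_idx = text.find("*** END OF THIS PROJECT GUTENBERG")
if start_idx != -1 and end_idx != -1:
    # Begin after the START marker line so the marker itself is not kept
    text = text[text.find("\n", start_idx) + 1:end_idx]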

# Clean text (Keep only words and spaces)
text = re.sub(r'[^a-zA-Z\s]', '', text).lower()
text = re.sub(r'\s+', ' ', text).strip()

# Tokenization and Vocabulary Reduction
MAX_VOCAB_SIZE = 30000
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, filters='', oov_token="<OOV>")
tokenizer.fit_on_texts([text])
total_words = min(MAX_VOCAB_SIZE, len(tokenizer.word_index)) + 1

# Reduce Sequence Length to Save Memory
MAXLEN = 100

# Load GloVe embeddings for selected words only
glove_vectors = api.load("glove-wiki-gigaword-100")
embedding_dim = 100
embedding_matrix = np.zeros((total_words, embedding_dim))

for word, i in tokenizer.word_index.items():
    if i < MAX_VOCAB_SIZE and word in glove_vectors:
        embedding_matrix[i] = glove_vectors[word]

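# Create input sequences
# The cleaning step collapsed all newlines, so the book is one token stream.
# Keep only the last MAXLEN tokens of each prefix: pad_sequences would truncate
# longer prefixes to that same window, and this avoids storing them in full.
token_list = tokenizer.texts_to_sequences([text])[0]
input_sequences = []
for i in range(1, len(token_list)):
    input_sequences.append(token_list[max(0, i + 1 - MAXLEN):i + 1])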

# Pad sequences
input_sequences = pad_sequences(input_sequences, maxlen=MAXLEN, padding='pre')

# Create input and output labels
X, y = input_sequences[:, :-1], input_sequences[:, -1]
y = tf.keras.utils.to_categorical(y, num_classes=total_words)

print(f"✅ Vocabulary Size: {total_words}")
print(f"✅ Maximum Sequence Length: {MAXLEN}")
print("✅ Data Preprocessing Completed Successfully")


✅ Vocabulary Size: 6265
✅ Maximum Sequence Length: 100
✅ Data Preprocessing Completed Successfully


# **2: Model Definition**
This cell defines the structure of the deep learning model for text generation. The model starts with an embedding layer that converts word indices into vector representations. These vectors are initialized using the GloVe embeddings loaded earlier, allowing the model to work with meaningful word relationships instead of just random numbers.
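
To make the lookup concrete, the sketch below builds a tiny embedding matrix from made-up 3-dimensional vectors and uses it to turn token ids into vectors. The vectors, the word index, and the `embed` helper are all hypothetical stand-ins for GloVe and the Keras `Embedding` layer.

```python
# Hypothetical 3-dimensional "pre-trained" vectors standing in for GloVe.
pretrained = {"poirot": [0.1, 0.2, 0.3], "case": [0.4, 0.5, 0.6]}
word_index = {"<OOV>": 1, "poirot": 2, "hastings": 3, "case": 4}
dim = 3

# Row 0 is the padding row; words without a pre-trained vector stay at zero.
matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
for word, i in word_index.items():
    if word in pretrained:
        matrix[i] = list(pretrained[word])

def embed(token_ids):
    """An embedding lookup is just row selection from the matrix."""
    return [matrix[t] for t in token_ids]

vectors = embed([2, 3, 4])
print(vectors)
```

Rows that start at zero (such as "hastings" here) can still be learned during training when the layer is trainable, as it is in this notebook.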

Next, two Long Short-Term Memory (LSTM) layers are added, with 256 and 128 units. These layers are designed for sequential data and help the model keep track of word order and context. To improve training stability, LayerNormalization is applied after each LSTM layer. Dropout layers with a rate of 0.3 follow, reducing overfitting by randomly zeroing some activations during training. A 128-unit dense layer with ReLU activation then feeds the output layer, a dense layer with a softmax activation that produces a probability for each word in the vocabulary. The output layer is kept in float32 because mixed precision is enabled. After the structure is defined, the model is built and summarized to check that all components are set up correctly.

### **Key Steps:**

* Defines an embedding layer initialized with GloVe word vectors.
* Adds two LSTM layers to learn word sequences and context.
* Uses LayerNormalization for stable training.
* Includes Dropout layers to reduce overfitting.
* Ends with a softmax output layer that predicts the next word.
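
The model summary does not appear in the output of this notebook, so as a rough cross-check the parameter counts can be worked out by hand from the standard formulas for each layer type. The helper functions are hypothetical; the vocabulary size of 6,265 comes from the preprocessing output, and the layer sizes from the cell below.

```python
def embedding_params(vocab, dim):
    return vocab * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (input_dim * units + units * units + units)

def layernorm_params(features):
    return 2 * features  # one scale and one offset per feature

def dense_params(input_dim, units):
    return input_dim * units + units

vocab, dim = 6265, 100
counts = {
    "embedding": embedding_params(vocab, dim),
    "lstm_256": lstm_params(dim, 256),
    "layernorm_256": layernorm_params(256),
    "lstm_128": lstm_params(256, 128),
    "layernorm_128": layernorm_params(128),
    "dense_128": dense_params(128, 128),
    "dense_softmax": dense_params(128, vocab),
}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:>14}: {n:,}")
print(f"{'total':>14}: {total:,}")
```

Under these formulas, the embedding table and the softmax layer together account for well over half of the parameters, since both scale with the vocabulary size.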

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, LayerNormalization

model = Sequential([
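    # Initialize from the GloVe matrix. A Constant initializer is more portable
    # across Keras versions than the older `weights=` and `input_length=` arguments.
    Embedding(input_dim=total_words, output_dim=embedding_dim, embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix), trainable=True),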
    LSTM(256, return_sequences=True),
    LayerNormalization(),
    Dropout(0.3),
    LSTM(128),
    LayerNormalization(),
    Dropout(0.3),
    Dense(128, activation='relu'),
    Dense(total_words, activation='softmax', dtype='float32')
])

# Build model before running summary
model.build(input_shape=(None, MAXLEN-1))
model.summary()
print("✅ Model Built and Ready for Training")




✅ Model Built and Ready for Training


# **3: Model Training & Saving**
This cell trains the model and saves the best results. Since training takes time, the model is saved to Google Drive so that progress is not lost if the Colab session ends. The optimizer is Adam with a learning rate of 0.0003. Gradient clipping (`clipnorm=1.0`) limits the norm of each gradient so that a single unusually large update cannot destabilize training. The loss function is categorical cross-entropy, since the model picks one word out of the whole vocabulary at each step.
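
Clipping by norm can be illustrated on a single gradient vector. This is a minimal sketch of the rescaling rule, with a hypothetical helper `clip_by_norm`; in Keras, `clipnorm` applies the same idea to the gradient of each weight tensor separately.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale `grad` so its Euclidean norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)            # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]  # direction kept, length reduced

large = clip_by_norm([3.0, 4.0], max_norm=1.0)   # original norm is 5
small = clip_by_norm([0.3, 0.4], max_norm=1.0)   # original norm is 0.5
print(large, small)
```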

Several callbacks support training. Early stopping monitors the training loss (no validation split is used) and stops training after 8 epochs without improvement, restoring the best weights. A learning rate scheduler (`ReduceLROnPlateau`) halves the learning rate after 5 epochs without improvement, down to a minimum of 1e-6. Model checkpointing saves the best version of the model seen so far. The model is trained for up to 250 epochs with a batch size of 64. Once training is complete, the final model and tokenizer are saved to Google Drive so they can be loaded later for text generation.

### **Key Steps:**

* Mounts Google Drive to store training results safely.
* Uses Adam optimizer with gradient clipping for stable learning.
* Applies categorical cross-entropy loss for multi-class word prediction.
* Implements early stopping, learning rate adjustment, and checkpoint saving.
* Trains the model for up to 250 epochs and saves the final version.
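
How the learning rate reduction and early stopping interact can be seen in a small simulation over a made-up loss curve. This is a simplified sketch, not the Keras callbacks themselves: it ignores details such as `min_delta` and cooldown, and the function `simulate` and the loss values are assumptions.

```python
def simulate(losses, lr=3e-4, factor=0.5, lr_patience=5,
             stop_patience=8, min_lr=1e-6):
    """Return (epoch, learning rate after that epoch) until training would stop."""
    best = float("inf")
    since_best = 0      # epochs since the best loss (early stopping counter)
    since_lr_cut = 0    # non-improving epochs since the last LR change
    history = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, since_best, since_lr_cut = loss, 0, 0
        else:
            since_best += 1
            since_lr_cut += 1
        if since_lr_cut >= lr_patience:
            lr = max(lr * factor, min_lr)  # halve, but never below min_lr
            since_lr_cut = 0
        history.append((epoch, lr))
        if since_best >= stop_patience:
            break  # early stopping triggers
    return history

# Hypothetical loss curve: two improving epochs, then a plateau.
history = simulate([5.0, 4.0] + [4.5] * 10)
for epoch, lr in history:
    print(epoch, lr)
```

With these settings the learning rate is cut once on the plateau before early stopping ends the run, which is the intended ordering when the scheduler's patience is shorter than the early-stopping patience.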

In [None]:
from google.colab import drive
import os

# Mount Google Drive
drive.mount('/content/drive')

# Define save path
drive_path = "/content/drive/MyDrive/Poirot_LSTM"
os.makedirs(drive_path, exist_ok=True)

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from tensorflow.keras.optimizers import Adam
import json

# Define optimizer
optimizer = Adam(learning_rate=0.0003, clipnorm=1.0)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

# Define Callbacks
early_stopping = EarlyStopping(monitor='loss', patience=8, restore_best_weights=True)
lr_scheduler = ReduceLROnPlateau(monitor='loss', factor=0.5, patience=5, min_lr=1e-6)
checkpoint = ModelCheckpoint(f"{drive_path}/poirot_lstm_best_model.keras", save_best_only=True, monitor='loss')

# Train model
num_epochs = 250
batch_size = 64
history = model.fit(X, y, epochs=num_epochs, batch_size=batch_size, verbose=1, callbacks=[early_stopping, lr_scheduler, checkpoint])

# Save final model
model.save(f"{drive_path}/poirot_lstm_final_model.keras")

# Save tokenizer
tokenizer_json = tokenizer.to_json()
with open(f"{drive_path}/tokenizer.json", "w", encoding="utf-8") as f:
    f.write(tokenizer_json)

print("✅ Training Complete & Model + Tokenizer Saved to Google Drive!")


Mounted at /content/drive
Epoch 1/250
821/821 ━━━━━━━━━━━━━━━━━━━━ 27s 21ms/step - accuracy: 0.0552 - loss: 7.0388 - learning_rate: 3.0000e-04
Epoch 2/250
821/821 ━━━━━━━━━━━━━━━━━━━━ 36s 20ms/step - accuracy: 0.0638 - loss: 6.3584 - learning_rate: 3.0000e-04
Epoch 3/250
821/821 ━━━━━━━━━━━━━━━━━━━━ 21s 20ms/step - accuracy: 0.0852 - loss: 6.0302 - learning_rate: 3.0000e-04
Epoch 4/250
821/821 ━━━━━━━━━━━━━━━━━━━━ 17s 20ms/step - accuracy: 0.0998 - loss: 5.8132 - learning_rate: 3.0000e-04
Epoch 5/250
821/821 ━━━━━━━━━━━━━━━━━━━━ 21s 20ms/step - accuracy: 0.1053 - loss: 5.6134 - learning_rate: 3.0000e-04
Epoch 6/250
821/821 ━━━━━━━━━━━━━━━━━━━━ 17s 20ms/step - accuracy: 0.1123 - loss: 5.4867 - learning_rate: 3.0000e-04
Epoch 7/250
821/821 ━━━━━━━━━━━━━━━━━━━━

# **4: Model Loading & Text Generation**
This cell loads the trained model and tokenizer to generate new text. It first mounts Google Drive to access the saved model files. The trained model and tokenizer are then loaded from their respective locations. If the files are missing, the script prints an error message to alert the user.

The text generation function takes an input prompt (a few words) and extends it one word at a time. At each step it converts the current text into numerical tokens and pads them to match the training format. The model then predicts a probability for every word in the vocabulary. To make the text more diverse and less repetitive, temperature scaling is applied to these probabilities: a temperature below 1 sharpens the distribution and makes predictions more deterministic, while a temperature above 1 flattens it and introduces more randomness. In addition, a sampling rate sets the probability that the next word is drawn at random from the scaled distribution; otherwise the single most likely word is chosen. The function is demonstrated by generating text starting with "The great", showing how the model continues the sentence in Agatha Christie’s writing style.
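
The effect of temperature and the sampling rate can be seen on a toy distribution. The sketch below uses only the standard library; `apply_temperature` and `pick` are hypothetical helpers that mirror the idea, not the notebook's `generate_text` function, and the small `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math
import random

def apply_temperature(probs, temperature, eps=1e-9):
    """Rescale probabilities as p ** (1 / temperature), then renormalize."""
    logits = [math.log(p + eps) / temperature for p in probs]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def pick(probs, temperature=0.8, sampling_rate=0.7, rng=random):
    """Sample with probability `sampling_rate`, otherwise take the argmax."""
    scaled = apply_temperature(probs, temperature)
    if rng.random() < sampling_rate:
        return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]
    return max(range(len(scaled)), key=scaled.__getitem__)

probs = [0.7, 0.2, 0.1, 0.0]
sharp = apply_temperature(probs, 0.5)  # low temperature: more peaked
flat = apply_temperature(probs, 2.0)   # high temperature: more uniform
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(pick(probs, sampling_rate=0.0))  # always the most likely index
```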

### **Key Steps:**

* Mounts Google Drive and loads the trained model and tokenizer.
* Checks if files are missing and displays an error if needed.
* Converts input text into numerical tokens and pads them.
* Uses temperature scaling and sampling to generate diverse text.
* Demonstrates text generation with an example prompt.

In [None]:
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.text import tokenizer_from_json
from google.colab import drive
import os
import json  # Ensure JSON module is imported
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Mount Google Drive
drive.mount('/content/drive')

# Define save path
drive_path = "/content/drive/MyDrive/Poirot_LSTM"
os.makedirs(drive_path, exist_ok=True)

# Load trained model
model_path = f"{drive_path}/poirot_lstm_final_model.keras"
tokenizer_path = f"{drive_path}/tokenizer.json"

if os.path.exists(model_path):
    model = load_model(model_path)
    print("✅ Model Loaded Successfully!")
else:
    print("❌ Model file not found!")

if os.path.exists(tokenizer_path):
    with open(tokenizer_path, "r", encoding="utf-8") as f:
        # The file already holds the JSON string written by tokenizer.to_json()
        tokenizer = tokenizer_from_json(f.read())
    print("✅ Tokenizer Loaded Successfully!")
else:
    print("❌ Tokenizer file not found!")

# Ensure MAXLEN is defined
MAXLEN = 100  # Set this to the correct max length used during training

# Generate Text
def generate_text(seed_text, next_words=20, temperature=0.8, sampling_rate=0.7):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=MAXLEN-1, padding='pre')
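        predicted_probs = model.predict(token_list, verbose=0)[0].astype("float64")
        # Apply temperature; the small epsilon avoids log(0) warnings
        predicted_probs = np.exp(np.log(predicted_probs + 1e-9) / temperature)
        predicted_probs /= np.sum(predicted_probs)
        if np.random.rand() < sampling_rate:
            # Sample the next word from the temperature-scaled distribution
            predicted = np.random.choice(len(predicted_probs), p=predicted_probs)
        else:
            # Otherwise take the most likely word
            predicted = np.argmax(predicted_probs)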
        seed_text += " " + tokenizer.index_word.get(predicted, "<OOV>")
    return seed_text

# Example Usage
print(generate_text("The great", next_words=20, temperature=0.8))


Mounted at /content/drive




✅ Model Loaded Successfully!
✅ Tokenizer Loaded Successfully!




The great financier was perfectly right in the afternoon poirot gave a policeman getting of his own flesh and cry the nephew
