# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purposes.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Split into training (80%) and validation (20%).

In [1]:
import tensorflow as tf
import numpy as np
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam
import math


In [None]:
with open('book.txt', 'r', encoding='utf-8') as f:
    text = f.read()

text = text.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ').replace('  ', ' ')

split_idx = int(len(text) * 0.8)
train_text = text[:split_idx]
val_text = text[split_idx:]
print(f"Length of training text: {len(train_text)} characters")
print(f"Length of validation text: {len(val_text)} characters")

Length of training text: 59916 characters
Length of validation text: 14979 characters
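Project Gutenberg files usually wrap the book in a license header and footer, which would otherwise end up in the training data. The sketch below is one way to cut them off before the whitespace cleanup. It assumes the file contains marker lines starting with `*** START OF` and `*** END OF`; the exact wording varies between books, and the helper name `strip_gutenberg_boilerplate` is hypothetical, not part of the notebook.

```python
def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the '*** START OF' and '*** END OF' marker lines.

    Assumption: the file uses these marker prefixes. If a marker is missing,
    the text is left untouched on that side.
    """
    start_marker = "*** START OF"
    end_marker = "*** END OF"

    start = raw.find(start_marker)
    if start != -1:
        # Drop everything up to and including the marker line itself.
        newline = raw.find("\n", start)
        raw = raw[newline + 1:] if newline != -1 else ""

    end = raw.find(end_marker)
    if end != -1:
        raw = raw[:end]

    return raw.strip()


sample = (
    "Header\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK X ***\n"
    "Body text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK X ***\n"
    "License"
)
print(strip_gutenberg_boilerplate(sample))
```

Because the helper looks for line breaks, it would have to run on the raw file contents, before newlines are replaced with spaces.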


## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).
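To make the vocabulary step concrete, here is a minimal hand-rolled sketch of a character-level vocabulary. The helper names (`build_char_vocab`, `encode`, `decode`) are illustrative only. IDs are assigned in sorted order, whereas the Keras `Tokenizer` orders by frequency, so the actual IDs will differ from what the notebook produces.

```python
def build_char_vocab(text):
    """Map each distinct character to an integer ID starting at 1.

    ID 0 is left unused so it can serve as the padding value.
    """
    chars = sorted(set(text))
    char_to_id = {ch: i for i, ch in enumerate(chars, start=1)}
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return char_to_id, id_to_char


def encode(text, char_to_id):
    # Characters that are not in the vocabulary are skipped.
    return [char_to_id[ch] for ch in text if ch in char_to_id]


def decode(ids, id_to_char):
    # Unknown IDs (including the padding ID 0) decode to nothing.
    return "".join(id_to_char.get(i, "") for i in ids)


char_to_id, id_to_char = build_char_vocab("to be or not to be.")
ids = encode("to be", char_to_id)
print(ids)
print(decode(ids, id_to_char))
print("vocab size including padding ID:", len(char_to_id) + 1)
```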

In [None]:
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import tensorflow as tf

def preprocess(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9.?! ]+", "", text)
    return text

train_text = preprocess(train_text)
val_text = preprocess(val_text)

In [4]:
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts([train_text])
vocab_size = len(tokenizer.word_index) + 1
max_seq_len = 100
def create_char_sequences(text, tokenizer, seq_length):
    seq = tokenizer.texts_to_sequences([text])[0]
    input_sequences = []
    for i in range(seq_length, len(seq)):
        input_sequences.append(seq[i-seq_length:i+1])
    input_sequences = np.array(input_sequences)
    x = input_sequences[:,:-1]
    y = input_sequences[:,-1]
    y = tf.keras.utils.to_categorical(y, num_classes=vocab_size)
    return x, y

xs, ys = create_char_sequences(train_text, tokenizer, max_seq_len)
val_x, val_y_cat = create_char_sequences(val_text, tokenizer, max_seq_len)
print(f'Char-level vocab size: {vocab_size}')
print(f'xs shape: {xs.shape}, ys shape: {ys.shape}')
print(f'val_x shape: {val_x.shape}, val_y_cat shape: {val_y_cat.shape}')

Char-level vocab size: 41
xs shape: (58470, 100), ys shape: (58470, 41)
val_x shape: (14426, 100), val_y_cat shape: (14426, 41)
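The windowing loop can also be written without a Python loop. The sketch below uses NumPy's `sliding_window_view` and keeps the targets as integer IDs instead of one-hot vectors; with integer targets the model would be compiled with `sparse_categorical_crossentropy` rather than `categorical_crossentropy`. The function name `make_windows` is hypothetical, and the sequence is assumed to be longer than `seq_length`.

```python
import numpy as np


def make_windows(token_ids, seq_length):
    """Split a 1-D ID sequence into (input window, next-token) pairs."""
    ids = np.asarray(token_ids)
    # Each row holds seq_length input IDs followed by the one target ID.
    windows = np.lib.stride_tricks.sliding_window_view(ids, seq_length + 1)
    x = windows[:, :-1]
    y = windows[:, -1]  # integer targets, not one-hot
    return x, y


x, y = make_windows(range(10), seq_length=4)
print(x.shape, y.shape)
print(x[0], "->", y[0])
```

This yields the same number of windows as the loop version (`len(seq) - seq_length`) while avoiding the memory cost of a one-hot label matrix.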


## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM, GRU, or 1D CNN layer.
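Conceptually, an embedding layer is a lookup into a trainable table with one row per token ID. The NumPy sketch below illustrates only that lookup; the random table stands in for learned weights, and the embedding dimension of 8 is an arbitrary illustrative value.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 41, 8  # 41 matches the vocabulary size above; 8 is illustrative
table = rng.normal(size=(vocab_size, embed_dim))  # stand-in for the trainable weights

batch = np.array([[5, 2, 9], [1, 1, 40]])  # shape (batch, time), integer IDs
embedded = table[batch]                    # shape (batch, time, embed_dim)

print(embedded.shape)
```

Identical IDs always map to identical vectors, which is why the second row of `batch` starts with two equal embeddings.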

In [5]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=256,
    input_length=max_seq_len,
)



## 4. Model
- Implement an LSTM or GRU or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)` or your custom 1D CNN).
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.


In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
model = Sequential([
    embedding_layer,
    Bidirectional(LSTM(256, return_sequences=True)),
    Dropout(0.3),
    Bidirectional(LSTM(256)),
    Dropout(0.3),
    Dense(vocab_size, activation='softmax')
])
optimizer = Adam(learning_rate=0.001, clipnorm=1.0)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.summary()

I0000 00:00:1745454057.657718 1185200 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9770 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3060, pci bus id: 0000:01:00.0, compute capability: 8.6


## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponential of the average negative log-likelihood.
  - If your model outputs cross-entropy loss `H`, then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible. If it stays higher (which can happen), try to explain why it does not decrease further.
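As a small worked example of `perplexity = e^H`, the sketch below converts a few cross-entropy values to perplexity. The loss values are the validation losses of epochs 1 to 5 from the training log in this notebook. The uniform baseline assumes a model that assigns equal probability to each of the 41 vocabulary entries. Keras reports cross-entropy in nats, so the natural exponential is the right inverse.

```python
import math


def perplexity(cross_entropy_nats):
    """Convert an average cross-entropy (natural log) into perplexity."""
    return math.exp(cross_entropy_nats)


# Validation losses for epochs 1-5, copied from the training log.
val_losses = [2.2859, 2.0593, 1.8778, 1.7627, 1.7000]
perplexities = [perplexity(h) for h in val_losses]

for epoch, (h, ppl) in enumerate(zip(val_losses, perplexities), start=1):
    print(f"epoch {epoch}: loss={h:.4f}  perplexity={ppl:.2f}")

# A model guessing uniformly over 41 characters has perplexity 41.
uniform_baseline = perplexity(math.log(41))
print(f"uniform baseline: {uniform_baseline:.2f}")
```

Note that a character-level perplexity is not comparable to a word-level one: the target of 50 is much easier to reach when each prediction chooses among a few dozen characters instead of thousands of words.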

In [None]:
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=1, min_lr=1e-6, verbose=1)
history = model.fit(
    xs, ys,
    validation_data=(val_x, val_y_cat),
    epochs=30,
    callbacks=[early_stop, reduce_lr],
    batch_size=256,
    verbose=1
)

Epoch 1/30
229/229 ━━━━━━━━━━━━━━━━━━━━ 29s 99ms/step - accuracy: 0.2058 - loss: 2.8468 - val_accuracy: 0.3266 - val_loss: 2.2859 - learning_rate: 0.0010
Epoch 2/30
229/229 ━━━━━━━━━━━━━━━━━━━━ 22s 97ms/step - accuracy: 0.3487 - loss: 2.1769 - val_accuracy: 0.3853 - val_loss: 2.0593 - learning_rate: 0.0010
Epoch 3/30
229/229 ━━━━━━━━━━━━━━━━━━━━ 22s 97ms/step - accuracy: 0.4159 - loss: 1.9492 - val_accuracy: 0.4396 - val_loss: 1.8778 - learning_rate: 0.0010
Epoch 4/30
229/229 ━━━━━━━━━━━━━━━━━━━━ 22s 97ms/step - accuracy: 0.4682 - loss: 1.7755 - val_accuracy: 0.4840 - val_loss: 1.7627 - learning_rate: 0.0010
Epoch 5/30
229/229 ━━━━━━━━━━━━━━━━━━━━ 22s 97ms/step - accuracy: 0.5054 - loss: 1.6457 - val_accuracy: 0.4992 - val_loss: 1.7000 - learning_rate: 0.0010
Epoch 6/30
229/229 ━━━━━━━━━━━━━━━━━━━━ 2

In [19]:
val_loss, val_acc = model.evaluate(val_x, val_y_cat, verbose=1)
val_perplexity = math.exp(val_loss)
print(f"Updated Validation Perplexity: {val_perplexity:.2f}")

451/451 ━━━━━━━━━━━━━━━━━━━━ 5s 10ms/step - accuracy: 0.5295 - loss: 1.5853
Updated Validation Perplexity: 5.06


## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).

In [23]:
def generate_text(seed_text, next_words=60):
    # The tokenizer is character-level, so `next_words` counts generated characters.
    result = seed_text
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([result])[0]
        # Left-pad (or truncate) to the window length the model was trained on.
        token_list = pad_sequences([token_list], maxlen=max_seq_len, padding='pre')
        predicted = model.predict(token_list, verbose=0)
        # Greedy decoding: always pick the most probable next character.
        predicted_index = np.argmax(predicted, axis=1)[0]
        result += tokenizer.index_word.get(predicted_index, '')
    return result

print("=== Sample 1 ===")
print(generate_text("his death was"))

print("\n=== Sample 2 ===")
print(generate_text("her father took"))

=== Sample 1 ===
his death was a story of the man of the story of the story of the story o

=== Sample 2 ===
her father took who had been so walter that he was a stirit to the story of
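The loop "the story of the story of" in Sample 1 is typical of greedy decoding, which always picks the most probable next character. A common alternative is temperature sampling, sketched below. This is not what the notebook used; the function name `sample_with_temperature` and the default temperature are assumptions for illustration.

```python
import numpy as np


def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw one index from a probability vector after temperature scaling.

    temperature < 1 sharpens the distribution (closer to argmax),
    temperature > 1 flattens it (more random output).
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    # Work in log space; the small constant guards against log(0).
    logits = np.log(probs + 1e-9) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))


rng = np.random.default_rng(0)
probs = [0.7, 0.2, 0.1]
cold = [sample_with_temperature(probs, 0.01, rng) for _ in range(50)]
warm = [sample_with_temperature(probs, 1.5, rng) for _ in range(50)]
print(set(cold), set(warm))
```

In `generate_text`, the `np.argmax(predicted, axis=1)[0]` call could be swapped for something like `sample_with_temperature(predicted[0], temperature=0.7)`; the best temperature would need to be found by experiment.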
