# Copyright

<PRE>
Copyright (c) Bálint Gyires-Tóth - All Rights Reserved
You may use and modify this code for research and development purpuses.
Using this code for educational purposes (self-paced or instructor led) without the permission of the author is prohibited.
</PRE>

# Assignment: RNN text generation with your favorite book


## 1. Dataset
- Download your favorite book from https://www.gutenberg.org/
- Split into training (80%) and validation (20%).

In [1]:
with open('book.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Clean whitespace
text = text.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ').replace('  ', ' ')

# 80-20 split
split_idx = int(len(text) * 0.8)
train_text = text[:split_idx]
val_text = text[split_idx:]

## 2. Preprocessing
- Convert text to lowercase.  
- Remove punctuation (except basic sentence delimiters).  
- Tokenize by words or characters (your choice).  
- Build a vocabulary (map each unique word to an integer ID).

In [2]:
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import tensorflow as tf

# Lowercase and remove punctuation (except period/question/exclamation)
def preprocess(text):
    text = text.lower()
    text = re.sub(r"[^a-z0-9.?! ]+", "", text)
    return text

train_text = preprocess(train_text)
val_text = preprocess(val_text)

# Tokenize by word
tokenizer = Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts([train_text])
vocab_size = len(tokenizer.word_index) + 1

train_seq = tokenizer.texts_to_sequences([train_text])[0]

# Create n-gram sequences
input_sequences = []
for i in range(2, len(train_seq)):
    input_sequences.append(train_seq[:i])

# Pad sequences to same length
max_seq_len = max(len(seq) for seq in input_sequences)
input_sequences = pad_sequences(input_sequences, maxlen=max_seq_len, padding='pre')

# Split into input (xs) and label (ys)
xs = input_sequences[:, :-1]
ys = input_sequences[:, -1]
ys = tf.keras.utils.to_categorical(ys, num_classes=vocab_size)

## 3. Embedding Layer in Keras
Below is a minimal example of defining an `Embedding` layer:
```python
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,     # size of the vocabulary
    output_dim=128,           # embedding vector dimension
    input_length=sequence_length
)
```
- This layer transforms integer-encoded sequences (word IDs) into dense vector embeddings.

- Feed these embeddings into your LSTM or GRU OR 1D CNN layer.

In [3]:
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=128,
    input_length=max_seq_len - 1
)



## 4. Model
- Implement an LSTM or GRU or 1D CNN-based language model with:
  - **The Embedding layer** as input.
  - At least **one recurrent layer** (e.g., `LSTM(256)` or `GRU(256)` or your custom 1D CNN).
  - A **Dense** output layer with **softmax** activation for word prediction.
- Train for about **5–10 epochs** so it can finish in approximately **2 hours** on a standard machine.


In [4]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam

model = Sequential([
    embedding_layer,
    LSTM(256, return_sequences=True),
    Dropout(0.3),
    LSTM(256),
    Dropout(0.3),
    Dense(vocab_size, activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])
model.summary()

2025-04-23 23:24:57.550626: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M1 Pro
2025-04-23 23:24:57.550822: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 16.00 GB
2025-04-23 23:24:57.550848: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 5.33 GB
I0000 00:00:1745443497.552076   45990 pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
I0000 00:00:1745443497.553129   45990 pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


## 5. Training & Evaluation
- **Monitor** the loss on both training and validation sets.
- **Perplexity**: a common metric for language models.
  - It is the exponent of the average negative log-likelihood.
  - If your model outputs cross-entropy loss `H`, then `perplexity = e^H`.
  - Try to keep the validation perplexity **under 50** if possible. If you have higher value (which is possible) try to draw conclusions, why doesn't it decrease to a lower value.

In [None]:
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
import math

early_stop = EarlyStopping(monitor='val_loss', patience=2, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=1, min_lr=1e-6, verbose=1)

# Evaluate on validation set
val_seq = tokenizer.texts_to_sequences([val_text])[0]
val_input_sequences = []
for i in range(2, len(val_seq)):
    val_input_sequences.append(val_seq[:i])
val_input_sequences = pad_sequences(val_input_sequences, maxlen=max_seq_len, padding='pre')
val_x = val_input_sequences[:, :-1]
val_y = val_input_sequences[:, -1]
val_y_cat = tf.keras.utils.to_categorical(val_y, num_classes=vocab_size)

# Train
history = model.fit(
    xs, ys,
    validation_data=(val_x, val_y_cat),
    epochs=10,
    callbacks=[early_stop, reduce_lr],
    verbose=1
)

val_loss = model.evaluate(val_x, val_y_cat, verbose=0)
val_perplexity = math.exp(val_loss[0])
print(f"Validation Perplexity: {val_perplexity:.2f}")

Epoch 1/10


2025-04-23 23:25:02.227092: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.


## 6. Generation Criteria
- After training, generate **two distinct text samples**, each at least **50 tokens**.
- Use **different seed phrases** (e.g., “love is” vs. “time will”).

In [None]:
# Updated text generation for word-level model
def generate_text(model, seed_text, word2idx, idx2word, seq_length, num_words):
    import numpy as np
    # Convert seed text to indices
    input_seq = [word2idx.get(w, 0) for w in seed_text.split()][-seq_length:]
    input_seq = np.array(input_seq)[None, :]
    generated = []
    for _ in range(num_words):
        preds = model.predict(input_seq, verbose=0)[0]
        next_id = np.argmax(preds)
        generated.append(idx2word[next_id])
        # slide window
        input_seq = np.concatenate([input_seq[:, 1:], [[next_id]]], axis=1)
    return seed_text + ' ' + ' '.join(generated)
# Example generation
seed1 = 'love is'
print(generate_text(model, seed1, word2idx, idx2word, sequence_length, 50))

love is <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK>


In [None]:
seed2 = 'time will'
print(generate_text(model, seed2, word2idx, idx2word, sequence_length, 50))

time will <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK>
