# Chapter 10: Natural language processing with TensorFlow: Language modeling

This notebook reproduces the code and summarizes the theoretical concepts from Chapter 10 of *'TensorFlow in Action'* by Thushan Ganegedara.

This chapter focuses on **language modeling**—the task of predicting the next token (a word or character) in a sequence. This is a fundamental task in NLP that enables models to generate text.

We will cover:
1.  **Data Processing**: How to process a raw text corpus, use n-grams to manage vocabulary size, and build an efficient `tf.data` pipeline.
2.  **Model Implementation**: Building a language model using a **Gated Recurrent Unit (GRU)**, which is similar to an LSTM.
3.  **Model Evaluation**: Creating a custom **Perplexity** metric to evaluate the quality of the language model.
4.  **Text Generation**: Using the trained model for inference, including **Greedy Decoding** and the more advanced **Beam Search**.

---

## 10.1 Processing the Data

Language modeling is a self-supervised task: no manual labels are needed, because they are generated from the data itself. The input is a sequence of tokens, and the target is the same sequence shifted by one position, so each input token is paired with the token that follows it.

**Input**: `[ "The", "cat", "sat" ]`
**Target**: `[ "cat", "sat", "on" ]`
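
As a minimal sketch of how such a pair can be derived from a single token list (the sentence and the variable names are illustrative, not taken from the book):

```python
# Illustrative tokens; any tokenized sequence works the same way.
tokens = ["The", "cat", "sat", "on"]

# The input drops the last token and the target drops the first one,
# so targets[i] is always the token that follows inputs[i].
inputs = tokens[:-1]
targets = tokens[1:]

for x, y in zip(inputs, targets):
    print(f"{x!r} -> {y!r}")
```

Every position in the input contributes one training example, which is why a single sequence of length *n* yields *n* predictions.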

### 10.1.3 N-grams

A major challenge in language modeling is large vocabulary size. A model that predicts the next *word* might have to choose from 50,000+ possibilities. The book uses **n-grams** (sequences of *n* characters) to solve this.

Using 2-grams (bigrams), for example, dramatically reduces the vocabulary. The word "hello" becomes `["he", "ll", "o"]` (the last chunk is shorter when the length is not a multiple of *n*). This allows the model to work with a much smaller vocabulary and even to produce words it has never seen before by combining known n-grams.
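
The splitting can be sketched in a few lines. The helper name `split_ngrams` and the 27-symbol alphabet (26 letters plus space) are assumptions made for this illustration only:

```python
def split_ngrams(text, n):
    """Split text into consecutive, non-overlapping chunks of n characters."""
    return [text[i:i + n] for i in range(0, len(text), n)]

print(split_ngrams("hello", 2))   # ['he', 'll', 'o']
print(split_ngrams("hello", 3))   # ['hel', 'lo']

# Assuming an alphabet of 26 letters plus space, there are at most
# 27**2 two-character chunks plus 27 one-character leftovers.
alphabet_size = 27
max_bigram_vocab = alphabet_size ** 2 + alphabet_size
print(max_bigram_vocab)           # 756
```

Under that assumption the bigram vocabulary is bounded by a few hundred entries, far below the 50,000+ entries of a word-level vocabulary.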

In [1]:
import tensorflow as tf
from tensorflow.keras import layers, models
import tensorflow.keras.backend as K
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
import os
import requests
import tarfile
from collections import Counter
from itertools import chain
import pickle

# --- 1. Download and Read Data (Simulated from book) ---
# The book uses the Children's Book Test (CBT) corpus from the bAbI project.
# Here we write tiny stand-in files instead of downloading it.
data_dir = os.path.join('data', 'lm', 'CBTest', 'data')
os.makedirs(data_dir, exist_ok=True)
train_path = os.path.join(data_dir, 'cbt_train.txt')
valid_path = os.path.join(data_dir, 'cbt_valid.txt')
test_path = os.path.join(data_dir, 'cbt_test.txt')

# Create dummy data files for demonstration
if not os.path.exists(train_path):
    with open(train_path, 'w') as f:
        f.write("_BOOK_TITLE_ A Simple Story\n")
        f.write("Once upon a time, there was a fox.\n")
        f.write("The fox was quick and brown.\n")
        f.write("_BOOK_TITLE_ Another Story\n")
        f.write("A dog and a cat were friends.\n")
        f.write("They played in the yard.\n")

    with open(valid_path, 'w') as f:
        f.write("_BOOK_TITLE_ A Validation Story\n")
        f.write("The sun was bright.\n")

    with open(test_path, 'w') as f:
        f.write("_BOOK_TITLE_ A Test Story\n")
        f.write("The moon was full.\n")

def read_data(path):
    stories = []
    with open(path, 'r') as f:
        s = []
        for row in f:
            if row.startswith("_BOOK_TITLE_"):
                if len(s) > 0:
                    stories.append(' '.join(s).lower())
                s = []
            s.append(row.strip()) # Add strip() to remove newlines
        if len(s) > 0:
            stories.append(' '.join(s).lower())
    return stories

stories = read_data(train_path)
val_stories = read_data(valid_path)
test_stories = read_data(test_path)

print(f"Loaded {len(stories)} training stories.")
print(stories[0][:100]) # Print first 100 chars of first story

Loaded 2 training stories.
_book_title_ a simple story once upon a time, there was a fox. the fox was quick and brown.


In [2]:
import pandas as pd

# --- 2. N-gram and Tokenizer Processing ---

# Function to get n-grams
def get_ngrams(text, n):
    return [text[i:i+n] for i in range(0, len(text), n)]

ngrams = 2 # We'll use 2-grams (bigrams)
train_ngram_stories = [get_ngrams(s, ngrams) for s in stories]

# Calculate vocabulary size (e.g., all n-grams appearing >= 10 times)
# In our small demo, we'll use a threshold of 1
text_corpus = chain(*train_ngram_stories)
cnt = Counter(text_corpus)
freq_df = pd.Series(list(cnt.values()), index=list(cnt.keys())).sort_values(ascending=False)

n_vocab = (freq_df >= 1).sum() # For demo, use 1. Book uses 10.
print(f"\nN-gram vocabulary size: {n_vocab}")
print("Most common n-grams:")
print(freq_df.head())

# --- 3. Tokenize Data ---
tokenizer = Tokenizer(num_words=n_vocab, oov_token='unk', lower=False)

# Fit tokenizer on training n-grams
tokenizer.fit_on_texts(train_ngram_stories)

# Convert all datasets to sequences of integer IDs
train_data_seq = tokenizer.texts_to_sequences(train_ngram_stories)

val_ngram_stories = [get_ngrams(s, ngrams) for s in val_stories]
val_data_seq = tokenizer.texts_to_sequences(val_ngram_stories)

test_ngram_stories = [get_ngrams(s, ngrams) for s in test_stories]
test_data_seq = tokenizer.texts_to_sequences(test_ngram_stories)

print("\nOriginal text:")
print(train_ngram_stories[0][:15])
print("\nTokenized sequence:")
print(train_data_seq[0][:15])


N-gram vocabulary size: 57
Most common n-grams:
 a    6
e     4
ti    3
er    3
th    3
dtype: int64

Original text:
['_b', 'oo', 'k_', 'ti', 'tl', 'e_', ' a', ' s', 'im', 'pl', 'e ', 'st', 'or', 'y ', 'on']

Tokenized sequence:
[8, 9, 10, 4, 11, 12, 2, 13, 22, 14, 3, 23, 24, 15, 25]


### 10.1.5 Defining a `tf.data` pipeline

We now create a pipeline that takes our long list of token sequences and turns it into `(input, target)` batches for training.

1.  `from_tensor_slices`: Creates a dataset from our list of stories.
2.  `flat_map` + `window`: This is the key part. It slides a `window` (of size `n_seq + 1`) across each story, creating many overlapping subsequences.
3.  `shuffle`: Shuffles these windows.
4.  `batch`: Groups the windows into batches.
5.  `map`: Splits each window `[t_0, t_1, ..., t_n]` into an input `x = [t_0, ..., t_{n-1}]` and a target `y = [t_1, ..., t_n]`.
6.  `prefetch`: Optimizes performance by pre-loading the next batch while the current one is processing.
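
The windowing and splitting steps can be illustrated without TensorFlow. The following is a pure-Python sketch of what `window` followed by the final `map` produces for one story; `make_windows` and the token IDs are made up for this example:

```python
def make_windows(seq, n_seq, shift=1):
    """Slide a window of n_seq + 1 tokens over seq, dropping incomplete windows."""
    size = n_seq + 1
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, shift)]

story = [10, 11, 12, 13, 14, 15]   # made-up token IDs
for window in make_windows(story, n_seq=3):
    # Same split as the pipeline's map step: y is x shifted by one token.
    x, y = window[:-1], window[1:]
    print(x, "->", y)
```

A story of length *L* therefore yields `L - n_seq` windows when `shift=1`, and none at all if it is shorter than `n_seq + 1` tokens.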

In [10]:
# Based on Listing 10.3
def get_tf_pipeline(data_seq, n_seq, batch_size=64, shift=1, shuffle=True):
    """Converts sequences of text IDs into (input, target) batches."""

    # Use RaggedTensor to handle stories of different lengths
    text_ds = tf.data.Dataset.from_tensor_slices(tf.ragged.constant(data_seq))

    if shuffle:
        # Ensure buffer_size is at least 1, even for very small datasets
        buffer_size_stories = max(1, len(data_seq) // 2)
        text_ds = text_ds.shuffle(buffer_size=buffer_size_stories)

    # Use flat_map to apply windowing to each story individually
    text_ds = text_ds.flat_map(
        lambda x: tf.data.Dataset.from_tensor_slices(x).window(
            n_seq + 1, shift=shift, drop_remainder=True
        ).flat_map(
            lambda window: window.batch(n_seq + 1, drop_remainder=True)
        )
    )

    if shuffle:
        # Ensure buffer_size is at least 1
        buffer_size_batches = max(1, 10 * batch_size)
        text_ds = text_ds.shuffle(buffer_size=buffer_size_batches)

    text_ds = text_ds.batch(batch_size)

    # Split into (x, y) pairs where y is x shifted by one
    text_ds = text_ds.map(lambda x: (x[:, :-1], x[:, 1:]))

    # Repeat the training dataset indefinitely. Because it then never ends,
    # model.fit() must be given steps_per_epoch to know where an epoch stops.
    if shuffle: # shuffle=True marks the training pipeline; validation/test are not repeated
        text_ds = text_ds.repeat()

    text_ds = text_ds.prefetch(buffer_size=tf.data.AUTOTUNE)
    return text_ds

# Set hyperparameters
n_seq = 10 # Sequence length for the model (reduced from 100 to 10 for this small demo)
batch_size = 128

train_ds = get_tf_pipeline(train_data_seq, n_seq, batch_size=batch_size, shuffle=True)
# shuffle=False keeps the evaluation pipelines finite and in a fixed order
valid_ds = get_tf_pipeline(val_data_seq, n_seq, batch_size=batch_size, shuffle=False)
test_ds = get_tf_pipeline(test_data_seq, n_seq, batch_size=batch_size, shuffle=False)

# Inspect a batch
for x_batch, y_batch in train_ds.take(1):
    print(f"X batch shape: {x_batch.shape}")
    print(f"Y batch shape: {y_batch.shape}")
    print(f"\nExample X: {x_batch[0, :10]}")
    print(f"Example Y: {y_batch[0, :10]}")

X batch shape: (67, 10)
Y batch shape: (67, 10)

Example X: [44  5  6 13 45 46  2 47 48  2]
Example Y: [ 5  6 13 45 46  2 47 48  2  7]


---

## 10.2 GRUs in Wonderland: Generating text with deep learning

A **Gated Recurrent Unit (GRU)** is a type of recurrent neural network (RNN), similar to an LSTM. It's designed to learn from sequences and remember information over long periods. It's slightly simpler than an LSTM, using two gates (an *update gate* and a *reset gate*) instead of three, and one hidden state instead of two. This often makes it faster to train with comparable performance.
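
To make the two gates concrete, here is a NumPy sketch of a single GRU step in its classic textbook form. It is a simplification for illustration, not the Keras implementation (whose default variant applies the reset gate at a different point), and the parameter names in `p` are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU time step for a batch of inputs x and previous states h_prev."""
    z = sigmoid(x @ p["Wz"] + h_prev @ p["Uz"] + p["bz"])              # update gate
    r = sigmoid(x @ p["Wr"] + h_prev @ p["Ur"] + p["br"])              # reset gate
    h_cand = np.tanh(x @ p["Wh"] + (r * h_prev) @ p["Uh"] + p["bh"])   # candidate state
    # The update gate blends the old state with the candidate state.
    return z * h_prev + (1.0 - z) * h_cand

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
p = {}
for gate in "zrh":
    p["W" + gate] = rng.normal(size=(n_in, n_hidden))
    p["U" + gate] = rng.normal(size=(n_hidden, n_hidden))
    p["b" + gate] = np.zeros(n_hidden)

h = np.zeros((1, n_hidden))              # initial hidden state
for x in rng.normal(size=(5, 1, n_in)):  # five time steps, batch of one
    h = gru_step(x, h, p)
print(h.shape)  # (1, 4)
```

The single vector `h` is the only state carried between steps, whereas an LSTM carries both a hidden state and a cell state.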

In [4]:
# Based on Listing 10.4
K.clear_session()

model = models.Sequential([
    # (None,) means the model accepts sequences of any length
    layers.Input(shape=(None,)),
    layers.Embedding(
        input_dim=n_vocab + 1, # +1 for the padding token (ID 0)
        output_dim=512
    ),

    # return_sequences=True is critical.
    # It makes the GRU output a prediction for *every* token in the sequence,
    # not just the very last one.
    layers.GRU(1024, return_state=False, return_sequences=True),

    layers.Dense(512, activation='relu'),

    # The final layer predicts the next token ID from the entire vocabulary
    layers.Dense(n_vocab, name='final_out'),
    layers.Activation('softmax') # Use softmax to get probabilities
])

model.summary()



## 10.3 Measuring the quality of the generated text

Simple accuracy is a poor metric for language models. If the correct next word is "dog" and the model predicts "cat," the accuracy is 0, but the prediction is semantically reasonable.

A better metric is **Perplexity (PPL)**, which measures how "surprised" the model is by the true target sequence. It is the exponential of the mean per-token cross-entropy (CE) loss, computed with the natural logarithm:

$$\text{PPL} = e^{\text{CE}}$$

Lower perplexity is better. A PPL of 100 means the model is, on average, as uncertain as if it were choosing uniformly at random among 100 tokens at each step.
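
A small NumPy check of this interpretation, written as a sketch: the function takes the probabilities a hypothetical model assigned to the true tokens, and the two input lists are invented for illustration.

```python
import numpy as np

def perplexity(probs_of_true_tokens):
    """exp of the mean negative log-probability assigned to the true tokens."""
    probs = np.asarray(probs_of_true_tokens, dtype=np.float64)
    cross_entropy = -np.mean(np.log(probs))
    return float(np.exp(cross_entropy))

# A model that spreads its belief evenly over 100 tokens at every step.
uniform = perplexity([1 / 100] * 20)
# A model that gives the true token probability 0.9 at every step.
confident = perplexity([0.9] * 20)
print(uniform, confident)
```

The uniform case comes out at 100 (up to floating-point error), and the confident case at roughly 1.11, i.e. `1 / 0.9`.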

In [5]:
# Based on Listing 10.5: Custom Perplexity Metric
class PerplexityMetric(tf.keras.metrics.Mean):
    def __init__(self, name='perplexity', **kwargs):
        super().__init__(name=name, **kwargs)
        # We use sparse categorical crossentropy because our y_true (targets)
        # are integers, not one-hot vectors.
        self.cross_entropy = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=False, reduction='none'
        )

    def _calculate_perplexity(self, real, pred):
        # Calculate the cross-entropy loss for each token
        loss_ = self.cross_entropy(real, pred)

        # Get the mean loss across the sequence
        mean_loss = K.mean(loss_, axis=-1)

        # Perplexity is the exponential of the mean loss
        perplexity = K.exp(mean_loss)
        return perplexity

    def update_state(self, y_true, y_pred, sample_weight=None):
        perplexity = self._calculate_perplexity(y_true, y_pred)
        super().update_state(perplexity, sample_weight=sample_weight)

## 10.4 Training and evaluating the language model

In [12]:
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, CSVLogger

# Compile the model
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy', PerplexityMetric()]
)

# Define callbacks
os.makedirs('eval', exist_ok=True)
csv_logger = CSVLogger(os.path.join('eval', '1_language_modelling.log'))
es_callback = EarlyStopping(monitor='val_perplexity', patience=5, mode='min')
lr_callback = ReduceLROnPlateau(monitor='val_perplexity', factor=0.1, patience=2, mode='min')

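# The training pipeline repeats forever, so each epoch needs an explicit length.
# A story of L tokens yields max(0, L - n_seq) windows of n_seq + 1 tokens.
def count_steps(data_seq):
    n_windows = sum(max(0, len(s) - n_seq) for s in data_seq)
    return max(1, int(np.ceil(n_windows / batch_size)))

steps_per_epoch = count_steps(train_data_seq)
validation_steps = count_steps(val_data_seq)
test_steps = count_steps(test_data_seq)

# Train the model (only 1 epoch for this demo, book uses 50)
print("Starting model training...")
history = model.fit(
    train_ds,
    epochs=1,
    steps_per_epoch=steps_per_epoch,
    validation_data=valid_ds,
    validation_steps=validation_steps,
    callbacks=[es_callback, lr_callback, csv_logger]
)

# Evaluate on the test set
print("\nEvaluating model on test set...")
model.evaluate(test_ds, steps=test_steps)

# Save the model and the text-processing hyperparameters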
os.makedirs('models', exist_ok=True)
model.save(os.path.join('models', '2_gram_lm.h5'))

with open(os.path.join('models', 'text_hyperparams.pkl'), 'wb') as f:
    pickle.dump({'n_vocab': n_vocab, 'ngrams': ngrams, 'n_seq': n_seq}, f)

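Starting model training...
  19634/Unknown 19994s 1s/step - accuracy: 0.9655 - loss: 0.0546 - perplexity: 179.1632

KeyboardInterrupt: 

This run was interrupted by hand. It was started without `steps_per_epoch`, so Keras could not tell where an epoch over the endlessly repeating `train_ds` ends (hence `19634/Unknown`).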

---

## 10.5 Generating new text from the language model: Greedy decoding

For **inference** (text generation), we can't run the model once over a whole sequence, because the sequence does not exist yet. We need to generate one token at a time, feed that token back into the model, and get the next one.

This requires a new model that:
1.  Takes the previous token(s) **and** the GRU's previous hidden state as input.
2.  Outputs the prediction (softmax probabilities over the vocabulary) **and** the new hidden state.

**Greedy Decoding** is the simplest method: at each step, we just pick the single token with the highest probability.
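
The loop itself is simple. The sketch below replaces the GRU with a hypothetical lookup table of next-token probabilities (the real model also conditions on its hidden state), so the vocabulary and the numbers are invented:

```python
import numpy as np

# Hypothetical next-token table: row i holds P(next token | current token i).
vocab = ["th", "e ", "do", "g "]
next_probs = np.array([
    [0.1, 0.6, 0.2, 0.1],   # after "th"
    [0.2, 0.1, 0.6, 0.1],   # after "e "
    [0.1, 0.1, 0.1, 0.7],   # after "do"
    [0.7, 0.1, 0.1, 0.1],   # after "g "
])

def greedy_decode(start_id, n_steps):
    """Repeatedly append the single most probable next token."""
    ids = [start_id]
    for _ in range(n_steps):
        ids.append(int(np.argmax(next_probs[ids[-1]])))
    return ids

ids = greedy_decode(start_id=0, n_steps=3)
print("".join(vocab[i] for i in ids))
```

Each step looks only at the current distribution and never reconsiders an earlier choice, which keeps the method cheap.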

In [13]:
# 1. Re-build the model for inference using the Functional API
K.clear_session()

from tensorflow.keras.models import load_model # Import load_model

# Load the entire trained model
trained_model = load_model(os.path.join('models', '2_gram_lm.h5'),
                           custom_objects={'PerplexityMetric': PerplexityMetric})

# Define inputs for the inference model
inp_token = tf.keras.layers.Input(shape=(1,), dtype=tf.int32, name='input_token') # one token per step
inp_state = tf.keras.layers.Input(shape=(1024,), name='input_state') # 1024 is the GRU units

# Get weights from the trained model's layers
embedding_weights = trained_model.get_layer('embedding').get_weights()
gru_weights = trained_model.get_layer('gru').get_weights()
dense_1_weights = trained_model.get_layer('dense').get_weights()
final_out_weights = trained_model.get_layer('final_out').get_weights()

# Create new layer instances for the inference model, explicitly configuring GRU
# for single step prediction and state output. Weights will be set later.
embedding_layer_infer = layers.Embedding(
    input_dim=trained_model.get_layer('embedding').input_dim,
    output_dim=trained_model.get_layer('embedding').output_dim,
    name='embedding_infer'
)

gru_layer_infer = layers.GRU(
    trained_model.get_layer('gru').units,
    return_sequences=False, # Process single input token, get single output
    return_state=True,      # Return the hidden state
    name='gru_infer'
)

dense_layer_1_infer = layers.Dense(
    trained_model.get_layer('dense').units,
    activation=trained_model.get_layer('dense').activation,
    name='dense_infer'
)

final_layer_infer = layers.Dense(
    trained_model.get_layer('final_out').units,
    name='final_out_infer'
)

softmax_layer_infer = layers.Activation('softmax', name='activation_infer')

# Build the functional graph for inference
emb_out = embedding_layer_infer(inp_token)
gru_output, gru_state_out = gru_layer_infer(emb_out, initial_state=inp_state)
dense_out = dense_layer_1_infer(gru_output)
final_out = final_layer_infer(dense_out)
softmax_out = softmax_layer_infer(final_out)

infer_model = tf.keras.models.Model(
    inputs=[inp_token, inp_state],
    outputs=[softmax_out, gru_state_out]
)

# Set the weights for the new layers
embedding_layer_infer.set_weights(embedding_weights)
gru_layer_infer.set_weights(gru_weights)
dense_layer_1_infer.set_weights(dense_1_weights)
final_layer_infer.set_weights(final_out_weights)

infer_model.summary()



In [14]:
# 2. Write the Greedy Decoding loop (based on Listing 10.7)

def generate_text_greedy(seed_text, n_to_generate=50):
    print(f"Seed text: '{seed_text}'\n")
    text = get_ngrams(seed_text.lower(), ngrams)
    seq = tokenizer.texts_to_sequences([text])

    # Initialize the state
    state = np.zeros(shape=(1, 1024), dtype=np.float32)

    # Feed the seed text to the model to "warm up" the state
    # (verbose=0 suppresses the per-call progress bar)
    for i in range(len(seq[0]) - 1):
        x_in = np.array([[seq[0][i]]])
        out, state = infer_model.predict([x_in, state], verbose=0)

    # Start generating from the last token of the seed text
    x = np.array([[seq[0][-1]]])
    generated_text = list(text)

    for _ in range(n_to_generate):
        out, state = infer_model.predict([x, state], verbose=0)

        # Greedy step: get the ID of the most probable next token
        wid = int(np.argmax(out[0]))

        # Stop if we predict 'unk' or 0 (padding)
        if wid == 0 or wid == tokenizer.word_index['unk']:
            break

        word = tokenizer.index_word[wid]
        generated_text.append(word)

        # The new input is the word we just predicted
        x = np.array([[wid]])

    print("Generated text:")
    print(''.join(generated_text))

generate_text_greedy("the dog was", n_to_generate=30)

Seed text: 'the dog was'

## 10.6 Beam search: Enhancing the predictive power of sequential models

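Greedy decoding has a major flaw: it does not necessarily find the most probable sequence. It might pick a token that looks best *now* but leads to a poor continuation later. For example, after "the dog was **ru**", the locally most probable next token might complete "the dog was **rug**", while a slightly less probable one would have led to "the dog was **running**".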

**Beam Search** improves this by keeping track of the *k* (e.g., *k*=3) most probable sequences at each step.

1.  **Step 1**: Get the top 3 most likely next words (e.g., "running", "barking", "sleeping").
2.  **Step 2**: For *each* of those 3 sequences, predict the *next* top 3 words. This gives $3 \times 3 = 9$ candidate sequences.
3.  **Step 3**: Rank all 9 sequences by their combined probability and keep only the new top 3.
4.  Repeat.

The book implements this as a recursive function (Listing 10.8). It is more complex than greedy decoding, but it explores more of the search space and usually produces more coherent text.
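
The ranking-and-pruning idea can be sketched iteratively. This is not the book's recursive Listing 10.8: it is a simplified version that uses a hypothetical three-token probability table in place of the trained GRU, and it sums log-probabilities to score sequences.

```python
import numpy as np

# Hypothetical table: row i holds P(next token | current token i).
next_probs = np.array([
    [0.05, 0.50, 0.45],
    [0.40, 0.35, 0.25],
    [0.05, 0.05, 0.90],
])

def beam_search(start_id, n_steps, beam_width):
    """Keep the beam_width most probable sequences after every step."""
    beams = [([start_id], 0.0)]   # (token ids, summed log-probability)
    for _ in range(n_steps):
        candidates = []
        for ids, score in beams:
            log_p = np.log(next_probs[ids[-1]])
            # Expand each beam with its beam_width most probable next tokens.
            for tok in np.argsort(log_p)[-beam_width:]:
                candidates.append((ids + [int(tok)], score + float(log_p[tok])))
        # Rank all candidates and prune back to beam_width sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

best_ids, best_logp = beam_search(0, n_steps=2, beam_width=2)[0]
greedy_ids, greedy_logp = beam_search(0, n_steps=2, beam_width=1)[0]
print(best_ids, np.exp(best_logp))
print(greedy_ids, np.exp(greedy_logp))
```

With a beam width of 1 the procedure reduces to greedy decoding. In this toy table the greedy path commits to token 1 and ends with probability 0.2, while a beam width of 2 recovers the path through token 2 with probability 0.405.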

## Reducing training time

To address the training speed issue, the model architecture and the training parameters are changed as follows:

1. **Reduce model parameters** (in the model definition cell):
    - Change `output_dim` of `layers.Embedding` from 512 to 128.
    - Change `layers.GRU` units from 1024 to 256.
2. **Increase batch size** (in the data pipeline cell):
    - Change `batch_size` from 128 to 256.

After these modifications, rerun the data pipeline cell to rebuild the datasets, then rerun the model definition cell to apply the architecture changes. Finally, calculate `steps_per_epoch` and `validation_steps` and retrain the model in the training cell.

The cell below applies the batch-size change and rebuilds the data pipelines.


In [16]:
import tensorflow as tf
from tensorflow.keras import layers, models
import tensorflow.keras.backend as K
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
import os
import requests
import tarfile
from collections import Counter
from itertools import chain
import pickle
import pandas as pd

# --- 1. Download and Read Data (Simulated from book) ---
# The book uses the bAbI dataset. We'll simulate downloading and reading it.
data_dir = os.path.join('data', 'lm', 'CBTest', 'data')
os.makedirs(data_dir, exist_ok=True)
train_path = os.path.join(data_dir, 'cbt_train.txt')
valid_path = os.path.join(data_dir, 'cbt_valid.txt')
test_path = os.path.join(data_dir, 'cbt_test.txt')

# Create dummy data files for demonstration
if not os.path.exists(train_path):
    with open(train_path, 'w') as f:
        f.write("_BOOK_TITLE_ A Simple Story\n")
        f.write("Once upon a time, there was a fox.\n")
        f.write("The fox was quick and brown.\n")
        f.write("_BOOK_TITLE_ Another Story\n")
        f.write("A dog and a cat were friends.\n")
        f.write("They played in the yard.\n")

    with open(valid_path, 'w') as f:
        f.write("_BOOK_TITLE_ A Validation Story\n")
        f.write("The sun was bright.\n")

    with open(test_path, 'w') as f:
        f.write("_BOOK_TITLE_ A Test Story\n")
        f.write("The moon was full.\n")

def read_data(path):
    stories = []
    with open(path, 'r') as f:
        s = []
        for row in f:
            if row.startswith("_BOOK_TITLE_"):
                if len(s) > 0:
                    stories.append(' '.join(s).lower())
                s = []
            s.append(row.strip()) # Add strip() to remove newlines
        if len(s) > 0:
            stories.append(' '.join(s).lower())
    return stories

stories = read_data(train_path)
val_stories = read_data(valid_path)
test_stories = read_data(test_path)

# --- 2. N-gram and Tokenizer Processing ---

# Function to get n-grams
def get_ngrams(text, n):
    return [text[i:i+n] for i in range(0, len(text), n)]

ngrams = 2 # We'll use 2-grams (bigrams)
train_ngram_stories = [get_ngrams(s, ngrams) for s in stories]

# Calculate vocabulary size (e.g., all n-grams appearing >= 10 times)
# In our small demo, we'll use a threshold of 1
text_corpus = chain(*train_ngram_stories)
cnt = Counter(text_corpus)
freq_df = pd.Series(list(cnt.values()), index=list(cnt.keys())).sort_values(ascending=False)

n_vocab = (freq_df >= 1).sum() # For demo, use 1. Book uses 10.

# --- 3. Tokenize Data ---
tokenizer = Tokenizer(num_words=n_vocab, oov_token='unk', lower=False)

# Fit tokenizer on training n-grams
tokenizer.fit_on_texts(train_ngram_stories)

# Convert all datasets to sequences of integer IDs
train_data_seq = tokenizer.texts_to_sequences(train_ngram_stories)

val_ngram_stories = [get_ngrams(s, ngrams) for s in val_stories]
val_data_seq = tokenizer.texts_to_sequences(val_ngram_stories)

test_ngram_stories = [get_ngrams(s, ngrams) for s in test_stories]
test_data_seq = tokenizer.texts_to_sequences(test_ngram_stories)

# Based on Listing 10.3
def get_tf_pipeline(data_seq, n_seq, batch_size=64, shift=1, shuffle=True):
    """Converts sequences of text IDs into (input, target) batches."""

    # Use RaggedTensor to handle stories of different lengths
    text_ds = tf.data.Dataset.from_tensor_slices(tf.ragged.constant(data_seq))

    if shuffle:
        # Ensure buffer_size is at least 1, even for very small datasets
        buffer_size_stories = max(1, len(data_seq) // 2)
        text_ds = text_ds.shuffle(buffer_size=buffer_size_stories)

    # Use flat_map to apply windowing to each story individually
    text_ds = text_ds.flat_map(
        lambda x: tf.data.Dataset.from_tensor_slices(x).window(
            n_seq + 1, shift=shift, drop_remainder=True
        ).flat_map(
            lambda window: window.batch(n_seq + 1, drop_remainder=True)
        )
    )

    if shuffle:
        # Ensure buffer_size is at least 1
        buffer_size_batches = max(1, 10 * batch_size)
        text_ds = text_ds.shuffle(buffer_size=buffer_size_batches)

    text_ds = text_ds.batch(batch_size)

    # Split into (x, y) pairs where y is x shifted by one
    text_ds = text_ds.map(lambda x: (x[:, :-1], x[:, 1:]))

    # Repeat only the training dataset (shuffle=True). A repeated dataset is
    # infinite, so model.fit() then needs an explicit steps_per_epoch.
    if shuffle:
        text_ds = text_ds.repeat()

    text_ds = text_ds.prefetch(buffer_size=tf.data.AUTOTUNE)
    return text_ds

# Set hyperparameters
n_seq = 10 # Sequence length for the model (changed from 100 to 10)
batch_size = 256

train_ds = get_tf_pipeline(train_data_seq, n_seq, batch_size=batch_size, shuffle=True)
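# shuffle defaults to True, so pass shuffle=False explicitly to keep the
# validation and test datasets ordered and finite
valid_ds = get_tf_pipeline(val_data_seq, n_seq, batch_size=batch_size, shuffle=False)
test_ds = get_tf_pipeline(test_data_seq, n_seq, batch_size=batch_size, shuffle=False)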

# Inspect a batch
for x_batch, y_batch in train_ds.take(1):
    print(f"X batch shape: {x_batch.shape}")
    print(f"Y batch shape: {y_batch.shape}")
    print(f"\nExample X: {x_batch[0, :10]}")
    print(f"Example Y: {y_batch[0, :10]}")

X batch shape: (67, 10)
Y batch shape: (67, 10)

Example X: [45 46  2 47 48  2  7  2 49 50]
Example Y: [46  2 47 48  2  7  2 49 50 20]
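
The pipeline above cuts each story into overlapping windows of `n_seq + 1` IDs and then splits every window into an input and a target shifted by one position, which is why `Example Y` is `Example X` moved one step along. The following is a minimal pure-Python sketch of that logic for a single story. The helper name `make_windows` and the token IDs are hypothetical and are not part of the book's code.

```python
def make_windows(ids, n_seq, shift=1):
    """Return (input, target) pairs from one story's token IDs.

    Mirrors window(n_seq + 1, shift, drop_remainder=True) followed by
    the x[:-1] / x[1:] split: incomplete trailing windows are dropped.
    """
    pairs = []
    for start in range(0, len(ids) - n_seq, shift):
        window = ids[start:start + n_seq + 1]
        # Input is everything but the last ID, target everything but the first
        pairs.append((window[:-1], window[1:]))
    return pairs


story = [45, 46, 2, 47, 48, 2, 7]  # made-up token IDs for illustration
pairs = make_windows(story, n_seq=3)
for x, y in pairs:
    print(x, "->", y)
```

With `shift=1`, a story of length `L` yields `max(0, L - n_seq)` windows, so stories no longer than `n_seq` tokens contribute nothing to the dataset.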


In [17]:
K.clear_session()

model = models.Sequential([
    layers.Embedding(
        input_dim=n_vocab + 1, # +1 for the padding token (ID 0)
        output_dim=128, # Changed from 512 to 128
        input_shape=(None,)
    ),

    layers.GRU(256, return_state=False, return_sequences=True), # Changed from 1024 to 256

    layers.Dense(512, activation='relu'),

    layers.Dense(n_vocab, name='final_out'),
    layers.Activation('softmax')
])

model.summary()

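
As a rough cross-check of what `model.summary()` reports, the trainable parameter count of each layer can be worked out by hand. The sketch below assumes the default Keras GRU configuration (`reset_after=True`, `use_bias=True`) and uses a placeholder `n_vocab` of 100, which is not the notebook's actual value. All function names are hypothetical.

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry
    return vocab * dim


def gru_params(input_dim, units):
    # Three weight sets (update gate, reset gate, candidate state), each with
    # input weights, recurrent weights, and two bias vectors (reset_after=True)
    return 3 * (input_dim * units + units * units + 2 * units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


def total_params(n_vocab):
    # Same layer sizes as the Sequential model above
    return (
        embedding_params(n_vocab + 1, 128)
        + gru_params(128, 256)
        + dense_params(256, 512)
        + dense_params(512, n_vocab)
    )


print(total_params(n_vocab=100))  # placeholder vocabulary size
```

Only the embedding and the final dense layer depend on the vocabulary size, so shrinking the vocabulary with n-grams reduces those two terms and leaves the GRU untouched.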


In [36]:
import tensorflow as tf
from tensorflow.keras import layers, models
import tensorflow.keras.backend as K
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
import os
import requests
import tarfile
from collections import Counter
from itertools import chain
import pickle
import pandas as pd

# --- 1. Download and Read Data (Simulated from book) ---
# The book uses the Children's Book Test (CBT) corpus from Facebook's bAbI project.
# Here we create small stand-in files instead of downloading it.
data_dir = os.path.join('data', 'lm', 'CBTest', 'data')
os.makedirs(data_dir, exist_ok=True)
train_path = os.path.join(data_dir, 'cbt_train.txt')
valid_path = os.path.join(data_dir, 'cbt_valid.txt')
test_path = os.path.join(data_dir, 'cbt_test.txt')

# Create dummy data files for demonstration
if not os.path.exists(train_path):
    with open(train_path, 'w') as f:
        f.write("_BOOK_TITLE_ A Simple Story\n")
        f.write("Once upon a time, there was a fox.\n")
        f.write("The fox was quick and brown.\n")
        f.write("_BOOK_TITLE_ Another Story\n")
        f.write("A dog and a cat were friends.\n")
        f.write("They played in the yard.\n")

    with open(valid_path, 'w') as f:
        f.write("_BOOK_TITLE_ A Validation Story\n")
        f.write("The sun was bright.\n")

    with open(test_path, 'w') as f:
        f.write("_BOOK_TITLE_ A Test Story\n")
        f.write("The moon was full.\n")

def read_data(path):
    stories = []
    with open(path, 'r') as f:
        s = []
        for row in f:
            if row.startswith("_BOOK_TITLE_"):
                if len(s) > 0:
                    stories.append(' '.join(s).lower())
                s = []
            s.append(row.strip()) # Add strip() to remove newlines
        if len(s) > 0:
            stories.append(' '.join(s).lower())
    return stories

stories = read_data(train_path)
val_stories = read_data(valid_path)
test_stories = read_data(test_path)

# --- 2. N-gram and Tokenizer Processing ---

# Function to get n-grams
def get_ngrams(text, n):
    return [text[i:i+n] for i in range(0, len(text), n)]

ngrams = 2 # We'll use 2-grams (bigrams)
train_ngram_stories = [get_ngrams(s, ngrams) for s in stories]

# Calculate vocabulary size (e.g., all n-grams appearing >= 10 times)
# In our small demo, we'll use a threshold of 1
text_corpus = chain(*train_ngram_stories)
cnt = Counter(text_corpus)
freq_df = pd.Series(list(cnt.values()), index=list(cnt.keys())).sort_values(ascending=False)

n_vocab = (freq_df >= 1).sum() # For demo, use 1. Book uses 10.

# --- 3. Tokenize Data ---
tokenizer = Tokenizer(num_words=n_vocab, oov_token='unk', lower=False)

# Fit tokenizer on training n-grams
tokenizer.fit_on_texts(train_ngram_stories)

# Convert all datasets to sequences of integer IDs
train_data_seq = tokenizer.texts_to_sequences(train_ngram_stories)

val_ngram_stories = [get_ngrams(s, ngrams) for s in val_stories]
val_data_seq = tokenizer.texts_to_sequences(val_ngram_stories)

test_ngram_stories = [get_ngrams(s, ngrams) for s in test_stories]
test_data_seq = tokenizer.texts_to_sequences(test_ngram_stories)

# Based on Listing 10.3
def get_tf_pipeline(data_seq, n_seq, batch_size=64, shift=1, shuffle=True):
    """Converts sequences of text IDs into (input, target) batches."""

    # Use RaggedTensor to handle stories of different lengths
    text_ds = tf.data.Dataset.from_tensor_slices(tf.ragged.constant(data_seq))

    if shuffle:
        # Ensure buffer_size is at least 1, even for very small datasets
        buffer_size_stories = max(1, len(data_seq) // 2)
        text_ds = text_ds.shuffle(buffer_size=buffer_size_stories)

    # Use flat_map to apply windowing to each story individually
    text_ds = text_ds.flat_map(
        lambda x: tf.data.Dataset.from_tensor_slices(x).window(
            n_seq + 1, shift=shift, drop_remainder=True
        ).flat_map(
            lambda window: window.batch(n_seq + 1, drop_remainder=True)
        )
    )

    if shuffle:
        # Ensure buffer_size is at least 1
        buffer_size_batches = max(1, 10 * batch_size)
        text_ds = text_ds.shuffle(buffer_size=buffer_size_batches)

    text_ds = text_ds.batch(batch_size)

    # Split into (x, y) pairs where y is x shifted by one
    text_ds = text_ds.map(lambda x: (x[:, :-1], x[:, 1:]))

    # No .repeat() here: every dataset stays finite, and model.fit(epochs=N)
    # iterates over the training dataset again at the start of each epoch.

    text_ds = text_ds.prefetch(buffer_size=tf.data.AUTOTUNE)
    return text_ds

# Set hyperparameters
n_seq = 10 # Sequence length for the model (changed from 100 to 10)
batch_size = 256 # Increased batch size from 128 to 256

train_ds = get_tf_pipeline(train_data_seq, n_seq, batch_size=batch_size, shuffle=True)
valid_ds = get_tf_pipeline(val_data_seq, n_seq, batch_size=batch_size, shuffle=False)
test_ds = get_tf_pipeline(test_data_seq, n_seq, batch_size=batch_size, shuffle=False)

# Inspect a batch
for x_batch, y_batch in train_ds.take(1):
    print(f"X batch shape: {x_batch.shape}")
    print(f"Y batch shape: {y_batch.shape}")
    print(f"\nExample X: {x_batch[0, :10]}")
    print(f"Example Y: {y_batch[0, :10]}")

X batch shape: (67, 10)
Y batch shape: (67, 10)

Example X: [18 19 15 14 54 55 56 16  5  3]
Example Y: [19 15 14 54 55 56 16  5  3  1]
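
The IDs printed above come from two steps in the cell: splitting each story into non-overlapping character n-grams, and counting how many distinct n-grams meet a frequency threshold to set the vocabulary size. The following is a minimal stdlib-only sketch of both steps. `get_ngrams` matches the function in the cell, while `vocab_size` and the sample sentences are hypothetical.

```python
from collections import Counter


def get_ngrams(text, n):
    """Split text into consecutive, non-overlapping chunks of n characters."""
    return [text[i:i + n] for i in range(0, len(text), n)]


def vocab_size(ngram_lists, min_count):
    """Count the distinct n-grams that occur at least min_count times."""
    counts = Counter(g for grams in ngram_lists for g in grams)
    return sum(1 for c in counts.values() if c >= min_count)


texts = ["the fox", "the dog"]  # made-up sentences for illustration
grams = [get_ngrams(t, 2) for t in texts]
print(grams)
print(vocab_size(grams, min_count=1), vocab_size(grams, min_count=2))
```

When the text length is not a multiple of `n`, the last chunk is shorter than `n` characters. Raising `min_count` shrinks the vocabulary, and the tokenizer then maps the dropped n-grams to its out-of-vocabulary token.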


In [37]:
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, CSVLogger

# Compile the model
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy', PerplexityMetric()]
)

# Define callbacks
os.makedirs('eval', exist_ok=True)
csv_logger = CSVLogger(os.path.join('eval', '1_language_modelling.log'))
es_callback = EarlyStopping(monitor='val_perplexity', patience=5, mode='min')
lr_callback = ReduceLROnPlateau(monitor='val_perplexity', factor=0.1, patience=2, mode='min')

# Number of (n_seq + 1)-token windows per split, for reference only.
# The datasets are finite, so Keras infers the steps per epoch itself.
num_train_windows = sum(max(0, len(s) - n_seq) for s in train_data_seq)
num_val_windows = sum(max(0, len(s) - n_seq) for s in val_data_seq)

# Train the model
print("Starting model training...")
history = model.fit(
    train_ds,
    epochs=10,
    validation_data=valid_ds,
    callbacks=[es_callback, lr_callback, csv_logger]
)

# Evaluate on the test set
print("\nEvaluating model on test set...")
model.evaluate(test_ds)

# Save the model and the text hyperparameters (the tokenizer itself is not saved here)
os.makedirs('models', exist_ok=True)
model.save(os.path.join('models', '2_gram_lm.h5'))

with open(os.path.join('models', 'text_hyperparams.pkl'), 'wb') as f:
    pickle.dump({'n_vocab': n_vocab, 'ngrams': ngrams, 'n_seq': n_seq}, f)

Starting model training...
Epoch 1/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 7s 7s/step - accuracy: 0.9537 - loss: 0.2214 - perplexity: 1.2597 - val_accuracy: 0.2000 - val_loss: 6.8920 - val_perplexity: 1576.9830 - learning_rate: 0.0010
Epoch 2/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 245ms/step - accuracy: 0.9045 - loss: 0.3140 - perplexity: 1.3913 - val_accuracy: 0.1875 - val_loss: 7.1631 - val_perplexity: 3245.8750 - learning_rate: 0.0010
Epoch 3/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 260ms/step - accuracy: 0.9552 - loss: 0.1992 - perplexity: 1.2320 - val_accuracy: 0.1875 - val_loss: 7.4845 - val_perplexity: 6902.5352 - learning_rate: 0.0010
Epoch 4/10
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 196ms/step - accuracy: 0.9522 - loss: 0.1904 - perplexity: 1.2188 - val_accuracy: 0.1875 - val_loss: 7.5192 - val_perplexity: 7358.2900 - learning_rate: 1.0000e

