# Implementing the Transformer Architecture for Shakespearean Text Generation

# Table of Contents

1. [Introduction](#introduction)
2. [Project Overview and Motivation](#project-overview-and-motivation)
3. [Objectives and Goals](#objectives-and-goals)
4. [Dependencies and Setup](#dependencies-and-setup)
5. [Data Acquisition and Preprocessing](#data-acquisition-and-preprocessing)
    - [Brief Description of the Dataset](#brief-description-of-the-dataset)
    - [Downloading the Shakespeare Text](#downloading-the-shakespeare-text)
    - [Text Preprocessing](#text-preprocessing)
    - [Vocabulary Creation](#vocabulary-creation)
    - [Text Encoding](#text-encoding)
    - [Data Splitting](#data-splitting)
    - [Sequence Creation for Language Modeling](#sequence-creation-for-language-modeling)
    - [Saving Vocabulary Mappings](#saving-vocabulary-mappings)
    - [Saving Preprocessed Data](#saving-preprocessed-data)
6. [Data Augmentation Techniques](#data-augmentation-techniques)
    - [1. Word Dropout](#1-word-dropout)
    - [2. Random Token Swapping](#2-random-token-swapping)
    - [3. Combined Augmentation](#3-combined-augmentation)
7. [TensorFlow Dataset Creation](#tensorflow-dataset-creation)
8. [Transformer Architecture](#transformer-architecture)
    - [Key Components of the Transformer](#key-components-of-the-transformer)
    - [Positional Encoding](#positional-encoding)
    - [Scaled Dot-Product Attention](#scaled-dot-product-attention)
    - [Multi-Head Attention Implementation](#multi-head-attention-implementation)
    - [Feed-Forward Networks](#feed-forward-networks)
9. [Encoder Components](#encoder-components)
    - [Encoder Layer](#encoder-layer)
    - [Complete Encoder Stack](#complete-encoder-stack)
10. [Decoder Components](#decoder-components)
    - [Decoder Layer](#decoder-layer)
    - [Complete Decoder Stack](#complete-decoder-stack)
11. [Masking Functions](#masking-functions)
    - [Padding Mask Creation](#padding-mask-creation)
    - [Look-Ahead Mask Creation](#look-ahead-mask-creation)
    - [Combined Masking](#combined-masking)
12. [Complete Transformer Model](#complete-transformer-model)
    - [Configuring the Transformer for Word-Level Language Modeling](#configuring-the-transformer-for-word-level-language-modeling)
13. [Training Implementation](#training-implementation)
    - [Loss Function with Label Smoothing](#loss-function-with-label-smoothing)
    - [Custom Learning Rate Scheduler](#custom-learning-rate-scheduler)
    - [Optimizer Configuration](#optimizer-configuration)
    - [Training Step Implementation](#training-step-implementation)
    - [Evaluation Function](#evaluation-function)
14. [Training Process](#training-process)
    - [Training Loop Details](#training-loop-details)
    - [Sample Text Generation During Training](#sample-text-generation-during-training)
15. [Results and Analysis](#results-and-analysis)
    - [Training Process and Metrics](#training-process-and-metrics)
    - [Performance Analysis](#performance-analysis)
    - [Text Generation Quality](#text-generation-quality)
    - [Overfitting Analysis](#overfitting-analysis)
    - [Conclusion](#conclusion)
16. [Training Metrics Visualization](#training-metrics-visualization)
17. [Model Persistence](#model-persistence)
    - [Saving Model and Configuration](#saving-model-and-configuration)
    - [Loading Model from Saved Files](#loading-model-from-saved-files)
18. [Model Evaluation](#model-evaluation)
    - [Perplexity Calculation](#perplexity-calculation)
    - [Generated Text Quality Assessment](#generated-text-quality-assessment)
    - [Performance Benchmarking](#performance-benchmarking)
19. [Conclusion and Future Work](#conclusion-and-future-work)
    - [Discussion of Observations and Limitations](#discussion-of-observations-and-limitations)
    - [Future Improvements and Experimentation Ideas](#future-improvements-and-experimentation-ideas)
20. [References](#references)

# Introduction

The Transformer architecture, introduced in the landmark paper "Attention Is All You Need" by Vaswani et al. (2017), revolutionized natural language processing by demonstrating that sequence modeling tasks could be effectively tackled without recurrent or convolutional neural networks. This project implements a complete Transformer model from scratch for Shakespearean text generation, showcasing the architecture's ability to capture long-range dependencies and generate coherent, stylistically consistent text.





## Project Overview and Motivation

This implementation serves as both an educational exploration of the Transformer architecture and a demonstration of its capabilities in creative text generation. By training on Shakespeare's works, we aim to create a model that can generate text with similar linguistic patterns, vocabulary, and stylistic elements as the original author. The Transformer is particularly well-suited for this task due to its:

1. **Parallel processing capabilities**: Unlike RNNs, Transformers process entire sequences simultaneously, making training more efficient.
2. **Attention mechanisms**: The self-attention mechanism allows the model to weigh the importance of different words in the input sequence regardless of their distance, capturing long-range dependencies more effectively than RNNs.
3. **Scalability**: The architecture can be scaled to handle larger datasets and more complex patterns by adjusting the number of layers, attention heads, and model dimensions.

This project implements the complete Transformer architecture with both encoder and decoder components, following the original paper's design while making appropriate adjustments for word-level language modeling of Shakespearean text.

## Objectives and Goals

The primary objectives of this project are:

1. **Implement a complete Transformer model**: Build all components of the Transformer architecture from scratch, including multi-head attention, positional encoding, feed-forward networks, and the full encoder-decoder structure.

2. **Train on Shakespearean text**: Process and prepare Shakespeare's works for training a language model that captures the unique patterns and style of Elizabethan English.

3. **Generate coherent Shakespearean-style text**: Demonstrate the model's ability to generate text that maintains thematic consistency and mimics Shakespeare's writing style.

4. **Explore hyperparameter effects**: Examine how different model configurations, training strategies, and generation parameters affect the quality and diversity of the generated text.

5. **Analyze model performance**: Evaluate the model using appropriate metrics like perplexity and qualitative assessment of generated samples.

Through this implementation, we aim to gain deeper insights into how attention-based models learn linguistic patterns and demonstrate the Transformer's effectiveness for creative text generation tasks.

## Dependencies and Setup

This implementation requires several key libraries to build and train our Transformer model:

- **TensorFlow**: The core deep learning framework used for model implementation, providing efficient tensor operations and automatic differentiation capabilities.
- **NumPy**: Used for numerical computations and array manipulation outside the TensorFlow graph, such as data splitting, sequence creation, and augmentation.
- **Matplotlib**: Visualization library for plotting training metrics and model performance.
- **Requests**: HTTP library for downloading the Shakespeare dataset.
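- **json** (standard library): For saving and loading model configurations and vocabulary mappings.
- **time** (standard library): For tracking training duration and benchmarking inference speed.
- **os** (standard library): For file and directory operations.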

The code is designed to run on both CPU and GPU environments, though training will be significantly faster with GPU acceleration. The implementation uses TensorFlow's eager execution mode for intuitive debugging and development, while leveraging `@tf.function` decorators for performance-critical operations during training.

TensorFlow, NumPy, Matplotlib, and Requests are third-party packages that can be installed via pip if not already available; `json`, `time`, and `os` ship with Python. The model architecture is implemented from scratch following the original paper specifications, without relying on high-level APIs like Keras Layers for the core Transformer components, providing full transparency into the implementation details.

In [1]:
# Import necessary libraries
import tensorflow as tf         # Deep learning framework
import numpy as np              # Numerical computations
import matplotlib.pyplot as plt # Visualization
import requests                 # HTTP requests for data downloading
import os                       # File and directory operations
import time                     # Time tracking for benchmarking
import json                     # JSON handling for model config

## Data Acquisition and Preprocessing

The foundation of our Transformer model's training is high-quality text data that captures the unique linguistic patterns of Shakespeare's writing. This section details how we acquire, clean, and prepare Shakespeare's works for training our language model.



### Brief Description of the Dataset

For this project, we use the "Tiny Shakespeare" dataset, a condensed corpus of Shakespeare's works commonly used for language modeling tasks. This dataset contains approximately 1 million characters of text from various Shakespeare plays and sonnets, providing a rich source of Elizabethan English with its distinctive vocabulary, syntax, and stylistic elements.

The dataset includes dialogue from various characters, stage directions, and scene descriptions, offering a diverse range of linguistic patterns for our model to learn. While relatively small compared to modern language modeling datasets (which often contain billions of tokens), Tiny Shakespeare provides sufficient data to demonstrate the capabilities of the Transformer architecture while remaining computationally tractable for training without specialized hardware.

The text exhibits several characteristics that make it an interesting challenge for language modeling:

1. Archaic vocabulary and grammatical constructions
2. Poetic meter and rhyme schemes in certain sections
3. Distinctive character speech patterns
4. Formal and informal language variations
5. Rich metaphorical and figurative language

These features provide an excellent test case for evaluating how well our Transformer model can capture and reproduce complex linguistic patterns across different contexts.

### Downloading the Shakespeare Text

To train our Transformer model, we first need to acquire the dataset. Tiny Shakespeare is hosted as a single plain-text file in Andrej Karpathy's char-rnn repository.

The function below handles the downloading process, creating a data directory if needed, and saving the text to a local file. It also performs basic validation by displaying the text length and a sample of the content to ensure the download was successful. This approach allows us to cache the dataset locally, avoiding redundant downloads in future runs.

In [2]:
# Download and load the Shakespeare text data
def download_shakespeare_data():
    """
    Download and process Shakespeare text data from Karpathy's char-rnn repository.

    Returns:
        str: The complete Shakespeare text data as a string.
    """
    url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"

    # Create data directory if it doesn't exist
    os.makedirs('data', exist_ok=True)

    file_path = os.path.join('data', 'shakespeare.txt')

    # Download data if not already present
    if not os.path.exists(file_path):
        print("Downloading Shakespeare text data...")
        response = requests.get(url, timeout=30)
        # Fail loudly on HTTP errors instead of caching an error page as the dataset
        response.raise_for_status()
        with open(file_path, 'w', encoding='utf-8') as file:
            file.write(response.text)
        print(f"Downloaded and saved to {file_path}")
    else:
        print(f"Shakespeare data already exists at {file_path}")

    # Read the downloaded text data
    with open(file_path, 'r', encoding='utf-8') as file:
        shakespeare_text = file.read()

    # Display basic information about the data
    print(f"Text length: {len(shakespeare_text)} characters")
    print(f"First 100 characters: \n{shakespeare_text[:100]}")

    return shakespeare_text

In [None]:
# Download Shakespeare data
shakespeare_text = download_shakespeare_data()

### Text Preprocessing

After downloading the Shakespeare text, we need to preprocess it to make it suitable for training our Transformer model. The preprocessing steps include:

1. **Cleaning the text**: Replacing line breaks with spaces, normalizing whitespace, and converting to lowercase for consistency.

2. **Handling punctuation**: Adding spaces around punctuation marks to treat them as separate tokens, which helps the model learn the grammatical structure more effectively.

3. **Tokenization**: Splitting the text into individual words (tokens) that will serve as the basic units for our language model.

This word-level tokenization approach offers a good balance between model complexity and linguistic understanding. Unlike character-level models that must learn spelling patterns from scratch, word-level models can focus on learning grammatical structures and semantic relationships between words. This is particularly important for Shakespeare's text, which contains rich vocabulary and complex sentence structures.

The function below implements these preprocessing steps and returns a list of words from the Shakespeare text. We'll use this list to build our vocabulary and create training examples for the model.

In [4]:
# Text preprocessing function
def preprocess_text(text):
    """
    Perform basic text preprocessing to prepare data for tokenization.

    Args:
        text (str): Raw text data to preprocess.

    Returns:
        list: List of preprocessed words from the text.
    """
    # Replace line breaks with spaces for better tokenization
    text = text.replace('\n', ' ')

    # Replace multiple consecutive spaces with a single space
    text = ' '.join(text.split())

    # Convert text to lowercase for consistency
    text = text.lower()

    # Add spaces around punctuation for better tokenization
    punctuation_marks = ['.', ',', '!', '?', ':', ';', '"', '(', ')', '[', ']', '{', '}']
    for punctuation in punctuation_marks:
        text = text.replace(punctuation, f" {punctuation} ")

    # Split text into individual words
    words = text.split()

    # Display information about processed words
    print(f"Total words after preprocessing: {len(words)}")
    print(f"Sample words: {words[:13]}")

    return words

In [None]:
# Process the Shakespeare text to get a list of words
processed_words = preprocess_text(shakespeare_text)
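
To make the effect of these steps concrete, here is a minimal standalone sketch that applies the same cleaning rules to a single short string. The helper name `tokenize_line` and the sample sentence are illustrative assumptions, not part of the notebook's pipeline.

```python
def tokenize_line(text):
    """Hypothetical helper mirroring the preprocessing steps on a short string."""
    # Collapse line breaks and repeated whitespace, then lowercase
    text = ' '.join(text.split()).lower()
    # Pad each punctuation mark with spaces so it becomes its own token
    for mark in ['.', ',', '!', '?', ':', ';', '"', '(', ')', '[', ']', '{', '}']:
        text = text.replace(mark, f" {mark} ")
    return text.split()


sample = "First Citizen:\nBefore we proceed any further, hear me speak."
tokens = tokenize_line(sample)
print(tokens)
```

Note that apostrophes and hyphens are not in the punctuation list, so a contraction such as `'tis` stays a single token under these rules.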

### Vocabulary Creation

After preprocessing the text into individual words, we need to create a vocabulary that maps each unique word to a numerical index. This vocabulary serves as the foundation for our language model, defining the set of tokens the model can recognize and generate.

The vocabulary creation process involves:

1. **Counting word frequencies**: We count how often each word appears in the text to identify the most common words.

2. **Vocabulary size limitation**: To manage computational complexity, we limit our vocabulary to the most frequent words (up to a maximum size), which typically cover the vast majority of the text.

3. **Special token addition**: We add special tokens to handle specific cases:
   - `<PAD>`: Used for padding sequences to a consistent length in batches
   - `<UNK>`: Used for unknown/rare words not included in the vocabulary

4. **Word-to-index and index-to-word mappings**: We create bidirectional mappings between words and their corresponding indices, which are essential for encoding input text and decoding model outputs.

This approach balances vocabulary coverage with model efficiency. While a larger vocabulary can represent more unique words directly, it also increases the model's parameter count and computational requirements. Our implementation focuses on capturing the most important words in Shakespeare's vocabulary while using the `<UNK>` token to handle rare words.

In [6]:
# Create vocabulary from processed text
def create_word_vocabulary(words, max_vocab_size=20000):
    """
    Create vocabulary mappings from processed words.

    Args:
        words (list): List of preprocessed words.
        max_vocab_size (int): Maximum size of the vocabulary.

    Returns:
        tuple: (word-to-index mapping, index-to-word mapping, vocabulary size)
    """
    # Count frequency of each word in the text
    word_frequencies = {}
    for word in words:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1
    
    # Sort words by frequency (most frequent first)
    sorted_words = sorted(word_frequencies.items(), key=lambda x: x[1], reverse=True)

    # Build vocabulary with special tokens and most frequent words
    # <PAD>: used for padding sequences to a consistent length
    # <UNK>: used for unknown/rare words not in the vocabulary
    vocabulary = ["<PAD>", "<UNK>"] + [word for word, count in sorted_words[:max_vocab_size-2]]
    
    # Create word-to-index and index-to-word mappings
    word_to_index = {word: idx for idx, word in enumerate(vocabulary)}
    index_to_word = {idx: word for idx, word in enumerate(vocabulary)}
    
    vocabulary_size = len(vocabulary)
    
    # Display information about the created vocabulary
    print(f"Vocabulary size: {vocabulary_size} words")
    print(f"Sample word-to-index mappings: {list(word_to_index.items())[:5]}")

    return word_to_index, index_to_word, vocabulary_size

In [None]:
# Create word-to-index and index-to-word mappings, and get vocabulary size
word2idx, idx2word, vocab_size = create_word_vocabulary(processed_words)
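
One way to check the claim that the most frequent words cover most of the text is to measure token coverage for a given vocabulary cap. The sketch below is an assumption-laden illustration: `vocab_coverage` is a hypothetical helper and the toy word list stands in for `processed_words`.

```python
from collections import Counter


def vocab_coverage(words, max_vocab_size):
    """Hypothetical helper: fraction of tokens covered by the top-k words."""
    counts = Counter(words)
    # Two slots are reserved for <PAD> and <UNK>, as in create_word_vocabulary
    kept = counts.most_common(max(max_vocab_size - 2, 0))
    covered = sum(count for _, count in kept)
    return covered / len(words) if words else 0.0


toy_words = "to be or not to be that is the question".split()
print(vocab_coverage(toy_words, max_vocab_size=4))    # keeps the 2 most frequent words
print(vocab_coverage(toy_words, max_vocab_size=100))  # keeps every word
```

Tokens outside the kept set are exactly the ones that would be encoded as `<UNK>`, so one minus this value estimates the unknown-token rate.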

### Text Encoding

After creating our vocabulary, we need to convert the preprocessed text into numerical sequences that our Transformer model can process. This encoding step transforms each word in our text into its corresponding index from the vocabulary.

The encoding process serves several important purposes:

1. **Numerical representation**: Neural networks operate on numerical data, so we need to convert text into numbers.

2. **Vocabulary mapping**: Words not present in our vocabulary (rare words or those excluded due to vocabulary size limitations) are mapped to the special `<UNK>` token, ensuring our model can handle any input text.

3. **Consistent representation**: By using the same word-to-index mapping throughout training and inference, we maintain consistency in how the model interprets text.

This encoding step is the final transformation before we can create training examples for our model. The function below handles this conversion, taking our preprocessed words and mapping each one to its corresponding index in the vocabulary.

In [8]:
# Encode text using vocabulary
def encode_text(words, word2idx):
    """
    Encode text using word-to-index mapping from vocabulary.

    Args:
        words (list): List of preprocessed words.
        word2idx (dict): Word-to-index mapping.

    Returns:
        list: List of encoded words (indices).
    """
    # Convert each word to its index, using <UNK> for words not in vocabulary
    return [word2idx.get(word, word2idx["<UNK>"]) for word in words]

In [None]:
# Encode the text
encoded_text = encode_text(processed_words, word2idx)
print(f"Encoded text length: {len(encoded_text)}")
print(f"Sample encoded text: {encoded_text[:13]}")
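
The mapping can be illustrated with a round trip on a toy vocabulary. This is a sketch under stated assumptions: `decode_indices` is a hypothetical inverse helper, and the six-entry vocabulary is invented for the example.

```python
def decode_indices(indices, idx2word):
    """Hypothetical inverse of encode_text: map indices back to words."""
    return [idx2word.get(idx, "<UNK>") for idx in indices]


# Toy vocabulary built the same way as above: special tokens first
toy_vocab = ["<PAD>", "<UNK>", "to", "be", "or", "not"]
toy_word2idx = {word: idx for idx, word in enumerate(toy_vocab)}
toy_idx2word = {idx: word for idx, word in enumerate(toy_vocab)}

words = ["to", "be", "or", "not", "to", "bee"]  # "bee" is out of vocabulary
encoded = [toy_word2idx.get(w, toy_word2idx["<UNK>"]) for w in words]
decoded = decode_indices(encoded, toy_idx2word)

print(encoded)
print(decoded)
```

The round trip is lossy for out-of-vocabulary words: "bee" comes back as `<UNK>`, which is why vocabulary coverage matters for generation quality.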

### Data Splitting

After encoding our text, we need to divide it into separate datasets for training, validation, and testing. This division is crucial for properly evaluating our model's performance:

1. **Training set**: The largest portion of the data, used to train the model's parameters. The model learns patterns and relationships directly from this data.

2. **Validation set**: Used during training to tune hyperparameters and monitor for overfitting. This helps us make decisions about model architecture and training process without contaminating our final evaluation.

3. **Testing set**: Held-out data that the model never sees during training or validation. This provides an unbiased evaluation of the final model's performance on new, unseen text.

We'll use a standard split ratio of 80% for training, 10% for validation, and 10% for testing. This balance provides sufficient data for training while reserving enough for meaningful validation and testing.

The function below implements this split contiguously, without shuffling, so the original order of the text is preserved within each subset. This matters for language modeling because it keeps the narrative flow and contextual relationships intact.

In [10]:
# Split data into training, validation, and testing sets
def split_data(data, train_ratio=0.8, val_ratio=0.1, test_ratio=0.1):
    """
    Split encoded data into training, validation, and testing sets.

    Args:
        data (list): Encoded text data.
        train_ratio (float): Proportion of data to use for training.
        val_ratio (float): Proportion of data to use for validation.
        test_ratio (float): Proportion of data to use for testing.

    Returns:
        tuple: (training data, validation data, testing data)
    """
    # Ensure ratios sum to 1.0 (tolerant of floating-point rounding)
    assert np.isclose(train_ratio + val_ratio + test_ratio, 1.0), "Data split ratios must sum to 1.0"

    # Convert data to numpy array for easier manipulation
    data_array = np.array(data, dtype=np.int32)

    # Calculate split indices
    data_size = len(data_array)
    train_size = int(data_size * train_ratio)
    val_size = int(data_size * val_ratio)

    # Split data into training, validation, and testing sets
    train_data = data_array[:train_size]
    val_data = data_array[train_size:train_size+val_size]
    test_data = data_array[train_size+val_size:]

    # Display information about the data splits
    print(f"Training set size: {len(train_data)} tokens")
    print(f"Validation set size: {len(val_data)} tokens")
    print(f"Testing set size: {len(test_data)} tokens")

    return train_data, val_data, test_data

In [None]:
# Split the data
train_data, val_data, test_data = split_data(encoded_text)
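
The boundary arithmetic can be checked by hand on a small example. The sketch below assumes a hypothetical helper `split_boundaries` and an illustrative corpus size of 205 tokens; it reproduces only the index calculation, not the array slicing.

```python
def split_boundaries(data_size, train_ratio=0.8, val_ratio=0.1):
    """Hypothetical helper returning the slice boundaries of a contiguous split."""
    train_end = int(data_size * train_ratio)
    val_end = train_end + int(data_size * val_ratio)
    return train_end, val_end


n_tokens = 205  # illustrative size, not the real corpus length
train_end, val_end = split_boundaries(n_tokens)
sizes = (train_end, val_end - train_end, n_tokens - val_end)
print(sizes)
```

Because the training and validation sizes are truncated with `int`, any rounding remainder ends up in the test set, so the three sizes always add up to the full token count.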

### Sequence Creation for Language Modeling

After splitting our encoded text into training, validation, and testing sets, we need to create input-target sequence pairs for language modeling. This step is crucial as it transforms our linear text data into a format suitable for training a sequence-to-sequence model.

For language modeling with the Transformer architecture, we need to create pairs of sequences where:

1. **Input sequence**: A fixed-length sequence of tokens that serves as context
2. **Target sequence**: The same sequence shifted by one position, representing what the model should predict

This sliding window approach allows the model to learn to predict the next word given a context of previous words. The process involves:

1. **Sequence extraction**: Creating overlapping sequences of fixed length from our encoded text
2. **Input-target pairing**: For each input sequence [x₀, x₁, ..., xₙ₋₁], creating a target sequence [x₁, x₂, ..., xₙ]
3. **Efficient batching**: Organizing these pairs into batches for efficient training

The function below implements this sequence creation process, with a configurable step size that determines the overlap between consecutive sequences. A smaller step size creates more training examples with greater overlap, potentially helping the model learn more effectively from limited data, while a larger step size reduces redundancy.

In [12]:
# Create input-target pairs for sequence modeling
def create_sequences(data, seq_length, step=1):
    """
    Create input-target sequence pairs for language modeling.

    For each sequence of length seq_length in the data, creates:
    - Input sequence: [x_0, x_1, ..., x_{seq_length-1}]
    - Target sequence: [x_1, x_2, ..., x_{seq_length}]

    Args:
        data (numpy.ndarray): Encoded token sequence.
        seq_length (int): Fixed length for each input sequence.
        step (int): Stride between starting positions of consecutive sequences.
                    Default is 1 for maximum overlap between sequences.

    Returns:
        tuple: (input sequences array, target sequences array)
    """
    input_sequences = []
    target_sequences = []

    # Create overlapping sequences from data
    for start_idx in range(0, len(data) - seq_length, step):
        # Input sequence: current seq_length tokens
        input_seq = data[start_idx:start_idx + seq_length]
        # Target sequence: next seq_length tokens (shifted by 1)
        target_seq = data[start_idx + 1:start_idx + seq_length + 1]

        input_sequences.append(input_seq)
        target_sequences.append(target_seq)

    # Convert lists to numpy arrays
    input_sequences_array = np.array(input_sequences)
    target_sequences_array = np.array(target_sequences)

    # Display information about created sequences
    print(f"Number of sequence pairs: {len(input_sequences_array)}")
    print(f"Input sequence shape: {input_sequences_array.shape}")
    print(f"Target sequence shape: {target_sequences_array.shape}")

    return input_sequences_array, target_sequences_array

In [None]:
# Create sequences for training, validation, and testing
seq_length = 50

train_inputs, train_targets = create_sequences(train_data, seq_length)
val_inputs, val_targets = create_sequences(val_data, seq_length)
test_inputs, test_targets = create_sequences(test_data, seq_length)
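
A tiny worked example shows how the sliding window pairs inputs with targets and how `step` controls the number of pairs. This sketch uses plain Python lists instead of NumPy arrays, and `sliding_pairs` is a hypothetical name introduced only for illustration.

```python
def sliding_pairs(tokens, seq_length, step=1):
    """Hypothetical pure-Python version of the windowing logic."""
    pairs = []
    for start in range(0, len(tokens) - seq_length, step):
        inputs = tokens[start:start + seq_length]
        targets = tokens[start + 1:start + seq_length + 1]
        pairs.append((inputs, targets))
    return pairs


toy_tokens = list(range(7))  # stand-in for encoded token ids 0..6
for inputs, targets in sliding_pairs(toy_tokens, seq_length=3, step=1):
    print(inputs, "->", targets)

# With n tokens the loop yields ceil((n - seq_length) / step) pairs when n > seq_length
print(len(sliding_pairs(toy_tokens, seq_length=3, step=2)))
```

With `step=1` and a sequence length of 50, almost every token appears in 50 different windows, so a larger step is a straightforward way to trade redundancy for memory.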

### Saving Vocabulary Mappings

Preserving the vocabulary mappings is a critical step in our language modeling pipeline. These mappings establish the connection between words in our text and their numerical representations in the model. By saving these mappings to disk, we ensure:

1. **Consistency between runs**: The same vocabulary can be used across multiple training sessions, ensuring consistent model behavior.

2. **Inference capability**: When using the trained model for text generation or other tasks, we need the exact same vocabulary mappings to properly encode inputs and decode outputs.

3. **Model portability**: Saving these mappings alongside the model weights allows the model to be deployed in different environments while maintaining the same word-to-index relationships.

4. **Reproducibility**: Preserving the vocabulary is essential for reproducing experimental results and comparing different model versions.

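The function below saves our vocabulary mappings (word-to-index and index-to-word dictionaries) to JSON files, along with the vocabulary size and special-token metadata. This information will be loaded during model inference to ensure that text processing remains consistent with what the model was trained on. Because JSON object keys are always strings, the integer keys of the index-to-word mapping are stored as strings and must be converted back to integers when the file is loaded.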

In [14]:
# Save vocabulary mappings to files
def save_vocab(word2idx, idx2word, vocab_size, save_dir='data'):
    """
    Save vocabulary mappings and metadata to JSON files.

    Args:
        word2idx (dict): Word-to-index mapping.
        idx2word (dict): Index-to-word mapping.
        vocab_size (int): Size of the vocabulary.
        save_dir (str): Directory to save vocabulary files.
    """
    # Create save directory if it doesn't exist
    os.makedirs(save_dir, exist_ok=True)

    # Save word-to-index mapping
    with open(os.path.join(save_dir, 'word2idx.json'), 'w', encoding='utf-8') as file:
        # Cast indices to plain Python ints so they are JSON serializable
        word2idx_serializable = {word: int(idx) for word, idx in word2idx.items()}
        json.dump(word2idx_serializable, file, ensure_ascii=False, indent=2)

    # Save index-to-word mapping
    with open(os.path.join(save_dir, 'idx2word.json'), 'w', encoding='utf-8') as file:
        # JSON object keys must be strings, so convert the integer indices
        idx2word_serializable = {str(idx): word for idx, word in idx2word.items()}
        json.dump(idx2word_serializable, file, ensure_ascii=False, indent=2)

    # Save vocabulary size and metadata
    with open(os.path.join(save_dir, 'vocab_info.json'), 'w', encoding='utf-8') as file:
        json.dump({
            'vocab_size': vocab_size,
            'special_tokens': {
                'pad_token': '<PAD>',
                'unknown_token': '<UNK>',
                'pad_token_id': word2idx['<PAD>'],
                'unknown_token_id': word2idx['<UNK>']
            }
        }, file)

    print(f"Vocabulary files saved to {save_dir}")

In [None]:
# Save the vocabulary mappings and size
save_vocab(word2idx, idx2word, vocab_size)
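
Reading the files back requires undoing the string conversion applied to the index keys. The following is a minimal sketch, assuming the file names used above; `load_vocab` is a hypothetical helper, and the round trip runs on a toy vocabulary in a temporary directory rather than on the real `data` folder.

```python
import json
import os
import tempfile


def load_vocab(save_dir):
    """Hypothetical loader for the mapping files written by save_vocab."""
    with open(os.path.join(save_dir, 'word2idx.json'), encoding='utf-8') as file:
        word2idx = json.load(file)
    with open(os.path.join(save_dir, 'idx2word.json'), encoding='utf-8') as file:
        # JSON object keys are always strings, so restore the integer indices
        idx2word = {int(idx): word for idx, word in json.load(file).items()}
    return word2idx, idx2word


# Round trip on a toy vocabulary in a temporary directory
toy_idx2word = {0: "<PAD>", 1: "<UNK>", 2: "thou"}
toy_word2idx = {word: idx for idx, word in toy_idx2word.items()}

with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, 'word2idx.json'), 'w', encoding='utf-8') as file:
        json.dump(toy_word2idx, file)
    with open(os.path.join(tmp_dir, 'idx2word.json'), 'w', encoding='utf-8') as file:
        json.dump({str(idx): word for idx, word in toy_idx2word.items()}, file)
    loaded_word2idx, loaded_idx2word = load_vocab(tmp_dir)

print(loaded_idx2word[2])
```

Skipping the `int(idx)` conversion is an easy mistake to make: lookups such as `idx2word[2]` would then raise `KeyError` because the loaded key is the string `"2"`.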

### Saving Preprocessed Data

After splitting the encoded text and creating input-target sequence pairs for the training, validation, and testing sets, it's important to save these preprocessed data arrays to disk. This step provides several key benefits:

1. **Computational efficiency**: Preprocessing text data can be time-consuming, especially for large corpora. By saving the results, we avoid repeating these operations in future sessions.

2. **Reproducibility**: Storing preprocessed data ensures that different training runs use exactly the same data, making experimental results more comparable and reproducible.

3. **Workflow flexibility**: Saving intermediate data allows us to experiment with different model architectures or training configurations without redoing the preprocessing steps.

4. **Checkpoint capability**: If our workflow is interrupted, we can resume from the saved data rather than starting from scratch.

The function below saves our preprocessed arrays (training, validation, and testing inputs and targets) to NumPy `.npy` files. These files can be quickly loaded in subsequent sessions, streamlining the development and experimentation process.

In [16]:
# Save prepared data arrays
def save_numpy_arrays(arrays, filenames, save_dir='data'):
    """
    Save numpy arrays to files for future use.

    Args:
        arrays (list): List of numpy arrays to save.
        filenames (list): List of filenames for each array.
        save_dir (str): Directory to save arrays.
    """
    # Create save directory if it doesn't exist
    os.makedirs(save_dir, exist_ok=True)

    # Save each array to a file
    for array, filename in zip(arrays, filenames):
        file_path = os.path.join(save_dir, filename)
        np.save(file_path, array)

    print(f"Data arrays saved to {save_dir}")

In [None]:
# Save all the preprocessed data arrays
save_numpy_arrays(
    [train_inputs, train_targets, val_inputs, val_targets, test_inputs, test_targets],
    ['train_inputs.npy', 'train_targets.npy', 'val_inputs.npy',
     'val_targets.npy', 'test_inputs.npy', 'test_targets.npy']
)

## Data Augmentation Techniques

Data augmentation is a powerful strategy to improve model generalization by artificially expanding the training dataset with modified versions of existing examples. For natural language processing tasks like our Shakespeare language modeling, appropriate augmentation techniques can help the model become more robust to variations in text and reduce overfitting, especially when working with limited data.

For our Transformer model, we'll implement three complementary augmentation techniques:

### 1. Word Dropout

Word dropout randomly replaces a percentage of words in the input sequences with the `<UNK>` (unknown) token. This technique serves multiple purposes:

- **Improves robustness**: Forces the model to learn to predict the next word even when parts of the context are missing
- **Simulates rare words**: Helps the model handle situations where it encounters words outside its vocabulary
- **Reduces overfitting**: Prevents the model from relying too heavily on specific words in fixed positions

In [18]:
# Word dropout function
def apply_word_dropout(sequences, word2idx, dropout_rate=0.1):
    """
    Randomly replace words with <UNK> token to improve model robustness.
    
    Args:
        sequences (np.ndarray): Input sequences of token IDs
        word2idx (dict): Word-to-index mapping
        dropout_rate (float): Probability of replacing a word with <UNK>
        
    Returns:
        np.ndarray: Augmented sequences with some words replaced by <UNK>
    """
    # Create a copy to avoid modifying original data
    augmented_sequences = sequences.copy()
    
    # Get token IDs for special tokens
    unk_id = word2idx["<UNK>"]
    pad_id = word2idx["<PAD>"]
    
    # Create random mask for dropout (True where we should drop)
    mask = np.random.random(sequences.shape) < dropout_rate
    
    # Only drop non-padding tokens
    effective_mask = (sequences != pad_id) & mask
    
    # Replace dropped tokens with <UNK>
    augmented_sequences[effective_mask] = unk_id
    
    return augmented_sequences
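
For reproducible experiments it can help to draw the dropout mask from a seeded generator instead of NumPy's global random state. The sketch below is one possible variant, not the function used in this notebook: the name `word_dropout_seeded`, the `seed` parameter, and the token IDs 0 and 1 are assumptions made for illustration.

```python
import numpy as np

def word_dropout_seeded(sequences, unk_id, pad_id, dropout_rate=0.1, seed=0):
    """Replace non-padding tokens with unk_id, reproducibly."""
    rng = np.random.default_rng(seed)            # local generator, no global state
    drop = rng.random(sequences.shape) < dropout_rate
    drop &= sequences != pad_id                  # never overwrite padding
    return np.where(drop, unk_id, sequences)

# Toy batch: 0 stands for <PAD>, 1 for <UNK> (hypothetical IDs).
toy = np.array([[5, 6, 7, 0, 0],
                [8, 9, 10, 11, 0]])
first = word_dropout_seeded(toy, unk_id=1, pad_id=0, dropout_rate=0.5, seed=42)
second = word_dropout_seeded(toy, unk_id=1, pad_id=0, dropout_rate=0.5, seed=42)
print(first)
```

Calling the function twice with the same seed yields the same augmented batch, which makes augmentation-related results easier to compare across runs.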

### 2. Random Token Swapping

This technique randomly swaps adjacent tokens in the input sequences, creating slight variations in word order. Intended benefits include:

- **Word-order noise**: Exposes the model to locally perturbed word order (most swapped pairs are not grammatical, so the swap rate should stay low)
- **Reduces positional bias**: Prevents the model from overly relying on exact word positions
- **Increases diversity**: Creates new token sequences while preserving most of the semantic content

In [19]:
# Randomly swaps adjacent tokens function
def apply_random_swaps(sequences, swap_rate=0.1):
    """
    Randomly swap adjacent words to create more diverse training data.
    
    Args:
        sequences (np.ndarray): Input sequences of token IDs
        swap_rate (float): Probability of swapping adjacent words
        
    Returns:
        np.ndarray: Augmented sequences with some adjacent words swapped
    """
    augmented_sequences = sequences.copy()
    seq_length = sequences.shape[1]
    
    # For each sequence in the batch
    for i in range(len(sequences)):
        # Consider each position except the last one
        for j in range(seq_length - 1):
            # Apply swap with probability swap_rate
            if np.random.random() < swap_rate:
                # Swap the current token with the next one
                augmented_sequences[i, j], augmented_sequences[i, j+1] = \
                    augmented_sequences[i, j+1], augmented_sequences[i, j]
    
    return augmented_sequences
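
One property of the loop above is worth knowing: after a swap at position `j`, the moved token sits at `j+1` and can be swapped again on the next iteration, so a token may drift several positions to the right, and padding tokens can be swapped as well. If that is not desired, a stricter variant is sketched below. The name `swap_adjacent_once` and the `pad_id` and `seed` parameters are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def swap_adjacent_once(sequences, pad_id, swap_rate=0.1, seed=0):
    """Swap adjacent non-padding tokens; each token moves at most one position."""
    rng = np.random.default_rng(seed)
    out = sequences.copy()
    for row in out:                        # each row is a view into `out`
        j = 0
        while j < len(row) - 1:
            both_real = row[j] != pad_id and row[j + 1] != pad_id
            if both_real and rng.random() < swap_rate:
                row[j], row[j + 1] = row[j + 1], row[j]
                j += 2                     # skip the pair so it is not swapped again
            else:
                j += 1
    return out

# With swap_rate=1.0 every eligible pair is swapped exactly once.
toy = np.array([[1, 2, 3, 4, 0, 0]])       # 0 stands for <PAD> (hypothetical ID)
swapped = swap_adjacent_once(toy, pad_id=0, swap_rate=1.0)
print(swapped)
```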

### 3. Combined Augmentation

Each technique is applied independently to its own copy of the training inputs, and the two augmented copies are concatenated with the original data, tripling the number of training sequences. The targets are repeated unchanged, so the model is still trained to predict the original text from a noisy context. Keeping both rates low introduces controlled variation without distorting the linguistic patterns too severely.

The function below implements this combination, with configurable parameters to control the intensity of each transformation. We apply augmentation only to the training data, keeping the validation and test sets in their original form so that they properly measure model performance.

In [20]:
# Data augmentation function
def augment_data(inputs, targets, word2idx, dropout_rate=0.1, swap_rate=0.1, verbose=True):
    """
    Create augmented training data using word dropout and random swaps.
    
    Args:
        inputs: Input sequences
        targets: Target sequences
        word2idx: Word to index mapping dictionary
        dropout_rate: Probability of dropping a word
        swap_rate: Probability of swapping adjacent words
        verbose: Whether to print information about the augmentation
        
    Returns:
        Tuple of (augmented_inputs, augmented_targets)
    """
    if verbose:
        print("Applying data augmentation techniques...")

    # Apply word dropout
    augmented_inputs_1 = apply_word_dropout(inputs, word2idx, dropout_rate=dropout_rate)

    # Apply random swaps
    augmented_inputs_2 = apply_random_swaps(inputs, swap_rate=swap_rate)

    # Combine original and augmented data
    augmented_inputs = np.concatenate([
        inputs,             # Original data
        augmented_inputs_1, # Word dropout augmentation
        augmented_inputs_2  # Word swap augmentation
    ], axis=0)

    # Repeat targets to match the augmented inputs
    augmented_targets = np.concatenate([
        targets,
        targets,
        targets
    ], axis=0)
    
    if verbose:
        print(f"Original dataset size: {len(inputs)} sequences")
        print(f"Augmented dataset size: {len(augmented_inputs)} sequences")
        
    return augmented_inputs, augmented_targets

In [None]:
# Apply data augmentation to training data
augmented_train_inputs, augmented_train_targets = augment_data(train_inputs, train_targets, word2idx)
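
Because `augment_data` concatenates three blocks (original, dropout, swap), the augmented arrays are ordered block by block. If a later shuffle uses a buffer smaller than the dataset, that ordering is only partly broken up. One option, sketched here on toy arrays, is to apply a single random permutation to inputs and targets before building the dataset. The toy arrays and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 6 sequences of length 4; targets are inputs offset by 100
# so that alignment is easy to verify.
toy_inputs = np.arange(24).reshape(6, 4)
toy_targets = toy_inputs + 100

# One permutation applied to both arrays keeps each input with its target.
perm = rng.permutation(len(toy_inputs))
shuffled_inputs = toy_inputs[perm]
shuffled_targets = toy_targets[perm]

print(perm)
```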

## TensorFlow Dataset Creation

After preparing and augmenting our data, we need to convert it into efficient TensorFlow Dataset objects for training and evaluation. The TensorFlow Dataset API provides optimized data pipelines that can significantly improve training performance through features like prefetching, batching, and caching.

For our Transformer model, we'll create three separate datasets:

### 1. Training Dataset

The training dataset requires special handling to optimize the learning process:
- **Shuffling**: Randomizes the order of examples so that each batch mixes original and augmented sequences and the model does not pick up patterns tied to example order
- **Batching**: Groups examples together for efficient parallel processing on GPU/TPU
- **Prefetching**: Loads the next batch of data while the current batch is being processed, reducing idle time
- **Caching**: Optionally stores processed data in memory to avoid redundant computation. The function below does not call `.cache()`; the arrays are already held in memory, so little would be gained here

### 2. Validation Dataset

The validation dataset is used to evaluate the model during training:
- No shuffling is applied to ensure consistent evaluation across epochs
- Batching is still used for computational efficiency
- The dataset size is typically smaller than the training set

### 3. Test Dataset

The test dataset is used only for final model evaluation:
- Like the validation set, no shuffling is applied
- Provides an unbiased assessment of model performance on unseen data

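The function below implements this dataset creation process, converting our sequence pairs into TensorFlow Dataset objects with the appropriate configuration for each use case. Note that it batches with `drop_remainder=True`, so when a split's size is not a multiple of the batch size the final partial batch is discarded, including for the validation and test sets. This approach ensures efficient data handling during training, which is particularly important for larger datasets or when training on GPU/TPU accelerators.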

In [22]:
# Create TensorFlow datasets for efficient training
def create_tf_dataset(inputs, targets, batch_size, buffer_size=10000, shuffle=True):
    """
    Create a TensorFlow dataset from input and target sequences.

    Args:
        inputs (numpy.ndarray): Input sequence data.
        targets (numpy.ndarray): Target sequence data.
        batch_size (int): Number of samples per batch.
        buffer_size (int): Buffer size for shuffling.
        shuffle (bool): Whether to shuffle the dataset.

    Returns:
        tf.data.Dataset: TensorFlow dataset for model training/evaluation.
    """
    # Create dataset from tensors
    dataset = tf.data.Dataset.from_tensor_slices((inputs, targets))

    # Shuffle dataset if specified
    if shuffle:
        dataset = dataset.shuffle(buffer_size)

    # Batch the dataset and drop remainder to ensure consistent batch sizes
    dataset = dataset.batch(batch_size, drop_remainder=True)

    # Prefetch data for better performance (tf.data.AUTOTUNE requires TF 2.4+)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)

    return dataset

In [None]:
# Create TensorFlow datasets
batch_size = 128
train_dataset = create_tf_dataset(augmented_train_inputs, augmented_train_targets, batch_size)
val_dataset = create_tf_dataset(val_inputs, val_targets, batch_size, shuffle=False)
test_dataset = create_tf_dataset(test_inputs, test_targets, batch_size, shuffle=False)

print("Datasets created successfully!")
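
Since `create_tf_dataset` batches with `drop_remainder=True`, it is worth checking how many examples each split loses. The arithmetic is a single `divmod`; the split sizes below are hypothetical and only illustrate the calculation.

```python
def batch_stats(num_examples, batch_size):
    """Return (full_batches, dropped_examples) under drop_remainder=True."""
    full_batches, dropped = divmod(num_examples, batch_size)
    return full_batches, dropped

# Hypothetical split sizes, for illustration only.
for name, n in [("train", 30000), ("val", 1000), ("test", 1000)]:
    batches, dropped = batch_stats(n, batch_size=128)
    print(f"{name}: {batches} batches, {dropped} examples dropped")
```

For small validation or test sets the dropped fraction can be noticeable, in which case a smaller evaluation batch size or `drop_remainder=False` for those splits may be preferable.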

## Transformer Architecture

The Transformer architecture, introduced in the seminal paper "Attention Is All You Need" by Vaswani et al. (2017), represents a paradigm shift in sequence modeling. Unlike previous approaches that relied on recurrent or convolutional neural networks, Transformers use self-attention mechanisms to process input sequences in parallel, capturing long-range dependencies more effectively while enabling significantly faster training.

### Key Components of the Transformer

Our implementation follows the original architecture with some adaptations for language modeling:

1. **Multi-Head Self-Attention**: The core innovation of the Transformer, allowing the model to attend to different positions in the input sequence simultaneously. Each attention head can focus on different aspects of the relationships between words.

2. **Positional Encoding**: Since the Transformer processes all tokens in parallel (rather than sequentially like RNNs), positional encodings are added to provide information about the relative or absolute position of tokens in the sequence.

3. **Feed-Forward Networks**: Each layer contains a position-wise feed-forward network that applies the same transformation to each position independently, adding non-linearity and representational power.

4. **Residual Connections and Layer Normalization**: These components help stabilize training and allow for deeper networks by mitigating the vanishing gradient problem.

5. **Encoder-Decoder Structure**: The complete Transformer consists of an encoder that processes the input sequence and a decoder that generates the output sequence, with cross-attention mechanisms connecting them.

### Adaptations for Language Modeling

For our Shakespeare text generation task, we make several adaptations to the original architecture:

1. **Causal Attention Masking**: In the decoder, we apply masking to ensure that predictions for a given position can only depend on known outputs at earlier positions, which is essential for autoregressive generation.

2. **Vocabulary Embeddings**: We use learned embeddings to convert token indices into continuous vector representations, with the same embedding layer used for both the encoder and decoder inputs.

3. **Linear Output Projection**: The final decoder output is projected to logits over our vocabulary; applying a softmax to these logits gives the probability distribution for the next word.

The following sections implement each component of the Transformer architecture, starting with positional encoding and the attention mechanism and building up to the complete encoder-decoder model.

### Positional Encoding

Unlike recurrent neural networks, the Transformer processes all tokens in a sequence simultaneously, which means it has no inherent understanding of token order. To address this limitation, positional encodings are added to the input embeddings to provide the model with information about the relative or absolute position of each token in the sequence.

The standard approach, as described in the original "Attention Is All You Need" paper, uses sine and cosine functions of different frequencies to create unique positional encodings:

$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$

Where:
- $pos$ is the position of the token in the sequence
- $i$ indexes the sine/cosine pairs ($0 \le i < d_{model}/2$), so dimensions $2i$ and $2i+1$ share the same frequency
- $d_{model}$ is the embedding dimension

This formulation has several important properties:

1. **Uniqueness**: Each position gets a unique encoding, allowing the model to distinguish between different positions
2. **Deterministic**: The encoding is fixed and doesn't require learning
3. **Bounded values**: The values fall within [-1, 1], making them compatible with the scale of the embeddings
4. **Theoretical extrapolation**: The sinusoidal pattern theoretically allows the model to extrapolate to sequence lengths longer than those seen during training

The implementation below creates these positional encodings as a fixed tensor that can be added to the input embeddings. By incorporating this information at the input level, every layer in the Transformer can access positional information, enabling the model to learn position-dependent patterns in the data.

In [24]:
# Positional encoding functions for Transformer
def get_angles(positions, indices, d_model):
    """
    Calculate the angles (position * angle rate) for positional encoding.

    Args:
        positions (numpy.ndarray): Position indices [0, 1, 2, ...]
        indices (numpy.ndarray): Dimension indices [0, 1, 2, ...]
        d_model (int): Model dimension size.

    Returns:
        numpy.ndarray: Angles in radians, broadcast to shape [num_positions, d_model].
    """
    # Each pair of dimensions (2i, 2i+1) shares the rate 1 / 10000^(2i / d_model)
    angle_rates = 1 / np.power(10000, (2 * (indices // 2)) / np.float32(d_model))
    return positions * angle_rates

def positional_encoding(max_position, d_model):
    """
    Generate positional encoding for Transformer model.

    The positional encoding is added to the embedding to provide
    positional information since Transformer has no recurrence/convolution.

    Args:
        max_position (int): Maximum sequence length to generate encodings for.
        d_model (int): Dimensionality of the model embeddings.

    Returns:
        tf.Tensor: Positional encoding tensor of shape [1, max_position, d_model].
    """
    # Calculate angle rates for all positions and dimensions
    angle_rads = get_angles(
        np.arange(max_position)[:, np.newaxis],    # Position vector: [0, 1, 2, ...]
        np.arange(d_model)[np.newaxis, :],         # Dimension vector: [0, 1, 2, ...]
        d_model
    )

    # Apply sine to even dimensions and cosine to odd dimensions
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])  # Even dimensions: sine
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])  # Odd dimensions: cosine

    # Add batch dimension [1, max_position, d_model]
    pos_encoding = angle_rads[np.newaxis, ...]

    # Convert to TensorFlow tensor with float32 precision
    return tf.cast(pos_encoding, dtype=tf.float32)
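
The properties listed above can be checked numerically. The sketch below re-implements the same formula in plain NumPy (the helper name `sinusoidal_encoding` and the sizes 50 and 16 are assumptions for illustration) and leaves out the batch dimension for brevity.

```python
import numpy as np

def sinusoidal_encoding(max_position, d_model):
    """NumPy-only sinusoidal encoding of shape (max_position, d_model)."""
    positions = np.arange(max_position)[:, np.newaxis]
    dims = np.arange(d_model)[np.newaxis, :]
    # Dimensions 2i and 2i+1 share the rate 1 / 10000^(2i / d_model).
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    encoding = np.empty_like(angles)
    encoding[:, 0::2] = np.sin(angles[:, 0::2])   # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])   # odd dimensions
    return encoding

pe = sinusoidal_encoding(max_position=50, d_model=16)

print(pe.shape)                                      # shape check
print(np.abs(pe).max() <= 1.0)                       # bounded values
print(len(np.unique(pe.round(6), axis=0)) == 50)     # one distinct row per position
```

At position 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check when debugging an implementation.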

### Scaled Dot-Product Attention

At the heart of the Transformer architecture lies the attention mechanism, specifically the scaled dot-product attention. This mechanism allows the model to focus on different parts of the input sequence when producing each element of the output sequence, effectively capturing dependencies regardless of their distance in the sequence.

The scaled dot-product attention is computed as follows:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- $Q$ (query), $K$ (key), and $V$ (value) are matrices representing different projections of the input
- $d_k$ is the dimension of the keys (used for scaling)
- The softmax is applied row-wise

The attention operation can be broken down into these steps:

1. **Compatibility calculation**: Compute dot products between each query and all keys ($QK^T$)
2. **Scaling**: Divide by $\sqrt{d_k}$ so that, when $d_k$ is large, the dot products do not grow large enough to push the softmax into regions with extremely small gradients
3. **Masking (optional)**: Apply a mask to prevent attention to certain positions (e.g., future positions in causal attention)
4. **Softmax**: Apply softmax to obtain attention weights that sum to 1
5. **Value weighting**: Multiply the attention weights by the values ($V$)

The result is a weighted sum of the values, where the weights are determined by the compatibility of the corresponding keys with each query.

For language modeling, we often use causal masking in the decoder to ensure that predictions for a given position can only depend on known outputs at earlier positions, which is essential for autoregressive generation.

The implementation below creates a flexible attention function that supports both standard attention for the encoder and masked attention for the decoder.

In [25]:
# Scaled dot-product attention mechanism
def scaled_dot_product_attention(query, key, value, mask=None):
    """
    Calculate scaled dot-product attention as described in the Transformer paper.

    The attention formula is: Attention(Q, K, V) = softmax(QK^T/√d_k)V

    Args:
        query: Query tensor of shape (..., seq_len_q, depth).
        key: Key tensor of shape (..., seq_len_k, depth).
        value: Value tensor of shape (..., seq_len_v, depth_v).
        mask: Optional mask tensor of shape broadcastable to (..., seq_len_q, seq_len_k).
              Positions where the mask is 1 are masked out; positions where it is 0 are kept.

    Returns:
        tuple: (attention output, attention weights)
    """
    # Calculate dot product of query and key: (Q)(K^T)
    matmul_qk = tf.matmul(query, key, transpose_b=True)

    # Scale dot product by square root of depth
    depth = tf.cast(tf.shape(key)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(depth)

    # Apply mask if provided (adding large negative values to masked positions)
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # Apply softmax to get attention weights
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)

    # Compute weighted sum of values using attention weights
    attention_output = tf.matmul(attention_weights, value)

    return attention_output, attention_weights
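
To make the five steps concrete, here is a small NumPy sketch of the same computation on a toy sequence with a causal mask. It uses the same mask convention as the function above (the mask is multiplied by `-1e9`, so a 1 hides a position). The helper names and the toy sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_np(q, k, v, mask=None):
    logits = q @ k.T / np.sqrt(k.shape[-1])    # compatibility scores, scaled
    if mask is not None:
        logits = logits + mask * -1e9          # a 1 in the mask hides that position
    weights = softmax(logits)                  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(4, 8))            # 4 tokens, depth 8 (self-attention)
causal_mask = 1.0 - np.tril(np.ones((4, 4)))   # 1 above the diagonal = future tokens
output, weights = attention_np(q, k, v, causal_mask)

print(np.round(weights, 3))
```

The printed weight matrix is lower triangular: the first token can only attend to itself, so its output equals its own value vector.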

### Multi-Head Attention Implementation

While the scaled dot-product attention mechanism is powerful, the original Transformer paper introduced multi-head attention to further enhance the model's capability. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, enabling it to capture various aspects of the relationships between tokens.

The multi-head attention mechanism works as follows:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, ..., \text{head}_h)W^O$$

Where each head is computed as:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

In this formulation:
- $W_i^Q$, $W_i^K$, and $W_i^V$ are learned projection matrices for the $i$-th head
- $W^O$ is the output projection matrix
- $h$ is the number of attention heads

The key advantages of multi-head attention include:

1. **Parallel attention mechanisms**: Each head can focus on different aspects of the input, similar to having multiple feature detectors
2. **Increased representation power**: The model can simultaneously attend to information from different representation subspaces
3. **Comparable cost**: Because each head operates on a reduced dimension of $d_{model}/h$, the total computation is similar to that of a single attention head with full dimensionality

Our implementation follows these steps:
1. Project the inputs into multiple heads using learned linear transformations
2. Apply scaled dot-product attention to each head independently
3. Concatenate the results from all heads
4. Apply a final linear projection to produce the output

This implementation supports both self-attention (where Q, K, and V are the same sequence) and cross-attention (where Q comes from one sequence while K and V come from another), which are used in different parts of the Transformer architecture.

In [26]:
# Multi-head attention layer implementation
class MultiHeadAttention(tf.keras.layers.Layer):
    """
    Multi-head attention layer as described in 'Attention Is All You Need'.

    This layer splits the embedding dimension into multiple heads to allow
    the model to jointly attend to information from different representation
    subspaces at different positions.
    """

    def __init__(self, d_model, num_heads):
        """
        Initialize multi-head attention layer.

        Args:
            d_model (int): Embedding dimension.
            num_heads (int): Number of attention heads.
        """
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model

        # Ensure d_model is divisible by num_heads
        assert d_model % self.num_heads == 0, "d_model must be divisible by num_heads"

        # Depth of each attention head
        self.depth = d_model // self.num_heads

        # Dense layers for linear projections
        self.query_layer = tf.keras.layers.Dense(d_model)  # Query projection
        self.key_layer = tf.keras.layers.Dense(d_model)    # Key projection
        self.value_layer = tf.keras.layers.Dense(d_model)  # Value projection
        self.output_layer = tf.keras.layers.Dense(d_model) # Output projection

    def split_heads(self, x, batch_size):
        """
        Split the last dimension into (num_heads, depth).

        Args:
            x (tf.Tensor): Tensor with shape (batch_size, seq_len, d_model).
            batch_size (int): Batch size.

        Returns:
            tf.Tensor: Reshaped tensor with shape (batch_size, num_heads, seq_len, depth).
        """
        # Reshape x to (batch_size, seq_len, num_heads, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))

        # Transpose to (batch_size, num_heads, seq_len, depth)
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, value, key, query, mask=None):
        """
        Forward pass for multi-head attention.

        Args:
            value (tf.Tensor): Value tensor.
            key (tf.Tensor): Key tensor.
            query (tf.Tensor): Query tensor.
            mask (tf.Tensor, optional): Attention mask tensor.

        Returns:
            tuple: (attention output, attention weights)
        """
        batch_size = tf.shape(query)[0]

        # Linear projections
        query_projected = self.query_layer(query)  # (batch_size, seq_len_q, d_model)
        key_projected = self.key_layer(key)        # (batch_size, seq_len_k, d_model)
        value_projected = self.value_layer(value)  # (batch_size, seq_len_v, d_model)

        # Split projections into multiple heads
        query_multi_head = self.split_heads(query_projected, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        key_multi_head = self.split_heads(key_projected, batch_size)      # (batch_size, num_heads, seq_len_k, depth)
        value_multi_head = self.split_heads(value_projected, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # Apply scaled dot-product attention to each head
        scaled_attention, attention_weights = scaled_dot_product_attention(
            query_multi_head, key_multi_head, value_multi_head, mask)
        # scaled_attention shape: (batch_size, num_heads, seq_len_q, depth)

        # Transpose and reshape back to original dimensions
        # Transpose to (batch_size, seq_len_q, num_heads, depth)
        scaled_attention_transposed = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])

        # Reshape to (batch_size, seq_len_q, d_model)
        concat_attention = tf.reshape(scaled_attention_transposed, (batch_size, -1, self.d_model))

        # Apply final output projection
        output = self.output_layer(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights
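
The reshape and transpose in `split_heads`, and their inverse at the end of `call`, are easy to get wrong. The NumPy sketch below mirrors both on a toy tensor and confirms that splitting and then merging returns the original; the sizes are arbitrary assumptions.

```python
import numpy as np

batch_size, seq_len, d_model, num_heads = 2, 5, 8, 4
depth = d_model // num_heads

x = np.arange(batch_size * seq_len * d_model, dtype=np.float32)
x = x.reshape(batch_size, seq_len, d_model)

# Split: (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
heads = x.reshape(batch_size, seq_len, num_heads, depth).transpose(0, 2, 1, 3)

# Merge: undo the transpose, then flatten the head and depth axes.
merged = heads.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, d_model)

print(heads.shape, merged.shape)
```

Head `h` of a token simply holds features `h * depth` through `(h + 1) * depth - 1` of that token's projected vector.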

### Feed-Forward Networks

In addition to the attention mechanisms, each layer of the Transformer contains a position-wise feed-forward network (FFN). This component applies the same feed-forward transformation to each position independently, adding non-linearity and increasing the model's representational capacity.

The feed-forward network consists of two linear transformations with a ReLU activation in between:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Where:
- $W_1$ and $b_1$ are the weights and bias of the first linear transformation
- $W_2$ and $b_2$ are the weights and bias of the second linear transformation
- $\max(0, \cdot)$ represents the ReLU activation function

Key characteristics of the feed-forward networks in Transformers:

1. **Position-wise application**: The same transformation is applied to each position independently, preserving the sequence length
2. **Dimensionality expansion**: Typically, the inner dimension ($d_{ff}$) is larger than the model dimension ($d_{model}$), allowing for more complex transformations
3. **Non-linearity**: The ReLU activation introduces non-linearity, enabling the model to learn more complex functions
4. **Parameter efficiency**: Despite the increased inner dimension, sharing parameters across positions keeps the total parameter count manageable

The feed-forward network can also be viewed as two convolutions with kernel size 1, processing each position in the sequence independently while transforming the feature representation. This component complements the attention mechanism: while attention captures relationships between different positions, the feed-forward network processes each position's features more deeply.

The implementation below defines a function that builds this feed-forward network as a `tf.keras.Sequential` model, following the original Transformer architecture specifications.

In [27]:
# Point-wise feed-forward network
def point_wise_feed_forward_network(d_model, dff):
    """
    Create a point-wise feed-forward network for Transformer layers.

    As specified in the Transformer paper, each layer contains a fully-connected
    feed-forward network consisting of two linear transformations with a ReLU
    activation in between.

    Args:
        d_model (int): Model dimension (input and output dimension).
        dff (int): Inner layer dimension (typically 4*d_model).

    Returns:
        tf.keras.Sequential: Feed-forward network.
    """
    return tf.keras.Sequential([
        # First dense layer with ReLU activation
        tf.keras.layers.Dense(dff, activation='relu'),  # (batch_size, seq_len, dff)

        # Second dense layer to restore original dimension
        tf.keras.layers.Dense(d_model)  # (batch_size, seq_len, d_model)
    ])
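
The "position-wise" property can be verified directly: running the network on a whole sequence gives the same result as running it on each position separately. The NumPy sketch below uses random weights and toy sizes, both assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, dff, seq_len = 4, 16, 3

W1, b1 = rng.normal(size=(d_model, dff)), rng.normal(size=dff)
W2, b2 = rng.normal(size=(dff, d_model)), rng.normal(size=d_model)

def ffn(x):
    """FFN(x) = max(0, x W1 + b1) W2 + b2"""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(seq_len, d_model))
whole_sequence = ffn(x)                                   # all positions at once
one_at_a_time = np.stack([ffn(x[t]) for t in range(seq_len)])

# Parameter count does not depend on the sequence length.
num_params = W1.size + b1.size + W2.size + b2.size
print(np.allclose(whole_sequence, one_at_a_time), num_params)
```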

## Encoder Components

The encoder is a fundamental component of the Transformer architecture, responsible for processing the input sequence and creating contextualized representations that capture the relationships between tokens. The encoder consists of multiple identical layers stacked on top of each other, each containing two main sub-layers: a multi-head self-attention mechanism and a position-wise feed-forward network.




### Encoder Layer

Each encoder layer transforms its input through the following sequence of operations:

1. **Multi-Head Self-Attention**: Allows each position to attend to all positions in the previous layer, capturing contextual relationships regardless of distance
   - Input: Sequence representations from the previous layer
   - Process: Computes attention scores between all pairs of positions
   - Output: Context-aware representations for each position

2. **Residual Connection and Layer Normalization**: After the attention sub-layer
   - Residual connection: Adds the input to the output of the attention sub-layer, helping with gradient flow
   - Layer normalization: Normalizes the features across the embedding dimension, stabilizing training

3. **Position-wise Feed-Forward Network**: Applies the same feed-forward transformation to each position independently
   - Input: Normalized output from the attention sub-layer
   - Process: Two linear transformations with a ReLU activation in between
   - Output: Transformed representations with enhanced feature extraction

4. **Second Residual Connection and Layer Normalization**: After the feed-forward sub-layer
   - Completes the layer by adding another residual connection and normalization

This architecture allows each encoder layer to refine the representations from the previous layer, gradually building up a rich understanding of the input sequence.

The implementation below creates the individual encoder layer, following the architecture described in the original Transformer paper. The complete encoder stack is built from it in the next subsection.

In [28]:
# Encoder layer implementation
class EncoderLayer(tf.keras.layers.Layer):
    """
    Encoder layer for the Transformer model.

    Each encoder layer consists of:
    1. Multi-head self-attention mechanism
    2. Position-wise feed-forward network

    Both sublayers have residual connections and layer normalization.
    """

    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        """
        Initialize encoder layer.

        Args:
            d_model (int): Model dimension.
            num_heads (int): Number of attention heads.
            dff (int): Inner dimension of feed-forward network.
            dropout_rate (float): Dropout rate for regularization.
        """
        super(EncoderLayer, self).__init__()

        # Multi-head attention sublayer
        self.multi_head_attention = MultiHeadAttention(d_model, num_heads)

        # Feed-forward network sublayer
        self.feed_forward = point_wise_feed_forward_network(d_model, dff)

        # Layer normalization layers
        self.layer_norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        # Dropout layers for regularization
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, inputs, training=True, mask=None):
        """
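        Forward pass for encoder layer.

        Args:
            inputs (tf.Tensor): Input tensor of shape (batch_size, seq_len, d_model).
            training (bool): Whether in training mode. Defaults to True, so pass
                training=False at inference time to disable dropout.
            mask (tf.Tensor, optional): Padding mask.

        Returns:
            tf.Tensor: Output of the encoder layer, same shape as inputs.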
        """
        # Multi-head self-attention sublayer
        # For self-attention, the same tensor is used for query, key, and value
        attention_output, _ = self.multi_head_attention(
            query=inputs, key=inputs, value=inputs, mask=mask)
        attention_output = self.dropout1(attention_output, training=training)
        attention_output_normalized = self.layer_norm1(inputs + attention_output)  # Residual connection

        # Feed-forward network sublayer
        ffn_output = self.feed_forward(attention_output_normalized)
        ffn_output = self.dropout2(ffn_output, training=training)
        output_normalized = self.layer_norm2(attention_output_normalized + ffn_output)  # Residual connection

        return output_normalized
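
Both sub-layers above follow the same pattern, `LayerNorm(x + Sublayer(x))`. The NumPy sketch below isolates that pattern. It omits the learned scale and offset of `LayerNormalization` and uses a stand-in sub-layer, both of which are simplifying assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the last (feature) axis; learned gamma/beta omitted."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    """Post-norm residual pattern: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(2, 4, 8))     # (batch, seq_len, d_model)
out = residual_block(x, sublayer=lambda t: 0.1 * t)    # stand-in sub-layer

print(out.mean(axis=-1).round(6))
```

Whatever the scale of the input, each position's feature vector leaves the block with roughly zero mean and unit variance, which is what keeps activations stable as layers are stacked.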

### Complete Encoder Stack

The complete encoder consists of N identical layers stacked sequentially. This stacking allows the model to build increasingly abstract and context-aware representations:

- The first few layers typically capture more local patterns and syntactic relationships
- Deeper layers develop more abstract and semantic representations
- The final layer produces the encoder output that will be used by the decoder

The number of encoder layers (N) is a hyperparameter that affects the model's capacity and computational requirements. Deeper encoders can capture more complex patterns but require more computation and may be more difficult to train.



In [29]:
# Encoder implementation
class Encoder(tf.keras.layers.Layer):
    """
    Transformer encoder consisting of multiple encoder layers.

    The encoder processes the input sequence through:
    1. Input embedding
    2. Positional encoding addition
    3. Multiple encoder layers
    """

    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, maximum_position_encoding, dropout_rate=0.1):
        """
        Initialize encoder.

        Args:
            num_layers (int): Number of encoder layers.
            d_model (int): Model dimension.
            num_heads (int): Number of attention heads.
            dff (int): Inner dimension of feed-forward networks.
            input_vocab_size (int): Size of input vocabulary.
            maximum_position_encoding (int): Maximum sequence length for position encoding.
            dropout_rate (float): Dropout rate for regularization.
        """
        super(Encoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        # Input embedding layer
        self.embedding_layer = tf.keras.layers.Embedding(input_vocab_size, d_model)

        # Positional encoding
        self.positional_encoding = positional_encoding(maximum_position_encoding, d_model)

        # Stack of encoder layers
        self.encoder_layers = [
            EncoderLayer(d_model, num_heads, dff, dropout_rate)
            for _ in range(num_layers)
        ]

        # Dropout for regularization
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, inputs, training=True, mask=None):
        """
        Forward pass for encoder.

        Args:
            inputs (tf.Tensor): Input token indices.
            training (bool): Whether in training mode.
            mask (tf.Tensor, optional): Padding mask.

        Returns:
            tf.Tensor: Encoder output.
        """
        # Get sequence length
        sequence_length = tf.shape(inputs)[1]

        # Convert token indices to embeddings
        embeddings = self.embedding_layer(inputs)  # (batch_size, seq_len, d_model)

        # Scale embeddings
        embeddings *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))

        # Add positional encoding
        embeddings += self.positional_encoding[:, :sequence_length, :]

        # Apply dropout to embeddings
        encoder_output = self.dropout(embeddings, training=training)

        # Pass through each encoder layer
        for layer_index in range(self.num_layers):
            encoder_output = self.encoder_layers[layer_index](
                encoder_output, training=training, mask=mask)

        # Final encoder output shape: (batch_size, seq_len, d_model)
        return encoder_output

## Decoder Components

The decoder is the second major component of the Transformer architecture, responsible for generating output sequences based on the encoded representations. Like the encoder, the decoder consists of multiple identical layers stacked on top of each other, but with an additional cross-attention mechanism that connects it to the encoder output.





### Decoder Layer

Each decoder layer transforms its input through the following sequence of operations:

1. **Masked Multi-Head Self-Attention**: Similar to the encoder's self-attention, but with a crucial difference - it includes a look-ahead mask to prevent positions from attending to future positions
   - Input: Sequence representations from the previous decoder layer
   - Process: Computes attention scores with masking to ensure autoregressive property
   - Output: Context-aware representations that only depend on previous positions

2. **Residual Connection and Layer Normalization**: After the masked self-attention sub-layer
   - Residual connection: Adds the input to the output of the attention sub-layer
   - Layer normalization: Normalizes the features across the embedding dimension

3. **Multi-Head Cross-Attention**: Allows each position in the decoder to attend to all positions in the encoder output
   - Input: Normalized output from the self-attention sub-layer and encoder output
   - Process: Computes attention scores between decoder queries and encoder keys/values
   - Output: Representations that incorporate information from the encoder

4. **Second Residual Connection and Layer Normalization**: After the cross-attention sub-layer

5. **Position-wise Feed-Forward Network**: Identical to the one in the encoder layer
   - Input: Normalized output from the cross-attention sub-layer
   - Process: Two linear transformations with a ReLU activation in between
   - Output: Transformed representations with enhanced feature extraction

6. **Third Residual Connection and Layer Normalization**: After the feed-forward sub-layer

The masking in the self-attention mechanism is crucial for autoregressive generation, ensuring that predictions for position i can only depend on known outputs at positions less than i.
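
The autoregressive property can be checked numerically. The sketch below is an illustration only: it uses NumPy rather than the notebook's TensorFlow layers, a single attention head, and no learned projections (all simplifying assumptions), and `masked_self_attention` is a hypothetical helper that does not appear elsewhere in this notebook. It changes the last token of a sequence and confirms that the outputs at earlier positions stay the same.

```python
import numpy as np

def masked_self_attention(x):
    """Single-head self-attention with a causal mask (no learned projections)."""
    seq_len, depth = x.shape
    scores = x @ x.T / np.sqrt(depth)                      # (seq_len, seq_len)
    blocked = np.triu(np.ones((seq_len, seq_len)), k=1)    # 1 marks a future position
    scores = scores + blocked * -1e9                       # push blocked scores towards -inf
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out_a, weights = masked_self_attention(x)

# Change only the last token; outputs at earlier positions must not move.
x_changed = x.copy()
x_changed[-1] += 10.0
out_b, _ = masked_self_attention(x_changed)

print(np.allclose(out_a[:-1], out_b[:-1]))  # True
```

Without the mask, every row of `out_b` would differ from `out_a`, because each position would mix in the modified final token.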

The implementation below creates both the individual decoder layer and the complete decoder stack, following the architecture described in the original Transformer paper.

In [30]:
# Decoder layer implementation
class DecoderLayer(tf.keras.layers.Layer):
    """
    Decoder layer for the Transformer model.

    Each decoder layer consists of:
    1. Masked multi-head self-attention mechanism
    2. Multi-head encoder-decoder attention mechanism
    3. Position-wise feed-forward network

    All sublayers have residual connections and layer normalization.
    """

    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        """
        Initialize decoder layer.

        Args:
            d_model (int): Model dimension.
            num_heads (int): Number of attention heads.
            dff (int): Inner dimension of feed-forward network.
            dropout_rate (float): Dropout rate for regularization.
        """
        super(DecoderLayer, self).__init__()

        # Masked multi-head self-attention sublayer
        self.self_attention = MultiHeadAttention(d_model, num_heads)

        # Multi-head encoder-decoder attention sublayer
        self.encoder_decoder_attention = MultiHeadAttention(d_model, num_heads)

        # Feed-forward network sublayer
        self.feed_forward = point_wise_feed_forward_network(d_model, dff)

        # Layer normalization layers
        self.layer_norm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layer_norm3 = tf.keras.layers.LayerNormalization(epsilon=1e-6)

        # Dropout layers for regularization
        self.dropout1 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout2 = tf.keras.layers.Dropout(dropout_rate)
        self.dropout3 = tf.keras.layers.Dropout(dropout_rate)

    def call(self, inputs, encoder_output, training=True, look_ahead_mask=None, padding_mask=None):
        """
        Forward pass for decoder layer.

        Args:
            inputs (tf.Tensor): Decoder input tensor.
            encoder_output (tf.Tensor): Output from the encoder.
            training (bool): Whether in training mode.
            look_ahead_mask (tf.Tensor, optional): Mask for masked self-attention.
            padding_mask (tf.Tensor, optional): Mask for encoder-decoder attention.

        Returns:
            tuple: (layer output, self-attention weights, encoder-decoder attention weights)
        """
        # Masked multi-head self-attention sublayer
        self_attention_output, self_attention_weights = self.self_attention(
            query=inputs, key=inputs, value=inputs, mask=look_ahead_mask)
        self_attention_output = self.dropout1(self_attention_output, training=training)
        self_attention_normalized = self.layer_norm1(inputs + self_attention_output)  # Residual connection

        # Multi-head encoder-decoder attention sublayer
        enc_dec_attention_output, enc_dec_attention_weights = self.encoder_decoder_attention(
            query=self_attention_normalized, key=encoder_output, value=encoder_output, mask=padding_mask)
        enc_dec_attention_output = self.dropout2(enc_dec_attention_output, training=training)
        enc_dec_attention_normalized = self.layer_norm2(
            self_attention_normalized + enc_dec_attention_output)  # Residual connection

        # Feed-forward network sublayer
        ffn_output = self.feed_forward(enc_dec_attention_normalized)
        ffn_output = self.dropout3(ffn_output, training=training)
        output_normalized = self.layer_norm3(enc_dec_attention_normalized + ffn_output)  # Residual connection

        return output_normalized, self_attention_weights, enc_dec_attention_weights

### Complete Decoder Stack

The complete decoder consists of N identical layers stacked sequentially, mirroring the structure of the encoder stack. The decoder stack processes its inputs in conjunction with the encoder output:

1. The first layer takes positionally encoded embeddings of the target sequence (shifted right)
2. Each subsequent layer refines these representations using self-attention, cross-attention with the encoder output, and feed-forward transformations
3. The final layer produces representations that are then projected to output probabilities over the vocabulary

The decoder's autoregressive property makes it suitable for sequence generation tasks like language modeling, where each token is generated based on all previously generated tokens.
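
To make "shifted right" concrete, here is a minimal sketch in plain Python. The token IDs are made up for illustration, and the slicing shown is the usual teacher-forcing convention rather than code taken from this notebook's data pipeline.

```python
# Hypothetical token IDs for one training sequence (0 is reserved for padding).
sequence = [12, 7, 431, 9, 58]

decoder_input = sequence[:-1]   # what the decoder is fed
target = sequence[1:]           # what it must predict at each position

for position, expected in enumerate(target):
    # With the look-ahead mask, position i only sees decoder_input[0..i].
    context = decoder_input[:position + 1]
    print(f"position {position}: context={context} -> predict {expected}")
```

Each position is trained to predict the token that follows its visible context, which is exactly the situation the model faces when generating text one token at a time.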



In [31]:
# Decoder implementation
class Decoder(tf.keras.layers.Layer):
    """
    Transformer decoder consisting of multiple decoder layers.

    The decoder processes the target sequence through:
    1. Target embedding
    2. Positional encoding addition
    3. Multiple decoder layers
    """

    def __init__(self, num_layers, d_model, num_heads, dff,
                 target_vocab_size, maximum_position_encoding, dropout_rate=0.1):
        """
        Initialize decoder.

        Args:
            num_layers (int): Number of decoder layers.
            d_model (int): Model dimension.
            num_heads (int): Number of attention heads.
            dff (int): Inner dimension of feed-forward networks.
            target_vocab_size (int): Size of target vocabulary.
            maximum_position_encoding (int): Maximum sequence length for position encoding.
            dropout_rate (float): Dropout rate for regularization.
        """
        super(Decoder, self).__init__()

        self.d_model = d_model
        self.num_layers = num_layers

        # Target embedding layer
        self.embedding_layer = tf.keras.layers.Embedding(target_vocab_size, d_model)

        # Positional encoding
        self.positional_encoding = positional_encoding(maximum_position_encoding, d_model)

        # Stack of decoder layers
        self.decoder_layers = [
            DecoderLayer(d_model, num_heads, dff, dropout_rate)
            for _ in range(num_layers)
        ]

        # Dropout for regularization
        self.dropout = tf.keras.layers.Dropout(dropout_rate)

    def call(self, inputs, encoder_output, training=True, look_ahead_mask=None, padding_mask=None):
        """
        Forward pass for decoder.

        Args:
            inputs (tf.Tensor): Target token indices.
            encoder_output (tf.Tensor): Output from the encoder.
            training (bool): Whether in training mode.
            look_ahead_mask (tf.Tensor, optional): Mask for masked self-attention.
            padding_mask (tf.Tensor, optional): Mask for encoder-decoder attention.

        Returns:
            tuple: (decoder output, attention weights dictionary)
        """
        # Get sequence length
        sequence_length = tf.shape(inputs)[1]

        # Dictionary to store attention weights from each decoder layer
        attention_weights = {}

        # Convert token indices to embeddings
        embeddings = self.embedding_layer(inputs)  # (batch_size, seq_len, d_model)

        # Scale embeddings
        embeddings *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))

        # Add positional encoding
        embeddings += self.positional_encoding[:, :sequence_length, :]

        # Apply dropout to embeddings
        decoder_output = self.dropout(embeddings, training=training)

        # Pass through each decoder layer
        for layer_index in range(self.num_layers):
            decoder_output, self_attn_weights, enc_dec_attn_weights = self.decoder_layers[layer_index](
                decoder_output,
                encoder_output,
                training=training,
                look_ahead_mask=look_ahead_mask,
                padding_mask=padding_mask
            )

            # Store attention weights for visualization/analysis
            attention_weights[f'decoder_layer{layer_index+1}_self_attention'] = self_attn_weights
            attention_weights[f'decoder_layer{layer_index+1}_encoder_decoder_attention'] = enc_dec_attn_weights

        # Final decoder output shape: (batch_size, seq_len, d_model)
        return decoder_output, attention_weights

## Masking Functions

Masking is a crucial aspect of the Transformer architecture, serving two distinct purposes:

1. **Padding Masks**: Handle variable-length sequences in batched processing
2. **Look-Ahead Masks**: Ensure the autoregressive property in the decoder

These masks modify the attention mechanism to prevent certain connections between positions, effectively controlling information flow within the model.

### Padding Mask Creation

When processing batches of sequences, we typically pad shorter sequences to match the length of the longest sequence in the batch. However, we don't want the model to attend to or be influenced by these padding tokens. Padding masks solve this problem:

- They identify which positions contain actual tokens versus padding tokens
- They're applied to the attention scores (logits) before the softmax operation
- By pushing the scores at padding positions to a large negative value (e.g., -1e9), the softmax assigns those positions effectively zero attention weight

Padding masks are used in both the encoder and decoder to ensure that padding tokens don't contribute to the contextual representations.
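
The effect on the softmax is easy to see with a small NumPy sketch. The scores and token IDs below are invented for illustration, and `softmax` is a local helper defined here, not a function from this notebook.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# One query attending over four key positions; the last two are padding (ID 0).
token_ids = np.array([5, 9, 0, 0])
scores = np.array([2.0, 1.0, 3.0, 0.5])

padding_mask = (token_ids == 0).astype(np.float32)   # 1 marks padding
masked_scores = scores + padding_mask * -1e9          # applied before the softmax

weights = softmax(masked_scores)
print(weights)  # padding positions receive zero weight
```

Note that the third position has the highest raw score (3.0), yet it receives no attention once the mask is applied; the remaining weight is shared between the two real tokens.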

In [32]:
# Helper functions for creating masks
def create_padding_mask(sequence):
    """
    Create a padding mask for transformer model.

    This mask identifies padding tokens (zeros) in the input sequence to ensure
    they don't contribute to attention calculations.

    Args:
        sequence (tf.Tensor): Input sequence tensor of shape (batch_size, seq_len).

    Returns:
        tf.Tensor: Padding mask of shape (batch_size, 1, 1, seq_len).
    """
    # Create mask where padding tokens (0) become 1, and other tokens become 0
    mask = tf.cast(tf.math.equal(sequence, 0), tf.float32)

    # Add dimensions for multi-head attention broadcasting
    # Shape: (batch_size, 1, 1, seq_len)
    return mask[:, tf.newaxis, tf.newaxis, :]

### Look-Ahead Mask Creation

The look-ahead mask (or causal mask) is specific to the decoder's self-attention mechanism and enforces the autoregressive property:

- It prevents each position from attending to future positions
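- The allowed connections form a lower triangular pattern: each position i can only attend to positions j ≤ i. The mask itself marks the blocked positions (j > i) with 1s, so it is strictly upper triangular
- Like padding masks, it works by pushing the scores of prohibited connections to large negative values before the softmax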

This masking is essential for language modeling and text generation tasks, where the model should only condition its predictions on previously generated tokens.

In [None]:
def create_look_ahead_mask(sequence_length):
    """
    Create a look-ahead mask for decoder self-attention.

    This mask prevents the decoder from attending to future positions during training,
    ensuring autoregressive property (can only see previous positions).

    Args:
        sequence_length (int): Length of the sequence.

    Returns:
        tf.Tensor: Look-ahead mask of shape (sequence_length, sequence_length).
    """
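    # band_part(..., -1, 0) keeps the lower triangle (including the diagonal) as ones.
    # Subtracting it from 1 leaves 1s strictly above the diagonal (future positions to block)
    # and 0s on and below the diagonal (positions that may be attended to).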
    mask = 1 - tf.linalg.band_part(tf.ones((sequence_length, sequence_length)), -1, 0)
    return mask  # Shape: (seq_len, seq_len)
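
For a concrete picture of what this function returns, the sketch below builds the same matrix with NumPy (`np.tril` standing in for `tf.linalg.band_part(..., -1, 0)`) for a sequence of length 4.

```python
import numpy as np

sequence_length = 4

# 1 - (lower triangle including the diagonal) leaves 1s strictly above the diagonal.
mask = 1 - np.tril(np.ones((sequence_length, sequence_length)))
print(mask)

# Row i is the query position, column j the key position; 0 means "may attend".
allowed = [np.flatnonzero(row == 0).tolist() for row in mask]
print(allowed)  # [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]
```

The first position can only see itself, while the last position can see the whole sequence.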

### Combined Masking

In practice, we often need to combine both types of masks in the decoder:

- The look-ahead mask ensures autoregressive generation
- The padding mask ensures we ignore padding tokens
- The combined mask applies both constraints simultaneously

The helper functions below implement these masking operations, creating the appropriate tensor masks that can be directly applied in the attention mechanism.

In [None]:
def create_masks(encoder_input, decoder_input):
    """
    Create all necessary masks for the transformer model.

    Args:
        encoder_input (tf.Tensor): Input to the encoder.
        decoder_input (tf.Tensor): Input to the decoder.

    Returns:
        tuple: (encoder padding mask, combined decoder mask, decoder padding mask)
    """
    # Create padding mask for encoder inputs
    encoder_padding_mask = create_padding_mask(encoder_input)

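    # Create padding mask for the encoder outputs, used in the decoder's encoder-decoder attention
    # (built from encoder_input because the keys and values there come from the encoder)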
    decoder_padding_mask = create_padding_mask(encoder_input)

    # Create look-ahead mask for decoder self-attention
    decoder_look_ahead_mask = create_look_ahead_mask(tf.shape(decoder_input)[1])

    # Create padding mask for decoder inputs in decoder self-attention
    decoder_input_padding_mask = create_padding_mask(decoder_input)

    # Combine look-ahead mask and padding mask for decoder self-attention
    # This prevents attending to future tokens AND padding tokens
    combined_decoder_mask = tf.maximum(decoder_look_ahead_mask, decoder_input_padding_mask)

    return encoder_padding_mask, combined_decoder_mask, decoder_padding_mask
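
The combination step relies on broadcasting between a `(seq_len, seq_len)` look-ahead mask and a `(batch_size, 1, 1, seq_len)` padding mask. The NumPy sketch below reproduces that step on a made-up batch so the resulting shape and values can be inspected; the token IDs are illustrative only.

```python
import numpy as np

# Hypothetical batch of two decoder inputs; 0 is the padding ID.
decoder_input = np.array([[4, 8, 15, 0],
                          [16, 23, 0, 0]])
seq_len = decoder_input.shape[1]

look_ahead = 1 - np.tril(np.ones((seq_len, seq_len)))                  # (4, 4)
padding = (decoder_input == 0).astype(np.float64)[:, None, None, :]   # (2, 1, 1, 4)

# Element-wise maximum blocks a position if either mask blocks it.
combined = np.maximum(look_ahead, padding)                             # (2, 1, 4, 4)
print(combined.shape)
print(combined[1, 0])
```

For the second sequence, the last two columns are blocked in every row (padding), and the upper triangle is blocked as usual (future positions).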

## Complete Transformer Model

After implementing all the individual components, we can now assemble the complete Transformer architecture. The Transformer model integrates the encoder and decoder stacks, along with embedding layers, positional encoding, and the final output layer to create a powerful sequence-to-sequence model.

The complete Transformer architecture consists of:

1. **Input and Output Embedding Layers**: Convert token indices to continuous vector representations
   - In our implementation, the encoder and decoder each create their own embedding layer
   - Because the input and target vocabularies are identical for language modeling, these layers could be shared to reduce the model size, which is a common choice in language models

2. **Positional Encoding**: Adds information about token positions to the embeddings
   - Applied to both encoder and decoder inputs
   - Enables the model to understand sequence order despite its parallel processing

3. **Encoder Stack**: Processes the input sequence to create contextualized representations
   - Consists of N identical layers with self-attention and feed-forward networks
   - Captures bidirectional context from the entire input sequence

4. **Decoder Stack**: Generates the output sequence based on the encoder output
   - Consists of N identical layers with masked self-attention, cross-attention, and feed-forward networks
   - Maintains the autoregressive property for sequence generation

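5. **Final Linear Layer and Softmax**: Projects decoder output to vocabulary-sized logits
   - The model itself returns raw logits; the softmax is applied implicitly inside the loss function via `from_logits=True`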

In [33]:
# Complete Transformer Model
class Transformer(tf.keras.Model):
    """
    Complete Transformer model as described in 'Attention Is All You Need'.

    The Transformer consists of an encoder and a decoder, each with multiple layers
    of self-attention and feed-forward networks, connected by encoder-decoder attention.
    """

    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, target_vocab_size,
                 maximum_position_encoding_input, maximum_position_encoding_target,
                 dropout_rate=0.1):
        """
        Initialize Transformer model.

        Args:
            num_layers (int): Number of encoder and decoder layers.
            d_model (int): Model dimension.
            num_heads (int): Number of attention heads.
            dff (int): Inner dimension of feed-forward networks.
            input_vocab_size (int): Size of input vocabulary.
            target_vocab_size (int): Size of target vocabulary.
            maximum_position_encoding_input (int): Maximum input sequence length for position encoding.
            maximum_position_encoding_target (int): Maximum target sequence length for position encoding.
            dropout_rate (float): Dropout rate for regularization.
        """
        super(Transformer, self).__init__()

        # Encoder stack
        self.encoder = Encoder(
            num_layers=num_layers,
            d_model=d_model,
            num_heads=num_heads,
            dff=dff,
            input_vocab_size=input_vocab_size,
            maximum_position_encoding=maximum_position_encoding_input,
            dropout_rate=dropout_rate
        )

        # Decoder stack
        self.decoder = Decoder(
            num_layers=num_layers,
            d_model=d_model,
            num_heads=num_heads,
            dff=dff,
            target_vocab_size=target_vocab_size,
            maximum_position_encoding=maximum_position_encoding_target,
            dropout_rate=dropout_rate
        )

        # Final projection layer to vocabulary
        self.final_projection = tf.keras.layers.Dense(target_vocab_size)

    def call(self, inputs, training=True):
        """
        Forward pass for the Transformer model.

        Args:
            inputs (tuple): (encoder_inputs, decoder_inputs).
            training (bool): Whether in training mode.

        Returns:
            tuple: (output logits, attention weights dictionary)
        """
        # Unpack inputs
        encoder_inputs, decoder_inputs = inputs

        # Create masks for encoder and decoder
        encoder_padding_mask, combined_decoder_mask, decoder_padding_mask = create_masks(
            encoder_inputs, decoder_inputs)

        # Pass through encoder
        encoder_output = self.encoder(
            encoder_inputs,
            training=training,
            mask=encoder_padding_mask
        )

        # Pass through decoder
        decoder_output, attention_weights = self.decoder(
            decoder_inputs,
            encoder_output,
            training=training,
            look_ahead_mask=combined_decoder_mask,
            padding_mask=decoder_padding_mask
        )

        # Project to vocabulary size
        final_output = self.final_projection(decoder_output)

        return final_output, attention_weights

### Configuring the Transformer for Word-Level Language Modeling

After implementing the complete Transformer architecture, we need to configure it specifically for our word-level Shakespeare language modeling task. This configuration involves setting appropriate hyperparameters that balance model capacity, computational efficiency, and performance.

The function below creates a Transformer model with the following configuration:

1. **Model Size and Capacity**:
   - **Number of layers**: 3 encoder and decoder layers, providing sufficient depth to capture complex patterns while remaining computationally tractable
   - **Embedding dimension (d_model)**: 192-dimensional embeddings, offering a good balance between expressiveness and efficiency
   - **Feed-forward dimension (dff)**: 768-dimensional inner layer in the feed-forward networks, allowing for complex non-linear transformations
   - **Number of attention heads**: 6 heads, enabling the model to attend to different aspects of the input simultaneously

2. **Regularization**:
   - **Dropout rate**: 0.3, providing regularization to prevent overfitting on our relatively small dataset

3. **Sequence Handling**:
   - **Maximum position encoding**: 5000 positions, accommodating long sequences while maintaining positional information
   - **Vocabulary size**: Dynamically set based on our preprocessed data, ensuring all tokens in our vocabulary can be represented

This configuration is specifically tailored for word-level language modeling, where each token represents a complete word rather than a character. Word-level models typically require:
- Larger embedding dimensions to capture the greater semantic complexity of words
- Fewer position encodings than character models (since sequences contain fewer tokens)
- Careful regularization to handle the larger vocabulary size

The resulting model strikes a balance between capacity and efficiency, making it suitable for training on Shakespeare's text without requiring excessive computational resources.
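
A quick arithmetic check of these hyperparameters is sketched below. It assumes the usual multi-head attention design in which `d_model` is split evenly across the heads; the variable names `depth_per_head` and `ffn_expansion` are introduced here for illustration.

```python
d_model = 192
num_heads = 6
dff = 768

# Multi-head attention splits d_model evenly across heads.
assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
depth_per_head = d_model // num_heads

# Ratio of the feed-forward inner dimension to the model dimension.
ffn_expansion = dff / d_model

print(depth_per_head, ffn_expansion)  # 32 4.0
```

Each head therefore works in a 32-dimensional subspace, and the feed-forward expansion factor of 4 matches the ratio used in the original Transformer paper (2048 / 512).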

In [34]:
# Create Transformer model for word-level language modeling
def create_word_transformer(vocab_size):
    """
    Create a Transformer model configured for word-level language modeling.

    Args:
        vocab_size (int): Size of the vocabulary.

    Returns:
        Transformer: Configured model instance.
    """
    # Model architecture hyperparameters
    num_layers = 3        # Number of encoder/decoder layers
    d_model = 192         # Embedding dimension
    num_heads = 6         # Number of attention heads
    dff = 768             # Feed-forward network inner dimension
    dropout_rate = 0.3    # Dropout rate for regularization

    # Vocabulary sizes from the processed data
    input_vocab_size = vocab_size
    target_vocab_size = vocab_size

    # Maximum position encoding length
    # For word models, we typically need fewer position encodings than character models
    max_position_encoding = 5000

    # Create and return the Transformer model
    return Transformer(
        num_layers=num_layers,
        d_model=d_model,
        num_heads=num_heads,
        dff=dff,
        input_vocab_size=input_vocab_size,
        target_vocab_size=target_vocab_size,
        maximum_position_encoding_input=max_position_encoding,
        maximum_position_encoding_target=max_position_encoding,
        dropout_rate=dropout_rate
    )

In [35]:
# Initialize transformer model with default parameters
word_transformer = create_word_transformer(vocab_size)  

## Training Implementation

Training a Transformer model for language modeling requires careful implementation of several components: loss calculation, optimization strategy, learning rate scheduling, and the training loop itself. This section covers the complete training pipeline for our Shakespeare language model.

### Loss Function with Label Smoothing

For training our Transformer model, we implement a specialized loss function that incorporates label smoothing, a regularization technique that improves model generalization and calibration. Label smoothing addresses several challenges in language model training:

1. **Overconfidence**: Neural networks tend to become overly confident in their predictions, assigning probabilities close to 1.0 for the target class. Label smoothing prevents this by "softening" the target distribution.

2. **Generalization**: By introducing uncertainty into the training targets, label smoothing encourages the model to learn more robust representations rather than memorizing the training data.

3. **Calibration**: Models trained with label smoothing typically produce better-calibrated probability distributions, meaning their confidence better reflects their actual accuracy.

The implementation works as follows:

- **Standard one-hot encoding** would assign a probability of 1.0 to the correct word and 0.0 to all other words
- **With label smoothing**, we scale the one-hot target by (1-α) and distribute the remaining α probability mass uniformly across all V words in the vocabulary, so the correct word ends up with (1-α) + α/V and every other word with α/V
- **Mathematically**: If y is the one-hot encoded ground truth and V is the vocabulary size, the smoothed label becomes:
  
  $$y_{smooth} = (1-\alpha) \cdot y + \alpha \cdot \frac{1}{V}$$
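
As a worked example of the formula, the NumPy sketch below smooths a one-hot target over a toy vocabulary of five words with α = 0.1. The vocabulary size and target index are arbitrary values chosen for illustration.

```python
import numpy as np

vocab_size = 5       # tiny vocabulary for illustration
alpha = 0.1          # smoothing factor
correct_index = 2

one_hot = np.zeros(vocab_size)
one_hot[correct_index] = 1.0

smooth = (1 - alpha) * one_hot + alpha / vocab_size
print(smooth)        # the correct word gets 0.92, every other word 0.02
print(smooth.sum())  # the smoothed target still sums to 1 (up to rounding)
```

With a realistic vocabulary of many thousands of words, the α/V share per word becomes tiny, so the correct word's target is very close to 1-α.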

Additionally, our loss function handles padding tokens by:
- Creating a mask that identifies non-padding tokens (those with ID ≠ 0)
- Applying this mask to the loss values to ignore padding tokens
- Normalizing the total loss by the number of non-padding tokens

This approach ensures that the model is only penalized for its predictions on actual content, not on padding tokens used for batch processing.
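
The difference between a masked average and a plain average can be seen in a small NumPy sketch. The per-token loss values are invented for illustration; only the masking arithmetic mirrors the steps listed above.

```python
import numpy as np

# Two sequences padded to length 4 (0 is the padding ID).
real_tokens = np.array([[7, 3, 0, 0],
                        [5, 2, 9, 0]])
# Made-up per-token losses; padding positions still produce values.
per_token_loss = np.array([[2.0, 1.0, 4.0, 4.0],
                           [3.0, 1.0, 2.0, 4.0]])

mask = (real_tokens != 0).astype(per_token_loss.dtype)
masked_mean = (per_token_loss * mask).sum() / mask.sum()   # 9.0 / 5 real tokens
naive_mean = per_token_loss.mean()                          # 21.0 / 8 positions

print(masked_mean, naive_mean)
```

The naive mean is inflated by the losses at padding positions, while the masked mean reflects only the five real tokens.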

In [36]:
# Loss function for training
def loss_function(real_tokens, predicted_logits, smoothing_factor=0.1):
    """
    Calculate the loss with manual label smoothing, ignoring padding tokens.
    
    Args:
        real_tokens (tf.Tensor): Ground truth tokens.
        predicted_logits (tf.Tensor): Predicted token logits.
        smoothing_factor (float): Label smoothing factor (0.0 to 1.0).
        
    Returns:
        tf.Tensor: The average loss value across the batch.
    """
    # Get vocabulary size from the last dimension of the logits
    vocab_size = tf.shape(predicted_logits)[-1]
    
    # Convert sparse tokens to one-hot
    one_hot_labels = tf.one_hot(real_tokens, depth=vocab_size)
    
    # Apply label smoothing manually:
    # - Assign (1-smoothing_factor) probability to the correct class
    # - Distribute smoothing_factor probability uniformly to all classes
    smooth_labels = (1.0 - smoothing_factor) * one_hot_labels + \
                    smoothing_factor / tf.cast(vocab_size, tf.float32)
    
    # Calculate cross entropy from logits using smoothed labels
    loss_values = tf.keras.losses.categorical_crossentropy(
        smooth_labels, predicted_logits, from_logits=True)
    
    # Create a mask to ignore padding tokens (ID = 0)
    mask = tf.math.logical_not(tf.math.equal(real_tokens, 0))
    
    # Apply mask to ignore padding tokens in loss calculation
    mask = tf.cast(mask, dtype=loss_values.dtype)
    loss_values *= mask
    
    # Return average loss (averaging over non-padding tokens only)
    return tf.reduce_sum(loss_values) / tf.reduce_sum(mask)

### Custom Learning Rate Scheduler

The Transformer architecture benefits significantly from a carefully designed learning rate schedule. The original paper "Attention Is All You Need" introduced a specific learning rate schedule that has become standard practice when training Transformer models. This schedule combines two key elements:

1. **Warm-up phase**: The learning rate gradually increases during the initial training steps, allowing the model to establish meaningful parameter values before applying larger updates.

2. **Decay phase**: After the warm-up, the learning rate decreases proportionally to the inverse square root of the step number, helping the model converge to an optimal solution.

The mathematical formula for this schedule is:

$$\text{lr} = \text{scale} \cdot d_{\text{model}}^{-0.5} \cdot \min(\text{step}^{-0.5}, \text{step} \cdot \text{warmup\_steps}^{-1.5})$$

Where:
- $d_{\text{model}}$ is the embedding dimension
- $\text{step}$ is the current training step
- $\text{warmup\_steps}$ is the number of steps in the warm-up phase
- $\text{scale}$ is an additional scaling factor to fine-tune the overall learning rate magnitude

This schedule offers several advantages:

- **Stability**: The gradual warm-up prevents unstable gradients early in training
- **Efficient exploration**: The learning rate peaks at the end of the warm-up phase (step = warmup_steps), allowing larger updates once training has stabilized
- **Fine-tuning**: The gradual decay helps the model settle into an optimal configuration
- **Scaling with model size**: The $d_{\text{model}}^{-0.5}$ factor automatically adjusts the learning rate based on model size
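
To get a feel for the numbers, the schedule can be evaluated in plain Python. The sketch below is a standalone restatement of the formula; `transformer_lr` is a hypothetical helper, and its default values (`d_model=192`, `warmup_steps=2000`, `scale=0.3`) are chosen to match the optimizer configuration used later in this notebook.

```python
def transformer_lr(step, d_model=192, warmup_steps=2000, scale=0.3):
    """Learning rate at a given step (step >= 1), following the formula above."""
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 500, 2000, 8000, 32000):
    print(f"step {step:>6}: lr = {transformer_lr(step):.6f}")

# The two arguments of min() meet at step == warmup_steps, where the rate peaks.
peak = transformer_lr(2000)
```

Under these settings the peak is roughly 4.8e-4. Because the decay follows the inverse square root of the step, the rate falls to half its peak at four times the warm-up length (step 8000).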

The implementation below creates a custom TensorFlow learning rate scheduler that follows this formula, which we'll use with the Adam optimizer for training our Transformer model.

In [37]:
# Custom learning rate scheduler
class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    """
    Custom learning rate scheduler for Transformer model.

    Implements the learning rate schedule from the Transformer paper,
    multiplied by an additional scaling factor:
    lr = initial_scale * d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))
    """

    def __init__(self, d_model, warmup_steps=1000, initial_scale=0.5):
        """
        Initialize learning rate scheduler.

        Args:
            d_model (int): Model dimension.
            warmup_steps (int): Number of warmup steps. Default is 1000.
            initial_scale (float): Additional scaling factor. Default is 0.5.
        """
        super(CustomSchedule, self).__init__()

        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps
        self.initial_scale = initial_scale

    def __call__(self, step):
        """
        Calculate learning rate based on step count.

        Args:
            step (tf.Tensor): Current step number.

        Returns:
            tf.Tensor: Learning rate value.
        """
        # Convert step to float32
        step = tf.cast(step, tf.float32)

        # Calculate args for min function
        arg1 = tf.math.rsqrt(step)  # step^(-0.5)
        arg2 = step * (self.warmup_steps ** -1.5)  # step * warmup_steps^(-1.5)

        # Apply formula: scale * d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5))
        return self.initial_scale * tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

### Optimizer Configuration

Selecting and configuring the right optimizer is crucial for effectively training Transformer models. For our Shakespeare language model, we use the Adam optimizer with carefully tuned hyperparameters and our custom learning rate schedule.

Adam (Adaptive Moment Estimation) combines the benefits of two other optimization algorithms:
- **AdaGrad**: Adapts learning rates based on the frequency of parameter updates
- **RMSProp**: Uses a moving average of squared gradients to normalize updates

This makes it particularly well-suited for training deep neural networks with many parameters, such as Transformers. Our optimizer configuration includes several important customizations:

1. **Custom learning rate schedule**: Implements the warm-up and decay schedule described in the previous section, which is essential for stable Transformer training

2. **Beta parameters**: Controls the exponential decay rates for moment estimates
   - **Beta_1 = 0.9**: Standard value for the first moment (mean)
   - **Beta_2 = 0.98**: Lower than the Keras default (0.999) for the second moment (uncentered variance), as used in the original Transformer paper

3. **Epsilon**: A small constant (1e-9) added to the denominator for numerical stability, preventing division by zero

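4. **Weight decay**: A small decay coefficient (0.01) that shrinks the weights at every step to discourage overfitting. Note that the Keras `weight_decay` argument applies the decay directly to the weights (decoupled, AdamW-style) rather than adding an L2 penalty term to the loss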

These carefully selected hyperparameters help balance the trade-offs between:
- Training speed and stability
- Exploration of parameter space and convergence
- Model performance and generalization

The function below creates an Adam optimizer with these customizations, which we'll use to train our Transformer model.

In [38]:
# Initialize optimizer with learning rate schedule
def create_optimizer(d_model=192):
    """
    Create an Adam optimizer with custom learning rate schedule.
    
    Args:
        d_model (int): Model dimension for the learning rate schedule.
        
    Returns:
        tf.keras.optimizers.Adam: Configured optimizer.
    """
    # Create learning rate schedule
    learning_rate_schedule = CustomSchedule(
        d_model=d_model,  # Match model's embedding dimension
        warmup_steps=2000,
        initial_scale=0.3
    )

    # Initialize Adam optimizer with custom learning rate schedule
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=learning_rate_schedule,
        beta_1=0.9,        # Exponential decay rate for 1st moment estimates
        beta_2=0.98,       # Exponential decay rate for 2nd moment estimates (lower than the 0.999 default)
        epsilon=1e-9,      # Small constant for numerical stability
        weight_decay=0.01  # Weight decay applied directly to the weights by the optimizer
    )
    
    return optimizer
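
To make the roles of `beta_1`, `beta_2`, `epsilon`, and weight decay concrete, the following NumPy sketch performs a single textbook Adam update with optional decoupled weight decay. The function `adam_step` and the example arrays are hypothetical, and the Keras implementation may differ in details such as exactly where `epsilon` enters the formula, so treat this as an illustration of the idea rather than a copy of the library code.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr, beta_1=0.9, beta_2=0.98,
              epsilon=1e-9, weight_decay=0.0):
    """One textbook Adam update with optional decoupled weight decay (illustrative)."""
    m = beta_1 * m + (1.0 - beta_1) * grad           # running mean of gradients
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2      # running mean of squared gradients
    m_hat = m / (1.0 - beta_1 ** t)                  # bias correction, step t >= 1
    v_hat = v / (1.0 - beta_2 ** t)
    w = w - lr * weight_decay * w                    # decoupled decay shrinks weights directly
    w = w - lr * m_hat / (np.sqrt(v_hat) + epsilon)  # adaptive gradient step
    return w, m, v

w = np.array([1.0, -2.0])
grad = np.array([0.5, -0.1])
w_new, m, v = adam_step(w, grad, np.zeros(2), np.zeros(2), t=1, lr=1e-3)
print(w_new)
```

On the very first step the bias-corrected ratio reduces to the sign of the gradient, so every weight moves by about `lr` regardless of the gradient's magnitude. This per-parameter normalization is one reason the warm-up schedule matters early in training.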

### Training Step Implementation

The core of our Transformer training pipeline is the training step function, which executes a single forward and backward pass through the model. This function is optimized using TensorFlow's `@tf.function` decorator, which compiles the computation into a high-performance graph for faster execution.

Our training step implements several key techniques for effective sequence-to-sequence training:

1. **Teacher Forcing**: During training, we provide the ground truth tokens as input to the decoder, rather than using the decoder's own predictions. This approach:
   - Stabilizes training by preventing error accumulation
   - Allows parallel training of all output positions
   - Creates a consistent learning signal

2. **Input-Target Preparation**:
   - **Decoder inputs**: The target sequence with the last token removed (since we don't need to predict after the last token)
   - **Decoder targets**: The target sequence with the first token removed (since the first token is typically a start token)

3. **Gradient Management**:
   - **Gradient clipping**: Limits the gradient norm to prevent exploding gradients, a common issue in training deep sequence models
   - **Automatic differentiation**: Uses TensorFlow's GradientTape to efficiently compute gradients

4. **Metrics Tracking**:
   - **Loss**: Tracks the cross-entropy loss with label smoothing
   - **Accuracy**: Measures token-level prediction accuracy

The training step function encapsulates the entire process of:
1. Preparing the input and target sequences
2. Performing the forward pass through the model
3. Computing the loss with label smoothing
4. Calculating and clipping gradients
5. Applying the gradients to update model parameters
6. Updating training metrics

This function will be called repeatedly during the training loop, once for each batch of data in each epoch.
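
The input-target shift used for teacher forcing can be seen on a toy batch. The sketch below uses NumPy and made-up token IDs (both are assumptions for illustration); the slicing expressions are the same ones the training step applies to `target_batch`.

```python
import numpy as np

# Hypothetical batch of two token-id sequences, each of length 5
target_batch = np.array([[11, 12, 13, 14, 15],
                         [21, 22, 23, 24, 25]])

decoder_inputs = target_batch[:, :-1]   # what the decoder sees
decoder_targets = target_batch[:, 1:]   # what it should predict at each position

# At every position the target is simply the next token of the input
for seen, expected in zip(decoder_inputs[0], decoder_targets[0]):
    print(f"input token {seen} -> target token {expected}")
```

Both slices have length 4, one less than the original sequence, and position `i` of the targets always holds the token that follows position `i` of the inputs.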

In [39]:
# Define training step function
@tf.function
def train_step(model, input_batch, target_batch, optimizer, train_loss_metric, train_accuracy_metric):
    """
    Execute a single training step (forward pass, loss calculation, and backpropagation).

    This function implements teacher forcing for sequence-to-sequence training,
    where the target sequence is shifted to create decoder inputs and labels.

    Args:
        model (Transformer): The Transformer model.
        input_batch (tf.Tensor): Batch of input sequences.
        target_batch (tf.Tensor): Batch of target sequences.
        optimizer (tf.keras.optimizers.Optimizer): Optimizer instance.
        train_loss_metric (tf.keras.metrics.Mean): Metric to track training loss.
        train_accuracy_metric (tf.keras.metrics.SparseCategoricalAccuracy): Metric to track training accuracy.
    """
    # Implement teacher forcing:
    # - decoder_inputs: target without the last token
    # - decoder_targets: target without the first token
    decoder_inputs = target_batch[:, :-1]   # Remove the last token (used as input to decoder)
    decoder_targets = target_batch[:, 1:]   # Remove the first token (used as ground truth)

    # Use gradient tape to record operations for automatic differentiation
    with tf.GradientTape() as tape:
        # Forward pass
        # Model expects (encoder_inputs, decoder_inputs) as input
        predictions, _ = model([input_batch, decoder_inputs], training=True)

        # Calculate loss
        loss = loss_function(decoder_targets, predictions, smoothing_factor=0.1)

    # Calculate gradients
    gradients = tape.gradient(loss, model.trainable_variables)

    # Clip gradients to a global norm of 1.0 to prevent exploding gradients
    gradients, _ = tf.clip_by_global_norm(gradients, clip_norm=1.0)

    # Apply gradients
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    # Update metrics
    train_loss_metric(loss)
    train_accuracy_metric(decoder_targets, predictions)
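
Global-norm clipping rescales all gradients together, so their relative directions are preserved. The NumPy sketch below reproduces that rule as we understand it; the helper name mirrors the TensorFlow function but is a stand-alone illustration, and the example gradients are made up.

```python
import numpy as np

def clip_by_global_norm(gradients, clip_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    # Gradients are left untouched when the global norm is already within the limit
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in gradients], global_norm

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]  # joint norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, clip_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Here every gradient is multiplied by the same factor (1/5), which differs from clipping each tensor separately: per-tensor clipping would change the direction of the overall update.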

### Evaluation Function

Evaluating our Transformer model during and after training is essential for monitoring progress, detecting overfitting, and assessing final performance. The evaluation function provides a standardized way to measure the model's performance on validation and test datasets.

The evaluation process shares similarities with the training step but with key differences:

1. **No Gradient Computation**: During evaluation, we only perform the forward pass through the model without computing or applying gradients, making the process more computationally efficient.

2. **No Training-Specific Operations**: Features like dropout are disabled during evaluation to assess the model's true performance.

3. **Complete Dataset Processing**: Unlike training, which often reports metrics after each batch, evaluation processes the entire dataset before reporting final metrics.

Our evaluation function calculates two primary metrics:

1. **Loss**: The same sparse categorical cross-entropy loss used during training, which measures how well the model's probability distributions match the actual next words.

2. **Accuracy**: The percentage of tokens that the model predicts correctly, providing an intuitive measure of performance.

As in training, we use teacher forcing during evaluation, providing the ground truth tokens as input to the decoder. This approach allows us to evaluate each prediction position independently, giving a clear picture of the model's capabilities across different sequence positions.

The function below implements this evaluation process, processing an entire dataset and returning the average loss and accuracy. These metrics will be used to:
- Monitor training progress
- Implement early stopping
- Compare different model configurations
- Assess final model performance

In [40]:
# Model evaluation function
def evaluate_model(model, dataset):
    """
    Evaluate the transformer model on a validation/test dataset.
    
    Args:
        model: The transformer model to evaluate
        dataset: TensorFlow dataset containing input and target batches
        
    Returns:
        tuple: (validation_loss, validation_accuracy) as numpy values
    """
    # Initialize metrics to track loss and accuracy
    val_loss = tf.keras.metrics.Mean()
    val_accuracy = tf.keras.metrics.SparseCategoricalAccuracy()
    
    # Iterate through all batches in the dataset
    for input_batch, target_batch in dataset:
        # Teacher forcing: the decoder sees the target without its last token
        # and is scored against the target without its first token
        decoder_inputs = target_batch[:, :-1]  # All but the last token
        decoder_targets = target_batch[:, 1:]  # All but the first token
        
        # Get model predictions (without training)
        predictions, _ = model([input_batch, decoder_inputs], training=False)
        
        # Calculate loss (smoothing_factor is left at loss_function's default here)
        batch_loss = loss_function(decoder_targets, predictions)
        
        # Update metrics
        val_loss.update_state(batch_loss)
        val_accuracy.update_state(decoder_targets, predictions)
    
    # Return final metrics as numpy values
    return val_loss.result().numpy(), val_accuracy.result().numpy()

## Training Process

Training a Transformer model for language modeling is a complex process that requires careful management of data flow, optimization, and model evaluation. This section outlines our complete training pipeline, which combines all the components we've built so far.



### Training Loop Details

The training loop orchestrates the entire training process, managing multiple aspects of model training:

1. **Epoch-Based Training**: The model is trained for a specified number of epochs, with each epoch processing the entire training dataset once.

2. **Batch Processing**: Within each epoch, the data is processed in batches to enable efficient parallel computation and stochastic optimization.

3. **Metrics Tracking**: Several metrics are tracked throughout training:
   - Training loss and accuracy: Measured on the training data
   - Validation loss and accuracy: Measured on the held-out validation data
   - Time per epoch: Tracked to monitor training efficiency

4. **Model Checkpointing**: The model's weights are saved whenever the validation loss improves, ensuring we retain the best-performing model configuration.

5. **Early Stopping**: Training is halted if the validation loss fails to improve for a specified number of consecutive epochs (patience), preventing overfitting and saving computational resources.

6. **Progress Reporting**: Regular updates are printed during training to monitor progress, including:
   - Batch-level metrics during each epoch
   - Epoch summaries with training and validation metrics
   - Sample text generation after each epoch to qualitatively assess model capabilities

7. **Sample Generation**: After each epoch, the model generates sample text based on a prompt, providing a tangible demonstration of its current capabilities.

This comprehensive training process balances efficient optimization with careful monitoring and evaluation, ensuring that we train an effective language model while avoiding common pitfalls like overfitting or unstable training dynamics.

In [41]:
# Model training function
def train_model(model, train_dataset, val_dataset, epochs, optimizer):
    """
    Train the Transformer model for language modeling.

    Args:
        model (Transformer): The Transformer model to train.
        train_dataset (tf.data.Dataset): Training dataset.
        val_dataset (tf.data.Dataset): Validation dataset.
        epochs (int): Number of training epochs.
        optimizer (tf.keras.optimizers.Optimizer): Optimizer.

    Returns:
        dict: Training history with loss and accuracy metrics.
    """
    # Initialize dictionary to track training history
    training_history = {
        'train_loss': [],
        'train_accuracy': [],
        'val_loss': [],
        'val_accuracy': []
    }

    # Set up model checkpointing
    checkpoint_directory = "./checkpoints"
    checkpoint_path = os.path.join(checkpoint_directory, "transformer.weights.h5")

    # Create checkpoint directory if it doesn't exist
    os.makedirs(checkpoint_directory, exist_ok=True)

    # Initialize best validation loss for checkpointing
    best_val_loss = float('inf')
    patience = 10  # Number of epochs with no improvement before early stopping
    patience_counter = 0

    # Training loop for specified number of epochs
    for epoch in range(epochs):
        start_time = time.time()

        # Initialize metrics for this epoch
        train_loss = tf.keras.metrics.Mean(name='train_loss')
        train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

        # Training loop
        for batch_idx, (input_batch, target_batch) in enumerate(train_dataset):
            # Execute a single training step
            train_step(model, input_batch, target_batch, optimizer, train_loss, train_accuracy)

            # Print progress every 50 batches
            if batch_idx % 50 == 0:
                print(f'Epoch {epoch + 1}, Batch {batch_idx}: '
                      f'Loss = {train_loss.result():.4f}, '
                      f'Accuracy = {train_accuracy.result():.4f}')

        # Store training metrics for this epoch
        train_loss_value = train_loss.result().numpy()
        train_accuracy_value = train_accuracy.result().numpy()

        # Validation loop
        val_loss_value, val_accuracy_value = evaluate_model(model, val_dataset)

        # Checkpointing logic
        if val_loss_value < best_val_loss:
            best_val_loss = val_loss_value
            patience_counter = 0
            # Save model weights
            model.save_weights(checkpoint_path)
            print(f"Model checkpoint saved with validation loss: {val_loss_value:.4f}")
        else:
            patience_counter += 1
            print(f"Validation loss did not improve. Patience: {patience_counter}/{patience}")
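            # Early stopping: stop once validation loss has failed to improve for `patience` epochs
            if patience_counter >= patience:
                print("Early stopping triggered.")
                # Record this epoch's metrics before leaving the loop
                training_history['train_loss'].append(train_loss_value)
                training_history['train_accuracy'].append(train_accuracy_value)
                training_history['val_loss'].append(val_loss_value)
                training_history['val_accuracy'].append(val_accuracy_value)
                break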

        # Print epoch summary
        print(f'Epoch {epoch + 1} Summary: '
              f'Training Loss = {train_loss_value:.4f}, '
              f'Training Accuracy = {train_accuracy_value:.4f}, '
              f'Validation Loss = {val_loss_value:.4f}, '
              f'Validation Accuracy = {val_accuracy_value:.4f}')

        print(f'Time taken for 1 epoch: {time.time() - start_time:.2f} seconds\n')

        # Generate sample text after every epoch
        print("\nGenerating sample text:")
        generated_text = generate_text(
            model, "ROMEO:", word2idx, idx2word,
            generation_length=30, temperature=0.7, top_k=10
        )
        print(generated_text)
        print()

        # Update training history
        training_history['train_loss'].append(train_loss_value)
        training_history['train_accuracy'].append(train_accuracy_value)
        training_history['val_loss'].append(val_loss_value)
        training_history['val_accuracy'].append(val_accuracy_value)

    return training_history

### Sample Text Generation During Training

Generating sample text during the training process provides valuable qualitative feedback on the model's progress. While quantitative metrics like loss and accuracy are important, they don't always reflect the subjective quality of the generated text. By periodically generating text samples, we can:

1. **Observe Learning Progress**: Watch how the model's outputs evolve from random words to coherent Shakespeare-like text
2. **Detect Specific Issues**: Identify problems like repetition loops, vocabulary limitations, or stylistic inconsistencies
3. **Assess Creative Capabilities**: Evaluate the model's ability to generate text that captures Shakespeare's unique style and language

Our text generation function implements several important techniques for high-quality language generation:

1. **Autoregressive Generation**: Words are generated one at a time, with each new word conditioned on all previously generated words
2. **Temperature Sampling**: Controls the randomness of the generation process
   - Higher temperature (e.g., 1.0+): More diverse but potentially less coherent text
   - Lower temperature (e.g., 0.5): More focused and deterministic but potentially repetitive text

3. **Top-k Sampling**: Restricts sampling to only the k most probable next words
   - Helps avoid low-probability outputs that might be grammatically incorrect or contextually inappropriate
   - Balances between greedy decoding (k = 1, fully deterministic) and pure sampling over the whole vocabulary (fully random)

4. **Prompt Conditioning**: Allows generation to be steered by providing an initial text prompt
   - Can be used to generate dialogue for specific characters
   - Helps maintain thematic consistency in the generated text

The function below implements this generation process, taking a trained model and a starting prompt, and producing a sequence of text that continues from that prompt in a style consistent with the training data.

In [42]:
# Text generation function
def generate_text(model, start_string, word2idx, idx2word, generation_length=100, temperature=1.0, top_k=0):
    """
    Generate text using the trained Transformer model.

    Args:
        model (Transformer): Trained Transformer model.
        start_string (str): Initial text prompt to start generation.
        word2idx (dict): Word-to-index mapping.
        idx2word (dict): Index-to-word mapping.
        generation_length (int): Number of words to generate.
        temperature (float): Controls randomness. Higher values increase diversity,
                             lower values increase determinism.
        top_k (int): If > 0, only sample from the top k most probable words.

    Returns:
        str: Generated text starting with the input prompt.
    """
    # Convert start string to lowercase and split into words
    words = start_string.lower().split()

    # Convert words to token indices, handling unknown words
    input_indices = [word2idx.get(word, word2idx["<UNK>"]) for word in words]

    # Create input tensor for the encoder - add batch dimension
    encoder_input = tf.expand_dims(input_indices, 0)  # Add batch dimension

    # Initialize list to store generated words
    generated_words = []

    # Generate words one at a time
    for _ in range(generation_length):
        # Prepare decoder input from what we've generated so far
        decoder_input = tf.expand_dims(input_indices, 0)  # Add batch dimension

        # Get model predictions
        predictions, _ = model([encoder_input, decoder_input], training=False)

        # Get the prediction for the next word (last position)
        next_word_predictions = predictions[:, -1:, :]

        # Apply temperature scaling
        scaled_predictions = next_word_predictions / temperature

        # Apply top-k sampling if specified
        if top_k > 0:
            # Get top k predictions and their indices
            top_k_predictions, top_k_indices = tf.nn.top_k(
                scaled_predictions[0, 0], k=top_k)

            # tf.random.categorical takes unnormalized log-probabilities, so the
            # top-k scores can be passed directly; this samples from the softmax
            # over the top-k words only
            sampled_index = tf.random.categorical(
                tf.reshape(top_k_predictions, (1, -1)), num_samples=1)[0, 0]

            # Get the actual token ID from our top k indices
            predicted_id = top_k_indices[sampled_index]
        else:
            # Sample from the full distribution
            scaled_predictions_reshaped = tf.reshape(scaled_predictions, (1, -1))
            predicted_id = tf.random.categorical(
                scaled_predictions_reshaped, num_samples=1)[0, 0]

        # Convert predicted ID to word and add to result
        predicted_word = idx2word[predicted_id.numpy()]
        generated_words.append(predicted_word)

        # Add predicted ID to input for next iteration
        input_indices.append(predicted_id.numpy())

    # Combine original prompt with generated text
    return start_string + ' ' + ' '.join(generated_words)
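
The interaction of temperature scaling and top-k filtering is easier to see on a tiny vocabulary. The following NumPy sketch is a simplified stand-in for the sampling step above, not the notebook's implementation: `sample_top_k`, the five-entry `logits` list, and the seed are all hypothetical.

```python
import numpy as np

def sample_top_k(logits, k, temperature=1.0, rng=None):
    """Sample one token id from the k highest-scoring logits (illustrative)."""
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    top_ids = np.argsort(scaled)[-k:]              # ids of the k largest logits
    top_logits = scaled[top_ids]
    probs = np.exp(top_logits - top_logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(top_ids, p=probs))

logits = [2.0, 0.5, -1.0, 3.0, 0.0]  # hypothetical scores over a 5-word vocabulary
rng = np.random.default_rng(0)
samples = [sample_top_k(logits, k=2, temperature=0.7, rng=rng) for _ in range(200)]
print(sorted(set(samples)))
```

With `k=2` only the two highest-scoring ids (3 and 0) can ever be drawn, and with `k=1` the function reduces to greedy decoding. Lowering the temperature sharpens the distribution over the surviving candidates, so the top-scoring word is picked more often.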

## Results and Analysis

After implementing our Transformer model for Shakespeare text generation, we now train the model and analyze its performance. This section covers the complete training process, evaluation metrics, and qualitative assessment of the generated text.

### Training Process and Metrics

The training process involves initializing the model with our chosen architecture, creating the optimizer with our custom learning rate schedule, and training for a specified number of epochs. Throughout training, we track several key metrics:

1. **Training and Validation Loss**: Measures how well the model predicts the next word in a sequence. Lower values indicate better performance.

2. **Training and Validation Accuracy**: The percentage of tokens that the model predicts correctly. This provides an intuitive measure of model performance.

3. **Generation Quality**: By generating sample text after each epoch, we can qualitatively assess how the model's language generation capabilities evolve during training.

The training function implements several best practices:
- **Model Checkpointing**: Saves the best model based on validation loss
- **Early Stopping**: Halts training if validation performance plateaus
- **Periodic Evaluation**: Assesses model performance on the validation set after each epoch
- **Sample Generation**: Produces text samples to demonstrate current capabilities

### Performance Analysis

After training, we analyze the model's performance from multiple perspectives:

1. **Quantitative Metrics**: Final loss and accuracy values on both training and validation sets
2. **Learning Curves**: How loss and accuracy evolve throughout training
3. **Generalization Gap**: The difference between training and validation metrics

In [43]:
# Run the training process
def initialize_and_train_model(word2idx, idx2word, vocab_size, train_dataset, val_dataset):
    """
    Initialize the model and start training.
    
    Args:
        word2idx (dict): Word-to-index mapping.
        idx2word (dict): Index-to-word mapping.
        vocab_size (int): Size of the vocabulary.
        train_dataset (tf.data.Dataset): Training dataset.
        val_dataset (tf.data.Dataset): Validation dataset.
        
    Returns:
        tuple: (trained model, training history)
    """
    # Create the transformer model
    word_transformer = create_word_transformer(vocab_size)
    
    # Create optimizer
    optimizer = create_optimizer(d_model=192)
    
    # Test untrained model's text generation
    print("Generating text with untrained model:")
    print(generate_text(
        word_transformer, "ROMEO:", word2idx, idx2word,
        generation_length=30, temperature=1.0
    ))
    print()

    # Start model training
    epochs = 5  # Number of training epochs
    training_history = train_model(
        word_transformer, train_dataset, val_dataset, epochs, optimizer
    )
    
    # Load the best model checkpoint after training is complete
    print("\nLoading the best model checkpoint based on validation loss...")
    checkpoint_path = os.path.join("./checkpoints", "transformer.weights.h5")
    if os.path.exists(checkpoint_path):
        word_transformer.load_weights(checkpoint_path)
        print(f"Best model loaded with validation loss: {min(training_history['val_loss']):.4f}")
    
    # Display final metrics after loading best model
    print("\nEvaluating best model on validation data...")
    val_loss, val_accuracy = evaluate_model(word_transformer, val_dataset)
    print(f"Best model validation metrics - Loss: {val_loss:.4f}, Accuracy: {val_accuracy:.4f}")

    # Display final training metrics
    print("\nTraining complete! Final metrics:")
    print(f"Training Loss: {training_history['train_loss'][-1]:.4f}")
    print(f"Training Accuracy: {training_history['train_accuracy'][-1]:.4f}")
    print(f"Validation Loss: {training_history['val_loss'][-1]:.4f}")
    print(f"Validation Accuracy: {training_history['val_accuracy'][-1]:.4f}")
    
    return word_transformer, training_history

In [None]:
# Train the model
word_transformer, training_history = initialize_and_train_model(word2idx, idx2word, vocab_size, train_dataset, val_dataset)

## Training Results and Analysis

### Initial State and Rapid Improvement

Our Transformer model for Shakespearean text generation began with completely random outputs, as evidenced by the initial generated text:

"ROMEO: forewarn toge lodging conveying achieve mummers proudest bianco's true-disposing rein rest sight-outrunning traitors' strings scripture highest unmanner'd wine works helmed guerdon'd answers strew kam rebuke traitorly schoolboys' whoever through't re-quicken'd"

This random collection of Shakespearean vocabulary demonstrates the untrained model's lack of understanding of language structure, grammar, or context.

The first epoch showed dramatic improvement, with the loss decreasing from 9.5020 to 3.5426 and accuracy increasing from 0% to 63.37%. This rapid initial learning is typical in language models, as they quickly learn basic patterns like common word sequences and simple grammatical structures. The validation metrics (loss: 2.3887, accuracy: 86.71%) were significantly better than the training metrics. This gap should not be read as evidence of good generalization on its own: the training metrics are averaged over every batch in the epoch, including the earliest ones, and are computed with dropout active, whereas validation is measured once at the end of the epoch with dropout disabled.
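
One reason first-epoch training metrics can look worse than validation metrics is that a running mean over the epoch includes the early, high-loss batches. The sketch below illustrates the effect with a made-up loss curve; the numbers are hypothetical and are not taken from the actual run.

```python
# Hypothetical per-batch training losses that fall quickly during one epoch
batch_losses = [9.5 * (0.97 ** i) + 1.8 for i in range(300)]

# What a running Mean metric reports at the end of the epoch
epoch_mean = sum(batch_losses) / len(batch_losses)

# Closer to the state of the model when validation is run
last_batches = batch_losses[-20:]
end_of_epoch = sum(last_batches) / len(last_batches)

print(f"running mean over the epoch: {epoch_mean:.3f}")
print(f"mean of the last 20 batches: {end_of_epoch:.3f}")
```

The running mean stays well above the end-of-epoch level, so comparing it directly with a validation loss measured after the epoch overstates how much better the model does on validation data.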

### Learning Progression Across Epochs

| Epoch | Training Loss | Training Accuracy | Validation Loss | Validation Accuracy | Time (s) |
|-------|---------------|-------------------|-----------------|---------------------|----------|
| 1     | 3.5426        | 63.37%            | 2.3887          | 86.71%              | 747.30   |
| 2     | 1.8419        | 90.68%            | 2.3412          | 86.84%              | 739.49   |
| 3     | 1.7879        | 91.58%            | 2.3277          | 86.83%              | 734.49   |
| 4     | 1.7643        | 91.97%            | 2.3468          | 86.34%              | 737.69   |
| 5     | 1.7497        | 92.22%            | 2.3575          | 86.10%              | 735.58   |

The learning curve shows a classic pattern:
- **Epochs 1-2**: Steep improvement in training metrics, while validation metrics improve only slightly (validation loss falls from 2.3887 to 2.3412)
- **Epoch 3**: Peak validation performance (lowest validation loss of 2.3277)
- **Epochs 4-5**: Continued improvement in training metrics but declining validation performance, indicating the onset of overfitting

### Text Generation Quality

The quality of generated text improved substantially across epochs:

**Epoch 1**:

"ROMEO: , would more of death , that word in time . let death , that in me no more : one , no more , an one , if more"

**Epoch 3** (Best model):

"ROMEO: ? i cannot , do what i cannot , that i say , i can do what , let me , and to me , if i would say ,"

**Epoch 5**:

"ROMEO: , but in the law , if she is a woman , if it is , but the better that which in it once , but this purpose , and"

The progression shows:
1. **Basic grammatical structure** emerged quickly
2. **Coherent phrases** developed by the middle epochs
3. **Contextual relevance** improved, with the model generating text that resembles Shakespearean dialogue

However, the generated text still lacks long-term coherence and complex narrative structure, which would require a larger model and more training data.

### Overfitting Analysis

The divergence between training and validation metrics after epoch 3 indicates the beginning of overfitting. While the training accuracy continued to improve (reaching 92.22% by epoch 5), the validation loss increased from 2.3277 in epoch 3 to 2.3575 in epoch 5. This pattern suggests that the model was starting to memorize specific patterns in the training data rather than learning generalizable features.

Because the run lasted only 5 epochs and the patience was set to 10, early stopping never triggered. Instead, the checkpointing logic retained the weights from epoch 3, the epoch with the lowest validation loss (2.3277, with an accuracy of 86.83%).
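
The checkpoint and patience logic from the training loop can be replayed on the validation losses in the table above. The helper `replay_checkpointing` is a hypothetical, simplified restatement of that logic, written in plain Python for illustration.

```python
def replay_checkpointing(val_losses, patience):
    """Replay best-checkpoint and patience logic on per-epoch validation losses.

    Returns (best_epoch, stop_epoch); stop_epoch is None if early stopping never fires.
    """
    best_loss, best_epoch, counter = float("inf"), None, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # A checkpoint would be saved here
            best_loss, best_epoch, counter = loss, epoch, 0
        else:
            counter += 1
            if counter >= patience:
                return best_epoch, epoch  # early stopping triggered at this epoch
    return best_epoch, None

val_losses = [2.3887, 2.3412, 2.3277, 2.3468, 2.3575]  # from the results table
print(replay_checkpointing(val_losses, patience=10))  # (3, None)
print(replay_checkpointing(val_losses, patience=2))   # (3, 5)
```

With a patience of 10 the loop runs all five epochs and keeps the epoch 3 checkpoint. A patience of 2 would have stopped the run at epoch 5 with the same checkpoint.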

### Conclusion

Our Transformer model successfully learned to generate Shakespearean-style text with reasonable grammatical structure and vocabulary usage. The model achieved its best performance at epoch 3, with a good balance between fitting the training data and generalizing to unseen examples. The generated text shows clear stylistic elements of Shakespeare's writing, though it lacks the complex narrative structure and deep thematic elements of the original works.

## Training Metrics Visualization

Visualizing training metrics provides valuable insights into the model's learning dynamics and helps identify patterns that might not be apparent from numerical data alone. The plots below illustrate the progression of loss and accuracy metrics for both training and validation sets across the training epochs.

### Key Observations from the Visualization:

1. **Loss Convergence Pattern**:
   - The training loss (blue line) shows a dramatic decrease between the first and second epochs, dropping from approximately 3.5 to 1.8
   - After the second epoch, the training loss continues to decrease but at a much slower rate, indicating diminishing returns from additional training
   - The validation loss (red line) remains relatively stable throughout training, with a slight decrease in the early epochs followed by a gradual increase

2. **Accuracy Progression**:
   - Training accuracy (blue line) shows rapid improvement between the first and second epochs, jumping from around 63% to over 90%
   - The accuracy continues to improve gradually in subsequent epochs, reaching approximately 92% by the end
   - Validation accuracy (red line) peaks early and then shows a slight downward trend, suggesting the onset of overfitting

3. **Training-Validation Gap**:
   - A significant gap develops between training and validation metrics after the first epoch
   - This gap widens as training progresses, which is a classic indicator of overfitting
   - The model continues to improve on the training data while its performance on unseen data (validation set) begins to deteriorate

4. **Optimal Stopping Point**:
   - The validation loss reaches its minimum around epoch 3, suggesting this would be the optimal point to stop training
   - Continuing beyond this point yields diminishing returns and potentially harmful overfitting
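
The widening training-validation gap can also be read directly from the numbers. This small sketch copies the per-epoch accuracies from the results table and prints the difference for each epoch; the variable names are chosen here for illustration.

```python
# Per-epoch accuracies (in percent) copied from the results table
train_accuracy = [63.37, 90.68, 91.58, 91.97, 92.22]
val_accuracy = [86.71, 86.84, 86.83, 86.34, 86.10]

# Positive values mean the model does better on training data than on validation data
gaps = [round(t - v, 2) for t, v in zip(train_accuracy, val_accuracy)]
for epoch, gap in enumerate(gaps, start=1):
    print(f"epoch {epoch}: train - val accuracy = {gap:+.2f} points")
```

The gap is negative after the first epoch and then grows steadily from epoch 2 onward, rising from about 3.8 to about 6.1 percentage points by epoch 5.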

These visualizations confirm our earlier observations about the model's learning dynamics and validate the effectiveness of our checkpointing strategy, which kept the weights from epoch 3 as the best-performing version based on validation loss.

The function below creates these visualization plots, displaying both loss and accuracy metrics side by side for easy comparison.

In [None]:
# Visualization of training progress
def plot_training_history(history):
    """
    Plot training and validation metrics over epochs.

    Args:
        history (dict): Training history dictionary with metrics.
    """
    # Create a figure with two subplots
    fig, (loss_axis, accuracy_axis) = plt.subplots(1, 2, figsize=(15, 5))

    # Plot loss metrics
    loss_axis.plot(history['train_loss'], 'b-', label='Training Loss')
    loss_axis.plot(history['val_loss'], 'r-', label='Validation Loss')
    loss_axis.set_xlabel('Epoch')
    loss_axis.set_ylabel('Loss')
    loss_axis.set_title('Training and Validation Loss')
    loss_axis.legend()
    loss_axis.grid(True)

    # Plot accuracy metrics
    accuracy_axis.plot(history['train_accuracy'], 'b-', label='Training Accuracy')
    accuracy_axis.plot(history['val_accuracy'], 'r-', label='Validation Accuracy')
    accuracy_axis.set_xlabel('Epoch')
    accuracy_axis.set_ylabel('Accuracy')
    accuracy_axis.set_title('Training and Validation Accuracy')
    accuracy_axis.legend()
    accuracy_axis.grid(True)

    # Adjust layout and display plot
    plt.tight_layout()
    plt.show()

In [None]:
# Visualization of training history
plot_training_history(history=training_history)

## Model Persistence

Saving and loading trained models is a critical aspect of machine learning workflows, allowing us to:
1. Preserve the results of computationally expensive training processes
2. Deploy models in production environments
3. Share models with other researchers or users
4. Resume work with trained models without retraining

For our Shakespeare text generation Transformer, we implement a comprehensive model persistence system that saves both the model weights and its configuration parameters.





### Saving Model and Configuration

Properly saving a Transformer model requires preserving two key components:

1. **Model Weights**: The learned parameters that encode the knowledge acquired during training. These weights represent the model's ability to generate Shakespeare-like text.

2. **Model Architecture Configuration**: The structural parameters that define the model's architecture, such as:
   - Number of encoder/decoder layers
   - Embedding dimension (d_model)
   - Number of attention heads
   - Feed-forward network dimensions
   - Vocabulary sizes
   - Maximum position encoding lengths
   - Dropout rates

Our saving function handles both aspects:
- Weights are saved in HDF5 format, a standard for efficient storage of large numerical arrays
- Configuration is saved as a JSON file, providing human-readable documentation of the model's architecture

This approach ensures that we can fully reconstruct the model without needing to remember or document the specific hyperparameters used during its creation.

In [1]:
# Save model function
def save_model(model, filepath='./saved_model/shakespeare_transformer'):
    """
    Save the trained Transformer model and its configuration.

    Args:
        model (Transformer): Trained model to save.
        filepath (str): Directory path to save model.
    """
    # Create the directory if it doesn't exist
    os.makedirs(filepath, exist_ok=True)

    # Save model weights
    weights_path = os.path.join(filepath, 'model.weights.h5')
    model.save_weights(weights_path)
    print(f"Model weights saved to {weights_path}")

    # Extract and save model configuration
    model_config = {
        'num_layers': model.encoder.num_layers,
        'd_model': model.encoder.d_model,
        'num_heads': model.encoder.encoder_layers[0].multi_head_attention.num_heads,
        'dff': model.encoder.encoder_layers[0].feed_forward.layers[0].units,
        'input_vocab_size': model.encoder.embedding_layer.input_dim,
        'target_vocab_size': model.decoder.embedding_layer.input_dim,
        'maximum_position_encoding_input': model.encoder.positional_encoding.shape[1],
        'maximum_position_encoding_target': model.decoder.positional_encoding.shape[1],
        'dropout_rate': model.encoder.dropout.rate
    }

    # Save configuration to JSON file
    config_path = os.path.join(filepath, 'model_config.json')
    with open(config_path, 'w') as config_file:
        json.dump(model_config, config_file, indent=2)
    print(f"Model configuration saved to {config_path}")

### Loading Model from Saved Files

The complementary loading function reverses this process:
1. Reads the configuration file to determine the model's architecture
2. Instantiates a new Transformer model with the same architecture
3. Loads the saved weights into this model

This creates a replica of our trained model, ready for text generation. Only the weights and architecture are restored; optimizer state is not saved, so any further training starts with a fresh optimizer. The function adds no error handling of its own, so a missing or corrupted file surfaces as the underlying Python or Keras exception.

The function below implements the loading half of this persistence system.

In [2]:
# Load model function
def load_model(filepath='./saved_model/shakespeare_transformer'):
    """
    Load a Transformer model from saved weights and configuration.

    Args:
        filepath (str): Directory path containing saved model.

    Returns:
        Transformer: Loaded model instance.
    """
    # Load model configuration
    config_path = os.path.join(filepath, 'model_config.json')
    with open(config_path, 'r') as config_file:
        config = json.load(config_file)

    # Create model instance with loaded configuration
    model = Transformer(
        num_layers=config['num_layers'],
        d_model=config['d_model'],
        num_heads=config['num_heads'],
        dff=config['dff'],
        input_vocab_size=config['input_vocab_size'],
        target_vocab_size=config['target_vocab_size'],
        maximum_position_encoding_input=config['maximum_position_encoding_input'],
        maximum_position_encoding_target=config['maximum_position_encoding_target'],
        dropout_rate=config['dropout_rate']
    )

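    # Build the model by calling it once with dummy data. A short sequence is
    # enough; it only has to fit within the saved positional encodings, so the
    # function no longer depends on a global seq_length.
    dummy_length = min(
        8,
        config['maximum_position_encoding_input'],
        config['maximum_position_encoding_target']
    )
    dummy_encoder_input = tf.zeros((1, dummy_length), dtype=tf.int32)  # (batch_size, seq_length)
    dummy_decoder_input = tf.zeros((1, dummy_length), dtype=tf.int32)
    _ = model([dummy_encoder_input, dummy_decoder_input], training=False)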

    # Load model weights
    weights_path = os.path.join(filepath, 'model.weights.h5')
    model.load_weights(weights_path)
    print(f"Model loaded successfully from {filepath}")

    return model
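
A small pre-check could make missing-file failures easier to diagnose than the raw exception raised while opening the configuration or loading the weights. The sketch below uses a hypothetical helper name, `check_saved_model_files`; the two file names mirror those written by `save_model`.

```python
import os


def check_saved_model_files(filepath):
    """Raise FileNotFoundError naming any expected file missing from filepath."""
    expected_files = ['model_config.json', 'model.weights.h5']
    missing = [
        name for name in expected_files
        if not os.path.isfile(os.path.join(filepath, name))
    ]
    if missing:
        raise FileNotFoundError(
            f"Missing in '{filepath}': {', '.join(missing)}"
        )
    # Return the full paths so the caller can reuse them
    return [os.path.join(filepath, name) for name in expected_files]
```

Calling such a helper at the top of `load_model` would report every missing file at once, before any model is constructed.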

In [None]:
# Save the model
save_model(word_transformer, './saved_model/shakespeare_transformer')

# Load the model
word_transformer = load_model(filepath='./saved_model/shakespeare_transformer')
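
One way to confirm that the round trip preserved the parameters is to compare weight arrays captured before saving with those of the reloaded model. The sketch below works on plain lists of NumPy arrays, such as those returned by Keras's `get_weights()`. The helper name `weights_match` and the tolerance are assumptions made for illustration.

```python
import numpy as np


def weights_match(weights_before, weights_after, atol=1e-6):
    """Check that two lists of weight arrays agree in count, shape and value."""
    if len(weights_before) != len(weights_after):
        return False
    return all(
        a.shape == b.shape and np.allclose(a, b, atol=atol)
        for a, b in zip(weights_before, weights_after)
    )


# Stand-in arrays; with the real model these would come from
# model.get_weights() before save_model() and after load_model().
original = [np.ones((2, 3)), np.zeros(4)]
restored = [array.copy() for array in original]
print(weights_match(original, restored))
```

Because the cell above reassigns `word_transformer`, the original weights would need to be captured before the call to `load_model`.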

## Model Evaluation

After training our Transformer model for Shakespeare text generation, we need to evaluate its performance comprehensively. This evaluation goes beyond the training and validation metrics to assess how well the model actually performs its intended task: generating Shakespeare-like text. We'll examine the model from multiple perspectives: perplexity on held-out data, the quality of generated text, and practical performance benchmarks.

### Perplexity Calculation

Perplexity is the standard quantitative metric for evaluating language models. It measures how "surprised" the model is by the test data, with lower values indicating better performance. Mathematically, perplexity is defined as the exponentiated average negative log-likelihood of a sequence:

$$\text{Perplexity} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i|x_1, \ldots, x_{i-1})\right)$$

Where:
- $N$ is the number of tokens in the sequence
- $p(x_i|x_1, \ldots, x_{i-1})$ is the probability the model assigns to the actual token $x_i$ given the preceding tokens

Intuitively, perplexity can be interpreted as the effective number of equally likely choices the model is deciding among when predicting the next token. A perfect model would have a perplexity of 1.0, while guessing uniformly at random over the vocabulary would result in a perplexity equal to the vocabulary size.

For our Shakespeare model, we calculate perplexity on the held-out test set to get an unbiased assessment of its predictive power.
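
The two reference points mentioned above can be checked numerically. The following sketch applies the formula to hand-picked probabilities. The function name `perplexity_from_probabilities` and the toy vocabulary size are illustrative assumptions, not part of the project code.

```python
import numpy as np


def perplexity_from_probabilities(token_probabilities):
    """Perplexity from the probabilities assigned to the correct tokens."""
    probabilities = np.asarray(token_probabilities, dtype=float)
    average_nll = -np.mean(np.log(probabilities))  # average negative log-likelihood
    return float(np.exp(average_nll))


vocab_size = 1000  # toy value

# Uniform guessing: every correct token gets probability 1 / vocab_size
uniform = perplexity_from_probabilities([1.0 / vocab_size] * 5)

# Perfect model: every correct token gets probability 1
perfect = perplexity_from_probabilities([1.0] * 5)

print(f"Uniform guessing: {uniform:.1f}")
print(f"Perfect model: {perfect:.1f}")
```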

In [None]:
# Calculate model perplexity
def calculate_perplexity(model, dataset):
    """
    Calculate perplexity of the model on a dataset.

    Perplexity is a measure of how well a language model predicts a text sample.
    Lower perplexity values indicate better performance.

    Args:
        model (Transformer): Trained model.
        dataset (tf.data.Dataset): Evaluation dataset.

    Returns:
        float: Perplexity value.
    """
    total_loss = 0.0
    total_tokens = 0

    # Per-token loss (no reduction) so padding can be masked out below;
    # created once rather than on every batch
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction='none')

    # Evaluate on dataset
    for input_batch, target_batch in dataset:
        # Teacher forcing for evaluation
        decoder_inputs = target_batch[:, :-1]
        decoder_targets = target_batch[:, 1:]

        # Get model predictions
        predictions, _ = model([input_batch, decoder_inputs], training=False)

        # Calculate token-level loss
        token_loss = loss_object(decoder_targets, predictions)

        # Create mask to ignore padding tokens
        mask = tf.cast(
            tf.math.logical_not(tf.math.equal(decoder_targets, 0)),
            dtype=token_loss.dtype
        )

        # Apply mask
        masked_loss = token_loss * mask

        # Sum loss and count tokens
        total_loss += tf.reduce_sum(masked_loss).numpy()
        total_tokens += tf.reduce_sum(mask).numpy()

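    # Calculate average per-token loss (guard against an empty or all-padding dataset)
    if total_tokens == 0:
        raise ValueError("Dataset contains no non-padding target tokens.")
    avg_loss = total_loss / total_tokens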

    # Perplexity is exp(average_loss)
    perplexity = np.exp(avg_loss)

    return perplexity

In [None]:
# Evaluate model performance using perplexity metric on test data
print(f"Test set perplexity: {calculate_perplexity(word_transformer, test_dataset):.4f}")

### Generated Text Quality Assessment

While perplexity provides a quantitative measure, the ultimate test of a language model is the quality of the text it generates. We evaluate this qualitative aspect by generating text samples with various prompts and generation settings:

1. **Different prompts**: We use iconic lines from Shakespeare's plays as starting points to assess how well the model continues in the appropriate style and context.

2. **Temperature variation**: The temperature parameter controls randomness in generation:
   - Low temperature (e.g., 0.7): More conservative, focused outputs
   - Medium temperature (e.g., 1.0): Balanced creativity and coherence
   - High temperature (e.g., 1.3): More creative but potentially less coherent outputs

3. **Top-k sampling**: This parameter limits token selection to the k most probable next words:
   - k=0: Use the full vocabulary distribution
   - k=10: Focus on the 10 most likely next words
   - k=40: Wider but still constrained selection

By systematically varying these parameters, we can explore the model's creative capabilities and find the optimal settings for different generation tasks.
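
To make the two parameters concrete, here is a minimal NumPy sketch of temperature scaling followed by top-k filtering for a single decoding step. It is not the project's `generate_text` implementation; the function name `sample_next_token`, its signature, and the toy logits are assumptions for illustration.

```python
import numpy as np


def sample_next_token(logits, temperature=1.0, top_k=0, rng=None):
    """Sample one token index from logits using temperature and top-k."""
    rng = rng if rng is not None else np.random.default_rng()
    # Temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = np.asarray(logits, dtype=float) / temperature

    if 0 < top_k < scaled.size:
        # Keep the k largest logits (ties at the threshold are also kept);
        # everything else gets zero probability
        kth_largest = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= kth_largest, scaled, -np.inf)

    # Numerically stable softmax over the remaining candidates
    exponentials = np.exp(scaled - np.max(scaled))
    probabilities = exponentials / exponentials.sum()
    return int(rng.choice(scaled.size, p=probabilities))


toy_logits = [2.0, 1.0, 0.5, -1.0]
rng = np.random.default_rng(seed=0)
print(sample_next_token(toy_logits, temperature=0.7, top_k=2, rng=rng))
```

With `top_k=1` the sketch reduces to greedy decoding, and with `top_k=0` it samples from the full temperature-scaled distribution.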

In [None]:
# Generate creative text samples with different settings
def generate_creative_samples(model, prompts, word2idx, idx2word):
    """
    Generate text samples with various settings for creative exploration.

    Args:
        model (Transformer): Trained model.
        prompts (list): List of text prompts to start generation.
        word2idx (dict): Word-to-index mapping.
        idx2word (dict): Index-to-word mapping.

    Returns:
        list: List of dictionaries containing generation results.
    """
    results = []

    # Define different temperature and top-k settings to try
    temperatures = [0.7, 1.0, 1.3]  # Controls randomness: low=conservative, high=creative
    top_k_values = [0, 10, 40]      # Controls diversity: 0=full vocabulary, smaller k=more focused

    print("Generating creative text samples with different settings:")

    # Generate text for each prompt and setting combination
    for prompt in prompts:
        print(f"\nPrompt: {prompt}")

        for temperature in temperatures:
            for top_k in top_k_values:
                # Generate text with current settings
                sample = generate_text(
                    model,
                    prompt,
                    word2idx,
                    idx2word,
                    generation_length=40,
                    temperature=temperature,
                    top_k=top_k
                )

                # Display generated text
                print(f"\nTemperature={temperature}, Top-K={top_k}:")
                print(sample)

                # Store result
                results.append({
                    'prompt': prompt,
                    'temperature': temperature,
                    'top_k': top_k,
                    'text': sample
                })

    return results

In [None]:
# Example prompts
prompts = [
    "HAMLET: To be, or not to be,",
    "ROMEO: But, soft! what light through yonder",
    "MACBETH: Is this a dagger which I see before me,",
    "KING LEAR: How sharper than a serpent's tooth it is"
]

creative_samples = generate_creative_samples(word_transformer, prompts, word2idx, idx2word)

### Performance Benchmarking

Beyond quality, we also evaluate the model's practical performance characteristics:

#### Inference Speed

Generation speed is crucial for applications requiring real-time or near-real-time responses. We measure:
- Time required to generate a fixed number of words
- Words generated per second
- Consistency of generation speed across multiple runs

These metrics help assess whether the model is suitable for interactive applications or batch processing scenarios.

#### Model Size Analysis

The model's resource requirements affect where and how it can be deployed:
- Number of trainable and non-trainable parameters
- Memory footprint (in MB)
- Computational complexity

This analysis helps determine whether the model can be deployed on resource-constrained environments or if it requires high-performance computing resources.

Together, these evaluation approaches provide a comprehensive assessment of our Shakespeare Transformer model's capabilities, limitations, and practical utility.

In [None]:
# Performance benchmarking
def benchmark_inference(model, prompt, word2idx, idx2word, num_words=100, num_runs=5):
    """
    Benchmark the inference speed and model size.

    Args:
        model (Transformer): Trained model.
        prompt (str): Starting text for generation.
        word2idx (dict): Word-to-index mapping.
        idx2word (dict): Index-to-word mapping.
        num_words (int): Number of words to generate per run.
        num_runs (int): Number of runs for averaging results.

    Returns:
        dict: Benchmark results.
    """
    generation_times = []

    # Run multiple generations and measure time
    for _ in range(num_runs):
        # perf_counter is a monotonic, high-resolution clock suited to timing
        start_time = time.perf_counter()

        # Generate text
        _ = generate_text(
            model, prompt, word2idx, idx2word,
            generation_length=num_words
        )

        # Record time
        end_time = time.perf_counter()
        generation_times.append(end_time - start_time)

    # Calculate average generation time
    avg_generation_time = sum(generation_times) / len(generation_times)
    words_per_second = num_words / avg_generation_time

    # Display time metrics
    print(f"Average generation time: {avg_generation_time:.4f} seconds for {num_words} words")
    print(f"Generation speed: {words_per_second:.2f} words per second")

    # Calculate model size
    trainable_params = sum(
        np.prod(variable.shape) for variable in model.trainable_variables
    )
    non_trainable_params = sum(
        np.prod(variable.shape) for variable in model.non_trainable_variables
    )
    total_params = trainable_params + non_trainable_params

    # Display model size information
    print(f"Model parameters: {trainable_params:,} trainable, {non_trainable_params:,} non-trainable")
    print(f"Total parameters: {total_params:,}")
    print(f"Approximate model size: {total_params * 4 / (1024 * 1024):.2f} MB (assuming float32)")

    # Return comprehensive benchmark results
    return {
        'avg_generation_time': avg_generation_time,
        'words_per_second': words_per_second,
        'trainable_params': int(trainable_params),
        'non_trainable_params': int(non_trainable_params),
        'total_params': int(total_params),
        'individual_times': generation_times
    }

In [None]:
# Benchmark inference speed and model size, generating from the prompt "HAMLET: "
benchmark_inference(word_transformer, "HAMLET: ", word2idx, idx2word)
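
The `individual_times` entry in the returned dictionary can be used to judge run-to-run consistency, which the function itself does not summarize. A possible helper, with the hypothetical name `summarize_generation_times` and made-up timings, is sketched below.

```python
import statistics


def summarize_generation_times(generation_times):
    """Summarize run-to-run variation in generation time (seconds)."""
    mean_time = statistics.mean(generation_times)
    # Sample standard deviation needs at least two runs
    std_time = statistics.stdev(generation_times) if len(generation_times) > 1 else 0.0
    return {
        'mean': mean_time,
        'std': std_time,
        'min': min(generation_times),
        'max': max(generation_times),
        # Coefficient of variation: spread relative to the mean
        'cv': std_time / mean_time if mean_time > 0 else float('nan'),
    }


# Made-up timings for illustration only
example_times = [2.10, 1.95, 2.00, 2.05, 1.90]
summary = summarize_generation_times(example_times)
print(f"{summary['mean']:.2f} s ± {summary['std']:.2f} s")
```

The first run may include one-time setup costs, so it can be worth inspecting it separately from the rest.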

## Conclusion and Future Work

This project has successfully implemented a complete Transformer architecture from scratch for Shakespearean text generation. By training on Shakespeare's works, we've created a model capable of generating text that captures aspects of the Bard's distinctive style and vocabulary. The implementation demonstrates the power of attention-based models for creative language generation tasks.

### Discussion of Observations and Limitations

Throughout this project, several key observations emerged:

1. **Learning Dynamics**: The model showed rapid initial learning, with most improvements occurring in the first epoch. This suggests that basic patterns in Shakespeare's language are relatively easy to learn, while the nuanced stylistic elements require more training.

2. **Overfitting Patterns**: Despite our regularization efforts, the model began to overfit after the third epoch, with validation loss increasing while training metrics continued to improve. This highlights the challenge of generalization with limited training data.

3. **Text Generation Quality**: The generated text successfully captured Shakespearean vocabulary and basic sentence structures. However, it struggled with:
   - Long-term coherence beyond a few sentences
   - Complex narrative structures
   - Consistent character development
   - Thematic depth characteristic of Shakespeare's works

4. **Hyperparameter Sensitivity**: Text quality was highly sensitive to generation parameters like temperature and top-k sampling. Lower temperatures produced more coherent but less creative text, while higher temperatures increased diversity at the cost of occasional grammatical errors.

5. **Model Size Constraints**: With approximately 10 million parameters, our model is relatively small compared to state-of-the-art language models with billions of parameters. This limited capacity constrains the model's ability to capture the full complexity of Shakespeare's writing.

6. **Training Efficiency**: The custom learning rate schedule with warmup proved effective, allowing stable training without excessive hyperparameter tuning. However, the training process remained computationally intensive, requiring significant time even on modern hardware.

### Future Improvements and Experimentation Ideas

Several promising directions could enhance this project:

1. **Architecture Enhancements**:
   - Implement relative positional encodings instead of absolute positions
   - Explore Transformer-XL or similar architectures for better handling of long-range dependencies
   - Experiment with different attention mechanisms like sparse attention

2. **Training Improvements**:
   - Pre-train on a larger corpus of Elizabethan English before fine-tuning on Shakespeare
   - Implement curriculum learning, starting with simpler texts and gradually introducing more complex works
   - Explore more sophisticated regularization techniques like stochastic depth or LayerDrop

3. **Generation Strategies**:
   - Implement nucleus sampling (top-p) as an alternative to top-k
   - Add beam search for more coherent generation
   - Develop controlled generation techniques to maintain consistent characters or themes

4. **Evaluation Enhancements**:
   - Develop automated metrics specific to Shakespearean style
   - Conduct human evaluation studies comparing generated text to actual Shakespeare
   - Analyze the model's ability to capture specific Shakespearean devices like iambic pentameter

5. **Application Extensions**:
   - Create a character-aware model that can generate text in the style of specific Shakespeare characters
   - Develop a dialogue generation system for creating new scenes between Shakespeare characters
   - Build an interactive system allowing users to collaborate with the model in writing Shakespeare-inspired content

6. **Efficiency Optimizations**:
   - Implement model quantization to reduce memory footprint
   - Explore knowledge distillation to create smaller, faster models
   - Optimize the inference process for real-time generation
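
As an illustration of one item from the list above, nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability reaches a threshold `top_p` and samples only from that set. The NumPy sketch below is a hypothetical starting point, not part of the current implementation; `nucleus_filter` and its arguments are assumed names.

```python
import numpy as np


def nucleus_filter(probabilities, top_p=0.9):
    """Zero out tokens outside the top-p nucleus and renormalize."""
    probabilities = np.asarray(probabilities, dtype=float)
    order = np.argsort(probabilities)[::-1]  # most probable first
    cumulative = np.cumsum(probabilities[order])
    # Smallest prefix whose cumulative probability reaches top_p
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]

    filtered = np.zeros_like(probabilities)
    filtered[keep] = probabilities[keep]
    return filtered / filtered.sum()


toy_probabilities = [0.5, 0.3, 0.15, 0.05]
print(nucleus_filter(toy_probabilities, top_p=0.9))
```

Unlike top-k, the number of candidates kept this way varies with how peaked the distribution is at each step.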

By pursuing these improvements, future iterations could create even more convincing Shakespeare-style text generation while addressing the limitations observed in the current implementation.

## References

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). *Attention Is All You Need*. Advances in Neural Information Processing Systems, 30. [arXiv:1706.03762](https://arxiv.org/abs/1706.03762)

2. Karpathy, A. (2015). *The Unreasonable Effectiveness of Recurrent Neural Networks*. [Blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

3. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*. [arXiv:1810.04805](https://arxiv.org/abs/1810.04805)

4. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). *Language Models are Unsupervised Multitask Learners*. OpenAI Blog.

5. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., ... & Rush, A. M. (2020). *Transformers: State-of-the-Art Natural Language Processing*. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (pp. 38-45).

6. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). *The Curious Case of Neural Text Degeneration*. International Conference on Learning Representations. [arXiv:1904.09751](https://arxiv.org/abs/1904.09751)

7. Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., ... & Auli, M. (2019). *fairseq: A Fast, Extensible Toolkit for Sequence Modeling*. In Proceedings of NAACL-HLT 2019: Demonstrations.

8. Sennrich, R., Haddow, B., & Birch, A. (2016). *Neural Machine Translation of Rare Words with Subword Units*. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1715-1725).

9. Kingma, D. P., & Ba, J. (2014). *Adam: A Method for Stochastic Optimization*. [arXiv:1412.6980](https://arxiv.org/abs/1412.6980)

10. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., ... & Zheng, X. (2016). *TensorFlow: A System for Large-Scale Machine Learning*. In 12th USENIX Symposium on Operating Systems Design and Implementation (pp. 265-283).

11. Shakespeare, W. *The Complete Works of William Shakespeare*. [Project Gutenberg](https://www.gutenberg.org/ebooks/100)

12. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., ... & Chintala, S. (2019). *PyTorch: An Imperative Style, High-Performance Deep Learning Library*. Advances in Neural Information Processing Systems, 32, 8026-8037.