# 1.Objective

This project aims to implement and compare advanced Recurrent Neural Network (RNN) architectures—specifically, vanilla RNNs, LSTMs, and GRUs—for the task of creative text generation using Shakespeare’s writings. We further incorporate techniques such as temperature-controlled sampling, beam search, teacher forcing, and gradient clipping to improve training stability and text quality. Ultimately, the goal is to produce coherent, stylistically aligned generated text and to evaluate models quantitatively (using perplexity) and qualitatively.


# 2.Dataset and Data Preprocessing

We use a dataset of Shakespeare’s plays. The dataset is provided in a CSV file (shakespeare_plays.csv) where texts from plays such as Hamlet, Coriolanus, and Richard III are extracted.

We filter the dataset by selecting specific plays and then concatenate the text of all chosen plays into one long string. This ensures that the training data covers diverse styles within Shakespeare’s corpus.


In [2]:
import os
import math
import random
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils import clip_grad_norm_
import torch.nn.functional as F
from tokenizers import ByteLevelBPETokenizer
import pandas as pd


# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [3]:
shakespeare = pd.read_csv("shakespeare_plays.csv")

filtered_shakespeare = shakespeare[shakespeare['play_name'].isin(['Hamlet', 'Coriolanus', 'Richard III'])]



In [4]:
print(filtered_shakespeare)

       Unnamed: 0    play_name    genre          character  act  scene  \
72918       72918  Richard III  History         Gloucester    1      1   
72919       72919  Richard III  History         Gloucester    1      1   
72920       72920  Richard III  History         Gloucester    1      1   
72921       72921  Richard III  History         Gloucester    1      1   
72922       72922  Richard III  History         Gloucester    1      1   
...           ...          ...      ...                ...  ...    ...   
87964       87964       Hamlet  Tragedy  Prince Fortinbras    5      2   
87965       87965       Hamlet  Tragedy  Prince Fortinbras    5      2   
87966       87966       Hamlet  Tragedy  Prince Fortinbras    5      2   
87967       87967       Hamlet  Tragedy  Prince Fortinbras    5      2   
87968       87968       Hamlet  Tragedy  Prince Fortinbras    5      2   

       sentence                                              text   sex  
72918         1               Now is 

In [5]:
text = filtered_shakespeare["text"].str.cat(sep=" ")
print(text)




## 2.2. Tokenization using Byte-Level BPE

Instead of using simple word-level tokenization, we employ Byte-Level Byte Pair Encoding (BPE). This approach helps manage vocabulary size, handles out-of-vocabulary tokens, and captures meaningful subword patterns, which is important for creative text generation.

We define a custom PyTorch Dataset that creates training samples from the tokenized text. Each sample consists of:

An input sequence: A random slice of the tokenized text with a length randomly chosen between a minimum (20 tokens) and a maximum (100 tokens).

A target sequence: The same slice shifted one token to the right, which trains the model to predict the next token.

In [6]:
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator([text], vocab_size=30000, min_frequency=2, special_tokens=["<pad>", "<unk>", "<bos>", "<eos>"])


# Encode the text
encoded = tokenizer.encode(text)
token_ids = encoded.ids
vocab_size = tokenizer.get_vocab_size()

print(f"Total tokens: {len(token_ids)}, Vocabulary size: {vocab_size}")
# Encode the entire text
encoded = tokenizer.encode(text)
token_ids = encoded.ids
print("Total tokens in dataset:", len(token_ids))


class TextDataset(Dataset):
    def __init__(self, token_ids, min_seq_len=20, max_seq_len=100):
        """
        Creates sequences of variable lengths sampled randomly between min_seq_len and max_seq_len.
        Each sample is a tuple (input_sequence, target_sequence), where the target is shifted by one token.
        """
        self.token_ids = token_ids
        self.min_seq_len = min_seq_len
        self.max_seq_len = max_seq_len
        self.indices = list(range(0, len(token_ids) - max_seq_len - 1))

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        start = self.indices[idx]
        seq_len = random.randint(self.min_seq_len, self.max_seq_len)
        end = start + seq_len
        input_seq = self.token_ids[start:end]
        target_seq = self.token_ids[start+1:end+1]
        return torch.tensor(input_seq, dtype=torch.long), torch.tensor(target_seq, dtype=torch.long)

# Create train/validation splits
dataset = TextDataset(token_ids)
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = torch.utils.data.random_split(dataset, [train_size, val_size])

batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=lambda batch: nn.utils.rnn.pad_sequence([item[0] for item in batch], batch_first=True, padding_value=tokenizer.token_to_id("<pad>")))
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=lambda batch: nn.utils.rnn.pad_sequence([item[0] for item in batch], batch_first=True, padding_value=tokenizer.token_to_id("<pad>")))


Total tokens: 112119, Vocabulary size: 7637
Total tokens in dataset: 112119


# 3. Model Architectures

Experiments:

We compared models with different configurations:

Activation and Kernel Choices: We experimented with different RNN types (RNN, LSTM, GRU) to evaluate their effect on sequence modeling.

Bidirectionality: Bidirectional layers were added to capture context from both past and future tokens during training.

In [7]:
class RNNModel(nn.Module):
    def __init__(self, rnn_type, vocab_size, embed_size, hidden_size, num_layers=1, bidirectional=False, dropout=0.2):
        """
        Generic RNN model that can use vanilla RNN, LSTM, or GRU.
        """
        super(RNNModel, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn_type = rnn_type.upper()
        if self.rnn_type == "RNN":
            self.rnn = nn.RNN(embed_size, hidden_size, num_layers=num_layers, batch_first=True, bidirectional=bidirectional, dropout=dropout)
        elif self.rnn_type == "LSTM":
            self.rnn = nn.LSTM(embed_size, hidden_size, num_layers=num_layers, batch_first=True, bidirectional=bidirectional, dropout=dropout)
        elif self.rnn_type == "GRU":
            self.rnn = nn.GRU(embed_size, hidden_size, num_layers=num_layers, batch_first=True, bidirectional=bidirectional, dropout=dropout)
        else:
            raise ValueError("Unsupported rnn_type. Choose from RNN, LSTM, or GRU.")

        self.bidirectional = bidirectional
        self.num_directions = 2 if bidirectional else 1
        self.fc = nn.Linear(hidden_size * self.num_directions, vocab_size)

    def forward(self, x, hidden=None):
        """
        x: [batch, seq_len]
        Returns logits [batch, seq_len, vocab_size] and new hidden state.
        """
        x = self.embed(x)

        if hidden is None:
            output, hidden = self.rnn(x)
        else:
            output, hidden = self.rnn(x, hidden)

        logits = self.fc(output)
        return logits, hidden

# 4. Training Strategy

## 4.1 Gradient Clipping and Early Stopping

To prevent exploding gradients, especially in recurrent networks, we apply gradient clipping (using a maximum norm of 5) in all of our training strategy. This ensures that the gradients do not become excessively large, stabilizing training.

Early stopping is integrated by monitoring the validation loss. If the loss does not improve by at least a minimum delta for 5 consecutive epochs, training is halted early.


In [8]:
def train_model(model, train_loader, val_loader, criterion, optimizer, scheduler,
                num_epochs=20, save_path="best_model.pth", early_stop_patience=5,min_delta=1e-4):
    best_val_loss = float('inf')
    early_stop_counter = 0
    model.to(device)
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0.0
        for batch in train_loader:
            inputs = batch.to(device)
            optimizer.zero_grad()
            logits, _ = model(inputs)
            logits = logits[:, :-1, :].contiguous().view(-1, logits.size(-1))
            targets = inputs[:, 1:].contiguous().view(-1)
            loss = criterion(logits, targets)
            loss.backward()
            clip_grad_norm_(model.parameters(), max_norm=5)
            optimizer.step()
            total_loss += loss.item() * inputs.size(0)
        train_loss = total_loss / len(train_loader.dataset)

        # Validation
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch in val_loader:
                inputs = batch.to(device)
                logits, _ = model(inputs)
                logits = logits[:, :-1, :].contiguous().view(-1, logits.size(-1))
                targets = inputs[:, 1:].contiguous().view(-1)
                loss = criterion(logits, targets)
                val_loss += loss.item() * inputs.size(0)
        val_loss = val_loss / len(val_loader.dataset)

        # Step scheduler (if not ReduceLROnPlateau)
        if not isinstance(scheduler, optim.lr_scheduler.ReduceLROnPlateau):
            scheduler.step()
        else:
            scheduler.step(val_loss)

        print(f"Epoch [{epoch+1}/{num_epochs}] | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")

        # Check for significant improvement
        if best_val_loss - val_loss > min_delta:
            best_val_loss = val_loss
            torch.save(model.state_dict(), save_path)
            early_stop_counter = 0
            print(f"--> Best model saved with Val Loss: {best_val_loss:.4f}")
        else:
            early_stop_counter += 1
            print(f"Early stop counter: {early_stop_counter}/{early_stop_patience}")
            if early_stop_counter >= early_stop_patience:
                print("Early stopping triggered.")
                break




## 4.2 Teacher Forcing


In the training loop, we generate the sequence step-by-step. At each time step, we decide whether to feed the actual next token or the predicted token based on a teacher forcing ratio

In [9]:
def train_model_with_teacher_forcing(model, train_loader, val_loader, criterion, optimizer, scheduler,
                                     num_epochs=20, teacher_forcing_ratio=0.5,
                                     save_path="best_model.pth", early_stop_patience=5, min_delta=1e-4):
    best_val_loss = float('inf')
    early_stop_counter = 0
    model.to(device)

    for epoch in range(num_epochs):
        model.train()
        total_loss = 0.0
        for batch in train_loader:
            inputs = batch.to(device)  # ground truth sequence
            batch_size, seq_len = inputs.shape
            optimizer.zero_grad()

            hidden = None
            outputs = []
            # Initialize with the first token in the sequence
            input_t = inputs[:, 0].unsqueeze(1)  # shape: [batch, 1]
            for t in range(seq_len - 1):
                logits, hidden = model(input_t, hidden)  # logits: [batch, 1, vocab_size]
                outputs.append(logits)

                if random.random() < teacher_forcing_ratio:
                    next_input = inputs[:, t+1]
                else:
                    next_input = logits.argmax(dim=-1).squeeze(1)
                input_t = next_input.unsqueeze(1)  # prepare next input

            outputs = torch.cat(outputs, dim=1)
            targets = inputs[:, 1:]  # target sequence is shifted by one
            loss = criterion(outputs.contiguous().view(-1, outputs.size(-1)), targets.contiguous().view(-1))
            loss.backward()

            clip_grad_norm_(model.parameters(), max_norm=5)
            optimizer.step()

            total_loss += loss.item() * batch_size

        train_loss = total_loss / len(train_loader.dataset)

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for batch in val_loader:
                inputs = batch.to(device)
                logits, _ = model(inputs)  # here, we pass the whole sequence at once
                logits = logits[:, :-1, :].contiguous().view(-1, logits.size(-1))
                targets = inputs[:, 1:].contiguous().view(-1)
                loss = criterion(logits, targets)
                val_loss += loss.item() * inputs.size(0)
        val_loss = val_loss / len(val_loader.dataset)

        if not isinstance(scheduler, optim.lr_scheduler.ReduceLROnPlateau):
            scheduler.step()
        else:
            scheduler.step(val_loss)

        print(f"Epoch [{epoch+1}/{num_epochs}] | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")

        if best_val_loss - val_loss > min_delta:
            best_val_loss = val_loss
            torch.save(model.state_dict(), save_path)
            early_stop_counter = 0
            print(f"--> Best model saved with Val Loss: {best_val_loss:.4f}")
        else:
            early_stop_counter += 1
            print(f"Early stop counter: {early_stop_counter}/{early_stop_patience}")
            if early_stop_counter >= early_stop_patience:
                print("Early stopping triggered.")
                break

    print("Training complete.")

## 4.3 Quantitative Evaluation: Perplexity

Perplexity is computed from the average cross-entropy loss. Lower perplexity indicates better performance. In our experiments, we compared perplexity across different models (vanilla RNN, LSTM, GRU) to assess predictive performance.

In [10]:
def evaluate_model(model, data_loader, criterion=None):
    model.eval()
    total_loss = 0.0
    total_tokens = 0
    with torch.no_grad():
        for batch in data_loader:
            inputs = batch.to(device)
            logits, _ = model(inputs)
            if criterion is not None:
                logits = logits[:, :-1, :].contiguous().view(-1, logits.size(-1))
                targets = inputs[:, 1:].contiguous().view(-1)
                loss = criterion(logits, targets)
                total_loss += loss.item() * inputs.size(0)
                total_tokens += targets.size(0)
    if criterion is not None:
        avg_loss = total_loss / len(data_loader.dataset)
        perplexity = math.exp(avg_loss)
        print(f"Perplexity: {perplexity:.2f}")
    else:
        print("Evaluation complete (loss not computed).")

# 5. Advanced Text Generation

## 5.1 Temperature-Controlled Sampling

Temperature controls the randomness of the generated text. A lower temperature makes the distribution “peaky” , while a higher temperature produces more diverse outputs. We set temperature at 0.8 after several experiments for all models.

## 5.2 Beam Search

We also inroduced Beam Search in text generation that keeps track of the top-k candidate sequences at each time step, selecting the sequence with the highest overall score.

In [11]:
def sample_text(model, start_text, tokenizer, max_length=100, temperature=1.0):
    """
    Generate text using temperature-controlled sampling.
    start_text: The initial text to prime the model.
    """
    model.eval()
    encoding = tokenizer.encode(start_text)
    input_ids = torch.tensor([encoding.ids], dtype=torch.long).to(device)  # Shape: [1, seq_len]
    generated = input_ids.tolist()[0]

    hidden = None
    with torch.no_grad():
        for _ in range(max_length):
            logits, hidden = model(input_ids, hidden)
            logits = logits[:, -1, :] / temperature  # Shape: [1, vocab_size]
            probs = F.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)  # Shape: [1, 1]
            generated.append(next_token.item())
            input_ids = next_token
            if tokenizer.id_to_token(next_token.item()) == "<eos>":
                break
    text_generated = tokenizer.decode(generated)
    return text_generated

def beam_search_text(model, start_text, tokenizer, beam_width=3, max_length=100, temperature=1.0):
    """
    Generate text using beam search.
    This is a simple implementation.
    """
    model.eval()
    encoding = tokenizer.encode(start_text)
    start_ids = encoding.ids
    sequences = [(start_ids, 0.0, None)]

    for _ in range(max_length):
        all_candidates = []
        for seq, score, hidden in sequences:
            input_ids = torch.tensor([seq], dtype=torch.long).to(device)
            with torch.no_grad():
                logits, hidden_new = model(input_ids, hidden)
            logits = logits[:, -1, :] / temperature
            probs = F.log_softmax(logits, dim=-1)
            topk = torch.topk(probs, beam_width)
            for i in range(beam_width):
                token = topk.indices[0, i].item()
                token_score = topk.values[0, i].item()
                candidate_seq = seq + [token]
                candidate_score = score + token_score
                all_candidates.append((candidate_seq, candidate_score, hidden_new))
        ordered = sorted(all_candidates, key=lambda tup: tup[1], reverse=True)
        sequences = ordered[:beam_width]
        if tokenizer.id_to_token(sequences[0][0][-1]) == "<eos>":
            break
    best_seq = sequences[0][0]
    return tokenizer.decode(best_seq)


# 6. Main

In [12]:
if __name__ == '__main__':
    embed_size = 256
    hidden_size = 512
    num_layers = 2
    bidirectional = True
    dropout = 0.3
    num_epochs = 20
    model_rnn = RNNModel("RNN", vocab_size, embed_size, hidden_size, num_layers, bidirectional, dropout).to(device)
    model_lstm = RNNModel("LSTM", vocab_size, embed_size, hidden_size, num_layers, bidirectional, dropout).to(device)
    model_gru = RNNModel("GRU", vocab_size, embed_size, hidden_size, num_layers, bidirectional, dropout).to(device)

    # Define loss
    criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.token_to_id("<pad>"))

    # Use Adam optimizer for these experiments
    optimizer_rnn = optim.Adam(model_rnn.parameters(), lr=0.001)
    optimizer_lstm = optim.Adam(model_lstm.parameters(), lr=0.001)
    optimizer_gru = optim.Adam(model_gru.parameters(), lr=0.001)

    # Use a ReduceLROnPlateau scheduler
    scheduler_rnn = optim.lr_scheduler.ReduceLROnPlateau(optimizer_rnn, mode='min', patience=3, factor=0.5, verbose=True)
    scheduler_lstm = optim.lr_scheduler.ReduceLROnPlateau(optimizer_lstm, mode='min', patience=3, factor=0.5, verbose=True)
    scheduler_gru = optim.lr_scheduler.ReduceLROnPlateau(optimizer_gru, mode='min', patience=3, factor=0.5, verbose=True)





## RNN

### Training without teacher forcing - RNN

In [26]:

    print("Training Vanilla RNN model...")
    train_model(model_rnn, train_loader, val_loader, criterion, optimizer_rnn, scheduler_rnn,
                num_epochs=num_epochs,save_path="best_rnn.pth", early_stop_patience = 5)
    model_rnn.load_state_dict(torch.load("best_rnn.pth"))
    print("Evaluating Vanilla RNN model:")
    evaluate_model(model_rnn, val_loader, criterion)

Training Vanilla RNN model...
Epoch [1/20] | Train Loss: 0.1321 | Val Loss: 0.0001
--> Best model saved with Val Loss: 0.0001
Epoch [2/20] | Train Loss: 0.0001 | Val Loss: 0.0000
--> Best model saved with Val Loss: 0.0000
Epoch [3/20] | Train Loss: 0.0000 | Val Loss: 0.0000
Early stop counter: 1/5
Epoch [4/20] | Train Loss: 0.0000 | Val Loss: 0.0000
Early stop counter: 2/5
Epoch [5/20] | Train Loss: 0.0000 | Val Loss: 0.0000
Early stop counter: 3/5
Epoch [6/20] | Train Loss: 0.0000 | Val Loss: 0.0000
Early stop counter: 4/5
Epoch [7/20] | Train Loss: 0.0000 | Val Loss: 0.0000
Early stop counter: 5/5
Early stopping triggered.
Evaluating Vanilla RNN model:
Perplexity: 1.00


### Training with Teacher forcing - RNN

In [34]:
model_rnn_teacher = RNNModel("RNN", vocab_size, embed_size, hidden_size, num_layers, bidirectional, dropout).to(device)
optimizer = optim.Adam(model_rnn_teacher.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=3, factor=0.5, verbose=True)
train_model_with_teacher_forcing(model_rnn_teacher, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs=10, teacher_forcing_ratio=0.5, save_path="best_model_teacher.pth", early_stop_patience=3)
print("Evaluating Vanilla RNN_Teacher model:")
evaluate_model(model_rnn_teacher, val_loader, criterion)



Epoch [1/10] | Train Loss: 4.8295 | Val Loss: 3.5305
--> Best model saved with Val Loss: 3.5305
Epoch [2/10] | Train Loss: 3.1458 | Val Loss: 3.1779
--> Best model saved with Val Loss: 3.1779
Epoch [3/10] | Train Loss: 2.2545 | Val Loss: 3.0074
--> Best model saved with Val Loss: 3.0074
Epoch [4/10] | Train Loss: 1.9013 | Val Loss: 2.9382
--> Best model saved with Val Loss: 2.9382
Epoch [5/10] | Train Loss: 1.7558 | Val Loss: 2.9371
--> Best model saved with Val Loss: 2.9371
Epoch [6/10] | Train Loss: 1.7279 | Val Loss: 2.9412
Early stop counter: 1/3
Epoch [7/10] | Train Loss: 1.7494 | Val Loss: 2.9304
--> Best model saved with Val Loss: 2.9304
Epoch [8/10] | Train Loss: 1.7915 | Val Loss: 2.9428
Early stop counter: 1/3
Epoch [9/10] | Train Loss: 1.7985 | Val Loss: 2.9333
Early stop counter: 2/3
Epoch [10/10] | Train Loss: 1.9447 | Val Loss: 3.0215
Early stop counter: 3/3
Early stopping triggered.
Training complete.
Evaluating Vanilla RNN_Teacher model:
Perplexity: 20.44


## LSTM

### Training without teacher forcing - LSTM

In [15]:
print("\nTraining LSTM model...")
train_model(model_lstm, train_loader, val_loader, criterion, optimizer_lstm, scheduler_lstm,
                num_epochs=num_epochs, save_path="best_lstm.pth", early_stop_patience=3)
model_lstm.load_state_dict(torch.load("best_lstm.pth"))
print("Evaluating LSTM model:")
evaluate_model(model_lstm, val_loader, criterion)




Training LSTM model...
Epoch [1/20] | Train Loss: 0.3974 | Val Loss: 0.0006
--> Best model saved with Val Loss: 0.0006
Epoch [2/20] | Train Loss: 0.0006 | Val Loss: 0.0001
--> Best model saved with Val Loss: 0.0001
Epoch [3/20] | Train Loss: 0.0002 | Val Loss: 0.0000
Early stop counter: 1/3
Epoch [4/20] | Train Loss: 0.0001 | Val Loss: 0.0000
--> Best model saved with Val Loss: 0.0000
Epoch [5/20] | Train Loss: 0.0001 | Val Loss: 0.0000
Early stop counter: 1/3
Epoch [6/20] | Train Loss: 0.0000 | Val Loss: 0.0000
Early stop counter: 2/3
Epoch [7/20] | Train Loss: 0.0000 | Val Loss: 0.0000
Early stop counter: 3/3
Early stopping triggered.
Evaluating LSTM model:
Perplexity: 1.00


### Training with teacher forcing - LSTM

In [16]:
model_lstm_teacher = RNNModel("LSTM", vocab_size, embed_size, hidden_size, num_layers, bidirectional, dropout).to(device)
optimizer = optim.Adam(model_lstm_teacher.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=3, factor=0.5, verbose=True)
train_model_with_teacher_forcing(model_lstm_teacher, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs=10, teacher_forcing_ratio=0.5, save_path="best_model.pth", early_stop_patience=3)
evaluate_model(model_lstm_teacher, val_loader, criterion)

Epoch [1/10] | Train Loss: 5.1588 | Val Loss: 3.3703
--> Best model saved with Val Loss: 3.3703
Epoch [2/10] | Train Loss: 2.0524 | Val Loss: 2.3216
--> Best model saved with Val Loss: 2.3216
Epoch [3/10] | Train Loss: 0.9169 | Val Loss: 2.0372
--> Best model saved with Val Loss: 2.0372
Epoch [4/10] | Train Loss: 0.8221 | Val Loss: 1.8484
--> Best model saved with Val Loss: 1.8484
Epoch [5/10] | Train Loss: 0.7751 | Val Loss: 1.8334
--> Best model saved with Val Loss: 1.8334
Epoch [6/10] | Train Loss: 0.7360 | Val Loss: 1.6948
--> Best model saved with Val Loss: 1.6948
Epoch [7/10] | Train Loss: 0.6970 | Val Loss: 1.6427
--> Best model saved with Val Loss: 1.6427
Epoch [8/10] | Train Loss: 0.6938 | Val Loss: 1.5665
--> Best model saved with Val Loss: 1.5665
Epoch [9/10] | Train Loss: 0.6671 | Val Loss: 1.5162
--> Best model saved with Val Loss: 1.5162
Epoch [10/10] | Train Loss: 0.6533 | Val Loss: 1.5136
--> Best model saved with Val Loss: 1.5136
Training complete.
Perplexity: 4.55


### Training without teacher forcing - GRU

In [13]:
print("\nTraining GRU model...")
train_model(model_gru, train_loader, val_loader, criterion, optimizer_gru, scheduler_gru,
                num_epochs=num_epochs, save_path="best_gru.pth", early_stop_patience=5)
model_gru.load_state_dict(torch.load("best_gru.pth"))
print("Evaluating GRU model:")
evaluate_model(model_gru, val_loader, criterion)



Training GRU model...
Epoch [1/20] | Train Loss: 0.2407 | Val Loss: 0.0002
--> Best model saved with Val Loss: 0.0002
Epoch [2/20] | Train Loss: 0.0002 | Val Loss: 0.0000
--> Best model saved with Val Loss: 0.0000
Epoch [3/20] | Train Loss: 0.0001 | Val Loss: 0.0000
Early stop counter: 1/5
Epoch [4/20] | Train Loss: 0.0000 | Val Loss: 0.0000
Early stop counter: 2/5
Epoch [5/20] | Train Loss: 0.0000 | Val Loss: 0.0000
Early stop counter: 3/5
Epoch [6/20] | Train Loss: 0.0004 | Val Loss: 0.0000
Early stop counter: 4/5
Epoch [7/20] | Train Loss: 0.0000 | Val Loss: 0.0000
Early stop counter: 5/5
Early stopping triggered.
Evaluating GRU model:
Perplexity: 1.00


### Training with teacher forcing - GRU

In [15]:
model_gru_teacher = RNNModel("GRU", vocab_size, embed_size, hidden_size, num_layers, bidirectional, dropout).to(device)
optimizer = optim.Adam(model_gru_teacher.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=3, factor=0.5, verbose=True)
train_model_with_teacher_forcing(model_gru_teacher, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs=20, teacher_forcing_ratio=0.5, save_path="best_gru_teacher.pth", early_stop_patience=3)
evaluate_model(model_gru_teacher, val_loader, criterion)

Epoch [1/20] | Train Loss: 4.4511 | Val Loss: 2.4821
--> Best model saved with Val Loss: 2.4821
Epoch [2/20] | Train Loss: 1.5721 | Val Loss: 2.0117
--> Best model saved with Val Loss: 2.0117
Epoch [3/20] | Train Loss: 1.1218 | Val Loss: 1.9232
--> Best model saved with Val Loss: 1.9232
Epoch [4/20] | Train Loss: 1.0361 | Val Loss: 1.9096
--> Best model saved with Val Loss: 1.9096
Epoch [5/20] | Train Loss: 1.0130 | Val Loss: 1.8684
--> Best model saved with Val Loss: 1.8684
Epoch [6/20] | Train Loss: 0.9959 | Val Loss: 1.8862
Early stop counter: 1/3
Epoch [7/20] | Train Loss: 0.9829 | Val Loss: 1.8913
Early stop counter: 2/3
Epoch [8/20] | Train Loss: 0.9802 | Val Loss: 1.8516
--> Best model saved with Val Loss: 1.8516
Epoch [9/20] | Train Loss: 0.9805 | Val Loss: 1.8796
Early stop counter: 1/3
Epoch [10/20] | Train Loss: 0.9669 | Val Loss: 1.8986
Early stop counter: 2/3
Epoch [11/20] | Train Loss: 0.9839 | Val Loss: 1.9177
Early stop counter: 3/3
Early stopping triggered.
Training co

## Optimizer Experiments

### Training with RMSprop

In [None]:
optimizer_gru = optim.RMSprop(model_gru.parameters(), lr=0.001)

print("\nTraining GRU model...")
train_model(model_gru, train_loader, val_loader, criterion, optimizer_gru, scheduler_gru,
                num_epochs=num_epochs, save_path="best_gru.pth", early_stop_patience=3)
model_gru.load_state_dict(torch.load("best_gru.pth"))
print("Evaluating GRU model:")
evaluate_model(model_gru, val_loader, criterion)





Training GRU model...
Epoch [1/20] | Train Loss: 0.3908 | Val Loss: 0.0001
--> Best model saved with Val Loss: 0.0001
Epoch [2/20] | Train Loss: 0.0000 | Val Loss: 0.0000
Early stop counter: 1/3
Epoch [3/20] | Train Loss: 0.0000 | Val Loss: 0.0000
Early stop counter: 2/3
Epoch [4/20] | Train Loss: 0.0000 | Val Loss: 0.0000
Early stop counter: 3/3
Early stopping triggered.
Evaluating GRU model:
Perplexity: 1.00


In [21]:
model_gru_teacher = RNNModel("GRU", vocab_size, embed_size, hidden_size, num_layers, bidirectional, dropout).to(device)
optimizer = optim.RMSprop(model_gru_teacher.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=3, factor=0.5, verbose=True)
train_model_with_teacher_forcing(model_gru_teacher, train_loader, val_loader, criterion, optimizer, scheduler, num_epochs=20, teacher_forcing_ratio=0.5, save_path="best_gru_teacher.pth", early_stop_patience=3)
evaluate_model(model_gru_teacher, val_loader, criterion)

Epoch [1/20] | Train Loss: 4.5433 | Val Loss: 2.7545
--> Best model saved with Val Loss: 2.7545
Epoch [2/20] | Train Loss: 1.9690 | Val Loss: 2.2835
--> Best model saved with Val Loss: 2.2835
Epoch [3/20] | Train Loss: 1.3033 | Val Loss: 2.1269
--> Best model saved with Val Loss: 2.1269
Epoch [4/20] | Train Loss: 1.1582 | Val Loss: 2.0029
--> Best model saved with Val Loss: 2.0029
Epoch [5/20] | Train Loss: 1.0677 | Val Loss: 1.9780
--> Best model saved with Val Loss: 1.9780
Epoch [6/20] | Train Loss: 1.0118 | Val Loss: 1.9451
--> Best model saved with Val Loss: 1.9451
Epoch [7/20] | Train Loss: 0.9841 | Val Loss: 1.9416
--> Best model saved with Val Loss: 1.9416
Epoch [8/20] | Train Loss: 1.0086 | Val Loss: 1.9443
Early stop counter: 1/3
Epoch [9/20] | Train Loss: 0.9817 | Val Loss: 1.8872
--> Best model saved with Val Loss: 1.8872
Epoch [10/20] | Train Loss: 0.9772 | Val Loss: 1.8923
Early stop counter: 1/3
Epoch [11/20] | Train Loss: 0.9599 | Val Loss: 1.8963
Early stop counter: 2/3

**Comparison with Adam optimizer**

In the non-teacher forced setup, the GRU model with RMSprop started with a moderate training loss (~0.39) in the first epoch, but then rapidly dropped to near-zero loss in subsequent epochs. By epoch 2, the loss was effectively zero, and early stopping was triggered after a few more epochs.

When teacher forcing was introduced (with a 0.5 ratio), the training dynamics changed significantly. The teacher forced GRU started with a much higher loss (Train Loss ~4.54 and Val Loss ~2.75 in epoch 1) and exhibited a gradual decrease in loss over time. By epoch 14, the validation loss was around 1.87, and the final evaluation showed a perplexity of approximately 6.69.

RMSprop in the non-teacher forced setting appears to push the model to a state where both training and validation loss become extremely low very quickly. Although this might seem desirable at first glance, it often indicates overfitting or that the optimizer is driving the gradients into a regime where the model fails to generalize.

Adam combined with teacher forcing yields a more measured convergence. The gradual decrease in loss, along with a realistic perplexity (∼6.81), indicates that the model learns better and retains the ability to generate diverse and coherent text.

In these experiments, using Adam with teacher forcing produced more stable training dynamics and realistic evaluation metrics.

## Text Generation

### GRU

In [16]:
# --- Advanced Text Generation ---
  # Select one model (e.g., best LSTM) as the base for text generation
base_model = model_gru
  # Generate text using temperature-controlled sampling
prompt = "To be, or not to be:"
generated_text = sample_text(base_model, prompt, tokenizer, max_length=100, temperature=0.8)
print("\nGenerated Text (Temperature Sampling):")
print(generated_text)

  # Generate text using beam search
beam_generated_text = beam_search_text(base_model, prompt, tokenizer, beam_width=3, max_length=100, temperature=0.8)
print("\nGenerated Text (Beam Search):")
print(beam_generated_text)



Generated Text (Temperature Sampling):
To be, or not to be: ThereforeTo ThereforeTo Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore Therefore There

In [17]:

    # Select one model
  base_model = model_gru_teacher
    # Generate text using temperature-controlled sampling
  prompt = "To be, or not to be:"
  generated_text = sample_text(base_model, prompt, tokenizer, max_length=100, temperature=0.8)
  print("\nGenerated Text (Temperature Sampling):")
  print(generated_text)

    # Generate text using beam search
  beam_generated_text = beam_search_text(base_model, prompt, tokenizer, beam_width=3, max_length=100, temperature=0.8)
  print("\nGenerated Text (Beam Search):")
  print(beam_generated_text)


Generated Text (Temperature Sampling):
To be, or not to be: has a charter to extol her blood, When she does praise me grieves me. I have done As you have done; that's what I can; induced As you have been; that's for my country: He that has but effected his good will Hath overta'en mine act. You shall not be The grave of your deserving; Rome must know The value of her own: 'twere a concealment Worse than a theft, no less than a traducement, To hide your do

Generated Text (Beam Search):
To be, or not to be: And four here I'll courtdeliver; looking as it were yours. Come, my coach such a perfecter S pack'd With many places as I say 'thwack our general;' Give it outface me. I cannot tell thee power. My custom always of my lord, my lord,-- I will not be my good lord,-- What, my Hastings, my lord,-- I will beseech you, Ay, madam, that will be so. Come, my sweet lord. How now, sweet


### RNN

In [39]:
  # Select one model
base_model = model_rnn
prompt = "To be, or not to be:"
generated_text = sample_text(base_model, prompt, tokenizer, max_length=100, temperature=0.8)
print("\nGenerated Text (Temperature Sampling):")
print(generated_text)

  # Generate text using beam search
beam_generated_text = beam_search_text(base_model, prompt, tokenizer, beam_width=3, max_length=100, temperature=0.9)
print("\nGenerated Text (Beam Search):")
print(beam_generated_text)



Generated Text (Temperature Sampling):
To be, or not to be: BernTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo senseTo

Generated Text (Beam Search):
To be, or not to be: indifferentToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToTo


In [46]:
# --- Advanced Text Generation ---
  # Select one model (e.g., best LSTM) as the base for text generation
base_model = model_rnn_teacher
  # Generate text using temperature-controlled sampling
prompt = "To be, or not to be:"
generated_text = sample_text(base_model, prompt, tokenizer, max_length=100, temperature=0.8)
print("\nGenerated Text (Temperature Sampling):")
print(generated_text)

  # Generate text using beam search
beam_generated_text = beam_search_text(base_model, prompt, tokenizer, beam_width=3, max_length=100, temperature=0.8)
print("\nGenerated Text (Beam Search):")
print(beam_generated_text)



Generated Text (Temperature Sampling):
To be, or not to be: Ocles that that bottom of the sea-fold force and a sponge, you shall no changeling; if we ourselves compell'd, Even to the teeth and forehead of our faults, To give in evidence thee? but, That blame--as to my serious foul and a made of it,s' banishment had then ' still app parts, these three parts coward: Who, that hath the land, Suffer't, and live with such as cannot rule Nor ever will most on

Generated Text (Beam Search):
To be, or not to be: O, let them still possess'd all done patience, Catesby, let us all. But now, mother by heaven and did love me,ment sit by time is't'd,, mother by the world-- Thy mother lives is hell, love, aught. It is time your patience, Hamlet, for my good friends. God patience, that is dead, is mine uncle?'t now, Hamlet's with him in lovely Edward's son, yet let me now, an Edward's grave To


### LSTM

In [19]:
  # Select one model (e.g., best LSTM) as the base for text generation
base_model = model_lstm
  # Generate text using temperature-controlled sampling
prompt = "To be, or not to be:"
generated_text = sample_text(base_model, prompt, tokenizer, max_length=100, temperature=1.2)
print("\nGenerated Text (Temperature Sampling):")
print(generated_text)

  # Generate text using beam search
beam_generated_text = beam_search_text(base_model, prompt, tokenizer, beam_width=3, max_length=100, temperature=0.9)
print("\nGenerated Text (Beam Search):")
print(beam_generated_text)



Generated Text (Temperature Sampling):
To be, or not to be: OxfordTo OxfordTo soldTo soldToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToatsToats wellats wellats wellats wellats wellats wellats wellats wellats wellats wellats wellats wellats wellats wellats wellats wellats wellats wellats well

Generated Text (Beam Search):
To be, or not to be: ElyToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToTo


In [18]:
  # Select one model
base_model = model_lstm_teacher
prompt = "To be, or not to be:"
generated_text = sample_text(base_model, prompt, tokenizer, max_length=100, temperature=0.9)
print("\nGenerated Text (Temperature Sampling):")
print(generated_text)

  # Generate text using beam search
beam_generated_text = beam_search_text(base_model, prompt, tokenizer, beam_width=3, max_length=100, temperature=0.9)
print("\nGenerated Text (Beam Search):")
print(beam_generated_text)


Generated Text (Temperature Sampling):
To be, or not to be: give him inventorially would dizzy the arithmetic of memory, and yet but yaw neither, in respect of his quick sail. But, in the verity of extolment, I take him to be a soul of great article; and his infusion of such dearth and rareness, as, to make true diction of him, his semblable is his mirror; and who else would trace him, his umbrage, nothing more. Your lordship speaks most infallibly

Generated Text (Beam Search):
To be, or not to be: But yet this is no oath: The George, profaned, hath lost his holy honour; The fear it is no oath: The George, profaned, hath lost his holy honour; The fear it is no oath: The George, profaned, hath lost his holy honour; The fear it is no oath: The George, profaned, hath lost his holy honour; The fear it is no oath: The George, profaned, hath lost his holy honour; The garter, blemish'd


# 7. Metric Comparison

## Non-Teacher Forced Models (Vanilla RNN, LSTM, GRU):

All three architectures trained without teacher forcing quickly reached near-zero training and validation loss. As a result, their reported perplexity is 1.00. Although this may appear ideal, it is suspiciously low and suggests that the models might have overfitted or that the loss is not being computed as expected during validation. A perplexity of 1.00 for a text generation task (especially on a complex corpus like Shakespeare) is unrealistic.

## Teacher Forced Models:

By introducing teacher forcing with a ratio of 0.5, the models show more realistic loss values:

**Vanilla RNN**

The validation loss decreased from around 3.53 in epoch 1 to approximately 2.93 by epoch 5–7, resulting in a perplexity in the range of 20.

**LSTM (Teacher Forced):**

The teacher forced LSTM model steadily improved, with the best validation loss around 1.51 by the end of training, yielding a perplexity of approximately 4.55.

**GRU (Teacher Forced):**

The teacher forced GRU model reached its best validation loss around 1.85, corresponding to a perplexity of roughly = 6.81.




The teacher forced variants report higher perplexities, which are more indicative of a model that is learning a realistic language distribution rather than simply memorizing the training data.



## Conclusion

**Convergence Speed**

The vanilla RNN, LSTM, and GRU models trained without teacher forcing quickly reached near-zero training and validation loss—resulting in a perplexity of 1.00.

When teacher forcing was introduced (with a ratio of 0.5), the models started with significantly higher loss values. For example, the teacher forced vanilla RNN had an initial training loss of around 4.83 and a validation loss of 3.53, which gradually decreased over the epochs. Similarly, the teacher forced LSTM model showed a steady decline in loss—from about 5.16 at epoch 1 to around 1.51 by epoch 10, yielding a more realistic perplexity of approximately 4.55.
The gradual decrease in loss under teacher forcing indicates a smoother convergence process. It prevents the model from relying solely on its own predictions (which might be highly confident but incorrect) during training, thereby encouraging it to learn more meaningful sequential dependencies.｜

# Text Generation Evaluation:


**Non-Teacher Forced Models:**

When using the vanilla GRU or LSTM model without teacher forcing, the generated text exhibited significant repetition. For instance, the GRU model without teacher forcing produced outputs like “BernTo senseTo senseTo sense…” and beam search resulted in long strings of repeated tokens (e.g., “indifferentToToToTo…”). These repetitive outputs suggest that the model has likely overfitted to common patterns in the training data and becomes trapped in a loop, lacking the ability to generate diverse and coherent sequences.



**Teacher Forced Models:**

**GRU Model**

The GRU output with temperature sampling produced text that is relatively coherent and somewhat reflective of Shakespeare’s style. It uses phrases like “has a charter to extol her blood” and “He that has but effected his good will,” which evoke Shakespearean language. Although the text sometimes shifts focus abruptly, overall the vocabulary and phrasing appear more natural and varied.


The beam search output for GRU is more structured and grammatically consistent. It uses complete phrases and punctuation to form longer, connected sentences (e.g., “And four here I’ll courtdeliver; looking as it were yours. Come, my coach such a perfecter…”). While beam search tends to favor high-probability sequences (which can sometimes lead to repetitive structures), in this case, the generated text maintains a level of coherence and captures some of the stylistic rhythm of Shakespeare.

**Vanilla RNN Model**

The vanilla RNN output under temperature sampling appears to generate text that is stylistically reminiscent of Shakespeare, with archaic word forms and some unusual phrasing (e.g., “Ocles that that bottom of the sea-fold force and a sponge…”). However, the output also includes fragments that seem less natural and slightly disjointed, suggesting that the vanilla RNN might be struggling to capture longer-range dependencies consistently.

The beam search output for the vanilla RNN shows attempts at more complete sentences (e.g., “O, let them still possess’d all done patience, Catesby, let us all…”). Although it maintains a Shakespearean flavor, the repetition of certain phrases and awkward syntax indicates that the vanilla RNN’s capacity to model complex language structure is more limited compared to the GRU and LSTM models.

**LSTM Model**

The LSTM output with temperature sampling is notable for its richness. It produces more fluid, creative, and contextually appropriate language (e.g., “give him inventorially would dizzy the arithmetic of memory, and yet but yaw neither, in respect of his quick sail…”). The text is more nuanced and diverse, suggesting that the LSTM is effectively capturing long-term dependencies and generating language that closely resembles the poetic and intricate style of Shakespeare.

The beam search output from the LSTM model, while grammatically coherent and repeating certain phrases (“The George, profaned, hath lost his holy honour…”), tends to become somewhat repetitive. This repetition is a known trade-off with beam search: it often converges to the most probable sequence, which may sacrifice some diversity for coherence. Nonetheless, the overall style remains more consistent and polished than the outputs from the vanilla RNN.

**Comparative Analysis**

**Quality and Coherence:**

LSTM: Among the three, the LSTM produces the most fluid and stylistically rich output. The temperature-sampled text demonstrates creative variation, while beam search generates grammatically correct (albeit sometimes repetitive) text.

GRU: The GRU model also generates text with a strong Shakespearean flavor. It balances creativity with coherence fairly well, though it is slightly less rich in vocabulary compared to the LSTM.

Vanilla RNN: The vanilla RNN output is the least robust, with more disjointed phrases and awkward syntax. It seems to capture some stylistic elements, but overall its coherence and quality are lower.

**Diversity:**

Temperature Sampling: This method generally leads to more diverse outputs. The LSTM’s temperature-sampled output stands out for its creative use of language, whereas the vanilla RNN and GRU show diversity to a lesser degree.

Beam Search: Beam search tends to reduce diversity as it often selects the most likely sequence. This effect is visible in the repetitive phrases produced by both the GRU and LSTM, although the LSTM’s output remains stylistically more consistent.

**Stylistic Alignment with Shakespeare:**

LSTM and GRU with Teacher Forcing: These models, especially when trained with teacher forcing, manage to produce text that echoes Shakespeare’s rhythm and vocabulary. The LSTM, in particular, is able to emulate the cadence and complexity of Shakespeare’s language.

Vanilla RNN: The style is somewhat reminiscent of the period language, but it is marred by inconsistencies and lacks the sophistication seen in the LSTM and GRU outputs.

# 8.Conclusion

## 8.1 Challenges and Insights



**Rapid Convergence and Overfitting in Non-Teacher Forced Models:**

In several experiments (e.g., the vanilla RNN, LSTM, and GRU models trained without teacher forcing), the training loss dropped extremely quickly—often reaching near-zero values within just a few epochs. Consequently, the validation loss also became near-zero, leading to a perplexity of 1.00.
While a perplexity of 1.00 might seem ideal, in this context it is a red flag indicating that the models may have memorized the training data rather than learning a robust language model. This rapid convergence is a strong indicator of overfitting.

When teacher forcing was introduced (with a ratio of 0.5), the models converged more gradually. The training loss started at a higher value and decreased over several epochs, yielding more realistic perplexity values

  **Qualitative Insights on Text Generation:**

  The text generated by non-teacher forced models was highly repetitive and lacked diversity, despite the extremely low perplexity values. In contrast, teacher forced models, though converging more slowly, produced more varied and coherent text that more closely resembled Shakespearean language.

  The quality of generated text is a critical qualitative measure of model performance. The discrepancy between low perplexity and repetitive output in non-teacher forced models underlines the importance of evaluating both quantitative metrics (like perplexity) and qualitative aspects (diversity and stylistic alignment). It also emphasizes that overfitting in sequence models can lead to artificially low loss values while failing to capture the true variability of natural language.