<div align="center">
<h1><b>How To Build a Large Language Model</b></h1>
<h3>Under 5 Minutes</h3>
</div>

<div align="left">
<p> Hey Everyone, today we are going to focus on building our own Large Language Model. We’ll cover the core backbone of how to organize, understand, and apply building such a model. This guide is broken into 8 steps: 6 for model creation, followed by exporting the model in ONNX format and deploying it into a backend service of your choice. In this example, we will use Microsoft Azure. However, you can also use Firebase, MongoDB, or even create your own backend with MySQL. The options are endless!</p>

<div align="left">
<p><b>Overview:</b></p>
<ol>
    <li>Data Feature Engineering</li>
    <li>Model Architecture</li>
    <li>Training</li>
    <li>Validation</li>
    <li>Testing</li>
    <li>Backtesting</li>
    <li>Export the Model from PyTorch to ONNX</li>
    <li>Deploy the Model in a Backend</li>
</ol>
</div>

<div align="left">
<p><b>BreakDown:</b></p>
<ol type="A">
    <li><b>Model Development</b>
        <ol type="1">
            <li>Data Feature Engineering</li>
            <li>Model Architecture</li>
            <li>Training</li>
            <li>Validation</li>
            <li>Testing</li>
            <li>Backtesting</li>
        </ol>
    </li>
    <li><b>Exporting and Deploying Model</b>
        <ol type="1" start="7">
            <li>Export the Model from PyTorch to ONNX</li>
            <li>Deploy the Model in a Backend</li>
        </ol>
    </li>
</ol>
</div>


In [1]:
#Libraries
import json
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, random_split
from collections import Counter
import matplotlib.pyplot as plt
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
import seaborn as sns
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score
from tqdm import tqdm  # Progress bar for batch processing

In [3]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Wed_Jan_15_19:38:46_Pacific_Standard_Time_2025
Cuda compilation tools, release 12.8, V12.8.61
Build cuda_12.8.r12.8/compiler.35404655_0


In [5]:
import torch
print(torch.version.cuda)  # Should return a CUDA version
print(torch.backends.cudnn.enabled)  # Should return True


12.1
True


In [6]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)  # Should return 'cuda'


cpu


In [4]:
import torch
print(torch.cuda.is_available())  # Should return True
print(torch.cuda.device_count())  # Should return a number greater than 0
print(torch.cuda.get_device_name(0))  # Should return the GPU name

False
0


RuntimeError: No CUDA GPUs are available

In [2]:
# ================================================================
# Step 1: Load Vocabulary
# ================================================================

# Load the tokenized vocabulary (vocab.json)
with open("vocab.json", "r", encoding="utf-8") as f:
    vocab = json.load(f)

# Create token-to-ID and ID-to-token mappings
token_to_id = vocab
id_to_token = {v: k for k, v in vocab.items()}

print(f"Vocabulary size: {len(vocab)}")

# ================================================================
# Step 2: Define Tokenized Dataset Class
# ================================================================

class TokenizedDataset(Dataset):
    """
    PyTorch Dataset for tokenized sequences.
    - Handles padding and truncation to ensure uniform sequence length.
    """
    def __init__(self, data, max_seq_length=4096):
        """
        Parameters:
        - data: List of tokenized sequences (each a list of token IDs).
        - max_seq_length: Fixed length for padding/truncation.
        """
        self.data = data
        self.max_seq_length = max_seq_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        """
        Returns a padded/truncated tokenized sequence as a PyTorch tensor.
        """
        tokens = self.data[idx]
        
        # Pad or truncate the sequence to max_seq_length
        if len(tokens) < self.max_seq_length:
            tokens += [token_to_id["<pad>"]] * (self.max_seq_length - len(tokens))
        else:
            tokens = tokens[:self.max_seq_length]
        
        return torch.tensor(tokens, dtype=torch.long)

# ================================================================
# Step 3: Load WET Tokenized Data
# ================================================================

# Load the wet_tokenized.json file
#We will have 20 different WET files to show the different sequences
#For a larger language model we are going to need alot more cleaned WET Files
with open("wet_tokenized.json", "r", encoding="utf-8") as f:
    wet_tokenized_data = json.load(f)

# Extract token IDs from the dataset
tokenized_sequences = [entry["token_ids"] for entry in wet_tokenized_data]

print(f"Total tokenized sequences: {len(tokenized_sequences)}")

# ================================================================
# Step 4: Split Data into Train/Validation/Test
# ================================================================

# Define split sizes (70% train, 20% validation, 10% test)
total_sequences = len(tokenized_sequences)
train_size = int(0.7 * total_sequences)
val_size = int(0.2 * total_sequences)
test_size = total_sequences - train_size - val_size

# Perform the split
train_data, val_data, test_data = random_split(tokenized_sequences, [train_size, val_size, test_size])

print(f"Data Split: {len(train_data)} training, {len(val_data)} validation, {len(test_data)} testing sequences.")

# ================================================================
# Step 5: Create DataLoaders for Batch Processing
# ================================================================

# Set batch size (optimized for GPU memory)
batch_size = 64
max_seq_length = 4096

# Create PyTorch Datasets
train_dataset = TokenizedDataset(train_data, max_seq_length=max_seq_length)
val_dataset = TokenizedDataset(val_data, max_seq_length=max_seq_length)
test_dataset = TokenizedDataset(test_data, max_seq_length=max_seq_length)

# Create DataLoaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, pin_memory=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, pin_memory=True)

print(f"DataLoaders ready:")
print(f" - Train batches: {len(train_loader)}")
print(f" - Validation batches: {len(val_loader)}")
print(f" - Test batches: {len(test_loader)}")

# ================================================================
# Notes for the Model Architecture
# ================================================================

"""
HOW THIS IS PROCESSED:
- Each DataLoader yields batches of size `batch_size`.
- Each batch contains tokenized sequences padded/truncated to `max_seq_length`.
- These batches are directly fed into the Transformer model during training.

For example:
- Train DataLoader yields a batch of shape [batch_size, max_seq_length].
- The batch is moved to the GPU (Titan RTX) and passed to the model.

Optimized GPU Utilization:
- Larger `batch_size` takes advantage of the GPU's memory.
- Use `pin_memory=True` for faster data transfer between CPU and GPU.
"""

# Example: Using DataLoader in Training Loop
for batch in train_loader:
    batch = batch.to("cuda")  # Move batch to GPU
    print(f"Batch shape: {batch.shape}")  # Example output: [batch_size, max_seq_length]
    break  # Stop after the first batch to illustrate


Vocabulary size: 50000
Total tokenized sequences: 4856
Data Split: 3399 training, 971 validation, 486 testing sequences.
DataLoaders ready:
 - Train batches: 54
 - Validation batches: 16
 - Test batches: 8


RuntimeError: No CUDA GPUs are available

In [None]:
#Model Architecture
"""
The model architecture is a transfoer model which is inspired and made from the following paper, Attention is All You Need

and the similar technique, Textbooks is All You Need, apply the transformer model, but using textbook like data from a mix of resources, including
a majority of it from common crawl 

There are 4 parts to a transformer model
1. Self Attention Layer--->Pays attention to the inputs and sees the probability of what to "predict" based on a score
2. Feed Forward Network-->Processes hwo the data will be made with the data
3. Transformer Block = Combines the Self Attention Layer with the Feed Forward Network
4. Transformer Model with its necessary parameters and adding in the number of heads, embeddings, hidden layers, and 
"""
# ================================================================
#  PART 1: MULTI-HEAD SELF-ATTENTION LAYER (Uses Softmax for Attention Weights)
# ================================================================

class MultiHeadSelfAttention(nn.Module):
    """
    Multi-Head Self-Attention Layer:
    - Computes self-attention for multiple "heads" in parallel.
    - Uses Softmax for probability-based attention weights.

    Parameters:
    - embed_dim: The dimension of the input embeddings.
    - num_heads: Number of attention heads.

    Operations:
    - Projects input embeddings into queries, keys, and values.
    - Computes scaled dot-product attention.
    - Applies Softmax to normalize attention scores.
    - Merges attention outputs and projects them back to embedding space.
    """
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0, "Embedding dimension must be divisible by num_heads."
        
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads

        # Learnable projection matrices for Q, K, V
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x, mask=None):
        """
        Forward pass of Multi-Head Self-Attention.

        Inputs:
        - x: Tensor of shape (batch_size, seq_length, embed_dim)
        - mask: Optional mask to prevent attending to certain positions.

        Returns:
        - Output tensor of shape (batch_size, seq_length, embed_dim)
        """
        batch_size, seq_length, embed_dim = x.shape

        # Project input to Queries, Keys, and Values
        qkv = self.qkv_proj(x).reshape(batch_size, seq_length, self.num_heads, 3 * self.head_dim).permute(2, 0, 1, 3)
        queries, keys, values = qkv.chunk(3, dim=-1)

        # Compute attention scores (scaled dot-product)
        attention_scores = torch.matmul(queries, keys.transpose(-2, -1)) / (self.head_dim ** 0.5)

        # Apply optional mask (for masked attention)
        if mask is not None:
            attention_scores = attention_scores.masked_fill(mask == 0, float('-inf'))
        
        # Apply Softmax to normalize attention scores
        attention_weights = F.softmax(attention_scores, dim=-1)

        # Compute attention output
        context = torch.matmul(attention_weights, values).permute(1, 2, 0, 3).reshape(batch_size, seq_length, embed_dim)
        
        return self.out_proj(context)

# ================================================================
#  PART 2: FEEDFORWARD NETWORK (FFN) WITH GELU
# ================================================================

class FeedForwardNetwork(nn.Module):
    """
    FeedForward Network (FFN):
    - Applies a non-linear transformation to each token representation.
    - Uses GELU 

    Parameters:
    - embed_dim: Input and output dimension.
    - hidden_dim: Expanded dimension inside the feedforward network.

    Operations:
    - Applies Linear → GELU → Linear transformations.
    """
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(embed_dim, hidden_dim)
        self.gelu = nn.GELU()  # Using GELU
        self.fc2 = nn.Linear(hidden_dim, embed_dim)

    def forward(self, x):
        """
        Forward pass of FFN.

        Inputs:
        - x: Tensor of shape (batch_size, seq_length, embed_dim)

        Returns:
        - Output tensor of shape (batch_size, seq_length, embed_dim)
        """
        return self.fc2(self.gelu(self.fc1(x)))

# ================================================================
#  PART 3: TRANSFORMER ENCODER BLOCK (Combining Attention & FFN)
# ================================================================

class TransformerBlock(nn.Module):
    """
    Transformer Encoder Block:
    - Combines Multi-Head Self-Attention and FeedForward Network (FFN).
    - Uses Layer Normalization for stability.
    - Applies residual connections to improve gradient flow.

    Parameters:
    - embed_dim: The embedding dimension.
    - num_heads: The number of attention heads.
    - hidden_dim: The size of the hidden layer in FFN.

    This block is stacked multiple times in the full Transformer.
    """
    def __init__(self, embed_dim, num_heads, hidden_dim):
        super().__init__()

        # Multi-Head Self-Attention Layer
        self.attention = MultiHeadSelfAttention(embed_dim, num_heads)

        # Layer Normalization for stable learning
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

        # Feedforward Network
        self.ffn = FeedForwardNetwork(embed_dim, hidden_dim)

    def forward(self, x, mask=None):
        """
        Forward pass through Transformer Encoder Block.

        Inputs:
        - x: Input tensor of shape (batch_size, seq_length, embed_dim)
        - mask: Optional attention mask

        Returns:
        - Output tensor of shape (batch_size, seq_length, embed_dim)
        """
        # Apply Self-Attention with Residual Connection
        attn_output = self.attention(x, mask)
        x = self.norm1(x + attn_output)  

        # Apply FeedForward Network with Residual Connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + ffn_output)  

        return x

# ================================================================
#  PART 4: FULL TRANSFORMER MODEL (25 Billion Parameters)
# ================================================================

class TransformerModel(nn.Module):
    """
    Full Transformer Model:
    - Consists of Token Embeddings, Positional Encoding, Transformer Blocks, and Output Layer.
    - Uses GELU for all activation functions.

    Parameters:
    - vocab_size: Number of unique tokens in vocabulary.
    - embed_dim: Size of the word embeddings.
    - num_heads: Number of attention heads per layer.
    - num_layers: Number of stacked Transformer layers.
    - hidden_dim: Size of the hidden layer in FFN.
    """
    def __init__(self, vocab_size, embed_dim=12288, num_heads=96, num_layers=96, hidden_dim=49152):
        super().__init__()

        # Token Embedding Layer
        self.embedding = nn.Embedding(vocab_size, embed_dim)

        # Positional Encoding (learnable)
        self.positional_encoding = nn.Parameter(torch.zeros(1, 4096, embed_dim))

        # Transformer Layers (Stacked Encoder Blocks)
        self.encoder_layers = nn.ModuleList([
            TransformerBlock(embed_dim, num_heads, hidden_dim)
            for _ in range(num_layers)
        ])

        # Final projection layer to vocabulary size
        self.fc_out = nn.Linear(embed_dim, vocab_size)

    def forward(self, input_ids, attention_mask=None):
        """
        Forward pass through the full Transformer.

        Inputs:
        - input_ids: Tensor of tokenized input (batch_size, seq_length)
        - attention_mask: Optional mask to prevent attending to padding tokens.

        Returns:
        - Logits of shape (batch_size, seq_length, vocab_size)
        """
        # Token Embeddings
        x = self.embedding(input_ids)

        # Add Positional Encoding
        x = x + self.positional_encoding[:, :x.size(1), :]

        # Pass through all Transformer Encoder Blocks
        for layer in self.encoder_layers:
            x = layer(x, attention_mask)

        # Output projection
        logits = self.fc_out(x)
        return logits

# ================================================================
#  PART 5: MODEL CONFIGURATION & TESTING
# ================================================================
if __name__ == "__main__":
    vocab_size = 50000
    max_seq_length = 4096
    model = TransformerModel(vocab_size)

    total_params = sum(p.numel() for p in model.parameters())
    print(f" Transformer Model with {total_params:,} Parameters")

    input_ids = torch.randint(0, vocab_size, (2, max_seq_length))
    output = model(input_ids)
    print("Output shape:", output.shape)  # Expected: (batch_size, seq_length, vocab_size)



In [None]:
#Train
""""
When we train a given neural network, what we are doing is first splitting our data into 70:20:10, or can do 80:10:10, where in this case, the 70
is going to be the majority of our data which we will train with our given model, and during training we are going to be having the following criteria

1. Batch processing

2. Number of Epochs

3. Corresponding loss score

4. Early Stopping

*We will use matplot lib to show the plot of epochs to the loss score, along with the F1 score, confusion matrix, then we will apply validation,
testing, and backtesting

""""
# ================================================================
# 🚀 TRAINING FUNCTION WITH EARLY STOPPING, LOSS VISUALIZATION, & PROGRESS BAR
# ================================================================

def train_model(model, train_loader, val_loader, epochs=10, lr=2e-5, patience=3, device="cuda"):
    """
    Trains the Transformer model and tracks performance metrics with a progress bar.
    
    Parameters:
    - model: Transformer model (25B parameters).
    - train_loader: DataLoader for training data.
    - val_loader: DataLoader for validation data.
    - epochs: Maximum number of epochs to train.
    - lr: Learning rate for AdamW optimizer.
    - patience: Number of epochs to wait before early stopping.
    - device: "cuda" or "cpu".
    
    Returns:
    - Trained model with the best validation performance.
    """
    
    model.to(device)
    optimizer = optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss(ignore_index=0)

    best_val_loss = float('inf')
    patience_counter = 0
    train_losses, val_losses = [], []

    for epoch in range(epochs):
        model.train()
        total_train_loss = 0

        # Add progress bar for training batches
        train_progress = tqdm(train_loader, desc=f"Epoch {epoch+1}/{epochs} - Training", leave=False)

        for batch in train_progress:
            batch = batch.to(device)
            optimizer.zero_grad()
            outputs = model(batch)
            loss = loss_fn(outputs.view(-1, outputs.size(-1)), batch.view(-1))
            loss.backward()
            optimizer.step()
            total_train_loss += loss.item()

            # Update progress bar with current loss
            train_progress.set_postfix({"Batch Loss": loss.item()})

        avg_train_loss = total_train_loss / len(train_loader)
        avg_val_loss = validate_model(model, val_loader, loss_fn, device)

        train_losses.append(avg_train_loss)
        val_losses.append(avg_val_loss)

        print(f"Epoch {epoch+1}/{epochs} | Train Loss: {avg_train_loss:.4f} | Val Loss: {avg_val_loss:.4f}")

        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            patience_counter = 0
            torch.save(model.state_dict(), "best_transformer_model.pth")
            print("Model saved (Best Validation Loss Improved).")
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print("Early stopping triggered. Training stopped.")
                break

    plt.figure(figsize=(8, 6))
    plt.plot(range(1, len(train_losses) + 1), train_losses, label='Train Loss')
    plt.plot(range(1, len(val_losses) + 1), val_losses, label='Validation Loss', linestyle='dashed')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.title('Training vs. Validation Loss Over Epochs')
    plt.legend()
    plt.show()

    return model

In [None]:
#Validate
"""
Validation Function
"""
# ================================================================
# 🚀 VALIDATION FUNCTION WITH PROGRESS BAR
# ================================================================

def validate_model(model, val_loader, loss_fn, device="cuda"):
    """
    Evaluates the model on validation data and computes key performance metrics.
    
    Parameters:
    - model: Transformer model.
    - val_loader: DataLoader for validation data.
    - loss_fn: Loss function.
    - device: "cuda" or "cpu".
    
    Returns:
    - Average validation loss.
    """
    model.eval()
    total_loss = 0
    all_preds, all_labels = [], []

    val_progress = tqdm(val_loader, desc="Validation", leave=False)  # Progress bar for validation

    with torch.no_grad():
        for batch in val_progress:
            batch = batch.to(device)
            outputs = model(batch)
            loss = loss_fn(outputs.view(-1, outputs.size(-1)), batch.view(-1))
            total_loss += loss.item()

            preds = torch.argmax(outputs, dim=-1).view(-1).cpu().numpy()
            labels = batch.view(-1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(labels)

            # Update progress bar with current batch loss
            val_progress.set_postfix({"Batch Loss": loss.item()})

    avg_val_loss = total_loss / len(val_loader)
    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average="weighted")
    cm = confusion_matrix(all_labels, all_preds)

    print(f"Validation Loss: {avg_val_loss:.4f}")
    print(f"Accuracy: {accuracy:.4f}, F1 Score: {f1:.4f}")

    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    plt.xlabel("Predicted Labels")
    plt.ylabel("True Labels")
    plt.title("Confusion Matrix - Validation")
    plt.show()

    return avg_val_loss

In [None]:
#Test
"""
Test Function
"""
# ================================================================
# 🚀 TEST FUNCTION WITH PROGRESS BAR
# ================================================================

def test_model(model, test_loader, device="cuda"):
    """
    Evaluates the trained model on test data and computes final metrics.
    
    Parameters:
    - model: Transformer model.
    - test_loader: DataLoader for test data.
    - device: "cuda" or "cpu".
    
    Returns:
    - Perplexity score (lower is better).
    """
    model.to(device)
    loss_fn = nn.CrossEntropyLoss(ignore_index=0)
    total_loss = 0
    all_preds, all_labels = []

    test_progress = tqdm(test_loader, desc="Testing", leave=False)  # Progress bar for testing

    with torch.no_grad():
        for batch in test_progress:
            batch = batch.to(device)
            outputs = model(batch)
            loss = loss_fn(outputs.view(-1, outputs.size(-1)), batch.view(-1))
            total_loss += loss.item()

            preds = torch.argmax(outputs, dim=-1).view(-1).cpu().numpy()
            labels = batch.view(-1).cpu().numpy()
            all_preds.extend(preds)
            all_labels.extend(labels)

            # Update progress bar with current batch loss
            test_progress.set_postfix({"Batch Loss": loss.item()})

    avg_test_loss = total_loss / len(test_loader)
    perplexity = torch.exp(torch.tensor(avg_test_loss))

    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average="weighted")
    cm = confusion_matrix(all_labels, all_preds)

    print(f"Test Loss: {avg_test_loss:.4f}")
    print(f"Test Perplexity: {perplexity:.2f}, Accuracy: {accuracy:.4f}, F1 Score: {f1:.4f}")

    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    plt.xlabel("Predicted Labels")
    plt.ylabel("True Labels")
    plt.title("Confusion Matrix - Test")
    plt.show()

    return perplexity


In [None]:
#Backtest
"""
Backtesting is the final benchmarking phase depneding how one wants to apply their model, and we will be using 4 types of backtests for our language model
1. Perplexity score: Measure Predictability
2. BLEU Score: Measures Similarity to Reference Texts
3. ROGUE Score: Measures Uniqueness of Generated Text
4. Test Divedrsity Score: Measures Uniqueness of Generated Text
"""
import torch
import torch.nn.functional as F
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer
from collections import Counter

# ================================================================
# 🚀 PERPLEXITY SCORE (Measures Predictability)
# ================================================================

def calculate_perplexity(model, test_loader, device="cuda"):
    """
    Computes perplexity (PPL) on test data.
    - Lower perplexity means the model predicts tokens more accurately.

    Parameters:
    - model: The trained Transformer model.
    - test_loader: DataLoader with test data.
    - device: "cuda" or "cpu".

    Returns:
    - Perplexity score (lower = better).
    """
    model.to(device)
    model.eval()
    total_loss = 0
    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=0)  # Ignore padding token (0)

    with torch.no_grad():
        for batch in test_loader:
            batch = batch.to(device)
            outputs = model(batch)  # Model predicts token probabilities
            loss = loss_fn(outputs.view(-1, outputs.size(-1)), batch.view(-1))  # Compute loss
            total_loss += loss.item()

    avg_loss = total_loss / len(test_loader)
    perplexity = torch.exp(torch.tensor(avg_loss))  # e^(cross-entropy loss)

    print(f"Perplexity Score: {perplexity:.2f} (Lower is better)")
    return perplexity

# ================================================================
# 🚀 BLEU SCORE (Measures Similarity to Reference Texts)
# ================================================================

def calculate_bleu(reference_texts, generated_texts):
    """
    Computes BLEU score between reference texts and model-generated texts.
    - BLEU measures how close generated text is to human-written text.

    Parameters:
    - reference_texts: List of ground truth text samples.
    - generated_texts: List of model-generated samples.

    Returns:
    - Average BLEU score (higher = better).
    """
    scores = []
    for ref, gen in zip(reference_texts, generated_texts):
        reference = [ref.split()]  # Reference text split into words
        candidate = gen.split()  # Model-generated text split into words
        score = sentence_bleu(reference, candidate)  # Compute BLEU score
        scores.append(score)

    avg_bleu = sum(scores) / len(scores)  # Compute average BLEU across samples
    print(f"BLEU Score: {avg_bleu:.4f} (Higher is better)")
    return avg_bleu

# ================================================================
# 🚀 ROUGE SCORE (Measures Recall & Overlap with Reference Texts)
# ================================================================

def calculate_rouge(reference_texts, generated_texts):
    """
    Computes ROUGE scores for model evaluation.
    - ROUGE measures how much of the reference text appears in the generated text.

    Parameters:
    - reference_texts: List of human-written reference texts.
    - generated_texts: List of model-generated outputs.

    Returns:
    - ROUGE scores (Higher is better).
    """
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scores = [scorer.score(ref, gen) for ref, gen in zip(reference_texts, generated_texts)]

    # Compute average ROUGE scores across all test samples
    avg_rouge = {
        'rouge1': sum(s['rouge1'].fmeasure for s in scores) / len(scores),
        'rouge2': sum(s['rouge2'].fmeasure for s in scores) / len(scores),
        'rougeL': sum(s['rougeL'].fmeasure for s in scores) / len(scores)
    }

    print(f"ROUGE Scores: {avg_rouge}")
    return avg_rouge

# ================================================================
# 🚀 TEXT DIVERSITY SCORE (Measures Uniqueness of Generated Text)
# ================================================================

def measure_text_diversity(generated_texts, n=2):
    """
    Measures diversity in model-generated text by analyzing distinct n-grams.
    - Ensures that generated text is not repetitive or generic.

    Parameters:
    - generated_texts: List of generated text outputs.
    - n: N-gram size (default = 2 for bigrams).

    Returns:
    - Distinct n-gram ratio (higher = more diverse).
    """
    all_ngrams = []
    total_ngrams = 0

    for text in generated_texts:
        words = text.split()
        ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
        all_ngrams.extend(ngrams)
        total_ngrams += len(ngrams)

    unique_ngrams = len(set(all_ngrams))
    diversity_score = unique_ngrams / total_ngrams if total_ngrams > 0 else 0

    print(f"Diversity Score (Distinct-{n} Ratio): {diversity_score:.4f} (Higher is better)")
    return diversity_score

# ================================================================
# 🚀 RUN THE FULL BACKTESTING PIPELINE
# ================================================================

if __name__ == "__main__":
    """
    Runs all backtesting benchmarks for a trained language model.
    - Loads reference and generated text.
    - Computes Perplexity, BLEU, ROUGE, and Diversity Scores.
    """

    # Example reference and generated texts (For real evaluation, use a larger dataset)
    reference_texts = ["The cat sat on the mat.", "Artificial intelligence is advancing rapidly."]
    generated_texts = ["The cat lay on the rug.", "AI is progressing very fast."]

    # Load trained model and test data
    model = TransformerModel(vocab_size=50000)  # Load your trained model
    model.load_state_dict(torch.load("best_transformer_model.pth"))  # Load best model
    model.eval()

    # Define test DataLoader
    test_loader = DataLoader(test_dataset, batch_size=16)

    print("\n===== Backtesting Results =====\n")

    # Compute Perplexity (Lower is Better)
    perplexity = calculate_perplexity(model, test_loader, device="cuda")

    # Compute BLEU Score (Higher is Better)
    bleu_score = calculate_bleu(reference_texts, generated_texts)

    # Compute ROUGE Score (Higher is Better)
    rouge_scores = calculate_rouge(reference_texts, generated_texts)

    # Compute Diversity Score (Higher is Better)
    diversity_score = measure_text_diversity(generated_texts)


In [None]:
#Export
"""
We will convert our Pytorch model to ONNX format to then save the model locally prior to deploying it on Microsoft Azure backend's servers so everyone
can use the following model and interface with it 
"""