# Custom K-Beams Testing

## Package installation

In [1]:
!pip install rouge-score bert-score mauve-text

Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting bert-score
  Downloading bert_score-0.3.13-py3-none-any.whl.metadata (15 kB)
Collecting mauve-text
  Downloading mauve_text-0.4.0-py3-none-any.whl.metadata (3.5 kB)
Collecting faiss-cpu>=1.7.0 (from mauve-text)
  Downloading faiss_cpu-1.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.4 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.0.0->bert-score)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.0.0->bert-score)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.0.0->bert-score)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (fr

In [2]:
!pip install datasets sacrebleu

Collecting datasets
  Downloading datasets-3.5.0-py3-none-any.whl.metadata (19 kB)
Collecting sacrebleu
  Downloading sacrebleu-2.5.1-py3-none-any.whl.metadata (51 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/51.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m51.8/51.8 kB[0m [31m4.5 MB/s[0m eta [36m0:00:00[0m
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.12.0,>=2023.1.0 (from fsspec[http]<=2024.12.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.12.0-py3-none-any.whl.metadata (11 kB)
Collecting portalocker (from sacrebleu)
  Downloading portalocker-3.1.1-py3-non

In [3]:
!pip install huggingface_hub[hf_xet]

Collecting hf-xet>=0.1.4 (from huggingface_hub[hf_xet])
  Downloading hf_xet-1.0.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (494 bytes)
Downloading hf_xet-1.0.2-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (53.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m53.8/53.8 MB[0m [31m10.4 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: hf-xet
Successfully installed hf-xet-1.0.2


## Initialization Code

In [4]:
import torch
import torch.nn.functional as F
import numpy as np
from typing import List, Dict, Tuple, Optional, Union, Any
from dataclasses import dataclass
from transformers import AutoModelForCausalLM, AutoTokenizer

@dataclass
class Beam:
    """Class to represent a beam in the k-beams search."""
    sequence: List[int]  # Token IDs in the sequence
    score: float         # Cumulative log probability score
    window_size: int     # Current window size before dropping

    def __str__(self):
        return f"Sequence: {self.sequence}, Score: {self.score}, Window: {self.window_size}"

class CustomKBeamSearch:
    def __init__(
        self,
        model: Any,
        tokenizer: Any,
        k: int = 5,
        initial_window_size: int = 3,
        max_length: int = 20,
        device: str = "cuda" if torch.cuda.is_available() else "cpu",
    ):
        """
        Initialize the custom k-beams search.

        Args:
            model: Any model that can generate logits for the next token
            tokenizer: Tokenizer for encoding/decoding
            k: Beam size
            initial_window_size: Window size before dropping a beam not in top-k
            max_length: Maximum length of generated sequences
            device: Device to run the model on
        """
        self.model = model.to(device)
        self.tokenizer = tokenizer
        self.k = k
        self.initial_window_size = initial_window_size
        self.max_length = max_length
        self.device = device

    def _get_next_token_probabilities(self, input_ids: torch.Tensor) -> torch.Tensor:
        """
        Get probabilities for the next token.

        Args:
            input_ids: Input token IDs [batch_size, seq_len]

        Returns:
            Log probabilities for next token [batch_size, vocab_size]
        """
        with torch.no_grad():

            outputs = self.model(input_ids)

            # Handle different model output formats
            if hasattr(outputs, "logits"):
                # Take last token's logits
                logits = outputs.logits[:, -1, :]
            elif isinstance(outputs, tuple) and hasattr(outputs[0], "logits"):
                logits = outputs[0].logits[:, -1, :]
            else:
                # Assuming the model directly returns logits
                logits = outputs[:, -1, :]

            # Convert to log probabilities
            log_probs = F.log_softmax(logits, dim=-1)

        return log_probs

    def generate(
        self,
        input_ids: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        **model_kwargs
    ) -> List[int]:
        """
        Generate a sequence using custom k-beams search.

        Args:
            input_ids: Input token IDs [batch_size=1, seq_len]
            attention_mask: Attention mask [batch_size=1, seq_len]
            model_kwargs: Additional arguments for the model

        Returns:
            Generated token IDs
        """
        # Initialize beams with the input sequence [seq_len]
        input_seq = input_ids[0].tolist()

        # Start with a single beam containing the input sequence
        beams = [Beam(sequence=input_seq, score=0.0, window_size=self.initial_window_size)]

        # Generate tokens up to max_length
        for _ in range(self.max_length):
            if not beams:
                # If all beams were dropped
                break

            # Get all current sequences from beams
            sequences = [beam.sequence for beam in beams]
            scores = [beam.score for beam in beams]
            window_sizes = [beam.window_size for beam in beams]

            # Prepare input for model
            batch_input_ids = torch.tensor([seq for seq in sequences], device=self.device)

            # Get next token probabilities for all beams [batch_size, vocab_size]
            next_token_log_probs = self._get_next_token_probabilities(batch_input_ids)

            # Flatten beams and token probabilities for selection
            all_next_tokens = []
            all_scores = []
            all_beam_indices = []

            # For each beam, get top-k next tokens
            for i, (beam_log_probs, beam_score) in enumerate(zip(next_token_log_probs, scores)):
                # Get top 2k tokens to ensure diversity (we'll filter to k later)
                # This gives us a chance to see tokens that might not be in the global top-k
                topk_log_probs, topk_indices = torch.topk(beam_log_probs, min(2 * self.k, beam_log_probs.size(-1)))

                for token_idx, token_log_prob in zip(topk_indices.tolist(), topk_log_probs.tolist()):
                    all_next_tokens.append(token_idx)
                    all_scores.append(beam_score + token_log_prob)
                    all_beam_indices.append(i)

            # Get top-k candidates among all beams
            if len(all_scores) <= self.k:
                top_indices = list(range(len(all_scores)))
            else:
                top_scores = torch.tensor(all_scores, device=self.device)
                _, top_indices = torch.topk(top_scores, min(self.k, len(all_scores)))
                top_indices = top_indices.tolist()

            # Track which beams are in top-k
            beams_in_topk = set()
            for idx in top_indices:
                beams_in_topk.add(all_beam_indices[idx])

            # Create new candidate beams
            new_beams = []
            processed_beam_indices = set()

            # First, process beams that are in top-k
            for idx in top_indices:
                beam_idx = all_beam_indices[idx]
                token_idx = all_next_tokens[idx]
                score = all_scores[idx]

                processed_beam_indices.add(beam_idx)

                # Create a new beam with the token appended
                new_sequence = sequences[beam_idx] + [token_idx]
                new_beams.append(Beam(
                    sequence=new_sequence,
                    score=score,
                    window_size=self.initial_window_size  # Reset window for beams in top-k
                ))

            # Process beams that are not in top-k but still have window remaining
            for i, beam in enumerate(beams):
                if i not in beams_in_topk and i not in processed_beam_indices:
                    # Reduce window size for this beam
                    new_window_size = window_sizes[i] - 1

                    if new_window_size > 0:
                        # Keep this beam but with reduced window
                        # We'll use the best token from this beam even though it's not in global top-k
                        token_log_probs = next_token_log_probs[i]
                        token_score, token_idx = torch.max(token_log_probs, dim=-1)

                        new_sequence = sequences[i] + [token_idx.item()]
                        new_beams.append(Beam(
                            sequence=new_sequence,
                            score=scores[i] + token_score.item(),
                            window_size=new_window_size
                        ))

            # Update beams
            beams = new_beams

            # Remove extinguished beams
            beams = [beam for beam in beams if beam.window_size > 0]

            # Check if all beams end with EOS token
            if self.tokenizer.eos_token_id is not None:
                if all(beam.sequence[-1] == self.tokenizer.eos_token_id for beam in beams):
                    break

        # Return the best beam's sequence
        if beams:
            best_beam = max(beams, key=lambda x: x.score)
            return best_beam.sequence
        else:
            return input_seq  # Return input if all beams were dropped


# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Define search parameters
beam_search = CustomKBeamSearch(
    model=model,
    tokenizer=tokenizer,
    k=5,
    initial_window_size=3,
    max_length=20
)

# Prepare input
prompt = "Once upon a time"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Generate sequence
output_ids = beam_search.generate(input_ids.to(model.device))

# Decode output
output_text = tokenizer.decode(output_ids, skip_special_tokens=True)
print(f"Input: {prompt}")
print(f"Output: {output_text}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Input: Once upon a time
Output: Once upon a time there was a great deal of confusion about the meaning of the word, and the meaning of the word


## Bleu Score

In [5]:
import torch
from datasets import load_dataset
from sacrebleu import corpus_bleu
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm import tqdm

def evaluate_kbeams(models: list, num_examples: int = 4000, prompt_length: int = 10, max_gen_length: int = 50):
    """Evaluate custom k-beams search on multiple models using WikiText-103."""
    # Load dataset
    dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', split='test')
    results = {}

    for model_name in models:
        # Load model and tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name).to(torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"))

        # Initialize your custom beam search
        generator = CustomKBeamSearch(
            model=model,
            tokenizer=tokenizer,
            k=5,
            initial_window_size=3,
            max_length=max_gen_length
        )

        # Generate hypotheses and collect references
        references = []
        hypotheses = []

        for example in tqdm(dataset.select(range(num_examples)), desc=f"Evaluating {model_name}"):
            text = example['text'].strip()
            if not text:
                continue

            # Tokenize and split into prompt/reference
            tokens = tokenizer.encode(text, return_tensors='pt')[0]
            if len(tokens) < prompt_length + max_gen_length:
                continue

            prompt_tokens = tokens[:prompt_length]
            reference_tokens = tokens[prompt_length:prompt_length + max_gen_length]

            # Generate continuation
            generated = generator.generate(prompt_tokens.unsqueeze(0).to(model.device))
            generated_continuation = generated[len(prompt_tokens):]  # Remove prompt

            # Decode texts
            reference_text = tokenizer.decode(reference_tokens, skip_special_tokens=True)
            generated_text = tokenizer.decode(generated_continuation, skip_special_tokens=True)

            references.append(reference_text)
            hypotheses.append(generated_text)

        # Calculate BLEU score
        bleu_score = corpus_bleu(hypotheses, [references], force=True)
        results[model_name] = bleu_score.score
        print(f"{model_name} BLEU: {bleu_score.score:.2f}")
        print(f"Final Reference Sentence: {reference_text}")
        print(f"Final Generated Sentence: {generated_text}")

    return results

# Run evaluation
models_to_test = [
    'distilgpt2',
    'gpt2',
    'openai-community/gpt2-medium'
]

evaluation_results = evaluate_kbeams(
    models=models_to_test,
    num_examples=2000,
    prompt_length=10,
    max_gen_length=50
)

print("\nFinal Results:")
for model, score in evaluation_results.items():
    print(f"{model}: {score:.2f} BLEU")

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/733k [00:00<?, ?B/s]

train-00000-of-00002.parquet:   0%|          | 0.00/157M [00:00<?, ?B/s]

train-00001-of-00002.parquet:   0%|          | 0.00/157M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/657k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/1801350 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/762 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/353M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Evaluating distilgpt2: 100%|██████████| 2000/2000 [11:05<00:00,  3.01it/s]


distilgpt2 BLEU: 0.20
Final Reference Sentence: ling Lesnar 's career entitled Brock Lesnar : Here Comes the Pain . It was re @-@ released in 2012 as a three @-@ disc DVD and two @-@ disc Blu @-@ ray collector 's edition to tie
Final Generated Sentence: ling the history of WWE Home Video.












































Evaluating gpt2: 100%|██████████| 2000/2000 [14:39<00:00,  2.27it/s]


gpt2 BLEU: 0.92
Final Reference Sentence: ling Lesnar 's career entitled Brock Lesnar : Here Comes the Pain . It was re @-@ released in 2012 as a three @-@ disc DVD and two @-@ disc Blu @-@ ray collector 's edition to tie
Final Generated Sentence: ling the WWE World Heavyweight Championship match between Randy Orton and Randy Orton Jr.

, WWE Home Video released a DVD chronicling the WWE World Heavyweight Championship match between Randy Orton and Randy Orton Jr. In 2004 , WWE


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/718 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.52G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Evaluating openai-community/gpt2-medium: 100%|██████████| 2000/2000 [39:02<00:00,  1.17s/it]


openai-community/gpt2-medium BLEU: 1.23
Final Reference Sentence: ling Lesnar 's career entitled Brock Lesnar : Here Comes the Pain . It was re @-@ released in 2012 as a three @-@ disc DVD and two @-@ disc Blu @-@ ray collector 's edition to tie
Final Generated Sentence: ling the events leading up to WrestleMania XXVIII. The DVD features interviews with Vince McMahon, Vince McMahon Jr., and Vince McMahon Jr.'s brother, Vince McMahon.

, WWE Home Video released a DVD chronicling the events leading up

Final Results:
distilgpt2: 0.20 BLEU
gpt2: 0.92 BLEU
openai-community/gpt2-medium: 1.23 BLEU


## Rouge Score

In [None]:
import torch
from datasets import load_dataset
from rouge_score import rouge_scorer
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm import tqdm

def evaluate_kbeams(models: list, num_examples: int = 4000, prompt_length: int = 10, max_gen_length: int = 50):
    """Evaluate custom k-beams search on multiple models using WikiText-103."""
    # Load dataset
    dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', split='test')
    results = {}
    scores = []
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

    for model_name in models:
        # Load model and tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name).to(torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"))

        # Initialize your custom beam search
        generator = CustomKBeamSearch(
            model=model,
            tokenizer=tokenizer,
            k=5,
            initial_window_size=3,
            max_length=max_gen_length
        )

        # Generate hypotheses and collect references
        references = []
        hypotheses = []

        for example in tqdm(dataset.select(range(num_examples)), desc=f"Evaluating {model_name}"):
            text = example['text'].strip()
            if not text:
                continue

            # Tokenize and split into prompt/reference
            tokens = tokenizer.encode(text, return_tensors='pt')[0]
            if len(tokens) < prompt_length + max_gen_length:
                continue

            prompt_tokens = tokens[:prompt_length]
            reference_tokens = tokens[prompt_length:prompt_length + max_gen_length]

            # Generate continuation
            generated = generator.generate(prompt_tokens.unsqueeze(0).to(model.device))
            generated_continuation = generated[len(prompt_tokens):]  # Remove prompt

            # Decode texts
            reference_text = tokenizer.decode(reference_tokens, skip_special_tokens=True)
            generated_text = tokenizer.decode(generated_continuation, skip_special_tokens=True)

            references.append(reference_text)
            hypotheses.append(generated_text)
            scores.append(scorer.score(reference_text, generated_text)['rougeL'])

        # Calculate BLEU score
        # bleu_score = corpus_bleu(hypotheses, [references], force=True)
        results[model_name] = np.mean([s.fmeasure for s in scores])
        print(f"Final Reference Sentence: {reference_text}")
        print(f"Final Generated Sentence: {generated_text}")

    return results

# Run evaluation
models_to_test = [
    'distilgpt2',
    'gpt2',
    'openai-community/gpt2-medium'
]

evaluation_results = evaluate_kbeams(
    models=models_to_test,
    num_examples=2000,
    prompt_length=10,
    max_gen_length=50
)

print("\nFinal Results:")
for model, score in evaluation_results.items():
    print(f"{model}: {score:.2f} BLEU")

Evaluating distilgpt2: 100%|██████████| 2000/2000 [10:58<00:00,  3.04it/s]


Final Reference Sentence: ling Lesnar 's career entitled Brock Lesnar : Here Comes the Pain . It was re @-@ released in 2012 as a three @-@ disc DVD and two @-@ disc Blu @-@ ray collector 's edition to tie
Final Generated Sentence: ling the history of WWE Home Video.












































Evaluating gpt2: 100%|██████████| 2000/2000 [15:20<00:00,  2.17it/s]


Final Reference Sentence: ling Lesnar 's career entitled Brock Lesnar : Here Comes the Pain . It was re @-@ released in 2012 as a three @-@ disc DVD and two @-@ disc Blu @-@ ray collector 's edition to tie
Final Generated Sentence: ling the WWE World Heavyweight Championship match between Randy Orton and Randy Orton Jr.

, WWE Home Video released a DVD chronicling the WWE World Heavyweight Championship match between Randy Orton and Randy Orton Jr. In 2004 , WWE


Evaluating openai-community/gpt2-medium: 100%|██████████| 2000/2000 [41:07<00:00,  1.23s/it]

Final Reference Sentence: ling Lesnar 's career entitled Brock Lesnar : Here Comes the Pain . It was re @-@ released in 2012 as a three @-@ disc DVD and two @-@ disc Blu @-@ ray collector 's edition to tie
Final Generated Sentence: ling the events leading up to WrestleMania XXVIII. The DVD features interviews with Vince McMahon, Vince McMahon Jr., and Vince McMahon Jr.'s brother, Vince McMahon.

, WWE Home Video released a DVD chronicling the events leading up

Final Results:
distilgpt2: 0.07 BLEU
gpt2: 0.10 BLEU
openai-community/gpt2-medium: 0.11 BLEU





## BertScore

In [None]:
import torch
from bert_score import score as bert_score
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from tqdm import tqdm

def evaluate_kbeams(models: list, num_examples: int = 4000, prompt_length: int = 10, max_gen_length: int = 50):
    """Evaluate custom k-beams search on multiple models using WikiText-103."""
    # Load dataset
    dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', split='test')
    results = {}

    for model_name in models:
        # Load model and tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name).to(torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"))

        # Initialize your custom beam search
        generator = CustomKBeamSearch(
            model=model,
            tokenizer=tokenizer,
            k=5,
            initial_window_size=3,
            max_length=max_gen_length
        )

        # Generate hypotheses and collect references
        references = []
        hypotheses = []

        for example in tqdm(dataset.select(range(num_examples)), desc=f"Evaluating {model_name}"):
            text = example['text'].strip()
            if not text:
                continue

            # Tokenize and split into prompt/reference
            tokens = tokenizer.encode(text, return_tensors='pt')[0]
            if len(tokens) < prompt_length + max_gen_length:
                continue

            prompt_tokens = tokens[:prompt_length]
            reference_tokens = tokens[prompt_length:prompt_length + max_gen_length]

            # Generate continuation
            generated = generator.generate(prompt_tokens.unsqueeze(0).to(model.device))
            generated_continuation = generated[len(prompt_tokens):]  # Remove prompt

            # Decode texts
            reference_text = tokenizer.decode(reference_tokens, skip_special_tokens=True)
            generated_text = tokenizer.decode(generated_continuation, skip_special_tokens=True)

            references.append(reference_text)
            hypotheses.append(generated_text)

        # Calculate BLEU score
        _, _, F1 = bert_score(hypotheses, references, lang="en", model_type="roberta-large")
        results[model_name] = F1.mean().item()
        print(f"Final Reference Sentence: {reference_text}")
        print(f"Final Generated Sentence: {generated_text}")

    return results

# Run evaluation
models_to_test = [
    'distilgpt2',
    'gpt2',
    'openai-community/gpt2-medium'
]

evaluation_results = evaluate_kbeams(
    models=models_to_test,
    num_examples=2000,
    prompt_length=10,
    max_gen_length=50
)

print("\nFinal Results:")
for model, score in evaluation_results.items():
    print(f"{model}: {score:.2f} BertScore")

Evaluating distilgpt2: 100%|██████████| 2000/2000 [11:00<00:00,  3.03it/s]


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Final Reference Sentence: ling Lesnar 's career entitled Brock Lesnar : Here Comes the Pain . It was re @-@ released in 2012 as a three @-@ disc DVD and two @-@ disc Blu @-@ ray collector 's edition to tie
Final Generated Sentence: ling the history of WWE Home Video.












































Evaluating gpt2: 100%|██████████| 2000/2000 [16:52<00:00,  1.98it/s]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Final Reference Sentence: ling Lesnar 's career entitled Brock Lesnar : Here Comes the Pain . It was re @-@ released in 2012 as a three @-@ disc DVD and two @-@ disc Blu @-@ ray collector 's edition to tie
Final Generated Sentence: ling the WWE World Heavyweight Championship match between Randy Orton and Randy Orton Jr.

, WWE Home Video released a DVD chronicling the WWE World Heavyweight Championship match between Randy Orton and Randy Orton Jr. In 2004 , WWE


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/718 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/1.52G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Evaluating openai-community/gpt2-medium: 100%|██████████| 2000/2000 [42:08<00:00,  1.26s/it]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Final Reference Sentence: ling Lesnar 's career entitled Brock Lesnar : Here Comes the Pain . It was re @-@ released in 2012 as a three @-@ disc DVD and two @-@ disc Blu @-@ ray collector 's edition to tie
Final Generated Sentence: ling the events leading up to WrestleMania XXVIII. The DVD features interviews with Vince McMahon, Vince McMahon Jr., and Vince McMahon Jr.'s brother, Vince McMahon.

, WWE Home Video released a DVD chronicling the events leading up

Final Results:
distilgpt2: 0.79 BertScore
gpt2: 0.80 BertScore
openai-community/gpt2-medium: 0.81 BertScore


## Mauve

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import mauve
from tqdm import tqdm

def evaluate_kbeams(models: list, num_examples: int = 4000, prompt_length: int = 10, max_gen_length: int = 50):
    """Evaluate custom k-beams search on multiple models using WikiText-103."""
    # Load dataset
    dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', split='test')
    results = {}

    for model_name in models:
        # Load model and tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name).to(torch.device(
            "cuda" if torch.cuda.is_available() else "cpu"))

        # Initialize your custom beam search
        generator = CustomKBeamSearch(
            model=model,
            tokenizer=tokenizer,
            k=5,
            initial_window_size=3,
            max_length=max_gen_length
        )

        # Generate hypotheses and collect references
        references = []
        hypotheses = []

        for example in tqdm(dataset.select(range(num_examples)), desc=f"Evaluating {model_name}"):
            text = example['text'].strip()
            if not text:
                continue

            # Tokenize and split into prompt/reference
            tokens = tokenizer.encode(text, return_tensors='pt')[0]
            if len(tokens) < prompt_length + max_gen_length:
                continue

            prompt_tokens = tokens[:prompt_length]
            reference_tokens = tokens[prompt_length:prompt_length + max_gen_length]

            # Generate continuation
            generated = generator.generate(prompt_tokens.unsqueeze(0).to(model.device))
            generated_continuation = generated[len(prompt_tokens):]  # Remove prompt

            # Decode texts
            reference_text = tokenizer.decode(reference_tokens, skip_special_tokens=True)
            generated_text = tokenizer.decode(generated_continuation, skip_special_tokens=True)

            references.append(reference_text)
            hypotheses.append(generated_text)

        # Calculate BLEU score
        score = mauve.compute_mauve(
            p_text=hypotheses,
            q_text=references,
            device_id=0,
            max_text_length=256,
            verbose=False,
            batch_size=16
        ).mauve
        results[model_name] = score
        print(f"Final Reference Sentence: {reference_text}")
        print(f"Final Generated Sentence: {generated_text}")

    return results

# Run evaluation
models_to_test = [
    'distilgpt2',
    'gpt2',
    'openai-community/gpt2-medium'
]

evaluation_results = evaluate_kbeams(
    models=models_to_test,
    num_examples=2000,
    prompt_length=10,
    max_gen_length=50
)

print("\nFinal Results:")
for model, score in evaluation_results.items():
    print(f"{model}: {score:.2f} Mauve")

Evaluating distilgpt2: 100%|██████████| 2000/2000 [10:52<00:00,  3.06it/s]


Featurizing p:   0%|          | 0/48 [00:00<?, ?it/s]

Featurizing q:   0%|          | 0/48 [00:00<?, ?it/s]

Final Reference Sentence: ling Lesnar 's career entitled Brock Lesnar : Here Comes the Pain . It was re @-@ released in 2012 as a three @-@ disc DVD and two @-@ disc Blu @-@ ray collector 's edition to tie
Final Generated Sentence: ling the history of WWE Home Video.












































Evaluating gpt2: 100%|██████████| 2000/2000 [15:33<00:00,  2.14it/s]


Featurizing p:   0%|          | 0/48 [00:00<?, ?it/s]

Featurizing q:   0%|          | 0/48 [00:00<?, ?it/s]

Final Reference Sentence: ling Lesnar 's career entitled Brock Lesnar : Here Comes the Pain . It was re @-@ released in 2012 as a three @-@ disc DVD and two @-@ disc Blu @-@ ray collector 's edition to tie
Final Generated Sentence: ling the WWE World Heavyweight Championship match between Randy Orton and Randy Orton Jr.

, WWE Home Video released a DVD chronicling the WWE World Heavyweight Championship match between Randy Orton and Randy Orton Jr. In 2004 , WWE


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/718 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/1.52G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Evaluating openai-community/gpt2-medium: 100%|██████████| 2000/2000 [41:28<00:00,  1.24s/it]


Featurizing p:   0%|          | 0/48 [00:00<?, ?it/s]

Featurizing q:   0%|          | 0/48 [00:00<?, ?it/s]

Final Reference Sentence: ling Lesnar 's career entitled Brock Lesnar : Here Comes the Pain . It was re @-@ released in 2012 as a three @-@ disc DVD and two @-@ disc Blu @-@ ray collector 's edition to tie
Final Generated Sentence: ling the events leading up to WrestleMania XXVIII. The DVD features interviews with Vince McMahon, Vince McMahon Jr., and Vince McMahon Jr.'s brother, Vince McMahon.

, WWE Home Video released a DVD chronicling the events leading up

Final Results:
distilgpt2: 0.01 Mauve
gpt2: 0.13 Mauve
openai-community/gpt2-medium: 0.28 Mauve


# Original K-Beam Testing

## Initalization

In [None]:
import torch
import time
from typing import List, Dict, Any, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer

class StandardKBeamSearch:
    def __init__(
        self,
        model_name: str = "distilgpt2",
        k: int = 5,
        max_length: int = 20,
        device: str = "cuda" if torch.cuda.is_available() else "cpu",
    ):
        """
        Initialize standard beam search

        Args:
            model_name: Name of the model to use
            k: Beam size
            max_length: Maximum length of generated sequences
            device: Device to run the model on
        """
        self.model_name = model_name
        self.k = k
        self.max_length = max_length
        self.device = device

        # Load model and tokenizer
        print(f"Loading {model_name}...")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

        # Add padding token if it doesn't exist
        if self.tokenizer.pad_token_id is None:
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

    def generate(
        self,
        input_ids: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        num_return_sequences: int = 1,
        early_stopping: bool = True,
        output_scores: bool = False,
        return_dict_in_generate: bool = False,
    ) -> Any:
        """
        Generate sequences using standard beam search.

        Args:
            input_ids: Input token IDs
            attention_mask: Attention mask
            num_return_sequences: Number of sequences to return (must be <= k)
            early_stopping: Whether to stop when all beams are finished
            output_scores: Whether to output scores
            return_dict_in_generate: Whether to return a dict with additional info

        Returns:
            Generated token IDs or dict with generated tokens and scores
        """
        # Move inputs to device
        input_ids = input_ids.to(self.device)
        if attention_mask is not None:
            attention_mask = attention_mask.to(self.device)

        # Number of sequences to return cannot be greater than beam size
        num_return_sequences = min(num_return_sequences, self.k)

        # Generate using standard beam search
        outputs = self.model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_length=input_ids.shape[1] + self.max_length,
            num_beams=self.k,
            num_return_sequences=num_return_sequences,
            early_stopping=early_stopping,
            output_scores=output_scores,
            return_dict_in_generate=return_dict_in_generate,
            use_cache=True,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        return outputs

    def decode(self, output_ids: torch.Tensor) -> List[str]:
        """
        Decode output token IDs to strings.

        Args:
            output_ids: Output token IDs [num_sequences, seq_len]

        Returns:
            List of decoded strings
        """
        return [self.tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids]


models = [
    "distilgpt2",
    "gpt2",
]

prompts = [
    "Once upon a time",
    "The meaning of life is",
    "In the future, AI will",
]

for model in models:
    print(f"\n=== Testing {model} ===")
    beam_search = StandardKBeamSearch(model_name=model, k=5, max_length=20)

    for prompt in prompts:
        print(f"\nPrompt: {prompt}")
        input_ids = beam_search.tokenizer(prompt, return_tensors="pt").input_ids

        # Regular generation
        outputs = beam_search.generate(input_ids)
        decoded = beam_search.decode(outputs)
        print(f"Output: {decoded[0]}")

        # Generate with scores
        outputs_with_scores = beam_search.generate(
            input_ids,
            num_return_sequences=1,
            output_scores=True,
            return_dict_in_generate=True
        )

        # Extract sequences
        sequences = outputs_with_scores.sequences
        decoded_sequences = beam_search.decode(sequences)

        # Print sequences
        print("\nSequences:")
        for i, text in enumerate(decoded_sequences):
            print(f"  {i+1}. {text}")

Loading distilgpt2...

Prompt: Once upon a time

=== Standard Beam Search ===
Loading distilgpt2...
Output: Once upon a time when the world was in a state of flux, the world was in a state of flux, the
Time: 0.3639 seconds

=== Custom K-Beams Search ===
Input: Once upon a time
Output: Once upon a time when the world was in a state of flux, the world was in a state of flux, the
Time: 1.5691 seconds


## Bleu Score

In [None]:
import torch
from datasets import load_dataset
from sacrebleu import corpus_bleu
from tqdm import tqdm

def evaluate_standard_beams(models: list, num_examples: int = 100, prompt_length: int = 10, max_gen_length: int = 20):
    """Evaluate standard beam search on multiple models using WikiText-103."""
    # Load dataset
    dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', split='test')
    results = {}

    for model_name in models:
        # Initialize standard beam search
        generator = StandardKBeamSearch(
            model_name=model_name,
            k=5,
            max_length=max_gen_length,
            device="cuda" if torch.cuda.is_available() else "cpu"
        )

        # Generate hypotheses and collect references
        references = []
        hypotheses = []

        for example in tqdm(dataset.select(range(num_examples)), desc=f"Evaluating {model_name}"):
            text = example['text'].strip()
            if not text:
                continue

            # Tokenize and split into prompt/reference
            inputs = generator.tokenizer(text, return_tensors='pt')
            tokens = inputs.input_ids[0]
            attention_mask = inputs.attention_mask[0] if 'attention_mask' in inputs else None

            if len(tokens) < prompt_length + max_gen_length:
                continue

            prompt_tokens = tokens[:prompt_length]
            reference_tokens = tokens[prompt_length:prompt_length + max_gen_length]
            attention_mask = attention_mask[:prompt_length] if attention_mask is not None else None

            # Generate continuation sequence
            input_ids = prompt_tokens.unsqueeze(0).to(generator.device)
            attention_mask = attention_mask.unsqueeze(0).to(generator.device) if attention_mask is not None else None
            generated_output = generator.generate(input_ids, attention_mask, return_dict_in_generate=False)

            # Remove prompt tokens from genereated sequence
            generated_continuation = generated_output[0][len(prompt_tokens):]

            # Decode texts
            reference_text = generator.tokenizer.decode(reference_tokens, skip_special_tokens=True)
            generated_text = generator.tokenizer.decode(generated_continuation, skip_special_tokens=True)

            references.append(reference_text)
            hypotheses.append(generated_text)

        # Calculate BLEU score
        bleu_score = corpus_bleu(hypotheses, [references], force=True)
        results[model_name] = bleu_score.score
        print(f"{model_name} BLEU: {bleu_score.score:.2f}")

    return results

# Run evaluation
models_to_test = [
    'distilgpt2',
    'gpt2',
    'openai-community/gpt2-medium'
]

evaluation_results = evaluate_standard_beams(
    models=models_to_test,
    num_examples=2000,
    prompt_length=10,
    max_gen_length=50
)

print("\nStandard Beam Search Results:")
for model, score in evaluation_results.items():
    print(f"{model}: {score:.2f} BLEU")

Loading distilgpt2...


Evaluating distilgpt2: 100%|██████████| 2000/2000 [02:57<00:00, 11.26it/s]


distilgpt2 BLEU: 0.24
Loading gpt2...


Evaluating gpt2: 100%|██████████| 2000/2000 [07:25<00:00,  4.49it/s]


gpt2 BLEU: 1.03
Loading openai-community/gpt2-medium...


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/718 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/1.52G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Evaluating openai-community/gpt2-medium: 100%|██████████| 2000/2000 [13:22<00:00,  2.49it/s]


openai-community/gpt2-medium BLEU: 1.29

Standard Beam Search Results:
distilgpt2: 0.24 BLEU
gpt2: 1.03 BLEU
openai-community/gpt2-medium: 1.29 BLEU


## Rouge L Score

In [None]:
import torch
from datasets import load_dataset
import numpy as np
from rouge_score import rouge_scorer
from tqdm import tqdm

def evaluate_standard_beams(models: list, num_examples: int = 100, prompt_length: int = 10, max_gen_length: int = 20):
    """Evaluate standard beam search on multiple models using WikiText-103."""
    # Load dataset
    dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', split='test')
    results = {}
    scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
    scores = []

    for model_name in models:
        # Initialize standard beam search
        generator = StandardKBeamSearch(
            model_name=model_name,
            k=5,
            max_length=max_gen_length,
            device="cuda" if torch.cuda.is_available() else "cpu"
        )

        # Generate hypotheses and collect references
        references = []
        hypotheses = []

        for example in tqdm(dataset.select(range(num_examples)), desc=f"Evaluating {model_name}"):
            text = example['text'].strip()
            if not text:
                continue

            # Tokenize and split into prompt/reference
            inputs = generator.tokenizer(text, return_tensors='pt')
            tokens = inputs.input_ids[0]
            attention_mask = inputs.attention_mask[0] if 'attention_mask' in inputs else None

            if len(tokens) < prompt_length + max_gen_length:
                continue

            prompt_tokens = tokens[:prompt_length]
            reference_tokens = tokens[prompt_length:prompt_length + max_gen_length]
            attention_mask = attention_mask[:prompt_length] if attention_mask is not None else None

            # Generate continuation sequence
            input_ids = prompt_tokens.unsqueeze(0).to(generator.device)
            attention_mask = attention_mask.unsqueeze(0).to(generator.device) if attention_mask is not None else None
            generated_output = generator.generate(input_ids, attention_mask, return_dict_in_generate=False)

            # Remove prompt tokens from genereated sequence
            generated_continuation = generated_output[0][len(prompt_tokens):]

            # Decode texts
            reference_text = generator.tokenizer.decode(reference_tokens, skip_special_tokens=True)
            generated_text = generator.tokenizer.decode(generated_continuation, skip_special_tokens=True)

            references.append(reference_text)
            hypotheses.append(generated_text)

            # Compute rouge score for this seqeunce
            scores.append(scorer.score(reference_text, generated_text)['rougeL'])

        # Compute mean and set it for this model
        results[model_name] = np.mean([s.fmeasure for s in scores])
    return results

# Run evaluation
models_to_test = [
    'distilgpt2',
    'gpt2',
    'openai-community/gpt2-medium'
]

evaluation_results = evaluate_standard_beams(
    models=models_to_test,
    num_examples=2000,
    prompt_length=10,
    max_gen_length=50
)

print("\nStandard Beam Search Results:")
for model, score in evaluation_results.items():
    print(f"{model}: {score:.2f} RougeL Score")

Loading distilgpt2...


Evaluating distilgpt2: 100%|██████████| 2000/2000 [02:59<00:00, 11.17it/s]


Loading gpt2...


Evaluating gpt2: 100%|██████████| 2000/2000 [07:28<00:00,  4.46it/s]


Loading openai-community/gpt2-medium...


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/718 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/1.52G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Evaluating openai-community/gpt2-medium: 100%|██████████| 2000/2000 [13:30<00:00,  2.47it/s]


Standard Beam Search Results:
distilgpt2: 0.08 RougeL Score
gpt2: 0.10 RougeL Score
openai-community/gpt2-medium: 0.12 RougeL Score





## BERTScore

In [None]:
import torch
from datasets import load_dataset
from bert_score import score as bert_score
from tqdm import tqdm

def evaluate_standard_beams(models: list, num_examples: int = 100, prompt_length: int = 10, max_gen_length: int = 20):
    """Evaluate standard beam search on multiple models using WikiText-103."""
    # Load dataset
    dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', split='test')
    results = {}
    scores = []

    for model_name in models:
        # Initialize standard beam search
        generator = StandardKBeamSearch(
            model_name=model_name,
            k=5,
            max_length=max_gen_length,
            device="cuda" if torch.cuda.is_available() else "cpu"
        )

        # Generate hypotheses and collect references
        references = []
        hypotheses = []

        for example in tqdm(dataset.select(range(num_examples)), desc=f"Evaluating {model_name}"):
            text = example['text'].strip()
            if not text:
                continue

            # Tokenize and split into prompt/reference
            inputs = generator.tokenizer(text, return_tensors='pt')
            tokens = inputs.input_ids[0]
            attention_mask = inputs.attention_mask[0] if 'attention_mask' in inputs else None

            if len(tokens) < prompt_length + max_gen_length:
                continue

            prompt_tokens = tokens[:prompt_length]
            reference_tokens = tokens[prompt_length:prompt_length + max_gen_length]
            attention_mask = attention_mask[:prompt_length] if attention_mask is not None else None

            # Generate continuation seqeunce
            input_ids = prompt_tokens.unsqueeze(0).to(generator.device)
            attention_mask = attention_mask.unsqueeze(0).to(generator.device) if attention_mask is not None else None
            generated_output = generator.generate(input_ids, attention_mask, return_dict_in_generate=False)

            # Remove prompt tokens from genereated sequence
            generated_continuation = generated_output[0][len(prompt_tokens):]

            # Decode texts
            reference_text = generator.tokenizer.decode(reference_tokens, skip_special_tokens=True)
            generated_text = generator.tokenizer.decode(generated_continuation, skip_special_tokens=True)

            references.append(reference_text)
            hypotheses.append(generated_text)

        # Calculate Bert score
        _, _, F1 = bert_score(hypotheses, references, lang="en", model_type="roberta-large")
        results[model_name] = F1.mean().item()
    return results

# Run evaluation
models_to_test = [
    'distilgpt2',
    'gpt2',
    'openai-community/gpt2-medium'
]

evaluation_results = evaluate_standard_beams(
    models=models_to_test,
    num_examples=2000,
    prompt_length=10,
    max_gen_length=50
)

print("\nStandard Beam Search Results:")
for model, score in evaluation_results.items():
    print(f"{model}: {score:.2f} BertScore")

Loading distilgpt2...


Evaluating distilgpt2: 100%|██████████| 2000/2000 [02:57<00:00, 11.28it/s]


tokenizer_config.json:   0%|          | 0.00/25.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/482 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/1.42G [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading gpt2...


Evaluating gpt2: 100%|██████████| 2000/2000 [07:41<00:00,  4.33it/s]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Loading openai-community/gpt2-medium...


Evaluating openai-community/gpt2-medium: 100%|██████████| 2000/2000 [13:52<00:00,  2.40it/s]
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Standard Beam Search Results:
distilgpt2: 0.80 BertScore
gpt2: 0.80 BertScore
openai-community/gpt2-medium: 0.81 BertScore


## Mauve Score

In [None]:
import torch
from datasets import load_dataset
import mauve
from tqdm import tqdm

def evaluate_standard_beams(models: list, num_examples: int = 100, prompt_length: int = 10, max_gen_length: int = 20):
    """Evaluate standard beam search on multiple models using WikiText-103."""
    # Load dataset
    dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', split='test')
    results = {}
    scores = []

    for model_name in models:
        # Initialize standard beam search
        generator = StandardKBeamSearch(
            model_name=model_name,
            k=5,
            max_length=max_gen_length,
            device="cuda" if torch.cuda.is_available() else "cpu"
        )

        # Generate hypotheses and collect references
        references = []
        hypotheses = []

        for example in tqdm(dataset.select(range(num_examples)), desc=f"Evaluating {model_name}"):
            text = example['text'].strip()
            if not text:
                continue

            # Tokenize and split into prompt/reference
            inputs = generator.tokenizer(text, return_tensors='pt')
            tokens = inputs.input_ids[0]
            attention_mask = inputs.attention_mask[0] if 'attention_mask' in inputs else None

            if len(tokens) < prompt_length + max_gen_length:
                continue

            prompt_tokens = tokens[:prompt_length]
            reference_tokens = tokens[prompt_length:prompt_length + max_gen_length]
            attention_mask = attention_mask[:prompt_length] if attention_mask is not None else None

            # Generate continuation sequence
            input_ids = prompt_tokens.unsqueeze(0).to(generator.device)
            attention_mask = attention_mask.unsqueeze(0).to(generator.device) if attention_mask is not None else None
            generated_output = generator.generate(input_ids, attention_mask, return_dict_in_generate=False)

            # Remove prompt tokens from genereated sequence
            generated_continuation = generated_output[0][len(prompt_tokens):]

            # Decode texts
            reference_text = generator.tokenizer.decode(reference_tokens, skip_special_tokens=True)
            generated_text = generator.tokenizer.decode(generated_continuation, skip_special_tokens=True)

            references.append(reference_text)
            hypotheses.append(generated_text)

        # Calculate Mauve score
        score = mauve.compute_mauve(
            p_text=hypotheses,
            q_text=references,
            device_id=0,
            max_text_length=256,
            verbose=False,
            batch_size=16
        ).mauve
        results[model_name] = score
    return results

# Run evaluation
models_to_test = [
    'distilgpt2',
    'gpt2',
    'openai-community/gpt2-medium'
]

evaluation_results = evaluate_standard_beams(
    models=models_to_test,
    num_examples=2000,
    prompt_length=10,
    max_gen_length=50
)

print("\nStandard Beam Search Results:")
for model, score in evaluation_results.items():
    print(f"{model}: {score:.2f} Mauve Score")

Loading distilgpt2...


Evaluating distilgpt2: 100%|██████████| 2000/2000 [02:56<00:00, 11.31it/s]


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/666 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/3.25G [00:00<?, ?B/s]

Featurizing p:   0%|          | 0/48 [00:00<?, ?it/s]

Featurizing q:   0%|          | 0/48 [00:00<?, ?it/s]

Loading gpt2...


Evaluating gpt2: 100%|██████████| 2000/2000 [07:27<00:00,  4.47it/s]


Featurizing p:   0%|          | 0/48 [00:00<?, ?it/s]

Featurizing q:   0%|          | 0/48 [00:00<?, ?it/s]

Loading openai-community/gpt2-medium...


Evaluating openai-community/gpt2-medium: 100%|██████████| 2000/2000 [13:28<00:00,  2.47it/s]


Featurizing p:   0%|          | 0/48 [00:00<?, ?it/s]

Featurizing q:   0%|          | 0/48 [00:00<?, ?it/s]


Standard Beam Search Results:
distilgpt2: 0.01 Mauve Score
gpt2: 0.16 Mauve Score
openai-community/gpt2-medium: 0.24 Mauve Score


# Testing different Window Size

## Pre-trained GPT-Medium model

In [None]:
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import mauve
from tqdm import tqdm

def evaluate_kbeams(model_name, window_sizes: list, num_examples: int = 4000, prompt_length: int = 10, max_gen_length: int = 50):
    """Evaluate custom k-beams search on multiple models using WikiText-103."""
    # Load dataset
    dataset = load_dataset('wikitext', 'wikitext-103-raw-v1', split='test')
    results = {}

    # Load model and tokenizer outside the loop as we are not training
    print(f"Loading Model {model_name}...")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).to(torch.device(
        "cuda" if torch.cuda.is_available() else "cpu"))
    print(f"Model Loaded. Starting Computations...")
    for window_size in window_sizes:

        # Initialize custom beam search
        generator = CustomKBeamSearch(
            model=model,
            tokenizer=tokenizer,
            k=5,
            initial_window_size=window_size,
            max_length=max_gen_length
        )

        # Generate hypotheses and collect references
        references = []
        hypotheses = []

        for example in tqdm(dataset.select(range(num_examples)), desc=f"Evaluating Window Size: {window_size}"):
            text = example['text'].strip()
            if not text:
                continue

            # Tokenize and split into prompt/reference
            tokens = tokenizer.encode(text, return_tensors='pt')[0]
            if len(tokens) < prompt_length + max_gen_length:
                continue

            prompt_tokens = tokens[:prompt_length]
            reference_tokens = tokens[prompt_length:prompt_length + max_gen_length]

            # Generate continuation seqeuence
            generated = generator.generate(prompt_tokens.unsqueeze(0).to(model.device))
            # Remove prompt tokens from generated text
            generated_continuation = generated[len(prompt_tokens):]

            # Decode texts
            reference_text = tokenizer.decode(reference_tokens, skip_special_tokens=True)
            generated_text = tokenizer.decode(generated_continuation, skip_special_tokens=True)

            references.append(reference_text)
            hypotheses.append(generated_text)

        # Calculate Mauve score
        score = mauve.compute_mauve(
            p_text=hypotheses,
            q_text=references,
            device_id=0,
            max_text_length=256,
            verbose=False,
            batch_size=16
        ).mauve
        print(f"Score for Window size {window_size} is {score}.")
        results[window_size] = score

    return results

# Run evaluation
window_sizes = [
    2,
    4,
    5,
    6
]

evaluation_results = evaluate_kbeams(
    model_name='openai-community/gpt2-medium',
    num_examples=2000,
    prompt_length=10,
    max_gen_length=50,
    window_sizes=window_sizes
)

print("\nFinal Results:")
for model, score in evaluation_results.items():
    print(f"{model}: {score:.2f} Mauve")

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/733k [00:00<?, ?B/s]

train-00000-of-00002.parquet:   0%|          | 0.00/157M [00:00<?, ?B/s]

train-00001-of-00002.parquet:   0%|          | 0.00/157M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/657k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/1801350 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

Loading Model openai-community/gpt2-medium...


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/718 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/1.52G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

Model Loaded. Starting Computations...


Evaluating Window Size: 2: 100%|██████████| 2000/2000 [37:15<00:00,  1.12s/it]


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/666 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/3.25G [00:00<?, ?B/s]

Featurizing p:   0%|          | 0/48 [00:00<?, ?it/s]

Featurizing q:   0%|          | 0/48 [00:00<?, ?it/s]

Score for Window size 2 is 0.21936078374996132.


Evaluating Window Size: 4: 100%|██████████| 2000/2000 [49:29<00:00,  1.48s/it]


Featurizing p:   0%|          | 0/48 [00:00<?, ?it/s]

Featurizing q:   0%|          | 0/48 [00:00<?, ?it/s]

Score for Window size 4 is 0.2603922832563385.


Evaluating Window Size: 5:  78%|███████▊  | 1557/2000 [43:42<12:17,  1.66s/it]

We stopped testing over with window size of 4 as the results were sufficient for making a conclusion in our opinion.