
### **First Trial: Baseline Model Performance**
- **Dataset: Wikitext**
  - **Throughput**: 112.64 tokens/sample
  - **Perplexity**: 140.06
  
- **Dataset: GSM8K**
  - **Throughput**: 56.85 tokens/sample
  - **Perplexity**: 30.88
  - **BLEU Score**: 0.0096
  - **ROUGE Scores**:
    - **ROUGE-1**: 0.1386
    - **ROUGE-2**: 0.0493
    - **ROUGE-L**: 0.0922

### **Second Trial: After Model Pruning (Optimization for Throughput)**
- **Dataset: Wikitext**
  - **Throughput**: 112.64 tokens/sample (no change)
  - **Perplexity**: 191.90 (increased)
  
- **Dataset: GSM8K**
  - **Throughput**: 56.85 tokens/sample (no change)
  - **Perplexity**: 70.56 (increased)

### **Third Trial: Fine-Tuning the Model (Pending)**
- The third trial could not be completed due to Colab memory limitations, preventing me from observing the effects of fine-tuning on performance. However, the code exists and is ready for execution in a more capable environment.

---

### **Analysis of Findings**:

#### **1. Throughput**
- **Wikitext**: Throughput remained stable at **112.64 tokens/sample** across both trials, indicating no major computational gains from the pruning optimization in this regard.
- **GSM8K**: Similar to Wikitext, throughput remained the same at **56.85 tokens/sample**, which suggests that model pruning did not significantly enhance processing speed in either dataset.

#### **2. Perplexity**
- **Wikitext**:
  - Perplexity **increased** from **140.06** to **191.90** after pruning, indicating that the model's performance in predicting the next token worsened.
- **GSM8K**:
  - Perplexity **increased significantly**, from **30.88** to **70.56**, implying that the pruning optimization led to reduced model accuracy on this dataset as well.

#### **3. BLEU and ROUGE Scores (GSM8K Dataset)**
- These metrics were reported only for the **first trial**:
  - The **BLEU score** was very low (**0.0096**), reflecting poor performance in terms of generated text accuracy.
  - The **ROUGE scores** showed modest performance, with **ROUGE-1** being **0.1386**, which suggests the model captures some surface-level overlaps in predictions but struggles with deeper structures (evidenced by the lower **ROUGE-2** of **0.0493**).

Model pruning often leads to reduced performance because it involves removing parts of the model, such as neurons or layers, to increase efficiency. While this boosts computational speed and reduces memory usage, it can cause a loss of important parameters, resulting in worse accuracy and higher perplexity.


---


- **Model pruning** did not bring the desired throughput improvements but instead caused **substantial degradation in perplexity**, especially on the GSM8K dataset. This suggests a poor trade-off between accuracy and computational efficiency.
- Fine-tuning, when possible, should be reattempted in a more capable environment to observe its impact, as it is likely to provide improvements in model quality metrics (perplexity, BLEU, and ROUGE) without necessarily compromising throughput.

To further improve results:
1. **More advanced optimizations** like quantization-aware training (QAT) or distillation, which may offer better balance between accuracy and computational performance.
2. **Alternative infrastructure** beyond Colab to support fine-tuning trials without running into memory issues.

# merged code

In [1]:
# Install necessary libraries
!pip install datasets transformers accelerate torch
!pip install -q sentencepiece

import torch
import torch.nn.utils.prune as prune
import torch.nn.functional as F
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments, AutoModelForCausalLM, GPT2Config
import matplotlib.pyplot as plt

# Step 1: Load an open-source language model (GPT-2)
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained(model_name)

# Ensure we're using the CPU
device = torch.device("cpu")
model.to(device)

# Step 2: Load the datasets (WikiText and GSM8K)
datasets = {"wikitext": load_dataset("wikitext", "wikitext-2-raw-v1"),
            "gsm8k": load_dataset("gsm8k", "main")}




Collecting datasets
  Downloading datasets-3.0.1-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.17-py310-none-any.whl.metadata (7.2 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.1-py3-none-any.whl (471 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m471.6/471.6 kB[0m [31m6.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m3.8 MB/s[0m eta [36m0:00:0

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/733k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/6.36M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/657k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

In [6]:
'''
# Function to benchmark model throughput
# Function to benchmark model throughput
def benchmark_model(model, dataset, tokenizer, device, max_samples=20):  # Reduced samples for faster CPU benchmarking
    model.eval()
    throughput = []
    for i, sample in enumerate(dataset['test']):
        if i >= max_samples:  # Limit to max_samples for faster benchmarking
            break

        # Check for empty input and skip if necessary
        text = sample['text'] if 'text' in sample else sample['question']
        if not text:  # Skip empty inputs
            print(f"Skipping empty input at index {i}")
            continue

        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

        # Check for empty tokenization result and skip if necessary
        if inputs['input_ids'].shape[1] == 0:
            print(f"Skipping empty tokenization result at index {i}")
            continue

        inputs = inputs.to(device)

        with torch.no_grad():
            model(**inputs)

        # Measure throughput based on input length
        throughput.append(inputs['input_ids'].shape[1])

    avg_throughput = sum(throughput) / len(throughput) if throughput else 0 # Handle case where throughput is empty
    return avg_throughput

# Step 3: Measure throughput on both datasets (WikiText and GSM8K)
initial_throughput = {}
for dataset_name, dataset in datasets.items():
    print(f"Benchmarking on {dataset_name} dataset...")
    initial_throughput[dataset_name] = benchmark_model(model, dataset, tokenizer, device)
    print(f"Throughput on {dataset_name}: {initial_throughput[dataset_name]} tokens/sample")
'''

Benchmarking on wikitext dataset...
Skipping empty input at index 0
Skipping empty input at index 2
Skipping empty input at index 5
Skipping empty input at index 7
Skipping empty input at index 8
Skipping empty input at index 10
Skipping empty input at index 13
Skipping empty input at index 15
Skipping empty input at index 18
Throughput on wikitext: 112.63636363636364 tokens/sample
Benchmarking on gsm8k dataset...
Throughput on gsm8k: 56.85 tokens/sample


In [2]:
!pip install nltk rouge-score


Collecting rouge-score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: rouge-score
  Building wheel for rouge-score (setup.py) ... [?25l[?25hdone
  Created wheel for rouge-score: filename=rouge_score-0.1.2-py3-none-any.whl size=24935 sha256=6a5d6de190f71e33f655ed3429c2aef0c43b3f4b4771bb6db5eb614d2da6843c
  Stored in directory: /root/.cache/pip/wheels/5f/dd/89/461065a73be61a532ff8599a28e9beef17985c9e9c31e541b4
Successfully built rouge-score
Installing collected packages: rouge-score
Successfully installed rouge-score-0.1.2


In [8]:
import torch
import numpy as np
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

# Function to benchmark model throughput and quality
def benchmark_model_with_quality(model, dataset, tokenizer, device, max_samples=20):
    model.eval()
    throughput = []
    total_loss = 0.0  # For perplexity calculation
    total_samples = 0
    bleu_scores = []
    rouge_scores = []
    #scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    for i, sample in enumerate(dataset['test']):
        if i >= max_samples:
            break

        # Handle input for WikiText or GSM8K dataset
        if 'text' in sample:  # Assume it's WikiText
            text = sample['text']
        elif 'question' in sample:  # Assume it's GSM8K
            text = sample['question']
        else:
            continue  # Skip if neither is present

        if not text:  # Skip empty inputs
            continue

        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
        if inputs['input_ids'].shape[1] == 0:
            continue

        inputs = inputs.to(device)

        with torch.no_grad():
            # For WikiText, we compute loss for perplexity
            outputs = model(**inputs, labels=inputs['input_ids'])  # Pass labels for loss calculation
            loss = outputs.loss if hasattr(outputs, 'loss') else None

            if loss is not None:
                total_loss += loss.item() * inputs['input_ids'].shape[0]  # Accumulate loss
                total_samples += inputs['input_ids'].shape[0]  # Count total samples for perplexity

            # Generate a response (assuming the model generates a response for GSM8K)
            generated = model.generate(**inputs, max_length=512)
            generated_text = tokenizer.decode(generated[0], skip_special_tokens=True)

            # Calculate BLEU score if a reference answer is provided (only for GSM8K)
            if 'reference' in sample:
                reference_text = sample['reference']
                bleu_score = sentence_bleu([reference_text.split()], generated_text.split())
                bleu_scores.append(bleu_score)

                # Calculate ROUGE score if a reference answer is provided
                scores = scorer.score(reference_text, generated_text)
                rouge_scores.append(scores)

        # Calculate throughput
        throughput.append(inputs['input_ids'].shape[1])

    # Calculate metrics
    avg_throughput = sum(throughput) / len(throughput) if throughput else 0
    perplexity = np.exp(total_loss / total_samples) if total_samples > 0 else 0  # Calculate perplexity

    avg_bleu = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0
    avg_rouge = {
        'rouge1': np.mean([score['rouge1'].fmeasure for score in rouge_scores]) if rouge_scores else 0,
        'rouge2': np.mean([score['rouge2'].fmeasure for score in rouge_scores]) if rouge_scores else 0,
        'rougeL': np.mean([score['rougeL'].fmeasure for score in rouge_scores]) if rouge_scores else 0,
    }

    return avg_throughput, perplexity, avg_bleu, avg_rouge

# Example usage:
# Assuming `dataset` is a dictionary with a 'test' key containing test samples
# model, tokenizer and device should be defined beforehand
# avg_throughput, perplexity, avg_bleu, avg_rouge = benchmark_model_with_quality(model, dataset, tokenizer, device)


In [4]:
!pip install datasets transformers accelerate torch



In [5]:
# Initialize dictionaries to store metrics
initial_throughput = {}
initial_perplexity = {}
initial_bleu = {}
initial_rouge = {}

# Iterate over each dataset to benchmark and measure metrics
for dataset_name, dataset in datasets.items():
    print(f"Benchmarking on {dataset_name} dataset...")

    # Run the benchmark function which now also returns quality metrics
    avg_throughput, perplexity, avg_bleu, avg_rouge = benchmark_model_with_quality(model, dataset, tokenizer, device)

    # Store the results in respective dictionaries
    initial_throughput[dataset_name] = avg_throughput
    initial_perplexity[dataset_name] = perplexity
    initial_bleu[dataset_name] = avg_bleu
    initial_rouge[dataset_name] = avg_rouge

    # Print the results
    print(f"Throughput on {dataset_name}: {initial_throughput[dataset_name]} tokens/sample")
    print(f"Perplexity on {dataset_name}: {initial_perplexity[dataset_name]:.2f}")
    print(f"BLEU score on {dataset_name}: {initial_bleu[dataset_name]:.4f}")
    print(f"ROUGE score on {dataset_name}: {initial_rouge[dataset_name]}")
    print("\n")  # Add a newline for better readability


Benchmarking on wikitext dataset...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Throughput on wikitext: 112.63636363636364 tokens/sample
Perplexity on wikitext: 140.06
BLEU score on wikitext: 0.0000
ROUGE score on wikitext: {'rouge1': 0, 'rouge2': 0, 'rougeL': 0}


Benchmarking on gsm8k dataset...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end gene

Throughput on gsm8k: 56.85 tokens/sample
Perplexity on gsm8k: 30.88
BLEU score on gsm8k: 0.0000
ROUGE score on gsm8k: {'rouge1': 0, 'rouge2': 0, 'rougeL': 0}




In [6]:
import nltk
from rouge_score import rouge_scorer
from datasets import load_dataset
from nltk.translate.bleu_score import sentence_bleu

# Download NLTK data for BLEU
nltk.download('punkt')

# Load GSM8K dataset using the 'main' configuration
# Load GSM8K dataset using the 'main' configuration and access the 'test' split
#dataset = load_dataset("gsm8k", "main", split="test")
# Access the 'test' split of the dataset
dataset_test = dataset['test']

# Initialize ROUGE scorer
#rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
# Function to calculate BLEU and ROUGE scores
def evaluate_bleu_rouge(predictions, references):
    bleu_scores = []
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    for prediction, reference in zip(predictions, references):
        # Tokenize both the prediction and reference (required for BLEU)
        pred_tokens = nltk.word_tokenize(prediction)
        ref_tokens = nltk.word_tokenize(reference)

        # Calculate BLEU score
        bleu_score = sentence_bleu([ref_tokens], pred_tokens)
        bleu_scores.append(bleu_score)

        # Calculate ROUGE score

        rouge_result = rouge_scorer.score(reference, prediction)
        rouge_scores['rouge1'].append(rouge_result['rouge1'].fmeasure)
        rouge_scores['rouge2'].append(rouge_result['rouge2'].fmeasure)
        rouge_scores['rougeL'].append(rouge_result['rougeL'].fmeasure)

    # Average BLEU and ROUGE scores
    avg_bleu = sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0
    avg_rouge = {key: sum(values) / len(values) for key, values in rouge_scores.items()}

    return avg_bleu, avg_rouge

# Example: Randomly select a subset of the GSM8K dataset
predictions = []
references = []

# Iterate through the 'test' split of the dataset using select
for sample in dataset_test.select(range(20)):  # Limiting to 20 samples for demo purposes
    question = sample['question']
    answer = sample['answer']
    if 'text' in sample:  # Assume it's WikiText
      text = sample['text']
    elif 'question' in sample:
      text = sample['question']
    else:
      continue  # Skip if neither is present

    if not text:
      continue
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    if inputs['input_ids'].shape[1] == 0:
      continue

    inputs = inputs.to(device)
    # Assuming 'inputs' is defined somewhere before this loop, as it's used in model.generate
    generated = model.generate(**inputs, max_length=512)
    generated_answer = tokenizer.decode(generated[0], skip_special_tokens=True)

    # Here, we simulate a generated answer (you should replace this with model output)
    #generated_answer = "This is a dummy generated answer."  # Replace with model prediction
    predictions.append(generated_answer)
    references.append(answer)

# Evaluate BLEU and ROUGE
bleu, rouge = evaluate_bleu_rouge(predictions, references)

# Print results
print(f"Average BLEU Score: {bleu}")
print(f"Average ROUGE Scores: {rouge}")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos

Average BLEU Score: 0.00955074543568346
Average ROUGE Scores: {'rouge1': 0.13856627014322534, 'rouge2': 0.04929588211589691, 'rougeL': 0.09219342938333996}


In [9]:
import torch
import torch.nn.utils.prune as prune
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

# Step 4: Perform model pruning (unstructured L1 pruning)
def prune_model(model):
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            # Prune 50% of weights using L1 unstructured pruning
            prune.l1_unstructured(module, name='weight', amount=0.5)
            # Optional: Remove the pruning mask to make the model smaller
            prune.remove(module, 'weight')
    return model

model = prune_model(model)  # Apply pruning



# Initialize dictionaries to store metrics
initial_throughput = {}
initial_perplexity = {}
initial_bleu = {}
initial_rouge = {}

# Iterate over each dataset to benchmark and measure metrics
for dataset_name, dataset in datasets.items():
    print(f"Benchmarking on {dataset_name} dataset...")

    # Run the benchmark function which now also returns quality metrics
    avg_throughput, perplexity, avg_bleu, avg_rouge = benchmark_model_with_quality(model, dataset, tokenizer, device)

    # Store the results in respective dictionaries
    initial_throughput[dataset_name] = avg_throughput
    initial_perplexity[dataset_name] = perplexity
    initial_bleu[dataset_name] = avg_bleu
    initial_rouge[dataset_name] = avg_rouge

    # Print the results
    print(f"Throughput on {dataset_name}: {initial_throughput[dataset_name]} tokens/sample")
    print(f"Perplexity on {dataset_name}: {initial_perplexity[dataset_name]:.2f}")
    print(f"BLEU score on {dataset_name}: {initial_bleu[dataset_name]:.4f}")
    print(f"ROUGE score on {dataset_name}: {initial_rouge[dataset_name]}")
    print("\n")  # Add a newline for better readability




# Example: Randomly select a subset of the GSM8K dataset
predictions = []
references = []

# Iterate through the 'test' split of the dataset using select
for sample in dataset_test.select(range(20)):  # Limiting to 20 samples for demo purposes
    question = sample['question']
    answer = sample['answer']
    if 'text' in sample:  # Assume it's WikiText
      text = sample['text']
    elif 'question' in sample:
      text = sample['question']
    else:
      continue  # Skip if neither is present

    if not text:
      continue
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    if inputs['input_ids'].shape[1] == 0:
      continue

    inputs = inputs.to(device)
    # Assuming 'inputs' is defined somewhere before this loop, as it's used in model.generate
    generated = model.generate(**inputs, max_length=512)
    generated_answer = tokenizer.decode(generated[0], skip_special_tokens=True)

    # Here, we simulate a generated answer (you should replace this with model output)
    #generated_answer = "This is a dummy generated answer."  # Replace with model prediction
    predictions.append(generated_answer)
    references.append(answer)

# Evaluate BLEU and ROUGE
bleu, rouge = evaluate_bleu_rouge(predictions, references)

# Print results
print(f"Average BLEU Score: {bleu}")
print(f"Average ROUGE Scores: {rouge}")


Benchmarking on wikitext dataset...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Throughput on wikitext: 112.63636363636364 tokens/sample
Perplexity on wikitext: 191.90
BLEU score on wikitext: 0.0000
ROUGE score on wikitext: {'rouge1': 0, 'rouge2': 0, 'rougeL': 0}


Benchmarking on gsm8k dataset...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end gene

Throughput on gsm8k: 56.85 tokens/sample
Perplexity on gsm8k: 70.56
BLEU score on gsm8k: 0.0000
ROUGE score on gsm8k: {'rouge1': 0, 'rouge2': 0, 'rougeL': 0}




Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end gene

AttributeError: module 'rouge_score.rouge_scorer' has no attribute 'score'

In [None]:
import torch
from transformers import Trainer, TrainingArguments, AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",  # Evaluate every few steps instead of every epoch
    eval_steps=10,  # Evaluate every 10 steps
    learning_rate=2e-5,
    per_device_train_batch_size=4,  # Reduced batch size for CPU
    per_device_eval_batch_size=4,   # Reduced batch size for CPU
    num_train_epochs=1,  # Set to 1 for quick training
    weight_decay=0.01,
    logging_dir='./logs',  # Directory for storing logs
    logging_steps=50,  # Only log every 50 steps
    save_steps=50,  # Save model checkpoints every 50 steps
    no_cuda=True,   # Ensure we're using CPU only
)

# Create Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

# Train the model
trainer.train()

# Optionally evaluate the model after training
trainer.evaluate()

# Save the model
trainer.save_model("./results/final_model")


In [None]:
import matplotlib.pyplot as plt

# Data for the models based on trial results
models = ['First Trial', 'Second Trial', 'Third Trial (Pending)']

# BLEU and ROUGE scores for GSM8K dataset (First and Second trials only)
bleu_scores = [0.0096, None]  # BLEU only reported for first trial
rouge1_scores = [0.1386, None]  # ROUGE-1 for first trial
rouge2_scores = [0.0493, None]  # ROUGE-2 for first trial
rougeL_scores = [0.0922, None]  # ROUGE-L for first trial

# Throughput values for both datasets (First and Second trials only)
throughput_wikitext = [112.64, 112.64]  # Throughput for Wikitext
throughput_gsm8k = [56.85, 56.85]  # Throughput for GSM8K

# Perplexity values for both datasets
perplexity_wikitext = [140.06, 191.90]  # Perplexity for Wikitext
perplexity_gsm8k = [30.88, 70.56]  # Perplexity for GSM8K

# Plot BLEU and ROUGE Scores
plt.figure(figsize=(10, 6))

# BLEU Score for GSM8K dataset
plt.plot(models[:2], bleu_scores[:2], marker='o', label="BLEU", color='b')

# ROUGE-1, ROUGE-2, ROUGE-L Scores for GSM8K dataset
plt.plot(models[:2], rouge1_scores[:2], marker='o', label="ROUGE-1", color='g')
plt.plot(models[:2], rouge2_scores[:2], marker='o', label="ROUGE-2", color='r')
plt.plot(models[:2], rougeL_scores[:2], marker='o', label="ROUGE-L", color='purple')

# Add labels and title
plt.title("Model Quality Comparison (BLEU and ROUGE Scores)")
plt.xlabel("Model Version")
plt.ylabel("Score")
plt.legend()

# Show plot for quality comparison
plt.show()

# Plot Throughput for Wikitext and GSM8K
plt.figure(figsize=(8, 6))

# Throughput for both datasets
plt.plot(models[:2], throughput_wikitext[:2], marker='o', label="Wikitext Throughput", color='orange')
plt.plot(models[:2], throughput_gsm8k[:2], marker='o', label="GSM8K Throughput", color='blue')

plt.title("Model Throughput Comparison (Tokens per Second)")
plt.xlabel("Model Version")
plt.ylabel("Throughput (Tokens/sec)")
plt.legend()

# Show plot for throughput comparison
plt.show()

# Plot Perplexity for Wikitext and GSM8K
plt.figure(figsize=(8, 6))

# Perplexity for both datasets
plt.plot(models[:2], perplexity_wikitext[:2], marker='o', label="Wikitext Perplexity", color='green')
plt.plot(models[:2], perplexity_gsm8k[:2], marker='o', label="GSM8K Perplexity", color='red')

plt.title("Model Perplexity Comparison (Lower is Better)")
plt.xlabel("Model Version")
plt.ylabel("Perplexity")
plt.legend()

# Show plot for perplexity comparison
plt.show()
