# Make a GPT-2 Model Smaller and More Powerful (v0.0.43-updated)

This notebook demonstrates how to make a GPT-2 model both smaller and more powerful by:
1. Applying pruning to remove less important attention heads
2. Fine-tuning the pruned model to recover and improve performance
3. Showing clear metrics of improvement throughout the process

We use real data (Wikitext) rather than synthetic data for realistic evaluation.

Version History:
- v0.0.43-updated (April 2025): Fixed Colab repository URL and branch selection
- v0.0.43 (April 2025): Fixed entropy pruning implementation to handle API availability gracefully
- v0.0.42 (April 2025): Added super_simple test mode and improved error handling
- v0.0.41 (April 2025): Modularized code using sentinel.pruning package
- v0.0.40 (April 2025): Improve robustness for different model architectures
- v0.0.39 (April 2025): Fix TypeError in run_experiment function call
- v0.0.38 (April 2025): Fix ValueError in generate_text function
- v0.0.37 (April 2025): Complete rewrite with minimal dependencies for reliability
- v0.0.36 (April 2025): Simplified pruning implementation for better reliability
- v0.0.35 (April 2025): Fixed in-place operation error in apply_head_pruning function
- v0.0.34 (April 2025): Fixed undefined variable error, visualization issues and enhanced CUDA error handling
- v0.0.33 (April 2025): Fixed visualization issues, improved model compatibility and enhanced error handling
- v0.0.32 (April 2025): Added CUDA error handling for Colab compatibility and memory management
- v0.0.31 (April 2025): Fixed get_strategy parameters issue and improved Colab compatibility
- v0.0.30 (April 2025): Added OPT model support and chart improvements

---
**Note**: This notebook is part of the SentinelAI project. For detailed documentation, see `PruningAndFineTuningColab.md`.

## Purpose of this Notebook

This notebook demonstrates how pruning transformer models can make them both smaller and more powerful. Pruning is the process of removing less important components (in this case, attention heads) to create a more efficient model.

The steps in this experiment are:

1. **Initial Evaluation**: Measure the starting performance of the model
2. **Pruning**: Remove less important attention heads using one of several strategies
3. **Fine-tuning**: Train the pruned model to recover and potentially exceed its original performance
4. **Evaluation**: Compare model performance before and after pruning and fine-tuning

The metrics we track:
- **Loss**: The training loss (lower is better)
- **Perplexity**: A measure of how well the model predicts the next token (lower is better)
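
The two metrics are directly related: perplexity is the exponential of the mean per-token cross-entropy loss (measured in nats). A minimal sketch of the conversion follows; the helper name `perplexity_from_loss` is illustrative and is not part of the `sentinel.pruning` API.

```python
import math

def perplexity_from_loss(mean_loss: float) -> float:
    """Convert a mean per-token cross-entropy loss (natural log) to perplexity."""
    return math.exp(mean_loss)

# A loss of 0 means every token was predicted with certainty: perplexity 1.
print(perplexity_from_loss(0.0))
# Lowering the loss always lowers the perplexity.
print(perplexity_from_loss(3.0) > perplexity_from_loss(2.5))
```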

The experiment shows how a properly pruned and fine-tuned model can be both smaller and more powerful than the original model.

---

## How to Use This Notebook

1. **Run all cells sequentially** from top to bottom
2. You can adjust parameters like model size, pruning percentage, and training epochs
3. The final cell allows you to interactively generate text with your pruned and fine-tuned model

For GPT-2 models, this notebook works best with:
- distilgpt2 (82M parameters)
- gpt2 (124M parameters) 
- gpt2-medium (355M parameters)

Other model architectures (OPT, Pythia, etc.) may require additional modifications.

In [ ]:
# Install required packages
!pip install -q transformers==4.38.0 datasets==2.17.0 torch matplotlib tqdm

# Import necessary libraries
import os
import torch
import matplotlib.pyplot as plt
import numpy as np
from tqdm.auto import tqdm
import json

# Import sys for path manipulation
import sys
# Add the project root to the path so we can import our API modules
if not any(p.endswith('sentinel-ai') for p in sys.path):
    # For Google Colab - handle case where the notebook is running in a different directory
    if os.path.exists('/content'):
        # Clone the repo if running in Colab and not already cloned
        if not os.path.exists('/content/sentinel-ai'):
            # Use the correct repository URL and branch
            !git clone -b feature/implement-adaptive-plasticity https://github.com/CambrianTech/sentinel-ai.git /content/sentinel-ai
        sys.path.append('/content/sentinel-ai')
    else:
        # Add parent directory to path if running locally
        sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

# Print the Python path for debugging
print("Python path:", sys.path)

# Check if the expected directory structure exists
if os.path.exists('/content/sentinel-ai'):
    print("Repository contents:")
    !ls -la /content/sentinel-ai
    print("\nBranch information:")
    !cd /content/sentinel-ai && git branch -v

# Import the modular sentinel.pruning API with better error handling
try:
    from sentinel.pruning.experiment_runner import run_experiment, ExperimentConfig
    from sentinel.pruning.text_generator import interactive_generate
    print("Successfully imported sentinel.pruning modules")
except ImportError as e:
    print(f"Failed to import sentinel.pruning modules: {e}")
    print("This notebook requires the modular sentinel.pruning package.")
    print("Make sure you've pulled the latest code from the repository.")
    print("Falling back to direct API imports...")
    
    # Fall back to the old API imports if sentinel.pruning is not available
    try:
        # Try importing from utils in the sentinel-ai repository
        from utils.pruning.api.pruning import compute_head_importance, prune_heads, fine_tune, evaluate_model
        from utils.pruning.api.data import load_wikitext, prepare_data, prepare_test_data
        print("Successfully imported from utils.pruning.api")
    except ImportError as e2:
        print(f"Failed to import utils.pruning.api: {e2}")
        print("Installing the necessary packages via pip...")
        !pip install -q git+https://github.com/CambrianTech/sentinel-ai.git@feature/implement-adaptive-plasticity
        print("Trying imports again after installation...")
        try:
            from sentinel.pruning.experiment_runner import run_experiment, ExperimentConfig
            from sentinel.pruning.text_generator import interactive_generate
            print("Successfully imported sentinel.pruning modules after pip install")
        except ImportError as e3:
            print(f"Still failing after pip install: {e3}")
            print("Please check the repository structure or run locally.")

# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Create directories for saving results
os.makedirs("pruning_results", exist_ok=True)

## How Pruning Works

Pruning in transformer models involves identifying and removing less important components. In this notebook, we focus on pruning attention heads.

### Why Prune Attention Heads?

1. **Efficiency**: Fewer attention heads mean less computation and memory usage
2. **Specialization**: Removing redundant heads can push the remaining heads to take on distinct roles during fine-tuning
3. **Performance**: Properly pruned and fine-tuned models can actually perform better than the original

### The Process

1. **Importance Calculation**: We use various metrics to determine which heads are important
   - Random (baseline)
   - Magnitude (based on weight norms)
   - Entropy (based on attention patterns)

2. **Pruning**: We remove heads with the lowest importance scores
   - This is done by masking their output rather than deleting parameters, so the stored parameter count stays the same until masked heads are physically removed
   
3. **Fine-tuning**: We train the pruned model to recover and improve performance 
   - The model learns to compensate for the missing heads
   - Remaining heads become more specialized and effective
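
For the magnitude criterion, one simple reading is to score each head by the L2 norm of its slice of a projection matrix. The sketch below assumes the columns are split evenly across heads; the function name, the plain-list layout, and the choice of matrix are all illustrative assumptions rather than the project's actual implementation.

```python
import math

def head_weight_norms(weight_rows, num_heads):
    """L2 norm of each head's slice of a projection matrix.

    weight_rows: matrix as a list of rows; columns are split evenly across heads.
    """
    head_dim = len(weight_rows[0]) // num_heads
    norms = []
    for h in range(num_heads):
        cols = range(h * head_dim, (h + 1) * head_dim)
        norms.append(math.sqrt(sum(row[c] ** 2 for row in weight_rows for c in cols)))
    return norms

# Toy 2x4 projection shared by two heads (two columns each).
weights = [[3.0, 0.0, 0.1, 0.0],
           [0.0, 4.0, 0.0, 0.1]]
norms = head_weight_norms(weights, num_heads=2)
print(norms)  # the second head has a much smaller norm than the first
```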

The result is a smaller, faster model that can match or exceed the original model's performance!
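
As a rough illustration of the entropy criterion and of masking, the sketch below scores each head by the mean entropy of its attention rows and zeroes the mask entry of the most diffuse head. Treat it as a toy under stated assumptions: the function names and the plain-list data layout are hypothetical, and the rule that high-entropy heads are pruned first is an assumption that the actual `sentinel.pruning` code may not share.

```python
import math

def row_entropy(probs):
    """Shannon entropy (nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def head_entropy(attention_rows):
    """Mean entropy over all query positions of a single head."""
    return sum(row_entropy(r) for r in attention_rows) / len(attention_rows)

def build_head_mask(entropies, num_to_prune):
    """Return a 0/1 mask that zeroes the `num_to_prune` highest-entropy heads.

    Assumption: importance = -entropy, so diffuse heads are pruned first.
    """
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    pruned = set(ranked[:num_to_prune])
    return [0.0 if i in pruned else 1.0 for i in range(len(entropies))]

# Toy attention maps for three heads over two query positions.
heads = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],      # sharply focused
    [[1/3, 1/3, 1/3], [1/3, 1/3, 1/3]],      # uniform (maximally diffuse)
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],      # in between
]
entropies = [head_entropy(h) for h in heads]
mask = build_head_mask(entropies, num_to_prune=1)
print(mask)  # [1.0, 0.0, 1.0]: the uniform head is masked
```

Multiplying each head's output by its mask entry silences the pruned heads without changing any tensor shapes, which is why masking is a convenient first step before parameters are physically removed.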

In [ ]:
# Configure experiment parameters
MODEL_NAME = "distilgpt2"  # Smaller GPT-2 model for faster demonstration
PRUNING_PERCENT = 0.3  # Fraction of heads to prune (0-1)
NUM_EPOCHS = 3  # Number of fine-tuning epochs 
BATCH_SIZE = 4  # Batch size for training and evaluation

# Create experiment config
experiment_config = ExperimentConfig(
    model_name=MODEL_NAME,
    pruning_percent=PRUNING_PERCENT,
    num_epochs=NUM_EPOCHS,
    batch_size=BATCH_SIZE,
    device=device,
    output_dir="pruning_results"
)

## Run the Experiment

Now we'll run the full pruning and fine-tuning experiment using our modular API. This process will:

1. Load the specified model and tokenizer
2. Evaluate the initial model performance
3. Compute head importance using entropy-based metrics
4. Prune the least important heads
5. Evaluate the pruned model
6. Fine-tune the pruned model
7. Evaluate the final model and generate a summary

All of this functionality is now neatly encapsulated in the `run_experiment` function from our modular API.
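
To make the pruning budget in step 4 concrete: `distilgpt2` has 6 layers with 12 heads each, so a pruning fraction of 0.3 removes 21 of its 72 heads. The sketch below shows one way such a selection could work, picking the globally lowest-scoring heads from a score table. `select_heads_to_prune` and the round-down rule are assumptions made for illustration and are not taken from `run_experiment`.

```python
def select_heads_to_prune(importance, pruning_percent):
    """Return (layer, head) pairs for the lowest-scoring fraction of heads.

    importance: list of per-layer lists of head scores.
    pruning_percent: fraction in [0, 1]; the head count is rounded down (assumption).
    """
    scored = [
        (score, layer, head)
        for layer, layer_scores in enumerate(importance)
        for head, score in enumerate(layer_scores)
    ]
    num_to_prune = int(len(scored) * pruning_percent)
    scored.sort()  # ascending: least important first
    return [(layer, head) for _, layer, head in scored[:num_to_prune]]

# Dummy scores shaped like distilgpt2 (6 layers x 12 heads); values are arbitrary.
importance = [[(layer * 12 + head) / 72 for head in range(12)] for layer in range(6)]
to_prune = select_heads_to_prune(importance, pruning_percent=0.3)
print(len(to_prune))   # 21
print(to_prune[:3])    # [(0, 0), (0, 1), (0, 2)]
```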

In [ ]:
# Run the experiment
print("\nRunning experiment with modular API and improved error handling...")
print("Note: Entropy pruning will gracefully fall back to alternative methods if needed")
try:
    import copy

    # Copy the config so the test-run flag does not leak into the real configuration
    run_real = not getattr(experiment_config, "use_test_data", False)
    test_config = copy.copy(experiment_config)
    test_config.use_test_data = True  # Use test data to ensure it works
    
    # Try running with test data first for verification
    print("First running with test data to verify functionality...")
    model, tokenizer, summary = run_experiment(test_config)
    
    # If that works, run with the real config
    if run_real:
        print("\nNow running with actual configuration...")
        model, tokenizer, summary = run_experiment(experiment_config)
        
except Exception as e:
    print(f"\nError in experiment: {e}")
    import traceback
    traceback.print_exc()
    
    if "collect_attention_distributions" in str(e) or "entropy_based_pruning" in str(e):
        print("\nNOTE: If you're seeing an error with entropy pruning functions, make sure")
        print("you're using the latest version of the benchmark_with_metrics.py script that")
        print("has the fix for handling different API availability scenarios.")
    
    # Try manual setup as a fallback
    print("\nAttempting manual setup as fallback...")
    try:
        from transformers import AutoModelForCausalLM, AutoTokenizer
        print("\nLoading model directly for demonstration purposes...")
        model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)
        tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        
        # Add some basic pruning functionality for demo purposes
        print("\nSetting up basic pruning demo...")
        
        # Create a very simple pruning function
        def simple_pruning_demo(model, tokenizer, text="Once upon a time"):
            print(f"Running simple pruning demo on '{text}'")
            # Generate text before pruning
            inputs = tokenizer(text, return_tensors="pt").to(device)
            with torch.no_grad():
                outputs_before = model.generate(
                    inputs["input_ids"], 
                    max_length=50, 
                    do_sample=True, 
                    temperature=0.7,
                    num_return_sequences=1
                )
            
            before_text = tokenizer.decode(outputs_before[0], skip_special_tokens=True)
            print(f"\nBefore pruning: {before_text}")
            
            # Prune the first head of every layer with the Hugging Face prune_heads API.
            # This is a crude stand-in for importance-based pruning, for demo purposes only.
            print("\nPruning the first attention head in each layer (demo only)...")
            num_layers = model.config.num_hidden_layers
            model.prune_heads({layer: [0] for layer in range(num_layers)})
            
            print(f"Pruned one head in each of {num_layers} layers")
            
            # Generate text after "pruning"
            with torch.no_grad():
                outputs_after = model.generate(
                    inputs["input_ids"], 
                    max_length=50, 
                    do_sample=True, 
                    temperature=0.7,
                    num_return_sequences=1
                )
            
            after_text = tokenizer.decode(outputs_after[0], skip_special_tokens=True)
            print(f"\nAfter pruning: {after_text}")
            
            return model, tokenizer
        
        # Run the demo
        model, tokenizer = simple_pruning_demo(model, tokenizer)
        
    except Exception as e2:
        print(f"Fallback also failed: {e2}")
        print("Please run this notebook in a local environment where you have the repository properly set up.")

## Interactive Text Generation

Now that we have a pruned and fine-tuned model, let's try generating some text with it interactively. You can provide your own prompts and see how the model responds.

In [ ]:
# Helper function for interactive text generation 
def generate_interactive(prompt=None, max_length=100):
    """Generate text from the fine-tuned model interactively"""
    return interactive_generate(model, tokenizer, prompt, max_length)

# Generate text interactively from the fine-tuned model
generate_interactive()