# Sentinel-AI Colab Tutorial

This notebook provides a beginner-friendly introduction to the Sentinel-AI framework, a dynamic transformer architecture that can prune, regrow, and adapt during training and inference.

## Features Demonstrated:
1. **Loading and initializing** the Sentinel-AI model
2. **Dynamic pruning** to remove unnecessary attention heads
3. **Adaptation to new data** after pruning
4. **Visualizing model behavior** during training and inference

Let's get started!

## Step 1: Set Up Google Colab Environment

First, we'll install the necessary libraries and clone the Sentinel-AI repository.

In [None]:
!pip install -q transformers torch matplotlib numpy pandas tqdm

In [None]:
# Clone the repository
!git clone https://github.com/CambrianTech/sentinel-ai.git
%cd sentinel-ai

## Step 2: Mount Google Drive (Optional)

If you want to save your models and results to Google Drive, run this cell to mount your drive.

In [None]:
# Mount Google Drive (optional but recommended for saving models)
from google.colab import drive
drive.mount('/content/drive')

# Create a directory for saving models and results
DRIVE_PATH = "/content/drive/MyDrive/sentinel-ai"
!mkdir -p {DRIVE_PATH}

## Step 3: Import Libraries and Set Up Paths

Let's import the necessary libraries and set up our Python paths.

In [None]:
import os
import sys
import torch
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from transformers import AutoTokenizer

# Add project root to path
sys.path.insert(0, os.getcwd())

# Import Sentinel-AI modules
from models.loaders.loader import load_baseline_model, load_adaptive_model
from utils.generation_wrapper import generate_text
from scripts.colab_training import apply_initial_pruning, visualize_gates

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## Step 4: Load a Pretrained Model

Now, let's load a pretrained model (e.g., DistilGPT2) and wrap it with our adaptive architecture.

In [None]:
model_name = "distilgpt2"  # You can try other models like "gpt2" if you have enough memory

# Load tokenizer
print(f"Loading tokenizer: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load baseline model
print(f"Loading baseline model: {model_name}")
baseline_model = load_baseline_model(model_name, device)

# Convert to adaptive model
print("Converting to adaptive model with sentinel gates...")
adaptive_model = load_adaptive_model(model_name, baseline_model, device)
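
To build intuition for what the conversion adds, here is a minimal sketch of a gated attention layer. It assumes each sentinel gate is a scalar that multiplies its head's output before the heads are summed; the actual Sentinel-AI implementation may differ, and `combine_heads` is a hypothetical helper, not part of the repository.

```python
def combine_heads(head_outputs, gates):
    """Sum per-head output vectors, each scaled by its scalar gate.

    Assumption for illustration: one scalar gate per head.
    """
    if len(head_outputs) != len(gates):
        raise ValueError("need exactly one gate per head")
    combined = [0.0] * len(head_outputs[0])
    for output, gate in zip(head_outputs, gates):
        for i, value in enumerate(output):
            combined[i] += gate * value
    return combined

# Three toy heads, each producing a 2-dimensional output
head_outputs = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]

all_open = combine_heads(head_outputs, gates=[1.0, 1.0, 1.0])
middle_pruned = combine_heads(head_outputs, gates=[1.0, 0.0, 1.0])

print("All gates open:    ", all_open)
print("Middle head pruned:", middle_pruned)
```

Setting a gate to zero removes that head's contribution without deleting its weights, which is why a pruned head could later be reopened.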

## Step 5: Generate Text with the Full Model

Let's first generate some text with the full model (no pruning) to establish a baseline.

In [None]:
prompts = [
    "Once upon a time in a land far away,",
    "The future of artificial intelligence depends on",
    "Scientists have recently discovered that"
]

print("=== Generating text with full model (no pruning) ===\n")
for i, prompt in enumerate(prompts):
    print(f"Prompt {i+1}: {prompt}")
    output = generate_text(
        model=adaptive_model,
        tokenizer=tokenizer,
        prompt=prompt,
        max_length=100,
        temperature=0.7,
        device=device
    )
    print(f"Generated: {output}\n")

## Step 6: Visualize Gate Activity (Before Pruning)

Let's visualize the gate activity in the model before we apply any pruning.

In [None]:
# Visualize gate activity before pruning
print("Gate activity before pruning:")
gates_before = visualize_gates(adaptive_model)

# Count active heads
num_layers = len(adaptive_model.blocks)
num_heads = adaptive_model.blocks[0]["attn"].num_heads
total_heads = num_layers * num_heads
active_heads = sum(
    1 for l in range(num_layers) for h in range(num_heads) 
    if adaptive_model.blocks[l]["attn"].gate[h].item() > 0.01
)
print(f"Active heads: {active_heads}/{total_heads} ({active_heads/total_heads:.1%})")

## Step 7: Apply Pruning

Now, let's apply entropy-based pruning to reduce the number of active attention heads. With `pruning_level = 0.5`, this identifies the least important half of the heads and disables them by closing their gates.

In [None]:
# Apply entropy-based pruning
pruning_level = 0.5  # Prune 50% of heads
pruning_strategy = "entropy"  # Can be "random", "entropy", or "gradient"

pruned_model = apply_initial_pruning(
    adaptive_model, 
    pruning_strategy, 
    pruning_level, 
    device
)
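
The idea behind entropy-based selection can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code inside `apply_initial_pruning`: it assumes heads whose attention is most diffuse (highest mean entropy) are pruned first, and `attention_entropy` and `select_heads_to_prune` are hypothetical names.

```python
import math

def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_heads_to_prune(head_attention, pruning_level):
    """Return the (layer, head) keys with the highest mean entropy.

    head_attention maps (layer, head) -> list of attention distributions.
    Assumption: high entropy (diffuse attention) marks a head as less useful.
    """
    mean_entropy = {
        key: sum(attention_entropy(row) for row in rows) / len(rows)
        for key, rows in head_attention.items()
    }
    num_to_prune = int(len(mean_entropy) * pruning_level)
    ranked = sorted(mean_entropy, key=mean_entropy.get, reverse=True)
    return set(ranked[:num_to_prune])

# Toy attention patterns for a 2-layer, 2-head model (made-up numbers)
toy_attention = {
    (0, 0): [[0.97, 0.01, 0.01, 0.01]],  # sharply focused
    (0, 1): [[0.25, 0.25, 0.25, 0.25]],  # uniform, maximally diffuse
    (1, 0): [[0.70, 0.10, 0.10, 0.10]],
    (1, 1): [[0.40, 0.30, 0.20, 0.10]],
}

pruned = select_heads_to_prune(toy_attention, pruning_level=0.5)
gates = {key: 0.0 if key in pruned else 1.0 for key in toy_attention}
print("Pruned heads:", sorted(pruned))
```

With a pruning level of 0.5, the two most diffuse heads are closed and the two most focused heads stay open.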

## Step 8: Visualize Gate Activity (After Pruning)

Let's visualize the gate activity after pruning to see which heads were pruned.

In [None]:
# Visualize gate activity after pruning
print("Gate activity after pruning:")
gates_after = visualize_gates(pruned_model)

# Count active heads after pruning
active_heads_after = sum(
    1 for l in range(num_layers) for h in range(num_heads) 
    if pruned_model.blocks[l]["attn"].gate[h].item() > 0.01
)
print(f"Active heads: {active_heads_after}/{total_heads} ({active_heads_after/total_heads:.1%})")
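
Besides the plots, it can help to list exactly which heads changed state. The return type of `visualize_gates` is not documented in this notebook, so the sketch below works on plain nested lists of gate values (one inner list per layer); `newly_pruned_heads` is a hypothetical helper.

```python
def newly_pruned_heads(gates_before, gates_after, threshold=0.01):
    """Return (layer, head) pairs that were active before and inactive after.

    Uses the same 0.01 activity threshold as the head counts in this notebook.
    """
    changed = []
    for layer, (before_row, after_row) in enumerate(zip(gates_before, gates_after)):
        for head, (before, after) in enumerate(zip(before_row, after_row)):
            if before > threshold and after <= threshold:
                changed.append((layer, head))
    return changed

# Made-up gate values for a 2-layer, 3-head model
example_before = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
example_after = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

changed_heads = newly_pruned_heads(example_before, example_after)
print("Newly pruned (layer, head):", changed_heads)
```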

## Step 9: Generate Text with the Pruned Model

Let's generate text with the pruned model to see if the quality is maintained.

In [None]:
print(f"=== Generating text with pruned model ({pruning_level*100:.0f}% pruning) ===\n")
for i, prompt in enumerate(prompts):
    print(f"Prompt {i+1}: {prompt}")
    output = generate_text(
        model=pruned_model,
        tokenizer=tokenizer,
        prompt=prompt,
        max_length=100,
        temperature=0.7,
        device=device
    )
    print(f"Generated: {output}\n")
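
Reading samples side by side is subjective, so a quantitative check is a useful complement. Perplexity, the exponential of the negative mean log-probability per token, is a common choice. This notebook does not compute it; the sketch below uses made-up log-probabilities purely to show the arithmetic.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Made-up values: each model scores the same four held-out tokens
full_model_log_probs = [math.log(0.5)] * 4     # probability 0.5 per token
pruned_model_log_probs = [math.log(0.25)] * 4  # probability 0.25 per token

full_ppl = perplexity(full_model_log_probs)
pruned_ppl = perplexity(pruned_model_log_probs)
print(f"Full model perplexity:   {full_ppl:.2f}")
print(f"Pruned model perplexity: {pruned_ppl:.2f}")
```

Lower is better: a model that assigns probability 0.5 to every token has perplexity 2, while one that assigns 0.25 has perplexity 4.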

## Step 10: Measure Inference Speed

Let's measure the inference speed to see if pruning has made the model faster.

In [None]:
import time

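def measure_inference_speed(model, tokenizer, prompt, num_tokens=50, num_runs=3):
    """Measure inference speed in generated tokens per second."""
    # Use the device the weights live on; not every wrapper exposes `model.device`
    model_device = next(model.parameters()).device
    inputs = tokenizer(prompt, return_tensors="pt").to(model_device)
    prompt_len = inputs.input_ids.shape[-1]

    def sync():
        # CUDA kernels run asynchronously; wait for them before reading the clock
        if model_device.type == "cuda":
            torch.cuda.synchronize()

    # Warmup run
    _ = model.generate(
        **inputs,
        max_length=prompt_len + 10,
        do_sample=False
    )

    times = []
    generated = []
    for _ in range(num_runs):
        sync()
        start_time = time.perf_counter()
        output = model.generate(
            **inputs,
            max_length=prompt_len + num_tokens,
            do_sample=True,
            temperature=0.7
        )
        sync()
        times.append(time.perf_counter() - start_time)
        # Sampling can stop early at EOS, so count the tokens actually produced
        generated.append(output.shape[-1] - prompt_len)

    # Total tokens over total time
    return sum(generated) / sum(times)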

# Load a fresh copy of the baseline model for fair comparison
baseline_model_new = load_baseline_model(model_name, device)
adaptive_model_new = load_adaptive_model(model_name, baseline_model_new, device)

# Measure full model speed
full_speed = measure_inference_speed(adaptive_model_new, tokenizer, prompts[0])
print(f"Full model inference speed: {full_speed:.2f} tokens/sec")

# Measure pruned model speed
pruned_speed = measure_inference_speed(pruned_model, tokenizer, prompts[0])
print(f"Pruned model inference speed: {pruned_speed:.2f} tokens/sec")

# Calculate speedup
speedup = (pruned_speed / full_speed - 1) * 100
print(f"Speedup: {speedup:.1f}%")
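
The measurement pattern itself (warm up, time several runs, divide tokens by seconds, then compare) can be tried without a model. The sketch below is a self-contained illustration: `fake_generate` is a hypothetical stand-in that sleeps instead of running a transformer. On a GPU, you would also call `torch.cuda.synchronize()` before reading the clock.

```python
import time

def tokens_per_second(generate_fn, num_tokens, num_runs=3, warmup_runs=1):
    """Time generate_fn and return produced tokens divided by elapsed seconds."""
    for _ in range(warmup_runs):
        generate_fn(num_tokens)  # warmup, not timed
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(num_runs):
        start = time.perf_counter()
        produced = generate_fn(num_tokens)
        total_seconds += time.perf_counter() - start
        total_tokens += produced
    return total_tokens / total_seconds

def relative_speedup(baseline_tps, candidate_tps):
    """Speedup in percent; positive means the candidate is faster."""
    return (candidate_tps / baseline_tps - 1.0) * 100.0

def fake_generate(delay_per_token):
    """Hypothetical stand-in for a model: sleeps, then reports tokens produced."""
    def generate(num_tokens):
        time.sleep(delay_per_token * num_tokens)
        return num_tokens
    return generate

slow_tps = tokens_per_second(fake_generate(0.004), num_tokens=10)
fast_tps = tokens_per_second(fake_generate(0.001), num_tokens=10)
print(f"Slow: {slow_tps:.0f} tok/s, fast: {fast_tps:.0f} tok/s")
print(f"Speedup: {relative_speedup(slow_tps, fast_tps):.1f}%")
```

Keep in mind that gating a head to zero does not necessarily skip its computation, so a real speedup depends on whether the implementation avoids the work for closed heads.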

## Step 11: Train on New Data

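Now, let's fine-tune the model on a small dataset and see how it adapts. Note that the training script runs as a separate process and the command below passes no pruning flags, so this run does not reuse the in-memory `pruned_model` from the previous steps. It serves as the reference run for the pruned training in Step 12.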

In [None]:
# Define training parameters
training_args = """
--model_name distilgpt2 
--dataset tiny_shakespeare 
--epochs 1 
--batch_size 4 
--learning_rate 5e-5 
--eval_every 100 
--save_every 500 
--max_length 128 
--save_results
"""

# Add drive path if available
if 'DRIVE_PATH' in globals():
    training_args += f" --drive_path {DRIVE_PATH}"

# Run training script
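# Collapse the multi-line string so the shell receives a single command line
training_args = " ".join(training_args.split())
!python scripts/colab_training.py {training_args}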
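
A note on passing arguments: a triple-quoted string contains newlines, and a shell treats each line as a separate command. One way to build a safe single-line argument string is the standard-library `shlex` module, sketched below. The drive path with a space in it is a made-up example to show the quoting.

```python
import shlex

training_args = """
--model_name distilgpt2
--dataset tiny_shakespeare
--epochs 1
--save_results
"""

# Hypothetical path containing a space, to demonstrate quoting
drive_path = "/content/drive/My Drive/sentinel-ai"

# Split on any whitespace (including newlines) into individual tokens
tokens = shlex.split(training_args)
tokens += ["--drive_path", drive_path]

# Join back into one shell-safe line; values with spaces get quoted
command_line = shlex.join(tokens)
print(command_line)
```

The resulting `command_line` can be interpolated into a `!python ...` cell the same way as the original string.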

## Step 12: Training with Initial Pruning

Now, let's try training a model that starts with pruning already applied. This demonstrates how we can train efficiently from the beginning.

In [None]:
# Define training parameters with initial pruning
pruned_training_args = """
--model_name distilgpt2 
--dataset tiny_shakespeare 
--epochs 1 
--batch_size 4 
--learning_rate 5e-5 
--eval_every 100 
--save_every 500 
--initial_pruning 0.5 
--pruning_strategy entropy 
--max_length 128 
--save_results
"""

# Add drive path if available
if 'DRIVE_PATH' in globals():
    pruned_training_args += f" --drive_path {DRIVE_PATH}"

# Run training script
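# Collapse the multi-line string so the shell receives a single command line
pruned_training_args = " ".join(pruned_training_args.split())
!python scripts/colab_training.py {pruned_training_args}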

## Step 13: Training with Dynamic Pruning (Controller)

Finally, let's demonstrate the dynamic pruning capability using the controller, which can learn which heads to prune during training.

In [None]:
# Define training parameters with controller
controller_training_args = """
--model_name distilgpt2 
--dataset tiny_shakespeare 
--epochs 1 
--batch_size 4 
--learning_rate 5e-5 
--eval_every 100 
--save_every 500 
--enable_controller 
--controller_interval 100 
--target_pruning 0.5 
--max_length 128 
--save_results
"""

# Add drive path if available
if 'DRIVE_PATH' in globals():
    controller_training_args += f" --drive_path {DRIVE_PATH}"

# Run training script
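# Collapse the multi-line string so the shell receives a single command line
controller_training_args = " ".join(controller_training_args.split())
!python scripts/colab_training.py {controller_training_args}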
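
To make the roles of `--controller_interval` and `--target_pruning` more concrete, here is a deliberately simplified sketch. It is not the Sentinel-AI controller, which is described above as learning which heads to prune. This version assumes fixed importance scores and closes the least important open head once per interval until the target fraction is reached. All names and numbers are hypothetical.

```python
def controller_step(gates, importance, target_pruning, max_new_prunes=1):
    """Close up to max_new_prunes of the least important open heads.

    Never closes more heads than the target pruning fraction allows.
    """
    target_closed = int(len(gates) * target_pruning)
    closed = sum(1 for gate in gates.values() if gate == 0.0)
    budget = min(max_new_prunes, max(0, target_closed - closed))
    open_heads = sorted(
        (key for key, gate in gates.items() if gate > 0.0),
        key=importance.get,
    )
    updated = dict(gates)
    for key in open_heads[:budget]:
        updated[key] = 0.0
    return updated

# Made-up importance scores for a 2-layer, 2-head model
importance = {"L0H0": 0.9, "L0H1": 0.1, "L1H0": 0.5, "L1H1": 0.3}
gates = {key: 1.0 for key in importance}

controller_interval = 100
for step in range(1, 401):
    # ... a training step would happen here ...
    if step % controller_interval == 0:
        gates = controller_step(gates, importance, target_pruning=0.5)

print(gates)
```

Pruning gradually, rather than all at once, gives the remaining heads time to compensate between controller updates.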

## Advanced Feature: Learning After Pruning

Now let's explore one of the most powerful capabilities of Sentinel-AI: the ability for pruned models to learn new tasks efficiently and potentially grow into more powerful models.

We'll use our new `learning_after_pruning.py` script to demonstrate this capability.

In [ ]:
# Define learning after pruning parameters
learning_args = """
--model_name distilgpt2 
--pruning_level 0.5 
--pruning_strategy entropy 
--task sentiment 
--sample_size 100 
--epochs 3 
--batch_size 4 
--learning_rate 5e-5 
--save_results
"""

# Add drive path if available
if 'DRIVE_PATH' in globals():
    learning_args += f" --drive_path {DRIVE_PATH}"

# Run the learning after pruning script
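# Collapse the multi-line string so the shell receives a single command line
learning_args = " ".join(learning_args.split())
!python scripts/learning_after_pruning.py {learning_args}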

## Analyzing Learning Results

Let's examine the results of our learning experiment to understand how the pruned model compares to the full model in learning a new task.

In [ ]:
# Find the most recent learning results directory
import glob
import os

if 'DRIVE_PATH' in globals():
    results_base = os.path.join(DRIVE_PATH, "learning_results")
else:
    results_base = "learning_results"

# Get all sentiment results directories sorted by modification time (latest first).
# On Linux, getctime reports the last metadata change rather than creation time.
result_dirs = sorted(
    glob.glob(f"{results_base}/sentiment_*"), 
    key=os.path.getmtime, 
    reverse=True
)

if result_dirs:
    latest_dir = result_dirs[0]
    print(f"Found results in: {latest_dir}")
    
    # Display learning efficiency comparison
    from IPython.display import Image, display
    
    images = [
        "performance_comparison.png",
        "learning_efficiency_comparison.png",
        "gate_activity_comparison.png",
        "gate_activity_difference.png"
    ]
    
    for img in images:
        img_path = os.path.join(latest_dir, img)
        if os.path.exists(img_path):
            print(f"\n{img.replace('.png', '').replace('_', ' ').title()}:")
            display(Image(img_path))
    
    # Display summary text
    summary_path = os.path.join(latest_dir, "learning_results_summary.txt")
    if os.path.exists(summary_path):
        print("\nSummary of Results:")
        with open(summary_path, 'r') as f:
            summary = f.read()
        print(summary)
else:
    print("No learning results found. Run the learning_after_pruning.py script first.")

## Experimenting with Different Tasks

The `learning_after_pruning.py` script supports several tasks to demonstrate adaptability:

1. **sentiment** - A sentiment analysis classification task
2. **code** - Learning to generate code snippets for programming problems
3. **science** - Learning scientific facts and explanations
4. **poetry** - Learning to generate poetic text with specific structures

Let's try a different task to see how pruned models adapt to different types of learning:

In [ ]:
# Try a different task (poetry generation)
poetry_args = """
--model_name distilgpt2 
--pruning_level 0.5 
--pruning_strategy entropy 
--task poetry 
--sample_size 100 
--epochs 3 
--batch_size 4 
--learning_rate 5e-5 
--save_results
"""

# Add drive path if available
if 'DRIVE_PATH' in globals():
    poetry_args += f" --drive_path {DRIVE_PATH}"

# Run the learning after pruning script with poetry task
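# Collapse the multi-line string so the shell receives a single command line
poetry_args = " ".join(poetry_args.split())
!python scripts/learning_after_pruning.py {poetry_args}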

## Comparing Sample Generations

Let's look at the sample text generated by both the full and pruned models after learning:

In [ ]:
# Display sample generations
import glob

# Find the most recent learning results directories
if 'DRIVE_PATH' in globals():
    results_base = os.path.join(DRIVE_PATH, "learning_results")
else:
    results_base = "learning_results"

# Sort by modification time (latest first); on Linux, getctime is not creation time
sentiment_dirs = sorted(
    glob.glob(f"{results_base}/sentiment_*"), 
    key=os.path.getmtime, 
    reverse=True
)

poetry_dirs = sorted(
    glob.glob(f"{results_base}/poetry_*"), 
    key=os.path.getmtime, 
    reverse=True
)

# Function to display sample generations
def display_generations(result_dir, task_name):
    samples_path = os.path.join(result_dir, "sample_generations.txt")
    if os.path.exists(samples_path):
        print(f"\n=== Sample {task_name.capitalize()} Generations ===")
        with open(samples_path, 'r') as f:
            samples = f.read()
            # Print just the first 1000 characters to avoid overwhelming the output
            if len(samples) > 1000:
                print(samples[:1000] + "...\n(output truncated)")
            else:
                print(samples)
    else:
        print(f"No sample generations found for {task_name} task.")

# Display generations from both tasks if available
if sentiment_dirs:
    display_generations(sentiment_dirs[0], "sentiment")
    
if poetry_dirs:
    display_generations(poetry_dirs[0], "poetry")

## Conclusion

In this advanced section, we've demonstrated several key capabilities of the Sentinel-AI framework:

1. **Adaptability After Pruning**: Pruned models can efficiently learn new tasks, showing that pruning doesn't compromise the model's ability to adapt and grow.

2. **Comparative Learning Efficiency**: In many cases, pruned models learn as efficiently as (or sometimes more efficiently than) their full-sized counterparts, while requiring fewer computational resources.

3. **Task Flexibility**: The adaptive architecture works across various tasks including classification (sentiment) and generation (poetry), demonstrating versatility.

4. **Gate Activity Evolution**: By examining gate values before and after learning, we can observe how the model dynamically adjusts its attention mechanisms to optimize for new tasks.

These experiments suggest that the Sentinel-AI approach not only reduces model size and can increase inference speed but also maintains or enhances adaptability, a critical attribute for models that need to grow in capability over time.

The ability of pruned models to learn new tasks efficiently supports our hypothesis that models can "grow into something much more powerful, given an existing model" through our adaptive architecture.

## Next Steps

To further explore the capabilities of Sentinel-AI:

1. **Try different pruning levels**: Experiment with pruning levels from 0.3 to 0.7 to find the optimal tradeoff between efficiency and performance.

2. **Test other pruning strategies**: Compare entropy-based pruning with gradient-based and random pruning to see which works best for different tasks.

3. **Combine pruning with progressive learning**: Start with a heavily pruned model and allow it to "grow" new heads as it learns more complex tasks.

4. **Explore controller-based dynamic pruning**: Use the controller to automatically adjust pruning during training based on task performance.

5. **Apply to larger models**: If you have access to more computational resources, try applying these techniques to larger models like gpt2-medium.

The Sentinel-AI framework opens up numerous possibilities for creating more efficient and adaptable transformer models!
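
The first two suggestions above amount to a small grid sweep. The sketch below only builds and prints the command lines; it does not run them. The flags are the ones used with `learning_after_pruning.py` earlier in this notebook. Whether that script accepts the `random` and `gradient` strategies is an assumption based on the options listed in Step 7.

```python
import itertools
import shlex

pruning_levels = [0.3, 0.5, 0.7]
strategies = ["random", "entropy", "gradient"]  # assumed to be accepted by the script

def build_command(level, strategy, task="sentiment"):
    """Build one shell-safe command line for a single sweep configuration."""
    args = [
        "python", "scripts/learning_after_pruning.py",
        "--model_name", "distilgpt2",
        "--pruning_level", str(level),
        "--pruning_strategy", strategy,
        "--task", task,
        "--save_results",
    ]
    return shlex.join(args)

commands = [
    build_command(level, strategy)
    for level, strategy in itertools.product(pruning_levels, strategies)
]

for command in commands:
    print(command)
```

Each printed line can be pasted into a cell with a leading `!` to run that configuration.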
