# Progressive Growth in Sentinel-AI

This notebook demonstrates Sentinel-AI's ability to start with a heavily pruned model and progressively grow it into a more capable system. Unlike conventional approaches that start from a full model and prune it, this workflow shows how models can:

1. **Start in a highly efficient, heavily pruned state**
2. **Strategically regrow attention heads** based on importance signals
3. **Evolve into more powerful models** during training
4. **Target growth to the most valuable computational pathways**

This capability is critical for creating models that can grow into more powerful systems based on task demands, rather than needing to start with overparameterized architectures.

## Setup Environment

First, let's set up our environment and import the necessary libraries.

In [None]:
import sys
import os
import torch
import numpy as np
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from transformers import AutoTokenizer

# Add project root to path (assumes the notebook is run from the project root)
sys.path.insert(0, os.getcwd())

# Import Sentinel-AI modules
from models.loaders.loader import load_baseline_model, load_adaptive_model
from datasets.dataset_loader import load_dataset
from utils.generation_wrapper import generate_text
from controller.controller_manager import ControllerManager
from controller.metrics.head_metrics import collect_head_metrics
from utils.head_metrics import compute_head_importance

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## Load Model

We'll start by loading a pretrained model and converting it to our adaptive architecture.
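
As a rough illustration of what a per-head gate does, the sketch below scales each head's output by its gate value before the heads are combined. This is an assumption about the mechanism, not Sentinel-AI's actual implementation: `combine_heads` and `is_active` are hypothetical helpers, and the heads are simply summed here, whereas real attention concatenates them and applies an output projection. The 0.001 and 1.0 gate values and the 0.5 activity threshold are the ones this notebook uses.

```python
def combine_heads(head_outputs, gates):
    """Scale each head's output vector by its gate, then sum across heads.

    head_outputs: list of per-head vectors (lists of floats), all the same length.
    gates: one float per head; near 0 means pruned, 1.0 means fully active.
    """
    dim = len(head_outputs[0])
    combined = [0.0] * dim
    for vec, gate in zip(head_outputs, gates):
        for i, value in enumerate(vec):
            combined[i] += gate * value
    return combined


def is_active(gate, threshold=0.5):
    """Follow the notebook's convention: a head counts as active above 0.5."""
    return gate > threshold


# Three toy heads with two-dimensional outputs; the middle head is pruned.
heads = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
gates = [1.0, 0.001, 1.0]

print(combine_heads(heads, gates))
print([is_active(g) for g in gates])
```

A gate of 0.001 leaves the pruned head's contribution almost, but not exactly, zero, so in this sketch a pruned head still costs the same compute as an active one.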

In [None]:
model_name = "distilgpt2"  # You can try "gpt2" if you have enough memory

# Load tokenizer
print(f"Loading tokenizer: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load baseline model
print(f"Loading baseline model: {model_name}")
baseline_model = load_baseline_model(model_name, device)

# Convert to adaptive model
print("Converting to adaptive model with sentinel gates...")
adaptive_model = load_adaptive_model(model_name, baseline_model, device)

## Apply Initial Heavy Pruning

We'll start with a heavily pruned model (90% pruning) and then progressively grow it during training.
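
To make the head budget concrete, the sketch below reproduces the arithmetic that `apply_initial_pruning` uses for the `uniform` strategy. It assumes the standard `distilgpt2` configuration of 6 layers with 12 heads each; `head_budget` and `uniform_allocation` are illustrative helper names, not part of Sentinel-AI.

```python
def head_budget(num_layers, num_heads, pruning_level):
    """Return (total_heads, heads_to_keep) for a given pruning level."""
    total = num_layers * num_heads
    keep = int(total * (1 - pruning_level))
    if keep < 1:
        # Same fallback as the notebook: keep one head per layer
        keep = num_layers
    return total, keep


def uniform_allocation(num_layers, keep):
    """Spread the kept heads across layers, giving early layers any remainder."""
    per_layer = max(1, keep // num_layers)
    remaining = keep - per_layer * num_layers
    return [per_layer + (1 if layer < remaining else 0) for layer in range(num_layers)]


total, keep = head_budget(num_layers=6, num_heads=12, pruning_level=0.9)
allocation = uniform_allocation(6, keep)
print(total, keep, allocation)  # 72 7 [2, 1, 1, 1, 1, 1]
```

Because the `uniform` strategy always keeps at least one head per layer, a budget smaller than the number of layers is rounded up: `uniform_allocation(6, 3)` still activates six heads.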

In [None]:
def apply_initial_pruning(model, strategy, pruning_level, device):
    """Apply initial heavy pruning to the model."""
    print(f"Applying initial {strategy} pruning at {pruning_level:.1%} level")
    
    # Get model dimensions
    num_layers = len(model.blocks)
    num_heads = model.blocks[0]["attn"].num_heads
    total_heads = num_layers * num_heads
    heads_to_keep = int(total_heads * (1 - pruning_level))
    
    if heads_to_keep < 1:
        print("Warning: Pruning level too high, keeping at least 1 head per layer")
        heads_to_keep = num_layers  # Ensure at least 1 head per layer

    # Create dummy input for collecting metrics if needed
    batch_size = 2
    seq_len = 32
    dummy_input = torch.randint(0, 1000, (batch_size, seq_len)).to(device)
    dummy_batch = {"input_ids": dummy_input, 
                  "attention_mask": torch.ones_like(dummy_input)}
    
    # Set all gates to near-zero initially
    with torch.no_grad():
        for l in range(num_layers):
            for h in range(num_heads):
                model.blocks[l]["attn"].gate[h] = torch.tensor(0.001, device=device)
    
    # Apply pruning based on strategy
    if strategy == "random":
        # Get a flattened list of (layer, head) tuples
        all_heads = [(l, h) for l in range(num_layers) for h in range(num_heads)]
        
        # Randomly select heads to keep active
        kept_head_indices = np.random.choice(len(all_heads), heads_to_keep, replace=False)
        
        # Set gates to 1.0 for kept heads
        with torch.no_grad():
            for idx in kept_head_indices:
                layer_idx, head_idx = all_heads[idx]
                model.blocks[layer_idx]["attn"].gate[head_idx] = torch.tensor(1.0, device=device)
    
    elif strategy == "uniform":
        # Distribute active heads uniformly across layers
        heads_per_layer = max(1, heads_to_keep // num_layers)
        remaining_heads = heads_to_keep - (heads_per_layer * num_layers)
        
        with torch.no_grad():
            for layer_idx in range(num_layers):
                # Determine how many heads to keep in this layer
                layer_heads = heads_per_layer
                if layer_idx < remaining_heads:
                    layer_heads += 1
                
                # Randomly select heads to keep in this layer
                head_indices = np.random.choice(num_heads, layer_heads, replace=False)
                
                # Set gates to 1.0 for kept heads
                for head_idx in head_indices:
                    model.blocks[layer_idx]["attn"].gate[head_idx] = torch.tensor(1.0, device=device)
    
    elif strategy in ["entropy", "gradient"]:
        # Collect metrics
        metrics = collect_head_metrics(model, batch=dummy_batch)
        
        if strategy == "entropy" and "entropy" in metrics:
            head_scores = metrics["entropy"]
            # Lower entropy = more focused attention = more important to keep
            descending = False
        elif strategy == "gradient" and "grad_norm" in metrics:
            head_scores = metrics["grad_norm"]
            # Higher gradient norm = more important head = more important to keep
            descending = True
        else:
            print(f"Warning: {strategy} metrics not available, using random pruning")
            return apply_initial_pruning(model, "random", pruning_level, device)
        
        # Reshape and flatten scores
        if not isinstance(head_scores, torch.Tensor):
            head_scores = torch.tensor(head_scores, device=device)
            
        if len(head_scores.shape) < 2:
            head_scores = head_scores.reshape(num_layers, num_heads)
            
        # reshape also handles non-contiguous tensors, where view would raise an error
        flat_scores = head_scores.reshape(-1)
        
        # Sort scores
        _, indices = torch.sort(flat_scores, descending=descending)
        indices_to_keep = indices[:heads_to_keep]
        
        # Apply pruning - keep only selected heads
        with torch.no_grad():
            # First set all gates to 0.001 (pruned)
            for layer_idx in range(num_layers):
                for head_idx in range(num_heads):
                    model.blocks[layer_idx]["attn"].gate[head_idx] = torch.tensor(0.001, device=device)
            
            # Then activate only the selected heads
            for idx in indices_to_keep:
                layer_idx = idx.item() // num_heads
                head_idx = idx.item() % num_heads
                model.blocks[layer_idx]["attn"].gate[head_idx] = torch.tensor(1.0, device=device)
    
    # Count active heads for verification
    active_count = 0
    with torch.no_grad():
        for layer_idx in range(num_layers):
            for head_idx in range(num_heads):
                if model.blocks[layer_idx]["attn"].gate[head_idx].item() > 0.5:
                    active_count += 1
    
    print(f"Kept {active_count} of {total_heads} heads active ({active_count/total_heads:.1%})")
    return model

# Apply 90% pruning to the model
initial_pruning_level = 0.9  # 90% pruning
pruning_strategy = "uniform"  # Ensure at least one head per layer

pruned_model = apply_initial_pruning(
    adaptive_model, 
    pruning_strategy, 
    initial_pruning_level, 
    device
)

## Visualize Initial Gate Activity

Let's visualize which heads are active and which are pruned.
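
When a plotting backend is not available, the same information can be printed as text. The sketch below is a minimal alternative to the heatmap, assuming the gate values have already been copied into a nested list (or any 2-D array that iterates row by row); `render_gates` is a hypothetical helper, not part of Sentinel-AI.

```python
def render_gates(gate_values, threshold=0.5):
    """Render a layers-by-heads gate matrix as text: '#' active, '.' pruned."""
    lines = []
    for layer_idx, row in enumerate(gate_values):
        cells = "".join("#" if gate > threshold else "." for gate in row)
        lines.append(f"layer {layer_idx}: {cells}")
    return "\n".join(lines)


# Toy example with 2 layers and 4 heads
example = [
    [1.0, 0.001, 0.001, 1.0],
    [0.001, 0.001, 1.0, 0.001],
]
print(render_gates(example))
# layer 0: #..#
# layer 1: ..#.
```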

In [None]:
def visualize_gate_activity(model):
    """Visualize gate activity across layers and heads."""
    num_layers = len(model.blocks)
    num_heads = model.blocks[0]["attn"].num_heads
    
    # Create matrix of gate values
    gate_values = torch.zeros(num_layers, num_heads)
    for l in range(num_layers):
        for h in range(num_heads):
            gate_values[l, h] = model.blocks[l]["attn"].gate[h].item()
    
    # Create heatmap
    plt.figure(figsize=(10, 6))
    plt.imshow(gate_values.numpy(), cmap="YlOrRd", vmin=0, vmax=1)
    plt.colorbar(label="Gate Value")
    plt.title("Attention Head Gate Activity", fontsize=16)
    plt.xlabel("Attention Head", fontsize=14)
    plt.ylabel("Transformer Layer", fontsize=14)
    
    # Disable grid lines and label every head and layer
    plt.grid(False)
    plt.xticks(range(num_heads))
    plt.yticks(range(num_layers))
    
    plt.show()
    
    # Count active heads
    active_heads = (gate_values > 0.5).sum().item()
    total_heads = num_layers * num_heads
    print(f"Active heads: {active_heads}/{total_heads} ({active_heads/total_heads:.1%})")
    
    return gate_values.numpy()

# Visualize initial gate activity
initial_gates = visualize_gate_activity(pruned_model)

## Generate Text with Heavily Pruned Model

Let's see how the heavily pruned model performs on text generation.

In [None]:
prompts = [
    "Once upon a time in a land far away,",
    "The future of artificial intelligence depends on",
    "In the midst of winter, I found there was, within me,"
]

print("=== Generating text with heavily pruned model ===\n")
for i, prompt in enumerate(prompts):
    print(f"Prompt {i+1}: {prompt}")
    output = generate_text(
        model=pruned_model,
        tokenizer=tokenizer,
        prompt=prompt,
        max_length=100,
        temperature=0.7,
        device=device
    )
    print(f"Generated: {output}\n")

## Setup Progressive Growth Functions

Now we'll define functions to help us grow the model progressively during training.
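
The core of the growth-order logic is a ranking of the currently inactive heads by some per-head score. The sketch below shows that step in isolation on plain nested lists; `rank_inactive_heads` and its `higher_is_better` flag are illustrative names, and the scores are made-up numbers rather than real importance values.

```python
def rank_inactive_heads(scores, gates, higher_is_better=True, threshold=0.5):
    """Return (layer, head) pairs for inactive heads, best candidate first.

    scores and gates are layers-by-heads nested lists of floats.
    A head is inactive when its gate is below the threshold.
    """
    candidates = [
        (scores[layer][head], layer, head)
        for layer, row in enumerate(gates)
        for head, gate in enumerate(row)
        if gate < threshold
    ]
    # Sort on the score only; ties keep their layer/head order (stable sort)
    candidates.sort(key=lambda item: item[0], reverse=higher_is_better)
    return [(layer, head) for _, layer, head in candidates]


gates = [[1.0, 0.001, 0.001], [0.001, 1.0, 0.001]]
scores = [[0.9, 0.2, 0.7], [0.5, 0.1, 0.7]]

# Importance or gradient norm: higher scores grow first
print(rank_inactive_heads(scores, gates, higher_is_better=True))
# Entropy: lower scores grow first
print(rank_inactive_heads(scores, gates, higher_is_better=False))
```

Already-active heads are excluded from the ranking no matter how high they score, which is why head `(0, 0)` with score 0.9 never appears in the output.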

In [None]:
def get_head_growth_order(model, strategy, dataloader, device):
    """Determine the order in which to grow attention heads."""
    # Get model dimensions
    num_layers = len(model.blocks)
    num_heads = model.blocks[0]["attn"].num_heads
    
    # Get currently inactive heads
    inactive_heads = []
    with torch.no_grad():
        for layer_idx in range(num_layers):
            for head_idx in range(num_heads):
                if model.blocks[layer_idx]["attn"].gate[head_idx].item() < 0.5:
                    inactive_heads.append((layer_idx, head_idx))
    
    if strategy == "random":
        # Shuffle the inactive heads randomly
        np.random.shuffle(inactive_heads)
        return inactive_heads
    
    # For other strategies, we need metrics for ranking
    batch = next(iter(dataloader))
    batch = {k: v.to(device) for k, v in batch.items()}
    
    if strategy == "importance":
        # Compute head importance for regrowth
        print("Computing head importance for regrowth...")
        
        # Temporarily activate all heads for importance calculation
        head_gates_backup = {}
        with torch.no_grad():
            for layer_idx, head_idx in inactive_heads:
                # Store original gate value
                head_gates_backup[(layer_idx, head_idx)] = model.blocks[layer_idx]["attn"].gate[head_idx].item()
                # Temporarily set gate to 1.0
                model.blocks[layer_idx]["attn"].gate[head_idx] = torch.tensor(1.0, device=device)
        
        # Compute importance scores for all heads
        importance_scores = compute_head_importance(model, batch)
        
        # Restore original gate values
        with torch.no_grad():
            for (layer_idx, head_idx), gate_value in head_gates_backup.items():
                model.blocks[layer_idx]["attn"].gate[head_idx] = torch.tensor(gate_value, device=device)
        
        # Create a list of (importance, layer_idx, head_idx) for inactive heads
        head_importance = []
        for layer_idx, head_idx in inactive_heads:
            imp = importance_scores[layer_idx][head_idx].item() if isinstance(importance_scores, torch.Tensor) else importance_scores[layer_idx, head_idx]
            head_importance.append((imp, layer_idx, head_idx))
        
        # Sort by importance (higher first)
        head_importance.sort(reverse=True)
        
        # Return only the (layer_idx, head_idx) tuples in order of importance
        return [(layer_idx, head_idx) for _, layer_idx, head_idx in head_importance]
    
    elif strategy in ["gradient", "entropy"]:
        # Collect metrics
        metrics = collect_head_metrics(model, batch=batch)
        
        if strategy == "entropy" and "entropy" in metrics:
            head_scores = metrics["entropy"]
            # Lower entropy = more focused attention = higher priority for growth
            reverse = False
        elif strategy == "gradient" and "grad_norm" in metrics:
            head_scores = metrics["grad_norm"]
            # Higher gradient norm = more important head = higher priority for growth
            reverse = True
        else:
            print(f"Warning: {strategy} metrics not available, using random growth")
            # Shuffle so the fallback order really is random
            np.random.shuffle(inactive_heads)
            return inactive_heads
        
        # Create a list of (score, layer_idx, head_idx) for inactive heads
        head_scores_list = []
        for layer_idx, head_idx in inactive_heads:
            score = head_scores[layer_idx][head_idx].item() if isinstance(head_scores, torch.Tensor) else head_scores[layer_idx, head_idx]
            head_scores_list.append((score, layer_idx, head_idx))
        
        # Sort by score
        head_scores_list.sort(reverse=reverse)
        
        # Return only the (layer_idx, head_idx) tuples in order of score
        return [(layer_idx, head_idx) for _, layer_idx, head_idx in head_scores_list]
    
    # Fall back to a random order if the strategy is not recognized
    np.random.shuffle(inactive_heads)
    return inactive_heads


def grow_attention_heads(model, num_heads_to_grow, growth_order, device):
    """Grow attention heads according to the specified order."""
    if not growth_order:
        print("No more heads to grow.")
        return 0
    
    print(f"Growing {num_heads_to_grow} attention heads...")
    heads_to_grow = growth_order[:num_heads_to_grow]
    
    # Activate the selected heads
    with torch.no_grad():
        for layer_idx, head_idx in heads_to_grow:
            # Activate the head by setting its gate to 1.0
            model.blocks[layer_idx]["attn"].gate[head_idx] = torch.tensor(1.0, device=device)
            print(f"  Activated head {head_idx} in layer {layer_idx}")
    
    return len(heads_to_grow)


def count_active_heads(model):
    """Count the number of active attention heads in the model."""
    active_count = 0
    total_count = 0
    
    with torch.no_grad():
        for layer_idx in range(len(model.blocks)):
            num_heads = model.blocks[layer_idx]["attn"].num_heads
            total_count += num_heads
            
            for head_idx in range(num_heads):
                if model.blocks[layer_idx]["attn"].gate[head_idx].item() > 0.5:
                    active_count += 1
    
    return active_count, total_count

## Load Dataset for Training

In [None]:
# Load dataset
dataset_name = "tiny_shakespeare"
max_length = 128

print(f"Loading {dataset_name} dataset...")
train_dataset, eval_dataset = load_dataset(
    dataset_name, tokenizer, max_length
)

# Create data loaders
batch_size = 4
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True
)
eval_loader = torch.utils.data.DataLoader(
    eval_dataset, batch_size=batch_size, shuffle=False
)

## Setup Training with Progressive Growth

Now let's set up the training loop with progressive growth.
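
It helps to work out the growth schedule by hand before training. The sketch below follows the rule used in the training loop: every epoch except the last grows `growth_rate` times the total number of heads to add, and the last epoch grows whatever is still missing. It assumes 72 total heads (`distilgpt2`) with 7 active after pruning, and it additionally caps each step at the remaining distance to the target, which changes nothing for these defaults. `growth_schedule` is a hypothetical helper.

```python
def growth_schedule(active, total, target_pruning, growth_rate, epochs):
    """Return the number of heads grown at the start of each epoch."""
    target_active = int(total * (1 - target_pruning))
    to_grow_total = max(0, target_active - active)
    schedule = []
    for epoch in range(epochs):
        remaining = max(0, target_active - active)
        if epoch < epochs - 1:
            # Fixed share of the total growth budget, capped at the target
            grow = min(int(to_grow_total * growth_rate), remaining)
        else:
            # Final epoch closes the gap to the target
            grow = remaining
        schedule.append(grow)
        active += grow
    return schedule


print(growth_schedule(active=7, total=72, target_pruning=0.3, growth_rate=0.2, epochs=3))
# [8, 8, 27]
```

With these settings the growth is back-loaded: 16 heads are added over the first two epochs and 27 in the last one, so the final epoch trains the largest number of freshly activated heads.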

In [None]:
from transformers import get_linear_schedule_with_warmup

# Training parameters
epochs = 3
learning_rate = 5e-5
warmup_steps = 200

# Growth parameters
growth_strategy = "importance"  # Use importance-based growth
growth_rate = 0.2  # Per epoch, grow 20% of the total heads to add (the last epoch grows the rest)
target_pruning = 0.3  # Target final pruning level (30% pruned)

# Setup optimizer
optimizer = torch.optim.AdamW(
    pruned_model.parameters(),
    lr=learning_rate,
    weight_decay=0.01
)

# Setup learning rate scheduler
total_steps = len(train_loader) * epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

# Enable U-Net skip connections
controller_manager = ControllerManager(model=pruned_model)
controller_manager.enable_unet_connections(enable=True, connection_scale=0.05)

# Tracking metrics
all_metrics = {
    "epochs": [],
    "train_loss": [],
    "eval_loss": [],
    "perplexity": [],
    "active_heads": [],
    "gate_snapshots": []
}

# Calculate heads to grow
active_heads, total_heads = count_active_heads(pruned_model)
target_active_heads = int(total_heads * (1 - target_pruning))
heads_to_grow_total = max(0, target_active_heads - active_heads)

print(f"Starting with {active_heads} of {total_heads} heads active ({active_heads/total_heads:.1%})")
print(f"Target: {target_active_heads} active heads ({target_active_heads/total_heads:.1%})")
print(f"Need to grow {heads_to_grow_total} heads over {epochs} epochs")

## Train with Progressive Growth

Now let's train the model while progressively growing attention heads.
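
The evaluation step reports perplexity as the exponential of the average cross-entropy loss. The sketch below shows that relationship on plain floats; `perplexity_from_losses` is an illustrative helper, and the loss values are made up.

```python
import math


def perplexity_from_losses(batch_losses):
    """Perplexity as exp of the mean per-batch cross-entropy loss (in nats)."""
    avg_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(avg_loss)


# A loss of 0 corresponds to a perplexity of exactly 1
print(perplexity_from_losses([0.0, 0.0]))

# Averaging happens in log space, so this is the geometric mean of 20 and 5
print(perplexity_from_losses([math.log(20.0), math.log(5.0)]))
```

Averaging per-batch losses matches a token-level average only when every batch contributes the same number of tokens, so a smaller final batch shifts the result slightly.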

In [None]:
def train_epoch(model, train_loader, optimizer, scheduler, device):
    """Train the model for one epoch."""
    model.train()
    epoch_loss = 0
    
    metrics = {
        "loss": [],
        "active_heads": []
    }
    
    for batch in tqdm(train_loader, desc="Training"):
        # Move batch to device
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # Forward pass
        outputs = model(
            input_ids=batch["input_ids"],
            attention_mask=batch["attention_mask"],
            labels=batch["labels"]
        )
        loss = outputs.loss
        
        # Backward pass; clear gradients first so none carry over from
        # earlier head-importance or metric computations
        optimizer.zero_grad()
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        # Update parameters
        optimizer.step()
        scheduler.step()
        
        # Record metrics
        epoch_loss += loss.item()
        metrics["loss"].append(loss.item())
        
        # Record active heads
        active_heads, total_heads = count_active_heads(model)
        metrics["active_heads"].append(active_heads)
    
    # Calculate average loss
    avg_loss = epoch_loss / len(train_loader)
    print(f"Average training loss: {avg_loss:.4f}")
    
    return metrics

def evaluate(model, eval_loader, device):
    """Evaluate the model on the validation set."""
    model.eval()
    eval_loss = 0
    
    with torch.no_grad():
        for batch in tqdm(eval_loader, desc="Evaluating"):
            # Move batch to device
            batch = {k: v.to(device) for k, v in batch.items()}
            
            # Forward pass
            outputs = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                labels=batch["labels"]
            )
            
            eval_loss += outputs.loss.item()
    
    # Calculate average loss and perplexity
    avg_loss = eval_loss / len(eval_loader)
    perplexity = torch.exp(torch.tensor(avg_loss)).item()
    
    print(f"Evaluation loss: {avg_loss:.4f}, Perplexity: {perplexity:.2f}")
    
    return {
        "loss": avg_loss,
        "perplexity": perplexity
    }

# Initial snapshot
all_metrics["gate_snapshots"].append(initial_gates)

# Get growth order (initial)
growth_order = get_head_growth_order(pruned_model, growth_strategy, train_loader, device)

# Training and growth loop
for epoch in range(epochs):
    print(f"\nEpoch {epoch+1}/{epochs}")
    
    # Calculate number of heads to grow this epoch
    current_active, _ = count_active_heads(pruned_model)
    remaining_to_target = max(0, target_active_heads - current_active)
    if epoch < epochs - 1:
        # Grow a fixed share of the total growth budget, never past the target
        heads_to_grow = min(int(heads_to_grow_total * growth_rate), remaining_to_target)
    else:
        # Last epoch - grow all remaining heads to reach target
        heads_to_grow = remaining_to_target
    
    # Grow heads if needed (the growth order is refreshed at the end of each epoch)
    if heads_to_grow > 0 and growth_order:
        grow_attention_heads(pruned_model, heads_to_grow, growth_order, device)
    
    # Train for one epoch
    train_metrics = train_epoch(pruned_model, train_loader, optimizer, scheduler, device)
    
    # Evaluate
    eval_metrics = evaluate(pruned_model, eval_loader, device)
    
    # Get current active head count
    active_heads, _ = count_active_heads(pruned_model)
    
    # Update metrics
    all_metrics["epochs"].append(epoch + 1)
    all_metrics["train_loss"].append(np.mean(train_metrics["loss"]))
    all_metrics["eval_loss"].append(eval_metrics["loss"])
    all_metrics["perplexity"].append(eval_metrics["perplexity"])
    all_metrics["active_heads"].append(active_heads)
    
    # Take a snapshot of gate activity
    gate_snapshot = visualize_gate_activity(pruned_model)
    all_metrics["gate_snapshots"].append(gate_snapshot)
    
    # Report progress
    print(f"Active heads: {active_heads}/{total_heads} ({active_heads/total_heads:.1%})")
    
    # Update growth order for the next epoch
    growth_order = get_head_growth_order(pruned_model, growth_strategy, train_loader, device)

## Generate Text After Progressive Growth

Let's see how the model performs after progressive growth.

In [None]:
print("=== Generating text with progressively grown model ===\n")
for i, prompt in enumerate(prompts):
    print(f"Prompt {i+1}: {prompt}")
    output = generate_text(
        model=pruned_model,
        tokenizer=tokenizer,
        prompt=prompt,
        max_length=100,
        temperature=0.7,
        device=device
    )
    print(f"Generated: {output}\n")

## Visualize Progressive Growth Results

Let's create visualizations to analyze the progressive growth process.

In [None]:
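# Extract metrics for plotting
# Note: this rebinds `epochs` and `active_heads`, which the training cells use
# as integers; re-run the setup cell before training again
epochs = all_metrics["epochs"]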
train_loss = all_metrics["train_loss"]
eval_loss = all_metrics["eval_loss"]
perplexity = all_metrics["perplexity"]
active_heads = all_metrics["active_heads"]

# Plot training and evaluation loss
plt.figure(figsize=(10, 6))
plt.plot(epochs, train_loss, label="Training Loss", marker="o")
plt.plot(epochs, eval_loss, label="Evaluation Loss", marker="s")
plt.title("Loss During Progressive Growth", fontsize=16)
plt.xlabel("Epoch", fontsize=14)
plt.ylabel("Loss", fontsize=14)
plt.legend(fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Plot perplexity
plt.figure(figsize=(10, 6))
plt.plot(epochs, perplexity, label="Perplexity", marker="o", color="green")
plt.title("Perplexity During Progressive Growth", fontsize=16)
plt.xlabel("Epoch", fontsize=14)
plt.ylabel("Perplexity", fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Plot active heads
plt.figure(figsize=(10, 6))
plt.plot(epochs, active_heads, label="Active Heads", marker="o", color="purple")
plt.title("Active Attention Heads During Progressive Growth", fontsize=16)
plt.xlabel("Epoch", fontsize=14)
plt.ylabel("Number of Active Heads", fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Combined plot
fig, ax1 = plt.subplots(figsize=(12, 7))

# Plot loss on left axis
ax1.set_xlabel("Epoch", fontsize=14)
ax1.set_ylabel("Loss", fontsize=14, color="blue")
ax1.plot(epochs, train_loss, label="Training Loss", marker="o", color="blue", alpha=0.7)
ax1.plot(epochs, eval_loss, label="Evaluation Loss", marker="s", color="green", alpha=0.7)
ax1.tick_params(axis="y", labelcolor="blue")

# Create second y-axis for active heads
ax2 = ax1.twinx()
ax2.set_ylabel("Active Heads", fontsize=14, color="purple")
ax2.plot(epochs, active_heads, label="Active Heads", marker="d", color="purple", linestyle="--", linewidth=2)
ax2.tick_params(axis="y", labelcolor="purple")

# Add legend with combined handles from both axes
handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(handles1 + handles2, labels1 + labels2, loc="upper right", fontsize=12)

plt.title("Progressive Growth: Loss and Active Heads", fontsize=16)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## Visualize Gate Activity Evolution

Let's compare the gate activity at different stages of the progressive growth process.
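
Alongside the heatmaps, it can be useful to list exactly which heads were switched on between two snapshots. The sketch below compares two gate matrices and returns the heads that crossed the 0.5 activity threshold; `newly_activated` is a hypothetical helper, shown here on nested lists, and it should also work on the 2-D NumPy arrays stored in `gate_snapshots`.

```python
def newly_activated(before, after, threshold=0.5):
    """Return (layer, head) pairs that were inactive before and active after."""
    grown = []
    for layer_idx, (row_before, row_after) in enumerate(zip(before, after)):
        for head_idx, (g0, g1) in enumerate(zip(row_before, row_after)):
            # Inactive at or below the threshold, active strictly above it
            if g0 <= threshold < g1:
                grown.append((layer_idx, head_idx))
    return grown


# Toy snapshots with 2 layers and 2 heads
before = [[1.0, 0.001], [0.001, 0.001]]
after = [[1.0, 1.0], [0.001, 1.0]]
print(newly_activated(before, after))  # [(0, 1), (1, 1)]
```

Applied to consecutive snapshots, this gives a per-epoch growth log that can be checked against the heads reported by `grow_attention_heads`.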

In [None]:
# Create heat map of gate activities at different growth stages
snapshots = all_metrics["gate_snapshots"]
num_snapshots = len(snapshots)

# Get dimensions from first snapshot
num_layers, num_heads = snapshots[0].shape

fig, axes = plt.subplots(1, num_snapshots, figsize=(4 * num_snapshots, 6))
if num_snapshots == 1:
    axes = [axes]

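# Epoch 0 is the snapshot taken before training; read the epoch list from
# all_metrics so this cell does not depend on how `epochs` is currently bound
for i, (gates, epoch) in enumerate(zip(snapshots, [0] + list(all_metrics["epochs"]))):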
    im = axes[i].imshow(gates, cmap="YlOrRd", vmin=0, vmax=1)
    axes[i].set_title(f"Epoch {epoch}", fontsize=14)
    axes[i].set_xlabel("Attention Head", fontsize=12)
    if i == 0:
        axes[i].set_ylabel("Transformer Layer", fontsize=12)
    
    # Disable grid lines and label every head and layer
    axes[i].grid(False)
    axes[i].set_xticks(range(num_heads))
    axes[i].set_yticks(range(num_layers))

fig.colorbar(im, ax=axes, label="Gate Value")
plt.suptitle("Gate Activity Evolution During Progressive Growth", fontsize=16)
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()

## Conclusion

In this notebook, we've demonstrated Sentinel-AI's ability to perform progressive growth: starting with a heavily pruned, efficient model and strategically growing it into a more powerful system.

Key findings:

1. **Efficiency with Capability**: The model starts in a highly efficient state (90% of heads pruned) and regrows heads only up to the 30%-pruned target, rather than restoring the full architecture.

2. **Targeted Growth**: Importance scores determine the order in which inactive heads are reactivated, so the highest-scoring heads are restored first.

3. **Performance Improvement**: The loss and perplexity curves show how performance changes as heads are added, while the grown model still runs with fewer active heads than the full model.

4. **Architectural Evolution**: The gate activity snapshots show which heads were activated at each epoch and how growth is distributed across layers.

Growing a model in response to task demands avoids the need to start from an overparameterized architecture. Progressive growth represents a more biologically inspired approach to neural network development, loosely analogous to how human brains develop by first overproducing neurons and then selectively pruning and strengthening connections.