# 3. The Active Learning Loop

This notebook explains what active learning is and how the loop works.

## What is Active Learning?

**Problem:** Labeling data is expensive and time-consuming. You have 10,000 audio clips but can only afford to label 100.

**Question:** Which 100 should you label to get the best model?

**Active Learning Answer:** Don't pick randomly! Let the model tell you which samples would be most helpful.

## The Core Loop

```
1. Start with:
   - Small labeled set (e.g., 10 samples)
   - Large unlabeled set (e.g., 990 samples)

2. Train model on labeled set

3. Use model to select "informative" samples from unlabeled set
   (e.g., samples the model is uncertain about)

4. Label those samples (human annotator)

5. Add to labeled set

6. Repeat steps 2-5 until the labeling budget is spent or the model stops improving
```

## Key Concepts

### Labeled vs Unlabeled Indices

We maintain two lists:
- `labeled_indices`: Indices of samples we've labeled (and can train on)
- `unlabeled_indices`: Indices of samples we haven't labeled yet

In [2]:
# Simulating the state
total_samples = 100

# Start: 5 labeled, 95 unlabeled
labeled_indices = [0, 1, 2, 3, 4]
unlabeled_indices = list(range(5, 100))

print(f"Labeled: {len(labeled_indices)} samples")
print(f"Unlabeled: {len(unlabeled_indices)} samples")
print(f"\nLabeled indices: {labeled_indices}")
print(f"Unlabeled indices (first 10): {unlabeled_indices[:10]}")

Labeled: 5 samples
Unlabeled: 95 samples

Labeled indices: [0, 1, 2, 3, 4]
Unlabeled indices (first 10): [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
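
A quick sanity check can catch bookkeeping bugs early: the two lists should never overlap, and together they should cover every sample. The sketch below assumes the indices are plain Python integers; `check_partition` is a hypothetical helper, not part of this notebook's implementation.

```python
def check_partition(labeled_indices, unlabeled_indices, total_samples):
    """Return True if the two lists split range(total_samples) with no overlap or duplicates."""
    labeled, unlabeled = set(labeled_indices), set(unlabeled_indices)
    no_duplicates = (len(labeled) == len(labeled_indices)
                     and len(unlabeled) == len(unlabeled_indices))
    disjoint = labeled.isdisjoint(unlabeled)
    complete = (labeled | unlabeled) == set(range(total_samples))
    return no_duplicates and disjoint and complete

print(check_partition([0, 1, 2, 3, 4], list(range(5, 100)), 100))  # True
print(check_partition([0, 1, 5], list(range(5, 100)), 100))        # False: 5 appears in both
```

Running a check like this after every update is cheap at this scale and makes index mix-ups visible immediately.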


## Sampling Strategies

How do we select which unlabeled samples to label next?

### 1. Random Sampling (Baseline)
Pick samples uniformly at random. This is simple, and it is the baseline that any smarter strategy should beat.

In [3]:
import numpy as np

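def sample_random(unlabeled_indices, n_samples=5):
    """Random sampling strategy: pick n_samples indices uniformly at random"""
    # Cap the request so we never ask for more samples than remain
    n_samples = min(n_samples, len(unlabeled_indices))
    selected = np.random.choice(unlabeled_indices, size=n_samples, replace=False)
    return selected.tolist()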

# Sample 5 random samples
selected = sample_random(unlabeled_indices, n_samples=5)
print(f"Randomly selected samples: {selected}")

Randomly selected samples: [90, 42, 38, 32, 60]
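
Because `np.random.choice` draws from NumPy's global random state, the selection changes on every run, which makes baseline comparisons hard to repeat. The sketch below shows one way to make it reproducible with `np.random.default_rng`; the function name `sample_random_seeded` and the `seed` parameter are assumptions for illustration.

```python
import numpy as np

def sample_random_seeded(unlabeled_indices, n_samples=5, seed=0):
    """Random sampling with a fixed seed so the selection is repeatable."""
    rng = np.random.default_rng(seed)
    # Never request more samples than remain in the pool
    n = min(n_samples, len(unlabeled_indices))
    return rng.choice(unlabeled_indices, size=n, replace=False).tolist()

pool = list(range(5, 100))
first = sample_random_seeded(pool, n_samples=5, seed=42)
second = sample_random_seeded(pool, n_samples=5, seed=42)
print(first == second)  # True: same seed, same selection
```

In a multi-round loop you would more likely create the generator once and pass it in, so that each round draws fresh samples from the same reproducible stream.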


### 2. Uncertainty Sampling (Smarter)
Pick samples the model is most uncertain about.

**Intuition:** If the model predicts probabilities [0.51, 0.49], it's very uncertain! This sample would be informative.

If the model predicts [0.99, 0.01], it's very confident. This sample is less informative.
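
To make the intuition concrete, the sketch below scores both example predictions with two common uncertainty measures: least confidence (1 minus the top probability) and entropy. It is only an illustration of the arithmetic, using plain PyTorch tensors.

```python
import torch

probs = torch.tensor([[0.51, 0.49],   # model is torn between the two classes
                      [0.99, 0.01]])  # model is confident in class 0

# Least confidence: 1 minus the highest class probability
least_confidence = 1 - probs.max(dim=1).values

# Entropy: -sum(p * log p); largest when probabilities are spread evenly
entropy = -(probs * torch.log(probs)).sum(dim=1)

print(least_confidence)  # approximately [0.49, 0.01]
print(entropy)           # approximately [0.693, 0.056]
```

Both measures rank the first prediction as far more uncertain, so either one would send it to the annotator first.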

In [4]:
import torch
import torch.nn.functional as F

def sample_uncertainty(model, embeddings, unlabeled_indices, n_samples=5):
    """Uncertainty sampling: select samples with highest prediction uncertainty"""
    model.eval()
    
    with torch.no_grad():
        # Get predictions for all unlabeled samples
        X_unlabeled = embeddings[unlabeled_indices]
        logits = model(X_unlabeled)
        probs = F.softmax(logits, dim=1)
        
        # Uncertainty metric: least confidence, 1 - max_probability
        # (higher = more uncertain). Entropy is a common alternative.
        max_probs, _ = torch.max(probs, dim=1)
        uncertainty = 1 - max_probs
        
        # Select top-k most uncertain; cap k so topk never exceeds the pool size
        k = min(n_samples, len(unlabeled_indices))
        top_k_indices = torch.topk(uncertainty, k=k).indices
        
        # Convert back to original indices
        selected = [unlabeled_indices[i] for i in top_k_indices.tolist()]
    
    return selected

# Demo with fake data
fake_model = torch.nn.Linear(10, 5)  # Untrained stand-in for a 5-class classifier
fake_embeddings = torch.randn(100, 10)

selected = sample_uncertainty(fake_model, fake_embeddings, unlabeled_indices, n_samples=5)
print(f"Uncertainty-based selected samples: {selected}")

Uncertainty-based selected samples: [48, 60, 13, 78, 96]
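
The function above scores samples by 1 - max probability. With more than two classes, entropy is a common alternative because it looks at the whole predicted distribution rather than only the top class. The following is a minimal sketch of an entropy-based variant with the same signature; the name `sample_entropy` is hypothetical, and `log_softmax` is used here for numerical stability.

```python
import torch
import torch.nn.functional as F

def sample_entropy(model, embeddings, unlabeled_indices, n_samples=5):
    """Select the unlabeled samples whose predicted distribution has the highest entropy."""
    model.eval()
    with torch.no_grad():
        logits = model(embeddings[unlabeled_indices])
        # log_softmax avoids log(0) when a probability underflows
        log_probs = F.log_softmax(logits, dim=1)
        entropy = -(log_probs.exp() * log_probs).sum(dim=1)
        k = min(n_samples, len(unlabeled_indices))
        top_k = torch.topk(entropy, k=k).indices
    # Map positions within the unlabeled pool back to dataset indices
    return [unlabeled_indices[i] for i in top_k.tolist()]

torch.manual_seed(0)
demo_model = torch.nn.Linear(10, 5)      # stand-in classifier
demo_embeddings = torch.randn(100, 10)   # stand-in embeddings
pool = list(range(5, 100))
picked = sample_entropy(demo_model, demo_embeddings, pool, n_samples=5)
print(picked)
```

For binary problems the two measures produce the same ranking; they can disagree once there are three or more classes.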


## Adding Samples to Labeled Set

Once we've selected samples, we:
1. Remove them from `unlabeled_indices`
2. Add them to `labeled_indices`

In [5]:
def add_samples(labeled_indices, unlabeled_indices, selected_indices):
    """Move selected samples from unlabeled to labeled.

    Both lists are modified in place and also returned for convenience.
    """
    for idx in selected_indices:
        if idx in unlabeled_indices:
            unlabeled_indices.remove(idx)
            labeled_indices.append(idx)
    return labeled_indices, unlabeled_indices

# Add the selected samples
print(f"Before: Labeled={len(labeled_indices)}, Unlabeled={len(unlabeled_indices)}")
labeled_indices, unlabeled_indices = add_samples(labeled_indices, unlabeled_indices, selected)
print(f"After:  Labeled={len(labeled_indices)}, Unlabeled={len(unlabeled_indices)}")
print(f"\nLabeled indices: {labeled_indices}")

Before: Labeled=5, Unlabeled=95
After:  Labeled=10, Unlabeled=90

Labeled indices: [0, 1, 2, 3, 4, 48, 60, 13, 78, 96]
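
`list.remove` scans the whole list for every selected index, which is fine for 100 samples but gets slow for large pools. The sketch below shows one possible alternative that uses a set for membership checks and returns new lists instead of modifying its inputs; `add_samples_fast` is a hypothetical name, not part of this notebook's implementation.

```python
def add_samples_fast(labeled_indices, unlabeled_indices, selected_indices):
    """Return new (labeled, unlabeled) lists with the selected samples moved over."""
    pool = set(unlabeled_indices)
    # Keep only selections that are actually in the pool, dropping repeats
    moved = []
    for idx in selected_indices:
        if idx in pool:
            pool.remove(idx)
            moved.append(idx)
    new_labeled = list(labeled_indices) + moved
    # Preserve the original ordering of the remaining unlabeled samples
    new_unlabeled = [idx for idx in unlabeled_indices if idx in pool]
    return new_labeled, new_unlabeled

labeled, unlabeled = add_samples_fast([0, 1, 2], [3, 4, 5, 6], [5, 3, 5, 99])
print(labeled)    # [0, 1, 2, 5, 3]
print(unlabeled)  # [4, 6]
```

Returning fresh lists also avoids surprises when the same list object is shared between cells.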


## Complete Active Learning Cycle

Let's simulate one complete cycle:

In [6]:
import torch.nn as nn
import torch.optim as optim

# Setup
n_samples = 100
n_features = 10
n_classes = 3

# Create fake dataset
embeddings = torch.randn(n_samples, n_features)
labels = torch.randint(0, n_classes, (n_samples,))

# Initialize
model = nn.Linear(n_features, n_classes)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

labeled_indices = [0, 1, 2, 3, 4]  # Start with 5 labeled
unlabeled_indices = list(range(5, n_samples))

print("=== ACTIVE LEARNING CYCLE ===")
print(f"\nStep 1: Initial state")
print(f"  Labeled: {len(labeled_indices)}")
print(f"  Unlabeled: {len(unlabeled_indices)}")

# Train on labeled data
print(f"\nStep 2: Training on labeled data...")
# The training tensors do not change between epochs, so build them once
X_train = embeddings[labeled_indices]
y_train = labels[labeled_indices]
model.train()
for epoch in range(10):
    optimizer.zero_grad()
    logits = model(X_train)
    loss = criterion(logits, y_train)
    loss.backward()
    optimizer.step()
print(f"  Final loss: {loss.item():.4f}")

# Sample using uncertainty
print(f"\nStep 3: Selecting most informative samples...")
selected = sample_uncertainty(model, embeddings, unlabeled_indices, n_samples=5)
print(f"  Selected: {selected}")

# Add to labeled set (the human labeling step is simulated: `labels` already holds the answers)
print(f"\nStep 4: Adding to labeled set...")
labeled_indices, unlabeled_indices = add_samples(labeled_indices, unlabeled_indices, selected)
print(f"  Labeled: {len(labeled_indices)}")
print(f"  Unlabeled: {len(unlabeled_indices)}")

print(f"\n=== CYCLE COMPLETE ===")
print(f"We grew our labeled set from 5 to {len(labeled_indices)} samples!")

=== ACTIVE LEARNING CYCLE ===

Step 1: Initial state
  Labeled: 5
  Unlabeled: 95

Step 2: Training on labeled data...
  Final loss: 0.8675

Step 3: Selecting most informative samples...
  Selected: [31, 94, 64, 74, 11]

Step 4: Adding to labeled set...
  Labeled: 10
  Unlabeled: 90

=== CYCLE COMPLETE ===
We grew our labeled set from 5 to 10 samples!
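
One cycle adds five labels; in practice the cycle repeats until the labeling budget runs out. The sketch below runs several rounds on synthetic data, with the `labels` tensor standing in for the human annotator. The function name `run_active_learning` and the `budget`, `batch_size`, and `epochs` parameters are assumptions for illustration, and retraining a fresh model each round is one design choice among several.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def run_active_learning(embeddings, labels, n_classes, budget=25, batch_size=5, epochs=10):
    """Repeat train -> select -> label -> add until `budget` samples are labeled."""
    labeled = list(range(batch_size))                       # initial seed set
    unlabeled = list(range(batch_size, len(embeddings)))
    while len(labeled) < budget and unlabeled:
        # Train a fresh model on everything labeled so far
        model = nn.Linear(embeddings.shape[1], n_classes)
        optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
        model.train()
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = F.cross_entropy(model(embeddings[labeled]), labels[labeled])
            loss.backward()
            optimizer.step()
        # Select the least-confident samples from the pool
        model.eval()
        with torch.no_grad():
            probs = F.softmax(model(embeddings[unlabeled]), dim=1)
        uncertainty = 1 - probs.max(dim=1).values
        k = min(batch_size, budget - len(labeled), len(unlabeled))
        picked = [unlabeled[i] for i in torch.topk(uncertainty, k=k).indices.tolist()]
        # "Labeling" is simulated: labels[picked] is already known here
        picked_set = set(picked)
        labeled.extend(picked)
        unlabeled = [i for i in unlabeled if i not in picked_set]
        print(f"Labeled: {len(labeled)}, Unlabeled: {len(unlabeled)}, loss: {loss.item():.4f}")
    return labeled, unlabeled

torch.manual_seed(0)
X = torch.randn(100, 10)
y = torch.randint(0, 3, (100,))
labeled, unlabeled = run_active_learning(X, y, n_classes=3, budget=25)
```

Because the labels here are random, the loss values carry no meaning; the point is only the bookkeeping of the loop. With real data you would also track accuracy on a held-out set after each round to decide when to stop.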


## Summary

**Active Learning in 4 Steps:**
1. **Train** on current labeled data
2. **Sample** informative examples from unlabeled data
3. **Label** those examples (human or oracle)
4. **Add** to labeled set and repeat

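**Key Benefit:** By spending the labeling budget on the samples the model finds hardest, you can often reach a given accuracy with fewer labels than random sampling would need.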

**In our implementation:**
- `ActiveLearner` class manages this loop
- API endpoints expose each step
- Frontend visualizes the process