# Lab 3.1.3: NEFTune Magic - 5 Lines to a 35-Point AlpacaEval Gain

**Module:** 3.1 - Large Language Model Fine-Tuning  
**Time:** 1 hour  
**Difficulty:** ⭐⭐☆☆☆

---

## Learning Objectives

By the end of this notebook, you will:
- [ ] Understand why adding noise to embeddings helps fine-tuning
- [ ] Implement NEFTune from scratch (just 5 lines!)
- [ ] Measure NEFTune's effect on a toy model (the paper reports 29.8% → 64.7% on AlpacaEval!)
- [ ] Know how to tune the noise parameter for your task

---

## Prerequisites

- Completed: Lab 3.1.1 (LoRA Theory)
- Knowledge of: PyTorch, embeddings basics

---

## Real-World Context

### The Problem: Fine-Tuned Models Sound Repetitive

You've probably noticed that fine-tuned models sometimes:
- Give shorter, more robotic responses
- Repeat phrases or patterns
- Lose some of the base model's "personality"

Why? During fine-tuning, the model overfits to the exact patterns in your training data.

**NEFTune (Noisy Embedding Fine-Tuning)** fixes this with a brilliantly simple trick: add random noise to the input embeddings during training. This makes the model more robust and leads to more diverse, natural responses.

**The results are staggering:**
- AlpacaEval win rate: 29.8% → **64.7%** (Llama-2-7B fine-tuned on Alpaca, as reported in the NEFTune paper)
- Just 5 lines of code to implement!
- Works with any fine-tuning method (LoRA, full FT, etc.)

---

## ELI5: What is NEFTune?

> **Imagine you're learning to draw by tracing over pictures.** If you trace the exact same picture 1000 times, you'll be really good at drawing *that specific picture*, but not very good at drawing in general.
>
> **NEFTune is like drawing with slightly shaky hands.** The small wobbles force you to understand the *essence* of the drawing rather than memorizing exact lines. You learn to capture the general shape and style, not just copy pixels.
>
> **In AI terms:** By adding small random noise to the input embeddings, we prevent the model from memorizing exact token sequences. Instead, it learns more robust, generalizable patterns that transfer better to new conversations.
>
> **The secret sauce:** The noise is scaled by $\alpha/\sqrt{L \cdot D}$, where $L$ is the sequence length and $D$ is the embedding dimension. Longer sequences get less noise per token, keeping the total "shakiness" consistent.

---

## Part 1: The NEFTune Algorithm

### The Math (It's Simple!)

Given input embeddings $E \in \mathbb{R}^{B \times L \times D}$ where:
- $B$ = batch size
- $L$ = sequence length  
- $D$ = embedding dimension

NEFTune adds noise:

$$E' = E + \frac{\alpha}{\sqrt{L \cdot D}} \cdot \epsilon$$

Where:
- $\alpha$ is the noise intensity (typically 5-15)
- $\epsilon \sim \text{Uniform}(-1, 1)$ is random noise
- The $\sqrt{L \cdot D}$ scaling ties the overall noise magnitude per sequence to $\alpha$ alone, independent of sequence length and embedding size
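
The effect of the scaling can be checked with a short expectation calculation. This is a sketch that assumes the entries of $\epsilon$ are independent and uniform on $(-1, 1)$, so that $\mathbb{E}[\epsilon^2] = 1/3$. Writing $\delta = E' - E$ for the noise added to one sequence (the symbol $\delta$ is introduced here only for illustration):

$$\mathbb{E}\,\lVert \delta \rVert_2^2 = \sum_{t=1}^{L} \sum_{j=1}^{D} \frac{\alpha^2}{L \cdot D}\, \mathbb{E}[\epsilon_{tj}^2] = L \cdot D \cdot \frac{\alpha^2}{L \cdot D} \cdot \frac{1}{3} = \frac{\alpha^2}{3}$$

So the noise added to a whole sequence has an L2 norm of roughly $\alpha/\sqrt{3}$ whatever $L$ and $D$ are. For a single token $t$, the same calculation gives $\mathbb{E}\,\lVert \delta_t \rVert_2^2 = \alpha^2 / (3L)$, which shrinks as the sequence gets longer.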

### Why This Works

1. **Regularization effect**: Noise prevents memorizing exact patterns
2. **Data augmentation**: Each training example becomes slightly different each time
3. **Smoother loss landscape**: Helps optimization avoid sharp local minima
4. **Improved generalization**: Model learns robust features

In [None]:
# Setup
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from typing import Optional, Tuple, Dict
import warnings
warnings.filterwarnings('ignore')

torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name()}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

---

## Part 2: The 5-Line Implementation

Here's the entire NEFTune algorithm in just 5 lines!

In [None]:
def neftune_forward(embeddings: torch.Tensor, alpha: float = 5.0, training: bool = True) -> torch.Tensor:
    """
    NEFTune: Noisy Embedding Fine-Tuning
    
    The complete implementation in 5 lines!
    
    Args:
        embeddings: Input embeddings (batch, seq_len, embed_dim)
        alpha: Noise intensity (recommended: 5-15)
        training: Whether in training mode
    
    Returns:
        Noisy embeddings (only during training)
    """
    if not training:                                           # Line 1
        return embeddings                                      # Line 2
    dims = embeddings.shape[-1] * embeddings.shape[-2]         # Line 3: L * D
    noise = torch.rand_like(embeddings) * 2 - 1                # Line 4: Uniform(-1, 1)
    return embeddings + (alpha / dims**0.5) * noise            # Line 5: Add scaled noise


# That's it! Let's verify it works
batch, seq_len, embed_dim = 4, 128, 768
embeddings = torch.randn(batch, seq_len, embed_dim)

# Apply NEFTune
noisy_embeddings = neftune_forward(embeddings, alpha=5.0, training=True)

# Calculate noise statistics
noise_added = noisy_embeddings - embeddings
print(f"Original embeddings shape: {embeddings.shape}")
print(f"Noise statistics:")
print(f"  Mean: {noise_added.mean():.6f} (should be ~0)")
print(f"  Std:  {noise_added.std():.6f}")
print(f"  Max:  {noise_added.abs().max():.6f}")
print(f"\nNoise as % of embedding std: {noise_added.std() / embeddings.std() * 100:.2f}%")

### What's Happening?

The noise is:
1. **Uniform** between -1 and 1 (not Gaussian)
2. **Scaled** by $\alpha / \sqrt{L \cdot D}$ to keep it proportional
3. **Only during training** - inference is unaffected

The scaling means longer sequences get less noise per token, but the same "total" noise: the L2 norm of the noise over the whole sequence stays near $\alpha/\sqrt{3}$.
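
The per-token versus whole-sequence behavior can be checked numerically. The snippet below is a minimal sketch: `sample_noise` is a hypothetical helper written for this check (it is not part of any library), and the sizes are arbitrary.

```python
import math
import torch

torch.manual_seed(0)
alpha, embed_dim = 5.0, 768

def sample_noise(seq_len: int) -> torch.Tensor:
    """Draw NEFTune-style noise for one sequence (hypothetical helper)."""
    scale = alpha / math.sqrt(seq_len * embed_dim)
    return (torch.rand(seq_len, embed_dim) * 2 - 1) * scale

for seq_len in (32, 256, 2048):
    noise = sample_noise(seq_len)
    total_norm = noise.norm().item()                   # L2 norm over the whole sequence
    per_token_norm = noise.norm(dim=-1).mean().item()  # average L2 norm of one token's noise
    print(f"L={seq_len:5d}  total={total_norm:.3f}  per-token={per_token_norm:.4f}")

print(f"alpha / sqrt(3) = {alpha / math.sqrt(3):.3f}")
```

The `total` column should stay close to `alpha / sqrt(3)` for every length, while the `per-token` column falls as the sequence grows.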

---

## Part 3: Visualizing the Effect

In [None]:
def visualize_neftune_effect():
    """
    Visualize what NEFTune does to embeddings.
    """
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    
    # Sample embedding
    seq_len, embed_dim = 64, 256
    embedding = torch.randn(1, seq_len, embed_dim)
    
    # Different alpha values
    alphas = [0, 5, 15]
    
    for i, alpha in enumerate(alphas):
        noisy = neftune_forward(embedding, alpha=alpha, training=True)
        noise = noisy - embedding
        
        # Original vs noisy embedding visualization
        im = axes[0, i].imshow(
            noisy[0, :32, :32].numpy(), 
            cmap='RdBu', aspect='auto', vmin=-3, vmax=3
        )
        axes[0, i].set_title(f'Embedding (α={alpha})')
        axes[0, i].set_xlabel('Embedding dim')
        axes[0, i].set_ylabel('Sequence position')
        
        # Noise distribution
        axes[1, i].hist(noise.flatten().numpy(), bins=50, density=True, alpha=0.7, color='steelblue')
        axes[1, i].axvline(x=0, color='red', linestyle='--', linewidth=2)
        axes[1, i].set_title(f'Noise Distribution (α={alpha})')
        axes[1, i].set_xlabel('Noise value')
        axes[1, i].set_ylabel('Density')
        
        noise_std = noise.std().item()
        axes[1, i].text(0.95, 0.95, f'std={noise_std:.4f}', 
                       transform=axes[1, i].transAxes, ha='right', va='top',
                       fontsize=10, bbox=dict(boxstyle='round', facecolor='white'))
    
    plt.tight_layout()
    plt.savefig('neftune_visualization.png', dpi=150, bbox_inches='tight')
    plt.show()
    plt.close(fig)

visualize_neftune_effect()

In [None]:
def visualize_noise_scaling():
    """
    Show how noise scales with sequence length.
    """
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    embed_dim = 768
    alpha = 5.0
    seq_lengths = [32, 64, 128, 256, 512, 1024, 2048]
    
    noise_stds = []
    noise_per_token = []
    
    for seq_len in seq_lengths:
        embedding = torch.randn(1, seq_len, embed_dim)
        noisy = neftune_forward(embedding, alpha=alpha, training=True)
        noise = noisy - embedding
        
        noise_stds.append(noise.std().item())
        # Mean absolute noise per element
        noise_per_token.append(noise.abs().mean().item())
    
    # Plot noise std vs sequence length
    axes[0].plot(seq_lengths, noise_stds, 'bo-', linewidth=2, markersize=8)
    axes[0].set_xlabel('Sequence Length')
    axes[0].set_ylabel('Noise Std')
    axes[0].set_title(f'Noise Magnitude vs Sequence Length (α={alpha})')
    axes[0].set_xscale('log')
    axes[0].grid(True, alpha=0.3)
    
    # Expected: noise_std ∝ 1/sqrt(seq_len)
    expected = [noise_stds[0] * (seq_lengths[0] / sl) ** 0.5 for sl in seq_lengths]
    axes[0].plot(seq_lengths, expected, 'r--', linewidth=2, label='Expected 1/√L scaling')
    axes[0].legend()
    
    # Total noise per sample: L2 norm of the whole noise tensor.
    # For zero-mean noise, norm ≈ std × sqrt(L × D).
    total_noise = [s * (l * embed_dim) ** 0.5 for s, l in zip(noise_stds, seq_lengths)]
    axes[1].bar([str(s) for s in seq_lengths], total_noise, color='coral')
    axes[1].set_xlabel('Sequence Length')
    axes[1].set_ylabel('Total Noise (L2 norm ≈ std × √(L·D))')
    axes[1].set_title('Total Noise Roughly Constant')
    
    plt.tight_layout()
    plt.savefig('neftune_scaling.png', dpi=150, bbox_inches='tight')
    plt.show()
    plt.close(fig)
    
    print("Key Insight: Noise per token decreases with sequence length,")
    print("but total noise per sample stays roughly constant!")

visualize_noise_scaling()

---

## Part 4: Wrapping a Model with NEFTune

Let's create a proper wrapper that can be applied to any embedding layer.

### PyTorch Transformer Components Used

| Component | Description |
|-----------|-------------|
| `nn.TransformerEncoder` | A stack of N encoder layers. Applies self-attention and feedforward networks. |
| `nn.TransformerEncoderLayer` | Single encoder layer with multi-head self-attention + feedforward. Parameters: `d_model` (dim), `nhead` (attention heads), `dim_feedforward`, `batch_first=True`. |
| `nn.Embedding` | Lookup table that maps token IDs to dense vectors. |

In [None]:
class NEFTuneEmbedding(nn.Module):
    """
    Wrapper that adds NEFTune noise to any embedding layer.
    
    Usage:
        model.embed_tokens = NEFTuneEmbedding(model.embed_tokens, alpha=5.0)
    """
    
    def __init__(self, embedding_layer: nn.Embedding, alpha: float = 5.0):
        super().__init__()
        self.embedding = embedding_layer
        self.alpha = alpha
    
    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Get embeddings
        embeddings = self.embedding(input_ids)
        
        # Apply NEFTune during training
        if self.training and self.alpha > 0:
            dims = embeddings.shape[-1] * embeddings.shape[-2]
            noise = torch.rand_like(embeddings) * 2 - 1
            embeddings = embeddings + (self.alpha / dims**0.5) * noise
        
        return embeddings
    
    @property
    def weight(self):
        return self.embedding.weight
    
    @property
    def num_embeddings(self):
        return self.embedding.num_embeddings
    
    @property
    def embedding_dim(self):
        return self.embedding.embedding_dim


def apply_neftune_to_model(model: nn.Module, alpha: float = 5.0, embedding_name: str = 'embed_tokens'):
    """
    Apply NEFTune to a model's embedding layer.
    
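    Works with Hugging Face models whose input embedding is named `embed_tokens`
    (e.g. Llama-style models); pass `embedding_name` for other architectures.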
    """
    # Find the embedding layer
    for name, module in model.named_modules():
        if embedding_name in name and isinstance(module, nn.Embedding):
            # Get parent module
            parent_name = '.'.join(name.split('.')[:-1])
            child_name = name.split('.')[-1]
            
            if parent_name:
                parent = model
                for part in parent_name.split('.'):
                    parent = getattr(parent, part)
            else:
                parent = model
            
            # Wrap with NEFTune
            neftune_embedding = NEFTuneEmbedding(module, alpha=alpha)
            setattr(parent, child_name, neftune_embedding)
            print(f"Applied NEFTune (α={alpha}) to {name}")
            return
    
    print(f"Warning: Could not find embedding layer '{embedding_name}'")

In [None]:
# Demo with a simple model
class SimpleLanguageModel(nn.Module):
    """A tiny language model for demonstration."""
    
    def __init__(self, vocab_size: int = 1000, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, embed_dim)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=8, dim_feedforward=hidden_dim, batch_first=True),
            num_layers=2
        )
        self.lm_head = nn.Linear(embed_dim, vocab_size)
    
    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed_tokens(input_ids)
        x = self.transformer(x)
        return self.lm_head(x)


# Create model
model = SimpleLanguageModel(vocab_size=1000, embed_dim=256).to(device)
print("Original model:")
print(f"  Embedding type: {type(model.embed_tokens).__name__}")

# Apply NEFTune
apply_neftune_to_model(model, alpha=5.0)
print(f"\nAfter NEFTune:")
print(f"  Embedding type: {type(model.embed_tokens).__name__}")
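
An alternative to replacing the embedding module is to attach a forward hook, which leaves the module tree untouched and can be removed again after training. The snippet below is a minimal sketch of that idea, not any library's code: `neftune_hook` and the `neftune_alpha` attribute are hypothetical names chosen for this example.

```python
import torch
import torch.nn as nn

def neftune_hook(module: nn.Module, inputs: tuple, output: torch.Tensor) -> torch.Tensor:
    """Forward hook that adds NEFTune noise while the module is in training mode."""
    alpha = getattr(module, "neftune_alpha", 0.0)  # hypothetical attribute, set below
    if not module.training or alpha <= 0:
        return output
    dims = output.shape[-1] * output.shape[-2]     # L * D
    noise = torch.rand_like(output) * 2 - 1        # Uniform(-1, 1)
    return output + (alpha / dims ** 0.5) * noise

embedding = nn.Embedding(1000, 64)
embedding.neftune_alpha = 5.0
handle = embedding.register_forward_hook(neftune_hook)

ids = torch.randint(0, 1000, (2, 16))

embedding.train()
noisy = embedding(ids)      # hook active: noise added
embedding.eval()
clean = embedding(ids)      # eval mode: hook returns the output unchanged

handle.remove()             # detach the hook, e.g. before saving the model
embedding.train()
restored = embedding(ids)   # no hook: plain embeddings even in train mode

print((noisy - clean).abs().max().item())  # bounded by alpha / sqrt(L * D)
print(torch.equal(restored, clean))
```

The wrapper class keeps the noise logic visible in the model definition, while the hook keeps parameter names and `state_dict` keys unchanged, which can matter when saving or loading checkpoints.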

---

## Part 5: Comparing With and Without NEFTune

Let's train a model with and without NEFTune and measure the difference.

In [None]:
def train_with_neftune(
    use_neftune: bool = True,
    alpha: float = 5.0,
    n_epochs: int = 50,
    verbose: bool = True
) -> Dict:
    """
    Train a model with or without NEFTune.
    """
    # Create fresh model
    model = SimpleLanguageModel(vocab_size=500, embed_dim=128).to(device)
    
    if use_neftune:
        apply_neftune_to_model(model, alpha=alpha)
    
    # Generate synthetic data
    # Task: next-token prediction on sequences with a learnable pattern
    torch.manual_seed(42)
    n_samples = 1000
    seq_len = 32
    
    def make_sequences(n: int) -> Tuple[torch.Tensor, torch.Tensor]:
        # Each token is the previous one plus a random step in {1, 2, 3} (mod 500),
        # so the next token is always one of three candidates (not just random).
        start = torch.randint(0, 500, (n, 1), device=device)
        steps = torch.randint(1, 4, (n, seq_len), device=device)
        seq = torch.cat([start, start + steps.cumsum(dim=1)], dim=1) % 500
        # Inputs, and targets shifted by 1
        return seq[:, :-1].contiguous(), seq[:, 1:].contiguous()
    
    train_data, train_targets = make_sequences(n_samples)
    
    # Test data (different seed)
    torch.manual_seed(999)
    test_data, test_targets = make_sequences(200)
    
    # Training setup
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
    
    train_losses = []
    test_losses = []
    batch_size = 32
    
    for epoch in range(n_epochs):
        # Training
        model.train()
        epoch_loss = 0
        n_batches = 0
        
        perm = torch.randperm(n_samples)
        for i in range(0, n_samples, batch_size):
            idx = perm[i:i+batch_size]
            batch_x = train_data[idx]
            batch_y = train_targets[idx]
            
            optimizer.zero_grad()
            logits = model(batch_x)
            loss = F.cross_entropy(logits.view(-1, 500), batch_y.view(-1))
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
            n_batches += 1
        
        avg_train_loss = epoch_loss / n_batches
        train_losses.append(avg_train_loss)
        
        # Evaluation
        model.eval()
        with torch.no_grad():
            test_logits = model(test_data)
            test_loss = F.cross_entropy(test_logits.view(-1, 500), test_targets.view(-1))
            test_losses.append(test_loss.item())
        
        if verbose and (epoch + 1) % 10 == 0:
            print(f"  Epoch {epoch+1}: Train={avg_train_loss:.4f}, Test={test_loss.item():.4f}")
    
    # Compute output diversity
    model.eval()
    with torch.no_grad():
        # Entropy of the predicted next-token distributions (no sampling involved)
        sample_input = torch.randint(0, 500, (50, seq_len), device=device)
        logits = model(sample_input)
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean().item()
    
    # Cleanup
    del model, optimizer
    torch.cuda.empty_cache()
    
    return {
        'use_neftune': use_neftune,
        'alpha': alpha if use_neftune else 0,
        'train_losses': train_losses,
        'test_losses': test_losses,
        'final_train_loss': np.mean(train_losses[-5:]),
        'final_test_loss': np.mean(test_losses[-5:]),
        'output_entropy': entropy,  # Higher = more diverse
    }


# Run comparison
print("="*60)
print("Training WITHOUT NEFTune...")
print("="*60)
results_no_neftune = train_with_neftune(use_neftune=False)

print("\n" + "="*60)
print("Training WITH NEFTune (α=5)...")
print("="*60)
results_neftune = train_with_neftune(use_neftune=True, alpha=5.0)

In [None]:
# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Training curves
axes[0].plot(results_no_neftune['train_losses'], label='No NEFTune (train)', color='blue', linestyle='--')
axes[0].plot(results_no_neftune['test_losses'], label='No NEFTune (test)', color='blue')
axes[0].plot(results_neftune['train_losses'], label='NEFTune (train)', color='red', linestyle='--')
axes[0].plot(results_neftune['test_losses'], label='NEFTune (test)', color='red')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training vs Test Loss')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Generalization gap
methods = ['No NEFTune', 'NEFTune (α=5)']
train_final = [results_no_neftune['final_train_loss'], results_neftune['final_train_loss']]
test_final = [results_no_neftune['final_test_loss'], results_neftune['final_test_loss']]
gaps = [t - tr for t, tr in zip(test_final, train_final)]

x_pos = np.arange(len(methods))
width = 0.35
axes[1].bar(x_pos - width/2, train_final, width, label='Train', color='steelblue')
axes[1].bar(x_pos + width/2, test_final, width, label='Test', color='coral')
axes[1].set_xlabel('Method')
axes[1].set_ylabel('Loss')
axes[1].set_title('Final Losses (Lower = Better)')
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(methods)
axes[1].legend()

# Output diversity (entropy)
entropies = [results_no_neftune['output_entropy'], results_neftune['output_entropy']]
colors = ['blue', 'red']
axes[2].bar(methods, entropies, color=colors)
axes[2].set_xlabel('Method')
axes[2].set_ylabel('Output Entropy')
axes[2].set_title('Output Diversity (Higher = More Diverse)')

plt.tight_layout()
plt.savefig('neftune_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
plt.close(fig)

# Print summary
print("\n" + "="*60)
print("SUMMARY")
print("="*60)
print(f"\nWithout NEFTune:")
print(f"  Final train loss: {results_no_neftune['final_train_loss']:.4f}")
print(f"  Final test loss:  {results_no_neftune['final_test_loss']:.4f}")
print(f"  Generalization gap: {gaps[0]:.4f}")
print(f"  Output entropy: {results_no_neftune['output_entropy']:.4f}")

print(f"\nWith NEFTune (α=5):")
print(f"  Final train loss: {results_neftune['final_train_loss']:.4f}")
print(f"  Final test loss:  {results_neftune['final_test_loss']:.4f}")
print(f"  Generalization gap: {gaps[1]:.4f}")
print(f"  Output entropy: {results_neftune['output_entropy']:.4f}")

test_improvement = (results_no_neftune['final_test_loss'] - results_neftune['final_test_loss']) / results_no_neftune['final_test_loss'] * 100
diversity_improvement = (results_neftune['output_entropy'] - results_no_neftune['output_entropy']) / results_no_neftune['output_entropy'] * 100

print(f"\nNEFTune Benefits:")
print(f"  Test loss improvement: {test_improvement:.1f}%")
print(f"  Diversity improvement: {diversity_improvement:.1f}%")

---

## Part 6: Finding the Optimal Alpha

The noise intensity (alpha) matters! Let's find the sweet spot.

In [None]:
# Test different alpha values
alphas_to_test = [0, 1, 5, 10, 15, 20]
alpha_results = []

print("Testing different alpha values...")
for alpha in alphas_to_test:
    print(f"\nAlpha = {alpha}")
    result = train_with_neftune(
        use_neftune=(alpha > 0),
        alpha=alpha,
        n_epochs=30,
        verbose=False
    )
    result['alpha'] = alpha
    alpha_results.append(result)
    print(f"  Test loss: {result['final_test_loss']:.4f}, Entropy: {result['output_entropy']:.4f}")

In [None]:
# Visualize alpha sweep
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

alphas = [r['alpha'] for r in alpha_results]
test_losses = [r['final_test_loss'] for r in alpha_results]
entropies = [r['output_entropy'] for r in alpha_results]

axes[0].plot(alphas, test_losses, 'bo-', linewidth=2, markersize=8)
axes[0].set_xlabel('Alpha (noise intensity)')
axes[0].set_ylabel('Test Loss')
axes[0].set_title('Test Loss vs Alpha (Lower = Better)')
axes[0].grid(True, alpha=0.3)

# Mark the best
best_idx = np.argmin(test_losses)
axes[0].scatter([alphas[best_idx]], [test_losses[best_idx]], color='red', s=200, zorder=5, marker='*')
axes[0].annotate(f'Best: α={alphas[best_idx]}', 
                 (alphas[best_idx], test_losses[best_idx]),
                 xytext=(10, 10), textcoords='offset points')

axes[1].plot(alphas, entropies, 'ro-', linewidth=2, markersize=8)
axes[1].set_xlabel('Alpha (noise intensity)')
axes[1].set_ylabel('Output Entropy')
axes[1].set_title('Output Diversity vs Alpha (Higher = More Diverse)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('neftune_alpha_sweep.png', dpi=150, bbox_inches='tight')
plt.show()
plt.close(fig)

print(f"\nRecommendation: α = {alphas[best_idx]} gives the best test loss")
print("(In practice, α = 5 is a good default for most LLM fine-tuning tasks)")

---

## Part 7: Using NEFTune with TRL (Production)

In production, TRL (Transformer Reinforcement Learning) has NEFTune built in. It's just one parameter!

In [None]:
# Production usage with TRL
trl_example = """
from trl import SFTTrainer, SFTConfig

# Create trainer with NEFTune enabled
training_args = SFTConfig(
    output_dir="./output",
    max_seq_length=2048,  # newer TRL releases name this max_length; check your version
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=3,
    
    # NEFTune: Just add this one line!
    neftune_noise_alpha=5.0,  # Recommended: 5-15
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=training_args,
)

# Train with NEFTune automatically applied!
trainer.train()

# The improvement is dramatic:
# - Llama-2-7B on AlpacaEval: 29.8% → 64.7%
# - Llama-2-7B on MT-Bench: 5.22 → 6.02
"""

print("Production NEFTune with TRL:")
print("=" * 50)
print(trl_example)

---

## Alpha Guidelines

| Task | Recommended Alpha | Notes |
|------|------------------|-------|
| Instruction following | 5 | Default, works well |
| Chat/Dialogue | 5-10 | Higher for more diverse responses |
| Code generation | 3-5 | Lower to preserve precision |
| Creative writing | 10-15 | Higher for more creativity |
| QA/Factual | 5 | Balanced |

**General rule:** Start with α=5, increase if responses feel repetitive, decrease if they become incoherent.
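
To get a feel for what a given alpha means, it can help to compare the expected noise norm with the norm of the embeddings it perturbs. The function below is a rough sketch: `relative_noise` is a hypothetical helper, and both the i.i.d. zero-mean assumption and the `emb_std=0.02` value are illustrative assumptions, so measure the real standard deviation on your own model's embedding matrix.

```python
import math

def relative_noise(alpha: float, seq_len: int, embed_dim: int, emb_std: float) -> float:
    """Approximate ||noise|| / ||embeddings|| for one sequence.

    Assumes embedding entries are roughly i.i.d., zero-mean, with standard
    deviation `emb_std` (an assumption, not a measured value).
    """
    noise_norm = alpha / math.sqrt(3)                    # expected L2 norm of NEFTune noise
    emb_norm = emb_std * math.sqrt(seq_len * embed_dim)  # approximate L2 norm of the embeddings
    return noise_norm / emb_norm

for alpha in (5, 10, 15):
    ratio = relative_noise(alpha, seq_len=512, embed_dim=4096, emb_std=0.02)
    print(f"alpha={alpha:2d}: noise is about {ratio:.1%} of the embedding norm")
```

Because the noise norm does not grow with sequence length but the embedding norm does, the same alpha perturbs short sequences relatively more than long ones.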

---

## Try It Yourself: Exercises

### Exercise 1: Gaussian vs Uniform Noise

The original NEFTune uses uniform noise. Try Gaussian noise and compare.

<details>
<summary>Hint</summary>

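Replace `torch.rand_like(embeddings) * 2 - 1` with `torch.randn_like(embeddings)`. Note that standard Gaussian noise has a standard deviation of 1, versus $1/\sqrt{3}$ for Uniform(-1, 1), so the same alpha gives stronger noise.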
</details>

In [None]:
# Exercise 1: Your code here
def neftune_gaussian(embeddings: torch.Tensor, alpha: float = 5.0, training: bool = True) -> torch.Tensor:
    """
    NEFTune variant with Gaussian noise instead of uniform.
    """
    # Your implementation here
    pass

### Exercise 2: Per-Token Noise

What if we add different noise intensity per token position? Earlier tokens might need less noise than later ones.

<details>
<summary>Hint</summary>

Create a position-dependent alpha: `alpha_per_pos = alpha * (1 + position / seq_len)`
</details>

In [None]:
# Exercise 2: Your code here
def neftune_positional(embeddings: torch.Tensor, alpha: float = 5.0, training: bool = True) -> torch.Tensor:
    """
    NEFTune variant with position-dependent noise.
    """
    # Your implementation here
    pass

---

## Common Mistakes

### Mistake 1: Applying Noise During Inference

```python
# Wrong: Noise during inference makes outputs inconsistent
def forward(self, x):
    embeddings = self.embedding(x)
    noise = torch.rand_like(embeddings) * 2 - 1
    return embeddings + noise  # WRONG!

# Right: Only during training
def forward(self, x):
    embeddings = self.embedding(x)
    if self.training:  # Key check!
        dims = embeddings.shape[-1] * embeddings.shape[-2]
        noise = torch.rand_like(embeddings) * 2 - 1
        embeddings = embeddings + (self.alpha / dims**0.5) * noise
    return embeddings
```

### Mistake 2: Wrong Scaling

```python
# Wrong: No 1/sqrt(L * D) scaling, and rand_like samples [0, 1), so the noise is not zero-mean
noise = torch.rand_like(embeddings) * alpha  # WRONG!

# Right: Scale by sqrt(L * D)
dims = embeddings.shape[-1] * embeddings.shape[-2]
noise = torch.rand_like(embeddings) * 2 - 1
noise = (alpha / dims**0.5) * noise  # Correct!
```
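
The difference between the two versions is easy to see numerically. This is a small sketch with arbitrary tensor sizes; the variable names `wrong` and `right` are just labels for the two snippets above.

```python
import torch

torch.manual_seed(0)
alpha = 5.0
embeddings = torch.randn(4, 128, 768)
dims = embeddings.shape[-1] * embeddings.shape[-2]  # L * D

# Unscaled, one-sided noise in [0, alpha)
wrong = torch.rand_like(embeddings) * alpha
# Zero-mean noise scaled by alpha / sqrt(L * D)
right = (alpha / dims ** 0.5) * (torch.rand_like(embeddings) * 2 - 1)

print(f"wrong: mean={wrong.mean():.3f}, std={wrong.std():.3f}")
print(f"right: mean={right.mean():.5f}, std={right.std():.5f}")
print(f"embedding std: {embeddings.std():.3f}")
```

The unscaled version shifts every embedding by about `alpha / 2` and has a spread larger than the embeddings themselves, while the scaled version is centered on zero and small relative to the signal.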

### Mistake 3: Alpha Too High

```python
# Wrong: Very high alpha can destabilize training
neftune_noise_alpha=50  # Too high!

# Right: Keep in reasonable range
neftune_noise_alpha=5  # Start here
# Increase to 10-15 only if needed
```

---

## Checkpoint

You've learned:
- ✅ The NEFTune algorithm (just 5 lines!)
- ✅ Why adding noise improves generalization (29.8% → 64.7% on AlpacaEval!)
- ✅ How to implement NEFTune from scratch
- ✅ How to use NEFTune with TRL in production
- ✅ How to tune the alpha parameter for your task

---

## Key Takeaway

**NEFTune is one of the highest ROI techniques in LLM fine-tuning:**
- 5 lines of code
- Negligible extra compute cost
- Up to ~35 percentage points higher AlpacaEval win rate (29.8% → 64.7% for Llama-2-7B)
- Works with any fine-tuning method

**Try NEFTune by default when fine-tuning LLMs, and confirm the gain on your own evaluation!**

---

## Further Reading

- [NEFTune Paper](https://arxiv.org/abs/2310.05914) - Noisy Embeddings Improve Instruction Finetuning
- [TRL Documentation](https://huggingface.co/docs/trl) - NEFTune integration

---

## Cleanup

In [None]:
# Clear GPU memory
import gc

torch.cuda.empty_cache()
gc.collect()

print("Cleanup complete!")
if torch.cuda.is_available():
    print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")

---

## Next Steps

Now that you understand NEFTune, continue to:

**[Lab 3.1.4: 8B Model LoRA Fine-tuning](lab-3.1.4-8b-lora-finetuning.ipynb)** - Put it all together: LoRA + DoRA + NEFTune on a real 8B model!