# 🔥 FILE 3-A: Advanced Training Techniques

**PART 3 - ADVANCED & PROFESSIONAL**

---

## 📋 Contents

✅ Custom Loss Functions

✅ Custom Layers/Modules

✅ Advanced Gradient Operations

✅ Advanced Learning Rate Scheduling

✅ Gradient Accumulation

✅ Debugging Training Issues

✅ Advanced Optimization Tricks

---

## ⏱️ Study Time: 3-4 hours

---

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Device: {device}")

---

# 1️⃣ Custom Loss Functions

## Why Write a Custom Loss?

Built-in losses (MSE, CrossEntropy) are not always a good fit:
- Domain-specific requirements
- Multi-task learning
- Weighted losses
- Custom metrics optimization
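
As a small illustration of the "weighted losses" case, the sketch below weights cross-entropy by inverse class frequency through the `weight` argument of `F.cross_entropy`. The toy targets and the inverse-frequency formula are assumptions chosen for the example, not a recommendation for any particular dataset.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6, 3)                    # 6 samples, 3 classes
targets = torch.tensor([0, 0, 0, 0, 1, 2])    # class 0 dominates

# Assumed weighting scheme: inverse class frequency
counts = torch.bincount(targets, minlength=3).float()
class_weights = counts.sum() / (len(counts) * counts)

# Built-in weighted cross-entropy
weighted = F.cross_entropy(logits, targets, weight=class_weights)

# Same value by hand: weighted mean of the per-sample losses
per_sample = F.cross_entropy(logits, targets, reduction='none')
w = class_weights[targets]
manual = (w * per_sample).sum() / w.sum()

print(f"class weights: {class_weights.tolist()}")
print(f"weighted CE:   {weighted.item():.4f}")
print(f"manual check:  {manual.item():.4f}")
```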

In [None]:
print("=" * 70)
print("CUSTOM LOSS - BASIC")
print("=" * 70)

class FocalLoss(nn.Module):
    """Focal Loss for imbalanced classification"""
    
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
    
    def forward(self, inputs, targets):
        # inputs: (N, C) logits
        # targets: (N,) class indices
        
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        p_t = torch.exp(-ce_loss)
        focal_loss = self.alpha * (1 - p_t) ** self.gamma * ce_loss
        
        return focal_loss.mean()

# Test
logits = torch.randn(10, 5)  # 10 samples, 5 classes
targets = torch.randint(0, 5, (10,))

focal = FocalLoss(alpha=0.25, gamma=2.0)
loss = focal(logits, targets)

print(f"\nFocal Loss: {loss.item():.4f}")

# Compare with CrossEntropy
ce = nn.CrossEntropyLoss()
ce_loss = ce(logits, targets)
print(f"CrossEntropy: {ce_loss.item():.4f}")

print("""

FOCAL LOSS:
  - Down-weight easy examples
  - Focus on hard examples
  - Good for imbalanced data
  
  FL = -α(1-p_t)^γ * log(p_t)
  
  γ=0 → α * CrossEntropy
  γ↑ → Focus more on hard examples
""")

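The two properties of focal loss listed above can be checked numerically. This is a minimal sketch using a standalone function `focal_loss` (a hypothetical helper that mirrors the `FocalLoss.forward` formula but returns per-sample values); the "easy" and "hard" logits are made-up inputs.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Per-sample focal loss, same formula as FocalLoss.forward."""
    ce = F.cross_entropy(logits, targets, reduction='none')
    p_t = torch.exp(-ce)                      # probability of the true class
    return alpha * (1 - p_t) ** gamma * ce

torch.manual_seed(0)
logits = torch.randn(10, 5)
targets = torch.randint(0, 5, (10,))

# 1) With gamma=0 and alpha=1 the focal loss equals plain cross-entropy
ce = F.cross_entropy(logits, targets, reduction='none')
fl_gamma0 = focal_loss(logits, targets, alpha=1.0, gamma=0.0)

# 2) Easy examples are scaled down much more than hard ones
t = torch.tensor([0])
easy = torch.tensor([[5.0, 0.0]])   # confident and correct
hard = torch.tensor([[0.0, 5.0]])   # confident and wrong
ratio_easy = (focal_loss(easy, t, alpha=1.0) / F.cross_entropy(easy, t)).item()
ratio_hard = (focal_loss(hard, t, alpha=1.0) / F.cross_entropy(hard, t)).item()

print(f"max |FL(gamma=0) - CE|: {(fl_gamma0 - ce).abs().max().item():.2e}")
print(f"FL/CE on easy example:  {ratio_easy:.6f}")
print(f"FL/CE on hard example:  {ratio_hard:.6f}")
```
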
In [None]:
print("=" * 70)
print("CUSTOM LOSS - MULTI-TASK")
print("=" * 70)

class MultiTaskLoss(nn.Module):
    """Combine multiple losses for multi-task learning"""
    
    def __init__(self, task_weights=None):
        super().__init__()
        self.task_weights = task_weights or [1.0, 1.0]
    
    def forward(self, outputs, targets):
        # outputs: dict {'task1': pred1, 'task2': pred2}
        # targets: dict {'task1': y1, 'task2': y2}
        
        # Task 1: Classification
        loss1 = F.cross_entropy(outputs['classification'], targets['classification'])
        
        # Task 2: Regression
        loss2 = F.mse_loss(outputs['regression'], targets['regression'])
        
        # Weighted combination
        total_loss = (self.task_weights[0] * loss1 + 
                     self.task_weights[1] * loss2)
        
        return total_loss, {'cls_loss': loss1.item(), 'reg_loss': loss2.item()}

# Example usage
outputs = {
    'classification': torch.randn(8, 3),
    'regression': torch.randn(8, 1)
}
targets = {
    'classification': torch.randint(0, 3, (8,)),
    'regression': torch.randn(8, 1)
}

mtl = MultiTaskLoss(task_weights=[1.0, 0.5])
total_loss, individual_losses = mtl(outputs, targets)

print(f"\nTotal Loss: {total_loss.item():.4f}")
print(f"Classification Loss: {individual_losses['cls_loss']:.4f}")
print(f"Regression Loss: {individual_losses['reg_loss']:.4f}")
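
To show where the dictionary outputs might come from, here is a sketch of a model with a shared trunk and one head per task. `TwoHeadNet` and its layer sizes are hypothetical; the loss combination uses the same weights as the example above (1.0 and 0.5), and one backward pass sends gradients from both tasks into the shared trunk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadNet(nn.Module):
    """Hypothetical model: shared trunk, one head per task."""

    def __init__(self, in_dim=16, hidden=32, num_classes=3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.cls_head = nn.Linear(hidden, num_classes)
        self.reg_head = nn.Linear(hidden, 1)

    def forward(self, x):
        h = self.trunk(x)
        return {'classification': self.cls_head(h),
                'regression': self.reg_head(h)}

torch.manual_seed(0)
net = TwoHeadNet()
x = torch.randn(8, 16)
targets = {'classification': torch.randint(0, 3, (8,)),
           'regression': torch.randn(8, 1)}

outputs = net(x)
loss_cls = F.cross_entropy(outputs['classification'], targets['classification'])
loss_reg = F.mse_loss(outputs['regression'], targets['regression'])
total = 1.0 * loss_cls + 0.5 * loss_reg

# A single backward pass populates gradients for the trunk and both heads
total.backward()
trunk_grad = net.trunk[0].weight.grad
print(f"total={total.item():.4f}, trunk grad norm={trunk_grad.norm().item():.4f}")
```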

---

# 2️⃣ Custom Layers

## Custom Linear Layer

In [None]:
print("=" * 70)
print("CUSTOM LAYER")
print("=" * 70)

class CustomLinear(nn.Module):
    """Custom implementation of Linear layer"""
    
    def __init__(self, in_features, out_features, bias=True):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        
        # Initialize parameters
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        if bias:
            self.bias = nn.Parameter(torch.zeros(out_features))
        else:
            self.register_parameter('bias', None)
        
        # Initialize weights (Kaiming uniform, the scheme nn.Linear uses by default)
        self.reset_parameters()
    
    def reset_parameters(self):
        nn.init.kaiming_uniform_(self.weight, a=np.sqrt(5))
        if self.bias is not None:
            fan_in, _ = nn.init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / np.sqrt(fan_in)
            nn.init.uniform_(self.bias, -bound, bound)
    
    def forward(self, x):
        # y = xW^T + b
        output = x @ self.weight.t()
        if self.bias is not None:
            output = output + self.bias
        return output

# Test
custom = CustomLinear(10, 5)
builtin = nn.Linear(10, 5)

x = torch.randn(3, 10)
out1 = custom(x)
out2 = builtin(x)

print(f"\nCustom output shape: {out1.shape}")
print(f"Built-in output shape: {out2.shape}")
print(f"\nCustom parameters: {sum(p.numel() for p in custom.parameters())}")
print(f"Built-in parameters: {sum(p.numel() for p in builtin.parameters())}")
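
Matching shapes and parameter counts do not prove that the forward pass is the same. One way to check the formula `y = xW^T + b` itself is to reuse the weights of an `nn.Linear` and compare the outputs, as in this small sketch (the tolerance is an arbitrary choice).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
builtin = nn.Linear(10, 5)
x = torch.randn(3, 10)

# Reuse the built-in layer's parameters
weight = builtin.weight.detach().clone()   # shape (out_features, in_features)
bias = builtin.bias.detach().clone()       # shape (out_features,)

manual = x @ weight.t() + bias             # the formula used in CustomLinear.forward
functional = F.linear(x, weight, bias)     # functional form of the same operation
reference = builtin(x)

print(f"max |manual - reference|:     {(manual - reference).abs().max().item():.2e}")
print(f"max |functional - reference|: {(functional - reference).abs().max().item():.2e}")
```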

In [None]:
print("=" * 70)
print("ATTENTION LAYER")
print("=" * 70)

class SelfAttention(nn.Module):
    """Simple self-attention mechanism"""
    
    def __init__(self, embed_dim):
        super().__init__()
        self.embed_dim = embed_dim
        
        # Query, Key, Value projections
        self.query = nn.Linear(embed_dim, embed_dim)
        self.key = nn.Linear(embed_dim, embed_dim)
        self.value = nn.Linear(embed_dim, embed_dim)
        
        self.scale = np.sqrt(embed_dim)
    
    def forward(self, x):
        # x: (batch, seq_len, embed_dim)
        
        Q = self.query(x)  # (batch, seq_len, embed_dim)
        K = self.key(x)
        V = self.value(x)
        
        # Attention scores
        scores = torch.bmm(Q, K.transpose(1, 2)) / self.scale
        attn_weights = F.softmax(scores, dim=-1)
        
        # Apply attention to values
        output = torch.bmm(attn_weights, V)
        
        return output, attn_weights

# Test
attn = SelfAttention(embed_dim=64)
x = torch.randn(2, 10, 64)  # batch=2, seq_len=10, embed=64

output, weights = attn(x)
print(f"\nInput: {x.shape}")
print(f"Output: {output.shape}")
print(f"Attention weights: {weights.shape}")
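
The core of the layer above is scaled dot-product attention. The sketch below isolates that step as a plain function and adds an optional causal mask, which is an extension not present in `SelfAttention`; the function name `scaled_dot_product` is made up for this example.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product(q, k, v, causal=False):
    """softmax(QK^T / sqrt(d)) V, optionally hiding future positions."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)      # (batch, seq, seq)
    if causal:
        seq_len = q.size(-2)
        # True above the diagonal marks "future" positions to hide
        future = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device),
            diagonal=1,
        )
        scores = scores.masked_fill(future, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
q = k = v = torch.randn(2, 10, 64)
out, w = scaled_dot_product(q, k, v, causal=True)

print(f"Output: {tuple(out.shape)}, weights: {tuple(w.shape)}")
print(f"Row sums close to 1: {torch.allclose(w.sum(-1), torch.ones(2, 10))}")
print(f"Max weight on future positions: {torch.triu(w, diagonal=1).max().item()}")
```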

---

# 3️⃣ Advanced Gradient Operations

## Gradient Clipping

In [None]:
print("=" * 70)
print("GRADIENT CLIPPING")
print("=" * 70)

model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 1)
)

optimizer = optim.Adam(model.parameters(), lr=0.001)

# Dummy data
x = torch.randn(32, 10)
y = torch.randn(32, 1)

# Forward
outputs = model(x)
loss = F.mse_loss(outputs, y)

# Backward
optimizer.zero_grad()
loss.backward()

# Check gradients before clipping
total_norm_before = 0
for p in model.parameters():
    if p.grad is not None:
        total_norm_before += p.grad.data.norm(2).item() ** 2
total_norm_before = total_norm_before ** 0.5

print(f"\nGradient norm before clipping: {total_norm_before:.4f}")

# Clip gradients
max_norm = 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# Check after clipping
total_norm_after = 0
for p in model.parameters():
    if p.grad is not None:
        total_norm_after += p.grad.data.norm(2).item() ** 2
total_norm_after = total_norm_after ** 0.5

print(f"Gradient norm after clipping: {total_norm_after:.4f}")

optimizer.step()

print("""

GRADIENT CLIPPING:
  - Prevents exploding gradients
  - Essential for RNNs, LSTMs
  - max_norm = 1.0 (typical)
  
USAGE:
  loss.backward()
  torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
  optimizer.step()
""")

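The manual norm loops above can be shortened, because `clip_grad_norm_` returns the total gradient norm measured before clipping. A minimal sketch, assuming deliberately large targets so that clipping actually triggers:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))

x = torch.randn(32, 10)
y = torch.randn(32, 1) * 100          # large targets, so gradients are large
loss = F.mse_loss(model(x), y)
loss.backward()

max_norm = 1.0
# The return value is the total norm *before* clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# Recompute the total norm to confirm the gradients were rescaled
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

print(f"Norm before clipping: {norm_before.item():.4f}")
print(f"Norm after clipping:  {norm_after.item():.4f}")
```
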
## Gradient Accumulation

In [None]:
print("=" * 70)
print("GRADIENT ACCUMULATION")
print("=" * 70)

print("""
GRADIENT ACCUMULATION:
  - Simulate larger batch size
  - Useful when GPU memory limited
  - Accumulate gradients over N batches
  - Then update weights
"""
)

model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

accumulation_steps = 4  # Effective batch = 32 * 4 = 128

# Training loop
for step in range(10):
    # Dummy mini-batch
    x = torch.randn(32, 10)
    y = torch.randn(32, 1)
    
    # Forward
    outputs = model(x)
    loss = criterion(outputs, y)
    
    # Normalize loss (average over accumulation steps)
    loss = loss / accumulation_steps
    
    # Backward (accumulate gradients)
    loss.backward()
    
    # Update weights every N steps
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
        print(f"Step {step+1}: Updated weights (effective batch=128)")

print("""

BENEFITS:
  ✅ Larger effective batch size
  ✅ Same memory as small batch
  ✅ Better gradient estimates
  
TRADEOFF:
  ❌ Fewer weight updates per epoch (each update needs N forward/backward passes)
""")

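Dividing the loss by `accumulation_steps` is what makes the accumulated gradient match the gradient of one large batch. The sketch below checks this for a mean-reduced loss with equally sized micro-batches; the sizes (128 samples split into 4 chunks of 32) are chosen to mirror the loop above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()              # mean reduction

x = torch.randn(128, 10)
y = torch.randn(128, 1)

# Reference: one backward pass over the full batch of 128
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulation: 4 micro-batches of 32, each loss divided by 4
accumulation_steps = 4
model.zero_grad()
for xb, yb in zip(x.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    loss = criterion(model(xb), yb) / accumulation_steps
    loss.backward()                   # gradients add up in .grad
accum_grad = model.weight.grad.clone()

print(f"max |full - accumulated|: {(full_grad - accum_grad).abs().max().item():.2e}")
```
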
---

# 4️⃣ Advanced LR Scheduling

## Warmup + Cosine Decay

In [None]:
print("=" * 70)
print("WARMUP + COSINE ANNEALING")
print("=" * 70)

class WarmupCosineScheduler:
    """Warmup followed by cosine annealing"""
    
    def __init__(self, optimizer, warmup_steps, total_steps, min_lr=1e-6):
        self.optimizer = optimizer
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps
        self.min_lr = min_lr
        self.base_lr = optimizer.param_groups[0]['lr']
        self.current_step = 0
    
    def step(self):
        self.current_step += 1
        
        if self.current_step < self.warmup_steps:
            # Linear warmup
            lr = self.base_lr * self.current_step / self.warmup_steps
        else:
            # Cosine annealing (clamp progress so the LR stays at min_lr past total_steps)
            progress = min(1.0, (self.current_step - self.warmup_steps) / (self.total_steps - self.warmup_steps))
            lr = self.min_lr + (self.base_lr - self.min_lr) * 0.5 * (1 + np.cos(np.pi * progress))
        
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = lr
        
        return lr

# Demo
model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = WarmupCosineScheduler(optimizer, warmup_steps=100, total_steps=1000)

lrs = []
for step in range(1000):
    lr = scheduler.step()
    lrs.append(lr)

# Plot
plt.figure(figsize=(12, 5))
plt.plot(lrs, linewidth=2)
plt.axvline(x=100, color='r', linestyle='--', alpha=0.5, label='Warmup end')
plt.xlabel('Step')
plt.ylabel('Learning Rate')
plt.title('Warmup + Cosine Annealing')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("""
WARMUP + COSINE:
  - Warmup: Gradually increase LR (stabilize training)
  - Cosine: Smooth decay
  - Very effective for Transformers
  - Used in BERT, GPT, etc.
""")

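The same schedule can also be expressed with PyTorch's built-in `LambdaLR`, which multiplies the base LR by a factor returned from a function of the step count. This is a sketch under assumptions: the helper name `warmup_cosine_factor` is made up, `min_ratio=1e-3` stands for `min_lr / base_lr` (1e-6 / 1e-3), and warmup uses `(step + 1)` so the first step does not run with a learning rate of zero.

```python
import math
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine_factor(step, warmup_steps=100, total_steps=1000, min_ratio=1e-3):
    """Multiplier applied to the base LR at a given step."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps                   # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)                          # hold at the floor afterwards
    return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress))

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=0.001)
scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine_factor)

lrs = []
for _ in range(1000):
    lrs.append(optimizer.param_groups[0]['lr'])
    optimizer.step()      # in real training this follows loss.backward()
    scheduler.step()

print(f"first LR: {lrs[0]:.2e}, peak LR: {max(lrs):.2e}, last LR: {lrs[-1]:.2e}")
```
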
---

# 5️⃣ Debugging Training

## Monitor Gradients

In [None]:
print("=" * 70)
print("GRADIENT MONITORING")
print("=" * 70)

class GradientMonitor:
    """Monitor gradient statistics"""
    
    def __init__(self, model):
        self.model = model
    
    def check_gradients(self):
        stats = {
            'mean': [],
            'std': [],
            'max': [],
            'min': [],
            'norm': []
        }
        
        for name, param in self.model.named_parameters():
            if param.grad is not None:
                grad = param.grad.data
                
                stats['mean'].append(grad.mean().item())
                stats['std'].append(grad.std().item())
                stats['max'].append(grad.max().item())
                stats['min'].append(grad.min().item())
                stats['norm'].append(grad.norm().item())
                
                # Check for issues
                if grad.norm().item() > 100:
                    print(f"⚠️  Large gradient in {name}: {grad.norm().item():.2f}")
                if grad.norm().item() < 1e-7:
                    print(f"⚠️  Vanishing gradient in {name}: {grad.norm().item():.2e}")
                if torch.isnan(grad).any():
                    print(f"❌ NaN gradient in {name}")
        
        return stats

# Test
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 1)
)

monitor = GradientMonitor(model)

# Dummy forward/backward
x = torch.randn(32, 10)
y = torch.randn(32, 1)
loss = F.mse_loss(model(x), y)
loss.backward()

stats = monitor.check_gradients()

print(f"\nGradient statistics:")
print(f"  Mean: {np.mean(stats['mean']):.6f}")
print(f"  Std: {np.mean(stats['std']):.6f}")
print(f"  Max: {np.max(stats['max']):.6f}")
print(f"  Min: {np.min(stats['min']):.6f}")
print(f"  Norm: {np.mean(stats['norm']):.6f}")
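
Monitoring is most useful when it can also stop a bad update. The sketch below shows one possible guard: a hypothetical helper `grads_are_finite` that checks for NaN/Inf gradients, used to skip the optimizer step. The NaN is injected by hand purely to exercise the check.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grads_are_finite(model):
    """True if every existing gradient contains only finite values."""
    return all(torch.isfinite(p.grad).all().item()
               for p in model.parameters() if p.grad is not None)

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(32, 10), torch.randn(32, 1)
F.mse_loss(model(x), y).backward()
healthy = grads_are_finite(model)

# Simulate a corrupted gradient (for demonstration only)
model[0].weight.grad[0, 0] = float('nan')
corrupted = not grads_are_finite(model)

before = model[0].weight.detach().clone()
if grads_are_finite(model):
    optimizer.step()
else:
    optimizer.zero_grad()   # drop the bad gradients and skip this update
unchanged = torch.equal(before, model[0].weight.detach())

print(f"healthy before injection: {healthy}")
print(f"corruption detected:      {corrupted}")
print(f"weights left unchanged:   {unchanged}")
```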

## Weight Initialization Check

In [None]:
print("=" * 70)
print("WEIGHT INITIALIZATION")
print("=" * 70)

def init_weights(m):
    """Custom weight initialization"""
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)
    elif isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)

# Apply initialization
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1)
)

model.apply(init_weights)

print("\n✅ Weights initialized")

# Check initialization
for name, param in model.named_parameters():
    if 'weight' in name:
        print(f"{name:20s}: mean={param.mean():.4f}, std={param.std():.4f}")

print("""

COMMON INITIALIZATIONS:

1. Xavier/Glorot:
   nn.init.xavier_uniform_(weight)
   → For tanh, sigmoid

2. Kaiming/He:
   nn.init.kaiming_normal_(weight, nonlinearity='relu')
   → For ReLU, LeakyReLU

3. Orthogonal:
   nn.init.orthogonal_(weight)
   → For RNNs
""")

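The printed standard deviations can be compared with what Kaiming initialization is expected to produce: for ReLU the gain is sqrt(2), so the target std is sqrt(2 / fan), where fan is `out_features` in `fan_out` mode and `in_features` in `fan_in` mode. A quick check on a layer large enough for a stable estimate (the layer size is an arbitrary choice):

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(512, 256)        # weight shape: (out=256, in=512)

# fan_out mode: std = sqrt(2 / out_features)
nn.init.kaiming_normal_(layer.weight, mode='fan_out', nonlinearity='relu')
std_fan_out = layer.weight.std().item()
expected_fan_out = math.sqrt(2.0 / 256)

# fan_in mode: std = sqrt(2 / in_features)
nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
std_fan_in = layer.weight.std().item()
expected_fan_in = math.sqrt(2.0 / 512)

print(f"fan_out: measured {std_fan_out:.4f}, expected {expected_fan_out:.4f}")
print(f"fan_in:  measured {std_fan_in:.4f}, expected {expected_fan_in:.4f}")
```
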
---

# ✅ Summary of FILE 3-A

## What You Learned

✅ **Custom Loss**: Focal Loss, Multi-task Loss

✅ **Custom Layers**: Linear, Self-Attention

✅ **Gradient Operations**: Clipping, Accumulation

✅ **Advanced LR Scheduling**: Warmup + Cosine

✅ **Debugging**: Gradient monitoring, Weight init

---

## Best Practices

```python
# Complete training setup
model = YourModel()
model.apply(init_weights)  # Custom initialization

optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
scheduler = WarmupCosineScheduler(optimizer, warmup_steps=100, total_steps=1000)

for epoch in range(epochs):
    for batch in dataloader:
        # Forward
        loss = model(batch)
        
        # Backward
        optimizer.zero_grad()
        loss.backward()
        
        # Clip gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        
        # Update
        optimizer.step()
        scheduler.step()
```

---

## Next

📕 **FILE 3-B: Transfer Learning & Mixed Precision**

---