# Hyperparameter Optimization (HPO) Tutorial

This tutorial teaches you how to optimize neural network hyperparameters using Neural DSL's built-in HPO capabilities.

## What You'll Learn

1. HPO syntax and configuration
2. Optimizing layer parameters
3. Optimizing optimizer settings
4. Running HPO trials
5. Analyzing and applying results
6. Multi-framework HPO (TensorFlow and PyTorch)

**Time:** ~30 minutes  
**Level:** Intermediate

## What is HPO?

Hyperparameter optimization (HPO) automatically searches for well-performing values of settings such as:
- **Layer sizes** (number of units, filters)
- **Learning rate** and optimizer settings
- **Dropout rates** for regularization
- **Architecture choices** (kernel sizes, activations)

Instead of manual trial-and-error, HPO systematically searches the space of possibilities.
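
To make "systematically searches" concrete, here is a minimal random-search sketch in plain Python. It is not how Neural DSL searches internally: the search space, the `toy_objective` function (a stand-in for training a model and reading its validation loss), and all names are hypothetical and exist only for illustration.

```python
import random

# Hypothetical search space, mirroring the kinds of parameters listed above.
SEARCH_SPACE = {
    "units": [64, 128, 256, 512],
    "dropout": [0.2, 0.3, 0.4, 0.5],
    "learning_rate": [1e-4, 1e-3, 1e-2],
}


def toy_objective(config):
    """Stand-in for 'train the model and return validation loss' (lower is better)."""
    return (
        abs(config["units"] - 256) / 256
        + abs(config["dropout"] - 0.3)
        + abs(config["learning_rate"] - 1e-3) * 100
    )


def random_search(space, objective, n_trials, seed=0):
    """Sample n_trials random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = random_search(SEARCH_SPACE, toy_objective, n_trials=20)
print(best_config, round(best_score, 4))
```

Each loop iteration corresponds to one trial. Real HPO tools typically replace the purely random sampling with a smarter strategy, but the trial loop keeps the same shape.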

## Setup

In [None]:
# Install Neural DSL with HPO support
!pip install neural-dsl optuna tensorflow

import neural
print(f"Neural DSL version: {neural.__version__}")

## Example 1: Basic HPO - Optimizing Dense Units

Let's start with a simple example: finding the best number of units in a dense layer.

In [None]:
# Define model with HPO for Dense units
basic_hpo_model = """
network BasicHPO {
  input: (28, 28, 1)
  
  layers:
    Flatten()
    # Try different numbers of units: 64, 128, 256, or 512
    Dense(units=HPO(choice(64, 128, 256, 512)), activation="relu")
    Dropout(rate=0.5)
    Output(units=10, activation="softmax")
  
  loss: "sparse_categorical_crossentropy"
  optimizer: Adam(learning_rate=0.001)
  metrics: ["accuracy"]
  
  train {
    epochs: 5
    batch_size: 64
    validation_split: 0.2
  }
}
"""

with open('basic_hpo.neural', 'w') as f:
    f.write(basic_hpo_model)

print("✅ Model with HPO saved to 'basic_hpo.neural'")

### HPO Syntax Explained

```yaml
Dense(units=HPO(choice(64, 128, 256, 512)))
```

- **`HPO(...)`**: Marks parameter for optimization
- **`choice(64, 128, 256, 512)`**: Try these discrete values
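- Each trial picks one of these values, and the best-scoring value is kept at the end. Depending on the search strategy, a small number of trials may not cover every value.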
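
Before launching a search, it can help to list which parameters a model file marks for optimization. The sketch below uses a regular expression for that. It is an illustrative helper, not part of Neural DSL, and it assumes HPO expressions have the simple `HPO(function(arguments))` shape shown above, without nested parentheses in the arguments.

```python
import re

# Matches HPO(<function>(<arguments>)) where the arguments contain no nested parentheses.
HPO_PATTERN = re.compile(r"HPO\(\s*(\w+)\(([^()]*)\)\s*\)")


def list_hpo_expressions(dsl_text):
    """Return (function_name, raw_arguments) pairs for each HPO expression found."""
    return [(m.group(1), m.group(2).strip()) for m in HPO_PATTERN.finditer(dsl_text)]


example = 'Dense(units=HPO(choice(64, 128, 256, 512)), activation="relu")'
print(list_hpo_expressions(example))
```

Passing the full contents of a `.neural` file instead of `example` returns one pair per tunable parameter.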

In [None]:
# Run HPO (this will take a few minutes)
!neural hpo basic_hpo.neural --backend tensorflow --trials 4 --output optimized_basic.neural

print("\n✅ HPO completed! Best configuration saved to 'optimized_basic.neural'")

## Example 2: Multiple Parameter HPO

Let's optimize multiple parameters simultaneously:

In [None]:
# Model with multiple HPO parameters
multi_hpo_model = """
network MultiHPO {
  input: (28, 28, 1)
  
  layers:
    # Optimize Conv2D filters: try 16, 32, or 64
    Conv2D(
      filters=HPO(choice(16, 32, 64)),
      kernel_size=(3, 3),
      activation="relu"
    )
    MaxPooling2D(pool_size=(2, 2))
    
    Flatten()
    
    # Optimize Dense units: range from 64 to 256, step by 32
    Dense(
      units=HPO(range(64, 256, step=32)),
      activation="relu"
    )
    
    # Optimize dropout rate: try values between 0.2 and 0.7
    Dropout(rate=HPO(range(0.2, 0.7, step=0.1)))
    
    Output(units=10, activation="softmax")
  
  loss: "sparse_categorical_crossentropy"
  optimizer: Adam(learning_rate=0.001)
  metrics: ["accuracy"]
  
  train {
    epochs: 5
    batch_size: 64
    validation_split: 0.2
  }
}
"""

with open('multi_hpo.neural', 'w') as f:
    f.write(multi_hpo_model)

print("✅ Multi-parameter HPO model saved")

### HPO Functions Reference

| Function | Usage | Example |
|----------|-------|--------|
| `choice(...)` | Discrete values | `HPO(choice(32, 64, 128))` |
| `range(min, max, step=...)` | Evenly stepped numeric range (integer or float) | `HPO(range(10, 100, step=10))` |
| `log_range(min, max)` | Log-scale range | `HPO(log_range(1e-5, 1e-2))` |

**When to use each:**
- `choice`: When you have specific values to try
- `range`: For evenly spaced numeric values (units, filters, dropout rates)
- `log_range`: For learning rates (vary across orders of magnitude)
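
One way to picture the three functions is as samplers that each produce one value per trial. The sketch below is a plain-Python illustration under stated assumptions: the helper names are hypothetical, `range` is assumed to include its upper endpoint, and none of this is confirmed to match Neural DSL's internals. Optuna, which the setup cell installs, exposes comparable primitives (`suggest_categorical`, `suggest_int`, and `suggest_float` with `log=True`), but whether Neural DSL maps onto them this way is also an assumption.

```python
import math
import random


def sample_choice(rng, *values):
    """Pick one of the listed values with equal probability."""
    return rng.choice(values)


def sample_range(rng, low, high, step):
    """Pick a value from low, low + step, ... up to high (endpoint assumed inclusive)."""
    n_steps = int(round((high - low) / step))
    value = low + rng.randint(0, n_steps) * step
    # Round to limit floating-point noise for float steps such as 0.1.
    return round(value, 10)


def sample_log_range(rng, low, high):
    """Pick a float whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
print(sample_choice(rng, 32, 64, 128))
print(sample_range(rng, 64, 256, step=32))
print(sample_range(rng, 0.2, 0.7, step=0.1))
print(sample_log_range(rng, 1e-5, 1e-2))
```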

In [None]:
# Run multi-parameter HPO with more trials
!neural hpo multi_hpo.neural --backend tensorflow --trials 10 --output optimized_multi.neural

print("\n✅ Multi-parameter HPO completed!")

## Example 3: Optimizing Learning Rate

Learning rate is one of the most important hyperparameters. Let's optimize it:

In [None]:
# Model with learning rate HPO
lr_hpo_model = """
network LearningRateHPO {
  input: (28, 28, 1)
  
  layers:
    Conv2D(filters=32, kernel_size=(3, 3), activation="relu")
    MaxPooling2D(pool_size=(2, 2))
    Flatten()
    Dense(units=128, activation="relu")
    Dropout(rate=0.5)
    Output(units=10, activation="softmax")
  
  loss: "sparse_categorical_crossentropy"
  
  # Use log_range for learning rate
  # Searches from 0.00001 to 0.01 on log scale
  optimizer: Adam(learning_rate=HPO(log_range(1e-5, 1e-2)))
  
  metrics: ["accuracy"]
  
  train {
    epochs: 5
    batch_size: 64
    validation_split: 0.2
  }
}
"""

with open('lr_hpo.neural', 'w') as f:
    f.write(lr_hpo_model)

print("✅ Learning rate HPO model saved")

### Why log_range for Learning Rate?

Learning rates vary across orders of magnitude:
- 0.00001 (1e-5)
- 0.0001 (1e-4)
- 0.001 (1e-3)
- 0.01 (1e-2)

`log_range` samples uniformly on a log scale, so each order of magnitude is equally likely to be sampled.
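
The difference is easy to check numerically. This standalone sketch (plain Python, independent of Neural DSL) draws samples from the interval 1e-5 to 1e-2 on a linear scale and on a log scale, then counts how many land in each decade.

```python
import math
import random

rng = random.Random(0)
LOW, HIGH, N = 1e-5, 1e-2, 10_000


def decade_counts(samples):
    """Count samples falling in [1e-5, 1e-4), [1e-4, 1e-3), and [1e-3, 1e-2]."""
    counts = [0, 0, 0]
    for x in samples:
        index = min(2, max(0, math.floor(math.log10(x)) + 5))
        counts[index] += 1
    return counts


uniform = [rng.uniform(LOW, HIGH) for _ in range(N)]
log_uniform = [10 ** rng.uniform(math.log10(LOW), math.log10(HIGH)) for _ in range(N)]

print("linear scale:", decade_counts(uniform))
print("log scale:   ", decade_counts(log_uniform))
```

On a linear scale roughly 90% of samples fall between 1e-3 and 1e-2, because that decade covers about 90% of the interval's width. On a log scale the three decades each receive about a third of the samples.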

In [None]:
# Run learning rate optimization
!neural hpo lr_hpo.neural --backend tensorflow --trials 8 --output optimized_lr.neural

print("\n✅ Learning rate optimization completed!")

## Example 4: Comprehensive HPO

Let's put it all together and optimize many parameters:

In [None]:
# Comprehensive HPO model
comprehensive_hpo = """
network ComprehensiveHPO {
  input: (28, 28, 1)
  
  layers:
    # First conv block
    Conv2D(
      filters=HPO(choice(16, 32, 64)),
      kernel_size=(3, 3),
      activation="relu"
    )
    MaxPooling2D(pool_size=(2, 2))
    
    # Second conv block
    Conv2D(
      filters=HPO(choice(32, 64, 128)),
      kernel_size=(3, 3),
      activation="relu"
    )
    MaxPooling2D(pool_size=(2, 2))
    
    Flatten()
    
    # Dense layers
    Dense(
      units=HPO(choice(64, 128, 256, 512)),
      activation="relu"
    )
    Dropout(rate=HPO(range(0.3, 0.7, step=0.1)))
    
    Dense(
      units=HPO(choice(32, 64, 128)),
      activation="relu"
    )
    Dropout(rate=HPO(range(0.2, 0.5, step=0.1)))
    
    Output(units=10, activation="softmax")
  
  loss: "sparse_categorical_crossentropy"
  
  # Optimize learning rate (batch size is tuned in the train block below)
  optimizer: Adam(learning_rate=HPO(log_range(1e-5, 1e-2)))
  metrics: ["accuracy"]
  
  train {
    epochs: 5
    # Batch size can be optimized too (powers of 2)
    batch_size: HPO(choice(32, 64, 128))
    validation_split: 0.2
  }
}
"""

with open('comprehensive_hpo.neural', 'w') as f:
    f.write(comprehensive_hpo)

print("✅ Comprehensive HPO model saved")
print("\nOptimizing: filters, units, dropout rates, learning rate, batch size")

### Note on Search Space

This model has a large search space:
- Conv1 filters: 3 options
- Conv2 filters: 3 options  
- Dense1 units: 4 options
- Dense2 units: 3 options
- Dropout rates: Multiple options
- Learning rate: Continuous
- Batch size: 3 options

**Recommendation:** Start with 20-50 trials, increase if needed.
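
To get a feel for the size of this space, the discrete options can simply be multiplied together. The counts below come from the list above. The two dropout counts assume that `range` includes both endpoints, which this tutorial does not confirm, so treat the total as an estimate. The continuous learning rate is left out.

```python
import math

# Number of candidate values per discrete parameter in ComprehensiveHPO.
# The dropout counts assume range() includes both endpoints (an assumption).
discrete_options = {
    "conv1_filters": 3,   # choice(16, 32, 64)
    "conv2_filters": 3,   # choice(32, 64, 128)
    "dense1_units": 4,    # choice(64, 128, 256, 512)
    "dense2_units": 3,    # choice(32, 64, 128)
    "dropout1_rate": 5,   # range(0.3, 0.7, step=0.1) -> 0.3 ... 0.7
    "dropout2_rate": 4,   # range(0.2, 0.5, step=0.1) -> 0.2 ... 0.5
    "batch_size": 3,      # choice(32, 64, 128)
}

grid_size = math.prod(discrete_options.values())
print(f"Discrete combinations: {grid_size}")
```

Under these assumptions there are 6,480 discrete combinations before the learning rate is even considered, so 20 to 50 trials sample well under 1% of the grid. That is why the search cannot be exhaustive here.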

In [None]:
# Run comprehensive HPO (will take longer)
# Adjust --trials based on time available
!neural hpo comprehensive_hpo.neural --backend tensorflow --trials 20 --output best_model.neural

print("\n✅ Comprehensive HPO completed!")
print("Best configuration saved to 'best_model.neural'")

## Analyzing HPO Results

Let's examine the optimized model:

In [None]:
# Read optimized model
with open('best_model.neural', 'r') as f:
    optimized_model = f.read()

print("Optimized Model Configuration:")
print("=" * 50)
print(optimized_model)

### Comparing Before and After

HPO replaces each `HPO(...)` expression with the best value found:

**Before:**
```yaml
Dense(units=HPO(choice(64, 128, 256, 512)))
optimizer: Adam(learning_rate=HPO(log_range(1e-5, 1e-2)))
```

**After:**
```yaml
Dense(units=256)  # Best value found
optimizer: Adam(learning_rate=0.0003)  # Best learning rate
```
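
For larger models, a line-by-line diff makes the substitutions easier to spot than reading both files. The sketch below uses Python's standard `difflib` module on two in-memory strings that reuse the example values above. With real files, you would read `comprehensive_hpo.neural` and `best_model.neural` into the two strings instead. The helper name `diff_configs` is made up for this example.

```python
import difflib


def diff_configs(before_text, after_text):
    """Return only the added/removed lines of a unified diff between two DSL texts."""
    diff = difflib.unified_diff(
        before_text.splitlines(),
        after_text.splitlines(),
        fromfile="before",
        tofile="after",
        lineterm="",
        n=0,  # no context lines
    )
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


before = """Dense(units=HPO(choice(64, 128, 256, 512)))
Dropout(rate=0.5)
optimizer: Adam(learning_rate=HPO(log_range(1e-5, 1e-2)))"""

after = """Dense(units=256)
Dropout(rate=0.5)
optimizer: Adam(learning_rate=0.0003)"""

for line in diff_configs(before, after):
    print(line)
```

Lines starting with `-` come from the original model and lines starting with `+` from the optimized one. The unchanged `Dropout` line does not appear.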

## Multi-Framework HPO

Neural DSL's HPO works across TensorFlow and PyTorch!

In [None]:
# Run HPO with PyTorch backend
!pip install torch torchvision

# Same model, different backend
!neural hpo comprehensive_hpo.neural --backend pytorch --trials 10 --output best_model_pytorch.neural

print("\n✅ PyTorch HPO completed!")

## HPO Best Practices

### 1. Start Small
```yaml
# ❌ Too many candidate values for a single parameter
Dense(units=HPO(range(10, 1000, step=10)))  # ~100 options!

# ✅ Start with discrete choices
Dense(units=HPO(choice(64, 128, 256)))  # 3 options
```

### 2. Use Appropriate Ranges
```yaml
# ✅ Learning rate: log scale
optimizer: Adam(learning_rate=HPO(log_range(1e-5, 1e-2)))

# ✅ Units: linear range
Dense(units=HPO(range(32, 512, step=32)))

# ✅ Dropout: small steps
Dropout(rate=HPO(range(0.2, 0.8, step=0.1)))
```

### 3. Prioritize Important Parameters

Most impact:
1. Learning rate
2. Network architecture (layers, units)
3. Regularization (dropout)

Less impact:
- Batch size (try 32, 64, 128)
- Minor architectural details

### 4. Use Early Stopping
```yaml
train {
  epochs: 20
  early_stopping: 5  # Stop if no improvement
}
```
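
The sketch below shows the usual patience-based logic behind a setting like this. It assumes that `early_stopping: 5` means "stop after 5 consecutive epochs without improvement in validation loss", which is the common convention but is not spelled out here. The function name and the loss values are hypothetical.

```python
def epochs_run(val_losses, patience):
    """Return how many epochs run before patience-based early stopping triggers.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation loss seen so far.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses: improvement stalls after epoch 4.
losses = [0.90, 0.60, 0.45, 0.40, 0.41, 0.42, 0.40, 0.43, 0.44, 0.45, 0.46, 0.47]
print(epochs_run(losses, patience=5))
```

With these losses the run stops after epoch 9 instead of using all 12 epochs. During HPO that saving applies to every trial that plateaus early.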

### 5. Parallel Trials
```bash
# Speed up HPO with parallel execution
neural hpo model.neural --parallel 4
```
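
Trials are independent of each other, which is what makes running them side by side possible. The sketch below illustrates the idea with Python's standard `concurrent.futures` module and a toy `run_trial` function standing in for model training. It is not how `--parallel` is implemented, and real training workloads usually need separate processes or devices rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_trial(units):
    """Stand-in for training one configuration; returns (units, validation loss)."""
    return units, abs(units - 256) / 256


candidates = [64, 128, 256, 512]

# Four workers, analogous in spirit to `--parallel 4`.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_trial, candidates))

best_units, best_loss = min(results, key=lambda r: r[1])
print(best_units, best_loss)
```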

## Common HPO Patterns

### Pattern 1: Layer Size Progression
```yaml
# Optimize the first layer; set later layers by hand
Dense(units=HPO(choice(64, 128, 256)))  # Optimized
Dense(units=64)  # Fixed by hand; does not scale with the optimized value
```

### Pattern 2: Learning Rate Schedule
```yaml
optimizer: Adam(
  learning_rate=HPO(log_range(1e-5, 1e-2))
)
lr_schedule: ExponentialDecay(
  initial_lr=0.001,
  decay_rate=HPO(range(0.9, 0.99, step=0.01))
)
```
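
To see what the `decay_rate` search range means in practice, the sketch below evaluates the standard exponential decay formula, `initial_lr * decay_rate ** (step / decay_steps)`. The `decay_steps` parameter and its value of 1000 are assumptions, since the snippet above does not specify one, and Neural DSL's `ExponentialDecay` may be parameterized differently.

```python
def exponential_decay(initial_lr, decay_rate, step, decay_steps=1000):
    """Learning rate after `step` updates: initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)


for decay_rate in (0.90, 0.95, 0.99):
    lr = exponential_decay(initial_lr=0.001, decay_rate=decay_rate, step=5000)
    print(f"decay_rate={decay_rate:.2f} -> lr after 5000 steps: {lr:.6f}")
```

Values near 0.99 leave the learning rate almost unchanged over a short run, while 0.90 shrinks it noticeably, so the two ends of the range behave quite differently.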

### Pattern 3: Architecture Search
```yaml
# Optimize the filter count of each conv block
Conv2D(filters=HPO(choice(16, 32, 64)))
MaxPooling2D(pool_size=(2, 2))
Conv2D(filters=HPO(choice(32, 64, 128)))
```

## Troubleshooting HPO

### Problem: HPO takes too long
**Solution:**
- Reduce `--trials`
- Reduce `epochs` in train block
- Use smaller search spaces
- Enable `--parallel`

### Problem: No improvement in trials
**Solution:**
- Widen search ranges
- Check data preprocessing
- Verify model architecture
- Increase epochs per trial

### Problem: Memory errors
**Solution:**
- Reduce max batch size in HPO
- Limit max units/filters
- Use gradient accumulation
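
Gradient accumulation trades memory for extra passes: several small batches are processed one at a time and their gradients are averaged before a single weight update. The toy sketch below (plain Python, a one-parameter model `y = w * x`, hypothetical data) shows that with equally sized micro-batches the averaged gradient matches the full-batch gradient. How to enable accumulation in Neural DSL is not covered here.

```python
def mean_gradient(w, batch):
    """Gradient of mean squared error for y = w * x, averaged over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
w = 0.5

# Full batch of 4: needs all four examples in memory at once.
full = mean_gradient(w, data)

# Two micro-batches of 2: average their gradients before the weight update.
micro_batches = [data[:2], data[2:]]
accumulated = sum(mean_gradient(w, mb) for mb in micro_batches) / len(micro_batches)

print(full, accumulated)
```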

### Problem: Unstable training
**Solution:**
- Narrow learning rate range
- Add gradient clipping
- Use batch normalization
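
As an illustration of gradient clipping, the sketch below rescales a gradient vector whenever its L2 norm exceeds a threshold, which caps the size of any single update. It is a framework-independent toy with a hypothetical function name. TensorFlow and PyTorch ship their own clipping utilities, and this tutorial does not show how Neural DSL exposes them.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Scale gradients down so their combined L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]


print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))  # norm 5 is scaled down to 1
print(clip_by_global_norm([0.3, 0.4], max_norm=1.0))  # norm 0.5 is left unchanged
```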

## Summary

You learned:

✅ HPO syntax: `HPO(choice(...))`, `HPO(range(...))`, `HPO(log_range(...))`  
✅ Optimizing layer parameters (units, filters, dropout)  
✅ Optimizing optimizer settings (learning rate)  
✅ Running HPO trials with `neural hpo`  
✅ Analyzing and applying results  
✅ Multi-framework HPO (TensorFlow and PyTorch)  
✅ Best practices and common patterns  

## Next Steps

- **[Advanced Architectures Tutorial](advanced_architectures.ipynb)** - Complex models
- **[Cloud Tutorial](cloud_tutorial.ipynb)** - HPO in the cloud
- **[Production Deployment](deployment_tutorial.ipynb)** - Deploy optimized models

## Resources

- **[HPO Guide](../examples/hpo_guide.md)** - Detailed HPO documentation
- **[Examples](../../examples/)** - More HPO examples
- **[Discord](https://discord.gg/KFku4KvS)** - Get help with HPO

Happy optimizing! 🎯