In [None]:
"""
AutoML & Hyperparameter Optimization - Setup

Production AutoML stack:
- HPO Frameworks: Optuna, Ray Tune, Hyperopt, KerasTuner, AutoGluon
- Optimization Algorithms: Bayesian Optimization (TPE, GP), Evolutionary (CMA-ES, NSGA-II)
- NAS: DARTS, ENAS, NASBench, AutoKeras
- Multi-Fidelity: Hyperband, ASHA, BOHB
- Experiment Tracking: Weights & Biases, MLflow, TensorBoard
"""

import numpy as np
import pandas as pd
from dataclasses import dataclass, field
from typing import List, Dict, Any, Tuple, Optional, Callable
from collections import defaultdict
import time
import uuid
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

print("✅ Setup complete - Ready for AutoML and hyperparameter optimization!")

## 1️⃣ Grid Search vs Random Search

### 📝 What's Happening in This Code?

**Purpose:** Compare basic hyperparameter search strategies: grid search and random search

**Key Concepts:**

**1. Grid Search**
- **Concept**: Exhaustively try all combinations of hyperparameter values
- **Algorithm**:
  1. Define discrete values for each hyperparameter (e.g., learning_rate = [0.001, 0.01, 0.1])
  2. Generate all combinations (Cartesian product)
  3. Train model with each combination
  4. Select combination with best validation performance
- **Complexity**: Exponential in number of hyperparameters
  - 3 hyperparams × 10 values each = 10³ = 1,000 trials
  - 5 hyperparams × 10 values each = 10⁵ = 100,000 trials (infeasible!)
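
To make the exponential growth concrete, here is a minimal sketch that counts grid-search trials for a small search space. The `grid_size` helper and the example values are illustrative assumptions, not part of the optimizer classes defined later.

```python
import itertools
import math

def grid_size(search_space):
    """Number of grid-search trials: product of per-parameter value counts."""
    return math.prod(len(values) for values in search_space.values())

# Hypothetical search space with 3 hyperparameters
space = {
    "learning_rate": [0.001, 0.01, 0.1],
    "max_depth": [4, 8],
    "n_estimators": [100, 200],
}

# The Cartesian product enumerates exactly grid_size(space) combinations
combos = list(itertools.product(*space.values()))
print(grid_size(space), len(combos))

# With ten values per hyperparameter, the trial count grows as 10**k
for k in (3, 5):
    print(f"{k} hyperparameters -> {10 ** k:,} trials")
```

Adding one more hyperparameter multiplies the trial count by its number of values, which is why grids stop being practical after a handful of dimensions.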

**2. Random Search**
- **Concept**: Sample hyperparameter combinations randomly from search space
- **Algorithm**:
  1. Define distributions for each hyperparameter (e.g., learning_rate ~ LogUniform(1e-5, 1e-1))
  2. Sample N random combinations
  3. Train model with each sample
  4. Select best performing combination
- **Advantage**: More efficient than grid search for high-dimensional spaces
  - **Bergstra & Bengio (2012)**: Random search found configurations as good as or better than grid search using a small fraction of the compute
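
Step 1 mentions a `LogUniform(1e-5, 1e-1)` distribution. A minimal sketch of such a sampler follows; `sample_log_uniform` is a hypothetical helper (the `RandomSearchOptimizer` below samples uniformly instead), and the seed is arbitrary.

```python
import numpy as np

def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high))."""
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
samples = [sample_log_uniform(1e-5, 1e-1, rng) for _ in range(1000)]

# Count draws per decade: 1e-5..1e-4, 1e-4..1e-3, 1e-3..1e-2, 1e-2..1e-1
decades = np.floor(np.log10(samples)).astype(int)
counts = {d: int((decades == d).sum()) for d in range(-5, -1)}
print(counts)
```

Each decade receives roughly a quarter of the draws, whereas plain uniform sampling on the same interval would put about 90% of the draws in the top decade.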

**3. Why Random > Grid?**
- **Coverage**: Random search explores more of the hyperparameter space
- **Important hyperparameters**: If only 2 of 10 hyperparameters matter, random search tries more values for those 2
- **Continuous spaces**: Grid requires discretization, random samples continuous values
- **Diminishing returns**: Grid wastes trials on less important regions

**Mathematical Insight:**
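For hyperparameters with low importance (flat response surface), grid search wastes many trials: every trial that varies only an unimportant hyperparameter re-tests a value of the important ones that was already tried. Random search draws a fresh value on every dimension in every trial, so N trials test N distinct values of each important hyperparameter.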
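
A small sketch of this effect, assuming a toy 2-D search space on [0, 1] and a budget of 9 trials (both chosen only for illustration):

```python
import itertools
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
budget = 9  # same number of trials for both strategies

# Grid: 3 values per axis -> 9 trials, but only 3 distinct values per hyperparameter
axis = np.linspace(0.0, 1.0, 3)
grid_trials = np.array(list(itertools.product(axis, axis)))

# Random: 9 trials, each drawing a fresh value on every axis
random_trials = rng.uniform(0.0, 1.0, size=(budget, 2))

# Count how many distinct values of the first hyperparameter each strategy tested
distinct_grid = len(np.unique(grid_trials[:, 0]))
distinct_random = len(np.unique(random_trials[:, 0]))
print(distinct_grid, distinct_random)
```

If only the first hyperparameter matters, the grid has effectively run 3 experiments while random search has run 9.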

**Why This Matters:**
- **Baseline**: Random search is the minimum viable HPO strategy
- **Cost**: Grid search can cost $10,000+ in compute (100K trials × $0.10/trial)
- **Speed**: As a rule of thumb, random search can reach a ~90%-optimal solution in about 10% of the trials (problem-dependent)
- **Practical**: Easy to implement, no complex optimization logic

**Post-Silicon Example:**
Optimize yield prediction model (5 hyperparameters):
- **Grid search**: 10⁵ trials × 2 min/trial = 200,000 min ≈ 139 days
- **Random search**: 100 trials × 2 min/trial = 200 min ≈ 3.3 hours (find 85% optimal)
- **Business value**: $18.7M/year from finding good hyperparameters in hours vs months

In [None]:
from dataclasses import dataclass
from typing import Dict, Any, List, Callable
import itertools
import random

@dataclass
class HPOTrial:
    """Single hyperparameter optimization trial"""
    trial_id: str
    hyperparameters: Dict[str, Any]
    score: float
    duration_seconds: float

class GridSearchOptimizer:
    """Exhaustive grid search over hyperparameter space"""
    
    def __init__(self, search_space: Dict[str, List[Any]], metric: Callable):
        self.search_space = search_space
        self.metric = metric  # Function: hyperparameters -> score (higher = better)
        self.trials: List[HPOTrial] = []
        
    def optimize(self, max_trials: int = None) -> HPOTrial:
        """Run grid search"""
        import time
        
        # Generate all combinations
        param_names = list(self.search_space.keys())
        param_values = [self.search_space[name] for name in param_names]
        combinations = list(itertools.product(*param_values))
        
        print(f"Grid Search: {len(combinations)} total combinations")
        
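        # Limit trials if specified. Sample at random: truncating the ordered
        # Cartesian product would only ever vary the last hyperparameters.
        if max_trials and len(combinations) > max_trials:
            print(f"⚠️ Limiting to {max_trials} trials (would take too long!)")
            combinations = random.sample(combinations, max_trials)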
        
        # Try each combination
        for i, values in enumerate(combinations):
            hyperparams = dict(zip(param_names, values))
            
            start = time.time()
            score = self.metric(hyperparams)
            duration = time.time() - start
            
            trial = HPOTrial(
                trial_id=f"grid_{i}",
                hyperparameters=hyperparams,
                score=score,
                duration_seconds=duration
            )
            self.trials.append(trial)
            
            if (i + 1) % 10 == 0:
                print(f"  Trial {i+1}/{len(combinations)}: score={score:.4f}")
        
        # Return best trial
        best_trial = max(self.trials, key=lambda t: t.score)
        print(f"\n✅ Best score: {best_trial.score:.4f}")
        return best_trial

class RandomSearchOptimizer:
    """Random sampling from hyperparameter space"""
    
    def __init__(self, search_space: Dict[str, tuple], metric: Callable):
        # search_space format: {'param': (min, max)} or {'param': [discrete values]}
        self.search_space = search_space
        self.metric = metric
        self.trials: List[HPOTrial] = []
        
    def _sample_hyperparameters(self) -> Dict[str, Any]:
        """Sample random hyperparameter configuration"""
        hyperparams = {}
        for name, space in self.search_space.items():
            if isinstance(space, list):
                # Discrete values
                hyperparams[name] = random.choice(space)
            elif isinstance(space, tuple) and len(space) == 2:
                # Continuous range (min, max)
                hyperparams[name] = random.uniform(space[0], space[1])
            else:
                raise ValueError(f"Invalid search space for {name}")
        return hyperparams
    
    def optimize(self, n_trials: int = 100) -> HPOTrial:
        """Run random search"""
        import time
        
        print(f"Random Search: {n_trials} random trials")
        
        for i in range(n_trials):
            hyperparams = self._sample_hyperparameters()
            
            start = time.time()
            score = self.metric(hyperparams)
            duration = time.time() - start
            
            trial = HPOTrial(
                trial_id=f"random_{i}",
                hyperparameters=hyperparams,
                score=score,
                duration_seconds=duration
            )
            self.trials.append(trial)
            
            if (i + 1) % 10 == 0:
                best_so_far = max(self.trials, key=lambda t: t.score).score
                print(f"  Trial {i+1}/{n_trials}: current score={score:.4f}, best={best_so_far:.4f}")
        
        # Return best trial
        best_trial = max(self.trials, key=lambda t: t.score)
        print(f"\n✅ Best score: {best_trial.score:.4f}")
        return best_trial

# Example: Optimize yield prediction model
def yield_prediction_objective(hyperparams: Dict[str, Any]) -> float:
    """
    Simulated yield prediction model performance
    
    Post-silicon context:
    - Predict device yield% from parametric test data
    - Hyperparameters: n_estimators, max_depth, learning_rate
    - Metric: R² score (higher = better)
    """
    import numpy as np
    
    # Simulate model training (realistic response surface)
    n_est = hyperparams['n_estimators']
    depth = hyperparams['max_depth']
    lr = hyperparams['learning_rate']
    
    # Optimal around: n_est=200, depth=8, lr=0.05
    # R² formula (simulated, peaked at optimal config)
    r2 = 0.7 + 0.24 * np.exp(-((n_est - 200)**2 / 10000 + (depth - 8)**2 / 16 + (lr - 0.05)**2 / 0.01))
    
    # Add noise
    r2 += np.random.normal(0, 0.02)
    
    return max(0, min(1, r2))  # Clamp to [0, 1]

# Grid search (limited to 50 trials for speed)
print("=" * 60)
print("GRID SEARCH")
print("=" * 60)
grid_space = {
    'n_estimators': [50, 100, 200, 500],
    'max_depth': [5, 8, 12, 15],
    'learning_rate': [0.01, 0.05, 0.1]
}
grid_opt = GridSearchOptimizer(grid_space, yield_prediction_objective)
grid_best = grid_opt.optimize(max_trials=50)
print(f"Best hyperparameters: {grid_best.hyperparameters}")

print("\n" + "=" * 60)
print("RANDOM SEARCH")
print("=" * 60)
random_space = {
    'n_estimators': [50, 100, 200, 300, 500, 1000],
    'max_depth': [3, 5, 8, 10, 12, 15, 20],
    'learning_rate': (0.001, 0.3)  # Continuous range
}
random_opt = RandomSearchOptimizer(random_space, yield_prediction_objective)
random_best = random_opt.optimize(n_trials=50)
print(f"Best hyperparameters: {random_best.hyperparameters}")

# Compare
print("\n" + "=" * 60)
print("COMPARISON")
print("=" * 60)
print(f"Grid Search Best:   R² = {grid_best.score:.4f}")
print(f"Random Search Best: R² = {random_best.score:.4f}")
print(f"Winner: {'Random Search' if random_best.score > grid_best.score else 'Grid Search'}")
print(f"\n💡 Random search often finds better configs with same budget!")
print(f"   (especially for continuous hyperparameters like learning_rate)")

# Business value
baseline_r2 = 0.75
improvement = max(grid_best.score, random_best.score) - baseline_r2
revenue_per_point = 18.7e6 / 0.19  # $18.7M for 0.19 R² improvement (from 0.75 to 0.94)
value = improvement * revenue_per_point
print(f"\n💰 Business Value:")
print(f"   R² improvement: {improvement:.4f} (from {baseline_r2:.2f} baseline)")
print(f"   Annual value: ${value/1e6:.1f}M/year")

## 2️⃣ Bayesian Optimization with Gaussian Processes

### 📝 What's Happening in This Code?

**Purpose:** Implement intelligent hyperparameter optimization using Bayesian optimization with Gaussian Process surrogate model

**Key Concepts:**

**1. Bayesian Optimization**
- **Idea**: Build a probabilistic model of the objective function and use it to select the most promising hyperparameters
- **Algorithm**:
  1. **Surrogate model**: Gaussian Process approximates the unknown objective function f(x)
  2. **Acquisition function**: Decides which hyperparameters to try next (balance exploration vs exploitation)
  3. **Iterative refinement**: Update surrogate with new observations, repeat
  
**2. Gaussian Process (GP)**
- **Concept**: Distribution over functions (not just parameters)
- **Mathematics**:
  - Prior: f(x) ~ GP(μ(x), k(x, x'))
    - μ(x) = mean function (often 0)
    - k(x, x') = kernel function (Matérn 5/2 is common)
  - Posterior (after observations): f(x|D) ~ N(μ_post(x), σ²_post(x))
    - μ_post(x) = k(x, X)(K + σ²I)⁻¹y (predictive mean)
    - σ²_post(x) = k(x, x) - k(x, X)(K + σ²I)⁻¹k(X, x) (uncertainty)
  - Where:
    - X = observed hyperparameters
    - y = observed scores
    - K = kernel matrix K_ij = k(x_i, x_j)
    - σ² = observation noise variance
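
The posterior formulas can be sketched directly in NumPy. This is a minimal illustration, assuming a zero mean function, a fixed length scale of 1, and toy observations; `matern52` and `gp_posterior` are hypothetical helper names, and the `BayesianOptimizer` below relies on scikit-learn's `GaussianProcessRegressor` instead.

```python
import numpy as np

def matern52(A, B, length_scale=1.0):
    """Matérn 5/2 kernel between rows of A (n, d) and rows of B (m, d)."""
    r = np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)) / length_scale
    s = np.sqrt(5.0) * r
    return (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

def gp_posterior(X, y, X_new, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at the rows of X_new."""
    K = matern52(X, X) + noise * np.eye(len(X))  # K + sigma^2 I
    k_star = matern52(X_new, X)                   # k(x, X)
    mean = k_star @ np.linalg.solve(K, y)
    # k(x, x) = 1 for this kernel; subtract the variance explained by the data
    explained = np.einsum("ij,ji->i", k_star, np.linalg.solve(K, k_star.T))
    return mean, np.maximum(1.0 - explained, 0.0)

X = np.array([[0.0], [1.0], [2.0]])  # observed hyperparameters (toy values)
y = np.array([0.2, 0.9, 0.4])        # observed scores (toy values)
X_new = np.array([[1.0], [5.0]])     # one observed point, one far from the data

mean, var = gp_posterior(X, y, X_new)
print("mean:", mean)
print("variance:", var)
```

At the already-observed point the mean reproduces the observed score and the variance collapses towards zero; far from the data the variance returns to the prior value of about 1.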

**3. Acquisition Functions**
- **Expected Improvement (EI)**: E[max(f(x) - f(x_best), 0)]
  - Formula: EI(x) = (μ(x) - f_best)Φ(Z) + σ(x)φ(Z)
    - Z = (μ(x) - f_best) / σ(x)
    - Φ(·) = standard normal cumulative distribution function
    - φ(·) = standard normal probability density function
  - **Intuition**: Balance between high predicted value (μ(x)) and high uncertainty (σ(x))
  - **Trade-off**: Exploitation (high μ) vs Exploration (high σ)
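
A worked example of the trade-off, using made-up predictive means and standard deviations for two candidate points. The standalone `expected_improvement` function here is a sketch for a single point under maximization, not the method used by the class below.

```python
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """EI for maximization at one point with predictive mean mu and std sigma."""
    if sigma == 0.0:
        return max(mu - f_best, 0.0)  # no uncertainty: improvement is deterministic
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)

f_best = 0.90  # best score observed so far (toy value)

# Candidate A: predicted slightly better than the incumbent, very certain
ei_exploit = expected_improvement(mu=0.91, sigma=0.01, f_best=f_best)
# Candidate B: predicted slightly worse than the incumbent, very uncertain
ei_explore = expected_improvement(mu=0.88, sigma=0.10, f_best=f_best)

print(f"exploit: {ei_exploit:.4f}, explore: {ei_explore:.4f}")
```

Candidate B gets the higher EI even though its predicted mean is below the incumbent, because its large σ leaves a real chance of a big improvement.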

**4. Why Bayesian > Random?**
- **Sample efficiency**: Often finds a near-optimal config in 10-50 trials (vs 100-1000 for random search)
- **Intelligent exploration**: Uses past trials to inform next trial
- **Convergence**: Convergence guarantees to the global optimum exist under smoothness assumptions on the objective
- **Cost reduction**: Each trial avoided saves ~$10-100 in compute (especially for expensive models)

**Mathematical Insight:**
Gaussian Process posterior variance σ²_post(x) is HIGH in unexplored regions and LOW near observations. Acquisition function balances:
- High μ_post(x): Likely good performance (exploit)
- High σ_post(x): High uncertainty (explore)

**Why This Matters:**
- **Cost**: Training a large model costs $50-500 per trial
  - Random search: 100 trials × $100 = $10,000
  - Bayesian optimization: 20 trials × $100 = $2,000 (80% savings)
- **Time**: Reduce hyperparameter tuning from weeks to days
- **Quality**: Find better hyperparameters (Bayesian explores intelligently)

**Post-Silicon Example:**
Optimize wafer test time vs defect coverage (multi-objective):
- **Random search**: 200 trials, 40 hours compute
- **Bayesian optimization**: 30 trials, 6 hours compute (~7× faster, same quality)
- **Business value**: $21.3M/year from 22% test time reduction (120s → 93s)

In [None]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from scipy.stats import norm
import numpy as np

class BayesianOptimizer:
    """Bayesian Optimization with Gaussian Process surrogate model"""
    
    def __init__(self, bounds: Dict[str, tuple], metric: Callable, maximize: bool = True):
        """
        Args:
            bounds: {'param': (min, max)} for continuous hyperparameters
            metric: Function to optimize (hyperparameters -> score)
            maximize: True to maximize metric, False to minimize
        """
        self.bounds = bounds
        self.param_names = list(bounds.keys())
        self.metric = metric
        self.maximize = maximize
        
        # Gaussian Process with Matérn 5/2 kernel (smooth but flexible)
        kernel = Matern(nu=2.5)
        self.gp = GaussianProcessRegressor(
            kernel=kernel,
            alpha=1e-6,  # Noise level
            normalize_y=True,
            n_restarts_optimizer=10  # Fit kernel hyperparameters
        )
        
        self.trials: List[HPOTrial] = []
        self.X_observed = []  # Hyperparameter vectors
        self.y_observed = []  # Observed scores
        
    def _hyperparams_to_vector(self, hyperparams: Dict[str, Any]) -> np.ndarray:
        """Convert hyperparameter dict to vector"""
        return np.array([hyperparams[name] for name in self.param_names])
    
    def _vector_to_hyperparams(self, vector: np.ndarray) -> Dict[str, Any]:
        """Convert vector to hyperparameter dict"""
        return {name: float(val) for name, val in zip(self.param_names, vector)}
    
    def _expected_improvement(self, X: np.ndarray) -> np.ndarray:
        """
        Expected Improvement acquisition function
        
        EI(x) = E[max(f(x) - f_best, 0)]
              = (μ(x) - f_best)Φ(Z) + σ(x)φ(Z)
        
        Where Z = (μ(x) - f_best) / σ(x)
        """
        if len(self.y_observed) == 0:
            # No observations yet, return uniform (explore randomly)
            return np.ones(len(X))
        
        # Predict mean and std from GP
        mu, sigma = self.gp.predict(X, return_std=True)
        
        # Best observed value
        f_best = max(self.y_observed) if self.maximize else min(self.y_observed)
        
        # Predicted improvement over the incumbent (sign flipped when minimizing)
        improvement = (mu - f_best) if self.maximize else (f_best - mu)
        
        # Expected improvement; silence divide-by-zero warnings where sigma == 0
        with np.errstate(divide='ignore', invalid='ignore'):
            Z = improvement / sigma
            ei = improvement * norm.cdf(Z) + sigma * norm.pdf(Z)
        ei[sigma == 0.0] = 0.0  # Handle zero variance
        
        return ei
    
    def _suggest_next(self) -> Dict[str, Any]:
        """Suggest next hyperparameter configuration to try"""
        # Generate random candidates
        n_candidates = 1000
        candidates = np.random.uniform(
            low=[self.bounds[name][0] for name in self.param_names],
            high=[self.bounds[name][1] for name in self.param_names],
            size=(n_candidates, len(self.param_names))
        )
        
        # Compute EI for all candidates
        ei_values = self._expected_improvement(candidates)
        
        # Select candidate with highest EI
        best_idx = np.argmax(ei_values)
        best_candidate = candidates[best_idx]
        
        return self._vector_to_hyperparams(best_candidate)
    
    def optimize(self, n_trials: int = 50, n_random_init: int = 5) -> HPOTrial:
        """Run Bayesian optimization"""
        import time
        
        print(f"Bayesian Optimization: {n_trials} trials ({n_random_init} random init)")
        
        for i in range(n_trials):
            # Random initialization for first few trials
            if i < n_random_init:
                hyperparams = {
                    name: np.random.uniform(bounds[0], bounds[1])
                    for name, bounds in self.bounds.items()
                }
                method = "random_init"
            else:
                # Fit GP and suggest next trial
                self.gp.fit(np.array(self.X_observed), np.array(self.y_observed))
                hyperparams = self._suggest_next()
                method = "bayesian"
            
            # Evaluate metric
            start = time.time()
            score = self.metric(hyperparams)
            duration = time.time() - start
            
            # Record trial
            trial = HPOTrial(
                trial_id=f"bayes_{i}",
                hyperparameters=hyperparams,
                score=score,
                duration_seconds=duration
            )
            self.trials.append(trial)
            
            # Update observations
            self.X_observed.append(self._hyperparams_to_vector(hyperparams))
            self.y_observed.append(score)
            
            # Progress
            best_so_far = max(self.y_observed) if self.maximize else min(self.y_observed)
            if (i + 1) % 5 == 0:
                print(f"  Trial {i+1}/{n_trials} ({method}): score={score:.4f}, best={best_so_far:.4f}")
        
        # Return best trial
        best_idx = np.argmax(self.y_observed) if self.maximize else np.argmin(self.y_observed)
        best_trial = self.trials[best_idx]
        print(f"\n✅ Best score: {best_trial.score:.4f} (found at trial {best_idx + 1})")
        return best_trial

# Compare Bayesian vs Random
print("=" * 60)
print("BAYESIAN OPTIMIZATION")
print("=" * 60)
bayes_space = {
    'n_estimators': (50, 1000),
    'max_depth': (3, 20),
    'learning_rate': (0.001, 0.3)
}
bayes_opt = BayesianOptimizer(bayes_space, yield_prediction_objective, maximize=True)
bayes_best = bayes_opt.optimize(n_trials=30, n_random_init=5)
print(f"Best hyperparameters: {bayes_best.hyperparameters}")

print("\n" + "=" * 60)
print("COMPARISON: Bayesian vs Random (Same Budget)")
print("=" * 60)
random_opt_30 = RandomSearchOptimizer(random_space, yield_prediction_objective)
random_best_30 = random_opt_30.optimize(n_trials=30)

print(f"Bayesian (30 trials): R² = {bayes_best.score:.4f}")
print(f"Random (30 trials):   R² = {random_best_30.score:.4f}")
print(f"Winner: {'Bayesian' if bayes_best.score > random_best_30.score else 'Random'}")

improvement = bayes_best.score - random_best_30.score
if improvement > 0:
    print(f"\n💡 Bayesian is {improvement:.4f} R² points better!")
    print(f"   This demonstrates intelligent exploration vs random sampling")

# Visualize convergence
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Score over trials
axes[0].plot([t.score for t in random_opt_30.trials], 'o-', label='Random Search', alpha=0.7)
axes[0].plot([t.score for t in bayes_opt.trials], 's-', label='Bayesian Optimization', alpha=0.7)
axes[0].axhline(y=0.94, color='green', linestyle='--', label='Optimal R² (0.94)', alpha=0.5)
axes[0].set_xlabel('Trial Number')
axes[0].set_ylabel('R² Score')
axes[0].set_title('Convergence: Bayesian vs Random Search')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Best score so far
random_best_so_far = [max([t.score for t in random_opt_30.trials[:i+1]]) for i in range(len(random_opt_30.trials))]
bayes_best_so_far = [max([t.score for t in bayes_opt.trials[:i+1]]) for i in range(len(bayes_opt.trials))]

axes[1].plot(random_best_so_far, 'o-', label='Random Search', alpha=0.7)
axes[1].plot(bayes_best_so_far, 's-', label='Bayesian Optimization', alpha=0.7)
axes[1].axhline(y=0.94, color='green', linestyle='--', label='Optimal R² (0.94)', alpha=0.5)
axes[1].set_xlabel('Trial Number')
axes[1].set_ylabel('Best R² So Far')
axes[1].set_title('Best Score Over Time')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💰 Business Value:")
print(f"   Bayesian finds better config faster → reduces tuning time")
print(f"   30 trials vs 100+ for random search → 70% compute savings")
print(f"   Annual value: ${(bayes_best.score - baseline_r2) * revenue_per_point / 1e6:.1f}M/year")

## 3️⃣ Multi-Objective Optimization

### 📝 What's Happening in This Code?

**Purpose:** Optimize multiple conflicting objectives simultaneously (e.g., maximize accuracy AND minimize latency)

**Key Concepts:**

**1. Pareto Optimality**
- **Definition**: Configuration x is Pareto optimal if no other configuration dominates it
- **Dominance**: x₁ dominates x₂ if:
  - x₁ is better or equal on all objectives
  - x₁ is strictly better on at least one objective
- **Pareto front**: Set of all Pareto optimal solutions (trade-off curve)
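
The dominance rule can be checked on a handful of points. This sketch assumes every objective is maximized (latency is negated to fit that convention) and uses made-up values; `dominates` and `pareto_front` are standalone illustrative helpers, separate from the optimizer class below.

```python
def dominates(a, b):
    """True if a dominates b, assuming every objective is maximized."""
    better_or_equal = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return better_or_equal and strictly_better

def pareto_front(points):
    """Points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (accuracy, -latency_ms) pairs
points = [(0.95, -120), (0.93, -60), (0.90, -40), (0.92, -80), (0.89, -45)]

front = pareto_front(points)
print(front)
```

Here `(0.92, -80)` drops out because `(0.93, -60)` is both more accurate and faster, and `(0.89, -45)` is dominated by `(0.90, -40)`. The three remaining points are mutually non-dominated trade-offs.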

**2. Multi-Objective Problem Formulation**
- **Single-objective**: max f(x)
- **Multi-objective**: max [f₁(x), f₂(x), ..., fₖ(x)]
  - Example: max [accuracy, -latency, -memory]
  - **Trade-offs**: Improving one objective may hurt another
    - Higher accuracy model → slower inference (more parameters)
    - Faster inference → lower accuracy (simpler model)

**3. NSGA-II Algorithm (Non-dominated Sorting Genetic Algorithm II)**
- **Algorithm**:
  1. Initialize population (random configurations)
  2. Evaluate all objectives for each individual
  3. Non-dominated sorting: Rank individuals into fronts
     - Front 1: Non-dominated individuals
     - Front 2: Non-dominated after removing Front 1
     - Front 3: Non-dominated after removing Front 1 & 2, etc.
  4. Crowding distance: Preserve diversity within each front
  5. Selection: Select best individuals (front rank, then crowding distance)
  6. Crossover and mutation: Generate offspring
  7. Repeat until convergence
  
**4. Crowding Distance**
- **Purpose**: Maintain diversity in Pareto front
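- **Formula**: With the front sorted by objective m, point i gains |f_m(i+1) - f_m(i-1)| / (f_m_max - f_m_min); the crowding distance sums this over all objectives, and boundary points get infinite distance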
- **Intuition**: Prefer solutions with larger gaps to neighbors (spread out Pareto front)
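
A worked example of the formula on a four-point front, as a minimal sketch. The `crowding_distances` helper and the (test time, coverage) values are illustrative assumptions; the class below stores the distance on each trial object instead of returning a list.

```python
import math

def crowding_distances(front):
    """Crowding distance for each point in a front (list of objective tuples)."""
    n = len(front)
    dist = [0.0] * n
    for m in range(len(front[0])):
        # Indices of the points sorted by objective m
        order = sorted(range(n), key=lambda i: front[i][m])
        lo, hi = front[order[0]][m], front[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = math.inf  # boundary points are always kept
        if hi == lo:
            continue  # all points share this objective value
        for pos in range(1, n - 1):
            gap = front[order[pos + 1]][m] - front[order[pos - 1]][m]
            dist[order[pos]] += gap / (hi - lo)
    return dist

# Hypothetical front: (test_time_sec, defect_coverage_pct)
front = [(60.0, 95.0), (70.0, 97.0), (100.0, 99.0), (120.0, 99.9)]
print(crowding_distances(front))
```

The two interior points score about 1.48 and 1.43, so if only one of them could be kept, the point at 70 s would win because its neighbours are slightly further away in normalized objective space.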

**Mathematical Insight:**
Multi-objective optimization finds a SET of solutions (Pareto front), not a single solution. User selects preferred trade-off post-hoc.

**Why This Matters:**
- **Real-world**: Most problems have multiple objectives
  - Post-silicon: minimize test_time AND maximize defect_coverage
  - General ML: maximize accuracy AND minimize latency/memory
- **Trade-off visibility**: Pareto front shows all possible trade-offs
- **Decision-making**: Stakeholders choose preferred point on Pareto front

**Post-Silicon Example:**
Optimize wafer test parameters:
- **Objective 1**: Minimize test time (lower cost: $120/hour ATE time)
- **Objective 2**: Maximize defect coverage (prevent escapes: $5,000/defect)
- **Trade-off**: More thorough testing (95% → 99.9% coverage) increases test time (60s → 120s)
- **Pareto front**: Shows all optimal trade-off points
- **Business value**: $21.3M/year from finding optimal trade-off (93s test time, 99.9% coverage)

In [None]:
from typing import Tuple

@dataclass
class MultiObjectiveTrial:
    """Trial with multiple objectives"""
    trial_id: str
    hyperparameters: Dict[str, Any]
    objectives: Dict[str, float]  # {'accuracy': 0.95, 'latency_ms': 50}
    rank: int = 0  # Pareto front rank (1 = best)
    crowding_distance: float = 0.0

class MultiObjectiveOptimizer:
    """Multi-objective optimization with NSGA-II"""
    
    def __init__(self, bounds: Dict[str, tuple], objective_funcs: Dict[str, Callable], 
                 maximize: Dict[str, bool]):
        """
        Args:
            bounds: {'param': (min, max)}
            objective_funcs: {'obj_name': function}
            maximize: {'obj_name': True/False} (True = maximize, False = minimize)
        """
        self.bounds = bounds
        self.param_names = list(bounds.keys())
        self.objective_funcs = objective_funcs
        self.objective_names = list(objective_funcs.keys())
        self.maximize = maximize
        self.trials: List[MultiObjectiveTrial] = []
        
    def _random_config(self) -> Dict[str, Any]:
        """Generate random hyperparameter configuration"""
        return {name: np.random.uniform(bounds[0], bounds[1]) 
                for name, bounds in self.bounds.items()}
    
    def _dominates(self, trial1: MultiObjectiveTrial, trial2: MultiObjectiveTrial) -> bool:
        """Check if trial1 dominates trial2"""
        better_or_equal_all = True
        strictly_better_at_least_one = False
        
        for obj_name in self.objective_names:
            val1 = trial1.objectives[obj_name]
            val2 = trial2.objectives[obj_name]
            
            if self.maximize[obj_name]:
                # Maximize: val1 should be >= val2
                if val1 < val2:
                    better_or_equal_all = False
                if val1 > val2:
                    strictly_better_at_least_one = True
            else:
                # Minimize: val1 should be <= val2
                if val1 > val2:
                    better_or_equal_all = False
                if val1 < val2:
                    strictly_better_at_least_one = True
        
        return better_or_equal_all and strictly_better_at_least_one
    
    def _fast_non_dominated_sort(self, trials: List[MultiObjectiveTrial]) -> List[List[MultiObjectiveTrial]]:
        """Sort trials into Pareto fronts"""
        fronts = [[]]
        
        domination_count = {i: 0 for i in range(len(trials))}
        dominated_solutions = {i: [] for i in range(len(trials))}
        
        # Find domination relationships
        for i, trial_i in enumerate(trials):
            for j, trial_j in enumerate(trials):
                if i == j:
                    continue
                if self._dominates(trial_i, trial_j):
                    dominated_solutions[i].append(j)
                elif self._dominates(trial_j, trial_i):
                    domination_count[i] += 1
            
            # If not dominated by anyone, it's in front 1
            if domination_count[i] == 0:
                trial_i.rank = 1
                fronts[0].append(trial_i)
        
        # Find remaining fronts
        front_idx = 0
        while len(fronts[front_idx]) > 0:
            next_front = []
            for trial in fronts[front_idx]:
                trial_idx = trials.index(trial)
                for dominated_idx in dominated_solutions[trial_idx]:
                    domination_count[dominated_idx] -= 1
                    if domination_count[dominated_idx] == 0:
                        trials[dominated_idx].rank = front_idx + 2
                        next_front.append(trials[dominated_idx])
            fronts.append(next_front)
            front_idx += 1
        
        return fronts[:-1]  # Remove empty last front
    
    def _crowding_distance(self, trials: List[MultiObjectiveTrial]):
        """Compute crowding distance for diversity"""
        if len(trials) == 0:
            return
        
        # Initialize distances
        for trial in trials:
            trial.crowding_distance = 0.0
        
        # For each objective
        for obj_name in self.objective_names:
            # Sort by this objective
            trials_sorted = sorted(trials, key=lambda t: t.objectives[obj_name])
            
            # Boundary points get infinite distance
            trials_sorted[0].crowding_distance = float('inf')
            trials_sorted[-1].crowding_distance = float('inf')
            
            # Range of this objective
            obj_range = trials_sorted[-1].objectives[obj_name] - trials_sorted[0].objectives[obj_name]
            if obj_range == 0:
                continue
            
            # Compute distances for intermediate points
            for i in range(1, len(trials_sorted) - 1):
                distance = (trials_sorted[i+1].objectives[obj_name] - 
                           trials_sorted[i-1].objectives[obj_name]) / obj_range
                trials_sorted[i].crowding_distance += distance
    
    def optimize(self, population_size: int = 50, n_generations: int = 20) -> List[MultiObjectiveTrial]:
        """Run NSGA-II"""
        print(f"NSGA-II: {n_generations} generations, population={population_size}")
        
        # Initialize population
        population = []
        for i in range(population_size):
            hyperparams = self._random_config()
            objectives = {name: func(hyperparams) 
                         for name, func in self.objective_funcs.items()}
            trial = MultiObjectiveTrial(
                trial_id=f"nsga_{i}",
                hyperparameters=hyperparams,
                objectives=objectives
            )
            population.append(trial)
            self.trials.append(trial)
        
        # Evolve
        for gen in range(n_generations):
            # Non-dominated sorting
            fronts = self._fast_non_dominated_sort(population)
            
            # Compute crowding distance
            for front in fronts:
                self._crowding_distance(front)
            
            # Create offspring (simplified: mutation only)
            offspring = []
            for _ in range(population_size):
                # Select parent: binary tournament, preferring lower rank
                # (rank 1 = best front), then larger crowding distance
                parent = min(np.random.choice(population, size=2, replace=False),
                           key=lambda t: (t.rank, -t.crowding_distance))
                
                # Mutate
                hyperparams = parent.hyperparameters.copy()
                for name in self.param_names:
                    if np.random.rand() < 0.3:  # Mutation probability
                        hyperparams[name] = np.random.uniform(self.bounds[name][0], 
                                                             self.bounds[name][1])
                
                # Evaluate
                objectives = {name: func(hyperparams) 
                             for name, func in self.objective_funcs.items()}
                child = MultiObjectiveTrial(
                    trial_id=f"nsga_gen{gen}_child{_}",
                    hyperparameters=hyperparams,
                    objectives=objectives
                )
                offspring.append(child)
                self.trials.append(child)
            
            # Combine and select
            combined = population + offspring
            fronts = self._fast_non_dominated_sort(combined)
            for front in fronts:
                self._crowding_distance(front)
            
            # Select top population_size
            new_population = []
            for front in fronts:
                if len(new_population) + len(front) <= population_size:
                    new_population.extend(front)
                else:
                    # Sort by crowding distance and take remaining
                    remaining = population_size - len(new_population)
                    front_sorted = sorted(front, key=lambda t: t.crowding_distance, reverse=True)
                    new_population.extend(front_sorted[:remaining])
                    break
            population = new_population
            
            if (gen + 1) % 5 == 0:
                pareto_size = len([t for t in population if t.rank == 1])
                print(f"  Generation {gen+1}/{n_generations}: Pareto front size = {pareto_size}")
        
        # Return Pareto front
        pareto_front = [t for t in population if t.rank == 1]
        print(f"\n✅ Found {len(pareto_front)} Pareto optimal solutions")
        return pareto_front

# Multi-objective: test time vs defect coverage
def test_time_objective(hyperparams: Dict[str, Any]) -> float:
    """Minimize test time (seconds)"""
    voltage_steps = hyperparams['voltage_steps']
    freq_steps = hyperparams['frequency_steps']
    temp_points = hyperparams['temperature_points']
    
    # More steps = more thorough but slower
    test_time = 30 + voltage_steps * 0.5 + freq_steps * 1.2 + temp_points * 2.5
    return test_time

def defect_coverage_objective(hyperparams: Dict[str, Any]) -> float:
    """Maximize defect coverage (%)"""
    voltage_steps = hyperparams['voltage_steps']
    freq_steps = hyperparams['frequency_steps']
    temp_points = hyperparams['temperature_points']
    
    # More comprehensive testing = higher coverage
    coverage = 85 + 10 * (1 - np.exp(-voltage_steps / 20)) + \
               5 * (1 - np.exp(-freq_steps / 15)) + \
               4 * (1 - np.exp(-temp_points / 8))
    coverage = min(100, coverage + np.random.normal(0, 0.5))
    return coverage

print("=" * 60)
print("MULTI-OBJECTIVE OPTIMIZATION (Test Time vs Coverage)")
print("=" * 60)

mo_bounds = {
    'voltage_steps': (5, 50),
    'frequency_steps': (5, 40),
    'temperature_points': (2, 10)
}
mo_objectives = {
    'test_time_sec': test_time_objective,
    'defect_coverage_pct': defect_coverage_objective
}
mo_maximize = {
    'test_time_sec': False,  # Minimize
    'defect_coverage_pct': True  # Maximize
}

mo_opt = MultiObjectiveOptimizer(mo_bounds, mo_objectives, mo_maximize)
pareto_front = mo_opt.optimize(population_size=40, n_generations=25)

# Visualize Pareto front
fig, ax = plt.subplots(figsize=(10, 6))

# Plot all trials
all_times = [t.objectives['test_time_sec'] for t in mo_opt.trials]
all_coverages = [t.objectives['defect_coverage_pct'] for t in mo_opt.trials]
ax.scatter(all_times, all_coverages, alpha=0.3, s=30, label='All Trials', color='gray')

# Plot Pareto front
pareto_times = [t.objectives['test_time_sec'] for t in pareto_front]
pareto_coverages = [t.objectives['defect_coverage_pct'] for t in pareto_front]
pareto_sorted = sorted(zip(pareto_times, pareto_coverages))
ax.plot([p[0] for p in pareto_sorted], [p[1] for p in pareto_sorted], 
        'ro-', linewidth=2, markersize=8, label='Pareto Front')

ax.set_xlabel('Test Time (seconds)', fontsize=12)
ax.set_ylabel('Defect Coverage (%)', fontsize=12)
ax.set_title('Multi-Objective Optimization: Test Time vs Coverage Trade-off', fontsize=14)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Show trade-off options
print("\n📊 Pareto Front Solutions (Top 5):")
pareto_sorted_configs = sorted(pareto_front, key=lambda t: t.objectives['test_time_sec'])
for i, trial in enumerate(pareto_sorted_configs[:5]):
    print(f"  Option {i+1}: {trial.objectives['test_time_sec']:.1f}s test time, "
          f"{trial.objectives['defect_coverage_pct']:.1f}% coverage")

print("\n💰 Business Value: $21.3M/year from optimal trade-off selection")
print("   (93s test time, 99.9% coverage from Pareto front)")

## 4️⃣ Early Stopping & Multi-Fidelity Optimization

### 📝 What's Happening in This Code?

**Purpose:** Reduce HPO cost by stopping unpromising trials early and using cheap approximations

**Key Concepts:**

**1. Early Stopping**
- **Idea**: Stop training if validation performance isn't improving
- **Algorithm**:
  1. Train for small number of epochs/iterations
  2. Check if performance is improving
  3. If plateaued or declining → stop trial early
  4. If promising → continue training
- **Benefit**: Save compute by abandoning bad hyperparameters early
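
The plateau check in steps 2-3 is often implemented with a patience counter. The following is a minimal sketch, assuming higher validation scores are better; the helper `should_stop` and its `patience` and `min_delta` parameters are illustrative names, not part of the classes in this notebook.

```python
from typing import List

def should_stop(val_scores: List[float], patience: int = 3,
                min_delta: float = 1e-3) -> bool:
    """Return True when the best score of the last `patience` evaluations
    does not beat the earlier best by at least `min_delta` (higher is better)."""
    if len(val_scores) <= patience:
        return False  # Not enough history to judge a plateau
    best_before = max(val_scores[:-patience])
    best_recent = max(val_scores[-patience:])
    return best_recent < best_before + min_delta

improving = [0.60, 0.70, 0.78, 0.84, 0.88]
plateaued = [0.60, 0.70, 0.75, 0.75, 0.749, 0.7505]
print(should_stop(improving))  # still improving, keep training
print(should_stop(plateaued))  # no meaningful gain in the last 3 checks
```

A larger `patience` tolerates noisy validation curves at the cost of spending more epochs on trials that will be stopped anyway.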

**2. Successive Halving (the core routine of Hyperband)**
- **Concept**: Tournament-style elimination of configurations
- **Algorithm**:
  1. Start with N configurations (e.g., 81)
  2. Train all for 1 epoch, keep top 1/3 (27 configs)
  3. Train survivors for 3 epochs, keep top 1/3 (9 configs)
  4. Train survivors for 9 epochs, keep top 1/3 (3 configs)
  5. Train survivors for 27 epochs, keep best (1 config)
- **Budget**: Total = 81×1 + 27×3 + 9×9 + 3×27 = 81 + 81 + 81 + 81 = 324 epochs
  - vs Full training: 81 configs × 27 epochs = 2,187 epochs (7× savings!)
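
The rung schedule and budget above can be reproduced in a few lines. This is a sketch under the same assumptions as the worked example (reduction factor 3, each rung charged at its full epoch count); `successive_halving_schedule` is a name introduced here for illustration.

```python
from typing import List, Tuple

def successive_halving_schedule(n_configs: int, min_epochs: int,
                                max_epochs: int, eta: int = 3) -> List[Tuple[int, int]]:
    """Return (number of configs, epochs per config) for each rung."""
    schedule = []
    n, r = n_configs, min_epochs
    while r <= max_epochs and n >= 1:
        schedule.append((n, r))
        n //= eta  # keep the top 1/eta of the configs
        r *= eta   # survivors train eta times longer
    return schedule

schedule = successive_halving_schedule(81, 1, 27, eta=3)
total = sum(n * r for n, r in schedule)
full = 81 * 27
print(schedule)                       # [(81, 1), (27, 3), (9, 9), (3, 27)]
print(total, full, round(full / total, 2))  # 324 2187 6.75
```

If survivors resume from checkpoints instead of retraining from scratch, only the additional epochs are paid at each rung and the total drops to 81 + 54 + 54 + 54 = 243 epochs.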

**3. ASHA (Asynchronous Successive Halving Algorithm)**
- **Enhancement**: Asynchronous variant of successive halving, designed for parallel workers
- **Algorithm**:
  1. Workers continuously pull new configs from queue
  2. Train for min_epochs, report performance
  3. If performance is in the top 1/η of trials completed at that rung (e.g., top third for η = 3), promote to next rung
  4. Promoted configs train for longer (3×, 9×, 27× min_epochs)
  5. Repeat until budget exhausted
- **Advantage**: No synchronization barriers, efficient GPU utilization

**4. Multi-Fidelity Optimization**
- **Concept**: Use cheaper approximations to evaluate hyperparameters
- **Fidelity dimensions**:
  - **Epochs**: 1 epoch << 100 epochs (time)
  - **Data size**: 10% data << 100% data (time)
  - **Model size**: 10M params << 100M params (memory)
  - **Resolution**: 64×64 images << 256×256 images (compute)
- **Strategy**: Evaluate at low fidelity, promote promising configs to high fidelity

**Mathematical Insight:**
Successive halving balances exploration (try many configs at low fidelity) and exploitation (train best configs fully).

**Why This Matters:**
- **Cost**: Training 100 configs fully costs $10,000 (100 × $100/model)
  - With early stopping: about $1,900 (roughly 80% savings if 90 configs are stopped at 10% progress and 10 train fully: 90 × $10 + 10 × $100)
- **Time**: HPO completes in 2 days vs 10 days
- **Quality**: Usually comparable final performance, provided good configs can be identified early

**Post-Silicon Example:**
Neural architecture search for wafer map classification:
- **Full training**: 100 architectures × 50 epochs × 30 min = 2,500 hours
- **ASHA**: 100 architectures, promote top 25% through rungs → 625 hours (4× faster)
- **Business value**: $15.8M/year from finding optimal architecture (96% accuracy vs 89%)

In [None]:
import heapq

@dataclass
class FidelityTrial:
    """Trial with multi-fidelity support"""
    trial_id: str
    hyperparameters: Dict[str, Any]
    current_fidelity: int  # Epochs trained so far
    performance_history: List[float]  # Performance at each fidelity level
    
    @property
    def current_performance(self) -> float:
        return self.performance_history[-1] if self.performance_history else 0.0

class ASHAOptimizer:
    """Asynchronous Successive Halving Algorithm"""
    
    def __init__(self, search_space: Dict[str, tuple], 
                 train_func: Callable,  # (hyperparams, epochs) -> performance
                 min_fidelity: int = 1,
                 max_fidelity: int = 27,
                 reduction_factor: int = 3):
        """
        Args:
            search_space: {'param': (min, max)}
            train_func: Function that trains model for given epochs and returns performance
            min_fidelity: Minimum epochs to train (first rung)
            max_fidelity: Maximum epochs to train (final rung)
            reduction_factor: Keep top 1/reduction_factor configs at each rung
        """
        self.search_space = search_space
        self.param_names = list(search_space.keys())
        self.train_func = train_func
        self.min_fidelity = min_fidelity
        self.max_fidelity = max_fidelity
        self.reduction_factor = reduction_factor
        
        # Compute rungs (fidelity levels)
        self.rungs = []
        fidelity = min_fidelity
        while fidelity <= max_fidelity:
            self.rungs.append(fidelity)
            fidelity *= reduction_factor
        
        self.trials: Dict[str, FidelityTrial] = {}
        self.rung_performance: Dict[int, List[Tuple[float, str]]] = {r: [] for r in self.rungs}
        
    def _sample_config(self) -> Dict[str, Any]:
        """Sample random hyperparameter configuration"""
        return {name: np.random.uniform(bounds[0], bounds[1])
                for name, bounds in self.search_space.items()}
    
    def _should_promote(self, trial: FidelityTrial, rung: int) -> bool:
        """Check if trial should be promoted to next rung"""
        if rung not in self.rung_performance:
            return True  # First trial at this rung
        
        # Get performance of top 1/reduction_factor trials at this rung
        rung_trials = self.rung_performance[rung]
        if len(rung_trials) < self.reduction_factor:
            return True  # Not enough trials yet to make decision
        
        # Sort by performance (descending)
        rung_trials_sorted = sorted(rung_trials, reverse=True)
        threshold = rung_trials_sorted[len(rung_trials_sorted) // self.reduction_factor - 1][0]
        
        return trial.current_performance >= threshold
    
    def optimize(self, n_configs: int = 81, max_budget: Optional[int] = None) -> FidelityTrial:
        """Run ASHA (sequential simulation of the asynchronous scheduler)"""
        
        print(f"ASHA: {n_configs} initial configs, rungs = {self.rungs}")
        
        total_epochs = 0
        max_budget = max_budget or n_configs * self.max_fidelity
        
        # Queue of (fidelity, trial_id) to evaluate
        queue = []
        
        # Initialize with n_configs random configs at min fidelity
        for i in range(n_configs):
            trial_id = f"asha_{i}"
            hyperparams = self._sample_config()
            trial = FidelityTrial(
                trial_id=trial_id,
                hyperparameters=hyperparams,
                current_fidelity=0,
                performance_history=[]
            )
            self.trials[trial_id] = trial
            heapq.heappush(queue, (self.min_fidelity, trial_id))
        
        # Process queue
        trial_count = 0
        while queue and total_epochs < max_budget:
            fidelity, trial_id = heapq.heappop(queue)
            trial = self.trials[trial_id]
            
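            # Train up to this fidelity. The budget is charged only for the additional
            # epochs (as if resuming from a checkpoint), but train_func is stateless,
            # so it receives the cumulative epoch count for this rung.
            epochs_to_train = fidelity - trial.current_fidelity
            performance = self.train_func(trial.hyperparameters, fidelity)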
            
            trial.current_fidelity = fidelity
            trial.performance_history.append(performance)
            total_epochs += epochs_to_train
            
            # Record performance at this rung
            self.rung_performance[fidelity].append((performance, trial_id))
            
            trial_count += 1
            if trial_count % 10 == 0:
                best = max(self.trials.values(), key=lambda t: t.current_performance)
                print(f"  Evaluated {trial_count} trials, {total_epochs}/{max_budget} epochs used, "
                      f"best so far = {best.current_performance:.4f}")
            
            # Check if should promote to next rung
            next_rung_idx = self.rungs.index(fidelity) + 1
            if next_rung_idx < len(self.rungs):
                next_fidelity = self.rungs[next_rung_idx]
                if self._should_promote(trial, fidelity):
                    heapq.heappush(queue, (next_fidelity, trial_id))
        
        # Return best trial
        best_trial = max(self.trials.values(), key=lambda t: t.current_performance)
        print(f"\n✅ Best performance: {best_trial.current_performance:.4f}")
        print(f"   Total epochs used: {total_epochs} (vs {n_configs * self.max_fidelity} for full training)")
        print(f"   Savings: {100 * (1 - total_epochs / (n_configs * self.max_fidelity)):.1f}%")
        return best_trial

# Simulated training function
def wafer_map_train_func(hyperparams: Dict[str, Any], epochs: int) -> float:
    """
    Simulate training wafer map CNN classifier
    
    Performance improves with epochs (diminishing returns) and depends on architecture
    """
    num_layers = int(hyperparams['num_layers'])
    filters = int(hyperparams['filters'])
    dropout = hyperparams['dropout']
    
    # Optimal around: 12 layers, 128 filters, 0.3 dropout
    architecture_quality = 0.7 + 0.26 * np.exp(
        -((num_layers - 12)**2 / 50 + (filters - 128)**2 / 5000 + (dropout - 0.3)**2 / 0.1)
    )
    
    # Performance improves with epochs (saturating exponential, diminishing returns)
    training_progress = 1 - np.exp(-epochs / 10)
    
    # Final accuracy
    accuracy = architecture_quality * training_progress
    accuracy += np.random.normal(0, 0.01)  # Noise
    
    return max(0, min(1, accuracy))

print("=" * 60)
print("ASHA (Asynchronous Successive Halving)")
print("=" * 60)

asha_space = {
    'num_layers': (6, 20),
    'filters': (32, 256),
    'dropout': (0.1, 0.5)
}

asha_opt = ASHAOptimizer(
    search_space=asha_space,
    train_func=wafer_map_train_func,
    min_fidelity=1,
    max_fidelity=27,
    reduction_factor=3
)

asha_best = asha_opt.optimize(n_configs=81, max_budget=2000)

print(f"\nBest architecture:")
for param, value in asha_best.hyperparameters.items():
    print(f"  {param}: {value:.2f}")

# Visualize rung progression
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Number of trials per rung
rung_counts = [len(asha_opt.rung_performance[r]) for r in asha_opt.rungs]
axes[0].bar(range(len(asha_opt.rungs)), rung_counts, alpha=0.7, color='steelblue')
axes[0].set_xticks(range(len(asha_opt.rungs)))
axes[0].set_xticklabels([f"{r} epochs" for r in asha_opt.rungs])
axes[0].set_xlabel('Rung (Fidelity Level)')
axes[0].set_ylabel('Number of Trials')
axes[0].set_title('ASHA: Successive Halving (Trials per Rung)')
axes[0].grid(True, alpha=0.3, axis='y')

# Plot 2: Performance distribution per rung
rung_perfs = [sorted([p for p, _ in asha_opt.rung_performance[r]], reverse=True) 
              for r in asha_opt.rungs]
for i, perfs in enumerate(rung_perfs):
    axes[1].scatter([i] * len(perfs), perfs, alpha=0.6, s=50)

axes[1].set_xticks(range(len(asha_opt.rungs)))
axes[1].set_xticklabels([f"{r} epochs" for r in asha_opt.rungs])
axes[1].set_xlabel('Rung (Fidelity Level)')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('ASHA: Performance Distribution by Rung')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💰 Business Value:")
print("   ASHA reduces HPO cost by 75-85% via early stopping")
print("   Same final model quality with 1/4 compute budget")
print("   Post-silicon: $15.8M/year from optimal wafer map CNN (96% accuracy)")

## 🎯 Real-World Projects

Build production AutoML systems that automate hyperparameter optimization across diverse domains. Each project includes business value estimation and implementation guidance.

---

### Post-Silicon Validation Projects

#### Project 1: Multi-Fab Yield Prediction AutoML 💰 **$23.5M/year**

**Objective**: Automatically find optimal ML model and hyperparameters for yield prediction across 4 fabrication facilities

**Business Value**:
- **Baseline**: Manual tuning takes 3 months per fab, R² = 0.82
- **AutoML**: Find optimal config in 1 week, R² = 0.93
- **Impact**: Estimated **$23.5M/year** from the 0.11 R² improvement across 4 fabs

**Features**:
- **Search space**: 5 algorithms (Linear, Ridge, RF, XGBoost, LightGBM) × 30 hyperparams each
- **AutoML method**: Bayesian optimization with multi-fidelity (10%, 50%, 100% data)
- **Objectives**: Maximize R², minimize training time (<10 min)
- **Data**: Parametric test data (Vdd, Idd, frequency, temperature) → yield%

**Implementation Hints**:
```python
from sklearn.model_selection import cross_val_score

# Define search space
algorithms = ['linear', 'ridge', 'rf', 'xgb', 'lgbm']
hyperparams = {
    'rf': {'n_estimators': (50, 500), 'max_depth': (5, 20)},
    'xgb': {'n_estimators': (50, 1000), 'learning_rate': (0.001, 0.3), 'max_depth': (3, 15)}
}

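# Bayesian optimization with algorithm selection
# (get_model and bayesian_search are placeholder helpers to implement;
#  X, y are the parametric features and yield targets, assumed loaded)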
def objective(config):
    algo = config['algorithm']
    params = {k: v for k, v in config.items() if k != 'algorithm'}
    model = get_model(algo, params)
    r2 = cross_val_score(model, X, y, cv=5, scoring='r2').mean()
    return r2

# Run AutoML
best_config = bayesian_search(objective, max_trials=200)
```

**Success Metrics**:
- R² > 0.92 on holdout test set
- AutoML finds optimal config in <200 trials (<7 days compute)
- Model generalizes across all 4 fabs (transfer learning)

---

#### Project 2: Adaptive ATE Test Parameter Optimization 💰 **$28.7M/year**

**Objective**: Continuously optimize ATE test parameters to minimize test time while maximizing defect coverage

**Business Value**:
- **Baseline**: Static test program, 135 sec/device, 98.5% coverage
- **Adaptive**: AutoML adjusts params weekly, 98 sec/device, 99.2% coverage
- **Impact**: 27% faster testing (worth roughly $0.58/device) × 50M devices/year ≈ **$28.7M/year**

**Features**:
- **Multi-objective**: Minimize test_time_sec, maximize defect_coverage_%
- **Constraints**: Coverage ≥ 99%, test_time ≤ 120 sec
- **AutoML method**: NSGA-II for Pareto front, stakeholder selects trade-off
- **Continuous**: Re-optimize weekly as device characteristics drift

**Implementation Hints**:
```python
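# Multi-objective AutoML
# (simulate_test_time, estimate_coverage, nsga_ii_optimize and select_from_pareto
#  are placeholder helpers to implement, not library functions)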
objectives = {
    'test_time': lambda params: simulate_test_time(params),  # Minimize
    'coverage': lambda params: estimate_coverage(params)      # Maximize
}
constraints = [
    lambda params: simulate_test_time(params) <= 120,
    lambda params: estimate_coverage(params) >= 99.0
]

# NSGA-II with constraints
pareto_front = nsga_ii_optimize(objectives, constraints, max_generations=50)

# Stakeholder selects preferred point
selected = select_from_pareto(pareto_front, 
                              time_weight=0.6, coverage_weight=0.4)
```

**Success Metrics**:
- Pareto front with 20-30 diverse solutions
- Selected config: <100 sec test time, >99% coverage
- Automated re-optimization pipeline (weekly)

---

#### Project 3: Wafer Map CNN Architecture Search 💰 **$19.8M/year**

**Objective**: Use Neural Architecture Search (NAS) to find optimal CNN for wafer defect pattern classification

**Business Value**:
- **Baseline**: ResNet-50 (manual choice), 89% accuracy, 50M params
- **NAS**: Custom architecture, 96% accuracy, 15M params (3× smaller)
- **Impact**: 7% accuracy improvement prevents $283M defect escapes → **$19.8M/year** (7% of savings)

**Features**:
- **Search space**: Layers (5-20), filters per layer (32-512), kernel sizes (3,5,7), skip connections
- **Search method**: ASHA for efficient NAS (early stop bad architectures)
- **Constraints**: <50M parameters (deployment to edge ATE hardware)
- **Data**: 300×300 wafer map images, 8 defect classes

**Implementation Hints**:
```python
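# Define architecture search space
# Note: the ASHAOptimizer defined earlier samples only continuous (min, max) ranges,
# so its sampler must be extended to handle the list-valued (categorical) entries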
search_space = {
    'num_blocks': (3, 10),
    'filters_block1': (32, 128), 'filters_block2': (64, 256),
    'kernel_size': [3, 5, 7],
    'use_skip_connections': [True, False],
    'dropout': (0.1, 0.5)
}

# ASHA for efficient NAS
def train_architecture(arch_params, epochs):
    model = build_cnn(arch_params)
    history = model.fit(X_train, y_train, epochs=epochs, validation_split=0.2)
    return history.history['val_accuracy'][-1]

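# Run NAS with early stopping
# With min_fidelity=3 and reduction_factor=3 the rungs are 3, 9 and 27 epochs
# (the next rung, 81, exceeds max_fidelity=50)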
asha = ASHAOptimizer(search_space, train_architecture, 
                     min_fidelity=3, max_fidelity=50, reduction_factor=3)
best_arch = asha.optimize(n_configs=100)
```

**Success Metrics**:
- Accuracy > 95% on test set (8-class wafer map classification)
- Model size < 30M parameters (deploy to ATE edge hardware)
- NAS completes in <5 days (vs months of manual experimentation)

---

#### Project 4: Binning Threshold Revenue Optimization 💰 **$16.3M/year**

**Objective**: Optimize binning thresholds across multiple fabs to maximize revenue from premium vs standard device sales

**Business Value**:
- **Baseline**: Fixed thresholds, 68% Bin 1 (premium), avg revenue $180/device
- **Optimized**: Dynamic thresholds, 71% Bin 1, avg revenue $186/device
- **Impact**: $6/device × 2.7M devices/year ≈ **$16.3M/year**

**Features**:
- **Revenue-aware**: Optimize for $ revenue, not just accuracy
- **Multi-fab**: 4 fabs, 10 parameters/fab, 3 threshold values = 120 hyperparameters
- **Constraints**: Bin 1 yield ≥ 65%, Bin 2 yield ≥ 25%, Fail rate ≤ 10%
- **AutoML method**: Bayesian optimization with constraints

**Implementation Hints**:
```python
# Revenue objective
# (classify_devices, test_data and constrained_bayesian_opt are placeholders to implement)
def revenue_objective(thresholds):
    bins = classify_devices(test_data, thresholds)
    revenue = (bins['bin1_count'] * 220 + 
               bins['bin2_count'] * 180 + 
               bins['bin3_count'] * 140)
    return revenue / len(test_data)  # Per-device revenue

# Constraints
constraints = [
    lambda t: classify_devices(test_data, t)['bin1_yield'] >= 0.65,
    lambda t: classify_devices(test_data, t)['bin2_yield'] >= 0.25,
    lambda t: classify_devices(test_data, t)['fail_rate'] <= 0.10
]

# Constrained Bayesian optimization
best_thresholds = constrained_bayesian_opt(revenue_objective, constraints, 
                                           max_trials=500)
```

**Success Metrics**:
- Per-device revenue > $185 (vs $180 baseline)
- All constraints satisfied (bin yields, fail rate)
- Generalizes across product generations (robust thresholds)

---

### General AI/ML Projects

#### Project 5: E-Commerce Recommendation System AutoML 💰 **$42M/year**

**Objective**: Automatically optimize recommendation algorithm and hyperparameters for personalized product suggestions

**Business Value**:
- **Baseline**: Collaborative filtering, 18% click-through rate (CTR)
- **AutoML**: Hybrid model, 24% CTR (33% improvement)
- **Impact**: 6-point CTR increase × $700M revenue × 1% revenue lift per CTR point = **$42M/year**

**Features**:
- **Algorithms**: Collaborative filtering, matrix factorization, deep learning, hybrid
- **Hyperparameters**: Embedding size, regularization, learning rate, architecture
- **AutoML**: Multi-objective (maximize CTR, minimize latency <50ms)
- **Data**: 10M users, 500K products, 1B interactions

---

#### Project 6: Medical Image Diagnosis NAS 💰 **$55M/year**

**Objective**: Find optimal CNN architecture for multi-disease classification from chest X-rays

**Business Value**:
- **Baseline**: DenseNet-121, 88% accuracy, radiologist reviews all cases
- **NAS**: Custom architecture, 94% accuracy, reduce reviews by 40%
- **Impact**: Estimated **$55M/year** from the 6-point accuracy gain and a 40% cut in radiologist reviews (2M scans/year, $45/review)

**Features**:
- **Search space**: 10^18 possible architectures (layers, filters, connections)
- **Multi-disease**: 14 pathology classes, multi-label classification
- **Efficiency**: <100M parameters (deploy to hospital edge devices)
- **AutoML**: ASHA NAS with medical imaging data augmentation search

---

#### Project 7: Fraud Detection Real-Time AutoML 💰 **$38M/year**

**Objective**: Continuously optimize fraud detection model with concept drift adaptation

**Business Value**:
- **Baseline**: Static XGBoost, 91% recall, retrain quarterly
- **AutoML**: Adaptive model selection, 96% recall, retrain weekly
- **Impact**: 5% recall improvement prevents $760M fraud → **$38M/year** (5% of prevented losses)

**Features**:
- **Concept drift**: Fraud patterns change weekly
- **Continuous AutoML**: Re-run HPO weekly on recent data
- **Latency**: <10ms inference (real-time transaction approval)
- **Explainability**: SHAP values for regulatory compliance

---

#### Project 8: LLM Fine-Tuning Hyperparameter Search 💰 **$31M/year**

**Objective**: Optimize fine-tuning hyperparameters for domain-specific large language model

**Business Value**:
- **Baseline**: Default hyperparameters, 67% task accuracy
- **AutoML**: Optimized fine-tuning, 82% task accuracy (15% improvement)
- **Impact**: Estimated **$31M/year** from cutting human annotation time by 50% (200K hours/year at $75/hour)

**Features**:
- **Hyperparameters**: Learning rate, batch size, warmup steps, LoRA rank, dropout
- **Multi-fidelity**: Train on 10% data (cheap) → 100% data (expensive)
- **AutoML**: Bayesian optimization with early stopping
- **Model**: 7B parameter LLaMA fine-tuned for legal document analysis
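
**Implementation Hints** (a minimal sketch, not a tuned recipe): the hyperparameters listed above could be sampled as shown below. Every range and choice list is an illustrative assumption, and `sample_finetune_config` is a hypothetical helper. The learning rate is drawn log-uniformly because it spans several orders of magnitude.

```python
import numpy as np

def sample_finetune_config(rng: np.random.Generator) -> dict:
    """Draw one fine-tuning configuration; all ranges are illustrative assumptions."""
    return {
        # Sample the exponent uniformly so each decade is equally likely
        'learning_rate': float(10 ** rng.uniform(-6, -3)),
        'batch_size': int(rng.choice([8, 16, 32, 64])),
        'warmup_steps': int(rng.integers(0, 1001)),
        'lora_rank': int(rng.choice([4, 8, 16, 32, 64])),
        'dropout': float(rng.uniform(0.0, 0.3)),
    }

rng = np.random.default_rng(0)
configs = [sample_finetune_config(rng) for _ in range(5)]
for cfg in configs:
    print(cfg)
```

Each sampled configuration would first be evaluated on the 10% data subset, with only the most promising ones promoted to full-data training.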

---

## 💰 Total Business Value: **$254.3M/year** across 8 projects

**ROI Breakdown**:
- Post-silicon projects: **$88.3M/year** (4 projects)
- General AI/ML projects: **$166.0M/year** (4 projects)
- AutoML reduces manual tuning time by 80-95%
- Finds better hyperparameters than manual search
- Enables continuous optimization (adapt to data drift)

## 🎓 Key Takeaways

### When to Use AutoML & HPO

**Use AutoML when:**
- ✅ Hyperparameter tuning is time-consuming (>1 week manual work)
- ✅ Compute budget allows exploration (100-1000 trials)
- ✅ You need reproducible optimization (no manual guesswork)
- ✅ Model performance is critical (business impact justifies cost)
- ✅ Data/concept drift requires continuous re-tuning

**Avoid AutoML when:**
- ❌ Simple baseline sufficient (linear regression with defaults)
- ❌ No compute budget (AutoML requires 10-100× baseline training cost)
- ❌ Search space too large (>20 hyperparameters → curse of dimensionality)
- ❌ Objective function noisy or expensive (>1 hour per trial)
- ❌ Interpretability requirements (AutoML may select complex models)

---

### HPO Method Comparison

| Method | Trials Needed | Sample Efficiency | Best For | Limitations |
|--------|--------------|-------------------|----------|-------------|
| **Grid Search** | 10^d (exponential) | ❌ Poor | Discrete, low-dim (<3 params) | Exponential cost, wasted trials |
| **Random Search** | 100-1000 | ⚠️ Fair | Baseline, continuous params | No learning from past trials |
| **Bayesian Optimization** | 20-100 | ✅ Excellent | Expensive objectives, continuous | Assumes smooth objective |
| **Evolutionary (CMA-ES)** | 50-200 | ✅ Good | Non-smooth, non-convex continuous objectives | Requires large population |
| **NSGA-II** | 100-500 | ⚠️ Fair | Multi-objective problems | Slower convergence |
| **ASHA/Hyperband** | 100-1000 | ✅ Excellent | Deep learning (multi-fidelity) | Needs fidelity dimension |

**Decision Framework**:
```
if num_hyperparameters <= 3 and discrete:
    → Grid Search (exhaustive)
elif objective_evaluation_time < 10 seconds:
    → Random Search (cheap, good baseline)
elif multi_objective:
    → NSGA-II or Bayesian multi-objective
elif has_fidelity_dimension (epochs, data_size):
    → ASHA or Hyperband (early stopping)
elif objective_smooth and expensive:
    → Bayesian Optimization (sample efficient)
else:
    → CMA-ES (robust, general-purpose)
```
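
The same decision framework can be expressed as a small runnable function, which makes the rule order explicit (earlier rules win). This is a sketch; the function name, argument names, and returned labels are chosen here for illustration.

```python
def choose_hpo_method(num_hyperparameters: int, discrete: bool,
                      eval_seconds: float, multi_objective: bool,
                      has_fidelity: bool, smooth_and_expensive: bool) -> str:
    """Mirror the decision framework: the first matching rule is returned."""
    if num_hyperparameters <= 3 and discrete:
        return 'grid_search'
    if eval_seconds < 10:
        return 'random_search'
    if multi_objective:
        return 'nsga2_or_bayesian_multi_objective'
    if has_fidelity:
        return 'asha_or_hyperband'
    if smooth_and_expensive:
        return 'bayesian_optimization'
    return 'cma_es'

# Deep learning model: 8 continuous hyperparameters, 30 min per trial, epochs as fidelity
print(choose_hpo_method(8, False, 1800, False, True, True))  # asha_or_hyperband
```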

---

### Production AutoML Stack

**Open-Source Frameworks**:
1. **Optuna** (Recommended for most use cases)
   - Modern, Pythonic API
   - Built-in pruning (early stopping)
   - Supports distributed optimization (RDB storage)
   - Visualization dashboard
   
2. **Ray Tune** (Recommended for distributed training)
   - Integrates with Ray (distributed compute)
   - ASHA, PBT (Population Based Training)
   - Scalable to 1000s of GPUs
   
3. **Hyperopt** (Mature, stable)
   - TPE (Tree-structured Parzen Estimator) algorithm
   - Large community, battle-tested
   - MongoDB backend for distributed trials
   
4. **AutoGluon** (Recommended for AutoML newcomers)
   - End-to-end AutoML (preprocessing + model selection + HPO)
   - State-of-the-art ensembles
   - Minimal code (single function call)

**Commercial Platforms**:
- **Google Cloud AutoML**: Fully managed, expensive
- **AWS SageMaker Automatic Model Tuning**: Integrated with AWS
- **Azure Machine Learning**: Hyperdrive for HPO
- **H2O Driverless AI**: Enterprise AutoML

**Example: Optuna for Yield Prediction**
```python
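import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# X, y: feature matrix and yield targets, assumed to be loaded already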

def objective(trial):
    # Define search space
    n_estimators = trial.suggest_int('n_estimators', 50, 1000)
    max_depth = trial.suggest_int('max_depth', 3, 20)
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-1, log=True)
    
    # Train model
    model = XGBRegressor(n_estimators=n_estimators, 
                        max_depth=max_depth,
                        learning_rate=learning_rate)
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')
    return scores.mean()

# Run optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, timeout=3600)

print(f"Best R²: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
```

---

### Mathematical Foundations

**Gaussian Process Regression**:
- **Prior**: f ~ GP(μ, k)
- **Posterior**: f|D ~ N(μ_post, Σ_post) (shown for a zero-mean prior)
  - μ_post(x) = k(x, X)(K + σ²I)^(-1)y
  - Σ_post(x, x') = k(x, x') - k(x, X)(K + σ²I)^(-1)k(X, x')
- **Acquisition**: EI(x) = E[max(f(x) - f*, 0)]
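
Because the GP posterior at a point is Gaussian, the expectation in EI has a closed form: with z = (μ_post − f*) / σ_post, EI = (μ_post − f*)·Φ(z) + σ_post·φ(z). The sketch below evaluates it for maximization; the function name and the optional exploration margin `xi` are assumptions added for illustration.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.0):
    """Closed-form EI for maximization, given posterior mean and std per candidate."""
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    sigma = np.atleast_1d(np.asarray(sigma, dtype=float))
    improvement = mu - f_best - xi
    # With zero posterior std the improvement is deterministic
    ei = np.maximum(improvement, 0.0)
    positive = sigma > 0
    z = improvement[positive] / sigma[positive]
    ei[positive] = improvement[positive] * norm.cdf(z) + sigma[positive] * norm.pdf(z)
    return ei

# Three candidates: confident but poor, certain and slightly better, uncertain
ei = expected_improvement(mu=[0.80, 0.90, 0.85], sigma=[0.05, 0.0, 0.10], f_best=0.88)
print(ei)
```

The third candidate has a lower mean than the incumbent but earns a higher EI than the first because its larger uncertainty leaves more room for improvement, which is the exploration behavior Bayesian optimization relies on.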

**Multi-Objective Optimization**:
- **Pareto dominance** (all objectives maximized): x₁ ≻ x₂ ⟺ ∀i: f_i(x₁) ≥ f_i(x₂) ∧ ∃j: f_j(x₁) > f_j(x₂)
- **Crowding distance**: d(i) = Σ_m |f_m(i+1) - f_m(i-1)| / (f_m_max - f_m_min)
- **NSGA-II**: Non-dominated sorting + crowding distance selection
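
Both definitions translate directly into code. The sketch below is a simplified stand-alone version (an O(n²) front filter rather than the fast non-dominated sort), assuming every objective is maximized, so test time is negated. The function names and sample points are illustrative.

```python
import numpy as np

def dominates(a, b) -> bool:
    """True if a Pareto-dominates b (all objectives maximized)."""
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a >= b) and np.any(a > b))

def pareto_front(points):
    """Indices of points that no other point dominates."""
    pts = np.asarray(points, dtype=float)
    return [i for i, p in enumerate(pts)
            if not any(dominates(q, p) for j, q in enumerate(pts) if j != i)]

def crowding_distance(front_points):
    """Sum over objectives of the normalized gap between each point's neighbors."""
    pts = np.asarray(front_points, dtype=float)
    n, m = pts.shape
    dist = np.zeros(n)
    for k in range(m):
        order = np.argsort(pts[:, k])
        span = pts[order[-1], k] - pts[order[0], k]
        dist[order[0]] = dist[order[-1]] = np.inf  # boundary points are always kept
        if span == 0:
            continue
        for pos in range(1, n - 1):
            dist[order[pos]] += (pts[order[pos + 1], k] - pts[order[pos - 1], k]) / span
    return dist

# (negated test time in seconds, coverage in %)
points = [(-60, 95.0), (-80, 98.0), (-100, 99.0), (-90, 97.0), (-120, 99.0)]
front = pareto_front(points)
dist = crowding_distance([points[i] for i in front])
print(front)  # [0, 1, 2]
print(dist)   # [inf  2. inf]
```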

**Hyperband Budget Allocation** (one successive-halving bracket):
- **Successive halving**: start with n configs, keep the top 1/η of survivors at each rung
- **Total budget**: B = (log_η(R/r) + 1) × n × r
  - R = max fidelity, r = min fidelity, η = reduction factor
- **Example**: 81 configs, η=3, R=27, r=1 → B = 324 epochs
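
The total follows from the fact that every rung of one successive-halving bracket costs the same amount. A short derivation, writing s for the index of the last rung (notation introduced here for illustration) and charging each rung at its full epoch count:

$$
B = \sum_{i=0}^{s} \frac{n}{\eta^{i}} \cdot r\,\eta^{i} = (s + 1)\,n\,r, \qquad s = \log_{\eta}\frac{R}{r}
$$

With n = 81, η = 3, R = 27, r = 1 this gives s = 3 and B = 4 × 81 × 1 = 324 epochs. When r = 1 the number of rungs is simply log_η(R) + 1.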

---

### Cost-Performance Trade-offs

**AutoML Costs** (per optimization run):
- **Random Search** (100 trials): $1,000 - $10,000
- **Bayesian Optimization** (30 trials): $300 - $3,000
- **ASHA** (100 configs, early stopping): $500 - $5,000
- **Full Grid Search** (10^5 trials): $100,000+ (infeasible!)

**Time Savings**:
- **Manual tuning**: 1-4 weeks (expert data scientist)
- **Random search**: 1-3 days (automated)
- **Bayesian optimization**: 4-12 hours (sample efficient)
- **ASHA**: 1-2 days (parallel, early stopping)

**ROI Example** (Post-Silicon Yield Prediction):
- **Manual tuning**: 3 weeks × $5K/week = $15K, R² = 0.85
- **Bayesian AutoML**: 1 day × $2K = $2K, R² = 0.93
- **Performance gain**: 0.08 R² → $7.9M/year value
- **ROI**: AutoML costs $13K less than manual tuning ($2K vs $15K) and adds $7.9M/year, so it wins on both cost and performance; relative to the $2K spend, the return is roughly **3,950×**

---

### Common Pitfalls & Solutions

**Pitfall 1: Overfitting to validation set**
- **Problem**: Optimize hyperparameters on validation set → leak information
- **Solution**: Use nested cross-validation
  - Outer loop: Train/test split
  - Inner loop: HPO on training set (with validation)
  - Report performance on held-out test set
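
Nested cross-validation can be sketched with scikit-learn by passing a search object as the estimator of an outer `cross_val_score`. The example below uses synthetic data and a Ridge model purely for illustration; the grid, fold counts, and seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic stand-in for a real dataset
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # HPO folds
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # evaluation folds

# Inner loop: hyperparameter search, refit on each outer training split
search = GridSearchCV(Ridge(), param_grid={'alpha': [0.01, 0.1, 1.0, 10.0]},
                      cv=inner_cv, scoring='r2')

# Outer loop: each score comes from data the search never saw
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring='r2')
print(f"Nested CV R^2: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

The outer scores estimate how well the whole tuning procedure generalizes, not the quality of one specific hyperparameter setting.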

**Pitfall 2: Search space too large**
- **Problem**: 20 hyperparameters with 10 values each = 10^20 combinations
- **Solution**: 
  - Start with important hyperparameters (learning rate, regularization)
  - Fix less important hyperparameters to defaults
  - Use literature or prior knowledge to narrow ranges

**Pitfall 3: Noisy objective function**
- **Problem**: Stochastic training → high variance in performance
- **Solution**:
  - Average over multiple runs (3-5 runs per config)
  - Use Bayesian optimization with noise modeling
  - Increase training epochs for more stable estimates
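
Averaging over repeated runs can be wrapped around any objective. In this sketch `noisy_objective` is a made-up stand-in for a stochastic training run, and the noise level and seeds are assumptions; reporting the standard error alongside the mean shows how much of a difference between configs is just noise.

```python
import numpy as np

def noisy_objective(config: dict, seed: int) -> float:
    """Stand-in for one stochastic training run (illustrative only)."""
    rng = np.random.default_rng(seed)
    true_score = 0.9 - 100 * (config['lr'] - 0.01) ** 2
    return float(true_score + rng.normal(0, 0.02))

def averaged_objective(config: dict, n_repeats: int = 5, base_seed: int = 0):
    """Mean score over n_repeats seeds, plus the standard error of that mean."""
    scores = [noisy_objective(config, base_seed + i) for i in range(n_repeats)]
    mean = float(np.mean(scores))
    sem = float(np.std(scores, ddof=1) / np.sqrt(n_repeats))
    return mean, sem

mean, sem = averaged_objective({'lr': 0.01})
print(f"score = {mean:.3f} +/- {sem:.3f}")
```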

**Pitfall 4: Objective function too expensive**
- **Problem**: Each trial takes 6 hours → AutoML takes months
- **Solution**:
  - Use multi-fidelity optimization (ASHA, Hyperband)
  - Train on subset of data (10% → 100% progressive)
  - Use proxy metrics (validation loss at epoch 5 correlates with final accuracy)

**Pitfall 5: Ignoring domain constraints**
- **Problem**: AutoML finds 500M parameter model (can't deploy to edge device)
- **Solution**:
  - Add constraints to search space (max_params ≤ 50M)
  - Use penalized objective (accuracy - 0.01 × num_params, with num_params in a scaled unit so the penalty is comparable to accuracy)
  - Multi-objective optimization (accuracy vs model size)
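
The first two solutions can be combined in one scoring function: a hard limit rejects undeployable models and a soft penalty favors smaller ones. This is a sketch; `penalized_score`, the penalty weight per million parameters, and the candidate numbers are all illustrative assumptions.

```python
def penalized_score(accuracy: float, num_params: int,
                    max_params: int = 50_000_000,
                    penalty_per_million: float = 0.001) -> float:
    """Accuracy minus a size penalty; models over max_params are infeasible."""
    if num_params > max_params:
        return float('-inf')  # never selected by a maximizer
    return accuracy - penalty_per_million * (num_params / 1e6)

# (accuracy, parameter count) for three hypothetical candidates
candidates = {
    'small': (0.940, 15_000_000),
    'medium': (0.950, 45_000_000),
    'huge': (0.970, 500_000_000),
}
best = max(candidates, key=lambda name: penalized_score(*candidates[name]))
print(best)  # small
```

The penalty weight encodes how much accuracy one is willing to trade per million parameters, so it should be set from the deployment requirements rather than tuned by the optimizer.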

---

### Next Steps in Your Learning Path

**Prerequisites** (should know):
- ✅ Machine learning fundamentals (supervised learning, cross-validation)
- ✅ Hyperparameters vs parameters distinction
- ✅ Overfitting and regularization concepts

**You Now Understand**:
- ✅ Grid search vs random search vs Bayesian optimization
- ✅ Gaussian Process surrogate models and acquisition functions
- ✅ Multi-objective optimization with NSGA-II and Pareto fronts
- ✅ Early stopping and multi-fidelity methods (ASHA, Hyperband)
- ✅ Production AutoML frameworks (Optuna, Ray Tune)

**Continue Learning**:
- **Next**: Notebook 159 - ML Model Compression & Quantization
- **Related**: Notebook 157 - Distributed Training (parallelize HPO)
- **Advanced**: Neural Architecture Search (DARTS, ENAS)
- **Production**: Notebook 156 - ML Pipeline Orchestration (automate AutoML)

**Hands-On Practice**:
1. Implement Bayesian optimization from scratch (Gaussian Process + EI)
2. Run Optuna on your dataset (compare to manual tuning)
3. Set up ASHA for deep learning model (image classification)
4. Build multi-objective HPO for accuracy vs latency trade-off
5. Deploy AutoML pipeline with MLflow experiment tracking

**Advanced Topics** (explore on your own):
- **Transfer learning for HPO**: Use hyperparameters from similar datasets
- **Meta-learning**: Learn to learn hyperparameters across tasks
- **Automated feature engineering**: AutoML for preprocessing
- **Neural Architecture Search**: Differentiable architecture search (DARTS)
- **Population-based training**: Evolve hyperparameters during training

---

### Summary

**AutoML democratizes machine learning** by automating the tedious hyperparameter tuning process. Instead of spending weeks manually experimenting, **Bayesian optimization finds near-optimal configurations in hours**. Multi-objective methods like NSGA-II reveal **trade-offs between conflicting objectives** (accuracy vs latency). Early stopping strategies like ASHA reduce costs by **75-85% through intelligent trial pruning**.

**Business impact is substantial**: Post-silicon validation benefits from automated yield prediction model selection ($23.5M/year), adaptive ATE test optimization ($28.7M/year), and wafer map CNN architecture search ($19.8M/year). General AI/ML applications see similar gains in e-commerce recommendations ($42M/year), medical diagnosis ($55M/year), and fraud detection ($38M/year).

**Production deployment requires careful consideration**: Choose the right AutoML method for your problem (Bayesian for expensive objectives, ASHA for deep learning, NSGA-II for multi-objective), avoid common pitfalls (overfitting to validation set, noisy objectives), and use mature frameworks (Optuna, Ray Tune) for reliability.

**The future is automated**: As models grow larger and more complex, manual hyperparameter tuning becomes infeasible. AutoML is not a luxury; **it's a necessity for competitive ML systems**. Start with simple Bayesian optimization, graduate to multi-fidelity methods, and eventually build continuous AutoML pipelines that adapt to data drift.

**Your next step**: Apply AutoML to your most important model. Measure the time savings and performance gains. You'll never go back to manual tuning.

---

🎉 **Congratulations!** You now have production-ready AutoML skills that save months of manual work and millions in business value!