# Exercise 1: Implementing a Learning Rate Finder

In this exercise, we'll implement a basic learning rate finder that helps identify good learning rates for training neural networks. The learning rate finder works by training the model for a few iterations while exponentially increasing the learning rate and monitoring the loss.
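
To make the "exponentially increasing" sweep concrete, here is a minimal plain-Python sketch of the learning rates such a finder would visit. It is an illustration only, not the callback itself: the helper name `exponential_lr_sweep` is hypothetical, and the default range simply mirrors the `min_lr`, `max_lr`, and `steps` defaults used in Part 1.

```python
def exponential_lr_sweep(min_lr=1e-7, max_lr=10.0, steps=200):
    """Return the per-step factor and the learning rates visited by the sweep.

    Hypothetical helper for illustration; the exercise implements the same
    idea inside a Keras callback.
    """
    # Multiplying by `factor` exactly `steps` times takes min_lr to max_lr.
    factor = (max_lr / min_lr) ** (1.0 / steps)
    learning_rates = [min_lr * factor ** i for i in range(steps + 1)]
    return factor, learning_rates


factor, learning_rates = exponential_lr_sweep()
```

Because each step multiplies by the same factor, the visited learning rates are evenly spaced on a logarithmic axis, which is why the results are later plotted with a log-scaled x-axis.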

## Setup and Imports



In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Load and preprocess Fashion MNIST dataset
fashion_mnist = tf.keras.datasets.fashion_mnist.load_data()
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist
X_train = X_train_full.astype('float32') / 255.0
y_train = y_train_full

# Create a simple model
def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=[28, 28]),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    return model



## Part 1: Learning Rate Finder Callback

First, we'll create a callback that increases the learning rate exponentially and records the loss at each step.



In [None]:
class LearningRateFinder(tf.keras.callbacks.Callback):
    def __init__(self, min_lr=1e-7, max_lr=10, steps=200):
        super().__init__()
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.steps = steps
        self.learning_rates = []
        self.losses = []
        
        # Calculate the multiplication factor for each step
        # STUDENT TASK: Calculate the factor that, when multiplied steps times,
        # goes from min_lr to max_lr
        self.lr_factor = # Your code here
        
    def on_train_begin(self, logs=None):
        # Start with minimum learning rate
        #print(f"Setting min rate {self.min_lr}")
        self.model.optimizer.learning_rate.assign(self.min_lr)
        
    def on_train_batch_end(self, batch, logs=None):
        # Get the current learning rate and loss
        lr = self.model.optimizer.learning_rate
        loss = logs['loss']
        
        # Store the values
        self.learning_rates.append(tf.keras.backend.get_value(lr))
        self.losses.append(loss)
        
        # Stop if the loss is not finite (exploding)
        if not np.isfinite(loss):
            print('Stopping - Loss is not finite!')
            self.model.stop_training = True
            return
        
        # Increase the learning rate for the next iteration
        # STUDENT TASK: Multiply the current learning rate by lr_factor
        new_lr = # Your code here
        #print(f"New factor {new_lr}")
        self.model.optimizer.learning_rate.assign(new_lr)



## Part 2: Finding the Learning Rate

Now we'll create a function that uses our callback to find a good learning rate.



In [None]:
def find_learning_rate(model, X, y, batch_size=64, steps=200):
    # Initialize the learning rate finder
    lr_finder = LearningRateFinder(steps=steps)
    
    # Compile the model
    # STUDENT TASK: Compile the model with SGD optimizer and sparse_categorical_crossentropy
    model.compile(
        optimizer=# YOUR CODE HERE,
        loss=# YOUR CODE HERE,
        metrics=['accuracy']
    )
    
    # Calculate the number of samples to use
    num_samples = steps * batch_size
    
    # If we have more samples than we need, take a random subset
    if len(X) > num_samples:
        # Sample without replacement so the subset contains no duplicate rows
        idx = np.random.choice(len(X), size=num_samples, replace=False)
        X = X[idx]
        y = y[idx]
    
    # Train the model with the learning rate finder
    history = model.fit(
        X, y,
        # STUDENT TASK: Set batch_size and include the lr_finder callback
        batch_size=# YOUR CODE HERE,
        epochs=1,
        callbacks=[# YOUR CODE HERE],
        verbose=0
    )
    
    return lr_finder.learning_rates, lr_finder.losses



## Part 3: Visualizing and Analyzing Results

Finally, we'll create a function to plot and analyze the results.



In [None]:
def plot_learning_rate(learning_rates, losses):
    # Remove any infinite or nan losses
    valid_idx = np.isfinite(losses)
    learning_rates = np.array(learning_rates)[valid_idx]
    losses = np.array(losses)[valid_idx]
    
    # Create the plot
    plt.figure(figsize=(10, 6))
    plt.plot(learning_rates, losses)
    plt.xscale('log')
    plt.xlabel('Learning Rate')
    plt.ylabel('Loss')
    plt.title('Loss vs. Learning Rate')
    
    # STUDENT TASK: Find the learning rate with minimum loss
    min_loss_idx = # YOUR CODE HERE
    best_lr = learning_rates[min_loss_idx]
    
    # Add a dot at the minimum loss
    plt.plot(best_lr, losses[min_loss_idx], 'ro')
    
    # Add a text annotation
    plt.annotate(f'Best LR: {best_lr:.2e}', 
                xy=(best_lr, losses[min_loss_idx]),
                xytext=(best_lr*1.5, losses[min_loss_idx]*1.1),
                arrowprops=dict(facecolor='black', shrink=0.05))
    
    plt.grid(True)
    plt.show()
    
    return best_lr

# Now let's run everything!
model = create_model()
lr_values, loss_values = find_learning_rate(model, X_train, y_train)
best_lr = plot_learning_rate(lr_values, loss_values)
print(f"\nRecommended learning rate: {best_lr:.2e}")



## Student Tasks:

1. In the `LearningRateFinder` class, calculate the `lr_factor` that will increase the learning rate from `min_lr` to `max_lr` over the specified number of steps.
   - Hint: Think about the relationship between exponential growth and multiplication.
   - The formula is: `factor = (max_lr/min_lr)^(1/steps)`

2. In the `LearningRateFinder` class, implement the learning rate update step.
   - You need to multiply the current learning rate by the `lr_factor`

3. In the `find_learning_rate` function, complete the model compilation step:
   - Choose the appropriate optimizer (SGD)
   - Set the loss function for classification
   - Remember this is a multi-class problem with integer labels

4. In the `find_learning_rate` function, set up the model fitting parameters:
   - Set the batch size
   - Include the learning rate finder callback

5. In the `plot_learning_rate` function, find the index of the minimum loss:
   - Use numpy to find the index of the minimum value in the losses array

## Extension Questions:

1. Why do we use a logarithmic scale for the learning rate?
2. What happens if we set the maximum learning rate too high?
3. Why might we want to use only a subset of our training data for finding the learning rate?
4. How would you modify this code to find a good learning rate for different optimizers (Adam, RMSprop, etc.)?

## Expected Output:

When run correctly, you should see:
- A plot showing the loss vs. learning rate on a log scale
- A red dot indicating the point of minimum loss
- An annotation showing the recommended learning rate
- The recommended learning rate should typically be between 1e-4 and 1e-1

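Remember: the value reported by the code is the learning rate at the minimum loss point, which tends to sit close to where the loss starts to diverge. A common rule of thumb is to train with about 1/10th of that value, as this provides a good balance between learning speed and stability.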
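
The rule of thumb above can be sketched in a few lines of plain Python. The helper name `suggest_learning_rate`, the `divisor` parameter, and the toy numbers are all assumptions made for illustration; they are not measurements from this exercise.

```python
import math


def suggest_learning_rate(learning_rates, losses, divisor=10.0):
    """Return (lr at minimum finite loss, that lr divided by `divisor`).

    Hypothetical helper: non-finite losses (NaN/Inf) are ignored.
    """
    finite = [(lr, loss) for lr, loss in zip(learning_rates, losses)
              if math.isfinite(loss)]
    if not finite:
        raise ValueError("No finite losses were recorded")
    lr_at_min, _ = min(finite, key=lambda pair: pair[1])
    return lr_at_min, lr_at_min / divisor


# Made-up sweep results, only to show the call
toy_lrs = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
toy_losses = [2.3, 2.1, 1.2, 0.8, float("nan")]
lr_at_min, suggested_lr = suggest_learning_rate(toy_lrs, toy_losses)
```

In this toy data the minimum finite loss occurs at a learning rate of 0.1, so the suggested training value is one tenth of that.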

# Exercise 2: Optimizer Comparison with Noisy Data

In this exercise, we'll explore how different optimizers perform when training on data with varying levels of noise. We'll learn how adaptive and non-adaptive optimizers respond differently to noisy gradients.
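
One rough intuition for the difference, sketched in plain Python: plain SGD steps scale directly with the gradient, while an RMSprop-style rule divides each gradient by a running estimate of its magnitude. The function names below are hypothetical, the update rule is a simplified version of RMSprop rather than the exact Keras implementation, and the gradient values are made up.

```python
import math


def sgd_steps(grads, lr=0.01):
    """Step sizes taken by plain SGD: proportional to each gradient."""
    return [lr * g for g in grads]


def rmsprop_style_steps(grads, lr=0.01, rho=0.9, eps=1e-7):
    """Step sizes for a simplified RMSprop-style rule (illustrative only)."""
    avg_sq = 0.0
    steps = []
    for g in grads:
        # Running average of squared gradients
        avg_sq = rho * avg_sq + (1.0 - rho) * g * g
        steps.append(lr * g / (math.sqrt(avg_sq) + eps))
    return steps


base_grads = [1.0, -0.5, 0.8, -1.2, 0.9, -0.7]   # made-up gradients
large_grads = [100.0 * g for g in base_grads]    # same pattern, 100x larger

sgd_base, sgd_large = sgd_steps(base_grads), sgd_steps(large_grads)
rms_base, rms_large = rmsprop_style_steps(base_grads), rmsprop_style_steps(large_grads)
```

Here the SGD steps grow 100-fold with the gradients, while the RMSprop-style steps stay almost unchanged. This normalization is one plausible reason adaptive methods can be less sensitive to gradient scale; whether it helps on the noisy data in this exercise is what the experiment is meant to show.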

## Setup and Imports



In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)



## Part 1: Data Generation

First, we'll create a function to generate synthetic data with controllable noise levels.



In [None]:
def generate_noisy_data(n_samples=1000, noise_level=0.1):
    """
    Generates synthetic classification data with controlled noise
    and non-linear decision boundaries
    
    Args:
        n_samples: Number of samples to generate
        noise_level: Controls both cluster overlap and noise magnitude
    """
    # Number of samples per class
    n_per_class = n_samples // 4
    
    # Generate four overlapping clusters in a spiral pattern
    t = np.linspace(0, 4*np.pi, n_per_class)
    
    # First two clusters (class 0)
    r1 = 2 + 0.2 * noise_level * np.random.randn(n_per_class)
    x1 = r1 * np.cos(t)
    y1 = r1 * np.sin(t)
    
    r2 = 4 + 0.2 * noise_level * np.random.randn(n_per_class)
    x2 = r2 * np.cos(t + np.pi/2)
    y2 = r2 * np.sin(t + np.pi/2)
    
    # Second two clusters (class 1)
    r3 = 3 + 0.2 * noise_level * np.random.randn(n_per_class)
    x3 = r3 * np.cos(t + np.pi/4)
    y3 = r3 * np.sin(t + np.pi/4)
    
    r4 = 5 + 0.2 * noise_level * np.random.randn(n_per_class)
    x4 = r4 * np.cos(t + 3*np.pi/4)
    y4 = r4 * np.sin(t + 3*np.pi/4)
    
    # Combine all clusters
    X = np.vstack([
        np.column_stack([x1, y1]),
        np.column_stack([x2, y2]),
        np.column_stack([x3, y3]),
        np.column_stack([x4, y4])
    ])
    
    # Create labels
    y = np.hstack([
        np.zeros(2*n_per_class),
        np.ones(2*n_per_class)
    ])
    
    # STUDENT TASK
    # Add random noise to the data set to make it a little more challenging
    X = #YOUR CODE HERE
    
    # Shuffle the dataset
    idx = np.random.permutation(len(X))
    X = X[idx]
    y = y[idx]
    
    # Standardize features
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    
    return X, y



## Part 2: Model Creation



In [None]:
def create_model():
    """Creates a simple neural network classifier"""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation='relu', input_shape=(2,)),
        tf.keras.layers.Dense(8, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    return model

def get_optimizer(optimizer_name, learning_rate):
    """
    Creates an optimizer instance based on name and learning rate
    
    Args:
        optimizer_name: String name of optimizer ('sgd', 'adam', etc.)
        learning_rate: Learning rate to use
    """
    # STUDENT TASK: Create and return the appropriate optimizer
    # Include momentum=0.9 for SGD
    if optimizer_name.lower() == 'sgd':
        return # YOUR CODE HERE
    elif optimizer_name.lower() == 'adam':
        return # YOUR CODE HERE
    elif optimizer_name.lower() == 'rmsprop':
        return # YOUR CODE HERE
    else:
        raise ValueError(f"Unsupported optimizer: {optimizer_name}")



## Part 3: Training and Evaluation



In [None]:
def train_and_evaluate(X, y, optimizer_name, learning_rate, noise_level, 
                      epochs=100, batch_size=32):
    """
    Trains a model and returns training history
    """
    # Split data into train/test sets
    # STUDENT TASK: Split the data 80/20 using train_test_split
    X_train, X_test, y_train, y_test = # YOUR CODE HERE
    
    # Create and compile model
    model = create_model()
    optimizer = get_optimizer(optimizer_name, learning_rate)
    model.compile(optimizer=optimizer,
                 loss='binary_crossentropy',
                 metrics=['accuracy'])
    
    # Train model
    history = model.fit(X_train, y_train,
                       epochs=epochs,
                       batch_size=batch_size,
                       validation_data=(X_test, y_test),
                       verbose=0)
    
    # Get final test accuracy
    test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
    
    return history.history, test_acc

def compare_optimizers(noise_levels=[0.1, 0.5, 1.0], 
                      learning_rates=[0.001, 0.01, 0.1]):
    """
    Compares optimizers across different noise levels and learning rates
    """
    results = []
    optimizers = ['SGD', 'Adam', 'RMSprop']
    
    # STUDENT TASK: Create nested loops to test all combinations
    # Loop through noise levels, optimizers, and learning rates
    for # YOUR CODE HERE:
        # Generate noisy data
        X, y = generate_noisy_data(noise_level=noise_level)
        
        # Train with current configuration
        history, test_acc = train_and_evaluate(X, y, optimizer, lr, noise_level)
        
        # Store results
        results.append({
            'Noise Level': noise_level,
            'Optimizer': optimizer,
            'Learning Rate': lr,
            'Test Accuracy': test_acc
        })
    
    return pd.DataFrame(results)



## Part 4: Visualization



In [None]:
def plot_results(results_df):
    """
    Creates visualizations of the results
    """
    # Create a heatmap for each optimizer
    plt.figure(figsize=(15, 5))
    
    for i, optimizer in enumerate(['SGD', 'Adam', 'RMSprop']):
        plt.subplot(1, 3, i+1)
        
        # STUDENT TASK: Pivot the data to create a heatmap
        # Rows should be noise levels, columns should be learning rates
        pivot_data = # YOUR CODE HERE
        
        sns.heatmap(pivot_data, annot=True, fmt='.3f', cmap='YlOrRd')
        plt.title(f'{optimizer} Performance')
        plt.xlabel('Learning Rate')
        plt.ylabel('Noise Level')
    
    plt.tight_layout()
    plt.show()

# Run the experiment
results = compare_optimizers()
plot_results(results)



## Student Tasks:

1. In `generate_noisy_data`, add Gaussian noise to the features:
   - Use np.random.normal with the given noise_level as the scale
   - The noise should be added to the original features X

2. In `get_optimizer`, implement the creation of each optimizer type:
   - SGD with momentum=0.9
   - Adam with default parameters
   - RMSprop with default parameters

3. In `train_and_evaluate`, split the data into training and test sets:
   - Use sklearn's train_test_split
   - Use an 80/20 split ratio
   - Set random_state=42 for reproducibility

4. In `compare_optimizers`, implement the nested loops:
   - Outer loop over noise levels
   - Middle loop over optimizers
   - Inner loop over learning rates

5. In `plot_results`, create the pivot table for the heatmap:
   - Filter for the current optimizer
   - Pivot the data with noise levels as index and learning rates as columns
   - Values should be test accuracy

## Extension Questions:

1. Why do adaptive optimizers (Adam, RMSprop) typically perform better with noisy data?
2. How does the optimal learning rate change with noise level for each optimizer?
3. What happens if we increase the number of training epochs? Does it affect different optimizers differently?
4. How would you modify this experiment to test the optimizers' robustness to different types of noise (e.g., outliers vs. Gaussian noise)?

## Expected Output:

When run correctly, you should see:
- Three heatmaps showing the performance of each optimizer
- Performance should generally decrease with increasing noise
- Adaptive optimizers should show more robustness to noise
- Higher learning rates should show more sensitivity to noise

The visualization should help identify which optimizer is most robust across different noise levels and learning rates.

# Exercise 3: Custom Learning Rate Schedule

In this exercise, we'll create a custom learning rate schedule that combines multiple scheduling strategies. Our schedule will implement a linear "warm-up" period, followed by cosine decay, and include periodic "restarts" where the learning rate jumps back up at the start of each decay cycle.

## Setup and Imports



In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

import math

from tensorflow.keras.datasets import mnist

(X_train_full, y_train_full), (X_test, y_test) = mnist.load_data()
# Take 5000 samples
n_samples = 5000
indices = np.random.permutation(len(X_train_full))[:n_samples]
X_train = X_train_full[indices] / 255.0
y_train = y_train_full[indices]
# Scale the test images the same way as the training images
X_test = X_test / 255.0



## Part 1: Custom Learning Rate Schedule Implementation



In [None]:
class WarmupCosineRestart(tf.keras.optimizers.schedules.LearningRateSchedule):
    """
    Custom learning rate schedule with:
    1. Linear warmup period
    2. Cosine decay
    3. Periodic restarts
    """
    def __init__(self, initial_learning_rate, warmup_steps, decay_steps, alpha=0.0):
        super(WarmupCosineRestart, self).__init__()
        
        self.initial_learning_rate = initial_learning_rate
        self.warmup_steps = warmup_steps
        self.decay_steps = decay_steps
        self.alpha = alpha  # Minimum learning rate factor
        
    def __call__(self, step):
        # Convert step to float32
        step = tf.cast(step, tf.float32)
        
        # STUDENT TASK 1: Implement warmup phase
        # During warmup, LR should increase linearly from 0 to initial_learning_rate
        warmup_lr = # YOUR CODE HERE
        
        # STUDENT TASK 2: Implement cosine decay with restart
        # Calculate the current cycle and progress within cycle
        cycle = tf.floor(step / self.decay_steps)
        current_step = step - (cycle * self.decay_steps)
        
        # Cosine decay formula
        cosine_decay = # YOUR CODE HERE
        
        # Combine warmup and decay
        lr = tf.cond(
            step < self.warmup_steps,
            lambda: warmup_lr,
            lambda: self.initial_learning_rate * cosine_decay
        )
        
        return lr
    
    def get_config(self):
        return {
            "initial_learning_rate": self.initial_learning_rate,
            "warmup_steps": self.warmup_steps,
            "decay_steps": self.decay_steps,
            "alpha": self.alpha
        }



## Part 2: Training Loop and Visualization



In [None]:
def create_model():
    """Creates a simple CNN model"""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    return model

def plot_schedule(schedule, steps):
    """Plots the learning rate schedule"""
    lrs = [schedule(step).numpy() for step in range(steps)]
    plt.figure(figsize=(10, 4))
    plt.plot(lrs)
    plt.title('Learning Rate Schedule')
    plt.xlabel('Step')
    plt.ylabel('Learning Rate')
    plt.grid(True)
    plt.show()
    
def train_and_compare_schedules(X_train, y_train, X_test, y_test):
    """Trains models with different learning rate schedules and compares them"""
    
    # Calculate steps per epoch
    batch_size = 32
    steps_per_epoch = len(X_train) // batch_size
    
    # STUDENT TASK 3: Create three different schedule configurations
    schedules = {
        'standard': tf.keras.optimizers.schedules.ExponentialDecay(
            # YOUR CODE HERE
        ),
        'warmup_cosine': WarmupCosineRestart(
            # YOUR CODE HERE
        ),
        'custom': # Create your own schedule configuration
    }
    
    histories = {}
    
    # Train with each schedule
    for name, schedule in schedules.items():
        # STUDENT TASK 4: Create and compile model
        model = create_model()
        model.compile(
            # YOUR CODE HERE
        )
        
        # Reshape data for CNN
        X_train_reshaped = X_train.reshape(-1, 28, 28, 1)
        X_test_reshaped = X_test.reshape(-1, 28, 28, 1)
        
        # Train model
        history = model.fit(
            X_train_reshaped, y_train,
            epochs=10,
            batch_size=batch_size,  # keep in sync with steps_per_epoch above
            validation_data=(X_test_reshaped, y_test),
            verbose=1
        )
        
        histories[name] = history.history
    
    return histories

def visualize_results(histories):
    """Plots training curves for different schedules"""
    plt.figure(figsize=(12, 4))
    
    # STUDENT TASK 5: Create subplots for loss and accuracy
    # Plot training curves for each schedule
    plt.subplot(1, 2, 1)
    for name, history in histories.items():
        # YOUR CODE HERE - Plot loss
    
    plt.subplot(1, 2, 2)
    for name, history in histories.items():
        # YOUR CODE HERE - Plot accuracy
    
    plt.tight_layout()
    plt.show()



## Part 3: Running the Experiment



In [None]:
# Create and visualize the custom schedule
schedule = WarmupCosineRestart(
    initial_learning_rate=0.001,
    warmup_steps=1000,
    decay_steps=4000,
    alpha=0.1
)

# Plot the schedule
plot_schedule(schedule, 15000)

# Train models and compare results
histories = train_and_compare_schedules(X_train, y_train, X_test, y_test)
visualize_results(histories)



## Student Tasks:

1. In the `WarmupCosineRestart` class, implement the warmup phase:
   - Learning rate should increase linearly from 0 to initial_learning_rate
   - Use tf.minimum to ensure we don't exceed warmup_steps

2. In the `WarmupCosineRestart` class, implement the cosine decay:
   - Use tf.cos and tf.cast to create the decay
   - Formula: 0.5 * (1 + cos(π * x)) where x goes from 0 to 1 in each cycle

3. In `train_and_compare_schedules`, create three learning rate schedules:
   - Standard exponential decay
   - Warmup cosine with restart
   - Your own custom configuration of the WarmupCosineRestart

4. Complete the model compilation in `train_and_compare_schedules`:
   - Use the appropriate optimizer with the schedule
   - Set loss and metrics

5. In `visualize_results`, implement the plotting code:
   - Create loss subplot with all schedules
   - Create accuracy subplot with all schedules
   - Add appropriate labels and legend
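
For the "standard" schedule in task 3, it can help to know what exponential decay computes before choosing its arguments. The sketch below is a plain-Python restatement of the formula `initial_learning_rate * decay_rate ** (step / decay_steps)`, with the optional staircase behaviour flooring the exponent. The function name and the example numbers are hypothetical and are not a recommended configuration.

```python
import math


def exponential_decay_lr(step, initial_lr, decay_steps, decay_rate, staircase=False):
    """Plain-Python restatement of exponential learning rate decay."""
    exponent = step / decay_steps
    if staircase:
        # Decay in discrete jumps instead of continuously
        exponent = math.floor(exponent)
    return initial_lr * decay_rate ** exponent


# Example values only: the rate halves every 1000 steps
lr_start = exponential_decay_lr(0, 0.001, 1000, 0.5)
lr_after_1000 = exponential_decay_lr(1000, 0.001, 1000, 0.5)
lr_staircase_1500 = exponential_decay_lr(1500, 0.001, 1000, 0.5, staircase=True)
```

When choosing `decay_steps`, compare it with `steps_per_epoch` times the number of epochs, so the schedule actually decays within the training run.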

## Extension Questions:

1. How does the warmup period affect the early stages of training?
2. What are the advantages and disadvantages of learning rate restarts?
3. How would you modify the schedule for a very deep network?
4. What considerations would you make when choosing the warmup_steps and decay_steps parameters?

## Expected Output:

When run correctly, you should see:
- A plot of the learning rate schedule showing warmup, decay, and restarts
- Training curves comparing different schedules showing:
  - Loss generally decreasing faster with the custom schedule
  - Potential spikes in loss during learning rate restarts
  - Better final performance with the custom schedule

The plot should clearly show the warmup period, followed by cosine decay cycles with restarts.
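
To sanity-check the shape of that plot, the following plain-Python sketch computes the same kind of schedule one step at a time. It is a reference for comparison, not the TensorFlow solution: the function name is hypothetical, and the way `alpha` is applied (as a floor expressed as a fraction of the initial rate) is an assumption borrowed from the usual cosine-decay convention. The cycle position is computed from the global step, mirroring the `cycle` and `current_step` lines in the class skeleton.

```python
import math


def warmup_cosine_restart_lr(step, initial_lr, warmup_steps, decay_steps, alpha=0.0):
    """Reference value of a warmup + cosine-with-restarts schedule at `step`."""
    if step < warmup_steps:
        # Linear warmup from 0 to initial_lr
        return initial_lr * step / warmup_steps
    # Position inside the current decay cycle, in [0, 1)
    progress = (step % decay_steps) / decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Assumed convention: alpha is the minimum fraction of initial_lr
    return initial_lr * (alpha + (1.0 - alpha) * cosine)


# Same settings as the schedule created in Part 3
lr_mid_warmup = warmup_cosine_restart_lr(500, 0.001, 1000, 4000, alpha=0.1)
lr_at_restart = warmup_cosine_restart_lr(4000, 0.001, 1000, 4000, alpha=0.1)
lr_mid_cycle = warmup_cosine_restart_lr(6000, 0.001, 1000, 4000, alpha=0.1)
```

Note that because the cycle position is measured from step 0, the learning rate at the end of warmup does not necessarily match the cosine value at that step, so a small jump at the warmup boundary is expected with this formulation.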

# Exercise 4: Momentum and Learning Rate Interaction Study

In this exercise, we'll explore how momentum and learning rate interact during training. We'll create a systematic study of different combinations and visualize their effects on model training.
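
Before running the full study, the interaction can be previewed on a toy problem. The sketch below applies the SGD momentum update (velocity = momentum * velocity - lr * grad, with the Nesterov variant adding a look-ahead term) to the one-dimensional function f(w) = 0.5 * curvature * w². The function name, the toy objective, and the parameter values are assumptions chosen for illustration; they are not the settings used in the experiment.

```python
def sgd_momentum_final_w(lr, momentum, nesterov=False, steps=200,
                         curvature=1.0, w0=1.0):
    """Run SGD with momentum on f(w) = 0.5 * curvature * w**2; return final w."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        grad = curvature * w
        velocity = momentum * velocity - lr * grad
        if nesterov:
            # Nesterov: apply the velocity once more as a look-ahead
            w = w + momentum * velocity - lr * grad
        else:
            w = w + velocity
    return w


w_standard = sgd_momentum_final_w(lr=0.1, momentum=0.9)
w_nesterov = sgd_momentum_final_w(lr=0.1, momentum=0.9, nesterov=True)
w_diverged = sgd_momentum_final_w(lr=2.5, momentum=0.0, steps=50)
```

With these toy settings the first two runs shrink `w` toward 0, while the third grows without bound, since each plain SGD step multiplies `w` by (1 - 2.5) = -1.5. A real network has many curvature directions, so the heatmaps in this exercise are the better guide to which combinations are safe.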

## Setup and Imports



In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tensorflow.keras.datasets import fashion_mnist
import pandas as pd

# Load and preprocess Fashion MNIST dataset
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0

# Take a subset of data for faster experimentation
n_samples = 10000
X_train = X_train[:n_samples]
y_train = y_train[:n_samples]



## Part 1: Training Infrastructure



In [None]:
def create_model(seed=42):
    """Creates a simple neural network"""
    tf.random.set_seed(seed)
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=[28, 28]),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

class TrainingMonitor(tf.keras.callbacks.Callback):
    """Monitors per-batch loss and relative loss changes"""
    def __init__(self):
        super().__init__()
        self.batch_losses = []
        self.loss_changes = []  # Track relative loss changes
        
    def on_train_batch_end(self, batch, logs=None):
        # STUDENT TASK 1: Store the batch loss
        # Hint: Use logs dictionary
        self.batch_losses.append(# YOUR CODE HERE)
        
        
        # STUDENT TASK 2: Calculate relative loss change 
        # Calculate the absolute change in loss relative to the loss from the previous batch
        if len(self.batch_losses) > 1:
            loss_change = #YOUR CODE HERE
            self.loss_changes.append(loss_change)
        else:
            self.loss_changes.append(0.0)




## Part 2: Training Function



In [None]:
def train_model_with_params(learning_rate, momentum, use_nesterov=False):
    """Trains model with specific learning rate and momentum settings"""
    model = create_model()
    
    # STUDENT TASK 3: Create SGD optimizer with given parameters
    optimizer = # YOUR CODE HERE
    
    model.compile(
        optimizer=optimizer,
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    monitor = TrainingMonitor()
    
    # Train for a small number of epochs
    history = model.fit(
        X_train, y_train,
        epochs=5,
        batch_size=32,
        validation_split=0.2,
        callbacks=[monitor],
        verbose=0
    )
    
    return {
        'history': history.history,
        'batch_losses': monitor.batch_losses,
        'loss_changes': monitor.loss_changes,
        'final_loss': history.history['loss'][-1],
        'final_accuracy': history.history['accuracy'][-1]
    }

def run_parameter_study(learning_rates, momentums):
    """Runs training with different combinations of learning rates and momentums"""
    results = []
    
    # STUDENT TASK 4: Create nested loops to test all combinations
    # Include both standard momentum and Nesterov momentum
    for # YOUR CODE HERE:
        
        # Store results in a list of dictionaries
        results.append({
            'learning_rate': lr,
            'momentum': momentum,
            'nesterov': use_nesterov,
            'final_loss': metrics['final_loss'],
            'final_accuracy': metrics['final_accuracy'],
            'avg_loss_change': np.mean(metrics['loss_changes'])
        })
    
    return pd.DataFrame(results)



## Part 3: Visualization Functions



In [None]:
def create_heatmaps(results_df):
    """Creates heatmaps for loss and accuracy across parameter combinations"""
    fig, axes = plt.subplots(2, 2, figsize=(15, 15))
    
    # STUDENT TASK 5: Create four heatmaps
    # Standard momentum accuracy
    std_acc_data = # YOUR CODE HERE - Pivot table for standard momentum accuracy
    sns.heatmap(std_acc_data, ax=axes[0, 0], cmap='viridis', annot=True)
    axes[0, 0].set_title('Standard Momentum - Accuracy')
    
    # Nesterov momentum accuracy
    nesterov_acc_data = # YOUR CODE HERE - Pivot table for Nesterov momentum accuracy
    sns.heatmap(nesterov_acc_data, ax=axes[0, 1], cmap='viridis', annot=True)
    axes[0, 1].set_title('Nesterov Momentum - Accuracy')
    
    # Standard momentum loss stability
    std_stability_data = # YOUR CODE HERE - Pivot table for standard momentum stability (using loss changes)
    sns.heatmap(std_stability_data, ax=axes[1, 0], cmap='rocket', annot=True)
    axes[1, 0].set_title('Standard Momentum - Loss stability')
    
    # Nesterov momentum loss stability
    nesterov_stability_data = # YOUR CODE HERE - Pivot table for Nesterov momentum stability
    sns.heatmap(nesterov_stability_data, ax=axes[1, 1], cmap='rocket', annot=True)
    axes[1, 1].set_title('Nesterov Momentum - Loss stability')
    
    plt.tight_layout()
    return fig



## Part 4: Running the Experiment



In [None]:
# Define parameter ranges
learning_rates = [0.001, 0.01, 0.1, 0.5]
momentums = [0.0, 0.5, 0.9, 0.99]

# Run the study
results = run_parameter_study(learning_rates, momentums)

# Create visualizations
fig = create_heatmaps(results)
plt.show()

# Print best configurations
print("\nBest configurations:")
for use_nesterov in [False, True]:
    subset = results[results['nesterov'] == use_nesterov]
    best_idx = subset['final_accuracy'].idxmax()
    best_config = subset.loc[best_idx]
    print(f"\n{'Nesterov' if use_nesterov else 'Standard'} Momentum:")
    print(f"Learning Rate: {best_config['learning_rate']}")
    print(f"Momentum: {best_config['momentum']}")
    print(f"Accuracy: {best_config['final_accuracy']:.4f}")



## Student Tasks:

1. In the `TrainingMonitor` class, implement batch loss storage:
   - Extract the 'loss' value from the logs dictionary
   - Append it to batch_losses list

2. In the `TrainingMonitor` class, calculate the relative loss change:
   - Compute the absolute difference between the current and previous batch loss
   - Divide it by the previous batch loss to get a relative change

3. In `train_model_with_params`, create the SGD optimizer:
   - Use tf.keras.optimizers.SGD
   - Include learning_rate, momentum, and nesterov parameters

4. In `run_parameter_study`, implement the nested loops:
   - Loop over learning rates, momentums, and nesterov options
   - Call train_model_with_params with each combination
   - Store results in the results list

5. In `create_heatmaps`, create the pivot tables:
   - Filter data for standard/Nesterov momentum
   - Create pivot tables with learning rates as columns and momentum as index
   - Use 'final_accuracy' and 'avg_loss_change' as values

## Extension Questions:

1. Why do some combinations of learning rate and momentum lead to unstable training?
2. How does the interaction between learning rate and momentum change with deeper networks?
3. What role does the loss change play in understanding training dynamics?
4. Why might Nesterov momentum perform better than standard momentum in some cases?

## Expected Output:

When run correctly, you should see:
- Four heatmaps showing the interaction between learning rate and momentum
- Clear patterns showing optimal combinations for each momentum type
- Potential instability with high learning rates and high momentum
- Different patterns for standard vs Nesterov momentum
- Summary of best configurations for each momentum type

The visualization should help identify safe and risky parameter combinations, and show how Nesterov momentum might provide more stability in some cases.

# Exercise 5: Adaptive Learning Rate Emergency

In this exercise, you'll implement a custom callback that monitors training stability and automatically adjusts the learning rate when problems are detected. This represents a real-world scenario where you need to rescue training that's becoming unstable.
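
The core logic can be prototyped without Keras. The sketch below is one possible interpretation of "detect a problem, then reduce the learning rate": the helper names, the halving factor, and the exact rules (spike relative to the best loss, strictly increasing loss over `patience` batches) are assumptions for illustration, not the required solution.

```python
import math


def detect_problems(loss_history, best_loss, spike_threshold=1.5, patience=5):
    """Return a list of problem descriptions for the latest loss value."""
    current = loss_history[-1]
    if not math.isfinite(current):
        return ["Non-finite loss"]
    problems = []
    if current > spike_threshold * best_loss:
        problems.append("Loss spike detected")
    # Strictly increasing loss over the last `patience` batch-to-batch changes
    recent = loss_history[-(patience + 1):]
    if len(recent) == patience + 1 and all(b > a for a, b in zip(recent, recent[1:])):
        problems.append("Consistently increasing loss")
    return problems


def reduced_lr(current_lr, min_lr=1e-6, factor=0.5):
    """Scale the learning rate down by `factor`, never going below min_lr."""
    return max(current_lr * factor, min_lr)


spike = detect_problems([1.0, 0.9, 2.0], best_loss=0.9)
creeping = detect_problems([0.50, 0.51, 0.52, 0.53, 0.54, 0.55], best_loss=0.5)
```

Keeping the detection rules in small pure functions like these makes them easy to test on hand-written loss sequences before wiring them into the callback.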

## Setup and Imports



In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import fashion_mnist
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load and preprocess Fashion MNIST
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0



## Part 1: Training Monitor Callback



In [None]:
class TrainingEmergencyCallback(tf.keras.callbacks.Callback):
    """
    Monitors training stability and adjusts learning rate when needed.
    """
    def __init__(self,
                 patience=5,
                 grad_norm_threshold=10.0,  # assumed default; tune for your model
                 loss_spike_threshold=1.5,
                 min_lr=1e-6):
        super().__init__()
        self.patience = patience
        self.grad_norm_threshold = grad_norm_threshold
        self.loss_spike_threshold = loss_spike_threshold
        self.min_lr = min_lr
        
        self.loss_history = []
        self.grad_history = []
        self.lr_history = []
        
    def on_train_begin(self, logs=None):
        # STUDENT TASK 1: Initialize monitoring variables
        # Keep track of consecutive problems and best loss
        self.consecutive_problems = # YOUR CODE HERE
        self.best_loss = # YOUR CODE HERE
        
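    def compute_gradient_norm(self):
        """Computes the global norm of the model's gradients"""
        # Get gradients
        # NOTE: optimizer.get_gradients and model.total_loss are legacy,
        # graph-mode Keras APIs; they may be unavailable in recent TF 2.x releases.
        gradients = self.model.optimizer.get_gradients(
            self.model.total_loss,
            self.model.trainable_weights
        )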
        
        # STUDENT TASK 2: Calculate and return gradient norm
        # Use tf.linalg.global_norm
        return # YOUR CODE HERE
    
    def check_training_problems(self, logs):
        """Checks for potential training problems"""
        current_loss = logs['loss']
        current_lr = tf.keras.backend.get_value(self.model.optimizer.learning_rate)
        grad_norm = self.compute_gradient_norm()
        
        # Store history
        self.loss_history.append(current_loss)
        self.grad_history.append(grad_norm)
        self.lr_history.append(current_lr)
        
        problems = []
        
        # STUDENT TASK 3: Implement problem detection
        # Check for:
        # 1. Loss spike (current loss much higher than best loss)
        # 2. Gradient explosion (gradient norm above grad_norm_threshold)
        # 3. NaN/Inf values in loss
        if # YOUR CODE HERE:  # Loss spike check
            problems.append("Loss spike detected")
            
        if # YOUR CODE HERE:  # Gradient explosion check
            problems.append("Gradient explosion detected")
            
        if # YOUR CODE HERE:  # NaN/Inf check
            problems.append("NaN/Inf loss detected")
            
        return problems
    
    def adjust_learning_rate(self, problems):
        """Adjusts learning rate based on detected problems"""
        current_lr = tf.keras.backend.get_value(self.model.optimizer.learning_rate)
        
        # STUDENT TASK 4: Implement learning rate adjustment
        # Reduce learning rate if there are problems
        # Make sure new lr isn't below min_lr
        if problems:
            new_lr = # YOUR CODE HERE
            new_lr = max(new_lr, self.min_lr)
            self.model.optimizer.learning_rate.assign(new_lr)
            logger.info(f"Learning rate adjusted from {current_lr} to {new_lr}")
            return True
        return False
    
    def on_batch_end(self, batch, logs=None):
        problems = self.check_training_problems(logs)
        
        if problems:
            self.consecutive_problems += 1
            logger.warning(f"Training problems detected: {problems}")
            
            if self.consecutive_problems >= self.patience:
                self.adjust_learning_rate(problems)
                self.consecutive_problems = 0
        else:
            self.consecutive_problems = 0
            current_loss = logs['loss']
            if current_loss < self.best_loss:
                self.best_loss = current_loss



## Part 2: Training Function



In [None]:
def create_model():
    """Creates a model prone to training instability"""
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=[28, 28]),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])

def train_with_emergency_monitoring(initial_lr=0.1, epochs=10, patience=3):
    """Trains model with emergency monitoring"""
    model = create_model()
    
    # STUDENT TASK 5: Create optimizer and compile model
    optimizer = # YOUR CODE HERE
    
    model.compile(
        optimizer=optimizer,
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy']
    )
    
    # Create callback
    emergency_cb = TrainingEmergencyCallback(
        patience=patience,
        grad_norm_threshold=10.0,
        loss_spike_threshold=1.5
    )
    
    # Train model
    history = model.fit(
        X_train, y_train,
        epochs=epochs,
        validation_split=0.2,
        callbacks=[emergency_cb],
        batch_size=32,
        verbose=1
    )
    
    return history, emergency_cb



## Part 3: Visualization



In [None]:
def plot_training_metrics(callback):
    """Visualizes training metrics and learning rate adjustments"""
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))
    
    # Plot loss history
    ax1.plot(callback.loss_history)
    ax1.set_title('Training Loss')
    ax1.set_ylabel('Loss')
    ax1.grid(True)
    
    # Plot learning rate changes
    ax2.plot(callback.lr_history)
    ax2.set_title('Learning Rate')
    ax2.set_ylabel('Learning Rate')
    ax2.set_yscale('log')
    ax2.grid(True)
    
    plt.tight_layout()
    plt.show()

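# Raise the root logger level to reduce verbosity.
# Note: at WARNING level, the INFO messages that report learning rate
# adjustments are hidden; use logging.INFO to see them.
logging.getLogger().setLevel(logging.WARNING)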

# Run training
history, callback = train_with_emergency_monitoring(patience=5)
plot_training_metrics(callback)



## Student Tasks:

1. In `TrainingEmergencyCallback.on_train_begin`, initialize monitoring variables:
   - Set consecutive_problems to 0
   - Set best_loss to float('inf')

2. In `compute_gradient_norm`, calculate the gradient norm:
   - Use tf.linalg.global_norm to compute norm of gradients
   - Convert result to a numpy scalar

3. In `check_training_problems`, implement problem detection:
   - Check if current loss exceeds best_loss * loss_spike_threshold
   - Check if gradient norm exceeds grad_norm_threshold
   - Check for NaN/Inf values in loss

4. In `adjust_learning_rate`, implement learning rate adjustment:
   - Reduce current learning rate by factor of 2
   - Ensure new rate doesn't fall below min_lr
   - Return True if adjustment was made

5. In `train_with_emergency_monitoring`, create the optimizer:
   - Use SGD with the specified initial learning rate
   - Add momentum=0.9 to make training more stable
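
For Task 2, it may help to see what a "global norm" is before calling the TensorFlow helper. The sketch below is a pure-Python illustration, not the exercise solution: the `global_norm` function and the `grads` values are hypothetical stand-ins for flattened gradient tensors. It computes the square root of the sum of squares over every element of every tensor, which is the quantity `tf.linalg.global_norm` returns.

```python
import math

def global_norm(tensors):
    """Square root of the sum of squares of every element in every tensor.

    `tensors` is a list of flat lists of floats (hypothetical stand-ins
    for flattened gradient tensors).
    """
    total = 0.0
    for tensor in tensors:
        for value in tensor:
            total += value * value
    return math.sqrt(total)

# Two made-up "gradient tensors": squares sum to 9 + 16 = 25.
grads = [[3.0, 0.0], [0.0, 4.0]]
print(global_norm(grads))  # 5.0
```

Because the norm pools all layers into a single number, one threshold such as `grad_norm_threshold` can flag an explosion in any layer.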

## Extension Questions:

1. What happens when you change the patience value? Try several.
2. How would you modify the callback to also adjust batch size when problems are detected?
3. What other metrics might be useful to monitor for training stability?
4. How would you implement a "recovery mode" that tries to restore weights to the last stable state?
5. How would you modify the callback to work with adaptive optimizers like Adam?

## Expected Output:

When run correctly, you should see:
- Training progress with occasional warnings about detected problems
- Learning rate adjustments when problems persist
- Two plots showing:
  - Training loss over time
  - Learning rate changes on log scale
- Successful training completion with stable final epochs

The plots should show the learning rate stepping down shortly after loss spikes or other detected problems persist for `patience` batches.
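
To build intuition for how `patience` and the halving rule interact, here is a toy simulation that replays a made-up loss sequence without Keras. It is a simplified sketch under stated assumptions: the `simulate` function and the loss values are hypothetical, gradient norms are ignored, and the rule of halving the rate with a floor at `min_lr` follows Task 4.

```python
import math

def simulate(losses, initial_lr=0.1, patience=3,
             loss_spike_threshold=1.5, min_lr=1e-6):
    """Replays a loss sequence and returns the learning rate after each step."""
    lr = initial_lr
    best_loss = float("inf")
    consecutive_problems = 0
    lr_history = []
    for loss in losses:
        non_finite = math.isnan(loss) or math.isinf(loss)
        spike = loss > best_loss * loss_spike_threshold
        if non_finite or spike:
            consecutive_problems += 1
            if consecutive_problems >= patience:
                # Halve the rate, but never go below the floor.
                lr = max(lr / 2, min_lr)
                consecutive_problems = 0
        else:
            consecutive_problems = 0
            best_loss = min(best_loss, loss)
        lr_history.append(lr)
    return lr_history

# Made-up losses: three consecutive spikes above 1.5 * best_loss (1.0).
losses = [2.0, 1.0, 1.2, 1.6, 1.7, 1.8, 0.9]
print(simulate(losses))
# [0.1, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05]
```

The rate only drops on the third consecutive problem step, so a larger `patience` tolerates longer bursts of instability before reacting.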