# Hyperparameters

## Tune Process

- It's hard to say in general which hyperparameter matters most; it depends heavily on the problem.

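- One way to tune is to lay out a grid of $N$ hyperparameter settings and try every combination of them on your problem.

- Prefer random values over a grid: with random sampling, each trial tests a new value of every hyperparameter, whereas a grid reuses the same few values of each.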

- You can use a **coarse-to-fine sampling scheme**:
  - Start by sampling the whole range coarsely and find the hyperparameter values that perform best.
  - Then zoom into a smaller region around those values and sample more densely within it.
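
The coarse-to-fine idea with random sampling can be sketched as below. This is only an illustration: `toy_score` is a hypothetical stand-in for a real validation score, and the two hyperparameters, their `[0, 1]` ranges, the sample counts, and the `half_width` of the zoomed region are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_score(h1, h2):
    # Hypothetical stand-in for a validation score; it peaks at (0.3, 0.7).
    return -((h1 - 0.3) ** 2 + (h2 - 0.7) ** 2)

def random_search(low, high, n_samples):
    # Draw random points in the box [low, high) and return the best one.
    points = rng.uniform(low, high, size=(n_samples, 2))
    scores = np.array([toy_score(h1, h2) for h1, h2 in points])
    return points[np.argmax(scores)]

# Coarse pass over the full range of both hyperparameters.
coarse_low, coarse_high = np.array([0.0, 0.0]), np.array([1.0, 1.0])
coarse_best = random_search(coarse_low, coarse_high, n_samples=25)

# Fine pass: zoom into a smaller box around the coarse winner and sample again.
half_width = 0.1
fine_low = np.clip(coarse_best - half_width, coarse_low, coarse_high)
fine_high = np.clip(coarse_best + half_width, coarse_low, coarse_high)
fine_best = random_search(fine_low, fine_high, n_samples=25)

print("coarse best:", coarse_best)
print("fine best:  ", fine_best)
```

In practice each call to `toy_score` would be a full training run, so the number of samples per pass is limited by the compute budget.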

## Using an Appropriate Scale

- **Issue:**
  - When searching for the learning rate $\alpha$ in the range $10^{-4}$ to $1$, sampling uniformly on a linear scale puts about 90% of the samples between $0.1$ and $1$, leaving only about 10% for the three decades from $10^{-4}$ to $0.1$.

- **Solution:**
  - We need a distribution that explores every order of magnitude in the range equally.

  - Sampling on a logarithmic scale does this: draw $r$ uniformly from $[-4, 0]$ and set $\alpha = 10^{r}$.
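
A small sketch comparing the two sampling schemes for the $10^{-4}$ to $1$ range discussed above. The sample count and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Uniform sampling on a linear scale over [1e-4, 1).
alpha_linear = rng.uniform(1e-4, 1.0, size=n)

# Log-scale sampling: draw the exponent r uniformly from [-4, 0),
# then set alpha = 10**r.
r = rng.uniform(-4.0, 0.0, size=n)
alpha_log = 10.0 ** r

# Fraction of samples that land in the top decade [0.1, 1).
frac_linear = np.mean(alpha_linear >= 0.1)
frac_log = np.mean(alpha_log >= 0.1)
print("linear scale:", frac_linear)  # roughly 0.90
print("log scale:   ", frac_log)     # roughly 0.25 (one of four decades)
```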


# Batch Normalization

- **Formulas:**
  - Mean: $\mu = \frac{1}{m} \sum_{i=1}^{m} z^{(i)}$
  - Variance: $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (z^{(i)} - \mu)^2$
  - Normalization: $z_{\text{norm}}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$
  - Scale and Shift: $\tilde{z}^{(i)} = \gamma z_{\text{norm}}^{(i)} + \beta$
  
- **Batch Norm in Neural Networks:**
  - **Beta and Gamma:**
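    - $\beta$ and $\gamma$ are learnable parameters, updated by backpropagation and gradient descent just like the weights. They let each layer settle on whatever mean and variance of the normalized values suit it, rather than fixing them at 0 and 1.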
  - **Working with Mini Batches:**
    - The bias parameter $b$ can be dropped: the normalization step subtracts the mean, $Z - \mu$, which cancels any constant added to $Z$. The shift $\beta$ takes over the role of the bias.
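
To see why the bias has no effect, the sketch below normalizes $Z = W A + b$ with and without $b$ and compares the results. The layout (one row per hidden unit, one column per example), the shapes, and the helper name `normalize` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(Z, eps=1e-8):
    # Normalize each unit (row) across the mini-batch (columns).
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

W = rng.standard_normal((3, 4))       # 3 hidden units, 4 input features
A_prev = rng.standard_normal((4, 8))  # mini-batch of 8 examples
b = rng.standard_normal((3, 1))       # a bias that the mean subtraction cancels

with_bias = normalize(W @ A_prev + b)
without_bias = normalize(W @ A_prev)

# The two results agree up to floating-point rounding.
print(np.allclose(with_bias, without_bias))
```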

## Normalization Algorithm

- **Given:**
  - $Z^{[l]}$: an array holding the values $z^{(1)}, \dots, z^{(m)}$ of layer $l$, one for each of the $m$ inputs in the mini-batch ($i$ runs from 1 to $m$).

- **Steps:**
  - Compute the mean: $\mu = \frac{1}{m} \sum_{i=1}^{m} z^{(i)}$.
  - Compute the variance: $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (z^{(i)} - \mu)^2$.
  - Normalize: $z_{\text{norm}}^{(i)} = \frac{z^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}}$, which forces the inputs into a distribution with zero mean and variance of 1. The small constant $\epsilon$ keeps the division numerically stable when the variance is 0.
  - Scale and Shift: $\tilde{z}^{(i)} = \gamma z_{\text{norm}}^{(i)} + \beta$, which adapts the inputs to another distribution whose mean and variance are set by $\beta$ and $\gamma$.
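
The four steps above can be sketched in NumPy as follows. This is a minimal forward pass only, assuming one row per hidden unit and one column per example; the function name `batch_norm_forward`, the default `eps`, and the example values of `gamma` and `beta` are illustrative choices, not part of the notes.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Z has shape (n_units, m); gamma and beta have shape (n_units, 1)."""
    mu = Z.mean(axis=1, keepdims=True)                 # mean per unit
    var = ((Z - mu) ** 2).mean(axis=1, keepdims=True)  # variance per unit
    Z_norm = (Z - mu) / np.sqrt(var + eps)             # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta                    # scale and shift
    return Z_tilde, Z_norm, mu, var

rng = np.random.default_rng(0)
Z = 5.0 + 3.0 * rng.standard_normal((2, 64))  # 2 units, mini-batch of 64
gamma = np.array([[2.0], [0.5]])
beta = np.array([[1.0], [-1.0]])

Z_tilde, Z_norm, mu, var = batch_norm_forward(Z, gamma, beta)
print(Z_norm.mean(axis=1), Z_norm.var(axis=1))    # close to 0 and 1
print(Z_tilde.mean(axis=1), Z_tilde.var(axis=1))  # close to beta and gamma**2
```

After the scale-and-shift step, each unit has mean $\beta$ and variance $\gamma^2$, which is how the learned parameters choose the distribution the next layer sees.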

## Why does Batch Normalization Work?

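- One reason is the same as for normalizing the input features: values on a similar scale are easier to learn from, and batch norm applies this to the hidden layers as well.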

- **Regularization**: 
  - Each mini-batch is normalized with the mean and variance computed on just that mini-batch.
  
  - These statistics are noisy estimates, so they add some noise to the values $Z^{[l]}$ within that mini-batch. Similar to dropout, this adds noise to each hidden layer's activations and has a slight regularizing effect.
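
The noise can be seen by placing the same value in two different mini-batches and normalizing each. The batch size of 32, the fixed value `z_example`, and the helper `normalize` are assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(z, eps=1e-8):
    # Normalize a 1-D mini-batch of values for a single hidden unit.
    return (z - z.mean()) / np.sqrt(z.var() + eps)

z_example = 1.5  # value of one fixed example for one hidden unit

# Place the same example (last position) in two different random mini-batches.
batch_a = np.append(rng.standard_normal(31), z_example)
batch_b = np.append(rng.standard_normal(31), z_example)

norm_a = normalize(batch_a)[-1]
norm_b = normalize(batch_b)[-1]

# The same example gets a slightly different normalized value in each batch.
print(norm_a, norm_b)
```

Larger mini-batches give more stable statistics, so this noise, and with it the regularizing effect, shrinks as the batch size grows.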

## Problem at Test Time

- **Issue:**
  - After training, predictions still need a mean and variance for the normalization step.

  - However, the mean and variance are computed per mini-batch during training and differ from batch to batch, and at test time you may process a single example at a time, so they cannot be applied directly to the test set.

- **Solution:** exponentially weighted averages.
  
  - During training, keep an exponentially weighted average of the mean and variance across mini-batches, updating it at each iteration. Use these running estimates to normalize consistently when making predictions.
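
A sketch of how the running estimates might be maintained and then used at test time. The decay rate `momentum = 0.9`, the initial values, the synthetic mini-batches, and the values of `gamma` and `beta` are all assumptions for illustration; real frameworks differ in their defaults and details.

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.9  # assumed decay rate of the exponentially weighted average
eps = 1e-8

running_mu = np.zeros((1, 1))
running_var = np.ones((1, 1))

# Training: update the running estimates from each mini-batch's statistics.
for _ in range(200):
    Z = 4.0 + 2.0 * rng.standard_normal((1, 64))  # synthetic mini-batch
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    running_mu = momentum * running_mu + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var

# Test time: normalize a single example with the running estimates
# instead of statistics from a mini-batch.
gamma, beta = 1.0, 0.0
z_test = np.array([[4.0]])
z_norm = (z_test - running_mu) / np.sqrt(running_var + eps)
z_tilde = gamma * z_norm + beta

print(running_mu, running_var, z_tilde)
```

Because the synthetic data has mean 4 and variance 4, the running estimates end up near those values, and a test input of 4 is normalized to roughly 0.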
