# Data Set Splitting
- **Basic Split**: Training, dev, and test sets.

- **Sourcing**: Dev and test sets should come from the same distribution, so that improvements tuned on the dev set carry over to the test set.

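- **Test Set Omission**: When an unbiased estimate of final performance is not required, a train/dev split alone can be enough. The dev error is then an optimistic estimate, because the hyperparameters were tuned on it.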
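A minimal sketch of such a split, assuming the data can be shuffled freely. The function name `split_indices` and the 80/10/10 fractions are illustrative choices, not something these notes prescribe. Cutting dev and test from the same shuffled pool is one simple way to keep them on the same distribution.

```python
import numpy as np

def split_indices(n_examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle example indices once, then cut them into train/dev/test parts."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_examples)
    n_dev = int(n_examples * dev_frac)
    n_test = int(n_examples * test_frac)
    dev_idx = perm[:n_dev]
    test_idx = perm[n_dev:n_dev + n_test]
    train_idx = perm[n_dev + n_test:]
    return train_idx, dev_idx, test_idx

train_idx, dev_idx, test_idx = split_indices(1000)
print(len(train_idx), len(dev_idx), len(test_idx))  # 800 100 100
```

Passing `test_frac=0` gives the train/dev-only variant.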


# Bias and Variance
- **Scenarios**: High variance (overfitting), high bias (underfitting)

- **High Variance Solutions**: More data, regularization, neural network architecture search.

- **High Bias Solutions**: Larger networks, longer training, architecture search.

- **Error Evaluation**: Judge whether an error is "high" or "low" relative to the Bayes (optimal) error, not relative to zero.


|                         | High variance (overfit) | High bias (underfit) | High bias and high variance | Low bias and low variance |
|-------------------------|:-----------------------:|:--------------------:|:---------------------------:|:-------------------------:|
| **Training set error**  |           Low           |        High          |            High             |            Low            |
| **Dev set error**       |           High          |        High          | Much higher than training   |            Low            |
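The table can be turned into a rough rule of thumb. The sketch below is only illustrative: the function `diagnose` and the threshold `tol` are hypothetical, and what counts as a "large" gap depends on the problem.

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.02):
    """Label a model from its error rates (fractions in [0, 1]).

    tol is a hypothetical threshold for what counts as a "large" gap.
    """
    high_bias = (train_err - bayes_err) > tol     # gap to Bayes error
    high_variance = (dev_err - train_err) > tol   # gap between dev and training
    if high_bias and high_variance:
        return "high bias, high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "low bias, low variance"

print(diagnose(0.01, 0.11))   # high variance
print(diagnose(0.15, 0.16))   # high bias
print(diagnose(0.15, 0.30))   # high bias, high variance
print(diagnose(0.005, 0.01))  # low bias, low variance
```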


![Image](./image/VarAndBias.png)

# Regularization 

## Overview
- Regularization reduces overfitting by constraining the model, for example by adding a penalty term to the loss function, so that it generalizes better to unseen data.

## L1 and L2 Regularization

- **L1 Regularization**: $Cost = Loss + \lambda \sum |w|$

- **L2 Regularization**: $Cost = Loss + \lambda \sum w^2$ <br>
For a weight matrix, the penalty is the squared Frobenius norm: $||w^{[l]}||_F^{2} = \sum_{i = 1}^{n^{[l]}}\sum_{j = 1}^{n^{[l - 1]}}(w_{i, j}^{[l]})^{2}$<br>
  - The row index $i$ runs over the neurons in the current layer, $n^{[l]}$;
  - the column index $j$ runs over the neurons in the previous layer, $n^{[l - 1]}$.
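As a small illustration of the L2 formula, the sketch below sums the squared Frobenius norms of two weight matrices. The helper name `l2_penalty`, the layer sizes, and the placeholder loss value are assumptions made for the example.

```python
import numpy as np

def l2_penalty(weights, lam):
    """lam times the sum of squared Frobenius norms over all weight matrices."""
    return lam * sum(np.sum(np.square(W)) for W in weights)

# Two layers: 3 inputs -> 4 hidden units -> 1 output.
# Each W[l] has shape (n[l], n[l-1]): rows = current layer, columns = previous layer.
W1 = np.ones((4, 3))
W2 = np.full((1, 4), 2.0)

loss = 0.5  # placeholder value for the unregularized loss
cost = loss + l2_penalty([W1, W2], lam=0.01)
print(cost)  # 0.5 + 0.01 * (12 + 16), up to floating-point rounding
```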

**Intuition Behind Regularization:**
  - High **lambda (λ)** values push weights (**W**) towards zero, simplifying the neural network.

  - Smaller weights reduce the impact of individual units, which makes the network less prone to overfitting.
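One way to see why a large $\lambda$ shrinks the weights is to write out a single gradient-descent step on the L2-regularized cost. The learning rate $\alpha$ is introduced here only for illustration.

$$
\frac{\partial Cost}{\partial w} = \frac{\partial Loss}{\partial w} + 2\lambda w
$$

$$
w := w - \alpha\left(\frac{\partial Loss}{\partial w} + 2\lambda w\right) = (1 - 2\alpha\lambda)\,w - \alpha\frac{\partial Loss}{\partial w}
$$

For small $\alpha\lambda$, each step first multiplies $w$ by a factor slightly below 1, which is why L2 regularization is also called weight decay.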

## Dropout Regularization
- Dropout assigns each neuron a probability of being kept or dropped on every training pass.

- **Dropout Process:**
  - During training, each neuron is kept with probability `keep_prob` (for example 0.5) and removed otherwise, which simulates training on many different network architectures.

  - This random removal of neurons leaves a smaller, thinned network, forcing the model to adapt to different structures.
  
  - Ensures that the network does not become too dependent on any single neuron, which promotes better generalization.

- **Benefits of Dropout:**
  - Encourages the neural network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.

  - Helps in preventing overfitting by ensuring that the model does not rely too heavily on any single path of neurons.

  - At test time no neurons are dropped; the full network then behaves like an ensemble of the sub-networks seen during training, which improves generalization.

- **Operational Insights:**
  - For each training example, a different "thinned" network is used, with neurons dropped out randomly.

  - This process simulates training multiple smaller networks and averaging their predictions, leading to a more robust model.

- **Dropout Probability:**
  - For layers with many units and large weight matrices, a lower `keep_prob` is advisable to reduce overfitting.
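The dropout process above can be sketched for a single layer as follows. This uses the common "inverted dropout" variant, which rescales the kept activations by `1 / keep_prob`. The rescaling step is not described in these notes and is an assumption here, as are the function name `dropout_forward` and the array shapes.

```python
import numpy as np

def dropout_forward(a, keep_prob, rng, training=True):
    """Inverted dropout on an activation matrix a of shape (units, examples)."""
    if not training:
        return a, None                      # no units are dropped at test time
    mask = rng.random(a.shape) < keep_prob  # True = keep the unit
    a_dropped = a * mask / keep_prob        # rescale so the expected value is unchanged
    return a_dropped, mask

rng = np.random.default_rng(0)
a = np.ones((1000, 50))
a_train, mask = dropout_forward(a, keep_prob=0.8, rng=rng)
print(mask.mean())     # close to 0.8
print(a_train.mean())  # close to 1.0
```

Because the mask is drawn per entry, every training example (column) sees a different thinned network.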


![Image](./image/DropOut.png)

## Others

### Early Stopping
- Stops training once the dev set loss starts to rise while the training set loss keeps falling, that is, when the two curves diverge significantly.

#### Limitations
- Couples two tasks that are better handled separately: minimizing the cost function (J) and avoiding overfitting. Stopping early interrupts the first in order to achieve the second, which makes the set of experiments (models, adjustments, etc.) harder to reason about.

#### Alternative Approaches
- L2 regularization is an alternative that allows training for many epochs. The trade-off is that it requires experimenting with several values of lambda.
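A stopping rule for early stopping can be sketched with a "patience" counter: stop once the dev loss has not improved for a few epochs and keep the best epoch seen so far. The `patience` parameter and the loss values below are made up for illustration.

```python
def early_stopping_epoch(dev_losses, patience=3):
    """Return the index of the epoch whose weights would be kept.

    Training stops once the dev loss has failed to improve for `patience`
    consecutive epochs; the best epoch seen so far is returned.
    """
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Made-up dev losses: they improve, then rise as the model starts to overfit.
dev_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.48]
print(early_stopping_epoch(dev_losses))  # 3
```

The final value 0.48 is never reached because training has already stopped, which illustrates the limitation noted above: stopping early also cuts short the minimization of J.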

![Image](./image/EarlyStopping.png)

### Data Augmentation
- Generate more training data by modifying the original data (for example, flipping, cropping, or rotating images) $\implies$ a remedy for overfitting
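A minimal sketch of image augmentation with NumPy, assuming images are stored as `(height, width, channels)` arrays and that a flip or a small shift does not change the label. The function name `augment` and the padding size are illustrative.

```python
import numpy as np

def augment(image, rng, pad=2):
    """Return a randomly flipped and shifted copy of an (H, W, C) image."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]              # horizontal flip
    h, w, _ = image.shape
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = rng.integers(0, 2 * pad + 1)         # random crop offset
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
batch = np.stack([augment(image, rng) for _ in range(8)])
print(batch.shape)  # (8, 32, 32, 3)
```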

![Image](./image/DataAug.png)

# Normalization
  1. **Mean Normalization:**<br>
      - $mean(x) = \frac{1}{n}\sum_{i=1}^{n}x_{i}$
      <br><br>
      - $x_{centered} = x - mean(x)$
      <br><br>
  2. **Variance Normalization:**

      - $std(x) = \sqrt{\frac{1}{n}\sum_{i = 1}^{n} (x_i - mean(x))^{2}}$
      <br><br>
      - $x_{norm} = \frac{x_{centered}}{std(x)} = \frac{x - mean(x)}{std(x)}$

$\implies$ Every feature ends up with zero mean and unit variance, so all features share a similar scale and gradient descent converges faster.
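The two normalization steps can be sketched with NumPy as below. The helper names and the synthetic data are assumptions. The mean and standard deviation are computed on the training set and then reused for other splits, so that all data goes through the same transformation.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Compute per-feature mean and std from training data of shape (n, features)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + eps   # eps guards against constant features
    return mean, std

def normalize(X, mean, std):
    return (X - mean) / std

rng = np.random.default_rng(0)
# Two synthetic features on very different scales.
X_train = rng.normal(loc=[5.0, 500.0], scale=[1.0, 100.0], size=(1000, 2))
X_test = rng.normal(loc=[5.0, 500.0], scale=[1.0, 100.0], size=(200, 2))

mean, std = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mean, std)
X_test_norm = normalize(X_test, mean, std)   # reuse the training statistics
print(X_train_norm.mean(axis=0), X_train_norm.std(axis=0))  # close to [0, 0] and [1, 1]
```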

![Image](./image/Distribution.png)

# Vanishing and Exploding Gradients

- **Exploding Gradients:**
  - If the weights are too large, activations and gradients grow exponentially with the number of layers, which makes training hard to converge.

- **Vanishing Gradients:**
  - If the weights are too small, activations and gradients shrink exponentially with the number of layers, so the updates become tiny and training slows down.

- **Partial Solution:**
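  - Careful weight initialization: draw the weights with a small variance that depends on the layer size.
    ```
    import numpy as np
    W = np.random.randn(*shape) * np.sqrt(variance)  # shape = (n_curr, n_prev); e.g. variance = 2 / n_prev for ReLU
    ```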
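To illustrate how much the weight scale matters, the sketch below pushes random inputs through a deep stack of ReLU layers and reports the spread of the final activations for three scales. The network depth and width are arbitrary, and the third scale, `sqrt(2 / n)`, is He initialization, which is one common choice for ReLU and is named here as an assumption rather than taken from these notes.

```python
import numpy as np

def forward_std(scale_fn, n_layers=50, n_units=256, seed=0):
    """Push random inputs through a deep ReLU stack; return the final activation std."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_units, 100))
    for _ in range(n_layers):
        W = rng.standard_normal((n_units, n_units)) * scale_fn(n_units)
        a = np.maximum(0.0, W @ a)   # ReLU
    return a.std()

vanishing = forward_std(lambda n: 0.01)              # weights too small
exploding = forward_std(lambda n: 1.0)               # weights too large
he_init = forward_std(lambda n: np.sqrt(2.0 / n))    # variance 2 / n
print(vanishing, exploding, he_init)
```

The first value collapses toward zero, the second blows up by many orders of magnitude, and the third stays at a moderate scale.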

# Gradient Checking

## Numerical Approximation
- **Approximation:** $\frac{df(x)}{dx} = \lim_{h \to 0}\frac{f(x + h) - f(x - h)}{2h}$
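A quick numerical illustration of why the two-sided form is used, with $f(x) = x^3$ as an arbitrary test function. The one-sided difference is included only for comparison and is not part of these notes.

```python
def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 3      # exact derivative at x = 1 is 3
h = 0.01
err_central = abs(central_diff(f, 1.0, h) - 3.0)
err_forward = abs(forward_diff(f, 1.0, h) - 3.0)
print(err_central)  # about 1e-4, shrinks like h**2
print(err_forward)  # about 3e-2, shrinks like h
```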

## Checking Process

- $Difference = \frac{||grad\space -\space grad_{approximate}||_{2}}{\space ||grad||_{2}\space +\space ||\space grad_{approximate}\space ||_{2}}$

    <br>$\implies$ If $Difference$ is much larger than $\epsilon$ (e.g. $10^{-7}$), the gradient computation (backpropagation) is probably implemented incorrectly.
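The checking process can be sketched end to end on a toy cost function whose gradient is known. The function names, the step size `h`, and the toy cost $J(\theta) = \sum \theta^2$ are assumptions chosen for the example.

```python
import numpy as np

def numerical_grad(f, theta, h=1e-5):
    """Two-sided finite-difference gradient of a scalar function f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad

def relative_difference(grad, grad_approx):
    num = np.linalg.norm(grad - grad_approx)
    den = np.linalg.norm(grad) + np.linalg.norm(grad_approx)
    return num / den

# Toy cost J(theta) = sum(theta**2) with analytic gradient 2 * theta.
f = lambda t: np.sum(t ** 2)
theta = np.array([1.0, -2.0, 3.0])

good = relative_difference(2 * theta, numerical_grad(f, theta))
bad = relative_difference(3 * theta, numerical_grad(f, theta))  # deliberately wrong gradient
print(good, bad)  # good is far below 1e-7, bad is about 0.2
```

Because the loop evaluates the cost twice per parameter, a check like this is normally run only while debugging, not during every training step.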