# Regularization

## Concept & Methods

**Why regularize?**
- Penalizes "memorization" (over-learning examples).
- Helps the model generalize to unseen examples.
- Changes the learned representations (sparser or more distributed, depending on the regularizer).
- Can increase or decrease training time.
- Can decrease training accuracy but increase generalization.
- Works better for large models with multiple hidden layers.
- Generally works better with sufficient data.

**Three families of regularizers in deep learning**
- **Family 1:** Modify the model (dropout)
- **Family 2:** Add a cost to the loss function (L1/L2)
- **Family 3:** Modify or add data (batch training, data augmentation)

<p style="display:flex">
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-1.png?token=GHSAT0AAAAAABY4P3FR5AETPZKJ43H3KJVSYZMA7WQ" width=450 style="border-radius:10px"/>
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-2.png?token=GHSAT0AAAAAABY4P3FRSBE5GP74OMM2L6Q6YZMBAKQ" width=500 style="border-radius:10px"/>
</p>

<p style="display:flex">
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-3.png?token=GHSAT0AAAAAABY4P3FRXW4A36U42GDF6BTWYZMBAYA" width=450 style="border-radius:10px"/>
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-4.png?token=GHSAT0AAAAAABY4P3FQR3EGF6G7KVGQTBOYYZMBBBQ" width=450 style="border-radius:10px"/>
</p>

**How to think about regularization:**
- Adds a cost to the complexity of the solution
- Forces the solution to be smooth
- Prevents the model from learning item-specific details.

**Which regularization method to use?**
<p>
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-5.png?token=GHSAT0AAAAAABY4P3FRWMQC7H4FKM56X7RWYZMBBLA" width=500 style="border-radius:10px"/>
</p>

## `train()` & `eval()` Methods

**Training vs evaluation mode:**
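- Gradients are needed only for backpropagation during training, not during evaluation.
- Some regularization methods (e.g., dropout) are applied only during training, not during evaluation.
- Ergo: We need a way to deactivate gradient computations and regularization while evaluating model performance. In PyTorch, `net.eval()` switches off training-only regularization (and `net.train()` switches it back on), while `torch.no_grad()` disables gradient tracking.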
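
A minimal sketch of switching between the two modes in PyTorch. The toy model, its layer sizes, and the dropout rate are assumptions made for illustration only; they are not taken from the linked notebooks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model: layer sizes and dropout rate are arbitrary.
net = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(8, 3),
)
x = torch.randn(5, 4)

# Training mode: dropout is active and the autograd graph is built.
net.train()
train_out = net(x)

# Evaluation mode: dropout is switched off, so repeated passes agree.
net.eval()
with torch.no_grad():  # additionally skip gradient tracking
    eval_out_1 = net(x)
    eval_out_2 = net(x)

print(net.training)              # False
print(train_out.requires_grad)   # True
print(eval_out_1.requires_grad)  # False
```

Note that `net.eval()` alone does not disable gradient tracking; that is the job of `torch.no_grad()`. Remember to call `net.train()` again before the next training epoch.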

<p style="display:flex">
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-6.png?token=GHSAT0AAAAAABY4P3FRUAVB7XMU2UPOUEJAYZMBCAQ" width=450 style="border-radius:10px"/>
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-7.png?token=GHSAT0AAAAAABY4P3FQKRRGFFUMP2RHQMDMYZMBCLA" width=500 style="border-radius:10px"/>
</p>

## Dropout Regularization

<p style="display:flex">
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-8.png?token=GHSAT0AAAAAABY4P3FQ6DSURO5FRQYHVS3YYZMBCUA" width=450 style="border-radius:10px"/>
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-9.png?token=GHSAT0AAAAAABY4P3FR6P64AJRYZDZWYF4SYZMBC4A" width=450 style="border-radius:10px"/>
</p>

<p style="display:flex">
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-10.png?token=GHSAT0AAAAAABY4P3FQQMKRPQ4QCNNNYTXOYZMBDGA" width=450 style="border-radius:10px"/>
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-11.png?token=GHSAT0AAAAAABY4P3FQMDOW64P2UFNZMTGYYZMBDPA" width=450 style="border-radius:10px"/>
</p>

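**How does it work?**
- Dropout reduces the overall activation (fewer elements in the weighted sum).
- Solutions (use one or the other):
    - Scale up during training: multiply the surviving activations by $1/(1-p)$, where $p$ is the dropout probability. This is what PyTorch does.
    - Scale down the weights during testing (multiply by $1-p$).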
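
A minimal sketch of the training-time scaling, assuming PyTorch's `nn.Dropout`, which rescales the surviving activations by $1/(1-p)$ in training mode and acts as the identity in evaluation mode. The dropout probability and the input of all ones are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

p = 0.25                      # dropout probability (arbitrary choice)
dropout = nn.Dropout(p=p)
x = torch.ones(10_000)

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p).
dropout.train()
y_train = dropout(x)
survivors = y_train[y_train != 0]

# Evaluation mode: dropout is the identity, so no rescaling is needed.
dropout.eval()
y_eval = dropout(x)

print(survivors.unique())     # a single value, 1 / (1 - p)
print(y_train.mean().item())  # close to 1.0, the mean of x
```

Because of the rescaling, the expected activation is the same in both modes, which is why nothing has to be adjusted at test time.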

**Effects:**
- Prevents a single node from learning too much
- Forces the model to have distributed representations.
- Makes the model less reliant on individual nodes and thus more stable.

**Other observations:**
- Generally requires more training (though each epoch computes faster).
- Can decrease training accuracy but increase generalization.
- Usually works better on deep than shallow networks.
- Whether it should be applied to convolution layers is debated.
- Works better with sufficient data, but is unnecessary with "enough" data.

**Code:**
- [Part 1 - DropOut In Pytorch](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/Part%201%20-%20DropOut%20In%20Pytorch.ipynb)
- [Part 2 - DropOut Regularization by Building Model](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/Part%202%20-%20DropOut%20Regularization%20by%20Building%20Model.ipynb)
- [Part 3 - Dropout on Iris Dataset](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/Part%203%20-%20Dropout%20on%20Iris%20Dataset.ipynb)

## L1 & L2 Regularization

<p style="display:flex">
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-12.png?token=GHSAT0AAAAAABY4P3FQIDNMVOHCWWYJEKBOYZMBD6A" style="border-radius:10px"/>
</p>

<p style="display:flex">
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-13.png?token=GHSAT0AAAAAABY4P3FQYPLDMN4BYGDCIFJMYZMBEGA" width=450 style="border-radius:10px"/>
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-14.png?token=GHSAT0AAAAAABY4P3FRZJ6QIPPQ7PS47XWQYZMBEOQ" width=450 style="border-radius:10px"/>
</p>

<p style="display:flex">
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-15.png?token=GHSAT0AAAAAABY4P3FRTV6D2XQUAVXLCSKIYZMBEWQ" width=450 style="border-radius:10px"/>
</p>

**What else to regularize for?**
- L1 + L2 ("elastic net" regression)
- Norm of weight matrix
- Sample-specific (e.g., positive bias on cancer diagnosis)
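
A minimal sketch of an elastic-net-style penalty, where an L1 term and an L2 term on the weights are added to the data loss. The toy model, the random data, and the two `lambda` values are assumptions for illustration, not values from the linked notebooks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

net = nn.Linear(4, 3)                 # hypothetical toy model
loss_fun = nn.CrossEntropyLoss()
X = torch.randn(6, 4)                 # made-up inputs
y = torch.tensor([0, 1, 2, 0, 1, 2])  # made-up labels

l1_lambda = 1e-3  # assumed penalty strengths
l2_lambda = 1e-2

# Collect the weight matrices only (biases are often left unregularized).
weights = [p for name, p in net.named_parameters() if "weight" in name]
l1_term = sum(w.abs().sum() for w in weights)   # sum of |w|
l2_term = sum(w.pow(2).sum() for w in weights)  # sum of w^2

data_loss = loss_fun(net(X), y)
loss = data_loss + l1_lambda * l1_term + l2_lambda * l2_term

# Gradients now include the contribution of both penalty terms.
loss.backward()
```

Setting `l2_lambda = 0` leaves pure L1 regularization, and setting `l1_lambda = 0` leaves pure L2.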

**Why does regularization reduce overfitting?**
- Discourages complex and sample-specific representations.
- Prevents overfitting to training examples.
- Large weights lead to instability (very different outputs for similar inputs).

**When to use L1/L2 regularization?**
- In large, complex models with lots of weights (high risk of overfitting)
- Use L1 when trying to understand the important encoding features (more common in regression than DL)
- When training accuracy is much higher than validation accuracy.
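
For L2 specifically, PyTorch optimizers accept a `weight_decay` argument, so the penalty does not have to be written into the loss by hand. The sketch below assumes plain `torch.optim.SGD` without momentum; the model, learning rate, and decay value are arbitrary. With the data gradient set to zero, one step shrinks every parameter by the factor `1 - lr * weight_decay`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

net = nn.Linear(4, 3)         # hypothetical toy model
lr, weight_decay = 0.1, 0.01  # assumed values

optimizer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=weight_decay)

w_before = net.weight.detach().clone()

# Pretend the data loss has zero gradient, so only the L2 term acts.
for param in net.parameters():
    param.grad = torch.zeros_like(param)
optimizer.step()

w_after = net.weight.detach()
shrink = 1 - lr * weight_decay
print(torch.allclose(w_after, w_before * shrink))  # True
```

Note that `weight_decay` applies to every parameter handed to the optimizer, including biases, and corresponds to a penalty of `(weight_decay / 2) * sum(w**2)`.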

**Code:**
- [Part 4 - L2 Regularization](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/Part%204%20-%20L2%20Regularization.ipynb)
- [Part 5 - L1 Regularization](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/Part%205%20-%20L1%20Regularization.ipynb)

## Training in Mini-Batches
<p style="display:flex">
    <img src="https://raw.githubusercontent.com/Sayan-Roy-729/Data-Science/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/images/image-16.png?token=GHSAT0AAAAAABY4P3FRLR7NRBOHV3GVUFJUYZMBE7A" width=450 style="border-radius:10px"/>
</p>

**How and why to train with mini-batches?**
- Batch size is often a power of 2 (e.g., $2^4 = 16$), typically between 2 and 512.
- Training in batches can decrease computation time because of vectorization (matrix multiplication instead of for-loops).
- But batching can increase computation time for large batches and large data samples (e.g., images).
- Batching is a form of regularization: It smooths learning by averaging the loss over many samples, and thereby reduces overfitting.
- If samples are highly similar, a mini-batch size of 1 can give faster training.
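
A minimal sketch of mini-batching with PyTorch's `DataLoader`. The dataset here is random and purely hypothetical. Because 16 does not divide 100, the last batch is smaller unless `drop_last=True` is set.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical data: 100 samples, 4 features, 3 classes.
X = torch.randn(100, 4)
y = torch.randint(0, 3, (100,))
dataset = TensorDataset(X, y)

# Default behaviour: the final batch holds the 4 leftover samples.
loader_keep = DataLoader(dataset, batch_size=16, shuffle=True)
sizes_keep = [xb.shape[0] for xb, yb in loader_keep]

# drop_last=True discards the incomplete batch so all batches are equal.
loader_drop = DataLoader(dataset, batch_size=16, shuffle=True, drop_last=True)
sizes_drop = [xb.shape[0] for xb, yb in loader_drop]

print(sizes_keep)  # [16, 16, 16, 16, 16, 16, 4]
print(sizes_drop)  # [16, 16, 16, 16, 16, 16]
```

In a training loop, each `(xb, yb)` pair would go through one forward pass, one loss computation, and one optimizer step.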

**Mini-Batch Analogy**
- Imagine you take an exam with 100 questions.
- SGD: Teacher gives you detailed feedback on each answer. This is good for learning but very time-consuming.
- One batch: Teacher gives you a final exam score with no feedback. Grading is fast, but it's difficult to learn from your mistakes.
- Mini-batch: Teacher gives you a separate grade and feedback on average performance for blocks of 10 questions. This balances speed and learning ability.
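
The analogy can be put in numbers: each round of feedback corresponds to one weight update. This small sketch counts updates per pass over the 100 "questions"; the helper `updates_per_epoch` is a hypothetical name introduced for illustration.

```python
import math

n_questions = 100  # samples in the "exam"


def updates_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of feedback rounds (weight updates) in one pass over the data."""
    return math.ceil(n_samples / batch_size)


sgd_updates = updates_per_epoch(n_questions, batch_size=1)
one_batch_updates = updates_per_epoch(n_questions, batch_size=n_questions)
mini_batch_updates = updates_per_epoch(n_questions, batch_size=10)

print(sgd_updates, one_batch_updates, mini_batch_updates)  # 100 1 10
```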

**Code:**
- [Part 6 - MiniBatch](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/Part%206%20-%20MiniBatch.ipynb)
- [Part 7 - Importance of Equal Batch Sizes](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/Part%207%20-%20Importance%20of%20Equal%20Batch%20Sizes.ipynb)
- [Part 8 - CodeChallenge Effects of Mini-Batch Size](https://github.com/Sayan-Roy-729/Data-Science/blob/main/Deep%20Learning/Using%20Pytorch/Part%205%20-%20Regularization/Part%208%20-%20CodeChallenge%20Effects%20of%20Mini-Batch%20Size.ipynb)