# 📘 Comprehensive List of Techniques Supporting Deep Model Training

---

## 1. Regularization by Randomization
Introduce randomness into activations, weights, or structure to reduce overfitting.

- **Dropout** (Srivastava et al., 2014) → randomly drop units  
- **DropConnect** → randomly drop weights  
- **Stochastic Depth** (Huang et al., 2016) → randomly drop residual blocks  
- **Shake-Shake Regularization** → random branch combination in ResNets  
- **DropBlock** → structured dropout in feature maps  
- **Stochastic Layer Skipping** (SkipNet, 2017) → conditionally skip layers  
- **Zoneout (RNNs)** → randomly preserve past hidden states  
- **RandAugment / Random Erasing** → random input perturbations  
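
As a rough illustration of two entries above, the following sketch shows inverted dropout and a stochastic-depth residual block on plain Python lists. The function names, the `rng` argument, and the list-based "tensors" are assumptions made for this example; real implementations operate on framework tensors and typically include a nonlinearity after the addition.

```python
import random

def inverted_dropout(activations, drop_prob, rng, training=True):
    """Zero each unit with probability drop_prob and rescale the survivors."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

def stochastic_depth_block(x, residual_fn, survival_prob, rng, training=True):
    """Residual block whose residual branch is randomly dropped during training."""
    if training:
        if rng.random() >= survival_prob:
            return list(x)  # block dropped: only the identity shortcut remains
        return [xi + fi for xi, fi in zip(x, residual_fn(x))]
    # At test time the branch is always active, scaled by its survival probability.
    return [xi + survival_prob * fi for xi, fi in zip(x, residual_fn(x))]

rng = random.Random(0)
dropped = inverted_dropout([1.0, 1.0, 1.0, 1.0], drop_prob=0.5, rng=rng)
skipped = stochastic_depth_block([1.0], lambda v: [2.0 * e for e in v], 0.8, rng)
```

Both functions share the same idea: randomness is applied only in training mode, and the test-time path is deterministic.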

---

## 2. Normalization Techniques
Rescale activations or weights to stabilize gradients; Batch Normalization was originally motivated as reducing internal covariate shift.

- **Batch Normalization (BN)**  
- **Layer Normalization (LN)**  
- **Instance Normalization (IN)**  
- **Group Normalization (GN)**  
- **Weight Normalization**  
- **Spectral Normalization** (popular in GANs)  
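
The main difference between the activation-normalization variants above is the axis over which statistics are computed. A minimal sketch, assuming plain Python lists and omitting the learnable scale/shift and the running averages that practical layers keep:

```python
import math

def layer_norm(features, eps=1e-5):
    """Normalize one example across its own features."""
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    return [(f - mean) / math.sqrt(var + eps) for f in features]

def batch_norm(batch, eps=1e-5):
    """Normalize each feature across the examples in a batch (training-mode statistics)."""
    n = len(batch)
    num_features = len(batch[0])
    means = [sum(ex[j] for ex in batch) / n for j in range(num_features)]
    variances = [sum((ex[j] - means[j]) ** 2 for ex in batch) / n
                 for j in range(num_features)]
    return [[(ex[j] - means[j]) / math.sqrt(variances[j] + eps)
             for j in range(num_features)] for ex in batch]

ln_out = layer_norm([1.0, 2.0, 3.0, 4.0])
bn_out = batch_norm([[1.0, 10.0], [3.0, 30.0]])
```

Layer normalization does not depend on the batch at all, which is one reason it is preferred when batch sizes are small or variable.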

---

## 3. Architectural Innovations
Structural designs that improve gradient flow and optimization.

- **Residual Connections / ResNets**  
- **Highway Networks** (gated shortcuts)  
- **DenseNet** (dense connectivity, feature reuse)  
- **Skip Connections in Transformers**  
- **Auxiliary Classifiers** (Inception, deeply-supervised nets)  
- **Neural ODEs** (continuous-depth modeling)  
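
To make the "gated shortcut" idea concrete, here is a hedged sketch of a highway-style combination next to a plain residual addition. In a real highway layer the gate is computed from the input by a learned affine map followed by a sigmoid; here `gate_logits` is a hypothetical stand-in passed in directly.

```python
import math

def residual_layer(x, transform_fn):
    """Plain residual connection: output = x + F(x)."""
    return [xi + hi for xi, hi in zip(x, transform_fn(x))]

def highway_layer(x, transform_fn, gate_logits):
    """Gated shortcut: output = T * H(x) + (1 - T) * x, with T = sigmoid(gate_logits)."""
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    transformed = transform_fn(x)
    return [t * hi + (1.0 - t) * xi for t, hi, xi in zip(gates, transformed, x)]

double = lambda v: [2.0 * e for e in v]
half_open = highway_layer([2.0], double, gate_logits=[0.0])     # gate = 0.5
mostly_carry = highway_layer([2.0], double, gate_logits=[-50.0])  # gate close to 0
```

With strongly negative gate logits the layer passes its input through nearly unchanged, which is the property that helps gradients reach early layers.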

---

## 4. Advanced Weight Initialization
Smart starting conditions for optimization.

- **Xavier Initialization** (Glorot & Bengio, 2010)  
- **He Initialization** (ReLU, PReLU; He et al., 2015)  
- **Orthogonal Initialization**  
- **LSUV** (Layer-Sequential Unit-Variance)  
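
The Xavier and He schemes differ mainly in how the weight scale depends on layer width. A small sketch, assuming dense layers stored as nested lists (the function names and the `rng` argument are illustrative, not any library's API):

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Glorot/Xavier: uniform in [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]

def he_normal(fan_in, fan_out, rng):
    """He: zero-mean Gaussian with std = sqrt(2 / fan_in), intended for ReLU-family units."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)
w_xavier = xavier_uniform(fan_in=100, fan_out=50, rng=rng)
w_he = he_normal(fan_in=200, fan_out=200, rng=rng)
```

The factor of 2 in the He variant compensates for ReLU zeroing roughly half of the pre-activations.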

---

## 5. Gradient & Optimization Stabilizers
Techniques to avoid gradient explosion/vanishing and improve convergence.

- **Gradient Clipping** (RNNs, Transformers)  
- **Residual Gradient Scaling** (ResNets)  
- **Adaptive Optimizers** → Adam, RMSProp, Adagrad  
- **Learning Rate Schedules** → step decay, cosine annealing, cyclical LR  
- **Warmup Schedules** (Transformers, very deep nets)  
- **Lookahead Optimizer**  
- **Sharpness-Aware Minimization (SAM)** (Foret et al., 2021)  
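
Two of the stabilizers above are simple enough to sketch directly: clipping by global gradient norm, and a linear warmup followed by cosine annealing. The function names, the flat list of gradients, and the exact warmup shape are assumptions for illustration; published recipes vary in these details.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)  # norm 5 is rescaled to norm 1
schedule = [warmup_cosine_lr(s, 0.1, warmup_steps=10, total_steps=100) for s in range(100)]
```

Clipping preserves the gradient direction and only shortens the step, which is why it is usually preferred over clipping each component independently.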

---

## 6. Ensemble & Implicit Ensemble Techniques
Boost generalization by simulating multiple models.

- **Bagging / Boosting**  
- **Dropout** → implicit ensemble  
- **Stochastic Depth** → ensemble of different-depth networks  
- **Snapshot Ensembles** (save checkpoints during one training run)  
- **SWAG** (Stochastic Weight Averaging-Gaussian)  
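
Ensembling can happen in prediction space (as in snapshot ensembles) or in weight space (the running average of weights that SWAG uses as its mean). A minimal sketch of both, with hypothetical function names and flat lists standing in for model outputs and parameters; the Gaussian covariance estimate that SWAG adds on top is not shown.

```python
def average_predictions(prob_lists):
    """Average class probabilities from several checkpoints (prediction-space ensemble)."""
    n = len(prob_lists)
    num_classes = len(prob_lists[0])
    return [sum(p[k] for p in prob_lists) / n for k in range(num_classes)]

def update_running_mean(mean_weights, new_weights, num_averaged):
    """Fold one more checkpoint into a running average of weights (weight-space average)."""
    return [m + (w - m) / (num_averaged + 1) for m, w in zip(mean_weights, new_weights)]

ensemble_probs = average_predictions([[0.9, 0.1], [0.5, 0.5]])
mean_weights = update_running_mean([1.0, 2.0], [3.0, 4.0], num_averaged=1)
```

Prediction averaging needs one forward pass per checkpoint at inference time, while weight averaging yields a single model.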

---

## 7. Data-Level Techniques
Improve data diversity and robustness.

- **Standard Augmentation** (flip, crop, rotate, jitter)  
- **Mixup** (linear interpolation of samples/labels)  
- **CutMix** (patch-level mixing)  
- **CutOut** (mask patches)  
- **AutoAugment / RandAugment** (AutoAugment learns a policy; RandAugment samples transforms at random with two tunable hyperparameters)  
- **Adversarial Training** (robustness to perturbations)  
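
Mixup is the easiest of these to show in a few lines: two examples and their one-hot labels are blended with the same coefficient, drawn from a Beta distribution. The sketch below assumes flat feature lists and a hypothetical `mixup` helper; `alpha` is the Beta concentration parameter.

```python
import random

def mixup(x1, y1, x2, y2, alpha, rng):
    """Blend two (input, one-hot label) pairs with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

rng = random.Random(0)
mixed_x, mixed_y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
                              alpha=0.2, rng=rng)
```

Small `alpha` values concentrate `lam` near 0 or 1, so most mixed samples stay close to one of the two originals.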

---

## 8. Regularization by Constraints
Apply explicit mathematical constraints to weights.

- **Weight Decay (L2 regularization)**  
- **L1 Sparsity Regularization**  
- **Orthogonality Constraints**  
- **Spectral Constraints** (bound Lipschitz constant)  
- **Manifold Regularization** (semi-supervised)  
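
For plain SGD, an L2 penalty and weight decay produce the same update, which is why the two names are often used interchangeably; with adaptive optimizers they generally differ. A small sketch under that plain-SGD assumption, using hypothetical helper names:

```python
def sgd_step_with_l2(weights, grads, lr, weight_decay):
    """One SGD step on loss + (weight_decay / 2) * ||w||^2."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

def l1_penalty(weights, strength):
    """L1 term added to the loss; its (sub)gradient pushes weights toward exactly zero."""
    return strength * sum(abs(w) for w in weights)

new_weights = sgd_step_with_l2([1.0, -2.0], [0.0, 0.0], lr=0.1, weight_decay=0.5)
penalty = l1_penalty([1.0, -2.0], strength=0.01)
```

With zero data gradient, the L2 step simply shrinks every weight by the factor `1 - lr * weight_decay`.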

---

## 9. Noise Injection Techniques
Add stochasticity to promote robustness.

- **Gaussian Noise** (inputs/weights)  
- **Label Smoothing** (target perturbation)  
- **Stochastic Gradient Descent** → inherent minibatch noise  
- **Bayesian Dropout** (interpreted as variational inference)  
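
Label smoothing can be written as a one-line change to the target vector: a fraction `epsilon` of the probability mass is spread uniformly over all classes. The helper name below is an assumption for illustration.

```python
def smooth_labels(class_index, num_classes, epsilon=0.1):
    """Return (1 - epsilon) * one_hot + epsilon / num_classes."""
    off_value = epsilon / num_classes
    return [off_value + (1.0 - epsilon) if k == class_index else off_value
            for k in range(num_classes)]

targets = smooth_labels(class_index=1, num_classes=4, epsilon=0.1)
```

The smoothed targets still sum to one, so they can be used directly with a cross-entropy loss.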

---

## 10. Curriculum & Sample Selection
Order and weight samples during training.

- **Curriculum Learning** (easy → hard)  
- **Self-Paced Learning**  
- **Hard Example Mining**  
- **Focal Loss** (down-weight easy samples)  
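
Focal loss shows how "down-weighting easy samples" can be built into the loss itself. The sketch below is a simplified per-example form that takes the predicted probability of the true class; the optional class-balancing weight from the original paper is left out, and the function name is illustrative.

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Cross-entropy scaled by (1 - p_true) ** gamma; gamma = 0 recovers plain cross-entropy."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

easy = focal_loss(0.9)   # confident, correct prediction: strongly down-weighted
hard = focal_loss(0.1)   # poorly classified example: loss stays large
```

Larger `gamma` values shift more of the total loss toward the poorly classified examples.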

---

## 11. Specialized Regularizers
Tailored methods for specific architectures.

- **Teacher Forcing / Scheduled Sampling** (Seq2Seq)  
- **KL Annealing / β-VAE** (stabilizing generative models)  
- **Consistency Regularization** (Mean Teacher, FixMatch)  
- **Contrastive Loss / InfoNCE** (self-supervised representation learning)  
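
As one concrete case from this group, InfoNCE is a softmax cross-entropy in which the positive pair must be picked out among negatives. A minimal sketch, assuming similarity scores have already been computed and using a hypothetical `info_nce` helper:

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.1):
    """-log( exp(pos / t) / sum over all candidates of exp(score / t) )."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

good = info_nce(pos_score=1.0, neg_scores=[0.0, 0.0])  # positive scores highest: low loss
bad = info_nce(pos_score=0.0, neg_scores=[1.0, 1.0])   # negatives score higher: high loss
```

Lower temperatures sharpen the softmax, so the loss focuses more on the hardest negatives.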

---

## 12. Optimization Tricks for Scaling Depth
Specific techniques for very deep networks.

- **Gradient Checkpointing** (memory-efficient backprop)  
- **ResNet Identity Mappings** (He et al., 2016b)  
- **Stochastic Depth** (deep ResNets)  
- **ReZero** (residual branch scaled by a learnable scalar initialized to zero, so each block starts as the identity)  
- **Pre-activation ResNets** (better gradient flow)  
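
The ReZero idea fits in a few lines: each residual branch is multiplied by a scalar that starts at zero, so a freshly initialized network of any depth computes the identity. In the sketch below `alpha` would be a learned parameter in practice; here it is a plain argument, and the function name is illustrative.

```python
def rezero_block(x, residual_fn, alpha=0.0):
    """Compute x + alpha * F(x); with alpha = 0 the block is exactly the identity."""
    return [xi + alpha * fi for xi, fi in zip(x, residual_fn(x))]

scale_by_ten = lambda v: [10.0 * e for e in v]
at_init = rezero_block([1.0, 2.0], scale_by_ten)            # identity at initialization
later = rezero_block([1.0, 2.0], scale_by_ten, alpha=0.5)   # branch gradually switched on
```

Because every block starts as the identity, signals and gradients pass through the full stack unchanged at the start of training.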

---

## 🎯 Key Takeaway
**Stochastic Depth** is part of the **randomized structural regularization family** (Dropout, DropConnect, Shake-Shake, etc.), but training deep models successfully requires a **toolbox of complementary methods**:

- **Weights/gradients:** initialization, optimizers, clipping.  
- **Activations:** dropout, normalization.  
- **Architecture:** residuals, dense connections.  
- **Data:** augmentation, adversarial training.  

👉 Collectively, these techniques form the **deep learning toolkit** that makes modern large-scale training possible.


# 📘 Techniques Supporting Deep Model Training: Paper References

---

## Regularization by Randomization

| Technique | Paper / Authors | Year |
|-----------|----------------|------|
| Dropout | Srivastava et al. – *Dropout: A Simple Way to Prevent NN Overfitting* | 2014 |
| DropConnect | Wan et al. – *DropConnect* | 2013 |
| Stochastic Depth | Huang et al. – *Deep Networks with Stochastic Depth* | 2016 |
| Shake-Shake Regularization | Gastaldi – *Shake-Shake Regularization* | 2017 |
| DropBlock | Ghiasi et al. – *DropBlock: A Structured Dropout* | 2018 |
| SkipNet (Layer Skipping) | Wang et al. – *SkipNet* | 2017 |
| Zoneout (RNNs) | Krueger et al. – *Zoneout* | 2016 |
| Random Erasing / RandAugment | Zhong et al. – *Random Erasing*; Cubuk et al. – *RandAugment* | 2017 / 2020 |

---

## Normalization Techniques

| Technique | Paper / Authors | Year |
|-----------|----------------|------|
| Batch Normalization | Ioffe & Szegedy – *Batch Norm* | 2015 |
| Layer Normalization | Ba et al. – *Layer Norm* | 2016 |
| Instance Normalization | Ulyanov et al. – *Instance Norm* | 2016 |
| Group Normalization | Wu & He – *Group Norm* | 2018 |
| Weight Normalization | Salimans & Kingma – *Weight Norm* | 2016 |
| Spectral Normalization | Miyato et al. – *Spectral Norm GANs* | 2018 |

---

## Architectural Innovations

| Technique | Paper / Authors | Year |
|-----------|----------------|------|
| Residual Connections (ResNet) | He et al. – *Deep Residual Learning* | 2016 |
| Highway Networks | Srivastava et al. – *Highway Networks* | 2015 |
| DenseNet | Huang et al. – *Densely Connected CNNs* | 2017 |
| Skip Connections (Transformers) | Vaswani et al. – *Attention Is All You Need* | 2017 |
| Auxiliary Classifiers (Inception) | Szegedy et al. – *Going Deeper with Convolutions* | 2015 |
| Neural ODEs | Chen et al. – *Neural Ordinary Differential Equations* | 2018 |

---

## Weight Initialization

| Technique | Paper / Authors | Year |
|-----------|----------------|------|
| Xavier Initialization | Glorot & Bengio – *Understanding Difficulty of Training Deep FFNs* | 2010 |
| He Initialization | He et al. – *Delving Deep into Rectifiers* | 2015 |
| Orthogonal Initialization | Saxe et al. – *Exact Solutions to Deep Linear Nets* | 2014 |
| LSUV Initialization | Mishkin & Matas – *All You Need is a Good Init* | 2015 |

---

## Optimization Stabilizers

| Technique | Paper / Authors | Year |
|-----------|----------------|------|
| Gradient Clipping | Pascanu et al. – *On the Difficulty of Training RNNs* | 2013 |
| Residual Gradient Scaling | He et al. – *ResNet* | 2016 |
| Adam Optimizer | Kingma & Ba – *Adam* | 2015 |
| RMSProp | Tieleman & Hinton – *Lecture Notes* | 2012 |
| Adagrad | Duchi et al. – *Adaptive Subgradient Methods* | 2011 |
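| LR Scheduling (Cosine, Step, Cyclical) | Loshchilov & Hutter – *SGDR* (cosine annealing); Smith – *Cyclical Learning Rates* | 2016 / 2017 |
| Warmup Schedules | He et al. – *Deep Residual Learning* (warmup used for the 110-layer CIFAR model) | 2016 |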
| Lookahead Optimizer | Zhang et al. – *Lookahead Optimizer* | 2019 |
| SAM | Foret et al. – *Sharpness-Aware Minimization* | 2021 |

---

## Ensemble & Implicit Ensembles

| Technique | Paper / Authors | Year |
|-----------|----------------|------|
| Bagging / Boosting | Breiman – *Bagging*; Freund & Schapire – *Boosting* | 1996 / 1997 |
| Dropout as Ensemble | Srivastava et al. – *Dropout* | 2014 |
| Stochastic Depth Ensemble Effect | Huang et al. – *Stochastic Depth* | 2016 |
| Snapshot Ensembles | Huang et al. – *Snapshot Ensembles* | 2017 |
| SWAG | Maddox et al. – *Stochastic Weight Averaging-Gaussian* | 2019 |

---

## Data-Level Techniques

| Technique | Paper / Authors | Year |
|-----------|----------------|------|
| Data Augmentation | Krizhevsky et al. – *ImageNet CNN* | 2012 |
| Mixup | Zhang et al. – *Mixup* | 2017 |
| CutMix | Yun et al. – *CutMix* | 2019 |
| CutOut | DeVries & Taylor – *Cutout* | 2017 |
| AutoAugment | Cubuk et al. – *AutoAugment* | 2019 |
| Adversarial Training | Goodfellow et al. – *Explaining & Harnessing Adversarial Examples* | 2015 |

---

## Constraints & Regularizers

| Technique | Paper / Authors | Year |
|-----------|----------------|------|
| Weight Decay (L2) | Krogh & Hertz – *A Simple Weight Decay Can Improve Generalization* | 1992 |
| L1 Sparsity | Tibshirani – *LASSO* | 1996 |
| Orthogonality Constraints | Brock et al. – *Neural Photo Editing with Introspective Adversarial Networks* (orthogonal regularization) | 2016 |
| Spectral Constraints | Yoshida & Miyato – *Spectral Norm Bounds* | 2017 |
| Manifold Regularization | Belkin et al. – *Manifold Regularization* | 2006 |

---

## Noise Injection

| Technique | Paper / Authors | Year |
|-----------|----------------|------|
| Gaussian Noise in Inputs/Weights | Bishop – *Training with Noise is Equivalent to Tikhonov Regularization* | 1995 |
| Label Smoothing | Szegedy et al. – *Rethinking Inception* | 2016 |
| SGD Noise | Bottou – *Stochastic Gradient Descent* | 2010 |
| Bayesian Dropout | Gal & Ghahramani – *Dropout as Bayesian Approximation* | 2016 |

---

## Curriculum & Sample Selection

| Technique | Paper / Authors | Year |
|-----------|----------------|------|
| Curriculum Learning | Bengio et al. – *Curriculum Learning* | 2009 |
| Self-Paced Learning | Kumar et al. – *Self-Paced Learning* | 2010 |
| Hard Example Mining | Shrivastava et al. – *OHEM* | 2016 |
| Focal Loss | Lin et al. – *Focal Loss for Dense Detection* | 2017 |

---

## Specialized Regularizers

| Technique | Paper / Authors | Year |
|-----------|----------------|------|
| Teacher Forcing / Scheduled Sampling | Bengio et al. – *Scheduled Sampling* | 2015 |
| KL Annealing / β-VAE | Bowman et al. – *Generating Sentences from a Continuous Space*; Higgins et al. – *β-VAE* | 2016 / 2017 |
| Consistency Regularization (Mean Teacher) | Tarvainen & Valpola – *Mean Teacher* | 2017 |
| Contrastive Loss / InfoNCE | van den Oord et al. – *CPC* | 2018 |

---

## Scaling Depth Tricks

| Technique | Paper / Authors | Year |
|-----------|----------------|------|
| Gradient Checkpointing | Chen et al. – *Training Deep Nets with Sublinear Memory Cost* | 2016 |
| Identity Mappings in ResNets | He et al. – *Identity Mappings in ResNets* | 2016b |
| Stochastic Depth | Huang et al. – *Stochastic Depth* | 2016 |
| ReZero | Bachlechner et al. – *ReZero* | 2020 |
| Pre-activation ResNets | He et al. – *Identity Mappings in Deep Residual Networks* (introduces pre-activation blocks) | 2016 |

---
