# Regularization
- Techniques that aim to reduce the generalization gap, i.e. the difference between training and test set performance

## Explicit Regularization

$$\hat{\phi} = argmin_{\phi} \left[\sum_{i=1}^I l_i[x_i,y_i] + \lambda g[\phi]    \right]$$

- $g[\phi]$ is a function that returns a scalar, taking larger values when the parameters are less preferred. $\lambda$ is a positive scalar that controls the contribution of this term to the loss function.

### Probabilistic Interpretation
- Regularization can be considered as a prior that represents knowledge about the parameters before seeing the data

$$\hat{\phi} = argmax_{\phi} \left[\prod_{i=1}^I Pr(y_i|x_i,\phi) \cdot Pr(\phi)   \right]$$
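
To see how the prior becomes the penalty term, take the negative logarithm of the maximum a posteriori criterion. This is a sketch that assumes each loss $l_i[x_i,y_i]$ is the negative log-likelihood $-\log Pr(y_i|x_i,\phi)$:

$$\begin{aligned}
\hat{\phi} &= argmax_{\phi} \left[\prod_{i=1}^I Pr(y_i|x_i,\phi) \cdot Pr(\phi) \right] \\
&= argmin_{\phi} \left[-\sum_{i=1}^I \log Pr(y_i|x_i,\phi) - \log Pr(\phi) \right]
\end{aligned}$$

Matching terms with the explicit regularization objective gives $\lambda g[\phi] = -\log Pr(\phi)$, so a less probable parameter setting under the prior incurs a larger penalty.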

### L2 regularization
$$\hat{\phi} = argmin_{\phi} \left[\sum_{i=1}^I l_i[x_i,y_i] + \lambda \sum_{j} \phi_j^2    \right]$$
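- Penalizes the sum of squares of the parameter values
- Corresponds to a zero-mean Gaussian prior on the parameters, with variance inversely proportional to $\lambda$
- Also called weight decay; in neural networks it is usually applied only to the weights, not the biases
- Effects
  - If the model is overfitting, increase the regularization
  - If the model is underfitting, the regularization might be too strong
- When the model is over-parametrized, some of the extra model capacity will describe areas with no training data; there, the regularization term favors functions that interpolate smoothly between nearby points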
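
The L2 objective above can be sketched on a one-dimensional linear model. This is a minimal illustration, not a reference implementation: the function name `fit_linear_l2`, the toy data, the step size, and the choice of a squared-error loss are all assumptions made for the example. The penalty is added to the weight gradient only, mirroring the weights-not-biases convention.

```python
def fit_linear_l2(xs, ys, lam, lr=0.05, steps=500):
    """Fit y ~ w*x + b by gradient descent on sum of squared errors + lam * w**2."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Gradients of the summed squared-error loss
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys))
        # Gradient of the L2 penalty, applied to the weight but not the bias
        grad_w += 2 * lam * w
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data (hypothetical) lying roughly on the line y = 2x
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [-1.9, -1.2, 0.1, 0.9, 2.1]

w_plain, b_plain = fit_linear_l2(xs, ys, lam=0.0)
w_reg, b_reg = fit_linear_l2(xs, ys, lam=1.0)
```

With `lam=1.0` the fitted weight is pulled toward zero relative to the unregularized fit, which is the shrinkage effect the penalty is meant to produce.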

## Implicit Regularization
- Neither gradient descent nor stochastic gradient descent moves neutrally toward the minimum of the loss function; each prefers some solutions over others
- Implicit regularization due to gradient descent may explain the observation that full-batch gradient descent generalizes better with larger step sizes
- SGD generally generalizes better than full-batch gradient descent, and smaller batch sizes generally perform better than larger ones
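
One commonly quoted way to make this preference concrete (a sketch that is not derived in these notes) is that gradient descent with a finite step size $\alpha$ approximately follows the continuous descent path of a modified loss. Here $L[\phi]$ denotes the full-batch loss, $L_b[\phi]$ the loss on batch $b$, and $B$ the number of batches per epoch; these symbols are assumptions introduced for illustration:

$$\tilde{L}_{GD}[\phi] = L[\phi] + \frac{\alpha}{4}\left\|\frac{\partial L}{\partial \phi}\right\|^2$$

$$\tilde{L}_{SGD}[\phi] = \tilde{L}_{GD}[\phi] + \frac{\alpha}{4B}\sum_{b=1}^{B}\left\|\frac{\partial L_b}{\partial \phi} - \frac{\partial L}{\partial \phi}\right\|^2$$

Under this approximation, the extra term in the first expression grows with $\alpha$, which is consistent with larger step sizes regularizing more. The additional term in the second expression penalizes disagreement between batch gradients, and it tends to be larger when batches are small and their gradients vary more.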

## Heuristics to improve performance
- Early stopping
  - Stop the training procedure before it has fully converged
  - Can reduce overfitting by stopping before the model starts to capture noise in the data
  - Has a similar effect to explicit L2 regularization
  - Single hyperparameter: the number of steps after which learning is terminated
    - Chosen using a validation set
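
Early stopping as described above can be sketched as a loop that tracks the validation loss and keeps the best parameters seen so far. The callables `train_step` and `val_loss` are hypothetical placeholders supplied by the caller, and the `patience` rule for ending the loop is an assumption added here, not something stated in these notes.

```python
import copy


def train_with_early_stopping(params, train_step, val_loss, max_steps, patience):
    """Return the parameters and step count with the lowest validation loss.

    train_step(params) -> updated params; val_loss(params) -> float.
    """
    best_params = copy.deepcopy(params)
    best_loss = val_loss(params)
    best_step = 0
    for step in range(1, max_steps + 1):
        params = train_step(params)
        loss = val_loss(params)
        if loss < best_loss:
            best_params, best_loss, best_step = copy.deepcopy(params), loss, step
        elif step - best_step >= patience:
            break  # validation loss has not improved for `patience` steps
    return best_params, best_step


# Toy usage: the single "parameter" grows by 0.1 per step, and the
# hypothetical validation loss is lowest when it is close to 1.0.
def toy_train_step(p):
    return p + 0.1


def toy_val_loss(p):
    return (p - 1.0) ** 2


best_p, best_step = train_with_early_stopping(
    0.0, toy_train_step, toy_val_loss, max_steps=100, patience=5
)
```

The returned `best_step` plays the role of the single hyperparameter: the number of steps after which training would have been terminated.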
- Ensembling
  - Build several models and average their predictions
  - Predictions are combined by:
    - Mean or median of the outputs (regression)
    - Mean of the softmax outputs or most frequent class in the predictions (multiclass classification)
  - Ways to train different models:
    - Different random initializations
    - Bootstrap samples of the data
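
The combination rules and the bootstrap resampling listed above can be sketched with the standard library alone. All function names here are hypothetical helpers chosen for illustration, and each function handles the predictions for a single input.

```python
import random
from collections import Counter
from statistics import mean, median


def combine_regression(predictions, how="mean"):
    """Combine one scalar prediction per model using the mean or the median."""
    return mean(predictions) if how == "mean" else median(predictions)


def combine_class_probs(prob_vectors):
    """Average the softmax outputs of several models and return the top class."""
    n_classes = len(prob_vectors[0])
    avg = [mean(p[k] for p in prob_vectors) for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: avg[k])


def majority_vote(class_predictions):
    """Return the most frequent predicted class across models."""
    return Counter(class_predictions).most_common(1)[0][0]


def bootstrap_sample(dataset, rng):
    """Resample the dataset with replacement, keeping the same size."""
    return [rng.choice(dataset) for _ in dataset]
```

Each ensemble member would be trained on its own `bootstrap_sample` (or from a different random initialization) before its predictions are passed to one of the combination functions.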
- Dropout
  - Randomly clamps a subset of hidden units to zero at each iteration of SGD. 
    - Encourages weights to have smaller magnitudes
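
A minimal sketch of the dropout step on a list of hidden activations follows. The rescaling by `1 / (1 - drop_prob)` (often called inverted dropout) is an assumption that is not stated in these notes; it keeps the expected activation the same between training and test time. The function name and arguments are illustrative.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Zero each hidden unit with probability drop_prob during training.

    Kept units are rescaled by 1 / (1 - drop_prob) (inverted dropout, an
    assumption of this sketch). At test time the input is returned unchanged.
    """
    if not training or drop_prob == 0.0:
        return list(activations)
    keep = 1.0 - drop_prob
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 1000
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
```

A fresh random mask is drawn on every call, which corresponds to clamping a different subset of units at each SGD iteration.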
- Applying noise
  - Adding noise to input data
  - Adding noise to weights
  - Perturbing labels
    - Randomly changing labels at each training iteration
      - The same end can be achieved by changing the loss function to minimize the cross entropy between the predicted distribution and a distribution where the true label has probability $1-\rho$ and the other classes have equal probability (label smoothing)
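
The smoothed target distribution and its cross entropy can be sketched as follows. The helper names are hypothetical, and the sketch assumes the remaining probability $\rho$ is split equally among the other classes.

```python
import math


def smoothed_targets(true_class, n_classes, rho):
    """True class gets 1 - rho; the other classes share rho equally."""
    other = rho / (n_classes - 1)
    return [1.0 - rho if k == true_class else other for k in range(n_classes)]


def cross_entropy(target, predicted):
    """Cross entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


target = smoothed_targets(true_class=2, n_classes=4, rho=0.1)
loss = cross_entropy(target, [0.05, 0.05, 0.85, 0.05])
```

Because every class has nonzero target probability, the loss keeps penalizing predictions that push the probability of any class toward zero, which discourages overconfident outputs.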
- Transfer Learning
  - The network is first pre-trained to perform a related secondary task for which data are more plentiful
  - The resulting model is then adapted to the original task
  - This is typically done by:
    - Removing the last layer and adding one or more layers that produce a suitable output
    - Keeping the main model fixed and training only the new last layers, or
    - Fine-tuning the whole model end-to-end
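
The "main model fixed, new last layers trained" option can be illustrated with a toy sketch. Here `pretrained_features` is a hypothetical stand-in for a pre-trained network with its last layer removed, and only a new linear head is trained on the target task; the data, learning rate, and squared-error loss are assumptions for the example.

```python
def pretrained_features(x):
    """Hypothetical frozen feature extractor: two fixed ReLU features."""
    return [max(0.0, x), max(0.0, -x)]


def train_new_head(data, lr=0.1, steps=500):
    """Train only a new linear output layer on top of the frozen features."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in data:
            f = pretrained_features(x)  # frozen: never updated
            err = sum(wi * fi for wi, fi in zip(w, f)) - y
            for j in range(len(w)):
                grad[j] += 2 * err * f[j] / len(data)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Target task (hypothetical): y = |x|, representable with head weights (1, 1)
data = [(x / 4, abs(x / 4)) for x in range(-8, 9)]
head = train_new_head(data)
```

Training end-to-end would additionally update the parameters inside the feature extractor, which this sketch deliberately leaves untouched.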
- Self-supervised learning
  - Generative self-supervised learning
    - Part of each data example is masked and the secondary task is to predict the missing part
  - Contrastive self-supervised learning
    - Pairs of examples with commonalities are compared to unrelated pairs
    - For images, the secondary task might be to identify whether two images are transformed versions of one another or are unconnected
- Augmentation
  - Expand the dataset by transforming each data point in a way that leaves the label unchanged
  - Text: translate to another language and then translate it back
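
For image data, a label-preserving transformation can be sketched as below. The random horizontal flip is an assumed example of such a transformation (it is not mentioned in these notes), images are represented as plain lists of rows, and the function names are illustrative.

```python
import random


def augment_image(image, rng):
    """Return a copy of a 2D image (list of rows), flipped horizontally half the time."""
    out = [row[:] for row in image]
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]  # horizontal flip
    return out


def augmented_dataset(dataset, copies, rng):
    """Create `copies` transformed versions of each (image, label) pair; labels are unchanged."""
    return [
        (augment_image(img, rng), label)
        for img, label in dataset
        for _ in range(copies)
    ]
```

Whether a given transformation really preserves the label depends on the task; a flip would be inappropriate, for example, when recognizing handwritten characters.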