# CHAPTER 7. REGULARIZATION FOR DEEP LEARNING

* Deep Learning Seminar: Theory [1]
* 김무성

# Contents
* 7.1 Parameter Norm Penalties
* 7.2 Norm Penalties as Constrained Optimization
* 7.3 Regularization and Under-Constrained Problems
* 7.4 Dataset Augmentation
* <font color="red">7.5 Noise Robustness</font>
    - 7.5.1 Injecting Noise at the Output Targets
* 7.6 Semi-Supervised Learning
* 7.7 Multi-Task Learning
* 7.8 Early Stopping
* 7.9 Parameter Tying and Parameter Sharing
* 7.10 Sparse Representations
* 7.11 Bagging and Other Ensemble Methods
* 7.12 Dropout
* 7.13 Adversarial Training
* 7.14 Tangent Distance, Tangent Prop, and Manifold Tangent Classifier

# 7.5 Noise Robustness
* 7.5.1 Injecting Noise at the Output Targets

#### inputs
* Section 7.4 has motivated the use of noise applied to the inputs as a dataset augmentation strategy. 
* For some models, the addition of noise with infinitesimal variance at the input of the model is equivalent to imposing a penalty on the norm of the weights.
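
The equivalence can be checked numerically in the simplest case. The sketch below assumes a plain linear model $\hat{y}(x) = w^\top x$ with isotropic Gaussian input noise of standard deviation `sigma`; the weights, input, and target are hypothetical values chosen for illustration. For this model the expected squared error under input noise equals the clean squared error plus $\sigma^2 \lVert w \rVert^2$, which is a penalty on the weight norm.

```python
import random

def noisy_input_loss(w, x, y, sigma, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[(w.(x + e) - y)^2] with e ~ N(0, sigma^2 I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Draw fresh noise for every input coordinate on every presentation.
        pred = sum(wi * (xi + rng.gauss(0.0, sigma)) for wi, xi in zip(w, x))
        total += (pred - y) ** 2
    return total / n_samples

def penalized_loss(w, x, y, sigma):
    """Clean squared error plus the weight-norm penalty sigma^2 * ||w||^2."""
    clean = (sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2
    return clean + sigma ** 2 * sum(wi * wi for wi in w)

# Hypothetical values, chosen only for illustration.
w = [1.0, -2.0, 0.5]
x = [0.3, -0.1, 0.8]
y = 0.5
sigma = 0.1

mc = noisy_input_loss(w, x, y, sigma)
exact = penalized_loss(w, x, y, sigma)
print(f"Monte Carlo: {mc:.4f}  penalized: {exact:.4f}")
```

The two numbers should agree up to sampling error. For nonlinear models the correspondence is only approximate and holds in the small-variance limit.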

#### hidden units
In the general case, it is important to remember that <font color="red">noise injection</font> can be much more powerful than simply shrinking the parameters, especially when the noise is added to the hidden units.

#### weights

##### bayesian
* Another way that noise has been used in the service of regularizing models is by adding it to the weights. 
* This technique has been used primarily in the context of recurrent neural networks (Jim et al., 1996; Graves, 2011).
* This can be interpreted as a stochastic implementation of <font color="red">Bayesian inference</font> over the weights.
* The Bayesian treatment of learning considers the model weights to be uncertain and represents them with a probability distribution. <font color="red">Adding noise to the weights is a practical, stochastic way to reflect this uncertainty</font>.

##### traditional form of regularization
* Noise applied to the weights can also be interpreted as <font color="red">equivalent (under some assumptions) to a more traditional form of regularization</font>, encouraging stability of the function to be learned.

<font color="blue">Consider the regression setting</font>, where we wish to train a function $\hat{y}(x)$ that maps a set of features $x$ to a scalar using the least-squares cost function between the model predictions $\hat{y}(x)$ and the true values $y$:

<img src="figures/cap7.5.1.png" width=600 />

We now assume that with each input presentation we also include a random perturbation

<img src="figures/cap7.5.2.png" width=200 /> of the network weights.

Let us imagine that we have a standard $l$-layer MLP. We denote the perturbed model as

<img src="figures/cap7.5.3.png" width=100 />

<font color="red">Despite the injection of noise, we are still interested in minimizing the squared error of the output of the network</font>. The objective function thus becomes:

<img src="figures/cap7.5.4.png" width=600 />

For small $\eta$, the minimization of $J$ with added weight noise (with covariance $\eta I$) is <font color="blue">equivalent to minimization of $J$ with an additional regularization term</font>:

<img src="figures/cap7.5.5.png" width=200 />

<font color="red">This form of regularization encourages the parameters to go to regions of parameter space where small perturbations of the weights have a relatively small influence on the output</font>.
* In other words, it pushes the model into regions where the model is relatively insensitive to small variations in the weights, finding points that are not merely minima, but minima surrounded by flat regions.
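
To make the sensitivity argument concrete, the following sketch uses a hypothetical one-hidden-unit model $\hat{y} = v \tanh(w_1 x_1 + w_2 x_2)$ (not a model from the text) and compares two quantities: a Monte Carlo estimate of the expected squared change in the output under weight noise with covariance $\eta I$, and the first-order prediction $\eta \lVert \nabla_W \hat{y}(x) \rVert^2$. Higher-order terms are ignored, so agreement is expected only for small $\eta$.

```python
import math
import random

def predict(params, x):
    """Hypothetical model: y_hat = v * tanh(w1*x1 + w2*x2)."""
    v, w1, w2 = params
    return v * math.tanh(w1 * x[0] + w2 * x[1])

def grad_sq_norm(params, x):
    """Squared norm of the gradient of y_hat with respect to (v, w1, w2)."""
    v, w1, w2 = params
    t = math.tanh(w1 * x[0] + w2 * x[1])
    dv = t
    dw1 = v * (1.0 - t * t) * x[0]
    dw2 = v * (1.0 - t * t) * x[1]
    return dv * dv + dw1 * dw1 + dw2 * dw2

def output_sensitivity(params, x, eta, n_samples=50_000, seed=0):
    """Monte Carlo estimate of E[(y_hat(W + e) - y_hat(W))^2], e ~ N(0, eta*I)."""
    rng = random.Random(seed)
    std = math.sqrt(eta)
    base = predict(params, x)
    total = 0.0
    for _ in range(n_samples):
        noisy = [p + rng.gauss(0.0, std) for p in params]
        total += (predict(noisy, x) - base) ** 2
    return total / n_samples

# Hypothetical parameter settings and input, chosen only for illustration.
x = [1.0, 2.0]
eta = 1e-4
steep = [1.5, 0.4, -0.3]   # tanh operates in its steep region
flat = [1.5, 4.0, 3.0]     # tanh is saturated, so the output barely moves

measured = output_sensitivity(steep, x, eta)
predicted = eta * grad_sq_norm(steep, x)
print(f"measured: {measured:.3e}  first-order prediction: {predicted:.3e}")
print(f"penalty (steep): {eta * grad_sq_norm(steep, x):.3e}")
print(f"penalty (flat):  {eta * grad_sq_norm(flat, x):.3e}")
```

In this toy setting the saturated parameters incur a smaller penalty, which matches the idea that the regularizer favors regions where the output is insensitive to small weight perturbations.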

<font color="blue">In the simplified case of linear regression</font>

<img src="figures/cap7.5.6.png" width=200 />

this <font color="red">regularization term collapses into</font>

<img src="figures/cap7.5.7.png" width=200 />

which is not a function of the parameters and therefore does not contribute to the gradient of $\tilde{J}_W$ with respect to the model parameters.
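
The intermediate step can be written out as follows. This is a sketch that assumes the linear model $\hat{y}(x) = w^\top x + b$ and takes the gradient with respect to the weight vector $w$ only:

$$
\nabla_w \hat{y}(x) = x
\quad\Longrightarrow\quad
\eta\,\mathbb{E}_{p(x)}\!\left[\lVert \nabla_w \hat{y}(x) \rVert^2\right]
= \eta\,\mathbb{E}_{p(x)}\!\left[\lVert x \rVert^2\right]
$$

The right-hand side depends only on the data distribution, so its derivative with respect to $w$ is zero.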

## 7.5.1 Injecting Noise at the Output Targets

* Most datasets have some amount of mistakes in the $y$ labels.
* It can be harmful to maximize $\log p(y \mid x)$ when $y$ is a mistake.
* One way to prevent this is to <font color="red">explicitly model the noise on the labels</font>.
    - For example, we can assume that for some small constant $\epsilon$,
        - the training set label $y$ is correct with probability $1-\epsilon$,
        - and otherwise any of the other possible labels might be correct. 
        - <font color="red">This assumption is easy to incorporate into the cost function analytically, rather than by explicitly drawing noise samples</font>.
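
One way to write the resulting cost for a single example is sketched below. It assumes $k$ classes and, as an additional assumption not stated above, that a wrong label is equally likely to be any of the other $k-1$ classes. Taking the expectation of the negative log-likelihood over the possible true labels $y'$ gives:

$$
\tilde{J}(x, y) = -(1-\epsilon)\,\log p(y \mid x) \;-\; \frac{\epsilon}{k-1} \sum_{y' \neq y} \log p(y' \mid x)
$$

No noise samples are drawn; the label noise enters only through the fixed weights $1-\epsilon$ and $\epsilon/(k-1)$.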

#### label smoothing  
* For example, label smoothing regularizes a model based on a softmax with $k$ output values 
    - by replacing the hard 0 and 1 classification targets 
        - with targets of $\epsilon/(k-1)$ and $1-\epsilon$, respectively. 
* The standard cross-entropy loss may then be used with these soft targets.
* Maximum likelihood learning with a softmax classifier and hard targets <font color="red">may actually never converge</font>:
    - the softmax can never predict a probability of exactly 0 or exactly 1, 
    - so it will continue to learn larger and larger weights,
    - making more extreme predictions forever.
    - <font color="blue">It is possible to prevent this scenario using other regularization strategies like weight decay</font>.
* <font color="red">Label smoothing has the advantage of preventing the pursuit of hard probabilities without discouraging correct classification</font>.
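
The points above can be illustrated with a minimal sketch in plain Python. The function names and the values of `k`, `eps`, and `label` are hypothetical. The sketch builds the smoothed targets, evaluates the standard cross-entropy, and uses the fact that the gradient of cross-entropy with respect to the logits is `softmax(logits) - targets`. With soft targets this gradient vanishes at a finite logit gap, whereas with hard targets it does not.

```python
import math

def smooth_targets(label, k, eps):
    """Soft targets: 1 - eps for the labeled class, eps / (k - 1) elsewhere."""
    off = eps / (k - 1)
    return [1.0 - eps if i == label else off for i in range(k)]

def log_softmax(logits):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]

def cross_entropy(targets, logits):
    """Standard cross-entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(targets, log_softmax(logits)))

def logit_gradient(targets, logits):
    """Gradient of the cross-entropy with respect to the logits."""
    return [math.exp(lp) - t for lp, t in zip(log_softmax(logits), targets)]

# Hypothetical setting, chosen only for illustration.
k, eps, label = 4, 0.1, 2
soft = smooth_targets(label, k, eps)
hard = smooth_targets(label, k, 0.0)   # eps = 0 recovers one-hot targets

# Logits whose softmax equals the soft targets exactly: a finite optimum.
gap = math.log((1.0 - eps) * (k - 1) / eps)
best = [gap if i == label else 0.0 for i in range(k)]
extreme = [50.0 if i == label else 0.0 for i in range(k)]

print("gradient at finite optimum (soft):", logit_gradient(soft, best))
print("gradient at same logits (hard):   ", logit_gradient(hard, best))
print("soft loss, best vs extreme:", cross_entropy(soft, best), cross_entropy(soft, extreme))
print("hard loss, best vs extreme:", cross_entropy(hard, best), cross_entropy(hard, extreme))
```

Pushing the logits to an extreme value keeps lowering the hard-target loss but raises the soft-target loss, which is the behavior described in the bullets above.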

# References
* [1] Deep Learning (Ian Goodfellow, Yoshua Bengio, and Aaron Courville) - http://www.deeplearningbook.org/