# Chapter 8. Optimization for Training Deep Models

* Deep Learning Seminar: Theory [1]
* 김무성

# Contents
* 8.1 How Learning Differs from Pure Optimization
    - 8.1.1 Empirical Risk Minimization
    - 8.1.2 Surrogate Loss Functions and Early Stopping
    - 8.1.3 Batch and Minibatch Algorithms
* 8.2 Challenges in Neural Network Optimization
    - 8.2.1 Ill-Conditioning
    - 8.2.2 Local Minima
    - 8.2.3 Plateaus, Saddle Points and Other Flat Regions
    - 8.2.4 Cliffs and Exploding Gradients
    - 8.2.5 Long-Term Dependencies
    - 8.2.6 Inexact Gradients
    - 8.2.7 Poor Correspondence between Local and Global Structure
    - 8.2.8 Theoretical Limits of Optimization
* 8.3 Basic Algorithms
    - 8.3.1 Stochastic Gradient Descent
    - 8.3.2 Momentum
    - 8.3.3 Nesterov Momentum
* 8.4 Parameter Initialization Strategies
* 8.5 Algorithms with Adaptive Learning Rates
    - 8.5.1 AdaGrad
    - 8.5.2 RMSProp
    - 8.5.3 Adam
    - 8.5.4 Choosing the Right Optimization Algorithm
* 8.6 Approximate Second-Order Methods
    - 8.6.1 Newton’s Method
    - 8.6.2 Conjugate Gradients
    - 8.6.3 BFGS
* 8.7 Optimization Strategies and Meta-Algorithms
    - 8.7.1 Batch Normalization
    - 8.7.2 Coordinate Descent
    - 8.7.3 Polyak Averaging
    - 8.7.4 Supervised Pretraining
    - 8.7.5 Designing Models to Aid Optimization
    - 8.7.6 Continuation Methods and Curriculum Learning

#### References
* [2] An overview of gradient descent optimization algorithms - http://sebastianruder.com/optimizing-gradient-descent/index.html
* [3] Stochastic gradient methods for machine learning - http://research.microsoft.com/en-us/um/cambridge/events/mls2013/downloads/stochastic_gradient.pdf
* [4] (OXFORD, Machine Learning: 2014-2015) Deep Learning Lecture 6: Optimization - https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/lecture5.pdf
* [5] 다크프로그래머: Gradient, Jacobian matrix, Hessian matrix, Laplacian - http://darkpgmr.tistory.com/132
* [6] 조금은 느리게 살자: Taylor series - http://ghebook.blogspot.kr/2010/07/taylor-series.html
* [7] 다크프로그래머: Understanding and using Taylor series - http://darkpgmr.tistory.com/59

This chapter presents optimization techniques for neural network training.

<img src="https://qph.is.quoracdn.net/main-qimg-46a77c77c721ba34283308232a1788c8?convert_to_webp=true" width=600 />

<img src="https://msampler.files.wordpress.com/2009/07/cvx-fun.gif" width=600 />

<img src="https://camo.githubusercontent.com/30bf2d42d3a9b0e07dbc03a014f4e36dbc06904f/68747470733a2f2f7261772e6769746875622e636f6d2f7175696e6e6c69752f4d616368696e654c6561726e696e672f6d61737465722f696d61676573466f724578706c616e6174696f6e2f4772616469656e7444657363656e74576974684d75746c69706c654c6f63616c4d696e696d756d2e6a7067" width=600 />

<img src="http://image.slidesharecdn.com/aihandson20140710bslideshare-140716182318-phpapp02/95/jsais-ai-tool-introduction-deep-learning-pylearn2-and-torch7-52-638.jpg?cb=1435207500" width=600 />

<font color="red">This chapter focuses on one particular case of optimization</font>: 
* finding the parameters $θ$ of a neural network that significantly reduce a cost function $J(θ)$, which typically includes a performance measure evaluated on the entire training set as well as additional regularization terms.
    <img src="http://cs231n.github.io/assets/dataflow.jpeg" />
* We begin with a description of how optimization used as a training algorithm for a machine learning task differs from pure optimization. 
* Next, we present several of the concrete challenges that make optimization of neural networks difficult. 
* We then define several practical algorithms, including both
    - optimization algorithms themselves and 
    - strategies for initializing the parameters. 
* More advanced algorithms either
    - adapt their learning rates during training or 
    - leverage information contained in the second derivatives of the cost function.
* Finally, we conclude with a review of 
    - several optimization strategies that are formed by 
        - combining simple optimization algorithms 
            - into higher-level procedures.

# 8.1 How Learning Differs from Pure Optimization
* 8.1.1 Empirical Risk Minimization
* 8.1.2 Surrogate Loss Functions and Early Stopping
* 8.1.3 Batch and Minibatch Algorithms

Optimization algorithms used for training deep models differ from traditional optimization algorithms in several ways.
* Machine learning usually acts indirectly.
    - In most machine learning scenarios, 
        - we care about some performance measure $P$, 
            - that is defined with respect to the test set and 
            - may also be intractable. 
    - We therefore optimize $P$ only indirectly. 
* We reduce a different cost function $J(θ)$ in the hope that 
    - doing so will improve $P$. 
* This is in contrast to pure optimization, where minimizing 
    - $J$ is a goal in and of itself. 
* Optimization algorithms for training deep models 
    - also typically include 
        - some specialization 
            - on the specific structure 
                - of machine learning objective functions.

Typically, the cost function can be written as an average over the training set, such as

<img src="figures/cap8.1.png" width=600 />

where $L$ is the per-example loss function, $f(x;θ)$ is the predicted output when the input is $x$, and $\hat{p}_{\text{data}}$ is the empirical distribution.

#### data generating distribution

* Eq. 8.1 defines an objective function with respect to the training set.
* We would usually prefer to minimize 
    - the corresponding objective function 
        - where the expectation is taken across 
            - the data generating distribution $p_{\text{data}}$ 
                - rather than just over the finite training set:

<img src="figures/cap8.2.png" width=600 />
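The gap between the two objectives above can be illustrated with a small sketch. Everything in it is an assumption made for illustration, not taken from the book: the toy data generating distribution (`y = 2x` plus Gaussian noise), the one-parameter model `f(x; θ) = θx`, the squared per-example loss, and the helper names `sample_data` and `objective`. Because the toy distribution is known, the expectation under $p_{\text{data}}$ has a closed form that the averages can be compared against.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(n):
    """Draw n (x, y) pairs from an assumed toy distribution: y = 2x + noise."""
    x = rng.normal(0.0, 1.0, size=n)
    y = 2.0 * x + rng.normal(0.0, 0.5, size=n)
    return x, y

def objective(theta, x, y):
    """Average per-example squared loss with f(x; theta) = theta * x."""
    return np.mean((theta * x - y) ** 2)

theta = 1.5
x_train, y_train = sample_data(20)       # finite training set
x_big, y_big = sample_data(200_000)      # large fresh sample standing in for p_data

j_train = objective(theta, x_train, y_train)  # average over the training set
j_data = objective(theta, x_big, y_big)       # Monte Carlo estimate of the expectation
j_exact = (theta - 2.0) ** 2 + 0.5 ** 2       # closed form, valid only for this toy setup

print(f"training-set objective : {j_train:.3f}")
print(f"large-sample estimate  : {j_data:.3f}")
print(f"closed-form expectation: {j_exact:.3f}")
```

With only 20 training examples the training-set average can sit noticeably away from the expectation, which is the quantity we would actually prefer to minimize.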

## 8.1.1 Empirical Risk Minimization

#### References
* [8] Empirical Risk Minimization - http://demo.clab.cs.cmu.edu/fa2015-11763/slides/erm.pdf
* [9] The Learning Problem and Regularization - http://www.mit.edu/~9.520/spring11/slides/class02.pdf
* [10] Risk Minimization - http://hellbell.tistory.com/entry/Risk-Minimization

<img src="http://www.svms.org/srm/Rychetsky2001_2-4.png" />

The simplest way to convert a machine learning problem back into an optimization problem is to minimize the expected loss on the training set. 
* This means replacing the true distribution $p(x, y)$ with the empirical distribution $\hat{p}(x, y)$ defined by the training set. 
* We now minimize the empirical risk

<img src="figures/cap8.3.png" width=600 />

where $m$ is the number of training examples.

The training process based on minimizing this average training error is known as <font color="red">empirical risk minimization</font>.
* In this setting, machine learning is still very similar to straightforward optimization. 
* Rather than optimizing the risk directly, 
    - we optimize the empirical risk, 
    - and hope that 
        - the risk decreases significantly as well.
* A variety of theoretical results establish conditions under which the true risk can be expected to decrease by various amounts.
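As a minimal sketch of empirical risk minimization, the snippet below runs plain gradient descent on the empirical risk of a toy problem and then checks whether the true risk went down too. The data distribution, the model `f(x; θ) = θx`, the squared loss, the learning rate, and the function names are all illustrative assumptions. The true risk is only computable here because the toy distribution is known, which is not the case in practice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy problem: y = 2x + Gaussian noise, model f(x; theta) = theta * x.
m = 50
x = rng.normal(0.0, 1.0, size=m)
y = 2.0 * x + rng.normal(0.0, 0.5, size=m)

def empirical_risk(theta):
    """Mean squared loss over the m training examples."""
    return np.mean((theta * x - y) ** 2)

def empirical_risk_grad(theta):
    """Derivative of the empirical risk with respect to theta."""
    return np.mean(2.0 * (theta * x - y) * x)

def true_risk(theta):
    """Expected squared loss under the toy distribution (closed form)."""
    return (theta - 2.0) ** 2 + 0.5 ** 2

theta = 0.0
lr = 0.1  # hypothetical learning rate chosen for this toy problem
risk_before = (empirical_risk(theta), true_risk(theta))

for _ in range(100):
    theta -= lr * empirical_risk_grad(theta)  # descend on the empirical risk only

risk_after = (empirical_risk(theta), true_risk(theta))
print("empirical risk:", risk_before[0], "->", risk_after[0])
print("true risk     :", risk_before[1], "->", risk_after[1])
```

The optimizer never sees `true_risk`. It decreases here only because the training sample happens to be representative of the distribution it was drawn from.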

<font color="red">However, empirical risk minimization is prone to overfitting</font>. 
* Models with high capacity can simply memorize the training set. 
* In many cases, empirical risk minimization is not really feasible. 
    - The most effective modern optimization algorithms are based on gradient descent, but many useful loss functions, such as 0-1 loss, have no useful derivatives (the derivative is either zero or undefined everywhere). 
* These two problems mean that, <font color="red">in the context of deep learning, we rarely use empirical risk minimization</font>. 
* Instead, we must use a slightly different approach, in which the quantity that we actually optimize is even more different from the quantity that we truly want to optimize.
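The claim that 0-1 loss has no useful derivative can be checked numerically. The sketch below is illustrative only: the five data points, the one-parameter classifier `sign(w * x)`, and the helper names are assumptions. The logistic loss is swapped in here as one example of a differentiable loss for comparison; it is not something the text above prescribes.

```python
import numpy as np

# Hypothetical 1-D dataset with labels in {-1, +1}.
x = np.array([-2.0, -1.0, 0.5, 1.5, 3.0])
y = np.array([-1.0, -1.0, 1.0, -1.0, 1.0])

def zero_one_loss(w):
    """Fraction of examples where sign(w * x) disagrees with the label."""
    return np.mean(np.sign(w * x) != y)

def logistic_loss(w):
    """Mean of log(1 + exp(-y * w * x)), a differentiable loss for comparison."""
    return np.mean(np.log1p(np.exp(-y * w * x)))

def finite_difference(loss, w, eps=1e-4):
    """Central-difference estimate of d loss / d w."""
    return (loss(w + eps) - loss(w - eps)) / (2.0 * eps)

w = 0.7
g01 = finite_difference(zero_one_loss, w)
glog = finite_difference(logistic_loss, w)

print("0-1 loss slope at w      :", g01)
print("logistic loss slope at w :", glog)
```

The 0-1 loss is piecewise constant in `w`, so a small step in either direction leaves it unchanged and gradient descent gets no signal. The differentiable loss has a nonzero slope at the same point.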

## 8.1.2 Surrogate Loss Functions and Early Stopping

## 8.1.3 Batch and Minibatch Algorithms

<img src="figures/cap8.4.png" width=600 />

<img src="figures/cap8.5.png" width=600 />

<img src="figures/cap8.6.png" width=600 />

<img src="figures/cap8.7.png" width=600 />

<img src="figures/cap8.8.png" width=600 />

# 8.2 Challenges in Neural Network Optimization
* 8.2.1 Ill-Conditioning
* 8.2.2 Local Minima
* 8.2.3 Plateaus, Saddle Points and Other Flat Regions
* 8.2.4 Cliffs and Exploding Gradients
* 8.2.5 Long-Term Dependencies
* 8.2.6 Inexact Gradients
* 8.2.7 Poor Correspondence between Local and Global Structure
* 8.2.8 Theoretical Limits of Optimization

## 8.2.1 Ill-Conditioning

<img src="figures/cap8.9.png" width=600 />

<img src="figures/cap8.10.png" width=600 />

<img src="figures/cap8.11.png" width=600 />

## 8.2.2 Local Minima

## 8.2.3 Plateaus, Saddle Points and Other Flat Regions

<img src="figures/cap8.12.png" width=600 />

## 8.2.4 Cliffs and Exploding Gradients

<img src="figures/cap8.13.png" width=600 />

## 8.2.5 Long-Term Dependencies

<img src="figures/cap8.14.png" width=600 />

## 8.2.6 Inexact Gradients

## 8.2.7 Poor Correspondence between Local and Global Structure

<img src="figures/cap8.15.png" width=600 />

<img src="figures/cap8.16.png" width=600 />

## 8.2.8 Theoretical Limits of Optimization

# 8.3 Basic Algorithms
* 8.3.1 Stochastic Gradient Descent
* 8.3.2 Momentum
* 8.3.3 Nesterov Momentum

<img src="figures/cap8.17.png" width=600 />
<img src="figures/cap8.18.png" width=600 />
<img src="figures/cap8.19.png" width=600 />
<img src="figures/cap8.20.png" width=600 />
<img src="figures/cap8.21.png" width=600 />
<img src="figures/cap8.22.png" width=600 />
<img src="figures/cap8.23.png" width=600 />
<img src="figures/cap8.24.png" width=600 />
<img src="figures/cap8.25.png" width=600 />
<img src="figures/cap8.26.png" width=600 />
<img src="figures/cap8.27.png" width=600 />
<img src="figures/cap8.28.png" width=600 />
<img src="figures/cap8.29.png" width=600 />
<img src="figures/cap8.30.png" width=600 />
<img src="figures/cap8.31.png" width=600 />
<img src="figures/cap8.32.png" width=600 />
<img src="figures/cap8.33.png" width=600 />
<img src="figures/cap8.34.png" width=600 />
<img src="figures/cap8.35.png" width=600 />
<img src="figures/cap8.36.png" width=600 />
<img src="figures/cap8.37.png" width=600 />
<img src="figures/cap8.38.png" width=600 />
<img src="figures/cap8.39.png" width=600 />
<img src="figures/cap8.40.png" width=600 />
<img src="figures/cap8.41.png" width=600 />
<img src="figures/cap8.42.png" width=600 />
<img src="figures/cap8.43.png" width=600 />
<img src="figures/cap8.44.png" width=600 />
<img src="figures/cap8.45.png" width=600 />
<img src="figures/cap8.46.png" width=600 />
<img src="figures/cap8.47.png" width=600 />
<img src="figures/cap8.48.png" width=600 />
<img src="figures/cap8.49.png" width=600 />
<img src="figures/cap8.50.png" width=600 />
<img src="figures/cap8.51.png" width=600 />
<img src="figures/cap8.52.png" width=600 />
<img src="figures/cap8.53.png" width=600 />
<img src="figures/cap8.54.png" width=600 />
<img src="figures/cap8.55.png" width=600 />
<img src="figures/cap8.56.png" width=600 />
<img src="figures/cap8.57.png" width=600 />

## 8.3.1 Stochastic Gradient Descent

## 8.3.2 Momentum

## 8.3.3 Nesterov Momentum

# 8.4 Parameter Initialization Strategies

# 8.5 Algorithms with Adaptive Learning Rates
* 8.5.1 AdaGrad
* 8.5.2 RMSProp
* 8.5.3 Adam
* 8.5.4 Choosing the Right Optimization Algorithm

## 8.5.1 AdaGrad

## 8.5.2 RMSProp

## 8.5.3 Adam

## 8.5.4 Choosing the Right Optimization Algorithm

# 8.6 Approximate Second-Order Methods
* 8.6.1 Newton’s Method
* 8.6.2 Conjugate Gradients
* 8.6.3 BFGS

## 8.6.1 Newton’s Method


## 8.6.2 Conjugate Gradients


## 8.6.3 BFGS

# 8.7 Optimization Strategies and Meta-Algorithms
* 8.7.1 Batch Normalization
* 8.7.2 Coordinate Descent
* 8.7.3 Polyak Averaging
* 8.7.4 Supervised Pretraining
* 8.7.5 Designing Models to Aid Optimization
* 8.7.6 Continuation Methods and Curriculum Learning

## 8.7.1 Batch Normalization

## 8.7.2 Coordinate Descent

## 8.7.3 Polyak Averaging

## 8.7.4 Supervised Pretraining

## 8.7.5 Designing Models to Aid Optimization


## 8.7.6 Continuation Methods and Curriculum Learning

# References

* [1] DEEP LEARNING (Yoshua Bengio) : 8. Optimization for Training Deep Models - http://www.deeplearningbook.org/contents/optimization.html
* [2] An overview of gradient descent optimization algorithms - http://sebastianruder.com/optimizing-gradient-descent/index.html
* [3] Stochastic gradient methods for machine learning - http://research.microsoft.com/en-us/um/cambridge/events/mls2013/downloads/stochastic_gradient.pdf
* [4] (OXFORD, Machine Learning: 2014-2015) Deep Learning Lecture 6: Optimization - https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/lecture5.pdf
* [5] 다크프로그래머: Gradient, Jacobian matrix, Hessian matrix, Laplacian - http://darkpgmr.tistory.com/132
* [6] 조금은 느리게 살자: Taylor series - http://ghebook.blogspot.kr/2010/07/taylor-series.html
* [7] 다크프로그래머: Understanding and using Taylor series - http://darkpgmr.tistory.com/59
* [8] Empirical Risk Minimization - http://demo.clab.cs.cmu.edu/fa2015-11763/slides/erm.pdf
* [9] The Learning Problem and Regularization - http://www.mit.edu/~9.520/spring11/slides/class02.pdf
* [10] Risk Minimization - http://hellbell.tistory.com/entry/Risk-Minimization