#ifndef OPENANN_OPTIMIZATION_MBSGD_H_
#define OPENANN_OPTIMIZATION_MBSGD_H_
#include <OpenANN/optimization/Optimizer.h>
#include <OpenANN/optimization/StoppingCriteria.h>
#include <OpenANN/util/Random.h>
#include <Eigen/Core>
#include <vector>
#include <list>
namespace OpenANN
{
/**
* @class MBSGD
*
* Mini-batch stochastic gradient descent.
*
* This implementation of gradient descent has some modifications:
*
 * - it is stochastic: we update the weights with a randomly chosen subset of
 * the training set to escape local minima and thus increase the
 * generalization capabilities
* - we use a momentum to smooth the search direction
* - each weight has an adaptive learning rate
* - we can decrease the learning rate during optimization
* - we can increase the momentum during optimization
*
* A good introduction to optimization with MBSGD can be found in Geoff
* Hinton's Coursera course "Neural Networks for Machine Learning". A detailed
* description of this implementation follows.
*
 * When the batch size equals 1, the algorithm degenerates to stochastic
* gradient descent. When it equals the training set size, the algorithm is
* like batch gradient descent.
*
* Standard mini-batch stochastic gradient descent updates the weight vector w
* in step t through
*
* \f$ w^t = w^{t-1} - \frac{\alpha}{|B_t|} \sum_{n \in B_t} \nabla E_n(w) \f$
*
* or
*
* \f$ \Delta w^t = - \frac{\alpha}{|B_t|} \sum_{n \in B_t} \nabla E_n(w),
* \quad w^t = w^{t-1} + \Delta w^t, \f$
*
* where \f$ \alpha \f$ is the learning rate and \f$ B_t \f$ is the set of
* indices of the t-th mini-batch, which is drawn randomly. The random order
* of gradients prevents us from getting stuck in local minima. This is an
* advantage over batch gradient descent. However, we must not make the batch
* size too small. A bigger batch size makes the optimization more robust
 * against noise in the training set. A reasonable batch size is between 10
 * and 100. The learning rate has to be within [0, 1). High learning rates can
 * result in divergence, i.e. the error increases. Learning rates that are too
 * low might make learning too slow, i.e. the number of epochs required to
 * find an optimum might be infeasible. A reasonable value for \f$ \alpha \f$
 * is usually within [1e-5, 0.1].
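 *
 * A minimal Eigen sketch of one such update (illustrative only; w is the
 * weight vector, batchIndices the indices of the mini-batch, and the
 * hypothetical helper miniBatchGradient(w, batchIndices) is assumed to
 * return \f$ \sum_{n \in B_t} \nabla E_n(w) \f$):
 * \code
 * // one plain mini-batch SGD step
 * Eigen::VectorXd g = miniBatchGradient(w, batchIndices);
 * w -= alpha / batchIndices.size() * g;
 * \endcode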
*
* A momentum can increase the optimization stability. In this case, the
* update rule is
*
* \f$ \Delta w^t = \eta \Delta w^{t-1} - \frac{\alpha}{|B_t|}
* \sum_{n \in B_t} \nabla E_n(w), \quad w^t = w^{t-1} +
* \Delta w^t, \f$
*
* where \f$ \eta \f$ is called momentum and must lie in [0, 1). The momentum
* term incorporates past gradients with exponentially decaying influence.
* This reduces changes of the search direction. An intuitive explanation of
* this update rule is: we regard w as the position of a ball that is rolling
* down a hill. The gradient represents its acceleration and the acceleration
* modifies its momentum.
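 *
 * With classical momentum the step above only changes in that the previous
 * update is blended in. A sketch, keeping the previous step in dw and using
 * the same assumed miniBatchGradient helper:
 * \code
 * // classical momentum: reuse part of the previous step
 * dw = eta * dw - alpha / batchIndices.size()
 *      * miniBatchGradient(w, batchIndices);
 * w += dw;
 * \endcode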
*
 * Another type of momentum that is available in this implementation is
 * Nesterov's accelerated gradient (NAG). For smooth convex functions, NAG
* achieves a convergence rate of \f$ O(\frac{1}{T^2}) \f$ instead of
* \f$ O(\frac{1}{T}) \f$ [1]. The update rule only differs in where we
* calculate the gradient.
*
* \f$ \Delta w^t = \eta \Delta w^{t-1} - \frac{\alpha}{|B_t|}
* \sum_{n \in B_t} \nabla E_n(w + \eta \Delta w^{t-1}),
* \quad w^t = w^{t-1} + \Delta w^t, \f$
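 *
 * In code the only difference is the point at which the gradient is
 * evaluated (illustrative sketch, same assumed helper and variables as
 * above):
 * \code
 * // NAG: look ahead by eta * dw before evaluating the mini-batch gradient
 * dw = eta * dw - alpha / batchIndices.size()
 *      * miniBatchGradient(w + eta * dw, batchIndices);
 * w += dw;
 * \endcode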
*
* Another trick is using different learning rates for each weight. For each
* weight \f$ w_{ji} \f$ we can introduce a gain \f$ g_{ji} \f$ which will be
* multiplied with the learning rate so that we obtain an update rule for each
* weight
*
* \f$ \Delta w_{ji}^t = \eta \Delta w_{ji}^{t-1} -
* \frac{\alpha g_{ji}^{t-1}}{|B_t|} \sum_{n \in B_t} \nabla E_n(w_{ji}),
* \quad w_{ji}^t = w_{ji}^{t-1} + \Delta w_{ji}^t, \f$
*
 * where \f$ g_{ji}^0 = 1 \f$. The gain \f$ g_{ji} \f$ will be increased by
 * 0.05 if \f$ \Delta w_{ji}^t \Delta w_{ji}^{t-1} \geq 0 \f$, i.e. the sign
 * of the search direction did not change, and will be multiplied by 0.95
 * otherwise. We set a minimum and a maximum value for each gain, usually
 * 0.1 and 10 or 0.001 and 100, respectively.
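 *
 * A sketch of this gain adaptation (illustrative only; dw and dwPrev denote
 * \f$ \Delta w^t \f$ and \f$ \Delta w^{t-1} \f$, and gains holds one gain
 * per parameter):
 * \code
 * // adapt each gain based on the sign of two consecutive updates
 * for(int i = 0; i < gains.size(); i++)
 * {
 *   if(dw(i) * dwPrev(i) >= 0.0)
 *     gains(i) += 0.05;  // direction kept: increase additively
 *   else
 *     gains(i) *= 0.95;  // direction flipped: shrink multiplicatively
 *   gains(i) = std::min(std::max(gains(i), minGain), maxGain);
 * }
 * \endcode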
*
 * During optimization it often makes sense to start with a more global
 * search, i.e. with a high learning rate, and to decrease the learning rate
 * as we approach the minimum. This yields the following update rule for the
 * learning rate:
*
 * \f$ \alpha^t = \max(\alpha_{decay} \alpha^{t-1}, \alpha_{min}). \f$
*
* In addition, we can allow the optimizer to change the search direction more
* often at the beginning of the optimization and reduce this possibility at
* the end. To do this, we can start with a low momentum and increase it over
* time until we reach a maximum:
*
 * \f$ \eta^t = \min(\eta^{t-1} + \eta_{inc}, \eta_{max}). \f$
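 *
 * Both schedules reduce to one line each after every weight update
 * (illustrative sketch; the variable names match the members below):
 * \code
 * alpha = std::max(alphaDecay * alpha, minAlpha);  // decay the learning rate
 * eta = std::min(eta + etaGain, maxEta);           // raise the momentum
 * \endcode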
*
* [1] Sutskever, Ilya; Martens, James; Dahl, George; Hinton, Geoffrey:
* On the importance of initialization and momentum in deep learning,
* International Conference on Machine Learning, 2013.
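 *
 * A minimal usage sketch (assuming a problem object that implements
 * OpenANN::Optimizable and that StoppingCriteria exposes a maximalIterations
 * member):
 * \code
 * OpenANN::MBSGD optimizer(0.01, 0.5, 16);
 * OpenANN::StoppingCriteria stop;
 * stop.maximalIterations = 100;       // assumed member; limits the epochs
 * optimizer.setOptimizable(problem);  // problem is an OpenANN::Optimizable&
 * optimizer.setStopCriteria(stop);
 * optimizer.optimize();
 * Eigen::VectorXd best = optimizer.result();
 * \endcode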
*/
class MBSGD : public Optimizer
{
//! Stopping criteria
StoppingCriteria stop;
//! Optimizable problem
Optimizable* opt; // do not delete
  //! Use Nesterov's accelerated gradient
bool nesterov;
//! Learning rate
double alpha;
//! Learning rate decay
double alphaDecay;
  //! Minimum learning rate
double minAlpha;
//! Momentum
double eta;
//! Momentum gain
double etaGain;
//! Maximum momentum
double maxEta;
  //! Size of a mini-batch; typical values range from 10 to a few hundred
int batchSize;
  //! Minimum gain of the learning rate per parameter, e.g. 0.01, 0.1 or 1
  double minGain;
  //! Maximum gain of the learning rate per parameter, e.g. 1, 10 or 100
  double maxGain;
  //! Use parameter adaptation (per-parameter gains)
  bool useGain;
  //! Current iteration counter
  int iteration;
  //! Random number generator used to shuffle the training instances
  RandomNumberGenerator rng;
  //! Number of parameters, training instances and mini-batches
  int P, N, batches;
  //! Work vectors: accumulated gradient, per-parameter gains, current
  //! parameters, accumulated update (momentum) and mini-batch gradient
  Eigen::VectorXd gradient, gains, parameters, momentum, currentGradient;
  //! Error accumulated over the current iteration
  double accumulatedError;
  //! Shuffled indices of the training instances
  std::vector<int> randomIndices;
public:
/**
* Create mini-batch stochastic gradient descent optimizer.
*
* @param learningRate learning rate (usually called alpha); range: (0, 1]
* @param momentum momentum coefficient (usually called eta); range: [0, 1)
* @param batchSize size of the mini-batches; range: [1, N], where N is the
* size of the training set
   * @param nesterov use Nesterov's accelerated gradient
* @param learningRateDecay will be multiplied with the learning rate after
* each weight update; range: (0, 1]
* @param minimalLearningRate minimum value for the learning rate; range:
* [0, 1]
* @param momentumGain will be added to the momentum after each weight
* update; range: [0, 1)
* @param maximalMomentum maximum value for the momentum; range [0, 1]
* @param minGain minimum factor for individual learning rates
* @param maxGain maximum factor for individual learning rates
*/
MBSGD(double learningRate = 0.01, double momentum = 0.5, int batchSize = 10,
bool nesterov = false, double learningRateDecay = 1.0,
double minimalLearningRate = 0.0, double momentumGain = 0.0,
double maximalMomentum = 1.0, double minGain = 1.0,
double maxGain = 1.0);
~MBSGD();
virtual void setOptimizable(Optimizable& opt);
virtual void setStopCriteria(const StoppingCriteria& stop);
virtual void optimize();
virtual bool step();
virtual Eigen::VectorXd result();
virtual std::string name();
private:
void initialize();
};
} // namespace OpenANN
#endif // OPENANN_OPTIMIZATION_MBSGD_H_