In [1]:
import tensorflow as tf
import numpy as np

**`tf.clip_by_average_norm`**
**`tf.clip_by_global_norm`**
**`tf.clip_by_norm`**
**`tf.clip_by_value(t, clip_value_min, clip_value_max, name=None)`**
+ Clips tensor values to a specified min and max.

In [8]:
v = tf.constant([[1.0, 2.0, 4.0],[4.0, 5.0, 6.0]])
result = tf.clip_by_value(v, 2.5, 4.5)

with tf.Session() as sess:
    print(sess.run(v))
    print(sess.run(result))

[[1. 2. 4.]
 [4. 5. 6.]]
[[2.5 2.5 4. ]
 [4.  4.5 4.5]]
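
The norm-based functions listed above rescale a tensor instead of clamping each element. The following is a minimal plain-Python sketch of the arithmetic behind `tf.clip_by_norm` and `tf.clip_by_global_norm`, assuming tensors are represented as flat lists of floats. The helper names mirror the TensorFlow functions but are illustrative re-implementations, not the library code.

```python
import math


def l2_norm(values):
    """L2 norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in values))


def clip_by_norm(values, clip_norm):
    """Rescale `values` so that its L2 norm is at most `clip_norm`."""
    norm = l2_norm(values)
    if norm <= clip_norm:
        return list(values)  # already within the limit, left unchanged
    return [x * clip_norm / norm for x in values]


def clip_by_global_norm(tensors, clip_norm):
    """Rescale several tensors by one shared factor.

    The factor is based on the norm of all tensors taken together.
    Returns (clipped_tensors, global_norm).
    """
    global_norm = math.sqrt(sum(l2_norm(t) ** 2 for t in tensors))
    scale = clip_norm / max(global_norm, clip_norm)
    return [[x * scale for x in t] for t in tensors], global_norm


print(clip_by_norm([3.0, 4.0], 1.0))
clipped, global_norm = clip_by_global_norm([[3.0], [4.0]], 1.0)
print(clipped, global_norm)
```

Because `clip_by_global_norm` uses a single scale factor for every tensor, the relative sizes of the gradients are preserved, which is why it is a common choice for gradient clipping.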


**`tf.train`**

+ **`tf.train.Optimizer()`**

+ Base class for optimizers. This class defines the API to add Ops to train a model.
+ You never use this class directly, but instead instantiate one of its subclasses such as `GradientDescentOptimizer`, `AdagradOptimizer`, or `MomentumOptimizer`.

+ `Usage`

```python
# Create an optimizer with the desired parameters.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# Add Ops to the graph to minimize a cost by updating a list of variables.
# "cost" is a Tensor, and the list of variables contains tf.Variable
# objects.
opt_op = opt.minimize(cost, var_list=<list of variables>)
```

In [None]:
tf.train.Optimizer
https://www.tensorflow.org/api_docs/python/tf/train/Optimizer

### Processing gradients before applying them.

Calling `minimize()` takes care of both computing the gradients and
applying them to the variables.  If you want to process the gradients
before applying them you can instead use the optimizer in three steps:

1.  Compute the gradients with `compute_gradients()`.
2.  Process the gradients as you wish.
3.  Apply the processed gradients with `apply_gradients()`.

Example:

```python
# Create an optimizer.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
```
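
The same compute / process / apply pattern can be traced end to end in a framework-free analogy. This is only a sketch: the functions below are hypothetical plain-Python stand-ins rather than TensorFlow API, the loss is assumed to be `sum((w - 3)^2)`, and `cap` plays the role of `MyCapper` by clamping each gradient to `[-1, 1]`.

```python
def compute_gradients(variables):
    """Step 1: return (gradient, name) pairs for loss = sum((w - 3)^2)."""
    return [(2.0 * (value - 3.0), name) for name, value in variables.items()]


def cap(gradient, limit=1.0):
    """Step 2: clamp one gradient into [-limit, limit]; pass None through."""
    if gradient is None:
        return None
    return max(-limit, min(limit, gradient))


def apply_gradients(variables, grads_and_vars, learning_rate=0.1):
    """Step 3: gradient-descent update, skipping variables with no gradient."""
    for gradient, name in grads_and_vars:
        if gradient is not None:
            variables[name] -= learning_rate * gradient


variables = {"w1": 0.0, "w2": 3.2}
grads_and_vars = compute_gradients(variables)
capped_grads_and_vars = [(cap(g), name) for g, name in grads_and_vars]
apply_gradients(variables, capped_grads_and_vars)
print(variables)
```

Here the large gradient on `w1` is capped before the update while the small gradient on `w2` passes through unchanged. The `None` check reflects the fact that a gradient entry can be `None` when a variable does not influence the loss.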


+ `Method`

```python
apply_gradients(
    grads_and_vars,
    global_step=None,
    name=None
)


compute_gradients(
    loss,
    var_list=None,
    gate_gradients=GATE_OP,
    aggregation_method=None,
    colocate_gradients_with_ops=False,
    grad_loss=None
)

minimize(
    loss,
    global_step=None,
    var_list=None,
    gate_gradients=GATE_OP,
    aggregation_method=None,
    colocate_gradients_with_ops=False,
    name=None,
    grad_loss=None
)

```

```python
apply_gradients(
    grads_and_vars,
    global_step=None,
    name=None
)
```

Apply gradients to variables.

This is the second part of `minimize()`. It returns an `Operation` that applies gradients.


+ Args:
    - `grads_and_vars`: List of (gradient, variable) pairs as returned by `compute_gradients()`.
    - `global_step`: Optional `Variable` to increment by one after the variables have been updated.
    - `name`: Optional name for the returned operation. Defaults to the name passed to the `Optimizer` constructor.

+ Returns:
    - An `Operation` that applies the specified gradients. If `global_step` was not None, that operation also increments `global_step`.

```python
compute_gradients(
    loss,
    var_list=None,
    gate_gradients=GATE_OP,
    aggregation_method=None,
    colocate_gradients_with_ops=False,
    grad_loss=None
)
```

Compute gradients of `loss` for the variables in `var_list`.

This is the first part of `minimize()`. It returns a list of (gradient, variable) pairs where "gradient" is the gradient for "variable". Note that "gradient" can be a `Tensor`, an `IndexedSlices`, or `None` if there is no gradient for the given variable.

+ Args:
    - `loss`: A `Tensor` containing the value to minimize, or a callable taking no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.
    - `var_list`: Optional list or tuple of `tf.Variable` to update to minimize `loss`. Defaults to the list of variables collected in the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.
    - `gate_gradients`: How to gate the computation of gradients. Can be `GATE_NONE`, `GATE_OP`, or `GATE_GRAPH`.
    - `aggregation_method`: Specifies the method used to combine gradient terms. Valid values are defined in the class `AggregationMethod`.
    - `colocate_gradients_with_ops`: If True, try colocating gradients with the corresponding op.
    - `grad_loss`: Optional. A `Tensor` holding the gradient computed for `loss`.

+ Returns:
    - A list of (gradient, variable) pairs. The variable is always present, but the gradient can be `None`.

```python
minimize(
    loss,
    global_step=None,
    var_list=None,
    gate_gradients=GATE_OP,
    aggregation_method=None,
    colocate_gradients_with_ops=False,
    name=None,
    grad_loss=None
)

```

Add operations to minimize `loss` by updating `var_list`.

This method simply combines calls to `compute_gradients()` and `apply_gradients()`. If you want to process the gradients before applying them, call `compute_gradients()` and `apply_gradients()` explicitly instead of using this function.

+ Args:
    - `loss`: A `Tensor` containing the value to minimize.
    - `global_step`: Optional `Variable` to increment by one after the variables have been updated.
    - `var_list`: Optional list or tuple of `Variable` objects to update to minimize `loss`. Defaults to the list of variables collected in the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.
    - `gate_gradients`: How to gate the computation of gradients. Can be `GATE_NONE`, `GATE_OP`, or `GATE_GRAPH`.
    - `aggregation_method`: Specifies the method used to combine gradient terms. Valid values are defined in the class `AggregationMethod`.
    - `colocate_gradients_with_ops`: If True, try colocating gradients with the corresponding op.
    - `name`: Optional name for the returned operation.
    - `grad_loss`: Optional. A `Tensor` holding the gradient computed for `loss`.

+ Returns:
    - An `Operation` that updates the variables in `var_list`. If `global_step` was not None, that operation also increments `global_step`.

**`tf.train.AdamOptimizer`**
**`tf.train.ExponentialMovingAverage`**
**`tf.train.GradientDescentOptimizer`**
**`tf.train.MomentumOptimizer`**
**`tf.train.Optimizer`**
**`tf.train.RMSPropOptimizer`**
**`tf.train.XXX`**

In [None]:
tf.train.AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False, name='Adam')

Construct a new Adam optimizer.

Initialization:

$$m_0 := 0 \quad \text{(Initialize initial 1st moment vector)}$$
$$v_0 := 0 \quad \text{(Initialize initial 2nd moment vector)}$$
$$t := 0 \quad \text{(Initialize timestep)}$$

The update rule for `variable` with gradient `g` uses an optimization
described at the end of Section 2 of the paper:

$$t := t + 1$$
$$lr_t := \text{learning\_rate} * \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$$

$$m_t := \beta_1 * m_{t-1} + (1 - \beta_1) * g$$
$$v_t := \beta_2 * v_{t-1} + (1 - \beta_2) * g * g$$
$$\text{variable} := \text{variable} - lr_t * m_t / (\sqrt{v_t} + \epsilon)$$
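
The update rule above can be sketched in plain Python for a single scalar variable. This is a minimal illustration under stated assumptions, not the TensorFlow implementation: `adam_step` is a hypothetical helper, the defaults copy the constructor signature shown earlier, and the objective `f(x) = x^2` (gradient `2x`) is chosen only for the demo.

```python
import math


def adam_step(variable, g, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """Apply one Adam update to a scalar and return the new state."""
    t += 1
    # Step size with both bias corrections folded in.
    lr_t = learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    m = beta1 * m + (1 - beta1) * g        # 1st moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # 2nd moment estimate
    variable -= lr_t * m / (math.sqrt(v) + epsilon)
    return variable, m, v, t


# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 1.
x, m, v, t = 1.0, 0.0, 0.0, 0
for _ in range(5):
    x, m, v, t = adam_step(x, 2.0 * x, m, v, t, learning_rate=0.1)
print(x)
```

On the first step the two moment estimates cancel against the bias correction, so the update has magnitude close to `learning_rate` whenever the gradient is much larger than `epsilon`.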

The default value of 1e-8 for epsilon might not be a good default in
general. For example, when training an Inception network on ImageNet a
current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the
formulation just before Section 2.1 of the Kingma and Ba paper rather than
the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon
hat" in the paper.
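
The relation between the two epsilons can be worked out directly. This is a sketch in the paper's notation, where $\alpha$ is the step size (`learning_rate` here) and $\theta$ is the variable being updated; neither symbol appears in the text above. Algorithm 1 uses bias-corrected moments:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Multiplying numerator and denominator of its update by $\sqrt{1 - \beta_2^t}$ gives:

$$\theta_t = \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} = \theta_{t-1} - \alpha \, \frac{\sqrt{1 - \beta_2^t}}{1 - \beta_1^t} \cdot \frac{m_t}{\sqrt{v_t} + \epsilon \sqrt{1 - \beta_2^t}}$$

Matching this against the update rule used by `AdamOptimizer` gives $lr_t = \alpha \sqrt{1 - \beta_2^t} / (1 - \beta_1^t)$ and $\hat{\epsilon} = \epsilon \sqrt{1 - \beta_2^t}$. Since $\sqrt{1 - \beta_2^t}$ approaches 1 as $t$ grows, the two formulations differ mainly during the early steps.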

The sparse implementation of this algorithm (used when the gradient is an
IndexedSlices object, typically because of `tf.gather` or an embedding
lookup in the forward pass) does apply momentum to variable slices even if
they were not used in the forward pass (meaning they have a gradient equal
to zero). Momentum decay (beta1) is also applied to the entire momentum
accumulator. This means that the sparse behavior is equivalent to the dense
behavior (in contrast to some momentum implementations which ignore momentum
unless a variable slice was actually used).

+ Args:
    - `learning_rate`: A Tensor or a floating point value. The learning rate.
    - `beta1`: A float value or a constant float tensor. The exponential decay rate for the 1st moment estimates.
    - `beta2`: A float value or a constant float tensor. The exponential decay rate for the 2nd moment estimates.
    - `epsilon`: A small constant for numerical stability. This epsilon is "epsilon hat" in the Kingma and Ba paper (in the formula just before Section 2.1), not the epsilon in Algorithm 1 of the paper.
    - `use_locking`: If True use locks for update operations.
    - `name`: Optional name for the operations created when applying gradients. Defaults to "Adam".

Eager compatibility: when eager execution is enabled, `learning_rate`,
`beta1`, `beta2`, and `epsilon` can each be a callable that takes no
arguments and returns the actual value to use. This can be useful for
changing these values across different invocations of optimizer functions.

In [None]:
Compared with the SGD algorithm, Adam:
1. Is less likely to get stuck in a local optimum
2. Is faster

