In [1]:
import tensorflow as tf
import numpy as np

**`tf.gradients`**(ys, xs, grad_ys=None, name='gradients', stop_gradients=None)

+ Args:
    + ys: A `Tensor` or list of tensors to be differentiated.
    + xs: A `Tensor` or list of tensors to be used for differentiation.
    + grad_ys: Optional. A `Tensor` or list of tensors the same size as `ys` and holding the gradients computed for each y in `ys`.
    + stop_gradients: Optional. A `Tensor` or list of tensors not to differentiate through.


+ Returns:
  + A list of `sum(dy/dx)` for each x in `xs`.


+ Constructs symbolic derivatives of sum of `ys` w.r.t. x in `xs`.
    + `ys` and `xs` are each a `Tensor` or a list of tensors.  `grad_ys` is a list of `Tensor`, holding the gradients received by the `ys`. The list must be the same length as `ys`.

    + `gradients()` adds ops to the graph to output the derivatives of `ys` with respect to `xs`.  It returns a list of `Tensor` of length `len(xs)` where each tensor is the `sum(dy/dx)` for y in `ys`.
    
    + `grad_ys` is a list of tensors of the same length as `ys` that holds the initial gradients for each y in `ys`.  When `grad_ys` is None, we fill in a tensor of '1's of the shape of y for each y in `ys`.  A user can provide their own initial `grad_ys` to compute the derivatives using a different initial gradient for each y.

    + `stop_gradients` is a `Tensor` or a list of tensors to be considered constant with respect to all `xs`. These tensors will not be backpropagated through, as though they had been explicitly disconnected using `stop_gradient`.  Among other things, this allows computation of partial derivatives as opposed to total derivatives.

In [2]:
with tf.variable_scope("", reuse=tf.AUTO_REUSE):
    w1 = tf.get_variable('w1', shape=[3])
    w2 = tf.get_variable('w2', shape=[3])
    w3 = tf.get_variable('w3', shape=[3])
    w4 = tf.get_variable('w4', shape=[3])

z1 = w1 + w2+ w3
z2 = w3 + w4

grads = tf.gradients([z1, z2], [w1, w2, w3, w4], 
                     grad_ys=[tf.convert_to_tensor([2.,2.,3.]), 
                              tf.convert_to_tensor([3.,2.,4.])])

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(sess.run(grads))

[array([2., 2., 3.], dtype=float32), array([2., 2., 3.], dtype=float32), array([5., 4., 7.], dtype=float32), array([3., 2., 4.], dtype=float32)]
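
The output above can be checked by hand. A minimal sketch in plain Python, assuming only the graph from the cell above (`z1 = w1 + w2 + w3`, `z2 = w3 + w4`): every local derivative is 1, so each variable's gradient is the element-wise sum of the `grad_ys` of the outputs it feeds. The `consumers` mapping and the `gradient_for` helper are illustrative names, not part of TensorFlow.

```python
# Initial gradients passed as grad_ys for z1 and z2.
grad_ys = {"z1": [2.0, 2.0, 3.0], "z2": [3.0, 2.0, 4.0]}

# Which outputs each variable feeds into (illustrative bookkeeping).
consumers = {"w1": ["z1"], "w2": ["z1"], "w3": ["z1", "z2"], "w4": ["z2"]}

def gradient_for(var):
    # All ops are additions, so dz/dw = 1 and the incoming gradients just add up.
    total = [0.0, 0.0, 0.0]
    for out in consumers[var]:
        total = [t + g for t, g in zip(total, grad_ys[out])]
    return total

manual_grads = [gradient_for(v) for v in ["w1", "w2", "w3", "w4"]]
print(manual_grads)
```

`w3` is the only variable used by both outputs, which is why its gradient is the sum of the two `grad_ys` vectors.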


In [3]:
a = tf.constant(0.)
b = 2 * a
g = tf.gradients(a + b, [a, b], stop_gradients=[a, b])
with tf.Session() as sess:
    print(sess.run(g))

[1.0, 1.0]


In [4]:
a = tf.constant(0.)
b = 2 * a
g = tf.gradients(a + b, [a, b])
with tf.Session() as sess:
    print(sess.run(g))

[3.0, 1.0]


In [5]:
a = tf.stop_gradient(tf.constant(0.))
b = tf.stop_gradient(2 * a)
g = tf.gradients(a + b, [a, b])
with tf.Session() as sess:
    print(sess.run(g))

[1.0, 1.0]
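
The three cells above contrast total and partial derivatives of `a + b` where `b = 2 * a`. A rough numerical check in plain Python, assuming a central finite difference is accurate enough here (`central_diff` is a helper defined only for this sketch):

```python
def central_diff(f, x, eps=1e-6):
    # Symmetric finite-difference approximation of df/dx.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

a0 = 0.0

# Total derivative: b is recomputed from a, so y = a + 2 * a.
total = central_diff(lambda a: a + 2 * a, a0)

# Partial derivative: b is held fixed at its current value,
# which is what stop_gradients=[a, b] expresses.
b_fixed = 2 * a0
partial = central_diff(lambda a: a + b_fixed, a0)

print(total, partial)
```

The total derivative is close to 3 and the partial derivative is close to 1, matching the `[3.0, 1.0]` and `[1.0, 1.0]` results for `a`.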


`Higher-order derivatives`

In [44]:
a = tf.constant(3.)
b = tf.pow(a, 2)

grad = tf.gradients(ys=b, xs=a) # first derivative
grad_2 = tf.gradients(ys=grad[0], xs=a) # second derivative
grad_3 = tf.gradients(ys=grad_2[0], xs=a) # third derivative

with tf.Session() as sess:
    print(sess.run(grad_3))
    print(sess.run(grad_2))
    print(sess.run(grad))

[0.0]
[2.0]
[6.0]
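
The same values can be obtained without TensorFlow by differentiating the polynomial directly. A small sketch, assuming a coefficient-list representation chosen only for this example (`differentiate` and `evaluate` are hypothetical helpers):

```python
def differentiate(coeffs):
    # coeffs[i] is the coefficient of x**i; returns the derivative's coefficients.
    return [i * c for i, c in enumerate(coeffs)][1:] or [0.0]

def evaluate(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

b = [0.0, 0.0, 1.0]        # b(a) = a**2
d1 = differentiate(b)      # 2a
d2 = differentiate(d1)     # 2
d3 = differentiate(d2)     # 0

derivs_at_3 = [evaluate(p, 3.0) for p in (d1, d2, d3)]
print(derivs_at_3)  # [6.0, 2.0, 0.0]
```

Each call to `differentiate` plays the role of one nested `tf.gradients` call in the cell above.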


**`tf.stop_gradient`**(input, name=None)

+ Args:
  + input: A `Tensor`.
  + name: A name for the operation (optional).
  
  
+ Stops gradient computation.

+ When executed in a graph, this op outputs its input tensor as-is.

In [38]:
w1 = tf.constant(2.0)
w2 = tf.constant(2.0)
a = tf.multiply(w1, 3.0)
b = tf.multiply(a, w2)
grad = tf.gradients(b, [w1, w2])

with tf.Session() as sess:
    print(sess.run(grad))

[6.0, 6.0]


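```python
w1 = tf.constant(2.0)
w2 = tf.constant(2.0)
a = tf.multiply(w1, 3.0)
a_stopped = tf.stop_gradient(a)
b = tf.multiply(a_stopped, w2)
grad = tf.gradients(b, [w1, w2])
print(grad)
# Output:
# [None, <tf.Tensor 'gradients_32/Mul_39_grad/Mul_1:0' shape=() dtype=float32>]
# Once a node is stopped, the gradient arriving at it is not backpropagated any further.
# w1 can only receive gradient through a, so its gradient is returned as None.

with tf.Session() as sess:
    # sess.run rejects None fetches, so evaluate only the gradients that exist.
    print(sess.run([g for g in grad if g is not None]))
```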

In [42]:
a = tf.Variable(1.0)
b = tf.Variable(1.0)
c = tf.add(a, b)
c_stopped = tf.stop_gradient(c)
d = tf.add(a, b)
e = tf.add(c_stopped, d)
gradients = tf.gradients(e, xs=[a, b])

# c is stopped, but a and b still receive gradient through d, so gradient values are still returned
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(sess.run(gradients)) # prints [1.0, 1.0]

[1.0, 1.0]


**`Activation Function`**

In [None]:
tf.nn.relu
tf.nn.relu6
tf.nn.relu_layer
tf.nn.sigmoid
tf.nn.tanh
tf.nn.softmax
tf.nn.softplus
tf.nn.softsign
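
For reference, most of the functions listed above have short scalar definitions. The sketch below writes them in plain Python using the standard textbook formulas; it is not TensorFlow's implementation, and `tf.nn.relu_layer` is omitted because it also involves weights and biases.

```python
import math

def relu(x):
    return max(0.0, x)

def relu6(x):
    # ReLU capped at 6.
    return min(max(0.0, x), 6.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    # Smooth approximation of ReLU.
    return math.log(1.0 + math.exp(x))

def softsign(x):
    return x / (1.0 + abs(x))

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-1.0), relu6(8.0), sigmoid(0.0), softsign(1.0))
```

`tanh` is available directly as `math.tanh`.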

**`Loss Function`**

+ `MSE`: mean squared error
```python
mse = tf.reduce_mean(tf.square(y_ - y))
```

+ `Cross Entropy`: measures the distance between two probability distributions
```python
cross_entropy = -tf.reduce_mean(
    y_label * tf.log(
        tf.clip_by_value(y_predict, 1e-10, 1.0)
    )
)
```
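
Both losses can be reproduced on small lists to see what the TensorFlow expressions compute. This is a plain-Python sketch with made-up inputs; the clipping bounds mirror the `1e-10` and `1.0` used above, and the function names are illustrative.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def clipped_cross_entropy(y_label, y_predict, low=1e-10, high=1.0):
    # Clip predictions into [low, high] so that log(0) is never evaluated.
    terms = [t * math.log(min(max(p, low), high)) for t, p in zip(y_label, y_predict)]
    # Mean over all entries, mirroring the reduce_mean in the snippet above.
    return -sum(terms) / len(terms)

mse_value = mse([1.0, 2.0], [1.0, 4.0])
ce_value = clipped_cross_entropy([0.0, 1.0], [1.0, 0.0])
print(mse_value, ce_value)
```

Without the clipping step, the second call would evaluate `math.log(0.0)`, which raises an error in Python and yields `-inf` in TensorFlow.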

+ `cross entropy` is usually used together with `softmax`, so TensorFlow wraps the two in a single function

```python
tf.nn.softmax_cross_entropy_with_logits_v2(labels=None, logits=None, dim=-1, name=None)
# Computes softmax cross entropy between `logits` and `labels`

# Measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class).  For example, each CIFAR-10 image is labeled with one and only one label: an image can be a dog or a truck, but not both.
```

In [31]:
from numpy.random import RandomState
rdm = RandomState(1)

logits  = tf.constant(rdm.rand(1,10))
y = np.array([0.0, 1.0, 0, 0, 0, 0, 0, 0, 0, 0])

cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits)
with tf.Session() as sess:
    loss = sess.run(cross_entropy)
    print("calc with softmax_cross_entropy_with_logits_v2\n\t", loss)

y_softmax = tf.nn.softmax(logits=logits)
cross_entropy_v1 = -tf.reduce_sum(y * tf.log(y_softmax))

with tf.Session() as sess:
    loss = sess.run(cross_entropy_v1)
    print("calc with softmax & cross_entropy\n\t", loss)

calc with softmax_cross_entropy_with_logits_v2
	 [1.91865955]
calc with softmax & cross_entropy
	 1.9186595478618151
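
The two results agree because the fused op computes the same quantity. One reason to prefer the fused form is numerical stability; the plain-Python sketch below illustrates the idea with a log-sum-exp formulation. It is an assumption about the general technique, not a description of TensorFlow's internals, and the logits are made-up values rather than the random ones above.

```python
import math

def softmax_cross_entropy(labels, logits):
    # log_softmax(z)_i = z_i - logsumexp(z), with the max subtracted for stability.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(y * (z - log_sum_exp) for y, z in zip(labels, logits))

def naive_cross_entropy(labels, logits):
    # Softmax first, then log; exp() overflows for large logits.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return -sum(y * math.log(e / total) for y, e in zip(labels, exps))

labels = [0.0, 1.0, 0.0]
logits = [0.2, 0.7, 0.1]
fused = softmax_cross_entropy(labels, logits)
naive = naive_cross_entropy(labels, logits)
print(fused, naive)

# The stable version still works for very large logits.
large = softmax_cross_entropy([0.0, 1.0], [0.0, 1000.0])
```

With a logit of `1000.0`, `naive_cross_entropy` would raise `OverflowError` inside `math.exp`, while the log-sum-exp version returns a finite loss.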


**`tf.train`**

+ **`tf.train.Optimizer()`**
    + Base class for optimizers. This class defines the API to add Ops to train a model.
    + You never use this class directly, but instead instantiate one of its subclasses such as `tf.train.GradientDescentOptimizer`, `tf.train.AdagradOptimizer`, or `tf.train.MomentumOptimizer`.

+ `Method`

```python
apply_gradients(
    grads_and_vars,
    global_step=None,
    name=None
)


compute_gradients(
    loss,
    var_list=None,
    gate_gradients=GATE_OP,
    aggregation_method=None,
    colocate_gradients_with_ops=False,
    grad_loss=None
)

minimize(
    loss,
    global_step=None,
    var_list=None,
    gate_gradients=GATE_OP,
    aggregation_method=None,
    colocate_gradients_with_ops=False,
    name=None,
    grad_loss=None
)

```

------

```python
compute_gradients(
    loss,
    var_list=None
)
```

Compute gradients of `loss` for the variables in `var_list`.

This is the first part of `minimize()`. It returns a list of `(gradient, variable)` pairs, where "gradient" is the gradient for "variable". Note that "gradient" can be a `Tensor`, an `IndexedSlices`, or `None` if there is no gradient for the given variable.


+ Args:
    + loss: A Tensor containing the value to minimize or a callable taking no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.
    + var_list: Optional list or tuple of `tf.Variable` to update to minimize `loss`.  Defaults to the list of variables collected in the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.


+ Returns:
    + A list of (gradient, variable) pairs. Variable is always present, but gradient can be `None`.

------

```python
apply_gradients(
    grads_and_vars,
    global_step=None,
    name=None
)
```

Apply gradients to variables.

This is the second part of `minimize()`. It returns an `Operation` that applies gradients.


+ Args:
    - `grads_and_vars`: List of (gradient, variable) pairs as returned by `compute_gradients()`.
    - `global_step`: Optional `Variable` to increment by one after the variables have been updated.
    - `name`: Optional name for the returned operation. Default to the name passed to the `Optimizer` constructor.

- Returns:
    - An `Operation` that applies the specified gradients. If `global_step` was not None, that operation also increments `global_step`.

------

```python
minimize(
    loss,
    global_step=None,
    var_list=None
)
```

Add operations to minimize `loss` by updating `var_list`.

This method simply combines calls to `compute_gradients()` and `apply_gradients()`. If you want to process the gradients before applying them, call `compute_gradients()` and `apply_gradients()` explicitly instead of using this function.

+ Args:
  + loss: A `Tensor` containing the value to minimize.
  + global_step: Optional `Variable` to increment by one after the variables have been updated.
  + var_list: Optional list or tuple of `Variable` objects to update to minimize `loss`.  Defaults to the list of variables collected in the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.
  

+ Returns:
  + An Operation that updates the variables in `var_list`.  If `global_step` was not `None`, that operation also increments `global_step`.
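
The relationship between the three methods can be sketched with a toy optimizer in plain Python. `ToySGD` is a hypothetical stand-in written for this note: it works on a dict of floats and takes a gradient function instead of a loss tensor, so it only mirrors the call structure, not the TensorFlow API.

```python
class ToySGD:
    """Illustrative stand-in for an optimizer; not the TensorFlow class."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def compute_gradients(self, grad_fn, variables):
        # Step 1: return a list of (gradient, variable_name) pairs.
        return [(grad_fn(name, variables), name) for name in variables]

    def apply_gradients(self, grads_and_vars, variables):
        # Step 2: gradient-descent update; pairs whose gradient is None are skipped.
        for grad, name in grads_and_vars:
            if grad is not None:
                variables[name] -= self.learning_rate * grad

    def minimize(self, grad_fn, variables):
        # minimize() is just step 1 followed by step 2.
        self.apply_gradients(self.compute_gradients(grad_fn, variables), variables)

# loss = (x - 3)^2 has gradient 2 * (x - 3) with respect to x.
variables = {"x": 0.0}
opt = ToySGD(learning_rate=0.25)
opt.minimize(lambda name, vs: 2.0 * (vs[name] - 3.0), variables)
print(variables["x"])  # 1.5
```

Splitting the update into two steps is what leaves room to modify the gradients in between.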


------

+ **`Usage - minimize`**

```python
# Create an optimizer with the desired parameters.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
# Add Ops to the graph to minimize a cost by updating a list of variables.
# "cost" is a Tensor, and the list of variables contains tf.Variable
# objects.
opt_op = opt.minimize(cost, var_list=<list of variables>)
```

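+ **`Usage - gradient clipping`**
    + Gradient clipping is mainly used to keep gradients from exploding during training. It only caps large gradients, so it does not by itself fix vanishing gradients.
```python
tf.train.Optimizer().compute_gradients()
tf.train.Optimizer().apply_gradients()
```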

```python
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)
train_op = opt.minimize(loss)
```
+ Calling `sess.run(train_op)` updates the trainable variables
+ Calling `minimize()` takes care of both computing the gradients and applying them to the variables.
+ If you want to process the gradients before applying them you can instead use the optimizer in three steps:
    1.  Compute the gradients with `compute_gradients()`.
    2.  Process the gradients as you wish.
    3.  Apply the processed gradients with `apply_gradients()`.

+ Example:

```python
# Create an optimizer.
opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)
```
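
`MyCapper` in the example above is a placeholder for whatever processing you want. One common choice is clipping by L2 norm; TensorFlow provides `tf.clip_by_norm` for this. The plain-Python sketch below only illustrates the arithmetic on lists of floats; `clip_by_norm` here is a local helper and the gradient values are made up.

```python
import math

def clip_by_norm(grad, clip_norm):
    # Rescale grad so its L2 norm is at most clip_norm; smaller gradients pass through.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm:
        return list(grad)
    return [g * clip_norm / norm for g in grad]

# (gradient, variable) pairs; a gradient may be None, as noted for compute_gradients().
grads_and_vars = [([3.0, 4.0], "w"), ([0.1, 0.0], "b"), (None, "unused")]

capped = [
    (clip_by_norm(g, 1.0) if g is not None else None, v)
    for g, v in grads_and_vars
]
print(capped)
```

The first gradient has norm 5 and is scaled down to norm 1, the second is already small and is left unchanged, and the `None` entry is passed through untouched.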

+ **`tf.train.AdamOptimizer`**
+ **`tf.train.ExponentialMovingAverage`**
+ **`tf.train.GradientDescentOptimizer`**
+ **`tf.train.MomentumOptimizer`**
+ **`tf.train.Optimizer`**
+ **`tf.train.RMSPropOptimizer`**
+ **`tf.train.XXX`**

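**`tf.train.AdamOptimizer`**: compared with plain SGD, it
+ is often less likely to get stuck at a poor local optimum
+ usually converges faster

+ Example: for loss = (x-3)^2, find the value of x that minimizes the loss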

In [12]:
# x = tf.placeholder(tf.float32)
x = tf.Variable(tf.truncated_normal([1]), name="x")
goal = tf.pow(x-3, 2, name="goal")

In [13]:
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    print(x.eval())
    print(goal.eval())

[0.465416]
[6.424116]


In [21]:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.18)
train_step = optimizer.minimize(goal)

In [22]:
def train():
    with tf.Session() as sess:
        x.initializer.run()
        for i in range(10):
            print("x: ", x.eval())
            train_step.run()
            print("goal: ",goal.eval())
train()

x:  [0.06676739]
goal:  [3.5241385]
x:  [1.1227311]
goal:  [1.4434869]
x:  [1.798548]
goal:  [0.5912522]
x:  [2.2310708]
goal:  [0.24217696]
x:  [2.5078852]
goal:  [0.0991956]
x:  [2.6850467]
goal:  [0.04063048]
x:  [2.79843]
goal:  [0.01664222]
x:  [2.8709953]
goal:  [0.00681664]
x:  [2.917437]
goal:  [0.00279209]
x:  [2.9471598]
goal:  [0.00114364]
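
The trajectory above follows directly from the update rule `x <- x - learning_rate * 2 * (x - 3)`. A plain-Python sketch that replays it, assuming the starting value `0.06676739` printed in the first line of the output (TensorFlow computes in float32, so the last digits can differ slightly):

```python
def gradient_descent(x0, learning_rate, steps):
    # Minimize (x - 3)^2 using its analytic gradient 2 * (x - 3).
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - learning_rate * 2.0 * (x - 3.0))
    return xs

trajectory = gradient_descent(x0=0.06676739, learning_rate=0.18, steps=10)
for x in trajectory[:4]:
    print(round(x, 4))
```

With `learning_rate=0.18`, each step multiplies the distance to the minimum by `1 - 2 * 0.18 = 0.64`, which is why `x` approaches 3 geometrically.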


**`Regularization`**

In [None]:
tf.contrib.layers.l1_regularizer
tf.contrib.layers.l2_regularizer
tf.contrib.layers.l1_l2_regularizer
tf.contrib.layers.sum_regularizer
tf.contrib.layers.apply_regularization

In [None]:
tf.contrib.layers.sum_regularizer(regularizer_list, scope=None)
# Returns a function that applies several regularizers at once, i.e. a single regularizer that combines all the regularizers in regularizer_list.

In [None]:
tf.contrib.layers.apply_regularization(regularizer, weights_list=None)
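# regularizer: the regularization function created in the previous step
# weights_list: the list of parameters to regularize; if None, the weights in GraphKeys.WEIGHTS are used

# Returns a scalar Tensor, which is also added to GraphKeys.REGULARIZATION_LOSSES. This Tensor represents the computation of the regularization loss.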
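
To make the penalties concrete, here is a plain-Python sketch of L1 and L2 terms and of summing several regularizers into one. The helper names are illustrative, and the L2 term assumes the half-sum-of-squares convention of `tf.nn.l2_loss`; treat it as a sketch of the arithmetic rather than of the `tf.contrib.layers` implementation.

```python
def l1_penalty(scale, weights):
    return scale * sum(abs(w) for w in weights)

def l2_penalty(scale, weights):
    # Assumes the sum(w**2) / 2 convention of tf.nn.l2_loss.
    return scale * sum(w * w for w in weights) / 2.0

def sum_penalties(penalties):
    # Combine several penalty functions into one, in the spirit of sum_regularizer.
    return lambda weights: sum(p(weights) for p in penalties)

weights = [1.0, -2.0, 3.0]
combined = sum_penalties([
    lambda w: l1_penalty(0.1, w),
    lambda w: l2_penalty(0.1, w),
])
total_penalty = combined(weights)
print(total_penalty)
```

Here the L1 term is `0.1 * 6` and the L2 term is `0.1 * 14 / 2`, so the combined penalty is about `1.3`. In training, this scalar is added to the data loss before calling `minimize()`.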