# PyTorch Study Notes: Neural Networks (LS16-LS36)

In [1]:
import numpy as np
import torch
from torch.nn import functional as F

## LS17. Activation Functions and Loss Gradients

### Activation Functions

$$
sigmoid(x)=\frac1{1+e^{-x}}
$$

$$
sigmoid'(x)=sigmoid(x)(1-sigmoid(x))
$$

In [98]:
a=torch.linspace(-10,10,10)
print(torch.sigmoid(a)) # beware of vanishing gradients: the output saturates at both ends

tensor([4.5398e-05, 4.1877e-04, 3.8510e-03, 3.4445e-02, 2.4766e-01, 7.5234e-01,
        9.6555e-01, 9.9615e-01, 9.9958e-01, 9.9995e-01])
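
The derivative formula above can be checked against autograd. The following is a minimal sketch on the same grid of inputs; the variable names `s` and `analytic` are introduced here for illustration only.

```python
import torch

# Evaluate sigmoid on the same grid as the cell above, this time tracking gradients.
a = torch.linspace(-10, 10, 10, requires_grad=True)
s = torch.sigmoid(a)

# Summing gives a scalar whose gradient with respect to a[i] is sigmoid'(a[i]).
s.sum().backward()

# Closed form: sigmoid(x) * (1 - sigmoid(x))
analytic = (s * (1 - s)).detach()
print(torch.allclose(a.grad, analytic, atol=1e-6))
print(a.grad)  # entries near both ends of the grid are close to zero
```

The derivative never exceeds 0.25 and shrinks toward zero for large |x|, which is where the vanishing-gradient warning comes from.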


$$
tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}=2sigmoid(2x)-1
$$

$$
tanh'(x)=1-tanh^2(x)
$$

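Relative to sigmoid, the curve is compressed by a factor of 2 along the x-axis, stretched by a factor of 2 along the y-axis, and then shifted down by 1. The output range is $(-1,1)$.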

In [97]:
a=torch.linspace(-10,10,10)
print(torch.tanh(a))

tensor([-1.0000, -1.0000, -1.0000, -0.9975, -0.8045,  0.8045,  0.9975,  1.0000,
         1.0000,  1.0000])
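
Both tanh formulas above can be verified numerically. This is a small sketch, assuming the same input grid as the cell above; `via_sigmoid` is a name chosen here for illustration.

```python
import torch

a = torch.linspace(-10, 10, 10, requires_grad=True)
t = torch.tanh(a)

# Identity from the formula above: tanh(x) = 2*sigmoid(2x) - 1
via_sigmoid = 2 * torch.sigmoid(2 * a) - 1
print(torch.allclose(t, via_sigmoid, atol=1e-6))

# Derivative: tanh'(x) = 1 - tanh(x)^2
t.sum().backward()
print(torch.allclose(a.grad, (1 - t ** 2).detach(), atol=1e-6))
```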


#### Rectified Linear Unit

$$
ReLU(x)=\begin{cases} 0 & \text{for } x<0 \\ x & \text{for } x\ge0 \end{cases}
$$

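- Giving the segment for $x<0$ a small nonzero slope yields Leaky ReLU.
- Replacing the segment for $x<0$ with a scaled exponential curve (and scaling the whole function by a constant) yields SELU.
- Smoothing the kink at $x=0$ yields Softplus.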

In [99]:
a=torch.linspace(-10,10,10)
print(torch.relu(a))
print(F.relu(a))

tensor([ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  1.1111,  3.3333,  5.5556,
         7.7778, 10.0000])
tensor([ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  1.1111,  3.3333,  5.5556,
         7.7778, 10.0000])
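
The three variants listed above are available in `torch.nn.functional`. A short sketch on the same input grid follows; the `negative_slope` value of 0.01 is an illustrative choice, not something fixed by these notes.

```python
import torch
from torch.nn import functional as F

a = torch.linspace(-10, 10, 10)

leaky = F.leaky_relu(a, negative_slope=0.01)  # x < 0 is scaled by 0.01 instead of being zeroed
selu = F.selu(a)                              # scaled exponential curve for x < 0
soft = F.softplus(a)                          # log(1 + exp(x)), a smooth approximation of ReLU

print(leaky)
print(selu)
print(soft)
```

Unlike ReLU, Leaky ReLU and SELU keep a nonzero output for negative inputs, and Softplus is strictly positive everywhere.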


### Loss Functions

$$
MSE=\sum(f_\theta(x)-y)^2
$$

$$
\nabla_\theta MSE=2\sum(f_\theta(x)-y)\cdot\frac{\partial f_\theta(x)}{\partial \theta}
$$

In [129]:
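# autograd.grad differentiates automatically: torch.autograd.grad(loss,[w1,w2,...]) returns (w1 grad, w2 grad, ...)

x=torch.ones(1)
w=torch.full([1],2.) # y=wx, x=1, w=2
w.requires_grad_()  # enable gradient tracking for w; autograd.grad raises an error without it
mse=F.mse_loss(x*w,torch.ones(1))  # builds the dynamic graph; the target is y=1
print(mse) # (2-1)^2
print(torch.autograd.grad(mse,[w]))  # derivative of the output (mse) with respect to the input (w): 2*(2-1)*1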

tensor(1., grad_fn=<MseLossBackward0>)
(tensor([2.]),)


In [130]:
# backward() computes the gradient for every leaf tensor that requires it; read the results from w1.grad, w2.grad, ...

mse=F.mse_loss(x*w,torch.ones(1))
mse.backward()
print(w.grad)

tensor([2.])
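
One detail worth checking: the MSE formula above is written as a sum, while `F.mse_loss` averages over elements by default. The sketch below compares the two on a three-element example; the data values are made up for illustration.

```python
import torch
from torch.nn import functional as F

x = torch.tensor([1., 2., 3.])
y = torch.tensor([1., 1., 1.])
w = torch.tensor([2.], requires_grad=True)

# reduction='sum' matches the formula above: sum((w*x - y)^2)
loss_sum = F.mse_loss(x * w, y, reduction='sum')
(grad_sum,) = torch.autograd.grad(loss_sum, [w])

# Hand-computed gradient: 2 * sum((w*x - y) * x)
manual = (2 * ((x * w - y) * x).sum()).detach()

# The default reduction='mean' divides both the loss and the gradient by the element count.
loss_mean = F.mse_loss(x * w, y)
(grad_mean,) = torch.autograd.grad(loss_mean, [w])

print(loss_sum.item(), grad_sum.item(), manual.item())
print(loss_mean.item(), grad_mean.item())
```

With a single element, as in the cells above, the two reductions coincide.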


#### Soft version of max

$$
y=\begin{bmatrix} 2.0 \\ 1.0 \\ 0.1 \end{bmatrix}\Rightarrow p_i=Softmax(y_i)=\frac{e^{y_i}}{\sum_ke^{y_k}},\quad p\approx\begin{bmatrix} 0.7 \\ 0.2 \\ 0.1 \end{bmatrix}
$$

$$
\nabla Softmax(y_i)=\frac{\partial p_i}{\partial y_j}=\begin{cases} p_i(1-p_i) & \text{if } i=j \\ -p_j\cdot p_i & \text{if } i\ne j \end{cases}
$$

In [135]:
a=torch.rand(3)
a.requires_grad_()
p=F.softmax(a,dim=0)
print(torch.autograd.grad(p[1],[a],retain_graph=True)) # the differentiated output must be a scalar, so take one p[i] at a time
print(torch.autograd.grad(p[2],[a]))

(tensor([-0.1005,  0.2052, -0.1048]),)
(tensor([-0.1265, -0.1048,  0.2313]),)
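
The case formula for the softmax gradient can be written compactly as the matrix `diag(p) - p p^T`. The sketch below builds that matrix by hand and compares it with the rows returned by autograd, using the example logits from the formula above; `analytic` and `numeric` are names chosen here.

```python
import torch
from torch.nn import functional as F

a = torch.tensor([2.0, 1.0, 0.1], requires_grad=True)
p = F.softmax(a, dim=0)

# Analytic Jacobian: J[i, j] = p_i * (1 - p_i) if i == j, else -p_i * p_j
pd = p.detach()
analytic = torch.diag(pd) - pd[:, None] * pd[None, :]

# Autograd Jacobian, one row per scalar output p[i].
rows = [torch.autograd.grad(p[i], [a], retain_graph=True)[0] for i in range(3)]
numeric = torch.stack(rows)

print(pd)
print(torch.allclose(analytic, numeric, atol=1e-6))
```

Each row sums to zero because the probabilities always sum to 1, so raising one of them must lower the others.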


## LS19. Deriving the Perceptron Gradient

### Single-Output Perceptron

<img src="pic\pic1.jpg" width="50%" height="50%" />

The loss function is $E=\frac12(O_0^1-t)^2$.

The derivative of the sigmoid function $g$ is $g'(z)=g(z)(1-g(z))$.

#### Gradient Derivation

$$
\begin{aligned}\frac{\partial E}{\partial w_{j0}}&=\left(O_{0}-t\right) \frac{\partial O_{0}}{\partial w_{j0}} \\
\frac{\partial E}{\partial w_{j0}}&=\left(O_{0}-t\right) \frac{\partial \sigma\left(x_{0}\right)}{\partial w_{j0}} \\
\frac{\partial E}{\partial w_{j0}}&=\left(O_{0}-t\right) \sigma\left(x_{0}\right)\left(1-\sigma\left(x_{0}\right)\right) \frac{\partial x_{0}^{1}}{\partial w_{j0}} \\
\frac{\partial E}{\partial w_{j0}}&=\left(O_{0}-t\right) O_{0}\left(1-O_{0}\right) \frac{\partial x_{0}^{1}}{\partial w_{j0}} \\
\frac{\partial E}{\partial w_{j0}}&=\left(O_{0}-t\right) O_{0}\left(1-O_{0}\right) x_{j}^{0}\end{aligned}
$$

#### Code Example

In [5]:
x=torch.randn(1,10)
w=torch.randn(1,10,requires_grad=True)
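o=torch.sigmoid(x@w.t()) # matrix multiplication: [1,10] @ [10,1] -> [1,1]

loss=F.mse_loss(o,torch.ones(1,1)) # mse_loss(input, target): prediction first, then target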

loss.backward()
print(w.grad)

tensor([[ 0.0891, -0.0548, -0.0099, -0.1739,  0.0405, -0.1314, -0.0911, -0.0527,
         -0.0367, -0.0011]])
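
The last line of the derivation can be compared with autograd directly. The sketch below writes the loss as $E=\frac12(O_0-t)^2$ explicitly instead of calling `F.mse_loss`; the seed and the target value `t` are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 10)
w = torch.randn(1, 10, requires_grad=True)
t = torch.ones(1, 1)

o = torch.sigmoid(x @ w.t())        # O_0, shape [1, 1]
E = 0.5 * ((o - t) ** 2).sum()      # E = 1/2 * (O_0 - t)^2
E.backward()

# Final line of the derivation: (O_0 - t) * O_0 * (1 - O_0) * x_j
od = o.detach()
manual = (od - t) * od * (1 - od) * x   # [1,1] broadcast against [1,10]
print(torch.allclose(w.grad, manual, atol=1e-6))
```

Note that `F.mse_loss` has no 1/2 factor, so on a single output its gradient is twice this value.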


### Multi-Output Perceptron

<img src="pic\pic2.jpg" width="50%" height="50%" />

The loss function is $E=\frac12\sum_i(O_i^1-t_i)^2$.

#### Gradient Derivation

$$
\begin{aligned}
\frac{\partial E}{\partial w_{j k}}&=\left(O_{k}-t_{k}\right) \frac{\partial O_{k}}{\partial w_{j k}} \\
\frac{\partial E}{\partial w_{j k}}&=\left(O_{k}-t_{k}\right) \frac{\partial \sigma\left(x_{k}\right)}{\partial w_{j k}} \\
\frac{\partial E}{\partial w_{j k}}&=\left(O_{k}-t_{k}\right) \sigma\left(x_{k}\right)\left(1-\sigma\left(x_{k}\right)\right) \frac{\partial x_{k}^{1}}{\partial w_{j k}} \\
\frac{\partial E}{\partial w_{j k}}&=\left(O_{k}-t_{k}\right) O_{k}\left(1-O_{k}\right) \frac{\partial x_{k}^{1}}{\partial w_{j k}} \\
\frac{\partial E}{\partial w_{j k}}&=\left(O_{k}-t_{k}\right) O_{k}\left(1-O_{k}\right) x_{j}^{0}
\end{aligned}
$$

#### Code Example

In [6]:
x=torch.randn(1,10)
w=torch.randn(2,10,requires_grad=True)
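o=torch.sigmoid(x@w.t()) # matrix multiplication: [1,10] @ [10,2] -> [1,2], one column per output node

loss=F.mse_loss(o,torch.ones(1,2)) # mse_loss(input, target): prediction first, then target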

loss.backward()
print(w.grad)

tensor([[-0.0043, -0.0066, -0.0252, -0.0253, -0.0119,  0.0927, -0.1116,  0.0628,
          0.0308, -0.0127],
        [-0.0026, -0.0040, -0.0151, -0.0152, -0.0071,  0.0556, -0.0669,  0.0377,
          0.0185, -0.0076]])
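
For several outputs, the result $\left(O_k-t_k\right)O_k\left(1-O_k\right)x_j^0$ is an outer product between the per-output error terms and the input. The sketch below checks this against autograd, with the loss written out as $\frac12\sum_k(O_k-t_k)^2$; the seed and the name `delta` are assumptions made for this example.

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 10)
w = torch.randn(2, 10, requires_grad=True)   # row k holds the weights w_jk of output node k
t = torch.ones(1, 2)

o = torch.sigmoid(x @ w.t())                 # shape [1, 2]
E = 0.5 * ((o - t) ** 2).sum()               # E = 1/2 * sum_k (O_k - t_k)^2
E.backward()

od = o.detach()
delta = (od - t) * od * (1 - od)             # (O_k - t_k) * O_k * (1 - O_k), shape [1, 2]
manual = delta.t() @ x                       # [2,1] @ [1,10] -> [2,10], entry [k, j] = delta_k * x_j
print(torch.allclose(w.grad, manual, atol=1e-6))
```

Because every row of the gradient is a multiple of the same input `x`, the rows are proportional to each other, which is also visible in the printed gradient above.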


## LS20. The Chain Rule

<img src="pic\pic3.jpg" width="50%" height="50%" />

$$
\frac{\partial E}{\partial w_{j k}^{1}}=\frac{\partial E}{\partial O_{k}^{1}} \frac{\partial O_{k}^{1}}{\partial w_{j k}^{1}}=\frac{\partial E}{\partial O_{k}^{2}} \frac{\partial O_{k}^{2}}{\partial O_{k}^{1}} \frac{\partial O_{k}^{1}}{\partial w_{j k}^{1}}
$$

In [9]:
x=torch.tensor(1.)
w1=torch.tensor(2.,requires_grad=True)
b1=torch.tensor(1.)
w2=torch.tensor(2.,requires_grad=True)
b2=torch.tensor(1.)

y1=x*w1+b1
y2=y1*w2+b2

dy2_dy1=torch.autograd.grad(y2,[y1],retain_graph=True)[0]
dy1_dw1=torch.autograd.grad(y1,[w1],retain_graph=True)[0]
dy2_dw1=torch.autograd.grad(y2,[w1],retain_graph=True)[0]

print(dy2_dy1*dy1_dw1)
print(dy2_dw1)

tensor(2.)
tensor(2.)


## LS21. Multi-Layer Perceptron

<img src="pic\pic4.jpg" width="50%" height="50%" />

To simplify the notation, define

$$
\delta_k^K=\left(O_{k}-t_{k}\right) O_{k}\left(1-O_{k}\right)
$$

Every quantity in this expression is available directly from the forward pass.

#### Gradient Derivation

$$
\begin{aligned}
\frac{\partial E}{\partial W_{i j}} &=\frac{\partial}{\partial W_{i j}} \frac{1}{2} \sum_{k \in K}\left(O_{k}-t_{k}\right)^{2} \\
\frac{\partial E}{\partial W_{i j}} &=\sum_{k \in K}\left(O_{k}-t_{k}\right) \frac{\partial}{\partial W_{i j}} O_{k} \\
\frac{\partial E}{\partial W_{i j}} &=\sum_{k \in K}\left(O_{k}-t_{k}\right) \frac{\partial}{\partial W_{i j}} \sigma\left(x_{k}\right)\\
\frac{\partial E}{\partial W_{i j}}&=\sum_{k \in K}\left(O_{k}-t_{k}\right) \sigma\left(x_{k}\right)\left(1-\sigma\left(x_{k}\right)\right) \frac{\partial x_{k}}{\partial W_{i j}} \\
\frac{\partial E}{\partial W_{i j}}&=\sum_{k \in K}\left(O_{k}-t_{k}\right) O_{k}\left(1-O_{k}\right) \frac{\partial x_{k}}{\partial O_{j}} \cdot \frac{\partial O_{j}}{\partial W_{i j}} \\
\frac{\partial E}{\partial W_{i j}}&=\sum_{k \in K}\left(O_{k}-t_{k}\right) O_{k}\left(1-O_{k}\right) W_{j k} \frac{\partial O_{j}}{\partial W_{i j}}\\
\frac{\partial E}{\partial W_{i j}}&=O_{j}\left(1-O_{j}\right) \frac{\partial x_{j}}{\partial W_{i j}} \sum_{k \in K}\left(O_{k}-t_{k}\right) O_{k}\left(1-O_{k}\right) W_{j k}\\
\frac{\partial E}{\partial W_{i j}}&=O_{i} O_{j}\left(1-O_{j}\right) \sum_{k \in K} \delta_{k} W_{j k}
\end{aligned}
$$

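Similarly, define
$$
\delta_j^J=O_{j}\left(1-O_{j}\right) \sum_{k \in K} \delta_{k} W_{j k}
$$
as the error term of node $j$ in layer $J$; it collects all the information that layer $J$ needs from the layers after it.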

### Summary

For an output layer node $k \in K$

$$
\frac{\partial E}{\partial W_{j k}}=O_{j} \delta_{k}
$$

where

$$
\delta_{k}=O_{k}\left(1-O_{k}\right)\left(O_{k}-t_{k}\right)
$$

For a hidden layer node $j \in J$

$$
\frac{\partial E}{\partial W_{i j}}=O_{i} \delta_{j}
$$

where

$$
\delta_{j}=O_{j}\left(1-O_{j}\right) \sum_{k \in K} \delta_{k} W_{j k}
$$
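
The two summary rules can be checked on a tiny network. The sketch below assumes a 4-3-2 architecture with sigmoid activations in both layers and no bias terms; the layer sizes, seed, and target vector are illustrative choices only.

```python
import torch

torch.manual_seed(0)
O_i = torch.randn(4)                           # outputs of layer I (the network input)
W_ij = torch.randn(4, 3, requires_grad=True)   # weights from layer I to hidden layer J
W_jk = torch.randn(3, 2, requires_grad=True)   # weights from layer J to output layer K
t = torch.tensor([1., 0.])

# Forward pass
O_j = torch.sigmoid(O_i @ W_ij)                # shape [3]
O_k = torch.sigmoid(O_j @ W_jk)                # shape [2]
E = 0.5 * ((O_k - t) ** 2).sum()
E.backward()

# Backward pass by hand, following the summary formulas
with torch.no_grad():
    delta_k = O_k * (1 - O_k) * (O_k - t)            # output layer error terms, shape [2]
    delta_j = O_j * (1 - O_j) * (W_jk @ delta_k)     # O_j(1-O_j) * sum_k delta_k W_jk, shape [3]
    grad_jk = O_j[:, None] * delta_k[None, :]        # dE/dW_jk = O_j * delta_k, shape [3, 2]
    grad_ij = O_i[:, None] * delta_j[None, :]        # dE/dW_ij = O_i * delta_j, shape [4, 3]

print(torch.allclose(W_jk.grad, grad_jk, atol=1e-6))
print(torch.allclose(W_ij.grad, grad_ij, atol=1e-6))
```

The hidden-layer error terms reuse `delta_k`, which is why backpropagation processes the layers from the output backwards.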

## LS22. Cross Entropy

### Entropy

$$
Entropy=-\sum_iP(i)\log P(i)
$$

- Uncertainty
- measure of surprise
- higher entropy: higher uncertainty.

In [2]:
a=torch.full([4],1/4.)
print(-(a*torch.log2(a)).sum()) # entropy in bits (log base 2)

a=torch.tensor([0.1,0.1,0.1,0.7])
print(-(a*torch.log2(a)).sum())

a=torch.tensor([0.001,0.001,0.001,0.997])
print(-(a*torch.log2(a)).sum())

# the more the probability concentrates on one outcome, the lower the entropy; the more uniform the distribution, the higher the entropy

tensor(2.)
tensor(1.3568)
tensor(0.0342)


### Cross Entropy

$$
H(p,q)=-\sum p\log q=H(p)+D_{KL}(p\|q),\quad D_{KL}=\text{KL divergence}
$$

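The more similar $p$ and $q$ are, the closer the KL divergence is to 0. Therefore:
- When $P=Q$, $D_{KL}(p\|q)=0$ and the cross entropy equals the entropy.
- With one-hot encoded labels, the entropy is $H(p)=-1\log 1=0$.
- So for classification with one-hot labels, the cross entropy equals the KL divergence, and minimizing one minimizes the other.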
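
The decomposition $H(p,q)=H(p)+D_{KL}(p\|q)$ can be checked numerically. The sketch below uses base-2 logarithms to match the entropy cell earlier; the helper functions and the example distributions are defined here for illustration.

```python
import torch

def entropy(p):
    # H(p) = -sum p log2 p (assumes every entry of p is positive)
    return -(p * torch.log2(p)).sum()

def cross_entropy(p, q):
    # H(p, q) = -sum p log2 q
    return -(p * torch.log2(q)).sum()

def kl_divergence(p, q):
    # D_KL(p || q) = sum p log2(p / q) (assumes every entry of p is positive)
    return (p * torch.log2(p / q)).sum()

p = torch.tensor([0.1, 0.1, 0.1, 0.7])
q = torch.tensor([0.25, 0.25, 0.25, 0.25])

print(cross_entropy(p, q))
print(entropy(p) + kl_divergence(p, q))   # same value as the line above
print(kl_divergence(p, p))                # zero when the two distributions are identical

# With a one-hot label, the cross entropy reduces to -log2 q[true class].
one_hot = torch.tensor([0., 0., 0., 1.])
q2 = torch.tensor([0.1, 0.2, 0.3, 0.4])
print(cross_entropy(one_hot, q2), -torch.log2(q2[3]))
```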

### Binary Classification

$$
H(P, Q)=-P(\text{cat}) \log Q(\text{cat})-(1-P(\text{cat})) \log (1-Q(\text{cat}))
$$
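
The binary formula above corresponds to `F.binary_cross_entropy`, which takes predicted probabilities and uses the natural logarithm. A minimal sketch follows, with a made-up prediction of 0.8 for a sample whose true label is "cat".

```python
import torch
from torch.nn import functional as F

q_cat = torch.tensor([0.8])   # predicted probability Q(cat), illustrative value
p_cat = torch.tensor([1.0])   # true label P(cat): 1 means cat, 0 means not cat

# The formula written out by hand
manual = -(p_cat * torch.log(q_cat) + (1 - p_cat) * torch.log(1 - q_cat)).mean()

# The built-in version: binary_cross_entropy(input=probabilities, target=labels)
builtin = F.binary_cross_entropy(q_cat, p_cat)

print(manual, builtin)
```

When the model outputs raw logits rather than probabilities, `F.binary_cross_entropy_with_logits` applies the sigmoid internally.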

In [3]:
x=torch.randn(1,784)
w=torch.randn(10,784)
logits=x@w.t()
print(F.cross_entropy(logits,torch.tensor([3])))# cross_entropy=softmax+log+nll_loss

pred=F.softmax(logits,dim=1)
pred_log=torch.log(pred)
print(F.nll_loss(pred_log,torch.tensor([3])))

tensor(16.3820)
tensor(16.3820)
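
The decomposition in the cell above can also be written with `F.log_softmax`, which fuses the softmax and log steps. Taking `torch.log` of a separately computed softmax can produce `-inf` when a probability underflows to zero, so the fused form is usually the safer choice. A sketch, assuming the same shapes as above and an arbitrary seed:

```python
import torch
from torch.nn import functional as F

torch.manual_seed(0)
x = torch.randn(1, 784)
w = torch.randn(10, 784)
logits = x @ w.t()
target = torch.tensor([3])

# cross_entropy expects raw logits, not probabilities
ce = F.cross_entropy(logits, target)

# Equivalent two-step form: log_softmax followed by nll_loss
nll = F.nll_loss(F.log_softmax(logits, dim=1), target)

print(ce, nll)
```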


## LS30. Visualization with Visdom

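1. Start the server with `python -m visdom.server` and open the web page it serves.
2. Create a client that connects to the server: `from visdom import Visdom`, then `viz = Visdom()`.
3. Create a plot with `viz.line([0.], [0.], win='window ID', opts=dict(title='window title'))`; `[0.], [0.]` is the initial point.
4. Append data with `viz.line([y data], [x data], win='window ID', update='append')`.
5. `viz.images()` displays images and `viz.text()` displays text.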

## LS31. Ways to Reduce Overfitting

- Use cross-validation
- Add a regularization term
- Use momentum
- Use learning rate decay
- Early stopping
- Dropout
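
Several of the items above map onto standard PyTorch components. The sketch below combines them in one small training loop on synthetic data. The model size, hyperparameters, and `patience` value are all illustrative assumptions, and a single held-out validation split stands in for full cross-validation.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic data, used only so that the sketch runs on its own.
x_train, y_train = torch.randn(64, 20), torch.randint(0, 2, (64,))
x_val, y_val = torch.randn(32, 20), torch.randint(0, 2, (32,))

model = nn.Sequential(
    nn.Linear(20, 32),
    nn.ReLU(),
    nn.Dropout(p=0.5),             # Dropout
    nn.Linear(32, 2),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,                  # momentum
    weight_decay=1e-4,             # L2 regularization term
)
# Learning rate decay: halve the learning rate every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

best_val, patience, bad_epochs = float('inf'), 5, 0
for epoch in range(100):
    model.train()                  # dropout is active during training
    optimizer.zero_grad()
    loss = criterion(model(x_train), y_train)
    loss.backward()
    optimizer.step()
    scheduler.step()

    model.eval()                   # dropout is disabled for validation
    with torch.no_grad():
        val_loss = criterion(model(x_val), y_val).item()

    # Early stopping: quit once validation loss has not improved for `patience` epochs.
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

print(epoch, best_val)
```

Switching between `model.train()` and `model.eval()` matters here, because dropout must be turned off when measuring validation loss.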