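# NN Complexity
## Space complexity
Number of layers = number of hidden layers + 1 output layer  
For a 3-4-2 network: one hidden layer + one output layer = 2 layers  
Total parameters = total weights w + total biases b  
For a 3-4-2 network: 3×4+4 + 4×2+2 = 26  
## Time complexity
Measured as the number of multiply-accumulate operations  
For a 3-4-2 network: 3×4 + 4×2 = 20  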
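
The counts above can be reproduced for any fully connected network with a few lines of plain Python. This is a minimal sketch; the helper name `count_layers_params_macs` is hypothetical and not part of the original notes.

```python
def count_layers_params_macs(sizes):
    """Return (layers, parameters, multiply-accumulates) for a fully connected net.

    `sizes` lists the width of every layer including the input, e.g. [3, 4, 2].
    """
    pairs = list(zip(sizes, sizes[1:]))          # (inputs, outputs) of each layer
    layers = len(pairs)                          # hidden layers + output layer
    params = sum(n_in * n_out + n_out for n_in, n_out in pairs)  # weights + biases
    macs = sum(n_in * n_out for n_in, n_out in pairs)            # one MAC per weight
    return layers, params, macs


print(count_layers_params_macs([3, 4, 2]))  # (2, 26, 20)
```
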

# Learning rate
$$ w_{t+1} = w_t - lr*\frac{\partial loss}{\partial w_t} $$
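## Exponentially decaying learning rate
$$ \text{decayed learning rate} = \text{initial learning rate} \times \text{decay rate}^{\,\text{current epoch} / \text{epochs per decay}} $$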
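Example: loss function loss = (w + 1)^2 | gradient ∇ = ∂loss/∂w = 2w + 2 | (at w = -1, ∇ = 0, which is the optimum)  
init_w = 5, learning_rate = 0.2  
1. | w = 5 | 5 - 0.2 × (2×5 + 2) = 2.6  
2. | w = 2.6 | 2.6 - 0.2 × (2×2.6 + 2) = 1.16  
......  
If the learning rate is too large, training oscillates and fails to converge; if it is too small, convergence is slow  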
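
The worked steps can be checked with a short plain-Python loop. This is a sketch: `lr=0.2` comes from the example, while the other two learning rates (0.001 and 1.1) and the step count of 40 are illustrative assumptions chosen to show slow convergence and divergence.

```python
def gradient_descent(w, lr, steps):
    """Run `steps` updates w <- w - lr * (2w + 2) and return every intermediate w."""
    history = []
    for _ in range(steps):
        grad = 2 * w + 2      # derivative of (w + 1)**2 with respect to w
        w = w - lr * grad
        history.append(w)
    return history


# lr=0.2 is the value from the text; 0.001 and 1.1 are illustrative assumptions.
for lr in (0.2, 0.001, 1.1):
    final_w = gradient_descent(5.0, lr, 40)[-1]
    print(f"lr={lr}: w after 40 steps = {final_w:.4f}")
```

Each update multiplies the distance to the optimum, w + 1, by (1 - 2·lr), so the iteration only converges when that factor has magnitude below 1.
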
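This motivates the exponentially decaying learning rate:  
Start with a relatively large learning rate to reach a reasonably good solution quickly, then shrink it gradually as iterations continue so that the model is more stable late in training  
e.g. `lr = LR_BASE * LR_DECAY ** (epoch / LR_STEP)`  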
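
A minimal sketch of the decay formula applied to the same loss, loss = (w + 1)^2. The notes do not give values for `LR_BASE`, `LR_DECAY`, or `LR_STEP`, so the ones below are assumptions, as are the helper names `decayed_lr` and `train_with_decay`.

```python
LR_BASE = 0.2    # initial learning rate (assumed value)
LR_DECAY = 0.99  # decay rate (assumed value)
LR_STEP = 1      # decay once every LR_STEP epochs (assumed value)


def decayed_lr(epoch):
    """Learning rate for a given epoch under exponential decay."""
    return LR_BASE * LR_DECAY ** (epoch / LR_STEP)


def train_with_decay(w, epochs):
    """Minimize (w + 1)**2 with a learning rate that shrinks every epoch."""
    for epoch in range(epochs):
        lr = decayed_lr(epoch)
        grad = 2 * w + 2      # derivative of (w + 1)**2 with respect to w
        w = w - lr * grad
    return w


print("lr at epoch 0:", decayed_lr(0))
print("lr at epoch 39:", decayed_lr(39))
print("w after 40 epochs:", train_with_decay(5.0, 40))
```
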
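
# Activation functions
Desirable properties of an activation function:  
Nonlinear  
Differentiable: optimizers mostly rely on gradient descent  
Monotonic: a monotonic activation keeps the loss of a single-layer network convex  
Approximately the identity, f(x)≈x: the network is more stable when parameters are initialized to small random values  
When the activation output is bounded, gradient-based optimization is more stable  
When the activation output is unbounded, a smaller learning rate is recommended  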
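### Sigmoid function
$$ f(x) = \frac{1}{1+e^{-x}} $$
`tf.nn.sigmoid(x)`  
#### Characteristics
Prone to vanishing gradients; the output is not zero-centered, which slows convergence; the exponentiation is expensive, which lengthens training  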
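
To make the vanishing-gradient point concrete: the derivative of sigmoid is f(x)(1 - f(x)), which never exceeds 0.25, so a product of such factors across layers shrinks quickly. The sketch below uses only the standard library; the depth of 10 layers is an assumed value for illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of sigmoid, f(x) * (1 - f(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


print("largest possible gradient:", sigmoid_grad(0.0))
print("gradient in the saturated region:", sigmoid_grad(10.0))
# Upper bound on the sigmoid factors multiplied together across 10 layers (assumed depth).
print("upper bound after 10 layers:", sigmoid_grad(0.0) ** 10)
```
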
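### Tanh function
$$ f(x) = \frac{1-e^{-2x}}{1+e^{-2x}} $$
`tf.math.tanh(x)`
#### Characteristics
The output is zero-centered; prone to vanishing gradients; the exponentiation is expensive, which lengthens training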
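### ReLU function
$$ f(x) = \max(x,0)=\begin{cases}0,&x<0 \\ x,&x\geq 0 \end{cases} $$
`tf.nn.relu(x)`
#### Advantages
Avoids vanishing gradients (in the positive range); only needs to check whether the input is greater than 0, so it is fast to compute; converges much faster than sigmoid and tanh  
#### Disadvantages
The output is not zero-centered, which slows convergence. Dead ReLU problem: some neurons may never activate, so their parameters are never updated (the cause is too many negative features reaching the ReLU; lowering the learning rate limits large shifts in the parameter distribution and so reduces the risk).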
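### Leaky ReLU function
$$ f(x) = \max(\alpha x,x)=\begin{cases}\alpha x,&x<0 \\ x,&x\geq 0 \end{cases} $$
`tf.nn.leaky_relu(x)`
#### Characteristics
In theory Leaky ReLU keeps the advantages of ReLU without the dead ReLU problem; in practice it does not always outperform ReLU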
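
The difference between the two functions shows up in their gradients for negative inputs. The plain-Python sketch below is illustrative rather than the TensorFlow implementation; `alpha=0.2` matches the default of `tf.nn.leaky_relu`, and the `*_grad` helpers are hypothetical names.

```python
def relu(x):
    """max(x, 0)"""
    return max(x, 0.0)


def leaky_relu(x, alpha=0.2):
    """max(alpha * x, x), valid for 0 < alpha < 1."""
    return max(alpha * x, x)


def relu_grad(x):
    """Gradient of ReLU: zero for negative inputs, so no update flows back."""
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, alpha=0.2):
    """Gradient of Leaky ReLU: a small nonzero slope for negative inputs."""
    return 1.0 if x > 0 else alpha


for x in (-3.0, 2.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

A neuron whose inputs are always negative receives a zero gradient under ReLU, which is the dead ReLU case, while Leaky ReLU still passes back `alpha`.
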

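# Tips
Prefer the ReLU activation function  
Set a relatively small learning rate  
Standardize the input features, i.e. make them follow a normal distribution with mean 0 and standard deviation 1  
Center the initial parameters, i.e. draw the random initial parameters from a normal distribution with mean 0 and standard deviation $$ \sqrt{\frac{2}{\text{number of input features of the current layer}}} $$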
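
The last two tips can be sketched with the standard library alone. The helper names `standardize` and `he_normal`, the layer shape, and the sample data are assumptions for illustration. Standardizing rescales to mean 0 and standard deviation 1 but does not by itself change the shape of the distribution.

```python
import math
import random
import statistics


def standardize(values):
    """Rescale a feature column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def he_normal(n_in, n_out, rng):
    """Draw an n_in x n_out weight matrix from N(0, sqrt(2 / n_in))."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


z = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
print("standardized mean/std:", statistics.fmean(z), statistics.pstdev(z))

weights = he_normal(200, 50, random.Random(0))   # layer shape is an assumed example
flat = [w for row in weights for w in row]
print("target std:", math.sqrt(2.0 / 200), "sample std:", statistics.pstdev(flat))
```
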

# Loss function (loss): the gap between the prediction (y) and the known answer (y_)
The goal of NN optimization: minimize loss
## Mean squared error (MSE)
$$ mse(y\_,y)=\frac{\sum_{i=1}^n(y-y\_)^2}{n} $$
### `loss_mse = tf.reduce_mean(tf.square(y_ - y))`
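
Before the TensorFlow version, the formula can be checked by hand on a tiny example. This plain-Python sketch uses made-up numbers; `y_true` plays the role of y_ and `y_pred` the role of y.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between predictions and labels."""
    n = len(y_true)
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n


# Squared errors are 0.25, 0.0 and 1.0, so the mean is 1.25 / 3.
print(mse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))
```
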

```python
import tensorflow as tf
import numpy as np

seed = 23455  # random seed
rdm = np.random.RandomState(seed)
x = rdm.rand(32, 2)
# Labels: x1 + x2 plus uniform noise in [-0.05, 0.05)
y_ = [[x1 + x2 + (rdm.rand() / 10.0 - 0.05)] for (x1, x2) in x]

# Cast x to float32 so that it matches the dtype of w1
x = tf.cast(x, dtype=tf.float32)
w1 = tf.Variable(tf.random.normal([2, 1], stddev=1, seed=1))

epochs = 15000  # number of training iterations
lr = 0.002  # learning rate

for epoch in range(epochs):
    with tf.GradientTape() as tape:  # record operations for automatic differentiation
        y = tf.matmul(x, w1)  # forward pass
        loss_mse = tf.reduce_mean(tf.square(y_ - y))  # loss: mean squared error

    grads = tape.gradient(loss_mse, w1)  # gradient of the loss with respect to w1
    w1.assign_sub(lr * grads)  # gradient-descent update of w1

    if epoch % 500 == 0:
        print("After %d training steps,w1 is" % (epoch))
        print(w1.numpy(), "\n")

print("Final w1 is:", w1.numpy())
```

After 0 training steps,w1 is
[[ 0.36442968]
 [-0.9614987 ]] 

After 500 training steps,w1 is
[[1.0837942 ]
 [0.07894276]] 

After 1000 training steps,w1 is
[[1.272613  ]
 [0.47505528]] 

After 1500 training steps,w1 is
[[1.2926469 ]
 [0.64627784]] 

After 2000 training steps,w1 is
[[1.2637599 ]
 [0.73507214]] 

After 2500 training steps,w1 is
[[1.2248375]
 [0.7905472]] 

After 3000 training steps,w1 is
[[1.1877908]
 [0.8302174]] 

After 3500 training steps,w1 is
[[1.1556823]
 [0.8607971]] 

After 4000 training steps,w1 is
[[1.1287668]
 [0.8852214]] 

After 4500 training steps,w1 is
[[1.1064948]
 [0.9050342]] 

After 5000 training steps,w1 is
[[1.0881608 ]
 [0.92121136]] 

After 5500 training steps,w1 is
[[1.0730999]
 [0.9344564]] 

After 6000 training steps,w1 is
[[1.0607392]
 [0.9453121]] 

After 6500 training steps,w1 is
[[1.0505978]
 [0.9542137]] 

After 7000 training steps,w1 is
[[1.0422784]
 [0.9615144]] 

After 7500 training steps,w1 is
[[1.0354537 ]
 [0.96750236]] 

After 8000 t

## Custom loss function
Custom loss function  $$ loss(y\_,y) = \sum_n f(y\_,y) $$
### Custom cost and profit
$$ f(y\_,y) = \begin{cases}PROFIT\cdot(y\_-y),&y\leq y\_ \\ COST\cdot(y-y\_),&y>y\_ \end{cases} $$
### `loss_zdy = tf.reduce_sum(tf.where(tf.greater(y, y_), COST*(y - y_), PROFIT*(y_ - y)))`
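
The piecewise definition is easier to see on single values. In this plain-Python sketch, `custom_loss` is a hypothetical helper that mirrors the `tf.where` expression, with `COST = 1` and `PROFIT = 999` taken from the notebook cell for this section.

```python
COST = 1      # loss per unit predicted above the true value
PROFIT = 999  # loss per unit predicted below the true value


def custom_loss(y_true, y_pred):
    """Sum of asymmetric penalties: over-prediction costs COST, under-prediction PROFIT."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if p > t:
            total += COST * (p - t)
        else:
            total += PROFIT * (t - p)
    return total


print("over-predict by 0.5:", custom_loss([1.0], [1.5]))
print("under-predict by 0.5:", custom_loss([1.0], [0.5]))
```

Because under-prediction is penalized far more heavily here, minimizing this loss pushes predictions upward.
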

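```python
import tensorflow as tf
import numpy as np

seed = 23455  # random seed
COST = 1  # penalty per unit when the prediction is above the label
PROFIT = 999  # penalty per unit when the prediction is below the label

rdm = np.random.RandomState(seed)
x = rdm.rand(32, 2)
# Labels: x1 + x2 plus uniform noise in [-0.05, 0.05)
y_ = [[x1 + x2 + (rdm.rand() / 10.0 - 0.05)] for (x1, x2) in x]

# Cast x to float32 so that it matches the dtype of w1
x = tf.cast(x, dtype=tf.float32)
w1 = tf.Variable(tf.random.normal([2, 1], stddev=1, seed=1))

epochs = 10000  # number of training iterations
lr = 0.002  # learning rate

for epoch in range(epochs):
    with tf.GradientTape() as tape:  # record operations for automatic differentiation
        y = tf.matmul(x, w1)  # forward pass
        # Custom loss: COST when y > y_, PROFIT otherwise
        loss_zdy = tf.reduce_sum(tf.where(tf.greater(y, y_), (y - y_) * COST, (y_ - y) * PROFIT))

    grads = tape.gradient(loss_zdy, w1)  # gradient of the loss with respect to w1
    w1.assign_sub(lr * grads)  # gradient-descent update of w1

    if epoch % 500 == 0:
        print("After %d training steps,w1 is" % (epoch))
        print(w1.numpy(), "\n")

print("Final w1 is:", w1.numpy())
```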

After 0 training steps,w1 is
[[29.671358]
 [32.047916]] 

After 500 training steps,w1 is
[[14.436411]
 [15.782992]] 

After 1000 training steps,w1 is
[[1.1549116]
 [2.1614442]] 

After 1500 training steps,w1 is
[[1.3539518]
 [1.2449582]] 

After 2000 training steps,w1 is
[[1.5100361]
 [1.0314224]] 

After 2500 training steps,w1 is
[[1.2925394]
 [2.2895539]] 

After 3000 training steps,w1 is
[[1.297658]
 [1.275389]] 

After 3500 training steps,w1 is
[[1.6664945]
 [2.6464522]] 

After 4000 training steps,w1 is
[[1.2408148]
 [2.1398804]] 

After 4500 training steps,w1 is
[[1.2029781]
 [1.828666 ]] 

After 5000 training steps,w1 is
[[1.363632 ]
 [1.4350257]] 

After 5500 training steps,w1 is
[[1.5197163]
 [1.2214899]] 

After 6000 training steps,w1 is
[[1.6900622]
 [2.6749787]] 

After 6500 training steps,w1 is
[[1.0750307]
 [1.8906238]] 

After 7000 training steps,w1 is
[[1.2311151]
 [1.677088 ]] 

After 7500 training steps,w1 is
[[1.3871995]
 [1.4635522]] 

After 8000 training steps,w1 i

## Cross-entropy loss CE (Cross Entropy): measures the distance between two probability distributions
$$ H(y\_,y) = -\sum y\_ \cdot \ln y $$
### `tf.losses.categorical_crossentropy(y_, y)`

```python
import tensorflow as tf

# Same label [1, 0], two different predictions
loss_ce1 = tf.losses.categorical_crossentropy([1, 0], [0.6, 0.4])
loss_ce2 = tf.losses.categorical_crossentropy([1, 0], [0.8, 0.2])
print("loss_ce1:", loss_ce1)
print("loss_ce2:", loss_ce2)
```

loss_ce1: tf.Tensor(0.5108256, shape=(), dtype=float32)
loss_ce2: tf.Tensor(0.22314353, shape=(), dtype=float32)
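
The two values above follow directly from the formula: with label [1, 0] only the first term survives, so the loss is -ln of the probability given to the correct class. A plain-Python check (the helper name `cross_entropy` is hypothetical):

```python
import math


def cross_entropy(y_true, y_pred):
    """-sum(y_ * ln y); terms with y_ = 0 contribute nothing and are skipped."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


print(cross_entropy([1, 0], [0.6, 0.4]))  # -ln(0.6)
print(cross_entropy([1, 0], [0.8, 0.2]))  # -ln(0.8)
```

The prediction [0.8, 0.2] is closer to the label, so its loss is smaller.
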


## Combining softmax with cross-entropy
The output is first passed through softmax, then the cross-entropy loss between y and y_ is computed
### `tf.nn.softmax_cross_entropy_with_logits(y_, y)`

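```python
import numpy as np
import tensorflow as tf

# Use a float dtype: tf.nn.softmax has no kernel for integer tensors
y_ = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]], dtype=np.float32)
y = np.array([[1, 0, 0], [0, 1, 0], [0, 10, 1], [1, 0, 0], [0, 1, 0]], dtype=np.float32)

y_pro = tf.nn.softmax(y)

loss_ce1 = tf.losses.categorical_crossentropy(y_, y_pro)
loss_ce2 = tf.nn.softmax_cross_entropy_with_logits(y_, y)
print("Step by step:", loss_ce1)
print("Combined:", loss_ce2)
```

With the original integer arrays (no `dtype` argument), the cell failed at `tf.nn.softmax(y)` because the Softmax op is registered only for half, float, and double inputs:

```
NotFoundError: Could not find valid device for node.
Node:{{node Softmax}}
All kernels registered for op Softmax :
  device='CPU'; T in [DT_HALF]
  device='CPU'; T in [DT_FLOAT]
  device='CPU'; T in [DT_DOUBLE]
 [Op:Softmax]
```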
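
To show that the two-step and the combined computation agree, here is a plain-Python sketch for a single sample. It is an illustration under stated assumptions, not TensorFlow's implementation: the helper names are hypothetical, and the combined version uses the log-sum-exp identity -ln softmax(z)_k = ln Σ e^{z_j} - z_k.

```python
import math


def softmax(logits):
    """Turn logits into probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred):
    """-sum(y_ * ln y), skipping terms where y_ is 0."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


def softmax_cross_entropy(y_true, logits):
    """Combined form: sum(y_ * (logsumexp(logits) - logit))."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return sum(t * (log_sum - z) for t, z in zip(y_true, logits))


label, logits = [0, 0, 1], [0.0, 10.0, 1.0]   # one row of the arrays used in this section
print("two steps:", cross_entropy(label, softmax(logits)))
print("combined: ", softmax_cross_entropy(label, logits))
```
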