<a href="https://colab.research.google.com/github/Muzhi1920/awesome-models/blob/main/00-basic_tricks/1_activation_func%26weight_init.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parameter Initialization (no activation)

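1. Fixed initialization;
2. Pretrained initialization;
3. Random initialization, where the choice of distribution matters because it determines whether gradients explode or vanish.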

## Naive Initialization
1. Gaussian distribution
$$W \sim N(\mu,\sigma^2)$$
2. Uniform distribution
$$W \sim U(a,b), a=-\frac{1}{\sqrt{N}}, b=\frac{1}{\sqrt{N}}$$
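
As a quick sanity check on the two distributions above, the sketch below samples both in plain Python (standard library only, rather than TensorFlow) and compares the sample variance with the closed-form value. It assumes $N$ is the layer's input dimension (128 here, matching the layers used later in this notebook). The Gaussian standard deviation of $1/\sqrt{N}$ is an illustrative choice, not something fixed by the formulas.

```python
import math
import random
import statistics

N = 128                      # assumed input dimension (fan-in)
rng = random.Random(0)       # fixed seed so the run is repeatable
num_samples = 20_000

# Uniform on [-1/sqrt(N), 1/sqrt(N)]; closed-form variance is (b - a)^2 / 12 = 1 / (3N).
bound = 1.0 / math.sqrt(N)
uniform_samples = [rng.uniform(-bound, bound) for _ in range(num_samples)]
uniform_var = statistics.pvariance(uniform_samples)

# Gaussian with mu = 0 and sigma = 1/sqrt(N); closed-form variance is 1 / N.
gauss_samples = [rng.gauss(0.0, bound) for _ in range(num_samples)]
gauss_var = statistics.pvariance(gauss_samples)

print("uniform: sample var = {:.6f}, expected = {:.6f}".format(uniform_var, 1.0 / (3 * N)))
print("gauss:   sample var = {:.6f}, expected = {:.6f}".format(gauss_var, 1.0 / N))
```

When the uniform bound equals the Gaussian standard deviation, the uniform distribution has only one third of the Gaussian's variance, so the two choices are not interchangeable.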


In [None]:
import tensorflow as tf

In [None]:
t1 = tf.feature_column.categorical_column_with_hash_bucket(key='featureField28',hash_bucket_size=4, dtype=tf.int64)
t1

HashedCategoricalColumn(key='featureField28', hash_bucket_size=4, dtype=tf.int64)

In [None]:
emb = tf.feature_column.embedding_column(categorical_column=t1,dimension = 128,combiner='mean',
    initializer=tf.keras.initializers.RandomNormal(mean=0,stddev=1.0),ckpt_to_load_from=None,tensor_name_in_ckpt=None,max_norm=None,trainable=True,use_safe_embedding_lookup=True)
emb

EmbeddingColumn(categorical_column=HashedCategoricalColumn(key='featureField28', hash_bucket_size=4, dtype=tf.int64), dimension=128, combiner='mean', initializer=<keras.initializers.initializers_v2.RandomNormal object at 0x7f1618bdc0d0>, ckpt_to_load_from=None, tensor_name_in_ckpt=None, max_norm=None, trainable=True, use_safe_embedding_lookup=True)

In [None]:
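# Sequence of ids: when a sequence is shorter than the fixed length, the engine pads it with -1 so the shape stays consistent; -1 has a special meaning of missing feature
features ={
    'featureField28': tf.constant([[1000,-1,-1], [-1,1002,-1],[1000,1002,-1]]), # cross_1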
}
features

{'featureField28': <tf.Tensor: shape=(3, 3), dtype=int32, numpy=
 array([[1000,   -1,   -1],
        [  -1, 1002,   -1],
        [1000, 1002,   -1]], dtype=int32)>}

In [None]:
feature_layer = tf.compat.v1.keras.layers.DenseFeatures(feature_columns=[emb])
dense_tensor = feature_layer(features, training=False)

In [None]:
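# Stack 80 linear Dense layers (128 units each) and track per-sample statistics.
layer_num = 80
for index, units in enumerate([128]*layer_num):
    # Weight std = 1/sqrt(128), so fan_in * Var(w) = 1 and the variance should be preserved.
    dense_tensor = tf.compat.v1.keras.layers.Dense(units, activation='linear',
                                                   kernel_initializer=tf.keras.initializers.RandomNormal(mean=0.0,stddev=1.0/(128**0.5)))(dense_tensor)
    mean, variance = tf.nn.moments(dense_tensor, axes=1)
    # Note: the value printed under the 'variabce' label is variance**0.5, i.e. the standard deviation.
    print('layer_{}_mean_{}_variabce_{}'.format(index, mean, variance**0.5))
prediction = tf.compat.v1.keras.layers.Dense(1)(dense_tensor)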

layer_0_mean_[-0.00474764 -0.11256254 -0.0586551 ]_variabce_[0.9030776 0.8726572 0.6470899]
layer_1_mean_[-0.07459724  0.05302676 -0.01078526]_variabce_[0.9536409  0.8253405  0.67076385]
layer_2_mean_[0.12263592 0.01789524 0.07026559]_variabce_[1.0337179 0.7588261 0.6421874]
layer_3_mean_[ 0.16345797 -0.0112858   0.07608611]_variabce_[1.1443658  0.7512038  0.66125846]
layer_4_mean_[0.04273469 0.01263199 0.02768335]_variabce_[0.949007   0.73244965 0.57807887]
layer_5_mean_[-0.12120512 -0.13118267 -0.12619388]_variabce_[0.8956612  0.77741194 0.60022247]
layer_6_mean_[0.03900212 0.09053891 0.0647705 ]_variabce_[0.8652431  0.79009306 0.606217  ]
layer_7_mean_[-0.06023568  0.06622929  0.00299681]_variabce_[0.8194714 0.781086  0.5971431]
layer_8_mean_[ 0.02006426 -0.02454356 -0.00223965]_variabce_[0.9004625 0.7788832 0.5988036]
layer_9_mean_[-0.20334095 -0.18653768 -0.19493939]_variabce_[0.86875   0.732591  0.5203612]
layer_10_mean_[-0.04872639  0.01962633 -0.01454999]_variabce_[0.9129105  0

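### Conclusions
1. Let $I0$ denote the initialization distribution of the input embedding;
2. Let $I1$ denote the weight initialization distribution of each layer;
3. After a single layer the variance is $\sigma^2 = n * (D(I0) * D(I1))$, where $n$ is the input dimension and $D(\cdot)$ is the variance;
4. After $L$ layers the variance is $\sigma^2 * (n * D(I1))^{(L-1)}$, derived under the assumption that the dimension stays fixed at $n$ and that inputs and weights are zero-mean and independent;
5. From this formula one can derive how the variance changes under different settings;
6. The initialization of the input embedding and of each layer's weights therefore has to be chosen so that $n * D(I1)$ stays close to 1.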
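
The variance relation can be checked numerically. The sketch below is a plain-Python stand-in for the TensorFlow experiment above, not the notebook's own code: it pushes one random input through a few bias-free linear layers and compares the measured variance ratio with $(n \cdot D(I1))^L$. The helper name `simulate_linear_stack`, the 6-layer depth, and the three weight scales are illustrative assumptions.

```python
import math
import random
import statistics

def simulate_linear_stack(weight_std, n=128, num_layers=6, seed=0):
    """Return the activation variance at the input and after each linear layer."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(n)]   # input with D(I0) = 1
    variances = [statistics.pvariance(h)]
    for _ in range(num_layers):
        # One dense layer without bias or activation: every output unit is a
        # weighted sum of all n inputs with freshly sampled Gaussian weights.
        h = [sum(rng.gauss(0.0, weight_std) * x for x in h) for _ in range(n)]
        variances.append(statistics.pvariance(h))
    return variances

n = 128
results = {}
for label, scale in [("vanishing", 0.5), ("preserved", 1.0), ("exploding", 2.0)]:
    weight_std = scale / math.sqrt(n)          # so n * D(I1) = scale ** 2
    variances = simulate_linear_stack(weight_std, n=n)
    results[label] = variances[-1] / variances[0]
    predicted = (n * weight_std ** 2) ** (len(variances) - 1)
    print("{:9s} measured ratio = {:.3e}, predicted = {:.3e}".format(
        label, results[label], predicted))
```

Only the case with $n \cdot D(I1) = 1$ keeps the variance on the same order of magnitude. The other two drift by a factor of 4 per layer in opposite directions. The measured ratios are noisy because each layer has only 128 units.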

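## Xavier Initialization
A good initialization should keep the variance consistent across layers: the variance of the activations in the forward pass and the variance of the gradients in the backward pass.

Averaging the forward and backward conditions gives the target weight variance; sampling from a uniform distribution with that variance gives Xavier (Glorot) uniform initialization.

For a linear layer $D(h_{k-1}) = \sigma_{k-1}^2$, so the forward pass gives

$\sigma^2_k = n_{k-1} * D(w_k) * \sigma_{k-1}^2$

Keeping $\sigma_k^2 = \sigma_{k-1}^2$ requires

$D(w_k)=\frac{1}{n_{k-1}}$

The backward pass requires $D(w_k)=\frac{1}{n_k}$ by the same argument, and the compromise between the two is $D(w_k)=\frac{2}{n_{k-1}+n_k}$.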
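
To make the forward/backward compromise concrete, the sketch below computes the Glorot uniform limit $\sqrt{6/(fan_{in}+fan_{out})}$ and the resulting per-layer variance multipliers in each direction. The helper names and the 128 -> 512 layer shape are hypothetical, chosen only for illustration.

```python
import math

def glorot_uniform_limit(fan_in, fan_out):
    """Limit of U(-limit, limit) whose variance is 2 / (fan_in + fan_out)."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def variance_gains(fan_in, fan_out):
    """Per-layer variance multipliers for the forward and backward passes."""
    limit = glorot_uniform_limit(fan_in, fan_out)
    weight_var = limit ** 2 / 3.0          # variance of U(-limit, limit)
    return fan_in * weight_var, fan_out * weight_var

square_forward, square_backward = variance_gains(128, 128)
wide_forward, wide_backward = variance_gains(128, 512)
print("128 -> 128: forward = {:.3f}, backward = {:.3f}".format(square_forward, square_backward))
print("128 -> 512: forward = {:.3f}, backward = {:.3f}".format(wide_forward, wide_backward))
```

For a square layer both multipliers are 1. For 128 -> 512 they are 0.4 and 1.6: neither direction is preserved exactly, but they average to 1, which is the compromise Xavier initialization makes.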

In [None]:
feature_layer = tf.compat.v1.keras.layers.DenseFeatures(feature_columns=[emb])
dense_tensor = feature_layer(features, training=False)

In [None]:
# Same 80-layer linear stack, now with Xavier (Glorot) uniform initialization.
layer_num = 80
for index, units in enumerate([128]*layer_num):
    dense_tensor = tf.compat.v1.keras.layers.Dense(units, activation='linear',
                                                   kernel_initializer=tf.keras.initializers.GlorotUniform())(dense_tensor)
    mean, variance = tf.nn.moments(dense_tensor, axes=1)
    # Note: the value printed under the 'variabce' label is variance**0.5, i.e. the standard deviation.
    print('layer_{}_mean_{}_variabce_{}'.format(index, mean, variance**0.5))
prediction = tf.compat.v1.keras.layers.Dense(1)(dense_tensor)

layer_0_mean_[ 0.02483078 -0.00109796  0.01186643]_variabce_[0.82370484 0.948914   0.615857  ]
layer_1_mean_[-0.07208517  0.08645344  0.00718416]_variabce_[0.8847155 0.8629459 0.6368445]
layer_2_mean_[-0.03414366  0.11825216  0.04205421]_variabce_[0.9669925  0.83785677 0.6641544 ]
layer_3_mean_[-0.03470136  0.00367569 -0.01551282]_variabce_[1.0158526  0.87873864 0.7045144 ]
layer_4_mean_[0.04297611 0.00330246 0.02313926]_variabce_[0.99449486 0.923896   0.7148219 ]
layer_5_mean_[-0.05191755  0.00731733 -0.02230008]_variabce_[0.9446416 1.0590137 0.7847328]
layer_6_mean_[-0.06161759  0.10039135  0.01938676]_variabce_[1.0846543  0.9660863  0.75370353]
layer_7_mean_[-0.06990366  0.0233842  -0.0232597 ]_variabce_[1.0425826  0.89115554 0.7008463 ]
layer_8_mean_[ 0.11449258 -0.138902   -0.01220481]_variabce_[1.0296674  0.8736784  0.69986767]
layer_9_mean_[-0.1174293   0.05342923 -0.03199992]_variabce_[1.034592   0.70938236 0.64833784]
layer_10_mean_[ 0.00602157 -0.00570393  0.00015872]_variabc

### Conclusions
1. The input and output variances stay consistent.
2. However, all of this assumes the layers have no activation function.

# Parameter Initialization (with activation)
## sigmoid & tanh

In [None]:
feature_layer = tf.compat.v1.keras.layers.DenseFeatures(feature_columns=[emb])
dense_tensor = feature_layer(features, training=False)
layer_num = 80
for index, units in enumerate([128]*layer_num):
    # Shown with 'relu'; change the activation to 'sigmoid' or 'tanh' to try the other cases discussed below.
    dense_tensor = tf.compat.v1.keras.layers.Dense(units, activation='relu',
                                                   kernel_initializer=tf.keras.initializers.GlorotUniform())(dense_tensor)
    mean, variance = tf.nn.moments(dense_tensor, axes=1)
    # Note: the value printed under the 'variabce' label is variance**0.5, i.e. the standard deviation.
    print('layer_{}_mean_{}_variabce_{}'.format(index, mean, variance**0.5))
prediction = tf.compat.v1.keras.layers.Dense(1)(dense_tensor)

layer_0_mean_[0.38708365 0.4723001  0.3041571 ]_variabce_[0.54093164 0.61664486 0.42793804]
layer_1_mean_[0.2547238  0.31664765 0.203093  ]_variabce_[0.35807145 0.44605884 0.2941656 ]
layer_2_mean_[0.18215099 0.24574809 0.16300657]_variabce_[0.2651742  0.31129587 0.20953818]
layer_3_mean_[0.14823917 0.15288803 0.11309601]_variabce_[0.21272396 0.22477272 0.1561114 ]
layer_4_mean_[0.09851833 0.12605196 0.08142226]_variabce_[0.14044975 0.18249093 0.10882287]
layer_5_mean_[0.06090649 0.09730212 0.05504981]_variabce_[0.08821749 0.13933423 0.07545511]
layer_6_mean_[0.04671136 0.06696291 0.03718035]_variabce_[0.06388108 0.09618846 0.05271107]
layer_7_mean_[0.0325173  0.04154013 0.02335726]_variabce_[0.0461579  0.06434995 0.03571976]
layer_8_mean_[0.02049509 0.03298849 0.01551645]_variabce_[0.03293059 0.05112841 0.02288269]
layer_9_mean_[0.0167726  0.0240771  0.01162702]_variabce_[0.02301887 0.03129017 0.01543266]
layer_10_mean_[0.01164736 0.01712047 0.00759532]_variabce_[0.01635502 0.02622594

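### Conclusions
1. With sigmoid: the mean stays around 0.5 and the variance around 0.1.
2. With tanh: the mean stays at 0 and the variance shrinks gradually.
3. With relu: the mean and the variance both shrink toward 0.
4. Xavier initialization only suits saturating activations such as sigmoid and tanh; it does not carry over to the non-saturating relu.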
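
The relu decay can be quantified from the recorded output of the relu run. A common back-of-the-envelope argument, which assumes zero-mean, symmetric pre-activations, is that relu discards half of the second moment. With $fan_{in} \cdot D(w) = 1$ the activation scale should then shrink by about $\sqrt{1/2}$ per layer. The sketch below compares that factor with the geometric mean of the layer-to-layer ratio in the log; the numbers are copied from the first sample of layers 0 to 9.

```python
import math

# Printed spread (variance**0.5) of the first sample for layers 0..9,
# copied from the relu + GlorotUniform run above.
recorded_std = [0.54093164, 0.35807145, 0.2651742, 0.21272396, 0.14044975,
                0.08821749, 0.06388108, 0.0461579, 0.03293059, 0.02301887]

steps = len(recorded_std) - 1
# Geometric mean of the layer-to-layer shrink factor.
measured_factor = (recorded_std[-1] / recorded_std[0]) ** (1.0 / steps)
# relu keeps half of the second moment, so the scale shrinks by sqrt(1/2)
# per layer when fan_in * D(w) = 1.
expected_factor = math.sqrt(0.5)

print("measured per-layer factor = {:.4f}".format(measured_factor))
print("expected per-layer factor = {:.4f}".format(expected_factor))
```

Both factors come out near 0.7, so the vanishing activations are consistent with relu halving the signal energy at every layer under Xavier initialization.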

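## Activation function gain (proof to be added)
1. sigmoid: 1;
2. tanh: 5/3.
3. Drawbacks: exponential computation, soft saturation, and slow gradient updates.

![Activation function gain](https://pic4.zhimg.com/80/v2-5346cac5ab78831170c31afdda6c79a3_720w.jpg)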

### Kaiming Initialization
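
Kaiming (He) initialization compensates for relu by doubling the weight variance to $2/fan_{in}$, which for a uniform distribution means $limit = \sqrt{6/fan_{in}}$. The sketch below is a plain-Python illustration, not the TensorFlow cell that follows. It runs the same random relu stack once with the Glorot limit and once with the He limit and compares the mean squared activation at the end. The helper name, the 8-layer depth, and the seed are assumptions made for the example.

```python
import math
import random

def relu_stack_second_moment(limit, n=128, num_layers=8, seed=0):
    """Mean squared activation after num_layers relu layers with U(-limit, limit) weights."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(num_layers):
        # Dense layer without bias, followed by relu.
        h = [max(0.0, sum(rng.uniform(-limit, limit) * x for x in h))
             for _ in range(n)]
    return sum(x * x for x in h) / n

n = 128
xavier_limit = math.sqrt(6.0 / (n + n))   # Glorot uniform with fan_in = fan_out = n
he_limit = math.sqrt(6.0 / n)             # He uniform: variance 2 / fan_in

xavier_moment = relu_stack_second_moment(xavier_limit, n=n)
he_moment = relu_stack_second_moment(he_limit, n=n)
print("Xavier + relu: mean squared activation = {:.3e}".format(xavier_moment))
print("He + relu:     mean squared activation = {:.3e}".format(he_moment))
```

Because both runs share a seed and relu is positively homogeneous, the He activations are the Xavier activations scaled by $\sqrt{2}$ per layer. The He run stays on the order of the input, while the Xavier run loses roughly a factor of 2 in signal energy at every layer.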


In [None]:
feature_layer = tf.compat.v1.keras.layers.DenseFeatures(feature_columns=[emb])
dense_tensor = feature_layer(features, training=False)
layer_num = 80
for index, units in enumerate([128]*layer_num):
    # relu layers with He (Kaiming) uniform initialization: Var(w) = 2 / fan_in.
    dense_tensor = tf.compat.v1.keras.layers.Dense(units, activation='relu',
                                                   kernel_initializer=tf.keras.initializers.HeUniform())(dense_tensor)
    mean, variance = tf.nn.moments(dense_tensor, axes=1)
    # Note: the value printed under the 'variabce' label is variance**0.5, i.e. the standard deviation.
    print('layer_{}_mean_{}_variabce_{}'.format(index, mean, variance**0.5))
prediction = tf.compat.v1.keras.layers.Dense(1)(dense_tensor)

layer_0_mean_[0.6815914  0.44994766 0.4279186 ]_variabce_[0.93412244 0.74842    0.633804  ]
layer_1_mean_[0.74655604 0.44662625 0.44490612]_variabce_[0.9562693 0.6873715 0.5816312]
layer_2_mean_[0.6344416  0.4201606  0.37366065]_variabce_[0.95291287 0.6056135  0.5512138 ]
layer_3_mean_[0.63523066 0.4870914  0.38921013]_variabce_[0.9148161  0.64915854 0.55019027]
layer_4_mean_[0.6661496  0.42441612 0.38301337]_variabce_[0.8732549  0.61280215 0.5185682 ]
layer_5_mean_[0.5462897  0.37623435 0.3140301 ]_variabce_[0.8165627  0.5244592  0.45257992]
layer_6_mean_[0.53636336 0.34057957 0.3263271 ]_variabce_[0.77413094 0.50370675 0.4690775 ]
layer_7_mean_[0.5432247  0.38242215 0.3228837 ]_variabce_[0.77435297 0.54743886 0.46645826]
layer_8_mean_[0.5245128  0.38176587 0.28827542]_variabce_[0.80272794 0.5410475  0.45088774]
layer_9_mean_[0.55005604 0.3363331  0.2887788 ]_variabce_[0.86950207 0.52417994 0.43467844]
layer_10_mean_[0.42597854 0.25813988 0.22418189]_variabce_[0.6896235  0.40841714 0.

# Current Model Setup
## Feature embedding initialization
Parameter initialization: **tf.keras.initializers.VarianceScaling(distribution='uniform')**
```
With `distribution="uniform"`, samples are drawn from a uniform distribution
  within [-limit, limit], with `limit = sqrt(3 * scale / n)`.
```  
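$$limit = \sqrt{\frac{3 * scale}{n}}$$
The variance of a uniform distribution on $[a, b]$ is $\frac{(b-a)^2}{12}$, so on $[-limit, limit]$ it is $\frac{limit^2}{3} = \frac{scale}{n}$.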
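
The limit formula also ties together the initializers used in the rest of this section. The sketch below is a hypothetical re-implementation of the formula, not the library's code. The `mode` names mirror the `mode` argument of Keras `VarianceScaling`. The sketch shows that `fan_avg` reproduces the Glorot uniform limit and that `scale=2` with `fan_in` reproduces the He uniform limit; the layer shape 128 -> 64 is an illustrative assumption.

```python
import math

def variance_scaling_uniform_limit(fan_in, fan_out, scale=1.0, mode="fan_in"):
    """Limit of U(-limit, limit) following limit = sqrt(3 * scale / n)."""
    if mode == "fan_in":
        n = fan_in
    elif mode == "fan_out":
        n = fan_out
    elif mode == "fan_avg":
        n = (fan_in + fan_out) / 2.0
    else:
        raise ValueError("unknown mode: {}".format(mode))
    return math.sqrt(3.0 * scale / n)

fan_in, fan_out = 128, 64
default_limit = variance_scaling_uniform_limit(fan_in, fan_out)
glorot_limit = variance_scaling_uniform_limit(fan_in, fan_out, mode="fan_avg")
he_limit = variance_scaling_uniform_limit(fan_in, fan_out, scale=2.0)

# Variance of U(-limit, limit) is (2 * limit)^2 / 12 = limit^2 / 3 = scale / n.
default_var = (2.0 * default_limit) ** 2 / 12.0
print("fan_in limit  = {:.4f}, variance = {:.6f}".format(default_limit, default_var))
print("fan_avg limit = {:.4f} (Glorot uniform)".format(glorot_limit))
print("scale=2 limit = {:.4f} (He uniform)".format(he_limit))
```

Seen this way, the parts of the model listed in this section differ only in which `scale` and which fan count they plug into the same formula.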


## Hidden layers before MMoE
1. Parameter initialization: **tf.compat.v1.glorot_uniform_initializer()**
2. Activation function: relu
```
Draws samples from a uniform distribution within [-limit, limit] where `limit` is `sqrt(6 / (fan_in + fan_out))` where `fan_in` is the number of input units in the weight tensor and `fan_out` is the number of output units in the weight tensor.
```
$$limit = \sqrt{\frac{6}{fan_{in} + fan_{out}}}$$

## MMoE expert part
1. Parameter initialization: **VarianceScaling**
2. Activation function: relu
$$limit = \sqrt{\frac{3 * 1}{n}}$$


## multi_head DNN part
1. Parameter initialization: **glorot_uniform**
2. Activation function: relu
$$limit = \sqrt{\frac{6}{fan_{in} + fan_{out}}}$$



## Output part
1. Parameter initialization: **tf.keras.initializers.VarianceScaling()**
2. Activation function: none (i.e. linear); the sigmoid is applied when the loss is computed.
$$limit = \sqrt{\frac{3 * 1}{n}}$$


# reference
https://zhuanlan.zhihu.com/p/148034113