# How to make a deep neural network stable?

## Xavier initialization

Consider a single MLP layer:

$y = Wx + b$

If the entries of $x$ and $W$ are independent with zero mean and unit variance, then (taking $b = 0$) each entry of $y$ has zero mean and variance $d_{in}$.

The variance will therefore explode as the neural network gets deeper.
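
A short derivation makes the factor $d_{in}$ explicit. It is only a sketch and assumes that all entries of $W$ and $x$ are mutually independent with zero mean and that $b = 0$:

$\mathrm{Var}(y_i) = \sum_{j=1}^{d_{in}} \mathrm{Var}(W_{ij} x_j) = d_{in} \, \mathrm{Var}(W_{ij}) \, \mathrm{Var}(x_j)$

With unit variances this gives $\mathrm{Var}(y_i) = d_{in}$. Stacking $L$ such layers of equal width (ignoring activations) multiplies the variance by $d_{in}$ at every layer, so it grows as $d_{in}^{L}$. The factor becomes 1 when $\mathrm{Var}(W_{ij}) = 1/d_{in}$.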

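In [this paper](https://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) (Glorot and Bengio, 2010), a weight initialization method is proposed. Instead of drawing the weights from a unit-variance uniform distribution such as $U[-\sqrt{3}, \sqrt{3}]$, we draw them from $U[-\sqrt{3/d_{in}}, \sqrt{3/d_{in}}]$, which has variance $1/d_{in}$ (a uniform distribution on $[-a, a]$ has variance $a^2/3$). The paper's normalized variant uses $U[-\sqrt{6/(d_{in}+d_{out})}, \sqrt{6/(d_{in}+d_{out})}]$ to balance the forward and backward passes.

This keeps the variance of the activations approximately constant from layer to layer.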


In [2]:
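# Test the idea of Xavier initialization

import math

import torch

d_in = 10
d_out = 10

# Inputs with zero mean and unit variance
x = torch.randn(1000, d_in)
# W1 ~ U[-sqrt(3), sqrt(3)] has unit variance; shape (d_out, d_in) matches y = W x
W1 = (torch.rand(d_out, d_in) - 0.5) * 2 * math.sqrt(3)
# W2 ~ U[-sqrt(3/d_in), sqrt(3/d_in)] has variance 1/d_in
W2 = W1 / math.sqrt(d_in)
# y[i, k] = sum_j x[i, j] * W[k, j]
y1 = torch.einsum('ij,kj->ik', x, W1)
y2 = torch.einsum('ij,kj->ik', x, W2)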


In [3]:
print(x.var(), W1.var(), y1.var())
print(x.var(), W2.var(), y2.var())

tensor(1.0002) tensor(0.9782) tensor(9.6752)
tensor(1.0002) tensor(0.0978) tensor(0.9675)
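
The single-layer test above can be extended to several layers to see the effect of depth. The following is a minimal sketch, not part of the original experiment: it stacks purely linear layers (no bias, no activation) and compares unit-variance weights with weights scaled by $1/\sqrt{d_{in}}$. The helper name `variance_through_depth`, the width of 100, and the depth of 5 are illustrative assumptions.

```python
import numpy as np


def variance_through_depth(scale_weights, depth=5, width=100, n=1000, seed=0):
    """Return the activation variance after each of `depth` linear layers."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, width))  # zero mean, unit variance inputs
    variances = []
    for _ in range(depth):
        # Unit-variance uniform weights on [-sqrt(3), sqrt(3)], shape (d_out, d_in)
        W = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(width, width))
        if scale_weights:
            W = W / np.sqrt(width)  # variance becomes 1/d_in
        x = x @ W.T
        variances.append(float(x.var()))
    return variances


unscaled = variance_through_depth(scale_weights=False)
scaled = variance_through_depth(scale_weights=True)
print("unscaled:", unscaled)
print("scaled:  ", scaled)
```

Because both runs use the same seed, the two weight sequences differ only by the factor $1/\sqrt{d_{in}}$, so the unscaled variance at layer $k$ is $d_{in}^{k}$ times the scaled one, while the scaled variance should stay close to 1.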


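## Variance change caused by the activation function

If the pre-activation is normally distributed (zero mean), ReLU reduces the variance to about 0.341 of its original value.

If the pre-activation is uniformly distributed (zero mean), ReLU reduces the variance to about 0.3125 of its original value. In both cases the second moment $E[\mathrm{ReLU}(z)^2]$ is exactly 0.5 of $E[z^2]$, which is the factor that He initialization compensates for.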
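
These ratios can be checked in closed form. As a sketch, assume $z$ is symmetric about zero with $E[z^2] = 1$. By symmetry, ReLU keeps exactly half of the second moment:

$E[\mathrm{ReLU}(z)^2] = \tfrac{1}{2} E[z^2] = \tfrac{1}{2}$

The variance additionally subtracts the squared mean, which depends on the distribution. For $z \sim N(0, 1)$, $E[\mathrm{ReLU}(z)] = 1/\sqrt{2\pi}$, so

$\mathrm{Var}(\mathrm{ReLU}(z)) = \tfrac{1}{2} - \tfrac{1}{2\pi} \approx 0.341$

For $z \sim U[-\sqrt{3}, \sqrt{3}]$, $E[\mathrm{ReLU}(z)] = \sqrt{3}/4$, so

$\mathrm{Var}(\mathrm{ReLU}(z)) = \tfrac{1}{2} - \tfrac{3}{16} = \tfrac{5}{16} = 0.3125$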

In [37]:
import numpy as np

# He initialization (suited to ReLU)
def he_init(dim_in, dim_out):
    return np.random.randn(dim_in, dim_out) * np.sqrt(2.0 / dim_in)

# ReLU activation function
def relu(x):
    return np.maximum(0, x)

# MLP layer
class MLPLayer:
    def __init__(self, input_dim, output_dim):
        self.W = he_init(input_dim, output_dim)  # He-initialized weights
        self.b = np.zeros(output_dim)            # biases initialized to 0

    def forward(self, x):
        return np.dot(x, self.W) + self.b

# Standard MLP network
class MLP:
    def __init__(self, layer_dims):
        self.layers = []
        for i in range(len(layer_dims) - 1):
            layer = MLPLayer(layer_dims[i], layer_dims[i + 1])
            self.layers.append(layer)

    def forward(self, x):
        for layer in self.layers:
            x = layer.forward(x)
            print('var before relu', x.var())
            # ReLU zeroes the negative entries; it is applied after every
            # layer here, including the last, so each variance can be printed
            x = relu(x)
            print('var after relu', x.var())
        return x

# Network structure and input data
input_dim = 100
hidden_dims = [100, 100, 100]  # 3 hidden layers with 100 neurons each
output_dim = 100
layer_dims = [input_dim] + hidden_dims + [output_dim]

# Initialize the MLP
mlp = MLP(layer_dims)

# Generate input data (mean 0, variance 1)
num_samples = 1000
x = np.random.randn(num_samples, input_dim)
print('var input', x.var())
# Forward pass
output = mlp.forward(x)

var input 0.9954899937768414
var before relu 1.9777306782911959
var after relu 0.6738203141509091
var before relu 1.9375079691397927
var after relu 0.6737599491212752
var before relu 1.9557270470001271
var after relu 0.6171111721672213
var before relu 1.7624774460636328
var after relu 0.5390921383593544
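
In the output above the variance is about 2 before each ReLU and about 0.67 after it, so the variance itself is not equal to 1 at either point. What He initialization holds roughly constant is the second moment $E[x^2]$ of the post-ReLU activations. The following sketch tracks that quantity directly; the function name `second_moments_with_he_init`, the seed, and the tolerance implied by "close to 1" are assumptions made for illustration.

```python
import numpy as np


def second_moments_with_he_init(width=100, depth=4, n=1000, seed=0):
    """Return E[x^2] after each He-initialized linear layer followed by ReLU."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, width))  # E[x^2] is about 1 at the input
    moments = []
    for _ in range(depth):
        # He initialization: normal weights with variance 2 / d_in
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        x = np.maximum(0, x @ W)  # linear layer (zero bias) followed by ReLU
        moments.append(float(np.mean(x ** 2)))
    return moments


moments = second_moments_with_he_init()
print("E[x^2] after each layer:", moments)
```

Each value should stay near 1: the factor 2 in the weight variance cancels the factor 0.5 that ReLU removes from the second moment.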


In [53]:
import numpy as np

# Generate normally distributed random data (mean 0, variance 1)
num_samples = 1000000
z = np.random.randn(num_samples)

# Variance before ReLU
var_before = np.var(z)

# Apply ReLU
relu_z = np.maximum(0, z)

# Variance after ReLU
var_after = np.var(relu_z)

print("Variance before ReLU:", var_before)
print("Variance after ReLU:", var_after)
print("Variance ratio:", var_after / var_before)

Variance before ReLU: 0.9993980444129914
Variance after ReLU: 0.34012237903980674
Variance ratio: 0.3403272409238921


In [60]:
import torch
import math
# Generate uniformly distributed random data on [-sqrt(3), sqrt(3)] (mean 0, variance 1)
num_samples = 1000000
z = 2*math.sqrt(3)*(torch.rand(num_samples) - 0.5)

# Variance before ReLU
var_before = z.var()

# Apply ReLU
relu_z = torch.relu(z)

# Variance after ReLU
var_after = relu_z.var()

print("Variance before ReLU:", var_before)
print("Variance after ReLU:", var_after)
print("Variance ratio:", var_after / var_before)

Variance before ReLU: tensor(0.9988)
Variance after ReLU: tensor(0.3120)
Variance ratio: tensor(0.3124)


## Norm test for radial basis functions (RBF)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def gaussian_rbf_embedding(r, centers, beta):
    """
    Compute the RBF embedding.
    :param r: array of input distances, shape (N,)
    :param centers: array of RBF centers, shape (M,)
    :param beta: RBF width parameter (scalar or array of shape (M,))
    :return: embedding matrix of shape (N, M)
    """
    r = r[:, np.newaxis]  # reshape to (N, 1) for broadcasting
    return np.exp(-beta * (r - centers) ** 2)

# ====== Configuration ======
num_samples = 1000  # number of sampled points
num_rbfs = 10       # RBF dimension
r_min, r_max = 0, 5 # value range
beta = 2.0          # RBF width

# Sample the input r from a distribution, e.g. normal or uniform
r_samples = np.random.normal(loc=2.5, scale=1.0, size=num_samples)  # normal distribution
r_samples = np.clip(r_samples, r_min, r_max)  # clip to the range

# Set the RBF centers (evenly spaced over [r_min, r_max])
centers = np.linspace(r_min, r_max, num_rbfs)

# Compute the RBF embedding
embeddings = gaussian_rbf_embedding(r_samples, centers, beta)

# Compute the mean and variance of each RBF dimension
embedding_means = np.mean(embeddings, axis=0)
embedding_vars = np.var(embeddings, axis=0)
'''
# ====== Visualization ======
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Distribution of the raw r
axes[0].hist(r_samples, bins=30, alpha=0.7, color='blue', edgecolor='black')
axes[0].set_title("Input $r$ Distribution")
axes[0].set_xlabel("$r$")
axes[0].set_ylabel("Frequency")

# Distribution of the embedding
for i in range(num_rbfs):
    axes[1].hist(embeddings[:, i], bins=30, alpha=0.5, label=f'RBF {i+1}')

axes[1].set_title("RBF Embedding Distributions")
axes[1].set_xlabel("Embedding Value")
axes[1].set_ylabel("Frequency")
axes[1].legend()

plt.tight_layout()
plt.show()
'''
# Print the embedding statistics
print("input r mean:", r_samples.mean())
print("input r variance:", r_samples.var())
print("RBF Embedding Means:", embedding_means)
print("RBF Embedding Variances:", embedding_vars)


input r mean: 4.864406698692232
input r variance: 0.4202306512605468
RBF Embedding Means: [0.01020935 0.00686296 0.00401549 0.00819955 0.0145136  0.01811526
 0.02413803 0.10123646 0.52405653 0.94740698]
RBF Embedding Variances: [0.00977696 0.00412055 0.00153939 0.00546363 0.01064526 0.01264845
 0.01375641 0.01292222 0.01195515 0.04679431]
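
The statistics above show that the raw RBF features are far from zero mean and unit variance, which is the input assumption used in the initialization analysis earlier. One possible remedy, offered only as a sketch and not something these notes prescribe, is to standardize each RBF dimension with statistics estimated from sample distances. The helper `standardize_embedding` and its `eps` value are hypothetical; the sampling parameters follow the code cell above.

```python
import numpy as np


def gaussian_rbf_embedding(r, centers, beta):
    """Map distances r of shape (N,) to RBF features of shape (N, M)."""
    return np.exp(-beta * (r[:, np.newaxis] - centers) ** 2)


def standardize_embedding(embeddings, eps=1e-8):
    """Shift and scale each column to zero mean and approximately unit variance."""
    mean = embeddings.mean(axis=0)
    std = embeddings.std(axis=0)
    return (embeddings - mean) / (std + eps)  # eps guards against a constant column


rng = np.random.default_rng(0)
r_samples = np.clip(rng.normal(loc=2.5, scale=1.0, size=1000), 0, 5)
centers = np.linspace(0, 5, 10)
embeddings = gaussian_rbf_embedding(r_samples, centers, beta=2.0)

normalized = standardize_embedding(embeddings)
print("means:", normalized.mean(axis=0).round(3))
print("variances:", normalized.var(axis=0).round(3))
```

In practice the mean and standard deviation would be estimated once on representative data and then reused, so that training and inference see the same scaling.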
