#### Batch Normalization

During deep learning training, the distribution of the data keeps changing as it passes through each layer.

##### Problems with Deep Neural Networks (DNN)

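- Q. Doesn't stacking more layers just make the model better?

    - Reality: as the network gets deeper, training often stops working or becomes very slow.
    - Cause: the `distribution` of the data swings erratically as it passes through the network.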

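- P. Vanishing and exploding gradients

    - Each time the data passes through a layer, the input $x$ is multiplied by the weights $W$. Repeated over many layers, these products can shrink toward zero or grow without bound, depending on the scale of $W$.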
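
The effect of repeated multiplication can be sketched with a toy example. This is a simplification, not a real network: a single scalar stands in for the activation, and the same hypothetical weight `w` is applied at every layer.

```python
# Toy illustration: a scalar "activation" multiplied by the same weight at every layer.
# The weights 0.5 and 1.5 and the depth of 50 are arbitrary illustrative choices.
def propagate(x, w, num_layers):
    """Multiply x by w once per layer and return the final value."""
    for _ in range(num_layers):
        x = w * x
    return x

shrinking = propagate(1.0, 0.5, 50)  # |w| < 1: the value collapses toward 0
growing = propagate(1.0, 1.5, 50)    # |w| > 1: the value blows up

print(f"w=0.5 after 50 layers: {shrinking:.3e}")
print(f"w=1.5 after 50 layers: {growing:.3e}")
```

Gradients flowing backward are multiplied by the same kind of factors, so they are exposed to the same shrinking or growing behavior.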


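Input: $x$ (batch data), shape `[batch_size, num_features]`

1. Compute the batch mean:
   $\mu = \frac{1}{m} \sum_{i=1}^{m} x_i$
   Here `m = batch_size`; the mean is computed separately for each feature.

2. Compute the batch variance:
   $\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2$
   The variance is also computed separately for each feature.

3. Normalize:
   $\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$
   $\epsilon$ (epsilon) prevents division by zero (here `1e-5`).

4. Scale and shift:
   $y_i = \gamma \hat{x}_i + \beta$
   $\gamma$ (gamma) is the scale and $\beta$ (beta) is the shift.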
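
As a worked example, the four steps can be traced on a single feature column. The values below are chosen only for illustration, and `gamma = 1`, `beta = 0` are assumed as the usual identity starting point.

```python
import math

# One feature column from a batch of m = 3 samples (illustrative values).
column = [1.0, 3.0, 5.0]
eps = 1e-5              # same epsilon as in step 3
gamma, beta = 1.0, 0.0  # assumed identity scale and shift

m = len(column)
mu = sum(column) / m                                        # step 1: batch mean
var = sum((v - mu) ** 2 for v in column) / m                # step 2: batch variance (divides by m)
x_hat = [(v - mu) / math.sqrt(var + eps) for v in column]   # step 3: normalize
y = [gamma * h + beta for h in x_hat]                       # step 4: scale and shift

print("mu:", mu)
print("var:", var)
print("x_hat:", x_hat)
print("y:", y)
```

With `gamma = 1` and `beta = 0` the output `y` equals `x_hat`; training is then free to move both parameters away from the identity.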

In [3]:
import math

class BatchNorm:
    def __init__(self, num_features):
        # num_features: number of features (out_features of the preceding Linear layer)
        self.num_features = num_features
        self.eps = 1e-5

        # Learnable parameters
        self.gamma = [1.0]*num_features  # Scale, initialized to 1
        self.beta = [0.0]*num_features   # Shift, initialized to 0

        # Values cached for the backward pass
        self.x = None
        self.x_normalized = None
        self.mu = None
        self.var = None
        self.std = None

    def forward(self, x):
        # x: nested list of floats, shape [batch_size, num_features]
        batch_size = len(x)
        self.x = x  # cache the input for the backward pass

        # 1. Batch mean (per feature)
        self.mu = []
        for j in range(self.num_features):
            sum_val = 0
            for i in range(batch_size):
                sum_val += x[i][j]
            mu = sum_val / batch_size
            self.mu.append(mu)

        # 2. Batch variance (per feature)
        self.var = []
        for j in range(self.num_features):
            sum_squared_diff = 0
            for i in range(batch_size):
                diff = x[i][j] - self.mu[j]
                sum_squared_diff += diff ** 2
            variance = sum_squared_diff / batch_size
            self.var.append(variance)
        
        # 3. Standard deviation (eps added for numerical stability)
        self.std = []
        for j in range(self.num_features):
            std = math.sqrt(self.var[j] + self.eps)
            self.std.append(std)
        
        # 4. Normalize
        self.x_normalized = []
        for i in range(batch_size):
            normalized_row = []
            for j in range(self.num_features):
                x_norm = (x[i][j] - self.mu[j]) / self.std[j]
                normalized_row.append(x_norm)
            self.x_normalized.append(normalized_row)
        
        # 5. Scale and Shift
        output = []
        for i in range(batch_size):
            output_row = []
            for j in range(self.num_features):
                out_val = self.gamma[j] * self.x_normalized[i][j] + self.beta[j]
                output_row.append(out_val)
            output.append(output_row)
    
        return output

    def backward(self, grad_output):
        # grad_output : [batch_size, num_features]
        batch_size = len(grad_output)
        
        # 1. Gradients with respect to gamma and beta
        self.grad_gamma = []
        self.grad_beta = []
        
        for j in range(self.num_features):
            grad_gamma_j = 0
            grad_beta_j = 0
            for i in range(batch_size):
                grad_gamma_j += grad_output[i][j] * self.x_normalized[i][j]
                grad_beta_j += grad_output[i][j]
            self.grad_gamma.append(grad_gamma_j)
            self.grad_beta.append(grad_beta_j)
        
        # 2. Gradient with respect to x_normalized
        grad_x_normalized = []
        for i in range(batch_size):
            grad_x_norm_row = []
            for j in range(self.num_features):
                grad_x_norm = grad_output[i][j] * self.gamma[j]
                grad_x_norm_row.append(grad_x_norm)
            grad_x_normalized.append(grad_x_norm_row)
        
        # 3. Gradient with respect to std
        grad_std = []
        for j in range(self.num_features):
            grad_std_j = 0
            for i in range(batch_size):
                grad_std_j += grad_x_normalized[i][j] * (self.x[i][j] - self.mu[j]) * (-1.0 / (self.std[j] ** 2))
            grad_std.append(grad_std_j)
        
        # 4. Gradient with respect to var
        grad_var = []
        for j in range(self.num_features):
            grad_var_j = grad_std[j] * 0.5 / math.sqrt(self.var[j] + self.eps)
            grad_var.append(grad_var_j)
        
        # 5. Gradient with respect to mu
        grad_mu = []
        for j in range(self.num_features):
            grad_mu_j = 0
            # direct path through x_normalized
            for i in range(batch_size):
                grad_mu_j += grad_x_normalized[i][j] * (-1.0 / self.std[j])
            # path through var
            grad_mu_j += grad_var[j] * (-2.0 / batch_size) * sum(self.x[i][j] - self.mu[j] for i in range(batch_size))
            grad_mu.append(grad_mu_j)
        
        # 6. Gradient with respect to x
        grad_x = []
        for i in range(batch_size):
            grad_x_row = []
            for j in range(self.num_features):
                # path through x_normalized
                grad_x_ij = grad_x_normalized[i][j] / self.std[j]
                # path through var
                grad_x_ij += grad_var[j] * (2.0 / batch_size) * (self.x[i][j] - self.mu[j])
                # path through mu
                grad_x_ij += grad_mu[j] / batch_size
                grad_x_row.append(grad_x_ij)
            grad_x.append(grad_x_row)
        
        return grad_x
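
One way to gain confidence in a hand-written backward pass is to compare it against finite differences. The sketch below is a standalone check for a single feature column, not part of the class above: the helper names `bn_forward`, `bn_backward_compact`, and `numeric_grad` are hypothetical, and the compact formula used is the algebraically collapsed form of the chain-rule steps, $\frac{\partial L}{\partial x_i} = \frac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \left( g_i - \operatorname{mean}(g) - \hat{x}_i \operatorname{mean}(g \hat{x}) \right)$, where $g$ is the upstream gradient.

```python
import math

def bn_forward(col, gamma, beta, eps=1e-5):
    """Batch-normalize one feature column; return output, x_hat, and std."""
    m = len(col)
    mu = sum(col) / m
    var = sum((v - mu) ** 2 for v in col) / m
    std = math.sqrt(var + eps)
    x_hat = [(v - mu) / std for v in col]
    y = [gamma * h + beta for h in x_hat]
    return y, x_hat, std

def bn_backward_compact(grad_y, x_hat, std, gamma):
    """dx_i = gamma / std * (g_i - mean(g) - x_hat_i * mean(g * x_hat))."""
    m = len(grad_y)
    mean_g = sum(grad_y) / m
    mean_g_xhat = sum(g * h for g, h in zip(grad_y, x_hat)) / m
    return [gamma / std * (g - mean_g - h * mean_g_xhat)
            for g, h in zip(grad_y, x_hat)]

def numeric_grad(col, grad_y, gamma, beta, step=1e-6):
    """Central differences of L = sum(grad_y * y) with respect to each input."""
    grads = []
    for i in range(len(col)):
        plus = list(col)
        plus[i] += step
        minus = list(col)
        minus[i] -= step
        y_plus, _, _ = bn_forward(plus, gamma, beta)
        y_minus, _, _ = bn_forward(minus, gamma, beta)
        loss_plus = sum(g * v for g, v in zip(grad_y, y_plus))
        loss_minus = sum(g * v for g, v in zip(grad_y, y_minus))
        grads.append((loss_plus - loss_minus) / (2 * step))
    return grads

# Illustrative inputs; grad_y is deliberately not a linear function of col.
col = [1.0, 2.0, 6.0]
grad_y = [0.3, -0.1, 0.5]
gamma, beta = 1.5, 0.2

_, x_hat, std = bn_forward(col, gamma, beta)
analytic = bn_backward_compact(grad_y, x_hat, std, gamma)
numeric = numeric_grad(col, grad_y, gamma, beta)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))

print("analytic:", analytic)
print("numeric: ", numeric)
print("max abs difference:", max_err)
```

If the two gradients agree to several decimal places, the analytic derivation is very likely correct. The same comparison could be run against the `backward` method of the class by feeding it one column at a time.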

In [4]:
# Test
bn = BatchNorm(num_features=2)

x = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0]
]

output = bn.forward(x)
print("Output:", output)

grad_output = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6]
]

grad_x = bn.backward(grad_output)
print("Grad x:", grad_x)
print("Grad gamma:", bn.grad_gamma)
print("Grad beta:", bn.grad_beta)

Output: [[-1.2247425750014138, -1.2247425750014138], [0.0, 0.0], [1.2247425750014138, 1.2247425750014138]]
Grad x: [[-4.592767433309053e-07, -4.5927674327539414e-07], [0.0, 0.0], [4.592767433309053e-07, 4.5927674327539414e-07]]
Grad gamma: [0.4898970300005655, 0.4898970300005655]
Grad beta: [0.9, 1.2000000000000002]
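
As a quick sanity check on these numbers, each output column should have a mean close to 0 and a variance just under 1 (just under, because $\epsilon$ is added to the variance before taking the square root). The sketch below simply recomputes those statistics from the printed output, copied in by hand.

```python
# Output of bn.forward(x), copied from the run above.
output = [
    [-1.2247425750014138, -1.2247425750014138],
    [0.0, 0.0],
    [1.2247425750014138, 1.2247425750014138],
]

m = len(output)
stats = []
for j in range(len(output[0])):
    col = [row[j] for row in output]
    mean = sum(col) / m
    var = sum((v - mean) ** 2 for v in col) / m
    stats.append((mean, var))
    print(f"feature {j}: mean={mean:.6f}, var={var:.6f}")
```

The near-zero `Grad x` is also expected for this particular test. The backward pass of batch normalization removes the per-feature mean of the upstream gradient and its component along $\hat{x}$; since `grad_output` here is an exact linear function of `x` within each feature, nearly everything cancels and only a residual on the order of $\epsilon$ remains. A `grad_output` that is not linear in `x` would give a more informative test.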
