### Weight Initialization Methods

Weight initialization has a large effect on how well a deep network trains. Why?
*  If weights start too small, signals shrink from layer to layer → vanishing gradients.
*  If weights start too large, signals grow from layer to layer → exploding gradients.
*  Both make training painfully slow or unstable.

<b>Note: "signal" refers to the information flowing through the network. During forward propagation, the signal is a neuron's output (its activation); during backpropagation, it is the gradient.</b>

---

<b> Vanishing Gradients</b>
*  If weights are initialized too small, the signals (gradients) shrink exponentially as they propagate backward through the network.
*  Result: Early layers receive tiny updates (their gradients approach zero), so they learn very slowly or stop learning entirely.
*  In short: small weights → small gradients → a product of many small numbers → near-zero gradients.
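
The chain of small multiplications can be made concrete with a minimal NumPy sketch. It assumes a plain ReLU MLP with a hypothetical depth of 30 and width of 256 (neither value comes from the text above), backpropagates a unit-norm gradient by hand, and compares a fixed standard deviation of 0.01 with the He value sqrt(2 / fan_in). The function name `first_layer_grad_norm` is made up for this illustration.

```python
import numpy as np

def first_layer_grad_norm(weight_std, depth=30, width=256, seed=0):
    """Backpropagate through a ReLU MLP and return the gradient norm that reaches the first layer."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(0.0, weight_std, size=(width, width)) for _ in range(depth)]

    # Forward pass: keep each pre-activation so ReLU's derivative can be applied later.
    h = rng.normal(size=width)
    pre_acts = []
    for W in weights:
        z = W @ h
        pre_acts.append(z)
        h = np.maximum(z, 0.0)

    # Backward pass: start from a unit-norm gradient at the output and apply the chain rule.
    grad = np.ones(width) / np.sqrt(width)
    for W, z in zip(reversed(weights), reversed(pre_acts)):
        grad = W.T @ (grad * (z > 0))
    return np.linalg.norm(grad)

small = first_layer_grad_norm(weight_std=0.01)
he = first_layer_grad_norm(weight_std=np.sqrt(2.0 / 256))
print(f"std=0.01: {small:.3e}   He std: {he:.3e}")
```

With these assumed sizes, each layer scales the gradient by roughly sqrt(fan_in × std² / 2), so the 0.01 run should print a value many orders of magnitude smaller than the He run.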
---

<b>Exploding Gradients</b>
*  If weights are initialized too large, signals (gradients) grow exponentially during backpropagation.
*  Result: Gradients become extremely large, causing:
   *  Unstable training (loss fluctuates wildly).
   *  Numerical overflow (e.g., NaN values).
   *  Oversized weight updates that destabilize the model.
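
The numerical-overflow case is easy to reproduce. The sketch below is an illustration under assumed settings (float32, width 256, up to 100 ReLU layers, and a hypothetical helper named `forward_until_overflow`): it runs a forward pass and reports the first layer whose output contains `inf` or `NaN`.

```python
import numpy as np

def forward_until_overflow(weight_std, depth=100, width=256, seed=0):
    """Run a float32 ReLU forward pass; return the first layer with non-finite output, or None."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=width).astype(np.float32)
    with np.errstate(over="ignore", invalid="ignore"):  # silence overflow warnings for the demo
        for layer in range(1, depth + 1):
            W = rng.normal(0.0, weight_std, size=(width, width)).astype(np.float32)
            h = np.maximum(W @ h, 0.0)
            if not np.isfinite(h).all():
                return layer
    return None

large = forward_until_overflow(weight_std=1.0)
scaled = forward_until_overflow(weight_std=np.sqrt(2.0 / 256))
print("std=1.0 first non-finite layer:", large)
print("sqrt(2 / fan_in) first non-finite layer:", scaled)
```

With a standard deviation of 1.0 the activations grow by about an order of magnitude per layer under these assumptions, so float32 overflows well before the last layer, while the sqrt(2 / fan_in) scale should stay finite throughout.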


#### Solutions
<b>For Vanishing Gradients:</b>
*  Use ReLU/Leaky ReLU activations (avoid sigmoid/tanh for deep networks).
*  Batch Normalization stabilizes signal flow.
*  Residual connections (e.g., in ResNet) allow gradients to bypass layers.
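
A minimal PyTorch sketch of two of these remedies combined, Batch Normalization and a residual connection. The `ResidualBlock` class, the width of 32, and the stack of 20 blocks are assumptions chosen for illustration; this is not the ResNet architecture itself.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Hypothetical block: Linear -> BatchNorm -> ReLU, plus an identity skip connection."""
    def __init__(self, width):
        super().__init__()
        self.fc = nn.Linear(width, width)
        self.bn = nn.BatchNorm1d(width)

    def forward(self, x):
        # The "+ x" term gives gradients a path that bypasses fc and bn entirely.
        return x + torch.relu(self.bn(self.fc(x)))

torch.manual_seed(0)
net = nn.Sequential(*[ResidualBlock(32) for _ in range(20)])
x = torch.randn(16, 32, requires_grad=True)
net(x).sum().backward()
print("gradient norm at the input:", x.grad.norm().item())
```

Because every block adds its input back to its output, the gradient reaching `x` always includes the contribution of the identity path, regardless of how the `Linear` weights are scaled.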
---

<b>For Exploding Gradients:</b>
*  Gradient Clipping: Cap the gradient norm (or individual gradient values) after backpropagation and before the weight update.
*  Weight Regularization (e.g., L2 regularization).
*  Careful weight initialization (e.g., Xavier/Glorot for tanh/sigmoid, He initialization for ReLU).
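
Gradient clipping and L2 regularization can be sketched in one PyTorch training step. The model, the random placeholder batch, `max_norm=1.0`, and `weight_decay=1e-4` are assumptions for illustration only; `torch.nn.utils.clip_grad_norm_` and the optimizer's `weight_decay` argument are the library calls doing the work.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
# weight_decay adds an L2 penalty on the parameters inside the optimizer update.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

# Placeholder batch; the inputs are scaled up on purpose to produce large gradients.
inputs = torch.randn(32, 4) * 100.0
targets = torch.randint(0, 3, (32,))

optimizer.zero_grad()
loss = criterion(model(inputs), targets)
loss.backward()

# Rescale all gradients so that their combined L2 norm is at most max_norm.
# clip_grad_norm_ returns the norm measured before clipping.
max_norm = 1.0
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()
print(f"grad norm before: {norm_before.item():.2f}, after: {norm_after.item():.2f}")
```

Clipping must happen between `loss.backward()` and `optimizer.step()`; clipping after the step has no effect on that update.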
---



### Common Initialization Methods
<table>
<tr><th>Method</th><th>Formula</th><th>Best for</th><th>Reason</th></tr>
<tr><td>Xavier (Glorot)</td><td>Var(W) = 2 / (fan_in + fan_out)</td><td>tanh / sigmoid activations</td><td>Keeps activation and gradient variance roughly constant across layers</td></tr>
<tr><td>He (Kaiming)</td><td>Var(W) = 2 / fan_in</td><td>ReLU activations</td><td>ReLU zeroes roughly half of its inputs, so the variance is doubled to compensate</td></tr>
<tr><td>Uniform (unscaled)</td><td>Small random values from a fixed range</td><td>Older/simple models</td><td>Scale ignores layer size, so deep networks risk vanishing/exploding signals</td></tr>
<tr><td>Zeros</td><td>All weights = 0</td><td>❌ Never use for weights (zero biases are fine)</td><td>No symmetry breaking: every neuron in a layer computes the same output and receives the same gradient</td></tr>
</table>

*  fan_in = number of input neurons to the layer
*  fan_out = number of output neurons from the layer
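
As a sanity check, the formulas in the table can be compared with what PyTorch's initializers produce. The layer size (256 inputs, 512 outputs) is an arbitrary assumption; the sketch measures the standard deviation of the sampled weights and compares it with the square root of the variance given above.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(in_features=256, out_features=512)  # weight shape is (fan_out, fan_in)
fan_in, fan_out = layer.in_features, layer.out_features

nn.init.xavier_normal_(layer.weight)
xavier_expected = math.sqrt(2.0 / (fan_in + fan_out))
xavier_measured = layer.weight.std().item()

nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')  # mode='fan_in' is the default
he_expected = math.sqrt(2.0 / fan_in)
he_measured = layer.weight.std().item()

print(f"Xavier: expected std {xavier_expected:.4f}, measured {xavier_measured:.4f}")
print(f"He:     expected std {he_expected:.4f}, measured {he_measured:.4f}")
```

The measured values are sample estimates, so they should agree with the formulas closely but not exactly.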

---
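The "no symmetry breaking" entry in the table can be checked directly. The sketch below is an illustration under assumptions: a tiny regression network (4 inputs, 8 tanh hidden units, 1 output), random placeholder data, and a hypothetical helper `hidden_gradient_rows` that fills every weight with the same constant and returns the gradient of the hidden layer's weights.

```python
import torch
import torch.nn as nn

def hidden_gradient_rows(fill_value):
    """Fill every weight with one constant and return the hidden layer's weight gradient."""
    torch.manual_seed(0)
    fc1, out = nn.Linear(4, 8), nn.Linear(8, 1)
    for layer in (fc1, out):
        nn.init.constant_(layer.weight, fill_value)
        nn.init.zeros_(layer.bias)
    x = torch.randn(16, 4)   # placeholder inputs
    y = torch.randn(16, 1)   # placeholder regression targets
    loss = nn.functional.mse_loss(out(torch.tanh(fc1(x))), y)
    loss.backward()
    return fc1.weight.grad   # one row per hidden unit

for value in (0.0, 0.5):
    grad = hidden_gradient_rows(value)
    identical = all(torch.allclose(grad[0], row) for row in grad)
    print(f"fill={value}: every hidden unit gets the same gradient -> {identical}")
```

With all zeros the hidden-layer gradient is itself zero, and with any other constant the rows are identical, so the hidden units stay copies of each other after every update. Random initialization is what lets them diverge.
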

In [None]:
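import torch.nn as nn

def init_weights(m):
    # Apply He (Kaiming) init to every Linear layer; other module types are left untouched
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')  # He init
        nn.init.zeros_(m.bias)  # zero biases are fine; the random weights break symmetry

# Example model so the cell runs on its own (layer sizes are illustrative)
model = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 3))
model.apply(init_weights)  # apply() visits every submodule recursively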


In [2]:
# PyTorch Full NN with He Initialization
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load data
X, y = load_iris(return_X_y=True)
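# Note: the scaler is fit on the full dataset for brevity; to avoid test-set leakage,
# fit it on the training split only.
scaler = StandardScaler()
X = scaler.fit_transform(X)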
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert to tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.long)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.long)

# Define model
class FullNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 64)
        self.fc2 = nn.Linear(64, 32)
        self.out = nn.Linear(32, 3)

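        # He initialization for the two hidden layers, which feed into ReLU
        nn.init.kaiming_normal_(self.fc1.weight, nonlinearity='relu')
        nn.init.kaiming_normal_(self.fc2.weight, nonlinearity='relu')
        # Xavier for the output layer, which is not followed by ReLU
        nn.init.xavier_normal_(self.out.weight)
        # Biases keep PyTorch's default initialization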

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.out(x)

model = FullNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
for epoch in range(10):
    optimizer.zero_grad()
    outputs = model(X_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()

# Accuracy
with torch.no_grad():
    preds = torch.argmax(model(X_test), dim=1)
    acc = (preds == y_test).float().mean()
    print(f"Test Accuracy: {acc:.4f}")

Test Accuracy: 0.9667


---
<b>PyTorch tensor:</b> a multi-dimensional array (similar to NumPy's ndarray) that stores numerical data and supports fast computation on CPUs and GPUs. Tensors are the fundamental building blocks of PyTorch, used to encode inputs, outputs, and model parameters.

Key Properties:
*  Multi-dimensional: Scalars (0D), vectors (1D), matrices (2D), or higher-dimensional arrays.
*  GPU Acceleration: Can be moved to CUDA-enabled GPUs for faster computation.
*  Autograd Support: Tracks operations for automatic differentiation (critical for training neural networks).
*  Optimized Operations: Backed by highly optimized C++/CUDA kernels.
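
A short sketch of these properties in practice. The shapes and values are arbitrary examples, and the GPU move is guarded so the code also runs on a CPU-only machine.

```python
import torch

scalar = torch.tensor(3.0)          # 0D
vector = torch.tensor([1.0, 2.0])   # 1D
matrix = torch.zeros(2, 3)          # 2D
batch = torch.zeros(8, 3, 32, 32)   # 4D, e.g. a batch of RGB images
print(scalar.ndim, vector.ndim, matrix.ndim, batch.ndim)

# Move to a GPU only if one is available; otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
matrix = matrix.to(device)

# Autograd: y = sum(w^2), so dy/dw = 2w.
w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
y = (w ** 2).sum()
y.backward()
print(w.grad)  # gradient values are 2, -4, 6
```

Setting `requires_grad=True` is what tells autograd to record operations on `w`; tensors created without it are treated as constants.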

---

In [4]:
# TensorFlow Full NN with He Initialization
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load and preprocess
X, y = load_iris(return_X_y=True)
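# Note: as in the PyTorch cell, the scaler sees the full dataset; fit on the training split in practice.
scaler = StandardScaler()
X = scaler.fit_transform(X)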
y = tf.keras.utils.to_categorical(y, 3)  # one-hot encoding

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define model
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),  # declare the input shape here instead of passing input_shape to Dense
    tf.keras.layers.Dense(64, activation='relu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(32, activation='relu', kernel_initializer='he_normal'),
    tf.keras.layers.Dense(3, activation='softmax', kernel_initializer='glorot_normal')
])

# Compile & train
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, verbose=0)

# Evaluate
loss, acc = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Accuracy: {acc:.4f}")





Test Accuracy: 0.8333


Proper initialization leads to faster convergence and a lower risk of vanishing or exploding gradients.