# Single Neuron With Backpropagation
## Forward Pass
$$ z = w_1x_1 + w_2x_2 + \dots + w_nx_n + b$$
$$ \sigma(z) = \frac{1}{1 + e^{-z}}$$
The neuron computes the pre-activation $z$ as the dot product of the weights and features plus the bias, then applies the sigmoid $\sigma$ to produce its output.
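As a minimal sketch of the forward pass for a single sample, the snippet below uses the first feature vector and the initial weights from the training example later in this notebook. The helper name `sigmoid` is an assumption made for illustration.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])    # one sample with two features
w = np.array([0.1, -0.2])   # one weight per feature
b = 0.0                     # bias

z = np.dot(w, x) + b        # 0.1 * 1.0 + (-0.2) * 2.0 + 0.0 = -0.3
a = sigmoid(z)              # approximately 0.4256

print(z, a)
```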

## Loss 
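The Mean Squared Error quantifies the error between the neuron's predictions and the true labels, averaged over the training samples. Note that $n$ here counts samples, whereas in the forward pass it counts features:
$$
MSE = \frac{1}{n} \sum_{i=1}^n(\sigma(z_i) - y_i)^2
$$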

## Backward Pass
$$
\begin{align}
\frac{\partial MSE}{\partial w_j} &= \frac{2}{n} \sum_{i=1}^n (\sigma(z_i) - y_i) \sigma'(z_i)x_{ij} \\
\frac{\partial MSE}{\partial b} &= \frac{2}{n} \sum_{i=1}^n (\sigma(z_i) - y_i) \sigma'(z_i)
\end{align}
$$
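The two gradient formulas can be checked numerically. The sketch below is a vectorised version (not the loop used in the training cell) that compares the analytic gradients against central finite differences. The function names `mse_loss` and `mse_gradients` are assumptions introduced for this example, and it relies on the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(w, b, X, y):
    # Mean squared error of the neuron's predictions over all samples
    return np.mean((sigmoid(X @ w + b) - y) ** 2)

def mse_gradients(w, b, X, y):
    a = sigmoid(X @ w + b)
    # Per-sample term 2 * (sigma(z_i) - y_i) * sigma'(z_i)
    delta = 2.0 * (a - y) * a * (1.0 - a)
    grad_w = X.T @ delta / len(y)   # averages delta_i * x_ij over samples
    grad_b = delta.mean()
    return grad_w, grad_b

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, 0.0, 0.0])
w = np.array([0.1, -0.2])
b = 0.0

grad_w, grad_b = mse_gradients(w, b, X, y)

# Central finite differences as an independent check
eps = 1e-6
num_grad_w = np.array([
    (mse_loss(w + eps * e, b, X, y) - mse_loss(w - eps * e, b, X, y)) / (2 * eps)
    for e in np.eye(len(w))
])
num_grad_b = (mse_loss(w, b + eps, X, y) - mse_loss(w, b - eps, X, y)) / (2 * eps)

print(np.allclose(grad_w, num_grad_w, atol=1e-6))
print(np.isclose(grad_b, num_grad_b, atol=1e-6))
```

If both checks print `True`, the analytic expressions agree with the numerical slope of the loss at this point.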

### Update the Parameters
$$
\begin{align}
w_j &= w_j - \alpha \frac{\partial MSE}{\partial w_j} \\
b &= b - \alpha \frac{\partial MSE}{\partial b}
\end{align}
$$


In [23]:
import numpy as np

def train_neuron(features: np.ndarray, labels: np.ndarray, 
                initial_weights: np.ndarray, initial_bias: float, 
                learning_rate: float, epochs: int) -> tuple[np.ndarray, float, list[float]]:
    
    updated_weights = initial_weights.copy()
    updated_bias = initial_bias
    mse_values = []

    for epoch in range(epochs):
        # Accumulate gradients over the whole dataset (batch gradient descent)
        grad_weights = np.zeros_like(updated_weights)
        grad_bias = 0.0
        preds = []

        for x, y_true in zip(features, labels):
            # Forward pass: weighted sum followed by the sigmoid activation
            z = np.dot(updated_weights, x) + updated_bias
            y_pred = 1 / (1 + np.exp(-z))
            preds.append(y_pred)

            # Backward pass: d(loss)/d(y_pred) times sigma'(z)
            dloss = 2 * (y_pred - y_true)
            dsig = y_pred * (1 - y_pred)

            grad = dloss * dsig
            grad_weights += grad * x
            grad_bias += grad
        
        # Record this epoch's MSE (computed before the parameter update)
        m = len(features)
        preds = np.array(preds)
        mse = np.mean((preds - labels) ** 2)
        mse_values.append(round(mse, 4))

        # Gradient descent step using the mean gradient over the m samples
        updated_weights -= learning_rate * (grad_weights / m)
        updated_bias -= learning_rate * (grad_bias / m)

    return updated_weights, updated_bias, mse_values

In [24]:
features = [[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0]]
labels = [1, 0, 0]
initial_weights = [0.1, -0.2]
initial_bias = 0.0
learning_rate = 0.1
epochs = 2


updated_weights, updated_bias, mse_values = train_neuron(
        np.array(features), np.array(labels),
        np.array(initial_weights), initial_bias,
        learning_rate, epochs
)

print("Updated Weights:", updated_weights)
print("Updated Bias:", updated_bias)
print("MSE Values:", mse_values)

Updated Weights: [ 0.1035744  -0.14254396]
Updated Bias: -0.016719880375037202
MSE Values: [np.float64(0.3033), np.float64(0.2942)]


# Simple Convolutional 2D Layer
1. `input_matrix`: the input data; for an image, these are the pixel values.
2. `kernel`: the filter that slides over the input.
3. `padding`: the number of zero-valued rows and columns added around each edge of the input so the kernel can cover the border.
4. `stride`: the number of positions the kernel moves at each step across the input.

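At each position, the convolution multiplies the kernel element-wise with the input window beneath it and sums the products; each sum becomes one entry of the output matrix. Strictly speaking, because the kernel is not flipped, this is a cross-correlation, which is what deep learning libraries conventionally call convolution.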
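The output size follows directly from the padding and stride: each spatial dimension becomes $\lfloor (\text{input} + 2 \cdot \text{padding} - \text{kernel}) / \text{stride} \rfloor + 1$. The helper below is a small sketch of that calculation; the name `conv_output_shape` is an assumption introduced for illustration.

```python
def conv_output_shape(input_shape, kernel_shape, padding, stride):
    """Return (out_height, out_width) for a 2D convolution with symmetric zero padding."""
    in_h, in_w = input_shape
    k_h, k_w = kernel_shape
    out_h = (in_h + 2 * padding - k_h) // stride + 1
    out_w = (in_w + 2 * padding - k_w) // stride + 1
    return out_h, out_w

# 4x4 input, 2x2 kernel, padding 1, stride 2
print(conv_output_shape((4, 4), (2, 2), padding=1, stride=2))  # (3, 3)

# Non-square input: height and width are computed independently
print(conv_output_shape((5, 7), (3, 3), padding=0, stride=1))  # (3, 5)
```

The non-square case is a useful check for any implementation, since swapped height and width only show up when the two differ.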

In [21]:
def simple_conv2d(input_matrix: np.ndarray, kernel: np.ndarray, padding: int, stride: int):
    input_height, input_width = input_matrix.shape
    kernel_height, kernel_width = kernel.shape

    input_matrix = np.pad(input_matrix, pad_width=padding)

    input_height_pad, input_width_pad = input_matrix.shape

    OH = ((input_height_pad - kernel_height) // stride) + 1
    OW = ((input_width_pad - kernel_width) // stride) + 1

    output_matrix = np.zeros((OH, OW))

    for i in range(0, OH):
        for j in range(0, OW):
            p = i * stride
            q = j * stride
            window = input_matrix[p:p+kernel_height, q:q+kernel_width]
        
            output_matrix[i,j] = np.sum(window * kernel)
            
    return output_matrix

In [22]:
input_matrix = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16]
])

kernel = np.array([
    [1, 0],
    [-1, 1]
])

padding = 1
stride = 2

output = simple_conv2d(input_matrix, kernel, padding, stride)
print(output)

[[ 1.  1. -4.]
 [ 9.  7. -4.]
 [ 0. 14. 16.]]


# ReLU
The ReLU (Rectified Linear Unit) activation function is widely used in neural networks, particularly in hidden layers of deep learning models. It maps any real-valued number to the non-negative range $[0,\infty)$, which helps introduce non-linearity into the model while maintaining computational efficiency.

$$f(z) = \max(0, z)$$

- Its graph is hinge-shaped: flat at 0 for $z < 0$ and the line $f(z) = z$ for $z \ge 0$.

In [None]:
def ReLU(z: np.ndarray) -> np.ndarray:
    return np.maximum(0, z)

In [10]:
out = ReLU(np.array([[-1, 2], [3, -4]]))
print(out)

[[0 2]
 [3 0]]


# Residual Connections
Residual connections are the core idea of ResNet. In a traditional network, a block learns the desired mapping $H(x)$ directly from its input. In a residual block, the layers instead learn the residual $F(x) = H(x) - x$, and the input is added back to form the output:
$$ y = F(x) + x$$

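The residual block implemented here has two weight layers with a ReLU activation between them. The input is added to the output of the second layer (the skip connection), and a final ReLU is applied to the sum.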

## Why?
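- Ease of learning: if the ideal mapping is close to the identity, the block only has to push $F(x)$ towards zero, which is easier than learning the identity through stacked non-linear layers.
- Gradient flow: the addition gives gradients a direct path back to earlier layers, which mitigates vanishing gradients.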
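The gradient-flow point can be illustrated numerically. The sketch below assumes a simplified block without the final ReLU, so that $y = F(x) + x$ exactly, and compares its Jacobian with that of the plain transformation $F(x)$ when the weights are tiny. The names `residual_pre_activation` and `numerical_jacobian` are introduced for this example only.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_pre_activation(x, w1, w2):
    # F(x) + x, leaving out the final ReLU to isolate the skip connection
    return w2 @ relu(w1 @ x) + x

def numerical_jacobian(f, x, eps=1e-6):
    # Central finite-difference Jacobian of f at x (f maps R^n to R^n)
    n = x.size
    jac = np.zeros((n, n))
    for k in range(n):
        step = np.zeros(n)
        step[k] = eps
        jac[:, k] = (f(x + step) - f(x - step)) / (2 * eps)
    return jac

x = np.array([1.0, 2.0])
w1 = np.eye(2)
w2_small = 1e-3 * np.ones((2, 2))   # deliberately tiny weights

jac_residual = numerical_jacobian(lambda v: residual_pre_activation(v, w1, w2_small), x)
jac_plain = numerical_jacobian(lambda v: w2_small @ relu(w1 @ v), x)

print(jac_residual)  # close to the identity matrix plus the small weights
print(jac_plain)     # close to the small weights alone
```

With tiny weights the plain path passes back almost no gradient, while the residual path still passes back roughly the identity because of the `+ x` term.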

In [18]:
def residual_block(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    first_layer = w1 @ x 
    relu_1 = ReLU(first_layer)

    second_layer = w2 @ relu_1

    res = second_layer + x

    return ReLU(res)

In [20]:
x = np.array([1.0, 2.0])
w1 = np.array([[1.0, 0.0], [0.0, 1.0]])
w2 = np.array([[0.5, 0.0], [0.0, 0.5]])

output = residual_block(x, w1, w2)
print("Residual Block Output:", output)

Residual Block Output: [1.5 3. ]


# Global Average Pooling

Global Average Pooling (GAP) is a pooling operation used in CNNs that collapses the spatial dimensions of each feature map to a single value.

For a 3D input tensor of shape $(H, W, C)$:
- $H$ is the height
- $W$ is the width
- $C$ is the number of channels (feature maps)

$$
GAP(x)_c = \frac{1}{H \times W} \sum_{i=1}^H \sum_{j=1}^W x_{i,j,c}
$$
It returns a 1D vector of shape $(C,)$ where each element is the average of all values in that corresponding feature map.

Essentially:
```py
np.mean(x, axis=(0, 1))
```

### Benefits
- Parameter Reduction: By replacing fully connected layers with GAP, the number of parameters is significantly reduced, which helps in preventing overfitting.
- Spatial Invariance: GAP captures the global information from each feature map, making the model more robust to spatial translations.
- Simplicity: It is a straightforward operation that doesn't require tuning hyperparameters like pooling window size or stride.

Global Average Pooling is a key component in architectures like ResNet, where it is used before the final classification layer. It allows the network to handle inputs of varying sizes, as the output depends only on the number of channels, not the spatial dimensions.
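To make the point about varying input sizes concrete, the sketch below applies the one-line vectorised form to two inputs with different heights and widths but the same number of channels. The function name `global_avg_pool_vectorised` is an assumption introduced for this example.

```python
import numpy as np

def global_avg_pool_vectorised(x):
    # x has shape (H, W, C); average over both spatial axes
    return x.mean(axis=(0, 1))

small = np.ones((2, 3, 4))                                    # H=2, W=3, C=4
large = np.arange(5 * 7 * 4, dtype=float).reshape(5, 7, 4)    # H=5, W=7, C=4

print(global_avg_pool_vectorised(small).shape)  # (4,)
print(global_avg_pool_vectorised(large).shape)  # (4,)
```

Both outputs have shape `(4,)`, so a classifier placed after GAP sees the same input size regardless of the spatial resolution.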

In [13]:
def global_avg_pool(x: np.ndarray) -> np.ndarray:
    height, width, channels = x.shape
    output = np.zeros(channels)

    for c in range(channels):
        channel_sum = 0
        for i in range(height):
            for j in range(width):
                channel_sum += x[i, j, c]
        # Average once per channel, after every position has been summed
        output[c] = channel_sum / (height * width)
    return output

In [14]:
x = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
print(global_avg_pool(x))

[5.5 6.5 7.5]


# Batch Normalisation for BCHW
- Batch Normalisation (BN) accelerates the training of deep networks by normalising the inputs to each layer.
- This makes the learning process more stable and speeds up convergence; it also has a regularising effect.

BN was introduced to reduce internal covariate shift: the change in the distribution of a layer's inputs as the parameters of earlier layers are updated during training.

BN is done via the following steps:
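1. Compute the mean and variance of each mini-batch.
2. Normalise the inputs using that mean and variance.
3. Apply scale and shift: after normalisation, apply a learned scale (gamma) and shift (beta) so the layer can still represent the original distribution if that is optimal.
4. Training and inference: during training, use the statistics of the current mini-batch; at inference, use running averages of the mean and variance collected during training.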

For an input tensor with shape BCHW (Batch Size, Channels, Height, Width)

1. Mean and Variance
$$
\begin{align}
\mu_c &= \frac{1}{B \cdot H \cdot W} \sum_{i=1}^B \sum_{h=1}^H \sum_{w=1}^W x_{i,c,h,w} \\
\sigma^2_c &= \frac{1}{B \cdot H \cdot W} \sum_{i=1}^B \sum_{h=1}^H \sum_{w=1}^W (x_{i,c,h,w} - \mu_c)^2 
\end{align}
$$
Where $x_{i,c,h,w}$ is the input activation at batch index $i$, channel $c$, height $h$, and width $w$.

2. Normalisation 
$$
\hat{x}_{i,c,h,w} = \frac{x_{i,c,h,w} - \mu_c}{\sqrt{\sigma^2_c + \epsilon}}
$$
We use a small constant $\epsilon$ for numerical stability (avoiding division by zero).

3. Scale and Shift

Then we apply a scale ($\gamma_c$) and a shift ($\beta_c$), to adjust the distribution of features.

$$
y_{i,c,h,w} = \gamma_c \hat{x}_{i,c,h,w} + \beta_c
$$
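The difference between training and inference is commonly handled with running statistics. The sketch below is one possible arrangement, not a definitive implementation: the class name `BatchNorm2dSketch`, the `momentum` parameter, and the exponential-moving-average update rule are assumptions modelled on how deep learning frameworks typically handle this step.

```python
import numpy as np

class BatchNorm2dSketch:
    def __init__(self, num_channels, momentum=0.1, epsilon=1e-5):
        shape = (1, num_channels, 1, 1)       # broadcastable over BCHW
        self.gamma = np.ones(shape)           # learned scale
        self.beta = np.zeros(shape)           # learned shift
        self.running_mean = np.zeros(shape)   # used only at inference
        self.running_var = np.ones(shape)
        self.momentum = momentum              # assumed moving-average weight
        self.epsilon = epsilon

    def __call__(self, x, training):
        if training:
            # Per-channel statistics of the current mini-batch
            mean = x.mean(axis=(0, 2, 3), keepdims=True)
            var = x.var(axis=(0, 2, 3), keepdims=True)
            # Update the running averages for later use at inference
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.epsilon)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm2dSketch(num_channels=2)
batch = rng.normal(loc=3.0, scale=2.0, size=(8, 2, 4, 4))

train_out = bn(batch, training=True)    # normalised with batch statistics
eval_out = bn(batch, training=False)    # normalised with running statistics

print(train_out.mean(axis=(0, 2, 3)))   # close to 0 for each channel
print(train_out.var(axis=(0, 2, 3)))    # close to 1 for each channel
```

After a single training step the running statistics are still far from the batch statistics, so `eval_out` differs from `train_out`; the two converge only as the running averages accumulate over many batches.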

### Key Points 
- Channel-wise normalisation: BN normalises the activations independently for each channel ($C$), because different channels in convolutional layers often have different distributions and should be treated separately.
- Improved gradient flow: reducing internal covariate shift allows faster and more reliable convergence.
- Regularisation: the mini-batch statistics are noisy estimates, and this noise acts as a mild regulariser.

In [16]:
def batch_normalization(X: np.ndarray, gamma: np.ndarray, beta: np.ndarray, epsilon: float = 1e-5) -> np.ndarray:
    mean = np.mean(X, axis=(0, 2, 3), keepdims=True)
    var = np.var(X, axis=(0, 2, 3), keepdims=True)

    x_norm = (X - mean) / np.sqrt(var + epsilon)

    out = gamma * x_norm + beta

    return out

In [17]:
B, C, H, W = 2, 2, 2, 2
np.random.seed(42)
X = np.random.randn(B, C, H, W)
gamma = np.ones(C).reshape(1, C, 1, 1)
beta = np.zeros(C).reshape(1, C, 1, 1)
actual_output = batch_normalization(X, gamma, beta)
print(actual_output)

[[[[ 0.42859934 -0.51776438]
   [ 0.65360963  1.95820707]]

  [[ 0.02353721  0.02355215]
   [ 1.67355207  0.93490043]]]


 [[[-1.01139563  0.49692747]
   [-1.00236882 -1.00581468]]

  [[ 0.45676349 -1.50433085]
   [-1.33293647 -0.27503802]]]]
