# Layer Concepts

This notebook provides an overview of the neural network layers implemented in this module, including their forward and backward pass formulas.

## Dense (Fully Connected Layer)

A fully connected neural network layer, also known as a Linear layer.

### Forward Pass
The layer performs a linear transformation of the input data $X$ using a weights matrix $W$ and an optional bias vector $b$, followed by an activation function $f$.

The operations are:
1. Linear transformation:
   $$
   Z = XW + b
   $$
2. Activation:
   $$
   A = f(Z)
   $$

Where:
- $X$ is the input data.
- $W$ is the weights matrix.
- $b$ is the bias vector (if `add_bias` is True).
- $Z$ is the linear output (input to the activation function).
- $A$ is the activation output.
- $f$ is the activation function.
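
The forward pass can be sketched in NumPy as follows. This is a minimal illustration, not this module's implementation: the function name `dense_forward`, the use of `tanh` as the activation, and the shape convention (`X` is `(batch, in_features)`, `W` is `(in_features, out_features)`, `b` is `(out_features,)`) are all assumptions chosen to match $Z = XW + b$.

```python
import numpy as np

def dense_forward(X, W, b=None, activation=np.tanh):
    """Hypothetical helper: returns (Z, A) with Z = XW + b and A = f(Z)."""
    Z = X @ W                # linear transformation, shape (batch, out_features)
    if b is not None:        # bias is optional, mirroring `add_bias`
        Z = Z + b            # b is broadcast across the batch dimension
    A = activation(Z)        # element-wise activation
    return Z, A

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))  # batch of 4 samples, 3 input features (assumed shapes)
W = rng.normal(size=(3, 2))  # 3 input features -> 2 output units
b = np.zeros(2)

Z, A = dense_forward(X, W, b)
print(Z.shape, A.shape)  # (4, 2) (4, 2)
```

Returning $Z$ alongside $A$ is a deliberate choice in this sketch, since the backward pass needs $Z$ to evaluate $f'(Z)$.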

### Backward Pass
The backward pass computes the gradients of the loss with respect to the layer's input ($dX$), weights ($dW$), and bias ($dB$). An optimizer then uses $dW$ and $dB$ to update the weights and bias, while $dX$ is passed back to the preceding layer.

The steps are:
1. Compute gradient with respect to the output of the linear part (before activation), $dZ$:
   $$ 
   dZ = dA \odot f'(Z) 
   $$ 
   Where $dA = \frac{\partial L}{\partial A}$ is the gradient of the loss $L$ with respect to the layer's activation output $A$, $f'(Z) = \frac{\partial A}{\partial Z}$ is the derivative of the activation function evaluated element-wise, and $\odot$ denotes element-wise multiplication. Thus, $dZ = \frac{\partial L}{\partial Z}$.

2. Compute gradient for weights, $dW$:
   $$ 
   dW = X^T dZ 
   $$ 

3. Compute gradient for bias, $dB$ (if bias is used):
   $$ 
   dB = \sum_{\text{batch}} dZ 
   $$ 
   The sum runs over the batch dimension, so $dB$ has the same shape as $b$.

4. Compute gradient with respect to the input of this layer, $dX$:
   $$ 
   dX = dZ W^T 
   $$ 
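
The four steps above can be sketched in NumPy and checked against a finite-difference estimate. This is an illustrative sketch under stated assumptions rather than this module's code: the names `dense_backward` and `tanh_grad`, the choice of `tanh` as the activation, the toy loss $L = \sum A$, and the shape convention (`X` is `(batch, in_features)`, `W` is `(in_features, out_features)`) are all hypothetical.

```python
import numpy as np

def tanh_grad(Z):
    # Element-wise derivative of tanh: f'(Z) = 1 - tanh(Z)^2
    return 1.0 - np.tanh(Z) ** 2

def dense_backward(dA, X, W, Z, activation_grad=tanh_grad, add_bias=True):
    """Hypothetical helper: returns (dX, dW, dB) following steps 1-4."""
    dZ = dA * activation_grad(Z)               # step 1: dZ = dA ⊙ f'(Z)
    dW = X.T @ dZ                              # step 2: dW = X^T dZ
    dB = dZ.sum(axis=0) if add_bias else None  # step 3: sum over the batch
    dX = dZ @ W.T                              # step 4: dX = dZ W^T
    return dX, dW, dB

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
b = rng.normal(size=2)

Z = X @ W + b
A = np.tanh(Z)
dA = np.ones_like(A)  # gradient of the toy loss L = sum(A) with respect to A
dX, dW, dB = dense_backward(dA, X, W, Z)

def loss(W_):
    # Same toy loss, recomputed for a perturbed weight matrix
    return np.tanh(X @ W_ + b).sum()

# Central finite difference on a single weight entry
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric_dW00 = (loss(W_plus) - loss(W_minus)) / (2 * eps)
print(np.isclose(numeric_dW00, dW[0, 0], atol=1e-6))
```

Each gradient has the same shape as the quantity it differentiates against, which is a quick sanity check when wiring layers together: `dX` matches `X`, `dW` matches `W`, and `dB` matches `b`.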
