### NeuralNetwork Class

The `NeuralNetwork` class implements a basic feedforward neural network capable of training and making predictions. It accepts the following parameters during initialization:

- **`input`**: Represents the input data or features for the neural network.
- **`hiddenlayer`**: Specifies the number of neurons in each hidden layer as a list of integers.
- **`activation_hidden`**: Tuple containing the activation function and its derivative for the hidden layers.
- **`outputlayer`**: Specifies the number of neurons in the output layer.
- **`idealValues`**: Target output values for training purposes.
- **`activation_output`**: Tuple containing the activation function and its derivative for the output layer.
- **`learning_rate`**: Determines the step size in gradient descent optimization during training.

### Example Usage

```python
# Example instantiation
NN = NeuralNetwork(
    input=[1, 2, 3],
    hiddenlayer=[4],
    activation_hidden=(activations.LeakyRelu, activations.LeakyRelu_derivative),
    outputlayer=2,
    idealValues=[1, 2],  # note: a sigmoid output lies in (0, 1), so the target 2 cannot be reached exactly
    activation_output=(activations.sigmoid, activations.sigmoid_derivative),
    learning_rate=0.001
)
```
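
The `activations` module used above is not shown in this document. A minimal sketch of what it might contain, assuming each function takes and returns a NumPy array and that the leaky slope `alpha=0.01` is a hypothetical default:

```python
import numpy as np

# Hypothetical stand-ins for the `activations` module referenced above;
# the names mirror the example, the bodies are assumptions.

def LeakyRelu(z, alpha=0.01):
    """Return z where z > 0, otherwise alpha * z (alpha is an assumed default)."""
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, z, alpha * z)

def LeakyRelu_derivative(z, alpha=0.01):
    """Slope of LeakyRelu: 1 for z > 0, alpha otherwise."""
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, 1.0, alpha)

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the logistic function, expressed in terms of z."""
    s = sigmoid(z)
    return s * (1.0 - s)
```

Pairing each function with its derivative matches the `(function, derivative)` tuples that the constructor expects.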

### Purpose

The `NeuralNetwork` class encapsulates the functionality for creating and training a neural network model. It provides methods to initialize the network, perform forward and backward propagation, compute loss, and update model parameters using gradient descent.



<img src="https://eu-images.contentstack.com/v3/assets/blt6b0f74e5591baa03/blt790f1b7ac4e04301/6543ff50fcf447040a6b8dc7/News_Image_(47).png?width=1280&auto=webp&quality=95&format=jpg&disable=upscale"
width=500
height=500
/>


---


This section goes through the forward pass code, explaining each line and the underlying mathematical concepts.


<img src="https://media.licdn.com/dms/image/D4D22AQH-8ATyCCV_Ww/feedshare-shrink_800/0/1720882385473?e=1724284800&v=beta&t=wc2S7yF2Fo5zIDEjMpC-VdjkvJEg9K4m3mTTHPnknDU"/>

### Forward Pass Explained

```python
def Forwardpass(self):
    # Hidden layers: every weight matrix except the last one.
    for i in range(len(self.W) - 1):
        # Multiply the layer's weights by the previous layer's activations.
        Mat_Vec_Mul = self.W[i] @ np.transpose(self.A[i])

        # Flatten an (n, 1) column to (n,) so that adding the bias does not broadcast to (n, n).
        shape = Mat_Vec_Mul.shape
        Z = (Mat_Vec_Mul.reshape(shape[0])) + self.b[i]  # Z = W A + b (weighted sum)
        self.Z.append(Z)
        A = self.activation_hidden[0](Z)
        self.A.append(A)

    # Output layer: same steps, using the output activation.
    Mat_Vec_Mul = self.W[-1] @ np.transpose(self.A[-1])
    shape = Mat_Vec_Mul.shape
    Z = (Mat_Vec_Mul.reshape(shape[0])) + self.b[-1]
    self.Z.append(Z)
    A = self.activation_output[0](Z)
    self.A.append(A)
    Loss = self.Loss()
    return Loss
```

### Mathematical Explanation and Chain Rule Application

#### 1. Initialize the Forward Pass

```python
for i in range(len(self.W) - 1):
```
- **Explanation**: Iterate through each layer except the last one (output layer).

#### 2. Weighted Sum Calculation

```python
Mat_Vec_Mul = self.W[i] @ np.transpose(self.A[i])
```
- **Explanation**: Multiply the layer's weight matrix (`self.W[i]`) by the activations of the previous layer (`self.A[i]`). This matrix-vector product is the weighted sum before the bias is added.
- **Mathematical Notation**: $ W^{(l)} \cdot A^{(l-1)} $
- **Chain Rule Application**: Not directly applicable here; this step prepares the input to the activation function.

#### 3. Adjust Shape for Computation

```python
shape = Mat_Vec_Mul.shape
Z = (Mat_Vec_Mul.reshape(shape[0])) + self.b[i]
```
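- **Explanation**: Flatten the `(n, 1)` result of the matrix-vector multiplication to shape `(n,)` and add the bias term. Without the reshape, adding a 1-D bias to an `(n, 1)` column would broadcast to an `(n, n)` matrix.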
- **Mathematical Notation**: $ Z^{(l)} = W^{(l)} \cdot A^{(l-1)} + b^{(l)} $
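
The reshape matters because of NumPy broadcasting. A small illustration, assuming the previous activations are stored as a `(1, n)` row vector (as they are once the backward pass has reshaped them) and the bias as a 1-D array; the values are made up:

```python
import numpy as np

W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # 2 neurons, 3 inputs
A_prev = np.array([[1.0, 0.0, -1.0]])  # stored as a (1, 3) row vector
b = np.array([0.5, -0.5])              # one bias per neuron, shape (2,)

product = W @ np.transpose(A_prev)     # shape (2, 1)

# Without the reshape, (2, 1) + (2,) broadcasts to a (2, 2) matrix.
wrong = product + b

# With the reshape, (2,) + (2,) stays a vector of weighted sums.
Z = product.reshape(product.shape[0]) + b

print(wrong.shape, Z.shape)  # (2, 2) (2,)
```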

#### 4. Activation Function

```python
self.Z.append(Z)
A = self.activation_hidden[0](Z)
self.A.append(A)
```
- **Explanation**: Apply the activation function to the weighted sum `Z` to get the activation for the current layer. Store `Z` and `A` for later use.
- **Mathematical Notation**: $ A^{(l)} = \sigma(Z^{(l)}) $
- **Chain Rule Application**: This prepares the activations for the next layer and for backpropagation.

#### 5. Output Layer Weighted Sum

```python
Mat_Vec_Mul = self.W[-1] @ np.transpose(self.A[-1])
shape = Mat_Vec_Mul.shape
Z = (Mat_Vec_Mul.reshape(shape[0])) + self.b[-1]
```
- **Explanation**: Calculate the weighted sum for the output layer.
- **Mathematical Notation**: $ Z^{(L)} = W^{(L)} \cdot A^{(L-1)} + b^{(L)} $

#### 6. Output Layer Activation Function

```python
self.Z.append(Z)
A = self.activation_output[0](Z)
self.A.append(A)
```
- **Explanation**: Apply the activation function to the weighted sum `Z` of the output layer to get the final output.
- **Mathematical Notation**: $ \hat{y} = \sigma(Z^{(L)}) $

#### 7. Compute Loss

```python
Loss = self.Loss()
return Loss
```
- **Explanation**: Calculate the loss function to measure the difference between the predicted values and the actual values.
- **Mathematical Notation**: $ L = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 $
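
`self.Loss()` itself is not shown. A minimal sketch consistent with the formula above, written as a standalone function with the hypothetical names `mse_loss`, `y_hat`, and `y`:

```python
import numpy as np

def mse_loss(y_hat, y):
    """Mean squared error over the N output values (a stand-in for self.Loss)."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((y_hat - y) ** 2))

print(mse_loss([0.5, 1.0], [1.0, 2.0]))  # (0.25 + 1.0) / 2 = 0.625
```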

### Summary

The forward pass computes the weighted sum and applies the activation function for each layer in turn, from the input to the output layer. Along the way it stores every weighted sum `Z` and activation `A`, because backpropagation needs them to calculate gradients. No derivatives are computed here. However, since each layer's output is the next layer's input, the network is a composition of functions, and that composition is what lets the chain rule propagate gradients backward during training.
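
Condensed into a standalone function, the same loop might look like the sketch below. It assumes 1-D activation vectors, weights of shape `(n_out, n_in)`, and hypothetical names (`forward`, `weights`, `biases`); it is not the class's own method.

```python
import numpy as np

def forward(x, weights, biases, act_hidden, act_output):
    """Return the lists of weighted sums Z and activations A for one input vector."""
    A = [np.asarray(x, dtype=float)]   # A[0] is the input itself
    Z = []
    last = len(weights) - 1
    for i, (W, b) in enumerate(zip(weights, biases)):
        z = W @ A[-1] + b              # Z = W A + b
        act = act_output if i == last else act_hidden
        Z.append(z)
        A.append(act(z))
    return Z, A

# Tiny 3-4-2 network with seeded random weights so the run is reproducible.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

Z, A = forward([1, 2, 3], weights, biases, relu, sigmoid)
print([a.shape for a in A])  # [(3,), (4,), (2,)]
```

Keeping `Z` and `A` as lists mirrors the `self.Z` and `self.A` bookkeeping in the method above.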



-----

This section covers the backward pass of the neural network, explaining each line and the underlying mathematics, including how the chain rule of derivatives is applied.

### Backward Pass Explained

<img src="https://miro.medium.com/v2/resize:fit:679/0*9lo2ux8ASvt6YJkH.gif"/>


```python
def Backwardpass(self):
    # Output layer error: prediction minus target
    output_error = (self.A[-1] - self.idealValues)
    # delta(L) = error * sigma'(Z(L))
    output_delta = output_error * self.activation_output_derivative(self.Z[-1])

    # Reshape for correct dimensions: delta as a column, activations as a row
    output_delta = output_delta.reshape(-1, 1)
    self.A[-2] = self.A[-2].reshape(1, -1)

    # Gradients for output layer
    dW = output_delta @ self.A[-2]
    db = np.sum(output_delta, axis=1)

    self.W[-1] -= self.learning_rate * dW
    self.b[-1] -= self.learning_rate * db

    # Backpropagate through hidden layers
    delta = output_delta
    for i in range(len(self.W) - 2, -1, -1):
        # delta(l) = (W(l+1)^T @ delta(l+1)) * sigma'(Z(l))
        delta = (self.W[i + 1].T @ delta).reshape(-1) * self.activation_hidden_derivative(self.Z[i])

        # Reshape for correct dimensions
        delta = delta.reshape(-1, 1)
        self.A[i] = self.A[i].reshape(1, -1)

        dW = delta @ self.A[i]
        db = np.sum(delta, axis=1)  # 1-D, consistent with the output-layer db above

        self.W[i] -= self.learning_rate * dW
        self.b[i] -= self.learning_rate * db
```

### Mathematical Explanation and Chain Rule Application

#### 1. Output Layer Error

```python
output_error = (self.A[-1] - self.idealValues)
```
- **Explanation**: The output layer error is the difference between the predicted values (`self.A[-1]`) and the ideal (target) values (`self.idealValues`).
- **Mathematical Notation**: $ E = \hat{y} - y $

#### 2. Output Delta

```python
output_delta = output_error * self.activation_output_derivative(self.Z[-1])
```
- **Explanation**: Multiply the error by the derivative of the activation function of the output layer. This gives the gradient of the loss with respect to the weighted sum $ Z $ of the output layer.
- **Mathematical Notation**: $ \delta^{(L)} = E \cdot \sigma' (Z^{(L)}) $
- **Chain Rule Application**:
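  - Error: $ \frac{\partial L}{\partial \hat{y}} $ (for the mean squared error above, this equals $ \hat{y} - y $ up to the constant factor $ \frac{2}{N} $, which the code omits; the omission only rescales the effective learning rate)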
  - Activation derivative: $ \frac{\partial \hat{y}}{\partial Z^{(L)}} $
  - Combined: $ \delta^{(L)} = \frac{\partial L}{\partial Z^{(L)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial Z^{(L)}} $
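
The chain-rule product can be checked numerically. The sketch below assumes a sigmoid output and the loss $\frac{1}{2}\sum(\hat{y}-y)^2$, whose derivative with respect to $\hat{y}$ is exactly $\hat{y}-y$; the values and helper names are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def half_squared_error(z, y):
    """Loss as a function of the output-layer weighted sum z."""
    return 0.5 * np.sum((sigmoid(z) - y) ** 2)

z = np.array([0.3, -1.2])
y = np.array([1.0, 0.0])

# Analytic delta: error times activation derivative.
y_hat = sigmoid(z)
delta = (y_hat - y) * y_hat * (1.0 - y_hat)

# Central finite differences, one component of z at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    step = np.zeros_like(z)
    step[k] = eps
    numeric[k] = (half_squared_error(z + step, y)
                  - half_squared_error(z - step, y)) / (2 * eps)

print(np.allclose(delta, numeric, atol=1e-8))
```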

#### 3. Reshape for Correct Dimensions

```python
output_delta = output_delta.reshape(-1, 1)
self.A[-2] = self.A[-2].reshape(1, -1)
```
- **Explanation**: Reshape the delta and the activations for matrix multiplication.

#### 4. Gradients for Output Layer

```python
dW = output_delta @ self.A[-2]
db = np.sum(output_delta, axis=1)
```
- **Explanation**: Calculate the gradients for the weights and biases in the output layer.
- **Mathematical Notation**:
  - Weight gradients: $ \frac{\partial L}{\partial W^{(L)}} = \delta^{(L)} \, (A^{(L-1)})^{T} $ (an outer product with the same shape as $ W^{(L)} $)
  - Bias gradients: $ \frac{\partial L}{\partial b^{(L)}} = \delta^{(L)} $
- **Chain Rule Application**:
  - For weights: $ \frac{\partial L}{\partial W^{(L)}} = \frac{\partial L}{\partial Z^{(L)}} \cdot \frac{\partial Z^{(L)}}{\partial W^{(L)}} $
  - For biases: $ \frac{\partial L}{\partial b^{(L)}} = \frac{\partial L}{\partial Z^{(L)}} \cdot \frac{\partial Z^{(L)}}{\partial b^{(L)}} $
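
In terms of shapes, the weight gradient is an outer product: a column of deltas times a row of previous-layer activations. A small sketch with made-up numbers:

```python
import numpy as np

delta = np.array([0.1, -0.2]).reshape(-1, 1)        # (2, 1): one delta per output neuron
A_prev = np.array([1.0, 2.0, 3.0]).reshape(1, -1)   # (1, 3): previous-layer activations

dW = delta @ A_prev            # (2, 3), same shape as the layer's weight matrix
db = np.sum(delta, axis=1)     # (2,), same shape as the bias vector

print(dW.shape, db.shape)      # (2, 3) (2,)
```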

#### 5. Update Output Layer Weights and Biases

```python
self.W[-1] -= self.learning_rate * dW
self.b[-1] -= self.learning_rate * db
```
- **Explanation**: Update the weights and biases using the calculated gradients and the learning rate.
- **Mathematical Notation**:
  - Weights update: $ W^{(L)} \leftarrow W^{(L)} - \eta \cdot \frac{\partial L}{\partial W^{(L)}} $
  - Biases update: $ b^{(L)} \leftarrow b^{(L)} - \eta \cdot \frac{\partial L}{\partial b^{(L)}} $

#### 6. Backpropagate Through Hidden Layers

```python
delta = output_delta
for i in range(len(self.W) - 2, -1, -1):
    delta = (self.W[i + 1].T @ delta).reshape(-1) * self.activation_hidden_derivative(self.Z[i])
```
- **Explanation**: Backpropagate the delta through each hidden layer.
- **Mathematical Notation**:
  - For layer $ l $: $ \delta^{(l)} = \left( (W^{(l+1)})^{T} \, \delta^{(l+1)} \right) \odot \sigma' (Z^{(l)}) $, where $ \odot $ is the elementwise product
- **Chain Rule Application**:
  - For hidden layers: $ \delta^{(l)} = \frac{\partial L}{\partial Z^{(l)}} = \left( \frac{\partial L}{\partial A^{(l+1)}} \cdot \frac{\partial A^{(l+1)}}{\partial Z^{(l+1)}} \cdot \frac{\partial Z^{(l+1)}}{\partial A^{(l)}} \right) \cdot \frac{\partial A^{(l)}}{\partial Z^{(l)}} $

#### 7. Reshape for Correct Dimensions

```python
delta = delta.reshape(-1, 1)
self.A[i] = self.A[i].reshape(1, -1)
```
- **Explanation**: Reshape the delta and the activations for matrix multiplication.

#### 8. Gradients for Hidden Layers

```python
dW = delta @ self.A[i]
db = np.sum(delta, axis=1)
```
- **Explanation**: Calculate the gradients for the weights and biases in the hidden layers.
- **Mathematical Notation**:
  - Weight gradients: $ \frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} \, (A^{(l-1)})^{T} $
  - Bias gradients: $ \frac{\partial L}{\partial b^{(l)}} = \delta^{(l)} $
- **Chain Rule Application**:
  - For weights: $ \frac{\partial L}{\partial W^{(l)}} = \frac{\partial L}{\partial Z^{(l)}} \cdot \frac{\partial Z^{(l)}}{\partial W^{(l)}} $
  - For biases: $ \frac{\partial L}{\partial b^{(l)}} = \frac{\partial L}{\partial Z^{(l)}} \cdot \frac{\partial Z^{(l)}}{\partial b^{(l)}} $

#### 9. Update Hidden Layer Weights and Biases

```python
self.W[i] -= self.learning_rate * dW
self.b[i] -= self.learning_rate * db
```
- **Explanation**: Update the weights and biases using the calculated gradients and the learning rate.
- **Mathematical Notation**:
  - Weights update: $ W^{(l)} \leftarrow W^{(l)} - \eta \cdot \frac{\partial L}{\partial W^{(l)}} $
  - Biases update: $ b^{(l)} \leftarrow b^{(l)} - \eta \cdot \frac{\partial L}{\partial b^{(l)}} $

### Summary

The backward pass involves computing the error and gradients for each layer starting from the output layer and moving backward through the hidden layers. The chain rule is applied to propagate the error gradients backward, allowing the network to update its weights and biases to minimize the loss function. Each line of code in the backward pass implements a step in this gradient descent optimization process.
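
One detail worth noting: the listing updates `self.W[-1]` before the loop reads `self.W[i + 1]`, so the hidden-layer deltas are computed from already-updated weights. A common alternative computes every gradient first and applies the updates afterwards. The sketch below shows that variant as standalone functions. The names (`backward`, `apply_update`, `run_forward`) and the tanh/sigmoid choice in the usage lines are assumptions, not the class's API.

```python
import numpy as np

def backward(Z, A, y, weights, act_hidden_deriv, act_output_deriv):
    """Return (dW, db) lists for every layer without modifying the weights."""
    dW = [None] * len(weights)
    db = [None] * len(weights)
    delta = (A[-1] - y) * act_output_deriv(Z[-1])     # output-layer delta
    for i in range(len(weights) - 1, -1, -1):
        dW[i] = np.outer(delta, A[i])                 # delta times previous activations
        db[i] = delta.copy()
        if i > 0:                                     # propagate using the not-yet-updated weights
            delta = (weights[i].T @ delta) * act_hidden_deriv(Z[i - 1])
    return dW, db

def apply_update(weights, biases, dW, db, learning_rate):
    """Gradient-descent step applied only after all gradients are known."""
    for i in range(len(weights)):
        weights[i] -= learning_rate * dW[i]
        biases[i] -= learning_rate * db[i]

# Usage on a tiny 3-4-2 network (tanh hidden layer, sigmoid output; made-up data).
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_deriv = lambda z: sigmoid(z) * (1.0 - sigmoid(z))
tanh_deriv = lambda z: 1.0 - np.tanh(z) ** 2

rng = np.random.default_rng(1)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
biases = [np.zeros(4), np.zeros(2)]
x, y = np.array([1.0, 2.0, 3.0]), np.array([1.0, 0.0])

def run_forward():
    z1 = weights[0] @ x + biases[0]
    a1 = np.tanh(z1)
    z2 = weights[1] @ a1 + biases[1]
    return [z1, z2], [x, a1, sigmoid(z2)]

Z, A = run_forward()
dW, db = backward(Z, A, y, weights, tanh_deriv, sigmoid_deriv)
apply_update(weights, biases, dW, db, learning_rate=0.1)
print([g.shape for g in dW])  # [(4, 3), (2, 4)]
```

Separating gradient computation from the update also makes it easy to compare the gradients against finite differences before trusting them.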


-----

## Batch Gradient Descent 

<h3> Batch gradient descent computes the gradient over the entire training set and updates the weights once per epoch </h3>


<img src="https://media.licdn.com/dms/image/D4D22AQH-8ATyCCV_Ww/feedshare-shrink_800/0/1720882385473?e=1724284800&v=beta&t=wc2S7yF2Fo5zIDEjMpC-VdjkvJEg9K4m3mTTHPnknDU"/>


<br/>

This section shows how to train a neural network with batch gradient descent, where the weights and biases are updated from gradients computed over the entire dataset. It works through a simple architecture using symbolic matrices and vectors.

### Architecture and Shapes

Let’s assume the architecture as follows:
- **Input Layer:** 3 nodes
- **Hidden Layer:** 2 nodes
- **Output Layer:** 3 nodes

Here's how the matrices and vectors are shaped:

1. **Input Matrix ($\mathbf{X}$)**: Shape $(m, 3)$
   - $m$ is the number of training examples.

2. **Hidden Layer Weight Matrix ($\mathbf{W}_1$)**: Shape $(3, 2)$
   - Each row holds the weights connecting one input node to each hidden node; each column holds the incoming weights of one hidden node.

3. **Hidden Layer Bias Vector ($\mathbf{b}_1$)**: Shape $(1, 2)$
   - One bias per hidden node.

4. **Output Layer Weight Matrix ($\mathbf{W}_2$)**: Shape $(2, 3)$
   - Each row holds the weights connecting one hidden node to each output node; each column holds the incoming weights of one output node.

5. **Output Layer Bias Vector ($\mathbf{b}_2$)**: Shape $(1, 3)$
   - One bias per output node.
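
These shapes can be verified quickly in NumPy. The sketch assumes `m = 2` and zero-filled placeholder arrays, since only the shapes matter here:

```python
import numpy as np

m = 2                       # number of training examples (assumed for illustration)
X = np.zeros((m, 3))        # input matrix
W1 = np.zeros((3, 2))       # input -> hidden weights
b1 = np.zeros((1, 2))       # hidden biases, broadcast over the m rows
W2 = np.zeros((2, 3))       # hidden -> output weights
b2 = np.zeros((1, 3))       # output biases

Z1 = X @ W1 + b1            # (m, 3) @ (3, 2) + (1, 2) -> (m, 2)
Z2 = Z1 @ W2 + b2           # (m, 2) @ (2, 3) + (1, 3) -> (m, 3)
print(Z1.shape, Z2.shape)   # (2, 2) (2, 3)
```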

### Example with Symbolic Elements

Let's work with symbolic elements to clarify the process:

#### **1. Forward Pass**

**Input Matrix ($\mathbf{X}$)**: Suppose we have 2 training examples.
$
\mathbf{X} = \begin{bmatrix}
x_{11} & x_{12} & x_{13} \\
x_{21} & x_{22} & x_{23}
\end{bmatrix}
$
Here, $\mathbf{X}$ has shape $(2, 3)$ where each row represents a training example.

**Hidden Layer Weight Matrix ($\mathbf{W}_1$)**:
$
\mathbf{W}_1 = \begin{bmatrix}
w_{11} & w_{12} \\
w_{21} & w_{22} \\
w_{31} & w_{32}
\end{bmatrix}
$
Here, $\mathbf{W}_1$ has shape $(3, 2)$, connecting each of the 3 input nodes to each of the 2 hidden nodes.

**Hidden Layer Bias Vector ($\mathbf{b}_1$)**:
$
\mathbf{b}_1 = \begin{bmatrix}
b_{1} & b_{2}
\end{bmatrix}
$
Here, $\mathbf{b}_1$ has shape $(1, 2)$, with each element corresponding to a bias term for a hidden node.

**Compute Hidden Layer Activations ($\mathbf{A}_1$)**:
$
\mathbf{Z}_1 = \mathbf{X} \cdot \mathbf{W}_1 + \mathbf{b}_1
$
$
\mathbf{Z}_1 = \begin{bmatrix}
x_{11} & x_{12} & x_{13} \\
x_{21} & x_{22} & x_{23}
\end{bmatrix} \cdot \begin{bmatrix}
w_{11} & w_{12} \\
w_{21} & w_{22} \\
w_{31} & w_{32}
\end{bmatrix} + \begin{bmatrix}
b_{1} & b_{2}
\end{bmatrix}
$

$
\mathbf{A}_1 = \sigma(\mathbf{Z}_1)
$
where $\sigma$ is the activation function (e.g., sigmoid, tanh).

**Output Layer Weight Matrix ($\mathbf{W}_2$)**:
$
\mathbf{W}_2 = \begin{bmatrix}
w_{1} & w_{2} & w_{3} \\
w_{4} & w_{5} & w_{6}
\end{bmatrix}
$
Here, $\mathbf{W}_2$ has shape $(2, 3)$, connecting each of the 2 hidden nodes to each of the 3 output nodes.

**Output Layer Bias Vector ($\mathbf{b}_2$)**:
$
\mathbf{b}_2 = \begin{bmatrix}
b_{3} & b_{4} & b_{5}
\end{bmatrix}
$
Here, $\mathbf{b}_2$ has shape $(1, 3)$, with each element corresponding to a bias term for an output node.

**Compute Output Layer Activations ($\mathbf{A}_2$)**:
$
\mathbf{Z}_2 = \mathbf{A}_1 \cdot \mathbf{W}_2 + \mathbf{b}_2
$
$
\mathbf{Z}_2 = \begin{bmatrix}
a_{11} & a_{12} \\
a_{21} & a_{22}
\end{bmatrix} \cdot \begin{bmatrix}
w_{1} & w_{2} & w_{3} \\
w_{4} & w_{5} & w_{6}
\end{bmatrix} + \begin{bmatrix}
b_{3} & b_{4} & b_{5}
\end{bmatrix}
$

$
\mathbf{A}_2 = \text{softmax}(\mathbf{Z}_2)
$
where softmax converts the logits into probabilities for classification tasks.
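
A minimal NumPy sketch of this batched forward pass, assuming a sigmoid hidden activation and made-up input values. The function names (`softmax`, `forward_batch`) are illustrative only. Subtracting the row maximum inside `softmax` is a standard numerical-stability step that does not change the result.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Row-wise softmax; each row of the result sums to 1."""
    shifted = z - np.max(z, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

def forward_batch(X, W1, b1, W2, b2):
    Z1 = X @ W1 + b1        # (m, 2)
    A1 = sigmoid(Z1)        # (m, 2)
    Z2 = A1 @ W2 + b2       # (m, 3)
    A2 = softmax(Z2)        # (m, 3), one probability row per example
    return Z1, A1, Z2, A2

rng = np.random.default_rng(0)
X = np.array([[0.1, 0.2, 0.3],
              [0.4, 0.5, 0.6]])
W1, b1 = rng.normal(size=(3, 2)), np.zeros((1, 2))
W2, b2 = rng.normal(size=(2, 3)), np.zeros((1, 3))

Z1, A1, Z2, A2 = forward_batch(X, W1, b1, W2, b2)
print(A2.sum(axis=1))  # each row sums to 1 (up to rounding)
```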

#### **2. Compute Loss**

Given the true labels ($\mathbf{Y}$) and the predicted output ($\mathbf{A}_2$):
$
\text{Loss} = \frac{1}{m} \sum_{i=1}^{m} \text{loss}(\mathbf{Y}_i, \mathbf{A}_{2i})
$
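
For a softmax output, a common choice of per-example loss is categorical cross-entropy. The text leaves the loss unspecified, so the sketch below is one possible instantiation, with `Y` assumed to be one-hot and `eps` a hypothetical guard against `log(0)`:

```python
import numpy as np

def cross_entropy(Y, A2, eps=1e-12):
    """Average categorical cross-entropy over the m rows of Y and A2."""
    m = Y.shape[0]
    return float(-np.sum(Y * np.log(A2 + eps)) / m)

Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
A2 = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1]])
print(cross_entropy(Y, A2))  # equals -(ln 0.7 + ln 0.8) / 2 up to the eps guard
```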

#### **3. Backward Pass**

Calculate gradients for weights and biases.

**Gradient for Output Layer:**
Assuming a cross-entropy loss, the softmax derivative simplifies so that the per-example gradient with respect to the logits $\mathbf{Z}_2$ is:
$
\frac{\partial \text{Loss}}{\partial \mathbf{Z}_2} = \mathbf{A}_2 - \mathbf{Y}
$
$
\frac{\partial \text{Loss}}{\partial \mathbf{W}_2} = \frac{1}{m} \mathbf{A}_1^T \cdot (\mathbf{A}_2 - \mathbf{Y})
$
$
\frac{\partial \text{Loss}}{\partial \mathbf{b}_2} = \frac{1}{m} \sum_{i=1}^{m} (\mathbf{A}_{2i} - \mathbf{Y}_i)
$

**Gradient for Hidden Layer:**
$
\frac{\partial \text{Loss}}{\partial \mathbf{A}_1} = (\mathbf{A}_2 - \mathbf{Y}) \cdot \mathbf{W}_2^T
$
$
\frac{\partial \text{Loss}}{\partial \mathbf{Z}_1} = \frac{\partial \text{Loss}}{\partial \mathbf{A}_1} \odot \sigma'(\mathbf{Z}_1)
$
$
\frac{\partial \text{Loss}}{\partial \mathbf{W}_1} = \frac{1}{m} \mathbf{X}^T \cdot \frac{\partial \text{Loss}}{\partial \mathbf{Z}_1}
$
$
\frac{\partial \text{Loss}}{\partial \mathbf{b}_1} = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial \text{Loss}}{\partial \mathbf{Z}_{1i}}
$
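
These gradient formulas translate almost line for line into NumPy. The sketch assumes a sigmoid hidden layer and softmax with cross-entropy at the output (so that the output-layer term is `A2 - Y`); the function name and the placeholder arrays are illustrative.

```python
import numpy as np

def backward_batch(X, Y, A1, A2, W2):
    """Gradients of the average loss; assumes sigmoid hidden units and softmax + cross-entropy."""
    m = X.shape[0]
    dZ2 = A2 - Y                                   # (m, 3)
    dW2 = A1.T @ dZ2 / m                           # (2, 3)
    db2 = np.sum(dZ2, axis=0, keepdims=True) / m   # (1, 3)
    dA1 = dZ2 @ W2.T                               # (m, 2)
    dZ1 = dA1 * A1 * (1.0 - A1)                    # sigmoid'(Z1) = A1 * (1 - A1), elementwise
    dW1 = X.T @ dZ1 / m                            # (3, 2)
    db1 = np.sum(dZ1, axis=0, keepdims=True) / m   # (1, 2)
    return dW1, db1, dW2, db2

# Shape check with placeholder activations (values are arbitrary).
X = np.ones((2, 3))
Y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
A1 = np.full((2, 2), 0.5)
A2 = np.full((2, 3), 1.0 / 3.0)
W2 = np.ones((2, 3))

dW1, db1, dW2, db2 = backward_batch(X, Y, A1, A2, W2)
print(dW1.shape, db1.shape, dW2.shape, db2.shape)  # (3, 2) (1, 2) (2, 3) (1, 3)
```

Each gradient has the same shape as the parameter it updates, which is a quick sanity check before running the update step.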

#### **4. Update Weights and Biases**

Update weights and biases using the gradients:

$
\mathbf{W}_2 = \mathbf{W}_2 - \eta \cdot \frac{\partial \text{Loss}}{\partial \mathbf{W}_2}
$
$
\mathbf{b}_2 = \mathbf{b}_2 - \eta \cdot \frac{\partial \text{Loss}}{\partial \mathbf{b}_2}
$
$
\mathbf{W}_1 = \mathbf{W}_1 - \eta \cdot \frac{\partial \text{Loss}}{\partial \mathbf{W}_1}
$
$
\mathbf{b}_1 = \mathbf{b}_1 - \eta \cdot \frac{\partial \text{Loss}}{\partial \mathbf{b}_1}
$

where $\eta$ is the learning rate.

### Summary

In summary:
- **Forward Pass**: Compute activations using the input data for the entire dataset.
- **Compute Loss**: Evaluate the loss over the entire dataset.
- **Backward Pass**: Calculate the gradients of the loss with respect to every weight and bias, averaged over the dataset.
- **Update Parameters**: Adjust weights and biases using the gradients obtained.

Because each update uses the gradient averaged over the entire dataset, the updates are less noisy than those of stochastic or mini-batch gradient descent. The trade-off is that the network makes only one parameter update per pass over the data.
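
Putting the four steps together, a full-batch training loop might look like the sketch below. It is an illustration under assumptions not stated in the text: a sigmoid hidden layer, softmax with cross-entropy at the output, a tiny made-up dataset, and hypothetical names and hyperparameters throughout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z, axis=1, keepdims=True))
    return e / np.sum(e, axis=1, keepdims=True)

def train_full_batch(X, Y, hidden=2, eta=0.5, epochs=200, seed=0):
    """Full-batch gradient descent: one parameter update per pass over X."""
    rng = np.random.default_rng(seed)
    m, n_in = X.shape
    n_out = Y.shape[1]
    W1, b1 = rng.normal(scale=0.5, size=(n_in, hidden)), np.zeros((1, hidden))
    W2, b2 = rng.normal(scale=0.5, size=(hidden, n_out)), np.zeros((1, n_out))
    losses = []
    for _ in range(epochs):
        # 1. Forward pass over the whole dataset
        A1 = sigmoid(X @ W1 + b1)
        A2 = softmax(A1 @ W2 + b2)
        # 2. Average cross-entropy loss
        losses.append(float(-np.sum(Y * np.log(A2 + 1e-12)) / m))
        # 3. Backward pass (gradients averaged over the m examples)
        dZ2 = A2 - Y
        dW2 = A1.T @ dZ2 / m
        db2 = dZ2.mean(axis=0, keepdims=True)
        dZ1 = (dZ2 @ W2.T) * A1 * (1.0 - A1)
        dW1 = X.T @ dZ1 / m
        db1 = dZ1.mean(axis=0, keepdims=True)
        # 4. Single update per epoch
        W1 -= eta * dW1
        b1 -= eta * db1
        W2 -= eta * dW2
        b2 -= eta * db2
    return (W1, b1, W2, b2), losses

X = np.eye(3)   # three one-hot inputs (made-up data)
Y = np.eye(3)   # each input maps to its own class
params, losses = train_full_batch(X, Y)
print(losses[0], losses[-1])
```

Recording the loss once per epoch gives a simple way to confirm that training is making progress.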


___