> This notebook is me trying to follow the explanations provided by Mr. Samson in his video: [Samson's Video](https://www.youtube.com/watch?v=w8yWXqWQYmU) <br>
> Here, I'll try to make it as precise as I can.

## Understanding the maths behind a Neural Network by building one from scratch (Just NumPy)

### Problem Statement
* Dataset: [Kaggle's Digit Recognizer Dataset](https://www.kaggle.com/competitions/digit-recognizer/data)
* It contains $28 \times 28$ grayscale images of handwritten digits
* Each image is accompanied by a label from **0 to 9**.
* **Task**: Build a network that predicts what digit is written.

### Neural Network Overview

<div style="text-align:center">
    <img src="./Notebook_images/NN_diagram_overview.png" alt="NN Overview">
</div>


#### Input Layer:
* The input image is $28\times28$ pixels, i.e. $784$ pixels in total, and each pixel feeds one node of the input layer. Hence, the input layer has $784$ nodes.
* Each pixel has a value between **0 and 255**; *0 being black* and *255 being white*.
* We need to normalize these values, so we *divide each pixel value by the maximum pixel value before feeding it to the network*, i.e., $${Normalized\_Value} = \frac{{pixel\_value}}{255}$$
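
As a minimal sketch of the normalization step, the snippet below scales a small batch of made-up pixel values into the range $[0, 1]$. The array `raw_pixels` is random placeholder data (an assumption for illustration), not the Kaggle dataset.

```python
import numpy as np

# Hypothetical batch of 3 flattened images with raw pixel values in [0, 255].
rng = np.random.default_rng(0)
raw_pixels = rng.integers(0, 256, size=(3, 784))

# Divide by the maximum pixel value so every entry lies in [0, 1].
normalized = raw_pixels / 255.0

print(normalized.min(), normalized.max())
```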

#### Hidden Layer:
* It could have any number of nodes, but to keep things simple we choose $10$ nodes.
* The value of each of these nodes is calculated based on weights and biases applied to the value of the $784$ nodes in the input layer. After this calculation, a ReLU activation is applied to all nodes in the layer.
* For simplicity, we use just one hidden layer.

#### Output Layer:
* The output layer also has $10$ nodes. **Reason: one node for each digit from 0 to 9**
* The value of each of these nodes will again be calculated from weights and biases applied to the value of the $10$ nodes in the hidden layer, with a softmax activation applied to them to get the final output.

### Brief Notes on Forward and Backward Propagation:
#### Forward Prop:
* Forward propagation is simply the process of taking an image and running it through the neural network to get a prediction.
* The prediction made from the given image depends on the *weights and biases* of the network.
#### Backprop:
* In backprop, we take the previously made prediction, calculate how far off it was from the actual value, then run this error backwards through the network to find out how much each weight and bias parameter contributed to it.
* **Gradient descent is carried out using the gradients that backprop computes.**
* The basic idea of gradient descent is to figure out what direction each parameter can go in to decrease error by the greatest amount, then nudge each parameter in its corresponding direction over and over again until the parameters for minimum error and highest accuracy are found. 
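
To make the "nudge each parameter" idea concrete, here is a toy sketch of gradient descent on a single parameter. The function $f(w) = (w - 3)^2$, the helper name `gradient_descent`, and the step size are all illustrative assumptions and have nothing to do with the digit network itself.

```python
def gradient_descent(grad, w0, alpha=0.1, steps=100):
    """Repeatedly step against the gradient, starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - alpha * grad(w)  # move in the direction that lowers the error
    return w

# f(w) = (w - 3)^2 has derivative 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

Each step shrinks the distance to the minimum by a constant factor here, which is why a fixed number of small steps is enough in this toy case.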

### Maths

#### Representing the data

Each training example is represented by a 784-element vector, corresponding to the image's pixels. These vectors can be stacked into a matrix for vectorized calculations, allowing error computation for all examples simultaneously with matrix operations.

In machine learning, it's common to stack these vectors as rows in a matrix with dimensions $m \times n$, where $m$ is the number of training examples and $n$ is the number of features ($784$ in this case). To simplify calculations, we'll transpose this matrix to have dimensions $n \times m$, with each column representing a training example and each row representing a feature.

$$X = \begin{bmatrix}
x^{(1)} \\ x^{(2)} \\ \vdots \\ x^{(m)}
\end{bmatrix}^T = \begin{bmatrix}
x^{(1)} & x^{(2)} & \cdots & x^{(m)}
\end{bmatrix}$$
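
The transposition described above can be sketched in NumPy as follows. The sizes and the random `data_rows` array are placeholders chosen for illustration.

```python
import numpy as np

m, n = 5, 784  # hypothetical: 5 training examples, 784 pixels each
rng = np.random.default_rng(0)

data_rows = rng.random((m, n))  # conventional layout: one example per row (m x n)
X = data_rows.T                 # layout used here: one example per column (n x m)

print(data_rows.shape, X.shape)
```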

#### Representing weights and biases

In a neural network, weights are represented as a matrix of dimensions $n^{[l]} \times n^{[l-1]}$, where $n^{[l-1]}$ is the number of nodes in the previous layer and $n^{[l]}$ is the number of nodes in the current layer. For example, $W^{[1]}$ is a $10 \times 784$ matrix, and $W^{[2]}$ is a $10 \times 10$ matrix. Biases are constant terms added to each node of the following layer and are represented as matrices with dimensions $n^{[l]} \times 1$, so both $b^{[1]}$ and $b^{[2]}$ have dimensions $10 \times 1$.
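
A minimal sketch of allocating parameters with these shapes is shown below. The text does not specify how the values are initialized, so drawing them uniformly from $[-0.5, 0.5)$ is an assumption, as is the helper name `init_params`.

```python
import numpy as np

def init_params(n_input=784, n_hidden=10, n_output=10, seed=0):
    """Create weights and biases with the shapes described above.

    Uniform values in [-0.5, 0.5) are an assumed initialization scheme.
    """
    rng = np.random.default_rng(seed)
    W1 = rng.random((n_hidden, n_input)) - 0.5   # 10 x 784
    b1 = rng.random((n_hidden, 1)) - 0.5         # 10 x 1
    W2 = rng.random((n_output, n_hidden)) - 0.5  # 10 x 10
    b2 = rng.random((n_output, 1)) - 0.5         # 10 x 1
    return W1, b1, W2, b2

W1, b1, W2, b2 = init_params()
print(W1.shape, b1.shape, W2.shape, b2.shape)
```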

#### Forward Propagation

Forward propagation in a neural network involves calculating the unactivated values of the nodes in each layer by applying weights and biases to the input.

1. **First Hidden Layer**:
   - Compute unactivated values: $Z^{[1]} = W^{[1]}X + b^{[1]}$
   - Dimensions: $X$ (784 x m), $W^{[1]}$ (10 x 784), resulting in $Z^{[1]}$ (10 x m).
   - Bias $b^{[1]}$ (10 x 1) is broadcast to match $Z^{[1]}$.
   - Apply activation function (ReLU): $A^{[1]} = \text{ReLU}(Z^{[1]})$.

2. **Second Layer (Output Layer)**:
   - Compute unactivated values: $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$.
   - Apply activation function (softmax): $A^{[2]} = \text{softmax}(Z^{[2]})$.

   - Dimensions: $Z^{[2]}$ and $A^{[2]}$ are both (10 x m).

The softmax function outputs probabilities for each class, allowing the network to predict the likelihood that a given input belongs to each class. The final output matrix $A^{[2]}$ provides these prediction probabilities for all training examples.
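
The forward pass above can be sketched in NumPy as follows. The function names, the random inputs, and the random parameters are assumptions for illustration. Subtracting the column maximum inside `softmax` is a common numerical-stability trick that does not change the result.

```python
import numpy as np

def relu(Z):
    return np.maximum(Z, 0)

def softmax(Z):
    # Shift each column by its maximum to avoid overflow in exp.
    exp_Z = np.exp(Z - Z.max(axis=0, keepdims=True))
    return exp_Z / exp_Z.sum(axis=0, keepdims=True)

def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1 @ X + b1   # (10 x 784) @ (784 x m) -> 10 x m, b1 is broadcast
    A1 = relu(Z1)
    Z2 = W2 @ A1 + b2  # (10 x 10) @ (10 x m) -> 10 x m
    A2 = softmax(Z2)   # each column is a probability distribution
    return Z1, A1, Z2, A2

# Placeholder inputs and parameters, m = 4 examples.
rng = np.random.default_rng(0)
X = rng.random((784, 4))
W1, b1 = rng.random((10, 784)) - 0.5, rng.random((10, 1)) - 0.5
W2, b2 = rng.random((10, 10)) - 0.5, rng.random((10, 1)) - 0.5

Z1, A1, Z2, A2 = forward_prop(W1, b1, W2, b2, X)
predictions = A2.argmax(axis=0)  # most probable digit for each example
print(A2.shape, predictions)
```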

#### Backward Propagation
Backward propagation involves computing how to adjust the neural network's parameters to minimize the loss function. For a softmax classifier, we use a cross-entropy loss function, defined as:

$$J(\hat{y}, y) = -\sum_{i=0}^{c-1} y_i \log(\hat{y}_i)$$

Here, $c = 10$ is the number of classes, $\hat{y}$ is our prediction vector, and $y$ is the one-hot encoded correct label. Because $y$ is one-hot, only the term for the correct class survives, so the loss for a given example is the negative log of the probability assigned to the correct class. The goal is to minimize this loss by updating the parameters using gradient descent.
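
As a rough sketch, one-hot encoding and the cross-entropy loss could be written as below. The helper names are hypothetical. Averaging the per-example loss over the batch is an assumption, and the small `eps` is added only to avoid `log(0)`.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (num_classes x m) matrix with a single 1 per column."""
    Y = np.zeros((num_classes, labels.size))
    Y[labels, np.arange(labels.size)] = 1.0
    return Y

def cross_entropy(A2, Y, eps=1e-12):
    """Mean over examples of -sum_i y_i * log(y_hat_i)."""
    per_example = -np.sum(Y * np.log(A2 + eps), axis=0)
    return per_example.mean()

labels = np.array([3, 0])
Y = one_hot(labels)

# A maximally unsure prediction: probability 0.1 for every digit.
A2 = np.full((10, 2), 0.1)
loss = cross_entropy(A2, Y)
print(loss)  # close to log(10)
```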

We compute the derivative of the loss function with respect to each parameter. For simplicity, these derivatives are denoted as $dW^{[1]}$, $db^{[1]}$, $dW^{[2]}$, and $db^{[2]}$. The process starts by calculating $dZ^{[2]}$, the derivative of the loss with respect to the unactivated output of the second layer. For a softmax output combined with cross-entropy loss, this simplifies to:

$$dZ^{[2]} = A^{[2]} - Y$$

where $Y$ is the $10 \times m$ matrix of one-hot encoded labels. From $dZ^{[2]}$, we calculate $dW^{[2]}$ and $db^{[2]}$ as follows:

$$dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}$$

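$$db^{[2]} = \frac{1}{m} \sum dZ^{[2]}$$

The sum runs over the $m$ training examples (the columns of $dZ^{[2]}$), so $db^{[2]}$ has the same $10 \times 1$ shape as $b^{[2]}$.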

Next, to find $dW^{[1]}$ and $db^{[1]}$, we first determine $dZ^{[1]}$:

$$dZ^{[1]} = W^{[2]T} dZ^{[2]} \cdot g^{[1]\prime}(Z^{[1]})$$

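Here $\cdot$ denotes element-wise multiplication. Since our activation function is ReLU, its derivative is 1 for positive input values and 0 otherwise (the derivative at exactly 0 is undefined and is taken to be 0 by convention). Thus, $g^{[1]\prime}(Z^{[1]})$ is a matrix of 1s and 0s based on the values of $Z^{[1]}$.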
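
A small sketch of the ReLU derivative as a 0/1 mask is given below. The helper name `relu_derivative` and the sample values are illustrative, and the value at exactly 0 is set to 0 by convention.

```python
import numpy as np

def relu_derivative(Z):
    # 1.0 where Z is positive, 0.0 elsewhere (including Z == 0).
    return (Z > 0).astype(float)

Z1 = np.array([[-2.0, 0.0, 3.5],
               [ 1.0, -0.1, 0.0]])
mask = relu_derivative(Z1)
print(mask)
```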

Finally, we calculate $dW^{[1]}$ and $db^{[1]}$:

$$dW^{[1]} = \frac{1}{m} dZ^{[1]} X^T$$

$$db^{[1]} = \frac{1}{m} \sum dZ^{[1]}$$
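
The gradient formulas can be sketched in NumPy as below, taking $dZ^{[2]} = A^{[2]} - Y$ as listed in the summary equations. The function name and the random placeholder data are assumptions. The bias gradients sum over the example axis and keep a column shape so they match $b^{[1]}$ and $b^{[2]}$.

```python
import numpy as np

def backward_prop(Z1, A1, A2, W2, X, Y):
    m = X.shape[1]
    dZ2 = A2 - Y                              # 10 x m
    dW2 = dZ2 @ A1.T / m                      # 10 x 10
    db2 = dZ2.sum(axis=1, keepdims=True) / m  # 10 x 1
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)             # element-wise ReLU mask
    dW1 = dZ1 @ X.T / m                       # 10 x 784
    db1 = dZ1.sum(axis=1, keepdims=True) / m  # 10 x 1
    return dW1, db1, dW2, db2

# Placeholder forward pass on m = 6 random examples.
rng = np.random.default_rng(0)
m = 6
X = rng.random((784, m))
W1, b1 = rng.random((10, 784)) - 0.5, rng.random((10, 1)) - 0.5
W2, b2 = rng.random((10, 10)) - 0.5, rng.random((10, 1)) - 0.5
Z1 = W1 @ X + b1
A1 = np.maximum(Z1, 0)
Z2 = W2 @ A1 + b2
exp_Z = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
A2 = exp_Z / exp_Z.sum(axis=0, keepdims=True)
Y = np.eye(10)[:, rng.integers(0, 10, size=m)]  # one-hot labels, 10 x m

dW1, db1, dW2, db2 = backward_prop(Z1, A1, A2, W2, X, Y)
print(dW1.shape, db1.shape, dW2.shape, db2.shape)
```

Each gradient has the same shape as the parameter it belongs to, which is a quick sanity check worth running before training.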

Once we have all the derivatives, we update our parameters:

$$W^{[2]} := W^{[2]} - \alpha dW^{[2]}$$

$$b^{[2]} := b^{[2]} - \alpha db^{[2]}$$

$$W^{[1]} := W^{[1]} - \alpha dW^{[1]}$$

$$b^{[1]} := b^{[1]} - \alpha db^{[1]}$$

Here, $\alpha$ is the learning rate, a hyperparameter that controls how much we adjust the parameters in each iteration.
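
One way to sketch the update step is shown below. Storing parameters and gradients in dictionaries keyed by name is a design choice made here for brevity, and `update_params` is a hypothetical helper name.

```python
import numpy as np

def update_params(params, grads, alpha):
    """Return new parameters after one gradient descent step."""
    return {name: value - alpha * grads["d" + name]
            for name, value in params.items()}

# Tiny made-up example with a single layer.
params = {"W1": np.ones((2, 3)), "b1": np.zeros((2, 1))}
grads = {"dW1": np.full((2, 3), 0.5), "db1": np.ones((2, 1))}

updated = update_params(params, grads, alpha=0.1)
print(updated["W1"][0, 0], updated["b1"][0, 0])
```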

To summarize the process: we first perform forward propagation to compute the predictions:

$$Z^{[1]} = W^{[1]} X + b^{[1]}$$

$$A^{[1]} = \text{ReLU}(Z^{[1]})$$

$$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$$

$$A^{[2]} = \text{softmax}(Z^{[2]})$$

Then, we perform backpropagation to compute the gradients:

$$dZ^{[2]} = A^{[2]} - Y$$

$$dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}$$

$$db^{[2]} = \frac{1}{m} \sum dZ^{[2]}$$

$$dZ^{[1]} = W^{[2]T} dZ^{[2]} \cdot g^{[1]\prime}(Z^{[1]})$$

$$dW^{[1]} = \frac{1}{m} dZ^{[1]} X^T$$

$$db^{[1]} = \frac{1}{m} \sum dZ^{[1]}$$

Finally, we update the parameters:

$$W^{[2]} := W^{[2]} - \alpha dW^{[2]}$$

$$b^{[2]} := b^{[2]} - \alpha db^{[2]}$$

$$W^{[1]} := W^{[1]} - \alpha dW^{[1]}$$

$$b^{[1]} := b^{[1]} - \alpha db^{[1]}$$

This process is repeated iteratively until the model's performance is satisfactory.
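
Putting the three stages together, a training loop might look roughly like the sketch below. It runs on synthetic data (noisy copies of 10 random sparse "prototype" images) instead of the Kaggle dataset. The initialization, the learning rate, the iteration count, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def init_params(rng):
    # Assumed initialization: uniform values in [-0.5, 0.5).
    return (rng.random((10, 784)) - 0.5, rng.random((10, 1)) - 0.5,
            rng.random((10, 10)) - 0.5, rng.random((10, 1)) - 0.5)

def softmax(Z):
    exp_Z = np.exp(Z - Z.max(axis=0, keepdims=True))
    return exp_Z / exp_Z.sum(axis=0, keepdims=True)

def forward_prop(W1, b1, W2, b2, X):
    Z1 = W1 @ X + b1
    A1 = np.maximum(Z1, 0)
    A2 = softmax(W2 @ A1 + b2)
    return Z1, A1, A2

def backward_prop(Z1, A1, A2, W2, X, Y):
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (Z1 > 0)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

def gradient_descent(X, labels, alpha, iterations, seed=0):
    W1, b1, W2, b2 = init_params(np.random.default_rng(seed))
    Y = np.eye(10)[:, labels]  # one-hot labels, 10 x m
    losses = []
    for _ in range(iterations):
        Z1, A1, A2 = forward_prop(W1, b1, W2, b2, X)
        # Mean cross-entropy before this iteration's update.
        losses.append(-np.mean(np.sum(Y * np.log(A2 + 1e-12), axis=0)))
        dW1, db1, dW2, db2 = backward_prop(Z1, A1, A2, W2, X, Y)
        W1 = W1 - alpha * dW1
        b1 = b1 - alpha * db1
        W2 = W2 - alpha * dW2
        b2 = b2 - alpha * db2
    return (W1, b1, W2, b2), losses

# Synthetic stand-in for the digit data: 200 noisy copies of 10 prototypes.
data_rng = np.random.default_rng(1)
labels = data_rng.integers(0, 10, size=200)
prototypes = data_rng.random((784, 10)) * (data_rng.random((784, 10)) < 0.2)
noise = 0.05 * data_rng.standard_normal((784, 200))
X = np.clip(prototypes[:, labels] + noise, 0.0, 1.0)

params, losses = gradient_descent(X, labels, alpha=0.01, iterations=300)
_, _, A2 = forward_prop(*params, X)
accuracy = np.mean(A2.argmax(axis=0) == labels)
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}, training accuracy: {accuracy:.2f}")
```

The learning rate is kept small here so that full-batch updates stay stable on this synthetic data. A real run on the digit dataset would need its own tuning of `alpha` and `iterations`.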

### The Code

In [None]:
# Code session for tomorrow 😊