# **Resources**

## **Neural Networks**

### **Complete Mathematics of Neural Network**

[Adam_Dhalla_Best](https://www.youtube.com/watch?v=Ixl3nykKG9M&t=2s)

<hr>

## **Mathematics**

### **Multivariable Calculus**

[IIT_Rookie](https://www.youtube.com/playlist?list=PL1XTxGlLddCzE1mHJVul22iGYE5NrSB3I)


<hr>
<hr>
<hr>


# **Complete Mathematics of Neural Network**

**Course Syllabus**

<img src='./Notes_Images/syllabus.png'>

### **Prerequisites**

- Basics of Linear Algebra

- Multivariable Calculus

  - Differential Equations

  - Jacobian / Gradients

- Base ML Knowledge

### **Agenda**

- **Big Picture of Neural Networks**

- **Multivariable Calculus Refresher**

- **Neuron as Function**

- **Jacobians and Neural Networks**

- **Gradient Descent**

- **Backpropagation**

- **Backpropagation as Matrix Multiplication**

<hr>

### **Notation**

<img src='./Notes_Images/notation.png'>

<hr>


## **Neural Networks as Functions**

A neural network is one big fancy function made up of many smaller functions (neurons) that work together to transform input data into output predictions.

The parameters of this function are the weights and biases of the neurons, which are learned during training.

The input to this function is a `vector` of features, and the output is a `vector` of predictions.
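
To make the "network as a function" idea concrete, here is a minimal sketch in plain Python. The architecture (two hidden neurons, one output neuron), the sigmoid activation, and all parameter values are assumptions chosen for illustration; they do not come from the course.

```python
import math

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the inputs plus a bias, squashed by a sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def tiny_network(x, params):
    """Hypothetical network: two hidden neurons feeding one output neuron."""
    hidden = [neuron(x, w, b) for w, b in params["hidden"]]
    w_out, b_out = params["output"]
    return [neuron(hidden, w_out, b_out)]  # output is a vector of predictions

# Arbitrary illustrative weights and biases (these are what training would learn).
params = {
    "hidden": [([0.5, -0.2], 0.1), ([0.3, 0.8], -0.4)],
    "output": ([1.0, -1.0], 0.0),
}

prediction = tiny_network([1.0, 2.0], params)  # input is a vector of features
print(prediction)
```

The whole network is a single call, `tiny_network(x, params)`: a feature vector goes in, a prediction vector comes out, and everything adjustable lives in `params`.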

<hr>

Training a `Neural Network` is just a `Big Calculus Problem`: we try to minimize the loss function, which measures the difference between the predicted output and the actual output, by adjusting the weights and biases with optimization techniques like gradient descent.

It's all about finding the `Derivative` of `Loss Function` w.r.t `Weights` and `Biases` (parameters of the model) and using that information to update the parameters in the direction that reduces the loss.

Mathematically,

$$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} $$

where:

- $ L $ is the loss function

- $ \hat{y} $ is the predicted output

- $ w $ is the weight

This is just the chain rule from calculus, and it allows us to compute the gradient of the loss function with respect to the weights and biases of the model.
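
As a quick sanity check of this chain rule, the sketch below uses the simplest possible model, $\hat{y} = w x$ with a squared-error loss $L = (\hat{y} - y)^2$. The model, the loss, and the numbers are assumptions picked for illustration. It multiplies the two factors by hand and compares the result with a finite-difference estimate.

```python
def predict(w, x):
    """Hypothetical one-weight model: y_hat = w * x."""
    return w * x

def loss(y_hat, y):
    """Squared error between prediction and target."""
    return (y_hat - y) ** 2

w, x, y = 1.5, 2.0, 1.0
y_hat = predict(w, x)

# The two chain-rule factors.
dL_dyhat = 2 * (y_hat - y)   # derivative of the loss w.r.t. the prediction
dyhat_dw = x                 # derivative of the prediction w.r.t. the weight
dL_dw = dL_dyhat * dyhat_dw  # their product is the gradient w.r.t. the weight

# Independent check: nudge w slightly and watch how the loss changes.
h = 1e-6
numeric = (loss(predict(w + h, x), y) - loss(predict(w - h, x), y)) / (2 * h)

print(dL_dw, numeric)
```

Both numbers should agree to several decimal places, which is what the chain rule promises.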

<hr>

Say we have a function with thousands of parameters. To figure out how to lower the loss, we have to find out how much each `Weight` and `Bias` contributes to it, so that we can add or subtract some value from our initial `Weights` to lower the `Loss`.

This is done with `Backpropagation`, which computes these contributions, and `Gradient Descent`, which uses them to update the parameters.

This means we need to find $ \frac{\partial L}{\partial w_i} $ for each `Weight` $ w_i $ in the model.

This is where automatic differentiation comes in handy. It allows us to compute these gradients efficiently without having to derive them manually.
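
To show what "one partial derivative per parameter" means in practice, here is a small sketch. Note that it uses finite differences, not automatic differentiation, purely because that is easy to write without any library. The linear model, the single training sample, and the learning rate are all illustrative assumptions.

```python
def loss(params, x, y):
    """Squared error of a hypothetical linear model y_hat = w1*x1 + w2*x2 + b."""
    w1, w2, b = params
    y_hat = w1 * x[0] + w2 * x[1] + b
    return (y_hat - y) ** 2

def numerical_gradient(params, x, y, h=1e-6):
    """Estimate dL/dp_i for every parameter p_i by nudging it up and down."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += h
        down = list(params)
        down[i] -= h
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * h))
    return grads

params = [0.5, -0.5, 0.0]   # w1, w2, b (arbitrary starting values)
x, y = [1.0, 2.0], 1.0      # one made-up training sample

grads = numerical_gradient(params, x, y)

# One gradient descent update: move each parameter against its own gradient.
eta = 0.1
new_params = [p - eta * g for p, g in zip(params, grads)]

print(grads)
print(loss(params, x, y), loss(new_params, x, y))
```

This approach needs two loss evaluations per parameter, so it does not scale to thousands of parameters. That cost is the reason backpropagation and automatic differentiation matter.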

<hr>
<hr>

## **Multivariable Calculus Refresher**

### **Understand the Meaning of `Gradients`**

In `Univariate Calculus`, the derivative of a function at a point gives us the `slope of the tangent line to the function at that point`. This tells us how much the function is changing at that point.

For example, let's understand mathematically:

Let $f(x) = x^2$. The derivative of this function is $f'(x) = 2x$. This tells us that at any point $x$, the slope of the tangent line to the function is $2x$. If we evaluate this at $x=1$, we find that the slope is $2$. Since the slope is positive, the function is increasing at that point, by about $2$ units of $f$ per unit of $x$.
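
The slope can also be estimated numerically. The sketch below uses a central difference with a step size `h` chosen arbitrarily for illustration.

```python
def f(x):
    return x ** 2

def numerical_derivative(func, x, h=1e-6):
    """Central-difference estimate of the slope of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

slope_at_1 = numerical_derivative(f, 1.0)
print(slope_at_1)  # should be very close to f'(1) = 2
```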

<hr>

In `Multivariable Calculus`, we extend this concept to functions of multiple variables. The gradient of a function is a `vector` that contains all of its `partial derivatives`. It points in the `direction` of the `steepest ascent` of the function.

Mathematically, if we have a function $f(x, y)$, the gradient is given by:

$$ \nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) $$

This vector points in the direction of the steepest increase of the function, and its magnitude is the rate of increase in that direction.

**Image Notes**

<img src='./Notes_Images/note1.png'>

<hr>

<img src='./Notes_Images/note2.png'>

<hr>

<img src='./Notes_Images/note3.png'>

<hr>

<img src='./Notes_Images/note4.png'>

<hr>

### **Worked Examples to Understand `Gradient` and Its Significance**

**Example A — linear approximation and directional derivative**

Let

$$
f(x,y) = x^2 + 4y^2.
$$

Compute $\nabla f(x,y) = (2x, 8y)$. At point $(1,1)$:

$$
\nabla f(1,1) = (2,8).
$$

Magnitude:

$$
\|\nabla f\| = \sqrt{2^2 + 8^2} = \sqrt{4 + 64} = \sqrt{68}.
$$

We can keep the exact $\sqrt{68}$ or approximate: $\sqrt{68}\approx 8.246211\ldots$ — so the steepest rate of increase is about $8.246$ units of $f$ per unit distance in $(x,y)$-space.

If we move by $\Delta\mathbf{x}=(0.01,0.005)$, the linear prediction is:

$$
\Delta f \approx \nabla f\cdot \Delta\mathbf{x} = 2\cdot 0.01 + 8\cdot 0.005 = 0.02 + 0.04 = 0.06.
$$

Actual $f(1.01,1.005)$ equals

$$
(1.01)^2 + 4(1.005)^2 = 1.0201 + 4\cdot 1.010025 = 1.0201 + 4.0401 = 5.0602,
$$

The initial value is $f(1,1)=5$, so the actual change is $0.0602$, close to the linear prediction of $0.06$. This demonstrates that the gradient gives a very good first-order prediction for small steps.
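
The arithmetic in Example A can be reproduced with a few lines of plain Python. This is only a sketch of the same computation; the function and variable names are chosen for illustration.

```python
import math

def f(x, y):
    return x ** 2 + 4 * y ** 2

def grad_f(x, y):
    """Analytic gradient of f: (2x, 8y)."""
    return (2 * x, 8 * y)

g = grad_f(1.0, 1.0)
magnitude = math.hypot(g[0], g[1])       # steepest rate of increase, sqrt(68)

dx, dy = 0.01, 0.005
predicted = g[0] * dx + g[1] * dy        # linear prediction: gradient dot step
actual = f(1.0 + dx, 1.0 + dy) - f(1.0, 1.0)

print(magnitude, predicted, actual)
```

The small gap between `predicted` and `actual` is the second-order term that the linear approximation ignores.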

<hr>

**Example B — gradient descent in 2D (one optimization step)**

Minimize

$$
f(x,y)=(x-3)^2 + 2(y+1)^2.
$$

Gradient:

$$
\nabla f = (2(x-3),\; 4(y+1)).
$$

Start at $(0,0)$. Compute gradient at $(0,0)$:

$$
\nabla f(0,0) = (2(0-3), 4(0+1)) = (-6, 4).
$$

Use learning rate $\eta=0.1$. Gradient descent update:

$$
(x_{\text{new}},y_{\text{new}}) = (0,0) - 0.1\cdot(-6,4) = (0+0.6,\; 0-0.4) = (0.6,\,-0.4).
$$

Function values:

$$
f(0,0) = 9 + 2 = 11.
$$

$$
f(0.6,-0.4) = (0.6-3)^2 + 2(-0.4+1)^2 = (-2.4)^2 + 2(0.6)^2 = 5.76 + 2\cdot0.36 = 5.76 + 0.72 = 6.48.
$$

Big drop from $11$ to $6.48$ in one step — the gradient told us a good descent direction.

Second step (quick):

$$
\nabla f(0.6,-0.4)= (2(0.6-3), 4(-0.4+1)) = (-4.8, 2.4).
$$

Update:

$$
(0.6,-0.4) - 0.1(-4.8,2.4) = (1.08,\,-0.64).
$$

Value $f(1.08,-0.64)=3.9456$ — decreasing further.
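
The two hand-computed steps of Example B can be written as a short loop. This is a minimal sketch with a fixed learning rate; `gradient_descent` is a name made up here, not a library function.

```python
def f(x, y):
    return (x - 3) ** 2 + 2 * (y + 1) ** 2

def grad_f(x, y):
    """Analytic gradient of f: (2(x - 3), 4(y + 1))."""
    return (2 * (x - 3), 4 * (y + 1))

def gradient_descent(start, eta=0.1, steps=2):
    """Return every point visited, beginning with the start point."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - eta * gx, y - eta * gy  # step against the gradient
        path.append((x, y))
    return path

path = gradient_descent((0.0, 0.0), eta=0.1, steps=2)
for x, y in path:
    print((round(x, 4), round(y, 4)), round(f(x, y), 4))

# Running many more steps should approach the minimizer (3, -1).
final_x, final_y = gradient_descent((0.0, 0.0), eta=0.1, steps=100)[-1]
print(final_x, final_y)
```

The first three printed points should match the values worked out by hand above, up to floating-point rounding.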

---

**Summary**

The `Gradient` turns a `Scalar` field into a `Vector` field: at every point it gives the direction and rate of steepest increase, providing a way to understand how functions change in multiple dimensions.


## **Understanding `Jacobian` Chain Rule**

[Understand_Jacobian](https://www.youtube.com/watch?v=wCZ1VEmVjVo)

Watch the above video to get a complete idea of `Matrix`, `Vectors` and `Jacobian`.

### **What is `Jacobian`?**

The `Jacobian` is a matrix that contains all the first-order partial derivatives of a vector-valued function. If we have a function $\mathbf{f} : \mathbb{R}^n \rightarrow \mathbb{R}^m$, the Jacobian matrix is an $m \times n$ matrix given by:

$$
J(\mathbf{x}) = \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
$$

where $f_i$ are the components of the vector-valued function $\mathbf{f}$ and $x_j$ are the components of the input vector $\mathbf{x}$.

<hr>

The `Jacobian` is another way of representing the `Partial Derivatives` of a function that transforms a `vector input` into a `vector output`.

For example, if we have a function `f: R^2 -> R^2` defined as:

$$
f(x,y) = \begin{bmatrix}
x^2 + y \\
2x + 3y
\end{bmatrix}
$$

Here $m = n = 2$, so the Jacobian matrix has the form:

$$
J(x,y) = \begin{bmatrix}
\frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} \\
\frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y}
\end{bmatrix}
$$

where $f_1$ and $f_2$ are the two components of $f$.

So, first we take the `Partial Derivatives` of the first component, i.e. `f_1(x,y) = x^2 + y`:

$$
\frac{\partial f_1}{\partial x} = 2x, \quad \frac{\partial f_1}{\partial y} = 1
$$

Next, for the second component, i.e. `f_2(x,y) = 2x + 3y`:

$$
\frac{\partial f_2}{\partial x} = 2, \quad \frac{\partial f_2}{\partial y} = 3
$$

Putting it all together, the Jacobian matrix is:

$$
J(\mathbf{x}) = \begin{bmatrix}
2x & 1 \\
2 & 3
\end{bmatrix}
$$
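
The hand-derived Jacobian can be checked against a numerical estimate. The sketch below builds the matrix column by column with central differences; the evaluation point $(1, 2)$ and the step size are arbitrary choices for illustration.

```python
def f(v):
    """The example function f(x, y) = (x^2 + y, 2x + 3y)."""
    x, y = v
    return [x ** 2 + y, 2 * x + 3 * y]

def analytic_jacobian(v):
    """The Jacobian derived by hand: [[2x, 1], [2, 3]]."""
    x, y = v
    return [[2 * x, 1.0], [2.0, 3.0]]

def numerical_jacobian(func, v, h=1e-6):
    """Entry [i][j] is a central-difference estimate of d f_i / d v_j."""
    m, n = len(func(v)), len(v)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        up = list(v)
        up[j] += h
        down = list(v)
        down[j] -= h
        f_up, f_down = func(up), func(down)
        for i in range(m):
            J[i][j] = (f_up[i] - f_down[i]) / (2 * h)
    return J

point = [1.0, 2.0]
J_exact = analytic_jacobian(point)
J_approx = numerical_jacobian(f, point)
print(J_exact)
print(J_approx)
```

Each column of the matrix answers one question: how do all outputs change when a single input is nudged?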

**Summary**

The `Jacobian` is the matrix representing the best `Linear Map` approximation of a function `f: R^n -> R^m` near a given point `a` in `R^n`.
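
The "best linear approximation" claim can be tested on the example function $f(x,y) = (x^2 + y,\; 2x + 3y)$ by comparing $f(\mathbf{a} + \mathbf{d})$ with $f(\mathbf{a}) + J(\mathbf{a})\,\mathbf{d}$ for a small step $\mathbf{d}$. The point `a` and the step `d` below are arbitrary values picked for illustration.

```python
def f(v):
    x, y = v
    return [x ** 2 + y, 2 * x + 3 * y]

def jacobian(v):
    """Jacobian of f derived by hand: [[2x, 1], [2, 3]]."""
    x, y = v
    return [[2 * x, 1.0], [2.0, 3.0]]

def mat_vec(M, d):
    """Multiply matrix M (list of rows) by vector d."""
    return [sum(m_ij * d_j for m_ij, d_j in zip(row, d)) for row in M]

a = [1.0, 2.0]
d = [0.01, -0.02]

exact = f([a_i + d_i for a_i, d_i in zip(a, d)])
linear = [f_i + c_i for f_i, c_i in zip(f(a), mat_vec(jacobian(a), d))]
errors = [abs(e - l) for e, l in zip(exact, linear)]

print(exact)
print(linear)
print(errors)
```

The second component of $f$ is already linear, so its error is zero up to rounding. The first component is off only by the squared term $(0.01)^2$.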
