# **Resources**

## **Neural Networks**

### **Complete Mathematics of Neural Network**

[Adam_Dhalla_Best](https://www.youtube.com/watch?v=Ixl3nykKG9M&t=2s)

<hr>

## **Mathematics**

### **Multivariable Calculus**

[IIT_Rookie](https://www.youtube.com/playlist?list=PL1XTxGlLddCzE1mHJVul22iGYE5NrSB3I)


<hr>
<hr>
<hr>


# **Complete Mathematics of Neural Network**

**Course Syllabus**

<img src='./Notes_Images/syllabus.png'>

### **Prerequisites**

- Basics of Linear Algebra

- Multivariable Calculus

  - Differential Equations

  - Jacobian / Gradients

- Base ML Knowledge

### **Agenda**

- **Big Picture of Neural Networks**

- **Multivariable Calculus Refresher**

- **Neuron as Function**

- **Jacobians and Neural Networks**

- **Gradient Descent**

- **Backpropagation**

- **Backpropagation as Matrix Multiplication**

<hr>

### **Notation**

<img src='./Notes_Images/notation.png'>

<hr>


## **Neural Networks as Functions**

A neural network is one big function made up of many smaller functions (neurons) that work together to transform input data into output predictions.

The parameters of this function are the weights and biases of the neurons, which are learned during training.

Input to this function is a `vector` of features, and the output is a `vector` of predictions.
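
To make the "function made of smaller functions" view concrete, here is a minimal sketch in plain Python. The `tanh` activation, the layer sizes, and every parameter value are assumptions chosen for illustration; they do not come from the course.

```python
import math

def neuron(weights, bias, inputs):
    """One neuron: weighted sum of the inputs plus a bias, passed through tanh."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return math.tanh(z)

def layer(weight_rows, biases, inputs):
    """A layer is a list of neurons that all read the same input vector."""
    return [neuron(w, b, inputs) for w, b in zip(weight_rows, biases)]

def network(params, inputs):
    """The whole network is the composition of its layers."""
    activations = inputs
    for weight_rows, biases in params:
        activations = layer(weight_rows, biases, activations)
    return activations

# Hypothetical parameters: 2 inputs -> 3 hidden neurons -> 1 output.
params = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]

prediction = network(params, [1.0, 2.0])
print(prediction)  # a list holding one number between -1 and 1
```

The input is a vector of two features and the output is a vector with one prediction; training would adjust only the numbers stored in `params`.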

<hr>

A `Neural Network` is just a `Big Calculus Problem`: we try to minimize the loss function, which measures the difference between the predicted output and the actual output, by adjusting the weights and biases using optimization techniques like gradient descent.

It's all about finding the `Derivative` of `Loss Function` w.r.t `Weights` and `Biases` (parameters of the model) and using that information to update the parameters in the direction that reduces the loss.

Mathematically,

$$ \frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w} $$

where:

- $L$ is the loss function

- $\hat{y}$ is the predicted output

- $w$ is the weight

This is just the chain rule from calculus, and it allows us to compute the gradient of the loss function with respect to the weights and biases of the model.
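
As a small check of the chain rule above, the sketch below assumes the simplest possible model, $\hat{y} = w x$ with a squared-error loss $L = (\hat{y} - y)^2$. The model, the loss, and the numbers are illustrative assumptions, not taken from the course.

```python
def predict(w, x):
    return w * x

def loss(y_hat, y):
    return (y_hat - y) ** 2

w, x, y = 0.5, 2.0, 3.0

y_hat = predict(w, x)           # 1.0
dL_dyhat = 2.0 * (y_hat - y)    # dL/dy_hat = -4.0
dyhat_dw = x                    # dy_hat/dw = 2.0
dL_dw = dL_dyhat * dyhat_dw     # chain rule: -8.0

# Independent check with a central finite difference on the full pipeline.
eps = 1e-6
numeric = (loss(predict(w + eps, x), y) - loss(predict(w - eps, x), y)) / (2 * eps)

print(dL_dw, numeric)
```

The two numbers agree up to rounding, which is what the chain rule promises: multiplying the local derivatives gives the derivative of the whole composition.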

<hr>

Say we have a function with thousands of parameters. To figure out how to lower the loss, we have to find out how much each `Weight` and `Bias` contributes to it, so that we can add or subtract some value from our initial `Weights` and `Biases` to lower the `Loss`.

This is done with `Backpropagation` and `Gradient Descent`.

This means we need to find $\frac{\partial L}{\partial w_i}$ for each `Weight` $w_i$ in the model.

This is where automatic differentiation comes in handy. It allows us to compute these gradients efficiently without having to derive them manually.
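
One way to see what automatic differentiation does is a toy forward-mode version built on dual numbers. This is only a sketch under stated assumptions: the `Dual` class, the `gradient` helper, and the one-sample linear `loss` are all hypothetical names invented here. Backpropagation is the reverse-mode variant, which obtains all partial derivatives from a single backward pass instead of one pass per parameter.

```python
class Dual:
    """A number that carries its value and its derivative w.r.t. one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    @staticmethod
    def lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def loss(w1, w2, b):
    """Squared error of a hypothetical two-input linear neuron on one sample."""
    x1, x2, y = 1.0, 2.0, 3.0
    error = w1 * x1 + w2 * x2 + b - y
    return error * error


def gradient(f, params):
    """Differentiate f w.r.t. each parameter, one forward pass per parameter."""
    grads = []
    for i in range(len(params)):
        # Seed derivative 1.0 on parameter i and 0.0 on all the others.
        duals = [Dual(p, 1.0 if j == i else 0.0) for j, p in enumerate(params)]
        grads.append(f(*duals).deriv)
    return grads


grads = gradient(loss, [0.5, -0.5, 0.0])
print(grads)  # [-7.0, -14.0, -7.0]
```

No derivative formula for `loss` was written by hand; the derivatives fall out of applying the sum and product rules at every arithmetic step.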

<hr>
<hr>

## **Multivariable Calculus Refresher**

### **Understand the Meaning of `Gradients`**

In `Univariate Calculus`, the derivative of a function at a point gives us the `slope of the tangent line to the function at that point`. This tells us how much the function is changing at that point.

For example, let's understand mathematically:

Let $f(x) = x^2$. The derivative of this function is $f'(x) = 2x$. This tells us that at any point $x$, the slope of the tangent line to the function is $2x$. If we evaluate this at $x=1$, we find that the slope is $2$. This means that the function is increasing at that point.

<hr>

In `Multivariable Calculus`, we extend this concept to functions of multiple variables. The gradient of a function is a `vector` that contains all of its `partial derivatives`. It points in the `direction` of the `steepest ascent` of the function.

Mathematically, if we have a function $f(x, y)$, the gradient is given by:

$$ \nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) $$

The magnitude of this vector, $\|\nabla f\|$, is the rate of increase in that steepest direction.

**Image Notes**

<img src='./Notes_Images/note1.png'>

<hr>

<img src='./Notes_Images/note2.png'>

<hr>

<img src='./Notes_Images/note3.png'>

<hr>

<img src='./Notes_Images/note4.png'>

<hr>

### **Worked Examples to Understand `Gradient` and its Significance**

**Example A — linear approximation and directional derivative**

Let

$$
f(x,y) = x^2 + 4y^2.
$$

Compute $\nabla f(x,y) = (2x, 8y)$. At point $(1,1)$:

$$
\nabla f(1,1) = (2,8).
$$

Magnitude:

$$
\|\nabla f\| = \sqrt{2^2 + 8^2} = \sqrt{4 + 64} = \sqrt{68}.
$$

We can keep the exact $\sqrt{68}$ or approximate: $\sqrt{68}\approx 8.246211\ldots$ — so the steepest rate of increase is about $8.246$ units of $f$ per unit distance in $(x,y)$-space.

If we move $\Delta\mathbf{x}=(0.01,0.005)$, linear prediction:

$$
\Delta f \approx \nabla f\cdot \Delta\mathbf{x} = 2\cdot 0.01 + 8\cdot 0.005 = 0.02 + 0.04 = 0.06.
$$

Actual $f(1.01,1.005)$ equals

$$
(1.01)^2 + 4(1.005)^2 = 1.0201 + 4\cdot 1.010025 = 1.0201 + 4.0401 = 5.0602,
$$

initial $f(1,1)=5$. Actual change $=0.0602$, close to linear prediction $0.06$. This demonstrates the gradient gives a very good first-order prediction.
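
The arithmetic of Example A can be reproduced with a few lines of Python. This is a direct transcription of the numbers above; the function names are arbitrary.

```python
import math

def f(x, y):
    return x**2 + 4 * y**2

def grad_f(x, y):
    return (2 * x, 8 * y)

gx, gy = grad_f(1.0, 1.0)
magnitude = math.hypot(gx, gy)                  # sqrt(68)

dx, dy = 0.01, 0.005
predicted = gx * dx + gy * dy                   # first-order prediction
actual = f(1.0 + dx, 1.0 + dy) - f(1.0, 1.0)    # true change in f

print(magnitude, predicted, actual)
```

The gap between `actual` and `predicted` is the second-order term that the linear approximation ignores.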

<hr>

**Example B — gradient descent in 2D (one optimization step)**

Minimize

$$
f(x,y)=(x-3)^2 + 2(y+1)^2.
$$

Gradient:

$$
\nabla f = (2(x-3),\; 4(y+1)).
$$

Start at $(0,0)$. Compute gradient at $(0,0)$:

$$
\nabla f(0,0) = (2(0-3), 4(0+1)) = (-6, 4).
$$

Use learning rate $\eta=0.1$. Gradient descent update:

$$
(x_{\text{new}},y_{\text{new}}) = (0,0) - 0.1\cdot(-6,4) = (0+0.6,\; 0-0.4) = (0.6,\,-0.4).
$$

Function values:

$$
f(0,0) = 9 + 2 = 11.
$$

$$
f(0.6,-0.4) = (0.6-3)^2 + 2(-0.4+1)^2 = (-2.4)^2 + 2(0.6)^2 = 5.76 + 2\cdot0.36 = 5.76 + 0.72 = 6.48.
$$

Big drop from $11$ to $6.48$ in one step — the gradient told us a good descent direction.

Second step (quick):

$$
\nabla f(0.6,-0.4)= (2(0.6-3), 4(-0.4+1)) = (-4.8, 2.4).
$$

Update:

$$
(0.6,-0.4) - 0.1(-4.8,2.4) = (1.08,\,-0.64).
$$

Value $f(1.08,-0.64)=3.9456$ — decreasing further.
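
The two hand-computed steps of Example B extend naturally to a loop. The sketch below keeps the same function, starting point, and learning rate; the step count of 50 is an arbitrary choice for illustration.

```python
def f(x, y):
    return (x - 3) ** 2 + 2 * (y + 1) ** 2

def grad_f(x, y):
    return (2 * (x - 3), 4 * (y + 1))

def gradient_descent(start, learning_rate, steps):
    """Repeatedly step against the gradient and record (x, y, f) after each step."""
    x, y = start
    history = [(x, y, f(x, y))]
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - learning_rate * gx, y - learning_rate * gy
        history.append((x, y, f(x, y)))
    return history

history = gradient_descent(start=(0.0, 0.0), learning_rate=0.1, steps=50)
for step in (0, 1, 2, 50):
    x, y, value = history[step]
    print(f"step {step:2d}: x={x:.4f}, y={y:.4f}, f={value:.4f}")
```

Steps 1 and 2 match the values worked out by hand, and after 50 steps the iterate sits very close to the minimizer $(3, -1)$.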

---

**Summary**

The `Gradient` maps a `Scalar` field to a `Vector` field: at each point it collects the partial derivatives of a scalar-valued function into a vector, which shows how the function changes in every input direction.


## **Understanding `Jacobian` Chain Rule**

[Understand_Jacobian](https://www.youtube.com/watch?v=wCZ1VEmVjVo)

Watch the video above for a complete picture of `Matrix`, `Vectors` and `Jacobian`.

### **What is `Jacobian`?**

The `Jacobian` is a matrix that contains all the first-order partial derivatives of a vector-valued function. If we have a function $\mathbf{f} : \mathbb{R}^n \rightarrow \mathbb{R}^m$, the Jacobian matrix is an $m \times n$ matrix given by:

$$
J(\mathbf{x}) = \begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
$$

where $f_i$ are the components of the vector-valued function $\mathbf{f}$ and $x_j$ are the components of the input vector $\mathbf{x}$.

<hr>

`Jacobian` is another way of representing `Partial Derivatives` of a function that transforms a `vector input` into a `vector output`.

For example, if we have a function `f: R^2 -> R^2` defined as:

$$
f(x,y) = \begin{bmatrix}
x^2 + y \\
2x + 3y
\end{bmatrix}
$$

Here $m = n = 2$, so the Jacobian is the $2 \times 2$ matrix:

$$
J(\mathbf{x}) = \begin{bmatrix}
\frac{\partial f_1}{\partial x} & \frac{\partial f_1}{\partial y} \\
\frac{\partial f_2}{\partial x} & \frac{\partial f_2}{\partial y}
\end{bmatrix}
$$

where $f_1$ and $f_2$ are the two components of $f$ and $\mathbf{x} = (x, y)$.

So, first we take the `Partial Derivative` for the first function i.e. `f_1(x,y) = x^2 + y`:

$$
\frac{\partial f_1}{\partial x} = 2x, \quad \frac{\partial f_1}{\partial y} = 1
$$

Next, for the second function i.e. `f_2(x,y) = 2x + 3y`:

$$
\frac{\partial f_2}{\partial x} = 2, \quad \frac{\partial f_2}{\partial y} = 3
$$

Putting it all together, the Jacobian matrix is:

$$
J(\mathbf{x}) = \begin{bmatrix}
2x & 1 \\
2 & 3
\end{bmatrix}
$$
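
A hand-derived Jacobian can be sanity-checked against finite differences. The helper `numeric_jacobian` below is a hypothetical utility written for this note, and the evaluation point is arbitrary.

```python
def f(x, y):
    return [x**2 + y, 2 * x + 3 * y]

def analytic_jacobian(x, y):
    return [[2 * x, 1.0], [2.0, 3.0]]

def numeric_jacobian(func, point, eps=1e-6):
    """Central differences: column j is the change in every output per unit change in input j."""
    m = len(func(*point))
    n = len(point)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus = list(point)
        plus[j] += eps
        minus = list(point)
        minus[j] -= eps
        f_plus, f_minus = func(*plus), func(*minus)
        for i in range(m):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac

point = (1.5, -2.0)
J_exact = analytic_jacobian(*point)
J_approx = numeric_jacobian(f, point)

print(J_exact)
print(J_approx)
```

Row $i$ of the matrix is the gradient of the component $f_i$, so the Jacobian is simply the gradients of all outputs stacked on top of each other.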

**Summary**

The `Jacobian` is the matrix of the best `Linear Map` approximation of a function `f: R^n -> R^m` near a given point `a`: for a small step `h`, `f(a + h) ≈ f(a) + J(a) h`.
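
The "best linear approximation" reading can be tested on the example function from this section. The point `a` and the step `h` below are arbitrary illustrative values.

```python
def f(x, y):
    return [x**2 + y, 2 * x + 3 * y]

def jacobian(x, y):
    return [[2 * x, 1.0], [2.0, 3.0]]

def mat_vec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

a = (1.0, 2.0)
h = (0.01, -0.02)

# Linear prediction: f(a) + J(a) h
linear = [fa + jh for fa, jh in zip(f(*a), mat_vec(jacobian(*a), h))]
exact = f(a[0] + h[0], a[1] + h[1])
errors = [e - l for e, l in zip(exact, linear)]

print(linear, exact, errors)
```

The second component of `f` is already linear, so its prediction is exact up to rounding; the first component is off by about $h_1^2 = 0.0001$, the second-order term that a linear map cannot capture.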


<hr>
<hr>


# **Understanding `Derivative` and `Gradient` in Detail**

## **Why do we find gradients?** (start from 1-D intuition)

- **In 1-D:** the derivative $f'(x)$ at a point $x=a$ is the _instantaneous rate of change_ — the slope of the tangent line to the graph at $a$.
  That slope answers: _if I change the input by a tiny amount $h$, how much does the output change, to first order?_
  Concretely,

  $$
  f(a+h)\approx f(a) + f'(a)\,h \qquad\text{for small }h.
  $$

  So we find derivatives because they give a **linear prediction** of how the function responds to small changes in the input. That prediction is cheap to compute and extremely useful in practice (physics, economics, optimization, error estimates).

- **Simple 1-D example (step-by-step arithmetic):**
  Take $f(x)=x^2$. Then $f'(x)=2x$.
  At $x=3$: $f'(3)=2\times 3 = 6$.
  If $h=0.01$ then linear prediction:

  $$
  f(3+0.01)\approx f(3)+f'(3)\cdot 0.01 = 9 + 6\cdot0.01 = 9 + 0.06 = 9.06.
  $$

  Exact value: $f(3.01) = (3.01)^2 = 9.0601.$
  Error $= 9.0601 - 9.06 = 0.0001$, which is $\mathcal{O}(h^2)$. The derivative gave us the dominant (linear) change.

- **Why that matters (practical reasons):**

  - Predict local behavior without evaluating the full nonlinear formula.
  - Drive optimization (find minima/maxima).
  - Model instantaneous rates (velocity = derivative of position).
  - Build numerical solvers (Newton’s method linearizes with derivatives).
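
The 1-D example above can be run for several step sizes to watch the error shrink. This is a small sketch using the same $f(x)=x^2$ at $a=3$; the list of step sizes is an arbitrary choice.

```python
def f(x):
    return x**2

def f_prime(x):
    return 2 * x

def linear_prediction(a, h):
    """First-order estimate of f(a + h) from the value and slope at a."""
    return f(a) + f_prime(a) * h

a = 3.0
for h in (0.1, 0.01, 0.001):
    exact = f(a + h)
    predicted = linear_prediction(a, h)
    print(f"h={h}: predicted={predicted:.6f}, exact={exact:.6f}, error={exact - predicted:.2e}")
```

Each time $h$ shrinks by a factor of 10 the error shrinks by a factor of 100, which is the $\mathcal{O}(h^2)$ behaviour noted in the example.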

---

### **If the function is “higher order” (e.g. a polynomial), doesn’t the derivative become a polynomial too — so what does the derivative _at a point_ represent?**

Short answer: **The derivative of a polynomial is indeed another polynomial, but the value of that derivative at a specific point is still the instantaneous slope at that point.** Distinguish between (a) the _derivative function_ and (b) the _derivative evaluated at a point_.

- **Derivative function vs derivative at a point**

  - The derivative $f'(x)$ is a function (it may be polynomial if $f$ is polynomial).

  - The number $f'(a)$ (plugging $x=a$) is the slope at $x=a$. That number is what we use in the linear approximation at that point.

- **Worked example (step-by-step):**
  Let $p(x)=x^3+2x$. Then compute derivative:

  $$
  p'(x)=3x^2+2.
  $$

  At $x=2$: evaluate derivative

  $$
  p'(2) = 3\times 2^2 + 2 = 3\times 4 + 2 = 12 + 2 = 14.
  $$

  So the slope at $x=2$ is $14$. The tangent line at $x=2$ is given by the `point-slope` form:

  $$
  y - y_1 = m(x - x_1)
  $$

  With

  $$
  y_1 = p(2), \quad m = p'(2), \quad x_1 = 2,
  $$

  this becomes

  $$
  y = p(2) + p'(2)(x-2).
  $$

  Compute $p(2)=2^3+2\times2=8+4=12$. So the tangent line is $y = 12 + 14(x-2)$, and $p(x)\approx 12 + 14(x-2)$ for $x$ near $2$.
  For $h=0.01$ we predict

  $$
  p(2+0.01)\approx 12 + 14\cdot 0.01 = 12 + 0.14 = 12.14.
  $$

  Exact: $(2.01)^3 = 8.120601$ (because $2.01^2 = 4.0401$, times $2.01$ gives $8.120601$), and $2\cdot2.01 = 4.02$, so

  $$
  p(2.01) = 8.120601 + 4.02 = 12.140601.
  $$

  Error $= 12.140601 - 12.14 = 0.000601$, again $\mathcal{O}(h^2)$.

- **Interpretation:** even for high-degree polynomials, the derivative evaluated at a point is the linear coefficient of the local (first-order) approximation. Higher derivatives (second, third, …) measure curvature, cubic bending, etc., and appear in higher-order Taylor terms:

  $$
  f(a+h)=f(a)+f'(a)h+\tfrac{1}{2}f''(a)h^2+\cdots.
  $$

  So the derivative at a point goes by several names: _instantaneous rate of change_, _slope of tangent_, or _first-order coefficient_ of the Taylor expansion.
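
To see how the higher-order Taylor terms tighten the estimate, the sketch below reuses $p(x)=x^3+2x$ at $a=2$, $h=0.01$ and adds the second-order term. The derivative functions `p1` and `p2` are written out by hand.

```python
def p(x):
    return x**3 + 2 * x

def p1(x):
    """First derivative of p."""
    return 3 * x**2 + 2

def p2(x):
    """Second derivative of p."""
    return 6 * x

a, h = 2.0, 0.01
exact = p(a + h)
first_order = p(a) + p1(a) * h
second_order = first_order + 0.5 * p2(a) * h**2

print(exact - first_order)    # about 6.01e-4
print(exact - second_order)   # about 1e-6, the leftover h**3 term
```

Adding the curvature term removes the $0.0006$ part of the error and leaves only the cubic remainder $h^3 = 10^{-6}$.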

---

### **Why is the _gradient_ required? Wasn't the derivative enough?**

Now move from single-variable to multivariable. The one-dimensional derivative is **not enough** once you have more than one input variable.

- **Problem with a single scalar derivative:** If $f$ depends on many inputs $x_1,\dots,x_n$, a single number cannot describe how $f$ changes when you vary **each** input independently. You need a collection of partial rates — one per input direction.

- **Definition (multivariable):** For $f:\mathbb{R}^n\to\mathbb{R}$, the **gradient**

  $$
  \nabla f(x) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} & \dfrac{\partial f}{\partial x_2} & \cdots & \dfrac{\partial f}{\partial x_n}\end{bmatrix}^\top
  $$

  is the vector of all first partial derivatives. It generalizes the derivative: the first-order linear approximation becomes

  $$
  f(x+h)\approx f(x) + \nabla f(x)\cdot h,
  $$

  where “$\cdot$” is the dot product. So the gradient is the **best linear predictor** in _every direction_ simultaneously.

- **What the gradient tells you:**

  - **Direction of steepest increase:** the unit direction $u$ that maximizes the directional derivative $\nabla f(x)\cdot u$ is $u = \nabla f(x)/\|\nabla f(x)\|$.

  - **Directional derivative:** the rate of change in direction $u$ is $\nabla f(x)\cdot u$.

  - **Orthogonality to level sets:** the gradient is perpendicular to level sets (contours) of $f$.

- **Concrete multivariable example (step-by-step):**
  Let $f(x,y)=x^2 + 3xy + y^2.$ Compute partials:

  $$
  \frac{\partial f}{\partial x} = 2x + 3y,\qquad
  \frac{\partial f}{\partial y} = 3x + 2y.
  $$

  At the point $(1,1)$:

  $$
  \frac{\partial f}{\partial x}\Big|_{(1,1)} = 2\cdot1 + 3\cdot1 = 2 + 3 = 5,
  $$

  $$
  \frac{\partial f}{\partial y}\Big|_{(1,1)} = 3\cdot1 + 2\cdot1 = 3 + 2 = 5.
  $$

  So $\nabla f(1,1) = \begin{bmatrix}5\\5\end{bmatrix}.$

  - **Directional derivative** in the unit direction $u=\tfrac{1}{\sqrt{2}}(1,1)$ is

    $$
    \nabla f(1,1)\cdot u = [5,5]\cdot\Big[\tfrac{1}{\sqrt2},\tfrac{1}{\sqrt2}\Big] = 5\cdot\tfrac{1}{\sqrt2} + 5\cdot\tfrac{1}{\sqrt2} = \frac{10}{\sqrt2}.
    $$

    Simplify: $\tfrac{10}{\sqrt2} = 5\sqrt2 \approx 5\times1.41421356 = 7.0710678.$

  That number (≈7.071) is the instantaneous rate of increase of $f$ if we move in the 45° direction from $(1,1)$.

- **Why derivative alone would be insufficient:** a single number cannot tell you rates along multiple coordinates or along arbitrary directions. The gradient bundles all per-input sensitivity into one object and allows you to compute directional rates by dotting with the direction vector.

- **Use in optimization (practical):** the gradient tells us which way to move to decrease the function fastest (follow the negative gradient). Gradient descent update:

  $$
  x_{\text{new}} = x_{\text{old}} - \eta\,\nabla f(x_{\text{old}}),
  $$

  where $\eta$ is a step size. This is the engine behind most continuous optimization and training in machine learning.
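
The claim that the gradient direction maximizes the directional derivative can be checked by brute force for the example $f(x,y)=x^2+3xy+y^2$ at $(1,1)$. Scanning directions in 1-degree steps is an arbitrary resolution chosen for this sketch.

```python
import math

def grad_f(x, y):
    # Gradient of f(x, y) = x**2 + 3*x*y + y**2
    return (2 * x + 3 * y, 3 * x + 2 * y)

def directional_derivative(grad, angle):
    """Rate of change along the unit vector (cos(angle), sin(angle))."""
    return grad[0] * math.cos(angle) + grad[1] * math.sin(angle)

grad = grad_f(1.0, 1.0)       # (5.0, 5.0)
norm = math.hypot(*grad)      # length of the gradient, 5 * sqrt(2)

# Scan unit directions in 1-degree steps and keep the one with the largest rate.
best_deg = max(range(360), key=lambda d: directional_derivative(grad, math.radians(d)))
best_rate = directional_derivative(grad, math.radians(best_deg))

print(best_deg, best_rate, norm)
```

The best direction comes out at 45 degrees, the direction of $\nabla f(1,1)$ itself, and the rate there equals $\|\nabla f(1,1)\| = 5\sqrt2 \approx 7.071$. The opposite direction gives the most negative rate, which is why gradient descent steps along $-\nabla f$.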

---

### **Quick hierarchy recap (so the relations are crystal clear)**

1. **Derivative (1-D)** = instantaneous slope at a specific point; gives linear (first-order) approximation: $f(a+h)\approx f(a)+f'(a)h.$

2. **Derivative function** (e.g. $f'(x)$ for a polynomial) is itself a function; evaluating it at a point produces the slope at that point.

3. **Gradient (multivariable)** = vector of partial derivatives; the direct generalization of derivative when the input is multidimensional. Gives linear approximation in every direction: $f(x+h)\approx f(x)+\nabla f(x)\cdot h.$

4. **Higher derivatives (second, Hessian, etc.)** capture curvature and give quadratic (or higher) approximations: $f(a+h)=f(a)+f'(a)h+\tfrac12 f''(a)h^2+\cdots$.

---
