<a target="_blank" href="https://colab.research.google.com/github/BenjaminHerrera/MAT421/blob/main/herrera_module_F.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# **MODULE F:** Differentiation and Optimization
# **AUTHOR:** Benjamin Joseph L. Herrera
# **CLASS:** MAT 421
# **DATE:** 18 FEB 2024

## Limits and Continuity

A limit describes the value that a function approaches as its input approaches a given point. Limits underpin the definitions of continuity, derivatives, and integrals. We define them formally as follows.

Given a function $f : D \rightarrow \Reals$ where $D$ is a subset of $\Reals^d$, $f$ has a limit $L \in \Reals$ as $x$ approaches a point $a \in \Reals^d$, written $\lim_{x \rightarrow a} f(x) = L$, if $f(x)$ can be made arbitrarily close to $L$ by taking $x$ sufficiently close to $a$.

We also define an $r$-ball around $x \in \Reals^d$:

$$B_r(x) = \{z \in \Reals^d : ||z-x|| < r\}$$

The $r$-ball is essentially the set of all vectors in $\Reals^d$ whose distance from $x$ is less than $r$.

Going back to our definition of a limit, we use the ball to make "sufficiently close" precise: $f$ has limit $L$ at $a$ if for every $\epsilon > 0$ there exists a $\sigma > 0$ such that $|f(x) - L| < \epsilon$ for all $x \in D \cap B_\sigma(a)$ with $x \neq a$. This essentially means that we can keep the error between $f(x)$ and $L$ below any threshold $\epsilon$ by restricting $x$ to a small enough ball around $a$.

Continuity is essentially the notion that a function has no jumps or gaps: $f$ is continuous at a point $a \in D$ if $\lim_{x \rightarrow a} f(x) = f(a)$. A discontinuity occurs at a point $x_*$ when this fails, that is, when the limit $L$ of $f$ at $x_*$ does not exist or $f(x_*) \neq L$.

Another thing to note is that if $f : D_1 \rightarrow \Reals^m$ is continuous at $x_i$ and $g : D_2 \rightarrow \Reals^n$, with $D_2 \subseteq \Reals^m$, is continuous at $f(x_i)$, then $g \circ f$ is continuous at $x_i$.

A function $f : D \rightarrow \Reals$ can also attain minimum and maximum values. $f$ attains a maximum value $M$ at $x^* \in D$ if $f(x^*) = M$ and $M \geq f(x)$, $\forall x \in D$. Likewise, $f$ attains a minimum value $N$ at $x^{\circ} \in D$ if $f(x^{\circ}) = N$ and $N \leq f(x)$, $\forall x \in D$.

## Derivatives

A derivative is the instantaneous rate of change of a function with respect to its input. It is defined as the limit below:

$$f'(x) = \frac{df}{dx} = \lim_{h \rightarrow 0} \frac{f(x + h) - f(x)}{h}$$
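As a rough numerical illustration of this limit, the sketch below evaluates the difference quotient for shrinking values of $h$. The helper name `difference_quotient` and the test function $x^3$ are assumptions chosen for the example, not part of the definition.

```python
def difference_quotient(f, x, h):
    """Evaluate (f(x + h) - f(x)) / h, the quotient inside the limit."""
    return (f(x + h) - f(x)) / h


def cube(x):
    # Example function (an assumption); its exact derivative is 3 * x**2.
    return x ** 3


x0 = 2.0
exact = 3 * x0 ** 2
for h in (1e-1, 1e-3, 1e-6):
    approx = difference_quotient(cube, x0, h)
    print(f"h = {h:g}: quotient = {approx:.8f}, error = {abs(approx - exact):.2e}")
```

The error shrinks as $h$ gets smaller, which is what the limit predicts. In floating-point arithmetic, making $h$ extremely small eventually makes the result worse because of round-off.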

Given a function $f : D \rightarrow \Reals$ and a point $x^* \in D$, the sign of the derivative tells us how $f$ behaves near $x^*$. If $f'(x^*) > 0$, then $f$ is increasing at $x^*$: for $x$ sufficiently close to $x^*$, $f(x) > f(x^*)$ when $x > x^*$ and $f(x) < f(x^*)$ when $x < x^*$. The inequalities reverse when $f'(x^*) < 0$, so an extreme value in the interior of $D$ can only occur where $f'(x^*) = 0$.

Another thing to note is that if $f$ is continuous on $[a, b]$ and differentiable on $(a, b)$, and $f(a) = f(b)$, then there exists a $d \in (a, b)$ where $f'(d) = 0$. This is known as Rolle's Theorem.

Dropping the requirement that $f(a) = f(b)$ gives the Mean Value Theorem: there exists a $d \in (a, b)$ such that:

$$f'(d) = \frac{f(b) - f(a)}{b - a}$$

It's simply the rise-over-run-equals-the-slope formula! At some point $d$, the derivative of $f$ equals the slope of the line through the endpoints $(a, f(a))$ and $(b, f(b))$.

We can also take the derivative of the derivative itself, the second derivative, with the following equation:
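To make the slope formula concrete, here is a small sketch that locates such a point $d$ for $f(x) = x^2$ on $[1, 3]$. The example function, the interval, and the use of bisection to solve $f'(d) = \text{slope}$ are all assumptions made for illustration; bisection only works here because $f'(x) - \text{slope}$ changes sign on the interval.

```python
def f(x):
    return x ** 2


def df(x):
    # Exact derivative of the example function.
    return 2 * x


a, b = 1.0, 3.0
slope = (f(b) - f(a)) / (b - a)  # rise over run between the endpoints

# Bisection on g(d) = df(d) - slope, which is negative at a and positive at b.
lo, hi = a, b
for _ in range(60):
    mid = (lo + hi) / 2
    if (df(lo) - slope) * (df(mid) - slope) <= 0:
        hi = mid
    else:
        lo = mid
d = (lo + hi) / 2

print(f"slope = {slope}, d = {d:.6f}, f'(d) = {df(d):.6f}")
```

For this example the point can also be found by hand: $2d = 4$ gives $d = 2$.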

$$f''(x) = \frac{d^2f}{dx^2} = \lim_{h \rightarrow 0} \frac{f'(x + h) - f'(x)}{h}$$

Now a function can have multiple inputs, like $f(x, y, z) = xy + z - z^2$. Differentiating with respect to all inputs at once can be challenging, but we can differentiate with respect to one input at a time while holding the others fixed. Say hello to partial derivatives! The partial derivative with respect to $x_i$ is defined below:

$$\frac{\partial f(x)}{ \partial x_i} = \lim_{h \rightarrow 0} \frac{f(x + he_i)- f(x)}{h}$$

where $e_i \in \Reals^d$ is the $i$-th standard basis vector, with a 1 in the $i$-th position and 0 elsewhere.

We can collect the rates of change of a function with respect to each of its variables in the Jacobian. For a function $f : D \rightarrow \Reals^m$ with $D \subseteq \Reals^d$ and components $f_1, \dots, f_m$, the Jacobian $J_f(x)$ is a matrix of size $m \times d$, where $d$ is the number of inputs and $m$ is the number of outputs. The element in position $i, j$ is $\frac{\partial f_i(x)}{\partial x_j}$.

This captures how every output of a multi-variable function $f$ responds to every input. When $f$ is real-valued ($m = 1$), the Jacobian is a single row, and its transpose is called the gradient of $f$, written $\nabla f(x)$.
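The sketch below approximates each partial derivative of the example $f(x, y, z) = xy + z - z^2$ by nudging one coordinate at a time, mirroring the $x + he_i$ term in the definition. The helper name `partial`, the step size, and the evaluation point are assumptions made for illustration.

```python
def f(p):
    x, y, z = p
    return x * y + z - z ** 2


def partial(func, p, i, h=1e-6):
    """Approximate the i-th partial derivative of func at point p."""
    shifted = list(p)
    shifted[i] += h  # this is p + h * e_i
    return (func(shifted) - func(p)) / h


point = [1.0, 2.0, 3.0]
approx = [partial(f, point, i) for i in range(3)]

# Exact partials: df/dx = y, df/dy = x, df/dz = 1 - 2z.
exact = [point[1], point[0], 1 - 2 * point[2]]

for name, a_val, e_val in zip("xyz", approx, exact):
    print(f"df/d{name}: approx = {a_val:.5f}, exact = {e_val:.5f}")
```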

When a composition $g \circ f$ exists, the Jacobian of that composition is:

$$J_{g \circ f}(x) = J_g(f(x)) \cdot J_f(x)$$

This is the multivariable version of the chain rule!

Now if we want to take the directional derivative of a function $f : D \rightarrow \Reals$, we take a unit direction vector $v \in \Reals^d$ and find the rate of change of the function along $v$ via:
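As a quick check of the product formula, the sketch below uses two hypothetical functions, $f(x, y) = (xy, x + y)$ and $g(u, v) = u^2 + v$, with hand-derived Jacobians. It compares the product of the Jacobians against the Jacobian of the composition worked out directly. The functions and the `matmul` helper are assumptions made for illustration.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def f(x, y):
    return (x * y, x + y)


def jacobian_f(x, y):
    # Rows are outputs of f, columns are inputs (x, y).
    return [[y, x], [1.0, 1.0]]


def jacobian_g(u, v):
    # g(u, v) = u**2 + v has a single output, so one row.
    return [[2 * u, 1.0]]


x, y = 1.0, 2.0
u, v = f(x, y)
chain = matmul(jacobian_g(u, v), jacobian_f(x, y))

# Direct differentiation of g(f(x, y)) = (x*y)**2 + x + y.
direct = [[2 * x * y ** 2 + 1, 2 * x ** 2 * y + 1]]

print("chain rule:", chain)
print("direct:    ", direct)
```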

$$\frac{\partial f(x)}{ \partial v} = \lim_{h \rightarrow 0} \frac{f(x + hv)- f(x)}{h}$$

This is the same as the definition of partial derivatives above, with the basis vector $e_i$ replaced by $v$.

Now, if $f$ is continuously differentiable, we can compute the directional derivative directly from the gradient via the following:

$$\frac{\partial f(x)}{\partial v} = J_f(x) \cdot v = \nabla f(x)^T \cdot v$$
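The following sketch compares the limit form of the directional derivative with the gradient form for a hypothetical function $f(x, y) = x^2 + 3y$ along the unit vector $v = (3/5, 4/5)$. The function, the point, and the step size are assumptions made for illustration.

```python
def f(x, y):
    return x ** 2 + 3 * y


def grad_f(x, y):
    # Hand-derived gradient of the example function.
    return (2 * x, 3.0)


point = (1.0, 2.0)
v = (3 / 5, 4 / 5)  # unit vector: (3/5)**2 + (4/5)**2 = 1

# Limit form: difference quotient along v with a small step h.
h = 1e-6
limit_form = (f(point[0] + h * v[0], point[1] + h * v[1]) - f(*point)) / h

# Gradient form: dot product of the gradient with v.
gradient_form = sum(g * vi for g, vi in zip(grad_f(*point), v))

print(f"limit form    = {limit_form:.6f}")
print(f"gradient form = {gradient_form:.6f}")
```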

What if we want to take partial derivatives of a higher order? Well, we can define the second-order partial derivatives via:

$$\frac{\partial^2 f(x)}{ \partial x_n \partial x_m} = \lim_{h \rightarrow 0} \frac{\partial f(x + he_n) / \partial x_m - \partial f(x) / \partial x_m }{h}$$

The order in which the variables are differentiated does not matter, provided the second-order partial derivatives are continuous:

$$\frac{\partial^2 f(x)}{ \partial x_n \partial x_m} = \frac{\partial^2 f(x)}{ \partial x_m \partial x_n}$$
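To see this symmetry numerically, the sketch below takes a hypothetical function $f(x, y) = x^2 y^3$, uses its hand-derived first partials, and differentiates each one numerically with respect to the other variable. The function, point, and step size are assumptions made for illustration.

```python
def df_dx(x, y):
    # First partial of f(x, y) = x**2 * y**3 with respect to x.
    return 2 * x * y ** 3


def df_dy(x, y):
    # First partial of f(x, y) = x**2 * y**3 with respect to y.
    return 3 * x ** 2 * y ** 2


x0, y0, h = 1.0, 2.0, 1e-6

# Differentiate df/dy with respect to x, and df/dx with respect to y.
d2f_dxdy = (df_dy(x0 + h, y0) - df_dy(x0, y0)) / h
d2f_dydx = (df_dx(x0, y0 + h) - df_dx(x0, y0)) / h

exact = 6 * x0 * y0 ** 2  # both mixed partials equal 6 * x * y**2

print(f"d/dx (df/dy) = {d2f_dxdy:.5f}")
print(f"d/dy (df/dx) = {d2f_dydx:.5f}")
print(f"exact        = {exact:.5f}")
```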

## Taylor's Theorem

Taylor's Theorem allows us to approximate a function around a point $x$ with a polynomial. Using Taylor's Theorem and the definitions of derivatives above, we can express $f(z)$, for an $f$ that is $n$ times continuously differentiable on $[x, z]$, in the form:

$$f(z) = f(x) + (z-x)f'(x) + \dots + \frac{(z - x)^{n - 1}}{(n - 1)!}f^{(n-1)}(x) + C$$

where the remainder $C$ is defined as $\frac{(z - x)^{n}}{n!} f^{(n)}(x + \sigma(z - x))$ for some $\sigma \in (0, 1)$.

So if we expand up to second order, $n=2$, the equation looks like this:

$$f(z) = f(x) + (z-x)f'(x) + \frac{1}{2}(z - x)^2f''(x + \sigma(z - x))$$
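As a sanity check on the $n = 2$ form, the sketch below uses $f = \exp$ (an assumed example, convenient because every derivative of $\exp$ is $\exp$ itself) with $x = 0$ and $z = 0.5$. Since $\exp$ is increasing, the second-derivative term must lie between its values at the two endpoints, so the error of the first-order approximation should fall inside that range.

```python
import math

x, z = 0.0, 0.5

# First-order part of the expansion: f(x) + (z - x) * f'(x).
first_order = math.exp(x) + (z - x) * math.exp(x)
error = math.exp(z) - first_order

# The remainder is 0.5 * (z - x)**2 * f''(xi) for some xi between x and z.
# Because exp is increasing, f''(xi) lies between exp(x) and exp(z).
lower = 0.5 * (z - x) ** 2 * math.exp(x)
upper = 0.5 * (z - x) ** 2 * math.exp(z)

print(f"first-order approximation = {first_order:.5f}")
print(f"actual error              = {error:.5f}")
print(f"remainder bounds          = [{lower:.5f}, {upper:.5f}]")
```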

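We also have the Multivariate Mean Value Theorem: for a continuously differentiable function $f : D \rightarrow \Reals$ and points $x^{\circ}, x \in D$ whose connecting segment lies in $D$, there is some $\delta \in (0, 1)$ such that: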

$$f(x) = f(x^{\circ}) + \nabla f(x^{\circ} + \delta(x-x^{\circ}))^T \cdot (x-x^{\circ})$$

## Conditions for Local Minimizers

An unconstrained minimization problem takes the form:

$$\min_{x \in \Reals^d}f(x)$$

We define a global minimizer as follows. Given a function $f : \Reals^d \rightarrow \Reals$, a point $a$ is a global minimizer of $f$ if $f(x) \geq f(a)$ for all $x$ in $\Reals^d$.

On the other hand, a point $a$ is a local minimizer if there exists a $\delta > 0$ such that $f(x) \geq f(a)$ for all $x$ in $B_\delta(a)$. This pretty much means that $f(a)$ is the lowest value within some radius $\delta$ around $a$.

Now, for a continuously differentiable $f$ as above, a vector $v$ is a descent direction at $x$ if moving a small distance along $v$ decreases $f$. This is guaranteed when:

$$\frac{\partial f(x)}{\partial v} = \nabla f(x)^T \cdot v < 0$$

If $\nabla f(a) \neq 0$, then $f$ has a descent direction at $a$ (for example, $v = -\nabla f(a)$), so $a$ cannot be a local minimizer. Consequently, if $a$ is a local minimizer, then $\nabla f(a) = 0$. This is known as the first-order necessary condition for the minimizer of a function. If $f$ is twice continuously differentiable, a local minimizer $a$ must also have a positive semi-definite Hessian at $a$ (second-order necessary condition). Neither condition is sufficient on its own: $f(x) = x^3$ satisfies both at $x = 0$, which is not a local minimizer.

We talked about necessary conditions; now let's look at sufficient conditions.

For a twice continuously differentiable function $f$, a point $a$ is a strict local minimizer if the gradient of $f$ at $a$ is 0 and the Hessian of $f$ at $a$ is positive definite. This is called the second-order sufficient condition.
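The sketch below checks the second-order sufficient condition for a hypothetical function $f(x, y) = x^2 + xy + y^2$ at the origin. The function, its hand-derived gradient and Hessian, and the use of Sylvester's criterion (positive leading principal minors) to test positive definiteness of a symmetric $2 \times 2$ matrix are assumptions made for illustration.

```python
def gradient(x, y):
    # Gradient of f(x, y) = x**2 + x*y + y**2.
    return (2 * x + y, x + 2 * y)


def hessian(x, y):
    # The Hessian of this quadratic is constant.
    return [[2.0, 1.0], [1.0, 2.0]]


def is_positive_definite_2x2(H):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return H[0][0] > 0 and det > 0


a = (0.0, 0.0)
stationary = gradient(*a) == (0.0, 0.0)
positive_definite = is_positive_definite_2x2(hessian(*a))

print("gradient is zero:", stationary)
print("Hessian is positive definite:", positive_definite)
print("strict local minimizer:", stationary and positive_definite)
```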

## Convexity and Global Minimizers

A convex function is essentially a function where the line segment joining any two points on its graph lies on or above the graph.

We define a convex set as a set $C \subseteq \Reals^d$ where every $x, y \in C$ and $\beta \in [0, 1]$ satisfy $(1 - \beta) \cdot x + \beta \cdot y \in C$.

Similarly, a function $f : C \rightarrow \Reals$ on a convex set $C$ is convex if, for all $x, y \in C$ and $\beta \in [0, 1]$, $f((1 - \beta) \cdot x + \beta \cdot y) \leq (1 - \beta) \cdot f(x) + \beta \cdot f(y)$.

The first-order convexity condition states that a continuously differentiable function $f$ is convex iff $f(y) \geq f(x) + \nabla f(x)^T (y - x)$ for all $x, y$. Similarly to the second-order conditions for minimizers, the second-order convexity condition states that a twice continuously differentiable function $f$ is convex iff the Hessian of $f$ is positive semi-definite at every point $x$.

Now, combine these facts: if $f$ is a convex function and $\nabla f(a) = 0$ at some point $a$, then the first-order convexity condition gives $f(y) \geq f(a) + \nabla f(a)^T (y - a) = f(a)$ for all $y$, so $a$ is a global minimizer of $f$.
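The inequality can be spot-checked numerically. The sketch below samples random pairs of points and random $\beta$ values for the assumed example $f(x) = x^2$ and counts violations. A clean run does not prove convexity; it only fails to find a counterexample. The sampling range, seed, and tolerance are arbitrary choices.

```python
import random


def f(x):
    # Example convex function (an assumption for illustration).
    return x ** 2


random.seed(0)
violations = 0
for _ in range(1000):
    x = random.uniform(-10, 10)
    y = random.uniform(-10, 10)
    beta = random.random()
    lhs = f((1 - beta) * x + beta * y)
    rhs = (1 - beta) * f(x) + beta * f(y)
    # Small tolerance to absorb floating-point round-off.
    if lhs > rhs + 1e-9:
        violations += 1

print("violations found:", violations)
```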

Consequently, any local minimizer of $f$ is a global minimizer!

## Gradient Descent

Given our understanding of derivatives, gradients, minimizers, and convex functions, we can construct a method for finding the minimum of a function. This is known as gradient descent. It is at the core of machine learning, where we want to minimize a cost function $f(x)$:

$$\min_{x \in \Reals^d}f(x)$$

To find the direction of steepest descent at a point $x$ where $\nabla f(x) \neq 0$, we look for the unit vector with the most negative directional derivative. For every unit vector $v$:

$$\frac{\partial f(x)}{\partial v} \geq \frac{\partial f(x)}{\partial v^{\circ}}$$

where $v^{\circ}$ is defined as $-\frac{\nabla f(x)}{||\nabla f(x)||}$.

Essentially, the negative normalized gradient $v^{\circ}$ decreases $f$ at a greater rate than any other unit direction, so gradient descent takes its steps along the negative gradient.
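A minimal sketch of the resulting procedure is shown below. It assumes the common fixed-step update $x \leftarrow x - \alpha \nabla f(x)$, which is not spelled out in the text above, and uses a hypothetical convex cost $f(x, y) = (x - 1)^2 + 2(y + 2)^2$ whose minimizer is $(1, -2)$. The step size, iteration count, and starting point are arbitrary choices for the example.

```python
def f(x, y):
    # Hypothetical convex cost function with minimizer at (1, -2).
    return (x - 1) ** 2 + 2 * (y + 2) ** 2


def grad_f(x, y):
    # Hand-derived gradient of the cost function.
    return (2 * (x - 1), 4 * (y + 2))


def gradient_descent(grad, start, step_size=0.1, iterations=200):
    """Repeatedly step against the gradient with a fixed step size."""
    point = list(start)
    for _ in range(iterations):
        g = grad(*point)
        point = [p - step_size * gi for p, gi in zip(point, g)]
    return point


minimizer = gradient_descent(grad_f, start=(5.0, 5.0))
print("approximate minimizer:", minimizer)
print("cost at that point:", f(*minimizer))
```

Because this cost function is convex, the stationary point that the iteration approaches is a global minimizer. With a step size that is too large, the same loop can overshoot and diverge.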