## Learning Objectives

- By the end of this session, you will be able to take the derivative of a function of one variable

- You will also be able to take the partial derivative of a function with respect to each of its variables

- You will be able to find the minimum of a function

## Introduction to differentiation

- Differentiation is a technique used to calculate the slope of a graph at different points.

<img src="diff_y_x2.png" width="600" height="600">
<img src="diff_y_x2_gragh.png" width="600" height="600">

## Calculate the gradient of a function at a certain point from the definition

- Choose small $\Delta x$

- $f^\prime(x_0) \approx \frac{f(x_0 + \Delta x) - f(x_0)}{\Delta x}$, which becomes exact in the limit $\Delta x \to 0$

## Activity: Write Python code that calculates the gradient of $x^2$ at $x_0 = 3$ and $x_0 = -2$ from the above definition

In [3]:
def f(x):
    return x**2


eps = 1e-6
x = 3
print((f(x + eps) - f(x)) / eps)
x = -2
print((f(x + eps) - f(x)) / eps)

6.000001000927568
-3.999998999582033
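
The small offsets from 6 and -4 in the output are not just floating-point noise. As a short worked derivation, applying the definition to $f(x) = x^2$ with a finite step $\Delta x$ (the value `eps` in the code) gives:

$$
\frac{f(x_0 + \Delta x) - f(x_0)}{\Delta x}
= \frac{(x_0 + \Delta x)^2 - x_0^2}{\Delta x}
= \frac{2 x_0 \Delta x + \Delta x^2}{\Delta x}
= 2 x_0 + \Delta x
$$

With $\Delta x = 10^{-6}$ this is $6.000001$ at $x_0 = 3$ and $-3.999999$ at $x_0 = -2$, which matches the printed values up to floating-point error. The exact derivative $2 x_0$ is recovered only as $\Delta x \to 0$.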


## Derivative Table

- https://www.qc.edu.hk/math/Resource/AL/Derivative%20Table.pdf

## Extend Gradient into Two-Dimensional Space

- Let's watch this Khan Academy video introducing partial derivatives: https://www.youtube.com/watch?v=AXqhWeUEtQU&t=175s


- Consider the function $f(x, y) = x^2/y$. Calculate the first-order partial derivatives ($\partial f/\partial x$ and $\partial f/\partial y$) and evaluate them at the point $P(2, 1)$.
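
For reference, one way to work this by hand is to treat the other variable as a constant each time: $y$ is held fixed when differentiating with respect to $x$, and $x$ is held fixed when differentiating with respect to $y$.

$$
\frac{\partial f}{\partial x} = \frac{2x}{y},
\qquad
\frac{\partial f}{\partial y} = -\frac{x^2}{y^2}
$$

$$
\left.\frac{\partial f}{\partial x}\right|_{(2,\,1)} = \frac{2 \cdot 2}{1} = 4,
\qquad
\left.\frac{\partial f}{\partial y}\right|_{(2,\,1)} = -\frac{2^2}{1^2} = -4
$$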

## We can use the SymPy (Symbolic Python) library to compute derivatives and partial derivatives

In [2]:
from sympy import symbols, diff
x, y = symbols('x y', real=True)
f = (x**2)/y
fx = diff(f, x, evaluate=True)
fy = diff(f, y, evaluate=True)
print(fx)
print(fy)
# print(f.evalf(subs={x: 2, y: 1}))
print(fx.evalf(subs={x: 2, y: 1}))
print(fy.evalf(subs={x: 2, y: 1}))

2*x/y
-x**2/y**2
4.00000000000000
-4.00000000000000


## Optional Reading: TensorFlow is a powerful package from Google that computes derivatives and partial derivatives by automatic differentiation

In [None]:
import tensorflow as tf 

x = tf.Variable(2.0)
y = tf.Variable(1.0)

with tf.GradientTape(persistent=True) as t:
    z = tf.divide(tf.multiply(x, x), y)

# Use the tape to compute the derivatives of z with respect to the
# variables x and y.
dz_dx = t.gradient(z, x)
dz_dy = t.gradient(z, y)


print(dz_dx)
print(dz_dy)

# All at once:
gradients = t.gradient(z, [x, y])
print(gradients)


del t

## Optional Reading: When x and y are declared as constants, add t.watch(x) and t.watch(y)

In [None]:
import tensorflow as tf 

x = tf.constant(2.0)
y = tf.constant(1.0)

with tf.GradientTape(persistent=True) as t:
    t.watch(x)
    t.watch(y)
    z = tf.divide(tf.multiply(x, x), y)

# Use the tape to compute the derivatives of z with respect to the
# watched constants x and y.
dz_dx = t.gradient(z, x)
dz_dy = t.gradient(z, y)

## Calculate Partial Derivatives from the Definition

In [14]:
def f(x, y):
    return x**2/y


eps = 1e-6
x = 2
y = 1
print((f(x + eps, y) - f(x, y)) / eps)
print((f(x, y + eps) - f(x, y)) / eps)

4.0000010006480125
-3.9999959997594203


Looks about right! This approach works well and is trivial to implement, but it is only
an approximation. More importantly, it needs one call to f() for f(x, y) itself plus at
least one more call per parameter. This makes the approach intractable for large systems
(for example, neural networks with many thousands of parameters).
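
To make the cost of this approach concrete, here is a minimal sketch that generalizes the cell above to any number of parameters and counts how often the function is called. The helper name `numerical_gradient` and the global `calls` counter are illustrative choices, not part of the original notebook.

```python
def numerical_gradient(f, point, eps=1e-6):
    """Forward-difference estimate of every partial derivative of f at point."""
    base = f(*point)  # computed once and reused for every parameter
    grad = []
    for i in range(len(point)):
        shifted = list(point)
        shifted[i] += eps  # nudge only the i-th parameter
        grad.append((f(*shifted) - base) / eps)
    return grad


calls = 0  # hypothetical counter, used only to show the cost


def f(x, y):
    global calls
    calls += 1
    return x**2 / y


grad = numerical_gradient(f, [2, 1])
print(grad)   # approximately [4, -4]
print(calls)  # one call for f(x, y) plus one call per parameter
```

For a function with $n$ parameters this sketch makes $n + 1$ calls, so the work grows linearly with the number of parameters.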

## Why Do We Need Partial Derivatives?

- In many applications, especially data science applications, we want to find the minimum of a cost function

- Most of the time, the cost function is the system error, and we want that error to be as small as possible

<img src="gradient_descent.png" width="800" height="800">

## Finding minimum of a function

- Assume we want to minimize the function $J$ which has two variables $w_0$ and $w_1$

- Two options to find the minimum of $J(w_0, w_1)$:

    - Take the partial derivatives of $J(w_0, w_1)$ with respect to $w_0$ and $w_1$, and set them to zero: $\partial J(w_0, w_1)/\partial w_0 = 0$ and $\partial J(w_0, w_1)/\partial w_1 = 0$. With this approach we have to solve a system of linear or non-linear equations
    
    - Use the gradient descent algorithm. Choose a small step size $\alpha$ and random initial values for the variables, for example `w_0 = np.random.randn()` and `w_1 = np.random.randn()`. Then update $w_0$ and $w_1$ by:
    
        $w_0 = w_0 - \alpha \partial J(w_0, w_1)/\partial w_0$
        
        $w_1 = w_1 - \alpha \partial J(w_0, w_1)/\partial w_1$
        
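    Repeat these two updates inside a `for` loop, evaluating both partial derivatives at the current $w_0$ and $w_1$ before changing either one.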
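
Both options can be tried on a small example. The sketch below assumes a hypothetical cost function $J(w_0, w_1) = (w_0 - 1)^2 + (w_1 + 2)^2$, which is not from the original notes. For this $J$, setting the partial derivatives to zero gives $2(w_0 - 1) = 0$ and $2(w_1 + 2) = 0$, so the minimum is at $(1, -2)$. The step size `alpha = 0.1` and the 200 iterations are also illustrative assumptions, and `random.gauss(0, 1)` from the standard library stands in for `np.random.randn()`.

```python
import random


def J(w0, w1):
    # Hypothetical cost function with its minimum at (1, -2)
    return (w0 - 1) ** 2 + (w1 + 2) ** 2


def dJ_dw0(w0, w1):
    # Partial derivative of J with respect to w0
    return 2 * (w0 - 1)


def dJ_dw1(w0, w1):
    # Partial derivative of J with respect to w1
    return 2 * (w1 + 2)


random.seed(0)           # fixed seed so the run is repeatable
alpha = 0.1              # step size (assumed value)
w0 = random.gauss(0, 1)  # random starting point
w1 = random.gauss(0, 1)

for step in range(200):
    # Evaluate both partial derivatives at the current point first,
    # then update both variables together.
    g0 = dJ_dw0(w0, w1)
    g1 = dJ_dw1(w0, w1)
    w0 = w0 - alpha * g0
    w1 = w1 - alpha * g1

print(w0, w1, J(w0, w1))  # w0 near 1, w1 near -2, J near 0
```

For this particular cost, each step shrinks the distance to the minimum by a factor of $1 - 2\alpha = 0.8$, so the loop lands very close to the point found by setting the partial derivatives to zero.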