# Gradient Descent - Understanding Theory through Play

Having completed DeepLearning.AI's "Deep Learning Specialization", quite a few of these principles have already stuck with me. Nonetheless, I figured I should stick to the book and keep a rhythm.

Let us see, then, the theory of Gradient Descent through interaction.

In [None]:
!pip install fastai --upgrade -q
!pip install fastbook -q

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from ipywidgets import interact
from fastai import *
from fastai.basics import *
from fastai.vision.all import *
from fastbook import *

def plot_function(f, title=None, min=-2.1, max=2.1, color='r', ylim=None):
    x = torch.linspace(min,max, 100)[:,None]
    if ylim: plt.ylim(ylim)
    plt.plot(x, f(x), color)
    if title is not None: plt.title(title)

### Fitting a function
Let's start by defining a quadratic function, along with a way of fixing its parameters (much like model weights). Later we'll look for the combination of parameters that best fits some data, though it won't match perfectly.

In [None]:
def f(x): return 3*x**2 + 2*x + 1

plot_function(f, "$3x^2 + 2x + 1$")

#### Partial
`functools.partial` gives us a way of making functions with some parameters fixed in advance.

In [None]:
def quad(a,b,c,x): return a*x**2 + b*x + c

In [None]:
from functools import partial
def mk_quad(a,b,c): return partial(quad, a, b, c) #partial fixes a, b and c, leaving a function of x only

In [None]:
f = mk_quad(3,2,1)
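
To convince myself of what `partial` is doing, here is a minimal sketch in plain Python. The name `f_fixed` is mine, not the notebook's; everything else mirrors the `quad` definition above.

```python
from functools import partial

def quad(a, b, c, x):
    return a * x**2 + b * x + c

# Fix a=3, b=2, c=1; the result is a function of x alone.
f_fixed = partial(quad, 3, 2, 1)

value_at_2 = f_fixed(2)           # 3*4 + 2*2 + 1
same_by_hand = quad(3, 2, 1, 2)   # passing all four arguments explicitly
print(value_at_2, same_by_hand)
```

Both calls give the same number, which is the whole point: `partial` only remembers the first three arguments for us.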

### Making some noise and sticking to it
Let's create a noisy sample by grabbing the original function, which is the optimal curve, and let's add some noise to the entries at random. Thus we know that we want a curve that best fits these points.

Now let me clarify what this "noise" function is doing, and what the "add_noise" function is doing. Do note I asked GPT for some clarity on this, and this text is mostly from it. I'm writing it not to come from a place of seeming like I know, but from a place where I want to internalize these ideas and be able to consult them again later.

"noise(x, scale)"
  - x is the array of the points of the original function, with the true parameters.
      We'll use x.shape so that normal(scale=scale, size=x.shape) returns an array of random values with this same shape.
  - scale is a bit harder to explain. normal() means these values come from a Normal Distribution. This curve is centered around a mean value that is 0 by default, and then it spreads out according to the Standard Deviation. The scale specified is the Standard Deviation.


Do note that the values are not between 0 and 1. Instead they are spread around the mean (which is 0 by default). With the default scale of 1, almost all of the generated values (about 99.7%) lie within roughly [-3, 3].
A smaller scale, say 0.1, squeezes the values into a narrower range of roughly [-0.3, 0.3]; once 1 is added (as add_noise does), that becomes roughly [0.7, 1.3].


Adding 1 to the noise (in add_noise) shifts the mean from 0 to 1, since every generated value is now 1 ± something.

To summarize:
The normal distribution has a mean of 0 by default, and values are spread out according to scale. A lower scale keeps values closer to the mean (and a larger scale spreads them out more).

Adding a constant to the values returned by normal "moves" the mean from 0 to somewhere else.

But why is the mean being moved to 1?
Because this noise is used as a multiplier. If the multipliers were centered around 0, many of them would be negative or close to 0, which would flip or shrink your values instead of jittering them. With a mean of 1, each value gets multiplied by something close to 1: sometimes a bit higher, like 1.3, which makes it bigger; sometimes a bit lower, like 0.6, which makes it smaller.
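
A quick way to check these claims is to draw a lot of samples and look at them. This is only a sketch: I use NumPy's `default_rng` with an arbitrary seed here rather than the `normal` import the notebook uses, and the sample size of 10,000 is my own choice.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for this sketch
samples = rng.normal(loc=0.0, scale=0.1, size=10_000)

# Share of samples within 3 standard deviations of the mean
share_within_3_sd = np.mean(np.abs(samples) <= 0.3)

# Adding 1 moves the center of the distribution from 0 to 1
multipliers = 1 + samples

print(f"share of noise within [-0.3, 0.3]: {share_within_3_sd:.3f}")
print(f"mean of noise: {samples.mean():.3f}, mean of 1 + noise: {multipliers.mean():.3f}")
```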

In [None]:
from numpy.random import normal, seed, uniform
np.random.seed(42)
def noise(x, scale):
    return normal(scale=scale, size=x.shape)
def add_noise(x, mult, add):
    return x * (1+noise(x,mult)) + noise(x,add)

Now let's use these functions to graph some points that more or less follow the curve. The noise changes it.

In [None]:
x = torch.linspace(-2, 2, steps=20)[:,None] #This gives us 20 evenly-spaced points from -2 to 2
y = add_noise(f(x), 0.3, 1.5)
#This adds noise to f(x): a multiplier centered on 1 with a standard deviation of 0.3,
#plus additive noise with a standard deviation of 1.5
plt.scatter(x, y)

You can make out the curve in plain sight, yeah?
Now let's create an interactive plot to try and find a best fit by hand.
This is kind of what an ML model does, just in an optimized manner.
You work out which movements bring you closer to your goal: which parameters to move, and in which direction, if only slightly each time.
You move a little, see what happened, then correct or keep going.

In [None]:
#It is here I notice there's a battle between 2020 and 2022 versions of the book.

In [None]:
plot_function??

In [None]:
from ipywidgets import interact #In my case it was about 3.90, 2.10, 0.20 but it varies bc noise.
@interact(a=1.5, b=1.5, c=1.5)
def plot_quad(a,b,c):
    plt.scatter(x,y)
    plot_function(mk_quad(a,b,c), ylim=(-3,12))

### Objectively optimizing, and automatizing objective optimization
So we use Loss Functions to optimize our neural network parameters.
A neural network is a set of parameters (many, many parameters) to try and best fit a complex non-linear function.

This first function is Mean Squared Error (MSE).
- preds: the predictions you've made.
- acts: the actual values. You must know these to have a loss.

The result is the mean of the squared errors: square each difference between prediction and actual value, then average them.

In [None]:
def mse(preds, acts): return ((preds-acts)**2).mean()
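
To see the arithmetic step by step, here is a tiny worked example with NumPy. The three predictions and actual values are made up purely for illustration.

```python
import numpy as np

preds = np.array([1.0, 2.0, 3.0])  # hypothetical predictions
acts = np.array([1.0, 1.0, 5.0])   # hypothetical actual values

errors = preds - acts        # [0, 1, -2]
squared = errors ** 2        # [0, 1, 4]; squaring removes the sign
mse_value = squared.mean()   # (0 + 1 + 4) / 3
print(mse_value)
```

Note how the error of -2 contributes 4 while the error of 1 contributes only 1: squaring punishes big misses much more than small ones.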

In [None]:
@interact(a=1.5, b=1.5, c=1.5)
def plot_quad(a,b,c):
    f = mk_quad(a,b,c)
    plt.scatter(x,y)
    loss = mse(f(x), y) #The loss gives us an objective measure of whether each tweak improved the fit
    plot_function(f, ylim=(-3,12), title=f"MSE: {loss:.2f}")

My guess is that low-dimensional problems are the ones where you can get stuck in local optima (although this particular loss, the MSE of a model that is linear in a, b and c, is convex, so it has a single minimum). In high-dimensional problems most critical points are saddle points rather than local minima, so the danger of getting stuck in a local optimum is much smaller. You can hit slow plateaus, but you eventually keep moving downwards.

Andrew Ng explained that in one of his videos.

### Automate time
Calculus gives us the slope of the loss with respect to each parameter; collected together, these slopes are the gradient. The gradient points uphill, so its negative tells you which direction is downwards.
You then move the parameters a little in that direction, to a position that lowers the error: that lowers the loss.
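
Here is the idea in its smallest form, with one parameter and no PyTorch. Everything in this sketch is hypothetical: `g` is a made-up loss with its minimum at 2, and the slope is approximated with a finite difference instead of being computed exactly.

```python
def g(w):
    # Hypothetical one-parameter "loss" with its minimum at w = 2
    return (w - 2) ** 2

def slope(func, w, eps=1e-6):
    # Central finite difference: an approximation of the derivative at w
    return (func(w + eps) - func(w - eps)) / (2 * eps)

w = 5.0
s = slope(g, w)        # positive here: the loss rises as w increases
w_new = w - 0.1 * s    # so step the other way, scaled down by a small factor
print(s, g(w), g(w_new))
```

The slope is positive, we subtract a fraction of it, and the loss at the new position comes out lower than before.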

So now let's actually automate this.

1) We need a function that takes the coefficients of the quadratic, a, b and c, as inputs.


In [None]:
def quad_mse(params):
    f = mk_quad(*params) #The * unpacks params into the separate arguments a, b, c (argument unpacking)
    return mse(f(x), y)

Let's see the tensor it returns. Tensors are the data structure that both TensorFlow and PyTorch use (different implementations, but tensors nonetheless).

In [None]:
quad_mse([1.5, 1.5, 1.5])

- Lists or vectors of numbers are 1D tensors
- Rectangles of numbers (tables) are 2D tensors
- Layers of tables of numbers are 3D tensors
- and so on...
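
A small sketch to make the dimensions concrete. I'm using NumPy arrays as a stand-in here; PyTorch tensors expose the same `.ndim` and `.shape` attributes. The shapes themselves are arbitrary examples.

```python
import numpy as np

vector = np.array([1.5, 1.5, 1.5])   # 1D: a list of numbers
table = np.zeros((2, 3))             # 2D: 2 rows, 3 columns
stack = np.zeros((4, 2, 3))          # 3D: 4 tables of 2 x 3

for name, arr in [("vector", vector), ("table", table), ("stack", stack)]:
    print(name, arr.ndim, arr.shape)
```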

In [None]:
# In fact, let's put our params into a variable
# This is a Rank 1 tensor, which is something Andrew Ng recommends against!
abc = torch.tensor([1.5, 1.5, 1.5])

# requires_grad_() flags this tensor so that PyTorch tracks gradients for it. In other words, these are our weights,
# and [1.5, 1.5, 1.5] are the values the weights start at
abc.requires_grad_()

In [None]:
loss = quad_mse(abc)
loss

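The printed loss carries a grad_fn: a record of the operation that produced it, which PyTorch uses to work backwards through the calculation.
PyTorch knows how to calculate the gradients for our inputs.
Let's tell it yes, do that: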

In [None]:
loss.backward()

In [None]:
abc.grad

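Each entry of the gradient tells us how the loss changes if we increase that parameter. A negative entry means increasing the parameter lowers the loss; a positive entry means decreasing it does. In my run that worked out to: add, subtract, subtract.
That's right, it's backward: we move each parameter against the sign of its gradient. It's like it's telling us the distance in a signed unit, so that we can push the slope towards 0.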
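
For this particular loss the gradient can also be written out by hand, which helped me see what `abc.grad` contains. This is my own derivation with the chain rule, not something taken from the lecture; $r_i$ is a name I'm introducing for the residual of point $i$, and $n$ is the number of points.

$$
r_i = a x_i^2 + b x_i + c - y_i, \qquad L(a,b,c) = \frac{1}{n}\sum_{i=1}^{n} r_i^2
$$

$$
\frac{\partial L}{\partial a} = \frac{2}{n}\sum_{i=1}^{n} r_i\, x_i^2, \qquad
\frac{\partial L}{\partial b} = \frac{2}{n}\sum_{i=1}^{n} r_i\, x_i, \qquad
\frac{\partial L}{\partial c} = \frac{2}{n}\sum_{i=1}^{n} r_i
$$

So if the curve sits below the points on average (negative residuals), the derivative with respect to c is negative, and subtracting it pushes c up.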

Let's apply that

In [None]:
with torch.no_grad():
    abc -= abc.grad*0.01 #We'll decrease it just one tiny bit, not too much. We may overshoot otherwise.
    loss = quad_mse(abc)

print(f'loss = {loss:.2f}')

PyTorch automatically tracks the operations on abc so that it can calculate derivatives.
But the update step is not part of the function we are differentiating; it is just us nudging the parameters.
So we wrap it in torch.no_grad() to tell PyTorch not to track it.

This is the standard gradient step of a Neural Network. I actually saw the same idea with GradientTape in TensorFlow.
Let's do this many times. These are like the training iterations of a Neural Network.

In [None]:
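for i in range(5):
    if abc.grad is not None: abc.grad.zero_() #Reset first: backward() adds to .grad instead of overwriting it
    loss = quad_mse(abc)
    loss.backward()
    with torch.no_grad():
        abc -= abc.grad*0.01
    print(f"step={i}; loss={loss:.2f}")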
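
To check my understanding of what the loop does, here is the same idea without autograd: a NumPy sketch where the gradient of the MSE is written out by hand with the chain rule. The assumptions are mine: the targets are noise-free (so the run is deterministic), and the names `x_pts`, `y_pts` and `loss_and_grad` are made up for this sketch.

```python
import numpy as np

x_pts = np.linspace(-2, 2, 20)
y_pts = 3 * x_pts**2 + 2 * x_pts + 1   # noise-free targets, to keep the sketch deterministic

def loss_and_grad(params):
    a, b, c = params
    resid = a * x_pts**2 + b * x_pts + c - y_pts
    loss = np.mean(resid**2)
    # Hand-derived partial derivatives of the MSE with respect to a, b and c
    grad = np.array([np.mean(2 * resid * x_pts**2),
                     np.mean(2 * resid * x_pts),
                     np.mean(2 * resid)])
    return loss, grad

params = np.array([1.5, 1.5, 1.5])
losses = []
for step in range(5):
    loss, grad = loss_and_grad(params)
    losses.append(loss)
    params = params - 0.01 * grad      # the gradient is recomputed fresh on every step
    print(f"step={step}; loss={loss:.2f}")
```

The loss should shrink on every step, and the parameters drift from 1.5 towards the true 3, 2, 1.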

In [None]:
# Let's see abc
abc

### Bigger Problems require more flexible functions

Here we get into the non-linearity introduced by Activation Functions.

The most famous one is ReLU: Rectified Linear Unit

In [None]:
def rectified_linear(m,b,x):
    y = m*x + b
    return torch.clip(y, 0.)

You can see they've changed. Now the movement is different!

Let's use partial to fix m=1 and b=1, leaving x free. This plots just the basic shape of ReLU for us to see!

In [None]:
plot_function(partial(rectified_linear, 1, 1))
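
Besides the plot, it helps me to look at a few actual numbers. This sketch uses a NumPy stand-in for the torch function above (`rectified_linear_np` is my name for it), with four input values picked by hand.

```python
import numpy as np

def rectified_linear_np(m, b, x):
    # NumPy stand-in for the notebook's torch version: a line, with negatives clipped to 0
    return np.clip(m * x + b, 0.0, None)

xs = np.array([-2.0, -1.0, 0.0, 1.0])
out = rectified_linear_np(1, 1, xs)   # the line x + 1, flattened to 0 where it would go negative
print(out)
```

At x = -2 the line would give -1, so the output is clipped to 0; from x = -1 onwards the output is just x + 1.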

In [None]:
# Now let's play

@interact(m=1.5, b=1.5)
def plot_relu(m, b):
    plot_function(partial(rectified_linear, m, b), ylim=(-1,4))

Now let's create a double ReLU. We're visualizing the magic of neurons together.

In [None]:
def double_relu(m1, b1, m2, b2, x):
    return rectified_linear(m1,b1,x) + rectified_linear(m2,b2,x)

In [None]:
@interact(m1=-1.5, b1=-1.5, m2=1.5, b2=1.5)
def plot_double_relu(m1, b1, m2, b2):
    plot_function(partial(double_relu, m1, b1, m2, b2), ylim=(-1,6))

You can see how these together can make more complex functions, and fit different forms of the data.
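
With the starting slider values above, the two ReLUs add up to a V shape. Here is a small check of that in NumPy; the helper names are mine, and the comparison against 1.5 * |x + 1| is my own observation, not something from the lecture.

```python
import numpy as np

def relu_line(m, b, x):
    # A line m*x + b with its negative part clipped to 0
    return np.clip(m * x + b, 0.0, None)

def double_relu_np(m1, b1, m2, b2, x):
    return relu_line(m1, b1, x) + relu_line(m2, b2, x)

xs = np.array([-3.0, -1.0, 1.0])
combined = double_relu_np(-1.5, -1.5, 1.5, 1.5, xs)
v_shape = 1.5 * np.abs(xs + 1)   # the V that the two pieces trace out together
print(combined, v_shape)
```

Each ReLU covers one arm of the V: the first is active to the left of x = -1, the second to the right. Neither could make that shape alone.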

### Tips from the lecture
Professor Howard remarks that some things are better to know at the start than at the end.
One of them, a very important one, is whether the problem is feasible at all.

This is like Andrew Ng's approach, in which you iterate on your model continuously.
Professor Howard's reasoning is that if you get something quite accurate, or even somewhat accurate, then perhaps you CAN work on it. If you can't get any accuracy at all despite what you have, perhaps it's not possible.
It's better to find that out first.

Iterating is all about trying and taking next steps from there.

#### Matrix Multiplication
GPUs are basically really good at this computation. It's the foundation of Neural Networks.
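
To connect this with the ReLUs from before: many `m*x + b` calculations can be done in one matrix product. This is a sketch under my own assumptions, with made-up slopes and intercepts for two units and three hand-picked inputs, using NumPy on the CPU rather than anything GPU-specific.

```python
import numpy as np

xs = np.array([[-2.0], [0.0], [2.0]])   # 3 inputs, one feature each: shape (3, 1)
W = np.array([[1.0, -1.5]])             # hypothetical slopes of two units: shape (1, 2)
b = np.array([1.0, -1.5])               # hypothetical intercepts of the two units

pre_activation = xs @ W + b             # every m*x + b, for every input, in one product
activated = np.clip(pre_activation, 0.0, None)
print(activated)
```

Each row is one input and each column is one unit, so the whole "layer" is a single multiplication followed by a clip.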

### Normalization
Professor Howard rescales a continuous numerical value to the range 0 to 1, so that it is on the same scale as the binary categorical inputs.
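
One common way to do this is min-max scaling. I'm not claiming it is the exact formula used in the lecture; the column names and values below are hypothetical.

```python
import numpy as np

ages = np.array([2.0, 20.0, 38.0, 80.0])   # hypothetical continuous column
is_member = np.array([0, 1, 1, 0])         # hypothetical binary column, already 0 or 1

# Min-max scaling: the smallest value maps to 0, the largest to 1
ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())
print(ages_scaled)
```

After scaling, both columns live in the same 0 to 1 range, so neither dominates just because of its units.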

### Big and small numbers
When a column mixes very big and very small numbers, you can take its log. That gets you a much more even distribution.
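
A minimal sketch of the effect, with made-up amounts. I use `np.log1p`, which computes log(1 + x), as my own choice so that a value of 0 does not break the log.

```python
import numpy as np

amounts = np.array([0.0, 7.0, 30.0, 500.0])   # hypothetical skewed column
logged = np.log1p(amounts)                    # log(1 + x), so that 0 stays valid

# Compare the spread before and after taking the log
print(amounts.max() - amounts.min(), logged.max() - logged.min())
```

The order of the values is preserved, but the one huge value no longer dwarfs the rest.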

He mentions that fastai does all of this for you.