# Micrograd

1. watch the [micrograd video](https://www.youtube.com/watch?v=VMj-3S1tku0) on YouTube
2. come back and complete these exercises to level up :)

## 1. *Write the function df that returns the analytical gradient of f, i.e. use your skills from calculus to take the derivative, then implement the formula.*

In [204]:
# here is a mathematical expression that takes 3 inputs and produces one output
from math import sin, cos

def f(a, b, c):
  return -a**3 + sin(3*b) - 1.0/c + b**2.5 - a**0.5

print(f(2, 3, 4))

6.336362190988558


In [205]:
# ------
# TODO
def gradf(a, b, c):
    dfda = -3 * a ** 2 - 1 / 2 * a ** (-1 / 2)
    dfdb = 3 * cos(3 * b) + 5 / 2 * b ** (3 / 2)
    dfdc = c ** (-2)

    grad = [dfda, dfdb, dfdc]
    return grad
# ------

In [206]:
# expected answer is the list of partial derivatives [df/da, df/db, df/dc] at (2, 3, 4)
ans = [-12.353553390593273, 10.25699027111255, 0.0625]
yours = gradf(2, 3, 4)
for dim in range(3):
  ok = 'OK' if abs(yours[dim] - ans[dim]) < 1e-5 else 'WRONG!'
  print(f"{ok} for dim {dim}: expected {ans[dim]}, yours returns {yours[dim]}")


OK for dim 0: expected -12.353553390593273, yours returns -12.353553390593273
OK for dim 1: expected 10.25699027111255, yours returns 10.25699027111255
OK for dim 2: expected 0.0625, yours returns 0.0625
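
For reference, the formulas implemented in `gradf` come from differentiating `f` term by term. This is a worked sketch; the notation is ours and does not come from the video:

$$
\begin{aligned}
f(a, b, c) &= -a^{3} + \sin(3b) - \frac{1}{c} + b^{2.5} - a^{0.5} \\
\frac{\partial f}{\partial a} &= -3a^{2} - \tfrac{1}{2}\,a^{-1/2} \\
\frac{\partial f}{\partial b} &= 3\cos(3b) + \tfrac{5}{2}\,b^{3/2} \\
\frac{\partial f}{\partial c} &= \frac{1}{c^{2}}
\end{aligned}
$$

At (a, b, c) = (2, 3, 4) the last line gives 1/16 = 0.0625, which matches the expected answer for dim 2.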


## 2. *Now estimate the gradient numerically without any calculus, using the approximation we used in the video. You should not call the analytical gradient function from the last cell.*

In [207]:
# ------
# TODO
def gradf_numeric(a, b, c, h = 1e-6):
    dfda = (f(a + h, b, c) - f(a, b, c)) / h
    dfdb = (f(a, b + h, c) - f(a, b, c)) / h
    dfdc = (f(a, b, c + h) - f(a, b, c)) / h

    grad = [dfda, dfdb, dfdc]
    return grad

numerical_grad = gradf_numeric(2, 3, 4)
# ------

In [208]:
for dim in range(3):
  ok = 'OK' if abs(numerical_grad[dim] - ans[dim]) < 1e-5 else 'WRONG!'
  print(f"{ok} for dim {dim}: expected {ans[dim]}, yours returns {numerical_grad[dim]}")

OK for dim 0: expected -12.353553390593273, yours returns -12.353559348809995
OK for dim 1: expected 10.25699027111255, yours returns 10.256991666679482
OK for dim 2: expected 0.0625, yours returns 0.062499984743169534


## 3. *There is an alternative formula that provides a much better numerical approximation to the derivative of a function. Learn about it here: https://en.wikipedia.org/wiki/Symmetric_derivative, then implement it. Confirm that for the same step size h this version gives a better approximation.*

In [209]:
# ------
# TODO
def gradf_numeric2(a, b, c, h = 1e-6):
    dfda = (f(a + h, b, c) - f(a - h, b, c)) / (2 * h)
    dfdb = (f(a, b + h, c) - f(a, b - h, c)) / (2 * h)
    dfdc = (f(a, b, c + h) - f(a, b, c - h)) / (2 * h)

    grad = [dfda, dfdb, dfdc]
    return grad

numerical_grad2 = gradf_numeric2(2, 3, 4)
# ------

In [210]:
for dim in range(3):
  ok = 'OK' if abs(numerical_grad2[dim] - ans[dim]) < 1e-5 else 'WRONG!'
  print(f"{ok} for dim {dim}: expected {ans[dim]}, yours returns {numerical_grad2[dim]}")

OK for dim 0: expected -12.353553390593273, yours returns -12.353553391353245
OK for dim 1: expected 10.25699027111255, yours returns 10.25699027401572
OK for dim 2: expected 0.0625, yours returns 0.06250000028629188
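
The check above only reports OK or WRONG against a fixed tolerance, so it does not show directly that the symmetric version is closer. A minimal sketch of an explicit comparison follows. The helper names `forward_diff` and `central_diff` are hypothetical, and the reference gradient is copied from the expected answer in exercise 1:

```python
from math import sin

def f(a, b, c):
    # same expression as in exercise 1
    return -a**3 + sin(3*b) - 1.0/c + b**2.5 - a**0.5

# analytical gradient at (2, 3, 4), copied from the expected answer in exercise 1
ans = [-12.353553390593273, 10.25699027111255, 0.0625]

def forward_diff(point, h=1e-6):
    # hypothetical helper: one-sided difference (f(x + h) - f(x)) / h per input
    base = f(*point)
    grads = []
    for i in range(len(point)):
        bumped = list(point)
        bumped[i] += h
        grads.append((f(*bumped) - base) / h)
    return grads

def central_diff(point, h=1e-6):
    # hypothetical helper: symmetric difference (f(x + h) - f(x - h)) / (2h) per input
    grads = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += h
        down[i] -= h
        grads.append((f(*up) - f(*down)) / (2 * h))
    return grads

point = (2, 3, 4)
err_forward = [abs(g - a) for g, a in zip(forward_diff(point), ans)]
err_central = [abs(g - a) for g, a in zip(central_diff(point), ans)]

for dim in range(3):
    print(f"dim {dim}: forward error {err_forward[dim]:.3e}, central error {err_central[dim]:.3e}")
```

With the same `h`, the symmetric difference should show a smaller absolute error in every dimension. Its truncation error shrinks with `h**2`, while the one-sided version shrinks only with `h`.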


## 4. *Without referencing our code/video __too__ much, make this cell work. You'll have to implement (in some cases re-implement) a number of functions of the Value object, similar to what we've seen in the video. Instead of the squared error loss, this implements the negative log likelihood loss, which is very often used in classification.*


In [211]:
# Value class starter code, with many functions taken out
from math import exp, log

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data})"
  
    def __add__(self, other): # exactly as in the video
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
    
        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad
        out._backward = _backward

        return out

    # ------
    # TODO
    # re-implement all the other functions needed for the exercises below
    # your code here

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), "*")

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def __pow__(self, other):
        assert isinstance(other, (int, float)), "only supporting int / float powers"
        out = Value(self.data ** other, (self, ), f"**{other}")

        def _backward():
            # power rule: d/dx x**n = n * x**(n - 1)
            self.grad += other * (self.data ** (other - 1)) * out.grad

        out._backward = _backward
        return out


    def exp(self):
        x = self.data

        out = Value(exp(x), (self, ), "exp")

        def _backward():
            self.grad += out.data * out.grad

        out._backward = _backward
        return out


    def log(self):
        x = self.data
        out = Value(log(x), (self, ), 'log')

        def _backward():
            # d/dx log(x) = 1 / x
            self.grad += (1.0 / x) * out.grad

        out._backward = _backward
        return out


    def tanh(self):
        x = self.data
        t = (exp(2*x) - 1)/(exp(2*x) + 1)
        out = Value(t, (self, ), 'tanh')
        
        def _backward():
            self.grad += (1 - t**2) * out.grad
        out._backward = _backward

        return out

    def __rmul__(self, other): # other * self
        return self * other

    def __truediv__(self, other):
        return self * other**-1

    def __neg__(self):
        return self * -1

    def __sub__(self, other):
        return self + (-other)

    def __radd__(self, other): 
        return self + other
    

    def sin(self):
        out = Value(sin(self.data), (self, ), 'sin')

        def _backward():
            self.grad += cos(self.data) * out.grad
        out._backward = _backward
        return out

    # ------

    def backward(self):
        topo = []
        visited = set()

        def build_topo(v):
            if v not in visited:
                visited.add(v)

                for child_v in v._prev:
                    build_topo(child_v)
                topo.append(v)
        build_topo(self)

        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

In [213]:
# this is the softmax function
# https://en.wikipedia.org/wiki/Softmax_function
def softmax(logits):
    counts = [logit.exp() for logit in logits]
    denominator = sum(counts)
    out = [c / denominator for c in counts]
    return out

In [214]:
# this is the negative log likelihood loss function, pervasive in classification
logits = [Value(0.0), Value(3.0), Value(-2.0), Value(1.0)]
probs = softmax(logits)
print(f"probs: {[round(p.data, 5) for p in probs]}")

loss = -probs[3].log() # dim 3 acts as the label for this input example
loss.backward()
print(loss.data)

probs: [0.04177, 0.83902, 0.00565, 0.11355]
2.1755153626167147


In [215]:
ans = [0.041772570515350445, 0.8390245074625319, 0.005653302662216329, -0.8864503806400986]
for dim in range(4):
  ok = 'OK' if abs(logits[dim].grad - ans[dim]) < 1e-5 else 'WRONG!'
  print(f"{ok} for dim {dim}: expected {ans[dim]}, yours returns {logits[dim].grad}")

OK for dim 0: expected 0.041772570515350445, yours returns 0.041772570515350445
OK for dim 1: expected 0.8390245074625319, yours returns 0.8390245074625319
OK for dim 2: expected 0.005653302662216329, yours returns 0.005653302662216329
OK for dim 3: expected -0.8864503806400986, yours returns -0.8864503806400986
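
The expected values above have a simple pattern. For a softmax followed by the negative log likelihood of the labelled class, the gradient with respect to each logit is the predicted probability minus 1 for the labelled class, and the predicted probability itself for every other class. A minimal sketch that checks this with plain floats (no `Value` objects; the variable names are ours):

```python
from math import exp

logits = [0.0, 3.0, -2.0, 1.0]
label = 3  # dim 3 acts as the label, as in the cell above

# softmax on plain floats
counts = [exp(z) for z in logits]
total = sum(counts)
probs = [c / total for c in counts]

# closed form for d(-log probs[label]) / d(logit_i): probs[i] - 1 if i == label else probs[i]
closed_form = [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]

# expected gradient, copied from the check cell above
ans = [0.041772570515350445, 0.8390245074625319, 0.005653302662216329, -0.8864503806400986]
for dim in range(4):
    print(f"dim {dim}: closed form {closed_form[dim]}, expected {ans[dim]}")
```

This explains why the first three gradients equal the printed probabilities and only the labelled dimension is negative.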


## 5. *Verify the gradient using the torch library. torch should give you the exact same gradient.*

In [216]:
import torch

logits = torch.tensor([[0.0, 3.0, -2.0, 1.0]], requires_grad=True)
probs = torch.softmax(logits, dim=1)[0]
print(f"probs: {probs}")
loss = -probs[3].log()
loss.backward()

print(f"logits.grad: {logits.grad}")

probs: tensor([0.0418, 0.8390, 0.0057, 0.1135], grad_fn=<SelectBackward0>)
logits.grad: tensor([[ 0.0418,  0.8390,  0.0057, -0.8865]])
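
If torch is not available, the symmetric difference from exercise 3 gives an independent check of the same gradient, though only approximately rather than to full precision. A minimal sketch, assuming plain-float helpers `nll_loss` and `numeric_grad` (both hypothetical names, not from the video):

```python
from math import exp, log

def nll_loss(logits, label):
    # negative log likelihood of the labelled class under a softmax over logits
    counts = [exp(z) for z in logits]
    return -log(counts[label] / sum(counts))

def numeric_grad(logits, label, h=1e-6):
    # symmetric difference (loss(z + h) - loss(z - h)) / (2h), one logit at a time
    grads = []
    for i in range(len(logits)):
        up, down = list(logits), list(logits)
        up[i] += h
        down[i] -= h
        grads.append((nll_loss(up, label) - nll_loss(down, label)) / (2 * h))
    return grads

logits = [0.0, 3.0, -2.0, 1.0]
# expected gradient, copied from exercise 4
ans = [0.041772570515350445, 0.8390245074625319, 0.005653302662216329, -0.8864503806400986]

approx = numeric_grad(logits, label=3)
for dim in range(4):
    ok = 'OK' if abs(approx[dim] - ans[dim]) < 1e-6 else 'WRONG!'
    print(f"{ok} for dim {dim}: expected {ans[dim]}, numeric estimate {approx[dim]}")
```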
