
# Building **micrograd** from scratch – exercise notebook

This is a **self‑contained** walkthrough of a tiny automatic‑differentiation engine and a tiny neural‑network library, inspired by Andrej Karpathy’s *micrograd*.

You do **not** need to watch Karpathy’s video lecture to follow this. We’ll build everything from scratch, step by step:

- a `Value` class that wraps scalars and tracks gradients,
- backpropagation (`.backward()`) over a computation graph,
- `Neuron`, `Layer`, and `MLP` classes on top of `Value`,
- a small training loop that fits a toy dataset.

Most sections have:

1. **Concept / motivation** – what we’re about to build and why.
2. **Exercise cell** – with `### YOUR CODE HERE` and `raise NotImplementedError(...)`.
3. **Solution cell** – a reference implementation.

Try the exercise first, then check the solution.



## Exercise 0 – Warm‑up: derivatives as "sensitivity"

Key idea: a derivative measures **how sensitive** a function’s output is to its input.

For a function

\[
f(x) = 3x^2 - 4x + 5
\]

the derivative at a point \(x\) answers:

> If I nudge \(x\) by a tiny amount, how much does \(f(x)\) change?

On a graph of \(y = f(x)\), the derivative is the **slope** of the curve at that point: rise over run.

On a computer we can approximate this slope using **finite differences**:

\[
f'(x) \approx \frac{f(x + h) - f(x)}{h}
\]

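for a small step size \(h\) (for example `1e-6`). The error of this one‑sided formula shrinks roughly in proportion to `h`, so a smaller `h` gives a better approximation, until `h` becomes so small that floating‑point round‑off in `f(x + h) - f(x)` dominates.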

This is exactly the sort of information gradients will give us later for giant neural‑network expressions.
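
As a rough illustration of how the step size affects accuracy, the sketch below compares the forward difference from the formula above with a central difference on the same \(f(x)\). The central difference is a standard alternative that this notebook does not otherwise use, and the helper names `forward_diff` and `central_diff` are made up for this example.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def forward_diff(f, x, h):
    # one-sided difference (the formula above); error shrinks roughly like h
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    # symmetric difference; error shrinks roughly like h**2
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 3.0
exact = 6 * x0 - 4  # 14.0

for h in (1e-1, 1e-3, 1e-6, 1e-12):
    fwd_err = abs(forward_diff(f, x0, h) - exact)
    ctr_err = abs(central_diff(f, x0, h) - exact)
    print(f"h={h:g}  forward error={fwd_err:.2e}  central error={ctr_err:.2e}")
```

For the smallest step the forward error typically grows again, because subtracting two nearly equal floats loses precision.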

### Task

1. Implement `numerical_derivative(f, x)` using the finite‑difference formula.
2. Test it on \(f(x) = 3x^2 - 4x + 5\), and compare with the exact derivative \(f'(x) = 6x - 4\).


In [None]:

import math

def f(x):
    # f(x) = 3x^2 - 4x + 5
    return 3 * x**2 - 4 * x + 5

def numerical_derivative(f, x, h=1e-6):
    """
    Approximate f'(x) at point x using finite differences.

    Replace the 'raise' line with your implementation.
    """
    ### YOUR CODE HERE
    raise NotImplementedError("Exercise 0: implement numerical_derivative")


# After implementing, you can test like this:
# x0 = 3.0
# approx = numerical_derivative(f, x0)
# exact = 6 * x0 - 4
# print("approx:", approx, " exact:", exact)


In [None]:

# Solution 0 – numerical derivative

import math

def f(x):
    return 3 * x**2 - 4 * x + 5

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

x0 = 3.0
approx = numerical_derivative(f, x0)
exact = 6 * x0 - 4

print(f"x = {x0}")
print(f"numerical derivative ~ {approx:.6f}")
print(f"exact derivative     = {exact:.6f}")



## Exercise 1 – Smart numbers: the `Value` class

Plain Python numbers like `3.0` or `-2.5` know nothing about:

- where they came from (which computation),
- how a final output depends on them (their gradient).

We want a **smart scalar** that will eventually:

- store its numeric value,
- store its gradient (derivative of some final loss w.r.t. this value),
- remember how it was computed (parents + operation).

We’ll call this smart scalar `Value`.

For now, we just need it to:

- hold `data` (a float),
- hold `grad` (initially `0.0`),
- print nicely so we can debug easily.


In [None]:

# Exercise 1 – your turn: minimal Value

class Value:
    def __init__(self, data):
        """
        Create a Value object that wraps a Python float.
        """
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 1: implement __init__")

    def __repr__(self):
        """
        Return a helpful string representation, e.g.
        Value(data=2.0, grad=0.0)
        """
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 1: implement __repr__")


# After implementing, try:
# v = Value(2.5)
# v


In [None]:

# Solution 1 – minimal Value

class Value:
    def __init__(self, data):
        # numeric value
        self.data = float(data)
        # gradient of some final scalar output w.r.t. this Value
        self.grad = 0.0

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"


Value(2.5)



## Exercise 2 – Using `+` and `*` with `Value` (operator overloading)

We’d like to write natural math with our `Value` objects:

```python
a = Value(-4.0)
b = Value(2.0)
c = a + b      # should be -2.0
d = a * b      # should be -8.0
```

In Python, `a + b` calls the method `a.__add__(b)`, and `a * b` calls `a.__mul__(b)`.

We will:

- implement `__add__` and `__mul__` so they return new `Value` objects,
- implement `__radd__` and `__rmul__` so `2 + a` and `3 * a` also work.

For now we **only** care about computing the correct numeric result. We’ll track graph structure in the next step.
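
To see the dispatch rule in isolation, here is a small sketch using a throwaway class `Logged` (hypothetical, not part of the engine) that records which special method Python ended up calling.

```python
class Logged:
    """Toy wrapper that records which special method handled the operation."""

    def __init__(self, data):
        self.data = data
        self.called = None

    def __add__(self, other):
        # handles: Logged + something
        self.called = "__add__"
        return self.data + other

    def __radd__(self, other):
        # handles: something + Logged, tried after the left operand's
        # own __add__ has returned NotImplemented (as int.__add__ does here)
        self.called = "__radd__"
        return other + self.data

a = Logged(-4.0)

print(a + 2, a.called)   # dispatched to a.__add__(2)
print(2 + a, a.called)   # int cannot add a Logged, so Python tries a.__radd__(2)
```

Without `__radd__`, the second expression would raise a `TypeError`.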


In [None]:

# Exercise 2 – your turn: add + and *

class Value:
    def __init__(self, data):
        self.data = float(data)
        self.grad = 0.0

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

    def __add__(self, other):
        """
        Return a new Value with data = self.data + other.data.
        If other is a plain number, wrap it as Value(other).
        """
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 2: implement __add__")

    def __radd__(self, other):
        """
        Called when Python evaluates: number + Value.
        """
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 2: implement __radd__")

    def __mul__(self, other):
        """
        Return a new Value with data = self.data * other.data.
        """
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 2: implement __mul__")

    def __rmul__(self, other):
        """
        Called when Python evaluates: number * Value.
        """
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 2: implement __rmul__")


# After implementing, try:
# a = Value(-4.0)
# b = Value(2.0)
# print("a + b =", a + b)
# print("a * b =", a * b)
# print("2 + a =", 2 + a)
# print("3 * a =", 3 * a)


In [None]:

# Solution 2 – addition and multiplication

class Value:
    def __init__(self, data):
        self.data = float(data)
        self.grad = 0.0

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

    @staticmethod
    def _lift(x):
        # Make sure x is a Value
        return x if isinstance(x, Value) else Value(x)

    def __add__(self, other):
        other = Value._lift(other)
        return Value(self.data + other.data)

    def __radd__(self, other):
        # addition is commutative
        return self + other

    def __mul__(self, other):
        other = Value._lift(other)
        return Value(self.data * other.data)

    def __rmul__(self, other):
        # multiplication is commutative
        return self * other


a = Value(-4.0)
b = Value(2.0)
print("a + b =", a + b)
print("a * b =", a * b)
print("2 + a =", 2 + a)
print("3 * a =", 3 * a)



## Exercise 3 – Building a computation graph

Backpropagation needs to know **how each value was computed**.

We’ll treat each `Value` as a node in a **computation graph**:

- Inputs you create directly are leaf nodes.
- Results of operations like `+` or `*` are internal nodes.
- If `z = x + y`, then `x` and `y` are **parents** of `z`.

Example:

```python
a = Value(-4.0)
b = Value(2.0)
c = a + b
d = a * b
e = c + d
```

Graph sketch:

```
   a      b
   | \  / |
   |  \/  |
   |  /\  |
   c    d
     \ /
      e
```

We’ll give each `Value` three extra fields:

- `self._prev`: the set of parent nodes (inputs to this operation), passed to the constructor as `_children`,
- `self._op`: a string describing the operation (`"+"`, `"*"`, etc.),
- `self._backward`: a function that later will know how to push gradients to the parents.

### Task

Extend `Value` to:

- accept `_children` and `_op` in the constructor,
- update `__add__` and `__mul__` so the **output** node remembers its parents and operation.


In [None]:

# Exercise 3 – your turn: add graph info

class Value:
    def __init__(self, data, _children=(), _op=""):
        self.data = float(data)
        self.grad = 0.0

        # graph information
        self._prev = set(_children)  # parents in the graph
        self._op = _op               # operation that produced this node
        self._backward = lambda: None

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

    @staticmethod
    def _lift(x):
        return x if isinstance(x, Value) else Value(x)

    def __add__(self, other):
        other = Value._lift(other)
        # TODO: create out, remember parents and op "+"
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 3: implement __add__ with graph info")

    def __radd__(self, other):
        return self + other

    def __mul__(self, other):
        other = Value._lift(other)
        # TODO: create out, remember parents and op "*"
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 3: implement __mul__ with graph info")

    def __rmul__(self, other):
        return self * other


# After implementing, try:
# a = Value(-4.0)
# b = Value(2.0)
# c = a + b
# d = a * b + b
# print("d:", d)
# print("d._op:", d._op)
# print("parents of d:", d._prev)


In [None]:

# Solution 3 – Value as a graph node

class Value:
    def __init__(self, data, _children=(), _op="", label=""):
        self.data = float(data)
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None
        self.label = label  # optional name for debugging

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

    @staticmethod
    def _lift(x):
        return x if isinstance(x, Value) else Value(x)

    def __add__(self, other):
        other = Value._lift(other)
        out = Value(self.data + other.data, (self, other), "+")
        return out

    def __radd__(self, other):
        return self + other

    def __mul__(self, other):
        other = Value._lift(other)
        out = Value(self.data * other.data, (self, other), "*")
        return out

    def __rmul__(self, other):
        return self * other


a = Value(-4.0, label="a")
b = Value(2.0, label="b")
c = a + b
d = a * b
e = c + d

print("e:", e)
print("e._op:", e._op)
print("e._prev:", {p.label for p in e._prev})



## Exercise 4 – Backpropagation: local gradients and `.backward()`

Now we want to automatically fill `grad` for all nodes when we call:

```python
loss.backward()
```

where `loss` is some final scalar `Value` in the graph. After this call, each node `v` should hold:

\[
v.grad = \frac{\partial \text{loss}}{\partial v}
\]

So `v.grad` tells us how sensitive the final loss is to small changes in `v`.

### Local gradients

Take two basic operations:

- Addition: `out = x + y`
  - `dout/dx = 1`
  - `dout/dy = 1`

- Multiplication: `out = x * y`
  - `dout/dx = y`
  - `dout/dy = x`

If we already know `out.grad = d(loss)/d(out)`, then by the chain rule:

```text
d(loss)/d(x) = d(loss)/d(out) * d(out)/d(x)
```

So in code:

- For addition:

```python
x.grad += 1.0 * out.grad
y.grad += 1.0 * out.grad
```

- For multiplication:

```python
x.grad += y.data * out.grad
y.grad += x.data * out.grad
```

We’ll store this logic in `out._backward()`, a tiny function that knows how to **send gradients to the parents**.
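
As a sanity check on these rules, the sketch below works one small case by hand with plain floats: `loss = x * y + x`, so `x` feeds two operations. The variable names (`prod`, `x_grad`, and so on) are illustrative only.

```python
x, y = 2.0, 3.0

# forward pass
prod = x * y          # 6.0
loss = prod + x       # 8.0

# backward pass, applying the local rules by hand
loss_grad = 1.0       # d(loss)/d(loss)
x_grad = 0.0
y_grad = 0.0

# loss = prod + x  (addition: each input receives 1 * loss_grad)
prod_grad = 1.0 * loss_grad
x_grad += 1.0 * loss_grad

# prod = x * y  (multiplication: each input receives the other's value)
x_grad += y * prod_grad
y_grad += x * prod_grad

print(x_grad, y_grad)  # 4.0 2.0
```

The `+=` matters: `x` receives one contribution from each place it is used, and the total `y + 1 = 4` matches the derivative of `x * y + x` with respect to `x`.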

### Global backward

To propagate gradients through the whole graph, `backward()` will:

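1. Build a list of all nodes reachable from `self` in **topological order** (every node comes after all of its parents).
2. Reset `grad` to `0.0` on every node in the list, so a repeated call does not add to stale gradients.
3. Set `self.grad = 1.0` (because d(loss)/d(loss) = 1).
4. Go through the list **in reverse order**, calling `v._backward()` at each node.

The reverse order guarantees that by the time we process a node, every node that depends on it has already passed its gradient back, so the node's own `grad` is complete before it is used.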
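
The ordering step can be sketched on the example graph from Exercise 3, using plain strings instead of `Value` objects. The `parents` dictionary and the `topo_order` helper are assumptions made for this illustration.

```python
# parents (inputs) of each node in the graph e = (a + b) + (a * b)
parents = {
    "a": [],
    "b": [],
    "c": ["a", "b"],   # c = a + b
    "d": ["a", "b"],   # d = a * b
    "e": ["c", "d"],   # e = c + d
}

def topo_order(root):
    order, visited = [], set()

    def build(node):
        if node not in visited:
            visited.add(node)
            for p in parents[node]:
                build(p)
            order.append(node)   # appended only after all of its inputs

    build(root)
    return order

order = topo_order("e")
print(order)                  # ['a', 'b', 'c', 'd', 'e']
print(list(reversed(order)))  # ['e', 'd', 'c', 'b', 'a']
```

In the reversed list, `a` and `b` come last, after both `c` and `d` have had the chance to contribute to their gradients.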

### Task

1. In `__add__` and `__mul__`, define `out._backward()` using the local rules above.
2. Implement `backward(self)` that:
   - builds a topological list of nodes,
   - resets `grad` to `0.0` on every node in that list,
   - seeds `self.grad = 1.0`,
   - walks the list in reverse and calls each node’s `_backward()`.


In [None]:

# Exercise 4 – your turn: implement backprop for + and *

class Value:
    def __init__(self, data, _children=(), _op="", label=""):
        self.data = float(data)
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

    @staticmethod
    def _lift(x):
        return x if isinstance(x, Value) else Value(x)

    def __add__(self, other):
        other = Value._lift(other)
        out = Value(self.data + other.data, (self, other), "+")

        def _backward():
            # TODO: use local derivative for addition
            ### YOUR CODE HERE
            raise NotImplementedError("Exercise 4: implement _backward for +")

        out._backward = _backward
        return out

    def __radd__(self, other):
        return self + other

    def __mul__(self, other):
        other = Value._lift(other)
        out = Value(self.data * other.data, (self, other), "*")

        def _backward():
            # TODO: use local derivative for multiplication
            ### YOUR CODE HERE
            raise NotImplementedError("Exercise 4: implement _backward for *")

        out._backward = _backward
        return out

    def __rmul__(self, other):
        return self * other

    def backward(self):
        """
        Backpropagate gradients from this node through the graph.
        """
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 4: implement backward()")


# After implementing, test with:
# a = Value(2.0, label="a")
# b = Value(3.0, label="b")
# c = a * b
# c.backward()
# print("dc/da (should be 3):", a.grad)
# print("dc/db (should be 2):", b.grad)


In [None]:

# Solution 4 – backprop for + and *

class Value:
    def __init__(self, data, _children=(), _op="", label=""):
        self.data = float(data)
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

    @staticmethod
    def _lift(x):
        return x if isinstance(x, Value) else Value(x)

    def __add__(self, other):
        other = Value._lift(other)
        out = Value(self.data + other.data, (self, other), "+")

        def _backward():
            # d(out)/d(self) = 1, d(out)/d(other) = 1
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad

        out._backward = _backward
        return out

    def __radd__(self, other):
        return self + other

    def __mul__(self, other):
        other = Value._lift(other)
        out = Value(self.data * other.data, (self, other), "*")

        def _backward():
            # d(out)/d(self) = other.data, d(out)/d(other) = self.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def __rmul__(self, other):
        return self * other

    def backward(self):
        # 1) build topological order
        topo = []
        visited = set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for parent in v._prev:   # _prev holds this node's inputs
                    build(parent)
                topo.append(v)

        build(self)

        # 2) zero all grads
        for v in topo:
            v.grad = 0.0

        # 3) seed gradient at this node
        self.grad = 1.0

        # 4) backprop in reverse topological order
        for v in reversed(topo):
            v._backward()


# Test: simple multiplication
a = Value(2.0, label="a")
b = Value(3.0, label="b")
c = a * b
c.backward()
print("c =", c)
print("dc/da (should be 3):", a.grad)
print("dc/db (should be 2):", b.grad)

# Test: node used twice, b = a + a
a = Value(3.0, label="a")
b = a + a
b.backward()
print("\nTest b = a + a")
print("b =", b)
print("db/da (should be 2):", a.grad)



## Exercise 5 – More math ops on `Value`

Real neural nets use more than `+` and `*`. We’d like to support:

- unary minus: `-x`
- subtraction: `x - y`, `y - x`
- division: `x / y`, `y / x`
- powers: `x ** k` for scalar `k`
- nonlinearities: `tanh()` and `relu()`
- `exp()`

The pattern is always the same:

1. Implement the **forward** computation.
2. Figure out the **local derivative** (from calculus or intuition).
3. In `_backward`, multiply `out.grad` by this local derivative and accumulate the result into the input's `grad` with `+=` (`self.grad`, plus `other.grad` for binary operations).

Some handy derivatives:

- `d(-x)/dx = -1`
- `d(x - y)/dx = 1`, `d(x - y)/dy = -1`
- `d(x**k)/dx = k * x**(k-1)`
- `d(tanh(x))/dx = 1 - tanh(x)**2`
- `d(ReLU(x))/dx = 1 if x > 0 else 0` (ReLU has no derivative at `x = 0`; the code uses 0 there)
- `d(exp(x))/dx = exp(x)`
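
These rules can be spot-checked numerically before trusting them inside `_backward`. The sketch below compares each analytic derivative with a central difference at one arbitrary point. The helper `central_diff`, the `checks` table, and the test point `x = 0.7` are choices made for this illustration.

```python
import math

def central_diff(f, x, h=1e-6):
    # symmetric finite difference, used here only as a reference value
    return (f(x + h) - f(x - h)) / (2 * h)

x, k = 0.7, 3

# name -> (function, analytic derivative at x)
checks = {
    "-x":      (lambda v: -v,           -1.0),
    "x**k":    (lambda v: v**k,         k * x**(k - 1)),
    "tanh(x)": (lambda v: math.tanh(v), 1 - math.tanh(x)**2),
    "relu(x)": (lambda v: max(v, 0.0),  1.0 if x > 0 else 0.0),
    "exp(x)":  (lambda v: math.exp(v),  math.exp(x)),
}

for name, (fn, analytic) in checks.items():
    numeric = central_diff(fn, x)
    print(f"{name:8s} analytic={analytic:.6f}  numeric={numeric:.6f}")
```

The two columns should agree to several decimal places. A mismatch usually points at a sign error or a missing factor in the analytic rule.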

Division can be rewritten using powers:

\[
\frac{x}{y} = x \cdot y^{-1}
\]
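
Applying the product and power rules to this form gives the two local derivatives of division, which is why composing `*` with `**` is enough and no separate backward rule for `/` is needed:

\[
\frac{\partial}{\partial x}\left(x \cdot y^{-1}\right) = y^{-1} = \frac{1}{y},
\qquad
\frac{\partial}{\partial y}\left(x \cdot y^{-1}\right) = x \cdot (-1)\, y^{-2} = -\frac{x}{y^{2}}
\]

For example, with \(x = 6\) and \(y = 2\) the two derivatives are \(1/2\) and \(-6/4 = -1.5\).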

Below is the full `Value` implementation with these ops. Read through slowly and make sure each `_backward` matches the derivative rule.


In [None]:

# Solution 5 – full Value engine with many scalar operations

import math

class Value:
    def __init__(self, data, _children=(), _op="", label=""):
        self.data = float(data)
        self.grad = 0.0
        self._prev = set(_children)
        self._op = _op
        self._backward = lambda: None
        self.label = label

    def __repr__(self):
        return f"Value(data={self.data}, grad={self.grad})"

    @staticmethod
    def _lift(x):
        return x if isinstance(x, Value) else Value(x)

    # ----- basic arithmetic -----
    def __add__(self, other):
        other = Value._lift(other)
        out = Value(self.data + other.data, (self, other), "+")

        def _backward():
            self.grad += 1.0 * out.grad
            other.grad += 1.0 * out.grad

        out._backward = _backward
        return out

    def __radd__(self, other):
        return self + other

    def __mul__(self, other):
        other = Value._lift(other)
        out = Value(self.data * other.data, (self, other), "*")

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def __rmul__(self, other):
        return self * other

    # ----- unary minus and subtraction -----
    def __neg__(self):
        # -x is just (-1) * x
        return self * -1.0

    def __sub__(self, other):
        other = Value._lift(other)
        return self + (-other)

    def __rsub__(self, other):
        other = Value._lift(other)
        return other + (-self)

    # ----- division -----
    def __truediv__(self, other):
        other = Value._lift(other)
        return self * (other ** -1.0)

    def __rtruediv__(self, other):
        other = Value._lift(other)
        return other * (self ** -1.0)

    # ----- powers -----
    def __pow__(self, exponent):
        assert isinstance(exponent, (int, float)), "only supports int/float powers"
        out = Value(self.data ** exponent, (self,), f"**{exponent}")

        def _backward():
            self.grad += exponent * (self.data ** (exponent - 1)) * out.grad

        out._backward = _backward
        return out

    # ----- nonlinearities -----
    def tanh(self):
        x = self.data
        # math.tanh is numerically stable; the exp-based formula overflows for large x
        t = math.tanh(x)
        out = Value(t, (self,), "tanh")

        def _backward():
            self.grad += (1.0 - t**2) * out.grad

        out._backward = _backward
        return out

    def relu(self):
        x = self.data
        out = Value(x if x > 0 else 0.0, (self,), "ReLU")

        def _backward():
            self.grad += (1.0 if x > 0 else 0.0) * out.grad

        out._backward = _backward
        return out

    def exp(self):
        x = self.data
        e = math.exp(x)
        out = Value(e, (self,), "exp")

        def _backward():
            self.grad += out.data * out.grad

        out._backward = _backward
        return out

    # ----- backward engine -----
    def backward(self):
        topo = []
        visited = set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for parent in v._prev:   # _prev holds this node's inputs
                    build(parent)
                topo.append(v)

        build(self)

        # zero grads
        for v in topo:
            v.grad = 0.0

        # seed gradient
        self.grad = 1.0

        # backprop
        for v in reversed(topo):
            v._backward()


# Small stress test (adapted from micrograd README)
a = Value(-4.0, label="a")
b = Value(2.0, label="b")
c = a + b
d = a * b + b**3
c = c + c + 1
c = c + 1 + c + (-a)
d = d + d * 2 + (b + a).relu()
d = d + 3 * d + (b - a).relu()
e = c - d
f = e**2
g = f / 2.0
g = g + 10.0 / f

print(f"g.data = {g.data:.4f}")   # expected: 24.7041
g.backward()
print(f"a.grad = {a.grad:.4f}")   # expected: 138.8338
print(f"b.grad = {b.grad:.4f}")   # expected: 645.5773



### (Optional) Inspecting the computation graph

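This helper prints a simple textual view of the graph starting from a root node, one line per node in topological order. It can help you see how everything is wired together. Nodes created without a `label` (such as the intermediate results above) are printed as `?`.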


In [None]:

def dump_graph(root):
    nodes = []
    visited = set()

    def build(v):
        if v not in visited:
            visited.add(v)
            for parent in v._prev:   # _prev holds this node's inputs
                build(parent)
            nodes.append(v)

    build(root)

    for v in nodes:
        parent_labels = [p.label or "?" for p in v._prev]
        print(f"node {v.label or '?'}: op={v._op!r}, data={v.data}, grad={v.grad}, parents={parent_labels}")


dump_graph(g)



## Exercise 6 – From scalars to a single neuron

Now we use our scalar engine to build neural‑network parts.

A single fully connected neuron computes

\[
\text{out} = \phi(w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b)
\]

- \(x_i\): inputs (we’ll use `Value` objects)
- \(w_i\): weights (trainable `Value` parameters)
- \(b\): bias (trainable `Value`)
- \(\phi\): nonlinearity (ReLU here)

Since all operations are on `Value`s (`*`, `+`, `relu`), the graph + gradients “just work”.
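
To make the formula concrete, here is the same computation with plain floats and made-up numbers. It shows only the forward arithmetic and assumes ReLU as the nonlinearity.

```python
# Plain-float version of one neuron; weights, inputs and bias are arbitrary.
w = [0.5, -1.0, 2.0]   # weights
x = [1.0, -2.0, 0.5]   # inputs
b = 0.1                # bias

# weighted sum: 0.5*1.0 + (-1.0)*(-2.0) + 2.0*0.5 + 0.1
act = sum(wi * xi for wi, xi in zip(w, x)) + b
out = max(act, 0.0)    # ReLU

print(act, out)
```

With `Value` objects in place of the floats, the same expression also records the graph that backpropagation needs.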

### `Module`

We also define a tiny base class `Module` so all neural‑network objects can:

- return their parameters via `parameters()`,
- zero their gradients via `zero_grad()`.

### Task

1. Implement `Module.parameters` (base version can just return `[]`).
2. Implement `Neuron.__call__`:
   - compute `act = sum(w_i * x_i) + b`,
   - if `self.nonlin` is `True`, return `act.relu()`, else return `act`.
3. Implement `Neuron.parameters` to return all weights plus bias.


In [None]:

# Exercise 6 – your turn: Module and Neuron

import random

class Module:
    def parameters(self):
        """
        Return a flat list of all Value parameters.

        Base class: no parameters.
        """
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 6: implement Module.parameters")

    def zero_grad(self):
        for p in self.parameters():
            p.grad = 0.0


class Neuron(Module):
    def __init__(self, nin, nonlin=True):
        """
        nin : number of inputs
        nonlin : whether to apply ReLU nonlinearity
        """
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)
        self.nonlin = nonlin

    def __call__(self, x):
        """
        x : list of Value inputs
        """
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 6: implement Neuron.__call__")

    def parameters(self):
        """
        Return all parameters (weights and bias) as a list.
        """
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 6: implement Neuron.parameters")


# After implementing, test with:
# n = Neuron(3)
# x = [Value(1.0), Value(-2.0), Value(0.5)]
# out = n(x)
# print("neuron output:", out)


In [None]:

# Solution 6 – Module and Neuron

import random

class Module:
    def parameters(self):
        # Base module: no parameters
        return []

    def zero_grad(self):
        for p in self.parameters():
            p.grad = 0.0


class Neuron(Module):
    def __init__(self, nin, nonlin=True):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(0.0)
        self.nonlin = nonlin

    def __call__(self, x):
        # weighted sum: w·x + b
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        # nonlinearity
        return act.relu() if self.nonlin else act

    def parameters(self):
        return self.w + [self.b]


# Test a neuron
n = Neuron(3)
x = [Value(1.0), Value(-2.0), Value(0.5)]
out = n(x)
print("neuron output:", out)



## Exercise 7 – Layers and multi‑layer perceptrons (MLPs)

Neural networks stack many neurons in **layers**, and then stack layers into a **multi‑layer perceptron** (MLP).

### Layer

A `Layer`:

- holds a list of `Neuron`s,
- on `__call__`, applies each neuron to the same input vector,
- returns a list of outputs (or a single `Value` if there is only one neuron),
- exposes all its parameters via `parameters()`.
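
In plain floats, a layer is just several neurons reading the same input. The sketch below uses a hypothetical `neuron` helper and hand-picked weights so the outputs are easy to verify.

```python
def neuron(w, b, x):
    # weighted sum followed by ReLU
    act = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(act, 0.0)

# a "layer" of two neurons sharing the same 3-dimensional input,
# each stored as a (weights, bias) pair
layer = [
    ([1.0, 0.0, 0.0], 0.0),   # picks out x[0]
    ([0.0, 1.0, 0.0], 0.5),   # x[1] + 0.5
]

x = [1.0, -2.0, 0.5]
outs = [neuron(w, b, x) for w, b in layer]
print(outs)  # [1.0, 0.0]
```

The second output is `0.0` because `-2.0 + 0.5` is negative and ReLU clips it.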

### MLP

An `MLP` is a sequence of layers. For example:

```python
mlp = MLP(3, [4, 4, 1])
```

- input dimension is 3,
- first hidden layer has 4 neurons,
- second hidden layer has 4 neurons,
- output layer has 1 neuron.

On `__call__`, the MLP just feeds the output of one layer into the next.
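
A quick way to check your mental model of the architecture is to count its parameters. The sketch below assumes the layout used in this notebook, where every neuron has one weight per input plus one bias. `count_params` is a helper invented for this example.

```python
def count_params(nin, nouts):
    """Number of weights and biases in an MLP with the given layer sizes."""
    sizes = [nin] + list(nouts)
    # each neuron in a layer has n_in weights plus 1 bias
    return sum((n_in + 1) * n_out for n_in, n_out in zip(sizes, sizes[1:]))

print(count_params(3, [4, 4, 1]))  # (3+1)*4 + (4+1)*4 + (4+1)*1 = 41
print(count_params(2, [4, 4, 1]))  # (2+1)*4 + (4+1)*4 + (4+1)*1 = 37
```

The result should equal `len(mlp.parameters())` for an `MLP` built with the same sizes.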

### Task

1. Implement `Layer.__call__` and `Layer.parameters()`.
2. Implement `MLP.__call__` (looping through layers) and `MLP.parameters()`.


In [None]:

# Exercise 7 – your turn: Layer and MLP

class Layer(Module):
    def __init__(self, nin, nout, **kwargs):
        """
        nin : number of inputs
        nout : number of neurons in this layer
        kwargs : forwarded to Neuron (e.g. nonlin=True/False)
        """
        self.neurons = [Neuron(nin, **kwargs) for _ in range(nout)]

    def __call__(self, x):
        # apply each neuron to the same input vector
        outs = [n(x) for n in self.neurons]
        # if there is only 1 neuron, return its single output
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 7: implement Layer.__call__")

    def parameters(self):
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 7: implement Layer.parameters")


class MLP(Module):
    def __init__(self, nin, nouts):
        """
        nin : input dimension
        nouts : list of layer sizes, e.g. [4, 4, 1]
        """
        sizes = [nin] + nouts
        self.layers = [
            Layer(sizes[i], sizes[i+1], nonlin=(i != len(nouts) - 1))
            for i in range(len(nouts))
        ]

    def __call__(self, x):
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 7: implement MLP.__call__")

    def parameters(self):
        ### YOUR CODE HERE
        raise NotImplementedError("Exercise 7: implement MLP.parameters")


# After implementing, test with:
# mlp = MLP(3, [4, 4, 1])
# x = [Value(1.0), Value(-2.0), Value(0.5)]
# y = mlp(x)
# print("MLP output:", y)


In [None]:

# Solution 7 – Layer and MLP

class Layer(Module):
    def __init__(self, nin, nout, **kwargs):
        self.neurons = [Neuron(nin, **kwargs) for _ in range(nout)]

    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]


class MLP(Module):
    def __init__(self, nin, nouts):
        sizes = [nin] + nouts
        self.layers = [
            Layer(sizes[i], sizes[i+1], nonlin=(i != len(nouts) - 1))
            for i in range(len(nouts))
        ]

    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]


mlp = MLP(3, [4, 4, 1])
x = [Value(1.0), Value(-2.0), Value(0.5)]
y = mlp(x)
print("MLP output:", y)



## Final example – training a tiny neural network

We now have:

- a scalar autograd engine (`Value`),
- neural‑network building blocks (`Neuron`, `Layer`, `MLP`).

Let’s put them together and **train** a small MLP on a toy classification problem.

### Dataset

We’ll create 4 points in 2D with labels in {+1, −1}. The goal is to map each point to its correct label.

### Model

Use an MLP:

```python
model = MLP(2, [4, 4, 1])
```

- input dimension = 2,
- two hidden layers with 4 neurons each,
- 1 output neuron (scalar score).

### Loss – mean squared error (MSE)

For predictions `y_pred` and targets `y_true`:

\[
\text{loss} = \frac{1}{N} \sum_i (y_{\text{pred}, i} - y_{\text{true}, i})^2
\]

Loss is small when predictions are close to targets.
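
The formula can be tried on plain floats first. The function name `mse` and the sample predictions below are made up for illustration.

```python
def mse(y_pred, y_true):
    """Mean of squared differences between predictions and targets."""
    assert len(y_pred) == len(y_true)
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_pred)

y_true = [1.0, -1.0, -1.0, 1.0]

print(mse([1.0, -1.0, -1.0, 1.0], y_true))  # 0.0 (perfect predictions)
print(mse([0.5, -0.5, -1.0, 0.0], y_true))  # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
```

In the training cell the same expression is built from `Value` objects, so the loss carries the graph needed for `backward()`.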

### Training loop – vanilla gradient descent

Each step:

1. Forward pass: compute predictions and the loss.
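2. `model.zero_grad()` – clear old parameter gradients. (The reference `backward()` above already resets every `grad` in the graph, so here this is a safeguard.)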
3. `loss.backward()` – fill in `p.grad` for all parameters.
4. For each parameter `p`, update `p.data += -learning_rate * p.grad`.

If everything is wired correctly, the loss should go down and predictions should approach the desired labels.
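
The update rule in step 4 can be seen in isolation on a one-parameter problem. The sketch below minimises the warm-up function \(f(x) = 3x^2 - 4x + 5\) with plain floats, using its exact derivative instead of backpropagation. The starting point and step count are arbitrary.

```python
def f(x):
    return 3 * x**2 - 4 * x + 5

def df(x):
    return 6 * x - 4   # exact derivative

x = 3.0
learning_rate = 0.1

for step in range(20):
    grad = df(x)
    x += -learning_rate * grad   # same update rule as for p.data

print(x, f(x))  # x approaches the minimiser 2/3
```

With this learning rate each step shrinks the distance to the minimum by a constant factor. A much larger learning rate would overshoot and diverge.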


In [None]:

# Tiny 2D dataset
xs = [
    [Value(2.0),  Value(3.0)],   # want +1
    [Value(1.0),  Value(-1.0)],  # want -1
    [Value(-1.0), Value(-2.0)],  # want -1
    [Value(-2.0), Value(2.0)],   # want +1
]
ys = [1.0, -1.0, -1.0, 1.0]

model = MLP(2, [4, 4, 1])
print("Number of parameters:", len(model.parameters()))

learning_rate = 0.1
steps = 50

for step in range(steps):
    # Forward pass: compute predictions
    ypred = [model(x) for x in xs]

    # Mean squared error loss
    losses = [(yout - ytrue)**2 for yout, ytrue in zip(ypred, ys)]
    loss = sum(losses) * (1.0 / len(losses))

    # Backpropagation
    model.zero_grad()
    loss.backward()

    # Parameter update
    for p in model.parameters():
        p.data += -learning_rate * p.grad

    if step % 5 == 0 or step == steps - 1:
        print(f"step {step:03d}  loss = {loss.data:.4f}")

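# Recompute predictions with the final parameters
# (ypred from the loop predates the last update)
ypred = [model(x) for x in xs]

print("\nFinal predictions:")
for ytrue, yout in zip(ys, ypred):
    print(f"target={ytrue:+.1f}, pred={yout.data:+.3f}")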



## Wrap‑up

You’ve built, from scratch:

- a scalar automatic‑differentiation engine:
  - `Value` nodes that store data, gradients, parents, and local `_backward` functions,
  - a `.backward()` method that walks the computation graph in reverse to apply the chain rule;
- a tiny neural‑network library:
  - `Module` with `parameters()` and `zero_grad()`,
  - `Neuron`, `Layer`, and `MLP` that are built purely from `Value` operations;
- a small training loop that uses gradient descent to minimise a loss.

This is the conceptual core of what big libraries like **PyTorch**, **TensorFlow**, or **JAX** do under the hood. They operate on tensors (multi‑dimensional arrays) instead of scalars, and they are highly optimised and feature‑rich, but the **math and ideas are the same**.

From here you can experiment with:

- different activation functions (`tanh`, sigmoid, ...),
- different network architectures (more layers, more neurons),
- different loss functions (e.g. cross‑entropy),
- visualising the computation graph with tools like `graphviz`.
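
As one possible starting point for the first item, here is a plain-float sketch of the sigmoid and its local derivative, checked against a central difference. The function names are illustrative. Turning this into a `Value` method would follow the same pattern as `tanh()`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # local derivative s * (1 - s); a _backward would multiply this by out.grad
    s = sigmoid(x)
    return s * (1.0 - s)

x, h = 0.3, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(sigmoid_grad(x), numeric)
```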

The key idea:

> Every operation knows how its **output** depends on its **inputs** (local derivative).  
> Backpropagation stitches all these pieces together (chain rule) to tell you how a final loss depends on **every parameter**.
