# Lab 1.7.1: Core Tensor Implementation

**Module:** 1.7 - Domain 1 Capstone: MicroGrad+  
**Time:** 3 hours  
**Difficulty:** ⭐⭐⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand how automatic differentiation works under the hood
- [ ] Build a Tensor class that tracks operations for backpropagation
- [ ] Implement the backward pass using reverse-mode autodiff
- [ ] Verify your gradients against numerical approximations

---

## 📚 Prerequisites

- Completed: Modules 1.1-1.6 (Python, CUDA, Math, Neural Network Fundamentals)
- Knowledge of: Calculus (chain rule), Python classes, NumPy

---

## 🌍 Real-World Context

Every deep learning framework (PyTorch, TensorFlow, JAX) has an automatic differentiation engine at its core. When you call `loss.backward()` in PyTorch, it's using the same principles we'll implement here!

Understanding autograd helps you:
- Debug gradient issues in training
- Write custom layers correctly
- Understand why certain operations break gradient flow
- Appreciate the engineering behind modern ML frameworks

---

## 🧒 ELI5: Automatic Differentiation

> **Imagine you're baking a cake** with multiple ingredients and steps.
>
> If the cake tastes terrible, you want to know: "Was it too much salt? Not enough sugar? Overcooked?"
>
> **Automatic differentiation is like a magical recipe tracker:**
> 1. As you cook, it writes down every step: "Added 2 cups flour, mixed for 3 minutes, baked at 350°F"
> 2. When the cake is done, it can trace back: "The burnt taste came from the oven being too hot"
> 3. It tells you exactly how much to change each setting to improve the result!
>
> **In AI terms:**
> - The "cake" is your model's prediction
> - The "ingredients" are your model parameters (weights)
> - The "burnt taste" is the loss (error)
> - The "how much to change" is the gradient
>
> Autograd builds a computation graph as you calculate, then walks backward through it to find gradients.

---

## Part 1: Understanding the Computation Graph

### Concept Explanation

When we compute something like `c = a * b + a`, we're implicitly building a **computation graph**:

```
    a ────┬───> [*] ───> temp ───> [+] ───> c
          │                         ^
    b ────┘                         │
    a ──────────────────────────────┘
```

Each operation creates a new node that knows:
1. Its parents (inputs)
2. What operation was performed
3. How to compute its gradient contribution (the "local gradient")

### The Chain Rule

If `c` depends on `a` through multiple paths, the **chain rule** tells us to sum the contribution of each path:

$$\frac{\partial c}{\partial a} = \frac{\partial c}{\partial \text{temp}} \cdot \frac{\partial \text{temp}}{\partial a} + \frac{\partial c}{\partial a}|_{\text{direct}}$$

For `c = a * b + a`:
- Path 1: `a → temp → c` contributes `b` (since ∂c/∂temp = 1 and ∂(a*b)/∂a = b)
- Path 2: `a → c` contributes `1` (since ∂(temp + a)/∂a = 1 with `temp` held fixed)
- Total: `dc/da = b + 1`

In [None]:
# Let's start with a minimal example to see the computation graph
import numpy as np

# Manual computation graph for: c = a * b + a
a = 2.0
b = 3.0

# Forward pass (build the graph)
temp = a * b  # temp = 6.0
c = temp + a  # c = 8.0

# Backward pass (compute gradients)
# Starting with dc/dc = 1
dc_dc = 1.0

# Gradient through addition: c = temp + a
dc_dtemp = dc_dc * 1.0  # ∂(temp + a)/∂temp = 1
dc_da_path2 = dc_dc * 1.0  # ∂(temp + a)/∂a = 1

# Gradient through multiplication: temp = a * b
dc_da_path1 = dc_dtemp * b  # ∂(a * b)/∂a = b
dc_db = dc_dtemp * a  # ∂(a * b)/∂b = a

# Total gradients (sum of all paths)
dc_da_total = dc_da_path1 + dc_da_path2

print(f"c = {c}")
print(f"dc/da = {dc_da_total} (expected: b + 1 = {b + 1})")
print(f"dc/db = {dc_db} (expected: a = {a})")
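As a quick sanity check on the hand-derived values, the sketch below compares them with central finite differences. The helper names `f` and `central_diff` are illustrative and not part of the lab code.

```python
# Finite-difference check of the hand-derived gradients for c = a * b + a.
# `f` and `central_diff` are illustrative names, not part of the lab code.

def f(a, b):
    return a * b + a

def central_diff(fn, x, h=1e-5):
    """Approximate d fn / dx at x with a symmetric difference."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

a, b = 2.0, 3.0

# Vary one input at a time while holding the other fixed
numeric_dc_da = central_diff(lambda v: f(v, b), a)
numeric_dc_db = central_diff(lambda v: f(a, v), b)

print(f"dc/da: numeric = {numeric_dc_da:.6f}, analytic = {b + 1}")
print(f"dc/db: numeric = {numeric_dc_db:.6f}, analytic = {a}")
```

Because `c` is linear in each input separately, the two columns should agree up to floating-point rounding.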

### 🔍 What Just Happened?

We manually traced through the computation graph backward:
1. Start at the output `c` with gradient 1
2. For each operation, distribute the gradient to inputs using local derivatives
3. Sum gradients that arrive at the same variable through different paths

Now let's automate this!

---

## Part 2: Building the Value Class (Simplified Tensor)

We'll start with a simplified `Value` class for scalars before moving to full tensors.

### Key Components:
1. **data**: The actual numerical value
2. **grad**: The gradient of the final output with respect to this value (starts at 0)
3. **_backward**: A function to compute gradient contributions
4. **_prev**: Set of parent nodes, i.e. the inputs that produced this value (for graph traversal)

In [None]:
class Value:
    """
    A simple scalar value with automatic differentiation support.
    This is the educational version before we build full Tensors.
    """
    
    def __init__(self, data, _children=(), _op=''):
        self.data = float(data)
        self.grad = 0.0  # Gradient starts at 0
        self._backward = lambda: None  # Default: no gradient computation
        self._prev = set(_children)  # Parent nodes
        self._op = _op  # Operation that created this node (for debugging)
    
    def __repr__(self):
        return f"Value(data={self.data:.4f}, grad={self.grad:.4f})"
    
    def __add__(self, other):
        """Addition: self + other"""
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        
        def _backward():
            # Gradient of addition: passes gradient unchanged to both inputs
            self.grad += out.grad  # d(a+b)/da = 1
            other.grad += out.grad  # d(a+b)/db = 1
        out._backward = _backward
        
        return out
    
    def __mul__(self, other):
        """Multiplication: self * other"""
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        
        def _backward():
            # Gradient of multiplication: swap and multiply by out gradient
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        
        return out
    
    def __radd__(self, other):  # Handle: number + Value
        return self + other
    
    def __rmul__(self, other):  # Handle: number * Value
        return self * other
    
    def backward(self):
        """Compute gradients via reverse-mode autodiff (backpropagation)"""
        # Build a topological order: every node appears after all of its inputs
        topo = []
        visited = set()
        
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for parent in v._prev:
                    build_topo(parent)
                topo.append(v)
        
        build_topo(self)
        
        # Start with gradient of 1 at the output
        self.grad = 1.0
        
        # Walk backward through the graph
        for v in reversed(topo):
            v._backward()

In [None]:
# Test our Value class
a = Value(2.0)
b = Value(3.0)
c = a * b + a

print(f"Before backward:")
print(f"  a = {a}")
print(f"  b = {b}")
print(f"  c = {c}")

c.backward()

print(f"\nAfter backward:")
print(f"  a = {a} (expected grad: {b.data + 1})")
print(f"  b = {b} (expected grad: {a.data})")
print(f"  c = {c} (expected grad: 1.0)")

# Verify
assert abs(a.grad - 4.0) < 1e-6, f"a.grad should be 4.0, got {a.grad}"
assert abs(b.grad - 2.0) < 1e-6, f"b.grad should be 2.0, got {b.grad}"
print("\nGradients are correct!")
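One detail worth checking separately is why `_backward` uses `+=` rather than `=`. The sketch below repeats a trimmed-down version of the class (named `MiniValue` here, addition and multiplication only) so that it runs on its own. The `accumulate` switch and the `send` helper are hypothetical additions made purely for this demonstration. With `d = a * a + a`, the node `a` receives three contributions, and overwriting instead of accumulating loses some of them.

```python
class MiniValue:
    """Trimmed-down Value for this demo; `accumulate` is a hypothetical switch."""

    accumulate = True  # False means: overwrite gradients instead of summing

    def __init__(self, data, _children=()):
        self.data = float(data)
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = MiniValue(self.data + other.data, (self, other))

        def _backward():
            send(self, out.grad)
            send(other, out.grad)
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = MiniValue(self.data * other.data, (self, other))

        def _backward():
            send(self, other.data * out.grad)
            send(other, self.data * out.grad)
        out._backward = _backward
        return out

    def backward(self):
        topo, visited = [], set()

        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for parent in v._prev:
                    build_topo(parent)
                topo.append(v)

        build_topo(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()


def send(node, contribution):
    """Deliver one gradient contribution to `node`."""
    if MiniValue.accumulate:
        node.grad += contribution  # correct: add to what already arrived
    else:
        node.grad = contribution   # deliberately wrong: last writer wins


def grad_of_a(accumulate):
    MiniValue.accumulate = accumulate
    a = MiniValue(3.0)
    d = a * a + a  # dd/da = 2a + 1 = 7
    d.backward()
    return a.grad


print("accumulating (+=):", grad_of_a(True))   # 7.0
print("overwriting  (=): ", grad_of_a(False))  # 3.0, two contributions are lost
```

The same accumulation is also why calling `backward()` twice without resetting `grad` doubles the stored gradients.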

### Try It Yourself: Exercise 1

Add more operations to the Value class:
1. `__sub__` (subtraction)
2. `__neg__` (negation: -self)
3. `__pow__` (power: self ** n where n is a number)

<details>
<summary>Hint for subtraction</summary>

Subtraction is just addition with negation: `a - b = a + (-b)`
Or you can implement it directly with gradient `-1` for the subtracted term.
</details>

<details>
<summary>Hint for power</summary>

The derivative of `x^n` is `n * x^(n-1)`
</details>

In [None]:
# YOUR CODE HERE: Extend the Value class with more operations
class ValueExtended(Value):
    
    def __neg__(self):
        """Negation: -self"""
        # TODO: Implement this
        pass
    
    def __sub__(self, other):
        """Subtraction: self - other"""
        # TODO: Implement this
        pass
    
    def __pow__(self, n):
        """Power: self ** n (n must be int or float)"""
        # TODO: Implement this
        pass

# Test your implementation
# x = ValueExtended(3.0)
# y = x ** 2 - x
# y.backward()
# print(f"x = {x}")  # Should have grad = 2*3 - 1 = 5

---

## Part 3: From Values to Tensors

Now let's extend our scalar `Value` to handle multi-dimensional arrays. This is where things get interesting!

### Key Challenges:
1. **Broadcasting**: `[1, 2, 3] + 5` should add 5 to each element
2. **Shape management**: Matrix multiplication has specific dimension requirements
3. **Gradient accumulation**: When a tensor is used multiple times, gradients accumulate

In [None]:
# Now let's look at the full Tensor implementation
# We've already built this - let's import and explore it

import sys
from pathlib import Path

# Robust path resolution - works regardless of working directory
def _find_module_root():
    """Find the module root directory containing micrograd_plus."""
    current = Path.cwd()
    for parent in [current] + list(current.parents):
        if (parent / 'micrograd_plus' / '__init__.py').exists():
            return str(parent)
    return str(Path.cwd().parent)

sys.path.insert(0, _find_module_root())

from micrograd_plus import Tensor

# Basic operations
a = Tensor([1.0, 2.0, 3.0], requires_grad=True)
b = Tensor([4.0, 5.0, 6.0], requires_grad=True)

# Element-wise operations
c = a + b
print(f"a + b = {c}")

d = a * b
print(f"a * b = {d}")

# Reduction operations
e = (a * b).sum()
print(f"sum(a * b) = {e}")

# Backward pass
e.backward()
print(f"\na.grad = {a.grad}")  # Should be b.data = [4, 5, 6]
print(f"b.grad = {b.grad}")  # Should be a.data = [1, 2, 3]

### Understanding Tensor Gradients

For `e = sum(a * b)`:

$$e = \sum_i a_i \cdot b_i$$

$$\frac{\partial e}{\partial a_i} = b_i$$

$$\frac{\partial e}{\partial b_i} = a_i$$

This is exactly what we see in the gradients!
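The same result can be reproduced by hand in plain NumPy, which makes the two backward steps explicit. This only illustrates the math: the variable names are made up here, and it does not show how `micrograd_plus` is implemented internally.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Forward pass
prod = a * b      # element-wise product
e = prod.sum()    # scalar output

# Backward pass, one operation at a time
grad_e = 1.0                             # de/de
grad_prod = grad_e * np.ones_like(prod)  # sum passes the gradient to every element
grad_a = grad_prod * b                   # d(a*b)/da = b
grad_b = grad_prod * a                   # d(a*b)/db = a

print("e      =", e)
print("grad_a =", grad_a)
print("grad_b =", grad_b)
```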

In [None]:
# Matrix operations
# Reset gradients
a.zero_grad()
b.zero_grad()

# Create matrices
W = Tensor([[1, 2], [3, 4], [5, 6]], requires_grad=True)  # Shape: (3, 2)
x = Tensor([[1], [2]], requires_grad=True)  # Shape: (2, 1)

print(f"W.shape = {W.shape}")
print(f"x.shape = {x.shape}")

# Matrix multiplication
y = W @ x
print(f"\nW @ x = {y}")  # Shape: (3, 1)
print(f"y.shape = {y.shape}")

# Sum to scalar for backward
loss = y.sum()
loss.backward()

print(f"\nW.grad =\n{W.grad}")
print(f"x.grad =\n{x.grad}")

### Understanding Matrix Gradient

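For `y = W @ x` where `y` is summed to a scalar, the upstream gradient `dL/dy` is `ones` with the shape of `y`, here `(3, 1)`:

- `W.grad` should be `ones @ x.T` = `[[1], [1], [1]] @ [[1, 2]]` = `[[1, 2], [1, 2], [1, 2]]`
- `x.grad` should be `W.T @ ones` = `[[1, 3, 5], [2, 4, 6]] @ [[1], [1], [1]]` = `[[9], [12]]` (each entry is the sum of one column of `W`)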
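These two products can be checked in plain NumPy. The sketch below uses the same `W` and `x` values as the cell above and spot-checks one entry of `W.grad` with a central difference; the helper name `loss` is illustrative.

```python
import numpy as np

W = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # (3, 2)
x = np.array([[1.0], [2.0]])                        # (2, 1)

# loss = sum(W @ x), so the upstream gradient dL/dy is all ones with y's shape
ones = np.ones((3, 1))

W_grad = ones @ x.T  # (3, 1) @ (1, 2) -> (3, 2)
x_grad = W.T @ ones  # (2, 3) @ (3, 1) -> (2, 1)

print("W_grad =\n", W_grad)
print("x_grad =\n", x_grad)

def loss(W_, x_):
    return float((W_ @ x_).sum())

# Spot-check a single entry, dL/dW[0, 1], with a central difference
h = 1e-6
W_plus = W.copy()
W_plus[0, 1] += h
W_minus = W.copy()
W_minus[0, 1] -= h
numeric = (loss(W_plus, x) - loss(W_minus, x)) / (2 * h)
print("numeric dL/dW[0,1] =", round(numeric, 6), "| analytic =", W_grad[0, 1])
```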

---

## Part 4: Implementing Key Operations

Let's walk through how some key operations are implemented with their gradients.

In [None]:
# Let's trace through how multiplication works in our Tensor class

def explain_mul_gradient():
    """
    When we compute c = a * b:
    
    Forward: c = a * b (element-wise)
    
    Backward: 
        dc/da = b  (derivative of a*b w.r.t. a is b)
        dc/db = a  (derivative of a*b w.r.t. b is a)
        
    But we receive dL/dc from upstream (out.grad), so by chain rule:
        dL/da = dL/dc * dc/da = out.grad * b
        dL/db = dL/dc * dc/db = out.grad * a
    """
    a = Tensor([2.0, 3.0], requires_grad=True)
    b = Tensor([4.0, 5.0], requires_grad=True)
    
    c = a * b  # [8, 15]
    loss = c.sum()  # 23
    
    print("Forward pass:")
    print(f"  a = {a.data}")
    print(f"  b = {b.data}")
    print(f"  c = a * b = {c.data}")
    print(f"  loss = sum(c) = {loss.data}")
    
    loss.backward()
    
    print("\nBackward pass:")
    print(f"  loss.grad = 1 (always starts at 1)")
    print(f"  c.grad = {c.grad} (gradient from sum is 1s)")
    print(f"  a.grad = {a.grad} = b.data * c.grad")
    print(f"  b.grad = {b.grad} = a.data * c.grad")

explain_mul_gradient()

In [None]:
# Let's look at a more complex example: a simple neural network layer

def explain_linear_layer():
    """
    A linear layer computes: y = x @ W + b
    
    Shapes:
    - x: (batch_size, in_features) e.g., (2, 3)
    - W: (in_features, out_features) e.g., (3, 2)
    - b: (out_features,) e.g., (2,)
    - y: (batch_size, out_features) e.g., (2, 2)
    
    Gradients:
    - dL/dW = x.T @ dL/dy
    - dL/db = sum(dL/dy, axis=0)
    - dL/dx = dL/dy @ W.T
    """
    np.random.seed(42)
    
    # Input: batch of 2 samples, 3 features each
    x = Tensor(np.random.randn(2, 3), requires_grad=True)
    
    # Weights and bias
    W = Tensor(np.random.randn(3, 2), requires_grad=True)
    b = Tensor(np.random.randn(2), requires_grad=True)
    
    # Forward pass
    y = x @ W + b
    loss = y.sum()  # Simple loss for demonstration
    
    print("Shapes:")
    print(f"  x: {x.shape}")
    print(f"  W: {W.shape}")
    print(f"  b: {b.shape}")
    print(f"  y: {y.shape}")
    
    # Backward pass
    loss.backward()
    
    print("\nGradient shapes:")
    print(f"  x.grad: {x.grad.shape}")
    print(f"  W.grad: {W.grad.shape}")
    print(f"  b.grad: {b.grad.shape}")
    
    # Verify gradient formula for W: dL/dW = x.T @ dL/dy
    # Since loss = sum(y), dL/dy = ones
    expected_W_grad = x.data.T @ np.ones((2, 2))
    print(f"\nW.grad verification:")
    print(f"  Computed: {W.grad}")
    print(f"  Expected (x.T @ ones): {expected_W_grad}")
    print(f"  Match: {np.allclose(W.grad, expected_W_grad)}")

explain_linear_layer()
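The check above uses `loss = y.sum()`, so `dL/dy` is all ones and only `W.grad` is verified. As a complement, here is a NumPy-only sketch with a loss whose upstream gradient is not constant (`sum(y ** 2)`, chosen just for illustration) that tests all three formulas from the docstring against finite differences. The names `loss_fn` and `numeric_grad` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3))
W = rng.standard_normal((3, 2))
b = rng.standard_normal(2)

def loss_fn(x_, W_, b_):
    y = x_ @ W_ + b_
    return float((y ** 2).sum())

# Analytical gradients from the linear-layer formulas
y = x @ W + b
dL_dy = 2 * y              # d(sum(y^2))/dy
dL_dW = x.T @ dL_dy        # (3, 2)
dL_db = dL_dy.sum(axis=0)  # (2,), the bias is broadcast over the batch
dL_dx = dL_dy @ W.T        # (2, 3)

def numeric_grad(fn, arr, h=1e-6):
    """Central-difference gradient of scalar fn() with respect to arr.

    `arr` is perturbed in place and restored after each entry.
    """
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        original = arr[idx]
        arr[idx] = original + h
        f_plus = fn()
        arr[idx] = original - h
        f_minus = fn()
        arr[idx] = original
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

current_loss = lambda: loss_fn(x, W, b)
checks = {
    "dL/dW": np.allclose(dL_dW, numeric_grad(current_loss, W), atol=1e-5),
    "dL/db": np.allclose(dL_db, numeric_grad(current_loss, b), atol=1e-5),
    "dL/dx": np.allclose(dL_dx, numeric_grad(current_loss, x), atol=1e-5),
}
for name, ok in checks.items():
    print(f"{name} matches finite differences: {ok}")
```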

---

## Part 5: Handling Broadcasting

One of the trickiest parts of tensor autograd is handling **broadcasting**. When shapes don't match exactly, NumPy broadcasts smaller tensors to match larger ones.

### The Problem

```python
a = Tensor([[1, 2, 3],    # Shape: (2, 3)
            [4, 5, 6]])
b = Tensor([10, 20, 30])  # Shape: (3,)
c = a + b                  # Shape: (2, 3) - b is broadcast!
```

When we compute gradients, `c.grad` has shape `(2, 3)`, but `b.grad` should have shape `(3,)`. We need to **sum over the broadcast dimensions**.

In [None]:
# Broadcasting example
a = Tensor([[1, 2, 3], [4, 5, 6]], requires_grad=True)  # (2, 3)
b = Tensor([10, 20, 30], requires_grad=True)  # (3,) - will broadcast to (2, 3)

print(f"a.shape = {a.shape}")
print(f"b.shape = {b.shape}")

c = a + b
print(f"\nc = a + b:")
print(f"c.shape = {c.shape}")
print(f"c = {c.data}")

loss = c.sum()
loss.backward()

print(f"\nGradients:")
print(f"a.grad = {a.grad} (shape: {a.grad.shape})")
print(f"b.grad = {b.grad} (shape: {b.grad.shape})")

# Notice: b.grad is the sum of gradients over the broadcast dimension
# Each row of a received the same b, so b's gradient is the sum of all rows' gradients

### 🔍 Why Sum Over Broadcast Dimensions?

When `b` is broadcast to match `a`, it's conceptually like:
```
b_broadcast = [[10, 20, 30],
               [10, 20, 30]]  # Same b repeated for each row
```

Since the same `b[i]` affects multiple outputs, its gradient is the **sum** of all those contributions. This is just the chain rule!
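One common way to implement this reduction is a small helper that sums the upstream gradient back down to the operand's original shape. The sketch below is a hypothetical `unbroadcast` function written for illustration; `micrograd_plus` may handle broadcasting differently.

```python
import numpy as np

def unbroadcast(grad, shape):
    """Sum `grad` down to `shape`, undoing NumPy broadcasting.

    Hypothetical helper for illustration; micrograd_plus may do this differently.
    """
    # 1. Broadcasting can prepend axes: sum away the extra leading ones
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # 2. Axes of size 1 were stretched: sum over them but keep the axis
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] != 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

upstream = np.ones((2, 3))  # c.grad when loss = c.sum()

print(unbroadcast(upstream, (3,)))          # operand of shape (3,): [2. 2. 2.]
print(unbroadcast(upstream, (2, 1)))        # column operand: each row sums to 3
print(unbroadcast(upstream, (2, 3)).shape)  # same shape, nothing to sum
```

In a backward function for `a + b`, each operand would receive `unbroadcast(out.grad, operand.shape)` instead of `out.grad` directly.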

---

## Part 6: Numerical Gradient Verification

How do we know our gradients are correct? We compare them to **numerical gradients** computed using finite differences.

### The Idea

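For a differentiable function, the derivative equals the limit of the central (symmetric) difference quotient:
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x - h)}{2h}$$

We can approximate this with a small `h` (typically `1e-5`). The central difference has truncation error $O(h^2)$, versus $O(h)$ for the one-sided quotient $(f(x + h) - f(x)) / h$, which makes it the better choice for gradient checks.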
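A minimal NumPy sketch of such an approximation is shown below. The function name `central_difference_gradient` is illustrative; the lab's own `numerical_gradient` utility may differ in signature and defaults.

```python
import numpy as np

def central_difference_gradient(f, x, h=1e-5):
    """Approximate the gradient of a scalar-valued f at array x.

    Illustrative stand-in, not the micrograd_plus.utils implementation.
    """
    x = np.array(x, dtype=np.float64)  # work on a float copy
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        original = x[idx]
        x[idx] = original + h
        f_plus = f(x)
        x[idx] = original - h
        f_minus = f(x)
        x[idx] = original  # restore before moving to the next entry
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

def cube_sum(v):
    return float((v ** 3).sum())  # analytic gradient: 3 * v**2

x = np.array([1.0, 2.0, 3.0])
numeric = central_difference_gradient(cube_sum, x)
analytic = 3 * x ** 2

print("numeric :", numeric)
print("analytic:", analytic)
print("max abs error:", np.max(np.abs(numeric - analytic)))
```

Each entry costs two function evaluations, so this approach is only practical for testing, not for training.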

In [None]:
from micrograd_plus.utils import numerical_gradient, gradient_check

def test_gradient(name, f, x):
    """
    Test that analytical gradient matches numerical gradient.
    """
    # Compute analytical gradient
    x.zero_grad()
    y = f(x)
    y.backward()
    analytical = x.grad.copy()
    
    # Compute numerical gradient
    def numpy_f(arr):
        return f(Tensor(arr)).data.item()
    numerical = numerical_gradient(numpy_f, x.data.copy())
    
    # Compare
    max_error = np.max(np.abs(analytical - numerical))
    passed = np.allclose(analytical, numerical, atol=1e-4)
    
    status = "PASS" if passed else "FAIL"
    print(f"[{status}] {name}: max_error = {max_error:.2e}")
    if not passed:
        print(f"   Analytical: {analytical}")
        print(f"   Numerical: {numerical}")
    
    return passed

# Test various operations
print("Gradient verification:")
print("-" * 50)

x = Tensor([1.0, 2.0, 3.0], requires_grad=True)

test_gradient("sum(x)", lambda t: t.sum(), Tensor([1.0, 2.0, 3.0], requires_grad=True))
test_gradient("sum(x^2)", lambda t: (t ** 2).sum(), Tensor([1.0, 2.0, 3.0], requires_grad=True))
test_gradient("sum(x * y)", lambda t: (t * Tensor([4.0, 5.0, 6.0])).sum(), Tensor([1.0, 2.0, 3.0], requires_grad=True))
test_gradient("mean(x)", lambda t: t.mean(), Tensor([1.0, 2.0, 3.0], requires_grad=True))
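# ReLU is not differentiable at 0 (a central difference there gives 0.5), so test away from the kink
test_gradient("sum(relu(x))", lambda t: t.relu().sum(), Tensor([-1.0, 0.5, 2.0], requires_grad=True))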
test_gradient("sum(sigmoid(x))", lambda t: t.sigmoid().sum(), Tensor([-1.0, 0.0, 1.0], requires_grad=True))
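A caveat for checks like these: finite differences are only meaningful where the function is differentiable. ReLU has a kink at 0, and the minimal sketch below (plain Python, illustrative names) shows that a central difference there returns 0.5, which matches neither of the usual analytical conventions (0 or 1). If a ReLU check reports a mismatch, look first for test inputs at exactly 0.

```python
def relu(x):
    return max(0.0, x)

def central_diff(fn, x, h=1e-5):
    """Symmetric finite-difference approximation of d fn / dx."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

for x in (-1.0, 0.0, 1.0):
    print(f"x = {x:+.1f}: central difference = {central_diff(relu, x):.3f}")
```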

In [None]:
# Test matrix multiplication gradient
np.random.seed(42)

def test_matmul_gradient():
    A = Tensor(np.random.randn(3, 4), requires_grad=True)
    B = Tensor(np.random.randn(4, 2), requires_grad=True)
    
    # Forward
    C = A @ B
    loss = C.sum()
    
    # Backward
    loss.backward()
    
    # Numerical gradient for A
    def f_A(arr):
        return (Tensor(arr) @ B).sum().data.item()
    numerical_A = numerical_gradient(f_A, A.data.copy())
    
    # Numerical gradient for B
    def f_B(arr):
        return (A @ Tensor(arr)).sum().data.item()
    numerical_B = numerical_gradient(f_B, B.data.copy())
    
    print("Matrix multiplication gradient check:")
    print(f"  A.grad matches numerical: {np.allclose(A.grad, numerical_A, atol=1e-4)}")
    print(f"  B.grad matches numerical: {np.allclose(B.grad, numerical_B, atol=1e-4)}")

test_matmul_gradient()

---

## Common Mistakes

### Mistake 1: Forgetting to Zero Gradients

```python
# Wrong: gradients accumulate!
for i in range(3):
    y = (x ** 2).sum()
    y.backward()
    print(x.grad)  # Keeps growing!

# Right: zero gradients before each backward
for i in range(3):
    x.zero_grad()
    y = (x ** 2).sum()
    y.backward()
    print(x.grad)  # Same each time
```

In [None]:
# Demonstrate the gradient accumulation problem
x = Tensor([1.0, 2.0, 3.0], requires_grad=True)

print("Without zeroing gradients (wrong!):")
for i in range(3):
    y = (x ** 2).sum()
    y.backward()
    print(f"  Iteration {i+1}: x.grad = {x.grad}")

print("\nWith zeroing gradients (correct):")
for i in range(3):
    x.zero_grad()
    y = (x ** 2).sum()
    y.backward()
    print(f"  Iteration {i+1}: x.grad = {x.grad}")

### Mistake 2: Calling backward() on Non-Scalar

```python
# Wrong: backward needs a scalar
x = Tensor([1, 2, 3], requires_grad=True)
y = x ** 2  # y is a vector!
y.backward()  # Error!

# Right: reduce to scalar first
y = x ** 2
loss = y.sum()  # Now it's a scalar
loss.backward()
```

In [None]:
# Demonstrate the non-scalar backward error
x = Tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # Vector output

try:
    y.backward()
except RuntimeError as e:
    print(f"Error: {e}")

# Fix: reduce to scalar
x.zero_grad()
y = x ** 2
loss = y.sum()
loss.backward()
print(f"\nAfter summing to scalar:")
print(f"   x.grad = {x.grad}")

### Mistake 3: Modifying Tensor Data In-Place

```python
# Wrong: the update bypasses the computation graph
x = Tensor([1, 2, 3], requires_grad=True)
x.data = x.data * 2  # Autograd never records this operation
y = x.sum()
y.backward()  # x.grad is all ones: the factor of 2 is missing

# Right: create a new tensor
x = Tensor([1, 2, 3], requires_grad=True)
x2 = x * 2  # New tensor, graph is intact
y = x2.sum()
y.backward()  # x.grad is all twos, as expected
```

---

## Checkpoint

You've learned:
- How computation graphs track operations for backpropagation
- How to implement basic autograd for scalars (Value class)
- How to extend autograd to tensors with broadcasting
- How to verify gradients using numerical approximation
- Common pitfalls and how to avoid them

---

## Challenge (Optional)

Implement these additional operations in a custom Tensor class:

1. **Division**: `__truediv__` with proper gradient
2. **Exponential**: `exp()` where gradient is `exp(x) * upstream_grad`
3. **Log**: `log()` where gradient is `1/x * upstream_grad`
4. **Softmax**: A more complex operation that normalizes across an axis

For softmax, the gradient is tricky:
$$\frac{\partial \text{softmax}(x)_i}{\partial x_j} = s_i(\delta_{ij} - s_j)$$

Where $s = \text{softmax}(x)$ and $\delta_{ij}$ is 1 if $i=j$, else 0.
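As a starting point for the softmax challenge, note that the Jacobian does not need to be materialized in the backward pass. Multiplying it by an upstream gradient $g$ collapses to $s \odot (g - \sum_j g_j s_j)$. The NumPy sketch below, for a single 1-D input and with illustrative function names, checks that shortcut against the explicit Jacobian; wiring it into a Tensor class is left to the challenge.

```python
import numpy as np

def softmax(x):
    shifted = x - np.max(x)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

def softmax_jacobian(s):
    """Explicit Jacobian: J[i, j] = s_i * (delta_ij - s_j)."""
    return np.diag(s) - np.outer(s, s)

def softmax_backward(s, upstream):
    """Vector-Jacobian product without building the full Jacobian."""
    return s * (upstream - np.dot(upstream, s))

x = np.array([1.0, 2.0, 3.0])
upstream = np.array([0.1, -0.2, 0.3])  # arbitrary upstream gradient

s = softmax(x)
via_jacobian = softmax_jacobian(s).T @ upstream
via_shortcut = softmax_backward(s, upstream)

print("via Jacobian:", via_jacobian)
print("via shortcut:", via_shortcut)
print("match:", np.allclose(via_jacobian, via_shortcut))
```

The shortcut costs O(n) per input vector instead of the O(n²) needed to build the Jacobian.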

---

## Further Reading

- [Karpathy's micrograd](https://github.com/karpathy/micrograd) - The inspiration for this module
- [Karpathy's micrograd video](https://www.youtube.com/watch?v=VMj-3S1tku0) - 2.5 hour deep dive
- [PyTorch Autograd Mechanics](https://pytorch.org/docs/stable/notes/autograd.html) - How the pros do it
- [Backpropagation in Matrix Form](https://explained.ai/matrix-calculus/) - The math behind matrix gradients

---

## Cleanup

In [None]:
# Cleanup - release memory
from micrograd_plus.utils import cleanup_notebook
cleanup_notebook(globals())

---

## Next Steps

Now that you understand the core Tensor with autograd, we'll build on this in:
- **Lab 1.7.2**: Implementing neural network layers (Linear, ReLU, Softmax, Dropout)
- **Lab 1.7.3**: Loss functions and optimizers

Each of these will use our Tensor class as the foundation!