# Lab 1.2.1: NumPy Broadcasting Lab

**Module:** 1.2 - Python for AI/ML  
**Time:** 2 hours  
**Difficulty:** ⭐⭐

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand and apply NumPy broadcasting rules
- [ ] Implement batch matrix multiplication using broadcasting
- [ ] Create outer products without explicit loops
- [ ] Achieve 100x+ speedup over loop-based implementations

---

## 📚 Prerequisites

- Completed: Module 1 (DGX Spark Platform Orientation)
- Knowledge of: Basic Python, array concepts

### Required Packages
- Python 3.9+
- NumPy >= 1.21

---

## 🌍 Real-World Context

**Why does broadcasting matter for AI?**

Every neural network operation relies on efficient matrix math. When you:
- Add bias to all neurons in a layer → **Broadcasting**
- Normalize a batch of images → **Broadcasting**
- Compute attention scores in a Transformer → **Broadcasting**

The difference between loop-based and broadcasted code can mean:
- Training in 2 hours vs. 2 weeks
- Running on your laptop vs. needing a cluster

**Real example:** GPT-style models compute attention scores for millions of token pairs at once using batched, vectorized matrix operations. Computed pair by pair in Python loops, language models would be impractically slow.

---

## 🧒 ELI5: What is Broadcasting?

> **Imagine you're baking cookies...** 🍪
>
> You have a recipe that says "add 1 teaspoon of vanilla to each cookie."
> 
> You don't write "1 tsp vanilla" on a separate card for each of your 100 cookies.
> Instead, you have ONE instruction that automatically applies to ALL cookies.
>
> That's broadcasting! NumPy takes a small array and automatically "stretches" 
> it to match a larger array, without actually copying the data.
>
> **In AI terms:** When you add a bias vector of shape `(128,)` to a batch of 
> 32 samples each with 128 features `(32, 128)`, NumPy broadcasts the bias 
> to add the same values to each row automatically.

---

## Part 1: Broadcasting Fundamentals

### The Broadcasting Rules

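NumPy compares shapes **from right to left**. Two dimensions are compatible if:
1. They are equal, OR
2. One of them is 1

If one array has fewer dimensions than the other, its shape is treated as if it were padded with 1s on the left. The result takes the larger size along each dimension.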

```
Array A:      (8, 1, 6, 1)
Array B:         (7, 1, 5)
Result:       (8, 7, 6, 5)  ✅ Compatible!
```
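
The rule can also be checked mechanically. The sketch below is a hand-rolled version of the shape comparison, written only to illustrate the rule; `broadcast_shape` is a hypothetical helper name, not a NumPy function. It is compared against NumPy's own `np.broadcast_shapes`, which assumes NumPy 1.20 or newer.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Illustrative re-implementation of the broadcasting rule (not a NumPy API)."""
    result = []
    # Walk both shapes from the right; missing leading dimensions count as 1.
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        dim_a = shape_a[-i] if i <= len(shape_a) else 1
        dim_b = shape_b[-i] if i <= len(shape_b) else 1
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            raise ValueError(f"Incompatible dimensions: {dim_a} vs {dim_b}")
        # A size-1 dimension stretches to match the other one.
        result.append(dim_b if dim_a == 1 else dim_a)
    return tuple(reversed(result))

print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))      # (8, 7, 6, 5)
print(np.broadcast_shapes((8, 1, 6, 1), (7, 1, 5)))  # NumPy's own answer
```

In practice `np.broadcast_shapes` is a quick way to test whether two shapes are compatible without allocating any arrays.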

Let's see this in action!

In [None]:
import numpy as np
import time

# Check NumPy version
print(f"NumPy version: {np.__version__}")
print(f"\n{'='*50}")
print("Welcome to the Broadcasting Lab! 🚀")
print(f"{'='*50}")

In [None]:
# Example 1: Adding a scalar to an array
# The scalar is broadcast to match the array shape

arr = np.array([1, 2, 3, 4, 5])
scalar = 10

result = arr + scalar

print("Array:", arr)
print("Scalar:", scalar)
print("Result:", result)
print(f"\nWhat happened: {scalar} was 'stretched' to [10, 10, 10, 10, 10]")

In [None]:
# Example 2: Adding a row vector to a matrix
# This is EXACTLY what happens when adding bias in neural networks!

# Simulating: batch of 4 samples, 3 features each
batch = np.array([
    [1.0, 2.0, 3.0],    # Sample 1
    [4.0, 5.0, 6.0],    # Sample 2
    [7.0, 8.0, 9.0],    # Sample 3
    [10.0, 11.0, 12.0]  # Sample 4
])

# Bias: one value per feature
bias = np.array([100, 200, 300])

print(f"Batch shape: {batch.shape}")
print(f"Bias shape: {bias.shape}")

result = batch + bias

print(f"Result shape: {result.shape}")
print("\nOriginal batch:")
print(batch)
print("\nAfter adding bias:")
print(result)
print("\n✨ Each column got its own bias value added to all rows!")

### 🔍 What Just Happened?

```
batch shape: (4, 3)
bias shape:     (3,)
              -----
Result:      (4, 3)  ← bias was stretched to (4, 3) automatically!
```

NumPy "virtually" repeated the bias for each row. No memory was actually copied!
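
One way to see the "virtual" repetition is `np.broadcast_to`, which returns the stretched view explicitly. This is a minimal sketch: the bias is created as float64 here (an assumption made so the byte strides are predictable), and the variable name `stretched` is illustrative.

```python
import numpy as np

bias = np.array([100.0, 200.0, 300.0])     # shape (3,), float64
stretched = np.broadcast_to(bias, (4, 3))  # read-only view with shape (4, 3)

print(stretched.shape)                    # (4, 3)
print(stretched.strides)                  # (0, 8): stepping to the next row moves 0 bytes
print(np.shares_memory(bias, stretched))  # True: all four rows reuse the same 3 values
```

The zero stride is what makes the repetition free. The *result* of `batch + bias` is still a newly allocated `(4, 3)` array; only the stretched operand avoids a copy.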

In [None]:
# Example 3: Column-vector operations using reshape
# What if we want to apply a different value to each ROW instead of each column?

# Per-sample scaling factors
row_scales = np.array([1, 2, 3, 4]).reshape(-1, 1)  # Shape: (4, 1)

print(f"Batch shape: {batch.shape}")
print(f"Row scales shape: {row_scales.shape}")

result = batch * row_scales

print(f"\nRow scales:\n{row_scales}")
print(f"\nResult (each row multiplied by its scale):\n{result}")
print("\n✨ Row 1 × 1, Row 2 × 2, Row 3 × 3, Row 4 × 4")

### ✋ Try It Yourself: Exercise 1

**Task:** Normalize each row to have zero mean (subtract the row mean from each element).

Given a matrix of shape `(5, 4)`, compute the mean of each row and subtract it.

```python
# Your code here
data = np.random.randn(5, 4)
# Step 1: Compute mean of each row (should be shape (5,) or (5, 1))
# Step 2: Subtract from data using broadcasting
```

<details>
<summary>💡 Hint</summary>

Use `np.mean(data, axis=1, keepdims=True)` to get a shape of `(5, 1)` which broadcasts correctly!

</details>

In [None]:
# YOUR CODE HERE - Exercise 1
np.random.seed(42)
data = np.random.randn(5, 4)
print("Original data:")
print(data)
print(f"\nRow means before: {data.mean(axis=1)}")

# TODO: Normalize each row to zero mean
# normalized = ?

# Uncomment to verify:
# print(f"Row means after: {normalized.mean(axis=1)}")  # Should be ~0

---

## Part 2: The Speed of Vectorization

Now let's see WHY broadcasting matters: **speed**.

### 🧒 ELI5: Why Loops are Slow

> **Imagine two ways to fill a swimming pool...** 🏊
>
> **Loop approach:** Walk to the pool with a cup, pour water, walk back, refill... repeat 1 million times.
>
> **Vectorized approach:** Turn on a fire hose and fill it all at once.
>
> Python loops are like the cup method - each iteration has overhead.
> NumPy operations are like the fire hose - optimized C code processes everything in bulk.

In [None]:
def time_function(func, *args, n_runs=5):
    """Time a function and return average execution time."""
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        result = func(*args)
        times.append(time.perf_counter() - start)
    return np.mean(times), result

# Test data
N = 1000
A = np.random.randn(N, N).astype(np.float32)
B = np.random.randn(N, N).astype(np.float32)

print(f"Matrix size: {N}x{N} = {N*N:,} elements")
print(f"Memory per matrix: {A.nbytes / 1e6:.1f} MB")

In [None]:
# SLOW: Element-wise addition with nested loops
def add_with_loops(A, B):
    """Add two matrices using nested Python loops. (DON'T DO THIS!)"""
    result = np.zeros_like(A)
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            result[i, j] = A[i, j] + B[i, j]
    return result

# FAST: Vectorized addition
def add_vectorized(A, B):
    """Add two matrices using NumPy vectorization."""
    return A + B

# Let's use smaller matrices for the loop version (it's SLOW)
N_small = 200
A_small = A[:N_small, :N_small]
B_small = B[:N_small, :N_small]

print(f"Testing with {N_small}x{N_small} matrices...\n")

loop_time, loop_result = time_function(add_with_loops, A_small, B_small)
vec_time, vec_result = time_function(add_vectorized, A_small, B_small)

print(f"🐢 Loop version:       {loop_time*1000:.2f} ms")
print(f"🚀 Vectorized version: {vec_time*1000:.4f} ms")
print(f"\n⚡ Speedup: {loop_time/vec_time:.0f}x faster!")
print(f"\n✅ Results match: {np.allclose(loop_result, vec_result)}")

### 🎉 Whoa!

That's a massive speedup! Now imagine this difference when training a neural network with millions of operations...

In [None]:
# Let's do matrix multiplication - even more dramatic!

def matmul_with_loops(A, B):
    """Matrix multiply using triple nested loops. (NEVER DO THIS!)"""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "Dimensions don't match!"
    
    result = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            for k in range(K):
                result[i, j] += A[i, k] * B[k, j]
    return result

def matmul_vectorized(A, B):
    """Matrix multiply using NumPy."""
    return A @ B  # or np.matmul(A, B) or np.dot(A, B)

# Use tiny matrices for loop version
N_tiny = 50
A_tiny = A[:N_tiny, :N_tiny]
B_tiny = B[:N_tiny, :N_tiny]

print(f"Testing matrix multiplication with {N_tiny}x{N_tiny} matrices...\n")

loop_time, loop_result = time_function(matmul_with_loops, A_tiny, B_tiny, n_runs=1)
vec_time, vec_result = time_function(matmul_vectorized, A_tiny, B_tiny)

print(f"🐢 Triple loop:  {loop_time*1000:.2f} ms")
print(f"🚀 NumPy @ :     {vec_time*1000:.4f} ms")
print(f"\n⚡ Speedup: {loop_time/vec_time:.0f}x faster!")
print(f"\n✅ Results match: {np.allclose(loop_result, vec_result, rtol=1e-4)}")

In [None]:
# Now let's see NumPy @ with full-size matrices
print(f"Full {N}x{N} matrix multiplication...\n")

vec_time_full, _ = time_function(matmul_vectorized, A, B)
print(f"🚀 NumPy @ : {vec_time_full*1000:.2f} ms for {N}x{N} matrices")
print(f"\nLoop version would take ~{(loop_time * (N/N_tiny)**3):.0f} seconds! 😱")

---

## Part 3: Batch Matrix Multiplication

In deep learning, we rarely work with single matrices. We work with **batches** - multiple samples processed simultaneously.

### 🧒 ELI5: Batch Operations

> **Imagine you're a teacher grading exams...** 📝
>
> You have 32 students, each with 10 questions.
> 
> **Without batching:** Grade all of student 1's questions, then student 2's, etc.
>
> **With batching:** Grade question 1 for ALL students at once, then question 2, etc.
>
> The second way is faster because you get into a rhythm with each question type!
>
> **In AI terms:** Processing a batch of 32 images together is more efficient than 
> processing them one at a time, because the GPU can parallelize the work.

In [None]:
# Batch matrix multiplication example
# Shape: (batch_size, rows, cols)

batch_size = 32
M, K, N = 64, 128, 64  # Matrix dimensions

# Batch of A matrices: (32, 64, 128)
A_batch = np.random.randn(batch_size, M, K).astype(np.float32)

# Batch of B matrices: (32, 128, 64)
B_batch = np.random.randn(batch_size, K, N).astype(np.float32)

print(f"A_batch shape: {A_batch.shape}")
print(f"B_batch shape: {B_batch.shape}")

# Method 1: Loop over batch (slow)
def batch_matmul_loop(A, B):
    results = []
    for i in range(A.shape[0]):
        results.append(A[i] @ B[i])
    return np.stack(results)

# Method 2: Single broadcasted operation (fast!)
def batch_matmul_broadcast(A, B):
    return A @ B  # NumPy handles batch dimension automatically!

loop_time, loop_result = time_function(batch_matmul_loop, A_batch, B_batch)
broadcast_time, broadcast_result = time_function(batch_matmul_broadcast, A_batch, B_batch)

print(f"\nResult shape: {loop_result.shape}")
print(f"\n🐢 Loop over batch:    {loop_time*1000:.2f} ms")
print(f"🚀 Broadcast matmul:   {broadcast_time*1000:.2f} ms")
print(f"\n⚡ Speedup: {loop_time/broadcast_time:.1f}x faster!")
print(f"✅ Results match: {np.allclose(loop_result, broadcast_result)}")

### 🔍 What Just Happened?

The `@` operator (and `np.matmul`) automatically handles batch dimensions!

```
A_batch: (32, 64, 128)  →  32 matrices of size 64×128
B_batch: (32, 128, 64)  →  32 matrices of size 128×64
Result:  (32, 64, 64)   →  32 matrices of size 64×64
```

NumPy matched up corresponding matrices and multiplied them all at once!
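
In the example above both operands already carry the same batch dimension. The batch dimensions of `@` also follow the broadcasting rules, so a single matrix can be shared across the whole batch. The sketch below illustrates this with a hypothetical shared weight matrix `w_shared`; the names and the seeded generator are assumptions for the example, not part of the lab code.

```python
import numpy as np

rng = np.random.default_rng(0)
x_batch = rng.standard_normal((32, 64, 128)).astype(np.float32)  # 32 matrices of 64x128
w_shared = rng.standard_normal((128, 64)).astype(np.float32)     # one shared 128x64 matrix

# w_shared is treated as shape (1, 128, 64) and broadcast across the batch dimension.
out = x_batch @ w_shared
print(out.shape)  # (32, 64, 64)

# The same computation written per sample, for comparison.
out_loop = np.stack([x_batch[i] @ w_shared for i in range(x_batch.shape[0])])
print(np.allclose(out, out_loop, rtol=1e-4, atol=1e-3))
```

This is the shape pattern of a dense layer applied to a batch of sequences: the weights are stored once and never tiled.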

---

## Part 4: Outer Products

An **outer product** creates a matrix from two vectors by multiplying each element of the first with every element of the second.

### 🧒 ELI5: Outer Product

> **Imagine a multiplication table...** ✖️
>
> Row headers: [1, 2, 3]
> Column headers: [4, 5]
>
> The table shows every possible product:
> ```
>     4   5
> 1   4   5
> 2   8  10
> 3  12  15
> ```
>
> That's an outer product! Vector of 3 × Vector of 2 = Matrix of 3×2

In [None]:
# Simple outer product example
a = np.array([1, 2, 3])
b = np.array([4, 5])

# Method 1: Using np.outer
outer1 = np.outer(a, b)

# Method 2: Using broadcasting with reshape
outer2 = a[:, np.newaxis] * b[np.newaxis, :]  # (3,1) * (1,2) = (3,2)
# Or equivalently:
outer3 = a.reshape(-1, 1) * b.reshape(1, -1)

print("a:", a)
print("b:", b)
print("\nOuter product:")
print(outer1)
print(f"\nAll methods equal: {np.allclose(outer1, outer2) and np.allclose(outer1, outer3)}")

In [None]:
# Why outer-product-style broadcasting matters: computing distance matrices
# Given N points, compute the distance between every pair

N_points = 1000
D = 128  # Dimensionality (like embedding dimension)

# Random points (like word embeddings)
points = np.random.randn(N_points, D).astype(np.float32)

# Slow: Double loop
def distances_loop(points):
    N = len(points)
    dists = np.zeros((N, N), dtype=np.float32)
    for i in range(N):
        for j in range(N):
            dists[i, j] = np.sqrt(np.sum((points[i] - points[j])**2))
    return dists

# Fast: Broadcasting magic!
def distances_broadcast(points):
    # points: (N, D)
    # points[:, np.newaxis, :]: (N, 1, D)
    # points[np.newaxis, :, :]: (1, N, D)
    # Difference: (N, N, D) - every pair's difference vector
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt(np.sum(diff**2, axis=2))

# Test on smaller subset for loop version
N_test = 100
points_test = points[:N_test]

print(f"Computing {N_test}×{N_test} distance matrix...\n")

loop_time, loop_result = time_function(distances_loop, points_test, n_runs=1)
broadcast_time, broadcast_result = time_function(distances_broadcast, points_test)

print(f"🐢 Double loop:    {loop_time*1000:.1f} ms")
print(f"🚀 Broadcasting:   {broadcast_time*1000:.2f} ms")
print(f"\n⚡ Speedup: {loop_time/broadcast_time:.0f}x faster!")
print(f"✅ Results match: {np.allclose(loop_result, broadcast_result)}")

In [None]:
# Full scale test with broadcasting only
print(f"\nFull {N_points}×{N_points} distance matrix with broadcasting...")
full_time, full_result = time_function(distances_broadcast, points)
print(f"🚀 Time: {full_time*1000:.1f} ms")
print(f"   Result shape: {full_result.shape}")
# Note: this is the size of the result only; the intermediate (N, N, D) diff array is D times larger
print(f"   Result memory: {full_result.nbytes / 1e6:.1f} MB")

### ✋ Try It Yourself: Exercise 2

**Task:** Implement cosine similarity between all pairs of vectors using broadcasting.

Cosine similarity formula:
$$\text{cosine\_sim}(a, b) = \frac{a \cdot b}{\|a\| \cdot \|b\|}$$

Given `embeddings` of shape `(100, 64)`, compute a `(100, 100)` similarity matrix.

<details>
<summary>💡 Hint</summary>

1. First normalize each embedding to unit length
2. Then cosine similarity is just the dot product!
3. Use `embeddings @ embeddings.T` after normalization

</details>

In [None]:
# YOUR CODE HERE - Exercise 2
np.random.seed(42)
embeddings = np.random.randn(100, 64).astype(np.float32)

# TODO: Compute cosine similarity matrix
# Step 1: Compute norm of each embedding (shape: (100,) or (100, 1))
# Step 2: Normalize embeddings
# Step 3: Compute similarity matrix using @ operator

# similarity_matrix = ?

# Uncomment to verify:
# print(f"Shape: {similarity_matrix.shape}")  # Should be (100, 100)
# print(f"Diagonal (self-similarity): {similarity_matrix.diagonal()[:5]}")  # Should be ~1.0

---

## Part 5: Advanced Broadcasting Patterns

Let's look at some patterns commonly used in deep learning.

In [None]:
# Pattern 1: Softmax with numerical stability
# Used in attention mechanisms, classification outputs

def softmax_naive(x):
    """Naive softmax - can overflow with large values!"""
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def softmax_stable(x):
    """Stable softmax using max subtraction trick."""
    # Subtract max for numerical stability (broadcasting!)
    x_shifted = x - x.max(axis=-1, keepdims=True)
    exp_x = np.exp(x_shifted)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

# Test with batch of logits
batch_logits = np.random.randn(8, 10)  # 8 samples, 10 classes

probs = softmax_stable(batch_logits)
print(f"Input shape: {batch_logits.shape}")
print(f"Output shape: {probs.shape}")
print(f"\nProbabilities sum to 1? {np.allclose(probs.sum(axis=1), 1.0)}")
print(f"Sample output: {probs[0].round(3)}")
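
The cell above defines `softmax_naive` but only exercises `softmax_stable`. The sketch below redefines both functions so it runs on its own and feeds them deliberately large logits to show why the max subtraction matters. The input `big_logits` is an illustrative value chosen to trigger float64 overflow.

```python
import numpy as np

def softmax_naive(x):
    exp_x = np.exp(x)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

def softmax_stable(x):
    # (batch, classes) - (batch, 1): the row max is broadcast across each row
    x_shifted = x - x.max(axis=-1, keepdims=True)
    exp_x = np.exp(x_shifted)
    return exp_x / exp_x.sum(axis=-1, keepdims=True)

big_logits = np.array([[1000.0, 1001.0, 1002.0]])

# np.exp(1000.0) overflows float64 to inf, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(big_logits)

stable = softmax_stable(big_logits)
print(naive)   # all nan
print(stable)  # finite probabilities that sum to 1

# Shifting every logit in a row by the same constant does not change the softmax.
print(np.allclose(stable, softmax_stable(big_logits - 1000.0)))
```

Subtracting the row maximum makes the largest exponent exactly 0, so `np.exp` never sees a large positive input.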

In [None]:
# Pattern 2: Batch Normalization
# Normalizes features across the batch dimension

def batch_normalize(x, eps=1e-5):
    """
    Batch normalization.
    x: (batch_size, features)
    Returns normalized x with mean=0, std=1 per feature
    """
    mean = x.mean(axis=0, keepdims=True)  # (1, features)
    std = x.std(axis=0, keepdims=True)    # (1, features)
    return (x - mean) / (std + eps)

# Test
batch = np.random.randn(32, 128) * 5 + 10  # Mean ~10, std ~5
normalized = batch_normalize(batch)

print(f"Before normalization:")
print(f"  Mean per feature: {batch.mean(axis=0)[:5].round(2)}")
print(f"  Std per feature: {batch.std(axis=0)[:5].round(2)}")

print(f"\nAfter normalization:")
print(f"  Mean per feature: {normalized.mean(axis=0)[:5].round(6)}")
print(f"  Std per feature: {normalized.std(axis=0)[:5].round(3)}")

In [None]:
# Pattern 3: One-hot encoding using broadcasting

def one_hot(labels, num_classes):
    """Convert labels to one-hot vectors using broadcasting."""
    # labels: (batch_size,) with values 0 to num_classes-1
    # Create class indices: [0, 1, 2, ..., num_classes-1]
    classes = np.arange(num_classes)
    # Compare: (batch_size, 1) == (num_classes,) broadcasts to (batch_size, num_classes)
    return (labels[:, np.newaxis] == classes).astype(np.float32)

# Test
labels = np.array([0, 2, 1, 3, 1])
one_hot_labels = one_hot(labels, num_classes=4)

print("Labels:", labels)
print("\nOne-hot encoded:")
print(one_hot_labels)
print("\n✨ No loops needed!")

---

## Part 6: Memory Considerations on DGX Spark

With 128GB of unified memory, you can work with large arrays. But it's still important to be memory-aware!
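
Broadcast intermediates are one place where this matters. The `distances_broadcast` function from Part 4 builds an `(N, N, D)` difference array: for 1000 points with 128 float32 features that is about 512 MB of scratch space to produce a 4 MB result. A possible alternative, sketched below, uses the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2 a·b so that nothing larger than `(N, N)` is allocated. The helper name `distances_gram` is hypothetical, and the approach trades some floating-point accuracy for memory.

```python
import numpy as np

def distances_gram(points):
    """Pairwise Euclidean distances without the (N, N, D) intermediate.

    Illustrative helper (not part of the lab code).
    """
    sq_norms = np.sum(points**2, axis=1)  # (N,)
    gram = points @ points.T              # (N, N) dot products
    # (N, 1) + (1, N) - (N, N) broadcasts to (N, N)
    sq_dists = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2.0 * gram
    # Rounding can push tiny values slightly below zero; clip before the sqrt.
    return np.sqrt(np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
pts = rng.standard_normal((200, 128)).astype(np.float32)

# Reference: the direct broadcasting approach on a small input.
diff = pts[:, np.newaxis, :] - pts[np.newaxis, :, :]
reference = np.sqrt(np.sum(diff**2, axis=2))

# Compare off-diagonal entries; the diagonal is where cancellation error is largest.
off_diag = ~np.eye(len(pts), dtype=bool)
print(np.allclose(distances_gram(pts)[off_diag], reference[off_diag], rtol=1e-3, atol=1e-3))
```

The diagonal (distance from a point to itself) can come out as a small nonzero value with this formulation, so set it to zero explicitly if exact zeros are needed.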

In [None]:
# Memory efficiency comparison

def check_memory(arr, name):
    """Print memory usage of an array."""
    mb = arr.nbytes / 1e6
    print(f"{name}: {arr.shape} {arr.dtype} = {mb:.1f} MB")

# Float64 (default) vs Float32 vs Float16
N = 10000
D = 1024

arr_f64 = np.random.randn(N, D)  # Default float64
arr_f32 = arr_f64.astype(np.float32)
arr_f16 = arr_f64.astype(np.float16)

check_memory(arr_f64, "float64")
check_memory(arr_f32, "float32")
check_memory(arr_f16, "float16")

print(f"\n💡 Using float32 instead of float64 halves memory usage!")
print(f"   Most ML models work fine with float32 (or even float16 for inference).")

In [None]:
# Avoiding unnecessary copies

a = np.random.randn(1000, 1000).astype(np.float32)

# BAD: Creates a copy
b = a + 0  # New array allocated
print(f"a + 0 creates copy? {not np.shares_memory(a, b)}")

# GOOD: View (no copy)
c = a.reshape(1000000)  # Same data, different view
print(f"reshape creates copy? {not np.shares_memory(a, c)}")

# GOOD: In-place operations
a_copy = a.copy()
a_copy += 1  # In-place, no new allocation
print(f"\n💡 Use in-place operations (+=, *=, etc.) when possible!")

In [None]:
# Check if array is contiguous (important for performance)

a = np.random.randn(100, 100).astype(np.float32)

print("Original array:")
print(f"  C-contiguous (row-major): {a.flags['C_CONTIGUOUS']}")
print(f"  F-contiguous (col-major): {a.flags['F_CONTIGUOUS']}")

# Transpose creates a view with different stride
a_T = a.T
print("\nTransposed array (view):")
print(f"  C-contiguous: {a_T.flags['C_CONTIGUOUS']}")
print(f"  F-contiguous: {a_T.flags['F_CONTIGUOUS']}")

# Make it contiguous again
a_T_contig = np.ascontiguousarray(a_T)
print("\nAfter np.ascontiguousarray:")
print(f"  C-contiguous: {a_T_contig.flags['C_CONTIGUOUS']}")

print("\n💡 Many NumPy operations are fastest on contiguous arrays (C-contiguous is the default layout)!")

---

## ⚠️ Common Mistakes

### Mistake 1: Broadcasting dimension mismatch

In [None]:
# ❌ Wrong: Trying to add vectors of different sizes
a = np.array([1, 2, 3])      # shape (3,)
b = np.array([1, 2, 3, 4])   # shape (4,)

try:
    result = a + b  # This will fail!
except ValueError as e:
    print(f"❌ Error: {e}")

# ✅ Right: Use compatible shapes
a = np.array([1, 2, 3])      # shape (3,)
b = np.array([[1], [2]])     # shape (2, 1)
result = a + b               # shape (2, 3)
print(f"\n✅ Compatible broadcast: {a.shape} + {b.shape} = {result.shape}")

### Mistake 2: Forgetting keepdims

In [None]:
x = np.random.randn(4, 5)

# ❌ Wrong: Loses dimension, can't broadcast back
mean_wrong = x.mean(axis=1)  # shape (4,)
print(f"❌ Without keepdims: {x.shape} - {mean_wrong.shape} needs reshape")

# ✅ Right: Keeps dimension for broadcasting
mean_right = x.mean(axis=1, keepdims=True)  # shape (4, 1)
print(f"✅ With keepdims: {x.shape} - {mean_right.shape} broadcasts correctly")

# The difference in action:
normalized = x - mean_right  # Works perfectly!
print(f"\nNormalized shape: {normalized.shape}")
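
With a `(4, 5)` matrix, subtracting the `(4,)` mean fails loudly with a shape error. The more dangerous case is a square matrix, where the shapes happen to be compatible and the subtraction silently lines the means up with the wrong axis. A minimal sketch, using a small illustrative matrix named `square`:

```python
import numpy as np

square = np.arange(16, dtype=np.float64).reshape(4, 4)

row_means_flat = square.mean(axis=1)                 # shape (4,)
row_means_kept = square.mean(axis=1, keepdims=True)  # shape (4, 1)

wrong = square - row_means_flat  # (4, 4) - (4,): means line up with COLUMNS, no error raised
right = square - row_means_kept  # (4, 4) - (4, 1): means line up with rows

print(wrong.mean(axis=1))  # [-6. -2.  2.  6.]  rows are not centered
print(right.mean(axis=1))  # [0. 0. 0. 0.]
```

Because no exception is raised, this kind of bug is usually caught only by checking the result, for example by asserting that the row means are close to zero.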

### Mistake 3: Modifying slices unintentionally

In [None]:
# ❌ Wrong: Slices are views, not copies!
original = np.array([1, 2, 3, 4, 5])
slice_view = original[1:4]
slice_view[0] = 999  # This modifies original!
print(f"❌ Original modified: {original}")

# ✅ Right: Explicitly copy if you need independence
original = np.array([1, 2, 3, 4, 5])
slice_copy = original[1:4].copy()
slice_copy[0] = 999  # Original is safe
print(f"✅ Original preserved: {original}")

---

## 🎉 Checkpoint

You've learned:
- ✅ Broadcasting rules: shapes are compared right-to-left
- ✅ Vectorization is often 100x+ faster than Python loops
- ✅ Batch matrix multiplication with the `@` operator
- ✅ Outer products using broadcasting
- ✅ Common ML patterns: softmax, batch norm, one-hot
- ✅ Memory efficiency with dtypes and contiguity

---

## 🚀 Challenge (Optional)

**Implement a mini neural network forward pass using only NumPy broadcasting!**

Create a 2-layer network:
1. Input: (batch_size, 784) - flattened MNIST images
2. Hidden: (784, 256) weights + (256,) bias with ReLU activation
3. Output: (256, 10) weights + (10,) bias with softmax

Use broadcasting for all operations - no loops allowed!

```python
def forward(x, w1, b1, w2, b2):
    # TODO: Implement using broadcasting
    # hidden = relu(x @ w1 + b1)
    # output = softmax(hidden @ w2 + b2)
    pass
```

In [None]:
# YOUR CHALLENGE CODE HERE
np.random.seed(42)

# Start with random weights
batch_size = 32
x = np.random.randn(batch_size, 784).astype(np.float32)
w1 = np.random.randn(784, 256).astype(np.float32) * 0.01
b1 = np.zeros(256, dtype=np.float32)
w2 = np.random.randn(256, 10).astype(np.float32) * 0.01
b2 = np.zeros(10, dtype=np.float32)

# TODO: Implement forward pass
# probs = forward(x, w1, b1, w2, b2)
# print(f"Output shape: {probs.shape}")  # Should be (32, 10)
# print(f"Probabilities sum: {probs.sum(axis=1)}")  # Should be all 1.0

---

## 📖 Further Reading

- [NumPy Broadcasting Documentation](https://numpy.org/doc/stable/user/basics.broadcasting.html)
- [Array Broadcasting in NumPy (Stanford CS231n)](https://cs231n.github.io/python-numpy-tutorial/#numpy-broadcasting)
- [Why Vectorization is Faster](https://realpython.com/numpy-array-programming/)

---

## 🧹 Cleanup

In [None]:
# Clean up large arrays
import gc

# Delete large arrays we created
del A, B, A_batch, B_batch, points
gc.collect()

print("✅ Memory cleaned up!")
print("\n🎉 Congratulations! You've completed the NumPy Broadcasting Lab!")
print("   Next up: Lab 1.2.2 - Dataset Preprocessing Pipeline")