In [1]:
# Setup
import numpy as np
np.set_printoptions(suppress=True, precision=3)

def describe(name, x):
    print(f"{name}:\n{x}")
    print(f"shape: {x.shape}, dtype: {x.dtype}\n")

print("NumPy version:", np.__version__)

NumPy version: 2.4.0


---
## Part 1: Broadcasting

**Broadcasting** lets NumPy apply element-wise operations to arrays of different but compatible shapes, without making explicit copies of the smaller array.

### Why Broadcasting Matters
In neural networks, we often need to:
- Add a **bias** (scalar or vector) to all samples.
- Normalize data by subtracting mean or dividing by standard deviation.
- Apply the same transformation across batches of data.

Broadcasting makes these operations efficient and concise.
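
As a minimal sketch of the normalization case, the per-feature mean and standard deviation (shape `(3,)`) can be broadcast against a whole batch (shape `(4, 3)`). The array `data` and its values are made up for illustration.

```python
import numpy as np

# Hypothetical batch: 4 samples, 3 features (values chosen for illustration)
data = np.array([
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
    [4.0, 40.0, 400.0],
])

# Per-feature statistics, computed down the sample axis: shape (3,)
mean = data.mean(axis=0)
std = data.std(axis=0)

# (4, 3) - (3,) and (4, 3) / (3,) both broadcast row by row
normalized = (data - mean) / std

print("mean:", mean)
print("normalized shape:", normalized.shape)
print("column means after normalizing:", normalized.mean(axis=0))
```

After this step each column should have a mean close to 0 and a standard deviation close to 1.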

### Example 1: Adding a Scalar to a Matrix
When you add a scalar to a matrix, the scalar is "broadcast" to match the matrix shape.

In [2]:
# Sample data: 3 samples, 4 features each
X = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12]
])
describe("X", X)

# Add a scalar (like adding a bias term)
X_biased = X + 10
describe("X + 10 (broadcast scalar)", X_biased)

X:
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
shape: (3, 4), dtype: int64

X + 10 (broadcast scalar):
[[11 12 13 14]
 [15 16 17 18]
 [19 20 21 22]]
shape: (3, 4), dtype: int64



### Example 2: Adding a 1D Array to Each Row
Broadcasting also works with 1D arrays whose length matches the number of columns. Each row of the matrix gets the same 1D array added to it.

In [3]:
# Bias vector: one bias per feature
bias = np.array([0.1, 0.2, 0.3, 0.4])
describe("bias", bias)

# Add bias to each row of X
X_with_bias = X + bias
describe("X + bias (broadcast 1D array)", X_with_bias)

# This is equivalent to adding the same bias to every sample!

bias:
[0.1 0.2 0.3 0.4]
shape: (4,), dtype: float64

X + bias (broadcast 1D array):
[[ 1.1  2.2  3.3  4.4]
 [ 5.1  6.2  7.3  8.4]
 [ 9.1 10.2 11.3 12.4]]
shape: (3, 4), dtype: float64
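
Why does a `(4,)` array work here? NumPy compares shapes starting from the trailing dimension, and two dimensions are compatible when they are equal or one of them is 1. The sketch below checks a few shape pairs with `np.broadcast_shapes` and shows one way to add a value per row instead of per column. The names `X_demo` and `row_offset` are illustrative only.

```python
import numpy as np

# np.broadcast_shapes reports the result shape without building any arrays
print(np.broadcast_shapes((3, 4), (4,)))    # row vector against a matrix
print(np.broadcast_shapes((3, 4), (3, 1)))  # column vector against a matrix

# A (3,) array does not line up with the trailing dimension of (3, 4)
try:
    np.broadcast_shapes((3, 4), (3,))
    compatible = True
except ValueError:
    compatible = False
print("(3, 4) with (3,) compatible?", compatible)

# To add one value per row, reshape the 1D array into a column first
X_demo = np.arange(12).reshape(3, 4)
row_offset = np.array([100, 200, 300])
shifted = X_demo + row_offset[:, np.newaxis]  # (3, 4) + (3, 1) -> (3, 4)
print(shifted)
```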



### Student Task: Broadcasting Practice

In [None]:
# TODO: Create a 3x3 matrix M
M = np.array([    ])  # Add your 3 rows here

# TODO: Multiply M by a scalar (e.g., 2.5)
M_scaled = M *   # Complete this line
describe("M scaled", M_scaled)

# TODO: Create a 1D array with 3 elements and add it to each row of M
offset = np.array([    ])  # Add 3 numbers
M_offset = M +   # Complete this
describe("M + offset", M_offset)

---
## Part 2: Matrix Multiplication for Data Transformation

**Matrix multiplication** (`@` operator or `np.dot`) is the core operation in neural networks.

### Why Matrix Multiplication?
- Transforms input features into new representations.
- Each output is a **weighted combination** of inputs.
- Enables learning: the weights are adjusted during training.

### Shape Rules
For `A @ B`:
- `A.shape = (m, n)`
- `B.shape = (n, p)`
- Result: `(m, p)`

The inner dimensions must match!
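
A quick way to see the rule in action is to try a product whose inner dimensions disagree. This is a small sketch with placeholder arrays `A` and `B`; NumPy raises a `ValueError` when the shapes do not line up.

```python
import numpy as np

A = np.ones((2, 3))
B = np.ones((3, 4))
print((A @ B).shape)  # inner dimensions 3 and 3 match

# Swapping the operands gives (3, 4) @ (2, 3): inner dimensions 4 and 2 differ
try:
    B @ A
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print("B @ A failed:", error_message is not None)
```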

In [4]:
# Input: 2 samples, 3 features
X_input = np.array([
    [1, 0, 2],  # sample 1
    [0, 1, 1],  # sample 2
])
describe("X_input", X_input)

# Weight matrix: transforms 3 features → 2 new features
W = np.array([
    [0.5, 0.3],  # weights for feature 0
    [0.2, 0.8],  # weights for feature 1
    [0.1, 0.4],  # weights for feature 2
])
describe("W (weight matrix)", W)

# Transform: X @ W → (2, 3) @ (3, 2) = (2, 2)
X_transformed = X_input @ W
describe("X_input @ W", X_transformed)

print("Each output value is a weighted sum of input features!")

X_input:
[[1 0 2]
 [0 1 1]]
shape: (2, 3), dtype: int64

W (weight matrix):
[[0.5 0.3]
 [0.2 0.8]
 [0.1 0.4]]
shape: (3, 2), dtype: float64

X_input @ W:
[[0.7 1.1]
 [0.3 1.2]]
shape: (2, 2), dtype: float64

Each output value is a weighted sum of input features!
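
To make the "weighted sum" concrete, the sketch below recomputes a single entry of the result by hand, using the same `X_input` and `W` values as the cell above. The variable `manual` is introduced here only for the comparison.

```python
import numpy as np

X_input = np.array([[1, 0, 2], [0, 1, 1]])
W = np.array([[0.5, 0.3], [0.2, 0.8], [0.1, 0.4]])

# Entry (0, 1): sample 1 combined with the second column of W
manual = sum(X_input[0, k] * W[k, 1] for k in range(3))
print("manual:", manual)            # 1*0.3 + 0*0.8 + 2*0.4
print("matmul:", (X_input @ W)[0, 1])
```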


### Student Task: Matrix Multiplication

In [None]:
# TODO: Create a data matrix D with shape (4, 5): 4 samples, 5 features
D = np.array([    ])  # Add 4 rows of 5 numbers each

# TODO: Create a weight matrix T with shape (5, 3) to transform 5 features → 3
T = np.array([    ])  # Add 5 rows of 3 numbers each

# Perform transformation
D_transformed = D @ T
describe("D @ T", D_transformed)
print(f"Expected shape: (4, 3), Actual shape: {D_transformed.shape}")

---
## Part 3: Neural Network Intuition - What Are Weights?

### Weights: The "Knobs" of Learning

In a neural network:
- **Weights** are numbers that control how much each input feature contributes to each output.
- Initially, weights are random.
- During training, weights are adjusted to minimize error (via gradient descent).
- After training, weights encode the "knowledge" the network learned.

### Analogy
Think of weights as **volume knobs** on a mixing board:
- Each input is a sound track.
- Each weight controls how loud that track is in the final mix.
- Training finds the right "mix" to produce the desired output.

### Mathematical View
For one output neuron:
$$
\text{output} = w_1 \cdot x_1 + w_2 \cdot x_2 + \ldots + w_n \cdot x_n + b
$$

- $w_i$: weights (learnable)
- $x_i$: input features
- $b$: bias (learnable offset)

In matrix form: `output = X @ W + b`

In [5]:
# Example: predicting house price from [size, bedrooms, age]
# Input: one house
house = np.array([1500, 3, 10])  # sq ft, bedrooms, years old

# Weights: learned importance of each feature
weights = np.array([200, 50000, -1000])  # $ per sq ft, $ per bedroom, $ per year
bias = 100000  # base price

# Prediction (linear combination)
price = np.dot(house, weights) + bias
print(f"Predicted price: ${price:,.0f}")
print(f"\nBreakdown:")
print(f"  Size contribution: {house[0] * weights[0]:,.0f}")
print(f"  Bedrooms contribution: {house[1] * weights[1]:,.0f}")
print(f"  Age contribution: {house[2] * weights[2]:,.0f}")
print(f"  Bias (base price): {bias:,.0f}")

Predicted price: $540,000

Breakdown:
  Size contribution: 300,000
  Bedrooms contribution: 150,000
  Age contribution: -10,000
  Bias (base price): 100,000
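
The same weights can score several houses at once, which is where the matrix form pays off. The sketch below reuses the weights and bias from the cell above. The extra rows in `houses` are hypothetical examples, not real data.

```python
import numpy as np

# Same weights and bias as the cell above
weights = np.array([200, 50000, -1000])
bias = 100000

# Hypothetical batch: each row is [sq ft, bedrooms, years old]
houses = np.array([
    [1500, 3, 10],   # the house from the cell above
    [2000, 4, 5],
    [900, 2, 30],
])

# (3, 3) @ (3,) -> (3,), then the scalar bias is broadcast to every house
prices = houses @ weights + bias
for features, p in zip(houses, prices):
    print(f"{features} -> ${p:,.0f}")
```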


---
## Part 4: Neural Network Intuition - What Are Activation Functions?

### The Problem with Pure Linear Transformations

If we only use `output = X @ W + b`, we have a **linear model**.
- Linear models can only learn straight lines or flat planes.
- Real-world patterns (images, language, etc.) are **non-linear**.

**Key Insight**: Stacking multiple linear layers without non-linearity is equivalent to a single linear layer!
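
This follows from expanding the product: $(X W_1 + b_1) W_2 + b_2 = X (W_1 W_2) + (b_1 W_2 + b_2)$, which is again a single linear layer. The sketch below checks this numerically with random placeholder parameters. All names (`X_lin`, `W_a`, `W_b`, and so on) and the seed are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability

X_lin = rng.normal(size=(4, 3))                     # 4 samples, 3 features
W_a, b_a = rng.normal(size=(3, 5)), rng.normal(size=5)
W_b, b_b = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two linear layers applied one after the other, no activation in between
two_layers = (X_lin @ W_a + b_a) @ W_b + b_b

# One linear layer with merged parameters
W_merged = W_a @ W_b          # shape (3, 2)
b_merged = b_a @ W_b + b_b    # shape (2,)
one_layer = X_lin @ W_merged + b_merged

print("Same result:", np.allclose(two_layers, one_layer))
```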

### Activation Functions: Introducing Non-Linearity

**Activation functions** apply element-wise non-linear transformations:
$$
\text{output} = \text{activation}(X @ W + b)
$$

This allows networks to learn complex, curved decision boundaries.

### Common Activation Functions

1. **ReLU (Rectified Linear Unit)**: `max(0, x)`
   - Most popular in hidden layers.
   - Keeps positive values, zeros out negatives.
   - Fast and effective.

2. **Sigmoid**: `1 / (1 + exp(-x))`
   - Squashes values to (0, 1).
   - Used for probabilities in binary classification.

3. **Tanh**: `(exp(x) - exp(-x)) / (exp(x) + exp(-x))`
   - Squashes values to (-1, 1).
   - Centered around zero.

4. **Softmax**: Converts logits to probability distribution.
   - Used in the output layer for multi-class classification.
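
Softmax is the only function in this list without a formula above. It is defined as $\text{softmax}(z)_i = e^{z_i} / \sum_j e^{z_j}$. The sketch below is one common way to write it in NumPy. Subtracting the row maximum before exponentiating is a standard stability trick and does not change the result. The `logits` values are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=-1, keepdims=True)

# Hypothetical logits for 2 samples and 3 classes
logits = np.array([
    [2.0, 1.0, 0.1],
    [1000.0, 1000.0, 1000.0],  # large values that would overflow a naive exp
])
probs = softmax(logits)
print(probs)
print("Row sums:", probs.sum(axis=1))
```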

### Role of Activation Functions
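- **Enable complexity**: With enough hidden units, a network with a non-linear activation can approximate a very wide class of functions.
- **Introduce non-linearity**: Break the chain of linear operations.
- **Control output range**: Sigmoid and tanh give bounded outputs; ReLU is bounded below by 0 but unbounded above.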

In [6]:
# Define common activation functions
def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

# Test on a range of values
z = np.array([-2, -1, 0, 1, 2])
print("Input z:", z)
print("ReLU(z):", relu(z))
print("Sigmoid(z):", sigmoid(z))
print("Tanh(z):", tanh(z))

Input z: [-2 -1  0  1  2]
ReLU(z): [0 0 0 1 2]
Sigmoid(z): [0.119 0.269 0.5   0.731 0.881]
Tanh(z): [-0.964 -0.762  0.     0.762  0.964]


### Student Task: Experiment with Activation Functions
Test how different activations transform various input ranges.

In [None]:
# TODO: Create your own test vector with at least 5 values
# Include negative, zero, and positive values
test_vector = np.array([    ])  # Add your numbers here
print("Test vector:", test_vector)

# TODO: Apply each activation function to your vector
relu_result =   # Apply relu to test_vector
sigmoid_result =   # Apply sigmoid to test_vector
tanh_result =   # Apply tanh to test_vector

print("\nResults:")
print("ReLU:", relu_result)
print("Sigmoid:", sigmoid_result)
print("Tanh:", tanh_result)

# TODO: Test extreme values (very large positive/negative)
extreme_values = np.array([-100, -10, 0, 10, 100])
print("\n--- Testing Extreme Values ---")
print("Input:", extreme_values)
print("ReLU (extreme):", relu(extreme_values))
print("Sigmoid (extreme):", sigmoid(extreme_values))
print("Tanh (extreme):", tanh(extreme_values))

# Observation questions:
# 1. What happens to negative values in ReLU?
# 2. What range do Sigmoid outputs fall in?
# 3. What range do Tanh outputs fall in?
# 4. Which activation "saturates" (approaches limits) for large inputs?

### Visualizing Activation Functions (Conceptual)
If you were to plot these:
- **ReLU**: A "bent line": flat at 0 for x < 0, then slopes up.
- **Sigmoid**: An "S-curve" from 0 to 1.
- **Tanh**: An "S-curve" from -1 to 1.

These non-linear shapes enable neural networks to learn curves, boundaries, and patterns.
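
Without a plotting library, a rough stand-in is to tabulate each function on a grid of inputs. The sketch below does that. The grid `xs` and the column layout are arbitrary choices for illustration.

```python
import numpy as np

xs = np.linspace(-5, 5, 11)  # -5, -4, ..., 5

relu_vals = np.maximum(0, xs)
sigmoid_vals = 1 / (1 + np.exp(-xs))
tanh_vals = np.tanh(xs)

print("    x   ReLU  Sigmoid    Tanh")
for x, r, s, t in zip(xs, relu_vals, sigmoid_vals, tanh_vals):
    print(f"{x:5.1f} {r:6.2f} {s:8.3f} {t:7.3f}")
```

Reading down the columns should show the bend in ReLU at 0 and the two S-curves flattening out toward their limits.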

---
## Part 5: Simulating a Simple Neural Network Layer

### A Neural Network Layer = Linear Transformation + Activation

**Formula**:
$$
\text{output} = \text{activation}(X @ W + b)
$$

**Steps**:
1. **Linear transformation**: Multiply input by weights (`X @ W`).
2. **Add bias**: Shift the result (`+ b`).
3. **Apply activation**: Introduce non-linearity (`activation(...)`).

Let's simulate a small layer processing token embeddings!

### Example: Processing Token Embeddings

Imagine we have:
- 3 tokens (words) in a sequence.
- Each token is represented by a 4-dimensional embedding.
- We want to transform embeddings to 5 dimensions using a neural layer.

**Setup**:
- Input: `(3, 4)`, i.e. 3 tokens with 4 features each.
- Weights: `(4, 5)`, transforming 4 → 5 dimensions.
- Bias: `(5,)`, one bias per output feature.
- Activation: ReLU.

In [7]:
# Input embeddings: 3 tokens, 4-dimensional embeddings
# (In real NLP, these come from an embedding layer)
embeddings = np.array([
    [0.5, 0.2, -0.1, 0.8],   # token 1: "cloud"
    [0.3, -0.4, 0.6, 0.1],   # token 2: "security"
    [-0.2, 0.9, 0.0, 0.5],   # token 3: "data"
])
describe("Input embeddings", embeddings)

# Weights: transform 4D → 5D (random here; in a real network they are learned during training)
W_layer = np.random.randn(4, 5) * 0.5  # small random weights
describe("Weight matrix W", W_layer)

# Bias: one per output dimension
b_layer = np.random.randn(5) * 0.1
describe("Bias vector b", b_layer)

Input embeddings:
[[ 0.5  0.2 -0.1  0.8]
 [ 0.3 -0.4  0.6  0.1]
 [-0.2  0.9  0.   0.5]]
shape: (3, 4), dtype: float64

Weight matrix W:
[[ 0.499  0.66  -0.01  -0.683  0.251]
 [ 1.17   0.454  0.339  0.222 -1.098]
 [-0.707 -0.004  0.493 -0.025 -0.087]
 [-0.717  1.099  0.774 -0.249  0.31 ]]
shape: (4, 5), dtype: float64

Bias vector b:
[-0.056  0.034  0.065  0.026  0.097]
shape: (5,), dtype: float64



In [8]:
# Step 1: Linear transformation
z = embeddings @ W_layer + b_layer  # (3, 4) @ (4, 5) + (5,) → (3, 5)
describe("Linear output z = X @ W + b", z)

# Step 2: Apply activation (ReLU)
output = relu(z)
describe("Activated output = ReLU(z)", output)

print("Notice: Negative values in z became 0 in output (ReLU effect)!")

Linear output z = X @ W + b:
[[-0.076  1.334  0.698 -0.468  0.259]
 [-0.87   0.158  0.299 -0.307  0.59 ]
 [ 0.539  0.86   0.759  0.238 -0.787]]
shape: (3, 5), dtype: float64

Activated output = ReLU(z):
[[0.    1.334 0.698 0.    0.259]
 [0.    0.158 0.299 0.    0.59 ]
 [0.539 0.86  0.759 0.238 0.   ]]
shape: (3, 5), dtype: float64

Notice: Negative values in z became 0 in output (ReLU effect)!
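
Because `np.random.randn` draws fresh values on every run, the numbers printed above will differ if you re-execute the cells. The sketch below wraps the steps in a small helper and uses a seeded generator so the result is repeatable. The helper name `dense_layer`, the names `W_seeded` and `b_seeded`, and the seed value are assumptions made for this example.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def dense_layer(X, W, b, activation):
    """One layer: linear transformation, bias, then element-wise activation."""
    return activation(X @ W + b)

rng = np.random.default_rng(42)  # seed is an arbitrary choice

embeddings = np.array([
    [0.5, 0.2, -0.1, 0.8],
    [0.3, -0.4, 0.6, 0.1],
    [-0.2, 0.9, 0.0, 0.5],
])
W_seeded = rng.normal(size=(4, 5)) * 0.5  # small random weights
b_seeded = rng.normal(size=5) * 0.1

out = dense_layer(embeddings, W_seeded, b_seeded, relu)
print(out.shape)
```

Passing the activation as an argument makes it easy to swap `relu` for `sigmoid` or `np.tanh` without touching the rest of the code.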


### What Just Happened?

1. **Weights mixed the input features**: Each output dimension is a weighted combination of the 4 input dimensions.
2. **Bias shifted the result**: Added a learnable offset.
3. **ReLU introduced non-linearity**: Negative values were zeroed out.

This linear-plus-activation pattern is the building block of every fully connected (dense) layer in a neural network!

### Deeper Networks
In practice:
- Stack **many** such layers.
- Each layer learns increasingly abstract representations.
- Early layers: edges, textures (for images) or simple patterns (for text).
- Later layers: complex objects, concepts, semantic meaning.
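
Stacking can be sketched as a loop in which each layer's output becomes the next layer's input. The layer widths, seed, and variable names below are hypothetical, and ReLU is applied after every layer purely to keep the example short.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

rng = np.random.default_rng(1)  # arbitrary seed

# Hypothetical layer widths: 4 input features -> 8 -> 8 -> 2 outputs
widths = [4, 8, 8, 2]
layers = [
    (rng.normal(size=(n_in, n_out)) * 0.5, np.zeros(n_out))
    for n_in, n_out in zip(widths[:-1], widths[1:])
]

h = rng.normal(size=(3, 4))  # 3 samples, 4 features
shapes = [h.shape]
for W_k, b_k in layers:
    h = relu(h @ W_k + b_k)  # each layer feeds the next
    shapes.append(h.shape)

print(" -> ".join(str(s) for s in shapes))
```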

### Student Task: Build Your Own Layer

In [None]:
# TODO: Create input data with shape (5, 3): 5 samples, 3 features
X_student = np.array([    ])  # Add 5 rows of 3 numbers each

# TODO: Create a weight matrix to transform 3 → 4 features
W_student = np.random.randn(  ,  ) * 0.5  # Fill in the shape

# TODO: Create a bias vector with 4 elements
b_student = np.random.randn(  ) * 0.1  # Fill in the size

# Compute linear transformation
z_student = X_student @ W_student + b_student
describe("Linear output", z_student)

# TODO: Apply an activation function (choose: relu, sigmoid, or tanh)
output_student =   # Apply activation to z_student
describe("Activated output", output_student)

print(f"Input shape: {X_student.shape} → Output shape: {output_student.shape}")

---
## Part 6: Connecting to Real Neural Networks

### Summary: What We've Learned

1. **Broadcasting**: Efficiently applies operations across batches (adding bias, normalizing).
2. **Matrix multiplication**: Core transformation in neural networks.
3. **Weights**: Learnable parameters that encode knowledge.
4. **Activation functions**: Non-linearities that enable complex pattern learning.
5. **Neural layer**: `output = activation(X @ W + b)`.

### Real-World Neural Networks

In frameworks like PyTorch or TensorFlow:
```python
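# PyTorch example (conceptual)
import torch
from torch import nn
layer = nn.Linear(in_features=4, out_features=5)  # W and b initialized randomly
x = torch.randn(3, 4)  # placeholder input: 3 tokens, 4 features each
output = torch.relu(layer(x))  # Apply linear + ReLU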
```

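Behind the scenes, this computes the same `activation(X @ W + b)` we coded above. (PyTorch stores the weight with shape `(out_features, in_features)` and multiplies by its transpose.)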

### Why This Matters 

- **Embeddings**: Words, code snippets, network traffic → vectors.
- **Transformers**: Stack attention + feed-forward layers (matrix ops + activations).
- **Understanding internals**: Debug models, optimize performance, ensure reliability.

### Bonus Challenge: Two-Layer Network

In [None]:
# TODO: Build a 2-layer network
# Input: shape (2, 3), i.e. 2 samples, 3 features
# Layer 1: 3 → 5 features, ReLU activation
# Layer 2: 5 → 2 features, Sigmoid activation

# Define input
X_two_layer = np.array([
    [1, 0, -1],
    [0, 2, 1],
])

# Layer 1 weights and bias
W1 = np.random.randn(3, 5) * 0.5
b1 = np.random.randn(5) * 0.1

# Layer 2 weights and bias
W2 = np.random.randn(5, 2) * 0.5
b2 = np.random.randn(2) * 0.1

# Forward pass
# TODO: Compute layer 1 output (use ReLU)
h1 =   # hidden layer 1 output
describe("Layer 1 output (hidden)", h1)

# TODO: Compute layer 2 output (use Sigmoid)
h2 =   # final output
describe("Layer 2 output (final)", h2)

print("This is a 2-layer neural network!")
print("Notice how the output is between 0 and 1 (sigmoid effect).")

---
## Summary & Key Takeaways

### Broadcasting
- Automatically expands smaller arrays to match larger ones.
- Essential for adding biases and normalizing data.

### Matrix Multiplication
- Transforms data by computing weighted combinations.
- Shape rule: `(m, n) @ (n, p) → (m, p)`.

### Weights
- Learnable parameters that define transformations.
- Encode the "knowledge" learned during training.
- Adjusted via gradient descent to minimize loss.

### Activation Functions
- Introduce non-linearity, enabling complex patterns.
- ReLU: most common in hidden layers.
- Sigmoid/Softmax: used for probabilities in outputs.

### Neural Network Layers
- **Formula**: `output = activation(X @ W + b)`.
- Each layer extracts higher-level features.
- Stacking layers creates deep networks (deep learning).

### Next Steps
- Experiment with different weight initializations.
- Try different activation functions and observe outputs.
- Build deeper networks (3+ layers).
- Explore real frameworks: PyTorch, TensorFlow, JAX.

**You now understand the core mechanics of neural networks!** 🎉