# Week 1

## Lesson 1

**Explain the core concepts of the lesson**

## Core Concepts of Deep Learning

Deep learning is a subset of machine learning that uses neural networks with multiple layers to learn hierarchical representations of data. The key concepts in this specialization include:

**Neural Networks**: Computational models inspired by biological neurons that learn patterns from data through interconnected layers of nodes.

**Deep Neural Networks**: Neural networks with multiple hidden layers that can learn complex, non-linear relationships in data.

**Hyperparameter Tuning**: The process of selecting values for hyperparameters, the settings that control training rather than being learned from data, such as learning rate, batch size, and number of layers.

**Regularization**: Techniques used to prevent overfitting by constraining model complexity, including L1/L2 regularization and dropout.

**Optimization Algorithms**: Methods for updating network weights during training, such as gradient descent and its variants.

**Machine Learning Project Strategy**: A systematic approach to structuring ML projects, including problem definition, data preparation, model selection, and evaluation.

**Data Splitting**: Dividing data into training, validation, and test sets to properly evaluate model performance.

**End-to-end Deep Learning**: Training a single network to map raw inputs directly to the final output, replacing a pipeline of separately designed intermediate stages.

**Convolutional Neural Networks (CNNs)**: Specialized neural networks designed for processing grid-like data, particularly images, using convolutional layers.

**Sequence Models**: Neural architectures designed to process sequential data, such as time series or text.

**Natural Language Processing (NLP)**: The application of deep learning to understand and generate human language.
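
To make the data-splitting concept above concrete, here is a minimal sketch of a shuffled train/validation/test split. The function name `split_dataset`, the 60/20/20 proportions, the seed, and the synthetic arrays are all assumptions chosen for illustration; they do not come from the lesson.

```python
import numpy as np

def split_dataset(X, y, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle the examples once, then cut them into train/validation/test parts.

    The fractions and the seed are illustrative assumptions.
    """
    m = X.shape[0]
    rng = np.random.default_rng(seed)
    order = rng.permutation(m)  # random order of example indices

    n_test = int(m * test_fraction)
    n_val = int(m * val_fraction)

    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]

    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])

# Synthetic data: 100 examples with 5 features each (hypothetical)
X_demo = np.arange(500, dtype=float).reshape(100, 5)
y_demo = np.arange(100)

(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_dataset(X_demo, y_demo)
print(train_X.shape, val_X.shape, test_X.shape)
```

Shuffling before cutting keeps the three sets drawn from the same distribution. The validation set is the one used for tuning, and the test set is held back for the final evaluation.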

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**AI as the New Electricity**: Deep learning and artificial intelligence are poised to transform industries and society in ways comparable to the electrification revolution. Just as electricity became a foundational technology enabling countless innovations, AI will reshape how we work, create, and solve problems across every sector.

**Deep Learning as a Rapidly Rising Field**: Deep learning is one of the fastest-growing areas in machine learning and artificial intelligence. The combination of increased computational power, larger datasets, and algorithmic innovations has made deep learning increasingly practical and powerful for solving real-world problems.

**Deep Learning Tools Enable New Products and Businesses**: The availability of accessible deep learning frameworks and pre-trained models has democratized AI development. These tools empower developers and entrepreneurs to create novel products and business solutions that were previously impossible, from computer vision applications to natural language understanding systems.

**Implement code primitive: Build a neural network**

In [None]:
import numpy as np

class NeuralNetwork:
    def __init__(self, layer_sizes):
        self.layer_sizes = layer_sizes
        self.weights = []
        self.biases = []
        
        for i in range(len(layer_sizes) - 1):
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * 0.01
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(w)
            self.biases.append(b)
    
    def forward(self, X):
        self.activations = [X]
        A = X
        
        for i in range(len(self.weights)):
            Z = np.dot(A, self.weights[i]) + self.biases[i]
            A = np.maximum(0, Z) if i < len(self.weights) - 1 else 1 / (1 + np.exp(-Z))
            self.activations.append(A)
        
        return A
    
    def backward(self, y, learning_rate):
        m = y.shape[0]
        dA = self.activations[-1] - y
        
        for i in range(len(self.weights) - 1, -1, -1):
            dW = np.dot(self.activations[i].T, dA) / m
            db = np.sum(dA, axis=0, keepdims=True) / m
            
            if i > 0:
                dA = np.dot(dA, self.weights[i].T)
                dA[self.activations[i] <= 0] = 0
            
            self.weights[i] -= learning_rate * dW
            self.biases[i] -= learning_rate * db

nn = NeuralNetwork([784, 128, 64, 10])
X_sample = np.random.randn(32, 784)
output = nn.forward(X_sample)
print(f"Output shape: {output.shape}")

**Implement code primitive: Train a neural network on data**

In [None]:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = digits.data
y = np.eye(10)[digits.target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

nn = NeuralNetwork([64, 32, 16, 10])

epochs = 100
learning_rate = 0.01

for epoch in range(epochs):
    output = nn.forward(X_train)
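    # Binary cross-entropy summed over the 10 sigmoid outputs, matching the (output - y) gradient in backward()
    loss = -np.mean(np.sum(y_train * np.log(output + 1e-8) + (1 - y_train) * np.log(1 - output + 1e-8), axis=1))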
    nn.backward(y_train, learning_rate)
    
    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch + 1}/{epochs}, Loss: {loss:.4f}")

test_output = nn.forward(X_test)
test_predictions = np.argmax(test_output, axis=1)
test_accuracy = np.mean(test_predictions == np.argmax(y_test, axis=1))
print(f"Test Accuracy: {test_accuracy:.4f}")

**Implement code primitive: Build a deep neural network to recognize cats**

In [None]:
import numpy as np

class DeepNeuralNetwork:
    def __init__(self, layer_sizes):
        self.layer_sizes = layer_sizes
        self.weights = []
        self.biases = []
        self.layer_count = len(layer_sizes) - 1
        
        for i in range(self.layer_count):
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(w)
            self.biases.append(b)
    
    def relu(self, Z):
        return np.maximum(0, Z)
    
    def relu_derivative(self, Z):
        return (Z > 0).astype(float)
    
    def sigmoid(self, Z):
        return 1 / (1 + np.exp(-np.clip(Z, -500, 500)))
    
    def forward(self, X):
        self.Z_values = []
        self.A_values = [X]
        A = X
        
        for i in range(self.layer_count):
            Z = np.dot(A, self.weights[i]) + self.biases[i]
            self.Z_values.append(Z)
            
            if i < self.layer_count - 1:
                A = self.relu(Z)
            else:
                A = self.sigmoid(Z)
            
            self.A_values.append(A)
        
        return A
    
    def backward(self, y, learning_rate):
        m = y.shape[0]
        dA = self.A_values[-1] - y
        
        for i in range(self.layer_count - 1, -1, -1):
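            if i == self.layer_count - 1:
                # Output layer: dA = A - y is the squared-error gradient, so apply sigmoid'(Z) = A * (1 - A)
                A_out = self.A_values[-1]
                dZ = dA * A_out * (1 - A_out)
            else:
                # Hidden layers: ReLU'(Z) is 1 where Z > 0 and 0 elsewhere
                dZ = dA * self.relu_derivative(self.Z_values[i])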
            dW = np.dot(self.A_values[i].T, dZ) / m
            db = np.sum(dZ, axis=0, keepdims=True) / m
            
            if i > 0:
                dA = np.dot(dZ, self.weights[i].T)
            
            self.weights[i] -= learning_rate * dW
            self.biases[i] -= learning_rate * db

cat_detector = DeepNeuralNetwork([12288, 256, 128, 64, 32, 1])
X_sample = np.random.randn(10, 12288)
y_sample = np.random.randint(0, 2, (10, 1))
predictions = cat_detector.forward(X_sample)
print(f"Predictions shape: {predictions.shape}")
print(f"Sample predictions: {predictions[:5].flatten()}")
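
The input size of 12288 used above is consistent with a 64 x 64 RGB image flattened into a single vector, since 64 * 64 * 3 = 12288. The image dimensions are an assumption here, as the cell does not state them. Under that assumption, preparing a batch of images for the detector might look like this sketch, which uses random pixel values in place of real photos:

```python
import numpy as np

# Assumption: each image is 64 x 64 pixels with 3 color channels, so that
# 64 * 64 * 3 = 12288 matches the input size used by the detector above.
height, width, channels = 64, 64, 3

# Hypothetical batch of 10 images with integer pixel values in [0, 255]
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(10, height, width, channels))

# Flatten each image into one row and scale pixel values to [0, 1]
X_flat = images.reshape(images.shape[0], -1) / 255.0

print(f"Flattened shape: {X_flat.shape}")
```

Scaling to [0, 1] keeps the inputs in a range where the initial weighted sums stay moderate, which is a common preprocessing choice rather than a requirement.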

**Implement code primitive: Implement convolutional neural networks**

In [None]:
import numpy as np

class ConvolutionalLayer:
    def __init__(self, num_filters, filter_size, stride=1, padding=0):
        self.num_filters = num_filters
        self.filter_size = filter_size
        self.stride = stride
        self.padding = padding
        self.filters = np.random.randn(num_filters, filter_size, filter_size) * 0.01
        self.biases = np.zeros((num_filters, 1))
    
    def convolve(self, X, filter_kernel):
        # X is one example of shape (height, width, channels); pad only the spatial axes
        p = self.padding
        X_padded = np.pad(X, ((p, p), (p, p), (0, 0)), mode='constant')
        output_size = (X_padded.shape[0] - self.filter_size) // self.stride + 1
        output = np.zeros((output_size, output_size))
        
        for i in range(0, X_padded.shape[0] - self.filter_size + 1, self.stride):
            for j in range(0, X_padded.shape[1] - self.filter_size + 1, self.stride):
                patch = X_padded[i:i+self.filter_size, j:j+self.filter_size, :]
                # The 2D kernel is shared across channels and the products are summed over them
                output[i//self.stride, j//self.stride] = np.sum(patch * filter_kernel[:, :, np.newaxis])
        
        return output
    
    def forward(self, X):
        self.X = X
        batch_size = X.shape[0]
        output_size = (X.shape[1] - self.filter_size + 2*self.padding) // self.stride + 1
        output = np.zeros((batch_size, output_size, output_size, self.num_filters))
        
        for b in range(batch_size):
            for f in range(self.num_filters):
                output[b, :, :, f] = self.convolve(X[b], self.filters[f]) + self.biases[f]
        
        return output

class MaxPoolingLayer:
    def __init__(self, pool_size=2, stride=2):
        self.pool_size = pool_size
        self.stride = stride
    
    def forward(self, X):
        batch_size, height, width, channels = X.shape
        output_height = (height - self.pool_size) // self.stride + 1
        output_width = (width - self.pool_size) // self.stride + 1
        output = np.zeros((batch_size, output_height, output_width, channels))
        
        for b in range(batch_size):
            for c in range(channels):
                for i in range(output_height):
                    for j in range(output_width):
                        h_start = i * self.stride
                        w_start = j * self.stride
                        patch = X[b, h_start:h_start+self.pool_size, w_start:w_start+self.pool_size, c]
                        output[b, i, j, c] = np.max(patch)
        
        return output

conv_layer = ConvolutionalLayer(num_filters=32, filter_size=3, stride=1, padding=1)
pool_layer = MaxPoolingLayer(pool_size=2, stride=2)

X_input = np.random.randn(4, 28, 28, 1)
conv_output = conv_layer.forward(X_input)
pool_output = pool_layer.forward(conv_output)

print(f"Input shape: {X_input.shape}")
print(f"Conv output shape: {conv_output.shape}")
print(f"Pool output shape: {pool_output.shape}")
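
Both layers above compute their spatial output size with the same arithmetic, floor((n + 2p - f) / s) + 1, where n is the input size, f the filter or pool size, p the padding, and s the stride. The helper below is a small sketch of that formula; the name `conv_output_size` is hypothetical and not part of the classes above.

```python
def conv_output_size(n, f, padding=0, stride=1):
    """Spatial output size of a convolution or pooling step: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * padding - f) // stride + 1

# 28x28 input, 3x3 filter, padding 1, stride 1 (the conv layer settings above)
after_conv = conv_output_size(28, 3, padding=1, stride=1)

# 2x2 max pooling with stride 2 applied to the conv output
after_pool = conv_output_size(after_conv, 2, padding=0, stride=2)

print(after_conv, after_pool)
```

With padding 1 and a 3x3 filter the convolution preserves the 28x28 size, and the 2x2 pooling with stride 2 halves it to 14x14.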

**Apply sequence models to natural language processing problems**

In [None]:
import numpy as np

class SimpleRNN:
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        
        self.Wxh = np.random.randn(input_size, hidden_size) * 0.01
        self.Whh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.Why = np.random.randn(hidden_size, output_size) * 0.01
        self.bh = np.zeros((1, hidden_size))
        self.by = np.zeros((1, output_size))
    
    def forward(self, X):
        batch_size, seq_length, _ = X.shape
        h = np.zeros((batch_size, self.hidden_size))
        outputs = []
        
        for t in range(seq_length):
            h = np.tanh(np.dot(X[:, t, :], self.Wxh) + np.dot(h, self.Whh) + self.bh)
            y = np.dot(h, self.Why) + self.by
            outputs.append(y)
        
        return np.array(outputs), h
    
    def predict(self, X):
        outputs, _ = self.forward(X)
        return np.argmax(outputs[-1], axis=1)

class LSTMCell:
    def __init__(self, input_size, hidden_size):
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        self.Wf = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        self.Wi = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        self.Wo = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        self.Wc = np.random.randn(input_size + hidden_size, hidden_size) * 0.01
        
        self.bf = np.zeros((1, hidden_size))
        self.bi = np.zeros((1, hidden_size))
        self.bo = np.zeros((1, hidden_size))
        self.bc = np.zeros((1, hidden_size))
    
    def forward(self, X, h_prev, c_prev):
        combined = np.concatenate([X, h_prev], axis=1)
        
        f = 1 / (1 + np.exp(-np.dot(combined, self.Wf) - self.bf))
        i = 1 / (1 + np.exp(-np.dot(combined, self.Wi) - self.bi))
        o = 1 / (1 + np.exp(-np.dot(combined, self.Wo) - self.bo))
        c_tilde = np.tanh(np.dot(combined, self.Wc) + self.bc)
        
        c = f * c_prev + i * c_tilde
        h = o * np.tanh(c)
        
        return h, c

rnn = SimpleRNN(input_size=100, hidden_size=50, output_size=10)
X_seq = np.random.randn(8, 20, 100)
outputs, final_h = rnn.forward(X_seq)
print(f"RNN outputs shape: {outputs.shape}")
print(f"Final hidden state shape: {final_h.shape}")

lstm = LSTMCell(input_size=100, hidden_size=50)
X_t = np.random.randn(8, 100)
h_t, c_t = lstm.forward(X_t, np.zeros((8, 50)), np.zeros((8, 50)))
print(f"LSTM hidden state shape: {h_t.shape}")
print(f"LSTM cell state shape: {c_t.shape}")
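
The `SimpleRNN` above returns raw scores at each time step, and `predict` takes the argmax of the last step. When class probabilities are needed instead, one common option is to pass the scores through a softmax. The original cell does not do this, so the following is an illustrative sketch with hypothetical scores:

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

# Hypothetical scores for a batch of 2 sequences over 3 classes
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)

print(probs.sum(axis=1))         # each row sums to 1
print(np.argmax(probs, axis=1))  # predicted class per row
```

Softmax preserves the ordering of the scores, so the argmax prediction is unchanged; it only rescales the scores into a probability distribution.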


## Deep Learning Specialization Curriculum

```mermaid
graph TD
    A[Course 1: Neural Network Foundations] --> B[Course 2: Practical Deep Learning]
    B --> C[Course 3: ML Project Structuring]
    C --> D[Course 4: Convolutional Neural Networks]
    D --> E[Course 5: Sequence Models]
```

This specialization provides a comprehensive pathway through deep learning, starting with foundational neural network concepts and progressing through practical applications, project management, specialized architectures, and advanced sequence modeling techniques.

## Lesson 2

**Explain the core concepts of the lesson**

## Core Concepts

A **neural network** is a computational model inspired by how biological neurons work. At its core, a neural network learns to map inputs (features) to outputs (predictions) through layers of interconnected units.

Key concepts in this lesson:

- **Single Neuron**: The simplest building block of a neural network. It takes one or more inputs and produces an output through a mathematical function.

- **ReLU Function (Rectified Linear Unit)**: A common activation function defined as $\text{ReLU}(z) = \max(0, z)$. It outputs the input if positive, and zero otherwise. This introduces non-linearity and helps neural networks learn complex patterns.

- **Input Features**: The raw data fed into a neural network (e.g., house size, number of bedrooms, zip code, wealth).

- **Hidden Units**: Intermediate neurons in a neural network that learn abstract representations of the input features. These units automatically discover useful intermediate concepts like "family size" or "walkability."

- **Densely Connected Layers**: Layers where every neuron in one layer connects to every neuron in the next layer. This allows information to flow and be transformed through the network.

- **Supervised Learning**: The process of training a neural network using labeled data (inputs paired with correct outputs) to learn the mapping from inputs to outputs.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Simple Functions as Neural Networks**: A simple function that maps house size to price can be thought of as the most basic neural network—a single neuron with a ReLU activation. This shows that neural networks are not mysterious; they are just functions that learn from data.

**Building Complexity with Simplicity**: Larger neural networks are constructed by combining many simple neurons, much like stacking Lego bricks. Each neuron performs a simple operation, but when combined in layers, they create powerful models capable of learning complex relationships.

**Automatic Feature Learning**: One of the most powerful aspects of neural networks is their ability to automatically learn intermediate features from raw inputs. Rather than manually engineering features, a neural network discovers useful representations. For example, from basic house attributes (size, bedrooms, zip code, wealth), a neural network can learn abstract concepts like "family size," "walkability," or "school quality" in its hidden layers.

**Learning with Data**: With sufficient training data, neural networks excel at discovering accurate functions to map inputs (X) to desired outputs (Y). The network learns by adjusting its internal parameters (weights) to minimize prediction errors.
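
As a rough illustration of "adjusting parameters to minimize prediction errors", the sketch below fits a single linear neuron to synthetic house data with plain gradient descent. The data, the learning rate, and the number of steps are assumptions picked so the example converges; they are not taken from the lesson.

```python
import numpy as np

# Synthetic, noise-free data (an assumption for illustration):
# price in $1000s = 0.2 * size in sq ft + 50
sizes = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
prices = 0.2 * sizes + 50.0

# Standardize the input so that plain gradient descent is well conditioned
x = (sizes - sizes.mean()) / sizes.std()

w, b = 0.0, 0.0  # parameters of a single linear neuron
learning_rate = 0.1

for step in range(500):
    predictions = w * x + b
    error = predictions - prices
    # Gradients of the loss 0.5 * mean(error**2) with respect to w and b
    grad_w = np.mean(error * x)
    grad_b = np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = 0.5 * np.mean((w * x + b - prices) ** 2)
print(f"w={w:.3f}, b={b:.3f}, loss={final_loss:.6f}")
```

Each step moves `w` and `b` a little in the direction that reduces the average squared error. Larger networks apply the same idea to many more parameters.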

**Present and explain the key equations used in the lesson**

## Key Equations

**ReLU Activation Function**:

$$\text{ReLU}(z) = \max(0, z)$$

This function is fundamental to modern neural networks. It takes any input $z$ and outputs the maximum of zero and $z$. This means:
- If $z > 0$, the output is $z$
- If $z \leq 0$, the output is $0$

The ReLU function introduces non-linearity, allowing neural networks to learn complex, non-linear relationships between inputs and outputs. Its gradient is exactly 1 for every positive input, so it does not shrink toward zero the way the sigmoid's gradient does for large $|z|$, which tends to make gradient-based training faster.
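
The definition above translates directly into vectorized NumPy code. This is a minimal sketch; the helper names `relu` and `relu_grad` are illustrative, and setting the derivative at exactly $z = 0$ to zero is a convention, not something the lesson specifies.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0, z)

def relu_grad(z):
    """Derivative of ReLU: 1 where z > 0, else 0 (the value at z = 0 is set to 0 by convention)."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))
print(relu_grad(z))
```

Negative inputs are clipped to zero and pass no gradient, while positive inputs pass through unchanged with a gradient of 1.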

**Implement code primitive: Implement a function that predicts house price based on size, incorporating a floor at zero (similar to a ReLU function).**

In [None]:
def predict_price_from_size(size, weight=200, bias=50000):
    """
    Predicts house price based on size using a ReLU-like function.
    
    Args:
        size: House size in square feet
        weight: Price per square foot
        bias: Base price
    
    Returns:
        Predicted price (floored at zero)
    """
    linear_output = weight * size + bias
    price = max(0, linear_output)  # ReLU: floor at zero
    return price

# Example usage
size_1 = 2000
price_1 = predict_price_from_size(size_1)
print(f"House size: {size_1} sq ft, Predicted price: ${price_1:,.0f}")

size_2 = 3000
price_2 = predict_price_from_size(size_2)
print(f"House size: {size_2} sq ft, Predicted price: ${price_2:,.0f}")

**Implement code primitive: Demonstrate a neural network taking multiple input features (size, number of bedrooms, zip code, wealth) to predict house price.**

In [None]:
import numpy as np

class SimpleHousingNeuralNetwork:
    def __init__(self):
        # Hidden layer weights and biases (4 inputs -> 3 hidden units)
        self.w_hidden = np.array([
            [0.5, 0.3, 0.2],      # weights from size
            [0.4, 0.6, 0.1],      # weights from bedrooms
            [0.2, 0.4, 0.7],      # weights from zip code
            [0.3, 0.2, 0.5]       # weights from wealth
        ])
        self.b_hidden = np.array([10000, 5000, 8000])
        
        # Output layer weights and biases (3 hidden units -> 1 output)
        self.w_output = np.array([[300], [250], [200]])
        self.b_output = np.array([50000])
    
    def relu(self, z):
        return np.maximum(0, z)
    
    def predict(self, size, bedrooms, zip_code, wealth):
        # Input features
        x = np.array([size, bedrooms, zip_code, wealth])
        
        # Hidden layer
        z_hidden = np.dot(x, self.w_hidden) + self.b_hidden
        a_hidden = self.relu(z_hidden)
        
        # Output layer
        z_output = np.dot(a_hidden, self.w_output) + self.b_output
        price = self.relu(z_output)[0]
        
        return price

# Example usage
network = SimpleHousingNeuralNetwork()

# Predict price for a house
size = 2500
bedrooms = 4
zip_code = 95000
wealth = 100000

predicted_price = network.predict(size, bedrooms, zip_code, wealth)
print(f"Input: Size={size} sq ft, Bedrooms={bedrooms}, Zip={zip_code}, Wealth=${wealth}")
print(f"Predicted Price: ${predicted_price:,.0f}")

**Create a Mermaid diagram: a single neuron mapping house size to price**

```mermaid
graph TD
    A[Size] --> B{Single Neuron/ReLU Function}
    B --> C[Price]
```

**Create a Mermaid diagram: a housing-price network with one densely connected hidden layer**

```mermaid
graph TD
    subgraph Input Layer
        A[Size]
        B[#Bedrooms]
        C[Zip Code]
        D[Wealth]
    end
    subgraph Hidden Layer
        E[Family Size]
        F[Walkability]
        G[School Quality]
    end
    subgraph Output Layer
        H[Price]
    end
    A --> E
    A --> F
    A --> G
    B --> E
    B --> F
    B --> G
    C --> E
    C --> F
    C --> G
    D --> E
    D --> F
    D --> G
    E --> H
    F --> H
    G --> H
```

## Lesson 3

**Explain the core concepts of the lesson**

## Core Concepts

**Supervised Learning** is the process of training neural networks to learn a mapping function from input data to desired output data. This is the primary application driving economic value from neural networks.

In supervised learning, we define:
- **Input (x)**: The data we provide to the neural network
- **Output (y)**: The desired result or target we want the network to predict

The neural network learns to approximate the function $f(x) \approx y$ by adjusting its internal parameters during training.

**Neural Network Architectures** are specialized designs optimized for different types of data:
- **Standard Neural Network**: Used for structured data (tabular, numerical features)
- **Convolutional Neural Networks (CNN)**: Specialized for image data
- **Recurrent Neural Networks (RNN)**: Designed for sequence data (audio, text, time series)
- **Hybrid Neural Networks**: Custom architectures combining multiple approaches for complex or mixed data types

The choice of architecture depends critically on the nature of the problem and the type of data being processed.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Supervised Learning as Function Approximation**: Think of supervised learning as teaching a neural network to mimic a hidden function. You provide examples of inputs and their corresponding outputs, and the network learns the pattern connecting them. This is fundamentally different from unsupervised learning, where no target outputs are provided.

**The Importance of Defining x and y**: Before applying supervised learning to any problem, you must carefully define what your inputs (x) and outputs (y) represent. Poor definitions lead to poor results, regardless of how powerful your neural network is. This is a critical first step in any machine learning project.

**Architecture Follows Data Type**: Neural networks are not one-size-fits-all. Different data types have different structures and properties:
- Structured data is already organized in a way standard networks can process
- Images have spatial structure that CNNs exploit through convolution operations
- Sequences have temporal or sequential dependencies that RNNs capture through recurrence
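
One way to see why the architecture follows the data type is to look at how each kind of data is typically laid out as an array. All sizes in this sketch are hypothetical and chosen only to show the layouts:

```python
import numpy as np

# All sizes below are hypothetical and chosen only to show typical array layouts.

# Structured data: one row per example, one column per feature
tabular = np.zeros((32, 4))          # (batch, features)

# Image data: a spatial grid plus color channels, which convolutions slide over
images = np.zeros((32, 64, 64, 3))   # (batch, height, width, channels)

# Sequence data: an ordered series of feature vectors, processed step by step
sequences = np.zeros((32, 20, 100))  # (batch, time steps, features per step)

for name, batch in [("tabular", tabular), ("images", images), ("sequences", sequences)]:
    print(f"{name:>9}: ndim={batch.ndim}, shape={batch.shape}")
```

A standard network consumes the flat feature axis directly, a CNN exploits the height and width axes, and an RNN iterates over the time axis.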

**Unlocking Unstructured Data**: Neural networks have revolutionized how computers process unstructured data (images, audio, text). This capability has opened entirely new application domains and is a major source of modern AI's economic impact.


## Supervised Learning Framework

```mermaid
graph TD
    A[Input X] --> B{Neural Network}
    B --> C[Output Y]
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#f9f,stroke:#333,stroke-width:2px
```

This diagram illustrates the fundamental supervised learning pipeline: the neural network takes input data (x) and produces a prediction that training pushes toward the target output (y).


## Matching Data Types to Neural Network Architectures

```mermaid
graph TD
    subgraph Data Type to NN Architecture
        A[Structured Data] --> SNN(Standard Neural Network)
        B[Image Data] --> CNN(Convolutional Neural Network)
        C["Sequence Data (Audio, Text)"] --> RNN(Recurrent Neural Network)
        D[Complex/Hybrid Data] --> HNN(Custom/Hybrid Neural Network)
    end
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#f9f,stroke:#333,stroke-width:2px
    style C fill:#f9f,stroke:#333,stroke-width:2px
    style D fill:#f9f,stroke:#333,stroke-width:2px
    style SNN fill:#bbf,stroke:#333,stroke-width:2px
    style CNN fill:#bbf,stroke:#333,stroke-width:2px
    style RNN fill:#bbf,stroke:#333,stroke-width:2px
    style HNN fill:#bbf,stroke:#333,stroke-width:2px
```

The most effective neural network architecture depends on the nature of your data. This diagram shows the primary mapping between data types and specialized architectures designed to handle them efficiently.

## Lesson 4

**Explain the core concepts of the lesson**

## Core Concepts

Deep learning's rise has been driven by three fundamental factors:

1. **Data Scale**: The availability of massive amounts of labeled data has enabled neural networks to learn complex patterns that traditional algorithms cannot capture.

2. **Computation Scale**: Advances in computational power (GPUs, TPUs) have made it feasible to train very large neural networks on large datasets.

3. **Neural Network Size**: Larger neural networks have the capacity to leverage vast amounts of data to continuously improve performance, unlike traditional algorithms which plateau.

These three factors work together synergistically. Traditional machine learning algorithms eventually reach a performance plateau regardless of how much additional data is provided. In contrast, large neural networks can continue to improve their performance as more data becomes available. This fundamental difference is a key driver of deep learning's success in modern applications.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**The Plateau Problem**: Traditional learning algorithms hit a ceiling. No matter how much data you feed them, their performance stops improving. It's like trying to fill a cup that's already full—adding more water doesn't help.

**The Scaling Advantage**: Large neural networks behave differently. Given enough capacity, their performance keeps improving over a much wider range of data sizes, so additional labeled data continues to pay off long after a traditional algorithm has flattened out.

**The Virtuous Cycle**: Deep learning's success comes from combining three elements:
- Abundant data (the fuel)
- Large neural networks (the engine)
- Computational power (the infrastructure)

When these three align, neural networks can learn representations that traditional algorithms simply cannot discover.

**Rapid Experimentation**: Faster computation enables quicker iteration. You can try an idea, run an experiment, analyze results, and refine your approach in hours instead of days. This speed of experimentation accelerates discovery and innovation.

**Small Changes, Big Impact**: Even seemingly minor algorithmic changes—like switching from sigmoid to ReLU activation functions—can dramatically improve both training speed and model performance. These innovations compound over time.
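
One commonly cited reason for the sigmoid-to-ReLU speedup is that the sigmoid's gradient shrinks toward zero as its input grows, while the ReLU's gradient stays at 1 for any positive input. The sketch below compares the two derivatives at a few hand-picked input values (the values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z)), at most 0.25
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # 1 for positive inputs, 0 otherwise
    return (z > 0).astype(float)

z = np.array([0.5, 2.0, 5.0, 10.0])
sig_g = sigmoid_grad(z)
relu_g = relu_grad(z)

for zi, sg, rg in zip(z, sig_g, relu_g):
    print(f"z={zi:5.1f}  sigmoid'={sg:.6f}  relu'={rg:.1f}")
```

Small gradients mean small parameter updates, so units that sit in the flat region of a sigmoid learn slowly under gradient descent.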

**Present and explain the key equations used in the lesson**

## Key Equations and Notation

**Training Set Size**

$$m = \text{number of training examples}$$

The variable $m$ represents the size of your training dataset. It is a key quantity in deep learning because the relationship between $m$ and model performance differs dramatically between traditional algorithms and neural networks:

- **Traditional algorithms**: Performance plateaus as $m$ increases beyond a certain point
- **Large neural networks**: Performance continues to improve as $m$ increases

This difference in how performance scales with $m$ is central to understanding why deep learning has become so dominant in modern machine learning.

**Implement code primitive: Implementing a specific neural network architecture for a given task.**

In [None]:
import numpy as np

class SimpleNeuralNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))
    
    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        self.a1 = np.maximum(0, self.z1)
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = 1 / (1 + np.exp(-self.z2))
        return self.a2
    
    def backward(self, X, y, learning_rate=0.01):
        m = X.shape[0]
        dz2 = self.a2 - y
        dW2 = np.dot(self.a1.T, dz2) / m
        db2 = np.sum(dz2, axis=0, keepdims=True) / m
        da1 = np.dot(dz2, self.W2.T)
        dz1 = da1 * (self.z1 > 0)
        dW1 = np.dot(X.T, dz1) / m
        db1 = np.sum(dz1, axis=0, keepdims=True) / m
        
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1

nn = SimpleNeuralNetwork(input_size=10, hidden_size=5, output_size=1)
X_sample = np.random.randn(100, 10)
output = nn.forward(X_sample)
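
The `backward` method above starts from `dz2 = self.a2 - y`. That expression is the gradient you get when a sigmoid output is paired with binary cross-entropy loss. The cell never names its loss, so that pairing is an assumption here. The sketch below checks the identity numerically with central finite differences on hypothetical values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(z, y):
    """Mean binary cross-entropy of sigmoid(z) against labels y."""
    a = sigmoid(z)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 1))                         # hypothetical pre-activations
y = rng.integers(0, 2, size=(6, 1)).astype(float)   # hypothetical binary labels
m = z.shape[0]

# Analytic gradient of the mean loss with respect to z
analytic = (sigmoid(z) - y) / m

# Central finite differences, one entry at a time
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(m):
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[k, 0] += eps
    z_minus[k, 0] -= eps
    numeric[k, 0] = (bce_loss(z_plus, y) - bce_loss(z_minus, y)) / (2 * eps)

max_diff = np.max(np.abs(analytic - numeric))
print(f"Largest difference: {max_diff:.2e}")
```

The division by `m` appears here in the analytic gradient, whereas the class above applies it later when forming `dW2` and `db2`; the two are equivalent.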

**Implement code primitive: Changing the activation function (e.g., from sigmoid to ReLU) within a neural network's code to improve training speed and performance.**

In [None]:
import numpy as np

class NeuralNetworkWithActivations:
    def __init__(self, input_size, hidden_size, output_size, activation='relu'):
        self.activation = activation
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))
    
    def relu(self, z):
        return np.maximum(0, z)
    
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def forward(self, X):
        self.z1 = np.dot(X, self.W1) + self.b1
        
        if self.activation == 'relu':
            self.a1 = self.relu(self.z1)
        elif self.activation == 'sigmoid':
            self.a1 = self.sigmoid(self.z1)
        else:
            raise ValueError(f"Unsupported activation: {self.activation!r}")
        
        self.z2 = np.dot(self.a1, self.W2) + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

nn_relu = NeuralNetworkWithActivations(input_size=10, hidden_size=5, output_size=1, activation='relu')
nn_sigmoid = NeuralNetworkWithActivations(input_size=10, hidden_size=5, output_size=1, activation='sigmoid')

X_sample = np.random.randn(100, 10)
output_relu = nn_relu.forward(X_sample)
output_sigmoid = nn_sigmoid.forward(X_sample)
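
One way to see why the hidden activation can affect training speed is to compare derivatives, since back propagation multiplies by them at every layer. The sketch below is illustrative only; the helper names `sigmoid_grad` and `relu_grad` and the sample inputs are assumptions introduced here, not part of the class above.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the sigmoid, s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 0.5, 10.0])
sig_g = sigmoid_grad(z)
relu_g = relu_grad(z)

# The sigmoid derivative never exceeds 0.25 and is tiny for large |z|,
# while the ReLU derivative stays at 1 for every positive input.
print("sigmoid'(z):", np.round(sig_g, 5))
print("relu'(z):   ", relu_g)
```

Small derivatives shrink the error signal as it passes backward through a layer, which is one reason a ReLU hidden layer often trains faster than a sigmoid one.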

## Performance Scaling with Data

The following diagram illustrates how performance scales differently for traditional algorithms versus neural networks as the amount of data increases:

```mermaid
graph TD
    A[Amount of Data] --> B{Traditional Algorithm}
    B --> C{Performance Plateaus}
    A --> D{Small Neural Network}
    D --> E{"Performance Improves (Limited)"}
    A --> F{Large Neural Network}
    F --> G{Performance Keeps Improving}
```

This visualization captures a fundamental insight: the performance of traditional algorithms tends to plateau as data grows, while large neural networks continue to benefit from additional data. This difference is a primary driver of deep learning's rise.

## The Iterative Experimentation Cycle

Modern deep learning development follows a rapid iterative cycle. Faster computational power enables researchers and practitioners to cycle through this loop quickly, accelerating innovation:

```mermaid
graph TD
    A[Have Idea] --> B["Implement Idea (Code)"]
    B --> C[Run Experiment]
    C --> D[Analyze Results]
    D --> E{Good Performance?}
    E -->|No| A
    E -->|Yes| F[Deploy/Conclude]
```

This cycle of ideation, implementation, experimentation, and analysis is fundamental to deep learning development. The speed at which you can complete each iteration directly impacts how quickly you can discover effective solutions. Advances in computation have dramatically reduced the time per iteration, enabling more experiments and faster progress.

## Lesson 5

**Explain the core concepts of the lesson**

## Core Concepts

This lesson introduces the foundational building blocks of deep learning. The key concepts you will learn are:

**Deep Neural Networks**: Multi-layered networks that learn hierarchical representations of data through stacked layers of neurons.

**Neural Network Programming**: The practical implementation of neural network architectures using code, enabling you to build and deploy models.

**Forward Propagation**: The process of computing predictions by passing input data through the network layers sequentially, from input to output.

**Back Propagation**: The algorithm for computing gradients of the loss function with respect to network parameters, enabling efficient learning through gradient descent.

**Single Hidden Layer Networks**: Neural networks with one intermediate layer between input and output, serving as a foundation for understanding deeper architectures.

**Many Layer Networks**: Deep neural networks with multiple hidden layers that can learn more complex patterns and representations.

These concepts form the essential foundation for understanding and implementing deep learning models.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Building Blocks of Deep Learning**: This course teaches the foundational building blocks of deep learning. Think of neural networks as modular components that you can combine and extend to solve increasingly complex problems.

**Learning Through Implementation**: The primary goal of the first course is to learn to build and deploy a deep neural network. Rather than just understanding theory, you will gain practical experience by constructing networks from scratch.

**Hands-On Programming Solidifies Understanding**: Hands-on programming exercises help solidify understanding of algorithms by implementing them. When you code forward propagation and back propagation yourself, you develop intuition for how these algorithms work and why they matter.

**From Simple to Complex**: Start with single hidden layer networks to understand the basics, then extend to many-layer networks as you build confidence and understanding.

**The Learning Loop**: Forward propagation computes predictions, back propagation computes how to improve those predictions, and this cycle repeats to train the network.
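
The learning loop can be sketched on the smallest possible model, a single sigmoid unit. This is a minimal illustration under stated assumptions: the toy data, the learning rate, and all variable names are hypothetical and chosen only to make the cycle visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 examples, 2 features, label depends on x1 + x2
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

w = np.zeros((2, 1))
b = 0.0
learning_rate = 0.5
losses = []

for step in range(200):
    # Forward propagation: compute predictions and the loss
    a = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    loss = -np.mean(y * np.log(a + 1e-12) + (1 - y) * np.log(1 - a + 1e-12))
    losses.append(loss)

    # Back propagation: gradients of the loss with respect to w and b
    dz = (a - y) / X.shape[0]
    dw = X.T @ dz
    db = np.sum(dz)

    # Parameter update, then the cycle repeats
    w -= learning_rate * dw
    b -= learning_rate * db

print(f"loss at step 0: {losses[0]:.3f}, at step 199: {losses[-1]:.3f}")
```

Deeper networks follow the same three steps; only the forward and backward computations grow longer.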

**Implement code primitive: Implement neural network algorithms efficiently.**

In [None]:
import numpy as np

def sigmoid(z):
    """Sigmoid activation function"""
    return 1 / (1 + np.exp(-z))

def relu(z):
    """ReLU activation function"""
    return np.maximum(0, z)

def sigmoid_derivative(z):
    """Derivative of sigmoid function"""
    s = sigmoid(z)
    return s * (1 - s)

def relu_derivative(z):
    """Derivative of ReLU function"""
    return (z > 0).astype(float)

print("Neural network activation functions implemented efficiently.")

**Implement code primitive: Code a single hidden layer neural network.**

In [None]:
class SingleHiddenLayerNetwork:
    def __init__(self, input_size, hidden_size, output_size):
        """Initialize network parameters"""
        self.W1 = np.random.randn(hidden_size, input_size) * 0.01
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(output_size, hidden_size) * 0.01
        self.b2 = np.zeros((output_size, 1))
    
    def forward(self, X):
        """Forward propagation through single hidden layer"""
        self.Z1 = np.dot(self.W1, X) + self.b1
        self.A1 = relu(self.Z1)
        self.Z2 = np.dot(self.W2, self.A1) + self.b2
        self.A2 = sigmoid(self.Z2)
        return self.A2

print("Single hidden layer neural network implemented.")

**Implement code primitive: Build a deep neural network with many layers.**

In [None]:
class DeepNeuralNetwork:
    def __init__(self, layer_dims):
        """Initialize deep network with multiple layers
        layer_dims: list of layer dimensions [input_size, hidden1, hidden2, ..., output_size]
        """
        self.layer_dims = layer_dims
        self.parameters = {}
        self.L = len(layer_dims)
        
        for l in range(1, self.L):
            self.parameters[f'W{l}'] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
            self.parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))
    
    def forward(self, X):
        """Forward propagation through all layers"""
        self.cache = {}
        A = X
        
        for l in range(1, self.L):
            Z = np.dot(self.parameters[f'W{l}'], A) + self.parameters[f'b{l}']
            A = relu(Z) if l < self.L - 1 else sigmoid(Z)
            self.cache[f'Z{l}'] = Z
            self.cache[f'A{l}'] = A
        
        return A

print("Deep neural network with many layers implemented.")

**Implement code primitive: Implement forward propagation steps.**

In [None]:
def forward_propagation(X, parameters, layer_dims):
    """Complete forward propagation through the network"""
    cache = {}
    A = X
    cache['A0'] = X
    
    L = len(layer_dims)
    
    for l in range(1, L):
        W = parameters[f'W{l}']
        b = parameters[f'b{l}']
        
        Z = np.dot(W, A) + b
        
        if l < L - 1:
            A = relu(Z)
        else:
            A = sigmoid(Z)
        
        cache[f'Z{l}'] = Z
        cache[f'A{l}'] = A
    
    return A, cache

print("Forward propagation steps implemented.")

**Implement code primitive: Implement back propagation steps.**

In [None]:
def backward_propagation(Y, cache, parameters, layer_dims, learning_rate=0.01):
    """Complete back propagation to compute gradients and update parameters"""
    m = Y.shape[1]
    L = len(layer_dims)
    gradients = {}
    
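    # With a sigmoid output and cross-entropy loss, dL/dZ at the output layer
    # simplifies to A - Y, which avoids dividing by activations equal to 0 or 1
    dZ_output = cache[f'A{L-1}'] - Y
    dA = None
    
    for l in range(L - 1, 0, -1):
        A_prev = cache[f'A{l-1}']
        Z = cache[f'Z{l}']
        
        dZ = dZ_output if l == L - 1 else dA * relu_derivative(Z)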
        
        dW = np.dot(dZ, A_prev.T) / m
        db = np.sum(dZ, axis=1, keepdims=True) / m
        
        gradients[f'dW{l}'] = dW
        gradients[f'db{l}'] = db
        
        if l > 1:
            dA = np.dot(parameters[f'W{l}'].T, dZ)
        
        parameters[f'W{l}'] -= learning_rate * dW
        parameters[f'b{l}'] -= learning_rate * db
    
    return parameters, gradients

print("Back propagation steps implemented.")
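
A common way to gain confidence in a back propagation implementation is to compare its gradients with finite differences. The sketch below does this for a small two-layer network in the same columns-are-examples layout; it is self-contained, and the function names, layer sizes, and random data are assumptions made for illustration.

```python
import numpy as np

def forward_loss(params, X, Y):
    """Cross-entropy loss of a 2-layer ReLU/sigmoid network (columns are examples)."""
    Z1 = params["W1"] @ X + params["b1"]
    A1 = np.maximum(0, Z1)
    Z2 = params["W2"] @ A1 + params["b2"]
    A2 = 1.0 / (1.0 + np.exp(-Z2))
    loss = -np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2))
    return loss, (Z1, A1, A2)

def analytic_grads(params, X, Y):
    """Gradients computed by back propagation."""
    m = X.shape[1]
    _, (Z1, A1, A2) = forward_loss(params, X, Y)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (params["W2"].T @ dZ2) * (Z1 > 0)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def numeric_grad(params, name, X, Y, eps=1e-6):
    """Central finite differences for one parameter array."""
    grad = np.zeros_like(params[name])
    for idx in np.ndindex(*params[name].shape):
        original = params[name][idx]
        params[name][idx] = original + eps
        loss_plus, _ = forward_loss(params, X, Y)
        params[name][idx] = original - eps
        loss_minus, _ = forward_loss(params, X, Y)
        params[name][idx] = original
        grad[idx] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
Y = rng.integers(0, 2, size=(1, 5)).astype(float)
params = {
    "W1": rng.normal(size=(4, 3)), "b1": rng.normal(size=(4, 1)),
    "W2": rng.normal(size=(1, 4)), "b2": rng.normal(size=(1, 1)),
}

grads = analytic_grads(params, X, Y)
max_diff = max(
    np.max(np.abs(grads[name] - numeric_grad(params, name, X, Y)))
    for name in params
)
print(f"largest analytic vs numeric difference: {max_diff:.2e}")
```

A difference many orders of magnitude below the gradient values suggests the analytic gradients are consistent with the loss; a large difference usually points to a sign, transpose, or scaling error.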

## Lesson 6

**Explain the core concepts of the lesson**

## Core Concepts

Geoffrey Hinton's research has fundamentally shaped modern deep learning through several key innovations:

**Backpropagation Algorithm**: The foundational technique for training neural networks by computing gradients through the chain rule, enabling efficient learning in multi-layer networks.

**Word Embeddings**: Semantic feature vectors that represent words in a continuous vector space, capturing relationships and meanings from text data.

**Boltzmann Machines**: Probabilistic models with densely connected networks that learn hidden representations through energy-based learning.

**Restricted Boltzmann Machines (RBMs)**: Simplified Boltzmann machines with a bipartite structure (visible and hidden units) that learn single layers of features efficiently.

**Deep Belief Networks (DBNs)**: Generative models constructed by stacking trained RBMs layer-wise, enabling efficient approximate inference in deep architectures.

**Variational Bayes**: A method for approximating true posteriors in complex probabilistic models through variational inference.

**Rectified Linear Units (ReLUs)**: Activation functions defined as $\text{ReLU}(x) = \max(0, x)$ that improve training of deep networks.

**Recirculation Algorithm**: An autoencoder learning approach that minimizes activity variation without explicit backpropagation.

**Fast Weights**: Neural network mechanisms enabling rapid adaptation and short-term memory for recursive processing.

**Capsule Networks**: Neural architectures where groups of neurons represent multi-dimensional feature entities with routing mechanisms for improved generalization.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Distributed Representations**: Memories are distributed across the brain rather than stored in one place, similar to how holograms store information. This principle underlies modern neural networks, where knowledge is encoded across many neurons rather than localized in single units.

**Computational Power Drives Progress**: The power of deep learning has been significantly driven by the increasing speed of computers, especially GPUs. Algorithms that were theoretically sound but computationally infeasible became practical with hardware advances.

**Biological Plausibility**: If an algorithm like backpropagation is an effective way to learn, strong selective pressure makes it plausible that evolution found a way for the brain to implement something similar. On this view, learning algorithms that work well in artificial networks are reasonable candidates for what biological learning does.

**Features as Concepts**: A concept can be understood as a collection of features, unifying traditional psychological and AI views. Rather than discrete symbolic representations, concepts emerge from combinations of learned features.

**State Vectors Over Symbols**: Transforming raw measurements into a 'state vector' where actions become linear simplifies modeling and manipulation. This continuous representation enables efficient computation and learning.

**Thoughts as Neural Activity**: Thoughts are best represented as large vectors of neural activity with causal powers, not as symbolic expressions. This perspective emphasizes the importance of learned representations over hand-crafted symbols.

**Intuition-Driven Research**: Trusting one's intuition about a research idea, even when others disagree, can be key to groundbreaking discoveries. Many of Hinton's innovations were pursued despite initial skepticism from the community.

**Present and explain the key equations used in the lesson**

## Key Equations

**Rectified Linear Unit (ReLU)**:

$$\text{ReLU}(x) = \max(0, x)$$

This activation function returns its input when the input is positive and zero otherwise. ReLUs are a common default activation for the hidden layers of modern deep neural networks because they:
- Enable efficient gradient flow during backpropagation
- Reduce the vanishing gradient problem in deep networks
- Provide sparse representations where many neurons are inactive
- Simplify computation compared to sigmoid or tanh activations
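
The gradient-related points above follow from the derivative. For comparison, the sigmoid is written here as $\sigma$, a notational assumption not used elsewhere in the lesson:

$$\frac{d}{dx}\text{ReLU}(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases} \qquad\qquad \frac{d}{dx}\sigma(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) \le \frac{1}{4}$$

Backpropagation multiplies one such factor per layer. A product of many factors of at most $1/4$ shrinks quickly with depth, whereas every active ReLU unit contributes a factor of exactly 1.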

**Implement code primitive: Implement backpropagation for discriminative learning tasks, such as predicting words in a sequence.**

In [None]:
import numpy as np

class SequencePredictor:
    def __init__(self, vocab_size, hidden_size, learning_rate=0.01):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate
        
        self.W_input = np.random.randn(vocab_size, hidden_size) * 0.01
        self.W_hidden = np.random.randn(hidden_size, hidden_size) * 0.01
        self.W_output = np.random.randn(hidden_size, vocab_size) * 0.01
        self.b_hidden = np.zeros((1, hidden_size))
        self.b_output = np.zeros((1, vocab_size))
    
    def forward(self, x, h_prev):
        h = np.tanh(np.dot(x, self.W_input) + np.dot(h_prev, self.W_hidden) + self.b_hidden)
        logits = np.dot(h, self.W_output) + self.b_output
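        # Subtract the row-wise maximum before exponentiating to avoid overflow
        exp_logits = np.exp(logits - np.max(logits, axis=1, keepdims=True))
        probs = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)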
        return h, probs
    
    def backward(self, x, h, h_prev, probs, target):
        batch_size = x.shape[0]
        
        d_logits = probs.copy()
        d_logits[np.arange(batch_size), target] -= 1
        d_logits /= batch_size
        
        dW_output = np.dot(h.T, d_logits)
        db_output = np.sum(d_logits, axis=0, keepdims=True)
        
        dh = np.dot(d_logits, self.W_output.T)
        dh_raw = dh * (1 - h ** 2)
        
        dW_input = np.dot(x.T, dh_raw)
        dW_hidden = np.dot(h_prev.T, dh_raw)
        db_hidden = np.sum(dh_raw, axis=0, keepdims=True)
        
        # Gradient for the previous hidden state, computed before W_hidden changes
        dh_prev = np.dot(dh_raw, self.W_hidden.T)
        
        self.W_output -= self.learning_rate * dW_output
        self.b_output -= self.learning_rate * db_output
        self.W_input -= self.learning_rate * dW_input
        self.W_hidden -= self.learning_rate * dW_hidden
        self.b_hidden -= self.learning_rate * db_hidden
        
        return dh_prev
    
    def train_step(self, x_batch, h_prev, target_batch):
        h, probs = self.forward(x_batch, h_prev)
        self.backward(x_batch, h, h_prev, probs, target_batch)
        return h
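
To train a next-word predictor, a token sequence first has to be turned into input and target pairs. The sketch below shows one plausible way to do that; the helper `make_next_token_pairs` and the example sequence are hypothetical and not part of the class above.

```python
import numpy as np

def make_next_token_pairs(token_ids, vocab_size):
    """Return one-hot inputs for tokens 0..n-2 and integer targets for tokens 1..n-1."""
    token_ids = np.asarray(token_ids)
    inputs = np.eye(vocab_size)[token_ids[:-1]]
    targets = token_ids[1:]
    return inputs, targets

# Hypothetical sequence over a 5-word vocabulary
inputs, targets = make_next_token_pairs([2, 0, 4, 1], vocab_size=5)
print(inputs.shape)   # (3, 5)
print(targets)        # [0 4 1]
```

Each row of `inputs` has the `(batch, vocab_size)` layout that a one-hot input `x` would need, and the matching entry of `targets` is the index of the word that follows it.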

**Implement code primitive: Train models to learn semantic feature vectors (embeddings) for words from text data.**

In [None]:
import numpy as np

class WordEmbedding:
    def __init__(self, vocab_size, embedding_dim, learning_rate=0.01):
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.learning_rate = learning_rate
        self.embeddings = np.random.randn(vocab_size, embedding_dim) * 0.01
    
    def get_embedding(self, word_id):
        return self.embeddings[word_id]
    
    def forward(self, context_ids, target_id):
        context_vectors = self.embeddings[context_ids]
        context_mean = np.mean(context_vectors, axis=0, keepdims=True)
        
        scores = np.dot(context_mean, self.embeddings.T)
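        # Shift by the maximum score so the exponentials cannot overflow
        exp_scores = np.exp(scores - np.max(scores))
        probs = exp_scores / np.sum(exp_scores)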
        
        return context_mean, probs
    
    def backward(self, context_ids, target_id, context_mean, probs):
        # Gradient of the cross-entropy loss with respect to the scores
        d_scores = probs.copy()
        d_scores[0, target_id] -= 1
        
        # scores = context_mean @ embeddings.T, so the embeddings receive a
        # gradient in their output role and context_mean in its input role
        d_embeddings_output = np.dot(d_scores.T, context_mean)
        d_context_mean = np.dot(d_scores, self.embeddings)
        
        # The mean spreads the gradient evenly over the context words
        d_context_vector = d_context_mean[0] / len(context_ids)
        
        self.embeddings -= self.learning_rate * d_embeddings_output
        for word_id in context_ids:
            self.embeddings[word_id] -= self.learning_rate * d_context_vector
    
    def train_step(self, context_ids, target_id):
        context_mean, probs = self.forward(context_ids, target_id)
        self.backward(context_ids, target_id, context_mean, probs)
        loss = -np.log(probs[0, target_id] + 1e-10)
        return loss
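
Once embeddings are trained, a common way to inspect them is cosine similarity between word vectors. The sketch below uses hand-picked, hypothetical 3-dimensional vectors purely to illustrate the computation; they are not the output of any training run, and `cosine_similarity` is a helper introduced here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only
toy_embeddings = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "car": np.array([0.0, 0.1, 0.9]),
}

sim_cat_dog = cosine_similarity(toy_embeddings["cat"], toy_embeddings["dog"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])
print(f"cat vs dog: {sim_cat_dog:.3f}, cat vs car: {sim_cat_car:.3f}")
```

With learned embeddings, the same function could be applied to rows returned by `get_embedding`.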

**Implement code primitive: Implement a Boltzmann machine learning algorithm for densely connected networks with hidden representations.**

In [None]:
import numpy as np

class BoltzmannMachine:
    def __init__(self, n_visible, n_hidden, learning_rate=0.01):
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        self.learning_rate = learning_rate
        
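        # Only visible-to-hidden weights are modeled here (no visible-visible or
        # hidden-hidden connections), which is the bipartite special case of a
        # fully connected Boltzmann machine
        self.W = np.random.randn(n_visible, n_hidden) * 0.01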
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def sample_hidden(self, v):
        h_prob = self.sigmoid(np.dot(v, self.W) + self.b_h)
        h = (np.random.rand(*h_prob.shape) < h_prob).astype(float)
        return h, h_prob
    
    def sample_visible(self, h):
        v_prob = self.sigmoid(np.dot(h, self.W.T) + self.b_v)
        v = (np.random.rand(*v_prob.shape) < v_prob).astype(float)
        return v, v_prob
    
    def gibbs_step(self, v):
        h, h_prob = self.sample_hidden(v)
        v_new, v_prob = self.sample_visible(h)
        return v_new, h, h_prob, v_prob
    
    def train_step(self, v_data, n_gibbs=1):
        h_data, h_prob_data = self.sample_hidden(v_data)
        
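        v_model = v_data.copy()
        for _ in range(n_gibbs):
            v_model, _, _, _ = self.gibbs_step(v_model)
        # The negative phase needs hidden probabilities driven by the
        # reconstruction, not by the visible state that preceded it
        _, h_prob_model = self.sample_hidden(v_model)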
        
        positive_gradient = np.dot(v_data.T, h_prob_data)
        negative_gradient = np.dot(v_model.T, h_prob_model)
        
        self.W += self.learning_rate * (positive_gradient - negative_gradient) / v_data.shape[0]
        self.b_v += self.learning_rate * np.mean(v_data - v_model, axis=0)
        self.b_h += self.learning_rate * np.mean(h_prob_data - h_prob_model, axis=0)
        
        return np.mean((v_data - v_model) ** 2)

**Implement code primitive: Develop Restricted Boltzmann Machines (RBMs) for learning single layers of features.**

In [None]:
import numpy as np

class RBM:
    def __init__(self, n_visible, n_hidden, learning_rate=0.01):
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        self.learning_rate = learning_rate
        
        self.W = np.random.randn(n_visible, n_hidden) * 0.01
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def forward(self, v):
        h_prob = self.sigmoid(np.dot(v, self.W) + self.b_h)
        h = (np.random.rand(*h_prob.shape) < h_prob).astype(float)
        return h, h_prob
    
    def backward(self, h):
        v_prob = self.sigmoid(np.dot(h, self.W.T) + self.b_v)
        v = (np.random.rand(*v_prob.shape) < v_prob).astype(float)
        return v, v_prob
    
    def contrastive_divergence(self, v_data, k=1):
        h_data, h_prob_data = self.forward(v_data)
        
        h = h_data
        for _ in range(k):
            v, v_prob = self.backward(h)
            h, h_prob = self.forward(v)
        
        positive = np.dot(v_data.T, h_prob_data)
        negative = np.dot(v.T, h_prob)
        
        self.W += self.learning_rate * (positive - negative) / v_data.shape[0]
        self.b_v += self.learning_rate * np.mean(v_data - v, axis=0)
        self.b_h += self.learning_rate * np.mean(h_prob_data - h_prob, axis=0)
        
        return np.mean((v_data - v) ** 2)
    
    def transform(self, v):
        h, h_prob = self.forward(v)
        return h_prob
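
The energy-based view behind these models can be made concrete with the standard RBM energy, $E(v, h) = -v^\top W h - b_v^\top v - b_h^\top h$. The sketch below is an illustration under assumptions: the function names and the random parameters are hypothetical, and the shapes follow the `W`, `b_v`, `b_h` layout used in the class above. It checks that the closed-form free energy of a visible vector matches a brute-force sum over all hidden states.

```python
import itertools
import numpy as np

def rbm_energy(v, h, W, b_v, b_h):
    """Energy of a joint binary configuration (v, h)."""
    return -(v @ W @ h) - b_v @ v - b_h @ h

def rbm_free_energy(v, W, b_v, b_h):
    """Free energy of v with the hidden units summed out analytically."""
    return -(b_v @ v) - np.sum(np.logaddexp(0.0, b_h + v @ W))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b_v = rng.normal(size=4)
b_h = rng.normal(size=3)
v = np.array([1.0, 0.0, 1.0, 1.0])

# Brute force: sum exp(-E) over all 2**3 hidden configurations
brute = -np.log(sum(
    np.exp(-rbm_energy(v, np.array(h, dtype=float), W, b_v, b_h))
    for h in itertools.product([0, 1], repeat=3)
))
closed_form = rbm_free_energy(v, W, b_v, b_h)
print(f"brute force: {brute:.6f}, closed form: {closed_form:.6f}")
```

The closed form exists because the bipartite structure makes hidden units conditionally independent given the visible units, which is also why a single layer of features can be learned efficiently.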

**Implement code primitive: Construct Deep Belief Networks (DBNs) by stacking trained RBMs layer-wise and performing efficient approximate inference.**

In [None]:
import numpy as np

class DBN:
    def __init__(self, layer_sizes, learning_rate=0.01):
        self.layer_sizes = layer_sizes
        self.learning_rate = learning_rate
        self.rbms = []
        
        for i in range(len(layer_sizes) - 1):
            rbm = RBM(layer_sizes[i], layer_sizes[i + 1], learning_rate)
            self.rbms.append(rbm)
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def pretrain(self, data, epochs=10, k=1):
        current_data = data
        
        for layer_idx, rbm in enumerate(self.rbms):
            for epoch in range(epochs):
                error = rbm.contrastive_divergence(current_data, k=k)
            
            current_data = rbm.transform(current_data)
    
    def forward(self, v):
        activations = [v]
        current = v
        
        for rbm in self.rbms:
            h_prob = self.sigmoid(np.dot(current, rbm.W) + rbm.b_h)
            activations.append(h_prob)
            current = h_prob
        
        return activations
    
    def backward(self, h_top):
        current = h_top
        
        for rbm in reversed(self.rbms):
            v_prob = self.sigmoid(np.dot(current, rbm.W.T) + rbm.b_v)
            current = v_prob
        
        return current
    
    def inference(self, v, n_steps=10):
        current = v
        
        for _ in range(n_steps):
            activations = self.forward(current)
            h_top = activations[-1]
            current = self.backward(h_top)
        
        return current

**Implement code primitive: Implement variational Bayesian learning to approximate true posteriors in complex models.**

In [None]:
import numpy as np
from scipy.special import digamma, loggamma

class VariationalBayesian:
    def __init__(self, n_features, n_components, learning_rate=0.01):
        self.n_features = n_features
        self.n_components = n_components
        self.learning_rate = learning_rate
        
        self.alpha = np.ones(n_components)
        self.beta = np.ones((n_components, n_features))
        self.gamma = np.random.rand(n_components)
    
    def elbo(self, data):
        n_samples = data.shape[0]
        
        log_likelihood = 0
        for k in range(self.n_components):
            log_beta_k = digamma(self.beta[k]) - digamma(np.sum(self.beta[k]))
            log_likelihood += np.sum(data * log_beta_k)
        
        kl_alpha = np.sum((self.alpha - 1) * (digamma(self.gamma) - digamma(np.sum(self.gamma))))
        kl_alpha += np.sum(loggamma(np.sum(self.alpha)) - np.sum(loggamma(self.alpha)))
        kl_alpha -= np.sum(loggamma(np.sum(self.gamma)) - np.sum(loggamma(self.gamma)))
        
        return log_likelihood + kl_alpha
    
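    def update(self, data):
        # Assumes a mixture-of-multinomials model, consistent with predict():
        # gamma parameterizes the mixing weights, beta the per-component features
        expected_log_theta = digamma(self.beta) - digamma(np.sum(self.beta, axis=1, keepdims=True))
        log_resp = digamma(self.gamma) - digamma(np.sum(self.gamma)) + np.dot(data, expected_log_theta.T)
        
        # Normalize the responsibilities per sample (shifted for stability)
        log_resp -= np.max(log_resp, axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= np.sum(resp, axis=1, keepdims=True)
        
        # Add responsibility-weighted counts to the Dirichlet priors
        # (a prior of 1 per feature matches the initial value of beta)
        self.gamma = self.alpha + np.sum(resp, axis=0)
        self.beta = 1.0 + np.dot(resp.T, data)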
    
    def fit(self, data, n_iterations=10):
        for iteration in range(n_iterations):
            self.update(data)
            elbo_val = self.elbo(data)
    
    def predict(self, data):
        log_prob = np.zeros((data.shape[0], self.n_components))
        
        for k in range(self.n_components):
            log_prob[:, k] = digamma(self.gamma[k]) - digamma(np.sum(self.gamma))
            log_prob[:, k] += np.dot(data, digamma(self.beta[k]) - digamma(np.sum(self.beta[k])))
        
        return np.argmax(log_prob, axis=1)
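
The quantity that variational methods maximize, the evidence lower bound (ELBO), is easiest to see on a model small enough to solve exactly. The sketch below is a toy illustration, separate from the class above: the joint probabilities are hypothetical numbers, and the `elbo` function here handles only a single discrete latent variable.

```python
import numpy as np

# Hypothetical joint p(x, z) for one fixed observation x and a latent z with 3 states
joint = np.array([0.10, 0.25, 0.05])      # p(x, z=0), p(x, z=1), p(x, z=2)
log_evidence = np.log(joint.sum())        # log p(x)
posterior = joint / joint.sum()           # true posterior p(z | x)

def elbo(q, joint):
    """E_q[log p(x, z)] - E_q[log q(z)] for a discrete latent variable."""
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * (np.log(joint) - np.log(q))))

elbo_uniform = elbo([1 / 3, 1 / 3, 1 / 3], joint)
elbo_exact = elbo(posterior, joint)

print(f"log p(x)            = {log_evidence:.4f}")
print(f"ELBO, uniform q     = {elbo_uniform:.4f}")
print(f"ELBO, q = posterior = {elbo_exact:.4f}")
```

The bound is tight only when the approximating distribution equals the true posterior, so raising the ELBO moves the approximation toward that posterior.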

**Implement code primitive: Utilize Rectified Linear Units (ReLUs) as activation functions in neural networks.**

In [None]:
import numpy as np

class ReLUNetwork:
    def __init__(self, layer_sizes, learning_rate=0.01):
        self.layer_sizes = layer_sizes
        self.learning_rate = learning_rate
        self.weights = []
        self.biases = []
        
        for i in range(len(layer_sizes) - 1):
            w = np.random.randn(layer_sizes[i], layer_sizes[i + 1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros((1, layer_sizes[i + 1]))
            self.weights.append(w)
            self.biases.append(b)
    
    def relu(self, x):
        return np.maximum(0, x)
    
    def relu_derivative(self, x):
        return (x > 0).astype(float)
    
    def softmax(self, x):
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)
    
    def forward(self, x):
        self.activations = [x]
        self.z_values = []
        
        for i in range(len(self.weights) - 1):
            z = np.dot(self.activations[-1], self.weights[i]) + self.biases[i]
            self.z_values.append(z)
            a = self.relu(z)
            self.activations.append(a)
        
        z_final = np.dot(self.activations[-1], self.weights[-1]) + self.biases[-1]
        self.z_values.append(z_final)
        a_final = self.softmax(z_final)
        self.activations.append(a_final)
        
        return a_final
    
    def backward(self, y_true):
        m = y_true.shape[0]
        
        delta = self.activations[-1] - y_true
        
        for i in range(len(self.weights) - 1, -1, -1):
            dw = np.dot(self.activations[i].T, delta) / m
            db = np.sum(delta, axis=0, keepdims=True) / m
            
            # Propagate the error with the current weights before they are updated
            if i > 0:
                delta = np.dot(delta, self.weights[i].T) * self.relu_derivative(self.z_values[i - 1])
            
            self.weights[i] -= self.learning_rate * dw
            self.biases[i] -= self.learning_rate * db
    
    def train_step(self, x, y_true):
        output = self.forward(x)
        self.backward(y_true)
        loss = -np.mean(np.sum(y_true * np.log(output + 1e-10), axis=1))
        return loss

**Implement code primitive: Initialize deep neural networks with identity matrices to facilitate training of very deep architectures.**

In [None]:
import numpy as np

class DeepNetworkWithIdentityInit:
    def __init__(self, layer_sizes, learning_rate=0.01):
        self.layer_sizes = layer_sizes
        self.learning_rate = learning_rate
        self.weights = []
        self.biases = []
        
        for i in range(len(layer_sizes) - 1):
            if layer_sizes[i] == layer_sizes[i + 1]:
                w = np.eye(layer_sizes[i])
            else:
                min_dim = min(layer_sizes[i], layer_sizes[i + 1])
                w = np.random.randn(layer_sizes[i], layer_sizes[i + 1]) * 0.01
                w[:min_dim, :min_dim] += np.eye(min_dim)
            
            b = np.zeros((1, layer_sizes[i + 1]))
            self.weights.append(w)
            self.biases.append(b)
    
    def relu(self, x):
        return np.maximum(0, x)
    
    def relu_derivative(self, x):
        return (x > 0).astype(float)
    
    def forward(self, x):
        self.activations = [x]
        self.z_values = []
        
        for i in range(len(self.weights)):
            z = np.dot(self.activations[-1], self.weights[i]) + self.biases[i]
            self.z_values.append(z)
            a = self.relu(z)
            self.activations.append(a)
        
        return self.activations[-1]
    
    def backward(self, y_true):
        m = y_true.shape[0]
        
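        # The output layer is also a ReLU, so its derivative gates the error
        delta = (self.activations[-1] - y_true) * self.relu_derivative(self.z_values[-1])
        
        for i in range(len(self.weights) - 1, -1, -1):
            dw = np.dot(self.activations[i].T, delta) / m
            db = np.sum(delta, axis=0, keepdims=True) / m
            
            # Propagate the error with the current weights before they are updated
            if i > 0:
                delta = np.dot(delta, self.weights[i].T) * self.relu_derivative(self.z_values[i - 1])
            
            self.weights[i] -= self.learning_rate * dw
            self.biases[i] -= self.learning_rate * db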
    
    def train_step(self, x, y_true):
        output = self.forward(x)
        self.backward(y_true)
        loss = np.mean((output - y_true) ** 2)
        return loss
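
The motivation for identity initialization can be seen without any training. In the sketch below, which uses hypothetical sizes and a hand-picked non-negative input, a stack of ReLU layers with identity weights and zero biases returns its input unchanged at any depth, while small random weights drive the signal toward zero.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

depth, width = 50, 4
x = np.array([[0.5, 2.0, 0.0, 1.5]])   # non-negative input, chosen for illustration

# Identity weights with zero biases: each ReLU layer returns its input unchanged
identity_out = x.copy()
for _ in range(depth):
    identity_out = relu(identity_out @ np.eye(width))

# Small random weights (scaled by 0.01) shrink the signal layer after layer
rng = np.random.default_rng(0)
random_out = x.copy()
for _ in range(depth):
    random_out = relu(random_out @ (rng.normal(size=(width, width)) * 0.01))

print("identity init:", identity_out)
print("max |activation| with small random init:", np.max(np.abs(random_out)))
```

Because the forward signal survives the full depth, the backward error signal does as well, which gives very deep stacks a workable starting point.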

**Implement code primitive: Implement the recirculation algorithm for autoencoders to learn without explicit backpropagation by minimizing activity variation.**

In [None]:
import numpy as np

class RecirculationAutoencoder:
    def __init__(self, input_size, hidden_size, learning_rate=0.01):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate
        
        self.W_encode = np.random.randn(input_size, hidden_size) * 0.01
        self.W_decode = np.random.randn(hidden_size, input_size) * 0.01
        self.b_hidden = np.zeros((1, hidden_size))
        self.b_output = np.zeros((1, input_size))
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
    
    def encode(self, x):
        h = self.sigmoid(np.dot(x, self.W_encode) + self.b_hidden)
        return h
    
    def decode(self, h):
        x_recon = self.sigmoid(np.dot(h, self.W_decode) + self.b_output)
        return x_recon
    
    def recirculate(self, x, n_iterations=3):
        h = self.encode(x)
        
        for _ in range(n_iterations):
            x_recon = self.decode(h)
            h_new = self.encode(x_recon)
            h = 0.5 * h + 0.5 * h_new
        
        return h
    
    def train_step(self, x):
        h_initial = self.encode(x)
        h_recirculated = self.recirculate(x, n_iterations=3)
        
        activity_variation = np.mean((h_initial - h_recirculated) ** 2)
        
        x_recon = self.decode(h_recirculated)
        reconstruction_error = np.mean((x - x_recon) ** 2)
        
        dh = 2 * (h_recirculated - h_initial) / x.shape[0]
        
        dW_decode = np.dot(h_recirculated.T, (x - x_recon)) / x.shape[0]
        db_output = np.mean(x - x_recon, axis=0, keepdims=True)
        
        self.W_decode += self.learning_rate * dW_decode
        self.b_output += self.learning_rate * db_output
        
        dW_encode = np.dot(x.T, dh) / x.shape[0]
        db_hidden = np.mean(dh, axis=0, keepdims=True)
        
        self.W_encode += self.learning_rate * dW_encode
        self.b_hidden += self.learning_rate * db_hidden
        
        return activity_variation + reconstruction_error

**Implement code primitive: Develop neural networks that use 'fast weights' for rapid adaptation and short-term memory, enabling recursion.**

In [None]:
import numpy as np

class FastWeightNetwork:
    def __init__(self, input_size, hidden_size, learning_rate=0.01, fast_weight_decay=0.95):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.learning_rate = learning_rate
        self.fast_weight_decay = fast_weight_decay
        
        self.W_slow = np.random.randn(input_size, hidden_size) * 0.01
        self.W_fast = np.zeros((input_size, hidden_size))
        self.W_recurrent = np.random.randn(hidden_size, hidden_size) * 0.01
        self.b_hidden = np.zeros((1, hidden_size))
    
    def relu(self, x):
        return np.maximum(0, x)
    
    def forward(self, x, h_prev):
        W_combined = self.W_slow + self.W_fast
        
        h = self.relu(np.dot(x, W_combined) + np.dot(h_prev, self.W_recurrent) + self.b_hidden)
        
        return h
    
    def update_fast_weights(self, x, h, learning_rate_fast=0.1):
        dW_fast = np.dot(x.T, h) / x.shape[0]
        self.W_fast = self.fast_weight_decay * self.W_fast + learning_rate_fast * dW_fast
    
    def update_slow_weights(self, x, h, target):
        error = h - target
        dW_slow = np.dot(x.T, error) / x.shape[0]
        self.W_slow -= self.learning_rate * dW_slow
    
    def train_step(self, x_sequence, target_sequence):
        h = np.zeros((x_sequence.shape[0], self.hidden_size))
        total_loss = 0
        
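        # x_sequence is indexed as (batch, time): each step feeds one scalar
        # per example, so this loop assumes input_size == 1
        for t in range(x_sequence.shape[1]):
            x_t = x_sequence[:, t:t+1]
            target_t = target_sequence[:, t:t+1]
            
            # h already holds the previous hidden state (zeros at t == 0)
            h = self.forward(x_t, h)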
            
            self.update_fast_weights(x_t, h)
            self.update_slow_weights(x_t, h, target_t)
            
            loss = np.mean((h - target_t) ** 2)
            total_loss += loss
        
        return total_loss / x_sequence.shape[1]

**Implement code primitive: Implement capsule networks where groups of neurons represent multi-dimensional feature entities.**

In [None]:
import numpy as np

class CapsuleLayer:
    def __init__(self, n_capsules, capsule_dim, input_dim, learning_rate=0.01):
        self.n_capsules = n_capsules
        self.capsule_dim = capsule_dim
        self.input_dim = input_dim
        self.learning_rate = learning_rate
        
        self.W = np.random.randn(input_dim, n_capsules, capsule_dim) * 0.01
        self.b = np.zeros((n_capsules, capsule_dim))
    
    def squash(self, x):
        norm = np.linalg.norm(x, axis=-1, keepdims=True)
        return (norm ** 2 / (1 + norm ** 2)) * (x / (norm + 1e-10))
    
    def forward(self, u):
        batch_size = u.shape[0]
        
        u_hat = np.zeros((batch_size, self.n_capsules, self.capsule_dim))
        for i in range(self.n_capsules):
            u_hat[:, i, :] = np.dot(u, self.W[:, i, :]) + self.b[i]
        
        b_ij = np.zeros((batch_size, self.n_capsules))
        
        for iteration in range(3):
            c_ij = np.exp(b_ij) / np.sum(np.exp(b_ij), axis=1, keepdims=True)
            
            s_j = np.zeros((batch_size, self.n_capsules, self.capsule_dim))
            for j in range(self.n_capsules):
                # Scale capsule j's prediction by its coupling coefficient
                s_j[:, j, :] = c_ij[:, j:j+1] * u_hat[:, j, :]
            
            v_j = np.zeros((batch_size, self.n_capsules, self.capsule_dim))
            for j in range(self.n_capsules):
                v_j[:, j, :] = self.squash(s_j[:, j, :])
            
            if iteration < 2:
                for i in range(self.n_capsules):
                    for j in range(self.n_capsules):
                        b_ij[:, j] += np.sum(u_hat[:, i, :] * v_j[:, j, :], axis=1)
        
        return v_j
    
    def backward(self, grad_output, u):
        # Simplified update: treats the squash and the routing coefficients as
        # constants, so grad_output is applied directly to u_hat = u @ W + b.
        batch_size = grad_output.shape[0]
        
        dW = np.zeros_like(self.W)
        db = np.zeros_like(self.b)
        
        for j in range(self.n_capsules):
            dW[:, j, :] = np.dot(u.T, grad_output[:, j, :])
            db[j] = np.sum(grad_output[:, j, :], axis=0)
        
        self.W -= self.learning_rate * dW / batch_size
        self.b -= self.learning_rate * db / batch_size

**Implement code primitive: Develop 'routing by agreement' mechanisms in capsule networks for improved generalization, segmentation, and viewpoint handling.**

In [None]:
import numpy as np

class RoutingByAgreement:
    def __init__(self, n_lower_capsules, n_upper_capsules, capsule_dim, learning_rate=0.01):
        self.n_lower_capsules = n_lower_capsules
        self.n_upper_capsules = n_upper_capsules
        self.capsule_dim = capsule_dim
        self.learning_rate = learning_rate
        
        self.W = np.random.randn(n_lower_capsules, n_upper_capsules, capsule_dim, capsule_dim) * 0.01
    
    def squash(self, x):
        norm = np.linalg.norm(x, axis=-1, keepdims=True)
        return (norm ** 2 / (1 + norm ** 2)) * (x / (norm + 1e-10))
    
    def routing_algorithm(self, u_lower, n_iterations=3):
        batch_size = u_lower.shape[0]
        
        b_ij = np.zeros((batch_size, self.n_lower_capsules, self.n_upper_capsules))
        
        for iteration in range(n_iterations):
            c_ij = np.exp(b_ij) / np.sum(np.exp(b_ij), axis=2, keepdims=True)
            
            s_j = np.zeros((batch_size, self.n_upper_capsules, self.capsule_dim))
            
            for j in range(self.n_upper_capsules):
                for i in range(self.n_lower_capsules):
                    u_hat_ij = np.dot(u_lower[:, i, :], self.W[i, j, :, :])
                    s_j[:, j, :] += c_ij[:, i, j:j+1] * u_hat_ij
            
            v_j = np.zeros((batch_size, self.n_upper_capsules, self.capsule_dim))
            for j in range(self.n_upper_capsules):
                v_j[:, j, :] = self.squash(s_j[:, j, :])
            
            if iteration < n_iterations - 1:
                for i in range(self.n_lower_capsules):
                    for j in range(self.n_upper_capsules):
                        u_hat_ij = np.dot(u_lower[:, i, :], self.W[i, j, :, :])
                        agreement = np.sum(u_hat_ij * v_j[:, j, :], axis=1)
                        b_ij[:, i, j] += agreement
        
        return v_j
    
    def forward(self, u_lower):
        v_upper = self.routing_algorithm(u_lower, n_iterations=3)
        return v_upper
    
    def backward(self, grad_output, u_lower):
        batch_size = u_lower.shape[0]
        
        dW = np.zeros_like(self.W)
        
        # Simplified gradient: ignores the squash and the coupling coefficients
        for i in range(self.n_lower_capsules):
            for j in range(self.n_upper_capsules):
                dW[i, j, :, :] += np.dot(u_lower[:, i, :].T, grad_output[:, j, :]) / batch_size
        
        self.W -= self.learning_rate * dW

**Create a Mermaid diagram: graph TD
    A[Input Data] --> B{RBM 1 (Layer 1)};
    B --> C{RBM 2 (Layer 2)};
    C --> D{RBM 3 (Layer 3)};
    D --> E[Deep Belief Network (DBN)];**

## Deep Belief Network Architecture

```mermaid
graph TD
    A[Input Data] --> B{RBM 1 Layer 1}
    B --> C{RBM 2 Layer 2}
    C --> D{RBM 3 Layer 3}
    D --> E[Deep Belief Network DBN]
```

Deep Belief Networks are constructed by stacking multiple Restricted Boltzmann Machines layer-wise. Each RBM learns a layer of features from the output of the previous layer, enabling the network to learn hierarchical representations of the input data.
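
The layer-wise stacking described above can be sketched as follows. This is a minimal illustration, assuming binary inputs and a single step of contrastive divergence (CD-1) per update; `TinyRBM`, `pretrain_dbn`, the layer sizes, and the random data are all hypothetical choices made for the example, not part of the lesson.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyRBM:
    """Hypothetical minimal RBM trained with one step of contrastive divergence."""

    def __init__(self, n_visible, n_hidden, rng):
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0, lr=0.1):
        # Positive phase: hidden activations driven by the data
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one reconstruction step
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (h0 - h1).mean(axis=0)

def pretrain_dbn(data, layer_sizes, epochs=5, seed=0):
    """Train one RBM per layer; each RBM sees the previous layer's features."""
    rng = np.random.default_rng(seed)
    rbms, layer_input = [], data
    for n_hidden in layer_sizes:
        rbm = TinyRBM(layer_input.shape[1], n_hidden, rng)
        for _ in range(epochs):
            rbm.cd1_step(layer_input)
        rbms.append(rbm)
        layer_input = rbm.hidden_probs(layer_input)  # input for the next RBM
    return rbms, layer_input

# Made-up binary data: 20 examples with 12 features each
data = (np.random.default_rng(1).random((20, 12)) > 0.5).astype(float)
rbms, top_features = pretrain_dbn(data, layer_sizes=[8, 6, 4])
print([rbm.W.shape for rbm in rbms], top_features.shape)
```

Each RBM is trained only on the features produced by the layer below it, which is what makes the procedure greedy.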

**Create a Mermaid diagram: sequenceDiagram
    participant NN as Neural Network
    participant Loss as Loss Function
    NN->>NN: Forward Pass (Input to Output)
    NN->>Loss: Calculate Loss
    Loss->>NN: Backward Pass (Compute Gradients)
    NN->>NN: Update Weights**

## Backpropagation Training Loop

```mermaid
sequenceDiagram
    participant NN as Neural Network
    participant Loss as Loss Function
    NN->>NN: Forward Pass Input to Output
    NN->>Loss: Calculate Loss
    Loss->>NN: Backward Pass Compute Gradients
    NN->>NN: Update Weights
```

Backpropagation enables efficient training of deep neural networks by computing gradients through the chain rule. The forward pass propagates input through the network to produce predictions, the loss function measures prediction error, and the backward pass computes gradients for each parameter, allowing weights to be updated in the direction that reduces loss.
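
The four steps in the diagram can be sketched for a small two-layer network. This is an illustrative example only: the synthetic data, the layer sizes, the `tanh` hidden activation, and the learning rate are assumptions chosen to keep the code short.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 64))                  # 2 features, 64 examples as columns
Y = (X[0:1] * X[1:2] > 0).astype(float)       # made-up XOR-like labels, shape (1, 64)
m = X.shape[1]

W1 = rng.normal(scale=0.5, size=(4, 2))
b1 = np.zeros((4, 1))
W2 = rng.normal(scale=0.5, size=(1, 4))
b2 = np.zeros((1, 1))
learning_rate = 0.2
losses = []

for step in range(500):
    # Forward pass: input to output
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))

    # Calculate loss (binary cross-entropy)
    loss = -np.mean(Y * np.log(A2 + 1e-12) + (1 - Y) * np.log(1 - A2 + 1e-12))
    losses.append(loss)

    # Backward pass: apply the chain rule layer by layer
    dZ2 = (A2 - Y) / m
    dW2 = dZ2 @ A1.T
    db2 = dZ2.sum(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)        # derivative of tanh is 1 - tanh^2
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)

    # Update weights in the direction that reduces the loss
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```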

**Create a Mermaid diagram: graph TD
    A[Input Units] -- Send data --> B[Hidden Units];
    B -- Send data back --> A;
    A -- Iterate --> B;
    B -- Learn to reduce variation --> A;**

## Recirculation Algorithm

```mermaid
graph TD
    A[Input Units] -- Send data --> B[Hidden Units]
    B -- Send data back --> A
    A -- Iterate --> B
    B -- Learn to reduce variation --> A
```

The recirculation algorithm enables autoencoders to learn without explicit backpropagation by iteratively passing information between input and hidden units. The network learns by minimizing the variation in activity patterns across recirculation iterations, allowing it to discover efficient representations of the input data.

**Create a Mermaid diagram: graph TD
    M[Mouth Capsule] --> V1(Vote for Face Parameters);
    N[Nose Capsule] --> V2(Vote for Face Parameters);
    V1 -- Agree? --> F[Face Capsule];
    V2 -- Agree? --> F;**

## Routing by Agreement in Capsule Networks

```mermaid
graph TD
    M[Mouth Capsule] --> V1(Vote for Face Parameters)
    N[Nose Capsule] --> V2(Vote for Face Parameters)
    V1 -- Agree? --> F[Face Capsule]
    V2 -- Agree? --> F
```

Routing by agreement is a mechanism in capsule networks where lower-level capsules vote on the parameters of higher-level capsules. When multiple lower-level capsules agree on their votes, the connection strength to the higher-level capsule is strengthened. This enables the network to learn part-whole relationships and improve generalization, segmentation, and viewpoint handling.
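
A small numeric sketch of the voting idea, under stated assumptions: the vote vectors are made up, and a second parent capsule (called "other" here) is added purely so the softmax has something to choose between. The two parts agree on the face parameters and disagree on the other parent, so routing should shift coupling toward the face capsule.

```python
import numpy as np

def squash(s):
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / (np.sqrt(norm_sq) + 1e-10)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# votes[i, j] is the vote of lower capsule i for parent capsule j (2-D pose).
# Lower capsules: 0 = mouth, 1 = nose. Parents: 0 = face, 1 = other (hypothetical).
votes = np.array([
    [[1.0, 0.9], [0.8, -0.7]],    # mouth: vote for face, vote for other
    [[0.9, 1.0], [-0.8, 0.7]],    # nose: agrees on face, disagrees on other
])

logits = np.zeros((2, 2))          # routing logits b_ij, start uniform
for _ in range(3):
    coupling = softmax(logits, axis=1)                   # each part splits its vote across parents
    s = np.sum(coupling[:, :, None] * votes, axis=0)     # weighted sum of votes per parent
    v = squash(s)                                        # parent outputs
    logits += np.sum(votes * v[None, :, :], axis=-1)     # agreement = dot(vote, parent output)

coupling = softmax(logits, axis=1)
print(coupling)  # column 0 (face) should dominate for both parts
```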

# Week 2

## Lesson 1

**Explain the core concepts of the lesson**

## Core Concepts

This lesson introduces the fundamental programming concepts and notation used in neural networks, with a focus on how to organize and represent data efficiently.

**Key Concepts:**

1. **Vectorization**: Processing entire training sets without explicit for loops is crucial for neural network efficiency. Instead of iterating through individual examples, we organize data into matrices and perform operations on them simultaneously.

2. **Binary Classification**: A foundational task where the goal is to predict one of two possible outcomes (e.g., yes/no, cat/not cat).

3. **Logistic Regression Algorithm**: A simple yet powerful algorithm used as a foundational example to understand neural network programming concepts.

4. **Image Representation (RGB)**: Images are represented using three separate matrices—one for each color channel (Red, Green, Blue).

5. **Feature Vector (x)**: All pixel intensity values from an image's RGB channels are unrolled into a single column vector, which serves as the input to the model.

6. **Input Feature Dimension ($n_x$)**: The total number of features in a single training example, calculated as width × height × number of channels.

7. **Training Example Notation**: Each training example is denoted as $(x, y)$, where $x$ is the feature vector and $y$ is the corresponding label.

8. **Training Set Size (m)**: The number of training examples in the dataset.

9. **Input Data Matrix (X)**: Individual training feature vectors are stacked as columns to form a matrix of dimensions $n_x \times m$.

10. **Output Label Matrix (Y)**: Individual training labels are stacked as columns to form a matrix of dimensions $1 \times m$.

11. **Column Stacking Convention**: The standard practice of organizing training data by stacking individual examples into columns within matrices simplifies neural network implementations.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Why Vectorization Matters:**
Avoiding explicit for loops over the training set matters because vectorized operations run inside optimized linear algebra libraries that process many examples at once. For large matrices this is often orders of magnitude faster than iterating through examples one by one in Python.
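
A minimal sketch comparing the two styles on the same computation, $z = w^T X + b$. The sizes and random data are assumptions for illustration; the point is that both versions give the same numbers while the vectorized one needs a single expression.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 5, 1000                      # made-up sizes
X = rng.normal(size=(n_x, m))         # one training example per column
w = rng.normal(size=(n_x, 1))
b = 0.5

# Explicit loops: one example and one feature at a time
z_loop = np.zeros((1, m))
for i in range(m):
    for j in range(n_x):
        z_loop[0, i] += w[j, 0] * X[j, i]
    z_loop[0, i] += b

# Vectorized: all m examples in one matrix product
z_vec = w.T @ X + b

print(np.allclose(z_loop, z_vec), z_vec.shape)
```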

**Forward and Backward Passes:**
Neural network computations are typically organized into distinct forward and backward passes. The forward pass computes predictions from input data, while the backward pass computes gradients for learning. Understanding this structure helps you organize your code logically and efficiently.

**Logistic Regression as a Foundation:**
Logistic regression is used as a foundational example to understand neural network programming concepts. It's simple enough to understand quickly, yet it demonstrates the same organizational principles (vectorization, matrix operations, forward/backward passes) that apply to more complex neural networks.

**Data Organization Through Column Stacking:**
Organizing training data by stacking individual examples into columns within matrices (X and Y) simplifies neural network implementations. This convention allows you to write compact, vectorized code that processes all training examples at once, rather than writing loops to handle each example individually.
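
As a small sketch of the convention, assuming a toy problem with made-up sizes and labels, individual feature vectors can be stacked into columns with `np.column_stack`:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 4, 3                                   # toy sizes for illustration

# One 1-D feature vector and one label per training example
examples = [rng.integers(0, 256, size=n_x) for _ in range(m)]
labels = [0, 1, 1]

X = np.column_stack(examples)                   # shape (n_x, m): example i is column i
Y = np.array(labels).reshape(1, m)              # shape (1, m)

print(X.shape, Y.shape)
```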

**Mental Model for Data Flow:**
Think of your training data as flowing through your model in a vectorized manner: all $m$ training examples are processed together, with each example contributing one column to the input matrix X and one column to the output matrix Y. This parallel processing is what makes neural networks computationally efficient at scale.

**Present and explain the key equations used in the lesson**

## Key Equations

**Input Feature Dimension:**

For an image represented with RGB channels, the total number of features in a single training example is:

$$n_x = \text{image\_width} \times \text{image\_height} \times \text{num\_channels}$$

**Example: 64×64 RGB Image:**

$$n_x = 64 \times 64 \times 3 = 12288$$

This means each image is flattened into a feature vector of 12,288 elements.

**Input Data Matrix Dimension:**

When stacking $m$ training examples as columns, the input data matrix X has dimensions:

$$X \text{ (Input Data Matrix) dimension}: n_x \times m$$

Each column represents one training example's feature vector.

**Output Label Matrix Dimension:**

When stacking $m$ training labels as columns, the output label matrix Y has dimensions:

$$Y \text{ (Output Label Matrix) dimension}: 1 \times m$$

Each column contains the label for the corresponding training example in X.

**Implement code primitive: Represent an image using three separate matrices for Red, Green, and Blue color channels.**

In [None]:
import numpy as np

# Create a sample 64x64 RGB image
image_height = 64
image_width = 64

# Initialize three separate matrices for R, G, B channels
R = np.random.randint(0, 256, size=(image_height, image_width), dtype=np.uint8)
G = np.random.randint(0, 256, size=(image_height, image_width), dtype=np.uint8)
B = np.random.randint(0, 256, size=(image_height, image_width), dtype=np.uint8)

print(f"Red channel shape: {R.shape}")
print(f"Green channel shape: {G.shape}")
print(f"Blue channel shape: {B.shape}")

**Implement code primitive: Unroll all pixel intensity values from an image's RGB channels into a single input feature vector (x).**

In [None]:
# Unroll all pixel intensity values from R, G, B channels into a single column feature vector
x = np.concatenate([R.flatten(), G.flatten(), B.flatten()]).reshape(-1, 1)

print(f"Feature vector shape: {x.shape}")
print(f"Feature vector dimension (n_x): {len(x)}")
print(f"Expected n_x = 64 * 64 * 3 = {64 * 64 * 3}")

**Implement code primitive: Construct a training data matrix (capital X) by stacking individual training feature vectors as columns.**

In [None]:
# Create multiple training examples and stack them as columns
m = 10  # number of training examples
n_x = 64 * 64 * 3  # feature dimension

# Initialize training data matrix X
X = np.zeros((n_x, m))

# Fill X with training examples (each column is one training example)
for i in range(m):
    # Generate a random feature vector for each training example
    X[:, i] = np.random.randint(0, 256, size=n_x)

print(f"Training data matrix X shape: {X.shape}")
print(f"X is a {X.shape[0]} by {X.shape[1]} matrix")

**Implement code primitive: Construct a training label matrix (capital Y) by stacking individual training labels as columns.**

In [None]:
# Create training labels and stack them as columns
m = 10  # number of training examples

# Initialize training label matrix Y
Y = np.zeros((1, m))

# Fill Y with training labels (each column is one training label)
for i in range(m):
    # Generate a random binary label (0 or 1) for each training example
    Y[0, i] = np.random.randint(0, 2)

print(f"Training label matrix Y shape: {Y.shape}")
print(f"Y is a {Y.shape[0]} by {Y.shape[1]} matrix")
print(f"Labels: {Y}")

**Implement code primitive: Determine the shape of a matrix (e.g., using .shape in Python).**

In [None]:
# Determine the shape of matrices using .shape
print(f"Shape of X: {X.shape}")
print(f"Shape of Y: {Y.shape}")
print(f"Shape of R: {R.shape}")

# Extract individual dimensions
n_x_actual, m_actual = X.shape
print(f"\nX has {n_x_actual} features and {m_actual} training examples")

y_rows, y_cols = Y.shape
print(f"Y has {y_rows} row(s) and {y_cols} column(s)")

**Create a Mermaid diagram: graph LR
    subgraph Image Pixels
        R[R Channel Matrix]
        G[G Channel Matrix]
        B[B Channel Matrix]
    end
    R --> Unroll
    G --> Unroll
    B --> Unroll
    Unroll --> x(Feature Vector x)**

```mermaid
graph LR
    subgraph Image Pixels
        R[R Channel Matrix]
        G[G Channel Matrix]
        B[B Channel Matrix]
    end
    R --> Unroll
    G --> Unroll
    B --> Unroll
    Unroll --> x(Feature Vector x)
```

**Create a Mermaid diagram: graph TD
    subgraph Training Examples
        x1(x^(1))
        x2(x^(2))
        xm(x^(m))
    end
    x1 -- Column 1 --> X[Matrix X]
    x2 -- Column 2 --> X
    xm -- Column m --> X
    X --- n_x_by_m[n_x by m matrix]**

```mermaid
graph TD
    subgraph Training Examples
        x1("x^(1)")
        x2("x^(2)")
        xm("x^(m)")
    end
    x1 -- Column 1 --> X[Matrix X]
    x2 -- Column 2 --> X
    xm -- Column m --> X
    X --- n_x_by_m[n_x by m matrix]
```

**Create a Mermaid diagram: graph TD
    subgraph Training Labels
        y1(y^(1))
        y2(y^(2))
        ym(y^(m))
    end
    y1 -- Column 1 --> Y[Matrix Y]
    y2 -- Column 2 --> Y
    ym -- Column m --> Y
    Y --- 1_by_m[1 by m matrix]**

```mermaid
graph TD
    subgraph Training Labels
        y1("y^(1)")
        y2("y^(2)")
        ym("y^(m)")
    end
    y1 -- Column 1 --> Y[Matrix Y]
    y2 -- Column 2 --> Y
    ym -- Column m --> Y
    Y --- 1_by_m[1 by m matrix]
```

## Lesson 2

**Explain the core concepts of the lesson**

## Core Concepts of Logistic Regression

Logistic regression is a learning algorithm designed for **binary classification problems**, where the output label $Y$ belongs to the set $\{0, 1\}$.

The fundamental concepts are:

- **Input Features** ($X$): The data we use to make predictions
- **Output Label** ($Y$): The target we want to predict, which is either 0 or 1
- **Parameters** ($W$ and $b$): Weights and bias that the model learns from data
- **Prediction** ($\hat{Y}$): The model's estimate of the probability that $Y = 1$ given the input $X$
- **Probability Estimation**: The prediction must always be a valid probability, meaning it must be between 0 and 1

The key insight is that logistic regression models the probability of the positive class (label 1) using a special mathematical function called the sigmoid function.

**Explain intuitions and mental models for the lesson**

## Intuitions Behind Logistic Regression

**Why not use linear regression for classification?**

In linear regression, we can fit a straight line to data, but the output can be any real number—it can be negative or greater than 1. For binary classification, we need predictions that represent probabilities, which must always be between 0 and 1. A linear function simply cannot guarantee this constraint.

**The sigmoid function to the rescue:**

The sigmoid function is a mathematical transformation that takes any real number as input and outputs a value strictly between 0 and 1. This makes it perfect for modeling probabilities:

- When the input is a very large positive number, the sigmoid output approaches 1
- When the input is a very large negative number, the sigmoid output approaches 0
- When the input is 0, the sigmoid output is exactly 0.5
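
The three cases above can be checked numerically. A minimal sketch, with the inputs -10, 0, and 10 chosen arbitrarily to stand in for "very negative", "zero", and "very positive":

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

values = sigmoid(np.array([-10.0, 0.0, 10.0]))
print(values)  # close to 0, exactly 0.5, close to 1
```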

**The logistic regression pipeline:**

1. Compute a linear combination of features: $Z = W^T X + b$
2. Apply the sigmoid function to transform $Z$ into a probability: $\hat{Y} = \sigma(Z)$
3. Interpret $\hat{Y}$ as the probability that $Y = 1$ given the input $X$

**Present and explain the key equations used in the lesson**

## Key Equations in Logistic Regression

**Output label constraint:**
$$Y \in \{0, 1\}$$

The output is binary—either 0 or 1.

**Prediction as probability:**
$$\hat{Y} = P(Y=1|X)$$

The prediction represents the probability that $Y$ equals 1 given the input features $X$.

**Linear combination:**
$$Z = W^T X + b$$

We compute a weighted sum of the input features plus a bias term.

**Sigmoid transformation:**
$$\hat{Y} = \sigma(Z)$$

We apply the sigmoid function to the linear combination.

**Sigmoid function definition:**
$$\sigma(Z) = \frac{1}{1 + e^{-Z}}$$

The sigmoid function maps any real number to a value between 0 and 1, making it suitable for probability estimation.

**Implement code primitive: Implement the logistic regression model to compute the prediction (Y_hat) given input features (X) and parameters (W, b).**

In [None]:
import numpy as np

def sigmoid(Z):
    """
    Compute the sigmoid function.
    
    Parameters:
    Z: Linear combination (scalar or array)
    
    Returns:
    Sigmoid output (scalar or array with values between 0 and 1)
    """
    return 1 / (1 + np.exp(-Z))

def logistic_regression_predict(X, W, b):
    """
    Compute the logistic regression prediction.
    
    Parameters:
    X: Input features (shape: (n_features, n_samples))
    W: Weights (shape: (n_features,))
    b: Bias (scalar)
    
    Returns:
    Y_hat: Predicted probabilities (shape: (n_samples,))
    """
    Z = np.dot(W, X) + b
    Y_hat = sigmoid(Z)
    return Y_hat

# Example usage
X = np.array([[1, 2, 3], [4, 5, 6]])  # 2 features, 3 samples
W = np.array([0.5, -0.3])  # 2 weights
b = 0.1  # bias

Y_hat = logistic_regression_predict(X, W, b)
print("Predictions:", Y_hat)

**Implement code primitive: Develop a process to learn optimal parameters (W and b) such that Y_hat accurately estimates the probability of Y being 1.**

In [None]:
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def compute_cost(Y_hat, Y):
    """
    Compute the binary cross-entropy cost function.
    
    Parameters:
    Y_hat: Predicted probabilities (shape: (n_samples,))
    Y: True labels (shape: (n_samples,))
    
    Returns:
    cost: Average cost across all samples
    """
    # The small epsilon keeps log() finite when a prediction is exactly 0 or 1
    cost = -np.mean(Y * np.log(Y_hat + 1e-8) + (1 - Y) * np.log(1 - Y_hat + 1e-8))
    return cost

def gradient_descent(X, Y, W, b, learning_rate=0.01, iterations=100):
    """
    Learn optimal parameters W and b using gradient descent.
    
    Parameters:
    X: Input features (shape: (n_features, n_samples))
    Y: True labels (shape: (n_samples,))
    W: Initial weights (shape: (n_features,))
    b: Initial bias (scalar)
    learning_rate: Step size for parameter updates
    iterations: Number of gradient descent iterations
    
    Returns:
    W: Learned weights
    b: Learned bias
    costs: List of costs at each iteration
    """
    m = Y.shape[0]
    costs = []
    
    for i in range(iterations):
        Z = np.dot(W, X) + b
        Y_hat = sigmoid(Z)
        
        dZ = Y_hat - Y
        dW = np.dot(dZ, X.T) / m
        db = np.mean(dZ)
        
        W = W - learning_rate * dW
        b = b - learning_rate * db
        
        cost = compute_cost(Y_hat, Y)
        costs.append(cost)
    
    return W, b, costs

# Example usage
X = np.array([[1, 2, 3], [4, 5, 6]])  # 2 features, 3 samples
Y = np.array([0, 1, 1])  # True labels
W = np.zeros(2)  # Initialize weights
b = 0.0  # Initialize bias

W_learned, b_learned, costs = gradient_descent(X, Y, W, b, learning_rate=0.1, iterations=100)
print("Learned W:", W_learned)
print("Learned b:", b_learned)
print("Final cost:", costs[-1])

## Lesson 3

**Explain the core concepts of the lesson**

## Core Concepts

Logistic regression is a fundamental classification algorithm that relies on two critical components: the **loss function** and the **cost function**.

**Loss Function**: A loss function measures how well a model's prediction aligns with the true label for a single training example. For logistic regression, we use a specialized loss function that penalizes incorrect predictions in a way that leads to a convex optimization problem.

**Cost Function**: The cost function aggregates the loss over all training examples, providing an overall measure of the model's performance on the entire training set. It is the objective function that we minimize during training.

**Training Example Indexing**: In logistic regression, we denote the $i$-th training example as $(\mathbf{x}^{(i)}, y^{(i)})$, where $\mathbf{x}^{(i)}$ is the input feature vector and $y^{(i)}$ is the true binary label.

**Convex vs. Non-convex Optimization**: Using squared error as a loss function in logistic regression results in a non-convex optimization problem with multiple local optima, making it difficult for gradient descent to find the global optimum. The logistic regression loss function is specifically designed to ensure a convex optimization problem, allowing gradient descent to reliably find the global minimum.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Why Loss Matters**: Think of the loss function as a report card for a single prediction. It tells us how far off our prediction is from the truth. A good loss function should heavily penalize confident wrong predictions while being lenient on uncertain predictions.

**The Problem with Squared Error**: If we use squared error loss (common in linear regression), we create a landscape with many valleys and hills. Gradient descent might get stuck in a local valley and never reach the global optimum. This is the non-convex problem.

**The Logistic Regression Loss Solution**: The logistic regression loss function is designed to create a smooth, bowl-shaped (convex) landscape. Wherever we start, gradient descent with a suitably small learning rate slides down toward the single global minimum rather than getting stuck in a local one.
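
The difference in shape can be seen on a one-parameter example. The sketch below assumes a single training example with $x = 1$ and $y = 1$, so $\hat{y} = \sigma(w)$, and uses second differences on a grid as a rough numerical stand-in for curvature. It only shows that squared error has regions of negative curvature for this example while the logistic loss does not; it is not a proof about the full cost function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.linspace(-6.0, 6.0, 241)      # grid of weight values
y_hat = sigmoid(w * 1.0)             # prediction for x = 1

squared_error = (y_hat - 1.0) ** 2   # squared error loss for y = 1
logistic_loss = -np.log(y_hat)       # logistic regression loss for y = 1

def second_differences(f):
    # Discrete curvature: positive everywhere for a convex curve
    return f[:-2] - 2 * f[1:-1] + f[2:]

print("squared error min curvature:", second_differences(squared_error).min())
print("logistic loss min curvature:", second_differences(logistic_loss).min())
```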

**Predicted Output Design**: The logistic regression loss function is designed to make the predicted output $\hat{y}$ close to 1 when the true label $y$ is 1, and close to 0 when $y$ is 0. When $\hat{y}$ matches $y$, the loss is small; when they differ, the loss grows.

**From Single to Aggregate**: The cost function takes all the individual losses and averages them. This gives us a single number that summarizes how well our model performs across the entire training set. We then optimize this single number to improve overall performance.

**Present and explain the key equations used in the lesson**

## Key Equations

**Linear Combination**: For a given input $\mathbf{x}^{(i)}$ and parameters $\mathbf{w}$ and $b$, we compute:
$$z^{(i)} = \mathbf{w}^T\mathbf{x}^{(i)} + b$$

**Sigmoid Activation Function**: The sigmoid function maps any real number to a probability between 0 and 1:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

**Predicted Output**: Combining the linear combination with the sigmoid activation:
$$\hat{y}^{(i)} = \sigma(z^{(i)})$$

Or equivalently:
$$\hat{y} = \sigma(\mathbf{w}^T\mathbf{x} + b)$$

**Logistic Regression Loss Function**: For a single training example, the loss is:
$$L(\hat{y}, y) = -(y \log(\hat{y}) + (1-y) \log(1-\hat{y}))$$

This function ensures that:
- When $y = 1$: loss = $-\log(\hat{y})$ (penalizes low predictions)
- When $y = 0$: loss = $-\log(1-\hat{y})$ (penalizes high predictions)
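
A short numeric illustration of the two cases, using made-up predictions of 0.9 and 0.1. A confident correct prediction gives a small loss, and a confident wrong one gives a much larger loss.

```python
import numpy as np

def loss(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

loss_right_y1 = loss(0.9, 1)   # y = 1, prediction near 1
loss_wrong_y1 = loss(0.1, 1)   # y = 1, prediction near 0
loss_right_y0 = loss(0.1, 0)   # y = 0, prediction near 0
loss_wrong_y0 = loss(0.9, 0)   # y = 0, prediction near 1

print(loss_right_y1, loss_wrong_y1, loss_right_y0, loss_wrong_y0)
```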

**Cost Function**: The average loss across all $m$ training examples:
$$J(\mathbf{w}, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$

Expanded form:
$$J(\mathbf{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)})]$$

**Implement code primitive: Compute the linear combination $z$ for a given input $x$ and parameters $W, b$.**

In [None]:
import numpy as np

def compute_z(x, W, b):
    """
    Compute the linear combination z = W^T * x + b
    
    Args:
        x: Input feature vector (shape: (n_features,))
        W: Weight vector (shape: (n_features,))
        b: Bias term (scalar)
    
    Returns:
        z: Linear combination (scalar)
    """
    z = np.dot(W, x) + b
    return z

**Implement code primitive: Compute the sigmoid activation function.**

In [None]:
def sigmoid(z):
    """
    Compute the sigmoid activation function: sigma(z) = 1 / (1 + e^(-z))
    
    Args:
        z: Input value or array (scalar or numpy array)
    
    Returns:
        Sigmoid output (same shape as input, values between 0 and 1)
    """
    sigma = 1 / (1 + np.exp(-z))
    return sigma

**Implement code primitive: Calculate the predicted output $\hat{y}$ for a given training example.**

In [None]:
def predict(x, W, b):
    """
    Calculate the predicted output y_hat for a given training example.
    
    Args:
        x: Input feature vector (shape: (n_features,))
        W: Weight vector (shape: (n_features,))
        b: Bias term (scalar)
    
    Returns:
        y_hat: Predicted probability (scalar, between 0 and 1)
    """
    z = compute_z(x, W, b)
    y_hat = sigmoid(z)
    return y_hat

**Implement code primitive: Implement the logistic regression loss function for a single training example.**

In [None]:
def loss(y_hat, y):
    """
    Compute the logistic regression loss for a single training example.
    L(y_hat, y) = -(y * log(y_hat) + (1 - y) * log(1 - y_hat))
    
    Args:
        y_hat: Predicted probability (scalar, between 0 and 1)
        y: True label (scalar, 0 or 1)
    
    Returns:
        L: Loss value (scalar)
    """
    epsilon = 1e-15
    y_hat = np.clip(y_hat, epsilon, 1 - epsilon)
    L = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return L

**Implement code primitive: Calculate the overall cost function for the entire training set.**

In [None]:
def cost(X, Y, W, b):
    """
    Calculate the cost function for the entire training set.
    J(W, b) = (1/m) * sum(L(y_hat^(i), y^(i))) for i = 1 to m
    
    Args:
        X: Training feature matrix (shape: (m, n_features))
        Y: Training labels vector (shape: (m,))
        W: Weight vector (shape: (n_features,))
        b: Bias term (scalar)
    
    Returns:
        J: Cost value (scalar)
    """
    m = X.shape[0]
    total_loss = 0
    
    for i in range(m):
        y_hat_i = predict(X[i], W, b)
        total_loss += loss(y_hat_i, Y[i])
    
    J = total_loss / m
    return J
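
The per-example loop above can also be written without a loop. The sketch below is one possible vectorized version, assuming the same row-per-example layout (`X` of shape `(m, n_features)`); `cost_vectorized` and `cost_loop` are hypothetical names, and the loop version is repeated only to check that both agree on made-up data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_vectorized(X, Y, W, b, epsilon=1e-15):
    # X: (m, n_features), Y: (m,), W: (n_features,), b: scalar
    y_hat = np.clip(sigmoid(X @ W + b), epsilon, 1 - epsilon)
    return -np.mean(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))

def cost_loop(X, Y, W, b, epsilon=1e-15):
    # Reference implementation: one example at a time
    total = 0.0
    for i in range(X.shape[0]):
        y_hat_i = np.clip(sigmoid(np.dot(W, X[i]) + b), epsilon, 1 - epsilon)
        total += -(Y[i] * np.log(y_hat_i) + (1 - Y[i]) * np.log(1 - y_hat_i))
    return total / X.shape[0]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))                      # made-up data
Y_demo = (rng.random(50) > 0.5).astype(float)
W_demo = rng.normal(size=3)

J_vec = cost_vectorized(X_demo, Y_demo, W_demo, 0.2)
J_loop = cost_loop(X_demo, Y_demo, W_demo, 0.2)
J_zero = cost_vectorized(X_demo, Y_demo, np.zeros(3), 0.0)  # every prediction is 0.5
print(J_vec, J_loop, J_zero)
```

With all parameters at zero every prediction is 0.5, so the cost equals $\log 2 \approx 0.693$ whatever the labels are, which makes a handy sanity check.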

**Implement code primitive: Set up an optimization process to find parameters $W$ and $b$ that minimize the cost function $J$.**

In [None]:
def gradient_descent(X, Y, W, b, learning_rate=0.01, iterations=1000):
    """
    Optimize parameters W and b using gradient descent to minimize cost function J.
    
    Args:
        X: Training feature matrix (shape: (m, n_features))
        Y: Training labels vector (shape: (m,))
        W: Initial weight vector (shape: (n_features,))
        b: Initial bias term (scalar)
        learning_rate: Step size for gradient descent (default: 0.01)
        iterations: Number of iterations (default: 1000)
    
    Returns:
        W: Optimized weight vector
        b: Optimized bias term
        costs: List of cost values at each iteration
    """
    m = X.shape[0]
    costs = []
    
    for iteration in range(iterations):
        dW = np.zeros_like(W)
        db = 0
        
        for i in range(m):
            y_hat_i = predict(X[i], W, b)
            error = y_hat_i - Y[i]
            dW += error * X[i]
            db += error
        
        dW /= m
        db /= m
        
        W = W - learning_rate * dW
        b = b - learning_rate * db
        
        J = cost(X, Y, W, b)
        costs.append(J)
    
    return W, b, costs

## Lesson 4

**Explain the core concepts of the lesson**

## Core Concepts of Gradient Descent for Logistic Regression

Gradient descent is an optimization algorithm used to find the parameters that minimize a cost function. In the context of logistic regression, we aim to find the optimal values of parameters $w$ and $b$ that make our model's predictions as accurate as possible.

**Key Concepts:**

- **Cost Function Minimization**: The goal is to find parameters that make the cost function as small as possible. The cost function measures how well our model performs on the training data.

- **Convex Cost Function**: The cost function for logistic regression is convex, so any minimum it has is a global minimum and there are no other local optima to get stuck in. With a suitably chosen learning rate, gradient descent therefore heads toward the same optimal solution regardless of where we initialize the parameters.

- **Non-convex Functions**: In contrast, some functions have multiple local optima, making optimization more challenging.

- **Global Optimum vs. Local Optima**: A global optimum is the best solution across the entire parameter space, while local optima are solutions that are better than nearby points but not necessarily the best overall.

- **Parameter Initialization**: We start the algorithm by initializing parameters $w$ and $b$ to specific values, typically zero.

- **Learning Rate**: The learning rate $\alpha$ controls the step size we take in each iteration. A larger learning rate means bigger steps, while a smaller learning rate means smaller, more careful steps.

- **Derivative and Partial Derivative**: The derivative indicates the slope of the cost function. For functions with multiple parameters, we use partial derivatives to understand how the cost changes with respect to each parameter individually.
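
The effect of the learning rate described in the list above can be seen on a toy problem. The sketch below is only an illustration: it assumes the one-parameter cost $J(w) = w^2$ (not the logistic regression cost), and the function name `run_gradient_descent` and the three values of $\alpha$ are hypothetical choices.

```python
def run_gradient_descent(alpha, w_start=5.0, steps=20):
    """Minimize the toy cost J(w) = w**2, whose derivative is dJ/dw = 2*w."""
    w = w_start
    for _ in range(steps):
        dw = 2 * w            # slope of J at the current w
        w = w - alpha * dw    # gradient descent update
    return w

# Small, moderate, and overly large learning rates
for alpha in (0.01, 0.1, 1.1):
    w_final = run_gradient_descent(alpha)
    print(f"alpha={alpha}: w after 20 steps = {w_final:.4f}")
```

With the small rate, $w$ creeps toward the minimum at 0; with the moderate rate it gets much closer in the same number of steps; with the overly large rate each step overshoots and $w$ moves farther from 0.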

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**The Landscape Analogy:**
Imagine the cost function as a landscape with hills and valleys. Gradient descent is like walking downhill to find the lowest point. At each step, we look around to find the steepest downward direction and take a step in that direction.

**Why Convexity Matters:**
A convex cost function is like a single bowl—no matter where you start on the bowl's surface, if you always walk downhill, you'll eventually reach the same lowest point at the bottom. This is why logistic regression's convex cost function is so powerful: we're guaranteed to find the global optimum.
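
The bowl picture can be checked numerically. This is a minimal sketch under stated assumptions: it uses a simple convex stand-in, $J(w, b) = (w - 3)^2 + (b + 1)^2$, rather than the logistic regression cost, and the function name `descend` and the starting points are hypothetical.

```python
def descend(w, b, alpha=0.1, steps=200):
    """Gradient descent on the convex bowl J(w, b) = (w - 3)**2 + (b + 1)**2."""
    for _ in range(steps):
        dw = 2 * (w - 3)   # partial derivative of J with respect to w
        db = 2 * (b + 1)   # partial derivative of J with respect to b
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Three very different starting points on the bowl
starts = [(0.0, 0.0), (10.0, -10.0), (-50.0, 25.0)]
for w0, b0 in starts:
    w_end, b_end = descend(w0, b0)
    print(f"start=({w0}, {b0}) -> end=({w_end:.4f}, {b_end:.4f})")
```

Every starting point ends up at essentially the same place, the bottom of the bowl at $w = 3$, $b = -1$.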

**Following the Slope:**
Gradient descent takes steps in the direction that decreases the cost most steeply. The derivative tells us the slope of the function at our current position. Because each update is the learning rate times the derivative, a steep slope produces a larger step and a gentle slope produces a smaller one, so the steps shrink naturally as we approach the minimum, where the slope is zero.

**The Role of Derivatives:**
The derivative indicates which direction to step to go downhill. For a function with multiple parameters like $J(w, b)$, we compute partial derivatives with respect to each parameter. These partial derivatives tell us how sensitive the cost is to changes in each parameter, guiding our updates.

**Present and explain the key equations used in the lesson**

## Key Equations

**Cost Function for Logistic Regression:**
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^m [y^{(i)}\log(a^{(i)}) + (1-y^{(i)})\log(1-a^{(i)})]$$

where $m$ is the number of training examples, $y^{(i)}$ is the true label, and $a^{(i)}$ is the predicted probability.

**Parameter Update Rules:**

For parameter $w$:
$$w := w - \alpha \frac{\partial J(w,b)}{\partial w}$$

For parameter $b$:
$$b := b - \alpha \frac{\partial J(w,b)}{\partial b}$$

where $\alpha$ is the learning rate, and $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$ are the partial derivatives of the cost function with respect to $w$ and $b$ respectively.

**Simplified Notation:**

We often denote the partial derivatives as:
- $dw = \frac{\partial J(w,b)}{\partial w}$
- $db = \frac{\partial J(w,b)}{\partial b}$

So the update rules become:
$$w := w - \alpha \cdot dw$$
$$b := b - \alpha \cdot db$$
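
To make the update rules concrete, here is a minimal NumPy sketch of the full loop for logistic regression. Several things are assumptions rather than part of the lesson: the toy dataset, the helper names `sigmoid` and `cost`, the learning rate, and the gradient expressions $dw = \frac{1}{m}\sum_i (a^{(i)} - y^{(i)})x^{(i)}$ and $db = \frac{1}{m}\sum_i (a^{(i)} - y^{(i)})$, which are the standard results for this cost function.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b):
    """Cross-entropy cost J(w, b) averaged over the m training examples."""
    a = sigmoid(X @ w + b)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

# Hypothetical toy data: 4 examples, 2 features; the label follows the first feature
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(2)   # initialize parameters to zero
b = 0.0
alpha = 0.5       # learning rate (illustrative value)

initial_cost = cost(X, y, w, b)
for _ in range(100):
    a = sigmoid(X @ w + b)        # predicted probabilities
    dw = X.T @ (a - y) / len(y)   # dJ/dw, one entry per feature
    db = np.mean(a - y)           # dJ/db
    w = w - alpha * dw            # w := w - alpha * dw
    b = b - alpha * db            # b := b - alpha * db
final_cost = cost(X, y, w, b)

print(f"cost before: {initial_cost:.4f}, cost after: {final_cost:.4f}")
```

With all parameters at zero every prediction is 0.5, so the starting cost is $\log 2 \approx 0.693$; the updates then drive the cost down.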

**Implement code primitive: Initialize parameters w and b, typically to zero.**

In [None]:
w = 0
b = 0

**Implement code primitive: Implement an iterative update for parameter w using a learning rate and its derivative.**

In [None]:
learning_rate = 0.01
dw = 0.0  # placeholder; in practice dw holds the computed derivative dJ/dw
w = w - learning_rate * dw

**Implement code primitive: Implement an iterative update for parameter b using a learning rate and its derivative.**

In [None]:
learning_rate = 0.01
db = 0.0  # placeholder; in practice db holds the computed derivative dJ/db
b = b - learning_rate * db

**Implement code primitive: Represent the derivative terms dJ/dw and dJ/db as variables `dw` and `db` in code.**

In [None]:
dw = 0.0
db = 0.0

**Create a Mermaid diagram: A flowchart of the gradient descent loop: initialize W and B, then repeatedly compute dW and dB and update W and B until convergence**

```mermaid
graph TD
    A[Start Gradient Descent] --> B(Initialize W, B);
    B --> C{Repeat until convergence};
    C --> D[Compute dW = ∂J/∂W];
    C --> E[Compute dB = ∂J/∂B];
    D --> F[Update W = W - α * dW];
    E --> G[Update B = B - α * dB];
    F --> C;
    G --> C;
    C --> H[End: Optimal W, B found];
```

## Lesson 5

**Explain the core concepts of the lesson**

## Core Concepts

A **derivative** is fundamentally a measure of how a function changes. Specifically, it tells us the **slope** of a function—how much the output changes when we nudge the input by a tiny amount.

For a linear function like $f(a) = 3a$, the slope is constant everywhere. This means that no matter where you are on the line, nudging the input by a small amount always causes the output to change by three times that amount.

Key concepts:
- **Slope of a function**: The rate at which the output changes relative to the input
- **Rate of change**: How sensitive the output is to changes in the input
- **Nudging a variable**: Making a small change to the input to observe the corresponding change in output
- **Infinitesimal change**: In calculus, we consider the limit as the nudge approaches zero, but for intuition, a small nudge like 0.001 is sufficient
- **Constant slope**: For linear functions, the slope is the same everywhere

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**The Slope as a Triangle**: Imagine a right triangle on the graph of a function. The height of the triangle represents the vertical change in the output (how much $f(a)$ changed), and the width represents the horizontal change in the input (how much $a$ changed). The slope is simply the ratio of height to width:

$$\text{slope} = \frac{\text{height}}{\text{width}}$$

**Constant Slope for Linear Functions**: For a straight line like $f(a) = 3a$, this ratio is always 3. Whether you look at the change from $a=0$ to $a=1$, or from $a=2$ to $a=3$, the output always changes by 3 times the input change.

**Sensitivity Interpretation**: The derivative tells you how sensitive a function is to changes in its input. A steep slope (large derivative) means the output is very sensitive to input changes. A flat slope (small derivative) means the output is insensitive.

**Practical Relevance**: In deep learning, you don't need to master calculus deeply. The key insight is that derivatives measure the sensitivity of outputs to input changes—this is essential for understanding how neural networks learn through gradient descent.

**Present and explain the key equations used in the lesson**

## Key Equations

The linear function we'll examine is:

$$f(a) = 3a$$

The derivative of this function is:

$$\frac{df(a)}{da} = 3$$

Alternatively, this can be written as:

$$\frac{d}{da}f(a) = 3$$

Both notations mean the same thing: the rate of change of $f(a)$ with respect to $a$ is 3.

The general formula for slope as a ratio is:

$$\text{slope} = \frac{\text{height}}{\text{width}} = \frac{\Delta f(a)}{\Delta a}$$

where $\Delta f(a)$ is the change in the function output and $\Delta a$ is the change in the input.

**Implement code primitive: Evaluate a linear function f(a) = 3a at a specific point (e.g., a=2)**

In [None]:
def f(a):
    return 3 * a

a = 2
result = f(a)
print(f"f({a}) = {result}")

**Implement code primitive: Compute the change in f(a) when a is nudged by a small amount (e.g., from a=2 to a=2.001)**

In [None]:
a1 = 2
a2 = 2.001

f_a1 = f(a1)
f_a2 = f(a2)

change_in_a = a2 - a1
change_in_f = f_a2 - f_a1

print(f"f({a1}) = {f_a1}")
print(f"f({a2}) = {f_a2}")
print(f"Change in a: {change_in_a}")
print(f"Change in f(a): {change_in_f}")

**Implement code primitive: Calculate the ratio of output change to input change to demonstrate slope**

In [None]:
slope = change_in_f / change_in_a
print(f"Slope = {change_in_f} / {change_in_a} = {slope}")

**Implement code primitive: Verify that slope remains constant across different points on a linear function**

In [None]:
points = [0, 1, 5, 10]
nudge = 0.001

print("Verifying constant slope across different points:")
for a in points:
    f_a = f(a)
    f_a_nudged = f(a + nudge)
    slope_at_a = (f_a_nudged - f_a) / nudge
    print(f"At a={a}: slope = {slope_at_a}")

**Create a Mermaid diagram: A coordinate plot showing the linear function f(a) = 3a with two points marked (a=2, f(a)=6) and (a=2.001, f(a)=6.003), with a right triangle highlighting the height (0.003) and width (0.001) to illustrate the slope calculation**

```mermaid
graph TD
    subgraph plot["Linear Function f(a) = 3a"]
        A["Point 1: (a=2, f(a)=6)"]
        B["Point 2: (a=2.001, f(a)=6.003)"]
        C["Width (Δa) = 0.001"]
        D["Height (Δf) = 0.003"]
        E["Slope = Height/Width = 0.003/0.001 = 3"]
    end
    A --> C
    B --> D
    C --> E
    D --> E
```

## Lesson 6

**Explain the core concepts of the lesson**

## Core Concepts

A **derivative** measures how much a function's output changes when you make a tiny change to its input. It represents the **slope of a function at a specific point**.

Key ideas:
- **Slope of a function**: The rate at which the output changes relative to the input. For straight lines, the slope is constant everywhere. For curved functions, the slope varies at different points.
- **Rate of change**: How fast a function's output is changing at a particular input value.
- **Infinitesimal nudge**: A conceptually infinitely small change to the input, used to define derivatives precisely.
- **Variable-slope functions**: Functions like $f(a) = a^2$ where the slope is different at different points.
- **Constant-slope functions**: Linear functions where the slope is the same everywhere.
- **Derivative notation**: Written as $\frac{d}{da}f(a)$, which means "the derivative of $f$ with respect to $a$."

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**The core intuition**: A derivative is the slope of a function at a point. Imagine zooming in on a curved graph until it looks like a straight line—the slope of that line is the derivative.

**Why slopes vary**: Different points on a curved function have different slopes. For example, on $f(a) = a^2$, the function is flatter near $a = 0$ and steeper as $a$ increases.

**The nudge principle**: When you nudge the input by a tiny amount $\Delta a$, the output changes by approximately the derivative times that nudge. This is the practical way to estimate derivatives.

**Infinitesimal vs. finite nudges**: Derivatives are defined using infinitesimally small nudges (arbitrarily close to zero, but never exactly zero). When we use finite nudges like $0.001$, we get small approximation errors. The smaller the nudge, the better the approximation.
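
A quick way to see the approximation error shrink is to repeat the estimate with smaller and smaller nudges. The sketch below assumes $f(a) = a^2$ at $a = 2$, where the exact derivative is $2a = 4$; the particular nudge sizes are arbitrary choices.

```python
def f(a):
    return a ** 2

a = 2.0
exact = 2 * a  # exact derivative of a**2 at a = 2

errors = []
for nudge in (0.1, 0.01, 0.001, 0.0001):
    estimate = (f(a + nudge) - f(a)) / nudge   # finite-nudge slope estimate
    error = abs(estimate - exact)
    errors.append(error)
    print(f"nudge={nudge}: estimate={estimate:.6f}, error={error:.6f}")
```

For this function the error is roughly equal to the nudge itself, because $(a + \Delta a)^2 - a^2 = 2a\,\Delta a + (\Delta a)^2$.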

**Using formulas**: Rather than computing derivatives from scratch each time, you can look up derivative formulas in a calculus table. Common formulas include $\frac{d}{da}(a^2) = 2a$ and $\frac{d}{da}(a^3) = 3a^2$.

**Present and explain the key equations used in the lesson**

## Key Equations

**Definition of a derivative**:
$$\frac{d}{da}f(a) = \text{slope at point } a$$

**Common derivative formulas**:
$$\frac{d}{da}(a^2) = 2a$$

$$\frac{d}{da}(a^3) = 3a^2$$

$$\frac{d}{da}(\log(a)) = \frac{1}{a}$$

Here $\log$ denotes the natural logarithm.

**The nudge principle**:
$$\Delta f(a) \approx \frac{d}{da}f(a) \cdot \Delta a$$

This equation tells us that the change in output ($\Delta f(a)$) is approximately equal to the derivative at that point times the change in input ($\Delta a$).
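
The nudge principle also gives a way to sanity-check each formula listed above. This is a sketch, not a proof: it compares the actual change in output with the predicted change (derivative times nudge) at the single point $a = 2$, and the `checks` table and `results` dictionary are hypothetical names.

```python
import math

# Each entry: (label, function, derivative formula being checked)
checks = [
    ("a^2",    lambda a: a ** 2,      lambda a: 2 * a),
    ("a^3",    lambda a: a ** 3,      lambda a: 3 * a ** 2),
    ("log(a)", lambda a: math.log(a), lambda a: 1 / a),   # natural logarithm
]

a = 2.0
nudge = 0.001
results = {}
for label, func, deriv in checks:
    actual = func(a + nudge) - func(a)   # measured change in output
    predicted = deriv(a) * nudge         # change predicted by the formula
    results[label] = (actual, predicted)
    print(f"{label}: actual={actual:.8f}, predicted={predicted:.8f}")
```

The two columns agree to several decimal places; the small remaining gap comes from using a finite nudge.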

**Implement code primitive: Compute the output of a function at a given point (e.g., f(a) = a² at a=2)**

In [None]:
def f(a):
    return a**2

a = 2
output = f(a)
print(f"f({a}) = {output}")

**Implement code primitive: Nudge an input variable by a small amount and observe the change in output**

In [None]:
def f(a):
    return a**2

a = 2
nudge = 0.001

output_before = f(a)
output_after = f(a + nudge)
output_change = output_after - output_before

print(f"f({a}) = {output_before}")
print(f"f({a + nudge}) = {output_after}")
print(f"Change in output: {output_change}")

**Implement code primitive: Calculate the ratio of output change to input change to estimate slope**

In [None]:
def f(a):
    return a**2

a = 2
nudge = 0.001

output_before = f(a)
output_after = f(a + nudge)
output_change = output_after - output_before

slope_estimate = output_change / nudge

print(f"Output change: {output_change}")
print(f"Input change: {nudge}")
print(f"Estimated slope (derivative): {slope_estimate}")

**Implement code primitive: Verify derivative formulas by comparing predicted changes (derivative × nudge) with actual computed changes**

In [None]:
def f(a):
    return a**2

def derivative_f(a):
    return 2 * a

a = 2
nudge = 0.001

output_before = f(a)
output_after = f(a + nudge)
actual_change = output_after - output_before

predicted_change = derivative_f(a) * nudge

print(f"Actual change in output: {actual_change}")
print(f"Predicted change (derivative × nudge): {predicted_change}")
print(f"Difference: {abs(actual_change - predicted_change)}")

**Create a Mermaid diagram: A flowchart showing the process of estimating a derivative: pick a point, nudge the input slightly, compute the output change, calculate the ratio of changes**

## Process of Estimating a Derivative

```mermaid
graph TD
    A["Pick a point a"] --> B["Nudge input: compute f(a + Δa)"]
    B --> C["Compute output change: Δf = f(a + Δa) - f(a)"]
    C --> D["Calculate ratio: Δf / Δa"]
    D --> E["Ratio ≈ derivative at point a"]
```

**Create a Mermaid diagram: A diagram illustrating how the slope (height-to-width ratio of a small triangle) varies at different points on a curved function like f(a) = a²**

## Slope Varies at Different Points

```mermaid
graph TD
    A["Function: f(a) = a²"] --> B["At a = 0: slope = 0 (flat)"]
    A --> C["At a = 1: slope = 2 (moderate)"]
    A --> D["At a = 2: slope = 4 (steep)"]
    A --> E["At a = 3: slope = 6 (steeper)"]
    B --> F["Slope = height/width of small triangle"]
    C --> F
    D --> F
    E --> F
```

## Lesson 7

**Explain the core concepts of the lesson**

## Core Concepts

A **computation graph** is a visual representation of how a calculation is organized, showing the flow of data from inputs to outputs through intermediate steps.

Key concepts in computation graphs:

- **Forward Pass**: The process of computing the output value by moving from left to right through the graph, starting from inputs and progressing through intermediate variables to the final output.

- **Backward Pass**: The process of computing derivatives by moving from right to left through the graph, which enables efficient calculation of gradients for optimization.

- **Intermediate Variables**: Variables computed during the forward pass that represent intermediate results in the calculation (e.g., $u$, $V$).

- **Output Variable**: The final result of the computation (e.g., $J$), which is often the cost function in machine learning contexts.

- **Gradients**: The derivatives of the output with respect to each input, computed during the backward pass.

- **Derivatives**: Mathematical measures of how the output changes with respect to inputs, essential for optimization.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Visual Organization of Computation**: A computation graph visually organizes the steps of a calculation from initial inputs to a final output. Think of it as a flowchart where each node represents a computation and each edge represents data flowing from one computation to the next. This visual structure makes it easy to understand the dependencies between variables.

**Forward Pass as Left-to-Right Evaluation**: The forward pass computes the value of the output by moving from left to right through the graph. Start with the input values, compute each intermediate variable in sequence, and finally arrive at the output. This is the natural way we think about evaluating a mathematical expression.

**Backward Pass as Right-to-Left Differentiation**: The backward pass computes derivatives by moving from right to left, which is a natural organization for optimization. Starting from the output, we work backwards through the graph, applying the chain rule to compute how changes in each variable affect the final output. This organization is the foundation of backpropagation in neural networks.

**Present and explain the key equations used in the lesson**

## Key Equations

The computation graph organizes the following equations:

**Intermediate variable $u$:**
$$u = bc$$

**Intermediate variable $V$:**
$$V = a + u$$

**Final output $J$:**
$$J = 3V$$

**Combined form:**
$$J = 3(a + bc)$$

These equations define the forward pass computation. Each equation represents a node in the computation graph, and the variables flow from left to right as inputs are transformed into the final output.

**Implement code primitive: Compute an intermediate variable 'u' as the product of 'b' and 'c'.**

In [None]:
b = 2
c = 3
u = b * c
print(f"u = {u}")

**Implement code primitive: Compute an intermediate variable 'V' as the sum of 'a' and 'u'.**

In [None]:
a = 5
V = a + u
print(f"V = {V}")

**Implement code primitive: Compute the final output 'J' as three times 'V'.**

In [None]:
J = 3 * V
print(f"J = {J}")

**Create a Mermaid diagram: The computation graph for J = 3(a + bc), with inputs b and c feeding u = b * c, inputs a and u feeding V = a + u, and V feeding J = 3 * V**

## Computation Graph Visualization

```mermaid
graph LR
    B[b] --> U_node
    C[c] --> U_node
    U_node("u = b * c") --> V_node
    A[a] --> V_node
    V_node("V = a + u") --> J_node
    J_node("J = 3 * V")
```

This diagram shows the flow of computation from inputs ($a$, $b$, $c$) through intermediate variables ($u$, $V$) to the final output ($J$). The forward pass evaluates from left to right, while the backward pass would compute gradients from right to left.
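
Before computing any derivatives formally, the nudging idea from the earlier lessons already hints at what the backward pass will produce. The sketch below wraps the forward pass in a hypothetical function `forward`, reuses the example values $a = 5$, $b = 2$, $c = 3$ from the cells above, and nudges one input at a time.

```python
def forward(a, b, c):
    """Forward pass through the graph: u = b*c, V = a + u, J = 3*V."""
    u = b * c
    V = a + u
    J = 3 * V
    return J

a, b, c = 5.0, 2.0, 3.0
nudge = 0.001
J = forward(a, b, c)

# Change in J per unit change in each input, holding the others fixed
sens_a = (forward(a + nudge, b, c) - J) / nudge
sens_b = (forward(a, b + nudge, c) - J) / nudge
sens_c = (forward(a, b, c + nudge) - J) / nudge

print(f"J = {J}")
print(f"sensitivity to a: {sens_a:.4f}")
print(f"sensitivity to b: {sens_b:.4f}")
print(f"sensitivity to c: {sens_c:.4f}")
```

The ratios come out close to 3, 9, and 6, which match $3$, $3c$, and $3b$ from the combined form $J = 3(a + bc)$.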

## Lesson 8

**Explain the core concepts of the lesson**

## Core Concepts

**Computation Graph**: A visual representation of how variables flow and transform through a series of calculations. Each node represents a variable or operation, and edges show dependencies between them.

**Backpropagation**: An algorithm that computes derivatives by traversing a computation graph from the output backwards to the inputs. It efficiently calculates how changes in inputs affect the final output.

**Chain Rule**: A fundamental calculus principle that allows us to compute derivatives of composite functions. If variable A affects B and B affects C, the total effect of A on C is the product of the individual effects: $\frac{dC}{dA} = \frac{dC}{dB} \cdot \frac{dB}{dA}$.

**Partial Derivatives**: The rate of change of a function with respect to one variable while holding others constant. In a computation graph, we compute partial derivatives at each step.

**Right-to-Left Computation**: The direction of backpropagation—we start from the final output and work backwards through the graph, computing derivatives as we go.

**Intermediate Variables**: Variables that appear in the middle of a computation, neither at the input nor at the final output. Their derivatives are computed as stepping stones to reach the input derivatives.

**Gradient Notation**: A shorthand way to represent derivatives. We use $dv$ to mean "the derivative of the final output with respect to variable $v$" rather than writing the full expression.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**The Flow Perspective**: Think of a computation graph as a river system. Water (the input) flows through channels (operations) and eventually reaches the ocean (the output). To understand how changes upstream affect the ocean, we trace the flow backwards from the ocean to the source.

**Breaking Down Complexity**: The chain rule lets you break down complex derivatives into simpler pieces. Instead of computing one giant derivative, you compute small local derivatives at each step and multiply them together. This is like understanding a long causal chain by examining each link individually.

**Efficiency Through Reuse**: Computing derivatives efficiently means working backwards through the graph, reusing previously computed derivatives to calculate new ones. Once you know how the output changes with respect to one intermediate variable, you can use that result to find how it changes with respect to earlier variables. This avoids redundant calculations and is much faster than computing each derivative independently.

**Shorthand Notation**: In code, we use shorthand notation like $dv$ to mean the derivative of the final output with respect to variable $v$, rather than writing out the full derivative expression each time. This keeps the code clean and readable while maintaining mathematical precision.

**Present and explain the key equations used in the lesson**

## Key Equations

Consider the computation graph used in this lesson's code, where $u = a + b$, $v = u \cdot c$, and $J = 3v$. The final output $J$ depends on the intermediate variables $u$ and $v$, which in turn depend on the inputs $a$, $b$, and $c$.

**Starting Point**: The derivative of the output with respect to itself is always 1, so the first informative step is the derivative with respect to $v$. Because $J = 3v$:

$$\frac{dJ}{dv} = 3$$

**Applying the Chain Rule**: To find how $J$ changes with respect to an earlier variable, we multiply the derivative of $J$ with respect to the intermediate variable by the local derivative of that intermediate variable. Both $u$ and $c$ feed directly into $v$:

$$\frac{dJ}{du} = \frac{dJ}{dv} \cdot \frac{dv}{du} = 3c$$

$$\frac{dJ}{dc} = \frac{dJ}{dv} \cdot \frac{dv}{dc} = 3u$$

**Continuing Backwards**: The inputs $a$ and $b$ reach $J$ only through $u$, so we reuse $\frac{dJ}{du}$:

$$\frac{dJ}{da} = \frac{dJ}{du} \cdot \frac{du}{da} = 3c$$

$$\frac{dJ}{db} = \frac{dJ}{du} \cdot \frac{du}{db} = 3c$$

Each equation shows how to combine a derivative from the previous step with a local derivative to compute the derivative with respect to an earlier variable.

**Implement code primitive: Implement variable naming convention where 'dvar' represents the derivative of the final output variable J with respect to intermediate variable 'var'**

In [None]:
# Variable naming convention: dvar = derivative of J with respect to var

# Example computation
a = 2
b = 3
c = 5

# Forward pass
u = a + b
v = u * c
J = 3 * v

# Backward pass using the naming convention
dv = 3        # dJ/dv, since J = 3 * v
du = dv * c   # dJ/du = dJ/dv * dv/du, with dv/du = c
da = du * 1   # dJ/da = dJ/du * du/da, with du/da = 1
db = du * 1   # dJ/db = dJ/du * du/db, with du/db = 1
dc = dv * u   # dJ/dc = dJ/dv * dv/dc, with dv/dc = u

print(f"dv = dJ/dv = {dv}")
print(f"du = dJ/du = {du}")
print(f"da = dJ/da = {da}")
print(f"db = dJ/db = {db}")
print(f"dc = dJ/dc = {dc}")

**Implement code primitive: Compute derivatives step-by-step by traversing the computation graph from right to left, storing intermediate derivative values**

In [None]:
# Forward pass: compute J from inputs
a = 2
b = 3
c = 5

u = a + b
v = u * c
J = 3 * v

print("Forward pass:")
print(f"u = a + b = {u}")
print(f"v = u * c = {v}")
print(f"J = 3 * v = {J}")
print()

# Backward pass: traverse right to left, storing intermediate derivatives
print("Backward pass (right to left):")

# Step 1: Start from the output
dJ_dv = 3
print(f"dJ/dv = {dJ_dv}")

# Step 2: Move to u
dJ_du = dJ_dv * c
print(f"dJ/du = dJ/dv * dv/du = {dJ_dv} * {c} = {dJ_du}")

# Step 3: Move to a and b
dJ_da = dJ_du * 1
dJ_db = dJ_du * 1
print(f"dJ/da = dJ/du * du/da = {dJ_du} * 1 = {dJ_da}")
print(f"dJ/db = dJ/du * du/db = {dJ_du} * 1 = {dJ_db}")

# Step 4: Move to c
dJ_dc = dJ_dv * u
print(f"dJ/dc = dJ/dv * dv/dc = {dJ_dv} * {u} = {dJ_dc}")

**Implement code primitive: Use the chain rule to combine derivatives: multiply the derivative of the output with respect to an intermediate variable by the derivative of that intermediate variable with respect to the input**

In [None]:
# Chain rule: dJ/dx = dJ/dy * dy/dx

# Forward pass
a = 2
b = 3
c = 5

u = a + b
v = u * c
J = 3 * v

# Backward pass: apply chain rule at each step
print("Applying the chain rule:")
print()

# dJ/dv is given
dJ_dv = 3
print(f"dJ/dv = {dJ_dv}")
print()

# Chain rule: dJ/du = dJ/dv * dv/du
dv_du = c  # derivative of v with respect to u
dJ_du = dJ_dv * dv_du
print(f"dJ/du = dJ/dv * dv/du = {dJ_dv} * {dv_du} = {dJ_du}")
print()

# Chain rule: dJ/da = dJ/du * du/da
du_da = 1  # derivative of u with respect to a
dJ_da = dJ_du * du_da
print(f"dJ/da = dJ/du * du/da = {dJ_du} * {du_da} = {dJ_da}")
print()

# Chain rule: dJ/db = dJ/du * du/db
du_db = 1  # derivative of u with respect to b
dJ_db = dJ_du * du_db
print(f"dJ/db = dJ/du * du/db = {dJ_du} * {du_db} = {dJ_db}")
print()

# Chain rule: dJ/dc = dJ/dv * dv/dc
dv_dc = u  # derivative of v with respect to c
dJ_dc = dJ_dv * dv_dc
print(f"dJ/dc = dJ/dv * dv/dc = {dJ_dv} * {dv_dc} = {dJ_dc}")

**Create a Mermaid diagram: A flowchart showing the forward pass (left to right) computing J from inputs a, b, c, u, v, and the backward pass (right to left) computing derivatives dJ/dv, dJ/da, dJ/du, dJ/db, dJ/dc**

## Computation Graph: Forward and Backward Pass

```mermaid
graph TD
    A["a"] --> U["u = a + b"]
    B["b"] --> U
    U --> V["v = u * c"]
    C["c"] --> V
    V --> J["J = 3 * v"]
    
    J --> dJ_dv["dJ/dv = 3"]
    dJ_dv --> dJ_du["dJ/du = dJ/dv * c"]
    dJ_dv --> dJ_dc["dJ/dc = dJ/dv * u"]
    dJ_du --> dJ_da["dJ/da = dJ/du * 1"]
    dJ_du --> dJ_db["dJ/db = dJ/du * 1"]
    
    style A fill:#e1f5ff
    style B fill:#e1f5ff
    style C fill:#e1f5ff
    style U fill:#fff3e0
    style V fill:#fff3e0
    style J fill:#f3e5f5
    style dJ_dv fill:#f3e5f5
    style dJ_du fill:#fff3e0
    style dJ_dc fill:#fff3e0
    style dJ_da fill:#e1f5ff
    style dJ_db fill:#e1f5ff
```

The diagram shows two flows, drawn from top to bottom:
- **Forward pass (upper nodes)**: Inputs $a$, $b$, $c$ flow through intermediate variables $u$ and $v$ to produce output $J$. This is the left-to-right direction in the lesson's terminology.
- **Backward pass (lower nodes)**: Starting from $J$, derivatives are computed in the reverse, right-to-left order, with each derivative obtained by the chain rule from previously computed derivatives.
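
A common sanity check for a hand-written backward pass is to compare it against numerical estimates obtained by nudging each input. The sketch below does this for the graph in this lesson's code ($u = a + b$, $v = u \cdot c$, $J = 3v$); the function name `forward`, the step size `h`, and the `num_*` variable names are hypothetical, and the two-sided difference is one choice among several.

```python
def forward(a, b, c):
    """Forward pass: u = a + b, v = u * c, J = 3 * v."""
    u = a + b
    v = u * c
    return 3 * v

a, b, c = 2.0, 3.0, 5.0

# Backward pass (dvar means dJ/dvar)
u = a + b
dv = 3.0
du = dv * c
da = du * 1
db = du * 1
dc = dv * u

# Numerical estimates: nudge each input up and down by h
h = 1e-6
num_da = (forward(a + h, b, c) - forward(a - h, b, c)) / (2 * h)
num_db = (forward(a, b + h, c) - forward(a, b - h, c)) / (2 * h)
num_dc = (forward(a, b, c + h) - forward(a, b, c - h)) / (2 * h)

print(f"da: backprop={da}, numerical={num_da:.6f}")
print(f"db: backprop={db}, numerical={num_db:.6f}")
print(f"dc: backprop={dc}, numerical={num_dc:.6f}")
```

If the chain-rule derivatives are correct, each pair of numbers agrees up to floating-point rounding.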

## Lesson 9

**Explain the core concepts of the lesson**

## Core Concepts

This lesson focuses on understanding how gradient descent works for logistic regression by computing derivatives for a single training example.

**Logistic Regression Model**: A binary classification algorithm that uses a linear combination of inputs followed by a sigmoid activation function to produce a probability prediction.

**Computation Graph**: A visual representation of how data flows through the model during forward propagation and how gradients flow backward during backpropagation.

**Forward Propagation**: The process of computing predictions by passing inputs through the model. For logistic regression with two features:
- Compute the linear combination: $Z = W_1X_1 + W_2X_2 + B$
- Apply sigmoid activation: $A = \sigma(Z)$
- Calculate loss: $L(A, Y) = -(Y\log(A) + (1-Y)\log(1-A))$

**Backward Propagation**: The process of computing how much each parameter contributes to the loss by applying the chain rule to traverse the computation graph in reverse.

**Gradient Descent**: An optimization algorithm that updates model parameters iteratively by moving them in the direction opposite to the gradient, scaled by a learning rate $\alpha$.

**Partial Derivatives**: The rate of change of the loss with respect to each parameter. These derivatives guide parameter updates during training.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Computation Graphs as Blueprints**: Think of a computation graph as a blueprint showing how data flows through your model. Each node represents a calculation, and each edge represents data flowing from one calculation to the next. This visual structure makes it much easier to understand where gradients come from and how they propagate backward.

**Forward vs. Backward**: During forward propagation, you're asking "What prediction does my model make?" During backward propagation, you're asking "How much did each parameter contribute to the error?" These two passes work together to improve the model.

**The Chain Rule as a Multiplier**: The chain rule breaks down complex derivatives into simpler pieces. If you want to know how a parameter affects the loss, you multiply the derivatives along the path from that parameter to the loss. For example, to find how $W_1$ affects the loss, you compute: $\frac{dL}{dW_1} = \frac{dL}{dZ} \cdot \frac{dZ}{dW_1}$.

**Gradient Descent as a Compass**: The gradient points in the direction of steepest increase in loss. Gradient descent moves parameters in the opposite direction (negative gradient) to reduce loss. The learning rate $\alpha$ controls how large each step is—too large and you might overshoot the minimum, too small and training becomes very slow.

**Present and explain the key equations used in the lesson**

## Key Equations

**Forward Propagation**:

$$Z = W_1X_1 + W_2X_2 + B$$

$$A = \sigma(Z) = \frac{1}{1 + e^{-Z}}$$

$$L(A, Y) = -(Y\log(A) + (1-Y)\log(1-A))$$

**Backward Propagation (Derivatives)**:

The derivative of loss with respect to the activation:
$$\frac{dL}{dA} = -\frac{Y}{A} + \frac{1-Y}{1-A}$$

The derivative of activation with respect to the linear combination:
$$\frac{dA}{dZ} = A(1-A)$$

Combining these using the chain rule:
$$\frac{dL}{dZ} = \frac{dL}{dA} \cdot \frac{dA}{dZ} = A - Y$$
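
The simplification to $A - Y$ follows from multiplying out the two derivatives above. Written step by step, using only those two expressions:

$$\begin{aligned}
\frac{dL}{dZ} &= \left(-\frac{Y}{A} + \frac{1-Y}{1-A}\right) \cdot A(1-A) \\
&= -Y(1-A) + (1-Y)A \\
&= -Y + YA + A - YA \\
&= A - Y
\end{aligned}$$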

Derivatives with respect to weights and bias:
$$\frac{dL}{dW_1} = X_1 \cdot \frac{dL}{dZ}$$

$$\frac{dL}{dW_2} = X_2 \cdot \frac{dL}{dZ}$$

$$\frac{dL}{dB} = \frac{dL}{dZ}$$

**Parameter Updates (Gradient Descent)**:

$$W_1 := W_1 - \alpha \cdot \frac{dL}{dW_1}$$

$$W_2 := W_2 - \alpha \cdot \frac{dL}{dW_2}$$

$$B := B - \alpha \cdot \frac{dL}{dB}$$

where $\alpha$ is the learning rate.

**Implement code primitive: Implement the forward pass to compute Z, A (predicted output), and the loss L for a single training example.**

In [None]:
import numpy as np

def forward_pass(X1, X2, Y, W1, W2, B):
    """
    Compute Z, A (predicted output), and loss L for a single training example.
    
    Args:
        X1, X2: Input features
        Y: Ground truth label (0 or 1)
        W1, W2: Weights
        B: Bias
    
    Returns:
        Z: Linear combination
        A: Sigmoid activation (prediction)
        L: Binary cross-entropy loss
    """
    Z = W1 * X1 + W2 * X2 + B
    A = 1 / (1 + np.exp(-Z))
    L = -(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    
    return Z, A, L

# Example usage
X1, X2, Y = 2.0, 3.0, 1
W1, W2, B = 0.5, -0.3, 0.1

Z, A, L = forward_pass(X1, X2, Y, W1, W2, B)
print(f"Z = {Z:.4f}")
print(f"A = {A:.4f}")
print(f"L = {L:.4f}")

**Implement code primitive: Implement the backward pass to compute partial derivatives dL/dA, dL/dZ, dL/dW1, dL/dW2, and dL/dB for a single training example.**

In [None]:
def backward_pass(X1, X2, Y, A, Z):
    """
    Compute partial derivatives for a single training example.
    
    Args:
        X1, X2: Input features
        Y: Ground truth label
        A: Sigmoid activation (prediction)
        Z: Linear combination
    
    Returns:
        dL_dA: Derivative of loss with respect to A
        dL_dZ: Derivative of loss with respect to Z
        dL_dW1: Derivative of loss with respect to W1
        dL_dW2: Derivative of loss with respect to W2
        dL_dB: Derivative of loss with respect to B
    """
    dL_dA = -Y / A + (1 - Y) / (1 - A)
    dA_dZ = A * (1 - A)
    dL_dZ = dL_dA * dA_dZ
    
    dL_dW1 = X1 * dL_dZ
    dL_dW2 = X2 * dL_dZ
    dL_dB = dL_dZ
    
    return dL_dA, dL_dZ, dL_dW1, dL_dW2, dL_dB

# Example usage (using values from forward pass)
dL_dA, dL_dZ, dL_dW1, dL_dW2, dL_dB = backward_pass(X1, X2, Y, A, Z)
print(f"dL/dA = {dL_dA:.4f}")
print(f"dL/dZ = {dL_dZ:.4f}")
print(f"dL/dW1 = {dL_dW1:.4f}")
print(f"dL/dW2 = {dL_dW2:.4f}")
print(f"dL/dB = {dL_dB:.4f}")
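
A common way to sanity-check hand-derived gradients like the ones above is to compare them against finite-difference estimates. The following is a minimal sketch, not part of the original lesson: the helper names `loss` and `numerical_grads` and the step size `eps` are assumptions introduced for illustration, and the inputs reuse the example values from the forward pass.

```python
import numpy as np

def loss(X1, X2, Y, W1, W2, B):
    """Binary cross-entropy loss for one example (same formula as forward_pass)."""
    Z = W1 * X1 + W2 * X2 + B
    A = 1 / (1 + np.exp(-Z))
    return -(Y * np.log(A) + (1 - Y) * np.log(1 - A))

def numerical_grads(X1, X2, Y, W1, W2, B, eps=1e-6):
    """Central-difference estimates of dL/dW1, dL/dW2, dL/dB (hypothetical helper)."""
    g_w1 = (loss(X1, X2, Y, W1 + eps, W2, B) - loss(X1, X2, Y, W1 - eps, W2, B)) / (2 * eps)
    g_w2 = (loss(X1, X2, Y, W1, W2 + eps, B) - loss(X1, X2, Y, W1, W2 - eps, B)) / (2 * eps)
    g_b = (loss(X1, X2, Y, W1, W2, B + eps) - loss(X1, X2, Y, W1, W2, B - eps)) / (2 * eps)
    return g_w1, g_w2, g_b

X1, X2, Y = 2.0, 3.0, 1
W1, W2, B = 0.5, -0.3, 0.1

# Analytic gradients via the chain rule, mirroring backward_pass
Z = W1 * X1 + W2 * X2 + B
A = 1 / (1 + np.exp(-Z))
dL_dA = -Y / A + (1 - Y) / (1 - A)
dL_dZ = dL_dA * A * (1 - A)
analytic = (X1 * dL_dZ, X2 * dL_dZ, dL_dZ)

numeric = numerical_grads(X1, X2, Y, W1, W2, B)

for name, a_val, n_val in zip(("dL/dW1", "dL/dW2", "dL/dB"), analytic, numeric):
    print(f"{name}: analytic = {a_val:.6f}, numeric = {n_val:.6f}")
```

If the two columns agree to several decimal places, the backward pass is very likely implemented correctly.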

**Implement code primitive: Implement the parameter updates for W1, W2, and B using the calculated derivatives and a specified learning rate.**

In [None]:
def update_parameters(W1, W2, B, dL_dW1, dL_dW2, dL_dB, learning_rate):
    """
    Update parameters using gradient descent.
    
    Args:
        W1, W2, B: Current parameters
        dL_dW1, dL_dW2, dL_dB: Gradients
        learning_rate: Learning rate (alpha)
    
    Returns:
        W1_new, W2_new, B_new: Updated parameters
    """
    W1_new = W1 - learning_rate * dL_dW1
    W2_new = W2 - learning_rate * dL_dW2
    B_new = B - learning_rate * dL_dB
    
    return W1_new, W2_new, B_new

# Example usage
learning_rate = 0.1
W1_new, W2_new, B_new = update_parameters(W1, W2, B, dL_dW1, dL_dW2, dL_dB, learning_rate)
print(f"W1: {W1:.4f} -> {W1_new:.4f}")
print(f"W2: {W2:.4f} -> {W2_new:.4f}")
print(f"B: {B:.4f} -> {B_new:.4f}")

**Create a Mermaid diagram: the computation graph for the logistic regression forward pass, from the inputs X1, X2 and parameters W1, W2, B through Z and A to the loss L.**

## Computation Graph for Logistic Regression

```mermaid
graph TD
    X1 --> Z_calc
    X2 --> Z_calc
    W1 --> Z_calc
    W2 --> Z_calc
    B --> Z_calc
    Z_calc["Z = W1*X1 + W2*X2 + B"] --> A_calc
    A_calc["A = sigmoid(Z)"] --> L_calc
    Y("Ground Truth Y") --> L_calc
    L_calc["L = Loss(A, Y)"]
```

This computation graph shows the forward propagation flow. During backward propagation, gradients flow in the reverse direction, allowing us to compute how each parameter contributes to the loss.

## Lesson 10

**Explain the core concepts of the lesson**

## Core Concepts

This lesson covers the fundamental components of implementing gradient descent for logistic regression with multiple training examples:

**Cost Function for Multiple Examples**: The overall cost $J(w,b)$ aggregates the loss across all $m$ training examples:
$$J(w,b) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log a^{(i)} + (1 - y^{(i)}) \log (1 - a^{(i)})]$$

**Loss Function for Single Example**: For each training example, we compute the individual loss, which contributes to the overall cost.

**Gradient Descent Algorithm**: An iterative optimization method that updates parameters by moving in the direction opposite to the gradient, scaled by a learning rate $\alpha$.

**Parameter Initialization**: Before training, parameters $w_1$, $w_2$, and $b$ are initialized (typically to zero).

**Gradient Accumulators**: During the forward and backward pass through all examples, we accumulate gradients in variables like $dw_1$, $dw_2$, and $db$.

**Derivative Averaging**: The accumulated gradients are divided by $m$ to obtain the average gradient, which represents the direction of steepest descent for the cost function.

**Learning Rate**: A hyperparameter $\alpha$ that controls the step size during parameter updates.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Averaging Gradients Across Examples**: To compute the overall gradient for the cost function across all training examples, you can average the derivatives calculated for each individual example. This is because the cost function itself is the average loss across all examples, so its gradient is naturally the average of individual gradients.

**The Problem with Explicit For-Loops**: Using explicit for-loops in deep learning code makes algorithms inefficient, especially when working with large datasets. When you have millions of training examples, iterating through them one by one with Python loops becomes a computational bottleneck.

**Vectorization as the Solution**: Vectorization is a critical technique in deep learning to eliminate explicit for-loops and significantly improve computational efficiency for large-scale data. By leveraging matrix operations and libraries like NumPy, we can process entire batches of examples simultaneously, achieving orders of magnitude speedup.

**Present and explain the key equations used in the lesson**

## Key Equations

**Forward Propagation**:
- Compute the linear combination: $z^{(i)} = w^T x^{(i)} + b$
- Apply sigmoid activation: $a^{(i)} = \sigma(z^{(i)})$

**Cost Function**:
$$J(w,b) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log a^{(i)} + (1 - y^{(i)}) \log (1 - a^{(i)})]$$

**Backward Propagation**:
- Compute the error for each example: $dz^{(i)} = a^{(i)} - y^{(i)}$
- Compute gradients for each parameter:
$$\frac{\partial J}{\partial w_1} = \frac{1}{m} \sum_{i=1}^{m} x_1^{(i)} dz^{(i)}$$
$$\frac{\partial J}{\partial w_2} = \frac{1}{m} \sum_{i=1}^{m} x_2^{(i)} dz^{(i)}$$
$$\frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)}$$

**Parameter Update**:
$$w_1 := w_1 - \alpha \cdot \frac{\partial J}{\partial w_1}$$
$$w_2 := w_2 - \alpha \cdot \frac{\partial J}{\partial w_2}$$
$$b := b - \alpha \cdot \frac{\partial J}{\partial b}$$
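
The shortcut $dz^{(i)} = a^{(i)} - y^{(i)}$ used in the backward propagation step follows from the chain rule applied to the single-example loss. A brief derivation, dropping the superscript $(i)$ for readability:

$$\frac{\partial L}{\partial a} = -\frac{y}{a} + \frac{1 - y}{1 - a}, \qquad \frac{\partial a}{\partial z} = a(1 - a)$$

$$dz = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} = \left(-\frac{y}{a} + \frac{1 - y}{1 - a}\right) a(1 - a) = -y(1 - a) + (1 - y)a = a - y$$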

**Implement code primitive: Initialize cost J and gradient accumulators (dw1, dw2, db) to zero.**

In [None]:
J = 0
dw1 = 0
dw2 = 0
db = 0

**Implement code primitive: Iterate through all 'm' training examples using a for loop.**

In [None]:
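# m is the number of training examples; the cells that follow form the loop body
for i in range(m):
    pass  # placeholder for the per-example computations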

**Implement code primitive: Inside the loop, compute z and the prediction a for the current example.**

In [None]:
# Runs inside the loop over i; X is assumed to have shape (m, 2), one row per example
z = w1 * X[i, 0] + w2 * X[i, 1] + b
a = 1 / (1 + np.exp(-z))

**Implement code primitive: Accumulate the loss for the current example into J.**

In [None]:
J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))

**Implement code primitive: Compute dz for the current example.**

In [None]:
dz = a - y[i]

**Implement code primitive: Accumulate individual gradients dw1, dw2, and db based on dz and features.**

In [None]:
dw1 += X[i, 0] * dz
dw2 += X[i, 1] * dz
db += dz

**Implement code primitive: After the loop, divide accumulated gradients (dw1, dw2, db) by 'm' to get the average gradients.**

In [None]:
J = J / m
dw1 = dw1 / m
dw2 = dw2 / m
db = db / m

**Implement code primitive: Update parameters (w1, w2, b) using the learning rate and the computed average gradients.**

In [None]:
w1 = w1 - alpha * dw1
w2 = w2 - alpha * dw2
b = b - alpha * db
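
The fragments above only run once they are placed inside a single loop with data and parameters defined. Assembled, one step of gradient descent might look like the sketch below. The toy dataset, the `(m, 2)` layout of `X`, and the value of `alpha` are assumptions chosen for illustration, not values from the lesson.

```python
import numpy as np

# Toy data (assumed for illustration): one row per example, two features
X = np.array([[0.5, 1.5],
              [1.0, -1.0],
              [-1.5, 0.5],
              [2.0, 1.0]])
y = np.array([1, 0, 0, 1])
m = X.shape[0]

w1, w2, b = 0.0, 0.0, 0.0   # parameters initialized to zero
alpha = 0.1                 # learning rate (assumed value)

# Initialize cost and gradient accumulators
J, dw1, dw2, db = 0.0, 0.0, 0.0, 0.0

for i in range(m):
    # Forward pass for example i
    z = w1 * X[i, 0] + w2 * X[i, 1] + b
    a = 1 / (1 + np.exp(-z))
    # Accumulate loss and gradients
    J += -(y[i] * np.log(a) + (1 - y[i]) * np.log(1 - a))
    dz = a - y[i]
    dw1 += X[i, 0] * dz
    dw2 += X[i, 1] * dz
    db += dz

# Average over the m examples
J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m

# Single gradient descent update
w1 -= alpha * dw1
w2 -= alpha * dw2
b -= alpha * db

print(f"J = {J:.4f}")
print(f"w1 = {w1:.4f}, w2 = {w2:.4f}, b = {b:.4f}")
```

Note that the code contains two kinds of loops: the explicit loop over examples, and an implicit "loop" over features written out by hand as `dw1` and `dw2`. Both are candidates for vectorization.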

**Create a Mermaid diagram: A flowchart illustrating the iterative steps of gradient descent for logistic regression, including parameter initialization, the loop for processing each training example (forward propagation, loss accumulation, backpropagation for individual gradients), averaging of gradients, and parameter updates. The flowchart should emphasize the single step of gradient descent logic.**

## Gradient Descent Flowchart

```mermaid
graph TD
    A["Initialize: w1, w2, b, J=0, dw1=0, dw2=0, db=0"] --> B["For each training example i=1 to m"]
    B --> C["Compute z = w1*x1 + w2*x2 + b"]
    C --> D["Compute a = sigmoid(z)"]
    D --> E["Accumulate loss: J += loss(a, y)"]
    E --> F["Compute dz = a - y"]
    F --> G["Accumulate gradients: dw1, dw2, db"]
    G --> H{"More examples?"}
    H -->|Yes| B
    H -->|No| I["Average gradients: dw1/m, dw2/m, db/m"]
    I --> J["Average cost: J/m"]
    J --> K["Update parameters: w1, w2, b"]
    K --> L["End of one gradient descent step"]
```

## Lesson 11

**Explain the core concepts of the lesson**

## Core Concepts

**Vectorization** is the practice of replacing explicit loops with built-in functions to perform computations more efficiently. In deep learning, vectorization is essential because it allows us to leverage hardware capabilities and built-in optimizations that significantly speed up code execution.

Key concepts include:

- **Explicit for loops**: Traditional Python loops that iterate through elements one at a time
- **Vectorized implementation**: Using NumPy and other libraries to perform operations on entire arrays at once
- **Non-vectorized implementation**: Implementation using explicit loops, which is slower
- **NumPy dot product**: The `np.dot()` function performs efficient matrix/vector multiplication
- **Deep learning performance**: Training neural networks on large datasets requires fast computation; vectorization is critical for reducing training time
- **Hardware parallelization**: Modern CPUs and GPUs can execute multiple operations in parallel through SIMD instructions and other mechanisms

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Why Vectorization Matters:**

Replacing explicit loops with built-in functions leads to significantly faster code. When you write a Python for loop, the interpreter executes each iteration one at a time and pays interpretation overhead on every element. A vectorized call instead hands the whole array to NumPy's precompiled C routines, which can use the CPU's SIMD instructions. Running on a GPU requires a GPU-aware library rather than NumPy itself.

**Performance in Deep Learning:**

Faster code is crucial for deep learning because training on large datasets can otherwise take a very long time, hindering the iterative experimentation cycle. When you're experimenting with different architectures or hyperparameters, slow training times make iteration impractical.

**Hardware Acceleration:**

Both CPUs and GPUs provide SIMD (single instruction, multiple data) instructions that apply one operation to several values at once. By avoiding explicit for loops, you let libraries like NumPy exploit this underlying parallelism. For simple element-wise work such as the dot product timed in this lesson, the vectorized version is often faster than the loop by a factor of a hundred or more.

**The Key Insight:**

The difference between vectorized and non-vectorized code is not just about convenience—it's about unlocking the full computational power of modern hardware.

**Present and explain the key equations used in the lesson**

## Key Equations

In deep learning, one of the fundamental computations is the linear transformation used in logistic regression and neural networks:

$$Z = W^T X + B$$

Where:
- $Z$ is the linear output (the pre-activation value that is later passed through the sigmoid)
- $W$ is the weight vector (a weight matrix in layers with several units)
- $X$ is the input data
- $B$ is the bias term
- $W^T$ denotes the transpose of $W$

This equation involves a dot product ($W^T X$), which is a perfect candidate for vectorization. Computing this efficiently across millions of samples is where vectorization provides the most dramatic performance improvements.

**Implement code primitive: Importing numpy and time libraries.**

In [None]:
import numpy as np
import time

**Implement code primitive: Generating large random NumPy arrays (e.g., million-dimensional).**

In [None]:
# Generate large random arrays
n = 1000000
a = np.random.rand(n)
b = np.random.rand(n)

**Implement code primitive: Measuring code execution time using `time.time()`.**

In [None]:
# Example: measuring execution time
start_time = time.time()
# ... code to measure ...
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.6f} seconds")

**Implement code primitive: Implementing a vectorized dot product using `np.dot(a, b)`.**

In [None]:
# Vectorized dot product
start_time = time.time()
z_vectorized = np.dot(a, b)
end_time = time.time()
time_vectorized = end_time - start_time
print(f"Vectorized dot product: {z_vectorized}")
print(f"Vectorized time: {time_vectorized:.6f} seconds")

**Implement code primitive: Implementing a non-vectorized dot product using an explicit for loop.**

In [None]:
# Non-vectorized dot product using explicit for loop
start_time = time.time()
z_non_vectorized = 0
for i in range(n):
    z_non_vectorized += a[i] * b[i]
end_time = time.time()
time_non_vectorized = end_time - start_time
print(f"Non-vectorized dot product: {z_non_vectorized}")
print(f"Non-vectorized time: {time_non_vectorized:.6f} seconds")

**Implement code primitive: Comparing the execution times of vectorized versus non-vectorized implementations.**

In [None]:
# Compare execution times
# Guard against a zero reading from time.time() on clocks with coarse resolution
speedup = time_non_vectorized / max(time_vectorized, 1e-9)
print("\nComparison:")
print(f"Vectorized time: {time_vectorized:.6f} seconds")
print(f"Non-vectorized time: {time_non_vectorized:.6f} seconds")
print(f"Speedup factor: {speedup:.1f}x")
print(f"\nVectorization is {speedup:.1f} times faster!")
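
A single `time.time()` measurement can be noisy, especially for an operation as short as one `np.dot` call. A somewhat more robust comparison, sketched below, repeats each measurement and keeps the fastest run. The helper names `best_time` and `loop_dot`, the repeat counts, and the reduced array size are assumptions for illustration; `time.perf_counter()` is the standard-library clock intended for timing short intervals.

```python
import time
import numpy as np

def best_time(fn, repeats=5):
    """Return the fastest of `repeats` wall-clock timings of fn() (hypothetical helper)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def loop_dot(a, b):
    """Non-vectorized dot product using an explicit Python loop."""
    total = 0.0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total

rng = np.random.default_rng(0)
n = 100_000  # smaller than the lesson's one million so the repeats stay quick
a = rng.random(n)
b = rng.random(n)

t_vec = best_time(lambda: np.dot(a, b))
t_loop = best_time(lambda: loop_dot(a, b), repeats=3)

print(f"Vectorized (best of 5):     {t_vec:.6f} seconds")
print(f"Non-vectorized (best of 3): {t_loop:.6f} seconds")
```

The exact ratio depends on the machine, the NumPy build, and the array size, so treat any single speedup figure as indicative rather than definitive.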

## Lesson 12

**Explain the core concepts of the lesson**

## Core Concepts of Vectorization

**Vectorization** is the practice of replacing explicit for-loops with built-in functions and NumPy operations that work on entire arrays at once. This approach leverages optimized, compiled code that executes much faster than Python loops.

Key concepts include:

- **Explicit for-loops**: Traditional Python loops that iterate over individual elements or training examples. These are slow because Python interprets each iteration.
- **Built-in functions**: NumPy and other libraries provide optimized functions that perform operations on entire vectors or matrices without explicit loops.
- **Matrix-vector multiplication**: A fundamental operation where each element of the output is computed as a dot product of a matrix row with a vector.
- **Element-wise operations**: Operations applied independently to each element of a vector or matrix, such as exponential, logarithm, or squaring.
- **Non-vectorized vs. Vectorized implementations**: Non-vectorized code uses explicit loops; vectorized code uses NumPy functions to process entire arrays simultaneously.

In the context of machine learning, vectorization is critical for efficient gradient computation and parameter updates in algorithms like logistic regression.

**Explain intuitions and mental models for the lesson**

## Intuitions Behind Vectorization

**Why vectorization matters:**

1. **Speed through optimization**: Avoiding explicit for-loops by using built-in functions significantly speeds up code execution. NumPy functions are written in C and optimized for numerical computation, making them orders of magnitude faster than Python loops.

2. **Leveraging library efficiency**: NumPy functions are optimized to perform operations on entire vectors or matrices much faster than manual looping. Arrays are stored in contiguous memory, and the compiled routines that process them can use SIMD instructions and optimized linear algebra libraries without any extra work on your part.

3. **Incremental improvements**: Even partial vectorization, like eliminating one for-loop, can lead to substantial performance improvements. You don't need to vectorize everything at once—removing bottleneck loops has immediate benefits.

4. **Scaling to datasets**: Full vectorization allows processing entire datasets simultaneously without any explicit loops over training examples. This is essential for handling large datasets efficiently.

**Mental model**: Think of vectorization as delegating work to a highly optimized specialist (NumPy) instead of doing it yourself (Python loops). The specialist knows tricks and shortcuts that make the work much faster.

**Present and explain the key equations used in the lesson**

## Key Equations

**Matrix-Vector Multiplication**

The fundamental operation in vectorization is computing the product of a matrix $A$ and a vector $V$. The $i$-th element of the result $U$ is:

$$U_i = \sum_j A_{ij} V_j$$

This equation represents the dot product of the $i$-th row of matrix $A$ with the vector $V$. In non-vectorized form, this requires nested loops: one over rows $i$ and one over columns $j$. In vectorized form, this entire computation is replaced by a single call to `np.dot(A, V)` or `A @ V`.

**Implement code primitive: Non-vectorized matrix-vector multiplication using nested Python for-loops.**

In [None]:
import numpy as np

# Non-vectorized matrix-vector multiplication
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
V = np.array([1, 2, 3])

U = np.zeros(A.shape[0])
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        U[i] += A[i, j] * V[j]

print("Non-vectorized result:")
print(U)

**Implement code primitive: Vectorized matrix-vector multiplication using `np.dot(A, v)`.**

In [None]:
import numpy as np

# Vectorized matrix-vector multiplication
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
V = np.array([1, 2, 3])

U = np.dot(A, V)

print("Vectorized result:")
print(U)

**Implement code primitive: Non-vectorized element-wise exponential operation using a Python for-loop.**

In [None]:
import numpy as np

# Non-vectorized element-wise exponential
V = np.array([1, 2, 3, 4])

result = np.zeros(len(V))
for i in range(len(V)):
    result[i] = np.exp(V[i])

print("Non-vectorized exponential:")
print(result)

**Implement code primitive: Vectorized element-wise exponential operation using `np.exp(v)`.**

In [None]:
import numpy as np

# Vectorized element-wise exponential
V = np.array([1, 2, 3, 4])

result = np.exp(V)

print("Vectorized exponential:")
print(result)

**Implement code primitive: Demonstration of other NumPy element-wise functions like `np.log(v)`, `np.abs(v)`, `np.maximum(v, 0)`, `v**2`, `1/v`.**

In [None]:
import numpy as np

V = np.array([1, 2, 3, 4])

print("Logarithm:")
print(np.log(V))

print("\nAbsolute value:")
print(np.abs(np.array([-1, -2, 3, -4])))

print("\nMaximum with 0 (ReLU):")
print(np.maximum(np.array([-1, 2, -3, 4]), 0))

print("\nSquaring:")
print(V**2)

print("\nReciprocal:")
print(1/V)

**Implement code primitive: Transforming logistic regression derivative calculation `dw` from individual components to a vectorized `np.zeros` array.**

In [None]:
import numpy as np

# Example: logistic regression with m training examples and n features
m = 5  # number of training examples
n = 3  # number of features

# Non-vectorized: one scalar accumulator per feature (n = 3 here)
dw1 = 0.0
dw2 = 0.0
dw3 = 0.0

# Vectorized: initialize dw as a zero array
dw = np.zeros(n)

print("Vectorized dw shape:")
print(dw.shape)
print("Vectorized dw:")
print(dw)

**Implement code primitive: Replacing a for-loop over features with a vectorized update `dw += x_i * dz_i`.**

In [None]:
import numpy as np

# Example data
m = 5  # training examples
n = 3  # features
X = np.random.randn(n, m)  # feature matrix (n x m)
dz = np.random.randn(m)    # gradient signal (m,)

# Non-vectorized: nested loops over features (i) and training examples (j)
dw_non_vec = np.zeros(n)
for i in range(n):
    for j in range(m):
        dw_non_vec[i] += X[i, j] * dz[j]

# Vectorized: single matrix-vector multiplication
dw = np.dot(X, dz)

print("Non-vectorized dw:")
print(dw_non_vec)
print("\nVectorized dw:")
print(dw)
print("\nAre they equal?", np.allclose(dw_non_vec, dw))
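
The update `dw += x_i * dz_i` named in the heading is the intermediate step between the nested loops and the fully vectorized `np.dot(X, dz)`: the loop over examples remains, but the inner loop over features is replaced by one array operation. A minimal sketch, assuming the same `(n, m)` column layout for `X` and using a seeded random generator for reproducibility:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((n, m))   # each column is one training example
dz = rng.standard_normal(m)       # one dz value per example

# Partially vectorized: loop over examples, no loop over features
dw = np.zeros(n)
for i in range(m):
    x_i = X[:, i]        # feature vector of example i, shape (n,)
    dw += x_i * dz[i]    # updates all n components at once

print("Partially vectorized dw:", dw)
print("Matches np.dot(X, dz)?", np.allclose(dw, np.dot(X, dz)))
```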

**Implement code primitive: Replacing a for-loop with a vectorized division `dw /= m`.**

In [None]:
import numpy as np

# Example: averaging gradient over training examples
m = 5  # number of training examples
n = 3  # number of features
dw = np.array([10.0, 20.0, 30.0])  # float dtype so the division results are not truncated

# Non-vectorized: loop over features
dw_non_vec = dw.copy()
for i in range(n):
    dw_non_vec[i] = dw_non_vec[i] / m

# Vectorized: single division operation
dw_vec = dw / m

print("Non-vectorized dw / m:")
print(dw_non_vec)
print("\nVectorized dw / m:")
print(dw_vec)
print("\nAre they equal?", np.allclose(dw_non_vec, dw_vec))

## Lesson 13

**Explain the core concepts of the lesson**

## Core Concepts

Vectorized forward propagation is the process of computing predictions for an entire training set simultaneously using matrix operations, rather than computing predictions one example at a time in a loop.

**Key Concepts:**

- **Vectorization**: Replacing explicit for loops with matrix operations to process multiple training examples at once
- **Forward Propagation**: Computing predictions by passing inputs through the model (computing Z and then A)
- **Matrix Stacking**: Arranging all training examples as columns in a matrix to enable batch processing
- **Training Set Processing**: Computing predictions for all m training examples in a single operation
- **Broadcasting**: Automatically expanding scalar values to match matrix dimensions during computation
- **Sigmoid Activation**: The activation function that transforms linear predictions into probabilities
- **Computational Efficiency**: Vectorized operations run significantly faster on modern hardware compared to loops
- **Batch Computation**: Processing the entire batch of training examples simultaneously

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**From Loops to Matrices:**
Instead of computing predictions one training example at a time in a loop, you can stack all training examples into matrices and compute all predictions simultaneously with matrix operations. This transforms a sequential process into a single parallel computation.

**Why Vectorization Matters:**
Vectorization replaces explicit for loops with matrix multiplication, making code run much faster on modern hardware. Modern processors and GPUs are optimized for matrix operations, so vectorized code can be orders of magnitude faster than loop-based code.

**Broadcasting Simplifies Code:**
Broadcasting automatically expands scalar values to match matrix dimensions, eliminating the need to manually replicate values. When you add a scalar bias to a matrix, the scalar is automatically broadcast to every element, making the code cleaner and more efficient.

**Horizontal Stacking:**
Stacking training examples horizontally into a matrix allows you to process the entire batch in a single operation. Each column represents one training example, and matrix multiplication naturally processes all columns together.

**Present and explain the key equations used in the lesson**

## Key Equations

The vectorized forward propagation for logistic regression uses the following equations:

**Linear Transformation:**
$$Z = W^T X + B$$

where:
- $W \in \mathbb{R}^{n_x \times 1}$ is the weight vector
- $X \in \mathbb{R}^{n_x \times m}$ is the input matrix (m training examples stacked as columns)
- $B \in \mathbb{R}^{1 \times m}$ is the bias written as a row vector; in code, a scalar $b$ is broadcast across all $m$ examples
- $Z \in \mathbb{R}^{1 \times m}$ is the linear output (one pre-activation value per training example)

**Activation Function:**
$$A = \sigma(Z)$$

where:
- $\sigma$ is the sigmoid activation function applied element-wise
- $A \in \mathbb{R}^{1 \times m}$ is the output matrix of predictions (probabilities between 0 and 1)

**Dimensions:**
- Input matrix: $X \in \mathbb{R}^{n_x \times m}$
- Linear output: $Z \in \mathbb{R}^{1 \times m}$
- Activation output: $A \in \mathbb{R}^{1 \times m}$
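
To make the stacking concrete, the sketch below builds $X$ from individual example vectors and checks that the vectorized result matches a per-example loop. The four example vectors are made up for illustration; `np.column_stack` places each 1-D array into one column of the result.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([[0.5], [0.3], [0.2]])   # shape: (n_x, 1)
b = 0.1

# Four made-up training examples, each with n_x = 3 features
examples = [
    np.array([1.0, 2.0, 3.0]),
    np.array([-1.0, 0.0, 1.0]),
    np.array([0.5, -0.5, 2.0]),
    np.array([2.0, 1.0, 0.0]),
]

# One example at a time: a = sigmoid(w^T x + b)
a_loop = [sigmoid(w[:, 0] @ x + b) for x in examples]

# Vectorized: stack the examples as columns, then one matrix product
X = np.column_stack(examples)   # shape: (n_x, m) = (3, 4)
Z = np.dot(w.T, X) + b          # shape: (1, m); scalar b is broadcast
A = sigmoid(Z)

print("X shape:", X.shape)
print("A shape:", A.shape)
print("Matches loop:", np.allclose(A[0], a_loop))
```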

**Implement code primitive: Implement vectorized computation of Z for all training examples using matrix multiplication: W transpose times X plus bias vector B**

In [None]:
import numpy as np

# Example dimensions
n_x = 3  # number of features
m = 5    # number of training examples

# Initialize parameters
W = np.array([[0.5], [0.3], [0.2]])  # shape: (n_x, 1)
B = 0.1  # scalar bias

# Create sample input matrix (each column is one training example)
X = np.random.randn(n_x, m)  # shape: (n_x, m)

# Vectorized computation of Z
Z = np.dot(W.T, X) + B

print(f"W shape: {W.shape}")
print(f"X shape: {X.shape}")
print(f"Z shape: {Z.shape}")
print(f"Z:\n{Z}")

**Implement code primitive: Implement vectorized sigmoid activation function that accepts matrix Z and outputs matrix A of same dimensions**

In [None]:
import numpy as np

def sigmoid(Z):
    """
    Vectorized sigmoid activation function.
    
    Args:
        Z: numpy array of any shape
    
    Returns:
        A: sigmoid(Z) with same shape as Z
    """
    return 1 / (1 + np.exp(-Z))

# Example usage
Z = np.array([[-2.0, -1.0, 0.0, 1.0, 2.0]])  # shape: (1, 5)
A = sigmoid(Z)

print(f"Z shape: {Z.shape}")
print(f"A shape: {A.shape}")
print(f"Z: {Z}")
print(f"A: {A}")

**Implement code primitive: Demonstrate broadcasting behavior when adding scalar bias B to row vector during vectorized computation**

In [None]:
import numpy as np

# Create a row vector (1, m)
Z_linear = np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])  # shape: (1, 5)
print(f"Z_linear shape: {Z_linear.shape}")
print(f"Z_linear:\n{Z_linear}")

# Scalar bias
B = 0.5
print(f"\nB (scalar): {B}")

# Broadcasting: scalar is automatically expanded to match Z_linear dimensions
Z_with_bias = Z_linear + B
print(f"\nZ_with_bias shape: {Z_with_bias.shape}")
print(f"Z_with_bias:\n{Z_with_bias}")

# Verify broadcasting worked correctly
print(f"\nBroadcasting expanded scalar {B} to all {Z_linear.shape[1]} columns")

**Create a Mermaid diagram: Flowchart showing the progression from computing single example predictions (Z1, A1) to vectorized batch computation (Z, A) using matrix operations**

## Progression from Single Example to Vectorized Batch Computation

```mermaid
graph TD
    A["Single Training Example"] --> B["Compute Z1 = w^T x1 + b"]
    B --> C["Compute A1 = sigmoid(Z1)"]
    C --> D["Loop over all m examples"]
    
    E["Vectorized Approach"] --> F["Stack all examples: X = [x1, x2, ..., xm]"]
    F --> G["Compute Z = W^T X + B"]
    G --> H["Compute A = sigmoid(Z)"]
    H --> I["All m predictions at once"]
    
    D --> J["Result: Z1, A1, Z2, A2, ..., Zm, Am"]
    I --> K["Result: Z, A with shape 1 × m"]
    
    J --> L["Slower: Sequential computation"]
    K --> M["Faster: Parallel matrix operations"]
```

## Lesson 14

**Explain the core concepts of the lesson**

## Core Concepts

Vectorized gradient computations form the foundation of efficient machine learning implementations. Instead of computing gradients for each training example individually using explicit loops, we can leverage matrix operations to compute gradients for all training examples simultaneously.

The key concepts in this lesson are:

1. **Vectorized Gradient Computation**: Computing gradients for all training examples at once using matrix operations rather than element-by-element loops.

2. **dZ Matrix Formation**: The difference between predictions and actual labels, stacked into a matrix: $dZ = A - Y$, where $A$ is the matrix of predictions and $Y$ is the matrix of true labels.

3. **Vectorized db Calculation**: The gradient with respect to bias is computed by summing all elements of $dZ$ and dividing by the number of training examples: $db = \frac{1}{m} \sum dZ$.

4. **Vectorized dW Calculation**: The gradient with respect to weights is computed using matrix multiplication: $dW = \frac{1}{m} X \cdot dZ^T$, where $X$ is the feature matrix and $dZ^T$ is the transpose of the gradient matrix.

5. **Forward and Backward Propagation**: A complete iteration of logistic regression includes vectorized forward propagation ($Z = W^T X + B$, $A = \text{sigmoid}(Z)$) and vectorized backpropagation (computing $dZ$, $dW$, and $db$).

6. **Parameter Updates**: Weights and bias are updated using the computed gradients and a learning rate: $W := W - \text{learning\_rate} \cdot dW$ and $B := B - \text{learning\_rate} \cdot db$.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Vectorization as Parallelization**: Vectorization allows us to compute gradients for all training examples simultaneously, replacing explicit for-loops with efficient matrix operations. Instead of iterating through each training example one by one, we stack all examples into matrices and perform operations on the entire matrices at once. This is not only more concise but also significantly faster due to optimized linear algebra libraries.

**Stacking Gradients into Matrices**: When we compute the gradient $dz^{(i)}$ for each individual training example $i$, we can stack all these scalar values into a single matrix $dZ$. This enables us to compute all $dz$ values through a single matrix subtraction ($dZ = A - Y$) rather than computing each one separately.

**Matrix Multiplication for Aggregation**: The vectorized calculation of $dW$ uses matrix multiplication to efficiently sum up the contributions of each training example. Each term $x^{(i)} dz^{(i)}$ (the product of features and gradient for example $i$) is automatically summed across all examples through the operation $X \cdot dZ^T$. This is far more efficient than explicitly looping through examples.

**Full Iteration Vectorization**: A complete iteration of logistic regression—encompassing forward propagation, backpropagation, and parameter updates—can be implemented without any explicit for-loops over training samples. All operations work on entire matrices, making the code both cleaner and faster.

**Multiple Iterations Still Require a Loop**: While a single iteration can be fully vectorized, an outermost for-loop is still necessary to perform multiple iterations of gradient descent. This loop controls the number of training epochs and allows the algorithm to converge by repeatedly applying the vectorized update steps.

**Present and explain the key equations used in the lesson**

## Key Equations

The following equations form the mathematical foundation of vectorized gradient computations for logistic regression:

**Gradient with respect to the linear output $Z$:**
$$dZ = A - Y$$

where $A$ is the matrix of predictions (shape: $1 \times m$) and $Y$ is the matrix of true labels (shape: $1 \times m$).

**Gradient with respect to bias:**
$$db = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)}$$

where $m$ is the number of training examples and $dz^{(i)}$ is the $i$-th entry of $dZ$. This sums all elements of $dZ$ and averages them.

**Gradient with respect to weights:**
$$dW = \frac{1}{m} X \cdot dZ^T$$

where $X$ is the feature matrix (shape: $n_x \times m$) and $dZ^T$ is the transpose of $dZ$ (shape: $m \times 1$). The result is a weight gradient matrix of shape $n_x \times 1$.

**Forward propagation:**
$$Z = W^T X + B$$
$$A = \text{sigmoid}(Z)$$

where $W$ is the weight vector, $X$ is the feature matrix, $B$ is the bias, and $\text{sigmoid}$ is the logistic function.

**Parameter updates:**
$$W := W - \text{learning\_rate} \cdot dW$$
$$B := B - \text{learning\_rate} \cdot db$$

These updates move the parameters in the direction opposite to the gradient, scaled by the learning rate.
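
One way to build confidence in these formulas is to confirm that they reproduce the per-example loop from the earlier lesson. The sketch below does that on a small made-up dataset; the variable names `dW_loop` and `db_loop` are introduced here only for the comparison.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0]])   # shape: (n_x, m)
Y = np.array([[1, 0, 1, 0]])           # shape: (1, m)
W = np.array([[0.5], [0.3]])           # shape: (n_x, 1)
b = 0.1
n_x, m = X.shape

# Loop version: accumulate per-example contributions, then average
dW_loop = np.zeros((n_x, 1))
db_loop = 0.0
for i in range(m):
    x_i = X[:, i]                                # shape: (n_x,)
    a_i = sigmoid((W.T @ x_i).item() + b)        # scalar prediction
    dz_i = a_i - Y[0, i]                         # scalar error
    dW_loop += (x_i * dz_i).reshape(-1, 1)
    db_loop += dz_i
dW_loop /= m
db_loop /= m

# Vectorized version: the equations above
A = sigmoid(np.dot(W.T, X) + b)
dZ = A - Y
dW = np.dot(X, dZ.T) / m
db = np.sum(dZ) / m

print("dW matches loop:", np.allclose(dW, dW_loop))
print("db matches loop:", np.isclose(db, db_loop))
```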

**Implement code primitive: Compute `dZ` using element-wise subtraction of `A` and `Y` matrices.**

In [None]:
import numpy as np

# Example: Compute dZ = A - Y
A = np.array([[0.9, 0.2, 0.8, 0.7]])  # Predictions (1, m)
Y = np.array([[1, 0, 1, 0]])            # True labels (1, m)

dZ = A - Y
print("dZ:", dZ)

**Implement code primitive: Compute `db` by summing all elements of `dZ` and dividing by the number of training examples `m` (e.g., `np.sum(dZ)`).**

In [None]:
import numpy as np

# Example: Compute db = (1/m) * sum(dZ)
dZ = np.array([[0.9, 0.2, 0.8, 0.7]])  # Shape: (1, m)
m = dZ.shape[1]  # Number of training examples

db = np.sum(dZ) / m
print("db:", db)

**Implement code primitive: Compute `dW` using matrix multiplication of the input feature matrix `X` and the transpose of `dZ`, then dividing by `m` (e.g., `np.dot(X, dZ.T)`).**

In [None]:
import numpy as np

# Example: Compute dW = (1/m) * X * dZ^T
X = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])  # Shape: (n_x, m)
dZ = np.array([[0.9, 0.2, 0.8, 0.7]])  # Shape: (1, m)
m = X.shape[1]  # Number of training examples

dW = np.dot(X, dZ.T) / m
print("dW shape:", dW.shape)
print("dW:\n", dW)

**Implement code primitive: Implement vectorized forward propagation: `Z = np.dot(W.T, X) + b` and `A = sigmoid(Z)`.**

In [None]:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Example: Vectorized forward propagation
W = np.array([[0.5], [0.3]])  # Shape: (n_x, 1)
b = 0.1
X = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])  # Shape: (n_x, m)

Z = np.dot(W.T, X) + b
A = sigmoid(Z)
print("Z shape:", Z.shape)
print("A shape:", A.shape)
print("A:", A)

**Implement code primitive: Implement vectorized backpropagation steps to compute `dZ`, `dW`, and `db` without explicit loops over training examples.**

In [None]:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Example: Vectorized backpropagation
W = np.array([[0.5], [0.3]])  # Shape: (n_x, 1)
b = 0.1
X = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])  # Shape: (n_x, m)
Y = np.array([[1, 0, 1, 0]])  # Shape: (1, m)
m = X.shape[1]

# Forward propagation
Z = np.dot(W.T, X) + b
A = sigmoid(Z)

# Backpropagation
dZ = A - Y
dW = np.dot(X, dZ.T) / m
db = np.sum(dZ) / m

print("dZ shape:", dZ.shape)
print("dW shape:", dW.shape)
print("db:", db)

**Implement code primitive: Update `W` and `b` parameters using the computed gradients (`dW`, `db`) and a learning rate.**

In [None]:
import numpy as np

# Example: Parameter update
W = np.array([[0.5], [0.3]])  # Shape: (n_x, 1)
b = 0.1
dW = np.array([[0.02], [0.01]])  # Computed gradient
db = 0.005  # Computed gradient
learning_rate = 0.01

# Update parameters
W = W - learning_rate * dW
b = b - learning_rate * db

print("Updated W:", W.flatten())
print("Updated b:", b)

**Implement code primitive: Utilize an outer `for` loop to execute multiple iterations of the vectorized gradient descent process.**

In [None]:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Example: Multiple iterations of vectorized gradient descent
W = np.array([[0.5], [0.3]])  # Shape: (n_x, 1)
b = 0.1
X = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8]])  # Shape: (n_x, m)
Y = np.array([[1, 0, 1, 0]])  # Shape: (1, m)
m = X.shape[1]
learning_rate = 0.01
num_iterations = 100

for iteration in range(num_iterations):
    # Forward propagation
    Z = np.dot(W.T, X) + b
    A = sigmoid(Z)
    
    # Backpropagation
    dZ = A - Y
    dW = np.dot(X, dZ.T) / m
    db = np.sum(dZ) / m
    
    # Parameter update
    W = W - learning_rate * dW
    b = b - learning_rate * db

print("Final W:", W.flatten())
print("Final b:", b)

**Create a Mermaid diagram: graph TD
    A[Start Iteration] --> B{Vectorized Forward Propagation};
    B --> C[Compute Z = W^T X + B];
    C --> D[Compute A = sigmoid(Z)];
    D --> E{Vectorized Backpropagation};
    E --> F[Compute dZ = A - Y];
    F --> G[Compute dW = (1/m) X dZ^T];
    G --> H[Compute db = (1/m) sum(dZ)];
    H --> I{Parameter Update};
    I --> J[W = W - learning_rate * dW];
    J --> K[B = B - learning_rate * db];
    K --> L[End Iteration];**

```mermaid
graph TD
    A[Start Iteration] --> B{Vectorized Forward Propagation};
    B --> C[Compute Z = W^T X + B];
    C --> D["Compute A = sigmoid(Z)"];
    D --> E{Vectorized Backpropagation};
    E --> F[Compute dZ = A - Y];
    F --> G["Compute dW = (1/m) X dZ^T"];
    G --> H["Compute db = (1/m) sum(dZ)"];
    H --> I{Parameter Update};
    I --> J[W = W - learning_rate * dW];
    J --> K[B = B - learning_rate * db];
    K --> L[End Iteration];
```

**Create a Mermaid diagram: graph TD
    X_matrix["X (n_x, m)"] --> MatMul;
    dZ_T_matrix["dZ^T (m, 1)"] --> MatMul;
    MatMul["Matrix Multiplication (X @ dZ^T)"] --> Result;
    Result["(1/m) * Result of Multiplication"] --> dW_vector["dW (n_x, 1)"];**

```mermaid
graph TD
    X_matrix["X (n_x, m)"] --> MatMul;
    dZ_T_matrix["dZ^T (m, 1)"] --> MatMul;
    MatMul["Matrix Multiplication (X @ dZ^T)"] --> Result;
    Result["(1/m) * Result of Multiplication"] --> dW_vector["dW (n_x, 1)"];
```

## Lesson 15

**Explain the core concepts of the lesson**

## Core Concepts

Python broadcasting is a powerful mechanism in NumPy that allows operations between arrays of different shapes. The key concepts include:

- **Broadcasting**: The process of conceptually 'stretching' smaller arrays to match the shape of larger arrays during element-wise operations.
- **Matrix operations**: Operations performed on 2D arrays (matrices) where broadcasting enables efficient computation without explicit loops.
- **Vector-scalar operations**: Operations between vectors and scalars, where the scalar is conceptually expanded to match the vector's shape.
- **Matrix-vector operations**: Operations between matrices and vectors, where the vector is conceptually expanded to match the matrix's shape.
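- **NumPy sum function**: The `sum()` method with the `axis` parameter computes sums along a specific dimension.
- **Axis for summation**: The `axis` parameter names the dimension that is collapsed: `axis=0` sums down the rows and yields one total per column, while `axis=1` sums across the columns and yields one total per row.
- **NumPy reshape command**: Changes an array's dimensions. For a contiguous array it returns a view in constant time without copying data; NumPy falls back to a copy only when the requested shape cannot be expressed as a view.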
- **Element-wise operations**: Operations applied independently to each element of arrays, such as addition, subtraction, multiplication, and division.
- **Computational efficiency**: Broadcasting eliminates the need for explicit loops, resulting in faster execution and more concise code.
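
As a small illustration of the `axis` parameter described in the list above, the sketch below sums one matrix along each axis. The matrix values and the use of `keepdims=True` are illustrative choices, not something the lesson prescribes.

```python
import numpy as np

A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])  # Shape: (3, 4)

# axis=0 collapses the row dimension: one sum per column
col_sums = A.sum(axis=0)                    # Shape: (4,)

# axis=1 collapses the column dimension: one sum per row
row_sums = A.sum(axis=1)                    # Shape: (3,)

# keepdims=True keeps the collapsed axis with size 1, ready for broadcasting
row_sums_2d = A.sum(axis=1, keepdims=True)  # Shape: (3, 1)

print("Column sums:", col_sums)
print("Row sums:", row_sums)
print("Row sums kept 2D, shape:", row_sums_2d.shape)
```

Keeping the result 2D avoids a separate `reshape` before dividing the matrix by its row sums.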

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Broadcasting as Conceptual Stretching**: Imagine broadcasting as a way to mentally 'stretch' or 'expand' a smaller array to match the shape of a larger array. When you perform an operation between arrays of different shapes, NumPy doesn't actually copy the data; instead, it conceptually repeats the smaller array to align with the larger one. This mental model helps you understand why operations work without explicitly writing loops.

**Efficiency Through Implicit Expansion**: Broadcasting allows you to write more concise and faster Python code by eliminating the need for explicit loops. Instead of manually iterating through rows or columns, you can write a single operation that NumPy applies efficiently across all elements. This leads to improved performance because the underlying operations are optimized at the C level.

**Reshape as a Lightweight Operation**: The `reshape` command is usually an efficient, constant-time operation. For a contiguous array it returns a view, so no data is copied and only the shape metadata changes. NumPy makes a copy only when the new shape cannot be expressed as a view, for example after some transposes or slices. In the common case this makes it cheap and safe to reshape arrays as needed for broadcasting.
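
One way to check when `reshape` avoids copying is to ask NumPy whether two arrays share memory. The sketch below is a minimal illustration with made-up values; it shows the common view case and one case where a copy is needed.

```python
import numpy as np

v = np.arange(12)

# Reshaping a contiguous array returns a view: no data is copied
v_2d = v.reshape(3, 4)
print("Reshaped array shares memory:", np.shares_memory(v, v_2d))

# Writing through the view changes the original array
v_2d[0, 0] = 100
print("v[0] after writing to the view:", v[0])

# A transposed array is not C-contiguous, so flattening it forces a copy
t = v_2d.T
flat = t.reshape(-1)
print("Flattened transpose shares memory:", np.shares_memory(t, flat))
```

Because a reshaped array may be a view, writing to it can silently modify the original; call `.copy()` when an independent array is needed.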

**Present and explain the key equations used in the lesson**

## Key Equations

Broadcasting follows specific rules for different shape combinations:

**Matrix-Row Vector Broadcasting**:
$$A_{m,n} \text{ op } B_{1,n} \implies \text{conceptually, } B \text{ is copied } m \text{ times to match } A_{m,n}$$

When a matrix of shape $(m, n)$ is combined with a row vector of shape $(1, n)$, the row vector is conceptually repeated $m$ times vertically to match the matrix's shape.

**Matrix-Column Vector Broadcasting**:
$$A_{m,n} \text{ op } B_{m,1} \implies \text{conceptually, } B \text{ is copied } n \text{ times to match } A_{m,n}$$

When a matrix of shape $(m, n)$ is combined with a column vector of shape $(m, 1)$, the column vector is conceptually repeated $n$ times horizontally to match the matrix's shape.

**Vector-Scalar Broadcasting**:
$$A_{m,1} \text{ op } B_{1,1} \implies \text{conceptually, } B \text{ is copied } m \text{ times to match } A_{m,1}$$

When a column vector of shape $(m, 1)$ is combined with a scalar (treated here as shape $(1, 1)$), the scalar is conceptually repeated $m$ times to match the vector's shape.
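
The three rules above can be checked directly. The sketch below uses `np.broadcast_to` to build the "stretched" array that each rule describes and compares it with the result of ordinary broadcasting; the array values are illustrative.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)     # Shape (m, n) = (2, 3)
row = np.array([[10, 20, 30]])     # Shape (1, n)
col = np.array([[100], [200]])     # Shape (m, 1)

# np.broadcast_to materializes the conceptual stretching as a read-only view
row_stretched = np.broadcast_to(row, A.shape)  # row repeated m times
col_stretched = np.broadcast_to(col, A.shape)  # column repeated n times

# Broadcasting gives the same result as using the stretched arrays
print(np.array_equal(A + row, A + row_stretched))
print(np.array_equal(A + col, A + col_stretched))

print(A + row)   # Matrix + row vector
print(A + col)   # Matrix + column vector
print(col + 5)   # Column vector + scalar
```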

**Implement code primitive: Initialize a NumPy 2D array (matrix) with specific values.**

In [None]:
import numpy as np

# Initialize a 2D array (matrix) with specific values
A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

print("Matrix A:")
print(A)
print(f"Shape: {A.shape}")

**Implement code primitive: Calculate column-wise sums of a matrix using `sum(axis=0)`.**

In [None]:
import numpy as np

# Initialize a matrix
A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

# Calculate column-wise sums using axis=0
column_sums = A.sum(axis=0)

print("Column-wise sums:")
print(column_sums)
print(f"Shape: {column_sums.shape}")

**Implement code primitive: Reshape a 1D array (vector) into a specific row vector dimension (e.g., 1x4).**

In [None]:
import numpy as np

# Create a 1D array
v = np.array([15, 18, 21, 24])

print("Original vector shape:", v.shape)

# Reshape to a row vector (1, 4)
v_row = v.reshape(1, -1)

print("Reshaped to row vector:")
print(v_row)
print(f"Shape: {v_row.shape}")

**Implement code primitive: Perform element-wise division of a matrix by a reshaped row vector using broadcasting.**

In [None]:
import numpy as np

# Initialize a matrix
A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12]])

# Calculate column-wise sums and reshape to row vector
column_sums = A.sum(axis=0).reshape(1, -1)

print("Column sums (row vector):")
print(column_sums)
print(f"Shape: {column_sums.shape}")

# Perform element-wise division using broadcasting
result = A / column_sums

print("\nResult of A / column_sums:")
print(result)
print(f"Shape: {result.shape}")

**Create a Mermaid diagram: graph TD\nA[Matrix A (m,n)]\nB[Vector B (1,n)]\nB -- Copy m times --> B_exp[Expanded B (m,n)]\nA & B_exp -- Element-wise Operation --> Result[(Result (m,n))]**

```mermaid
graph TD
    A["Matrix A (m,n)"]
    B["Vector B (1,n)"]
    B_exp["Expanded B (m,n)"]
    Result["Result (m,n)"]
    
    B -->|Copy m times| B_exp
    A -->|Element-wise Operation| Result
    B_exp -->|Element-wise Operation| Result
```

**Create a Mermaid diagram: graph TD\nA[Matrix A (m,n)]\nB[Vector B (m,1)]\nB -- Copy n times --> B_exp[Expanded B (m,n)]\nA & B_exp -- Element-wise Operation --> Result[(Result (m,n))]**

```mermaid
graph TD
    A["Matrix A (m,n)"]
    B["Vector B (m,1)"]
    B_exp["Expanded B (m,n)"]
    Result["Result (m,n)"]
    
    B -->|Copy n times| B_exp
    A -->|Element-wise Operation| Result
    B_exp -->|Element-wise Operation| Result
```

**Create a Mermaid diagram: graph TD\nA[Vector A (m,1)]\nB[Scalar B (1,1)]\nB -- Copy m times --> B_exp[Expanded B (m,1)]\nA & B_exp -- Element-wise Operation --> Result[(Result (m,1))]**

```mermaid
graph TD
    A["Vector A (m,1)"]
    B["Scalar B (1,1)"]
    B_exp["Expanded B (m,1)"]
    Result["Result (m,1)"]
    
    B -->|Copy m times| B_exp
    A -->|Element-wise Operation| Result
    B_exp -->|Element-wise Operation| Result
```

## Lesson 16

**Explain the core concepts of the lesson**

## Core Concepts: NumPy Vector Dimensions and Rank 1 Array Bugs

NumPy's flexibility in handling arrays can introduce subtle, hard-to-find bugs if array dimensions are not fully understood. The key distinction lies in how NumPy treats different vector representations:

**Rank 1 Arrays**: Created with `np.random.randn(N)`, these have shape `(N,)`. They behave inconsistently—transposition appears to have no effect, and operations like matrix multiplication can produce unexpected results.

**Column Vectors**: Created with `np.random.randn(N, 1)`, these have shape `(N, 1)`. They are explicitly 2D and behave predictably in linear algebra operations.

**Row Vectors**: Created with `np.random.randn(1, N)`, these have shape `(1, N)`. They are also explicitly 2D and provide consistent behavior.

The core issue is that rank 1 arrays lack a clear orientation (neither row nor column), leading to broadcasting ambiguities. Explicitly using 2D vectors eliminates this ambiguity and makes code more maintainable.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Broadcasting Flexibility as a Double-Edged Sword**: NumPy's broadcasting rules are powerful but can mask errors. A rank 1 array can be broadcast in multiple ways, making it unclear what operation is actually being performed.

**Transposition Confusion**: When you transpose a rank 1 array, it looks identical visually. This violates the mathematical expectation that transposing a vector should change its orientation. Explicit column and row vectors make this behavior transparent.

**Inner vs. Outer Products**: With a rank 1 array, `a @ a.T` produces a scalar (inner product). With an explicit column vector, the same operation produces a matrix (outer product). This dramatic difference in behavior is a common source of bugs.

**Assertions as Self-Documentation**: Using assertions like `assert(array.shape == (N, 1))` serves dual purposes: it catches dimension mismatches early and documents your intent to future readers.

**Reshaping as a Solution**: Converting a rank 1 array to an explicit column or row vector using `reshape()` resolves ambiguity and ensures predictable behavior in downstream operations.
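
The broadcasting ambiguity described above is easy to reproduce. In the sketch below (with made-up values), adding a rank 1 array to a column vector silently produces a matrix instead of a vector; reshaping first, and asserting the shape, gives the intended result.

```python
import numpy as np

N = 3
a = np.array([1.0, 2.0, 3.0])            # Rank 1 array, shape (N,)
b = np.array([[10.0], [20.0], [30.0]])   # Column vector, shape (N, 1)

# Intended: element-wise sum of two length-N vectors.
# Actual: (N,) is treated as (1, N) and broadcast against (N, 1), giving (N, N).
surprise = a + b
print("Shape of a + b:", surprise.shape)

# Making the orientation explicit restores the intended result
a_col = a.reshape(N, 1)
assert a_col.shape == (N, 1)
intended = a_col + b
print("Shape of a_col + b:", intended.shape)
```

No error is raised in the first case, which is why this kind of bug tends to surface only later, as a shape mismatch or a wrong numerical result.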

**Implement code primitive: Creating a rank 1 array using `np.random.randn(N)`.**

In [None]:
import numpy as np

N = 5
a = np.random.randn(N)
print(a)

**Implement code primitive: Inspecting an array's shape with `array.shape`.**

In [None]:
print(f"Shape of a: {a.shape}")
print(f"Number of dimensions: {a.ndim}")

**Implement code primitive: Demonstrating transposition (`array.T`) of a rank 1 array and its non-intuitive visual effect.**

In [None]:
a_transposed = a.T
print(f"Original array: {a}")
print(f"Transposed array: {a_transposed}")
print(f"Shape of a.T: {a_transposed.shape}")
print(f"Are they identical? {np.array_equal(a, a_transposed)}")

**Implement code primitive: Performing matrix multiplication (`array @ array.T`) with a rank 1 array resulting in a scalar.**

In [None]:
result = a @ a.T
print(f"a @ a.T = {result}")
print(f"Type of result: {type(result)}")
print(f"Is it a scalar? {np.isscalar(result)}")

**Implement code primitive: Creating an explicit column vector using `np.random.randn(N, 1)`.**

In [None]:
b = np.random.randn(N, 1)
print(f"Column vector:\n{b}")
print(f"Shape: {b.shape}")

**Implement code primitive: Creating an explicit row vector using `np.random.randn(1, N)`.**

In [None]:
c = np.random.randn(1, N)
print(f"Row vector:\n{c}")
print(f"Shape: {c.shape}")

**Implement code primitive: Demonstrating transposition (`array.T`) of a column vector to show it becomes a row vector.**

In [None]:
b_transposed = b.T
print(f"Original column vector shape: {b.shape}")
print(f"Transposed column vector shape: {b_transposed.shape}")
print(f"Transposed is now a row vector: {b_transposed.shape == (1, N)}")

**Implement code primitive: Performing matrix multiplication (`array @ array.T`) with an explicit column vector resulting in a matrix (outer product).**

In [None]:
outer_product = b @ b.T
print(f"b @ b.T shape: {outer_product.shape}")
print(f"b @ b.T (outer product):\n{outer_product}")

**Implement code primitive: Adding an assertion statement to check for specific array dimensions, e.g., `assert(array.shape == (N, 1))`.**

In [None]:
assert b.shape == (N, 1), f"Expected shape (N, 1), got {b.shape}"
print("Assertion passed: b is a column vector")

assert c.shape == (1, N), f"Expected shape (1, N), got {c.shape}"
print("Assertion passed: c is a row vector")

**Implement code primitive: Reshaping an array using `array.reshape((N, 1))` to explicitly define its dimensions.**

In [None]:
a_reshaped = a.reshape((N, 1))
print(f"Original rank 1 array shape: {a.shape}")
print(f"Reshaped to column vector: {a_reshaped.shape}")
print(f"Reshaped array:\n{a_reshaped}")

**Create a Mermaid diagram: graph TD
    subgraph NumPy Vector Types and Behaviors
        A[Rank 1 Array (e.g., np.random.randn(N))] --> A_SHAPE(Shape: (N,));
        A_SHAPE --> A_TRANSPOSE{a.T looks identical to a};
        A_TRANSPOSE --> A_PRODUCT{a @ a.T gives a scalar};

        B[Column Vector (e.g., np.random.randn(N, 1))] --> B_SHAPE(Shape: (N,1));
        B_SHAPE --> B_TRANSPOSE{a.T is a (1,N) row vector};
        B_TRANSPOSE --> B_PRODUCT{a @ a.T gives a (N,N) matrix};

        C[Row Vector (e.g., np.random.randn(1, N))] --> C_SHAPE(Shape: (1,N));
        C_SHAPE --> C_TRANSPOSE{a.T is a (N,1) column vector};
        C_TRANSPOSE --> C_PRODUCT{a @ a.T gives a (1,1) scalar};
    end

    A_PRODUCT -- Leads to bugs/confusion --> D(Recommendation: Avoid Rank 1 Arrays);
    B_PRODUCT -- Promotes clarity --> E(Recommendation: Use (N,1) or (1,N) for vectors);
    C_PRODUCT -- Promotes clarity --> E;**

```mermaid
graph TD
    subgraph NumPy Vector Types and Behaviors
        A["Rank 1 Array - np.random.randn(N)"] --> A_SHAPE("Shape: (N,)")
        A_SHAPE --> A_TRANSPOSE{"a.T looks identical to a"}
        A_TRANSPOSE --> A_PRODUCT{"a @ a.T gives a scalar"}

        B["Column Vector - np.random.randn(N, 1)"] --> B_SHAPE("Shape: (N, 1)")
        B_SHAPE --> B_TRANSPOSE{"b.T is a (1, N) row vector"}
        B_TRANSPOSE --> B_PRODUCT{"b @ b.T gives an (N, N) matrix"}

        C["Row Vector - np.random.randn(1, N)"] --> C_SHAPE("Shape: (1, N)")
        C_SHAPE --> C_TRANSPOSE{"c.T is an (N, 1) column vector"}
        C_TRANSPOSE --> C_PRODUCT{"c @ c.T gives a (1, 1) matrix"}
    end

    A_PRODUCT -- "Leads to bugs/confusion" --> D("Recommendation: Avoid Rank 1 Arrays")
    B_PRODUCT -- "Promotes clarity" --> E("Recommendation: Use (N, 1) or (1, N) for vectors")
    C_PRODUCT -- "Promotes clarity" --> E
```

## Lesson 17

**Explain the core concepts of the lesson**

## Core Concepts

Jupyter notebooks (formerly IPython notebooks) are interactive computing environments that combine text instructions and executable code in a single document. Each notebook consists of cells that can be either:

- **Text blocks (Markdown cells)**: Contain formatted text, instructions, and explanations
- **Code blocks (Code cells)**: Contain executable Python code

When you execute a code cell, the code is sent to a **kernel** running on a server. The kernel processes your code and returns the output, which is displayed directly below the cell. This interactive workflow allows you to write, test, and refine code while learning.

The **kernel** is a backend process that executes your code. If something goes wrong, you can restart the kernel to clear all variables and start fresh. Importantly, the order in which you execute cells matters—earlier cells may set up variables or import libraries that later cells depend on.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Jupyter Notebooks as Interactive Learning Tools**: Think of a Jupyter notebook as a conversation between you and a computer. You write instructions and code in cells, execute them, see the results immediately, and then refine your approach. This rapid feedback loop makes learning and experimentation much faster than traditional programming workflows.

**The Kernel as a Backend Worker**: Imagine the kernel as a worker on a remote server. You send it tasks (code cells), it executes them, and reports back with results. The kernel maintains state—variables you create persist until you restart it. This is why executing cells in the correct order is crucial; if an earlier cell defines a variable that a later cell uses, you must run the earlier cell first.

**Cell Execution as a Sequential Process**: Each time you execute a cell, your code travels to the kernel, gets processed, and the output returns to your notebook. If the kernel crashes or becomes unresponsive, you can restart it, but this clears all stored variables and state. Understanding this flow helps you debug issues and organize your work effectively.

**Implement code primitive: Writing code between 'START CODE HERE' and 'END CODE HERE' markers.**

In [None]:
# START CODE HERE
x = 5
y = 10
z = x + y
# END CODE HERE

**Implement code primitive: Executing a code block using Shift+Enter or 'Cell > Run Cell'.**

In [None]:
# Execute this cell by pressing Shift+Enter or using Cell > Run Cell
result = 42
print(f"The result is: {result}")

**Implement code primitive: Printing a 'Hello world' message.**

In [None]:
print("Hello world")

**Implement code primitive: Importing the NumPy library as 'np'.**

In [None]:
import numpy as np

**Create a Mermaid diagram: graph TD;A[Instructions in Text Block] --> B[User writes code in Code Block];B --> C{Execute Code Block (Shift+Enter/Run Cell)};C --> D[Code sent to Kernel on Server];D --> E[Kernel executes code];E --> F[Output displayed in Code Block];F --> B;D -- "Kernel dies" --> G[Restart Kernel];**

## Notebook Execution Flow

```mermaid
graph TD
    A[Instructions in Text Block] --> B[User writes code in Code Block]
    B --> C{Execute Code Block<br/>Shift+Enter/Run Cell}
    C --> D[Code sent to Kernel on Server]
    D --> E[Kernel executes code]
    E --> F[Output displayed in Code Block]
    F --> B
    D -- "Kernel dies" --> G[Restart Kernel]
```

This diagram illustrates the interactive cycle of working in a Jupyter notebook. You read instructions, write code, execute it, see results, and iterate. If the kernel crashes, you can restart it and continue your work.

## Lesson 18

**Explain the core concepts of the lesson**

## Core Concepts of Logistic Regression Cost Function

Logistic regression is a fundamental algorithm for binary classification. The cost function used in logistic regression is derived from principles of probability and maximum likelihood estimation.

**Key Concepts:**

1. **Logistic Regression Prediction**: The model outputs a probability using the sigmoid function:
   $$\hat{y} = \sigma(w^T x + b)$$
   where $\sigma$ is the sigmoid function that maps any input to a value between 0 and 1.

2. **Conditional Probability**: The prediction $\hat{y}$ represents the probability that the label is 1 given the input:
   - $P(y=1|x) = \hat{y}$
   - $P(y=0|x) = 1 - \hat{y}$

3. **Combined Probability Equation**: A single expression elegantly captures both outcomes:
   $$P(y|x) = \hat{y}^y (1 - \hat{y})^{(1-y)}$$
   When $y=1$, this equals $\hat{y}$; when $y=0$, this equals $1-\hat{y}$.

4. **Loss Function**: For a single training example, the loss measures how well the model's prediction matches the actual label:
   $$L(\hat{y}, y) = -[y \log \hat{y} + (1-y) \log (1-\hat{y})]$$

5. **Overall Cost Function**: The average loss across all training examples:
   $$J(w,b) = - \frac{1}{m} \sum_{i=1}^m [y^{(i)} \log \hat{y}^{(i)} + (1-y^{(i)}) \log (1-\hat{y}^{(i)})]$$

6. **Maximum Likelihood Estimation**: The cost function is derived to maximize the probability of observing the actual training labels given the model parameters.
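
The relationship between the combined probability expression and the loss can be verified numerically. The helper names `combined_probability` and `cross_entropy_loss` below are hypothetical, and the prediction value 0.8 is illustrative.

```python
import numpy as np

def combined_probability(y_hat, y):
    """P(y|x) = y_hat^y * (1 - y_hat)^(1 - y) for a label y in {0, 1}."""
    return y_hat**y * (1 - y_hat)**(1 - y)

def cross_entropy_loss(y_hat, y):
    """Loss for one example: -[y log(y_hat) + (1 - y) log(1 - y_hat)]."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_hat = 0.8  # Illustrative predicted probability that y = 1

for y in (0, 1):
    p = combined_probability(y_hat, y)
    print(f"y={y}: P(y|x)={p:.3f}, "
          f"-log P(y|x)={-np.log(p):.4f}, "
          f"loss={cross_entropy_loss(y_hat, y):.4f}")
```

For both labels the loss equals the negative log of the combined probability, which is the link between items 3 and 4 above.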

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Why This Cost Function?**

The logistic regression cost function is not arbitrary. It emerges naturally from a fundamental principle: **choose the parameters that maximize the probability the model assigns to the actual labels in the training data**.

**Key Intuitions:**

1. **Unified Expression for Both Outcomes**: Rather than writing separate equations for $y=0$ and $y=1$, the expression $\hat{y}^y (1 - \hat{y})^{(1-y)}$ elegantly combines both cases into one. This mathematical elegance simplifies both understanding and computation.

2. **Logarithm Simplification**: Taking the logarithm of probabilities converts products into sums:
   $$\log P(\text{labels}) = \sum_{i=1}^m \log P(y^{(i)}|x^{(i)})$$
   Sums are easier to differentiate than products, and they avoid the numerical underflow that occurs when many probabilities smaller than 1 are multiplied together.

3. **Loss Is the Negative Log-Likelihood**: The loss is defined as $-\log P(y|x)$, so minimizing the cost is mathematically equivalent to maximizing the log-likelihood of the training data. When we minimize the cost function, we're finding the parameters that make the observed labels most probable.

4. **Averaging Over Examples**: Dividing by $m$ (the number of training examples) scales the cost function appropriately. This makes the cost comparable across datasets of different sizes and prevents the cost from growing simply because we have more data.

5. **Independent and Identically Distributed Assumption**: The derivation assumes each training example is independent and drawn from the same distribution. This allows us to multiply individual probabilities to get the joint probability of all labels.

**Present and explain the key equations used in the lesson**

## Key Equations

**Prediction:**
$$\hat{y} = \sigma(w^T x + b)$$

**Conditional Probabilities:**
$$P(y=1|x) = \hat{y}$$
$$P(y=0|x) = 1 - \hat{y}$$

**Combined Probability for a Single Example:**
$$P(y|x) = \hat{y}^y (1 - \hat{y})^{(1-y)}$$

**Logarithm of Conditional Probability:**
$$\log P(y|x) = y \log \hat{y} + (1-y) \log (1-\hat{y})$$

**Loss Function for a Single Training Example:**
$$L(\hat{y}, y) = -[y \log \hat{y} + (1-y) \log (1-\hat{y})]$$

**Joint Probability of All Training Labels:**
$$P(\text{labels in training set}) = \prod_{i=1}^m P(y^{(i)}|x^{(i)})$$

**Log of Joint Probability:**
$$\log P(\text{labels in training set}) = \sum_{i=1}^m \log P(y^{(i)}|x^{(i)})$$

**Overall Cost Function (Average Loss):**
$$J(w,b) = - \frac{1}{m} \sum_{i=1}^m [y^{(i)} \log \hat{y}^{(i)} + (1-y^{(i)}) \log (1-\hat{y}^{(i)})]$$

The goal is to find the weights $w$ and bias $b$ that minimize $J(w,b)$.

**Implement code primitive: Compute the logistic regression prediction (y_hat) for a given input x, weights w, and bias b.**

In [None]:
import numpy as np

def sigmoid(z):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-z))

def logistic_regression_prediction(x, w, b):
    """
    Compute logistic regression prediction.
    
    Args:
        x: Input features (1D array or scalar)
        w: Weights (1D array, same shape as x)
        b: Bias (scalar)
    
    Returns:
        y_hat: Predicted probability (scalar between 0 and 1)
    """
    z = np.dot(w, x) + b
    y_hat = sigmoid(z)
    return y_hat

# Example usage
w = np.array([0.5, -0.3])
b = 0.1
x = np.array([2.0, 1.0])
y_hat = logistic_regression_prediction(x, w, b)
print(f"Prediction: {y_hat}")

**Implement code primitive: Calculate the loss function for a single training example (y_hat, y).**

In [None]:
def loss_function(y_hat, y):
    """
    Calculate the loss function for a single training example.
    
    Args:
        y_hat: Predicted probability (scalar between 0 and 1)
        y: Actual label (0 or 1)
    
    Returns:
        loss: Loss value (scalar)
    """
    epsilon = 1e-15  # Small value to avoid log(0)
    y_hat = np.clip(y_hat, epsilon, 1 - epsilon)
    loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return loss

# Example usage
y_hat = 0.8
y = 1
loss = loss_function(y_hat, y)
print(f"Loss: {loss}")

y_hat = 0.3
y = 0
loss = loss_function(y_hat, y)
print(f"Loss: {loss}")

**Implement code primitive: Calculate the overall cost function for the entire training set by averaging individual losses.**

In [None]:
def cost_function(X, y, w, b):
    """
    Calculate the overall cost function for the entire training set.
    
    Args:
        X: Training features (2D array, shape: m x n)
        y: Training labels (1D array, shape: m)
        w: Weights (1D array, shape: n)
        b: Bias (scalar)
    
    Returns:
        J: Cost function value (scalar)
    """
    m = X.shape[0]
    total_loss = 0
    
    for i in range(m):
        y_hat = logistic_regression_prediction(X[i], w, b)
        total_loss += loss_function(y_hat, y[i])
    
    J = total_loss / m
    return J

# Example usage
X = np.array([[2.0, 1.0], [1.0, 3.0], [3.0, 2.0]])
y = np.array([1, 0, 1])
w = np.array([0.5, -0.3])
b = 0.1

J = cost_function(X, y, w, b)
print(f"Cost function: {J}")
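
The explicit loop over examples can also be written as array operations. The sketch below is one possible vectorized equivalent, using the same $(m, n)$ layout for `X` as the cell above; the name `cost_function_vectorized` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-z))

def cost_function_vectorized(X, y, w, b, epsilon=1e-15):
    """Average loss; X has shape (m, n), y has shape (m,), w has shape (n,)."""
    y_hat = sigmoid(X @ w + b)                    # Predictions, shape (m,)
    y_hat = np.clip(y_hat, epsilon, 1 - epsilon)  # Avoid log(0)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return np.mean(losses)

# Same example data as the loop-based version
X = np.array([[2.0, 1.0], [1.0, 3.0], [3.0, 2.0]])
y = np.array([1, 0, 1])
w = np.array([0.5, -0.3])
b = 0.1

J_vec = cost_function_vectorized(X, y, w, b)
print(f"Vectorized cost: {J_vec}")
```

Both versions compute the same average loss; the vectorized form replaces the Python loop with a single matrix-vector product.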

**Implement code primitive: Implement an optimization algorithm to minimize the calculated cost function J(w,b).**

In [None]:
def gradient_descent(X, y, w, b, learning_rate=0.01, iterations=100):
    """
    Optimize weights and bias using gradient descent to minimize cost function.
    
    Args:
        X: Training features (2D array, shape: m x n)
        y: Training labels (1D array, shape: m)
        w: Initial weights (1D array, shape: n)
        b: Initial bias (scalar)
        learning_rate: Learning rate for gradient descent
        iterations: Number of iterations
    
    Returns:
        w: Optimized weights
        b: Optimized bias
        costs: List of cost values at each iteration
    """
    m, n = X.shape
    costs = []
    
    for iteration in range(iterations):
        dw = np.zeros(n)
        db = 0
        
        for i in range(m):
            y_hat = logistic_regression_prediction(X[i], w, b)
            error = y_hat - y[i]
            dw += error * X[i]
            db += error
        
        dw /= m
        db /= m
        
        w = w - learning_rate * dw
        b = b - learning_rate * db
        
        J = cost_function(X, y, w, b)
        costs.append(J)
    
    return w, b, costs

# Example usage
X = np.array([[2.0, 1.0], [1.0, 3.0], [3.0, 2.0], [2.5, 0.5]])
y = np.array([1, 0, 1, 1])
w = np.array([0.0, 0.0])
b = 0.0

w_opt, b_opt, costs = gradient_descent(X, y, w, b, learning_rate=0.1, iterations=50)
print(f"Optimized weights: {w_opt}")
print(f"Optimized bias: {b_opt}")
print(f"Final cost: {costs[-1]}")
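
The inner loop that accumulates `dw` and `db` can likewise be replaced by array operations. The sketch below is a minimal vectorized gradient computation under the same $(m, n)$ layout for `X`; the function name `gradients_vectorized` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-z))

def gradients_vectorized(X, y, w, b):
    """Gradients of the cost; X has shape (m, n), y has shape (m,), w has shape (n,)."""
    m = X.shape[0]
    error = sigmoid(X @ w + b) - y   # y_hat - y for every example, shape (m,)
    dw = X.T @ error / m             # Shape (n,)
    db = np.sum(error) / m           # Scalar
    return dw, db

# Same example data and initial parameters as the loop-based version
X = np.array([[2.0, 1.0], [1.0, 3.0], [3.0, 2.0], [2.5, 0.5]])
y = np.array([1, 0, 1, 1])
w = np.array([0.0, 0.0])
b = 0.0

dw, db = gradients_vectorized(X, y, w, b)
print("dw:", dw)
print("db:", db)
```

With zero initial parameters every prediction is 0.5, so the first gradient depends only on the data and labels.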

## Lesson 19

**Explain the core concepts of the lesson**

## Core Concepts of Deep Reinforcement Learning

Deep reinforcement learning combines two powerful paradigms:

**Deep Learning** provides representation learning—the ability to automatically discover patterns and features in raw sensory data (images, audio, text) without hand-engineering features.

**Reinforcement Learning** provides decision-making under uncertainty—agents learn to select actions that maximize cumulative rewards over time, learning from their own experience rather than labeled data.

When combined, deep RL enables agents to learn directly from raw sensory inputs (like pixels) to action outputs, end-to-end. However, this combination introduces unique challenges that supervised learning does not face:

1. **The Exploration Problem**: Unlike supervised learning where training data is provided, RL agents must actively explore to gather experience. Where does the training data come from?

2. **Credit Assignment**: Understanding which actions taken early in a sequence led to rewards received much later—a fundamentally different problem from supervised learning's immediate input-output mapping.

3. **Safety in Autonomous Systems**: When agents collect their own data through exploration, random exploration can cause real-world damage before learning anything useful.

4. **Long-Horizon Reasoning**: Effective planning and action over extended time periods (days or lifetimes), not just seconds.

These challenges define the frontier of deep reinforcement learning research and applications.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**The Representation Problem is Solved; New Problems Emerge**

Deep networks excel at solving the representation problem—they can capture complex patterns in raw sensory data. However, solving representation learning doesn't automatically solve reinforcement learning. RL still faces exploration, credit assignment, and safety challenges that are orthogonal to representation quality.

**Exploration as Active Data Collection**

In supervised learning, data is given. In reinforcement learning, the agent must actively explore to gather experience. This raises a fundamental question: where does training data come from? The agent must balance exploring new actions (to discover better strategies) with exploiting known good actions (to maximize immediate reward).

**Credit Assignment Across Time**

When an agent receives a reward after many steps, it must determine which earlier actions were responsible. This is harder than supervised learning because there's no immediate feedback signal for each action. The agent must propagate credit backward through time, accounting for the delayed consequences of its decisions.

**Safety Through Constrained Learning**

Behavioral cloning (learning from human demonstrations) can bootstrap RL by providing initial good behavior. Then reinforcement learning refines this behavior with explicit objectives. This hybrid approach reduces the risk of dangerous exploration early in training.

**Learning the Algorithm Itself**

Just as deep learning replaced hand-engineered features, meta-learning could replace hand-designed RL algorithms. One RL algorithm could learn to modify another RL algorithm's parameters based on task performance, enabling the system to discover better learning strategies automatically.

**Present and explain the key equations used in the lesson**

## Key Equations and Formulas

**Exploration vs. Exploitation**

$$\text{Where does training data come from?}$$

This is the fundamental question of the exploration problem. In reinforcement learning, the agent must decide whether to explore new actions (gathering information about the environment) or exploit known good actions (maximizing immediate reward). The balance between these two drives the data collection process.
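The trade-off described above can be sketched with an epsilon-greedy rule on a toy multi-armed bandit. This is only an illustration under stated assumptions: the bandit, the function name `run_epsilon_greedy`, the arm means, and the value of `epsilon` are hypothetical and do not come from the lesson.

```python
import numpy as np

def run_epsilon_greedy(true_means, epsilon, steps, seed=0):
    """Play a toy bandit and return (value estimates, pull counts) per arm."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    estimates = np.zeros(n_arms)            # running mean reward per arm
    counts = np.zeros(n_arms, dtype=int)    # how often each arm was pulled

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = int(rng.integers(n_arms))   # explore: pick a random arm
        else:
            arm = int(np.argmax(estimates))   # exploit: pick the best arm so far
        reward = rng.normal(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the running mean for the chosen arm
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts

# Hypothetical arm means, chosen only for illustration
estimates, counts = run_epsilon_greedy([0.1, 0.5, 0.9], epsilon=0.1, steps=2000)
print("Pull counts per arm:", counts)
print("Estimated values:", np.round(estimates, 2))
```

Every reward the agent ever sees comes from an arm it chose to pull, so the exploration rule directly determines what ends up in the training data.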

**Credit Assignment**

$$\text{Which early actions led to later rewards?}$$

When an agent receives a reward signal at time $t$, it must determine which actions taken at earlier times $t-1, t-2, \ldots$ were responsible. This requires propagating credit backward through the sequence of decisions, accounting for the causal relationships between actions and outcomes.
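One common way to propagate credit backward is to give each step a discounted return. The sketch below assumes exponential discounting with a factor `gamma`; the lesson does not prescribe this scheme, and the function name `discounted_returns` is hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk the episode from the last step to the first
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# A single reward arrives only at the final step of a four-step episode
print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.5))
```

With `gamma=0.5`, the final step is credited with 1.0 and the three earlier steps with 0.5, 0.25, and 0.125, so actions further from the reward receive less credit.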

**Safety Constraint**

$$\text{Minimize accidents during autonomous data collection}$$

When an autonomous system learns by exploring, it must do so safely. The safety constraint requires that the cost of exploration (accidents, damage, resource waste) remains acceptable while the agent learns to perform its task effectively.
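One standard way to write such a constraint, offered here as an assumption rather than as the lesson's own formulation, is a constrained objective. The per-step cost $c_t$, the budget $d$, and the horizon $T$ are symbols introduced for illustration:

$$\max_{\pi} \; \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right] \quad \text{subject to} \quad \mathbb{E}_{\pi}\left[\sum_{t=0}^{T} \gamma^{t} c_t\right] \le d$$

The policy $\pi$ still maximizes expected discounted reward, but only among policies whose expected discounted cost (accidents, damage, resource waste) stays within the budget $d$.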

**Implement code primitive: Implement a deep Q-network (DQN) agent that learns to play Atari games from raw pixel inputs, demonstrating end-to-end learning from sensory input to action output.**

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from collections import deque

class DQNNetwork(nn.Module):
    def __init__(self, input_channels, num_actions):
        super(DQNNetwork, self).__init__()
        self.conv1 = nn.Conv2d(input_channels, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 512)
        self.fc2 = nn.Linear(512, num_actions)
    
    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.conv3(x))
        x = x.view(x.size(0), -1)
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

class DQNAgent:
    def __init__(self, input_channels, num_actions, learning_rate=0.0001, gamma=0.99):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.q_network = DQNNetwork(input_channels, num_actions).to(self.device)
        self.target_network = DQNNetwork(input_channels, num_actions).to(self.device)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=learning_rate)
        self.gamma = gamma
        self.num_actions = num_actions
        self.memory = deque(maxlen=100000)
        self.epsilon = 1.0
    
    def select_action(self, state):
        if np.random.random() < self.epsilon:
            return np.random.randint(self.num_actions)
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
            q_values = self.q_network(state_tensor)
            return q_values.argmax(dim=1).item()
    
    def store_transition(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))
    
    def train(self, batch_size):
        if len(self.memory) < batch_size:
            return
        
        batch = np.random.choice(len(self.memory), batch_size, replace=False)
        states, actions, rewards, next_states, dones = zip(*[self.memory[i] for i in batch])
        
        states = torch.FloatTensor(np.array(states)).to(self.device)
        actions = torch.LongTensor(actions).to(self.device)
        rewards = torch.FloatTensor(rewards).to(self.device)
        next_states = torch.FloatTensor(np.array(next_states)).to(self.device)
        dones = torch.FloatTensor(dones).to(self.device)
        
        q_values = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
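        # Targets are treated as constants, so no gradients are tracked through the target network
        with torch.no_grad():
            next_q_values = self.target_network(next_states).max(dim=1)[0]
            target_q_values = rewards + self.gamma * next_q_values * (1 - dones)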
        
        loss = nn.functional.mse_loss(q_values, target_q_values)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
    
    def update_target_network(self):
        self.target_network.load_state_dict(self.q_network.state_dict())
    
    def decay_epsilon(self, decay_rate=0.995):
        self.epsilon *= decay_rate

agent = DQNAgent(input_channels=4, num_actions=18)
print(f"DQN Agent initialized with Q-network: {agent.q_network}")
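The `64 * 7 * 7` input size of `fc1` above is only correct for one particular frame size. The sketch below checks it under the assumption of 84×84 input frames, a size the text does not state; the helper `conv_output_size` is hypothetical.

```python
def conv_output_size(size, kernel_size, stride, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 84  # assumed frame height and width; not stated in the text
# (kernel_size, stride) pairs copied from conv1, conv2, and conv3
for kernel_size, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel_size, stride)
    print(f"kernel={kernel_size}, stride={stride} -> {size}x{size}")

# 64 output channels from conv3 times the final spatial area
print("Flattened features:", 64 * size * size)
```

Frames of a different size would produce a different flattened length, and `fc1` would need to be resized accordingly.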

**Implement code primitive: Demonstrate transfer of a trained RL policy across different robot morphologies (two-legged to four-legged) using the same algorithm, showing generalization without retraining.**

In [None]:
import numpy as np
import torch
import torch.nn as nn

class RobotPolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(RobotPolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, action_dim)
    
    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))

class RobotController:
    def __init__(self, state_dim, action_dim):
        self.policy = RobotPolicyNetwork(state_dim, action_dim)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.policy.to(self.device)
    
    def select_action(self, state):
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        with torch.no_grad():
            action = self.policy(state_tensor).cpu().numpy()[0]
        return action
    
    def save_policy(self, filepath):
        torch.save(self.policy.state_dict(), filepath)
    
    def load_policy(self, filepath):
        self.policy.load_state_dict(torch.load(filepath))

class TwoLeggedRobot:
    def __init__(self):
        self.state_dim = 8
        self.action_dim = 2
    
    def get_state(self):
        return np.random.randn(self.state_dim)
    
    def execute_action(self, action):
        reward = np.sum(action ** 2)
        return reward

class FourLeggedRobot:
    def __init__(self):
        self.state_dim = 12
        self.action_dim = 4
    
    def get_state(self):
        return np.random.randn(self.state_dim)
    
    def execute_action(self, action):
        reward = np.sum(action ** 2)
        return reward

controller_2leg = RobotController(state_dim=8, action_dim=2)
controller_2leg.save_policy('policy_2leg.pt')

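# The saved weights have the two-legged shapes (8 inputs, 2 outputs), so the
# controller that loads them must be built with the same dimensions.
controller_4leg = RobotController(state_dim=8, action_dim=2)
controller_4leg.load_policy('policy_2leg.pt')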

robot_2leg = TwoLeggedRobot()
robot_4leg = FourLeggedRobot()

state_2leg = robot_2leg.get_state()
action_2leg = controller_2leg.select_action(state_2leg)
reward_2leg = robot_2leg.execute_action(action_2leg)

state_4leg = robot_4leg.get_state()
action_4leg = controller_4leg.select_action(state_4leg[:8])
reward_4leg = robot_4leg.execute_action(np.concatenate([action_4leg, action_4leg[:2]]))

print(f"2-legged robot reward: {reward_2leg:.4f}")
print(f"4-legged robot reward (transferred policy): {reward_4leg:.4f}")

**Implement code primitive: Build a behavioral cloning system that learns to mimic human actions from demonstrations, then augment it with reinforcement learning objectives to optimize for specific metrics.**

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

class BehavioralCloningNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(BehavioralCloningNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, action_dim)
    
    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

class BehavioralCloningAgent:
    def __init__(self, state_dim, action_dim, learning_rate=0.001):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.policy = BehavioralCloningNetwork(state_dim, action_dim).to(self.device)
        self.optimizer = optim.Adam(self.policy.parameters(), lr=learning_rate)
        self.bc_loss_fn = nn.MSELoss()
        self.rl_loss_fn = nn.MSELoss()
    
    def behavioral_cloning_loss(self, states, expert_actions):
        predicted_actions = self.policy(states)
        return self.bc_loss_fn(predicted_actions, expert_actions)
    
    def reinforcement_learning_loss(self, states, predicted_actions, rewards):
        action_quality = torch.sum(predicted_actions * predicted_actions, dim=1)
        reward_tensor = torch.FloatTensor(rewards).to(self.device)
        return self.rl_loss_fn(action_quality, reward_tensor)
    
    def train_bc(self, states, expert_actions, epochs=10):
        states = torch.FloatTensor(states).to(self.device)
        expert_actions = torch.FloatTensor(expert_actions).to(self.device)
        
        for epoch in range(epochs):
            loss = self.behavioral_cloning_loss(states, expert_actions)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
    
    def train_rl(self, states, rewards, expert_actions, epochs=10, bc_weight=0.5):
        states = torch.FloatTensor(states).to(self.device)
        # Keep the demonstrations as the imitation target; comparing the policy
        # with its own detached output would make the BC term identically zero.
        expert_actions = torch.FloatTensor(expert_actions).to(self.device)
        
        for epoch in range(epochs):
            predicted_actions = self.policy(states)
            
            bc_loss = self.bc_loss_fn(predicted_actions, expert_actions)
            # Toy surrogate objective, not a policy-gradient estimator
            rl_loss = self.reinforcement_learning_loss(states, predicted_actions, rewards)
            
            total_loss = bc_weight * bc_loss + (1 - bc_weight) * rl_loss
            self.optimizer.zero_grad()
            total_loss.backward()
            self.optimizer.step()
    
    def select_action(self, state):
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        with torch.no_grad():
            action = self.policy(state_tensor).cpu().numpy()[0]
        return action

state_dim, action_dim = 10, 3
agent = BehavioralCloningAgent(state_dim, action_dim)

# Random placeholders standing in for real demonstrations
expert_states = np.random.randn(100, state_dim)
expert_actions = np.random.randn(100, action_dim)
agent.train_bc(expert_states, expert_actions, epochs=5)

rewards = np.random.rand(100)
agent.train_rl(expert_states, rewards, expert_actions, epochs=5, bc_weight=0.7)

test_state = np.random.randn(state_dim)
action = agent.select_action(test_state)
print(f"Behavioral cloning + RL agent action: {action}")

**Implement code primitive: Implement a meta-learning framework where one RL algorithm learns to modify another RL algorithm's parameters based on task performance, enabling algorithm learning.**

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

class BaseRLAlgorithm(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super(BaseRLAlgorithm, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, action_dim)
        self.learning_rate = 0.01
    
    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
    
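    def get_parameters(self):
        # Detached flat copy: the meta-learner reads the weights without backpropagating into them
        return torch.cat([p.detach().view(-1) for p in self.parameters()])
    
    def set_parameters(self, params):
        offset = 0
        # Copy values in place without recording the assignment in the autograd graph
        with torch.no_grad():
            for p in self.parameters():
                p.copy_(params[offset:offset + p.numel()].view(p.shape))
                offset += p.numel()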

class MetaLearner(nn.Module):
    def __init__(self, param_dim, hidden_dim=128):
        super(MetaLearner, self).__init__()
        self.fc1 = nn.Linear(param_dim + 1, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, param_dim)
    
    def forward(self, params, task_performance):
        # Both tensors must be 1-D to concatenate: (param_dim,) and (1,)
        x = torch.cat([params, task_performance.view(1)], dim=0)
        x = torch.relu(self.fc1(x.unsqueeze(0)))
        x = torch.relu(self.fc2(x))
        param_update = self.fc3(x)
        return param_update.squeeze(0)

class MetaRLFramework:
    def __init__(self, state_dim, action_dim, param_dim=None):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.base_algorithm = BaseRLAlgorithm(state_dim, action_dim).to(self.device)
        
        if param_dim is None:
            param_dim = sum(p.numel() for p in self.base_algorithm.parameters())
        
        self.meta_learner = MetaLearner(param_dim).to(self.device)
        self.meta_optimizer = optim.Adam(self.meta_learner.parameters(), lr=0.001)
    
    def adapt_algorithm(self, task_performance):
        params = self.base_algorithm.get_parameters()
        param_update = self.meta_learner(params, torch.FloatTensor([task_performance]).to(self.device))
        new_params = params + 0.1 * param_update
        self.base_algorithm.set_parameters(new_params)
    
    def select_action(self, state):
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        with torch.no_grad():
            action = self.base_algorithm(state_tensor).cpu().numpy()[0]
        return action
    
    def train_meta_learner(self, task_performances, num_iterations=10):
        for iteration in range(num_iterations):
            total_loss = 0
            for perf in task_performances:
                params = self.base_algorithm.get_parameters()
                param_update = self.meta_learner(params, torch.FloatTensor([perf]).to(self.device))
                new_params = params + 0.1 * param_update
                
                loss = torch.norm(param_update) * (1 - perf)
                total_loss += loss
            
            self.meta_optimizer.zero_grad()
            (total_loss / len(task_performances)).backward()
            self.meta_optimizer.step()

state_dim, action_dim = 8, 4
meta_framework = MetaRLFramework(state_dim, action_dim)

task_performances = [0.3, 0.5, 0.7, 0.6, 0.8]
meta_framework.train_meta_learner(task_performances, num_iterations=5)

test_state = np.random.randn(state_dim)
action = meta_framework.select_action(test_state)
print(f"Meta-learned RL algorithm action: {action}")

**Create a Mermaid diagram: a flowchart of the deep RL loop, from raw sensory input through learned representation, action selection, reward signal, and policy update**

```mermaid
flowchart TD
    A[Raw Sensory Input] -->|Deep Network| B[Learned Representation]
    B -->|RL Algorithm| C[Action Selection]
    C -->|Environment| D[Reward Signal]
    D -->|Credit Assignment| E[Policy Update]
    E -->|Exploration| A
    style A fill:#e1f5ff
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#e8f5e9
    style E fill:#fce4ec
```

**Create a Mermaid diagram: a flowchart of the pipeline from behavioral cloning to reinforcement learning, ending in a deployed real-world system**

```mermaid
flowchart TD
    A[Behavioral Cloning] -->|Supervised Learning| B[Mimic Human Behavior]
    B -->|Initial Policy| C[Reinforcement Learning]
    C -->|Explicit Objectives| D[Optimized Policy]
    D -->|Deployment| E[Real-World System]
    style A fill:#fff3e0
    style B fill:#f3e5f5
    style C fill:#e8f5e9
    style D fill:#fce4ec
    style E fill:#e1f5ff
```

**Create a Mermaid diagram: a flowchart linking exploration, credit assignment, policy improvement, and safety risk, with the open question each one raises**

```mermaid
flowchart TD
    A[Exploration] -->|Data Collection| B[Credit Assignment Problem]
    B -->|Identify Causal Actions| C[Policy Improvement]
    C -->|New Behavior| D[Safety Risk]
    D -->|Autonomous Learning| A
    A -->|Challenge| E[Where does data come from?]
    B -->|Challenge| F[Which actions caused rewards?]
    D -->|Challenge| G[How to learn safely?]
    style E fill:#ffebee
    style F fill:#ffebee
    style G fill:#ffebee
```

# Week 3

## Lesson 1

**Explain the core concepts of the lesson**

## Core Concepts

A neural network is fundamentally a repetition of logistic regression across multiple layers. Each layer performs the same computational pattern:

1. **Neural Network Layer**: A layer consists of stacked sigmoid units that transform inputs through linear and activation steps.

2. **Layer Notation (Square Brackets)**: We use square brackets to denote layer indices. For example, $w^{[1]}$ refers to weights in layer 1, while $w^{[2]}$ refers to weights in layer 2. Round brackets index training examples instead, so $x^{(i)}$ is the $i$-th training example.

3. **Z and A Calculations**: Each layer computes two key quantities:
   - **z**: The linear combination of inputs and weights plus bias
   - **a**: The activated output after applying the sigmoid function

4. **Forward Propagation Layers**: Information flows left to right through the network, with each layer's output becoming the next layer's input.

5. **Backward Propagation Layers**: During training, derivatives flow right to left through the network, computing gradients for all parameters.

6. **Hidden Layer Computation**: Intermediate layers (between input and output) compute their z and a values using the previous layer's activation as input.

7. **Output Layer Computation**: The final layer produces the network's prediction, which is compared against the target to compute loss.

8. **Multi-layer Computation Graph**: The entire network forms a directed acyclic graph where each layer's computation depends on the previous layer's output.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Intuition 1: Repeated Logistic Regression**

A neural network is logistic regression repeated multiple times. In logistic regression, we compute $z = w^T x + b$ and then $a = \sigma(z)$. In a neural network, we repeat this exact pattern at each layer, but instead of using the raw input $x$, each layer after the first uses the previous layer's activation $a^{[l-1]}$ as its input.

**Intuition 2: Stacked Sigmoid Units**

Each layer consists of stacked sigmoid units. Think of each unit as a small logistic regression classifier. When you stack multiple units in a layer, you're creating multiple parallel logistic regressors that all operate on the same input. These units learn different features and patterns.

**Intuition 3: Layer Notation as Organizational Tool**

The square bracket notation is purely organizational—it helps us keep track of which layer we're talking about. Layer 1 is the first hidden layer, layer 2 is the second hidden layer, and so on. This notation cleanly separates the concept of "which layer" (square brackets) from "which training example" (round brackets).

**Intuition 4: Forward and Backward Flow**

During forward propagation, information flows left to right: input → layer 1 → layer 2 → output. During backward propagation, error signals flow right to left: output loss → layer 2 gradients → layer 1 gradients → input gradients. This bidirectional flow is the essence of how neural networks learn.
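The two directions of flow can be traced on the smallest possible case: a chain with one scalar unit per layer. The sketch below is an illustration under stated assumptions. The squared-error loss, the numeric values, and all variable names are chosen for brevity and are not taken from the lesson. It runs the forward pass left to right, the backward pass right to left, and then checks one gradient against a finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, b1, w2, b2):
    """Left to right: x -> z1 -> a1 -> z2 -> a2."""
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    a2 = sigmoid(z2)
    return z1, a1, z2, a2

def loss(a2, y):
    """Squared error, used here only to keep the derivative short."""
    return 0.5 * (a2 - y) ** 2

# Hypothetical input, target, and parameters
x, y = 1.5, 1.0
w1, b1, w2, b2 = 0.4, -0.2, 0.7, 0.1

# Forward pass
z1, a1, z2, a2 = forward(x, w1, b1, w2, b2)

# Backward pass: right to left, one chain-rule factor per step
da2 = a2 - y                 # dL/da2
dz2 = da2 * a2 * (1 - a2)    # sigmoid'(z2) = a2 * (1 - a2)
dw2 = dz2 * a1
da1 = dz2 * w2
dz1 = da1 * a1 * (1 - a1)
dw1 = dz1 * x

# Finite-difference estimate of dL/dw1 as an independent check
eps = 1e-6
loss_plus = loss(forward(x, w1 + eps, b1, w2, b2)[3], y)
loss_minus = loss(forward(x, w1 - eps, b1, w2, b2)[3], y)
dw1_numeric = (loss_plus - loss_minus) / (2 * eps)

print(f"dw1 analytic: {dw1:.8f}, numeric: {dw1_numeric:.8f}")
```

Each backward line reuses a value cached during the forward pass (`a1`, `a2`), which is why the forward results are kept around.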

**Present and explain the key equations used in the lesson**

## Key Equations

**Layer 1 Computation:**

The first hidden layer takes the input $x$ and computes:

$$z^{[1]} = w^{[1]}x + b^{[1]}$$

$$a^{[1]} = \sigma(z^{[1]})$$

where $w^{[1]}$ are the weights, $b^{[1]}$ is the bias, and $\sigma$ is the sigmoid activation function.

**Layer 2 Computation:**

The second layer takes the output of layer 1 as input:

$$z^{[2]} = w^{[2]}a^{[1]} + b^{[2]}$$

$$a^{[2]} = \sigma(z^{[2]})$$

Notice that the input to layer 2 is $a^{[1]}$ (the activation from layer 1), not the original input $x$.

**General Pattern:**

For any layer $l$:

$$z^{[l]} = w^{[l]}a^{[l-1]} + b^{[l]}$$

$$a^{[l]} = \sigma(z^{[l]})$$

This pattern repeats for each layer in the network, creating a chain of transformations from input to output.

**Implement code primitive: Implement forward propagation through multiple layers by sequentially computing z and a values for each layer**

In [None]:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_propagation(X, parameters):
    """
    Implement forward propagation through multiple layers.
    
    Args:
        X: Input data of shape (n_x, m) where n_x is number of features, m is number of examples
        parameters: Dictionary containing weights and biases for each layer
                   keys: 'W1', 'b1', 'W2', 'b2', etc.
    
    Returns:
        cache: Dictionary containing all z and a values for each layer
        A: Activation of the final layer (the network output)
    """
    cache = {}
    A = X
    layer = 1
    
    while f'W{layer}' in parameters:
        W = parameters[f'W{layer}']
        b = parameters[f'b{layer}']
        
        Z = np.dot(W, A) + b
        A = sigmoid(Z)
        
        cache[f'Z{layer}'] = Z
        cache[f'A{layer}'] = A
        
        layer += 1
    
    return cache, A

# Example usage
X = np.random.randn(2, 3)  # 2 features, 3 examples
parameters = {
    'W1': np.random.randn(3, 2),
    'b1': np.random.randn(3, 1),
    'W2': np.random.randn(1, 3),
    'b2': np.random.randn(1, 1)
}

cache, output = forward_propagation(X, parameters)
print(f"Output shape: {output.shape}")
print(f"Cache keys: {list(cache.keys())}")

**Implement code primitive: Implement backward propagation by computing derivatives (da, dz, dw, db) flowing from output layer back to input layer**

In [None]:
def sigmoid_derivative(Z):
    A = sigmoid(Z)
    return A * (1 - A)

def backward_propagation(dA, cache, parameters, X):
    """
    Implement backward propagation through multiple layers.
    
    Args:
        dA: Gradient of loss with respect to output activation
        cache: Dictionary containing Z and A values from forward propagation
        parameters: Dictionary containing weights and biases
        X: Input data of shape (n_x, m), used as A^[0] for the first layer
    
    Returns:
        gradients: Dictionary containing dW and db for each layer
    """
    gradients = {}
    m = X.shape[1]  # number of training examples
    num_layers = len(parameters) // 2  # one W and one b per layer
    
    # Walk from the output layer back to layer 1
    for layer in range(num_layers, 0, -1):
        Z = cache[f'Z{layer}']
        # The first layer's input is X itself; deeper layers read the cached activation
        A_prev = X if layer == 1 else cache[f'A{layer-1}']
        W = parameters[f'W{layer}']
        
        dZ = dA * sigmoid_derivative(Z)
        gradients[f'dW{layer}'] = np.dot(dZ, A_prev.T) / m
        gradients[f'db{layer}'] = np.sum(dZ, axis=1, keepdims=True) / m
        
        # Gradient passed on to the previous layer
        dA = np.dot(W.T, dZ)
    
    return gradients

# Example usage (reuses X, cache, and parameters from the forward-propagation cell)
dA_output = np.random.randn(1, X.shape[1])
gradients = backward_propagation(dA_output, cache, parameters, X)
print(f"Gradient keys: {list(gradients.keys())}")

**Create a Mermaid diagram: a left-to-right flowchart of forward propagation through two layers, from input x to loss L**

## Computation Flow Diagram

```mermaid
flowchart LR
    X["Input x"] --> Z1["Compute z^[1]"] --> A1["Compute a^[1]"] --> Z2["Compute z^[2]"] --> A2["Compute a^[2]"] --> L["Loss L"]
    style Z1 fill:#e1f5ff
    style A1 fill:#e1f5ff
    style Z2 fill:#fff3e0
    style A2 fill:#fff3e0
```

This diagram shows the forward propagation flow through a two-layer neural network. The input $x$ flows through layer 1 (light blue) where it is transformed into $z^{[1]}$ and then activated to $a^{[1]}$. The activation $a^{[1]}$ then flows through layer 2 (light orange) where it is transformed into $z^{[2]}$ and activated to $a^{[2]}$. Finally, the output $a^{[2]}$ is used to compute the loss $L$.

## Lesson 2

**Explain the core concepts of the lesson**

## Core Concepts

A neural network is organized into distinct layers that process information sequentially:

- **Input Layer**: Contains the raw features of your data, stacked as a vector. This is denoted as $A^{[0]} = X$.

- **Hidden Layer**: An intermediate layer that learns representations of the input. The activations (outputs) of this layer are denoted $A^{[1]}$. It is called "hidden" because we do not observe the true values of its nodes in the training set—we only see inputs and desired outputs.

- **Output Layer**: Produces the final prediction of the network. For a single output, this is denoted $\hat{y} = A^{[2]}$.

- **Activation Values**: The outputs computed by each layer as data flows forward through the network. These values represent what each layer computes and passes to the next layer.

- **Layer Parameters**: Each layer has associated weights ($W$) and biases ($b$) that determine how it transforms its inputs into outputs.

- **Two-Layer Network**: A network with one hidden layer is called a two-layer network because we count layers starting from the hidden layer (we skip the input layer in the count).

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Layered Processing**: Think of a neural network as a pipeline. Data enters through the input layer, gets transformed by the hidden layer, and emerges as a prediction from the output layer. Each layer builds on the representations created by the previous layer.

**Why "Hidden"?**: The hidden layer is called hidden not because it is mysterious, but because we never see its true values during training. We only observe the inputs we provide and the outputs we want to match. The hidden layer's job is to learn useful intermediate representations that help the network make better predictions.

**Activations as Information Flow**: Activations are the actual numerical values flowing through the network at each stage. Think of them as the "state" of the network at each layer—they carry the processed information forward.

**Counting Layers**: When we say "two-layer network," we are counting the hidden layer and the output layer. The input layer is not counted because it simply holds the raw data without performing any computation.

**Parameters Shape the Transformation**: The weights and biases in each layer determine exactly how that layer transforms its input. The dimensions of these parameters are carefully chosen to match the number of units in the current layer and the previous layer.

**Present and explain the key equations used in the lesson**

## Key Equations

**Input Layer Activation**:
$$A^{[0]} = X$$
The input layer activation is simply the input data itself.

**Hidden Layer Activation**:
$$A^{[1]} \in \mathbb{R}^{4 \times 1}$$
The hidden layer has 4 units, so its activation is a column vector with 4 rows and 1 column.

**Output Layer Activation (Prediction)**:
$$\hat{y} = A^{[2]}$$
The output layer produces the final prediction.

**Hidden Layer Weights**:
$$W^{[1]} \in \mathbb{R}^{4 \times 3}$$
The weight matrix for the hidden layer has 4 rows (one for each hidden unit) and 3 columns (one for each input feature).

**Hidden Layer Bias**:
$$b^{[1]} \in \mathbb{R}^{4 \times 1}$$
The bias vector for the hidden layer has 4 rows, matching the number of hidden units.

**Output Layer Weights**:
$$W^{[2]} \in \mathbb{R}^{1 \times 4}$$
The weight matrix for the output layer has 1 row (one output unit) and 4 columns (one for each hidden unit).

**Output Layer Bias**:
$$b^{[2]} \in \mathbb{R}^{1 \times 1}$$
The bias for the output layer is a scalar (1 row, 1 column).
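The shapes above follow one rule: layer $l$ has a weight matrix with one row per unit in layer $l$ and one column per unit in layer $l-1$. A minimal sketch of that rule follows. The helper name `initialize_parameters`, the small random scaling, and the zero biases are assumptions made for illustration.

```python
import numpy as np

def initialize_parameters(layer_sizes, seed=0):
    """Create W^[l] of shape (n_l, n_{l-1}) and b^[l] of shape (n_l, 1)."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_sizes)):
        n_prev, n_curr = layer_sizes[l - 1], layer_sizes[l]
        # Small random weights and zero biases: an assumed initialization
        parameters[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * 0.01
        parameters[f"b{l}"] = np.zeros((n_curr, 1))
    return parameters

# 3 input features, 4 hidden units, 1 output unit
params = initialize_parameters([3, 4, 1])
for name, value in params.items():
    print(name, value.shape)
```

Passing `[3, 4, 1]` reproduces the shapes listed above: (4, 3) and (4, 1) for the hidden layer, (1, 4) and (1, 1) for the output layer.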

**Implement code primitive: Represent the input layer as a vector of features stacked vertically**

In [None]:
import numpy as np

# Input layer: 3 features stacked vertically
A_0 = np.array([
    [2.0],
    [1.5],
    [0.8]
])

print("Input layer (A^[0]):")
print(A_0)
print(f"Shape: {A_0.shape}")

**Implement code primitive: Store hidden layer activations as a column vector with dimensions matching the number of hidden units**

In [None]:
# Hidden layer: 4 units, represented as a column vector
A_1 = np.array([
    [0.5],
    [0.3],
    [0.9],
    [0.1]
])

print("Hidden layer activations (A^[1]):")
print(A_1)
print(f"Shape: {A_1.shape}")

**Implement code primitive: Represent output layer activation as a scalar value**

In [None]:
# Output layer: single scalar output
A_2 = np.array([[0.7]])

print("Output layer activation (A^[2]):")
print(A_2)
print(f"Shape: {A_2.shape}")
print(f"Prediction (ŷ): {A_2[0, 0]}")

**Implement code primitive: Organize layer parameters (W and b) with dimensions determined by the number of units in the current and previous layers**

In [None]:
# Hidden layer parameters
# W^[1]: 4 hidden units × 3 input features
W_1 = np.array([
    [0.2, 0.5, 0.1],
    [0.3, 0.1, 0.4],
    [0.6, 0.2, 0.3],
    [0.1, 0.4, 0.2]
])

# b^[1]: 4 hidden units × 1
b_1 = np.array([
    [0.1],
    [0.2],
    [0.05],
    [0.15]
])

# Output layer parameters
# W^[2]: 1 output unit × 4 hidden units
W_2 = np.array([
    [0.5, 0.3, 0.8, 0.2]
])

# b^[2]: 1 output unit × 1
b_2 = np.array([[0.1]])

print("Hidden layer weights W^[1]:")
print(W_1)
print(f"Shape: {W_1.shape}")
print("\nHidden layer bias b^[1]:")
print(b_1)
print(f"Shape: {b_1.shape}")
print("\nOutput layer weights W^[2]:")
print(W_2)
print(f"Shape: {W_2.shape}")
print("\nOutput layer bias b^[2]:")
print(b_2)
print(f"Shape: {b_2.shape}")

**Create a Mermaid diagram: A flowchart showing data flow through a two-layer neural network: input layer → hidden layer → output layer, with activations and parameters labeled at each stage**

## Neural Network Data Flow

```mermaid
graph TD
    A["Input Layer<br/>A^[0] ∈ ℝ^3×1"] -->|"W^[1] ∈ ℝ^4×3<br/>b^[1] ∈ ℝ^4×1"| B["Hidden Layer<br/>A^[1] ∈ ℝ^4×1"]
    B -->|"W^[2] ∈ ℝ^1×4<br/>b^[2] ∈ ℝ^1×1"| C["Output Layer<br/>ŷ = A^[2] ∈ ℝ^1×1"]
```

The diagram shows how data flows through a two-layer neural network:

1. **Input Layer**: Receives 3 features as a column vector $A^{[0]}$.
2. **Hidden Layer**: Transforms the input using weights $W^{[1]}$ and bias $b^{[1]}$ to produce 4 hidden unit activations $A^{[1]}$.
3. **Output Layer**: Transforms the hidden activations using weights $W^{[2]}$ and bias $b^{[2]}$ to produce the final prediction $\hat{y} = A^{[2]}$.

Each arrow is labeled with the parameters that govern that transformation.
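The flow in the diagram can be run end to end with the example parameter values from this lesson. The sketch below assumes a sigmoid activation in both layers, since this lesson does not name the activation function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input and parameters copied from the cells above
A_0 = np.array([[2.0], [1.5], [0.8]])
W_1 = np.array([
    [0.2, 0.5, 0.1],
    [0.3, 0.1, 0.4],
    [0.6, 0.2, 0.3],
    [0.1, 0.4, 0.2],
])
b_1 = np.array([[0.1], [0.2], [0.05], [0.15]])
W_2 = np.array([[0.5, 0.3, 0.8, 0.2]])
b_2 = np.array([[0.1]])

# Hidden layer: (4, 3) @ (3, 1) + (4, 1) -> (4, 1)
Z_1 = W_1 @ A_0 + b_1
A_1 = sigmoid(Z_1)

# Output layer: (1, 4) @ (4, 1) + (1, 1) -> (1, 1)
Z_2 = W_2 @ A_1 + b_2
A_2 = sigmoid(Z_2)

print(f"A^[1] shape: {A_1.shape}")
print(f"A^[2] shape: {A_2.shape}")
print(f"Prediction: {A_2[0, 0]:.4f}")
```

The inner dimensions of each product must match, which is why $W^{[1]}$ has 3 columns (one per input feature) and $W^{[2]}$ has 4 (one per hidden unit).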

## Lesson 3

**Explain the core concepts of the lesson**

## Core Concepts

Forward propagation in a two-layer neural network involves computing predictions by passing input data through two sequential layers of computation.

**Hidden Layer Computation**: The first layer takes the input vector $x$ and computes a hidden representation. Each hidden node performs a linear transformation followed by an activation function, similar to logistic regression but with different parameters for each node.

**Layer Notation Convention**: We use square brackets to denote the layer number. For example, $W^{[1]}$ refers to weights in the first layer, and $a^{[1]}$ refers to activations from the first layer. Subscripts indicate specific nodes within a layer.

**Weight Matrix Stacking**: Instead of computing each hidden node individually with loops, we stack all weight vectors into a matrix. This allows us to compute all hidden nodes simultaneously through a single matrix-vector multiplication.

**Output Layer Computation**: The hidden layer activations become inputs to the output layer, which performs another linear transformation and activation to produce the final prediction.

**Vectorized Forward Pass**: By organizing computations as matrix operations, we can efficiently compute the entire forward pass without explicit loops, making the implementation fast and scalable.
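The weight-stacking idea above can be sketched with a small example. The per-node weights, biases, and input below are made-up illustrative values, and the names `w_nodes` and `b_nodes` are hypothetical. The sketch compares a per-node loop against a single matrix-vector product.

```python
import numpy as np

# Hypothetical per-node parameters: 3 hidden nodes, 2 input features.
w_nodes = [np.array([[0.1], [0.2]]),
           np.array([[0.3], [0.4]]),
           np.array([[0.5], [0.6]])]
b_nodes = [0.01, 0.02, 0.03]
x = np.array([[1.0], [2.0]])

# Loop version: one logistic-regression-style z per node.
z_loop = np.array([[(w.T @ x).item() + b] for w, b in zip(w_nodes, b_nodes)])

# Stacked version: each row of W1 is one node's transposed weight vector.
W1 = np.vstack([w.T for w in w_nodes])     # shape (3, 2)
b1 = np.array(b_nodes).reshape(-1, 1)      # shape (3, 1)
z_stacked = W1 @ x + b1                    # shape (3, 1)

print("Loop:   ", z_loop.ravel())
print("Stacked:", z_stacked.ravel())
print("Match:", np.allclose(z_loop, z_stacked))
```

Both versions compute the same three values; the stacked form simply replaces the Python loop with one matrix-vector multiplication.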

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Neural Networks as Repeated Logistic Regression**: With sigmoid activations, a neural network can be viewed as logistic regression applied multiple times. Each hidden node performs the same $z$ and activation computation as logistic regression, but with its own parameters. The output layer then applies logistic regression to the hidden layer outputs.

**Stacking for Efficiency**: Stacking parameter vectors into matrices allows you to compute all hidden nodes simultaneously instead of using loops. This vectorization is not just a convenience—it's essential for efficient computation on modern hardware.

**Layer Notation as Bookkeeping**: The notation with square brackets for layer number and subscripts for node index keeps track of which layer and which node within that layer you're referring to. This systematic notation prevents confusion when working with multiple layers.

**Chaining Transformations**: Hidden layer outputs become inputs to the output layer, creating a chain of transformations from raw features to final prediction. Each layer learns to transform its input into a more useful representation for the next layer.

**Present and explain the key equations used in the lesson**

## Key Equations

The forward propagation process follows a sequence of four key computations:

**Hidden Layer Linear Transformation**:
$$z^{[1]} = W^{[1]} x + b^{[1]}$$

Here, $W^{[1]}$ is the weight matrix for the first layer, $x$ is the input vector, and $b^{[1]}$ is the bias vector. The result $z^{[1]}$ is the pre-activation value for the hidden layer.

**Hidden Layer Activation**:
$$a^{[1]} = \sigma(z^{[1]})$$

The sigmoid function $\sigma$ is applied element-wise to each component of $z^{[1]}$ to produce the hidden layer activations $a^{[1]}$.

**Output Layer Linear Transformation**:
$$z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$$

The hidden layer activations $a^{[1]}$ are transformed by the second layer weights $W^{[2]}$ and bias $b^{[2]}$ to produce $z^{[2]}$.

**Output Layer Activation (Final Prediction)**:
$$\hat{y} = a^{[2]} = \sigma(z^{[2]})$$

The sigmoid function is applied to $z^{[2]}$ to produce the final prediction $\hat{y}$, which is also denoted as $a^{[2]}$.

**Implement code primitive: Compute z for hidden layer by matrix-vector multiplication: W[1] times x plus b[1]**

In [None]:
import numpy as np

# Example dimensions
n_x = 2  # input features
n_h = 3  # hidden nodes

# Initialize parameters
W1 = np.random.randn(n_h, n_x)
b1 = np.zeros((n_h, 1))
x = np.array([[1.0], [2.0]])

# Compute z[1]
z1 = np.dot(W1, x) + b1
print("z[1] shape:", z1.shape)
print("z[1]:\n", z1)

**Implement code primitive: Apply sigmoid element-wise to z[1] to get activation vector a[1]**

In [None]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Apply sigmoid to z[1]
a1 = sigmoid(z1)
print("a[1] shape:", a1.shape)
print("a[1]:\n", a1)
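As a side note, the plain formula `1 / (1 + np.exp(-z))` can trigger overflow warnings when `z` holds very large negative values. A minimal sketch of one common workaround follows; the name `stable_sigmoid` is hypothetical and not part of the lesson, and it assumes a NumPy array as input.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids calling exp on large positive arguments."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so the usual formula is safe.
    out[pos] = 1 / (1 + np.exp(-z[pos]))
    # For z < 0, rewrite as exp(z) / (1 + exp(z)); exp(z) is below 1 here.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1 + exp_z)
    return out

print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

For moderate inputs this should agree with the simpler `sigmoid` defined above.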

**Implement code primitive: Compute z for output layer: W[2] times a[1] plus b[2]**

In [None]:
# Initialize output layer parameters
n_y = 1  # output nodes
W2 = np.random.randn(n_y, n_h)
b2 = np.zeros((n_y, 1))

# Compute z[2]
z2 = np.dot(W2, a1) + b2
print("z[2] shape:", z2.shape)
print("z[2]:\n", z2)

**Implement code primitive: Apply sigmoid to z[2] to get final prediction**

In [None]:
# Apply sigmoid to z[2] for final prediction
a2 = sigmoid(z2)
print("a[2] (prediction) shape:", a2.shape)
print("a[2] (prediction):\n", a2)

**Implement code primitive: Stack individual node computations into vectorized matrix operations**

In [None]:
def forward_propagation(x, W1, b1, W2, b2):
    """
    Vectorized forward propagation for a two-layer neural network.
    
    Parameters:
    x: input vector (n_x, 1)
    W1: weights for hidden layer (n_h, n_x)
    b1: bias for hidden layer (n_h, 1)
    W2: weights for output layer (n_y, n_h)
    b2: bias for output layer (n_y, 1)
    
    Returns:
    a2: final prediction (n_y, 1)
    """
    z1 = np.dot(W1, x) + b1
    a1 = sigmoid(z1)
    z2 = np.dot(W2, a1) + b2
    a2 = sigmoid(z2)
    return a2

# Test the vectorized forward propagation
prediction = forward_propagation(x, W1, b1, W2, b2)
print("Final prediction:", prediction)

**Create a Mermaid diagram: Flowchart showing the sequence of forward propagation: input x → compute z[1] → apply sigmoid → get a[1] → compute z[2] → apply sigmoid → get output prediction**

## Forward Propagation Flow

```mermaid
graph TD
    A["Input: x"] --> B["Compute z[1] = W[1]x + b[1]"]
    B --> C["Apply sigmoid: a[1] = σ(z[1])"]
    C --> D["Compute z[2] = W[2]a[1] + b[2]"]
    D --> E["Apply sigmoid: ŷ = σ(z[2])"]
    E --> F["Output: ŷ"]
```

## Lesson 4

**Explain the core concepts of the lesson**

## Core Concepts

Vectorization across multiple training examples is a fundamental technique in neural network computation that dramatically improves computational efficiency. Instead of processing training examples one at a time using loops, we can stack all examples into matrices and compute predictions for the entire batch simultaneously.

The key concepts include:

- **Vectorization across examples**: Processing all m training examples at once using matrix operations rather than iterating through them individually
- **Matrix stacking convention**: Organizing data so that each column represents a single training example and each row represents a feature or node
- **Training example indexing**: Understanding how examples are indexed horizontally across the matrix
- **Hidden unit indexing**: Understanding how network nodes are indexed vertically within the matrix
- **Vectorized forward propagation**: Computing activations for all examples in a single matrix operation
- **Batch computation**: Processing multiple examples together as a batch
- **Matrix dimensions**: Understanding how dimensions change through the network layers
- **Activation matrix structure**: How activation matrices are organized with examples as columns and units as rows
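To give a rough feel for the efficiency claim, here is a small timing sketch for a single hidden layer. The sizes are arbitrary illustrative choices, and the measured times will vary by machine, so treat the numbers as indicative only.

```python
import time
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h, m = 20, 10, 5000          # illustrative sizes, not from the lesson
X = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n_h, n_x))
b1 = np.zeros((n_h, 1))

# Loop over examples, one column at a time.
start = time.perf_counter()
A1_loop = np.zeros((n_h, m))
for i in range(m):
    A1_loop[:, i:i+1] = sigmoid(W1 @ X[:, i:i+1] + b1)
loop_seconds = time.perf_counter() - start

# Single matrix operation over all columns.
start = time.perf_counter()
A1_vec = sigmoid(W1 @ X + b1)
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vec_seconds:.4f}s")
print("Same result:", np.allclose(A1_loop, A1_vec))
```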

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**The Column-Based Perspective**: Instead of looping through each training example one at a time, you can stack all examples as columns in a matrix and compute predictions for all of them simultaneously. This is like processing an entire batch of data in parallel rather than sequentially.

**Horizontal vs. Vertical Organization**: The horizontal direction of a matrix represents different training examples, while the vertical direction represents different nodes or features in the network. This convention makes it easy to think about what each dimension represents.

**Equation Similarity**: The vectorized equations are nearly identical to the single-example equations, with lowercase vectors replaced by uppercase matrices. For example, $z^{[1]} = W^{[1]} x + b^{[1]}$ becomes $Z^{[1]} = W^{[1]} X + b^{[1]}$. This consistency makes the transition intuitive.

**Connection to Logistic Regression**: This approach mirrors the vectorization technique used in logistic regression, making the transition from single to batch computation straightforward. If you've already learned vectorization in the context of logistic regression, the same principles apply here.

**Present and explain the key equations used in the lesson**

## Key Equations

The vectorized forward propagation equations for a neural network processing all m training examples simultaneously are:

**Layer 1 (Hidden Layer)**:
$$Z^{[1]} = W^{[1]} X + b^{[1]}$$
$$A^{[1]} = \sigma(Z^{[1]})$$

**Layer 2 (Output Layer)**:
$$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$$
$$A^{[2]} = \sigma(Z^{[2]})$$

Where:
- $X$ is the input matrix with shape (n_x, m), where n_x is the number of input features and m is the number of training examples
- $W^{[l]}$ is the weight matrix for layer l
- $b^{[l]}$ is the bias vector for layer l (broadcasted across all examples)
- $Z^{[l]}$ is the pre-activation matrix for layer l
- $A^{[l]}$ is the activation matrix for layer l
- $\sigma$ is the activation function (applied element-wise)

Each column of $X$, $Z^{[l]}$, and $A^{[l]}$ corresponds to one training example, and each row corresponds to a feature or node. The parameters $W^{[l]}$ and $b^{[l]}$ are shared by all examples.

**Implement code primitive: Implement vectorized forward propagation by replacing single-example equations with matrix operations that process all m training examples simultaneously**

In [None]:
import numpy as np

def vectorized_forward_propagation(X, W1, b1, W2, b2, activation_fn):
    """
    Vectorized forward propagation for all m training examples.
    
    X: input matrix of shape (n_x, m)
    W1, b1: weights and bias for layer 1
    W2, b2: weights and bias for layer 2
    activation_fn: activation function
    """
    Z1 = np.dot(W1, X) + b1
    A1 = activation_fn(Z1)
    
    Z2 = np.dot(W2, A1) + b2
    A2 = activation_fn(Z2)
    
    return Z1, A1, Z2, A2

# Example usage
np.random.seed(42)
m = 5  # number of training examples
n_x = 3  # number of input features
n_h = 4  # number of hidden units
n_y = 2  # number of output units

X = np.random.randn(n_x, m)
W1 = np.random.randn(n_h, n_x)
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(n_y, n_h)
b2 = np.zeros((n_y, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

Z1, A1, Z2, A2 = vectorized_forward_propagation(X, W1, b1, W2, b2, sigmoid)
print(f"X shape: {X.shape}")
print(f"Z1 shape: {Z1.shape}")
print(f"A1 shape: {A1.shape}")
print(f"Z2 shape: {Z2.shape}")
print(f"A2 shape: {A2.shape}")

**Implement code primitive: Convert unvectorized for-loop implementation (iterating over each training example) into vectorized matrix operations**

In [None]:
import numpy as np

def unvectorized_forward_propagation(X, W1, b1, W2, b2, activation_fn):
    """
    Unvectorized forward propagation using a for-loop over training examples.
    """
    m = X.shape[1]
    A1_list = []
    A2_list = []
    
    for i in range(m):
        x_i = X[:, i:i+1]
        z1_i = np.dot(W1, x_i) + b1
        a1_i = activation_fn(z1_i)
        z2_i = np.dot(W2, a1_i) + b2
        a2_i = activation_fn(z2_i)
        A1_list.append(a1_i)
        A2_list.append(a2_i)
    
    A1 = np.hstack(A1_list)
    A2 = np.hstack(A2_list)
    return A1, A2

def vectorized_forward_propagation(X, W1, b1, W2, b2, activation_fn):
    """
    Vectorized forward propagation processing all m examples simultaneously.
    """
    Z1 = np.dot(W1, X) + b1
    A1 = activation_fn(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = activation_fn(Z2)
    return A1, A2

# Test both implementations
np.random.seed(42)
m = 5
n_x = 3
n_h = 4
n_y = 2

X = np.random.randn(n_x, m)
W1 = np.random.randn(n_h, n_x)
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(n_y, n_h)
b2 = np.zeros((n_y, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

A1_unvec, A2_unvec = unvectorized_forward_propagation(X, W1, b1, W2, b2, sigmoid)
A1_vec, A2_vec = vectorized_forward_propagation(X, W1, b1, W2, b2, sigmoid)

print(f"Unvectorized A2 shape: {A2_unvec.shape}")
print(f"Vectorized A2 shape: {A2_vec.shape}")
print(f"Results match: {np.allclose(A2_unvec, A2_vec)}")

**Implement code primitive: Stack individual training example vectors as columns to form input matrix X and activation matrices A**

In [None]:
import numpy as np

# Create individual training examples
example_1 = np.array([[1.0], [2.0], [3.0]])
example_2 = np.array([[4.0], [5.0], [6.0]])
example_3 = np.array([[7.0], [8.0], [9.0]])

# Stack examples as columns to form the input matrix X
X = np.hstack([example_1, example_2, example_3])
print("Input matrix X (examples as columns):")
print(X)
print(f"Shape: {X.shape} (3 features, 3 examples)\n")

# Simulate activation values for hidden layer
a1_example_1 = np.array([[0.5], [0.6], [0.7], [0.8]])
a1_example_2 = np.array([[0.4], [0.5], [0.6], [0.7]])
a1_example_3 = np.array([[0.3], [0.4], [0.5], [0.6]])

# Stack activations as columns to form activation matrix A1
A1 = np.hstack([a1_example_1, a1_example_2, a1_example_3])
print("Activation matrix A1 (examples as columns):")
print(A1)
print(f"Shape: {A1.shape} (4 hidden units, 3 examples)")
print(f"\nEach column represents one training example")
print(f"Each row represents one hidden unit")

**Create a Mermaid diagram: Flowchart showing the transformation from unvectorized loop-based computation to vectorized matrix computation**

## Transformation from Unvectorized to Vectorized Computation

```mermaid
graph TD
    A["Start: m training examples"] --> B{"Unvectorized or Vectorized?"}
    
    B -->|Unvectorized| C["Loop: for i = 1 to m"]
    C --> D["Extract example i: x^(i)"]
    D --> E["Compute: z^[1](i) = W^[1] x^(i) + b^[1]"]
    E --> F["Compute: a^[1](i) = σ(z^[1](i))"]
    F --> G["Compute: z^[2](i) = W^[2] a^[1](i) + b^[2]"]
    G --> H["Compute: a^[2](i) = σ(z^[2](i))"]
    H --> I["Store result for example i"]
    I --> J{"More examples?"}
    J -->|Yes| D
    J -->|No| K["Combine all results"]
    
    B -->|Vectorized| L["Stack all examples as columns: X"]
    L --> M["Compute: Z^[1] = W^[1] X + b^[1]"]
    M --> N["Compute: A^[1] = σ(Z^[1])"]
    N --> O["Compute: Z^[2] = W^[2] A^[1] + b^[2]"]
    O --> P["Compute: A^[2] = σ(Z^[2])"]
    P --> Q["All results computed in one step"]
    
    K --> R["End: Forward propagation complete"]
    Q --> R
```

**Create a Mermaid diagram: Diagram illustrating matrix structure where horizontal axis represents training examples and vertical axis represents network nodes/features**

## Matrix Structure: Examples and Nodes

```mermaid
graph LR
    subgraph Input["Input Matrix X"]
        direction TB
        X1["Feature 1"]
        X2["Feature 2"]
        X3["Feature 3"]
        X1 --> XE1["Example 1"]
        X1 --> XE2["Example 2"]
        X1 --> XE3["Example 3"]
        X2 --> XE1
        X2 --> XE2
        X2 --> XE3
        X3 --> XE1
        X3 --> XE2
        X3 --> XE3
    end
    
    subgraph Hidden["Hidden Layer Matrix A^[1]"]
        direction TB
        H1["Hidden Unit 1"]
        H2["Hidden Unit 2"]
        H3["Hidden Unit 3"]
        H4["Hidden Unit 4"]
        H1 --> HE1["Example 1"]
        H1 --> HE2["Example 2"]
        H1 --> HE3["Example 3"]
        H2 --> HE1
        H2 --> HE2
        H2 --> HE3
        H3 --> HE1
        H3 --> HE2
        H3 --> HE3
        H4 --> HE1
        H4 --> HE2
        H4 --> HE3
    end
    
    Input -->|"W^[1], b^[1]"| Hidden
    
    style Input fill:#e1f5ff
    style Hidden fill:#f3e5f5
```

## Lesson 5

**Explain the core concepts of the lesson**

## Core Concepts

Vectorized forward propagation extends the forward propagation computation from processing a single training example to processing all $m$ training examples simultaneously using matrix operations.

**Key Ideas:**

1. **Stacking Training Examples**: Instead of processing each training example $x^{(i)}$ individually, we stack all examples horizontally into a matrix $X$ of shape $(n_x, m)$, where $n_x$ is the input dimension and $m$ is the number of training examples.

2. **Matrix Multiplication Vectorization**: When we multiply the weight matrix $W$ by the stacked input matrix $X$, we compute all outputs simultaneously: $Z = WX$. Each column of $Z$ corresponds to the output for one training example.

3. **Broadcasting Bias Addition**: NumPy broadcasting adds the bias vector $b$ to every column of $Z$ without explicit replication, which keeps the computation efficient.

4. **Layer-wise Computation Symmetry**: Each layer performs the same computation pattern—multiply by weights, add bias, apply activation—regardless of how many examples are processed.

5. **Column Correspondence**: The column structure is preserved through the computation: the $i$-th column of $X$ produces the $i$-th column of $Z$, which becomes the $i$-th column of $A$ after applying the activation function.
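The column correspondence described in the list above can be checked directly. This is a minimal sketch with random illustrative values; the sizes and the permutation are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_l, m = 3, 4, 5               # illustrative sizes
W = rng.standard_normal((n_l, n_x))
b = rng.standard_normal((n_l, 1))
X = rng.standard_normal((n_x, m))

Z = W @ X + b                       # all m examples at once

# Column i of Z should equal the single-example result for column i of X.
for i in range(m):
    z_i = W @ X[:, i:i+1] + b
    assert np.allclose(Z[:, i:i+1], z_i)

# Reordering the columns of X reorders the columns of Z the same way.
perm = np.array([4, 0, 3, 1, 2])
print("Permutation preserved:", np.allclose(W @ X[:, perm] + b, Z[:, perm]))
```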

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Intuition 1: Column Stacking Preserves Structure**

When you stack training examples as columns in a matrix, the corresponding outputs also stack as columns in the result matrix. This is because matrix multiplication naturally produces outputs in the same column positions as the inputs. If you think of each column as a "slot" for one example, matrix multiplication fills all slots simultaneously while preserving their positions.

**Intuition 2: Repeated Layer Computation**

Each layer of a neural network performs the same type of computation repeatedly: multiply by weights, add bias, apply activation. Deeper networks just repeat this pattern more times. Whether you're processing one example or a thousand, each layer does the same operation—the only difference is the shape of the matrices involved.

**Intuition 3: Broadcasting as Implicit Replication**

NumPy broadcasting handles adding the bias vector to each column of the result matrix, so you don't need to replicate it explicitly. Instead of manually copying $b$ to match the shape of $Z$, NumPy "stretches" $b$ across all columns during the addition.
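A short sketch of this "implicit replication" view: adding a column-vector bias by broadcasting should give the same result as tiling it explicitly with `np.tile`. The values are illustrative. The last part shows a related pitfall, assuming the bias is accidentally stored as a 1-D array.

```python
import numpy as np

Z = np.arange(12, dtype=float).reshape(4, 3)   # stand-in for W @ X, shape (n_l, m)
b = np.array([[0.1], [0.2], [0.3], [0.4]])     # shape (n_l, 1)

# Explicit replication: copy b into m columns, then add.
b_tiled = np.tile(b, (1, Z.shape[1]))          # shape (4, 3)
explicit = Z + b_tiled

# Broadcasting: NumPy stretches the size-1 axis of b to match Z.
broadcast = Z + b
print("Same result:", np.array_equal(explicit, broadcast))

# A 1-D bias of shape (n_l,) is aligned with the LAST axis of Z instead,
# so here (4,) against (4, 3) is a shape mismatch.
b_flat = b.ravel()
try:
    Z + b_flat
except ValueError as err:
    print("1-D bias fails:", err)
```

Keeping biases as `(n_l, 1)` column vectors, as the cells in this lesson do, avoids that mismatch.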

**Present and explain the key equations used in the lesson**

## Key Equations

**Single Example Forward Propagation:**

For a single training example $x^{(i)}$:

$$z^{[l](i)} = W^{[l]} a^{[l-1](i)} + b^{[l]}$$

$$a^{[l](i)} = \sigma(z^{[l](i)})$$

where $x^{(i)} = a^{[0](i)}$ is the input for example $i$.

**Vectorized Forward Propagation:**

For all $m$ training examples stacked as columns in matrix $X = A^{[0]}$:

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$$

$$A^{[l]} = \sigma(Z^{[l]})$$

where:
- $X$ has shape $(n_x, m)$ — each column is one training example
- $W^{[l]}$ has shape $(n^{[l]}, n^{[l-1]})$ — the weight matrix for layer $l$
- $Z^{[l]}$ has shape $(n^{[l]}, m)$ — each column is the pre-activation output for one example
- $b^{[l]}$ has shape $(n^{[l]}, 1)$ — broadcasted to all $m$ columns
- $A^{[l]}$ has shape $(n^{[l]}, m)$ — each column is the activation output for one example

The key insight is that $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ applies the same transformation to all examples in parallel.
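Because every layer applies the same pattern, the forward pass can be written as a loop over layers. The sketch below assumes sigmoid activations in every layer and a hypothetical parameter layout (a list of `(W, b)` pairs); the function name `forward_all_layers` is made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_all_layers(X, params):
    """params is a list of (W, b) pairs, one per layer (a hypothetical layout)."""
    A = X                                   # A^[0] = X
    activations = [A]
    for W, b in params:
        Z = W @ A + b                       # Z^[l] = W^[l] A^[l-1] + b^[l]
        A = sigmoid(Z)                      # A^[l] = sigma(Z^[l])
        activations.append(A)
    return activations

rng = np.random.default_rng(2)
layer_sizes = [2, 4, 1]                     # n^[0], n^[1], n^[2]
params = [(rng.standard_normal((n_out, n_in)) * 0.01, np.zeros((n_out, 1)))
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

for l, A in enumerate(forward_all_layers(X, params)):
    print(f"A[{l}] shape: {A.shape}")
```

Every activation matrix keeps $m$ columns, while the number of rows follows the layer sizes.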

**Implement code primitive: Demonstrate matrix multiplication of weight matrix W with stacked input matrix X to produce stacked output matrix Z**

In [None]:
import numpy as np

# Example: 3 training examples, 2 input features, 4 output units
m = 3  # number of training examples
n_x = 2  # input dimension
n_l = 4  # output dimension (number of units in layer l)

# Weight matrix W^[l] with shape (n_l, n_x)
W = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8]
])

# Stacked input matrix X with shape (n_x, m)
# Each column is one training example
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0]
])

# Matrix multiplication: Z = W @ X
Z = W @ X

print("Weight matrix W shape:", W.shape)
print("Input matrix X shape:", X.shape)
print("Output matrix Z shape:", Z.shape)
print("\nOutput matrix Z:")
print(Z)
print("\nEach column of Z is the output for one training example")

**Implement code primitive: Show how Python broadcasting adds bias vector b to each column of the Z matrix simultaneously**

In [None]:
import numpy as np

# Continue from previous example
m = 3
n_l = 4

# Z = W @ X from the previous computation
Z = np.array([
    [0.9, 1.2, 1.5],
    [1.9, 2.6, 3.3],
    [2.9, 4.0, 5.1],
    [3.9, 5.4, 6.9]
])

# Bias vector b^[l] with shape (n_l, 1)
b = np.array([
    [0.1],
    [0.2],
    [0.3],
    [0.4]
])

# Broadcasting: b is automatically added to each column of Z
Z_with_bias = Z + b

print("Z shape:", Z.shape)
print("b shape:", b.shape)
print("Z + b shape:", Z_with_bias.shape)
print("\nZ + b (bias added to each column):")
print(Z_with_bias)
print("\nNotice: b is added to all", m, "columns without explicit replication")

**Implement code primitive: Implement vectorized forward propagation loop that processes all m training examples in one matrix operation instead of iterating through individual examples**

In [None]:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Network parameters
m = 3  # number of training examples
n_x = 2  # input dimension
n_1 = 4  # hidden layer dimension
n_2 = 1  # output layer dimension

# Initialize weights and biases
W1 = np.random.randn(n_1, n_x) * 0.01
b1 = np.zeros((n_1, 1))
W2 = np.random.randn(n_2, n_1) * 0.01
b2 = np.zeros((n_2, 1))

# Training data: X has shape (n_x, m)
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0]
])

# Vectorized forward propagation
A0 = X
Z1 = W1 @ A0 + b1
A1 = sigmoid(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)

print("Input A0 shape:", A0.shape)
print("Hidden layer Z1 shape:", Z1.shape)
print("Hidden layer A1 shape:", A1.shape)
print("Output Z2 shape:", Z2.shape)
print("Output A2 shape:", A2.shape)
print("\nAll", m, "training examples processed in one forward pass")

**Create a Mermaid diagram: Flowchart showing the progression from single-example forward propagation (z = Wx + b for each i) to vectorized forward propagation (Z = WX + b for all examples simultaneously)**

```mermaid
graph TD
    A["Single Example Approach"] --> B["For each training example i"]
    B --> C["Compute z^[l](i) = W^[l] x^(i) + b^[l]"]
    C --> D["Compute a^[l](i) = σ(z^[l](i))"]
    D --> E["Repeat for all m examples"]
    E --> F["Result: m separate computations"]
    
    G["Vectorized Approach"] --> H["Stack all examples as columns in X"]
    H --> I["Compute Z^[l] = W^[l] X + b^[l]"]
    I --> J["Compute A^[l] = σ(Z^[l])"]
    J --> K["Result: All m examples in one operation"]
    
    F --> L{"Efficiency Comparison"}
    K --> L
    L --> M["Vectorized is much faster"]
```

**Create a Mermaid diagram: Diagram illustrating how stacking training examples horizontally in matrix X results in corresponding outputs stacked horizontally in matrix Z, with column correspondence highlighted**

```mermaid
graph LR
    subgraph Input["Input Matrix X (n_x × m)"]
        X1["x^(1)"] 
        X2["x^(2)"]
        X3["x^(3)"]
    end
    
    subgraph Weights["Weight Matrix W (n_l × n_x)"]
        W["W^[l]"]
    end
    
    subgraph Output["Output Matrix Z (n_l × m)"]
        Z1["z^[l](1)"]
        Z2["z^[l](2)"]
        Z3["z^[l](3)"]
    end
    
    Input -->|Column 1| Z1
    Input -->|Column 2| Z2
    Input -->|Column 3| Z3
    Weights -->|Applied to all| Z1
    Weights -->|Applied to all| Z2
    Weights -->|Applied to all| Z3
    
    style X1 fill:#e1f5ff
    style X2 fill:#e1f5ff
    style X3 fill:#e1f5ff
    style Z1 fill:#fff3e0
    style Z2 fill:#fff3e0
    style Z3 fill:#fff3e0
```

## Lesson 6

**Explain the core concepts of the lesson**

## Core Concepts

This lesson explores why **non-linear activation functions** are essential in neural networks.

**Linear Activation Functions** (also called identity activation) simply pass the input through unchanged: $a = z$. When used in hidden layers, they provide no additional computational power.

**Non-linear Activation Functions** (such as ReLU, Tanh, or Sigmoid) apply non-linear transformations to the output of each neuron. These are crucial for enabling neural networks to learn complex, non-linear relationships in data.

**Hidden Layer Expressiveness** refers to the ability of hidden layers to compute and represent increasingly complex functions. Without non-linearity, adding more layers does not increase this expressiveness.

**Composition of Linear Functions** is the mathematical principle that explains why linear activation functions are limited: when you compose (chain) multiple linear functions together, the result is always another linear function. This means a deep network with only linear activations is mathematically equivalent to a single linear model.

**Regression Output Layer** is a special case where linear activation may be appropriate—when predicting continuous real-valued targets (like housing prices), the output layer often uses linear activation to allow any real-valued output.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**The Composition Problem**: Imagine stacking multiple transparent sheets, each with a linear transformation drawn on it. No matter how many sheets you stack, the combined effect is still just a linear transformation. You cannot create a curve or bend in the data by composing straight lines. This is exactly what happens in a neural network with only linear activations—no matter how deep it is, it can only learn linear relationships.

**Why Non-linearity Matters**: Non-linear activation functions are like adding a "bend" or "twist" at each layer. When you compose a non-linear function with another non-linear function, you can create increasingly complex, curved decision boundaries and function approximations. This is what gives deep networks their power.
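A tiny hand-built example may make the "bend" concrete. The weights below are chosen by hand (not learned) so that a two-unit ReLU hidden layer reproduces XOR. For simplicity the output layer is left linear, which is an assumption of this sketch rather than something from the lesson.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# The four XOR inputs as columns, and their targets.
X = np.array([[0.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0, 1.0]])
y = np.array([[0.0, 1.0, 1.0, 0.0]])

# Hand-picked (not learned) weights for a 2-unit hidden layer.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([[0.0], [-1.0]])
W2 = np.array([[1.0, -2.0]])
b2 = np.array([[0.0]])

out_relu = W2 @ relu(W1 @ X + b1) + b2      # non-linear hidden layer
out_linear = W2 @ (W1 @ X + b1) + b2        # identity hidden layer

print("ReLU hidden:  ", out_relu.ravel())
print("Linear hidden:", out_linear.ravel())
```

With ReLU the outputs match the XOR targets exactly. With the identity activation the same weights collapse to the single linear function $2 - x_1 - x_2$, which cannot reproduce XOR.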

**Depth Without Non-linearity is Useless**: A 10-layer neural network with linear activations is no more powerful than a single-layer linear model. The depth provides no benefit. But a 10-layer network with non-linear activations can represent vastly more complex functions than a shallow network. This is why non-linearity is the key ingredient that makes deep learning work.

**Linear Output Layers**: In regression tasks, we often want the network to output any real number (positive, negative, or zero). A linear activation in the output layer allows this. However, hidden layers still need non-linearity to learn the complex patterns that feed into this final linear transformation.

**Present and explain the key equations used in the lesson**

## Key Equations

**Layer 1 with Linear Activation**:
$$a^{[1]} = z^{[1]} = W^{[1]}x + b^{[1]}$$

Here, $W^{[1]}$ is the weight matrix, $x$ is the input, and $b^{[1]}$ is the bias. The activation $a^{[1]}$ is identical to the pre-activation $z^{[1]}$ because the activation function is the identity.

**Layer 2 with Linear Activation**:
$$a^{[2]} = z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$$

The output of layer 2 depends on the activation from layer 1.

**Composition Simplification**:
$$a^{[2]} = W^{[2]}(W^{[1]}x + b^{[1]}) + b^{[2]} = (W^{[2]}W^{[1]})x + (W^{[2]}b^{[1]} + b^{[2]}) = W'x + b'$$

This is the crucial insight: when we substitute the expression for $a^{[1]}$ into the equation for $a^{[2]}$, we can algebraically simplify it to a single linear transformation with combined weights $W' = W^{[2]}W^{[1]}$ and combined bias $b' = W^{[2]}b^{[1]} + b^{[2]}$. No matter how many linear layers we stack, the result is always equivalent to a single linear transformation.
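The same folding argument extends beyond two layers. As a sketch, the loop below collapses an arbitrary stack of linear layers into one pair $(W', b')$ by repeatedly applying $W' \leftarrow W^{[l]} W'$ and $b' \leftarrow W^{[l]} b' + b^{[l]}$. The layer widths and random values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = [3, 5, 4, 6, 2]                      # illustrative layer widths
layers = [(rng.standard_normal((n_out, n_in)), rng.standard_normal((n_out, 1)))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal((sizes[0], 1))

# Layer-by-layer pass with identity activations.
a = x
for W, b in layers:
    a = W @ a + b

# Fold every layer into one pair (W', b').
W_prime = np.eye(sizes[0])
b_prime = np.zeros((sizes[0], 1))
for W, b in layers:
    W_prime = W @ W_prime                    # W' <- W^[l] W'
    b_prime = W @ b_prime + b                # b' <- W^[l] b' + b^[l]

print("W' shape:", W_prime.shape)
print("Same output:", np.allclose(a, W_prime @ x + b_prime))
```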

**Implement code primitive: Demonstrate the mathematical composition of two linear transformations to show that the result is a single linear function**

In [None]:
import numpy as np

# Define two linear transformations
W1 = np.array([[2, 1], [1, 3]])
b1 = np.array([0.5, -0.5])

W2 = np.array([[1, 2], [3, 1]])
b2 = np.array([1.0, 0.5])

# Input
x = np.array([1, 2])

# Method 1: Apply transformations sequentially
a1 = W1 @ x + b1
a2 = W2 @ a1 + b2

print("Sequential application:")
print(f"a1 = {a1}")
print(f"a2 = {a2}")

# Method 2: Compose into a single transformation
W_combined = W2 @ W1
b_combined = W2 @ b1 + b2

a2_direct = W_combined @ x + b_combined

print("\nDirect composition:")
print(f"W' = W2 @ W1 = \n{W_combined}")
print(f"b' = W2 @ b1 + b2 = {b_combined}")
print(f"a2 (direct) = {a2_direct}")

print("\nAre they equal?", np.allclose(a2, a2_direct))

**Implement code primitive: Implement a simple neural network with linear activation functions and show that it produces identical output to logistic regression**

In [None]:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Generate simple binary classification data
np.random.seed(42)
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Logistic regression (single linear model)
lr = LogisticRegression()
lr.fit(X, y)
lr_predictions = lr.predict_proba(X)[:, 1]

# Neural network with linear hidden layer and sigmoid output
class LinearNetworkClassifier:
    def __init__(self, input_dim, hidden_dim):
        self.W1 = np.random.randn(input_dim, hidden_dim) * 0.01
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, 1) * 0.01
        self.b2 = np.zeros(1)
    
    def forward(self, X):
        # Linear hidden layer
        Z1 = X @ self.W1 + self.b1
        A1 = Z1  # Linear activation
        # Output layer with sigmoid
        Z2 = A1 @ self.W2 + self.b2
        A2 = 1 / (1 + np.exp(-Z2))
        return A2.flatten()

# Build a network equivalent to the fitted logistic regression:
# an identity hidden layer followed by the logistic regression weights
net = LinearNetworkClassifier(2, 2)
net.W1 = np.eye(2)  # Identity mapping
net.b1 = np.zeros(2)
net.W2 = lr.coef_.T
net.b2 = lr.intercept_

net_predictions = net.forward(X)

print("Logistic Regression predictions (first 5):", lr_predictions[:5])
print("Linear Network predictions (first 5):", net_predictions[:5])
print("\nAre they equivalent?", np.allclose(lr_predictions, net_predictions))

**Implement code primitive: Compare the output of a network with non-linear hidden activations versus linear activations on a non-linear dataset**

In [None]:
import numpy as np

# Generate non-linear dataset (points inside vs. outside the unit circle)
np.random.seed(42)
X = np.random.randn(200, 2)
y = ((X[:, 0]**2 + X[:, 1]**2) > 1).astype(int)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def relu(z):
    return np.maximum(0, z)

# Network with linear hidden activation
W1_linear = np.random.randn(2, 5) * 0.5
b1_linear = np.zeros(5)
W2_linear = np.random.randn(5, 1) * 0.5
b2_linear = np.zeros(1)

Z1_linear = X @ W1_linear + b1_linear
A1_linear = Z1_linear  # Linear activation
Z2_linear = A1_linear @ W2_linear + b2_linear
A2_linear = sigmoid(Z2_linear).flatten()

# Network with ReLU hidden activation
W1_relu = np.random.randn(2, 5) * 0.5
b1_relu = np.zeros(5)
W2_relu = np.random.randn(5, 1) * 0.5
b2_relu = np.zeros(1)

Z1_relu = X @ W1_relu + b1_relu
A1_relu = relu(Z1_relu)  # ReLU activation
Z2_relu = A1_relu @ W2_relu + b2_relu
A2_relu = sigmoid(Z2_relu).flatten()

# Calculate accuracy
accuracy_linear = np.mean((A2_linear > 0.5) == y)
accuracy_relu = np.mean((A2_relu > 0.5) == y)

print(f"Linear activation accuracy: {accuracy_linear:.3f}")
print(f"ReLU activation accuracy: {accuracy_relu:.3f}")
print(f"\nDifference: {accuracy_relu - accuracy_linear:.3f}")
print("\nNote: both networks use random, untrained weights, so this compares forward passes at initialization, not learned performance.")

**Create a Mermaid diagram: flowchart showing the progression from linear activation (no hidden layer benefit) to non-linear activation (enables deep learning expressiveness)**

```mermaid
flowchart TD
    A["Neural Network Architecture"] --> B{"Activation Function Type"}
    B -->|"Linear Activation" | C["Hidden Layer Output: Linear Combination"]
    C --> D["Composition of Linear Functions"]
    D --> E["Result: Single Linear Model"]
    E --> F["No Additional Expressiveness<br/>Depth Provides No Benefit"]
    
    B -->|"Non-linear Activation<br/>ReLU, Tanh, Sigmoid" | G["Hidden Layer Output: Non-linear Transform"]
    G --> H["Composition of Non-linear Functions"]
    H --> I["Result: Complex Non-linear Function"]
    I --> J["Increased Expressiveness<br/>Depth Enables Learning Complex Patterns"]
    
    F --> K["Conclusion: Non-linearity is Essential"]
    J --> K
```

**Create a Mermaid diagram: diagram illustrating how composing two linear functions W2(W1*x + b1) + b2 simplifies to a single linear function W'*x + b'**

```mermaid
graph TD
    A["Input: x"] --> B["Layer 1: W¹x + b¹"]
    B --> C["a¹ = W¹x + b¹<br/>Linear Activation"]
    C --> D["Layer 2: W²a¹ + b²"]
    D --> E["a² = W²a¹ + b²"]
    
    E --> F["Substitute a¹"]
    F --> G["a² = W²(W¹x + b¹) + b²"]
    G --> H["Expand"]
    H --> I["a² = W²W¹x + W²b¹ + b²"]
    I --> J["Simplify"]
    J --> K["a² = W'x + b'<br/>where W' = W²W¹<br/>and b' = W²b¹ + b²"]
    
    K --> L["Result: Single Linear Transformation"]
    L --> M["Equivalent to a 1-Layer Model"]
```
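
The algebra in the diagram can be checked numerically. The sketch below uses made-up layer sizes and random values (all names are illustrative, not from the lesson) and compares a two-layer network with a linear hidden activation against the collapsed single layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 inputs, 4 hidden units, 2 outputs, 5 examples
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
X = rng.standard_normal((3, 5))

# Two layers with a linear (identity) hidden activation
two_layer = W2 @ (W1 @ X + b1) + b2

# The equivalent single layer: W' = W2 W1, b' = W2 b1 + b2
W_prime = W2 @ W1
b_prime = W2 @ b1 + b2
one_layer = W_prime @ X + b_prime

print(np.allclose(two_layer, one_layer))  # True
```

The hidden layer's four units leave no trace: `W_prime` has shape `(2, 3)`, exactly what a network with no hidden layer would use.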

## Lesson 7

**Explain the core concepts of the lesson**

## Core Concepts

A neural network with a single hidden layer has four sets of learnable parameters (two weight matrices and two bias vectors) that must be optimized through gradient descent:

- **W¹** and **b¹**: Weight matrix and bias vector for the first layer (input to hidden)
- **W²** and **b²**: Weight matrix and bias vector for the second layer (hidden to output)

The dimensions of these parameters are determined by the network architecture:
- W¹ has shape (hidden_size, input_size)
- b¹ has shape (hidden_size, 1)
- W² has shape (output_size, hidden_size)
- b² has shape (output_size, 1)
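
To see how these shapes fit together, here is a minimal sketch with hypothetical sizes (3 inputs, 4 hidden units, 1 output, 5 examples). Activations are omitted because they are applied element-wise and do not change shapes.

```python
import numpy as np

# Hypothetical sizes chosen only for illustration
input_size, hidden_size, output_size, m = 3, 4, 1, 5

W1 = np.zeros((hidden_size, input_size))
b1 = np.zeros((hidden_size, 1))
W2 = np.zeros((output_size, hidden_size))
b2 = np.zeros((output_size, 1))
X = np.zeros((input_size, m))  # one column per training example

Z1 = W1 @ X + b1   # (hidden_size, m); b1 broadcasts across the m columns
Z2 = W2 @ Z1 + b2  # (output_size, m); b2 broadcasts the same way

print(Z1.shape, Z2.shape)  # (4, 5) (1, 5)
```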

The learning process involves:
1. **Forward propagation**: Computing predictions by passing data through the network
2. **Cost computation**: Measuring prediction error across all training examples
3. **Backpropagation**: Computing gradients of the cost with respect to each parameter
4. **Gradient descent**: Updating parameters in the direction that reduces cost

This cycle repeats until the parameters converge to values that minimize the cost function.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Parameter Initialization**: Initializing parameters randomly (not to zero) is crucial. If all parameters start at zero, neurons in the hidden layer will compute identical functions, preventing the network from learning diverse features. Random initialization breaks this symmetry.

**Forward vs. Backward Flow**: Forward propagation moves left-to-right through the network, transforming input data into predictions. Backpropagation moves right-to-left, using the chain rule to compute how changes in each parameter affect the overall cost. This is why we compute gradients in reverse order: dZ² → dW², dB² → dA¹ → dZ¹ → dW¹, dB¹.

**Vectorization**: Instead of computing gradients for one training example at a time, we process all m examples simultaneously using matrix operations. This makes the code efficient and the notation cleaner—each variable represents a batch of values, not a single value.

**Gradient Descent Update**: The update rule is uniform across all parameters: subtract the learning rate times the gradient from the current parameter. This simple rule, applied iteratively, gradually reduces the cost.

**Matrix Dimensions**: Maintaining consistent matrix dimensions throughout forward and backward passes is essential. Each operation must respect the shapes of the matrices involved, and bias gradients require special handling with `np.sum(dZ, axis=1, keepdims=True)` so that the result keeps the column shape `(n, 1)` of the bias vector instead of collapsing to a 1-D array.
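
A small sketch of why `keepdims=True` matters, using a made-up `dZ` with 3 units and 2 examples. Without it, the bias update still runs, but broadcasting quietly turns the `(3, 1)` bias into a `(3, 3)` matrix.

```python
import numpy as np

dZ = np.arange(6, dtype=float).reshape(3, 2)  # 3 units, m = 2 examples
b = np.zeros((3, 1))

db_flat = np.sum(dZ, axis=1) / 2                 # shape (3,)
db_col = np.sum(dZ, axis=1, keepdims=True) / 2   # shape (3, 1)

print(db_flat.shape, db_col.shape)   # (3,) (3, 1)
print((b - 0.1 * db_flat).shape)     # (3, 3): broadcasting silently changed b
print((b - 0.1 * db_col).shape)      # (3, 1): shape preserved
```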

**Present and explain the key equations used in the lesson**

## Key Equations

**Forward Propagation:**

$$Z^{[1]} = W^{[1]}X + b^{[1]}$$

$$A^{[1]} = g^{[1]}(Z^{[1]})$$

$$Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$$

$$A^{[2]} = g^{[2]}(Z^{[2]})$$

where $g^{[1]}$ and $g^{[2]}$ are activation functions (e.g., ReLU for hidden layer, sigmoid for output layer in binary classification).

**Cost Function:**

$$J = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}^{(i)}, y^{(i)})$$

where $m$ is the number of training examples and $L$ is the loss function.

**Backpropagation** (for a sigmoid output layer with binary cross-entropy loss):

$$dZ^{[2]} = A^{[2]} - Y$$

$$dW^{[2]} = \frac{1}{m}dZ^{[2]}A^{[1]T}$$

$$db^{[2]} = \frac{1}{m}\text{np.sum}(dZ^{[2]}, \text{axis}=1, \text{keepdims}=\text{True})$$

$$dA^{[1]} = W^{[2]T}dZ^{[2]}$$

$$dZ^{[1]} = dA^{[1]} * g^{[1]\prime}(Z^{[1]}) \quad \text{(element-wise product)}$$

$$dW^{[1]} = \frac{1}{m}dZ^{[1]}X^{T}$$

$$db^{[1]} = \frac{1}{m}\text{np.sum}(dZ^{[1]}, \text{axis}=1, \text{keepdims}=\text{True})$$

**Gradient Descent Update:**

$$W := W - \alpha dW$$

$$b := b - \alpha db$$

where $\alpha$ is the learning rate.

**Implement code primitive: Initialize parameters W1, B1, W2, B2 randomly (not to zeros) with appropriate dimensions based on layer sizes**

In [None]:
import numpy as np

def initialize_parameters(input_size, hidden_size, output_size):
    W1 = np.random.randn(hidden_size, input_size) * 0.01
    b1 = np.zeros((hidden_size, 1))
    W2 = np.random.randn(output_size, hidden_size) * 0.01
    b2 = np.zeros((output_size, 1))
    return W1, b1, W2, b2

**Implement code primitive: Implement forward propagation: compute Z1, A1, Z2, A2 using matrix operations and activation functions**

In [None]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def forward_propagation(X, W1, b1, W2, b2):
    Z1 = np.dot(W1, X) + b1
    A1 = relu(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    return Z1, A1, Z2, A2

**Implement code primitive: Compute cost function as average loss across all training examples for binary classification**

In [None]:
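def compute_cost(A2, Y):
    m = Y.shape[1]  # number of training examples
    eps = 1e-8      # small constant to avoid log(0)
    losses = -(Y * np.log(A2 + eps) + (1 - Y) * np.log(1 - A2 + eps))
    cost = np.sum(losses) / m  # average loss over the m examples
    return cost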

**Implement code primitive: Implement backpropagation: compute dZ2, dW2, dB2 using vectorized operations**

In [None]:
def backward_propagation_layer2(A2, Y, A1, W2):
    m = Y.shape[1]
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    dB2 = np.sum(dZ2, axis=1, keepdims=True) / m
    return dZ2, dW2, dB2

**Implement code primitive: Compute dA1 as transpose of W2 times dZ2 (chain rule through second layer)**

In [None]:
def compute_dA1(dZ2, W2):
    dA1 = np.dot(W2.T, dZ2)
    return dA1

**Implement code primitive: Compute dZ1 as element-wise product of dA1 and derivative of activation function applied to Z1**

In [None]:
def relu_derivative(Z1):
    return (Z1 > 0).astype(float)

def compute_dZ1(dA1, Z1):
    dZ1 = dA1 * relu_derivative(Z1)
    return dZ1

**Implement code primitive: Compute dW1 and dB1 using vectorized matrix operations with proper dimension handling**

In [None]:
def backward_propagation_layer1(dZ1, X):
    m = X.shape[1]
    dW1 = np.dot(dZ1, X.T) / m
    dB1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, dB1

**Implement code primitive: Use np.sum with axis=1 and keepdims=True to maintain correct matrix dimensions for bias gradients**

In [None]:
def compute_bias_gradients(dZ, m):
    dB = np.sum(dZ, axis=1, keepdims=True) / m
    return dB

**Implement code primitive: Implement gradient descent update: subtract learning rate times gradient from each parameter**

In [None]:
def update_parameters(W1, b1, W2, b2, dW1, dB1, dW2, dB2, learning_rate):
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * dB1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * dB2
    return W1, b1, W2, b2

**Implement code primitive: Repeat forward-backward-update cycle until convergence**

In [None]:
def train_neural_network(X, Y, input_size, hidden_size, output_size, learning_rate=0.01, iterations=1000):
    W1, b1, W2, b2 = initialize_parameters(input_size, hidden_size, output_size)
    
    for i in range(iterations):
        Z1, A1, Z2, A2 = forward_propagation(X, W1, b1, W2, b2)
        cost = compute_cost(A2, Y)
        
        dZ2, dW2, dB2 = backward_propagation_layer2(A2, Y, A1, W2)
        dA1 = compute_dA1(dZ2, W2)
        dZ1 = compute_dZ1(dA1, Z1)
        dW1, dB1 = backward_propagation_layer1(dZ1, X)
        
        W1, b1, W2, b2 = update_parameters(W1, b1, W2, b2, dW1, dB1, dW2, dB2, learning_rate)
    
    return W1, b1, W2, b2

**Create a Mermaid diagram: Flowchart showing the iterative gradient descent loop: initialize parameters, forward propagation, compute cost, backpropagation, compute gradients, update parameters, repeat until convergence**

```mermaid
graph TD
    A[Initialize Parameters] --> B[Forward Propagation]
    B --> C[Compute Cost]
    C --> D[Backpropagation]
    D --> E[Compute Gradients]
    E --> F[Update Parameters]
    F --> G{Converged?}
    G -->|No| B
    G -->|Yes| H[Training Complete]
```

**Create a Mermaid diagram: Diagram showing forward propagation flow through layers: input X → Z1 → A1 → Z2 → A2 with parameter matrices W1, B1, W2, B2 labeled at each step**

```mermaid
graph LR
    X["Input X"] -->|W¹, b¹| Z1["Z¹"]
    Z1 -->|ReLU| A1["A¹"]
    A1 -->|W², b²| Z2["Z²"]
    Z2 -->|Sigmoid| A2["A² (Output)"]
```

**Create a Mermaid diagram: Diagram showing backpropagation flow: cost J → dZ2 → dW2, dB2 and dA1 → dZ1 → dW1, dB1 with chain rule connections illustrated**

```mermaid
graph TD
    J["Cost J"] --> dZ2["dZ²"]
    dZ2 --> dW2["dW²"]
    dZ2 --> dB2["dB²"]
    dZ2 --> dA1["dA¹"]
    dA1 --> dZ1["dZ¹"]
    dZ1 --> dW1["dW¹"]
    dZ1 --> dB1["dB¹"]
```

## Lesson 8

**Explain the core concepts of the lesson**

## Core Concepts of Backpropagation

Backpropagation is the fundamental algorithm for training neural networks. It computes gradients of the loss function with respect to all parameters by working backward through the computation graph.

**Key concepts:**

1. **Computation Graph**: A directed acyclic graph representing how inputs flow through operations to produce the loss. Each node represents a variable, and edges represent operations.

2. **Chain Rule Application**: Backpropagation applies the chain rule of calculus to compute derivatives. For a composite function, the derivative with respect to an input is the product of derivatives along the path.

3. **Gradient Computation**: At each step backward, we compute the gradient of the loss with respect to intermediate variables (activations, pre-activations, weights, biases).

4. **Vectorized Backpropagation**: Instead of computing gradients for one training example at a time, we stack examples into matrices and apply the same derivative equations to all examples simultaneously.

5. **Matrix Dimensions Matching**: In vectorized implementations, gradient matrices must have the same dimensions as their corresponding parameter matrices. This constraint helps verify correctness.

6. **Weight Initialization**: Neural network weights must be initialized randomly (not to zero) to break symmetry and enable the network to learn diverse features.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Backward Flow of Information**: Imagine the loss as a scalar value at the end of the computation graph. Backpropagation traces backward from this loss, computing how much each parameter contributed to it. This is like asking: "If I change this weight slightly, how much does the loss change?"

**Chain Rule as Path Multiplication**: When computing the gradient of the loss with respect to a weight deep in the network, you multiply the gradients along the path from the loss back to that weight. Each multiplication represents applying the chain rule at one operation.
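
The path-multiplication view can be sanity-checked numerically. The sketch below takes a single logistic-regression unit with made-up values, multiplies the three local derivatives along the path from the loss to `w`, and compares the product with a central finite-difference estimate. The helper `loss` and the step size `eps` are illustrative choices, not part of the lesson.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy loss for one example of logistic regression."""
    a = sigmoid(w * x + b)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

w, b, x, y = 0.3, 0.1, 0.5, 1.0
a = sigmoid(w * x + b)

# Local derivatives along the path loss -> a -> z -> w
dL_da = -y / a + (1 - y) / (1 - a)
da_dz = a * (1 - a)
dz_dw = x
dw_chain = dL_da * da_dz * dz_dw

# Finite-difference estimate of dL/dw for comparison
eps = 1e-6
dw_numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

print(f"chain rule: {dw_chain:.8f}, numeric: {dw_numeric:.8f}")
```

The two values should agree to several decimal places. A large gap usually points to a mistake in one of the local derivatives.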

**Stacking Examples for Efficiency**: When you have multiple training examples, instead of running backpropagation separately for each one, you can stack them as columns in a matrix. The same derivative equations apply to the entire matrix at once, computing gradients for all examples in parallel.
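
As a rough illustration with made-up data, the sketch below computes the weight gradient for logistic regression twice: once by looping over examples and averaging, and once with a single matrix product over the stacked columns. The two results should match.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, m = 3, 4
X = rng.standard_normal((n_x, m))      # examples stacked as columns
Y = np.array([[1.0, 0.0, 1.0, 0.0]])
W = rng.standard_normal((1, n_x))
b = 0.0

A = 1.0 / (1.0 + np.exp(-(W @ X + b)))
dZ = A - Y                             # (1, m), one column per example

# Loop version: accumulate each example's gradient, then average
dW_loop = np.zeros_like(W)
for i in range(m):
    x_i = X[:, i:i+1]                  # (n_x, 1)
    dz_i = dZ[:, i:i+1]                # (1, 1)
    dW_loop += dz_i @ x_i.T
dW_loop /= m

# Vectorized version: one matrix product handles all examples
dW_vec = dZ @ X.T / m

print(np.allclose(dW_loop, dW_vec))    # True
```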

**Dimension Checking as Error Detection**: By tracking matrix dimensions through backpropagation, you can catch bugs early. If a gradient matrix doesn't match the shape of its corresponding parameter, something is wrong.

**Symmetry Breaking**: If all weights start at zero, all neurons in a layer compute identical functions. Random initialization ensures different neurons learn different features, allowing the network to represent complex patterns.

**Present and explain the key equations used in the lesson**

## Key Equations

**Loss Function** (binary cross-entropy):
$$L(a, y) = -y \log(a) - (1-y) \log(1-a)$$

**Gradient of Loss with Respect to Activation**:
$$\frac{dL}{da} = -\frac{y}{a} + \frac{1-y}{1-a}$$

**Simplified Gradient for Logistic Regression**:
$$dz = a - y$$

This elegant result comes from combining the loss gradient with the sigmoid derivative.
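
Written out, using the sigmoid derivative $\frac{da}{dz} = a(1-a)$, the combination is:

$$dz = \frac{dL}{da} \cdot \frac{da}{dz} = \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) a(1-a) = -y(1-a) + (1-y)a = a - y$$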

**General Backpropagation Through Activation**:
$$dz = da \cdot g'(z)$$

where $g'(z)$ is the derivative of the activation function.

**Vectorized Gradient for Layer 2 Weights**:
$$dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}$$

**Vectorized Gradient for Layer 2 Bias** (sum over the $m$ example columns of $dZ^{[2]}$):
$$db^{[2]} = \frac{1}{m} \sum_{i=1}^{m} dz^{[2](i)}$$

**Backpropagation to Previous Layer**:
$$dZ^{[1]} = W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$$

where $*$ denotes element-wise multiplication.

**Vectorized Gradient for Layer 1 Weights**:
$$dW^{[1]} = \frac{1}{m} dZ^{[1]} X^T$$

In these equations, $m$ is the number of training examples, and the superscript denotes the layer number.

**Implement code primitive: Implement single-example backpropagation by computing da, dz, dw, db in reverse order through the computation graph**

In [None]:
import numpy as np

def single_example_backprop(x, y, z, a, w, b):
    """
    Compute gradients for a single training example.
    
    Args:
        x: input feature
        y: true label
        z: pre-activation (w*x + b)
        a: activation (sigmoid(z))
        w: weight
        b: bias
    
    Returns:
        da, dz, dw, db: gradients in reverse order
    """
    # Gradient of the loss with respect to the activation
    da = -y/a + (1-y)/(1-a)
    
    # Gradient of the loss with respect to the pre-activation
    # (chain rule with the sigmoid derivative a * (1 - a); simplifies to a - y)
    dz = da * a * (1 - a)
    
    # Gradient of the loss with respect to the weight (since dz/dw = x)
    dw = dz * x
    
    # Gradient of the loss with respect to the bias (since dz/db = 1)
    db = dz
    
    return da, dz, dw, db

# Example usage
x_example = 0.5
y_example = 1
w_example = 0.3
b_example = 0.1
z_example = w_example * x_example + b_example  # pre-activation w*x + b
a_example = 1 / (1 + np.exp(-z_example))  # sigmoid

da, dz, dw, db = single_example_backprop(x_example, y_example, z_example, a_example, w_example, b_example)
print(f"da: {da:.4f}, dz: {dz:.4f}, dw: {dw:.4f}, db: {db:.4f}")

**Implement code primitive: Implement vectorized backpropagation by stacking gradients across training examples as matrix columns and applying the same equations**

In [None]:
import numpy as np

def vectorized_backprop(X, Y, Z, A, W, b):
    """
    Compute gradients for all training examples simultaneously.
    
    Args:
        X: input features, shape (n_features, m_examples)
        Y: true labels, shape (1, m_examples)
        Z: pre-activations, shape (1, m_examples)
        A: activations, shape (1, m_examples)
        W: weights, shape (1, n_features)
        b: bias, scalar
    
    Returns:
        dW, db: gradient matrices
    """
    m = X.shape[1]  # number of training examples
    
    # Gradient of the loss with respect to the activations
    dA = -Y/A + (1-Y)/(1-A)
    
    # Gradient of the loss with respect to the pre-activations
    # (chain rule with the sigmoid derivative; simplifies to A - Y)
    dZ = dA * A * (1 - A)
    
    # Gradient with respect to weights, averaged over the m examples
    dW = (1/m) * np.dot(dZ, X.T)
    
    # Gradient with respect to bias, averaged over the m examples
    db = (1/m) * np.sum(dZ, axis=1, keepdims=True)
    
    return dW, db

# Example usage
X = np.array([[0.5, 0.3, 0.8]])
Y = np.array([[1, 0, 1]])
W = np.array([[0.3]])
b = 0.1
Z = np.dot(W, X) + b  # pre-activations computed from the parameters
A = 1 / (1 + np.exp(-Z))

dW, db = vectorized_backprop(X, Y, Z, A, W, b)
print(f"dW shape: {dW.shape}, db shape: {db.shape}")
print(f"dW: {dW}, db: {db}")

**Implement code primitive: Verify backpropagation implementation by checking that gradient matrices have the same dimensions as their corresponding parameter matrices**

In [None]:
import numpy as np

def verify_gradient_dimensions(W1, b1, W2, b2, dW1, db1, dW2, db2):
    """
    Verify that gradient matrices match parameter matrix dimensions.
    
    Args:
        W1, b1: Layer 1 parameters
        W2, b2: Layer 2 parameters
        dW1, db1: Layer 1 gradients
        dW2, db2: Layer 2 gradients
    
    Returns:
        Boolean indicating if all dimensions match
    """
    checks = [
        (W1.shape == dW1.shape, f"W1 {W1.shape} vs dW1 {dW1.shape}"),
        (b1.shape == db1.shape, f"b1 {b1.shape} vs db1 {db1.shape}"),
        (W2.shape == dW2.shape, f"W2 {W2.shape} vs dW2 {dW2.shape}"),
        (b2.shape == db2.shape, f"b2 {b2.shape} vs db2 {db2.shape}")
    ]
    
    all_match = True
    for match, message in checks:
        status = "✓" if match else "✗"
        print(f"{status} {message}")
        all_match = all_match and match
    
    return all_match

# Example usage
W1 = np.random.randn(3, 2)
b1 = np.random.randn(3, 1)
W2 = np.random.randn(1, 3)
b2 = np.random.randn(1, 1)

dW1 = np.random.randn(3, 2)
db1 = np.random.randn(3, 1)
dW2 = np.random.randn(1, 3)
db2 = np.random.randn(1, 1)

result = verify_gradient_dimensions(W1, b1, W2, b2, dW1, db1, dW2, db2)
print(f"\nAll dimensions match: {result}")

**Implement code primitive: Initialize neural network weights randomly rather than to zero to enable proper learning**

In [None]:
import numpy as np

def initialize_parameters(layer_dims, seed=None):
    """
    Initialize neural network parameters randomly.
    
    Args:
        layer_dims: List of layer dimensions, e.g., [2, 3, 1] for 2 inputs, 3 hidden, 1 output
        seed: Random seed for reproducibility
    
    Returns:
        Dictionary containing initialized weights and biases
    """
    if seed is not None:
        np.random.seed(seed)
    
    parameters = {}
    
    for l in range(1, len(layer_dims)):
        # Random initialization for weights (small values to avoid saturation)
        parameters[f'W{l}'] = np.random.randn(layer_dims[l], layer_dims[l-1]) * 0.01
        
        # Initialize biases to zero
        parameters[f'b{l}'] = np.zeros((layer_dims[l], 1))
    
    return parameters

# Example usage
layer_dims = [2, 3, 1]  # 2 inputs, 3 hidden units, 1 output
params = initialize_parameters(layer_dims, seed=42)

for key, value in params.items():
    print(f"{key}: shape {value.shape}")
    print(f"  Sample values: {value.flatten()[:3]}")

**Create a Mermaid diagram: Computation graph showing forward pass (z → a → loss) and backward pass (dL/da → dz → dw, db) for logistic regression**

## Logistic Regression Computation Graph

```mermaid
graph TD
    X["x<br/>(input)"] --> Z["z = w·x + b<br/>(pre-activation)"]
    Z --> A["a = σ(z)<br/>(activation)"]
    A --> L["L(a,y)<br/>(loss)"]
    Y["y<br/>(label)"] --> L
    
    L --> dA["dL/da<br/>(gradient w.r.t. a)"]
    dA --> dZ["dz = da·σ'(z)<br/>(gradient w.r.t. z)"]
    dZ --> dW["dw = dz·x<br/>(gradient w.r.t. w)"]
    dZ --> dB["db = dz<br/>(gradient w.r.t. b)"]
    
    style X fill:#e1f5ff
    style Y fill:#e1f5ff
    style Z fill:#fff3e0
    style A fill:#fff3e0
    style L fill:#ffebee
    style dA fill:#f3e5f5
    style dZ fill:#f3e5f5
    style dW fill:#e8f5e9
    style dB fill:#e8f5e9
```

**Create a Mermaid diagram: Two-layer neural network computation graph with forward pass (x → z¹ → a¹ → z² → a² → loss) and corresponding backward pass derivatives**

## Two-Layer Neural Network Computation Graph

```mermaid
graph TD
    X["X<br/>(input)"] --> Z1["Z¹ = W¹·X + b¹"]
    Z1 --> A1["A¹ = g(Z¹)"]
    A1 --> Z2["Z² = W²·A¹ + b²"]
    Z2 --> A2["A² = σ(Z²)"]
    A2 --> L["L(A², Y)"]
    Y["Y<br/>(labels)"] --> L
    
    L --> dA2["dA² = -Y/A² + (1-Y)/(1-A²)"]
    dA2 --> dZ2["dZ² = dA²·σ'(Z²)"]
    dZ2 --> dW2["dW² = 1/m·dZ²·A¹ᵀ"]
    dZ2 --> dB2["db² = 1/m·Σ dZ²"]
    dZ2 --> dA1["dA¹ = W²ᵀ·dZ²"]
    
    dA1 --> dZ1["dZ¹ = dA¹·g'(Z¹)"]
    dZ1 --> dW1["dW¹ = 1/m·dZ¹·Xᵀ"]
    dZ1 --> dB1["db¹ = 1/m·Σ dZ¹"]
    
    style X fill:#e1f5ff
    style Y fill:#e1f5ff
    style Z1 fill:#fff3e0
    style A1 fill:#fff3e0
    style Z2 fill:#fff3e0
    style A2 fill:#fff3e0
    style L fill:#ffebee
    style dA2 fill:#f3e5f5
    style dZ2 fill:#f3e5f5
    style dA1 fill:#f3e5f5
    style dZ1 fill:#f3e5f5
    style dW2 fill:#e8f5e9
    style dB2 fill:#e8f5e9
    style dW1 fill:#e8f5e9
    style dB1 fill:#e8f5e9
```

**Create a Mermaid diagram: Matrix dimension flow diagram showing how dimensions propagate through vectorized backpropagation operations**

## Matrix Dimension Flow in Vectorized Backpropagation

```mermaid
graph TD
    X["X: n_x × m"] --> Z["Z = W·X + b<br/>1 × m"]
    Z --> A["A = σ(Z)<br/>1 × m"]
    A --> L["L: scalar"]
    Y["Y: 1 × m"] --> L
    
    L --> dA["dA: 1 × m"]
    dA --> dZ["dZ: 1 × m"]
    dZ --> dW["dW = 1/m·dZ·Xᵀ<br/>1 × n_x"]
    dZ --> dB["db = 1/m·Σ dZ<br/>1 × 1"]
    
    W["W: 1 × n_x"] -.-> dW
    b["b: 1 × 1"] -.-> dB
    
    style X fill:#e1f5ff
    style Y fill:#e1f5ff
    style Z fill:#fff3e0
    style A fill:#fff3e0
    style L fill:#ffebee
    style dA fill:#f3e5f5
    style dZ fill:#f3e5f5
    style dW fill:#e8f5e9
    style dB fill:#e8f5e9
    style W fill:#eeeeee
    style b fill:#eeeeee
```

## Lesson 9

**Explain the core concepts of the lesson**

## Core Concepts of Weight Initialization

Weight initialization is a critical step in training neural networks. The way we initialize the parameters of a network—particularly the weights and biases—significantly impacts whether the network can learn effectively.

**Key Concepts:**

1. **Weight Initialization**: The process of assigning initial values to the weight matrices $w^{[l]}$ before training begins.

2. **Symmetry Breaking Problem**: When all weights are initialized to the same value (e.g., zero), all hidden units in a layer compute identical functions, making them redundant and preventing the network from learning diverse features.

3. **Random Initialization**: Using random values drawn from a distribution (typically Gaussian) to break symmetry and allow different hidden units to learn different features.

4. **Gaussian Random Variables**: Random values sampled from a normal distribution with mean 0 and standard deviation 1, generated using `np.random.randn`.

5. **Initialization Constant**: A small scaling factor (e.g., 0.01) applied to random weights to keep initial pre-activations near zero, where sigmoid and tanh are steep and gradients are large.

6. **Saturation in Activation Functions**: When weights are too large, the pre-activation values $z^{[l]}$ push activations into flat regions of sigmoid or tanh functions, where gradients are nearly zero, slowing learning.

7. **Bias Initialization**: Bias terms $b^{[l]}$ can be safely initialized to zero since they don't suffer from the symmetry problem that affects weights.

8. **Parameter Initialization Strategy**: A systematic approach to choosing initial values for weights and biases to enable effective learning.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Why Zero Initialization Fails:**

Imagine a neural network with multiple hidden units in a layer. If all weights are initialized to zero, then every hidden unit receives the same input and computes the same function. During backpropagation, all hidden units receive identical gradient updates, so they remain identical throughout training. This means the network effectively has only one hidden unit instead of many—a massive waste of capacity. The network cannot learn diverse features because all units are locked in symmetry.

**How Random Initialization Breaks Symmetry:**

By initializing weights to small random values, each hidden unit starts with a slightly different set of weights. This means each unit computes a different function on the input, producing different activations. During backpropagation, each unit receives different gradients, allowing them to diverge and specialize in learning different features. This diversity is essential for the network's learning capacity.

**The Problem with Large Weights:**

Consider the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$. When $z$ is large in magnitude (strongly positive or strongly negative), the function is nearly flat and the gradient $\frac{d\sigma}{dz}$ is close to zero. If weights are initialized with large values, the pre-activation values $z^{[l]} = w^{[l]} x + b^{[l]}$ become large in magnitude, pushing activations into these flat regions. With tiny gradients, learning becomes extremely slow. This is called **saturation**.
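
A rough numerical sketch of saturation follows. The layer sizes, the random input, and the two weight scales are assumptions chosen for illustration. It prints the sigmoid gradient at a few values of $z$, then the average gradient across a layer for small and large initial weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The gradient shrinks quickly as z moves away from zero
for z in [0.0, 2.0, 10.0]:
    print(f"z = {z:5.1f}  sigmoid'(z) = {sigmoid_grad(z):.6f}")

# Hypothetical layer: 100 input features, 10 units, 1000 examples
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1000))

mean_grads = {}
for scale in [0.01, 1.0]:
    W = rng.standard_normal((10, 100)) * scale
    Z = W @ X
    mean_grads[scale] = sigmoid_grad(Z).mean()
    print(f"scale = {scale}: mean sigmoid'(Z) = {mean_grads[scale]:.4f}")
```

With the 0.01 scale the average gradient stays close to the maximum of 0.25. With unscaled weights most units land in the flat tails and the average drops sharply.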

**The Sweet Spot:**

Small random weights (e.g., scaled by 0.01) keep initial activations in the steep, middle regions of activation functions where gradients are large. This allows the network to learn quickly while maintaining the diversity needed for different hidden units to specialize.

**Why Biases Are Different:**

Bias terms do not need to break symmetry themselves, because the random weights already do. Even if all biases are zero, the hidden units still compute different functions (due to different weights), so they produce different activations and receive different gradients. Therefore, initializing biases to zero is safe and conventional.

**Present and explain the key equations used in the lesson**

## Key Equations

**Weight Initialization Formula:**

$$w^{[l]} = \text{np.random.randn}(n^{[l]}, n^{[l-1]}) \times \text{small constant}$$

Where:
- $w^{[l]}$ is the weight matrix for layer $l$
- $n^{[l]}$ is the number of units in layer $l$
- $n^{[l-1]}$ is the number of units in layer $l-1$
- `np.random.randn` generates random values from a standard normal distribution
- The small constant (typically 0.01) scales the random values to keep them small

**Pre-activation Computation:**

$$z^{[l]} = w^{[l]} x + b^{[l]}$$

Where:
- $z^{[l]}$ is the pre-activation (linear combination) for layer $l$
- $x$ is the input to layer $l$ (or the activation from the previous layer)
- $b^{[l]}$ is the bias vector for layer $l$

**Bias Initialization:**

$$b^{[l]} = 0$$

Bias terms are initialized to zero vectors, which is safe because biases don't create the symmetry problem that weights do.

**Implement code primitive: Generate random weight matrix using np.random.randn with specified dimensions**

In [None]:
import numpy as np

# Generate random weight matrix
n_current = 3  # number of units in current layer
n_previous = 4  # number of units in previous layer

W = np.random.randn(n_current, n_previous)
print("Random weight matrix shape:", W.shape)
print("Random weight matrix:\n", W)

**Implement code primitive: Scale random weights by a small constant (e.g., 0.01)**

In [None]:
import numpy as np

# Generate and scale random weight matrix
n_current = 3
n_previous = 4
scaling_constant = 0.01

W = np.random.randn(n_current, n_previous) * scaling_constant
print("Scaled weight matrix shape:", W.shape)
print("Scaled weight matrix:\n", W)
print("\nMean of weights:", np.mean(W))
print("Std of weights:", np.std(W))

**Implement code primitive: Initialize bias terms to zero**

In [None]:
import numpy as np

# Initialize bias terms to zero
n_current = 3

b = np.zeros((n_current, 1))
print("Bias vector shape:", b.shape)
print("Bias vector:\n", b)

**Implement code primitive: Demonstrate why zero initialization fails by showing symmetric gradient updates**

In [None]:
import numpy as np

# Demonstrate the symmetry problem with zero initialization
np.random.seed(42)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradients(W1, b1, W2, b2, X, Y):
    """One forward/backward pass through a 2-layer sigmoid network."""
    m = X.shape[1]
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
    dW1 = dZ1 @ X.T / m
    return A1, dW1, dW2

# Input: 3 features, 5 samples, with binary labels
X = np.random.randn(3, 5)
Y = np.array([[1, 0, 1, 1, 0]])

# Zero biases are used in both cases: 2 hidden units, 1 output unit
b1 = np.zeros((2, 1))
b2 = np.zeros((1, 1))

# Zero initialization of the weights
A1_zero, dW1_zero, dW2_zero = gradients(np.zeros((2, 3)), b1, np.zeros((1, 2)), b2, X, Y)
print("Hidden activations (zero init):\n", A1_zero)
print("\ndW1 (zero init), one row per hidden unit:\n", dW1_zero)
print("dW2 (zero init), one entry per hidden unit:\n", dW2_zero)
print("\nBoth hidden units have identical activations and identical gradients,")
print("so they remain identical after every update.")

# Random initialization for comparison
W1_random = np.random.randn(2, 3) * 0.01
W2_random = np.random.randn(1, 2) * 0.01
A1_rand, dW1_rand, dW2_rand = gradients(W1_random, b1, W2_random, b2, X, Y)
print("\n" + "="*50)
print("Hidden activations (random init):\n", A1_rand)
print("\ndW1 (random init), one row per hidden unit:\n", dW1_rand)
print("dW2 (random init), one entry per hidden unit:\n", dW2_rand)
print("\nEach hidden unit receives a different gradient update,")
print("which allows the units to diverge and learn different features.")

**Create a Mermaid diagram: Flowchart showing the consequence of zero weight initialization: identical hidden units → identical activations → identical gradients → no learning diversity**

## Consequence of Zero Weight Initialization

```mermaid
graph TD
    A["All Weights Initialized to Zero"] --> B["All Hidden Units Compute Identical Functions"]
    B --> C["All Hidden Units Produce Identical Activations"]
    C --> D["All Hidden Units Receive Identical Gradients"]
    D --> E["Weights Remain Identical After Update"]
    E --> F["No Learning Diversity"]
    F --> G["Network Cannot Learn Diverse Features"]
```

**Create a Mermaid diagram: Diagram illustrating how large weights cause z values to saturate in sigmoid/tanh functions, resulting in small gradients**

## Effect of Large Weights on Activation Functions

```mermaid
graph TD
    A["Large Initial Weights"] --> B["Large Pre-activation Values z"]
    B --> C["Sigmoid/Tanh Pushed to Extremes"]
    C --> D["Activation Function is Nearly Flat"]
    D --> E["Gradient dσ/dz ≈ 0"]
    E --> F["Vanishing Gradients"]
    F --> G["Very Slow Learning"]
    
    H["Small Random Weights"] --> I["Moderate Pre-activation Values z"]
    I --> J["Sigmoid/Tanh in Steep Region"]
    J --> K["Activation Function is Steep"]
    K --> L["Gradient dσ/dz is Large"]
    L --> M["Strong Gradient Signal"]
    M --> N["Effective Learning"]
```

## Lesson 10

**Explain the core concepts of the lesson**

## Core Concepts

**Generative Adversarial Networks (GANs)** are a framework for learning to generate new data that resembles training data. GANs consist of two competing neural networks:

1. **Generator**: Creates fake examples from random noise, attempting to fool the discriminator
2. **Discriminator**: Distinguishes between real examples from the training data and fake examples generated by the generator

These networks compete iteratively, with the generator improving at creating realistic data and the discriminator improving at detecting fakes. This adversarial process drives both networks toward better performance.

**Generative Modeling** is the task of learning the underlying distribution of training data to generate new, similar examples. GANs are one approach to this problem.

**GAN Stability** refers to the challenge that GANs can be difficult to train and require careful tuning of hyperparameters and architecture choices.

**Semi-supervised Learning** is an application of GANs where the discriminator's learned representations can be used for classification tasks with limited labeled data.

**Adversarial Examples** are carefully crafted inputs designed to fool machine learning models. They reveal vulnerabilities in neural networks that are distinct from traditional computer security problems.

**Machine Learning Security** encompasses protecting ML systems from adversarial attacks at multiple levels:
- **Application-level security**: Protecting the ML model itself from adversarial examples
- **Network-level security**: Protecting data in transit and system infrastructure
- **Machine-learning-level attacks**: Attacks that exploit properties of ML algorithms themselves

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**The Competing Networks Intuition**: Think of GANs as a game between a counterfeiter (generator) and a detective (discriminator). The counterfeiter tries to create fake currency that looks real, while the detective tries to catch fakes. As they compete, both improve—the counterfeiter becomes better at forgery and the detective becomes better at detection. Eventually, the counterfeiter produces currency so good that the detective cannot tell it apart from real currency.
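
The two competing goals can be written as loss functions. The lesson does not state them, so the sketch below assumes the commonly used binary cross-entropy form, with the generator trained to make the discriminator output 1 on fakes. The discriminator outputs are made-up numbers, and `bce` is an illustrative helper.

```python
import numpy as np

def bce(p, target):
    """Mean binary cross-entropy between probabilities p and a 0/1 target."""
    p = np.clip(p, 1e-8, 1 - 1e-8)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

# Hypothetical discriminator outputs: probability that an input is real
d_on_real = np.array([0.9, 0.8, 0.95])   # detective inspects genuine notes
d_on_fake = np.array([0.2, 0.1, 0.3])    # detective inspects forgeries

# Detective (discriminator): wants real -> 1 and fake -> 0
d_loss = bce(d_on_real, 1.0) + bce(d_on_fake, 0.0)

# Counterfeiter (generator): wants the detective to say 1 on forgeries
g_loss = bce(d_on_fake, 1.0)

# Suppose the forgeries improve and fool the detective more often
d_on_better_fake = np.array([0.6, 0.5, 0.7])
g_loss_better = bce(d_on_better_fake, 1.0)
d_loss_after = bce(d_on_real, 1.0) + bce(d_on_better_fake, 0.0)

print(f"generator loss:     {g_loss:.3f} -> {g_loss_better:.3f}")
print(f"discriminator loss: {d_loss:.3f} -> {d_loss_after:.3f}")
```

As the forgeries improve, the generator's loss falls and the discriminator's rises. That opposition is what makes the training adversarial.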

**GAN Training Challenges**: GANs currently work well sometimes but require careful tuning, similar to how deep learning was finicky 10 years ago before techniques like ReLU and batch normalization made it reliable. The field is still developing best practices for stable training.

**Adversarial Vulnerability**: Adversarial examples show that machine learning models can be fooled by carefully crafted inputs. This reveals a new class of security vulnerabilities distinct from traditional computer security problems. A model might correctly classify an image of a cat, but a small, imperceptible perturbation can cause it to misclassify the same image as a dog.
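One commonly cited intuition for why tiny perturbations can flip a prediction is that, in high dimensions, many small per-feature changes add up. The sketch below illustrates this with a hypothetical linear classifier (random weights `w`, random input `x`, and a made-up "cat"/"dog" labeling); it is an illustration under those assumptions, not a claim about any particular trained network. No clipping to a valid input range is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 784

# Hypothetical linear classifier: score > 0 means "cat", otherwise "dog"
w = rng.normal(size=n_features)
x = rng.uniform(-1.0, 1.0, size=n_features)

score = w @ x

# Moving every feature by epsilon against the current class shifts the score
# by epsilon * sum(|w|), so this epsilon is just large enough to cross the
# decision boundary.
epsilon = 1.01 * abs(score) / np.abs(w).sum()
x_adv = x - np.sign(score) * epsilon * np.sign(w)
adv_score = w @ x_adv

print(f"Original score:    {score:.3f}")
print(f"Adversarial score: {adv_score:.3f}")
print(f"Per-feature change (epsilon): {epsilon:.4f}")
```

The sign of the score flips even though no single feature moves by more than `epsilon`, which is small relative to the input range of $[-1, 1]$.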

**Security by Design**: Building security into machine learning systems from the start is more effective than trying to patch vulnerabilities after deployment. This means considering adversarial robustness during model development, not as an afterthought.

**Accessibility of Deep Learning**: The field of deep learning has evolved from struggling to work on high-dimensional data like images to having many open research directions. This makes it accessible to newcomers without requiring a Ph.D., as foundational techniques are now well-established and tools are readily available.

**Implement code primitive: Implementing a basic GAN architecture with generator and discriminator networks**

In [None]:
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100, output_dim=784):
        super(Generator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, output_dim),
            nn.Tanh()
        )
    
    def forward(self, z):
        return self.model(z)

class Discriminator(nn.Module):
    def __init__(self, input_dim=784):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )
    
    def forward(self, x):
        return self.model(x)

latent_dim = 100
generator = Generator(latent_dim=latent_dim)
discriminator = Discriminator()

z = torch.randn(32, latent_dim)
fake_data = generator(z)
fake_prediction = discriminator(fake_data)

print(f"Generated data shape: {fake_data.shape}")
print(f"Discriminator prediction shape: {fake_prediction.shape}")

**Implement code primitive: Creating code to generate synthetic examples that resemble training data**

In [None]:
import torch
import torch.nn as nn

class SimpleGenerator(nn.Module):
    def __init__(self, latent_dim=100):
        super(SimpleGenerator, self).__init__()
        self.fc1 = nn.Linear(latent_dim, 128)
        self.fc2 = nn.Linear(128, 256)
        self.fc3 = nn.Linear(256, 784)
        self.relu = nn.ReLU()
        self.tanh = nn.Tanh()
    
    def forward(self, z):
        x = self.relu(self.fc1(z))
        x = self.relu(self.fc2(x))
        x = self.tanh(self.fc3(x))
        return x

generator = SimpleGenerator(latent_dim=100)

batch_size = 16
latent_dim = 100
z = torch.randn(batch_size, latent_dim)

synthetic_examples = generator(z)

print(f"Batch size: {batch_size}")
print(f"Synthetic examples shape: {synthetic_examples.shape}")
print(f"Value range: [{synthetic_examples.min().item():.3f}, {synthetic_examples.max().item():.3f}]")
print(f"Mean: {synthetic_examples.mean().item():.3f}, Std: {synthetic_examples.std().item():.3f}")

**Implement code primitive: Developing adversarial attack code to generate adversarial examples that fool classifiers**

In [None]:
import torch
import torch.nn as nn

class SimpleClassifier(nn.Module):
    def __init__(self):
        super(SimpleClassifier, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

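def fgsm_attack(model, x, y, epsilon=0.1):
    """Fast Gradient Sign Method: move each input feature by epsilon in the direction that increases the loss."""
    # Work on a copy so the caller's tensor is left untouched
    x = x.clone().detach().requires_grad_(True)
    output = model(x)
    loss = nn.CrossEntropyLoss()(output, y)
    loss.backward()
    
    perturbation = epsilon * x.grad.sign()
    adversarial_x = x + perturbation
    # Keep the adversarial inputs inside the valid [-1, 1] range
    adversarial_x = torch.clamp(adversarial_x, -1, 1)
    
    return adversarial_x.detach()

model = SimpleClassifier()
model.eval()

# Draw inputs from [-1, 1] so the clamp above does not inflate the measured perturbation
x = torch.rand(8, 784) * 2 - 1
y = torch.randint(0, 10, (8,))

adversarial_examples = fgsm_attack(model, x, y, epsilon=0.1)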

print(f"Original examples shape: {x.shape}")
print(f"Adversarial examples shape: {adversarial_examples.shape}")
print(f"Perturbation magnitude: {(adversarial_examples - x).abs().max().item():.4f}")

**Create a Mermaid diagram: A flowchart showing the GAN training process with generator and discriminator competing iteratively**

```mermaid
graph TD
    A[Start Training] --> B[Sample random noise z]
    B --> C[Generator creates fake data]
    C --> D[Discriminator evaluates fake data]
    D --> E[Discriminator evaluates real data]
    E --> F{Discriminator Loss}
    F --> G[Update Discriminator weights]
    G --> H[Sample random noise z]
    H --> I[Generator creates fake data]
    I --> J[Discriminator evaluates fake data]
    J --> K{Generator Loss}
    K --> L[Update Generator weights]
    L --> M{Training Complete?}
    M -->|No| B
    M -->|Yes| N[End Training]
    
    style B fill:#e1f5ff
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#f3e5f5
    style G fill:#e8f5e9
    style L fill:#fce4ec
```

**Create a Mermaid diagram: A diagram illustrating the three levels of security: application-level, network-level, and machine-learning-level attacks**

```mermaid
graph TD
    A[Machine Learning System Security] --> B[Application-Level Security]
    A --> C[Network-Level Security]
    A --> D[Machine-Learning-Level Security]
    
    B --> B1[Protect ML Model]
    B --> B2[Defend Against Adversarial Examples]
    B --> B3[Input Validation]
    
    C --> C1[Secure Data in Transit]
    C --> C2[Protect System Infrastructure]
    C --> C3[Access Control]
    
    D --> D1[Adversarial Robustness]
    D --> D2[Model Poisoning Defense]
    D --> D3[Exploit ML Algorithm Properties]
    
    style B fill:#e3f2fd
    style C fill:#f3e5f5
    style D fill:#fff3e0
```

# Week 4

## Lesson 1

**Explain the core concepts of the lesson**

## Core Concepts

A **deep neural network** is a stack of multiple layers that process information sequentially from input to output. The key concepts for understanding deep networks are:

- **Network Depth**: The number of layers in the network, denoted as $L$. This counts only the hidden layers and the output layer, excluding the input layer.

- **Layer Counting Convention**: The input layer is indexed as layer 0, hidden layers are indexed as 1 through $L-1$, and the output layer is indexed as layer $L$. This means a network with $L$ layers has $L+1$ indexed positions (0 through $L$).

- **Shallow vs. Deep Networks**: Logistic regression and single-hidden-layer networks are considered shallow models. Networks with many hidden layers are deep models. The distinction is a matter of degree rather than a binary classification.

- **Network Capacity**: Very deep networks can learn functions that shallower models cannot. However, the optimal depth for a given problem is typically unknown in advance and should be treated as a hyperparameter to tune during model development.

- **Units Per Layer**: Each layer $l$ contains $n^{[l]}$ units (neurons). The number of units can vary across layers and is a key architectural choice.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Stacking as Composition**: Think of a deep neural network as a composition of functions. Each layer applies a transformation to the output of the previous layer. The deeper the network, the more transformations are applied, allowing the network to learn increasingly complex representations of the data.
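The composition view can be written down almost literally. The sketch below treats each layer as a function and folds the input through them in order; the helper `make_layer`, the layer sizes, and the random weights are illustrative assumptions rather than part of the lesson.

```python
from functools import reduce
import numpy as np

def make_layer(W, b, g):
    """Return a function a_prev -> g(W @ a_prev + b) for one layer."""
    return lambda a_prev: g(W @ a_prev + b)

relu = lambda z: np.maximum(0, z)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
layers = [
    make_layer(rng.normal(size=(3, 2)), np.zeros((3, 1)), relu),     # layer 1
    make_layer(rng.normal(size=(4, 3)), np.zeros((4, 1)), relu),     # layer 2
    make_layer(rng.normal(size=(1, 4)), np.zeros((1, 1)), sigmoid),  # layer 3 (output)
]

x = np.array([[0.5], [0.3]])

# The network is f3(f2(f1(x))): fold the input through the layers in order
y_hat = reduce(lambda a, layer: layer(a), layers, x)

print(y_hat.shape)
```

Adding depth in this picture simply means appending another function to the `layers` list.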

**Depth as Expressiveness**: A key intuition is that depth enables expressiveness. Logistic regression can only learn a linear decision boundary, and a network with a single hidden layer may need a very large number of units to represent functions that a deeper network captures compactly. By stacking more layers, the network gains the ability to learn hierarchical features: lower layers might learn simple patterns, while deeper layers combine these patterns into more abstract concepts.

**Hyperparameter Tuning**: The depth of a network is not something you can determine theoretically for a given problem. Instead, you should experiment with different depths and evaluate which works best for your specific task. This is why depth is treated as a hyperparameter rather than something fixed by the problem structure.

**Gradual Increase in Complexity**: The distinction between shallow and deep networks is not sharp. A network with 2 hidden layers is deeper than one with 1, but whether it's "truly deep" depends on context. What matters is that adding more layers generally increases the model's capacity to learn complex functions.

**Present and explain the key equations used in the lesson**

## Key Equations

The notation and equations for deep neural networks are:

**Network Architecture**:
$$L = \text{number of layers, excluding the input layer}$$
$$n^{[l]} = \text{number of units in layer } l$$

**Forward Propagation**:
For each layer $l$ from 1 to $L$, we compute:
$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$
$$a^{[l]} = g(z^{[l]})$$

where:
- $W^{[l]}$ is the weight matrix for layer $l$
- $b^{[l]}$ is the bias vector for layer $l$
- $z^{[l]}$ is the pre-activation (linear combination) for layer $l$
- $a^{[l]}$ is the activation (post-activation output) for layer $l$
- $g$ is the activation function (e.g., ReLU, sigmoid, tanh)

**Input and Output**:
$$a^{[0]} = x$$
$$a^{[L]} = \hat{y}$$

The input features are denoted as $a^{[0]}$, and the network's final prediction is $a^{[L]}$, the activation of the output layer.

**Implement code primitive: Implement notation system for accessing layer-specific parameters: W[l], b[l], z[l], a[l] using dictionary or array indexing**

In [None]:
# Notation system for layer-specific parameters
# Using dictionaries to store parameters indexed by layer
import numpy as np

W = {}  # Weight matrices: W[l] for layer l
b = {}  # Bias vectors: b[l] for layer l
z = {}  # Pre-activations: z[l] for layer l
a = {}  # Activations: a[l] for layer l

# Example: Initialize parameters for a 3-layer network

# Layer 1: 3 units, input from 2 features
W[1] = np.random.randn(3, 2) * 0.01
b[1] = np.zeros((3, 1))

# Layer 2: 4 units, input from 3 units in layer 1
W[2] = np.random.randn(4, 3) * 0.01
b[2] = np.zeros((4, 1))

# Layer 3: 1 unit (output), input from 4 units in layer 2
W[3] = np.random.randn(1, 4) * 0.01
b[3] = np.zeros((1, 1))

# Access parameters by layer
print(f"W[1] shape: {W[1].shape}")
print(f"b[2] shape: {b[2].shape}")
print(f"W[3] shape: {W[3].shape}")

**Implement code primitive: Create a data structure to store network architecture: number of layers L and units per layer n[l]**

In [None]:
# Data structure for network architecture

# Total number of layers (excluding input layer)
L = 3

# Number of units in each layer (indexed from 0 to L)
# n[0] = number of input features
# n[1] to n[L] = number of units in each layer
n = {
    0: 2,   # Input layer: 2 features
    1: 3,   # Hidden layer 1: 3 units
    2: 4,   # Hidden layer 2: 4 units
    3: 1    # Output layer: 1 unit
}

# Verify architecture
print(f"Number of layers L: {L}")
print(f"Network architecture:")
for layer in range(L + 1):
    print(f"  Layer {layer}: {n[layer]} units")

**Implement code primitive: Implement forward propagation loop that iterates through layers l=1 to L, computing z[l] and a[l] sequentially**

In [None]:
# Forward propagation through all layers

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

# Initialize input (a[0])
a[0] = np.array([[0.5], [0.3]])  # Example input with 2 features

# Forward propagation loop
for l in range(1, L + 1):
    # Compute pre-activation
    z[l] = np.dot(W[l], a[l-1]) + b[l]
    
    # Apply activation function
    if l < L:
        # Use ReLU for hidden layers
        a[l] = relu(z[l])
    else:
        # Use sigmoid for output layer
        a[l] = sigmoid(z[l])
    
    print(f"Layer {l}: z[{l}] shape = {z[l].shape}, a[{l}] shape = {a[l].shape}")

# Final prediction
print(f"\nFinal prediction a[{L}]: {a[L]}")

**Create a Mermaid diagram: Diagram showing a 4-layer neural network with input layer (layer 0), three hidden layers (layers 1-3), and output layer (layer 4), with layer indices and unit counts labeled**

## Network Architecture Visualization

```mermaid
graph LR
    subgraph L0["Layer 0<br/>(Input)<br/>n⁽⁰⁾=2"]
        x1((x₁))
        x2((x₂))
    end
    
    subgraph L1["Layer 1<br/>(Hidden)<br/>n⁽¹⁾=3"]
        h1_1((a₁⁽¹⁾))
        h1_2((a₂⁽¹⁾))
        h1_3((a₃⁽¹⁾))
    end
    
    subgraph L2["Layer 2<br/>(Hidden)<br/>n⁽²⁾=4"]
        h2_1((a₁⁽²⁾))
        h2_2((a₂⁽²⁾))
        h2_3((a₃⁽²⁾))
        h2_4((a₄⁽²⁾))
    end
    
    subgraph L3["Layer 3<br/>(Hidden)<br/>n⁽³⁾=3"]
        h3_1((a₁⁽³⁾))
        h3_2((a₂⁽³⁾))
        h3_3((a₃⁽³⁾))
    end
    
    subgraph L4["Layer 4<br/>(Output)<br/>n⁽⁴⁾=1"]
        y((ŷ))
    end
    
    x1 --> h1_1
    x1 --> h1_2
    x1 --> h1_3
    x2 --> h1_1
    x2 --> h1_2
    x2 --> h1_3
    
    h1_1 --> h2_1
    h1_1 --> h2_2
    h1_1 --> h2_3
    h1_1 --> h2_4
    h1_2 --> h2_1
    h1_2 --> h2_2
    h1_2 --> h2_3
    h1_2 --> h2_4
    h1_3 --> h2_1
    h1_3 --> h2_2
    h1_3 --> h2_3
    h1_3 --> h2_4
    
    h2_1 --> h3_1
    h2_1 --> h3_2
    h2_1 --> h3_3
    h2_2 --> h3_1
    h2_2 --> h3_2
    h2_2 --> h3_3
    h2_3 --> h3_1
    h2_3 --> h3_2
    h2_3 --> h3_3
    h2_4 --> h3_1
    h2_4 --> h3_2
    h2_4 --> h3_3
    
    h3_1 --> y
    h3_2 --> y
    h3_3 --> y
```

This diagram shows a deep neural network with:
- **Layer 0** (Input): 2 input features
- **Layer 1** (Hidden): 3 units
- **Layer 2** (Hidden): 4 units
- **Layer 3** (Hidden): 3 units
- **Layer 4** (Output): 1 output unit (prediction $\hat{y}$)

The total number of layers is $L = 4$ (counting only hidden and output layers, excluding the input layer).

## Lesson 2

**Explain the core concepts of the lesson**

## Core Concepts of Forward Propagation

Forward propagation is the process of computing predictions in a neural network by passing input data through each layer sequentially. The key concepts are:

**Layer-wise Computation**: Each layer performs the same fundamental operation: compute a weighted sum of inputs, add a bias term, and apply an activation function.

**Activation Computation**: For each layer $l$, we compute:
- The pre-activation (linear combination): $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
- The activation (after applying the activation function): $a^{[l]} = g(z^{[l]})$

**Input as Layer Zero**: The input features $x$ can be viewed as the activations of layer zero, $a^{[0]} = x$ (or $A^{[0]} = X$ for the whole training set). This unified perspective means all layers follow the same computational pattern.

**Vectorized Forward Propagation**: Instead of computing predictions for one training example at a time, we stack all training examples as columns in matrices and compute activations for the entire batch simultaneously using the same equations with capital letter notation.

**Sequential Computation**: Layers must be computed in order from layer 1 to layer $L$ because each layer depends on the activations from the previous layer.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**The Repeating Pattern**: Forward propagation in deep networks follows the same pattern repeated layer by layer: compute weighted sum, add bias, apply activation function. Once you understand one layer, you understand the entire network—you just repeat the process.

**Unified Layer Equations**: By treating the input features as the activations of layer zero ($a^{[0]} = x$), all layer equations follow the same general form. This makes the implementation cleaner and more elegant.

**Vectorization for Efficiency**: Vectorization stacks all training examples as columns in matrices, allowing you to compute activations for the entire training set simultaneously using the same equations. This is much faster than looping over individual examples.
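As a small check that the vectorized form computes the same thing as a per-example loop, the sketch below evaluates one layer's pre-activations both ways. The layer sizes, the number of examples, and the random values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_prev, n_curr, m = 4, 3, 1000

W = rng.normal(size=(n_curr, n_prev))
b = rng.normal(size=(n_curr, 1))
A_prev = rng.normal(size=(n_prev, m))   # one training example per column

# Loop version: process one example (column) at a time
Z_loop = np.zeros((n_curr, m))
for i in range(m):
    a_prev = A_prev[:, i:i+1]           # keep a column vector of shape (n_prev, 1)
    Z_loop[:, i:i+1] = W @ a_prev + b

# Vectorized version: one matrix product for the whole batch
Z_vec = W @ A_prev + b

print(np.allclose(Z_loop, Z_vec))
```

Both versions produce the same matrix; the vectorized one replaces `m` small matrix-vector products with a single matrix-matrix product.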

**Loops Over Layers Are Necessary**: Even though we usually avoid explicit loops in favor of vectorization, a for loop over layers is necessary and acceptable because each layer must be computed sequentially before the next layer can begin. You cannot compute layer 2 until layer 1 is complete.

**Matrix Dimensions Matter**: Matrix dimensions are critical for debugging deep network implementations. Carefully tracking the shapes of weight matrices, biases, and activations throughout the computation helps catch errors early.

**Present and explain the key equations used in the lesson**

## Key Equations

**Single Training Example**:

For a single training example with input $a^{[l-1]}$, the forward propagation equations for layer $l$ are:

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$

$$a^{[l]} = g(z^{[l]})$$

where:
- $W^{[l]}$ is the weight matrix for layer $l$
- $b^{[l]}$ is the bias vector for layer $l$
- $g$ is the activation function (e.g., ReLU, sigmoid, tanh)
- $z^{[l]}$ is the pre-activation
- $a^{[l]}$ is the activation (output) of layer $l$

**Vectorized Form (Entire Training Set)**:

When processing $m$ training examples simultaneously, we use capital letters for the stacked quantities $Z$ and $A$:

$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$$

$$A^{[l]} = g(Z^{[l]})$$

where:
- $A^{[l-1]}$ is a matrix of shape $(n^{[l-1]}, m)$ containing all training examples as columns
- $Z^{[l]}$ and $A^{[l]}$ are matrices of shape $(n^{[l]}, m)$
- The bias $b^{[l]}$ is broadcast across all $m$ examples

**Input as Layer Zero**:

$$A^{[0]} = X$$

where $X$ is the input matrix of shape $(n^{[0]}, m)$ containing all training examples as columns.

**Implement code primitive: Implement single training example forward propagation by iterating through layers, computing z and a for each layer using weight matrices, biases, and activation functions.**

In [None]:
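import numpy as np

def forward_propagation_single_example(x, parameters, L, activation_fn):
    """
    Forward propagation for a single training example.
    
    Args:
        x: Input vector of shape (n[0],) or (n[0], 1)
        parameters: Dictionary containing weights w[l] of shape (n[l], n[l-1])
            and biases b[l] of shape (n[l], 1) for each layer
        L: Number of layers
        activation_fn: Function that applies activation (e.g., relu, sigmoid)
    
    Returns:
        a: Final activation (output) of the network, shape (n[L], 1)
        cache: Dictionary storing z and a values for each layer
    """
    cache = {}
    # Reshape to a column vector so that adding an (n[l], 1) bias does not
    # broadcast a 1-D activation into an (n[l], n[l]) matrix
    a = np.reshape(x, (-1, 1))
    cache['a0'] = a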
    
    for l in range(1, L + 1):
        w = parameters[f'w{l}']
        b = parameters[f'b{l}']
        
        z = w @ a + b
        a = activation_fn(z)
        
        cache[f'z{l}'] = z
        cache[f'a{l}'] = a
    
    return a, cache

**Implement code primitive: Implement vectorized forward propagation using capital letter notation where matrices contain all training examples stacked as columns, applying the same layer equations to the entire batch.**

In [None]:
def forward_propagation_vectorized(X, parameters, L, activation_fn):
    """
    Vectorized forward propagation for all training examples.
    
    Args:
        X: Input matrix of shape (n[0], m) where m is number of training examples
        parameters: Dictionary containing weights w[l] and biases b[l] for each layer
        L: Number of layers
        activation_fn: Function that applies activation element-wise
    
    Returns:
        A: Final activation (output) matrix of shape (n[L], m)
        cache: Dictionary storing Z and A values for each layer
    """
    cache = {}
    A = X
    cache['A0'] = A
    
    for l in range(1, L + 1):
        w = parameters[f'w{l}']
        b = parameters[f'b{l}']
        
        Z = w @ A + b
        A = activation_fn(Z)
        
        cache[f'Z{l}'] = Z
        cache[f'A{l}'] = A
    
    return A, cache

**Implement code primitive: Implement a for loop structure that iterates from layer 1 to layer L, computing activations sequentially for each layer in the network.**

In [None]:
def forward_propagation_loop(A, parameters, L, activation_fn):
    """
    Forward propagation with explicit layer-by-layer loop.
    
    Args:
        A: Activation from previous layer (or input X for first iteration)
        parameters: Dictionary with weights and biases
        L: Total number of layers
        activation_fn: Activation function to apply
    
    Returns:
        A: Final output activation
        cache: Stored activations and pre-activations
    """
    cache = {}
    cache['A0'] = A
    
    for l in range(1, L + 1):
        A_prev = A
        w = parameters[f'w{l}']
        b = parameters[f'b{l}']
        
        Z = w @ A_prev + b
        A = activation_fn(Z)
        
        cache[f'Z{l}'] = Z
        cache[f'A{l}'] = A
    
    return A, cache

**Create a Mermaid diagram: Flowchart showing the sequential steps of forward propagation: input x (or A0) → compute z1 → compute a1 → compute z2 → compute a2 → ... → compute zL → compute aL (output).**

## Forward Propagation Flow

```mermaid
graph TD
    A0["Input: A⁽⁰⁾ = X"] --> Z1["Compute Z⁽¹⁾ = W⁽¹⁾A⁽⁰⁾ + b⁽¹⁾"]
    Z1 --> A1["Compute A⁽¹⁾ = g(Z⁽¹⁾)"]
    A1 --> Z2["Compute Z⁽²⁾ = W⁽²⁾A⁽¹⁾ + b⁽²⁾"]
    Z2 --> A2["Compute A⁽²⁾ = g(Z⁽²⁾)"]
    A2 --> Dots["..."]
    Dots --> ZL["Compute Z⁽ᴸ⁾ = W⁽ᴸ⁾A⁽ᴸ⁻¹⁾ + b⁽ᴸ⁾"]
    ZL --> AL["Compute A⁽ᴸ⁾ = g(Z⁽ᴸ⁾)"]
    AL --> Output["Output: Ŷ = A⁽ᴸ⁾"]
```

The diagram shows the sequential nature of forward propagation: each layer's computation depends on the previous layer's activation, so layers must be computed in order from layer 1 to layer $L$.
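The forward propagation functions in this lesson apply a single `activation_fn` to every layer. As a usage sketch, the variant below takes a list of per-layer activations so that hidden layers can use ReLU and the output layer sigmoid. The function name `forward_with_activations`, its `activations` argument, and the layer sizes are assumptions introduced for illustration; the parameter keys follow the `w{l}` / `b{l}` convention used above.

```python
import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def forward_with_activations(X, parameters, activations):
    """Forward pass where activations[l-1] is the activation function for layer l."""
    A = X
    for l, g in enumerate(activations, start=1):
        Z = parameters[f'w{l}'] @ A + parameters[f'b{l}']
        A = g(Z)
    return A

rng = np.random.default_rng(0)
layer_sizes = [2, 3, 4, 1]            # n[0], n[1], n[2], n[3]
parameters = {}
for l in range(1, len(layer_sizes)):
    parameters[f'w{l}'] = rng.normal(size=(layer_sizes[l], layer_sizes[l - 1])) * 0.01
    parameters[f'b{l}'] = np.zeros((layer_sizes[l], 1))

X = rng.normal(size=(2, 5))           # 5 examples stacked as columns
Y_hat = forward_with_activations(X, parameters, [relu, relu, sigmoid])

print(Y_hat.shape)   # one prediction per example (column)
```

The loop over layers is the same as before; only the choice of `g` changes from layer to layer.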

## Lesson 3

**Explain the core concepts of the lesson**

## Core Concepts: Matrix Dimensions in Deep Neural Networks

Understanding matrix dimensions is fundamental to implementing neural networks correctly. The key concepts are:

**Weight Matrix Dimensions**: Each layer $l$ has a weight matrix $W^{[l]}$ with dimensions $n^{[l]} \times n^{[l-1]}$, where $n^{[l]}$ is the number of units in layer $l$ and $n^{[l-1]}$ is the number of units in the previous layer.

**Bias Vector Dimensions**: The bias vector $b^{[l]}$ has dimensions $n^{[l]} \times 1$, matching the number of units in the current layer.

**Activation Vector Dimensions**: For a single example, the activation vector $a^{[l]}$ has dimensions $n^{[l]} \times 1$. In vectorized form with $m$ training examples, it becomes $n^{[l]} \times m$.

**Layer Notation and Indexing**: Layers are indexed starting from 0 (input layer) through $L$ (output layer). The superscript notation $[l]$ denotes the layer number, while subscripts denote individual units.

**Dimension Consistency Checking**: For the matrix product $W^{[l]} A^{[l-1]}$ to be valid, the number of columns in $W^{[l]}$ must equal the number of rows in $A^{[l-1]}$ (both are $n^{[l-1]}$; for the first layer, $A^{[0]} = X$). Checking this is a practical debugging technique that catches implementation errors early.

**Vectorized Implementation Dimensions**: When processing multiple training examples simultaneously, we stack them horizontally. The input matrix $X$ has dimensions $n^{[0]} \times m$, where $m$ is the number of training examples.

**Broadcasting in Matrix Operations**: Bias vectors automatically broadcast across all examples during addition, so a $n^{[l]} \times 1$ bias vector adds correctly to a $n^{[l]} \times m$ matrix.

**Gradient Matrix Dimensions**: During backpropagation, gradient matrices $dW^{[l]}$ and $db^{[l]}$ maintain the same dimensions as their corresponding parameters $W^{[l]}$ and $b^{[l]}$.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Dimension Compatibility as a Constraint**: Think of matrix dimensions like puzzle pieces: they must fit together correctly. When multiplying $W^{[l]}$ (dimensions $n^{[l]} \times n^{[l-1]}$) by the previous layer's activations $A^{[l-1]}$ (dimensions $n^{[l-1]} \times m$, with $A^{[0]} = X$), the inner dimensions ($n^{[l-1]}$) must match. The result has outer dimensions $n^{[l]} \times m$. This is not arbitrary; it's a mathematical requirement.

**Layers Define Shape**: Each layer's number of units determines the shape of all associated matrices and vectors. If layer $l$ has $n^{[l]}$ units, then $W^{[l]}$ will always have $n^{[l]}$ rows, and any activation vector from that layer will always have $n^{[l]}$ rows. This consistency makes the architecture predictable.

**Stacking Examples Horizontally**: In vectorized implementations, instead of processing one training example at a time, we stack $m$ examples side-by-side. Each column represents one complete example. This transforms a single example's $n^{[l]} \times 1$ vector into a $n^{[l]} \times m$ matrix. The computation is identical; we just do it all at once.

**Dimensions as a Debugging Tool**: Before running code, trace through the dimensions at each step. If dimensions don't match at any point, you've found a bug. This "dimension checking" is faster than debugging runtime errors and catches logical mistakes in your implementation.
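One way to turn dimension checking into a habit is a small helper that asserts the expected shapes before any forward pass runs. The sketch below is one possible version; the function name `check_parameter_shapes` and the `W[l]` / `b[l]` dictionary keys are assumptions chosen to match the style of the code in this lesson.

```python
import numpy as np

def check_parameter_shapes(parameters, n_units):
    """Raise AssertionError if any W[l] or b[l] has an unexpected shape."""
    for l in range(1, len(n_units)):
        W, b = parameters[f'W[{l}]'], parameters[f'b[{l}]']
        expected_W = (n_units[l], n_units[l - 1])
        expected_b = (n_units[l], 1)
        assert W.shape == expected_W, f"W[{l}]: got {W.shape}, expected {expected_W}"
        assert b.shape == expected_b, f"b[{l}]: got {b.shape}, expected {expected_b}"

n_units = [3, 4, 5, 2]   # n[0] through n[3]
parameters = {}
for l in range(1, len(n_units)):
    parameters[f'W[{l}]'] = np.zeros((n_units[l], n_units[l - 1]))
    parameters[f'b[{l}]'] = np.zeros((n_units[l], 1))

check_parameter_shapes(parameters, n_units)   # passes silently

# A deliberately transposed weight matrix is caught before any forward pass
parameters['W[2]'] = parameters['W[2]'].T
try:
    check_parameter_shapes(parameters, n_units)
except AssertionError as err:
    print(f"Caught: {err}")
```

The error message names the offending layer and both shapes, which is usually enough to locate the mistake.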

**Broadcasting Handles Bias Automatically**: The bias vector $b^{[l]}$ stays $n^{[l]} \times 1$ regardless of how many examples you process. NumPy's broadcasting automatically replicates it across all $m$ columns when you add it to the $n^{[l]} \times m$ product $W^{[l]} A^{[l-1]}$. You don't need to manually expand the bias; NumPy handles it.
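Broadcasting only does the right thing when the bias really is a column of shape $(n^{[l]}, 1)$. The sketch below contrasts that with a rank-1 array of shape $(n^{[l]},)$; the sizes and the variable names `b_col` and `b_flat` are illustrative assumptions.

```python
import numpy as np

n_curr, m = 4, 6
Z = np.zeros((n_curr, m))

b_col = np.arange(n_curr).reshape(n_curr, 1)   # shape (4, 1): one bias per unit
b_flat = np.arange(n_curr)                     # shape (4,): rank-1 array

# (4, 6) + (4, 1): the bias column is copied across all 6 examples
print((Z + b_col).shape)

# (4, 6) + (4,): NumPy aligns trailing axes (6 vs 4), so this fails
try:
    Z + b_flat
except ValueError as err:
    print(f"ValueError: {err}")

# If n_curr happens to equal m, the rank-1 bias broadcasts along the wrong axis
Z_square = np.zeros((n_curr, n_curr))
wrong = Z_square + b_flat    # adds b[j] to column j
right = Z_square + b_col     # adds b[i] to row i, as intended
print(np.array_equal(wrong, right))
```

The last case is the dangerous one because it raises no error, which is a good reason to create biases with an explicit second dimension of 1.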

**Present and explain the key equations used in the lesson**

## Key Equations

**Weight Matrix Dimensions**:
$$W^{[l]} \text{ has dimensions } n^{[l]} \times n^{[l-1]}$$

The weight matrix for layer $l$ has $n^{[l]}$ rows (one for each unit in the current layer) and $n^{[l-1]}$ columns (one for each input from the previous layer).

**Bias Vector Dimensions**:
$$b^{[l]} \text{ has dimensions } n^{[l]} \times 1$$

The bias vector has one entry per unit in layer $l$.
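These two shape rules also determine how many trainable parameters a network has: layer $l$ contributes $n^{[l]} \times n^{[l-1]}$ weights and $n^{[l]}$ biases. The sketch below counts them for an example architecture; the layer sizes in `n_units` are chosen for illustration.

```python
n_units = [2, 3, 4, 1]   # n[0] (input) through n[L] (output); example sizes

total = 0
for l in range(1, len(n_units)):
    n_weights = n_units[l] * n_units[l - 1]   # W[l] is n[l] x n[l-1]
    n_biases = n_units[l]                     # b[l] is n[l] x 1
    total += n_weights + n_biases
    print(f"Layer {l}: W has {n_weights} entries, b has {n_biases}")

print(f"Total trainable parameters: {total}")
```

For this architecture the layers contribute 9, 16, and 5 parameters, for a total of 30.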

**Forward Propagation Equation**:
$$Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$$

This is the core computation: matrix multiplication followed by bias addition. For the first layer, $A^{[0]} = X$.

**Single Example Dimensions**:
$$\text{Single example: } z^{[l]} \text{ is } n^{[l]} \times 1$$

For one training example, the pre-activation output is a column vector.

**Vectorized Dimensions**:
$$\text{Vectorized: } Z^{[l]} \text{ is } n^{[l]} \times m$$
$$\text{Vectorized: } X \text{ is } n^{[0]} \times m$$

When processing $m$ training examples simultaneously, each $n^{[l]} \times 1$ column vector becomes a matrix with $m$ columns, one per example.

**Gradient Dimensions**:
$$dW^{[l]} \text{ has same dimensions as } W^{[l]}$$
$$dZ^{[l]} \text{ has same dimensions as } Z^{[l]}$$

Gradient matrices maintain the same shape as their corresponding forward pass variables, ensuring consistent backpropagation.

**Implement code primitive: Implement dimension checking by working through matrix multiplication rules for each layer**

In [None]:
import numpy as np

# Define network architecture
n_layers = 3
n_units = [3, 4, 5, 2]  # n[0]=3 (input), n[1]=4, n[2]=5, n[3]=2 (output)
m = 10  # number of training examples

# Initialize parameters
parameters = {}
for l in range(1, n_layers + 1):
    parameters[f'W[{l}]'] = np.random.randn(n_units[l], n_units[l-1])
    parameters[f'b[{l}]'] = np.random.randn(n_units[l], 1)

# Check dimensions through forward propagation
X = np.random.randn(n_units[0], m)
print(f"Input X dimensions: {X.shape} (expected: ({n_units[0]}, {m}))")

A = X
for l in range(1, n_layers + 1):
    W = parameters[f'W[{l}]']
    b = parameters[f'b[{l}]']
    
    print(f"\nLayer {l}:")
    print(f"  W[{l}] dimensions: {W.shape} (expected: ({n_units[l]}, {n_units[l-1]}))")
    print(f"  b[{l}] dimensions: {b.shape} (expected: ({n_units[l]}, 1))")
    print(f"  A[{l-1}] dimensions: {A.shape} (expected: ({n_units[l-1]}, {m}))")
    
    Z = np.dot(W, A) + b
    print(f"  Z[{l}] dimensions: {Z.shape} (expected: ({n_units[l]}, {m}))")
    
    A = np.maximum(0, Z)  # ReLU activation
    print(f"  A[{l}] dimensions: {A.shape} (expected: ({n_units[l]}, {m}))")

print(f"\nFinal output dimensions: {A.shape} (expected: ({n_units[n_layers]}, {m}))")

**Implement code primitive: Verify that W^[l] dimensions (n^[l] by n^[l-1]) produce correct output shapes when multiplied by input**

In [None]:
import numpy as np

# Test matrix multiplication dimension rules
n_prev = 5   # n[l-1]
n_curr = 3   # n[l]
m = 8        # number of examples

W = np.random.randn(n_curr, n_prev)
A_prev = np.random.randn(n_prev, m)

print(f"W dimensions: {W.shape}")
print(f"A_prev dimensions: {A_prev.shape}")
print(f"Expected Z dimensions: ({n_curr}, {m})")

Z = np.dot(W, A_prev)
print(f"Actual Z dimensions: {Z.shape}")

assert Z.shape == (n_curr, m), f"Dimension mismatch: got {Z.shape}, expected ({n_curr}, {m})"
print("✓ Dimension check passed: W @ A_prev produces correct shape")

**Implement code primitive: Confirm bias vector dimensions (n^[l] by 1) match the output of matrix multiplication**

In [None]:
import numpy as np

# Test bias addition with broadcasting
n_curr = 4   # n[l]
m = 6        # number of examples

Z = np.random.randn(n_curr, m)
b = np.random.randn(n_curr, 1)

print(f"Z dimensions: {Z.shape}")
print(f"b dimensions: {b.shape}")
print(f"Expected output dimensions: ({n_curr}, {m})")

Z_biased = Z + b
print(f"Actual output dimensions: {Z_biased.shape}")

assert Z_biased.shape == (n_curr, m), f"Dimension mismatch: got {Z_biased.shape}, expected ({n_curr}, {m})"
print("✓ Dimension check passed: bias broadcasts correctly across examples")

# Verify that bias is applied to each example
for j in range(m):
    assert np.allclose(Z_biased[:, j], Z[:, j] + b[:, 0]), f"Bias not applied correctly to example {j}"
print("✓ Bias correctly applied to all examples")

**Implement code primitive: Implement vectorized forward propagation with stacked training examples (m columns)**

In [None]:
import numpy as np

# Vectorized forward propagation
np.random.seed(42)

# Network architecture
n_units = [2, 3, 4, 1]  # Input: 2, Hidden: 3, Hidden: 4, Output: 1
m = 5  # 5 training examples

# Initialize parameters
parameters = {}
for l in range(1, len(n_units)):
    parameters[f'W[{l}]'] = np.random.randn(n_units[l], n_units[l-1]) * 0.01
    parameters[f'b[{l}]'] = np.zeros((n_units[l], 1))

# Input: m examples, each with n[0] features
X = np.random.randn(n_units[0], m)
print(f"Input X shape: {X.shape}")
print(f"X represents {m} examples, each with {n_units[0]} features\n")

# Forward propagation
A = X
for l in range(1, len(n_units)):
    W = parameters[f'W[{l}]']
    b = parameters[f'b[{l}]']
    
    Z = np.dot(W, A) + b
    A = np.maximum(0, Z) if l < len(n_units) - 1 else Z  # ReLU for hidden, linear for output
    
    print(f"Layer {l}:")
    print(f"  Z[{l}] shape: {Z.shape}")
    print(f"  A[{l}] shape: {A.shape}")
    print(f"  Each column is one example's activation\n")

print(f"Final output shape: {A.shape}")
print(f"Output has {A.shape[1]} examples (columns) and {A.shape[0]} output unit(s) (rows)")

**Implement code primitive: Verify that gradient matrices (dW, db) maintain the same dimensions as their corresponding parameters**

In [None]:
import numpy as np

# Verify gradient dimensions match parameter dimensions
n_curr = 3   # n[l]
n_prev = 4   # n[l-1]
m = 6        # number of examples

# Forward pass
W = np.random.randn(n_curr, n_prev)
b = np.random.randn(n_curr, 1)
A_prev = np.random.randn(n_prev, m)
Z = np.dot(W, A_prev) + b

print(f"Parameter dimensions:")
print(f"  W shape: {W.shape}")
print(f"  b shape: {b.shape}")

# Simulated gradient computation (simplified backprop)
dZ = np.random.randn(n_curr, m)  # gradient w.r.t. Z

# Gradient computation
dW = (1/m) * np.dot(dZ, A_prev.T)
db = (1/m) * np.sum(dZ, axis=1, keepdims=True)

print(f"\nGradient dimensions:")
print(f"  dW shape: {dW.shape}")
print(f"  db shape: {db.shape}")

print(f"\nDimension verification:")
assert dW.shape == W.shape, f"dW shape {dW.shape} != W shape {W.shape}"
assert db.shape == b.shape, f"db shape {db.shape} != b shape {b.shape}"
print(f"✓ dW has same dimensions as W: {dW.shape}")
print(f"✓ db has same dimensions as b: {db.shape}")

**Create a Mermaid diagram: A flowchart showing the dimension transformation through layers: input (n^[0] by m) -> W^[1] multiplication -> Z^[1] (n^[1] by m) -> activation -> A^[1] (n^[1] by m) -> next layer**

## Dimension Flow Through Layers

```mermaid
graph TD
    A["Input X<br/>(n⁰ × m)"] --> B["W¹ × X<br/>W¹: n¹ × n⁰<br/>Result: n¹ × m"]
    B --> C["Z¹ = W¹X + b¹<br/>(n¹ × m)"]
    C --> D["Activation<br/>A¹ = ReLU(Z¹)<br/>(n¹ × m)"]
    D --> E["W² × A¹<br/>W²: n² × n¹<br/>Result: n² × m"]
    E --> F["Z² = W²A¹ + b²<br/>(n² × m)"]
    F --> G["Activation<br/>A² = ReLU(Z²)<br/>(n² × m)"]
    G --> H["Continue to<br/>next layer..."]
    
    style A fill:#e1f5ff
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style F fill:#fff3e0
    style G fill:#f3e5f5
```

**Create a Mermaid diagram: A diagram showing how a single example dimension (n^[l] by 1) extends to vectorized dimension (n^[l] by m) by stacking m examples horizontally**

## Single Example vs. Vectorized Dimensions

```mermaid
graph TD
    A["Single Example<br/>a⁽ˡ⁾: n⁽ˡ⁾ × 1"] --> B["Stack m Examples<br/>Horizontally"]
    B --> C["Vectorized Form<br/>A⁽ˡ⁾: n⁽ˡ⁾ × m"]
    
    D["Example 1<br/>Column 1"] --> E["Example 2<br/>Column 2"]
    E --> F["Example 3<br/>Column 3"]
    F --> G["... Example m<br/>Column m"]
    
    C --> H["Each column is<br/>one complete example"]
    
    style A fill:#e8f5e9
    style C fill:#e3f2fd
    style H fill:#fff9c4
```
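
The stacking step in the diagram can be sketched in NumPy. This is a minimal illustration rather than part of the lesson's code; the layer size `n_l`, the example count `m`, and the filler values are assumptions chosen only to make the shapes visible.

```python
import numpy as np

n_l, m = 3, 4  # hypothetical layer size and number of examples

# One activation column vector of shape (n_l, 1) per example
columns = [np.full((n_l, 1), float(i)) for i in range(m)]

# Stack the m column vectors side by side to form A of shape (n_l, m)
A = np.hstack(columns)

print(A.shape)          # shape of the vectorized activation matrix
print(A[:, [2]].shape)  # indexing with a list keeps column 2 as an (n_l, 1) vector
```

Indexing with `A[:, [2]]` instead of `A[:, 2]` keeps the result two-dimensional, which avoids rank-1 arrays when a single example is pulled back out.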

## Lesson 4

**Explain the core concepts of the lesson**

## Core Concepts

Deep neural networks are effective because they learn **hierarchical representations** of data. Rather than attempting to learn complex functions directly from raw input, deep networks build sophisticated representations by composing simpler ones across multiple layers.

**Key concepts:**

- **Feature Hierarchy**: Early layers detect simple, low-level features (e.g., edges in images, pitch in audio). Middle layers combine these into more complex features (e.g., shapes, phonemes). Deep layers recognize high-level abstractions (e.g., objects, words).

- **Compositional Representation**: Each layer builds upon the previous layer's output, creating a compositional structure where complex functions are expressed as compositions of simpler functions.

- **Depth as Computational Efficiency**: Deep networks can compute certain functions with exponentially fewer neurons than shallow networks. This is a fundamental principle from circuit theory: depth dramatically reduces the number of components needed to compute complex functions.

- **Simple to Complex Functions**: The progression from simple to complex features loosely mirrors how biological systems (such as the human visual system) are thought to process information. The analogy motivates the layered design, but it is an intuition rather than a precise model of the brain.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Hierarchical Processing in Vision:**
Imagine recognizing a face in a photograph. Your visual system doesn't directly recognize "face" from raw pixel values. Instead, it first detects simple features like edges and corners, then combines these into parts like eyes and noses, and finally recognizes the complete face. Deep networks work the same way—early layers act like edge detectors, middle layers recognize parts, and deep layers recognize whole objects.

**The XOR Problem:**
Consider computing the XOR (parity) of $N$ binary inputs. A shallow network with a single hidden layer would need exponentially many hidden units (on the order of $2^N$). A deep network can instead arrange two-input XOR operations in a tree: the depth grows only as $O(\log N)$, and the whole tree uses $N - 1$ XOR operations. This illustrates why depth is powerful: each layer reuses the partial results computed by the layer below it.

**Speech Recognition Across Layers:**
In speech processing, the same hierarchical principle applies. Raw audio waveforms are first processed to extract low-level features like pitch and tone. These combine into phonemes (basic sound units), which combine into words, which combine into phrases and sentences. Each layer adds semantic meaning by composing the outputs of the previous layer.

**Composition Over Direct Learning:**
The fundamental insight is that deep networks are effective because they learn to compose simple functions into complex ones, rather than trying to learn complex functions directly from raw input. This compositional approach is more efficient, more generalizable, and mirrors how nature solves complex problems.

**Present and explain the key equations used in the lesson**

## Key Equations

**XOR Tree Depth:**
$$\text{XOR tree depth} = O(\log N)$$

When computing XOR of $N$ binary inputs using a tree of XOR gates, the depth (number of layers) grows logarithmically with the number of inputs. This is because each layer can combine pairs of results from the previous layer, halving the number of remaining inputs at each step.

**Shallow Network Complexity for XOR:**
$$\text{Shallow network hidden units for XOR} = O(2^N)$$

A shallow network with a single hidden layer requires exponentially many hidden units to compute XOR of $N$ inputs. This exponential blowup demonstrates why depth is essential: a deep network can solve the same problem with far fewer total neurons by leveraging the hierarchical structure of the computation.

These equations illustrate the fundamental trade-off: **depth reduces the number of components needed to compute complex functions**. A shallow network must learn the entire complex function in one step, while a deep network can decompose it into simpler sub-functions.
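
To make the two growth rates concrete, the sketch below counts components for a few values of $N$. The helper names are hypothetical. The "odd-parity patterns" count stands in for a naive single-hidden-layer construction that dedicates one hidden unit to each input pattern whose XOR is 1; that construction is an assumption used to illustrate the exponential bound, not a proof that no smaller shallow network exists.

```python
import math
from itertools import product

def tree_depth(n):
    """Depth of a balanced tree of two-input XOR gates over n inputs."""
    return math.ceil(math.log2(n)) if n > 1 else 0

def tree_gates(n):
    """Each two-input gate merges two values into one, so n inputs need n - 1 gates."""
    return max(n - 1, 0)

def odd_parity_patterns(n):
    """Count the n-bit inputs whose XOR is 1 by brute-force enumeration."""
    return sum(1 for bits in product([0, 1], repeat=n) if sum(bits) % 2 == 1)

for n in [2, 4, 8, 12]:
    print(f"N={n:2d}  depth={tree_depth(n)}  gates={tree_gates(n)}  "
          f"odd-parity patterns={odd_parity_patterns(n)}")
```

The tree's gate count grows linearly and its depth logarithmically, while the number of odd-parity patterns doubles with every added input.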

**Implement code primitive: Visualize hidden unit activations in a neural network to show what features early layers detect (e.g., edge orientations in images)**

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Create a simple edge detection visualization
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
fig.suptitle('Hidden Unit Activations: Feature Detection Across Layers', fontsize=14, fontweight='bold')

# Layer 1: Edge Detection (horizontal, vertical, diagonal edges)
edge_filters = [
    np.array([[-1, -1, -1], [2, 2, 2], [-1, -1, -1]]),  # Horizontal edge
    np.array([[-1, 2, -1], [-1, 2, -1], [-1, 2, -1]]),  # Vertical edge
    np.array([[2, -1, -1], [-1, 2, -1], [-1, -1, 2]])   # Diagonal edge
]

# Create a simple image with edges
image = np.zeros((7, 7))
image[2:5, 2:5] = 1  # White square

for idx, (ax, filt) in enumerate(zip(axes[0], edge_filters)):
    # Simple convolution
    activation = np.zeros((5, 5))
    for i in range(5):
        for j in range(5):
            activation[i, j] = np.sum(image[i:i+3, j:j+3] * filt)
    
    ax.imshow(activation, cmap='hot')
    ax.set_title(f'Layer 1: Edge Detector {idx+1}')
    ax.axis('off')

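# Layer 2: Corner and Shape Detection (combinations of edges)
# Each simulated layer-2 unit combines a different pair of layer-1 edge detectors,
# so the three panels show three distinct features instead of the same map.
filter_pairs = [(0, 1), (0, 2), (1, 2)]
for idx, (p, q) in enumerate(filter_pairs):
    ax = axes[1, idx]
    # Simulate combining edge detections: sum of absolute responses of the two filters
    combined = np.zeros((5, 5))
    for i in range(5):
        for j in range(5):
            patch = image[i:i+3, j:j+3]
            combined[i, j] = sum(abs(np.sum(patch * edge_filters[k])) for k in (p, q))
    
    ax.imshow(combined, cmap='viridis')
    ax.set_title(f'Layer 2: Shape Feature {idx+1}')
    ax.axis('off')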

plt.tight_layout()
plt.show()

print("Layer 1 detects simple features: edges at different orientations")
print("Layer 2 combines these edges to detect more complex shapes")
print("Deeper layers would recognize complete objects by combining shapes")

**Implement or trace through an XOR tree circuit to demonstrate how depth reduces the number of required components**

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def xor(a, b):
    """Compute XOR of two bits"""
    return int(a != b)

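def xor_tree_depth(n):
    """Calculate depth needed for XOR tree with n inputs (ceil(log2(n)))"""
    if n <= 1:
        return 0
    # Pairing n values leaves ceil(n / 2) values for the next level;
    # an unpaired input is carried up, matching xor_tree_compute below.
    return 1 + xor_tree_depth((n + 1) // 2)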

def xor_tree_compute(inputs):
    """Compute XOR of inputs using tree structure"""
    if len(inputs) == 1:
        return inputs[0]
    
    # Pair up inputs and compute XOR
    next_level = []
    for i in range(0, len(inputs), 2):
        if i + 1 < len(inputs):
            next_level.append(xor(inputs[i], inputs[i+1]))
        else:
            next_level.append(inputs[i])
    
    return xor_tree_compute(next_level)

# Demonstrate XOR tree for 8 inputs
test_inputs = [1, 0, 1, 1, 0, 1, 0, 1]
result = xor_tree_compute(test_inputs)
print(f"XOR of {test_inputs} = {result}")
print(f"Tree depth for {len(test_inputs)} inputs: {xor_tree_depth(len(test_inputs))}")

# Compare shallow vs deep complexity
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Depth comparison
n_values = np.arange(1, 17)
deep_depth = [xor_tree_depth(n) for n in n_values]
shallow_units = [2**n for n in n_values]

ax1.plot(n_values, deep_depth, 'o-', label='Deep Network Depth', linewidth=2, markersize=8)
ax1.set_xlabel('Number of Inputs (N)', fontsize=12)
ax1.set_ylabel('Tree Depth (layers)', fontsize=12)
ax1.set_title('Deep Network: Logarithmic Depth', fontsize=13, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.legend(fontsize=11)

# Plot 2: Shallow network complexity
ax2.semilogy(n_values, shallow_units, 's-', label='Shallow Network Hidden Units', color='red', linewidth=2, markersize=8)
ax2.set_xlabel('Number of Inputs (N)', fontsize=12)
ax2.set_ylabel('Hidden Units (log scale)', fontsize=12)
ax2.set_title('Shallow Network: Exponential Growth', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3, which='both')
ax2.legend(fontsize=11)

plt.tight_layout()
plt.show()

print(f"\nFor 16 inputs:")
print(f"  Deep network depth: {xor_tree_depth(16)} layers")
print(f"  Shallow network hidden units: {2**16} neurons")

**Create a Mermaid diagram: A flowchart showing the visual feature hierarchy: raw input image -> Layer 1 (edge detection) -> Layer 2 (face parts) -> Layer 3 (face recognition)**

## Visual Hierarchy: Image Recognition

```mermaid
flowchart TD
    A[Raw Input Image] --> B[Layer 1: Edge Detection]
    B --> C[Layer 2: Face Parts]
    C --> D[Layer 3: Face Recognition]
    style A fill:#e1f5ff
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#e8f5e9
```

This diagram shows how a deep network processes visual information hierarchically. The raw pixel input is progressively transformed into increasingly abstract representations. Early layers detect simple edges and textures, middle layers recognize meaningful parts like eyes and noses, and deep layers recognize complete objects like faces. This compositional approach allows the network to build complex visual understanding from simple building blocks.

**Create a Mermaid diagram: A flowchart showing the speech feature hierarchy: audio input -> Layer 1 (waveform features) -> Layer 2 (phonemes) -> Layer 3 (words) -> Layer 4 (phrases/sentences)**

## Feature Hierarchy: Speech Recognition

```mermaid
flowchart TD
    A[Audio Input] --> B[Layer 1: Waveform Features]
    B --> C[Layer 2: Phonemes]
    C --> D[Layer 3: Words]
    D --> E[Layer 4: Phrases/Sentences]
    style A fill:#e1f5ff
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#fce4ec
    style E fill:#e8f5e9
```

This diagram illustrates the same hierarchical principle applied to speech recognition. Raw audio waveforms are first processed to extract low-level acoustic features like pitch and tone. These combine into phonemes (basic sound units), which are the building blocks of spoken language. Phonemes combine into words, and words combine into phrases and sentences. Each layer builds a more abstract representation by composing the outputs of the previous layer, which shows that the compositional approach is not specific to images.

**Create a Mermaid diagram: A flowchart showing a four-input XOR tree: X1 and X2 feed one XOR gate, X3 and X4 feed a second XOR gate, and a third XOR gate combines both results into the output**

## XOR Tree Circuit

```mermaid
flowchart TD
    X1 --> XOR1[XOR]
    X2 --> XOR1
    X3 --> XOR2[XOR]
    X4 --> XOR2
    XOR1 --> XOR3[XOR]
    XOR2 --> XOR3
    XOR3 --> Y[Output]
    style Y fill:#e8f5e9
```

This diagram shows how a tree structure of XOR gates can compute the XOR of multiple inputs with logarithmic depth. Instead of requiring exponentially many neurons in a single hidden layer, the tree organizes the computation hierarchically. Each layer combines pairs of results from the previous layer, halving the number of remaining values at each step. For $N$ inputs, this tree needs only $O(\log N)$ depth and $N - 1$ two-input XOR gates, compared to $O(2^N)$ hidden units in a shallow network. This is a concrete example of how depth can dramatically reduce the number of components required.
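
The four-input circuit in the diagram can be traced gate by gate. The sketch below uses a hypothetical helper, `xor_tree_4`, whose intermediate names follow the diagram's node labels, and checks it against a plain left-to-right XOR over all 16 input combinations.

```python
from functools import reduce
from itertools import product
from operator import xor

def xor_tree_4(x1, x2, x3, x4):
    """Evaluate the four-input circuit from the diagram, gate by gate."""
    xor1 = x1 ^ x2      # first layer, left gate
    xor2 = x3 ^ x4      # first layer, right gate
    xor3 = xor1 ^ xor2  # second layer combines the two partial results
    return xor3

# Check the circuit against a left-to-right XOR over every input combination
for bits in product([0, 1], repeat=4):
    assert xor_tree_4(*bits) == reduce(xor, bits)

print(xor_tree_4(1, 0, 1, 1))  # prints 1
```

Three gates arranged in two layers cover four inputs, which matches the $N - 1$ gate count and the $\log_2 N$ depth for $N = 4$.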

## Lesson 5

**Explain the core concepts of the lesson**

## Core Concepts

A neural network is built as a sequence of layers, where each layer performs two fundamental operations:

**Forward Propagation**: Each layer takes the activation from the previous layer and computes the activation for the current layer using learnable parameters (weights and biases).

**Backward Propagation**: After computing the loss, gradients flow backward through the network. Each layer receives the gradient of the loss with respect to its output activation and computes gradients with respect to its inputs and parameters.

**Cache Mechanism**: During forward propagation, intermediate values (like the pre-activation $z^{[l]}$) are stored in a cache. These cached values are essential during backward propagation to compute gradients efficiently without recomputation.

**Layer Computation**: A single layer performs a linear transformation followed by an activation function. The linear transformation combines the previous activation with weights and biases, and the activation function introduces non-linearity.

**Gradient Flow**: Gradients propagate backward through the network, allowing us to compute how much each parameter should change to reduce the loss. This enables the parameter update step in training.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Complementary Operations**: Each layer in a neural network has two complementary operations: forward propagation computes activations from the previous layer, and backward propagation computes gradients flowing back through the network. Think of forward propagation as the "thinking" phase where the network processes input, and backward propagation as the "learning" phase where the network adjusts based on errors.

**Caching for Efficiency**: The cache stores intermediate values (like $z$) computed during forward propagation so they can be reused during backward propagation without recomputation. This is like writing down your work during a calculation so you don't have to redo it when checking your answer.

**Bidirectional Flow**: Forward propagation flows left-to-right through layers, while backward propagation flows right-to-left, computing gradients for all parameters along the way. Imagine information flowing in one direction during prediction, then flowing back in the opposite direction during learning.

**Complete Training Cycle**: A complete training iteration involves a full forward pass to compute predictions, followed by a full backward pass to compute all parameter gradients, then updating all parameters simultaneously. This ensures all parameters are updated based on the same loss computation.
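
The cycle described above (forward pass, backward pass, simultaneous update) can be sketched on the smallest possible network: a single sigmoid unit. The toy data, the learning rate, and the step count below are assumptions chosen for illustration, not values from the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 2 features, 8 examples; label is 1 when the features sum to a positive value
X = rng.standard_normal((2, 8))
Y = (X.sum(axis=0, keepdims=True) > 0).astype(float)

W = np.zeros((1, 2))
b = np.zeros((1, 1))
learning_rate = 0.5  # assumed value for illustration
m = X.shape[1]

def cost(A, Y):
    """Mean binary cross-entropy; A is clipped to keep the logarithms finite."""
    A = np.clip(A, 1e-12, 1 - 1e-12)
    return float(-np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

losses = []
for step in range(50):
    # 1. Full forward pass
    Z = W @ X + b
    A = 1 / (1 + np.exp(-Z))
    losses.append(cost(A, Y))

    # 2. Full backward pass (for sigmoid with cross-entropy, dZ reduces to A - Y)
    dZ = A - Y
    dW = dZ @ X.T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m

    # 3. Update every parameter using gradients from the same forward pass
    W = W - learning_rate * dW
    b = b - learning_rate * db

print(f"loss at step 0: {losses[0]:.4f}, loss at step 49: {losses[-1]:.4f}")
```

Both gradients are computed before either parameter changes, so `W` and `b` are updated from the same loss computation.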

**Present and explain the key equations used in the lesson**

## Key Equations

**Linear Transformation** (pre-activation):
$$z^{[l]} = w^{[l]} a^{[l-1]} + b^{[l]}$$

where $w^{[l]}$ are the weights, $a^{[l-1]}$ is the activation from the previous layer, and $b^{[l]}$ is the bias.

**Activation Function**:
$$a^{[l]} = g(z^{[l]})$$

where $g$ is an activation function (e.g., ReLU, sigmoid, tanh) that introduces non-linearity.

**Parameter Update for Weights**:
$$w^{[l]} := w^{[l]} - \alpha dw^{[l]}$$

where $\alpha$ is the learning rate and $dw^{[l]}$ is the gradient of the loss with respect to $w^{[l]}$.

**Parameter Update for Biases**:
$$b^{[l]} := b^{[l]} - \alpha db^{[l]}$$

where $db^{[l]}$ is the gradient of the loss with respect to $b^{[l]}$.

These equations form the foundation of layer-wise computation and parameter optimization in deep neural networks.
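
The two update rules can be applied to every layer in one pass. The sketch below is one possible helper, not the lesson's implementation. The function name `update_parameters` is hypothetical, and it assumes parameters are stored under keys `'W1'`, `'b1'`, ... and gradients under `'dW1'`, `'db1'`, ....

```python
import numpy as np

def update_parameters(parameters, gradients, learning_rate):
    """Return a new dict with one gradient-descent step applied to every W and b."""
    L = len(parameters) // 2  # one W and one b per layer
    updated = {}
    for l in range(1, L + 1):
        updated[f'W{l}'] = parameters[f'W{l}'] - learning_rate * gradients[f'dW{l}']
        updated[f'b{l}'] = parameters[f'b{l}'] - learning_rate * gradients[f'db{l}']
    return updated

# Tiny usage example with made-up values
parameters = {'W1': np.ones((2, 3)), 'b1': np.zeros((2, 1))}
gradients = {'dW1': np.full((2, 3), 0.5), 'db1': np.ones((2, 1))}
updated = update_parameters(parameters, gradients, learning_rate=0.1)
print(updated['W1'].shape, updated['b1'].shape)
```

Returning a new dictionary leaves the input parameters untouched, which makes it easy to compare values before and after a step. An in-place update would work equally well.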

**Implement code primitive: Implement a forward function that takes activation from previous layer and parameters, computes linear transformation and activation, returns output activation and cache**

In [None]:
import numpy as np

def forward_layer(A_prev, W, b, activation='relu'):
    """
    Forward propagation for a single layer.
    
    Args:
        A_prev: Activation from previous layer, shape (n_prev, m)
        W: Weight matrix, shape (n_curr, n_prev)
        b: Bias vector, shape (n_curr, 1)
        activation: Activation function ('relu' or 'sigmoid')
    
    Returns:
        A: Activation of current layer, shape (n_curr, m)
        cache: Tuple of (A_prev, W, b, Z) for backward propagation
    """
    Z = np.dot(W, A_prev) + b
    
    if activation == 'relu':
        A = np.maximum(0, Z)
    elif activation == 'sigmoid':
        A = 1 / (1 + np.exp(-Z))
    else:
        raise ValueError("Unknown activation function")
    
    cache = (A_prev, W, b, Z)
    return A, cache

**Implement code primitive: Implement a backward function that takes gradient of loss with respect to current activation and cache, computes gradients with respect to previous activation and parameters**

In [None]:
def backward_layer(dA, cache, activation='relu'):
    """
    Backward propagation for a single layer.
    
    Args:
        dA: Gradient of loss with respect to current activation, shape (n_curr, m)
        cache: Tuple of (A_prev, W, b, Z) from forward propagation
        activation: Activation function ('relu' or 'sigmoid')
    
    Returns:
        dA_prev: Gradient with respect to previous activation, shape (n_prev, m)
        dW: Gradient with respect to weights, shape (n_curr, n_prev)
        db: Gradient with respect to bias, shape (n_curr, 1)
    """
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    
    if activation == 'relu':
        dZ = dA * (Z > 0)
    elif activation == 'sigmoid':
        S = 1 / (1 + np.exp(-Z))
        dZ = dA * S * (1 - S)
    else:
        raise ValueError("Unknown activation function")
    
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    
    return dA_prev, dW, db

**Implement code primitive: Implement a full forward propagation loop that sequentially applies forward functions through all layers, storing caches at each step**

In [None]:
def forward_propagation(X, parameters, activations):
    """
    Forward propagation through all layers.
    
    Args:
        X: Input data, shape (n_0, m)
        parameters: Dictionary with keys 'W1', 'b1', 'W2', 'b2', ..., 'WL', 'bL'
        activations: List of activation functions for each layer
    
    Returns:
        AL: Output activation from final layer, shape (n_L, m)
        caches: List of caches from each layer
    """
    caches = []
    A = X
    L = len(activations)
    
    for l in range(1, L + 1):
        A_prev = A
        W = parameters[f'W{l}']
        b = parameters[f'b{l}']
        activation = activations[l - 1]
        
        A, cache = forward_layer(A_prev, W, b, activation)
        caches.append(cache)
    
    return A, caches

**Implement code primitive: Implement a full backward propagation loop that sequentially applies backward functions in reverse order, accumulating parameter gradients**

In [None]:
def backward_propagation(dAL, caches, activations):
    """
    Backward propagation through all layers.
    
    Args:
        dAL: Gradient of loss with respect to final activation, shape (n_L, m)
        caches: List of caches from forward propagation
        activations: List of activation functions for each layer
    
    Returns:
        gradients: Dictionary with keys 'dW1', 'db1', 'dW2', 'db2', ..., 'dWL', 'dbL'
    """
    gradients = {}
    dA = dAL
    L = len(caches)
    
    for l in reversed(range(1, L + 1)):
        cache = caches[l - 1]
        activation = activations[l - 1]
        
        dA_prev, dW, db = backward_layer(dA, cache, activation)
        
        gradients[f'dW{l}'] = dW
        gradients[f'db{l}'] = db
        dA = dA_prev
    
    return gradients

**Create a Mermaid diagram: flowchart showing forward propagation pipeline: input a0 flows through layer 1 (using w1, b1) to produce a1, then through layer 2 (using w2, b2) to produce a2, continuing to layer L producing aL as output, with z values cached at each step**

## Forward Propagation Pipeline

```mermaid
graph LR
    A0["a⁽⁰⁾<br/>(Input)"] --> L1["Layer 1<br/>W¹, b¹"]
    L1 --> Z1["z⁽¹⁾<br/>(Cache)"]
    Z1 --> Act1["g(z⁽¹⁾)"]
    Act1 --> A1["a⁽¹⁾"]
    
    A1 --> L2["Layer 2<br/>W², b²"]
    L2 --> Z2["z⁽²⁾<br/>(Cache)"]
    Z2 --> Act2["g(z⁽²⁾)"]
    Act2 --> A2["a⁽²⁾"]
    
    A2 --> Dots["..."]
    Dots --> LL["Layer L<br/>Wᴸ, bᴸ"]
    LL --> ZL["z⁽ᴸ⁾<br/>(Cache)"]
    ZL --> ActL["g(z⁽ᴸ⁾)"]
    ActL --> AL["aᴸ<br/>(Output)"]

```

**Create a Mermaid diagram: flowchart showing backward propagation pipeline: starting from dL (gradient of loss), flowing backward through layer L to compute dz, dw, db and da(L-1), then continuing backward through previous layers in reverse order**

## Backward Propagation Pipeline

```mermaid
graph RL
    DL["daᴸ<br/>(Loss Gradient)"] --> BL["Layer L<br/>Backward"]
    BL --> DZL["dz⁽ᴸ⁾"]
    BL --> DWL["dWᴸ"]
    BL --> DBL["dbᴸ"]
    DZL --> DAL_1["da⁽ᴸ⁻¹⁾"]
    
    DAL_1 --> BL_1["Layer L-1<br/>Backward"]
    BL_1 --> DZL_1["dz⁽ᴸ⁻¹⁾"]
    BL_1 --> DWL_1["dWᴸ⁻¹"]
    BL_1 --> DBL_1["dbᴸ⁻¹"]
    DZL_1 --> DAL_2["da⁽ᴸ⁻²⁾"]
    
    DAL_2 --> Dots["..."]
    Dots --> B1["Layer 1<br/>Backward"]
    B1 --> DZ1["dz⁽¹⁾"]
    B1 --> DW1["dW¹"]
    B1 --> DB1["db¹"]
```

## Lesson 6

**Explain the core concepts of the lesson**

## Core Concepts

Forward and backward propagation are the fundamental mechanisms for training neural networks. They work together to compute predictions and update model parameters.

**Forward Propagation** is the process of computing layer outputs sequentially from input to output. For each layer $l$, we compute:
- The pre-activation $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
- The activation $a^{[l]} = g(z^{[l]})$ where $g$ is an activation function

During forward propagation, we cache intermediate values (activations, pre-activations, weights, and biases) for use in the backward pass.

**Backward Propagation** is the process of computing gradients of the loss with respect to all parameters and activations. It flows from the output layer back to the input layer, using the cached values from the forward pass. For each layer, shown here for a single training example, we compute:
- The gradient with respect to pre-activation: $dz^{[l]} = da^{[l]} \odot g'(z^{[l]})$
- The gradient with respect to weights: $dW^{[l]} = dz^{[l]} (a^{[l-1]})^T$
- The gradient with respect to bias: $db^{[l]} = dz^{[l]}$
- The gradient with respect to previous activation: $da^{[l-1]} = (W^{[l]})^T dz^{[l]}$

**Cache Storage** enables efficient computation by storing values computed during the forward pass so they can be reused during backpropagation without recomputation.

**Vectorization** allows us to process entire batches of examples simultaneously using matrix operations, making training efficient.
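
As a small check of the vectorization claim, the sketch below compares one matrix product over a whole batch with a per-example loop. The shapes and random values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
b = rng.standard_normal((3, 1))
A_prev = rng.standard_normal((4, 5))  # 5 examples stacked as columns

# Vectorized: b of shape (3, 1) is broadcast across all 5 columns
Z_vec = W @ A_prev + b

# Loop: process one example (one column) at a time, then stack the results
Z_loop = np.hstack([W @ A_prev[:, [i]] + b for i in range(A_prev.shape[1])])

print(np.allclose(Z_vec, Z_loop))  # True
```

The two results agree because matrix multiplication acts on each column independently and broadcasting adds the same bias column to every example.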

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Forward Pass as Information Flow**: Think of forward propagation as information flowing left-to-right through the network. Each layer transforms its input through a linear transformation (weights and bias) followed by a nonlinear activation. The output of one layer becomes the input to the next.

**Backward Pass as Error Attribution**: Backward propagation computes how much each parameter contributed to the final error. The gradient flows right-to-left, telling us how to adjust each weight and bias to reduce the loss. The chain rule connects gradients across layers.

**Mirrored Structure**: The backward pass mirrors the forward pass structure but in reverse. Where forward propagation applies $W^{[l]}$ to $a^{[l-1]}$, backward propagation applies $(W^{[l]})^T$ to $dz^{[l]}$. Where forward propagation applies activation function $g$, backward propagation applies its derivative $g'$.

**Caching as Efficiency**: During forward propagation, we store intermediate values (z, a, W, b) at each layer. During backward propagation, we retrieve these cached values to compute gradients. This avoids recomputing values and makes backpropagation efficient.

**Loss Derivative as Initialization**: The backward recursion must start somewhere. For binary classification with cross-entropy loss, the derivative of the loss with respect to the final activation $a^{[L]}$ is $da^{[L]} = -\frac{y}{a^{[L]}} + \frac{1-y}{1-a^{[L]}}$. This initializes the backward chain.

**Present and explain the key equations used in the lesson**

## Key Equations

**Forward Propagation**:

$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$

$$a^{[l]} = g(z^{[l]})$$

where $g$ is an activation function (ReLU, sigmoid, etc.) and $a^{[0]} = x$ is the input.

**Backward Propagation - Gradient with respect to pre-activation**:

$$dz^{[l]} = da^{[l]} \odot g'(z^{[l]})$$

where $\odot$ denotes element-wise multiplication and $g'$ is the derivative of the activation function.

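**Backward Propagation - Gradients with respect to parameters** (single training example):

$$dW^{[l]} = dz^{[l]} (a^{[l-1]})^T$$

$$db^{[l]} = dz^{[l]}$$

For a batch of $m$ examples stacked as columns, the same gradients are averaged over the batch: $dW^{[l]} = \frac{1}{m} dZ^{[l]} (A^{[l-1]})^T$ and $db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dz^{[l](i)}$.

**Backward Propagation - Gradient with respect to previous activation**:

$$da^{[l-1]} = (W^{[l]})^T dz^{[l]}$$

**Backward Propagation - Recursive gradient computation**:

$$dz^{[l]} = (W^{[l+1]})^T dz^{[l+1]} \odot g'(z^{[l]})$$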

**Loss Derivative for Binary Classification**:

$$da^{[L]} = -\frac{y}{a^{[L]}} + \frac{1-y}{1-a^{[L]}}$$

where $y$ is the true label and $a^{[L]}$ is the predicted probability from the sigmoid output.
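
When the output activation is a sigmoid, combining the loss derivative above with $g'(z) = a(1-a)$ gives $dz = a - y$. The sketch below checks that identity numerically; the labels and random pre-activations are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal((1, 5))            # pre-activations of the output unit
Y = np.array([[1.0, 0.0, 1.0, 1.0, 0.0]])  # hypothetical labels

A = 1 / (1 + np.exp(-Z))                   # sigmoid output
dA = -(Y / A) + (1 - Y) / (1 - A)          # loss derivative with respect to A
dZ_chain = dA * A * (1 - A)                # chain rule with sigmoid'(z) = a(1 - a)

print(np.allclose(dZ_chain, A - Y))        # True
```

Using `A - Y` directly avoids the division by `A` and `1 - A`, which is one reason implementations often fuse the sigmoid and cross-entropy gradients in the output layer.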

**Implement code primitive: Implement forward propagation function that takes a^{l-1} and returns a^l and cache (z^l, W^l, b^l)**

In [None]:
import numpy as np

def forward_propagation(a_prev, W, b, activation='relu'):
    """
    Forward propagation for a single layer.
    
    Args:
        a_prev: Activation from previous layer
        W: Weight matrix
        b: Bias vector
        activation: Activation function ('relu' or 'sigmoid')
    
    Returns:
        a: Activation output
        cache: Tuple of (z, W, b, a_prev) for backward pass
    """
    z = np.dot(W, a_prev) + b
    
    if activation == 'relu':
        a = np.maximum(0, z)
    elif activation == 'sigmoid':
        a = 1 / (1 + np.exp(-z))
    else:
        raise ValueError("Unknown activation function")
    
    cache = (z, W, b, a_prev)
    return a, cache

**Implement code primitive: Implement backward propagation function that takes da^l and returns da^{l-1}, dW^l, db^l using cached values**

In [None]:
def backward_propagation(da, cache, activation='relu'):
    """
    Backward propagation for a single layer and a single training example.
    
    Args:
        da: Gradient of loss with respect to activation, shape (n, 1)
        cache: Tuple of (z, W, b, a_prev) from forward pass
        activation: Activation function ('relu' or 'sigmoid')
    
    Returns:
        da_prev: Gradient with respect to previous activation, shape (n_prev, 1)
        dW: Gradient with respect to weights, shape (n, n_prev)
        db: Gradient with respect to bias, shape (n, 1)
    """
    z, W, b, a_prev = cache
    
    if activation == 'relu':
        dz = da * (z > 0)
    elif activation == 'sigmoid':
        s = 1 / (1 + np.exp(-z))
        dz = da * s * (1 - s)
    else:
        raise ValueError("Unknown activation function")
    
    # Single-example gradients: no averaging over a batch is needed here.
    # With m > 1 columns, db = dz would have shape (n, m) instead of (n, 1);
    # use backward_propagation_batch for batches.
    dW = np.dot(dz, a_prev.T)
    db = dz
    da_prev = np.dot(W.T, dz)
    
    return da_prev, dW, db

**Implement code primitive: Vectorize forward pass using matrix multiplication and broadcasting for batch processing**

In [None]:
def forward_propagation_batch(A_prev, W, b, activation='relu'):
    """
    Vectorized forward propagation for batch of examples.
    
    Args:
        A_prev: Activation matrix (n_prev, m) where m is batch size
        W: Weight matrix (n, n_prev)
        b: Bias vector (n, 1)
        activation: Activation function ('relu' or 'sigmoid')
    
    Returns:
        A: Activation output (n, m)
        cache: Tuple of (Z, W, b, A_prev) for backward pass
    """
    Z = np.dot(W, A_prev) + b
    
    if activation == 'relu':
        A = np.maximum(0, Z)
    elif activation == 'sigmoid':
        A = 1 / (1 + np.exp(-Z))
    else:
        raise ValueError("Unknown activation function")
    
    cache = (Z, W, b, A_prev)
    return A, cache

**Implement code primitive: Vectorize backward pass using matrix operations and np.sum with axis and keepdims parameters**

In [None]:
def backward_propagation_batch(dA, cache, activation='relu'):
    """
    Vectorized backward propagation for batch of examples.
    
    Args:
        dA: Gradient matrix (n, m) where m is batch size
        cache: Tuple of (Z, W, b, A_prev) from forward pass
        activation: Activation function ('relu' or 'sigmoid')
    
    Returns:
        dA_prev: Gradient with respect to previous activation (n_prev, m)
        dW: Gradient with respect to weights (n, n_prev)
        db: Gradient with respect to bias (n, 1)
    """
    Z, W, b, A_prev = cache
    m = A_prev.shape[1]
    
    if activation == 'relu':
        dZ = dA * (Z > 0)
    elif activation == 'sigmoid':
        S = 1 / (1 + np.exp(-Z))
        dZ = dA * S * (1 - S)
    else:
        raise ValueError("Unknown activation function")
    
    dW = np.dot(dZ, A_prev.T) / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = np.dot(W.T, dZ)
    
    return dA_prev, dW, db
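
One way to gain confidence in a vectorized backward pass is a numerical gradient check. The sketch below is self-contained and uses its own compact sigmoid-layer helpers (`layer_forward`, `layer_backward`) and a hypothetical squared-error cost. These names and the cost are assumptions for illustration, not the functions defined above.

```python
import numpy as np

def layer_forward(A_prev, W, b):
    """Sigmoid layer; returns the activation and the pre-activation Z."""
    Z = W @ A_prev + b
    return 1 / (1 + np.exp(-Z)), Z

def layer_backward(dA, Z, A_prev):
    """Vectorized parameter gradients for the sigmoid layer, averaged over m examples."""
    m = A_prev.shape[1]
    S = 1 / (1 + np.exp(-Z))
    dZ = dA * S * (1 - S)
    dW = dZ @ A_prev.T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    return dW, db

def cost(W, b, A_prev, Y):
    """Hypothetical scalar cost: half squared error, averaged over examples."""
    A, _ = layer_forward(A_prev, W, b)
    return 0.5 * np.sum((A - Y) ** 2) / A_prev.shape[1]

rng = np.random.default_rng(0)
A_prev = rng.standard_normal((4, 6))
W = rng.standard_normal((3, 4))
b = rng.standard_normal((3, 1))
Y = rng.random((3, 6))

# Analytic gradient: for this cost the per-example derivative is dA = A - Y
A, Z = layer_forward(A_prev, W, b)
dW, db = layer_backward(A - Y, Z, A_prev)

# Numerical gradient of the cost with respect to each entry of W (central differences)
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        dW_num[i, j] = (cost(W_plus, b, A_prev, Y) - cost(W_minus, b, A_prev, Y)) / (2 * eps)

print("max |dW - dW_num|:", np.max(np.abs(dW - dW_num)))
```

A small maximum difference suggests that the `1/m` averaging and the transpose in `dZ @ A_prev.T` are placed correctly. The same loop can be repeated over the entries of `b`.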

**Implement code primitive: Initialize backward recursion with loss derivative for final layer in binary classification**

In [None]:
def loss_derivative_binary_classification(Y, A):
    """
    Compute derivative of binary cross-entropy loss with respect to final activation.
    
    Args:
        Y: True labels (1, m) where m is batch size
        A: Predicted probabilities (1, m)
    
    Returns:
        dA: Gradient of loss with respect to A (1, m)
    """
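    # Clip predictions away from exactly 0 and 1 to avoid division by zero
    eps = 1e-12
    A = np.clip(A, eps, 1 - eps)
    dA = -(Y / A) + (1 - Y) / (1 - A)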
    return dA

**Implement code primitive: Chain forward functions left-to-right starting with a^0 = x**

In [None]:
def forward_pass_network(X, parameters, activations):
    """
    Forward pass through entire network.
    
    Args:
        X: Input data (n_0, m) where m is batch size
        parameters: Dictionary with keys 'W1', 'b1', 'W2', 'b2', ...
        activations: List of activation functions for each layer
    
    Returns:
        A: Final output
        caches: List of caches from each layer
    """
    A = X
    caches = []
    L = len(activations)
    
    for l in range(L):
        A_prev = A
        W = parameters[f'W{l+1}']
        b = parameters[f'b{l+1}']
        A, cache = forward_propagation_batch(A_prev, W, b, activation=activations[l])
        caches.append(cache)
    
    return A, caches

**Implement code primitive: Chain backward functions right-to-left starting with da^L from loss derivative**

In [None]:
def backward_pass_network(Y, A_final, caches, activations):
    """
    Backward pass through entire network.
    
    Args:
        Y: True labels (1, m)
        A_final: Final activation output (1, m)
        caches: List of caches from forward pass
        activations: List of activation functions for each layer
    
    Returns:
        gradients: Dictionary with keys 'dW1', 'db1', 'dW2', 'db2', ...
    """
    gradients = {}
    L = len(caches)
    
    dA = loss_derivative_binary_classification(Y, A_final)
    
    for l in reversed(range(L)):
        cache = caches[l]
        dA, dW, db = backward_propagation_batch(dA, cache, activation=activations[l])
        gradients[f'dW{l+1}'] = dW
        gradients[f'db{l+1}'] = db
    
    return gradients

**Create a Mermaid diagram: flowchart showing forward propagation chain: input x → layer 1 (ReLU) → layer 2 (ReLU) → layer 3 (sigmoid) → output y-hat → loss computation**

## Forward Propagation Flow

```mermaid
graph TD
    A["Input x<br/>(a⁰)"] --> B["Layer 1<br/>z¹ = W¹a⁰ + b¹<br/>a¹ = ReLU(z¹)"]
    B --> C["Layer 2<br/>z² = W²a¹ + b²<br/>a² = ReLU(z²)"]
    C --> D["Layer 3<br/>z³ = W³a² + b³<br/>a³ = sigmoid(z³)"]
    D --> E["Output ŷ<br/>(a³)"]
    E --> F["Loss Computation<br/>L = -y log(ŷ) - (1-y) log(1-ŷ)"]
    
    style A fill:#e1f5ff
    style B fill:#fff3e0
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#e1f5ff
    style F fill:#ffebee
```

**Create a Mermaid diagram: flowchart showing backward propagation chain: loss → da^L computation → backprop layer 3 → backprop layer 2 → backprop layer 1, with cached z values flowing into each backward step**

## Backward Propagation Flow

```mermaid
graph TD
    A["Loss L"] --> B["Compute dA³<br/>dA³ = -y/a³ + (1-y)/(1-a³)"]
    B --> C["Backward Layer 3<br/>dz³ = dA³ ⊙ sigmoid'(z³)<br/>dW³ = dz³(a²)ᵀ<br/>db³ = dz³<br/>dA² = (W³)ᵀ dz³"]
    C --> D["Backward Layer 2<br/>dz² = dA² ⊙ ReLU'(z²)<br/>dW² = dz²(a¹)ᵀ<br/>db² = dz²<br/>dA¹ = (W²)ᵀ dz²"]
    D --> E["Backward Layer 1<br/>dz¹ = dA¹ ⊙ ReLU'(z¹)<br/>dW¹ = dz¹(a⁰)ᵀ<br/>db¹ = dz¹"]
    
    F["Cache: z³, W³, b³, a²"] -.-> C
    G["Cache: z², W², b², a¹"] -.-> D
    H["Cache: z¹, W¹, b¹, a⁰"] -.-> E
    
    style A fill:#ffebee
    style B fill:#f3e5f5
    style C fill:#fff3e0
    style D fill:#fff3e0
    style E fill:#fff3e0
    style F fill:#e0f2f1
    style G fill:#e0f2f1
    style H fill:#e0f2f1
```
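
For the output layer in the diagram, the product `dA³ ⊙ sigmoid'(z³)` simplifies to `a³ - y` when the loss is binary cross-entropy. The sketch below checks that identity numerically; the input values are made up, and the variable names simply mirror the diagram's symbols.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up pre-activations and labels for four examples
z3 = np.array([[-2.0, -0.5, 0.0, 1.5]])
y = np.array([[0.0, 1.0, 1.0, 0.0]])

a3 = sigmoid(z3)

# Long route from the diagram: loss derivative times sigmoid derivative
dA3 = -(y / a3) + (1 - y) / (1 - a3)
dz3_chain = dA3 * a3 * (1 - a3)   # sigmoid'(z) = a * (1 - a)

# Short route: the simplified closed form
dz3_short = a3 - y

print(np.allclose(dz3_chain, dz3_short))
```

The short form avoids dividing by `a³` and `1 - a³`, which is one reason implementations often compute the output-layer `dz` directly.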

## Lesson 7

**Explain the core concepts of the lesson**

## Core Concepts

**Hyperparameters** are settings you control that determine how the actual parameters (weights $W$ and biases $B$) evolve during training. Unlike parameters, which are learned from data, hyperparameters are chosen by the practitioner before training begins.

Key hyperparameters in deep learning include:

- **Learning Rate** ($\alpha$): Controls the step size in gradient descent. A small learning rate leads to slow convergence, while a large learning rate can cause the cost function to diverge.
- **Number of Hidden Layers**: Determines the depth of the network architecture.
- **Number of Hidden Units**: Determines the width of each hidden layer.
- **Activation Function Selection**: The choice of activation function (ReLU, sigmoid, tanh, etc.) for hidden layers.

**Hyperparameter Tuning** is the empirical process of selecting optimal hyperparameter values. Unlike parameter learning, there is no closed-form formula to predict the best hyperparameter values in advance. Instead, practitioners must test different combinations and evaluate their impact on the cost function $J(W, B)$ and overall model performance.
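
Testing "different combinations" can be as simple as enumerating a small grid of candidates. The sketch below only builds the list of configurations; the candidate values and dictionary keys are illustrative assumptions, not recommendations from the lesson.

```python
from itertools import product

# Illustrative candidate values for three of the hyperparameters listed above
learning_rates = [0.001, 0.01, 0.1]
hidden_layer_counts = [1, 2]
hidden_unit_counts = [10, 20]

# Every combination of the three lists: 3 * 2 * 2 configurations
candidates = [
    {'learning_rate': lr, 'hidden_layers': n_layers, 'hidden_units': n_units}
    for lr, n_layers, n_units in product(
        learning_rates, hidden_layer_counts, hidden_unit_counts
    )
]

print(len(candidates))
print(candidates[0])
```

Each dictionary would then be passed to a training run, and the resulting cost curves compared.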

**Cost Function Convergence** refers to how the loss decreases during training. Observing convergence behavior helps identify whether hyperparameters are appropriate for the problem at hand.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**The Learning Rate as a Step Size**: Think of gradient descent as walking down a hill. The learning rate $\alpha$ determines how large a step you take each time. If $\alpha$ is too small, you move very slowly and may not reach the bottom in a reasonable time. If $\alpha$ is too large, you can overshoot the bottom and bounce around erratically, or even climb back up the hill.
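
The hill-walking picture can be made concrete on a one-dimensional toy cost. The sketch below assumes $J(w) = w^2$ (chosen only because its gradient, $2w$, is easy to verify) and a hypothetical helper `descend`; the three learning rates are illustrative.

```python
def descend(alpha, w0=1.0, steps=20):
    """Gradient descent on the toy cost J(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

# Distance from the minimum (w = 0) after 20 steps for three step sizes
for alpha in (0.01, 0.4, 1.1):
    print(f"alpha = {alpha}: |w| = {abs(descend(alpha)):.3g}")
```

Each step multiplies `w` by `1 - 2 * alpha`. With `alpha = 0.01` that factor is 0.98, so progress is slow; with `alpha = 0.4` it is 0.2, so `w` collapses toward zero quickly; with `alpha = 1.1` it is -1.2, so every step overshoots the minimum and lands farther away than before.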

**Hyperparameters Control Parameter Evolution**: Hyperparameters are the "knobs and dials" you adjust to control how the actual parameters $W$ and $B$ change during training. They don't directly determine the final model; instead, they determine the process by which the model learns.

**Empirical Testing is Necessary**: Deep learning does not have a universal formula for optimal hyperparameters. What works well for one problem may fail for another. This is why empirical testing—trying different values and observing results—is fundamental to the practice of deep learning.

**Hyperparameter Transfer is Limited**: Hyperparameters that work well in one domain or application may not transfer to another. A learning rate that converges quickly for image classification might be too aggressive for natural language processing. This domain transfer challenge means you often need to re-tune hyperparameters when moving to new problems.

**Hyperparameters Evolve Over Time**: Even on a single problem, the best hyperparameter values can drift as the infrastructure, the dataset, or the problem itself changes. What was optimal last month may not be optimal today.

**Present and explain the key equations used in the lesson**

## Key Equations

**Learning Rate**

$$\alpha \text{ (learning rate)}$$

The learning rate is a hyperparameter that scales the gradient step in the gradient descent update rule. It directly controls the magnitude of parameter updates at each iteration.
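
For reference, the standard gradient descent update that $\alpha$ scales can be written in this lesson's $W$, $B$ notation as follows (a generic statement of the rule, not a formula specific to any one network):

$$W := W - \alpha \frac{\partial J(W, B)}{\partial W}, \qquad B := B - \alpha \frac{\partial J(W, B)}{\partial B}$$

Doubling $\alpha$ doubles the size of each update for the same gradient, which is why it has such a direct effect on convergence speed and stability.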

**Cost Function**

$$J(W, B) \text{ (cost function)}$$

The cost function measures the error of the model given parameters $W$ and $B$. During training, we use gradient descent to minimize $J(W, B)$ by iteratively updating the parameters. The trajectory of $J(W, B)$ over iterations reveals whether the learning rate and other hyperparameters are appropriate: the cost should generally decrease smoothly toward a minimum.

**Implement code primitive: Implement a hyperparameter search loop that tries different learning rate values and observes cost function behavior**

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def simple_gradient_descent(X, y, learning_rate, iterations=100):
    m = X.shape[0]
    w = np.zeros(X.shape[1])
    b = 0
    cost_history = []
    
    for i in range(iterations):
        predictions = X @ w + b
        errors = predictions - y
        cost = np.mean(errors ** 2)
        cost_history.append(cost)
        
        dw = (2 / m) * (X.T @ errors)
        db = (2 / m) * np.sum(errors)
        
        w -= learning_rate * dw
        b -= learning_rate * db
    
    return w, b, cost_history

np.random.seed(42)
X = np.random.randn(100, 5)
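# Targets with a learnable linear signal plus noise (illustrative weights), so the cost has room to fall
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + 0.1 * np.random.randn(100)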

learning_rates = [0.001, 0.01, 0.1, 0.5]
results = {}

for lr in learning_rates:
    w, b, cost_history = simple_gradient_descent(X, y, lr, iterations=100)
    results[lr] = cost_history

for lr, costs in results.items():
    plt.plot(costs, label=f'α = {lr}')

plt.xlabel('Iteration')
plt.ylabel('Cost Function J(W, B)')
plt.legend()
plt.title('Cost Function Behavior for Different Learning Rates')
plt.show()

**Implement code primitive: Compare cost function trajectories for different learning rate settings to identify convergence speed and stability**

In [None]:
def analyze_convergence(results):
    convergence_metrics = {}
    
    for lr, cost_history in results.items():
        final_cost = cost_history[-1]
        initial_cost = cost_history[0]
        cost_reduction = initial_cost - final_cost
        
        iterations_to_threshold = None
        threshold = initial_cost * 0.1
        for i, cost in enumerate(cost_history):
            if cost < threshold:
                iterations_to_threshold = i
                break
        
        stability = np.std(np.diff(cost_history))
        
        convergence_metrics[lr] = {
            'final_cost': final_cost,
            'cost_reduction': cost_reduction,
            'iterations_to_threshold': iterations_to_threshold,
            'stability': stability
        }
    
    return convergence_metrics

metrics = analyze_convergence(results)

for lr, metric in metrics.items():
    print(f"Learning Rate α = {lr}:")
    print(f"  Final Cost: {metric['final_cost']:.4f}")
    print(f"  Cost Reduction: {metric['cost_reduction']:.4f}")
    print(f"  Iterations to 10% of Initial: {metric['iterations_to_threshold']}")
    print(f"  Stability (std of cost changes): {metric['stability']:.6f}")
    print()

**Implement code primitive: Systematically test different numbers of hidden layers and hidden units to evaluate their impact on model performance**

In [None]:
def simple_neural_network(X, y, hidden_layers, hidden_units, learning_rate=0.01, iterations=50):
    m = X.shape[0]
    input_size = X.shape[1]
    
    layers = [input_size] + [hidden_units] * hidden_layers + [1]
    weights = [np.random.randn(layers[i], layers[i+1]) * 0.01 for i in range(len(layers)-1)]
    biases = [np.zeros((1, layers[i+1])) for i in range(len(layers)-1)]
    
    cost_history = []
    
    for iteration in range(iterations):
        # Forward pass: ReLU on hidden layers, linear output for regression
        a = X
        activations = [a]
        
        for i, (w, b) in enumerate(zip(weights, biases)):
            z = a @ w + b
            a = z if i == len(weights) - 1 else np.maximum(0, z)
            activations.append(a)
        
        output = activations[-1]
        cost = np.mean((output - y.reshape(-1, 1)) ** 2)
        cost_history.append(cost)
        
        # Gradient of the mean squared error with respect to the linear output
        delta = (output - y.reshape(-1, 1)) * 2 / m
        
        for i in range(len(weights) - 1, -1, -1):
            dw = activations[i].T @ delta
            db = np.sum(delta, axis=0, keepdims=True)
            
            # Propagate delta using the weights from before this step's update
            if i > 0:
                delta = (delta @ weights[i].T) * (activations[i] > 0)
            
            weights[i] -= learning_rate * dw
            biases[i] -= learning_rate * db
    
    return cost_history

np.random.seed(42)
X = np.random.randn(100, 5)
y = np.random.randn(100)

architectures = [
    (1, 10),
    (2, 10),
    (2, 20),
    (3, 20),
]

architecture_results = {}

for hidden_layers, hidden_units in architectures:
    cost_history = simple_neural_network(X, y, hidden_layers, hidden_units, learning_rate=0.01, iterations=50)
    architecture_results[(hidden_layers, hidden_units)] = cost_history

for (layers, units), costs in architecture_results.items():
    plt.plot(costs, label=f'{layers} layers, {units} units')

plt.xlabel('Iteration')
plt.ylabel('Cost Function J(W, B)')
plt.legend()
plt.title('Cost Function for Different Network Architectures')
plt.show()

for (layers, units), costs in architecture_results.items():
    print(f"Architecture: {layers} hidden layers, {units} units per layer")
    print(f"  Final Cost: {costs[-1]:.4f}")
    print()

**Create a Mermaid diagram: Flowchart showing the iterative hyperparameter tuning cycle: propose values → implement → evaluate cost function → adjust → repeat**

## Hyperparameter Tuning Cycle

```mermaid
graph TD
    A["Propose Hyperparameter Values<br/>(α, layers, units, activation)"] --> B["Implement Model<br/>(Build network architecture)"]
    B --> C["Train Model<br/>(Run gradient descent)"]
    C --> D["Evaluate Cost Function<br/>(Observe J(W, B) trajectory)"]
    D --> E{"Convergence<br/>Satisfactory?"}
    E -->|No| F["Adjust Hyperparameters<br/>(Increase/decrease α, change architecture)"]
    F --> A
    E -->|Yes| G["Accept Hyperparameters<br/>(Deploy or test on validation set)"]
```

This cycle reflects the empirical nature of deep learning. Since there is no formula to predict optimal hyperparameters in advance, practitioners iterate through this loop, proposing values, implementing and training the model, evaluating the cost function behavior, and adjusting based on observations. The process repeats until satisfactory convergence is achieved.
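
The cycle can be sketched as a small loop over candidate learning rates. This is a minimal illustration under stated assumptions: a linear model on synthetic, noiseless data, a hypothetical helper `train_linear_model`, and selection by final training cost. A real project would usually compare candidates on a held-out validation set.

```python
import numpy as np

def train_linear_model(X, y, learning_rate, iterations=100):
    """Gradient descent on mean squared error; returns the cost at each iteration."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    cost_history = []
    for _ in range(iterations):
        errors = X @ w + b - y
        cost_history.append(np.mean(errors ** 2))
        w -= learning_rate * (2 / m) * (X.T @ errors)
        b -= learning_rate * (2 / m) * np.sum(errors)
    return cost_history

# Synthetic data with made-up true weights and bias
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.3

best_lr, best_cost = None, np.inf
for lr in [0.0001, 0.001, 0.01, 0.1]:                # propose values
    final_cost = train_linear_model(X, y, lr)[-1]    # implement, train, evaluate
    # Keep the best candidate so far; a diverged run (inf or nan) is skipped
    if np.isfinite(final_cost) and final_cost < best_cost:
        best_lr, best_cost = lr, final_cost

print("best learning rate:", best_lr, "final cost:", best_cost)
```

The `np.isfinite` check guards against runs whose cost overflowed, which is one way a too-large learning rate shows up in practice.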

## Lesson 8

**Explain the core concepts of the lesson**

## Core Concepts

This lesson explores the relationship between deep learning and the brain, examining both the useful analogies and their fundamental limitations.

**Neural Network Analogy**: Deep learning systems are often described as mimicking the brain's structure and function. However, this analogy is more metaphorical than literal.

**Biological Neuron vs. Logistic Regression Unit**: A biological neuron receives signals from other neurons, performs a thresholding computation, and fires an action potential. A logistic regression unit receives numerical inputs, applies a sigmoid activation function, and produces an output. While structurally similar in description, the biological reality is vastly more complex.

**Forward and Backward Propagation**: Deep learning networks learn through forward propagation (computing outputs from inputs) and backpropagation (computing gradients to update weights). This learning mechanism is mathematically elegant and computationally effective.

**Function Approximation**: Neural networks are fundamentally function approximators. They learn to map inputs to outputs through training, not through biological simulation. This perspective is more accurate and useful than thinking of them as brain models.

**Gradient Descent and Learning**: The learning algorithm in neural networks relies on gradient descent to minimize error. Whether the brain uses similar mechanisms remains an open question in neuroscience.

**Explain intuitions and mental models for the lesson**

## Intuitions and Mental Models

**Neural networks as function approximators**: Think of a neural network as a flexible mathematical tool that, given enough capacity and data, can learn to approximate a wide range of functions mapping inputs to outputs. This is more accurate than imagining it as a miniature brain.

**The seductive but oversimplified brain analogy**: The brain analogy is intuitive and historically useful for motivation, but it breaks down quickly under scrutiny. A single biological neuron is far more complex than a logistic regression unit—it has thousands of synapses, complex temporal dynamics, neuromodulators, and feedback mechanisms that we still don't fully understand.

**Complexity mismatch**: The human brain contains roughly 86 billion neurons with trillions of connections. Modern deep learning networks have millions to billions of parameters. The scale is different, but more importantly, the mechanisms are fundamentally different. Neuroscience still doesn't fully understand how individual neurons compute or how learning occurs at the biological level.

**The mystery of biological learning**: It remains unclear whether the brain uses backpropagation, gradient descent, or entirely different learning mechanisms. The brain's learning algorithm is one of neuroscience's greatest unsolved mysteries.

**Practical success independent of biology**: Deep learning excels at learning complex functions through forward and backward propagation, regardless of whether these mechanisms resemble the brain. The field's maturity has shifted focus from biological plausibility to mathematical effectiveness.

**Evolution of the analogy**: The brain analogy was once essential for intuition and motivation in the early days of neural networks. As the field has matured, this analogy has become less relevant and sometimes misleading.

**Create a Mermaid diagram: flowchart comparing a biological neuron (receives signals → thresholding computation → sends pulse along the axon) with a logistic regression unit (receives inputs → sigmoid activation → output), linked by a loose analogy**

## Biological Neuron vs. Logistic Regression Unit

```mermaid
graph TD
    A[Biological Neuron] -->|receives signals| B[X1, X2, X3]
    B -->|thresholding computation| C[Fire Decision]
    C -->|sends pulse| D[Axon to Other Neurons]
    E[Logistic Regression Unit] -->|receives inputs| F[A1, A2, A3]
    F -->|sigmoid activation| G[Output]
    A -.->|loose analogy| E
```

The diagram above illustrates the structural similarity and the loose analogy between biological neurons and logistic regression units. Both receive multiple inputs and produce an output based on a nonlinear computation. However, the biological neuron's actual mechanisms—involving ion channels, neurotransmitters, temporal integration, and feedback loops—are far more intricate than the simple sigmoid function used in a logistic regression unit.
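
To make the contrast concrete, the entire computation of a logistic regression unit fits in a few lines. The sketch below is illustrative: `logistic_unit` is a hypothetical helper name, and the inputs, weights, and bias are made-up values.

```python
import numpy as np

def logistic_unit(a, w, b):
    """One logistic regression unit: weighted sum of inputs, then sigmoid."""
    z = np.dot(w, a) + b
    return 1.0 / (1.0 + np.exp(-z))

a = np.array([0.5, -1.0, 2.0])   # inputs A1, A2, A3 (made-up values)
w = np.array([0.4, 0.3, -0.2])   # hypothetical weights
b = 0.1                          # hypothetical bias

output = logistic_unit(a, w, b)
print(output)
```

Here the weighted sum is $z = -0.4$, so the output is $\sigma(-0.4) \approx 0.40$. Nothing in this computation models ion channels, spike timing, or neurotransmitters, which is the gap the analogy glosses over.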