# Lab E.2: Graph Convolutional Networks from Scratch

**Module:** E - Graph Neural Networks  
**Time:** 2 hours  
**Difficulty:** ⭐⭐⭐ (Intermediate-Advanced)

---

## 🎯 Learning Objectives

By the end of this notebook, you will:
- [ ] Understand the message passing framework for GNNs
- [ ] Implement a Graph Convolutional Network layer from scratch
- [ ] Build and train a 2-layer GCN for node classification
- [ ] Achieve >80% accuracy on the Cora dataset
- [ ] Visualize learned node embeddings with t-SNE
- [ ] Compare scratch implementation with PyG's optimized version

---

## 📚 Prerequisites

- Completed: Lab E.1 (PyTorch Geometric Setup)
- Knowledge of: PyTorch nn.Module, matrix operations, basic neural networks

---

## 🌍 Real-World Context

**Graph Convolutional Networks power:**

- **Google Maps:** Predicting traffic and ETA using road networks
- **Drug Discovery:** Predicting if a molecule will be effective medicine
- **Fraud Detection:** Identifying suspicious transaction patterns
- **Recommendation Systems:** Pinterest's PinSage recommends billions of items

GCNs are the foundation of modern graph learning. Understanding them deeply will help you tackle any graph problem.

---

## 🧒 ELI5: What Are Graph Convolutions?

> **Imagine you're at a school party** where everyone wears a colored shirt (that's their "feature").
>
> To figure out what group someone belongs to, you could:
> 1. Look at their own shirt color (their features)
> 2. Look at the shirt colors of their friends (neighbor features)
> 3. Mix these together to get a better picture
>
> **A Graph Convolution does exactly this:**
> 1. Each person looks at their own features
> 2. Collects features from friends (neighbors)
> 3. Averages everything together
> 4. Transforms the result through a learnable filter
>
> **After doing this once**, you know about your direct friends.
> **After doing it twice**, you know about friends-of-friends.
> **After three times**, you have a good sense of your local "neighborhood."
>
> **In AI terms:** This is called *message passing*. Each node sends messages to neighbors, receives messages from neighbors, and updates itself. The "messages" are learned representations, and the GCN learns what information to extract and share.

---

## Part 1: Setup and Data Loading

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from torch_geometric.datasets import Planetoid
from torch_geometric.utils import add_self_loops, degree
import time

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

In [None]:
# Load Cora dataset
dataset = Planetoid(root='/tmp/Cora', name='Cora')
data = dataset[0].to(device)

print("=" * 50)
print("CORA DATASET")
print("=" * 50)
print(f"Nodes: {data.num_nodes}")
print(f"Edges: {data.num_edges}")
print(f"Features per node: {dataset.num_features}")
print(f"Classes: {dataset.num_classes}")
print(f"Training nodes: {data.train_mask.sum().item()}")
print(f"Validation nodes: {data.val_mask.sum().item()}")
print(f"Test nodes: {data.test_mask.sum().item()}")

---

## Part 2: The Message Passing Framework

### 2.1 Understanding Message Passing

Every Graph Neural Network follows the **Message Passing** paradigm:

```
For each node i:
    1. MESSAGE:   Compute messages from all neighbors j
    2. AGGREGATE: Combine messages (sum, mean, max, etc.)
    3. UPDATE:    Update node i's representation
```

Mathematically:

$$h_i^{(l+1)} = \text{UPDATE}\left(h_i^{(l)}, \text{AGGREGATE}\left(\{\text{MESSAGE}(h_i^{(l)}, h_j^{(l)}, e_{ij}) : j \in \mathcal{N}(i)\}\right)\right)$$

Where:
- $h_i^{(l)}$ = representation of node $i$ at layer $l$
- $\mathcal{N}(i)$ = neighbors of node $i$
- $e_{ij}$ = edge features (optional)
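
The three steps can be sketched without any libraries. This is a minimal illustration, not a reference implementation: the function name `message_passing_round`, the three aggregator helpers, and the 50/50 blend used for UPDATE are all assumptions made for this example. Each message is simply the neighbor's current feature vector, and edge features are ignored.

```python
from typing import Callable, Dict, List

Vector = List[float]


def agg_sum(messages: List[Vector]) -> Vector:
    return [sum(vals) for vals in zip(*messages)]


def agg_mean(messages: List[Vector]) -> Vector:
    return [sum(vals) / len(messages) for vals in zip(*messages)]


def agg_max(messages: List[Vector]) -> Vector:
    return [max(vals) for vals in zip(*messages)]


def message_passing_round(
    features: Dict[int, Vector],
    neighbors: Dict[int, List[int]],
    aggregate: Callable[[List[Vector]], Vector],
) -> Dict[int, Vector]:
    """One round of MESSAGE -> AGGREGATE -> UPDATE for every node."""
    new_features = {}
    for i, h_i in features.items():
        # MESSAGE: each neighbor j sends its current features h_j unchanged
        messages = [features[j] for j in neighbors[i]]
        # AGGREGATE: combine the messages with the chosen permutation-invariant function
        m_i = aggregate(messages)
        # UPDATE (illustrative choice): blend the node's own state with the aggregate
        new_features[i] = [0.5 * own + 0.5 * agg for own, agg in zip(h_i, m_i)]
    return new_features


# Square graph: 0 -- 1, 0 -- 2, 1 -- 3, 2 -- 3
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0], 3: [0.5, 0.5]}
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}

results = {}
for name, fn in [("sum", agg_sum), ("mean", agg_mean), ("max", agg_max)]:
    results[name] = message_passing_round(features, neighbors, fn)
    print(f"{name:>4}: node 0 -> {results[name][0]}")
```

Swapping the aggregator changes the result for the same graph, which is one of the main design choices that distinguishes GNN variants from each other.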

In [None]:
# Let's visualize message passing with a simple example

def visualize_message_passing():
    """
    Demonstrates message passing on a tiny graph.
    """
    print("🔄 MESSAGE PASSING EXAMPLE")
    print("=" * 50)
    
    # Tiny graph: 4 nodes, simple connections
    #   0 -- 1
    #   |    |
    #   2 -- 3
    
    # Node features (2D for simplicity)
    features = torch.tensor([
        [1.0, 0.0],  # Node 0
        [0.0, 1.0],  # Node 1
        [1.0, 1.0],  # Node 2
        [0.5, 0.5],  # Node 3
    ])
    
    # Edges (undirected)
    edges = [(0, 1), (1, 0), (0, 2), (2, 0), 
             (1, 3), (3, 1), (2, 3), (3, 2)]
    
    print("Initial node features:")
    for i, f in enumerate(features):
        print(f"  Node {i}: {f.tolist()}")
    
    print("\n📨 Round 1: Each node collects neighbor messages")
    
    # For each node, aggregate neighbor features (mean)
    neighbors = {
        0: [1, 2],
        1: [0, 3],
        2: [0, 3],
        3: [1, 2]
    }
    
    new_features = []
    for node in range(4):
        # Collect neighbor features
        neighbor_feats = torch.stack([features[n] for n in neighbors[node]])
        
        # Include self (self-loop)
        all_feats = torch.cat([features[node].unsqueeze(0), neighbor_feats])
        
        # Aggregate: mean
        aggregated = all_feats.mean(dim=0)
        new_features.append(aggregated)
        
        print(f"  Node {node}:")
        print(f"    Neighbors: {neighbors[node]}")
        print(f"    Neighbor features: {neighbor_feats.tolist()}")
        print(f"    After aggregation: {aggregated.tolist()}")
    
    print("\n✅ After 1 round, each node knows about its direct neighbors!")
    print("\n💡 With 2 rounds, nodes would know about 2-hop neighbors.")

visualize_message_passing()

### 🔍 What Just Happened?

We simulated one round of message passing:

1. Each node collected features from its neighbors
2. Combined them with its own feature (self-loop)
3. Computed the mean as the new representation

**Key insight:** After this process:
- Nodes 0 and 3 aggregate over the same neighbors (1 and 2), so their new representations differ only through their own features
- Nodes 1 and 2 aggregate over the same neighbors (0 and 3), so the same holds for them

This is how GNNs create **structurally-aware embeddings**!
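
To see why repeating the process widens what a node can "see", here is a small dependency-free sketch. It assumes a 4-node path graph and a single scalar feature that is nonzero only at node 0; the helper name `mean_round` is made up for this example.

```python
def mean_round(values, neighbors):
    """One round of mean aggregation over each node and its neighbors."""
    new_values = []
    for node, own in enumerate(values):
        group = [own] + [values[j] for j in neighbors[node]]
        new_values.append(sum(group) / len(group))
    return new_values


# Path graph 0 -- 1 -- 2 -- 3; only node 0 carries a signal at the start
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
values = [1.0, 0.0, 0.0, 0.0]

reached = []  # nodes holding a nonzero value after each round
for round_idx in range(1, 4):
    values = mean_round(values, neighbors)
    reached.append([node for node, v in enumerate(values) if v > 0])
    print(f"Round {round_idx}: signal has reached nodes {reached[-1]}")
```

The signal travels exactly one hop per round, which is why a 2-layer GCN can only use information from a node's 2-hop neighborhood.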

---

## Part 3: Implementing GCN from Scratch

### 3.1 The GCN Layer Formula

The Graph Convolutional Network (Kipf & Welling, 2017) uses this update rule:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\right)$$

Where:
- $\tilde{A} = A + I$ (adjacency matrix with self-loops)
- $\tilde{D}$ = degree matrix of $\tilde{A}$
- $H^{(l)}$ = node features at layer $l$
- $W^{(l)}$ = learnable weight matrix
- $\sigma$ = activation function (ReLU)
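
As a worked example of the formula, the sketch below builds every term as a dense matrix for a tiny graph. It assumes a 3-node path graph and identity weights (so only the propagation step is visible); dense matrices are used here for readability only and would not scale to large graphs.

```python
import torch

# Path graph 0 -- 1 -- 2 as a dense adjacency matrix
A = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])

A_tilde = A + torch.eye(3)                  # A~ = A + I (self-loops)
deg = A_tilde.sum(dim=1)                    # degrees including self-loops: 2, 3, 2
D_inv_sqrt = torch.diag(deg.pow(-0.5))      # D~^(-1/2)
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # normalized adjacency

H = torch.tensor([[1., 0.],
                  [0., 1.],
                  [1., 1.]])                # node features H^(l)
W = torch.eye(2)                            # identity weights, for illustration only

H_next = torch.relu(A_hat @ H @ W)          # H^(l+1)

print(A_hat)
print(H_next)
```

Each entry of `A_hat` equals $1/\sqrt{\tilde{d}_i \tilde{d}_j}$ for connected pairs (including $i = j$) and zero otherwise, so the matrix stays symmetric.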

### 🧒 ELI5: What's with all the D's and A's?

> **Imagine you're averaging your friends' opinions:**
>
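> - **Without normalization:** If you have 100 friends and I have 2, your total is far bigger than mine simply because you are summing 100 opinions instead of 2
> - **With normalization:** We scale each opinion down by how well-connected the people involved are, so popular people don't drown everyone else out
>
> The $\tilde{D}^{-1/2}$ terms do exactly this: a message from node $j$ to node $i$ is scaled by $1/\sqrt{\tilde{d}_i \tilde{d}_j}$ (the degrees counted with self-loops), so nodes with many neighbors don't dominate.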

In [None]:
class GCNLayerScratch(nn.Module):
    """
    Graph Convolutional Layer - implemented from scratch.
    
    This implements: H' = σ(D^(-1/2) Ã D^(-1/2) H W)
    
    Where:
    - Ã = A + I (adjacency with self-loops)
    - D = degree matrix
    - H = node features
    - W = learnable weights
    - σ = activation (ReLU, applied externally)
    
    Args:
        in_channels: Number of input features per node
        out_channels: Number of output features per node
    
    Example:
        >>> layer = GCNLayerScratch(1433, 64)
        >>> x = torch.randn(2708, 1433)
        >>> edge_index = ...  # [2, num_edges]
        >>> out = layer(x, edge_index)  # [2708, 64]
    """
    
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        
        # Learnable weight matrix (allocated uninitialized, filled in below)
        self.weight = nn.Parameter(torch.empty(in_channels, out_channels))
        
        # Initialize weights using Xavier/Glorot initialization
        nn.init.xavier_uniform_(self.weight)
    
    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.
        
        Args:
            x: Node features [num_nodes, in_channels]
            edge_index: Graph edges [2, num_edges]
            
        Returns:
            Updated node features [num_nodes, out_channels]
        """
        num_nodes = x.size(0)
        
        # Step 1: Add self-loops (A → A + I)
        edge_index_with_loops, _ = add_self_loops(edge_index, num_nodes=num_nodes)
        
        # Step 2: Compute degree normalization (D^(-1/2))
        row, col = edge_index_with_loops
        deg = degree(row, num_nodes=num_nodes, dtype=x.dtype)
        deg_inv_sqrt = deg.pow(-0.5)
        deg_inv_sqrt[deg_inv_sqrt == float('inf')] = 0
        
        # Normalization coefficient for each edge
        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
        
        # Step 3: Linear transformation (H @ W)
        x = x @ self.weight
        
        # Step 4: Message passing (aggregate normalized neighbor features)
        out = torch.zeros_like(x)
        
        # This is the "slow" way - we'll optimize later
        for i, (src, dst) in enumerate(edge_index_with_loops.t()):
            out[dst] += norm[i] * x[src]
        
        return out
    
    def __repr__(self):
        return f'GCNLayerScratch({self.in_channels}, {self.out_channels})'

In [None]:
# Test the layer
layer = GCNLayerScratch(dataset.num_features, 64).to(device)
print(f"Layer: {layer}")
print(f"Weight shape: {layer.weight.shape}")

# Forward pass
out = layer(data.x, data.edge_index)
print(f"\nInput shape: {data.x.shape}")
print(f"Output shape: {out.shape}")
print("\n✅ GCN layer working!")

### 3.2 Optimized GCN Layer with Scatter Operations

The for-loop above is slow. Let's use vectorized scatter operations:

In [None]:
class GCNLayer(nn.Module):
    """
    Optimized Graph Convolutional Layer using scatter operations.
    
    Same formula as GCNLayerScratch, but much faster!
    Uses scatter_add for efficient message aggregation.
    """
    
    def __init__(self, in_channels: int, out_channels: int, bias: bool = True):
        super().__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        
        # Learnable parameters (allocated uninitialized; reset_parameters() fills them)
        self.weight = nn.Parameter(torch.empty(in_channels, out_channels))
        if bias:
            self.bias = nn.Parameter(torch.empty(out_channels))
        else:
            self.register_parameter('bias', None)
        
        self.reset_parameters()
    
    def reset_parameters(self):
        """Initialize weights."""
        nn.init.xavier_uniform_(self.weight)
        if self.bias is not None:
            nn.init.zeros_(self.bias)
    
    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        """
        Optimized forward pass using scatter operations.
        """
        num_nodes = x.size(0)
        
        # Step 1: Add self-loops
        edge_index, _ = add_self_loops(edge_index, num_nodes=num_nodes)
        
        # Step 2: Compute normalization
        row, col = edge_index
        deg = degree(row, num_nodes=num_nodes, dtype=x.dtype)
        deg_inv_sqrt = deg.pow(-0.5)
        deg_inv_sqrt[deg_inv_sqrt == float('inf')] = 0
        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
        
        # Step 3: Linear transformation
        x = x @ self.weight
        
        # Step 4: Efficient message passing with scatter_add
        # For each edge (src → dst), add normalized src feature to dst
        out = torch.zeros_like(x)
        src_features = x[row] * norm.view(-1, 1)  # Normalized source features
        out.scatter_add_(0, col.view(-1, 1).expand_as(src_features), src_features)
        
        # Step 5: Add bias
        if self.bias is not None:
            out = out + self.bias
        
        return out
    
    def __repr__(self):
        return f'GCNLayer({self.in_channels}, {self.out_channels})'

In [None]:
# Compare speed: slow vs optimized
layer_slow = GCNLayerScratch(dataset.num_features, 64).to(device)
layer_fast = GCNLayer(dataset.num_features, 64).to(device)

# Warm-up
_ = layer_slow(data.x, data.edge_index)
_ = layer_fast(data.x, data.edge_index)
torch.cuda.synchronize() if torch.cuda.is_available() else None

# Time slow version
start = time.time()
for _ in range(10):
    _ = layer_slow(data.x, data.edge_index)
torch.cuda.synchronize() if torch.cuda.is_available() else None
slow_time = (time.time() - start) / 10

# Time fast version
start = time.time()
for _ in range(10):
    _ = layer_fast(data.x, data.edge_index)
torch.cuda.synchronize() if torch.cuda.is_available() else None
fast_time = (time.time() - start) / 10

print(f"Slow (loop) version: {slow_time*1000:.2f} ms per forward pass")
print(f"Fast (scatter) version: {fast_time*1000:.2f} ms per forward pass")
print(f"\nSpeedup: {slow_time/fast_time:.1f}x faster! 🚀")
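
A speedup only matters if the vectorized path computes the same thing. The sketch below is one way to check that, assuming a small hand-built undirected graph without existing self-loops. It uses `index_add_`, which performs the same row-wise accumulation as the `scatter_add_` call with an expanded index, and compares the result against the dense formula $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} X$. The weight matrix is left out because it is applied identically in both paths.

```python
import torch

torch.manual_seed(0)

# Undirected graph with both edge directions listed, as in PyG's edge_index
edge_index = torch.tensor([[0, 1, 1, 2, 2, 0, 2, 3],
                           [1, 0, 2, 1, 0, 2, 3, 2]])
num_nodes = 4
x = torch.randn(num_nodes, 3)

# Add self-loops by hand (assumes the graph has none yet)
loops = torch.arange(num_nodes).repeat(2, 1)
row, col = torch.cat([edge_index, loops], dim=1)

# Sparse path: per-edge coefficients, then accumulate into destination rows
deg = torch.zeros(num_nodes).index_add_(0, row, torch.ones(row.numel()))
deg_inv_sqrt = deg.pow(-0.5)
norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
out_sparse = torch.zeros_like(x).index_add_(0, col, x[row] * norm.unsqueeze(1))

# Dense path: build A~ explicitly and apply the matrix formula
A_tilde = torch.zeros(num_nodes, num_nodes)
A_tilde[col, row] = 1.0
D_inv_sqrt = torch.diag(A_tilde.sum(dim=1).pow(-0.5))
out_dense = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ x

print("Max abs difference:", (out_sparse - out_dense).abs().max().item())
```

The same comparison can be run against `GCNLayerScratch` and `GCNLayer` by copying one layer's weight into the other before calling both.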

---

## Part 4: Building the Complete GCN Model

Now let's stack multiple GCN layers to create a full model for node classification:

In [None]:
class GCN(nn.Module):
    """
    Two-layer Graph Convolutional Network for node classification.
    
    Architecture:
        Input → GCN Layer 1 → ReLU → Dropout → GCN Layer 2 → Output
    
    Args:
        num_features: Number of input features per node
        num_classes: Number of output classes
        hidden_dim: Hidden layer dimension (default: 64)
        dropout: Dropout probability (default: 0.5)
    """
    
    def __init__(self, num_features: int, num_classes: int, 
                 hidden_dim: int = 64, dropout: float = 0.5):
        super().__init__()
        
        self.conv1 = GCNLayer(num_features, hidden_dim)
        self.conv2 = GCNLayer(hidden_dim, num_classes)
        self.dropout = dropout
    
    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        """
        Forward pass.
        
        Args:
            x: Node features [num_nodes, num_features]
            edge_index: Graph edges [2, num_edges]
            
        Returns:
            Class logits [num_nodes, num_classes]
        """
        # Layer 1: GCN + ReLU + Dropout
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=self.dropout, training=self.training)
        
        # Layer 2: GCN (no activation - raw logits)
        x = self.conv2(x, edge_index)
        
        return x
    
    def get_embeddings(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        """
        Get intermediate node embeddings (after layer 1).
        Useful for visualization.
        """
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        return x

In [None]:
# Create model
model = GCN(
    num_features=dataset.num_features,
    num_classes=dataset.num_classes,
    hidden_dim=64,
    dropout=0.5
).to(device)

print("GCN Model Architecture:")
print("=" * 50)
print(model)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\nTotal parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
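
The parameter count can be checked by hand. This sketch assumes Cora's 1,433 input features and 7 classes together with the `hidden_dim=64` setting above; the helper `gcn_layer_params` is a made-up name for this example.

```python
def gcn_layer_params(in_channels: int, out_channels: int, bias: bool = True) -> int:
    """Weight matrix (in x out) plus an optional bias vector (out)."""
    return in_channels * out_channels + (out_channels if bias else 0)


num_features, hidden_dim, num_classes = 1433, 64, 7

layer1 = gcn_layer_params(num_features, hidden_dim)   # 1433 * 64 + 64
layer2 = gcn_layer_params(hidden_dim, num_classes)    # 64 * 7 + 7
total = layer1 + layer2

print(f"Layer 1: {layer1:,} | Layer 2: {layer2:,} | Total: {total:,}")
```

Almost all parameters sit in the first layer because the bag-of-words input is so wide, which is worth keeping in mind when varying `hidden_dim`.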

---

## Part 5: Training the GCN

### 5.1 Training Loop

In [None]:
def train(model, data, optimizer):
    """
    Train for one epoch.
    
    Returns:
        Training loss
    """
    model.train()
    optimizer.zero_grad()
    
    # Forward pass
    out = model(data.x, data.edge_index)
    
    # Compute loss ONLY on training nodes
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    
    # Backward pass
    loss.backward()
    optimizer.step()
    
    return loss.item()


@torch.no_grad()
def evaluate(model, data):
    """
    Evaluate on train/val/test sets.
    
    Returns:
        Tuple of (train_acc, val_acc, test_acc)
    """
    model.eval()
    
    # Forward pass
    out = model(data.x, data.edge_index)
    pred = out.argmax(dim=1)
    
    # Compute accuracy for each split
    accs = []
    for mask in [data.train_mask, data.val_mask, data.test_mask]:
        correct = pred[mask].eq(data.y[mask]).sum().item()
        acc = correct / mask.sum().item()
        accs.append(acc)
    
    return tuple(accs)

In [None]:
# Training configuration
model = GCN(
    num_features=dataset.num_features,
    num_classes=dataset.num_classes,
    hidden_dim=64,
    dropout=0.5
).to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

# Training history
history = {
    'loss': [],
    'train_acc': [],
    'val_acc': [],
    'test_acc': []
}

print("Training GCN on Cora...")
print("=" * 60)

best_val_acc = 0
best_epoch = 0
best_test_acc = 0  # defined up front so the summary below never hits a NameError

for epoch in range(200):
    loss = train(model, data, optimizer)
    train_acc, val_acc, test_acc = evaluate(model, data)
    
    # Save history
    history['loss'].append(loss)
    history['train_acc'].append(train_acc)
    history['val_acc'].append(val_acc)
    history['test_acc'].append(test_acc)
    
    # Track best model
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        best_epoch = epoch
        best_test_acc = test_acc
    
    # Print progress
    if epoch % 20 == 0 or epoch == 199:
        print(f"Epoch {epoch:03d} | Loss: {loss:.4f} | "
              f"Train: {train_acc:.4f} | Val: {val_acc:.4f} | Test: {test_acc:.4f}")

print("\n" + "=" * 60)
print(f"🎉 Best validation accuracy: {best_val_acc:.4f} at epoch {best_epoch}")
print(f"📊 Test accuracy at best val: {best_test_acc:.4f}")
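
The loop reports the test accuracy at the epoch with the best validation accuracy rather than the highest test accuracy seen. The sketch below isolates that selection rule. The helper name `select_by_validation` is hypothetical and the numbers are made up purely for illustration; they are not results from this lab.

```python
def select_by_validation(history):
    """Return (epoch, val_acc, test_acc) at the first epoch with the highest val accuracy."""
    val = history["val_acc"]
    best_epoch = max(range(len(val)), key=lambda epoch: val[epoch])
    return best_epoch, val[best_epoch], history["test_acc"][best_epoch]


# Illustrative values only
toy_history = {
    "val_acc":  [0.50, 0.74, 0.79, 0.78, 0.79],
    "test_acc": [0.48, 0.75, 0.80, 0.82, 0.81],
}

epoch, val_acc, test_acc = select_by_validation(toy_history)
print(f"Selected epoch {epoch}: val={val_acc:.2f}, test={test_acc:.2f}")
print(f"Highest test accuracy seen: {max(toy_history['test_acc']):.2f} (not what we report)")
```

Picking the epoch by peeking at test accuracy would leak test information into model selection, so the reported number can be lower than the best test value in the history.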

In [None]:
# Plot training curves
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss curve
axes[0].plot(history['loss'], color='steelblue', linewidth=2)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss')
axes[0].grid(True, alpha=0.3)

# Accuracy curves
axes[1].plot(history['train_acc'], label='Train', linewidth=2)
axes[1].plot(history['val_acc'], label='Validation', linewidth=2)
axes[1].plot(history['test_acc'], label='Test', linewidth=2)
axes[1].axhline(y=0.8, color='red', linestyle='--', label='80% target')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Accuracy Curves')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

if best_test_acc >= 0.8:
    print("\n🎉 Congratulations! You achieved >80% test accuracy!")
else:
    print(f"\n📈 Try tuning hyperparameters. Current: {best_test_acc:.1%}")

---

## Part 6: Visualizing Learned Embeddings

Let's see what the GCN learned by visualizing node embeddings with t-SNE:

### Understanding t-SNE for Visualization

**t-SNE (t-Distributed Stochastic Neighbor Embedding)** is a dimensionality reduction technique perfect for visualization:

- Reduces high-dimensional data (64D embeddings) to 2D for plotting
- Preserves local structure: similar points stay close together
- From **scikit-learn** (`sklearn`), a popular ML library

**Key Parameters:**
- `n_components`: Output dimensions (usually 2 for visualization)
- `perplexity`: Balance between local/global structure (typically 5-50)
- `random_state`: For reproducibility

In [None]:
# scikit-learn (sklearn) is the standard Python ML library
# TSNE is used for visualizing high-dimensional data in 2D
from sklearn.manifold import TSNE

# Get embeddings from trained model
model.eval()
with torch.no_grad():
    embeddings = model.get_embeddings(data.x, data.edge_index)
    embeddings = embeddings.cpu().numpy()  # TSNE needs NumPy arrays
    labels = data.y.cpu().numpy()

print(f"Embedding shape: {embeddings.shape}")
print("Running t-SNE (this may take a minute)...")

# Create TSNE object and fit-transform embeddings
# fit_transform(): fits the model and transforms data in one step
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
embeddings_2d = tsne.fit_transform(embeddings)

print("t-SNE complete!")

In [None]:
# Plot embeddings colored by class
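# NOTE: assumes labels 0-6 follow this alphabetical topic order;
# verify against the dataset's label mapping before trusting the legend
class_names = ['Case-Based', 'Genetic Alg.', 'Neural Nets', 'Probabilistic',
               'Reinforcement', 'Rule Learn.', 'Theory']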

plt.figure(figsize=(12, 10))

scatter = plt.scatter(
    embeddings_2d[:, 0], 
    embeddings_2d[:, 1], 
    c=labels, 
    cmap='tab10',
    alpha=0.7,
    s=20
)

# Create legend
handles, _ = scatter.legend_elements()
plt.legend(handles, class_names, loc='upper right', title='Paper Topic')

plt.title('GCN Node Embeddings (t-SNE Visualization)', fontsize=14)
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.tight_layout()
plt.show()

print("💡 Notice how papers of the same class cluster together!")
print("   The GCN learned to create separable representations.")

In [None]:
# Compare: Original features vs Learned embeddings
print("Comparing Original Features vs GCN Embeddings...")

# t-SNE on original features
original_features = data.x.cpu().numpy()
tsne_original = TSNE(n_components=2, random_state=42, perplexity=30)
original_2d = tsne_original.fit_transform(original_features)

# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Original features
axes[0].scatter(original_2d[:, 0], original_2d[:, 1], 
                c=labels, cmap='tab10', alpha=0.7, s=20)
axes[0].set_title('Original Bag-of-Words Features', fontsize=12)
axes[0].set_xlabel('t-SNE Dimension 1')
axes[0].set_ylabel('t-SNE Dimension 2')

# GCN embeddings
axes[1].scatter(embeddings_2d[:, 0], embeddings_2d[:, 1], 
                c=labels, cmap='tab10', alpha=0.7, s=20)
axes[1].set_title('GCN Learned Embeddings', fontsize=12)
axes[1].set_xlabel('t-SNE Dimension 1')
axes[1].set_ylabel('t-SNE Dimension 2')

plt.tight_layout()
plt.show()

print("\n🔍 The GCN embeddings show MUCH clearer class separation!")
print("   This is because GCN uses both features AND graph structure.")

---

## Part 7: Comparison with PyG's Optimized Implementation

Let's compare our implementation with PyTorch Geometric's optimized version:

In [None]:
from torch_geometric.nn import GCNConv

class PyGGCN(nn.Module):
    """
    GCN using PyTorch Geometric's optimized layers.
    """
    
    def __init__(self, num_features, num_classes, hidden_dim=64, dropout=0.5):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, num_classes)
        self.dropout = dropout
    
    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.conv2(x, edge_index)
        return x

# Create PyG model
pyg_model = PyGGCN(
    num_features=dataset.num_features,
    num_classes=dataset.num_classes,
    hidden_dim=64
).to(device)

print("PyG GCN Model:")
print(pyg_model)

In [None]:
# Compare speed
our_model = GCN(dataset.num_features, dataset.num_classes).to(device)

# Warm-up
for _ in range(10):
    _ = our_model(data.x, data.edge_index)
    _ = pyg_model(data.x, data.edge_index)
torch.cuda.synchronize() if torch.cuda.is_available() else None

# Time our implementation
n_runs = 100
start = time.time()
for _ in range(n_runs):
    _ = our_model(data.x, data.edge_index)
torch.cuda.synchronize() if torch.cuda.is_available() else None
our_time = (time.time() - start) / n_runs * 1000

# Time PyG implementation
start = time.time()
for _ in range(n_runs):
    _ = pyg_model(data.x, data.edge_index)
torch.cuda.synchronize() if torch.cuda.is_available() else None
pyg_time = (time.time() - start) / n_runs * 1000

print("Speed Comparison:")
print("=" * 40)
print(f"Our implementation: {our_time:.3f} ms per forward pass")
print(f"PyG implementation: {pyg_time:.3f} ms per forward pass")
print(f"\nSpeed ratio (ours / PyG): {our_time/pyg_time:.1f}x (above 1 means PyG is faster)")

In [None]:
# Train PyG model and compare accuracy
pyg_model = PyGGCN(dataset.num_features, dataset.num_classes).to(device)
optimizer = torch.optim.Adam(pyg_model.parameters(), lr=0.01, weight_decay=5e-4)

for epoch in range(200):
    pyg_model.train()
    optimizer.zero_grad()
    out = pyg_model(data.x, data.edge_index)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()

# Evaluate
pyg_model.eval()
with torch.no_grad():
    pred = pyg_model(data.x, data.edge_index).argmax(dim=1)
    pyg_test_acc = (pred[data.test_mask] == data.y[data.test_mask]).float().mean().item()

print(f"\nAccuracy Comparison:")
print("=" * 40)
print(f"Our implementation: {best_test_acc:.4f}")
print(f"PyG implementation: {pyg_test_acc:.4f}")
print("\n✅ If the two accuracies are close, our implementation behaves like PyG's!")

---

## ✋ Try It Yourself: Exercise 1

**Task:** Experiment with different hidden dimensions.

Train GCNs with hidden_dim = [16, 32, 64, 128, 256] and compare:
1. Training time
2. Final test accuracy
3. Number of parameters

Which hidden dimension works best?

In [None]:
# Your code here!
# Hint: Create a loop over hidden dimensions

hidden_dims = [16, 32, 64, 128, 256]
results = []

for hidden_dim in hidden_dims:
    # Create model
    # model = GCN(..., hidden_dim=hidden_dim, ...)
    
    # Train for 200 epochs
    # ...
    
    # Record: hidden_dim, num_params, test_acc, train_time
    pass

# Plot results
# ...

<details>
<summary>💡 Hint</summary>

```python
for hidden_dim in hidden_dims:
    model = GCN(dataset.num_features, dataset.num_classes, 
                hidden_dim=hidden_dim).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
    
    start = time.time()
    for epoch in range(200):
        train(model, data, optimizer)
    train_time = time.time() - start
    
    _, _, test_acc = evaluate(model, data)
    n_params = sum(p.numel() for p in model.parameters())
    
    results.append((hidden_dim, n_params, test_acc, train_time))
    print(f"hidden_dim={hidden_dim}: {test_acc:.4f} acc, {n_params} params")
```
</details>

---

## ✋ Try It Yourself: Exercise 2

**Task:** Try a 3-layer GCN.

Deeper GCNs can sometimes perform worse due to "over-smoothing" - all nodes become too similar!

1. Modify the GCN class to have 3 layers
2. Train and compare accuracy
3. Do you see over-smoothing?

**Hint:** Look at the t-SNE visualization - do classes still separate?

In [None]:
# Your code here!

class GCN3Layer(nn.Module):
    """Three-layer GCN."""
    
    def __init__(self, num_features, num_classes, hidden_dim=64):
        super().__init__()
        # TODO: Add conv1, conv2, conv3
        pass
    
    def forward(self, x, edge_index):
        # TODO: Implement forward pass
        pass

---

## ⚠️ Common Mistakes

### Mistake 1: Training on all nodes instead of just training mask
```python
# ❌ Wrong: Using ALL labels for loss
loss = F.cross_entropy(out, data.y)

# ✅ Right: Only use training nodes
loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
```
**Why:** Test nodes should never influence training - that's data leakage!

### Mistake 2: Forgetting to add self-loops
```python
# ❌ Wrong: No self-loops
# Each node can't see its own features!

# ✅ Right: Add self-loops before message passing
edge_index, _ = add_self_loops(edge_index, num_nodes=num_nodes)
```
**Why:** Without self-loops, a node only sees neighbors, not itself.

### Mistake 3: Wrong normalization
```python
# ❌ Wrong: No normalization (high-degree nodes dominate)
out[dst] += x[src]

# ✅ Right: Symmetric normalization
norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]
out[dst] += norm[i] * x[src]  # norm[i] is the coefficient of edge i = (src, dst)
```
**Why:** High-degree nodes would have much larger representations otherwise.
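
The effect can be measured on a toy graph. The sketch below assumes a star graph (one hub, eight leaves, self-loops included) where every node has the same scalar feature, and compares three per-edge coefficient schemes. The `schemes` dictionary and the `aggregate` helper are illustrative names, not library functions.

```python
import math

num_leaves = 8

# Star graph: node 0 is the hub, nodes 1..8 are leaves; each list includes the node itself
neighbors = {0: list(range(num_leaves + 1))}
for leaf in range(1, num_leaves + 1):
    neighbors[leaf] = [leaf, 0]

x = {node: 1.0 for node in neighbors}                      # identical features everywhere
deg = {node: len(nbrs) for node, nbrs in neighbors.items()}  # degree incl. self-loop


def aggregate(node, coeff):
    """Weighted sum of the features of `node` and its neighbors."""
    return sum(coeff(node, j) * x[j] for j in neighbors[node])


schemes = {
    "sum": lambda i, j: 1.0,
    "mean": lambda i, j: 1.0 / deg[i],
    "symmetric": lambda i, j: 1.0 / math.sqrt(deg[i] * deg[j]),
}

ratios = {}
for name, coeff in schemes.items():
    hub, leaf = aggregate(0, coeff), aggregate(1, coeff)
    ratios[name] = hub / leaf
    print(f"{name:>9}: hub={hub:.3f}  leaf={leaf:.3f}  hub/leaf={ratios[name]:.2f}")
```

In this toy setting the symmetric scheme narrows the hub-to-leaf gap compared with a plain sum, though it does not remove it entirely the way a plain mean does.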

### Mistake 4: Too many layers (over-smoothing)
```python
# ❌ Risky: Many GCN layers
# Node representations drift toward each other with every extra layer
for i in range(10):
    x = gcn_layer(x, edge_index)

# ✅ Safe: 2-3 layers for most tasks
x = self.conv1(x, edge_index)
x = F.relu(x)
x = self.conv2(x, edge_index)
```
**Why:** Each GCN layer averages neighbors - too many layers makes everything similar.
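
The averaging effect can be observed in isolation. The sketch below is a simplification, assuming the 4-node square graph from Part 2 and pure mean aggregation with self-loops; it has no learned weights or nonlinearities, so it shows the tendency rather than the behavior of a trained model. The helper names `mean_round` and `spread` are made up for this example.

```python
def mean_round(features, neighbors):
    """Replace each node's features with the mean over itself and its neighbors."""
    new_features = []
    for node, own in enumerate(features):
        group = [own] + [features[j] for j in neighbors[node]]
        new_features.append([sum(vals) / len(group) for vals in zip(*group)])
    return new_features


def spread(features):
    """Largest per-dimension gap between any two nodes."""
    return max(max(column) - min(column) for column in zip(*features))


# Square graph: 0 -- 1, 0 -- 2, 1 -- 3, 2 -- 3
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]

spreads = [spread(features)]
for _ in range(10):
    features = mean_round(features, neighbors)
    spreads.append(spread(features))

for layer, value in enumerate(spreads):
    print(f"After {layer:2d} rounds: spread = {value:.6f}")
```

The gap between the most different nodes shrinks with every round, which is the mechanism behind over-smoothing in deep stacks.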

---

## 🎉 Checkpoint

You've learned:
- ✅ The message passing framework (Message → Aggregate → Update)
- ✅ How to implement GCN layers from scratch
- ✅ The importance of normalization and self-loops
- ✅ How to train GCNs for node classification
- ✅ Visualizing learned embeddings with t-SNE
- ✅ The over-smoothing problem with deep GCNs

---

## 🚀 Challenge (Optional)

**Advanced Challenge:** Implement residual connections to fight over-smoothing.

Add skip connections like in ResNet:
```
h' = h + GCN(h)  # Instead of h' = GCN(h)
```

Does this allow deeper GCNs to work better?

In [None]:
# Advanced Challenge: ResGCN

class ResGCN(nn.Module):
    """GCN with residual connections."""
    
    def __init__(self, num_features, num_classes, hidden_dim=64, num_layers=4):
        super().__init__()
        # Your code here!
        pass
    
    def forward(self, x, edge_index):
        # Your code here!
        # Remember: h' = h + GCN(h)
        pass

---

## 📖 Further Reading

- [GCN Paper](https://arxiv.org/abs/1609.02907) - Original 2017 paper by Kipf & Welling
- [Understanding Convolutions on Graphs](https://distill.pub/2021/understanding-gnns/) - Interactive visualization
- [PyG Documentation](https://pytorch-geometric.readthedocs.io/) - Official docs
- [Over-smoothing in GNNs](https://arxiv.org/abs/2006.13318) - Analysis of the problem

---

## 🧹 Cleanup

In [None]:
# Clear GPU memory
import gc

del model, pyg_model, our_model
del embeddings, embeddings_2d, original_2d

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print(f"GPU memory after cleanup: {torch.cuda.memory_allocated() / 1e6:.1f} MB")

print("✅ Cleanup complete!")

---

## ⏭️ Next Steps

GCNs weight every neighbor with a fixed coefficient that depends only on node degrees. But shouldn't some neighbors be more important than others?

**In Lab E.3: Graph Attention Networks**, you'll:
- Learn about attention mechanisms for graphs
- Implement GAT layers that learn neighbor importance
- Visualize which neighbors get the most attention
- Achieve even better accuracy than GCN!

Let's add attention to our graphs! 👀