# Graph Convolutional Network (GCN) Pipeline for ENZYMES Dataset

**Author:** Shuai Huang

## Introduction

This assignment focuses on implementing Graph Convolutional Networks (GCNs) for graph classification using the ENZYMES dataset from PyTorch Geometric. You will explore graph structures, preprocess the data, implement GCN models (both manually and using PyG), and evaluate their performance.

Please fill in the missing code between the designated markers:

```Python
### Your code starts
```
and
```Python
### Your code ends
```


## Dataset Description

### ENZYMES Dataset Overview

The **ENZYMES dataset** is a collection of **600 graphs**, where each graph represents a **protein**.  
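The **nodes** in each graph correspond to **secondary structure elements** (helices, sheets, and turns), and **edges** connect elements that are **neighbors along the amino acid sequence or in 3D space**.  
The dataset is used for **graph classification**, where the goal is to predict the **enzyme class** of each protein.  
There are **six enzyme classes** in total.

#### **Graph Structure**
Each graph has:

- **Nodes**: Secondary structure elements (**average ~32 nodes per graph**)
- **Edges**: Sequence or spatial adjacency (**~62 undirected edges per graph**; PyG stores each undirected edge in both directions, so `num_edges` averages ~124)
- **Node Features**: **3-dimensional one-hot node labels** with the default `TUDataset` settings used below; **21-dimensional feature vectors per node** (18 continuous attributes plus the 3 labels) when loaded with `use_node_attr=True`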
- **Graph Labels**: One of **six enzyme classes**
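
PyG stores a graph's connectivity as an `edge_index` array of shape 2 x E in which every undirected edge appears once per direction. The following is a minimal plain-Python sketch of that layout; the toy edge list and the helper name `to_edge_index` are illustrative assumptions, not part of the dataset or of PyG.

```python
def to_edge_index(undirected_edges):
    """Return [sources, targets] with every undirected edge listed in both directions."""
    sources, targets = [], []
    for u, v in undirected_edges:
        sources += [u, v]
        targets += [v, u]
    return [sources, targets]

# Hypothetical 4-node path graph: 0 - 1 - 2 - 3
toy_edges = [(0, 1), (1, 2), (2, 3)]
edge_index = to_edge_index(toy_edges)

# Number of directed edges, which is what a per-graph edge count reports in this layout
num_directed_edges = len(edge_index[0])
print(edge_index)
print(num_directed_edges)
```

Under this convention, 3 undirected edges are counted as 6 directed edges, which is worth keeping in mind when comparing edge statistics across sources.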

#### **Use Cases**
This dataset is widely used in **biochemical function prediction** and is suitable for training **Graph Neural Networks (GNNs)** such as **Graph Convolutional Networks (GCN)**.


## 1. Device Selection

In [None]:
import torch

# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 2. Package Installation

To use PyTorch Geometric (PyG), we need to install PyTorch, torch_geometric, and the required dependencies.

In [None]:
# Install PyTorch (if not installed)
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Install PyTorch Geometric dependencies
!pip install torch-geometric torch-scatter torch-sparse torch-cluster torch-spline-conv -f https://data.pyg.org/whl/torch-2.0.0+cu118.html

## 3. Prepare Dataset

We will download the ENZYMES dataset using TUDataset from PyTorch Geometric and prepare it for training.

In [None]:
from torch_geometric.datasets import TUDataset

# Load the ENZYMES dataset
dataset = TUDataset(root='data/ENZYMES', name='ENZYMES')

# Shuffle the graphs (batches are moved to the selected device later, during training)
dataset = dataset.shuffle()

print(f"Dataset loaded: {dataset}")
print(f"Number of graphs: {len(dataset)}")
print(f"Number of classes: {dataset.num_classes}")
print(f"Node feature dimension: {dataset.num_node_features}")

## 4. Explore the Data

### Instruction:
Before training, we should explore the dataset to understand its structure:

- **Total number of graphs**
- **Number of nodes per graph**
- **Number of edges per graph**
- **Node features and labels**

In [None]:
import numpy as np

# Get basic dataset statistics
### Your code starts
num_graphs = 
num_classes = 
num_node_features = 
### Your code ends

# Collect statistics about node and edge counts
num_nodes_list = []
num_edges_list = []

for graph in dataset:
    num_nodes_list.append(graph.num_nodes)
    num_edges_list.append(graph.num_edges)

print(f"Total Graphs: {num_graphs}")
print(f"Number of Classes: {num_classes}")
print(f"Node Feature Dimension: {num_node_features}")

print(f"Avg. Nodes per Graph: {np.mean(num_nodes_list):.2f}")
print(f"Avg. Edges per Graph: {np.mean(num_edges_list):.2f}")

# Inspect the first graph
first_graph = dataset[0]
print(f"First Graph Details:\n {first_graph}")


## 5. Preprocessing (Normalization, Splitting into Train/Validation/Test)

### Instruction:
- Shuffle the dataset before splitting.
- Normalize node features to ensure better convergence.
- Split the dataset into train (80%), validation (10%), and test (10%) sets.
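
The steps above can be sketched without any deep learning library. The example below is only an illustration in plain Python: `normalize_rows` and `split_indices` are hypothetical helper names, the seed is arbitrary, and the row scaling shows the general idea of row normalization rather than the exact behavior of PyG's `NormalizeFeatures`.

```python
import random

def normalize_rows(features):
    """Scale each node's feature vector so its entries sum to 1 (all-zero rows stay unchanged)."""
    normalized = []
    for row in features:
        total = sum(row)
        normalized.append([value / total for value in row] if total > 0 else list(row))
    return normalized

def split_indices(num_items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle indices 0..num_items-1 and cut them into train/validation/test lists."""
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)  # shuffle before splitting
    n_train = round(train_frac * num_items)
    n_val = round(val_frac * num_items)
    return indices[:n_train], indices[n_train:n_train + n_val], indices[n_train + n_val:]

# Hypothetical node features for a 3-node graph
toy_features = [[1.0, 3.0], [0.0, 0.0], [2.0, 2.0]]
normalized = normalize_rows(toy_features)
print(normalized)

# 600 graphs split 80/10/10
train_idx, val_idx, test_idx = split_indices(600)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting index lists rather than the graphs themselves keeps the three subsets disjoint by construction, and fixing the seed makes the split reproducible.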

In [None]:
import torch
from torch_geometric.transforms import NormalizeFeatures
from torch.utils.data import random_split

# Reload the dataset with feature normalization, then shuffle and split it (80/10/10)
### Your code starts
dataset = TUDataset(root="data/ENZYMES", name="ENZYMES", transform=NormalizeFeatures())


### Your code ends

print(f"Train size: {len(train_dataset)}, Validation size: {len(val_dataset)}, Test size: {len(test_dataset)}")


## 6. GCN Model Definition

### Instruction:
We will implement two GCN models:

1. GCN by hand (using basic matrix operations)
2. GCN using PyG (utilizing torch_geometric.nn.GCNConv)

### 6.1 Implementing GCN by Hand

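We will manually implement the **graph convolution operation** **without** using PyG's convolution layers such as `GCNConv`. Only the helpers `add_self_loops` and `degree` from `torch_geometric.utils` and the pooling function `global_mean_pool` are used.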
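
For intuition, a GCN layer aggregates features with the symmetrically normalized adjacency matrix, `D^-1/2 (A + I) D^-1/2`, where `A + I` is the adjacency matrix with self-loops and `D` is its degree matrix. The sketch below works through that aggregation on a toy graph in plain Python. It is not the assignment solution: the learnable weight matrix and the nonlinearity are left out, and the function name `gcn_propagate` and the toy graph are assumptions made for illustration.

```python
import math

def gcn_propagate(features, edge_index, num_nodes):
    """Apply D^-1/2 (A + I) D^-1/2 to node features given as a list of rows."""
    sources, targets = edge_index
    # Add a self-loop (i, i) for every node
    sources = list(sources) + list(range(num_nodes))
    targets = list(targets) + list(range(num_nodes))

    # Degree of each node, counted over incoming edges (self-loop included)
    degree = [0] * num_nodes
    for t in targets:
        degree[t] += 1

    # Weight each message by 1 / sqrt(deg(src) * deg(dst)) and sum it at the target node
    dim = len(features[0])
    out = [[0.0] * dim for _ in range(num_nodes)]
    for s, t in zip(sources, targets):
        weight = 1.0 / math.sqrt(degree[s] * degree[t])
        for k in range(dim):
            out[t][k] += weight * features[s][k]
    return out

# Hypothetical 3-node path graph 0 - 1 - 2, with both edge directions listed
edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]
x = [[1.0], [2.0], [3.0]]
h = gcn_propagate(x, edge_index, num_nodes=3)
print(h)
```

Node 1 has degree 3 after self-loops, so its own feature contributes with weight 1/3, while each neighbor of degree 2 contributes with weight 1/sqrt(6).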

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.utils import add_self_loops, degree
from torch_geometric.nn import global_mean_pool  # For graph-level pooling

class GCNHandmade(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super(GCNHandmade, self).__init__()
        ### Your code starts
        self.fc1 = 
        self.fc2 = 
        self.classifier =   # Final graph-level classification layer
        ### Your code ends

    def forward(self, x, edge_index, batch):
        ### Your code starts

        # Step 1: Add self-loops
        

        # Step 2: Compute degree matrix
        

        # Step 3: Normalize adjacency matrix
        

        # Step 4: Message Passing
        

        # Step 5: Graph-level pooling (aggregate node representations into a single vector per graph)
        

        # Step 6: Final classification (store the class scores in `out`, which is returned below)
        
        
        ### Your code ends

        return F.log_softmax(out, dim=1)  # Graph-level classification output


### 6.2 Implementing GCN using PyG

Now, we define a GCN model using PyG's built-in layers.
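
Both models reduce the node embeddings to one vector per graph before classification. `global_mean_pool` does this by averaging the embeddings of all nodes that share a graph id in the `batch` vector. A plain-Python sketch of that reduction follows; the function name `mean_pool` and the toy values are illustrative assumptions.

```python
def mean_pool(node_embeddings, batch):
    """Average node embeddings per graph; batch[i] is the graph id of node i."""
    num_graphs = max(batch) + 1
    dim = len(node_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_graphs)]
    counts = [0] * num_graphs
    for row, graph_id in zip(node_embeddings, batch):
        counts[graph_id] += 1
        for k in range(dim):
            sums[graph_id][k] += row[k]
    return [[value / counts[g] for value in sums[g]] for g in range(num_graphs)]

# Hypothetical batch: nodes 0-1 belong to graph 0, nodes 2-4 belong to graph 1
embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [3.0, 6.0], [6.0, 3.0]]
batch = [0, 0, 1, 1, 1]
pooled = mean_pool(embeddings, batch)
print(pooled)
```

The result has one row per graph regardless of how many nodes each graph contains, which is what lets a fixed-size linear classifier follow the pooling step.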

In [None]:
from torch_geometric.nn import GCNConv, global_mean_pool

class GCNPyG(nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super(GCNPyG, self).__init__()
        ### Your code starts
        self.conv1 = 
        self.conv2 = 
        self.classifier =   # Graph-level classifier
        ### Your code ends

    def forward(self, x, edge_index, batch):
        ### Your code starts
        
        # Step 1: apply GCN layers with activation functions
        

        # Step 2: Graph-level pooling (aggregate node embeddings per graph into `x_graph`, used below)
        
        
        ### Your code ends
        
        # Final graph-level classification
        out = self.classifier(x_graph)
        return F.log_softmax(out, dim=1)



## 7. Training the Model

We will train both models using the same pipeline and store hyperparameters in a dictionary for easy tuning.
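
The `batch_size` hyperparameter only has an effect when several graphs are grouped into one mini-batch. PyG's data loader does this by merging the graphs into a single large disconnected graph: node features are concatenated, edge indices are shifted by the number of preceding nodes, and a `batch` vector records which graph each node came from. The sketch below mimics that collation in plain Python; the dictionary layout and the name `collate_graphs` are assumptions for illustration, not PyG's API.

```python
def collate_graphs(graphs):
    """Merge graphs, given as dicts with 'x' and 'edge_index', into one disconnected graph."""
    x, sources, targets, batch = [], [], [], []
    offset = 0
    for graph_id, graph in enumerate(graphs):
        num_nodes = len(graph["x"])
        x.extend(graph["x"])
        # Shift node ids so they stay unique inside the merged graph
        sources.extend(s + offset for s in graph["edge_index"][0])
        targets.extend(t + offset for t in graph["edge_index"][1])
        # Remember which graph every node belongs to
        batch.extend([graph_id] * num_nodes)
        offset += num_nodes
    return {"x": x, "edge_index": [sources, targets], "batch": batch}

# Two hypothetical graphs: an edge 0 - 1 and a path 0 - 1 - 2
g0 = {"x": [[1.0], [2.0]], "edge_index": [[0, 1], [1, 0]]}
g1 = {"x": [[3.0], [4.0], [5.0]], "edge_index": [[0, 1, 1, 2], [1, 0, 2, 1]]}
merged = collate_graphs([g0, g1])
print(merged["edge_index"])
print(merged["batch"])
```

Because no edges connect nodes of different graphs, message passing on the merged graph gives the same node embeddings as processing each graph separately, and the `batch` vector lets the pooling step separate them again.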

In [None]:
# Define hyperparameters
hyperparams = {
    "hidden_channels": 64,
    "learning_rate": 0.01,
    "epochs": 50,
    "batch_size": 32
}

# Select the device again so this cell can run on its own
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

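from torch_geometric.loader import DataLoader

def train(model, train_dataset, val_dataset):
    # val_dataset is accepted for hyperparameter tuning but is not used in this loop
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=hyperparams["learning_rate"])
    # The models already return log-probabilities (log_softmax), so NLLLoss is the matching loss
    loss_fn = nn.NLLLoss()
    # Mini-batches supply the batch vector that maps each node to its graph
    train_loader = DataLoader(train_dataset, batch_size=hyperparams["batch_size"], shuffle=True)

    for epoch in range(hyperparams["epochs"]):
        model.train()
        total_loss = 0

        for batch in train_loader:
            batch = batch.to(device)
            optimizer.zero_grad()
            out = model(batch.x, batch.edge_index, batch.batch)  # Pass batch indices
            loss = loss_fn(out, batch.y)  # Graph-level loss
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        # Average loss per mini-batch
        print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader):.4f}")

    print("Training Complete!")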


## 8. Evaluating the Model

We will evaluate both models and plot accuracy results.

In [None]:
import matplotlib.pyplot as plt

def evaluate(model, test_dataset):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for graph in test_dataset:
            graph = graph.to(device)

            out = model(graph.x, graph.edge_index, graph.batch)  # Pass batch index
            pred = out.argmax(dim=1)
            correct += (pred == graph.y).sum().item()
            total += graph.y.size(0)

    accuracy = correct / total
    print(f"Test Accuracy: {accuracy * 100:.2f}%")
    return accuracy

# Train and evaluate both models
gcn_handmade = GCNHandmade(dataset.num_node_features, hyperparams["hidden_channels"], dataset.num_classes)
gcn_pyg = GCNPyG(dataset.num_node_features, hyperparams["hidden_channels"], dataset.num_classes)

print("\nTraining GCN (Handmade)...")
train(gcn_handmade, train_dataset, val_dataset)

print("\nTraining GCN (PyG)...")
train(gcn_pyg, train_dataset, val_dataset)

# Evaluate both models
acc_handmade = evaluate(gcn_handmade, test_dataset)
acc_pyg = evaluate(gcn_pyg, test_dataset)

# Plot results
models = ["GCN Handmade", "GCN PyG"]
accuracies = [acc_handmade, acc_pyg]

plt.bar(models, accuracies, color=['blue', 'green'])
plt.xlabel("Model")
plt.ylabel("Accuracy")
plt.title("Comparison of GCN Implementations")
plt.ylim([0, 1])
plt.show()


## 9. Making Predictions

We will make predictions on test samples and visualize them.
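
A confusion matrix tabulates how often each true class is predicted as each class, with true labels as rows and predictions as columns. The plain-Python sketch below shows the counting on made-up labels; the helper names `confusion_counts` and `per_class_recall` and the label values are hypothetical and are not results from the models in this notebook.

```python
def confusion_counts(true_labels, predicted_labels, num_classes):
    """Return a num_classes x num_classes table: rows are true labels, columns are predictions."""
    table = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        table[t][p] += 1
    return table

def per_class_recall(table):
    """Fraction of each true class predicted correctly (None when a class has no samples)."""
    recalls = []
    for c, row in enumerate(table):
        total = sum(row)
        recalls.append(row[c] / total if total else None)
    return recalls

# Made-up labels for a 3-class problem
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
toy_cm = confusion_counts(y_true, y_pred, num_classes=3)
toy_recall = per_class_recall(toy_cm)
print(toy_cm)
print(toy_recall)
```

The diagonal holds the correct predictions, so off-diagonal entries show which classes are confused with each other, something a single accuracy number hides.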

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score, roc_curve
import seaborn as sns

def make_predictions_and_plot(model, test_dataset):
    model.eval()
    all_preds = []
    all_labels = []
    all_probs = []

    with torch.no_grad():
        for graph in test_dataset:
            graph = graph.to(device)
            out = model(graph.x, graph.edge_index, graph.batch)  # Pass batch index
            prob = torch.exp(out)  # Convert log-softmax to probabilities
            pred = prob.argmax(dim=1)

            all_preds.append(pred.item())
            all_labels.append(graph.y.item())
            all_probs.append(prob[:, 1].item())  # Probability of class 1 (used only by the binary ROC branch below)

    # Convert to numpy arrays
    all_preds = np.array(all_preds)
    all_labels = np.array(all_labels)
    all_probs = np.array(all_probs)

    # Confusion matrix plot
    ### Your code starts
    cm = confusion_matrix(all_labels, all_preds)
    plt.figure(figsize=(6,5))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=range(6), yticklabels=range(6))
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.title("Confusion Matrix")
    plt.show()

    # Classification report
    print("Classification Report:\n", classification_report(all_labels, all_preds, digits=4))

    # ROC curve and AUC score (binary labels only; with six ENZYMES classes this branch normally does not run)
    if len(set(all_labels)) == 2:  # Only compute if binary classification
        auc_score = roc_auc_score(all_labels, all_probs)
        fpr, tpr, _ = roc_curve(all_labels, all_probs)

        plt.figure(figsize=(6,5))
        plt.plot(fpr, tpr, label=f"AUC = {auc_score:.4f}")
        plt.plot([0,1], [0,1], linestyle='--', color='gray')
        plt.xlabel("False Positive Rate")
        plt.ylabel("True Positive Rate")
        plt.title("ROC Curve")
        plt.legend()
        plt.show()

        print(f"AUC Score: {auc_score:.4f}")

    ### Your code ends

    return all_preds, all_labels


In [None]:
make_predictions_and_plot(gcn_pyg, test_dataset)

# Key findings and observations?