### Practice with GNNs

The purpose of this homework is for you to build intuition about graph neural networks. You will explore three different neural network learning approaches:

- A graph autoencoder that uses one-hot encodings for the node feature vector
- A graph autoencoder with a graph convolutional encoder that uses the bag-of-words node feature vector from the dataset
- A graph convolutional neural network trained with semi-supervised learning on a few labeled nodes

---

#### Imports and Utilities

**Imports:** Import all libraries that will be used in this notebook. Sort them into categories so that it's easy to see how they might be used.

In [None]:
# GNN libraries
import torch
from torch_geometric.data import Data
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GAE

# Numpy and random
import random
import numpy as np

# Matplot lib utilities
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.axes import Axes


# Network x
import networkx as nx

# Useful utilities for understanding results
from sklearn.manifold import TSNE
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Data  and type management
from numpy.typing import NDArray
from typing import Hashable
from collections import defaultdict
import pandas as pd


**Drawing Utilities:** Functions that we'll use to show results. We'll look at networks with labeled nodes so we'll need utilities for plotting networks with labeled nodes and with colored nodes. It will also be helpful to inspect clusters of the embeddings using a scatterplot.

In [None]:
def plot_scatter(embeddings, labels, title="t-SNE of Node Embeddings", cmap=plt.cm.tab10):
    """Plot a 2D t-SNE scatterplot of node embeddings colored by labels."""
    z = TSNE(n_components=2, perplexity=5, random_state=42).fit_transform(embeddings)
    plt.figure(figsize=(8, 6))
    _ = plt.scatter(z[:, 0], z[:, 1], c=labels, cmap=cmap, s=60, edgecolors='k')
    plt.title(title)
    plt.axis('off')
    plt.show()

def plot_ground_truth_graph(G, node_labels, title="Graph with Node Labels", cmap=plt.cm.tab10):
    """Visualize a NetworkX graph with nodes colored by labels."""
    pos = nx.spring_layout(G, seed=42)
    plt.figure(figsize=(8, 6))
    # Fix vmin/vmax so that integer label i maps to cmap(i), matching the legend colors
    nx.draw(G, pos, node_color=node_labels, cmap=cmap, vmin=0, vmax=cmap.N - 1,
            node_size=100, with_labels=False)
    plt.title(title)
    plt.axis('off')
    # Note: relies on the global `unique_labels` defined in the visualization cell
    legend_handles = [Patch(color=cmap(i), label=label) for i, label in enumerate(unique_labels)]
    plt.legend(handles=legend_handles, title="Subjects", bbox_to_anchor=(1.05, 1), loc="upper left")

def plot_graph(G: nx.Graph, 
               node_labels: list[int],
               pos: dict[Hashable, tuple[float, float]] | None = None,
               title: str = " ", 
               ax: Axes | None = None) -> None:
    if pos is None:
        pos = nx.spring_layout(G, seed=42)
    if ax is None:
        plt.figure()
        ax = plt.gca()
    nx.draw(G, 
            pos=pos,  # use the layout computed above so side-by-side plots share node positions
            node_color=node_labels, 
            cmap=plt.cm.tab10,
            node_size=100, 
            ax = ax, 
            with_labels=False)
    ax.set_title(title)

#### The Dataset

We'll construct a smaller dataset from the [Cora dataset](https://paperswithcode.com/dataset/cora). ChatGPT-4o helped with code to subsample this dataset to something more manageable.

**Import:** Import information about the nodes. The dataset assigns a unique paper ID to each paper and includes a bag-of-words feature vector over a vocabulary of 1433 words. Name the feature columns `feature_0` through `feature_1432`, and name the last column `subject`. This last column holds the class of the paper, which in the Cora dataset is the subject area into which the paper was categorized.

In [None]:
# Load node data
column_names = ['paper_id'] + [f'feature_{i}' for i in range(1433)] + ['subject']
nodes: pd.DataFrame  = pd.read_csv('datasets/cora.content', sep='\t', header=None, names=column_names)
nodes.head()


Load the edges of the dataset. Each edge records that one paper cites another.

In [None]:
# Load edge data
edges: pd.DataFrame = pd.read_csv('datasets/cora.cites', sep='\t', header=None, names=['source', 'target'])
edges.head()


---

#### Build Networkx Graph

**Create Graph from DataFrame:** Create an undirected networkx graph. Nodes are indexed by the paper ID.

In [None]:
# Create a graph
G: nx.Graph = nx.from_pandas_edgelist(edges, 'source', 'target', create_using=nx.Graph())

# Add node attributes
for _, row in nodes.iterrows():
    G.nodes[row['paper_id']].update(row.to_dict())

# Show first 5 entries
num_nodes: int = 5
for node in G.nodes():
    print(f"node {node} is in class {G.nodes[node]['subject']} with feature_0 = {G.nodes[node]['feature_0']}")
    num_nodes -=1
    if num_nodes == 0:
        break


**Create a connected subgraph:** The full Cora database makes it a bit difficult to see results, so create a connected subgraph.

In [None]:
# Code from ChatGPT in response to a prompt asking how to create a smaller dataset
# from Cora while ensuring that the resulting graph is connected
def snowball_sample(G_full: nx.Graph, start_node: Hashable, target_size: int) -> nx.Graph:
    visited = set([start_node])
    frontier = set(G_full.neighbors(start_node))
    
    while len(visited) < target_size and frontier:
        next_node = random.choice(list(frontier))
        visited.add(next_node)
        frontier.update(G_full.neighbors(next_node))
        frontier -= visited  # remove already visited nodes
    
    return G_full.subgraph(visited).copy()

In [None]:
# Ensure consistent results for everyone doing the assignment
random.seed(42)

# Start from a high-degree node
start_node = max(G.degree, key=lambda x: x[1])[0]

# Sample 250 connected nodes
G_sub = snowball_sample(G, start_node=start_node, target_size=250)

Visualize

In [None]:
# Map class labels to integers
labels = nx.get_node_attributes(G_sub, 'subject')
unique_labels = sorted(set(labels.values()))
label_to_int = {label: i for i, label in enumerate(unique_labels)}

# Create a color list for the nodes
node_colors = [label_to_int[labels[node]] for node in G_sub.nodes()]

# Show
plot_ground_truth_graph(G_sub, node_colors, title="Cora Subgraph (250 Nodes) Colored by Class Label")


Notice how some classes don't have as many samples as others. Indeed, there are no nodes from the _Probabilistic Methods_ class.

In [None]:
class_count: dict[str, int] = {subject: 0 for subject in set(nodes['subject'])}
for node in G_sub.nodes():
    class_count[G_sub.nodes[node]['subject']] += 1
print(class_count)

---

### Build PyTorch Data Structures

We will explore three types of models:
- Graph autoencoders that use one-hot encoding for each node to create a node embedding
- Graph convolutional neural networks that use the node's feature vector to create a node embedding
- Graph convolutional neural networks that use the node's feature vector and some hand-labeled nodes to do semi-supervised learning.

Each model will be built using a class in the PyTorch Geometric library, abbreviated _PyG_. The input to these models will be the data from the graph, and this data must be put into the format required by PyG. It needs to have edge information (since we are doing graph neural networks), node features, node labels, and information about training data (if using semi-supervised learning).

I used ChatGPT-4o to put the data into the correct format. Two formats are sufficient: one with one-hot encoding and one with node features plus training data.

**One-hot Encoding**

Build PyG data structure for data using one-hot encoding. 
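
As a minimal sketch of the idea, here is the same construction in plain Python on a made-up three-node graph (the node IDs and the `toy_` names are hypothetical). One-hot encoding gives node `i` a vector with a single 1 in position `i`, and edges are rewritten in terms of row indices rather than paper IDs.

```python
# Toy graph: the node IDs are arbitrary, hypothetical paper IDs
toy_nodes = [35, 1033, 887]
toy_edges = [(35, 1033), (1033, 887)]

# Node ID -> row index
toy_index = {node_id: i for i, node_id in enumerate(toy_nodes)}

# One-hot features: row i has a single 1 in column i (an identity matrix)
toy_x = [[1.0 if i == j else 0.0 for j in range(len(toy_nodes))]
         for i in range(len(toy_nodes))]

# Edges expressed with row indices instead of paper IDs
toy_edge_pairs = [(toy_index[s], toy_index[d]) for s, d in toy_edges]

# The same edges listed in both directions, a common way to
# represent an undirected graph in an edge-index format
toy_both_directions = toy_edge_pairs + [(d, s) for s, d in toy_edge_pairs]

print(toy_x)
print(toy_edge_pairs)
print(toy_both_directions)
```

The notebook cell lists each undirected edge once. PyG provides `torch_geometric.utils.to_undirected` for adding the reverse direction; whether to do so here is a modeling choice.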


In [None]:
from torch_geometric.data import Data

# Step 1: One-hot encode the 250 nodes in the subgraph
ordered_nodes = list(G_sub.nodes())
num_nodes = len(ordered_nodes)
x = torch.eye(num_nodes)  # Shape: (250, 250), each row is a one-hot vector

# Step 2: Build node ID → index mapping
node_to_index = {node_id: i for i, node_id in enumerate(ordered_nodes)}

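# Step 3: Remap edges using index mapping
# (named edge_list so the `edges` DataFrame loaded earlier is not overwritten;
# each undirected edge of G_sub appears once)
edge_list = [
    [node_to_index[src], node_to_index[dst]]
    for src, dst in G_sub.edges()
]
edge_index = torch.tensor(edge_list, dtype=torch.long).T  # Shape: (2, num_edges)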

# Step 4: Map class labels to integers
labels = nx.get_node_attributes(G_sub, 'subject')
unique_labels = sorted(set(labels.values()))
label_to_index = {label: i for i, label in enumerate(unique_labels)}
y = torch.tensor([label_to_index[labels[node]] for node in ordered_nodes], dtype=torch.long)

# Step 5: Create PyG Data object
data_onehot: Data = Data(x=x, edge_index=edge_index, y=y)

# Inspect
print(f"x shape: {data_onehot.x.shape}, edge_index shape: {data_onehot.edge_index.shape}, y shape: {data_onehot.y.shape}")

**Node Feature Vector**

Encode the features of each node using the bag-of-words features in the Cora dataset.

In [None]:
# Step 1: Prepare node features and labels
X = nodes.iloc[:, 1:-1].values  # 1433 BoW features
y = nodes["subject"]
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Map paper_id to index
paper_id_to_idx = {paper_id: i for i, paper_id in enumerate(nodes["paper_id"])}

# Build edge index for subgraph
ordered_nodes = list(G_sub.nodes())
node_to_index = {node_id: i for i, node_id in enumerate(ordered_nodes)}
edge_list = [
    [node_to_index[src], node_to_index[dst]]
    for src, dst in G_sub.edges()
    if src in node_to_index and dst in node_to_index
]
edge_index = torch.tensor(edge_list, dtype=torch.long).T

# Build node feature matrix and label vector (aligned with ordered_nodes)
X_sub = torch.tensor([X[paper_id_to_idx[n]] for n in ordered_nodes], dtype=torch.float)
y_sub = torch.tensor([y_encoded[paper_id_to_idx[n]] for n in ordered_nodes], dtype=torch.long)

data_cora: Data = Data(x=X_sub, edge_index=edge_index, y=y_sub)

**Node Labels**

We'll eventually use the node labels, so let's randomly choose some of the nodes in the graph and use the known classes for those nodes as labeled training data. We'll add this to the existing data structure for simplicity. We'll only use 10% of the data for training.
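
The class-balanced sampling idea can be sketched in plain Python on made-up labels. The `toy_` names, the 50% fraction, and the use of `random.Random` are assumptions for illustration; the notebook cell uses NumPy shuffling and a 10% fraction instead.

```python
import random
from collections import defaultdict

# Hypothetical labels for 12 nodes in 3 classes of unequal size
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
toy_fraction = 0.5

# Group node indices by class
by_class = defaultdict(list)
for idx, label in enumerate(toy_labels):
    by_class[label].append(idx)

# Take the same number of training nodes from every class
per_class = int(toy_fraction * len(toy_labels)) // len(by_class)

rng = random.Random(42)
toy_train = []
for label, indices in by_class.items():
    toy_train.extend(rng.sample(indices, min(per_class, len(indices))))

# Boolean mask: True marks a node whose label is visible during training
chosen = set(toy_train)
toy_mask = [i in chosen for i in range(len(toy_labels))]
print(per_class, sorted(toy_train))
print(toy_mask)
```

Because the per-class count is rounded down, the number of training nodes can end up smaller than the requested fraction.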

In [None]:
# Step 0: Choose what percentage of data to use as training data
training_percent = 0.1

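# Step 1: Build training mask with ~10% nodes (balanced across classes)
# Count every class known to the label encoder so that each encoded label index
# is a valid output channel, even for a class that is absent from the subgraph
num_classes = len(label_encoder.classes_)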
num_total = len(y_sub)
num_train = int(training_percent * num_total)

# Group indices by class
class_indices = defaultdict(list)
for i, label in enumerate(y_sub.tolist()):
    class_indices[label].append(i)

# Sample evenly from each class
samples_per_class = num_train // num_classes
train_indices = []

for label, indices in class_indices.items():
    np.random.seed(42)
    np.random.shuffle(indices)
    train_indices.extend(indices[:samples_per_class])

# Build training mask
train_mask = torch.zeros(num_total, dtype=torch.bool)
train_mask[train_indices] = True
data_cora.train_mask = train_mask

---

#### GNN Architecture

To make comparisons fair, we'll use essentially the same architecture for each GNN we build. The figure below was generated by ChatGPT-4o with prompts by me. It has obvious errors but illustrates the basic idea.

<img src="figures/Graph_Convolutional_Network_Architecture.png" alt="General architecture for each GNN" width = "800">

The architecture has two hidden layers, and the nonlinear "squashing" function between each layer will be a ReLU. Each GNN will use the same dimension for the hidden layers and for the output (except for the semi-supervised implementation). 

In [None]:
HIDDEN_1_NODES=128
HIDDEN_2_NODES=64 
EMBEDDING_DIMENSION=32

I've made no effort to optimize the number of layers or the number of nodes per layer. I've also made no effort to optimize the nonlinear squashing function.

---
---

#### Graph Autoencoder

**Architecture:** Define a GAE model with two hidden layers and ReLUs between the layers.

In [None]:
class GCNEncoder(torch.nn.Module):
    def __init__(self, in_channels, hidden1, hidden2, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden1)
        self.conv2 = GCNConv(hidden1, hidden2)
        self.conv3 = GCNConv(hidden2, out_channels)

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.conv2(x, edge_index)
        x = F.relu(x)
        x = self.conv3(x, edge_index)
        return x

Let's discuss the elements of this class not only because it is a nice review but also because each of the GNNs that we'll define will use this identical architecture.

The class inherits from the `torch.nn.Module` superclass. Its `__init__` function calls the superclass constructor and instantiates three convolutional layers: one for the first hidden layer, one for the second hidden layer, and one for the output layer. 

The `forward` function feeds the output of each layer into the input of the next layer. Stated simply, it passes the input "forward" through each layer, squashing things from one layer to the next.
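
To make "passing forward through a layer" concrete, here is a rough sketch of one graph convolution step in plain Python. It assumes the standard GCN propagation rule (add self-loops, then average over neighbors with symmetric degree normalization). The toy graph, the scalar features, and the single fixed weight are made up for illustration.

```python
import math

# Toy path graph 0 - 1 - 2 with one scalar feature per node (made-up values)
toy_adj = [[0, 1, 0],
           [1, 0, 1],
           [0, 1, 0]]
toy_h = [1.0, 2.0, 3.0]
toy_weight = 0.5  # stands in for a learned weight matrix

n = len(toy_adj)
# Add self-loops so each node keeps part of its own feature
a_hat = [[toy_adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
deg = [sum(row) for row in a_hat]

def gcn_layer(h):
    """One step: normalized neighbor aggregation, then the weight, then ReLU."""
    out = []
    for i in range(n):
        agg = sum(a_hat[i][j] * h[j] / math.sqrt(deg[i] * deg[j]) for j in range(n))
        out.append(max(0.0, toy_weight * agg))
    return out

# Stacking layers: the output of one layer is the input of the next
h1 = gcn_layer(toy_h)
h2 = gcn_layer(h1)
print(h1)
print(h2)
```

In the class above, the final layer skips the ReLU so that the output can take negative values.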

**Training:** Add the training function, which just sequences the steps required for training. The key part of the training function is the definition of loss: `loss = model.recon_loss(z, data.edge_index)`. This says to use the reconstruction loss. In case you've forgotten, this just means that the goal of this network is to take the embedding and turn it back into the adjacency matrix using the following steps.

Recall that the goal is to take node $i$, call it $u_i$, and compute a real-valued vector representation of the node, call it ${\mathbf z}_i$. We want the embedding to satisfy the property that two similar nodes end up close to each other in the embedding space.

- if ${\rm sim}(u_i, u_j)$ is high then ${\mathbf z}_i$ is near ${\mathbf z}_j$.

For a graph convolutional autoencoder, we need to define what we mean both by _similar_ and by _near_. 

- We'll use adjacency to define _similar_, so two nodes are similar if $A_{ij}=1$.
- We'll use cosine similarity as the metric for _near_, so we want ${\mathbf z}_i^T{\mathbf z}_j$ to be high.

The maximum value of the cosine between two vectors is 1, and the minimum value is -1. However, we aren't dividing by the lengths of ${\mathbf z}_i$ and ${\mathbf z}_j$, which we technically would have to do for the vector product to represent the actual cosine. So we aren't guaranteed that ${\mathbf z}_i^T{\mathbf z}_j$ will even be between $-1$ and $1$. To fix this, we'll pass this product through the sigmoid function, which means that _near_ is defined as

$$ \sigma({\mathbf z}_i^T{\mathbf z}_j) = \frac{1}{1+e^{-{\mathbf z}_i^T{\mathbf z}_j}}$$

which squashes the values of _near_ so that they are always between $0$ and $1$. In other words, we'll approximate $A_{ij}$ by $\sigma({\mathbf z}_i^T{\mathbf z}_j)$.  The decoder does this computation for us.
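
The formula above can be checked numerically with a few lines of plain Python. The embeddings here are made-up two-dimensional vectors and `edge_probability` is a hypothetical helper name, not part of PyG.

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def edge_probability(z_i, z_j):
    """sigma(z_i^T z_j): the predicted value of A_ij for nodes i and j."""
    return sigmoid(sum(a * b for a, b in zip(z_i, z_j)))

# Made-up 2-dimensional embeddings
z_a = [2.0, 1.0]
z_b = [1.5, 1.0]    # points in a similar direction to z_a
z_c = [-2.0, -1.0]  # points in the opposite direction

print(edge_probability(z_a, z_b))  # close to 1: predicted to be adjacent
print(edge_probability(z_a, z_c))  # close to 0: predicted not to be adjacent
```

An inner product of exactly zero maps to 0.5, so orthogonal embeddings express no preference either way.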

In [None]:
def train(model, data, optimizer):
    model.train()
    optimizer.zero_grad()
    z = model.encode(data.x, data.edge_index)
    loss = model.recon_loss(z, data.edge_index)
    loss.backward()
    optimizer.step()
    return loss.item()

Set up the model and train. 
- The encoder instantiates the GCN architecture using the required input, hidden, and output dimensions. Its goal is to output the embedding.
- The `model` is set to `GAE(encoder)`. This wraps the encoder in a graph autoencoder whose decoder tries to reproduce the adjacency matrix using the math described above.
- The `model.recon_loss` uses a loss function designed to efficiently compute the error between the actual adjacency matrix and the predicted adjacency matrix.
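
The reconstruction loss can be sketched as a binary cross entropy over node pairs: pairs that are edges should get a probability near 1, and pairs that are not edges should get a probability near 0. The sketch below is plain Python with made-up embeddings and a hand-picked non-edge, and `toy_recon_loss` is a hypothetical name. As far as I understand the PyG implementation, it samples the non-edges at random rather than comparing every entry of the adjacency matrix.

```python
import math

def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def toy_recon_loss(z, pos_edges, neg_edges, eps=1e-15):
    """Mean -log(p) over true edges plus mean -log(1 - p) over non-edges."""
    pos = sum(-math.log(sigmoid(dot(z[i], z[j])) + eps) for i, j in pos_edges)
    neg = sum(-math.log(1.0 - sigmoid(dot(z[i], z[j])) + eps) for i, j in neg_edges)
    return pos / len(pos_edges) + neg / len(neg_edges)

pos_edges = [(0, 1)]  # nodes 0 and 1 are adjacent
neg_edges = [(0, 2)]  # nodes 0 and 2 are not adjacent

good_z = [[3.0, 0.0], [3.0, 0.0], [-3.0, 0.0]]  # neighbors close, non-neighbors far
bad_z = [[3.0, 0.0], [-3.0, 0.0], [3.0, 0.0]]   # the reverse

good_loss = toy_recon_loss(good_z, pos_edges, neg_edges)
bad_loss = toy_recon_loss(bad_z, pos_edges, neg_edges)
print(good_loss, bad_loss)
```

Training pushes the embeddings toward the low-loss arrangement, which is why adjacent nodes end up near each other.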

In [None]:
# Parameters
num_epochs: int = 200

# Model setup
encoder = GCNEncoder(in_channels=data_onehot.num_node_features, hidden1=HIDDEN_1_NODES, hidden2=HIDDEN_2_NODES, out_channels=EMBEDDING_DIMENSION)
model = GAE(encoder)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Training loop
for epoch in range(1, num_epochs+1):
    loss = train(model, data_onehot, optimizer)
    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d} | Loss: {loss:.4f}")

The embedding dimension is pretty big (32 dimensions), so we can't visualize it directly. We can get a sense for whether or not there is good clustering by projecting the 32 dimensions down to 2 dimensions using the t-SNE algorithm, which tries to place points in 2 dimensions so that points that are close in the original space stay close in the plot. The plot therefore gives us a sense of whether the embedding can be used to identify similar nodes by how well they cluster in the embedding space. 

In [None]:
# Put the model in evaluation (instead of training) mode
model.eval()

# Tell PyTorch to not track gradients and then infer the embedding
with torch.no_grad():
    node_embeddings_GAE = model.encode(data_onehot.x, data_onehot.edge_index)  # shape: [num_nodes, embedding_dim]

plot_scatter(node_embeddings_GAE, data_onehot.y.cpu(), title="t-SNE 2D Projection of GAE Node Embeddings")

## Question 1

Answer this question before running the next cell. 

Based on patterns you see in the scatterplot above, what will happen when the 32-dimensional embeddings are clustered into 6 clusters (since the seventh class never appears in the data)? In other words, suppose that we find six clusters and then label each node by its cluster. How well will these node labels correspond to the actual node labels? Justify your answer by looking at the ground truth plot we generated above and discussing what kinds of correlations you see between the graph structure and the node class.

## Answer 1

Put your answer here

---

## Question 2

Run the following cell, which shows a side-by-side comparison of the node classes of ground truth versus the labels produced by the GAE. What did you get right in your answer to Question 1? What did you get wrong?

In [None]:
# Run KMeans on the learned node embeddings
kmeans = KMeans(n_clusters=6, init="random", n_init=10, random_state=1234)

# Assign nodes to classes according to which cluster they belong
cluster_labels_GAE = kmeans.fit_predict(node_embeddings_GAE.cpu().numpy())  # convert the torch tensor to a NumPy array

# Get true labels
true_labels = data_onehot.y.cpu().numpy()  # Ground truth labels from PyG Data object

# Plot side-by-side comparison
_, axes = plt.subplots(1, 2, figsize=(14, 6))

# Show both cluster and true plots side by side
plot_graph(G_sub, cluster_labels_GAE, title = "KMeans Cluster Labels from GAE Embeddings", ax=axes[1])
plot_graph(G_sub, true_labels, title = "True Classes", ax=axes[0])


## Answer 2

Put your answer to question 2 here

---
---

#### Using Node Features in GAE


Set up a graph convolutional neural network that uses the feature vectors but no other training data. We'll use the same model just a different set of input data. Thus, there is no need to define the encoder and training function again. We'll instantiate the model, train, and look at the results.

In [None]:
# Parameters
num_epochs: int = 200

# Model setup
encoder = GCNEncoder(in_channels=data_cora.num_node_features, hidden1=HIDDEN_1_NODES, hidden2=HIDDEN_2_NODES, out_channels=EMBEDDING_DIMENSION)
model = GAE(encoder)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Training loop
for epoch in range(1, num_epochs+1):
    loss = train(model, data_cora, optimizer)
    if epoch % 20 == 0:
        print(f"Epoch {epoch:3d} | Loss: {loss:.4f}")

In [None]:
model.eval()
with torch.no_grad():
    node_embeddings_GAE_features = model.encode(data_cora.x, data_cora.edge_index)  # Output shape: [250, 32]


Visualize whether the result is likely to cluster

In [None]:
plot_scatter(node_embeddings_GAE_features, 
             data_cora.y.cpu(),
             title="t-SNE 2D Projection of GAE Embeddings with Feature Vector")

## Question 3

Answer this question before running the next cell. 

Based on patterns you see in the scatterplot above, what will happen when the 32-dimensional embeddings are clustered into 6 clusters (since the seventh class never appears in the data)? In other words, suppose that we find six clusters and then label each node by its cluster. How well will these node labels correspond to the actual node labels? Justify your answer by looking at the ground truth plot we generated above and discussing what kinds of correlations you see between the graph structure and the node class. Remember that this problem uses the node feature set instead of one-hot encoding.

## Answer 3

Put your answer here

---

## Question 4

Run the following cell, which shows a side-by-side comparison of the node classes of ground truth versus the labels produced by the GAE. What did you get right in your answer to Question 3? What did you get wrong?

In [None]:
# Run KMeans on the learned node embeddings
kmeans = KMeans(n_clusters=6, init="random", n_init=10, random_state=1234)

# Assign nodes to classes according to which cluster they belong
cluster_labels_gae_cora = kmeans.fit_predict(node_embeddings_GAE_features.cpu().numpy())  # convert the torch tensor to a NumPy array

# Get true labels
true_labels = data_cora.y.cpu().numpy()  # Ground truth labels from PyG Data object

# Plot side-by-side comparison
_, axes = plt.subplots(1, 2, figsize=(14, 6))

# Show both cluster and true plots side by side
plot_graph(G_sub, cluster_labels_gae_cora, title = "KMeans Cluster Labels from GAE Embeddings on Cora features", ax=axes[1])
plot_graph(G_sub, true_labels, title = "True Classes", ax=axes[0])


## Answer 4

Put your answer here

---
---

#### Adding semi-supervised learning to the GCN


Set up a GCN that learns from labeled nodes by minimizing classification error. The architecture is the same as the encoder above, so in principle we don't need to repeat it. However, we will repeat it and give it a different name. We are doing this because the dimension of the output layer won't be the embedding dimension, but rather the number of desired node classes. The `forward` function also takes the whole `data` object instead of separate `x` and `edge_index` arguments.

In [None]:
# Define GCN with classifier output
class GCNClassifier(torch.nn.Module):
    def __init__(self, in_channels, hidden1, hidden2, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden1)
        self.conv2 = GCNConv(hidden1, hidden2)
        self.conv3 = GCNConv(hidden2, out_channels)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = self.conv2(x, edge_index)
        x = F.relu(x)
        x = self.conv3(x, edge_index)
        return x


The training will use a different error since the goal is not to find the 32-dimensional embedding but rather to estimate the node's class. 

In [None]:
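# Note: this redefines `train` from the GAE section; re-run that earlier cell before retraining a GAE
def train(model, data, optimizer):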
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    return loss.item(), out

The big difference is that we are using labeled training data and computing the cross-entropy loss, which is a measure of the difference between the predicted class and the actual class.
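
As a sketch of what the masked cross-entropy computes, the plain-Python example below uses made-up output scores for three nodes, only two of which are labeled for training. The `toy_` names are hypothetical. It averages the per-node losses, which I believe matches the default behavior of `F.cross_entropy`.

```python
import math

def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to the target class."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return log_sum - logits[target]

# Made-up output scores for 3 nodes and 3 classes
toy_out = [[4.0, 0.0, 0.0],   # confident in class 0
           [0.0, 0.0, 4.0],   # confident in class 2
           [1.0, 1.0, 1.0]]   # undecided
toy_y = [0, 0, 2]
toy_train_mask = [True, True, False]  # the third node is not used for training

# Only nodes selected by the mask contribute to the loss
losses = [cross_entropy(o, t)
          for o, t, m in zip(toy_out, toy_y, toy_train_mask) if m]
toy_loss = sum(losses) / len(losses)
print(losses, toy_loss)
```

The first node is confidently right and contributes a small loss. The second is confidently wrong and contributes a large one. The third is ignored because its label is hidden from training.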

We can now instantiate the class and train it. Recall how the autoencoder instantiated a model: 

`encoder = GCNEncoder(in_channels=data_onehot.num_node_features, hidden1=HIDDEN_1_NODES, hidden2=HIDDEN_2_NODES, out_channels=EMBEDDING_DIMENSION)`

sets the number of output channels to the embedding dimension, followed by the instantiation 

`model = GAE(encoder)`

which creates the model as a graph autoencoder.

The classifier model uses `out_channels=num_classes` and instantiates the entire model.

In [None]:
# Instantiate model
model = GCNClassifier(in_channels=data_cora.num_node_features, hidden1=HIDDEN_1_NODES, hidden2=HIDDEN_2_NODES, out_channels=num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Train only on labeled nodes
for epoch in range(1, 201):
    loss, out = train(model, data_cora, optimizer)
    
    if epoch % 20 == 0:
        # `out` holds the scores from this training step; accuracy is measured on the training nodes only
        pred = out.argmax(dim=1)
        acc = (pred[data_cora.train_mask] == data_cora.y[data_cora.train_mask]).float().mean().item()
        print(f"Epoch {epoch:3d} | Loss: {loss:.4f} | Acc (train): {acc:.4f}")

We output an _accuracy_ score, which is how well the model is learning the training set. The `pred=out.argmax(dim=1)` chooses the class by finding the class with highest output score.
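
A small plain-Python sketch of the argmax step and the accuracy calculation, using made-up scores for four nodes (the `toy_` names are hypothetical):

```python
# Made-up output scores: one row per node, one column per class
toy_scores = [[0.2, 2.5, -1.0],
              [1.7, 0.3, 0.1],
              [0.0, 0.4, 3.1],
              [2.0, 1.9, 0.5]]
toy_true = [1, 0, 2, 1]

# argmax over each row: the index of the highest score is the predicted class
toy_pred = [max(range(len(row)), key=row.__getitem__) for row in toy_scores]

# Accuracy: the fraction of nodes whose predicted class equals the true class
toy_acc = sum(p == t for p, t in zip(toy_pred, toy_true)) / len(toy_true)
print(toy_pred, toy_acc)
```

The last node shows how a near tie between two scores still produces a single, possibly wrong, class.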

Visualize results, comparing what happens when we use the embedding plus some clusters to what happens when we just run the classifier.

In [None]:
# Extract the classifier's output scores (one per class) and use them as the embedding
model.eval()
with torch.no_grad():
    node_embeddings = model(data_cora)

print("Embeddings shape:", node_embeddings.shape)
print("Label shape:", data_cora.y.shape)

plot_scatter(node_embeddings, 
             data_cora.y.cpu(),
             title="t-SNE 2D Projection of Classifier Embedding with Feature Vector")

## Question 5

Answer this question before running the next cell.

The embedding produced above uses information about correct node classes. How well will we be able to find clusters in the embedding and use those to predict the class of each node? And how well will this prediction compare to if we just use the classifier prediction directly? Justify your answer.

## Answer 5

Put your answer here.

---

Run the following code and look at the clusters.

In [None]:
# --- Clustering on embeddings ---
kmeans = KMeans(n_clusters=num_classes, random_state=0)
cluster_labels_classifier_embedding = kmeans.fit_predict(node_embeddings.cpu().numpy())

# --- Classification prediction ---
predicted_labels = node_embeddings.argmax(dim=1).cpu().numpy()

# --- True labels ---
true_labels = data_cora.y.cpu().numpy()

_, axes = plt.subplots(1, 3, figsize=(18, 6))

plot_graph(G_sub, true_labels, title="True Labels", ax=axes[0])
plot_graph(G_sub, cluster_labels_classifier_embedding, title="KMeans Clusters on GCN Embeddings", ax=axes[1])
plot_graph(G_sub, predicted_labels, title="GCN Predicted Labels (argmax)", ax=axes[2])


## Question 6

What did you get right and wrong in your answer to Question 5? What did you learn?

## Answer 6

Put your answer here.

---
---

#### ARI: Quantifying Performance

The Adjusted Rand Index (ARI) measures how similar two groupings are, adjusting for random chance. According to ChatGPT-4o, ARI evaluates:

- Which pairs of points were grouped together
- Which pairs were placed in different groups
- Whether this matches what happened in the ground truth labels

ChatGPT also suggests the following interpretations for ARI ranges.

**Adjusted Rand Index (ARI) Interpretation**

| ARI Score | Interpretation                       |
|-----------|--------------------------------------|
| **1.0**   | Perfect match with true labels       |
| 0.8–0.9   | Excellent clustering                  |
| 0.5–0.7   | Good structure, moderate alignment    |
| 0.2–0.4   | Weak clustering, some structure       |
| **0.0**   | No better than random assignment      |
| **< 0.0** | Worse than random (actively misleading) |
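
To see where these numbers come from, here is a minimal sketch of the ARI computed from pair counts in plain Python. `toy_ari` is a hypothetical helper written for illustration; the notebook itself uses `adjusted_rand_score` from scikit-learn. The sketch does not handle the degenerate case where the denominator is zero, such as when both groupings put every point in one group.

```python
from collections import Counter
from math import comb

def toy_ari(labels_true, labels_pred):
    """Adjusted Rand Index computed from counts of co-grouped pairs."""
    n = len(labels_true)
    # Pairs grouped together in both labelings
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs grouped together in each labeling separately
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Agreement expected by chance, and the largest possible agreement
    expected = same_true * same_pred / comb(n, 2)
    max_index = 0.5 * (same_true + same_pred)
    return (same_both - expected) / (max_index - expected)

# Same grouping with the label names swapped: still a perfect match
perfect = toy_ari([0, 0, 1, 1], [1, 1, 0, 0])
# Every true pair is split apart: worse than chance
split = toy_ari([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect, split)
```

The first example shows why ARI suits clustering: only the grouping matters, not which integer names each cluster.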

## Question 7

Answer this question before running the code.

Put the following four ways of classifying nodes in order from smallest to largest ARI scores. Justify your answer.
- Classes from GAE with one-hot encoding
- Classes from GAE with bag-of-words node feature encoding
- Classes from GCN classifier with clustered embedding
- Classes from GCN classifier output

## Answer 7

Put your answer here.

---

Compute and print the ARI for each classifier, starting with the two GAE variants.

In [None]:
ari = adjusted_rand_score(data_onehot.y.cpu().numpy(), cluster_labels_GAE)
print(f"Adjusted Rand Index (ARI) comparing GAE clusters with \none-hot encoding to true labels: {ari:.4f}")

ari = adjusted_rand_score(data_cora.y.cpu().numpy(), cluster_labels_gae_cora)
print(f"Adjusted Rand Index (ARI) comparing GAE clusters with \nbag-of-words encoding to true labels: {ari:.4f}")

# ARI: KMeans on GCN embeddings
ari_kmeans_classifier = adjusted_rand_score(true_labels, cluster_labels_classifier_embedding)

# ARI: Classifier predictions
ari_classifier = adjusted_rand_score(true_labels, predicted_labels)

print(f"Adjusted Rand Index (KMeans on embeddings): {ari_kmeans_classifier:.4f}")
print(f"Adjusted Rand Index (GCN predicted labels): {ari_classifier:.4f}")

## Question 8

What did you get right and what did you get wrong? Why might you have gotten the answer wrong? Try changing the code so that 30% of the data is used for training (search on `training_percent` to find where to set it). What do you learn from that?

## Answer 8

Put your answer here.