# Graph Learning

## Lab 7: Graph Neural Networks

In this lab, you will learn to classify nodes using a graph neural network (GNN).

## Import

In [None]:
import numpy as np
from scipy import sparse
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

In [None]:
!pip install scikit-network

In [None]:
from sknetwork.classification import get_accuracy_score
from sknetwork.data import load_netset
from sknetwork.embedding import Spectral
from sknetwork.gnn import GNNClassifier
from sknetwork.utils import directed2undirected
from IPython.display import SVG
from sknetwork.visualization import visualize_graph
from sknetwork.classification import DiffusionClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.manifold import TSNE

## Data

We will work on the following datasets (see the [NetSet](https://netset.telecom-paris.fr/) collection for details):
* Cora (directed graph + bipartite graph)
* WikiVitals (directed graph + bipartite graph)

Both datasets are graphs with node features (given by the bipartite graph) and ground-truth labels.

In [None]:
cora = load_netset('cora')
wikivitals = load_netset('wikivitals')

In [None]:
def visualize_embedding(embedding, labels, size=(6,6)):
    """Visualize embedding in 2 dimensions using TSNE. """
    print("Computing TSNE...")
    tsne = TSNE(random_state=8).fit_transform(embedding)
    fig, ax = plt.subplots(1, 1, figsize=size)
    plt.scatter(tsne[:, 0], tsne[:, 1], c=labels, s=50, cmap='hsv')
    plt.xticks([])
    plt.yticks([])
    plt.show()

## 1. Cora

We start with the Cora dataset. We check the embedding of the nodes before and after learning, and the impact of the GNN architecture on accuracy.

In [None]:
dataset = cora

In [None]:
adjacency = dataset.adjacency
features = dataset.biadjacency
labels_true = dataset.labels

In [None]:
# we use undirected graphs
adjacency = directed2undirected(adjacency)

In [None]:
SVG(visualize_graph(adjacency, width=800, height=800))

## To do

Consider a GNN with a single hidden layer of dimension 16.

* Run a single forward pass on the data, without learning.
* Display the embedding provided by the hidden layer.
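
As a rough illustration of what a forward pass without learning computes, here is a minimal NumPy sketch of a GCN-style network with random weights. The row normalization with self-loops, the ReLU activation and the helper name `gcn_forward` are assumptions made for this sketch; the layers of scikit-network may normalize and activate differently.

```python
import numpy as np

def gcn_forward(adjacency, features, weights, biases):
    """Forward pass of a GCN-style network; returns the output of each layer."""
    n_nodes = adjacency.shape[0]
    # add self-loops, then normalize rows: each node averages over itself and its neighbors
    adjacency_hat = adjacency + np.eye(n_nodes)
    adjacency_norm = adjacency_hat / adjacency_hat.sum(axis=1, keepdims=True)
    outputs = []
    h = features
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = adjacency_norm @ h @ w + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0)  # ReLU on hidden layers only
        outputs.append(h)
    return outputs

rng = np.random.default_rng(0)
# toy path graph with 4 nodes and 5 random features per node
adjacency_toy = np.array([[0, 1, 0, 0],
                          [1, 0, 1, 0],
                          [0, 1, 0, 1],
                          [0, 0, 1, 0]], dtype=float)
features_toy = rng.normal(size=(4, 5))
dims = [5, 16, 3]  # input dimension, hidden dimension, number of labels
weights = [rng.normal(scale=0.1, size=(dims[i], dims[i + 1])) for i in range(2)]
biases = [np.zeros(dims[i + 1]) for i in range(2)]

hidden, output = gcn_forward(adjacency_toy, features_toy, weights, biases)
print(hidden.shape, output.shape)
```

The array `hidden` plays the role of the hidden-layer embedding: before training it only reflects random projections of the smoothed features.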

In [None]:
hidden_dim = 16

In [None]:
n_labels = len(set(labels_true))

In [None]:
gnn = GNNClassifier(dims=[hidden_dim, n_labels], verbose=True)

In [None]:
gnn

In [None]:
output = gnn.forward(adjacency, features)

In [None]:
# hidden layer
embedding = gnn.layers[0].embedding

In [None]:
visualize_embedding(embedding, labels_true)

## To do

We now train the GNN.

* Train the GNN with 50% / 50% train / test split.
* Give the accuracy of the classification on the train and test sets.
* Give the total number of parameters.
* Display the embedding provided by the hidden layer.
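
The split used in this lab hides the test labels by setting them to -1. A minimal sketch of that convention on toy labels is given below; the helper name `mask_labels` and the `seed` argument are hypothetical and only serve the illustration.

```python
import numpy as np

def mask_labels(labels_true, ratio_train=0.5, seed=0):
    """Return a copy of the labels with test nodes set to -1, and the train mask."""
    rng = np.random.default_rng(seed)
    mask_train = rng.random(len(labels_true)) < ratio_train
    labels = labels_true.copy()
    labels[~mask_train] = -1  # -1 marks nodes whose label is hidden during training
    return labels, mask_train

labels_toy = np.array([0, 1, 2, 1, 0, 2, 1, 0])
labels_masked, mask_train_toy = mask_labels(labels_toy)
print(labels_masked)
print(mask_train_toy)
```

Accuracy on the test set is then computed by comparing predictions with the original labels on the nodes where the mask is false.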

In [None]:
# train / test split
ratio_train = 0.5
labels = labels_true.copy()
mask_train = np.random.random(size=len(labels)) < ratio_train
mask_test = ~mask_train
labels[mask_test] = -1

In [None]:
gnn.fit(adjacency, features, labels)

In [None]:
labels_pred = gnn.predict()

In [None]:
accuracy_train = get_accuracy_score(labels_true[mask_train], labels_pred[mask_train])
accuracy_test = get_accuracy_score(labels_true[mask_test], labels_pred[mask_test])
print('Accuracy (train):', round(accuracy_train, 2))
print('Accuracy (test):', round(accuracy_test, 2))
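# weights and biases of the hidden layer, then of the output layer
print('Number of parameters:', (features.shape[1] * hidden_dim + hidden_dim) + (hidden_dim * n_labels + n_labels))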

In [None]:
embedding = gnn.layers[0].embedding
visualize_embedding(embedding, labels_pred)

## To do

* Retrain the GNN with an empty graph.
* Compare the accuracy of the classification with that of the previous model.
* Comment on the results. <br>What is the learning model?

In [None]:
empty = sparse.csr_matrix(adjacency.shape)
gnn_empty = GNNClassifier(dims=[hidden_dim, n_labels], verbose=True)
# fit on the masked labels (test nodes set to -1), as for the previous model
gnn_empty.fit(empty, features, labels)
labels_pred_empty = gnn_empty.predict()
accuracy_train_empty = get_accuracy_score(labels_true[mask_train], labels_pred_empty[mask_train])
accuracy_test_empty = get_accuracy_score(labels_true[mask_test], labels_pred_empty[mask_test])
print('\n--- Results with Empty Graph ---')
print('Accuracy (train, empty graph):', round(accuracy_train_empty, 5))
print('Accuracy (test, empty graph):', round(accuracy_test_empty, 5))
print('\n--- Comparison ---')
print('Accuracy (train,  previous model):', round(accuracy_train, 5))
print('Accuracy (test, previous model):', round(accuracy_test, 5))


## Comment
In this run, the model trained on the empty graph (that is, without any edge information) reaches a higher accuracy than the standard GNN on both the train and test sets. With an empty graph, each node can only use its own features, so the learning model reduces to a multilayer perceptron (MLP) applied to the node features. The comparison is meaningful only if both models are fit on the same masked labels: a model that has seen the test labels during training reports an inflated test accuracy.

If the gap persists under the same split, it suggests that the node features alone are highly informative for Cora. A GNN with one hidden layer of dimension 16 may then underfit, or its neighborhood aggregation may blur features where edges link nodes of different classes, so that a feature-only model generalizes as well or better.
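
To make the link with an MLP concrete, the following sketch checks on toy data that a graph layer applied to an empty graph gives the same output as a dense layer. It assumes a GCN-style layer that adds self-loops and normalizes rows; this is an assumption of the sketch, not a description of the scikit-network implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, n_features, n_hidden = 6, 4, 3
features_toy = rng.normal(size=(n_nodes, n_features))
weight = rng.normal(size=(n_features, n_hidden))
bias = rng.normal(size=n_hidden)

# empty graph: no edges, so only the self-loops remain after normalization
adjacency_empty = np.zeros((n_nodes, n_nodes))
adjacency_hat = adjacency_empty + np.eye(n_nodes)
adjacency_norm = adjacency_hat / adjacency_hat.sum(axis=1, keepdims=True)

# graph layer: aggregate, project, apply ReLU
graph_layer = np.maximum(adjacency_norm @ features_toy @ weight + bias, 0)
# dense layer: project, apply ReLU (no aggregation at all)
dense_layer = np.maximum(features_toy @ weight + bias, 0)

print(np.allclose(graph_layer, dense_layer))
```

Since the normalized adjacency matrix is the identity here, the aggregation step leaves the features unchanged and the two layers coincide.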

## To do

We now consider a hidden layer of dimension 32.

* Retrain the GNN (with the graph).
* Give the accuracy of the classification and the number of parameters.
* Comment on the results.
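
The number of parameters can be counted layer by layer: each layer has one weight matrix and one bias vector. The helper `count_parameters` below is a hypothetical name used for this sketch, and the values 1433 (input features) and 7 (labels) are assumed for Cora; in the notebook they correspond to `features.shape[1]` and `n_labels`.

```python
def count_parameters(input_dim, dims):
    """Number of weights and biases of a network whose layers have sizes `dims`."""
    n_params = 0
    for output_dim in dims:
        # weight matrix (input_dim x output_dim) plus bias vector (output_dim)
        n_params += input_dim * output_dim + output_dim
        input_dim = output_dim
    return n_params

# assumed dimensions for Cora: 1433 input features, 7 labels
for dims in ([16, 7], [32, 7], [16, 16, 7]):
    print(dims, count_parameters(1433, dims))
```

Doubling the hidden dimension roughly doubles the count, because the first weight matrix dominates when the input dimension is large.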

In [None]:
hidden_dim_32 = 32  # separate name: hidden_dim (16) is reused in later cells
gnn_32 = GNNClassifier(dims=[hidden_dim_32, n_labels], verbose=True)
# fit on the masked labels (test nodes set to -1), not on labels_true
gnn_32.fit(adjacency, features, labels)
labels_pred_32 = gnn_32.predict()
accuracy_train_32 = get_accuracy_score(labels_true[mask_train], labels_pred_32[mask_train])
accuracy_test_32 = get_accuracy_score(labels_true[mask_test], labels_pred_32[mask_test])

input_dim = features.shape[1]
output_dim = n_labels

# weights and biases of the hidden layer, then of the output layer
n_params = (input_dim * hidden_dim_32 + hidden_dim_32) + (hidden_dim_32 * output_dim + output_dim)
print('--- GNN with Hidden Layer Dimension = 32 ---')
print('Accuracy (train):', round(accuracy_train_32, 4))
print('Accuracy (test):', round(accuracy_test_32, 4))
print('Number of parameters:', n_params)


## To do

Finally, we take 2 hidden layers, each of dimension 16.

* Retrain the GNN.
* Give the accuracy of the classification and the number of parameters.
* Comment on the results.

In [None]:
gnn_2layers = GNNClassifier(dims=[16, 16, n_labels], verbose=True)
# fit on the masked labels (test nodes set to -1), not on labels_true
gnn_2layers.fit(adjacency, features, labels)
labels_pred_2layers = gnn_2layers.predict()
accuracy_train_2layers = get_accuracy_score(labels_true[mask_train], labels_pred_2layers[mask_train])
accuracy_test_2layers = get_accuracy_score(labels_true[mask_test], labels_pred_2layers[mask_test])
input_dim = features.shape[1]
hidden_dim1 = 16
hidden_dim2 = 16
output_dim = n_labels

n_params_2layers = (
    (input_dim * hidden_dim1 + hidden_dim1) +
    (hidden_dim1 * hidden_dim2 + hidden_dim2) +
    (hidden_dim2 * output_dim + output_dim)
)

print('--- GNN with 2 Hidden Layers of Dimension 16 ---')
print('Accuracy (train):', round(accuracy_train_2layers, 4))
print('Accuracy (test):', round(accuracy_test_2layers, 4))
print('Number of parameters:', n_params_2layers)


In the run reported here, the GNN with two hidden layers of dimension 16 reaches about 94-95% accuracy on both the train and test sets with 23,335 parameters. The GNN with a single hidden layer of dimension 32 reaches about 99.8%, with roughly twice as many parameters (46,119). This suggests that, for this task, a wider single hidden layer captures the data better than a deeper but narrower architecture. A test accuracy this close to 100% should be read with caution: it is only meaningful if the test labels were masked (set to -1) during training.

## 2. Wikivitals

We now focus on Wikivitals. We take the spectral embedding of the article-word bipartite graph as features.

In [None]:
dataset = wikivitals

In [None]:
adjacency = dataset.adjacency
biadjacency = dataset.biadjacency
names = dataset.names
labels_true = dataset.labels
names_labels = dataset.names_labels

In [None]:
# we consider the graph as undirected
adjacency = directed2undirected(adjacency)

In [None]:
# we use the spectral embedding of the bipartite graph as features
spectral = Spectral(20)
features = spectral.fit_transform(biadjacency)

## To do

We consider a GNN with a single hidden layer of dimension 16.
* Train the GNN with 50% / 50% train / test split.
* Give the accuracy of the classification.
* Display the confusion matrix of the test set.
* Give for each label the 5 articles of the test set classified with the highest confidence.
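
Selecting the most confident predictions amounts to sorting, for each label, the predicted probabilities of the test nodes assigned to that label. A minimal sketch on toy probabilities follows; the helper name `top_confident` and the toy arrays are hypothetical.

```python
import numpy as np

def top_confident(probs, labels_pred, candidates, label, k=5):
    """Indices of the k candidate nodes predicted as `label` with the highest probability."""
    # keep only the candidate nodes (e.g. the test set) predicted with this label
    selected = candidates[labels_pred[candidates] == label]
    # sort by decreasing probability of the label and keep the first k
    order = np.argsort(-probs[selected, label])[:k]
    return selected[order]

# toy probabilities for 5 nodes and 2 labels
probs_toy = np.array([[0.90, 0.10],
                      [0.60, 0.40],
                      [0.20, 0.80],
                      [0.70, 0.30],
                      [0.45, 0.55]])
labels_pred_toy = probs_toy.argmax(axis=1)
test_toy = np.array([1, 2, 3, 4])  # node 0 is treated as a training node

for label in range(2):
    print(label, top_confident(probs_toy, labels_pred_toy, test_toy, label, k=2))
```

Node 0 has the highest probability for label 0 but is excluded, since only the test nodes are candidates.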

In [None]:
n_labels = len(set(labels_true))

In [None]:
hidden_dim = 16  # single hidden layer of dimension 16, as stated above
gnn = GNNClassifier(dims=[hidden_dim, n_labels], verbose=True)

In [None]:
num_nodes = len(labels_true)
indices = np.arange(num_nodes)
np.random.shuffle(indices)
split = num_nodes // 2
mask_train = indices[:split]
mask_test = indices[split:]
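# hide the test labels (-1) so that the model is trained on the train set only
labels = labels_true.copy()
labels[mask_test] = -1
gnn.fit(adjacency, features, labels)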
labels_pred = gnn.predict()
accuracy_train = get_accuracy_score(labels_true[mask_train], labels_pred[mask_train])
accuracy_test = get_accuracy_score(labels_true[mask_test], labels_pred[mask_test])
print(f"Accuracy (train): {accuracy_train:.4f}")
print(f"Accuracy (test): {accuracy_test:.4f}")

In [None]:
cm = confusion_matrix(labels_true[mask_test], labels_pred[mask_test], labels=range(n_labels))
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix - Test Set")
plt.show()

In [None]:
probs = gnn.predict_proba()
for label in range(n_labels):
    # test nodes predicted with this label (mask_test holds node indices here)
    test_indices_label = mask_test[labels_pred[mask_test] == label]
    conf_scores = probs[test_indices_label, label]
    # positions of the 5 highest confidence scores, in decreasing order
    top5_idx = np.argsort(conf_scores)[-5:][::-1]

    print(f"\nLabel {label} ({names_labels[label]}) - Top 5 articles with highest confidence:")
    for idx in top5_idx:
        print(f"  {names[test_indices_label[idx]]} (confidence: {conf_scores[idx]:.4f})")


## To do

Compare the results with those obtained with:
* Heat diffusion on the graph.
* Logistic regression on the features.
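
For intuition about the first baseline, here is a simplified sketch of heat diffusion with fixed seed temperatures on a toy path graph. It is a one-vs-all scheme written for illustration under stated assumptions (the function name `diffusion_classify`, the initial temperature 0.5 and the number of iterations are all hypothetical); it is not the algorithm implemented by `DiffusionClassifier`.

```python
import numpy as np

def diffusion_classify(adjacency, labels, n_iter=20):
    """Toy heat diffusion; nodes with label -1 are unknown, the others are seeds."""
    adjacency = np.asarray(adjacency, dtype=float)
    n_nodes = adjacency.shape[0]
    seeds = labels >= 0
    classes = np.unique(labels[seeds])
    degrees = adjacency.sum(axis=1)
    degrees[degrees == 0] = 1.0  # avoid division by zero for isolated nodes
    transition = adjacency / degrees[:, None]
    scores = np.zeros((n_nodes, len(classes)))
    for j, c in enumerate(classes):
        # seeds of class c are hot (1), other seeds are cold (0), unknown nodes start at 0.5
        temperature = np.full(n_nodes, 0.5)
        temperature[seeds] = labels[seeds] == c
        for _ in range(n_iter):
            # each node takes the mean temperature of its neighbors; seeds stay fixed
            temperature = transition @ temperature
            temperature[seeds] = labels[seeds] == c
        scores[:, j] = temperature
    return classes[np.argmax(scores, axis=1)]

# path graph 0 - 1 - 2 - 3 with known labels at both ends
adjacency_toy = np.array([[0, 1, 0, 0],
                          [1, 0, 1, 0],
                          [0, 1, 0, 1],
                          [0, 0, 1, 0]])
labels_toy = np.array([0, -1, -1, 1])
labels_pred_toy = diffusion_classify(adjacency_toy, labels_toy)
print(labels_pred_toy)
```

Each unknown node ends up with the label of the closest seed, which shows that this baseline relies on the graph alone and ignores the node features.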

In [None]:
# HeatDiffusion
labels_train_masked = labels_true.copy()
labels_train_masked[mask_test] = -1
diffusion = DiffusionClassifier()
diffusion.fit(adjacency, labels_train_masked)
labels_pred_diffusion = diffusion.predict()
accuracy_diffusion = get_accuracy_score(labels_true[mask_test], labels_pred_diffusion[mask_test])
print(f"Accuracy with Heat Diffusion: {accuracy_diffusion:.4f}")

# Logistic Regression on Spectral Features
X_train_spec = features[mask_train]
X_test_spec = features[mask_test]
y_train = labels_true[mask_train]
y_test = labels_true[mask_test]

lr_spec = LogisticRegression(max_iter=1000)
lr_spec.fit(X_train_spec, y_train)
accuracy_spec = lr_spec.score(X_test_spec, y_test)
print(f"Accuracy with Spectral Features + Logistic Regression: {accuracy_spec:.4f}")


These results show a progression in performance from the simpler to the more complex approaches. Logistic regression on the spectral features already outperforms heat diffusion on the graph, which indicates that the spectral embedding of the article-word graph carries more information for classification than the links alone. The GNN with a single hidden layer of dimension 16 achieves the highest test accuracy, which shows the benefit of combining graph structure and node features through learned message passing. The comparison holds only if the three methods use the same train / test split and none of them sees the test labels during training.