# Graph Learning
## Lab 5: Heat Diffusion

In this lab, you will learn to use heat diffusion for **contrastive ranking** and **node classification**.

## Import

In [None]:
from IPython.display import SVG

In [None]:
import numpy as np
from scipy import sparse
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from sknetwork.data import load_netset, grid, karate_club
from sknetwork.regression import Dirichlet
from sknetwork.classification import DiffusionClassifier
from sknetwork.ranking import PageRank
from sknetwork.utils import directed2undirected
from sknetwork.visualization import visualize_graph
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

## Data

We will work on the following graphs (see the [NetSet](https://netset.telecom-paris.fr/) collection for details):
* Openflights (graph)
* WikiVitals (directed graph + bipartite graph)

In [None]:
openflights = load_netset('openflights')
wikivitals = load_netset('wikivitals')

## 1. Graphs

## Grid

We first illustrate the notion of **contrastive ranking** on a $k\times k$ grid.

In [None]:
k = 5
dataset = grid(k, k, True)
adjacency = dataset.adjacency
position = dataset.position

In [None]:
image = visualize_graph(adjacency, position, names=np.arange(k * k), width=200, height=200)
SVG(image)

## To do

* Display the solution to the Dirichlet problem with 1 hot source and 1 cold source, located on the opposite corners of the square.
* Add a hot source at the top-left corner of the square. What is the free node of highest temperature?

In [None]:
dirichlet = Dirichlet()

In [None]:
bottom_left = 0
top_right = k * k - 1

In [None]:
values = {bottom_left: 0, top_right: 1}

In [None]:
temperatures = dirichlet.fit_predict(adjacency, values=values)

In [None]:
image = visualize_graph(adjacency, position, scores=temperatures, width=200, height=200, names=np.arange(k * k))
SVG(image)

In [None]:
# Add a hot source in the top left
top_left = k - 1
values[top_left] = 1
temperatures = dirichlet.fit_predict(adjacency, values=values)
image = visualize_graph(adjacency, position, scores=temperatures, width=200, height=200, names=np.arange(k * k))
SVG(image)

In [None]:
temp_copy = temperatures.copy()
# Exclude the seed nodes by index (temperatures lie in [0, 1], so -1 can never win the argmax)
temp_copy[list(values)] = -1
highest_temp_node = np.argmax(temp_copy)
print(f"The free node with the highest temperature is node {highest_temp_node}")

This result is consistent with intuition: nodes 19 and 23 are the free nodes furthest from the cold source (node 0), and of the two, node 19 is also the closest to both hot sources (nodes 4 and 24), so it is the hottest free node.
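
As a cross-check of the intuition above, the Dirichlet problem can also be solved exactly as a linear system, since the temperature of each free node is the average of the temperatures of its neighbors. The sketch below uses plain NumPy rather than `sknetwork`. The helpers `grid_adjacency` and `solve_dirichlet` and the node numbering (`row * k + col`) are assumptions made for illustration and may not match the numbering used by `grid`.

```python
import numpy as np

def grid_adjacency(k):
    """Dense adjacency of a k x k grid; node index = row * k + col (assumed numbering)."""
    adjacency = np.zeros((k * k, k * k))
    for row in range(k):
        for col in range(k):
            node = row * k + col
            if col + 1 < k:  # edge to the next node in the same row
                adjacency[node, node + 1] = adjacency[node + 1, node] = 1
            if row + 1 < k:  # edge to the node in the next row
                adjacency[node, node + k] = adjacency[node + k, node] = 1
    return adjacency

def solve_dirichlet(adjacency, seeds):
    """Seeds keep their value; free nodes satisfy L_ff T_f = A_fs T_s."""
    n = adjacency.shape[0]
    seed_nodes = list(seeds)
    seed_values = np.array(list(seeds.values()), dtype=float)
    free = np.setdiff1d(np.arange(n), seed_nodes)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    temperatures = np.zeros(n)
    temperatures[seed_nodes] = seed_values
    rhs = adjacency[np.ix_(free, seed_nodes)] @ seed_values
    temperatures[free] = np.linalg.solve(laplacian[np.ix_(free, free)], rhs)
    return temperatures, free

k = 5
adjacency = grid_adjacency(k)
seeds = {0: 0, k - 1: 1, k * k - 1: 1}  # one cold corner, two hot corners
temperatures, free = solve_dirichlet(adjacency, seeds)
hottest_free = free[np.argmax(temperatures[free])]
print(hottest_free, round(temperatures[hottest_free], 3))
```

Restricting the `argmax` to the `free` indices avoids having to mask the seeds by their temperature value.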

## Karate Club


We now consider the **classification** of nodes by heat diffusion. We use the [karate club graph](https://en.wikipedia.org/wiki/Zachary%27s_karate_club) that has ground-truth labels.

In [None]:
dataset = karate_club(True)

In [None]:
adjacency = dataset.adjacency
position = dataset.position
labels_true = dataset.labels

In [None]:
n = len(labels_true)

In [None]:
image = visualize_graph(adjacency, position, labels=labels_true, names=np.arange(n))
SVG(image)

## To do

* Select one node in each ground-truth cluster, and predict the labels of the other nodes by heat diffusion.
* Display the graph with the predicted labels.
* What is the accuracy of the classification?

In [None]:
labels_true = np.array(labels_true)
nodes_cluster0 = np.where(labels_true == 0)[0]
nodes_cluster1 = np.where(labels_true == 1)[0]
node0 = np.random.choice(nodes_cluster0)
node1 = np.random.choice(nodes_cluster1)
y_partial = -np.ones(n, dtype=int)
y_partial[node0] = 0
y_partial[node1] = 1

classifier = DiffusionClassifier()

y_pred = classifier.fit_predict(adjacency, y_partial)
accuracy = accuracy_score(labels_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")

image = visualize_graph(adjacency, position, labels=y_pred, names=np.arange(n))
display(SVG(image))




## To do

The classifier applies temperature centering after diffusion.

* Repeat the same experiment without temperature centering.
* Do the same experiments with 3 nodes in a ground-truth cluster and 1 in the other.
* Comment on the results.

In [None]:
# Repeating the experiments with no centering
labels_true = np.array(labels_true)
nodes_cluster0 = np.where(labels_true == 0)[0]
nodes_cluster1 = np.where(labels_true == 1)[0]
node0 = np.random.choice(nodes_cluster0)
node1 = np.random.choice(nodes_cluster1)
y_partial = -np.ones(n, dtype=int)
y_partial[node0] = 0
y_partial[node1] = 1

classifier = DiffusionClassifier(centering=False)

y_pred = classifier.fit_predict(adjacency, y_partial)
accuracy = accuracy_score(labels_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")

image = visualize_graph(adjacency, position, labels=y_pred, names=np.arange(n))
display(SVG(image))


In [None]:
# Repeating the experiments with 3 nodes from one cluster and just one node from the other one
labels_true = np.array(labels_true)
nodes_cluster0 = np.where(labels_true == 0)[0]
nodes_cluster1 = np.where(labels_true == 1)[0]
node0 = np.random.choice(nodes_cluster0)
nodes1 = np.random.choice(nodes_cluster1, 3, replace=False)
y_partial = -np.ones(n, dtype=int)
y_partial[node0] = 0
y_partial[nodes1] = 1

classifier = DiffusionClassifier()

y_pred = classifier.fit_predict(adjacency, y_partial)
accuracy = accuracy_score(labels_true, y_pred)
print(f"Accuracy: {accuracy:.2f}")

image = visualize_graph(adjacency, position, labels=y_pred, names=np.arange(n))
display(SVG(image))


### No centering
When temperature centering is disabled, the diffusion process may produce biased class probabilities. The raw heat values that propagate through the graph are not adjusted to account for the global balance between classes, so nodes closer to a labeled example may receive stronger signals regardless of their structural similarity to other regions. In some runs, the accuracy even falls below 50%.

### Balanced (1 vs 1) vs. imbalanced (3 vs 1) labeling
In a balanced setup where one labeled node is chosen per class, the classifier treats both classes equally, and the diffusion process is less biased. However, when more nodes are labeled in one class than the other (e.g., 3 vs 1), the diffusion inherently favors the majority-labeled class.
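
The interplay between seed imbalance and centering can be illustrated on a toy path graph. This is a minimal sketch, assuming one diffusion per class (seeds of the class are hot, other seeds are cold) followed by an `argmax`, with or without subtracting the mean temperature of each class. It is not the `DiffusionClassifier` implementation; `solve_dirichlet` is a hypothetical helper that solves the diffusion exactly.

```python
import numpy as np

def solve_dirichlet(adjacency, seeds):
    """Exact Dirichlet solution: seeds are fixed, free nodes average their neighbors."""
    n = adjacency.shape[0]
    seed_nodes = list(seeds)
    seed_values = np.array(list(seeds.values()), dtype=float)
    free = np.setdiff1d(np.arange(n), seed_nodes)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    temperatures = np.zeros(n)
    temperatures[seed_nodes] = seed_values
    rhs = adjacency[np.ix_(free, seed_nodes)] @ seed_values
    temperatures[free] = np.linalg.solve(laplacian[np.ix_(free, free)], rhs)
    return temperatures

# Path graph 0 - 1 - ... - 7
n = 8
adjacency = np.zeros((n, n))
for i in range(n - 1):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1

seed_labels = {0: 0, 5: 1, 6: 1, 7: 1}  # 1 seed for class 0, 3 seeds for class 1

# One diffusion per class: seeds of the class are hot (1), all other seeds are cold (0).
temperatures = np.array([
    solve_dirichlet(adjacency, {node: float(label == c) for node, label in seed_labels.items()})
    for c in (0, 1)
])

pred_raw = temperatures.argmax(axis=0)
centered = temperatures - temperatures.mean(axis=1, keepdims=True)
pred_centered = centered.argmax(axis=0)
print(pred_raw)       # [0 0 0 1 1 1 1 1]
print(pred_centered)  # [0 0 0 0 1 1 1 1]
```

In this toy case the raw temperatures give five nodes to the majority class, while centering moves the boundary back to the middle of the path.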

## Openflights


We now show how to classify the nodes of a graph **without labels**. We identify some nodes of interest and use these seeds to classify the other nodes.

In [None]:
dataset = openflights

In [None]:
adjacency = dataset.adjacency
position = dataset.position
names = dataset.names

In [None]:
image = visualize_graph(adjacency, position, width=800, height=400, node_size=3, display_edges=False)
SVG(image)

## To do

* Display the same world map with the labels predicted for 3 seeds (Paris, New York, Beijing), each with its own label.
* Add Madrid with another label and comment on the results.

In [None]:
paris = 622
newyork = 1842
beijing = 1618
madrid = 572

In [None]:
# Clustering with only three Labels
n = len(names)
y_partial = -np.ones(n, dtype=int)
y_partial[paris] = 0
y_partial[newyork] = 1
y_partial[beijing] = 2
classifier = DiffusionClassifier()
y_pred = classifier.fit_predict(adjacency, y_partial)

image = visualize_graph(adjacency, position, labels=y_pred, width=800, height=400, node_size=3, display_edges=False)
display(SVG(image))


In [None]:
# Add Madrid as a fourth seed (label 3)
y_partial[madrid] = 3
classifier = DiffusionClassifier()
y_pred = classifier.fit_predict(adjacency, y_partial)

image = visualize_graph(adjacency, position, labels=y_pred, width=800, height=400, node_size=3, display_edges=False)
display(SVG(image))


## Comments on the results
With only three labeled cities, the diffusion classifier propagates labels through the graph according to connectivity. Most airports are assigned to the region of their closest and best-connected seed city: Europe and Africa are mostly influenced by Paris, North America by New York, and Asia by Beijing. This division reflects global airline route patterns quite well.

Introducing Madrid creates a finer segmentation of Europe and possibly parts of Africa or Latin America. Madrid is geographically close to Paris but may have distinct flight routes, especially toward Spanish-speaking countries in South America, so it splits the region previously dominated by Paris. Some nodes previously labeled like Paris may now get the Madrid label, especially if they are better connected to Madrid than to Paris. This shows how sensitive diffusion is to the choice and number of seed nodes: adding a single seed can shift the boundaries significantly.


Finally, we use **contrastive ranking**.

## To do

* List the top-10 airports that are the closest to Tokyo in terms of Personalized PageRank.
* List the top-10 airports that are both the closest to Tokyo and the furthest from Paris Charles de Gaulle, using heat diffusion (contrastive ranking).<br>What do you observe?
* Give the temperatures of these airports after heat diffusion and explain the result.

In [None]:
tokyo = 1084

In [None]:
# Top 10 airports that are the closest to Tokyo in terms of PPR
pagerank = PageRank()
scores = pagerank.fit_predict(adjacency, weights={tokyo: 1})
top_pagerank = np.argsort(-scores)[:10]
print(names[top_pagerank])

In [None]:
# Top 10 airports closest to Tokyo and furthest from Paris
dirichlet = Dirichlet()
values = {paris: 0, tokyo: 1}
temperatures = dirichlet.fit_predict(adjacency, values)
top_diffusion = np.argsort(-temperatures)[:10]
print("Top 10 airports close to Tokyo and far from Paris (by heat diffusion):")
for i in top_diffusion:
    print(f"{names[i]}: temperature = {temperatures[i]:.4f}")

Personalized PageRank (PPR) ranks airports based on how frequently they’re visited during a biased random walk starting from Tokyo. It highlights airports that are not only geographically close but also highly connected to Tokyo in the global network. As a result, the top airports tend to be major East Asian hubs like Osaka or Seoul, which are frequently reached from Tokyo and serve as central nodes in the regional flight network.
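
The random walk with restart behind PPR can be sketched as a power iteration on a toy graph. This is an illustration in plain NumPy, not the `sknetwork` implementation; the function `personalized_pagerank`, the damping factor of 0.85, and the number of iterations are assumptions.

```python
import numpy as np

def personalized_pagerank(adjacency, seed, damping=0.85, n_iter=100):
    """Power iteration for a random walk that restarts at `seed` with probability 1 - damping."""
    n = adjacency.shape[0]
    restart = np.zeros(n)
    restart[seed] = 1
    # Row-stochastic transition matrix of the random walk (assumes no isolated node).
    transition = adjacency / adjacency.sum(axis=1, keepdims=True)
    scores = restart.copy()
    for _ in range(n_iter):
        scores = damping * (transition.T @ scores) + (1 - damping) * restart
    return scores

# Toy graph: hub 0 linked to nodes 1, 2 and 3; node 3 is also linked to node 4.
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
adjacency = np.zeros((5, 5))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

scores = personalized_pagerank(adjacency, seed=4)
ranking = np.argsort(-scores)  # nodes sorted by decreasing score
print(ranking, scores.round(3))
```

The scores sum to 1 and depend on both proximity to the seed and node degree, which is why well-connected hubs tend to rank high.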

Contrastive heat diffusion using the Dirichlet method offers a different perspective by setting Tokyo as a hot source (1) and Paris as a cold one (0). It computes which airports are structurally closer to Tokyo and farther from Paris. This leads to a more regionally focused ranking, emphasizing airports that fall under Tokyo’s sphere of influence while being distant from European hubs. It’s more effective at revealing clear separations between different zones in the network.

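The temperatures of these top airports are equal to 1. This suggests that they are connected to the rest of the network only through Tokyo (for example, airports whose only route goes to Tokyo), so no cold from Paris can reach them. Tokyo itself, as the hot source, also has temperature 1 and may appear in the list.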
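
A temperature exactly equal to 1 can be reproduced on a toy graph where one node is attached to the hot source only. The sketch below iterates neighbor averaging with the seeds held fixed; the graph, the initial temperature of 0.5, and the number of iterations are assumptions chosen for illustration.

```python
import numpy as np

# Toy graph: hot seed 0, cold seed 4, path 0 - 2 - 3 - 4,
# plus node 1 attached to the hot seed only.
edges = [(0, 1), (0, 2), (2, 3), (3, 4)]
n = 5
adjacency = np.zeros((n, n))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

seeds = {0: 1.0, 4: 0.0}
seed_nodes = list(seeds)
seed_values = list(seeds.values())

degrees = adjacency.sum(axis=1)
temperatures = np.full(n, 0.5)  # arbitrary initial temperature of free nodes
temperatures[seed_nodes] = seed_values
for _ in range(100):
    # Each node takes the average temperature of its neighbors...
    temperatures = adjacency @ temperatures / degrees
    # ...except the seeds, whose temperature is held fixed.
    temperatures[seed_nodes] = seed_values

print(temperatures.round(3))
```

Node 1 ends at temperature 1 because its only neighbor is the hot seed, while nodes 2 and 3, which lie between the two seeds, settle near 2/3 and 1/3.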

## Wikipedia Vitals

We now consider the graph of links of WikiVitals, considered as undirected. We first focus on **contrastive ranking**, then on **node classification**.

In [None]:
dataset = wikivitals

In [None]:
adjacency = dataset.adjacency
names = dataset.names
labels = dataset.labels
names_labels = dataset.names_labels

In [None]:
adjacency = directed2undirected(adjacency)

## Contrastive ranking

## To do

* List the top-10 articles that are close to **Cat** and **Dog** in terms of Personalized PageRank.
* Compare with the list of top-10 articles that are close to **Cat** and **Dog** and far from **Bear** and **Tiger** using heat diffusion (contrastive ranking).
* List the top-10 articles that are close to **Bear** and **Tiger** and far from **Cat** and **Dog** (you can use previous diffusion). Interpret the results.

In [None]:
pagerank = PageRank()

In [None]:
#  Top 10 articles close to Cat and Dog
cat = np.flatnonzero(names == "Cat")[0]
dog = np.flatnonzero(names == "Dog")[0]
scores = pagerank.fit_predict(adjacency, weights={cat: 1, dog: 1}, force_bipartite=True)
top_pagerank = np.argsort(-scores)[:10]
print(names[top_pagerank])

In [None]:
# Top 10 closest to cat dog and furthest to Bear and tiger
bear = np.flatnonzero(names == "Bear")[0]
tiger = np.flatnonzero(names == "Tiger")[0]
dirichlet = Dirichlet()
values = {cat: 1, dog: 1, bear: 0, tiger: 0}
temperatures = dirichlet.fit_predict(adjacency, values, force_bipartite=True)
top_diffusion = np.argsort(-temperatures)[:10]
print("Top 10 articles close to cat dog and far from bear tiger: ", names[top_diffusion])


In [None]:
# Top 10 closest to Bear and Tiger and furthest from Cat and Dog:
# these are the 10 coldest nodes of the previous diffusion
top_diffusion = np.argsort(temperatures)[:10]
print("Top 10 articles far from cat dog and closest to bear tiger: ", names[top_diffusion])

The PPR results give articles that are directly or indirectly connected to "Cat" and "Dog", like "Domestic Animal", "Pet", or other related animals. However, this method does not take any negative influence into account (e.g., Bear or Tiger), so high scores can also go to general animal-related nodes.

Contrastive diffusion adds nuance: it not only finds what is close to "Cat" and "Dog" but also penalizes nodes that are related to "Bear" and "Tiger". As a result, the top results are more specific to the pet/domestic animal theme, filtering out unrelated wild animals.

Similarly, in the reversed direction (Bear and Tiger hot, Cat and Dog cold), the diffusion reveals a wild-animal cluster: articles about predators, wilderness, or ecosystems far from domesticated animals, as well as legendary creatures.
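
The reversed ranking does not require a second diffusion: swapping the hot and cold seeds maps every temperature $T$ to $1 - T$, so the coldest nodes of the first diffusion are the hottest nodes of the reversed one. The sketch below checks this on a toy graph with an exact solver; `solve_dirichlet` and the graph are hypothetical and only serve the illustration.

```python
import numpy as np

def solve_dirichlet(adjacency, seeds):
    """Exact Dirichlet solution: seeds are fixed, free nodes average their neighbors."""
    n = adjacency.shape[0]
    seed_nodes = list(seeds)
    seed_values = np.array(list(seeds.values()), dtype=float)
    free = np.setdiff1d(np.arange(n), seed_nodes)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    temperatures = np.zeros(n)
    temperatures[seed_nodes] = seed_values
    rhs = adjacency[np.ix_(free, seed_nodes)] @ seed_values
    temperatures[free] = np.linalg.solve(laplacian[np.ix_(free, free)], rhs)
    return temperatures

# Toy graph: cycle 0-1-2-3-4-5-0 plus node 6 linked to nodes 0 and 2.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (2, 6), (6, 0)]
adjacency = np.zeros((7, 7))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

hot, cold = [0, 1], [3, 4]
temperatures = solve_dirichlet(adjacency, {**{i: 1 for i in hot}, **{i: 0 for i in cold}})
reversed_temperatures = solve_dirichlet(adjacency, {**{i: 0 for i in hot}, **{i: 1 for i in cold}})

# Swapping hot and cold seeds maps every temperature T to 1 - T.
print(np.allclose(reversed_temperatures, 1 - temperatures))  # True
```

With an iterative solver stopped early, the identity may hold only approximately.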

## Node classification

We now use the ground-truth labels provided by the category of each article.

In [None]:
print(names_labels)

## To do

* What is the accuracy of node classification, using half of the nodes in the train set?
* Display the confusion matrix. What is the most difficult class to predict?
* Give examples of misclassification for each label (e.g., 2 false positives and 2 false negatives per label).

In [None]:
# Accuracy of node classification, using half of the nodes as the train set
dataset = wikivitals
adjacency = dataset.adjacency
labels_true = dataset.labels
classifier = DiffusionClassifier()
labels = labels_true.copy()
mask = np.random.random(size=len(labels_true)) < 0.5
labels[~mask] = -1
labels_pred = classifier.fit_predict(adjacency, labels)
print("Accuracy : ", np.mean(labels_pred[~mask] == labels_true[~mask]))

In [None]:
cm = confusion_matrix(labels_true[~mask], labels_pred[~mask])

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix on Test Set")
plt.xlabel("Predicted Labels")
plt.ylabel("True Labels")
plt.show()

In [None]:
# Examples of misclassifications
unique_labels = np.unique(labels_true)
test_indices = np.where(~mask)[0]
print("Examples of Misclassifications:\n")
for label in unique_labels:
    false_positives = []
    false_negatives = []
    for idx in test_indices:
        true = labels_true[idx]
        pred = labels_pred[idx]
        if pred == label and true != label:
            false_positives.append(idx)
        if true == label and pred != label:
            false_negatives.append(idx)
    label_name = names_labels[label]
    print(f"Label  {label_name}:")
    print("-False Positives:")
    for i in false_positives[:2]:
        print(f"{names[i]} — True = {labels_true[i]}, Pred = {labels_pred[i]}")
    print("-False Negatives:")
    for i in false_negatives[:2]:
        print(f"{names[i]} — True = {labels_true[i]}, Pred = {labels_pred[i]}")
    print()


## To do

* Classify each article of the **People** category in one of the other categories (Mathematics, History, etc.), using all other labels.
* List the top-5 predicted people of each category using Personalized PageRank.

In [None]:
# Classify each article of the People category in other categories
people = np.flatnonzero(names_labels == "People")[0]
labels = labels_true.copy()
labels[labels == people] = -1
labels_pred = classifier.fit_predict(adjacency, labels)

In [None]:
# Top 5 people of each category
pagerank = PageRank()
for label in np.unique(labels_true):
    if label != people:  # use the index of the People label instead of a hard-coded 6
        mask = (labels_pred == label) & (labels_true == people)
        scores = pagerank.fit_predict(adjacency, mask, force_bipartite=True)
        scores *= mask  # keep only the people predicted in this category
        print(names_labels[label], f"({np.sum(mask)})")
        print(names[np.argsort(-scores)[:5]])
        print()

## 2. Bipartite graphs

Finally, we consider the bipartite graph between articles and words in WikiVitals.

In [None]:
dataset = wikivitals
biadjacency = dataset.biadjacency

In [None]:
words = dataset.names_col

## To do

* Repeat the experiments on node classification using the bipartite graph between articles and words.
* Which information seems to be richer, the text or the links between articles?
* Propose a solution exploiting both sources of information.

In [None]:
classifier = DiffusionClassifier()
labels = labels_true.copy()
mask = np.random.random(size=len(labels_true)) < 0.5
labels[~mask] = -1
labels_pred = classifier.fit_predict(biadjacency, labels)
print(np.mean(labels_pred[~mask] == labels_true[~mask]))

In [None]:
# Classify People from the text, then rank them in each category
labels = labels_true.copy()
labels[labels == people] = -1
labels_pred = classifier.fit_predict(biadjacency, labels)
pagerank = PageRank()
for label in np.unique(labels_true):
    if label != people:  # use the index of the People label instead of a hard-coded 6
        mask = (labels_pred == label) & (labels_true == people)
        scores = pagerank.fit_predict(biadjacency, mask)
        scores *= mask  # keep only the people predicted in this category
        print(names_labels[label], f"({np.sum(mask)})")
        print(names[np.argsort(-scores)[:5]])
        print()

## Which is richer, the text or the links?
Article links (original graph):
Often reflect human-curated structure (like Wikipedia hyperlinks), which capture semantic relationships between topics.

Text (bipartite graph):
Encodes more semantic content directly. It works well if the text contains distinctive vocabulary per topic — especially for science/history categories.

## Proposal to use both sources of information
To effectively leverage both article links and textual content for node classification, a powerful approach is to construct a combined graph that integrates both sources of information. This can be done by forming a block adjacency matrix where articles are connected to each other via their original hyperlinks and simultaneously linked to the words they contain through a bipartite structure. Specifically, the article-article adjacency matrix (capturing link structure) is placed in the top-left, and the article-word biadjacency matrix (capturing textual content) in the top-right and bottom-left blocks, forming an augmented graph. By applying a diffusion-based method or a graph neural network on this unified graph, information can propagate through both semantic and structural pathways, enabling more accurate and robust classification. This fusion effectively combines the global structure of hyperlinks with the local semantic signals from text.
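
The block adjacency matrix described above can be sketched with `scipy.sparse.bmat`. The toy matrices below (3 articles, 4 words) and the variable names are hypothetical; only the layout of the blocks follows the description. Word nodes carry no label, so the label vector is padded with `-1`.

```python
import numpy as np
from scipy import sparse

# Toy data (hypothetical): 3 articles, 4 words.
adjacency = sparse.csr_matrix(np.array([[0, 1, 0],
                                        [1, 0, 1],
                                        [0, 1, 0]]))
biadjacency = sparse.csr_matrix(np.array([[1, 0, 2, 0],
                                          [0, 1, 0, 0],
                                          [0, 0, 1, 3]]))

# Block adjacency matrix: articles first, then words.
# Top-left: article links; top-right and bottom-left: article-word links;
# bottom-right: empty block (no word-word edges).
combined = sparse.bmat([[adjacency, biadjacency],
                        [biadjacency.T, None]], format='csr')

# Only articles carry labels; words (and test articles) are unlabeled (-1).
labels_articles = np.array([0, -1, 1])
n_words = biadjacency.shape[1]
labels_combined = np.concatenate([labels_articles, -np.ones(n_words, dtype=int)])

print(combined.shape, labels_combined)
```

Predictions on the combined graph would then be read from the first entries, which correspond to the articles. Since link weights and word counts may live on different scales, rescaling one of the blocks before stacking is a possible refinement.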

In [None]:
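# `labels` and `mask` were overwritten in the previous cell: rebuild the train/test split
mask = np.random.random(size=len(labels_true)) < 0.5
labels = np.where(mask, labels_true, -1)
full_matrix = sparse.hstack((adjacency, biadjacency), format='csr')
labels_pred = classifier.fit_predict(full_matrix, labels)
print(np.mean(labels_pred[~mask] == labels_true[~mask]))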