# SD212: Graph mining
## Lab 5: Heat diffusion

In this lab, you will learn to use diffusion for **contrastive ranking** and **classification** of nodes.

## Import

In [None]:
from IPython.display import SVG

In [None]:
import numpy as np
from scipy import sparse

In [None]:
from sknetwork.data import load_netset, grid, karate_club
from sknetwork.regression import Dirichlet
from sknetwork.classification import DirichletClassifier
from sknetwork.ranking import PageRank, top_k
from sknetwork.utils import membership_matrix
from sknetwork.visualization import svg_graph

## Data

We will work on the following graphs (see the [NetSet](https://netset.telecom-paris.fr/) collection for details):
* Openflights (graph)
* WikiVitals (directed graph + bipartite graph)

In [None]:
openflights = load_netset('openflights')
wikivitals = load_netset('wikivitals')

## 1. Graphs

## Grid

We first illustrate the notion of **contrastive ranking** on a grid.
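
To make the idea concrete before using the library, here is a minimal NumPy sketch of heat diffusion with fixed sources (the Dirichlet problem) on a path graph. The helper `dirichlet_temperatures` is hypothetical and written for illustration only; it is not necessarily how `Dirichlet` is implemented in scikit-network. Each free node repeatedly takes the average temperature of its neighbors while the seeds keep their imposed temperature.

```python
import numpy as np


def dirichlet_temperatures(adjacency, seeds, n_iter=1000):
    """Average neighbor temperatures repeatedly, keeping seed nodes fixed (illustrative helper)."""
    n = adjacency.shape[0]
    degrees = adjacency.sum(axis=1)  # assumes no isolated node
    seed_nodes = np.array(list(seeds.keys()))
    seed_values = np.array(list(seeds.values()), dtype=float)
    # free nodes start at the mean temperature of the seeds
    temperatures = np.full(n, seed_values.mean())
    temperatures[seed_nodes] = seed_values
    for _ in range(n_iter):
        # each node takes the average temperature of its neighbors
        temperatures = adjacency @ temperatures / degrees
        # seeds are reset to their imposed temperature
        temperatures[seed_nodes] = seed_values
    return temperatures


# path graph with 5 nodes: 0 - 1 - 2 - 3 - 4
n = 5
path = np.zeros((n, n))
for i in range(n - 1):
    path[i, i + 1] = path[i + 1, i] = 1

# hot source at node 0, cold source at node 4
temperatures = dirichlet_temperatures(path, {0: 1, 4: 0})
print(np.round(temperatures, 3))
```

On a path, the equilibrium temperature decreases linearly from the hot source to the cold source, which is the one-dimensional analogue of what the grid experiment shows.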

In [None]:
k = 5
graph = grid(k, k, True)
adjacency = graph.adjacency
position = graph.position

In [None]:
image = svg_graph(adjacency, position, names=np.arange(k * k), width=200, height=200)
SVG(image)

## To do

* Display the graph with 1 hot source and 1 cold source, located at opposite corners of the square.
* Do the same with $k$ hot sources, at the bottom of the square, and 2 cold sources, at the top corners.<br>
What is the free node of highest temperature?

In [None]:
dirichlet = Dirichlet()

In [None]:
corner_sw = 0
corner_ne = k * k - 1

In [None]:
seeds = {corner_ne: 0, corner_sw: 1}

In [None]:
temperatures = dirichlet.fit_transform(adjacency, seeds)

In [None]:
image = svg_graph(adjacency, position, scores=temperatures, width=200, height=200)
SVG(image)

In [None]:
bottom_side = k * np.arange(k)

In [None]:
corner_nw = k - 1

In [None]:
seeds = {i: 1 for i in bottom_side}
seeds.update({corner_nw: 0, corner_ne: 0})

In [None]:
temperatures = dirichlet.fit_transform(adjacency, seeds)

In [None]:
image = svg_graph(adjacency, position, scores=temperatures, names=np.arange(k*k), width=200, height=200)
SVG(image)

In [None]:
temperatures

## Karate Club


We now consider the **classification** of nodes by the Dirichlet method. We use the [karate club graph](https://en.wikipedia.org/wiki/Zachary%27s_karate_club) that has ground-truth labels.
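
As a rough illustration of the principle, classification by diffusion can be sketched as one diffusion per label: the seeds of that label are hot, the other seeds are cold, and each node receives the label for which it ends up hottest. The helpers `diffuse` and `classify` below are hypothetical names, and this one-against-all scheme is an assumption about the general approach; `DirichletClassifier` may differ in its details.

```python
import numpy as np


def diffuse(adjacency, seed_nodes, seed_values, n_iter=200):
    """Heat diffusion with fixed seed temperatures (illustrative helper)."""
    degrees = adjacency.sum(axis=1)  # assumes no isolated node
    temperatures = np.full(adjacency.shape[0], seed_values.mean())
    temperatures[seed_nodes] = seed_values
    for _ in range(n_iter):
        temperatures = adjacency @ temperatures / degrees
        temperatures[seed_nodes] = seed_values
    return temperatures


def classify(adjacency, seeds):
    """Assign to each node the label whose diffusion makes it hottest."""
    seed_nodes = np.array(list(seeds.keys()))
    seed_labels = np.array(list(seeds.values()))
    classes = np.unique(seed_labels)
    # one diffusion per label: seeds of that label are hot (1), other seeds are cold (0)
    scores = np.array([
        diffuse(adjacency, seed_nodes, (seed_labels == c).astype(float))
        for c in classes
    ])
    return classes[np.argmax(scores, axis=0)]


# toy graph: two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2 - 3
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = np.zeros((6, 6))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1

# one seed per triangle
labels_pred = classify(adjacency, {0: 0, 5: 1})
print(labels_pred)
```

With one seed in each triangle, every node takes the label of the seed in its own triangle.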

In [None]:
graph = karate_club(True)

In [None]:
adjacency = graph.adjacency
position = graph.position
labels_true = graph.labels

In [None]:
n = len(labels_true)

In [None]:
image = svg_graph(adjacency, position, labels=labels_true, names=np.arange(n))
SVG(image)

## To do

* Select 2 seeds, one in each cluster, and display the graph with the predicted labels. <br>
What is the accuracy of the classification?
* Display the graph with the temperature of each node at equilibrium.
* Give the accuracy averaged over 20 experiments with 2 seeds selected at random, one in each cluster.<br>
Do the same with 2 seeds per cluster.

In [None]:
classifier = DirichletClassifier()

In [None]:
seeds = {i: labels_true[i] for i in [12, 23]}

In [None]:
labels = classifier.fit_transform(adjacency, seeds)

In [None]:
image = svg_graph(adjacency, position, labels=labels, seeds=seeds)
SVG(image)

In [None]:
clusters = [np.argwhere(labels_true==label).ravel() for label in set(labels_true)]

In [None]:
n_seeds = 1
seeds = {i: labels_true[i] for cluster in clusters for i in np.random.choice(cluster, n_seeds, replace=False)}

In [None]:
seeds

In [None]:
n_seeds = 1
accuracy_scores = []
for t in range(20):
    seeds = {i: labels_true[i] for cluster in clusters for i in np.random.choice(cluster, n_seeds, replace=False)}
    labels_pred = classifier.fit_transform(adjacency, seeds)
    accuracy_scores.append(np.mean(labels_pred==labels_true))

In [None]:
# 1 seed per cluster
np.mean(accuracy_scores)

In [None]:
classifier.membership_

In [None]:
temperatures = classifier.membership_[:, 1].toarray().ravel()

In [None]:
image = svg_graph(adjacency, position, scores=temperatures, seeds=seeds)
SVG(image)

## Openflights


We now show how to classify the nodes of a graph **without labels**, with manual labelling of a few nodes.

In [None]:
graph = openflights

In [None]:
adjacency = graph.adjacency
position = graph.position
names = graph.names

In [None]:
image = svg_graph(adjacency, position, width=800, height=400, node_size=3, display_edges=False)
SVG(image)

## To do

* Display the same world map with the labels predicted for 3 seeds (Paris, New York, Beijing), each with its own label.
* Add a seed in Madrid with another label and observe the result.

In [None]:
classifier = DirichletClassifier(centering=True)

In [None]:
paris = 622
newyork = 1842
beijing = 1618
madrid = 572

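In [None]:
# three seeds, each with its own label
labels = classifier.fit_transform(adjacency, seeds={paris: 0, newyork: 1, beijing: 2})

In [None]:
image = svg_graph(adjacency, position, labels=labels, width=800, height=400, node_size=3, display_edges=False)
SVG(image)

In [None]:
# add a fourth seed in Madrid with a new label
labels = classifier.fit_transform(adjacency, seeds={paris: 0, newyork: 1, beijing: 2, madrid: 3})

In [None]:
image = svg_graph(adjacency, position, labels=labels, width=800, height=400, node_size=3, display_edges=False)
SVG(image)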

Finally, we illustrate **contrastive ranking** on Openflights.

## To do

* List the top-10 airports that are close to Tokyo in terms of Personalized PageRank.
* Compare with the list of the top-10 airports that are close to Tokyo and far from Paris, in terms of heat diffusion.
* Observe the temperatures of these airports and explain the result.
* Check your guess by displaying the following aggregate graph: Tokyo, top-10 airports except Tokyo, rest of the world.<br>**Hint:** Use the function ``membership_matrix``.
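
For reference when comparing the two rankings, Personalized PageRank scores nodes by how often a random walk that restarts from the seed visits them. The following is a minimal power-iteration sketch on a toy path graph; the function name `personalized_pagerank` and its parameters are illustrative assumptions, not the scikit-network implementation.

```python
import numpy as np


def personalized_pagerank(adjacency, seed, damping_factor=0.85, n_iter=100):
    """Power iteration for PageRank with restart at a single seed (illustrative helper)."""
    n = adjacency.shape[0]
    # transition matrix of the random walk (rows sum to 1; assumes no isolated node)
    transition = adjacency / adjacency.sum(axis=1, keepdims=True)
    restart = np.zeros(n)
    restart[seed] = 1
    scores = restart.copy()
    for _ in range(n_iter):
        # follow an edge with probability damping_factor, otherwise restart at the seed
        scores = damping_factor * transition.T @ scores + (1 - damping_factor) * restart
    return scores


# path graph with 5 nodes: 0 - 1 - 2 - 3 - 4
n = 5
path = np.zeros((n, n))
for i in range(n - 1):
    path[i, i + 1] = path[i + 1, i] = 1

scores = personalized_pagerank(path, seed=0)
print(np.round(scores, 3))
```

The scores sum to 1 and depend on node degrees as well as on the distance to the seed, whereas temperatures in heat diffusion are averages of neighboring temperatures, bounded by the temperatures of the sources.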

In [None]:
{i: name for i, name in enumerate(names) if 'Tokyo' in name}

In [None]:
tokyo = 1084

In [None]:
pagerank = PageRank()

In [None]:
scores = pagerank.fit_transform(adjacency, seeds={tokyo:1})

In [None]:
names[top_k(scores, 10)]

In [None]:
temperatures = dirichlet.fit_transform(adjacency, seeds={tokyo: 1, paris: 0})

In [None]:
names[top_k(temperatures, 10)]

In [None]:
temperatures[top_k(temperatures, 10)]

In [None]:
# the world
labels = np.zeros(len(names), dtype=int)

In [None]:
# top-10 airports
labels[top_k(temperatures, 10)] = np.arange(10) + 1

In [None]:
# aggregation
membership = membership_matrix(labels)

In [None]:
membership

In [None]:
aggregate = membership.T.dot(adjacency).dot(membership).tocsr()
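
The aggregation step can be checked by hand on a toy example. The sketch below builds a membership matrix directly with SciPy (an assumed stand-in for what `membership_matrix` returns: one row per node, one column per group) and computes the aggregate graph as the product of the transposed membership matrix, the adjacency matrix and the membership matrix.

```python
import numpy as np
from scipy import sparse

# 5 nodes assigned to 3 groups: {0, 1}, {2, 3}, {4}
labels = np.array([0, 0, 1, 1, 2])
n = len(labels)

# membership matrix: entry (i, k) is 1 if node i belongs to group k
membership = sparse.csr_matrix(
    (np.ones(n), (np.arange(n), labels)), shape=(n, labels.max() + 1)
)

# undirected path graph: 0 - 1 - 2 - 3 - 4
row = np.array([0, 1, 2, 3])
col = np.array([1, 2, 3, 4])
adjacency = sparse.csr_matrix((np.ones(4), (row, col)), shape=(n, n))
adjacency = adjacency + adjacency.T

# entry (k, l) sums the weights of the edges between group k and group l
aggregate = (membership.T @ adjacency @ membership).toarray()
print(aggregate)
```

Diagonal entries count each internal edge of a group twice (once per direction), so a large diagonal value for the group "world" is expected in the Openflights aggregate.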

In [None]:
aggregate

In [None]:
SVG(svg_graph(aggregate, names=['world'] + list(names[top_k(temperatures, 10)])))

## 2. Directed and bipartite graphs

We now consider the graph of links of WikiVitals. We first focus on **contrastive ranking**, then on **node classification**.

## Wikipedia Vitals

In [None]:
graph = wikivitals

In [None]:
adjacency = graph.adjacency
names = graph.names
labels = graph.labels

## To do

* List the top-10 articles that are close to **Cat** and **Dog** in terms of Personalized PageRank, considering the graph as bipartite.
* Compare with the list of top-10 articles that are close to **Cat** and **Dog** and far from **Bear** and **Tiger** using diffusion.
* List the top-10 articles that are close to **Bear** and **Tiger** and far from **Cat** and **Dog**.<br>
**Note:** You can reuse the previous diffusion.
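
Considering a directed graph as bipartite can be pictured as follows: each node is duplicated into a source copy and a target copy, and each link from $i$ to $j$ becomes an undirected edge between the source copy of $i$ and the target copy of $j$. The sketch below builds this graph explicitly with SciPy; that this is what the option `force_bipartite=True` does internally is an assumption, and the variable names are illustrative.

```python
import numpy as np
from scipy import sparse

# directed graph with links 0 -> 1, 0 -> 2, 1 -> 2
adjacency = sparse.csr_matrix(np.array([[0, 1, 1],
                                        [0, 0, 1],
                                        [0, 0, 0]]))
n = adjacency.shape[0]

# bipartite view: nodes 0..n-1 are source copies, nodes n..2n-1 are target copies
bipartite = sparse.bmat([[None, adjacency], [adjacency.T, None]], format='csr')

print(bipartite.toarray())
```

The resulting matrix is symmetric with $2n$ nodes, so methods designed for undirected graphs apply to it directly.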

In [None]:
pagerank = PageRank()

In [None]:
# index of the first (and only) matching name; avoids converting a 2D array to int
cat = int(np.flatnonzero(names == 'Cat')[0])
dog = int(np.flatnonzero(names == 'Dog')[0])

In [None]:
scores = pagerank.fit_transform(adjacency, seeds={cat:1, dog:1}, force_bipartite=True)

In [None]:
names[top_k(scores, 10)]

In [None]:
# index of the first (and only) matching name; avoids converting a 2D array to int
tiger = int(np.flatnonzero(names == 'Tiger')[0])
bear = int(np.flatnonzero(names == 'Bear')[0])

In [None]:
temperatures = dirichlet.fit_transform(adjacency, {cat:1, dog:1, bear:0, tiger:0})

In [None]:
names[top_k(temperatures, 10)]

In [None]:
names[top_k(-temperatures, 10)]

For node classification, we hide the labels of half of the nodes and try to predict these labels from the labels of the remaining nodes.

## To do

* What is the F1 score of the classification?
* Do the same by restricting the classification to all labels except the label **People**.<br>
What do you observe? Interpret this result.

In [None]:
graph.names_labels

In [None]:
mask_train = np.ones(len(names), dtype=bool)
mask_test = np.zeros(len(names), dtype=bool)
for label in set(labels):
    index_label = np.argwhere(labels==label).ravel()
    index_test = np.random.choice(index_label, size=int(len(index_label)/2), replace=False)
    mask_train[index_test] = 0
    mask_test[index_test] = 1

In [None]:
np.sum(mask_train)

In [None]:
np.sum(mask_test)

In [None]:
any(mask_test&mask_train)

In [None]:
labels_train = -np.ones(len(names), dtype=int)
labels_train[mask_train] = labels[mask_train]

In [None]:
labels_pred = classifier.fit_transform(adjacency, labels_train)

In [None]:
from sklearn.metrics import f1_score

In [None]:
f1_score(labels[mask_test], labels_pred[mask_test], average='micro')
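
When each node has exactly one true label and one predicted label, the micro-averaged F1 score coincides with the accuracy, because every error counts once as a false positive (for the predicted label) and once as a false negative (for the true label). The following sketch checks this on a toy example; `micro_f1` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
import numpy as np


def micro_f1(labels_true, labels_pred):
    """Micro-averaged F1 score from pooled counts over all labels (illustrative helper)."""
    classes = np.unique(np.concatenate([labels_true, labels_pred]))
    tp = fp = fn = 0
    for c in classes:
        tp += np.sum((labels_pred == c) & (labels_true == c))
        fp += np.sum((labels_pred == c) & (labels_true != c))
        fn += np.sum((labels_pred != c) & (labels_true == c))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels_true = np.array([0, 0, 1, 1, 2, 2])
labels_pred = np.array([0, 1, 1, 1, 2, 0])

score = micro_f1(labels_true, labels_pred)
accuracy = np.mean(labels_true == labels_pred)
print(score, accuracy)
```

Other averaging modes, such as `average='macro'`, weight each label equally and generally give a different value when labels are imbalanced.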

In [None]:
labels_train = -np.ones(len(names), dtype=int)
# keep all labels except label 6 (People, see graph.names_labels)
labels_train[labels!=6] = labels[labels!=6]

In [None]:
labels_pred = classifier.fit_transform(adjacency, labels_train)

In [None]:
# People articles predicted with label 2 (see graph.names_labels for the category name)
names[(labels==6)&(labels_pred==2)]

In [None]:
graph.names_labels

## To do

* Classify each article of the **People** category in one of the other categories (Mathematics, History, etc.), using all other labels.
* List the top-5 people of each category using PageRank.
* Do the same experiments with the bipartite graph between articles and words.

In [None]:
biadjacency = graph.biadjacency