# SD212: Graph mining
## Lab 3: PageRank

In this lab, you will learn to compute, use and interpret various [PageRank](https://en.wikipedia.org/wiki/PageRank) scores.

## Import

In [None]:
from IPython.display import SVG

In [None]:
import numpy as np
from scipy import sparse

In [None]:
from sknetwork.data import load_netset, linear_graph, miserables
from sknetwork.linalg import normalize
from sknetwork.ranking import PageRank, BiPageRank, top_k
from sknetwork.visualization import svg_graph

## Data

We will work on the following graphs (see the [NetSet](https://netset.telecom-paris.fr/) collection for details):
* Openflights (graph)
* WikiVitals (digraph)
* Cinema (bigraph)

In [None]:
openflights = load_netset('openflights')
wikivitals = load_netset('wikivitals')
cinema = load_netset('cinema')

## 1. Graphs

PageRank scores correspond to the stationary distribution of a random walk that restarts with probability $1-\alpha$ at each step. Unless otherwise specified, we use the default damping factor $\alpha = 0.85$, and the restart distribution is uniform over the set of nodes.
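
To make this definition concrete, here is a minimal NumPy sketch of PageRank by power iteration. It is an illustration only, not the solver used by `sknetwork`: the function name `pagerank_power_iteration`, the fixed number of iterations, and the rule that the walk always restarts from a node without outgoing edges are all assumptions made for this example.

```python
import numpy as np


def pagerank_power_iteration(adjacency, alpha=0.85, n_iter=100):
    """Approximate PageRank scores by power iteration (illustrative sketch)."""
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    out_weights = adjacency.sum(axis=1)
    # Transition matrix of the random walk; rows of nodes without
    # outgoing edges are left at zero.
    transition = np.divide(adjacency, out_weights[:, None],
                           out=np.zeros_like(adjacency),
                           where=out_weights[:, None] > 0)
    restart = np.ones(n) / n  # uniform restart distribution
    # Probability of following an edge: alpha, or 0 for nodes without
    # outgoing edges (assumption: the walk then always restarts).
    follow = alpha * (out_weights > 0)
    scores = restart.copy()
    for _ in range(n_iter):
        moved = transition.T @ (follow * scores)
        restarted = ((1 - follow) @ scores) * restart
        scores = moved + restarted
    return scores


# Toy example: undirected path on 4 nodes.
toy_path = np.array([[0, 1, 0, 0],
                     [1, 0, 1, 0],
                     [0, 1, 0, 1],
                     [0, 0, 1, 0]])
toy_scores = pagerank_power_iteration(toy_path)
print(toy_scores)
```

Each iteration preserves the total probability mass, so the scores always sum to 1. On this toy path, the two inner nodes get a higher score than the two end nodes.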

## Linear graph

Consider a linear graph:

In [None]:
n = 10

In [None]:
graph = linear_graph(n, True)
adjacency = graph.adjacency
position = graph.position

In [None]:
image = svg_graph(adjacency, position, names=np.arange(n))

In [None]:
SVG(image)

## To do

* What are the two best ranked nodes? Try with different values of $\alpha$ and interpret the results.
* What is the exact PageRank vector when $\alpha=1$? You need to compute it!
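
For the second question, one possible approach (a sketch, not necessarily the intended solution) relies on a standard fact: without restart, the random walk on a connected undirected graph has a stationary distribution proportional to the node degrees. The code below builds the path graph with NumPy only and checks this numerically; the names `path_adjacency` and `stationary` are introduced for this example.

```python
import numpy as np

n_nodes = 10
# Adjacency matrix of an undirected path on n_nodes nodes.
path_adjacency = np.zeros((n_nodes, n_nodes))
idx = np.arange(n_nodes - 1)
path_adjacency[idx, idx + 1] = 1
path_adjacency[idx + 1, idx] = 1

path_degrees = path_adjacency.sum(axis=1)
# Transition matrix of the random walk without restart (alpha = 1).
path_transition = path_adjacency / path_degrees[:, None]

# Candidate stationary distribution: proportional to the degrees.
stationary = path_degrees / path_degrees.sum()

# A stationary distribution is left unchanged by one step of the walk.
print(np.allclose(stationary @ path_transition, stationary))
print(stationary)
```

Note that the path graph is bipartite, so the walk without restart is periodic: plain power iteration may oscillate instead of converging when $\alpha=1$, which is one reason to compute the vector exactly.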

In [None]:
pagerank = PageRank(damping_factor=0.85, solver='lanczos')

In [None]:
scores = pagerank.fit_transform(adjacency)

In [None]:
image = svg_graph(adjacency, position, names=np.arange(n), scores=scores)

In [None]:
SVG(image)

## Les Misérables


In [None]:
graph = miserables(True)

In [None]:
adjacency = graph.adjacency
position = graph.position
names = graph.names

In [None]:
image = svg_graph(adjacency, position, names, scale=2)

In [None]:
SVG(image)

## To do

* Display the graph with PageRank scores.
* List the 10 best ranked characters.
* Compare with:
    1. the 10 nodes of highest degree,
    2. the 10 nodes of highest weight.
* Try different values of $\alpha$ (e.g., 0.1, 0.9, 0.99) and interpret the results.

In [None]:
pagerank = PageRank(damping_factor=0.1, solver='lanczos')

In [None]:
scores = pagerank.fit_transform(adjacency)

In [None]:
# Display scores on a log scale so that differences are more visible
image = svg_graph(adjacency, position, names, scores=np.log(scores), scale=2)

In [None]:
SVG(image)

In [None]:
top_nodes = top_k(scores, 10)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
n = len(names)

In [None]:
weights = adjacency.dot(np.ones(n))

In [None]:
top_nodes = top_k(weights, 10)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
degrees = (adjacency > 0).dot(np.ones(n))

In [None]:
top_nodes = top_k(degrees, 10)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

## Openflights


In [None]:
graph = openflights

In [None]:
adjacency = graph.adjacency
position = graph.position
names = graph.names

In [None]:
image = svg_graph(adjacency, position, width=800, height=400, display_node_weight=True, display_edges=False)

In [None]:
SVG(image)

## To do

* Display the same world map with PageRank scores.
* List the 10 best ranked airports, and compare with the 10 airports of highest traffic.
* Display the world map with Personalized PageRank scores, starting from Tokyo international airport.
* List the corresponding 10 best ranked airports.

In [None]:
pagerank = PageRank()

In [None]:
scores = pagerank.fit_transform(adjacency)

In [None]:
image = svg_graph(adjacency, position, scores=np.log(scores), node_order=np.argsort(scores), width=800, height=400, display_node_weight=True, display_edges=False)

In [None]:
SVG(image)

In [None]:
top_nodes = top_k(scores, 10)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
n = adjacency.shape[0]

In [None]:
weights = adjacency.dot(np.ones(n))

In [None]:
top_nodes = top_k(weights, 10)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
{i: name for i, name in enumerate(names) if 'Tokyo' in name}

In [None]:
tokyo = 1084  # index read from the search output above

In [None]:
pagerank = PageRank()

In [None]:
scores = pagerank.fit_transform(adjacency, seeds={tokyo: 1})

In [None]:
image = svg_graph(adjacency, position, scores=np.log(scores), node_order=np.argsort(scores), width=800, height=400, display_node_weight=True, display_edges=False) 

In [None]:
SVG(image)

In [None]:
top_nodes = top_k(scores, 10)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

## 2. Directed graphs

## Wikipedia Vitals

In [None]:
graph = wikivitals

In [None]:
adjacency = graph.adjacency
names = graph.names

## To do

* List the 10 best ranked pages of Wikipedia Vitals.
* Compare with the 10 nodes of highest out-degrees and the 10 nodes of highest in-degrees. Interpret the results.
* Which article of Wikipedia Vitals is in the top-20 of PageRank but not in the top-20 of in-degrees?

In [None]:
pagerank = PageRank(solver='lanczos')

In [None]:
scores = pagerank.fit_transform(adjacency)

In [None]:
top_nodes = top_k(scores, 10)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
n = len(names)

In [None]:
out_degrees = adjacency.dot(np.ones(n))

In [None]:
top_nodes = top_k(out_degrees, 10)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
in_degrees = adjacency.T.dot(np.ones(n))

In [None]:
top_nodes = top_k(in_degrees, 10)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
nodes = list(set(top_k(scores, 20)) - set(top_k(in_degrees, 20)))

In [None]:
print(names[nodes])

## To do

* List the 20 closest pages to **Picasso** in Wikipedia Vitals. Who is the best ranked painter other than Picasso?
* List the 20 closest pages to both **Cat** and **Dog** in Wikipedia Vitals.
* In both cases, propose a method to get the 20 closest pages **specifically** related to the target(s). <br>
**Hint:** Consider the difference between two PageRank scores.
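
The hint can be illustrated on a small toy digraph with NumPy only. This is a sketch under assumptions: the helper `pagerank_scores`, the toy graph and the seed index are all hypothetical, and every node is assumed to have at least one outgoing edge. Node 0 plays the role of a popular page that most nodes link to.

```python
import numpy as np


def pagerank_scores(adjacency, restart, alpha=0.85, n_iter=100):
    """PageRank by power iteration for a given restart distribution.

    Assumes every node has at least one outgoing edge.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    transition = adjacency / adjacency.sum(axis=1)[:, None]
    scores = restart.copy()
    for _ in range(n_iter):
        scores = alpha * (transition.T @ scores) + (1 - alpha) * restart
    return scores


# Toy digraph: toy_digraph[i, j] = 1 for an edge i -> j.
toy_digraph = np.array([[0, 1, 0, 0, 0],
                        [1, 0, 1, 0, 0],
                        [1, 0, 0, 0, 0],
                        [1, 0, 0, 0, 1],
                        [1, 0, 0, 1, 0]])
n_toy = toy_digraph.shape[0]
seed = 3

uniform = np.ones(n_toy) / n_toy
one_hot = np.zeros(n_toy)
one_hot[seed] = 1

global_scores = pagerank_scores(toy_digraph, uniform)
personalized = pagerank_scores(toy_digraph, one_hot)
# Subtracting the global scores penalizes nodes that are popular
# regardless of the seed.
specific = personalized - global_scores

print(np.argsort(-personalized))
print(np.argsort(-specific))
```

In this toy graph, the popular node 0 has the highest Personalized PageRank score even though the walk restarts from node 3. After subtracting the global scores, the seed and its neighbor node 4 come first, while node 0 gets a negative value.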

In [None]:
picasso = 4936

In [None]:
pagerank = PageRank()

In [None]:
scores_ppr = pagerank.fit_transform(adjacency, seeds={picasso: 1})

In [None]:
top_nodes = top_k(scores_ppr, 20)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
scores_pr = pagerank.fit_transform(adjacency)

In [None]:
top_nodes = top_k(scores_ppr - scores_pr, 20)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
cat = 1401

In [None]:
dog = 1395

In [None]:
scores_ppr = pagerank.fit_transform(adjacency, seeds={cat: 1, dog: 1})

In [None]:
top_nodes = top_k(scores_ppr, 20)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
top_nodes = top_k(scores_ppr - scores_pr, 20)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

## 3. Bipartite graphs

## Cinema

In [None]:
graph = cinema

In [None]:
biadjacency = graph.biadjacency
movies = graph.names_row
actors = graph.names_col

## To do

* List the top-10 movies in terms of PageRank.
* List the 5 closest movies to **Mon Oncle**. 
* List the 5 closest actors to **Catherine Deneuve**.

In [None]:
pagerank = BiPageRank()

In [None]:
scores = pagerank.fit_transform(biadjacency)

In [None]:
top_nodes = top_k(scores, 10)

In [None]:
movies[top_nodes]

In [None]:
deneuve = 2173

In [None]:
scores = pagerank.fit_transform(biadjacency, seeds_col={deneuve: 1})

In [None]:
scores_col = pagerank.scores_col_

In [None]:
top_nodes = top_k(scores_col, 5)

In [None]:
actors[top_nodes]

## 4. Directed graphs as bipartite graphs

Directed graphs can be represented as bipartite graphs by duplicating each node: one copy acts as the source of edges, the other as the destination of edges. The biadjacency matrix of the bipartite graph is simply the adjacency matrix of the directed graph.

The PageRank scores obtained with the bipartite graph differ from those obtained with the directed graph: they correspond to the **forward-backward** random walk in the directed graph, edges being alternately followed in forward and backward directions.
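
The forward-backward walk can be sketched with NumPy on a toy directed graph. This is an illustration under assumptions, not the implementation of `BiPageRank`: the toy graph is hypothetical, every node is assumed to have at least one incoming and one outgoing edge, and the column scores are obtained by one forward step from the row scores, which is only one possible convention.

```python
import numpy as np

# Toy directed graph: toy_adjacency[i, j] = 1 for an edge i -> j.
toy_adjacency = np.array([[0, 1, 1, 0],
                          [0, 0, 1, 0],
                          [1, 0, 0, 1],
                          [0, 0, 1, 0]], dtype=float)
out_deg = toy_adjacency.sum(axis=1)
in_deg = toy_adjacency.sum(axis=0)

# Forward step: from a source copy, follow an edge to a destination copy.
forward = toy_adjacency / out_deg[:, None]
# Backward step: from a destination copy, follow an edge back to a source copy.
backward = toy_adjacency.T / in_deg[:, None]
# One forward step followed by one backward step, seen from the source copies.
two_step = forward @ backward

alpha = 0.85
restart = np.ones(len(out_deg)) / len(out_deg)
row_scores = restart.copy()
for _ in range(200):
    row_scores = alpha * (two_step.T @ row_scores) + (1 - alpha) * restart

# Assumption: scores of destination copies after one more forward step.
col_scores = forward.T @ row_scores

print(row_scores)
print(col_scores)
```

Without restart ($\alpha=1$), the distribution proportional to the out-degrees is stationary for this two-step walk on the source copies, which helps explain why the row ranking is naturally compared with out-degrees.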

## Wikipedia Vitals

In [None]:
graph = wikivitals

In [None]:
adjacency = graph.adjacency
names = graph.names

## To do

Do the same experiments as above using BiPageRank:
* List the 10 best ranked pages of Wikipedia Vitals. Give 2 rankings: one for rows, the other for columns.
* Compare with the 10 nodes of highest out-degrees and the 10 nodes of highest in-degrees, respectively.<br> Interpret the results.

In [None]:
pagerank = BiPageRank(damping_factor=0.85)

In [None]:
scores = pagerank.fit_transform(adjacency)

In [None]:
top_nodes = top_k(scores, 10)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
top_nodes_out = top_k(out_degrees, 10)

In [None]:
for i, node in enumerate(top_nodes_out):
    print(i + 1, names[node])

In [None]:
len(set(top_nodes) & set(top_nodes_out))

In [None]:
top_nodes = top_k(pagerank.scores_col_, 10)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
top_nodes_in = top_k(in_degrees, 10)

In [None]:
for i, node in enumerate(top_nodes_in):
    print(i + 1, names[node])

In [None]:
len(set(top_nodes) & set(top_nodes_in))

## To do

Using BiPageRank:
* List the 20 closest pages to **Picasso** in Wikipedia Vitals. 
* List the 20 closest pages to both **Cat** and **Dog** in Wikipedia Vitals.

In your opinion, which algorithm is better suited to content recommendation: PageRank or BiPageRank?<br>
Justify your answer.

In [None]:
pagerank = BiPageRank()

In [None]:
scores_pr = pagerank.fit_transform(adjacency)

In [None]:
scores = pagerank.fit_transform(adjacency, seeds_row={picasso: 1})

In [None]:
top_nodes = top_k(scores, 20)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])

In [None]:
scores = pagerank.fit_transform(adjacency, seeds_row={cat: 1, dog: 1})

In [None]:
top_nodes = top_k(scores, 20)

In [None]:
for i, node in enumerate(top_nodes):
    print(i + 1, names[node])