# SD212: Graph mining
## Lab 7: Graph embedding

In this lab, you will learn to embed the nodes of a graph in a vector space of low dimension.

## Import

In [None]:
from IPython.display import SVG

In [None]:
import numpy as np
from scipy import sparse
import matplotlib.pyplot as plt

In [None]:
# if you get any error, please update scikit-network!
from sknetwork.data import load_netset, karate_club
from sknetwork.clustering import KMeans
from sknetwork.embedding import Spectral, BiSpectral
from sknetwork.linalg import HalkoEig
from sknetwork.ranking import PageRank, BiPageRank, top_k
from sknetwork.visualization import svg_graph, svg_digraph

## Data

We will work on the following graphs (see the [NetSets](https://graphs.telecom-paristech.fr/Home_page.html#netsets-section) collection for details):
* Openflights (graph)
* WikiVitals (digraph)
* Cinema (bigraph)

In [None]:
openflights = load_netset('openflights')
wikivitals = load_netset('wikivitals')
cinema = load_netset('cinema')

## 1. Graphs

## Karate Club


We first consider the spectral embedding of the [karate club graph](https://en.wikipedia.org/wiki/Zachary%27s_karate_club).

In [None]:
graph = karate_club(True)

In [None]:
adjacency = graph.adjacency
position = graph.position
labels_true = graph.labels

In [None]:
image = svg_graph(adjacency, position, labels=labels_true)
SVG(image)

## To do

* Display the spectrum of the normalized Laplacian (e.g., first 20 eigenvalues). 
* What does the spectrum suggest?
* Display the graph with the embedding in dimension 2.
* Check that the embedding weighted by the degrees is centered.
* How many nodes are misclassified using the sign of the first eigenvector? What are these nodes?

In [None]:
spectral = Spectral(20, barycenter=False, normalized=False, regularization=None)

In [None]:
spectral.fit(adjacency)

In [None]:
eigenvalues = spectral.eigenvalues_

In [None]:
eigenvalues

Display the spectrum of the normalized Laplacian (e.g., first 20 eigenvalues).

In [None]:
spectral = Spectral(20, normalized_laplacian=True, barycenter=False, normalized=False, regularization=None)
spectral.fit(adjacency)
eigenvalues_n = spectral.eigenvalues_

In [None]:
eigenvalues_n

In [None]:
full_eigenvalues = np.hstack(([0],eigenvalues_n))

In [None]:
plt.plot(full_eigenvalues, 'o')
plt.grid()

What does the spectrum suggest?

- The spectrum of the Karate Club graph suggests that there are 4 clusters.
The first 4 eigenvalues are not exactly zero because the four clusters are not completely isolated from each other,
but there is a clear gap after the fourth eigenvalue.
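
The link between small eigenvalues and clusters can be checked on a toy example. The sketch below uses plain NumPy rather than scikit-network, and the graph (two triangles, then the same triangles joined by one edge) and the helper `normalized_laplacian_spectrum` are assumptions made for illustration. The eigenvalue 0 appears once per connected component, and a weak link between two groups turns one of these zeros into a small positive eigenvalue.

```python
import numpy as np

def normalized_laplacian_spectrum(adjacency):
    """Eigenvalues of L = I - D^{-1/2} A D^{-1/2}, in increasing order."""
    adjacency = np.asarray(adjacency, dtype=float)
    degrees = adjacency.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degrees)  # assumes there is no isolated node
    laplacian = np.eye(len(degrees)) - inv_sqrt[:, None] * adjacency * inv_sqrt[None, :]
    return np.linalg.eigvalsh(laplacian)

# toy graph (hypothetical): two disconnected triangles, nodes 0-2 and 3-5
triangle = np.ones((3, 3)) - np.eye(3)
disconnected = np.block([[triangle, np.zeros((3, 3))],
                         [np.zeros((3, 3)), triangle]])

# same graph with one bridge edge between node 2 and node 3
bridged = disconnected.copy()
bridged[2, 3] = bridged[3, 2] = 1

spectrum_disconnected = normalized_laplacian_spectrum(disconnected)
spectrum_bridged = normalized_laplacian_spectrum(bridged)
print(np.round(spectrum_disconnected, 3))
print(np.round(spectrum_bridged, 3))
```

Note that the lab code prepends the eigenvalue 0 by hand before plotting, whereas this sketch returns the full spectrum directly.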

Display the graph with the embedding in dimension 2 of Karate Club.

In [None]:
spectral = Spectral(2, normalized_laplacian=True, barycenter=False, normalized=False, regularization=None)
embedding = spectral.fit_transform(adjacency)
plt.plot(embedding[:,0],embedding[:,1], 'o')
plt.gca().set_aspect('equal', adjustable='box')
plt.show()

Check that the embedding weighted by the degrees is centered.

In [None]:
degree = np.diag(adjacency.toarray().sum(axis=1))
mean = np.dot(degree,embedding)
print ( np.abs(np.mean(mean[:,0])) )
print ( np.abs(np.mean(mean[:,1])) )

In [None]:
#embedding

In [None]:
degree

In [None]:
#mean

In [None]:
#np.dot(degree,embedding)

- The two values, 3.2653618371328135e-17 and 2.938825653419532e-16, are zero up to floating-point error, so the degree-weighted embedding is centered.
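
Why the weighted embedding is centered can be seen on a small example. The sketch below assumes that the embedding consists of generalized eigenvectors of $Lx = \lambda Dx$ with $\lambda > 0$ (this may differ from what `Spectral` computes for other parameter settings). Such vectors are orthogonal, with respect to the degrees, to the constant eigenvector associated with $\lambda = 0$, so the degree-weighted sum of each coordinate vanishes. The toy graph and all variable names are hypothetical.

```python
import numpy as np

# toy graph (hypothetical): edges 0-1, 1-2, 1-3, 2-3
toy_adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
], dtype=float)
toy_degrees = toy_adjacency.sum(axis=1)

# eigenvectors of the normalized Laplacian I - D^{-1/2} A D^{-1/2}
inv_sqrt = 1.0 / np.sqrt(toy_degrees)
laplacian = np.eye(4) - inv_sqrt[:, None] * toy_adjacency * inv_sqrt[None, :]
eigenvalues, eigenvectors = np.linalg.eigh(laplacian)

# generalized eigenvectors of L x = lambda D x, skipping the trivial one (lambda = 0)
toy_embedding = inv_sqrt[:, None] * eigenvectors[:, 1:3]

# degree-weighted sum of the embedding vectors: one value per dimension
weighted_sum = toy_degrees @ toy_embedding
print(weighted_sum)
```

The product `toy_degrees @ toy_embedding` gives the same check as multiplying by a dense diagonal matrix, without building that matrix.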

How many nodes are misclassified using the sign of the first eigenvector? What are these nodes?

In [None]:
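# predicted labels from the sign of the first embedding coordinate
labels_pred = (embedding[:, 0] > 0).astype(int)

In [None]:
# an eigenvector is defined up to its sign, so keep the better of the two labelings
misclassified = np.flatnonzero(labels_pred != labels_true)
if len(misclassified) > len(labels_true) / 2:
    misclassified = np.flatnonzero(labels_pred == labels_true)

print(f"There are {len(misclassified)} misclassified nodes")
print(f"The misclassified nodes are: {misclassified}")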
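
The idea of splitting a graph with the sign of the first non-trivial eigenvector can be illustrated on a toy graph where the answer is known. The sketch below uses plain NumPy, and the graph (two cliques of 4 nodes joined by a single edge) and the variable names are assumptions for illustration. Since an eigenvector is only defined up to its sign, the predicted labels are compared with the true labels in both orientations.

```python
import numpy as np

# toy graph (hypothetical): two cliques of 4 nodes joined by the edge 3-4
clique = np.ones((4, 4)) - np.eye(4)
toy_adjacency = np.block([[clique, np.zeros((4, 4))],
                          [np.zeros((4, 4)), clique]])
toy_adjacency[3, 4] = toy_adjacency[4, 3] = 1
toy_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

toy_degrees = toy_adjacency.sum(axis=1)
inv_sqrt = 1.0 / np.sqrt(toy_degrees)
laplacian = np.eye(8) - inv_sqrt[:, None] * toy_adjacency * inv_sqrt[None, :]
_, eigenvectors = np.linalg.eigh(laplacian)

# first non-trivial eigenvector (the eigenvalue 0 comes first and is skipped)
first_vector = inv_sqrt * eigenvectors[:, 1]
predicted = (first_vector > 0).astype(int)

# compare both sign conventions and keep the smaller number of errors
errors = min(np.sum(predicted != toy_labels), np.sum(predicted == toy_labels))
print(errors)
```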

## Openflights


We now consider a larger graph. We use the Laplacian eigenmap in dimension 20. We measure proximity between nodes in terms of [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). Equivalently, we work on the embedding in the unit sphere, which can be obtained through the parameter ``normalized`` of the class ``Spectral``.
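
The equivalence between cosine similarity and the dot product on the unit sphere can be sketched as follows. The toy embedding and the helper `cosine_similarity` are hypothetical and only serve the illustration; once the rows are normalized, a single matrix-vector product gives the cosine similarities of all nodes to a given node.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# toy embedding (hypothetical values): 4 nodes in dimension 3
toy_embedding = np.array([
    [1.0, 2.0, 2.0],
    [2.0, 4.0, 4.0],
    [0.0, 3.0, -3.0],
    [-1.0, 0.0, 0.5],
])

# project each row onto the unit sphere
norms = np.linalg.norm(toy_embedding, axis=1, keepdims=True)
unit_embedding = toy_embedding / norms

# all cosine similarities to node 0 with a single matrix-vector product
scores = unit_embedding @ unit_embedding[0]
print(scores)
```

Node 1 is a positive multiple of node 0, so its score is 1, while node 2 is orthogonal to node 0 and gets a score of 0.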

In [None]:
graph = openflights

In [None]:
adjacency = graph.adjacency
position = graph.position
names = graph.names

In [None]:
image = svg_graph(adjacency, position, width=800, height=400, node_size=3, display_edges=False)
SVG(image)

## To do

* Display the spectrum of the normalized Laplacian.
* Display the same world map with the scores equal to the cosine similarity to **Beijing** in the embedding space.
* List the 10 closest airports to **Madrid** in terms of cosine similarity in the embedding space.
* Display the same world map with the 10 clusters found by k-means in the embedding space.<br> **Hint:** Use the class ``KMeans`` with the appropriate embedding.

In [None]:
?Spectral

In [None]:
#spectral = Spectral(20, normalized_laplacian=True)
spectral = Spectral(20, normalized=True, regularization=None, barycenter=False)

In [None]:
beijing = 1618

In [None]:
madrid = 572

Display the spectrum of the normalized Laplacian.

In [None]:
#spectral = Spectral(10, normalized_laplacian=True, barycenter=False, normalized=False, regularization=None)
spectral.fit(adjacency)
eigenvalues_n = spectral.eigenvalues_

In [None]:
full_eigenvalues = np.hstack(([0],eigenvalues_n))
plt.plot(full_eigenvalues, 'o')
plt.grid()

Display the same world map with the scores equal to the cosine similarity to **Beijing** in the embedding space.

In [None]:
embedding = spectral.embedding_
score = embedding.dot(embedding[beijing])

In [None]:
score

In [None]:
image = svg_graph(adjacency, position, scores = score, width=800, height=400, node_size=3, display_edges=False)
SVG(image)

Display the same world map with the scores equal to the cosine similarity to Madrid in the embedding space.

In [None]:
embedding = spectral.embedding_
score = embedding.dot(embedding[madrid])
image = svg_graph(adjacency, position, scores = score, width=800, height=400, node_size=3, display_edges=False)
SVG(image)

List the 10 closest airports to Madrid in terms of cosine similarity in the embedding space.

In [None]:
top10 = score.argsort()[-10:][::-1]

In [None]:
names[top10]
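
When ranking by cosine similarity, the query node has similarity 1 with itself and so normally comes first in the list. A minimal sketch of a top-k search that excludes the query is given below; the helper `closest_nodes` and the toy embedding on the unit circle are hypothetical and not part of scikit-network.

```python
import numpy as np

def closest_nodes(unit_embedding, node, k):
    """Indices of the k nodes most similar to `node`, excluding `node` itself."""
    scores = unit_embedding @ unit_embedding[node]
    scores[node] = -np.inf  # the query always has similarity 1 with itself
    return np.argsort(-scores)[:k]

# toy embedding on the unit circle (hypothetical values), angles in degrees
angles = np.deg2rad([0, 10, 25, 90, 180, 355])
toy_embedding = np.column_stack([np.cos(angles), np.sin(angles)])

closest = closest_nodes(toy_embedding, node=0, k=3)
print(closest)
```

Here the nodes at 355, 10 and 25 degrees are returned in that order, since they form the smallest angles with node 0.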

Display the same world map with the 10 clusters found by k-means in the embedding space.

In [None]:
kmeans = KMeans(10, embedding_method=Spectral(20, normalized=True))
label = kmeans.fit_transform(adjacency)

In [None]:
image = svg_graph(adjacency, position, labels = label, width=800, height=400, node_size=3, display_edges=False)
SVG(image)

## 2. Digraphs

## Wikipedia Vitals

We now consider a large directed graph. We use the class ``BiSpectral`` (i.e., the directed graph is viewed as a bipartite graph).
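
The bipartite view of a directed graph can be sketched as follows: each node appears twice, once as a source and once as a target, and the adjacency matrix of the digraph plays the role of the biadjacency matrix. The toy digraph below is hypothetical, and the sketch only builds the bipartite adjacency matrix; it does not reproduce the internals of ``BiSpectral``.

```python
import numpy as np

# toy directed graph (hypothetical): edges 0->1, 0->2, 1->2, 2->0
toy_adjacency = np.array([
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
])
n = toy_adjacency.shape[0]

# bipartite view: rows 0..n-1 are the nodes as sources,
# rows n..2n-1 are the same nodes as targets
zeros = np.zeros((n, n), dtype=int)
bipartite_adjacency = np.block([[zeros, toy_adjacency],
                                [toy_adjacency.T, zeros]])

out_degrees = toy_adjacency.sum(axis=1)
in_degrees = toy_adjacency.sum(axis=0)
print(bipartite_adjacency)
```

The resulting matrix is symmetric, and the degrees of the source and target copies are the out-degrees and in-degrees of the original nodes.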

In [None]:
graph = wikivitals

In [None]:
adjacency = graph.adjacency
names = graph.names
labels = graph.labels_hierarchy
names_labels = graph.names_labels_hierarchy

## To do

* Check that the average cosine similarity between nodes in some set $S$ is given by the square norm of the barycenter of $S$.
* Give the average cosine similarity between nodes in the **Mammals** category.
* Compare with the expected cosine similarity between two nodes sampled uniformly at random.
* Give the average cosine similarity between nodes in some category of your choice.

In [None]:
spectral = BiSpectral(20, solver=HalkoEig(random_state=42))

In [None]:
spectral.fit(adjacency)
embedding = spectral.embedding_

In [None]:
mammal = 274

In [None]:
names_labels[mammal]

* Check that the average cosine similarity between nodes in some set $S$ is given by the square norm of the barycenter of $S$.

* Give the average cosine similarity between nodes in the Mammals category.

In [None]:
nodes = np.where(labels == mammal)[0]

In [None]:
np.linalg.norm(np.mean(embedding[nodes],axis = 0))**2

Average Cosine similarity

In [None]:
matrix = np.dot( embedding[nodes], np.transpose(embedding[nodes]) )

In [None]:
np.mean(matrix)

In [None]:
# average over unordered pairs, self-pairs included (upper triangle with diagonal)
n = len(nodes)
total = np.sum(matrix[np.triu_indices(n)]) / (n * (n + 1) / 2)

In [None]:
total
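
For unit vectors $x_i$, the average of $x_i \cdot x_j$ over all ordered pairs $(i, j)$ of $S$, including $i = j$, is $\frac{1}{|S|^2} \sum_{i, j \in S} x_i \cdot x_j = \left\| \frac{1}{|S|} \sum_{i \in S} x_i \right\|^2$, the square norm of the barycenter. The sketch below checks this identity on random data; the points and variable names are hypothetical. It also computes the average over distinct pairs only, which is slightly lower because the diagonal of ones is removed.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy unit vectors (hypothetical data): 50 points in dimension 5
points = rng.normal(size=(50, 5)) + 1.0
points /= np.linalg.norm(points, axis=1, keepdims=True)

# average cosine similarity over all ordered pairs (i, j), including i = j
gram = points @ points.T
average_similarity = gram.mean()

# squared norm of the barycenter
barycenter = points.mean(axis=0)
squared_norm = barycenter @ barycenter

# average over distinct pairs only, obtained by removing the diagonal of ones
n_points = len(points)
average_distinct = (gram.sum() - n_points) / (n_points * (n_points - 1))

print(average_similarity, squared_norm, average_distinct)
```

An average restricted to the upper triangle with the diagonal weights the self-pairs differently, so it does not match the square norm of the barycenter exactly.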

Compare with the expected cosine similarity between two nodes sampled uniformly at random.

In [None]:
two_elem_index = np.random.choice(nodes,2)
print(two_elem_index)
result = embedding[two_elem_index]
print(result[0]@result[1])

In [None]:
liste = []
for i in range(10000):
    two_elem_index = np.random.choice(np.size(labels),2)
    result = embedding[two_elem_index]
    liste.append(result[0]@result[1])
np.mean(liste)

In [None]:
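# the mean of all pairwise dot products equals the squared norm of the barycenter,
# which avoids building the dense n x n similarity matrix
result = np.linalg.norm(np.mean(embedding, axis=0))**2
result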
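
The sampling experiment can be vectorized and compared with the exact expectation. The sketch below uses hypothetical random data and assumes that the two nodes are drawn independently (so the same node may be drawn twice), in which case the expected cosine similarity is the square norm of the barycenter of all nodes.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy unit vectors (hypothetical data): 200 points in dimension 5
points = rng.normal(size=(200, 5)) + 0.5
points /= np.linalg.norm(points, axis=1, keepdims=True)

# Monte Carlo estimate: draw pairs of indices independently and uniformly
n_samples = 200_000
first = rng.integers(len(points), size=n_samples)
second = rng.integers(len(points), size=n_samples)
estimate = np.einsum('ij,ij->i', points[first], points[second]).mean()

# exact value, without sampling and without any n x n matrix
barycenter = points.mean(axis=0)
exact = barycenter @ barycenter

print(estimate, exact)
```

The estimate fluctuates around the exact value, with an error that shrinks as the number of samples grows.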

Give the average cosine similarity between nodes in some category of your choice.

In [None]:
label = 500
names_labels[label]

In [None]:
nodes = np.where(labels == label)[0]

In [None]:
result = np.mean (np.dot( embedding[nodes], np.transpose(embedding[nodes]) ) )
result

We use the cosine because it provides a measure of proximity:
a value close to 1 means that the two nodes are very close in the embedding space.

## 3. Bigraphs

## Cinema

Finally, we consider the co-embedding of a large bipartite graph. We use the Laplacian eigenmap in dimension 20.
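
A co-embedding places rows and columns of the biadjacency matrix in the same space, so that a movie can be compared to an actor. The sketch below is a generic construction based on the singular value decomposition of the degree-normalized biadjacency matrix, written with plain NumPy; it is not claimed to match ``BiSpectral`` in scaling or regularization, and the toy matrix and the helper `unit_rows` are hypothetical.

```python
import numpy as np

# toy biadjacency matrix (hypothetical): 4 movies (rows) x 3 actors (columns)
toy_biadjacency = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
], dtype=float)

# normalize by the square roots of the row and column degrees
row_degrees = toy_biadjacency.sum(axis=1)
col_degrees = toy_biadjacency.sum(axis=0)
normalized = toy_biadjacency / np.sqrt(row_degrees)[:, None] / np.sqrt(col_degrees)[None, :]

# singular vectors, skipping the trivial first pair (singular value 1)
u, singular_values, vt = np.linalg.svd(normalized, full_matrices=False)
dimension = 2
toy_embedding_row = u[:, 1:1 + dimension] / np.sqrt(row_degrees)[:, None]
toy_embedding_col = vt[1:1 + dimension].T / np.sqrt(col_degrees)[:, None]

def unit_rows(x):
    """Project each row onto the unit sphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# cosine similarity between every movie and every actor
similarity = unit_rows(toy_embedding_row) @ unit_rows(toy_embedding_col).T
print(np.round(similarity, 2))
```

Movies 0 and 1 have the same cast in this toy example, so they get the same embedding and the same similarities to every actor.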

In [None]:
graph = cinema

In [None]:
biadjacency = graph.biadjacency
movies = graph.names_row
actors = graph.names_col

## To do

* What is the cosine similarity between **Marion Cotillard** and **The Godfather**  in the embedding space?
* What is the cosine similarity between **The Godfather** and **The Godfather 2** in the embedding space?

In [None]:
bispectral = BiSpectral(20, regularization=0.5, solver=HalkoEig(random_state=42))

In [None]:
cotillard = 3814

In [None]:
actors[cotillard]

In [None]:
godfather = 69768
godfather2 = 69770

In [None]:
movies[godfather]

In [None]:
movies[godfather2]

* What is the cosine similarity between Marion Cotillard and The Godfather in the embedding space?

In [None]:
bispectral.fit(biadjacency)
embedding_row = bispectral.embedding_row_
embedding_col = bispectral.embedding_col_

In [None]:
embedding_row[godfather]@embedding_col[cotillard]

* What is the cosine similarity between The Godfather and The Godfather 2 in the embedding space?

In [None]:
embedding_row[godfather]@embedding_row[godfather2]