## Node Classification and Link Prediction with DeepWalk

This notebook uses DeepWalk for unsupervised node representation learning. The learned embeddings are then used for supervised node classification and link prediction.

#### References

\[1\] [DeepWalk: Online Learning of Social Representations](https://www.thejournal.club/c/paper/54593/), B. Perozzi, R. Al-Rfou, S. Skiena, KDD, 2014

In [None]:
import networkx as nx
import pandas as pd
import numpy as np
import os

import matplotlib.pyplot as plt
from sklearn import preprocessing, feature_extraction, model_selection

import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F

from gensim.models import Word2Vec

from sklearn.manifold import TSNE

### Load the dataset (Cora)

In [None]:
dataset = dgl.data.CoraGraphDataset()
graph = dataset[0]

We are going to perform a random walk starting from each node in the graph.

DGL has a method for generating random walk data, `dgl.sampling.random_walk(...)`.

In [None]:
nodes = graph.nodes()
num_repeats = 10
length = 20       # Random walk length
doc = []
for _ in range(num_repeats):
    sentences, _ = dgl.sampling.random_walk(graph, nodes=nodes, length=length)
    doc.extend(sentences.tolist())
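
To make the walk generation concrete, here is a minimal pure-Python sketch of a uniform random walk over an adjacency list. The function `uniform_random_walk` and the toy graph are hypothetical and only illustrate the idea; they are not how DGL implements `dgl.sampling.random_walk`.

```python
import random


def uniform_random_walk(adjacency, start, length, rng):
    """Return a walk of up to `length` steps (at most length + 1 nodes) from `start`."""
    walk = [start]
    for _ in range(length):
        neighbours = adjacency[walk[-1]]
        if not neighbours:  # dead end: stop the walk early
            break
        # Pick the next node uniformly at random among the neighbours
        walk.append(rng.choice(neighbours))
    return walk


# Hypothetical toy graph; every undirected link is stored in both directions
toy_adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
rng = random.Random(0)

# Two repeats, one walk of 5 steps per start node, collected like `doc` above
toy_doc = [
    uniform_random_walk(toy_adjacency, node, 5, rng)
    for _ in range(2)
    for node in toy_adjacency
]
```

Each walk plays the role of a sentence and each node ID the role of a word. A walk of `length` steps contains `length + 1` nodes, including the start node.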

The graph has 2708 nodes, so the number of "sentences" should equal `num_repeats` times 2708, i.e. 27080.

In [None]:
len(doc)

We are going to use `gensim`'s implementation of `Word2Vec`.

We are going to set the dimensionality of the embedding vectors to __128__ and the window size to __5__.

In [None]:
w2v_model = Word2Vec(doc, vector_size=128, window=5, min_count=0, sg=1, workers=2)
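
The role of the window size can be sketched as follows. With skip-gram (`sg=1`), every node in a walk is paired with the nodes that appear at most `window` positions away from it, and the model learns to predict these context nodes. The helper `skipgram_pairs` is hypothetical and leaves out details of real implementations such as negative sampling.

```python
def skipgram_pairs(walk, window):
    """List the (centre, context) pairs that a window of the given size produces."""
    pairs = []
    for i, centre in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a node is not its own context
                pairs.append((centre, walk[j]))
    return pairs


toy_walk = [7, 3, 9, 4]
toy_pairs = skipgram_pairs(toy_walk, window=1)
# With window=1 each node is paired only with its immediate neighbours in the walk:
# (7, 3), (3, 7), (3, 9), (9, 3), (9, 4), (4, 9)
```

A larger window pairs nodes that are further apart in the walk, so the embeddings reflect a wider neighbourhood.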

We can retrieve the embedding vector for each node by using the node ID. Note that node IDs are integers from 0 to 2707.

In [None]:
graph.nodes()  # the node IDs

We can retrieve the embedding vectors via the `wv` member variable of the `Word2Vec` model, using the node ID as the key.

In [None]:
#dir(w2v_model.wv)

In [None]:
w2v_model.wv[1000]

### Visualise the learned embeddings

In [None]:
node_labels = graph.ndata["label"]

In [None]:
# Store node vectors in a 2D numpy array. We make sure that the row index corresponds to the node ID.
word_vectors = []
for node in nodes:
    word_vectors.append(w2v_model.wv[node.item()])
word_vectors = np.array(word_vectors)
word_vectors.shape

In [None]:
transform = TSNE(n_components=2)
node_embeddings_2d = transform.fit_transform(word_vectors) 

In [None]:
def plot_embeddings(node_embeddings, node_labels, title, x_label="$X_1$", y_label="$X_2$", alpha=0.7, figsize=(7,7)):
    fig, ax = plt.subplots(figsize=figsize)
    ax.scatter(node_embeddings[:, 0], 
               node_embeddings[:, 1], 
               c=node_labels, 
               cmap="jet", alpha=alpha)
    ax.set(aspect="equal", xlabel=x_label, ylabel=y_label)
    plt.title(title)
    plt.show()    

In [None]:
plot_embeddings(node_embeddings_2d, 
                graph.ndata['label'], 
                title='DeepWalk node embeddings for the Cora dataset', 
                x_label="$X_1$", 
                y_label="$X_2$", 
                alpha=0.7, 
                figsize=(8,8))

## Train a node classification model

In [None]:
X = word_vectors
y = graph.ndata["label"]

In [None]:
X_train = X[graph.ndata["train_mask"]]
y_train = y[graph.ndata["train_mask"]]
X_test = X[graph.ndata["test_mask"]]
y_test = y[graph.ndata["test_mask"]]

In [None]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier(n_estimators=1000, max_depth=2, n_jobs=2, random_state=42)

In [None]:
rf.fit(X_train, y_train)

In [None]:
train_acc = rf.score(X_train, y_train)
test_acc = rf.score(X_test, y_test)
print(f"Train acc: {train_acc:.2} and Test acc: {test_acc:.2}")

## Extensions

[Node2Vec](https://www.thejournal.club/c/paper/97424/) improves on DeepWalk by using biased random walks. It has two parameters that control how likely the random walker is to return to the previous node or to follow an edge further away. Carefully selecting these parameters allows the practitioner to calculate node embeddings that emphasize either node homophily or structural equivalence.
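
As a brief sketch in the notation of the Node2Vec paper (the symbols $t$, $v$, $x$, $p$, $q$ and $d_{tx}$ are not used elsewhere in this notebook): suppose the walker has just moved from node $t$ to node $v$ and is choosing the next node $x$ among the neighbours of $v$. On an unweighted graph, the probability of stepping to $x$ is proportional to

$$
\alpha_{pq}(t, x) =
\begin{cases}
1/p & \text{if } d_{tx} = 0 \\
1 & \text{if } d_{tx} = 1 \\
1/q & \text{if } d_{tx} = 2
\end{cases}
$$

where $d_{tx}$ is the shortest-path distance between $t$ and $x$. The return parameter $p$ governs stepping back to $t$, and the in-out parameter $q$ governs moving to nodes further away from $t$. With $p = q = 1$ the walk reduces to the uniform random walk used by DeepWalk.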

A further extension to heterogeneous graphs is [MetaPath2Vec](https://www.thejournal.club/c/paper/290795/). It works the same as DeepWalk but the random walker is guided by a metapath (a valid sequence of node types) such that the node to follow is only allowed to be of the type specified in the metapath.
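
A minimal sketch of a metapath-guided walk follows, assuming a toy heterogeneous graph with authors, papers and venues. The function `metapath_walk`, the node names and the cyclic treatment of the metapath are illustrative assumptions, not the MetaPath2Vec reference implementation.

```python
import random


def metapath_walk(adjacency, node_type, start, metapath, length, rng):
    """Walk that only steps to neighbours whose type comes next in `metapath`.

    The metapath is treated as cyclic, e.g. ["author", "paper"] is followed as
    author -> paper -> author -> paper -> ...
    """
    assert node_type[start] == metapath[0]
    walk = [start]
    for step in range(1, length + 1):
        wanted = metapath[step % len(metapath)]
        candidates = [n for n in adjacency[walk[-1]] if node_type[n] == wanted]
        if not candidates:  # no neighbour of the required type: stop early
            break
        walk.append(rng.choice(candidates))
    return walk


# Hypothetical toy graph with three node types
toy_type = {"a1": "author", "a2": "author", "p1": "paper", "p2": "paper", "v1": "venue"}
toy_adj = {
    "a1": ["p1", "p2"],
    "a2": ["p2"],
    "p1": ["a1", "v1"],
    "p2": ["a1", "a2", "v1"],
    "v1": ["p1", "p2"],
}
walk = metapath_walk(toy_adj, toy_type, "a1", ["author", "paper"], 6, random.Random(1))
```

Because the metapath contains only authors and papers, the venue node never appears in the walk even though it is adjacent to both papers.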

## Exercises

\[1\] Can you improve the results? Some ideas are to generate more random walk data with longer or shorter walk lengths, or to tune the [**Word2Vec**](https://www.thejournal.club/c/paper/47668/) hyper-parameters, e.g., a different window size or a different embedding dimensionality.

\[2\] Implement biased random walks as proposed in **Node2Vec**.

# Link prediction

For link prediction we need to estimate embedding vectors for edges. We are going to combine the node embeddings at the two ends of an edge into an edge vector. Once we have these edge vectors, we can train a binary classifier to predict if an edge between two nodes should exist or not.

Before we can do the above, we must remove some of the edges from the graph and set them aside for training and evaluating the binary classifier.

In [None]:
# All graph edges, returned as a tuple of two tensors: the first holds the source node
# and the second the target node of each edge
all_edges = graph.edges()   
all_edges

In [None]:
num_train_edges = int(len(all_edges[0]) * 0.1)   # use 10% of edges as positive examples
print(f"Number of positive examples: {num_train_edges}")
# Randomly select num_train_edges distinct edges (sampling without replacement)
edge_index = torch.randperm(len(all_edges[0]))[:num_train_edges]
edges = (all_edges[0][edge_index], all_edges[1][edge_index])
edge_labels = torch.ones(num_train_edges, dtype=torch.long)

In [None]:
# Keep track of the edge IDs so we can remove them from the graph later
eids = graph.edge_ids(edges[0], edges[1])
eids

In [None]:
# Sample negative examples, that is, pairs of distinct nodes that are not connected by an edge.
# Candidates are drawn from all nodes; we oversample because some candidate pairs are rejected.
num_candidates = 2 * num_train_edges
source_nodes_candidate = nodes[torch.randint(0, len(nodes), (num_candidates,))]
target_nodes_candidate = nodes[torch.randint(0, len(nodes), (num_candidates,))]
source_nodes = []
target_nodes = []
for s, t in zip(source_nodes_candidate.tolist(), target_nodes_candidate.tolist()):
    if s != t and not graph.has_edges_between(s, t):
        source_nodes.append(s)
        target_nodes.append(t)
    if len(source_nodes) == num_train_edges:
        break

len(source_nodes), len(target_nodes)
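
The idea behind negative sampling can also be written as plain rejection sampling: keep drawing random node pairs and discard any pair that is a self-loop, an existing edge, or a duplicate. The helper `sample_negative_pairs` and the toy edge set below are hypothetical and only sketch the technique.

```python
import random


def sample_negative_pairs(num_nodes, edge_set, num_samples, rng):
    """Draw `num_samples` distinct (source, target) pairs that are not edges.

    Assumes enough non-edges exist; otherwise the loop would never finish.
    """
    negatives = set()
    while len(negatives) < num_samples:
        s = rng.randrange(num_nodes)
        t = rng.randrange(num_nodes)
        # Reject self-loops, existing edges and pairs we already sampled
        if s == t or (s, t) in edge_set or (s, t) in negatives:
            continue
        negatives.add((s, t))
    return sorted(negatives)


# Hypothetical toy graph with 5 nodes; each link is stored in both directions
toy_edges = {(0, 1), (1, 0), (1, 2), (2, 1)}
toy_negatives = sample_negative_pairs(5, toy_edges, 6, random.Random(0))
```

Because the loop runs until enough pairs are collected, the number of negatives always matches the number requested, which keeps the two classes balanced.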

In [None]:
edges = (torch.cat((edges[0], torch.tensor(source_nodes))),torch.cat((edges[1], torch.tensor(target_nodes)))) 

In [None]:
edge_labels = torch.cat((edge_labels, torch.zeros(len(source_nodes), dtype=torch.long)))

In [None]:
len(edge_labels), len(edges[0])

Now, remove the edges from the graph.

In [None]:
print(f"graph num edges before: {graph.number_of_edges()}")
graph.remove_edges(eids)
print(f"graph num edges after: {graph.number_of_edges()}")
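
One point worth checking, stated here as an assumption rather than a verified fact about this dataset: if the graph stores every undirected link as two directed edges, removing only `(u, v)` leaves `(v, u)` in place, so random walks can still cross a held-out link and link prediction scores may look better than they should. The sketch below uses a hypothetical helper, `with_reverse_edges`, on a toy edge set to show how the reverse edges could be collected for removal as well.

```python
def with_reverse_edges(held_out, edge_set):
    """Return the held-out edges plus any of their reverse edges present in the graph."""
    to_remove = set(held_out)
    for u, v in held_out:
        if (v, u) in edge_set:
            to_remove.add((v, u))
    return to_remove


# Hypothetical toy graph: three undirected links, each stored in both directions
toy_edge_set = {(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)}
to_remove = with_reverse_edges([(0, 1), (3, 2)], toy_edge_set)
remaining = toy_edge_set - to_remove
```

After removal, nodes 0 and 3 are no longer reachable from the rest of the toy graph in either direction.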

Now that the graph is different, let us re-calculate the node representations using DeepWalk. This is the same procedure we used earlier for unsupervised node representation learning that was subsequently used for node classification.

In [None]:
nodes = graph.nodes()
num_repeats = 10
length = 20       # Random walk length
doc = []
for _ in range(num_repeats):
    sentences, _ = dgl.sampling.random_walk(graph, nodes=nodes, length=length)
    doc.extend(sentences.tolist())

w2v_model = Word2Vec(doc, vector_size=128, window=5, min_count=0, sg=1, workers=2)

# Store node vectors in a 2D numpy array. We make sure that the row index corresponds to the node ID.
node_vectors = []
for node in nodes:
    node_vectors.append(w2v_model.wv[node.item()])
node_vectors = np.array(node_vectors)
node_vectors.shape

An edge vector will be a function of the two node vectors that define the edge. We have different options available for such a function. We define several below.

In [None]:
def operator_hadamard(u, v):
    return u * v


def operator_l1(u, v):
    return np.abs(u - v)


def operator_l2(u, v):
    return (u - v) ** 2


def operator_avg(u, v):
    return (u + v) / 2.0
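
As a small worked example, here are the four operators applied to two made-up three-dimensional node vectors. The values of `u` and `v` are illustrative only.

```python
import numpy as np

u = np.array([1.0, -2.0, 0.5])
v = np.array([3.0, 2.0, 0.5])

hadamard = u * v        # element-wise product:      [3.0, -4.0, 0.25]
l1 = np.abs(u - v)      # absolute difference:       [2.0,  4.0, 0.0]
l2 = (u - v) ** 2       # squared difference:        [4.0, 16.0, 0.0]
avg = (u + v) / 2.0     # element-wise average:      [2.0,  0.0, 0.5]

# All four operators are symmetric, so the edge (u, v) and the edge (v, u)
# receive the same edge vector
symmetric = np.allclose(u * v, v * u) and np.allclose(np.abs(u - v), np.abs(v - u))
```

Each operator returns a vector with the same dimensionality as the node embeddings, so the classifier input size does not depend on which operator is chosen.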

In [None]:
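def get_edge_vectors(edges, node_vectors, op=operator_hadamard):
    # Combine the two endpoint embeddings of every edge with the chosen operator
    edge_vectors = []
    for source, target in zip(edges[0], edges[1]):
        edge_vectors.append(op(node_vectors[source], node_vectors[target]))
    # Stack into one array first; building a tensor from a list of arrays is slow
    edge_vectors = torch.tensor(np.array(edge_vectors))
    return edge_vectors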

In [None]:
edge_vectors = get_edge_vectors(edges, node_vectors, op=operator_hadamard)

In [None]:
edge_vectors.shape, len(edge_labels)

The last step is to split our edge data into train and test sets.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, roc_curve

In [None]:
(
    edge_vectors_train, 
    edge_vectors_test, 
    edge_labels_train, 
    edge_labels_test
) = train_test_split(edge_vectors, edge_labels, test_size=0.2)

In [None]:
edge_vectors_train.shape, edge_vectors_test.shape, edge_labels_train.shape, edge_labels_test.shape

Let's train a Random Forest to predict edges. Since we are training a binary classification model, we can evaluate its performance using accuracy but also AUC.

In [None]:
rf = RandomForestClassifier(n_estimators=1000, max_depth=2, n_jobs=2, random_state=1993)

In [None]:
rf.fit(edge_vectors_train, edge_labels_train)

In [None]:
print(f"Accuracy on train data: {rf.score(edge_vectors_train, edge_labels_train):.2}")
print(f"Accuracy on test data : {rf.score(edge_vectors_test, edge_labels_test):.2}")

In [None]:
edge_labels_train_pred_prob = rf.predict_proba(edge_vectors_train)[:, 1]
edge_labels_test_pred_prob = rf.predict_proba(edge_vectors_test)[:, 1]
print(f"AUC on train data: {roc_auc_score(edge_labels_train, edge_labels_train_pred_prob):.2}")
print(f"AUC on test data: {roc_auc_score(edge_labels_test, edge_labels_test_pred_prob):.2}")
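
AUC has a useful interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative example, with ties counted as one half. The sketch below computes it directly from that definition on made-up labels and scores. The helper `pairwise_auc` is hypothetical and far slower than `roc_auc_score` on real data.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:  # ties count as half a win
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = pairwise_auc(toy_labels, toy_scores)
# 8 of the 9 (positive, negative) pairs are ordered correctly, so the AUC is 8/9
```

Unlike accuracy, this quantity depends only on how the examples are ranked, not on a particular decision threshold.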

Plot the ROC curve for the test data.

In [None]:
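# Use predicted probabilities rather than hard class labels, so that the curve
# has more than one operating point
edge_labels_test_pred_prob = rf.predict_proba(edge_vectors_test)[:, 1]
fpr, tpr, _ = roc_curve(edge_labels_test, edge_labels_test_pred_prob)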

In [None]:
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curve (test data)");
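
An ROC curve is traced by sweeping a decision threshold over continuous scores and recording the false and true positive rates at each threshold. The sketch below does this by hand on made-up data and shows what happens when hard 0/1 predictions are used in place of scores. The helper `roc_points` is hypothetical and simpler than scikit-learn's `roc_curve`.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points obtained by thresholding at each distinct score."""
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / num_neg, tp / num_pos))
    return points


toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve_from_scores = roc_points(toy_labels, toy_scores)

# Hard predictions at a 0.5 threshold collapse the scores to two distinct values
hard_predictions = [1 if s >= 0.5 else 0 for s in toy_scores]
curve_from_hard = roc_points(toy_labels, hard_predictions)
```

With continuous scores the curve has one point per distinct score, whereas hard predictions leave a single interior point between (0, 0) and (1, 1).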

## Exercises

\[1\] Try different operators for combining node embeddings to generate edge embeddings. How does performance change?

\[2\] Consider sampling negative examples using the *local* approach mentioned in the presentation. How does performance change?