## Anonymous Walks for Graph Embeddings
(based on the paper [Anonymous Walk Embeddings](https://arxiv.org/abs/1805.11921) 2018, by Sergey Ivanov and Evgeny Burnaev)

### What are graph embeddings?

Graphs are data structures that consist of nodes and the edges connecting them. They are used to model relationships between entities in domains such as social networks, recommendation systems, biology, and knowledge graphs. Graph embeddings are a way to represent the nodes and edges of a graph, or the graph as a whole, in a lower-dimensional vector space.

Graph embeddings aim to capture the structural and relational information of the graph in a continuous vector space. This representation makes it easier to perform various machine learning tasks such as node classification, link prediction, and graph clustering.

![Graph embeddings](images/embeddings.jpg)

### What are anonymous walks?

To understand anonymous walks, we first need to understand random walks.

A random walk is a mathematical formalization of a path that consists of a succession of random steps. In the context of graph theory, a random walk on a graph is a traversal that begins at a starting node and, at each step, moves to a neighboring node chosen according to a probability distribution.

<img src="images/random_walk.png" alt="Random walk" width="500"/>

For example, a random walk on the graph above might be: <span style="color:red; font-weight: bold;">1 -> 3 -> 4 -> 3 -> 2</span>
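As a minimal sketch of the idea, the snippet below samples a uniform random walk from an adjacency list. The graph `adj` is a hypothetical example (it is not read from the figure), and the names `random_walk`, `adj`, and `length` are illustrative assumptions rather than anything defined in the paper.

```python
import random

def random_walk(adj, start, length, rng=None):
    """Sample a walk that visits `length` nodes, starting at `start`.

    At each step the next node is chosen uniformly among the neighbors
    of the current node (one possible choice of probability distribution).
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:  # dead end: stop early
            break
        walk.append(rng.choice(neighbors))
    return walk

# Hypothetical undirected graph given as an adjacency list.
adj = {1: [3], 2: [3], 3: [1, 2, 4], 4: [3]}

walk = random_walk(adj, start=1, length=5, rng=random.Random(0))
print(" -> ".join(map(str, walk)))
```

Passing a seeded `random.Random` makes the sampled walk reproducible, which is convenient when comparing runs.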

Normally, the states in a random walk correspond to the label, or global name, of a node; however, for reasons such as privacy and security, these names may be unavailable. It has been shown that an anonymized version of a random walk can provide a flexible way to reconstruct a network even when global names are absent. We next define the notion of an **anonymous walk**.

<span style="color:red; font-weight: bold;">Definition 1: </span> An anonymous walk is a type of walk on a graph where only the underlying structure of the walk is considered, ignoring the specific identities of the nodes. This abstraction provides a robust and efficient way to capture graph structures, especially in settings where node identities are irrelevant or unavailable.

<span style="color:red; font-weight: bold;">Definition 2: </span> Let s = (u<sub>1</sub>, u<sub>2</sub>, ... , u<sub>k</sub>) be an ordered list of elements u<sub>i</sub> ∈ V. We define the positional function pos: (s, u<sub>i</sub>) → q, which, for any such list s and any element u<sub>i</sub> ∈ V, returns the list q = (p<sub>1</sub>, p<sub>2</sub>, ... , p<sub>l</sub>) of all positions p<sub>j</sub> ∈ ℕ at which u<sub>i</sub> occurs in s.

For example, if s = (a, b, c, b, c), then pos(s, a) = (1), as element a appears only in the first position, and pos(s, b) = (2, 4).
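The positional function can be sketched in a few lines of Python. The function name `pos` follows the definition above; representing the result as a tuple of 1-based positions is an assumption made for this illustration.

```python
def pos(s, u):
    """Return the 1-based positions at which element `u` occurs in sequence `s`."""
    return tuple(i for i, x in enumerate(s, start=1) if x == u)

s = ("a", "b", "c", "b", "c")
print(pos(s, "a"))  # (1,)
print(pos(s, "b"))  # (2, 4)
```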

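We denote the mapping of a random walk **w** to its anonymous walk **a** by w → a. Under this mapping, each node is replaced by the order in which it first appears in the walk, so the walk (a, b, c, b, c) maps to the anonymous walk (1, 2, 3, 2, 3).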


<img src="images/mapping.png" alt="Random walk" width="800"/>

In the image above we can see an example demonstrating the concept of an anonymous
walk. Two different random walks, 1 and 2, correspond
to the same anonymous walk 1, while random walk 3 corresponds to
a different anonymous walk 2.
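The mapping w → a can be sketched as follows, assuming the usual convention that each node is relabeled by the order of its first appearance in the walk. The function name `anonymize` and the sample walks are hypothetical and are not taken from the figure.

```python
def anonymize(walk):
    """Map a walk to its anonymous walk.

    Each node is replaced by the index (starting at 1) of its first
    appearance, so only the visiting pattern of the walk is kept.
    """
    first_seen = {}
    anonymous = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen) + 1
        anonymous.append(first_seen[node])
    return tuple(anonymous)

# Two different walks with the same visiting pattern give the same anonymous walk.
print(anonymize(("a", "b", "c", "b", "c")))  # (1, 2, 3, 2, 3)
print(anonymize(("c", "d", "b", "d", "b")))  # (1, 2, 3, 2, 3)

# A walk with a different pattern gives a different anonymous walk.
print(anonymize((1, 3, 4, 3, 2)))            # (1, 2, 3, 2, 4)
```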

### How do we use anonymous walks to generate embeddings?

The question remains: how do we use these anonymous walks to generate embeddings for a graph?
In their paper, Ivanov and Burnaev propose two unsupervised algorithms for generating embeddings. The first algorithm is feature-based, while the second is data-driven.\
Feature-based approaches explicitly provide the model with engineered features, designed from domain knowledge or insight into the data. Data-driven approaches, also known as end-to-end or representation learning approaches, provide the model with raw data and let it learn useful features automatically during training.
 - <span style="color:red; font-weight: bold;">Feature-based embedding</span>  The anonymous walk embedding of a graph G is the vector f<sub>G</sub> of size n, whose i-th component is the probability p(a<sub>i</sub>) of observing the anonymous walk a<sub>i</sub> in G.\
 f<sub>G</sub> = (p(a<sub>1</sub>), p(a<sub>2</sub>), ... , p(a<sub>n</sub>)), i.e. the values in the embedding vector are the probabilities with which the possible anonymous walks appear in the graph.\
 However, as computing these probabilities exactly can be computationally expensive, Ivanov and Burnaev demonstrate a way to sample walks in a graph in order to approximate the actual distribution with a given confidence.
 - <span style="color:red; font-weight: bold;">Data-driven embedding</span> Next, they show how one can learn distributed graph representations in a data-driven manner, similar to learning paragraph vectors in NLP. In this analogy, an anonymous walk is a word, a randomly sampled set of anonymous walks starting from the same node is a set of co-occurring words, and a graph is a document.

In their experiments, the authors report that these embeddings, even when paired with a simple SVM classifier, achieve higher classification accuracy than state-of-the-art supervised neural network methods and graph kernels. This suggests that the representation of the data can be a more promising subject of study than the type and architecture of the predictive model.
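The sampling idea behind the feature-based embedding can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: start nodes are drawn uniformly, every node is assumed to have at least one neighbor, `length` counts visited nodes, and the number of samples is a free parameter rather than the confidence-based value derived in the paper. All function and parameter names are hypothetical.

```python
import random
from collections import Counter

def anonymize(walk):
    """Relabel each node by the order of its first appearance in the walk."""
    first_seen = {}
    return tuple(first_seen.setdefault(node, len(first_seen) + 1) for node in walk)

def sample_anonymous_walk(adj, length, rng):
    """Sample one uniform random walk visiting `length` nodes and anonymize it."""
    node = rng.choice(list(adj))       # assumption: uniform start node
    walk = [node]
    for _ in range(length - 1):
        node = rng.choice(adj[node])   # assumption: every node has a neighbor
        walk.append(node)
    return anonymize(walk)

def estimate_embedding(adj, length, num_samples, seed=0):
    """Estimate p(a_i) for each observed anonymous walk from sampled walks."""
    rng = random.Random(seed)
    counts = Counter(
        sample_anonymous_walk(adj, length, rng) for _ in range(num_samples)
    )
    return {a: c / num_samples for a, c in sorted(counts.items())}

# Hypothetical triangle graph.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
embedding = estimate_embedding(adj, length=3, num_samples=1000)
for anonymous_walk, probability in embedding.items():
    print(anonymous_walk, probability)
```

The result maps each observed anonymous walk to its estimated probability. Fixing an ordering over all possible anonymous walks of the chosen length, and filling in zero for those never observed, would turn this dictionary into the vector f<sub>G</sub>.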

### Code