# Machine learning with graphs

## Assignment 3 (16/03/2021)

Solution notebook for the homework proposed in the [MLG](http://jcid.webs.tsc.uc3m.es/machine-learning-group/) seminar of 2021, based on the [Machine learning with graphs](http://snap.stanford.edu/class/cs224w-2019/) course by Stanford University.

    Author: Daniel Bacaicoa Barber (13 mar, 2021)


In [None]:
#Importing generic libraries.
import numpy as np
import pandas as pd
import scipy 

# Graph related libraries 
import networkx as nx

# Util libraries
from collections import Counter, OrderedDict
import itertools
import random

#Plotting library
import matplotlib.pyplot as plt

### 2 Structural Roles: RolX and ReFeX

In this problem, we will explore the structural role extraction algorithm RolX and its recursive feature extraction method ReFeX. As part of this exploration, we will work with a dataset representing a scientist co-authorship network, which can be downloaded at http://www-personal.umich.edu/~mejn/netdata/netscience.zip. 
> <font size="1">We provide a binary file named hw1-q2.graph for you to load directly into snap with
`G = snap.TUNGraph.Load(snap.TFIn("hw1-q2.graph"))`. You are welcome to either use this file or load from raw
data yourself.</font>

Although the graph is weighted, for simplicity we treat it as **undirected and unweighted** in this problem.


Feature extraction consists of two steps; we first extract basic local features from every node, and we subsequently aggregate them along graph edges so that global features are also obtained.

Collectively, feature extraction constructs a matrix $V \in \mathbb{R}^{n\times f}$, where each of the $n$ nodes has $f$ features covering local and global information. RolX extracts node roles from that matrix.

#### 2.1 Basic Features

We begin by loading the graph G provided in the bundle and computing three basic features for the nodes. For each node $v$, we choose 3 basic local features (in this order):

1. The degree of $v$, i.e., deg($v$)

In [None]:
#Your code here

2. The number of edges in the egonet of $v$, where the egonet of $v$ is defined as the subgraph of $G$ induced by $v$ and its neighborhood.

In [None]:
#Your code here

3. The number of edges that connect the egonet of $v$ and the rest of the graph, i.e., the number of edges that enter or leave the egonet of $v$.

In [None]:
#Your code here
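
The three basic features listed above can be sketched as follows. This is a minimal illustration, assuming a networkx `Graph` without self-loops and a small hypothetical toy graph rather than the netscience data; `basic_features` is an illustrative helper name, not part of the assignment.

```python
import networkx as nx

def basic_features(G, v):
    """Return [degree, edges inside the egonet, edges leaving the egonet] for node v."""
    ego_nodes = set(G.neighbors(v)) | {v}
    degree = G.degree(v)
    # Edges with both endpoints inside the egonet (induced subgraph).
    inside = G.subgraph(ego_nodes).number_of_edges()
    # Edges with exactly one endpoint inside the egonet.
    boundary = sum(1 for _ in nx.edge_boundary(G, ego_nodes))
    return [degree, inside, boundary]

# Hypothetical toy graph: a triangle 0-1-2 with a path 2-3-4 attached.
toy = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)])
features = {v: basic_features(toy, v) for v in toy.nodes}
print(features)
```

For node 2 of the toy graph, the egonet contains nodes 0, 1, 2 and 3, so four edges lie inside it and only the edge 3-4 crosses its boundary.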

We use $\tilde{V}_u$ to represent the vector of the basic features of node $u$. For any pair of nodes $u$ and $v$, we can use cosine similarity to measure how similar two nodes are according to their feature vectors $x$ and $y$:

$$Sim(x, y) = \dfrac{x\cdot y}{\Vert x \Vert_2 \cdot \Vert y \Vert_2} = \dfrac{\sum_{i}x_iy_i}{\sqrt{\sum_{i}x_i^2} \cdot \sqrt{\sum_{i}y_i^2}} $$ 

Also, when $\Vert x \Vert_2 = 0$ or $\Vert y \Vert_2 = 0$, we define $Sim(x, y) = 0$.

In [None]:
#Your code here
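
One possible way to write the cosine similarity defined above, including the zero-norm convention, is sketched below. It assumes NumPy and uses the hypothetical function name `cosine_sim`; the input vectors are made up for illustration.

```python
import numpy as np

def cosine_sim(x, y):
    """Cosine similarity of two vectors; 0 if either vector has zero norm."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    norm_x = np.linalg.norm(x)
    norm_y = np.linalg.norm(y)
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return float(np.dot(x, y) / (norm_x * norm_y))

# Illustrative vectors only: parallel, orthogonal, and zero-norm cases.
print(cosine_sim([1, 2, 2], [2, 4, 4]))
print(cosine_sim([1, 0], [0, 1]))
print(cosine_sim([0, 0, 0], [1, 2, 3]))
```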

Compute the basic feature vector for the node with ID 9, and report the top 5 nodes that are most similar to node 9 (excluding node 9). As a sanity check, no element in $\tilde{V}_9$ is larger than 10.

In [1]:
#Your code here
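
The ranking step can also be done in vectorized form. The sketch below assumes the feature vectors are stacked as rows of a NumPy array and that the row index equals the node ID, which is an assumption about the data layout; `top_k_similar` and the toy matrix are hypothetical.

```python
import numpy as np

def top_k_similar(V, target, k=5):
    """Rank rows of V by cosine similarity to row `target`, excluding the target itself."""
    V = np.asarray(V, dtype=float)
    norms = np.linalg.norm(V, axis=1)
    # Zero-norm rows stay all-zero after division, so their similarity is 0.
    safe = np.where(norms == 0, 1.0, norms)
    unit = V / safe[:, None]
    sims = unit @ unit[target]
    order = [int(i) for i in np.argsort(-sims, kind="stable") if i != target]
    return [(i, float(sims[i])) for i in order[:k]]

# Toy feature matrix (made-up values), one row per node.
V_toy = [[1, 2, 2], [2, 4, 4], [0, 0, 0], [2, 0, 1]]
ranking = top_k_similar(V_toy, target=0, k=2)
print(ranking)
```

Normalizing every row once and taking a single matrix-vector product avoids a Python loop over node pairs, which matters little here but scales better to larger graphs.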

#### 2.2 Recursive Features

In this next step, we recursively generate some more features. We use mean and sum as aggregation functions.

Initially, we have a feature vector $\tilde{V}_u \in \mathbb{R}^3$ for every node $u$. In the first iteration, we concatenate
the mean of all $u$'s neighbors' feature vectors to $\tilde{V}_u$, and do the same for *sum*, i.e.,

$$\tilde{V}_u^{(1)}=\left[\tilde{V}_u;\ \frac{1}{\vert N(u) \vert}\sum_{v\in N(u)} \tilde{V}_v;\ \sum_{v\in N(u)}\tilde{V}_v \right]\in \mathbb{R}^9 $$
where $N(u)$ is the set of $u$'s neighbors in the graph. If $N(u) = \emptyset$, set the mean and sum to 0.
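
A single aggregation step of this kind could look like the sketch below. It assumes a networkx graph and a NumPy feature matrix whose rows follow the order of a `nodes` list; `recursive_step` is a hypothetical name, and the toy example uses only the degree as its single basic feature, so its dimensions are $3^k$ rather than $3^{k+1}$.

```python
import networkx as nx
import numpy as np

def recursive_step(G, V, nodes):
    """One aggregation step: concatenate [V_u ; mean over N(u) ; sum over N(u)]."""
    index = {u: i for i, u in enumerate(nodes)}
    n, f = V.shape
    means = np.zeros((n, f))
    sums = np.zeros((n, f))
    for u in nodes:
        nbrs = [index[v] for v in G.neighbors(u)]
        if nbrs:  # isolated nodes keep mean = sum = 0
            sums[index[u]] = V[nbrs].sum(axis=0)
            means[index[u]] = sums[index[u]] / len(nbrs)
    return np.hstack([V, means, sums])

# Toy graph: path 0-1-2 plus an isolated node 3, with degree as the only feature.
toy = nx.path_graph(3)
toy.add_node(3)
nodes = sorted(toy.nodes)
V0 = np.array([[toy.degree(u)] for u in nodes], dtype=float)
V1 = recursive_step(toy, V0, nodes)  # shape (4, 3)
V2 = recursive_step(toy, V1, nodes)  # shape (4, 9)
print(V1)
```

Each call triples the number of columns, so applying the step $K$ times to three basic features yields $3^{K+1}$ columns.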



In [None]:
def Recursive_features(V_0):
    '''
    input:  V_u^(k) of dimension (n, 3^(k+1))
    output: V_u^(k+1) of dimension (n, 3^(k+2))
    '''
    
    #Your code here

After $K$ iterations, every node $u$ has a feature vector $\tilde{V}_u^{(K)} \in \mathbb{R}^{3^{K+1}}$; stacking these vectors as rows gives the overall feature matrix $V \in \mathbb{R}^{n \times 3^{K+1}}$.

For this exercise, run $K = 2$ iterations, and report the top 5 nodes that are most similar to node 9 (excluding node 9). If there are ties, e.g. 4th, 5th, and 6th have the same similarity, report any of them to fill up the top-5 ranking. As a sanity check, the similarities between the reported nodes and node 9 are all greater than 0.9.

In [None]:
#Your code here

Compare the top 5 nodes you obtained with the previous results from 2.1. In particular, are there common
nodes? Are there different nodes? In one sentence, why would this change?

#### 2.3 Role Discovery

In this part, we explore the graph further using the recursive feature vectors of the nodes and the similarity between them.

> (a) Produce a 20-bin histogram to show the distribution of cosine similarity between node 9 and any other node in the graph (according to their recursive feature vectors). Note here that the x-axis is cosine similarity with node 9, and the y-axis is the number of nodes.

In [None]:
#Your code here
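
The binning behind a 20-bin histogram can be sketched with NumPy as below. The similarity values are hypothetical placeholders, not results from the netscience graph, and the range $[0, 1]$ is an assumption that holds because all features are non-negative counts.

```python
import numpy as np

# Hypothetical similarity values standing in for Sim(V_9, V_u); not real results.
sims = np.array([0.03, 0.04, 0.51, 0.52, 0.53, 0.96, 0.97, 0.98, 0.99, 0.995])

# 20 equal-width bins over [0, 1]; counts[i] is the number of nodes in bin i.
counts, edges = np.histogram(sims, bins=20, range=(0.0, 1.0))
# plt.hist(sims, bins=20, range=(0.0, 1.0)) would draw these same counts with matplotlib.

for left, right, c in zip(edges[:-1], edges[1:], counts):
    if c > 0:
        print(f"[{left:.2f}, {right:.2f}): {c} nodes")
```

Bins with a markedly higher count than their neighbours are the spikes to look for when identifying candidate groups.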

According to the histogram, can you spot some groups / roles? How many can you spot? (Clue:
look for the spikes!)

> (b) For these groups / roles in the cosine similarity histogram, take one node $u$ from each group to examine the feature vector, and draw the subgraph of the node based on its feature vector. You can draw the subgraph by hand, or you can use libraries such as networkx or graphviz.

For these drawings, you should use the local features for $u$, and pay attention to the features aggregated from its 1-hop neighbors, but feel free to ignore further features if they are difficult to incorporate. Also, you should not draw nodes that are more than 3 hops away from $u$.

In [None]:
#Your code here

Briefly argue how different structural roles are captured in 1-2 sentences.