# Machine learning with graphs

## Assignment 2 (02/03/2021)

Solution notebook for the homework proposed in the [MLG](http://jcid.webs.tsc.uc3m.es/machine-learning-group/) 2021 seminar, based on the [Machine Learning with Graphs](http://snap.stanford.edu/class/cs224w-2019/) course from Stanford University.

    Author: Daniel Bacaicoa Barber (27 feb, 2021)
    
<font size="1">This notebook may contain several errata. Use it at your own risk.</font>

In [None]:
#Importing generic libraries.
import numpy as np
import pandas as pd
import scipy 

# Graph related libraries 
import networkx as nx

# Util libraries
from collections import Counter, OrderedDict
import itertools
import random

#Plotting library
import matplotlib.pyplot as plt

### 1 Network Characteristics
One of the goals of network analysis is to find mathematical models that characterize real-world
networks and that can then be used to generate new networks with similar properties. In this
problem, we will explore two famous models, Erdős-Rényi and Small World, and compare them
to real-world data from an academic collaboration network. Note that in this problem all networks
are undirected. You may use the starter code in hw1-q1-starter.py for this problem.

> Erdős-Rényi random graph ($G(n,m)$ random network): Generate a random instance of this
model by using n = 5242 nodes and picking m = 14484 edges at random. Write code to
construct instances of this model, i.e., do not call a SNAP function.


**Note**:

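The total number of edges in an *undirected* graph is computed from the adjacency matrix $A$ as:

$$E=\frac{1}{2}\sum_{i=1}^N\sum_{\substack{j=1\\j\neq i}}^N A_{ij} \left(+ \sum_{i=1}^N A_{ii}\right)$$

where the term in parentheses counts *self edges* and is only present when they are allowed. Since $A$ is symmetric, generating a graph with $E$ edges and no self edges amounts to setting exactly $E$ entries to 1 in the strictly upper triangular part of $A$ (the full upper triangle, diagonal included, if self edges are allowed) and mirroring them into the lower triangular part.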
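
As a minimal sketch of the note above, and not necessarily the intended solution, one way to pick $E$ ones in the strictly upper triangle is to sample $E$ distinct unordered pairs $(i, j)$ with $i < j$. The helper name `sample_gnm_edges` and its `seed` parameter are hypothetical, and the sketch uses only the standard library.

```python
import random


def sample_gnm_edges(N, E, seed=None):
    """Sample E distinct undirected edges (i, j), i < j, among N nodes.

    Hypothetical helper: rejection sampling over unordered pairs, which is
    efficient when E is much smaller than N * (N - 1) / 2.
    """
    max_edges = N * (N - 1) // 2
    if E > max_edges:
        raise ValueError("E exceeds the number of possible simple edges")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < E:
        i, j = rng.sample(range(N), 2)  # two distinct nodes, so no self edge
        edges.add((min(i, j), max(i, j)))  # canonical order merges duplicates
    return sorted(edges)


toy_edges = sample_gnm_edges(N=20, E=10, seed=0)
print(len(toy_edges))
```

To turn the pairs into an `nx.Graph`, one option is to call `add_nodes_from(range(N))` first, so that isolated nodes are kept, and then `add_edges_from(toy_edges)`.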


In [None]:
def genErdosRenyi(N = 5242, E = 14484):
    """
    :param - N: number of nodes
    :param - E: number of edges

    return type: nx.Graph (undirected_graph) 
    return: Erdos-Renyi graph with N nodes and E edges
    """
    # Your code here

    return Graph

In [None]:
# Toy example for verifying

Toy_erdos = genErdosRenyi(N = 20, E = 10)
print('Number of nodes in the toy Erdos-Renyi random graph is: %i'%Toy_erdos.number_of_nodes())
print('Number of edges in the toy Erdos-Renyi random graph is: %i'%Toy_erdos.number_of_edges())
nx.draw(Toy_erdos,with_labels=True)

> Small-World Random Network: Generate an instance from this model as follows: begin with
n = 5242 nodes arranged as a ring, i.e., imagine the nodes form a circle and each node is
connected to its two direct neighbors (e.g., node 399 is connected to nodes 398 and 400),
giving us 5242 edges. Next, connect each node to the neighbors of its neighbors (e.g., node
399 is also connected to nodes 397 and 401). This gives us another 5242 edges. Finally,
randomly select 4000 pairs of nodes not yet connected and add an edge between them. In
total, this will make m = 5242 $\cdot$ 2 + 4000 = 14484 edges. Write code to construct instances of
this model, i.e., do not call a SNAP function.
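
The three construction steps described above can be sketched on plain edge sets as follows. This is an illustrative outline under stated assumptions, not the reference solution: the helper name `small_world_edges` and the parameters `extra` and `seed` are hypothetical, and the sketch assumes at least 5 nodes so that the ring and second-neighbour edges are all distinct.

```python
import random


def small_world_edges(N, extra, seed=None):
    """Ring lattice (distance 1 and 2) plus `extra` random new edges.

    Hypothetical helper; requires N >= 5 so that the 2 * N lattice edges
    are all distinct.
    """
    if N < 5:
        raise ValueError("N must be at least 5")
    edges = set()
    for i in range(N):
        for step in (1, 2):  # direct neighbour, then neighbour of neighbour
            j = (i + step) % N
            edges.add((min(i, j), max(i, j)))
    target = len(edges) + extra
    if target > N * (N - 1) // 2:
        raise ValueError("not enough unconnected pairs left")
    rng = random.Random(seed)
    while len(edges) < target:
        i, j = rng.sample(range(N), 2)  # distinct endpoints, no self edge
        edges.add((min(i, j), max(i, j)))  # already connected pairs are no-ops
    return sorted(edges)


toy_small_world = small_world_edges(N=20, extra=1, seed=0)
print(len(toy_small_world))
```

With `N=5242` and `extra=4000` the same routine yields the 5242 $\cdot$ 2 + 4000 = 14484 edges required by the statement.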



In [None]:
def genCircle(N = 5242):
    """
    :param - N: number of nodes

    return type: nx.Graph (undirected_graph)
    return: Circle graph with N nodes and N edges. Imagine the nodes form a
        circle and each node is connected to its two direct neighbors.
    """
    # Your code here
    
    return Graph

In [None]:
# Toy example for verifying

Toy_circle = genCircle(N = 20)
print('Number of nodes in the toy circle graph is: %i'%Toy_circle.number_of_nodes())
print('Number of edges in the toy circle graph is: %i'%Toy_circle.number_of_edges())
nx.draw(Toy_circle,with_labels=True)

In [None]:
def connectNbrOfNbr(Graph, N = 5242):
    """
    :param - Graph: nx.Graph object representing a circle graph on N nodes
    :param - N: number of nodes

    return type: nx.Graph (undirected_graph)
    return: Graph object with additional N edges added by connecting each node
        to the neighbors of its neighbors
    """
    # Your code here
    
    return Graph

In [None]:
# Toy example for verifying
Toy_NeigCirc = connectNbrOfNbr(Toy_circle, N = Toy_circle.number_of_nodes())
print('Number of nodes in the toy neighbour circle graph is: %i'%Toy_NeigCirc.number_of_nodes())
print('Number of edges in the toy neighbour circle graph is: %i'%Toy_NeigCirc.number_of_edges())
nx.draw(Toy_NeigCirc,with_labels=True)

In [None]:
def connectRandomNodes(Graph, M = 4000):
    """
    :param - Graph: nx.Graph object representing an undirected graph
    :param - M: number of edges to be added

    return type: nx.Graph (undirected_graph)
    return: Graph object with additional M edges added by connecting M randomly
        selected pairs of nodes not already connected.
    """
    # Your code here
    
    return Graph

In [None]:
# Toy example for verifying

Toy_RandCirc = connectRandomNodes(Toy_NeigCirc, M = 10)
print('Number of nodes in the toy neighbour circle random graph is: %i'%Toy_RandCirc.number_of_nodes())
print('Number of edges in the toy neighbour circle random graph is: %i'%Toy_RandCirc.number_of_edges())
nx.draw(Toy_RandCirc,with_labels=True)

In [None]:
def genSmallWorld(N = 5242, E = 14484):
    """
    :param - N: number of nodes
    :param - E: number of edges

    return type: nx.Graph (undirected_graph)
    return: Small-World graph with N nodes and E edges
    """
    # Your code here

    return Graph

In [None]:
# Toy example for verifying
Toy_SmallWorld = genSmallWorld(N = 20, E = 41)
print('Number of nodes in the toy Small-World random graph is: %i'%Toy_SmallWorld.number_of_nodes())
print('Number of edges in the toy Small-World random graph is: %i'%Toy_SmallWorld.number_of_edges())
nx.draw(Toy_SmallWorld,with_labels=True)
print('All degrees should be 4 except for 2 nodes that should be 5')
print('    Degrees are: ', list(dict(Toy_SmallWorld.degree()).values()))

> Real-World Collaboration Network: Download [this undirected network](http://snap.stanford.edu/data/ca-GrQc.txt.gz). Nodes in this network represent authors of research
papers on the arXiv in the General Relativity and Quantum Cosmology section. There is
an edge between two authors if they have co-authored at least one paper together. Note
that some edges may appear twice in the data, once for each direction. Ignoring repeats and
self-edges, there are 5242 nodes and 14484 edges. (Note: Repeats are automatically ignored
when loading an (un)directed graph with SNAP's LoadEdgeList function).
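
The cleaning described above (merging edges listed in both directions and dropping self edges) can be sketched without any graph library. The helper name `parse_edge_list` is hypothetical, and treating lines that start with `#` as comments is an assumption about the file format.

```python
def parse_edge_list(lines):
    """Return (nodes, edges) from whitespace-separated 'u v' lines.

    Hypothetical helper: '#' lines are skipped as comments (an assumption
    about the file format), repeated edges are merged, and self edges are
    dropped. Nodes that appear only in self edges are still kept.
    """
    nodes, edges = set(), set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        u, v = map(int, line.split()[:2])
        nodes.update((u, v))
        if u != v:
            edges.add((min(u, v), max(u, v)))  # one entry per undirected edge
    return nodes, edges


sample = ["# FromNodeId\tToNodeId", "1\t2", "2\t1", "3\t3", "2\t3"]
nodes, edges = parse_edge_list(sample)
print(len(nodes), len(edges))
```

The same function accepts an open file object, since iterating over a file yields its lines; on the decompressed data set the counts should match the 5242 nodes and 14484 edges quoted in the statement.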

In [None]:
def loadCollabNet(path):
    """
    :param - path: path to edge list file

    return type: nx.Graph (undirected_graph)
    return: Graph loaded from edge list at `path` with self edges removed

    Do not forget to remove the self edges!
    """
    # Your code here
    
    return Graph

#### 1.1 Degree Distribution
Generate a random graph from both the Erdős-Rényi (i.e., $G(n,m)$) and Small-World models and
read in the collaboration network. Delete all of the self-edges in the collaboration network (there
should be 14,484 total edges remaining).

Plot the degree distribution of all three networks in the same plot on a log-log scale. In other words,
generate a plot with the horizontal axis representing node degrees and the vertical axis representing
the proportion of nodes with a given degree (by "log-log scale" we mean that both the horizontal
and vertical axis must be in logarithmic scale). 

In one to two sentences, describe one key difference
between the degree distribution of the collaboration network and the degree distributions of the
random graph models.
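
A possible way to obtain the plotted points, given as a sketch rather than the expected answer, is to count how many nodes share each degree and divide by the total number of nodes. The helper name `degree_fractions` is hypothetical; it takes a plain list of degrees, which for an `nx.Graph` could be built as `[d for _, d in Graph.degree()]`.

```python
from collections import Counter


def degree_fractions(degrees):
    """Return (X, Y): sorted distinct degrees and the fraction of nodes with each.

    `degrees` holds one degree per node. Zero-degree nodes count towards the
    total but are left out of the output, because 0 cannot be drawn on a
    logarithmic axis.
    """
    degrees = list(degrees)
    counts = Counter(degrees)
    n = len(degrees)
    X = sorted(k for k in counts if k > 0)
    Y = [counts[k] / n for k in X]
    return X, Y


X, Y = degree_fractions([4, 4, 4, 5, 5, 0, 1, 1])
print(X, Y)
```

Sorting `X` matters when the points are drawn as a connected line, as `plt.loglog` does by default.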

In [None]:
def getDataPointsToPlot(Graph):
    """
    :param - Graph: nx.Graph object representing an undirected graph

    return values:
    X: list of degrees
    Y: list of frequencies: Y[i] = fraction of nodes with degree X[i]
    """
    # Your code here
    
    return X, Y

The following code plots the degree distributions of the three networks.

In [None]:
def Q1_1():
    """
    Code for HW1 Q1.1
    """
    global erdosRenyi, smallWorld, collabNet
    erdosRenyi = genErdosRenyi(5242, 14484)
    smallWorld = genSmallWorld(5242, 14484)
    collabNet = loadCollabNet("ca-GrQc.txt")

    x_erdosRenyi, y_erdosRenyi = getDataPointsToPlot(erdosRenyi)
    plt.loglog(x_erdosRenyi, y_erdosRenyi, color = 'y', label = 'Erdos Renyi Network')

    x_smallWorld, y_smallWorld = getDataPointsToPlot(smallWorld)
    plt.loglog(x_smallWorld, y_smallWorld, linestyle = 'dashed', color = 'r', label = 'Small World Network')

    x_collabNet, y_collabNet = getDataPointsToPlot(collabNet)
    plt.loglog(x_collabNet, y_collabNet, linestyle = 'dotted', color = 'b', label = 'Collaboration Network')

    plt.xlabel('Node Degree (log)')
    plt.ylabel('Proportion of Nodes with a Given Degree (log)')
    plt.title('Degree Distribution of Erdos Renyi, Small World, and Collaboration Networks')
    plt.legend()
    plt.show()

In [None]:
# Execute code for Q1.1
Q1_1()

In one to two sentences, describe one key difference between the degree distribution of the collaboration network and the degree distributions of the
random graph models.

Your answer:

#### 1.2 Clustering Coefficient
Recall that the local clustering coefficient for a node $v_i$ was defined in class as

$$ C_i = \begin{cases}\frac{2|e_i|}{k_i\cdot(k_i-1)} & k_i \geq 2 \\
0 & \mathrm{otherwise}\end{cases}$$

where $k_i$ is the degree of node $v_i$ and $e_i$ is the number of edges between the neighbors of $v_i$. The
*average clustering coefficient* is defined as

$$C=\frac{1}{|V|}\sum_{i\in V}C_i$$

Compute and report the average clustering coefficient of the three networks. For this question,
write your own implementation to compute the clustering coefficient, instead of using a built-in
SNAP function.
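
The two definitions above translate fairly directly into code. The sketch below is one possible reading, not the official solution: it assumes the graph is given as a dictionary `adj` mapping each node to the set of its neighbours (with no self edges), and the names `local_clustering` and `average_clustering` are hypothetical.

```python
from itertools import combinations


def local_clustering(adj, node):
    """C_i for `node`, where `adj` maps each node to its set of neighbours."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # the "otherwise" branch of the definition
    # e_i: number of edges between pairs of neighbours of `node`
    e = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * e / (k * (k - 1))


def average_clustering(adj):
    """Mean of C_i over all nodes (0.0 for an empty graph)."""
    if not adj:
        return 0.0
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Toy graph: triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print([local_clustering(adj, v) for v in adj])
print(average_clustering(adj))
```

In the toy graph, nodes 0 and 1 have $C_i = 1$, node 2 has $C_i = 1/3$ (one edge among its three neighbours) and node 3 has $C_i = 0$, so the average is $7/12$. For an `nx.Graph`, the dictionary could be built as `{v: set(Graph.neighbors(v)) for v in Graph.nodes()}`.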

Which network has the largest clustering coefficient? In one to two sentences, explain. Think about
the underlying process that generated the network.

In [None]:
def calcClusteringCoefficientSingleNode(Node, Graph):
    """
    :param - Node: node from nx.Graph object. Graph.nodes() will give an
                   iterable of nodes in a graph
    :param - Graph: nx.Graph object representing an undirected graph

    return type: float
    returns: local clustering coefficient of Node
    """
    # Your code here 
    C = 0.0
    

    return C

In [None]:
def calcClusteringCoefficient(Graph):
    """
    :param - Graph: nx.Graph object representing an undirected graph

    return type: float
    returns: average clustering coefficient of Graph
    """
    # Your code here! If you filled out calcClusteringCoefficientSingleNode,
    #       you'll probably want to call it in a loop here
    C = 0.0

    return C


In [None]:
def Q1_2():
    """
    Code for Q1.2
    """
    C_erdosRenyi = calcClusteringCoefficient(erdosRenyi)
    C_smallWorld = calcClusteringCoefficient(smallWorld)
    C_collabNet = calcClusteringCoefficient(collabNet)

    print('Clustering Coefficient for Erdos Renyi Network: %f' % C_erdosRenyi)
    print('Clustering Coefficient for Small World Network: %f' % C_smallWorld)
    print('Clustering Coefficient for Collaboration Network: %f' % C_collabNet)

In [None]:
# Execute code for Q1.2
Q1_2()

Which network has the largest clustering coefficient? In one to two sentences, explain. Think about
the underlying process that generated the network.

Your answer: