# Basic PageRank

In this notebook we will see a basic implementation of the PageRank algorithm, along with some considerations on its complexity. The graphs are taken from the SNAP (Stanford Large Network Dataset) collection:

https://snap.stanford.edu/data/

We will start with a small directed graph called "email-Eu-core network" (https://snap.stanford.edu/data/email-Eu-core.html). It represents the emails sent from user *i* to user *j* during an observation period. An email with multiple recipients produces one edge per recipient. The graph contains 1005 nodes and 25571 edges.

The format is very simple: each line contains an edge **from** one node **to** another node:
```text
0 1
2 3
2 4
5 6
5 7
8 9
10 11
```

We first load the data as a list of lists (each line becomes a list with 2 elements), then we transform the data into a matrix and work with the matrix formulation. After that, we transform the data into an adjacency list to improve efficiency, and we reformulate PageRank accordingly.

## Loading the data


As usual, we first define the function to load the data, adapting it to the specific input file format.

In particular, we are going to assign to nodes progressive numbers, so we do not need to rely on the numbering in the file itself (and the node id can be used as matrix index).

In [None]:
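def load_data(filename):
    input_lines = []
    # read all the lines, closing the file when done
    with open(filename, 'r') as f:
        raw_lines = f.read().splitlines()
    num_nodes = 0
    # maps each node id found in the file to a progressive id (0, 1, 2, ...)
    nodes = {}
    for line in raw_lines:
        line_content = line.split()
        # skip empty lines and '#' comment lines, which would break int()
        if not line_content or line_content[0].startswith('#'):
            continue
        from_id = int(line_content[0])
        to_id = int(line_content[1])
        if from_id not in nodes:
            nodes[from_id] = num_nodes
            num_nodes += 1
        if to_id not in nodes:
            nodes[to_id] = num_nodes
            num_nodes += 1
        input_lines.append([nodes[from_id], nodes[to_id]])
    return input_lines, num_nodes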
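
The renumbering described above can be tried on a few in-memory lines, without touching any file. This is only a minimal sketch: `relabel_edges` and the sample lines are hypothetical names introduced here for illustration, and the helper mirrors the mapping logic of `load_data` rather than replacing it.

```python
def relabel_edges(lines):
    """Map arbitrary node ids to 0, 1, 2, ... in order of first appearance."""
    nodes = {}   # original id -> progressive id
    edges = []
    for line in lines:
        from_id, to_id = (int(tok) for tok in line.split()[:2])
        for node_id in (from_id, to_id):
            if node_id not in nodes:
                nodes[node_id] = len(nodes)
        edges.append([nodes[from_id], nodes[to_id]])
    return edges, len(nodes)

# hypothetical sample with non-contiguous ids
sample_lines = ["10 42", "42 7", "7 10"]
sample_edges, sample_num_nodes = relabel_edges(sample_lines)
print(sample_edges, sample_num_nodes)
```

Since the progressive ids are contiguous and start from 0, they can be used directly as row and column indices of an N x N matrix.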

The input file containing the dataset is called "4-email-Eu-core.txt".

On Colab, remember to mount your Drive first:
```python
from google.colab import drive
drive.mount('/content/drive')
input_file = "/content/drive/My Drive/..."
```
Let's load our dataset and see its initial content:

In [None]:
input_file = "./4-email-Eu-core.txt"

input_edges, num_nodes = load_data(input_file)

print("\nThe dataset contains", num_nodes, "nodes and", 
      len(input_edges),"edges.\n")
print("The first five edges are:", input_edges[:5],"\n")

## Matrix formulation

Building and maintaining in memory the full adjacency matrix is inefficient, since the matrix is sparse. But working with a matrix is more intuitive, therefore we will start with this approach. 

As done before, we will use the NumPy library for handling matrices. We first set an element to 1 if the corresponding edge exists, then we transform the matrix into a column-stochastic one.

In [None]:
import numpy as np

# create an NxN matrix of zeros, where N = number of nodes 
matrix_nodes = np.zeros((num_nodes, num_nodes))

# Set element i,j to "1" if there is an edge from j to i
for edge in input_edges:
    from_id = edge[0]
    to_id = edge[1]
    matrix_nodes[to_id, from_id] = 1
    
# compute the "sparsity", i.e., percentage of non-zero cells
sparsity = 100*float(np.count_nonzero(matrix_nodes))/float(num_nodes*num_nodes)
print("Sparsity: %.2f%%" % (sparsity))

# Show a snippet of the matrix
matrix_nodes

We normalize each element by the sum of its column. If a column has no entries (i.e., the node has no outgoing links: it is a dead-end), we fill each element of that column with 1/N.

In [None]:
for col in range(num_nodes):
    degree = np.sum(matrix_nodes[:,col])
    if degree > 0:
        matrix_nodes[:,col] *= 1.0/degree
    else:
        matrix_nodes[:,col] = 1.0/num_nodes

# Show a snippet of the normalized matrix
matrix_nodes
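
The normalization above can be checked by hand on a tiny graph. The following is a sketch on a hypothetical 3-node graph (the names `toy_edges`, `toy_n` and `toy_matrix` are introduced here only for illustration); it uses a vectorized variant of the same normalization, which should give the same result as the column-by-column loop.

```python
import numpy as np

# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2 (node 2 is a dead-end)
toy_edges = [[0, 1], [0, 2], [1, 2]]
toy_n = 3

# element (i, j) is 1 if there is an edge from j to i
toy_matrix = np.zeros((toy_n, toy_n))
for from_id, to_id in toy_edges:
    toy_matrix[to_id, from_id] = 1

# column sums are the outdegrees; a zero sum marks a dead-end
col_sums = toy_matrix.sum(axis=0)
dead_ends = col_sums == 0

# divide the non-empty columns by their sum, fill dead-end columns with 1/N
toy_matrix[:, ~dead_ends] /= col_sums[~dead_ends]
toy_matrix[:, dead_ends] = 1.0 / toy_n

print(toy_matrix)
print("Column sums:", toy_matrix.sum(axis=0))
```

Every column of a column-stochastic matrix sums to 1, which is a quick sanity check to run on `matrix_nodes` as well.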

## Preliminary questions

### Question  Q1
<div class="alert alert-info">
Consider the matrix before the normalization: compute the number of dead-end nodes (and its percentage with respect to the total number of nodes).
</div>

In [None]:
# your answer

### Question  Q2
<div class="alert alert-info">
Compute the sparsity of the matrix (before and after normalization), i.e., the ratio between the non-zero elements and the total number of elements.
</div>

In [None]:
# your answer

### Question  Q3
<div class="alert alert-info">
Compute the amount of memory (bytes) required to store the matrix, before and after the normalization.
</div>

In [None]:
# your answer

### Question  Q4
<div class="alert alert-info">
Compute some basic statistics about the graph, such as:
    
- minimum, maximum and average outdegree (number of outgoing links);
- outdegree distribution (percentage of nodes with outdegree 0, 1, 2, 3, ...). 
- minimum, maximum and average indegree (number of incoming links);
- indegree distribution (percentage of nodes with indegree 0, 1, 2, 3, ...). 
</div>

In [None]:
# your answer

## PageRank (matrix formulation)

We are now ready to compute the PageRank with the power iteration approach. The inputs are:

- The adjacency matrix;
- The parameter beta (the probability of following a link; a teleport happens with probability 1 - beta);
- The target error;
- The maximum number of iterations.
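
For reference, the update performed at each step can be written as follows. The symbols are notation assumed here, not taken from the dataset or the code: $M$ is the column-stochastic matrix, $N$ the number of nodes, $r^{(t)}$ the rank vector at iteration $t$, $\mathbf{1}$ the all-ones vector, and $\varepsilon$ the target error.

$$
r^{(0)} = \frac{1}{N}\,\mathbf{1}, \qquad
r^{(t+1)} = \beta\, M\, r^{(t)} + \frac{1-\beta}{N}\,\mathbf{1}
$$

The iterations stop when the L1 distance between two consecutive vectors drops below the target error, or when the maximum number of iterations is reached:

$$
\sum_{i=1}^{N} \left| r^{(t+1)}_i - r^{(t)}_i \right| < \varepsilon
$$

Since every column of $M$ sums to 1, the entries of $r^{(t+1)}$ sum to $\beta \cdot 1 + (1-\beta) = 1$ whenever those of $r^{(t)}$ do, so the rank vector remains a probability distribution.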


In [None]:
def matrix_pagerank(input_matrix, beta, target_error, max_iterations):
    # infer the number of nodes from the matrix size
    num_nodes = input_matrix.shape[0]
    # initialize the ranking vector
    rank_prev = np.full((num_nodes), 1.0/num_nodes)
    # iterate at most "max_iterations" times
    for curr_iteration in range(max_iterations):
        rank_new = beta*input_matrix.dot(rank_prev) + (1.0-beta)/num_nodes
        # compute the error (L1 distance between consecutive rank vectors)
        curr_err = np.sum(abs(rank_new - rank_prev))
        # for debugging: print the error at each iteration 
        #print("iteration:", curr_iteration, ", err:", curr_err)
        if curr_err < target_error:
            break
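        # keep the current ranks for the next iteration (the ".copy()" avoids aliasing the two vectors)
        rank_prev = rank_new.copy()
    # curr_iteration is the zero-based index of the last iteration performed
    return rank_new, curr_err, curr_iteration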

We run the PageRank with the following parameters:

In [None]:
beta = 0.8
error = 0.0001
max_iterations = 30

pg_nodes, err, iterations = matrix_pagerank(matrix_nodes, beta, error, max_iterations)
print("\nComputed", iterations, "iterations with final error", err, "\n")
print(pg_nodes)

### Question  Q5
<div class="alert alert-info">
Modify the function so that the error is not an absolute value (such as 0.0001), but a relative one for each node rank. For instance, one could stop the iterations if, for each node, the variation of the rank is below 1% with respect to the previous rank.
</div>

## Adjacency list formulation

Instead of a matrix, we maintain a data structure (dictionary) in which, for each node (key), we have the list of its neighbors (value).

In [None]:
adj_nodes = {}
for edge in input_edges:
    from_id = edge[0]
    to_id = edge[1]
    if from_id not in adj_nodes:
        adj_nodes[from_id] = [to_id]
    else:
        adj_nodes[from_id].append(to_id)

# For simplicity, we sort each value (not necessary, but better for visualization)
for node in adj_nodes:
    adj_nodes[node].sort()

# Show the neighbors of the first node
print(adj_nodes[0])
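
The same structure can also be built with `collections.defaultdict` from the standard library, which removes the explicit membership test. This is only an alternative sketch on a hypothetical toy edge list (`toy_edges` and `toy_adj` are names introduced here), not a change to the cell above.

```python
from collections import defaultdict

# Hypothetical toy graph: 0 -> 2, 0 -> 1, 1 -> 2 (node 2 has no outgoing links)
toy_edges = [[0, 2], [0, 1], [1, 2]]

# a missing key is created on first access with an empty list
toy_adj = defaultdict(list)
for from_id, to_id in toy_edges:
    toy_adj[from_id].append(to_id)

# convert back to a plain dict with sorted neighbor lists
toy_adj = {node: sorted(neighbors) for node, neighbors in toy_adj.items()}
print(toy_adj)
```

Note that node 2 never appears as a key: only nodes with at least one outgoing link are stored.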

We can now change our PageRank formulation so that it takes the adjacency list as input. Note that, unlike with the matrix, we need to pass the number of nodes explicitly, because nodes with no outgoing links (dead-ends) do not appear as keys in the adjacency list, so we cannot infer the number of nodes from the size of the list.

Note also that at each step we evaluate how much rank has been lost (due to dead-ends), and we redistribute it evenly among all the nodes.

In [None]:
def adj_pagerank(adj_list, num_nodes, beta, target_error, max_iterations):
    # initialize the ranking vector
    rank_prev = np.full((num_nodes), 1.0/num_nodes)
    # iterate at most "max_iterations" times
    for curr_iteration in range(max_iterations):
        # rank_new is built incrementally, so it must be reset to zero at each iteration
        rank_new = np.zeros((num_nodes))
        # the leaked rank is found by subtraction: start from the total (1.0)
        # and subtract the rank of every node that has outgoing links
        leaked = 1.0
        for node in adj_list:
            # we derive the outdegree from the list size
            outdegree = len(adj_list[node])
            leaked -= rank_prev[node]
            for neigh in adj_list[node]:
                rank_new[neigh] += beta*rank_prev[node]/outdegree
        # add the teleport (1-beta) and the leaked values (times beta)
        rank_new += (1.0-beta+beta*leaked)/num_nodes
        # compute the error
        curr_err = np.sum(abs(rank_new - rank_prev))
        if curr_err < target_error:
            break
        rank_prev = rank_new.copy()
    # curr_iteration is the zero-based index of the last iteration performed
    return rank_new, curr_err, curr_iteration
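
To see why redistributing the leaked rank matches the matrix formulation (where dead-end columns are filled with 1/N), a single update step can be computed both ways on a tiny graph. The sketch below assumes a hypothetical 3-node graph and an arbitrary starting vector that sums to 1; all the `toy_` names are introduced here for illustration only.

```python
import numpy as np

toy_beta = 0.8
toy_n = 3
# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2 (node 2 is a dead-end)
toy_adj = {0: [1, 2], 1: [2]}
toy_rank_prev = np.array([0.5, 0.3, 0.2])

# Matrix step: column-stochastic matrix with the dead-end column set to 1/N
toy_m = np.array([[0.0, 0.0, 1 / 3],
                  [0.5, 0.0, 1 / 3],
                  [0.5, 1.0, 1 / 3]])
rank_from_matrix = toy_beta * toy_m.dot(toy_rank_prev) + (1.0 - toy_beta) / toy_n

# Adjacency list step: propagate along the edges, then add teleport and leak
rank_from_adj = np.zeros(toy_n)
leaked = 1.0
for node, neighbors in toy_adj.items():
    leaked -= toy_rank_prev[node]
    for neigh in neighbors:
        rank_from_adj[neigh] += toy_beta * toy_rank_prev[node] / len(neighbors)
rank_from_adj += (1.0 - toy_beta + toy_beta * leaked) / toy_n

print(rank_from_matrix)
print(rank_from_adj)
```

After the loop, `leaked` equals the total rank held by the dead-ends (here 0.2), provided that the previous vector sums to 1. Spreading `beta * leaked` evenly over the N nodes is what the 1/N columns do in the matrix version.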

Let's try it and compare with the results obtained with the matrix formulation:


In [None]:
pg_nodes_adj, err, iterations = adj_pagerank(adj_nodes, num_nodes, beta, error, max_iterations)
print("\nComputed", iterations, "iterations with final error", err, "\n")
print(pg_nodes_adj)

print("\nThe sum of the element-wise difference with the previous formulation is:\n", 
      np.sum(abs(pg_nodes_adj - pg_nodes)), "\n")

## Additional questions

### Question  Q6
<div class="alert alert-info">
Compute the amount of memory (bytes) required to store the adjacency list, and compare it with the memory used by the matrix. 
</div>

In [None]:
# your answer here

### Question  Q7
<div class="alert alert-info">
Find the top 5 ranked nodes, the bottom 5 ranked nodes, and compute the ranking range (difference between the highest and lowest ranking).
</div>

In [None]:
# your answer here

### Question  Q8
<div class="alert alert-info">
Divide the different ranking values into ranges (with constant size, or exponentially increasing size) and show the percentage of nodes in each range. 
    
Compare the distribution with the indegree distribution computed before.
</div>

In [None]:
# your answer here

### Question  Q9
<div class="alert alert-info">
Analyze a larger graph (file "4-soc-Epinions1.txt"), and in particular:
    
- How many nodes and edges does it have?
- Is your system able to handle the PageRank computation using the matrix formulation?
- Compute the PageRank with the adjacency list formulation.
- Reply to Q7-Q8 for this graph.
</div>

In [None]:
# your answer here

### Question  Q10
<div class="alert alert-info">
Can you reformulate the PageRank algorithm assuming that you have the adjacency list of the incoming links? 
    
Does this formulation have any benefits?
</div>

In [None]:
# your answer here