# Networks: structure, evolution & processes
**Internet Analytics - Lab 2**

---

**Group:** W

**Names:**

* Olivier Cloux
* Thibault Urien
* Saskia Reiss

---

#### Instructions

*This is a template for part 4 of the lab. Clearly write your answers, comments and interpretations in Markodown cells. Don't forget that you can add $\LaTeX$ equations in these cells. Feel free to add or remove any cell.*

*Please properly comment your code. Code readability will be considered for grading. To avoid long cells of codes in the notebook, you can also embed long python functions and classes in a separate module. Don’t forget to hand in your module if that is the case. In multiple exercises, you are required to come up with your own method to solve various problems. Be creative and clearly motivate and explain your methods. Creativity and clarity will be considered for grading.*

---

## 2.4 PageRank

### 2.4.1 Random Surfer Model

#### Exercise 2.12

In [1]:
#necessary imports
import networkx as nx
import matplotlib.pyplot as plt
import random as random
import csv

#global variables
jumps = 1000
dangling_fac = 0.05

In [2]:
#Load of graphs
G1=nx.read_adjlist('../data/components.graph', create_using=nx.DiGraph(), nodetype=int)
G2=nx.read_adjlist('../data/absorbing.graph', create_using=nx.DiGraph(), nodetype=int)
Gwiki=nx.read_adjlist('../data/wikipedia.graph', create_using=nx.DiGraph(), nodetype=int)


In [3]:
#helper functions, to make code cleaner
def print_dict_sorted(d, precision):
    """print a dictionnary sorted by its keys
    argument 'precision' precises number of desired leading zeros
    """
    print("Weight of each node :")
    for k, v in sorted(d.items()): 
        print("Node",str(k).zfill(precision),"has score", v)

#Surfer
def surfer(G, jumps):
    """Surfs through an nx graph"""
    nodes_list = G.nodes()
    nodes_and_weight = dict(zip(G.nodes(), [0]*G.number_of_nodes()))
    seed = random.sample(nodes_list, 1).pop()
    current = seed
    i = 0
    while i < jumps:
        
        i+= 1
        nodes_and_weight[current] += 1
        possible_nodes = G.edges(current)
        if len(possible_nodes) >= 1: #check for dead end
            current = random.sample(possible_nodes,1).pop()[1]
        else:
            print("Reached a dead end after",i,"jumps and",i+1,"visited pages. No links in this page")
            break
    
    #return normalized version
    nodes_and_weight.update((k, v/i) for k,v in nodes_and_weight.items())
    return nodes_and_weight

#### Results of components graph
We see below that not all components are connected (the network is not one giant component). Thus, entering at one node traps us in the connected component and excludes us from different component(s). This behaviour is to be avoided.

In [4]:
surf1 = surfer(G1, jumps)
print_dict_sorted(surf1, 1)    

Weight of each node :
Node 0 has score 0.288
Node 1 has score 0.288
Node 2 has score 0.289
Node 3 has score 0.135
Node 4 has score 0.0
Node 5 has score 0.0
Node 6 has score 0.0
Node 7 has score 0.0


#### Results of absorbing graph
The result here is better seen when launching the code multiple times.
We quickly see there is a dangling node (node with not outgoing edge). This denotes an absorbing behaviour, meaning once this node is reached we can't keep crawling.

In [5]:
surf2 = surfer(G2, jumps)
print_dict_sorted(surf2, 1)

Reached a dead end after 7 jumps and 8 visited pages. No links in this page
Weight of each node :
Node 0 has score 0.14285714285714285
Node 1 has score 0.14285714285714285
Node 2 has score 0.2857142857142857
Node 3 has score 0.2857142857142857
Node 4 has score 0.14285714285714285


#### Exercise 2.13

In [6]:
def modified_surfer(G, jumps, dang_fac):
    """Surfs through an nx graph"""
    nodes_list = G.nodes()
    nodes_and_weight = dict(zip(G.nodes(), [0]*G.number_of_nodes())) #create dict of node ID and its score (0)
    seed = random.sample(nodes_list, 1).pop()
    current = seed
    i = 0
    while i < jumps:
        
        nodes_and_weight[current] += 1
        possible_nodes = G.edges(current)
        if len(possible_nodes) == 0 or random.randrange(0, 1) < dang_fac: 
            current = random.sample(nodes_list, 1).pop() #take one node at random
        else:
            current = random.sample(possible_nodes,1).pop()[1] #pick one in linked nodes
        i += 1
        
    #return normalized version
    nodes_and_weight.update((k, v/i) for k,v in nodes_and_weight.items())
    return nodes_and_weight

#### Results of components graph with modified surfer
The below result seems much better, as we now visited all components. 

In [7]:
surf1 = modified_surfer(G1, jumps, dangling_fac)
print_dict_sorted(surf1, 1)

Weight of each node :
Node 0 has score 0.137
Node 1 has score 0.101
Node 2 has score 0.124
Node 3 has score 0.137
Node 4 has score 0.128
Node 5 has score 0.11
Node 6 has score 0.143
Node 7 has score 0.12


#### Results of absorbing graph with modified surfer
Our surfer does not halt anymore when reaching a dangling node, which is a correct behaviour. 

In [8]:
surf2 = modified_surfer(G2, jumps, dangling_fac)
print_dict_sorted(surf2, 1)

Weight of each node :
Node 0 has score 0.201
Node 1 has score 0.199
Node 2 has score 0.191
Node 3 has score 0.226
Node 4 has score 0.183


---

### 2.4.2 Power Iteration Method

#### Exercise 2.14: Power Iteration method

In [9]:
import numpy as np
np.set_printoptions(threshold=np.inf)
theta = 0.85

In [13]:
N = Gwiki.number_of_nodes()
w = np.zeros(N)
H = np.zeros((N, N))

for node in Gwiki.nodes(): #analyse every node iteratively
    edges = (Gwiki.edges(node)) #type : edges = list of connected nodes
    if(len(edges) == 0): #no outgoing egde <-> dangling node
        w[node] = 1
    else :
        edges_indices = np.array([x[1] for x in edges], dtype=int)  
        H[node][edges_indices] = 1/len(edges)
ones = np.ones((N,1))
H2 = H + ((w) * ones.T)/N
G = theta*H2 + (1-theta)*np.ones((N, N))/N


In [11]:
pivec = np.ones(size)/size #original pi

for i in range(50000): #long operation, decrease for faster result
    pivec = pivec @ G

max_indices = np.argpartition(pivec, -10)[-10:]


In [12]:
value_to_index = map(lambda x : (pivec[x], x), max_indices)
sorted_indices = sorted(value_to_index, key= lambda x:x[0], reverse=True)
with open('../data/wikipedia_titles.tsv', newline='\n') as datafile:
    fin = datafile.read().splitlines(True)[1:]
    reader = csv.reader(fin, delimiter='\t')
dictionnary = dict((int(row[0]), row[1]) for row in reader)

print("The 10 max elements are :")
j = 1
for i in sorted_indices:    
    print("#",j,":",dictionnary[i[1]],"with score",i[0])
    j+=1

The 10 max elements are :
# 1 : United States with score 0.00128528805214
# 2 : United Kingdom with score 0.000889928006083
# 3 : France with score 0.000859309596939
# 4 : Europe with score 0.000770214720349
# 5 : Germany with score 0.00068103315614
# 6 : England with score 0.000661682508149
# 7 : World War II with score 0.000639597663051
# 8 : Latin with score 0.000620056910491
# 9 : India with score 0.00061750101724
# 10 : English language with score 0.000580789083603


---

### 2.4.3 Gaming the system *(Bonus)*

#### Exercise 2.15 *(Bonus)*