# Homework 5 - Visit the Wikipedia hyperlinks graph!

In this assignment we analyze the Wikipedia hyperlink graph. In particular, given extra information about the
categories to which each article belongs, we want to rank the articles according to some criteria.

### [RQ1] Build the graph G = (V, E), where V is the set of articles and E the hyperlinks among them, and provide its basic information:
- Whether it is directed or not
- The number of nodes
- The number of edges
- The average node degree. Is the graph dense?

In [11]:
import pandas as pd
import json
from tqdm import tqdm
import collections
from collections import defaultdict
import networkx as nx

Looking at the file **wiki-topcats-reduced.txt**, we notice a pair of nodes (107 and 104) joined by both an edge from 107 to 104 and an edge from 104 to 107. This suggests that the graph is directed: in an undirected graph, one of the two lines would be redundant. The argument is as follows. Suppose the graph were undirected. Then the file would have no reason to list both an edge $a \rightarrow b$ and an edge $b \rightarrow a$ for any pair of nodes $a$ and $b$. So if at least one such pair exists, we conclude that the graph is directed. To check this, we load the file into a dataframe called `reduced`.

In [None]:
reduced = pd.read_csv('wiki-topcats-reduced.txt', sep = '\t', names = ['source', 'destination'])

In [None]:
reduced.head()

In [None]:
len(reduced)

Now, for each pair $(a,b)$ in the file, we add the reversed edge $b\rightarrow a$, so that every pair appears in both directions. If the resulting edge list contains duplicates, then some reversed edge was already present in the original file, i.e. there is a pair of nodes linked in both directions, and the graph is directed.
To do this we swap the two columns, concatenate the original and the swapped dataframes, and then remove the duplicates.

In [None]:
reduced_inv=reduced.reindex(columns=['destination','source'])
reduced_inv.columns=['source','destination']

In [None]:
print('Number of edges of the graph with possible duplicates:', len(pd.concat([reduced, reduced_inv])))
print('Number of edges of the graph without duplicates:', len(pd.concat([reduced, reduced_inv]).drop_duplicates()))

The number of edges decreases after removing the duplicates, so the graph is directed.
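
The same check can be made directly on the edge list, without concatenating dataframes. The following is a minimal sketch, assuming the edges are available as `(source, destination)` tuples; `reciprocal_pairs` and the toy edges are hypothetical names introduced only for illustration. Unlike the duplicate count, it ignores self-loops and rows repeated in the file, both of which would also shrink the count after removing duplicates.

```python
def reciprocal_pairs(edges):
    """Return the unordered pairs (a, b), a < b, for which both a->b and b->a are listed."""
    edge_set = set(edges)
    return {
        (min(a, b), max(a, b))
        for a, b in edge_set
        if a != b and (b, a) in edge_set
    }


# Hypothetical toy edge list: 107 <-> 104 is reciprocal, 1 -> 2 is not.
toy_edges = [(107, 104), (104, 107), (1, 2)]
pairs = reciprocal_pairs(toy_edges)
print(pairs)
```

A non-empty result means that at least one pair of articles links in both directions, which is the condition used in the argument above.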

Now we build our graph as a dictionary in which each key is a node and the value is the list of nodes that the key links to. We also build the `inverse_graph`, which we will need later: there each key is a node and the value is the list of nodes from which an edge reaches it.

In [2]:
direct_graph=defaultdict(list)
inverse_graph=defaultdict(list)
with open('wiki-topcats-reduced.txt') as f: 
    for line in tqdm(f):
        l=list(map(int,line.split()))
        direct_graph[l[0]].append(l[1])
        inverse_graph[l[1]].append(l[0])
        

2645247it [00:09, 293706.34it/s]
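
To make the two dictionaries concrete, here is a small sketch on a toy edge list. `build_graphs` and the three toy edges are assumptions made for illustration; the parsing mirrors the loop above but reads from an in-memory string instead of the real file.

```python
from collections import defaultdict
import io


def build_graphs(lines):
    """Build successor and predecessor adjacency lists from 'src dst' lines."""
    direct = defaultdict(list)   # node -> nodes it links to
    inverse = defaultdict(list)  # node -> nodes that link to it
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        src, dst = map(int, parts)
        direct[src].append(dst)
        inverse[dst].append(src)
    return direct, inverse


# Hypothetical toy input standing in for the real file.
toy_file = io.StringIO("107\t104\n104\t107\n1\t2\n")
direct, inverse = build_graphs(toy_file)
print(dict(direct))   # successors of each node
print(dict(inverse))  # predecessors of each node
```

Node 2 appears only as a key of `inverse`, and node 1 only as a key of `direct`, which is why both key sets are needed to count the nodes.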


To compute the total number of nodes, we take the union of the set of keys of `direct_graph` and the set of keys of `inverse_graph`.
The number of edges is the sum of the lengths of the lists stored as values in the direct graph.

In [3]:
V=len(set(direct_graph.keys()).union(set(inverse_graph.keys())))
E=sum(map(len,direct_graph.values()))
D=E/(V*(V-1))
print('Number of nodes: ',V)
print('Number of edges: ',E)
print('Average node degree (IN): ',sum(map(len,inverse_graph.values()))/V)
print('Average node degree (OUT): ',E/V)
print('Density ratio: ', D)

Number of nodes:  461193
Number of edges:  2645247
Average node degree (IN):  5.735661642739591
Average node degree (OUT):  5.735661642739591
Density ratio:  1.2436602635647606e-05


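Every edge contributes exactly one out-degree and one in-degree, which is why the two averages coincide. Each node has on average fewer than 6 outgoing links out of roughly 461,000 possible targets, and the density ratio $E/(V(V-1))$ is about $1.2 \cdot 10^{-5}$, so the graph is not dense: it is very sparse.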
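
As a quick sanity check of the figures above, the density of a directed graph without self-loops can be recomputed from the node and edge counts alone. This is a small sketch; `directed_density` is a hypothetical helper name, and the counts are the ones printed above.

```python
def directed_density(num_nodes, num_edges):
    """Fraction of the num_nodes * (num_nodes - 1) possible directed edges that exist."""
    if num_nodes < 2:
        raise ValueError("density is undefined for fewer than 2 nodes")
    return num_edges / (num_nodes * (num_nodes - 1))


# Counts taken from the output above.
n_nodes, n_edges = 461193, 2645247
avg_degree = n_edges / n_nodes
density = directed_density(n_nodes, n_edges)
print('Average out-degree:', avg_degree)
print('Density ratio:', density)
```

A complete directed graph would have density 1, so a value of this order indicates that almost all possible links are absent.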

In [None]:
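Block_ranking=[]
categories={}
with open('wiki-topcats-categories.txt') as f:
    for line in f:
        l=line.split()
        # Assuming the first token looks like 'Category:<name>;',
        # drop the 9-character prefix and the trailing ';' to get the name.
        categories[l[0][9:-1]]=list(map(int,l[1:]))
        # Keep the article ids of each category, in file order.
        Block_ranking.append(list(map(int,l[1:])))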

In [None]:
# Keep only the categories with more than 3500 articles.
graph={key: articles for key, articles in categories.items() if len(articles)>3500}

In [None]:
for key in graph:
    for article in graph[key]:
        if article not in direct_graph:
            direct_graph[article]=[]

In [29]:
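from collections import defaultdict, deque

def BFS(G, v):
    """Breadth-first search from v; returns the search tree as {parent: [children]}."""
    visited={v}
    search_tree=defaultdict(list)
    # A deque pops from the left in O(1); list.pop(0) is O(n) on a long queue.
    queue=deque([v])
    while queue:
        u=queue.popleft()
        for w in G[u]:
            if w not in visited:
                queue.append(w)
                visited.add(w)
                search_tree[u].append(w)

    return search_tree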

In [None]:
graph1 = {'A': ['B', 'C', 'E'],
         'B': ['A','D', 'E'],
         'C': ['A', 'F', 'G'],
         'D': ['B'],
         'E': ['A', 'B','D'],
         'F': ['C'],
         'G': ['C']}

t=BFS(graph1,'A')

In [None]:
t

In [None]:
def compute_distances(search_tree, root):
    """Return {node: distance from root} for every node of the BFS search tree."""
    res = {}

    def visit(node, distance):
        res[node] = distance
        # .get avoids a KeyError on plain dicts and does not add
        # empty entries to a defaultdict for the leaves of the tree.
        for child in search_tree.get(node, []):
            visit(child, distance + 1)

    visit(root, 0)
    return res
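
The two-step approach above (build the BFS tree, then walk it recursively) can also be collapsed into a single pass that records each distance when a node is first discovered. The following is a sketch of that alternative, not the method used in the rest of this notebook; `bfs_distances` is a hypothetical name, and the toy graph repeats the small example used to test `BFS`.

```python
from collections import deque


def bfs_distances(G, source):
    """Return {node: number of edges on a shortest path from source}."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        # .get works for plain dicts and defaultdicts, and treats
        # nodes without an adjacency list as having no successors.
        for w in G.get(u, ()):
            if w not in dist:  # first discovery gives the shortest distance
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


toy_graph = {'A': ['B', 'C', 'E'],
             'B': ['A', 'D', 'E'],
             'C': ['A', 'F', 'G'],
             'D': ['B'],
             'E': ['A', 'B', 'D'],
             'F': ['C'],
             'G': ['C']}

toy_dist = bfs_distances(toy_graph, 'A')
print(toy_dist)
```

Nodes that cannot be reached from the source do not appear in the result at all, which is worth keeping in mind when distances from several sources are combined.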
    


In [None]:
d1 = compute_distances(t, 'A')

In [None]:
def merge_distances(d1, d2):
    dres = defaultdict(list)
    for k,v in list(d1.items()) + list(d2.items()):
        dres[k].append(v)
      
    return dres

In [None]:
r=merge_distances(d1,d1)

In [None]:
r

In [None]:
def compute_graph_distance(G, nodes):
    """Compute the distances between each node in `nodes` and
    all the other nodes reachable from it in the graph G."""
    if not nodes:
        raise RuntimeError("The nodes list cannot be empty.")
    # Append every distance to one flat list per reached node. Merging an
    # already merged result again would nest lists inside lists.
    dist = defaultdict(list)
    for n in tqdm(nodes):
        for node, d in compute_distances(BFS(G, n), n).items():
            dist[node].append(d)
    return dist
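
When distances from several sources are gathered into one list per node, the list no longer records which source each value came from, and a node that is unreachable from some source simply gets a shorter list. One possible variant, sketched here under the assumption that the per-source distance dictionaries have already been computed, keys each distance by its source. `collect_distances` and the toy data are hypothetical.

```python
from collections import defaultdict


def collect_distances(per_source):
    """Turn {source: {node: distance}} into {node: {source: distance}}."""
    collected = defaultdict(dict)
    for source, dist in per_source.items():
        for node, d in dist.items():
            collected[node][source] = d
    return dict(collected)


# Hypothetical toy data: 'A' is not reachable from 'B'.
per_source = {
    'A': {'A': 0, 'B': 1, 'C': 2},
    'B': {'B': 0, 'C': 1},
}
by_node = collect_distances(per_source)
print(by_node)
```

With this layout a missing source key makes an unreachable pair explicit instead of silently shortening a list.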

In [None]:
#graph['Asteroids_named_for_people']
direct_graph[graph['Asteroids_named_for_people'][1]]

In [None]:
distances = compute_graph_distance(direct_graph, graph['Asteroids_named_for_people'])