# Homework 5 - Visit the Wikipedia hyperlinks graph!

In this assignment we analyze the Wikipedia hyperlink graph. In particular, given extra information about the
categories to which each article belongs, we want to rank the articles according to some criteria.

### [RQ1] Build the graph G = (V, E), where V is the set of articles and E the hyperlinks among them, and provide its basic information:
- Whether it is directed or not
- The number of nodes
- The number of edges
- The average node degree. Is the graph dense?

In [16]:
import pandas as pd
import json
from tqdm import tqdm
from collections import defaultdict
from collections import deque
import networkx as nx
import math
import statistics
import heapq

Inspecting the file **wiki-topcats-reduced.txt**, we notice a pair of nodes (107 and 104) with an edge from 107 to 104 and another from 104 to 107. This suggests that the graph is directed: in an undirected graph one of the two lines would be redundant. We reason as follows. Suppose the file described an undirected graph with each edge listed once. Then there could be no pair of nodes $a$ and $b$ for which both $a \rightarrow b$ and $b \rightarrow a$ are listed. So if such a pair exists, the edges must be directed. To check this, we load the file into a dataframe called `reduced`.

In [2]:
reduced = pd.read_csv('wiki-topcats-reduced.txt', sep = '\t', names = ['source', 'destination'])

In [3]:
reduced.head()

Unnamed: 0,source,destination
0,52,401135
1,52,1069112
2,52,1163551
3,62,12162
4,62,167659


In [4]:
len(reduced)

2645247

Now, for each pair $(a,b)$, we add the reversed edge $b\rightarrow a$, so that the combined edge list contains both $a\rightarrow b$ and $b\rightarrow a$ for every original edge. If the combined list contains duplicates, then some reversed edge was already present in the original file, i.e. there is a pair of nodes linked in both directions, and the graph is directed.
To do this we swap the two columns, concatenate the two dataframes, and then remove the duplicates.

In [5]:
reduced_inv=reduced.reindex(columns=['destination','source'])
reduced_inv.columns=['source','destination']

In [6]:
print('Number of edges of the graph with possible duplicates: ',len(pd.concat([reduced,reduced_inv])))
print('Number of edges of the graph without duplicates: ',len(pd.concat([reduced,reduced_inv]).drop_duplicates()))

Number of edges of the graph with possible duplicates:  5290494
Number of edges of the graph without duplicates:  4348125


The number of edges decreases after removing the duplicates (942369 rows are dropped), so the graph is directed.
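
The duplicate test above can also be triggered by self-loops or by rows repeated in the input file, since both collapse under `drop_duplicates`. As a cross-check, here is a minimal sketch that looks for reciprocal pairs directly. The function name `reciprocal_pairs` and the toy edge list are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
def reciprocal_pairs(edges):
    """Return the unordered pairs (a, b), a != b, that are linked in both directions."""
    edge_set = set(edges)
    return {
        (min(a, b), max(a, b))
        for a, b in edge_set
        if a != b and (b, a) in edge_set  # self-loops are ignored on purpose
    }

# Hypothetical toy edge list: 104 <-> 107 is reciprocal, 52 -> 62 is not,
# and (62, 62) is a self-loop that must not count as evidence.
toy_edges = [(104, 107), (107, 104), (52, 62), (62, 62)]
pairs = reciprocal_pairs(toy_edges)
print(pairs)
```

On the real data one could presumably pass `zip(reduced['source'], reduced['destination'])`; a non-empty result would confirm that some pair of articles links to each other in both directions.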

Now we create our graph using a dictionary in which the keys are the nodes and the values are the lists of nodes reached by their outgoing edges. We also build the `inverse_graph`, which we will need later: its keys are the destination nodes and its values are the lists of nodes from which the incoming edges start.

In [7]:
direct_graph=defaultdict(list)
inverse_graph=defaultdict(list)
with open('wiki-topcats-reduced.txt') as f: 
    for line in tqdm(f):
        l=list(map(int,line.split()))
        direct_graph[l[0]].append(l[1])
        inverse_graph[l[1]].append(l[0])
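# complete the graph: nodes with only incoming edges get an empty adjacency list,
# so that every node appears as a key of direct_graph
for key in inverse_graph.keys():
    if key not in direct_graph:
        direct_graph[key]=[]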

2645247it [00:10, 260346.94it/s]


To compute the total number of nodes, we count the keys of `direct_graph`, which after the completion step above is the union of the keys of `direct_graph` and `inverse_graph`.
The number of edges is the sum of the lengths of the adjacency lists stored as values in `direct_graph`.

In [8]:
V=len(direct_graph.keys())
E=sum(map(len,direct_graph.values()))
D=E/(V*(V-1))
print('Number of nodes: ',V)
print('Number of edges: ',E)
print('Average node degree (IN): ',sum(map(len,inverse_graph.values()))/V)
print('Average node degree (OUT): ',E/V)
print('Density ratio: ', D)

Number of nodes:  461193
Number of edges:  2645247
Average node degree (IN):  5.735661642739591
Average node degree (OUT):  5.735661642739591
Density ratio:  1.2436602635647606e-05
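
The IN and OUT averages coincide because every edge adds one to exactly one out-degree and one in-degree, so both equal $E/V$. The following sketch recomputes the same quantities on a hypothetical toy adjacency dictionary; `graph_stats` is an illustrative name, and the density formula $E/(V(V-1))$ assumes a directed graph without self-loops, matching the formula used in the cell above.

```python
def graph_stats(adjacency):
    """adjacency maps every node to the list of its successors (sinks map to [])."""
    n_nodes = len(adjacency)
    n_edges = sum(len(successors) for successors in adjacency.values())
    avg_out_degree = n_edges / n_nodes          # equals the average in-degree
    density = n_edges / (n_nodes * (n_nodes - 1))
    return n_nodes, n_edges, avg_out_degree, density

# Hypothetical toy graph with 4 nodes and 4 directed edges.
toy = {1: [2, 3], 2: [3], 3: [], 4: [1]}
n_nodes, n_edges, avg_out_degree, density = graph_stats(toy)
print(n_nodes, n_edges, avg_out_degree, density)
```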


The average degree (about 5.7) is tiny compared with the number of nodes, and the density ratio is about $1.2 \cdot 10^{-5}$, so the graph is not dense. Now we create another dictionary whose keys are the categories and whose values are the lists of articles in each category. Then we remove the categories that have 3500 articles or fewer.

In [9]:
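categories={}
with open('wiki-topcats-categories.txt') as f:
    for line in f:
        l=line.split()
        # l[0][9:-1] drops the first 9 characters and the last one of the first token,
        # leaving the category name; the remaining tokens are the article ids
        categories[l[0][9:-1]]=list(map(int,l[1:]))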

In [10]:
graph=categories.copy()
for key in categories:
    if len(categories[key])<=3500:
        del(graph[key]) 

Now we remove the useless nodes from each category, i.e. the articles that have neither outgoing nor incoming edges and therefore do not appear in the graph.

In [11]:
clean_cat=graph.copy()
vertex=set(direct_graph.keys()).union(set(inverse_graph.keys()))
for key in graph:
    clean_cat[key]=list(set(graph[key]).intersection(vertex))

The nodes that have no outgoing edges were already added to `direct_graph`, each with an empty list, when we built the graph.

### [RQ2] Given a category  $C_0 = \{article_1, article_2, \dots \}$ as input we want to rank all of the nodes in V according to the following criteria: 
Obtain a block-ranking, where the blocks are represented by the categories. In particular, we want:

$$block_{RANKING}=\left[\begin{array}{c} C_0 \\ C_1 \\ \vdots \\ C_C \end{array}\right]$$

Each category $C_i$ corresponds to a list of nodes.

In the following chunks of code we implement some tools that we need for the analysis. The first tool is BFS (breadth-first search), which we use to explore the graph. Given a graph and a starting node (root), it returns a dictionary with the distance (the number of edges on a shortest path) from the root to every reachable node; nodes that cannot be reached default to infinity.

In [12]:
def bfs(graph, root):
    seen, queue = set([root]), deque([root])
    distances=defaultdict(lambda: math.inf)
    distances[root] = 0
    while queue:
        vertex = queue.popleft()
        d = distances[vertex]
        for node in graph[vertex]:
            if node not in seen:
                seen.add(node)
                queue.append(node)
                distances[node] = d+1
    return(distances)
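
As a quick sanity check, the function can be run on a small graph whose distances are easy to verify by hand. The sketch below repeats the same BFS so that it is self-contained; the toy graph is hypothetical.

```python
import math
from collections import defaultdict, deque

def bfs(graph, root):
    # same logic as the notebook's bfs: level-by-level exploration from root
    seen, queue = {root}, deque([root])
    distances = defaultdict(lambda: math.inf)
    distances[root] = 0
    while queue:
        vertex = queue.popleft()
        for node in graph[vertex]:
            if node not in seen:
                seen.add(node)
                queue.append(node)
                distances[node] = distances[vertex] + 1
    return distances

# Hypothetical toy graph: node 5 points to the root but cannot be reached from it.
toy_graph = {1: [2, 3], 2: [4], 3: [4], 4: [], 5: [1]}
dist = bfs(toy_graph, 1)
print(dist[4], dist[5])
```

Note that looking up an unreached node such as `dist[5]` returns `math.inf` and, because `distances` is a `defaultdict`, also stores that key in the dictionary.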

We run the BFS from every root in $C_0$ and, for each node, collect the distances from all the roots in a list using the following helper function.

In [13]:
def merge_dist(d_min,d):
    for key in d_min:
        d_min[key].append(d[key])
    return d_min

In [28]:
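def compute_median(l):
    # keep only the finite distances (reachable pairs)
    c=[x for x in l if x!=math.inf]
    # number of unreachable pairs
    n_inf=l.count(math.inf)
    # median finite distance divided by the log of the unreachable count;
    # note: this raises if c is empty or if n_inf is 0 or 1
    return statistics.median(c)/math.log(n_inf)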
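
The score above divides by `math.log(n_inf)`, which is undefined for `n_inf = 0` and zero for `n_inf = 1`, and `statistics.median` fails on an empty list. Below is one possible guarded variant, offered only as a sketch: the name `compute_median_guarded` and both fallback rules are assumptions of ours, not the method used to produce the ranking in this notebook.

```python
import math
import statistics

def compute_median_guarded(distances):
    finite = [x for x in distances if x != math.inf]
    n_inf = len(distances) - len(finite)
    if not finite:
        # assumption: a category with no reachable pair gets the worst score
        return math.inf
    median = statistics.median(finite)
    if n_inf <= 1:
        # assumption: skip the log scaling when log(n_inf) is undefined or zero
        return median
    return median / math.log(n_inf)

print(compute_median_guarded([2, 4, math.inf, math.inf, math.inf]))
```

With this data set the edge cases may never occur, since every category contains many unreachable pairs, but the guards make the failure modes explicit.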

In [15]:
C0=clean_cat['Year_of_birth_unknown']
min_dist={i:[] for i in direct_graph.keys()}
for root in tqdm(C0):
    dist=bfs(direct_graph, root)
    min_dist=merge_dist(min_dist, dist)

100%|██████████| 2536/2536 [1:07:51<00:00,  1.59it/s]  


In [17]:
with open('distances.json', 'w') as fp:
    json.dump(min_dist, fp)
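
When the saved file is read back, two details of Python's `json` module matter: integer keys are written as strings, and `math.inf` is written as the non-standard token `Infinity`, which Python parses back to a float by default. The following sketch shows the round trip in memory on hypothetical toy data; the variable names are illustrative.

```python
import json
import math

# Hypothetical toy version of min_dist: node id -> distances from each root.
toy_min_dist = {52: [0, 2, math.inf], 62: [1, math.inf, 3]}

text = json.dumps(toy_min_dist)      # math.inf is serialized as Infinity
loaded = json.loads(text)            # keys come back as strings
restored = {int(node): dists for node, dists in loaded.items()}

print(list(loaded.keys()))
print(restored == toy_min_dist)
```

Other JSON parsers may reject `Infinity`, so the file is best treated as Python-only unless the infinite values are replaced before saving.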

In [18]:
short_path=defaultdict(list)
for cat in tqdm(clean_cat):
    nodes=clean_cat[cat]
    for key in nodes:
        short_path[cat]+=min_dist[key]  # concatenate the distances from every root to this node

100%|██████████| 35/35 [11:07<00:00, 16.58s/it]


In [None]:
short_path

In [29]:
ranking=[]
for cat in tqdm(short_path):
    heapq.heappush(ranking,(float(compute_median(short_path[cat])),cat))

100%|██████████| 35/35 [10:14<00:00,  3.30s/it]  


In [30]:
ranking

[(0.350766055217393, 'Living_people'),
 (0.3595978887745843, 'English-language_films'),
 (0.37091218182629093, 'American_film_actors'),
 (0.3882433810587469, 'American_military_personnel_of_World_War_II'),
 (0.3688198550787606, 'American_films'),
 (0.38744280712432094, 'People_from_New_York_City'),
 (0.3728531191284263, 'American_television_actors'),
 (0.39593849257133024, 'Article_Feedback_Pilot'),
 (0.4482138406654756, 'Association_football_goalkeepers'),
 (0.40389049910919095, 'English_television_actors'),
 (0.3786004143060232, 'Harvard_University_alumni'),
 (0.4000412099261835, 'Year_of_birth_missing_(living_people)'),
 (0.3919424891362564, 'Fellows_of_the_Royal_Society'),
 (0.43284885756703656, 'The_Football_League_players'),
 (0.3732793895257392, 'Black-and-white_films'),
 (0.4431134262516374, 'Place_of_birth_missing_(living_people)'),
 (0.4331479705370126, 'English_footballers'),
 (0.49705223728207354, 'Asteroids_named_for_people'),
 (0.45289685369664195, 'English-language_album

In [31]:
n=len(ranking)
block_ranking=[]
for  i in range(n):
    block_ranking.append(heapq.heappop(ranking))
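
The `ranking` list printed earlier is not fully sorted because a heap only guarantees that the smallest element sits at index 0; the sorted order appears when the elements are popped one by one. The sketch below, on hypothetical scores, shows that repeated `heappop` yields the same order as `sorted`.

```python
import heapq

# Hypothetical (score, category) pairs, pushed in arbitrary order.
toy_scores = [(0.37, 'B'), (0.35, 'A'), (0.44, 'C')]

heap = []
for item in toy_scores:
    heapq.heappush(heap, item)

# Popping repeatedly returns the tuples in ascending order of score.
popped = [heapq.heappop(heap) for _ in range(len(heap))]
print(popped)
print(popped == sorted(toy_scores))
```

For a small number of categories, calling `sorted` on the list of tuples is a simpler alternative with the same result.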
    

In [32]:
block_ranking

[(0.350766055217393, 'Living_people'),
 (0.3595978887745843, 'English-language_films'),
 (0.3688198550787606, 'American_films'),
 (0.37091218182629093, 'American_film_actors'),
 (0.3728531191284263, 'American_television_actors'),
 (0.3732793895257392, 'Black-and-white_films'),
 (0.3786004143060232, 'Harvard_University_alumni'),
 (0.38744280712432094, 'People_from_New_York_City'),
 (0.3882433810587469, 'American_military_personnel_of_World_War_II'),
 (0.3919424891362564, 'Fellows_of_the_Royal_Society'),
 (0.3951740593136729, 'British_films'),
 (0.39593849257133024, 'Article_Feedback_Pilot'),
 (0.39819989403016337, 'American_Jews'),
 (0.4000412099261835, 'Year_of_birth_missing_(living_people)'),
 (0.40389049910919095, 'English_television_actors'),
 (0.43284885756703656, 'The_Football_League_players'),
 (0.4331479705370126, 'English_footballers'),
 (0.43630320779825527, 'Association_football_midfielders'),
 (0.43862634487184377, 'Debut_albums'),
 (0.43909758020639655, 'Year_of_birth_missi