# Homework 5 - Visit the Wikipedia hyperlinks graph!
In this assignment we perform an analysis of the Wikipedia Hyperlink graph. In particular, given extra information about the categories to which an article belongs to, we are curious to rank the articles according to some criteria. 

In [168]:
import pandas as pd
import json
import pickle
from tqdm import tqdm

## Research questions


### **[RQ1]** 
Build the graph <img src="https://latex.codecogs.com/gif.latex?G=(V,&space;E)" title="G=(V, E)" /> where *V* is the set of articles and *E* the hyperlinks among them, and provide its basic information:
 
- If it is direct or not
- The number of nodes
- The number of edges 
- The average node degree. Is the graph dense?

###### Build the graph!

In [208]:
F = open('wiki-topcats-reduced.txt','r') 
rows=F.read().split('\n') #split the rows
grafo={} #initialize the graph
for row in rows:
        link=row.split('\t') 
        if link[0] not in grafo: #add the vertex if it doesn't exist
            try:
                grafo[link[0]]=set()
                grafo[link[0]].add(link[1]) #add the edge
            except: print('empty row')
        else:
            grafo[link[0]].add(link[1])
F.close()

empty row


###### Find out if it's directed or not:

In [210]:
for neighbors in grafo['52']:
    print(grafo[neighbors])

{'1161659', '402300', '723911', '1058269', '401227', '827334', '1163806', '401295', '401628', '1061728', '946986', '1163338', '401137', '1163551', '1288276', '595633', '401310', '1062938', '401231', '401171', '400980', '447882', '1060341', '1184538', '401154', '1394526', '1184217', '402715', '167532', '724192', '401018', '824998', '401975', '1169888', '1399606', '60219', '1288076', '630946', '1061885', '401019', '1061824', '401474', '1163407', '401609', '606279', '401457', '401053', '810461', '1061905', '1571179', '776478', '961942', '402265', '1062323', '401067', '1245651', '809904', '402718', '401184', '401505', '401981', '401315'}
{'1066969', '1060396', '1069113', '1069258', '1061304', '1656982', '1069275', '1069008', '1062611'}
{'1061960', '1062448', '1163562', '1400452', '66625', '1181827', '1454311', '1170120', '1262459', '263677', '1179310', '1638488', '167374', '751054', '1166302', '8961', '604445', '806219', '1383234', '811181', '1184026', '1164926', '1061736', '1163557', '678

We see that it's undirected, since we have unspecular edges.

###### Get the number of nodes!

In [211]:
nodes=len(grafo)
nodes

428958

###### Get the number of edges!

In [212]:
edges=0
for node in grafo:
    edges+=(len(grafo[node]))
edges

2645247

###### Get the average node degree. Is the graph dense?

In graph theory, the degree of a vertex of a graph is the number of edges incident to the vertex. The degree of a vertex $v$ is denoted $\deg(v)$.

In [213]:
avg_degree= edges/nodes
avg_degree

6.1666806540500465

As we see, the average node degree is slightly great than six.
In mathematics, a dense graph is a graph in which the number of edges is close to the maximal number of edges.
We can conclude that the graph is not dense. It is very sparse indeed.

In [244]:
F = open('wiki-topcats-page-names.txt','r')
articles={}
for line in F.readlines():
    num=line.split()[0]
    tit=line.split()[1:]
    title=' '.join(tit)
    articles[num]=title
F.close()

### **[RQ2]** 
Given a category <img src="https://latex.codecogs.com/gif.latex?C_0&space;=&space;\{article_1,&space;article_2,&space;\dots&space;\}" title="C_0 = \{article_1, article_2, \dots \}" /> as input we want to rank all of the nodes in *V* according to the following criteria:
	
* Obtain a *block-ranking*, where the blocks are represented by the categories. In particular, we want:

<img src="https://latex.codecogs.com/gif.latex?block_{RANKING}&space;=\begin{bmatrix}&space;C_0&space;\\&space;C_1&space;\\&space;\dots&space;\\&space;C_c\\&space;\end{bmatrix}" title="block_{RANKING} =\begin{bmatrix} C_0 \\ C_1 \\ \dots \\ C_c\\ \end{bmatrix}" />
	
Each category $C_i$ corresponds to a list of nodes. 

The first category of the rank, $C_0$, always corresponds to the input category. The order of the remaining categories is given by:

<img src="https://latex.codecogs.com/gif.latex?$$distance(C_0,&space;C_i)&space;=&space;median(ShortestPath(C_0,&space;C_i))$$" title="distance(C_0, C_i) = median(ShortestPath(C_0, C_i))" />

The lower is the distance from $C_0$, the higher is the $C_i$ position in the rank. $ShortestPath(C_0, C_i)$ is the set of all the possible shortest paths between the nodes of $C_0$  and $C_i$. Moreover, the length of a path is given by the sum of the weights of the edges it is composed by.



We create a dictionary that associate evry category with his articles.

In [214]:
F = open('wiki-topcats-categories.txt','r')
categories={}
for line in F.readlines():
    riga=line.split(' ')
    categoria=(riga[0].replace('Category:','').replace(';',''))
    articles=(riga[1:-1])
    articles.append(riga[-1].replace('\n',''))
    categories[categoria]= articles
F.close()

We take care of categories that contains only articles that are not in our graph:

In [215]:
todelete=[]
for cat in categories:
    esiste=False
    for art in categories[cat]:
        esiste=art in grafo
        if esiste==True: break          
    if esiste==False:
        todelete.append(cat)
for cat in todelete:
    del categories[cat]            

We remove every article that is not in our graph:

In [217]:
for cat in categories:
    toremove=[]
    for art in categories[cat]:
        if art not in grafo:
            toremove.append(art)
    for art in toremove:
        categories[cat].remove(art)

        

We store our final categories dictionary in an external pickle file so we can reload it in need.

In [218]:
with open('categories.pickle', 'wb') as handle:
    pickle.dump(categories, handle, protocol=pickle.HIGHEST_PROTOCOL)


In [219]:
with open('categories.pickle', 'rb') as handle:
    categories = pickle.load(handle)

In [221]:
len(categories)

11954

In [229]:
inputcat=input()

Japanese_expatriate_footballers


In [249]:
categories[inputcat][1:10]

['87351',
 '87352',
 '88946',
 '90384',
 '90502',
 '90528',
 '90530',
 '90754',
 '91000']

In [248]:
articles['87351']

'Junichi Inamoto'


* Once you obtain the $"block_{RANKING}"$ vector, you want to sort the nodes in each category. The way you should sort them is explained by this example:

	*	Suppose the categories order, given from the previous point, is <img src="https://latex.codecogs.com/gif.latex?C_0,&space;C_1,&space;C_2" title="C_0, C_1, C_2" />


__[STEP1]__ Compute subgraph induced by <img src="https://latex.codecogs.com/gif.latex?C_0" title="C_0" />. For each node compute the sum of the weigths of the in-edges.

 <img src="https://latex.codecogs.com/gif.latex?score_{article_i}&space;=&space;\sum_{i&space;\in&space;in-edges}&space;w_i" title="score_{article_i} = \sum_{j \in in-edges(article_i)} w_j" />

__[STEP2]__ Extend the graph to the nodes that belong to <img src="https://latex.codecogs.com/gif.latex?C_1" title="C_1" />. Thus, for each article in <img src="https://latex.codecogs.com/gif.latex?C_1" title="C_1" /> compute the score as before. __Note__ that the in-edges coming from the previous category, <img src="https://latex.codecogs.com/gif.latex?C_0" title="C_0" />, have as weights the score of the node that sends the edge.


__[STEP3]__ Repeat Step2 up to the last category of the ranking. In the last step of the example you clearly see the weight update of the edge coming from node *E*.
	
![alt text](imgs/algorithm.PNG)
