In [1]:
import os
import json

from bs4 import BeautifulSoup
from igraph import Graph
from prettytable import PrettyTable

#### 4.8. Find the 10 Wikipedia documents with the most inlinks. Show the collection of anchor text for those pages

In [2]:
wiki_network = {}

# Walk the collection and record, for every HTML page, the distinct links it contains
for root, dirs, files in os.walk("wiki-small/"):
    for file in files:
        if not file.endswith(".html"):
            continue
        path = os.path.join(root, file)
        with open(path, 'r') as f:
            soup = BeautifulSoup(f.read(), "html5lib")

        outlinks = []
        for a in soup.find_all('a', href=True):
            href = a['href']
            # Keep Wikipedia URLs and relative links, skipping links already recorded
            if ('wikipedia' in href or '../' in href) and href not in outlinks:
                outlinks.append(href)

        wiki_network[path] = outlinks

# Save the link structure to a .json file
with open('wiki-network.json', 'w') as outfile:
    json.dump(wiki_network, outfile)

***

Because the previous cell takes a few minutes to read every page and extract its links, the following lines load the generated .json file directly, so the crawl does not have to be run again. The file contains the links between the nodes.

In [3]:
with open('wiki-network.json') as f:
    wiki_network = json.load(f)

The variable **wiki_network** is a dictionary whose keys are pages and whose values are the lists of pages that each key links to.
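One caveat with this structure: the keys are file paths (they start with `wiki-small/`), while the values are raw `href` strings such as `../../../../index.html`, so the same page may end up as two different nodes. A minimal sketch of one way to make them comparable is shown below. The function name `resolve_href` and the example paths are hypothetical, and it assumes that file names on disk use the percent-decoded form and POSIX-style separators.

```python
import posixpath
from urllib.parse import unquote, urlsplit

def resolve_href(page_path, href):
    """Map an href found in page_path to a name comparable with the dictionary keys."""
    parts = urlsplit(href)
    if parts.scheme in ("http", "https"):
        return href  # absolute URL: keep it unchanged
    # Relative link: drop any #fragment or ?query and decode escapes such as %7E
    local = unquote(parts.path)
    # Resolve the '../' segments against the directory of the linking page
    return posixpath.normpath(posixpath.join(posixpath.dirname(page_path), local))

# Hypothetical page path, used only for illustration
page = "wiki-small/en/articles/a/b/o/About.html"
print(resolve_href(page, "../../../../index.html"))
print(resolve_href(page, "../../../c/o/m/Wikipedia%7ECommunity_Portal_6a3c.html"))
```

Applying such a step before building the graph would change the inlink counts, so the tables in this notebook correspond to the unresolved links.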

Next, we build the graph:

In [4]:
g = Graph(directed=True)
# Collect every page that appears as a source (key) or as a target (value), once,
# so that vertex order and labels are guaranteed to match
nodes = list(set(wiki_network.keys()) | {a for v in wiki_network.values() for a in v})
g.add_vertices(nodes)  # igraph stores these strings in the "name" vertex attribute
g.add_edges([(v, a) for v in wiki_network for a in wiki_network[v]])
g.vs["label"] = nodes

Finally, we compute the number of inlinks of each page and show the 10 pages with the most inlinks:

In [5]:
deg_in = g.degree(mode='in')
deg_in_top_index = sorted(range(len(deg_in)), 
                          key=lambda i: deg_in[i], 
                          reverse=True)[:10]

t = PrettyTable(['Page', 'Number of inlinks'])
for i in deg_in_top_index:
    t.add_row([g.vs["label"][i], deg_in[i]])
    
print(t)

+-----------------------------------------------------------------------------------+-------------------+
|                                        Page                                       | Number of inlinks |
+-----------------------------------------------------------------------------------+-------------------+
|                http://en.wikipedia.org/wiki/Charitable_organization               |        6043       |
|                               ../../../../index.html                              |        6043       |
| http://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License |        6043       |
|                 http://en.wikipedia.org/wiki/Wikipedia:Copyrights                 |        6043       |
|         ../../../../articles/c/o/m/Wikipedia%7ECommunity_Portal_6a3c.html         |        6043       |
|               ../../../../articles/a/b/o/Wikipedia%7EAbout_8d82.html              |        6043       |
|            ../../../../articles/c/u/r/Portal
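The exercise also asks for the anchor text of those pages, which the cells above do not collect. The following is a minimal sketch of how it could be gathered, grouped by `href`. It uses the standard-library `html.parser` instead of BeautifulSoup only to keep the example dependency-free; the class name `AnchorCollector` and the sample HTML are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect the visible text of every <a href=...> element, grouped by href."""

    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)  # href -> list of anchor texts
        self._href = None                 # href of the link currently open, if any
        self._parts = []                  # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and collapse runs of whitespace
            text = " ".join("".join(self._parts).split())
            if text:
                self.anchors[self._href].append(text)
            self._href = None

# Hypothetical page fragment, used only for illustration
sample = ('<p><a href="../a.html">About  Wikipedia</a> and '
          '<a href="../a.html"><b>About</b></a> <a name="x">skip</a></p>')
collector = AnchorCollector()
collector.feed(sample)
collector.close()
anchors = dict(collector.anchors)
print(anchors)
```

Running one collector over every page and keeping only the entries for the top 10 targets would give the requested collection of anchor text.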

#### 4.9. Compute PageRank for the Wikipedia documents. List the 20 documents with the highest PageRank values together with the values.

In [6]:
page_rank = g.pagerank()
page_rank_top_index = sorted(range(len(page_rank)), 
                             key=lambda i: page_rank[i], 
                             reverse=True)[:20]

t = PrettyTable(['Page', 'PageRank'])
for i in page_rank_top_index:
    t.add_row([g.vs["label"][i], page_rank[i]])
    
print(t)

+-----------------------------------------------------------------------------------+------------------------+
|                                        Page                                       |        PageRank        |
+-----------------------------------------------------------------------------------+------------------------+
|                http://en.wikipedia.org/wiki/Charitable_organization               | 0.00032614838074396577 |
|                               ../../../../index.html                              | 0.00032614838074396577 |
| http://en.wikipedia.org/wiki/Wikipedia:Text_of_the_GNU_Free_Documentation_License | 0.00032614838074396577 |
|                 http://en.wikipedia.org/wiki/Wikipedia:Copyrights                 | 0.00032614838074396577 |
|         ../../../../articles/c/o/m/Wikipedia%7ECommunity_Portal_6a3c.html         | 0.00032614838074396577 |
|               ../../../../articles/a/b/o/Wikipedia%7EAbout_8d82.html              | 0.00032614838074396577 |
|

#### 4.10. Figure 4.11 shows an algorithm for computing PageRank. Prove that the entries of the vector $I$ sum to 1 every time the algorithm enters the loop on line 9.

![title](PageRank algorithm.png)
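A sketch of the argument by induction follows. The figure is only available as an image here, so the notation is an assumption: in each pass of the loop every $R_q$ is first set to $\lambda/|P|$; then each page $p$ with outlink set $Q_p$ adds $(1-\lambda)\,I_p/|Q_p|$ to $R_q$ for every $q \in Q_p$, or $(1-\lambda)\,I_p/|P|$ to every page when $Q_p = \emptyset$; finally $I \leftarrow R$.

Base case: before the first pass, $\sum_{i} I_i = |P| \cdot \frac{1}{|P|} = 1$.

Inductive step: if $\sum_{p \in P} I_p = 1$ when the loop is entered, then at the end of the pass

$$
\sum_{q \in P} R_q
= |P| \cdot \frac{\lambda}{|P|}
+ (1-\lambda)\left(\sum_{p \,:\, Q_p \neq \emptyset} |Q_p| \cdot \frac{I_p}{|Q_p|}
+ \sum_{p \,:\, Q_p = \emptyset} |P| \cdot \frac{I_p}{|P|}\right)
= \lambda + (1-\lambda) \sum_{p \in P} I_p
= 1 .
$$

Since $I \leftarrow R$, the entries of $I$ sum to 1 again the next time the loop is entered. The cells below check the base case numerically.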

In [7]:
P = g.vs["label"] # Pages of the graph

I = [0] * len(P) # Create a vector of length |P|
for i in range(len(I)):
    I[i] = 1 / len(P) # Start with each page being equally likely
    
sum(I) # Check that the entries of I sum to 1 before the first pass through the loop on line 9

0.9999999999948229

The result is not exactly 1 because floating-point rounding errors accumulate during the summation. Nevertheless, as shown below, every entry of the vector $I$ is $3.735901641181591 \cdot 10^{-6}$ (only the first 10 elements are shown):

In [8]:
I[:10]

[3.735901641181591e-06,
 3.735901641181591e-06,
 3.735901641181591e-06,
 3.735901641181591e-06,
 3.735901641181591e-06,
 3.735901641181591e-06,
 3.735901641181591e-06,
 3.735901641181591e-06,
 3.735901641181591e-06,
 3.735901641181591e-06]
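The rounding effect can be isolated with a short sketch that compares step-by-step accumulation against `math.fsum`, which keeps exact partial sums and rounds only once. The explicit loop is used because, as far as we know, recent Python versions make the built-in `sum` more accurate for floats, so its output may differ from the value shown above. The variable names `naive` and `exact` are illustrative.

```python
import math

n = 267673       # number of pages in this notebook, i.e. len(P)
I = [1 / n] * n  # uniform start vector

# Plain left-to-right accumulation: a rounding error can be introduced at every step
naive = 0.0
for x in I:
    naive += x

# Exact partial sums, rounded once at the end
exact = math.fsum(I)

print(naive, exact)
```

Any gap between the two values comes from the order and number of roundings, not from the entries of $I$ themselves.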

The size of $P$, that is, the number of pages, is:

In [9]:
len(P)

267673

Therefore, multiplying $3.735901641181591 \cdot 10^{-6}$ by the size of $P$ gives 1:

In [10]:
I[1] * len(P)

1.0
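The cells above only cover the state of $I$ before the first pass. To check the later passes numerically, here is a minimal sketch of the iteration on a toy graph. It assumes the algorithm in Figure 4.11 is the usual random-surfer update with parameter $\lambda$; the function name `pagerank_sums`, the value `lam=0.15`, the fixed number of passes (instead of a convergence test) and the four-page graph are all illustrative assumptions.

```python
import math

def pagerank_sums(pages, links, lam=0.15, iterations=5):
    """Run the update for a fixed number of passes and record sum(I) on entering each pass."""
    n = len(pages)
    index = {p: k for k, p in enumerate(pages)}
    I = [1.0 / n] * n  # start with each page equally likely
    sums = []
    for _ in range(iterations):
        sums.append(math.fsum(I))  # value of sum(I) when the loop is entered
        R = [lam / n] * n          # each page gets the random-jump share
        for p in pages:
            # Outlinks of p that point to pages in the collection
            Q = [q for q in links.get(p, []) if q in index]
            if Q:
                share = (1 - lam) * I[index[p]] / len(Q)
                for q in Q:
                    R[index[q]] += share
            else:
                # No outlinks: spread the mass of p evenly over all pages
                share = (1 - lam) * I[index[p]] / n
                for k in range(n):
                    R[k] += share
        I = R
    return sums

# Toy graph; page D has no outlinks
pages = ["A", "B", "C", "D"]
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}
sums = pagerank_sums(pages, links)
print(sums)
```

Each recorded value should equal 1 up to floating-point rounding, because every pass redistributes the mass of $I$ without creating or losing any.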