## Introduction

This notebook is part of the BAINSA Wikipedia Knowledge Graph Project. Using the Wiki dumps, we aim to construct an unweighted directed graph containing the Wikipedia articles as nodes and page links as edges. In this particular notebook, we use the files "articles.txt" and "article_links.txt", which we created from data gathered from the Wiki dumps. The structure of "articles.txt" is: page_id page_title. The structure of "article_links.txt" is: page_title page_id_1 page_id_2 ... page_id_n, where page_id_i is the id of the i-th page that contains a link to the current page. In other words, "articles.txt" contains the nodes, while "article_links.txt" is a collection of inward edges. The problem is that some of the titles and ids in "article_links.txt" correspond to improper pages (e.g., redirects). Therefore, we keep only those titles that are present in both files, and then keep only those link ids that belong to pages in this intersection.
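To make the two line formats concrete, here is a minimal parsing sketch. The helper names (`clean_title`, `parse_article_line`, `parse_links_line`) and the sample lines are hypothetical, and the sketch assumes that titles are wrapped in a single pair of quote characters and may contain backslash escapes, which is what the title cleaning in the loading cells of this notebook suggests.

```python
def clean_title(raw):
    """Drop the first and last character (assumed quotes) and any backslashes."""
    return raw[1:-1].replace("\\", "")


def parse_article_line(line):
    """Parse a 'page_id page_title' line into (page_id, title)."""
    fields = line.split()
    return fields[0], clean_title(fields[1])


def parse_links_line(line):
    """Parse a 'page_title page_id_1 ... page_id_n' line into (title, inward_ids)."""
    fields = line.split()
    return clean_title(fields[0]), fields[1:]


# Hypothetical sample lines, made up for illustration only.
article = parse_article_line("12 'Anarchism'\n")
links = parse_links_line("'Anarchism' 25 39 290\n")
print(article)
print(links)
```

Ids are kept as strings here, as in the rest of the notebook, so no integer conversion is needed before writing them back to disk.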

## Storage

We use dictionaries to store the data from the two files. The first one (ids_titles) uses the titles as keys and the ids as values, whereas the second one (all_titles) uses the titles as keys and the array of ids representing inward edges as values. What we have to do is find the "intersection" of the titles and then filter out the ids corresponding to edges that connect nodes which are not in the graph.
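The two-step filtering can be illustrated on a toy example. This is only a sketch of the idea on small in-memory dictionaries, not the code used on the full data; the function `filter_graph` and the toy values are hypothetical.

```python
def filter_graph(ids_by_title, links_by_title):
    """Return {page_id: [inward ids]} restricted to titles found in both dicts."""
    # Step 1: keep only titles that appear in both dictionaries.
    common = ids_by_title.keys() & links_by_title.keys()
    # Step 2: an inward link is valid only if its source page survived step 1.
    valid_ids = {ids_by_title[title] for title in common}
    return {
        ids_by_title[title]: [pid for pid in links_by_title[title] if pid in valid_ids]
        for title in common
    }


# Toy data: "C" has no entry in the links, "R" (say, a redirect) has no id,
# and id "9" does not belong to any known page.
toy_ids = {"A": "1", "B": "2", "C": "3"}
toy_links = {"A": ["2", "9"], "B": ["1", "3"], "R": ["1"]}
edges = filter_graph(toy_ids, toy_links)
print(edges)
```

Only "A" and "B" survive the intersection, so the link from id "3" (page "C") into "B" is dropped together with the unknown id "9".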

In [1]:
import numpy as np

In [2]:
ids_titles = dict()
with open("/Users/mateicosa/Bocconi/BAINSA/articles.txt", "r") as f:
    for line in f:
        # Each line has the form: page_id page_title
        split_line = line.split()
        # Strip the first and last character of the title and drop backslashes
        ids_titles[split_line[1][1:-1].replace("\\", "")] = split_line[0]

In [3]:
all_titles = dict()
clean_titles = set(ids_titles.keys())
with open("/Users/mateicosa/Bocconi/BAINSA/article_links.txt", "r") as g:
    for line in g:
        # Each line has the form: page_title page_id_1 ... page_id_n
        split_line = line.split()
        title = split_line[0][1:-1].replace("\\", "")
        link_ids = np.array(split_line[1:])
        # Keep a title only on its first occurrence, and only if it is also in "articles.txt"
        if title != "" and title in clean_titles:
            all_titles[title] = link_ids
            clean_titles.remove(title)

In [4]:
print(len(ids_titles))
print(len(all_titles))

6090843
5957440


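We notice that ids_titles holds 6090843 titles, while only 5957440 of them also appear in "article_links.txt" and were kept in all_titles. The first file therefore contains more data than the second one, hence filtering is needed.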

In [5]:
del clean_titles

## Filtering the data

We create a set containing the keys of all_titles to speed up the search process. In the end, we want to produce two output files: "final_articles.txt" and "inward_edges.txt". "final_articles.txt" should contain each page_id followed by its page_title, while "inward_edges.txt" should contain each page_id followed by the ids of the pages that link to it. We start with the first file, keeping track of the titles and ids that are missing.

In [6]:
remaining_titles = set(all_titles.keys())

In [7]:
missing_titles = []
missing_ids = []
valid_ids = set()
with open("final_articles.txt", "w") as output:
    for title in ids_titles.keys():
        if title in remaining_titles:
            # The title appears in both files: write it out and mark its id as valid
            output.write(ids_titles[title] + " " + title + '\n')
            remaining_titles.remove(title)
            valid_ids.add(ids_titles[title])
        else:
            # The title has no entry in all_titles: record it as missing
            missing_titles.append(title)
            missing_ids.append(ids_titles[title])

In [8]:
print(len(missing_titles), len(missing_ids), len(all_titles))

133403 133403 5957440
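
As a quick consistency check, the number of missing titles should equal the gap between the two dictionary sizes printed earlier, since all_titles only holds titles that are also keys of ids_titles. The sketch below simply redoes that arithmetic with the figures printed by the notebook; the variable names are hypothetical.

```python
n_articles = 6090843  # len(ids_titles), as printed by cell 4
n_linked = 5957440    # len(all_titles), as printed by cell 4
n_missing = 133403    # len(missing_titles), as printed by cell 8

gap = n_articles - n_linked
print(gap, gap == n_missing)  # prints: 133403 True
```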


In [9]:
missing_titles = set(missing_titles)
missing_ids = set(missing_ids)

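We now delete from all_titles the keys that belong to only one of the dictionaries. Since all_titles was already restricted to titles from ids_titles when it was loaded, we expect its size to stay the same.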

In [10]:
for title in missing_titles:
    all_titles.pop(title, None)

In [11]:
print(len(all_titles))
del missing_titles

5957440


Finally, we create the last file by cross-checking the page ids: a link id is written only if it belongs to valid_ids.

In [12]:
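with open("inward_edges.txt", "w") as output:
    for title in all_titles.keys():
        # Start the line with the id of the current page
        output.write(ids_titles[title])
        for page_id in all_titles[title]:
            # Keep only inward links coming from pages that made it into the graph
            if page_id in valid_ids:
                output.write(' ' + page_id)
        output.write('\n')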

## Conclusion

We now have all the "ingredients" to build the directed, unweighted graph of English Wikipedia.
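
As a possible next step, the inward edges can be inverted into outward adjacency lists. The following is a minimal sketch rather than the project's actual graph-building code: the function `load_outward_edges` is hypothetical, and it assumes lines in the "inward_edges.txt" format produced above (a page id followed by the ids of the pages linking to it).

```python
def load_outward_edges(edge_lines):
    """Invert inward-edge lines ('target source_1 ... source_n') into
    a mapping from each source id to the list of ids it links to."""
    outward = {}
    for line in edge_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        target, sources = fields[0], fields[1:]
        # Make sure pages without outgoing links still appear as nodes.
        outward.setdefault(target, [])
        for source in sources:
            outward.setdefault(source, []).append(target)
    return outward


# Hypothetical three-node example in the "inward_edges.txt" format.
sample = ["1 2 3\n", "2 1\n", "3\n"]
graph = load_outward_edges(sample)
print(graph)
```

Because the function accepts any iterable of lines, an open file handle for "inward_edges.txt" could be passed directly. For the full graph of roughly six million nodes, a plain dictionary of lists may be memory-hungry, so a more compact representation might be preferable.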