## Introduction

This notebook is part of the BAINSA Wikipedia Knowledge Graph Project. Using the Wiki Dumps, we aim to construct an unweighted directed graph with the Wikipedia articles as nodes and the page links as edges. In this particular notebook, we use the file "enwiki-latest-pagelinks.sql". To parse it, we exploit the regular structure of the SQL INSERT statements it contains, as shown in the example below.

In [1]:
with open("/Users/mateicosa/Downloads/enwiki-latest-pagelinks.sql", "r") as f:
    line = "a"
    i = 0
    while "INSERT INTO" not in line:
        i += 1
        line = f.readline()

In [2]:
line = line[31:]
print(line[:100])

(586,0,'!',0),(4748,0,'!',0),(9773,0,'!',0),(15019,0,'!',0),(15154,0,'!',0),(25213,0,'!',0),(73634,0
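
The offset 31 used above can be checked on a small hand-written line. This is only a sketch: the statement header `INSERT INTO `pagelinks` VALUES ` is an assumption inferred from the offset and the table name, and the sample line is made up rather than read from the dump.

```python
# Assumed statement header (not copied from the dump itself).
prefix = "INSERT INTO `pagelinks` VALUES "

# Made-up sample line in the same shape as the output shown above.
sample = prefix + "(586,0,'!',0),(4748,0,'!',0);\n"

# The header is 31 characters long, which is why the notebook slices at 31.
print(len(prefix))

# Everything after the header is a comma-separated list of 4-field tuples.
body = sample[len(prefix):]
print(body)
```

Each tuple holds one link, and the whole list is terminated by a semicolon and a newline.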


## Line processing

We define a function to process individual lines of the SQL file. The schema of the table created by the SQL statement contained in the file can be found at https://www.mediawiki.org/wiki/Manual:Pagelinks_table. The fields that are of interest to us are pl_from (0), pl_namespace (1), pl_title (2) and pl_from_namespace (3). To obtain a file containing only proper articles, we discard every link for which pl_namespace or pl_from_namespace is different from 0 (i.e., at least one of the two pages is not in the article namespace). For the remaining links, we store the id of the linking page (pl_from) under the title of the linked page (pl_title) in a dictionary defined outside the function.

In [3]:
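def process_line(line):
    # Drop the trailing ");" so the last tuple is parsed like all the others.
    line = line.rstrip()
    if line.endswith(");"):
        line = line[:-2]
    n = len(line)
    i = 0
    while i < n:
        if line[i] == "(":
            start = i + 1
        else:
            raise Exception(f"Something went wrong {line[max(i - 10, 0):i + 10]}")
        # Advance to the "),(" separator that ends the current tuple.
        while i + 2 < n and not (line[i] == ")" and line[i + 1] == "," and line[i + 2] == "("):
            i += 1
        if i + 2 >= n:
            # No separator left: this is the last tuple on the line.
            i = n
        end = i
        i += 2
        block = line[start:end].split(",")
        if len(block) >= 4:
            # The last field is pl_from_namespace; the title itself may contain commas.
            if block[1] != '0' or block[-1] != '0':
                continue
            page_id = block[0]
            title = ",".join(block[2:-1])
            if title in g:
                g[title].append(page_id)
            else:
                g[title] = [page_id]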
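
As a complement to the character-by-character scan above, the same tuples could also be extracted with a regular expression. This is a minimal sketch, assuming every tuple has the four-field layout described earlier; `TUPLE_RE` and `parse_tuples` are hypothetical names that are not part of the original notebook.

```python
import re

# Hypothetical pattern for (pl_from, pl_namespace, 'pl_title', pl_from_namespace).
# The title group accepts backslash-escaped characters such as \' inside the quotes.
TUPLE_RE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)',(-?\d+)\)")

def parse_tuples(values):
    """Return (pl_from, pl_title) pairs for links between article-namespace pages."""
    links = []
    for pl_from, pl_namespace, pl_title, pl_from_namespace in TUPLE_RE.findall(values):
        # Keep a link only if both endpoints are in namespace 0.
        if pl_namespace == "0" and pl_from_namespace == "0":
            links.append((pl_from, pl_title))
    return links

# Made-up sample: the second title contains a comma, the last two tuples are filtered out.
sample = "(586,0,'!',0),(4748,0,'Foo,_bar',0),(9,1,'X',0),(10,0,'Y',2)"
print(parse_tuples(sample))
```

Unlike the scan above, this sketch returns the titles without their surrounding quotes. Because the title is matched as a quoted string, commas inside a title do not shift the field positions.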

## Parsing the file

Using the previously defined function, we parse the entire file line by line, collect the relevant information in a dictionary and write it to the output text file. To keep memory usage bounded, we write the dictionary to the file every 100 lines and then reinitialize it as an empty dictionary. As a consequence, the same title can appear on more than one row of the output. Furthermore, we skip disambiguation pages, as these are not proper articles.

In [4]:
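def write_titles(g, output):
    # Write one row per title, skipping disambiguation pages.
    for title in g:
        if "(disambiguation)" not in title:
            output.write(title + " " + ' '.join(g[title]) + "\n")

g = dict()
i = 0  # lines processed since the last flush
with open("article_links.txt", "w") as output, open("/Users/mateicosa/Downloads/enwiki-latest-pagelinks.sql", "r") as f:
    line = ""
    # Skip the schema preamble up to the first INSERT statement.
    while "INSERT INTO" not in line:
        line = f.readline()
    # Process consecutive INSERT statements until the first other line.
    while "INSERT INTO" in line:
        process_line(line[31:])
        i += 1
        if i == 100:
            # Flush to disk and reset the dictionary to bound memory use.
            write_titles(g, output)
            g = dict()
            i = 0
        line = f.readline()
    # Flush the titles collected in the final, partial batch.
    write_titles(g, output)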
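
The batching idea described in this section can be tried out on in-memory data. The following is a minimal sketch rather than the notebook's code: `write_in_batches` and `flush_batch` are hypothetical helpers, the records are made up, and the output goes to an `io.StringIO` buffer instead of a file.

```python
import io

def flush_batch(batch, out):
    # One row per title: "title id_1 id_2 ...", skipping disambiguation pages.
    for title, ids in batch.items():
        if "(disambiguation)" not in title:
            out.write(title + " " + " ".join(ids) + "\n")

def write_in_batches(records, out, batch_size=100):
    """records: iterable of lists of (title, page_id) pairs, one list per input line."""
    batch = {}
    for count, pairs in enumerate(records, start=1):
        for title, page_id in pairs:
            batch.setdefault(title, []).append(page_id)
        if count % batch_size == 0:
            flush_batch(batch, out)
            batch = {}
    # Flush the last, possibly partial, batch.
    flush_batch(batch, out)

# Made-up records standing in for three parsed lines of the dump.
records = [
    [("'A'", "1")],
    [("'A'", "2"), ("'B_(disambiguation)'", "3")],
    [("'A'", "4"), ("'C'", "5")],
]
buffer = io.StringIO()
write_in_batches(records, buffer, batch_size=2)
print(buffer.getvalue())
```

With a batch size of 2, the title 'A' ends up on two separate rows, one per batch, so any later step that reads the file has to merge rows that share a title.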

## Conclusion

We are left with a file whose rows have the following structure: page_title page_id_1 page_id_2 ... page_id_n, where page_id_i is the id of the ith page that contains a link to the current page. We will later use this file together with another one containing the page titles and ids in order to produce the graph.
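
To make the row format concrete, here is one way the rows might later be turned into directed edges. This is a hedged sketch only: `rows_to_edges` is a hypothetical helper, the `title_to_id` mapping stands in for the second file mentioned above (whose exact format is not described here), and the rows are made up.

```python
def rows_to_edges(rows, title_to_id):
    """Turn 'title id_1 ... id_n' rows into (source_id, target_id) edges."""
    edges = []
    for row in rows:
        title, *source_ids = row.split()
        # Titles are stored with their SQL quotes; remove them before the lookup.
        if len(title) >= 2 and title[0] == "'" and title[-1] == "'":
            title = title[1:-1]
        target = title_to_id.get(title)
        if target is None:
            # The link points to a title with no known page id.
            continue
        edges.extend((int(src), target) for src in source_ids)
    return edges

# Made-up rows and a made-up title-to-id mapping.
rows = ["'A' 1 2", "'B' 3"]
title_to_id = {"A": 10}
print(rows_to_edges(rows, title_to_id))
```

Each edge points from the page that contains the link to the page being linked, which matches the direction described above.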