## Introduction

This notebook is part of the BAINSA Wikipedia Knowledge Graph Project. Using the Wiki Dumps, we aim to construct an unweighted directed graph containing the Wikipedia articles as nodes and page links as edges.
In this particular notebook, we use the file "enwiki-20221020-page.sql". To parse it, we exploit the structure of the SQL dump: the page data sits in INSERT INTO statements, which we read one line at a time, as shown in the example below.

In [None]:
with open("enwiki-20221020-page.sql") as f:
    line = "a"
    while "INSERT INTO" not in line:
        line = f.readline()
    print(line)

In [None]:
line = line[26:]
print(line)
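
The offset 26 used above is the length of the statement prefix that precedes the records. A minimal check, assuming every data line starts with "INSERT INTO `page` VALUES " (the sample line is invented for illustration and shortened to four fields per record):

```python
PREFIX = "INSERT INTO `page` VALUES "

# Hypothetical, shortened sample; real records carry more fields.
sample = "INSERT INTO `page` VALUES (10,0,'AccessibleComputing',1),(12,0,'Anarchism',0);\n"

# The slice line[26:] relies on this length.
assert len(PREFIX) == 26

# What remains is the comma-separated list of parenthesized records.
values = sample[len(PREFIX):]
print(values)
```

Slicing with `len(PREFIX)` instead of a literal 26 keeps the code correct if the table name ever changes.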

## Line processing

We define a function to process individual lines of the SQL file. The schema of the table created by the SQL statement contained in the file can be found at https://www.mediawiki.org/wiki/Manual:Page_table. The fields of interest to us are page_id (0), page_namespace (1), page_title (2) and page_is_redirect (3). To obtain a file containing only proper articles, we eliminate the pages for which the flag page_is_redirect is 1, as well as all pages outside the article namespace (i.e., namespace != 0). We store the page_id and the page_title in a dictionary, g, defined outside the function.

In [39]:
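def process_line(line):
    """Parse one "(...),(...),..." VALUES string and record articles in the global dict g."""
    n = len(line)
    i = 0
    while i < n:
        # Every record must open with "(".
        if line[i] == "(":
            start = i + 1
        else:
            raise Exception(f"Something went wrong {line[max(i - 10, 0):i + 10]}")
        # Advance to the "),(" separator that closes the current record.
        while not (line[i] == ")" and line[i + 1] == "," and line[i + 2] == "("):
            i += 1
            if i == n - 2:
                # Last record of the line: no separator follows.
                i += 2
                break
        end = i
        i += 2
        # Naive split: a title that itself contains a comma shifts the fields.
        block = line[start:end].split(",")
        if block[1] != '0':
            count[0] += 1  # page outside the article namespace
            continue
        if block[3] != '0':
            count[1] += 1  # redirect
            continue
        page_id = int(block[0])
        title = block[2].replace(" ", "_")
        if page_id in g:
            g[page_id].append(title)
        else:
            g[page_id] = [title]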
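
Splitting a record on every comma misreads titles that contain a comma themselves. As a possible alternative (not the method used in this notebook), the sketch below reads the leading fields with a quote-aware regular expression. The helper name `leading_fields` and the sample records are hypothetical, and the unescaping simply drops the backslash, which should cover the `\'`, `\"` and `\\` escapes but is not a full MySQL string decoder.

```python
import re

# One field: a single-quoted string with backslash escapes, or a bare token (number, NULL).
FIELD = re.compile(r"'((?:[^'\\]|\\.)*)'|([^,')]+)")

def leading_fields(record, k=4):
    """Return the first k fields of one record body such as "12,0,'A,_B',0,..."."""
    fields, pos = [], 0
    while len(fields) < k:
        m = FIELD.match(record, pos)
        if m is None:
            raise ValueError(f"cannot parse field at offset {pos}")
        if m.group(1) is not None:
            # Quoted string: drop the backslash in front of each escaped character.
            fields.append(re.sub(r"\\(.)", r"\1", m.group(1)))
        else:
            fields.append(m.group(2))
        pos = m.end() + 1  # skip the comma that follows the field
    return fields

# Hypothetical records, shortened to the four fields of interest.
print(leading_fields("12,0,'Foo,_Bar',0"))
print(leading_fields(r"5,0,'Rock_\'n\'_Roll',1"))
```

Unlike `block[2]` in the function above, the title returned here no longer carries its surrounding single quotes.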

## Parsing the file

Using the previously defined function, we parse the entire file line by line, store the relevant information in a dictionary and write it to the output text file. To keep memory usage bounded, we write the collected titles to the file every 100 lines of the dump (each line holds many pages) and then reinitialize the data structure with an empty dictionary. Furthermore, we eliminate disambiguation pages, identified by "(disambiguation)" in the title, as these are not proper articles.

In [40]:
g = dict()
count = [0, 0]  # rows skipped by process_line: [non-article namespace, redirect]
i = 0
with open("articles.txt", "w") as output, \
     open("/Users/mateicosa/Downloads/enwiki-20221020-page.sql", "r") as f:
    line = ""
    # Skip the schema preamble up to the first INSERT statement.
    while "INSERT INTO" not in line:
        line = f.readline()
    while "INSERT INTO" in line:
        process_line(line[26:])  # drop the "INSERT INTO `page` VALUES " prefix
        i += 1
        line = f.readline()
        # Flush every 100 lines, and once more when the INSERT statements end.
        if i == 100 or "INSERT INTO" not in line:
            i = 0
            for page_id in g:
                if "(disambiguation)" not in g[page_id][0]:
                    output.write(str(page_id) + " " + ' '.join(g[page_id]) + "\n")
            g = dict()
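
The write-and-reset step can also be isolated into a small helper, which makes it easy to call once more after the last line so that no buffered pages are lost. This is only a sketch: the name `flush` is hypothetical, and `io.StringIO` stands in for the real output file.

```python
import io

def flush(buffer, out):
    """Write the non-disambiguation entries of buffer to out, then empty buffer."""
    written = 0
    for page_id, titles in buffer.items():
        if "(disambiguation)" in titles[0]:
            continue
        out.write(f"{page_id} {' '.join(titles)}\n")
        written += 1
    # Clearing in place keeps the same dict object alive for process_line.
    buffer.clear()
    return written

# Hypothetical buffer; titles keep their SQL quotes, as block[2] does.
out = io.StringIO()
buffer = {12: ["'Anarchism'"], 99: ["'Mercury_(disambiguation)'"]}
written = flush(buffer, out)
print(written, repr(out.getvalue()))
```

Because the buffer is emptied with `clear()` rather than rebound to a new dictionary, the helper works regardless of where the dictionary is defined.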

## Conclusion

We are left with a file whose rows have the following structure: page_id page_title. We will later use this file together with another one containing the links in order to produce the graph.
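
For later steps, the file can be read back into a dictionary. The sketch below assumes one "page_id page_title" pair per row, separated by the first space, and assumes the titles still carry the single quotes from the SQL dump, as the parsing code above appears to leave them. The function name `load_articles` and the sample rows are hypothetical.

```python
def load_articles(lines):
    """Map page_id -> title from rows shaped like "page_id page_title"."""
    articles = {}
    for row in lines:
        row = row.rstrip("\n")
        if not row:
            continue  # ignore blank rows
        # Split on the first space only, so the title is kept whole.
        page_id, title = row.split(" ", 1)
        articles[int(page_id)] = title
    return articles

# Hypothetical rows in the format written to articles.txt.
sample = ["12 'Anarchism'\n", "25 'Autism'\n"]
articles = load_articles(sample)
print(articles)
```

An open file object can be passed in place of the sample list, since the function only iterates over rows.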