# Wikipedia Dataset Generation

In this notebook, we generate a dataset from Wikipedia articles. We use the `wikipediaapi` library (imported below as `wiki`) to download the articles and extract their text. To keep the dataset from growing too large, we only consider the most popular English articles.

In [24]:
import wikipediaapi as wiki
from tqdm.notebook import tqdm
import os
import json

In [3]:
# Create the Wikipedia API client: the first argument is the user agent, the second is the language
wiki_wiki = wiki.Wikipedia("RUG NLP Q&A", 'en')

## Most Popular Articles

We use the Popular pages list on Wikipedia to find the most popular articles. That page contains 1079 links to other pages. We extract these links and then download the linked articles.

In [27]:
# Get all pages mentioned in this url https://en.wikipedia.org/wiki/Wikipedia:Popular_pages
popular_pages = wiki_wiki.page("Wikipedia:Popular_pages")
popular_pages_links = popular_pages.links

print(f"Found {len(popular_pages_links)} popular pages")

Found 1079 popular pages
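
`popular_pages.links` is a dictionary keyed by page title, and the keys may include project or help pages as well as articles. The following is a minimal sketch of how such titles could be filtered out before downloading. The helper `article_titles` and the prefix list `NON_ARTICLE_PREFIXES` are hypothetical names introduced here, and the prefix list is an assumption, not a complete list of Wikipedia namespaces.

```python
# Title prefixes treated as "not an article". This list is an assumption
# for illustration and does not cover every Wikipedia namespace.
NON_ARTICLE_PREFIXES = (
    "Wikipedia:", "Template:", "Help:", "Category:",
    "Portal:", "User:", "Talk:", "File:",
)


def article_titles(titles):
    """Return the titles that do not start with one of the prefixes above."""
    return [t for t in titles if not t.startswith(NON_ARTICLE_PREFIXES)]


# Works on any iterable of titles, including the keys of a dictionary
example_titles = ["Albert Einstein", "Wikipedia:Statistics", "Help:Contents", "YouTube"]
print(article_titles(example_titles))
```

Because iterating over a dictionary yields its keys, `article_titles(popular_pages_links)` would return the filtered titles directly.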


We follow the medallion convention for the dataset, with raw, bronze, silver, and gold layers. The raw data is the text of the articles as downloaded. The bronze data is the processed text of the articles. The silver data is a JSON file containing the title and the text of each article. The gold data is the embeddings of each sentence in the articles.
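
The cells below write to `data/raw`, `data/bronze` and `data/silver` and expect those folders to exist. A minimal sketch for creating the layout up front is shown here. The helper `make_layer_dirs` is a hypothetical name, and the `gold` folder is an assumption, since this notebook does not write gold data itself.

```python
import os


def make_layer_dirs(base_dir="data", layers=("raw", "bronze", "silver", "gold")):
    """Create one sub-directory per layer and return a dict of their paths."""
    paths = {}
    for layer in layers:
        path = os.path.join(base_dir, layer)
        os.makedirs(path, exist_ok=True)  # does nothing if the folder already exists
        paths[layer] = path
    return paths
```

Calling `make_layer_dirs()` once before the download loop would be enough. Running it again is harmless because of `exist_ok=True`.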

## Raw Data

In [23]:
# Download the text of every linked page and save it in data/raw,
# using the sanitized page title as the file name
for title in tqdm(popular_pages_links):
    page = wiki_wiki.page(title)

    # Replace / and ? with _ so that the title is a valid file name
    file_name = page.title.replace("/", "_").replace("?", "_")

    # If the file already exists, skip it (checked with the sanitized name,
    # which is the name the file is actually written under)
    if os.path.exists(f"data/raw/{file_name}.txt"):
        continue

    with open(f"data/raw/{file_name}.txt", "w") as f:
        f.write(page.text)


  0%|          | 0/1079 [00:00<?, ?it/s]
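
The loop above replaces `/` and `?` in titles. Other characters, such as `:` or `*`, are also rejected by some file systems. A slightly broader sanitizer could look like the sketch below. The helper name `safe_file_name` and the exact character set are assumptions, and using it would produce different file names from the ones the loop above writes.

```python
import re


def safe_file_name(title):
    """Replace characters that are commonly invalid in file names with underscores."""
    # Character class: backslash, slash, colon, asterisk, question mark,
    # double quote, angle brackets and pipe
    return re.sub(r'[\\/:*?"<>|]', "_", title)


print(safe_file_name("AC/DC"))
```

Applying the same helper both when checking for an existing file and when writing it keeps the two paths consistent.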

## Bronze Data

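We remove the "See also", "References", "External links" and "Notes" sections from the articles. Each `split` below cuts the text at the first occurrence of the heading text, wherever it appears, so everything after that point is dropped.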

In [28]:
# Load the data from the raw files, remove the see also, references, external links and notes sections. Then save it in data/bronze
for file in tqdm(os.listdir("data/raw/")):
    with open(f"data/raw/{file}", "r") as f:
        text = f.read()

    # Remove the see also, references, external links and notes sections
    text = text.split("See also")[0]
    text = text.split("References")[0]
    text = text.split("External links")[0]
    text = text.split("Notes ")[0]

    with open(f"data/bronze/{file}", "w") as f:
        f.write(text)

  0%|          | 0/1025 [00:00<?, ?it/s]
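
Since `str.split` matches the heading text anywhere, a sentence in the body that happens to contain "References" or "See also" also truncates the article. A stricter variant cuts only at a line that consists of the heading alone. This is a sketch under the assumption that section headings appear on their own line in the raw text. The helper `strip_trailing_sections` is a hypothetical name.

```python
def strip_trailing_sections(
    text, headings=("See also", "References", "External links", "Notes")
):
    """Cut the text at the first line that consists only of one of the headings."""
    kept = []
    for line in text.splitlines():
        if line.strip() in headings:
            break  # drop this heading and everything after it
        kept.append(line)
    return "\n".join(kept).rstrip()


sample = (
    "Intro. See also the References in chapter 2.\n\n"
    "History\nSome text.\n\n"
    "See also\nOther page\n\n"
    "References\n1. A source"
)
print(strip_trailing_sections(sample))
```

In this sample, the sentence in the introduction is kept and the text is cut at the standalone "See also" line. The `split`-based cell above would have kept only `"Intro. "`.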

## Silver Data

We create a single JSON file that maps each article's file name (the sanitized title plus `.txt`) to its text. This is easier to work with than one file per article.

In [29]:
# Load the text of all the pages in data/bronze and store it in a json file in data/silver
data = {}
for file in tqdm(os.listdir("data/bronze")):
    with open(f"data/bronze/{file}", "r") as f:
        data[file] = f.read()

with open("data/silver/data.json", "w") as f:
    json.dump(data, f)

  0%|          | 0/1025 [00:00<?, ?it/s]
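
The JSON written above is keyed by file name, so the title still carries the `.txt` extension. If explicit `title` and `text` fields are preferred, the silver file could be built as a list of records instead. The sketch below shows that alternative. The helpers `build_silver_records` and `save_silver` are hypothetical names, and the recovered title is the sanitized file name, which can differ from the original Wikipedia title.

```python
import json
import os


def build_silver_records(bronze_dir):
    """Return one {"title", "text"} record per .txt file in bronze_dir."""
    records = []
    for file in sorted(os.listdir(bronze_dir)):
        title, ext = os.path.splitext(file)
        if ext != ".txt":
            continue  # ignore anything that is not an article file
        with open(os.path.join(bronze_dir, file), "r", encoding="utf-8") as f:
            records.append({"title": title, "text": f.read()})
    return records


def save_silver(records, path):
    """Write the records to a JSON file, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
```

With the folder layout used in this notebook, this would be called as `save_silver(build_silver_records("data/bronze"), "data/silver/data.json")`.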

The gold data will be generated in the embedding model notebook.