# Tutorial: Extracting Wikipedia Pages & Detecting Redirects

Wikipedia publishes its contents as large downloadable dumps. One common format is the .bz2-compressed pages-articles XML file. This tutorial shows you how to read this file line by line in Python and collect the title and text of each page.

In [None]:
import bz2

# Define the path to the dump file - Update this with your path to the dump
dump_path = "D:Users/Paschalis/phd/data/dumps/EN/wikipedia/enwiki-20250123-pages-articles-multistream.xml.bz2"

# Initialize a list to store the results
titles_and_texts = []

# Open and read the compressed XML file
with bz2.open(dump_path, 'rt', encoding='utf-8') as file:
    inside_page = False
    inside_text_tag = False
    title = ""
    text = []

    for line in file:
        line = line.strip()

        if line == "<page>":
            inside_page = True
            title = ""
            text = []
            inside_text_tag = False

        elif line == "</page>":
            if title and text:
                titles_and_texts.append({
                    "title": title,
                    "text": '\n'.join(text).strip()
                })
            inside_page = False

        elif inside_page and line.startswith("<title>") and line.endswith("</title>"):
            title = line[len("<title>"):-len("</title>")]

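        elif inside_page and "<text" in line:
            # Content starts after the opening <text ...> tag (which may carry attributes)
            start = line.find('>') + 1
            end = line.find("</text>")
            if line.endswith("/>"):
                # Self-closing <text ... /> tag: the page has no text
                inside_text_tag = False
            elif end != -1:
                # Opening and closing tags are on the same line
                text.append(line[start:end])
                inside_text_tag = False
            else:
                # The text continues on the following lines
                inside_text_tag = True
                text.append(line[start:])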

        elif inside_text_tag:
            end = line.find("</text>")
            if end != -1:
                text.append(line[:end])
                inside_text_tag = False
            else:
                text.append(line)

## Detecting Redirects

The text of a redirect page in the Wikipedia dump tends to start with a `#`.
For instance, a redirect in the dump can look like:
```
Redirect_name  -->  #REDIRECT [[Actual page title]]
```

In [None]:
# Flag every extracted page whose text starts with "#" as a redirect
for entry in titles_and_texts:
    entry["is_redirect"] = entry["text"].startswith("#")

# Print a few entries to check the result
for entry in titles_and_texts[:100]:  # adjust the number here if you want to print fewer entries
    print(f"Title: {entry['title']}")
    print(f"Redirect: {entry.get('is_redirect', False)}")
    print(f"Text snippet: {entry['text'][:100]}...\n")
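
Checking only for a leading `#` can also flag pages whose text begins with a numbered-list item, since `#` is list syntax in wikitext. A stricter check that also recovers the target title could look like the sketch below. The pattern `REDIRECT_RE` and the helper `redirect_target` are illustrative assumptions, and the pattern only covers the English `#REDIRECT` keyword; other language editions may use localized keywords that would need to be added.

```python
import re

# Assumed pattern: '#REDIRECT', an optional colon and whitespace, then [[Target]].
# Section anchors (#...) and link labels (|...) are left out of the captured target.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]\|#]+)", re.IGNORECASE)


def redirect_target(page_text):
    """Return the redirect target title, or None if no '#REDIRECT [[...]]' is found."""
    match = REDIRECT_RE.match(page_text)
    return match.group(1).strip() if match else None


print(redirect_target("#REDIRECT [[Computer accessibility]]"))   # Computer accessibility
print(redirect_target("#redirect [[Foo#Section]]"))              # Foo
print(redirect_target("# A numbered list item, not a redirect")) # None
```

In the notebook, the same helper could be applied to `entry["text"]` for each entry in `titles_and_texts` to store the target alongside the `is_redirect` flag.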

## Detect pages that you might want to remove

Our goal was to remove pages that act as automatic templates or that are mostly help pages or lists of links. Detecting such pages requires a manual check of the titles.

While building WikiTextGraph, we noticed that pages whose titles contain a colon (":") usually fit the description above. The prefixes before the colon vary from language to language, so it can take some time to spot them.

Below is an example of how you could look for such pages.

In [None]:
for entry in titles_and_texts:
    # Title matching is case-sensitive, and the dump spells this prefix "MediaWiki:".
    # Replace "MediaWiki:" with another prefix as needed;
    # refer to LANG_SETTINGS.yml for some examples in multiple language versions
    if entry["title"].startswith("MediaWiki:"):
        print("Title: ", entry["title"])
        print("Text: ", entry["text"][:100])
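
To speed up the manual check, it can help to count how often each prefix before the first colon occurs, so the most frequent candidates surface first. The following is a minimal sketch under that assumption; `prefix_counts` and `sample_titles` are hypothetical names introduced here for illustration. Ordinary article titles that happen to contain a colon will appear in the counts as well, so the result still needs a manual review.

```python
from collections import Counter


def prefix_counts(titles):
    """Count the text before the first colon in each title that has one."""
    counts = Counter()
    for title in titles:
        prefix, sep, _rest = title.partition(":")
        if sep:  # only titles that actually contain a colon
            counts[prefix] += 1
    return counts


# Hypothetical titles for illustration; in the notebook you would pass
# [entry["title"] for entry in titles_and_texts] instead.
sample_titles = [
    "Template:Infobox",
    "Template:Cite web",
    "MediaWiki:Sidebar",
    "Star Wars: Episode I",
    "Anarchism",
]

# Print prefixes from most to least frequent
for prefix, count in prefix_counts(sample_titles).most_common():
    print(f"{prefix}: {count}")
```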