1. **Entry Point** (`wikitextgraph.py`)
- Parses command-line arguments (`--dump_filepath`, `--language_code`, `--base_dir`, `--generate_graph`).
- If required inputs are missing, launches a GUI prompt via `gui_prompt_for_inputs()`.
- Ensures the base output directory exists.
- Calls `parse_wikidump()` from `parser_module`, passing file path, language settings, and graph flag.
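
The argument handling above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the flag names come from the list above, while the defaults, the `store_true` behaviour of `--generate_graph`, and the helper names `build_arg_parser`, `missing_required`, and `ensure_base_dir` are assumptions.

```python
import argparse
from pathlib import Path


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a parser for the four flags listed above (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Parse a Wikipedia dump.")
    parser.add_argument("--dump_filepath", default=None, help="Path to the .bz2 dump")
    parser.add_argument("--language_code", default=None, help="Language code, e.g. 'en'")
    parser.add_argument("--base_dir", default=None, help="Base output directory")
    # Assumption: the graph flag is a simple on/off switch.
    parser.add_argument("--generate_graph", action="store_true")
    return parser


def missing_required(args: argparse.Namespace) -> list[str]:
    """Return the names of inputs that were not supplied on the command line."""
    required = ("dump_filepath", "language_code", "base_dir")
    return [name for name in required if getattr(args, name) is None]


def ensure_base_dir(base_dir: str) -> Path:
    """Create the base output directory if it does not exist yet."""
    path = Path(base_dir)
    path.mkdir(parents=True, exist_ok=True)
    return path


args = build_arg_parser().parse_args(
    ["--dump_filepath", "dump.xml.bz2", "--language_code", "en"]
)
# A non-empty result is the point where a GUI prompt would take over.
print(missing_required(args))  # ['base_dir']
```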

---

2. **Configuration** (`config.py` & `LANG_SETTINGS.yml`)
- `LANG_SETTINGS.yml`: Stores language-specific regex patterns for section headers, page filters, and redirect keywords for 10 languages (en, es, el, pl, it, nl, eu, hi, de, vi).
- `config.py`: 
    - `load_language_settings()`: Reads and compiles regex patterns from YAML.
    - `get_language_settings()`: Returns settings for a given language code (defaults to English).
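
As an illustration of the compile-then-look-up pattern described above, here is a small sketch. It works on a plain dictionary standing in for the parsed YAML, so it needs no YAML library. The key names (`section_patt`, `filter_patterns`, `redirect_keywords`), the sample patterns, and the function names `compile_settings` and `get_settings` are all hypothetical and may not match the project's actual schema.

```python
import re

# Hypothetical stand-in for the parsed contents of LANG_SETTINGS.yml;
# the keys and patterns below are assumptions, not the project's real schema.
RAW_SETTINGS = {
    "en": {
        "section_patt": r"==\s*(See also|References|External links)\s*==",
        "filter_patterns": [r"^Category:", r"^Template:"],
        "redirect_keywords": ["#REDIRECT"],
    },
    "es": {
        "section_patt": r"==\s*(Véase también|Referencias|Enlaces externos)\s*==",
        "filter_patterns": [r"^Categoría:", r"^Plantilla:"],
        "redirect_keywords": ["#REDIRECCIÓN", "#REDIRECT"],
    },
}


def compile_settings(raw):
    """Compile every regex string once so later lookups are cheap."""
    compiled = {}
    for lang, cfg in raw.items():
        compiled[lang] = {
            "section_patt": re.compile(cfg["section_patt"], re.IGNORECASE),
            "filter_patterns": [
                re.compile(p, re.IGNORECASE) for p in cfg["filter_patterns"]
            ],
            "redirect_keywords": list(cfg["redirect_keywords"]),
        }
    return compiled


def get_settings(compiled, language_code):
    """Return settings for a language code, falling back to English."""
    return compiled.get(language_code, compiled["en"])


compiled = compile_settings(RAW_SETTINGS)
settings = get_settings(compiled, "xx")  # unknown code falls back to "en"
print(bool(settings["section_patt"].search("== See also ==")))  # True
```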

---

3. **Parsing and Processing** (`parser_module.py`)

`WikiXmlHandler`, an `xml.sax` handler that:

- Buffers `<title>` and `<text>` elements in batches (default 10k pages).

- At the end of each batch, creates a DataFrame, filters non-content pages, extracts main text via `utils.extract_wiki_main_text()`, and writes the result to a Parquet file with gzip compression.

- Frees memory with `gc.collect()` between batches.

`parse_wikidump()`, which:

- Loads language settings (section delimiter regex, filter patterns, redirect keywords).

- Sets up output directories and files (e.g., `en/output/en_WP_titles_texts.parquet`).

- Instantiates `WikiXmlHandler` and streams the `.bz2` dump through SAX.

- After parsing, optionally calls `generate_graph()` to build the link graph.
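
The batching idea can be sketched with the standard library alone. The class below is a simplified, hypothetical stand-in (`BatchingHandler`), not the project's handler: it hands each batch to a callback instead of building a DataFrame and writing Parquet, and it assumes each `<page>` holds exactly one `<title>` and one `<text>`.

```python
import xml.sax


class BatchingHandler(xml.sax.ContentHandler):
    """Collects (title, text) pairs and hands them off in fixed-size batches."""

    def __init__(self, on_batch, batch_size=10_000):
        super().__init__()
        self._on_batch = on_batch
        self._batch_size = batch_size
        self._current_tag = None  # "title" or "text" while inside one of them
        self._chunks = []         # SAX may deliver character data in pieces
        self._title = None
        self._pages = []

    def startElement(self, name, attrs):
        if name in ("title", "text"):
            self._current_tag = name
            self._chunks = []

    def characters(self, content):
        if self._current_tag is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == self._current_tag:
            value = "".join(self._chunks)
            if name == "title":
                self._title = value
            else:
                self._pages.append((self._title, value))
            self._current_tag = None
        if name == "page" and len(self._pages) >= self._batch_size:
            self.flush()

    def endDocument(self):
        self.flush()  # emit the final, possibly partial, batch

    def flush(self):
        if self._pages:
            self._on_batch(self._pages)
            self._pages = []


batches = []
sample = (
    "<mediawiki>"
    "<page><title>A</title><revision><text>alpha</text></revision></page>"
    "<page><title>B</title><revision><text>beta</text></revision></page>"
    "<page><title>C</title><revision><text>gamma</text></revision></page>"
    "</mediawiki>"
)
xml.sax.parseString(sample.encode("utf-8"), BatchingHandler(batches.append, batch_size=2))
print(batches)  # [[('A', 'alpha'), ('B', 'beta')], [('C', 'gamma')]]
```

For a real dump, the same handler could be fed a decompressed stream, for example `xml.sax.parse(bz2.open(path, "rb"), handler)`, so the file is never loaded into memory in full.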

---

4. **Text Cleaning and Utilities** (`utils.py`)

- Template and Tag Removal: Strips `{{templates}}`, `<ref>` tags, HTML comments, and trims to main content (starting at first bold text and stopping at "See also" sections).

- Page Filtering: Excludes pages whose titles match specified patterns (e.g., the `Category:` and `Template:` namespaces, or disambiguation pages).

- Link Extraction Helpers: Extracts wikilinks with regex, fixes underscores in titles, and resolves redirects using a provided mapping.
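
A rough sketch of the cleaning and filtering steps is shown below. The regexes and the function names `clean_wikitext` and `is_content_title` are assumptions made for illustration; the project's `extract_wiki_main_text()` and `filter_non_content_pages()` may work differently, and real wikitext has edge cases (tables, `<nowiki>` blocks) that this ignores.

```python
import re

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Self-closing <ref ... /> first, then paired <ref>...</ref>.
REF_RE = re.compile(r"<ref\b[^>]*?/>|<ref\b[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
# Matches only innermost templates; applied repeatedly to unwrap nesting.
TEMPLATE_RE = re.compile(r"\{\{[^{}]*\}\}")
# Assumed English end-of-article section headers.
SECTION_RE = re.compile(r"==\s*(See also|References|External links)\s*==")


def clean_wikitext(text, section_re=SECTION_RE):
    """Strip comments, refs, and templates, then trim to the main content."""
    text = COMMENT_RE.sub("", text)
    text = REF_RE.sub("", text)
    previous = None
    while previous != text:  # repeat until no template is left
        previous = text
        text = TEMPLATE_RE.sub("", text)
    start = text.find("'''")  # main text starts at the first bold span
    if start != -1:
        text = text[start:]
    end = section_re.search(text)  # stop at the first end-of-article section
    if end:
        text = text[: end.start()]
    return " ".join(text.split())


def is_content_title(title, patterns):
    """Return False for titles matching any non-content pattern."""
    return not any(p.search(title) for p in patterns)


sample = """<!-- note -->
{{Infobox|name={{lang|en|Python}}}}
'''Python''' is a language.<ref>Cite</ref> Created by [[Guido van Rossum]].
== References ==
{{reflist}}
"""
print(clean_wikitext(sample))
# '''Python''' is a language. Created by [[Guido van Rossum]].

patterns = [re.compile(r"^Category:"), re.compile(r"^Template:")]
print([t for t in ["Python", "Category:Languages"] if is_content_title(t, patterns)])
# ['Python']
```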

---

5. **GUI Interface** (`gui.py`)

- A Tkinter-based window prompting users to:
  1. Select a compressed `.bz2` dump file.
  2. Choose a language code from a dropdown.
  3. Decide whether to generate the graph.
  4. Select an output directory.
- Buttons to open GitHub repo or contact developer.
- Returns inputs to `wikitextgraph.py` for processing.

---

6. **Graph Generation** (`graph.py` - not shown)

- Reads the Parquet of titles/texts, extracts links, resolves redirects, and constructs a node-edge Parquet representation.
- Outputs in `base_dir/<lang>/graph/`:
  - `redirects_rev_mapping.pkl.gzip`
  - `<lang>_id_node_mapping.parquet`
  - `<lang>_wiki_graph.parquet`
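
Since `graph.py` is not shown, the following is only a guess at the general shape of the computation, using plain dictionaries and lists instead of Parquet files. The function name `build_graph`, the link regex (which captures the link target rather than its display text), and the choice to drop self-links and links to pages missing from the dump are all assumptions.

```python
import re

# Captures the link target: everything after "[[" up to "|", "#", or "]".
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def build_graph(pages, redirects):
    """Build an id mapping and an edge list.

    pages:     dict mapping title -> cleaned wikitext
    redirects: dict mapping redirect title -> canonical title
    """
    node_ids = {title: i for i, title in enumerate(sorted(pages))}
    edges = set()
    for source, text in pages.items():
        for raw in LINK_RE.findall(text):
            target = raw.strip().replace("_", " ")  # normalise underscores
            target = redirects.get(target, target)  # follow a known redirect
            # Assumption: skip self-links and links to pages not in the dump.
            if target in node_ids and target != source:
                edges.add((node_ids[source], node_ids[target]))
    return node_ids, sorted(edges)


pages = {
    "Python (programming language)": "Created by [[Guido_van_Rossum|Guido]].",
    "Guido van Rossum": "Author of [[PyLang]] and fan of [[Monty Python]].",
}
redirects = {"PyLang": "Python (programming language)"}

node_ids, edges = build_graph(pages, redirects)
print(node_ids)  # {'Guido van Rossum': 0, 'Python (programming language)': 1}
print(edges)     # [(0, 1), (1, 0)]
```

Storing edges as integer id pairs next to a separate id-to-title mapping keeps the edge table compact, which is presumably why the outputs above are split into a node mapping file and a graph file.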

---

```python
import unittest
import re
import pandas as pd
from utils import (
    extract_wiki_main_text,
    filter_non_content_pages,
    extract_wikilinks,
    fix_dubious_links,
    resolve_redirects
)

# Section headers to stop main text extraction
section_patt = re.compile(
    r"(==\s*(See also|Publications|References|Notes|Footnotes|External links|Further reading)\s*==|WP:SEEALSO)"
)

class TestWikiUtils(unittest.TestCase):

    def test_extract_wiki_main_text(self):
        """
        Test that extract_wiki_main_text removes refs/comments/templates and truncates at known end sections.
        """
        sample_text = """
        <!-- Comment about the article -->
        {{Infobox}}
        '''Python''' is a high-level programming language. <ref>Reference here</ref>
        It was created by [[Guido van Rossum|Guido]].
        == References ==
        <ref>Extra ref</ref>
        """
        expected_output = "'''Python''' is a high-level programming language. It was created by [[Guido van Rossum|Guido]]."
        result = extract_wiki_main_text(sample_text, section_patt)
        self.assertEqual(" ".join(result.split()), " ".join(expected_output.split()))

    def test_filter_non_content_pages(self):
        """
        Test that filter_non_content_pages removes pages matching a namespace pattern.
        """
        df = pd.DataFrame({
            'title': ['Article1', 'User:Example', 'Talk:Article2', 'Main Article'],
            'text': ['some text'] * 4
        })
        patterns = ['^user:', '^talk:']
        filtered_df = filter_non_content_pages(df, patterns, redirect_keywords=[])
        expected_titles = ['Article1', 'Main Article']
        self.assertListEqual(list(filtered_df['title']), expected_titles)

    def test_extract_wikilinks(self):
        """
        Test that extract_wikilinks extracts internal Wikipedia links.
        """
        text = "This links to [[Python (programming language)]] and [[Guido van Rossum|Guido]]."
        wiki_link_regex = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')
        links = extract_wikilinks(wiki_link_regex, text)
        self.assertListEqual(links, ['Python (programming language)', 'Guido'])

    def test_fix_dubious_links(self):
        """
        Test that underscores are replaced with spaces in links.
        """
        self.assertEqual(fix_dubious_links("Python_(programming_language)"), "Python (programming language)")
        self.assertEqual(fix_dubious_links("Guido_van_Rossum"), "Guido van Rossum")
        self.assertIsNone(fix_dubious_links(None))

    def test_resolve_redirects(self):
        """
        Test that resolve_redirects replaces known redirects.
        """
        series = pd.Series(["PyLang", "Guido", "Monty Python"])
        redirect_map = {
            "PyLang": "Python (programming language)",
            "Guido": "Guido van Rossum"
        }
        resolved = resolve_redirects(series, redirect_map)
        expected = pd.Series(["Python (programming language)", "Guido van Rossum", "Monty Python"])
        pd.testing.assert_series_equal(resolved, expected)

if __name__ == "__main__":
    unittest.main(argv=["first-arg-is-ignored"], exit=False)
```

```text
......
----------------------------------------------------------------------
Ran 6 tests in 0.007s

OK
```
