A Python library for processing wiki dump XML files.
pip install wikixml --upgrade

Visit: https://dumps.wikimedia.org/zhwiki/latest/
Download the latest wiki dump file through a proxy:
curl -L --proxy http://127.0.0.1:11111 -o ~/repos/wikixml/data/zhwiki-latest-pages-meta-current.xml.bz2 https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-meta-current.xml.bz2

Run example:
python example.py

See: example.py
from pathlib import Path

from wikixml import WikiXmlParser

if __name__ == "__main__":
    # Dump file name; expected in the "data" directory next to this script
    wiki_xml_bz2 = "zhwiki-20241101-pages-meta-current.xml.bz2"
    file_path = Path(__file__).parent / "data" / wiki_xml_bz2
    parser = WikiXmlParser(file_path)
    # parser.preview_lines(5000)
    parser.preview_pages(max_pages=100)

Extract wiki pages from XML and write to MongoDB:
python -m wikixml.mongo -d zhwiki -f "../data/zhwiki-latest-pages-meta-current.xml.bz2"
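
For readers curious what page extraction from a compressed dump involves, here is a minimal standard-library sketch. It is not how wikixml is implemented; `iter_pages` and `local_name` are hypothetical helpers written for illustration, and the sketch assumes the usual MediaWiki export layout, where each `<page>` element contains a `<title>` and a revision `<text>`.

```python
import bz2
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(path: str, max_pages: int = 100) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) pairs from a bz2-compressed MediaWiki XML dump.

    Hypothetical helper for illustration only; not part of wikixml.
    """
    count = 0
    # bz2.open decompresses on the fly, so the dump is never fully unpacked.
    with bz2.open(path, "rb") as f:
        # iterparse streams the XML; an "end" event fires when a tag closes.
        for _event, elem in ET.iterparse(f, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            title, text = "", ""
            for child in elem.iter():
                name = local_name(child.tag)
                if name == "title":
                    title = child.text or ""
                elif name == "text":
                    text = child.text or ""
            yield title, text
            elem.clear()  # drop the page's subtree to limit memory use
            count += 1
            if count >= max_pages:
                break
```

Calling `iter_pages("data/zhwiki-latest-pages-meta-current.xml.bz2", max_pages=5)` and looping over the result would give the first five title/text pairs without loading the whole dump into memory.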