Hansimov/wikixml


wikixml

A Python library for processing wiki XML dumps.

Install

pip install wikixml --upgrade

Download Wiki Dumps

Visit: https://dumps.wikimedia.org/zhwiki/latest/

Download the latest wiki dump file through a proxy (adjust or drop `--proxy` to match your network, and make sure the output directory exists):

curl -L --proxy http://127.0.0.1:11111 -o ~/repos/wikixml/data/zhwiki-latest-pages-meta-current.xml.bz2 https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-meta-current.xml.bz2
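If curl is not available, the same download can be sketched with Python's standard library. This is only an illustration and not part of wikixml: the `download` helper and its parameters are hypothetical names, and the proxy address simply mirrors the one in the curl command above.

```python
import shutil
import urllib.request
from pathlib import Path
from typing import Optional


def download(url: str, dest: Path, proxy: Optional[str] = None) -> Path:
    """Stream `url` to `dest`, optionally through an HTTP proxy.

    Hypothetical helper, not part of wikixml.
    """
    handlers = []
    if proxy:
        # Route both http and https requests through the given proxy.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    # The default opener already follows redirects, like `curl -L`.
    opener = urllib.request.build_opener(*handlers)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with opener.open(url) as response, open(dest, "wb") as out:
        # Copy in chunks so the multi-gigabyte dump never sits in memory.
        shutil.copyfileobj(response, out)
    return dest


# Example call (not executed here; it needs network access):
# download(
#     "https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-meta-current.xml.bz2",
#     Path("data") / "zhwiki-latest-pages-meta-current.xml.bz2",
#     proxy="http://127.0.0.1:11111",
# )
```

Unlike curl, this sketch shows no progress and cannot resume an interrupted transfer, so curl remains the more practical choice for the full dump.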

WikiXmlParser

Run the example:

python example.py

See: example.py

from pathlib import Path

from wikixml import WikiXmlParser

if __name__ == "__main__":
    # Name of the dump file, expected under ./data next to this script
    wiki_xml_bz2 = "zhwiki-20241101-pages-meta-current.xml.bz2"
    file_path = Path(__file__).parent / "data" / wiki_xml_bz2
    parser = WikiXmlParser(file_path)
    # parser.preview_lines(5000)
    # Preview the first 100 pages of the dump
    parser.preview_pages(max_pages=100)
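For intuition about what reading pages from a compressed dump involves, here is a minimal standard-library sketch. It is not how `WikiXmlParser` is implemented (the README does not say); `iter_pages` and `local_name` are hypothetical names, and the field layout assumes the usual MediaWiki export format, with `<title>`, `<id>`, and `<revision><text>` inside each `<page>`.

```python
import bz2
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Iterator


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump_path: Path) -> Iterator[dict]:
    """Yield one dict per <page> element, decompressing and parsing as a stream."""
    with bz2.open(dump_path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if local_name(elem.tag) != "page":
                continue
            page = {"title": None, "id": None, "text": None}
            for node in elem.iter():
                name = local_name(node.tag)
                if name == "title":
                    page["title"] = node.text
                elif name == "id" and page["id"] is None:
                    # The first <id> in document order is the page id;
                    # later ones belong to the revision or contributor.
                    page["id"] = node.text
                elif name == "text":
                    page["text"] = node.text or ""
            yield page
            # Release the finished page's subtree before parsing the next one.
            elem.clear()
```

Wrapping the generator in `itertools.islice(iter_pages(path), 100)` would stop after the first 100 pages, similar in spirit to `preview_pages(max_pages=100)`.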

WikiPagesMongoWriter

Extract wiki pages from the XML dump and write them to MongoDB:

python -m wikixml.mongo -d zhwiki -f "../data/zhwiki-latest-pages-meta-current.xml.bz2"
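The general pattern behind such a writer can be sketched as batched inserts. This is an assumption-laden illustration, not the actual `wikixml.mongo` code: `batched` and `write_pages` are hypothetical names, the batch size is arbitrary, and `collection` is assumed to be any object with an `insert_many(documents)` method, such as a pymongo `Collection`.

```python
from itertools import islice
from typing import Any, Iterable, Iterator, List


def batched(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def write_pages(collection: Any, pages: Iterable[dict], batch_size: int = 1000) -> int:
    """Insert page dicts in batches and return how many were written.

    `collection` is assumed to expose insert_many(), as pymongo collections do.
    """
    written = 0
    for batch in batched(pages, batch_size):
        # One round trip per batch instead of one per page.
        collection.insert_many(batch)
        written += len(batch)
    return written


# Example wiring (not executed here; requires pymongo and a running MongoDB).
# The "pages" collection name is an assumption:
# from pymongo import MongoClient
# collection = MongoClient("mongodb://localhost:27017")["zhwiki"]["pages"]
# write_pages(collection, page_dicts)
```

Batching keeps memory bounded while avoiding a network round trip for every page, which matters for dumps with millions of pages.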
