This repository has been archived by the owner on Mar 3, 2024. It is now read-only.

CyberZHG/wiki-dump-reader


Wiki-Dump Reader


Extract corpora from Wikipedia dumps.

Install

pip install wiki-dump-reader

Usage

Download the dump file *wiki-*-pages-articles.xml first. You can then iterate over its pages and get cleaned text from each page's raw text:

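from wiki_dump_reader import Cleaner, iterate

cleaner = Cleaner()
# iterate() yields one (title, raw text) pair per page in the dump
for title, text in iterate('*wiki-*-pages-articles.xml'):
    # Clean the raw page text
    text = cleaner.clean_text(text)
    # Returns the final cleaned text together with the links found in it
    cleaned_text, links = cleaner.build_links(text)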

Just ignore links if you don't need them:

cleaned_text, _ = cleaner.build_links(text)
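
The loop above can be wrapped into a small helper that writes a corpus file, one cleaned article per line. This is only a sketch: `write_corpus`, its parameters, and the one-article-per-line layout are hypothetical choices and not part of wiki-dump-reader. It relies only on the two `Cleaner` calls shown above and accepts any iterable of `(title, text)` pairs.

```python
def write_corpus(pages, cleaner, out_path):
    """Write one cleaned article per line to out_path.

    Hypothetical helper, not part of wiki-dump-reader.
    pages:   iterable of (title, raw_text) pairs, e.g. the result of iterate(...)
    cleaner: object with clean_text() and build_links(), e.g. Cleaner()
    Returns the number of articles written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for title, text in pages:
            # Same two calls as in the usage example; the links are ignored
            cleaned, _ = cleaner.build_links(cleaner.clean_text(text))
            # Collapse all whitespace so each article occupies a single line
            line = " ".join(cleaned.split())
            if not line:
                continue  # skip pages that are empty after cleaning
            out.write(line + "\n")
            count += 1
    return count
```

With the library installed, it could be called as write_corpus(iterate('*wiki-*-pages-articles.xml'), Cleaner(), 'corpus.txt'), where 'corpus.txt' is an arbitrary output name.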

See the examples for a sense of what the cleaned output looks like.
