Extract corpora from wiki-dump.
pip install wiki-dump-reader
Download the dump file *wiki-*-pages-articles.xml first.
Then you can iterate over its pages and get the cleaned text of each one:
from wiki_dump_reader import Cleaner, iterate
cleaner = Cleaner()
for title, text in iterate('*wiki-*-pages-articles.xml'):
    text = cleaner.clean_text(text)
    cleaned_text, links = cleaner.build_links(text)
If you don't need the links, just ignore them:
cleaned_text, _ = cleaner.build_links(text)
See the examples for a feel of what the output looks like.