# wikipedia2text

A tool to convert a Wikipedia dump file into plain text.
Use `python3 wikipedia2text.py -h` to print usage.
A typical invocation looks like this: `python3 wikipedia2text.py out.txt -dl fr,en`
There are a few arguments you can use to change the "normal" behavior:
- `-dl` (`--download_languages`): Lets you specify one or more languages to download from Wikipedia (they must be separated by commas).
- `-i` (`--input_files`): Lets you specify one or more files that were already downloaded (`-dl` and `-i` can be combined).
- `-m` (`--mix_sentences`): Shuffles the data to make it more uniform (requires `sort`).
- `-u` (`--unique`): Removes duplicate lines (requires `uniq`).
- `-d` (`--keep_digits`): Keeps digits in the strings.
- `-s` (`--split_sentences`): Splits sentences on ".".
- `-ml` (`--min_length`): Skips any string for which `len(str) < x`.
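
To make the effect of these options concrete, here is a minimal pure-Python sketch of the text filters described above. It is not the code in `wikipedia2text.py`: the function name `clean_lines`, its parameters, the order of the steps, and the dropping of empty strings are all assumptions made for illustration. The real tool relies on the external `sort` and `uniq` programs for `-m` and `-u`, whereas this sketch works on an in-memory list.

```python
import random
import re


def clean_lines(lines, keep_digits=False, split_sentences=False,
                min_length=0, unique=False, mix=False, seed=None):
    """Hypothetical helper mimicking the documented options on a list of strings."""
    out = []
    for line in lines:
        # -s: split on "." (naive, abbreviations are not handled)
        parts = line.split(".") if split_sentences else [line]
        for part in parts:
            # Without -d, digits are removed from the string
            if not keep_digits:
                part = re.sub(r"\d", "", part)
            part = part.strip()
            # -ml: skip strings with len(str) < min_length
            # (assumption: empty strings are always skipped)
            if not part or len(part) < min_length:
                continue
            out.append(part)
    if unique:
        # -u: drop duplicates, keeping the first occurrence
        out = list(dict.fromkeys(out))
    if mix:
        # -m: shuffle the result (seed is only here for reproducibility)
        random.Random(seed).shuffle(out)
    return out


sample = ["Paris is in France. It has 2 rivers.", "Paris is in France."]
print(clean_lines(sample, keep_digits=True, split_sentences=True, unique=True))
# → ['Paris is in France', 'It has 2 rivers']
```

An in-memory version like this only suits small inputs; for a full Wikipedia dump, streaming the text through external tools, as the requirements on `sort` and `uniq` suggest, avoids holding everything in RAM.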
The result of `python3 wikipedia2text.py -dl fr,en -m -u -s` can be found here: http://node1.belval.org:3080/en-fr-wiki-corpus.bz2 (It includes French!)
Running `python3 wikipedia2text.py -dl fr,en -m -u -s` took about 12 hours on a Ryzen 1700X with a 120Mb/s Internet connection.