# wikipedia2text

A tool to convert a Wikipedia dump file into plain text.

## How to use it

Run `python3 wikipedia2text.py -h` to print usage.

A typical invocation looks like this: `python3 wikipedia2text.py out.txt -dl fr,en`

There are a few arguments you can use to change the default behavior (see the example invocation after this list):

* `-dl` (`--download_languages`) Specify one or more languages to download from Wikipedia (they must be comma-separated).
* `-i` (`--input_files`) Specify one or more dump files that were already downloaded (this option can be combined with `-dl`).
* `-m` (`--mix_sentences`) Shuffle the data to make it more uniform (requires the `sort` utility).
* `-u` (`--unique`) Remove duplicate lines (requires the `uniq` utility).
* `-d` (`--keep_digits`) Keep digits in the output strings.
* `-s` (`--split_sentences`) Split sentences on ".".
* `-ml` (`--min_length`) Discard any string shorter than the given length.
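
For example, a hypothetical run that downloads the English dump, adds an already-downloaded French dump, removes duplicates, splits sentences, and drops very short strings could look like the sketch below (the file name `frwiki-dump.xml.bz2` and the minimum length of 20 are placeholders, not values from this repository):

```
python3 wikipedia2text.py out.txt -dl en -i frwiki-dump.xml.bz2 -u -s -ml 20
```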

## Where can I get a precomputed file?

The result of `python3 wikipedia2text.py -dl fr,en -m -u -s` can be found at http://node1.belval.org:3080/en-fr-wiki-corpus.bz2 (it includes French!).
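
As a rough sketch, assuming `wget` and `bzip2` are available on your system, the archive can be fetched and decompressed with standard tools:

```
wget http://node1.belval.org:3080/en-fr-wiki-corpus.bz2
# bunzip2 strips the .bz2 extension, leaving en-fr-wiki-corpus
bunzip2 en-fr-wiki-corpus.bz2
```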

## How long does it take?

Running `python3 wikipedia2text.py -dl fr,en -m -u -s` took about 12 hours on a Ryzen 1700X with a 120 Mb/s Internet connection.
