wiki-corpus

Having a large and diverse corpus like Wikipedia is invaluable for developing and testing new natural language processing algorithms and models. Wiki-Corpus extracts the text content of each Wikipedia article and converts it into a format that can be used for natural language processing tasks such as tokenization and part-of-speech tagging.
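For readers curious about what this kind of extraction involves, the following is a minimal sketch of streaming plain text out of a Wikipedia dump using gensim's WikiCorpus. It is only an illustration of the general technique, not necessarily how main.py is implemented.

# Sketch only: stream article text out of a Wikipedia dump with gensim.
# Illustrates the general technique; main.py may be implemented differently.
from gensim.corpora.wikicorpus import WikiCorpus

dump_path = "metawiki-latest-pages-articles.xml.bz2"  # assumed local dump file

# WikiCorpus parses the compressed dump and strips the wiki markup.
# Passing dictionary={} skips building a vocabulary, which is not needed here.
wiki = WikiCorpus(dump_path, dictionary={})

with open("corpus.txt", "w", encoding="utf-8") as out:
    # get_texts() yields one list of tokens per article (str tokens in gensim 4.x)
    for i, tokens in enumerate(wiki.get_texts()):
        out.write(" ".join(tokens) + "\n")
        if i and i % 1000 == 0:
            print(f"processed {i} articles")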

Prerequisites

Ensure you have a wiki dump file in .xml.bz2 format downloaded from enwiki or metawiki. Please be aware that a full enwiki dump is extremely large (over 19 GB). If you want a smaller dump (often for development or testing purposes), go for metawiki.
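If you do not have a dump yet, one can be fetched from dumps.wikimedia.org. Below is a minimal Python sketch; the URL follows the usual "latest" naming pattern on that server, so verify it before relying on it.

# Sketch: download the latest metawiki dump using only the standard library.
# The URL is an assumption based on dumps.wikimedia.org's usual naming pattern.
import urllib.request

url = "https://dumps.wikimedia.org/metawiki/latest/metawiki-latest-pages-articles.xml.bz2"
urllib.request.urlretrieve(url, "metawiki-latest-pages-articles.xml.bz2")
print("download complete")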

You should have Python installed.

Usage

Inside the cloned repo, pass the path of the xml.bz2 file and start the process:

> python main.py metawiki-latest-pages-articles.xml.bz2

You can exit the process at any time once enough of the corpus has been generated, or wait until everything is processed. Have a look at your text corpus, named corpus.txt, inside the out folder.
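Once corpus.txt exists, it can be fed straight into downstream NLP tooling. The sketch below reads the corpus and computes simple word-frequency statistics; it assumes out/corpus.txt contains whitespace-separated plain text with one record per line, so verify that against your own output.

# Sketch: read the generated corpus and compute simple word-frequency statistics.
# Assumes out/corpus.txt holds whitespace-separated plain text, one record per line.
from collections import Counter

counts = Counter()
with open("out/corpus.txt", encoding="utf-8") as f:
    for line in f:
        counts.update(line.lower().split())  # naive whitespace tokenization

print("unique tokens:", len(counts))
print("most common:", counts.most_common(10))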

FAQ

Nothing happens after I run the script with the argument passed
Try pressing `CTRL + C` once after running the command. Only do it once, because doing it twice kills the process.
