WikiExtrator

Extractor for Wikipedia XML dump files. For each Wikipedia article, the title, infoboxes, categories, langlinks, pageid, abstract, and other fields are extracted from a dump file and written to separate .dat files.
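
As an illustration of the kind of streaming pass such an extraction needs, here is a minimal sketch that reads a dump with the JDK's StAX parser and collects each article's pageid and title. It is not taken from the WikiExtrator sources: the class name DumpTitleSketch and the method readTitles are hypothetical, and the sketch assumes the standard MediaWiki export layout, where `<title>` and the page's own `<id>` are direct children of `<page>`.

```java
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not part of WikiExtrator.
class DumpTitleSketch {

    /** Streams a MediaWiki export and returns pageid -> title in document order. */
    static Map<Long, String> readTitles(Reader xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, true); // one CHARACTERS event per text node
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);  // dumps carry no DTD
        XMLStreamReader reader = factory.createXMLStreamReader(xml);

        Map<Long, String> titles = new LinkedHashMap<>();
        Deque<String> path = new ArrayDeque<>(); // currently open elements, innermost first
        StringBuilder text = new StringBuilder();
        String title = null;
        Long pageId = null;

        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    path.push(reader.getLocalName());
                    text.setLength(0);
                    break;
                }
                case XMLStreamConstants.CHARACTERS: {
                    // Buffer only the small fields; the article body is skipped.
                    String current = path.peek();
                    if ("title".equals(current) || "id".equals(current)) {
                        text.append(reader.getText());
                    }
                    break;
                }
                case XMLStreamConstants.END_ELEMENT: {
                    String name = path.pop();
                    String parent = path.peek();
                    if ("page".equals(parent) && "title".equals(name)) {
                        title = text.toString().trim();
                    } else if ("page".equals(parent) && "id".equals(name)) {
                        // The parent check ignores the <id> nested under <revision>.
                        pageId = Long.valueOf(text.toString().trim());
                    } else if ("page".equals(name)) {
                        if (title != null && pageId != null) {
                            titles.put(pageId, title);
                        }
                        title = null;
                        pageId = null;
                    }
                    text.setLength(0);
                    break;
                }
                default:
                    break;
            }
        }
        reader.close();
        return titles;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to an uncompressed dump file
        try (Reader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            readTitles(in).forEach((id, t) -> System.out.println(id + "\t" + t));
        }
    }
}
```

A streaming parser is used because the dump files are too large to load as a DOM tree. For a full dump the result map should be replaced by writing each record out as soon as its page closes.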

How to use

The Wikipedia dump files are very large and must be downloaded separately, either from "dumps.wikimedia.org" or from the Keg server (path: "10.1.1.66:/data/dump/"). Examples of dump files:
FR: "frwiki-20161120-pages-articles-multistream.xml"
ZH: "zhwiki-20160203-pages-articles.xml"
In main() of src/main/Preprocess.java, comment or un-comment the languages / files that need to be extracted.
