EnWikiIndexing

A simple program that builds inverted indexes for an English Wikipedia dump using Hadoop. It was my final project for the 2013 Distributed System course at Fudan University.

Overview

The project builds inverted indexes from an English Wikipedia XML dump file. The following inverted index types are supported:

Normal indexes: TF + DF
Indexes with term weighting: TF + IDF
Positional indexes: TF + DF + positions
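To make the three index types concrete, here is a minimal plain-Java sketch (no Hadoop involved) of a positional postings structure from which TF, DF, and an IDF weight can be read. It is not the project's actual code: the class `PostingsSketch`, its method names, the tokenizer (lowercase, split on non-word characters), and the formula `idf = ln(N / df)` are all assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of the EnWikiIndexing sources.
public class PostingsSketch {

    // Builds term -> (docId -> token positions of the term in that document).
    static Map<String, Map<Integer, List<Integer>>> build(Map<Integer, String> docs) {
        Map<String, Map<Integer, List<Integer>>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            int pos = 0;
            // Assumed tokenizer: lowercase, split on runs of non-word characters.
            for (String token : doc.getValue().toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue; // leading separator yields an empty token
                index.computeIfAbsent(token, t -> new TreeMap<>())
                     .computeIfAbsent(doc.getKey(), d -> new ArrayList<>())
                     .add(pos);
                pos++;
            }
        }
        return index;
    }

    // TF: number of occurrences of the term in one document.
    static int tf(Map<String, Map<Integer, List<Integer>>> index, String term, int docId) {
        Map<Integer, List<Integer>> postings = index.get(term);
        if (postings == null || !postings.containsKey(docId)) return 0;
        return postings.get(docId).size();
    }

    // DF: number of documents that contain the term.
    static int df(Map<String, Map<Integer, List<Integer>>> index, String term) {
        Map<Integer, List<Integer>> postings = index.get(term);
        return postings == null ? 0 : postings.size();
    }

    // IDF, assumed here to be ln(N / df); returns 0.0 for unseen terms.
    static double idf(Map<String, Map<Integer, List<Integer>>> index, String term, int totalDocs) {
        int d = df(index, term);
        return d == 0 ? 0.0 : Math.log((double) totalDocs / d);
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(1, "Hadoop builds the index");
        docs.put(2, "Lucene searches the index with the index reader");
        Map<String, Map<Integer, List<Integer>>> index = build(docs);
        System.out.println("positions of 'index' in doc 2: " + index.get("index").get(2));
        System.out.println("tf=" + tf(index, "index", 2) + " df=" + df(index, "index"));
        System.out.println("idf('hadoop')=" + idf(index, "hadoop", docs.size()));
    }
}
```

A normal index needs only the sizes of the position lists (TF) and of the postings map (DF); a weighted index stores a value derived from TF and IDF; a positional index keeps the position lists themselves.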

The inverted indexes are built with MapReduce on Hadoop and stored in HDFS. The results in HDFS are then imported and converted into Lucene indexes. Finally, full-text search can be run against the Lucene indexes, as in a web search engine.
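The MapReduce stage described above can be pictured with a small in-memory simulation. This is only a sketch of the map, shuffle, and reduce steps in plain Java; it uses no Hadoop or Lucene classes, and the class `MapReduceSketch`, its method names, and its tokenizer are hypothetical rather than taken from this repository.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; simulates the data flow only.
public class MapReduceSketch {

    // Map phase: emit one (term, docId) pair per token occurrence.
    static List<Map.Entry<String, Integer>> map(int docId, String text) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) out.add(new SimpleEntry<>(token, docId));
        }
        return out;
    }

    // Shuffle: group all emitted docIds by term, as the framework does between phases.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: for one term, turn its docId list into docId -> TF.
    // The size of the returned map is the term's DF.
    static Map<Integer, Integer> reduce(List<Integer> docIds) {
        Map<Integer, Integer> tfByDoc = new TreeMap<>();
        for (int docId : docIds) tfByDoc.merge(docId, 1, Integer::sum);
        return tfByDoc;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.addAll(map(1, "wiki page about wiki"));
        pairs.addAll(map(2, "another page"));
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            System.out.println(e.getKey() + " -> " + reduce(e.getValue()));
        }
    }
}
```

In a real job, each reducer output line would be written to HDFS, and a separate import step would read those lines and add them to a Lucene index.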
