Skip to content

synhershko/BzReader

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

The main purpose of BzReader is to allow the offline browsing of the Wikipedia.

BzReader project page is http://code.google.com/p/bzreader/

This fork adds the following functionalities:

	1. Upgrade to Lucene.Net 2.9.2
	2. Proper search for Hebrew dumps, using HebMorph (http://www.code972.com/blog/hebmorph)

Released under GPLv2

Compiling from source
---------------------

	1. Make sure you have a local copy of Lucene.Net (DLL or sources). Their SVN repository is at https://svn.apache.org/repos/asf/lucene/lucene.net
	2. If you are interested in the Hebrew integration, you may want to compile HebMorph from source as well instead of using the precompiled option: http://github.com/synhershko/HebMorph

Usage
-----
The typical usage would be:

	1. Download one or several .xml.bz2 dumps of Wikipedia topics from http://download.wikipedia.org/backup-index.html
	2. You will need the files named pages-articles.xml.bz2 for your language. Alternatively, you may use any database dumps which follow Wikipedia XML format - Wikibooks, Wikitravel etc.
	3. Install BzReader
	4. To enable proper Hebrew searches, point to a path where hspell files reside. Go to Options > Set hspell path, and select the hspell-data-files folder shipping with the BzReader. Without doing so, a simple Hebrew analyzer will be used instead of a morphological one.
	5. Open the Wikipedia dump in BzReader (you can have multiple dumps opened at the same time, all of them will be searched together)
	6. Let BzReader create the index for fast retrieval of topics. This might be quite lengthy for big dumps. It takes about 4 hours on Core 2 Duo to index the current dump of all English topics (enwiki-20080312-pages-articles.xml.bz2, 3.5 Gb file). The BzReader index for this dump takes about 550 Mb.
	7. Browse!
	8. Enter your search terms into the search window on the left, wait for the result(s) and click on the topic. Try also something like "music*" to find anything that starts with "music".


Known issues
------------
Chinese/Japanese/Korean users - please note that indexing the dumps in your languages is currently too slow to be of real use. This is due to the complexity of the actual languages. Unfortunately, I don't really know the appropriate languages to actually fix the tokenizing code. Just as a guide - 2.7 Mb compressed jawikinews dump takes about 1.5 hours to index on Core 2 Duo.

About

Fork of BzReader. BzReader is an application to browse Wikipedia compressed dumps offline. Original project can be found at http://code.google.com/p/bzreader/

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published