markburns/wwwjdic2db
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Database generator for Jim Breen's wwwjdic kanjidic file ======================================================= This is basically a command line app that has been built alongside a Rails app so that there is easy access to things like ActiveRecord, and the various Rake tasks that Rails developers are used to. _____________ Note this is only currently working with Ruby 1.9.1 tested against : *ruby 1.9.1p129 (2009-05-12 revision 23412) [i686-linux] *ruby 1.9.1 p378 _____________ Current Status of this Project ============================== At the moment running the spec should run the tests and generate a Sqlite3 database from kanjidic. The basic structure of the database can be seen by examining schema.rb or looking at the models. But in summary: The main model is Kanji, and there are various other models which refer to the other things in kanjidic. For an introduction to the kanjidic file structure please see Jim Breen's page at http://www.csse.monash.edu.au/~jwb/kanjidic.html or see http://www.csse.monash.edu.au/~jwb/kanjidic_doc.html for more detailed information. This application imports all the data from the kanjidic file I perceive the data to be of two basic types: * Language related data * Dictionary indexes _____________________ Language related data ===================== The language related models are *Kunyomi *Onyomi *Nanori *Meaning (the English meaning of a Kanji) *Korean (the Korean reading of a Kanji) *Pinyin (the Chinese Pinyin reading of a Kanji) Each of these models have their own tables and join tables to join to the Kanji model. Dictionary indexes ================== The majority of the data refer to various dictionary indexes and study book indexes such as James Heisig's "Remembering the Kanji" These indexes have been moved into the kanji_lookups table, where the dictionary_id can be used to find out which dictionary or index it refers to. _________________________ Future ======= Future plans for this project. * Using Rsync do an hourly check and download only the changes to the kanjidic file, and update the database accordingly. * Do periodic entire rebuilds of the database. * Provide a copy of this database for the wwwjdic website to allow people to get up to date versions of the database. * Incorporate data from edict into their own tables * Incorporate example sentences from the tanaka corpus (tatoeba project) * Create join tables to the kanji for tanaka corpus based on the reading and sense number * Possible automated discovery of likely joins between edict words and kanjidic entries (this is possibly a research project in itself, but it may be worth finding some exact matches and at least creating a 'possible matches' table. This would mean for any client applications using the database that they may be able to match edict entries to kanji reading entries, and also tanaka corpus sentences to their corresponding kanji reading entries)
About
A project for converting the kanjidic (and in future edict and Tanaka corpus) files from Jim Breen's wwwjdic project into a database format.
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published