
markburns/wwwjdic2db


Database generator for Jim Breen's wwwjdic kanjidic file
=======================================================

This is a command-line app built alongside a Rails app,
giving easy access to ActiveRecord and the various Rake
tasks that Rails developers are used to.

_____________

Note: this currently only works with Ruby 1.9.1,
tested against:
* ruby 1.9.1p129 (2009-05-12 revision 23412) [i686-linux]
* ruby 1.9.1p378


_____________

Current Status of this Project
==============================

At the moment, running the specs runs the tests and
generates an SQLite3 database from kanjidic.

The basic structure of the database can be seen by 
examining schema.rb or looking at the models.

In summary:

The main model is Kanji, and various other models
cover the remaining data in kanjidic.

For an introduction to the kanjidic file structure please
see Jim Breen's page at http://www.csse.monash.edu.au/~jwb/kanjidic.html
or see http://www.csse.monash.edu.au/~jwb/kanjidic_doc.html for more
detailed information.

This application imports all the data from the kanjidic file. I perceive
the data to be of two basic types:

* Language related data
* Dictionary indexes
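
As a rough sketch of this split (illustrative only, not the project's actual importer), a kanjidic line can be separated into the two types of data. The field conventions follow Jim Breen's documentation: English meanings appear in {braces}, codes beginning with an ASCII letter or digit (e.g. U4e9c, S7) are dictionary/index fields, and the remaining kana fields are readings. The sample line below is abridged.

```ruby
# Sketch: split one kanjidic line into the two data types described above.
def split_kanjidic_fields(line)
  meanings = line.scan(/\{([^}]*)\}/).flatten    # English meanings in {braces}
  rest     = line.gsub(/\{[^}]*\}/, '').split    # everything outside the braces
  kanji    = rest.shift                          # first field is the character itself
  # Index/code fields start with an ASCII letter or digit; kana readings do not.
  indexes, readings = rest.partition { |f| f.match?(/\A[A-Z0-9]/) }
  { kanji: kanji, meanings: meanings, indexes: indexes, readings: readings }
end
```

For example, `split_kanjidic_fields("亜 3021 U4e9c B7 S7 ア つ.ぐ {Asia} {rank next}")` yields the meanings `["Asia", "rank next"]` separately from the index codes and the readings.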

_____________________

Language related data
=====================

The language related models are

* Kunyomi
* Onyomi
* Nanori
* Meaning (the English meaning of a Kanji)
* Korean (the Korean reading of a Kanji)
* Pinyin (the Chinese Pinyin reading of a Kanji)

Each of these models has its own table, plus a join table
to the Kanji model.
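
In the kanjidic file, on readings are given in katakana and kun readings in hiragana, which is how readings can be sorted into the Onyomi and Kunyomi models. A minimal sketch of that classification (nanori readings, which follow a T1 marker in the file, are not handled here):

```ruby
# Sketch: classify a kanjidic reading field by script, following the file's
# convention that onyomi are katakana and kunyomi are hiragana.
def classify_reading(reading)
  case reading
  when /\A\p{Katakana}/ then :onyomi
  when /\A\p{Hiragana}/ then :kunyomi
  else :other
  end
end
```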

Dictionary indexes
==================

The majority of the data refers to various dictionary indexes
and study-book indexes, such as James Heisig's "Remembering the
Kanji".

These indexes have been moved into the kanji_lookups table,
where the dictionary_id identifies which dictionary or
index each entry refers to.
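
A hypothetical sketch of how raw index fields could be normalised into kanji_lookups rows. The prefix-to-dictionary mapping below is an assumption for illustration, not the project's actual dictionaries table:

```ruby
# Hypothetical prefix => dictionary mapping (illustrative only; the real
# dictionaries table in the project may differ).
DICTIONARIES = {
  "H" => "Halpern NJECD",
  "N" => "Nelson (Classic)",
  "L" => "Heisig, Remembering the Kanji",
}

# Turn raw index fields like "H123" into rows for a kanji_lookups table,
# keyed by kanji and dictionary. Unknown prefixes are skipped.
def build_kanji_lookups(kanji_id, index_fields)
  index_fields.filter_map do |field|
    dictionary = DICTIONARIES[field[0]] or next
    { kanji_id: kanji_id, dictionary: dictionary, index_number: field[1..].to_i }
  end
end
```

With this shape, a client can answer "what is this kanji's Heisig number?" with a single lookup on (kanji_id, dictionary_id) rather than a column per dictionary.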


_________________________

Future
=======

Future plans for this project:

* Use rsync to do an hourly check, download only the changes to the kanjidic file, and
update the database accordingly.
* Do periodic full rebuilds of the database.
* Provide a copy of this database for the wwwjdic website, so people can get up-to-date
versions of the database.
* Incorporate data from edict into its own tables.
* Incorporate example sentences from the Tanaka Corpus (Tatoeba project).
* Create join tables from the Tanaka Corpus sentences to the kanji, based on reading and sense number.
* Possibly automate discovery of likely joins between edict words and kanjidic entries.
(This is possibly a research project in itself, but it may be worth finding some exact matches
and at least creating a 'possible matches' table. Client applications using the database could
then match edict entries to kanji reading entries, and also Tanaka Corpus sentences to their
corresponding kanji reading entries.)

About
=====

A project for converting the kanjidic (and in future edict and Tanaka Corpus) files from Jim Breen's wwwjdic project into a database format.
