GitHub - markburns/wwwjdic2db: A project for converting the kanjidic (and in future edict and Tanaka corpus) files from Jim Breen's wwwjdic project into a database format.


Database generator for Jim Breen's wwwjdic kanjidic file
=======================================================

This is essentially a command-line app built alongside a
Rails app, which gives easy access to things like
ActiveRecord and the various Rake tasks that Rails
developers are used to.

_____________

Note that this currently works only with Ruby 1.9.1.
It has been tested against:
* ruby 1.9.1p129 (2009-05-12 revision 23412) [i686-linux]
* ruby 1.9.1 p378


_____________

Current Status of this Project
==============================

At the moment, running the specs should run the tests and
generate a SQLite3 database from kanjidic.

The basic structure of the database can be seen by 
examining schema.rb or looking at the models.

But in summary:

The main model is Kanji, and there are various other
models which refer to the other things in kanjidic.

For an introduction to the kanjidic file structure please
see Jim Breen's page at http://www.csse.monash.edu.au/~jwb/kanjidic.html
or see http://www.csse.monash.edu.au/~jwb/kanjidic_doc.html for more
detailed information.

This application imports all the data from the kanjidic file. I perceive
the data to be of two basic types:

* Language related data
* Dictionary indexes

_____________________

Language related data
=====================

The language related models are:

* Kunyomi
* Onyomi
* Nanori
* Meaning (the English meaning of a Kanji)
* Korean (the Korean reading of a Kanji)
* Pinyin (the Chinese Pinyin reading of a Kanji)

Each of these models has its own table and a join table
that joins it to the Kanji model.
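
As a rough illustration of where these six kinds of data come from, the sketch
below splits one kanjidic line into the groups listed above. It is not the
project's importer: the method name, the hash keys and the abridged sample line
are assumptions made for this example. The field conventions (meanings in
braces, `Y` for Pinyin, `W` for Korean, katakana for on readings, hiragana for
kun readings, `T1` introducing nanori) follow Jim Breen's kanjidic
documentation linked earlier.

```ruby
# encoding: utf-8
# Sketch only: names below are illustrative, not taken from this project.

# Abridged sample line in the kanjidic format (most index codes left out).
SAMPLE_LINE = '亜 3021 U4e9c B1 C7 G8 S7 F1509 L1809 Yya4 Wa ' \
              'ア つ.ぐ T1 や つぎ つぐ {Asia} {rank next} {come after} {-ous}'

def parse_language_fields(line)
  result = { kanji: nil, onyomi: [], kunyomi: [], nanori: [],
             meanings: [], korean: [], pinyin: [] }

  # Meanings sit inside braces and may contain spaces, so extract them first.
  result[:meanings] = line.scan(/\{([^}]*)\}/).flatten
  fields = line.gsub(/\{[^}]*\}/, ' ').split

  result[:kanji] = fields.shift
  fields.shift # JIS code, not needed here

  section = :readings
  fields.each do |field|
    case field
    when 'T1' then section = :nanori        # readings after T1 are nanori
    when 'T2' then section = :radical_name  # radical names are skipped here
    when /\AW(.+)/ then result[:korean] << Regexp.last_match(1)
    when /\AY(.+)/ then result[:pinyin] << Regexp.last_match(1)
    when /\p{Katakana}/
      result[:onyomi] << field if section == :readings
    when /\p{Hiragana}/
      result[:kunyomi] << field if section == :readings
      result[:nanori] << field if section == :nanori
    end
    # Every other field is a dictionary index or code and is ignored here.
  end

  result
end

parsed = parse_language_fields(SAMPLE_LINE)
puts parsed.inspect
```

In the real application each of these values would presumably become a row in
the matching model's table, linked to the Kanji through its join table.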

Dictionary indexes
==================

The majority of the data refers to various dictionary indexes
and study book indexes, such as James Heisig's "Remembering the
Kanji".

These indexes have been moved into the kanji_lookups table, 
where the dictionary_id can be used to find out which dictionary or
index it refers to.
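
The idea can be sketched as follows: each index code on a kanjidic line becomes
one row, tagged with the id of the dictionary it belongs to. Only the
kanji_lookups table and the dictionary_id column are named in this README. The
id values, the other hash keys and the method name below are hypothetical. The
code letters (`L` for Heisig, `N` for Nelson, `H` for Halpern) follow the
kanjidic documentation.

```ruby
# encoding: utf-8
# Sketch only: ids and key names are assumptions made for illustration.

# Hypothetical dictionary ids keyed by the kanjidic code prefix.
DICTIONARY_IDS = {
  'L' => 1, # Heisig, "Remembering the Kanji"
  'N' => 2, # Nelson
  'H' => 3  # Halpern
}

# Abridged sample line in the kanjidic format.
LOOKUP_SAMPLE = '亜 3021 U4e9c B1 C7 G8 S7 F1509 N43 H3540 L1809 ' \
                'Yya4 Wa ア つ.ぐ {Asia}'

def lookup_rows(line)
  fields = line.gsub(/\{[^}]*\}/, ' ').split
  kanji = fields.shift
  fields.shift # JIS code, not needed here

  rows = []
  fields.each do |field|
    # An index field is an uppercase prefix followed by a value starting with a digit.
    match = /\A([A-Z]+)(\d.*)\z/.match(field)
    next unless match

    dictionary_id = DICTIONARY_IDS[match[1]]
    next unless dictionary_id # prefix is not one of the dictionaries we know

    rows << { kanji: kanji, dictionary_id: dictionary_id, index_value: match[2] }
  end
  rows
end

lookup_rows(LOOKUP_SAMPLE).each { |row| puts row.inspect }
```

Keeping every index in one table means that supporting another dictionary only
needs a new dictionary entry, not a new column on the kanji.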


_________________________

Future
=======

Future plans for this project:

* Use rsync to check hourly for changes to the kanjidic file, download only those changes, and
update the database accordingly.
* Do periodic full rebuilds of the database.
* Provide a copy of this database for the wwwjdic website so that people can get up-to-date
versions of the database.
* Incorporate data from edict into its own tables.
* Incorporate example sentences from the Tanaka corpus (Tatoeba project).
* Create join tables from the Tanaka corpus to the kanji, based on the reading and sense number.
* Possibly automate the discovery of likely joins between edict words and kanjidic entries.
(This may be a research project in itself, but it may be worth finding some exact matches
and at least creating a 'possible matches' table. Client applications using the database
could then match edict entries to kanji reading entries, and
also Tanaka corpus sentences to their corresponding kanji reading entries.)
