Japanese Character Image Classification

Author: Chaz Frazer

Overview

This project applies image recognition to handwritten Japanese characters and builds a learning model that classifies each character from the read input.

The project aims to predict accurately across the three main Japanese writing systems (kanji, hiragana, katakana), which have been used in Japan for over a thousand years, since the 8th century.

The data is from the ETL Character Database, which includes roughly 1.2 million handwritten Japanese characters collected and reorganized by the Japanese National Institute of Advanced Industrial Science and Technology (AIST), with the assistance of the Japan Electronics & Information Technology Industries Association (JEITA).

Business Problem

Can a viable product model be created to accurately transcribe, read, and identify Japanese text for archiving important literary works? Such a model could also help preserve the surviving texts of endangered languages from the Ainu and Ryukyu minority groups in Japan.

Can this be expanded into an accurate API that recognizes written Japanese characters on touchscreen devices (e.g., dictionaries, translation apps)? The target audience is Japanese and English research organizations, higher learning institutions, linguistic preservation societies, and language students.

The Writing Systems of Japan

Kanji:
Kanji entered Japan in the 8th century via Chinese monks, who also brought other traditions with them such as tea and Buddhism. Kanji are based on the corresponding Chinese characters, which convey meaning through pictographic forms.

Hiragana:
A phonetic writing system that takes the curved, cursive aspects of certain kanji to represent sounds. There are 46 individual hiragana characters in use today (alongside 29 diphthongs).

Katakana:
Katakana is phonetically identical to hiragana but takes the angular aspects of certain kanji; it is mainly used for foreign words, onomatopoeia, and sound effects. Katakana contains the same number of phonetic characters as hiragana.

Kuzushiji:
A cursive writing style. Over 3 million books written in kuzushiji, on a diverse array of topics such as literature, science, mathematics, and cooking, are preserved today. However, the standardization of Japanese textbooks under the 1900 "Elementary School Order" removed kuzushiji from the regular school curriculum as modern Japanese print became dominant. As a result, most Japanese natives today cannot read books written or printed in kuzushiji just 120 years ago.

Data

The data is from the National Institute of Advanced Industrial Science and Technology (AIST) and was reorganized by the Japan Electronics and Information Technology Industries Association (JEITA). There are about 1.2 million handwritten Japanese records, including numerals, hiragana, katakana, and kanji, written by tens of thousands of individuals. Collected from 1973 to 1984, the data was submitted to AIST on magnetic tapes and CD-Rs delivered by post.

  • Each file contains 5 data sets, except ETL8G_33.
  • Each data set contains 956 characters written by a writer.
  • Each writer wrote 10 sheets (genkouyoushi) per data set.

Hiragana (ETL 8):
  • 71 hiragana characters (46 unique + 29 diphthongs)
  • 160 writers
  • 8,199 records (genkouyoushi sheets)
  • 1,254,120,000 unique handwritten hiragana characters (shared with kanji characters in the same files)

Kanji (ETL 8):
  • 883 daily-use kanji
  • 160 writers
  • 8,199 records
  • 152,878,411 unique handwritten kanji (shared with hiragana characters in the same files)

Katakana (ETL 1):
  • 46 katakana characters (46 unique; diphthongs not included as they are phonetically identical to hiragana)
  • 1,411 writers
  • 2,052 records
  • 2,436,366 unique handwritten katakana characters

Data Cleaning & EDA

The data was read in from binary, sorted, and saved to a .npz file for quick access and modeling. The separate datasets were then merged into one to represent the full scope of the Japanese language. Once training labels were created, the image data could be rendered for inspection.
(Figure: sample of merged character images)
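A minimal sketch of this step, assuming the character images have already been decoded from the ETL binaries into NumPy arrays (the ETL-specific parsing is omitted); the file paths, array names, and helper functions below are illustrative, not the repository's actual code:

```python
import numpy as np

def save_dataset(path, images, labels):
    """Persist one writing system (e.g. hiragana) to a compressed .npz archive."""
    np.savez_compressed(path, images=images, labels=labels)

def load_and_merge(paths):
    """Load each .npz archive and stack them into a single merged dataset."""
    all_images, all_labels = [], []
    for path in paths:
        with np.load(path) as data:
            all_images.append(data["images"])
            all_labels.append(data["labels"])
    return np.concatenate(all_images), np.concatenate(all_labels)

# Hypothetical paths for the three per-system archives produced earlier.
images, labels = load_and_merge(
    ["data/hiragana.npz", "data/kanji.npz", "data/katakana.npz"]
)
print(images.shape, labels.shape)
```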

Feature Engineering

The images were resized to 64x64 pixels for the CNN model to read. TensorFlow's ImageDataGenerator was applied to the images with random transformations to add variability to the data and reduce the chance of the model overfitting.
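A hedged sketch of the augmentation step; the transform ranges, batch size, and validation split below are assumptions rather than the notebook's actual settings, and `images`/`labels` are assumed to be the merged arrays from the previous step:

```python
import tensorflow as tf

# Reshape to (n, 64, 64, 1) and scale pixel values to [0, 1].
x = images.reshape(-1, 64, 64, 1).astype("float32") / 255.0

datagen = tf.keras.preprocessing.image.ImageDataGenerator(
    rotation_range=10,        # small random rotations
    width_shift_range=0.1,    # random horizontal shifts
    height_shift_range=0.1,   # random vertical shifts
    zoom_range=0.1,           # random zoom in/out
    validation_split=0.2,     # hold out a validation slice
)

train_flow = datagen.flow(x, labels, batch_size=128, subset="training")
val_flow = datagen.flow(x, labels, batch_size=128, subset="validation")
```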

Modeling & Results

After EDA and feature engineering, we were ready to begin the modeling process.
The data was first trained on shallow KNN and Random Forest algorithms, and then on a CNN and a cuDNN model in the cloud using an AWS EC2 instance. The environment ran in a virtual machine on a g4dn instance with Nvidia Tesla GPU architecture.
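For reference, a minimal scikit-learn sketch of the two shallow baselines, using the hyperparameters listed in the results table below; the flattening and train/test split details are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Flatten each 64x64 image into a 4096-dimensional feature vector.
X = images.reshape(len(images), -1)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=10, weights="distance")
knn.fit(X_train, y_train)
print("KNN test accuracy:", knn.score(X_test, y_test))

rf = RandomForestClassifier(class_weight="balanced", max_depth=32, n_jobs=-1)
rf.fit(X_train, y_train)
print("Random Forest test accuracy:", rf.score(X_test, y_test))
```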

Before the merge, our dataset sizes for modeling were:

  • Hiragana: 11,360 images
  • Kanji: 139,680 images
  • Katakana: 64,906 images

After the merge, we had 214,946 images across 3 classes (hiragana, kanji, katakana).

Initial Class Imbalance

(Figure: initial class imbalance across the three writing systems)

After modeling each writing system and experimenting with various parameters and hyperparameters, our results were as follows:

| Model | Description | Train Accuracy | Train Loss | Validation Accuracy | Validation Loss | Test Accuracy | Test Loss |
| --- | --- | --- | --- | --- | --- | --- | --- |
| KNN | n_neighbors=10, weights='distance' | 92.67% | N/A | 92.75% | N/A | 95.95% | N/A |
| Random Forest | class_weight='balanced', max_depth=32 | 94.50% | N/A | 94.48% | N/A | 94.95% | N/A |
| CNN | 12 layers, 8,491,555 parameters | 99.79% | 0.68% | 99.40% | 3.2% | 99.73% | 0.90% |
| cuDNN (Nvidia) | 8 layers, 264,195 parameters | 99.95% | 0.017% | 99.55% | 0.027% | 99.48% | 0.019% |

Best Performing Model Architecture

(Figure: layer architecture of the best performing model)
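The diagram above shows the repository's actual layer stack; as a rough, hedged sketch only (the filter counts, layer count, and training settings are assumptions, not the 8-layer cuDNN architecture itself), a compact Keras CNN for this task could look like:

```python
import tensorflow as tf

NUM_CLASSES = 3  # hiragana, kanji, katakana, per the merged dataset above

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Train on the augmented generators from the feature engineering step.
# model.fit(train_flow, validation_data=val_flow, epochs=20)
```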

Next Steps

  • Work with the kuzushiji (Japanese cursive writing) KMNIST dataset variations
  • Use OpenCV for live image recognition with a webcam (see the sketch after this list)
  • Expand the model into a touchscreen handwriting API for language education (iOS app)
  • Partner with the CUNY Endangered Language Initiative, which strives to preserve dying languages around the world, using the model to apply computational linguistics to the preservation of precious texts and early written Japanese history
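For the OpenCV item above, a hedged sketch of what live webcam inference could look like; the saved-model path, preprocessing, and window name are hypothetical:

```python
import cv2
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("models/japanese_cnn.h5")  # hypothetical path

cap = cv2.VideoCapture(0)  # open the default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Convert to grayscale and resize to the 64x64 training input size.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (64, 64)).astype("float32") / 255.0
    pred = model.predict(small[np.newaxis, ..., np.newaxis], verbose=0)
    cv2.putText(frame, f"class: {int(np.argmax(pred))}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    cv2.imshow("japanese_classification", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to quit
        break

cap.release()
cv2.destroyAllWindows()
```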


Repository Structure

├── data
├── img
├── logs
├── models
├── sub_functions    
├── trials
├── .gitignore
├── README.md
├── japanese_character_classification.pdf   
└── japanese_classification.ipynb
