AI-powered Kosakuin (Contributor) for Aozora Bunko (https://www.aozora.gr.jp). This project (Shinonome Bunko) aims to generate text from digital images of books with OCR so that people can then correct it accurately. It is a very personal project that attempts OCR-based text digitization in order to contribute to Aozora Bunko's volunteer work.

ItsukiKigoshi/shinonome-bunko

shinonome-bunko / 東雲文庫

In time, the night gives way to dawn.

In my mind...🤔

graph TD
    A[Images of Books] -->|OCR| B
    B["Text File \n (MD? XML?; \n UTF-8? Shift-JIS?)"] -->|"Parser? (Should I do this?)"| C
    B --> |"Iterate + Modify by Human \n (Editor in Browser or Git; cf. Wiki, Qiita, Zenn)"| B
    C["Aozora Bunko File Format?"] -->| | D
    D["Publish to Aozora Bunko?"]
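The diagram above leaves the encoding question open (UTF-8 or Shift-JIS). One way to keep both options available is to edit in UTF-8 and check, before any export, which characters would not survive a conversion to Shift_JIS. The following is a minimal sketch under that assumption; the helper name `find_unencodable` is hypothetical, and it relies only on Python's built-in `shift_jis` codec, which may not match every rule Aozora Bunko applies to its own files.

```python
def find_unencodable(text: str, encoding: str = "shift_jis") -> list:
    """Return (index, character) pairs that cannot be encoded in `encoding`.

    Hypothetical helper: it only reports problem characters and does not
    decide how they should be replaced or annotated.
    """
    problems = []
    for index, char in enumerate(text):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            problems.append((index, char))
    return problems


# Ordinary Japanese text encodes cleanly; the emoji does not.
sample = "吾輩は猫である😀"
issues = find_unencodable(sample)
for index, char in issues:
    print(f"position {index}: {char!r} is not representable in shift_jis")
```

A check like this could run in the editor or in a Git hook, so that a human decides how to handle each reported character before anything is converted.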
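  1. Use Existing Aozora Bunko Files as Training Data
    • Each Aozora Bunko text names the source edition it was transcribed from ("底本", teihon), so the original printed book can be identified and its page images paired with the finished text.
    • Use these image/text pairs as labeled data for supervised learning.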
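To use an existing Aozora Bunko file as a ground-truth label, its markup probably has to be reduced to plain text first. The sketch below assumes the common Aozora Bunko notation (ruby readings in `《…》`, the ruby start marker `｜`, and editor annotations in `［＃…］`) and assumes the OCR target is the base text without ruby. The function name `strip_aozora_markup` is hypothetical, and the sketch does not handle gaiji (external character) notes or the header and footer blocks of a real file.

```python
import re

# Patterns for the Aozora Bunko notation assumed in this sketch.
RUBY = re.compile(r"《[^》]*》")           # ruby readings, e.g. 《わがはい》
RUBY_START = re.compile(r"｜")             # marks where a ruby base begins
ANNOTATION = re.compile(r"［＃[^］]*］")   # editor notes, e.g. ［＃改ページ］


def strip_aozora_markup(text: str) -> str:
    """Remove annotations, ruby readings, and ruby start markers."""
    text = ANNOTATION.sub("", text)
    text = RUBY.sub("", text)
    return RUBY_START.sub("", text)


line = "吾輩《わがはい》は｜猫《ねこ》である。［＃改ページ］"
plain = strip_aozora_markup(line)
print(plain)
```

If the scanned pages print ruby next to the base text, keeping the readings in a separate field instead of discarding them may be the better design, since the model would otherwise be trained against labels that omit visible characters.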

This project consists of...

  1. Text Recognition
    • OCR with Python
    • Aim to recognize text accurately and quickly, including vertically set Japanese text
  2. Viewer/Editor
    • A simple, fast viewer and editor that runs in the browser
    • Anyone can correct the generated text, either in the built-in editor or on GitHub (open question: can the original page images and the generated text be compared side by side?)
    • Open question: can this editor be built with Python as well?
  3. Text Matching Game
    • A matching game for Japanese text
    • Aims to improve OCR accuracy (and to be fun, of course!)
    • The game could also serve as learning material for learners of Japanese (like the original concept of Duolingo)
    • cf. Google reCAPTCHA
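Both the editor and the matching game depend on knowing where the OCR output differs from a human-corrected version. As a rough sketch, a character-level diff can be computed with Python's standard `difflib`; the function name `char_diff` and the sample strings are illustrative only and are not part of the project.

```python
import difflib


def char_diff(ocr_text: str, corrected: str):
    """Return the similarity ratio and the spans where the two texts differ."""
    # autojunk=False keeps frequent characters from being ignored on long texts.
    matcher = difflib.SequenceMatcher(None, ocr_text, corrected, autojunk=False)
    changes = [
        (tag, ocr_text[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
    return matcher.ratio(), changes


# Illustrative pair: the OCR output confuses two visually similar kanji.
ocr_text = "吾輩は描である"
corrected = "吾輩は猫である"
ratio, changes = char_diff(ocr_text, corrected)
print(f"similarity: {ratio:.3f}")
for tag, before, after in changes:
    print(f"{tag}: {before!r} -> {after!r}")
```

The mismatched spans could be highlighted next to the page image in the editor, and frequently confused character pairs might be reused as questions in the matching game; both uses are suggestions rather than a settled design.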

Related Projects
