The Nan.ai OCR open data initiative makes handwritten data publicly available for reuse in training OCR models. This dataset is derived from forms submitted via Nan.ai and processed using our OCR ML service, which extracts information from photos of forms captured using readily available mobile devices. The data is extracted, anonymized, and annotated, and can be used to train OCR models for your own use case.
You can participate by (1) contributing handwritten data or (2) annotating existing datasets. We also welcome image processing experts to help improve this repository's usability for various use cases.
To explore our datasets, you can use the existing image processing notebooks available here or import data by following the instructions here.
From form images, the handwritten image data is isolated, extracted, and anonymized to form a generic dataset similar to the MNIST handwritten digit dataset. This repository is continually updated to reflect new inputs, such as improved annotations from the OCR ML service. Users can choose their own ratio for splitting the data into training, validation, and test sets.
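Since the split ratio is left to the user, here is a minimal sketch of one way to do it in plain Python. The `split_dataset` helper, the 80/10/10 ratio, and the `img_0000.png`-style filenames are all illustrative assumptions and not part of this repository; adapt them to however you load the data.

```python
import random


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and split them into (train, validation, test) lists.

    Hypothetical helper for illustration; not provided by this repository.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    # Whatever remains goes to the test set, so no sample is dropped.
    test = shuffled[n_train + n_val:]
    return train, val, test


# Placeholder (filename, label) pairs standing in for real samples.
samples = [(f"img_{i:04d}.png", i % 10) for i in range(100)]
train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))  # 80 10 10
```

Fixing the seed means the same samples land in the same subsets each time, which helps when comparing models trained on different days. If your labels are imbalanced, a stratified split may be a better fit than this simple shuffle.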
We are also creating datasets derived from and annotated against the corpus data, such as use-case-specific dictionaries (e.g. possible handwritten values and their counterparts in the standard naming of places in the Philippines).
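To illustrate how such a dictionary might be used, the sketch below maps a noisy OCR reading to the closest standard place name using Python's standard-library `difflib`. The `STANDARD_PLACES` list and the `normalize_place` function are hypothetical stand-ins, not the actual dictionary or API from this repository, and fuzzy string matching is just one possible lookup strategy.

```python
import difflib

# Tiny illustrative list; a real dictionary would come from the dataset.
STANDARD_PLACES = ["Quezon City", "Cebu City", "Davao City", "Makati"]


def normalize_place(raw, choices=STANDARD_PLACES, cutoff=0.6):
    """Return the standard name closest to an OCR reading, or None.

    Hypothetical helper for illustration; not provided by this repository.
    """
    # Compare in lowercase so casing differences do not affect the score.
    lookup = {name.lower(): name for name in choices}
    matches = difflib.get_close_matches(
        raw.strip().lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None


print(normalize_place("Quezon Cty"))  # Quezon City
print(normalize_place("xyz"))         # None
```

The `cutoff` parameter controls how similar a reading must be before it is accepted; raising it reduces wrong matches at the cost of more unmatched values.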
Alongside our open data initiative, we are also open sourcing a related machine learning service, Nan.ai OCR.
- Documentation
- Issue tracking
- Discussion board
- How to contribute data
Nan.ai-opendata-ocr is licensed under the Creative Commons Zero v1.0 Universal license.