DaveLogs/AIHubOCR2TDS

AIHubOCR2TDS

Convert AIHub OCR format data to training-datasets-splitter format data.

References

Usage example

Convert datasets for the 'printed', 'printed_augmentation', and 'handwritten' groups.

(venv) $ python3 convert.py \
                --input_path ./input \
                --label_file ./input/labels.json \
                --output_path ./output

Convert datasets for the 'textinthewild' group.

(venv) $ python3 convert_textinthewild.py \
                --input_path ./input \
                --label_file ./input/labels.json \
                --output_path ./output
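
The three flags used in the commands above could be parsed roughly as follows. This is a minimal sketch, not the actual contents of convert.py: the function name `build_parser`, the help strings, and the choice to make every flag required are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Convert AIHub OCR format data to "
                    "training-datasets-splitter format data."
    )
    # Assumption: all three flags are mandatory.
    parser.add_argument("--input_path", required=True,
                        help="folder containing the group sub-folders")
    parser.add_argument("--label_file", required=True,
                        help="AIHub OCR label file (JSON)")
    parser.add_argument("--output_path", required=True,
                        help="folder that receives the converted groups")
    return parser


# Parse the same arguments as in the usage example above.
args = build_parser().parse_args([
    "--input_path", "./input",
    "--label_file", "./input/labels.json",
    "--output_path", "./output",
])
print(args.input_path, args.label_file, args.output_path)
```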

Input Data Structure

The input data folder is structured as follows.

/input
├── label_info.json
├── /group1
│   #   [id].[ext]
│   ├── 0000000001.png
│   ├── 0000000002.png
│   ├── 0000000003.png
│   └── ...
│
├── /group2
└── ...
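
To check that an input folder matches this layout before converting, the group folders and their image files can be listed with a short script. This is a sketch under stated assumptions: the helper name `list_group_images` and the set of accepted extensions are hypothetical and not taken from the repository.

```python
from pathlib import Path

# Assumption: these are the image extensions of interest; adjust as needed.
IMAGE_EXTS = {".png", ".jpg", ".jpeg"}


def list_group_images(input_path):
    """Map each group folder name to the sorted image file names inside it."""
    groups = {}
    for group_dir in sorted(p for p in Path(input_path).iterdir() if p.is_dir()):
        groups[group_dir.name] = sorted(
            f.name
            for f in group_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTS
        )
    return groups
```

Files that sit directly in the input folder, such as the label file, are skipped because only sub-folders are treated as groups.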

Label file structure

For the structure of the 'label_info.json' file, refer to AIHub OCR.

Output Data Structure

The output data folder is structured as follows.

/output
├── /group1
│   #   [filename].[ext]
│   ├── 0000000001.png
│   ├── 0000000002.png
│   ├── ...
│   └── labels.txt
│
├── /group2
└── ...

Label file structure

  • labels.txt
# {filename}\t{label}\n
  0000000001.png	abcd
  0000000002.png	efgh
  0000000003.png	ijkl
  ...
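
The tab-separated format above can be written and read back with a few lines of Python. This is a minimal sketch, not the converter's actual code: the helper names `write_labels` and `read_labels` are hypothetical, and UTF-8 encoding is an assumption.

```python
from pathlib import Path


def write_labels(group_dir, labels):
    """Write one '{filename}\\t{label}\\n' line per entry to labels.txt."""
    labels_file = Path(group_dir) / "labels.txt"
    # Assumption: UTF-8 text with plain '\n' line endings.
    with labels_file.open("w", encoding="utf-8", newline="\n") as f:
        for filename, label in labels.items():
            f.write(f"{filename}\t{label}\n")
    return labels_file


def read_labels(labels_file):
    """Parse labels.txt back into a {filename: label} dictionary."""
    labels = {}
    with Path(labels_file).open(encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the first tab only, so a label may itself contain tabs.
            filename, label = line.split("\t", 1)
            labels[filename] = label
    return labels
```

Splitting on the first tab only keeps any later tab characters as part of the label rather than discarding them.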
