Convert AIHub OCR format data to training-datasets-splitter format data.
- AIHub OCR: Korean text images
- training-datasets-splitter: splits a dataset into 'training'/'validation'/'test' subsets.
Usage:
(venv) $ python3 convert.py \
--input_path ./input \
--label_file ./input/labels.json \
--output_path ./output
For the 'Text in the Wild' dataset, use convert_textinthewild.py instead:
(venv) $ python3 convert_textinthewild.py \
--input_path ./input \
--label_file ./input/labels.json \
--output_path ./output
The structure of the input data folder is as below.
- Input: AIHub OCR format data
/input
├── label_info.json
├── /group1
│ # [id].[ext]
│ ├── 0000000001.png
│ ├── 0000000002.png
│ ├── 0000000003.png
│ └── ...
│
├── /group2
└── ...
For the 'label_info.json' file structure, refer to the AIHub OCR documentation.
The structure of the output data folder is as below.
- Output: for use in training-datasets-splitter project.
/output
├── /group1
│ # [filename].[ext]
│ ├── 0000000001.png
│ ├── 0000000002.png
│ ├── ...
│ └── labels.txt
│
├── /group2
└── ...
- labels.txt
# {filename}\t{label}\n
0000000001.png abcd
0000000002.png efgh
0000000003.png ijkl
...