Convert AIHub OCR format data to training-datasets-splitter format data.
- AIHub OCR: Korean text images
- training-datasets-splitter: splits a dataset into 'training'/'validation'/'test' subsets.
Usage:
(venv) $ python3 convert.py \
--input_path ./input \
--label_file ./input/labels.json \
--output_path ./output
For the 'Text in the Wild' dataset, use convert_textinthewild.py instead:
(venv) $ python3 convert_textinthewild.py \
--input_path ./input \
--label_file ./input/labels.json \
--output_path ./output
The structure of the input data folder is as below.
- Input: AIHub OCR format data
/input
├── label_info.json
├── /group1
│ # [id].[ext]
│ ├── 0000000001.png
│ ├── 0000000002.png
│ ├── 0000000003.png
│ └── ...
│
├── /group2
└── ...
For the 'label_info.json' file structure, refer to the AIHub OCR documentation.
The structure of the output data folder is as below.
- Output: for use in training-datasets-splitter project.
/output
├── /group1
│ # [filename].[ext]
│ ├── 0000000001.png
│ ├── 0000000002.png
│ ├── ...
│ └── labels.txt
│
├── /group2
└── ...
- labels.txt
# {filename}\t{label}\n
0000000001.png abcd
0000000002.png efgh
0000000003.png ijkl
...