# Cuneiform OCR Data Preprocessing for the eBL Project

https://www.ebl.lmu.de/, https://github.com/ElectronicBabylonianLiterature

Data and code are part of the paper "Sign Detection for Cuneiform Tablets" by Yunus Cobanoglu, Luis Sáenz, Ilya Khait, and Enrique Jiménez. Please contact us for access to the data on Zenodo and to the paper, as it is currently under review. See https://github.com/ElectronicBabylonianLiterature/cuneiform-ocr/blob/main/README.md for an overview and general information about all repositories associated with the paper.
## Installation

- requirements.txt (optionally includes opencv-python)

```sh
pip3 install torch=="2.0.1" torchvision --index-url https://download.pytorch.org/whl/cpu
pip install -U openmim
mim install "mmocr==1.0.0rc5"
mim install "mmcv==2.0.0"
```

It is important to use exactly this mmocr version, because `prepare_data.py` won't work in newer versions (the DATA_PARSERS are not backward compatible).

Make sure `PYTHONPATH` is set to the root of the repository.
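A quick way to confirm that the pinned versions resolved correctly (a minimal sanity check, not one of the repository's scripts):

```python
# Print the installed versions of the pinned dependencies.
import torch
import mmcv
import mmocr

print(torch.__version__)   # expected: 2.0.1 (with a +cpu suffix for the CPU wheel)
print(mmcv.__version__)    # expected: 2.0.0
print(mmocr.__version__)   # expected: 1.0.0rc5
```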
See the explanatory video here.
## Data

The data was fetched from our API: https://github.com/ElectronicBabylonianLiterature/ebl-api/blob/master/ebl/fragmentarium/retrieve_annotations.py

Download raw-data and processed-data according to the instructions below. This code uses raw-data and processed-data and outputs the data ready for training (i.e. in icdar2015 and coco2017 format); see ready-for-training.tar.gz on Zenodo.

### Directory Structure
```
data
    processed-data
        data-coco
        data-icdar2015
        detection
            ...
        classification
            data (after gather_all.py)
                ...
    raw-data
        ebl
        heidelberg
        jooch
        urschrei-CDP
```

## Data Preparation (Detection)
1. Preprocess the Heidelberg data; all details in `cuneiform_ocr_data/heidelberg/README.md`.
2. eBL (our) data is in `data/raw-data/ebl` (it is generally better to create the test set from the eBL data because its quality is better).
    - 2.1. Run `extract_contours.py` with `EXTRACT_AUTMOATICALLY=False` on `data/raw-data/ebl/detection`.
    - 2.2. Run `display_bboxes.py` and use the keys to delete all images which are not of good quality.
3. Run `select_test_set.py`, which will select 50 random images from `data/processed-data/ebl/ebl-detection-extracted-deleted` (currently there is no option to create a validation set because of the small size of the dataset).
4. `data/processed-data/ebl/ebl-detection-extracted-test` has a .txt file with the names of all images in the test set (this will be necessary to create the train/test split for classification later).
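A minimal sketch of what this selection step amounts to (the actual logic lives in `select_test_set.py`; the image extension and the name of the .txt file are assumptions here):

```python
import random
import shutil
from pathlib import Path

src = Path("data/processed-data/ebl/ebl-detection-extracted-deleted")
test_dir = Path("data/processed-data/ebl/ebl-detection-extracted-test")
test_dir.mkdir(parents=True, exist_ok=True)

# Pick 50 random images for the test set.
images = sorted(src.glob("*.jpg"))  # extension is an assumption
test_images = random.sample(images, 50)
for img in test_images:
    shutil.copy(img, test_dir / img.name)

# Record the test-set image names; the classification split needs them later.
(test_dir / "test_images.txt").write_text(  # file name is an assumption
    "\n".join(img.name for img in test_images)
)
```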
5. Now merge `data/processed-data/heidelberg/heidelberg-extracted-deleted` and `ebl-detection-extracted-train`, which will be your train set (see `data/processed-data/detection`; around 295 train and 50 test instances).
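Merging is a plain copy of both folders into one place, for example (the `train` subfolder name is an assumption):

```python
import shutil
from pathlib import Path

train_dir = Path("data/processed-data/detection/train")  # name is an assumption
train_dir.mkdir(parents=True, exist_ok=True)

# Copy both sources into one train directory (~295 instances in total).
for src in (
    Path("data/processed-data/heidelberg/heidelberg-extracted-deleted"),
    Path("data/processed-data/ebl/ebl-detection-extracted-train"),
):
    for f in src.iterdir():
        if f.is_file():
            shutil.copy(f, train_dir / f.name)
```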
6. Optionally: create an ICDAR2015-style dataset using `convert_to_icdar2015.py`.
7. Optionally: create a COCO-style dataset using `convert_to_coco.py` (it will create only a COCO-style test set).
### Data Format

Image `P3310-0.jpg` comes with `gt_P3310-0.txt`. The ground truth contains top-left x, top-left y, width, height, and the sign. A sign followed by `?` means it is partially broken. Unclear signs have the value `UnclearSign`.

Example: `0,0,10,10,KUR`
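For illustration, a hypothetical parser for this format, plus the corresponding ICDAR2015-style line (four corner points clockwise from top-left, followed by the transcription); `parse_gt_line` and `to_icdar2015` are not repository functions:

```python
def parse_gt_line(line: str):
    """Parse 'x,y,width,height,sign' as described above."""
    x, y, w, h, sign = line.strip().split(",")
    partially_broken = sign.endswith("?")
    sign = sign.rstrip("?")
    unclear = sign == "UnclearSign"
    return int(x), int(y), int(w), int(h), sign, partially_broken, unclear

def to_icdar2015(line: str) -> str:
    """Rewrite an x,y,w,h box as four corner points plus transcription."""
    x, y, w, h, sign, *_ = parse_gt_line(line)
    return f"{x},{y},{x + w},{y},{x + w},{y + h},{x},{y + h},{sign}"

assert parse_gt_line("0,0,10,10,KUR") == (0, 0, 10, 10, "KUR", False, False)
assert to_icdar2015("0,0,10,10,KUR") == "0,0,10,0,10,10,0,10,KUR"
```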
## Data Preparation (Classification)

1. Fetch sign images from https://labasi.acdh.oeaw.ac.at/ using their API in `cuneiform_ocr_data/labasi`.
2. Run `classification/cdp/main.py`, which will map the data using our ABZ sign mapping; some signs can't be mapped (this should be checked by an Assyriologist for correctness).
3. Run `classification/jooch/main.py`, which will map the data using our ABZ sign mapping; some signs can't be mapped (this should be checked by an Assyriologist for correctness).
4. Merge `data/processed-data/heidelberg/heidelberg` and `data/raw-data/ebl/ebl-classification` into `data/processed-data/ebl+heidelberg-classification`.
5. Split `data/processed-data/ebl+heidelberg-classification` into `data/processed-data/ebl+heidelberg-classification-train` and `data/processed-data/ebl+heidelberg-classification-test` by copying all files which are part of the detection test set, using the script `move_test_set_for_classification.py`.
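A minimal sketch of this split (the actual logic is in `move_test_set_for_classification.py`; the test-set file name is the same assumption as in the detection steps above):

```python
import shutil
from pathlib import Path

src = Path("data/processed-data/ebl+heidelberg-classification")
train = Path("data/processed-data/ebl+heidelberg-classification-train")
test = Path("data/processed-data/ebl+heidelberg-classification-test")
train.mkdir(parents=True, exist_ok=True)
test.mkdir(parents=True, exist_ok=True)

# Image names of the detection test set, recorded earlier.
test_names = set(
    Path("data/processed-data/ebl/ebl-detection-extracted-test/test_images.txt")
    .read_text()
    .splitlines()
)

# Everything in the detection test set goes to -test, the rest to -train.
for f in src.iterdir():
    if f.is_file():
        shutil.copy(f, (test if f.name in test_names else train) / f.name)
```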
6. Run `crop_signs.py` on `data/processed-data/ebl+heidelberg-classification-train` and `data/processed-data/ebl+heidelberg-classification-test`. You can modify `crop_signs.py` to include/exclude partially broken signs or `UnclearSign`s (a sketch of this cropping appears after this list).
7. `data/processed-data/classification` should contain Cuneiform Dataset JOOCH, `ebl+heidelberg/ebl+heidelberg-train`, `ebl+heidelberg/ebl+heidelberg-test`, labasi, and `urschrei-CDP-processed`.
8. `gather_all.py` will gather and finalize the format for training/testing of all the folders from step 7 (it will create the `cuneiform_ocr_data/classification/data` directory with a `classes.txt` that lists all classes used for training/testing).
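A minimal sketch of the cropping in step 6, based on the ground-truth format above and assuming one output folder per sign class (the function name and layout are illustrative, not the repository's exact implementation):

```python
import cv2
from pathlib import Path

def crop_signs(image_path: Path, gt_path: Path, out_dir: Path,
               include_partially_broken: bool = False,
               include_unclear: bool = False) -> None:
    """Cut every annotated sign out of the tablet photo and store it
    in a folder named after its sign class."""
    img = cv2.imread(str(image_path))
    for i, line in enumerate(gt_path.read_text().splitlines()):
        x, y, w, h, sign = line.strip().split(",")
        if sign.endswith("?") and not include_partially_broken:
            continue  # skip partially broken signs
        if sign == "UnclearSign" and not include_unclear:
            continue  # skip unclear signs
        x, y, w, h = int(x), int(y), int(w), int(h)
        class_dir = out_dir / sign.rstrip("?")
        class_dir.mkdir(parents=True, exist_ok=True)
        cv2.imwrite(str(class_dir / f"{image_path.stem}_{i}.jpg"),
                    img[y:y + h, x:x + w])
```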
## Details

- Use `move_test_set_for_classification.py` to extract all images belonging to the detection test set for classification.
- Images are cropped from the LMU and Heidelberg data using `crop_signs.py` and converted to the ABZ sign list via the `ebl.txt` mapping from OraccGlobalSignList/MZL to ABZ number. Partially broken and unclear signs can be included/excluded via a parameter in the script.
- Images from CDP (urschrei-cdp) are renamed using the mapping from the urschrei repo https://github.com/urschrei/CDP/csvs (see `cuneiform_ocr/preprocessing_cdp`):
  - Images are renamed with `rename_to_mzl.py`.
  - Images are mapped via the urschrei-cdp `corrected_instances_forimport.xlsx` and a custom mapping via `convert_cdp_and_jooch.py`.
- Cuneiform JOOCH images are currently not used due to their bad quality.
- The Labasi project is scraped with `labasi/crawl_labasi_page.py` (this can take very long, multiple hours with interruptions); images are renamed manually to fit the `ebl.txt` mapping.
## Data Sources

- Deep learning of cuneiform sign detection with weak supervision using transliteration alignment: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0243039
  - Annotated tablets (75 tablets): https://compvis.github.io/cuneiform-sign-detection-dataset/ -> Heidelberg data
- Towards Query-by-eXpression Retrieval of Cuneiform Signs: https://patrec.cs.tu-dortmund.de/pubs/papers/Rusakov2020-TQX
  - -> JOOCH dataset: https://graphics-data.cs.tu-dortmund.de/docs/publications/cuneiform/
- Labasi Project: https://labasi.acdh.oeaw.ac.at/
- CDP Project: https://github.com/urschrei/CDP
- LMU: https://www.ebl.lmu.de/