ElectronicBabylonianLiterature/cuneiform-ocr-data


Cuneiform OCR Data Preprocessing for the eBL Project (https://www.ebl.lmu.de/, https://github.com/ElectronicBabylonianLiterature)

The data and code are part of the paper "Sign Detection for Cuneiform Tablets" by Yunus Cobanoglu, Luis Sáenz, Ilya Khait, and Enrique Jiménez. Please contact us for access to the data (Zenodo DOI) and the paper, as it is currently under review. See https://github.com/ElectronicBabylonianLiterature/cuneiform-ocr/blob/main/README.md for an overview and general information on all repositories associated with the paper.

Installation

  • Install the dependencies from requirements.txt (optionally includes opencv-python)
  • pip3 install torch=="2.0.1" torchvision --index-url https://download.pytorch.org/whl/cpu
  • pip install -U openmim
  • mim install "mmocr==1.0.0rc5" (it is important to use this exact version because prepare_data.py won't work with newer versions; the DATA_PARSERS are not backward compatible)
  • mim install "mmcv==2.0.0"

Make sure PYTHONPATH is set to the root of the repository.
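As a quick sanity check that the pinned versions are active (a minimal sketch; it only assumes the packages above installed cleanly):

    # Hedged sanity check of the pinned versions from the install steps.
    import torch
    import mmcv
    import mmocr

    print(torch.__version__)  # expect 2.0.1 (possibly with a +cpu suffix)
    print(mmcv.__version__)   # expect 2.0.0
    print(mmocr.__version__)  # expect 1.0.0rc5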

See the explanatory video here.

Data

The data was fetched from our API: https://github.com/ElectronicBabylonianLiterature/ebl-api/blob/master/ebl/fragmentarium/retrieve_annotations.py

Download raw-data and processed-data from the Zenodo DOI according to the instructions below. This code uses raw-data and processed-data and outputs the data in ready-for-training form, i.e. ICDAR2015 and COCO2017 format (see ready-for-training.tar.gz on Zenodo).

Directory structure:

data
  processed-data
    data-coco
    data-icdar2015
    detection
    ...
    classification
      data (after gather_all.py)
      ...
  raw-data
    ebl
    heidelberg
    jooch
    urschrei-CDP

Data Preprocessing for Text Detection (Predicting Bounding Boxes Only)

  1. Preprocess the Heidelberg data; all details are in cuneiform_ocr_data/heidelberg/README.md

  2. Process the eBL (our) data in data/raw-data/ebl (it is generally better to create the test set from eBL data because its quality is higher)

    2.1. Run extract_contours.py with EXTRACT_AUTOMATICALLY=False on data/raw-data/ebl/detection

    2.2. Run display_bboxes.py and use the keyboard controls to delete all images that are not of good quality

  3. Run select_test_set.py, which will randomly select 50 images from data/processed-data/ebl/ebl-detection-extracted-deleted (currently there is no option to create a validation set because of the small size of the dataset); see the sketch after this list

  4. data/processed-data/ebl/ebl-detection-extracted-test contains a .txt file with the names of all images in the test set (this will be necessary to create the train/test split for classification later)

  5. Now merge data/processed-data/heidelberg/heidelberg-extracted-deleted and ebl-detection-extracted-train; this will be your train set (see data/processed-data/detection, around 295 train and 50 test instances).

  6. Optionally: create an ICDAR2015-style dataset using convert_to_icdar2015.py

  7. Optionally: create a COCO-style dataset using convert_to_coco.py (it will create only a test set in COCO style)
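For orientation, here is a minimal sketch of the random test-set selection in step 3; select_test_set.py is the authoritative implementation, and the glob pattern and annotation handling below are assumptions:

    # Hedged sketch of the random 50-image test split from step 3; the real
    # logic lives in select_test_set.py and may differ in details.
    import random
    import shutil
    from pathlib import Path

    src = Path("data/processed-data/ebl/ebl-detection-extracted-deleted")
    test_dir = Path("data/processed-data/ebl/ebl-detection-extracted-test")
    test_dir.mkdir(parents=True, exist_ok=True)

    images = sorted(src.glob("*.jpg"))  # assumed image extension
    random.seed(0)  # make the selection reproducible
    for image in random.sample(images, 50):
        shutil.copy(image, test_dir / image.name)
        # the matching gt_<stem>.txt annotation would be copied alongside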

Data Format

Image: P3310-0.jpg, with ground truth gt_P3310-0.txt.

Each ground-truth line contains top-left x, top-left y, width, height, and the sign.

A sign followed by ? means it is partially broken. Unclear signs have the value 'UnclearSign'.

Example: 0,0,10,10,KUR
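A minimal parsing sketch for this format (the file name is the example above; UTF-8 encoding is an assumption):

    # Parse one ground-truth file in the "x,y,width,height,sign" format above.
    from pathlib import Path

    def parse_ground_truth(path):
        boxes = []
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            x, y, w, h, sign = line.split(",", 4)
            boxes.append({
                "bbox": (int(x), int(y), int(w), int(h)),  # top-left x, y, width, height
                "sign": sign.rstrip("?"),
                "partially_broken": sign.endswith("?"),
                "unclear": sign == "UnclearSign",
            })
        return boxes

    print(parse_ground_truth("gt_P3310-0.txt"))  # [{'bbox': (0, 0, 10, 10), 'sign': 'KUR', ...}]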

Data Preprocessing for Image Classification

  1. Fetch sign images from https://labasi.acdh.oeaw.ac.at/ using their API (see cuneiform_ocr_data/labasi)
  2. Run classification/cdp/main.py, which will map the data using our ABZ sign mapping; some signs can't be mapped (this should be checked by an Assyriologist for correctness)
  3. Run classification/jooch/main.py, which will map the data using our ABZ sign mapping; some signs can't be mapped (this should be checked by an Assyriologist for correctness)
  4. Merge data/processed-data/heidelberg/heidelberg and data/raw-data/ebl/ebl-classification into data/processed-data/ebl+heidelberg-classification
  5. Split data/processed-data/ebl+heidelberg-classification into data/processed-data/ebl+heidelberg-classification-train and data/processed-data/ebl+heidelberg-classification-test by copying all files which are part of the detection test set, using the script move_test_set_for_classification.py
  6. Run crop_signs.py on data/processed-data/ebl+heidelberg-classification-train and data/processed-data/ebl+heidelberg-classification-test; you can modify crop_signs.py to include/exclude partially broken or unclear signs (see the sketch after this list)
  7. data/processed-data/classification should contain Cuneiform Dataset JOOCH, ebl+heidelberg/ebl+heidelberg-train, ebl+heidelberg/ebl+heidelberg-test, labasi, and urschrei-CDP-processed
  8. gather_all.py will gather and finalize the format for training/testing of all the folders from step 7 (it will create the "cuneiform_ocr_data/classification/data" directory with a classes.txt containing all classes used for training/testing)
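As a rough illustration of step 6, cropping one annotated sign out of a tablet image could look like the sketch below; crop_signs.py is the authoritative implementation, and the file names here are hypothetical:

    # Hedged sketch of cropping a single annotated sign; crop_signs.py does
    # this for every bounding box of every image.
    import cv2  # opencv-python, optionally included in requirements.txt

    image = cv2.imread("P3310-0.jpg")       # hypothetical tablet photo
    x, y, w, h = 0, 0, 10, 10               # bbox from the ground-truth file
    crop = image[y:y + h, x:x + w]          # rows are y, columns are x
    cv2.imwrite("KUR_P3310-0_0.png", crop)  # hypothetical per-sign output name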

Data Preprocessing for Image (Sign) Classification (Details)

  1. Use move_test_set_for_classification.py to extract all images belonging to the detection test set for classification
  2. Images from LMU and Heidelberg are cropped using crop_signs.py and converted to the ABZ sign list via the ebl.txt mapping from the Oracc Global Sign List / MZL to ABZ numbers
    • Partially broken and unclear signs can be included or excluded via a parameter in the script
  3. Images from CDP (urschrei-CDP) are renamed using the mapping from the urschrei repo https://github.com/urschrei/CDP/csvs (see cuneiform_ocr/preprocessing_cdp); a sketch follows after this list
    • Images are renamed with rename_to_mzl.py
    • Images are mapped via the urschrei-CDP corrected_instances_forimport.xlsx and a custom mapping via convert_cdp_and_jooch.py
  4. Cuneiform JOOCH images are currently not used due to their poor quality
  5. The LaBaSi project is scraped with labasi/crawl_labasi_page.py (this can take very long, multiple hours, with interruptions) and the images are renamed manually to fit the ebl.txt mapping
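A heavily hedged sketch of the MZL-to-ABZ renaming in step 3: the real mapping comes from ebl.txt and the urschrei-CDP tables, and both the dictionary entries and the filename convention below are invented for illustration.

    # Illustrative MZL -> ABZ renaming; the entries and filename layout are NOT
    # the real mapping, which lives in ebl.txt / corrected_instances_forimport.xlsx.
    from pathlib import Path

    MZL_TO_ABZ = {"MZL737": "ABZ1", "MZL748": "ABZ2"}  # invented example entries

    for image in Path("data/raw-data/urschrei-CDP").glob("*.png"):
        mzl = image.stem.split("_")[0]  # assumes the MZL number leads the filename
        abz = MZL_TO_ABZ.get(mzl)
        if abz:
            image.rename(image.with_name(image.name.replace(mzl, abz, 1)))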

Acknowledgements / Citation
