ElectronicBabylonianLiterature/cuneiform-ocr-data


Cuneiform OCR Data Preprocessing for the eBL Project (https://www.ebl.lmu.de/, https://github.com/ElectronicBabylonianLiterature)

The data and code are part of the paper "Sign Detection for Cuneiform Tablets" by Yunus Cobanoglu, Luis Sáenz, Ilya Khait, and Enrique Jiménez. Please contact us for access to the data (Zenodo DOI) and the paper, as it is currently under review. See https://github.com/ElectronicBabylonianLiterature/cuneiform-ocr/blob/main/README.md for an overview and general information on all repositories associated with the paper.

Installation

  • Install the dependencies from requirements.txt (optionally includes opencv-python)
  • pip3 install torch=="2.0.1" torchvision --index-url https://download.pytorch.org/whl/cpu
  • pip install -U openmim
  • mim install "mmocr==1.0.0rc5" (it is important to use this exact version because prepare_data.py won't work with newer versions; the DATA_PARSERS are not backward compatible)
  • mim install "mmcv==2.0.0"

Make sure PYTHONPATH is set to the root of the repository.
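As a quick sanity check that the pinned versions are active (a minimal sketch; it only assumes the packages above installed cleanly):

    # Hedged sanity check of the pinned versions from the install steps.
    import torch
    import mmcv
    import mmocr

    print(torch.__version__)  # expect 2.0.1 (possibly with a +cpu suffix)
    print(mmcv.__version__)   # expect 2.0.0
    print(mmocr.__version__)  # expect 1.0.0rc5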

See the explanatory video here.

Data

The data was fetched from our API: https://github.com/ElectronicBabylonianLiterature/ebl-api/blob/master/ebl/fragmentarium/retrieve_annotations.py

Download raw-data and processed-data from the Zenodo DOI according to the instructions below. This code uses raw-data and processed-data and outputs the data in ready-for-training form, i.e. ICDAR2015 and COCO2017 format (see ready-for-training.tar.gz on Zenodo).

Directory structure:

data
  processed-data
    data-coco
    data-icdar2015
    detection
    ...
    classification
      data (after gather_all.py)
      ...
  raw-data
    ebl
    heidelberg
    jooch
    urschrei-CDP

Data Preprocessing for Text Detection (Predicting Bounding Boxes Only)

  1. Preprocess the Heidelberg data; all details are in cuneiform_ocr_data/heidelberg/README.md

  2. Process the eBL (our) data in data/raw-data/ebl (it is generally better to create the test set from eBL data because its quality is higher)

    2.1. Run extract_contours.py with EXTRACT_AUTOMATICALLY=False on data/raw-data/ebl/detection

    2.2. Run display_bboxes.py and use the keyboard controls to delete all images that are not of good quality

  3. Run select_test_set.py, which will randomly select 50 images from data/processed-data/ebl/ebl-detection-extracted-deleted (currently there is no option to create a validation set because of the small size of the dataset); see the sketch after this list

  4. data/processed-data/ebl/ebl-detection-extracted-test contains a .txt file with the names of all images in the test set (this will be necessary to create the train/test split for classification later)

  5. Now merge data/processed-data/heidelberg/heidelberg-extracted-deleted and ebl-detection-extracted-train; this will be your train set (see data/processed-data/detection, around 295 train and 50 test instances).

  6. Optionally: create an ICDAR2015-style dataset using convert_to_icdar2015.py

  7. Optionally: create a COCO-style dataset using convert_to_coco.py (it will create only a test set in COCO style)
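For orientation, here is a minimal sketch of the random test-set selection in step 3; select_test_set.py is the authoritative implementation, and the glob pattern and annotation handling below are assumptions:

    # Hedged sketch of the random 50-image test split from step 3; the real
    # logic lives in select_test_set.py and may differ in details.
    import random
    import shutil
    from pathlib import Path

    src = Path("data/processed-data/ebl/ebl-detection-extracted-deleted")
    test_dir = Path("data/processed-data/ebl/ebl-detection-extracted-test")
    test_dir.mkdir(parents=True, exist_ok=True)

    images = sorted(src.glob("*.jpg"))  # assumed image extension
    random.seed(0)  # make the selection reproducible
    for image in random.sample(images, 50):
        shutil.copy(image, test_dir / image.name)
        # the matching gt_<stem>.txt annotation would be copied alongside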

Data Format

Image: P3310-0.jpg, with ground truth gt_P3310-0.txt.

Each ground-truth line contains top-left x, top-left y, width, height, and the sign.

A sign followed by ? means it is partially broken. Unclear signs have the value 'UnclearSign'.

Example: 0,0,10,10,KUR
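A minimal parsing sketch for this format (the file name is the example above; UTF-8 encoding is an assumption):

    # Parse one ground-truth file in the "x,y,width,height,sign" format above.
    from pathlib import Path

    def parse_ground_truth(path):
        boxes = []
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            x, y, w, h, sign = line.split(",", 4)
            boxes.append({
                "bbox": (int(x), int(y), int(w), int(h)),  # top-left x, y, width, height
                "sign": sign.rstrip("?"),
                "partially_broken": sign.endswith("?"),
                "unclear": sign == "UnclearSign",
            })
        return boxes

    print(parse_ground_truth("gt_P3310-0.txt"))  # [{'bbox': (0, 0, 10, 10), 'sign': 'KUR', ...}]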

Data Preprocessing for Image Classification

  1. Fetch sign images from https://labasi.acdh.oeaw.ac.at/ using their API (see cuneiform_ocr_data/labasi)
  2. Run classification/cdp/main.py, which will map the data using our ABZ sign mapping; some signs can't be mapped (this should be checked by an Assyriologist for correctness)
  3. Run classification/jooch/main.py, which will map the data using our ABZ sign mapping; some signs can't be mapped (this should be checked by an Assyriologist for correctness)
  4. Merge data/processed-data/heidelberg/heidelberg and data/raw-data/ebl/ebl-classification into data/processed-data/ebl+heidelberg-classification
  5. Split data/processed-data/ebl+heidelberg-classification into data/processed-data/ebl+heidelberg-classification-train and data/processed-data/ebl+heidelberg-classification-test by copying all files which are part of the detection test set, using the script move_test_set_for_classification.py
  6. Run crop_signs.py on data/processed-data/ebl+heidelberg-classification-train and data/processed-data/ebl+heidelberg-classification-test; you can modify crop_signs.py to include/exclude partially broken or unclear signs (see the sketch after this list)
  7. data/processed-data/classification should contain Cuneiform Dataset JOOCH, ebl+heidelberg/ebl+heidelberg-train, ebl+heidelberg/ebl+heidelberg-test, labasi, and urschrei-CDP-processed
  8. gather_all.py will gather and finalize the format for training/testing of all the folders from step 7 (it will create the "cuneiform_ocr_data/classification/data" directory with a classes.txt containing all classes used for training/testing)
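As a rough illustration of step 6, cropping one annotated sign out of a tablet image could look like the sketch below; crop_signs.py is the authoritative implementation, and the file names here are hypothetical:

    # Hedged sketch of cropping a single annotated sign; crop_signs.py does
    # this for every bounding box of every image.
    import cv2  # opencv-python, optionally included in requirements.txt

    image = cv2.imread("P3310-0.jpg")       # hypothetical tablet photo
    x, y, w, h = 0, 0, 10, 10               # bbox from the ground-truth file
    crop = image[y:y + h, x:x + w]          # rows are y, columns are x
    cv2.imwrite("KUR_P3310-0_0.png", crop)  # hypothetical per-sign output name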

Data Preprocessing for Image (Sign) Classification (Details)

  1. Use move_test_set_for_classification.py to extract all images belonging to the detection test set for classification
  2. Images from LMU and Heidelberg are cropped using crop_signs.py and converted to the ABZ sign list via the ebl.txt mapping from the Oracc Global Sign List / MZL to ABZ numbers
    • Partially broken and unclear signs can be included or excluded via a parameter in the script
  3. Images from CDP (urschrei-CDP) are renamed using the mapping from the urschrei repo https://github.com/urschrei/CDP/csvs (see cuneiform_ocr/preprocessing_cdp); a sketch follows after this list
    • Images are renamed with rename_to_mzl.py
    • Images are mapped via the urschrei-CDP corrected_instances_forimport.xlsx and a custom mapping via convert_cdp_and_jooch.py
  4. Cuneiform JOOCH images are currently not used due to their poor quality
  5. The LaBaSi project is scraped with labasi/crawl_labasi_page.py (this can take very long, multiple hours, with interruptions) and the images are renamed manually to fit the ebl.txt mapping
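A heavily hedged sketch of the MZL-to-ABZ renaming in step 3: the real mapping comes from ebl.txt and the urschrei-CDP tables, and both the dictionary entries and the filename convention below are invented for illustration.

    # Illustrative MZL -> ABZ renaming; the entries and filename layout are NOT
    # the real mapping, which lives in ebl.txt / corrected_instances_forimport.xlsx.
    from pathlib import Path

    MZL_TO_ABZ = {"MZL737": "ABZ1", "MZL748": "ABZ2"}  # invented example entries

    for image in Path("data/raw-data/urschrei-CDP").glob("*.png"):
        mzl = image.stem.split("_")[0]  # assumes the MZL number leads the filename
        abz = MZL_TO_ABZ.get(mzl)
        if abz:
            image.rename(image.with_name(image.name.replace(mzl, abz, 1)))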

Acknowledgements / Citation
