This assumes that the `.docx` files have been converted to UTF-8 encoded `.txt` and the `.TIF` images to `.png`, after a slight deskewing (done on the fly with ScanTailor).

In [31]:
from ajmc.commons.variables import DRIVE_BASE_DIR
persee_sample_dir = DRIVE_BASE_DIR / 'data' / 'persee_sample'
ocr_origine_dir = persee_sample_dir / 'ocr_origine'
transcription_dir = persee_sample_dir / 'transcriptions'
png_dir = persee_sample_dir / 'png'


# OCR

## Quality assessment

We start by assessing the average quality of the original OCR against the ground-truth transcription.

In [32]:
# Get the ocr results and the ground truth transcription
page_ids = [path.stem for path in png_dir.glob('*.png')]
ocr_pages = [(ocr_origine_dir / (page_id+'_OCR_origine.txt')).read_text(encoding='utf-8') for page_id in page_ids]
gt_pages = [(transcription_dir / (page_id+'_transcription.txt')).read_text(encoding='utf-8') for page_id in page_ids]



In [33]:
# To avoid whitespace-related issues, we collapse every run of whitespace chars into a single space
import re
ocr_pages = [re.sub(r'\s+', ' ', page) for page in ocr_pages]
gt_pages = [re.sub(r'\s+', ' ', page) for page in gt_pages]
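
As a small illustration of what this normalisation does, here is a minimal sketch on a made-up string. The helper name `normalize_whitespace` and the sample text are assumptions introduced for the example only; the regular expression is the one used in the cell above.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r'\s+', ' ', text)

# Hypothetical sample: double space, newline + tab, and a non-breaking space
sample = "first  line\n\tsecond\u00a0line"
print(normalize_whitespace(sample))
```

On Python 3 strings, `\s` also matches Unicode whitespace such as the non-breaking space, so differences of that kind between OCR output and transcription should no longer count as errors.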

In [34]:
# We now compute the average quality of the ocr
from ajmc.ocr.evaluation import line_based_evaluation

# Note that we will consider a page as a single (long) line since we do not have the line segmentation
error_record, editops_record, results = line_based_evaluation(gt_lines=gt_pages, ocr_lines=ocr_pages, normalize=False)

print(results)

INFO - ajmc.ocr.evaluation -   Character Error Rate: 0.021
INFO - ajmc.ocr.evaluation -   Word Error Rate: 0.094
INFO - ajmc.ocr.evaluation -   Greek Character Error Rate: 0.121
INFO - ajmc.ocr.evaluation -   Latin Character Error Rate: 0.011
INFO - ajmc.ocr.evaluation -   Numeral Character Error Rate: 0.004


{'chars_ER': 0.021, 'words_ER': 0.094, 'greek_chars_ER': 0.121, 'latin_chars_ER': 0.011, 'numeral_chars_ER': 0.004, 'punctuation_chars_ER': 0.006}
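
To make these figures easier to read, here is a minimal sketch of how a character or word error rate is commonly defined: the edit distance between ground truth and OCR, divided by the length of the ground truth. This is an assumption about the general technique, not the actual implementation of `line_based_evaluation`, which may normalise or aggregate differently; `levenshtein` and `error_rate` are hypothetical helper names.

```python
def levenshtein(a, b) -> int:
    """Edit distance between two sequences (insert, delete, replace all cost 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # replacement or match
        previous = current
    return previous[-1]

def error_rate(gt, ocr) -> float:
    """Edit distance normalised by the length of the ground truth."""
    return levenshtein(gt, ocr) / max(len(gt), 1)

gt, ocr = "the old text", "the o1d text"
print(error_rate(gt, ocr))                  # character level
print(error_rate(gt.split(), ocr.split()))  # word level
```

The same function covers both levels because it works on any sequence: a string for characters, a list of tokens for words. A single wrong character spoils a whole word, which is one reason a word error rate is typically several times higher than the character error rate.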


In [35]:
# Now let us look at the error distribution
for (op, old, new), count in sorted(editops_record.items(), key=lambda x: x[1], reverse=True):
    print(f'{op} {old} -> {new}: {count}')

delete   -> : 27
replace Ά -> Ἀ: 13
replace ό -> ὸ: 13
replace έ -> ἐ: 11
replace α -> ὰ: 10
delete - -> : 10
replace ι -> ὶ: 8
replace ϊ -> ἴ: 7
delete I -> : 6
replace η -> ῆ: 5
delete 0 -> : 5
replace ε -> ὲ: 5
replace Ι -> |: 5
replace έ -> ὲ: 5
replace ί -> ἰ: 4
delete U -> : 4
delete L -> : 4
delete E -> : 4
delete P -> : 4
insert . -> : 4
replace ά -> ἀ: 4
insert I -> : 4
replace α -> ἀ: 4
insert - -> : 4
replace ύ -> ῦ: 4
replace ώ -> ῶ: 3
replace ί -> ῖ: 3
replace ί -> ἱ: 3
replace r -> I: 3
delete 6 -> : 3
replace α -> ᾶ: 3
replace ύ -> ὑ: 3
replace ώ -> ὠ: 3
replace Έ -> Ἐ: 3
replace ή -> ὴ: 3
replace ή -> ἡ: 3
replace ί -> ὶ: 3
replace ε -> ἔ: 3
replace ΐ -> ῖ: 3
replace ω -> ῶ: 3
replace ύ -> ὐ: 2
replace ' -> ἴ: 2
delete ί -> : 2
replace ή -> ῆ: 2
replace η -> ὴ: 2
replace l -> 1: 2
insert   -> : 2
delete 2 -> : 2
delete ] -> : 2
delete B -> : 2
delete T -> : 2
delete N -> : 2
delete É -> : 2
delete G -> : 2
delete R -> : 2
delete A -> : 2
delete H -> : 2
delete Q -> : 2


We can see that many OCR errors are related to whitespace and diacritics. Above all, the error rate is **much higher for Greek chars** (12.1%) than for Latin chars (1.1%).
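
One way to quantify the diacritics observation is to check, for each `replace` operation, whether both sides reduce to the same base letter once diacritics are stripped. The sketch below assumes that `editops_record` maps `(op, old, new)` tuples to counts, as the loop above suggests; `strip_diacritics` and `diacritic_only_share` are hypothetical helpers, and the sample dictionary only reuses three entries from the listing.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose to NFD and drop combining marks (accents, breathings, diaeresis)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def diacritic_only_share(editops: dict) -> float:
    """Share of 'replace' operations whose two sides differ only in diacritics."""
    replaces = {key: n for key, n in editops.items() if key[0] == 'replace'}
    total = sum(replaces.values())
    diacritic_only = sum(n for (_, old, new), n in replaces.items()
                         if strip_diacritics(old) == strip_diacritics(new))
    return diacritic_only / total if total else 0.0

# Three entries copied from the listing above (assumed key layout: op, old, new)
sample = {('replace', 'Ά', 'Ἀ'): 13, ('replace', 'Ι', '|'): 5, ('delete', '-', ''): 10}
print(diacritic_only_share(sample))
```

Calling `diacritic_only_share(editops_record)` on the full record would give the proportion of substitutions that are pure accent or breathing confusions.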

## Re-OCRing using specialised models

Let us now try to re-OCR the pages using specialised models for Greek and Latin chars. We will use Tesseract 5 with `-l grc_hist+fra`. `grc_hist` is a model trained on historical Greek texts, especially commentaries and groundtruth from OpenGreekAndLatin. It can be found in tesseract's [contrib repo](https://github.com/tesseract-ocr/tessdata_contrib/) and must be placed in the tessdata directory that `TESSDATA_PREFIX` points to (see tesseract's documentation for more info).

We then run the following command:
```bash
cd png
export TESSDATA_PREFIX=/path/to/tessdata/dir/
for i in *.png; do tesseract "$i" "${i%.png}" -l grc_hist+fra; done
mkdir -p ../tesseract/txt
mv *.txt ../tesseract/txt
```

We will now run the same evaluation as before on the new OCR results.

In [42]:
tess_output_dir = persee_sample_dir / 'tesseract' / 'txt_grc_grc_hist'
tess_ocr_pages = [(tess_output_dir / (page_id+'.txt')).read_text(encoding='utf-8') for page_id in page_ids]

tess_ocr_pages = [re.sub(r'\s+', ' ', page) for page in tess_ocr_pages]

error_record, editops_record, results = line_based_evaluation(gt_lines=gt_pages, ocr_lines=tess_ocr_pages, normalize=False)

INFO - ajmc.ocr.evaluation -   Character Error Rate: 0.038
INFO - ajmc.ocr.evaluation -   Word Error Rate: 0.139
INFO - ajmc.ocr.evaluation -   Greek Character Error Rate: 0.083
INFO - ajmc.ocr.evaluation -   Latin Character Error Rate: 0.028
INFO - ajmc.ocr.evaluation -   Numeral Character Error Rate: 0.004
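
To compare the two runs side by side, the rates from both evaluation logs can be put into dictionaries and subtracted. This is only a sketch: the values are copied by hand from the logs above, and `compare` is a hypothetical helper, not part of `ajmc`.

```python
# Error rates copied from the two evaluation logs above
original = {'chars_ER': 0.021, 'words_ER': 0.094, 'greek_chars_ER': 0.121,
            'latin_chars_ER': 0.011, 'numeral_chars_ER': 0.004}
tesseract = {'chars_ER': 0.038, 'words_ER': 0.139, 'greek_chars_ER': 0.083,
             'latin_chars_ER': 0.028, 'numeral_chars_ER': 0.004}

def compare(baseline: dict, candidate: dict) -> dict:
    """Absolute and relative change of each rate, candidate versus baseline."""
    rows = {}
    for metric, base in baseline.items():
        delta = candidate[metric] - base
        rows[metric] = (delta, delta / base if base else float('nan'))
    return rows

for metric, (delta, relative) in compare(original, tesseract).items():
    print(f'{metric:<18} {delta:+.3f} ({relative:+.0%})')
```

A negative change means the Tesseract run makes fewer errors on that metric than the original OCR.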


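We see that the specialised model gives better OCR results for Greek chars (error rate 0.083 vs. 0.121), which is our goal here. This comes at a cost, however: Latin chars (0.028 vs. 0.011) and the overall character and word error rates (0.038 vs. 0.021 and 0.139 vs. 0.094) get worse. Let us now move on to the OLR!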

# OLR 