# Commentary importation pipeline

This notebook goes through all the steps involved in the creation of `CanonicalCommentary`s from OCR outputs.

We will therefore:

1. See how to import an `OcrCommentary` from OCR outputs.
2. See how to optimise this commentary and transform it into a `CanonicalCommentary`.
3. See how to export it to a canonical json format for later use.

## Creating an `OcrCommentary`

`OcrCommentary`s need access to (at least) three kinds of information:
- OCR output files, each of which represents a single page and serves as a basis to create an `OcrPage` object.
- The corresponding images (after which the former are named).
- A `via_project.json` file containing information about the layout.

Using the data provided in `ajmc/data/sample_commentaries`, we can create our first `OcrCommentary`.

In [None]:
from ajmc.text_processing.ocr_classes import OcrCommentary

ocr_commentary = OcrCommentary(id='cu31924087948174',
                               ocr_dir='../data/sample_commentaries/cu31924087948174/ocr/runs/tess_eng_grc/outputs',
                               via_path='../data/sample_commentaries/cu31924087948174/olr/via_project.json',
                               image_dir='../data/sample_commentaries/cu31924087948174/images/png')

Providing all these paths can be cumbersome. `ajmc` therefore has a [systematic directory structure]() which allows us to create a commentary directly from its OCR outputs directory, provided that this directory complies with the project's folder structure. Since the commentaries in `../data/sample_commentaries` are ajmc-compliant, we can simply write:

In [None]:
# Note that our path follows the pattern: '/abspath/to/base_dir/[comm_id]/ocr/runs/[ocr_run]/outputs'
ocr_commentary = OcrCommentary.from_ajmc_structure(
    ocr_dir='../data/sample_commentaries/cu31924087948174/ocr/runs/tess_eng_grc/outputs')
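
As a rough illustration of what such a structure makes possible, the sketch below derives the other paths from the OCR outputs directory alone. It is not ajmc's implementation: `derive_ajmc_paths` is a hypothetical helper, and it only assumes the pattern quoted above plus the `olr/via_project.json` and `images/png` locations used in the first example.

```python
from pathlib import PurePosixPath


def derive_ajmc_paths(ocr_dir: str) -> dict:
    """Hypothetical helper: infer commentary paths from '[base_dir]/[comm_id]/ocr/runs/[ocr_run]/outputs'."""
    outputs = PurePosixPath(ocr_dir)
    ocr_run_dir = outputs.parent
    runs_dir = ocr_run_dir.parent
    ocr_root = runs_dir.parent
    comm_dir = ocr_root.parent
    # Refuse paths that do not match the assumed pattern
    if (outputs.name, runs_dir.name, ocr_root.name) != ('outputs', 'runs', 'ocr'):
        raise ValueError(f'{ocr_dir} does not follow the expected structure')
    return {
        'id': comm_dir.name,
        'ocr_run': ocr_run_dir.name,
        'via_path': str(comm_dir / 'olr' / 'via_project.json'),
        'image_dir': str(comm_dir / 'images' / 'png'),
    }


paths = derive_ajmc_paths('../data/sample_commentaries/cu31924087948174/ocr/runs/tess_eng_grc/outputs')
print(paths['id'], paths['ocr_run'])
print(paths['via_path'])
```

A path that does not end in `ocr/runs/[ocr_run]/outputs` raises a `ValueError` instead of silently producing wrong locations.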

The creation of an `OcrCommentary` will take care of the creation of its pages, lines, regions and words. However, it is also possible to instantiate any of these directly:

In [None]:
from ajmc.text_processing.ocr_classes import OcrPage

page = OcrPage(id='cu31924087948174_0035',
               ocr_path='../data/sample_commentaries/cu31924087948174/ocr/runs/tess_eng_grc/outputs/cu31924087948174_0035.hocr',
               image_path='../data/sample_commentaries/cu31924087948174/images/png/cu31924087948174_0035.png',
               commentary=ocr_commentary)

Note:
    It is not necessary to provide all the arguments shown here. For instance, if you omit `commentary=...`, the object will still be functional, but you won't be able to retrieve commentary-level information, such as the via_project.

## Why should one bother creating `CanonicalCommentary`s?

- TL;DR : Skip to the next section 👇🏼

### The vagaries of OCR

You may ask yourself: what is actually the problem with `OcrCommentary`s? Why should we care about enhancing an `OcrCommentary` in the first place? Well, the problem is not really the object itself but the many inconsistencies and the noise of the OCR outputs it relies on. To cite a few:

1. Empty words or non-words.
2. Crummy, elongated, stretched or shrunken word bounding boxes, or even inverted bounding boxes with negative width and height (true story).
3. Labyrinthine reading order.
4. Single lines spanning multiple columns, multiple lines or side numbers.
5. Diacritics recognized as lines of their own.
6. Crummy, elongated, stretched or shrunken line bounding boxes.
7. ...
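
Several of these problems are easy to detect programmatically. The sketch below flags the first two. The `(text, (x, y, width, height))` word representation and the `is_suspicious_word` helper are assumptions made for illustration, not ajmc structures.

```python
import re


def is_suspicious_word(text: str, bbox: tuple) -> bool:
    """Hypothetical check: flag empty words, non-words and inverted or zero-sized boxes."""
    x, y, width, height = bbox
    if width <= 0 or height <= 0:  # inverted or zero-sized bounding box
        return True
    # Empty strings, whitespace, or strings without a single letter or digit
    return re.search(r'[^\W_]', text) is None


ocr_words = [('Foo', (10, 10, 40, 12)),
             ('', (55, 10, 5, 12)),
             ('~~', (70, 10, 8, 12)),
             ('Bar', (120, 10, -40, -12))]

suspicious = [text for text, bbox in ocr_words if is_suspicious_word(text, bbox)]
print(suspicious)
```

Here `'Bar'` is flagged because of its box rather than its text, which is why text and geometry are checked together.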


### The weakness of xml formats

To this already long, though not exhaustive, list of pitfalls, one should add two other caveats:
- OCR outputs come in different formats (Kraken or Tesseract style `hocr`, `alto`, `pagexml`...).
- Though they differ widely, each of them having set out to create its own [harmonized, overarching standard](https://xkcd.com/927/), these formats all share the same weakness: the nested architecture of xml-like documents. Let me give a simple example. Say we have the following page:

```xml
<xml_page attribute_1="..." attribute_2="...">
    <xml_line attribute_1="..." attribute_2="...">
        <xml_word attribute_1="..." attribute_2="...">Foo</xml_word>
        <xml_word attribute_1="..." attribute_2="...">Bar</xml_word>
    </xml_line>
    <xml_line attribute_1="..." attribute_2="...">
        <xml_word attribute_1="..." attribute_2="...">Zig</xml_word>
        <xml_word attribute_1="..." attribute_2="...">Zag</xml_word>
    </xml_line>
</xml_page>
```
In `xml_page` we have two `xml_line` elements, each of which contains two `xml_word` elements. If this is already a pain to navigate, the most vicious issue is still to come. It appears when you try to overlap different layers of text containers. Say you have a region spanning only the first n words of a line. Should your region be a child of the line? This makes no sense from a global perspective: regions (such as paragraphs) are higher in the hierarchy and should be parents of lines. One could be tempted to create a new line for the region, but then another problem arises: when calling all the lines of a page, should one take the lines from the regions or the lines from the page directly, since they are now different?

If you think this is complicated enough, let me give another trivial example: what about entities (e.g. named entities) that span multiple pages? For instance, say we have an entity that starts with the last two words of the last line of page n and ends with the first word of the main text of page n+1. Retrieving the words of such an entity demands extreme precision and ridiculously complex chunks of code. In pseudo-python, you would end up with something like `my_entity.words = pages[n].children.lines[-1].children.words[-2:] + pages[n+1].children.lines[0].children.words[:1]`.

And this is still a simple case. What if you have a footnote on page n that you don't want to include? What if the first line of page n+1 is actually the page number and not the main text? I let you imagine the kind of recondite code you end up with (`my_entity.words = pages[n].children.find_all("regions", type="main_text")[-1].children.lines[-1].words[-2:] + pages[n+1] ... my_pain.end()`). Not to mention that this code is not even dynamic: a simple change in page numbering, word alignment or region reading order completely ruins the pipeline.

### The advantages of the canonical format

To tackle these issues, we came up with a fairly simple canonical format to store our data. Its philosophy and implementation are strikingly easy to fathom: go horizontal! Instead of having nested and re-nested text containers, we collect a global list of words and map every other text container to a word range. And that's pretty much it. Here's an example:

```json
{
  "words" : [{"text":"Foo", "attribute_1":"...", "attribute_2":"..."},
             {"text":"Bar", "attribute_1": "...", "attribute_2":"..."},
             {"text":"Zig", "attribute_1": "...", "attribute_2":"..."},
             {"text":"Zag", "attribute_1": "...", "attribute_2":"..."}],
  "pages": [{"word_range": [0,3]}],
  "lines": [{"word_range": [0,1]},
            {"word_range": [2,3]}]
}
```
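
To make the mapping concrete, here is a minimal sketch that flattens a nested page like the one shown earlier into this layout. `flatten_page` is a hypothetical helper written for this example, not part of ajmc, and word ranges are assumed to be inclusive, as in the JSON above.

```python
import json
import xml.etree.ElementTree as ET

NESTED_PAGE = """
<xml_page>
    <xml_line><xml_word>Foo</xml_word><xml_word>Bar</xml_word></xml_line>
    <xml_line><xml_word>Zig</xml_word><xml_word>Zag</xml_word></xml_line>
</xml_page>
"""


def flatten_page(xml_string: str) -> dict:
    """Hypothetical helper: build a flat word list plus inclusive word ranges."""
    page = ET.fromstring(xml_string.strip())
    canonical = {'words': [], 'pages': [], 'lines': []}
    for line in page.iter('xml_line'):
        first = len(canonical['words'])  # global index of the line's first word
        canonical['words'].extend({'text': word.text} for word in line.iter('xml_word'))
        canonical['lines'].append({'word_range': [first, len(canonical['words']) - 1]})
    canonical['pages'].append({'word_range': [0, len(canonical['words']) - 1]})
    return canonical


canonical = flatten_page(NESTED_PAGE)
print(json.dumps(canonical, indent=2))
```

The nesting is only read once, at conversion time. Afterwards every container is just a pair of indices into `canonical['words']`.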

This format comes with a lot of advantages:

1. It's a `json`, not an xml. It's easily readable both by humans and machines. You can import it in 2 lines of code and then, well, it's a python `dict`. No weird `bs4` or `etree` objects. They are completely overkill for our purpose and offer nothing that `json`s or `dict`s can't do.
2. It solves the nesting and overlapping problem at once. You can have overlapping, nested and re-nested text containers. The important thing is that they can be accessed **horizontally**, simply by finding the other text containers with an included or overlapping word range. The same goes for getting a text container's words: since word ranges are inclusive, simply call `my_tc.words = words[my_tc.word_range[0]:my_tc.word_range[1] + 1]`.
3. It makes the redundant information of xml files unnecessary: to get a line's bounding box, you simply merge its words' bounding boxes. This allows us to store an entire 400-page commentary in a ~35MB file, as opposed to ~85MB for the corresponding OCR outputs (with no information loss and no optimisation on my side), which transitions well to the next point.
4. It is computationally efficient. See a simple example here:

In [4]:
import time
import re
from ajmc.text_processing.ocr_classes import OcrCommentary
from ajmc.text_processing.canonical_classes import CanonicalCommentary


def time_commentary_operations(commentary_type):
    print(f'Measuring {commentary_type} importation time and manipulation time')

    start_time = time.time()
    if commentary_type == "OcrCommentary":
        commentary = OcrCommentary.from_ajmc_structure('/Users/sven/drive/_AJAX/AjaxMultiCommentary/data/commentaries/commentaries_data/sophoclesplaysa05campgoog/ocr/runs/15n06u_lace_base_sophoclesplaysa05campgoog-2021-05-21-19-46-44-porson-2021-05-21-16-43-56/outputs')
    else:
        commentary = CanonicalCommentary.from_json('/Users/sven/drive/_AJAX/AjaxMultiCommentary/data/commentaries/commentaries_data/sophoclesplaysa05campgoog/canonical/v2/15o09Y_lace_base_sophoclesplaysa05campgoog-2021-05-23-21-38-49-porson-2021-05-23-14-27-27.json')

    commentary.children.words
    print("    Time required by importation and word retrieval: {:.2f}s".format(time.time() - start_time))

    start_time = time.time()
    [l.text for l in commentary.children.lines if re.findall(r'[0-9]', l.text)]
    print("    Time required to retrieve the text lines containing decimals: {:.2f}s\n".format(time.time() - start_time))


time_commentary_operations('OcrCommentary')
time_commentary_operations('CanonicalCommentary')

Measuring OcrCommentary importation time and manipulation time
    Time required by importation and word retrieval: 9.81s
    Time required to retrieve the text lines containing decimals: 2.10s

Measuring CanonicalCommentary importation time and manipulation time
    Time required by importation and word retrieval: 3.07s
    Time required to retrieve the text lines containing decimals: 0.23s



5. It makes it easy to deal with multiple versions of the text, simply by creating new lists of words and mapping text containers to any word range of any list (recall how complicated such an implementation would be in a nested architecture).
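
The horizontal access described in points 2 and 3 can be sketched with plain dicts. Everything here is illustrative: the `bbox` key, the `(x1, y1, x2, y2)` box layout, the helper names and the two-page toy data are assumptions, not ajmc's schema.

```python
def get_words(container: dict, words: list) -> list:
    """Return the words covered by a container's inclusive word range."""
    first, last = container['word_range']
    return words[first:last + 1]


def overlaps(a: dict, b: dict) -> bool:
    """True if two containers share at least one word."""
    return a['word_range'][0] <= b['word_range'][1] and b['word_range'][0] <= a['word_range'][1]


def merged_bbox(container: dict, words: list) -> tuple:
    """Smallest (x1, y1, x2, y2) box enclosing all the words of a container."""
    boxes = [word['bbox'] for word in get_words(container, words)]
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


words = [{'text': 'Foo', 'bbox': (10, 10, 40, 20)},
         {'text': 'Bar', 'bbox': (45, 9, 80, 21)},
         {'text': 'Zig', 'bbox': (10, 10, 38, 20)},
         {'text': 'Zag', 'bbox': (42, 10, 75, 20)}]
pages = [{'word_range': [0, 1]}, {'word_range': [2, 3]}]
entity = {'word_range': [1, 2]}  # starts on the first page, ends on the second

entity_texts = [word['text'] for word in get_words(entity, words)]
entity_pages = [i for i, page in enumerate(pages) if overlaps(entity, page)]
print(entity_texts, entity_pages, merged_bbox(pages[0], words))
```

The cross-page entity that required a long chain of nested lookups earlier is now a single slice, and finding the pages it touches is a comparison of two index pairs.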

## Post-processing OCR outputs

Now, how does this solve the OCR-related issues mentioned above? These are dealt with in post-processing. `OcrCommentary.to_canonical()` therefore launches two operations under the hood:
1. Post-processing OCR.
2. Converting to `CanonicalCommentary`.

Since we already covered the second step, let us briefly go through the first one. Post-processing the OCR aims at harmonizing text, bounding boxes and relations between text containers. It therefore brings a solution to each of the problems listed above:
- It deletes empty words and non-words.
- It adjusts word bounding boxes to their content using contour detection (`cv2.findContours`).
- It adjusts line and region boxes to the minimal box containing their words (for regions, an `_inclusion_threshold` is used, which, set to 0.8, proves to be quite robust).
- It cuts lines according to regions, so that overlapping lines are now chunked.
- It removes empty lines.
- It resets the reading order from the region level downwards (i.e. it orders regions, then lines in each region, then words in each line). The algorithm is also robust. Use `Image.draw_reading_order` to test.

All these operations are performed at page level, using `OcrPage.optimise()`, which is itself called internally by `OcrCommentary.to_canonical()`:

In [None]:
# `from_ajmc_structure` expects the OCR outputs directory (see above), not a single page file
can_commentary = OcrCommentary.from_ajmc_structure(
    ocr_dir='../data/sample_commentaries/cu31924087948174/ocr/runs/tess_eng_grc/outputs').to_canonical()
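
As an illustration of the box-fitting step, the sketch below keeps the words that are sufficiently included in a region and fits the region to the minimal box around them. It is only one plausible reading of the inclusion threshold (the share of a word's area lying inside the region), not ajmc's actual code. The `(x1, y1, x2, y2)` boxes and the function names are hypothetical.

```python
def inclusion_ratio(word_box: tuple, region_box: tuple) -> float:
    """Share of the word box's area that lies inside the region box."""
    x1, y1 = max(word_box[0], region_box[0]), max(word_box[1], region_box[1])
    x2, y2 = min(word_box[2], region_box[2]), min(word_box[3], region_box[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area = (word_box[2] - word_box[0]) * (word_box[3] - word_box[1])
    return intersection / area if area > 0 else 0.0


def fit_region(region_box: tuple, word_boxes: list, inclusion_threshold: float = 0.8) -> tuple:
    """Fit a region to the minimal box containing its sufficiently included words."""
    kept = [box for box in word_boxes if inclusion_ratio(box, region_box) >= inclusion_threshold]
    if not kept:
        return region_box  # no word to fit the region to
    return (min(b[0] for b in kept), min(b[1] for b in kept),
            max(b[2] for b in kept), max(b[3] for b in kept))


region = (0, 0, 100, 50)
word_boxes = [(10, 10, 40, 20),   # fully inside the region
              (45, 10, 90, 20),   # fully inside the region
              (95, 10, 145, 20)]  # only a tenth of its area inside: ignored

fitted = fit_region(region, word_boxes)
print(fitted)
```

A threshold below 1 tolerates words that stick out of a slightly misdrawn region, while still excluding words that merely touch it.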

## Exporting canonical commentaries to json

This last step is pretty straightforward:

In [None]:
can_commentary.to_json(output_path=None)

If `output_path` is not provided, the canonical json is automatically exported to the location determined by ajmc's directory structure guidelines (i.e. `/base_dir/comm_id/canonical/v2/ocr_run_id.json`). Under the hood, this calls each `CanonicalTextContainer`'s own `to_json()` method.
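
For reference, this default location can be reconstructed from the OCR outputs directory. The helper below is a hypothetical sketch based only on the two path patterns quoted in this notebook, not ajmc's own code. It also assumes that `ocr_run_id` is the name of the OCR run directory.

```python
from pathlib import PurePosixPath


def default_canonical_path(ocr_dir: str) -> str:
    """Hypothetical helper: '[base_dir]/[comm_id]/ocr/runs/[ocr_run]/outputs' -> '[base_dir]/[comm_id]/canonical/v2/[ocr_run].json'."""
    outputs = PurePosixPath(ocr_dir)
    ocr_run_id = outputs.parent.name  # assumption: the run id is the run directory's name
    comm_dir = outputs.parents[3]     # outputs -> [ocr_run] -> runs -> ocr -> [comm_id]
    return str(comm_dir / 'canonical' / 'v2' / f'{ocr_run_id}.json')


json_path = default_canonical_path('/abspath/to/base_dir/cu31924087948174/ocr/runs/tess_eng_grc/outputs')
print(json_path)
```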