### Creating a commentary from ajmc's folder structure

You can also create a commentary from the `ocr_dir` of a commentary that complies with the project's folder structure. Because every path is derived from this structure, the declaration is much simpler: a single directory is enough.

In [None]:
from ajmc.text_processing.ocr_classes import OcrCommentary

# Use the class that was imported above; the bracketed parts of the path are placeholders
comm = OcrCommentary.from_structure(ocr_dir='/abspath/to/base_dir/[comm_id]/ocr/runs/[ocr_run]/outputs')

### `CanonicalCommentary`'s main attributes

As a collection of pages, `CanonicalCommentary` objects contain:

- Lists of pages: the OCR outputs in `Commentary.Pages`, plus `Commentary.ocr_groundtruth_pages` and `Commentary.olr_groundtruth_pages`.
- `Commentary.paths`, a dict containing the most useful paths.
- `Commentary.via_project`, a dict with information about layout regions.
- `classes.TextElements` such as `regions`, `lines` and `words`.

**Note**. Every class contains its respective sub-elements. Each `Page` contains `regions`, `lines` and `words`, each `OlrRegion` contains `lines` and `words`, and each line contains `words`.

This provides a simple framework for a wide range of operations. For instance, you can:

In [None]:
# Get all the words in the annotated `primary_text` regions of a commentary.
# Note that this still takes a bit of time as regions are automatically re-shaped.
words = [word.text
         for region in comm.regions if region.region_type == 'primary_text'
         for word in region.children.words]

# Get the number of words
count = len(words)

# Get the pages with at least one annotated `commentary` region (a list of pages, not a count)
pages_with_commentary = [p for p in comm.children.pages
                         if any(r.region_type == 'commentary' for r in p.children.regions)]
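
The same nested traversal extends to other aggregations, such as tallying region types. The sketch below is only an illustration: so that it runs without ajmc or any data, it uses hypothetical stand-in objects (`make_region`, `make_page`, `demo_comm`) that mimic just the `children.pages`, `children.regions` and `region_type` attributes used in this notebook. With a real commentary you would iterate over `comm.children.pages` instead.

```python
from collections import Counter
from types import SimpleNamespace


# Hypothetical stand-ins, NOT ajmc classes: they only expose the attributes
# used in the cells above (`children.pages`, `children.regions`, `region_type`).
def make_region(region_type):
    return SimpleNamespace(region_type=region_type)


def make_page(region_types):
    regions = [make_region(t) for t in region_types]
    return SimpleNamespace(children=SimpleNamespace(regions=regions))


demo_comm = SimpleNamespace(children=SimpleNamespace(pages=[
    make_page(['primary_text', 'commentary']),
    make_page(['commentary', 'commentary']),
    make_page(['primary_text']),
]))

# Tally how often each region type occurs across all pages
region_type_counts = Counter(
    region.region_type
    for page in demo_comm.children.pages
    for region in page.children.regions
)

# Count the pages with at least one `commentary` region
n_pages_with_commentary = sum(
    any(r.region_type == 'commentary' for r in page.children.regions)
    for page in demo_comm.children.pages
)
```

Passing generators to `Counter`, `any` and `sum` avoids building intermediate lists, which matters when a commentary holds many pages.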

### Getting familiar with textual elements

Textual elements are stored in two kinds of objects: `classes.OlrRegion` and `classes.TextElement`. You can:

In [None]:
# Select an arbitrary region (here, the first region of the groundtruth page at index 12)
region = comm.olr_groundtruth_pages[12].children.regions[0]

# Get the region type
region.region_type

# Get the lines contained in the region
region.children.lines

# Get the region's image
region.image

# Get the region's coordinates
region.bbox

# Get the region's text
region.text

**Note**. With the exception of `region_type`, all these attributes are shared by `OlrRegion` and `TextElement` (the parent class of lines and words).
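
Because these attributes are shared, one helper can handle regions, lines and words alike. The sketch below is an assumption-laden illustration rather than part of ajmc: `summarize`, `demo_region` and `demo_line` are hypothetical names, the stand-in objects carry only `bbox` and `text` (plus `region_type` for the region), and the `bbox` values are arbitrary placeholders whose format may differ from the library's.

```python
from types import SimpleNamespace


def summarize(element):
    """Collect the shared attributes, falling back when `region_type` is absent."""
    return {
        # Only regions carry `region_type`; lines and words get a generic label
        'kind': getattr(element, 'region_type', 'text element'),
        'bbox': element.bbox,
        'text': element.text,
    }


# Hypothetical stand-ins, NOT ajmc objects; bbox values are placeholders
demo_region = SimpleNamespace(region_type='commentary',
                              bbox=((0, 0), (100, 40)),
                              text='first line\nsecond line')
demo_line = SimpleNamespace(bbox=((0, 0), (100, 20)), text='first line')

summaries = [summarize(element) for element in (demo_region, demo_line)]
```

Using `getattr` with a default keeps the helper free of type checks, so it should work on any object exposing `bbox` and `text`.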
