# Presentation

In this notebook, we compute some statistics about our OLR groundtruth. First, we count the number of words per region type.

# Get the OCR-outputs paths

Let us first list our output directories. We will be using the default `base_dir`, but you may use your own. Since we are mainly interested in OLR, the OCR run we choose doesn't really matter.

In [10]:
from ajmc.commons import variables
import os

base_dir = variables.PATHS['base_dir']

ocr_output_dirs = [
    os.path.join(base_dir, 'bsb10234118/ocr/runs/22e089_tess_final/outputs'),
    os.path.join(base_dir, 'Colonna1975/ocr/runs/18108i_kraken/outputs'),
    os.path.join(base_dir, 'cu31924087948174/ocr/runs/1bm0b3_tess_final/outputs'),
    os.path.join(base_dir, 'DeRomilly1976/ocr/runs/17g08V_kraken/outputs'),
    os.path.join(base_dir, 'Ferrari1974/ocr/runs/17k0de_kraken/outputs'),
    os.path.join(base_dir, 'Garvie1998/ocr/runs/17g0ao_kraken/outputs'),
    os.path.join(base_dir, 'Kamerbeek1953/ocr/runs/17u09o_kraken/outputs'),
    os.path.join(base_dir, 'lestragdiesdeso00tourgoog/ocr/runs/21i0dA_tess_hocr/outputs'),
    os.path.join(base_dir, 'Paduano1982/ocr/runs/17v0fZ_kraken/outputs'),
    os.path.join(base_dir, 'sophoclesplaysa05campgoog/ocr/runs/1bm0b4_tess_final/outputs'),
    os.path.join(base_dir, 'sophokle1v3soph/ocr/runs/1bm0b5_tess_final/outputs'),
    os.path.join(base_dir, 'Untersteiner1934/ocr/runs/17v0as_kraken/outputs'),
    os.path.join(base_dir, 'Wecklein1894/ocr/runs/1bm0b6_tess_final/outputs')
]
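
If you point `base_dir` at your own data, it may help to check that every run directory exists before loading anything. The following is a minimal sketch using only the standard library; `find_missing_dirs` is a hypothetical helper written for this illustration and is not part of `ajmc`.

```python
import os


def find_missing_dirs(dirs):
    """Return the entries of `dirs` that are not existing directories."""
    return [d for d in dirs if not os.path.isdir(d)]


# Illustration with one existing and one (presumably) non-existing directory.
# With the list defined above, one would call find_missing_dirs(ocr_output_dirs).
example = [os.getcwd(), os.path.join(os.getcwd(), 'no_such_run', 'outputs')]
print(find_missing_dirs(example))
```

An empty list means that all directories were found.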

# Initialize our output

We will output a single `Dict[str, List[int]]` mapping each region type to a list of word counts, with one entry per commentary. We want it to look like:

```python
{'region1':[word_counts for each commentary], 'region2': [word_counts for each commentary]}
```

In [2]:
# We store the commentary ids; they will later serve as the DataFrame index
commentary_ids = []
# We take our regions from `OLR_REGION_TYPES`
counts = {k: [] for k in variables.OLR_REGION_TYPES}

# Loop over output paths

Let us now loop over our commentaries.

In [11]:
from ajmc.text_importation.classes import Commentary

for ocr_dir in ocr_output_dirs:
    commentary = Commentary.from_folder_structure(ocr_dir=ocr_dir)

    # We append our commentary id
    commentary_ids.append(commentary.id)

    # We create a temporary dict to store our counts
    comm_reg_counts = {}
    # We count only fully annotated OLR pages (not those where only the commentary is annotated).
    for p in commentary.olr_groundtruth_pages:
        for r in p.regions:
            try:
                comm_reg_counts[r.region_type] += len(r.words)
            except KeyError:
                comm_reg_counts[r.region_type] = len(r.words)

    # Let us now append our `comm_reg_counts` to our general counts
    for k in counts.keys():
        counts[k].append(comm_reg_counts.get(k, 0))

INFO - ajmc.commons.file_management.utils -   Page_id lestragdiesdeso00tourgoog_0014 matches no file in /Users/sven/drive/_AJAX/AjaxMultiCommentary/data/commentaries/commentaries_data/lestragdiesdeso00tourgoog/ocr/groundtruth/evaluation, skipping...
INFO - ajmc.commons.file_management.utils -   Page_id lestragdiesdeso00tourgoog_0015 matches no file in /Users/sven/drive/_AJAX/AjaxMultiCommentary/data/commentaries/commentaries_data/lestragdiesdeso00tourgoog/ocr/groundtruth/evaluation, skipping...
INFO - ajmc.commons.file_management.utils -   Page_id lestragdiesdeso00tourgoog_0016 matches no file in /Users/sven/drive/_AJAX/AjaxMultiCommentary/data/commentaries/commentaries_data/lestragdiesdeso00tourgoog/ocr/groundtruth/evaluation, skipping...
INFO - ajmc.commons.file_management.utils -   Page_id lestragdiesdeso00tourgoog_0017 matches no file in /Users/sven/drive/_AJAX/AjaxMultiCommentary/data/commentaries/commentaries_data/lestragdiesdeso00tourgoog/ocr/groundtruth/evaluation, skipping...
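
The `try`/`except KeyError` accumulation in the loop above can also be written with `collections.Counter`, which treats missing keys as zero. The sketch below is not the notebook's own code: the `SimpleNamespace` objects are hypothetical stand-ins for regions, and they only carry the two attributes used above, `region_type` and `words`.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-ins for page regions (not real ajmc objects).
regions = [
    SimpleNamespace(region_type='commentary', words=['a', 'b', 'c']),
    SimpleNamespace(region_type='footnote', words=['d']),
    SimpleNamespace(region_type='commentary', words=['e', 'f']),
]


def count_words_by_region(regions):
    """Sum the number of words per region type."""
    region_counts = Counter()
    for r in regions:
        # Missing keys start at 0, so no KeyError handling is needed.
        region_counts[r.region_type] += len(r.words)
    return region_counts


comm_reg_counts = count_words_by_region(regions)
print(comm_reg_counts['commentary'])    # 5
print(comm_reg_counts.get('title', 0))  # 0, region type absent from this commentary
```

Since a `Counter` is a `dict` subclass, `comm_reg_counts.get(k, 0)` behaves as in the loop above.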


In [12]:
counts

{'id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'app_crit': [0, 6375, 1199, 1956, 0, 940, 0, 0, 0, 2031, 0, 0, 0],
 'appendix': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1450, 0, 0, 844],
 'bibliography': [0, 1153, 0, 0, 348, 932, 0, 0, 0, 0, 0, 539, 0],
 'commentary': [6019,
  0,
  9440,
  8412,
  17190,
  11923,
  21802,
  0,
  6016,
  10519,
  9772,
  24714,
  8156],
 'footnote': [621, 1014, 460, 740, 2045, 1525, 578, 392, 0, 1364, 148, 0, 178],
 'index_siglorum': [1961, 0, 0, 0, 0, 1160, 0, 0, 0, 1646, 0, 3299, 0],
 'introduction': [0,
  0,
  3679,
  3893,
  4134,
  5367,
  6354,
  4272,
  7360,
  2114,
  4697,
  5976,
  820],
 'line_number_text': [84, 118, 57, 102, 73, 81, 0, 0, 90, 44, 30, 32, 40],
 'line_number_commentary': [11,
  34,
  240,
  15,
  77,
  67,
  189,
  0,
  10,
  14,
  46,
  35,
  38],
 'printed_marginalia': [0, 1, 3, 0, 0, 0, 0, 0, 0, 29, 0, 0, 0],
 'handwritten_marginalia': [0, 0, 0, 0, 0, 0, 26, 0, 0, 26, 0, 0, 0],
 'page_number': [41, 5, 36, 5, 24, 13, 25, 0, 36, 33, 51, 

# Convert our dict to `pandas.DataFrame`

In [16]:
import pandas as pd

counts_df = pd.DataFrame(counts, index=commentary_ids)
counts_df

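| | id | app_crit | appendix | bibliography | commentary | footnote | index_siglorum | introduction | line_number_text | line_number_commentary | ... | handwritten_marginalia | page_number | preface | primary_text | running_header | table_of_contents | title | translation | other | undefined |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bsb10234118 | 0 | 0 | 0 | 0 | 6019 | 621 | 1961 | 0 | 84 | 11 | ... | 0 | 41 | 855 | 4377 | 0 | 0 | 108 | 0 | 663 | 40 |
| Colonna1975 | 0 | 6375 | 0 | 1153 | 0 | 1014 | 0 | 0 | 118 | 34 | ... | 0 | 5 | 5535 | 6668 | 0 | 0 | 26 | 5 | 14 | 2107 |
| cu31924087948174 | 0 | 1199 | 0 | 0 | 9440 | 460 | 0 | 3679 | 57 | 240 | ... | 0 | 36 | 2551 | 2860 | 0 | 0 | 44 | 0 | 24 | 32 |
| DeRomilly1976 | 0 | 1956 | 0 | 0 | 8412 | 740 | 0 | 3893 | 102 | 15 | ... | 0 | 5 | 0 | 5258 | 0 | 129 | 52 | 0 | 9 | 1181 |
| Ferrari1974 | 0 | 0 | 0 | 348 | 17190 | 2045 | 0 | 4134 | 73 | 77 | ... | 0 | 24 | 1 | 4659 | 0 | 78 | 92 | 0 | 1671 | 609 |
| Garvie1998 | 0 | 940 | 0 | 932 | 11923 | 1525 | 1160 | 5367 | 81 | 67 | ... | 0 | 13 | 0 | 3958 | 0 | 0 | 24 | 3811 | 0 | 3727 |
| Kamerbeek1953 | 0 | 0 | 0 | 0 | 21802 | 578 | 0 | 6354 | 0 | 189 | ... | 26 | 25 | 507 | 0 | 0 | 0 | 115 | 0 | 41 | 1244 |
| lestragdiesdeso00tourgoog | 0 | 0 | 0 | 0 | 0 | 392 | 0 | 4272 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 42 |
| Paduano1982 | 0 | 0 | 0 | 0 | 6016 | 0 | 0 | 7360 | 90 | 10 | ... | 0 | 36 | 0 | 5343 | 0 | 0 | 37 | 6483 | 5 | 1146 |
| sophoclesplaysa05campgoog | 0 | 2031 | 1450 | 0 | 10519 | 1364 | 1646 | 2114 | 44 | 14 | ... | 26 | 33 | 0 | 2124 | 0 | 270 | 60 | 1266 | 18 | 8 |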


In [15]:
counts_df.to_csv('/Users/sven/Desktop/olr_counts.tsv', sep='\t')
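
As a first statistic, one may want the total number of words per region type and per commentary. The sketch below works in plain Python on a small excerpt of the counts printed above (three region types, first three commentaries); on the full DataFrame, `counts_df.sum()` and `counts_df.sum(axis=1)` should give the corresponding totals.

```python
# Excerpt of the counts shown above: first three commentaries only.
commentary_ids = ['bsb10234118', 'Colonna1975', 'cu31924087948174']
counts = {
    'app_crit': [0, 6375, 1199],
    'appendix': [0, 0, 0],
    'footnote': [621, 1014, 460],
}

# Total words per region type (sum over commentaries).
region_totals = {region: sum(values) for region, values in counts.items()}

# Total words per commentary (sum over region types).
commentary_totals = {
    cid: sum(values[i] for values in counts.values())
    for i, cid in enumerate(commentary_ids)
}

print(region_totals)      # {'app_crit': 7574, 'appendix': 0, 'footnote': 2095}
print(commentary_totals)  # {'bsb10234118': 621, 'Colonna1975': 7389, 'cu31924087948174': 1659}
```

These totals cover only the excerpt, not the whole groundtruth.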