## Assessing the quality of the alignments

This notebook assesses the quality of the alignments between the OCR text and the human-corrected text of the Trove dataset, as produced by the `aligning_trove.ipynb` notebook. It computes the ratio of aligned characters to the total number of characters, and writes a CSV file for quick manual evaluation of the alignments in a sample of 50 articles.

In [1]:
import pandas as pd
import ast

In [2]:
trove_df = pd.read_pickle("trove_artidigh_aligned_human.pkl")

In [3]:
trove_df.head()

Unnamed: 0,filePath,articleId,articleType,year,ocrText,humanText,corrected,str_similarity,str_length,alignment,processed
1,./trove_overproof/datasets/dataset1/rawTextAnd...,18378453,Article ILLUSTRATED,1953,"FROM RIVER CROSSING TO END OF TRIÄÜ I ^PI A^H""...",FROM RIVER CROSSING TO END OF TRIAL SPLASH: Pe...,,0.847747,747,"[(0, 4, 0, 4), (5, 10, 5, 10), (11, 19, 11, 19...",yes
2,./trove_overproof/datasets/dataset1/rawTextAnd...,18363627,Article,1953,"Natural Childbirth Sir,-We nurses have seen fa...","Natural Childbirth Sir,-We nurses have seen fa...",,0.964174,642,"[(0, 7, 0, 7), (8, 18, 8, 18), (19, 26, 19, 26...",yes
5,./trove_overproof/datasets/dataset1/rawTextAnd...,18368961,Article,1953,"DIVORCE Before The Judge In Divorce, Mr Justic...","DIVORCE Before The Judge In Divorce, Mr. Justi...",,0.894262,1220,"[(0, 7, 0, 7), (8, 14, 8, 14), (19, 24, 19, 24...",yes
7,./trove_overproof/datasets/dataset1/rawTextAnd...,18381450,Article,1953,I SCHOOL CHESS * Homebush Increased Ils lead o...,SCHOOL CHESS Homebush increased its lead over...,,0.918347,992,"[(2, 8, 0, 6), (9, 14, 7, 12), (17, 25, 14, 22...",yes
8,./trove_overproof/datasets/dataset1/rawTextAnd...,18383206,Article,1953,Architects' Contracts Architects have signed t...,Architects' Contracts Architects have signed t...,,0.897167,953,"[(0, 11, 0, 11), (12, 21, 12, 21), (22, 32, 22...",yes


In [4]:
sample_df = trove_df.sample(50)

In [5]:
sample_df

Unnamed: 0,filePath,articleId,articleType,year,ocrText,humanText,corrected,str_similarity,str_length,alignment,processed
19839,./trove_overproof/datasets/dataset1/rawTextAnd...,14255823,Article,1899,I JOHNSTONE^ BAY SAILING CL01). |¡ The weekly ...,JOHNSTONE'S BAY SAILING CLUB. The weekly meeti...,,0.893057,1244,"[(2, 12, 0, 11), (17, 24, 16, 23), (39, 45, 34...",yes
5070,./trove_overproof/datasets/dataset1/rawTextAnd...,15842582,Article,1919,I ATHLETIC^ f ' CnOSSOOUNTRY CHAMPIONSHIP. WON...,ATHLETICS CROSS COUNTRY CHAMPIONSHIP. WON BY R...,,0.892623,4859,"[(2, 11, 0, 9), (16, 28, 16, 23), (29, 42, 24,...",yes
2448,./trove_overproof/datasets/dataset1/rawTextAnd...,13657408,Article,1887,I LABOUR MATTERS. * I Sinco the publication of...,LABOUR MATTERS. Since the publication of our l...,,0.907478,785,"[(2, 8, 0, 6), (9, 17, 7, 15), (22, 27, 16, 21...",yes
18366,./trove_overproof/datasets/dataset1/rawTextAnd...,17439016,Article,1938,> DIÑO BORGIOLI. I New italian Songs. Besides ...,DINO BORGIOLI. New italian Songs. Besides a lo...,,0.917374,1735,"[(2, 6, 0, 4), (7, 16, 5, 14), (23, 30, 19, 26...",yes
6375,./trove_overproof/datasets/dataset1/rawTextAnd...,28369417,Article,1884,"A meeting of tte New South ""Wales Cricket Asso...",CRICKET. --- A meeting of the New South Wales ...,,0.882084,1094,"[(2, 9, 15, 22), (21, 26, 34, 39), (34, 41, 46...",yes
25164,./trove_overproof/datasets/dataset1/rawTextAnd...,14469875,Article,1902,"TILE CAMBRIDGE (ORTHIC) SHORTHAND SOCIETY. , T...",THE CAMBRIDGE (ORTHIC) SHORTHAND SOCIETY. The ...,,0.708847,864,"[(5, 14, 4, 13), (15, 23, 14, 22), (24, 33, 23...",yes
9831,./trove_overproof/datasets/dataset1/rawTextAnd...,13966169,Article,1894,"TEMORA, REEFTON, AND BARMEDMAN DISTRICT«. Tho ...","TEMORA, REEFTON, AND BARMEDMAN DISTRICTS. Tho ...",,0.893821,4483,"[(0, 7, 0, 7), (8, 16, 8, 16), (21, 30, 21, 30...",yes
27102,./trove_overproof/datasets/dataset1/rawTextAnd...,13046646,Article,1860,"LAW. SUPREME COURT -WHnsKSDAr. i "" Is Eanrrr. ...",LAW. SUPREME COURT -WEDNESDAY. IN EQUITY. BEFO...,,0.880203,972,"[(0, 4, 0, 4), (5, 12, 5, 12), (13, 18, 13, 18...",yes
17351,./trove_overproof/datasets/dataset1/rawTextAnd...,14654860,Article,1904,GORDON BENNETT NEWS. In Hamburg and tne neighb...,GORDON BENNETT NEWS. In Hamburg and the neighb...,,0.857143,935,"[(0, 6, 0, 6), (7, 14, 7, 14), (15, 20, 15, 20...",yes
27905,./trove_overproof/datasets/dataset1/rawTextAnd...,14369575,Article,1901,"SALES OF WORK. - » A. bazaar and ial<, of -wor...",SALES OF WORK. - ---- A. bazaar and sale of -w...,,0.752273,2640,"[(0, 5, 0, 5), (9, 14, 9, 14), (22, 28, 25, 31...",yes
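
Unless it is given a seed, `DataFrame.sample` draws a different random sample on each run, so re-running the cell will not reproduce the 50 articles shown above. A minimal sketch of seeded sampling follows. The toy frame and the seed value 42 are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for trove_df; only the sampling behaviour matters here.
toy_df = pd.DataFrame({"articleId": range(100)})

# Passing the same random_state yields the same rows on every run.
first_draw = toy_df.sample(5, random_state=42)
second_draw = toy_df.sample(5, random_state=42)

print(first_draw.equals(second_draw))
```

Fixing the seed would keep the evaluated sample stable if the notebook is re-run after the annotation spreadsheet has been filled in.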


### Compare the amount of text that has been aligned vs uncertain

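For each article, count the characters that fall inside aligned spans and the characters whose alignment was left as uncertain (the text between consecutive aligned spans). The ratio of aligned characters is then reported for each quality band.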
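
The idea can be illustrated on a single short string. This is a simplified sketch, not the notebook's exact loop: the helper name `count_aligned_uncertain` is hypothetical, spans are given as `(start, end)` offsets into one text only, and trailing text after the last span is counted as uncertain.

```python
def count_aligned_uncertain(text, spans):
    """Count characters inside aligned spans and characters in the gaps.

    `spans` is a list of (start, end) offsets with an exclusive end.
    Surrounding whitespace is stripped from each piece before counting.
    """
    aligned = 0
    uncertain = 0
    cursor = 0
    for start, end in sorted(spans):
        # Text between the previous span and this one is uncertain.
        uncertain += len(text[cursor:start].strip())
        aligned += len(text[start:end].strip())
        cursor = end
    # Anything left after the last aligned span is also uncertain.
    uncertain += len(text[cursor:].strip())
    return aligned, uncertain


ocr = "THE NEXT MATCH. Tho third test"
spans = [(4, 8), (9, 15), (20, 25), (26, 30)]  # NEXT, MATCH., third, test

aligned, uncertain = count_aligned_uncertain(ocr, spans)
ratio = aligned / (aligned + uncertain)
print(aligned, uncertain, ratio)
```

Here the two unaligned words (`THE` and `Tho`) contribute 6 uncertain characters against 19 aligned ones, giving a ratio of 0.76.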

In [6]:
trove_df['alignedchars_ocr'] = 0
trove_df['alignedchars_hum'] = 0
trove_df['uncertainchars_ocr'] = 0
trove_df['uncertainchars_hum'] = 0

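# Keep only articles whose OCR and human texts differ in length by at most 100 characters.
# .copy() avoids pandas' SettingWithCopyWarning when the new columns are filled in below.
trove_df = trove_df[(abs(trove_df['ocrText'].str.len() - trove_df['humanText'].str.len()) <= 100)].copy()

for index, row in trove_df.iterrows():
    aligned_chars_ocr = 0
    aligned_chars_hum = 0
    uncertain_chars_ocr = 0
    uncertain_chars_hum = 0
    ocrText = row['ocrText']
    humanText = row['humanText']
    # Each alignment is a tuple (ocr_start, ocr_end, hum_start, hum_end).
    alignment = ast.literal_eval(row['alignment'])
    ocr_index = 0
    hum_index = 0
    alignment.sort(key=lambda tup: tup[0])
    for a in alignment:
        # Spans that start at offset 0 in either text are skipped here:
        # neither the gap before them nor their own characters are counted.
        if a[0] != 0 and a[2] != 0:
            # Text between the previous aligned span and this one is uncertain.
            uncertain_chars_ocr += len(ocrText[ocr_index : a[0] - 1].strip())
            uncertain_chars_hum += len(humanText[hum_index : a[2] - 1].strip())
            aligned_chars_ocr += len(ocrText[a[0] : a[1]].strip())
            aligned_chars_hum += len(humanText[a[2] : a[3]].strip())
        # Move past the current span and the separator that follows it.
        ocr_index = a[1] + 1
        hum_index = a[3] + 1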
    trove_df.loc[index, 'alignedchars_ocr'] = aligned_chars_ocr
    trove_df.loc[index, 'alignedchars_hum'] = aligned_chars_hum
    trove_df.loc[index, 'uncertainchars_ocr'] = uncertain_chars_ocr
    trove_df.loc[index, 'uncertainchars_hum'] = uncertain_chars_hum

In [9]:
def quality(similarity):
    if similarity > 0.9:
        return 1 # good
    elif similarity > 0.8:
        return 2 # soso
    elif similarity > 0.7:
        return 3 # bad
    return 4 # ugly

trove_df['quality_band'] = trove_df["str_similarity"].apply(quality)
trove_df.head()

Unnamed: 0,filePath,articleId,articleType,year,ocrText,humanText,corrected,str_similarity,str_length,alignment,processed,alignedchars_ocr,alignedchars_hum,uncertainchars_ocr,uncertainchars_hum,quality_band
1,./trove_overproof/datasets/dataset1/rawTextAnd...,18378453,Article ILLUSTRATED,1953,"FROM RIVER CROSSING TO END OF TRIÄÜ I ^PI A^H""...",FROM RIVER CROSSING TO END OF TRIAL SPLASH: Pe...,,0.847747,747,"[(0, 4, 0, 4), (5, 10, 5, 10), (11, 19, 11, 19...",yes,387,385,338,268,2
2,./trove_overproof/datasets/dataset1/rawTextAnd...,18363627,Article,1953,"Natural Childbirth Sir,-We nurses have seen fa...","Natural Childbirth Sir,-We nurses have seen fa...",,0.964174,642,"[(0, 7, 0, 7), (8, 18, 8, 18), (19, 26, 19, 26...",yes,419,414,115,129,1
5,./trove_overproof/datasets/dataset1/rawTextAnd...,18368961,Article,1953,"DIVORCE Before The Judge In Divorce, Mr Justic...","DIVORCE Before The Judge In Divorce, Mr. Justi...",,0.894262,1220,"[(0, 7, 0, 7), (8, 14, 8, 14), (19, 24, 19, 24...",yes,478,508,474,530,2
7,./trove_overproof/datasets/dataset1/rawTextAnd...,18381450,Article,1953,I SCHOOL CHESS * Homebush Increased Ils lead o...,SCHOOL CHESS Homebush increased its lead over...,,0.918347,992,"[(2, 8, 0, 6), (9, 14, 7, 12), (17, 25, 14, 22...",yes,586,605,210,230,1
8,./trove_overproof/datasets/dataset1/rawTextAnd...,18383206,Article,1953,Architects' Contracts Architects have signed t...,Architects' Contracts Architects have signed t...,,0.897167,953,"[(0, 11, 0, 11), (12, 21, 12, 21), (22, 32, 22...",yes,506,518,273,299,2
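
As an aside, the same banding can be expressed without a Python-level function by using `pd.cut`. The sketch below is an alternative, not the notebook's method; the similarity values are a small illustrative series (three of them appear in the tables above, while 0.65 and 0.9 are made up to exercise the boundaries).

```python
import numpy as np
import pandas as pd

similarity = pd.Series([0.847747, 0.964174, 0.708847, 0.65, 0.9])

# Bins are right-closed by default, so 0.9 itself falls in (0.8, 0.9] and
# gets band 2, matching the strict `>` comparisons in `quality`.
bands = pd.cut(
    similarity,
    bins=[-np.inf, 0.7, 0.8, 0.9, np.inf],
    labels=[4, 3, 2, 1],
).astype(int)

print(bands.tolist())
```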


In [14]:
sampledf_band1 = trove_df[trove_df['quality_band'] == 1]
sampledf_band2 = trove_df[trove_df['quality_band'] == 2]
sampledf_band3 = trove_df[trove_df['quality_band'] == 3]
sampledf_band4 = trove_df[trove_df['quality_band'] == 4]

dfbands = [sampledf_band1, sampledf_band2, sampledf_band3, sampledf_band4]

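for band in reversed(range(len(dfbands))):
    ahum = dfbands[band]['alignedchars_hum'].sum(axis=0)
    aocr = dfbands[band]['alignedchars_ocr'].sum(axis=0)
    # NOTE: the two names below are swapped relative to the columns they read,
    # so `uhum` holds the OCR-side uncertain count and the printed ratio
    # combines human-side aligned characters with OCR-side uncertain ones.
    uhum = dfbands[band]['uncertainchars_ocr'].sum(axis=0)
    uocr = dfbands[band]['uncertainchars_hum'].sum(axis=0)

    # `band` is the 0-based list index, i.e. the quality band minus 1
    # (printed band 0 is quality band 1, "good").
    print("Band:", band, ahum / (ahum + uhum))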

Band: 3 0.2878159676988064
Band: 2 0.4288786937251221
Band: 1 0.5554745451432586
Band: 0 0.6363623896364651
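
A per-band ratio can also be computed for both texts at once with `groupby`, pairing each aligned count with the uncertain count from the same text. This is a sketch under assumptions: the column names follow the notebook, but the numbers in the toy frame are made up for illustration and are not Trove results.

```python
import pandas as pd

# Toy values, made up for illustration; not taken from the Trove data.
toy = pd.DataFrame({
    "quality_band": [1, 1, 2, 4],
    "alignedchars_ocr": [80, 60, 50, 10],
    "uncertainchars_ocr": [20, 40, 50, 30],
    "alignedchars_hum": [90, 50, 40, 20],
    "uncertainchars_hum": [10, 50, 60, 20],
})

# Sum the character counts within each quality band.
sums = toy.groupby("quality_band").sum()

# Share of aligned characters, computed separately for each text.
ratios = pd.DataFrame({
    "ocr": sums["alignedchars_ocr"]
    / (sums["alignedchars_ocr"] + sums["uncertainchars_ocr"]),
    "hum": sums["alignedchars_hum"]
    / (sums["alignedchars_hum"] + sums["uncertainchars_hum"]),
})

print(ratios)
```

Indexing the result by `quality_band` keeps the labels 1 to 4 attached to the ratios, which avoids any confusion with 0-based list positions.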


### Create csv with aligned text of articles in the sample

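Uncertain text is surrounded by `(((` and `)))` so that it is easy to spot when annotating. Despite their names, the last two columns hold the start and end character offsets of each span, taken from the OCR text for uncertain rows and from the human text for aligned rows. Format: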
```
articleId  str_similarity  aligned    ocrText              humanText           ocrOffset  humOffset
14473800   0.7610062893    UNCERTAIN  (((THE               THE)))              0          3
14473800   0.7610062893    ALIGNED    NEXT                 NEXT                4          8
14473800   0.7610062893    ALIGNED    MATCH.               MATCH.              9          15
14473800   0.7610062893    UNCERTAIN  (((Tho               The)))              16         19
14473800   0.7610062893    ALIGNED    third                third               20         25
14473800   0.7610062893    ALIGNED    test                 test                26         30
14473800   0.7610062893    ALIGNED    match                match               31         36
14473800   0.7610062893    UNCERTAIN  (((of tho            of the)))           37         43
14473800   0.7610062893    ALIGNED    tour                 tour                44         48
14473800   0.7610062893    UNCERTAIN  (((v                 >)))                49         51
14473800   0.7610062893    ALIGNED    ill                  will                49         53
14473800   0.7610062893    ALIGNED    commence             commence            54         62
14473800   0.7610062893    UNCERTAIN  (((to du} at         to-day at)))        65         74
14473800   0.7610062893    ALIGNED    Sheffield,           Sheffield,          73         83
14473800   0.7610062893    UNCERTAIN  (((and lull bo tho   and will be the)))  86         101
14473800   0.7610062893    ALIGNED    first                first               100        105
14473800   0.7610062893    ALIGNED    occasion             occasion            106        114
```

Output is here: https://docs.google.com/spreadsheets/d/1HugVFVYPjtz9rBPNfJwsmL9WXVDCUVw_IuFiCYSHR88/edit#gid=753828840

In [7]:
import csv

dAlignmentsToEvaluate = dict()
for index, row in sample_df.iterrows():
    newArticle = []
    ocrText = row['ocrText']
    humanText = row['humanText']
    # Each alignment is a tuple (ocr_start, ocr_end, hum_start, hum_end).
    alignment = ast.literal_eval(row['alignment'])
    ocr_index = 0
    hum_index = 0
    alignment.sort(key=lambda tup: tup[0])
    for a in alignment:
        if a[0] != 0 and a[2] != 0:
            # Text between the previous aligned span and this one is uncertain.
            uncertain_ocr = ocrText[ocr_index : a[0] - 1].strip()
            uncertain_hum = humanText[hum_index : a[2] - 1].strip()
            if uncertain_ocr or uncertain_hum:
                # Rows are built as lists so that an empty side keeps its own column.
                newArticle.append([str(row['articleId']), str(row['str_similarity']), "UNCERTAIN",
                                   "(((" + uncertain_ocr, uncertain_hum + ")))",
                                   str(ocr_index), str(a[0] - 1)])
        newArticle.append([str(row['articleId']), str(row['str_similarity']), "ALIGNED",
                           ocrText[a[0] : a[1]].strip(), humanText[a[2] : a[3]].strip(),
                           str(a[2]), str(a[3])])
        ocr_index = a[1] + 1
        hum_index = a[3] + 1

    dAlignmentsToEvaluate[(row['articleId'], row['str_similarity'])] = newArticle

# newline="" is what the csv module documentation recommends for files passed to csv.writer.
with open("alignments.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter='\t')
    for article in dAlignmentsToEvaluate:
        writer.writerows(dAlignmentsToEvaluate[article])
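
Once written, the tab-separated rows can be read back to list the uncertain pairs whose two sides differ, which are candidates for OCR errors. The sketch below is an assumption-laden illustration: it parses four rows copied from the format example from an in-memory string rather than from `alignments.csv`, and the helper `strip_markers` is hypothetical.

```python
import csv
import io

# Rows copied from the format example; real input would be alignments.csv.
sample = (
    "14473800\t0.7610062893\tUNCERTAIN\t(((THE\tTHE)))\t0\t3\n"
    "14473800\t0.7610062893\tALIGNED\tNEXT\tNEXT\t4\t8\n"
    "14473800\t0.7610062893\tALIGNED\tMATCH.\tMATCH.\t9\t15\n"
    "14473800\t0.7610062893\tUNCERTAIN\t(((Tho\tThe)))\t16\t19\n"
)


def strip_markers(ocr, hum):
    """Remove the ((( and ))) markers that wrap an uncertain pair."""
    if ocr.startswith("((("):
        ocr = ocr[3:]
    if hum.endswith(")))"):
        hum = hum[:-3]
    return ocr, hum


rows = []
reader = csv.reader(io.StringIO(sample), delimiter="\t")
for article_id, similarity, status, ocr, hum, start, end in reader:
    if status == "UNCERTAIN":
        ocr, hum = strip_markers(ocr, hum)
    rows.append((status, ocr, hum))

# Uncertain pairs whose two sides differ are candidates for OCR errors.
differing = [
    (ocr, hum)
    for status, ocr, hum in rows
    if status == "UNCERTAIN" and ocr != hum
]
print(differing)
```

In this sample only `Tho` / `The` is reported, because the first uncertain pair (`THE` / `THE`) is identical on both sides.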

The manual assessment of the alignment quality is recorded here: https://docs.google.com/spreadsheets/d/1HugVFVYPjtz9rBPNfJwsmL9WXVDCUVw_IuFiCYSHR88/edit#gid=753828840