# Impact of OCR on linguistic processing

Tasks in this notebook:
- [ ] Part-of-speech tagging
- [ ] Named entity recognition
- [ ] Dependency parsing
- [ ] Semantic role labelling

Not considered here:
- Sentence splitting
- Tokenisation
- Lemmatisation

In [1]:
import pathlib
import sys
import argparse
from pathlib import Path
import glob
import string
import pandas as pd
import os
import numpy as np
import difflib
from difflib import SequenceMatcher
import collections
import re
import ast
from tqdm import tnrange, tqdm_notebook

In [2]:
trovedf = pd.read_pickle("db_trove.pkl")

In [3]:
trovedf.shape

(30509, 9)

In [4]:
trovedf.head()

Unnamed: 0,filePath,articleId,articleType,year,ocrText,humanText,corrected,str_similarity,str_length
1,./trove_overproof/datasets/dataset1/rawTextAnd...,18378453,Article ILLUSTRATED,1953,"FROM RIVER CROSSING TO END OF TRIÄÜ I ^PI A^H""...",FROM RIVER CROSSING TO END OF TRIAL SPLASH: Pe...,,0.847747,747
2,./trove_overproof/datasets/dataset1/rawTextAnd...,18363627,Article,1953,"Natural Childbirth Sir,-We nurses have seen fa...","Natural Childbirth Sir,-We nurses have seen fa...",,0.964174,642
3,./trove_overproof/datasets/dataset1/rawTextAnd...,18366055,Article,1953,FIRST CHURCH I SERVICE 1 Presbyterian I ' Anni...,FIRST CHURCH SERVICE Presbyterian Anniversary ...,,0.739176,947
4,./trove_overproof/datasets/dataset1/rawTextAnd...,18386137,Article,1953,"""Bob"" Lulham's Fight Against Thallium District...","""Bob"" Lulham's Fight Against Thallium Arthur ...",,0.493898,2950
5,./trove_overproof/datasets/dataset1/rawTextAnd...,18368961,Article,1953,"DIVORCE Before The Judge In Divorce, Mr Justic...","DIVORCE Before The Judge In Divorce, Mr. Justi...",,0.894262,1220


### Create sample to use in linguistic processing tasks

Define quality bands:

In [5]:
def quality(similarity):
    if similarity > 0.9:
        return 1 # good
    elif similarity > 0.8:
        return 2 # soso
    elif similarity > 0.7:
        return 3 # bad
    return 4 # ugly
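
As a quick check of the thresholds, the sketch below applies the same banding to the `str_similarity` values shown in the `trovedf.head()` output above. `quality_band_of` is a hypothetical stand-in that restates `quality` so the snippet runs on its own. Note that the comparisons are strict, so a similarity of exactly 0.9 falls into band 2.

```python
def quality_band_of(similarity):
    """Restates the thresholds of quality() above (hypothetical helper name)."""
    if similarity > 0.9:
        return 1  # good
    elif similarity > 0.8:
        return 2  # soso
    elif similarity > 0.7:
        return 3  # bad
    return 4  # ugly

# str_similarity values copied from the trovedf.head() output above
head_similarities = [0.847747, 0.964174, 0.739176, 0.493898, 0.894262]
head_bands = [quality_band_of(s) for s in head_similarities]
print(head_bands)  # [2, 1, 3, 4, 2]

# Boundaries are exclusive: exactly 0.9 is band 2, not band 1
print(quality_band_of(0.9))  # 2
```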

Keep only texts whose OCR and human-corrected lengths differ by at most 100 characters, then assign each remaining text a quality band:

In [6]:
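length_diff = (trovedf['ocrText'].str.len() - trovedf['humanText'].str.len()).abs()
# .copy() so that the new column is added to an independent frame, not a view of the original
trovedf = trovedf[length_diff <= 100].copy()
trovedf['quality_band'] = trovedf["str_similarity"].apply(quality)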

In [7]:
trovedf.shape

(28287, 10)

In [8]:
subsamplefile = pathlib.Path("trove_subsample.pkl")
if not subsamplefile.exists():
    # Draw 950 articles from each of the four quality bands (3,800 in total)
    sampledf = pd.concat(
        [trovedf[trovedf['quality_band'] == band].sample(950) for band in (1, 2, 3, 4)]
    )
    sampledf.to_pickle(subsamplefile)
else:
    # Reuse the stored subsample so that later cells see the same articles
    sampledf = pd.read_pickle(subsamplefile)
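
Because `sample(950)` is unseeded, deleting `trove_subsample.pkl` would produce a different subsample on the next run. The following is a minimal sketch of a seeded per-band draw on a toy frame, assuming pandas 1.1 or later for `DataFrameGroupBy.sample`. The names `stratified_sample`, `per_band` and `seed` are hypothetical and not part of the notebook.

```python
import pandas as pd

def stratified_sample(df, per_band, seed=0, band_col="quality_band"):
    """Draw `per_band` rows from every quality band, reproducibly."""
    counts = df[band_col].value_counts()
    too_small = counts[counts < per_band]
    if not too_small.empty:
        # Fail early with a message that names the offending bands
        raise ValueError(f"bands with fewer than {per_band} rows: {sorted(too_small.index)}")
    return df.groupby(band_col, group_keys=False).sample(n=per_band, random_state=seed)

# Toy data: 12 articles, three in each band
toy = pd.DataFrame({
    "articleId": range(12),
    "quality_band": [1, 2, 3, 4] * 3,
})
first = stratified_sample(toy, per_band=2, seed=42)
second = stratified_sample(toy, per_band=2, seed=42)
print(first.equals(second))  # the same seed gives the same rows
```

Checking the band sizes first also guards against a band that holds fewer rows than requested, in which case sampling without replacement cannot succeed.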

In [9]:
sampledf.head()

Unnamed: 0,filePath,articleId,articleType,year,ocrText,humanText,corrected,str_similarity,str_length,quality_band
26185,./trove_overproof/datasets/dataset1/rawTextAnd...,14330048,Article,1900,"WESTERN SUBURBS COTTAGE HOSPITAL. ,-«A-. Tim c...",WESTERN SUBURBS COTTAGE HOSPITAL. — The commit...,,0.920732,1144,1
13445,./trove_overproof/datasets/dataset1/rawTextAnd...,13166969,Article,1868,INTERCOLONIAL NEWS. QUEENSLAND. BRISBANE files...,INTERCOLONIAL NEWS. QUEENSLAND. BRISBANE files...,,0.924826,2155,1
13729,./trove_overproof/datasets/dataset1/rawTextAnd...,14953246,Article,1908,IN DIVORCE. (Before Mr. Justlco Simpson.) O'BR...,IN DIVORCE. (Before Mr. Justice Simpson.) O'BR...,,0.929938,3240,1
13979,./trove_overproof/datasets/dataset1/rawTextAnd...,14947605,Article,1908,"I MILITAEY. MELBOURNE, Saturday. The following...","MILITARY. MELBOURNE, Saturday. The following n...",,0.921942,2563,1
22979,./trove_overproof/datasets/dataset1/rawTextAnd...,15708305,Article,1917,I WAR CASUALTIES. -» j KILLED. ! PRIVATE W. PR...,WAR CASUALTIES. -» KILLED. PRIVATE W. PRYOR. M...,,0.915966,578,1


In [10]:
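# Rows with an empty 'corrected' field have no separately corrected text:
# fall back to humanText and record the fallback with use_corrected = 0.
corrected_cond = (sampledf["corrected"] == '')
sampledf["use_corrected"] = 1
sampledf.loc[corrected_cond, 'corrected'] = sampledf.loc[corrected_cond, 'humanText']
sampledf.loc[corrected_cond, 'use_corrected'] = 0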

In [11]:
sampledf.head()

Unnamed: 0,filePath,articleId,articleType,year,ocrText,humanText,corrected,str_similarity,str_length,quality_band,use_corrected
26185,./trove_overproof/datasets/dataset1/rawTextAnd...,14330048,Article,1900,"WESTERN SUBURBS COTTAGE HOSPITAL. ,-«A-. Tim c...",WESTERN SUBURBS COTTAGE HOSPITAL. — The commit...,WESTERN SUBURBS COTTAGE HOSPITAL. — The commit...,0.920732,1144,1,0
13445,./trove_overproof/datasets/dataset1/rawTextAnd...,13166969,Article,1868,INTERCOLONIAL NEWS. QUEENSLAND. BRISBANE files...,INTERCOLONIAL NEWS. QUEENSLAND. BRISBANE files...,INTERCOLONIAL NEWS. QUEENSLAND. BRISBANE files...,0.924826,2155,1,0
13729,./trove_overproof/datasets/dataset1/rawTextAnd...,14953246,Article,1908,IN DIVORCE. (Before Mr. Justlco Simpson.) O'BR...,IN DIVORCE. (Before Mr. Justice Simpson.) O'BR...,IN DIVORCE. (Before Mr. Justice Simpson.) O'BR...,0.929938,3240,1,0
13979,./trove_overproof/datasets/dataset1/rawTextAnd...,14947605,Article,1908,"I MILITAEY. MELBOURNE, Saturday. The following...","MILITARY. MELBOURNE, Saturday. The following n...","MILITARY. MELBOURNE, Saturday. The following n...",0.921942,2563,1,0
22979,./trove_overproof/datasets/dataset1/rawTextAnd...,15708305,Article,1917,I WAR CASUALTIES. -» j KILLED. ! PRIVATE W. PR...,WAR CASUALTIES. -» KILLED. PRIVATE W. PRYOR. M...,WAR CASUALTIES. -» KILLED. PRIVATE W. PRYOR. M...,0.915966,578,1,0


### Align OCRed and human-corrected text

First pass at aligning OCR and human-corrected text:

In [12]:
def findOcrHumanMatches(ocrText, humText):
    dOcrWordIndices = dict()
    dHumWordIndices = dict()
    maskedOcrText = ocrText.lower()
    for ow in maskedOcrText.split(" "):
        indices = maskedOcrText.index(ow), maskedOcrText.index(ow) + len(ow)
        for index in range(indices[0], indices[1]):
            maskedOcrText = maskedOcrText[:index] + '%' + maskedOcrText[index + 1:]
        if ow in dOcrWordIndices:
            dOcrWordIndices[ow].append(indices)
        else:
            dOcrWordIndices[ow] = [indices]
    maskedHumText = humText.lower()
    for hw in maskedHumText.split(" "):
        indices = maskedHumText.index(hw), maskedHumText.index(hw) + len(hw)
        for index in range(indices[0], indices[1]):
            maskedHumText = maskedHumText[:index] + '%' + maskedHumText[index + 1:]
        if hw in dHumWordIndices:
            dHumWordIndices[hw].append(indices)
        else:
            dHumWordIndices[hw] = [indices]
    
    dPotentialMatches = dict()
    for hq in sorted(dHumWordIndices, key=len, reverse=True):
        for oq in dOcrWordIndices:
            m = SequenceMatcher(None, hq, oq)
            if (hq, tuple(dHumWordIndices[hq])) in dPotentialMatches:
                dPotentialMatches[(hq, tuple(dHumWordIndices[hq]))].append((oq, dOcrWordIndices[oq], float(m.ratio())))
            else:
                dPotentialMatches[(hq, tuple(dHumWordIndices[hq]))] = [(oq, dOcrWordIndices[oq], float(m.ratio()))]
    
    ratio_decreasing = [1.0, 0.9, 0.8, 0.7]
    distance_limits = [10, 20, 30, 50, 90]
    word_length_list = [5, 3]
    
    lMatches = []
    used_hum_indices = []
    used_ocr_indices = []
    already_added = set()
    
    for word_length in word_length_list:
        for ratio in ratio_decreasing:
            for allowed_distance in distance_limits:
                for pm in dPotentialMatches:
                    hum_word = pm[0]
                    hum_indices = pm[1]
                    if len(hum_word) > word_length:
                        potential_matches = [dPotentialMatches[pm]]
                        hum_index_matched = False
                        for hum_index in hum_indices:
                            if hum_index_matched == False:
                                if not (hum_word, hum_index) in already_added:
                                    for pm in potential_matches[0]:
                                        for ow_indices in pm[1]:
                                            if abs(hum_index[0] - ow_indices[0]) <= allowed_distance and pm[2] >= ratio:
                                                if not hum_index[0] in used_hum_indices and not ow_indices[0] in used_ocr_indices:
                                                    match_not_possible = False
                                                    for already_matched in lMatches:
                                                        if already_matched[2] > hum_index[0]:
                                                            if already_matched[0] <= ow_indices[0]:
                                                                match_not_possible = True
                                                        elif already_matched[2] < hum_index[0]:
                                                            if already_matched[0] >= ow_indices[0]:
                                                                match_not_possible = True
                                                    if match_not_possible == False: 
                                                        already_added.add((hum_word, hum_index))
                                                        lMatches.append((ow_indices[0], ow_indices[1], hum_index[0], hum_index[1]))
                                                        # Mark the full spans of the matched words as used
                                                        # (len(hw)/len(ow) referred to the last words of the texts)
                                                        used_hum_indices += list(range(hum_index[0], hum_index[1]))
                                                        used_ocr_indices += list(range(ow_indices[0], ow_indices[1]))
                                                        hum_index_matched = True
                                                        break
                            else:
                                hum_index_matched = False
                                break

    sorted_matches = sorted(lMatches, key=lambda tup: tup[0])
    
    return sorted_matches
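
The first half of `findOcrHumanMatches` records the character span of every space-separated word by repeatedly masking the text with `%`. As an illustration only, a similar word-to-spans mapping can be sketched with `re.finditer`. `word_spans` is a hypothetical helper, and unlike the original it splits on any whitespace and skips empty tokens.

```python
import re
from collections import defaultdict

def word_spans(text):
    """Map each lower-cased token to the list of its (start, end) character spans."""
    spans = defaultdict(list)
    for match in re.finditer(r"\S+", text.lower()):
        spans[match.group()].append((match.start(), match.end()))
    return dict(spans)

spans = word_spans("To the Editor of the Herald.")
print(spans["the"])      # [(3, 6), (17, 20)]
print(spans["herald."])  # [(21, 28)]
```

Each text is scanned once here, whereas the masking approach searches and rebuilds the string for every word.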

Second pass at aligning OCR and human-corrected text:

In [13]:
def alignmentSecondPass(ocrText, humanText, alignment):
    second_pass_alignment = []
    ocr_index = 0
    hum_index = 0
    for a in alignment:
        if a[0] != 0 and a[2] != 0:
            uncertain_match = [ocrText[ocr_index : a[0] - 1].strip(), humanText[hum_index : a[2] - 1].strip(), ocr_index, hum_index]
            if uncertain_match:
                uncertain_ocr = uncertain_match[0]
                uncertain_ocr_list = uncertain_ocr.split(" ")
                uncertain_hum = uncertain_match[1]
                uncertain_hum_list = uncertain_hum.split(" ")
                m = SequenceMatcher(None, uncertain_ocr, uncertain_hum)
                if m.ratio() >= 0.9:
                    if len(uncertain_ocr_list) <= 3:
                        if len(uncertain_ocr_list) == len(uncertain_hum_list):
                            for utoken in range(len(uncertain_ocr_list)):
                                if uncertain_ocr_list[utoken] and uncertain_hum_list[utoken]:
                                    second_pass_ocr = ocr_index + uncertain_ocr.index(uncertain_ocr_list[utoken]), ocr_index + uncertain_ocr.index(uncertain_ocr_list[utoken]) + len(uncertain_ocr_list[utoken])
                                    second_pass_hum = hum_index + uncertain_hum.index(uncertain_hum_list[utoken]), hum_index + uncertain_hum.index(uncertain_hum_list[utoken]) + len(uncertain_hum_list[utoken])
                                    second_pass_alignment.append((second_pass_ocr[0], second_pass_ocr[1], second_pass_hum[0], second_pass_hum[1]))
        aligned_match = ocrText[a[0] : a[1]].strip() + "\t" + humanText[a[2] : a[3]].strip()
        ocr_index = a[1] + 1
        hum_index = a[3] + 1
    alignment += second_pass_alignment
    
    alignment = sorted(alignment, key=lambda x: x[0])

    return alignment
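
One thing to watch in the second pass: `uncertain_ocr.index(token)` always returns the first occurrence, so a gap such as `the the` would yield the same span twice. A possible way around this, sketched with a hypothetical `token_spans` helper, is to search from the end of the previous token:

```python
def token_spans(segment, offset=0):
    """Return (start, end) spans of space-separated tokens, shifted by `offset`."""
    spans = []
    cursor = 0
    for token in segment.split(" "):
        if not token:
            continue  # consecutive spaces produce empty tokens
        start = segment.index(token, cursor)  # search only past the previous token
        end = start + len(token)
        spans.append((offset + start, offset + end))
        cursor = end
    return spans

print(token_spans("the the", offset=10))  # [(10, 13), (14, 17)]
```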

In [14]:
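if 'alignment' not in sampledf:
    sampledf['alignment'] = ""
    sampledf['processed'] = "no"

counter = 0
# iterrows() has no length, so pass the total explicitly for a meaningful progress bar
for index, row in tqdm_notebook(sampledf.iterrows(), total=len(sampledf)):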
    if row['processed'] == "no":
        counter += 1
        ocrText = row['ocrText'].strip(" ")
        humanText = row['corrected'].strip(" ")
        sorted_matches = findOcrHumanMatches(ocrText, humanText)
        sorted_matches = alignmentSecondPass(ocrText, humanText, sorted_matches)
        sampledf.loc[index, 'alignment'] = str(sorted_matches)
        sampledf.loc[index, 'processed'] = 'yes'
        if counter % 100 == 0:
            sampledf.to_pickle("trove_subsample_aligned.pkl")

sampledf.to_pickle("trove_subsample_aligned.pkl")
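
The alignment is stored as `str(sorted_matches)`, so each cell holds text rather than a list of tuples. A minimal sketch of turning such a cell back into tuples with `ast.literal_eval` (the `ast` module is already imported in the first cell); the string below is a shortened, illustrative cell value:

```python
import ast

stored = "[(0, 7, 0, 7), (8, 15, 8, 15)]"  # shortened example of a stored cell
matches = ast.literal_eval(stored)          # parses Python literals only, no code execution
print(matches[1])        # (8, 15, 8, 15)
print(type(matches[0]))  # <class 'tuple'>
```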

In [15]:
sampledf.head()

Unnamed: 0,filePath,articleId,articleType,year,ocrText,humanText,corrected,str_similarity,str_length,quality_band,use_corrected,alignment,processed
26185,./trove_overproof/datasets/dataset1/rawTextAnd...,14330048,Article,1900,"WESTERN SUBURBS COTTAGE HOSPITAL. ,-«A-. Tim c...",WESTERN SUBURBS COTTAGE HOSPITAL. — The commit...,WESTERN SUBURBS COTTAGE HOSPITAL. — The commit...,0.920732,1144,1,0,"[(0, 7, 0, 7), (8, 15, 8, 15), (16, 23, 16, 23...",yes
13445,./trove_overproof/datasets/dataset1/rawTextAnd...,13166969,Article,1868,INTERCOLONIAL NEWS. QUEENSLAND. BRISBANE files...,INTERCOLONIAL NEWS. QUEENSLAND. BRISBANE files...,INTERCOLONIAL NEWS. QUEENSLAND. BRISBANE files...,0.924826,2155,1,0,"[(0, 13, 0, 13), (14, 19, 14, 19), (20, 31, 20...",yes
13729,./trove_overproof/datasets/dataset1/rawTextAnd...,14953246,Article,1908,IN DIVORCE. (Before Mr. Justlco Simpson.) O'BR...,IN DIVORCE. (Before Mr. Justice Simpson.) O'BR...,IN DIVORCE. (Before Mr. Justice Simpson.) O'BR...,0.929938,3240,1,0,"[(0, 2, 0, 2), (3, 11, 3, 11), (12, 19, 12, 19...",yes
13979,./trove_overproof/datasets/dataset1/rawTextAnd...,14947605,Article,1908,"I MILITAEY. MELBOURNE, Saturday. The following...","MILITARY. MELBOURNE, Saturday. The following n...","MILITARY. MELBOURNE, Saturday. The following n...",0.921942,2563,1,0,"[(2, 11, 0, 9), (12, 22, 10, 20), (23, 32, 21,...",yes
22979,./trove_overproof/datasets/dataset1/rawTextAnd...,15708305,Article,1917,I WAR CASUALTIES. -» j KILLED. ! PRIVATE W. PR...,WAR CASUALTIES. -» KILLED. PRIVATE W. PRYOR. M...,WAR CASUALTIES. -» KILLED. PRIVATE W. PRYOR. M...,0.915966,578,1,0,"[(6, 17, 4, 15), (23, 30, 19, 26), (44, 50, 38...",yes


In [16]:
sampledf.shape

(3800, 13)

### Align just one article

In [21]:
ocrText = "i NEW CALEDONIA. ' Io the Editor of the Berala. SIB, -Enclosed is a letter ooncernlug the expedition of the ' »Governor, M. de Siisseî, tbiough the north of Caledonia, which"
humanText = "NEW CALEDONIA. To the Editor of the Herald. SIR, -Enclosed is a letter concerning the expedition of the Governor, M. de Saisset, through the north of Caledonia, which"
sorted_matches = findOcrHumanMatches(ocrText, humanText)
aligned = alignmentSecondPass(ocrText, humanText, sorted_matches)

In [22]:
for a in aligned:
    print(ocrText[a[0]:a[1]], humanText[a[2]:a[3]], a)

CALEDONIA. CALEDONIA. (6, 16, 4, 14)
Editor Editor (26, 32, 22, 28)
of of (33, 35, 29, 31)
the the (36, 39, 32, 35)
Berala. Herald. (40, 47, 36, 43)
SIB, SIR, (48, 52, 44, 48)
-Enclosed -Enclosed (53, 62, 49, 58)
is is (63, 65, 59, 61)
a a (66, 67, 62, 63)
letter letter (68, 74, 64, 70)
ooncernlug concerning (75, 85, 71, 81)
the the (86, 89, 82, 85)
expedition expedition (90, 100, 86, 96)
»Governor, Governor, (110, 120, 104, 113)
M. M. (121, 123, 114, 116)
de de (124, 126, 117, 119)
Siisseî, Saisset, (127, 135, 120, 128)
tbiough through (136, 143, 129, 136)
the the (144, 147, 137, 140)
north north (148, 153, 141, 146)
of of (154, 156, 147, 149)
Caledonia, Caledonia, (157, 167, 150, 160)
which which (168, 173, 161, 166)
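
To see how much of the OCR output survives intact, the aligned spans can be turned into token pairs and scored. The sketch below does this for the opening of the example, with the first five spans copied from the output above. `pair_report` is a hypothetical helper, and the similarity is `SequenceMatcher.ratio()`, the same measure used in the alignment code.

```python
from difflib import SequenceMatcher

ocr = "i NEW CALEDONIA. ' Io the Editor of the Berala."
human = "NEW CALEDONIA. To the Editor of the Herald."
# First five spans from the output above: (ocr_start, ocr_end, hum_start, hum_end)
aligned = [(6, 16, 4, 14), (26, 32, 22, 28), (33, 35, 29, 31),
           (36, 39, 32, 35), (40, 47, 36, 43)]

def pair_report(ocr_text, human_text, alignment):
    """Return (ocr_token, human_token, similarity) for every aligned span."""
    report = []
    for o_start, o_end, h_start, h_end in alignment:
        o_tok, h_tok = ocr_text[o_start:o_end], human_text[h_start:h_end]
        report.append((o_tok, h_tok, SequenceMatcher(None, o_tok, h_tok).ratio()))
    return report

report = pair_report(ocr, human, aligned)
exact = sum(1 for o_tok, h_tok, _ in report if o_tok == h_tok)
print(f"{exact} of {len(report)} aligned tokens are identical")  # 4 of 5 aligned tokens are identical
print([pair[:2] for pair in report if pair[0] != pair[1]])       # [('Berala.', 'Herald.')]
```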
