# Explore aligned data

Once the aligned data has been [downloaded from Zenodo](), the dataframes can be explored as shown here.

In [2]:
# where is data downloaded?
data_download_dir = '/Users/jnaiman/Dropbox/wwt_image_extraction/OCRPostCorrection/alignments/'

In [26]:
import pandas as pd
from IPython.display import display, Latex

# debug
from importlib import reload
import utils
reload(utils)

from utils import select_wordtype

In [4]:
test = pd.read_csv(data_download_dir + 'test_masked_n10000_20230503.csv')

In [5]:
test.head()

Unnamed: 0,aligned sentences source,aligned sentences target,sentences source,sentences target,aligned sentences source types,aligned sentences target types,sentences source types,sentences target types
0,A histogram of the ^^^^^^^Va^^^ /slope for dw...,A histogram of the $_{\rm max}$ /slope for dw...,A histogram of the Va /slope for dwarf irregu...,A histogram of the $_{\rm max}$ /slope for dw...,W WWWWWWWWW WW WWW ^^^^^^^II^^^ WWWWWW WWW WW...,W WWWWWWWWW WW WWW IIIIIIIIIIII WWWWWW WWW WW...,W WWWWWWWWW WW WWW II WWWWWW WWW WWWWW WWWWWW...,W WWWWWWWWW WW WWW IIIIIIIIIIII WWWWWW WWW WW...
1,Observations were carried out. using ^^a log ...,Observations were carried out@ using – a log ...,Observations were carried out. using a log of...,Observations were carried out using – a log o...,WWWWWWWWWWWW WWWW WWWWWWW WWWW WWWWW ^^W WWW ...,WWWWWWWWWWWW WWWW WWWWWWW WWW@ WWWWW W W WWW ...,WWWWWWWWWWWW WWWW WWWWWWW WWWW WWWWW W WWW WW...,WWWWWWWWWWWW WWWW WWWWWWW WWW WWWWW W W WWW W...
2,Compared to a smooth polynomial. the flat fie...,"Compared to a smooth polynomial, the flat fie...",Compared to a smooth polynomial. the flat fie...,"Compared to a smooth polynomial, the flat fie...",WWWWWWWW WW W WWWWWW WWWWWWWWWWW WWW WWWW WWW...,WWWWWWWW WW W WWWWWW WWWWWWWWWWW WWW WWWW WWW...,WWWWWWWW WW W WWWWWW WWWWWWWWWWW WWW WWWW WWW...,WWWWWWWW WW W WWWWWW WWWWWWWWWWW WWW WWWW WWW...
3,2006) confirmed. lis scenario.,2006) confirmed this scenario.,2006) confirmed. lis scenario.,2006) confirmed this scenario.,WWWWW WWWWWWWWWW WWW WWWWWWWWW,WWWWW WWWWWWWWW WWWW WWWWWWWWW,WWWWW WWWWWWWWWW WWW WWWWWWWWW,WWWWW WWWWWWWWW WWWW WWWWWWWWW
4,Thus. slieht differences in ihe ^^^©C'a ^^^^v...,"Thus, slight differences in the $\Sigma Ca$ v...",Thus. slieht differences in ihe ©C'a value of...,"Thus, slight differences in the $\Sigma Ca$ v...",WWWWW WWWWWW WWWWWWWWWWW WW WWW ^^^IIIII^^^^W...,WWWWW WWWWWW WWWWWWWWWWW WW WWW IIIIIIIIIII W...,WWWWW WWWWWW WWWWWWWWWWW WW WWW IIIIIWWWWW WW...,WWWWW WWWWWW WWWWWWWWWWW WW WWW IIIIIIIIIII W...


"Raw" source (OCR) and target (synthetic ground truth, SGT) sentences are stored in `sentences source` and `sentences target`:

In [12]:
i = 4
print('OCR : ', test.iloc[i]['sentences source'])
print('SGT : ', test.iloc[i]['sentences target'])

OCR :   Thus. slieht differences in ihe ©C'a value of one of them could change the derived W" significantly.
SGT :   Thus, slight differences in the $\Sigma Ca$ value of one of them could change the derived $W'$ significantly.


The SGT instances contain the LaTeX formatting needed to display math formulas, for example:

In [15]:
display(Latex(str(test.iloc[i]['sentences target'])))

<IPython.core.display.Latex object>

Also provided are sentences which have been aligned using the [Levenshtein edit distance Python package](https://github.com/maxbachmann/Levenshtein):

In [16]:
print('OCR : ', test.iloc[i]['aligned sentences source'])
print('SGT : ', test.iloc[i]['aligned sentences target'])

OCR :   Thus. slieht differences in ihe ^^^©C'a ^^^^value of one of them could change the derived ^W^" significantly.
SGT :   Thus, slight differences in the $\Sigma Ca$ value of one of them could change the derived $W'$ significantly.


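Here, "^" characters in the *OCR* sentence mark insertions (positions where the SGT has a character that the OCR lacks), and "@" characters in the *SGT* sentence mark deletions (positions where the OCR has a character that the SGT lacks). For example: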

In [21]:
i2 = 9
print('OCR : ', test.iloc[i2]['aligned sentences source'])
print('SGT : ', test.iloc[i2]['aligned sentences target'])

OCR :   Taken as a whole. Figure ^^^^^^^^^^5. is evidence for a similar PAIL size distribution in spirals. AGN. clwarls. and HH II regions.
SGT :   Taken as a whole, Figure \ref{smith1} is evidence for a similar PAH@ size distribution in spirals, AGN, d@warfs, and H@ II regions.
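To make the marker convention concrete, the sketch below tallies edit operations from one aligned pair. The function name `count_alignment_edits` and its arguments are hypothetical (they are not part of `utils`), and the sketch assumes the two aligned strings have equal length. It also treats every unmatched "^" in the OCR string and "@" in the SGT string as padding, so it would miscount sentences that contain those characters literally.

```python
def count_alignment_edits(aligned_source, aligned_target,
                          gap_source="^", gap_target="@"):
    """Tally matches and edits between two aligned, equal-length strings.

    Hypothetical helper for illustration only. Assumes `gap_source` pads the
    OCR string and `gap_target` pads the SGT string.
    """
    if len(aligned_source) != len(aligned_target):
        raise ValueError("aligned strings must have the same length")

    counts = {"match": 0, "substitution": 0, "insertion": 0, "deletion": 0}
    for s, t in zip(aligned_source, aligned_target):
        if s == t:
            counts["match"] += 1
        elif s == gap_source:
            # the SGT has a character that the OCR lacks
            counts["insertion"] += 1
        elif t == gap_target:
            # the OCR has a character that the SGT lacks
            counts["deletion"] += 1
        else:
            counts["substitution"] += 1
    return counts


# fragment taken from the aligned example above
edits = count_alignment_edits("PAIL size", "PAH@ size")
print(edits)
```

On the full dataframe, the same function could be applied row by row to the `aligned sentences source` and `aligned sentences target` columns.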


## Sentence types

Additionally, the "type" of each character in a sentence is recorded in a companion string, one tag per character. For example:

In [23]:
test.iloc[i]['sentences target types']

' WWWWW WWWWWW WWWWWWWWWWW WW WWW IIIIIIIIIII WWWWW WW WWW WW WWWW WWWWW WWWWWW WWW WWWWWWW IIII WWWWWWWWWWWWWW'

Here, `W` marks a character that belongs to a word (punctuation included, as in `Thus,`) and `I` marks a character inside an inline expression such as `$\Sigma Ca$`.

If we are not sure what a tag means, we can look up its name with `select_wordtype`:

In [24]:
char_list = list(test.iloc[i]['sentences target types'])
char_list[:5]

[' ', 'W', 'W', 'W', 'W']

In [28]:
count_types = {}
for c in char_list:
    t = select_wordtype(c)
    # don't count spaces
    if t != ' ':
        count_types[t] = count_types.get(t, 0) + 1

In [29]:
count_types

{'word': 78, 'inline': 15}
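Because the tag string lines up character by character with its sentence, the tags can also be used to pull out the tagged text itself. The sketch below is a minimal illustration: `spans_of_type` is a hypothetical helper (not part of `utils`), and the two strings reproduce only the beginning of the SGT sentence and tag string shown above.

```python
from itertools import groupby


def spans_of_type(sentence, types, tag):
    """Return the substrings of `sentence` whose characters carry `tag`.

    Hypothetical helper; assumes `sentence` and `types` are aligned
    character by character and therefore have the same length.
    """
    if len(sentence) != len(types):
        raise ValueError("sentence and types must have the same length")

    spans = []
    pos = 0
    # walk over runs of identical tags and keep the runs that match `tag`
    for key, run in groupby(types):
        n = len(list(run))
        if key == tag:
            spans.append(sentence[pos:pos + n])
        pos += n
    return spans


# start of the SGT sentence for i = 4 (note the leading space)
sentence = r" Thus, slight differences in the $\Sigma Ca$ value"
# matching start of its tag string, built from run lengths to avoid miscounting
types = " " + " ".join(["W" * 5, "W" * 6, "W" * 11, "W" * 2, "W" * 3, "I" * 11, "W" * 5])

inline_spans = spans_of_type(sentence, types, "I")
word_spans = spans_of_type(sentence, types, "W")
print(inline_spans)
print(word_spans)
```

The same call with `test.iloc[i]['sentences target']` and `test.iloc[i]['sentences target types']` should work on full rows, provided the two columns have equal length for that row.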

Once the SGT sentence has been aligned with the OCR sentence, the SGT tags can be carried over to the OCR characters, which gives the OCR character tags:

In [30]:
test.iloc[i]['sentences source types']

' WWWWW WWWWWW WWWWWWWWWWW WW WWW IIIIIWWWWW WW WWW WW WWWW WWWWW WWWWWW WWW WWWWWWW IW WWWWWWWWWWWWWW'
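One plausible way to carry out this back-tracking is sketched below. It is an illustration, not the code that produced the dataset: `propagate_types` and `drop_gaps` are hypothetical helpers, and the rule for "@" positions (reuse the previous tag) is an assumption, chosen because it reproduces the tag strings shown in the `test.head()` output for the two sample fragments used here.

```python
def propagate_types(aligned_source, aligned_target_types, gap="^", pad="@"):
    """Copy SGT tags onto the aligned OCR string, position by position.

    Hypothetical sketch. Positions where the OCR string holds the gap marker
    keep the marker; positions where the SGT tag is the pad marker reuse the
    previous tag (an assumption, see the lead-in).
    """
    if len(aligned_source) != len(aligned_target_types):
        raise ValueError("inputs must have the same length")

    out = []
    previous = " "
    for s_char, t_tag in zip(aligned_source, aligned_target_types):
        if s_char == gap:
            out.append(gap)  # no OCR character at this position
            continue
        if t_tag == pad:
            t_tag = previous  # assumed rule for OCR-only characters
        out.append(t_tag)
        previous = t_tag
    return "".join(out)


def drop_gaps(aligned_source, aligned_types, gap="^"):
    """Remove the tags that sit on gap positions of the aligned OCR string."""
    return "".join(tag for s_char, tag in zip(aligned_source, aligned_types)
                   if s_char != gap)


# fragment of row 4: "$\Sigma Ca$ v" in the SGT is tagged with 11 I's, a space, and a W
aligned_ocr = "^^^©C'a ^^^^v"
aligned_sgt_types = "I" * 11 + " W"

aligned_ocr_types = propagate_types(aligned_ocr, aligned_sgt_types)
ocr_types = drop_gaps(aligned_ocr, aligned_ocr_types)
print(aligned_ocr_types)
print(ocr_types)
```

Dropping the gap positions is what removes the space between the inline tags and the following word in `sentences source types`: the SGT space falls on a "^" position of the OCR string, so no tag survives for it.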

We can also use these tags to find particular kinds of words.  For example, let's look for sentences that contain hyphenated words:

In [54]:
ilimit = 5  # stop after finding this many sentences

icount = 0
# use irow so the example index i set earlier is not overwritten
for irow in range(len(test)):
    d = test.iloc[irow]
    # report the first hyphen-related tag in this sentence, if there is one
    for c in d['sentences target types']:
        wordtype = select_wordtype(c)
        if 'hyp-' in wordtype:
            print('word type:', wordtype, 'char=', c)
            print('OCR   : ', d['sentences source'])
            print('SGT   : ', d['sentences target'])
            print('types : ', d['sentences target types'])
            print('')
            icount += 1
            break
    if icount >= ilimit:
        break

word type: hyp-word char= w
OCR   :   When the mass of such a shell reaches some critical value (presumably of the order of 0.1. )) the shell can become unstable in respect to recombining into the "iron group elements (specifically intoNi) to supply the stalled shock wave with the energy of =10" erg necessary to trigger the supernova.
SGT   :   When the mass of such a shell reaches some critical value (presumably of the order of $\approx$ ) the shell can become unstable in respect to recombining into the ”iron group" elements (specifically into }) to supply the stalled shock wave with the energy of $\approx 10^{51}$ erg necessary to trigger the supernova.
types :   WWWW WWW WWWW WW WWWW W WWWWW WWWWWWW WWWW WWWWWWWW WWWWW WWWWWWWWWWW WW WWW WWWWW WW IIIIIIIII W WWW WWWWW WWW WWWWWW WWWWWWWW WW WWWWWWW WW WWWWWWWWWWW WWWW WWW WWWWW WWWWWW WWWWWWWW WWWWWWWWWWWWW WWWW wW WW WWWWWW WWW WWWWWWW WWWWW WWWW WWWW WWW WWWWWW WW IIIIIIIIIIIIIIIII WWW WWWWWWWWW WW WWWWWWW WWW WWWWWWWWWW

word typ