# Extract from the Programming Historian Tutorial Generating an Ordered Data Set from an OCR Text File

This Notebook is designed to trial the instructions from the Lesson written by Jon Crum for the Programming Historian 


## Importing the libraries 
First we import all the libraries we will need 

In [1]:
import re
from pprint import pprint
from collections import Counter

## Set up the functions we are going to apply 

### Roman to Arabic Number

The function below will search for number expressed in roman numeral and transform them to the correspondin arab number 

In [2]:
def rom2ar(rom):
    """ From the Python tutor mailing list:
    János Juhász janos.juhasz at VELUX.com
    returns arabic equivalent of a Roman numeral """
    roman_codec = {'M':1000, 'D':500, 'C':100, 'L':50, 'X':10, 'V':5, 'I':1}
    roman = rom.upper()
    roman = list(roman)
    roman.reverse()
    decimal = [roman_codec[ch] for ch in roman]
    result = 0

    while len(decimal):
        act = decimal.pop()
        if len(decimal) and act < max(decimal):
            act = -act
        result += act

    return result

### Levestein Function 

The Function below will help us identify the page breaks in the text. Once we have defined it we can call "lev" everytime we need to do so on our text

In [3]:
'''
Code ripped from https://www.datacamp.com/community/tutorials/fuzzy-string-python
'''
def lev(seq1, seq2):
    """ levenshtein_ratio_and_distance:
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of seq1 and the
        first j characters of seq2
    """
    # Initialize matrix of zeros
    rows = len(seq1)+1
    cols = len(seq2)+1
    distance = [[0]*cols for x in range(rows)]

    # Populate matrix of zeros with the indeces of each character of both strings
    for i in range(1, rows):
        for k in range(1,cols):
            distance[i][0] = i
            distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions,insertions and/or substitutions    
    for col in range(1, cols):
        for row in range(1, rows):
            if seq1[row-1] == seq2[col-1]:
                cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                 distance[row][col-1] + 1,          # Cost of insertions
                                 distance[row-1][col-1] + cost)     # Cost of substitutions

    return distance[row][col]

## Search for the headers of each page 

Now that we have our function we start and we do so by searching for the headers using the lev function. Before doing that we need set up our starting variables and the file input and output 

In [7]:
n=0
fin = open("canonical.txt", 'r') # read our OCR output text
fout = open("out1.txt", 'w') # create a new textfile to write to when we're ready
GScriba = fin.readlines() # turn our input file into a list of lines

In [8]:
for line in GScriba:
    # get a Levenshtein distance score for each line in the text
    recto_lev_score = lev(line, 'IL CARTOLARE DI GIOVANNI SCRIBA')
    verso_lev_score = lev(line, 'MARIO CHIAUDANO - MATTIA MORESCO')

    # you want to use a score that's as high as possible,
    # but still finds only potential page header texts.
    if recto_lev_score < 26 :

        # If we increment a variable 'n' to count the number of headers we've found,
        # then the value of that variable should be our page number.
        n += 1
        print(f"recto: {recto_lev_score} {line}")

        # Once we've figured out our optimal 'lev' score, we can 'uncomment'
        # all these `fout.write()` lines to write out our new text file,
        # replacing each header with an easy-to-find string that contains
        # the page number: our variable 'n'.

        fout.write("~~~~~ PAGE %d ~~~~~\n\n" % n)
    elif verso_lev_score < 26 :
        n += 1
        print(f"verso: {verso_lev_score} {line}")
        fout.write("~~~~~ PAGE %d ~~~~~\n\n" % n)
    else:
        fout.write(line)
        pass

print(n)

verso: 13 2 ll!ARIO ClllAUDA:-;O • MATTIA MORESCO

recto: 12 IL CARTOLARB DI GIOYA."\:'\l SCRIBA 3

verso: 18 MAl\10 COIAUDAi'iO - l\lATTIA l\101\ESCO

recto: 10 rL CARTOL~RE DI GIOV.o;:-;r SCRIBA

verso: 8 MARIO CHIAUDAJ';O - MATTlA lllORESCO

recto: 3 IL CARTOLARE DI GIOVANNI SCRIBA 7

verso: 9 8 MARIO BIAUDA O • )JATTIA MORESCO

recto: 9 fL CARTOLARE DI GIOVA!' N I SCRIBA 9·

recto: 4 IL CARTOLARE DI GIOVANNI SCRIBA 11

verso: 7 12 MARIO CBIAUDAì'O - MATTIA MORESCO

recto: 5 lL CARTOLARE DI GIOVANNI SCRIBA 13

verso: 14 MAl\10 CIJIA UDANO - ~AITIA :O.lORESCO

recto: 5 IL CARTOLARE Df GIOVANNI SCRIBA 15

verso: 15 16 MAl\10 CRIAUDANO • MATIIA MOl\llSCO

recto: 16 CL CAl\TOL.\l\E DI CIOVA:\Wl SCRIBA 17

verso: 8 18 MARIO CHIAUDA:-;0 - MATTIA MORESCO

recto: 15 IL CAllTOLAl\E DI GIO\'A:'>l"I SCRIBA 19

verso: 10 20 MARIO CalAUDA1'0 - MATTCA MORESCO

recto: 4 IL CARTOLARE DI GIOVA:"NJ SCRIBA

verso: 5 22 MARIO CHLAUDANO - MATTIA MORESCO

recto: 5 IL CARTOLARE DI CIOVANNI SCRIBA 23

vers

Print the first 300 lines of the results 

In [9]:
with open("out1.txt", 'r') as file:
    for _ in range(300):  # Print first 300 lines
        print(file.readline(), end='')


~ . '" ~~ ,'"':.f~l• l
[io. 1 r.] (1) . I.
.. .. . si obbliga di dare ad Anna figlia del fu Ogerio Musso determinati quantitativi di m erci al ritorno dal viaggio di Alessandria o al S. Giovanni prossimo (dicembre 115-!).
(Test.)es Ann e fi(lie) quondam Ogerii :\fussi] (2) .
.... dom ine :\nne quondam fìlie Ogerii )fm i qu ...... de Gu ido ne ex part e ip ius usque ad adventum navium \lexand(riam) ....
postquam vencrit aut u quc ::id a nctum Iohanncm in istis quatuor m ercihu , videliceL (quarl am in pipere, quart am in braçili sel)-
vatico, quart am in alumine çucarino et quarlam in bono bombace, quod si non fecero pe( nam clupli stipuhmti prom itto) ~n bonis meis. Rctineo tamen mi ch i in predi clis libris si voluero convenire ipsam .\nnam de .... de aliquo quod quondam fìlius meus
sibi remi!'eril de dotibus eiu . . \ ctum ante domum Oonumòei de
Tercio, (mill e imo) centesimo qui nquagesimo quarto, mense deccmbris, ind ic ione secunda .
H.
I consoli di Ccnot•a assn!Mnn cnn senl,,n-:

We can see that the count are all off by 1 because there was no header in the first page so we need to modify the above to start from 2 and therefore set the n to 1

In [12]:
n=1
fin = open("canonical.txt", 'r') # read our OCR output text
fout = open("out1.txt", 'w') # create a new textfile to write to when we're ready
GScriba = fin.readlines() # turn our input file into a list of lines
for line in GScriba:
    # get a Levenshtein distance score for each line in the text
    recto_lev_score = lev(line, 'IL CARTOLARE DI GIOVANNI SCRIBA')
    verso_lev_score = lev(line, 'MARIO CHIAUDANO - MATTIA MORESCO')

    # you want to use a score that's as high as possible,
    # but still finds only potential page header texts.
    if recto_lev_score < 26 :

        # If we increment a variable 'n' to count the number of headers we've found,
        # then the value of that variable should be our page number.
        n += 1
        print(f"recto: {recto_lev_score} {line}")

        # Once we've figured out our optimal 'lev' score, we can 'uncomment'
        # all these `fout.write()` lines to write out our new text file,
        # replacing each header with an easy-to-find string that contains
        # the page number: our variable 'n'.

        fout.write("~~~~~ PAGE %d ~~~~~\n\n" % n)
    elif verso_lev_score < 26 :
        n += 1
        print(f"verso: {verso_lev_score} {line}")
        fout.write("~~~~~ PAGE %d ~~~~~\n\n" % n)
    else:
        fout.write(line)
        pass

print(n)

verso: 13 2 ll!ARIO ClllAUDA:-;O • MATTIA MORESCO

recto: 12 IL CARTOLARB DI GIOYA."\:'\l SCRIBA 3

verso: 18 MAl\10 COIAUDAi'iO - l\lATTIA l\101\ESCO

recto: 10 rL CARTOL~RE DI GIOV.o;:-;r SCRIBA

verso: 8 MARIO CHIAUDAJ';O - MATTlA lllORESCO

recto: 3 IL CARTOLARE DI GIOVANNI SCRIBA 7

verso: 9 8 MARIO BIAUDA O • )JATTIA MORESCO

recto: 9 fL CARTOLARE DI GIOVA!' N I SCRIBA 9·

recto: 4 IL CARTOLARE DI GIOVANNI SCRIBA 11

verso: 7 12 MARIO CBIAUDAì'O - MATTIA MORESCO

recto: 5 lL CARTOLARE DI GIOVANNI SCRIBA 13

verso: 14 MAl\10 CIJIA UDANO - ~AITIA :O.lORESCO

recto: 5 IL CARTOLARE Df GIOVANNI SCRIBA 15

verso: 15 16 MAl\10 CRIAUDANO • MATIIA MOl\llSCO

recto: 16 CL CAl\TOL.\l\E DI CIOVA:\Wl SCRIBA 17

verso: 8 18 MARIO CHIAUDA:-;0 - MATTIA MORESCO

recto: 15 IL CAllTOLAl\E DI GIO\'A:'>l"I SCRIBA 19

verso: 10 20 MARIO CalAUDA1'0 - MATTCA MORESCO

recto: 4 IL CARTOLARE DI GIOVA:"NJ SCRIBA

verso: 5 22 MARIO CHLAUDANO - MATTIA MORESCO

recto: 5 IL CARTOLARE DI CIOVANNI SCRIBA 23

vers

In [13]:
# print the first 300 row again 
with open("out1.txt", 'r') as file:
    for _ in range(300):  # Print first 300 lines
        print(file.readline(), end='')


~ . '" ~~ ,'"':.f~l• l
[io. 1 r.] (1) . I.
.. .. . si obbliga di dare ad Anna figlia del fu Ogerio Musso determinati quantitativi di m erci al ritorno dal viaggio di Alessandria o al S. Giovanni prossimo (dicembre 115-!).
(Test.)es Ann e fi(lie) quondam Ogerii :\fussi] (2) .
.... dom ine :\nne quondam fìlie Ogerii )fm i qu ...... de Gu ido ne ex part e ip ius usque ad adventum navium \lexand(riam) ....
postquam vencrit aut u quc ::id a nctum Iohanncm in istis quatuor m ercihu , videliceL (quarl am in pipere, quart am in braçili sel)-
vatico, quart am in alumine çucarino et quarlam in bono bombace, quod si non fecero pe( nam clupli stipuhmti prom itto) ~n bonis meis. Rctineo tamen mi ch i in predi clis libris si voluero convenire ipsam .\nnam de .... de aliquo quod quondam fìlius meus
sibi remi!'eril de dotibus eiu . . \ ctum ante domum Oonumòei de
Tercio, (mill e imo) centesimo qui nquagesimo quarto, mense deccmbris, ind ic ione secunda .
H.
I consoli di Ccnot•a assn!Mnn cnn senl,,n-:

Now we will need to manually add the tag for page one so 
- go back to the home page
- find out1.txt
- double click on it to open it in an editor
- Add ~~~~~ PAGE 1 ~~~~~ at the top of the document


Now we can deal with the roman numeral at the start of each document 


In [18]:
# At the top, do the importing you need, then define rom2ar() as described above, and then:
n = 0
romstr = re.compile("\s*[IVXLCDM]{2,}")
fin = open("OutCleaned.txt", 'r')
fout = open("out2.txt", 'w')
GScriba = fin.readlines()

for line in GScriba:
    if romstr.match(line):
        rnum = line.strip().strip('.')
        # each time we find a roman numeral by itself on a line we increment n:
        # that's our charter number.
        n += 1
        try:
            # translate the roman to the arabic and it should be equal to n.
            if n != rom2ar(rnum):
                # if it's not, then alert us
                print(f"{n}, there's a charter roman numeral missing?, because line number {GScriba.index(line)} reads: {line}")
                # then set 'n' to the right number
                n = rom2ar(rnum)
        except KeyError:
            print(f"{n}, KeyError, line number {GScriba.index(line)} reads: {line}")

1, there's a charter roman numeral missing?, because line number 12 reads: II

5, there's a charter roman numeral missing?, because line number 73 reads: VI

10, there's a charter roman numeral missing?, because line number 148 reads: XI

15, there's a charter roman numeral missing?, because line number 215 reads: XVI.

20, there's a charter roman numeral missing?, because line number 299 reads: XXI.

23, there's a charter roman numeral missing?, because line number 348 reads: XXIV

25, there's a charter roman numeral missing?, because line number 389 reads: XXVI

30, there's a charter roman numeral missing?, because line number 467 reads: XXXI.

35, there's a charter roman numeral missing?, because line number 546 reads: XXXVI

38, there's a charter roman numeral missing?, because line number 423 reads: XXVIII

29, there's a charter roman numeral missing?, because line number 578 reads: XXXIX

48, there's a charter roman numeral missing?, because line number 772 reads: XLIX

50, there

there are a lot of error and unfortunately we need to fix them manually (focus on the one that say Key error)
- go back to the home page
- find out1.txt
- double click on it to open it in an editor
- Using the list of error above clean the out1.txt file
- If it is taking too much time you can use the OutCleaned.txt that we have created that contain the first 100 chapters cleaned


Let's add the translated numbers to something easy to find later

In [22]:
n = 0
romstr = re.compile("\s*[IVXLCDM]{2,}")
fin = open("OutCleaned.txt", 'r')
fout = open("out2.txt", 'w')
GScriba = fin.readlines()

for line in GScriba:
    if romstr.match(line):
        rnum = line.strip().strip('.')
        num = rom2ar(rnum)
        fout.write(f"[~~~~ GScriba_{rnum} :::: {num} ~~~~]\n")
    else:
        fout.write(line)

In [23]:
# print the first 300 row again 
with open("out2.txt", 'r') as file:
    for _ in range(300):  # Print first 300 lines
        print(file.readline(), end='')


~~~~~ PAGE 1 ~~~~~

[io. 1 r.] (1) . I.
.. .. . si obbliga di dare ad Anna figlia del fu Ogerio Musso determinati quantitativi di m erci al ritorno dal viaggio di Alessandria o al S. Giovanni prossimo (dicembre 115-!).
(Test.)es Ann e fi(lie) quondam Ogerii :\fussi] (2) .
.... dom ine :\nne quondam fìlie Ogerii )fm i qu ...... de Gu ido ne ex part e ip ius usque ad adventum navium \lexand(riam) ....
postquam vencrit aut u quc ::id a nctum Iohanncm in istis quatuor m ercihu , videliceL (quarl am in pipere, quart am in braçili sel)-
vatico, quart am in alumine çucarino et quarlam in bono bombace, quod si non fecero pe( nam clupli stipuhmti prom itto) ~n bonis meis. Rctineo tamen mi ch i in predi clis libris si voluero convenire ipsam .\nnam de .... de aliquo quod quondam fìlius meus
sibi remi!'eril de dotibus eiu . . \ ctum ante domum Oonumòei de
Tercio, (mill e imo) centesimo qui nquagesimo quarto, mense deccmbris, ind ic ione secunda .

[~~~~ GScriba_II :::: 2 ~~~~]

I consoli di Ccno

Again almost good but need to fix a couple of things
- go back to the home page
- find out2.txt
- double click on it to open it in an editor
- Fix the Roman numeral headers that did not worked 
- Again there is a cleaned version named Cleaned2.txt if you need 

Now we search for folio notations

In [24]:

# note the optional quantifiers '\s?'. We want to find as many as we can, and
# the OCR is erratic about whitespace, so our regex is permissive. But as
# you find and correct these strings, you will want to make them consistent.

fol = re.compile("\[fo\.\s?\d+\s?[rv]\.\s?\]")

for line in GScriba:
    if fol.match(line):
        # since GScriba is a list, we can get the index of any of its members to find the line number in our input file.
        print (GScriba.index(line), line)

159 [fo. 2 r.] 



In [28]:
for line in GScriba:
    all = fol.findall(line)
    if len(all) > 1:
        print (GScriba.index(line), line)

## Find and normalize the Italian summary lines.
Find-and-normalize-the-italian-summary-lines
This important line is invariably the first one after the charter heading.

In [44]:
slug_and_firstline = re.compile(r"""
    (\[~~~~\sGScriba_)  # matches the "[~~~~ GScriba_" bit
    (.*)                # matches the charter's roman numeral
    \s::::\s            # matches the " :::: " bit
    (\d+)               # matches the arabic charter number
    \s~~~~\]\n          # matches the last " ~~~~ " bit and the line ending
    (.*)                # matches all of the next line up to:
    (\(\d?.*\d+\))      # the paranthetical expression at the end
    """, re.VERBOSE)

In [47]:
num_firstlines = 0
n=0
fin = open("Cleaned2.txt", 'r')
# NB: GScriba is not a list of lines this time, but a single big string.
GScriba = fin.read()

# finditer() creates an iterator 'i' that we can do a 'for' loop over.
i = slug_and_firstline.finditer(GScriba)

# each element 'x' in that iterator is a regex match object.
for x in i:
    # count the summary lines we find. Remember, we know how many
    # there should be, because we know how many charters there are.
    num_firstlines += 1

    chno = int(x.group(3)) # our charter number is a string, we need an integer

    # chno should equal n + 1, if it doesn't, report to us
    if chno != n + 1:
        print(f"problem in charter: {(n + 1)}") #NB: this will miss consecutive problems.
    # then set n to the right charter number
    n = chno

# print out the number of summary lines we found
print(f"number of italian summaries: {num_firstlines}")

problem in charter: 1
problem in charter: 46
problem in charter: 57
problem in charter: 63
problem in charter: 71
problem in charter: 74
problem in charter: 77
problem in charter: 83
problem in charter: 88
number of italian summaries: 11


In [48]:
# Don't forget to import the Counter module:
from collections import Counter
fin = open("out3.txt", 'r')
GScriba = fin.readlines() # GScriba is a list again
r = re.compile("\(\d{1,2}\)") # there's lots of ways for OCR to screw this up, so be alert.
pg = re.compile("~~~~~ PAGE \d+ ~~~~~")
pgno = 0

pgfnlist = []
# remember, we're processing lines in document order. So for each page
# we'll populate a temporary container, 'pgfnlist', with values. Then
# when we come to a new page, we'll report what those values are and
# then reset our container to the empty list.

for line in GScriba:
    if pg.match(line):
        # if this test is True, then we're starting a new page, so increment pgno
        pgno += 1

        # if we've started a new page, then test our list of footnote markers
        if pgfnlist:
            c = Counter(pgfnlist)

            # if there are fn markers that do not appear exactly twice,
            # then report the page number to us
            if 1 in c.values(): print(pgno, pgfnlist)

            # then reset our list to empty
            pgfnlist = []

    # for each line, look for ALL occurences of our footnote marker regex
    i = r.finditer(line)
    for mark in [eval(x.group(0)) for x in i]:
        # and add them to our list for this page
        pgfnlist.append(mark)

2 [1, 2, 2]
3 [1]
10 [1, 2, 2]
14 [1, 1, 2]
16 [1]
18 [1]
21 [1, 3, 1, 2, 3]
24 [1, 2, 1, 2, 3, 4]
25 [1, 2, 3, 6, 1, 2, 3, 4, 5, 6]
26 [1, 2, 3, 1, 3, 4]
27 [1, 3, 2, 3]
28 [1, 2]
29 [1, 2, 2]
35 [1, 2, 4, 1, 2, 3, 4]
36 [1, 3, 1, 2, 3]
38 [1, 2, 1, 2, 3]
42 [1]
46 [2, 1, 2]
49 [1]
52 [1, 2, 2]


In [None]:
## Creating a dictionary

In [56]:


# Regular expression patterns
slug = re.compile("(\[~~~~\sGScriba_)(.*)\s::::\s(\d+)\s~~~~\]")  # Pattern to match a new charter
fol = re.compile("\[fo\.\s?\d+\s?[rv]\.\s?\]")  # Pattern to match a folio
pgbrk = re.compile("~~~~~ PAGE (\d+) ~~~~~")  # Pattern to match a page break

# Open the file
with open("Cleaned2.txt", 'r') as fin:
    GScriba = fin.readlines()

# Global variables with starting values
n = 0
this_folio = '[fo. 1 r.]'
this_page = 1
charters = {}  # Dictionary to store charter information

for line in GScriba:
    if fol.match(line):
        this_folio = fol.match(line).group(0)
        continue

    if slug.match(line):
        m = slug.match(line)
        chid = "GScriba_" + m.group(2)
        chno = int(m.group(3))

        # Ensure proper handling if chno is in charters
        if chno in charters:
            # Key exists in the dictionary, so access it safely
            d = charters[chno]
        else:
            # Handle the case where the key doesn't exist in the dictionary
            print(f"Key {chno} not found in charters dictionary.")
            continue  # Skip further processing for this line

        d['footnotes'] = []
        d['chid'] = chid
        d['chno'] = chno
        d['folio'] = this_folio
        d['pgno'] = this_page

        if re.match('^\(\d+\)', line):
            continue

        if pgbrk.match(line):
            this_page = int(pgbrk.match(line).group(1))
        elif fol.search(line):
            this_folio = fol.search(line).group(0)
            templist.append(line)
        else:
            templist.append(line)

        d['text'] = [x for x in templist if not x == '\n']



Key 1 not found in charters dictionary.
Key 2 not found in charters dictionary.
Key 3 not found in charters dictionary.
Key 4 not found in charters dictionary.
Key 5 not found in charters dictionary.
Key 6 not found in charters dictionary.
Key 7 not found in charters dictionary.
Key 8 not found in charters dictionary.
Key 9 not found in charters dictionary.
Key 10 not found in charters dictionary.
Key 11 not found in charters dictionary.
Key 12 not found in charters dictionary.
Key 13 not found in charters dictionary.
Key 14 not found in charters dictionary.
Key 15 not found in charters dictionary.
Key 16 not found in charters dictionary.
Key 17 not found in charters dictionary.
Key 18 not found in charters dictionary.
Key 19 not found in charters dictionary.
Key 20 not found in charters dictionary.
Key 21 not found in charters dictionary.
Key 22 not found in charters dictionary.
Key 23 not found in charters dictionary.
Key 24 not found in charters dictionary.
Key 25 not found in chart