In this notebook, we match the law titles produced by the xslt processing to the manually annotated data. 
This lets us check to what extent the appropriate titles have been found.
Furthermore, we can connect the detected laws to the annotations in order to create labeled data. 

In [1]:
import pandas as pd
import xml.etree.ElementTree as ET
import io
import difflib 
import nltk 


In [3]:
# load ground truth (gt)
#dt = {'pdfpaginastart':int, 'Volledigetiteltekst':str}

with open('../categories/Gelderlandboek_cat2.csv', 'r') as f: 
    gt= pd.read_csv(f, sep=';', nrows=435 ) # file has empty lines
    
gt.rename(columns={'pdfpaginastart':'page', 'Volledigetiteltekst':'title'}, inplace=True )
gt.head()


Unnamed: 0,TranskribusID,NUMBERTEST,Gewest,Instrument,Instrumentoverige,Jaarhand,Maand,Dagnr,title,page,...,CodeMPIER1,CODEMPIER2,CODEMPIER3,CODEMPIER4,CODEMPIER5,CODEMPIER6,CODEMPIER7,CODEmpIER8,CODEMPIER9,CODEMPIER10
0,168882,1,1,2,,1581,6,,Resolutie tegens het slaen der Hegh-munten,11,...,4ESP7D7,,,,,,,,,
1,168882,2,1,1,,1581,7,31.0,Placaet waer by alle Vasallen en Leenplichtige...,11,...,1SO2K2,1SO2B1,,,,,,,,
2,168882,3,1,1,,1581,10,31.0,Placaet op den prys van den gelde en tegens he...,12,...,4ESP7D7,4ESP7D9,4ESP7D11,,,,,,,
3,168882,4,1,1,,1581,11,7.0,"Placaet op 't vangen der vyanden, en tegens he...",15,...,2PSO2G3,2PSO2J,2PSO2E2,2PSO2E7,2PSO2K4,,,,,
4,168882,5,1,1,,1582,1,24.0,Placaet inhoudende confiscatie en verval der l...,16,...,1SO2B1,1SO2K2,2PSO4A18,2PSO2E2,2PSO4K4,2PSO4K8,2PSO2J,,,


In [5]:
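# load processed info (xslt output)
def iter_docs(root):
    # yield one dict per Header element: root attributes, header attributes and header text
    root_attr = root.attrib
    for doc in root.iter('Header'):
        doc_dict = root_attr.copy()
        doc_dict.update(doc.attrib)
        doc_dict['page'] = int(doc_dict['page'])  # pages are compared numerically later on
        doc_dict['title_start'] = doc.text  # only the first line of the title
        yield doc_dict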

with open('process_page_output.xml','r') as f:
    etree = ET.parse(f)
    
doc_df = pd.DataFrame(list(iter_docs(etree.getroot())))
doc_df.rename(columns={'title_start':'title'}, inplace=True)

doc_df.head()

Unnamed: 0,page,size,title,type
0,11,3.5294117647058822,HET TWEEDE DEEL,start-law-2
1,11,1.6470588235294117,Resolutie tegens het slaen der Hegh-munten,start-law
2,11,1.2941176470588236,DEEEngaende die munts-,start-law-2
3,11,1.6470588235294117,Placaet waer by alle Vasallen en,start-law
4,12,1.5294117647058822,Placaet op den prys van den,start-law
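
To make the expected shape of the xslt output concrete, here is a minimal sketch that parses an inline sample instead of the real file. The sample XML is hypothetical: the `Document` and `Page` tag names and the nesting are assumptions, and only the `Header` elements with `page`, `size` and `type` attributes are suggested by the columns shown above. `iter_headers` is an illustrative name, not a function from this notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real process_page_output.xml may be structured differently.
SAMPLE = """<Document>
  <Page>
    <Header page="11" size="1.647" type="start-law">Resolutie tegens het slaen der Hegh-munten</Header>
  </Page>
  <Header page="12" size="1.529" type="start-law">Placaet op den prys van den</Header>
</Document>"""

def iter_headers(root):
    # root.iter() walks the whole tree, so nesting depth does not matter
    for header in root.iter('Header'):
        row = dict(root.attrib)           # attributes shared by all headers (none in this sample)
        row.update(header.attrib)         # page, size, type
        row['page'] = int(row['page'])    # attribute values are strings in XML
        row['title'] = (header.text or '').strip()
        yield row

rows = list(iter_headers(ET.fromstring(SAMPLE)))
for row in rows:
    print(row['page'], row['type'], row['title'])
```

Passing `rows` to `pd.DataFrame` would give a frame with the same `page`, `size`, `title` and `type` columns as above.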


Let's see to what extent the same laws are in both dataframes.
NB:
- There may be errors in the titles of either one, so we use fuzzy matching based on string edit distance. 
- The xslt output only has the first line of the title, so we compare it to a prefix of the annotated title.
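
The two points above can be illustrated with a small, dependency-free sketch. `weighted_edit_distance` is a hypothetical stand-in written for this example; it follows the same cost scheme the notebook requests from `nltk.edit_distance` (insertions and deletions cost 1, substitutions cost `substitution_cost`), but it is not the nltk implementation. The title pair is taken from the data in this notebook, with the annotated title cut off where the display truncates it.

```python
def weighted_edit_distance(a, b, substitution_cost=1):
    """Edit distance with unit insert/delete cost and a configurable substitution cost."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            substitute = prev[j - 1] + (0 if ca == cb else substitution_cost)
            curr.append(min(delete, insert, substitute))
        prev = curr
    return prev[-1]

header = "Resolutie dat den krychs-rtaet"        # first title line from the xslt output
annotated = "Resolutie dat den Krygs-Raet geen"  # start of the annotated title

# Cut the annotated title to the header length, so that the missing
# remainder of the title is not counted as a series of insertions.
truncated = annotated[:len(header)]

d_full = weighted_edit_distance(header, annotated, substitution_cost=3)
d_trunc = weighted_edit_distance(header, truncated, substitution_cost=3)
print(d_full, d_trunc)
```

For this pair the truncated comparison gives 10, the same distance the notebook reports for it further down, while the untruncated comparison is larger purely because of the extra characters.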

In [6]:
def match_lr(left, right):
    # find corresponding law in right for every entry in left (left is modified in place)
    for l_i, row in left.iterrows(): 
        n = row.page 
        # Candidates: only look in surrounding pages (+/- 2) to find the corresponding law
        # (both for efficiency and correctness)
        r_i = (right.page > n-3) & (right.page < n+3)

        if not r_i.any(): # nothing to match with 
            left.loc[l_i,'closest_i'] = -1
            continue

        # determine longest prefix that could match
        prefix = row.title[:max(right.loc[r_i,'title'].str.len())]
        # determine distance to candidates; each candidate is cut to the length of the left title
        # NB: penalty on substitution bigger than on deletion, because one title might be a prefix of the other
        distances = right.loc[r_i,'title'].apply(
            lambda x: nltk.edit_distance(prefix, x[:len(row.title)], substitution_cost=3))

        ii = distances.idxmin() # index of the closest match in the right frame    
        if distances[ii] > 14: # match isn't good enough
            left.loc[l_i,'closest_i'] = -1 
        else:
            # save index + corresponding distance in left df
            left.loc[l_i,'closest_i'] = ii
            left.loc[l_i,'distance'] = distances[ii]
    

First, see if we can find a match for every title from the processing.

In [7]:
match_lr(doc_df, gt)
merged_1 = doc_df.add_suffix( '_L').merge(gt.add_suffix('_R'), how='left', left_on='closest_i_L', right_index=True)


In [9]:
# have a look at the worst matches
merged_1[['title_L', 'title_R', 'distance_L']].sort_values(by='distance_L', ascending=False).head(10)

Unnamed: 0,title_L,title_R,distance_L
244,Placaet waer by to0. Ryx-,Placaet waer by hondert ryxdaelers belooft wor...,14.0
77,Resolutie waer by Registers en,Resolutie dat registers en leger-boecken over ...,14.0
401,Resolutie om by ’t Hoff in Leen-,"Resolutie dat 't Hoff in leen-saken vier, vyff...",14.0
245,Placaet van de Hooghe ende,Placaet van de Heeren Staten Generael tegens J...,14.0
291,Placaet rahende N aengebael-,Placaet raeckende 31 aengehaelde coper-vaten t...,12.0
324,Ordonnantie van den Hove en,Ordonnantie van Hoff en Reecken-kamer op het k...,12.0
309,Publicatie waer by Doctor Lu-,"Placaet waer by Doctor Lucas Harckens, en Gera...",12.0
158,Kercken- ordeninghe s goet ge-,Kercken-ordeninge goet-gevonden ende gearreste...,12.0
344,Placaet tegens de beerlose vaga¬,"Placaet tegens de heerlose knechten, vagabodne...",12.0
179,Resolutie dat den krychs-rtaet,Resolutie dat den Krygs-Raet geen jurisdictie ...,10.0


In [10]:
missing_links = merged_1[merged_1.distance_L.isnull()] # false positives or problems in gt?
ml = len(missing_links)
print("Precision is "+ str(1-ml/len(doc_df)) +".", ml, "missing links (False positives):" )
print(missing_links[['title_L', 'title_R', 'distance_L']])

Precision is 0.9257641921397379. 34 missing links (False positives):
                                         title_L title_R  distance_L
0                                HET TWEEDE DEEL     NaN         NaN
2                         DEEEngaende die munts-     NaN         NaN
24            Resolutie van dat geestelicke goe-     NaN         NaN
83                  Op-ten tweden van profanatie     NaN         NaN
103                    Placaet van myn Heeren de     NaN         NaN
104     Staten Generael van de Vereentrde Neder-     NaN         NaN
139               Artyckelen van den Gelderschen     NaN         NaN
148                Son-en feest ende bede-dagen.     NaN         NaN
159          Van de Kerckelycke t samen-kemsten.     NaN         NaN
160             Volgen de articulen daer inne de     NaN         NaN
239            Nxvidkt uyttet reces-sito Lqinst-     NaN         NaN
259        BBaer en Latum aen de Graefschap Zut¬     NaN         NaN
272              Exrract uyttet re
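
The precision above is simply one minus the share of headers that did not receive a link. As a minimal sketch of that arithmetic, assuming that an unmatched entry is represented by a NaN (or missing) distance as in `merged_1`; `link_rate` and the toy distances are illustrative, not part of the notebook:

```python
import math

def link_rate(distances):
    """Return (share of entries with a match, number of entries without one)."""
    missing = sum(1 for d in distances if d is None or math.isnan(d))
    return 1 - missing / len(distances), missing

# toy data: three matched entries and one unmatched entry
toy_distances = [0.0, 2.0, float('nan'), 14.0]
rate, missing = link_rate(toy_distances)
print(rate, missing)  # 0.75 1
```

Applied to the distances of the xslt titles this rate is the precision; applied to the distances of the ground-truth titles it is the recall.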

Now, see if we can find a match for every title in the ground truth.

In [11]:
match_lr(gt, doc_df)
merged_2 = gt.add_suffix( '_L').merge(doc_df.add_suffix( '_R'), how='left', left_on='closest_i_L', right_index=True)

In [12]:
missing_links = merged_2[merged_2.distance_L.isnull()] # false negatives
# because of processing or HTR errors?
# (or problems in gt?)
ml = len(missing_links)
print("Recall is "+ str(1-ml/len(gt)) +".")
print(len(missing_links), "missing links (False negatives): ")
print(missing_links[['title_L', 'title_R', 'distance_L']])

Recall is 0.9540229885057472.
20 missing links (False negatives): 
                                               title_L title_R  distance_L
2    Placaet op den prys van den gelde en tegens he...     NaN         NaN
73   Placaet tegens eenige nieuwe heele en halve sc...     NaN         NaN
76   Resolutie dat registers en leger-boecken over ...     NaN         NaN
88              Resolutie aengaende de keursmatigheyt.     NaN         NaN
102  Placaet van haer Hoogh Mogende op de paspoorte...     NaN         NaN
108  Placaet tegens vagabonden, lantlopers, bedelae...     NaN         NaN
133  Resolutie inhoudnede dat tienden geen schattin...     NaN         NaN
139  Eerste en tweede affischeyt van wegen de Heere...     NaN         NaN
146  Placeaet tegens de gestrooyde lasteringen en c...     NaN         NaN
241  Placaet waer by hondert ryxdaelers belooft wor...     NaN         NaN
242  Placaet van de Heeren Staten Generael tegens J...     NaN         N