# Analysing Index Entries

To get an idea of the quality of the lemma extraction, and of the possibility of linking lemmas across books, we analyse roughly 140,000 index entries extracted from the resolution books of 1708-1749 (inventories 3763-3804). These books share the same index page layout and, as of 2021-09-03, have mostly high-quality OCR output.

Our approach is iterative: in each pass we extract the lemmas we can identify as correct, then analyse the remaining unidentified lemmas to improve the extraction in the next pass.

In [1]:
# This reload library is just used for developing the REPUBLIC hOCR parser 
# and can be removed once this module is stable.
%reload_ext autoreload
%autoreload 2


# This is needed to add the repo dir to the path so jupyter
# can load the republic modules directly from the notebooks
import os
import sys
repo_name = 'republic-project'
repo_dir = os.path.split(os.getcwd())[0].split(repo_name)[0] + repo_name
print("adding project dir to path:", repo_dir)
if repo_dir not in sys.path:
    sys.path.append(repo_dir)



adding project dir to path: /Users/marijnkoolen/Code/Huygens/republic-project


In [2]:
import re

# Many lemmas have leading and/or trailing punctuation that is not part of
# the proper lemma. 
# The first step is to add a column with cleaned up lemmas.

def clean_start_end(string):
    if not isinstance(string, str):
        return string
    return re.sub(r'\W+$', '', re.sub(r'^\W+', '', string))
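
To make the effect of the cleaning step concrete, here is a small sketch that applies the same function to a handful of raw lemma strings that occur elsewhere in this notebook. Only leading and trailing non-word characters are stripped; punctuation inside a lemma is left alone, and a lemma consisting of punctuation only becomes the empty string.

```python
import re

def clean_start_end(string):
    """Strip leading and trailing non-word characters; pass non-strings through."""
    if not isinstance(string, str):
        return string
    return re.sub(r'\W+$', '', re.sub(r'^\W+', '', string))

# Raw lemma strings taken from the tables in this notebook
examples = ["Paltz;", "'s Hertogenbosch", "Lumge-", "gen.", "&"]
cleaned = {raw: clean_start_end(raw) for raw in examples}

for raw, clean in cleaned.items():
    print(repr(raw), '->', repr(clean))
```

Because `\W` is Unicode-aware for Python 3 strings, accented letters such as the `û` in `Brûsel` count as word characters and are not stripped.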


In [3]:
import pandas as pd

# The file with 140,313 index lemma entries
entries_file = '../../data/indices/index_entries-3763-3804-latest.csv.gz'

df = pd.read_csv(entries_file, sep='\t', compression='gzip')

df.head(2)


Unnamed: 0,lemma,main_term,text,first_line_id,first_line_scan_id,first_line_page_id,first_line_column_id,last_line_id,last_line_scan_id,last_line_page_id,last_line_column_id
0,nee Winter quartieren,nee,"nee Winter quartieren, 918.",NL-HaNA_1.01.02_3763_0626-column-2471-1318-894...,NL-HaNA_1.01.02_3763_0626,NL-HaNA_1.01.02_3763_0626-page-1251,NL-HaNA_1.01.02_3763_0626-column-2471-1318-894...,NL-HaNA_1.01.02_3763_0626-column-2471-1318-894...,NL-HaNA_1.01.02_3763_0626,NL-HaNA_1.01.02_3763_0626-page-1251,NL-HaNA_1.01.02_3763_0626-column-2471-1318-894...
1,nee Winter quartieren,nee,aerte tifthop van Gran Brief van felicitatie o...,NL-HaNA_1.01.02_3763_0626-column-2471-1318-894...,NL-HaNA_1.01.02_3763_0626,NL-HaNA_1.01.02_3763_0626-page-1251,NL-HaNA_1.01.02_3763_0626-column-2471-1318-894...,NL-HaNA_1.01.02_3763_0626-column-2471-1318-894...,NL-HaNA_1.01.02_3763_0626,NL-HaNA_1.01.02_3763_0626-page-1251,NL-HaNA_1.01.02_3763_0626-column-2471-1318-894...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140313 entries, 0 to 140312
Data columns (total 11 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   lemma                 140198 non-null  object
 1   main_term             140075 non-null  object
 2   text                  140313 non-null  object
 3   first_line_id         140313 non-null  object
 4   first_line_scan_id    140313 non-null  object
 5   first_line_page_id    140313 non-null  object
 6   first_line_column_id  140313 non-null  object
 7   last_line_id          140313 non-null  object
 8   last_line_scan_id     140313 non-null  object
 9   last_line_page_id     140313 non-null  object
 10  last_line_column_id   140313 non-null  object
dtypes: object(11)
memory usage: 11.8+ MB


In [5]:
df.shape

(140313, 11)

In [8]:
from republic.elastic.republic_elasticsearch import initialize_es

rep_es = initialize_es(host_type='external', timeout=60)

query = {
    #'size': 0,
    #'track_total_hits': True,
    #'query': {
        'range': {
            'metadata.session_year': {
                'gte': 1708,
                'lte': 1749
            }
        }
    #}
}


response = rep_es.es_anno.search(index='resolutions', query=query, size=0, track_total_hits=True)
print('Total number of resolutions in period 1708-1749:', response['hits']['total']['value'])

Total number of resolutions in period 1708-1749: 157026


In [15]:
list(df[df.text.str.contains('pestil')].text)

['te enamineeren de communicatie van Gelderland weegens de pestilentiaale Sieckte in Seevenbergen. 656.']

In [49]:
from IPython.display import HTML

# check the number of distinct lemmas and how often they occur
print(len(df.lemma.value_counts()))
df.lemma.value_counts().head(20).to_frame(name="# Sublemma's").reset_index().rename(columns={"index": "Hoofdterm"})

15801


Unnamed: 0,Hoofdterm,# Sublemma's
0,Pasporten,2924
1,'s Hertogenbosch,2784
2,Holland,2566
3,Raad van Staate,1809
4,Militaire,1516
5,Militaire saaken,1470
6,Militaire saacken,1443
7,Vlaanderen,1368
8,Brieven van,1124
9,E,1064


There are 15,801 distinct lemmas, some with more than 2,000 entries and many with only a single entry.

The lemmas with many entries are probably correct in the sense that they have no OCR errors, but we need a reasoned approach to identifying the correct lemmas and mapping the lemmas with OCR errors to their correct counterparts. 

Repeated entries of the same lemma in a single book don't repeat the lemma itself, so the extraction process copies the lemma term to each of those entries. A high entry count therefore doesn't show that a lemma was recognised multiple times with the exact same orthography. To find that out, we should look at how often a lemma occurs with the same orthography across books.
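
One possible way to map lemmas with OCR errors onto correct counterparts is approximate string matching. The sketch below is not the method used in this notebook. It uses the standard library's `difflib.get_close_matches` to suggest a reference form for a few OCR variants. The reference list, the helper name `suggest_reference` and the cutoff of 0.75 are all assumptions made for illustration.

```python
from difflib import get_close_matches

# Frequent lemmas from the tables in this notebook, treated here as
# trusted reference forms (an assumption made for this illustration).
reference_lemmas = ['Commissien', 'Luyck', 'Vlaanderen', 'Pruyssen']

def suggest_reference(ocr_lemma, references, cutoff=0.75):
    """Return the most similar reference lemma, or None if nothing is close enough."""
    matches = get_close_matches(ocr_lemma, references, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# OCR variants as they appear in the index of inventory 3765
variants = ['Commiftien', 'Luyek', 'Vlaenderen', 'Doomiek']
suggestions = {variant: suggest_reference(variant, reference_lemmas) for variant in variants}

for variant, suggestion in suggestions.items():
    print(variant, '->', suggestion)
```

A similarity cutoff trades recall against wrong mappings, so any suggestion produced this way would still need checking against the scans.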

In [50]:
# Rename the lemma column to ocr_lemma, so we know this is the raw extracted text with potential errors.
df = df.rename(columns={'lemma': 'ocr_lemma'})
df.shape

(140313, 11)

In [51]:

df['clean_lemma'] = df.ocr_lemma.apply(clean_start_end)
df.clean_lemma.value_counts().head(20).to_frame(name="# Sublemma's").reset_index().rename(columns={"index": "Hoofdterm"})

Unnamed: 0,Hoofdterm,# Sublemma's
0,Pasporten,3583
1,Holland,3161
2,Raad van Staate,2875
3,s Hertogenbosch,2856
4,Vlaanderen,1525
5,Militaire,1516
6,Militaire saaken,1470
7,Militaire saacken,1443
8,Brieven van,1124
9,E,1066


Cleaning up the OCR lemmas reduced the number of distinct lemmas from 15,801 to 15,561, so some of the terms with surrounding punctuation are now mapped onto other OCR lemmas.
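
The drop in distinct lemmas can be reproduced on a toy data frame. The sketch below uses a few made-up rows (the variants `Pasporten.` and `Paltz;` stand in for the kind of punctuation noise seen in the real data) and compares the number of distinct values before and after cleaning.

```python
import re
import pandas as pd

def clean_start_end(string):
    """Strip leading and trailing non-word characters; pass non-strings through."""
    if not isinstance(string, str):
        return string
    return re.sub(r'\W+$', '', re.sub(r'^\W+', '', string))

# Toy rows for illustration only, not taken from the entries file
toy = pd.DataFrame({'ocr_lemma': [
    'Paltz', 'Paltz;',
    'Pasporten', 'Pasporten.',
    "'s Hertogenbosch", 's Hertogenbosch',
    None,
]})
toy['clean_lemma'] = toy.ocr_lemma.apply(clean_start_end)

# nunique() ignores missing values by default
n_raw = toy.ocr_lemma.nunique()
n_clean = toy.clean_lemma.nunique()
print(n_raw, 'distinct raw lemmas,', n_clean, 'distinct clean lemmas')
```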

## Creating links from index entries to IIIF images

In [52]:
# https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3785/NL-HaNA_1.01.02_3785_0047.jpg/full/full/0/default.jpg

def get_line_iiif_url(line_id, margin=100):
    elements = parse_line_id_elements(line_id)
    return get_iiif_url(elements['inventory_num'], elements['scan_id'], elements['line_coords'], margin=margin)

def get_column_iiif_url(line_id, margin=100):
    elements = parse_line_id_elements(line_id)
    return get_iiif_url(elements['inventory_num'], elements['scan_id'], elements['column_coords'], margin=margin)

def get_iiif_url(inv_num, scan_id, coords, margin=100):
    coords_string = f"{coords[0]-margin},{coords[1]-margin},{coords[2]+2*margin},{coords[3]+2*margin}"
    base_url = 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/'
    return f"{base_url}{inv_num}/{scan_id}.jpg/{coords_string}/full/0/default.jpg"

def parse_line_id_elements(line_id):
    scan_id, rest = line_id.split('-column-')
    inventory_num = scan_id.split('_')[2]
    column_coords, line_coords = rest.split('-line-')
    return {
        'scan_id': scan_id,
        'inventory_num': inventory_num,
        'column_coords': [int(coord) for coord in column_coords.split('-')],
        'line_coords': [int(coord) for coord in line_coords.split('-')]
    }

for line_id in list(df[df.clean_lemma == 'porten'].first_line_id):
    elements = parse_line_id_elements(line_id)
    line_iiif_url = get_line_iiif_url(line_id, margin=100)
    column_iiif_url = get_column_iiif_url(line_id, margin=100)
    print(elements['inventory_num'], elements['scan_id'], elements['line_coords'])
    print(line_iiif_url)
    


3769 NL-HaNA_1.01.02_3769_0013 [1486, 497, 251, 47]
https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3769/NL-HaNA_1.01.02_3769_0013.jpg/1386,397,451,247/full/0/default.jpg
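
As a worked example of the region arithmetic, the sketch below rebuilds the URL above from a single line ID. The line ID is hypothetical: the scan ID and line coordinates are the ones printed above, but the column coordinates are invented, since the full IDs are truncated in the tables. Coordinates are assumed to be `x, y, width, height`, and the region in the URL follows the IIIF Image API order `x,y,w,h`.

```python
BASE_URL = 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/'

def parse_line_id_elements(line_id):
    """Split a line ID into scan ID, inventory number and column/line boxes."""
    scan_id, rest = line_id.split('-column-')
    column_coords, line_coords = rest.split('-line-')
    return {
        'scan_id': scan_id,
        'inventory_num': scan_id.split('_')[2],
        'column_coords': [int(coord) for coord in column_coords.split('-')],
        'line_coords': [int(coord) for coord in line_coords.split('-')],
    }

def get_iiif_url(inv_num, scan_id, coords, margin=100):
    """Build a IIIF region URL: move the origin up and left by `margin`,
    and grow width and height by 2 * margin to keep the box centred."""
    x, y, w, h = coords
    region = f"{x - margin},{y - margin},{w + 2 * margin},{h + 2 * margin}"
    return f"{BASE_URL}{inv_num}/{scan_id}.jpg/{region}/full/0/default.jpg"

# Hypothetical line ID: the column coordinates (1386-397-900-2800) are invented
line_id = 'NL-HaNA_1.01.02_3769_0013-column-1386-397-900-2800-line-1486-497-251-47'
elements = parse_line_id_elements(line_id)
url = get_iiif_url(elements['inventory_num'], elements['scan_id'], elements['line_coords'])
print(url)
```

A line closer than `margin` pixels to the top or left edge of the scan would produce a negative offset with this arithmetic, so such cases may need clamping to zero.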


In [53]:
df['line_iiif_url'] = df.first_line_id.apply(get_line_iiif_url)
df['column_iiif_url'] = df.first_line_id.apply(get_column_iiif_url)


## Add inventory source

Each entry comes from a book that is identified by its inventory number. In these books, each lemma is mentioned only once in the index, but the lemma can have multiple entries, where the lemma term is omitted after the first entry and replaced by a repeat symbol (`____`).

By adding the inventory number for each entry in a separate column, we can group entries per year. 
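
A minimal sketch of this step on toy data: the line IDs below are hypothetical (real scan-ID prefixes, invented coordinates), and the regular expression assumes every ID starts with `NL-HaNA_1.01.02_` followed by a four-digit inventory number. Compared with splitting on `_`, a pattern yields a missing value rather than a wrong one when an ID does not have the expected shape.

```python
import pandas as pd

# Hypothetical line IDs in the format used in this notebook
toy = pd.DataFrame({'first_line_id': [
    'NL-HaNA_1.01.02_3763_0626-column-2471-1318-894-1500-line-2480-1330-800-40',
    'NL-HaNA_1.01.02_3763_0626-column-2471-1318-894-1500-line-2480-1380-800-40',
    'NL-HaNA_1.01.02_3798_0034-column-2650-472-907-2400-line-2660-480-850-45',
]})

# Capture the four digits that follow the archive prefix
toy['inventory'] = toy.first_line_id.str.extract(
    r'^NL-HaNA_1\.01\.02_(\d{4})_', expand=False
)

# Number of entries per book
entries_per_inventory = toy.groupby('inventory').size()
print(entries_per_inventory)
```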

In [54]:
# The line IDs contain the inventory number of the book from which the
# entry was extracted.
# We add a column with the inventory number so we can check which lemmas 
# occur in multiple books, therefore need to be linked and are more 
# likely to be correct lemmas (lemmas with OCR errors are less likely
# to occur with precisely the same errors across multiple books)

df['inventory'] = df.first_line_id.apply(lambda x: x.split('_')[2])
df.inventory.nunique()

42

In [55]:
df[(df.inventory == '3798') & (df.first_line_page_id == 'NL-HaNA_1.01.02_3798_0034-page-67')]


Unnamed: 0,ocr_lemma,main_term,text,first_line_id,first_line_scan_id,first_line_page_id,first_line_column_id,last_line_id,last_line_scan_id,last_line_page_id,last_line_column_id,clean_lemma,line_iiif_url,column_iiif_url,inventory
115041,Paltz,Paltz,"versoeck om den stluen te ontstaan, de Raad va...",NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,Paltz,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3798
115042,Paltz,Paltz,"Creditif op van Asten als Agent, en aangenaam....",NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,Paltz,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3798
115043,Paltz,Paltz,klaghten over ontireckingh van Jerri- zoir in ...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,Paltz,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3798
115044,Paltz,Paltz,versoght de schuldige aan gepleeghde geweldena...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,Paltz,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3798
115045,Paltz,Paltz,notificeerende dat Meyer afgesonden had om ove...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,Paltz,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3798
115046,Paltz,Paltz,Pasport tot de vrye passagie van duy- Jend las...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,Paltz,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3798
115047,Paltz,Paltz,antwoord weegens quade behandelingh van Bedien...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,Paltz,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3798
115048,Paltz,Paltz,antwoord wegens de passagie vau sis honderd la...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,Paltz,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3798
115049,Paraviciny,Paraviciny,Paraviciny advertentie. 279. 302. 420. 588. 589.,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,Paraviciny,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3798
115050,Paraviciny,Paraviciny,antwoord weegens bet vijiteeren van Scheepen v...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,NL-HaNA_1.01.02_3798_0034,NL-HaNA_1.01.02_3798_0034-page-67,NL-HaNA_1.01.02_3798_0034-column-2650-472-907-...,Paraviciny,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3798


In [56]:
# There are entries from 42 different books. 

# Next step, count the number of books in which each clean lemma occurs
df[['clean_lemma', 'inventory']].drop_duplicates().clean_lemma.value_counts().head(20).to_frame(name="# Indices").reset_index().rename(columns={"index": "Hoofdterm"})


Unnamed: 0,Hoofdterm,# Indices
0,Breda,40
1,Finantie,38
2,Utrecht,35
3,s Hertogenbosch,35
4,Hop,34
5,Groningen,34
6,Vlaanderen,33
7,Levantschen Handel,32
8,Pruyssen,31
9,Overyssel,31


In [57]:
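# The comparison with True is redundant; without parentheses it also applies
# to the whole combined mask rather than to the startswith() result alone.
df[(df.clean_lemma == 'Admiraliteyten in het') & df.text.str.startswith('Admiraliteyten in het')]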

Unnamed: 0,ocr_lemma,main_term,text,first_line_id,first_line_scan_id,first_line_page_id,first_line_column_id,last_line_id,last_line_scan_id,last_line_page_id,last_line_column_id,clean_lemma,line_iiif_url,column_iiif_url,inventory
41865,Admiraliteyten in het,Admiraliteyten,Admiraliteyten in het gemeen te adviseren op d...,NL-HaNA_1.01.02_3779_0005-column-2629-1686-904...,NL-HaNA_1.01.02_3779_0005,NL-HaNA_1.01.02_3779_0005-page-9,NL-HaNA_1.01.02_3779_0005-column-2629-1686-904...,NL-HaNA_1.01.02_3779_0005-column-2629-1686-904...,NL-HaNA_1.01.02_3779_0005,NL-HaNA_1.01.02_3779_0005-page-9,NL-HaNA_1.01.02_3779_0005-column-2629-1686-904...,Admiraliteyten in het,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3779
49160,Admiraliteyten in het,Admiraliteyten,Admiraliteyten in het gemeen beschreeven. 79. ...,NL-HaNA_1.01.02_3781_0006-column-319-384-879-2...,NL-HaNA_1.01.02_3781_0006,NL-HaNA_1.01.02_3781_0006-page-10,NL-HaNA_1.01.02_3781_0006-column-319-384-879-2949,NL-HaNA_1.01.02_3781_0006-column-319-384-879-2...,NL-HaNA_1.01.02_3781_0006,NL-HaNA_1.01.02_3781_0006-page-10,NL-HaNA_1.01.02_3781_0006-column-319-384-879-2949,Admiraliteyten in het,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3781
75376,Admiraliteyten in het,Admiraliteyten,Admiraliteyten in het gemeen gelast de Schip- ...,NL-HaNA_1.01.02_3788_0005-column-3480-1603-888...,NL-HaNA_1.01.02_3788_0005,NL-HaNA_1.01.02_3788_0005-page-9,NL-HaNA_1.01.02_3788_0005-column-3480-1603-888...,NL-HaNA_1.01.02_3788_0005-column-3480-1603-888...,NL-HaNA_1.01.02_3788_0005,NL-HaNA_1.01.02_3788_0005-page-9,NL-HaNA_1.01.02_3788_0005-column-3480-1603-888...,Admiraliteyten in het,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3788
125079,Admiraliteyten in het,Admiraliteyten,Admiraliteyten in het gemeen.,NL-HaNA_1.01.02_3801_0008-column-3530-1823-895...,NL-HaNA_1.01.02_3801_0008,NL-HaNA_1.01.02_3801_0008-page-15,NL-HaNA_1.01.02_3801_0008-column-3530-1823-895...,NL-HaNA_1.01.02_3801_0008-column-3530-1823-895...,NL-HaNA_1.01.02_3801_0008,NL-HaNA_1.01.02_3801_0008-page-15,NL-HaNA_1.01.02_3801_0008-column-3530-1823-895...,Admiraliteyten in het,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3801
125094,Admiraliteyten in het,Admiraliteyten,Admiraliteyten in het gemeen. van het Schip Zi...,NL-HaNA_1.01.02_3801_0009-column-313-436-896-2...,NL-HaNA_1.01.02_3801_0009,NL-HaNA_1.01.02_3801_0009-page-16,NL-HaNA_1.01.02_3801_0009-column-313-436-896-2950,NL-HaNA_1.01.02_3801_0009-column-313-436-896-2...,NL-HaNA_1.01.02_3801_0009,NL-HaNA_1.01.02_3801_0009-page-16,NL-HaNA_1.01.02_3801_0009-column-313-436-896-2950,Admiraliteyten in het,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3801
125138,Admiraliteyten in het,Admiraliteyten,Admiraliteyten in het gemeen. lingh van de kos...,NL-HaNA_1.01.02_3801_0009-column-2553-446-918-...,NL-HaNA_1.01.02_3801_0009,NL-HaNA_1.01.02_3801_0009-page-17,NL-HaNA_1.01.02_3801_0009-column-2553-446-918-...,NL-HaNA_1.01.02_3801_0009-column-2553-446-918-...,NL-HaNA_1.01.02_3801_0009,NL-HaNA_1.01.02_3801_0009-page-17,NL-HaNA_1.01.02_3801_0009-column-2553-446-918-...,Admiraliteyten in het,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3801
125159,Admiraliteyten in het,Admiraliteyten,Admiraliteyten in het gemeen. voorgevallene tu...,NL-HaNA_1.01.02_3801_0009-column-3522-381-932-...,NL-HaNA_1.01.02_3801_0009,NL-HaNA_1.01.02_3801_0009-page-17,NL-HaNA_1.01.02_3801_0009-column-3522-381-932-...,NL-HaNA_1.01.02_3801_0009-column-3522-381-932-...,NL-HaNA_1.01.02_3801_0009,NL-HaNA_1.01.02_3801_0009-page-17,NL-HaNA_1.01.02_3801_0009-column-3522-381-932-...,Admiraliteyten in het,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3801
125178,Admiraliteyten in het,Admiraliteyten,Admiraliteyten in het gemeen.,NL-HaNA_1.01.02_3801_0010-column-303-435-886-2...,NL-HaNA_1.01.02_3801_0010,NL-HaNA_1.01.02_3801_0010-page-18,NL-HaNA_1.01.02_3801_0010-column-303-435-886-2868,NL-HaNA_1.01.02_3801_0010-column-303-435-886-2...,NL-HaNA_1.01.02_3801_0010,NL-HaNA_1.01.02_3801_0010-page-18,NL-HaNA_1.01.02_3801_0010-column-303-435-886-2868,Admiraliteyten in het,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3801
133377,Admiraliteyten in het,Admiraliteyten,Admiraliteyten in het gemeen.,NL-HaNA_1.01.02_3803_0009-column-370-406-953-2...,NL-HaNA_1.01.02_3803_0009,NL-HaNA_1.01.02_3803_0009-page-16,NL-HaNA_1.01.02_3803_0009-column-370-406-953-2979,NL-HaNA_1.01.02_3803_0009-column-370-406-953-2...,NL-HaNA_1.01.02_3803_0009,NL-HaNA_1.01.02_3803_0009-page-16,NL-HaNA_1.01.02_3803_0009-column-370-406-953-2979,Admiraliteyten in het,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3803
133400,Admiraliteyten in het,Admiraliteyten,Admiraliteyten in het gemeen.,NL-HaNA_1.01.02_3803_0009-column-1398-431-874-...,NL-HaNA_1.01.02_3803_0009,NL-HaNA_1.01.02_3803_0009-page-16,NL-HaNA_1.01.02_3803_0009-column-1398-431-874-...,NL-HaNA_1.01.02_3803_0009-column-1398-431-874-...,NL-HaNA_1.01.02_3803_0009,NL-HaNA_1.01.02_3803_0009-page-16,NL-HaNA_1.01.02_3803_0009-column-1398-431-874-...,Admiraliteyten in het,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3803


Of the 15,561 distinct terms, some occur in the vast majority of books, while many occur in only one. The latter are more likely to contain OCR errors, although there are also rare but correct lemmas among them.

We inspect the lemmas occurring in at least half of the books to see if they contain any obvious OCR errors:
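
The number of books per lemma can also be computed with a `groupby` and `nunique`, which is equivalent to dropping duplicate lemma and inventory pairs and then counting values. The sketch below uses a toy data frame with made-up rows and selects the lemmas that occur in at least half of the books.

```python
import pandas as pd

# Toy rows for illustration: 'Luyek' stands in for an OCR variant of 'Luyck'
toy = pd.DataFrame({
    'clean_lemma': ['Breda', 'Breda', 'Breda', 'Breda', 'Luyek', 'Luyck', 'Luyck'],
    'inventory':   ['3763',  '3763',  '3764',  '3765',  '3765',  '3763',  '3764'],
})

# Number of distinct books in which each lemma occurs
books_per_lemma = toy.groupby('clean_lemma')['inventory'].nunique()

# Keep lemmas found in at least half of the books
n_books = toy.inventory.nunique()
widespread = sorted(books_per_lemma[books_per_lemma >= n_books / 2].index)
print(widespread)
```

The variant `Luyek` occurs in a single book and falls below the threshold, which is the pattern the analysis relies on: an OCR error is unlikely to recur in exactly the same form across many books.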

In [58]:
s1 = df.clean_lemma.value_counts()
s2 = df[['clean_lemma', 'inventory']].drop_duplicates().clean_lemma.value_counts()
#.to_frame().rename(columns={'clean_lemma': 'Entries'}).head(5)

pd.concat([s1.to_frame().rename(columns={'clean_lemma': "# Sublemma's"}), s2.to_frame().rename(columns={'clean_lemma': '# Indices'})], axis=1).reset_index().rename(columns={"index": "Hoofdterm"}).head(20)


Unnamed: 0,Hoofdterm,# Sublemma's,# Indices
0,Pasporten,3583,29
1,Holland,3161,24
2,Raad van Staate,2875,17
3,s Hertogenbosch,2856,35
4,Vlaanderen,1525,33
5,Militaire,1516,9
6,Militaire saaken,1470,13
7,Militaire saacken,1443,10
8,Brieven van,1124,19
9,E,1066,14


In [59]:
pd.concat([s2.to_frame().rename(columns={'clean_lemma': "# Indices"}), s1.to_frame().rename(columns={'clean_lemma': "# Sublemma's"})], axis=1).reset_index().rename(columns={"index": "Hoofdterm"}).head(20)


Unnamed: 0,Hoofdterm,# Indices,# Sublemma's
0,Breda,40,471
1,Finantie,38,984
2,Utrecht,35,723
3,s Hertogenbosch,35,2856
4,Hop,34,1029
5,Groningen,34,828
6,Vlaanderen,33,1525
7,Levantschen Handel,32,185
8,Pruyssen,31,308
9,Overyssel,31,671


In [60]:
s = df[['clean_lemma', 'inventory']].drop_duplicates().clean_lemma.value_counts()

sorted(list(s[s >= 21].index))

['Admiraliteyt in het Noorder Quartier',
 'Admiraliteyt tot Amsterdam',
 'Barbaryen',
 'Bergen op den Zoom',
 'Bescheyt',
 'Beyeren',
 'Bleyswyck',
 'Boodens',
 'Bosch',
 'Breda',
 'Bruyninx',
 'Buys',
 'Colyear',
 'Commissien',
 'Credentie',
 'Cronstrom',
 'Finantie',
 'Gedeputeerden',
 'Gerbrants',
 'Groningen',
 'Grys',
 'Heysterman',
 'Hochepied',
 'Hogendorp',
 'Holland',
 'Hompesch',
 'Hop',
 'Hoyer',
 'Hudson',
 'Instructie',
 'Keulen',
 'Lemmens',
 'Levantschen Handel',
 'Luyck',
 'Meyer',
 'Mol',
 'Octroyen',
 'Overduyn',
 'Overmaze',
 'Overyssel',
 'Paltz',
 'Pardon',
 'Pasporten',
 'Patenten',
 'Pruyssen',
 'Revisie',
 'Rumpf',
 'Serres',
 'Siet',
 'Smits',
 'Spina',
 'Suriname',
 'Utrecht',
 'Vlaanderen',
 'Voorschryvens',
 'de Groot',
 'de Jongh',
 's Gravesande',
 's Hertogenbosch',
 'van Breugel',
 'van Deurs',
 'van Hoey',
 'van Rechteren',
 'vanden Bergh',
 'vanden Heuvel',
 'vander Burgh',
 'vander Duyn',
 'vander Goes',
 'vander Haar',
 'vander Meer',
 'vander Meulen

The lemmas that occur in at least half of the books look good. Across the full set, however, there are a few very short terms that are not lemmas: stopwords like *om* and *en*, partial words like *gen* and *ren*, and the empty string ''.

In [61]:
non_lemmas = {
    '',
    'en',
    'gen',
    'om',
    'ren',
}

print(df[df.clean_lemma.isin(non_lemmas)].shape)

df[df.clean_lemma.isin(non_lemmas)].head(5)[['clean_lemma', 'text']]

(539, 15)


Unnamed: 0,clean_lemma,text
1950,,) diers te veranderen in de lauwe Gar-
1951,,"des te voet, 12337."
3479,om,"om garantie op negotiatie, 1490. Q"
3483,om,om een Cômpagnie onder Cars éf wel
3631,ren,"ren en Waren, 129. Resolutie van haer Hoogh Mo..."


There are also some rows in which the lemma term is missing altogether (NaN), so something went wrong in extracting these entries.

In [62]:
print(df[df.clean_lemma.isna()].shape)
df[df.clean_lemma.isna()].head(2)

(115, 15)


Unnamed: 0,ocr_lemma,main_term,text,first_line_id,first_line_scan_id,first_line_page_id,first_line_column_id,last_line_id,last_line_scan_id,last_line_page_id,last_line_column_id,clean_lemma,line_iiif_url,column_iiif_url,inventory
62849,,,", teyts Reekenkamer te adviseeren. 7oo.",NL-HaNA_1.01.02_3784_0024-column-2560-452-926-...,NL-HaNA_1.01.02_3784_0024,NL-HaNA_1.01.02_3784_0024-page-47,NL-HaNA_1.01.02_3784_0024-column-2560-452-926-...,NL-HaNA_1.01.02_3784_0024-column-2560-452-926-...,NL-HaNA_1.01.02_3784_0024,NL-HaNA_1.01.02_3784_0024-page-47,NL-HaNA_1.01.02_3784_0024-column-2560-452-926-...,,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3784
71110,,,", cheeren voor Perez. 438.",NL-HaNA_1.01.02_3786_0034-column-3446-432-889-...,NL-HaNA_1.01.02_3786_0034,NL-HaNA_1.01.02_3786_0034-page-67,NL-HaNA_1.01.02_3786_0034-column-3446-432-889-...,NL-HaNA_1.01.02_3786_0034-column-3446-432-889-...,NL-HaNA_1.01.02_3786_0034,NL-HaNA_1.01.02_3786_0034-page-67,NL-HaNA_1.01.02_3786_0034-column-3446-432-889-...,,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3786


Both the short non-lemma entries and the entries with a missing lemma look like extraction mistakes. Before continuing, we write them to a separate file of entries with extraction mistakes and drop them from the data frame.
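
The split can be sketched on a toy data frame with a single boolean mask that covers both kinds of mistake, so the two resulting parts are guaranteed to partition the data. The rows below are made up, and the sketch writes to an in-memory buffer instead of a file.

```python
import io
import pandas as pd

non_lemmas = {'', 'en', 'gen', 'om', 'ren'}

# Toy rows for illustration only
toy = pd.DataFrame({
    'clean_lemma': ['Paltz', '', 'ren', None, 'Breda'],
    'text': ['entry a', 'entry b', 'entry c', 'entry d', 'entry e'],
})

# One mask for both kinds of mistake: non-lemma strings and missing lemmas
is_mistake = toy.clean_lemma.isin(non_lemmas) | toy.clean_lemma.isna()

mistakes = toy[is_mistake]
kept = toy[~is_mistake]

# Write the mistakes as tab-separated text (a buffer stands in for the file)
buffer = io.StringIO()
mistakes.to_csv(buffer, sep='\t')

print(len(mistakes), 'mistakes,', len(kept), 'kept')
```

Reusing the same mask with `~` avoids the risk of the two selections drifting apart when the criteria change.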

In [63]:
s = df[df.inventory == '3765'].ocr_lemma.value_counts()
s[s > 10]

mende                                 74
Lumge-                                73
Spaeniche Nederlanden Gouvernement    65
Militaire laken                       56
Brûsel                                50
Overquantier van Gelderlandt          49
Declaratien                           47
Voorschryvens                         44
Paltz;                                38
Denemarcken                           37
Pruyssen                              36
Vlaenderen                            34
Rysel                                 31
Commiftien                            30
Doomiek                               26
Commissien                            26
Luyek                                 25
Quinze                                22
Engelandt                             21
gelyi                                 20
s Hertogenbosch                       19
Ooff-Vrieslandt                       19
ZZ de Militie                         19
Geneniiteyt Rekenkamer                18
Duytiche Hoven  

## Analysing Likely Mistakes

In [64]:
mistakes_file = '../../data/indices/index_entries-3763-3804-mistakes.csv'

mistakes_df = df[df.clean_lemma.isin(non_lemmas) | df.clean_lemma.isna()]
print(mistakes_df.shape)
mistakes_df.to_csv(mistakes_file, sep='\t')

# Drop the entries of lemmas with mistakes from the main data frame
df = df[~df.clean_lemma.isin(non_lemmas) & df.clean_lemma.notna()]


df.shape



(654, 15)


(139659, 15)

In [65]:
mistakes_df[['ocr_lemma', 'clean_lemma']].value_counts()

ocr_lemma  clean_lemma
ren        ren            344
&                          56
.-                         31
en         en              20
/                          18
)                          13
om         om              13
gen        gen             12
;                          10
gen.       gen              5
|                           3
ren.       ren              3
\                           3
'                           2
-.                          2
»                           2
>                           1
).                          1
dtype: int64

In [66]:
mistakes_df.inventory.value_counts().head(5)

3792    127
3764    111
3786     76
3779     68
3798     52
Name: inventory, dtype: int64

In [67]:
mistakes_df[mistakes_df.inventory == '3792'].first_line_page_id.value_counts()

NL-HaNA_1.01.02_3792_0387-page-772    44
NL-HaNA_1.01.02_3792_0386-page-770    32
NL-HaNA_1.01.02_3792_0386-page-771    31
NL-HaNA_1.01.02_3792_0385-page-769    20
Name: first_line_page_id, dtype: int64

In [68]:
df[(df.inventory == '3792') & (df.first_line_page_id == 'NL-HaNA_1.01.02_3792_0385-page-769')]


Unnamed: 0,ocr_lemma,main_term,text,first_line_id,first_line_scan_id,first_line_page_id,first_line_column_id,last_line_id,last_line_scan_id,last_line_page_id,last_line_column_id,clean_lemma,line_iiif_url,column_iiif_url,inventory
91384,Militaire saacken,Militaire,Pasport voor Lemmens en Graaven tot den uytvoe...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,Militaire saacken,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91385,Militaire saacken,Militaire,Pasport voor den Baron van Schwart- zenbergh t...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,Militaire saacken,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91386,Militaire saacken,Militaire,berigbt van den commandeerenden Of- Jicier van...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,Militaire saacken,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91387,Militaire saacken,Militaire,Hirzel weegens doen van eed als Ge- nerdal Maj...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,Militaire saacken,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91388,Militaire saacken,Militaire,Pasport voor van Texel tot den uyt- voer van M...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,Militaire saacken,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91389,Militaire saacken,Militaire,Larcher van Keenenburg ses maan- den verlof. 106.,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,Militaire saacken,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91390,Militaire saacken,Militaire,rapport op het berigbt van Hambrock op de Memo...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,Militaire saacken,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91391,Militaire saacken,Militaire,propositie van Zeeland tot verande- ringe der ...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,Militaire saacken,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91392,Militaire saacken,Militaire,Veldman gepermitteert voor aght daa- gen een r...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,Militaire saacken,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91393,Militaire saacken,Militaire,Lyste van de veranderinge der Guar- nisoenen. ...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-2602-445-905-...,Militaire saacken,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792


In [69]:
mistakes_df[mistakes_df.inventory == '3792'].first_line_page_id.value_counts()

NL-HaNA_1.01.02_3792_0387-page-772    44
NL-HaNA_1.01.02_3792_0386-page-770    32
NL-HaNA_1.01.02_3792_0386-page-771    31
NL-HaNA_1.01.02_3792_0385-page-769    20
Name: first_line_page_id, dtype: int64

In [70]:
mistakes_df[(mistakes_df.inventory == '3792') & (mistakes_df.first_line_page_id == 'NL-HaNA_1.01.02_3792_0385-page-769')]



Unnamed: 0,ocr_lemma,main_term,text,first_line_id,first_line_scan_id,first_line_page_id,first_line_column_id,last_line_id,last_line_scan_id,last_line_page_id,last_line_column_id,clean_lemma,line_iiif_url,column_iiif_url,inventory
91403,ren,ren,ren weegens het defect in baare Militie. 137-,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,ren,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91404,ren,ren,Carel Lodewyck van Wassenaer op sijn versoeck ...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,ren,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91405,ren,ren,consent van Vriesland in den Staat van Oorlogh...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,ren,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91406,ren,ren,Grave van Hompesch aangesteld tot Gouverneur v...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,ren,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91407,ren,ren,Vriesland en Overyssel om vermeer - deringh va...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,ren,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91408,ren,ren,Resolutie van Zeeland raakende de incompleethe...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,ren,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91409,ren,ren,Pasport voor de Officieren van bert Esquadron ...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,ren,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91410,ren,ren,"Zeeland consent in den Staat van Oorlogh, 1.9.",NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,ren,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91411,ren,ren,Passport voor Smits tot den uytvoer van Montee...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,ren,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792
91412,ren,ren,"antwoord van den Bisschop en Prince wan Luyck,...",NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,NL-HaNA_1.01.02_3792_0385,NL-HaNA_1.01.02_3792_0385-page-769,NL-HaNA_1.01.02_3792_0385-column-3561-445-915-...,ren,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3792


## Extracting Likely Correct Lemmas

In [72]:
s = df[['clean_lemma', 'inventory']].drop_duplicates().clean_lemma.value_counts()

# Extract the lemmas occurring in at least 21 books as correct lemmas
correct_lemmas = set(s[s >= 21].index)

correct_file = '../../data/indices/index_entries-3763-3804-correct_lemmas.csv'

# create a new data frame with only the correct lemmas entries
# (copied, so that adding a column below does not modify a view of df)
is_correct = df.clean_lemma.isin(correct_lemmas)
correct_df = df[is_correct].copy()

# Drop the entries of correct lemmas from the main data frame
df = df[~is_correct]

# in the correct lemma data frame, add a new column to explicitly label the correct lemma
correct_df['correct_lemma'] = correct_df.clean_lemma

# write the correct lemma data frame to file for later use
correct_df.to_csv(correct_file, sep='\t')


df.shape




(112813, 15)

In [73]:
correct_df[correct_df.clean_lemma.str.contains('Bergen op')].head(5)

Unnamed: 0,ocr_lemma,main_term,text,first_line_id,first_line_scan_id,first_line_page_id,first_line_column_id,last_line_id,last_line_scan_id,last_line_page_id,last_line_column_id,clean_lemma,line_iiif_url,column_iiif_url,inventory,correct_lemma
189,Bergen op den Zoom,Bergen,"Bergen op den Zoom, Vilatte versoeckt met het ...",NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1255,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1255,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,Bergen op den Zoom,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3763,Bergen op den Zoom
190,Bergen op den Zoom,Bergen,"item den Grave van Oxenstern, 578.",NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1255,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1255,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,Bergen op den Zoom,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3763,Bergen op den Zoom
191,Bergen op den Zoom,Bergen,"item Paron Friesheym, 581.",NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1255,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1255,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,Bergen op den Zoom,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3763,Bergen op den Zoom
192,Bergen op den Zoom,Bergen,"item Dompré, Lieutenant Ge- nerael, 581.",NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1255,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1255,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,Bergen op den Zoom,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3763,Bergen op den Zoom
193,Bergen op den Zoom,Bergen,"item Baron Spar, sor.",NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1255,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1255,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,Bergen op den Zoom,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....,3763,Bergen op den Zoom


In [80]:
#correct_lemmas
import sys

#!{sys.executable} -m pip install sklearn
!{sys.executable} -m pip install sparse_dot_topn

Collecting sparse_dot_topn
  Downloading sparse_dot_topn-0.3.1.tar.gz (17 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting cython>=0.29.15
  Using cached Cython-0.29.28-py2.py3-none-any.whl (983 kB)
Building wheels for collected packages: sparse-dot-topn
  Building wheel for sparse-dot-topn (pyproject.toml) ... done
  Created wheel for sparse-dot-topn: filename=sparse_dot_topn-0.3.1-cp39-cp39-macosx_10_9_x86_64.whl size=296962 sha256=e6a1cc4ff1b8481a648d17c9b9352b0d4889bbaf0408e4bc9cdcf9b99026117b
  Stored in directory: /Users/marijnkoolen/Library/Caches/pip/wheels/07/0b/f4/7f1cf4237f41f6eeacfc8f6b1859ee0a51abf369a940c8823c
Successfully built sparse-dot-topn
Installing collected packages: cython, sparse-dot-topn
Successfully installed cython-0.29.28 sparse-dot-topn-0.3.1


In [81]:
from republic.extraction.extract_index_entries import LemmaGrouper

lemma_grouper = LemmaGrouper(use_lowercase=True, debug_level=1)

ocr_lemmas = list(df.clean_lemma.drop_duplicates())

lemma_grouper.set_idf(ocr_lemmas + list(correct_lemmas))

setting IDF for 15556 terms


In [82]:
lemma_grouper.group_terms_by_base_terms(ocr_lemmas, correct_lemmas)

setting IDF for 15556 terms
Term TF-IDF matrix has shape: (15483, 6948)
Base term TF-IDF matrix has shape: (73, 6948)
Coord matrix has shape: (15483, 73)
number of groups: 73
number of grouped terms: 239


Next, we map each lemma term to the set of inventories it occurs in, so it is easy to check whether two lemmas occur together in at least one inventory.

The insight here is that if two different lemma terms appear in the index of a single inventory, they must be distinct terms, no matter how similar they are. Otherwise the index would list only a single lemma, with repeat symbols for its further entries.

For example, the terms `Luyck` and `Luycken` are grouped by the lemma grouper, but they are listed as separate lemmas in the index pages of both inventories 3778 and 3780. If `Luycken` were a misrecognition of `Luyck`, they would not co-occur as distinct lemmas within 3778 and 3780.
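
As a minimal sketch of this check, the toy example below builds such a mapping by hand and tests two pairs of terms. The data is hypothetical: the (lemma, inventory) pairs are only loosely based on the example above, and the variant `Luyk` with inventory `3801` is invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (lemma, inventory) pairs, not taken from the real index
toy_pairs = [
    ('Luyck', '3778'), ('Luyck', '3780'), ('Luyck', '3792'),
    ('Luycken', '3778'), ('Luycken', '3780'),
    ('Luyk', '3801'),  # invented variant that never co-occurs with 'Luyck'
]

# map each lemma to the set of inventories in which it occurs
toy_inv_map = defaultdict(set)
for lemma, inventory in toy_pairs:
    toy_inv_map[lemma].add(inventory)


def share_inventory(term1, term2, inv_map):
    """Return True if both terms occur in at least one common inventory."""
    return not inv_map[term1].isdisjoint(inv_map[term2])


# co-occurring terms are treated as distinct lemmas
print(share_inventory('Luyck', 'Luycken', toy_inv_map))
# terms that never co-occur remain candidates for being variants
print(share_inventory('Luyck', 'Luyk', toy_inv_map))
```

Only pairs for which the check is false are candidates for mapping an OCR variant to a correct lemma.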

In [83]:
from collections import defaultdict

inv_map = defaultdict(set)

temp_df = df[['clean_lemma', 'inventory']].drop_duplicates()

for index, row in temp_df.iterrows():
    inv_map[row['clean_lemma']].add(row['inventory'])

temp_df = correct_df[['clean_lemma', 'inventory']].drop_duplicates()

for index, row in temp_df.iterrows():
    inv_map[row['clean_lemma']].add(row['inventory'])



In [84]:
import numpy as np

def terms_share_inventory(term1, term2, inv_map):
    return len(inv_map[term1].intersection(inv_map[term2])) > 0

lemma_map = {}

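for term in sorted(lemma_grouper.group_lookup):
    correct_lemma = lemma_grouper.group_lookup[term]
    # a term that is its own group lemma needs no mapping
    if term == correct_lemma:
        continue
    # terms that co-occur with the correct lemma in the same inventory
    # are distinct lemmas, not OCR variants
    if terms_share_inventory(term, correct_lemma, inv_map):
        continue
    lemma_map[term] = correct_lemma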

In [85]:
# add the correct lemma in a separate column to the mapped ocr_lemmas
df['correct_lemma'] = df.clean_lemma.apply(lambda x: lemma_map.get(x, np.nan))
# add the mapped lemmas to the correct lemma DataFrame
correct_df = pd.concat([correct_df, df[df.correct_lemma.notna()]])
# drop the mapped lemmas from the main DataFrame
df = df[df.correct_lemma.isna()]


In [86]:
df.shape

(110763, 16)

In [87]:
inv_map['Luyck']

{'3763',
 '3768',
 '3769',
 '3770',
 '3772',
 '3773',
 '3778',
 '3779',
 '3780',
 '3782',
 '3783',
 '3785',
 '3786',
 '3787',
 '3788',
 '3789',
 '3790',
 '3791',
 '3792',
 '3795',
 '3796',
 '3797',
 '3798',
 '3799',
 '3800'}

In [88]:
inv_map['Luycken']

{'3776', '3777', '3778', '3780'}

In [89]:
df[['clean_lemma','inventory']].drop_duplicates().clean_lemma.value_counts()

Bentinck                        20
van Dorssele                    20
Engelandt                       20
Bededagh                        20
Zeeland                         20
                                ..
Mexnert                          1
Mingerssen                       1
Missche tot                      1
de Moisan Brieux Brieven van     1
Zwitzerse Cantons                1
Name: clean_lemma, Length: 15397, dtype: int64

In [90]:
s = df[['clean_lemma', 'inventory']].drop_duplicates().clean_lemma.value_counts()

sorted(list(s[s > 10].index))
#s[['Bergh', 'Assendelft']]
s[s >= 10]

Bentinck        20
van Dorssele    20
Engelandt       20
Bededagh        20
Zeeland         20
                ..
Hoeuft          10
Vleertman       10
Ham             10
van Affelen     10
Bronckhorst     10
Name: clean_lemma, Length: 307, dtype: int64

In [91]:
#df['lemma'] = df.lemma.apply(lambda x: x if type(x) == str else '')
df[df.clean_lemma.str.startswith('Bergh')].clean_lemma.value_counts()


Berghuys              26
Berghman               8
Bergh                  6
Bergheyck              2
Berghsma               2
Berghman klagten       1
Bergh-Hohenzollern     1
Name: clean_lemma, dtype: int64

In [92]:
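def read_lemmas(lemma_file):
    """Read one lemma per line, using spaces instead of ';' separators."""
    lemmas = []
    with open(lemma_file, 'rt') as fh:
        for line in fh:
            lemma = line.strip()
            # skip empty lines, which would raise an IndexError on lemma[-1]
            if lemma == '':
                continue
            # drop a single trailing separator
            if lemma[-1] == ';':
                lemma = lemma[:-1]
            lemma = lemma.replace(';', ' ')
            lemmas.append(lemma)
    return lemmas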


index_lemma_file = '../../analysis/Indices/index_1700_1724_3867.csv'

transcribed_lemmas = read_lemmas(index_lemma_file)
len(transcribed_lemmas)

transcribed_lemmas

['Abdije van Bern',
 'Abditte van Thorn',
 'Admiraal',
 'Admiraliteiten in t gemeen',
 'Admiraliteit op de Maze',
 'Admiraliteit te Amsterdam',
 'Admiraliteit in Zeeland',
 'Admiraliteit in Vriesland',
 'Admiraliteit Noorderquartr',
 'Aerdenburg',
 'Agenten',
 'Agent van Haar Hoog Mogd',
 'Agent van Haar Hoog Mogd Munster en Paderborn',
 'Aken',
 'Algiers',
 'Almanak',
 'Almanakkegeld',
 'Ameland',
 'Amersfoort',
 'Ammunitie',
 'Ampten',
 'Anholt',
 'Apperitez',
 'Armen',
 'Arnhem',
 'Artillerij',
 'Artois',
 'Auctographa',
 'Auvergne',
 'Axel en Neuse',
 'Barneze',
 'Bededagen',
 'Bedelarije',
 'Bedestonden',
 'Benthem',
 'Berbice',
 'Bergen op Zoom',
 'Bern',
 'Besendinge',
 'Beursen',
 'Beijeren',
 'Biervliet',
 'Bilbao',
 'Bodens',
 'Boeken',
 'Bommel',
 'Bon',
 'Boxmeer',
 'Braband',
 'Brand',
 'Brandenburg',
 'Brandenburg Anspach',
 'Brandewijnstookers',
 'Brandspuijt',
 'Breda en Baronnije',
 'Breemen',
 'Breskens',
 'Brieven',
 'Brouwers',
 'Brunswijk Lunenburg',
 'Brussel',
 '

In [93]:
print("transcribed lemmas:", len(transcribed_lemmas))

transcribed lemmas: 369


In [94]:
correct_lemmas = [lemma for lemma in list(correct_df.correct_lemma.drop_duplicates()) if type(lemma) == str]

lemma_grouper = LemmaGrouper(use_lowercase=True, debug_level=0)

gt_lemmas = transcribed_lemmas
all_lemmas = gt_lemmas + correct_lemmas

lemma_grouper.group_terms_by_base_terms(correct_lemmas, gt_lemmas)

gt_tfidf = lemma_grouper._get_tf_idf_matrix(gt_lemmas)
correct_tfidf = lemma_grouper._get_tf_idf_matrix(correct_lemmas)
gt_tfidf.shape

(369, 1379)

In [95]:
correct_tfidf.shape

(73, 1379)

In [96]:
for term in sorted(lemma_grouper.group_lookup):
    group_lemma = lemma_grouper.group_lookup[term]
    if term == group_lemma:
        continue
    # print each grouped term next to the transcribed lemma it is mapped to
    print(f'{term: <40}{group_lemma: <40}')

Bergen op den Zoom                      Bergen op Zoom                          
s Hertogenbosch                         Hertogenbosch                           


In [98]:
from republic.extraction.extract_index_entries import LemmaGrouper

lemma_grouper = LemmaGrouper(use_lowercase=True, debug_level=1)

ocr_lemmas = list(df.clean_lemma.drop_duplicates())

lemma_grouper.set_idf(ocr_lemmas + transcribed_lemmas)

setting IDF for 15766 terms


In [99]:
lemma_grouper.group_terms_by_base_terms(ocr_lemmas, transcribed_lemmas)

setting IDF for 15766 terms
Term TF-IDF matrix has shape: (15397, 7014)
Base term TF-IDF matrix has shape: (369, 7014)
Coord matrix has shape: (15397, 369)
number of groups: 369
number of grouped terms: 595


In [100]:
for term in sorted(lemma_grouper.group_lookup):
    group_lemma = lemma_grouper.group_lookup[term]
    if term == group_lemma:
        continue
    # print each grouped term next to the transcribed lemma it is mapped to
    print(f'{term: <40}{group_lemma: <40}')

Aadmiraliteyt in Vrieslandt             Admiraliteit in Vriesland               
Admiraliteyt in Vriesland               Admiraliteit in Vriesland               
Admiraliteyt in Vrieslandt              Admiraliteit in Vriesland               
Admiraliteyt in Vrieslandt te           Admiraliteit in Vriesland               
Admiraliteyt in Zeeland                 Admiraliteit in Zeeland                 
Admiraliteyt in Zeelande                Admiraliteit in Zeeland                 
Admiraliteyt in Zeelandt                Admiraliteit in Zeeland                 
Admiraliteyt op de Maze                 Admiraliteit op de Maze                 
Admiraliteyt op de Maze te              Admiraliteit op de Maze                 
Algiers Dey                             Algiers                                 
Armentier                               Armen                                   
Audenaerde                              Naerden                                 
Audenaerden                 

In [101]:
import json

lemma_correction_map = lemma_grouper.group_lookup

with open('lemma_correction_map.json', 'wt') as fh:
    json.dump(lemma_correction_map, fh)

### Correcting Lemmata Based on Transcribed Index Terms

In [102]:
import pandas as pd
import re

# Many lemmas have leading and/or trailing punctuation that is not part of
# the proper lemma. 
# The first step is to add a column with cleaned up lemmas.

def clean_start_end(string):
    if type(string) != str:
        return string
    return re.sub(r'\W+$', '', re.sub(r'^\W+', '', string))

# The file with 140,313 index lemma entries
entries_file = '../../data/indices/index_entries-3763-3804-latest.csv.gz'

df = pd.read_csv(entries_file, sep='\t', compression='gzip')

# Rename the lemma column to ocr_lemma, so we know this is the raw extracted text with potential errors.
df = df.rename(columns={'lemma': 'ocr_lemma'})
df.shape


# The line IDs contain the inventory number of the book from which the
# entry was extracted.
# We add a column with the inventory number so we can check which lemmas 
# occur in multiple books, therefore need to be linked and are more 
# likely to be correct lemmas (lemmas with OCR errors are less likely
# to occur with precisely the same errors across multiple books)

df['inventory'] = df.first_line_id.apply(lambda x: x.split('_')[2])
df.inventory.nunique()

df['clean_lemma'] = df.ocr_lemma.apply(clean_start_end)

df['correct_lemma'] = df.clean_lemma.apply(lambda x: lemma_correction_map[x] if x in lemma_correction_map else x)

df.correct_lemma.value_counts()




Militaire Saken               4960
Holland                       3871
Pasporten                     3584
Raad van Staten               3060
s Hertogenbosch               2856
                              ... 
van Zievel                       1
Verbodt                          1
Verburgh                         1
de Walsche Kerkenraad            1
van Ouderkerck Brieven van       1
Name: correct_lemma, Length: 15381, dtype: int64

In [104]:
s1 = df.correct_lemma.value_counts()
s2 = df[['correct_lemma', 'inventory']].drop_duplicates().correct_lemma.value_counts()
#.to_frame().rename(columns={'clean_lemma': 'Entries'}).head(5)

pd.concat([s1.to_frame().rename(columns={'correct_lemma': 'Entries'}), s2.to_frame().rename(columns={'correct_lemma': 'Indices'})], axis=1).head(20)


Unnamed: 0,Entries,Indices
Militaire Saken,4960,38
Holland,3871,41
Pasporten,3584,29
Raad van Staten,3060,18
s Hertogenbosch,2856,35
Vlaanderen,1554,33
Overquartier van gelderland,1479,41
Gelderland,1375,38
Zeeland,1351,35
Brieven,1201,19


In [82]:
# Next step, count the number of books in which each clean lemma occurs
s = df[['correct_lemma', 'inventory']].drop_duplicates().correct_lemma.value_counts()

s[s > 30].to_frame().rename(columns={'correct_lemma': 'Num indices'})

Unnamed: 0,Num indices
Overquartier van gelderland,41
Breda,40
Vriesland,40
Militaire Saken,38
Gelderland,38
Finantie,38
West Indische Compagnie,38
s Hertogenbosch,35
Utrecht,35
Zeeland,35


In [208]:
ocr_lemmas = [lemma for lemma in list(df.clean_lemma.drop_duplicates()) if type(lemma) == str]

len(ocr_lemmas)

19475

In [209]:
ocr_lemmas = [lemma for lemma in list(df.clean_lemma.drop_duplicates()) if type(lemma) == str]

lemma_grouper = LemmaGrouper(use_lowercase=True, debug_level=0)

# the 369 transcribed lemmas serve as ground truth (gt)
gt_lemmas = transcribed_lemmas
all_lemmas = gt_lemmas + ocr_lemmas

#lemma_grouper.set_idf(all_lemmas)
#lemma_grouper.group_terms(ocr_lemmas)
lemma_grouper.group_terms_by_base_terms(ocr_lemmas[:100000], gt_lemmas)

gt_tfidf = lemma_grouper._get_tf_idf_matrix(gt_lemmas)
ocr_tfidf = lemma_grouper._get_tf_idf_matrix(ocr_lemmas)
gt_tfidf.shape

(369, 7512)

In [216]:
# terms in the OCR lemma list that are similar to some of the correct lemmas
# but are their own lemmas
non_variants = {
    'Oostindische Compagnie',
    'Indische Compagnie',
    'Denaturalisatie',
    'Audenaerden',
    'Aerden',
    'Pesters',
    'Portois',
    'Ratificatie',
    'Thielen',
    'Timmers',
    'Trompetters',
    'Vermaze',
    'Waerdenburgh',
    'Zeegelaar',
    'Zeegers',
}

lemma_map = {}

for term in sorted(lemma_grouper.group_lookup):
    if term in gt_lemmas:
        continue
    #print(term, lemma_grouper.group_lookup[term])
    group_term = lemma_grouper.group_lookup[term]
    print(f"{term: <50}{group_term: <40}")
    if term in lemma_map:
        print('\tDOUBLE MAP:', term, group_term, lemma_map[term])

Aadmiraliteyt in Vrieslandt                       Admiraliteit in Vriesland               
Admiraliteyt in Vriesland                         Admiraliteit in Vriesland               
Admiraliteyt in Vrieslandt                        Admiraliteit in Vriesland               
Admiraliteyt in Vrieslandt te                     Admiraliteit in Vriesland               
Admiraliteyt in Zeeland                           Admiraliteit in Zeeland                 
Admiraliteyt in Zeelande                          Admiraliteit in Zeeland                 
Admiraliteyt in Zeelandt                          Admiraliteit in Zeeland                 
Admiraliteyt op de Maze                           Admiraliteit op de Maze                 
Admiraliteyt op de Maze te                        Admiraliteit op de Maze                 
Aerden                                            Naerden                                 
Algiers Dey                                       Algiers                                 

In [225]:
df[df.clean_lemma == 'porten']

Unnamed: 0,ocr_lemma,main_term,text,first_line_id,first_line_scan_id,first_line_page_id,first_line_column_id,last_line_id,last_line_scan_id,last_line_page_id,last_line_column_id,clean_lemma,inventory,correct_lemma
14324,porten,porten,porten 356. hondert twintigh Pasporten voor de...,NL-HaNA_1.01.02_3769_0013-column-1443-431-883-...,NL-HaNA_1.01.02_3769_0013,NL-HaNA_1.01.02_3769_0013-page-24,NL-HaNA_1.01.02_3769_0013-column-1443-431-883-...,NL-HaNA_1.01.02_3769_0013-column-1443-431-883-...,NL-HaNA_1.01.02_3769_0013,NL-HaNA_1.01.02_3769_0013-page-24,NL-HaNA_1.01.02_3769_0013-column-1443-431-883-...,porten,3769,
71237,porten,porten,porten verleent. 738.,NL-HaNA_1.01.02_3785_0047-column-1216-392-1006...,NL-HaNA_1.01.02_3785_0047,NL-HaNA_1.01.02_3785_0047-page-92,NL-HaNA_1.01.02_3785_0047-column-1216-392-1006...,NL-HaNA_1.01.02_3785_0047-column-1216-392-1006...,NL-HaNA_1.01.02_3785_0047,NL-HaNA_1.01.02_3785_0047-page-92,NL-HaNA_1.01.02_3785_0047-column-1216-392-1006...,porten,3785,
71238,porten,porten,Bewindhebbers van Zeelandt te be-,NL-HaNA_1.01.02_3785_0047-column-1216-392-1006...,NL-HaNA_1.01.02_3785_0047,NL-HaNA_1.01.02_3785_0047-page-92,NL-HaNA_1.01.02_3785_0047-column-1216-392-1006...,NL-HaNA_1.01.02_3785_0047-column-1216-392-1006...,NL-HaNA_1.01.02_3785_0047,NL-HaNA_1.01.02_3785_0047-page-92,NL-HaNA_1.01.02_3785_0047-column-1216-392-1006...,porten,3785,
72732,porten,porten,"porten, de Admiraliteyt op de Maze te ad-",NL-HaNA_1.01.02_3786_0018-column-1317-441-904-...,NL-HaNA_1.01.02_3786_0018,NL-HaNA_1.01.02_3786_0018-page-34,NL-HaNA_1.01.02_3786_0018-column-1317-441-904-...,NL-HaNA_1.01.02_3786_0018-column-1317-441-904-...,NL-HaNA_1.01.02_3786_0018,NL-HaNA_1.01.02_3786_0018-page-34,NL-HaNA_1.01.02_3786_0018-column-1317-441-904-...,porten,3786,
95214,porten.,porten.,porten. 6a1.,NL-HaNA_1.01.02_3792_0375-column-3532-429-895-...,NL-HaNA_1.01.02_3792_0375,NL-HaNA_1.01.02_3792_0375-page-749,NL-HaNA_1.01.02_3792_0375-column-3532-429-895-...,NL-HaNA_1.01.02_3792_0375-column-3532-429-895-...,NL-HaNA_1.01.02_3792_0375,NL-HaNA_1.01.02_3792_0375-page-749,NL-HaNA_1.01.02_3792_0375-column-3532-429-895-...,porten,3792,
95215,porten.,porten.,rapport en resolutie dien aangaande. 626.,NL-HaNA_1.01.02_3792_0375-column-3532-429-895-...,NL-HaNA_1.01.02_3792_0375,NL-HaNA_1.01.02_3792_0375-page-749,NL-HaNA_1.01.02_3792_0375-column-3532-429-895-...,NL-HaNA_1.01.02_3792_0375-column-3532-429-895-...,NL-HaNA_1.01.02_3792_0375,NL-HaNA_1.01.02_3792_0375-page-749,NL-HaNA_1.01.02_3792_0375-column-3532-429-895-...,porten,3792,
95216,porten.,porten.,rapport en versûght ordre te stellen tot,NL-HaNA_1.01.02_3792_0375-column-3532-429-895-...,NL-HaNA_1.01.02_3792_0375,NL-HaNA_1.01.02_3792_0375-page-749,NL-HaNA_1.01.02_3792_0375-column-3532-429-895-...,NL-HaNA_1.01.02_3792_0375-column-3532-429-895-...,NL-HaNA_1.01.02_3792_0375,NL-HaNA_1.01.02_3792_0375-page-749,NL-HaNA_1.01.02_3792_0375-column-3532-429-895-...,porten,3792,


## Analysing Page References

We assume that each page reference is a reference to a resolution that starts and/or ends on that page. So each reference is a reference to a single resolution. That means the number of page references is equal to the number of resolution references. 

Each term can refer to one or more pages, so to one or more resolutions. Each term can also occur in one or more annual indexes, so refer to resolutions across years. 


- **linking factor**: the number of resolutions that are linked by the same index term. Index terms are a grouping mechanism that is related to the content or aboutness of the resolutions.
- **linking specificity**: how specific the content is through which an index term links resolutions. The specificity varies. A term like *military matters* broadly categorises resolutions, so resolutions that are assigned that index term may be only vaguely related to each other, whereas a narrower term, such as a town or person name, links resolutions much more specifically.
- **linking coverage**: the number of resolutions that are linked to at least one other resolution by a set of index terms.

The _linking factor_ and _linking specificity_ are inversely, but not necessarily linearly, related.

- The **inverse relationship** is explained by the amount of content that is covered by a set of resolutions, which tends to be larger for larger sets of resolutions. Terms that group many resolutions therefore group a large amount of content, which is likely more heterogeneous than the content grouped by terms that link only a few resolutions.
- The **non-linearity** comes from the fact that not all index terms are equally relevant or important for each resolution they apply to. A person might be the sender of two missives leading to two resolutions that are topically unrelated to each other. What unites these resolutions is their sender, but this is not necessarily a relevant connection.
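
To make these definitions concrete, here is a minimal sketch that computes the linking factor and linking coverage for a handful of page references. The data is hypothetical (terms, inventories and page numbers are chosen for illustration only), and it follows the simplifying assumption stated above that each page reference stands for a single resolution, identified here by its (inventory, page) pair.

```python
from collections import defaultdict

# Hypothetical toy data: (index term, inventory, page) triples,
# one triple per page reference
toy_refs = [
    ('Militaire Saken', '3792', 106),
    ('Militaire Saken', '3792', 137),
    ('Militaire Saken', '3792', 210),
    ('Militaire Saken', '3793', 15),
    ('Breda', '3792', 137),
    ('Breda', '3793', 88),
    ('Verburgh', '3792', 300),
]

# Assumption: an (inventory, page) pair stands for one resolution
term_resolutions = defaultdict(set)
for term, inventory, page in toy_refs:
    term_resolutions[term].add((inventory, page))

# linking factor: the number of resolutions linked by each term
linking_factor = {term: len(res) for term, res in term_resolutions.items()}

# linking coverage: the number of resolutions linked to at least one other
# resolution, i.e. those referenced by a term with a linking factor above 1
linked_resolutions = set()
for term, res in term_resolutions.items():
    if len(res) > 1:
        linked_resolutions |= res
linking_coverage = len(linked_resolutions)

print(linking_factor)
print(linking_coverage)
```

In this toy set, the broad term has the highest linking factor, while the term with a single reference links nothing and so does not add to the coverage. Linking specificity is not computed here, since it depends on the content of the resolutions and not just on the counts.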

In [6]:
import pandas as pd
import re

# The file with 145,985 index lemma entries
entries_file = '../../data/indices/index_entries-3763-3804-latest.csv.gz'

df = pd.read_csv(entries_file, sep='\t', compression='gzip')

df = df.rename(columns={'lemma': 'ocr_lemma'})
df['clean_lemma'] = df.ocr_lemma.apply(clean_start_end)

df['inventory'] = df.first_line_id.apply(lambda x: x.split('_')[2])



In [8]:
import re

import numpy as np

def has_comma_refs(text):
    # get_comma_refs_string returns np.nan (not None) when no reference is found
    return isinstance(get_comma_refs_string(text), str)

def get_comma_refs_string_type(text):
    if re.search(r'[,\.] \d+', text):
        return 'comma_digit_refs'
    elif ' siet ' in text:
        return 'redirect'
    elif re.search(r'[,\.] \w+\d+', text):
        return 'comma_alpha_digit_refs'
    elif re.search(r'[a-z] \w+\d+', text):
        return 'word_digit_refs'
    elif re.search(r'letter [A-Z]', text):
        return 'redirect'
    elif text[-1] == '-':
        return 'missing_end'
    else:
        return 'unknown'

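def get_comma_refs_string(text):
    # Strip the text that precedes the references and return the rest.
    # Note: re.sub replaces every match, so in entries with several
    # references the text between later matches is dropped as well.
    if re.search(r'[,\.] \d+', text):
        return re.sub(r'.*?[,\.] (\d+)', r'\1', text)
    elif re.search(r'[,\.] \w+\d+', text):
        return re.sub(r'.*?[,\.] (\w+\d+)', r'\1', text)
    elif re.search(r'[a-z] \w+\d+', text):
        return re.sub(r'.*?[a-z] (\w+\d+)', r'\1', text)
    else:
        return np.nan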

ref_texts = list(df.head(50).text)

df['ref_string'] = df.text.apply(get_comma_refs_string)
df['ref_string_type'] = df.text.apply(get_comma_refs_string_type)
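
In entries that list several page references, the `re.sub` calls above can fuse or drop numbers, because every match is replaced: `Barcelona, 608.659. 631.924.988.` yields `608631.924.988.`. A minimal sketch of an alternative for the most common pattern (`comma_digit_refs`) is to locate only the first reference with `re.search` and keep the rest of the text unchanged. The function names `get_trailing_refs_string` and `split_refs` are hypothetical and not part of the notebook, and the sketch has not been run on the full set of entries.

```python
import re

def get_trailing_refs_string(text):
    """Return the part of an entry text that starts at the first page reference."""
    match = re.search(r'[,.] (\d+)', text)
    if match is None:
        return None
    # Keep everything from the first digit of the first reference onwards.
    return text[match.start(1):]

def split_refs(ref_string):
    """Split a reference string into its runs of digits."""
    return re.findall(r'\d+', ref_string)

ref_string = get_trailing_refs_string('Barcelona, 608.659. 631.924.988.')
print(ref_string)
print(split_refs(ref_string))
```

Text that follows the references, OCR-damaged numbers such as `1a6` (which `split_refs` would split into two numbers) and possible page ranges would still need separate handling.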


In [9]:
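# Number of entries with an extracted reference string per inventory
# (not displayed: a cell only shows the value of its last expression)
df[df.ref_string.notna()].inventory.value_counts()
# Fraction of entries per reference string type
df.ref_string_type.value_counts() / len(df)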

comma_digit_refs          0.879370
unknown                   0.066808
word_digit_refs           0.029035
comma_alpha_digit_refs    0.013912
missing_end               0.006671
redirect                  0.004205
Name: ref_string_type, dtype: float64

In [13]:
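def split_ref_string(ref_string):
    # Keep only purely numeric tokens; str.isdigit() is False for empty strings
    return [ref for ref in re.split(r'\W+', ref_string) if ref.isdigit()]

# Number of page references per entry (NaN if no reference string was extracted)
df['num_refs'] = df.ref_string.apply(lambda x: len(split_ref_string(x)) if x is not np.nan else np.nan)
df.num_refs.value_counts()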

1.0    125566
0.0      3232
2.0       591
3.0        19
4.0         6
Name: num_refs, dtype: int64

In [14]:
df[df.num_refs > 2]

Unnamed: 0,ocr_lemma,main_term,text,first_line_id,first_line_scan_id,first_line_page_id,first_line_column_id,last_line_id,last_line_scan_id,last_line_page_id,last_line_column_id,clean_lemma,inventory,ref_string,ref_string_type,num_refs
24,adnirdites Rotterdam,adnirdites,"adnirdites Rotterdam, nopende aen- haien van v...",NL-HaNA_1.01.02_3763_0626-column-3416-1337-896...,NL-HaNA_1.01.02_3763_0626,NL-HaNA_1.01.02_3763_0626-page-1251,NL-HaNA_1.01.02_3763_0626-column-3416-1337-896...,NL-HaNA_1.01.02_3763_0626-column-3416-1337-896...,NL-HaNA_1.01.02_3763_0626,NL-HaNA_1.01.02_3763_0626-page-1251,NL-HaNA_1.01.02_3763_0626-column-3416-1337-896...,adnirdites Rotterdam,3763,15.59 168 203.,comma_digit_refs,4.0
72,Adniraliteyt Amsterdam,Adniraliteyt,"wegens een Pinck van Scheve- ningen gesonden, ...",NL-HaNA_1.01.02_3763_0627-column-1365-421-889-...,NL-HaNA_1.01.02_3763_0627,NL-HaNA_1.01.02_3763_0627-page-1252,NL-HaNA_1.01.02_3763_0627-column-1365-421-889-...,NL-HaNA_1.01.02_3763_0627-column-1365-421-889-...,NL-HaNA_1.01.02_3763_0627,NL-HaNA_1.01.02_3763_0627-page-1252,NL-HaNA_1.01.02_3763_0627-column-1365-421-889-...,Adniraliteyt Amsterdam,3763,921.948.995.,comma_digit_refs,3.0
99,Aadniraliteyt Zeelandt,Aadniraliteyt,nopende equipage na Portugael en de Middelandt...,NL-HaNA_1.01.02_3763_0627-column-3459-441-871-...,NL-HaNA_1.01.02_3763_0627,NL-HaNA_1.01.02_3763_0627-page-1253,NL-HaNA_1.01.02_3763_0627-column-3459-441-871-...,NL-HaNA_1.01.02_3763_0627-column-3459-441-871-...,NL-HaNA_1.01.02_3763_0627,NL-HaNA_1.01.02_3763_0627-page-1253,NL-HaNA_1.01.02_3763_0627-column-3459-441-871-...,Aadniraliteyt Zeelandt,3763,833.994 1036. nos.,comma_digit_refs,3.0
140,Auerquercq,Auerquercq,"Auerquercq, advertentie, 17.31-35.62. 97. 123....",NL-HaNA_1.01.02_3763_0628-column-396-427-902-2...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1254,NL-HaNA_1.01.02_3763_0628-column-396-427-902-2878,NL-HaNA_1.01.02_3763_0628-column-1345-402-882-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1254,NL-HaNA_1.01.02_3763_0628-column-1345-402-882-...,Auerquercq,3763,1797123139201227297303331341347364368416436451...,comma_digit_refs,3.0
156,Barcelona,Barcelona,"Barcelona, 608.659. 631.924.988.",NL-HaNA_1.01.02_3763_0628-column-1345-402-882-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1254,NL-HaNA_1.01.02_3763_0628-column-1345-402-882-...,NL-HaNA_1.01.02_3763_0628-column-1345-402-882-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1254,NL-HaNA_1.01.02_3763_0628-column-1345-402-882-...,Barcelona,3763,608631.924.988.,comma_digit_refs,3.0
204,Beyersche,Beyersche,"Beyersche interessen en aghterstallen, 266. 29...",NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1255,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,NL-HaNA_1.01.02_3763_0628,NL-HaNA_1.01.02_3763_0628-page-1255,NL-HaNA_1.01.02_3763_0628-column-3442-417-970-...,Beyersche,3763,266295.651.657.980.,comma_digit_refs,4.0
2375,Engeland,Engeland,"Hermitage advertentie, 5. 14. 92. 100. 1a6. 13...",NL-HaNA_1.01.02_3764_0011-column-2645-416-899-...,NL-HaNA_1.01.02_3764_0011,NL-HaNA_1.01.02_3764_0011-page-21,NL-HaNA_1.01.02_3764_0011-column-2645-416-899-...,NL-HaNA_1.01.02_3764_0011-column-2645-416-899-...,NL-HaNA_1.01.02_3764_0011,NL-HaNA_1.01.02_3764_0011-page-21,NL-HaNA_1.01.02_3764_0011-column-2645-416-899-...,Engeland,3764,5149210011361452192472582883143493763864034134...,comma_digit_refs,3.0
5737,Overquantier van Gelderlandt,Overquantier,"unt Kesd, u99. rs. 868. 871 914 1083.",NL-HaNA_1.01.02_3765_0019-column-2606-386-896-...,NL-HaNA_1.01.02_3765_0019,NL-HaNA_1.01.02_3765_0019-page-37,NL-HaNA_1.01.02_3765_0019-column-2606-386-896-...,NL-HaNA_1.01.02_3765_0019-column-2606-386-896-...,NL-HaNA_1.01.02_3765_0019,NL-HaNA_1.01.02_3765_0019-page-37,NL-HaNA_1.01.02_3765_0019-column-2606-386-896-...,Overquantier van Gelderlandt,3765,868871 914 1083.,comma_digit_refs,3.0
8198,Overquartier van Gelderlande,Overquartier,"H stunt, brand aldaer, 183.205. joo. 452. 458....",NL-HaNA_1.01.02_3766_0020-column-521-433-899-2...,NL-HaNA_1.01.02_3766_0020,NL-HaNA_1.01.02_3766_0020-page-38,NL-HaNA_1.01.02_3766_0020-column-521-433-899-2878,NL-HaNA_1.01.02_3766_0020-column-521-433-899-2...,NL-HaNA_1.01.02_3766_0020,NL-HaNA_1.01.02_3766_0020-page-38,NL-HaNA_1.01.02_3766_0020-column-521-433-899-2878,Overquartier van Gelderlande,3766,1834524587707305 433. nei 1364,comma_digit_refs,3.0
10393,teren in Ryssel,teren,teren in Ryssel gelaten magh werden. 1247. rak...,NL-HaNA_1.01.02_3767_0765-column-1324-408-930-...,NL-HaNA_1.01.02_3767_0765,NL-HaNA_1.01.02_3767_0765-page-1528,NL-HaNA_1.01.02_3767_0765-column-1324-408-930-...,NL-HaNA_1.01.02_3767_0765-column-1324-408-930-...,NL-HaNA_1.01.02_3767_0765,NL-HaNA_1.01.02_3767_0765-page-1528,NL-HaNA_1.01.02_3767_0765-column-1324-408-930-...,teren in Ryssel,3767,12471322. consent in de Petitie tot de Hagazyn...,comma_digit_refs,4.0


In [15]:
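s = df.num_refs.value_counts()

# Total number of page references: references per entry (i) times
# the number of entries with that many references (c)
print(sum([i * c for i, c in zip(s.index, s)]))
for i, c in zip(s.index, s):
    print(i, c, i * c)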

126829.0
1.0 125566 125566.0
0.0 3232 0.0
2.0 591 1182.0
3.0 19 57.0
4.0 6 24.0


In [320]:
# What are the entries with unknown endings?
list(df[df.ref_string_type == 'unknown'].text.head(50))

['equipage ende onderhoudt van Schepen na de Middelanduche Zee',
 'nopende aenhalinge van Koets en Paerden van den Ridder Richard simers, sso.',
 'geen Granen na vyandtlijcke Plaetsen te laten passeren; ror5.',
 'nopende het verkoopen van eeni- ge onbequame Schepen, Gor.',
 'nopende het senden van Bus- kruyt na Genua en neutrale Have- nen, sor.',
 'item Baron Spar, sor.',
 'Burcard toegestaen voor een maendt van Breda na Amsterdam te mogen vertrecken, roso.',
 'Conferentie Krijghsgevangenen afgebro- ken, ar.',
 'der Gedeputeerden tot de Ko- ninghlycke Portugaelsche Bruydt, ES)',
 'Drakesteyn, toor.',
 'Druet, tost.',
 'Bk, Brigadier, versoeckende tot',
 'Els, Boudewyn Willem , Capiteyn, versoeckende den eedt by procura- tie te mogen doen, gtr.',
 'Els, Collonel en Brigadier, versoec- kende het Gouvernement van Roer- monde by vacature, toro.',
 'gelast alle devorren aen te wen- den tot ontslaginge van het Schip de jonge lan, met retorsie na West- Indien varende, door Engelsche ge- nomen

In [321]:
s = df.groupby(['clean_lemma']).num_refs.sum().sort_values()
for index in s[s > 100].index:
    print(f"{index: <40}{s[index]: > 5}")

Bergen op Zoom                           101.0
vander Duyn                              101.0
Vranckrijck                              102.0
weegens                                  102.0
niet                                     103.0
Elsacker                                 103.0
Doornici                                 103.0
Raad van Vlaanderen                      103.0
Hochepied                                103.0
Kerckelijcke saacken                     103.0
vacante Compagnie                        104.0
van Rechteren                            105.0
DoParenten                               105.0
Commissarissen Deciseurs                 105.0
Rechteren                                106.0
Nalatenschap                             107.0
Raasvelt                                 107.0
Hertogenbosch                            108.0
twee                                     109.0
Greflein                                 109.0
Hudson                                   110.0
Stockholm    

In [341]:
df[df.clean_lemma.str.len() < 4].clean_lemma.value_counts()

       1037
Hop     974
E       836
ren     677
EX      520
       ... 
iem       1
cie       1
Nys       1
wyk       1
Eck       1
Name: clean_lemma, Length: 453, dtype: int64

In [376]:
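# Ignore lemmas shorter than four characters
temp_df = df[df.clean_lemma.str.len() >= 4]

# Sum the references per lemma and draw a random sample of 1,000 lemmas
sample_lemma_num_refs = temp_df.groupby(['clean_lemma']).num_refs.sum().sample(1000)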

In [377]:
sample_df = df[df.clean_lemma.isin(sample_lemma_num_refs.index)]
sample_df.shape

(13638, 16)

In [378]:
{term: 'person_name' for term in list(sample_df.clean_lemma.drop_duplicates())}

{'nee Winter quartieren': 'person_name',
 'Autwerpen': 'person_name',
 'Armaitres tot Duynkercken': 'person_name',
 'Auvergne': 'person_name',
 'Backere': 'person_name',
 'Chevallerie': 'person_name',
 'Eedt van suyveringe': 'person_name',
 'Geursen': 'person_name',
 'Ginu': 'person_name',
 'Guardemeuble de Lange': 'person_name',
 's Hertogenbosch': 'person_name',
 'SI Jacomo Dirix': 'person_name',
 'Jan Somer': 'person_name',
 'Langenbergh': 'person_name',
 'Lillo': 'person_name',
 'Meuthen': 'person_name',
 'een Friseer-molen te': 'person_name',
 'Vlaenderen': 'person_name',
 'Oostende': 'person_name',
 'Oyen': 'person_name',
 'Pasporten-recht': 'person_name',
 'Plante': 'person_name',
 'Queisen': 'person_name',
 'Rumpf': 'person_name',
 'Seville': 'person_name',
 'Sichterman': 'person_name',
 'Tullebardin': 'person_name',
 'Westwoldingerlandt': 'person_name',
 'Aken': 'person_name',
 'Andringa': 'person_name',
 'Aspremont': 'person_name',
 'Barbaryen': 'person_name',
 'Benthein-Stei

In [None]:
term_cat = {
    'Roesteren': 'person_name',
    'Commissie': 'org',
    'Coolen': 'person_name',
    'Delsupéche': 'person_name',
    'opioven': 'person_name',
    'Heemert': 'person_name',
    'Mailboroug': 'person_name',
    'Mealingh': 'person_name',
    'Rouser': 'person_name',
    'Armentiers': 'person_name',
    'Fytingh': 'person_name',
    'Rollané': 'person_name',
    'Thieleman Hulsipe': 'person_name',
    'tuna': 'person_name',
    'Sintsendorff': 'person_name',
    'Martini': 'person_name',
    'wegens': 'person_name',
    'Santen': 'person_name',
    'soodanich in te schicken': 'fout',
    'Mayeux': 'person_name',
    'zZuerins': 'person_name',
    'Auger': 'person_name',
    'Verliege': 'person_name',
    'Noorthey': 'person_name',
    'Schilt': 'person_name',
    'Fourquin': 'person_name',
    'Martyn Pasport verleent': 'person_name',
    'van Rieu': 'person_name',
    'de Wit Brieven van': 'person_name',
    'Bleyswyck': 'person_name',
    'Flinck te': 'person_name',
    'Schoenmakers': 'person_name',
    'Verlooven': 'person_name',
    'van den Bergh': 'person_name',
    'Besmettelycke siekten': 'person_name',
    'Declaratien van de Vuurstoockster van Durs': 'person_name',
    'Lutteken': 'person_name',
    'Calkberner Brieven van': 'person_name',
    'de Groote': 'person_name',
    'van Hoogstraten': 'person_name',
    'de Retroactta': 'person_name',
    'Hallandi': 'person_name',
    'aan den Resident Galliers': 'person_name',
    'Barbut Brieven van': 'person_name',
    'Bertlingh': 'person_name',
    'Carrier': 'person_name',
    'Commistien in den Raadt van Staate': 'person_name',
    'Gylen': 'person_name',
    'van Heusden': 'person_name',
    'van Affelen te': 'person_name',
    'page': 'person_name',
    'lycke Majesteyt': 'person_name',
    'Euskerke': 'person_name',
    'Lerberge': 'person_name',
    'miteert': 'person_name',
    'Bors van Waveren': 'person_name',
    'Middelen tot Iperen': 'person_name',
    'Memorien van den Grave Konigsegg Erps': 'person_name',
    'van Bladel': 'person_name',
    'voorige Traclaaten': 'person_name',
    'Alvares': 'person_name',
    'van Hauxleden': 'person_name',
    'Gerritje': 'person_name',
    'wan Aarlerixtel': 'person_name',
    'de Roomsche Kerken': 'person_name',
    'Algiers Pasport': 'person_name',
    'Gastelaars Brieven van': 'person_name',
    'van Heteren Aitiestatie de': 'person_name',
    'van Moens': 'person_name',
    'Nelleftein': 'person_name',
    'generaale pyrdon': 'person_name',
    'de Wolff': 'person_name',
    'Booreel': 'person_name',
    'van Heyden tot Otmarssen': 'person_name',
    'giment van Hirzel': 'person_name',
    'Goste': 'person_name',
    'koopen': 'person_name',
    'Rotteveel': 'person_name',
    'Lyste van de Nieuwejaaren': 'person_name',
    'Corendyckér': 'person_name',
    'Hummeling': 'person_name',
    'Straatmaackers': 'person_name',
    'Sweers de Landas': 'person_name',
    'Hoeymans': 'person_name',
    'Heer van de Hoeve': 'person_name',
    'van Issem': 'person_name',
    'voor het Regiment van Plotho': 'person_name',
    'Notarissen in de Meyerye': 'person_name',
    'Leewe': 'person_name',
    'overgegeeven': 'person_name',
    'van Aat': 'person_name',
    'Tonis': 'person_name',
    'van Aspremont': 'person_name',
    'van Bleyywyck': 'person_name',
    'Coehoorn van Houwerda': 'person_name',
    'na Braband': 'person_name',
    'BDB Aad van Braband te': 'person_name',
    'Verlaan': 'person_name',
    'van Biesenbroek': 'person_name',
    'Haaften': 'person_name'
}

The classification of a random sample gives us an estimate of the distribution of classes. Combining this with each term's linking factor gives us insight into the linking potential of each class, and shows for which class the effort of building curated term lists pays off most in terms of the number of resolutions covered.
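
As an illustration of that combination, the sketch below aggregates the number of terms and the number of page references per class. The term names, classes and counts are made-up placeholders, not values from the index data.

```python
from collections import defaultdict

# Hypothetical placeholders: term -> class, and term -> number of page references.
term_class = {
    'place A': 'location', 'place B': 'location',
    'person A': 'person_name', 'person B': 'person_name', 'person C': 'person_name',
    'topic A': 'topic',
}
term_num_refs = {
    'place A': 12, 'place B': 4,
    'person A': 2, 'person B': 1, 'person C': 1,
    'topic A': 20,
}

# Per class: number of terms to curate and number of references they cover.
class_stats = defaultdict(lambda: {'terms': 0, 'refs': 0})
for term, term_cls in term_class.items():
    class_stats[term_cls]['terms'] += 1
    class_stats[term_cls]['refs'] += term_num_refs.get(term, 0)

# References per term is a rough indicator of the return on curation effort.
refs_per_term = {cls: stats['refs'] / stats['terms'] for cls, stats in class_stats.items()}

for cls in sorted(class_stats):
    print(cls, class_stats[cls], round(refs_per_term[cls], 2))
```

A class with few terms and many references per term needs a short curated list to cover many resolutions, while a class with many terms and few references per term needs a long list for little coverage.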



In [390]:


list(sample_df.apply(lambda x: ' '.join([x['clean_lemma'], x['text']]), axis=1))




['nee Winter quartieren nee Winter quartieren, 918.',
 'nee Winter quartieren aerte tifthop van Gran Brief van felicitatie over het verwisielen van het jaer, 45.',
 'Autwerpen Autwerpen, Bringius versoeckende haer Hoogh Mog. approbatie als Collo- nel titulair, 896.',
 'Autwerpen wegens iwee Schepen op de Schelde kruyssende, 1087.',
 'Autwerpen nopendé het senden van een Bat- talllon na Hulster-Ambacht, 1000.',
 'Autwerpen wegens het senden van Levens- middelen na den Vyandt, 99. 281.',
 'Autwerpen nopende vier Compagnien van Junius, leggende op twee Forten by het Vlaemsche Hooft, 1166. 1192.',
 'Armaitres tot Duynkercken Armaitres tot Duynkercken, 242.252. 274. 344.',
 'Auvergne Auvergne, Generael Major, omme tot Lieutenant Generael aengestelt te werden, 188.',
 'Backere Backere, 781.',
 'Chevallerie Chevallerie versoeckende Montfort tot sijn gevangenisse, 744',
 'Eedt van suyveringe Eedt van suyveringe, 42.',
 'Geursen Geursen, Capiteyn om Major de Briga- de aengestelt te werden, 258.

In [369]:
sample_df.first_line_id.apply(get_line_iiif_url, margin=100)

1285      https://images.diginfra.net/iiif/NL-HaNA_1.01....
2025      https://images.diginfra.net/iiif/NL-HaNA_1.01....
2026      https://images.diginfra.net/iiif/NL-HaNA_1.01....
2027      https://images.diginfra.net/iiif/NL-HaNA_1.01....
2028      https://images.diginfra.net/iiif/NL-HaNA_1.01....
                                ...                        
143801    https://images.diginfra.net/iiif/NL-HaNA_1.01....
143983    https://images.diginfra.net/iiif/NL-HaNA_1.01....
143984    https://images.diginfra.net/iiif/NL-HaNA_1.01....
143985    https://images.diginfra.net/iiif/NL-HaNA_1.01....
145581    https://images.diginfra.net/iiif/NL-HaNA_1.01....
Name: first_line_id, Length: 1142, dtype: object

In [395]:
temp_df = sample_df.groupby('clean_lemma').head(1)

temp_df = temp_df[['inventory', 'clean_lemma', 'text', 'first_line_id', 'first_line_column_id']]
temp_df['line_iiif_url'] = temp_df.first_line_id.apply(get_line_iiif_url)
temp_df['column_iiif_url'] = temp_df.first_line_id.apply(get_column_iiif_url)
temp_df = temp_df.drop(['first_line_id','first_line_column_id'], axis=1)
temp_df

Unnamed: 0,inventory,clean_lemma,text,line_iiif_url,column_iiif_url
0,3763,nee Winter quartieren,"nee Winter quartieren, 918.",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
129,3763,Autwerpen,"Autwerpen, Bringius versoeckende haer Hoogh Mo...",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
138,3763,Armaitres tot Duynkercken,"Armaitres tot Duynkercken, 242.252. 274. 344.",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
148,3763,Auvergne,"Auvergne, Generael Major, omme tot Lieutenant ...",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
153,3763,Backere,"Backere, 781.",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
...,...,...,...,...,...
144833,3804,Parnasiins van de Portugeesche Jobasche Natie,Parnasiins van de Portugeesche Jobasche Natie ...,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
145009,3804,DD Aaden van de,"DD Aaden van de gemeene drie Bonden,",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
145420,3804,Roode,Roode om aangefielt te werden als Scheepen van...,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
145492,3804,Slegers,Slegers om seekere Sententie tegens baar bui- ...,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....


In [396]:
import os

data_dir = '../../analysis/Indices/'
sample_file = os.path.join(data_dir, 'entries-sample-categorise.csv')
temp_df.to_csv(sample_file)

### Observations

- Person names link fewer resolutions than other categories (low frequency of occurrence: low token/type ratio)
- Person names are difficult to curate (high number of distinct names: many types)


### Conclusions

- Person names are less valuable in linking and faceting resolutions because of their lower linking potential and the higher effort required to operationalise them well. 
- Topics are low in number so easier to curate and operationalise, and have very high linking potential.
- Topics and geographical locations have a hierarchical structure. In the case of topics, it is not a tree-like structure but a directed acyclic graph (DAG), with partially overlapping topics and sub-topics connected to multiple parents.

In [430]:
data_dir = '../../analysis/Indices/'
categorised_sample_file = os.path.join(data_dir, 'entries-sample-categorised-Rik.tsv')

categorised_sample_df = pd.read_csv(categorised_sample_file, sep='\t')
categorised_sample_df

Unnamed: 0,inventory,category,clean_lemma,text,line_iiif_url,column_iiif_url
0,3763,e,nee Winter quartieren,"nee Winter quartieren, 918.",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
1,3763,l,Autwerpen,"Autwerpen, Bringius versoeckende haer Hoogh Mo...",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
2,3763,p,Armaitres tot Duynkercken,"Armaitres tot Duynkercken, 242.252. 274. 344.",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
3,3763,p,Auvergne,"Auvergne, Generael Major, omme tot Lieutenant ...",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
4,3763,p,Backere,"Backere, 781.",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
...,...,...,...,...,...,...
995,3804,,Parnasiins van de Portugeesche Jobasche Natie,Parnasiins van de Portugeesche Jobasche Natie ...,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
996,3804,,DD Aaden van de,"DD Aaden van de gemeene drie Bonden,",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
997,3804,,Roode,Roode om aangefielt te werden als Scheepen van...,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
998,3804,,Slegers,Slegers om seekere Sententie tegens baar bui- ...,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....


In [431]:
s = categorised_sample_df.category.apply(lambda x: x.replace('?', '') if x is not np.nan else np.nan).value_counts()

print(sum(s))
s

200


p      97
l      34
z      30
e      24
f       7
i       7
i;l     1
Name: category, dtype: int64

In [432]:
s / sum(s)


p      0.485
l      0.170
z      0.150
e      0.120
f      0.035
i      0.035
i;l    0.005
Name: category, dtype: float64

In [433]:
s[[i for i in s.index if i != 'e']] / sum(s[[i for i in s.index if i != 'e']])

p      0.551136
l      0.193182
z      0.170455
f      0.039773
i      0.039773
i;l    0.005682
Name: category, dtype: float64

In the sample of 200 categorised lemmas, 24 entries (12%) are incorrectly identified as lemmas. Of the remaining 176 lemmas, there are:

- 97 person names (55%)
- 35 place names (20%)
- 30 topics (17%)
- 8 institutes/organisations (5%)
- 7 person roles/functions (4%)

One lemma is categorised as both institute and place (`i;l`) and is counted in both groups, so the counts add up to 177.


In [434]:
categorised_sample_df

Unnamed: 0,inventory,category,clean_lemma,text,line_iiif_url,column_iiif_url
0,3763,e,nee Winter quartieren,"nee Winter quartieren, 918.",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
1,3763,l,Autwerpen,"Autwerpen, Bringius versoeckende haer Hoogh Mo...",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
2,3763,p,Armaitres tot Duynkercken,"Armaitres tot Duynkercken, 242.252. 274. 344.",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
3,3763,p,Auvergne,"Auvergne, Generael Major, omme tot Lieutenant ...",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
4,3763,p,Backere,"Backere, 781.",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
...,...,...,...,...,...,...
995,3804,,Parnasiins van de Portugeesche Jobasche Natie,Parnasiins van de Portugeesche Jobasche Natie ...,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
996,3804,,DD Aaden van de,"DD Aaden van de gemeene drie Bonden,",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
997,3804,,Roode,Roode om aangefielt te werden als Scheepen van...,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
998,3804,,Slegers,Slegers om seekere Sententie tegens baar bui- ...,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....


In [505]:
temp_df = df[['clean_lemma', 'num_refs']].drop_duplicates()#.set_index('clean_lemma')

temp_df = pd.merge(temp_df, categorised_sample_df, on=['clean_lemma'], how='right')

In [508]:
temp_df['category'] = temp_df.category.apply(lambda x: np.nan if x is np.nan else x.replace('?',''))
temp_df['num_links'] = temp_df.num_refs.apply(lambda x: 0 if x == 0.0 else (x * (x-1))/2)
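
The `num_links` formula counts unordered pairs: an entry that references x resolutions connects x(x-1)/2 pairs of resolutions. A minimal sketch that checks this against an explicit enumeration, where the helper name `num_pairwise_links` and the page numbers are illustrative only:

```python
from itertools import combinations

def num_pairwise_links(num_refs):
    """Number of unordered pairs among num_refs resolutions."""
    return num_refs * (num_refs - 1) // 2

# Illustrative page references of a single index entry.
pages = [608, 659, 631, 924]

# Enumerate every unordered pair of pages explicitly.
pairs = list(combinations(pages, 2))

assert len(pairs) == num_pairwise_links(len(pages))
print(len(pairs))  # 6
```

Entries with zero or one reference create no links, which matches the formula without a special case.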

In [518]:
# Keep only the categorised lemmas
temp_df = temp_df[temp_df.category.notna()][['category', 'num_refs', 'num_links']]

s = temp_df.groupby(['category']).num_links.value_counts()
t = s.unstack('category').fillna(0.0)

t = t.reset_index()

# How many links does each category generate?
t[['p', 'l', 'i', 'f', 'z']].apply(lambda x: x * t.num_links).sum()


category
p     20.0
l    281.0
i    232.0
f     23.0
z    227.0
dtype: float64

In [521]:
categorised_sample_df.category.apply(lambda x: np.nan if x is np.nan else x.replace('?','')).value_counts()

p      97
l      34
z      30
e      24
f       7
i       7
i;l     1
Name: category, dtype: int64

Unnamed: 0_level_0,inventory,category,text,line_iiif_url,column_iiif_url
clean_lemma,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
nee Winter quartieren,3763,e,"nee Winter quartieren, 918.",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
Autwerpen,3763,l,"Autwerpen, Bringius versoeckende haer Hoogh Mo...",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
Armaitres tot Duynkercken,3763,p,"Armaitres tot Duynkercken, 242.252. 274. 344.",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
Auvergne,3763,p,"Auvergne, Generael Major, omme tot Lieutenant ...",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
Backere,3763,p,"Backere, 781.",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
...,...,...,...,...,...
Parnasiins van de Portugeesche Jobasche Natie,3804,,Parnasiins van de Portugeesche Jobasche Natie ...,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
DD Aaden van de,3804,,"DD Aaden van de gemeene drie Bonden,",https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
Roode,3804,,Roode om aangefielt te werden als Scheepen van...,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....
Slegers,3804,,Slegers om seekere Sententie tegens baar bui- ...,https://images.diginfra.net/iiif/NL-HaNA_1.01....,https://images.diginfra.net/iiif/NL-HaNA_1.01....


## Searching in the index entries
