## Parsing REPUBLIC hOCR Files

The scans of the printed RSG volumes have the following characteristics:

- all scans:
  - have two pages per scan
  - have up to 4 columns per scan, 2 per page
  - are around 4800 pixels wide: the left page runs up to roughly pixel 2400 and the right page starts at roughly pixel 2400
- scans of index pages
  - have no page numbers
- scans of resolution pages
  - have page numbers (left-side page is even, right-side page is odd)
  
### Columns

The scans are normalized such that the columns are straight. The text width should be around 1000 pixels. Some columns are not cut out properly, resulting in columns that are either too narrow (some of the column text is missing) or too wide (the hOCR output contains partial text from two columns).
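The width check described above could be sketched as follows. This is a minimal illustration, not part of `parse_republic_hocr_files`: the function name `classify_column_width` and the 25% tolerance are assumptions; only the expected width of about 1000 pixels comes from the notes above.

```python
def classify_column_width(width, expected=1000, tolerance=0.25):
    """Label a column by its pixel width relative to the expected text width.

    `expected` follows the ~1000 pixel text width mentioned in the notes;
    `tolerance` is an assumed margin and should be tuned on real scans.
    """
    lower = expected * (1 - tolerance)
    upper = expected * (1 + tolerance)
    if width < lower:
        return "too_narrow"  # part of the column text is probably missing
    if width > upper:
        return "too_wide"    # probably contains text from two columns
    return "ok"
```

Columns labelled `too_narrow` or `too_wide` are candidates for exclusion before page-type detection.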

### Index pages

- start of entry:
  - the line is left aligned
- end of entry:
  - the line may end before the right edge of the text column
  - the line ends with one or more page numbers
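As an illustration of the "end of entry" rule, the sketch below extracts trailing page numbers from a line of index text. The helper name `extract_page_refs` and the regular expression are assumptions made for this example; the parser module may detect page references differently.

```python
import re

# One or more numbers at the very end of the line, each followed by
# a full stop, comma or colon (OCR often confuses these).
PAGE_REFS_AT_END = re.compile(r"((?:\d+\s*[.,:]\s*)+)$")


def extract_page_refs(line_text):
    """Return the page numbers at the end of an index line, or an empty list."""
    match = PAGE_REFS_AT_END.search(line_text.strip())
    if match is None:
        return []
    return [int(num) for num in re.findall(r"\d+", match.group(1))]
```

A line that yields a non-empty list is a candidate for the last line of an index entry; a line that yields an empty list probably continues on the next line.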


### Resolution pages

- header:
  - near the top of the page (less than 350 pixels from the top)
  - page has header with:
    - even numbered pages: date page_number year
    - odd numbered pages: year page_number date
  - columns have half of page header, e.g.:
    - even numbered pages: 
      - first column: date left aligned and part of page_number right aligned
      - second column: part of page_number left aligned and year right aligned
    - odd numbered pages: 
      - first column: year left aligned and part of page_number right aligned
      - second column: part of page_number left aligned and date right aligned
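The header layout above can be written down as a small lookup. This is only a sketch of the rule as described in the list: `expected_header_layout` and the returned keys are hypothetical names, not part of the parser module.

```python
def expected_header_layout(page_number):
    """Return the expected header elements per column for a resolution page.

    Even pages (left side) read: date, page number, year.
    Odd pages (right side) read: year, page number, date.
    The page number straddles both columns, so each column holds part of it.
    """
    is_even = page_number % 2 == 0
    outer_left, outer_right = ("date", "year") if is_even else ("year", "date")
    return {
        "page_side": "left" if is_even else "right",
        "first_column": (outer_left, "page_number_part"),
        "second_column": ("page_number_part", outer_right),
    }
```

Comparing the recognized header words of a column against this expectation is one possible way to spot columns that were cut out incorrectly.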
      
### Viewer

- page viewer: https://images.huygens.knaw.nl/assets/argos/index.html
- list of page URLs: https://images.huygens.knaw.nl/api/argos


### National Archive site

- search in the archive: https://www.nationaalarchief.nl/onderzoeken/index/nt00444?searchTerm=
- search the index: https://www.nationaalarchief.nl/onderzoeken/zoekhulpen/voc-opvarenden
- example page: https://www.nationaalarchief.nl/onderzoeken/index/nt00444/d110980c-c864-11e6-9d8b-00505693001d


In [1]:
# This reload library is just used for developing the REPUBLIC hOCR parser 
# and can be removed once this module is stable.
%reload_ext autoreload
%autoreload 2

In [7]:
import json
import os
import re
from collections import defaultdict
from parse_hocr_files import make_hocr_page
#from parse_republic_hocr_files import get_files, get_page_types, count_page_ref_lines, get_index_entry_lines, gather_page_columns
from elasticsearch import Elasticsearch
import parse_republic_hocr_files as rep_parser

# The hOCR file name contains relevant information for parsing. Here's an example:
# NL-HaNA_1.01.02_3780_0016.jpg-0-251-98--0.40.hocr

# NL-HaNA_1.01.02 is the name of the archive
# 3780_0016 identifies the specific scan (here 3780 followed by scan number 0016)
# 0-251-98--0.40 identifies four aspects:
#   1. the number of the column (0)
#   2. the offset from the left (251)
#   3. the offset from the top (98)
#   4. and the slant (-0.40)
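The file name convention described in the comments above could be parsed as in the following sketch. It is not the implementation in `parse_republic_hocr_files`: the function name, the field names and the 2400-pixel page boundary (taken from the scan description at the top of this notebook) are assumptions for illustration.

```python
import os
import re

# archive name, two numeric identifiers, then column number, left offset,
# top offset and slant, all separated by hyphens after ".jpg"
HOCR_FILENAME = re.compile(
    r"^(?P<archive>.+)_(?P<volume>\d+)_(?P<scan>\d+)\.jpg"
    r"-(?P<column>\d+)-(?P<left>\d+)-(?P<top>\d+)-(?P<slant>-?\d+\.\d+)\.hocr$"
)


def parse_hocr_filename(filepath, page_boundary=2400):
    """Split an hOCR column file name into its parts (hypothetical field names)."""
    match = HOCR_FILENAME.match(os.path.basename(filepath))
    if match is None:
        raise ValueError("unexpected hOCR file name: " + filepath)
    left = int(match.group("left"))
    return {
        "archive": match.group("archive"),
        "scan_num": int(match.group("scan")),
        "column_num": int(match.group("column")),
        "left": left,
        "top": int(match.group("top")),
        "slant": float(match.group("slant")),
        # Assumption: columns left of the boundary belong to the even (left) page.
        "page_side": "even" if left < page_boundary else "odd",
    }
```

For the example file name above, this yields scan number 16, column 0, left offset 251, top offset 98 and slant -0.4.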



### Reading column scans for a single volume

1. get scan file info
    - scan number, page number, page side, column number, slant, page
2. iterate over pages
    - create hocr_page
    - determine page type: index, resolution, other
    

In [346]:
import copy

data_dir = "../../../Data/Projects/REPUBLIC/hocr/1725/"

scan_files = rep_parser.get_files(data_dir)
scan_files.sort(key = lambda x: x["scan_num_column_num"])
print("Number of scan files:", len(scan_files))
#print(json.dumps(scan_files[0:11], indent=2))

# Group the column files by the scan they were cut from
scan_columns = defaultdict(list)
for scan_file in scan_files:
    scan_columns[scan_file["scan_num"]].append(scan_file)

config = {
    "tiny_word_width": 15, # pixel width
    "avg_char_width": 20,
    "remove_tiny_words": True,
    "remove_line_numbers": False,
}

pages_info = rep_parser.gather_page_columns(scan_files)


Number of scan files: 1887


### Indexing Page Data in Elasticsearch



In [201]:
index = "republic_hocr_pages"
doc_type = "page"

es = Elasticsearch()
if es.indices.exists(index=index):
    print("exists, deleting")
    es.indices.delete(index=index)

def create_es_doc(page_info):
    """Copy page_info and serialize each column's hOCR data to a JSON string for indexing."""
    doc = copy.deepcopy(page_info)
    for column_info in doc["columns"]:
        if "column_hocr" in column_info:
            column_info["column_hocr"] = json.dumps(column_info["column_hocr"])
    return doc

def parse_es_doc(doc):
    """Turn each column's hOCR JSON string back into a Python object (modifies doc in place)."""
    for column_info in doc["columns"]:
        if "column_hocr" in column_info:
            column_info["column_hocr"] = json.loads(column_info["column_hocr"])
    return doc

for page_id in pages_info:
    if pages_info[page_id]["scan_num"] >= 7000:
        continue
    print(page_id)
    page_info = copy.deepcopy(pages_info[page_id])
    for column_info in page_info["columns"]:
        try:
            column_info["column_hocr"] = rep_parser.get_column_hocr(column_info, config)
        except TypeError:
            print("Error parsing file", column_info["filepath"])
    doc = create_es_doc(page_info)
    es.index(index=index, doc_type=doc_type, id=page_id, body=doc)
        
    #pages_hocr[page_id] = rep_parser.get_page_columns_hocr(pages_info[page_id], config)
#print(json.dumps(pages_info["scan-6-even"], indent=2))
#print(json.dumps(pages_hocr["scan-6-even"], indent=2))


exists, deleting
scan-4-even
scan-4-odd
scan-5-even
scan-5-odd
scan-6-even
scan-6-odd
scan-7-even
scan-7-odd
scan-8-even
scan-8-odd
scan-9-even
scan-9-odd
scan-10-even
scan-10-odd
scan-11-even
scan-11-odd
scan-12-even
scan-12-odd
scan-13-even
scan-13-odd
scan-14-even
scan-14-odd
scan-15-even
scan-15-odd
scan-16-even
scan-16-odd
scan-17-even
scan-17-odd
scan-18-even
scan-18-odd
scan-19-even
scan-19-odd
scan-20-even
scan-20-odd
scan-21-even
scan-21-odd
scan-22-even
scan-22-odd
scan-23-even
scan-23-odd
scan-24-even
scan-24-odd
scan-25-even
scan-25-odd
scan-26-even
scan-26-odd
scan-27-even
scan-27-odd
scan-28-even
scan-28-odd
scan-29-even
scan-29-odd
scan-30-even
scan-30-odd
scan-31-even
scan-31-odd
scan-32-even
scan-32-odd
scan-33-even
scan-33-odd
scan-34-even
scan-34-odd
scan-35-even
scan-35-odd
scan-36-even
scan-36-odd
scan-37-even
scan-37-odd
scan-38-even
scan-38-odd
scan-39-even
scan-39-odd
scan-40-even
scan-40-odd
scan-41-even
scan-41-odd
scan-42-even
scan-42-odd
scan-43-even
scan-43

scan-322-odd
scan-323-even
scan-323-odd
scan-324-even
scan-324-odd
scan-325-odd
scan-326-even
scan-326-odd
scan-327-even
scan-327-odd
scan-328-even
scan-328-odd
scan-329-even
scan-329-odd
scan-330-even
scan-330-odd
scan-331-even
scan-331-odd
scan-332-even
scan-332-odd
scan-333-even
scan-333-odd
scan-334-even
Error parsing file ../../../Data/Projects/REPUBLIC/hocr/1725/NL-HaNA_1.01.02_3780_0334.jpg-2-2253-75--0.00.hocr
scan-334-odd
scan-335-even
scan-335-odd
scan-336-even
scan-336-odd
scan-337-even
scan-337-odd
scan-338-even
scan-338-odd
scan-339-even
scan-339-odd
scan-340-even
scan-340-odd
scan-341-even
scan-341-odd
scan-342-even
scan-342-odd
scan-343-even
scan-343-odd
scan-344-even
scan-344-odd
scan-345-even
scan-345-odd
scan-346-even
scan-346-odd
scan-347-even
scan-347-odd
scan-348-even
scan-348-odd
scan-349-even
scan-350-even
scan-350-odd
scan-351-even
scan-351-odd
scan-352-even
scan-352-odd
scan-353-even
scan-353-odd
scan-354-even
scan-354-odd
scan-355-even
scan-355-odd
scan-356-ev

### Determining Page Type

We want to parse index pages differently from resolution pages and filter out non-text pages and pages where the columns are not properly identified.


In [404]:
def retrieve_page_doc(page_id):
    response = es.get(index=index, doc_type=doc_type, id=page_id)
    if "_source" in response:
        page_doc = parse_es_doc(response["_source"])
        return page_doc
    else:
        return None
    

for page_id in pages_info:
    #if pages_info[page_id]["scan_num"] < 14 or  pages_info[page_id]["scan_num"] > 318:
    #    continue
    page_doc = retrieve_page_doc(page_id)
    #print(json.dumps(page_info, indent=2))
    #print(pages_hocr[page_id]['scan-4-even-0'])
    rep_parser.calculate_left_jumps(page_doc)
    pages_info[page_id]["page_type"] = rep_parser.get_page_type(page_doc, config, DEBUG=False)
    print(page_id, "\t", pages_info[page_id]["page_type"])




scan-4-even 	 bad_page
scan-4-odd 	 bad_page
		NO HEADER LINE for column id scan-5-even-0
scan-5-even 	 index_page
scan-5-odd 	 index_page
scan-6-even 	 bad_page
scan-6-odd 	 index_page
scan-7-even 	 index_page
scan-7-odd 	 index_page
scan-8-even 	 index_page
scan-8-odd 	 index_page
scan-9-even 	 bad_page
scan-9-odd 	 index_page
scan-10-even 	 index_page
scan-10-odd 	 index_page
scan-11-even 	 bad_page
scan-11-odd 	 index_page
scan-12-even 	 index_page
scan-12-odd 	 index_page
scan-13-even 	 index_page
scan-13-odd 	 index_page
scan-14-even 	 index_page
scan-14-odd 	 bad_page
		NO HEADER LINE for column id scan-15-even-1
scan-15-even 	 index_page
scan-15-odd 	 bad_page
scan-16-even 	 index_page
scan-16-odd 	 index_page
scan-17-even 	 index_page
scan-17-odd 	 index_page
scan-18-even 	 index_page
scan-18-odd 	 bad_page
scan-19-even 	 index_page
scan-19-odd 	 bad_page
scan-20-even 	 index_page
scan-20-odd 	 bad_page
scan-21-even 	 index_page
scan-21-odd 	 bad_page
scan-22-even 	 index_page

scan-133-even 	 resolution_page
scan-133-odd 	 resolution_page
scan-134-even 	 resolution_page
scan-134-odd 	 resolution_page
scan-135-even 	 resolution_page
scan-135-odd 	 resolution_page
scan-136-even 	 resolution_page
scan-136-odd 	 resolution_page
scan-137-even 	 resolution_page
scan-137-odd 	 resolution_page
		NO HEADER LINE for column id scan-138-even-0
scan-138-even 	 resolution_page
scan-138-odd 	 resolution_page
scan-139-even 	 resolution_page
scan-139-odd 	 resolution_page
scan-140-even 	 resolution_page
scan-140-odd 	 resolution_page
scan-141-even 	 resolution_page
scan-141-odd 	 resolution_page
scan-142-even 	 resolution_page
scan-142-odd 	 resolution_page
scan-143-even 	 resolution_page
scan-143-odd 	 resolution_page
scan-144-even 	 resolution_page
scan-144-odd 	 resolution_page
scan-145-odd 	 resolution_page
scan-146-even 	 resolution_page
scan-146-odd 	 resolution_page
scan-147-even 	 resolution_page
scan-147-odd 	 resolution_page
scan-148-even 	 resolution_page
scan-148

scan-266-even 	 resolution_page
		NO HEADER LINE for column id scan-266-odd-2
scan-266-odd 	 resolution_page
scan-267-even 	 resolution_page
scan-267-odd 	 resolution_page
scan-268-even 	 resolution_page
scan-268-odd 	 resolution_page
scan-270-even 	 resolution_page
scan-270-odd 	 resolution_page
scan-271-even 	 resolution_page
scan-271-odd 	 resolution_page
scan-272-even 	 resolution_page
scan-272-odd 	 resolution_page
scan-273-even 	 resolution_page
scan-273-odd 	 resolution_page
scan-274-even 	 resolution_page
scan-274-odd 	 resolution_page
scan-275-even 	 resolution_page
scan-275-odd 	 resolution_page
scan-276-even 	 resolution_page
scan-276-odd 	 resolution_page
scan-277-even 	 resolution_page
scan-277-odd 	 resolution_page
scan-278-even 	 resolution_page
scan-278-odd 	 resolution_page
scan-279-even 	 resolution_page
scan-279-odd 	 resolution_page
scan-280-even 	 resolution_page
scan-280-odd 	 resolution_page
scan-281-even 	 resolution_page
scan-281-odd 	 resolution_page
scan-282-

scan-402-odd 	 resolution_page
scan-403-even 	 resolution_page
scan-403-odd 	 resolution_page
scan-404-even 	 resolution_page
scan-404-odd 	 resolution_page
scan-405-even 	 resolution_page
scan-405-odd 	 resolution_page
scan-406-even 	 resolution_page
scan-406-odd 	 resolution_page
scan-407-even 	 resolution_page
scan-407-odd 	 resolution_page
scan-408-even 	 resolution_page
scan-408-odd 	 resolution_page
scan-409-even 	 resolution_page
scan-409-odd 	 resolution_page
scan-410-even 	 resolution_page
scan-410-odd 	 resolution_page
scan-411-even 	 resolution_page
scan-411-odd 	 resolution_page
scan-412-even 	 resolution_page
scan-412-odd 	 resolution_page
scan-413-even 	 resolution_page
scan-413-odd 	 resolution_page
scan-414-even 	 resolution_page
scan-414-odd 	 resolution_page
scan-415-even 	 resolution_page
scan-415-odd 	 resolution_page
		NO HEADER LINE for column id scan-416-even-0
scan-416-even 	 resolution_page
scan-416-odd 	 resolution_page
scan-417-even 	 resolution_page
scan-417

### Parsing and Preprocessing Index Pages

- filter tiny and huge text elements (i.e. those deviating from the average character/word width and height)
- extract page lines that are part of the main text body containing index entries
- insert and clean up repetition symbols in index entries
    - determine length of repetition symbol
    - identify and replace mis-recognized repetition symbols
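The repetition-symbol steps above could look roughly like the sketch below. It assumes each line is a dict with a `left` offset and a list of `words`, each word having `text` and `width`; these field names, the window size and the jump threshold are assumptions for illustration and may not match the data structures used by `parse_republic_hocr_files`.

```python
import re
from statistics import median

# A repetition symbol is recognized as a run of two or more dashes
# (em dash, underscore or hyphen).
REPEAT_SYMBOL = re.compile(r"^[\u2014_-]{2,}$")


def avg_repeat_symbol_length(lines):
    """Average pixel width of line-initial words that look like a repetition symbol."""
    widths = [
        line["words"][0]["width"]
        for line in lines
        if line["words"] and REPEAT_SYMBOL.match(line["words"][0]["text"])
    ]
    return sum(widths) / len(widths) if widths else None


def find_deviating_lines(lines, window=4, min_jump=50):
    """Indexes of lines that start much further right than their neighbours.

    Such lines may have lost a repetition symbol (or had it mis-recognized).
    """
    deviating = []
    for i, line in enumerate(lines):
        neighbours = lines[max(0, i - window): i + window + 1]
        local_median = median(n["left"] for n in neighbours)
        if line["left"] - local_median > min_jump:
            deviating.append(i)
    return deviating
```

A deviating line whose extra indentation is close to the average repetition-symbol length is a plausible place to reinsert the symbol.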


In [341]:
from parse_republic_hocr_files import index_lemmata
avg_left = 0
lemma_index = defaultdict(list)
curr_lemma = None
    

for page_id in pages_info:
    print("\n\n", page_id)
    if pages_info[page_id]["page_type"] != "index_page":
        print("skipping non-index page")
        continue
    page_doc = retrieve_page_doc(page_id)
    pages_info[page_id]["num_page_ref_lines"] = rep_parser.count_page_ref_lines(page_doc)
    for column_info in page_doc["columns"]:
        print("\n\n", column_info["column_id"])
        column_hocr = column_info["column_hocr"]
        lines = rep_parser.get_index_entry_lines(column_hocr)
        curr_lemma = index_lemmata(column_info["column_id"], lines, lemma_index, curr_lemma)






 scan-4-even
skipping non-index page


 scan-4-odd
skipping non-index page


 scan-5-even
skipping non-index page


 scan-5-odd


 scan-5-odd-1
Header: LE        Kn
	scan: don't know line: 0 top: 244 0 1004 1004
	FIRST BODY LINE: 9 1903 	##            gejaf} Fabritius te'Cadix eh Courcamp
possibly misrecognised repeat symbol: 145 mn
DEVIATING LINE: 227 [227, 108, 63, 112, 169] gejaf} Fabritius te'Cadix eh Courcamp
avg_repeat_symbol_length: 145
DEVIATING LINE: 169 [227, 108, 63, 112, 169, 114, 116, 113, 113] ee juftificatie van de conduites ‘gehouden
avg_repeat_symbol_length: 145
DEVIATING LINE: 237 [114, 116, 113, 113, 237, 110, 114, 60, 116] redenen waarom het Schip van Zee-
avg_repeat_symbol_length: 145
DEVIATING LINE: 240 [114, 60, 116, 108, 240, 69, 115, 72, 119] gevis en geaccordeert. “916.
avg_repeat_symbol_length: 145
DEVIATING LINE: 247 [115, 72, 119, 122, 247, 122, 260, 124, 131] Re{olutie ‘wegens de adminiftratie van
avg_repeat_symbol_length: 145
DEVIATING LINE: 260 [119, 1

41 101 1 continue_stop 	      de Convoyen en Licenten tot Swolle.   169.
42 99 -2 start 	      —— wegens  aangevinge  van Rundvee én
	PAGE_REFS: [178] 	CURR LEMMA: Admiraliteyt in het Noorder Quartier
HAS LEMMA: Varckens. 178.
LEMMA: Varckens
43 100 -1 start_stop 	      Varckens.   178.
44 100 -2 start 	      —— Greveftein  aangefteldt  tot  Contrerol-
45 108 6 continue 	      leur van de Comvoyen en Licenten tot Ha(fel.
	PAGE_REFS: [256] 	CURR LEMMA: Varckens
46 107 5 continue_stop 	      256.
47 100 -3 start 	      —— commijfie  wegens  Hollandt  voor  de
	PAGE_REFS: [304] 	CURR LEMMA: Varckens
48 106 10 continue_stop 	      Heer Jelius.   304.
49 98 6 continue 	      —— iten voor de Heer Aylua wegens Pries-
	PAGE_REFS: [310] 	CURR LEMMA: Varckens
50 106 14 continue_stop 	      landt.   310.
51 108 21 continue 	 —— Akerlaken  aangefteldt  zot  Capiteyn,
52 37 -52 start 	   uw  35
53 59 -33 start 	    and    Houttuyn aangefteldt  tot Capiteya op
	PAGE_REFS: [361] 	CURR LEMMA: Varckens

5 104 -15 start 	    —— Ei   en Zoelmont   midts vier duy[ent
	PAGE_REFS: [166] 	CURR LEMMA: Beaufort aangefteldt tot Secretaris
6 150 37 continue_stop 	        gulden ter The[aurie fourneerende.   166.
HAS LEMMA: Becker, creditif van den Furft van Ooftvries-
LEMMA: Becker
7 104 -8 start 	      Becker, creditif van den Furft van Ooftvries-
	PAGE_REFS: [31] 	CURR LEMMA: Becker
8 49 -56 start_stop 	        —— landt.   31.
	PAGE_REFS: [96] 	CURR LEMMA: Becker
9 101 7 continue_stop 	      —— creditif en aangenaam.  96.
10 105 11 continue 	      —— Memorie tegens  de afgefette  admini-
	PAGE_REFS: [126] 	CURR LEMMA: Becker
11 141 53 continue_stop 	        fratoiren, te examineeren.   126.
12 49 -38 start 	   —— rapport  en  refolutie  dien aangaande.
	PAGE_REFS: [137] 	CURR LEMMA: Becker
13 49 -44 start_stop 	         —— 137.
	PAGE_REFS: [199] 	CURR LEMMA: Becker
HAS LEMMA: Becker 07% afiftentie, afgeweftn. 199,
LEMMA: Becker 07% afiftentie
14 98 0 start_stop 	      Becker 07% afiftentie, a

16 159 7 continue 	         dikant fem :magh  hebben  in  het  verkie-
17 159 2 continue 	         mn  van  cen  Armmeefter 3: te examineeren.
	PAGE_REFS: [688] 	CURR LEMMA: Bleskensgrave
18 163 6 continue_stop 	         688.
19 122 -34 start 	       —__   DrûfJard gelaft ordre te fellen tot bet
20 162 5 continue 	         opvangen  van  [Heydens  en  Vagabonden.
	PAGE_REFS: [692] 	CURR LEMMA: Bleskensgrave
21 163 6 continue_stop 	         692.
22 163 6 continue 	   —— Regenten van de Hage wegens [teke-
23 159 2 continue 	         ve Capitalen, de Raadt van Staate te ad-
	PAGE_REFS: [717] 	CURR LEMMA: Bleskensgrave
24 163 2 continue_stop 	         vijeeren.  717.
25 166 5 continue 	   —— Schouten ende Regenten    mitsgaders
HAS LEMMA: Predikant en Kerkenraadt van de Hage ge-
26 157 -5 start 	         Predikant en Kerkenraadt van de Hage ge-
27 161 0 start 	         laft haar te reguleeren na bet een en twin-
28 161 0 start 	         tighfle Articul van het Reglement op de po-
	PAGE_REF

61 0 -2 start 	 ee


 scan-10-odd-3
	IS INDEX HEADER: 0 201 	## )           E           Xx.
possibly misrecognised repeat symbol: 217 Cs
possibly misrecognised repeat symbol: 131 et
DEVIATING LINE: 240 [118, 117, 110, 108, 240, 117, 116, 235, 238] gj geadmitteert te werden van hare
avg_repeat_symbol_length: 174
DEVIATING LINE: 235 [108, 240, 117, 116, 235, 238, 116, 240, 113] geadmitteert. 491.
avg_repeat_symbol_length: 174
DEVIATING LINE: 238 [240, 117, 116, 235, 238, 116, 240, 113, 235] Pafport om eenige Goederen na Prank-
avg_repeat_symbol_length: 174
DEVIATING LINE: 240 [116, 235, 238, 116, 240, 113, 235, 113, 113] gm eenigh Silverwerck 24 Parijs te
avg_repeat_symbol_length: 174
DEVIATING LINE: 235 [238, 116, 240, 113, 235, 113, 113, 66, 65] item voor de Meubilen en Bágagie
avg_repeat_symbol_length: 174
DEVIATING LINE: 233 [112, 114, 62, 105, 233, 109, 109, 61, 102] Pafport om met vier Paarden de Ri=
avg_repeat_symbol_length: 174
DEVIATING LINE: 233 [109, 109, 61, 102, 233, 110, 21

DEVIATING LINE: 253 [130, 249, 80, 128, 253, 166, 130, 256, 132] van da Vienon: 645: 913.
avg_repeat_symbol_length: 146
DEVIATING LINE: 256 [128, 253, 166, 130, 256, 132, 254, 84, 133] van den Amanuenfis Engelenburgh.
avg_repeat_symbol_length: 146
DEVIATING LINE: 254 [166, 130, 256, 132, 254, 84, 133, 86, 135] voor vander Weyde. 828.
avg_repeat_symbol_length: 146
DEVIATING LINE: 263 [138, 137, 133, 138, 263] rapport en antwoordt. 445.
avg_repeat_symbol_length: 146
	PAGE_REFS: [39, 488] 	CURR LEMMA: Jam
0 76 -12 start_stop 	     —— pan den Refident Spina.  39. 488.
	PAGE_REFS: [44] 	CURR LEMMA: Jam
1 74 -6 start_stop 	     — van  den  Envoyé  Bruyns.   44.
	PAGE_REFS: [509] 	CURR LEMMA: Jam
2 126 47 continue_stop 	       509.
	PAGE_REFS: [35] 	CURR LEMMA: Jam
3 78 0 start_stop 	     —— ‘van den Envoyé Bijs.   35.
4 88 9 continue 	      —— pan  den   Perwer  vander  Göës.
	PAGE_REFS: [1498] 	CURR LEMMA: Jam
5 38 -46 start_stop 	   1498,
6 75 -9 start 	     —— ‘van Sara Befcheyt.   $8.
7 

0 33 -30 start 	   —Commiien van  de Heer Lohman in
	PAGE_REFS: [521] 	CURR LEMMA: Jubflituyt
1 33 -45 start_stop 	 —— des Generaliteyts Reekenkamer.   521.
2 107 25 continue 	       —— item  voor  den  Heer  de  Drews ter
	PAGE_REFS: [680] 	CURR LEMMA: Jubflituyt
3 33 -58 start_stop 	 —— Generaliteyt.   680.
4 109 12 continue 	       —— item voor den Heer Geertfema in den
	PAGE_REFS: [706] 	CURR LEMMA: Jubflituyt
5 154 48 continue_stop 	         Raadt van Staate.   706.
6 109 -4 start 	       —— redenen van onvermogen  in het fur-
7 150 32 continue 	        neeren, dev penningen ten Comptoire generaal,
	PAGE_REFS: [879] 	CURR LEMMA: Jubflituyt
8 151 26 continue_stop 	         te examineeren, … 879.
	PAGE_REFS: [960] 	CURR LEMMA: Jubflituyt
9 108 -21 start_stop 	       —— befendinge gedecerneert.   960.
10 99 -29 start 	      de Geen    Brieven  requifitoir  verleendt.
	PAGE_REFS: [158] 	CURR LEMMA: Jubflituyt
11 157 25 continue_stop 	         158.
12 95 -37 start 	      de. Groot  te 

DEVIATING LINE: 253 [116, 241, 120, 123, 253, 118, 119, 70] item op bet ver[oeck van Regen-
avg_repeat_symbol_length: 146
0 58 -27 start 	   — Regenten  van  Heefwijck  0m vermij
1 94 9 continue 	     fie   de Raadt van  Staate te advifteren.
	PAGE_REFS: [63] 	CURR LEMMA: PVHermitage
2 106 22 continue_stop 	      63.
3 65 -22 start 	    —— Regenten van Merefeldboven en Heelft
4 106 17 continue 	      om Temijie 4 Raadt van Staate te advi-
	PAGE_REFS: [79] 	CURR LEMMA: PVHermitage
5 81 -10 start_stop 	      feeren:.   79.
6 81 -11 start 	    —— Hodorp, Predikant tot Aarle on Beek
7 105 12 continue 	      om protellie   de  Raadt  van Biabandt te
	PAGE_REFS: [80] 	CURR LEMMA: PVHermitage
8 105 8 continue_stop 	      advijfeeren.   80.
9 81 -16 start 	    —— Regenten  van Erp  om  te negtie-
10 106 6 continue 	      ren’,  de Raadt  van.  Stlaate  te advijceren.
11 108 10 continue 	      93
12 105 6 continue 	     —— item  op bet  verfoeck van Regenten
13 107 13 continue 	      van Hooghl

DEVIATING LINE: 259 [131, 76, 122, 125, 259, 123, 247, 123, 247] item op de Requefte van Regenten
avg_repeat_symbol_length: 181
DEVIATING LINE: 247 [122, 125, 259, 123, 247, 123, 247, 123, 244] Regentèn van Helvoirt remijie ver-
avg_repeat_symbol_length: 181
DEVIATING LINE: 247 [259, 123, 247, 123, 247, 123, 244, 79, 249] Regenten van Drunen remie ver-
avg_repeat_symbol_length: 181
DEVIATING LINE: 244 [247, 123, 247, 123, 244, 79, 249, 119, 245] Regenten van Moergeftel remijie ver-
avg_repeat_symbol_length: 181
DEVIATING LINE: 249 [247, 123, 244, 79, 249, 119, 245, 119, 242] Regenten van Dien remijie ver-
avg_repeat_symbol_length: 181
DEVIATING LINE: 245 [244, 79, 249, 119, 245, 119, 242, 118, 238] Regenten van Huyclom gepermitteert
avg_repeat_symbol_length: 181
DEVIATING LINE: 242 [249, 119, 245, 119, 242, 118, 238, 119, 239] Regenten van Hapert en Cafteren re-
avg_repeat_symbol_length: 181
DEVIATING LINE: 238 [245, 119, 242, 118, 238, 119, 239, 115, 115] item aan Regenten van Marefel

3 129 5 continue_stop 	       len  tot furmiJemecnt van interefen.   394.
4 123 4 continue 	      —— belaftinge  van  den  hondertflen Pen-
	PAGE_REFS: [399] 	CURR LEMMA: Heylmans
5 127 7 continue_stop 	       aingh ter Generaliteyt gea;vefteert. 399.
6 123 2 continue 	     —— Cammifen  ea  Clercguch  om trae-
	PAGE_REFS: [408] 	CURR LEMMA: Heylmans
7 123 6 continue_stop 	       ment, Te examineren.   408.
8 85 -21 start 	     —— Saat van veragbierde intere (Jen van
9 131 24 continue 	       Canitalen onder haar Hoogl Mog. guarantie
10 125 23 continue 	       gewegoticert; Ie examineren,  423. F1:
11 87 -15 start 	     —— Refolutie van Overyfel, rakende het
12 35 -63 start 	   9  verleenen  van  twee  Alies  van decharge.
	PAGE_REFS: [459] 	CURR LEMMA: Heylmans
13 129 26 continue_stop 	       459:
14 81 -17 start 	     ———  Refôluiie van Zeelandi dien aangaan-
	PAGE_REFS: [395] 	CURR LEMMA: Heylmans
15 129 30 continue_stop 	       de.   395.
16 84 -20 start 	     —— gelaft aan den Grif

	PAGE_REFS: [595] 	CURR LEMMA: Aeio
31 169 0 start_stop 	         twee Alles van decharge.   595.
32 169 6 continue 	    —— wegens ongebeurde Jaaken in publique
33 168 5 continue 	         Coaranten   Hollandt  verfoght  voor fieninge
	PAGE_REFS: [777] 	CURR LEMMA: Aeio
34 167 10 continue_stop 	         te doen.  777.
	PAGE_REFS: [819] 	CURR LEMMA: Aeio
35 166 9 continue_stop 	   —— notificatie van fijn overlijden.   819.
HAS LEMMA: Horenbeeck om Hoepen te mogen uytvoeren
36 120 -37 start 	       Horenbeeck om Hoepen te mogen uytvoeren
	PAGE_REFS: [853] 	CURR LEMMA: Horenbeeck
37 169 12 continue_stop 	         by Hollandt overgenoomen.  853.
38 119 -32 start 	       van  Hornes,  Prince   nader  bericht op het
39 167 16 continue 	         contra bericht van la Motte, ’s Landts Ad-=
	PAGE_REFS: [15] 	CURR LEMMA: Horenbeeck
40 168 16 continue_stop 	         vocaten te advijeeren.   15.
	PAGE_REFS: [127] 	CURR LEMMA: Horenbeeck
41 169 17 continue_stop 	    —— advis en refolutie.   127.
HA

7 150 -5 start 	        pen/ioen van hondert guldens baar levenlangh
	PAGE_REFS: [279] 	CURR LEMMA: NEO
8 159 2 continue_stop 	         toegeleght. * 279.
9 150 -5 start 	 —— Kerckenraadt van Maaftricht om een
10 150 -4 start 	        funds om de onkofien van een beroepinge goet
	PAGE_REFS: [338] 	CURR LEMMA: NEO
11 162 7 continue_stop 	         te maken, te examineeren.   338.
12 170 15 continue 	         —    Clas  van Walcheren wegens diffe-
13 150 -4 start 	        vent   in de beroepinge van  een Predikant te
14 152 -3 start 	        biervliet    de  Magiftraat  te  berichten.
	PAGE_REFS: [360] 	CURR LEMMA: NEO
15 156 8 continue_stop 	        360.
16 154 8 continue 	 —— Gecommitteerden der Synode verfoeck
17 150 7 continue 	        om de Papieren Synodi Nationalis e4 Au-
	PAGE_REFS: [389] 	CURR LEMMA: NEO
18 151 8 continue_stop 	        tographa tot Leyden geaccordeert.   389.
19 87 -55 start 	     ——  Gecommitteerden  haar  Hoogh  Mog.
20 148 8 continue 	        voor haar yver in

23 81 -20 start 	   —— van Spörcken om achterfallen, te exa-
	PAGE_REFS: [262] 	CURR LEMMA: Memorien
24 106 4 continue_stop 	    mineeren.   262.
25 99 -4 start 	   —— van Meynertsbagen  klagten over het
26 106 3 continue 	    aanhouden van vier hondert vyf en dertigh
27 98 -3 start 	   fucken Noteboomen Hout,  de Admiraliteyt
	PAGE_REFS: [313] 	CURR LEMMA: Memorien
28 108 9 continue_stop 	    in Vrieslandt te berigten.   313.
29 113 14 continue 	    —— van  Siegman om.  reftitutie  van een
30 104 4 continue 	    Obligatie van een millioen galdens, den Ont-
	PAGE_REFS: [345] 	CURR LEMMA: Memorien
HAS LEMMA: Janger Hogendorp te berigten. 345.
LEMMA: Janger Hogendorp te berigten
31 97 -3 start_stop 	   Janger Hogendorp te berigten.   345.
32 60 -41 start 	  —  van Meynerishagen infteerende op de
33 108 6 continue 	    betalinge van een vente  van duyfent guldens
34 108 7 continue 	    ‘sjaars, de Raaden tot de Nalatenfchap te
	PAGE_REFS: [379] 	CURR LEMMA: Janger Hogendorp te berigten
35

	PAGE_REFS: [31] 	CURR LEMMA: Jchreeven
42 163 59 continue_stop 	         31,
43 64 -41 start 	    ——  item  woor  de [elve  tot  uytvoer vat
44 111 2 continue 	       Monteeringe  voor  het  Regimet van  Prins
	PAGE_REFS: [231] 	CURR LEMMA: Jchreeven
45 114 5 continue_stop 	       Willem van Heen na Namen.  231,
46 88 -22 start 	     —— Hambroeck  vier  maanden  verlof.
	PAGE_REFS: [234] 	CURR LEMMA: Jchreeven
47 116 13 continue_stop 	       234.
48 108 0 start 	     —— rapport en refolutie‚ nopende de ver=
49 115 6 continue 	       deelinge der Regimenten naa Staats 'Vlaan-
	PAGE_REFS: [234] 	CURR LEMMA: Jchreeven
50 119 14 continue_stop 	       dereu.   234.
51 98 -11 start 	      —— Passport  voor  vander  Duyn  ‘om
52 108 -1 start 	      fes  Paarden  naar  Doornick  te  voeren.
53 122 20 continue 	       23%.
54 78 -24 start 	     —— Zeelandt verfoght Deferteurs gereclas
55 119 18 continue 	       meert werdende volgens de Conventie over té
	PAGE_REFS: [235] 	CURR LEMMA: Jchreeve

possibly misrecognised repeat symbol: 147 md
DEVIATING LINE: 184 [49, 49, 59, 46, 184, 162, 45, 79, 45] om een ander Heer , den: Heer Ham te
avg_repeat_symbol_length: 147
DEVIATING LINE: 162 [49, 59, 46, 184, 162, 45, 79, 45, 91] frogeeren en geaccordeert.. 138 —
avg_repeat_symbol_length: 147
DEVIATING LINE: 224 [96, 41, 81, 83, 224, 69, 92, 0, 235] Beaufort aangefteldt tot Secretaris,
avg_repeat_symbol_length: 147
DEVIATING LINE: 235 [224, 69, 92, 0, 235, 66, 88, 86, 205] Avoroult Heer van Guincourt aan=
avg_repeat_symbol_length: 147
DEVIATING LINE: 205 [235, 66, 88, 86, 205, 70, 85, 85, 208] te berichten op de Memorie van Mey-
avg_repeat_symbol_length: 147
DEVIATING LINE: 208 [205, 70, 85, 85, 208, 207, 82, 81, 204] Requefte civile verleent. 395.
avg_repeat_symbol_length: 147
DEVIATING LINE: 207 [70, 85, 85, 208, 207, 82, 81, 204, 204] bericht op de Memorie van Meynerts-
avg_repeat_symbol_length: 147
DEVIATING LINE: 204 [208, 207, 82, 81, 204, 204, 82, 202, 80] oexecuriaal. 406,
avg_

60 131 4 continue_stop 	      119.  126.


 scan-29-even-1
	IS INDEX HEADER: 0 227 	##           Em.           xXx
possibly misrecognised repeat symbol: 166 EE
possibly misrecognised repeat symbol: 143 ne
possibly misrecognised repeat symbol: 142 od
possibly misrecognised repeat symbol: 147 a
possibly misrecognised repeat symbol: 201 me
DEVIATING LINE: 219 [219, 57, 219, 98, 51] rapport en refolntie wegens achterftal
avg_repeat_symbol_length: 157
DEVIATING LINE: 219 [219, 57, 219, 98, 51, 107, 136] klaghten van de oude adminiftratoris
avg_repeat_symbol_length: 157
DEVIATING LINE: 136 [219, 98, 51, 107, 136, 90, 218, 101, 8] ne peflmtie op de Mifive van den Ont-
avg_repeat_symbol_length: 157
DEVIATING LINE: 218 [51, 107, 136, 90, 218, 101, 8, 84, 227] Memorie van de oude Aáminiftratooren.
avg_repeat_symbol_length: 157
DEVIATING LINE: 227 [218, 101, 8, 84, 227, 76, 52, 91, 224] antwoordt wegens achterflallen te
avg_repeat_symbol_length: 157
DEVIATING LINE: 224 [227, 76, 52, 91, 224, 225,

DEVIATING LINE: 208 [0, 0, 208, 0, 208, 166] tonnen Meel te tran[porteren. 160.
avg_repeat_symbol_length: 96
DEVIATING LINE: 166 [0, 208, 0, 208, 166] meme van den Raadt van Staate om der=
avg_repeat_symbol_length: 96
0 58 -64 start 	    —— Kerckenraadt om een collefle, te exa:
	PAGE_REFS: [5308] 	CURR LEMMA: Paltz
1 147 19 continue_stop 	        mineeren.   5308.
2 104 -27 start 	      —_       ampel bericht   raakende het Turven
3 151 17 continue 	         op den Gulickfchen Bodem, te examineeren,
	PAGE_REFS: [841] 	CURR LEMMA: Paltz
4 153 23 continue_stop 	         841.
5 158 17 continue 	   —— notificeerende de geboorte vaneen Prins
6 151 9 continue 	         met Brieven van congratulatie in civile tere
	PAGE_REFS: [966] 	CURR LEMMA: Paltz
7 151 2 continue_stop 	         men beantwoordt.  966.
HAS LEMMA: Paravicini, advertentie. 899. 912. 928.939:
LEMMA: Paravicini
8 105 -39 start 	      Paravicini, advertentie.  899. 912. 928.939:
	PAGE_REFS: [947, 958, 965] 	CURR LEMMA: Paravicin

32 178 22 continue 	    vier hondert duy[ent guldens‚ de Raadt van
	PAGE_REFS: [92] 	CURR LEMMA: Pefters
33 168 8 continue_stop 	   Staate te advifeeren.   92.
	PAGE_REFS: [115] 	CURR LEMMA: Pefters
34 180 11 continue_stop 	    den Grave Daun.   115.
	PAGE_REFS: [126] 	CURR LEMMA: Pefters
35 132 -31 start_stop 	  —— Brief van Credentie.   126.
36 178 11 continue 	    duyfent guldens  aan Wiels over te maa-
37 174 7 continue 	    ken    de  Raadt van Staate te advifteren.
	PAGE_REFS: [135] 	CURR LEMMA: Pefters
38 185 22 continue_stop 	    135.
39 136 -28 start 	  ai   gelaft ftekere Alte van /ubmijie en
40 179 10 continue 	    willige  condemnatie  op  den [elven  voet te
41 179 10 continue 	    vervolgen.   T45.
42 132 -38 start 	  —— devoir te doen op de Requefte van de
43 186 16 continue 	    Claffis  van Walcheren  tot  ontflaginge van
HAS LEMMA: Jeekeren Landtman, die geweygert hadde
LEMMA: Jeekeren Landtman
44 174 -1 start 	    Jeekeren  Landtman,  die  geweygert  hadde
	PAGE_REFS

	PAGE_REFS: [68] 	CURR LEMMA: Janzer
28 140 8 continue_stop 	        te mogen omflaan.  68.
29 130 -3 start 	      —— #e advifeeren op het ver[oeck van Re-
	PAGE_REFS: [68] 	CURR LEMMA: Janzer
30 130 -2 start_stop 	       genten van Ieefwyck om remie.   68.
31 129 -3 start 	       —— te  berichten  op de Mi(ive van den
32 136 4 continue 	        Churfurft van de Palts wegens  arreft door
	PAGE_REFS: [70] 	CURR LEMMA: Janzer
33 135 3 continue_stop 	        Abraham en Mofes Cohen.  70.
34 129 -3 start 	      —— ?e advijeeren op het ver[oek van Re-
35 129 -3 start 	       genten van Ginneken en Bacel om interpre-
36 136 4 continue 	        tatie van het Placaat tot reduêlie der inte-
37 137 10 continue 	        velen.  77-
38 129 2 continue 	      —— te advifeeren op de Requefte van Re-
39 132 4 continue 	        genten van Meefeldhoven en Zeelft om remif-
	PAGE_REFS: [79] 	CURR LEMMA: Janzer
40 130 2 continue_stop 	       fie.   79.
41 91 -32 start 	      —— term op  de Mi(ive van Peters

avg_repeat_symbol_length: 147
DEVIATING LINE: 273 [104, 152, 275, 156, 273, 152, 151, 102, 148] defeëten by de laaifte Monferinte
avg_repeat_symbol_length: 147
DEVIATING LINE: 278 [151, 102, 148, 146, 278, 159, 164, 270, 142] confent van Gelderlandt in dé gene=
avg_repeat_symbol_length: 147
DEVIATING LINE: 270 [146, 278, 159, 164, 270, 142, 275, 146, 144]  Comrgijie voor den Heer Paats wes
avg_repeat_symbol_length: 147
DEVIATING LINE: 275 [159, 164, 270, 142, 275, 146, 144, 98, 147] Ze advifteren op het verfoeck vór
avg_repeat_symbol_length: 147
DEVIATING LINE: 265 [146, 144, 98, 147, 265, 139, 141, 267, 142] te advifteren op de Requefte van Ia:
avg_repeat_symbol_length: 147
DEVIATING LINE: 267 [147, 265, 139, 141, 267, 142, 283, 144, 97] te op het verfoeck van Snonckaart
avg_repeat_symbol_length: 147
DEVIATING LINE: 283 [139, 141, 267, 142, 283, 144, 97, 141, 143] van Heeswyek remijië verleend:
avg_repeat_symbol_length: 147
DEVIATING LINE: 267 [144, 97, 141, 143, 267, 144, 266, 142, 9

40 118 -11 start 	   —— verzogt Ordonnantie van  twee duy-
HAS LEMMA: Sent guldens te depecheeren woor den Heer
41 118 -10 start 	   Sent guldens  te  depecheeren  woor  den Heer
	PAGE_REFS: [478] 	CURR LEMMA: Sent
42 134 10 continue_stop 	    Hop.   478.
43 118 -5 start 	  —— te advifeeren op de Mifive van van
44 134 11 continue 	    Welderen om nogh twee Compagnien van de
	PAGE_REFS: [479] 	CURR LEMMA: Sent
45 133 14 continue_stop 	    la Rocque.   479.
46 94 -25 start 	  —— Regenten  van  Hogemierden   Lage-
47 130 16 continue 	    mierden  ende  Haulzel  remijjie  verleendt.
	PAGE_REFS: [480] 	CURR LEMMA: Sent
48 129 14 continue_stop 	    480.
49 82 -32 start 	 ——  aduis op  de  klaghten van de Chur-
	PAGE_REFS: [481] 	CURR LEMMA: Sent
50 122 9 continue_stop 	   fur van de Palts, te examineren.  481.
51 85 -30 start 	  —— onderzoeck te doen op de klagten-van
52 126 12 continue 	   Pefters  wegens  infolentie  door  Dragonaers
53 126 13 continue 	   van  vander  Duyn  te  Thienen   

DEVIATING LINE: 243 [165, 121, 119, 120, 243, 120, 243, 119, 122] Regenten van Venlo gepermitteert vier
avg_repeat_symbol_length: 147
DEVIATING LINE: 243 [119, 120, 243, 120, 243, 119, 122, 246, 123] te advifeeren op het ver[oeck van Re-
avg_repeat_symbol_length: 147
DEVIATING LINE: 246 [120, 243, 119, 122, 246, 123, 122, 125, 245] Provincie van Zeelandt verfoght or-
avg_repeat_symbol_length: 147
DEVIATING LINE: 245 [246, 123, 122, 125, 245, 123, 113, 242, 124] te advifeeren op het verfoeck van Ie
avg_repeat_symbol_length: 147
DEVIATING LINE: 242 [125, 245, 123, 113, 242, 124, 249, 123, 125] item op bet verfoeck van Regenten van
avg_repeat_symbol_length: 147
DEVIATING LINE: 249 [123, 113, 242, 124, 249, 123, 125, 248, 125] Pasport voor Belaarts om eenige Goe-
avg_repeat_symbol_length: 147
DEVIATING LINE: 248 [124, 249, 123, 125, 248, 125, 188, 128, 249] Regenten van Oir[chot verfoeck om
avg_repeat_symbol_length: 147
DEVIATING LINE: 249 [248, 125, 188, 128, 249, 121] te advifteren op he

	PAGE_REFS: [616] 	CURR LEMMA: Rcygersman
60 114 12 continue_stop 	     616.


 scan-39-even-1
	IS INDEX HEADER: 0 274 	##           E           Xx.
possibly misrecognised repeat symbol: 142 mn
DEVIATING LINE: 218 [111, 62, 62, 108, 218, 94, 232, 59, 106] Ambafadeur aangenomen favorabel te
avg_repeat_symbol_length: 143
DEVIATING LINE: 232 [62, 108, 218, 94, 232, 59, 106, 105, 105] antwoordt. 926. 935.
avg_repeat_symbol_length: 143
DEVIATING LINE: 226 [59, 106, 105, 105, 226, 226, 101, 107, 224] gm ordres en refolutie. 11.
avg_repeat_symbol_length: 143
DEVIATING LINE: 226 [106, 105, 105, 226, 226, 101, 107, 224, 102] teedenen waaromme op de paghtver-
avg_repeat_symbol_length: 143
DEVIATING LINE: 224 [226, 226, 101, 107, 224, 102, 223, 99, 53] klaghten van den Paghter te exa-
avg_repeat_symbol_length: 143
DEVIATING LINE: 223 [101, 107, 224, 102, 223, 99, 53, 97, 99] klaghten over wanbetalinge van den
avg_repeat_symbol_length: 143
DEVIATING LINE: 220 [99, 53, 97, 99, 220, 95, 97, 46, 45] 

	PAGE_REFS: [766] 	CURR LEMMA: Schoorman
3 117 26 continue_stop 	       766.
HAS LEMMA: Schorfin aangefteldt als Major van het Re-
4 68 -23 start 	    Schorfin aangefteldt als Major  van het Re-
	PAGE_REFS: [147] 	CURR LEMMA: Schorfin
5 113 16 continue_stop 	       giment van Chambrier.   147.
HAS LEMMA: Schotte om Brieven van Voorfchryvens, de
LEMMA: Schotte om Brieven van Voorfchryvens
6 70 -22 start 	     Schotte om  Brieven  van Voorfchryvens,  de
	PAGE_REFS: [628] 	CURR LEMMA: Schotte om Brieven van Voorfchryvens
7 116 24 continue_stop 	       retroatla na te fien.   628.
8 88 -5 start 	     —— Brieven  van Voorfchryvens aan het
	PAGE_REFS: [711] 	CURR LEMMA: Schotte om Brieven van Voorfchryvens
9 119 21 continue_stop 	       Hof van Sweeden.  711.
	PAGE_REFS: [199] 	CURR LEMMA: Schotte om Brieven van Voorfchryvens
HAS LEMMA: Schreurs om affiftentie, afgeweefen. 199.
LEMMA: Schreurs om affiftentie
10 71 -24 start_stop 	     Schreurs om affiftentie, afgeweefen.   199.
HAS LEMMA: Sc

	PAGE_REFS: [70] 	CURR LEMMA: Tarouca
HAS LEMMA: Temet om /ubfiftentie afgeweefen. 70.
LEMMA: Temet om /ubfiftentie afgeweefen
52 53 -24 start_stop 	    Temet  om  /ubfiftentie    afgeweefen.   70.
53 102 22 continue 	      Dad                  a
HAS LEMMA: Tengnagel 7e advifeeren op de Rcgue[le van
54 48 -29 start 	   Tengnagel 7e advifeeren  op de Rcgue[le van
55 96 15 continue 	      Pelgrom Engelbreght om approbatie van [te-
	PAGE_REFS: [713] 	CURR LEMMA: Tengnagel
56 93 15 continue_stop 	      kere Collatie.   713.


 scan-41-even-1
	IS INDEX HEADER: 0 255 	##              !           <
possibly misrecognised repeat symbol: 142 nn
DEVIATING LINE: 227 [56, 103, 54, 227, 103, 105, 54, 98] Pa/port ‘om de Monteeringe voor het
avg_repeat_symbol_length: 144
DEVIATING LINE: 105 [103, 54, 227, 103, 105, 54, 98, 4, 56] te mogen uytvoeren. 451.
avg_repeat_symbol_length: 144
DEVIATING LINE: 108 [54, 98, 4, 56, 108, 53, 92, 225, 101] 192.
avg_repeat_symbol_length: 144
DEVIATING LINE: 225 [56,

26 86 -33 start 	     ———— Refòlutie wegens de Aften:  van des
	PAGE_REFS: [205] 	CURR LEMMA: Vreede
27 134 20 continue_stop 	        charge.   205.
28 94 -22 start 	      —— antwoordt op baar Hoogh Mog. Re:
29 125 9 continue 	       Solutie ‚ raakende de defeftueusheyt, te exa-
	PAGE_REFS: [228] 	CURR LEMMA: Vreede
30 133 23 continue_stop 	        mineeren.   228.
31 85 -31 start 	     —_— propofitie wegens de vyf dertigh duy-
32 123 8 continue 	       Jent guldens van de Admiralitejt   te exa
	PAGE_REFS: [246, 415] 	CURR LEMMA: Vreede
33 132 17 continue_stop 	        mineeren.   246. 415.
34 86 -30 start 	     ——_—  om patent  voor  de  Compagnie vak
35 136 19 continue 	        Haarfma   de Raadt van Staate te advi-
	PAGE_REFS: [263] 	CURR LEMMA: Vreede
36 127 6 continue_stop 	       Jesren.   263.
37 91 -31 start 	      —— gertet op het projet Reglement op de
38 139 21 continue 	        Revijien van fententieh van den Ra4dt van
	PAGE_REFS: [265] 	CURR LEMMA: Vreede
39 141 22 continu

54 121 27 continue 	       ken.   696.                         #5
55 27 -69 start 	  ——  pegfoeck 07: dedommagement van brand
	PAGE_REFS: [791] 	CURR LEMMA: Wierfma
56 124 27 continue_stop 	        afgeweefen.   791.
HAS LEMMA: Windilgratz affcheydt gewalidiceert en Brie-
57 78 -25 start 	     Windilgratz affcheydt  gewalidiceert en  Brie-
58 127 26 continue 	       ven van recredenlie.   EIT.
59 84 -14 start 	     —— Secretaris Medaille van drie honder:
	PAGE_REFS: [626] 	CURR LEMMA: Windilgratz
60 123 13 continue_stop 	       geldens toegelenhbt.  626.
61 127 20 continue 	       wiiee                    Wins


 scan-44-even


 scan-44-even-0
	IS INDEX HEADER: 0 243 	##                        I           N
possibly misrecognised repeat symbol: 223 eeN
possibly misrecognised repeat symbol: 131 A
DEVIATING LINE: 271 [95, 144, 271, 96, 149, 267, 145] oyftigh guldens toegeleght. 267.
avg_repeat_symbol_length: 135
DEVIATING LINE: 267 [144, 271, 96, 149, 267, 145, 268, 142, 142] Refolntie v

 scan-107-even
skipping non-index page


 scan-107-odd
skipping non-index page


 scan-108-even
skipping non-index page


 scan-108-odd
skipping non-index page


 scan-109-even
skipping non-index page


 scan-109-odd
skipping non-index page


 scan-110-even
skipping non-index page


 scan-110-odd
skipping non-index page


 scan-111-even
skipping non-index page


 scan-111-odd
skipping non-index page


 scan-112-even
skipping non-index page


 scan-112-odd
skipping non-index page


 scan-113-even
skipping non-index page


 scan-113-odd
skipping non-index page


 scan-114-even
skipping non-index page


 scan-114-odd
skipping non-index page


 scan-115-even
skipping non-index page


 scan-115-odd
skipping non-index page


 scan-116-even
skipping non-index page


 scan-116-odd
skipping non-index page


 scan-117-even
skipping non-index page


 scan-117-odd
skipping non-index page


 scan-118-even
skipping non-index page


 scan-118-odd
skipping non-index page


 scan-119-even
skipping non-

ZeroDivisionError: division by zero

In [None]:
for lemma in lemma_index:
    print("\nTrefwoord:", lemma)
    #print(lemma_index[lemma])
    for entry in lemma_index[lemma]:
        pages = ", ".join([str(page_ref) for page_ref in entry["page_refs"]])
        description = entry["description"][:70]
        print("\tPagina:", pages, "\tBeschrijving:", description)

In [None]:
# the odd (right-hand) page of scan 45 is the first resolution page
# page num: 91

from fuzzy_context_searcher import FuzzyContextSearcher
import pandas as pd

config = {
    "char_match_threshold": 0.8,
    "ngram_threshold": 0.6,
    "levenshtein_threshold": 0.8,
    "ignorecase": False,
    "ngram_size": 3,
    "skip_size": 0,
}

fuzzy_searcher = FuzzyContextSearcher(config)

keywords = [
    "Admiraliteyt tot Amfterdam", 
    "Admiraliteyt in het Noorder Quartier", 
    "Admiraliteyt in Vrieslandt", 
    "Admiraliteyt in Zeelandt",
    "Varckens"
]

# each Admiraliteyt keyword has the other three as distractor terms
distractor_terms = {
    "Admiraliteyt tot Amfterdam": {
        "Admiraliteyt in het Noorder Quartier", "Admiraliteyt in Vrieslandt", "Admiraliteyt in Zeelandt"
    },
    "Admiraliteyt in het Noorder Quartier": {
        "Admiraliteyt tot Amfterdam", "Admiraliteyt in Vrieslandt", "Admiraliteyt in Zeelandt"
    },
    "Admiraliteyt in Vrieslandt": {
        "Admiraliteyt tot Amfterdam", "Admiraliteyt in het Noorder Quartier", "Admiraliteyt in Zeelandt"
    },
    "Admiraliteyt in Zeelandt": {
        "Admiraliteyt tot Amfterdam", "Admiraliteyt in het Noorder Quartier", "Admiraliteyt in Vrieslandt"
    },
}
fuzzy_searcher.index_keywords(keywords)
fuzzy_searcher.index_distractor_terms(distractor_terms)

hocr_resolution_pages = []



### Fuzzy Searching of Keywords in the Resolutions

Since we know which keywords should appear in the text, possibly with spelling variation and OCR errors, we can use a fuzzy search algorithm to find candidate matches.

Keywords that closely resemble each other are registered as distractor terms of one another, so that each candidate match is assigned to the nearest keyword within a set of similar keywords.
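
To make the role of distractor terms concrete, here is a minimal sketch of the idea. It is not the implementation used by `FuzzyContextSearcher`; the helper names `levenshtein` and `nearest_keyword` are hypothetical and only illustrate how a candidate string could be assigned to the closest keyword among a set of similar ones.

```python
def levenshtein(s1, s2):
    """Plain edit distance (insert, delete, substitute) between two strings."""
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    previous = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2, start=1):
        current = [i2]
        for i1, c1 in enumerate(s1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[i1] + cost,      # substitute or copy
                               previous[i1 + 1] + 1,     # skip a character of s2
                               current[-1] + 1))         # skip a character of s1
        previous = current
    return previous[-1]


def nearest_keyword(candidate, keyword, distractors):
    """Return whichever of the keyword and its distractors is closest to the candidate."""
    options = [keyword] + sorted(distractors)
    return min(options, key=lambda term: levenshtein(candidate, term))


# hypothetical candidate string with an OCR error ("Vrieslant")
candidate = "Admiraliteyt in Vrieslant"
keyword = "Admiraliteyt tot Amfterdam"
distractors = {"Admiraliteyt in Vrieslandt", "Admiraliteyt in het Noorder Quartier"}
print(nearest_keyword(candidate, keyword, distractors))
```

Without the distractor check, the shared prefix "Admiraliteyt" could be enough for the candidate to pass the similarity thresholds of the wrong keyword.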

In [None]:
from collections import defaultdict

from parse_republic_hocr_files import merge_text_lines, read_hocr_scan

lemma_matches = defaultdict(list)

def add_context(match, page_text):
    context = fuzzy_searcher.get_term_context(page_text, match, context_size=40)
    match["match_term_in_context"] = context["match_term_in_context"]
    match["context_start_offset"] = context["start_offset"]
    match["context_end_offset"] = context["end_offset"]

for scan_file in scan_files:
    resolution_page_num = scan_file["scan_page_num"] - 90
    if scan_file["scan_page_num"] <= 90:
        continue
    print(scan_file["scan_page_num"], resolution_page_num)
    hocr_page = read_hocr_scan(scan_file)
    page_text = merge_text_lines(hocr_page)
    matches = fuzzy_searcher.find_candidates(page_text)
    for match in matches:
        lemma_matches[match["match_keyword"]] += [match]
        add_context(match, page_text)
        match["page_num"] = scan_file["scan_page_num"]
        print(match["match_keyword"], "\t", match)
    #break
    

In [None]:
for lemma in sorted(lemma_matches):
    print("\n", lemma, "\tAantal kandidaten:", len(lemma_matches[lemma]), "\n")
    for match in lemma_matches[lemma]:
        print("\tKandidaat:", match["match_string"])
        print("\tPagina:", match["page_num"])
        print("\tContext:", match["match_term_in_context"][5:-5])
        print()



### Extracting Resolutions From Pages

Identify:

- resolution dates
- resolution participant lists
- resolution text blocks
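
As an illustration of the first item, the sketch below parses a date line of the form weekday, "den", day number, month, year, while tolerating OCR errors. It is an assumption-laden example and not the code used in the cell that follows: it relies on the standard library's `difflib.get_close_matches` rather than the Levenshtein scoring used elsewhere in this notebook, and the names `closest` and `parse_date_line` and the cutoff of 0.75 are hypothetical choices.

```python
import re
from difflib import get_close_matches

WEEK_DAYS = ["Lunae", "Martis", "Mercurii", "Jovis", "Veneris", "Sabbathi", "Dominica"]
MONTHS = ["Januarii", "Februarii", "Maart", "April", "Mey", "Junii",
          "Juli", "Augufti", "September", "October", "November", "December"]


def closest(word, vocabulary, cutoff=0.75):
    """Return the vocabulary entry most similar to word, or None below the cutoff."""
    matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def parse_date_line(line_text):
    """Extract weekday, day, month and year from an OCR'ed date line, if present."""
    date = {"weekday": None, "day": None, "month": None, "year": None}
    # split into alphabetic tokens and digit runs, dropping punctuation
    for token in re.findall(r"[^\W\d_]+|\d+", line_text):
        if token.isdigit():
            if len(token) == 4 and date["year"] is None:
                date["year"] = int(token)
            elif len(token) <= 2 and date["day"] is None and 1 <= int(token) <= 31:
                date["day"] = int(token)
        elif date["weekday"] is None and closest(token, WEEK_DAYS):
            date["weekday"] = closest(token, WEEK_DAYS)
        elif date["month"] is None and closest(token, MONTHS):
            date["month"] = closest(token, MONTHS)
    # without both a weekday and a month this is not treated as a date line
    if date["weekday"] is None or date["month"] is None:
        return None
    return date


print(parse_date_line("Martis den 19, Junit 1725."))
```

Numbers that are implausible as a day or a year, such as a three-digit OCR misreading, are left as `None` rather than guessed.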

In [533]:
from fuzzy_context_searcher import FuzzyContextSearcher
from fuzzy_person_name_searcher import FuzzyPersonNameSearcher

config = {
    "char_match_threshold": 0.8,
    "ngram_threshold": 0.6,
    "levenshtein_threshold": 0.8,
    "ignorecase": False,
    "ngram_size": 2,
    "skip_size": 2,
}

keywords = [
    #"Is ter Vergaderinge gelefen",
    #"Ontvangen een miffive van",
    #"WAAR op gedelibereert zijnde",
    #"WAAR op geen refolutie is gevallen",
    #"De Resolutien, gifteren genomen",
    #"Nihil actum eft.",
    #"PRAESIDE",
    "PRAESENTIBUS",
    #"Januarii"
]

week_days = [
    "Lunae",
    "Martis",
    "Mercurii",
    "Jovis",
    "Veneris",
    "Sabbathi",
    "Dominica"
]

months = [
    "Januarii",
    "Februarii",
    "Maart",
    "April",
    "Mey",
    "Junii",
    "Juli",
    "Augufti",
    "September",
    "October",
    "November",
    "December"
]

spelling_variants = {
    "Ontvangen een miffive van": [
        "ON een miffive van",
    ],
    "PRAESENTIBUS": [
        "PRASENTIBUS",
        "PRESENTIBUS"
    ]
}

fuzzy_searcher = FuzzyContextSearcher(config)
fuzzy_person_searcher = FuzzyPersonNameSearcher(config)

fuzzy_searcher.index_keywords(keywords)
#fuzzy_searcher.index_spelling_variants(spelling_variants)
#fuzzy_searcher.index_distractor_terms(distractor_terms)

def is_empty_line(line):
    return len(line["words"]) == 0

def get_resolution_page_paragraphs(page_doc):
    paragraphs = []
    for column_info in page_doc["columns"]:
        paragraph = []
        prev_line_bottom = None
        for line in column_info["column_hocr"]["lines"]:
            boundary = False
            if is_empty_line(line):
                continue
            if prev_line_bottom is None:
                prev_line_bottom = line["bottom"]
                paragraph.append(line)
                continue
            line_gap = line["top"] - prev_line_bottom
            if line_gap > 30:
                boundary = True
            elif line_gap > 10 and line_is_centered_date(line):
                boundary = True
            if boundary and prev_line_bottom < 400:
                paragraph = [] # start new paragraph, ignore header line
            elif boundary:
                paragraphs.append(paragraph)
                paragraph = []
            paragraph.append(line)
            prev_line_bottom = line["bottom"]
        if len(paragraph) > 0:
            paragraphs.append(paragraph)
    return paragraphs

def line_has_weekday(line):
    return line_has_word_from_list(line, week_days)

def line_has_month(line):
    return line_has_word_from_list(line, months)

def line_has_word_from_list(line, word_list):
    for line_word in line["words"]:
        best_match = None
        best_score = 1
        for list_word in word_list:
            score = score_levenshtein_distance(line_word["word_text"], list_word)
            relative_score = score / len(list_word)
            if relative_score < 0.4:
                if relative_score < best_score:
                    best_match = list_word
                    best_score = relative_score
        if best_match:
            print("#{}# #{}#".format(line_word["word_text"], best_match), best_score)
            return True
    return False

def line_is_centered_date(line):
    # a centered line starts and ends well inside the column margins
    if line["words"][0]["left"] < 200 or line["words"][-1]["right"] > 800:
        return False
    # a date line has the form: weekday "den" day-number month
    if line_has_weekday(line) and line_has_month(line):
        print("\n\tCentered date line:\t", line["line_text"])
        return True
    return False

def paragraph_starts_with_centered_date(paragraph):
    if line_is_centered_date(paragraph[0]):
        return True
    return False

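def merge_paragraph_lines(paragraph):
    # join the lines of a paragraph into one string,
    # gluing together words that are hyphenated across a line break
    paragraph_text = ""
    for line in paragraph:
        if line["line_text"].endswith("-"):
            paragraph_text += line["line_text"][:-1]
        else:
            paragraph_text += line["line_text"] + " "
    return paragraph_text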

for page_id in pages_info:
    if pages_info[page_id]["page_type"] != "resolution_page":
        continue
    if pages_info[page_id]["scan_num"] < 264 or pages_info[page_id]["scan_num"] > 280:
        continue
    print(page_id, pages_info[page_id]["page_type"])
    page_doc = retrieve_page_doc(page_id)
    for paragraph in get_resolution_page_paragraphs(page_doc):
        paragraph_text = merge_paragraph_lines(paragraph)
        matches = fuzzy_searcher.find_candidates(paragraph_text, include_variants=True)
        if len(matches) == 0 and paragraph_starts_with_centered_date(paragraph):
            print("DATE LINE:", paragraph_text)
        #    print("paragraph_text:", paragraph_text)
        for match in matches:
            #print("\t", match)
            if match["match_term"] == "PRAESENTIBUS":
                print("DAY START:", paragraph_text)
            context_match = fuzzy_searcher.get_term_context(paragraph_text, match, context_size=200)
            #print(context_match)
            person_matches = fuzzy_person_searcher.find_person_names_in_text(context_match["match_term_in_context"])
            #person_matches = fuzzy_person_searcher.find_person_names_in_context(context_match)
            #for person_match in person_matches:
            #    print("\t", person_match)
        #print("\n\n\n")


scan-264-even resolution_page
scan-264-odd resolution_page
#Martis# #Martis# 0.0
#Junit# #Junii# 0.2

	Centered date line:	 Martis den 19, Junit
DAY START: Martis den 19, Junit 1725. PRASIDE, Den Heere Van Maasdam. PRASENTIBUS. De Heeren Jan Welderen van Heucketom , Singendonck , ván Heeckeren, Umibgroeven met een extraordinaris Gedeputeerde uyt de Provincie van Gelderlandt. Vanden Boetzelaar Raadtpenfionaris van Hoornbeeck. Ocker(Je. Van Voor. Van Schwartzenbergty Rou[e , Vriefen. Emmen, Tamminga. 
scan-265-even resolution_page
scan-265-odd resolution_page
scan-266-even resolution_page
scan-266-odd resolution_page
DAY START: Mercuri: den 210. Junii 4725. PRASTDE, Den Heere Zan Maasdam. PRASENTIBUS, De Heeren’ Van Welderen van Heuckts lom van Heeckeren Umbgroeven met een extraordinaris Gedeputeerde uyt dé Provincie van Gelderlandt. Panden Boetzelaar Eelbo van Marfeveen, Boon Raadtpen/ionaris van Hoornbeeck. OckerJe, met een extraordinaris Gedeputeerde uyt de Provincie van Zeelandt. Van

## Experimental Corner

Playground for trying out fuzzy search algorithms.

In [503]:
def score_levenshtein_distance(s1, s2):
    """Return the Levenshtein (edit) distance between strings s1 and s2."""
    # make s1 the shorter string, so the rows kept in memory are as short as possible
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    # distances[i] is the distance between s1[:i] and the part of s2 processed so far
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                # 1 + the cheapest of substitution, deletion and insertion
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

def get_match(matrix, T):
    match = ""
    col_index = len(matrix) - 1
    distance = matrix[-1][-1]
    row_index = len(matrix[-1]) - 1
    while row_index > 0:
        if col_index < 0:
            print("BREAKING")
            break
        #print("col_index:", col_index, "row_index:", row_index, "distance:", distance)
        if matrix[col_index-1][row_index-1] < distance:
            #print("replacing")
            match = T[col_index] + match
            distance = matrix[col_index-1][row_index-1]
            col_index -= 1
            row_index -= 1
        elif matrix[col_index][row_index-1] < distance:
            #print("inserting")
            #match = T[col_index-1] + match
            distance = matrix[col_index][row_index-1]
            row_index -= 1
        elif matrix[col_index-1][row_index] < distance:
            #print("deleting")
            match = T[col_index] + match
            distance = matrix[col_index-1][row_index]
            col_index -= 1
        elif matrix[col_index-1][row_index-1] == distance:
            #print("copying")
            match = T[col_index] + match
            col_index -= 1
            row_index -= 1
        else:
            print("This should never be printed")
        #print("row_index:", row_index)
        #print("match:", match)
    #print("returning match:", match)
    return (match, col_index+1, matrix[-1][-1])

def search_levenshtein_distance(P, T, threshold=2):
    """Find approximate occurrences of pattern P in text T.

    Returns a list of (match_string, start_offset, distance) tuples, one for each
    end position in T where the edit distance to P is at most threshold.
    """
    # make P the shorter string
    if len(P) > len(T):
        P, T = T, P
    # distances[i] is the distance between P[:i] and the text ending at the current position
    distances = range(len(P) + 1)
    matches = []
    matrix = []
    for t_j, T_j in enumerate(T):
        # a match may start at any text position, so every column starts at cost 0
        distances_ = [0]
        for p_i, P_i in enumerate(P):
            if P_i == T_j:
                distances_.append(distances[p_i])
            else:
                distances_.append(1 + min((distances[p_i], distances[p_i + 1], distances_[-1])))
        distances = distances_
        matrix += [distances]
        distance = distances[-1]
        if distance <= threshold:
            # trace back through the matrix to recover the matching substring
            match = get_match(matrix, T)
            matches.append(match)
    return matches

target = "survey"
context = "surgery"
matches = search_levenshtein_distance(target, context)
print(matches)

target = "survey"
context = "the doctor is specialised in surgery of the heart"
matches = search_levenshtein_distance(target, context)
print(matches)



[('surge', 0, 2), ('surger', 0, 2), ('surgery', 0, 2)]
[('surge', 29, 2), ('surger', 29, 2), ('surgery', 29, 2)]


In [327]:
texts = [
    "B. dr Bofch Mikelaer, zal op den 11 May , in de Keyzers Kroon verkopen , een uvrmuntënde pirty Schilderyen van de voornaernfte M;efter«; ils van de Oude Pslma extra goed , A'exander Veronecs, Pluweele Breugel , Wouwerman, J. vander Heydc, Laireffe, Bril en ancVc. Nagelaten door een voon.aem Liefhcbb.T ; als mede een fraye parry Teekenmgen en Miniatuur fchoone D'ukke Prenten van de voorn-emfte Mcefters. De Citilogu-zal ir.tyds te bekomen zyn by Jicób Carpi Konft Schilder, en by de gsm. Mikelaer. B. SÜgtenborff Mikelaer, za' op Woensdag den 25- Miy, t'Amft. in 't Oude Heeren Logement verkopen, een pirty Engelfche Manuft-Uuren et Wmkclw ren , beftaende in Lakend diverfe fbrteering vin * en 9 quart breed , Biyen, Kirfayen, Drogetreu,.Sergies, gcfc»om-c en gj'lrcepre Kalamankcn, S.urynen, diverfe Grynen en Stoften, Ssycn, Chitzen , Catoenenen Neteldoekeu , divetfc Kouffen,-1 !___. witte gebleekte. Li.inens„cn andere Goederen meer ;, alle* dacgj ïoer <k Verkoping te zien.. '",
    "Plülippus van d:r Land Makelaer, za! op d:.i : 5 May verkopen , een uytmuntend ko1 fti- Kabinet Schilderyen; als vin Pb. Woaverman, A. vin O.tilc, Rottenhimcr, G. Metzu, G. dc LairefTc, D. van Deeien, |. Srccn, D. Teniers, de- Oude Griffier. ]. en 4. Bcth, M. ds Hond-koere-, J. do Heem, J. Lingelbigh.en andere Mecfterj meer; nagelaten door den £J. Heer Secretaris Lambert Witzen , wier van de Citalogus by den gen*. Mikelaer te bekomen zullen zyn. Alte de gee-.e die iers te prereideeren hebben of verfftiulcKgt zyn op de mgchten Boedel vin Pieter Engels , tot Warmer overleden, gelieven luc prerenficn binr.e.-yden tyd vaa 5 weeleen 1:1 te leveren leo Wccsvi-lers rn Armvoogden tot Wormer ; dew-'ke de Penningen vm de y;em. Boedel geprovenieert binnen korre , na verloop van de ge.eyde C weken , by preferentie en concurrentie zulten diftri- MSfctti foi.nig als zy zu'ten vmdc.i te be'nojren,",
    "Jac. Torner Junior Mikelaer, prefènteert tegens primo May.of wel eerder :e verbuurcn, een zeer vermakelyke en welgelege Herberg genaemt RUSTENBURG ; gelegen buyten dc Urregtfj Poort op 't Ruften!., rger par! , met zyn tnodicufe Huyzinge , fraye Thuyn en Tuynhuy», Kolfoaen.Troktafel &o; aiwaer de neering veel jiren met fueces il gcoaen en nog werd gecontinueert, laetft bewoont dexr wylen Dirk He Lange. N ider onderngtinge bv gem. Makelae-. Alle de eeene die eenig regt, aélie of pretenfie beeft, of iets fchal lig zyn a-.n den r.eabindonneerden Boedel .-an Dirk van Schaik en Ja -metje Theuniffe, geweezene Herbe-g er in de Oude Loosdregt, werden-verzoet zulks op en aen ce geven ter Secretary aldaer, uyterlyk voor primo April 1745 , op potne vau eeuwig ftilfwygen en verftek.",
    "ArnoHns van Sprang-Mikelaer, zal op Maendag d-n 27 Vnny.t'Amfteid. in 't Oude Heeten Logement vetkopen, een wel ter nerinc 4_endc Hoys, In de Crom elboo»ftee» 't rte huvs -an den\"Dim,»rnaemt het Mo huys, daer de Baftille uythangt ; breeJet byßiljettr Vermeld : lemant nadet ondetrigi begreiende of genegen zynde dit Perceel uyt de hand te-koopen , fpieeke met de gem. Makelaet, by van eygendom «dagen voet de Veikoopdag te zien tullen zyn. Ajnüld:. s- au Spiang.M.kc!aei, zal op Maendag den 2- lünv , t'Aml. in 't Ó.de tfeeien Logement vetku. en, drie hegre, fterkeen weC-doortt üsnefde Huvfcn, ftaende nieft mi'k.nrlrien in deGroore aen dl-Vtoord-vde by de dwirsftiact • bieede by 3i,;**rte**i vermeld: lefnanrna-e.'onderri.ti.ig begeerende of geneegen z.ynie degemPi'tcedriaTt ie.__.ad.teko.pen,fpi?eke mei yucru, Mak_liei, by wien de bewyzen .van cyi*j_d-_- S dagen voot de:yeikoop_ag.tc _iea zullen ü\", cv",
    "Job. Haverkamp Mikelaer, zal op Maendag den 11 January 1745, t Amft. in t Oude Heere Logement verkopen, No. 1. Een hegt en fterk Huys; ftaende op de Weesperftraet tuffchen de Hect:n en Keyzersgragt. No.1. Eén dito Huys.ftaende naeft 't voorgaende. No. 3. Een Sdepart in een we! ter nering ftaende Huys en Erve op de N. Z. van de Agterburgwal op de hoek van de Pottebakkersfteeg. Breder by Biüetten gefpecifkeert. De bewyzen van eygendom en de Vylconditien zullen 8 dagen voor en op de Verkoopdag voormiddag» tot J 2 uuren toe te zien zyn ten Compt. van den Notaris E. Haverkamp. Uyt dé huni te Koop een wel beklante Banket en Confiturier» Winkel met deszelfs Gereetfchappen , waer in die affaire» circa SS jaeren met c.\".cd fucces zyn gecontinueert: Te bevragen by Hendrik Bofch en Gomp., in de Kalverftract, tuflehen de Öjjes-flaysen Olyfiigers-fteeg\" t'Amfterdam. Als meede het Huys te hu*r.",
    "J. van Zutphen Mikelaer , zal op Maendag den 14 September , t'Amft. in 't Oude Heeren Logement verkopen , een extra wel geleegen 3-ftreeksLiken-Raem , met zyn Thuyn , Huyzinge en Droogfcheevders Winkel, eertyts genaemt het Varken, en nu Buyren-Ruft, ftaende en aeleegen buyten de Raem-Pooit ren eynde het eerfte Sceene-Pad , zynde Stads Grond , en geteekent met. No! 27, 28 en 29, Alles breder by Biljetten en dagelyks te zien , de bewyzen van Eygendom zullen 8 dagen voor en op de verkoopdag te zien zyn '. ten Compt. van den Nots. Arnoud Roermond in de Krom-Elleboogfteeg , en nader onderrigting by gem. Makelaer. ; Alle die iets te pretendeeren mogten hebben ten.laften van wylen Jean Fredrik Bernard , in zyn Leven Boekverkoper t'Amft. v of die eenige Engagementen met dezelve hebben lopende , uyr wat hoofde het o:)-_zoude mogen zyn, als mede die aen der Boedel fchuJdig mogten wezen , of ook eenige Goederen van hem onder zig hebben , werden verzogt zulks op voor tieu ïsOitcber 1744. .;ten Compt. vanden Notaris Jan Ardinois , op de Cingel t'Amfterdam.",    
]

#target = "Makelaar"
target = "Oude Heere Logement"
for text in texts:
    matches = search_levenshtein_distance(target, text)
    print(matches)



[('Oude Heeren Logemen', 568, 2), ('Oude Heeren Logement', 568, 1), ('Oude Heeren Logement ', 568, 2)]
[]
[]
[('Oude Heeten Logement', 73, 2)]
[('Oude Heere Logeme', 74, 2), ('Oude Heere Logemen', 74, 1), ('Oude Heere Logement', 74, 0), ('Oude Heere Logement ', 74, 1), ('Oude Heere Logement v', 74, 2)]
[('Oude Heeren Logemen', 74, 2), ('Oude Heeren Logement', 74, 1), ('Oude Heeren Logement ', 74, 2)]
