## Experimentation Notebook

This notebook is part of the sandbox and is intended for experimenting with the REPUBLIC Elasticsearch functionality.

In [1]:
# This is needed to add the repo dir to the path so jupyter
# can load the republic modules directly from the notebooks
import os
import sys
repo_name = 'republic-project'
repo_dir = os.path.split(os.getcwd())[0].split(repo_name)[0] + repo_name
print(repo_dir)
if repo_dir not in sys.path:
    sys.path.append(repo_dir)



/Users/marijnkoolen/Code/Huygens/republic-project


## Initialise Republic Elasticsearch Instance

This creates a RepublicElasticsearch object that contains an elasticsearch instance for the Republic CAF indexes, as well as a range of retrieval functions.

Check the [README](https://github.com/HuygensING/republic-project#readme) for configuration details that should be placed in `settings.py`.

In [2]:
from republic.elastic.republic_elasticsearch import initialize_es

rep_es = initialize_es(host_type='external', timeout=60)
rep_es.config['resolutions_index'] = 'resolutions_new'

## Keyword in Context

A simple way to start exploring is with the `keyword_in_context` function. It takes words or phrases as input and shows a number of hits with surrounding context in the resolutions.

The `keyword_in_context` function returns `hit`s, which are dictionaries containing the search `term`, the `pre` and `post` contextual words, a formatted `context`, and the identifiers and offsets of the match: `para_id`, `resolution_id`, `para_offset` and `resolution_offset`.
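To make the `pre`, `post` and `context` fields concrete, here is a minimal pure-Python sketch that builds similar dictionaries from a single string. The helper `simple_kwic` is hypothetical and is not how `rep_es.keyword_in_context` is implemented; it only handles single-word terms and omits the id and offset fields.

```python
def simple_kwic(text, term, context_size=3):
    """Hypothetical helper: yield a hit dict for each occurrence of a single-word term."""
    words = text.split()
    for i, word in enumerate(words):
        # compare case-insensitively, ignoring punctuation attached to the word
        if word.strip(".,;:").lower() != term.lower():
            continue
        pre = " ".join(words[max(0, i - context_size):i])
        post = " ".join(words[i + 1:i + 1 + context_size])
        yield {
            "term": term,
            "pre": pre,
            "post": post,
            "context": f"{pre} {word} {post}".strip(),
        }


text = "gevalle sijn Persoon ten allen tyden te sullen sisteren, daer voor"
hits = list(simple_kwic(text, "tyden", context_size=2))
for hit in hits:
    print(hit["context"])
```

With `context_size=2` the context holds two words on either side of the term.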

In [10]:
# use single word or multi-word phrase
for hit in rep_es.keyword_in_context("tyden", num_hits=20, context_size=5):
    print(hit["context"])


                        gevalle sijn Persoon ten allen tyden te sullen sisteren, daer voor
                     belofte en borghtochte van tallen tyden sijn Persoon, soo het nodigh
                             cautie van sigh ten allen tyden weder te sullen sisteren als
                 Onderdaanen in deese conjuncturen van tyden niet ten hooghite noodsaackelijck soude
             als hooghstdeselve na omstandigheeden van tyden en saatken sal vinden te
                            haar Hoog Mog., in voorige tyden by diergelyke ceremonien, gewoon waren
                                 wel nu als in voorige tyden het gewoole douceur van zes
                          volgens de gewoonte van alle tyden, maendelijck waren betaeldt geweest van
                            veel minder als in voorige tyden gereguleert zynde, veele van haer
              Convoyen en Licenten, sedert immemoriale tyden genoten. Waer op gedelibereert zijnde
                   by de tegenwoordige constitutie van tyd

The `keyword_in_context` function also has several optional arguments to control the size of the context window (`context_size`, default is 3 words before and after), the number of hits (`num_hits`, default is 10) and query filters to constrain the search space (`filters`, which are added to the query).
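The `filters` argument takes a list of Elasticsearch query clauses. The following sketch wraps the two clause types used in this notebook in small helper functions. The helper names `year_filter` and `date_range_filter` are hypothetical; the field names `metadata.session_year` and `metadata.session_date` are the ones used in the cells of this notebook.

```python
from datetime import date


def year_filter(year):
    """Hypothetical helper: select resolutions from a single session year."""
    return {"match": {"metadata.session_year": year}}


def date_range_filter(start, end):
    """Hypothetical helper: select resolutions with a session date in [start, end]."""
    # date.fromisoformat raises ValueError for malformed dates such as '1672-13-01'
    start_date = date.fromisoformat(start)
    end_date = date.fromisoformat(end)
    if start_date > end_date:
        raise ValueError("start must not be after end")
    return {
        "range": {
            "metadata.session_date": {
                "gte": start_date.isoformat(),
                "lte": end_date.isoformat(),
            }
        }
    }


filters = [date_range_filter("1672-04-01", "1672-08-01")]
print(filters)
```

Validating the dates before building the clause catches typos locally instead of in the server response.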

**Note**: the `num_hits` argument controls the number of _resolutions_ that are retrieved. Within a resolution, the search keyword may appear multiple times. A context is created for each occurrence of the search keyword, so the number of contexts returned can be (and typically is) higher than the number of hits.
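A quick way to see the difference between contexts and resolutions is to count the contexts per `resolution_id`. The sketch below uses hand-written hit dictionaries in place of a live query, so the data is illustrative only.

```python
from collections import Counter

# Hand-written stand-ins for the dictionaries returned by keyword_in_context
hits = [
    {"resolution_id": "session-1709-02-06-num-1-resolution-10",
     "context": "dolerende over sijne periculeuse ende kostelijke reyse"},
    {"resolution_id": "session-1709-02-06-num-1-resolution-10",
     "context": "Jochem Joppe, voor sijn periculeuse reyse ende schade"},
    {"resolution_id": "session-1720-01-20-num-1-resolution-4",
     "context": "voor de sware ende periculeuse reyse die hy gedaan"},
]

# count how many contexts each resolution contributes
contexts_per_resolution = Counter(hit["resolution_id"] for hit in hits)
num_resolutions = len(contexts_per_resolution)
num_contexts = sum(contexts_per_resolution.values())

print(f"{num_contexts} contexts from {num_resolutions} resolutions")
for resolution_id, count in contexts_per_resolution.most_common():
    print(resolution_id, count)
```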

In [11]:
# use context_size to get fewer or more surrounding words as context
for hit in rep_es.keyword_in_context("secreete", context_size=5):
    print(hit["context"])

                          gedaan op haar Hoogh Mogende secreete commissoriaale Resolutie van den vierden
                         ontfangen, en gemeld in sijne secreete Mislive van den seven en
                       besoignes by haar Hoogh Mogende secreete Resolutie van den vierden December
                        voldoeninge van haar Hoog Mog. secreete Resolutie en Aanschryving van den
                             in haar Hoog Mog. gemelde secreete - Resolutie van den 12 deeser
                     nakoominge van haar Hoogh Mogende secreete Resolutie en Aanschryvens van den
                          vervat in haar Hoogh Mogende secreete Resolutie van den sestienden deeser
                       ontfangst van haar Hoog Mogende secreete Resolutie van den dertigsten der
                             in gevolge haar Hoog Mog. secreete Resolutie van den 25 deezer
                        kraghte van haar Hoogh Mogende secreete Resolutie van den vier en
               Suppliantes Man waaren 

In [12]:
for hit in rep_es.keyword_in_context("periculeuse", context_size=5):
    # First, show the resolution id (which contains the session date)
    print(hit["resolution_id"])
    # Second, show the keyword in context
    print(hit["context"])
    # Finally, add newline for readability
    print()


session-1709-02-06-num-1-resolution-10
                 tot Scheveninge, dolerende over sijne periculeuse ende kostelijke reyse, met een

session-1709-02-06-num-1-resolution-10
                      Stuyrman Jochem Joppe, voor sijn periculeuse reyse ende schade aen de

session-1720-01-20-num-1-resolution-4
                            douceur voor de sware ende periculeuse reyse die hy gedaan heeft

session-1733-07-11-num-1-resolution-11
                          door niet alleen buyten alle periculeuse gevaaren werden gestelt, maar oock

session-1737-12-27-num-1-resolution-16
                                van door een swaare en periculeuse sieckte te worden overvallen, in

session-1672-11-21-num-1-resolution-25
                       Schoonhoven aen eene sware ende periculeuse sieckte het beddehoudende. om hem

session-1711-09-30-num-1-resolution-9
                      werden, om sulcke moeyelijcke en periculeuse reyse te doen of te

session-1707-11-17-num-1-resolution-14
          

In [13]:
# use num_hits to get fewer or more results
for hit in rep_es.keyword_in_context("pestilentie", context_size=500, num_hits=20):
    print(hit["resolution_id"])
    print(hit["context"])
    print()


session-1739-01-07-num-1-resolution-7
DE Heeren Gedepuleerden van de Provintie van Gelderland, hebben ter Vergaderinge voorgedraagen, en in bedencken gestelt, of niet in het begin van dit aangevangen jaar; behoort te werden uytgeschreven een generaale Danck-, Vasten Beededagh , om God almaghtigh te dancken voor de continuatie van de vreede, en voor alle sijne weldaaden tot hier toe aan den Staat beweesen, en te bidden om sijnen aanhoudenden zeegen over den Staat, ten eynde deselve van Oorlogen, Pestilentie en andere plaagen bevryd mooge blyven. WAAR op gedelibereert zynde, hebben de Heeren Gedeputeerden van de Provintien van Holland en Westvriesland, van Zeeland, van Vriesland, van Overyssel, en van Stad en Lande aangenoomen haar daar op naader te sullen verklaaren

session-1711-09-04-num-1-resolution-16
DE ondergeschreven Resident van de Loffelijcke Hansche Steden, geeft sich de eere, op expresse ordre van den Senaet der Stadt Dantzich, aen U Hoogh Mogende met alle behoorlijk respect 

In [14]:
# use filters to constrain the search space:
# selecting resolutions by year
filters = [
    {"match": {"metadata.session_year": 1672}}
]

for hit in rep_es.keyword_in_context("periculeuse", filters=filters):
    print(hit["para_id"])
    print(hit["context"])

session-1672-11-21-num-1-para-56
                   eene sware ende periculeuse sieckte het beddehoudende
session-1672-02-18-num-1-para-10
           doch voornamentlijck in periculeuse tijden in goede
session-1672-01-04-num-1-para-63
            doch voornamentlick in periculeuse tijden, in goede
session-1672-02-24-num-1-para-13
            doch voornamentlick in periculeuse tijden in goede


In [15]:
# use filters to constrain the search space:
# selecting resolutions by date range
filters = [
    {"range": {"metadata.session_date": {"gte": "1672-04-01", "lte": "1672-08-01"}}}
]

for hit in rep_es.keyword_in_context("Vloot", filters=filters):
    print(hit["para_id"], '\n')
    print(hit["context"], '\n')

session-1672-06-16-num-1-para-24 

                 dat de Smirnasche Vloot door d'Engelschen 

session-1672-05-31-num-1-para-92 

                      sigh naer de Vloot te vervoegen: Waerop 

session-1672-05-24-num-1-para-20 

           gedaen, opde Smirnasche Vloot is gear„ resteert 

session-1672-04-02-num-1-para-50 

                       Ho:Mo: inde vloot vanden Staat op 

session-1672-07-27-num-1-para-83 

                   ont„ houden van Vloot der Vijanden van 

session-1672-05-31-num-1-para-36 

                 dat de Smirnasche Vloot, door d'Engelschen 

session-1672-04-06-num-1-para-74 

  gevolmachtichde opde voorschreve vloot commanderen sal den 



In [49]:
# use filters to constrain the search space:
# selecting resolutions by date range
filters = [
    {"range": {"metadata.session_date": {"gte": "1672-01-01", "lte": "1672-12-31"}}}
]


phrases = [
    "Marie van Sisburgh",
    "Nicolaes Gouverneur",
    "Wouter van Swieten",
    "Jan van Gulpen",
    "Seghwaert"
]

for phrase in phrases:
    for hit in rep_es.keyword_in_context(phrase, filters=filters, context_size=200):
        print(hit["para_id"], '\n')
        print(hit["context"], '\n')

session-1672-02-06-num-1-para-16 

Opde requeste van Marie van Sisburgh, weduwe wijlen Johan Pels, houdende, dat haer Ho:Mo: in consideratie dat den voornoemden haren man met het schip vanden neutenant admirael den heere van Wassenaer, in Zee was opgesprongen, aende Suppliante in Januarij 1671. hadden toegeleijt de somme van hondert twintigh gulden, ende versoeckende dat deselve somme van hondert twintich gulden, aen haer bij continuatie wederom tot haer onderhout in haren hoogen onderdan van drie en t'seventightaren, soude mogen werden toegeleijt; Is naer voergaende deliberatie goetgevonden ende verstaen, dat deselve requeste gestelt sal werden in handen vande heeren Huijgens, ende andere haer Ho:Mo: Gedeputeerdens tot de saecken vande Griffie, om te visiteren, examineren, ende daervan rapport te doen 

session-1672-02-17-num-1-para-10 

De requeste van Nicolaes Gouverneur Coopman tot Amsterdam, versoeckende dat alsnoch favorabelijck soude mogen werden gedisponeert op sijn vorige requ

In [16]:
filters = [
    {"match": {"metadata.session_year": 1672}}
]

# using a larger context size
for hit in rep_es.keyword_in_context("Vlooten", filters=filters, context_size=20):
    print(hit["resolution_id"], hit["resolution_offset"], '\n')
    print(hit["context"], '\n')


session-1672-05-31-num-1-resolution-1 0 

Ontfangen een missive vanden Heer Cornelis de Witt, hare Ho:Mo: Gedepden. ende Gevolmachtichde op 's Lants Vlooten in de jegenwoordige expeditie ter Zee, Jehan ‛s Lants Schip de seven Provincien, laverende voor Walcheren, Brugge & Oost van haer 

session-1672-09-01-num-1-resolution-1 583 

advertentie ten spoedichsten kennisse sal werden gegeven aenden Lieutenant Admirael de Ruijter om daerop behoorlicke reflexie te nemen, de Vijantlicke vlooten te doen observeren, ingevolge van hare Ho:Mo: resolutie vanden seven„ thienden Augusti laestleden, de desseijnen vande Vijanden vanden Staet 

session-1672-09-01-num-1-resolution-1 1366 

Welderen, ende Lieutenant Admirael de Ruijter sal werden, aengeschreven, dat deselve haer soo veel mogelick op de voor„ schreve Vijantlicke Vlooten sullen informeren, haer Ho:Mo: sonder eenich tijt versuijm, adverteren vande condtschappen die haer vande voornoemde Vijantlicke Vlooten souden mogen 

session-1672-07-15-n

## Retrieving Resolutions

The `rep_es` object has a range of functions to retrieve `resolution` objects.

You can find all available properties and methods of `resolution` objects in `republic_document_model.py`, specifically in the
[Resolution](https://github.com/HuygensING/republic-project/blob/bb4cdad7b4cb9fb71378d0dde000fe7725ceb45e/republic/model/republic_document_model.py#L392) class, which inherits several properties and methods from the [ResolutionElementDoc](https://github.com/HuygensING/republic-project/blob/bb4cdad7b4cb9fb71378d0dde000fe7725ceb45e/republic/model/republic_document_model.py#L158) class.
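Retrieved documents are not necessarily returned in reading order. The identifiers in this notebook appear to follow the pattern `session-YYYY-MM-DD-num-N-resolution-M` (or end in `attendance_list`), so one option is to sort on the parsed identifier. The sketch below assumes that pattern, which is inferred from the ids in this notebook rather than taken from the REPUBLIC code base; the helper `parse_doc_id` is hypothetical.

```python
import re
from datetime import date

# Assumed id layout, inferred from the ids shown in this notebook
ID_PATTERN = re.compile(
    r"^session-(\d{4})-(\d{2})-(\d{2})-num-(\d+)-(resolution-(\d+)|attendance_list)$"
)


def parse_doc_id(doc_id):
    """Hypothetical helper: split a document id into date, session number and sequence number."""
    match = ID_PATTERN.match(doc_id)
    if match is None:
        raise ValueError(f"unexpected id format: {doc_id}")
    year, month, day, session_num, _suffix, res_num = match.groups()
    return {
        "session_date": date(int(year), int(month), int(day)),
        "session_num": int(session_num),
        "doc_type": "attendance_list" if res_num is None else "resolution",
        # attendance lists get 0 so they sort before the numbered resolutions
        "resolution_num": 0 if res_num is None else int(res_num),
    }


ids = [
    "session-1672-02-12-num-1-resolution-15",
    "session-1672-02-12-num-1-attendance_list",
    "session-1672-02-12-num-1-resolution-2",
]
sorted_ids = sorted(ids, key=lambda doc_id: parse_doc_id(doc_id)["resolution_num"])
print(sorted_ids)
```

Sorting on the parsed number avoids the string-ordering problem where `resolution-15` would come before `resolution-2`.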

In [18]:
resolutions = rep_es.retrieve_resolutions_by_session_date("1672-02-12")
for res in resolutions:
    print(res.session_date.isoformat(), res.id, res.metadata['index_timestamp'])

1672-02-12 session-1672-02-12-num-1-resolution-15 2022-03-29T15:39:04.587225
1672-02-12 session-1672-02-12-num-1-attendance_list 2022-03-29T13:12:35.786221
1672-02-12 session-1672-02-12-num-1-resolution-1 2022-03-29T13:12:35.914799
1672-02-12 session-1672-02-12-num-1-resolution-2 2022-03-29T13:12:36.879396
1672-02-12 session-1672-02-12-num-1-resolution-3 2022-03-29T13:12:38.169285
1672-02-12 session-1672-02-12-num-1-resolution-4 2022-03-29T13:12:38.314536
1672-02-12 session-1672-02-12-num-1-resolution-5 2022-03-29T13:12:38.663113
1672-02-12 session-1672-02-12-num-1-resolution-6 2022-03-29T13:12:39.011344
1672-02-12 session-1672-02-12-num-1-resolution-7 2022-03-29T13:12:39.212624
1672-02-12 session-1672-02-12-num-1-resolution-8 2022-03-29T13:12:39.406294
1672-02-12 session-1672-02-12-num-1-resolution-9 2022-03-29T13:12:39.550528
1672-02-12 session-1672-02-12-num-1-resolution-10 2022-03-29T13:12:39.704603
1672-02-12 session-1672-02-12-num-1-resolution-11 2022-03-29T13:12:39.895641
1672-0

In [11]:
output_file = "resoluties_rampjaar.csv"

with open(output_file, 'wt') as fh:
    headers = ['resolution_id', 'date', 'paragraph_id', 'text', 'iiif_url']
    fh.write('\t'.join(headers) + '\n')
    for res in resolutions:
        for para in res.paragraphs:
            if isinstance(para.metadata['iiif_url'], list):
                url = ', '.join(para.metadata['iiif_url'])
            else:
                url = para.metadata['iiif_url']
            row = [res.id, res.metadata['session_date'], para.id, para.text, url]
            row = [cell if cell is not None else '' for cell in row]
            print(row)
            fh.write('\t'.join(row) + '\n')
        

['session-1672-02-12-num-1-attendance_list', '1672-02-12', 'session-1672-02-12-num-1-para-1', 'Veneris den 12. Februarij 1672', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3285/NL-HaNA_1.01.02_3285_0265.jpg/2609,337,2446,3576/full/0/default.jpg']
['session-1672-02-12-num-1-attendance_list', '1672-02-12', 'session-1672-02-12-num-1-para-2', 'Preside den Heere van Coeverden', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3285/NL-HaNA_1.01.02_3285_0265.jpg/2609,337,2446,3576/full/0/default.jpg']
['session-1672-02-12-num-1-attendance_list', '1672-02-12', 'session-1672-02-12-num-1-para-3', 'Præsentibus de Heeren van Gendt, Gellicum, Brakell, Vijgh, Ripperda tot Buirse Schimmelpenningh, Ommeren Werckendam, Goeree, Meerens Odijck, Reijgersbergh, Crommon, Vrijbergen, Mauregnault,', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3285/NL-HaNA_1.01.02_3285_0265.jpg/2609,337,2446,3576/full/0/default.jpg']
['session-1672-02-12-num-1-attendance_list', '1672-02-12', 'session-1672-02-12-nu

In [19]:
keyword = "pestilentie"

query = {"query": {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},                  # only resolutions, no attendance lists
            {"match": {"metadata.session_year": 1711}},         # only resolutions from 1711
            {"match": {"paragraphs.text": keyword}},            # only resolutions containing the keyword
        ]
    }
}}
resolutions = rep_es.retrieve_resolutions_by_query(query)
output_file = f"resoluties_rampjaar_{keyword.replace(' ','_')}.csv"
with open(output_file, 'wt') as fh:
    headers = ['resolution_id', 'date', 'paragraph_id', 'text', 'iiif_url']
    fh.write('\t'.join(headers) + '\n')
    for res in resolutions:
        print(res.metadata["text_page_num"])
        for para in res.paragraphs:
            if isinstance(para.metadata['iiif_url'], list):
                url = ', '.join(para.metadata['iiif_url'])
            else:
                url = para.metadata['iiif_url']
            row = [res.id, res.metadata['session_date'], para.id, para.text, url]
            row = [cell if cell is not None else '' for cell in row]
            print(row)
            fh.write('\t'.join(row) + '\n')

[1057, 1058]
['session-1711-09-04-num-1-resolution-16', '1711-09-04', 'session-1711-09-04-num-1-para-20', 'IS ter Vergaderinge gelesen de Memorie van den Heere Resident Breyer, rakende het ontssaen van de Wollen ende andere Goederen, voor de besmettelijke Sieckte van Dantzick na dese Landen gebraght, volgende de voorschreve Memorie hier naer geinsereert.', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3766/NL-HaNA_1.01.02_3766_0556.jpg/2506,315,1119,3115/full/0/default.jpg']
['session-1711-09-04-num-1-resolution-16', '1711-09-04', 'session-1711-09-04-num-1-para-21', 'Hoogh Mog. Heeren.', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3766/NL-HaNA_1.01.02_3766_0556.jpg/2506,315,1119,3115/full/0/default.jpg']
['session-1711-09-04-num-1-resolution-16', '1711-09-04', 'session-1711-09-04-num-1-para-22', 'DE ondergeschreven Resident van de Loffelijcke Hansche Steden, geeft sich de eere, op expresse ordre van den Senaet der Stadt Dantzich, aen U Hoogh Mogende met alle behoorlijk respec

In [21]:
query = {
    "bool": {
        "must": [
            {"match": {"metadata.session_year": 1711}},
            {"match": {"metadata.text_page_num": 1099}}
        ]
    }
}

resolutions = rep_es.retrieve_resolutions_by_query(query)
for res in resolutions:
    print(res.id)
    for para in res.paragraphs:
        print(para.text)

session-1711-09-16-num-1-resolution-3
ONtfangen een Missive van de Magistraet der Keyserlycke vrye Rycks-Stadt Bremen, geschreven aldaer den twaelfden deser loopende maendt, houdende, dat de Koopluyden aldaer op de Nederlanden trafiquerenrende, aen haer hadden voorgedragen ende te kennen gegeven, in wat voegen sommige van haer reets in het midden van de maendt Augusti, in verscheyde Schepen hare Packen met Linnen ende andere Waren, alle met beëdigde Pasporten voorsien, dat se uyt geen besmetlijcke, maer uyt gesonde Plaetsen waren gekomen, na Amsterdam hadden laten afgaen, hebbende de selve tegenwoordigh eerst vernomen, dat haer Hoogh Mogende op den vierden September, als wanneer de voorschreve van daer afgevaren Schepen reets te Amsterdam waren aengekomen , een Placaet hadden laten uytgaen , waer by haer Hoogh Mogende de in het Placaet gespecificeerde Goederen verboden in te brengen; dat sy haer Hoogh Mogende konden verseeckeren, dat hare Stadt tot noch toe een gesonde lucht genoodt, e

### The Anatomy of a Resolution

Resolutions in the index consist of `metadata` and `paragraphs`.

In [20]:
import json

res = resolutions[0]
# Each resolution has metadata
print(json.dumps(res.metadata, indent=4))

{
    "inventory_num": 3766,
    "source_id": "session-1711-09-04-num-1",
    "type": "resolution",
    "id": "session-1711-09-04-num-1-resolution-16",
    "session_date": "1711-09-04",
    "session_id": "session-1711-09-04-num-1",
    "session_num": 1,
    "president": null,
    "session_year": 1711,
    "session_month": 9,
    "session_day": 4,
    "session_weekday": "Veneris",
    "proposition_type": "memorie",
    "proposer": null,
    "decision": null,
    "resolution_type": "ordinaris",
    "text_page_num": [
        1057,
        1058
    ],
    "index_timestamp": "2022-01-10T16:40:43.318944"
}


In [14]:
# You can dump all resolution data to JSON
res.json

{'id': 'session-1672-04-01-num-1-resolution-4',
 'type': ['republic_doc', 'resolution_element', 'resolution'],
 'metadata': {'inventory_num': 3285,
  'source_id': 'session-1672-04-01-num-1',
  'type': 'resolution',
  'id': 'session-1672-04-01-num-1-resolution-4',
  'session_date': '1672-04-01',
  'session_id': 'session-1672-04-01-num-1',
  'session_num': 1,
  'president': None,
  'session_year': 1672,
  'session_month': 4,
  'session_day': 1,
  'session_weekday': 'Veneris',
  'proposition_type': 'missive',
  'proposer': None,
  'decision': None,
  'resolution_type': 'ordinaris',
  'text_page_num': [],
  'index_timestamp': '2022-02-03T09:11:42.940395'},
 'evidence': [{'type': 'PhraseMatch',
   'phrase': 'ONtfangen een Missive van',
   'variant': 'ONtfangen een Missive van',
   'string': 'Ontfangen een Missive van',
   'offset': 0,
   'label': ['proposition_opening',
    'proposition_from_correspondence',
    'proposition_type:missive'],
   'ignorecase': False,
   'text_id': 'session-167

## Using Elasticsearch Queries

See the [Elasticsearch Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) for details on how to construct different types of queries.
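The queries in this notebook all share the same `bool`/`must` shape, so they can be assembled by a small function. This is a minimal sketch: the function `build_resolution_query` and its `as_phrase` parameter are hypothetical, while the field names are the ones used in the cells of this notebook.

```python
def build_resolution_query(doc_type="resolution", year=None, text=None, as_phrase=False):
    """Hypothetical helper: build a bool query whose clauses must all match."""
    must = [{"match": {"type": doc_type}}]
    if year is not None:
        must.append({"match": {"metadata.session_year": year}})
    if text is not None:
        # match_phrase requires the words to be adjacent and in order
        text_clause = "match_phrase" if as_phrase else "match"
        must.append({text_clause: {"paragraphs.text": text}})
    return {"bool": {"must": must}}


query = build_resolution_query(year=1672, text="raet pensionaris")
phrase_query = build_resolution_query(year=1672, text="raet pensionaris", as_phrase=True)
print(query)
```

By default, a `match` query on a multi-word string matches documents that contain any of the words. Use `match_phrase` when only the exact phrase should count.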

In [21]:
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},                  # only resolutions, no attendance lists
            {"match": {"metadata.session_year": 1672}},         # only resolutions from 1672
            {"match": {"paragraphs.text": "raet pensionaris"}}, # only resolutions containing 'raet pensionaris'
        ]
    }
}

resolutions = rep_es.retrieve_resolutions_by_query(query)

for res in resolutions:
    print(res.id)
    for para in res.paragraphs:
        print(f"\t{para.text}\n")
    print('--------------------\n')

session-1672-07-25-num-1-resolution-14
	Ontfangen een missive vanden pensionaris Pesters, geschreven tot Maestricht den 23en. deses, houdende advertentie, ende onder anderen rakende de contri„ butie bij de franschen gevordent wer„ ,dende inde Landen van Overmase, Waerop gedelibereert zijnde, Is goetgevonden ende verstaen, dat de voors missive gestelt sal werden in handen vande heeren van Brakel ende andere haer Ho:Mo: Gedepu„ teerden tot de saken vande Landen van Overmaze, met ende nevens eenige Heeren Gecommitteerden uijt den Raet van State bij haer E. selffs te nomineren, om te visi„ teren, examineren, ende daer van rapport te doen

--------------------

session-1672-11-15-num-1-resolution-5
	Ontfangen een missive van Alleij Aga, geschreven tot Amsterdam den twaelffden deses, houdende advertentie, dat hij uijt Turckijen was gesonden voor Ambassadeur vanden Grootenheer aenden Coningh van Sweeden dat hij oock een recommandatie Brieff aen haer Ho:Mo: om hem be,, hulpsaem te sijn int gee

In [22]:
# import Counter to do some simple word counting and frequency comparison
from collections import Counter
import re


In [23]:
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}}
        ]
    }
}

resolutions_1672 = rep_es.retrieve_resolutions_by_query(query, size=10000)

all_word_freq = Counter()

for res in resolutions_1672:
    for para in res.paragraphs:
        all_word_freq.update([word for word in re.split(r"\W+", para.text) if word != ''])

for word, freq in all_word_freq.most_common(10):
    print(f"{word: <20}{freq: >6}")

ende                 26487
van                  23337
de                   20843
te                   15032
dat                  11202
den                   9404
haer                  8512
vande                 8469
tot                   8431
in                    8057


In [24]:
res_missives = [res for res in resolutions_1672 if res.metadata['proposition_type'] == 'missive']

len(res_missives)

1616

In [25]:
from collections import Counter
import re

query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}},
            {"match": {"paragraphs.text": "raet pensionaris"}}
        ]
    }
}

resolutions = rep_es.retrieve_resolutions_by_query(query)


word_freq = Counter()

for res in resolutions:
    for para in res.paragraphs:
        word_freq.update([word for word in re.split(r"\W+", para.text) if word != ''])

rel_freq = {}
min_freq = 3
for word, freq in word_freq.most_common():
    if freq < min_freq:
        continue
    rel_freq[word] = freq / all_word_freq[word]
    
for word in sorted(rel_freq, key = lambda w: rel_freq[w], reverse=True):
    print(f"{word: <20}{rel_freq[word]: >6.4f}{word_freq[word]: >6}{all_word_freq[word]: >8}")

Pensionaris         0.3333     3       9
Wijtingh            0.2000     3      15
Pensionnaris        0.0946     7      74
Fagel               0.0714     5      70
raet                0.0338     5     148
Brakel              0.0268     3     112
gecommuniceert      0.0208     3     144
Raet                0.0182    16     877
Schepenen           0.0176     3     170
seeckere            0.0176     3     170
geaddresseert       0.0164     5     305
saken               0.0120     3     251
schreven            0.0112     3     268
nomineren           0.0109     3     275
heeft               0.0105     6     569
Orange              0.0090     3     332
Griffier            0.0086     4     464
advertentie         0.0085    10    1172
nevens              0.0080     5     622
anderen             0.0073     3     411
Gecommitteerden     0.0069     3     433
Hollandt            0.0069     3     435
hadden              0.0065     5     768
Heer                0.0062     8    1281
examineren      

### Resolutions in JSON Format

Resolution objects have a `.json` property to get a JSON representation of the resolution, including metadata, paragraph text and basic statistics. This can be a convenient format for storing and later retrieving them from disk (faster than getting them from the Republic CAF server).

You can also turn them into plain text representations if you want to do extensive text analysis.
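For large collections, an alternative to one big JSON array is the JSON Lines layout, with one resolution per line, which lets you stream resolutions from disk one at a time. The sketch below uses hand-written dictionaries in place of real `res.json` output and writes to a temporary directory; the helper names are hypothetical.

```python
import gzip
import json
import os
import tempfile

# Hand-written stand-ins for the dictionaries returned by `res.json`
records = [
    {"id": "session-1672-04-01-num-1-resolution-4", "metadata": {"session_year": 1672}},
    {"id": "session-1711-09-04-num-1-resolution-16", "metadata": {"session_year": 1711}},
]


def write_json_lines(path, docs):
    """Write one JSON document per line to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for doc in docs:
            fh.write(json.dumps(doc) + "\n")


def read_json_lines(path):
    """Yield documents one at a time, without loading the whole file into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            yield json.loads(line)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "resolutions.jsonl.gz")
    write_json_lines(path, records)
    restored = list(read_json_lines(path))

print(len(restored))
```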

In [26]:
import json
import gzip

resolutions_file = "../../data/resolutions/rampjaar-ordinaris-resolutions.json.gz"

# open a file for storing the JSON representation of resolutions
with gzip.open(resolutions_file, 'wt') as fh:
    # iterate over the resolutions and dump their JSON representations to file
    json.dump([res.json for res in resolutions_1672], fh)

In [4]:
import json
import gzip

import republic.model.republic_document_model as rdm


resolutions_file = "../../data/resolutions/rampjaar-ordinaris-resolutions.json.gz"

# Reading the JSON representations from file again and turning 
# them into Resolution objects again
with gzip.open(resolutions_file, 'rt') as fh:
    # the document model has a convenience function to turn a JSON representation
    # to a Resolution object: json_to_republic_resolution
    resolutions_1672 = [rdm.json_to_republic_resolution(res) for res in json.load(fh)]
    

In [27]:
# Creating plain text representations of resolutions by concatenating paragraph texts
for res in resolutions_1672:
    res_text = '\n'.join([para.text for para in res.paragraphs])
    print(res_text)
    break

Is gehoort het rapport vande heeren van Ommeren, ende andere haer Ho:Mo: Gedeputeerden tot de saken vande Zee, hebbende ingevolge ende tot voldoeninge vander selver resolutie Commis„ soriael, gevisiteert ende geexami„ neert de requeste van Sammel Turcker, woonende tot Rotterdam als d'affaires doende van Thimo„ thens keijser, Coopman tot dublijn in Irlandt, raekende hetschip de providentie wenigh dagen voor dat de Smirnasche Vloot door d'Engelschen is geattacqueert, tot Rotterdam voornoemt gearri„ veert, ende aldaer ontladen Waerop gedelibereert sijnde, Is goetgevonden ende verstaen, dat de voorsr. requeste gesonden sal werden aen het Collegie ter admiraliteyt opde Maze, om haer Ho:Mo: daerop te laten toecomen der selver bericht consi„ deratien en advis.
Finis van maent Junij
1072


## Retrieving Aggregate Statistics

You can also directly query the indexes using the elasticsearch instance inside the `rep_es` object, which is stored in the `es_anno` property (so can be addressed via `rep_es.es_anno`).

Below is an example of a query and an aggregation to get, for each month of 1672, the number of resolutions that match the query `raet pensionaris`:

In [33]:
# raet pensionaris in resolutions
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}},
            {"match": {"paragraphs.text": "raet pensionaris"}}
        ]
    }
}

aggs = {
    "months": {
        "date_histogram": {
            "field": "metadata.session_date",
            "calendar_interval": "month"
        }
    }
}


response = rep_es.es_anno.search(index="resolutions_new", query=query, aggs=aggs, size=0)
buckets = response["aggregations"]["months"]["buckets"]
for bucket in buckets:
    print(bucket["key_as_string"].split("T")[0], bucket["doc_count"])

1672-01-01 130
1672-02-01 36
1672-03-01 146
1672-04-01 18
1672-05-01 142
1672-06-01 89
1672-07-01 128
1672-08-01 113
1672-09-01 95
1672-10-01 122
1672-11-01 73
1672-12-01 31


Results on 2022-02-10:
    
- 1672-01-01 64
- 1672-02-01 18
- 1672-03-01 71
- 1672-04-01 9
- 1672-05-01 83
- 1672-06-01 45
- 1672-07-01 63
- 1672-08-01 56
- 1672-09-01 47
- 1672-10-01 59
- 1672-11-01 48
- 1672-12-01 27
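The two sets of counts differ considerably, so it can help to compare runs programmatically. The sketch below converts aggregation buckets into a dictionary and computes the difference per month. The buckets are hand-written with the counts for the first three months listed above, and the time suffix in `key_as_string` is an assumption about the response format.

```python
def buckets_to_counts(buckets):
    """Map the date part of each bucket's key_as_string to its doc_count."""
    return {b["key_as_string"].split("T")[0]: b["doc_count"] for b in buckets}


# Hand-written buckets; the time suffix after 'T' is an assumed format
current_buckets = [
    {"key_as_string": "1672-01-01T00:00:00.000Z", "doc_count": 130},
    {"key_as_string": "1672-02-01T00:00:00.000Z", "doc_count": 36},
    {"key_as_string": "1672-03-01T00:00:00.000Z", "doc_count": 146},
]
# counts recorded on 2022-02-10
earlier_counts = {"1672-01-01": 64, "1672-02-01": 18, "1672-03-01": 71}

current_counts = buckets_to_counts(current_buckets)
differences = {
    month: current_counts[month] - earlier_counts[month]
    for month in current_counts
    if month in earlier_counts
}
for month, diff in differences.items():
    print(month, earlier_counts[month], current_counts[month], f"{diff:+d}")
```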


In [34]:
# raet pensionaris in attendance lists
query = {
    "bool": {
        "must": [
            {"match": {"type": "attendance_list"}},
            {"match": {"metadata.session_year": 1672}},
            {"match": {"paragraphs.text": "raet pensionaris"}}
        ]
    }
}

response = rep_es.es_anno.search(index="resolutions_new", query=query, aggs=aggs, size=0)
buckets = response["aggregations"]["months"]["buckets"]
for bucket in buckets:
    print(bucket["key_as_string"].split("T")[0], bucket["doc_count"])

1672-01-01 27
1672-02-01 14
1672-03-01 29
1672-04-01 10
1672-05-01 15
1672-06-01 4
1672-07-01 4
1672-08-01 7
1672-09-01 23
1672-10-01 31
1672-11-01 19
1672-12-01 9


In [29]:
# pestilentie in resolutions, counted per year
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"paragraphs.text": "pestilentie"}}
        ]
    }
}

aggs = {
    "months": {
        "date_histogram": {
            "field": "metadata.session_date",
            "calendar_interval": "year",
        }
    }
}

response = rep_es.es_anno.search(index="resolutions", query=query, aggs=aggs, size=0)
buckets = response["aggregations"]["months"]["buckets"]
for bucket in buckets:
    print(bucket["key_as_string"].split("T")[0], bucket["doc_count"])

1711-01-01 1
1712-01-01 0
1713-01-01 0
1714-01-01 0
1715-01-01 2
1716-01-01 0
1717-01-01 0
1718-01-01 0
1719-01-01 0
1720-01-01 0
1721-01-01 0
1722-01-01 0
1723-01-01 0
1724-01-01 0
1725-01-01 0
1726-01-01 0
1727-01-01 0
1728-01-01 0
1729-01-01 0
1730-01-01 0
1731-01-01 0
1732-01-01 0
1733-01-01 0
1734-01-01 0
1735-01-01 0
1736-01-01 0
1737-01-01 0
1738-01-01 0
1739-01-01 1
1740-01-01 0
1741-01-01 0
1742-01-01 0
1743-01-01 0
1744-01-01 0
1745-01-01 0
1746-01-01 0
1747-01-01 0
1748-01-01 0
1749-01-01 0
1750-01-01 0
1751-01-01 0
1752-01-01 2
1753-01-01 0
1754-01-01 0
1755-01-01 0
1756-01-01 0
1757-01-01 0
1758-01-01 0
1759-01-01 0
1760-01-01 0
1761-01-01 0
1762-01-01 0
1763-01-01 0
1764-01-01 0
1765-01-01 0
1766-01-01 0
1767-01-01 0
1768-01-01 0
1769-01-01 0
1770-01-01 1
1771-01-01 2


In [31]:
# number of resolutions per proposition type in 1672
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}}
        ]
    }
}

aggs = {
    "prop_types": {
        "terms": {
            "field": "metadata.proposition_type.keyword"
        }
    }
}


response = rep_es.es_anno.search(index="resolutions_new", query=query, aggs=aggs, size=0)
buckets = response["aggregations"]["prop_types"]["buckets"]
for bucket in buckets:
    print(bucket["key"], bucket["doc_count"])
    
print("Total:", sum([bucket["doc_count"] for bucket in buckets]))

missive 2841
requeste 482
rapport 404
voordracht 89
memorie 51
declaratie 14
rekening 2
Total: 3883


In [9]:
resolutions = rep_es.retrieve_resolutions_by_query(query, size=2700)
len(resolutions)

2500

In [23]:
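from collections import Counter

# Frequency of proposition types in the retrieved resolutions
# (only displayed when this is the last expression in the cell)
Counter([res.metadata["proposition_type"] for res in resolutions])

# List resolutions whose indexed metadata has no proposition type,
# with the start of their first paragraph and the evidence phrases
for res in resolutions:
    response = rep_es.es_anno.get(index="resolutions", id=res.id)
    res_json = response["_source"]
    if "proposition_type" not in res_json["metadata"] or res_json["metadata"]["proposition_type"] is None:
        print(res.id, res.proposition_type, res.metadata["proposition_type"])
        res_json["metadata"]["proposition_type"] = res.proposition_type
        print('\t', res.paragraphs[0].text[:100], '\n')
        for match in res.evidence:
            print('\t\t', match.phrase.phrase_string, match.label)
        # re-indexing is left disabled in this experiment
        #rep_es.index_resolution(res)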

session-1672-01-28-num-1-resolution-24 None None
	 Is gehoort het rapport vande heeren van Brakell, ende andere hare Ho:Mo: Gedeputeerdens tot de saeck 

		 in gevolge en tot voldoeninge van der selver Resolutie commissoriaal van den proposition_opening
		 hebbende ['proposition_verb', 'proposition_opening_end_verb', 'proposition_body']
		 houdende ['proposition_verb', 'proposition_opening_end_verb', 'proposition_body']
		 dienende ['proposition_verb', 'proposition_opening_end_verb', 'proposition_body']
		 hebbende ['proposition_verb', 'proposition_opening_end_verb', 'proposition_body']
session-1672-01-12-num-1-resolution-5 None None
	 Op de Requeste van Mr. Louis d' Outrelean, Licentiaet in de rechten woonende tot Middelburch in Zeel 

		 OP de Requeste van  proposition_opening
		 versoekende ['proposition_verb', 'request', 'proposition_opening_end_verb', 'proposition_body']
session-1672-06-08-num-1-resolution-14 None None
	 Op de requeste van Jacob Pesijn, Ontfanger vande gemeijne mi

KeyboardInterrupt: 

## Fuzzy Search

Use fuzzy searching with a list of keywords/phrases to find resolutions with spelling variants of those keywords/phrases.

The `fuzzy-search` package has a `FuzzyPhraseSearcher`, but also a `FuzzyContextSearcher`, which returns the matched phrase together with the surrounding context and allows one to search for additional keywords/phrases within those match contexts.

A `PhraseMatchInContext` object has a `.string` property that contains the string in the text that matches the phrase. In addition, it has the properties `prefix`, `suffix` and `context`. The `prefix` contains the preceding text, the `suffix` the text following the matching string, and the `context` contains the matching string with both `prefix` and `suffix` text. The amount of preceding and following text is controlled in the `.find_matches()` function by the optional `prefix_size` and `suffix_size` arguments.

Requirements: 
- `fuzzy-search` (Python package, installable with `pip install fuzzy-search`; it should already be installed if you installed the required packages for the Republic code repository).
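
The relation between `string`, `prefix`, `suffix` and `context` described above can be illustrated with plain string slicing. This is only a sketch and not the `fuzzy-search` implementation: `split_context` is a hypothetical helper, and the match offsets are found here with an exact `str.index` lookup rather than fuzzy matching.

```python
def split_context(text, start, end, prefix_size=20, suffix_size=20):
    """Hypothetical helper (not part of fuzzy-search): slice a match and its context."""
    prefix = text[max(0, start - prefix_size):start]   # text preceding the match
    string = text[start:end]                           # the matching string itself
    suffix = text[end:end + suffix_size]               # text following the match
    return {
        "prefix": prefix,
        "string": string,
        "suffix": suffix,
        "context": prefix + string + suffix,
    }


text = "ende andere haer Ho:Mo: Gedeputeerden tot de saken vande Zee, hebbende ingevolge"
start = text.index("tot de saken")
end = start + len("tot de saken")

parts = split_context(text, start, end, prefix_size=14, suffix_size=10)
for name, value in parts.items():
    print(f"{name: <8}|{value}|")
```

Larger `prefix_size` and `suffix_size` values simply widen the slices on either side of the match, which is why the later cells pass larger values when they need more text before or after the phrase.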

In [7]:
from fuzzy_search.fuzzy_context_searcher import FuzzyContextSearcher
from fuzzy_search.fuzzy_phrase_model import PhraseModel

phrase = 'tot de saecken'

config = {
    'levenshtein_threshold': 0.7,
    'ngram_size': 3,
    'skip_size': 1,
    'include_variants': True
}

phrase_model = PhraseModel([phrase], config=config)
saecken_searcher = FuzzyContextSearcher(config)
saecken_searcher.index_phrase_model(phrase_model)

for res in resolutions_1672[:20]:
    for para in res.paragraphs:
        matches = saecken_searcher.find_matches(para.text)
        for match in matches:
            print(f"Phrase: {match.phrase.phrase_string: <30}\tmatch string: {match.string}")
            print(f"\t{match.context}\n")
            print(match.offset, match.end)
            

Phrase: tot de saecken                	match string: tot de saken
	Is gehoort het rapport vande heeren van Ommeren, ende andere haer Ho:Mo: Gedeputeerden tot de saken vande Zee, hebbende ingevolge ende tot voldoeninge vander selver resolutie Commis„ soriael, gevisit

87 99
Phrase: tot de saecken                	match string: tot de saecken
	Is gehoort het rapport vande Heeren van Ommeren, ende andere hare Ho:Mo: Gedeputeerden tot de saecken vande Zee, hebbende inge, volge, ende tot voldoeninge van der„ selver resolutie Commissoriael van„ 

87 101
Phrase: tot de saecken                	match string: tot de saecken
	t gestelt sal werden in handen vande Heeren van Hoogendorp, ende andere hare Ho:Mo: Gedepu„ teerden tot de saecken vande Zee, om te visiteren, examineren, ende van alles rapport te doen, sonder resumptie.

1128 1142
Phrase: tot de saecken                	match string: tot de saecken
	ns gestelt sullen werden in handen vande Heeren van Hogendorp ende andere haer Ho:Mo: Gedeput

Many suffixes start with the names of persons or organisations, followed by a comma. To get an overview of which entities are mentioned after the phrase `tot de saecken`, we can do a simple count of suffixes, cut off at the first comma:

In [75]:
saecken_entity_freq = Counter()

for res in resolutions_1672:
    for para in res.paragraphs:
        matches = saecken_searcher.find_matches(para.text, suffix_size=50)
        for match in matches:
            saecken_entity = match.suffix.split(',')[0]
            saecken_entity_freq.update([saecken_entity])

for saecken_entity, freq in saecken_entity_freq.most_common(50):
    print(f"{saecken_entity: <40}{freq: >5}")

 vande Zee                                462
 vande Triple Alliantie                    73
 van Vlaenderen                            46
 vande Triple Alliancie                    44
 vande finantie                            44
 vande Meijerije van ‛s Hertogenbosch      31
 van Oostvrieslandt                        23
 vande Landen van Overmase                 22
 vande Griffie                             20
 vande Triple alliantie                    19
 van Oostvrieslant                         17
 vande Triple Alliantie om te visiteren    12
 van Denemarcken                            8
 van de Zee                                 7
 vande Triple Alliancie om te visiteren     6
 vande Westindische Compagnie               6
 van Oost                                   6
 van Vlaen„ deren                           6
 vande Triple alliancie                     6
 vande Meijerie van s'hertogenbosch         6
 ken vande Zee                              6
 vande triple Alliantie           
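
The counts above are split over spelling and layout variants of the same entity (capitalisation, `van de` versus `vande`, words broken across a line with `„ `). A minimal sketch of how such variants could be merged before counting is given below. The helper `normalise_entity` and its three rules are assumptions made for illustration, not part of the Republic code, and the input counts are copied from the output above. Variants that differ in spelling, such as `Alliancie` versus `Alliantie`, are deliberately left separate.

```python
import re
from collections import Counter


def normalise_entity(entity: str) -> str:
    """Hypothetical helper: reduce surface variation in a suffix string."""
    entity = entity.replace("„ ", "")                      # rejoin words broken across a line
    entity = re.sub(r"\s+", " ", entity).strip().lower()   # collapse whitespace, ignore case
    entity = re.sub(r"^van de ", "vande ", entity)         # treat 'van de' and 'vande' alike
    return entity


# A few of the raw suffix counts from the output above
raw_freq = Counter({
    " vande Zee": 462,
    " vande Triple Alliantie": 73,
    " van Vlaenderen": 46,
    " vande Triple Alliancie": 44,
    " vande Triple alliantie": 19,
    " van de Zee": 7,
    " van Vlaen„ deren": 6,
})

merged_freq = Counter()
for entity, freq in raw_freq.items():
    merged_freq[normalise_entity(entity)] += freq

for entity, freq in merged_freq.most_common():
    print(f"{entity: <40}{freq: >5}")
```

In the notebook itself the same normalisation could be applied to `saecken_entity` before calling `saecken_entity_freq.update()`.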

Next, we make a second searcher to find the phrase _gestelt sullen werden_ and its singular variant _gestelt sal werden_ in the prefix of `tot de saecken`:

In [92]:
phrases = [
    {
        'phrase': 'gestelt sullen werden',
        'variants': [
            'gestelt sal werden'
        ]
    }
]

config = {
    'levenshtein_threshold': 0.7,
    'ngram_size': 3,
    'skip_size': 1,
    'include_variants': True
}


phrase_model = PhraseModel(phrases, config=config)
prefix_searcher = FuzzyContextSearcher(config)
prefix_searcher.index_phrase_model(phrase_model)

for res in resolutions_1672[:20]:
    print(res.id)
    for para in res.paragraphs:
        saecken_matches = saecken_searcher.find_matches(para.text, prefix_size=150)
        for saecken_match in saecken_matches:
            # first, get the entity mentioned in the suffix
            saecken_entity = saecken_match.suffix.split(',')[0]
            # next, search for the prefix phrase in the prefix of 'tot de saecken'
            prefix_matches = prefix_searcher.find_matches(saecken_match.prefix)
            for prefix_match in prefix_matches:
                print(f"{prefix_match.string}\n\t{prefix_match.suffix}")
                print(f"{saecken_match.string}\n\t{saecken_entity}\n")


session-1672-01-07-num-1-resolution-1
session-1672-01-07-num-1-resolution-2
gestelt sal werden
	 in handen vande Heeren Schimmelpenningh ende andere haer Ho:Mo: Gedeputeerden 
tot de saecken
	 vande Triple Alliantie

gestelt sal werden
	 in handen vande Heeren Schimmelpenningh ende andere haer Ho:Mo: Gedeputeerden 
tot de saecken
	 vande Triple Alliantie

session-1672-01-07-num-1-resolution-3
gestelt sal werden
	 in handen vande Heeren van Ommeren ende andere hare Ho:Mo: Gede,, puteerden 
tot de saecken
	 vande Triple Alliantie

gestelt sal werden
	 in handen vande Heeren van Ommeren ende andere hare Ho:Mo: Gede,, puteerden 
tot de saecken
	 vande Triple Alliantie

session-1672-01-07-num-1-resolution-4
gestelt sullen werden
	 in handen vande Heeren van Gent ende andere hare Ho:Mo: Gedeputeerden 
tot de saken
	 vande Triple Alliantie

session-1672-01-07-num-1-resolution-5
session-1672-01-07-num-1-resolution-6
gestelt sal werden
	 in handen vande Heeren van Gent, ende andere hare Ho:Mo: 