## Experimentation Notebook

This notebook is part of the sandbox and is intended to experiment and play around with the REPUBLIC elasticsearch functionalities.

In [3]:
# This is needed to add the repo dir to the path so jupyter
# can load the republic modules directly from the notebooks
import os
import sys
repo_name = 'republic-project'
repo_dir = os.path.split(os.getcwd())[0].split(repo_name)[0] + repo_name
print(repo_dir)
if repo_dir not in sys.path:
    sys.path.append(repo_dir)



/Users/femkegordijn/republic-project


In [2]:
import elasticsearch

In [22]:
pip install elasticsearch==7.17.0

Note: you may need to restart the kernel to use updated packages.


In [2]:
elasticsearch.__version__

(7, 17, 0)

NameError: name 'y' is not defined

## Initialise Republic Elasticsearch Instance

This creates a RepublicElasticsearch object that contains an elasticsearch instance for the Republic CAF indexes, as well as a range of retrieval functions.

Check the [README](https://github.com/HuygensING/republic-project#readme) for configuration details that should be placed in `settings.py`.

In [4]:
from republic.elastic.republic_elasticsearch import initialize_es

rep_es = initialize_es(host_type='external', timeout=60)


## Keyword in Context

A simple way to start exploring is with the `keyword_in_context` function. It takes words or phrases as input and shows a number of hits with surrounding context in the resolutions.

The `keyword_in_context` function returns `hit`s, which are dictionaries with the search `term`, the `pre` and `post` contextual words, a formatted `context`, as well as the `para_id` and `resolution_id` and `resolution_offset` and `para_offset`.

In [5]:
# use single word or multi-word phrase
for hit in rep_es.keyword_in_context("g", num_hits=20, context_size=10):
    print(hit["context"])


ONtfangen een Missive van den Commissaris G. G. Brandi, geschreeven te Alexandrien in Egypten den 29 October 1781
de Heeren Staaten haare Principaalen hadden gecommitteert den Heere C. G. G. v. Wassenaer, omme sessie te neemen weegens hooghgemelde Provincie
Amsterdam; versoekende derhalven, dat haar Hoogh Mogende gemelden Heere C. G. G. v. Wassenaer na prestatie van den behoorlijcken eed, met
IS ter Vergaderinge geleesen de Requeste van O. G. Veldman , Collonel van een Regiment Infanterye, en Commandeur der Stad
absenteeren, wanneer het commando zoude koomen op den Collonel H. G. Veldman. WAAaRk op gedelibereert zynde , is goedgevonden en verstaan, dat
laatende het commando aldaar in handen van den Collonek H. G. Veldman
een Missive van den Gouverneur Generaal der Colonie Zurinamen, J. G Wichers, geschreven te Paramaribo den 30 January deezes jaars 1789
houdende , dat alzoo J. G. Neitzsch, Inwoonder der gemelde Colonie, aan hem by Requeste te
ab intestato of anderzints zoude kunnen sch

The `keyword_in_context` function also has several optional arguments to control the size of the context window (`context_size`, default is 3 words before and after), the number of hits (`num_hits`, default is 10) and query filters to constrain the search space (`filters`, which are added to the query).

**Note**: the `num_hits` argument controls the number of _resolutions_ that are retrieved. Within a resolution, the search keyword may appear multiple times. A context is created for each occurrence of the search keyword, so the number returned of contexts can be (and typically is) higher than the number of hits.

In [6]:
# use context_size to get fewer or more surrounding words as context
for hit in rep_es.keyword_in_context("secreete", context_size=5):
    print(hit["context"])

                          gedaan op haar Hoogh Mogende secreete commissoriaale Resolutie van den vierden
                         ontfangen, en gemeld in sijne secreete Mislive van den seven en
                       besoignes by haar Hoogh Mogende secreete Resolutie van den vierden December
                        voldoeninge van haar Hoog Mog. secreete Resolutie en Aanschryving van den
                             in haar Hoog Mog. gemelde secreete - Resolutie van den 12 deeser
                     nakoominge van haar Hoogh Mogende secreete Resolutie en Aanschryvens van den
                          vervat in haar Hoogh Mogende secreete Resolutie van den sestienden deeser
                       ontfangst van haar Hoog Mogende secreete Resolutie van den dertigsten der
                             in gevolge haar Hoog Mog. secreete Resolutie van den 25 deezer
                        kraghte van haar Hoogh Mogende secreete Resolutie van den vier en
               Suppliantes Man waaren 

In [7]:
for hit in rep_es.keyword_in_context("goetgevonden", context_size=5):
    # First, show paragraph id (which contains session date)
    print(hit["resolution_id"])
    # Second, show the keyword in context
    print(hit["context"])
    # Finally, add newline for readability
    print()


session-1713-05-23-num-1-resolution-14
               Fervaques. IS na voorgaende deliberatie goetgevonden ende verstaen, dat een Pasport

session-1713-05-23-num-1-resolution-14
                          IS na voorgaende deliberatie goetgevonden ende verstaen , dat een Pasport

session-1709-02-14-num-1-resolution-7
                      Waer op gedelibereert zijnde, is goetgevonden ende verstaen, mits desen tot

session-1709-02-14-num-1-resolution-7
                          Is na voorgaende deliberatie goetgevonden ende verstaen, dat het Collegie

session-1709-02-14-num-1-resolution-7
                          Is na voorgaende deliberatie goetgevonden ende verstaen, dat den Raedt

session-1709-02-14-num-1-resolution-7
                          Is na voorgaende deliberatie goetgevonden ende verstaen, dat ten behoeve

session-1709-02-14-num-1-resolution-7
                          Is na voorgaende deliberatie goetgevonden ende verstaen, dat ten behoeve

session-1708-06-25-num-1-resoluti

In [8]:
# use num_hits to get fewer or more results
for hit in rep_es.keyword_in_context("voornoemde Procureur", context_size=5, num_hits=20):
    print(hit["resolution_id"])
    print(hit["context"])
    print()


session-1779-12-06-num-1-resolution-15
                              andere zyde; waar by den voornoemde Procureur van Kervel versoekt obedientie, en

session-1779-12-06-num-1-resolution-15
                  op condemnatie; en consenteerende de voornoemde Procureur Alsche in de versogte condemnatie

session-1780-07-18-num-1-resolution-8
                         exceptie en defensie; waar op voornoemde Procureur van Alphen wyders versoekt condemnatie

session-1780-10-13-num-1-resolution-7
                               andere zyde; waar by de voornoemde Procureur van Alphen versogt obedientie: en

session-1780-10-13-num-1-resolution-7
                    wyders versogt condemnatie, en den voornoemde Procureur de Byo consenteerde in de

session-1779-07-23-num-1-resolution-11
                              andere zyde; waar by den voornoemde Procureur van Son versogt obedientie, en

session-1779-07-23-num-1-resolution-11
           Son versogt condemnatie. Consenteerende den voornoemde Pro

In [30]:
# use filters to contrain the search space:
# selecting resolutions by year
filters = [
    {"match": {"metadata.session_year": 1672}}
]

for hit in rep_es.keyword_in_context('tijden', filters=filters, context_size=10, num_hits=10000):
    print(hit["para_id"])
    print(hit["context"])

session-1672-05-31-num-1-para-55
deert, de paerden uijt den Lande — vervoert, maer in dese tijden van Oorloge wel eenige onzeijlen den Staet souden connen werden
session-1672-05-31-num-1-para-55
haer verplicht hadden gevonden haer Ho:Mo: bij dese becommerlijcke tijden te geven in derselver bedencken, dewijle dat bij verscheijde van
session-1672-05-31-num-1-para-55
te laten brengen, T‛welck haer E. oordeelden bij dese tijden niet alleen ten hoochsten noot„ saeckelijck te sijn, soo tot
session-1672-10-12-num-1-para-27
duijsent ende een hondert gulden, dat van gelijcken t' allen tijden van Oorloge ordre placht gestelt te worden op den ontfangh
session-1672-10-12-num-1-para-27
in de reecke„ ningen van tijtlicke gewesene Ontfangers Generael in tijden van Oorloge, dat haer E. aengaende d'ordre te stellen
session-1672-06-08-num-1-para-23
gunt bij de hoochgeme. heeren Staten in dese conjuncture van tijden ende saecken was geresolveert ende gedaen: Waerop gedelibereert zijnde, Is
session-1672-0

In [9]:
# use filters to contrain the search space:
# selecting resolutions by date range
filters = [
    {"range": {"metadata.session_date": {"gte": "1672-04-01", "lte": "1672-08-01"}}}
]

for hit in rep_es.keyword_in_context("Vloot", filters=filters):
    print(hit["para_id"], '\n')
    print(hit["context"], '\n')

session-1672-05-31-num-1-para-92 

                      sigh naer de Vloot te vervoegen: Waerop 

session-1672-06-16-num-1-para-24 

                 dat de Smirnasche Vloot door d'Engelschen 

session-1672-04-02-num-1-para-50 

                       Ho:Mo: inde vloot vanden Staat op 

session-1672-05-24-num-1-para-20 

           gedaen, opde Smirnasche Vloot is gear„ resteert 

session-1672-07-27-num-1-para-83 

                   ont„ houden van Vloot der Vijanden van 

session-1672-05-31-num-1-para-36 

                 dat de Smirnasche Vloot, door d'Engelschen 

session-1672-04-06-num-1-para-74 

  gevolmachtichde opde voorschreve vloot commanderen sal den 



In [10]:
filters = [
    {"match": {"metadata.session_year": 1672}}
]

# using a larger context size
for hit in rep_es.keyword_in_context("requeste", filters=filters, context_size=1000, num_hits=10000):
    print(hit["resolution_id"], hit["resolution_offset"], '\n')
    print(hit["context"], '\n')


session-1672-09-12-num-1-resolution-1 0 

Is ter Vergaderinge gelesen de Requeste van Paulus Croesbeeck, Procu„ reur voor het Collegie ter Admira„ liteijt op de Maze, versoeckende om redenen daerin verhaelt, dat hare Ho:Mo: goede geliefte zij, den Suppliant, in qualité, soo hij ageert, eenmael expeditie van Justitie te laten gewerden, ende dienvolgende hem sijn versoeck bij sijne voorgaende Requeste gedaen, alsnoch te accor„ derden: Waerop gedelibereert sijnde, Is goetgevonden ende verstaen, dat de voornoemde Requeste gestelt sal werden in handen vande Heeren meerens, ende andere hare Ho:Mo: Gede„ puteerden tot de saecken vande Zee, om te visiteren, examineren, ende daer van rapport te doen 

session-1672-09-12-num-1-resolution-1 658 

Sijnde ter Vergaderinge gelesen de Requeste vande gesamentlicke Boden vande Generaliteijt, versoeckende om de geallegeerde rederen, dat haer Ho:Mo: de Supplianten de reijsen op Zeelandt wederom gelieven te vergunnen, onder condi„ tie van deselve soo spoe

## Retrieving Resolutions

The `rep_es` object has a range of functions to retrieve `resolution` objects.

You can find all available properties and methods of `resolution` objects in `republic_document_model.py`: i.e. in the
[Resolution](https://github.com/HuygensING/republic-project/blob/bb4cdad7b4cb9fb71378d0dde000fe7725ceb45e/republic/model/republic_document_model.py#L392) class, which inherits several properties and methods from the [ResolutionElementDoc](https://github.com/HuygensING/republic-project/blob/bb4cdad7b4cb9fb71378d0dde000fe7725ceb45e/republic/model/republic_document_model.py#L158)

In [11]:
resolutions = rep_es.retrieve_resolutions_by_session_date("1672-02-12")
for res in resolutions:
    print(res.session_date.isoformat(), res.id)

1672-02-12 session-1672-02-12-num-1-attendance_list
1672-02-12 session-1672-02-12-num-1-resolution-1
1672-02-12 session-1672-02-12-num-1-resolution-2
1672-02-12 session-1672-02-12-num-1-resolution-3
1672-02-12 session-1672-02-12-num-1-resolution-4
1672-02-12 session-1672-02-12-num-1-resolution-5
1672-02-12 session-1672-02-12-num-1-resolution-6
1672-02-12 session-1672-02-12-num-1-resolution-7
1672-02-12 session-1672-02-12-num-1-resolution-8
1672-02-12 session-1672-02-12-num-1-resolution-9
1672-02-12 session-1672-02-12-num-1-resolution-10
1672-02-12 session-1672-02-12-num-1-resolution-11
1672-02-12 session-1672-02-12-num-1-resolution-12
1672-02-12 session-1672-02-12-num-1-resolution-13
1672-02-12 session-1672-02-12-num-1-resolution-14
1672-02-12 session-1672-02-12-num-1-resolution-15


In [14]:
output_file = "resoluties_rampjaar.csv"

with open(output_file, 'wt') as fh:
    headers = ['resolution_id', 'date', 'paragraph_id', 'text', 'iiif_url']
    fh.write('\t'.join(headers) + '\n')
    for res in resolutions:
        for para in res.paragraphs:
            if isinstance(para.metadata['iiif_url'], list):
                url = ', '.join(para.metadata['iiif_url'])
            else:
                url = para.metadata['iiif_url']
            row = [res.id, res.metadata['session_date'], para.id, para.text, url]
            row = [cell if cell is not None else '' for cell in row]
            print(row)
            fh.write('\t'.join(row) + '\n')
        

['session-1672-02-12-num-1-attendance_list', '1672-02-12', 'session-1672-02-12-num-1-para-1', 'Veneris den 12. Februarij 1672', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3285/NL-HaNA_1.01.02_3285_0265.jpg/2609,337,2446,3576/full/0/default.jpg']
['session-1672-02-12-num-1-attendance_list', '1672-02-12', 'session-1672-02-12-num-1-para-2', 'Preside den Heere van Coeverden', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3285/NL-HaNA_1.01.02_3285_0265.jpg/2609,337,2446,3576/full/0/default.jpg']
['session-1672-02-12-num-1-attendance_list', '1672-02-12', 'session-1672-02-12-num-1-para-3', 'Præsentibus de Heeren van Gendt, Gellicum, Brakell, Vijgh, Ripperda tot Buirse Schimmelpenningh, Ommeren Werckendam, Goeree, Meerens Odijck, Reijgersbergh, Crommon, Vrijbergen, Mauregnault,', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3285/NL-HaNA_1.01.02_3285_0265.jpg/2609,337,2446,3576/full/0/default.jpg']
['session-1672-02-12-num-1-attendance_list', '1672-02-12', 'session-1672-02-12-nu

In [12]:
keyword = "oorlogh"

query = {"query": {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},                  # only resolutions, no attendance lists
            {"match": {"metadata.session_year": 1672}},         # only resolutions from 1672
            {"match": {"paragraphs.text": keyword}}, # only resolutions containing 'raet pensionaris'
        ]
    }
}}
resolutions = rep_es.retrieve_resolutions_by_query(query)
output_file = f"resoluties_rampjaar_{keyword.replace(' ','_')}.csv"
with open(output_file, 'wt') as fh:
    headers = ['resolution_id', 'date', 'paragraph_id', 'text', 'iiif_url']
    fh.write('\t'.join(headers) + '\n')
    for res in resolutions:
        for para in res.paragraphs:
            if isinstance(para.metadata['iiif_url'], list):
                url = ', '.join(para.metadata['iiif_url'])
            else:
                url = para.metadata['iiif_url']
            row = [res.id, res.metadata['session_date'], para.id, para.text, url]
            row = [cell if cell is not None else '' for cell in row]
            print(row)
            fh.write('\t'.join(row) + '\n')

['session-1672-04-01-num-1-resolution-4', '1672-04-01', 'session-1672-04-01-num-1-para-11', 'Ontfangen een Missive van het Collegie ter admiraliteijt tot amsterdam, geschreven aldaer den eenendertichsten Martij lest„ leden, houdende, dat het gemelte Collegie vermits de jegenwoordige ongelegentheijt met den Coningh van Groot Brittannien geerne soude verstaen haer Ho:Mo: intentie ontrent het affsenden van hetschip van Oorlogh voor desen gedestineert tot transport vanden heer van Strevels„ hoeck, haer ho:Mo: gedesigneerde resident aen het hoff van Spamen, uijt dese Landen naer Spaignen, sonderlingh alsoo het voorschreve Schip van oorlogh niet sonder pericul ende ondienst vanden Lande soo verte vande handt gesonden conde werden, ende andersints was gedesigneert onder de schepen die', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3285/NL-HaNA_1.01.02_3285_0544.jpg/3236,-8,1830,3874/full/0/default.jpg']
['session-1672-04-01-num-1-resolution-4', '1672-04-01', 'session-1672-04-01-num-1-par

### The Anatomy of a Resolution

Resolutions in the index consist of `metadata` and `paragraphs`.

In [13]:
import json

res = resolutions[0]
# Each resolution has metadata
print(json.dumps(res.metadata, indent=4))

{
    "inventory_num": 3285,
    "source_id": "session-1672-04-01-num-1",
    "type": "resolution",
    "id": "session-1672-04-01-num-1-resolution-4",
    "session_date": "1672-04-01",
    "session_id": "session-1672-04-01-num-1",
    "session_num": 1,
    "president": null,
    "session_year": 1672,
    "session_month": 4,
    "session_day": 1,
    "session_weekday": "Veneris",
    "proposition_type": "missive",
    "proposer": null,
    "decision": null,
    "resolution_type": "ordinaris",
    "text_page_num": [],
    "index_timestamp": "2022-02-03T09:11:42.940395"
}


In [14]:
# You can dump all resolution data to JSON
res.json

{'id': 'session-1672-04-01-num-1-resolution-4',
 'type': ['republic_doc', 'resolution_element', 'resolution'],
 'metadata': {'inventory_num': 3285,
  'source_id': 'session-1672-04-01-num-1',
  'type': 'resolution',
  'id': 'session-1672-04-01-num-1-resolution-4',
  'session_date': '1672-04-01',
  'session_id': 'session-1672-04-01-num-1',
  'session_num': 1,
  'president': None,
  'session_year': 1672,
  'session_month': 4,
  'session_day': 1,
  'session_weekday': 'Veneris',
  'proposition_type': 'missive',
  'proposer': None,
  'decision': None,
  'resolution_type': 'ordinaris',
  'text_page_num': [],
  'index_timestamp': '2022-02-03T09:11:42.940395'},
 'evidence': [{'type': 'PhraseMatch',
   'phrase': 'ONtfangen een Missive van',
   'variant': 'ONtfangen een Missive van',
   'string': 'Ontfangen een Missive van',
   'offset': 0,
   'label': ['proposition_opening',
    'proposition_from_correspondence',
    'proposition_type:missive'],
   'ignorecase': False,
   'text_id': 'session-167

## Using Elasticsearch Queries

See the [Elasticsearch Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) for details on how to construct different types of queries.

In [15]:
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},                  # only resolutions, no attendance lists
            {"match": {"metadata.session_year": 1672}},         # only resolutions from 1672
            {"match": {"metadata.paragraphs.text": "requeste"}}, # only resolutions containing 'raet pensionaris'
        ]
    }
}

resolutions = rep_es.retrieve_resolutions_by_query(query)

for res in resolutions:
    print(res.id)
    for para in res.paragraphs:
        print(f"\t{para.text}\n")
    print('--------------------\n')

session-1672-09-12-num-1-resolution-1
	Is ter Vergaderinge gelesen de Requeste van Paulus Croesbeeck, Procu„ reur voor het Collegie ter Admira„ liteijt op de Maze, versoeckende om redenen daerin verhaelt, dat hare Ho:Mo: goede geliefte zij, den Suppliant, in qualité, soo hij ageert, eenmael expeditie van Justitie te laten gewerden, ende dienvolgende hem sijn versoeck bij sijne voorgaende Requeste gedaen, alsnoch te accor„ derden: Waerop gedelibereert sijnde, Is goetgevonden ende verstaen, dat de voornoemde Requeste gestelt sal werden in handen vande Heeren meerens, ende andere hare Ho:Mo: Gede„ puteerden tot de saecken vande Zee, om te visiteren, examineren, ende daer van rapport te doen.

	Sijnde ter Vergaderinge gelesen de Requeste vande gesamentlicke Boden vande Generaliteijt, versoeckende om de geallegeerde rederen, dat haer Ho:Mo: de Supplianten de reijsen op Zeelandt wederom gelieven te vergunnen, onder condi„ tie van deselve soo spoedelick jae spoediger te doen, als de tegen„ wo

In [16]:
# import Counter to do some simple word counting and frequency comparison
from collections import Counter
import re


In [31]:
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}},
            #{"match": {"paragraphs.text": "requeste"}}
        ]
    }
}

resolutions_1672 = rep_es.retrieve_resolutions_by_query(query, size=10000)

all_word_freq = Counter()

for res in resolutions_1672:
    for para in res.paragraphs:
        all_word_freq.update([word for word in re.split(r"\W+", para.text) if word != ''])

for word, freq in all_word_freq.most_common(10):
    print(f"{word: <20}{freq: >6}")
    
#print(all_word_freq)

ende                 26333
van                  23211
de                   20743
te                   14955
dat                  11138
den                   9356
haer                  8467
vande                 8418
tot                   8380
in                    8012


In [32]:
Counter([res.metadata['proposition_type'] for res in resolutions_1672])

Counter({None: 460,
         'missive': 1617,
         'requeste': 289,
         'rekening': 1,
         'memorie': 36,
         'rapport': 2,
         'declaratie': 7})

In [20]:
res_missives = [res for res in resolutions_1672 if res.metadata['proposition_type'] == 'missive']

len(res_missives)

1617

In [22]:
from collections import Counter
import re

query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}},
            {"match": {"paragraphs.text": "raet pensionaris"}}
        ]
    }
}

resolutions = rep_es.retrieve_resolutions_by_query(query)


word_freq = Counter()

for res in resolutions:
    for para in res.paragraphs:
        word_freq.update([word for word in re.split(r"\W+", para.text) if word != ''])

rel_freq = {}
min_freq = 3
for word, freq in word_freq.most_common():
    if freq < min_freq:
        continue
    rel_freq[word] = freq / all_word_freq[word]
    
for word in sorted(rel_freq, key = lambda w: rel_freq[w], reverse=True):
    print(f"{word: <20}{rel_freq[word]: >6.4f}{word_freq[word]: >6}{all_word_freq[word]: >8}")

Pensionaris         1.0000     3       3
Pensionnaris        0.2000     7      35
Wijtingh            0.2000     3      15
Fagel               0.1429     5      35
Brakel              0.1154     3      26
geaddresseert       0.0847     5      59
raet                0.0833     5      60
advertentie         0.0826    10     121
schreven            0.0517     3      58
gecommuniceert      0.0500     3      60
saken               0.0448     3      67
Griffier            0.0412     4      97
Raet                0.0395    16     405
nomineren           0.0337     3      89
Ontfangen           0.0335     8     239
missive             0.0318    12     377
gevallen            0.0292     4     137
seeckere            0.0254     3     118
Schepenen           0.0248     3     121
anderen             0.0246     3     122
Missive             0.0244     4     164
nevens              0.0236     5     212
heeft               0.0217     6     276
Orange              0.0216     3     139
geschreven      

### Resolutions in JSON Format

Resolution objects have a `.json` property to get a JSON representation of the resolution, including metadata, paragraph text and basic statistics. This can be a convenient format for storing and later retrieving them from disk (faster than getting them from the Republic CAF server).

You can also turn them into plain text representations if you want to do extensive text analysis.

In [22]:
import json
import gzip

resolutions_file = "../../data/resolutions/rampjaar-ordinaris-resolutions.json.gz"

# open a file for storing the JSON representation of resolutions
with gzip.open(resolutions_file, 'wt') as fh:
    # iterate over the resolutions and dump their JSON representations to file
    json.dump([res.json for res in resolutions_1672], fh)

In [23]:
import json
import gzip

import republic.model.republic_document_model as rdm


resolutions_file = "../../data/resolutions/rampjaar-ordinaris-resolutions.json.gz"

# Reading the JSON representations from file again and turning 
# them into Resolution objects again
with gzip.open(resolutions_file, 'rt') as fh:
    # the document model has a convenience function to turn a JSON representation
    # to a Resolution object: json_to_republic_resolution
    resolutions_1672 = [rdm.json_to_republic_resolution(res) for res in json.load(fh)]
    

In [24]:
# Creating plain text representations of resolutions by concatenating paragraph texts
for res in resolutions_1672:
    res_text = '\n'.join([para.text for para in res.paragraphs])
    print(res_text)
    break

Is gehoort het rapport vande Heeren Schimmelpenningh, ende andere hare Ho:Mo: Gedeputeerden tot de saken vande Zee, hebbende ingevolge ende tot voldoeninge van derselver resolutie Commissoriael vanden negenentwin„ tichsten December laestleden, gevi, siteert ende geexamineert de Requeste van David Centsen, Consul vande Nederlantsche natie tot Rochelle, versoeckende door hare Ho:Mo: met eene somme van penningen te mogen werden gesubvenieert, ten aensien vande oncosten bij hem gesupporteert in een continueel vervolgh van ontrent acht maenden, om expeditie, en het obtineren van eene resolutie op de Consulaetrechten aldaer, ende een daghgelt aen hem Suppliant als Con„ sul toe te leggen: Waerop gedelibereert sijnde, Is goetgevonden ende verstaen, mits desen te versoec„ ken de Heeren Gedeputeerden vande Provincie van Hollandt ende West,, vrieslandt, dat haer E. haer hoe eerder soo liever willen verclaren op het rapport vande gemelte Heeren Schimmel„ penningh ende andere hare Ho:Mo: Gedeputeer

## Retrieving Aggregate Statistics

You can also directly query the indexes using the elasticsearch instance inside the `rep_es` object, which is stored in the `es_anno` property (so can be addressed via `rep_es.es_anno`).

Below is an example of a query and an aggregation to get the number of resolutions per month in the year 1672:

In [29]:
# raet pensionaris in resolutions
query = {
    "bool": {
        "must": [
            #{"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}},
            {"match": {"metadata.proposition_type": "requeste"}}
        ]
    }
}

aggs = {
    "months": {
        "date_histogram": {
            "field": "metadata.session_date",
            "calendar_interval": "week"
        }
    }
}


response = rep_es.es_anno.search(index="resolutions", query=query, aggs=aggs, size=0)
buckets = response["aggregations"]["months"]["buckets"]
for bucket in buckets:
    print(bucket["key_as_string"].split("T")[0], bucket["doc_count"])

1672-01-04 3
1672-01-11 6
1672-01-18 2
1672-01-25 1
1672-02-01 3
1672-02-08 1
1672-02-15 4
1672-02-22 0
1672-02-29 5
1672-03-07 6
1672-03-14 7
1672-03-21 10
1672-03-28 11
1672-04-04 9
1672-04-11 0
1672-04-18 0
1672-04-25 4
1672-05-02 19
1672-05-09 10
1672-05-16 9
1672-05-23 4
1672-05-30 9
1672-06-06 3
1672-06-13 6
1672-06-20 0
1672-06-27 1
1672-07-04 2
1672-07-11 10
1672-07-18 3
1672-07-25 4
1672-08-01 7
1672-08-08 5
1672-08-15 5
1672-08-22 0
1672-08-29 4
1672-09-05 5
1672-09-12 5
1672-09-19 6
1672-09-26 4
1672-10-03 6
1672-10-10 4
1672-10-17 7
1672-10-24 4
1672-10-31 6
1672-11-07 5
1672-11-14 2
1672-11-21 6
1672-11-28 13
1672-12-05 1
1672-12-12 9
1672-12-19 5
1672-12-26 1


In [26]:
# raet pensionaris in attendance lists
query = {
    "bool": {
        "must": [
            {"match": {"type": "attendance_list"}},
            {"match": {"metadata.session_year": 1672}},
            {"match": {"paragraphs.text": "raet pensionaris"}}
        ]
    }
}

response = rep_es.es_anno.search(index="resolutions", query=query, aggs=aggs, size=0)
buckets = response["aggregations"]["months"]["buckets"]
for bucket in buckets:
    print(bucket["key_as_string"].split("T")[0], bucket["doc_count"])

1672-01-01 14
1672-02-01 8
1672-03-01 14
1672-04-01 5
1672-05-01 10
1672-06-01 2
1672-07-01 2
1672-08-01 3
1672-09-01 12
1672-10-01 15
1672-11-01 12
1672-12-01 6


In [30]:
# raet pensionaris in resolutions
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}}
        ]
    }
}

aggs = {
    "months": {
        "date_histogram": {
            "field": "metadata.session_date",
            "calendar_interval": "month"
        }
    }
}


response = rep_es.es_anno.search(index="resolutions", query=query, aggs=aggs, size=0)
buckets = response["aggregations"]["months"]["buckets"]
for bucket in buckets:
    print(bucket["key_as_string"].split("T")[0], bucket["doc_count"])

1672-01-01 200
1672-02-01 191
1672-03-01 241
1672-04-01 107
1672-05-01 292
1672-06-01 179
1672-07-01 194
1672-08-01 212
1672-09-01 220
1672-10-01 184
1672-11-01 210
1672-12-01 182


## Fuzzy Search

Use fuzzy searching with a list of keywords/phrases to find resolutions with spelling variants of those keywords/phrases.

The `fuzzy-search` package has a FuzzyPhraseSearcher, but also a FuzzyContextSearcher, that returns the matched phrase together with the surrounding context, and allows one to search for additional keywords/phrases within those match contexts.

A PhraseMatchInContext object has a `.string` property that contains the string in the text that matches the phrase. In addition, it has the properties `prefix`, `suffix` and `context`. The `prefix` contains the preceding text, the `suffix` the text following the matching string, and the `context` contains the matching string with both `prefix` and `suffix` text. The amount of preceding and following text is controlled in the `.find_matches()` function by optional the `prefix_size` and `suffix_size` arguments. 

Requirements: 
- `fuzzy-search` (python package via `pip install fuzzy-search`, but it should already be installed if you installed the required packages to use the Republic code repository).

In [81]:
from fuzzy_search.fuzzy_context_searcher import FuzzyContextSearcher
from fuzzy_search.fuzzy_phrase_model import PhraseModel

phrase = 'tot de saecken'

config = {
    'levenshtein_threshold': 0.7,
    'ngram_size': 3,
    'skip_size': 1,
    'include_variants': True
}

phrase_model = PhraseModel([phrase], config=config)
saecken_searcher = FuzzyContextSearcher(config)
saecken_searcher.index_phrase_model(phrase_model)

for res in resolutions_1672[:20]:
    for para in res.paragraphs:
        matches = saecken_searcher.find_matches(para.text)
        for match in matches:
            print(f"Phrase: {match.phrase.phrase_string: <30}\tmatch string: {match.string}")
            print(f"\t{match.context}\n")
            

Phrase: tot de saecken                	match string: tot de saken
	Is gehoort het rapport vande Heeren Schimmelpenningh, ende andere hare Ho:Mo: Gedeputeerden tot de saken vande Zee, hebbende ingevolge ende tot voldoeninge van derselver resolutie Commissoriael vanden neg

Phrase: tot de saecken                	match string: tot de saecken
	at gestelt sal werden in handen vande Heeren Schimmelpenningh ende andere haer Ho:Mo: Gedeputeerden tot de saecken vande Griffie, om te visiteren, examineren, ende daervan rapport te doen.

Phrase: tot de saecken                	match string: tot de saecken
	iael gestelt sal werden in handen vande Heeren van Ommeren ende andere hare Ho:Mo: Gede,, puteerden tot de saecken vande Griffie, omme te visiteren, examineren, ende daervan rapport te doen.

Phrase: tot de saecken                	match string: tot de saken
	issiven gestelt sullen werden in handen vande Heeren van Gent ende andere hare Ho:Mo: Gedeputeerden tot de saken vande Triple Alliancie omme

Many suffixes start with the name of persons or organisations, followed by a comma. To get an overview of which entities are mentioned after this phrase `tot de saecken`, we can do a simple count of suffixes, cut off at the first comma:

In [75]:
saecken_entity_freq = Counter()

for res in resolutions_1672:
    for para in res.paragraphs:
        matches = saecken_searcher.find_matches(para.text, suffix_size=50)
        for match in matches:
            saecken_entity = match.suffix.split(',')[0]
            saecken_entity_freq.update([saecken_entity])

for saecken_entity, freq in saecken_entity_freq.most_common(50):
    print(f"{saecken_entity: <40}{freq: >5}")

 vande Zee                                462
 vande Triple Alliantie                    73
 van Vlaenderen                            46
 vande Triple Alliancie                    44
 vande finantie                            44
 vande Meijerije van ‛s Hertogenbosch      31
 van Oostvrieslandt                        23
 vande Landen van Overmase                 22
 vande Griffie                             20
 vande Triple alliantie                    19
 van Oostvrieslant                         17
 vande Triple Alliantie om te visiteren    12
 van Denemarcken                            8
 van de Zee                                 7
 vande Triple Alliancie om te visiteren     6
 vande Westindische Compagnie               6
 van Oost                                   6
 van Vlaen„ deren                           6
 vande Triple alliancie                     6
 vande Meijerie van s'hertogenbosch         6
 ken vande Zee                              6
 vande triple Alliantie           

Next, we make second searcher to find the phrase _gestelt sullen werden_ and its singular variant _gestelt sal werden_ in the prefix of `tot de saecken`:

In [92]:
phrases = [
    {
        'phrase': 'gestelt sullen werden',
        'variants': [
            'gestelt sal werden'
        ]
    }
]

config = {
    'levenshtein_threshold': 0.7,
    'ngram_size': 3,
    'skip_size': 1,
    'include_variants': True
}


phrase_model = PhraseModel(phrases, config=config)
prefix_searcher = FuzzyContextSearcher(config)
prefix_searcher.index_phrase_model(phrase_model)

for res in resolutions_1672[:20]:
    print(res.id)
    for para in res.paragraphs:
        saecken_matches = saecken_searcher.find_matches(para.text, prefix_size=150)
        for saecken_match in saecken_matches:
            # first, get the entity mentioned in the suffix
            saecken_entity = match.suffix.split(',')[0]
            # next, search for the prefix phrase in the prefix of 'tot de saecken'
            prefix_matches = prefix_searcher.find_matches(saecken_match.prefix)
            for prefix_match in prefix_matches:
                print(f"{prefix_match.string}\n\t{prefix_match.suffix}")
                print(f"{saecken_match.string}\n\t{saecken_entity}\n")


session-1672-01-07-num-1-resolution-1
session-1672-01-07-num-1-resolution-2
gestelt sal werden
	 in handen vande Heeren Schimmelpenningh ende andere haer Ho:Mo: Gedeputeerden 
tot de saecken
	 vande Triple Alliantie

gestelt sal werden
	 in handen vande Heeren Schimmelpenningh ende andere haer Ho:Mo: Gedeputeerden 
tot de saecken
	 vande Triple Alliantie

session-1672-01-07-num-1-resolution-3
gestelt sal werden
	 in handen vande Heeren van Ommeren ende andere hare Ho:Mo: Gede,, puteerden 
tot de saecken
	 vande Triple Alliantie

gestelt sal werden
	 in handen vande Heeren van Ommeren ende andere hare Ho:Mo: Gede,, puteerden 
tot de saecken
	 vande Triple Alliantie

session-1672-01-07-num-1-resolution-4
gestelt sullen werden
	 in handen vande Heeren van Gent ende andere hare Ho:Mo: Gedeputeerden 
tot de saken
	 vande Triple Alliantie

session-1672-01-07-num-1-resolution-5
session-1672-01-07-num-1-resolution-6
gestelt sal werden
	 in handen vande Heeren van Gent, ende andere hare Ho:Mo: 