## Experimentation Notebook

This notebook is part of the sandbox and is intended to experiment and play around with the REPUBLIC elasticsearch functionalities.

In [1]:
# This is needed to add the repo dir to the path so jupyter
# can load the republic modules directly from the notebooks
import os
import sys
repo_name = 'republic-project'
repo_dir = os.path.split(os.getcwd())[0].split(repo_name)[0] + repo_name
print(repo_dir)
if repo_dir not in sys.path:
    sys.path.append(repo_dir)



/Users/marijnkoolen/Code/Huygens/republic-project


## Initialise Republic Elasticsearch Instance

This creates a RepublicElasticsearch object that contains an elasticsearch instance for the Republic CAF indexes, as well as a range of retrieval functions.

Check the [README](https://github.com/HuygensING/republic-project#readme) for configuration details that should be placed in `settings.py`.

In [2]:
from republic.elastic.republic_elasticsearch import initialize_es

rep_es = initialize_es(host_type='external', timeout=60)


## Keyword in Context

A simple way to start exploring is with the `keyword_in_context` function. It takes words or phrases as input and shows a number of hits with surrounding context in the resolutions.

The `keyword_in_context` function returns `hit`s, which are dictionaries with the search `term`, the `pre` and `post` contextual words, a formatted `context`, as well as the `para_id` and `resolution_id` and `resolution_offset` and `para_offset`.

In [3]:
# use single word or multi-word phrase
for hit in rep_es.keyword_in_context("consenteerende", num_hits=20, context_size=10):
    print(hit["context"])


extraordinaris byslagh van vyfiien duysent guldens, daar in mits deesen consenteerende, en voor de overige twee duysent guldens daar in consenteerende
Resolutie van de Heeren Staaten van hooghgemelde Provincie haare Principaalen, consenteerende in de helfte van de Petitie van een millioen guldens
Resolutie van de Heeren Staaten van honghgemelde Provintie haare Principaalsen, consenteerende in het concept Placaat tot billioeneeringe van quaade Stuyvers, ende
Resolutie van de Heeren Staaten van hooghgemelde Provintie sijne Principaalen, consenteerende in het secours van twintigch Scheenhen van oorlogh aan sijne
Resolutie van de Heeren Staaten van hooghgemelde Provincie sijne Principalen, consenteerende in het Tractaat met de Regeeringe van Tripoli; volgende de
van de Heeren Staat en van hooghgemelde Provincie sijne Principaalen, consenteerende in de negotiatie van twee honderd duysend guldens voor en
Resolutie van de Heeren Staaten van hooghgemelde Provincie sijne Principalen, consenteer

The `keyword_in_context` function also has several optional arguments to control the size of the context window (`context_size`, default is 3 words before and after), the number of hits (`num_hits`, default is 10) and query filters to constrain the search space (`filters`, which are added to the query).

**Note**: the `num_hits` argument controls the number of _resolutions_ that are retrieved. Within a resolution, the search keyword may appear multiple times. A context is created for each occurrence of the search keyword, so the number returned of contexts can be (and typically is) higher than the number of hits.

In [4]:
# use context_size to get fewer or more surrounding words as context
for hit in rep_es.keyword_in_context("secreete", context_size=5):
    print(hit["context"])

                          gedaan op haar Hoogh Mogende secreete commissoriaale Resolutie van den vierden
                         ontfangen, en gemeld in sijne secreete Mislive van den seven en
                       besoignes by haar Hoogh Mogende secreete Resolutie van den vierden December
                        voldoeninge van haar Hoog Mog. secreete Resolutie en Aanschryving van den
                             in haar Hoog Mog. gemelde secreete - Resolutie van den 12 deeser
                     nakoominge van haar Hoogh Mogende secreete Resolutie en Aanschryvens van den
                          vervat in haar Hoogh Mogende secreete Resolutie van den sestienden deeser
                       ontfangst van haar Hoog Mogende secreete Resolutie van den dertigsten der
                             in gevolge haar Hoog Mog. secreete Resolutie van den 25 deezer
                        kraghte van haar Hoogh Mogende secreete Resolutie van den vier en
               Suppliantes Man waaren 

In [5]:
for hit in rep_es.keyword_in_context("periculeuse", context_size=5):
    # First, show paragraph id (which contains session date)
    print(hit["resolution_id"])
    # Second, show the keyword in context
    print(hit["context"])
    # Finally, add newline for readability
    print()


session-1709-02-06-num-1-resolution-10
                 tot Scheveninge, dolerende over sijne periculeuse ende kostelijke reyse, met een

session-1709-02-06-num-1-resolution-10
                      Stuyrman Jochem Joppe, voor sijn periculeuse reyse ende schade aen de

session-1720-01-20-num-1-resolution-4
                            douceur voor de sware ende periculeuse reyse die hy gedaan heeft

session-1733-07-11-num-1-resolution-11
                          door niet alleen buyten alle periculeuse gevaaren werden gestelt, maar oock

session-1737-12-27-num-1-resolution-16
                                van door een swaare en periculeuse sieckte te worden overvallen, in

session-1672-11-21-num-1-resolution-25
                       Schoonhoven aen eene sware ende periculeuse sieckte het beddehoudende. om hem

session-1711-09-30-num-1-resolution-9
                      werden, om sulcke moeyelijcke en periculeuse reyse te doen of te

session-1707-11-17-num-1-resolution-14
          

In [6]:
# use num_hits to get fewer or more results
for hit in rep_es.keyword_in_context("voornoemde Procureur", context_size=5, num_hits=20):
    print(hit["resolution_id"])
    print(hit["context"])
    print()


session-1779-12-06-num-1-resolution-15
                              andere zyde; waar by den voornoemde Procureur van Kervel versoekt obedientie, en

session-1779-12-06-num-1-resolution-15
                  op condemnatie; en consenteerende de voornoemde Procureur Alsche in de versogte condemnatie

session-1780-07-18-num-1-resolution-8
                         exceptie en defensie; waar op voornoemde Procureur van Alphen wyders versoekt condemnatie

session-1780-10-13-num-1-resolution-7
                               andere zyde; waar by de voornoemde Procureur van Alphen versogt obedientie: en

session-1780-10-13-num-1-resolution-7
                    wyders versogt condemnatie, en den voornoemde Procureur de Byo consenteerde in de

session-1779-07-23-num-1-resolution-11
                              andere zyde; waar by den voornoemde Procureur van Son versogt obedientie, en

session-1779-07-23-num-1-resolution-11
           Son versogt condemnatie. Consenteerende den voornoemde Pro

In [7]:
# use filters to contrain the search space:
# selecting resolutions by year
filters = [
    {"match": {"metadata.session_year": 1672}}
]

for hit in rep_es.keyword_in_context("periculeuse", filters=filters):
    print(hit["para_id"])
    print(hit["context"])

session-1672-11-21-num-1-para-56
                   eene sware ende periculeuse sieckte het beddehoudende
session-1672-02-18-num-1-para-10
           doch voornamentlijck in periculeuse tijden in goede
session-1672-01-04-num-1-para-63
            doch voornamentlick in periculeuse tijden, in goede
session-1672-02-24-num-1-para-13
            doch voornamentlick in periculeuse tijden in goede


In [8]:
# use filters to contrain the search space:
# selecting resolutions by date range
filters = [
    {"range": {"metadata.session_date": {"gte": "1672-04-01", "lte": "1672-08-01"}}}
]

for hit in rep_es.keyword_in_context("Vloot", filters=filters):
    print(hit["para_id"], '\n')
    print(hit["context"], '\n')

session-1672-05-31-num-1-para-92 

                      sigh naer de Vloot te vervoegen: Waerop 

session-1672-06-16-num-1-para-24 

                 dat de Smirnasche Vloot door d'Engelschen 

session-1672-04-02-num-1-para-50 

                       Ho:Mo: inde vloot vanden Staat op 

session-1672-05-24-num-1-para-20 

           gedaen, opde Smirnasche Vloot is gear„ resteert 

session-1672-07-27-num-1-para-83 

                   ont„ houden van Vloot der Vijanden van 

session-1672-05-31-num-1-para-36 

                 dat de Smirnasche Vloot, door d'Engelschen 

session-1672-04-06-num-1-para-74 

  gevolmachtichde opde voorschreve vloot commanderen sal den 



In [9]:
filters = [
    {"match": {"metadata.session_year": 1672}}
]

# using a larger context size
for hit in rep_es.keyword_in_context("Vlooten", filters=filters, context_size=20):
    print(hit["resolution_id"], hit["resolution_offset"], '\n')
    print(hit["context"], '\n')


session-1672-05-31-num-1-resolution-1 0 

Ontfangen een missive vanden Heer Cornelis de Witt, hare Ho:Mo: Gedepden. ende Gevolmachtichde op 's Lants Vlooten in de jegenwoordige expeditie ter Zee, Jehan ‛s Lants Schip de seven Provincien, laverende voor Walcheren, Brugge & Oost van haer 

session-1672-09-01-num-1-resolution-1 583 

advertentie ten spoedichsten kennisse sal werden gegeven aenden Lieutenant Admirael de Ruijter om daerop behoorlicke reflexie te nemen, de Vijantlicke vlooten te doen observeren, ingevolge van hare Ho:Mo: resolutie vanden seven„ thienden Augusti laestleden, de desseijnen vande Vijanden vanden Staet 

session-1672-09-01-num-1-resolution-1 1366 

Welderen, ende Lieutenant Admirael de Ruijter sal werden, aengeschreven, dat deselve haer soo veel mogelick op de voor„ schreve Vijantlicke Vlooten sullen informeren, haer Ho:Mo: sonder eenich tijt versuijm, adverteren vande condtschappen die haer vande voornoemde Vijantlicke Vlooten souden mogen 

session-1672-07-15-n

## Retrieving Resolutions

The `rep_es` object has a range of functions to retrieve `resolution` objects.

You can find all available properties and methods of `resolution` objects in `republic_document_model.py`: i.e. in the
[Resolution](https://github.com/HuygensING/republic-project/blob/bb4cdad7b4cb9fb71378d0dde000fe7725ceb45e/republic/model/republic_document_model.py#L392) class, which inherits several properties and methods from the [ResolutionElementDoc](https://github.com/HuygensING/republic-project/blob/bb4cdad7b4cb9fb71378d0dde000fe7725ceb45e/republic/model/republic_document_model.py#L158)

In [11]:
resolutions = rep_es.retrieve_resolutions_by_session_date("1672-02-12")
for res in resolutions:
    print(res.session_date.isoformat(), res.id)

1672-02-12 session-1672-02-12-num-1-attendance_list
1672-02-12 session-1672-02-12-num-1-resolution-1
1672-02-12 session-1672-02-12-num-1-resolution-2
1672-02-12 session-1672-02-12-num-1-resolution-3
1672-02-12 session-1672-02-12-num-1-resolution-4
1672-02-12 session-1672-02-12-num-1-resolution-5
1672-02-12 session-1672-02-12-num-1-resolution-6
1672-02-12 session-1672-02-12-num-1-resolution-7
1672-02-12 session-1672-02-12-num-1-resolution-8
1672-02-12 session-1672-02-12-num-1-resolution-9
1672-02-12 session-1672-02-12-num-1-resolution-10
1672-02-12 session-1672-02-12-num-1-resolution-11
1672-02-12 session-1672-02-12-num-1-resolution-12
1672-02-12 session-1672-02-12-num-1-resolution-13
1672-02-12 session-1672-02-12-num-1-resolution-14
1672-02-12 session-1672-02-12-num-1-resolution-15


In [12]:
output_file = "resoluties_rampjaar.csv"

with open(output_file, 'wt') as fh:
    headers = ['resolution_id', 'date', 'paragraph_id', 'text', 'iiif_url']
    fh.write('\t'.join(headers) + '\n')
    for res in resolutions:
        for para in res.paragraphs:
            if isinstance(para.metadata['iiif_url'], list):
                url = ', '.join(para.metadata['iiif_url'])
            else:
                url = para.metadata['iiif_url']
            row = [res.id, res.metadata['session_date'], para.id, para.text, url]
            row = [cell if cell is not None else '' for cell in row]
            print(row)
            fh.write('\t'.join(row) + '\n')
        

['session-1672-02-12-num-1-attendance_list', '1672-02-12', 'session-1672-02-12-num-1-para-1', 'Veneris den 12. Februarij 1672', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3285/NL-HaNA_1.01.02_3285_0265.jpg/2609,337,2446,3576/full/0/default.jpg']
['session-1672-02-12-num-1-attendance_list', '1672-02-12', 'session-1672-02-12-num-1-para-2', 'Preside den Heere van Coeverden', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3285/NL-HaNA_1.01.02_3285_0265.jpg/2609,337,2446,3576/full/0/default.jpg']
['session-1672-02-12-num-1-attendance_list', '1672-02-12', 'session-1672-02-12-num-1-para-3', 'Præsentibus de Heeren van Gendt, Gellicum, Brakell, Vijgh, Ripperda tot Buirse Schimmelpenningh, Ommeren Werckendam, Goeree, Meerens Odijck, Reijgersbergh, Crommon, Vrijbergen, Mauregnault,', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3285/NL-HaNA_1.01.02_3285_0265.jpg/2609,337,2446,3576/full/0/default.jpg']
['session-1672-02-12-num-1-attendance_list', '1672-02-12', 'session-1672-02-12-nu

In [16]:
keyword = "oorlogh"

query = {"query": {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},                  # only resolutions, no attendance lists
            {"match": {"metadata.session_year": 1672}},         # only resolutions from 1672
            {"match": {"paragraphs.text": keyword}}, # only resolutions containing 'raet pensionaris'
        ]
    }
}}
resolutions = rep_es.retrieve_resolutions_by_query(query)
output_file = f"resoluties_rampjaar_{keyword.replace(' ','_')}.csv"
with open(output_file, 'wt') as fh:
    headers = ['resolution_id', 'date', 'paragraph_id', 'text', 'iiif_url']
    fh.write('\t'.join(headers) + '\n')
    for res in resolutions:
        for para in res.paragraphs:
            if isinstance(para.metadata['iiif_url'], list):
                url = ', '.join(para.metadata['iiif_url'])
            else:
                url = para.metadata['iiif_url']
            row = [res.id, res.metadata['session_date'], para.id, para.text, url]
            row = [cell if cell is not None else '' for cell in row]
            print(row)
            fh.write('\t'.join(row) + '\n')

['session-1672-04-01-num-1-resolution-4', '1672-04-01', 'session-1672-04-01-num-1-para-11', 'Ontfangen een Missive van het Collegie ter admiraliteijt tot amsterdam, geschreven aldaer den eenendertichsten Martij lest„ leden, houdende, dat het gemelte Collegie vermits de jegenwoordige ongelegentheijt met den Coningh van Groot Brittannien geerne soude verstaen haer Ho:Mo: intentie ontrent het affsenden van hetschip van Oorlogh voor desen gedestineert tot transport vanden heer van Strevels„ hoeck, haer ho:Mo: gedesigneerde resident aen het hoff van Spamen, uijt dese Landen naer Spaignen, sonderlingh alsoo het voorschreve Schip van oorlogh niet sonder pericul ende ondienst vanden Lande soo verte vande handt gesonden conde werden, ende andersints was gedesigneert onder de schepen die', 'https://images.diginfra.net/iiif/NL-HaNA_1.01.02/3285/NL-HaNA_1.01.02_3285_0544.jpg/3236,-8,1830,3874/full/0/default.jpg']
['session-1672-04-01-num-1-resolution-4', '1672-04-01', 'session-1672-04-01-num-1-par

### The Anatomy of a Resolution

Resolutions in the index consist of `metadata` and `paragraphs`.

In [12]:
import json

res = resolutions[0]
# Each resolution has metadata
print(json.dumps(res.metadata, indent=4))

{
    "inventory_num": 3285,
    "source_id": "session-1672-02-12-num-1",
    "type": "resolution",
    "id": "session-1672-02-12-num-1-attendance_list",
    "session_date": "1672-02-12",
    "session_id": "session-1672-02-12-num-1",
    "session_num": 1,
    "president": null,
    "session_year": 1672,
    "session_month": 2,
    "session_day": 12,
    "session_weekday": "Veneris",
    "text_page_num": [],
    "index_timestamp": "2022-02-03T09:14:42.796086",
    "proposition_type": null,
    "proposer": null,
    "decision": null,
    "resolution_type": "ordinaris"
}


In [13]:
# You can dump all resolution data to JSON
res.json

{'id': 'session-1672-02-12-num-1-attendance_list',
 'type': ['republic_doc',
  'resolution_element',
  'resolution',
  'attendance_list'],
 'metadata': {'inventory_num': 3285,
  'source_id': 'session-1672-02-12-num-1',
  'type': 'resolution',
  'id': 'session-1672-02-12-num-1-attendance_list',
  'session_date': '1672-02-12',
  'session_id': 'session-1672-02-12-num-1',
  'session_num': 1,
  'president': None,
  'session_year': 1672,
  'session_month': 2,
  'session_day': 12,
  'session_weekday': 'Veneris',
  'text_page_num': [],
  'index_timestamp': '2022-02-03T09:14:42.796086',
  'proposition_type': None,
  'proposer': None,
  'decision': None,
  'resolution_type': 'ordinaris'},
 'evidence': [],
 'stats': {'lines': 16, 'words': 71, 'text_regions': 0, 'paragraphs': 5},
 'paragraphs': [{'id': 'session-1672-02-12-num-1-para-1',
   'type': ['republic_doc', 'resolution_paragraph', 'republic_paragraph'],
   'metadata': {'inventory_num': 3285,
    'source_id': 'session-1672-02-12-num-1',
    

## Using Elasticsearch Queries

See the [Elasticsearch Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) for details on how to construct different types of queries.

In [16]:
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},                  # only resolutions, no attendance lists
            {"match": {"metadata.session_year": 1672}},         # only resolutions from 1672
            {"match": {"paragraphs.text": "raet pensionaris"}}, # only resolutions containing 'raet pensionaris'
        ]
    }
}

resolutions = rep_es.retrieve_resolutions_by_query(query)

for res in resolutions:
    print(res.id)
    for para in res.paragraphs:
        print(f"\t{para.text}\n")
    print('--------------------\n')

session-1672-07-25-num-1-resolution-14
	Ontfangen een missive vanden pensionaris Pesters, geschreven tot Maestricht den 23en. deses, houdende advertentie, ende onder anderen rakende de contri„ butie bij de franschen gevordent wer„ ,dende inde Landen van Overmase, Waerop gedelibereert zijnde, Is goetgevonden ende verstaen, dat de voors missive gestelt sal werden in handen vande heeren van Brakel ende andere haer Ho:Mo: Gedepu„ teerden tot de saken vande Landen van Overmaze, met ende nevens eenige Heeren Gecommitteerden uijt den Raet van State bij haer E. selffs te nomineren, om te visi„ teren, examineren, ende daer van rapport te doen

--------------------

session-1672-11-15-num-1-resolution-5
	Ontfangen een missive van Alleij Aga, geschreven tot Amsterdam den twaelffden deses, houdende advertentie, dat hij uijt Turckijen was gesonden voor Ambassadeur vanden Grootenheer aenden Coningh van Sweeden dat hij oock een recommandatie Brieff aen haer Ho:Mo: om hem be,, hulpsaem te sijn int gee

In [17]:
# import Counter to do some simple word counting and frequency comparison
from collections import Counter
import re


In [12]:
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}}
        ]
    }
}

resolutions_1672 = rep_es.retrieve_resolutions_by_query(query, size=10000)

all_word_freq = Counter()

for res in resolutions_1672:
    for para in res.paragraphs:
        all_word_freq.update([word for word in re.split(r"\W+", para.text) if word != ''])

for word, freq in all_word_freq.most_common(10):
    print(f"{word: <20}{freq: >6}")

ende                 26333
van                  23211
de                   20743
te                   14955
dat                  11138
den                   9356
haer                  8467
vande                 8418
tot                   8380
in                    8012


In [13]:
Counter([res.metadata['proposition_type'] for res in resolutions_1672])

Counter({None: 460,
         'missive': 1617,
         'requeste': 289,
         'rekening': 1,
         'memorie': 36,
         'rapport': 2,
         'declaratie': 7})

In [20]:
res_missives = [res for res in resolutions_1672 if res.metadata['proposition_type'] == 'missive']

len(res_missives)

1617

In [11]:
from collections import Counter
import re

query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}},
            {"match": {"paragraphs.text": "raet pensionaris"}}
        ]
    }
}

resolutions = rep_es.retrieve_resolutions_by_query(query)


word_freq = Counter()

for res in resolutions:
    for para in res.paragraphs:
        word_freq.update([word for word in re.split(r"\W+", para.text) if word != ''])

rel_freq = {}
min_freq = 3
for word, freq in word_freq.most_common():
    if freq < min_freq:
        continue
    rel_freq[word] = freq / all_word_freq[word]
    
for word in sorted(rel_freq, key = lambda w: rel_freq[w], reverse=True):
    print(f"{word: <20}{rel_freq[word]: >6.4f}{word_freq[word]: >6}{all_word_freq[word]: >8}")

NameError: name 'all_word_freq' is not defined

### Resolutions in JSON Format

Resolution objects have a `.json` property to get a JSON representation of the resolution, including metadata, paragraph text and basic statistics. This can be a convenient format for storing and later retrieving them from disk (faster than getting them from the Republic CAF server).

You can also turn them into plain text representations if you want to do extensive text analysis.

In [22]:
import json
import gzip

resolutions_file = "../../data/resolutions/rampjaar-ordinaris-resolutions.json.gz"

# open a file for storing the JSON representation of resolutions
with gzip.open(resolutions_file, 'wt') as fh:
    # iterate over the resolutions and dump their JSON representations to file
    json.dump([res.json for res in resolutions_1672], fh)

In [23]:
import json
import gzip

import republic.model.republic_document_model as rdm


resolutions_file = "../../data/resolutions/rampjaar-ordinaris-resolutions.json.gz"

# Reading the JSON representations from file again and turning 
# them into Resolution objects again
with gzip.open(resolutions_file, 'rt') as fh:
    # the document model has a convenience function to turn a JSON representation
    # to a Resolution object: json_to_republic_resolution
    resolutions_1672 = [rdm.json_to_republic_resolution(res) for res in json.load(fh)]
    

In [24]:
# Creating plain text representations of resolutions by concatenating paragraph texts
for res in resolutions_1672:
    res_text = '\n'.join([para.text for para in res.paragraphs])
    print(res_text)
    break

Is gehoort het rapport vande Heeren Schimmelpenningh, ende andere hare Ho:Mo: Gedeputeerden tot de saken vande Zee, hebbende ingevolge ende tot voldoeninge van derselver resolutie Commissoriael vanden negenentwin„ tichsten December laestleden, gevi, siteert ende geexamineert de Requeste van David Centsen, Consul vande Nederlantsche natie tot Rochelle, versoeckende door hare Ho:Mo: met eene somme van penningen te mogen werden gesubvenieert, ten aensien vande oncosten bij hem gesupporteert in een continueel vervolgh van ontrent acht maenden, om expeditie, en het obtineren van eene resolutie op de Consulaetrechten aldaer, ende een daghgelt aen hem Suppliant als Con„ sul toe te leggen: Waerop gedelibereert sijnde, Is goetgevonden ende verstaen, mits desen te versoec„ ken de Heeren Gedeputeerden vande Provincie van Hollandt ende West,, vrieslandt, dat haer E. haer hoe eerder soo liever willen verclaren op het rapport vande gemelte Heeren Schimmel„ penningh ende andere hare Ho:Mo: Gedeputeer

## Retrieving Aggregate Statistics

You can also directly query the indexes using the elasticsearch instance inside the `rep_es` object, which is stored in the `es_anno` property (so can be addressed via `rep_es.es_anno`).

Below is an example of a query and an aggregation to get the number of resolutions per month in the year 1672:

In [25]:
# raet pensionaris in resolutions
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}},
            {"match": {"paragraphs.text": "raet pensionaris"}}
        ]
    }
}

aggs = {
    "months": {
        "date_histogram": {
            "field": "metadata.session_date",
            "calendar_interval": "month"
        }
    }
}


response = rep_es.es_anno.search(index="resolutions", query=query, aggs=aggs, size=0)
buckets = response["aggregations"]["months"]["buckets"]
for bucket in buckets:
    print(bucket["key_as_string"].split("T")[0], bucket["doc_count"])

1672-01-01 64
1672-02-01 18
1672-03-01 71
1672-04-01 9
1672-05-01 83
1672-06-01 45
1672-07-01 63
1672-08-01 56
1672-09-01 47
1672-10-01 59
1672-11-01 48
1672-12-01 27


In [26]:
# raet pensionaris in attendance lists
query = {
    "bool": {
        "must": [
            {"match": {"type": "attendance_list"}},
            {"match": {"metadata.session_year": 1672}},
            {"match": {"paragraphs.text": "raet pensionaris"}}
        ]
    }
}

response = rep_es.es_anno.search(index="resolutions", query=query, aggs=aggs, size=0)
buckets = response["aggregations"]["months"]["buckets"]
for bucket in buckets:
    print(bucket["key_as_string"].split("T")[0], bucket["doc_count"])

1672-01-01 14
1672-02-01 8
1672-03-01 14
1672-04-01 5
1672-05-01 10
1672-06-01 2
1672-07-01 2
1672-08-01 3
1672-09-01 12
1672-10-01 15
1672-11-01 12
1672-12-01 6


In [30]:
# raet pensionaris in resolutions
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}}
        ]
    }
}

aggs = {
    "months": {
        "date_histogram": {
            "field": "metadata.session_date",
            "calendar_interval": "month"
        }
    }
}


response = rep_es.es_anno.search(index="resolutions", query=query, aggs=aggs, size=0)
buckets = response["aggregations"]["months"]["buckets"]
for bucket in buckets:
    print(bucket["key_as_string"].split("T")[0], bucket["doc_count"])

1672-01-01 200
1672-02-01 191
1672-03-01 241
1672-04-01 107
1672-05-01 292
1672-06-01 179
1672-07-01 194
1672-08-01 212
1672-09-01 220
1672-10-01 184
1672-11-01 210
1672-12-01 182


## Fuzzy Search

Use fuzzy searching with a list of keywords/phrases to find resolutions with spelling variants of those keywords/phrases.

The `fuzzy-search` package has a FuzzyPhraseSearcher, but also a FuzzyContextSearcher, that returns the matched phrase together with the surrounding context, and allows one to search for additional keywords/phrases within those match contexts.

A PhraseMatchInContext object has a `.string` property that contains the string in the text that matches the phrase. In addition, it has the properties `prefix`, `suffix` and `context`. The `prefix` contains the preceding text, the `suffix` the text following the matching string, and the `context` contains the matching string with both `prefix` and `suffix` text. The amount of preceding and following text is controlled in the `.find_matches()` function by optional the `prefix_size` and `suffix_size` arguments. 

Requirements: 
- `fuzzy-search` (python package via `pip install fuzzy-search`, but it should already be installed if you installed the required packages to use the Republic code repository).

In [81]:
from fuzzy_search.fuzzy_context_searcher import FuzzyContextSearcher
from fuzzy_search.fuzzy_phrase_model import PhraseModel

phrase = 'tot de saecken'

config = {
    'levenshtein_threshold': 0.7,
    'ngram_size': 3,
    'skip_size': 1,
    'include_variants': True
}

phrase_model = PhraseModel([phrase], config=config)
saecken_searcher = FuzzyContextSearcher(config)
saecken_searcher.index_phrase_model(phrase_model)

for res in resolutions_1672[:20]:
    for para in res.paragraphs:
        matches = saecken_searcher.find_matches(para.text)
        for match in matches:
            print(f"Phrase: {match.phrase.phrase_string: <30}\tmatch string: {match.string}")
            print(f"\t{match.context}\n")
            

Phrase: tot de saecken                	match string: tot de saken
	Is gehoort het rapport vande Heeren Schimmelpenningh, ende andere hare Ho:Mo: Gedeputeerden tot de saken vande Zee, hebbende ingevolge ende tot voldoeninge van derselver resolutie Commissoriael vanden neg

Phrase: tot de saecken                	match string: tot de saecken
	at gestelt sal werden in handen vande Heeren Schimmelpenningh ende andere haer Ho:Mo: Gedeputeerden tot de saecken vande Griffie, om te visiteren, examineren, ende daervan rapport te doen.

Phrase: tot de saecken                	match string: tot de saecken
	iael gestelt sal werden in handen vande Heeren van Ommeren ende andere hare Ho:Mo: Gede,, puteerden tot de saecken vande Griffie, omme te visiteren, examineren, ende daervan rapport te doen.

Phrase: tot de saecken                	match string: tot de saken
	issiven gestelt sullen werden in handen vande Heeren van Gent ende andere hare Ho:Mo: Gedeputeerden tot de saken vande Triple Alliancie omme

Many suffixes start with the name of persons or organisations, followed by a comma. To get an overview of which entities are mentioned after this phrase `tot de saecken`, we can do a simple count of suffixes, cut off at the first comma:

In [75]:
saecken_entity_freq = Counter()

for res in resolutions_1672:
    for para in res.paragraphs:
        matches = saecken_searcher.find_matches(para.text, suffix_size=50)
        for match in matches:
            saecken_entity = match.suffix.split(',')[0]
            saecken_entity_freq.update([saecken_entity])

for saecken_entity, freq in saecken_entity_freq.most_common(50):
    print(f"{saecken_entity: <40}{freq: >5}")

 vande Zee                                462
 vande Triple Alliantie                    73
 van Vlaenderen                            46
 vande Triple Alliancie                    44
 vande finantie                            44
 vande Meijerije van ‛s Hertogenbosch      31
 van Oostvrieslandt                        23
 vande Landen van Overmase                 22
 vande Griffie                             20
 vande Triple alliantie                    19
 van Oostvrieslant                         17
 vande Triple Alliantie om te visiteren    12
 van Denemarcken                            8
 van de Zee                                 7
 vande Triple Alliancie om te visiteren     6
 vande Westindische Compagnie               6
 van Oost                                   6
 van Vlaen„ deren                           6
 vande Triple alliancie                     6
 vande Meijerie van s'hertogenbosch         6
 ken vande Zee                              6
 vande triple Alliantie           

Next, we make second searcher to find the phrase _gestelt sullen werden_ and its singular variant _gestelt sal werden_ in the prefix of `tot de saecken`:

In [92]:
phrases = [
    {
        'phrase': 'gestelt sullen werden',
        'variants': [
            'gestelt sal werden'
        ]
    }
]

config = {
    'levenshtein_threshold': 0.7,
    'ngram_size': 3,
    'skip_size': 1,
    'include_variants': True
}


phrase_model = PhraseModel(phrases, config=config)
prefix_searcher = FuzzyContextSearcher(config)
prefix_searcher.index_phrase_model(phrase_model)

for res in resolutions_1672[:20]:
    print(res.id)
    for para in res.paragraphs:
        saecken_matches = saecken_searcher.find_matches(para.text, prefix_size=150)
        for saecken_match in saecken_matches:
            # first, get the entity mentioned in the suffix
            saecken_entity = match.suffix.split(',')[0]
            # next, search for the prefix phrase in the prefix of 'tot de saecken'
            prefix_matches = prefix_searcher.find_matches(saecken_match.prefix)
            for prefix_match in prefix_matches:
                print(f"{prefix_match.string}\n\t{prefix_match.suffix}")
                print(f"{saecken_match.string}\n\t{saecken_entity}\n")


session-1672-01-07-num-1-resolution-1
session-1672-01-07-num-1-resolution-2
gestelt sal werden
	 in handen vande Heeren Schimmelpenningh ende andere haer Ho:Mo: Gedeputeerden 
tot de saecken
	 vande Triple Alliantie

gestelt sal werden
	 in handen vande Heeren Schimmelpenningh ende andere haer Ho:Mo: Gedeputeerden 
tot de saecken
	 vande Triple Alliantie

session-1672-01-07-num-1-resolution-3
gestelt sal werden
	 in handen vande Heeren van Ommeren ende andere hare Ho:Mo: Gede,, puteerden 
tot de saecken
	 vande Triple Alliantie

gestelt sal werden
	 in handen vande Heeren van Ommeren ende andere hare Ho:Mo: Gede,, puteerden 
tot de saecken
	 vande Triple Alliantie

session-1672-01-07-num-1-resolution-4
gestelt sullen werden
	 in handen vande Heeren van Gent ende andere hare Ho:Mo: Gedeputeerden 
tot de saken
	 vande Triple Alliantie

session-1672-01-07-num-1-resolution-5
session-1672-01-07-num-1-resolution-6
gestelt sal werden
	 in handen vande Heeren van Gent, ende andere hare Ho:Mo: 