## Experimentation Notebook

This notebook is part of the sandbox and is intended for experimenting with the REPUBLIC Elasticsearch functionality.

In [1]:
# This is needed to add the repo dir to the path so jupyter
# can load the republic modules directly from the notebooks
import os
import sys
repo_name = 'republic-project'
repo_dir = os.path.split(os.getcwd())[0].split(repo_name)[0] + repo_name
print(repo_dir)
if repo_dir not in sys.path:
    sys.path.append(repo_dir)



/Users/marijnkoolen/Code/Huygens/republic-project


## Initialise Republic Elasticsearch Instance

This creates a RepublicElasticsearch object that contains an elasticsearch instance for the Republic CAF indexes, as well as a range of retrieval functions.

Check the [README](https://github.com/HuygensING/republic-project#readme) for configuration details that should be placed in `settings.py`.

In [2]:
from republic.elastic.republic_elasticsearch import initialize_es

rep_es = initialize_es(host_type='external', timeout=60)


## Keyword in Context

A simple way to start exploring is with the `keyword_in_context` function. It takes words or phrases as input and shows a number of hits with surrounding context in the resolutions.

The `keyword_in_context` function returns `hit`s, which are dictionaries containing the search `term`, the `pre` and `post` contextual words, a formatted `context`, the identifiers `para_id` and `resolution_id`, and the offsets `para_offset` and `resolution_offset`.
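To make the shape of a `hit` concrete, here is a minimal, self-contained sketch of keyword-in-context extraction over a plain string. It is not the implementation used by `rep_es`: the function name `kwic`, the `offset` key and the sample text are assumptions for illustration, and the real function also pads the context for alignment and adds the identifiers listed above.

```python
import re


def kwic(text, term, context_size=3):
    """Yield one hit dictionary per occurrence of `term` in `text`.

    Illustrative sketch only, not the republic-project implementation.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b")
    for match in pattern.finditer(text):
        # take the last `context_size` words before the match ...
        pre = " ".join(text[:match.start()].split()[-context_size:])
        # ... and the first `context_size` words after it
        post = " ".join(text[match.end():].split()[:context_size])
        yield {
            "term": term,
            "pre": pre,
            "post": post,
            "context": f"{pre} {match.group(0)} {post}".strip(),
            "offset": match.start(),  # hypothetical key: character offset in `text`
        }


# Sample text assembled by hand from fragments of the output shown in this notebook
sample_text = (
    "waar by den voornoemde Procureur van Kervel versoekt obedientie, "
    "en de voornoemde Procureur Alsche in de versogte condemnatie"
)

sample_hits = list(kwic(sample_text, "voornoemde Procureur"))
for hit in sample_hits:
    print(hit["context"])
```

Each occurrence of the term produces its own hit, so a single text can yield several contexts.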

In [3]:
# use single word or multi-word phrase
for hit in rep_es.keyword_in_context("voornoemde Procureur"):
    print(hit["context"])


                       waar by den voornoemde Procureur van Kervel versoekt
              en consenteerende de voornoemde Procureur Alsche in de
                 defensie; waar op voornoemde Procureur van Alphen wyders
                        waar by de voornoemde Procureur van Alphen versogt
               condemnatie, en den voornoemde Procureur de Byo consenteerde
                       waar by den voornoemde Procureur van Son versogt
   condemnatie. Consenteerende den voornoemde Procureur van Kervel in
         en defensie ; versoekende voornoemde Procureur van Son wyders
                       by welke de voornoemde Procureur van Son, alvoorens
                        waar op de voornoemde Procureur van Son condemnatie
                       waar by den voornoemde Procureur van Son versogt
                      wyse door de voornoemde Procureur van Son soude
             qualificeeren, om aan voornoemde Procureur van Son, of
               te authoriseeren de voornoemde Procureur 

The `keyword_in_context` function also has several optional arguments to control the size of the context window (`context_size`, default is 3 words before and after), the number of hits (`num_hits`, default is 10) and query filters to constrain the search space (`filters`, which are added to the query).

**Note**: the `num_hits` argument controls the number of _resolutions_ that are retrieved. Within a resolution, the search keyword may appear multiple times. A context is created for each occurrence of the search keyword, so the number of returned contexts can be (and typically is) higher than `num_hits`.
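One way to see how contexts relate to resolutions is to group the hits by their `resolution_id`. The sketch below does this for a small list of hand-written hit dictionaries; the list `sample_hits` and the helper `group_contexts_by_resolution` are assumptions for illustration, with only the `resolution_id` and `context` keys filled in.

```python
from collections import defaultdict

# Hypothetical hits, shaped like the dictionaries that keyword_in_context returns
sample_hits = [
    {"resolution_id": "session-1779-12-06-num-1-resolution-15",
     "context": "waar by den voornoemde Procureur van Kervel versoekt"},
    {"resolution_id": "session-1779-12-06-num-1-resolution-15",
     "context": "en consenteerende de voornoemde Procureur Alsche in de"},
    {"resolution_id": "session-1780-07-18-num-1-resolution-8",
     "context": "defensie; waar op voornoemde Procureur van Alphen wyders"},
]


def group_contexts_by_resolution(hits):
    """Map each resolution id to the list of contexts found in that resolution."""
    grouped = defaultdict(list)
    for hit in hits:
        grouped[hit["resolution_id"]].append(hit["context"])
    return dict(grouped)


grouped = group_contexts_by_resolution(sample_hits)
for resolution_id, contexts in grouped.items():
    print(resolution_id, len(contexts))
```

With real data you would pass `rep_es.keyword_in_context(...)` to the helper instead of `sample_hits`; the number of keys in the result is then the number of distinct resolutions.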

In [4]:
# use context_size to get fewer or more surrounding words as context
for hit in rep_es.keyword_in_context("voornoemde Procureur", context_size=5):
    print(hit["context"])

                              andere zyde; waar by den voornoemde Procureur van Kervel versoekt obedientie, en
                  op condemnatie; en consenteerende de voornoemde Procureur Alsche in de versogte condemnatie
                         exceptie en defensie; waar op voornoemde Procureur van Alphen wyders versoekt condemnatie
                               andere zyde; waar by de voornoemde Procureur van Alphen versogt obedientie: en
                    wyders versogt condemnatie, en den voornoemde Procureur de Byo consenteerde in de
                              andere zyde; waar by den voornoemde Procureur van Son versogt obedientie, en
           Son versogt condemnatie. Consenteerende den voornoemde Procureur van Kervel in de versogte
          behoudens exceptie en defensie ; versoekende voornoemde Procureur van Son wyders daar op
                              andere zyde, by welke de voornoemde Procureur van Son, alvoorens eisch te
                               en defens

In [5]:
for hit in rep_es.keyword_in_context("voornoemde Procureur", context_size=5):
    # First, show the resolution id (which contains the session date)
    print(hit["resolution_id"])
    # Second, show the keyword in context
    print(hit["context"])
    # Finally, add newline for readability
    print()


session-1779-12-06-num-1-resolution-15
                              andere zyde; waar by den voornoemde Procureur van Kervel versoekt obedientie, en

session-1779-12-06-num-1-resolution-15
                  op condemnatie; en consenteerende de voornoemde Procureur Alsche in de versogte condemnatie

session-1780-07-18-num-1-resolution-8
                         exceptie en defensie; waar op voornoemde Procureur van Alphen wyders versoekt condemnatie

session-1780-10-13-num-1-resolution-7
                               andere zyde; waar by de voornoemde Procureur van Alphen versogt obedientie: en

session-1780-10-13-num-1-resolution-7
                    wyders versogt condemnatie, en den voornoemde Procureur de Byo consenteerde in de

session-1779-07-23-num-1-resolution-11
                              andere zyde; waar by den voornoemde Procureur van Son versogt obedientie, en

session-1779-07-23-num-1-resolution-11
           Son versogt condemnatie. Consenteerende den voornoemde Pro

In [6]:
# use num_hits to get fewer or more results
for hit in rep_es.keyword_in_context("voornoemde Procureur", context_size=5, num_hits=20):
    print(hit["resolution_id"])
    print(hit["context"])
    print()


session-1779-12-06-num-1-resolution-15
                              andere zyde; waar by den voornoemde Procureur van Kervel versoekt obedientie, en

session-1779-12-06-num-1-resolution-15
                  op condemnatie; en consenteerende de voornoemde Procureur Alsche in de versogte condemnatie

session-1780-07-18-num-1-resolution-8
                         exceptie en defensie; waar op voornoemde Procureur van Alphen wyders versoekt condemnatie

session-1780-10-13-num-1-resolution-7
                               andere zyde; waar by de voornoemde Procureur van Alphen versogt obedientie: en

session-1780-10-13-num-1-resolution-7
                    wyders versogt condemnatie, en den voornoemde Procureur de Byo consenteerde in de

session-1779-07-23-num-1-resolution-11
                              andere zyde; waar by den voornoemde Procureur van Son versogt obedientie, en

session-1779-07-23-num-1-resolution-11
           Son versogt condemnatie. Consenteerende den voornoemde Pro

In [7]:
# use filters to constrain the search space:
# selecting resolutions by year
filters = [
    {"match": {"metadata.session_year": 1672}}
]

for hit in rep_es.keyword_in_context("de Witt", filters=filters):
    print(hit["para_id"])
    print(hit["context"])

session-1672-06-08-num-1-para-54
               vanden Heer Cornen. de Witt, haer Ho:Mo
session-1672-06-08-num-1-para-54
                 vanden geme. heer de Witt, en dat hij
session-1672-06-08-num-1-para-54
                  aenden gein Heer de Witt sal werden gerescribeert
session-1672-06-08-num-1-para-54
                      dat hij heer de Witt, mits sijn indispositie
session-1672-03-14-num-1-para-21
              missive vande Heeren de Witt, ende van Vrijbergen
session-1672-03-01-num-1-para-34
              missive vande Heeren de Witt, ende van Vrijbergen
session-1672-02-18-num-1-para-56
              missive vande Heeren de Witt, van Vrijbergen ende
session-1672-02-25-num-1-para-38
              Missive vande Heeren de Witt, van Vrijbergen ende
session-1672-03-18-num-1-para-10
              missive vande Heeren de Witt, ende van Vrijbergen
session-1672-01-15-num-1-para-16
            Heer Raet Pensionnaris de Witt heeft ter Vergaderinge
session-1672-02-25-num-1-para-39
      

In [9]:
# use filters to constrain the search space:
# selecting resolutions by date range
filters = [
    {"range": {"metadata.session_date": {"gte": "1672-04-01", "lte": "1672-08-01"}}}
]

for hit in rep_es.keyword_in_context("Vloot", filters=filters):
    print(hit["para_id"], '\n')
    print(hit["context"], '\n')

session-1672-05-31-num-1-para-92 

                      sigh naer de Vloot te vervoegen: Waerop 

session-1672-06-16-num-1-para-24 

                 dat de Smirnasche Vloot door d'Engelschen 

session-1672-04-02-num-1-para-50 

                       Ho:Mo: inde vloot vanden Staat op 

session-1672-05-24-num-1-para-20 

           gedaen, opde Smirnasche Vloot is gear„ resteert 

session-1672-07-27-num-1-para-83 

                   ont„ houden van Vloot der Vijanden van 

session-1672-05-31-num-1-para-36 

                 dat de Smirnasche Vloot, door d'Engelschen 

session-1672-04-06-num-1-para-74 

  gevolmachtichde opde voorschreve vloot commanderen sal den 
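Because filters are plain Elasticsearch query clauses, it can be convenient to build them with small helper functions. The sketch below is one possible approach; the helper names `year_filter` and `date_range_filter` are assumptions and not part of the `republic` package, while the field names are the ones used in the cells above.

```python
from datetime import date


def year_filter(year):
    """Build a clause that selects sessions from a single year."""
    return {"match": {"metadata.session_year": int(year)}}


def date_range_filter(start_date, end_date):
    """Build a clause that selects sessions between two ISO dates (inclusive)."""
    # date.fromisoformat raises ValueError on malformed dates such as "1672-4-1"
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    if start > end:
        raise ValueError("start_date must not be after end_date")
    return {
        "range": {
            "metadata.session_date": {"gte": start.isoformat(), "lte": end.isoformat()}
        }
    }


filters = [date_range_filter("1672-04-01", "1672-08-01")]
print(filters)
```

The resulting list can be passed as the `filters` argument of `keyword_in_context`, just like the hand-written lists.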



In [10]:
filters = [
    {"match": {"metadata.session_year": 1672}}
]

# using a larger context size
for hit in rep_es.keyword_in_context("Vlooten", filters=filters, context_size=20):
    print(hit["resolution_id"], hit["resolution_offset"], '\n')
    print(hit["context"], '\n')


session-1672-05-31-num-1-resolution-1 0 

Ontfangen een missive vanden Heer Cornelis de Witt, hare Ho:Mo: Gedepden. ende Gevolmachtichde op 's Lants Vlooten in de jegenwoordige expeditie ter Zee, Jehan ‛s Lants Schip de seven Provincien, laverende voor Walcheren, Brugge & Oost van haer 

session-1672-09-01-num-1-resolution-1 583 

advertentie ten spoedichsten kennisse sal werden gegeven aenden Lieutenant Admirael de Ruijter om daerop behoorlicke reflexie te nemen, de Vijantlicke vlooten te doen observeren, ingevolge van hare Ho:Mo: resolutie vanden seven„ thienden Augusti laestleden, de desseijnen vande Vijanden vanden Staet 

session-1672-09-01-num-1-resolution-1 1366 

Welderen, ende Lieutenant Admirael de Ruijter sal werden, aengeschreven, dat deselve haer soo veel mogelick op de voor„ schreve Vijantlicke Vlooten sullen informeren, haer Ho:Mo: sonder eenich tijt versuijm, adverteren vande condtschappen die haer vande voornoemde Vijantlicke Vlooten souden mogen 

session-1672-07-15-n

## Retrieving Resolutions

The `rep_es` object has a range of functions to retrieve `resolution` objects.

You can find all available properties and methods of `resolution` objects in `republic_document_model.py`, specifically in the
[Resolution](https://github.com/HuygensING/republic-project/blob/bb4cdad7b4cb9fb71378d0dde000fe7725ceb45e/republic/model/republic_document_model.py#L392) class, which inherits several properties and methods from the [ResolutionElementDoc](https://github.com/HuygensING/republic-project/blob/bb4cdad7b4cb9fb71378d0dde000fe7725ceb45e/republic/model/republic_document_model.py#L158) class.

In [11]:
resolutions = rep_es.retrieve_resolutions_by_session_date("1672-02-12")
for res in resolutions:
    print(res.session_date.isoformat(), res.id)

1672-02-12 session-1672-02-12-num-1-attendance_list
1672-02-12 session-1672-02-12-num-1-resolution-1
1672-02-12 session-1672-02-12-num-1-resolution-2
1672-02-12 session-1672-02-12-num-1-resolution-3
1672-02-12 session-1672-02-12-num-1-resolution-4
1672-02-12 session-1672-02-12-num-1-resolution-5
1672-02-12 session-1672-02-12-num-1-resolution-6
1672-02-12 session-1672-02-12-num-1-resolution-7
1672-02-12 session-1672-02-12-num-1-resolution-8
1672-02-12 session-1672-02-12-num-1-resolution-9
1672-02-12 session-1672-02-12-num-1-resolution-10
1672-02-12 session-1672-02-12-num-1-resolution-11
1672-02-12 session-1672-02-12-num-1-resolution-12
1672-02-12 session-1672-02-12-num-1-resolution-13
1672-02-12 session-1672-02-12-num-1-resolution-14
1672-02-12 session-1672-02-12-num-1-resolution-15
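The identifiers in this output appear to encode the session date, the session number and the element within the session. The sketch below parses that pattern with a regular expression. The pattern is an assumption inferred from the identifiers printed in this notebook, not a documented format, and `parse_element_id` is a hypothetical helper; the resolution metadata remains the authoritative source for these values.

```python
import re
from datetime import date

# Assumed pattern: session-<YYYY-MM-DD>-num-<n>-<element>
ID_PATTERN = re.compile(
    r"^session-(?P<date>\d{4}-\d{2}-\d{2})-num-(?P<session_num>\d+)-(?P<element>.+)$"
)


def parse_element_id(element_id):
    """Split an identifier into session date, session number and element name."""
    match = ID_PATTERN.match(element_id)
    if match is None:
        raise ValueError(f"unexpected identifier format: {element_id!r}")
    return {
        "session_date": date.fromisoformat(match.group("date")),
        "session_num": int(match.group("session_num")),
        "element": match.group("element"),
    }


parsed = parse_element_id("session-1672-02-12-num-1-resolution-15")
print(parsed["session_date"].isoformat(), parsed["session_num"], parsed["element"])
```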


### The Anatomy of a Resolution

Resolutions in the index consist of `metadata` and `paragraphs`.

In [12]:
import json

res = resolutions[0]
# Each resolution has metadata
print(json.dumps(res.metadata, indent=4))

{
    "inventory_num": 3285,
    "source_id": "session-1672-02-12-num-1",
    "type": "resolution",
    "id": "session-1672-02-12-num-1-attendance_list",
    "session_date": "1672-02-12",
    "session_id": "session-1672-02-12-num-1",
    "session_num": 1,
    "president": null,
    "session_year": 1672,
    "session_month": 2,
    "session_day": 12,
    "session_weekday": "Veneris",
    "text_page_num": [],
    "index_timestamp": "2022-02-03T09:14:42.796086",
    "proposition_type": null,
    "proposer": null,
    "decision": null,
    "resolution_type": "ordinaris"
}


In [13]:
# The json property returns all resolution data as a JSON-serialisable dictionary
res.json

{'id': 'session-1672-02-12-num-1-attendance_list',
 'type': ['republic_doc',
  'resolution_element',
  'resolution',
  'attendance_list'],
 'metadata': {'inventory_num': 3285,
  'source_id': 'session-1672-02-12-num-1',
  'type': 'resolution',
  'id': 'session-1672-02-12-num-1-attendance_list',
  'session_date': '1672-02-12',
  'session_id': 'session-1672-02-12-num-1',
  'session_num': 1,
  'president': None,
  'session_year': 1672,
  'session_month': 2,
  'session_day': 12,
  'session_weekday': 'Veneris',
  'text_page_num': [],
  'index_timestamp': '2022-02-03T09:14:42.796086',
  'proposition_type': None,
  'proposer': None,
  'decision': None,
  'resolution_type': 'ordinaris'},
 'evidence': [],
 'stats': {'lines': 16, 'words': 71, 'text_regions': 0, 'paragraphs': 5},
 'paragraphs': [{'id': 'session-1672-02-12-num-1-para-1',
   'type': ['republic_doc', 'resolution_paragraph', 'republic_paragraph'],
   'metadata': {'inventory_num': 3285,
    'source_id': 'session-1672-02-12-num-1',
    

## Using Elasticsearch Queries

See the [Elasticsearch Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) for details on how to construct different types of queries.
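A point worth knowing when writing such queries: a `match` clause on a multi-word string matches documents that contain any of the words (the default operator is OR), whereas `match_phrase` requires the words to appear next to each other in that order. The sketch below builds both variants as plain dictionaries. The helper `build_resolution_query` is an assumption for illustration, and it assumes that `paragraphs.text` is a regular analysed text field on which phrase queries work.

```python
def build_resolution_query(year, text, phrase=True):
    """Build a bool query for resolutions from one year that mention `text`.

    With phrase=True the words must occur as a contiguous phrase;
    with phrase=False any of the words is enough for a match.
    """
    text_clause_type = "match_phrase" if phrase else "match"
    return {
        "bool": {
            "must": [
                {"match": {"type": "resolution"}},
                {"match": {"metadata.session_year": year}},
                {text_clause_type: {"paragraphs.text": text}},
            ]
        }
    }


phrase_query = build_resolution_query(1672, "raet pensionaris")
loose_query = build_resolution_query(1672, "raet pensionaris", phrase=False)
print(phrase_query["bool"]["must"][2])
print(loose_query["bool"]["must"][2])
```

Either dictionary can be passed to `rep_es.retrieve_resolutions_by_query`; the phrase variant should return fewer, more precise results.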

In [16]:
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},                  # only resolutions, no attendance lists
            {"match": {"metadata.session_year": 1672}},         # only resolutions from 1672
            {"match": {"paragraphs.text": "raet pensionaris"}}, # only resolutions containing 'raet' and/or 'pensionaris'
        ]
    }
}

resolutions = rep_es.retrieve_resolutions_by_query(query)

for res in resolutions:
    print(res.id)
    for para in res.paragraphs:
        print(f"\t{para.text}\n")
    print('--------------------\n')

session-1672-07-25-num-1-resolution-14
	Ontfangen een missive vanden pensionaris Pesters, geschreven tot Maestricht den 23en. deses, houdende advertentie, ende onder anderen rakende de contri„ butie bij de franschen gevordent wer„ ,dende inde Landen van Overmase, Waerop gedelibereert zijnde, Is goetgevonden ende verstaen, dat de voors missive gestelt sal werden in handen vande heeren van Brakel ende andere haer Ho:Mo: Gedepu„ teerden tot de saken vande Landen van Overmaze, met ende nevens eenige Heeren Gecommitteerden uijt den Raet van State bij haer E. selffs te nomineren, om te visi„ teren, examineren, ende daer van rapport te doen

--------------------

session-1672-11-15-num-1-resolution-5
	Ontfangen een missive van Alleij Aga, geschreven tot Amsterdam den twaelffden deses, houdende advertentie, dat hij uijt Turckijen was gesonden voor Ambassadeur vanden Grootenheer aenden Coningh van Sweeden dat hij oock een recommandatie Brieff aen haer Ho:Mo: om hem be,, hulpsaem te sijn int gee

In [17]:
# import Counter to do some simple word counting and frequency comparison
from collections import Counter
import re


In [18]:
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}}
        ]
    }
}

resolutions_1672 = rep_es.retrieve_resolutions_by_query(query, size=10000)

all_word_freq = Counter()

for res in resolutions_1672:
    for para in res.paragraphs:
        all_word_freq.update([word for word in re.split(r"\W+", para.text) if word != ''])

for word, freq in all_word_freq.most_common(10):
    print(f"{word: <20}{freq: >6}")

ende                 26333
van                  23211
de                   20743
te                   14955
dat                  11138
den                   9356
haer                  8467
vande                 8418
tot                   8380
in                    8012
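Splitting on `\W+` is a simple tokenisation that has some side effects on this material: abbreviations such as `Ho:Mo:` and words broken across lines with `„` are split into separate tokens, and capitalised and lower-case spellings are counted separately. The sketch below shows this on a made-up sample string; the `tokenize` helper and its `lowercase` option are assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text, lowercase=False):
    """Split text on runs of non-word characters, dropping empty strings."""
    tokens = [word for word in re.split(r"\W+", text) if word != ""]
    return [token.lower() for token in tokens] if lowercase else tokens


# Made-up sample combining fragments that occur in the resolutions
sample = "haer Ho:Mo: contri„ butie Raet raet"

tokens = tokenize(sample)
print(tokens)

# Case-sensitive versus case-insensitive counts
case_sensitive = Counter(tokens)
case_insensitive = Counter(tokenize(sample, lowercase=True))
print(case_sensitive["raet"], case_insensitive["raet"])
```

Whether to lowercase depends on the analysis: capitalisation can be meaningful for names and titles, but it also spreads the counts of one word over several spellings.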


In [19]:
Counter([res.metadata['proposition_type'] for res in resolutions_1672])

Counter({None: 460,
         'missive': 1617,
         'requeste': 289,
         'rekening': 1,
         'memorie': 36,
         'rapport': 2,
         'declaratie': 7})

In [20]:
res_missives = [res for res in resolutions_1672 if res.metadata['proposition_type'] == 'missive']

len(res_missives)

1617

In [21]:
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}},
            {"match": {"paragraphs.text": "raet pensionaris"}}
        ]
    }
}

resolutions = rep_es.retrieve_resolutions_by_query(query)


word_freq = Counter()

for res in resolutions:
    for para in res.paragraphs:
        word_freq.update([word for word in re.split(r"\W+", para.text) if word != ''])

rel_freq = {}
min_freq = 3
for word, freq in word_freq.most_common():
    if freq < min_freq:
        continue
    rel_freq[word] = freq / all_word_freq[word]
    
for word in sorted(rel_freq, key = lambda w: rel_freq[w], reverse=True):
    print(f"{word: <20}{rel_freq[word]: >6.4f}{word_freq[word]: >6}{all_word_freq[word]: >8}")

Pensionaris         0.3333     3       9
Wijtingh            0.2000     3      15
Pensionnaris        0.0946     7      74
Fagel               0.0714     5      70
raet                0.0338     5     148
Brakel              0.0268     3     112
gecommuniceert      0.0208     3     144
Raet                0.0183    16     873
seeckere            0.0178     3     169
Schepenen           0.0176     3     170
geaddresseert       0.0166     5     302
saken               0.0120     3     251
schreven            0.0112     3     267
nomineren           0.0110     3     272
heeft               0.0106     6     564
Orange              0.0091     3     330
Griffier            0.0087     4     461
advertentie         0.0086    10    1168
nevens              0.0081     5     615
anderen             0.0073     3     410
Gecommitteerden     0.0070     3     430
Hollandt            0.0069     3     433
hadden              0.0065     5     764
Heer                0.0063     8    1275
examineren      

### Resolutions in JSON Format

Resolution objects have a `.json` property that returns a JSON representation of the resolution, including metadata, paragraph text and basic statistics. This is a convenient format for storing resolutions on disk and loading them again later, which is faster than retrieving them from the Republic CAF server.

You can also turn them into plain text representations if you want to do extensive text analysis.
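The round trip can be sketched without a server connection by using a few hand-written dictionaries in place of real `res.json` output. In the sketch below, `sample_docs`, the helper names and the temporary file are assumptions for illustration; real resolution JSON contains many more fields.

```python
import gzip
import json
import os
import tempfile

# Hypothetical stand-ins for the dictionaries returned by res.json
sample_docs = [
    {"id": "session-1672-02-12-num-1-resolution-1",
     "paragraphs": [{"text": "Ontfangen een missive"}, {"text": "Waerop gedelibereert zijnde"}]},
    {"id": "session-1672-02-12-num-1-resolution-2",
     "paragraphs": [{"text": "Is gehoort het rapport"}]},
]


def write_json_gz(docs, path):
    """Write a list of JSON-serialisable documents to a gzipped file."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(docs, fh)


def read_json_gz(path):
    """Read the list of documents back from a gzipped JSON file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample-resolutions.json.gz")
    write_json_gz(sample_docs, path)
    restored = read_json_gz(path)

# Plain text of the first document: paragraph texts joined by newlines
plain_text = "\n".join(para["text"] for para in restored[0]["paragraphs"])
print(plain_text)
```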

In [22]:
import json
import gzip

resolutions_file = "../../data/resolutions/rampjaar-ordinaris-resolutions.json.gz"

# open a file for storing the JSON representation of resolutions
with gzip.open(resolutions_file, 'wt') as fh:
    # iterate over the resolutions and dump their JSON representations to file
    json.dump([res.json for res in resolutions_1672], fh)

In [23]:
import json
import gzip

import republic.model.republic_document_model as rdm


resolutions_file = "../../data/resolutions/rampjaar-ordinaris-resolutions.json.gz"

# Reading the JSON representations from file again and turning 
# them into Resolution objects again
with gzip.open(resolutions_file, 'rt') as fh:
    # the document model has a convenience function to turn a JSON representation
    # to a Resolution object: json_to_republic_resolution
    resolutions_1672 = [rdm.json_to_republic_resolution(res) for res in json.load(fh)]
    

In [24]:
# Creating plain text representations of resolutions by concatenating paragraph texts
for res in resolutions_1672:
    res_text = '\n'.join([para.text for para in res.paragraphs])
    print(res_text)
    break

Is gehoort het rapport vande Heeren Schimmelpenningh, ende andere hare Ho:Mo: Gedeputeerden tot de saken vande Zee, hebbende ingevolge ende tot voldoeninge van derselver resolutie Commissoriael vanden negenentwin„ tichsten December laestleden, gevi, siteert ende geexamineert de Requeste van David Centsen, Consul vande Nederlantsche natie tot Rochelle, versoeckende door hare Ho:Mo: met eene somme van penningen te mogen werden gesubvenieert, ten aensien vande oncosten bij hem gesupporteert in een continueel vervolgh van ontrent acht maenden, om expeditie, en het obtineren van eene resolutie op de Consulaetrechten aldaer, ende een daghgelt aen hem Suppliant als Con„ sul toe te leggen: Waerop gedelibereert sijnde, Is goetgevonden ende verstaen, mits desen te versoec„ ken de Heeren Gedeputeerden vande Provincie van Hollandt ende West,, vrieslandt, dat haer E. haer hoe eerder soo liever willen verclaren op het rapport vande gemelte Heeren Schimmel„ penningh ende andere hare Ho:Mo: Gedeputeer

## Retrieving Aggregate Statistics

You can also query the indexes directly using the Elasticsearch instance inside the `rep_es` object, which is available as `rep_es.es_anno`.

Below is an example of a query and an aggregation to get, for each month of 1672, the number of resolutions that match 'raet pensionaris':

In [25]:
# raet pensionaris in resolutions
query = {
    "bool": {
        "must": [
            {"match": {"type": "resolution"}},
            {"match": {"metadata.session_year": 1672}},
            {"match": {"paragraphs.text": "raet pensionaris"}}
        ]
    }
}

aggs = {
    "months": {
        "date_histogram": {
            "field": "metadata.session_date",
            "calendar_interval": "month"
        }
    }
}


response = rep_es.es_anno.search(index="resolutions", query=query, aggs=aggs, size=0)
buckets = response["aggregations"]["months"]["buckets"]
for bucket in buckets:
    print(bucket["key_as_string"].split("T")[0], bucket["doc_count"])

1672-01-01 64
1672-02-01 18
1672-03-01 71
1672-04-01 9
1672-05-01 83
1672-06-01 45
1672-07-01 63
1672-08-01 56
1672-09-01 47
1672-10-01 59
1672-11-01 48
1672-12-01 27


In [26]:
# raet pensionaris in attendance lists
query = {
    "bool": {
        "must": [
            {"match": {"type": "attendance_list"}},
            {"match": {"metadata.session_year": 1672}},
            {"match": {"paragraphs.text": "raet pensionaris"}}
        ]
    }
}

response = rep_es.es_anno.search(index="resolutions", query=query, aggs=aggs, size=0)
buckets = response["aggregations"]["months"]["buckets"]
for bucket in buckets:
    print(bucket["key_as_string"].split("T")[0], bucket["doc_count"])

1672-01-01 14
1672-02-01 8
1672-03-01 14
1672-04-01 5
1672-05-01 10
1672-06-01 2
1672-07-01 2
1672-08-01 3
1672-09-01 12
1672-10-01 15
1672-11-01 12
1672-12-01 6
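To compare or plot such aggregations, it helps to turn the buckets into a dictionary keyed by month. The sketch below does this for a hand-built response that mimics the structure used above. The `sample_response` dictionary, the time suffix in `key_as_string` and the helper `buckets_to_counts` are assumptions for illustration; the two counts are copied from the first output above.

```python
def buckets_to_counts(response, agg_name):
    """Map each date histogram bucket to its document count, keyed by YYYY-MM."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {
        # keep only the year and month part of the bucket's date string
        bucket["key_as_string"].split("T")[0][:7]: bucket["doc_count"]
        for bucket in buckets
    }


# Hypothetical response fragment with the same nesting as a real search response
sample_response = {
    "aggregations": {
        "months": {
            "buckets": [
                {"key_as_string": "1672-01-01T00:00:00.000Z", "doc_count": 64},
                {"key_as_string": "1672-02-01T00:00:00.000Z", "doc_count": 18},
            ]
        }
    }
}

monthly_counts = buckets_to_counts(sample_response, "months")
for month, count in monthly_counts.items():
    print(month, count)
```

Applying the helper to the responses of both queries gives two dictionaries with the same keys, which makes a month-by-month comparison straightforward.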
