# Fuzzy Search and Petitioners' Roles and Professions

In [1]:
# This is needed to add the repo dir to the path so jupyter
# can load the republic modules directly from the notebooks
import os
import sys
repo_name = 'republic-project'
repo_dir = os.path.split(os.getcwd())[0].split(repo_name)[0] + repo_name
print(repo_dir)
if repo_dir not in sys.path:
    sys.path.append(repo_dir)



/Users/marijnkoolen/Code/Huygens/republic-project


In [2]:
# load the Republic Elasticsearch API
from republic.elastic.republic_elasticsearch import initialize_es

rep_es = initialize_es(host_type='external', timeout=60)


### Dictionary of resolution-specific terms

During the project we compile lists of terms that are relevant within the corpus of resolutions. The lists of terms are categorised, with lists for persons, organisations, objects, locations, topics, etc.

These terms can be used in fuzzy search to identify and classify different aspects of resolutions. For instance, the opening sentence of a resolution describes a proposition submitted to the States General. This proposition has a source type (often a specific type of document, such as a missive or a request) and details about who submitted it, from where, and on what date. Categorising these aspects allows us to add metadata to the individual resolutions, which in turn improves information access.

In [8]:
from republic.model.resolution_phrase_model import read_republic_term_dict

term_dict = read_republic_term_dict()

# What different categories of terms are available?
term_dict.keys()

dict_keys(['action', 'object', 'unit', 'meeting', 'person_reference', 'organisation', 'geographical_name', 'other_name', 'person_name', 'adjective', 'location', 'topic', 'date', 'misc', 'function'])

Most categories have sub-categories. For example, the `person_reference` category distinguishes between professions, family relationships, legal status and titles, among others.

In [3]:
term_dict['person_reference'].keys()

dict_keys(['person_legal_status', 'person_family', 'person_citizen', 'person_title', 'person_other', 'person_profession', 'person_meeting_role', 'person_nationality'])

In [4]:
# The number of profession terms in the dictionary
len(term_dict['person_reference']['person_profession'])

1039
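
The same count can be taken for every sub-category at once. The snippet below is a minimal sketch, assuming each sub-category maps to a sized collection of terms (a list or a dict). The helper `count_terms_per_subcategory` and the toy dictionary are hypothetical stand-ins, so the snippet runs without the `republic` package.

```python
def count_terms_per_subcategory(category_terms):
    """Map each sub-category name to the number of terms it holds."""
    return {sub: len(terms) for sub, terms in category_terms.items()}


# Toy stand-in for term_dict['person_reference'] (hypothetical, tiny sample).
# The real dictionary holds far more terms per sub-category.
toy_person_reference = {
    'person_profession': ['Schepen', 'Secretaris', 'Major', 'Schipper'],
    'person_title': ['Grave', 'Heer'],
    'person_meeting_role': ['Suppliant'],
}

subcategory_sizes = count_terms_per_subcategory(toy_person_reference)

# Print the sub-categories from largest to smallest
for sub, size in sorted(subcategory_sizes.items(), key=lambda item: -item[1]):
    print(f"{sub: <25}{size: >5}")
```

With the real dictionary, passing `term_dict['person_reference']` instead of the toy sample should give the size of each sub-category.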

These person reference terms can be added as a lexicon to a fuzzy searcher, so you can search for occurrences of these terms.

In [10]:
from fuzzy_search.fuzzy_context_searcher import FuzzyContextSearcher
from fuzzy_search.fuzzy_phrase_model import PhraseModel

config = {
    'levenshtein_threshold': 0.8,
    'ngram_threshold': 0.7,
    'ngram_size': 3,
    'skip_size': 1,
    'include_variants': True
}

phrases = []
for category in term_dict['person_reference']:
    for term in term_dict['person_reference'][category]:
        # turn the term into a fuzzy search phrase, and add its categories as labels
        phrase = {
            "phrase": term,
            "label": ["person_reference", category]
        }
        # add the term to the list of phrases
        phrases.append(phrase)
print("number of person reference phrases:", len(phrases))

number of person reference phrases: 1190
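
To get a feel for what a `levenshtein_threshold` of 0.8 allows, the sketch below computes a normalised Levenshtein similarity for a few spelling variants. This is an illustration under an assumption: it defines similarity as one minus the edit distance divided by the length of the longer string. The scoring inside `fuzzy_search` may differ in detail, and the functions `levenshtein_distance` and `similarity` are written here for illustration only.

```python
def levenshtein_distance(a, b):
    """Number of single-character insertions, deletions and substitutions
    needed to turn string a into string b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed definition: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_distance(a, b) / longest


threshold = 0.8
pairs = [("Scheepenen", "Schepenen"), ("Koopman", "Coopman"), ("Licentiaat", "Licentiaet")]
for phrase, found in pairs:
    score = similarity(phrase, found)
    verdict = "match" if score >= threshold else "no match"
    print(f"{phrase: <12}{found: <12}{score:.2f}  {verdict}")
```

Under this definition, a single differing character in a word of seven to ten characters stays above 0.8, which is why historical spelling variants can still be linked to their dictionary phrase.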


In [14]:
# Create a fuzzy search phrase model from the list of phrases
phrase_model = PhraseModel(phrases, config=config)
# configure a searcher
person_ref_searcher = FuzzyContextSearcher(config)
# Add the phrase model as lexicon to the searcher
person_ref_searcher.index_phrase_model(phrase_model)



In [23]:
# Create a query to select only resolutions in the year 1672 based 
# on propositions of type request
query = {
    "bool": {
        "must": [
            {"match": {"metadata.type": "resolution"}},
            {"match": {"metadata.session_year": 1672}},
            {"match": {"metadata.proposition_type": "requeste"}}
        ]
    }
}

resolutions = rep_es.retrieve_resolutions_by_query(query, size=100)



Each request proposition starts with a fixed formula, followed by details of the proposer, location and date, and then a _proposition verb_ that introduces the content of the proposition. To identify the proposer's role or profession, we use the fuzzy searcher and the `person_reference` lexicon on the text between the opening formula and the _proposition verb_.

In [28]:
from collections import Counter

person_ref_freq = Counter()
person_ref_type_freq = Counter()

for res in resolutions:
    # the opening formula is always in the first paragraph
    first_para = res.paragraphs[0]
    
    # The resolution evidence consists of fuzzy search matches based
    # on the resolution opening phrase lexicon.
    # Select only the matches in the first paragraph
    first_para_matches = [match for match in res.evidence if match.text_id == first_para.id]    
    
    # From there, pick the first match phrase that is an opening formula.
    # The end of the formula is the start of the proposition text.
    opening_matches = [match for match in first_para_matches if match.has_label('proposition_opening')]
    if len(opening_matches) == 0:
        # skip resolutions without a recognised opening formula
        continue
    proposition_start = opening_matches[0].end
    
    # Then, pick the first proposition verb as the end of the proposition text,
    # or the end of the paragraph if there is no proposition verb
    verb_matches = [match for match in first_para_matches if match.has_label('proposition_verb')]
    proposition_end = verb_matches[0].end if len(verb_matches) > 0 else len(first_para.text)
    
    # Select the text of the first paragraph between the opening formula and the proposition verb
    proposition_text = first_para.text[proposition_start:proposition_end]
    print(proposition_text, '\n')
    
    # look for person reference terms
    matches = person_ref_searcher.find_matches(proposition_text)
    for match in matches:
        print(f"Phrase: {match.phrase.phrase_string: <30}\tmatch string: {match.string}")
        print("\t", match.label)
        person_ref_freq.update([match.phrase.phrase_string])
        # the label can be a single string or a list of strings
        refs = match.label if isinstance(match.label, list) else [match.label]
        person_ref_type_freq.update(refs)
            

Johan Coorte, ende Gijsbert Zuijlen van Nieuvelt, beijde Schepenen ‛s Lants vanden Vrijen, versoeckende 

Phrase: Scheepenen                    	match string: Schepenen
	 ['person_reference', 'person_profession']
Jan van Eede geweest hebbende 

Johan d'Arbaij, Major van een regiment te voet, ten dienst deser Landen, guarnisoen houdende 

Phrase: Major                         	match string: Major
	 ['person_reference', 'person_profession']
Balthasar van geersbergen, Secretaris van Derssel, Wessen & Beersen alle inde Meijerije van s' Hertogenbos, houdende 

Phrase: Sekretaris                    	match string: Secretaris
	 ['person_reference', 'person_profession']
Phrase: Secretaris                    	match string: Secretaris
	 ['person_reference', 'person_profession']
N. Cauberecht, Licentiaet inde rechten tot Maestricht, houdende 

Phrase: Licentiaat                    	match string: Licentiaet
	 ['person_reference', 'person_profession']
Boulliu, Burgemeesteren ende Schepenen der Stede

Phrase: Burgemeester                  	match string: Burgemeesteren
	 ['person_reference', 'person_profession']
Phrase: Burgemeesteren                	match string: Burgemeesteren
	 ['person_reference', 'person_profession']
Phrase: Scheepenen                    	match string: Schepenen
	 ['person_reference', 'person_profession']
Johan Schoock, houdende 

Johannes Amilius, Commis Generael vande Convoijen ende Licenten, houdende 

Henrick Graham, Lieutenant Colonnel ten dienste deser Landen, houdende 

Phrase: Lieutenant                    	match string: Lieutenant
	 ['person_reference', 'person_profession']
Otto Grave van Limburch, ende Bronchorst, Heer van Stierum, houdende 

Phrase: Grave                         	match string: Grave
	 ['person_reference', 'person_title']
Phrase: Heer                          	match string: Heer
	 ['person_reference', 'person_title']
francisco van Lisidro, Coopman tot Amsterdam, houdende 

Phrase: Koopman                       	match string: Coopman
	 

Phrase: Schipper                      	match string: schipper
	 ['person_reference', 'person_profession']
Phrase: schippers                     	match string: schipper
	 ['person_reference', 'person_profession']
Phrase: Schepen                       	match string: Schepe
	 ['person_reference', 'person_profession']
Phrase: Suppliantes                   	match string: Suppliant
	 ['person_reference', 'person_meeting_role']
Phrase: Suppliante                    	match string: Suppliant
	 ['person_reference', 'person_meeting_role']
Phrase: Supplianten                   	match string: Suppliant
	 ['person_reference', 'person_meeting_role']
Phrase: Suppliants                    	match string: Suppliant
	 ['person_reference', 'person_meeting_role']
Phrase: Suppliant                     	match string: Suppliant
	 ['person_reference', 'person_meeting_role']
Phrase: Borger                        	match string: Borgers
	 ['person_reference', 'person_citizen']
Phrase: Borgers                      

**Note**: this is a very coarse analysis and contains plenty of mistakes. In particular, check the output for resolutions where the first paragraph has no proposition verb: in that case everything after the opening formula is searched, so many of the person references found will not be about the proposer.
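
One way to make those doubtful cases easy to find is to return a flag alongside the selected text that records whether a proposition verb was found. The sketch below is self-contained and rests on assumptions: `ToyMatch` is a hypothetical stand-in for the match objects used above (only offsets and labels), `select_proposition_text` is an illustrative helper, and the example sentence is made up, modelled on the outputs above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyMatch:
    """Hypothetical stand-in for a fuzzy search match: offsets plus labels."""
    start: int
    end: int
    labels: List[str] = field(default_factory=list)

    def has_label(self, label):
        return label in self.labels


def select_proposition_text(text, matches):
    """Return (proposition_text, has_verb), or None without an opening formula."""
    openings = [m for m in matches if m.has_label('proposition_opening')]
    if not openings:
        return None
    start = openings[0].end
    # only consider proposition verbs that come after the opening formula
    verbs = [m for m in matches
             if m.has_label('proposition_verb') and m.start >= start]
    if verbs:
        return text[start:verbs[0].end], True
    # no verb found: fall back to the rest of the paragraph and flag it
    return text[start:], False


# Made-up paragraph for illustration only
opening = "Is ter Vergaderinge gelesen de Requeste van"
verb = "houdende"
text = opening + " Johan Schoock, " + verb + " dat ..."
verb_start = text.index(verb)
matches = [
    ToyMatch(0, len(opening), ['proposition_opening']),
    ToyMatch(verb_start, verb_start + len(verb), ['proposition_verb']),
]

with_verb = select_proposition_text(text, matches)
without_verb = select_proposition_text(text, matches[:1])
print(with_verb)
print(without_verb)
```

Resolutions for which the flag is `False` can then be counted separately or inspected by hand before their person references are added to the frequency lists.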

In [30]:
for person_ref, freq in person_ref_freq.most_common():
    print(f"{person_ref: <30}{freq: >5}")

Scheepenen                        7
Burgemeester                      5
Burgemeesteren                    5
Griffier                          5
Koopman                           5
Schepen                           5
Schipper                          5
Scheepen                          4
Mr                                4
Bailliuw                          4
Borger                            4
Borgers                           4
Grave                             3
Heer                              3
Capitein                          3
Lieutenant                        3
Schippers                         3
Sekretaris                        2
Secretaris                        2
Commissaris                       2
Vrouwe                            2
Poorter                           2
Suppliantes                       2
Suppliante                        2
Supplianten                       2
Suppliants                        2
Suppliant                         2
Resident                    

In [32]:
for person_ref_type, freq in person_ref_type_freq.most_common():
    print(f"{person_ref_type: <30}{freq: >5}")

person_reference                119
person_profession                75
person_title                     17
person_meeting_role              12
person_citizen                   12
person_family                     2
person_legal_status               1
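
Because every phrase carries the top-level label `person_reference`, its count (119) is simply the number of matches, and the sub-category counts add up to it. The sketch below, with the counts copied from the output above, turns them into shares of all matches. It assumes each matched phrase has exactly one sub-category label, which these counts are consistent with.

```python
from collections import Counter

# Counts copied from the output above
person_ref_type_freq = Counter({
    'person_reference': 119,
    'person_profession': 75,
    'person_title': 17,
    'person_meeting_role': 12,
    'person_citizen': 12,
    'person_family': 2,
    'person_legal_status': 1,
})

# The top-level label is attached to every match, so it gives the total
total_matches = person_ref_type_freq['person_reference']

# Keep only the sub-category labels
sub_category_freq = Counter({
    label: freq for label, freq in person_ref_type_freq.items()
    if label != 'person_reference'
})

# Share of all matches per sub-category
shares = {label: freq / total_matches for label, freq in sub_category_freq.items()}

for label, freq in sub_category_freq.most_common():
    print(f"{label: <25}{freq: >5}{shares[label]: >8.1%}")
```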
