# Conversion Process for the files from the Dutch Parliament

In [1]:
# Imports
import os
import json
import requests
from tqdm import tqdm
import lxml.etree as et
from utils import ground_actors
from collections import defaultdict

# Introduction

This notebook contains a small example of how to use the code in this repo as an end-to-end way of converting parliamentary documents to the TEI format.

The notebook will contain the following sections:
 - Data acquisition
 - Data conversion
 - Grounding of actors
 - Metadata creation

## Data Acquisition

- The data from the parliamentary debates can be retrieved from multiple sources. In this research the meetings were extracted from the officielebekendmakingen FTP server, filtered on the 'Handelingen' type. This is probably the easiest route; however,
  the information can also be acquired through the website by making web requests. The code for that is shown below.
  (Details on how to do this can be found at https://www.koopoverheid.nl/documenten/instructies/2021/02/09/handleiding-voor-het-uitvragen-van-de-collectie-officiele-publicaties)

In [2]:
# Code for downloading the parliament files

def get_handelingen_urls(start_date='2015-01-01', end_date='2020-12-31', num_records_per_request=100, startnum=1):
    # return each doc as [doc_url, doc_metadata_url]
    base_url = 'https://repository.overheid.nl/sru?&'
    # Retrieve a json response as this is a bit easier to parse
    # currently we are filtering on the debates, we can remove this if we want all the files
    query = 'query=(c.product-area==officielepublicaties) AND (w.publicatienaam="Handelingen") AND (dt.modified>=%s AND dt.modified<=%s) AND (w.subrubriek="Plenair debat")&httpAccept=application/json&maximumRecords=%d&startRecord=%d'% (start_date, end_date, num_records_per_request, startnum)
    url = base_url+query
    response = requests.get(url).json()
    number_of_hits = response['searchRetrieveResponse']['numberOfRecords']

    # Get the position of the next record for our next query. The maximum number of records
    # returned in one go is 1000, so we will have to make multiple calls
    # if we exceed this.
    next_record_num = response['searchRetrieveResponse'].get('nextRecordPosition', None)
    # get all the documents from the 'records' key
    documents = response['searchRetrieveResponse']['records']['record']
    
    # We can extract the xml files from this, but we should also check that we get the right files,
    # some files have been corrected later, this is denoted by n1 in the newer files, and the
    # 'replacedBy' key in the record of the older record.
    
    # If wanted, more filtering can be done on the files with the  'publicatienaam' key to filter
    # on vragenuurtjes, debates, openingen etc.
    
    urls = []
    metadata_urls = []
    
    for doc in documents:
        doc_data = doc['recordData']['gzd']
        # Check whether the 'isReplacedBy' key exists; if it does not, the document is already the newest version
        is_old = doc_data['originalData']['meta']['owmsmantel'].get('isReplacedBy', None)
        if not is_old:
            xml_url = doc_data['enrichedData']['preferredUrl'].replace('.html', '.xml')
            # to get the metadata url simply add '/metadata' behind the doc identifier
            xml_metadata_url = xml_url.replace('.xml', '/metadata.xml')
            urls.append(xml_url)
            metadata_urls.append(xml_metadata_url)
            
    return next_record_num, urls, metadata_urls

def get_all_files(start_date='2015-12-01', end_date='2020-12-31', stopnum=None, num_records_per_request=100):
    # stopnum if you want to limit the number of files downloaded
    all_xml_urls = []
    all_metadata_urls = []
    next_num, xml_urls, metadata_urls = get_handelingen_urls(start_date, end_date, num_records_per_request=num_records_per_request)
    
    all_xml_urls.extend(xml_urls)
    all_metadata_urls.extend(metadata_urls)
    
    while next_num:
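        # pass the page size along as well, otherwise follow-up requests silently fall back to the default
        next_num, urls, meta_urls = get_handelingen_urls(start_date, end_date, num_records_per_request=num_records_per_request, startnum=next_num)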
        all_xml_urls.extend(urls)
        all_metadata_urls.extend(meta_urls)
    
    files = []
    metadata_files = []
    
    for i in tqdm(range(len(all_xml_urls[:stopnum]))):
        file = requests.get(all_xml_urls[i]).content.decode('utf-8')
        metadata_file = requests.get(all_metadata_urls[i]).content.decode('utf-8')
        files.append(file)
        metadata_files.append(metadata_file)
    
    return files, metadata_files
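
The paging logic used above can be looked at in isolation. The following is a minimal sketch, assuming a hypothetical `fetch_page(start)` callable that stands in for one SRU request and returns the position of the next record (or `None`) together with the URLs on that page. The in-memory backend is invented for illustration and is not part of this repository.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical page type: (position of the next record or None, URLs found on this page)
Page = Tuple[Optional[int], List[str]]


def collect_all(fetch_page: Callable[[int], Page], first_record: int = 1) -> List[str]:
    """Follow the 'next record' pointer until the server stops returning one."""
    collected: List[str] = []
    position: Optional[int] = first_record
    while position is not None:
        position, urls = fetch_page(position)
        collected.extend(urls)
    return collected


def make_fake_backend(total: int, page_size: int) -> Callable[[int], Page]:
    """Build an in-memory stand-in for the SRU endpoint (illustration only)."""
    def fetch_page(start: int) -> Page:
        end = min(start + page_size - 1, total)
        urls = ["doc-%d.xml" % n for n in range(start, end + 1)]
        next_start = end + 1 if end < total else None
        return next_start, urls
    return fetch_page


all_urls = collect_all(make_fake_backend(total=7, page_size=3))
print(len(all_urls), all_urls[0], all_urls[-1])
```

Keeping the loop separate from the request makes it possible to test the paging without touching the network.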
        



In [3]:
# Let's just download 20 files and save them in the right place, as a quick sample.
# The process is the same for larger collections, but it will take longer;
# just remove the stopnum parameter and all the found results will be downloaded and saved.

xml_files, xml_metadata_files = get_all_files(start_date='2023-01-01', end_date='2023-02-01',stopnum=20)


100%|███████████████████████████████████████████| 20/20 [00:16<00:00,  1.25it/s]


In [7]:
# Let's create the folders to save our data to (no error if they already exist)
os.makedirs('raw_files', exist_ok=True)
os.makedirs('metadata_files', exist_ok=True)

In [8]:
# Now we will use the metadata we got from the xml files to save the meetings with the 
# filenames used by the ParlaMint conventions.
for xml_file, metadata_file in zip(xml_files, xml_metadata_files):
    # first we will actually already read in the metadata file so that we can appropriately name
    # the source file
    
    parsed_metadata_file = et.fromstring(metadata_file)
    date = parsed_metadata_file.xpath('metadata[@name="OVERHEIDop.datumVergadering"]')[0].attrib['content']
    chamber = parsed_metadata_file.xpath('metadata[@scheme="OVERHEID.StatenGeneraal"]')[0].attrib['content']
    chamber = 'tweedekamer' if ('Tweede Kamer' in chamber) else 'eerstekamer'
    meeting_num = parsed_metadata_file.xpath('metadata[@name="OVERHEIDop.handelingenItemNummer"]')[0].attrib['content']
    dest_file_name = 'ParlaMint-NL_%s-%s-%s.xml' % (date, chamber, meeting_num)
    
    # Write the file to the appropriate location. Because we write the raw string instead of
    # re-serializing it with lxml, the output file doesn't look very nice, but it is still valid XML and this approach is quicker.
    with open(os.path.join('raw_files', dest_file_name), 'w', encoding='utf-8') as f:
        f.write(xml_file)
    
    # we can use lxml to write the other file to xml
    metadata_name = parsed_metadata_file.xpath('metadata[@name="DC.identifier"]')[0].attrib['content']+"_metadata.xml"
    et.ElementTree(parsed_metadata_file).write(os.path.join('metadata_files', metadata_name), pretty_print=True)
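
The naming step above indexes the XPath result with `[0]`, which raises a bare `IndexError` when a metadata field is absent. A minimal sketch of a more explicit variant follows. It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies; the root element name `metadata_gegevens`, the sample values and the helper names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened metadata record (illustration only)
SAMPLE_METADATA = """<metadata_gegevens>
  <metadata name="OVERHEIDop.datumVergadering" content="2020-10-15"/>
  <metadata name="OVERHEIDop.handelingenItemNummer" content="14"/>
</metadata_gegevens>"""


def metadata_value(root, name):
    """Return the 'content' attribute of the named metadata field, or raise a descriptive error."""
    node = root.find("metadata[@name='%s']" % name)
    if node is None:
        raise KeyError("metadata field %r is missing" % name)
    return node.get('content')


def parlamint_filename(date, chamber_label, meeting_num):
    """Build a file name following the pattern used in the cell above."""
    chamber = 'tweedekamer' if 'Tweede Kamer' in chamber_label else 'eerstekamer'
    return 'ParlaMint-NL_%s-%s-%s.xml' % (date, chamber, meeting_num)


root = ET.fromstring(SAMPLE_METADATA)
name = parlamint_filename(
    metadata_value(root, 'OVERHEIDop.datumVergadering'),
    'Tweede Kamer',  # assumed chamber label
    metadata_value(root, 'OVERHEIDop.handelingenItemNummer'),
)
print(name)
```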
    



Let's look at an example to see what the input file looks like.

In [9]:
# Picking the file at index 1 to show a bit of an interesting file
print(xml_files[1])

﻿<?xml version="1.0" encoding="utf-8"?>
<officiele-publicatie xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://technische-documentatie.oep.overheid.nl/schema/op-xsd-2012-1">
  <metadata>
    <meta name="OVERHEIDop.externMetadataRecord" scheme="" content="https://zoek.officielebekendmakingen.nl/h-tk-20222023-24-5/metadata.xml" />
  </metadata>
  <handelingen>
    <agendapunt>
      <nr>5</nr>
      <item-titel>Raad Algemene Zaken van 18 november 2022</item-titel>
      <onderwerp>
        <al-groep>
          <al>Aan de orde is het <nadruk type="vet">tweeminutendebat Raad Algemene Zaken van 18 november 2022</nadruk>.</al>
        </al-groep>
      </onderwerp>
      <spreekbeurt nieuw="nee">
        <spreker>
          <voorvoegsels>De</voorvoegsels>
          <naam>
            <achternaam>voorzitter</achternaam>
          </naam>:</spreker>
        <tekst status="goed">
          <al-groep>
            <al>Een hartelijk 

As can be seen from the output, the file is an XML file in the format of the officielebekendmakingen API. Some parts of this file will be removed
in our process, such as the 'agendapunt' tags, as these will be encoded differently in our new files. The speakers will also be encoded differently and
some grounding will have to take place. To make the process transparent, we include the original names extracted from the raw files as notes in the ParlaMint output, so that it is clear how the names that were found were extracted, and so that you can add your own entity disambiguation/linking pipeline.
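
To get a feeling for the input structure before running the XSLT conversion, the speaker turns can be pulled out with a few lines of Python. This is only a sketch for inspection, not the conversion used in this repo: it relies on the standard library's `xml.etree.ElementTree`, the sample is a shortened variant of the file printed above, and the paragraph text in it is a made-up placeholder.

```python
import xml.etree.ElementTree as ET

# Shortened sample modelled on the raw file above; the <al> text is a placeholder
SAMPLE = """<officiele-publicatie>
  <handelingen>
    <agendapunt>
      <nr>5</nr>
      <spreekbeurt nieuw="nee">
        <spreker>
          <voorvoegsels>De</voorvoegsels>
          <naam>
            <achternaam>voorzitter</achternaam>
          </naam>:</spreker>
        <tekst status="goed">
          <al-groep>
            <al>Voorbeeldtekst van de spreker.</al>
          </al-groep>
        </tekst>
      </spreekbeurt>
    </agendapunt>
  </handelingen>
</officiele-publicatie>"""


def extract_turns(xml_text):
    """Return one dict per <spreekbeurt> with the speaker label and the paragraph texts."""
    root = ET.fromstring(xml_text)
    turns = []
    for turn in root.iter('spreekbeurt'):
        prefix = (turn.findtext('spreker/voorvoegsels') or '').strip()
        surname = (turn.findtext('spreker/naam/achternaam') or '').strip()
        # itertext() also collects text inside inline tags such as <nadruk>
        paragraphs = [''.join(al.itertext()).strip() for al in turn.iter('al')]
        turns.append({'speaker': (prefix + ' ' + surname).strip(), 'paragraphs': paragraphs})
    return turns


for turn in extract_turns(SAMPLE):
    print(turn['speaker'], '->', turn['paragraphs'])
```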

# The next step: The conversion process

The way the code is currently set up, the conversion requires two passes over the data so that the metadata on word- and tag-level statistics is set to the right values. After this, the names of the actors in the files are set
to what they were in the text, which makes the grounding step that follows easier.

In [6]:
# Call saxon with the first conversion script
!java  -jar saxon-he-10.3.jar  -s:raw_files  -xsl:conversion.xsl  -o:semiconverted_files

In [7]:
# output of first conversion on our original file:
print(et.tostring(et.parse('semiconverted_files/ParlaMint-NL_2020-10-15-tweedekamer-14.xml'),pretty_print=True,encoding='unicode'))

<TEI xmlns="http://www.tei-c.org/ns/1.0" ana="#covid" xml:lang="nl" xml:id="ParlaMint-NL_2020-10-15-tweedekamer-14">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title type="main" xml:lang="en">Dutch Parliamentary Corpus ParlaMint-NL, Lower House 2020-10-15 [ParlaMint]</title>
            <title type="main" xml:lang="nl">Corpus van het Nederlandse Parlement ParlaMint-NL, Tweede Kamer 2020-10-15 [ParlaMint]</title>
            <title type="sub" xml:lang="en">Report of the meeting of the Dutch Lower House, Meeting 15, Session 14 (2020-10-15)</title>
            <title type="sub" xml:lang="nl">Verslag van de vergadering van de Tweede Kamer, Meeting 15, Session 14 (2020-10-15)</title>
            <meeting ana="#parla.meeting.regular" corresp="#TK" n="15">Meeting 15</meeting>
            <meeting ana="#parla.session" corresp="#TK" n="14">Session 14</meeting>
            <meeting ana="#parla.term #TK.29" corresp="#TK" n="29-lower">Meeting of the 29th Tweede Kamer</meetin

The above cell shows the output of the first conversion. The main things to notice are the encoding of some attributes in the TEI format and
the alteration of the speaker names to include a party (or a _ if no party is known; this is dealt with later).

In [8]:
# Call saxon again with the second script
!java  -jar saxon-he-10.3.jar  -s:semiconverted_files  -xsl:second_pass.xsl  -o:converted_files
print(et.tostring(et.parse('converted_files/ParlaMint-NL_2020-10-15-tweedekamer-14.xml'),pretty_print=True,encoding='unicode'))

<TEI xmlns="http://www.tei-c.org/ns/1.0" ana="#covid" xml:lang="nl" xml:id="ParlaMint-NL_2020-10-15-tweedekamer-14">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title type="main" xml:lang="en">Dutch Parliamentary Corpus ParlaMint-NL, Lower House 2020-10-15 [ParlaMint]</title>
            <title type="main" xml:lang="nl">Corpus van het Nederlandse Parlement ParlaMint-NL, Tweede Kamer 2020-10-15 [ParlaMint]</title>
            <title type="sub" xml:lang="en">Report of the meeting of the Dutch Lower House, Meeting 15, Session 14 (2020-10-15)</title>
            <title type="sub" xml:lang="nl">Verslag van de vergadering van de Tweede Kamer, Meeting 15, Session 14 (2020-10-15)</title>
            <meeting ana="#parla.meeting.regular" corresp="#TK" n="15">Meeting 15</meeting>
            <meeting ana="#parla.session" corresp="#TK" n="14">Session 14</meeting>
            <meeting ana="#parla.term #TK.29" corresp="#TK" n="29-lower">Meeting of the 29th Tweede Kamer</meetin

The above cell shows the output of the second conversion pass. It looks very similar to the first pass, with some updated counts for words and speeches.

We get no output from either conversion script call, meaning the process has been executed without any errors.
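
Relying on the absence of output works in a notebook, but in a script it can be more robust to look at the exit code. The sketch below builds the same Saxon call as an argument list and runs it with `subprocess`. It assumes that Saxon exits with a non-zero status when a transformation fails; the helper names are hypothetical, and the actual run is left commented out because it needs Java and the jar to be available locally.

```python
import subprocess
from typing import List


def saxon_command(jar: str, source_dir: str, stylesheet: str, output_dir: str) -> List[str]:
    """Assemble the Saxon call used in the cells above as an argument list."""
    return ['java', '-jar', jar, '-s:' + source_dir, '-xsl:' + stylesheet, '-o:' + output_dir]


def run_checked(command: List[str]) -> str:
    """Run a command; subprocess raises CalledProcessError on a non-zero exit code."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout + result.stderr


first_pass = saxon_command('saxon-he-10.3.jar', 'raw_files', 'conversion.xsl', 'semiconverted_files')
second_pass = saxon_command('saxon-he-10.3.jar', 'semiconverted_files', 'second_pass.xsl', 'converted_files')
print(' '.join(first_pass))
# run_checked(first_pass)   # uncomment when Java and the Saxon jar are available
# run_checked(second_pass)
```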

Naturally, the next step is to validate the output files to ensure they adhere to the TEI format.

In [9]:
! java -jar jing.jar ParlaMint-TEI.rng converted_files/ParlaMint-NL_*.xml

We are not getting any output from jing, which means that all the files passed validation.

# Optional step: 'Grounding' the actors

Although 'grounding' all mentions to real people with 100% accuracy is difficult, we do have a method of at least
getting pretty close by using existing records from eerstekamer.nl and tweedekamer.nl. However, this only works for people currently in these chambers and is thus not optimal. To solve this we use the 'parlement.com' website, which has a nice search engine that
works with names and party names. This is useful, as the party is sometimes the only way to distinguish people with the same surname.

In case there is still ambiguity, we simply pick the first occurrence (for the ParlaMint data this was manually corrected, as the number of cases where this happened was manageable). Although not perfect, this works reasonably well.
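
The selection rule described here (match on surname, narrow down by party when one is known, otherwise take the first hit) can be sketched as follows. This is not the code in `utils.py`; the candidate records, their fields and the function name are hypothetical and only illustrate the rule.

```python
from typing import Dict, List, Optional

# Hypothetical candidate records, shaped like search results might be (not real data)
CANDIDATES = [
    {'name': 'Jansen, A.', 'party': 'PartijA', 'link': 'example.invalid/a'},
    {'name': 'Jansen, B.', 'party': 'PartijB', 'link': 'example.invalid/b'},
    {'name': 'Pietersen, C.', 'party': 'PartijA', 'link': 'example.invalid/c'},
]


def pick_candidate(surname: str, party: Optional[str], candidates: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Pick the best candidate: surname match first, then party, then simply the first hit."""
    matches = [c for c in candidates
               if c['name'].split(',')[0].strip().lower() == surname.lower()]
    if party:
        same_party = [c for c in matches if c['party'] == party]
        if same_party:
            matches = same_party
    # Remaining ambiguity is resolved by taking the first occurrence
    return matches[0] if matches else None


print(pick_candidate('Jansen', 'PartijB', CANDIDATES))
print(pick_candidate('Jansen', None, CANDIDATES))
```

Returning `None` for unmatched names keeps those cases visible, so they can be reviewed manually.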

In [10]:
# call the external method 'ground_actors' from utils.py, this will search the names on the parlement.com website based on the names in the xml file
list_of_actors = ground_actors('converted_files')

100%|██████████| 21/21 [00:00<00:00, 1729.41it/s]
100%|██████████| 21/21 [00:00<00:00, 1158.50it/s]


The grounding step changes the converted XML files, so we should of course verify that they
still validate as they should.

In [11]:
! java -jar jing.jar ParlaMint-TEI.rng converted_files/ParlaMint-NL_*.xml

In [12]:
# Print the file with the actor conversion
print(et.tostring(et.parse('converted_files/ParlaMint-NL_2020-10-15-tweedekamer-14.xml'),pretty_print=True,encoding='unicode'))

<TEI xmlns="http://www.tei-c.org/ns/1.0" ana="#covid" xml:lang="nl" xml:id="ParlaMint-NL_2020-10-15-tweedekamer-14">
   <teiHeader>
      <fileDesc>
         <titleStmt>
            <title type="main" xml:lang="en">Dutch Parliamentary Corpus ParlaMint-NL, Lower House 2020-10-15 [ParlaMint]</title>
            <title type="main" xml:lang="nl">Corpus van het Nederlandse Parlement ParlaMint-NL, Tweede Kamer 2020-10-15 [ParlaMint]</title>
            <title type="sub" xml:lang="en">Report of the meeting of the Dutch Lower House, Meeting 15, Session 14 (2020-10-15)</title>
            <title type="sub" xml:lang="nl">Verslag van de vergadering van de Tweede Kamer, Meeting 15, Session 14 (2020-10-15)</title>
            <meeting ana="#parla.meeting.regular" corresp="#TK" n="15">Meeting 15</meeting>
            <meeting ana="#parla.session" corresp="#TK" n="14">Session 14</meeting>
            <meeting ana="#parla.term #TK.29" corresp="#TK" n="29-lower">Meeting of the 29th Tweede Kamer</meetin

From the above cell we can see the grounding: '#KHOllengren' has been created from the 'minister ollengren' reference. The way the code is set up, it is guaranteed (provided there are no spelling mistakes in the source, of course) that
people mentioned in the text will be grounded to the same person, even if they have different functions (such as minister and then just a regular member). People with no party are annotated as guests.
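
One way to make mentions with different functions end up at the same person is to reduce each mention to a key before looking it up. The sketch below illustrates that idea only; it is not taken from `utils.py`, and both the list of titles and the function name are assumptions.

```python
# Hypothetical list of function words and forms of address that may precede a surname
TITLES = ('minister', 'staatssecretaris', 'de heer', 'mevrouw')


def mention_key(mention: str) -> str:
    """Lower-case the mention, collapse whitespace and strip leading titles."""
    text = ' '.join(mention.lower().split())
    changed = True
    while changed:
        changed = False
        for title in TITLES:
            if text.startswith(title + ' '):
                text = text[len(title) + 1:]
                changed = True
    return text


print(mention_key('minister ollengren'), mention_key('Mevrouw  Ollengren'))
```

Because both mentions reduce to the same key, they would be looked up, and therefore grounded, only once.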

# Last step: creating the metadata (partially)

The last step in converting the documents is creating the metadata records for both the speakers and the parties.
We can do this with the data we got from the grounding process.

As the content of the root file that is normally in the corpus is fairly standard, we will simply print the
results of our metadata creation step, this can easily be adapted to write the contents of the output to a template
root file.
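
Writing the parties into a template instead of printing them could look roughly like this. It is a sketch under stated assumptions: the one-element `<listOrg/>` template stands in for the real root file, the function name is hypothetical, and the standard library's `xml.etree.ElementTree` is used so that the example runs without lxml. Building elements instead of formatting strings also means special characters in party names are escaped automatically.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace by the XML specification
XML_ID = '{http://www.w3.org/XML/1998/namespace}id'

TEMPLATE = '<listOrg/>'  # hypothetical stand-in for the real root file template


def add_parties(list_org, parties):
    """Append one <org> element per party, in sorted order for stable output."""
    for party in sorted(parties):
        org = ET.SubElement(list_org, 'org', {'role': 'politicalParty', XML_ID: 'party.%s' % party})
        full = ET.SubElement(org, 'orgName', {'full': 'yes'})
        full.text = party
        short = ET.SubElement(org, 'orgName', {'full': 'init'})
        short.text = party
    return list_org


# None stands for people without a party and is removed, as in the notebook
list_org = add_parties(ET.fromstring(TEMPLATE), {'PVV', 'D66', None} - {None})
print(ET.tostring(list_org, encoding='unicode'))
```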

In [13]:
# We remove None because we are now creating a list of parties; people without a party
# do not contribute to this list
list_of_parties = {val['party'] for val in list_of_actors.values()} - {None}

party_template = """<org role="politicalParty" xml:id="party.%s">
                      <orgName full="yes">%s</orgName>
                      <orgName full="init">%s</orgName>
                   </org>"""
for party in list_of_parties:
    print(party_template % (party, party, party))
    print('\n')
    

<org role="politicalParty" xml:id="party.PVV">
                      <orgName full="yes">PVV</orgName>
                      <orgName full="init">PVV</orgName>
                   </org>


<org role="politicalParty" xml:id="party.PvdD">
                      <orgName full="yes">PvdD</orgName>
                      <orgName full="init">PvdD</orgName>
                   </org>


<org role="politicalParty" xml:id="party.D66">
                      <orgName full="yes">D66</orgName>
                      <orgName full="init">D66</orgName>
                   </org>


<org role="politicalParty" xml:id="party.GroenLinks">
                      <orgName full="yes">GroenLinks</orgName>
                      <orgName full="init">GroenLinks</orgName>
                   </org>


<org role="politicalParty" xml:id="party.50PLUS">
                      <orgName full="yes">50PLUS</orgName>
                      <orgName full="init">50PLUS</orgName>
                   </org>


<org role="politicalParty" 

In [14]:
# now we can do the same thing for the persons, although here we have to do a bit more work
# to get some of the metadata required. We will also include a link to the person on parlement.com
# if the person was found. This will make future extractions of more info easier.
# Although not really meant for this, we will use the <graphic url= ''> TEI tag for this.
# We will also use the information present in the 'functions' of the persons to add
# lines corresponding to their respective functions, where we convert the names of the functions
# to things allowed in the TEI format (minister -> minister) (staatssecretaris -> secretary)

person_template = """<person n="%d" xml:id="%s">
                          <persName>
                             <surname>%s</surname>
                             <forename>%s</forename>
                          </persName>
                      </person>"""

for i, (person, info) in enumerate(list_of_actors.items()):
                         
    id_name = info['id_name']
    if info.get('grounded_name', None):
        full_name = list(reversed(info['grounded_name'].split(',')))
        if len(full_name) == 1:
            full_name = full_name[0]
            forename = " ".join(full_name.split(' ')[:-1])
            surname = full_name.split(' ')[-1]
        else:
            forename = full_name[0]
            surname = full_name[1]
    else:
        forename = 'UNK'
        surname = person
    
    forename = forename.strip()
    surname = surname.strip()
    # now we can create a small xml element from this and add elements if we have them
    person_xml = et.fromstring(person_template % (i+1, id_name, surname, forename))
    if 'deheer' in person.lower():
        sex = 'M'
        sex_name = 'man'
    elif 'mevrouw' in person.lower():
        sex = 'F'
        sex_name = 'vrouw'
    else:
        sex='UNK'
        sex_name = 'unknown'
    # add the info
    person_xml.append(et.fromstring('<sex value="%s">%s</sex>' % (sex, sex_name)))
    
    
    if info['functions']:
        # add that they have been an MP
        person_xml.append(et.fromstring('<affiliation role="MP"/>'))
        if info['party']:
            person_xml.append(et.fromstring('<affiliation role="member" ref="#party.%s"/>'  % info['party']))
        # a 'fractievoorzitter' is excluded here: it contains 'voorzitter' but is not the chair
        is_chair = any('voorzitter' in function and 'fractievoorzitter' not in function for function in info['functions'])
        for function in info['functions']:
            if 'voorzitter' not in function:
                if 'minister' in function:
                    person_xml.append(et.fromstring('<affiliation role="minister"/>'))
                elif 'staatssecretaris' in function:
                    person_xml.append(et.fromstring('<affiliation role="secretary"/>'))
                    
        
        if is_chair:
            person_xml.append(et.fromstring('<affiliation role="chairperson" />'))
    # add the link if it exists
    if info['link']:
        person_xml.append(et.fromstring('<graphic url="%s"/>' % info['link']))
    
    print(et.tostring(person_xml,pretty_print=True).decode())
    
        
# Output can be shown, but is hidden because it is quite large   


<person n="1" xml:id="JanAnthonieBruijn">
                          <persName>
                             <surname>Bruijn</surname>
                             <forename>Jan Anthonie</forename>
                          </persName>
                      <sex value="UNK">unknown</sex><affiliation role="MP"/><affiliation role="chairperson"/><graphic url="www.parlement.com/id/vimphy96a3y5/j_a_jan_anthonie_bruijn"/></person>

<person n="2" xml:id="MJVanRooijen">
                          <persName>
                             <surname>Van Rooijen</surname>
                             <forename>M.J.</forename>
                          </persName>
                      <sex value="M">man</sex><affiliation role="MP"/><affiliation role="secretary"/><graphic url="www.parlement.com/id/vg09llgfy4zy/m_j_martin_van_rooijen"/></person>

<person n="3" xml:id="HansVijlbrief">
                          <persName>
                             <surname>Vijlbrief</surname>
                          

# Conclusion

This notebook gives an example usage of the end-to-end pipeline for converting parliamentary documents from the 'officielebekendmakingen.nl' website to the TEI format. It does this by using
two conversion scripts for the XML conversion and the 'parlement.com' website for grounding speakers and adding metadata records.

# Discussion

The above code can be used in a near end-to-end fashion to convert existing officielebekendmakingen records to ParlaMint TEI. The conversion works quite well; however, there are some things to watch out for,
which are listed below.

1. Missing metadata
    - The current script just grabs all 'Handelingen'. These sometimes also contain things like attachments, which are separate objects without metadata. The first conversion pass will
    notice this and throw an error, meaning that the document is not processed further by the pipeline. This is not really a problem if you are not interested in these documents;
    otherwise a filter should probably be added to the first URL to filter them out, or they should be saved to a different location and processed appropriately.
    - The current way of automatically extracting metadata from speakers is not perfect. Although it gets a lot of the speakers right, it does make mistakes. It is probably best
    to check the metadata manually for small datasets, but this is of course more difficult for larger datasets.
    - The metadata records retrieved now are very basic. The forenames are just initials and no data about the specific functions of people other than chair is mentioned (nor whether they switched parties, etc.).
    Although not included in the dataset, I have added the possibility of doing this later by adding the link to the parlement.com page of the politician.
    This can be scraped to get more information that can be added to the records.
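
For the first point in the list above, a small pre-filter could separate documents that lack the metadata needed for naming before they reach the conversion. The following is a sketch, not part of the current pipeline: the helper names, the choice of required fields and the sample records (including the `metadata_gegevens` root element) are assumptions, and the standard library's `xml.etree.ElementTree` is used in place of lxml.

```python
import xml.etree.ElementTree as ET

# Fields the naming step needs; this selection is an assumption for the sketch
REQUIRED = ('OVERHEIDop.datumVergadering', 'OVERHEIDop.handelingenItemNummer')


def has_required_metadata(metadata_xml, required=REQUIRED):
    """True if the record parses and contains every required metadata field."""
    try:
        root = ET.fromstring(metadata_xml)
    except ET.ParseError:
        return False
    return all(root.find("metadata[@name='%s']" % name) is not None for name in required)


def partition(documents):
    """Split (xml_text, metadata_xml) pairs into complete and incomplete ones."""
    complete, incomplete = [], []
    for xml_text, metadata_xml in documents:
        target = complete if has_required_metadata(metadata_xml) else incomplete
        target.append((xml_text, metadata_xml))
    return complete, incomplete


# Hypothetical records for illustration
good = ('<metadata_gegevens>'
        '<metadata name="OVERHEIDop.datumVergadering" content="2023-01-17"/>'
        '<metadata name="OVERHEIDop.handelingenItemNummer" content="5"/>'
        '</metadata_gegevens>')
attachment = '<metadata_gegevens><metadata name="DC.identifier" content="bijlage"/></metadata_gegevens>'

complete, incomplete = partition([('<doc/>', good), ('<doc/>', attachment), ('<doc/>', 'not xml')])
print(len(complete), len(incomplete))
```

The incomplete documents could then be written to a separate folder and processed in a way that fits them.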