## Debates to RDF

This notebook is exploratory code updating the Akoma Ntoso (AKN) debates XML to reflect changes made to the ontology and then extracting metadata to RDF. 

For more information on the ontology, read the [Debates section of the wiki](https://github.com/Oireachtas/ontology/wiki/Debates)

I am making the changes to [the AKN file](../debates/AK-dail-2015-11-12.xml)  in the debates folder, and saving the changes as a new file.

In [1]:
import re
import json
from lxml import etree
from rdflib import URIRef, Literal, Namespace, Graph
from rdflib.namespace import RDF, OWL, SKOS, DCTERMS, XSD, RDFS, FOAF
from dateutil.parser import parse
from datetime import datetime

In [2]:
AKN = {"akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13"}
xml = "../debates/AK-dail-2015-11-12.xml"
baseURI = "http://oireachtas.ie"

### Metadata elements


As the file contains data for the oral part of the debate (writtens are stored separately), the FRBRWork attributes need to be updated. See the [Metadata](https://github.com/Oireachtas/ontology/wiki/Debates#metadata) section of the wiki for the specification.

Note that under the Akoma Ntoso naming convention, the "@" character in the FRBRExpression URI denote them as the original expression of the work. This is not strictly true in the case of revised Official Reports but we have no way of telling the difference at the moment, so the original expression of a debate is whatever this file turns out to be.

The original FRBRExpression URIs have a language value of ``eng``, however, this should be ``mul`` because it is not (easily) possible to determine whether a debate is in English or Irish.

In [35]:
#last part of the regex included to prevent duplicate insertions
regex = re.compile("\d{4}-\d{2}-\d{2}(?!/debate|/writtens)")

root = etree.parse(xml).getroot()

work = root.find(".//{*}FRBRWork")
etree.SubElement(work, "FRBRname", {"value": "debate"})


name = root.find(".//{*}FRBRWork/{*}FRBRname").attrib['value']
for uri in root.xpath(".//akn:identification/*//*[starts-with(@value, '/akn')]", namespaces=AKN):
    print(re.sub("{.+}", "", uri.getparent().tag)+ "/" + re.sub("{.+}", "", uri.tag) )
    value = uri.attrib['value'].replace("eng@", "mul@")
    print("Original:", value)
    span = regex.search(value).span()
    uri.attrib['value'] = value[:span[1]] + "/" + name + value[span[1]:]
    print("New:", uri.attrib['value'], "\n---")

FRBRWork/FRBRthis
Original: /akn/ie/debateRecord/dail/2015-11-12/main
New: /akn/ie/debateRecord/dail/2015-11-12/debate/main 
---
FRBRWork/FRBRuri
Original: /akn/ie/debateRecord/dail/2015-11-12
New: /akn/ie/debateRecord/dail/2015-11-12/debate 
---
FRBRExpression/FRBRthis
Original: /akn/ie/debateRecord/dail/2015-11-12/mul@/main
New: /akn/ie/debateRecord/dail/2015-11-12/debate/mul@/main 
---
FRBRExpression/FRBRuri
Original: /akn/ie/debateRecord/dail/2015-11-12/mul@
New: /akn/ie/debateRecord/dail/2015-11-12/debate/mul@ 
---
FRBRManifestation/FRBRthis
Original: /akn/ie/debateRecord/dail/2015-11-12/mul@/main.xml
New: /akn/ie/debateRecord/dail/2015-11-12/debate/mul@/main.xml 
---
FRBRManifestation/FRBRuri
Original: /akn/ie/debateRecord/dail/2015-11-12/mul@.akn
New: /akn/ie/debateRecord/dail/2015-11-12/debate/mul@.akn 
---


Get the correct Dáil for Minister references.

In [36]:
date = parse(root.xpath(".//akn:FRBRWork/akn:FRBRdate/@date", namespaces=AKN)[0])

with open ("../data/dail.json", "r") as f:
    dail = json.load(f)
for d in dail:
    if parse(d['start']) <= date <= parse(d['end']):
        house_uri = d['houseURI'].replace("/house", "")
house_uri   

'/dail/31'

Add a heading to the Prelude debateSection - this heading is displayed on the web but is not in the original XML.

In [37]:
heading = etree.SubElement(root.xpath(".//akn:debateSection[@name='prelude']", namespaces=AKN)[0], "heading")
heading.text = "Prelude"

# Correct generation and reported href errors
for concept in root.xpath(".//akn:TLCConcept[@eId='generation' or @eId='reported' or @eId='publication']", namespaces=AKN):
    concept.attrib['href'] = concept.attrib['href'].replace("oireachtas//", "oireachtas/")

### Unmatched Members
Will need to audit unmatched member URIs. Thought I had fixed them already.


In [38]:
for person in root.xpath(".//akn:meta/akn:references/akn:TLCPerson", namespaces=AKN):
    person.attrib['href'] = person.attrib['href'] + house_uri

### TLCPerson references

TLCPerson references are to the OIR:Member URI in the original AKN. However, using the specific org:Membership (of a Dáil or Seanad) would make it easier to link to other information needed in the website, like constituency and party, which otherwise would require a more expensive query within date ranges. The information associated with OIR:Member is only one step away. However, there may be a cost to this when it comes to searching for speeches by a Member over multiple houses. For that reason, it would be worthwhile testing this over a larger set of debate files.

In [39]:
with open("../data/members.json", "r") as f:
    memberLU = {m['pId']: m['eId'] for m in json.load(f)}

In [40]:
def assert_URI_has_right_number_of_elements(uri, right_len):
    uri_len = len(uri.split("/"))
    try:
        assert uri_len == right_len
    except AssertionError as e:
        e.args += ("URI: {0} Incorrect length. Should have {1} elements but it has {2}".format(uri, right_len, uri_len),)
        raise

#Michael Kitt the latter has a pId of MichaelPKitt
unmatched = root.xpath(".//akn:TLCPerson[contains(@href, 'unmatched')]", namespaces=AKN)
for unm in unmatched:
    unm.attrib['href'] = unm.attrib['href'].replace("/member/unmatchedMember", memberLU['MichaelPKitt'])
    unm.attrib['showAs'] = "Mr. Michael P. Kitt"
    print(unm.attrib)
    assert_URI_has_right_number_of_elements(unm.attrib['href'], 7)
for person in root.xpath(".//akn:TLCPerson", namespaces=AKN):
    assert_URI_has_right_number_of_elements(person.attrib['href'], 7)

{'eId': 'MichaelKitt', 'href': '/ie/oireachtas/member/Michael-P-Kitt.D.1975-03-04/dail/31', 'showAs': 'Mr. Michael P. Kitt'}


Fix TLC eId references starting with "#"

In [41]:
for tlc in root.xpath(".//akn:references/*[starts-with(@eId, '#')]", namespaces=AKN):
    tlc.attrib['eId'] = tlc.attrib['eId'][1:]

### Speeches and questions

Questions are addressed to a function of the relevant department rather than a Minister, should update accordingly.

In [42]:
with open("../data/government_members.json", "r") as f:
    ministers = json.load(f)
functions = {m['uri'].split("/")[-1] for m in ministers}

for m in ministers:
    m['function'] = m['uri'].split("/")[-1].split("__")

# Update records for end of Cabinet for 31st Dáil
end_date = "2016-05-06"
for m in ministers:
    if m['end'] is None:
        m['end'] = end_date
        m['cabinets'][0]['end'] = end_date

to_roles = set(root.xpath(".//akn:question", namespaces=AKN))
for role in to_roles:
    f = role.attrib['to'].split("for_")[-1].lower()
    #print(f)
    for m in ministers:
        if "functions" in m and f in m['functions'] and (parse(m['start']) <= date <= parse(m['end'])):
            to_ref = m['office'].replace(" ", "_")
            role_eId = root.xpath(".//akn:TLCRole[@eId='{}']/@eId".format(to_ref), namespaces=AKN)
            if len(role_eId) == 1:
                role.attrib['to'] = "#"+role_eId[0]
        



### Deriving Member office titles from debates XML

Accurate information about all office holders is not yet available in structured format. A workaround for this would be to find Members' office titles from the debates XML in the speaker/@as attribute. 

A Minister is only identified by title the first and last time he or she speaks in a particular debate. This information would be more usefully associated with every instance of that Minister speaking.

In [None]:
for role in root.xpath(".//akn:TLCRole", namespaces=AKN):
    if not role.attrib['eId'] in ["author", "editor", "Acting_Chairman"]:
        person = root.xpath(".//akn:speech[@as='{}']/@by".format("#"+role.attrib['eId']), namespaces=AKN)
        if len(person) > 0: 
            tlc_p = root.xpath(".//akn:TLCPerson[@eId='{}']/@href".format(person[0][1:]), 
                                namespaces=AKN)[0]
            role.attrib['href'] = tlc_p+role.attrib['href'].replace("/ie/oireachtas", "")

debates_with_speeches = root.xpath(".//akn:debateSection[.//akn:speech]", namespaces=AKN)
for debate in debates_with_speeches:
    for speech in debate.xpath(".//akn:speech[@as]", namespaces=AKN):
        other_speeches_as = debate.xpath(".//akn:speech[@by='{}'][not(@as)]".format(speech.attrib['by']), 
                                         namespaces=AKN)
        for other_speech in other_speeches_as:
            
            other_speech.attrib['as'] = speech.attrib['as']
            

### Converting to RDF

When converting to RDF, FRBR elements map to their RDA equivalents. I'm mapping only the FRBRuri elements for now. 

ToDo: Extend ontology to cover both contributors as those listed as TLCPerson as well as speakers as those identified in speech nodes.

In [21]:
g = Graph()

In [22]:
OIR = Namespace("http://oireachtas.ie/ontology#")
RDA = Namespace("http://www.rdaregistry.info/Elements/c/#")
METALEX = Namespace("http://www.metalex.eu/metalex/2008-05-02#")

g.bind("oir", OIR)

In [23]:
workURI = baseURI+root.xpath(".//akn:FRBRWork/akn:FRBRuri/@value", 
                                  namespaces=AKN)[0]
work = URIRef(workURI)

In [24]:
# C10001 is RDA:Work
g.add(( work, 
       RDF.type, 
       RDA.C10001))
# C10006 is RDA:Expression
g.add(( URIRef(baseURI+root.xpath(".//akn:FRBRExpression/akn:FRBRuri/@value", 
                                  namespaces=AKN)[0]), 
       RDF.type, 
       RDA.C10006))
# C10007 is RDA:Manifestation
g.add(( URIRef(baseURI+root.xpath(".//akn:FRBRManifestation/akn:FRBRuri/@value", 
                                  namespaces=AKN)[0]), 
       RDF.type, 
       RDA.C10007))
g.add(( work, 
       RDF.type, 
       OIR.DebateRecord))

### Debate sections

The individual debates in OIR:Debate are contained in a sequence of OIR:DebateSection instances, which may in turn contain further OIR:DebateSection instances.

Each DebateSection has a property ``OIR:debateSectionOf``, with its parent as object and an inverse property ``OIR:debateSection`` between parent and child DebateSection.

The first DebateSection in a sequence is denoted with the property ``OIR:firstDebateSectionOf`` of its parent instance(either type ``OIR:Debate`` or ``OIR:DebateSection``). They also have an inverse property, ``OIR:firstDebateSection``. 

Each DebateSection except the last has the property ``OIR:nextDebateSection``, taking as its object the succeeding DebateSection. 

Each DebateSection after the first has a property ``OIR:nextDebateSectionOf`` taking as its object the preceeding OIR:DebateSection.

In [25]:
#g = Graph()
def seq_dbs_uri(dbsects, i, dbs_uri, rel):
    other_dbs_uri = workURI+"/"+dbsects[i].attrib['eId']
    relation = "nextDebateSection"+rel
    g.add(( URIRef(dbs_uri), OIR[relation], URIRef(other_dbs_uri) ))

def child_dbs_uri(parent_uri, dbs_uri, relation):
    g.add(( URIRef(parent_uri), OIR[relation], URIRef(dbs_uri)))
    g.add(( URIRef(dbs_uri), OIR[relation+"Of"], URIRef(parent_uri)))

def debate_type(dbs):
    g.add(( URIRef(dbs_uri), RDF.type, OIR.DebateSection ))
    name = dbs.attrib['name']
    name = name.title()+"Debate" if name != "debate" else name.title()
    g.add(( URIRef(dbs_uri), OIR.debateType, OIR[name]))
    
dbsects = root.xpath(".//akn:debateBody/akn:debateSection", namespaces=AKN)

for i, dbs in enumerate(dbsects):    
    dbs_uri = workURI+"/"+dbs.attrib['eId']
    debate_type(dbs)
    child_dbs_uri(workURI, dbs_uri, "debateSection" )
    if i == 0:
        child_dbs_uri(workURI, dbs_uri, "firstDebateSection" )
    else:
        seq_dbs_uri(dbsects, i-1, dbs_uri, "Of")
    if i != len(dbsects)-1:
        seq_dbs_uri(dbsects, i+1, dbs_uri, "")
    subdbsects = dbs.xpath("./akn:debateSection[./akn:heading]", namespaces=AKN)
    for n, subdbs in enumerate(subdbsects):
        subdbs_uri = workURI+"/"+subdbs.attrib['eId']
        debate_type(subdbs)
        child_dbs_uri(dbs_uri, dbs_uri, "debateSection" )
        if n == 0:
            child_dbs_uri(dbs_uri, subdbs_uri, "firstDebateSection")
        else:
            seq_dbs_uri(subdbsects, n-1, subdbs_uri, "Of")
        if n != len(subdbsects)-1:
            seq_dbs_uri(subdbsects, n+1, subdbs_uri, "")

Leaders Questions has name attribute "questions" when it's really just a debate. The ``questions`` attribute should only refer to numbered/formal questions (this has implications for early years when questions weren't numbered.) It may not matter so much though when written answers are broken out.

In [26]:
#g = Graph()
def match_tlc_uri(ref, tlc):
    tlc_uri = root.xpath(".//akn:{0}[@eId='{1}']".format(tlc, ref[1:]), namespaces=AKN)
    try:
        assert len(tlc_uri) == 1
    except AssertionError as e:
        e.args += ( "{0} returned {1} TLCPerson matches, should be 1".format(ref, 
                                                                             len(by_uri)),
                   )
        raise
    return baseURI + tlc_uri[0].attrib['href']

def seq_uri(parent, i, this_uri, rel):
    other_uri = workURI+"/"+parent[i].attrib['eId']
    g.add(( URIRef(this_uri), OIR[rel], URIRef(other_uri) ))
    
def child_uri(parent_uri, child_uri, relation):
    g.add(( URIRef(parent_uri), OIR[relation], URIRef(child_uri)))
    g.add(( URIRef(child_uri), OIR[relation+"Of"], URIRef(parent_uri)))

def keep_sequence(workURI, parent_list, this_uri, i, rel):
    if i == 0:
        child_uri(workURI, this_uri, "first{}".format(rel) )
    else:
        seq_uri(parent_list, i-1, pq_uri, "next{}Of".format(rel))
    if i != len(parent_list)-1:
        seq_uri(parent_list, i+1, pq_uri, "next{}".format(rel))
    
questions = root.xpath(".//akn:question", namespaces=AKN)
for i, pq in enumerate(questions):
    order = int(pq.attrib['eId'].split("_")[-1])
    pq_uri = workURI+"/"+pq.attrib['eId']
    dbs_uri = workURI+"/"+pq.getparent().attrib['eId']
    by_uri = match_tlc_uri(pq.attrib['by'], "TLCPerson")
    to_uri = match_tlc_uri(pq.attrib['to'], "TLCRole")
    g.add(( URIRef(pq_uri), RDF.type, OIR.Question))
    g.add(( URIRef(pq_uri), OIR.by, URIRef(by_uri)))
    g.add(( URIRef(pq_uri), OIR.to, URIRef(to_uri)))
    g.add(( URIRef(pq_uri), OIR.questionNo, Literal(order, datatype=XSD.integer)))
    keep_sequence(workURI, questions, pq_uri, i, "Question")
    child_uri(dbs_uri, pq_uri, "part")
    
answers = root.xpath(".//akn:answer", namespaces=AKN) # not in current sample file
print(len(g))

198


In [27]:
#g = Graph()
debate_speeches = root.xpath(".//akn:debateSection[./akn:speech or ./akn:answer]", namespaces=AKN)
for debate in debate_speeches:
    
    speeches = debate.xpath("./akn:speech|./akn:answer", namespaces=AKN)
    for i, spk in enumerate(speeches):
        
        spk_uri = workURI + "/" + spk.attrib['eId']
        dbs_uri = workURI + "/" + spk.getparent().attrib['eId']
        if len(spk.attrib['by']) > 1:
            by_uri = match_tlc_uri(spk.attrib['by'], "TLCPerson")
            g.add(( URIRef(spk_uri), OIR.by, URIRef(by_uri)))
    
        if "as" in spk.attrib:
            as_uri = match_tlc_uri(spk.attrib['as'], 
                                    "TLCRole")
            
            g.add(( URIRef(spk_uri), OIR['as'], URIRef(as_uri)))
        keep_sequence(dbs_uri, speeches, spk_uri, i, "Speech")
        child_uri(dbs_uri, pq_uri, "part")
print(len(g))

1969


### Bills and amendments


In [28]:
for dbs in root.xpath(".//akn:debateSection[@refersTo]", namespaces=AKN):
    bill_uri = baseURI + root.xpath(".//akn:TLCEvent[@eId='{}']/@href".format(dbs.attrib['refersTo'][1:]), namespaces=AKN)[0]
    dbs_uri = workURI+"/"+dbs.attrib["eId"]
    g.add(( URIRef(bill_uri), OIR.debate, URIRef(dbs_uri) ))
    g.add(( URIRef(dbs_uri), OIR.debateOf, URIRef(bill_uri) ))

** TODO: update TLCEvent for missing amendments. **

** TODO: differentiate sections (sec) and recommendations (rec) from amendments **

In [29]:
#g = Graph()
missed_amd = []

for amd in root.xpath(".//akn:entity[@name='amendment']", namespaces=AKN):
    amd_tlc = root.xpath(".//akn:TLCEvent[@eId='{}']".format(amd.attrib["refersTo"][1:]), namespaces=AKN)
    if len(amd_tlc) == 0: 
        missed_amd.append(amd.attrib['refersTo'][1:])
    amd_uri = amd.attrib['refersTo'][1:].replace(".", "/")
    amd_len = len(amd_uri.split("/"))
    deb_uri = workURI + "/" + amd.getparent().attrib["eId"]
    g.add(( URIRef(amd_uri), OIR.eventDebate, URIRef(deb_uri)))
    g.add(( URIRef(deb_uri), OIR.eventDebateOf, URIRef(amd_uri)))
    if amd_len == 6:
        amd_components = amd_uri.split("/amd_")
        g.add(( URIRef(amd_uri), RDF.type, OIR.BillAmendment))
        g.add(( URIRef(amd_uri), OIR.amendmentStage, URIRef(amd_components[0])))
        g.add(( URIRef(amd_components[0]), OIR.amendmentStageOf, URIRef(amd_uri)))
        g.add(( URIRef(amd_uri), OIR.amendmentNo, Literal(amd_components[1], datatype=XSD.string)))
        g.add(( URIRef(amd_uri), OIR.eventDebate, URIRef(deb_uri)))
    else:
        amd_components = amd_uri.split("/")
        outcome = amd_components[-1].title()
        affected_amd_uri = "/".join(amd_components[:-1])
        g.add(( URIRef(amd_uri), RDF.type, OIR.BillEventOutcome ))
        #g.add(( URIRef(amd_uri), OIR.outcomeOf, URIRef(affect_amd_uri) ))
        g.add(( URIRef(affected_amd_uri), METALEX.result, URIRef(amd_uri) ))
        

    
missed_amd = set(missed_amd)
len(g)

2215

### Votes

Members who are participants are linked to specific speech of voting instances by the object categories of OIR:speakerOf, OIR:voterOf or OIR:attendeeOf. A voter will also be one of OIR:taVoter, OIR:nilVoter or OIR:staonVoter

** TODO: Describe vote_matter in terms of whether it's a Bill element or a debate element **

** TODO: Decide whether to update voter href to debateSection rather than summary **

In [30]:
#g = Graph()

for vote in root.xpath(".//akn:voting", namespaces=AKN):
    vote_uri = workURI + "/" + vote.attrib['eId']
    vote_dbs = root.xpath(".//akn:debateSection[akn:summary/@eId='{}']".format(vote.attrib['href'][1:]), namespaces=AKN)[0]
    vote_ref = workURI + "/" + vote_dbs.attrib['eId']
    #print(vote_ref)
    if vote.attrib['refersTo'].startswith("#bill"):
        vote_matter = baseURI + "/" + vote.attrib['refersTo'][1:].replace(".", "/")
    else:
        vote_matter = workURI + vote.attrib['refersTo'][1:]
    g.add(( URIRef(vote_uri), OIR.divisionOf, URIRef(vote_matter) ))
    g.add(( URIRef(vote_matter), OIR.division, URIRef(vote_uri) ))
    g.add(( URIRef(vote_uri), OIR.debate, URIRef(vote_ref) ))
    g.add(( URIRef(vote_ref), OIR.debateOf, URIRef(vote_uri) ))
    for count in vote.xpath("./akn:count", namespaces=AKN):
        v_type_uri = baseURI + root.xpath(".//akn:TLCConcept[@eId='{}']/@href".format(count.attrib['refersTo'][1:]), 
                        namespaces=AKN)[0]
        count_uri = workURI + "/" + count.attrib['eId']
        g.add(( URIRef(count_uri), RDF.type, URIRef(v_type_uri) ))
        count_dbs = vote_dbs.xpath("./akn:debateSection[@name='{}']".format(count.attrib['refersTo'][1:]), 
                            namespaces=AKN)[0]
        for person in count_dbs.xpath(".//akn:person/@refersTo", namespaces=AKN):
            voter_uri = baseURI + root.xpath(".//akn:TLCPerson[@eId='{}']/@href".format(person[1:]),
                                  namespaces=AKN)[0]
            g.add(( URIRef(count_uri), OIR.voter, URIRef(voter_uri) ))
            g.add(( URIRef(voter_uri), OIR.voterOf, URIRef(count_uri) ))
len(g)

5677

In [46]:
with open("../debates/AK-dail-2015-11-12-v2.xml", "wb") as f: 
    f.write(etree.tostring(root, xml_declaration=True, encoding="utf-8"))

In [32]:
today = datetime.today().date()
print(today)
g.serialize("../data/debates_{}_{}.ttl".format(house_uri[1:].replace("/", "_"), date.date()), format="turtle")

2016-08-05
