## Debates to RDF

This notebook is exploratory code updating the Akoma Ntoso (AKN) debates XML to reflect changes made to the ontology and then extracting metadata to RDF. 

For more information on the ontology, read the [Debates section of the wiki](https://github.com/Oireachtas/ontology/wiki/Debates)

I am making the changes to [the AKN file](../debates/AK-dail-2015-11-12.xml)  in the debates folder, and saving the changes as a new file.

In [3]:
import re
import json
from lxml import etree
from rdflib import URIRef, Literal, Namespace, Graph
from rdflib.namespace import RDF, OWL, SKOS, DCTERMS, XSD, RDFS, FOAF

In [2]:
AKN = {"akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13"}
xml = "../debates/AK-dail-2015-11-12.xml"
baseURI = "http://oireachtas.ie"

### Metadata elements


As the file contains data for the oral part of the debate (writtens are stored separately), the FRBRWork attributes need to be updated. See the [Metadata](https://github.com/Oireachtas/ontology/wiki/Debates#metadata) section of the wiki for the specification.

Note that under the Akoma Ntoso naming convention, the "@" character in the FRBRExpression URI denote them as the original expression of the work. This is not strictly true in the case of revised Official Reports but we have no way of telling the difference at the moment, so the original expression of a debate is whatever this file turns out to be.

The original FRBRExpression URIs have a language value of ``eng``, however, this should be ``mul`` because it is not (easily) possible to determine whether a debate is in English or Irish.

In [71]:
#last part of the regex included to prevent duplicate insertions
regex = re.compile("\d{4}-\d{2}-\d{2}(?!/debate|/writtens)")

root = etree.parse(xml).getroot()

work = root.find(".//{*}FRBRWork")
etree.SubElement(work, "FRBRname", {"value": "debate"})


name = root.find(".//{*}FRBRWork/{*}FRBRname").attrib['value']
for uri in root.xpath(".//akn:identification/*//*[starts-with(@value, '/akn')]", namespaces=AKN):
    print(re.sub("{.+}", "", uri.getparent().tag)+ "/" + re.sub("{.+}", "", uri.tag) )
    value = uri.attrib['value'].replace("eng@", "mul@")
    print("Original:", value)
    span = regex.search(value).span()
    uri.attrib['value'] = value[:span[1]] + "/" + name + value[span[1]:]
    print("New:", uri.attrib['value'], "\n---")

FRBRWork/FRBRthis
Original: /akn/ie/debateRecord/dail/2015-11-12/main
New: /akn/ie/debateRecord/dail/2015-11-12/debate/main 
---
FRBRWork/FRBRuri
Original: /akn/ie/debateRecord/dail/2015-11-12
New: /akn/ie/debateRecord/dail/2015-11-12/debate 
---
FRBRExpression/FRBRthis
Original: /akn/ie/debateRecord/dail/2015-11-12/mul@/main
New: /akn/ie/debateRecord/dail/2015-11-12/debate/mul@/main 
---
FRBRExpression/FRBRuri
Original: /akn/ie/debateRecord/dail/2015-11-12/mul@
New: /akn/ie/debateRecord/dail/2015-11-12/debate/mul@ 
---
FRBRManifestation/FRBRthis
Original: /akn/ie/debateRecord/dail/2015-11-12/mul@/main.xml
New: /akn/ie/debateRecord/dail/2015-11-12/debate/mul@/main.xml 
---
FRBRManifestation/FRBRuri
Original: /akn/ie/debateRecord/dail/2015-11-12/mul@.akn
New: /akn/ie/debateRecord/dail/2015-11-12/debate/mul@.akn 
---


Add a heading to the Prelude debateSection - this heading is displayed on the web but is not in the original XML.

In [40]:
heading = etree.SubElement(root.xpath(".//akn:debateSection[@name='prelude']", namespaces=AKN)[0], "heading")
heading.text = "Prelude"

### TLCPerson references

TLCPerson references are to the OIR:Member URI in the original AKN. However, using the specific org:Membership (of a Dáil or Seanad) would make it easier to link to other information needed in the website, like constituency and party, which otherwise would require a more expensive query within date ranges. The information associated with OIR:Member is only one step away. However, there may be a cost to this when it comes to searching for speeches by a Member over multiple houses. For that reason, it would be worthwhile testing this over a larger set of debate files.

In [80]:
for person in root.xpath(".//akn:meta/akn:references/akn:TLCPerson", namespaces=AKN):
    person.attrib['href'] = person.attrib['href'] + "/dail/31"

### Unmatched Members
Will need to audit unmatched member URIs. Thought I had fixed them already.


In [36]:
with open("../data/members.json", "r") as f:
    memberLU = {m['pId']: m['eId'] for m in json.load(f)}

In [82]:
def assert_URI_has_right_number_of_elements(uri, right_len):
    uri_len = len(uri.split("/"))
    try:
        assert uri_len == right_len
    except AssertionError as e:
        e.args += ("URI: {0} Incorrect length. Should have {1} elements but it has {2}".format(uri, right_len, uri_len),)
        raise

#Michael Kitt the latter has a pId of MichaelPKitt
unmatched = root.xpath(".//akn:TLCPerson[contains(@href, 'unmatched')]", namespaces=AKN)
for unm in unmatched:
    unm.attrib['href'] = unm.attrib['href'].replace("/member/unmatchedMember", memberLU['MichaelPKitt'])
    unm.attrib['showAs'] = "Mr. Michael P. Kitt"
    print(unm.attrib)
    assert_URI_has_right_number_of_elements(unm.attrib['href'], 7)

for person in root.xpath(".//akn:TLCPerson", namespaces=AKN):
    
    assert_URI_has_right_number_of_elements(person.attrib['href'], 7)

### Converting to RDF

When converting to RDF, FRBR elements map to their RDA equivalents. I'm mapping only the FRBRuri elements for now. 

ToDo: Extend ontology to cover both contributors as those listed as TLCPerson as well as speakers as those identified in speech nodes.

In [119]:
g = Graph()

In [120]:
OIR = Namespace("http://oireachtas.ie/ontology#")
RDA = Namespace("http://www.rdaregistry.info/Elements/c/#")
METALEX = Namespace("http://www.metalex.eu/metalex/2008-05-02#")

In [121]:
workURI = baseURI+root.xpath(".//akn:FRBRWork/akn:FRBRuri/@value", 
                                  namespaces=AKN)[0]
work = URIRef(workURI)

In [116]:
# C10001 is RDA:Work
g.add(( work, 
       RDF.type, 
       RDA.C10001))
# C10006 is RDA:Expression
g.add(( URIRef(baseURI+root.xpath(".//akn:FRBRExpression/akn:FRBRuri/@value", 
                                  namespaces=AKN)[0]), 
       RDF.type, 
       RDA.C10006))
# C10007 is RDA:Manifestation
g.add(( URIRef(baseURI+root.xpath(".//akn:FRBRManifestation/akn:FRBRuri/@value", 
                                  namespaces=AKN)[0]), 
       RDF.type, 
       RDA.C10007))
g.add(( work, 
       RDF.type, 
       OIR.DebateRecord))

### Debate sections

The individual debates in OIR:Debate are contained in a sequence of OIR:DebateSection instances, which may in turn contain further OIR:DebateSection instances.

Each DebateSection has a property ``OIR:debateSectionOf``, with its parent as object and an inverse property ``OIR:debateSection`` between parent and child DebateSection.

The first DebateSection in a sequence is denoted with the property ``OIR:firstDebateSectionOf`` of its parent instance(either type ``OIR:Debate`` or ``OIR:DebateSection``). They also have an inverse property, ``OIR:firstDebateSection``. 

Each DebateSection except the last has the property ``OIR:nextDebateSection``, taking as its object the succeeding DebateSection. 

Each DebateSection after the first has a property ``OIR:nextDebateSectionOf`` taking as its object the preceeding OIR:DebateSection.

In [162]:
g = Graph()
def seq_dbs_uri(dbsects, i, dbs_uri, rel):
    other_dbs_uri = workURI+"/"+dbsects[i].attrib['eId']
    relation = "nextDebateSection"+rel
    g.add(( URIRef(dbs_uri), OIR[relation], URIRef(other_dbs_uri) ))

def child_dbs_uri(parent_uri, dbs_uri, relation):
    g.add(( URIRef(parent_uri), OIR[relation], URIRef(dbs_uri)))
    g.add(( URIRef(dbs_uri), OIR[relation+"Of"], URIRef(parent_uri)))

def debate_type(dbs):
    g.add(( URIRef(dbs_uri), RDF.type, OIR.DebateSection ))
    name = dbs.attrib['name']
    name = name+"Debate" if name != "debate" else name
    g.add(( URIRef(dbs_uri), OIR.debateType, OIR[name.title()]))
    
dbsects = root.xpath(".//akn:debateBody/akn:debateSection", namespaces=AKN)

for i, dbs in enumerate(dbsects):    
    dbs_uri = workURI+"/"+dbs.attrib['eId']
    debate_type(dbs)
    child_dbs_uri(workURI, dbs_uri, "debateSection" )
    if i == 0:
        child_dbs_uri(workURI, dbs_uri, "firstDebateSection" )
    else:
        seq_dbs_uri(dbsects, i-1, dbs_uri, "Of")
    if i != len(dbsects)-1:
        seq_dbs_uri(dbsects, i+1, dbs_uri, "")
    subdbsects = dbs.xpath("./akn:debateSection[./akn:heading]", namespaces=AKN)
    for n, subdbs in enumerate(subdbsects):
        subdbs_uri = workURI+"/"+subdbs.attrib['eId']
        debate_type(subdbs)
        child_dbs_uri(dbs_uri, dbs_uri, "debateSection" )
        if n == 0:
            child_dbs_uri(dbs_uri, subdbs_uri, "firstDebateSection")
        else:
            seq_dbs_uri(subdbsects, n-1, subdbs_uri, "Of")
        if n != len(subdbsects)-1:
            seq_dbs_uri(subdbsects, n+1, subdbs_uri, "")

### Speeches and questions

Questions are addressed to a function of the relevant department rather than a Minister, should update accordingly.

In [142]:
[q.attrib['to'] for q in root.findall(".//{*}question")][0]

'#Minister_for_Arts'

Leaders Questions has name attribute "questions" when it's really just a debate. The ``questions`` attribute should only refer to numbered/formal questions (this has implications for early years when questions weren't numbered.) It may not matter so much though when written answers are broken out.

In [160]:
questions = root.xpath(".//akn:debateSection[./akn:question]/akn:question", namespaces=AKN)
for pq in questions:
    pq_uri = workURI+"/"+pq.attrib['eId']
    g.add(( URIRef(pq_uri), rdf.type,  ))
etree.tostring(questions[0])

b'<question xmlns="http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" by="#SeanOFearghaillFF" to="#Minister_for_Arts" eId="pq_1">\n                        <p eId="para_1">1<b>Deputy Se&#225;n &#211; Feargha&#237;l</b> asked the <b>Minister for Arts, Heritage and the Gaeltacht</b> the funding being provided for the Heritage Council in 2016; and if she will make a statement on the matter. [39716/15]</p>\n                    </question>\n                    '

### Member roles in debates

ToDo: these should be object properties, not classes.

Members are participants, OIR:debateParticipantOf in debate if they are recorded as speaking, voting or (in the case of a committee) attending. Members who are participants are linked to specific speech of voting instances by the object categories of OIR:speakerOf, OIR:voterOf or OIR:attendeeOf. A voter will also be one of OIR:taVoter, OIR:nilVoter or OIR:staonVoter

In [98]:
for member in root.xpath(".//akn:TLCPerson/@href", namespaces=AKN):
    memberURI = URIRef(baseURI+member)
    g.add(( work, METALEX.participant, memberURI ))
    g.add(( memberURI, METALEX.participantOf, work ))

In [136]:
root.findall(".//{*}question")

[<Element {http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13}question at 0x7f4e3c10c208>,
 <Element {http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13}question at 0x7f4e3c10c988>,
 <Element {http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13}question at 0x7f4e3c10c688>,
 <Element {http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13}question at 0x7f4e3c10c108>,
 <Element {http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13}question at 0x7f4e3c10c308>,
 <Element {http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13}question at 0x7f4e3c10c508>,
 <Element {http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13}question at 0x7f4e3c10ce08>,
 <Element {http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13}question at 0x7f4e3c10c648>,
 <Element {http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13}question at 0x7f4e3c10cc88>,
 <Element {http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13}question at 0x7f4e3c10c548>]

In [96]:
for dbs in root.xpath(".//akn:debateSection[./akn:speech]", namespaces=AKN):
    
    spkrs = set(dbs.xpath(".//akn:speech/@by", namespaces=AKN))
    for spkr in spkrs: 
        href = root.xpath(".//akn:TLCPerson[@eId='{}']/@href".format(spkr[1:]), namespaces=AKN)
        if len(href) != 1:
            pass
        else:
            dbsURI = URIRef(workURI+"/"+dbs.attrib['eId'])
            spkrURI = URIRef(baseURI+href[0])
            g.add(( spkrURI, OIR.speakerOf, dbsURI))
            g.add(( dbsURI, OIR.speaker, spkrURI ))
            

In [108]:
for dbs in root.xpath(".//akn:debateSection[@name='ta']", namespaces=AKN):
    voteURI = URIRef(workURI+"/"+dbs.getparent().attrib['eId'])
    voters = [p[1:] for p in dbs.xpath(".//akn:person/@refersTo", namespaces=AKN)]
    for voter in voters:
        href = root.xpath(".//akn:TLCPerson[@eId='{}']/@href".format(voter), namespaces=AKN)
        voterURI = URIRef(baseURI+href[0])
        g.add(( voterURI, OIR.voterOf, voteURI))
        g.add(( voteURI, OIR.voter, voterURI ))

In [65]:
href = '/ie/oireachtas/member/Michael-P-Kitt.D.1975-03-04/dail/31'
right_len = 7
uri_len = len(href.split("/"))
try:
    assert uri_len == right_len
except AssertionError as e:
    e.args += ("URI: {0} Incorrect length. Should have {1} elements but it has {2}".format(href, right_len, uri_len),)
    raise

In [134]:
for s, p, o in g:
    if p.endswith("Type"):
        print(s.split("/")[-1], p.split("#")[-1], o.split("/")[-1])

dbsect_3 debateType ontology#Questions
dbsect_82 debateType ontology#Debate
dbsect_83 debateType ontology#Debate
dbsect_2 debateType ontology#Questions
dbsect_32 debateType ontology#Debate
dbsect_31 debateType ontology#Debate
dbsect_29 debateType ontology#Questions
dbsect_30 debateType ontology#Debate
dbsect_16 debateType ontology#Debate
dbsect_3 debateType ontology#Question
dbsect_9 debateType ontology#Question
dbsect_15 debateType ontology#Debate
dbsect_33 debateType ontology#Debate
dbsect_9 debateType ontology#Questions
dbsect_1 debateType ontology#Prelude


In [163]:
for s, p, o in g:
    if p.endswith("type"):
        print(s.split("/")[-1], p.split("#")[-1], o.split("/")[-1])

dbsect_30 type ontology#DebateSection
dbsect_32 type ontology#DebateSection
dbsect_82 type ontology#DebateSection
dbsect_9 type ontology#DebateSection
dbsect_16 type ontology#DebateSection
dbsect_33 type ontology#DebateSection
dbsect_29 type ontology#DebateSection
dbsect_15 type ontology#DebateSection
dbsect_83 type ontology#DebateSection
dbsect_31 type ontology#DebateSection
dbsect_1 type ontology#DebateSection
dbsect_3 type ontology#DebateSection
dbsect_2 type ontology#DebateSection


In [110]:
[d.attrib['eId'] for d in root.xpath(".//akn:debateBody/akn:debateSection", namespaces=AKN)]

['dbsect_1',
 'dbsect_2',
 'dbsect_3',
 'dbsect_9',
 'dbsect_15',
 'dbsect_16',
 'dbsect_29',
 'dbsect_30',
 'dbsect_31',
 'dbsect_32',
 'dbsect_33',
 'dbsect_82',
 'dbsect_83']

In [112]:
root.xpath(".//akn:debateBody//akn:debateSection/akn:heading/text()", namespaces=AKN)

['Ceisteanna - Questions',
 'Priority Questions',
 'Heritage Council Funding',
 'Seirbhísí Eitilte',
 'National Monuments',
 'Seirbhísí Eitilte',
 'National Monuments',
 'Other Questions',
 'Cultural Policy',
 'Commemorative Events',
 'Céanna agus Cuanta',
 'Pleananna Teanga',
 'Deer Culls',
 'Message from Select Committee',
 'Garda Síochána (Policing Authority and Miscellaneous Provisions) Bill 2015: Report Stage (Resumed)',
 'Leaders’ Questions',
 'Order of Business',
 'Finance Bill 2015: Financial Resolutions',
 'Finance Bill 2015: Allocation of Time: Motion',
 'Garda Síochána (Policing Authority and Miscellaneous Provisions) Bill 2015: Report Stage (Resumed)',
 'Topical Issue Matters',
 'Topical Issue Debate',
 'Preschool Services',
 'Sexual Abuse and Violence',
 'Housing Issues',
 'Schools Building Projects Applications']