## Debates to RDF

This notebook is draft code that updates the Akoma Ntoso (AKN) debates XML to reflect changes made to the ontology, and then extracts metadata to RDF.

For more information on the ontology, read the [Debates section of the wiki](https://github.com/Oireachtas/ontology/wiki/Debates).

I am making the changes to [the AKN file](../debates/AK-dail-2015-11-12.xml) in the debates folder, and saving the changes as a new file.

In [1]:
import re
from lxml import etree
from rdflib import URIRef, Literal, Namespace, Graph
from rdflib.namespace import RDF, OWL, SKOS, DCTERMS, XSD, RDFS, FOAF

In [2]:
AKN = {"akn": "http://docs.oasis-open.org/legaldocml/ns/akn/3.0/CSD13"}
xml = "../debates/AK-dail-2015-11-12.xml"
baseURI = "http://oireachtas.ie"

### Metadata elements


As the file contains data for the oral part of the debate (writtens are stored separately), the FRBRWork attributes need to be updated. See the [Metadata](https://github.com/Oireachtas/ontology/wiki/Debates#metadata) section of the wiki for the specification.

Note that under the Akoma Ntoso naming convention, the "@" character in an FRBRExpression URI denotes the original expression of the work. This is not strictly true for revised Official Reports, but we currently have no way of telling the difference, so the original expression of a debate is whatever this file turns out to be.

The original FRBRExpression URIs have a language value of ``eng``. This should be ``mul``, because it is not (easily) possible to determine whether a debate is in English or Irish.

In [3]:
# Match the date component unless "/debate" or "/writtens" already follows it
regex = re.compile(r"\d{4}-\d{2}-\d{2}(?!/debate|/writtens)")

root = etree.parse(xml).getroot()

work = root.find(".//{*}FRBRWork")
# Create FRBRname in the AKN namespace so it serializes like its sibling elements
etree.SubElement(work, "{%s}FRBRname" % AKN["akn"], {"value": "debate"})


name = root.find(".//{*}FRBRWork/{*}FRBRname").attrib['value']
for uri in root.xpath(".//akn:identification/*//*[starts-with(@value, '/akn')]", namespaces=AKN):
    # Print "Parent/child" element names without the namespace part
    print(etree.QName(uri.getparent()).localname + "/" + etree.QName(uri).localname)
    # The language code is swapped before printing, so "Original" already shows mul@
    value = uri.attrib['value'].replace("eng@", "mul@")
    print("Original:", value)
    # Insert the work name immediately after the date component
    end = regex.search(value).end()
    uri.attrib['value'] = value[:end] + "/" + name + value[end:]
    print("New:", uri.attrib['value'], "\n---")

FRBRWork/FRBRthis
Original: /akn/ie/debateRecord/dail/2015-11-12/main
New: /akn/ie/debateRecord/dail/2015-11-12/debate/main 
---
FRBRWork/FRBRuri
Original: /akn/ie/debateRecord/dail/2015-11-12
New: /akn/ie/debateRecord/dail/2015-11-12/debate 
---
FRBRExpression/FRBRthis
Original: /akn/ie/debateRecord/dail/2015-11-12/mul@/main
New: /akn/ie/debateRecord/dail/2015-11-12/debate/mul@/main 
---
FRBRExpression/FRBRuri
Original: /akn/ie/debateRecord/dail/2015-11-12/mul@
New: /akn/ie/debateRecord/dail/2015-11-12/debate/mul@ 
---
FRBRManifestation/FRBRthis
Original: /akn/ie/debateRecord/dail/2015-11-12/mul@/main.xml
New: /akn/ie/debateRecord/dail/2015-11-12/debate/mul@/main.xml 
---
FRBRManifestation/FRBRuri
Original: /akn/ie/debateRecord/dail/2015-11-12/mul@.akn
New: /akn/ie/debateRecord/dail/2015-11-12/debate/mul@.akn 
---
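
The URI rewrite can also be tried on plain strings, away from the XML tree. The following is a minimal sketch, not the notebook's own code: `add_work_name` is a hypothetical helper name, and returning the URI unchanged when no bare date is found is my assumption about what a re-run should do.

```python
import re

# Matches the date component unless a work name already follows it
DATE = re.compile(r"\d{4}-\d{2}-\d{2}(?!/debate|/writtens)")


def add_work_name(uri: str, name: str = "debate") -> str:
    """Switch the language code to 'mul' and insert `name` after the date."""
    uri = uri.replace("eng@", "mul@")
    match = DATE.search(uri)
    if match is None:
        # No bare date found, for example because the URI was already rewritten
        return uri
    return uri[:match.end()] + "/" + name + uri[match.end():]


print(add_work_name("/akn/ie/debateRecord/dail/2015-11-12/eng@/main.xml"))
```

Because of the negative lookahead, passing an already rewritten URI through the function a second time leaves it as it is, which makes the step safe to repeat.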


### TLCPerson references

TLCPerson references point to the OIR:Member URI in the original AKN. However, pointing to the specific org:Membership (of a Dáil or Seanad) would make it easier to link to other information the website needs, such as constituency and party, which would otherwise require a more expensive query over date ranges. The information associated with OIR:Member is still only one step away. There may be a cost, however, when searching for speeches by a Member across multiple houses, so it would be worth testing this on a larger set of debate files.

In [4]:
for person in root.xpath(".//akn:meta/akn:references/akn:TLCPerson", namespaces=AKN):
    # Point at the membership of the 31st Dáil (hard-coded for this file)
    person.attrib['href'] += "/dail/31"
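
Running the cell above twice appends the suffix twice. The following is a minimal sketch of a guard against that, using plain strings only. `to_membership_href` is a hypothetical helper, and the member href in the example is made up for illustration; it is not taken from the AKN file.

```python
def to_membership_href(href: str, house: str = "dail", number: int = 31) -> str:
    """Append '/<house>/<number>' to a member href unless it is already there."""
    suffix = f"/{house}/{number}"
    return href if href.endswith(suffix) else href + suffix


# Hypothetical member href, for illustration only
member = "/ie/oireachtas/member/Example-Member"
once = to_membership_href(member)
twice = to_membership_href(once)
print(once)
print(once == twice)
```

In the loop this could be applied as `person.attrib['href'] = to_membership_href(person.attrib['href'])`, with the house and number passed in once other debate files are tested.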

### Converting to RDF

When converting to RDF, FRBR elements map to their RDA equivalents. I'm mapping only the FRBRuri elements for now. 

ToDo: Extend the ontology to cover both contributors (those listed as TLCPerson) and speakers (those identified in speech nodes).

In [5]:
g = Graph()

In [6]:
OIR = Namespace("http://oireachtas.ie/ontology#")
RDA = Namespace("http://www.rdaregistry.info/Elements/c/#")

In [7]:
# C10001 is RDA:Work
g.add(( URIRef(baseURI+root.xpath(".//akn:FRBRWork/akn:FRBRuri/@value", 
                                  namespaces=AKN)[0]), 
       RDF.type, 
       RDA.C10001))
# C10006 is RDA:Expression
g.add(( URIRef(baseURI+root.xpath(".//akn:FRBRExpression/akn:FRBRuri/@value", 
                                  namespaces=AKN)[0]), 
       RDF.type, 
       RDA.C10006))
# C10007 is RDA:Manifestation
g.add(( URIRef(baseURI+root.xpath(".//akn:FRBRManifestation/akn:FRBRuri/@value", 
                                  namespaces=AKN)[0]), 
       RDF.type, 
       RDA.C10007))

In [8]:
# Each triple is (subject, predicate, object); iteration order is not guaranteed
for s, p, o in g:
    print(s, p, o)

http://oireachtas.ie/akn/ie/debateRecord/dail/2015-11-12/debate/mul@.akn http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.rdaregistry.info/Elements/c/#C10007
http://oireachtas.ie/akn/ie/debateRecord/dail/2015-11-12/debate http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.rdaregistry.info/Elements/c/#C10001
http://oireachtas.ie/akn/ie/debateRecord/dail/2015-11-12/debate/mul@ http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://www.rdaregistry.info/Elements/c/#C10006
