## OpenAIRE API Example
Use the [OpenAIRE API](https://explore.openaire.eu/) to find *open access* PDFs for research publications.

If needed, activate the virtual environment and install the dependencies listed in `requirements.txt`:
```
source venv/bin/activate
pip install -r requirements.txt
```

Toggle this boolean flag to get into debug/test mode...

In [2]:
DEBUG = False # True

Here's an iterator that scans the file from the ETL workflow listing the publications to look up. NB: the row format may change, so check the column order before running:

In [3]:
import csv

def iter_pub (filename):
    with open(filename) as f:
        for row in csv.reader(f, delimiter=","):
            dataset, doi, journal, title = row[:4]
            yield doi, journal, title
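
The sample output further down includes a row whose fields are literally `doi`, `journal`, `title`, which suggests the CSV begins with a header line that gets looked up like any other publication. A minimal sketch of a variant that skips it, assuming the header puts the text `doi` in the second column (the name `iter_pub_skip_header` is hypothetical, not part of the original workflow):

```python
import csv

def iter_pub_skip_header(filename):
    """Hypothetical variant of iter_pub that drops a leading header row."""
    with open(filename, newline="") as f:
        for i, row in enumerate(csv.reader(f, delimiter=",")):
            if len(row) < 4:
                continue  # ignore blank or short rows

            dataset, doi, journal, title = row[:4]

            # assumption: a header row carries the literal text "doi" here
            if i == 0 and doi.strip().lower() == "doi":
                continue

            yield doi, journal, title
```

If the file turns out to have no header, the first row passes through unchanged, so the variant should be safe to use either way.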

Here's an access function to read content from a URI -- in other words, access an HTTP-based API:

In [4]:
import urllib.request

def load_uri (uri):
    with urllib.request.urlopen(uri) as response:
        html = response.read()
        return html.decode("utf-8")
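
Since `load_uri` is called once per publication, a single network failure would stop the whole run. Here's a hedged sketch of a more defensive variant, assuming it is acceptable to skip a publication when its request fails; the name `load_uri_safe` and the 10-second default timeout are illustrative choices, not part of the original notebook:

```python
import urllib.request

def load_uri_safe(uri, timeout=10.0):
    """Hypothetical variant of load_uri: returns None instead of raising."""
    try:
        with urllib.request.urlopen(uri, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (OSError, UnicodeDecodeError):
        # OSError covers urllib.error.URLError, HTTPError, and socket timeouts
        return None
```

A caller would then need to check for `None` before passing the result on to the XML parser.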

Here's a function that creates a unique identifier for an entity in the graph. It hashes a tuple of column values, e.g., `[doi, journal, title]`, with the `hashlib` library. Strictly speaking the result is a hash digest rather than a standard UUID, but the same input always reproduces the same identifier.

In [5]:
import hashlib

def get_hash (row):
    m = hashlib.blake2b(digest_size=10)
    
    for elem in row:
        m.update(elem.encode("utf-8").lower().strip())

    return m.hexdigest()
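
To make the behavior of the function above concrete, here's a small sketch (the input values are only illustrative). It repeats the function so the block runs on its own, then checks three properties: case and surrounding whitespace are ignored, the identifier is 20 hex characters long, and, because the elements are fed into the hash with no separator, two different splits of the same text produce the same identifier.

```python
import hashlib

def get_hash(row):
    # same logic as the notebook cell above
    m = hashlib.blake2b(digest_size=10)

    for elem in row:
        m.update(elem.encode("utf-8").lower().strip())

    return m.hexdigest()

a = get_hash(["10.1111/agec.12444", "Agricultural Economics"])
b = get_hash(["10.1111/AGEC.12444 ", " agricultural economics"])

print(a == b)   # True: lowercasing and stripping normalize the input
print(len(a))   # 20: digest_size=10 bytes, two hex characters per byte

# no separator between elements, so these two rows collide
print(get_hash(["ab", "c"]) == get_hash(["a", "bc"]))   # True
```

The last property is unlikely to matter for `[doi, journal, title]` rows, but it is worth knowing before reusing the function on other column sets.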

This function uses the title of a publication to run a lookup in the OpenAIRE API, yielding a `[doi, pub_id, pub_url, journal, title]` tuple for each publication where an open access URL is found.

In [6]:
from urllib import parse

API_URI = "http://api.openaire.eu/search/publications?title="

def lookup_pub_uris (pub_iter, debug=False):
    for doi, journal, title in pub_iter:
        # use the `debug` argument rather than the global flag
        if debug:
            print(doi, journal, title)

        xml = load_uri(API_URI + parse.quote(title))

        if debug:
            print(xml)

        pub_url = extract_pub_uri(xml)

        if pub_url:
            pub_id = "publication-{}".format(get_hash([doi, journal, title]))
            yield doi, pub_id, pub_url, journal, title
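
To see what request the lookup actually sends, here's a minimal sketch of the URL construction on its own, with no network access. The helper name `build_query_uri` is hypothetical; the logic is the same `API_URI + parse.quote(title)` expression used above.

```python
from urllib import parse

API_URI = "http://api.openaire.eu/search/publications?title="

def build_query_uri(title):
    """Hypothetical helper: percent-encode the title and append it to API_URI."""
    return API_URI + parse.quote(title)

print(build_query_uri("Does a Nutritious Diet Cost More in Food Deserts?"))
# http://api.openaire.eu/search/publications?title=Does%20a%20Nutritious%20Diet%20Cost%20More%20in%20Food%20Deserts%3F
```

Note that `parse.quote` encodes spaces as `%20` and the trailing `?` as `%3F`, so punctuation in a title cannot be mistaken for part of the query string.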

Here's an access function that parses the XML output from the OpenAIRE API response, then extracts the URL for the publication's open access PDF -- if it's available.

In [7]:
import xml.etree.ElementTree as et

NS = {
    "oaf": "http://namespace.openaire.eu/oaf"
    }

def extract_pub_uri (xml):
    root = et.fromstring(xml)
    result = root.findall("./results/result[1]/metadata/oaf:entity/oaf:result", NS)

    if len(result) > 0:
        url_list = result[0].findall("./children/instance/webresource/url")

        if len(url_list) > 0:
            pub_url = url_list[0].text
            return pub_url

    return None
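
The parser can be exercised offline with a hand-written document. The XML below is a synthetic sample that only mirrors the element paths the function looks for; it is not a real OpenAIRE response, and the root element name and the `example.org` URL are assumptions made for illustration.

```python
import xml.etree.ElementTree as et

NS = {"oaf": "http://namespace.openaire.eu/oaf"}

def extract_pub_uri(xml):
    # same logic as the notebook cell above
    root = et.fromstring(xml)
    result = root.findall("./results/result[1]/metadata/oaf:entity/oaf:result", NS)

    if len(result) > 0:
        url_list = result[0].findall("./children/instance/webresource/url")

        if len(url_list) > 0:
            return url_list[0].text

    return None

# synthetic sample, NOT a real API response
SAMPLE_XML = """<response xmlns:oaf="http://namespace.openaire.eu/oaf">
  <results>
    <result>
      <metadata>
        <oaf:entity>
          <oaf:result>
            <children>
              <instance>
                <webresource><url>http://example.org/paper.pdf</url></webresource>
              </instance>
            </children>
          </oaf:result>
        </oaf:entity>
      </metadata>
    </result>
  </results>
</response>"""

print(extract_pub_uri(SAMPLE_XML))                         # http://example.org/paper.pdf
print(extract_pub_uri("<response><results/></response>"))  # None
```

Only the first `result` element is inspected, so if the best match for a title is not ranked first by the API, the function returns the URL of a different publication or `None`.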

Now, specify the source data -- or optionally inject a test case instead...

In [10]:
filename = "../usda/dyads_for_validation.csv"
iter = iter_pub(filename)

if DEBUG:
    iter = [["10.5150/alfred.e.neuman", "Mad Magazine", "Does a Nutritious Diet Cost More in Food Deserts?"]]

In [9]:
import json

# pull results from the OpenAIRE API
results = list(lookup_pub_uris(iter, debug=DEBUG))
print(json.dumps(results, indent=2))

[
  [
    "doi",
    "publication-61f787206f84063920fb",
    "http://cds.cern.ch/record/969091",
    "journal",
    "title"
  ],
  [
    "10.1111/agec.12444",
    "publication-e84ab84d25278051773e",
    "http://purl.umn.edu/205424",
    "Agricultural Economics",
    "Does a nutritious diet cost more in food deserts?"
  ],
  [
    "10.1016/j.aaspro.2016.02.014",
    "publication-0c97553ee24b5584124e",
    "http://dx.doi.org/10.1016/j.aaspro.2016.02.014",
    "Agriculture and Agricultural Science Procedia",
    "Consumer Response to Quality Differentiation Strategies in Wine PDOs"
  ],
  [
    "10.1016/j.wep.2015.12.001",
    "publication-54a72c05d10975c8ac10",
    "http://hdl.handle.net/10419/194508",
    "Wine Economics and Policy",
    "Wine consumption and sales strategies: The evolution of Mass Retail Trading in Italy"
  ],
  [
    "10.1162/rest_a_00447",
    "publication-3d5bdb7e22255b2aa07d",
    "http://www.mitpressjournals.org/doi/pdf/10.1162/REST_a_00447",
    "Review of Econom

Really, that's all that's needed for the ETL workflow.

The next step, which we'll return to later for knowledge graph generation, converts the `results` structure into JSON-LD as an example of publishing this metadata. Here's a header and a template in TTL (Turtle), used as the intermediate form during the format conversion.

In [12]:
TTL_HEADER = """
@base <https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary> .
@prefix cito:   <http://purl.org/spar/cito/> .
@prefix dct:    <http://purl.org/dc/terms/> .
@prefix foaf:   <http://xmlns.com/foaf/0.1/> .
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
"""

PUB_TEMPLATE = """
:{}
  rdf:type :ResearchPublication ;
  dct:title "{}"@en ;
  dct:identifier "{}" ;
  dct:language "en" ;
  foaf:page "{}"^^xsd:anyURI ;
  .
"""
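
Because `PUB_TEMPLATE` wraps each value in double quotes, a title that itself contains a double quote or a backslash would produce invalid Turtle. A minimal sketch of an escaping step, assuming it is applied to each value before calling `PUB_TEMPLATE.format(...)`; the helper name `escape_ttl_literal` is hypothetical and it handles only the most common special characters:

```python
def escape_ttl_literal(text):
    """Hypothetical helper: escape a string for use inside a "..." Turtle literal."""
    return (
        text.replace("\\", "\\\\")   # backslash first, so later escapes survive
            .replace('"', '\\"')     # double quote would end the literal early
            .replace("\n", "\\n")    # raw line breaks are not allowed in "..."
            .replace("\r", "\\r")
    )

print(escape_ttl_literal('The "food desert" question'))
# The \"food desert\" question
```

Titles without any of these characters pass through unchanged.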

In [15]:
from rdflib import Graph, plugin
from rdflib.plugin import register, Parser, Serializer
import pyld

# format as JSON-LD
with open("vocab.json", "r") as f:
    CONTEXT = json.load(f)

frags = [TTL_HEADER]

for doi, pub_id, pub_url, journal, title in results:
    frags.append(PUB_TEMPLATE.format(pub_id, title, doi, pub_url))

g = Graph()
g.parse(data="\n".join(frags), format="n3")

jsonld = json.loads(g.serialize(format="json-ld", context=CONTEXT))
jsonld = pyld.jsonld.compact(jsonld, CONTEXT)

print(json.dumps(jsonld, indent=2))

{
  "@context": {
    "@language": "en",
    "@vocab": "https://github.com/Coleridge-Initiative/adrf-onto/wiki/Vocabulary#",
    "authority": "http://id.loc.gov/authorities/subjects/",
    "cito": "http://purl.org/spar/cito/",
    "dbpedia": "http://dbpedia.org/resource/",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "doi": "https://doi.org/",
    "fabio": "http://purl.org/spar/fabio/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "iso639": "http://id.loc.gov/vocabulary/iso639-1/",
    "mads": "http://www.loc.gov/mads/rdf/v1#",
    "orcid": "https://orcid.org/",
    "owl": "http://www.w3.org/2002/07/owl#",
    "pav": "http://purl.org/pav/",
    "prism": "http://prismstandard.org/namespaces/basic/2.0/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "xsd": "http://www.w3.org/2001/XMLSchema#"
  },
  "@graph": [
    {
      "@id"

Overkill, but good to have in hand.