<a href="https://colab.research.google.com/github/IAAA-Lab/otp-module-3-activity-1/blob/main/odeco_otp3_assignment1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ODECO, Online Training Programme, Module 3, Assignment 1

The European Data Portal (data.europa.eu) has a user interface for searching at https://data.europa.eu/data/datasets?locale=en&minScoring=0 .

However, to build an application (software) that reuses some of the datasets publicly available on the portal, we need to write code that directly accesses the APIs offered by data.europa.eu.

In this short tutorial/assignment we will show how to use the following APIs for searching:

- the CKAN RESTful API (https://data.europa.eu/api/hub/search/#tag/Ckan): the CKAN API, also available in many other open data catalogues deployed with CKAN software
- the Search RESTful API (https://data.europa.eu/api/hub/search/#tag/Search): an API developed specifically for data.europa.eu
- the SPARQL end-point (https://data.europa.eu/data/sparql?locale=en): a standard SPARQL end-point, also available in many other open data catalogues

We will use Python as the programming language to show that using these APIs is feasible.

Let's start by preparing a query to retrieve datasets from the European Data Portal. First, we will type a keyword. For instance, if we want to retrieve the datasets produced in the European TRAFAIR project, we could use the specific term 'Trafair'.

In [None]:
keyword = input('Type a keyword : ')

A first possibility is to use the package_search function of the CKAN API (see https://data.europa.eu/api/hub/search/#operation/ckanPackageSearch ).

In [None]:
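import json
from urllib.parse import quote
from urllib.request import urlopen

edp_url = 'https://data.europa.eu/data/search'

# Percent-encode the keyword so that spaces and special characters are valid in the URL.
# %22 is an encoded double quote, so the keyword is sent as a quoted phrase.
search_request = edp_url + '/ckan/package_search?q=%22' + quote(keyword) + '%22&rows=10'

print('URL of the search request: ', search_request, '\n')

response = urlopen(search_request)

jsonResponse = json.load(response)

print(jsonResponse)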
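
Printing the whole response at once makes it hard to inspect. As a small aid, the sketch below pretty-prints a JSON structure with the standard `json` module. The `sample_response` dictionary is a made-up, heavily shortened stand-in for a real response, and the helper name `pretty` is hypothetical; only the `result`/`results`/`id` nesting is taken from the code in this tutorial.

```python
import json

# Hypothetical, heavily shortened response used only for illustration.
sample_response = {"result": {"results": [{"id": "example-dataset-id"}]}}


def pretty(obj):
    """Return obj as indented JSON text, keeping non-ASCII characters readable."""
    return json.dumps(obj, indent=2, ensure_ascii=False, sort_keys=True)


print(pretty(sample_response))
```

In the notebook, `print(pretty(jsonResponse))` would show the real response in the same indented layout.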

The raw JSON output is hard for humans to read, so we extract specific fields, e.g. the identifier ('id'), with the following code:

In [None]:
rows = jsonResponse["result"]["results"]
for row in rows:
    print(row["id"])
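
The loop above assumes that every level of the response is present. A slightly more defensive variant, sketched here under the assumption that a failed or empty search may lack the `result` or `results` keys, returns an empty list instead of raising a `KeyError`. The function name `extract_ids` and both sample responses are hypothetical.

```python
def extract_ids(response):
    """Return the 'id' of every result, skipping entries without one."""
    results = response.get("result", {}).get("results", [])
    return [row["id"] for row in results if "id" in row]


# Hypothetical responses, used only for illustration.
ok_response = {"result": {"results": [{"id": "dataset-a"},
                                      {"id": "dataset-b"},
                                      {"title": "entry without id"}]}}
empty_response = {}

print(extract_ids(ok_response))     # ['dataset-a', 'dataset-b']
print(extract_ids(empty_response))  # []
```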

A second possibility is to use the Search RESTful API of the portal (see https://data.europa.eu/api/hub/search/#operation/searchGet ).

In [None]:
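from urllib.parse import quote

# Percent-encode the keyword so that spaces and special characters are valid in the URL.
search_request = edp_url + '/search?q=%22' + quote(keyword) + '%22&limit=10'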

print('URL of the search request: ', search_request, '\n')

response = urlopen(search_request)

jsonResponse = json.load(response)

rows = jsonResponse["result"]["results"]
for row in rows:
    print(row["id"])
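
Both requests above will raise an exception if the portal is unreachable or answers with an error status. The following sketch wraps the request in a helper that reports the problem and returns `None` instead. The name `fetch_json`, its `opener` parameter and the 30-second timeout are choices made for this illustration, not part of the portal's API; the fake opener only exists so that the example runs without network access.

```python
import io
import json
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_json(url, opener=urlopen, timeout=30):
    """Return the parsed JSON body of `url`, or None if the request fails."""
    try:
        with opener(url, timeout=timeout) as response:
            return json.load(response)
    except HTTPError as err:             # the server answered with an error status
        print(f'HTTP error {err.code} for {url}')
    except URLError as err:              # the server could not be reached
        print(f'Connection problem for {url}: {err.reason}')
    except json.JSONDecodeError as err:  # the body was not valid JSON
        print(f'Invalid JSON from {url}: {err}')
    return None


# Offline demonstration with a fake opener and a made-up response body.
def fake_opener(url, timeout=None):
    return io.BytesIO(b'{"result": {"results": [{"id": "example-id"}]}}')


data = fetch_json('https://example.org/search?q=test', opener=fake_opener)
print(data["result"]["results"][0]["id"])  # example-id
```

With such a helper, each request in this tutorial could be written as `fetch_json(search_request)` followed by a check for `None`.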

If we want to make more specific queries filtering the values of specific properties, we can construct a SPARQL query including a filter on the dcat:keyword property using this keyword.

In [None]:
query_text = """
           PREFIX dct:<http://purl.org/dc/terms/>
           PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
           PREFIX dcat: <http://www.w3.org/ns/dcat#>
           SELECT DISTINCT ?s WHERE { 
             ?s a dcat:Dataset .
             ?s dcat:keyword ?value . 
             FILTER regex(str(?value), '"""+ keyword +"""', 'i') .
           }"""

print(query_text)
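
The query above inserts the keyword verbatim, so a keyword containing a quote or a backslash would produce an invalid query, and regular expression characters would be interpreted by `regex`. The sketch below shows one possible way to guard against that: it escapes the keyword for a single-quoted SPARQL string literal and swaps `regex` for the SPARQL 1.1 functions `CONTAINS` and `LCASE`, so the keyword is matched as plain, case-insensitive text. The helper names `escape_sparql_literal` and `build_keyword_query` are hypothetical, and the switch to `CONTAINS` is a design choice of this sketch, not of the original query.

```python
def escape_sparql_literal(text):
    """Escape backslashes, quotes and line breaks for a single-quoted SPARQL literal."""
    return (text.replace('\\', '\\\\')
                .replace("'", "\\'")
                .replace('\n', '\\n')
                .replace('\r', '\\r'))


def build_keyword_query(keyword):
    """Build a query for datasets whose dcat:keyword contains `keyword` (case-insensitive)."""
    literal = escape_sparql_literal(keyword)
    return """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
SELECT DISTINCT ?s WHERE {
  ?s a dcat:Dataset .
  ?s dcat:keyword ?value .
  FILTER(CONTAINS(LCASE(STR(?value)), LCASE('""" + literal + """')))
}"""


print(build_keyword_query("air quality"))
```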

We could run this query in the user interface of the SPARQL end-point at https://data.europa.eu/data/sparql?locale=en

But in our case, we will launch the query through our Python code:

In [None]:
!pip install SPARQLWrapper

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper('https://data.europa.eu/sparql')

sparql.setQuery(query_text)

sparql.setReturnFormat(JSON)

results = sparql.query().convert()


We need to process the results in JSON format as follows:

In [None]:
for row in results["results"]["bindings"]:
    dataset = row["s"]["value"]
    print(dataset)
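
To make the structure being traversed explicit, here is a minimal sketch of a result document in the standard SPARQL JSON results layout, together with a helper that collects the values of one variable. The URIs are placeholders and the helper name `binding_values` is hypothetical.

```python
# Made-up result document in the SPARQL JSON results layout; the URIs are placeholders.
sample_results = {
    "head": {"vars": ["s"]},
    "results": {
        "bindings": [
            {"s": {"type": "uri", "value": "http://example.org/dataset/1"}},
            {"s": {"type": "uri", "value": "http://example.org/dataset/2"}},
        ]
    },
}


def binding_values(results, variable):
    """Collect the values bound to `variable`, skipping rows where it is unbound."""
    return [row[variable]["value"]
            for row in results["results"]["bindings"]
            if variable in row]


print(binding_values(sample_results, "s"))
```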

We can click on any of the previous links to see the metadata of a dataset in a human-readable format.
We can also retrieve the metadata of one of these datasets in Turtle format and load it into a graph:

In [None]:

def get_file_name(url):
    """
    Returns the last path segment of a dataset URL. For example,
    https://europeandataportal.eu/set/data/https-opendata-aragon-es-datos-catalogo-dataset-oai-zaguan-unizar-es-89319
    returns https-opendata-aragon-es-datos-catalogo-dataset-oai-zaguan-unizar-es-89319
    """
    words = url.split('/')
    file_name = words[-1]
    return file_name

import rdflib

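def download_metadata(url):
    """
    Downloads the Turtle metadata of the dataset with the given URL
    and parses it into an rdflib graph. Returns None if this fails.
    """
    dataset_id = get_file_name(url)  # avoid shadowing the built-in id()
    ttl_url = 'https://data.europa.eu/data/api/datasets/' + dataset_id + '.ttl?useNormalizedId=true&locale=en'
    print(ttl_url)
    try:
        graph = rdflib.Graph()
        graph.parse(ttl_url, format="turtle")
        return graph
    except Exception as err:
        print(f'Error while downloading or parsing the metadata: {err}')
        return None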

url = input('Copy here one of the previous URLs: ')

graph = download_metadata(url)
# download_metadata returns None when the download or the parsing fails
if graph is not None:
    v = graph.serialize(format='turtle')
    print(v)
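
Since the URL is pasted by hand, it may carry a trailing slash, surrounding spaces or a query string, and a plain split on '/' would then return the wrong segment. A more tolerant way to obtain the identifier could look like the sketch below, which relies on `urllib.parse.urlsplit` from the standard library. The function name `dataset_id_from_url` and the identifier `example-dataset` are placeholders chosen for illustration.

```python
from urllib.parse import urlsplit


def dataset_id_from_url(url):
    """Return the last non-empty path segment of `url`, ignoring query string and fragment."""
    path = urlsplit(url.strip()).path
    segments = [segment for segment in path.split('/') if segment]
    return segments[-1] if segments else None


print(dataset_id_from_url('http://data.europa.eu/88u/dataset/example-dataset/'))
print(dataset_id_from_url(' http://data.europa.eu/88u/dataset/example-dataset?locale=en'))
```

Both calls print `example-dataset`; a URL without any path yields `None`.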

We can also retrieve the real data behind a dataset. 

For instance, this dataset ( http://data.europa.eu/88u/dataset/https-opendata-aragon-es-datos-catalogo-dataset-oai-zaguan-unizar-es-95726 ) has an associated distribution in CSV format.

We can retrieve the URL of the CSV by running a SPARQL query on the metadata in Turtle format:

In [None]:
url = "http://data.europa.eu/88u/dataset/https-opendata-aragon-es-datos-catalogo-dataset-oai-zaguan-unizar-es-95726"

graph = download_metadata(url)

def get_csv_url(graph):
    query_text = """
                   PREFIX dct:<http://purl.org/dc/terms/>
                   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
                   PREFIX dcat: <http://www.w3.org/ns/dcat#>
                   PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
                   SELECT  ?value  WHERE {
                        ?dataset dcat:distribution ?resource .
                        ?resource a dcat:Distribution .
                        ?resource dct:format <http://publications.europa.eu/resource/authority/file-type/CSV> .
                        ?resource dcat:accessURL ?value
                    }
                    """
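    print(query_text)
    qres = graph.query(query_text)
    value = None  # returned as-is when the dataset has no CSV distribution
    for row in qres:
        value = row["value"]  # with several CSV distributions, the last one is kept
    return value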

csv_url = get_csv_url(graph)
print(csv_url)


Finally, we can display the content of the CSV file:

In [None]:
!pip install pandas

import pandas as pd

raw_sensor_data_2020_01 = pd.read_csv(csv_url)

print(raw_sensor_data_2020_01)