# Health-RI core metadata example

This notebook shows how to generate metadata compliant with the Health-RI core shapes using SeMPyRO.

First, we look at the Health-RI mandatory fields and their types for a Dataset.

In [13]:
from sempyro.hri_dcat.hri_dataset import HRIDataset
from sempyro.foaf.agent import Agent
from pprint import pprint

core_fields = HRIDataset.annotate_model()
types = core_fields.get_fields_types()
mandatory_types = {k: types[k] for k in core_fields.mandatory_fields()}
pprint(mandatory_types)

{'contact_point': {'RDF type': 'uri', 'datatype': 'List[Union[Url, VCard]]'},
 'creator': {'RDF type': 'uri', 'datatype': 'List[Union[Url, Agent]]'},
 'description': {'RDF type': 'literal', 'datatype': 'List[LiteralField]'},
 'identifier': {'RDF type': 'xsd:string',
                'datatype': 'Union[str, LiteralField]'},
 'issued': {'RDF type': 'datetime_literal',
            'datatype': 'Union[str, datetime, date, AwareDatetime, '
                        'NaiveDatetime]'},
 'license': {'RDF type': 'uri', 'datatype': 'Url'},
 'modified': {'RDF type': 'datetime_literal',
              'datatype': 'Union[str, date, AwareDatetime, NaiveDatetime]'},
 'publisher': {'RDF type': 'uri', 'datatype': 'List[Union[Url, Agent]]'},
 'theme': {'RDF type': 'uri', 'datatype': 'List[Url]'},
 'title': {'RDF type': 'rdfs_literal', 'datatype': 'List[LiteralField]'}}


Now we read the previous example data. First we load and visualize it.

In [14]:
from tabulate import tabulate
import pandas as pd

df = pd.read_csv("./example_data.csv", sep=";")
print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))

+------+-----------------------------+----------------------------------------------------------------------+------------------+------------------------------------------------------+-------------------------------------+
|   id | name                        | description                                                          | author_name      | author_id                                            | keywords                            |
|------+-----------------------------+----------------------------------------------------------------------+------------------+------------------------------------------------------+-------------------------------------|
|    1 | Gryffindor research project | Impact of muggle technical inventions on word's magic presense       | Hermione Granger | https://harrypotter.fandom.com/wiki/Hermione_Granger | magic, technic, muggles             |
|    2 | Slytherin research project  | Comarative analysis of magic powers of muggle-born and blood wizards | Dr

A few fields already exist! We try to map all mandatory fields now based on what we have.

In [15]:
# We split keywords by the comma, and put them in a List
df["keywords"] = df["keywords"].apply(lambda x: x.split(","))
df.rename(columns={"keywords": "keyword"}, inplace=True)

# We rename "id" to identifier
# Note: cardinality in DCAT-AP is 0..* but in Health-RI is 1..1
# So it should NOT be in a list
df["id"] = df["id"].apply(lambda x: str(x))
df.rename(columns={"id": "identifier"}, inplace=True)

# Description can stay as is, but we put it in a List (reason: i18n support).
df["description"] = df["description"].apply(lambda x: [x])

# Name is called "title" in dcat. We first put it in a List, then rename it
df["name"] = df["name"].apply(lambda x: [x])
df.rename(columns={"name": "title"}, inplace=True)

# creator is a foaf:Agent. We add the mandatory tags for it
df["creator"] = df.apply(lambda x: [{'name': ["author_name"], 'identifier': x["author_id"]}], axis=1)
df.drop(columns=["author_name", "author_id"], inplace=True)


Great, we have half of the fields now! We are still missing the following:

* contact_point
* issued
* license
* modified
* publisher
* theme
* type

We do not need to panic, as we can usually get away with having these specified on a global level.

In [16]:
from rdflib import URIRef
from sempyro.foaf.agent import Agent
import datetime

# contact_point is a VCard, we create a dictionary for it
# prof. Dumbledore is the contact point for all Hogwarts-related research
contact_point_vcard = {"hasEmail":URIRef("mailto:dumbledore@hogwarts.example.com"), "hasUID":URIRef("https://www.wikidata.org/wiki/Q712548")}
df["contact_point"] = [contact_point_vcard for _ in range(len(df))]

# We add a Health theme for all datasets. Put it in a list cause cardinalit is 1..*
df["theme"] = [[URIRef("http://publications.europa.eu/resource/authority/data-theme/HEAL")] for _ in range(len(df))]

# CC BY-SA 4.0 license for all datasets, Hermione needs her citations
df["license"] = URIRef("https://creativecommons.org/licenses/by-sa/4.0/")

# Publisher is Hogwarts obviously
df["publisher"] = [[URIRef("https://harrypotter.fandom.com/wiki/Hogwarts_School_of_Witchcraft_and_Wizardry")] for _ in range(len(df))]

# Issued and modified we put some different times
df["issued"] = datetime.datetime(2024, 7, 1, 11, 11, 11)
df["modified"] = datetime.datetime.today()

We can now see how the populated dataframe looks like:

In [17]:
print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))

+--------------+---------------------------------+---------------------------------------------------------------------------+-----------------------------------------------+---------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------+-------------------------------------------------+--------------------------------------------------------------------------------------------------------+---------------------+----------------------------+
|   identifier | title                           | description                                                               | keyword                                       | creator                                                                                           | con

In the end, serialize the graphs.

In [18]:
datasets = df.to_dict('records')
dcat_datasets = [HRIDataset(**x) for x in datasets]
for i, dataset in enumerate(dcat_datasets):
    print(dataset.to_graph(URIRef(f"http://example.com/dataset/{i}")).serialize())

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.com/dataset/0> a dcat:Dataset ;
    dcterms:creator [ a foaf:Agent ;
            dcterms:identifier "https://harrypotter.fandom.com/wiki/Hermione_Granger" ;
            foaf:name "author_name" ] ;
    dcterms:description "Impact of muggle technical inventions on word's magic presense" ;
    dcterms:identifier "1"^^xsd:string ;
    dcterms:issued "2024-07-01T11:11:11"^^xsd:dateTime ;
    dcterms:license <https://creativecommons.org/licenses/by-sa/4.0/> ;
    dcterms:modified "2024-07-23T09:09:06.583297"^^xsd:dateTime ;
    dcterms:publisher <https://harrypotter.fandom.com/wiki/Hogwarts_School_of_Witchcraft_and_Wizardry> ;
    dcterms:title "Gryffindor research project" ;
    dcat:contactPoint [ a v:Kind ;
            v:hasEmail <mailto:du

In [19]:
datasets = df.to_dict('records')
ds = datasets[0]
pprint(ds)


{'contact_point': [{'hasEmail': rdflib.term.URIRef('mailto:dumbledore@hogwarts.example.com'),
                    'hasUID': rdflib.term.URIRef('https://www.wikidata.org/wiki/Q712548')}],
 'creator': [{'identifier': 'https://harrypotter.fandom.com/wiki/Hermione_Granger',
              'name': ['author_name']}],
 'description': ["Impact of muggle technical inventions on word's magic "
                 'presense'],
 'identifier': '1',
 'issued': Timestamp('2024-07-01 11:11:11'),
 'keyword': ['magic', ' technic', ' muggles'],
 'license': rdflib.term.URIRef('https://creativecommons.org/licenses/by-sa/4.0/'),
 'modified': Timestamp('2024-07-23 09:09:06.583297'),
 'publisher': [rdflib.term.URIRef('https://harrypotter.fandom.com/wiki/Hogwarts_School_of_Witchcraft_and_Wizardry')],
 'theme': [rdflib.term.URIRef('http://publications.europa.eu/resource/authority/data-theme/HEAL')],
 'title': ['Gryffindor research project']}
