# Preparing and uploading data to a FAIR Data Point with SeMPyRO using Health-RI Core v2

In this notebook we will go through the steps of defining a simple metadata set consisting of a DCAT:Catalog and a DCAT:Dataset according to the Health-RI Core v2 application profile. We will load some example data and either serialize it to a Turtle file or push it to a FAIR Data Point (FDP).

**Prerequisites:** To execute this notebook in full, you need a running FAIR Data Point (FDP) instance and an account with write access.
This notebook is written for the reference implementation, FAIR Data Point version 1.16 with the [Health-RI Core v2 SHACL shapes](https://github.com/Health-RI/health-ri-metadata/tree/develop/Formalisation(shacl)/Core/FairDataPointShape).

This notebook continues after the 'Documentation_DCAT' notebook.

## Imports and setup

In [None]:
from typing import List, Union
from pprint import pprint

from rdflib import URIRef, DCTERMS
from pydantic import AnyHttpUrl, Field, field_validator

from getpass import getpass
import dateutil.parser as parser

from fairclient.fdpclient import FDPClient

from sempyro import LiteralField
from sempyro.hri_dcat import HRICatalog, HRIDataset, HRIVCard, HRIAgent, HRIDistribution
from sempyro.utils.validator_functions import force_literal_field

To work with the FAIR Data Point, we need to log in to the FDP and define an FDP-specific catalog subclass:

In [None]:
# Strip trailing slashes so the base URL can be reused consistently below
fdp_base = input("Enter base link to FDP: ").rstrip("/")
username = input("Enter username: ")
password = getpass(prompt="Password: ")

fdp_client = FDPClient(base_url=fdp_base, username=username, password=password)

In [None]:
class FDPCatalog(HRICatalog):
    # Link from the catalog to its parent (here the FDP root), serialized as dcterms:isPartOf
    is_part_of: List[AnyHttpUrl] = Field(
        description="Link to parent object",
        json_schema_extra={
            "rdf_term": DCTERMS.isPartOf,
            "rdf_type": "uri"
        })

EX = "http://www.example.com"

## Defining a Catalog and Datasets

The function of SeMPyRO is to define objects according to a specification, such as dcat:Catalog and dcat:Dataset, and to validate the metadata against this specification. The metadata for the datasets we will use for this demo is in `example_data_fdp.csv`.
The FDP specification requires that each dataset is part of a catalog, so we need to create a catalog first.

To see what we need to provide for a catalog, we can annotate the model and request its mandatory fields:

In [None]:
core_fields = HRICatalog.annotate_model()
types = core_fields.get_fields_types()
mandatory_types = {k: types[k] for k in core_fields.mandatory_fields()}
pprint(mandatory_types)

Let's create a minimal catalogue with an example title and description. We also need a URI to use as the graph subject at serialization. Let's use the `example.com` domain for this purpose for now:

In [None]:
# Create a catalog instance with a title, description, contact point, and publisher
fdp_catalog = FDPCatalog(
    title=[
        LiteralField(value="Inflammatory Bowel Disease catalogue", language="en")
    ],
    description=[
        LiteralField(value="This catalogue describes the core metadata of AUMC Inflammatory Bowel Disease datasets", language="en")
    ],
    contact_point=HRIVCard(
        hasEmail="mailto:data-access-committee@xumc.nl",
        formatted_name="Data Access Committee of the x UMC"),
    publisher=HRIAgent(
        name=[LiteralField(value="Academic Medical Center")],
        identifier=["https://ror.org/05wg1m734"],
        homepage=URIRef("https://www.xumc.nl"),
        mbox="mailto:data-access-committee@xumc.nl"
    ),
    is_part_of=[URIRef(fdp_base)],
    dataset=[])

fdp_catalog_record = fdp_catalog.to_graph(URIRef(f"{EX}/test_catalog_1"))
print(fdp_catalog_record.serialize())


In [None]:
catalog_fdp_url = fdp_client.create_and_publish(resource_type="catalog", metadata=fdp_catalog_record)
print(catalog_fdp_url)

Now we can add datasets to the catalogue. The example dataset metadata is taken from the [Health-RI Metadata repository](https://github.com/Health-RI/health-ri-metadata).

In [None]:
hri_dataset = HRIDataset(
    contact_point=HRIVCard(
        hasEmail="mailto:data-access-committee@xumc.nl",
        formatted_name="Data Access Committee of the x UMC"
    ),
    creator=[HRIAgent(
        name=["Academic Medical Center"], 
        identifier=["https://ror.org/05wg1m734"],
        homepage="https://www.xumc.nl",
        mbox="mailto:data-access-committee@xumc.nl"    
    )],
    description=[LiteralField(value=
                              "The primary aim of the PRISMA study was to investigate the potential value of risk-tailored versus "
                              "traditional breast cancer screening protocols in the Netherlands. Data collection took place between "
                              "2014-2019, resulting in ∼67,000 mammograms, ∼38,000 surveys, ∼10,000 blood samples and ∼600 saliva "
                              "samples.")],
    issued=parser.isoparse("2024-07-01T11:11:11"),
    identifier=f"{EX}/dataset/ZLOYOJ",
    modified=parser.isoparse("2024-06-04T13:36:10.246Z"),
    publisher=HRIAgent(
        name=["Academic Medical Center"], 
        identifier=["https://ror.org/05wg1m734"],
        homepage="https://www.xumc.nl",
        mbox="mailto:data-access-committee@xumc.nl"    
    ),
    theme=[URIRef("http://publications.europa.eu/resource/authority/data-theme/HEAL")],
    title=[LiteralField(value="Questionnaire data of the Personalised RISk-based MAmmascreening Study (PRISMA)")],
    license=URIRef("https://creativecommons.org/licenses/by-sa/4.0/"),
    distribution=[],
    access_rights=URIRef("http://publications.europa.eu/resource/authority/access-right/RESTRICTED"),
    keyword=['example'],
    applicable_legislation=["http://data.europa.eu/eli/reg/2025/327/oj"]
)

To make sure the dataset is correctly serialized, link it to the catalogue. After that we can publish it:

In [None]:
fdp_dataset_record = hri_dataset.to_graph(subject=URIRef(hri_dataset.identifier))
fdp_dataset_record.add((URIRef(hri_dataset.identifier), DCTERMS.isPartOf, URIRef(catalog_fdp_url)))
dataset_fdp_url = fdp_client.create_and_publish(resource_type="dataset", metadata=fdp_dataset_record)

print(dataset_fdp_url)

To tell people how to access the dataset, we describe a distribution that points to the data:

In [None]:
hri_distribution = HRIDistribution(
    title=[
        LiteralField(value="CSV-distribution of the questionnaire data of the Personalised RISk-based MAmmascreening Study (PRISMA)")
    ],
    description=[
        LiteralField(value="CSV file containing the questionnaire data of the PRISMA study")
    ],
    access_url=URIRef("https://example.com/dataset/PRISMA/questionnaire.csv"),
    media_type=URIRef("https://www.iana.org/assignments/media-types/text/csv"),
    byte_size="4096",
    license=URIRef("https://definities.geostandaarden.nl/dcat-ap-nl/id/waardelijst/licenties/niet_open"),
    rights="https://www.example.com/contracts/definitely_a_real_DPA.pdf",
    format=URIRef("http://publications.europa.eu/resource/authority/file-type/CSV")
)

The identifier of the distribution should be unique within the context of the dataset. The access URL is mandatory, so we can combine its last path segment with the dataset identifier to form the distribution identifier. Let's link the distribution to the dataset and publish it:

In [None]:
access_url_str = str(hri_distribution.access_url)
distribution_uri = URIRef(f"{hri_dataset.identifier}/distribution/{access_url_str.split('/')[-1]}")
fdp_distribution_record = hri_distribution.to_graph(subject=distribution_uri)
fdp_distribution_record.add((distribution_uri, DCTERMS.isPartOf, URIRef(f"{dataset_fdp_url}")))
print(fdp_distribution_record.serialize())

distribution_fdp_url = fdp_client.create_and_publish(resource_type="distribution", metadata=fdp_distribution_record)

print(distribution_fdp_url)
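
The identifier above is built by splitting the access URL on `/`, which also picks up any query string and yields an empty name when the URL ends in a slash. A slightly more defensive variant could look like the sketch below; `build_distribution_uri` is a hypothetical helper and not part of SeMPyRO or the FDP client.

```python
from urllib.parse import quote, urlsplit


def build_distribution_uri(dataset_identifier: str, access_url: str) -> str:
    """Combine a dataset identifier with the file name of an access URL."""
    # Use only the path component so query strings and fragments are ignored.
    path = urlsplit(access_url).path.rstrip("/")
    file_name = path.rsplit("/", 1)[-1]
    if not file_name:
        raise ValueError(f"Cannot derive a file name from {access_url!r}")
    # Percent-encode the file name so the result stays a valid URI.
    return f"{dataset_identifier.rstrip('/')}/distribution/{quote(file_name)}"


print(build_distribution_uri(
    "http://www.example.com/dataset/ZLOYOJ",
    "https://example.com/dataset/PRISMA/questionnaire.csv",
))
```

For the example values used in this notebook, the helper produces the same identifier as the cell above.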

In this notebook we have created and pushed to an FDP:

In [None]:
print(f"A catalog: {catalog_fdp_url}")
print(f"A dataset: {dataset_fdp_url}")
print(f"A distribution: {distribution_fdp_url}")