# Preparing data compliant with the Health-RI core shapes and uploading it to a FAIR Data Point (FDP) using SeMPyRO

**Prerequisites:** Running this notebook in full requires a running FAIR Data Point (FDP) instance and an account with write access. The FDP can run locally or be hosted externally.

The data used to manually fill in the SeMPyRO dataset is [dataset_health.ttl](https://github.com/Health-RI/harvesting-test-data/blob/main/dataset_health.ttl).

## Imports and setup

Import the Python libraries and classes used throughout the notebook. They cover creating the dataset, uploading it to the FDP, and comparing it against the single source of truth, [dataset_health.ttl](https://github.com/Health-RI/harvesting-test-data/blob/main/dataset_health.ttl). Note that the comparison does not work at the moment.

In [None]:
from getpass import getpass

from fairclient.fdpclient import FDPClient
from pydantic import AnyHttpUrl, Field
from rdflib import DCTERMS, Graph, URIRef
from rdflib.compare import to_isomorphic

from sempyro import LiteralField
from sempyro.adms import Identifier
from sempyro.dcat import Attribution, Relationship
from sempyro.dqv import QualityCertificate
from sempyro.hri_dcat import HRIAgent, HRICatalog, HRIDataset, HRIDistribution, HRIPeriodOfTime, HRIVCard

Connect to your FDP.

In [None]:
fdp_base = input("Enter base link to FDP: ").rstrip("/")
username = input("Enter username: ")
password = getpass(prompt="Password: ")

# Or for local testing with default credentials:
# fdp_base = "http://localhost:8080"
# username = "albert.einstein@example.com"
# password = "password"

fdp_client = FDPClient(base_url=fdp_base, username=username, password=password)
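
The `.rstrip("/")` call above only removes trailing slashes from the entered base link. A minimal sketch of a slightly stricter check, using only the standard library, could look like the following. `normalize_fdp_base` is a hypothetical helper name and is not part of `fairclient`.

```python
from urllib.parse import urlsplit


def normalize_fdp_base(raw: str) -> str:
    """Return the base URL without surrounding whitespace or trailing slashes.

    Hypothetical helper for illustration; raises ValueError for anything
    that is not an absolute http(s) URL.
    """
    cleaned = raw.strip().rstrip("/")
    parts = urlsplit(cleaned)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"Not an absolute http(s) URL: {raw!r}")
    return cleaned


print(normalize_fdp_base(" http://localhost:8080/ "))
```

Failing early on a malformed base link gives a clearer error than a failed request later in the notebook.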

## Catalog

The following cells define and instantiate a catalog that will be uploaded to the FDP instance. The custom `FDPCatalog` class extends the `HRICatalog` class to include an `is_part_of` field, which links the catalog to its parent FDP object. After defining the class, we create an instance with metadata including title, description, contact point, and publisher information. Finally, the catalog is converted to RDF and uploaded to the FDP.

In [None]:
class FDPCatalog(HRICatalog):
    is_part_of: list[AnyHttpUrl] = Field(
        description="Link to parent object",
        json_schema_extra={
            "rdf_term": DCTERMS.isPartOf,
            "rdf_type": "uri",
        },
    )
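
As a rough illustration of what such a field definition carries, the sketch below attaches the same kind of `json_schema_extra` metadata to a plain pydantic (v2) model and reads it back. `PartOfDemo` is a hypothetical stand-in that does not use SeMPyRO at all, and the RDF term is written as a plain string. How SeMPyRO consumes this metadata internally is not shown here.

```python
from pydantic import AnyHttpUrl, BaseModel, Field, ValidationError


class PartOfDemo(BaseModel):
    # A list-of-URLs field carrying RDF metadata, on a plain pydantic model.
    is_part_of: list[AnyHttpUrl] = Field(
        description="Link to parent object",
        json_schema_extra={
            "rdf_term": "http://purl.org/dc/terms/isPartOf",
            "rdf_type": "uri",
        },
    )


# The extra metadata is stored on the field definition, not on instances.
field_info = PartOfDemo.model_fields["is_part_of"]
print(field_info.json_schema_extra["rdf_term"])

# Valid URLs are accepted; anything else is rejected at construction time.
demo = PartOfDemo(is_part_of=["http://localhost:8080"])
print(len(demo.is_part_of))

try:
    PartOfDemo(is_part_of=["not a url"])
except ValidationError as err:
    print(f"rejected with {err.error_count()} validation error(s)")
```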

In [None]:
# Create an FDPCatalog instance with example metadata
fdp_catalog = FDPCatalog(
    title=[LiteralField(value="Test data Catalog", language="en")],
    description=[
        LiteralField(
            value="This catalogue is to test the submission of a complete metadata .ttl file to an FDP instance.",
            language="en",
        )
    ],
    contact_point=HRIVCard(hasEmail="mailto:contact-point@xumc.nl", formatted_name="Contact Point of the x UMC"),
    publisher=HRIAgent(
        name=[LiteralField(value="Academic Medical Center")],
        identifier=["https://ror.org/05wg1m734"],
        homepage=URIRef("https://www.xumc.nl"),
        mbox="mailto:data-access-committee@xumc.nl",
    ),
    is_part_of=[URIRef(fdp_base)],
    dataset=[],
)

In [None]:
fdp_catalog_record = fdp_catalog.to_graph(URIRef("https://www.example.com/catalog/testdata"))
print(fdp_catalog_record.serialize())

catalog_fdp_url = fdp_client.create_and_publish(resource_type="catalog", metadata=fdp_catalog_record)
print(catalog_fdp_url)

## Dataset

The following cells demonstrate how to manually create an HRIDataset instance with comprehensive metadata that complies with the HRI specification. The dataset contains the same data as in [dataset_health.ttl](https://github.com/Health-RI/harvesting-test-data/blob/main/dataset_health.ttl). After creating the dataset instance, it is serialized to RDF format and uploaded to the FDP instance, linked to the previously created catalog.

In [None]:
# Manually fill an HRIDataset instance with data matching the provided Turtle (.ttl) dataset.
# All supported fields are filled in, including nested ones (except was_generated_by, see the end of the call).
# Nested objects are given as SeMPyRO model instances or plain dicts, as appropriate.
hri_dataset = HRIDataset(
    identifier="http://example.com/dataset/1234567890",
    title=[
        LiteralField(value="HealthDCAT-AP test dataset", language="en"),
        LiteralField(value="HealthDCAT-AP test dataset", language="nl"),
    ],
    description=[
        LiteralField(value="This dataset is an example of using HealthDCAT-AP in CKAN", language="en"),
        LiteralField(value="Deze dataset is een voorbeeld van het gebruik van HealthDCAT-AP in CKAN", language="nl"),
    ],
    access_rights=URIRef("http://publications.europa.eu/resource/authority/access-right/NON_PUBLIC"),
    keyword=[
        LiteralField(value="Test 1", language="en"),
        LiteralField(value="Test 2", language="en"),
        LiteralField(value="Test 3", language="nl"),
    ],
    language=[
        "http://publications.europa.eu/resource/authority/language/ENG",
        "http://publications.europa.eu/resource/authority/language/NLD",
        "http://publications.europa.eu/resource/authority/language/FRA",
    ],
    conforms_to=["http://www.wikidata.org/entity/Q19597236"],
    geographical_coverage=["http://publications.europa.eu/resource/authority/country/NLD"],
    frequency="http://publications.europa.eu/resource/authority/frequency/DAILY",
    publisher=HRIAgent(
        name=[
            LiteralField(value="Health-RI Publisher", language="en"),
            LiteralField(value="Health-RI Uitgever", language="nl"),
        ],
        identifier=["https://orcid.org/publisher_id"],
        mbox="mailto:example@publisher.com",
        homepage="https://example.com/publisher_homepage",
        publisher_note=LiteralField(
            value="Health-RI is the Dutch health care initiative to build an integrated health data infrastructure for research and innovation.",
            language="en",
        ),
        publisher_type="http://purl.org/adms/publishertype/NationalAuthority",
        spatial=["http://publications.europa.eu/resource/authority/country/NLD"],
        type="http://publications.europa.eu/resource/authority/dataset-type/TEST_DATA",
    ),
    creator=[
        HRIAgent(
            name=[
                LiteralField(value="Example Creator", language="en"),
                LiteralField(value="Voorbeeld Maker", language="nl"),
            ],
            identifier=["https://orcid.org/creator_id"],
            mbox="mailto:example@creator.com",
            homepage="https://example.com/creator_homepage",
            publisher_note=LiteralField(value="Creator example note", language="en"),
            publisher_type="http://purl.org/adms/publishertype/NationalAuthority",
            spatial=["http://publications.europa.eu/resource/authority/country/NLD"],
            type="http://publications.europa.eu/resource/authority/dataset-type/TEST_DATA",
        )
    ],
    contact_point=HRIVCard(
        formatted_name=LiteralField(value="Contact Point", language="en"),
        hasEmail="mailto:contact@example.com",
        contact_page=["https://example.com/contactPoint"],
    ),
    release_date="2024-01-01T00:00:00Z",
    modification_date="2024-12-31T23:59:59Z",
    is_referenced_by=["https://doi.org/10.1038/sdata.2016.18", "https://dx.doi.org/10.1002/jmri.28679"],
    documentation=["https://www.sciensano.be/en/projects/linking-registers-covid-19-vaccine-surveillance"],
    has_version=["http://example.com/dataset/1234567890"],
    in_series=["http://example.com/series"],
    theme=[URIRef("http://publications.europa.eu/resource/authority/data-theme/HEAL")],
    type=["http://publications.europa.eu/resource/authority/dataset-type/TEST_DATA"],
    version="1.2.3",
    version_notes=[
        LiteralField(value="Dataset continuously updated", language="en"),
        LiteralField(value="Dataset continue bijgewerkt", language="nl"),
    ],
    population_coverage=[
        LiteralField(value="This example includes a very non-descript population", language="en"),
        LiteralField(value="Dit voorbeeld bevat een zeer nietszeggende populatie", language="nl"),
    ],
    number_of_records=123,
    number_of_unique_individuals=1234,
    minimum_typical_age=0,
    maximum_typical_age=110,
    analytics=[
        HRIDistribution(
            access_url="https://www.example.com/analytics",
            byte_size=2048,
            format="http://publications.europa.eu/resource/authority/file-type/TXT",
            license="http://publications.europa.eu/resource/authority/licence/SLEEPYCAT",
            rights="https://www.example.com/rights_statement_for_analytics",
            title=[
                LiteralField(value="Analytics title", language="en"),
                LiteralField(value="Analytiek titel", language="nl"),
            ],
        )
    ],
    code_values=["http://www.wikidata.org/entity/Q12125"],
    coding_system=["http://www.wikidata.org/entity/P1690", "http://www.wikidata.org/entity/P4229"],
    health_theme=["http://www.wikidata.org/entity/Q7907952", "http://www.wikidata.org/entity/Q58624061"],
    legal_basis=["https://w3id.org/dpv#Consent"],
    personal_data=[
        "https://w3id.org/dpv/pd#Age",
        "https://w3id.org/dpv/pd#Gender",
        "https://w3id.org/dpv/pd#HealthRecord",
    ],
    purpose=["https://w3id.org/dpv#AcademicResearch"],
    qualified_attribution=[
        Attribution(
            agent=HRIAgent(
                name=[
                    LiteralField(value="Organization qualifiedAttribution", language="en"),
                    LiteralField(value="Organisatie qualifiedAttribution", language="nl"),
                ],
                identifier=["https://orcid.org/qualifiedAttribution_id"],
                mbox="mailto:example@qualifiedAttribution.com",
                homepage="https://example.com/qualifiedAttribution",
                publisher_note=LiteralField(value="qualifiedAttribution example note", language="en"),
                publisher_type="http://purl.org/adms/publishertype/NationalAuthority",
                spatial=["http://publications.europa.eu/resource/authority/country/NLD"],
                type="http://publications.europa.eu/resource/authority/dataset-type/TEST_DATA",
            ),
            role="http://inspire.ec.europa.eu/metadata-codelist/ResponsiblePartyRole/pointOfContact",
        )
    ],
    qualified_relation=[
        Relationship(
            had_role=["http://www.iana.org/assignments/relation/related"],
            relation=["http://example.com/dataset/3.141592"],
        )
    ],
    quality_annotation=[
        QualityCertificate(
            target="http://example.com/dataset/1234567890", body="http://example.com/dataset/certificate"
        )
    ],
    distribution=[
        HRIDistribution(
            access_url="https://www.example.com/analytics",
            byte_size=1024,
            format="http://publications.europa.eu/resource/authority/file-type/CSV",
            license="http://publications.europa.eu/resource/authority/licence/CC_BYNCND_3_0",
            description=[
                LiteralField(value="Example of distribution description.", language="en"),
                LiteralField(value="Voorbeeld van distributie omschrijving", language="nl"),
            ],
            title=[
                LiteralField(
                    value="Technical report number of unique study subjects available by environment for project HDBP0250",
                    language="en",
                ),
                LiteralField(
                    value="Technisch rapport aantal unieke studiepersonen beschikbaar per omgeving voor project HDBP0250",
                    language="nl",
                ),
            ],
            media_type="http://www.iana.org/assignments/media-types/text/csv",
            download_url="https://fair.healthdata.be/sites/default/files/distribution/d43a158e-7d13-4660-bbc3-9d3f8d5501e5/Technical_report_number_of_unique_study_subjects_available_by_environment_for_project_HDBP0250.csv",
            release_date="2024-06-03T08:51:00Z",
            modification_date="2024-06-04T18:00:00Z",
            language=["http://publications.europa.eu/resource/authority/language/ENG"],
            linked_schemas=["http://example.org/spec"],
            packaging_format="http://www.iana.org/assignments/media-types/text/csv",
            compression_format="http://www.iana.org/assignments/media-types/application/gzip",
            status="http://publications.europa.eu/resource/authority/distribution-status/COMPLETED",
            retention_period=HRIPeriodOfTime(
                start_date="2021-01-01T00:00:00Z",
                end_date="2031-12-31T00:00:00Z",
            ),
            temporal_resolution="P1D",
            rights="http://www.example.com/rights_statement_for_distribution",
            checksum={
                "algorithm": "spdx:checksumAlgorithm_md5",
                "checksum_value": "a0f2a3c1dcd5b1cac71bf0c03f2ff1bd",
            },
            documentation=["http://example.com/analytics_documentation"],
            applicable_legislation=["http://data.europa.eu/eli/reg/2025/327/oj"],
        )
    ],
    sample=[
        HRIDistribution(
            access_url="https://www.example.com/sample",
            byte_size=4096,
            format="http://publications.europa.eu/resource/authority/file-type/XLSX",
            license="http://publications.europa.eu/resource/authority/licence/GPL_2_0_OR_LATER",
            rights="https://www.example.com/rights_statement_for_sample",
            title=[
                LiteralField(value="Sample title", language="en"),
                LiteralField(value="Monster titel", language="nl"),
            ],
        )
    ],
    temporal_coverage=[
        HRIPeriodOfTime(
            start_date="2020-01-01T00:00:00Z",
            end_date="2020-12-31T00:00:00Z",
        )
    ],
    temporal_resolution="P1D",
    retention_period=HRIPeriodOfTime(
        start_date="2020-01-01T00:00:00Z",
        end_date="2030-12-31T00:00:00Z",
    ),
    status=URIRef("http://publications.europa.eu/resource/authority/dataset-status/DEVELOP"),
    other_identifier=[
        Identifier(
            notation="https://www.healthinformationportal.eu/health-information-sources/linking-registers-covid-19-vaccine-surveillance",
            schema_agency="Health Information Portal",
        )
    ],
    applicable_legislation=["http://data.europa.eu/eli/reg/2025/327/oj"],
    source=["http://example.com/source_of_data"],
    # was_generated_by=[Activity()] # Left out because of incompleteness of the testdata; must be filled in at a later date.
)

print(hri_dataset.to_graph(subject=URIRef(hri_dataset.identifier)).serialize())

In [None]:
# Upload the test dataset to the FDP
fdp_dataset_record = hri_dataset.to_graph(subject=URIRef(hri_dataset.identifier))
fdp_dataset_record.add((URIRef(hri_dataset.identifier), DCTERMS.isPartOf, URIRef(catalog_fdp_url)))

dataset_fdp_url = fdp_client.create_and_publish(resource_type="dataset", metadata=fdp_dataset_record)
print(dataset_fdp_url)

## Compare the manual graph and the test data graph

Compare the manually created `hri_dataset` RDF graph with the reference test data from GitHub. The cells below fetch the reference TTL file, parse it, upload it to the FDP, and compare both graphs to identify any differing triples.

In [None]:
# Get and parse the TTL file
uri_testdata = URIRef(
    "https://raw.githubusercontent.com/Health-RI/harvesting-test-data/refs/heads/main/dataset_health.ttl"
)
graph_testdata = Graph().parse(uri_testdata, format="turtle")

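# Graph.identifier names the graph itself (a blank node here), not the dataset,
# so look up the dcat:Dataset subject and link that to the catalog instead.
# Assumes the reference file describes a single dcat:Dataset.
# Note: this extra triple also shows up in the comparison below.
from rdflib import DCAT, RDF

dataset_subject = graph_testdata.value(predicate=RDF.type, object=DCAT.Dataset)
graph_testdata.add((dataset_subject, DCTERMS.isPartOf, URIRef(catalog_fdp_url)))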
testdata_fdp_url = fdp_client.create_and_publish(resource_type="dataset", metadata=graph_testdata)

In [None]:
# Create a graph from hri_dataset
hri_dataset_graph = hri_dataset.to_graph(subject=URIRef(hri_dataset.identifier))

# Compare the number of triples
print(f"\nOriginal graph has {len(graph_testdata)} triples")
print(f"HRI dataset graph has {len(hri_dataset_graph)} triples")

# Find triples in original but not in hri_dataset_graph
missing_triples = graph_testdata - hri_dataset_graph
print(f"\nTriples in original but not in HRI dataset: {len(missing_triples)}")
if len(missing_triples) > 0:
    print(missing_triples.serialize(format="turtle"))

# Find triples in hri_dataset_graph but not in original
extra_triples = hri_dataset_graph - graph_testdata
print(f"\nTriples in HRI dataset but not in original: {len(extra_triples)}")
if len(extra_triples) > 0:
    print(extra_triples.serialize(format="turtle"))

### Using the to_isomorphic function

To check whether the two graphs are equivalent regardless of blank node labels, use rdflib's `to_isomorphic` function. The comparison below evaluates to `True` or `False`.

In [None]:
to_isomorphic(graph_testdata) == to_isomorphic(hri_dataset_graph)