# GDI Example dataset

The GDI project has its own specific metadata based on DCAT and HealthDCAT-AP.
Here is an example using the following six fields:

1. Dataset Title
2. Dataset Description
3. Number of participants
4. Relevant phenotypes (COVID status, sex, age?, smoking status?)
5. DUO Codes?
6. Ancestry / Population?

The first two are Dublin Core terms; the third is defined by HealthDCAT-AP.

First, we import the required classes and define the HealthDCAT-AP namespace.

⚠️ The HealthDCAT-AP namespace is not formally defined yet, so we use a placeholder.

In [None]:
from typing import List, Union

from pydantic import ConfigDict, Field
from rdflib import DCAT, DCTERMS, Namespace, URIRef
from rdflib.namespace import DefinedNamespace

from sempyro.dcat import DCATDataset
from sempyro.rdf_model import LiteralField


# Define HealthDCAT-AP namespace with some properties
class HEALTHDCAT(DefinedNamespace):
    minTypicalAge: int
    maxTypicalAge: int
    numberOfUniqueIndividuals: int
    numberOfRecords: int
    populationCoverage: List[LiteralField]

    # FIXME: This is a placeholder until official HealthDCAT-AP namespace is defined
    _NS = Namespace("http://healthdataportal.eu/ns/health#")

Second, we define a Dataset class for GDI MS8. It is based on the DCAT-AP dataset class, with
a few additional properties borrowed from HealthDCAT-AP. In this case, the number of participants,
the number of records, and the minimum and maximum typical age are mandatory properties, while the
population coverage description is optional.
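
The difference between mandatory and optional properties comes down to whether a field has a default. A minimal sketch with plain pydantic, using a hypothetical `MiniDataset` model that stands in for the real class and leaves out all RDF metadata:

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class MiniDataset(BaseModel):
    # Hypothetical stand-in for GDIDataset, reduced to two fields
    no_unique_individuals: int                 # no default: mandatory
    population_coverage: Optional[str] = None  # default given: optional


# Omitting the optional field is accepted
ok = MiniDataset(no_unique_individuals=4)
print(ok.population_coverage)

# Omitting the mandatory field raises a validation error
try:
    MiniDataset(population_coverage="No real population")
    missing_field_rejected = False
except ValidationError:
    missing_field_rejected = True
print(missing_field_rejected)
```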

In [None]:
class GDIDataset(DCATDataset):
    model_config = ConfigDict(
        json_schema_extra={
            "$ontology": "https://healthdcat-ap.github.io/",
            "$namespace": str(HEALTHDCAT),
            "$IRI": DCAT.Dataset,
            "$prefix": "healthdcatap",
        }
    )
    min_typical_age: int = Field(
        description="Minimum typical age of the population within the dataset",
        json_schema_extra={
            "rdf_term": HEALTHDCAT.minTypicalAge,
            "rdf_type": "xsd:nonNegativeInteger"
        }
    )
    max_typical_age: int = Field(
        description="Maximum typical age of the population within the dataset",
        json_schema_extra={
            "rdf_term": HEALTHDCAT.maxTypicalAge,
            "rdf_type": "xsd:nonNegativeInteger",
        }
    )
    no_unique_individuals: int = Field(
        description="Number of participants in study",
        json_schema_extra={
            "rdf_term": HEALTHDCAT.numberOfUniqueIndividuals,
            "rdf_type": "xsd:nonNegativeInteger",
        }
    )
    no_records: int = Field(
        description="Size of the dataset in terms of the number of records.",
        json_schema_extra={
            "rdf_term": HEALTHDCAT.numberOfRecords, 
            "rdf_type": "xsd:nonNegativeInteger",
        }
    )
    population_coverage: List[Union[str, LiteralField]] = Field(
        default=None,
        description="A definition of the population within the dataset",
        json_schema_extra={
            "rdf_term": HEALTHDCAT.populationCoverage,
            "rdf_type": "rdfs_literal"
        }
    )

Now we are ready to define the dataset. We can do that as a Python dictionary.

⚠️ Because DCAT supports multilingual metadata, literals must usually be given as a list. With the
`LiteralField` class, you can specify a language for each string.

In [None]:
import datetime

from sempyro.dcat.dcat_distribution import DCATDistribution
from sempyro.dcat.data_service import DCATDataService
from sempyro.foaf.agent import Agent
from sempyro.vcard.vcard import VCard


dataset_subject = URIRef("http://example.com/gdi/dataset")
distribution_subject = URIRef("http://example.com/gdi/distribution")
dataservice_definition_subject = URIRef("http://example.com/gdi/dataservice")

dataset_definition = {
    "contact_point": [VCard(hasEmail=["mailto:servicedesk@health-ri.nl"], formatted_name=["Servicedesk Health-RI"],
                             hasUID="https://ror.org/05wg1m734")],
    "creator": [Agent(name=["BSC"], identifier="https://ror.org/05wg1m734", mbox=["mailto:info@bsc.es"], homepage="https://www.bsc.es/")],
    "description": ["""The synthetic genomes have been created trying to mimic real cancer data of 4 patients (Named 185, 186, 187, and 188). 
    Mutations are based on real CRC patients from the PCAWG dataset. For each patient, two tumor samples at different time points and one 
    healthy sample have been simulated. The cancer intra-tumor heterogeneity and evolution in the patients is depicted by simulating reads 
    from tumor subclones separately and then mixing them according to their clonal proportions in each sample. For rapid use and transfer, 
    only selected chromosomes have been generated for each patient. For Beacon queries, the following url can be used https://gdi-beacon-prototype.azurewebsites.net/api/datasets/EGAD50000000276"""],
    "distribution": ["http://example.com/gdi/distribution"],
    "release_date": datetime.datetime(2025, 1, 29, 11, 11, 11, tzinfo=datetime.timezone.utc),
    "keyword": ["Cancer", "Colorectal Cancer", "Genomics", "Synthetic Data"],
    "identifier": ["EGAD50000000276"],
    "modification_date": datetime.datetime(2025, 1, 29, 13, 36, 10, tzinfo=datetime.timezone.utc),
    "publisher": [Agent(name=["Health-RI"], identifier="https://ror.org/05wg1m734", mbox=["mailto:servicedesk@health-ri.nl"], homepage="https://www.health-ri.nl/")],
    "theme": [URIRef("http://publications.europa.eu/resource/authority/data-theme/HEAL")],
    "title": ["EOSC4Cancer Longitudinal Synthetic Colorectal Cancer Genomic data developed at BSC"],
    "no_unique_individuals": 4,
    "min_typical_age": 1,
    "max_typical_age": 100,
    "no_records": 10,
    "population_coverage": ["This test dataset covers no real population."],
    "access_rights": URIRef("http://publications.europa.eu/resource/authority/access-right/PUBLIC")
}

distribution_definition = {
    "title": ["Example VCF from BSC"],
    "description": ["VCF file containing GWAS and allele frequency lookup data of synthetic Colorectal cancer cases."],
    "access_url": ["https://ega-archive.org/datasets/EGAD50000000276"],
    "media_type": "https://www.iana.org/assignments/media-types/application/vcf",
    "license": URIRef("https://creativecommons.org/licenses/by-sa/4.0/"),
    "format": "https://publications.europa.eu/resource/authority/file-type/VCF",
    "byte_size": 1024,
    "rights": URIRef("http://publications.europa.eu/resource/authority/access-right/PUBLIC")
}

dataservice_definition = {
    "title": ["Beacon+ implementation of Health-RI"],
    "description": ["Beacon queries on cancer genomics data"],
    "endpoint_url": ["https://gdi-test.healthdata.nl/api"],
    "endpoint_description": ["https://gdi-test.healthdata.nl/api/info"],  # assuming beacon+ spec URL
    "serves_dataset": [
        "http://example.com/gdi/dataset"
    ],
    "access_rights": URIRef("http://publications.europa.eu/resource/authority/access-right/RESTRICTED"),
    "contact_point": [VCard(hasEmail=["mailto:servicedesk@health-ri.nl"], formatted_name=["Servicedesk Health-RI"],
                             hasUID="https://ror.org/05wg1m734")],
    "identifier": ["EGAD50000000276-beacon"],
    "license": URIRef("https://creativecommons.org/licenses/by-nc/4.0/"),
    "publisher": [Agent(name=["Health-RI"], identifier="https://ror.org/05wg1m734", mbox=["mailto:servicedesk@health-ri.nl"], homepage="https://www.health-ri.nl/")],
    "theme": [URIRef("http://publications.europa.eu/resource/authority/data-theme/HEAL")],  # placeholder for controlled vocab if missing
    "version": "v2.0",
    "keyword": ["beacon"],
    "conforms_to": "https://github.com/ga4gh-beacon/specification-v2"
}


Finally, we instantiate the dataset class and print the serialization.

In [None]:
example_dataset = GDIDataset(**dataset_definition)
example_dataset_graph = example_dataset.to_graph(dataset_subject)
example_distribution = DCATDistribution(**distribution_definition)
example_distribution_graph = example_distribution.to_graph(distribution_subject)
example_dataservice = DCATDataService(**dataservice_definition)
example_dataservice_graph = example_dataservice.to_graph(dataservice_definition_subject)

# Combine the three graphs so they are printed as a single Turtle document
print((example_dataset_graph + example_distribution_graph + example_dataservice_graph).serialize(format="turtle"))

Now, we can push the Dataset to a FAIR Data Point. For this, we use the Health-RI developed
[FAIRClient](https://github.com/Health-RI/fairclient) library.

First, we define a few settings. Note: you need an existing Catalog in the FDP to add
the dataset to. If you don't have one, you can create it in the FDP web interface.
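
Outside a local test setup, you may prefer not to hard-code credentials. One possible approach, sketched here with a hypothetical helper `load_fdp_settings` and hypothetical environment variable names, is to read them from the environment and fall back to the local test values used in this notebook:

```python
import os


def load_fdp_settings(env=os.environ):
    """Read FDP connection settings, falling back to this notebook's local test values."""
    # The variable names FDP_BASEURL, FDP_USER and FDP_PASS are hypothetical
    return {
        "baseurl": env.get("FDP_BASEURL", "http://localhost:8888"),
        "user": env.get("FDP_USER", "albert.einstein@example.com"),
        "password": env.get("FDP_PASS", "password"),
    }


settings = load_fdp_settings()
print(settings["baseurl"])
```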

In [None]:
fdp_parent_catalog = "http://localhost:8888/catalog/83e5c2a9-9fe8-4b98-8737-47ef7332f579"
fdp_baseurl = "http://localhost:8888"
fdp_user = "albert.einstein@example.com"
fdp_pass = "password"

In [None]:
import fairclient.fdpclient

# Log in to the FAIR Data Point
fdpclient = fairclient.fdpclient.FDPClient(base_url=fdp_baseurl, username=fdp_user, password=fdp_pass)

# The FDP expects a reference to the parent catalog, so add it to the dataset graph
example_dataset_graph.add((dataset_subject, DCTERMS.isPartOf, URIRef(fdp_parent_catalog)))

In [None]:
new_dataset = fdpclient.create_and_publish("dataset", example_dataset_graph)
print(new_dataset)

To conclude, we add the distribution and the data service. Each graph first gets a reference to its parent (the distribution points to the newly created dataset, the data service to the newly created distribution) and is then pushed to the FDP.

In [None]:
example_distribution_graph.add((distribution_subject, DCTERMS.isPartOf, URIRef(str(new_dataset))))
distribution_fdp_id = fdpclient.create_and_publish(resource_type="distribution", metadata=example_distribution_graph)

print(distribution_fdp_id)

example_dataservice_graph.add((dataservice_definition_subject, DCTERMS.isPartOf, URIRef(str(distribution_fdp_id))))
dataservice_fdp_id = fdpclient.create_and_publish(resource_type="dataservice", metadata=example_dataservice_graph)

print(dataservice_fdp_id)

Note: The FDP identifiers of the distribution and the data service must still be set manually in the dataset's `distribution` field and the distribution's `access_service` field, as this has not been automated yet!