# GDI Example dataset

The GDI project has its own specific metadata based on DCAT and HealthDCAT-AP.
Here is an example using the following six fields:

1. Dataset Title
2. Dataset Description
3. Number of participants
4. Relevant phenotypes (covid status, sex, age?, smoking status?)
5. DUO Codes?
6. Ancestry / Population?

The first two are Dublin Core terms; the third is defined by HealthDCAT-AP.

Let's first import what we need and define the HealthDCAT-AP namespace.

⚠️ The HealthDCAT-AP namespace is not formally defined yet, so we use a placeholder.

In [2]:
from typing import List, Union

from pydantic import ConfigDict, Field
from rdflib import DCAT, DCTERMS, Namespace, URIRef
from rdflib.namespace import DefinedNamespace

from sempyro.dcat import DCATDataset
from sempyro.rdf_model import LiteralField


# Define HealthDCAT-AP namespace with some properties
class HEALTHDCAT(DefinedNamespace):
    # Members of a DefinedNamespace resolve to URIRef, so they are annotated as such
    minTypicalAge: URIRef
    maxTypicalAge: URIRef
    numberOfUniqueIndividuals: URIRef
    numberOfRecords: URIRef
    populationCoverage: URIRef

    # FIXME: This is a placeholder until official HealthDCAT-AP namespace is defined
    _NS = Namespace("http://example.com/ns/healthdcat#")

Second, we define a Dataset class for GDI MS8 (Milestone 8). This is based on the DCAT-AP dataset
class, but with a few additional properties we borrow from HealthDCAT-AP. In this case, the typical
age range, the number of participants and the number of records are mandatory properties, and the
population coverage description is optional.

In [3]:
class GDIDataset(DCATDataset):
    model_config = ConfigDict(
                              json_schema_extra={
                                  "$ontology": "https://healthdcat-ap.github.io/",
                                  "$namespace": str(HEALTHDCAT),
                                  "$IRI": DCAT.Dataset,
                                  "$prefix": "healthdcatap"
                              }
                              )
    min_typical_age: int = Field(
        description="Minimum typical age of the population within the dataset",
        rdf_term=HEALTHDCAT.minTypicalAge,
        rdf_type="xsd:nonNegativeInteger",
    )
    max_typical_age: int = Field(
        description="Maximum typical age of the population within the dataset",
        rdf_term=HEALTHDCAT.maxTypicalAge,
        rdf_type="xsd:nonNegativeInteger",
    )
    no_unique_individuals: int = Field(
        description="Number of participants in study",
        rdf_term=HEALTHDCAT.numberOfUniqueIndividuals,
        rdf_type="xsd:nonNegativeInteger",
    )
    no_records: int = Field(
        description="Size of the dataset in terms of the number of records.",
        rdf_term=HEALTHDCAT.numberOfRecords,
        rdf_type="xsd:nonNegativeInteger",
    )
    population_coverage: List[Union[str, LiteralField]] = Field(
        default=None,
        description="A definition of the population within the dataset",
        rdf_term=HEALTHDCAT.populationCoverage,
        rdf_type="rdfs_literal",
    )

Now we are ready to define the dataset. We can do that as a Python dictionary.

⚠️ Because DCAT supports multilingual values, literals must usually be given as a list. Using the
`LiteralField` class, you can specify a language for each string.

In [None]:
import datetime

from sempyro.dcat.dcat_distribution import DCATDistribution
from sempyro.foaf.agent import Agent
from sempyro.vcard.vcard import VCard

dataset_definition = {
    "contact_point": [VCard(hasEmail=["mailto:cto@biodata.pt"], full_name=["BioData.pt Chief Technology Officer"],
                           hasUID="https://ror.org/02q7abn51")],
    "creator": [Agent(name=["BioData.pt"], identifier="https://ror.org/02q7abn51")],
    "description": ["This dataset is being used as part of the GDI Milestone 8, containing VCFs and phenotypic data in CSV format about 41514 samples. The dataset consists only of synthetic data."],
    "distribution": ["https://fdp.gdi.biodata.pt/gdi/distribution"],  # TODO: list the URIs of all distributions
    "release_date": datetime.datetime(2024, 7, 7, 11, 11, 11, tzinfo=datetime.timezone.utc),
    "keyword": ["COVID"],
    "identifier": ["GDID-becadf5a-a1b2"],
    "update_date": datetime.datetime(2024, 11, 4, 10, 20, 5, tzinfo=datetime.timezone.utc),
    "publisher": [Agent(name=["BioData.pt"], identifier="https://ror.org/02q7abn51")],
    "theme": [URIRef("http://publications.europa.eu/resource/authority/data-theme/HEAL")],
    "title": ["PT node COVID-19 GWAS and Allele Frequency Lookup Dataset with Italian population 1"],
    "license": URIRef("https://creativecommons.org/licenses/by-sa/4.0/"),
    "no_unique_individuals": 41514,
    "no_records": 18382376,
    "population_coverage": ["This test dataset covers no real population."],
    "min_typical_age": 18,
    "max_typical_age": 64,
}

distribution_definition = {
    "title": ["GWAS and Allele Frequency Lookup Data Distribution for GDI MS8"],
    # Adjacent string literals are concatenated, so the first part ends with a space
    "description": ["VCF file containing GWAS and allele frequency lookup data of synthetic COVID-19 "
                    "cases and controls for GDI MS8 demonstration."],
    "access_url": ["https://example.com/dataset/GDI-MS8-COVID19.vcf"],
    "media_type": "https://www.iana.org/assignments/media-types/application/vcf",
    # "identifier": ["GDIF-12345678-90ab-defg"]
}

dataset_subject = URIRef("https://fdp.gdi.biodata.pt/gdi/dataset")
distribution_subject = URIRef("https://fdp.gdi.biodata.pt/gdi/distribution")
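
The integer fields above are declared as `xsd:nonNegativeInteger`, and the age bounds only make sense if the minimum does not exceed the maximum. A minimal pre-flight check, assuming the key names used in `dataset_definition`, could look like the sketch below. The helper `check_dataset_definition` is hypothetical and not part of SeMPyRO; whether the library performs equivalent validation itself is not shown in this notebook.

```python
def check_dataset_definition(definition: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    integer_fields = ("no_unique_individuals", "no_records",
                      "min_typical_age", "max_typical_age")
    for key in integer_fields:
        value = definition.get(key)
        if value is None:
            problems.append(f"{key} is missing")
        elif isinstance(value, bool) or not isinstance(value, int) or value < 0:
            # bool is a subclass of int in Python, so it is rejected explicitly
            problems.append(f"{key} must be a non-negative integer, got {value!r}")
    low = definition.get("min_typical_age")
    high = definition.get("max_typical_age")
    if isinstance(low, int) and isinstance(high, int) and low > high:
        problems.append("min_typical_age is greater than max_typical_age")
    return problems


# Values copied from the example dataset in this notebook
sample = {"no_unique_individuals": 41514, "no_records": 18382376,
          "min_typical_age": 18, "max_typical_age": 64}
print(check_dataset_definition(sample))  # → []
```

Running such a check before instantiating the model keeps data-entry mistakes separate from serialization errors.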

In [None]:
distribution_definition = {
    "title": ["GWAS and Allele Frequency Lookup Data Distribution for GDI MS8"],
    # Adjacent string literals are concatenated, so the first part ends with a space
    "description": ["VCF file containing GWAS and allele frequency lookup data of synthetic COVID-19 "
                    "cases and controls for GDI MS8 demonstration."],
    "access_url": ["https://example.com/dataset/GDI-MS8-COVID19.vcf"],  # TODO: location on S&I
    "media_type": "https://www.iana.org/assignments/media-types/application/vcf",
    # "identifier": ["GDIF-12345678-90ab-defg"]
}

# The path segment after the final slash must be unique for each distribution
distribution_subject = URIRef("https://fdp.gdi.biodata.pt/gdi/distribution/1")
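
When a dataset has several distributions, each one needs its own subject URI. One possible way to get distinct final path segments is to generate them, as in the sketch below. The helper `distribution_subjects` and the use of random UUIDs are assumptions made for illustration; this notebook itself uses the hand-written suffix `/1`.

```python
import uuid


def distribution_subjects(base: str, count: int) -> list:
    """Build `count` distinct subject URI strings below `base`, one per distribution."""
    prefix = base.rstrip("/")
    # uuid4() yields a random identifier, so collisions are practically impossible
    return [f"{prefix}/{uuid.uuid4()}" for _ in range(count)]


subjects = distribution_subjects("https://fdp.gdi.biodata.pt/gdi/distribution", 3)
for subject in subjects:
    print(subject)
```

Each returned string would then be wrapped in `URIRef` before being passed to `to_graph`.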


Next, we instantiate the dataset and distribution classes and print the serialization.

In [None]:
example_dataset = GDIDataset(**dataset_definition)
example_dataset_graph = example_dataset.to_graph(dataset_subject)
example_distribution = DCATDistribution(**distribution_definition)
example_distribution_graph = example_distribution.to_graph(distribution_subject)

# Add them up for prettier visualization
print((example_dataset_graph + example_distribution_graph).serialize(format="turtle"))

Now we can push the dataset to a FAIR Data Point (FDP). For this, we use the
[FAIRClient](https://github.com/Health-RI/fairclient) library developed by Health-RI.

First, we define a couple of settings. Note: you will need an existing Catalog in the FDP to add
the dataset to. If you don't have one, you can easily create one using the web interface.

In [None]:
fdp_parent_catalog = "https://fdp.gdi.biodata.pt/catalog/88e146bb-0fb9-4f45-ad57-26f928750773"
fdp_baseurl = "https://fdp.gdi.biodata.pt"
fdp_user = "albert.einstein@example.com"
# Read the password from a local file so that it is not stored in the notebook
with open("/home/ubuntu/fdp/passwd.txt", "r") as f:
    fdp_pass = f.read().strip()
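
The password file path above is specific to one machine. As an alternative, the password could be taken from an environment variable, with a file as an optional fallback. The sketch below is one way to do that; the function `read_fdp_password` and the variable name `FDP_PASSWORD` are hypothetical and are not defined by FAIRClient or the FDP.

```python
import os
from pathlib import Path


def read_fdp_password(env_var: str = "FDP_PASSWORD", fallback_file=None) -> str:
    """Return the password from `env_var`, else from `fallback_file`, else raise."""
    value = os.environ.get(env_var)
    if value:
        return value.strip()
    if fallback_file is not None:
        # Strip the trailing newline that most editors add to the file
        return Path(fallback_file).read_text(encoding="utf-8").strip()
    raise RuntimeError(f"Set {env_var} or provide a fallback password file")


# Possible usage (left commented out because it depends on the local setup):
# fdp_pass = read_fdp_password(fallback_file="/home/ubuntu/fdp/passwd.txt")
```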

In [None]:
import fairclient.fdpclient

# Log in to the FAIR Data Point
fdpclient = fairclient.fdpclient.FDPClient(base_url=fdp_baseurl, username=fdp_user, password=fdp_pass)

# Add a reference to the parent catalog to make the FDP happy
example_dataset_graph.add((dataset_subject, DCTERMS.isPartOf, URIRef(fdp_parent_catalog)))

In [None]:
new_dataset = fdpclient.create_and_publish("dataset", example_dataset_graph)
print(new_dataset)

To conclude, we add the distribution to the dataset. This is done by linking the distribution graph to the newly published dataset with `dcterms:isPartOf` and then pushing it to the FDP.

In [None]:
example_distribution_graph.add((distribution_subject, DCTERMS.isPartOf, URIRef(f"{new_dataset}")))  # repeat for each distribution
distribution_fdp_id = fdpclient.create_and_publish(resource_type="distribution", metadata=example_distribution_graph)

print(distribution_fdp_id)