# GDI Example dataset

The GDI project defines its own metadata profile, based on DCAT and HealthDCAT-AP.
This example uses the following six candidate fields:

1. Dataset Title
2. Dataset Description
3. Number of participants
4. Relevant phenotypes (covid status, sex, age?, smoking status?)
5. DUO Codes?
6. Ancestry / Population?

The first two are Dublin Core terms; the third is defined by HealthDCAT-AP.

First, we import the required classes and define the HealthDCAT-AP namespace.

⚠️ The HealthDCAT-AP namespace is not defined yet, so we use a placeholder namespace URI.

In [None]:
from typing import List, Union
from pydantic import Field
from rdflib.namespace import DefinedNamespace
from rdflib import DCAT, XSD, Namespace, URIRef
from sempyro.dcat import DCATDataset
from sempyro.rdf_model import LiteralField

# Define the HealthDCAT-AP namespace with the properties we need.
# Each annotated name becomes a term (a URIRef) in the namespace.
class HEALTHDCAT(DefinedNamespace):
    noParticipants: URIRef
    populationCoverage: URIRef

    # FIXME: This is a placeholder until the official HealthDCAT-AP namespace is defined
    _NS = Namespace("http://example.com/ns/healthdcat#")



Second, we define a dataset class for GDI MS8. It extends the `DCATDataset` class with two
additional properties borrowed from HealthDCAT-AP: the number of participants, which is
mandatory, and a description of the population coverage, which is optional.

In [None]:
class GDIDataset(DCATDataset):
    no_participants: int = Field(
        description="Number of participants in study",
        rdf_term=HEALTHDCAT.noParticipants,
        rdf_type="xsd:nonNegativeInteger",
    )
    population_coverage: List[Union[str, LiteralField]] = Field(
        default=None,
        description="A definition of the population within the dataset",
        rdf_term=HEALTHDCAT.populationCoverage,
        rdf_type="rdfs_literal",
    )

Now we are ready to define the dataset. We can do that as a Python dictionary.

⚠️ Because DCAT supports multilingual values, literals must usually be given as a list. With the
`LiteralField` class, you can set a language for each string.

In [None]:
dataset_definition = {
    'title': ["GDI MS8 dataset"],
    'description': ["This is an example dataset for GDI MS8."],
    'no_participants': 5,
    'population_coverage': ["This test dataset covers no real population."],
}
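
Before instantiating the model, it can help to sanity-check such a dictionary. The sketch below uses a hypothetical helper, `check_definition`, which is not part of sempyro. It assumes that the title and description are expected as non-empty lists and that the participant count must be a non-negative integer, mirroring `xsd:nonNegativeInteger`.

```python
from typing import List


def check_definition(definition: dict) -> List[str]:
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for key in ("title", "description"):
        value = definition.get(key)
        if not isinstance(value, list) or not value:
            problems.append(f"'{key}' should be a non-empty list of literals")
    count = definition.get("no_participants")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(count, int) or isinstance(count, bool) or count < 0:
        problems.append("'no_participants' should be a non-negative integer")
    return problems


sample_definition = {
    "title": ["GDI MS8 dataset"],
    "description": ["This is an example dataset for GDI MS8."],
    "no_participants": 5,
}
print(check_definition(sample_definition))
```

The model class performs its own validation on instantiation; a check like this only yields friendlier messages earlier.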


Finally, we instantiate the dataset class, convert it to an RDF graph, and print its Turtle serialization.

In [None]:
example = GDIDataset(**dataset_definition)
example_graph = example.to_graph(URIRef("http://example.com/dataset"))
example_graph.bind("healthdcat", HEALTHDCAT)
print(example_graph.serialize(format="turtle"))