# Preparing metadata compliant with the Health-RI core shapes and uploading it to a FAIR Data Point (FDP) using SeMPyRO

**Prerequisites:** To run this notebook in full, you need a running FAIR Data Point (FDP) instance and an account with write access.

I used the [sample](https://github.com/Health-RI/health-ri-metadata) data in the Health-RI metadata repository as a reference for the Health-RI core shapes. The shapes themselves are available in the same repository: [shapes](https://github.com/Health-RI/health-ri-metadata/tree/develop/Formalisation(shacl)/Core/PiecesShape). The FDP instance I used is https://fdp-test.health-ri.nl.

FDP requires each dataset to be part of a catalogue, so we first need to create a catalogue. Let's see what we need to provide for that:

In [None]:
from sempyro.hri_dcat import HRICatalog

catalog_fields = HRICatalog.annotate_model()
print(catalog_fields.mandatory_fields())

Let's create a minimal catalogue with an example title and description. We also need a URI to use as the graph subject at serialization; let's use the `example.com` domain for this purpose. FDP additionally requires a link pointing to the parent object. For a catalogue the parent is the FDP itself, and the link is expressed as a property `is_part_of` mapped to the RDF term `DCTERMS.isPartOf`. This is covered in more detail in the other example, Usage_example_FDP.ipynb, whose code is reused here:

In [None]:
from typing import List

from pydantic import AnyHttpUrl, Field
from rdflib import DCTERMS, URIRef
from sempyro import LiteralField
from sempyro.foaf import Agent
from sempyro.hri_dcat import HRICatalog
from sempyro.vcard import VCard

EX = "https://example.com"
fdp_base = input("Enter base link to FDP: ").rstrip("/")


class FDPCatalog(HRICatalog):
    # FDP-specific extension: every record must link to its parent object
    is_part_of: List[AnyHttpUrl] = Field(description="Link to parent object",
                                         rdf_term=DCTERMS.isPartOf,
                                         rdf_type="uri")


# Create a minimal catalogue instance; its parent is the FDP itself
fdp_catalog = FDPCatalog(
    title=[LiteralField(value="Inflammatory Bowel Disease catalogue", language="en")],
    description=[LiteralField(value="This catalogue describes the core metadata of AUMC Inflammatory Bowel Disease datasets", language="en")],
    contact_point=VCard(hasEmail=[URIRef("mailto:data-access-committee@xumc.nl")],
                        full_name=["Data Access Committee of the x UMC"], hasUID="https://ror.org/05wg1m734"),
    publisher=[Agent(name=[LiteralField(value="Academic Medical Center")],
                     identifier="https://ror.org/05wg1m734")],
    is_part_of=[fdp_base],
    dataset=[]
)

fdp_catalog_record = fdp_catalog.to_graph(URIRef(f"{EX}/test_catalog_1"))
print(fdp_catalog_record.serialize())
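
The parent link that FDP requires is an ordinary RDF triple with the predicate `dcterms:isPartOf`, and the same pattern recurs for the dataset and the distribution published later in this notebook. The sketch below only illustrates that chain using plain strings; the helper names and the `fdp.example.org` URLs are placeholders invented for this illustration and are not part of SeMPyRO or FDP.

```python
# Illustrative only: the URLs below are placeholders, not real FDP records.
DCTERMS_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"


def is_part_of_triple(child: str, parent: str) -> str:
    """Return the N-Triples line linking a child record to its parent."""
    return f"<{child}> <{DCTERMS_IS_PART_OF}> <{parent}> ."


def parent_links(fdp_base: str, catalog: str, dataset: str, distribution: str) -> list[str]:
    """Build the chain FDP <- catalogue <- dataset <- distribution."""
    chain = [fdp_base, catalog, dataset, distribution]
    # Pair every record with the one directly above it in the hierarchy.
    return [is_part_of_triple(child, parent) for parent, child in zip(chain, chain[1:])]


for line in parent_links(
    "https://fdp.example.org",
    "https://fdp.example.org/catalog/1",
    "https://fdp.example.org/dataset/2",
    "https://fdp.example.org/distribution/3",
):
    print(line)
```

Each child can only be linked once its parent has a URL, which is why the records in this notebook are published top-down.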




Publishing a record in FDP takes two steps: creating the record and then publishing it. The two actions are separate API calls with different content types. The `FDPClient` used below wraps both in a single `create_and_publish` call. First, enter the credentials of the account with write access:

In [None]:
username = input("Enter username: ")

In [None]:
from getpass import getpass
password = getpass()

Now we can create a client instance and publish the catalogue record to FDP:

In [None]:
from fairclient.fdpclient import FDPClient

fdp_client = FDPClient(base_url=fdp_base, username=username, password=password)

catalog_fdp_url = fdp_client.create_and_publish(resource_type="catalog", metadata=fdp_catalog_record)
print(catalog_fdp_url)
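
To make the two-step flow more concrete, here is a minimal sketch of what the two requests might look like. It assumes the REST conventions of the FAIR Data Point reference implementation: a `POST` of Turtle to `{base}/{resource_type}`, then a `PUT` of a JSON state to `{record}/meta/state`, both authorised with a bearer token. The endpoint paths, the helper names and the placeholder URLs are assumptions made for illustration; this is not the code of `fairclient`. The requests are only built, not sent.

```python
import json
from urllib.request import Request


def build_create_request(fdp_base: str, resource_type: str, turtle: str, token: str) -> Request:
    """Step 1: POST the serialized metadata; the content type is Turtle."""
    return Request(
        url=f"{fdp_base.rstrip('/')}/{resource_type}",
        data=turtle.encode("utf-8"),
        headers={"Content-Type": "text/turtle", "Authorization": f"Bearer {token}"},
        method="POST",
    )


def build_publish_request(record_url: str, token: str) -> Request:
    """Step 2: PUT the new state; the content type is JSON."""
    body = json.dumps({"current": "PUBLISHED"}).encode("utf-8")
    return Request(
        url=f"{record_url.rstrip('/')}/meta/state",
        data=body,
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
        method="PUT",
    )


# Placeholder values; a real call would use the serialized graph and a real token.
create = build_create_request("https://fdp.example.org/", "catalog", "# turtle goes here", "TOKEN")
publish = build_publish_request("https://fdp.example.org/catalog/1", "TOKEN")
print(create.get_method(), create.full_url, create.get_header("Content-type"))
print(publish.get_method(), publish.full_url, publish.get_header("Content-type"))
```

Under the same assumptions, sending the first request with `urllib.request.urlopen` would return the URL of the new record in the `Location` response header, and that URL would be the input to the second step.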

Now we need to add datasets to the catalogue. The example dataset below uses data from the [sample](https://github.com/Health-RI/health-ri-metadata) repository:

In [None]:
from sempyro.hri_dcat import HRIDataset
from sempyro.vcard import VCard
import dateutil.parser as parser

hri_dataset = HRIDataset(
    contact_point=VCard(hasEmail=[URIRef("mailto:data-access-committee@xumc.nl")],
                        full_name=["Data Access Committee of the x UMC"], hasUID="https://ror.org/05wg1m734"),
    creator=[Agent(name=["Academic Medical Center"], identifier="https://ror.org/05wg1m734")],
    description=[LiteralField(value=
                              "The primary aim of the PRISMA study was to investigate the potential value of risk-tailored versus "
                              "traditional breast cancer screening protocols in the Netherlands. Data collection took place between "
                              "2014-2019, resulting in ∼67,000 mammograms, ∼38,000 surveys, ∼10,000 blood samples and ∼600 saliva "
                              "samples.")],
    issued=parser.isoparse("2024-07-01T11:11:11"),
    identifier=f"{EX}/dataset/ZLOYOJ",
    modified=parser.isoparse("2024-06-04T13:36:10.246Z"),
    publisher=[
        Agent(name=["Radboud University Medical Center"], identifier="https://ror.org/05wg1m734")],
    theme=[URIRef("http://publications.europa.eu/resource/authority/data-theme/HEAL")],
    title=[LiteralField(value="Questionnaire data of the Personalised RISk-based MAmmascreening Study (PRISMA)")],
    license=URIRef("https://creativecommons.org/licenses/by-sa/4.0/"),
    distribution=[]
)

Serialize the dataset to a graph, link it to the catalogue published above, and publish it:

In [None]:
fdp_dataset_record = hri_dataset.to_graph(subject=URIRef(hri_dataset.identifier))
fdp_dataset_record.add((URIRef(hri_dataset.identifier), DCTERMS.isPartOf, URIRef(catalog_fdp_url)))
dataset_fdp_url = fdp_client.create_and_publish(resource_type="dataset", metadata=fdp_dataset_record)

print(dataset_fdp_url)

Now we can check the catalogue and the dataset in FDP. The next step is adding a distribution. Let's create a distribution for the dataset:

In [None]:
from sempyro.hri_dcat import HRIDistribution

hri_distribution = HRIDistribution(
    title=[LiteralField(value="CSV-distribution of the questionnaire data of the Personalised RISk-based MAmmascreening Study (PRISMA)")],
    description=[LiteralField(value="CSV file containing the questionnaire data of the PRISMA study")],
    access_url=[URIRef("https://example.com/dataset/PRISMA/questionnaire.csv")],
    media_type=URIRef("https://www.iana.org/assignments/media-types/text/csv")
)

The subject URI of the distribution should be unique in the context of the dataset; here it is derived from the last path segment of the access URL, which is a mandatory field. Let's link the distribution to the dataset and publish it:

In [None]:
access_url_str = str(hri_distribution.access_url[0])
distribution_uri = URIRef(f"{EX}/distribution/{access_url_str.split('/')[-1]}")
fdp_distribution_record = hri_distribution.to_graph(subject=distribution_uri)
fdp_distribution_record.add((distribution_uri, DCTERMS.isPartOf, URIRef(f"{dataset_fdp_url}")))
distribution_fdp_id = fdp_client.create_and_publish(resource_type="distribution", metadata=fdp_distribution_record)

print(distribution_fdp_id)
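
Taking the last element of `split('/')` works for the example URL, but it yields an empty string when the access URL ends with a slash and keeps any query string. A slightly more defensive variant, using only the standard library, could look like the sketch below. The helper names and the `{base}/distribution/` layout are choices made for this illustration, not a SeMPyRO or FDP convention.

```python
import posixpath
from urllib.parse import urlsplit


def distribution_slug(access_url: str) -> str:
    """Return the last non-empty path segment of an access URL."""
    path = urlsplit(access_url).path  # the path excludes query string and fragment
    slug = posixpath.basename(path.rstrip("/"))
    if not slug:
        raise ValueError(f"No path segment to derive an identifier from: {access_url!r}")
    return slug


def distribution_uri(base: str, access_url: str) -> str:
    """Build a distribution URI under a hypothetical `{base}/distribution/` namespace."""
    return f"{base.rstrip('/')}/distribution/{distribution_slug(access_url)}"


print(distribution_uri("https://example.com", "https://example.com/dataset/PRISMA/questionnaire.csv"))
# prints: https://example.com/distribution/questionnaire.csv
```

Two distributions whose access URLs end in the same file name would still collide, so uniqueness within the dataset remains something to check separately.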

Now we can check the catalogue, dataset and distribution in FDP.
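
Besides opening the printed URLs in a browser, the records could also be checked programmatically. The sketch below builds a request for the Turtle representation of a record, assuming the FDP instance supports content negotiation through the `Accept` header. The helper name and the URL are placeholders, and the request is built but not sent.

```python
from urllib.request import Request


def build_turtle_request(record_url: str) -> Request:
    """Build a GET request asking for the Turtle form of a record."""
    return Request(record_url, headers={"Accept": "text/turtle"}, method="GET")


# Placeholder URL; in the notebook this would be one of the URLs printed above.
check = build_turtle_request("https://fdp.example.org/dataset/2")
print(check.get_method(), check.full_url, check.get_header("Accept"))
```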