# Preparing and uploading data to a FAIR Data Point with SeMPyRO

In this notebook we will go through the steps of defining a simple metadata set consisting of a DCAT:Catalog and DCAT:Dataset. We will load some example data and either serialize it to a Turtle file or push it to a FAIR Data Point (FDP).

**Prerequisites:** To execute this notebook in full you need a running FAIR Data Point (FDP) instance and an active account with write access.
This notebook is written for the reference implementation, FAIR Data Point version 1.16 with default SHACL shapes.

## Imports and setup

In [None]:
from typing import List, Union

import pandas as pd
from tabulate import tabulate
from rdflib import URIRef, DCTERMS
from pydantic import AnyHttpUrl, Field, field_validator

from getpass import getpass

from fairclient.fdpclient import FDPClient

from sempyro import LiteralField
from sempyro.dcat import DCATCatalog, DCATDataset
from sempyro.vcard import VCard
from sempyro.foaf import Agent
from sempyro.utils.validator_functions import force_literal_field

## Defining a Catalog and Datasets

The function of SeMPyRO is to define objects according to a specification, such as dcat:Catalog and dcat:Dataset, and to validate the metadata against this specification. The metadata for the datasets used in this demo is in `example_data_fdp.csv`.
The FDP specification requires each dataset to be part of a catalog, so we need to create a catalog first.

To see what we need to provide for that, we can annotate the model and request the mandatory fields:

In [None]:
catalog_fields = DCATCatalog.annotate_model()
print(catalog_fields.mandatory_fields())

Let's create a minimal catalogue with an example title and description. We also need a URI to use as the graph subject at serialization. Let's use the `example.com` domain for this purpose for now:

In [None]:
catalog_subject = URIRef("http://example.com/test_catalog_1")

catalog = DCATCatalog(title=[LiteralField(value="Test catalog", language="en")],
                      description=[LiteralField(value="Catalog for test example datasets", language="en")])

We can check the serialized output of the current Catalog record to see what it looks like:

In [None]:
catalog_record = catalog.to_graph(catalog_subject)
print(catalog_record.serialize())

Now let's add datasets to the catalog.
The data for the example datasets is read from the `./example_data_fdp.csv` file. Let's look at the data:

In [None]:
df = pd.read_csv("./example_data_fdp.csv", sep=";")
print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))

We need to modify some of the text formatting to align it with the standard. Each contact point also needs to be wrapped in a VCard (vCard:Kind) object, which is also a model class from SeMPyRO.

In [None]:
df["keywords"] = df["keywords"].apply(lambda x: [y.strip() for y in x.split(",")])
df["theme"] = df["theme"].apply(lambda x: x.split(","))
df["id"] = df["id"].apply(lambda x: [str(x)])
df["contact_point"] = df.apply(
    lambda x: VCard(hasEmail=x["contact_point"], full_name=[x["author_name"]], hasUID=x["author_id"]), axis=1
)
datasets = df.to_dict('records')
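
The same clean-up can be reproduced without pandas, which makes it easier to check what each transformation does to a single row. The sketch below is only an illustration: the two sample rows are invented (they are not the contents of `example_data_fdp.csv`), and `clean_row` is a hypothetical helper name.

```python
import csv
import io

# Invented sample in the same semicolon-separated layout; not the real file contents.
SAMPLE = """id;name;keywords;theme
1;Potions study;potions, brewing ,safety;http://example.com/theme/a,http://example.com/theme/b
2;Broom aerodynamics;flight;http://example.com/theme/c
"""


def clean_row(row: dict) -> dict:
    """Return a copy of the row with the list-valued columns split out."""
    cleaned = dict(row)
    # Keywords: split on commas and trim the whitespace around each keyword.
    cleaned["keywords"] = [k.strip() for k in row["keywords"].split(",")]
    # Themes: split on commas only, mirroring the pandas cell above.
    cleaned["theme"] = row["theme"].split(",")
    # Identifier: wrap the value in a one-element list of strings.
    cleaned["id"] = [str(row["id"])]
    return cleaned


rows = [clean_row(r) for r in csv.DictReader(io.StringIO(SAMPLE), delimiter=";")]
for r in rows:
    print(r["id"], r["keywords"], r["theme"])
```

Note that theme values are split but not stripped, just as in the cell above. If a CSV has spaces after the commas in the `theme` column, the resulting values would carry that leading space.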

In [None]:
dataset_list = []
for record in datasets:
    dataset = DCATDataset(
        title=[LiteralField(value=record["name"])],
        description=[LiteralField(value=record["description"])],
        identifier=record["id"],
        creator=[record["author_id"]],
        release_date=record["issued"],
        theme=record["theme"],
        keyword=[LiteralField(value=x) for x in record["keywords"]],
        contact_point=[record["contact_point"]]
    )
    dataset_subject = URIRef(f"http://example.com/dataset_{record['id'][0]}")
    dataset_list.append({'subject_uri': dataset_subject, 'dataset': dataset})

These DCATDataset objects can be serialized individually:

In [None]:
for dataset_dict in dataset_list:
    dataset_graph = dataset_dict['dataset'].to_graph(dataset_dict['subject_uri'])
    print(dataset_graph.serialize())

The DCATDatasets can be added to the DCATCatalog to link them:

In [None]:
dataset_objects = [ds_dict['dataset'] for ds_dict in dataset_list]
catalog.dataset = dataset_objects
catalog_graph = catalog.to_graph(catalog_subject)
print(catalog_graph.serialize())

Or the DCATDataset and DCATCatalog can be linked through the subject URIs of the DCATDatasets:

In [None]:
dataset_uris = [ds_dict['subject_uri'] for ds_dict in dataset_list]
catalog.dataset = dataset_uris
catalog_graph = catalog.to_graph(catalog_subject)
for ds in dataset_list:
    catalog_graph += ds['dataset'].to_graph(ds['subject_uri'])
print(catalog_graph.serialize())

This output can also be written to a file:

In [None]:
catalog_graph.serialize(destination="./usage_example.ttl")

## Push to a FAIR Data Point

The first step in pushing to a FAIR Data Point is connecting to it. For this you need the URL, a username and a password.

In [None]:
fdp_base = input("Enter base link to FDP: ").rstrip("/'")
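
Keep in mind that `str.rstrip` treats its argument as a set of characters rather than a suffix, so `rstrip("/'")` removes any trailing run of slashes and single quotes. A small sketch of that behaviour follows; `normalize_base` is a hypothetical helper name, the URLs are made up, and the extra `strip()` for surrounding whitespace is an addition that is not in the cell above.

```python
def normalize_base(raw: str) -> str:
    """Trim surrounding whitespace, then any trailing '/' or "'" characters."""
    return raw.strip().rstrip("/'")


# Made-up inputs showing different kinds of trailing noise.
for raw in ["https://fdp.example.org/", "https://fdp.example.org//'", " https://fdp.example.org "]:
    print(repr(normalize_base(raw)))
```

A leading quote is left untouched by this approach, so a value pasted as `'https://...'` would still need manual correction.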

In [None]:
username = input("Enter username: ")

In [None]:
password = getpass(prompt="Password: ")
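
When the notebook is run non-interactively, prompts such as `input` and `getpass` cannot be answered. One possible workaround, sketched here under the assumption that credentials are supplied through environment variables, is to fall back to a prompt only when a variable is missing. The helper `get_setting` and the variable names `FDP_USERNAME` and `FDP_PASSWORD` are hypothetical and not part of any library used in this notebook.

```python
from typing import Callable, Mapping


def get_setting(name: str, env: Mapping[str, str], prompt: Callable[[str], str]) -> str:
    """Return env[name] when it is set and non-empty, otherwise ask via prompt."""
    value = env.get(name, "")
    return value if value else prompt(f"{name}: ")


# Demonstration with a plain dict and a fake prompt, so nothing blocks on input.
demo_env = {"FDP_USERNAME": "albus"}
print(get_setting("FDP_USERNAME", demo_env, lambda message: "typed-in"))  # taken from demo_env
print(get_setting("FDP_PASSWORD", demo_env, lambda message: "typed-in"))  # falls back to the prompt
```

In the notebook itself this could be called as `get_setting("FDP_USERNAME", os.environ, input)` and `get_setting("FDP_PASSWORD", os.environ, getpass)`, which keeps the interactive behaviour when the variables are not set.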

Now connect to the FDP with the given username and password:

In [None]:
fdpclient = FDPClient(base_url=fdp_base, username=username, password=password)

To align with the FDP standard, some modifications need to be made to the model. One requirement is that each object contains a link to its parent object. For a catalogue the parent is the FDP itself, and the link is expressed through a property `is_part_of` that maps to the RDF term `DCTERMS.isPartOf`. This property is outside the DCAT-AP specification.

There are two ways to add it. The first way is to add it directly to a graph after converting the base FDP link to URIRef:

In [None]:
catalog_record.add((catalog_subject, DCTERMS.isPartOf, URIRef(fdp_base)))
print(catalog_record.serialize())

The record above can be published to FDP. The second way is to create a subclass of the DCATCatalog class specifically for FDP. This is the way to go if you want to write reusable code.

In [None]:
# Create subclass of catalog, and add/override the fields different from standard DCAT-AP
class FDPCatalog(DCATCatalog):
    publisher: List[Agent] = Field(
        description="The entity responsible for making the resource available.",
        json_schema_extra={
            "rdf_term": DCTERMS.publisher, 
            "rdf_type": "uri",
        }
    )
    is_part_of: List[AnyHttpUrl] = Field(
        description="Link to parent object",
        json_schema_extra={
            "rdf_term": DCTERMS.isPartOf,
            "rdf_type": "uri",
        }
    )
    has_version: LiteralField = Field(
        description="This resource has a more specific, versioned resource",
        json_schema_extra={
            "rdf_term": DCTERMS.hasVersion,
            "rdf_type": "rdfs_literal",
        }
    )

    @field_validator("has_version", mode="before")
    @classmethod
    def convert_to_literal(cls, value: Union[str, LiteralField]) -> List[LiteralField]:
        return force_literal_field(value)


In `DCATCatalog`, the `publisher` field is inherited from `DCATResource`; it is optional and accepts either an `AnyHttpUrl` or an `Agent`:
```
publisher: List[Union[AnyHttpUrl, Agent]] = Field(
        default=None,
        description="The entity responsible for making the resource available.",
        rdf_term=DCTERMS.publisher,
        rdf_type="uri"
    )
```

❗Note that the configuration of mandatory fields and field types may be defined differently in the Shapes Constraint Language (SHACL) forms of a particular FDP instance. In that case you may need to change the example code accordingly to prevent validation errors when uploading data. To review your instance's SHACL forms, go to `<your FDP host>/schemas` and select the resource type of interest.

So far the catalogue record has been compliant with DCAT-AP. However, the default FDP shapes require us to add a `publisher` in the form of a `foaf:Agent`. We also add the previously mentioned `is_part_of` field. With the default shapes, the `has_version` field must be a single literal rather than the list of IRIs that DCAT-AP allows.

Now that we have a valid FDP catalog class, we can fill it with data.

In [None]:
fdp_catalog = FDPCatalog(
    title=[LiteralField(value="Hogwarts research catalog", language="en")],
    description=[LiteralField(value="Catalog for Hogwarts students research projects", language="en")],
    publisher=[
        Agent(
            name=["Hogwarts school of Witchcraft and Wizardry"],
            identifier="https://harrypotter.fandom.com/wiki/Hogwarts_School_of_Witchcraft_and_Wizardry",
        )
    ],
    is_part_of=[fdp_base],
    has_version="1.0",
)

fdp_catalog_record = fdp_catalog.to_graph(catalog_subject)
print(fdp_catalog_record.serialize())

In [None]:
catalog_fdp_url = fdpclient.create_and_publish(resource_type="catalog", metadata=fdp_catalog_record)
print(catalog_fdp_url)

If everything goes well, you should be able to see a new catalog entry in your FDP instance: ![newly created catalog](./imgs/fdp_catalog.png)

This time let's prepare a class for an FDP-compatible dataset that inherits from the SeMPyRO `DCATDataset`.
We need to extend the base class with an `is_part_of` property, as we did for the catalogue, make the `publisher` an `Agent` and modify the `has_version` field.

Another property to add is an identifier. FDP does not require this property, but it is useful when you need to update a record in FDP. Each time a record is created in FDP, a unique ID is assigned to it. (For the catalogue record example above we extracted it from the response header.) Because this ID does not exist before the record is created in an FDP, it is hard to track. Having an identifier at the data level is therefore highly recommended if you want to implement incremental updates.
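
If you want to work with the FDP-assigned ID itself, it can often be recovered from the URL returned at publication. The sketch below assumes the URL ends in the record ID, along the lines of `<FDP base>/catalog/<id>`. That layout is an assumption about your instance, the example URL is invented, and `record_id_from_url` is a hypothetical helper.

```python
from urllib.parse import urlparse


def record_id_from_url(url: str) -> str:
    """Return the last non-empty path segment of a URL."""
    # Drop a trailing slash first so the last segment is never empty.
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


# Invented URL in the assumed <base>/catalog/<id> layout.
example_url = "https://fdp.example.org/catalog/0f3c1a2e-1111-2222-3333-444455556666"
print(record_id_from_url(example_url))
```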

In [None]:
class FDPDataset(DCATDataset):
    publisher: List[Agent] = Field(description="The entity responsible for making the resource available.",
                                   json_schema_extra={
                                        "rdf_term": DCTERMS.publisher,
                                        "rdf_type": "uri"
                                   })
    is_part_of: List[AnyHttpUrl] = Field(description="Link to parent object",
                                         json_schema_extra={
                                             "rdf_term": DCTERMS.isPartOf,
                                             "rdf_type": "uri"
                                         })
    identifier: List[Union[str, LiteralField]] = Field(
        description="A unique identifier of the resource being described or catalogued.",
        json_schema_extra={
            "rdf_term": DCTERMS.identifier,
            "rdf_type": "rdfs_literal"
        })
    has_version: LiteralField = Field(description="This resource has a more specific, versioned resource",
                                      json_schema_extra={
                                          "rdf_term": DCTERMS.hasVersion,
                                          "rdf_type": "rdfs_literal"
                                      })

    @field_validator("has_version", mode="before")
    @classmethod
    def convert_to_literal(cls, value: Union[str, LiteralField]) -> List[LiteralField]:
        return force_literal_field(value)

Now let's create the datasets, filling in the mandatory fields and those optional fields that are present in the data, and publish them to FDP:

In [None]:
for record in datasets:
    dataset = FDPDataset(
        title=[LiteralField(value=record["name"])],
        description=[LiteralField(value=record["description"])],
        identifier=record["id"],
        is_part_of=[f"{catalog_fdp_url}"],
        creator=[record["author_id"]],
        release_date=record["issued"],
        publisher=[Agent(name=[record["publisher_name"]], identifier=record["publisher_id"])],
        theme=record["theme"],
        keyword=[LiteralField(value=x) for x in record["keywords"]],
        has_version="0.1",
    )
    dataset_subject = URIRef(f"http://example.com/dataset_{record['id'][0]}")
    dataset_graph = dataset.to_graph(dataset_subject)
    print(dataset_graph.serialize())
    dataset_fdp_id = fdpclient.create_and_publish(resource_type="dataset", metadata=dataset_graph)
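
Because `dataset_fdp_id` is overwritten on every iteration, only the value for the last dataset survives the loop. If you plan to update records later, one option is to keep a mapping from your own identifier to the value returned at publication. The sketch below illustrates the idea with a stand-in `fake_publish` function (hypothetical, returning invented URLs) so that it runs without an FDP. In the notebook, the call to `fdpclient.create_and_publish` would take its place.

```python
import uuid
from typing import Callable, Dict, List


def fake_publish(identifier: str) -> str:
    """Stand-in for a publish call: returns an invented FDP-style URL."""
    return f"https://fdp.example.org/dataset/{uuid.uuid4()}"


def publish_all(identifiers: List[str], publish: Callable[[str], str]) -> Dict[str, str]:
    """Publish each identifier once and remember what the publish call returned."""
    published: Dict[str, str] = {}
    for identifier in identifiers:
        # Skip identifiers that were already published in this run.
        if identifier in published:
            continue
        published[identifier] = publish(identifier)
    return published


registry = publish_all(["1", "2", "2", "3"], fake_publish)
for local_id, fdp_url in registry.items():
    print(local_id, "->", fdp_url)
```

Persisting such a registry, for example as a JSON file, would make it available across notebook runs.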


The catalogue we created earlier now contains 4 datasets: ![catalog](./imgs/ds_in_catalog.png)

The datasets themselves are also available: ![datasets](./imgs/datasets_fdp.png)