# Preparing and uploading data to FAIR Data Point with SeMPyRO

**Prerequisites:** To run this notebook in full, you need a running FAIR Data Point (FDP) instance and an account with write access.
This notebook is written for the reference implementation, FAIR Data Point version 1.16, with the default SHACL shapes.

Let us consider uploading the datasets from `example_data_fdp.csv` to FDP.
FDP requires each dataset to be part of a catalogue, so we first need to create a catalogue. Let's see which fields are mandatory for that:

In [None]:
from sempyro.dcat import DCATCatalog

catalog_fields = DCATCatalog.annotate_model()
print(catalog_fields.mandatory_fields())

Let's create a minimal catalogue with an example title and description. We also need a URI to use as the graph subject during serialization. For now, let's use the `example.com` domain for this purpose:

In [None]:
from sempyro import LiteralField
from rdflib import URIRef

catalog_subject = URIRef("http://example.com/test_catalog_1")

catalog = DCATCatalog(title=[LiteralField(value="Test catalog", language="en")],
                      description=[LiteralField(value="Catalog for test example datasets", language="en")])
catalog_record = catalog.to_graph(catalog_subject)
print(catalog_record.serialize())

In [None]:
fdp_base=input("Enter base link to FDP: ").rstrip("/'")
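
The base link is later converted to a `URIRef` and used as the `is_part_of` target, so it can help to check it before connecting. The following is a minimal sketch using only the standard library; `normalize_base_url` is a hypothetical helper and not part of SeMPyRO or the FDP client, and the URL is a made-up example.

```python
from urllib.parse import urlsplit


def normalize_base_url(raw: str) -> str:
    """Strip whitespace, quotes and trailing slashes; require an http(s) URL."""
    cleaned = raw.strip().strip("'\"").rstrip("/")
    parts = urlsplit(cleaned)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"Not an http(s) URL: {raw!r}")
    return cleaned


# Made-up address, for illustration only.
print(normalize_base_url(" 'https://fdp.example.com/' "))
```

Failing early with a `ValueError` here is easier to diagnose than a validation error raised later, when the catalogue model is built.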

In [None]:
username=input("Enter username: ")

In [None]:
from getpass import getpass
password = getpass()

Now connect to FDP with the given username and password:

In [None]:
from fairclient.fdpclient import FDPClient

fdpclient = FDPClient(base_url=fdp_base, username=username, password=password)

We can check the serialized output of the current catalogue record to see what it looks like.

In [None]:
catalog_record = catalog.to_graph(catalog_subject)
print(catalog_record.serialize())

Another FDP requirement is a link pointing to the parent object. For a catalogue the parent is the FDP itself, and the link should be given as an `is_part_of` property mapped to `DCTERMS.isPartOf`. This property is outside the DCAT-AP specification. There are two ways to add it. The first is to add it directly to the graph (remembering to convert the base FDP link to a `URIRef`):

In [None]:
from rdflib import DCTERMS

catalog_record.add((catalog_subject, DCTERMS.isPartOf, URIRef(fdp_base)))
print(catalog_record.serialize())

The record above can be published to FDP. However, if you want reusable code, the second way is better: create a child catalogue class specifically for FDP that reflects the logic FDP requires.

In `DCATCatalog`, the `publisher` field is inherited from `DCATResource`, is optional, and takes either an `AnyHttpUrl` or an `Agent`:
```
publisher: List[Union[AnyHttpUrl, Agent]] = Field(
        default=None,
        description="The entity responsible for making the resource available.",
        rdf_term=DCTERMS.publisher,
        rdf_type="uri"
    )
```

❗Note that the configuration of mandatory fields and field types may be defined differently in the Shapes Constraint Language (SHACL) shapes of your FDP instance. In that case you may need to adapt the example code below to prevent validation errors when uploading data. To review your instance's SHACL shapes, go to `<your FDP host>/schemas` and select the resource type of interest.

So far the catalogue record has been compliant with DCAT-AP. However, the default FDP shapes require a `publisher` in the form of a `foaf:Agent`. We also add the previously mentioned `is_part_of` field. With the default shapes, the `has_version` field must be a single literal rather than the list of IRIs that DCAT-AP allows.

In [None]:
from sempyro.foaf import Agent
from sempyro import LiteralField
from sempyro.utils.validator_functions import force_literal_field

from pydantic import AnyHttpUrl, Field, field_validator
from typing import List, Union


# Create subclass of catalog, and add/override the fields different from standard DCAT-AP
class FDPCatalog(DCATCatalog):
    publisher: List[Agent] = Field(
        description="The entity responsible for making the resource available.",
        rdf_term=DCTERMS.publisher,
        rdf_type="uri",
    )
    is_part_of: List[AnyHttpUrl] = Field(description="Link to parent object", rdf_term=DCTERMS.isPartOf, rdf_type="uri")
    has_version: LiteralField = Field(
        description="This resource has a more specific, versioned resource",
        rdf_term=DCTERMS.hasVersion,
        rdf_type="rdfs_literal",
    )

    @field_validator("has_version", mode="before")
    @classmethod
    def convert_to_literal(cls, value: Union[str, LiteralField]) -> LiteralField:
        return force_literal_field(value)

Now that we have a valid FDP catalog class, we can fill it with data.

In [None]:
fdp_catalog = FDPCatalog(
    title=[LiteralField(value="Hogwarts research catalog", language="en")],
    description=[LiteralField(value="Catalog for Hogwarts students research projects", language="en")],
    publisher=[
        Agent(
            name=["Hogwarts school of Witchcraft and Wizardry"],
            identifier="https://harrypotter.fandom.com/wiki/Hogwarts_School_of_Witchcraft_and_Wizardry",
        )
    ],
    is_part_of=[fdp_base],
    has_version="1.0",
)

fdp_catalog_record = fdp_catalog.to_graph(catalog_subject)
print(fdp_catalog_record.serialize())

In [None]:
catalog_fdp_url = fdpclient.create_and_publish(resource_type="catalog", metadata=fdp_catalog_record)
print(catalog_fdp_url)

If everything goes well, you should see a new catalog entry in your FDP instance: ![newly created catalog](./imgs/fdp_catalog.png)

Now let's add datasets to the catalog.
Data for the example datasets will be read from the `./example_data_fdp.csv` file. Let's look at the data:

In [None]:
from tabulate import tabulate
import pandas as pd

df = pd.read_csv("./example_data_fdp.csv", sep=";")
print(tabulate(df, headers='keys', tablefmt='psql', showindex=False))

Let's prepare the source data:

In [None]:
from sempyro.vcard import VCard

df["keywords"] = df["keywords"].apply(lambda x: [y.strip() for y in x.split(",")])
df["theme"] = df["theme"].apply(lambda x: x.split(","))
df["id"] = df["id"].apply(lambda x: [str(x)])
df["contact_point"] = df.apply(
    lambda x: VCard(hasEmail=x["contact_point"], full_name=[x["author_name"]], hasUID=x["author_id"]), axis=1
)
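
To make the transformations above concrete, here is what the splitting does to a single row, written in plain Python instead of pandas. The row values are invented for illustration and are not taken from `example_data_fdp.csv`.

```python
# Hypothetical row; the values are made up for illustration.
row = {
    "id": 7,
    "keywords": "potions, herbology ,charms",
    "theme": "http://example.com/theme/a,http://example.com/theme/b",
}

# Same logic as the lambdas applied to the DataFrame columns.
keywords = [k.strip() for k in row["keywords"].split(",")]
themes = row["theme"].split(",")
identifier = [str(row["id"])]

print(keywords)
print(themes)
print(identifier)
```

Note that the `theme` values are split without stripping whitespace, so that column should not contain spaces after the commas.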

This time let's prepare a class for an FDP-compatible dataset that inherits from SeMPyRO's `DCATDataset`.
As with the catalogue, we need to extend the base class with an `is_part_of` property, make the publisher an `Agent`, and modify the `has_version` field.

Another property to add is an identifier. FDP does not require it, but it is useful when you need to update a record in FDP. Each time a record is created in FDP, a unique ID is assigned to it. (For the catalogue record example above, we extracted it from the response header.) Because this ID does not exist before the record is created in FDP, it is hard to track. Having your own identifier at the data level is therefore highly recommended for implementing incremental updates.
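
One way to use such an identifier for incremental updates is to keep a small local index that maps it to the URL FDP returned at creation time. The sketch below assumes a JSON file as storage; `load_index`, `save_index` and `plan_action` are hypothetical helpers, not part of SeMPyRO or the FDP client, and the URL is made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def load_index(path: Path) -> Dict[str, str]:
    """Read the identifier -> FDP URL mapping, or start with an empty one."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_index(path: Path, index: Dict[str, str]) -> None:
    """Persist the mapping so the next run can see earlier uploads."""
    path.write_text(json.dumps(index, indent=2, sort_keys=True))


def plan_action(index: Dict[str, str], identifier: str) -> str:
    """Return 'update' for identifiers already published, else 'create'."""
    return "update" if identifier in index else "create"


with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "fdp_index.json"
    index = load_index(index_path)
    print(plan_action(index, "1"))  # not seen before: create
    index["1"] = "https://fdp.example.com/dataset/abc"  # made-up URL
    save_index(index_path, index)
    print(plan_action(load_index(index_path), "1"))  # seen before: update
```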

In [None]:
from sempyro.dcat import DCATDataset

class FDPDataset(DCATDataset):
    publisher: List[Agent] = Field(description="The entity responsible for making the resource available.",
                                        rdf_term=DCTERMS.publisher,
                                        rdf_type="uri")
    is_part_of: List[AnyHttpUrl] = Field(description="Link to parent object",
                                         rdf_term=DCTERMS.isPartOf,
                                         rdf_type="uri"
                                        )
    identifier: List[Union[str, LiteralField]] = Field(
        description="A unique identifier of the resource being described or catalogued.",
        rdf_term=DCTERMS.identifier,
        rdf_type="rdfs_literal")
    has_version: LiteralField = Field(description="This resource has a more specific, versioned resource",
                                      rdf_term = DCTERMS.hasVersion,
                                      rdf_type="rdfs_literal")

    @field_validator("has_version", mode="before")
    @classmethod
    def convert_to_literal(cls, value: Union[str, LiteralField]) -> LiteralField:
        return force_literal_field(value)

Now let's create the datasets, filling in the mandatory fields and the optional ones present in the data, and publish them to FDP:

In [None]:
datasets = df.to_dict('records')
for record in datasets:
    dataset = FDPDataset(
        title=[LiteralField(value=record["name"])],
        description=[LiteralField(value=record["description"])],
        identifier=record["id"],
        is_part_of=[f"{catalog_fdp_url}"],
        creator=[record["author_id"]],
        release_date=record["issued"],
        publisher=[Agent(name=[record["publisher_name"]], identifier=record["publisher_id"])],
        theme=record["theme"],
        keyword=[LiteralField(value=x) for x in record["keywords"]],
        has_version="0.1",
    )
    dataset_subject = URIRef(f"http://example.com/dataset_{record['id'][0]}")
    dataset_graph = dataset.to_graph(dataset_subject)
    print(dataset_graph.serialize())
    dataset_fdp_id = fdpclient.create_and_publish(resource_type="dataset", metadata=dataset_graph)
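
When publishing many records, one failing record would stop a loop like the one above. A possible pattern, sketched here under assumptions, is to collect successes and failures separately. `publish_all` and `fake_publish` are hypothetical; `fake_publish` stands in for building the graph and calling the FDP client, and the records and URLs are invented.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def publish_all(
    records: Iterable[dict], publish: Callable[[dict], str]
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Publish every record; return (identifier -> URL, list of failures)."""
    published: Dict[str, str] = {}
    failed: List[Tuple[str, str]] = []
    for record in records:
        key = str(record["id"])
        try:
            published[key] = publish(record)
        except Exception as exc:  # keep going so one bad record does not stop the batch
            failed.append((key, str(exc)))
    return published, failed


def fake_publish(record: dict) -> str:
    """Stand-in for the real upload; rejects records without a name."""
    if not record.get("name"):
        raise ValueError("missing title")
    return f"https://fdp.example.com/dataset/{record['id']}"


ok, bad = publish_all([{"id": 1, "name": "A"}, {"id": 2, "name": ""}], fake_publish)
print(ok)
print(bad)
```

The returned mapping can also serve as the identifier-to-URL index needed for later updates.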


The catalogue we created earlier now contains 4 datasets: ![catalog](./imgs/ds_in_catalog.png)

and the datasets themselves are available: ![datasets](./imgs/datasets_fdp.png)