# DCAT - Landscape analysis 

## What is [DCAT](https://www.w3.org/TR/vocab-dcat-3/)? 
**DCAT** (Data Catalog Vocabulary) is an RDF vocabulary for publishing data catalogs on the web. It provides standardized guidelines, rules and conventions for structuring and describing metadata about datasets, with the goal of increasing the interoperability and discoverability of datasets across domains. The vocabulary is machine-readable because it is defined as a set of [RDF](https://www.w3.org/RDF/) (Resource Description Framework) classes and properties. 

DCAT uses terms from other well-established metadata standards such as [FOAF](http://xmlns.com/foaf/spec/) and [Dublin Core](https://www.dublincore.org/), and introduces new classes and properties in the [`dcat` namespace](http://www.w3.org/ns/dcat#). 


Here's a timeline of the development of the vocabulary: 
- 2010: First development of DCAT by [Vassilios Peristeras](https://www.ihu.gr/ucips/cv/vassilios-peristeras) at the former [Digital Enterprise Research Institute](https://en.wikipedia.org/wiki/Digital_Enterprise_Research_Institute) in Ireland. 
- 2012: DCAT is taken over by the [W3C](https://www.w3.org/) Government Linked Data Working Group. 
- 2014: Fully standardized version of DCAT is released. 
- 2020: Release of DCAT version 2. 
- 2024: DCAT version 3 is published as a W3C Recommendation. 

The different versions of DCAT are backwards compatible, meaning that metadata that conforms to an older version also conforms to the newer versions. The updates mainly added new classes and properties and relaxed some constraints. 
                                                          


### Structure 
DCAT is built around seven main classes. Each class provides guidelines for describing a particular kind of object. 

- `dcat:Dataset`: represents a collection of data. 
- `dcat:Catalog`: represents a catalog, or collection of datasets, in which each individual item is a metadata record. 
- `dcat:Resource`: represents a dataset (or other resource) that may be described by a metadata record in a catalog. 
- `dcat:Distribution`: represents an accessible form of a dataset, such as a downloadable file. 
- `dcat:DataService`: represents a collection of operations accessible through an interface (API). 
- `dcat:DatasetSeries`: a dataset that represents a collection of datasets that are published separately, but share some characteristics that group them. 
- `dcat:CatalogRecord`: represents a metadata record in the catalog. 

Each class has a number of predicates with different levels of obligation. Some elements are mandatory, such as the title or description of a dataset. Others are recommended or optional.  

![Overview of the classes](img/dcat_class_diagram.png)

See [this file](basic-example.ttl) for an example of a data catalogue described with DCAT. 
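
As a complement to the linked file, the relationships between `dcat:Catalog`, `dcat:Dataset` and `dcat:Distribution` can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library and writes N-Triples (a line-based RDF syntax) by hand. All `https://example.org/` identifiers, titles and descriptions are hypothetical, and the `literal` helper only escapes backslashes and quotes, so it is not a general-purpose serializer.

```python
# Namespaces used by DCAT (dcat:) and Dublin Core Terms (dct:).
DCAT = "http://www.w3.org/ns/dcat#"
DCT = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical identifiers, used for illustration only.
BASE = "https://example.org/"
catalog = BASE + "catalog"
dataset = BASE + "dataset/1"
distribution = BASE + "dataset/1/csv"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text, lang="en"):
    """Format a language-tagged string, escaping backslashes and quotes only."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


# A catalog that lists one dataset, which has one downloadable distribution.
triples = [
    (iri(catalog), iri(RDF_TYPE), iri(DCAT + "Catalog")),
    (iri(catalog), iri(DCT + "title"), literal("Example catalogue")),
    (iri(catalog), iri(DCAT + "dataset"), iri(dataset)),
    (iri(dataset), iri(RDF_TYPE), iri(DCAT + "Dataset")),
    (iri(dataset), iri(DCT + "title"), literal("Example dataset")),
    (iri(dataset), iri(DCT + "description"), literal("A small illustrative dataset.")),
    (iri(dataset), iri(DCAT + "distribution"), iri(distribution)),
    (iri(distribution), iri(RDF_TYPE), iri(DCAT + "Distribution")),
    (iri(distribution), iri(DCAT + "downloadURL"), iri(BASE + "files/data.csv")),
]


def to_ntriples(rows):
    """Serialize (subject, predicate, object) terms as one N-Triples line each."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in rows)


ntriples = to_ntriples(triples)
print(ntriples)
```

In practice an RDF library would normally handle serialization; the hand-written version is only meant to make the triple structure visible.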

- Mandatory/recommended items
- [OWL 2](https://www.w3.org/TR/owl2-overview/)


## EU Recommendations: [DCAT-AP](https://semiceu.github.io/DCAT-AP/releases/3.0.0/) 
In 2013, the European Union revised its legislation on the re-use of public data. The revision adopted the "open by default" principle, which states that data held by public bodies should be open and free by default and by design, and easily accessible via an API where possible. In response, European public administrations set up Open Data portals, taking the first steps towards a European open data ecosystem. A lack of standardization led to a fragmented landscape in which it was difficult to exchange metadata between the more than 150 data portals, each of which had to be queried individually. 


The European Commission recognized the need to improve the fragmented open data environment in Europe. A common metadata language would make it easier to discover and reuse datasets across portals. With this in mind, the Commission extended the existing DCAT vocabulary with additional requirements and recommendations tailored to the European context, resulting in **DCAT-AP** (DCAT Application Profile). DCAT-AP is a specification of DCAT that reuses its terms but adds more specificity. It identifies mandatory, recommended and optional elements to be used in particular situations, and recommends controlled vocabularies. DCAT-AP is extensible, meaning that further profiles can add their own specifications on top of it. Although it was developed in the context of Open Data portals in Europe, it can be used for any type of dataset. 

DCAT-AP does not require a system to implement specific technical environments. The only requirement is that a system can export and import data in RDF. 

(This section was based on the information found in [this paper](https://doi.org/10.1504/EG.2022.121856).)


### Goals

As listed in their [documentation](https://semiceu.github.io/DCAT-AP/releases/3.0.0/), the DCAT-AP aims to facilitate data findability and promote reusability through coherent documentation of datasets. The Application Profile enables this by considering the following aspects: 
- Understanding the data or service structure
- Understanding how to get access to it
- Legal information about its use
- Information about data publishers and other agents
- Information about data availability and change policies 


The main contribution of the Application Profile is to extend DCAT with more constraints and mandatory fields. 


### Range terminology 
The following section will go through some of the additions provided by the Application Profile. You will come across the terms that are listed and described below. They will occur in the context of the range of a property, meaning it describes which form the property should take. Note that in some cases, the range refers to a class in another ontology.  
- [Literal](https://semiceu.github.io/DCAT-AP/releases/3.0.0/#Literal): A literal value such as a string or integer
- [Legal Resource](https://semiceu.github.io/DCAT-AP/releases/3.0.0/#LegalResource): An [ELI](https://op.europa.eu/en/web/eu-vocabularies/eli) (European Legislation Identifier) class that represents legislation or policies
- [Document](https://semiceu.github.io/DCAT-AP/releases/3.0.0/#Document): A textual resource intended for human consumption that contains information (e.g. a web page about a Dataset)
- [Dataset](https://semiceu.github.io/DCAT-AP/releases/3.0.0/#Dataset): A DCAT-AP class that refers to a collection of data. 
- [Provenance Statement](https://semiceu.github.io/DCAT-AP/releases/3.0.0/#ProvenanceStatement): A statement of changes in ownership and custody that are significant for its authenticity, integrity and interpretation. 
- [Distribution](https://semiceu.github.io/DCAT-AP/releases/3.0.0/#Distribution): A physical embodiment of the dataset in a particular format.  
- [Resource](https://semiceu.github.io/DCAT-AP/releases/3.0.0/#Resource): Anything described by RDF
- [Agent](https://semiceu.github.io/DCAT-AP/releases/3.0.0/#Agent): An entity that carries out actions. Often refers to the organization that publishes data. 





### Mandatory properties and DCAT-AP additions
The sections below outline mandatory properties for a selection of classes, as well as optional properties that were added specifically for the Application Profile. The information is taken from the [DCAT-AP documentation](https://semiceu.github.io/DCAT-AP/releases/3.0.0/#main-entities). In the original documentation you can find all properties and their associated constraints for all classes. Moreover, it includes an indicator to mark whether a property is:  
- `A`: reused as defined in DCAT
- `E`: reused with additional notes or restrictions
- `P`: specific to DCAT-AP



#### The `Dataset` class

Here's an [overview](https://semiceu.github.io/DCAT-AP/releases/3.0.0/#Dataset) of the most important aspects that the metadata of a dataset should include to conform to the AP. It includes the mandatory properties as well as the additions made by DCAT-AP (indicated with `P`).

| Property | Range | Definition | DCAT  | Reuse | Optionality |
| ---      | ---   | ---        | ---   | ---   | --- |
| description | Literal | A free-text account of the Dataset | [link](https://www.w3.org/TR/vocab-dcat-3/#Property:resource_description)| E | mandatory | 
| title | Literal | A name given to the Dataset | [link](https://www.w3.org/TR/vocab-dcat-3/#Property:resource_title) | E | mandatory |
| applicable legislation | Legal Resource | The legislation that mandates the creation or management of the Dataset |  | P | optional | 
| documentation | Document | A page or document about this Dataset | | P | optional | 
| has version | Dataset | A related Dataset that is a version, edition, or adaptation of the described Dataset | | P | optional | 
| provenance | Provenance Statement | A statement about the lineage of a Dataset | | P | optional | 
| sample | Distribution | A sample distribution of the dataset | | P | optional | 
| source | Dataset | A related Dataset from which the described Dataset is derived | | P | optional | 
| version notes | Literal | A description of the differences between this version and a previous version of the Dataset | | P | optional | 
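
The mandatory properties in the table above lend themselves to a simple completeness check. The sketch below is a plain-Python stand-in, not a SHACL validation: it assumes a dataset record is represented as a dictionary keyed by the property labels from the table, and both example records are hypothetical.

```python
# Mandatory Dataset properties according to the table above.
MANDATORY_DATASET_PROPERTIES = ("title", "description")


def missing_mandatory(record, mandatory=MANDATORY_DATASET_PROPERTIES):
    """Return the mandatory properties that are absent or empty in a record."""
    missing = []
    for name in mandatory:
        value = record.get(name)
        # Treat a missing key, None, or a whitespace-only string as "not provided".
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


# Hypothetical records, for illustration only.
complete = {
    "title": "Air quality 2023",
    "description": "Hourly measurements.",
    "provenance": "Collected by a sensor network.",
}
incomplete = {"title": "Air quality 2023", "description": "   "}

print(missing_mandatory(complete))    # []
print(missing_mandatory(incomplete))  # ['description']
```

The same function can be reused for the `Catalogue` class by passing a different `mandatory` tuple, for example `("description", "publisher", "title")`.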
 


#### The `Catalogue` class
This class describes a catalogue or repository that hosts Datasets or Data Services. A Data Station can be defined as a Data Service.  

| Property | Range | Definition | DCAT  | Reuse | Optionality | 
| ---      | ---   | ---        | ---   | ---   | --- | 
| description | Literal | A free-text account of the Catalogue | [link](https://www.w3.org/TR/vocab-dcat-3/#Property:resource_description) | E | mandatory | 
| publisher | Agent | An entity responsible for making the Catalogue available | [link](https://www.w3.org/TR/vocab-dcat-3/#Property:resource_publisher) | E | mandatory | 
| title | Literal | A name given to the Catalogue | [link](https://www.w3.org/TR/vocab-dcat-3/#Property:resource_title)| E | mandatory | 
| applicable legislation | Legal Resource | The legislation that mandates the creation or management of the Catalogue |  | P | optional | 



### Validation 

- Validation and SHACL information for DCAT-AP-NL can be found [here](https://docs.geostandaarden.nl/dcat/dcat-ap-nl30/#17C1E0BE).

## Who uses DCAT-AP?  


- [European data portal](https://data.europa.eu/en): The portal where all EU member states have to upload their open data. 

## [DCAT-AP-NL](https://geonovum.github.io/DCAT-AP-NL30/)
DCAT-AP-NL is the Dutch adaptation of DCAT-AP. It is developed and maintained by [Geonovum](https://www.geonovum.nl/), a Dutch executive committee concerned with developing data standards, mainly in the realm of spatial data. A number of collaborators from different fields are involved as well, including [Health-RI](https://www.health-ri.nl/over-health-ri), [Digitaal erfgoed](https://datasetregister.netwerkdigitaalerfgoed.nl/?lang=en), and [Centraal Bureau voor de Statistiek](https://www.cbs.nl/).

The profile adds further specifications, mandatory elements and recommendations tailored to Dutch open data catalogues and governmental data portals. It is built upon DCAT-AP, so metadata that complies with DCAT-AP-NL automatically also complies with DCAT-AP. The specifications are written in Dutch; the summary in this document gives English translations. 

The Geonovum website [reported](https://www.geonovum.nl/over-geonovum/actueel/nederlands-profiel-voor-europese-metadata-standaard-dcat) in July 2024 that the development of the Dutch application profile has finished, but that the profile is still awaiting confirmation from the EU before it becomes official. Confirmation is expected by the end of 2024. 



### Additions
Below you find an overview of all the mandatory properties of the `Dataset` class, as well as optional and recommended properties that have been specified by DCAT-AP-NL. There is a column to indicate whether and how the property changed compared to DCAT-AP. 


| Property | Definition   | Optionality DCAT-AP-NL | Optionality DCAT-AP | 
| ---      | ---     | ---         |    ---     |
| [title](https://geonovum.github.io/DCAT-AP-NL30/#dataset-title)    |  A name given to the Dataset |  mandatory   | mandatory         |  
| [description](https://geonovum.github.io/DCAT-AP-NL30/#dataset-description) |  A free-text account of the Dataset  |  mandatory | mandatory | 
| [access rights](https://geonovum.github.io/DCAT-AP-NL30/#dataset-access-rights) | Information about who can access the resource or an indication of its security status   |  mandatory | optional | 
| [contact point](https://geonovum.github.io/DCAT-AP-NL30/#dataset-contact-point) | Contact information that can be used for correspondence about the Dataset |  mandatory |  optional | 
| [creator](https://geonovum.github.io/DCAT-AP-NL30/#dataset-creator) | An entity responsible for making the resource | mandatory | optional | 
| [identifier](https://geonovum.github.io/DCAT-AP-NL30/#dataset-identifier) | An unambiguous reference to the resource within a given context | mandatory | optional|
| [theme](https://geonovum.github.io/DCAT-AP-NL30/#dataset-theme) |  At least one theme as defined by the [Dataset Theme Vocabulary](https://publications.europa.eu/resource/authority/data-theme) | mandatory | recommended |     
| [publisher](https://geonovum.github.io/DCAT-AP-NL30/#dataset-publisher) | An entity responsible for making the resource available   | mandatory  | recommended  |
| [conforms to](https://geonovum.github.io/DCAT-AP-NL30/#dataservice-conforms-to) | An established standard to which the described resource conforms |  recommended | optional |
| [documentation](https://geonovum.github.io/DCAT-AP-NL30/#dataset-documentation) |  A page or document about this Dataset | recommended | optional | 
| [landing page](https://geonovum.github.io/DCAT-AP-NL30/#dataset-landing-page) |  A webpage that gives access to the Dataset, as well as more information.  | recommended | optional | 
| [status](https://geonovum.github.io/DCAT-AP-NL30/#dataset-status) | The status of the dataset in the context of its life cycle | optional | NA: added in DCAT-AP-NL |  
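
To make the differences between the two profiles easier to query, the obligation levels from the table above can be captured in a small lookup structure. The sketch below is only an illustration: the values are copied from the table rather than read from the official specifications, and the ordering `optional < recommended < mandatory` is an assumption used to decide when a requirement counts as stricter.

```python
# Obligation levels ordered from weakest to strongest (assumed ordering).
LEVELS = {"optional": 0, "recommended": 1, "mandatory": 2}

# (DCAT-AP-NL level, DCAT-AP level) per Dataset property, copied from the table above.
# None marks a property that DCAT-AP does not define.
DATASET_OBLIGATIONS = {
    "title": ("mandatory", "mandatory"),
    "description": ("mandatory", "mandatory"),
    "access rights": ("mandatory", "optional"),
    "contact point": ("mandatory", "optional"),
    "creator": ("mandatory", "optional"),
    "identifier": ("mandatory", "optional"),
    "theme": ("mandatory", "recommended"),
    "publisher": ("mandatory", "recommended"),
    "conforms to": ("recommended", "optional"),
    "documentation": ("recommended", "optional"),
    "landing page": ("recommended", "optional"),
    "status": ("optional", None),
}


def tightened(obligations):
    """List properties whose obligation is stricter in DCAT-AP-NL than in DCAT-AP."""
    return sorted(
        name
        for name, (nl, ap) in obligations.items()
        if ap is not None and LEVELS[nl] > LEVELS[ap]
    )


def newly_mandatory(obligations):
    """List properties that are mandatory in DCAT-AP-NL but not in DCAT-AP."""
    return sorted(
        name
        for name, (nl, ap) in obligations.items()
        if nl == "mandatory" and ap != "mandatory"
    )


print(len(tightened(DATASET_OBLIGATIONS)))  # 9
print(newly_mandatory(DATASET_OBLIGATIONS))
# ['access rights', 'contact point', 'creator', 'identifier', 'publisher', 'theme']
```

The six newly mandatory properties are the ones a DCAT-AP-compliant catalogue would most likely need to add before it also complies with DCAT-AP-NL.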



### Who is using DCAT-AP-NL? 
- 

## [HealthDCAT-AP](https://healthdcat-ap.github.io/)
The HealthDCAT-AP extension was proposed by the [European Health Data Space](https://health.ec.europa.eu/ehealth-digital-health-and-care/european-health-data-space_en) (EHDS) following a [landscape analysis](https://ehds2pilot.eu/wp-content/uploads/2024/04/HealthData@EU-Pilot_MS6.1_FIN.pdf) that established that current metadata catalogues lack vocabulary specific to the health domain. The analysis was performed in response to studies arguing that findability is one of the main barriers to the re-use of health data. The development of a metadata scheme that includes a vocabulary specific to the health domain facilitates secondary use of health data, for instance in research and innovation. 

The development of HealthDCAT-AP is still ongoing, and the information in this notebook is based on the latest [unofficial draft](https://healthdcat-ap.github.io/), released in December 2023. The official release was [expected](https://ehds2pilot.eu/upcoming_results/extension-of-dcat-ap-healthdcat-ap/) in early 2024, but has not yet happened. No status reports on the scheme have been published since the start of 2024. 

As the HealthDCAT-AP is an extension of the DCAT-AP, it adds elements to the schema while still adhering to the main principles. The mandatory properties retain their status, while the other optionalities may change. 

### HealthDCAT additions
Here you find an outline of the additions that HealthDCAT-AP makes to the Dataset class. Properties with the DCAT Reuse value `E` are reused from DCAT but have additional notes or restrictions in HealthDCAT-AP. In the majority of cases, the change promotes a property from optional or recommended to mandatory. Properties marked with `P` are introduced in HealthDCAT-AP. 

| Property | Definition | Optionality | DCAT Reuse | 
| --- | --- | --- | --- |
| alternative | Alternative title of the dataset such as an acronym | optional | P | 
| access rights | Dataset is publicly accessible, has access restrictions or is not public | mandatory | E | 
| applicable legislation | The legislation that mandates the creation or management of the Dataset. | mandatory | E | 
| analytics | An analytics distribution of the dataset | optional | P | 
| code values | Health classifications and their codes associated with the dataset | optional | P | 
| coding system | Coding systems in use (e.g. ICD-10-CM, DRGs, SNOMED CT, ...) | optional | P | 
| conforms to | An implementing rule or other specification | optional | E | 
| dataset distribution | An available Distribution for the Dataset | mandatory | E | 
| geographical coverage | A geographic region that is covered by the Dataset | mandatory | E | 
| health category | The health category to which this dataset belongs as described in the Commission Regulation on the European Health Data Space laying down a list of categories of electronic data for secondary use, Art.33 | optional | P | 
| health data access body | Health Data Access Body supporting access to data in the Member State | mandatory | P | 
| health theme | A category of the Dataset or tag describing the Dataset | mandatory | P | 
| identifier | The main identifier for the Dataset, e.g. the URI or other unique identifier in the context of the Catalogue | mandatory | P | 
| legal basis | The legal basis used to justify processing of personal data | optional | P | 
| minimum typical age | Minimum typical age of the population within the dataset | mandatory | P | 
| maximum typical age | Maximum typical age of the population within the dataset | optional | P | 
| number of records | Size of the dataset in terms of the number of records | optional | P | 
| number of records for unique individuals | Number of records for unique individuals | optional | P | 
| personal data | Key elements that represent an individual in the dataset | optional | P | 
| population coverage | A definition of the population within the dataset | optional | P | 
| provenance | A statement about the lineage of a Dataset | mandatory | E | 
| publisher | An entity (organisation) responsible for making the Dataset available | mandatory | E | 
| publisher note | A description of the publisher activities | mandatory | P | 
| publisher type | A type of organisation that makes the Dataset available | mandatory | P | 
| purpose | A free text statement of the purpose of the processing of data or personal data | mandatory | P | 
| quality annotation | Dataset, including rating, quality certificate, feedback that can be associated to the dataset | optional | P | 
| retention period | A temporal period which the dataset is available for secondary use | optional | P | 
| sample | A sample distribution of the dataset | optional |  E | 
| theme | A category of the Dataset | mandatory | E | 
| type | A type of the Dataset | mandatory | E | 



## [Health-RI](https://github.com/Health-RI/health-ri-metadata/tree/master)

Health-RI is a Dutch effort to create a data catalogue for health data. Its metadata schema is based on DCAT-AP and HealthDCAT-AP.  

- *How well do Health-RI and HealthDCAT-AP match?* \
Below you see an overview of the mandatory properties of Health-RI, as well as a comparison to their optionality status in HealthDCAT-AP. 
- *Do the mandatory Health-RI fields match the Life Science Data Station fields?* \
Not all of the mandatory properties are currently available in the LS Data Station metadata form. 


### [Additions & Comparison](https://github.com/Health-RI/health-ri-metadata/tree/master?tab=readme-ov-file#dataset)

| Property                | Description                                                                 | Health-RI   | HealthDCAT-AP   | Present in LS Data Station?        |
|-------------------------|-----------------------------------------------------------------------------|-------------|-----------------|------------------------------------|
| access rights           | Dataset is publicly accessible, has access restrictions or is not public   | mandatory   | mandatory       | no                                 |
| applicable legislation  | The legislation that mandates the creation or management of the Dataset.    | mandatory   | mandatory       | no                                |
| contact point           | Relevant contact information for the catalog resource                       | mandatory   | mandatory       | yes, mandatory                     |
| creator                 | The entity responsible for producing the resource                           | mandatory   | optional        | yes? (Author?), mandatory          |
| description             | A free-text account of the Dataset                                          | mandatory   | mandatory       | yes                                |
| geographical coverage   | A geographic region that is covered by the Dataset                          | mandatory   | mandatory       |      yes, optional                              |
| identifier              | A unique identifier of the resource being described or catalogued           | mandatory   | mandatory       | no                                 |
| legal basis             | The legal basis used to justify processing of personal data                 | mandatory   | optional        |                          no          |
| modified                | Most recent date on which the catalog entry was changed, updated or modified| mandatory   | optional        | no                                 |
| number of records       | Size of the dataset in terms of the number of records                       | mandatory   | optional        |                          no           |
| publisher               | The entity responsible for making the resource available                    | mandatory   | mandatory       | yes? (Producer?), optional         |
| theme                   | A main category of the resource. A resource can have multiple themes        | mandatory   | mandatory       | yes? (Subject?), mandatory         |
| title                   | A name given to the Dataset                                                 | mandatory   | mandatory       | yes                                |
| issued                  | Date of formal issuance (e.g., publication) of the resource                 | mandatory   | optional        | yes, optional                      |
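
The last column of the table amounts to a gap analysis, which can be expressed as a small Python sketch. The dictionary below is transcribed from the table: each Health-RI mandatory property maps to whether it appears in the LS Data Station form and, where the table states it, its obligation there. The matches marked as tentative are the ones the table flags with a question mark, so the output should be read as provisional.

```python
# Health-RI mandatory property -> (present in the LS Data Station form?,
# obligation in that form, or None where the table does not state one).
LS_DATA_STATION = {
    "access rights": (False, None),
    "applicable legislation": (False, None),
    "contact point": (True, "mandatory"),
    "creator": (True, "mandatory"),        # tentative match: "Author?"
    "description": (True, None),
    "geographical coverage": (True, "optional"),
    "identifier": (False, None),
    "legal basis": (False, None),
    "modified": (False, None),
    "number of records": (False, None),
    "publisher": (True, "optional"),       # tentative match: "Producer?"
    "theme": (True, "mandatory"),          # tentative match: "Subject?"
    "title": (True, None),
    "issued": (True, "optional"),
}


def absent(fields):
    """Properties that Health-RI requires but the form does not offer at all."""
    return sorted(name for name, (present, _) in fields.items() if not present)


def optional_only(fields):
    """Properties that Health-RI requires but the form only offers as optional."""
    return sorted(
        name
        for name, (present, obligation) in fields.items()
        if present and obligation == "optional"
    )


print(absent(LS_DATA_STATION))
# ['access rights', 'applicable legislation', 'identifier', 'legal basis', 'modified', 'number of records']
print(optional_only(LS_DATA_STATION))
# ['geographical coverage', 'issued', 'publisher']
```

The first list shows fields that would have to be added to the form, the second shows fields whose obligation would have to be raised.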


### Useful links 
- [Mapping pipeline](https://health-ri.atlassian.net/wiki/spaces/FSD/pages/290291734/Mapping+pipeline): describes the process of metadata mapping 
- [Health-RI RDF Validator](https://www.itb.ec.europa.eu/shacl/healthri/upload): SHACL shape-based validator to see if your metadata conforms to Health-RI. 


### [FAIR Data Point](https://www.fairdatapoint.org/)
A FAIR Data Point (FDP) is a metadata service that provides access to metadata in a way that adheres to the FAIR principles. An FDP allows owners of digital objects to expose their metadata in a FAIR way, and it allows consumers of digital objects to discover information about them. By default, all uploaded resources are publicly accessible by anyone. The DCAT model is used as the basis for the metadata in an FDP. 

A FAIR Data Point has three components: 
- A definition of the [FAIR Data Point API specification](https://specs.fairdatapoint.org/fdp-specs-v1.2.html)
- A service implementing the API specification
- A client of the API (a web front end) that can be used to add, edit and query the information in the metadata 
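
The web front end is one possible client, but any HTTP client can play that role. The sketch below shows how a script might ask an FDP for its metadata using only the Python standard library. It assumes that the FDP serves RDF at its root URL and honours an `Accept: text/turtle` header, which should be checked against the API specification linked above; the address `https://fdp.example.org/` is hypothetical, and `fetch_metadata` is not called here because it needs network access.

```python
from urllib.request import Request, urlopen


def build_metadata_request(fdp_url, media_type="text/turtle"):
    """Prepare a GET request asking the FDP for RDF metadata in the given format."""
    return Request(fdp_url, headers={"Accept": media_type}, method="GET")


def fetch_metadata(fdp_url, timeout=10):
    """Send the request and return the response body as text (needs network access)."""
    with urlopen(build_metadata_request(fdp_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical FDP address; replace it with a real deployment before fetching.
request = build_metadata_request("https://fdp.example.org/")
print(request.get_method(), request.full_url, request.get_header("Accept"))
# GET https://fdp.example.org/ text/turtle
```

Separating request construction from sending keeps the example testable offline.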

#### Deployment
The deployment of an FDP involves a handful of components, each outlined below. The information is taken from the FDP [documentation](https://fairdatapoint.readthedocs.io/en/latest/deployment/local-deployment.html).  

FDP deployment components:
- Triple Store: the place where the semantic data is stored
- MongoDB: the place where other information (e.g. user accounts, roles) is saved
- FAIRDataPoint: the core component that handles all logic and operations with the semantic data. It is distributed as the Docker image `fairdatapoint`. 
- FAIRDataPoint-client: provides the user interface for humans. It is distributed as the Docker image `fairdatapoint-client`. 
- Reverse Proxy: handles HTTPS certificates to keep the connection to the FDP secured. 
 
![overview of fdp components](img/FDP_components.png)

[image source](https://fairdatapoint.readthedocs.io/en/latest/about/components.html#components)

Useful links: 
- [FAIRDataPoint Github](https://github.com/FAIRDataTeam/FAIRDataPoint)

## Questions 

- (see [minutes](https://ehds2pilot.eu/wp-content/uploads/2024/04/HealthData@EU-Pilot_MS6.2_Technical-working-group-on-the-transition-from-existing-metadata-templates-to-HealthDCAT_FIN-1.pdf) )