DatasetSeries #240

Open
bertvannuffelen opened this issue Nov 21, 2022 · 18 comments
Labels
alignment-DCAT3.0 release:3.0.0 https://semiceu.github.io/DCAT-AP/releases/3.0.0 status:fixed This issue has been fixed in a draft.

Comments

@bertvannuffelen
Contributor

DCAT 3.0 introduces the notion of DatasetSeries.
As an application profile, DCAT-AP may add additional constraints on DatasetSeries.

This issue is to collect proposals.

Some suggestions are

  • Presence of a title
  • At least one element in the series
  • Preferred chain is forward linking (first -> next -> next)

If the community provides no suggestions, a proposal for DatasetSeries will be made with a minimal set of constraints.
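For illustration, the first two suggestions could look roughly like this in Turtle. This is only a sketch: the `ex:` namespace, the resource names and the titles are hypothetical, and it shows the suggested constraints, not an agreed DCAT-AP profile.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

# Hypothetical series that carries a title (first suggestion).
ex:budget a dcat:DatasetSeries ;
  dcterms:title "Budget data"@en .

# At least one member dataset, linked to the series via dcat:inSeries
# (second suggestion).
ex:budget-2022 a dcat:Dataset ;
  dcterms:title "Budget data for 2022"@en ;
  dcat:inSeries ex:budget .
```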

@oystein-asnes

oystein-asnes commented Nov 21, 2022

+1 from Norway to include DatasetSeries in DCAT-AP.

First/Last is defined by the DatasetSeries description and should be included as optional there?

Prev/Next is defined by the child dataset description (correct me if I am wrong @bertvannuffelen). Question 2: Do we need dcat:next when we have dcat:prev? At least for time series, I assume providers adding a new dataset to a DatasetSeries don't have a dcat:next to point to, but always a dcat:prev (unless it is the first).

Statement: we don't need dcat:next to put the datasets in a datasetSeries in line.

@init-dcat-ap-de

We support the initial statements about dcterms:title and "at least one element".
We support Norway's statements about dcat:prev. Backward linking allows adding data without changing the previous dataset, which is easier to handle:
dcat:prev --> recommended
dcat:next --> optional

Additionally, we think that dct:accrualPeriodicity should be recommended.
There should also be a PO vocabulary to make the type (dcterms:type) of the dcat:DatasetSeries explicit (similar to Dataset Types).

Typical entries: tbd.

@andrea-perego

About the use of dcat:prev and dcat:next, this is covered in §7 Use of inverse properties of DCAT3.

Quoting:

The properties described in 6. Vocabulary specification do not include inverses intentionally, with the purpose of ensuring interoperability also in systems not making use of OWL reasoning.

However, recognizing that inverses are needed for some use cases, DCAT supports them, but with the requirement that they MAY be used only in addition to those described in 6. Vocabulary specification, and that they MUST NOT be used to replace them.

dcat:next is one of these inverse properties.

In practice, the recommendation is as follows:

  • use dcat:prev
  • you can also use its inverse (namely, dcat:next) but only if dcat:prev is present
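A small Turtle sketch of this recommendation, assuming a hypothetical `ex:` namespace and three yearly datasets: every link is stated with dcat:prev, and the one dcat:next shown only repeats a dcat:prev statement that is already present.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

ex:budget a dcat:DatasetSeries ;
  dcterms:title "Budget data"@en .

# First member: nothing precedes it, so it has no dcat:prev.
ex:budget-2021 a dcat:Dataset ;
  dcat:inSeries ex:budget ;
  # Optional inverse; allowed here only because the matching
  # dcat:prev statement exists on ex:budget-2022 below.
  dcat:next ex:budget-2022 .

ex:budget-2022 a dcat:Dataset ;
  dcat:inSeries ex:budget ;
  dcat:prev ex:budget-2021 .

# Newest member: added without touching ex:budget-2022.
ex:budget-2023 a dcat:Dataset ;
  dcat:inSeries ex:budget ;
  dcat:prev ex:budget-2022 .
```

Following the dcat:prev links from the newest member back to the first is enough to reconstruct the order, so a consumer does not have to rely on dcat:next.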

@matthiaspalmer

What about the following scenario:

  1. Periodic datasets (e.g. yearly or even monthly) corresponding to downloadable files, one distribution per dataset.
  2. A dataset series connecting all these datasets via the inSeries property.
  3. But there is also a dataset service (an API) that is up to date and gives access to all data independent of the period (with potential parameters to filter per period, but that is different from case to case).

What should be the best practise here? I see two different alternatives:

  1. Add another distribution for each periodic dataset that points to the data service (via accessService), or
  2. Add a distribution on the level of the dataset series that points to the dataservice

Is alternative 2 even allowed, i.e. can a dataset series have a distribution?
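To make alternative 1 concrete, here is a rough Turtle sketch for a single monthly dataset. The `ex:` namespace, the URLs and the names are made up for illustration; the same second distribution would have to be repeated on every dataset in the series.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

ex:measurements a dcat:DatasetSeries ;
  dcterms:title "Monthly measurements"@en .

# Hypothetical API that serves the data of all periods.
ex:measurements-api a dcat:DataService ;
  dcterms:title "Measurements API"@en ;
  dcat:endpointURL <http://example.org/api/measurements> ;
  dcat:servesDataset ex:measurements-2023-01 .

# One periodic dataset in the series.
ex:measurements-2023-01 a dcat:Dataset ;
  dcterms:title "Measurements, January 2023"@en ;
  dcat:inSeries ex:measurements ;
  dcat:distribution ex:measurements-2023-01-file ,
                    ex:measurements-2023-01-api .

# The downloadable file for this period.
ex:measurements-2023-01-file a dcat:Distribution ;
  dcat:accessURL <http://example.org/files/measurements-2023-01.csv> ;
  dcat:downloadURL <http://example.org/files/measurements-2023-01.csv> .

# Alternative 1: an extra distribution that points to the data service.
ex:measurements-2023-01-api a dcat:Distribution ;
  dcat:accessURL <http://example.org/api/measurements> ;
  dcat:accessService ex:measurements-api .
```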

@jakubklimek
Contributor

jakubklimek commented Feb 2, 2023

I see two different alternatives:

Ad 1) that could be perceived as breaking the assumption that the distributions are informationally equivalent. Nevertheless, it is similar to the case where we have a SPARQL endpoint serving multiple totally separate datasets, having that SPARQL endpoint as a DataService distribution of each of those datasets. IMHO this is definitely a viable possibility.
Ad 2) In Czechia, we do not allow instances of DatasetSeries to have distributions. But in this particular case, it could make sense. However, we have examples where we actually do something else:

3) Periodic datasets form a series as in 1) On top of that, there is another DatasetSeries, which contains the periodic DatasetSeries and a new dataset with the DataService distribution. Here we use DatasetSeries to group datasets serving similar data in various forms, kind of a vague DatasetSeries interpretation.
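A possible Turtle reading of option 3, with hypothetical `ex:` names. It relies on dcat:DatasetSeries being a subclass of dcat:Dataset in DCAT 3, so that a series can itself carry dcat:inSeries; whether DCAT-AP should endorse this nesting is exactly the kind of question this issue is collecting input on.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

# Outer series: groups the same data offered in various forms.
ex:budget-all a dcat:DatasetSeries ;
  dcterms:title "Budget data, all forms"@en .

# Inner series: the periodic datasets, as in 1).
ex:budget-yearly a dcat:DatasetSeries ;
  dcterms:title "Budget data, yearly files"@en ;
  dcat:inSeries ex:budget-all .

ex:budget-2022 a dcat:Dataset ;
  dcterms:title "Budget data for 2022"@en ;
  dcat:inSeries ex:budget-yearly .

# Additional dataset whose only distribution is the data service.
ex:budget-live a dcat:Dataset ;
  dcterms:title "Budget data, up-to-date API"@en ;
  dcat:inSeries ex:budget-all ;
  dcat:distribution [
    a dcat:Distribution ;
    dcat:accessURL <http://example.org/api/budget> ;
    dcat:accessService ex:budget-api
  ] .

ex:budget-api a dcat:DataService ;
  dcterms:title "Budget API"@en ;
  dcat:endpointURL <http://example.org/api/budget> ;
  dcat:servesDataset ex:budget-live .
```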

@sirex

sirex commented Feb 2, 2023

Periodic datasets (e.g. yearly or even monthly) corresponding to downloadable files, one distribution per dataset.

In Lithuania we do not allow such things as Periodic datasets, we always have one dataset and multiple distributions. According to DCAT on distributions:

A specific representation of a dataset. A dataset might be available in multiple serializations that may differ in various ways, including natural language, media-type or format, schematic organization, temporal and spatial resolution, level of detail or profiles (which might specify any or all of the above).

Here is an example: https://data.gov.lt/dataset/espbi-is-e-recepto-posistemes-duomenys?lang=en

If we had so-called periodic datasets, we would have to duplicate the same dataset multiple times and change only some attributes.

So with dataset series, as I understand it, things like periodic datasets are encouraged, and I think this will end up in an explosion of datasets. Of course this is good for statistics, to be able to say that we have millions of datasets, but most of them are duplicates.

I think there should be a requirement that datasets should not be divided by properties like time or space, and that distributions should be used if such a division is needed.

@matthiaspalmer

@sirex I fully understand and agree with your perspective. I also worry about the effect or extra burden it will put on publishers, portals and tool developers.

However, the approach you describe is problematic as well. Having multiple distributions to divide the data in a spatial or temporal dimension goes against how distributions are supposed to be used. (Multiple distributions are supposed to correspond to different representations of a dataset, typically in different formats.)

If you are interested, I have argued for another solution where we repeat the dcat:downloadURL in a distribution instead, it is described here: w3c/dxwg#868 (comment).
And in some more detail here: w3c/dxwg#1429 (comment)

From Sweden's perspective it is unclear how we are going to proceed. One option could be to keep the approach of repeating downloadURLs for most trivial scenarios and do a conversion when exporting to data.europe.eu.

In this case we would reserve the use of DatasetSeries for scenarios where you need to provide more metadata. But we would still need to know how to do such a conversion. That is why I brought up the scenario above, as it is quite common to have two distributions, one with a bunch of files and another with an API.

@sirex

sirex commented Feb 2, 2023

One simple solution would be to use URI templates, like here:

And a distribution with a templated URI is quite similar to a data service. That is why we are moving away from distributions to data services; basically everything will be a data service.

Then you can get any distribution you want, you can filter by time, location or other attributes.

So instead of

DatasetSeries -> Dataset -> Distribution

There will be:

Dataset -> DataService -> <dynamic distributions> (periodic, in any format you want, in any language you want).

We still plan to use Dataset Series, but for grouping similar datasets into groups. But from what I see, that is not what is intended here?

The data service that I'm talking about:

https://get.data.gov.lt/datasets/gov/:ns

All data can be downloaded in multiple formats, and you can generate infinite number of so called periodic datasets, like this:

https://get.data.gov.lt/datasets/gov/lsd/covid19/AtvejaiIrMirtys?date="2020-02-01"
https://get.data.gov.lt/datasets/gov/lsd/covid19/AtvejaiIrMirtys?date="2020-02-02"
...

So the question is are dataset series really just for bunch of identical datasets partitioned by temporal or spatial dimension (which seems like a huge overhead) or it can be used for any kind of datasets to group them together, for example by similar topic?

@matthiaspalmer

I have also been considering URI templates (https://www.rfc-editor.org/rfc/rfc6570) but unfortunately it is not an option to use in the dcat:accessURL or dcat:downloadURL positions. From the RFC:

URI Templates are not URIs: they do not identify an abstract or
physical resource, they are not parsed as URIs, and they should not
be used in places where a URI would be expected unless the template
expressions will be expanded by a template processor prior to use.

So, if it were to be used, it would have to be in a different property, but then we would still need a value for dcat:accessURL, which is mandatory. So unfortunately it is not a valid solution.

Your approach of pointing to dataservices from the dataset requires the use of an intermediate distribution as there is no direct suitable property on the dataset level in the dcat namespace. However, you could point in the other direction via the dcat:servesDataset property.

I share your view that (in most cases) it will be a huge overhead to use the Dataset series for temporal or spatial dimension. That is why we have been using multiple dcat:accessURL and dcat:downloadURL to point to multiple files instead in Sweden. We also support a title on these urls to provide a way of distinguishing them, e.g. something like "Budget 2020", "Budget 2021" etc. If needed, additional properties could be attached there, like spatial and temporal properties. But the title has been enough for our use cases up to now.
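As a rough illustration of that approach (not the exact Swedish profile; the `ex:` namespace, file URLs and titles are made up), a single distribution with repeated dcat:downloadURL values and a title attached to each URL could look like this:

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/> .

ex:budget a dcat:Dataset ;
  dcterms:title "Budget"@en ;
  dcat:distribution ex:budget-csv .

# One distribution that points to several files of the same format.
ex:budget-csv a dcat:Distribution ;
  dcat:accessURL <http://example.org/budget> ;
  dcat:downloadURL <http://example.org/files/budget-2020.csv> ,
                   <http://example.org/files/budget-2021.csv> .

# Titles attached to the download URLs themselves, to tell the files apart.
<http://example.org/files/budget-2020.csv> dcterms:title "Budget 2020"@en .
<http://example.org/files/budget-2021.csv> dcterms:title "Budget 2021"@en .
```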

@init-dcat-ap-de

So the question is are dataset series really just for bunch of identical datasets partitioned by temporal or spatial dimension (which seems like a huge overhead) or it can be used for any kind of datasets to group them together, for example by similar topic?

As I understand the DCAT specification and the last webinar: both. They are often partitioned by temporal or spatial dimensions, but they can also be a loose grouping of datasets.

always have one dataset and multiple distributions. According to DCAT on distributions

My interpretation of DCAT is the opposite of your way to do "Periodic datasets". That said, using distributions for this happens in Germany as well, because it is the easiest and only existing out-of-the-box solution to present periodic distributions as a bundle in most data portals.

@jakubklimek
Contributor

@sirex

In Lithuania we do not allow such things as Periodic datasets, we always have one dataset and multiple distributions.

This has been discussed thoroughly in the past years both in DCAT and DCAT-AP groups. See the note in DCAT's description of Distribution, which is the result of the discussions. The point is that those should be different datasets exactly because they have a difference in some attributes. Specifically, spatial and temporal coverage are attributes of Dataset, not Distribution. To keep the dataset e.g. for a certain time period, machine findable, you need these attributes of the individual Datasets.

Moreover, if you had e.g. budget for 2020, 2021, etc., each available in different formats (XML, CSV, ...) it would be even more confusing to have a dataset Budget with distributions "2020 in XML", "2020 in CSV", "2021 in XML", etc. and not distinguishable, unless you specify your own coverage properties for Distributions, despite how this is designed in DCAT.
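A sketch of what this looks like for one year, assuming hypothetical `ex:` identifiers and URLs: the temporal coverage sits on the dataset, and the distributions differ only by format.

```turtle
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:      <http://example.org/> .

ex:budget a dcat:DatasetSeries ;
  dcterms:title "Budget"@en .

# The year is described once, on the dataset, so it is machine findable.
ex:budget-2020 a dcat:Dataset ;
  dcterms:title "Budget 2020"@en ;
  dcat:inSeries ex:budget ;
  dcterms:temporal [
    a dcterms:PeriodOfTime ;
    dcat:startDate "2020-01-01"^^xsd:date ;
    dcat:endDate   "2020-12-31"^^xsd:date
  ] ;
  dcat:distribution ex:budget-2020-xml , ex:budget-2020-csv .

# Same data, two representations.
ex:budget-2020-xml a dcat:Distribution ;
  dcat:accessURL <http://example.org/files/budget-2020.xml> ;
  dcat:mediaType <https://www.iana.org/assignments/media-types/application/xml> .

ex:budget-2020-csv a dcat:Distribution ;
  dcat:accessURL <http://example.org/files/budget-2020.csv> ;
  dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> .
```

A dataset for 2021 would repeat the same pattern with its own dcterms:temporal value.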

@matthiaspalmer

I also worry about the effect or extra burden it will put on publishers, portals and tool developers.

There can be a separation between how the data is structured in RDF, where it needs to be precise for achieving interoperability, and how the data is shown to/collected from users such as publishers. The fact that there is a dataset series, with individual datasets, each with e.g. one distribution, can still be hidden from the users, if this is desirable. I can still do a UI for publishers, where they drag and drop 10 files and are done with it, and still represent those 10 files as distributions of 10 datasets in a series, given that I get the metadata from e.g. the context of the operation in the UI.

I share your view that (in most cases) it will be a huge overhead to use the Dataset series for temporal or spatial dimension. That is why we have been using multiple dcat:accessURL and dcat:downloadURL to point to multiple files instead in Sweden. We also support a title on these urls to provide a way of distinguishing them, e.g. something like "Budget 2020", "Budget 2021" etc. If needed, additional properties could be attached there, like spatial and temporal properties. But the title has been enough for our use cases up to now.

But then you basically attach additional data to these URLs that you would normally attach to a dataset - I see no advantage of this approach, and, from the interoperability point of view, no-one who reads DCAT and DCAT-AP will expect those properties to be there - so again - sure, this can be done technically, and maybe you save a few RDF triples, but it is not interoperable with DCAT and its other users then, which is the whole point. And this would have been a problem even without DatasetSeries.

@sirex

sirex commented Feb 11, 2023

@jakubklimek

Sorry, I'm late to the discussion, and thanks for the explanation, I think, now I see things more clearly.

In Lithuania we are moving toward federated open data publishing through a single unified API, so I guess, essentially we will have only one dataset or a data service get.data.gov.lt. And basically everything else is created artificially, to separate data into datasets by topics and provide multiple urls as distributions for convenience.

So in our case we are not describing already published data, but instead first we name a dataset and then publish the data. And one of the requirements for data publishers is to not split datasets by temporal, spatial or other attributes, instead those attributes should be added to the data itself and the data publishing service allows to filter the data by any attribute.

I guess, we could artificially split single data stream into multiple datasets and combine them into single dataset series, but as I understand, this only applies for already published data?

So currently, we try to name datasets, by a topic, and if user searches for "budget", a single dataset will be returned. Then user can pick a format and apply filters if needed. But to make it more convenient, we provide multiple distributions, already filtered by year or by region for less technical data users. But these distributions are created just for convenience, all of them point to the same data publishing service.

Here is an example:

<datasets/budget> a dcat:Dataset ;
  dct:title "Budget" .

<services/budget> a dcat:DataService ;
  dcat:endpointURL <https://get.data.gov.lt/datasets/gov/org/budget> ;
  dcat:servesDataset <datasets/budget> .

<distributions/budget/2023> a dcat:Distribution ;
  dcat:accessService <services/budget> ;
  dcat:downloadURL <https://get.data.gov.lt/datasets/gov/org/budget?date.year=2023> .

<distributions/budget/2022> a dcat:Distribution ;
  dcat:accessService <services/budget> ;
  dcat:downloadURL <https://get.data.gov.lt/datasets/gov/org/budget?date.year=2022> .

So my question is: do we need to refactor this kind of data presentation, which uses distributions, into multiple datasets and a dataset series for DCAT 3? Or is the way we do it now also OK?

@init-dcat-ap-de

As soon as you link the distributions to the dataset you are in conflict with the definition of distributions:

<datasets/budget> a dcat:Dataset ;
  dct:title "Budget" ;
  dcat:distribution <distributions/budget/2023> ;
  dcat:distribution <distributions/budget/2022> .

Imho, this would be your structure in DCAT3:

<datasetseries/budget> a dcat:DatasetSeries ;
  dct:title "Budget" .

<services/budget> a dcat:DataService ;
  dcat:endpointURL <https://get.data.gov.lt/datasets/gov/org/budget> ;
  dcat:servesDataset <datasetseries/budget> .

<dataset-distribution/budget/2023> a dcat:Dataset ;
  dcat:inSeries <datasetseries/budget> ;
  dct:title "Budget for 2023" ;
  dct:description "For convenience we prepared the budget for 2023, but it's just a call for our awesome webservice!" ;
  dcat:distribution [
    a dcat:Distribution ;
    dcat:accessService <services/budget> ;
    dcat:downloadURL <https://get.data.gov.lt/datasets/gov/org/budget?date.year=2023>
  ] .

<dataset-distribution/budget/2022> a dcat:Dataset ;
  dcat:inSeries <datasetseries/budget> ;
  dct:title "Budget for 2022" ;
  dct:description "For convenience we prepared the budget for 2022, but it's just a call for our awesome webservice!" ;
  dcat:distribution [
    a dcat:Distribution ;
    dcat:accessService <services/budget> ;
    dcat:downloadURL <https://get.data.gov.lt/datasets/gov/org/budget?date.year=2022>
  ] .

@sirex

sirex commented Feb 13, 2023

As soon as you link the distributions to the dataset you are in conflict with the definition of distributions:

Could you explain, where is the conflict?

And why, this would not work?

<distributions/budget/2023> a dcat:Distribution ;
  dcat:accessService <services/budget> ;
  dcat:downloadURL <https://get.data.gov.lt/datasets/gov/org/budget?date.year=2023> .

<distributions/budget/2022> a dcat:Distribution ;
  dcat:accessService <services/budget> ;
  dcat:downloadURL <https://get.data.gov.lt/datasets/gov/org/budget?date.year=2022> .

<datasets/budget> a dcat:Dataset ;
  dct:title "Budget" ;
  dcat:distribution 
    <distributions/budget/2023> ,
    <distributions/budget/2022> .

@init-dcat-ap-de

@sirex

Budgets are the counter-example for things not to model as multiple distributions, since they are not "the same data in a different format or resolution":

As a counter-example, budget data for different years would usually be modeled as different datasets, each with their own distributions, since all distributions of one dataset should broadly contain the same data. - https://www.w3.org/TR/vocab-dcat-3/#Class:Distribution

Would it work? In your portal, definitely. In a portal which harvests you? Probably. They would show one dataset with multiple files to download. Is it semantically correct? As far as I understand it: no.

@bertvannuffelen
Contributor Author

As a counter-example, budget data for different years would usually be modeled as different datasets, each with their own distributions, since all distributions of one dataset should broadly contain the same data. - https://www.w3.org/TR/vocab-dcat-3/#Class:Distribution

Would it work? In your portal, definitely. In a portal which harvests you? Probably. They would show one dataset with multiple files to download. Is it semantically correct? As far as I understand it: no.

The last point is the challenge. According to the DCAT(-AP) definitions, your last example, @sirex, violates the intent that each distribution contains exactly the same data and that the difference is only in the representation.
In practice this is not, and cannot be, checked. The same holds for the advice to use Distributions for file-based access and DataService for other means.

So while we cannot prohibit the example from existing, we can prohibit assigning the term Dataset Series to it.
That is the goal of this discussion: to decide which usage we assign the term Dataset Series to.
I would stick to the purest semantics as provided by W3C DCAT 3.0, and not use it for existing practices.
For your example, I would consider it as a Dataset with a number of Distributions. Not a Dataset Series.

The problem with your example is that it is impossible to distinguish from another dataset with 2 distributions (e.g. one RDF and the other JSON). That is the reason why the notion of Dataset Series was explicitly introduced in DCAT 3.0: to unambiguously indicate whether it is a series or not.
So it is not about whether I can represent the grouping in some approach, but whether I can communicate that it is a grouping.

@sirex

sirex commented Feb 20, 2023

For your example, I would consider it as a Dataset with a number of Distributions. Not a Dataset Series.

So I'm not sure what to do. We are moving towards large datasets that are not split by time or place, and all datasets are published as data services. For example, all data (DCAT) of our open data portal (data.gov.lt) is published as a single data service and has a single dataset record. Depending on the format, it might be a single RDF or JSON file, but it could also be multiple CSV files, because CSV limitations mean we can't put multiple tables into a single CSV file.

For convenience, we also provide multiple distributions for each dataset/data service, where data are split or filtered by various attributes, so that users could download a smaller slice of the data, that they are interested in.

For example, some users prefer to get data in a normalized form, others in a denormalized form, where multiple tables are joined into one large table. Other users want to get data split by year or by region.

Currently, all different representations of a large dataset are provided as distributions.

Physically, all the data are in a single data service, per single dataset, and distributions are just calls to data service with different filters.

What would be your advice in this case? Should we continue with distributions, or should we convert distributions to datasets, leaving only distributions that differ by format, and wrap those datasets in a dataset series?

@bertvannuffelen bertvannuffelen added the release:3.0.0 https://semiceu.github.io/DCAT-AP/releases/3.0.0 label Jun 21, 2023
@bertvannuffelen
Contributor Author

Thanks for the many contributions in this issue and during the webinars. We close this issue with the release of DCAT-AP 3.0 so that we can start with a clean discussion of the left over open topics that could be present in this exchange.

@bertvannuffelen bertvannuffelen added the status:fixed This issue has been fixed in a draft. label Jan 31, 2024
7 participants