DatasetSeries #240
Comments
+1 from Norway to include DatasetSeries in DCAT-AP. First/Last is defined by the DatasetSeries description and should be included as optional there? Prev/Next is defined by the child dataset description (correct me if I am wrong @bertvannuffelen). Question 2: Do we need Statement: we don't need |
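To make the First/Last versus Prev/Next split concrete, here is a minimal Turtle sketch. It assumes the DCAT 3 property names `dcat:first`, `dcat:last`, `dcat:prev` and `dcat:inSeries`; the `ex:` URIs and titles are hypothetical, and whether DCAT-AP should keep these properties optional is exactly the open question.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix ex:   <https://example.org/> .

# The series description points at its first and last member.
ex:budget a dcat:DatasetSeries ;
    dct:title "Budget"@en ;
    dcat:first ex:budget-2021 ;
    dcat:last  ex:budget-2022 .

# Each child dataset declares its series membership ...
ex:budget-2021 a dcat:Dataset ;
    dct:title "Budget 2021"@en ;
    dcat:inSeries ex:budget .

# ... and every child except the first links to its predecessor.
ex:budget-2022 a dcat:Dataset ;
    dct:title "Budget 2022"@en ;
    dcat:inSeries ex:budget ;
    dcat:prev ex:budget-2021 .
```

With this layout, adding a new member only requires updating `dcat:last` on the series and adding one `dcat:prev` link on the new child.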
We support the initial statements about Additional we think that Typical entries: tbd. |
About the use of Quoting:
In practice, the recommendation is as follows:
|
What about the following scenario:
What would be the best practice here? I see two different alternatives:
Is alternative 2 even allowed, i.e. can a dataset series have a distribution? |
Ad 1) That could be perceived as breaking the assumption that the distributions are informationally equivalent. Nevertheless, it is similar to the case where we have a SPARQL endpoint serving multiple totally separate datasets, having that SPARQL endpoint as a DataService distribution of each of those datasets. IMHO this is definitely a viable possibility. 3) Periodic datasets form a series as in 1). On top of that, there is another DatasetSeries, which contains the periodic DatasetSeries and a new dataset with the DataService distribution. Here we use DatasetSeries to group datasets serving similar data in various forms, which is a rather loose interpretation of DatasetSeries. |
In Lithuania we do not allow such things as Periodic datasets, we always have one dataset and multiple distributions. According to DCAT on distributions:
Here is an example: https://data.gov.lt/dataset/espbi-is-e-recepto-posistemes-duomenys?lang=en If we had so-called periodic datasets, we would have to duplicate the same dataset multiple times and change only some attributes. So with dataset series, as I understand it, things like periodic datasets are encouraged, and I think this will end up in an explosion of datasets. Of course this is good for statistics: we can say that we have millions of datasets, even though most of them are duplicates. I think there should be a requirement that datasets should not be divided by properties like time or space, and if such a division is needed, distributions should be used. |
@sirex I fully understand and agree with your perspective. I also worry about the extra burden it will put on publishers, portals and tool developers. However, the approach you describe is problematic as well. Having multiple distributions to divide the data in a spatial or temporal dimension goes against how distributions are supposed to be used. (Multiple distributions are supposed to correspond to different representations of a dataset, typically in different formats.) If you are interested, I have argued for another solution where we repeat the From Sweden's perspective it is unclear how we are going to proceed. One option could be to keep the approach of repeating downloadURLs for the most trivial scenarios and do a conversion when exporting to data.europe.eu. In this case we would reserve the use of DatasetService for scenarios where you need to provide more metadata. But we would still need to know how to do such a conversion. That is why I brought up the scenario above, as it is quite common to have two distributions, one with a bunch of files and another with an API. |
One simple solution would be to use URI templates, like here: And a distribution with a templated URI is quite similar to a data service. That is why we are moving away from distributions to data services; basically everything will be a data service. Then you can get any distribution you want, and you can filter by time, location or other attributes. So instead of
There will be:
But we still plan to use Dataset Series, for grouping similar datasets into groups. But from what I see, that is not what is intended here? The data service that I'm talking about: https://get.data.gov.lt/datasets/gov/:ns All data can be downloaded in multiple formats, and you can generate an infinite number of so-called periodic datasets, like this: https://get.data.gov.lt/datasets/gov/lsd/covid19/AtvejaiIrMirtys?date="2020-02-01" So the question is: are dataset series really just for a bunch of identical datasets partitioned by a temporal or spatial dimension (which seems like a huge overhead), or can they be used to group any kind of datasets together, for example by similar topic? |
I have also been considering URI templates (https://www.rfc-editor.org/rfc/rfc6570) but unfortunately it is not an option to use in the
So, if it were to be used it would have to be in a different property, but then we still need a value for the Your approach of pointing to data services from the dataset requires the use of an intermediate distribution, as there is no directly suitable property on the dataset level in the dcat namespace. However, you could point in the other direction via the I share your view that (in most cases) it will be a huge overhead to use dataset series for the temporal or spatial dimension. That is why we have been using multiple |
As I understand the DCAT specification and the last webinar: both. They are often partitioned by temporal or spatial dimensions, but they can also be a loose grouping of datasets.
My interpretation of DCAT is the opposite of your way to do "Periodic datasets". That said, using distributions for this happens in Germany as well, because it is the easiest and only existing out-of-the-box solution to present periodic distributions as a bundle in most data portals. |
This has been discussed thoroughly in the past years both in DCAT and DCAT-AP groups. See the note in DCAT's description of Distribution, which is the result of the discussions. The point is that those should be different datasets exactly because they have a difference in some attributes. Specifically, spatial and temporal coverage are attributes of Dataset, not Distribution. To keep the dataset e.g. for a certain time period, machine findable, you need these attributes of the individual Datasets. Moreover, if you had e.g. budget for 2020, 2021, etc., each available in different formats (XML, CSV, ...) it would be even more confusing to have a dataset Budget with distributions "2020 in XML", "2020 in CSV", "2021 in XML", etc. and not distinguishable, unless you specify your own coverage properties for Distributions, despite how this is designed in DCAT.
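As a rough illustration of the budget case above, the following Turtle sketch models one dataset per year inside a series, with the coverage on the Dataset and only the format varying between distributions. All `ex:` URIs, file URLs and titles are hypothetical; the property choices (`dct:temporal` with `dcat:startDate`/`dcat:endDate`, `dcat:mediaType`) are one plausible reading of DCAT 3, not a prescribed profile.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <https://example.org/> .

ex:budget a dcat:DatasetSeries ;
    dct:title "Budget"@en .

# Temporal coverage is stated on the yearly dataset, where harvesters expect it.
ex:budget-2020 a dcat:Dataset ;
    dct:title "Budget 2020"@en ;
    dcat:inSeries ex:budget ;
    dct:temporal [
        a dct:PeriodOfTime ;
        dcat:startDate "2020-01-01"^^xsd:date ;
        dcat:endDate   "2020-12-31"^^xsd:date
    ] ;
    dcat:distribution ex:budget-2020-xml , ex:budget-2020-csv .

# The two distributions carry the same data and differ only in format.
ex:budget-2020-xml a dcat:Distribution ;
    dcat:downloadURL <https://example.org/files/budget-2020.xml> ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/application/xml> .

ex:budget-2020-csv a dcat:Distribution ;
    dcat:downloadURL <https://example.org/files/budget-2020.csv> ;
    dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> .
```

A dataset for 2021 would follow the same pattern, so a client looking for a given year can filter on the datasets' `dct:temporal` values instead of parsing distribution titles.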
There can be a separation between how the data is structured in RDF, where it needs to be precise to achieve interoperability, and how the data is shown to or collected from users such as publishers. The fact that there is a dataset series with individual datasets, each with e.g. one distribution, can still be hidden from the users, if this is desirable. I can still build a UI for publishers where they drag and drop 10 files and are done with it, and still represent those 10 files as distributions of 10 datasets in a series, given that I get the metadata from e.g. the context of the operation in the UI.
But then you basically attach additional data to these URLs that you would normally attach to a dataset - I see no advantage of this approach, and, from the interoperability point of view, no-one who reads DCAT and DCAT-AP will expect those properties to be there - so again - sure, this can be done technically, and maybe you save a few RDF triples, but it is not interoperable with DCAT and its other users then, which is the whole point. And this would have been a problem even without DatasetSeries. |
Sorry, I'm late to the discussion, and thanks for the explanation; I think I now see things more clearly.
In Lithuania we are moving toward federated open data publishing through a single unified API, so I guess we will essentially have only one dataset or data service, get.data.gov.lt. Basically everything else is created artificially, to separate data into datasets by topic and to provide multiple URLs as distributions for convenience. So in our case we are not describing already published data; instead, we first name a dataset and then publish the data. One of the requirements for data publishers is to not split datasets by temporal, spatial or other attributes. Instead, those attributes should be added to the data itself, and the data publishing service allows filtering the data by any attribute.
I guess we could artificially split a single data stream into multiple datasets and combine them into a single dataset series, but as I understand it, this only applies to already published data?
So currently we try to name datasets by topic, and if a user searches for "budget", a single dataset is returned. The user can then pick a format and apply filters if needed. To make it more convenient for less technical data users, we provide multiple distributions, already filtered by year or by region. But these distributions are created just for convenience; all of them point to the same data publishing service. Here is an example:
<datasets/budget> a dcat:Dataset ;
    dct:title "Budget" .
<services/budget> a dcat:DataService ;
    dcat:endpointURL <https://get.data.gov.lt/datasets/gov/org/budget> ;
    dcat:servesDataset <datasets/budget> .
<distributions/budget/2023> a dcat:Distribution ;
    dcat:accessService <services/budget> ;
    dcat:downloadURL <https://get.data.gov.lt/datasets/gov/org/budget?date.year=2023> .
<distributions/budget/2022> a dcat:Distribution ;
    dcat:accessService <services/budget> ;
    dcat:downloadURL <https://get.data.gov.lt/datasets/gov/org/budget?date.year=2022> .
So my question is: for DCAT 3, do we need to refactor this kind of data presentation, which uses distributions, into multiple datasets and a dataset series? Or is the way we do it now also OK? |
As soon as you link the distributions to the dataset, you are in conflict with the definition of distributions:
<datasets/budget> a dcat:Dataset ;
    dct:title "Budget" ;
    dcat:distribution <distributions/budget/2023> ;
    dcat:distribution <distributions/budget/2022> .
IMHO, this would be your structure in DCAT 3:
<datasetseries/budget> a dcat:DatasetSeries ;
    dct:title "Budget" .
<services/budget> a dcat:DataService ;
    dcat:endpointURL <https://get.data.gov.lt/datasets/gov/org/budget> ;
    dcat:servesDataset <datasetseries/budget> .
<dataset-distribution/budget/2023> a dcat:Dataset ;
    dcat:inSeries <datasetseries/budget> ;
    dct:title "Budget for 2023" ;
    dct:description "For convenience we prepared the budget for 2023, but it's just a call for our awesome webservice!" ;
    dcat:distribution [
        a dcat:Distribution ;
        dcat:accessService <services/budget> ;
        dcat:downloadURL <https://get.data.gov.lt/datasets/gov/org/budget?date.year=2023>
    ] .
<dataset-distribution/budget/2022> a dcat:Dataset ;
    dcat:inSeries <datasetseries/budget> ;
    dct:title "Budget for 2022" ;
    dct:description "For convenience we prepared the budget for 2022, but it's just a call for our awesome webservice!" ;
    dcat:distribution [
        a dcat:Distribution ;
        dcat:accessService <services/budget> ;
        dcat:downloadURL <https://get.data.gov.lt/datasets/gov/org/budget?date.year=2022>
    ] . |
Could you explain where the conflict is, and why this would not work?
<distributions/budget/2023> a dcat:Distribution ;
    dcat:accessService <services/budget> ;
    dcat:downloadURL <https://get.data.gov.lt/datasets/gov/org/budget?date.year=2023> .
<distributions/budget/2022> a dcat:Distribution ;
    dcat:accessService <services/budget> ;
    dcat:downloadURL <https://get.data.gov.lt/datasets/gov/org/budget?date.year=2022> .
<datasets/budget> a dcat:Dataset ;
    dct:title "Budget" ;
    dcat:distribution
        <distributions/budget/2023> ,
        <distributions/budget/2022> . |
Budgets are the counterexample for things not to model as multiple distributions, since they are not "the same data in a different format or resolution":
Would it work? In your portal, definitely. In a portal which harvests you? Probably. They would show one dataset with multiple files to download. Is it semantically correct? As far as I understand it: no. |
The last is the challenge. According to the DCAT(-AP) definitions, your last example @sirex violates the intent that each distribution contains exactly the same data and that the difference is only in the representation. So while we cannot prohibit the example from existing, we can prohibit assigning the term Dataset Series to it. The problem with your example is that it is impossible to distinguish from another dataset with 2 distributions (e.g. one RDF and the other JSON). That is the reason why the notion of Dataset Series was explicitly introduced in DCAT 3.0: to unambiguously indicate whether something is a series or not. |
So I'm not sure what to do. We are moving towards large datasets that are not split by time or place, and all datasets are published as data services. For example, all data (DCAT) of our open data portal (data.gov.lt) is published as a single data service and has a single dataset record. Depending on the format, it might be a single RDF or JSON file, but it could also be multiple CSV files, because CSV does not allow multiple tables in a single file. For convenience, we also provide multiple distributions for each dataset/data service, where the data are split or filtered by various attributes, so that users can download just the smaller slice of the data they are interested in. For example, some users prefer to get data in a normalized form, others in a denormalized form, where multiple tables are joined into one large table. Other users want to get data split by year or by region. Currently, all the different representations of a large dataset are provided as distributions. Physically, all the data are in a single data service per dataset, and distributions are just calls to the data service with different filters. What would be your advice in this case? Should we continue with distributions, or should we convert distributions to datasets, keeping only the distributions that differ by format, and wrap those datasets in a dataset series? |
Thanks for the many contributions in this issue and during the webinars. We close this issue with the release of DCAT-AP 3.0 so that we can start with a clean discussion of the left over open topics that could be present in this exchange. |
DCAT 3.0 introduces the notion of DatasetSeries.
As an application profile, DCAT-AP may add additional constraints on DatasetSeries.
This issue is to collect proposals.
Some suggestions are
If the community provides no suggestions, a proposal for DatasetSeries will be made with a minimal set of constraints.