
Support for Simple Data Format Data Packages #778

Open
rufuspollock opened this issue Aug 9, 2013 · 26 comments
Labels
  • import - About importers in general; add a label for the data format if available
  • metadata - Adding metadata to projects, columns and other parts of the data model
  • new data format - Requests for creation of new importers/exporters
  • Type: Feature Request - Identifies requests for new features or enhancements. These involve proposing new improvements.

Comments

@rufuspollock

rufuspollock commented Aug 9, 2013

This is a discussion issue atm. Also a place to ask questions about how best to patch OpenRefine to support this.

Introduction and overview to Simple Data Format (full spec)

@tfmorris
Member

tfmorris commented Aug 9, 2013

Here's a link to the specification: http://www.dataprotocols.org/en/latest/simple-data-format.html

@wetneb added the import and metadata labels on Aug 2, 2017
@wetneb added the new data format label and removed the import label on Sep 18, 2017
@jackyq2015 added and then removed the Priority: High label on Oct 25, 2017
@wetneb
Sponsor Member

wetneb commented Feb 2, 2018

@rufuspollock We have a first working version of this, just merged. @jackyq2015 has implemented import and export for data packages distributed as Zip files, and also data packages hosted on the web as a JSON file.

The various properties of columns (types, constraints, descriptions, …) are not currently exposed in the UI - this will require more work and coordination.
I also plan to work on import/export for data packages with embedded tabular data. This might require PRs to the java library for data packages.

By the way, thanks for setting up https://github.com/frictionlessdata/test-data, this is very useful.

@rufuspollock
Author

@wetneb awesome 👍 👏 /cc @pwalsh @vitorbaptista

@pwalsh

pwalsh commented Feb 7, 2018

Amazing @wetneb . Happy to support with reviews and discussion on the Data Package Java Library.

@thadguidry
Member

@wetneb So what's left to deliver the features in this issue? I'd like to see a task breakdown, please, with linked issue numbers for each remaining task.

@thadguidry added this to the 3.5 Cookie Monster milestone on May 29, 2018
@thadguidry added this to TODO in Google Sponsored Projects via automation on May 29, 2018
@thadguidry moved this from TODO to In Progress in Google Sponsored Projects on May 29, 2018
@wetneb
Sponsor Member

wetneb commented May 30, 2018

I don't think we have issues for these yet:

@thadguidry moved this from In Progress to TODO in Google Sponsored Projects on Jan 24, 2019
@wetneb removed this from the 3.5 milestone on Jul 26, 2019
@lwinfree

Hi! I've noticed that the datapackage import & export functionality is no longer present in the latest versions (3.2 and 3.3) but it was there in 3.0. Are there any plans to re-implement this functionality? If so, is there anything you need help with to get it working?

Thanks!

@wetneb
Sponsor Member

wetneb commented Oct 23, 2019

Hi @lwinfree,

Yes indeed, we removed it because it relied on a non-free library, see frictionlessdata/datapackage-java#26.

It would be great to have this back though! We do not have short-term plans to work on this but would surely welcome PRs in that direction.

In my opinion, the integration we had in 3.0 lacked vision a bit: we should think about concrete user workflows where the integration would really make a difference. As a user, how do I want to turn a messy CSV into a nice validated data package? This means thinking about the interaction between the spec's notions (such as type constraints on columns) and OpenRefine's data model, for instance. What I mean is that it's not enough to just have an importer and an exporter if the importer discards most of the interesting metadata and the exporter produces a jsonified CSV… that defeats the purpose of data packages!

It might be worth looking at use cases in communities that already rely on data packages: what benefit they get out of them, how they produce them, how they could use OpenRefine as part of their existing workflows, and so on.

@wetneb removed the Priority: High label on Oct 23, 2019
@lwinfree

Hi @wetneb, thanks, this is really helpful information! I work on the Frictionless Data team and we are interested in getting this functionality fixed. I'll keep you posted on our progress :-)
Also, yes, I agree it would be great to have use cases. One of our current Tool Fund grantees (https://github.com/frictionlessdata/FrictionlessDarwinCore) actually inspired this issue, as he was planning on working with OpenRefine.
Thanks for the quick response, and I'll stay in touch.

@wetneb added the import label on Dec 22, 2019
@tfmorris
Member

It looks like datapackage-java has been updated with a new JSON library.
frictionlessdata/datapackage-java#35

Doesn't look like they do formal releases: https://github.com/frictionlessdata/datapackage-java/releases
so not sure how long a cooling off period we should allow before grabbing a snapshot.

@wetneb
Sponsor Member

wetneb commented Jun 20, 2020

I would be wary of simply restoring the previous integration with the migrated library, since it did not really enable any useful workflow for users as far as I can tell.

I would be interested to hear from @lwinfree what sort of workflow their tool grantee had in mind - that could help drive the integration in the right direction.

@thadguidry
Member

@wetneb I've been told that Data Packages are used in a lot of statistical & bio tools. To name a few important ones in the R lang community:
https://cran.r-project.org/web/packages/datapackage.r/index.html
https://cran.r-project.org/web/packages/dpmr/index.html
https://cran.r-project.org/web/packages/codebook/index.html

The bio/stat/scientific community is looking for more data tools that support editing metadata and help improve reproducibility by pushing good practices for publishing data, which involves producing a data dictionary, making it machine-readable, etc.
https://arxiv.org/pdf/2002.11626.pdf

The last 3 times I went to the R lang meetup in Dallas, they all asked "Does OpenRefine support adding metadata editing of the table schema yet?" My reply 3 times: "nope"

I personally kinda like the approach, used by many existing data tools, of vertically scrolling through the columns to edit them. It makes the metadata entry faster:
[screenshot]

@wetneb
Sponsor Member

wetneb commented Jun 21, 2020

What I would like is a concrete description of a workflow in OpenRefine.

@jimfhahn

This seems to relate somewhat to the FAIR OpenRefine plugin project : https://github.com/FAIRDataTeam/OpenRefine-metadata-extension

FAIR metadata would seem to aspire to some of the same goals as Data Packages tech. I came across the FAIR plugin earlier in the year but have not had the chance to play with it much. FAIR data is very much about replicability, data rights, and data re-use. https://www.go-fair.org/fair-principles/

@rufuspollock
Author

I would be interested to hear from @lwinfree what sort of workflow their tool grantee had in mind - that could help drive the integration in the right direction.

@wetneb I'm not fully up to speed on the exact flow in OpenRefine itself, but the overall workflow this supports is something like the following (I imagine).

Let me know if this is the kind of thing you were looking for or not.

Export flow

  • User has data to tidy, e.g. a CSV
  • They load it into OpenRefine and wrangle it. This work includes adding some type information
  • The data is re-exported for consumption in some other tool. That export includes the datapackage.json with the Table Schema describing the table (see the sketch below)
  • Another tool (be that a data validator, a visualization tool, or a data loader to a DB) uses that metadata as part of its processing flow
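For concreteness, here is a minimal sketch of what such an export could produce alongside the cleaned CSV (file and field names here are invented; the shape follows the Data Package and Table Schema specs):

```json
{
  "name": "cleaned-dataset",
  "resources": [
    {
      "name": "observations",
      "path": "observations.csv",
      "schema": {
        "fields": [
          {"name": "id", "type": "integer", "constraints": {"required": true, "unique": true}},
          {"name": "year", "type": "year"},
          {"name": "temperature", "type": "number", "constraints": {"minimum": -90, "maximum": 60}}
        ]
      }
    }
  ]
}
```

A downstream validator, visualization tool, or DB loader can then check each row against the declared types and constraints instead of re-guessing them from the raw CSV.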

Ingest flow

I imagine there are situations where OpenRefine would benefit from being able to consume data that is already described as a data package, for example where a user has already:

  • Added information about the CSV dialect, e.g. knowing that ; is the separator, that quotes are ' rather than ", etc.
  • Added information about types, e.g. that 1989 is a year, not an integer
  • Recorded the decimal character
  • Used the validation information (beyond types) in Table Schema to flag values out of range

etc. (a minimal sketch of such a descriptor is shown below)
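To illustrate, a small made-up datapackage.json that already carries this kind of information, which an importer could read instead of asking the user to re-enter it (names and values are invented; the dialect and field properties follow the CSV Dialect and Table Schema specs):

```json
{
  "name": "source-data",
  "resources": [
    {
      "name": "records",
      "path": "records.csv",
      "dialect": {"delimiter": ";", "quoteChar": "'"},
      "schema": {
        "fields": [
          {"name": "year", "type": "year"},
          {"name": "amount", "type": "number", "decimalChar": ","},
          {"name": "score", "type": "integer", "constraints": {"minimum": 0, "maximum": 100}}
        ]
      }
    }
  ]
}
```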

@lwinfree

Tagging @andrejjh. Andre, we have updated our datapackage-java so it can theoretically be integrated with OpenRefine again. Would you be able to write a short summary of the use case you would pursue if this integration were added back in? It would help the OpenRefine team understand and prioritize. Thanks!

@wetneb
Sponsor Member

wetneb commented Jun 22, 2020

Thanks @rufuspollock for the workflows, they sound very sensible to me!

I do not think bringing back the integration we had before will address these. More work is needed to make these workflows possible and smooth:

  • Export flow: this requires exposing more column metadata (and deciding what OpenRefine should do with it, such as validating the values in the column: in which form?)
  • Ingest flow: this requires feeding metadata from the data package into the importer. Off the top of my head I do not think this was supported, but I am not sure about that.

@andrejjh

andrejjh commented Jun 23, 2020 via email

@tfmorris
Member

So many acronyms, so few links or definitions. Here are some links to help others:

GBIF (https://www.gbif.org/) - Global Biodiversity Information Facility - "international network and research infrastructure funded by the world's governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth."

RGBIF - R package for dealing with GBIF data (currently 404 for its web page)

DwC-A - Darwin Core Archive - "biodiversity informatics data standard that makes use of the Darwin Core terms to produce a single, self-contained dataset." Domain-specific packaging/archiving (yet another). Appears to be a zip file containing two XML files and one or more CSVs, all tied together.

"tool I made" - perhaps Darwin Core Archive Assistant

Now that I understand the terms, I'm not sure I'm any closer to understanding the workflow.

"prepare institutional data for publication" = ?

@andrejjh

andrejjh commented Jun 23, 2020 via email

@andrejjh

andrejjh commented Jun 23, 2020 via email

@lwinfree

Hi all! I'm wondering if there is any interest in reinstating this datapackage support? If yes, are there ways that the frictionless team could help? I was just communicating with @andrejjh who would still find this really useful, so that's our primary use case right now. Happy to chat more and thanks!

@wetneb
Sponsor Member

wetneb commented May 28, 2021

Hi @lwinfree,
I think it would be a nice thing to have indeed! I am not aware of anyone working on this at the moment, so the issue is up for grabs :) I would not mark it as a "good first issue" because it is a bit too involved for that, but it is definitely doable.
I would be happy to review pull requests going in that direction.

@thadguidry
Member

thadguidry commented May 28, 2021

@wetneb But I thought we first needed to add support in OpenRefine for column metadata, per your comments from last year above?

to then align with the Table Schema spec's built-in support for Name, Type, Format
https://specs.frictionlessdata.io/table-schema/#types-and-formats

Then we need lots of discussion to arrive at decisions on how to visualize things in OpenRefine for quickly seeing and editing the Name, Type, Format... something like what SPSS does, which I kinda like, with its little hover icons which could be clickable/editable:
https://www.spss-tutorials.com/spss-variable-types-and-formats/

In my mind, this issue is actually an EPIC.
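For reference, a small made-up Table Schema fragment showing the Name / Type / Format triple that such a UI would surface per column (field names are invented; the types and formats come from the spec linked above):

```json
{
  "fields": [
    {"name": "registered", "type": "date", "format": "%d/%m/%Y"},
    {"name": "homepage", "type": "string", "format": "uri"},
    {"name": "location", "type": "geopoint", "format": "array"}
  ]
}
```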

@wetneb
Sponsor Member

wetneb commented May 28, 2021

Yes, absolutely! I would really like to see this foundational work and the design discussions about column metadata / constraints happen, and I do think they are necessary to provide satisfying support for data packages. If we are aiming for that, this issue is indeed an "epic" which should be broken down into many subtasks.

But if people only want to restore the limited support we had in the past (ingesting data packages by discarding most of the metadata they contain) then I would say it's also okay and should be easier. Even though it feels half-baked to me, perhaps it is useful in some workflows, and I don't see a reason to prevent people from implementing this in OpenRefine (if it's just another importer / exporter, basically).

@tfmorris
Member

Looks like they decided to write a tool from scratch instead: https://github.com/okfn/opendataeditor

They have a comparison page which lists a bunch of alternatives, including OpenRefine, but doesn't really say anything about how they compare.
