
Support for Simple Data Format Data Packages #778

Open
rufuspollock opened this issue Aug 9, 2013 · 26 comments
Labels
  • import - About importers in general; add a label for the data format if available
  • metadata - Adding metadata to projects, columns and other parts of the data model
  • new data format - Requests for creation of new importers/exporters
  • Type: Feature Request - Identifies requests for new features or enhancements. These involve proposing new improvements.

Comments

@rufuspollock

rufuspollock commented Aug 9, 2013

This is a discussion issue atm. Also a place to ask questions about how best to patch OpenRefine to support this.

Introduction and overview to Simple Data Format (full spec)

@tfmorris
Member

tfmorris commented Aug 9, 2013

Here's a link to the specification: http://www.dataprotocols.org/en/latest/simple-data-format.html

@wetneb added the import and metadata labels on Aug 2, 2017
@wetneb added the new data format label and removed the import label on Sep 18, 2017
@jackyq2015 added and then removed the Priority: High label on Oct 25, 2017
@wetneb
Sponsor Member

wetneb commented Feb 2, 2018

@rufuspollock We have a first working version of this, just merged. @jackyq2015 has implemented import and export for data packages distributed as Zip files, and also data packages hosted on the web as a JSON file.

The various properties of columns (types, constraints, descriptions, …) are not currently exposed in the UI - this will require more work and coordination.
I also plan to work on import/export for data packages with embedded tabular data. This might require PRs to the java library for data packages.

By the way, thanks for setting up https://github.com/frictionlessdata/test-data, this is very useful.

@rufuspollock
Author

@wetneb awesome 👍 👏 /cc @pwalsh @vitorbaptista

@pwalsh

pwalsh commented Feb 7, 2018

Amazing @wetneb . Happy to support with reviews and discussion on the Data Package Java Library.

@thadguidry
Member

@wetneb So what's left to deliver the features in this issue? I'd like to see a task breakdown, please, with linked issue numbers for each remaining task.

@thadguidry added this to the 3.5 Cookie Monster milestone on May 29, 2018
@thadguidry added this to TODO in Google Sponsored Projects via automation on May 29, 2018
@thadguidry moved this from TODO to In Progress in Google Sponsored Projects on May 29, 2018
@wetneb
Sponsor Member

wetneb commented May 30, 2018

I don't think we have issues for these yet:

@thadguidry moved this from In Progress to TODO in Google Sponsored Projects on Jan 24, 2019
@wetneb removed this from the 3.5 milestone on Jul 26, 2019
@lwinfree

Hi! I've noticed that the datapackage import & export functionality is no longer present in the latest versions (3.2 and 3.3) but it was there in 3.0. Are there any plans to re-implement this functionality? If so, is there anything you need help with to get it working?

Thanks!

@wetneb
Sponsor Member

wetneb commented Oct 23, 2019

Hi @lwinfree,

Yes indeed, we removed it because it relied on a non-free library, see frictionlessdata/datapackage-java#26.

It would be great to have this back though! We do not have short-term plans to work on this but would surely welcome PRs in that direction.

In my opinion, the integration we had in 3.0 lacked vision a bit: we should think about concrete user workflows where the integration would really make a difference. As a user, how do I want to turn a messy CSV into a nice validated data package? This means thinking about the interaction between the spec's notions (such as type constraints on columns) and OpenRefine's data model, for instance. What I mean is that it's not enough to just have an importer and an exporter if the importer discards most of the interesting metadata and the exporter produces a jsonified CSV… that defeats the purpose of data packages!

It might be worth looking at use cases in communities that already rely on data packages: what benefit they get out of them, how they produce them, how they could use OpenRefine as part of their existing workflows, and so on.

@wetneb removed the Priority: High label on Oct 23, 2019
@lwinfree

Hi @wetneb, thanks, this is really helpful information! I work on the Frictionless Data team and we are interested in getting this functionality fixed. I'll keep you posted on our progress :-)
Also, yes, I agree it would be great to have use cases. One of our current Tool Fund grantees (https://github.com/frictionlessdata/FrictionlessDarwinCore) actually inspired this issue, as he was planning on working with OpenRefine.
Thanks for the quick response, and I'll stay in touch.

@wetneb added the import label on Dec 22, 2019
@tfmorris
Member

It looks like datapackage-java has been updated with a new JSON library.
frictionlessdata/datapackage-java#35

Doesn't look like they do formal releases: https://github.com/frictionlessdata/datapackage-java/releases
so not sure how long a cooling off period we should allow before grabbing a snapshot.

@wetneb
Sponsor Member

wetneb commented Jun 20, 2020

I would be wary of simply restoring the previous integration with the migrated library, since it did not really enable any useful workflow for users as far as I can tell.

I would be interested to hear from @lwinfree what sort of workflow their tool grantee had in mind - that could help drive the integration in the right direction.

@thadguidry
Member

@wetneb I've been told that Data Packages are used in a lot of statistical & bio tools. To name a few important ones in the R lang community:
https://cran.r-project.org/web/packages/datapackage.r/index.html
https://cran.r-project.org/web/packages/dpmr/index.html
https://cran.r-project.org/web/packages/codebook/index.html

The bio/stat/scientific community is looking for more data tools that support editing metadata and help improve reproducibility by pushing good practices for publishing data, which involves producing a data dictionary, making it machine-readable, etc.
https://arxiv.org/pdf/2002.11626.pdf

The last 3 times I went to the R lang meetup in Dallas, they all asked "Does OpenRefine support adding metadata editing of the table schema yet?" My reply 3 times: "nope"

I personally kinda like the approach, used by many existing data tools, of vertically scrolling through the columns to edit them. It makes the metadata entry faster:
[screenshot]

@wetneb
Sponsor Member

wetneb commented Jun 21, 2020

What I would like is a concrete description of a workflow in OpenRefine.

@jimfhahn

This seems to relate somewhat to the FAIR OpenRefine plugin project : https://github.com/FAIRDataTeam/OpenRefine-metadata-extension

FAIR metadata would seem to aspire to some of the same goals as Data Packages tech. I came across the FAIR plugin earlier in the year but have not had the chance to play with it much. FAIR data is very much about replicability, data rights, and data re-use. https://www.go-fair.org/fair-principles/

@rufuspollock
Author

I would be interested to hear from @lwinfree what sort of workflow their tool grantee had in mind - that could help drive the integration in the right direction.

@wetneb I'm not fully up to speed on the exact flow in OpenRefine itself, but the overall workflow this supports is something like the following (I imagine).

Let me know if this is the kind of thing you were looking for or not.

Export flow

  • User has data to tidy, e.g. a CSV
  • They load it into OpenRefine and wrangle it. This work includes adding some type information
  • The data is re-exported for consumption in some other tool. That export includes the datapackage.json with the Table Schema describing the table (see the sketch below)
  • Another tool (be that a data validator, a visualization tool, or a data loader to a DB) uses that metadata as part of its processing flow
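For concreteness, here is a minimal sketch of what such an export could produce alongside the cleaned CSV (file and field names here are invented; the shape follows the Data Package and Table Schema specs):

```json
{
  "name": "cleaned-dataset",
  "resources": [
    {
      "name": "observations",
      "path": "observations.csv",
      "schema": {
        "fields": [
          {"name": "id", "type": "integer", "constraints": {"required": true, "unique": true}},
          {"name": "year", "type": "year"},
          {"name": "temperature", "type": "number", "constraints": {"minimum": -90, "maximum": 60}}
        ]
      }
    }
  ]
}
```

A downstream validator, visualization tool, or DB loader can then check each row against the declared types and constraints instead of re-guessing them from the raw CSV.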

Ingest flow

I imagine there are situations where OpenRefine would benefit from being able to consume data that is already described as a data package, for example where a user has already:

  • Added information about the CSV dialect, e.g. knowing that ; is the separator, that quotes are ' rather than ", etc.
  • Added information about types, e.g. that 1989 is a year, not an integer
  • Recorded the decimal character
  • Used the validation information (beyond types) in Table Schema to flag values out of range

etc. (a minimal sketch of such a descriptor is shown below)
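To illustrate, a small made-up datapackage.json that already carries this kind of information, which an importer could read instead of asking the user to re-enter it (names and values are invented; the dialect and field properties follow the CSV Dialect and Table Schema specs):

```json
{
  "name": "source-data",
  "resources": [
    {
      "name": "records",
      "path": "records.csv",
      "dialect": {"delimiter": ";", "quoteChar": "'"},
      "schema": {
        "fields": [
          {"name": "year", "type": "year"},
          {"name": "amount", "type": "number", "decimalChar": ","},
          {"name": "score", "type": "integer", "constraints": {"minimum": 0, "maximum": 100}}
        ]
      }
    }
  ]
}
```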

@lwinfree

Tagging @andrejjh. Andre, we have updated our datapackage-java so it can theoretically be integrated with OpenRefine again. Would you be able to write a short summary of the use case you would pursue if this integration were added back in? It would help the OpenRefine team understand and prioritize. Thanks!

@wetneb
Sponsor Member

wetneb commented Jun 22, 2020

Thanks @rufuspollock for the workflows, they sound very sensible to me!

I do not think bringing back the integration we had before will address these. More work is needed to make these workflows possible and smooth:

  • Export flow: this requires exposing more column metadata (and deciding what OpenRefine should do with it, such as validating the values in the column: in which form?)
  • Ingest flow: this requires feeding metadata from the data package into the importer. Off the top of my head I do not think this was supported, but I am not sure about that.

@andrejjh

andrejjh commented Jun 23, 2020 via email

@tfmorris
Member

So many acronyms, so few links or definitions. Here are some links to help others:

GBIF (https://www.gbif.org/) - Global Biodiversity Information Facility - "international network and research infrastructure funded by the world's governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth."

RGBIF - R package for dealing with GBIF data (currently 404 for its web page)

DwC-A - Darwin Core Archive - "biodiversity informatics data standard that makes use of the Darwin Core terms to produce a single, self-contained dataset." Domain-specific packaging/archiving (yet another). Appears to be a zip file containing two XML files and one or more CSVs, all tied together.

"tool I made" - perhaps Darwin Core Archive Assistant

Now that I understand the terms, I'm not sure I'm any closer to understanding the workflow.

"prepare institutional data for publication" = ?

@andrejjh

andrejjh commented Jun 23, 2020 via email

@andrejjh

andrejjh commented Jun 23, 2020 via email

@lwinfree

Hi all! I'm wondering if there is any interest in reinstating this datapackage support? If yes, are there ways that the frictionless team could help? I was just communicating with @andrejjh who would still find this really useful, so that's our primary use case right now. Happy to chat more and thanks!

@wetneb
Sponsor Member

wetneb commented May 28, 2021

Hi @lwinfree,
I think it would be a nice thing to have indeed! I am not aware of anyone working on this at the moment, so the issue is up for grabs :) I would not mark it as a "good first issue" because it is a bit too involved for that, but it is definitely doable.
I would be happy to review pull requests going in that direction.

@thadguidry
Member

thadguidry commented May 28, 2021

@wetneb But I thought we first needed to add support in OpenRefine for column metadata, per your comments from last year above?

to then align with the Table Schema spec's built-in support for Name, Type, Format
https://specs.frictionlessdata.io/table-schema/#types-and-formats

Then we need lots of discussion to arrive at decisions on how to visualize things in OpenRefine for quickly seeing and editing the Name, Type, Format... something like what SPSS does, which I kinda like, with its little hover icons which could be clickable/editable:
https://www.spss-tutorials.com/spss-variable-types-and-formats/

In my mind, this issue is actually an EPIC.
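For reference, a small made-up Table Schema fragment showing the Name / Type / Format triple that such a UI would surface per column (field names are invented; the types and formats come from the spec linked above):

```json
{
  "fields": [
    {"name": "registered", "type": "date", "format": "%d/%m/%Y"},
    {"name": "homepage", "type": "string", "format": "uri"},
    {"name": "location", "type": "geopoint", "format": "array"}
  ]
}
```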

@wetneb
Sponsor Member

wetneb commented May 28, 2021

Yes, absolutely! I would really like to see this foundational work and the design discussions about column metadata / constraints happen, and I do think they are necessary to provide satisfying support for data packages. If we are aiming for that, this issue is indeed an "epic" which should be broken down into many subtasks.

But if people only want to restore the limited support we had in the past (ingesting data packages by discarding most of the metadata they contain) then I would say it's also okay and should be easier. Even though it feels half-baked to me, perhaps it is useful in some workflows, and I don't see a reason to prevent people from implementing this in OpenRefine (if it's just another importer / exporter, basically).

@tfmorris
Member

Looks like they decided to write a tool from scratch instead: https://github.com/okfn/opendataeditor

They have a comparison page which lists a bunch of alternatives, including OpenRefine, but doesn't really say anything about how they compare.
