Support for Simple Data Format Data Packages #778
Comments
Here's a link to the specification: http://www.dataprotocols.org/en/latest/simple-data-format.html
@rufuspollock We have a first working version of this, just merged. @jackyq2015 has implemented import and export for data packages distributed as Zip files, and also data packages hosted on the web as a JSON file. The various properties of columns (types, constraints, descriptions, …) are not currently exposed in the UI - this will require more work and coordination. By the way, thanks for setting up https://github.com/frictionlessdata/test-data, this is very useful.
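To make the Zip import path concrete: a data package distributed as a Zip file is an archive with a `datapackage.json` descriptor alongside its resource files. Here is a minimal sketch of reading such an archive with the Python standard library; this is not the actual OpenRefine importer code, and the file names and layout are assumptions based on the spec:

```python
import io
import json
import zipfile

def read_datapackage_zip(data: bytes) -> dict:
    """Parse a zipped data package: return its descriptor and resource paths.

    Assumes the descriptor is stored as 'datapackage.json' at the archive
    root, per the Data Package spec; real importers may be more lenient.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        descriptor = json.loads(zf.read("datapackage.json"))
    # Each resource usually carries a 'path' (or at least a 'name').
    resources = [r.get("path") or r.get("name")
                 for r in descriptor.get("resources", [])]
    return {"descriptor": descriptor, "resources": resources}

# Build a tiny example package in memory (hypothetical data).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("datapackage.json", json.dumps({
        "name": "example",
        "resources": [{"name": "cities", "path": "data/cities.csv"}],
    }))
    zf.writestr("data/cities.csv", "name,population\nParis,2140526\n")

result = read_datapackage_zip(buf.getvalue())
print(result["resources"])  # → ['data/cities.csv']
```

The descriptor is where the column properties mentioned above (types, constraints, descriptions) live, under each resource's `schema`, which is why exposing them in the UI is the harder part.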
@wetneb awesome 👍 👏 /cc @pwalsh @vitorbaptista
Amazing @wetneb. Happy to support with reviews and discussion on the Data Package Java Library.
@wetneb So what's left to deliver the features on this issue? I'd like to see a task breakdown, with linked issue numbers for each remaining task.
I don't think we have issues for these yet:
Hi! I've noticed that the datapackage import & export functionality is no longer present in the latest versions (3.2 and 3.3) but it was there in 3.0. Are there any plans to re-implement this functionality? If so, is there anything you need help with to get it working? Thanks!
Hi @lwinfree,

Yes indeed, we removed it because it relied on a non-free library, see frictionlessdata/datapackage-java#26. It would be great to have this back though! We do not have short-term plans to work on this but would surely welcome PRs in that direction.

In my opinion, the integration we had in 3.0 lacked vision a bit: we should think about concrete user workflows where the integration would really make a difference. As a user, how do I want to turn a messy CSV into a nice validated data package? This means thinking about the interaction between the spec's notions (such as type constraints on columns) and OpenRefine's data model, for instance. What I mean is that it's not enough to just have an importer and an exporter if the importer discards most of the interesting metadata and the exporter produces a jsonified CSV… that defeats the purpose of data packages!

It might be worth looking at use cases in communities who already rely on data packages: what benefit they get out of it, how they produce them, how they could use OpenRefine as part of their existing workflows, and so on.
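To illustrate the interaction between the spec's column type constraints and messy CSV data, here is a deliberately simplified sketch of validating one column against a Table Schema-style field descriptor. Only the `integer` type and the `required`/`minimum` constraints are handled; the real spec defines many more types and constraints, and an actual integration would need a mapping onto OpenRefine's data model:

```python
import csv
import io

def check_column(values, field):
    """Validate raw CSV strings against a simplified Table Schema field.

    Returns a list of (row_index, message) pairs. Only supports the
    'integer' type and the 'required'/'minimum' constraints; this is an
    illustration, not a full Table Schema validator.
    """
    errors = []
    constraints = field.get("constraints", {})
    for i, raw in enumerate(values):
        if raw == "":
            if constraints.get("required"):
                errors.append((i, "missing required value"))
            continue
        if field.get("type") == "integer":
            try:
                value = int(raw)
            except ValueError:
                errors.append((i, f"not an integer: {raw!r}"))
                continue
            if "minimum" in constraints and value < constraints["minimum"]:
                errors.append((i, f"below minimum: {value}"))
    return errors

# Hypothetical messy column: one valid row, one constraint violation,
# one type error.
rows = list(csv.reader(io.StringIO("age\n34\n-2\nabc\n")))
field = {"name": "age", "type": "integer",
         "constraints": {"required": True, "minimum": 0}}
errors = check_column([r[0] for r in rows[1:]], field)
print(errors)  # → [(1, 'below minimum: -2'), (2, "not an integer: 'abc'")]
```

Feedback like this, surfaced per cell, is the kind of workflow where an integration would add value over a plain CSV importer.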
Hi @wetneb, Thanks this is really helpful information! I work on the frictionless data team and we are interested in getting this functionality fixed. I'll keep you posted on our progress :-)
It doesn't look like they do formal releases: https://github.com/frictionlessdata/datapackage-java/releases
I would be wary of simply restoring the previous integration with the migrated library, since it did not really enable any useful workflow for users as far as I can tell. I would be interested to hear from @lwinfree what sort of workflow their tool grantee had in mind - that could help drive the integration in the right direction. |
@wetneb I've been told that Data Packages are used in a lot of statistical & bio tools, including a few important ones in the R lang community. The bio/stat/scientific community is looking for more data tools that support editing metadata and help improve reproducibility by pushing good practices for publishing data, which involves producing a data dictionary, making it machine readable, etc.

The last 3 times I went to the R lang meetup in Dallas, they all asked "Does OpenRefine support adding metadata editing of the table schema yet?" My reply, all 3 times: "nope."

I personally kinda like the approach, used by many existing data tools, of vertically scrolling through the columns to edit them. It makes the metadata entry faster:
What I would like is a concrete description of a workflow in OpenRefine.
This seems to relate somewhat to the FAIR OpenRefine plugin project: https://github.com/FAIRDataTeam/OpenRefine-metadata-extension FAIR metadata would seem to aspire to some of the same goals as Data Packages tech. I came across the FAIR plugin earlier in the year but have not had the chance to play with it much. FAIR data is very much about replicability, data rights, and data re-use. https://www.go-fair.org/fair-principles/
@wetneb I'm not fully up to speed on the exact flow in OpenRefine itself, but the overall workflow this supports is something like the following (I imagine). Let me know if this is the kind of thing you were looking for or not.

Export flow:

Ingest flow: I imagine there are situations where OpenRefine would benefit from being able to consume data that is already described as a data package, for example, a user has already:

etc.
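The export flow above could be sketched as: take a project's header and rows, write them out as CSV, and generate a `datapackage.json` descriptor with an inferred Table Schema. This is a hypothetical illustration with deliberately naive type inference (integer vs. string only), not how any OpenRefine exporter actually worked:

```python
import csv
import io
import json

def export_datapackage(name, header, rows):
    """Serialize a table to CSV text plus a minimal datapackage.json.

    Type inference is deliberately naive (integer vs. string); a real
    exporter would expose the full Table Schema, including any metadata
    the user has edited, instead of guessing.
    """
    def infer(values):
        # Treat a column as integer only if every value looks like one.
        return ("integer"
                if values and all(v.lstrip("-").isdigit() for v in values)
                else "string")

    columns = list(zip(*rows)) if rows else [[] for _ in header]
    schema = {"fields": [{"name": h, "type": infer(col)}
                         for h, col in zip(header, columns)]}

    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)

    descriptor = {"name": name,
                  "resources": [{"name": name,
                                 "path": f"{name}.csv",
                                 "schema": schema}]}
    return out.getvalue(), json.dumps(descriptor, indent=2)

# Hypothetical project data.
csv_text, descriptor = export_datapackage(
    "cities", ["name", "population"],
    [["Paris", "2140526"], ["Lyon", "513275"]])
print(json.loads(descriptor)["resources"][0]["schema"]["fields"])
```

Even this toy version shows why the export side matters: the descriptor is where column types and other metadata survive the round trip, rather than being flattened into a bare CSV.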
Tagging @andrejjh. Andre, we have updated our datapackage-java so it can theoretically be integrated with OpenRefine again. Would you be able to write a short summary of the use case you would pursue if this integration was added back in? It would help the OpenRefine team understand and prioritize. Thanks!
Thanks @rufuspollock for the workflows, they sound very sensible to me! I do not think bringing back the integration we had before will address these, though. More work is needed to make these workflows possible and smooth:
The reason I'm desperately waiting for a data-package-compatible OpenRefine is this:

OpenRefine is quite popular amongst the GBIF community. It is often used by data holders to prepare institutional data for publication. On the other side, GBIF data users mostly use RGBIF, but some download data in simple CSV or DwC-Archive format. With the tool I made, any DwC-A can be converted into a data package, which enables great tools such as GoodTables and... OpenRefine, if it can ingest data packages. I'm pretty sure this functionality was supported by OpenRefine 2.x and would be greatly appreciated by the GBIF community.

Ingested data packages contain enough information to be saved after processing in OpenRefine. On top of that, OpenRefine is a powerful tool that could help people dealing with any data packages.

I hate walls, love bridges. Best regards,
André Heughebaert <https://andrejjh.github.io/>
So many acronyms, so few links or definitions. Here are some links to help others:
- [GBIF](https://www.gbif.org/) - Global Biodiversity Information Facility - "international network and research infrastructure funded by the world's governments and aimed at providing anyone, anywhere, open access to data about all types of life on Earth."
- RGBIF - R package for dealing with GBIF data (currently 404 for its web page)
- DwC-A - Darwin Core Archive - "biodiversity informatics data standard that makes use of the Darwin Core terms to produce a single, self-contained dataset." Domain-specific packaging/archiving (yet another). Appears to be a zip file containing two XML files and one or more CSVs, all tied together.
- "tool I made" - perhaps Darwin Core Archive Assistant

Now that I understand the terms, I'm not sure I'm any closer to understanding the workflow. "prepare institutional data for publication" = ?
Hi Tom,

Sorry for assuming everybody knows all the acronyms I used. You found almost all of them, except these two:
- Tool I made: Frictionless DarwinCore <https://github.com/frictionlessdata/FrictionlessDarwinCore>
- Prepare institutional data for publication: Quick guide to publishing data through GBIF <https://www.gbif.org/publishing-data>

This is a graph explaining the complete data workflow if the missing link (data package ingestion in OpenRefine) is added: see annex.

Hope it clarifies a bit,
Hi all! I'm wondering if there is any interest in reinstating this datapackage support? If yes, are there ways that the frictionless team could help? I was just communicating with @andrejjh who would still find this really useful, so that's our primary use case right now. Happy to chat more and thanks!
Hi @lwinfree,
@wetneb But I thought we first needed to add support in OpenRefine for column metadata, per your comments from last year above? Then align with Table Schema's built-in support for Name, Type, Format. Then have lots of discussion to arrive at decisions on how to visualize things in OpenRefine for quickly seeing and editing the Name, Type, Format... something like what SPSS does, which I kinda like with its little hover icons that could be clickable/editable. In my mind, this issue is actually an epic.
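For reference, the Name / Type / Format triple mentioned here corresponds to a field descriptor in a Table Schema. A minimal hand-written example (hypothetical field names, not taken from OpenRefine or any real package) might look like:

```json
{
  "fields": [
    {
      "name": "birth_date",
      "title": "Date of birth",
      "type": "date",
      "format": "%Y-%m-%d",
      "constraints": {"required": true}
    },
    {
      "name": "population",
      "type": "integer",
      "constraints": {"minimum": 0}
    }
  ]
}
```

Supporting column metadata in OpenRefine would essentially mean giving each of these properties a home in the data model and the UI.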
Yes absolutely! I would really like to see this foundational work and the design discussions about column metadata / constraints happen, and I do think they are necessary to provide satisfying support for data packages. If we are aiming for that, this issue is indeed an "epic" which should be broken down into many subtasks. But if people only want to restore the limited support we had in the past (ingesting data packages by discarding most of the metadata they contain), then I would say it's also okay and should be easier. Even though it feels half-baked to me, perhaps it is useful in some workflows, and I don't see a reason to prevent people from implementing this in OpenRefine (if it's just another importer / exporter, basically).
Looks like they decided to write a tool from scratch instead: https://github.com/okfn/opendataeditor They have a comparison page which lists a bunch of alternatives, including OpenRefine, but doesn't really say anything about how they compare.
This is a discussion issue atm. Also a place to ask questions about how best to patch OpenRefine to support this.
Introduction and overview to Simple Data Format (full spec)