feat: metadata on dataset creation #850

mohammad-alisafaee · 2019-12-04T09:30:17Z

Description

When creating a dataset, users can provide its description, creators, and display name using command-line options. A creator is a string in "Name <email>" format, and users can pass multiple creators.
This also makes consistent use of short_name in the code. short_name is used as the name of the dataset's data directory and as the dataset's reference name. Renku does not use name other than to store it as metadata. If users don't provide a short name when creating a dataset, one is generated automatically.

Fixes #515
Fixes #791
Fixes #840
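
For illustration, parsing a creator string of the form used in this thread (a name followed by an email in angle brackets) could look roughly like the sketch below. The names CREATOR_PATTERN, parse_creator, and parse_creators are hypothetical and are not taken from the Renku code base; the pattern is only an assumption about what a reasonable check might be.

```python
import re

# Hypothetical pattern: a non-empty name, then an email address in angle brackets.
CREATOR_PATTERN = re.compile(
    r"^\s*(?P<name>[^<>\s][^<>]*?)\s*<(?P<email>[^<>\s]+@[^<>\s]+)>\s*$"
)


def parse_creator(value):
    """Split a 'Name <email>' string into a (name, email) tuple."""
    match = CREATOR_PATTERN.match(value)
    if match is None:
        raise ValueError(
            f'Invalid creator format, expected "Name <email>": {value!r}'
        )
    return match.group("name"), match.group("email")


def parse_creators(values):
    """Parse every creator string, e.g. one per repeated command-line option."""
    return [parse_creator(value) for value in values]
```

A string without an email part, such as "Emma", raises ValueError in this sketch.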

emmjab · 2019-12-04T14:57:21Z

I tried the following:

$ renku dataset create new -d "this is a test dataset" -c "Emma <e.jablonski@gmail.com>" --display-name "mountain"
OK
$ ls
Dockerfile		requirements.txt

(1) didn't realize that the data directory and a new or mountain subdirectory wasn't going to be created

(env) IC-LM-208782:my_new_project jablonsk$ renku dataset 
ID                                    DISPLAY_NAME    VERSION    CREATED              CREATORS
------------------------------------  --------------  ---------  -------------------  ----------
3041b30a-df24-4dcb-a0b9-709340433f88  mountain                   2019-12-04 14:31:11  Emma
(env) IC-LM-208782:my_new_project jablonsk$ renku dataset add new /Users/jablonsk/Documents/permafrosthackathon/permafrost-refresh/data/annotations/annotations.zip 
Error: Dataset "new" does not exist.
Use "renku dataset create new" to create the dataset or retry "renku dataset add new" command with "--create" option for automatic dataset creation.

(2) Didn't realize display-name was then going to be how I add files to my dataset & the name of the directory inside data.

(env) IC-LM-208782:my_new_project jablonsk$ renku dataset add mountain /Users/jablonsk/Documents/permafrosthackathon/permafrost-refresh/data/annotations/annotations.zip 
Warning: Adding data from local Git repository. Use remote's Git URL instead to enable lineage information and updates.
(env) IC-LM-208782:my_new_project jablonsk$ ls
Dockerfile		data			requirements.txt

But then ls-files still shows the dataset being called new

$ renku dataset ls-files
ADDED                CREATORS    DATASET    PATH
-------------------  ----------  ---------  ---------------------------------------------------------------------------------------
2019-12-04 14:32:35  Emma        new        /Users/jablonsk/Documents/test/testing_515/my_new_project/data/mountain/annotations.zip

Also checked that the name & email get validated (i.e. it has to be of the format "Name <email>") and that you can pass in any combination of the options

I don't quite understand display_name -- what makes something a valid renku name?

mohammad-alisafaee · 2019-12-04T15:14:42Z

Data directory is created when you add a file to the dataset.
display_name is used as a valid Renku name, however, its name is a bit misleading and we might change this.

emmjab · 2019-12-04T15:23:41Z

> Data directory is created when you add a file to the dataset.

Not gonna dig into this one now; can see it going either way (i.e. if the directory shows up, users will think they can add files to the directory without renku dataset add-ing them, but if it doesn't show up it's kind of opaque about what happens when you "create a dataset").

> display_name is used as a valid Renku name, however, its name is a bit misleading and we might change this.

What's the definition of a "valid Renku name"?

> But then ls-files still shows the dataset being called new

What about the inconsistency between whether it's named by renku dataset create <firstname> v. renku dataset create --display_name <lastname>?

mohammad-alisafaee · 2019-12-11T13:45:18Z

> Not gonna dig into this one now; can see it going either way (i.e. if the directory shows up, users will think they can add files to the directory without renku dataset add-ing them, but if it doesn't show up it's kind of opaque about what happens when you "create a dataset").

Please create an issue for this if you think we need to discuss it.

> What's the definition of a "valid Renku name"?

A valid Git reference. Basically, one can use alphanumeric characters, ., -, and _. Git allows some more characters, but we should (and will) disallow them.
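
As a rough sketch of the rule described above, a check restricted to alphanumeric characters, ".", "-", and "_" might look like this. The names ALLOWED_NAME and is_valid_name are hypothetical, and the sketch deliberately ignores the additional constraints Git itself places on reference names.

```python
import re

# Assumed allow-list taken from the comment above: alphanumerics, ".", "-", "_".
ALLOWED_NAME = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_name(name):
    """Return True if the whole name consists only of allowed characters."""
    return ALLOWED_NAME.fullmatch(name) is not None
```

Under this assumption "new" and "mountain" pass, while a name containing a space does not.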

> What about the inconsistency between whether it's named by renku dataset create <firstname> v. renku dataset create --display_name <lastname>?

The --display-name option is removed due to the apparent confusion it causes. Now, there is only the dataset's name (<firstname> in your example).

rokroskar · 2019-12-12T08:31:23Z

Since we are now using internal_name, could we also specify --name or --title to give it a more human-readable name?

rokroskar · 2019-12-12T22:37:07Z

renku/core/management/datasets.py

        dataset.to_yaml()

+    def create_dataset(
+        self, name, internal_name='', description='', creators=()


should allow for specifying an identifier?

Also, +1 on this refactor. :)

rokroskar

This looks like a good improvement, thanks! I'm still not 100% sure how the names should be handled exactly, but I find this much less confusing than before.

mohammad-alisafaee · 2019-12-16T10:20:14Z

> Since we are now using internal_name, could we also specify --name or --title to give it a more human-readable name?

We already have a name field for datasets which is the same as internal_name in locally-created datasets and is whatever-is-set-as-name for imported datasets. Probably, we should use a title field for human-readable names for both of these two types of datasets but in that case we are changing name field for imported datasets.

rokroskar · 2019-12-16T10:27:57Z

Right - I'm wondering if we could follow the same convention for newly-created datasets - if the name given by the user matches the "allowed characters" regex, then we keep it as is and name is the same as internal_name. However, if a user specifies the name in this way:

renku dataset create "Exploratory dataset"

then this would become

name -> "Exploratory dataset"
internal_name -> exploratory.dataset
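
A minimal sketch of that convention, assuming the same allow-list of alphanumeric characters, ".", "-", and "_": names that already match are kept unchanged, and anything else is lowercased with each run of disallowed characters collapsed into a separator. The function name derive_internal_name and its separator parameter are hypothetical; this is not the code in the PR.

```python
import re


def derive_internal_name(name, separator="."):
    """Derive a command-line friendly name from a free-form dataset name."""
    # Names made only of allowed characters are kept as they are.
    if re.fullmatch(r"[A-Za-z0-9._-]+", name):
        return name
    # Otherwise lowercase the name and collapse every run of disallowed
    # characters into a single separator.
    slug = re.sub(r"[^a-z0-9._-]+", separator, name.lower())
    # Drop separators left over at either end (e.g. from quotes).
    return slug.strip(separator)
```

With the default separator, "Exploratory dataset" becomes exploratory.dataset; passing separator="_" yields underscore-separated names instead.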

mohammad-alisafaee · 2019-12-16T10:33:10Z

I believe this will confuse some users, as they expect the name "Exploratory dataset" to refer to the dataset but they must actually use exploratory.dataset. This is the same situation as in a previous commit of this PR, where we had both --display-name and a name.

rokroskar · 2019-12-16T10:34:49Z

What if we followed up the dataset creation with a helpful message that said something like:

Creation successful! Use the name exploratory.dataset to refer to this dataset.

We also need something like this to follow the import command anyway.

mohammad-alisafaee · 2019-12-16T13:57:48Z

I had to force-push to resolve merge conflicts. Please review the last two commits.

rokroskar · 2019-12-16T14:02:01Z

Cool! I think that works nicely:

$ renku dataset create "My awesome dataset"
Use the name "my_awesome_dataset" to refer to this dataset.
OK

$ renku dataset
ID                                    INTERNAL_NAME             VERSION    CREATED              CREATORS
------------------------------------  ------------------------  ---------  -------------------  ---------------------------------------
8bdee124-1a8a-408e-9ff1-b98f9ba007c4  my_awesome_dataset                   2019-12-16 14:59:50  R.Roskar

@emmjab what do you think?

emmjab · 2019-12-16T15:33:03Z

Hah -- I think this is still confusing. Why does the name have to change? It has to be a valid git name? Why? it's not a submodule anymore, was that why?

rokroskar · 2019-12-16T15:34:33Z

Because it's annoying to type something that is super long and (potentially) contains special characters?

rokroskar · 2019-12-16T15:36:13Z

It's not obvious in this example, clearly, but if you import a dataset from zenodo it's likely that the name will be annoyingly long. You don't want to have to type that or copy/paste that every time you use it

emmjab · 2019-12-16T15:37:06Z

hmm... autocomplete? 😅

> if you import a dataset from zenodo it's likely that the name will be annoyingly long

Why is that? I can google this

rokroskar · 2019-12-16T15:38:34Z

Take https://zenodo.org/record/3549866 for example. The name is "Synthetic dataset used in "The maximum weighted submatrix coverage problem: A CP approach"" - how are you supposed to use that on the command line?

emmjab · 2019-12-16T15:39:52Z

why not use the DOI? <-- but we're not just talking about the imported datasets, are we?

rokroskar · 2019-12-16T15:57:08Z

Ok so @emmjab and I have been discussing this a bit offline. I find "internal_name" a bit confusing - what if we called it "short_name"? We could store this using schema.org/alternateName to give the user the option to set it or change it. The default would be what we are doing now, i.e. automatically generate it.

So it would look like

dataset create --shortname dataset1 "This is a dataset about stuff"

or

dataset import --shortname dataset1 <DOI>
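
To make the mapping concrete, here is a small sketch of how the two names could sit side by side in schema.org terms, with the human-readable title in name and the short name in alternateName. The function dataset_metadata and the exact dictionary layout are assumptions for illustration only; how Renku actually serializes this is not shown in this thread.

```python
def dataset_metadata(name, short_name):
    """Build a schema.org-style record holding both names (illustrative only)."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        # Free-form, human-readable title given by the user or the provider.
        "name": name,
        # Short, command-line friendly name used to refer to the dataset.
        "alternateName": short_name,
    }
```

For the first command above, this would pair the name "This is a dataset about stuff" with the alternateName dataset1.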

rokroskar

This is definitely a step in the right direction! Thanks!

mohammad-alisafaee requested a review from a team as a code owner December 4, 2019 09:30

rokroskar requested a review from emmjab December 4, 2019 09:31

mohammad-alisafaee force-pushed the 515-metadata-on-dataset-creation branch 2 times, most recently from f05610b to c0c7d21 Compare December 11, 2019 13:29

rokroskar reviewed Dec 12, 2019

rokroskar previously approved these changes Dec 12, 2019

mohammad-alisafaee dismissed rokroskar’s stale review via bdfc87c December 16, 2019 10:22

mohammad-alisafaee added 2 commits December 16, 2019 11:33

refactor: dataset creation

640dd33

refactor: rename methods

8642b0b

mohammad-alisafaee added 7 commits December 16, 2019 11:41

refactor: consistent use of display_name

13282da

refactor: dataset name validation

21367dd

feat: parse creators from string

11c25a7

add more tests

371d14a

fix: do not allow display_name at dataset creation

afbd671

chore: resolve rebasing conflicts

ae2e688

feat: allow any character in dataset name

d482ec2

mohammad-alisafaee force-pushed the 515-metadata-on-dataset-creation branch from bdfc87c to d482ec2 Compare December 16, 2019 13:56

feat: allow short_name for datasets

5b63f58

rokroskar approved these changes Dec 17, 2019

mohammad-alisafaee merged commit b357ee7 into master Dec 17, 2019

mohammad-alisafaee deleted the 515-metadata-on-dataset-creation branch December 17, 2019 09:23

jsam mentioned this pull request Dec 18, 2019

RaaS: Use additional metadata on dataset creation #874

Closed
