Skip to content

Conversation

@mohammad-alisafaee
Copy link
Contributor

@mohammad-alisafaee mohammad-alisafaee commented Dec 4, 2019

Description

When creating datasets, users can provide its description, creators, and display name using command line options. Creator is a string with "Name " format and users can pass multiple creators.
This also makes a consistent use of short_name in the code. short_name is used as the dataset's data directory name and dataset's reference name. name is not used by Renku other than storing it as a metadata. If users don't provide a short name when creating a dataset then one is automatically created.

Fixes #515
Fixes #791
Fixes #840

@mohammad-alisafaee mohammad-alisafaee requested a review from a team as a code owner December 4, 2019 09:30
@rokroskar rokroskar requested a review from emmjab December 4, 2019 09:31
@emmjab
Copy link
Contributor

emmjab commented Dec 4, 2019

I tried the following:

$ renku dataset create new -d "this is a test dataset" -c "Emma <e.jablonski@gmail.com>" --display-name "mountain"
OK
$ ls
Dockerfile		requirements.txt

(1) didn't realize that the data directory and a new or mountain subdirectory wasn't going to be created

(env) IC-LM-208782:my_new_project jablonsk$ renku dataset 
ID                                    DISPLAY_NAME    VERSION    CREATED              CREATORS
------------------------------------  --------------  ---------  -------------------  ----------
3041b30a-df24-4dcb-a0b9-709340433f88  mountain                   2019-12-04 14:31:11  Emma
(env) IC-LM-208782:my_new_project jablonsk$ renku dataset add new /Users/jablonsk/Documents/permafrosthackathon/permafrost-refresh/data/annotations/annotations.zip 
Error: Dataset "new" does not exist.
Use "renku dataset create new" to create the dataset or retry "renku dataset add new" command with "--create" option for automatic dataset creation.

(2) Didn't realize display-name was then going to be how I add files to my dataset & the name of the directory inside data.

(env) IC-LM-208782:my_new_project jablonsk$ renku dataset add mountain /Users/jablonsk/Documents/permafrosthackathon/permafrost-refresh/data/annotations/annotations.zip 
Warning: Adding data from local Git repository. Use remote's Git URL instead to enable lineage information and updates.
(env) IC-LM-208782:my_new_project jablonsk$ ls
Dockerfile		data			requirements.txt

But then ls-files still shows the dataset being called new

$ renku dataset ls-files
ADDED                CREATORS    DATASET    PATH
-------------------  ----------  ---------  ---------------------------------------------------------------------------------------
2019-12-04 14:32:35  Emma        new        /Users/jablonsk/Documents/test/testing_515/my_new_project/data/mountain/annotations.zip
  • Also checked that the name & email get validated (i.e. has to be of the format "Name ")
  • that you can pass in any combination of the options

I don't quite understand display_name -- what makes something a valid renku name?

@mohammad-alisafaee
Copy link
Contributor Author

  1. Data directory is created when you add a file to the dataset.
  2. display_name is used as a valid Renku name, however, its name is a bit misleading and we might change this.

@emmjab
Copy link
Contributor

emmjab commented Dec 4, 2019

  1. Data directory is created when you add a file to the dataset.

Not gonna dig into this one now; can see it going either way (i.e. if the directory shows up, users will think they can add files to the directory without renku dataset add-ing them, but if it doesn't show up it's kind of opaque about what happens when you "create a dataset".

  1. display_name is used as a valid Renku name, however, its name is a bit misleading and we might change this.

What's the definition of a "valid Renku name"?

  1. But then ls-files still shows the dataset being called new

What about the inconsistency between whether it's named by renku dataset create <firstname> v. renku dataset create --display_name <lastname>?

@mohammad-alisafaee mohammad-alisafaee force-pushed the 515-metadata-on-dataset-creation branch 2 times, most recently from f05610b to c0c7d21 Compare December 11, 2019 13:29
@mohammad-alisafaee
Copy link
Contributor Author

Not gonna dig into this one now; can see it going either way (i.e. if the directory shows up, users will think they can add files to the directory without renku dataset add-ing them, but if it doesn't show up it's kind of opaque about what happens when you "create a dataset".

Please create an issue for this if you think we need to discuss it.

What's the definition of a "valid Renku name"?

A valid Git reference. Basically, one can use alphanumeric, ., -, and _. Some more characters are allowed by Git to use but we should (and will) disallow them.

What about the inconsistency between whether it's named by renku dataset create <firstname> v. renku dataset create --display_name <lastname>?

The --display-name option is removed due the apparent confusion it causes. Now, there is only dataset's name (<firstname> in your example).

@rokroskar
Copy link
Member

Since we are now using internal_name, could we also specify --name or --title to give it a more human-readable name?

dataset.to_yaml()

def create_dataset(
self, name, internal_name='', description='', creators=()
Copy link
Member

@rokroskar rokroskar Dec 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should allow for specifying an identifier?

Also, +1 on this refactor. :)

rokroskar
rokroskar previously approved these changes Dec 12, 2019
Copy link
Member

@rokroskar rokroskar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good improvement, thanks! I'm still not 100% sure how the names should be handled exactly, but I find this much less confusing than before.

@mohammad-alisafaee
Copy link
Contributor Author

Since we are now using internal_name, could we also specify --name or --title to give it a more human-readable name?

We already have a name field for datasets which is the same as internal_name in locally-created datasets and is whatever-is-set-as-name for imported datasets. Probably, we should use a title field for human-readable names for both of these two types of datasets but in that case we are changing name field for imported datasets.

@rokroskar
Copy link
Member

Right - I'm wondering if we could follow the same convention for newly-created datasets - if the name given by the user matches the "allowed characters" regex, then we keep it as is and name is the same as internal_name. However, if a user specifies the name in this way:

renku dataset create "Exploratory dataset"

then this would become

name -> "Exploratory dataset"
internal_name -> exploratory.dataset

@mohammad-alisafaee
Copy link
Contributor Author

I believe this will confuse some users as they expect to the name "Exploratory dataset" to refer to the dataset but they must actually use exploratory.dataset. This is the same situation as a previous commit of this PR that we had --display-name and a name.

@rokroskar
Copy link
Member

What if we followed up the dataset creation with a helpful message that said something like:

Creation successful! Use the name exploratory.dataset to refer to this dataset.

We also need something like this to follow the import command anyway.

@mohammad-alisafaee mohammad-alisafaee force-pushed the 515-metadata-on-dataset-creation branch from bdfc87c to d482ec2 Compare December 16, 2019 13:56
@mohammad-alisafaee
Copy link
Contributor Author

I had to force-push to resolve merge conflicts. Please review the last two commits.

@rokroskar
Copy link
Member

rokroskar commented Dec 16, 2019

Cool! I think that works nicely:

$ renku dataset create "My awesome dataset"
Use the name "my_awesome_dataset" to refer to this dataset.
OK

$ renku dataset
ID                                    INTERNAL_NAME             VERSION    CREATED              CREATORS
------------------------------------  ------------------------  ---------  -------------------  ---------------------------------------
8bdee124-1a8a-408e-9ff1-b98f9ba007c4  my_awesome_dataset                   2019-12-16 14:59:50  R.Roskar

@emmjab what do you think?

@emmjab
Copy link
Contributor

emmjab commented Dec 16, 2019

Hah -- I think this is still confusing. Why does the name have to change? It has to be a valid git name? Why? it's not a submodule anymore, was that why?

@rokroskar
Copy link
Member

Because it's annoying to type something that is super long and (potentially) contains special characters?

@rokroskar
Copy link
Member

It's not obvious in this example, clearly, but if you import a dataset from zenodo it's likely that the name will be annoyingly long. You don't want to have to type that or copy/paste that every time you use it

@emmjab
Copy link
Contributor

emmjab commented Dec 16, 2019

hmm... autocomplete? 😅

if you import a dataset from zenodo it's likely that the name will be annoyingly long

Why is that? I can google this

@rokroskar
Copy link
Member

Take https://zenodo.org/record/3549866 for example. The name is "Synthetic dataset used in "The maximum weighted submatrix coverage problem: A CP approach"" - how do you suppose to use that on the command line?

@emmjab
Copy link
Contributor

emmjab commented Dec 16, 2019

why not use the DOI? <-- but we're not just talking about the imported datasets, are we?

@rokroskar
Copy link
Member

Ok so @emmjab and I have been discussing this a bit offline. I find "internal_name" a bit confusing - what if we called it "short_name"? We could store this using schema.org/alternateName to give the user the option to set it or change it. The default would be what we are doing now, i.e. automatically generate it.

So it would look like

dataset create --shortname dataset1 "This ia a dataset about stuff"

or

dataset import --shortname dataset1 <DOI>

Copy link
Member

@rokroskar rokroskar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely a step in the right direction! Thanks!

@mohammad-alisafaee mohammad-alisafaee merged commit b357ee7 into master Dec 17, 2019
@mohammad-alisafaee mohammad-alisafaee deleted the 515-metadata-on-dataset-creation branch December 17, 2019 09:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Wrong file location when adding files to imported datasets Clean up use of dataset display_name cli: add metadata on dataset creation

4 participants