Skip to content

Conversation

@jsam
Copy link
Contributor

@jsam jsam commented Dec 18, 2019

closes: #769
depends on: #875

fixes: #881
fixes: #880

Fixes additional problems with:

  • dataset creators
  • linting problems
  • fixes broken tests
  • adds more tests around dataset create and add commands

@jsam jsam marked this pull request as ready for review December 19, 2019 17:04
@jsam jsam requested a review from a team as a code owner December 19, 2019 17:04
Copy link
Contributor

@mohammad-alisafaee mohammad-alisafaee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Some minor comments. My main comment is regarding #881 which I believe has a simpler fix.

creators = [Person.from_string(c) for c in creators]

elif hasattr(creators, '__iter__') and isinstance(creators[0], dict):
creators = [Person.from_dict(creator) for creator in creators]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pass the correct type (i.e. list of Person) to this function when calling it. These type checks do not look nice.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creators can be any collection type (list, tuple, str,...), so checking for correct type would make this even uglier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not talking about type checking for correctness but about taking action based on the type. This code also does not check for type correctness (if creators is not an iterable nothing happens and we might end up with wrong metadata).

file_paths = {str(data['path']) for data in files if str(data['path'])}
self.repo.git.add(*(file_paths - set(ignored)))
new_files_count = len(file_paths) - len(ignored)
new_files = (file_paths - set(ignored))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are not new files but the files that are not ignored. Ignored files can be new as well.

Copy link
Contributor Author

@jsam jsam Dec 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignored files are handled by the force. There are tests for it and you can check that it works fine. But the count might be off, let me check.


if not self.repo.is_dirty():
return warning_message
if dataset.contains(new_files) is False and (new_files_count or force):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This condition is not right. For example: If the dataset contains at least one file from new_files then the if block won't execute and the rest of the files won't be added to the dataset.

Copy link
Contributor Author

@jsam jsam Dec 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which (I would argue) that is exactly what we want as a default behaviour - transactional add. Operation should not partially succeed since from users perspective is very strange to have partial success - since users might add something by accident which they donot want. We should force users to be explicit on the CLI and not silently succeeding at the operations and not letting them know what exactly happened (which is currently the case).

edit: Let's add a flag where that behavior would be possible. That would make things more explicit and straightforward.

Copy link
Contributor

@mohammad-alisafaee mohammad-alisafaee Dec 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following is obviously a bug:

[~/playground/tmp]$ renku dataset create a
Use the name "a" to refer to this dataset.
OK
[~/playground/tmp]$ renku dataset add a ~/some-file 
[~/playground/tmp]$ renku dataset add a ~/some-file ~/some-other-file 
[~/playground/tmp]$ git grep some-file
.gitattributes:data/a/some-file filter=lfs diff=lfs merge=lfs -text
.renku/datasets/f2960ea8-65a3-4a97-92e2-e87768d21f05/metadata.yml:  _id: file://blob/1f7d45364cbe7e7de19bfdfa01da7585351a9e5c/data/a/some-file
.renku/datasets/f2960ea8-65a3-4a97-92e2-e87768d21f05/metadata.yml:  _label: data/a/some-file@1f7d45364cbe7e7de19bfdfa01da7585351a9e5c
.renku/datasets/f2960ea8-65a3-4a97-92e2-e87768d21f05/metadata.yml:  name: some-file
.renku/datasets/f2960ea8-65a3-4a97-92e2-e87768d21f05/metadata.yml:  path: data/a/some-file
.renku/datasets/f2960ea8-65a3-4a97-92e2-e87768d21f05/metadata.yml:  url: file://../../some-file
[~/playground/tmp]$ git grep some-other-file
.gitattributes:data/a/some-other-file filter=lfs diff=lfs merge=lfs -text
[~/playground/tmp]$ git log --pretty=oneline
762b5de192477613185d8e90900a0f38c920b8b9 (HEAD -> master) renku dataset add a /home/mohammad/some-file /home/mohammad/some-other-file
bf50499e866daec0a1aa09170e36a422a6ba26ee renku dataset add a /home/mohammad/some-file
1f7d45364cbe7e7de19bfdfa01da7585351a9e5c renku dataset: committing 1 newly added files
e6dcaae7418a8c53f47ec412bae405032393a863 renku dataset create a
11b817913f9b58601b19aa1bc20d4f21ad196f06 renku init

It partially succeeds and it does not print an error message which I believe is the exact opposite of what you are aiming to do here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is truly a bug! I will add a test case for it and fix it.

Thank you.

new_files = (file_paths - set(ignored))

if not self.repo.is_dirty():
return warning_message
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug reported in #881 can be solved by moving the commit statement to this if block and removing return from here:

if not self.repo.is_dirty():
    self.repo.index.commit(
         'renku dataset: commiting {} newly added files'.
         format(len(file_paths) + len(ignored))
    )

Like this we avoid an empty commit if there is no change in data files and we allow the metadata to be updated (which was prevented previously by returning early).

Copy link
Contributor Author

@jsam jsam Dec 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And in that case adding ignored files would fail to be added?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. Added ignored files show up as dirty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How? If they are ignored it means that they are part of the .gitignore and git won't register them at all.

Copy link
Contributor

@mohammad-alisafaee mohammad-alisafaee Dec 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[~/playground/tmp]$ rm -rf * .* ; renku init
OK
Initialized empty project in /home/mohammad/playground/tmp
[~/playground/tmp]$ grep .scrapy .gitignore 
.scrapy
[~/playground/tmp]$ echo .scrapy > .scrapy
[~/playground/tmp]$ git add .scrapy 
The following paths are ignored by one of your .gitignore files:
.scrapy
Use -f if you really want to add them.
[~/playground/tmp]$ git add -f .scrapy
[~/playground/tmp]$ git status -s
A  .scrapy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If users do not use --force and there are ignored files then this function raises an exception and won't reach this point.

data = {field_: obj.pop(field_) for field_ in self.EDITABLE_FIELDS}
return data

def contains(self, files):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is a bit misleading. I'd rename it to contains_any or contains_at_least_one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it miss leading? It follows a common functionality pattern which can be found in standard library (and in other languages standard library).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it accepts multiple files I expect contains return true only if all files are contained. This function return true when any of files is in the dataset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, renamed to contains_any.


monkeypatch.setattr(requests, 'get', get)
dataset = client.load_dataset('my-dataset')
o = client._add_from_url(dataset, url, client.path)
Copy link
Contributor

@mohammad-alisafaee mohammad-alisafaee Dec 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to avoid calling a non-public method in this test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That client abstraction should really be fixed (IMHO), cause isolating pieces in testable units is almost impossible without resorting to such calls.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get why you removed the cli call to add files. I thought perhaps it was because of the requests mocking that something wasn't working in between.
Perhaps it's better to remove this test and have it only as integration test because it is mocking an implementation detail and it is calling a protected method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I din't removed anything, that test the way it was written it was integration test and it was not marked as such - I just marked it as an integration test and wrote the actual unit test for the part of what that test is suppose to be testing since @rokroskar requested to have it.

I would also be fine just with having it as an integration test, tbh.


with client.with_dataset('my-dataset') as dataset:
file_ = dataset.files[0]
assert file_.url == 'https://example.com/index.html'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, it should be other way around - thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do not start the discussion about expected and actual ;).


assert {'result'} == set(response.json.keys())
assert {'dataset_name', 'files',
'project_id'} == set(response.json['result'].keys())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might fail depending on the order. Consider using sorted(...).

Copy link
Contributor Author

@jsam jsam Dec 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since when does order matter in a set? 😄

Try this in your interpreter:

print({1,2,3} == {2,3,1})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! Technically, the order in sets matters! :)

@rokroskar
Copy link
Member

fixes: #877

I think you mean #876?

@jsam
Copy link
Contributor Author

jsam commented Dec 20, 2019

fixes: #877

I think you mean #876?

It was a typo, sorry!

Copy link
Contributor Author

@jsam jsam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oki, thanks for the review. I will look into it.

Copy link
Contributor

@mohammad-alisafaee mohammad-alisafaee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@jsam jsam merged commit faf4c9a into master Jan 6, 2020
@jsam jsam deleted the 769_project_files0 branch January 6, 2020 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

4 participants