-
Notifications
You must be signed in to change notification settings - Fork 24
feat: add project files to datasets from service #877
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…r/renku-python into 769_project_files0
mohammad-alisafaee
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Some minor comments. My main comment is regarding #881 which I believe has a simpler fix.
| creators = [Person.from_string(c) for c in creators] | ||
|
|
||
| elif hasattr(creators, '__iter__') and isinstance(creators[0], dict): | ||
| creators = [Person.from_dict(creator) for creator in creators] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we pass the correct type (i.e. list of Person) to this function when calling it. These type checks do not look nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creators can be any collection type (list, tuple, str,...), so checking for correct type would make this even uglier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not talking about type checking for correctness but about taking action based on the type. This code also does not check for type correctness (if creators is not an iterable nothing happens and we might end up with wrong metadata).
renku/core/management/datasets.py
Outdated
| file_paths = {str(data['path']) for data in files if str(data['path'])} | ||
| self.repo.git.add(*(file_paths - set(ignored))) | ||
| new_files_count = len(file_paths) - len(ignored) | ||
| new_files = (file_paths - set(ignored)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not new files but the files that are not ignored. Ignored files can be new as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ignored files are handled by the force. There are tests for it and you can check that it works fine. But the count might be off, let me check.
renku/core/management/datasets.py
Outdated
|
|
||
| if not self.repo.is_dirty(): | ||
| return warning_message | ||
| if dataset.contains(new_files) is False and (new_files_count or force): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This condition is not right. For example: If the dataset contains at least one file from new_files then the if block won't execute and the rest of the files won't be added to the dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which (I would argue) that is exactly what we want as a default behaviour - transactional add. Operation should not partially succeed since from users perspective is very strange to have partial success - since users might add something by accident which they donot want. We should force users to be explicit on the CLI and not silently succeeding at the operations and not letting them know what exactly happened (which is currently the case).
edit: Let's add a flag where that behavior would be possible. That would make things more explicit and straightforward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following is obviously a bug:
[~/playground/tmp]$ renku dataset create a
Use the name "a" to refer to this dataset.
OK
[~/playground/tmp]$ renku dataset add a ~/some-file
[~/playground/tmp]$ renku dataset add a ~/some-file ~/some-other-file
[~/playground/tmp]$ git grep some-file
.gitattributes:data/a/some-file filter=lfs diff=lfs merge=lfs -text
.renku/datasets/f2960ea8-65a3-4a97-92e2-e87768d21f05/metadata.yml: _id: file://blob/1f7d45364cbe7e7de19bfdfa01da7585351a9e5c/data/a/some-file
.renku/datasets/f2960ea8-65a3-4a97-92e2-e87768d21f05/metadata.yml: _label: data/a/some-file@1f7d45364cbe7e7de19bfdfa01da7585351a9e5c
.renku/datasets/f2960ea8-65a3-4a97-92e2-e87768d21f05/metadata.yml: name: some-file
.renku/datasets/f2960ea8-65a3-4a97-92e2-e87768d21f05/metadata.yml: path: data/a/some-file
.renku/datasets/f2960ea8-65a3-4a97-92e2-e87768d21f05/metadata.yml: url: file://../../some-file
[~/playground/tmp]$ git grep some-other-file
.gitattributes:data/a/some-other-file filter=lfs diff=lfs merge=lfs -text
[~/playground/tmp]$ git log --pretty=oneline
762b5de192477613185d8e90900a0f38c920b8b9 (HEAD -> master) renku dataset add a /home/mohammad/some-file /home/mohammad/some-other-file
bf50499e866daec0a1aa09170e36a422a6ba26ee renku dataset add a /home/mohammad/some-file
1f7d45364cbe7e7de19bfdfa01da7585351a9e5c renku dataset: committing 1 newly added files
e6dcaae7418a8c53f47ec412bae405032393a863 renku dataset create a
11b817913f9b58601b19aa1bc20d4f21ad196f06 renku init
It partially succeeds and it does not print an error message which I believe is the exact opposite of what you are aiming to do here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is truly a bug! I will add a test case for it and fix it.
Thank you.
| new_files = (file_paths - set(ignored)) | ||
|
|
||
| if not self.repo.is_dirty(): | ||
| return warning_message |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bug reported in #881 can be solved by moving the commit statement to this if block and removing return from here:
if not self.repo.is_dirty():
self.repo.index.commit(
'renku dataset: commiting {} newly added files'.
format(len(file_paths) + len(ignored))
)
Like this we avoid an empty commit if there is no change in data files and we allow the metadata to be updated (which was prevented previously by returning early).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And in that case adding ignored files would fail to be added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Added ignored files show up as dirty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How? If they are ignored it means that they are part of the .gitignore and git won't register them at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[~/playground/tmp]$ rm -rf * .* ; renku init
OK
Initialized empty project in /home/mohammad/playground/tmp
[~/playground/tmp]$ grep .scrapy .gitignore
.scrapy
[~/playground/tmp]$ echo .scrapy > .scrapy
[~/playground/tmp]$ git add .scrapy
The following paths are ignored by one of your .gitignore files:
.scrapy
Use -f if you really want to add them.
[~/playground/tmp]$ git add -f .scrapy
[~/playground/tmp]$ git status -s
A .scrapy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If users do not use --force and there are ignored files then this function raises an exception and won't reach this point.
renku/core/models/datasets.py
Outdated
| data = {field_: obj.pop(field_) for field_ in self.EDITABLE_FIELDS} | ||
| return data | ||
|
|
||
| def contains(self, files): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is a bit misleading. I'd rename it to contains_any or contains_at_least_one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it miss leading? It follows a common functionality pattern which can be found in standard library (and in other languages standard library).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it accepts multiple files I expect contains return true only if all files are contained. This function return true when any of files is in the dataset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, renamed to contains_any.
|
|
||
| monkeypatch.setattr(requests, 'get', get) | ||
| dataset = client.load_dataset('my-dataset') | ||
| o = client._add_from_url(dataset, url, client.path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to avoid calling a non-public method in this test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That client abstraction should really be fixed (IMHO), cause isolating pieces in testable units is almost impossible without resorting to such calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't get why you removed the cli call to add files. I thought perhaps it was because of the requests mocking that something wasn't working in between.
Perhaps it's better to remove this test and have it only as integration test because it is mocking an implementation detail and it is calling a protected method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I din't removed anything, that test the way it was written it was integration test and it was not marked as such - I just marked it as an integration test and wrote the actual unit test for the part of what that test is suppose to be testing since @rokroskar requested to have it.
I would also be fine just with having it as an integration test, tbh.
|
|
||
| with client.with_dataset('my-dataset') as dataset: | ||
| file_ = dataset.files[0] | ||
| assert file_.url == 'https://example.com/index.html' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, it should be other way around - thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do not start the discussion about expected and actual ;).
|
|
||
| assert {'result'} == set(response.json.keys()) | ||
| assert {'dataset_name', 'files', | ||
| 'project_id'} == set(response.json['result'].keys()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might fail depending on the order. Consider using sorted(...).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since when does order matter in a set? 😄
Try this in your interpreter:
print({1,2,3} == {2,3,1})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops! Technically, the order in sets matters! :)
jsam
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oki, thanks for the review. I will look into it.
mohammad-alisafaee
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
closes: #769
depends on: #875
fixes: #881
fixes: #880
Fixes additional problems with: