Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(core): new dataset provenance #2181

Merged

Conversation

m-alisafaee
Copy link
Contributor

@m-alisafaee m-alisafaee commented Jun 17, 2021

Description

Adds two indexes to the new persistent layer: datasets and datasets-provenance. The first index is a mapping from dataset names to data objects for all current datasets in a project. The second index is a mapping from dataset id to dataset object; this index contains current datasets and all previous versions of all datasets.

Fixes #2119

@m-alisafaee m-alisafaee force-pushed the 2119-new-dataset-provenance branch 9 times, most recently from 4fc27cc to b210037 Compare June 23, 2021 15:29
@m-alisafaee m-alisafaee marked this pull request as ready for review June 23, 2021 16:41
@m-alisafaee m-alisafaee requested a review from a team as a code owner June 23, 2021 16:41

def update_datasets_provenance(self, dataset, remove=False):
@inject.autoparams()
def update_datasets_provenance(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be cleaner to have two methods, add_or_update_dataset_provenance and remove_dataset_provenance?

It would also be cleaner to just call the methods on DatasetProvenance directly and inject that where needed, since this method in LocalClient doesn't really contain any additional logic (especially since commit_database isn't set to True anywhere). In the future I'd like to rely on LocalClient less and have the dedicated classes do more of the heavy lifting.
If we inject DatasetProvenance using the binder.bind_to_constructor(Class, lambda: Class(*args)) way of injection, it would only initialise DatasetProvenance if code that uses it (well, requests it as an injection) is called and then it would be a singleton, so we could just inject this, ProvenanceGraph and DependencyGraph this way in the with_database CommandBuilder without performance overhead, and it'd save performance since we wouldn't have to do from_database more than once.

@@ -1488,7 +1462,7 @@ def _check_url(url):
if not is_git:
# NOTE: Check if the url is a redirect.
url = requests.head(url, allow_redirects=True).url
u = parse.urlparse(url)
_ = parse.urlparse(url)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know I wrote this, but now I'm confused, does this line actually do anything other than raise an uncaught error if the redirect url is not a valid url? I feel like we could remove this line? Looks like a copy&paste error by me

@m-alisafaee m-alisafaee force-pushed the 2119-new-dataset-provenance branch 2 times, most recently from 1d2f216 to eb75698 Compare June 28, 2021 15:30
@m-alisafaee m-alisafaee changed the base branch from master to 2108-custom-persistent-backend July 5, 2021 08:52
renku/core/errors.py Outdated Show resolved Hide resolved
@m-alisafaee m-alisafaee force-pushed the 2108-custom-persistent-backend branch from 4162ff6 to c3053ac Compare July 9, 2021 09:20
@m-alisafaee m-alisafaee force-pushed the 2108-custom-persistent-backend branch from c3053ac to ec62fa7 Compare July 9, 2021 09:39
@m-alisafaee m-alisafaee merged commit 7ab9421 into 2108-custom-persistent-backend Jul 9, 2021
@m-alisafaee m-alisafaee deleted the 2119-new-dataset-provenance branch July 9, 2021 09:46
m-alisafaee added a commit that referenced this pull request Jul 9, 2021
m-alisafaee added a commit that referenced this pull request Jul 9, 2021
m-alisafaee added a commit that referenced this pull request Jul 9, 2021
m-alisafaee added a commit that referenced this pull request Jul 9, 2021
Panaetius pushed a commit that referenced this pull request Jul 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change dataset provenance to store datasets in storage
2 participants