knowledge-graph not to list removed datasets #137

Closed
jachro opened this issue Sep 25, 2019 · 2 comments

jachro (Contributor) commented Sep 25, 2019

As a Renku user, once I have removed a dataset from a project, I don't want this dataset to be listed among that project's datasets anymore.

Acceptance criteria:

  • drive the change from acceptance-tests;
  • triples-generator to feed info about deleted files to the renku log command while generating triples;
  • GET /knowledge-graph/datasets NOT to return a removed dataset if it existed on a single project where it was removed, or if it existed on multiple forks of the project and it was removed on all of them; if a removed dataset is still present on some forks, it should still be searchable;
  • GET /knowledge-graph/datasets/:id to return NOT FOUND (404) when the requested id identifies a deleted dataset, under the assumption that there are no forks sharing the same dataset or that the dataset is removed on all the projects; the resource should return the dataset details if the dataset is shared with at least one fork and is NOT removed on all the projects (see the query sketch after this list);
  • GET /knowledge-graph/projects/:namespace/:name/datasets NOT to return datasets that were deleted on the project with the given namespace and name, even if they are not removed on projects sharing exactly the same dataset (fork case);
  • think of the case when a parent dataset (in terms of dataset modification) is removed; probably the wasDerivedFrom triple should be removed from the direct child dataset.
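
A hedged sketch of the 404 decision for GET /knowledge-graph/datasets/:id, assuming the removal eventually shows up as a prov:wasInvalidatedBy link (one of the options below) and that a commit Activity can be tied back to its project; the predicate names and IRIs are illustrative assumptions, not the actual renku-graph vocabulary. The endpoint would return the dataset details only when the ASK query below evaluates to true:

    PREFIX schema: <http://schema.org/>
    PREFIX prov:   <http://www.w3.org/ns/prov#>

    # true = the dataset is still valid on at least one project (fork) linking to it
    ASK {
      ?dataset schema:identifier "c42f08db-27f4-44d0-9b55-6dfe6ca96ec9" ;
               schema:isPartOf   ?project .
      FILTER NOT EXISTS {
        ?dataset prov:wasInvalidatedBy ?removal .
        ?removal schema:isPartOf       ?project .   # hypothetical activity-to-project link
      }
    }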

Original acceptance criteria:
Option 1:

  • the triples curation process to look for removed datasets' metadata files in the project's .renku/datasets folder;
    • removed here means deleted in the sense of a particular commit (probably git diff --name-only --diff-filter=D HEAD~1..HEAD could be a choice here);
    • do nothing when no deleted metadata files are found;
    • if deleted metadata files are found, extract the datasets' identifiers from the paths (e.g. if a deleted file is .renku/datasets/c42f08db-27f4-44d0-9b55-6dfe6ca96ec9/metadata.yml, the identifier is c42f08db-27f4-44d0-9b55-6dfe6ca96ec9);
    • generate a delete query removing the schema:isPartOf link between the dataset with the found identifier and the project the triples are generated for;
    • generate a delete query removing the whole dataset entity if (a SPARQL sketch of both delete queries follows this list):
      • it's not linked to any other project;
      • there are no descendant datasets;
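
A minimal SPARQL Update sketch of the two delete queries above, assuming datasets carry schema:identifier and are linked to projects via schema:isPartOf; the prefixes, predicate names and project IRI are illustrative assumptions, not the exact renku-graph vocabulary:

    PREFIX schema: <http://schema.org/>
    PREFIX prov:   <http://www.w3.org/ns/prov#>

    # 1. Unlink the removed dataset from the project the triples are generated for
    #    (identifier taken from the deleted .renku/datasets/<id>/metadata.yml path).
    DELETE { ?dataset schema:isPartOf <https://renkulab.io/projects/some-namespace/some-project> }
    WHERE {
      ?dataset schema:identifier "c42f08db-27f4-44d0-9b55-6dfe6ca96ec9" ;
               schema:isPartOf   <https://renkulab.io/projects/some-namespace/some-project> .
    };

    # 2. Drop the whole dataset entity, but only if no other project still links to it
    #    and no descendant dataset points back at it via prov:wasDerivedFrom.
    DELETE { ?dataset ?p ?o }
    WHERE {
      ?dataset schema:identifier "c42f08db-27f4-44d0-9b55-6dfe6ca96ec9" ;
               ?p ?o .
      FILTER NOT EXISTS { ?dataset schema:isPartOf ?otherProject }
      FILTER NOT EXISTS { ?child prov:wasDerivedFrom ?dataset }
    }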

Option 2:

  • the triples curation process to look for removed datasets' metadata files in the project's .renku/datasets folder;
    • once a deleted dataset is identified (see the first approach above), the triples curation process does not generate a dataset entity removal query but instead inserts an invalidatedBy triple pointing to the commit Activity in which the dataset metadata was removed (see the sketch after this list);
    • the only problem I can see here is that the dataset-finding queries would get more complicated (and they are already complicated enough); the reason is that all the queries reaching a dataset entity would have to check whether there's an invalidatedBy link to a commit Activity of a certain project, as I suppose we wouldn't be unlinking the project from the dataset;
    • the KG queries have to be able to deal with the eventuality of multiple invalidatedBy links on a single dataset (the case of project forks);
    • cross-check whether the above assumptions hold once we play the dataset immutability issue on renku-python.
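
A hedged sketch of both sides of this option, assuming the invalidatedBy link maps to prov:wasInvalidatedBy from PROV-O and that a commit Activity can be tied back to its project; all IRIs and the activity-to-project predicate are illustrative assumptions. The two statements below are separate requests, not a single update:

    PREFIX schema: <http://schema.org/>
    PREFIX prov:   <http://www.w3.org/ns/prov#>

    # Curation side: mark the dataset as invalidated by the commit Activity that
    # deleted its metadata file, without removing any of the existing triples.
    INSERT {
      ?dataset prov:wasInvalidatedBy <https://renkulab.io/activities/commit/abc123def> .
    }
    WHERE {
      ?dataset schema:identifier "c42f08db-27f4-44d0-9b55-6dfe6ca96ec9" .
    }

    # Query side: every dataset-finding query now has to filter out datasets
    # invalidated on the project being looked at (forks may add more such links).
    SELECT ?dataset ?name
    WHERE {
      ?dataset a schema:Dataset ;
               schema:name     ?name ;
               schema:isPartOf ?project .
      FILTER NOT EXISTS {
        ?dataset  prov:wasInvalidatedBy ?activity .
        ?activity schema:isPartOf       ?project .   # hypothetical activity-to-project link
      }
    }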

Option 3:

  • there are no changes done to a removed dataset entity; only the dataset queries are updated so they look for files that get invalidated; if all of a dataset's parts (effectively the underlying files) are invalidated, then such a dataset should not be retrieved by the queries (see the sketch after this list);
    • I imagine this approach would make the KG queries very complicated, and the complexity would have to be repeated in all the queries touching dataset entities;
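
A hedged sketch of how such a query-only approach might look, assuming dataset parts hang off schema:hasPart and file invalidation is already expressed with prov:wasInvalidatedBy; both predicates are assumptions about the data model, and the same filter would have to be repeated in every query touching dataset entities:

    PREFIX schema: <http://schema.org/>
    PREFIX prov:   <http://www.w3.org/ns/prov#>

    # Keep a dataset only while at least one of its parts (files) is still valid,
    # i.e. the dataset disappears from results once every part has been invalidated.
    SELECT DISTINCT ?dataset
    WHERE {
      ?dataset a schema:Dataset ;
               schema:hasPart ?part .
      FILTER NOT EXISTS { ?part prov:wasInvalidatedBy ?anyActivity }
    }
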
jachro added this to the sprint-2020-03-27 milestone Apr 30, 2020

ciyer commented May 6, 2020

Also need to consider cases where a project is deleted; there is no .renku/datasets folder to refer to anymore.

jachro (Contributor, Author) commented May 6, 2020

@ciyer there's a separate issue to deal with project removals.
