knowledge-graph not to list removed datasets #137

Closed
jachro opened this issue Sep 25, 2019 · 2 comments

jachro (Contributor) commented Sep 25, 2019

As a Renku user, once I have removed a dataset from a project, I don't want this dataset to be listed among that project's datasets anymore.

Acceptance criteria:

  • drive the change from acceptance-tests;
  • triples-generator to feed info about deleted files to the renku log command while generating triples;
  • GET /knowledge-graph/datasets NOT to return a removed dataset if it existed on a single project where it was removed, or if it existed on multiple forks of the project and it was removed on all of them; if a removed dataset is still present on some forks, it should still be searchable;
  • GET /knowledge-graph/datasets/:id to return NOT FOUND (404) when the requested id identifies a deleted dataset, under the assumption that there are no forks sharing the same dataset or that the dataset is removed on all the projects; the resource should return the dataset details if the dataset is shared with at least one fork and is NOT removed on all the projects (see the query sketch after this list);
  • GET /knowledge-graph/projects/:namespace/:name/datasets NOT to return datasets that were deleted on the project with the given namespace and name, even if they are not removed on projects sharing exactly the same dataset (fork case);
  • think of the case when a parent dataset (in terms of dataset modification) is removed; probably the wasDerivedFrom triple should be removed from the direct child dataset.
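
A hedged sketch of the 404 decision for GET /knowledge-graph/datasets/:id, assuming the removal eventually shows up as a prov:wasInvalidatedBy link (one of the options below) and that a commit Activity can be tied back to its project; the predicate names and IRIs are illustrative assumptions, not the actual renku-graph vocabulary. The endpoint would return the dataset details only when the ASK query below evaluates to true:

    PREFIX schema: <http://schema.org/>
    PREFIX prov:   <http://www.w3.org/ns/prov#>

    # true = the dataset is still valid on at least one project (fork) linking to it
    ASK {
      ?dataset schema:identifier "c42f08db-27f4-44d0-9b55-6dfe6ca96ec9" ;
               schema:isPartOf   ?project .
      FILTER NOT EXISTS {
        ?dataset prov:wasInvalidatedBy ?removal .
        ?removal schema:isPartOf       ?project .   # hypothetical activity-to-project link
      }
    }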

Original acceptance criteria:
Option 1:

  • the triples curation process to look for removed datasets' metadata files in the project's .renku/datasets folder;
    • removed here means deleted in the sense of a particular commit (probably git diff --name-only --diff-filter=D HEAD~1..HEAD could be a choice here);
    • do nothing when no deleted metadata files are found;
    • if deleted metadata files are found, extract the datasets' identifiers from the paths (e.g. if a deleted file is .renku/datasets/c42f08db-27f4-44d0-9b55-6dfe6ca96ec9/metadata.yml, the identifier is c42f08db-27f4-44d0-9b55-6dfe6ca96ec9);
    • generate a delete query removing the schema:isPartOf link between the dataset with the found identifier and the project the triples are generated for;
    • generate a delete query removing the whole dataset entity if (a SPARQL sketch of both delete queries follows this list):
      • it's not linked to any other project;
      • there are no descendant datasets;
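
A minimal SPARQL Update sketch of the two delete queries above, assuming datasets carry schema:identifier and are linked to projects via schema:isPartOf; the prefixes, predicate names and project IRI are illustrative assumptions, not the exact renku-graph vocabulary:

    PREFIX schema: <http://schema.org/>
    PREFIX prov:   <http://www.w3.org/ns/prov#>

    # 1. Unlink the removed dataset from the project the triples are generated for
    #    (identifier taken from the deleted .renku/datasets/<id>/metadata.yml path).
    DELETE { ?dataset schema:isPartOf <https://renkulab.io/projects/some-namespace/some-project> }
    WHERE {
      ?dataset schema:identifier "c42f08db-27f4-44d0-9b55-6dfe6ca96ec9" ;
               schema:isPartOf   <https://renkulab.io/projects/some-namespace/some-project> .
    };

    # 2. Drop the whole dataset entity, but only if no other project still links to it
    #    and no descendant dataset points back at it via prov:wasDerivedFrom.
    DELETE { ?dataset ?p ?o }
    WHERE {
      ?dataset schema:identifier "c42f08db-27f4-44d0-9b55-6dfe6ca96ec9" ;
               ?p ?o .
      FILTER NOT EXISTS { ?dataset schema:isPartOf ?otherProject }
      FILTER NOT EXISTS { ?child prov:wasDerivedFrom ?dataset }
    }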

Option 2:

  • the triples curation process to look for removed datasets' metadata files in the project's .renku/datasets folder;
    • once a deleted dataset is identified (see the first approach above), the triples curation process does not generate a dataset entity removal query but instead inserts an invalidatedBy triple pointing to the commit Activity in which the dataset metadata was removed (see the sketch after this list);
    • the only problem I can see here is that the dataset-finding queries would get more complicated (and they are already complicated enough); the reason is that all the queries reaching a dataset entity would have to check whether there's an invalidatedBy link to a commit Activity of a certain project, as I suppose we wouldn't be unlinking the project from the dataset;
    • the KG queries have to be able to deal with the eventuality of multiple invalidatedBy links on a single dataset (the case of project forks);
    • cross-check whether the above assumptions hold once we play the dataset immutability issue on renku-python.
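
A hedged sketch of both sides of this option, assuming the invalidatedBy link maps to prov:wasInvalidatedBy from PROV-O and that a commit Activity can be tied back to its project; all IRIs and the activity-to-project predicate are illustrative assumptions. The two statements below are separate requests, not a single update:

    PREFIX schema: <http://schema.org/>
    PREFIX prov:   <http://www.w3.org/ns/prov#>

    # Curation side: mark the dataset as invalidated by the commit Activity that
    # deleted its metadata file, without removing any of the existing triples.
    INSERT {
      ?dataset prov:wasInvalidatedBy <https://renkulab.io/activities/commit/abc123def> .
    }
    WHERE {
      ?dataset schema:identifier "c42f08db-27f4-44d0-9b55-6dfe6ca96ec9" .
    }

    # Query side: every dataset-finding query now has to filter out datasets
    # invalidated on the project being looked at (forks may add more such links).
    SELECT ?dataset ?name
    WHERE {
      ?dataset a schema:Dataset ;
               schema:name     ?name ;
               schema:isPartOf ?project .
      FILTER NOT EXISTS {
        ?dataset  prov:wasInvalidatedBy ?activity .
        ?activity schema:isPartOf       ?project .   # hypothetical activity-to-project link
      }
    }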

Option 3:

  • there are no changes done to a removed dataset entity; only the dataset queries are updated so they look for files that get invalidated; if all of a dataset's parts (effectively the underlying files) are invalidated, then such a dataset should not be retrieved by the queries (see the sketch after this list);
    • I imagine this approach would make the KG queries very complicated, and the complexity would have to be repeated in all the queries touching dataset entities;
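
A hedged sketch of how such a query-only approach might look, assuming dataset parts hang off schema:hasPart and file invalidation is already expressed with prov:wasInvalidatedBy; both predicates are assumptions about the data model, and the same filter would have to be repeated in every query touching dataset entities:

    PREFIX schema: <http://schema.org/>
    PREFIX prov:   <http://www.w3.org/ns/prov#>

    # Keep a dataset only while at least one of its parts (files) is still valid,
    # i.e. the dataset disappears from results once every part has been invalidated.
    SELECT DISTINCT ?dataset
    WHERE {
      ?dataset a schema:Dataset ;
               schema:hasPart ?part .
      FILTER NOT EXISTS { ?part prov:wasInvalidatedBy ?anyActivity }
    }
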
jachro added this to the sprint-2020-03-27 milestone Apr 30, 2020

ciyer commented May 6, 2020

Also need to consider cases where a project is deleted; there is no .renku/datasets folder to refer to anymore.

jachro (Contributor, Author) commented May 6, 2020

@ciyer there's a separate issue to deal with project removals.
