Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DS-3763: Scripts and processes endpoint #17

Merged

Conversation

tomdesair
Copy link
Contributor

This is a contract proposal for https://jira.duraspace.org/browse/DS-3763

DSpace offers functionality like import and export of metadata and items, running curation tasks, running an OAI harvest from the web UI. While we can implement each of these features as a dedicated endpoint, I think it would require the same effort to implement one generic endpoint that is able to run (REST-enabled) DSpace CLI scripts for the REST API.

I'm very interested to hear your thoughts on this.

@tdonohue
Copy link
Member

Hi @tomdesair : while I like this idea overall, I worry it's a possible extension of the scope of DSpace 7. We haven't made any promises in DSpace 7 to make CLI tools available from the Admin UI or REST API.

That said, I do think it'd be nice to kick off administrative scripts via REST API and/or an Admin UI in Angular. I just worry this effort may require a significant refactor to the existing DSpace CLI, unless it is scoped more to kicking off Curation Tasks (which does need to occur from the Admin UI, as it exists in DSpace 6 in that same way).

Overall, I'd like to better understand what changes are being proposed at the Java API layer. I'd rather not undertake another significant Java API refactor during the DSpace 7 efforts, as our time is better spent on getting an initial DSpace 7 release out the door.

@tomdesair
Copy link
Contributor Author

We would only need a small refactoring in some DSpace API script classes so that they can list their parameters in a standardised way. The actual parameters or the way a script is called from the CLI will not change.

We will also not modify the actual script implementations. The idea is that starting a script from the REST API will result in an external process (in the same way if you would have invoked the script from the CLI). This has the advantage that the script is not executed within Tomcat and will not crash Tomcat if it fails (which can happen in DSpace 6 if you start, for example, an export of your full repository).

I agree that we should not support all scripts right away. I think we should focus first on the following cases which are also available in DSpace 6:

  • import
  • export
  • metadata-import
  • metadata-export
  • harvest

So we will only modify those classes, but it will not be a full refactor. That's why I included this in the contract description:

This endpoint will list all (REST supported) scripts defined in dspace/config/launcher.xml.

I think the effort for supporting these features in 5 dedicated endpoints will be similar to creating 1 generic endpoint. In later versions of DSpace 7, we can then add support for more CLI scripts.

@abollini
Copy link
Member

I'm in favor of a generic endpoint able to execute pluggable processes. IMHO these should be the curation tasks and few scripts that if were developed after the creation of the curation framework would be done as curation tasks ;)
That said, I'm not sure about the list suggested by @tomdesair as I don't think that import/export are good example of curation task (also in a bit expanded form).
Import and export are so important in DSpace that should be managed as special case, moreover the exports should be just driven using content negotiation over enabled endpoints. See also https://jira.duraspace.org/browse/DS-3722

Copy link
Member

@abollini abollini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other than the formal changes to the contract please review also my general comment to decide if we want to proceed on this maybe limiting it at the start to the current curation tasks

scripts-endpoint.md Outdated Show resolved Hide resolved
scripts-endpoint.md Outdated Show resolved Hide resolved
processes-endpoint.md Outdated Show resolved Hide resolved
```

## Execution File Output
**GET /api/admin/processes/<:process-id>/file**
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can have a /files that list all the existent file to download? or we can flat the files at the upper level and use the file name as rel name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean /api/admin/processes/<:process-id>/files and /api/admin/processes/<:process-id>/files/<:file-name>?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, or we can simplify in
/api/admin/processes/<:process-id>/<:file-name> assuming (ie adding the convention) that no scripts will use output as the name of one of the generated file

scripts-endpoint.md Outdated Show resolved Hide resolved
* `string`: A regular string value
* `date`: A string that has a valid date pattern
* `boolean`: true or false
* `file`: For parameters with this type, the user has to provide a file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should specify how file as supplied. Obviously one option is to provide them in a multipart POST request using the parameter as name of the file but we want to document that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll document the multipart post instructions. The ultimate goal for me would be to have file upload support using something like the tus protocol. Because then people can upload for example really large imports and datasets. But that's probably something for DSpace 7.1 or DSpace 8.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will like to have support for resumable upload but it is probably out-of-scope for dspace 7. Once introduced it should be added to all the upload endpoints (including the submission)

processes-endpoint.md Outdated Show resolved Hide resolved
@tdonohue
Copy link
Member

This was discussed a bit in today's DSpace 7 Mtg. Just noting that I'm withdrawing my concern about expansion of scope.

Based on discussion with @artlowel and @abollini, it sounds like the primary goal of this work is to enable Curation Tasks at the REST API layer, but that @tomdesair is being careful to design this in a way that it could also apply to other CLI commands (in launcher.xml).

That sounds like a good approach to me. I'd just simply recommend that, in terms of the implementation, we should not spend too much time enabling all CLI commands. A few is fine, but others can be implemented/refactored later (possibly even post DSpace 7).

@terrywbrady
Copy link
Contributor

I like the capabilities that are suggested here. I have had a long time desire to extend the curation system so that it could (1)take task appropriate parameters (2)persist output.

  • If that were possible, how many of our existing CLI tools could become curation tasks?
  • How many admin features could be written as curation tasks?
  • Could curation tasks become our back-end plugin mechanism?

Looking at the proposed scripts endpoint, could the proposed GET/POST methods described there also apply to any/all curation tasks?

A few questions about output files

  • Will output files have an expiration policy?
  • When we do choose to persist an output file, what are the restrictions on that output file? Does the script/task need to set an authorization policy for the output file?
  • In many instances, it will be useful for a task to generate JSON or HTML based on the action performed. Note the following
    • In the context of a rest client, a JSON return object would likely be preferable.
    • In the context of an admin interface, an HTML return object (with links) would likely be preferable.
    • In the context of a CLI client, a TEXT return object will likely be preferable. This is currently the only option for curation tasks.

Updates based on community feedback and support for features such as export to ZIP or export to CSV
Copy link
Member

@benbosman benbosman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functionality from this pull request is useful for features such as the export to ZIP and export to CSV we're currently working on.
For that reason, I've revived the pull request and:

  • solved the problems mentioned by the reviewers
  • reduced the scope to ensure the initial implementation doesn't create too much work
  • ensured it supports all data required to use it for features which are already supported in DSpace 6 from the UI (export to CSV, export to ZIP, curation tasks, …)

@mwoodiupui
Copy link
Member

mwoodiupui commented Nov 14, 2018

Concerning access to curation tasks: please be aware of DSpace/DSpace#2180 and DSpace/DSpace#2181 which are pending.

UPDATE: these were merged to master.

@tomdesair
Copy link
Contributor Author

Thx @benbosman !

processes-endpoint.md Outdated Show resolved Hide resolved
@abollini
Copy link
Member

I'm confused about the status of this PR and the ongoing implementation work. @benbosman are your working on implementing it? do you plan to use this contract to support the curation task system from the UI or just to implement the export feature?

I'm not sure about how to use this contract for the curation tasks UI. It is true that a curation task can be executed as a script but I don't expect to have all the curation tasks listed in the script-launcher.xml
Moreover an endpoint to retrieve all the available curation tasks is needed to provide the "dropdown" in the UI and an uniform way to specify the target object (from the command line is the handle) is also useful to check the permission as also community or collection administrators can run curation tasks over their items.

@benbosman
Copy link
Member

This PR is waiting for an approval since all feedback was processed. Once approved, the development can start.

Specifically for curation tasks, more work is indeed required to also create a list of the available tasks.
The goal of this functionality is to have a uniform REST solution for command line scripts, but the UI sections using it will still be designed specifically for that goal. The UI for e.g. importing a CSV is completely different from the UI for e.g. starting a curation task.
For most such tasks, this framework is all that's required in REST, but the curation tasks are an exception here.

@abollini
Copy link
Member

abollini commented Jan 2, 2019

The contract looks fine to me in regard of its purpose (execute administrative script). Please note that I'm not convinced that it is useful for the immediate tasks to be done in the UI (import / export / curation tasks) as I'm still convinced that for such features dedicated endpoints/strategies need to be used.

More precisely, to export something I suggest to use a simple GET request to the main target of the export with an appropriate Content-Type, i.e. we should use content-negotiation for the export.
In a similar way we can manage also the import using POST and content negotiation.

For the curation tasks as already stated we will need the ability to know which tasks are available. Of course we can expose such list as a controlled list for the t parameter of the org.dspace.curate.CurationCli script but this will make a bit more complicated the implementation of the scripts endpoint and I also worried about the effort required to refactor the scripts to expose their parameters (and allowed values, see the -t). Moreover, the curation tasks UI also need to be able to add a task to an execution queue instead than run it immediately, for all these reasons I feel more comfortable if we start a dedicated contract for a curationtasks endpoint.

Finally, for the implementation of the scripts and processes endpoints keep in mind the global goal to keep the REST webapp stateless and horizontal scalable, this mean that we should store the running processes in a shareable, temporary-persistent storage

@benbosman
Copy link
Member

Thanks for approving it, to be used at least for executing administrative scripts.

I think the discussion on whether this will be used for import, export, curation tasks would fit better in the DSpace meetings.

The solution will indeed be implemented to ensure the webapp remains stateless and horizontal scalable.

@tdonohue
Copy link
Member

tdonohue commented Jan 7, 2019

I'd like to discuss this further in our DSpace 7 meeting this week. I'm still concerned about possibly extending the scope of DSpace 7 in this endpoint. I agree that we need an endpoint for Curation Tasks, and an endpoint for Administrative import / export (of Zips).

However, this endpoint as described seems to involve functionality that isn't in DSpace (yet)... namely the entire /api/system/processes endpoint, which seems to imply a way to watch/monitor ongoing scripts/processes (and assigning each running process a UUID). As far as I'm aware, the DSpace backend doesn't currently support this concept. We've also made no promises for such a feature in DSpace 7.

So, I'd like to better understand the implementation plans/proposal for this endpoint, so that we can ensure we are not expanding our scope.

processes-endpoint.md Outdated Show resolved Hide resolved
@mwoodiupui
Copy link
Member

DSpace 6 does not export items or metadata through the GUIs, so far as I can see. That would not make much sense, since you'd have to go to the server anyway to collect the output.

@AlexanderS
Copy link
Contributor

@mwoodiupui DSpace 6 XMLUI has a feature to export the metadata of the search results (or all items in a collection/community) as CSV and items in a collection or community can be exported as ZIP files. (Both may be enabled only for admins.)

@mwoodiupui
Copy link
Member

I will add my voice to those asserting that curation tasks and CLI tools need separate treatment.
I also agree that, rather than inventing a new mechanism, it makes more sense to be able to request "give me this site/community/collection as application/zip+dspace-saf" or the like.

@mwoodiupui
Copy link
Member

Where is this export facility? Do we document it somewhere?
I didn't find any mention in the manual. I looked all through the administrative menu.

Copy link
Member

@tdonohue tdonohue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benbosman : Thanks for the updates and continued work here. Overall, this looks good. But, it looks like there is one outstanding request (that I agree with, see below) that was never addressed. It's a minor fix though, so I'll go ahead and give this a +1.

processes-endpoint.md Outdated Show resolved Hide resolved
@benbosman benbosman merged commit d035dd2 into DSpace:master Sep 26, 2019
@benbosman benbosman deleted the DS-3763_Scripts-and-processes-endpoint branch March 6, 2020 09:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants