New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DS-3673] First try: get latest COUNTER robots list release #2892
base: main
Are you sure you want to change the base?
Conversation
Once again I'm unsure how to do integration tests for command line utilities that don't affect the model. |
This pull request introduces 1 alert when merging 27a7251 into 6b3bd4d - view on LGTM.com new alerts:
|
@mwoodiupui : If you are willing to use the new Scripts & Processes framework (instead of the old ...and you could write ITs at the REST API layer (as any script of this type is runnable either via the REST API or via CLI...whereas We've only moved a handful of scripts from So, you could use those as examples. If you look closely, the |
This pull request introduces 1 alert when merging 10771e8 into a235551 - view on LGTM.com new alerts:
|
I've turned it into a "script" and added a bit of testing. But I wonder whether this really should be accessible from the UI. It's really meant to be run regularly by the system scheduler to keep a robot list updated. Its only function is to grab a release from Github and stash it on the server's filesystem. Shouldn't this really be command-line-only? Converting it was instructive. Now I need to find out how a "script" is supposed to return errors or status. |
Hi @mwoodiupui, This message is automatically generated by prince-chrismc/label-merge-conflicts-action so don't hesitate to report issues/improvements there. |
Per decision in DSpace Steering (and as announced in last DSpace DevMtg), I'm moving this to the 8.0 board. Steering has limited 7.6 to only features which provide "feature parity/compatibility" with 6.x. Even though this will move to the 8.0 board, our goal is to move all existing pre-8.0 features forward as soon as 7.6 development is finishing (as 7.6 will be the last 7.x release allowing new features). |
Hi @mwoodiupui, |
Hi @mwoodiupui, |
… the "script" idiom.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mwoodiupui : Apologies for the very long delay in feedback on this PR. I'm just revisiting it today, and I'm finding that I cannot get this tool to work (or maybe don't know how to use it?).
I'm trying this from commandline:
./dspace get-counter-robots
And I get this error:
Exception in thread "main" java.lang.NullPointerException
at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:293)
at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:134)
at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:99)
I also agree with you that I don't think the "get-github-release" tool is useful for the Admin User interface. It seems like an extremely advanced tool, that might only be used by other tools, but never appear even in the list of available tools (e.g. when running ./dspace
you get a list of every tool).
In other words, I'm not sure it's useful for any use case other than this get-counter-robots
script at this time. However, I do see the potential that other tools can use it if needed.
Beyond that, a few other minor notes inline below.
// Identify the latest release. | ||
String owner = commandLine.getOptionValue(GetGithubReleaseOptions.OPT_OWNER); | ||
String repo = commandLine.getOptionValue(GetGithubReleaseOptions.OPT_REPO); | ||
URL releaseConnection = new URL(String.format("https://api.github.com/repos/%s/%s/releases/latest", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to blindly always gather the latest version? Or should we really just be taking in a URL here as a parameter... that way it's possible to use a specific tagged version of the files (which could be useful if one version of DSpace supports a different file format than another)
If you feel this is the best approach, I'm OK with it. I just wanted to point out that blindly taking the "latest" can mean that you might try to download an update for DSpace 7 only to find that the new "latest" version is only compatible with DSpace 8 (as an example)... I don't know if this would ever happen, but wanted to point it out.
@@ -96,6 +96,19 @@ | |||
<class>org.dspace.app.sitemap.GenerateSitemaps</class> | |||
</step> | |||
</command> | |||
<command> | |||
<name>get-counter-robots</name> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be named "update-counter-robots"?
I also wonder if this should install the file in the correct location? Personally, I wouldn't know where to place this file. Does it go under [dspace]/spiders/
or [dspace]/spiders/agents
?
If it doesn't put the file in the correct location, then we need to provide clear documentation on how to install this file. Ideally, though, if we want people to be able to run this via a cron job, it'd be nice to have it install the file in the proper place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's called "get-" rather than "update-" because it doesn't update; it gets, and lets you decide whether you want to update.
It never occurred to me that someone would trust a cron job to do the updating. YOU decide to run it because you have learned that there's a new release from COUNTER. YOU inspect the file and decide whether it's suitable. YOU install it. This seems to be what was asked for in #7020.
Maybe we need to have the discussion that Bram suggested, before finishing this up? People seem to have very different ideas of how to consume the COUNTER robots list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mwoodiupui : Yes, it sounds like we may need to discuss the goals of this PR further.
My assumption was that we'd want this to be as easy as possible to help sites keep this list updated at all times. So, I was assuming it'd be as simply as a scheduled cron job that you could run to update it once a week/month/year (if changed) or similar.
Your approach is much more "manual", which I'm worried means it's not much of an improvement over the current process. An Admin could just go to this site and download the latest file quickly to inspect it & install it. This script currently seems to only help them in the download process (which could also be done via a wget
command).
So, I think you are correct that our two goals for this script don't currently align. I'd rather have an update script here, and I'm questioning whether a simple "get" script is useful when we could just tell people to run something like this:
wget https://github.com/atmire/COUNTER-Robots/blob/master/generated/COUNTER_Robots_list.txt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that that wget
does the same thing: does it not fetch the latest checkin, rather than the latest release? A good deal of the code here is to deal with the structure of a Github release package, which is a bit of a pain. If I were going to turn a cron job loose to update a production site, I wouldn't want to be installing random checkins.
It's been suggested more than once right here in these comments that we might want to exercise judgment rather than blindly taking the latest release. I agree with that.
I would point out that there is no current process for updating from COUNTER. That was the meat of the Issue. Our current process for keeping the bot patterns current is that from time to time someone gets worried that the patterns are very stale and manually updates them from somewhere.
One use that I envision for this tool (or a manual procedure) is as part of the DSpace release process: at some point before release, developers would update to the most recent version of the COUNTER list that they find acceptable, and put it in the official sources.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per today's DevMtg, we discussed leaving this feature as-is... just a script to get-counter-robots
. It's clear there seems to be two use cases:
- Some institutions prefer just this
get
script as it lets them review the update and apply when they are ready - Some institutions prefer an automated update. But, we may be able to achieve that by providing better documentation around the proper
cp
command to use to copy the file into the correct location.- If somehow a
cp
command isn't good enough, we could create a separate script to install it to the correct location. But, for now, I'm assuming acp
command is all we may need.
- If somehow a
I'm also wondering if this should always get the latest release—it's easy to imagine an upstream release that might contain invalid or undesirable patterns, or maybe they change the format or whatever that we might not want to consume automatically. Maybe we could also have the script take a parameter specifying a release tag to download? |
References
Fixes #7020
Description
Command line gadget to get a plain-text list of robot agent patterns from the latest release by COUNTER. This is not a complete response to DS-3673 but it is a start.
Instructions for Reviewers
New feature.
List of changes in this PR:
org.dspace.app.statistics.GetGithubRelease
, a general-purpose gadget to fetch the latest release of any Github project and either save the Zip archive as-is or unpack one or more files from it.Include guidance for how to test or review your PR. This may include: steps to reproduce a bug, screenshots or description of a new feature, or reasons behind specific changes.
Checklist
main()
, because we all know whatmain()
does.)pom.xml
), I've made sure their licenses align with the DSpace BSD License based on the Licensing of Contributions documentation.