Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DS-3673] First try: get latest COUNTER robots list release #2892

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

mwoodiupui
Copy link
Member

@mwoodiupui mwoodiupui commented Jul 20, 2020

References

Fixes #7020

Description

Command line gadget to get a plain-text list of robot agent patterns from the latest release by COUNTER. This is not a complete response to DS-3673 but it is a start.

Instructions for Reviewers

New feature.

List of changes in this PR:

  • New class org.dspace.app.statistics.GetGithubRelease, a general-purpose gadget to fetch the latest release of any Github project and either save the Zip archive as-is or unpack one or more files from it.
  • Launcher configuration specific to the COUNTER-Robots project.

Include guidance for how to test or review your PR. This may include: steps to reproduce a bug, screenshots or description of a new feature, or reasons behind specific changes.

Checklist

  • My PR is small in size (e.g. less than 1,000 lines of code, not including comments & integration tests). Exceptions may be made if previously agreed upon.
  • My PR passes Checkstyle validation based on the Code Style Guide.
  • My PR includes Javadoc for all new (or modified) public methods and classes. It also includes Javadoc for large or complex private methods. (Except main(), because we all know what main() does.)
  • My PR passes all tests and includes new/updated Unit or Integration Tests based on the Code Testing Guide.
  • If my PR includes new, third-party dependencies (in any pom.xml), I've made sure their licenses align with the DSpace BSD License based on the Licensing of Contributions documentation.
  • If my PR modifies the REST API, I've linked to the REST Contract page (or open PR) related to this change.

@mwoodiupui mwoodiupui added new feature work in progress PR is still being worked on & is not currently ready for review quick win Pull request is small in size & should be easy to review and/or merge port to 6.x component: statistics Related to Statistics (Solr or Google Analytics) interface: command-line DSpace CLI (command-line interface) configuration Related to out-of-the-box configuration labels Jul 20, 2020
@mwoodiupui
Copy link
Member Author

Once again I'm unsure how to do integration tests for command line utilities that don't affect the model.

@lgtm-com
Copy link

lgtm-com bot commented Jul 20, 2020

This pull request introduces 1 alert when merging 27a7251 into 6b3bd4d - view on LGTM.com

new alerts:

  • 1 for Potential input resource leak

@tdonohue
Copy link
Member

@mwoodiupui : If you are willing to use the new Scripts & Processes framework (instead of the old launcher.xml), then you could run the script directly in Integration tests using this runDSpaceScript() method:
https://github.com/DSpace/DSpace/blob/main/dspace-server-webapp/src/test/java/org/dspace/app/rest/test/AbstractIntegrationTestWithDatabase.java#L232

...and you could write ITs at the REST API layer (as any script of this type is runnable either via the REST API or via CLI...whereas launcher.xml is CLI only).

We've only moved a handful of scripts from launcher.xml to Scripts & Processes so far, but they are all listed here: https://github.com/DSpace/DSpace/blob/main/dspace/config/spring/api/scripts.xml

So, you could use those as examples. If you look closely, the index-discovery script no longer appears in launcher.xml, but it will respond on the commandline as normal (./dspace index-discovery)...or you can now run it from the Admin UI / REST API (see the example "Processes Admin UI" video in the beta 3 announcement; https://lyrasisnow.org/now-available-dspace-7-0-beta-3/)

@lgtm-com
Copy link

lgtm-com bot commented May 5, 2022

This pull request introduces 1 alert when merging 10771e8 into a235551 - view on LGTM.com

new alerts:

  • 1 for Potential input resource leak

@mwoodiupui
Copy link
Member Author

I've turned it into a "script" and added a bit of testing. But I wonder whether this really should be accessible from the UI. It's really meant to be run regularly by the system scheduler to keep a robot list updated. Its only function is to grab a release from Github and stash it on the server's filesystem. Shouldn't this really be command-line-only?

Converting it was instructive. Now I need to find out how a "script" is supposed to return errors or status.

@github-actions github-actions bot added the merge conflict PR has a merge conflict that needs resolution label Jun 16, 2022
@github-actions github-actions bot removed the merge conflict PR has a merge conflict that needs resolution label Jan 23, 2023
@mwoodiupui mwoodiupui removed the work in progress PR is still being worked on & is not currently ready for review label Jan 24, 2023
@mwoodiupui mwoodiupui self-assigned this Jan 24, 2023
@github-actions github-actions bot added the merge conflict PR has a merge conflict that needs resolution label Feb 13, 2023
@github-actions
Copy link

Hi @mwoodiupui,
Conflicts have been detected against the base branch.
Please resolve these conflicts as soon as you can. Thanks!


This message is automatically generated by prince-chrismc/label-merge-conflicts-action so don't hesitate to report issues/improvements there.

@github-actions github-actions bot removed the merge conflict PR has a merge conflict that needs resolution label Feb 14, 2023
@mwoodiupui mwoodiupui added this to the 7.6 milestone Feb 14, 2023
@tdonohue tdonohue removed this from the 7.6 milestone Feb 23, 2023
@tdonohue
Copy link
Member

Per decision in DSpace Steering (and as announced in last DSpace DevMtg), I'm moving this to the 8.0 board. Steering has limited 7.6 to only features which provide "feature parity/compatibility" with 6.x. Even though this will move to the 8.0 board, our goal is to move all existing pre-8.0 features forward as soon as 7.6 development is finishing (as 7.6 will be the last 7.x release allowing new features).

@github-actions
Copy link

github-actions bot commented Jun 8, 2023

Hi @mwoodiupui,
Conflicts have been detected against the base branch.
Please resolve these conflicts as soon as you can. Thanks!

@github-actions github-actions bot added merge conflict PR has a merge conflict that needs resolution and removed merge conflict PR has a merge conflict that needs resolution labels Jun 8, 2023
@github-actions
Copy link

Hi @mwoodiupui,
Conflicts have been detected against the base branch.
Please resolve these conflicts as soon as you can. Thanks!

@github-actions github-actions bot added the merge conflict PR has a merge conflict that needs resolution label Aug 23, 2023
@github-actions github-actions bot removed the merge conflict PR has a merge conflict that needs resolution label Aug 31, 2023
Copy link
Member

@tdonohue tdonohue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mwoodiupui : Apologies for the very long delay in feedback on this PR. I'm just revisiting it today, and I'm finding that I cannot get this tool to work (or maybe don't know how to use it?).

I'm trying this from commandline:
./dspace get-counter-robots

And I get this error:

Exception in thread "main" java.lang.NullPointerException
        at org.dspace.app.launcher.ScriptLauncher.runOneCommand(ScriptLauncher.java:293)
        at org.dspace.app.launcher.ScriptLauncher.handleScript(ScriptLauncher.java:134)
        at org.dspace.app.launcher.ScriptLauncher.main(ScriptLauncher.java:99)

I also agree with you that I don't think the "get-github-release" tool is useful for the Admin User interface. It seems like an extremely advanced tool, that might only be used by other tools, but never appear even in the list of available tools (e.g. when running ./dspace you get a list of every tool).

In other words, I'm not sure it's useful for any use case other than this get-counter-robots script at this time. However, I do see the potential that other tools can use it if needed.

Beyond that, a few other minor notes inline below.

// Identify the latest release.
String owner = commandLine.getOptionValue(GetGithubReleaseOptions.OPT_OWNER);
String repo = commandLine.getOptionValue(GetGithubReleaseOptions.OPT_REPO);
URL releaseConnection = new URL(String.format("https://api.github.com/repos/%s/%s/releases/latest",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to blindly always gather the latest version? Or should we really just be taking in a URL here as a parameter... that way it's possible to use a specific tagged version of the files (which could be useful if one version of DSpace supports a different file format than another)

If you feel this is the best approach, I'm OK with it. I just wanted to point out that blindly taking the "latest" can mean that you might try to download an update for DSpace 7 only to find that the new "latest" version is only compatible with DSpace 8 (as an example)... I don't know if this would ever happen, but wanted to point it out.

@@ -96,6 +96,19 @@
<class>org.dspace.app.sitemap.GenerateSitemaps</class>
</step>
</command>
<command>
<name>get-counter-robots</name>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be named "update-counter-robots"?

I also wonder if this should install the file in the correct location? Personally, I wouldn't know where to place this file. Does it go under [dspace]/spiders/ or [dspace]/spiders/agents?

If it doesn't put the file in the correct location, then we need to provide clear documentation on how to install this file. Ideally, though, if we want people to be able to run this via a cron job, it'd be nice to have it install the file in the proper place.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called "get-" rather than "update-" because it doesn't update; it gets, and lets you decide whether you want to update.

It never occurred to me that someone would trust a cron job to do the updating. YOU decide to run it because you have learned that there's a new release from COUNTER. YOU inspect the file and decide whether it's suitable. YOU install it. This seems to be what was asked for in #7020.

Maybe we need to have the discussion that Bram suggested, before finishing this up? People seem to have very different ideas of how to consume the COUNTER robots list.

Copy link
Member

@tdonohue tdonohue Feb 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mwoodiupui : Yes, it sounds like we may need to discuss the goals of this PR further.

My assumption was that we'd want this to be as easy as possible to help sites keep this list updated at all times. So, I was assuming it'd be as simply as a scheduled cron job that you could run to update it once a week/month/year (if changed) or similar.

Your approach is much more "manual", which I'm worried means it's not much of an improvement over the current process. An Admin could just go to this site and download the latest file quickly to inspect it & install it. This script currently seems to only help them in the download process (which could also be done via a wget command).

So, I think you are correct that our two goals for this script don't currently align. I'd rather have an update script here, and I'm questioning whether a simple "get" script is useful when we could just tell people to run something like this:

wget https://github.com/atmire/COUNTER-Robots/blob/master/generated/COUNTER_Robots_list.txt

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that that wget does the same thing: does it not fetch the latest checkin, rather than the latest release? A good deal of the code here is to deal with the structure of a Github release package, which is a bit of a pain. If I were going to turn a cron job loose to update a production site, I wouldn't want to be installing random checkins.

It's been suggested more than once right here in these comments that we might want to exercise judgment rather than blindly taking the latest release. I agree with that.

I would point out that there is no current process for updating from COUNTER. That was the meat of the Issue. Our current process for keeping the bot patterns current is that from time to time someone gets worried that the patterns are very stale and manually updates them from somewhere.

One use that I envision for this tool (or a manual procedure) is as part of the DSpace release process: at some point before release, developers would update to the most recent version of the COUNTER list that they find acceptable, and put it in the official sources.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per today's DevMtg, we discussed leaving this feature as-is... just a script to get-counter-robots. It's clear there seems to be two use cases:

  1. Some institutions prefer just this get script as it lets them review the update and apply when they are ready
  2. Some institutions prefer an automated update. But, we may be able to achieve that by providing better documentation around the proper cp command to use to copy the file into the correct location.
    • If somehow a cp command isn't good enough, we could create a separate script to install it to the correct location. But, for now, I'm assuming a cp command is all we may need.

@alanorth
Copy link
Contributor

alanorth commented Jan 22, 2024

I'm also wondering if this should always get the latest release—it's easy to imagine an upstream release that might contain invalid or undesirable patterns, or maybe they change the format or whatever that we might not want to consume automatically. Maybe we could also have the script take a parameter specifying a release tag to download?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: statistics Related to Statistics (Solr or Google Analytics) configuration Related to out-of-the-box configuration interface: command-line DSpace CLI (command-line interface) new feature quick win Pull request is small in size & should be easy to review and/or merge
Projects
Status: 👀 Under Review
Development

Successfully merging this pull request may close these issues.

[DS-3673] Robots/Crawlers: Pull latest botlist from COUNTER github instead of managing our own list
3 participants