Mitigate Google rate-limiting for Scholar polling #1237

@boonebgorges

We have a mechanism that allows subscription to Google Scholar URLs. We then use our regular feed-fetching tools to poll the Scholar URL on a schedule and parse the contents.

However, Google is somewhat aggressive about identifying "bot" traffic. As such, we should introduce mitigations that prevent an IP/client from getting blocked. A few initial ideas that come to mind:

  1. Automatic polling/fetching for Scholar feeds should be limited. Instead of our default interval, perhaps once daily would be sufficient.
  2. Keep track of Google Scholar pings from the WP installation within a given period, and cap them. So, perhaps, only one or two per hour, or perhaps 10 per day, or some other limiting/spacing mechanism.
  3. We have a manual button, 'Refresh Feed Items', which should also respect these limits.
  4. Don't want to be scummy, but if there's something we could set in our request headers, like a user agent string indicating that we're not a terrible bot, it would be worth exploring.
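
To make ideas 1 to 3 concrete, here is a rough, language-agnostic sketch in Python of one possible limiter: a sliding-window cap combined with a minimum spacing between requests, shared by scheduled polling and the manual refresh. The names `ScholarRateLimiter`, `try_acquire`, and `refresh_scholar_feed`, and the default numbers, are hypothetical and not part of the existing codebase. The state here lives in memory only; a real version would need to persist timestamps across requests somewhere in the WP installation.

```python
import time
from collections import deque


class ScholarRateLimiter:
    """Hypothetical helper: cap requests per window and enforce minimum spacing."""

    def __init__(self, max_requests=10, window_seconds=86400,
                 min_spacing_seconds=3600, clock=time.time):
        self.max_requests = max_requests                # e.g. 10 per day
        self.window_seconds = window_seconds            # length of the counting window
        self.min_spacing_seconds = min_spacing_seconds  # e.g. at most one per hour
        self.clock = clock                              # injectable for testing
        self._timestamps = deque()                      # times of allowed requests

    def try_acquire(self):
        """Return True and record the request if allowed, otherwise return False."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        # Enforce the cap for the current window.
        if len(self._timestamps) >= self.max_requests:
            return False
        # Enforce spacing relative to the most recent allowed request.
        if self._timestamps and now - self._timestamps[-1] < self.min_spacing_seconds:
            return False
        self._timestamps.append(now)
        return True


def refresh_scholar_feed(limiter, fetch):
    """Single entry point for both scheduled polling and 'Refresh Feed Items'."""
    if not limiter.try_acquire():
        return None  # Over the limit: skip the fetch instead of hitting Google.
    return fetch()
```

Routing the manual button through the same `refresh_scholar_feed` call as the scheduled job is what makes it respect the limits; the button could report "try again later" when `None` comes back.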

Happy to hear (or see) other suggestions for best practices.
