lookups using google scholar, semantic scholar, etc. #37

Open
jeremymanning opened this issue Dec 23, 2020 · 0 comments

I've tried using the following packages to interface with Google Scholar:

  • scholarly
  • gscholar
  • mechanize (via a simulated browser)

I can get each of these to return valid information for a small number of queries. However, when I submit many queries (I'm not sure of the precise number: 20? 50? 100?) I start seeing HTTP 429 errors ("Too Many Requests"). It seems that the Google Scholar backend limits the number of queries per day (or possibly the rate) that can come from a single browser, IP address, or user (I'm not sure how the limit is parameterized).
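If the limit turns out to be rate-based rather than a daily quota, backing off between retries might help; it will not help against a hard per-day cap. The sketch below is a generic retry wrapper, not anything the packages above provide: `RateLimited`, `fetch_with_backoff`, and the `fetch` callable are all hypothetical names, and `fetch` is assumed to raise `RateLimited` whenever the server answers with a 429.

```python
import time


class RateLimited(Exception):
    """Hypothetical signal: the fetch function saw an HTTP 429 response."""


def fetch_with_backoff(fetch, query, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(query), waiting exponentially longer after each 429.

    Returns the fetch result, or None if every attempt was rate limited.
    The sleep function is injectable so the logic can be tested without waiting.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(query)
        except RateLimited:
            if attempt == max_retries:
                return None  # give up; the limit may be a daily quota
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return None
```

Whatever issues the actual request (scholarly, gscholar, or mechanize) would need a thin wrapper that translates its 429 handling into `RateLimited`.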

This seems to make it impossible (or at least "non-trivial") to use Google Scholar to verify and/or look up bibliographic information.

I've also tried using the semanticscholar package to interface with Semantic Scholar. Unfortunately, the Semantic Scholar API requires knowing the DOI, author ID, or Semantic Scholar paper ID, none of which I have for most papers. Google Scholar does support DOI lookups, but that doesn't help (if I could reliably access Google Scholar we wouldn't need Semantic Scholar!). I also tried submitting requests to Crossref (using the mechanize package to simulate browser requests, and then regular expressions to parse out DOIs), but the results were highly unreliable (only a very small proportion of queries seemed to return useful information).
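One alternative to scraping Crossref through a simulated browser is its public REST API, which accepts a free-text citation via the `query.bibliographic` parameter on the `/works` endpoint and returns JSON. The following is a minimal sketch under that assumption, not the approach tried in the notebook: the helper names and the `User-Agent` string are made up for illustration, and how well the top match corresponds to the intended paper is untested here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(citation, rows=1):
    """Build a Crossref /works URL for a free-text bibliographic query."""
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urlencode(params)


def extract_dois(payload):
    """Pull the DOI of each returned item out of a decoded JSON response."""
    items = payload.get("message", {}).get("items", [])
    return [item["DOI"] for item in items if "DOI" in item]


def lookup_dois(citation, rows=1, timeout=10.0):
    """Query Crossref over the network and return candidate DOIs, best match first."""
    # The User-Agent value is a placeholder for this sketch.
    request = Request(build_query_url(citation, rows),
                      headers={"User-Agent": "bibcheck-sketch/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return extract_dois(json.load(response))
```

Because the response is structured JSON, no regular expressions are needed to find the DOI; any DOI returned this way should still be treated as a candidate to verify rather than a confirmed match.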

So: I'm stumped. Until I can figure out a way forward (e.g. a way around Google Scholar's limits, a way to look up information via Semantic Scholar, and/or another reliable source for bibliographic information), I'm going to remove bibliographic lookups from the BibTeX checker code. My (broken) attempts can be found in the notebook (dev folder) of this commit.

The main issue I was trying to solve was that some of the page numbers are either self-inconsistent or invalid (e.g. the given page range doesn't make sense, like starting from a high number and going to a low number, or mixing alphabetic and numeric characters in ways that seem suspect). I'm going to implement some heuristics for cleaning up those sorts of issues (to the extent that I can reliably detect them), and for now I'll ignore the likelihood that some bibliographic information may be entered incorrectly.
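A rough sketch of what such heuristics might look like follows. It is illustrative only: `check_pages` is a hypothetical name, and it assumes a BibTeX `pages` value is either a single page or two pages joined by one or more hyphens or an en dash. It only flags suspect values; it does not try to repair them.

```python
import re


def check_pages(pages):
    """Return a list of human-readable problems found in a BibTeX pages value."""
    issues = []
    # Split on runs of hyphens or en dashes, e.g. "123--145" -> ["123", "145"].
    parts = re.split(r"\s*[-\u2013]+\s*", pages.strip())
    if len(parts) > 2:
        return ["more than one range separator"]
    for part in parts:
        if not part:
            issues.append("empty page number")
        elif not part.isdigit():
            # Letters are sometimes legitimate (e.g. supplement pages), so this is only a flag.
            issues.append(f"non-numeric page {part!r}")
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        if int(parts[0]) > int(parts[1]):
            issues.append("start page is greater than end page")
    return issues
```

For example, `check_pages("145--123")` reports a descending range and `check_pages("12a--20")` reports the mixed alphanumeric page. One known limitation: abbreviated ranges such as `123-45` (meaning 123 to 145) would also be flagged as descending.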
