arXiv API feed contains less data than page size - but getpapers starts new query with the next start parameter #177

sedimentation-fault opened this issue on Aug 27, 2019

Problem

To see this problem, it is advisable to turn on debug mode in getpapers:

category='math.AG'; start_date='20170101'; end_date='20190827'; getpapers --api 'arxiv' --query "cat:$category AND lastUpdatedDate:[${start_date}* TO ${end_date}*] " --outdir "$category" -p -l debug

You don't have to download the papers - just watch the query phase at the start and interrupt the run once that phase finishes. Depending on the whims of arxiv.org, you will see messages telling you

Got 500 results in this page
...
Got 500 results in this page
...
Got 500 results in this page

or:

Malformed response from arXiv API - no data in feed

If you look at the URL printed, and pay attention to its start parameter, you will see it increasing in steps of your page size (here: 500) after each message:

...&start=0...
...&start=500...
...&start=1000...

All is well as long as things follow the normal path of operation. But sometimes I see messages like:

Got 200 results in this page

while another 500 is added to start for the next query! This is a bug.
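
To make the data loss concrete, here is a minimal sketch (plain Node.js with made-up numbers, NOT the actual getpapers code) of what the buggy bookkeeping does:

// Hypothetical numbers for illustration: suppose the query matches
// 1500 records and we page through them with pagesize = 500.
var pagesize = 500;
var start = 500;              // we are fetching the second page
var result = { length: 200 }; // arXiv returned a short page: 200 of 500

// Buggy behaviour: start is advanced by pagesize regardless of how many
// results actually arrived, so the next query begins at start=1000 and
// the 300 records in positions 700..999 are never requested again.
start += pagesize;
console.log('received', result.length, '- next start =', start); // 1000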

Solution

The patch below (which GitHub - how idiotic! - refused to let me attach as a file...) resolves this issue. Please apply it at your discretion.

--- arxiv.js.orig       2019-08-25 22:54:11.495494078 +0200
+++ arxiv.js.new        2019-08-27 21:13:26.402183883 +0200
@@ -103,7 +103,23 @@
     setTimeout(arxiv.pageQuery.bind(arxiv), arxiv.page_delay)
     return
   }
-  log.debug('Got', result.length, 'results in this page')
+
+  // Sanity check: sometimes the feed does contain data
+  // - but with *fewer* than our page size results!
+  // This condition is valid only in the last batch of results
+  // - and even then the number of results we get should be
+  // equal to - and never less than - the number of remaining hits (i.e. results).
+  const hitsremaining = arxiv.hitlimit - arxiv.iter;
+  log.debug('There were', hitsremaining, 'results remaining to get');
+  log.debug('Got', result.length, 'results in this page');
+  if ((result.length < arxiv.pagesize) && (result.length < hitsremaining)) {
+    log.error('Malformed response from arXiv API - only ' + result.length + ' results in feed');
+    // log.debug(data);
+    log.info('Retrying failed request');
+    setTimeout(arxiv.pageQuery.bind(arxiv), arxiv.page_delay);
+    return;
+  }
+
   arxiv.allresults = arxiv.allresults.concat(result)
   arxiv.pageprogress.tick(result.length)

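If you want to apply it directly: assuming the file lives at lib/arxiv.js in your getpapers checkout and you saved the diff as, say, arxiv-feed-sanity.patch (both path and filename are just examples here), something like this should work:

cd getpapers/lib
patch arxiv.js < arxiv-feed-sanity.patch
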
Basically, what it does is keep an eye on how many results we get in each batch. If that number is less than pagesize, it logs an error and retries the query that returned the partial results - unless we are in the very last batch, which legitimately contains fewer than pagesize results. In that case, the number of results we got should be exactly equal to the number of remaining 'hits' (results); if it is less, we have a problem. This is the meaning of the condition

(result.length < arxiv.pagesize) && (result.length < hitsremaining)

where

hitsremaining = arxiv.hitlimit - arxiv.iter

If you think about it a bit, you will see that it catches the problem correctly.
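
To see it at work, here is a small self-contained check of the predicate (the sample numbers and the needsRetry wrapper are made up for illustration; pagesize, hitlimit and iter stand in for the real fields on the arxiv object):

// Evaluate the patch's sanity-check predicate on sample values.
function needsRetry(arxiv, resultLength) {
  var hitsremaining = arxiv.hitlimit - arxiv.iter;
  return (resultLength < arxiv.pagesize) && (resultLength < hitsremaining);
}

var arxiv = { pagesize: 500, hitlimit: 1200, iter: 1000 };
console.log(needsRetry(arxiv, 200)); // false: legitimate last batch (200 of 200 remaining)
console.log(needsRetry(arxiv, 150)); // true:  malformed feed (150 of 200 remaining)
console.log(needsRetry({ pagesize: 500, hitlimit: 1200, iter: 0 }, 200)); // true: short page mid-run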

Tested here with

arxiv.pagesize = 1000

and it works fine.
