arXiv API feed contains less data than page size - but getpapers starts new query with the next start parameter #177

sedimentation-fault opened this issue on Aug 27, 2019

Problem

To see this problem, it is advisable to turn on debug mode in getpapers:

category='math.AG'; start_date='20170101'; end_date='20190827'; getpapers --api 'arxiv' --query "cat:$category AND lastUpdatedDate:[${start_date}* TO ${end_date}*] " --outdir "$category" -p -l debug

You don't have to download the papers - just watch the query phase at the start and interrupt the run once that phase finishes. Depending on the whims of arxiv.org, you will see messages telling you

Got 500 results in this page
...
Got 500 results in this page
...
Got 500 results in this page

or:

Malformed response from arXiv API - no data in feed

If you look at the URL printed, and pay attention to its start parameter, you will see it increasing in steps of your page size (here: 500) after each message:

...&start=0...
...&start=500...
...&start=1000...

All is well as long as things follow the normal path of operation. But sometimes I see messages like:

Got 200 results in this page

while another 500 is added to start for the next query! This is a bug.
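
To make the data loss concrete, here is a minimal sketch (plain Node.js with made-up numbers, NOT the actual getpapers code) of what the buggy bookkeeping does:

// Hypothetical numbers for illustration: suppose the query matches
// 1500 records and we page through them with pagesize = 500.
var pagesize = 500;
var start = 500;              // we are fetching the second page
var result = { length: 200 }; // arXiv returned a short page: 200 of 500

// Buggy behaviour: start is advanced by pagesize regardless of how many
// results actually arrived, so the next query begins at start=1000 and
// the 300 records in positions 700..999 are never requested again.
start += pagesize;
console.log('received', result.length, '- next start =', start); // 1000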

Solution

The patch below (which GitHub - how idiotic! - refused to let me attach as a file...) resolves this issue. Please apply it at your discretion.

--- arxiv.js.orig       2019-08-25 22:54:11.495494078 +0200
+++ arxiv.js.new        2019-08-27 21:13:26.402183883 +0200
@@ -103,7 +103,23 @@
     setTimeout(arxiv.pageQuery.bind(arxiv), arxiv.page_delay)
     return
   }
-  log.debug('Got', result.length, 'results in this page')
+
+  // Sanity check: sometimes the feed does contain data
+  // - but with *fewer* than our page size results!
+  // This condition is valid only in the last batch of results
+  // - and even then the number of results we get should be
+  // equal to - and never less than - the number of remaining hits (i.e. results).
+  const hitsremaining = arxiv.hitlimit - arxiv.iter;
+  log.debug('There were', hitsremaining, 'results remaining to get');
+  log.debug('Got', result.length, 'results in this page');
+  if ((result.length < arxiv.pagesize) && (result.length < hitsremaining)) {
+    log.error('Malformed response from arXiv API - only ' + result.length + ' results in feed');
+    // log.debug(data);
+    log.info('Retrying failed request');
+    setTimeout(arxiv.pageQuery.bind(arxiv), arxiv.page_delay);
+    return;
+  }
+
   arxiv.allresults = arxiv.allresults.concat(result)
   arxiv.pageprogress.tick(result.length)

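If you want to apply it directly: assuming the file lives at lib/arxiv.js in your getpapers checkout and you saved the diff as, say, arxiv-feed-sanity.patch (both path and filename are just examples here), something like this should work:

cd getpapers/lib
patch arxiv.js < arxiv-feed-sanity.patch
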
Basically, what it does is keep an eye on how many results we get in each batch. If that number is less than pagesize, it logs an error and retries the query that returned the partial results - unless we are in the very last batch, which legitimately contains fewer than pagesize results. In that case, the number of results we got should be exactly equal to the number of remaining 'hits' (results); if it is less, we have a problem. This is the meaning of the condition

(result.length < arxiv.pagesize) && (result.length < hitsremaining)

where

hitsremaining = arxiv.hitlimit - arxiv.iter

If you think about it a bit, you will see that it catches the problem correctly.
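
To see it at work, here is a small self-contained check of the predicate (the sample numbers and the needsRetry wrapper are made up for illustration; pagesize, hitlimit and iter stand in for the real fields on the arxiv object):

// Evaluate the patch's sanity-check predicate on sample values.
function needsRetry(arxiv, resultLength) {
  var hitsremaining = arxiv.hitlimit - arxiv.iter;
  return (resultLength < arxiv.pagesize) && (resultLength < hitsremaining);
}

var arxiv = { pagesize: 500, hitlimit: 1200, iter: 1000 };
console.log(needsRetry(arxiv, 200)); // false: legitimate last batch (200 of 200 remaining)
console.log(needsRetry(arxiv, 150)); // true:  malformed feed (150 of 200 remaining)
console.log(needsRetry({ pagesize: 500, hitlimit: 1200, iter: 0 }, 200)); // true: short page mid-run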

Tested here with

arxiv.pagesize = 1000

and it works fine.
