Problem
To see this problem, it is advisable to turn on debug mode in getpapers:
category='math.AG'; start_date='20170101'; end_date='20190827'; getpapers --api 'arxiv' --query "cat:$category AND lastUpdatedDate:[${start_date}* TO ${end_date}*] " --outdir "$category" -p -l debug
You don't have to actually download the papers - just watch the query phase at the start until it finishes, and interrupt the run after that. Depending on the whims of arxiv.org, you will see messages telling you
Got 500 results in this page
...
Got 500 results in this page
...
Got 500 results in this page
or:
Malformed response from arXiv API - no data in feed
:roll:
If you look at the URL printed, and pay attention to its start parameter, you will see it increase in steps of your page size (here: 500) after each message:
...&start=0...
...&start=500...
...&start=1000...
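For reference, here is a sketch of how such paged query URLs are built. The parameter names start and max_results are the arXiv API's own; the helper function buildPageUrl is mine, for illustration only, not getpapers code:

```javascript
// Sketch: successive page URLs for the arXiv API differ only in `start`,
// which advances by the page size on each request.
function buildPageUrl(query, start, pagesize) {
  const base = 'http://export.arxiv.org/api/query';
  return base + '?search_query=' + encodeURIComponent(query) +
         '&start=' + start + '&max_results=' + pagesize;
}

const pagesize = 500;
for (let page = 0; page < 3; page++) {
  // start takes the values 0, 500, 1000 - matching the URLs above.
  console.log(buildPageUrl('cat:math.AG', page * pagesize, pagesize));
}
```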
All is well when everything follows the normal path of operation. But sometimes I see messages like:
Got 200 results in this page
while another full 500 is still added to start for the next query, so the missing results of the short page are silently skipped! This is a bug.
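The failure mode can be reproduced in isolation. The sketch below (all names are mine, not getpapers code) simulates a paged fetch in which the server truncates one page while start still advances by the full page size - the truncated page's missing results are lost without any error:

```javascript
// Simulate paging over 1200 total results with pagesize 500,
// where the server truncates the page starting at 500 to only 200 results.
const total = 1200;
const pagesize = 500;

function fetchPage(start) {
  const full = Math.min(pagesize, total - start);
  // Fault injection: the second page comes back short, as seen in the logs.
  const truncated = (start === 500) ? 200 : full;
  return new Array(Math.max(truncated, 0)).fill(0);
}

let allresults = [];
// start always advances by pagesize, regardless of how many results came back.
for (let start = 0; start < total; start += pagesize) {
  allresults = allresults.concat(fetchPage(start));
}

console.log(allresults.length); // 900 collected instead of 1200: 300 results dropped
```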
Solution
The patch below (which - how idiotic! - GitHub refused to attach...) resolves this issue. Please apply it at your discretion.
--- arxiv.js.orig 2019-08-25 22:54:11.495494078 +0200
+++ arxiv.js.new 2019-08-27 21:13:26.402183883 +0200
@@ -103,7 +103,23 @@
       setTimeout(arxiv.pageQuery.bind(arxiv), arxiv.page_delay)
       return
     }
-    log.debug('Got', result.length, 'results in this page')
+
+    // Sanity check: sometimes the feed does contain data -
+    // but with *fewer* results than our page size!
+    // This situation is legitimate only in the last batch of results -
+    // and even then the number of results we get should be
+    // equal to - and never less than - the number of remaining hits (i.e. results).
+    var hitsremaining = arxiv.hitlimit - arxiv.iter;
+    log.debug('There were', hitsremaining, 'results remaining to get');
+    log.debug('Got', result.length, 'results in this page');
+    if ((result.length < arxiv.pagesize) && (result.length < hitsremaining)) {
+      log.error('Malformed response from arXiv API - only ' + result.length + ' results in feed');
+      // log.debug(data);
+      log.info('Retrying failed request');
+      setTimeout(arxiv.pageQuery.bind(arxiv), arxiv.page_delay);
+      return;
+    }
+
     arxiv.allresults = arxiv.allresults.concat(result)
     arxiv.pageprogress.tick(result.length)
Basically, the patch keeps an eye on how many results we get in each batch. If a batch contains fewer than pagesize results, it logs an error and retries the query that returned the partial page - unless we are in the very last batch, which may legitimately contain fewer than pagesize results. Even then, the number of results we got should be exactly equal to the number of remaining 'hits' (results); if it is less, we have a problem. This is the meaning of the condition
(result.length < arxiv.pagesize) && (result.length < hitsremaining)
where
hitsremaining = arxiv.hitlimit - arxiv.iter
If you think about it a bit, you will see that it catches the problem correctly.
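To see why, the check can be pulled out as a pure function (a sketch; the real patch reads these values from the arxiv object, and the function name is mine):

```javascript
// Returns true when a page is suspiciously short and should be retried:
// fewer results than the page size AND fewer than the hits still outstanding.
function shouldRetry(resultCount, pagesize, hitlimit, iter) {
  const hitsremaining = hitlimit - iter;
  return resultCount < pagesize && resultCount < hitsremaining;
}

// Truncated mid-run page: 200 of 500 back, 700 hits still to go -> retry.
console.log(shouldRetry(200, 500, 1200, 500));  // true
// Legitimate short final page: 200 back, exactly 200 remaining -> accept.
console.log(shouldRetry(200, 500, 1200, 1000)); // false
// A full page is always accepted.
console.log(shouldRetry(500, 500, 1200, 0));    // false
```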
Tested here with
arxiv.pagesize = 1000
and it works fine.