
Resume (-r) does not work the same with arxiv, as with EUPMC API #158

sedimentation-fault opened this issue Apr 18, 2017 · 3 comments


@sedimentation-fault

Description

While appending '-r' to a getpapers command will not re-download PDFs that are already present for queries against the EUPMC API, the same is NOT true for the arxiv API: there it not only retries the metadata query, but also re-downloads each and every PDF, from the first to the last.

This is very annoying if you have 24000 PDFs to download, of which a few thousand have already been downloaded in previous attempts.

How to reproduce

Run

getpapers --api 'arxiv' --query 'cat:math.DG' --outdir math.DG -p

Let it download some number of PDFs, then interrupt it and retry with

getpapers --api 'arxiv' --query 'cat:math.DG' --outdir math.DG -p -r

Side effects

An annoying side-effect of this is that the web server will reply with a

416 Range Not Satisfiable

error if the downloader (be it the standard requestretry from the master code, or a curl wrapper from a custom fork, as in #157) attempts a resumable download, i.e. if it tries to resume from where it stopped: in this case from the end of an already completely downloaded file, a byte range whose start is certainly not satisfiable. If the error is not caught, the script will stop at the first file that is already there, with the above 416 error.

This side effect may not occur in the master code (or it may appear as a caught 416 error; I didn't check).


blahah commented Apr 18, 2017

@sedimentation-fault at this point I think it's worth saying that you've discovered enough bugs that it's clear we need to rewrite some fundamentals. We have known about some of these internally for some time, but haven't had the resources to resolve them. If you could refrain from reporting new issues for now, that would be helpful - the existing reports have enough detail that we have a good set of goals to test against when rewriting. At this point more issues just give us more admin to do afterwards. We really appreciate the detailed analysis so far, though :)

@sedimentation-fault (Author)

The good thing is, all this discussion helps us to close bugs too, like #157 😉

But, no problem. I understand - I will keep quiet for a while... 😄
KUTGW! 👍

@sedimentation-fault (Author)

Let me break the silence and tell you the solution - there is no point in leaving you to grope in the dark. 😄

Solution

This bug is actually two bugs:

Do not redownload papers if they are already there

Reason

I thought this had to do with the '-r' option, but it does not. download.js already checks whether a file with the same name exists and skips the download if it does. BUT: that test fails for arxiv papers because their IDs contain slashes, and the IDs are not sanitized in the function that performs the check.

Solution

In download.js, replace

  var rename = urlObj.rename;
  var base = id + '/';

with

  var rename = sanitize(urlObj.rename);
  var base = sanitize(id) + '/';
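To see why the unsanitized check fails, consider an old-scheme arXiv ID, which contains a slash. The sketch below uses a minimal stand-in for the sanitize function (getpapers uses the sanitize-filename package; the ID shown is an illustrative example, not taken from the issue):

```javascript
// Minimal stand-in for the sanitize-filename package: strip characters
// that are illegal in file names, including the path separator '/'.
function sanitize(name) {
  return name.replace(/[\/\\?<>:*|"]/g, '');
}

// Old-scheme arXiv IDs contain a slash (illustrative ID):
const id = 'math.DG/0307245';

// Unsanitized, the existence check looks under a nested directory
// 'math.DG/0307245/' that never matches the file actually written,
// so every paper looks "missing" and gets downloaded again:
const unsafeBase = id + '/';           // 'math.DG/0307245/'
const safeBase = sanitize(id) + '/';   // 'math.DG0307245/'
```

With sanitize applied, the base used for the check matches the base used when the file was written, and already-downloaded papers are skipped.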

Do not retry the query if the user passed the '-r' option

Solution

In arxiv.js, replace

arxiv.pageQuery();

with

  if (arxiv.opts.restart) {
    fs.readFile('arxiv_results.json', (err, data) => {
      if (err && err.code === 'ENOENT') {
        log.error('No existing download to restart')
        process.exit(1)
      } else if (err) {
        throw err
      } else {
        log.info('Restarting previous download')
        arxiv.allresults = JSON.parse(data)
        arxiv.addDlTasks()
      }
    })
  } else {
    arxiv.pageQuery();
  }

Still in arxiv.js, after the lines

   var filename = chalk.blue('arxiv_results.json')
   log.info('Full ArXiv result metadata written to ' + filename);

add:

  arxiv.addDlTasks()

}

ArXiv.prototype.addDlTasks = function() {

  arxiv = this

This closes pageQuery right after the metadata has been written, and moves the remainder of its body (the code that queues the download tasks) into the new addDlTasks method, which the restart branch above can then call directly.

For your convenience, I attach two patches. Inspect them before you apply them - the patch for download.js also contains commented-out and debugging code (I found it informative in this form, YMMV).

patches.zip
