
Resume (-r) does not work the same with arxiv, as with EUPMC API #158

sedimentation-fault opened this issue Apr 18, 2017 · 3 comments


@sedimentation-fault

Description

While appending '-r' to a getpapers command will not re-download PDFs that are already present for queries against the EUPMC API, the same is NOT true for the arxiv API: there it not only retries the metadata query, but also re-downloads each and every PDF, from the first to the last.

This is very annoying if you have 24000 PDFs to download, of which a few thousand have already been downloaded in previous attempts.

How to reproduce

Run

getpapers --api 'arxiv' --query 'cat:math.DG' --outdir math.DG -p

Let it download some number of PDFs, then interrupt it and retry with

getpapers --api 'arxiv' --query 'cat:math.DG' --outdir math.DG -p -r

Side effects

An annoying side-effect of this is that the web server will reply with a

416 Range Not Satisfiable

error if the downloader (be it the standard requestretry from the master code, or a curl wrapper from a custom fork, as in #157) attempts a resumable download, i.e. if it tries to resume from where it stopped: in this case from the end of an already completely downloaded file, a byte range whose start is certainly not satisfiable. If the error is not caught, the script will stop at the first file that is already there, with the above 416 error.

This side effect may not occur in the master code (or it may appear as a caught 416 error; I didn't check).


blahah commented Apr 18, 2017

@sedimentation-fault at this point I think it's worth saying that you've discovered enough bugs that it's clear we need to rewrite some fundamentals. We have known about some of these internally for some time, but haven't had the resources to resolve them. If you could refrain from reporting new issues for now, that would be helpful - the existing reports have enough detail that we have a good set of goals to test against when rewriting. At this point more issues just give us more admin to do afterwards. We really appreciate the detailed analysis so far, though :)

@sedimentation-fault (Author)

The good thing is, all this discussion helps us to close bugs too, like #157 😉

But, no problem. I understand - I will keep quiet for a while... 😄
KUTGW! 👍

@sedimentation-fault (Author)

Let me break the silence and tell you the solution - there is no point in leaving you to grope in the dark. 😄

Solution

This bug is actually two bugs:

Do not redownload papers if they are already there

Reason

I thought this had to do with the '-r' option, but it does not. download.js already checks whether a file with the same name exists and skips the download if it does. BUT: that test fails for arxiv papers because their IDs contain slashes, and the IDs are not sanitized in the function that performs the check.

Solution

In download.js, replace

  var rename = urlObj.rename;
  var base = id + '/';

with

  var rename = sanitize(urlObj.rename);
  var base = sanitize(id) + '/';
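To see why the unsanitized check fails, consider an old-scheme arXiv ID, which contains a slash. The sketch below uses a minimal stand-in for the sanitize function (getpapers uses the sanitize-filename package; the ID shown is an illustrative example, not taken from the issue):

```javascript
// Minimal stand-in for the sanitize-filename package: strip characters
// that are illegal in file names, including the path separator '/'.
function sanitize(name) {
  return name.replace(/[\/\\?<>:*|"]/g, '');
}

// Old-scheme arXiv IDs contain a slash (illustrative ID):
const id = 'math.DG/0307245';

// Unsanitized, the existence check looks under a nested directory
// 'math.DG/0307245/' that never matches the file actually written,
// so every paper looks "missing" and gets downloaded again:
const unsafeBase = id + '/';           // 'math.DG/0307245/'
const safeBase = sanitize(id) + '/';   // 'math.DG0307245/'
```

With sanitize applied, the base used for the check matches the base used when the file was written, and already-downloaded papers are skipped.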

Do not retry the query if the user passed the '-r' option

Solution

In arxiv.js, replace

arxiv.pageQuery();

with

  if (arxiv.opts.restart) {
    fs.readFile('arxiv_results.json', (err, data) => {
      if (err && err.code === 'ENOENT') {
        log.error('No existing download to restart')
        process.exit(1)
      } else if (err) {
        throw err
      } else {
        log.info('Restarting previous download')
        arxiv.allresults = JSON.parse(data)
        arxiv.addDlTasks()
      }
    })
  } else {
    arxiv.pageQuery();
  }

Still in arxiv.js, after the lines

   var filename = chalk.blue('arxiv_results.json')
   log.info('Full ArXiv result metadata written to ' + filename);

add:

  arxiv.addDlTasks()

}

ArXiv.prototype.addDlTasks = function() {

  arxiv = this

This closes pageQuery right after the metadata has been written, and moves the remainder of its body (the code that queues the download tasks) into the new addDlTasks method, which the restart branch above can then call directly.

For your convenience, I attach two patches. Inspect them before you apply them - the patch for download.js also contains commented-out and debugging code (I found it informative in this form, YMMV).

patches.zip
