Feature Request: Detect DOI numbers and download academic paper PDFs automatically #720

Open
pirate opened this issue Apr 24, 2021 · 2 comments

Comments

@pirate
Member

pirate commented Apr 24, 2021

New extractor idea: SCIHUB

Take this academic paper, for example: https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1

If a full paper PDF is available on Sci-Hub, e.g. https://sci-hub.se/https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1, it could be downloaded to a ./archive/<timestamp>/scihub/ output folder.

# try downloading via verbatim URL first
$ scihub.py -d 'https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1'
DEBUG:Sci-Hub:Successfully downloaded file with identifier https://www.cell.com/current-biology/fulltext/S0960-9822(19)31469-1

We could also use a regex to look for a DOI number in the page URL or the page HTML contents (e.g. 10.1016/j.cub.2019.11.030) and try downloading that.

# otherwise try downloading via any regex-extracted bare DOI numbers on the page or in the URL
$ scihub.py -d '10.1016/j.cub.2019.11.030'
DEBUG:Sci-Hub:Successfully downloaded file with identifier 10.1016/j.cub.2019.11.030

$ ls
c28dc1242df6f931c29b9cd445a55597-.cub.2019.11.030.pdf
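
The two-step lookup above (verbatim URL first, then any regex-extracted DOIs) could be sketched roughly as follows. This is only an illustration, not part of scihub.py or ArchiveBox: the names `DOI_REGEX`, `extract_dois`, and `scihub_identifiers` are hypothetical, and the pattern is an assumed approximation of common DOI shapes that may miss or over-match unusual ones.

```python
import re
from typing import List
from urllib.parse import unquote

# Assumed pattern (not from the issue): "10.", a 4-9 digit registrant code,
# a slash, then a suffix running until whitespace, a quote, or an angle bracket.
DOI_REGEX = re.compile(r'\b10\.\d{4,9}/[^\s"\'<>]+')

def extract_dois(text: str) -> List[str]:
    """Return unique DOI-like strings in order of first appearance."""
    found: List[str] = []
    for match in DOI_REGEX.findall(unquote(text)):
        # Trim trailing punctuation that usually belongs to the surrounding
        # prose or markup (this would also clip a DOI that really ends in ")").
        doi = match.rstrip('.,;:)]}')
        if doi not in found:
            found.append(doi)
    return found

def scihub_identifiers(url: str, html: str = '') -> List[str]:
    """Candidates to try in order: the verbatim URL, then any bare DOIs."""
    dois = extract_dois(url + '\n' + html)
    return [url] + [doi for doi in dois if doi != url]
```

Each identifier in the returned list would then be passed to `scihub.py -d` in turn, stopping at the first successful download.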

New Dependencies:

New Extractors:

  • extractors/scihub.py

New Config Options:

  • SAVE_SCIHUB=True
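
As a rough sketch of how the proposed option and output folder might fit together, the snippet below reads `SAVE_SCIHUB` from the environment and builds the `./archive/<timestamp>/scihub/` path. The function names are hypothetical, the accepted truthy values are an assumption, and none of this is ArchiveBox's actual config or extractor API.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

def save_scihub_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the proposed SAVE_SCIHUB option, defaulting to True as in the issue."""
    env = os.environ if env is None else env
    value = str(env.get('SAVE_SCIHUB', 'True')).strip().lower()
    # Assumed set of truthy spellings; anything else disables the extractor.
    return value in ('1', 'true', 'yes', 'on')

def scihub_output_dir(archive_root: str, timestamp: str) -> Path:
    """Build the per-snapshot output folder: <archive_root>/<timestamp>/scihub."""
    return Path(archive_root) / timestamp / 'scihub'
```

For example, `scihub_output_dir('./archive', '1619222400')` yields `archive/1619222400/scihub` (the timestamp value here is made up for illustration).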
pirate changed the title from "Feature Request: Extract DOI numbers and download academic paper PDFs automatically" to "Feature Request: Detect DOI numbers and download academic paper PDFs automatically" on Apr 24, 2021
@danisztls

Captchas would be a problem.

@pirate
Member Author

pirate commented Apr 27, 2021

In my testing so far, captchas haven't been a problem as long as you're not downloading dozens of papers at a time.
