Scholar crawler 学术爬虫

This project aims to crawl titles and citations from given journals with little human interaction, then download the papers from their URLs or from Sci-Hub. It achieves a 97%+ success rate on a wide range of papers.

Current version: v2.0, which adds a reCAPTCHA solver.

Google Scholar scraper.py : With good pacing, the script can crawl one page every 10 seconds and run for at least 100 pages before it hits a bot check. The project solves the bot check using recaptcha-challenger with an OpenAI Whisper model. You can always fall back to a manual human check with scrapper_manual.py.
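The pacing and bot-check fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in scraper.py: `paced_delays`, `crawl`, and the three callbacks (`fetch_page`, `is_bot_check`, `solve_bot_check`) are hypothetical names, and the 30% jitter is an assumed value chosen only to make the delays look less regular.

```python
import random
import time
from typing import Callable, Iterator, List, Optional


def paced_delays(base: float = 10.0, jitter: float = 0.3,
                 rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield an endless stream of delays of about `base` seconds, +/- `jitter`."""
    rng = rng or random.Random()
    while True:
        yield base * (1.0 + rng.uniform(-jitter, jitter))


def crawl(fetch_page: Callable[[int], str],
          is_bot_check: Callable[[str], bool],
          solve_bot_check: Callable[[], bool],
          max_pages: int,
          sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch up to `max_pages` result pages, pausing between requests.

    All three callbacks are placeholders (assumptions) for the real
    browser automation and the reCAPTCHA solver.
    """
    pages: List[str] = []
    delays = paced_delays()
    for page_no in range(max_pages):
        html = fetch_page(page_no)
        if is_bot_check(html):
            # Try the automatic solver first; stop if it fails so that a
            # human can take over with the manual script.
            if not solve_bot_check():
                break
            html = fetch_page(page_no)  # retry the same page once
            if is_bot_check(html):
                break
        pages.append(html)
        sleep(next(delays))
    return pages
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting; in normal use the default `time.sleep` applies.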

ScihubDownloader.py : Tries to download the links scraped from Google Scholar. If that fails, it falls back to Sci-Hub. If that fails again, it falls back to Sci-Hub backbones. There are lots of Sci-Hub mirrors, so be sure to put plenty of accessible mirrors in _get_available_scihub_urls(). It's recommended to add at least 5 mirrors for load balancing.
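The fallback chain above might look something like the sketch below. It is an assumption-laden illustration rather than the project's implementation: `download_with_fallback` and the injected `fetch` callback are hypothetical, the `<mirror>/<paper id>` URL layout is assumed, and shuffling the mirror list is just one simple way to spread requests across mirrors. The only format fact relied on is that PDF files begin with the bytes `%PDF`.

```python
import random
from typing import Callable, List, Optional, Sequence


def download_with_fallback(paper_id: str,
                           direct_url: Optional[str],
                           mirrors: Sequence[str],
                           fetch: Callable[[str], Optional[bytes]],
                           rng: Optional[random.Random] = None) -> Optional[bytes]:
    """Try the scraped link first, then each mirror, and return PDF bytes or None.

    `fetch` is a hypothetical callback that returns the response body for a
    URL (or raises / returns None on failure).
    """
    rng = rng or random.Random()
    shuffled = list(mirrors)
    rng.shuffle(shuffled)  # spread the load across mirrors

    candidates: List[str] = []
    if direct_url:
        candidates.append(direct_url)
    # Assumed URL layout: <mirror>/<paper id>
    candidates.extend(f"{m.rstrip('/')}/{paper_id}" for m in shuffled)

    for url in candidates:
        try:
            data = fetch(url)
        except Exception:
            data = None  # treat any error as a miss and try the next source
        # Accept the response only if it looks like a PDF.
        if data and data.startswith(b"%PDF"):
            return data
    return None
```

Checking the leading bytes guards against mirrors that answer with an HTML error or CAPTCHA page instead of the paper.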

Clash clash.py : Changes the proxy server before Google gets irritated. TBD
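Since this part is still TBD, the following is only one possible shape for the rotation logic, not what clash.py does. `ProxyRotator` and the `requests_per_proxy` threshold are hypothetical, and the sketch deliberately leaves out how the chosen proxy would actually be applied through Clash.

```python
from itertools import cycle
from typing import Sequence


class ProxyRotator:
    """Round-robin over proxy names, switching after a fixed number of requests."""

    def __init__(self, proxies: Sequence[str], requests_per_proxy: int = 50) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        if requests_per_proxy < 1:
            raise ValueError("requests_per_proxy must be at least 1")
        self._cycle = cycle(proxies)
        self._limit = requests_per_proxy  # assumed threshold, tune as needed
        self._used = 0
        self.current = next(self._cycle)

    def next_proxy(self) -> str:
        """Return the proxy to use for the next request."""
        if self._used >= self._limit:
            # Budget for the current proxy is spent: move to the next one.
            self.current = next(self._cycle)
            self._used = 0
        self._used += 1
        return self.current
```

Switching on a request budget, rather than waiting for a bot check to appear, matches the idea of changing servers before the rate limit is reached.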

Folders and files:
recaptcha_challenger
.gitignore
Download.py
README.md
chromedriver.exe
clash.py
img.png
img_1.png
list.txt
scraper.py
scrapper_manual.py
test_diskbackedlist.py

Shadow-Alex/PaperCrawler
