Web-crawler

This project uses the Scrapy framework to implement a web crawler.

Reference documentation: reading the Scrapy documentation is recommended before starting.

Environment and dependencies

  • Python 2.7, setuptools, zope.interface
  • Twisted, Scrapy
  • lxml, BeautifulSoup4, pywin32, pyOpenSSL
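
One way to set these up is with pip (a minimal sketch, assuming pip is available for the Python 2.7 interpreter; pywin32 applies only on Windows, and no versions are pinned here):

  pip install setuptools zope.interface Twisted Scrapy lxml beautifulsoup4 pywin32 pyOpenSSL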

How to start

  • douban: the main Scrapy project, used to run the crawler. Each item type has its own spider, started with "scrapy crawl <ItemName>" (a sketch of such a spider follows this list):

  scrapy crawl bookItem
  scrapy crawl movieItem

  • fetch_proxies: used to fetch usable IP proxies, so that the target site does not block the crawler. Running the script writes the collected proxies to the file "proxies.json"; paste that output into douban/douban-setting.py, replacing the contents of the PROXIES list (see the proxy sketch after this list):

  python fetch_free_proxies.py
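
For orientation, here is a minimal sketch of what a spider such as bookItem might look like. The class name, domain, start URL, and selectors are assumptions rather than the project's actual code; the point is that the spider's "name" attribute is what "scrapy crawl" matches.

  # Hypothetical sketch only; the real spider in the douban project may differ.
  import scrapy

  class BookSpider(scrapy.Spider):
      name = "bookItem"                      # matched by "scrapy crawl bookItem"
      allowed_domains = ["book.douban.com"]  # assumed target domain
      start_urls = ["https://book.douban.com/top250"]  # assumed entry page

      def parse(self, response):
          # Field names and CSS selectors are illustrative only.
          for book in response.css("tr.item"):
              yield {
                  "title": book.css("div.pl2 a::attr(title)").extract_first(),
                  "rating": book.css("span.rating_nums::text").extract_first(),
              }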
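
The project's exact proxy handling is not shown in this README; the sketch below only illustrates one common pattern: a PROXIES list in the settings file (filled from proxies.json) and a downloader middleware that assigns a random proxy to each request. The class name RandomProxyMiddleware and the example addresses are hypothetical.

  # Hypothetical settings excerpt: PROXIES filled from the output of fetch_free_proxies.py.
  PROXIES = [
      "http://1.2.3.4:8080",   # example entries only; replace with fetched proxies
      "http://5.6.7.8:3128",
  ]

  # Hypothetical downloader middleware that picks one of the proxies per request.
  import random

  class RandomProxyMiddleware(object):
      def __init__(self, proxies):
          self.proxies = proxies

      @classmethod
      def from_crawler(cls, crawler):
          return cls(crawler.settings.getlist("PROXIES"))

      def process_request(self, request, spider):
          if self.proxies:
              request.meta["proxy"] = random.choice(self.proxies)

For such a middleware to take effect, it would also have to be enabled in the DOWNLOADER_MIDDLEWARES setting of the project.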

Problems

  • If the crawler cannot run, run "fetch_free_proxies.py" first to refresh the proxy list.
  • See the Scrapy documentation on how to avoid getting banned; a sketch of the relevant settings follows this list.
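
The usual advice from that documentation maps onto a handful of Scrapy settings. A sketch of the kind of values that could be added to the settings file (the numbers are illustrative, not taken from this project):

  # Illustrative anti-ban settings for the Scrapy settings file; values are examples.
  DOWNLOAD_DELAY = 2                   # pause between requests to the same site
  RANDOMIZE_DOWNLOAD_DELAY = True      # jitter the delay so requests look less mechanical
  CONCURRENT_REQUESTS_PER_DOMAIN = 4   # keep per-domain concurrency low
  COOKIES_ENABLED = False              # some sites use cookies to spot crawlers
  USER_AGENT = "Mozilla/5.0 (compatible; example-crawler)"  # a realistic UA string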
