Scraping data with a Scrapy crawler and analyzing it with Grafana
The density of the curve shows that transactions are fairly frequent. Beike (贝壳网) lists 56,543 second-hand housing transactions for Wuhan (as of 2020-06-21), and the crawler collected 56,153 of them; the gap comes from incomplete dimension filters in the early runs and from newly closed deals that had not yet been counted. To keep the data current, the crawler needs to be deployed to a server and run on a schedule against a fixed number of pages.
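One simple way to cap each scheduled run at a fixed number of pages is Scrapy's built-in CLOSESPIDER_PAGECOUNT setting; this is only a sketch, and the original project may limit the crawl differently (for example by requesting only the newest listing pages):

# settings.py (sketch): let the CloseSpider extension stop the crawl after a
# fixed number of responses, so a scheduled run only touches the newest pages.
CLOSESPIDER_PAGECOUNT = 50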
Every IT topic has official documentation, and it is the most complete and authoritative source: Android, Python, the TDengine time-series database, AMap (高德地图), the Python-based Scrapy framework, and so on. Take the Scrapy and Android official docs as examples:
With a basic demo working, the real problems start. The first is the IP being temporarily blocked for crawling too fast (fortunately Beike's anti-crawling mechanism does not blacklist you outright), in other words how to outwit the anti-crawler. Here is how I worked through the problem step by step. The first step was to rotate the User-Agent: in the downloader middleware's process_request, pick a random one for each request:
# in the downloader middleware (middlewares.py); needs `import random` and a
# USER_AGENT_LIST of User-Agent strings defined or imported at module level
def process_request(self, request, spider):
    # Called for each request that goes through the downloader
    # middleware.
    # Must either:
    # - return None: continue processing this request
    # - or return a Response object
    # - or return a Request object
    # - or raise IgnoreRequest: process_exception() methods of
    #   installed downloader middleware will be called
    # Pick a random User-Agent for every request so the traffic does not
    # all look like it comes from a single client.
    user_agent = random.choice(USER_AGENT_LIST)
    if user_agent:
        request.headers.setdefault('User-Agent', user_agent)
        # logging.info(f"User-Agent:{user_agent}")
    return None
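For reference, USER_AGENT_LIST is nothing more than a list of User-Agent strings. The original post does not show where it is defined (it could sit in settings.py or next to the middleware); a minimal sketch looks like this:

# settings.py (sketch): a pool of desktop browser User-Agent strings for
# the middleware above to choose from.
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) '
    'Gecko/20100101 Firefox/77.0',
]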
The next step was a custom Scrapy command that launches every spider in the project with a single invocation:

from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        # Schedule every spider registered in the project, then start the
        # reactor once so they all run in the same process.
        spider_list = self.crawler_process.spider_loader.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
This also requires registering the commands module in settings.py:

# expose the custom commands in mingyan/commands to the scrapy command line
COMMANDS_MODULE = 'mingyan.commands'
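The command name comes from the filename of the module that holds the Command class, for example scrapy crawlall if it is saved as mingyan/commands/crawlall.py (the original filename is not shown here). The same "run every spider in one process" effect can also be had from a standalone script; the sketch below is not part of the original project, but it is convenient when launching from a scheduler on the server:

# run_all.py (sketch): launch every spider in the project from one script.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
for name in process.spider_loader.list():
    process.crawl(name)
process.start()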
That said, the fundamental fix for IP bans is to use rotating proxy IPs. I started by scraping free IP pools from the web, only to find that very few of the addresses worked: they either expired almost immediately or never worked at all. In mingyan/tools/crawl_xici_ip.py, get_random_ip_from_mysql() reads the free shared IPs that were stored in the database, while get_ip_from_xun() fetches paid IPs from Xun proxy (讯代理). The paid service is far more reliable than the free pools; Zhihu has plenty of recommendations on which provider to pick.
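For context, here is a minimal sketch of what such a GetIP helper might look like; the endpoint, response format, and credentials below are placeholders, not Xun proxy's actual API:

import base64
import requests

class GetIP:
    # Placeholder endpoint and credentials; substitute the vendor's real API.
    API_URL = 'http://api.example-proxy.com/get_ip'
    USER, PASSWORD = 'proxy_user', 'proxy_pass'

    def get_ip_from_xun(self):
        # Ask the vendor for one proxy and return it as "http://host:port".
        data = requests.get(self.API_URL, timeout=5).json()
        return 'http://{}:{}'.format(data['ip'], data['port'])

    def get_auth(self):
        # Value for the Proxy-Authorization header (HTTP Basic credentials).
        token = base64.b64encode(
            '{}:{}'.format(self.USER, self.PASSWORD).encode()
        ).decode()
        return 'Basic ' + token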
To use rotating proxy IPs, the MingyanSpiderMiddleware must be enabled; like MingyanDownloaderMiddleware it lives in middlewares.py, and the number after each entry is its priority (lower values sit closer to the engine).
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # disable the built-in User-Agent middleware; ours sets the header instead
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'mingyan.middlewares.MingyanDownloaderMiddleware': 543,
    'mingyan.middlewares.MingyanSpiderMiddleware': 543,
}
Then intercept each request in process_request and attach the proxy IP:
import logging

from scrapy import signals

# GetIP lives in mingyan/tools/crawl_xici_ip.py (see above); it hands back a
# proxy address plus the credentials needed to authenticate against it.
from mingyan.tools.crawl_xici_ip import GetIP


class MingyanSpiderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # ip = random.choice(self.ip)
        get_ip = GetIP()
        # ip = get_ip.get_random_ip_from_mysql()  # free IPs collected into MySQL
        ip = get_ip.get_ip_from_xun()             # paid Xun proxy IP
        logging.info("this is request ip:" + str(ip))
        auth = get_ip.get_auth()
        # encoded_user_pass = base64.encodestring(auth)
        # Authenticate against the paid proxy and route the request through it
        # by setting request.meta['proxy'], Scrapy's per-request proxy hook.
        request.headers['Proxy-Authorization'] = auth
        request.meta['proxy'] = ip
The "middleware" concept is a key piece of understanding how Scrapy works; see the Spider Middleware section of the official docs for the details.
A couple of deployment questions left to dig into:
- how to check on a shell job while it is running;
- what the /dev/null redirection means when running in the background with nohup.