Before running this project, first start redis (and connect with redis-cli) and elasticsearch (if the data should also be stored in mysql, start mysql as well). Then run the types.py file under the es_models folder on its own, and open elasticsearch-head to set the index maximum: https://blog.csdn.net/bushanyantanzhe/article/details/79109721
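The "maximum" referred to above is presumably the index's max_result_window setting (an assumption based on the linked post; the index name below is illustrative). Instead of elasticsearch-head, it can also be set with a single request:

```shell
# Assumed fix: raise max_result_window on the target index so deep
# pagination queries do not fail. Replace your_index with the real name.
curl -XPUT 'http://localhost:9200/your_index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"max_result_window": 100000}}'
```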
Next, issue the following command in redis-cli (spidername is the name of the spider to start; URL is the starting crawl location, cf. the start_urls in the spider file):
lpush spidername:start_urls URL
For example, to start the csdn spider:
In the spiders directory:
scrapy crawl csdn_QA
In redis-cli:
lpush csdn_QA:start_urls http://ask.csdn.net/questions?type=resolved
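The list key pushed to above follows the key pattern that scrapy-redis uses by default ('%(name)s:start_urls', overridable per spider via the redis_key attribute). A minimal sketch of that key derivation, assuming the scrapy-redis default:

```python
# Key pattern a scrapy-redis spider polls for seed URLs
# (assumption: the library's default, '%(name)s:start_urls').
START_URLS_KEY = '%(name)s:start_urls'

def start_urls_key(spider_name: str) -> str:
    """Return the redis list key holding seed URLs for a given spider."""
    return START_URLS_KEY % {'name': spider_name}

print(start_urls_key('csdn_QA'))  # csdn_QA:start_urls
```

This is why the lpush key must match the spider name exactly: a key like csdnQA:start_urls would never be read by the csdn_QA spider.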
Start: scrapy crawl spidername -s JOBDIR=StorageProcess/yourFileName
Pause: Ctrl+C
Resume: scrapy crawl spidername -s JOBDIR=StorageProcess/yourFileName (same as the start command; reusing the same JOBDIR lets Scrapy restore the paused crawl state)
Start all spiders at once: scrapy crawlall