<h1 align=center>Chapter 4, Section 4-2: Big Data Collection with Scrapy</h1>

# Defining the spider's task

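## Syntax involved
This example uses classes (object-oriented programming), lists, dictionaries, loops, functions, string operations, and file reading and writing.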

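## Overview
The spider crawls the first two pages of http://quotes.toscrape.com/, starting at http://quotes.toscrape.com/page/1/. From each quote it extracts the text, the author, and the tags, and it saves the results to a file in JSON format, one JSON object per line.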
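
To make the output format concrete, here is a minimal sketch of how one record becomes one line of JSON text. The record values are made up for illustration; only the three field names (`author`, `text`, `tags`) come from the spider in this section.

```python
import json

# Hypothetical record with the three fields the spider extracts.
record = {
    "author": "Jane Doe",
    "text": "A sample quote.",
    "tags": ["life", "humor"],
}

# One record becomes one line of JSON text.
line = json.dumps(record)
print(line)

# Parsing the line gives back an equal dictionary.
restored = json.loads(line)

# By default json.dumps escapes non-ASCII characters such as curly quotes;
# ensure_ascii=False keeps them readable in the output.
escaped = json.dumps({"text": "“Hi”"})
readable = json.dumps({"text": "“Hi”"}, ensure_ascii=False)
print(escaped)
print(readable)
```

Both forms load back to the same string, so the choice only affects how the file looks in a text editor.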


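## How to modify

To customize the spider, you only need to change two methods, `__init__` and `parse`. Put simply, `__init__` decides which pages are crawled, and `parse` specifies what is extracted from each page.
- `__init__`: sets the list of URLs to crawl and the path of the output file. `self.urls` is the list of URLs to crawl, and `self.file` is a file object.
- `parse`: parses the page after each URL has been fetched successfully.
   How a page is parsed, that is, which selectors are used, depends heavily on the page's HTML structure, so the selectors in this example will not work on other sites. Selectors can be written in many ways; see the documentation at http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/selectors.html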


### Method: the Scrapy framework

In [1]:
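import scrapy
import json
import os

class MySpider(scrapy.Spider):

    name = "spider"

    def __init__(self, *args, **kwargs):    # set the list of URLs to crawl and the output file path
        super().__init__(*args, **kwargs)   # let scrapy.Spider finish its own setup

        self.file = open('demo1_quotes.json', 'w')    # file object for the results

        # build the list of URLs to crawl
        self.urls = []
        for i in range(1, 3):
            self.urls.append('http://quotes.toscrape.com/page/' + str(i))

#       The loop above is nearly equivalent to the literal list below; the only
#       difference is that the generated URLs have no trailing slash.
#         self.urls = [
#             'http://quotes.toscrape.com/page/1/',
#             'http://quotes.toscrape.com/page/2/',
#         ]

        print(self.urls)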

        
    def start_requests(self):
        # issue one request per URL; each response is handed to self.parse
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)
    

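    # parse is called once for each response; it parses the page returned for each successfully fetched URL
    def parse(self, response):
        # response is an HtmlResponse, so both CSS and XPath selectors can be used

        # select the quote blocks
        quotes = response.css("div.quote")     # CSS selector
        # quotes = response.xpath("//div[@class='quote']")  # XPath equivalent; mind the nested quotation marks

        for quote in quotes:
            # text of the quote
            text = quote.css(".text::text").extract_first()   # first match only
            # author of the quote
            author = quote.css("small.author::text").extract_first()
            # tags of the quote
            tags = quote.css(".tags .tag::text").extract()    # all matches

            # build a dictionary for this quote
            item = {"author": author, "text": text, "tags": tags}
            # serialize the dictionary as a JSON string
            line = json.dumps(item)
            # write one quote per line
            self.file.write(line + "\n")
        # push the data to disk right away; otherwise it may appear in the file with some delay
        self.file.flush()
        os.fsync(self.file.fileno())
        # print the URL that was just parsed; this only shows progress and does not affect the crawl
        print("over: " + response.url)

    def closed(self, reason):
        # Scrapy calls this when the spider finishes; close the output file
        self.file.close()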
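
The URL loop in `__init__` is the part most likely to change when the spider is pointed at a different page range. As a minimal sketch, the same list can be built with a small helper. The name `build_page_urls` and its parameters are hypothetical and not part of Scrapy. Unlike the loop above, this version appends the trailing slash, so it matches the literal list shown in the comments.

```python
def build_page_urls(first_page, last_page, base="http://quotes.toscrape.com/page/"):
    """Return the listing URLs for pages first_page..last_page, inclusive."""
    return [f"{base}{page}/" for page in range(first_page, last_page + 1)]


urls = build_page_urls(1, 2)
print(urls)
```

Assigning the result to `self.urls` would be one way to use it inside `__init__`.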


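# Running the spider
Running the cell below starts `MySpider`.
Customize this code block only if you understand how Scrapy runs crawls; otherwise it is best left unchanged.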

In [2]:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # starts the crawl and blocks until it finishes; it prints a lot of log output, which can be ignored

2020-05-29 11:48:47 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: scrapybot)
2020-05-29 11:48:47 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.7, Platform Windows-10-10.0.17763-SP0
2020-05-29 11:48:47 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2020-05-29 11:48:47 [scrapy.extensions.telnet] INFO: Telnet Password: d4010be694d9647a
2020-05-29 11:48:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']


['http://quotes.toscrape.com/page/1', 'http://quotes.toscrape.com/page/2']


2020-05-29 11:48:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-29 11:48:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.ref

over: http://quotes.toscrape.com/page/2/
over: http://quotes.toscrape.com/page/1/


- Getting past anti-scraping measures
- Distributed crawling

# Reading the data
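
The file written by the spider holds one JSON object per line, so it has to be parsed line by line. As a minimal sketch, that idea can be wrapped in a small generator and combined with a simple tag count. The helper name `read_json_lines` is hypothetical, and the records are made up and held in memory so the example runs without the crawl output.

```python
import io
import json
from collections import Counter


def read_json_lines(lines):
    """Yield one dictionary per non-empty line of JSON text."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines instead of failing on them
            yield json.loads(line)


# In-memory stand-in for the output file, with made-up records.
sample = io.StringIO(
    '{"author": "Jane Doe", "text": "First.", "tags": ["life", "humor"]}\n'
    '\n'
    '{"author": "John Roe", "text": "Second.", "tags": ["life"]}\n'
)

records = list(read_json_lines(sample))

# Count how often each tag occurs across all records.
tag_counts = Counter(tag for record in records for tag in record["tags"])

print(len(records))
print(tag_counts.most_common(1))
```

The same function accepts an open file object, because iterating over a text file also yields one line at a time.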

In [3]:
import json

# The file holds one JSON object per line, so json.load() on the whole file
# would raise an error; parse it line by line instead.
data = []
with open('demo1_quotes.json', 'r', encoding='UTF-8') as file:
    for line in file:
        dic = json.loads(line)
        data.append(dic)
print('The JSON file contains %d lines of data' % len(data))
data

The JSON file contains 20 lines of data


[{'author': 'Marilyn Monroe',
  'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high,