<h1 align=center> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Introduction to Data Science - The Python Way </h1>

<h1 align=center> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Lesson 5: Data Collection - Python Web Crawling in Practice I </h1>

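# Crawler Overview
Before working through this example, it helps to know what a web crawler is and to have a basic grasp of URLs, crawling techniques, and HTML pages. An introduction is available at http://python.jobbole.com/81334/

Before running this notebook, read "5 爬虫环境搭建.pdf" (the crawler environment setup guide) and install Scrapy, the Python crawling framework this notebook depends on.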

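# Defining the Crawler Task

## Syntax Involved
The code uses classes (object-oriented programming), lists (`list`), dictionaries (`dict`), loops, functions, string operations, and file reading and writing.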

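## Overview
The crawler fetches the first ten pages of the quotes site, starting at http://quotes.toscrape.com/page/1/ , extracts the text, author, and tags of each quote, and saves the results to a file. The first version below writes one JSON object per line; the second version writes CSV.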
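
As a rough illustration of the JSON output, the sketch below writes records as one JSON object per line and reads them back. The two records are made up for the example, and an in-memory buffer stands in for the output file; only the three field names come from the task description.

```python
import io
import json

# Made-up records with the same three fields the crawler extracts.
items = [
    {"author": "Author A", "text": "First sample quote.", "tags": ["life", "humor"]},
    {"author": "Author B", "text": "Second sample quote.", "tags": []},
]

# An in-memory buffer stands in for the output file.
buffer = io.StringIO()
for item in items:
    # One JSON object per line, so every line can be parsed on its own.
    buffer.write(json.dumps(item) + "\n")

# Read the records back line by line.
restored = [json.loads(line) for line in buffer.getvalue().splitlines() if line]
print(restored == items)
```

Because each line is a complete JSON object, a partially written file can still be read up to its last complete line.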


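## How to Modify

To adapt the crawler, you only need to change two methods, `__init__` and `parse`. Put simply, `__init__` decides which pages are crawled, and `parse` decides what is extracted from each page.
- `__init__`: sets the list of URLs to crawl and the output file. `self.urls` is the list of URLs to crawl, and `self.file` is a file object.
- `parse`: parses the page after each URL has been fetched successfully.
   How a page is parsed, that is, which selectors are used, depends heavily on the HTML structure of the page, so the selectors in this example will not work on other sites. Selectors can be written in many different ways; see the documentation at http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/selectors.html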


In [4]:
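import json
import os

import scrapy


class MySpider(scrapy.Spider):

    name = "spider"

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialization first
        super().__init__(*args, **kwargs)

        # Output file: one JSON object per line
        self.file = open('demo1_quotes.json', 'w')

        # Build the list of URLs to crawl (pages 1 to 10).
        # These URLs have no trailing slash, so the site first answers each
        # one with a 301 redirect to the slash-terminated URL.
        self.urls = []
        for i in range(1, 11):
            self.urls.append('http://quotes.toscrape.com/page/' + str(i))

        # The list can also be written out by hand, for example:
        # self.urls = [
        #     'http://quotes.toscrape.com/page/1/',
        #     'http://quotes.toscrape.com/page/2/',
        # ]

        print(self.urls)

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # parse is called each time a request receives its response
    def parse(self, response):

        # Select all quote blocks on the page
        quotes = response.css("div.quote")
        for quote in quotes:
            # Extract the author name of the quote
            author = quote.css("small.author::text").extract_first()
            # Extract the text of the quote
            text = quote.css(".text::text").extract_first()
            # Extract the tags of the quote
            tags = quote.css(".tags .tag::text").extract()
            # Build a dictionary for this entry
            item = {"author": author, "text": text, "tags": tags}
            # Convert the dictionary to a JSON string
            line = json.dumps(item)
            # Write the entry to the file, one per line
            self.file.write(line + "\n")
        # Flush to disk right away; otherwise the file may lag slightly behind
        self.file.flush()
        os.fsync(self.file.fileno())
        # Print the URL of the page just parsed; this only shows progress
        # and has no effect on the program logic
        print("over: " + response.url)

    # Scrapy calls closed() when the spider finishes; close the output file
    def closed(self, reason):
        self.file.close()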

In [1]:
import csv
import os

import pandas as pd
import scrapy
from scrapy.selector import Selector


class MySpider(scrapy.Spider):

    name = "spider"

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialization first
        super().__init__(*args, **kwargs)

        # Write to a CSV file instead of a JSON file
        self.file = open('demo1_quotesAdj.csv', 'w',
                         encoding='GBK', newline='')
        self.csvWriter = csv.DictWriter(self.file, fieldnames=["author", "text", "tags"])
        self.csvWriter.writeheader()

        # Build the list of URLs to crawl (pages 1 to 10).
        # These URLs have no trailing slash, so the site first answers each
        # one with a 301 redirect to the slash-terminated URL.
        self.urls = []
        for i in range(1, 11):
            self.urls.append('http://quotes.toscrape.com/page/' + str(i))

        # The list can also be written out by hand, for example:
        # self.urls = [
        #     'http://quotes.toscrape.com/page/1/',
        #     'http://quotes.toscrape.com/page/2/',
        # ]

        print(self.urls)

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # parse is called each time a request receives its response
    def parse(self, response):

        # Extract every quote block as an HTML string
        quotes = response.xpath('//div[@class="quote"]').extract()
        for quote in quotes:
            # Wrap the string in a new Selector so it can be queried again
            quote = Selector(text=quote)
            # Extract the author name of the quote
            author = quote.xpath('//span/small[@class="author"]/text()').extract_first()
            # Extract the text of the quote
            text = quote.xpath('//span[@class="text"]/text()').extract_first()
            # Extract the tags of the quote
            tags = quote.xpath('//div/a[@class="tag"]/text()').extract()
            # Build a dictionary for this entry
            item = {"author": author, "text": text, "tags": tags}
            # Write the entry as one CSV row; the tags list is stored in its string form
            self.csvWriter.writerow(item)

        # Flush to disk right away; otherwise the file may lag slightly behind
        self.file.flush()
        os.fsync(self.file.fileno())
        # Print the URL of the page just parsed; this only shows progress
        # and has no effect on the program logic
        print("over: " + response.url)

    # Scrapy calls closed() when the spider finishes; close the output file
    def closed(self, reason):
        self.file.close()

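# Running the Crawler
Once started, the cell below runs `MySpider`.
Customize this code block only if you understand how Scrapy runs internally; otherwise, leaving it unchanged is recommended.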

In [2]:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()  # starts the whole crawl; it prints a lot of log output, which can be ignored

2019-10-29 14:32:14 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: scrapybot)
2019-10-29 14:32:14 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.6.1, Platform Windows-10-10.0.18362-SP0
2019-10-29 14:32:14 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2019-10-29 14:32:14 [scrapy.extensions.telnet] INFO: Telnet Password: 67cea940ddcb9780
2019-10-29 14:32:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']


['http://quotes.toscrape.com/page/1', 'http://quotes.toscrape.com/page/2', 'http://quotes.toscrape.com/page/3', 'http://quotes.toscrape.com/page/4', 'http://quotes.toscrape.com/page/5', 'http://quotes.toscrape.com/page/6', 'http://quotes.toscrape.com/page/7', 'http://quotes.toscrape.com/page/8', 'http://quotes.toscrape.com/page/9', 'http://quotes.toscrape.com/page/10']


2019-10-29 14:32:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-10-29 14:32:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.ref

over: http://quotes.toscrape.com/page/8/
over: http://quotes.toscrape.com/page/2/
over: http://quotes.toscrape.com/page/7/
over: http://quotes.toscrape.com/page/6/


2019-10-29 14:32:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2019-10-29 14:32:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/10/> (referer: None)
2019-10-29 14:32:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/5/> (referer: None)


over: http://quotes.toscrape.com/page/1/
over: http://quotes.toscrape.com/page/10/
over: http://quotes.toscrape.com/page/5/


2019-10-29 14:32:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/page/3/> from <GET http://quotes.toscrape.com/page/3>
2019-10-29 14:32:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/4/> (referer: None)


over: http://quotes.toscrape.com/page/4/


2019-10-29 14:32:20 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET http://quotes.toscrape.com/page/9/> from <GET http://quotes.toscrape.com/page/9>
2019-10-29 14:32:22 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/9/> (referer: None)


over: http://quotes.toscrape.com/page/9/


2019-10-29 14:32:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/3/> (referer: None)
2019-10-29 14:32:24 [scrapy.core.engine] INFO: Closing spider (finished)
2019-10-29 14:32:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4812,
 'downloader/request_count': 20,
 'downloader/request_method_count/GET': 20,
 'downloader/response_bytes': 29347,
 'downloader/response_count': 20,
 'downloader/response_status_count/200': 10,
 'downloader/response_status_count/301': 10,
 'elapsed_time_seconds': 9.315847,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 10, 29, 6, 32, 24, 552600),
 'log_count/DEBUG': 20,
 'log_count/INFO': 10,
 'response_received_count': 10,
 'scheduler/dequeued': 20,
 'scheduler/dequeued/memory': 20,
 'scheduler/enqueued': 20,
 'scheduler/enqueued/memory': 20,
 'start_time': datetime.datetime(2019, 10, 29, 6, 32, 15, 236753)}
2019-10-29 14:32:24 [scrapy.core.engine] INFO: Spider closed 

over: http://quotes.toscrape.com/page/3/
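
The stats in the log above report 20 requests for 10 pages, ten of them answered with status 301: each URL built without a trailing slash was redirected to its slash-terminated form. A minimal sketch of building the URL list with the trailing slash follows. It should avoid the extra round trip, assuming the site keeps redirecting only the slash-less form; `PAGE_URL` and `build_page_urls` are hypothetical names introduced for the example.

```python
# URL template with a trailing slash (hypothetical helper names).
PAGE_URL = "http://quotes.toscrape.com/page/{}/"


def build_page_urls(first_page, last_page):
    """Return the listing-page URLs for first_page..last_page, inclusive."""
    return [PAGE_URL.format(page) for page in range(first_page, last_page + 1)]


urls = build_page_urls(1, 10)
print(urls[0])
print(len(urls))
```

The resulting list could be assigned to `self.urls` in `__init__` in place of the loop.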


In [6]:
pd.read_csv('demo1_quotesAdj.csv',index_col=0,encoding='GBK')

| author | text | tags |
|---|---|---|
| Alfred Tennyson | “If I had a flower for every time I thought of... | ['friendship', 'love'] |
| Charles Bukowski | “Some people never go crazy. What truly horrib... | ['humor'] |
| Terry Pratchett | “The trouble with having an open mind, of cour... | ['humor', 'open-mind', 'thinking'] |
| Dr. Seuss | “Think left and think right and think low and ... | ['humor', 'philosophy'] |
| J.D. Salinger | “What really knocks me out is a book that, whe... | ['authors', 'books', 'literature', 'reading', ... |
| ... | ... | ... |
| Dr. Seuss | “Today you are You, that is truer than true. T... | ['comedy', 'life', 'yourself'] |
| Albert Einstein | “If you want your children to be intelligent, ... | ['children', 'fairy-tales'] |
| J.K. Rowling | “It is impossible to live without failing at s... | [] |
| Albert Einstein | “Logic will get you from A to Z; imagination w... | ['imagination'] |
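
The `tags` column above holds the string form of a Python list, because `csv.DictWriter` converts non-string values with `str()` before writing them. One possible way to turn such a cell back into a real list is `ast.literal_eval`, sketched below with made-up rows and an in-memory buffer instead of the CSV file.

```python
import ast
import csv
import io

# Made-up rows with the same columns as the crawler's CSV output.
rows = [
    {"author": "Author A", "text": "First sample quote.", "tags": ["life", "humor"]},
    {"author": "Author B", "text": "Second sample quote.", "tags": []},
]

# Write the rows to an in-memory CSV; the tags list is stored as its str() form.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["author", "text", "tags"])
writer.writeheader()
writer.writerows(rows)

# Read the rows back and parse each tags cell into a list again.
buffer.seek(0)
restored = []
for row in csv.DictReader(buffer):
    # literal_eval only evaluates Python literals, so it does not run arbitrary code.
    row["tags"] = ast.literal_eval(row["tags"])
    restored.append(dict(row))

print(restored == rows)
```

With pandas, the same function could be applied to the column after loading, for example through the `converters` argument of `pd.read_csv`.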
