<h1 align=center>Introduction to Data Science - The Way of Python</h1>

<h1 align=center>Lesson 5: Data Collection - Python Web Crawling in Practice II</h1>

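## Overview
Next, we walk through a crawler example that is more complex but closer to real-world use.

This time we crawl the list of newly founded companies on IT桔子 (ITjuzi), that is, the first 10 pages of http://www.itjuzi.com/company?sortby=foundtime&page=1 .

The fields of interest are the company name, company category, founding date, province, and latest funding status. The results are saved to a file in CSV format. The spiders in this lesson write the first four of these fields.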

> CSV is a common format for storing tabular data. Once the crawl finishes, the resulting CSV file can be opened directly in Excel.
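
As a minimal sketch of what such a file contains, the snippet below writes one row with `csv.DictWriter` and reads it back with `csv.DictReader`. It uses an in-memory buffer rather than a real file. The helper names `rows_to_csv` and `csv_to_rows` are hypothetical; the column names match the spiders in this lesson, and the sample row is taken from the API output shown at the end of the lesson.

```python
import csv
import io

FIELDNAMES = ["name", "type", "date", "province"]


def rows_to_csv(rows):
    """Serialize a list of dicts to CSV text, starting with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def csv_to_rows(text):
    """Parse CSV text back into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text, newline="")))


sample = [{"name": "六街聚将", "type": "企业服务",
           "date": "2019-10-01", "province": "上海"}]
csv_text = rows_to_csv(sample)
print(csv_text)
print(csv_to_rows(csv_text))
```

Each dict becomes one comma-separated line, and the header row is what lets `DictReader` (or a spreadsheet program) label the columns.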

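## Notes
This crawler is more complex than the earlier examples. On a real website, the parsed values often need some string processing before they are usable: pages commonly contain extra spaces and line breaks for layout purposes, and these characters must be stripped from the scraped text.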
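
The clean-up step can be sketched as a small helper. This is only an illustration: `clean_text` is a hypothetical name, and treating a missing value (`None`) as an empty string is an assumption about how missing cells should be handled.

```python
def clean_text(raw):
    """Remove all whitespace from a scraped string; treat None as empty."""
    if raw is None:
        return ""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines), so joining the pieces drops all of it.
    return "".join(raw.split())


print(clean_text("\n\t  企业服务 \n"))
print(clean_text("2019-10-01\t\n"))
print(repr(clean_text(None)))
```

Note that this also removes whitespace inside the value, which is fine for short fields such as a category or a date but would not suit free text such as a description.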


In [1]:
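import csv
import os

import scrapy


class MySpider(scrapy.Spider):

    name = "spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.file = open('demo2_newCompanies.csv', 'w',
                         encoding='GBK', newline='')
        self.csvWriter = csv.DictWriter(
            self.file, fieldnames=['name', 'type', 'date', 'province'])
        # Write the column names as the first row
        self.csvWriter.writeheader()

        # Build the list of pages to crawl: pages 1 through 10
        self.urls = []
        for i in range(1, 11):
            self.urls.append(
                'http://www.itjuzi.com/company?sortby=foundtime&page=' + str(i))
        print(self.urls)

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # parse() is called each time a request receives its response
    def parse(self, response):

        # print(response.body)

        # Extract the list of companies
        companies = response.css(".list-main-icnset.company-list-ul li")

        for company in companies:
            # Parse the company name
            name = company.css(".title span::text").extract_first()
            # Skip entries without a company name, such as the table header row
            if name is None:
                continue

            # Parse the company category; default='' avoids calling
            # .replace() on None when the cell is missing
            category = company.css(".cell.classify::text").extract_first(default='')
            # Remove the spaces, newlines and tabs present in the page
            category = category.replace('\t', '').replace('\n', '').replace(' ', '')
            # Parse the founding date
            date = company.css(".date::text").extract_first(default='')
            # Remove the spaces, newlines and tabs present in the page
            date = date.replace('\t', '').replace('\n', '').replace(' ', '')
            # Parse the province
            province = company.css(".cell.place::text").extract_first(default='')
            province = province.replace('\t', '').replace(
                '\n', '').replace(' ', '')
            # Build the row as a dict
            item = {"name": name, "type": category, "date": date,
                    "province": province}

            # Write the row to the file in CSV format
            self.csvWriter.writerow(item)

        # Flush promptly; otherwise the file contents may lag behind
        self.file.flush()
        os.fsync(self.file.fileno())
        # Print the URL of the page just parsed. This only serves as a
        # progress indicator and is not part of the program logic.
        print("over: " + response.url)

    def closed(self, reason):
        # Scrapy calls this when the spider finishes; release the file handle
        self.file.close()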
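
The listing pages differ only in the `page` query parameter, so the URL list can be generated. The sketch below shows one way to do this with `urllib.parse.urlencode`; `build_page_urls` is a hypothetical helper, not part of Scrapy. Because `range` excludes its end value, covering pages 1 through 10 needs an end of 11.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.itjuzi.com/company"


def build_page_urls(first_page, last_page):
    """Return the listing URLs for pages first_page..last_page, inclusive."""
    urls = []
    # range() excludes its end value, so add 1 to include last_page
    for page in range(first_page, last_page + 1):
        query = urlencode({"sortby": "foundtime", "page": page})
        urls.append(BASE_URL + "?" + query)
    return urls


urls = build_page_urls(1, 10)
print(len(urls))
print(urls[0])
print(urls[-1])
```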

In [1]:
# Variant: instead of parsing HTML, POST to the site's JSON API
# and read the fields directly from the JSON response.
import csv
import json
import os

import scrapy


class MySpider(scrapy.Spider):

    name = "spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.file = open('demo2_newCompanies.csv', 'w',
                         encoding='GBK', newline='')
        self.csvWriter = csv.DictWriter(
            self.file, fieldnames=['name', 'type', 'date', 'province'])
        self.csvWriter.writeheader()

        # Set the list of URLs to crawl (a single API endpoint here)
        self.urls = []
        self.urls.append('https://www.itjuzi.com/api/companys')
        print(self.urls)

    def start_requests(self):
        for url in self.urls:
            # range(1, 2) requests page 1 only; use range(1, 11) for pages 1-10
            for pagenum in range(1, 2):
                form_data = {"pagetotal": 0, "total": 0, "per_page": 20, "page": pagenum, "scope": "",
                             "sub_scope": "", "round": [], "location": "", "prov": "", "city": [],
                             "status": "", "sort": "", "selected": "", "year": [], "hot_city": "",
                             "com_fund_needs": "", "keyword": ""}
                # The body is JSON, so declare that in the Content-Type header
                yield scrapy.Request(url=url, method='POST',
                                     headers={'Content-Type': 'application/json'},
                                     body=json.dumps(form_data),
                                     callback=self.parse)

    # parse() is called each time a request receives its response
    def parse(self, response):

        # print(json.loads(response.body)['data']['data'])
        records = json.loads(response.body)['data']['data']

        for record in records:
            item = {'name': record['name'],
                    'type': record['scope'],
                    'province': record['prov'],
                    'date': record['agg_born_time']}
            self.csvWriter.writerow(item)
        self.file.flush()
        os.fsync(self.file.fileno())

    def closed(self, reason):
        # Scrapy calls this when the spider finishes; release the file handle
        self.file.close()
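
The mapping from the API response to CSV rows can be tried without any network access. The sketch below is an assumption-laden illustration: `records_to_rows` is a hypothetical helper, the sample body is an abbreviated version of the record printed in the output at the end of this lesson, and falling back to an empty string for a missing key is a choice made here, not something the API guarantees.

```python
import json


def records_to_rows(body):
    """Map the JSON body (records under data -> data) to the four CSV columns."""
    records = json.loads(body)["data"]["data"]
    rows = []
    for record in records:
        rows.append({
            "name": record.get("name", ""),
            "type": record.get("scope", ""),
            "province": record.get("prov", ""),
            # agg_born_time is the founding date already formatted as text
            "date": record.get("agg_born_time", ""),
        })
    return rows


# Abbreviated sample with the same nesting as the real response
sample_body = json.dumps({"data": {"data": [
    {"name": "六街聚将", "scope": "企业服务", "prov": "上海",
     "agg_born_time": "2019-10-01", "round": "天使轮"},
]}})

rows = records_to_rows(sample_body)
print(rows)
```

Keeping this mapping in a plain function makes it easy to check the field names against a saved response before running the full crawl.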

In [2]:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()  # starts the whole crawl and blocks until it finishes

2019-10-29 13:54:00 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: scrapybot)
2019-10-29 13:54:00 [scrapy.utils.log] INFO: Versions: lxml 4.3.2.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c  28 May 2019), cryptography 2.6.1, Platform Windows-10-10.0.18362-SP0
2019-10-29 13:54:00 [scrapy.crawler] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2019-10-29 13:54:00 [scrapy.extensions.telnet] INFO: Telnet Password: 0784f2cc3e393134
2019-10-29 13:54:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']


['https://www.itjuzi.com/api/companys']


2019-10-29 13:54:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-10-29 13:54:01 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.ref

[{'id': 35369418, 'register_name': '上海六街聚将品牌管理有限公司', 'sec_name': '', 'name': '六街聚将', 'logo': 'https://cdn.itjuzi.com/images/d6bfa76567df2bab4660d47f75bdae07.png?imageView2/0/q/100', 'slogan': '品牌管理服务商', 'des': '六街聚将是一家品牌管理服务商，主要经营范围是品牌管理，文化艺术交流策划,日用品、纺织品、皮革制品、工艺品（象牙及其制品除外）、文化用品、包装材料、服装及辅料、鞋帽箱包、陶瓷制品、床上用品、眼镜、电子产品的销售、服,装设计,商务咨询,市场营销策划咨询。', 'scope': '企业服务', 'sub_scope': '综合企业服务', 'round': '天使轮', 'born_time': 1569859200, 'agg_born_time': '2019-10-01', 'update_time': 1572328291, 'year': 2019, 'month': 10, 'location': 'in', 'prov': '上海', 'city': '黄浦区', 'status': '运营中', 'total_money': '100万人民币', 'famous_com': False, 'famous_school': False, 'famous_invst': True, 'hot_news_count': 0, 'invse_interval': -1, 'unicorn': False, 'horse': 0, 'tag': [{'tag_name': '企业服务', 'tag_id': 737}, {'tag_name': '综合企业服务', 'tag_id': 744}, {'tag_name': '品牌管理', 'tag_id': 3697}, {'tag_name': '销售与营销', 'tag_id': 7426}], 'term_tag': [{'tag_name': '企业服务', 'tag_id': 737}, {'tag_name': '综合企业服务', 'tag_id': 744}, {'tag_name': '