# 1. Installing Scrapy
---


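Installation:
- On Windows, run: `conda install -c conda-forge scrapy`
- On macOS and other systems: `pip install Scrapy`

To download from the Tsinghua PyPI mirror, pass its address with `-i`. The example below installs `jieba`; replace the package name with `scrapy` to install Scrapy the same way:
`pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple`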

In [3]:
pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Note: you may need to restart the kernel to use updated packages.




# Checking that Scrapy is installed
If the import below finishes without an error, Scrapy is installed.
> `import scrapy`

In [2]:
import scrapy
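
The same check can be done without triggering an `ImportError`, by asking Python whether it can locate the package. This is a minimal sketch that uses only the standard library; the helper name `is_installed` is made up for this illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Prints True when Scrapy is available in the current environment.
    print("scrapy installed:", is_installed("scrapy"))
```

If this prints `False` inside Jupyter, the notebook kernel is probably using a different Python environment from the one where Scrapy was installed.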

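# 2. A first taste of Scrapy

Open the [Renmin University of China (RUC) news site](https://news.ruc.edu.cn/archives/category/important_news),
which shows a list of news items:
![RUC news list](ruc_news_homepage.png)

Question: how can we extract the news list from this page, including each item's title, date, and URL?


Try the code below: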

In [4]:

#import the Scrapy package
import scrapy

#Define a class that inherits from scrapy.Spider; the class wraps one spider.
class RucNewsSpider(scrapy.Spider):

    #the name of this spider is RucNews
    name = "RucNews"

    #define start_requests; the method name and signature must match exactly:
    def start_requests(self):
        #the URL of the page to crawl
        url = "http://news.ruc.edu.cn/archives/category/important_news"

        #send the request and have self.parse handle the downloaded page
        yield scrapy.Request(url=url, callback=self.parse)

    #this is the callback passed to scrapy.Request
    #its signature is fixed; response wraps the downloaded content
    def parse(self, response):
        for news in response.css('div.content_col_2_list ul li'):
            #the keys are "title" (标题), "link" (链接) and "date" (日期)
            yield {
                '标题': news.css('a::text').extract_first(),
                '链接':  news.css('a::attr("href")').extract_first(),
                "日期": news.css("span::text").extract_first()
            }


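## Running the code
- Running this code in a Jupyter cell produces no output.
- That is because the cell only defines a class; nothing tells Scrapy to run it.
- So how do we run it?

- Save the code above to a .py file, for example scrapy_00_StartExample.py.
Then open a command line, change to the directory that contains the .py file, and have scrapy run the file:

Run in the current directory:

> scrapy runspider scrapy_00_StartExample.py

Alternatively, run the scrapy command directly from Jupyter with the `!` prefix:
> !scrapy runspider scrapy_00_StartExample.py

Note: the file is run with `scrapy`, not with `python`.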


In [4]:
!scrapy runspider scrapy_00_StartExample.py

2022-11-09 21:25:53 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-09 21:25:53 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-11-09 21:25:53 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-09 21:25:53 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2022-11-09 21:25:53 [scrapy.extensions.telnet] INFO: Telnet Password: 6fa2954a057191fd
2022-11-09 21:25:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-11-09 21:25:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.ht

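The `-o` option writes the scraped items to a file, here in CSV (comma-separated values) format:

```
#-o news.csv writes the output to that file
!scrapy runspider scrapy_00_StartExample.py -o news.csv
```
- After the command finishes, news.csv appears in the same directory as the .py file.
- The .csv file can be opened in a text editor such as Notepad, where you can see the scraped titles and other fields.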



In [None]:
!scrapy runspider scrapy_00_StartExample.py -o news.csv

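- If you open the file in Excel, the text may be garbled. The cause is a character-encoding mismatch.
- Scrapy saves the CSV as UTF-8 by default, whereas Excel on a Chinese-language Windows system assumes the system's Chinese code page (GBK, a superset of GB2312) for a CSV file that has no byte-order mark.
- You can tell Scrapy to save the file in a Chinese encoding instead:
> ``` !scrapy runspider scrapy_00_StartExample.py -o news.csv -s FEED_EXPORT_ENCODING=gb2312```

Excel then opens the file correctly. Note that GB2312 covers only the most common characters; if the scraped text may contain rarer ones, a larger encoding such as `gb18030` is the safer choice.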
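
The encoding mismatch can be reproduced in plain Python. The sketch below is only an illustration: the sample string and the helper `can_encode` are made up, and it shows the general mechanism rather than what Excel does internally.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


title = "新闻标题"

# UTF-8 bytes read back with the wrong codec turn into unrelated characters.
garbled = title.encode("utf-8").decode("gbk", errors="replace")

# Reading with the same codec that was used for writing restores the text.
restored = title.encode("gb2312").decode("gb2312")

if __name__ == "__main__":
    print(garbled == title)    # False
    print(restored == title)   # True
    # "镕" is a character that GBK and GB18030 contain but GB2312 does not.
    print(can_encode("镕", "gb2312"), can_encode("镕", "gb18030"))
```

The last line is the reason a larger encoding can be preferable: a single character outside GB2312 is enough to make the export fail.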

In [5]:
!scrapy runspider scrapy_00_StartExample.py -o news.csv -s FEED_EXPORT_ENCODING=gb2312

2022-11-09 21:26:15 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-09 21:26:15 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-11-09 21:26:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-09 21:26:15 [scrapy.crawler] INFO: Overridden settings:
{'FEED_EXPORT_ENCODING': 'gb2312', 'SPIDER_LOADER_WARN_ONLY': True}
2022-11-09 21:26:15 [scrapy.extensions.telnet] INFO: Telnet Password: 5b444eddfc40cc92
2022-11-09 21:26:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-11-09 21:26:16 [scrapy.mid

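# 3. The Scrapy framework and how a crawl runs

The code above is remarkably short: a few lines are enough to collect the whole news list.

## The Scrapy framework
![Framework diagram](scrapy_framework.png)


## Execution flow

- When you run `scrapy runspider *.py`, Scrapy finds the Spider definition in the file (the class we defined) and runs it through its crawling engine.

- The engine calls the class's `start_requests` method to create the requests. Once a page has been downloaded, it calls the callback, `parse` by default, and passes the response object to it as the argument.

- Inside that function, CSS selectors extract the content.

![Framework diagram](scrapy_framework_2.png)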
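
To make the flow concrete, here is a plain-Python toy that mimics the three steps: take the start requests, fetch each page, and hand the response to the callback. It is not Scrapy's engine, which is asynchronous and adds scheduling, middleware, and duplicate filtering; every name in it (`run_toy_engine`, `fake_fetch`) is invented for this sketch.

```python
from collections import deque


def run_toy_engine(start_requests, fetch):
    """Drive callbacks in the order described above and return the items.

    start_requests: iterable of (url, callback) pairs.
    fetch: function that maps a URL to a response (here just a string).
    """
    queue = deque(start_requests)
    items = []
    while queue:
        url, callback = queue.popleft()
        response = fetch(url)
        for result in callback(response):
            if isinstance(result, tuple):
                # A (url, callback) pair is treated as a follow-up request.
                queue.append(result)
            else:
                # Anything else is treated as an extracted item.
                items.append(result)
    return items


def parse(response):
    # Stand-in for CSS extraction: one item per line of the fake page.
    for line in response.splitlines():
        yield {"title": line}


def fake_fetch(url):
    # No network access: every URL returns the same two-line "page".
    return "first headline\nsecond headline"


if __name__ == "__main__":
    print(run_toy_engine([("http://example.com/news", parse)], fake_fetch))
```

The point of the toy is the division of labour: the spider only says which URLs to fetch and how to parse a response, while the engine owns the loop.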

```python
import scrapy

#Define a class that inherits from scrapy.Spider; the class wraps one spider.
class RucNewsSpider(scrapy.Spider):

    #the name of this spider is RucNews
    name = "RucNews"

    #define start_requests; the method name and signature must match exactly:
    def start_requests(self):
        #the URL of the page to crawl
        url = "http://news.ruc.edu.cn/archives/category/important_news"

        #send the request and have self.parse handle the downloaded page
        yield scrapy.Request(url=url, callback=self.parse)

    #this is the callback passed to scrapy.Request
    #its signature is fixed; response wraps the downloaded content
    def parse(self, response):
        for news in response.css('div.content_col_2_list ul li'):
            #the keys are "title" (标题), "link" (链接) and "date" (日期)
            yield {
                '标题': news.css('a::text').extract_first(),
                '链接':  news.css('a::attr("href")').extract_first(),
                "日期": news.css("span::text").extract_first()
            }
```

## Scrapy documentation

- https://docs.scrapy.org/en/latest/
- https://docs.scrapy.org/en/latest/intro/tutorial.html


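# 4. Scrapy in practice

## Fetching a page and extracting its content

- Open the article [有温度的人工智能：为科教兴国、人才强国贡献人大智慧！](https://news.ruc.edu.cn/archives/407847)

- The key content on this page:

    - Title: 有温度的人工智能：为科教兴国、人才强国贡献人大智慧！
    - Published: 2022-11-02 11:32:34
    - Views: 11,684
    - Source: 党委宣传部 高瓴人工智能学院
    - Editor: 徐 小婷
    - Body: 习近平总书记在党的二十大报告中指出“加强基础学科、新兴学科、交叉学科建设”，强调“必须坚持科技是第一生产力、人才是第一资源、创新是第一动力，深入实施科教兴国战略、人才强国战略、创新驱动发展战略”。...
- The question: how do we extract these pieces of information?


### Downloading the page content

#### Downloading the HTML of one page
Let's try downloading the entire page.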

In [6]:
import scrapy

class RucNewsSpider(scrapy.Spider):
    name = "RucNews"

    #note how this code differs from the earlier StartExample
    def start_requests(self):
        urls = [
            'https://news.ruc.edu.cn/archives/407847'   
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)    

    def parse(self, response):
        page = response.url.split("/")[-1]
        filename ="ruc_news_" + page + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        print('Saved file ' +  filename)

Save this as scrapy_01_crawl_page.py and run `!scrapy runspider scrapy_01_crawl_page.py`.
A new file appears in the directory, named after the last segment of the URL: ruc_news_407847.html for the URL above.
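
The file name comes from splitting the URL on `/` and keeping the last piece. The sketch below isolates that step in a small helper (the name `filename_for` is made up here) and adds one guard that the spider above does not have: a trailing slash would otherwise leave an empty last segment.

```python
def filename_for(url, prefix="ruc_news_", suffix=".html"):
    """Build a file name from the last path segment of a URL."""
    # rstrip("/") guards against a trailing slash, which would otherwise
    # make the last segment an empty string.
    page = url.rstrip("/").split("/")[-1]
    return prefix + page + suffix


if __name__ == "__main__":
    print(filename_for("https://news.ruc.edu.cn/archives/407847"))
```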


In [7]:
!scrapy runspider scrapy_01_crawl_page.py

Saved file ruc_news_263466.html


2022-11-09 22:11:49 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-09 22:11:49 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-11-09 22:11:49 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-09 22:11:49 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2022-11-09 22:11:49 [scrapy.extensions.telnet] INFO: Telnet Password: 60151579062d958a
2022-11-09 22:11:49 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-11-09 22:11:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.ht

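#### The Request and Response objects


![Request](request.png)


For the Response documentation, see: https://docs.scrapy.org/en/latest/topics/request-response.html

Attributes of the Response object:
- url (str): the URL of the response
- status (int): the HTTP status code, e.g. 200 or 404
- body (bytes): the raw HTML of the page; for the decoded Unicode version use TextResponse.text
- text (str): the HTML of the page as a string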


![Response](request.png)


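### Crawling multiple pages

- Extend the previous code so that it also downloads https://news.ruc.edu.cn/archives/408414


        urls = [
                'https://news.ruc.edu.cn/archives/407847',
                #add the line below:
                'https://news.ruc.edu.cn/archives/408414'
            ]
- To crawl more pages, keep adding URLs to urls.

- A single run now downloads every page in the list.

- Try adding more URLs yourself.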
--- 

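### Using start_urls
- Because start_requests so often just turns a list of URLs into requests, Scrapy offers a shorthand: the start_urls class attribute.
- When a spider defines start_urls and no start_requests of its own, the default start_requests creates one request per URL with parse as the callback, so the two forms shown below are equivalent.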

In [None]:
import scrapy


class RucNewsSpider(scrapy.Spider):
    name = "RucNews"

    #def start_requests(self):
    #    urls = [
    #        'https://news.ruc.edu.cn/archives/407847',
    #        'https://news.ruc.edu.cn/archives/408414'         
    #    ]
    #    for url in urls:
    #        yield scrapy.Request(url=url, callback=self.parse)

    #shorthand form
    start_urls = [
         'https://news.ruc.edu.cn/archives/407847',
         'https://news.ruc.edu.cn/archives/408414'  
    ]

    def parse(self, response):
        page = response.url.split("/")[-1]
        filename ="ruc_news_" + page + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        print('Saved file ' +  filename)

In [2]:
!scrapy runspider scrapy_01_crawl_page_start_urls.py

Saved file ruc_news_408414.html
Saved file ruc_news_263466.html


2022-11-10 08:53:37 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-10 08:53:37 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-11-10 08:53:37 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-10 08:53:37 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2022-11-10 08:53:37 [scrapy.extensions.telnet] INFO: Telnet Password: 88c85beab8dc0e2f
2022-11-10 08:53:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-11-10 08:53:38 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.ht

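### Using Response.text

- response.body is a byte string (bytes), so the file has to be opened in binary mode with the wb option.
- response.text holds the same content decoded to str. To save it, open the file in text mode (t instead of b) and state the encoding explicitly; otherwise Python uses the platform default, which on Chinese-language Windows is typically GBK and can fail on characters outside it.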
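
The difference between the two modes can be tried out without Scrapy. In this sketch the two variables stand in for `response.text` and `response.body`, under the assumption that the page is encoded as UTF-8; the markup and file names are placeholders.

```python
import os
import tempfile

text = "<title>新闻</title>"       # stand-in for response.text (str)
body = text.encode("utf-8")        # stand-in for response.body (bytes)

with tempfile.TemporaryDirectory() as folder:
    binary_path = os.path.join(folder, "page_binary.html")
    text_path = os.path.join(folder, "page_text.html")

    # Binary mode writes the bytes unchanged.
    with open(binary_path, "wb") as f:
        f.write(body)

    # Text mode encodes the str while writing; newline="" leaves line
    # endings untouched.
    with open(text_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    with open(binary_path, "rb") as f:
        binary_bytes = f.read()
    with open(text_path, "rb") as f:
        text_bytes = f.read()

print(binary_bytes == text_bytes)  # True
```

Both routes produce the same file as long as the encoding given in text mode matches the encoding of the page.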

Save the following code to scrapy_01_crawl_page_start_urls_textmode.py

In [None]:
import scrapy


class RucNewsSpider(scrapy.Spider):
    name = "RucNews"

    #shorthand form
    start_urls = [
         'https://news.ruc.edu.cn/archives/407847',
         'https://news.ruc.edu.cn/archives/408414'
    ]

    def parse(self, response):
        page = response.url.split("/")[-1]
        filename ="ruc_news_" + page + '.txt'
        #text mode; the explicit encoding avoids depending on the platform default
        with open(filename, 'wt', encoding='utf-8') as f:
            f.write(response.text)
        print('Saved file ' +  filename)

In [6]:
!scrapy runspider scrapy_01_crawl_page_start_urls_textmode.py

Saved file ruc_news_408414.txt
Saved file ruc_news_263466.txt


2022-11-10 09:22:56 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-10 09:22:56 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-11-10 09:22:56 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-10 09:22:56 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2022-11-10 09:22:56 [scrapy.extensions.telnet] INFO: Telnet Password: d7103c584dd3315d
2022-11-10 09:22:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-11-10 09:22:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.ht

## Class exercise 1

- Fetch the page http://ai.ruc.edu.cn/newslist/newsdetail/391c706031504058a9a11c5eddc513c6.htm and save its HTML



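# Information extraction
We have now saved whole pages, but recall the earlier question:
- How do we keep only part of a page, for example the title, the time, the source, the editor, or the body text?
- How do we keep only the text, with the HTML markup stripped?

![RUC news page](ruc_news_page.png)


## Understanding HTML


HTML is not a programming language; it is a markup language that defines the structure of content. HTML consists of a series of **elements**, which enclose different parts of the content so that they appear or behave in a certain way. A pair of tags can turn a piece of text or an image into a hyperlink, set text in italics, change the font size, and so on.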


```html
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <base href="http://airs2016.ruc.edu.cn/" >
    <title>
        AIRS 2016, Beijing, China, November 30 – December 2, 2016
    </title>  
</head>
<body role="document">    
    
    <div class="navbar navbar-default navbar-fixed-top" role="navigation">
        <!--navbar-inverse navbar-fixed-top-->
        <div class="container">            
            <div class="navbar-collapse collapse">
                <ul class="nav navbar-nav">                    
                    <li><a href="/cfp.aspx">CFP</a></li>
                    <li><a href="/dates.aspx">Dates</a></li>
                    <li><a href="/papers.aspx">Program</a></li>
                    <li><a href="/reg.aspx">Registration</a></li>
                    <li><a href="/keynote.aspx">Keynotes</a></li>
                    <li><a href="/committee.aspx">Committee</a></li>                    
                    <li><a href="/location.aspx">Venue</a></li>
                    <li><a href="/beijing.aspx">Beijing</a></li>
                </ul>                 
            </div>
            <!--/.navbar-collapse -->
        </div>
    </div>
        
        
    <div class="container">
        <div class="row">           
            <div class="col-xs-12 col-md-12  col-lg-7">
                <div class="text-center" style="margin:40px 0 20px;">            
                    <h2>Welcome to <span class="label label-success">AIRS 2016</span></h2>
                    <h2>
                        The 12th Asia Information Retrieval Societies Conference,
                        <br>
                        Nov. 30 – Dec. 2, 2016 at Tsinghua University, Beijing 
                        <br>
                        In cooperation with ACM SIGIR.
                    </h2>
                </div>

                <div>
                    <p>
                        The <a href="/">Asia Information Retrieval Societies Conference (AIRS)</a> aims to bring together researchers and developers
                        to exchange new ideas and latest achievements in the field of information retrieval (IR). 
                        The scope of the conference covers applications, systems, technologies and theory aspects of information retrieval
                        in text, audio, image, video and multimedia data. The Twelfth AIRS (AIRS 2016) will be hosted by the 
                        Chinese Information Processing Society of China and co-organized by Tsinghua University and Renmin University of China. 
                    </p>
                    <p>
                        We welcome submissions of original papers in the broad field of information retrieval. 

                        Please find the list of <a href="papers.aspx">Accepted papers</a>.
                    </p>    

                    <h2>This sentence is not included in the p element.</h2>

                    <p>
                        Accepted papers will be published as part of the LNCS series from Springer, and will be EI-indexed.
                        Details about relevant topics, publication format and submission deadlines can be found in the 
                        <a href="cfp.aspx">Call for Papers</a>.
                    </p>
                </div>
            </div>
        </div>
    </div>

    <script src="/Scripts/ie10-viewport-bug-workaround.js"></script>
    <script src="/Content/bootstrap/js/bootstrap.min.js"></script>

</body>
</html>
``` 


![html page example](html_page_example.png)

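## Viewing the page source

- Use the browser's "View page source" command, or press F12 to open the developer tools, to see the source of a page
- Open https://news.ruc.edu.cn/archives/407847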

``` html
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="format-detection" content="telephone=no">
<title>
有温度的人工智能：为科教兴国、人才强国贡献人大智慧！ - 中国人民大学新闻网 | NEWS of RUC</title>   
```


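- Question: how do we get the value inside the title element?
- First find the opening tag, then find the closing tag
- Then take the text between them with a string slice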
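
The steps above can be sketched as a small standalone function. The helper name `text_between` and the sample HTML are invented for illustration; unlike the spider that follows, the helper returns `None` when a marker is missing instead of slicing from a wrong position.

```python
def text_between(html, start_marker, end_marker):
    """Return the stripped text between two markers, or None if one is missing."""
    start = html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    # Look for the end marker only after the start marker.
    end = html.find(end_marker, start)
    if end == -1:
        return None
    return html[start:end].strip()


sample = "<html><head><title>\n Example news page </title></head></html>"

if __name__ == "__main__":
    print(text_between(sample, "<title>", "</title>"))
```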

In [11]:
import scrapy


class RucNewsSpider(scrapy.Spider):
    name = "RucNews"

    #shorthand form
    start_urls = [
         'https://news.ruc.edu.cn/archives/407847',
         'https://news.ruc.edu.cn/archives/408414'  
    ]

    def parse(self, response):
        page = response.url.split("/")[-1]
        filename ="ruc_news_" + page + '_content.txt'

        html = response.text
        start = html.find("<title>") + len("<title>")
        end = html.find("</title>")

        title = html[start:end]

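        #strip whitespace from both ends of the string
        title = title.strip()

        #the explicit encoding avoids depending on the platform default
        with open(filename, 'wt', encoding='utf-8') as f: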
            f.write(title)
        print('Saved file ' +  filename)

In [12]:
!scrapy runspider scrapy_01_crawl_page_extract_content.py

Saved file ruc_news_407847_content.txt

2022-11-10 10:49:13 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-10 10:49:13 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-11-10 10:49:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-10 10:49:13 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2022-11-10 10:49:13 [scrapy.extensions.telnet] INFO: Telnet Password: 476375e78fd93830
2022-11-10 10:49:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-11-10 10:49:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.ht


Saved file ruc_news_408414_content.txt



2022-11-10 10:49:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://news.ruc.edu.cn/archives/408414> (referer: None)
2022-11-10 10:49:14 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-10 10:49:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 468,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 25810,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'elapsed_time_seconds': 0.580642,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 11, 10, 2, 49, 14, 714592),
 'httpcompression/response_bytes': 77069,
 'httpcompression/response_count': 2,
 'log_count/DEBUG': 2,
 'log_count/INFO': 10,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2022, 11, 10, 2, 49, 14, 133950)}
2022-11-10 10:49:14 [scrapy.cor

- Next, how do we get the publication time and the view count? Look at the HTML source again:
https://news.ruc.edu.cn/archives/407847

``` html
<div class="nc_title">有温度的人工智能：为科教兴国、人才强国贡献人大智慧！</div>
<div class="nc_subtitle"></div>
<div class="nc_meta">
    <span class="date">2022-11-02 11:32:34</span> 
    <div class="views">11,809 次浏览</div>
</div>
<div class="nc_author">来源：党委宣传部 高瓴人工智能学院</div>
<div style="display: inline-block; width: auto; margin-right: 25px;" class="nc_author">编辑：徐 小婷</div>
```

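- The same approach as for title works here: find the start and end positions of the content we want
- The news title can also be taken from this fragment instead of from the title element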

In [None]:
import scrapy


class RucNewsSpider(scrapy.Spider):
    name = "RucNews"

    #shorthand form
    start_urls = [
         'https://news.ruc.edu.cn/archives/407847',
         'https://news.ruc.edu.cn/archives/408414'  
    ]

    def parse(self, response):
        page = response.url.split("/")[-1]
        filename ="ruc_news_" + page + '_content.txt'

        html = response.text

        #<div class="nc_title">有温度的人工智能：为科教兴国、人才强国贡献人大智慧！</div>
        #<div class="nc_subtitle"></div>
        #<div class="nc_meta">
        #    <span class="date">2022-11-02 11:32:34</span> 
        #    <div class="views">11,809 次浏览</div>
        #</div>
        #<div class="nc_author">来源：党委宣传部 高瓴人工智能学院</div>


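        start = html.find("<div class=\"nc_title\">") + len("<div class=\"nc_title\">")
        #note the second argument of find: the search for </div> begins at start
        end = html.find("</div>", start)
        title = html[start:end]


        start = html.find("<span class=\"date\">") + len("<span class=\"date\">")
        #again, the second argument makes find look only after the opening tag
        end = html.find("</span>", start)
        newsdate = html[start:end]


        #strip whitespace from both ends of the strings
        title = title.strip()
        newsdate = newsdate.strip()

        #the explicit encoding avoids depending on the platform default
        with open(filename, 'wt', encoding='utf-8') as f:
            f.write(newsdate+"\t"+ title)
        print('Saved file ' +  filename)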

In [13]:
!scrapy runspider scrapy_01_crawl_page_extract_content_more.py

Saved file ruc_news_408414_content.txt
Saved file ruc_news_407847_content.txt


2022-11-10 11:11:23 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-10 11:11:23 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-11-10 11:11:23 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-10 11:11:23 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2022-11-10 11:11:24 [scrapy.extensions.telnet] INFO: Telnet Password: f6f47bf1c359d729
2022-11-10 11:11:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-11-10 11:11:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.ht

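## Class exercise 2: complete the code so that it also extracts the view count and the source

### How do we extract the body text?

![Body HTML](news_page_p.png)

- First locate the element that encloses the body text
- Then extract the paragraphs one by one and join them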
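
A minimal sketch of these two steps with string search only. It assumes the body sits in a `<div class="nc_body">` container (the class name used by the CSS selector later in these notes) and that paragraphs are plain `<p>...</p>` pairs without attributes or nested `div` elements; the sample HTML and the function name are made up.

```python
def extract_paragraphs(html, container_open, container_close="</div>"):
    """Return the text of each <p> inside the first matching container."""
    start = html.find(container_open)
    if start == -1:
        return []
    start += len(container_open)
    end = html.find(container_close, start)
    block = html[start:end] if end != -1 else html[start:]

    paragraphs = []
    pos = 0
    while True:
        p_start = block.find("<p>", pos)
        if p_start == -1:
            break
        p_start += len("<p>")
        p_end = block.find("</p>", p_start)
        if p_end == -1:
            break
        paragraphs.append(block[p_start:p_end].strip())
        # Continue searching after the paragraph just consumed.
        pos = p_end + len("</p>")
    return paragraphs


sample = (
    '<div class="nc_body"><p>First paragraph.</p>\n'
    "<p> Second paragraph. </p></div><p>Footer</p>"
)
body_text = "\n".join(extract_paragraphs(sample, '<div class="nc_body">'))

if __name__ == "__main__":
    print(body_text)
```

The number of assumptions needed even for this small case is a good illustration of why string search does not scale to real pages.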



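## A better way to extract

- Extracting with string search and slicing is tedious, and it breaks easily when the same tag occurs more than once
- Scrapy's CSS selectors and XPath selectors are the recommended alternative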


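# CSS selectors

## CSS
http://www.w3school.com.cn/cssref/css_selectors.asp

Cascading Style Sheets (CSS) is a language for describing the presentation of documents written in HTML (an application of SGML) or XML (a subset of SGML).

CSS can style a page statically, and scripts can also use it to restyle individual elements dynamically.

CSS gives pixel-precise control over the layout of elements and supports nearly every font and size setting. What matters for scraping is its selector syntax: the patterns CSS uses to pick the elements to style can equally be used to pick the elements to extract.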


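## Common CSS selectors

![css 1](css1.png)

![css 2](css2.png)


Key points:
- How do we select nodes of a given tag type?
- How do we select nodes with a given class?
- How do we select the node with a given id?
- How do we get the text inside a node?
- How do we get a link?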

## How do we use them?

Think about:
- How do we extract the title?
- How do we extract the date and time?


``` html
<div class="nc_title">有温度的人工智能：为科教兴国、人才强国贡献人大智慧！</div>
<div class="nc_subtitle"></div>
<div class="nc_meta">
    <span class="date">2022-11-02 11:32:34</span> 
    <div class="views">11,809 次浏览</div>
</div>
<div class="nc_author">来源：党委宣传部 高瓴人工智能学院</div>
<div style="display: inline-block; width: auto; margin-right: 25px;" class="nc_author">编辑：徐 小婷</div>
```

In [None]:
import scrapy

class RucNewsSpider(scrapy.Spider):
    name = "RucNews"

    def start_requests(self):
        urls = [
         'https://news.ruc.edu.cn/archives/407847',
         'https://news.ruc.edu.cn/archives/408414'          
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-1]
        filename ="ruc_news_" + page + '_css.txt'

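        #extract the title and the date; default="" avoids writing None when a node is missing
        title = response.css(".nc_title::text").extract_first(default="")
        date = response.css(".date::text").extract_first(default="")

        #extract the body paragraphs; extract() returns a list with one string per match
        paragraphs = response.css(".nc_body p::text").extract()

        #the explicit encoding avoids depending on the platform default
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(date)
            f.write("\t")
            f.write(title)
            f.write("\n")
            #write the paragraphs one after another
            for para in paragraphs:
                f.write(para)
                f.write("\n")
        print('Saved file ' +  filename)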

In [15]:
!scrapy runspider scrapy_02_crawl_page_css.py

Saved file ruc_news_408414_css.txt


2022-11-10 12:17:19 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-10 12:17:19 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-11-10 12:17:19 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-10 12:17:19 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2022-11-10 12:17:19 [scrapy.extensions.telnet] INFO: Telnet Password: d997b8f4f9835221
2022-11-10 12:17:19 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-11-10 12:17:19 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.ht

## Extracting items and passing them to the pipeline

```python
import scrapy

class RucNewsSpider(scrapy.Spider):
    name = "RucNews"

    def start_requests(self):
        urls = [
            'https://news.ruc.edu.cn/archives/407847',
            'https://news.ruc.edu.cn/archives/408414'           
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        title = response.css(".nc_title::text").extract_first()
        date = response.css(".date::text").extract_first()

        #join all paragraphs of the body into one string
        content = " ".join(response.css(".nc_body p::text").extract())

        #yield one item per page; each key names the value it holds
        yield {
                'title': title,
                'date': date,
                'content': content
            }
```




In [16]:
!scrapy runspider scrapy_02_crawl_page_css_yield.py

2022-11-10 12:29:57 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-10 12:29:57 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-11-10 12:29:57 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-10 12:29:57 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2022-11-10 12:29:57 [scrapy.extensions.telnet] INFO: Telnet Password: f3758abdb97883e4
2022-11-10 12:29:57 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-11-10 12:29:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.ht

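## Internal execution flow

- When you run `scrapy runspider *.py`, Scrapy finds the Spider definition in the file and runs it through its crawling engine.
- Crawling starts from the requests produced by start_requests (or from the URLs listed in start_urls). Once a page has been downloaded, Scrapy calls the callback, parse by default, and passes the response object to it as the argument.
- Inside that function, CSS selectors extract the content of the news page. The function runs once per page and yields one extracted item, which Scrapy hands on to the downstream item pipeline.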
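
For orientation, this is roughly what a component on the receiving end of those items looks like. In the Scrapy 2.x series used in these notes, an item pipeline is a plain class with a `process_item(self, item, spider)` method. The class below is a hypothetical example, not part of the course code, and it only takes effect once it is listed in the `ITEM_PIPELINES` setting.

```python
class StripWhitespacePipeline:
    """Item pipeline sketch: trims whitespace from every string field."""

    def process_item(self, item, spider):
        # Scrapy calls this once per yielded item; whatever is returned is
        # passed on to the next pipeline stage and then to the exporter.
        return {
            key: value.strip() if isinstance(value, str) else value
            for key, value in item.items()
        }


if __name__ == "__main__":
    pipeline = StripWhitespacePipeline()
    # spider is unused in this sketch, so None stands in for it.
    print(pipeline.process_item({"title": "  Sample headline \n"}, None))
```

Keeping clean-up steps like this out of `parse` lets the spider concentrate on locating content.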


### Class exercise 3: extract the title, date, and body text of news pages from the Gaoling School of Artificial Intelligence


- http://ai.ruc.edu.cn/newslist/newsdetail/391c706031504058a9a11c5eddc513c6.htm
- http://ai.ruc.edu.cn/newslist/newsdetail/44855dcfb835433486f3020c75ed4804.htm
- http://ai.ruc.edu.cn/newslist/newsdetail/20221104001.html

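## Scraping a list

![RUC news list](ruc_news_homepage.png)

Question: how can we extract the news list from this page, including each item's title, date, and URL?

### Logic
1. First select the element that holds each news item
2. Loop over these elements and extract the date, link, and title from each one; the process closely resembles the paragraph extraction shown earlier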

In [None]:
#import the Scrapy package
import scrapy

#Define a class that inherits from scrapy.Spider; the class wraps one spider.
class RucNewsSpider(scrapy.Spider):

    #the name of this spider is RucNews
    name = "RucNews"

    #define start_requests; the method name and signature must match exactly:
    def start_requests(self):
        #the URL of the page to crawl
        url = "http://news.ruc.edu.cn/archives/category/important_news"

        #send the request and have self.parse handle the downloaded page
        yield scrapy.Request(url=url, callback=self.parse)

    #this is the callback passed to scrapy.Request
    #its signature is fixed; response wraps the downloaded content
    def parse(self, response):
        for news in response.css('div.content_col_2_list ul li'):
            #default="" keeps .strip() from failing when a node is missing
            date = news.css(".date::text").extract_first(default="").strip()
            title = news.css("a::text").extract_first(default="").strip()
            url = news.css("a::attr('href')").extract_first(default="").strip()
            #the keys are "date" (日期), "title" (标题) and "URL" (网址)
            yield {"日期": date, "标题": title, "网址":url}


In [17]:
!scrapy runspider scrapy_03_crawl_newslist.py

2022-11-10 12:36:02 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-10 12:36:02 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-11-10 12:36:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-10 12:36:02 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2022-11-10 12:36:02 [scrapy.extensions.telnet] INFO: Telnet Password: fb66ace69a6a9b78
2022-11-10 12:36:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-11-10 12:36:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.ht

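Saving to a file

### Class exercise 4: scrape the date, title, and link of each RUC news item and store them in a file

- Read and understand the code below
- Modify the code so that it writes the extracted content to a file by hand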

```python
#import the Scrapy package
import scrapy

#Define a class that inherits from scrapy.Spider; the class wraps one spider.
class RucNewsSpider(scrapy.Spider):

    #the name of this spider is RucNews
    name = "RucNews"

    #define start_requests; the method name and signature must match exactly:
    def start_requests(self):
        #the URL of the page to crawl
        url = "http://news.ruc.edu.cn/archives/category/important_news"

        #send the request and have self.parse handle the downloaded page
        yield scrapy.Request(url=url, callback=self.parse)

    #this is the callback passed to scrapy.Request
    #its signature is fixed; response wraps the downloaded content
    def parse(self, response):
        # for news in response.css('div.content_col_2_list ul li'):
        #     yield {
        #         '标题': news.css('a::text').extract_first(),
        #         '链接':  news.css('a::attr("href")').extract_first(),
        #         "日期": news.css("span::text").extract_first()
        #     }
        #write one tab-separated line per news item: title, link, date
        with open("ruc_news_list.txt", 'w', encoding='utf-8') as f:
            for news in response.css('div.content_col_2_list ul li'):
                #default='' avoids passing None to f.write
                f.write(news.css('a::text').extract_first(default=''))
                f.write("\t")
                f.write(news.css('a::attr("href")').extract_first(default=''))
                f.write("\t")
                f.write(news.css("span::text").extract_first(default=''))
                f.write("\n")

```



In [19]:
!scrapy runspider scrapy_03_crawl_newslist_save2file.py

2022-11-10 12:44:33 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-10 12:44:34 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-11-10 12:44:34 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-10 12:44:34 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2022-11-10 12:44:34 [scrapy.extensions.telnet] INFO: Telnet Password: bcbcea33bef7438f
2022-11-10 12:44:34 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-11-10 12:44:34 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.ht

### Class exercise: crawling a list page

- Crawl https://www.ruc.edu.cn/schools-az
- Extract each school's name, website URL, and description
- Export the result in CSV or TSV format
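
For the export step, Python's `csv` module can produce TSV by changing the delimiter. The sketch below shows only the writing part with placeholder rows; the column names and the helper `rows_to_tsv` are assumptions, and the extraction itself is left as the exercise.

```python
import csv
import io


def rows_to_tsv(rows, header):
    """Serialize rows as tab-separated text using the csv module."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows; a real spider would collect these from the page.
rows = [("School A", "http://example.com/a", "An example description")]
tsv_text = rows_to_tsv(rows, ["name", "url", "intro"])

if __name__ == "__main__":
    print(tsv_text)
```

When writing to a real file, open it with `newline=""` as the `csv` documentation recommends, and consider `encoding="utf-8-sig"`, which adds a byte-order mark that lets Excel recognise the file as UTF-8.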


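## How do we crawl the next page?

- Goal: fetch the next page of the list
- Extract the URL of the next page from the current page
- Issue a new Request for that URL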
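
One detail matters before issuing the Request: the `href` of a next-page link may be relative, and a Request needs an absolute URL. The standard library's `urljoin` shows the idea, and Scrapy's `response.urljoin(href)` applies the same joining with the response's own URL as the base. The two `href` values below are hypothetical; the real page may use a different path.

```python
from urllib.parse import urljoin

current_page = "http://news.ruc.edu.cn/archives/category/important_news"

# Hypothetical href values, as they might appear in a "next page" link.
relative_href = "/archives/category/important_news/page/2"
absolute_href = "http://news.ruc.edu.cn/archives/category/important_news/page/2"

# A relative href is resolved against the current page ...
from_relative = urljoin(current_page, relative_href)
# ... while an absolute href passes through unchanged.
from_absolute = urljoin(current_page, absolute_href)

if __name__ == "__main__":
    print(from_relative)
    print(from_absolute)
```

Because an absolute URL is left untouched, joining is safe to apply in every case.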

In [28]:
import scrapy


class RucInfoSpider(scrapy.Spider):
    name = "info"

    #stop after this many list pages
    maxpage = 3
    #number of list pages processed so far
    page = 0

    def start_requests(self):
        urls = [
            'http://news.ruc.edu.cn/archives/category/important_news'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.page = self.page + 1
        #the message reads "collecting page N"
        print("正在采集第" + str(self.page) + "页")
        #append mode, so every page adds its lines to the same file
        with open("ruc_news_list_all.txt", 'a', encoding='utf-8') as f:
            for news in response.css('div.content_col_2_list ul li'):
                #default='' avoids passing None to f.write
                f.write(news.css('a::text').extract_first(default=''))
                f.write("\t")
                f.write(news.css('a::attr("href")').extract_first(default=''))
                f.write("\t")
                f.write(news.css("span::text").extract_first(default=''))
                f.write("\n")
        if self.page >= self.maxpage:
            return

        #extract the URL of the next page
        next_page = response.css(".content_col_2_nav_alignright a::attr('href')").extract_first()
        if next_page is not None:
            #urljoin makes a relative href absolute and leaves an absolute one unchanged
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, self.parse)

In [29]:
!scrapy runspider scrapy_03_crawl_newslist_nextpage.py

正在采集第1页
正在采集第2页
正在采集第3页


2022-11-10 12:53:33 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-11-10 12:53:33 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 3.4.8, Platform Windows-10-10.0.22000-SP0
2022-11-10 12:53:33 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-11-10 12:53:33 [scrapy.crawler] INFO: Overridden settings:
{'SPIDER_LOADER_WARN_ONLY': True}
2022-11-10 12:53:33 [scrapy.extensions.telnet] INFO: Telnet Password: 705f85d82de8ea6d
2022-11-10 12:53:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2022-11-10 12:53:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.ht