# 1. Fetching HTML pages with the requests library

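This section shows how to use the requests library to fetch the HTML source of a web page. requests is a package for constructing HTTP requests for network resources. Make sure requests is installed correctly before you continue.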

In [1]:
import requests

In [2]:
requests.__version__

'2.26.0'

The requests library provides the get() function, which constructs and sends a GET request.

In [3]:
help(requests.get)

Help on function get in module requests.api:

get(url, params=None, **kwargs)
    Sends a GET request.
    
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
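
The `params` argument described in the help text can be illustrated without sending anything over the network. The sketch below builds a request with `requests.Request` and prepares it, which shows the URL that `get()` would request. The search path `/s` and the field names `wd` and `ie` are taken from the search form in the Baidu page source; the search term is only an illustrative value.

```python
import requests

# Build (but do not send) a GET request so the final URL can be inspected.
req = requests.Request(
    'GET',
    'https://www.baidu.com/s',
    params={'wd': 'python', 'ie': 'utf-8'},  # illustrative query values
)
prepared = req.prepare()

# The params dictionary is encoded into the query string.
print(prepared.url)
```

Calling `requests.get('https://www.baidu.com/s', params={...})` with the same dictionary should request the same URL.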



```Example```  

Fetch the Baidu home page.

In [11]:
r = requests.get('https://www.baidu.com')
print(r)

<Response [200]>


The return value is a Response object, which holds the response that the web server sent back to us.

In [12]:
type(r)

requests.models.Response

In [13]:
r.status_code  # HTTP response status code

200
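
A status code of 200 means the request succeeded, but real requests can also fail or hang. The following is a minimal sketch of a defensive wrapper, not part of the original notebook: `fetch_html` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the response text, or None if the request fails."""
    try:
        r = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx and 5xx status codes.
        r.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class of the errors requests raises
        # (invalid URL, connection failure, timeout, HTTP error, ...).
        print(f'request failed: {exc}')
        return None
    return r.text
```

For example, `html = fetch_html('https://www.baidu.com')` returns the page text on success and `None` otherwise.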

In [14]:
r.text  # main body of the response (the HTML page)

'<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>ç\x99¾åº¦ä¸\x80ä¸\x8bï¼\x8cä½\xa0å°±ç\x9f¥é\x81\x93</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off au

In [15]:
r.encoding  # encoding that requests uses to decode the HTML

'ISO-8859-1'

In [16]:
r.encoding = 'utf-8'  # change the encoding

In [17]:
print(r.text)  # HTML content decoded with the new encoding

<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus=autofocus></span><span class="bg s_bt
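
The garbled title in the first output comes from decoding UTF-8 bytes as ISO-8859-1, which requests falls back to for text responses that declare no charset. The sketch below reproduces the effect with plain Python strings, so it needs no network access. As far as I know, `r.apparent_encoding` can also be used to guess the encoding from the content instead of hard-coding `'utf-8'`.

```python
original = '百度一下，你就知道'

# Bytes as the server would send them for a UTF-8 page.
raw = original.encode('utf-8')

# Decoding with the wrong codec gives the mojibake seen in the first output.
wrong = raw.decode('iso-8859-1')

# Decoding with the right codec restores the title.
right = raw.decode('utf-8')

print(repr(wrong))
print(right)
```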

In [21]:
help(requests.Response)

Help on class Response in module requests.models:

class Response(builtins.object)
 |  The :class:`Response <Response>` object, which contains a
 |  server's response to an HTTP request.
 |  
 |  Methods defined here:
 |  
 |  __bool__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if the status code of the response is between
 |      400 and 600 to see if there was a client error or a server error. If
 |      the status code, is between 200 and 400, this will return True. This
 |      is **not** a check to see if the response code is ``200 OK``.
 |  
 |  __enter__(self)
 |  
 |  __exit__(self, *args)
 |  
 |  __getstate__(self)
 |  
 |  __init__(self)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |      Allows you to use a response as an iterator.
 |  
 |  __nonzero__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if 

```Example```

Fetch binary data, for example the logo on the Baidu home page.

In [None]:
r = requests.get('https://www.baidu.com/img/bd_logo1.png')
print(r.text)

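The output above is garbled. What we fetched is the Baidu logo, which is binary data, so decoding it as text produces gibberish. Use another attribute of the response, content, to get the raw bytes instead.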

In [None]:
r.content

In [None]:
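# Write the bytes to a local PNG file (the data directory must already exist); running the code below produces an image file of the Baidu logo.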
with open('data\\logo.png','wb') as writer:
    writer.write(r.content)

Audio and video files can be fetched in the same way.
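
The cell above uses a Windows-style path and fails if the `data` directory is missing. Here is a small sketch of a more portable variant using `pathlib`. `save_bytes` and `looks_like_png` are hypothetical helper names; the signature check relies on the fixed first eight bytes of the PNG format.

```python
from pathlib import Path

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'  # first 8 bytes of every PNG file


def looks_like_png(data):
    """Return True if the bytes start with the PNG file signature."""
    return data[:8] == PNG_SIGNATURE


def save_bytes(data, path):
    """Write bytes to path, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return path
```

With a response `r` for the logo URL, `save_bytes(r.content, 'data/logo.png')` would write the image, and `looks_like_png(r.content)` offers a quick check that the server returned an image rather than an error page.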

```Example```  

Fetch the Zhihu explore page.

In [None]:
r = requests.get('https://www.zhihu.com/explore')
r

In [None]:
print(r.text)

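Zhihu returned '403 Forbidden', which means it refused to serve the request. Let's add some request header information so that the request looks like it comes from a browser, and try again.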

In [None]:
headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, \
like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}

r = requests.get('https://www.zhihu.com/explore',headers = headers)
r

In [None]:
print(r.text)

With the headers information added, we successfully fetched the source of Zhihu's explore page. We can also add other fields to the headers dictionary.
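
To see why the first attempt was likely rejected, it helps to compare the headers requests sends by default with the customized ones. The sketch below only prepares the requests through a `requests.Session` and sends nothing. That the server filters on the User-Agent value is an assumption consistent with the behavior above, not something the response confirms.

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/96.0.4664.45 Safari/537.36'
}
url = 'https://www.zhihu.com/explore'

session = requests.Session()
# prepare_request merges the session's default headers with ours.
default_req = session.prepare_request(requests.Request('GET', url))
custom_req = session.prepare_request(requests.Request('GET', url, headers=headers))

print(default_req.headers['User-Agent'])  # identifies itself as python-requests
print(custom_req.headers['User-Agent'])   # looks like a desktop browser
```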

```Example```  

Fetch the Zhihu hot list page after logging in

### Sessions and Cookies

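While browsing the web, we often run into pages that require login. Some pages can only be accessed after logging in, and after a while we have to log in again. After the user enters a username and password, the browser receives a credential that it keeps in memory and possibly on the local disk, which is how the login state is maintained.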

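1. HTTP is a stateless protocol: the server does not know what state the client is in. When the client sends a request, the server parses it and returns a response, and it keeps no record of what happened between requests. Some things are hard to handle under this model. For example, when we visit an online bank, the server asks us to submit our account and password; after we submit them, the server still does not remember who we are and asks for the account and password again on the next request. We therefore need a technique that lets the server keep track of the client's state.  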



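2. Sessions and cookies are used to maintain state across HTTP requests. The session lives on the server and stores the user's session information; cookies live on the client. The next time the client visits the site, it automatically sends the cookies along to the server, which uses them to identify the user, determines whether the user is still logged in, and returns the corresponding response.  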

* Log in to Zhihu, open the Application tab in the browser's developer tools, and select the last item under Storage, which is Cookies.

<img src='resource/Cookies.png' style='zoom:50%'>

In [None]:
headers = {
    'Cookie':'SESSIONID=Wqk2T3KYcBdxVjYRTEmClscLUmuUYaQNkTy1MITg5Px; JOID=VFAUAkP1CQ8RIot4E_I51noETl8AvXVGQBj_ISGqckR-Rsglfvbat3cijHEaDJZSTdOIl0HL8LF0GqQARVFDCe8=; osd=UFgRAk7xAQoRL49wFvI00nIBTlIEtXBGTRz3JCGndkx7RsUhdvPaunMqiXEXCJ5XTd6Mn0TL_bV8H6QNQVlGCeI=; _zap=66c25c60-fdef-4f70-a835-2083e0a11337; _xsrf=mK6AYNA3JQcU4AvkUIfCKk9g04dri0OV; d_c0="APAeVi7CJxSPTuY9X0Urz5MJ-oXp71sKIf0=|1639045532"; Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1639045534; SESSIONID=FuIX4zP2QuBomMSdnwZR6lGWghFpXQYJlU10kgrTnrC; osd=V10SC0t3STnNNCePU3p85aIc6K1FMT5_nQVd3W4jPHmtUW_cNeidj6ExJYJU0hzzWEQnWjE2ahiagsZfbP3W9yY=; JOID=U1kRAE1zTTrGMiOLUHF64aYf46tBNT10mwFZ3mUlOH2uWmnYMeuWiaU1JolS1hjwU0IjXjI9bByegc1ZaPnV_CA=; __snaker__id=WQ16kMAbiwUytjjQ; _9755xjdesxxd_=32; gdxidpyhxdE=dvm6cqRvUal9Z2Smu5fwNIDyAV%2BWISf6I2x0Ca72d6efmw1tnKB%5CjDYh1bhOU3KzuUkVCiymxpV5swiNvCTZ%2FljPUtWXiJm%2FQhSnlLiiqbUZuOUYzZxv2atQnKtRBNL14jPKRRh21%2BevqDnNhMK5NYVHLwDu%2Fp5ii%2BslxL7aqma9fLUm%3A1639046435405; YD00517437729195%3AWM_NI=eK3xM5QONWX2i3%2BOhDS8ocxNhgM5xzi10ELr4CF84QAGaIDLRBKUYmBQEjhN8rHWsn1VwclHGZQa62JM19VJMqTnQj7HBkBt2lDgfNnNQqJVkAta7vsWuak0ghaKAN%2BhcTQ%3D; YD00517437729195%3AWM_NIKE=9ca17ae2e6ffcda170e2e6eeb3c94d8eb8fb87d254bae78fa2c44a928e9fabf579b8bfbc8eee6888928cabd72af0fea7c3b92aa795fd99d03f939182b8db3aabede5b0eb5091e8a184fb46acb983d6fb72abef00a6cf7af89d8ad9b75b8399afaac748929e85b6dc67828f8b8beb46fcae88b8f266f38ab984ef4485869f82b53a8a93ffdad5428faf9c8fd462f5988e8cc180f2b48c88e546ae929f92c95ea99fbc82b67eb189bb82ea3ca5b8afacc446e9b8adb5d437e2a3; YD00517437729195%3AWM_TID=X3z3tBfBLKRFBAVUAVd744P9iskWYTcT; \
captcha_ticket_v2="2|1:0|10:1639045566|17:captcha_ticket_v2|704:eyJ2YWxpZGF0ZSI6IkNOMzFfNUVxOGhSYmRON01zUEVCRm5RUlVsLXZhYWk0SUhhUXRYMmdnZnpHb3E1WlJQOFEwSkJwVF83LmU2LWp6b2pPU282MU1CRHJOclZWaUhBcW44Wk1RbjA5bTgtdDBTbWxBY0l6c3NEN1FkUUxoeS1OXzJxdkc2SHFYWXFqVEZkbF90TTl5UDhfMUJ3ZGpCa2g0a0pPQ1BpQzU3T0R5bC1GSW5SYi1ENi1BNzZaY1FRbGk4T0c2QUNfRHlWdjlvSjY5bUNpYmxqVHk0UGFzV1BUMDk1S3VhUjdmU2V5dTU0UW01eHAucElaRU9hRlJ0Ym8xRGdacHVNci5YUVNYQXJTWFpQa0hNSzZVMmpid01tMDYwMXNpbk1jLmNVWk5Ic2c3Ti5KaHRzbjllaDEuUXhJREE0eWlxamJuck9ELmpPbG0yenNIU0xGc0pseTc0TlR3Sk1LQmxNS3RyWWFMRUJXcXhHUFJrUGFzTThNX1d3N3BSSEFCRlprYk0wa1dvZ0NLRmtObWluQ3gtRXZheGlESGJaeWsuUlk1b2lKYzlJTkdjSS55V1pwLnJlTDlGZjZZMjF1ZHpxZHQ1WVdBcEJRa3Z3dTVPZlcxVzcteUt0Z2Iyd3U5NnRab0xZN0lYYUF1OXZPQ1hTUTE4WmM4RC5hNmZpZUlTOWdKWjZYMyJ9|5d88856b19d6fc58e46f42b6a398078529c657d4a26e44aa35e17a1daabab0bb"; captcha_session_v2="2|1:0|10:1639045580|18:captcha_session_v2|88:dCttZ0w1dnNQTFp2bS94RzZnSExqRmI4cml6NWNFRmJqUzdlN0ZvcjlycHltbWVRUjZaVlU2QXFpMVJzN3lESg==|94d94490abd9df361de4ea7fed73e5102b737ed5e8e4a78839aeda8a1deffd98"; z_c0="2|1:0|10:1639045580|4:z_c0|92:Mi4xN20xckRBQUFBQUFBOEI1V0xzSW5GQ1lBQUFCZ0FsVk55eWVmWWdDVFpuUjNyQjF6N1FGRU9yQjhYUVdmUkxYV0pR|3b018153271472bbf574a57d7cf9b7824fdb628ce75f51024f744583ac204c77"; unlock_ticket="AGDmGfk0RA4mAAAAYAJVTdPgsWF1FHX8Uv07UfzmKkyBjA8Serd0qA=="; NOT_UNREGISTER_WAITING=1; q_c1=0b1995468f7b4b22a09d094adbf73339|1639045593000|1639045593000; tst=r; Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1639045851; KLBRSID=81978cf28cf03c58e07f705c156aa833|1639045866|1639045532',
    'Host':'www.zhihu.com',
    'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, \
like Gecko) Chrome/96.0.4664.45 Safari/537.36'
}
r = requests.get('https://www.zhihu.com/api/v4/questions/504503720/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cvip_info%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled&limit=5&offset=10&platform=desktop&sort_by=default',headers=headers)
#print(r.text)
'''
with open(r'data\zhihu.txt','w',encoding='utf-8') as writer:
    writer.write(r.text)
'''   
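
Pasting a Cookie header by hand works, but the value expires and has to be copied again. As an alternative sketch, a `requests.Session` object keeps a cookie jar and attaches matching cookies to every request made through it. The cookie name `sessionid` and its value below are hypothetical placeholders, and nothing is sent over the network.

```python
import requests

session = requests.Session()

# Hypothetical cookie; a real one would come from a login response
# or from the browser's developer tools.
session.cookies.set('sessionid', 'abc123', domain='www.zhihu.com', path='/')

# Prepare (but do not send) requests to see which cookies would be attached.
same_site = session.prepare_request(
    requests.Request('GET', 'https://www.zhihu.com/explore'))
other_site = session.prepare_request(
    requests.Request('GET', 'https://www.baidu.com/'))

print(same_site.headers.get('Cookie'))   # cookie is sent to the matching domain
print(other_site.headers.get('Cookie'))  # no cookie for a different domain
```

Cookies that a server sets in its responses are stored in `session.cookies` as well, so later calls to `session.get(...)` send them back automatically.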

In [None]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())

In [None]:
for a in soup.find_all(attrs={'itemprop': 'text'}):
    print(a)
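
The URL requested above is an API endpoint (`/api/v4/...`), so its body is probably JSON rather than an HTML page, and the `itemprop` search may find nothing. The sketch below assumes a payload with a `data` list whose items carry an HTML `content` field, as the `include=data[*]...content` part of the URL suggests. The sample payload is made up for illustration, and `html.parser` is used in place of `lxml` to avoid the extra dependency.

```python
import json

from bs4 import BeautifulSoup

# Made-up stand-in for the response body; the real field layout is an assumption.
sample_body = json.dumps({
    'data': [
        {'content': '<p>First <b>answer</b></p>'},
        {'content': '<p>Second answer</p>'},
    ]
})

# With a live response this would be: payload = r.json()
payload = json.loads(sample_body)

# Each answer's content is an HTML fragment; strip the tags to get plain text.
texts = [
    BeautifulSoup(item['content'], 'html.parser').get_text()
    for item in payload['data']
]
print(texts)
```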