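### The Requests library for web scraping
- A third-party HTTP library for Python (not part of the standard library; usually installed with pip)
- More convenient and concise than the built-in urllib module
- Sends requests the way a browser would and retrieves the response data
- Basic steps:
  - Specify the URL
  - Send the request
  - Get the response data
  - Persist the data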

In [49]:
import requests

url = 'https://www.sogou.com/'  # Specify the URL #
response = requests.get(url = url)  # Send a GET request; returns a Response object #
page_text = response.text  # The text attribute holds the page data as a str #
with open('./Save_files/Sougou.html', 'w', encoding = 'UTF-8') as fp:
    fp.write(page_text)
    print('Save Success')

Save Success
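
For comparison with the urllib module mentioned in the list above, the same four steps can be sketched with the standard library alone. This is a minimal sketch, not the author's code: the helper names fetch_text and save_text and the 10-second timeout are assumptions chosen for illustration.

```python
from urllib.request import Request, urlopen


def fetch_text(url, encoding='UTF-8', timeout=10):
    """Fetch a URL with urllib and return the body decoded as text."""
    request = Request(url)  # Step 1: specify the URL
    with urlopen(request, timeout=timeout) as response:  # Step 2: send the request
        raw = response.read()  # Step 3: read the response body as bytes
    return raw.decode(encoding)  # urllib leaves decoding to the caller


def save_text(text, path, encoding='UTF-8'):
    """Step 4: persist the page text to a local file."""
    with open(path, 'w', encoding=encoding) as fp:
        fp.write(text)
```

Calling fetch_text('https://www.sogou.com/') and passing the result to save_text would mirror the cell above. With requests, the read-and-decode step is folded into response.text.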


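### Common attributes of the Response object
- text: the page data as a str (the response body decoded to text)
- content: the page data as bytes
- headers: the response headers
- status_code: the HTTP status code of the response
- url: the final URL the response was fetched from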

In [6]:
import requests

url = 'https://www.sogou.com/'  # Specify the URL #
response = requests.get(url = url)  # Send a GET request; returns a Response object #

print(response.text)
print('--------------------------------------------------')
print(response.content)
print('--------------------------------------------------')
print(response.headers)
print('--------------------------------------------------')
print(response.status_code)
print('--------------------------------------------------')
print(response.url)

<!DOCTYPE html><html lang="cn"><head> <meta name="baidu_union_verify" content="efd6e8ce094119528f66c2d380f6ec94">
<meta name='360_ssp_verify' content='651669fb99b77a4e4efae7ec25d6796a' /> <meta name="viewport" content="width=device-width,minimum-scale=1,maximum-scale=1,user-scalable=no"><script>window._speedMark = new Date();
window.lead_ip = '222.94.67.99';
window.now = 1726242111660;</script><script type="text/javascript">/*file=static/js/resourceErrorReport.js*/!function(a){var n=(new Date).getTime(),r=a.location.protocol;function c(e,t){var o=(new Date).getTime()-n;(new Image).src=["//pb.sogou.com/pv.gif?uigs_productid=wapapp&type=resource-error&stype=",e,"&timestamp=",o,"&protocol=",r,"&host=",encodeURIComponent(a.location.host),"&path=",encodeURIComponent(a.location.pathname),"&resource=",encodeURIComponent(t)].join("")}function e(e){if((e=e||a.event)&&"error"===e.type){var t=e.srcElement?e.srcElement:e.target;if(t){var o,n,r=t.tagName;"LINK"===r?(n="css",(o=t.getAttribute("href"
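
The relationship between text and content can be illustrated without a network call. A minimal sketch, assuming a UTF-8 page: the sample string is made up for illustration and stands in for a real response body.

```python
# Bytes as they would arrive over the network (what response.content holds).
raw_body = '搜狗搜索'.encode('UTF-8')

# Decoding with the correct charset gives readable text (what response.text holds).
page_text = raw_body.decode('UTF-8')

# Decoding with the wrong charset yields garbled characters instead of an error.
garbled_text = raw_body.decode('ISO-8859-1')

print(type(raw_body).__name__, len(raw_body))  # bytes 12
print(page_text)
print(garbled_text == page_text)  # False
```

When response.text looks garbled, assigning the correct charset to response.encoding before reading text is a common remedy.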

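### Common parameters of a request
- url: the address of the page
- params: query-string parameters, usually given as a dict
- headers: the request headers
#### Example: fetching Sogou search results for a user-supplied query with a GET request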

In [2]:
import requests

search = input('What are you search? ')
url = 'https://www.sogou.com/web'  # Specify the URL #
params = {  # Query parameters as a dict #
    'query': search,
    'ie': 'UTF-8'
}
headers = {  # Set the request headers #
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0'
}
response = requests.get(url = url, params = params, headers = headers)  # Send a GET request; returns a Response object #
page_text = response.text  # Get the page data as a str #
text_name = 'Save_files/Sougou_' + search + '.html'  # Local file name #
with open(text_name, 'w', encoding = 'UTF-8') as fp:
    fp.write(page_text)  # Save to a local file #
    print('Save Success')

What are you search?  NAU


Save Success
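
To see what the params dictionary turns into, the query string can be built by hand with the standard library. This only approximates what requests does when it appends params to the URL, and the helper name build_url is hypothetical.

```python
from urllib.parse import urlencode


def build_url(base_url, params):
    """Append URL-encoded query parameters to a base URL."""
    return base_url + '?' + urlencode(params)


print(build_url('https://www.sogou.com/web', {'query': 'NAU', 'ie': 'UTF-8'}))
# https://www.sogou.com/web?query=NAU&ie=UTF-8
print(build_url('https://www.sogou.com/web', {'query': '水果', 'ie': 'UTF-8'}))
# https://www.sogou.com/web?query=%E6%B0%B4%E6%9E%9C&ie=UTF-8
```

Non-ASCII search terms are percent-encoded as UTF-8 bytes, which is why the search term can be passed to params unchanged.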


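### Implementing a POST request
- Specify the URL
- Supply custom form data
#### Example: fetching suggestions for a user-supplied term from Baidu Translate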

In [22]:
import requests

url = 'https://fanyi.baidu.com/sug'
word = input('What are you search? ')

data = {  # Custom form data for the POST request #
    'kw': word
}
response = requests.post(url = url, data = data)
page_text = response.text
print(page_text)

What are you search?  水果


{"errno":0,"data":[{"k":"\u6c34\u679c","v":"fruit; fruitage"},{"k":"\u6c34\u679c\u51bb","v":"\u540d. Fruit jelly"},{"k":"\u6c34\u679c\u5200","v":"fruit knife"},{"k":"\u6c34\u679c\u53c9","v":"fruit fork"},{"k":"\u6c34\u679c\u5e97","v":"fruit shop;fruit store"}],"logid":441324850}
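
The response body above is JSON with \uXXXX escapes, so it becomes readable once parsed. A minimal sketch using the standard json module on an abridged copy of that output (only the first two entries are kept). The requests library also provides response.json() for the same purpose.

```python
import json

# Abridged copy of the response body printed above, kept as raw JSON text.
page_text = r'{"errno":0,"data":[{"k":"\u6c34\u679c","v":"fruit; fruitage"},{"k":"\u6c34\u679c\u51bb","v":"\u540d. Fruit jelly"}],"logid":441324850}'

result = json.loads(page_text)  # \uXXXX escapes are decoded into characters
suggestions = [(item['k'], item['v']) for item in result['data']]
for term, meaning in suggestions:
    print(term, '->', meaning)
```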


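### Implementing an asynchronous (Ajax) GET request
- Specify the URL
- Supply custom query parameters
- Supply custom request headers
#### Example: fetching the movie chart listing from Douban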

In [2]:
import requests

url = 'https://movie.douban.com/j/chart/top_list'
params = {
    'type': 13,
    'interval_id': '100:90',
    'action': '',
    'start': 40,
    'limit': 20
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0'
}
response = requests.get(url = url, params = params, headers = headers)
print(response.text)

[{"rating":["8.8","45"],"rank":41,"cover_url":"https://img1.doubanio.com\/view\/photo\/s_ratio_poster\/public\/p2154212680.jpg","is_playable":false,"id":"1418834","types":["剧情","爱情","同性","家庭"],"regions":["美国","加拿大"],"title":"断背山","url":"https:\/\/movie.douban.com\/subject\/1418834\/","release_date":"2005-09-02","actor_count":17,"vote_count":741234,"score":"8.8","actors":["希斯·莱杰","杰克·吉伦哈尔","米歇尔·威廉姆斯","安妮·海瑟薇","凯特·玛拉","兰迪·奎德","琳达·卡德里尼","安娜·法瑞丝","格拉汉姆·贝克尔","斯科特·迈克尔·坎贝尔","大卫·哈伯","罗伯塔·马克斯韦尔","皮特·麦克罗比","夏恩·希尔","布鲁克琳·普劳克斯","杰克·丘奇","罗德里戈·普列托"],"is_watched":false},{"rating":["8.8","45"],"rank":42,"cover_url":"https://img9.doubanio.com\/view\/photo\/s_ratio_poster\/public\/p2555762374.jpg","is_playable":true,"id":"1296339","types":["剧情","爱情"],"regions":["美国","奥地利","瑞士"],"title":"爱在黎明破晓前","url":"https:\/\/movie.douban.com\/subject\/1296339\/","release_date":"1995-01-27","actor_count":18,"vote_count":735988,"score":"8.8","actors":["伊桑·霍克","朱莉·德尔佩","安德莉亚·埃克特","汉诺·波西尔","卡尔·布拉克施魏格尔","特克斯·鲁比诺威茨","埃尔尼·
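
In the request above, start = 40 with limit = 20 returns items ranked from 41 onward, which suggests that start is a zero-based offset. Under that assumption, the parameters for any page can be derived as sketched below. The helper name top_list_params is hypothetical, and the other fields are copied from the cell above.

```python
def top_list_params(page, page_size=20):
    """Build query parameters for one page of the chart (pages start at 1)."""
    return {
        'type': 13,
        'interval_id': '100:90',
        'action': '',
        'start': (page - 1) * page_size,  # offset of the first item on the page
        'limit': page_size,
    }


for page in (1, 2, 3):
    params = top_list_params(page)
    print(page, params['start'], params['limit'])
```

Page 3 reproduces the start value of 40 used in the original request.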

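### Implementing an asynchronous (Ajax) POST request
- Specify the URL
- Supply custom form data
- Supply custom request headers
#### Example: fetching KFC restaurant locations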

In [4]:
import requests

url = 'https://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
data = {
    'cname': '',
    'pid': '',
    'keyword': '吴江',
    'pageIndex': '1',
    'pageSize': '10'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0'
}
response = requests.post(url = url, data = data, headers = headers)
print(response.text)

{"Table":[{"rowcount":25}],"Table1":[{"rownum":1,"storeName":"吴江运东","addressDetail":"吴江经济开发区运东大道东侧江南奥斯卡5幢1层","pro":"24小时,Wi-Fi,店内参观,礼品卡","provinceName":"江苏省","cityName":"苏州市"},{"rownum":2,"storeName":"盛泽西环路","addressDetail":"吴江市盛泽镇西环路西侧欧尚超一层","pro":"24小时,Wi-Fi,点唱机,店内参观,礼品卡","provinceName":"江苏省","cityName":"苏州市"},{"rownum":3,"storeName":"吴江汾湖","addressDetail":"吴江市汾湖镇杭州路北侧368号华润超市一层","pro":"24小时,Wi-Fi,点唱机,店内参观,礼品卡","provinceName":"江苏省","cityName":"苏州市"},{"rownum":4,"storeName":"松陵百润发","addressDetail":"吴江市笠泽路117号恒森广场百润发","pro":"24小时,Wi-Fi,礼品卡","provinceName":"江苏省","cityName":"苏州市"},{"rownum":5,"storeName":"吴江永康","addressDetail":"吴江市松陵镇永康路68号","pro":"Wi-Fi,礼品卡","provinceName":"江苏省","cityName":"苏州市"},{"rownum":6,"storeName":"盛泽市场","addressDetail":"吴江市盛泽镇东方丝绸市场十字河西1层1区157号","pro":"24小时,Wi-Fi,点唱机,店内参观,礼品卡","provinceName":"江苏省","cityName":"苏州市"},{"rownum":7,"storeName":"盛泽舜新","addressDetail":"吴江市舜新中路27号大润发卖场一层","pro":"Wi-Fi,点唱机,礼品卡","provinceName":"江苏省","cityName":"苏州市"},{"rownum":8,"storeName
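
The response reports rowcount = 25 while the request asked for pageSize = 10, so a single request does not cover every store. A minimal sketch of how the remaining pages could be planned, using an abridged copy of the output above in which the store list is left empty. It assumes pageIndex simply counts pages from 1.

```python
import json
import math

# Abridged copy of the response above: only the row count is kept.
page_text = '{"Table":[{"rowcount":25}],"Table1":[]}'

row_count = json.loads(page_text)['Table'][0]['rowcount']
page_size = 10
total_pages = math.ceil(row_count / page_size)  # 25 rows at 10 per page -> 3 pages

# Form data for every page, reusing the fields of the request above.
all_pages = [
    {'cname': '', 'pid': '', 'keyword': '吴江', 'pageIndex': str(i), 'pageSize': str(page_size)}
    for i in range(1, total_pages + 1)
]
print(total_pages, [d['pageIndex'] for d in all_pages])
```

Each dictionary in all_pages could then be passed as data to requests.post in a loop.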

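### Implementing an asynchronous (Ajax) GET request across multiple pages
- Specify the URL
- Supply custom query parameters
- Supply custom request headers
#### Example: fetching multiple result pages from the Sogou Zhihu module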

In [14]:
import requests

url = 'https://www.sogou.com/sogou'
word = input('What are you search? ')
start_page = int(input('Please give me the start page number: '))
end_page = int(input('Please give me the end page number: '))
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0'
}

for i in range(start_page, end_page + 1):  # Loop over every requested page #
    params = {
        'query': word,
        'page': i,
        'ie': 'UTF-8'
    }
    response = requests.get(url = url, params = params, headers = headers)
    file_name = './Save_files/Sougou_' + word + '_page_' + str(i) + '.html'
    with open(file_name, 'w', encoding = 'UTF-8') as fp:
        fp.write(response.text)
        print(f'Page {i} Save Success')

What are you search?  顾桓源
Please give me the start page number:  1
Please give me the end page number:  3


Page 1 Save Success
Page 2 Save Success
Page 3 Save Success
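
The loop above assumes that ./Save_files already exists, because open() in 'w' mode does not create missing directories. A minimal sketch of a helper that builds each page's path and creates the directory first. The name page_file is hypothetical, and the file-name pattern is copied from the cell above.

```python
from pathlib import Path


def page_file(out_dir, word, page):
    """Return the output path for one page, creating the directory if needed."""
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)  # avoids FileNotFoundError in open()
    return directory / f'Sougou_{word}_page_{page}.html'
```

Inside the loop, file_name = page_file('./Save_files', word, i) could replace the string concatenation.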


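### Using the Session object
- Supports the same request methods as the requests module itself
- Mostly used to store cookies automatically and send them with later requests
- Cookie: data set by the server to keep track of client state; the client stores it and returns it on subsequent requests
#### Example: fetching a user's profile page from Cheshi (网上车市)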

In [91]:
import requests

session = requests.Session()  # Create a Session object #
login_url = 'https://api.cheshi.com/services/common/api.php?api=login.Login'
data = {  # Login form data; the credentials are placeholders to be replaced with your own #
    "act": "login",
    "mobile": "YOUR_MOBILE_NUMBER",
    "source": "pc",
    "password": "YOUR_PASSWORD",
    "hold_time": "yes"
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0'
}
login_response = session.post(url = login_url, data = data, headers = headers)  # Send the login request; the session stores the returned cookies #
url = 'https://my.cheshi.com/user/'  # Personal profile page #
response = session.get(url = url, headers = headers)
page_text = response.text
with open('./Save_files/Cheshi_userinfo.html', 'w', encoding = 'UTF-8') as fp:
    fp.write(page_text)
    print('Save Success')

Save Success
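
What the Session object automates can be shown with the standard library: the server sends a Set-Cookie header, and the client returns its value in a Cookie header on later requests. A minimal sketch with a made-up header value, since the real cookie names used by the site are not shown in the source.

```python
from http.cookies import SimpleCookie

# A made-up Set-Cookie header such as a login response might carry.
set_cookie_header = 'sessionid=abc123; Path=/; HttpOnly'

jar = SimpleCookie()
jar.load(set_cookie_header)  # what the client stores after logging in

# What the client sends back on the next request to the same site.
cookie_header = '; '.join(f'{name}={morsel.value}' for name, morsel in jar.items())
print(cookie_header)  # sessionid=abc123
```

With a Session, both halves happen implicitly between session.post and session.get.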


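### IP proxies
- Relevant to both sides: sites block IP addresses as an anti-scraping measure, and scrapers use proxies to get around such blocks
- Forward proxy: fetches data on behalf of the client
- Reverse proxy: serves data on behalf of the server
#### Example: changing the apparent IP address through a proxy, checked against an IP lookup page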

In [94]:
import requests

url = 'https://benjiip.com/'
proxy = {
    'https': 'http://111.11.109.11:80'  # Proxy for https URLs; the explicit scheme assumes a plain HTTP proxy #
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0'
}
response = requests.get(url = url, headers = headers, proxies = proxy)
page_text = response.text
with open('./Save_files/Ip.html', 'w', encoding = 'UTF-8') as fp:
    fp.write(page_text)
    print('Save Success')

Save Success
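
A single proxy address can stop working at any time, so scrapers often rotate through several. The following is a minimal sketch of round-robin rotation that produces mappings in the shape expected by the proxies argument. The helper name proxy_rotation is hypothetical, the second address is a placeholder from a documentation-only range, and both are assumed to be plain HTTP proxies.

```python
from itertools import cycle


def proxy_rotation(addresses):
    """Yield proxies mappings in round-robin order, one per request."""
    for address in cycle(addresses):
        proxy_url = f'http://{address}'  # assumes plain HTTP proxies
        yield {'http': proxy_url, 'https': proxy_url}


# The second address is a placeholder from a documentation-only range.
rotation = proxy_rotation(['111.11.109.11:80', '203.0.113.5:8080'])
print(next(rotation))
print(next(rotation))
print(next(rotation))  # wraps around to the first proxy again
```

Each value drawn with next(rotation) could be passed as proxies to requests.get.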


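### Ways to handle CAPTCHAs
- Image CAPTCHAs
  - Save the image locally and read it by hand
  - Use a third-party online CAPTCHA-solving service
  - Use an OCR library together with the PIL image-processing library
- Click-to-select and slider CAPTCHAs
  - Use third-party open-source code
#### Example: recognizing the image CAPTCHA of Nanjing Audit University with an OCR library and PIL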

In [3]:
import requests
from PIL import Image  # Used to read the image #
import tesserocr

url = 'http://sso.nau.edu.cn/sso/captcha.jpg'  # URL that generates the image CAPTCHA #
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0'
}
img_bytes = requests.get(url=url, headers=headers).content  # Fetch the image as binary data #
with open('./Save_files/Auth_code.png', 'wb') as fp:  # Save it locally #
    fp.write(img_bytes)
    print('Picture Save Success')

image = Image.open("./Save_files/Auth_code.png")  # Read the image #
image = image.convert("L")  # Mode "L" converts the image to grayscale #
width = image.size[0]  # Image width #
height = image.size[1]  # Image height #
threshold = 150  # Binarization threshold #
for h in range(0, height):
    for w in range(0, width):
        if image.getpixel((w, h)) < threshold:  # Compare each pixel with the threshold #
            image.putpixel((w, h), 0)  # Below the threshold: set the pixel to 0 (black) #
        else:
            image.putpixel((w, h), 255)  # At or above the threshold: set the pixel to 255 (white) #
result = tesserocr.image_to_text(image)  # Recognize the text in the image #
print(result)

Picture Save Success
6900
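
The binarization step above maps every gray level below the threshold to black and every other level to white. The same rule can be written as a 256-entry lookup table, sketched here in plain Python so that it runs without any image. The helper name threshold_table is hypothetical.

```python
def threshold_table(threshold):
    """Return a 256-entry table mapping each gray level to 0 (black) or 255 (white)."""
    return [0 if level < threshold else 255 for level in range(256)]


table = threshold_table(150)
print(table[149], table[150])  # 0 255
```

Pillow's Image.point accepts a lookup table of this shape for a mode "L" image, so image.point(table) should give the same result as the nested loops, and is usually much faster than setting pixels one by one.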

