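### Web Scraping with urllib
- A scraping module from Python's standard library
- Simulates a browser by sending requests and retrieving the response data
- Basic steps:
  - Specify the URL
  - Send the request
  - Read the response data
  - Persist the data (the example below writes it to a local file)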

In [1]:
import urllib.request

url = 'https://www.sougou.com/'  # specify the URL
response = urllib.request.urlopen(url=url)  # send a GET request
page_text = response.read()  # read the page data; returns bytes
# Save to a local file; the ./Save_files directory must already exist
with open('./Save_files/Sougou.html', 'wb') as fp:
    fp.write(page_text)
    print('Save Success')

Save Success
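
Because `response.read()` returns raw bytes, the page has to be decoded before it can be printed or searched as text. The following is a minimal sketch, assuming UTF-8 as the fallback when no charset is known; `decode_body` is a hypothetical helper written for illustration, not part of `urllib`.

```python
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode a response body (hypothetical helper, not part of urllib)."""
    # Assumption: fall back to UTF-8 when no charset is supplied.
    encoding = charset or 'utf-8'
    # errors='replace' substitutes U+FFFD for undecodable bytes instead of raising.
    return raw.decode(encoding, errors='replace')


# Offline demonstration: hand-made bytes stand in for response.read().
sample = '<title>搜狗</title>'.encode('utf-8')
text = decode_body(sample)
print(text)
```

With a live response, the declared charset (if any) can be read with `response.headers.get_content_charset()` and passed as the second argument.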


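### Anti-Scraping Measure: User-Agent Check
- User-Agent (UA): a request header that identifies the client sending the request
- Countermeasure: build the request object with a custom User-Agent that mimics a browser
#### Example: sending GET requests with different search terms to the Sogou search engine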

In [3]:
import urllib.request
import urllib.parse

url = 'https://www.sogou.com/web?query='
search = input('What are you search? ')  # ask for the search term
target = urllib.parse.quote(search)  # quote() percent-encodes the string and returns a str
url += target  # append the encoded term to the URL

headers = {  # request headers
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0'
}
request = urllib.request.Request(url=url, headers=headers)  # custom request object
response = urllib.request.urlopen(request)
page_text = response.read()
text_name = './Save_files/Sougou_' + search + '.html'  # local file name
with open(text_name, 'wb') as fp:
    fp.write(page_text)  # save to a local file
    print('Save Success')


What are you search?  苏州


Save Success
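
The two ideas in this cell, percent-encoding the search term and replacing the User-Agent, can be inspected without sending anything over the network. The sketch below is only illustrative: the UA value `Mozilla/5.0 (example)` is a placeholder rather than a real browser string, and the search term is hard-coded instead of read with `input()`.

```python
import urllib.parse
import urllib.request

# quote() percent-encodes the UTF-8 bytes of non-ASCII characters.
encoded = urllib.parse.quote('苏州')
url = 'https://www.sogou.com/web?query=' + encoded

# Placeholder UA string for illustration; substitute a real browser's value.
headers = {'User-Agent': 'Mozilla/5.0 (example)'}
request = urllib.request.Request(url=url, headers=headers)

# Request stores header names capitalized, hence 'User-agent'.
custom_ua = request.get_header('User-agent')

# Without a custom header, urllib's default opener announces itself
# as Python-urllib/<version>, which a UA check can easily reject.
default_ua = dict(urllib.request.build_opener().addheaders)['User-agent']

print(encoded)
print(custom_ua)
print(default_ua)
```

No request is sent here; the `Request` object is only constructed so that its headers can be examined.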


### Scraping with Custom Request Data
- Send a POST request that carries custom form data
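
As a rough offline illustration of how `urllib` builds a POST request, the sketch below encodes a form dictionary and checks which HTTP method the resulting request object would use. The keyword is hard-coded as an assumption for the demonstration, and nothing is sent over the network.

```python
import urllib.parse
import urllib.request

# urlencode() turns a dict into a percent-encoded 'key=value' string.
body = urllib.parse.urlencode({'kw': '草莓'})
# The request body must be bytes, not str.
data = body.encode('utf-8')

request = urllib.request.Request(url='https://fanyi.baidu.com/sug', data=data)
# Supplying data switches the method from GET to POST.
method = request.get_method()

print(body)
print(method)
```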
#### Example: submitting a word to Baidu Translate and fetching the result

In [6]:
import urllib.request
import urllib.parse

url = 'https://fanyi.baidu.com/sug'
word = input('What are you search? ')

data = {  # custom POST parameters
    'kw': word
}
data = urllib.parse.urlencode(data)  # urlencode() encodes a dict and returns a str
data = data.encode()  # convert the str to bytes
response = urllib.request.urlopen(url=url, data=data)
page_text = response.read()  # read the response; returns bytes
print(page_text)

What are you search?  草莓


b'{"errno":0,"data":[{"k":"\\u8349\\u8393","v":"[\\u690d] strawberry \\uff08\\u8349\\u8393\\u5c5e Fragaria \\u690d\\u7269\\u7684\\u6cdb\\u79f0\\uff09"},{"k":"\\u8349\\u8393\\u6c41","v":"strawberry juice"},{"k":"\\u8349\\u8393\\u9171","v":"strawberry jam"},{"k":"\\u8349\\u8393\\u5976\\u6614","v":"\\u540d. Strawberry Shake"},{"k":"\\u8349\\u8393\\u5976\\u8336","v":"Strawberry Milk Tea"}],"logid":2869144086}'
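
The printed bytes are JSON with `\uXXXX` escapes, which makes the Chinese keys hard to read. One possible way to turn them into readable text is `json.loads`, sketched here on a shortened copy of the output above (two of the five entries) so that it runs offline.

```python
import json

# Shortened copy of the response shown above, standing in for response.read().
page_text = (
    b'{"errno":0,"data":['
    b'{"k":"\\u8349\\u8393\\u6c41","v":"strawberry juice"},'
    b'{"k":"\\u8349\\u8393\\u9171","v":"strawberry jam"}]}'
)

# json.loads() accepts bytes and resolves the \uXXXX escapes.
result = json.loads(page_text)
pairs = [(item['k'], item['v']) for item in result['data']]

for key, value in pairs:
    print(key, '->', value)
```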
