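# Parsing web pages with `BeautifulSoup`

**The `lxml` library**
- If `lxml` is not installed, Beautiful Soup falls back to another available parser such as Python's built-in `html.parser` (asking for `'lxml'` explicitly without having it installed raises an error instead)

- Although `Beautiful Soup` supports the `HTML parser` in Python's standard library as well as several `third-party parsers`, `lxml` is more capable and faster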
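
The parser choice described above can be made explicit in code. This is a minimal sketch, not part of the original notebook: the helper name `make_soup` is hypothetical, and it relies on `bs4` raising `FeatureNotFound` when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html):
    """Parse with lxml when it is installed, otherwise with the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python
        return BeautifulSoup(html, "html.parser")

soup = make_soup('<a href="https://www.ahpu.edu.cn/">AHPU</a>')
print(soup.a.get("href"))
```

Either parser handles this small snippet the same way; differences mostly show up on malformed HTML and in speed.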

In [7]:
import requests
from bs4 import BeautifulSoup

url = 'https://pvp.qq.com/web201605/herolist.shtml'

strhtml = requests.get(url)
print(strhtml)
strhtml.encoding = 'gbk'                        # the page is GBK-encoded; without this the titles come out garbled
soup = BeautifulSoup(strhtml.text, 'lxml')      # parse the HTML with the lxml parser
selector = 'body > div.wrapper > div > div > div.herolist-box > div.herolist-content > ul > li > a'
data = soup.select(selector)

<Response [200]>


In [8]:
result_list = []
for item in data:
    result = {
        'title': item.get_text(),
        'link': 'https://pvp.qq.com/web201605/' + item.get('href')
    }
    result_list.append(result)
result_list[0:5]

[{'title': '云中君',
  'link': 'https://pvp.qq.com/web201605/herodetail/506.shtml'},
 {'title': '瑶', 'link': 'https://pvp.qq.com/web201605/herodetail/505.shtml'},
 {'title': '盘古',
  'link': 'https://pvp.qq.com/web201605/herodetail/529.shtml'},
 {'title': '猪八戒',
  'link': 'https://pvp.qq.com/web201605/herodetail/511.shtml'},
 {'title': '嫦娥',
  'link': 'https://pvp.qq.com/web201605/herodetail/515.shtml'}]
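
`requests` decodes `.text` with the charset named in the HTTP headers and, for text responses that name none, falls back to ISO-8859-1. The sketch below shows what that fallback does to GBK text. It assumes this page is GBK-encoded, and the sample name is only an illustration.

```python
name = "云中君"

# What the server sends: the text encoded as GBK bytes
raw = name.encode("gbk")

# Decoding those bytes with the wrong charset yields garbled text
garbled = raw.decode("iso-8859-1")

# Undoing the wrong decode and applying the right one repairs it
repaired = garbled.encode("iso-8859-1").decode("gbk")

print(repr(garbled))
print(repaired)
```

With `requests`, setting `strhtml.encoding = 'gbk'` (or `strhtml.encoding = strhtml.apparent_encoding`) before reading `strhtml.text` avoids the problem at the source.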
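
The links above are built by concatenating a fixed prefix with the relative `href`. A possible alternative, sketched here with the standard library, is to resolve the `href` against the page URL so the prefix does not have to be typed by hand:

```python
from urllib.parse import urljoin

page_url = "https://pvp.qq.com/web201605/herolist.shtml"
href = "herodetail/506.shtml"   # relative link as it appears in the page

link = urljoin(page_url, href)
print(link)
```

`urljoin` also copes with absolute and root-relative links, which plain string concatenation does not.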

In [143]:
print(type(strhtml.text))   # type of the raw page text: <class 'str'>
print(type(soup))           # type of the page after parsing

<class 'str'>
<class 'bs4.BeautifulSoup'>


### Types
#### Type at each step
- **The object returned by `requests` has type `requests.models.Response`** (`<Response [200]>`)
- **After `.text`, the type is `<class 'str'>`**
    - `response.text`
    - `<html><body><a href="https://www.ahpu.edu.cn/">安徽工程大学</a></body></html>`
- **After parsing with `Beautiful Soup`, the type is `<class 'bs4.BeautifulSoup'>`**
    - `<html><body><a href="https://www.ahpu.edu.cn/">安徽工程大学</a></body></html>` (tidied up)
- **After `soup.select(selector)`, the type is `<class 'bs4.element.ResultSet'>`**
    - `[<a href="https://www.ahpu.edu.cn/">安徽工程大学</a>, <a>......</a>, <a>......</a>, ...]` (list-like, supports indexing)

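#### Methods for extracting content
- **Use `get_text()` to get the text between a tag's opening and closing tags**
- **Use `get()` to extract an attribute of a tag**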

In [136]:
string = """<a href="https://www.ahpu.edu.cn/">安徽工程大学</a>"""
soup = BeautifulSoup(string, 'lxml')    # name the parser explicitly; lxml adds the <html><body> wrapper
data = soup.select('body > a')
print("[Type after BeautifulSoup]\n", type(soup), sep='')
print("\n[Form after BeautifulSoup]\n", soup, sep='')
print("\n[Navigate to the target tag] [select]")
print(type(data))
print(data)

[Type after BeautifulSoup]
<class 'bs4.BeautifulSoup'>

[Form after BeautifulSoup]
<html><body><a href="https://www.ahpu.edu.cn/">安徽工程大学</a></body></html>

[Navigate to the target tag] [select]
<class 'bs4.element.ResultSet'>
[<a href="https://www.ahpu.edu.cn/">安徽工程大学</a>]


#### Use `get_text()` to get the text inside a tag, for example:

In [137]:
data[0].get_text()

'安徽工程大学'

#### Use `get()` to extract an attribute of a tag, for example:

In [138]:
data[0].get('href')

'https://www.ahpu.edu.cn/'

## Regular expressions

In [42]:
import re

# `data` is the list of hero <a> tags selected earlier (page text decoded as GBK)
result_list = []
for item in data:
    result = {
        'title': item.get_text(),
        'link': 'https://pvp.qq.com/web201605/' + item.get('href'),
        'ID': re.findall(r'\d+', item.get('href'))    # raw string, so \d is not treated as a string escape
    }
    result_list.append(result)
result_list[0:5]

[{'title': '云中君',
  'link': 'https://pvp.qq.com/web201605/herodetail/506.shtml',
  'ID': ['506']},
 {'title': '瑶',
  'link': 'https://pvp.qq.com/web201605/herodetail/505.shtml',
  'ID': ['505']},
 {'title': '盘古',
  'link': 'https://pvp.qq.com/web201605/herodetail/529.shtml',
  'ID': ['529']},
 {'title': '猪八戒',
  'link': 'https://pvp.qq.com/web201605/herodetail/511.shtml',
  'ID': ['511']},
 {'title': '嫦娥',
  'link': 'https://pvp.qq.com/web201605/herodetail/515.shtml',
  'ID': ['515']}]
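
In the output above each `ID` is a one-element list, because `re.findall` returns every match. If a single string is wanted, `re.search` with a capture group is one option. The pattern below is a sketch that assumes links of the form `herodetail/<digits>.shtml`.

```python
import re

href = "herodetail/506.shtml"

ids = re.findall(r"\d+", href)               # list of every run of digits
match = re.search(r"(\d+)\.shtml$", href)    # only the digits right before ".shtml"
hero_id = match.group(1) if match else None  # None when the link has another shape

print(ids, hero_id)
```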

## Dealing with anti-scraping measures

In [45]:
from bs4 import BeautifulSoup
import requests
import random

'''
    [Setting request headers in a Python crawler](https://blog.csdn.net/aaronjny/article/details/62088640)
'''
# Add request headers so the request looks like it comes from a browser
headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
    'Connection': 'keep-alive',
    'Referer': 'http://www.baidu.com/'
}

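# Route requests through a proxy so that frequent access does not get the IP address banned
# (placeholder addresses in host:port form; replace them with reachable proxies)
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080"
}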

url = 'https://movie.douban.com/top250'
strhtml = requests.get(url, headers=headers, proxies=proxies)     # returns a Response object
print(strhtml)
soup = BeautifulSoup(strhtml.text, 'lxml')
soup

ProxyError: HTTPSConnectionPool(host='movie.douban.com', port=443): Max retries exceeded with url: /top250 (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f9d0414e550>: Failed to establish a new connection: [Errno -2] Name or service not known')))
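
A proxy URL has the form `scheme://host:port`. If the port is joined to the host with a dot instead of a colon, the whole string is treated as a host name, and an unresolvable host is the kind of failure ("Name or service not known") that the `ProxyError` above reports. The sketch below checks how a proxy URL is split; the helper name `split_proxy` is hypothetical.

```python
from urllib.parse import urlsplit

def split_proxy(proxy_url):
    """Return (host, port) as urllib parses them from a proxy URL."""
    parts = urlsplit(proxy_url)
    return parts.hostname, parts.port

print(split_proxy("http://10.10.1.10:3128"))   # colon before the port
print(split_proxy("http://10.10.1.10.3128"))   # dot instead of a colon: no port is found
```

Even with a well-formed address, the request only succeeds if a proxy is actually listening there.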

# China Tourism Network (`cntour.cn`)

<img src="../../images/76c36180d4c116ad718cc5609104e5b4d86c59c397a902bdc7102fcc901d29e8.png" width="67.5%" height="67.5%" align=left />

In [None]:
from bs4 import BeautifulSoup
import requests
import random

'''
    [Setting request headers in a Python crawler](https://blog.csdn.net/aaronjny/article/details/62088640)
'''
headers = {
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
    'Connection': 'keep-alive',
    'Referer': 'http://www.baidu.com/'
}

url = 'http://www.cntour.cn/'
strhtml = requests.get(url, headers=headers)     # returns a Response object
print(strhtml)
soup = BeautifulSoup(strhtml.text, 'lxml')

<Response [200]>


**Failed**

In [None]:
selector = '#m_news_info1231 > div.news_title > a'
# Other candidate selectors:
#   #module6890 > div.formMiddle.formMiddle6890 > div > div > div > div.m_news.m_new_padding_1.m_news__wrap-1 > div > div > div > div:nth-child(1)
#   #m_news_info1214 > div.news_title > a
#   #m_news_content > div
data = soup.select(selector)
data

[<a class="article_title" href="/h-nd-1231.html#_np=2_8590" target="_blank" title="数字人民币扩容，旅游业入门淘金"><div class="news_titleimg"></div> <span class="title_content">
             数字人民币扩容，旅游业入门淘金
         </span></a>]
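
The selector above depends on an element id (`m_news_info1231`) that matches a single article. Selecting by class names instead may match every article link at once. The sketch below assumes the other entries share the `news_title` and `article_title` classes seen in the output; the HTML is a shortened, hypothetical sample parsed with the built-in parser.

```python
from bs4 import BeautifulSoup

html = '''
<div id="m_news_info1231"><div class="news_title">
<a class="article_title" href="/h-nd-1231.html" title="Sample headline">
  <span class="title_content">
      Sample headline
  </span></a>
</div></div>
'''
soup = BeautifulSoup(html, "html.parser")

items = []
for a in soup.select("div.news_title > a.article_title"):
    # strip=True removes the whitespace and line breaks around the title text
    items.append((a.get_text(strip=True), a.get("href")))

print(items)
```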

# Youdao Translate (有道翻译)

<img src="../../../images/5211e8f366e5fb640256fc46c2ea8a3302ca6b9bd1d0d3bd00b6e93dd8a8de4c.png" width="67.5%" height="67.5%" align=center />

## 1. Scraping: the plain approach

In [40]:
import json
import requests

def get_translate_data(word=None):
    url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'
    form_data = {   # form data
        'i': word,
        'from': 'AUTO',
        'to': 'AUTO',
        'smartresult': 'dict',
        'client': 'fanyideskweb',
        'salt': '1512399450582',
        'sign': '78181ebbdcb38de9b4a3f4cd1d38816b',
        'doctype': 'json',
        'version': '2.1',
        'keyfrom': 'fanyi.web',
        'action': 'FY_BY_CLICKBUTTION',
        'typoResult': 'false'
    }

    # POST the form data
    response = requests.post(url=url, data=form_data)     # returns a Response object
    # Equivalent to: content = json.loads(response.text)
    content = response.json()
    print("\nTranslation: ", content['translateResult'][0][0]['tgt'], sep='')

get_translate_data('Hello')


Translation: 你好
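
The code above reads only `content['translateResult'][0][0]['tgt']`, the first sentence of the first paragraph. The sketch below walks the whole nested structure. The helper name `collect_translations` and the sample payload are hypothetical, modeled on the `src` and `tgt` fields used in this notebook.

```python
def collect_translations(content):
    """Return (source, translation) pairs from a response dictionary."""
    pairs = []
    for paragraph in content.get("translateResult", []):
        for sentence in paragraph:
            pairs.append((sentence["src"], sentence["tgt"]))
    return pairs

# Hypothetical payload shaped like the fields used in this notebook
sample = {"translateResult": [[{"src": "Hello", "tgt": "你好"}],
                              [{"src": "test", "tgt": "测试"}]]}
print(collect_translations(sample))
```

Using `.get("translateResult", [])` returns an empty list instead of raising `KeyError` when the key is missing.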


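## 2. Scraping combined with reversing the `JS` encryption

> [The `JS` encryption explained in detail](https://juejin.cn/post/6932769337115688974?#heading-2)

> On every translation, the data under `Payload` -> `Form Data` of the `translate_o?smartresult=dict&smartresult=rule` endpoint changes
- Specifically these four fields: `salt`, `sign`, `lts`, `bv` (`bv` does not appear to change)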

<img src="../../../images/09ea1d0966caf59af1185197051b85869bdcf991b4970935bccab34826cba2db.png" width="40%" height="40%">


### **1. The encryption function: `JS` version**

<img src="../../../images/0b51701438e3c7d3736632419eadda220cf75c74466a25fccc55ea33b53dec4b.png" width="67.5%" height="67.5%" align=center>

In [None]:
# Encryption function
'''
    var r = function(e) {
        var t = n.md5(navigator.appVersion)
          , r = "" + (new Date).getTime()
          , i = r + parseInt(10 * Math.random(), 10);
        return {
            ts: r,
            bv: t,
            salt: i,
            sign: n.md5("fanyideskweb" + e + i + "Ygy_4c=r#e#4EX^NUGUc5")
        }
    };
'''

#### (1) Computing `bv`

> **`appVersion`**

<img src="../../../images/5f0798ef0dbb39f4f754f7251afcf240146a12fa395fb75203a49e6307da7b1f.png" width="75%" height="75%">

In [28]:
from hashlib import md5
appVersion = "5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"

t = md5(appVersion.encode()).hexdigest()    # hexdigest() returns the digest as a hexadecimal string
print(t)

f4ba6f45063bb1cd2f2872ed23e89c8f


> The result here is the same as the `bv` value
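
Comparing the `appVersion` string above with the `User-Agent` header used later in this notebook suggests that, for this browser, `appVersion` is the User-Agent without the leading `Mozilla/`. The sketch below derives `bv` under that assumption; other browsers may behave differently, and the variable names are illustrative.

```python
from hashlib import md5

user_agent = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36")

# Assumption: appVersion is the User-Agent with the "Mozilla/" prefix removed
prefix = "Mozilla/"
app_version = user_agent[len(prefix):] if user_agent.startswith(prefix) else user_agent

bv = md5(app_version.encode("utf-8")).hexdigest()
print(app_version)
print(bv)
```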

#### (2) Computing `lts`

**That is, the timestamp as produced by `JS`**

<img src="../../../images/3fbe4a8516c35884b5b2840932bb992ef5f543544745064dc006ff1c38130a98.png" width="35%" height="35%">

In [22]:
import time
r = time.time()
print("Python timestamp", r)
print("JS timestamp", 1650974766322)

Python timestamp 1650975545.8530574
JS timestamp 1650974766322


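- Multiply by 1000, convert to `int` to drop the fractional part, then convert to a string

> **Because the value is a string in JS**

In [23]:
r = str(int(time.time()*1000))
print("Python timestamp", r)
print("JS timestamp", 1650974766322)

Python timestamp 1650975616016
JS timestamp 1650974766322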
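
An equivalent way to get the millisecond timestamp without going through a float is `time.time_ns()`. This is a small sketch, and the helper name `js_timestamp` is hypothetical.

```python
import time

def js_timestamp():
    """Millisecond Unix timestamp as a string, like "" + (new Date).getTime() in JS."""
    return str(time.time_ns() // 1_000_000)

ts = js_timestamp()
print(ts, len(ts))
```

For current dates the string has 13 digits, matching the JS value shown above.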


#### (3) Computing `salt`

`i = r + parseInt(10 * Math.random(), 10)`

Append a random integer from 0 to 9 to the string `r`; the second argument `10` tells `parseInt` to parse in base 10

In [27]:
import random
r = str(int(time.time()*1000))
i = r + str(random.randint(0, 9))
print(i)

16509763352232


#### (4) Computing `sign`

`sign: n.md5("fanyideskweb" + e + i + "Ygy_4c=r#e#4EX^NUGUc5")`

In [30]:
e = "I love you"

r = str(int(time.time()*1000))
i = r + str(random.randint(0, 9))
sign = md5(("fanyideskweb" + e + i + "Ygy_4c=r#e#4EX^NUGUc5").encode()).hexdigest()
print(sign)

191e93864c6de81be2f212d143d9053f


### 2. The encryption function: `Python` version

In [None]:
# Encryption function: JS version
'''
    var r = function(e) {
        var t = n.md5(navigator.appVersion)
          , r = "" + (new Date).getTime()
          , i = r + parseInt(10 * Math.random(), 10);
        return {
            ts: r,
            bv: t,
            salt: i,
            sign: n.md5("fanyideskweb" + e + i + "Ygy_4c=r#e#4EX^NUGUc5")
        }
    };
'''

# Encryption function: Python version
import time
import random
from hashlib import md5
appVersion = "5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"

def r(e):
    t = md5(appVersion.encode()).hexdigest()    # hexdigest() returns the digest as a hexadecimal string
    r = str(int(time.time()*1000))
    i = r + str(random.randint(0,9))

    return {
        "ts":r,
        "bv":t,
        "salt":i,
        "sign":md5(("fanyideskweb" + e + i + "Ygy_4c=r#e#4EX^NUGUc5").encode()).hexdigest()
    }
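
Because `r(e)` depends on the clock and on a random digit, its output differs on every call, which makes it hard to check. One possible variant, sketched below, lets both be fixed. The names `make_params`, `now_ms`, `digit` and `SECRET` are hypothetical; the constant strings are the ones from the JS function above.

```python
import random
import time
from hashlib import md5

SECRET = "Ygy_4c=r#e#4EX^NUGUc5"   # constant taken from the JS function above

def make_params(text, app_version, now_ms=None, digit=None):
    """Build ts/bv/salt/sign; now_ms and digit can be fixed to get repeatable output."""
    ts = str(int(time.time() * 1000) if now_ms is None else now_ms)
    if digit is None:
        digit = random.randint(0, 9)
    salt = ts + str(digit)
    sign = md5(("fanyideskweb" + text + salt + SECRET).encode()).hexdigest()
    return {"ts": ts, "bv": md5(app_version.encode()).hexdigest(),
            "salt": salt, "sign": sign}

a = make_params("Hello", "5.0 (X11)", now_ms=1650974766322, digit=7)
b = make_params("Hello", "5.0 (X11)", now_ms=1650974766322, digit=7)
print(a["salt"], a == b)
```

With the same inputs the two dictionaries are identical, and changing `digit` changes both `salt` and `sign`.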

## 3. Scraping combined with reversing the `JS` encryption: the code

In [7]:
import json
import requests

import time
import random
from hashlib import md5

appVersion = "5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
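url = 'http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule'

# Request headers copied from the browser; note that fanyi() below does not pass them to requests.post
headers = {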
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7",
    "Connection": "keep-alive",
    "Content-Length": "261",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Cookie": "P_INFO=null; UM_distinctid=18031c6458f169-07ea19b779ecdd-5b4c7c4c-f7716-18031c64590151; OUTFOX_SEARCH_USER_ID=-163867329@10.110.96.158; OUTFOX_SEARCH_USER_ID_NCOO=1855100400.923158; fanyi-ad-id=305676; fanyi-ad-closed=0; JSESSIONID=aaakE6qmBnMvMfr5nFLby; ___rl__test__cookies=1650974132444",
    "Host": "fanyi.youdao.com",
    "Origin": "https://fanyi.youdao.com",
    "Referer": "https://fanyi.youdao.com/",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

def r(e):
    t = md5(appVersion.encode()).hexdigest()    # hexdigest() returns the digest as a hexadecimal string
    r = str(int(time.time()*1000))
    i = r + str(random.randint(0,9))

    return {
        "ts":r,
        "bv":t,
        "salt":i,
        "sign":md5(("fanyideskweb" + e + i + "Ygy_4c=r#e#4EX^NUGUc5").encode()).hexdigest()
    }

def fanyi(word=None):
    
    data = r(word)
    # print(word)
    params = {   # parameters (form data)
        'i':word,
        'from':'AUTO',
        'to': 'AUTO',
        'smartresult': 'dict',
        'client':'fanyideskweb',
        'salt':data["salt"],
        'sign':data["sign"],
        'lts':data["ts"],
        'bv':data["bv"],
        'doctype':'json',
        'version': '2.1',
        'keyfrom':'fanyi.web',
        'action':'FY_BY_CLICKBUTTION',
        'typoResult':'false'
    }

    # POST the form data
    response = requests.post(url=url, data=params)     # returns a Response object

    # Return the parsed JSON data
    return response.json()
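
`requests.post(..., data=params)` sends the dictionary as a URL-encoded form body. The sketch below uses the standard library's `urlencode`, which performs the same kind of encoding, to show that a stray space inside a value is transmitted (as `+`) rather than ignored. The two dictionaries are illustrative.

```python
from urllib.parse import urlencode

clean = {"action": "FY_BY_CLICKBUTTION", "doctype": "json"}
stray = {"action": "FY_BY_ CLICKBUTTION", "doctype": "json"}   # note the space

clean_body = urlencode(clean)
stray_body = urlencode(stray)
print(clean_body)
print(stray_body)
```

The same applies to spaces inside the query string of the URL itself, so values copied from the browser are worth checking for stray whitespace.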

In [3]:
if __name__ == "__main__":
    word = input("Enter the text to translate: ")
    print(fanyi(word)['translateResult'][0][0]['tgt'])

test
测试


## Translating text read from a file

In [20]:
if __name__ == "__main__":
    # Open the article to translate
    with open("文章.txt", mode="r", encoding="utf-8") as f:
        # Read the whole article
        text = f.read()
    result = fanyi(text)
    r_data = result["translateResult"]

    # Save the translation results
    with open("文章翻译.txt", mode="w", encoding="utf-8") as f:
        for data in r_data:
            f.write(data[0]["tgt"])     # translation
            f.write('\n')
            f.write(data[0]["src"])     # source text
            f.write('\n')
            print(data[0]["tgt"])
            print(data[0]["src"])

I wish you a happy simple, warm and kind.
《愿你简单快乐，温暖善良》
Author: whales fall April 22, 2022, literature appreciation
作者: 鲸落2022年04月22日美文欣赏
Sometimes will think, the more people grow up, the more difficult it is to happiness?
有时候会不会觉得，人越长大，越难快乐了？


Is not life become boring, less happy, but happy threshold was raised, change, is our mood.
其实不是生活变得乏味，快乐变少了，而是快乐的门槛被提高了，变化的，是我们的心境。


Most of the time, we want to be too much, care too much, but ignored the plain and simple happiness.
很多时候，我们想要的太多，在意的太多，却忽视了那些朴实的、简单的幸福。


One of the best life state, not every moment with a bang, but at the same time in the pursuit of a warm, also can cherish small beautiful days.
一个人最好的生活状态，并不是每时每刻都要轰轰烈烈，而是在追求热烈的同时，也能珍惜平淡日子里的小美好。


Today, do a simple people.
今天起，做一个简单的人。


Shopping and reading books, go for a walk, have their own small life and interest, to live out yourself.
逛逛街，看看书，散散步，有自己的小生活和小情趣，努力活出自己就好。


As one writer said: "today's things, devoted all my heart and soul to try to do it, no matter how, 