## Reading HTML Tables from Web Pages with Pandas and a Python Scraper, and Saving Them to Excel

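Goals:
* NetEase Youdao Dictionary can be used to look up English words, and looked-up words can be added to a wordbook.
* It currently has no feature for exporting the full word list. To make review easier, we can scrape the whole word list and save it to Excel.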

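Technologies involved:
* Pandas: a widely used Python library for data processing and data analysis
* Python scraper: downloads web pages and then parses them. It is implemented with the requests library, and because the wordbook sits behind a login, the requests must carry the cookies of a logged-in session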


In [1]:
import requests
import requests.cookies
import json
import time
import pandas as pd

### 0. Workflow

<h4>Input web page: Youdao Dictionary wordbook</h4>
<img src="./course_datas/c32_read_html/youdao_cidian.png" style="width:50%; margin-left:0px;"/>

<h4>Workflow</h4>
<img src="./course_datas/c32_read_html/ppt_flow.png" style="width:70%; margin-left:0px;"/>

<h4>Results written to an Excel file (convenient for printing and review):</h4>
<img src="./course_datas/c32_read_html/output_excel.png" style="width:70%; margin-left:0px;"/>

### 1. Log in to the PC version of NetEase Youdao Dictionary by scanning the QR code with WeChat, then copy the cookies to a file

* PC version: http://dict.youdao.com/  
* A Chrome extension can copy the cookies in JSON format: http://www.editthiscookie.com/

In [2]:
# Build a cookie jar from the JSON array exported by the browser extension
cookie_jar = requests.cookies.RequestsCookieJar()

with open("./course_datas/c32_read_html/cookie.txt", encoding="utf-8") as fin:
    cookiejson = json.load(fin)
    for cookie in cookiejson:
        cookie_jar.set(
            name=cookie["name"],
            value=cookie["value"],
            domain=cookie["domain"],
            path=cookie["path"]
        )
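The loading loop assumes that the exported file is a JSON array in which every entry has `name`, `value`, `domain` and `path` keys. The following is a minimal sketch of that assumed structure, not the real file: `SAMPLE_EXPORT` and its placeholder values are made up for illustration, and `build_cookie_jar` is a hypothetical helper name.

```python
import json

import requests.cookies

# Hypothetical sample of an exported cookie file; the values are placeholders.
SAMPLE_EXPORT = """
[
  {"name": "DICT_LOGIN", "value": "placeholder-1", "domain": ".youdao.com", "path": "/"},
  {"name": "DICT_SESS", "value": "placeholder-2", "domain": ".youdao.com", "path": "/"}
]
"""


def build_cookie_jar(cookie_json_text):
    """Turn an exported JSON cookie array into a RequestsCookieJar."""
    jar = requests.cookies.RequestsCookieJar()
    for cookie in json.loads(cookie_json_text):
        jar.set(
            cookie["name"],
            cookie["value"],
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return jar


jar = build_cookie_jar(SAMPLE_EXPORT)
# List the cookie names so a missing login cookie is easy to spot.
print(sorted(cookie.name for cookie in jar))
```

Printing the names before scraping is a cheap way to confirm that the export actually contains the login cookies.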

In [3]:
cookie_jar

<RequestsCookieJar[Cookie(version=0, name='DICT_LOGIN', value='3||1578922508302', port=None, port_specified=False, domain='.youdao.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False), Cookie(version=0, name='DICT_PERS', value='v2|weixin||DICT||web||2592000000||1578922508299||114.244.161.198||wxoXQUDj_FtHSw23tfJWsboPkq38ok||gFnMeLRLQLRpBOMYMhf6LRUf0Mz5P4TLRqSOM6uhfY5RzW0L6ZhHTB0kGRHeukLg40QZOMOMkMwu0gBkfJF0LTL0', port=None, port_specified=False, domain='.youdao.com', domain_specified=True, domain_initial_dot=True, path='/', path_specified=True, secure=False, expires=None, discard=True, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False), Cookie(version=0, name='DICT_SESS', value='v2|odmTRIUgTmgz6MlEOMqB0TBnfk5h4pZ0Py0MeBP4Q40qynHeuPMOWRpLPMY5RHJuRQykfJBOLQBRPKO4YYOLquR6zhLwBnMYMR', port=None, port_specified=False, domain='

### 2. Download all the HTML pages and store them in a list

In [4]:
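htmls = []
url = "http://dict.youdao.com/wordbook/wordlist?p={idx}&tags="
for idx in range(6):
    time.sleep(1)  # pause between requests to avoid hammering the server
    print("**Fetching data: page %d" % idx)
    r = requests.get(url.format(idx=idx), cookies=cookie_jar)
    htmls.append(r.text)

**Fetching data: page 0
**Fetching data: page 1
**Fetching data: page 2
**Fetching data: page 3
**Fetching data: page 4
**Fetching data: page 5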
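The download loop stores whatever the server returns, even an error page. As a hedged variation, the sketch below reuses one `requests.Session`, sets a timeout, and raises on HTTP error codes. The names `fetch_pages` and `make_session` and their parameters are hypothetical; how the site responds to expired cookies is not known here, so only the HTTP status is checked.

```python
import time

import requests

URL_TEMPLATE = "http://dict.youdao.com/wordbook/wordlist?p={idx}&tags="


def fetch_pages(session, page_count, delay=1.0, timeout=10):
    """Download each wordbook page and return the HTML strings in page order."""
    pages = []
    for idx in range(page_count):
        time.sleep(delay)  # wait `delay` seconds before every request
        response = session.get(URL_TEMPLATE.format(idx=idx), timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx instead of storing an error page
        pages.append(response.text)
    return pages


def make_session(cookie_jar):
    """Create a Session that sends the given cookies with every request."""
    session = requests.Session()
    session.cookies.update(cookie_jar)
    return session
```

With the notebook's `cookie_jar`, the call would look like `htmls = fetch_pages(make_session(cookie_jar), page_count=6)`.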


In [5]:
htmls[0]

'<!doctype html>\n<html>\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>\n<title>有道单词本</title>\n\n<link rel="canonical" href="http://dict.youdao.com/wordbook/"/> \n<meta name="Keywords" content="单词本,web单词本,有道,词典,youdao" />\n<meta name="Description" content="有道词典单词本" />\n<link rel="shortcut icon" href="http://shared.ydstatic.com/images/favicon.ico?213" type="image/x-icon"/>\n<link href="http://shared.ydstatic.com/r/1.0/s/g3.css?20110428" rel="stylesheet" type="text/css"/>\n<link type="text/css" href="resources/styles/main.css" rel="stylesheet">\n\n<style type="text/css">\n\n#f{background-image:url(http://shared.ydstatic.com/images/skins/default/skin-x.jpg)}\n#fbl{background:url(http://shared.ydstatic.com/images/skins/default/skin_.jpg) left top}\n#fbr{background:url(http://shared.ydstatic.com/images/skins/default/skin_.jpg) right -200px}\n\n</style>\n<script type="text/javascript">\nvar VARIABLES={ \n                tags:"",\n                page:"0",\n    

### 3. Parse the tables in the web page with Pandas

In [7]:
df = pd.read_html(htmls[0])

In [8]:
print(len(df))
print(type(df))

2
<class 'list'>
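`pd.read_html` returns a list with one DataFrame per `<table>` element it finds, which is why the result here has length 2. The sketch below shows this on a tiny hand-written page (`SAMPLE_HTML` is hypothetical, not the Youdao markup). The HTML is wrapped in `StringIO` because recent pandas versions deprecate passing a literal HTML string, and an HTML parser such as lxml is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature page: one table with a header row, one without.
SAMPLE_HTML = """
<table>
  <tr><th>word</th><th>meaning</th></tr>
  <tr><td>anatomy</td><td>n. the study of body structure</td></tr>
</table>
<table>
  <tr><td>1</td><td>backbone</td></tr>
  <tr><td>2</td><td>anatomy</td></tr>
</table>
"""

tables = pd.read_html(StringIO(SAMPLE_HTML))
print(len(tables))               # one DataFrame per <table>
print(list(tables[0].columns))   # names taken from the <th> cells
print(list(tables[1].columns))   # integer labels when there is no header row
```

A table without header cells gets integer column labels, which matches what the second table of the wordbook page looks like after parsing.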


In [9]:
df[0].head(3)

| | 序号 | 单词 | 音标 | 解释 | 时间 | 分类 | 操作 |
|---|---|---|---|---|---|---|---|


In [10]:
df[1].head(3)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | agglomerative | | adj. 会凝聚的；[冶] 烧结的，凝结的 | 2020-1-13 | | |
| 1 | 2 | anatomy | [ə'nætəmɪ] | n. 解剖；解剖学；剖析；骨骼 | 2017-7-17 | | |
| 2 | 3 | backbone | ['bækbəʊn] | n. 支柱;主干网;决心,毅力;脊椎 | 2017-7-13 | | |


In [11]:
df_cont = df[1]

In [12]:
df_cont.columns = df[0].columns
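Assigning `df[0].columns` to the data table only works when both tables have the same number of columns; otherwise pandas raises a `ValueError`. A small sketch of that step with an explicit check follows. `attach_header` is a hypothetical helper, and the two toy frames merely stand in for `df[0]` and `df[1]`.

```python
import pandas as pd


def attach_header(header_df, body_df):
    """Return a copy of body_df labelled with header_df's column names."""
    if len(header_df.columns) != len(body_df.columns):
        raise ValueError(
            "header has %d columns but body has %d"
            % (len(header_df.columns), len(body_df.columns))
        )
    labelled = body_df.copy()  # leave the parsed table untouched
    labelled.columns = header_df.columns
    return labelled


# Toy stand-ins for df[0] (header only) and df[1] (rows only).
header = pd.DataFrame(columns=["序号", "单词", "解释"])
body = pd.DataFrame([[1, "anatomy", "n. 解剖"], [2, "backbone", "n. 支柱"]])
print(attach_header(header, body))
```

Working on a copy keeps the originally parsed table available in case the header turns out to be wrong.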

In [13]:
df_cont.head(3)

| | 序号 | 单词 | 音标 | 解释 | 时间 | 分类 | 操作 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | agglomerative | | adj. 会凝聚的；[冶] 烧结的，凝结的 | 2020-1-13 | | |
| 1 | 2 | anatomy | [ə'nætəmɪ] | n. 解剖；解剖学；剖析；骨骼 | 2017-7-17 | | |
| 2 | 3 | backbone | ['bækbəʊn] | n. 支柱;主干网;决心,毅力;脊椎 | 2017-7-13 | | |


In [14]:
# Collect the tables from all 6 pages
df_list = []
for html in htmls:
    df = pd.read_html(html)
    df_cont = df[1]  # second table: the data rows
    df_cont.columns = df[0].columns  # first table: the column names
    df_list.append(df_cont)

In [15]:
# Merge the per-page tables into one
df_all = pd.concat(df_list)

In [16]:
df_all.head(3)

| | 序号 | 单词 | 音标 | 解释 | 时间 | 分类 | 操作 |
|---|---|---|---|---|---|---|---|
| 0 | 1 | agglomerative | | adj. 会凝聚的；[冶] 烧结的，凝结的 | 2020-1-13 | | |
| 1 | 2 | anatomy | [ə'nætəmɪ] | n. 解剖；解剖学；剖析；骨骼 | 2017-7-17 | | |
| 2 | 3 | backbone | ['bækbəʊn] | n. 支柱;主干网;决心,毅力;脊椎 | 2017-7-13 | | |


In [17]:
df_all.shape

(86, 7)
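By default `pd.concat` keeps each page's own row labels, so the merged index restarts at 0 for every page. If a single continuous index is preferred, `ignore_index=True` renumbers the rows. The sketch below illustrates the difference on two toy frames that stand in for parsed pages.

```python
import pandas as pd

# Toy stand-ins for two parsed pages.
page_0 = pd.DataFrame({"单词": ["agglomerative", "anatomy"], "时间": ["2020-1-13", "2017-7-17"]})
page_1 = pd.DataFrame({"单词": ["backbone"], "时间": ["2017-7-13"]})

kept = pd.concat([page_0, page_1])
renumbered = pd.concat([page_0, page_1], ignore_index=True)

print(list(kept.index))        # row labels repeat across pages
print(list(renumbered.index))  # one continuous range
```

The Excel export is unaffected by the repeated labels when it is written with `index=False`, so renumbering mainly matters for label-based lookups such as `df_all.loc[0]`.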

### 4. Write the results to an Excel file

In [18]:
df_all[["单词", "音标", "解释"]].to_excel("./course_datas/c32_read_html/网易有道单词本列表.xlsx", index=False)
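Writing `.xlsx` files with `to_excel` needs an Excel writer engine such as openpyxl to be installed. The sketch below is one possible way to guard against a missing engine: it falls back to a UTF-8 CSV with a byte-order mark, which Excel typically opens with the Chinese text intact. The CSV fallback and the helper name `export_words` are additions for illustration, not part of the original notebook, and the toy frame stands in for `df_all`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_words(df, xlsx_path, columns=("单词", "音标", "解释")):
    """Write the chosen columns to xlsx, or to CSV if no Excel engine is installed."""
    selected = df[list(columns)]
    xlsx_path = Path(xlsx_path)
    try:
        selected.to_excel(xlsx_path, index=False)
        return xlsx_path
    except ImportError:
        # Fallback (not in the original notebook): a BOM-prefixed UTF-8 CSV.
        csv_path = xlsx_path.with_suffix(".csv")
        selected.to_csv(csv_path, index=False, encoding="utf-8-sig")
        return csv_path


# Toy stand-in for df_all, written to a temporary directory.
words = pd.DataFrame(
    {"单词": ["anatomy"], "音标": ["[ə'nætəmɪ]"], "解释": ["n. 解剖"], "时间": ["2017-7-17"]}
)
out_dir = Path(tempfile.mkdtemp())
written = export_words(words, out_dir / "words.xlsx")
print(written.name)
```

The function returns the path it actually wrote, so the caller can tell which format was produced.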