# Introduction to HTML Parsing and Chained URL Generation
![for humans](https://requests-html.kennethreitz.org/_static/requests-html-logo.png#thumbnail)

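*  This week: fundamentals and techniques for crawling pages in batches
*  Last week: HTML parsing and preparing chained URL generation
*  Spring 2020, Web Data Mining, week04
*  Lecture notes designed by: 廖汉腾, 许智超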
<br/>
<br/>

-----
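## Review

Review of last week's material and practice:

* 猎聘 (Liepin) desktop site liepin.com: extracting the useful data from the job-search URL parameters
* How to generate a series of new URLs for further crawling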


-----
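## This week's content and learning goals

This week focuses on

<mark> how to systematically crawl many more pages of data that share the same structure </mark>

To do that, we need to learn

* Pagination: taking the parameter dictionary apart
  * xpath
  * building a parameter template
  * building a parameter dictionary
* Pagination: systematic iteration
  * robots.txt
  * request frequency and timing
* Pagination: backing up and consolidating data
  * saving backups
  * consolidating data
  
### Goals
1. Use requests-html to crawl web pages and save them as text files; consult the [requests-html documentation (Chinese)](https://cncert.github.io/requests-html-doc-cn/#/)
2. Get familiar with [xpath syntax](https://www.w3cschool.cn/xpath/xpath-syntax.html) and [xpath nodes](https://www.w3cschool.cn/xpath/xpath-nodes.html)
3. Use the [xpath cheatsheet](https://devhints.io/xpath)
  * in Chrome Inspector
  * in requests-html (Python)
4. Basic use of [pd.DataFrame](https://www.pypandas.cn/doc/getting_started/dsintro.html#dataframe)
5. Taking apart and iterating over the parameter dictionary
6. Backing up and consolidating paginated data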

In [1]:
%%html
<style>
/* 本电子讲义使用之CSS */
div.code_cell {
    background-color: #e5f1fe;
}
div.cell.selected {
    background-color: #effee2;
    font-size: 2rem;
    line-height: 2.4rem;
}
div.cell.selected .rendered_html table {
    font-size: 2rem !important;
    line-height: 2.4rem !important;
}
.rendered_html pre code {
    background-color: #C4E4ff;   
    padding: 2px 25px;
}
.rendered_html pre {
    background-color: #99c9ff;
}
div.code_cell .CodeMirror {
    font-size: 2rem !important;
    line-height: 2.4rem !important;
}
.rendered_html img, .rendered_html svg {
    max-width: 60%;
    height: auto;
    float: right;
}

.rendered_html img[src*="#full"], .rendered_html svg[src*="#full"] {
    max-width: 100%;
    height: auto;
    float: none;
}

.rendered_html img[src*="#thumbnail"], .rendered_html svg[src*="#thumbnail"] {
    max-width: 15%;
    height: auto;
}

/* Gradient transparent - color - transparent */
hr {
    border: 0;
    border-bottom: 1px dashed #ccc;
}
.emoticon{
    font-size: 5rem;
    line-height: 4.4rem;
    text-align: center;
    vertical-align: middle;
}
.bg-split_apply_comine {
    width: 500px;     
    height: 300px;
    background: url('02_split-apply-comine_500x300.png') -10px -10px;
    float: right;
}
.bg-comine {
    width: 175px;
    height: 150px;
    background: url('02_split-apply-comine_500x300.png') -280px -80px;
    float: right;
}
.bg-apply {
    width: 155px;
    height: 225px;
    background: url('02_split-apply-comine_500x300.png') -160px -30px;
    float: right;
}
.bg-split {
    width: 205px;
    height: 225px;
    background: url('02_split-apply-comine_500x300.png') -10px -30px;
    float: right;
}
.break {
                   page-break-after: right; 
                   width:700px;
                   clear:both;
}
</style>

In [1]:
# Core modules
import pandas as pd
from requests_html import HTMLSession

## 0. Last week's consolidated code

In [3]:
# Last week C-1B-5: build the parameter template
参数_compTag_用户体验 = {
    '中国500强': {
        'init': ['-1'],
        'headckid': ['58d828c357a8cb19'],
        'flushckid': ['1'],
        'fromSearchBtn': ['2'],
        'keyword': ['用户体验'],
        'compTag': ['155'],
        'ckid': ['58d828c357a8cb19'],
        'siTag': ['1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw'],
        'd_sfrom': ['search_unknown'],
        'd_ckId': ['6aa779111c1b4ca77cff3648d9dee049'],
        'd_curPage': ['0'],
        'd_pageSize': ['40'],
        'd_headId': ['6aa779111c1b4ca77cff3648d9dee049']
    },
    '2018互联网300强': {
        'init': ['-1'],
        'headckid': ['58d828c357a8cb19'],
        'flushckid': ['1'],
        'fromSearchBtn': ['2'],
        'keyword': ['用户体验'],
        'compTag': ['182'],
        'ckid': ['58d828c357a8cb19'],
        'siTag': ['1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw'],
        'd_sfrom': ['search_unknown'],
        'd_ckId': ['6aa779111c1b4ca77cff3648d9dee049'],
        'd_curPage': ['0'],
        'd_pageSize': ['40'],
        'd_headId': ['6aa779111c1b4ca77cff3648d9dee049']
    },
    '制造业500强': {
        'init': ['-1'],
        'headckid': ['58d828c357a8cb19'],
        'flushckid': ['1'],
        'fromSearchBtn': ['2'],
        'keyword': ['用户体验'],
        'compTag': ['186'],
        'ckid': ['58d828c357a8cb19'],
        'siTag': ['1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw'],
        'd_sfrom': ['search_unknown'],
        'd_ckId': ['6aa779111c1b4ca77cff3648d9dee049'],
        'd_curPage': ['0'],
        'd_pageSize': ['40'],
        'd_headId': ['6aa779111c1b4ca77cff3648d9dee049']
    },
    'AI创新成长50强': {
        'init': ['-1'],
        'headckid': ['58d828c357a8cb19'],
        'flushckid': ['1'],
        'fromSearchBtn': ['2'],
        'keyword': ['用户体验'],
        'compTag': ['189'],
        'ckid': ['58d828c357a8cb19'],
        'siTag': ['1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw'],
        'd_sfrom': ['search_unknown'],
        'd_ckId': ['6aa779111c1b4ca77cff3648d9dee049'],
        'd_curPage': ['0'],
        'd_pageSize': ['40'],
        'd_headId': ['6aa779111c1b4ca77cff3648d9dee049']
    },
    '独角兽': {
        'init': ['-1'],
        'headckid': ['58d828c357a8cb19'],
        'flushckid': ['1'],
        'fromSearchBtn': ['2'],
        'keyword': ['用户体验'],
        'compTag': ['130'],
        'ckid': ['58d828c357a8cb19'],
        'siTag': ['1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw'],
        'd_sfrom': ['search_unknown'],
        'd_ckId': ['6aa779111c1b4ca77cff3648d9dee049'],
        'd_curPage': ['0'],
        'd_pageSize': ['40'],
        'd_headId': ['6aa779111c1b4ca77cff3648d9dee049']
    },
    '上市公司': {
        'init': ['-1'],
        'headckid': ['58d828c357a8cb19'],
        'flushckid': ['1'],
        'fromSearchBtn': ['2'],
        'keyword': ['用户体验'],
        'compTag': ['156'],
        'ckid': ['58d828c357a8cb19'],
        'siTag': ['1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw'],
        'd_sfrom': ['search_unknown'],
        'd_ckId': ['6aa779111c1b4ca77cff3648d9dee049'],
        'd_curPage': ['0'],
        'd_pageSize': ['40'],
        'd_headId': ['6aa779111c1b4ca77cff3648d9dee049']
    }
}

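# Last week C-1: multi-page preparation, test 1 (中国500强)
url = "https://www.liepin.com/zhaopin/"
session = HTMLSession()
payload = 参数_compTag_用户体验['中国500强']
r = session.get(url, params=payload)

# r.url

# Last week C-2: simplified A-1, crawl and parse a single page
session = HTMLSession()


def requests_liepin(url, params):
    # Query with the params argument passed in by the caller
    r = session.get(url, params=params)

    # Select the container elements first, then query only their descendants
    主要元素 = r.html.xpath('//ul[@class="sojob-list"]/li')

    # xpath dictionary: keys are the fields to extract, values are xpath expressions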
    dict_xpaths = {
        'text': {
            'edu':
            '//div[contains(@class,"job-info")]/p/span[@class="edu"]',
            '经验':
            '//div[contains(@class,"job-info")]/p/span[@class="edu"]/following-sibling::span',
            '薪水':
            '//div[contains(@class,"job-info")]/p/span[@class="text-warning"]',
            '时间':
            '//div[contains(@class,"job-info")]/p/time/@title',
            '职称':
            '//div[contains(@class,"job-info")]/h3/a',
            '公司地点':
            '//div[contains(@class,"job-info")]/p/a',
            '公司名称':
            '//div[contains(@class,"sojob-item-main")]//p[@class="company-name"]/a',
        },
        'text_content': {},
        'href': {
            '链结':
            '//div[contains(@class,"job-info")]/h3/a',
            '公司URL':
            '//div[contains(@class,"sojob-item-main")]//p[@class="company-name"]/a',
        }
    }

    def get_e_text_content(_xpath_):
        # List comprehension: full text content of the first match in each item
        暂存结果 = [e.xpath(_xpath_)[0].lxml.text_content() for e in 主要元素]
        return (暂存结果)

    def get_e_text(_xpath_):
        # Nested list comprehension: join the stripped text of every match in each item
        暂存结果 = [
            "".join([
                x.strip() if type(x) is str else x.text.strip()
                for x in e.xpath(_xpath_)
            ]) for e in 主要元素
        ]
        return (暂存结果)

    def get_e_href(_xpath_):
        # List comprehension: first absolute link of each item, or "" if there is none
        暂存结果 = [list(e.xpath(_xpath_, first=True).absolute_links)[0] \
                   if len(e.xpath(_xpath_, first=True).absolute_links) >= 1  \
                   else "" for e in 主要元素]
        return (暂存结果)

    # Run .xpath only within the container elements
    数据字典 = dict()

    数据字典 = {
        k: get_e_text_content(v)
        for k, v in dict_xpaths['text_content'].items()
    }
    数据字典.update({k: get_e_text(v) for k, v in dict_xpaths['text'].items()})
    数据字典.update({k: get_e_href(v) for k, v in dict_xpaths['href'].items()})

    数据 = pd.DataFrame(数据字典)
    #数据.to_excel("20春_Web数据挖掘_week03_liepin.xlsx", sheet_name="搜查结果")
    return (数据)


# Last week C-3: multiple pages
url = "https://www.liepin.com/zhaopin/"

list_df = list()
for k, v in 参数_compTag_用户体验.items():
    payload = v
    df = requests_liepin(url, params=payload)
    df = df.assign(热门公司类型=k)
    list_df.append(df)

df_all = pd.concat(list_df)
df_all

# Last week C-4: output
df_all.to_excel("20春_Web数据挖掘_week03_liepin_各热门公司类型.xlsx", sheet_name="搜查结果")

# Last week C-5: basic pandas skills

print(df_all.nunique())
df_all[['edu']].drop_duplicates()

df_all.groupby(['公司名称', 'edu']).agg({
    "职称": "count"
}).sort_values(by='职称', ascending=False)

edu         5
经验         10
薪水         74
时间         25
职称        189
公司地点       66
公司名称       72
链结        200
公司URL      72
热门公司类型      6
dtype: int64


| 公司名称 | edu | 职称 |
| --- | --- | --- |
| 阿里巴巴 | 学历不限 | 38 |
| 华为 | 本科及以上 | 16 |
| 小米 | 统招本科 | 15 |
| 上海擎创信息技术有限公司 | 本科及以上 | 11 |
| 华为 | 统招本科 | 8 |
| ... | ... | ... |
| 戴维医疗 | 统招本科 | 1 |
| 招金矿业 | 统招本科 | 1 |
| 新城悦控股有限公司 | 大专及以上 | 1 |
| 方太 | 本科及以上 | 1 |



-----

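## This week's practice goals
<mark> How to systematically crawl many more pages of data that share the same structure </mark>[猎聘 (Liepin) desktop site](https://www.liepin.com/zhaopin/)
* Pagination: taking the parameter dictionary apart
  * parsing the pager links a/@href with xpath
  * building a parameter template
  * building a parameter dictionary
* Pagination: systematic iteration
  * robots.txt
  * request frequency and timing
* Pagination: backing up and consolidating data
  * saving backups
  * consolidating data

# Pagination: taking the parameter dictionary apart
## Parsing the pager links a/@href with xpath

* Open a connection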

In [5]:
# A-0: a single page
url = "https://www.liepin.com/zhaopin/?keyword=PRD"
session = HTMLSession()
r = session.get( url )

In [6]:
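# A-1: parse the pager links a/@href with xpath
xpath_翻页a = '//div[@class="pagerbar"]/a' # first attempt: also matches disabled/current links whose href is javascript
xpath_翻页a = '//div[@class="pagerbar"]/a[starts-with(@href,"/zhaopin")]' # keep only real page links
print (r.html.xpath(xpath_翻页a)) # Element objects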

href_列表 = [x.xpath('//@href')[0] for x in r.html.xpath(xpath_翻页a)]
#print (href_列表)

文字_列表 = [x.text for x in r.html.xpath(xpath_翻页a)]
#print (文字_列表)

href_字典 = {x.text:x.xpath('//@href')[0]  for x in r.html.xpath(xpath_翻页a)}
#print (href_字典)

[<Element 'a' href='/zhaopin/?init=-1&headckid=26f51855aa93f50f&fromSearchBtn=2&keyword=PRD&ckid=26f51855aa93f50f°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=d171c8b79eed664fd857fa9848a4bdea&d_curPage=0&d_pageSize=40&d_headId=d171c8b79eed664fd857fa9848a4bdea&curPage=1'>, <Element 'a' href='/zhaopin/?init=-1&headckid=26f51855aa93f50f&fromSearchBtn=2&keyword=PRD&ckid=26f51855aa93f50f°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=d171c8b79eed664fd857fa9848a4bdea&d_curPage=0&d_pageSize=40&d_headId=d171c8b79eed664fd857fa9848a4bdea&curPage=2'>, <Element 'a' href='/zhaopin/?init=-1&headckid=26f51855aa93f50f&fromSearchBtn=2&keyword=PRD&ckid=26f51855aa93f50f°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=d171c8b79eed664fd857fa9848a4bdea&d_curPage=0&d_pageSize=40&d_headId=d171c8b79eed664fd857fa9848a4bdea&curPage=3'>, <Element 'a' href='/zhaopin/?init

In [7]:
href_列表 = [x.xpath('//@href')[0] for x in r.html.xpath(xpath_翻页a)]
print (href_列表)

['/zhaopin/?init=-1&headckid=26f51855aa93f50f&fromSearchBtn=2&keyword=PRD&ckid=26f51855aa93f50f°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=d171c8b79eed664fd857fa9848a4bdea&d_curPage=0&d_pageSize=40&d_headId=d171c8b79eed664fd857fa9848a4bdea&curPage=1', '/zhaopin/?init=-1&headckid=26f51855aa93f50f&fromSearchBtn=2&keyword=PRD&ckid=26f51855aa93f50f°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=d171c8b79eed664fd857fa9848a4bdea&d_curPage=0&d_pageSize=40&d_headId=d171c8b79eed664fd857fa9848a4bdea&curPage=2', '/zhaopin/?init=-1&headckid=26f51855aa93f50f&fromSearchBtn=2&keyword=PRD&ckid=26f51855aa93f50f°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=d171c8b79eed664fd857fa9848a4bdea&d_curPage=0&d_pageSize=40&d_headId=d171c8b79eed664fd857fa9848a4bdea&curPage=3', '/zhaopin/?init=-1&headckid=26f51855aa93f50f&fromSearchBtn=2&keyword=PRD&ckid=26f51855aa93

In [8]:
文字_列表 = [x.text for x in r.html.xpath(xpath_翻页a)]
print (文字_列表)

['2', '3', '4', '5', '下一页', '']


In [9]:
href_字典 = {x.text:x.xpath('//@href')[0]  for x in r.html.xpath(xpath_翻页a)}
print (href_字典)

{'2': '/zhaopin/?init=-1&headckid=26f51855aa93f50f&fromSearchBtn=2&keyword=PRD&ckid=26f51855aa93f50f°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=d171c8b79eed664fd857fa9848a4bdea&d_curPage=0&d_pageSize=40&d_headId=d171c8b79eed664fd857fa9848a4bdea&curPage=1', '3': '/zhaopin/?init=-1&headckid=26f51855aa93f50f&fromSearchBtn=2&keyword=PRD&ckid=26f51855aa93f50f°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=d171c8b79eed664fd857fa9848a4bdea&d_curPage=0&d_pageSize=40&d_headId=d171c8b79eed664fd857fa9848a4bdea&curPage=2', '4': '/zhaopin/?init=-1&headckid=26f51855aa93f50f&fromSearchBtn=2&keyword=PRD&ckid=26f51855aa93f50f°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=d171c8b79eed664fd857fa9848a4bdea&d_curPage=0&d_pageSize=40&d_headId=d171c8b79eed664fd857fa9848a4bdea&curPage=3', '5': '/zhaopin/?init=-1&headckid=26f51855aa93f50f&fromSearchBtn=2&keyword=P

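### Observations:
Does this page tell us the start, step, and end values we need to set up the iteration?

* Old problem: the URLs are too long. Use last week's URL + query parameter parsing and a pandas DataFrame to find what is the same and what differs.
* Old problem: how do we generate URLs systematically? While comparing the parsed query parameters in a DataFrame, also build a parameter dictionary so that at least the following parameters can be adjusted:
  * Search keyword: last week's keyword (url = "https://www.liepin.com/zhaopin/?keyword=PRD")
  * Where is the page number? (//div[@class="pagerbar"])
* Practice challenge: how do we turn last week's code into modules we can reuse?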
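
As a minimal sketch of the "URL + query parameter parsing" step mentioned above, the standard library can split one pager link into its components. The link below is a shortened, hypothetical example; the real links on the page carry more parameters.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical, shortened pager link (assumption: real links have more parameters)
href = "/zhaopin/?init=-1&keyword=PRD&d_pageSize=40&curPage=2"

# urlparse splits a URL into scheme, netloc, path, params, query, fragment
parts = urlparse(href)

# parse_qs turns the query string into a dict whose values are lists of strings
query = parse_qs(parts.query)

print(parts.path)
print(query["keyword"], query["curPage"])
```

Because `parse_qs` returns list values, the resulting dictionary has the same shape as the parameter dictionaries used in this lecture.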

-----

## Building the parameter template

```python

# Last week B-1: parse with urllib.parse
from urllib.parse import urlparse, parse_qs


# Last week B-2: use pd.DataFrame and nunique() to count distinct values per column
import pandas as pd
df = pd.DataFrame([ urlparse(x) for x in 公司数据选择器链结.values()])
print(df.nunique())

# Last week B-3: parse the query component further
#df_qs = pd.DataFrame([ parse_qs(x) for x in df['query'] ])
df_qs = pd.DataFrame([{k:v[0] for k,v in parse_qs(x).items()} for x in df['query'] ])
print(df_qs.nunique())

# Last week B-4: build the parameter template and 字典_compTag
def parse_url_qs_for_compTag (url):
    six_parts = urlparse(url) 
    out = parse_qs(six_parts.query)
    return (out)

# parse_url_qs_for_compTag(list(公司数据选择器链结.values())[0])['compTag']
参数模板 = parse_url_qs_for_compTag(list(公司数据选择器链结.values())[0])
print(参数模板)
# [ parse_url_qs_for_compTag(x)['compTag'] for x in 公司数据选择器链结.values()]
[ parse_url_qs_for_compTag(x)['compTag'][0] for x in 公司数据选择器链结.values()]

字典_compTag = { k:parse_url_qs_for_compTag(v)['compTag'][0] for k,v in 公司数据选择器链结.items()}
print (字典_compTag)

# B-5: build the parameter template generator
def 参数模板生成(compTag , keyword ):
    参数 = 参数模板.copy()
    参数['compTag'] = compTag
    参数['keyword'] = keyword
    return (参数)

参数_compTag_用户体验 = { k:参数模板生成(compTag = [v], keyword = ['用户体验']) for k,v in 字典_compTag.items()}
print(参数_compTag_用户体验)

```

In [10]:
# A-2 Build the parameter template: find the key parameters and the parameter structure

# Required modules
from urllib.parse import urlparse, parse_qs
import pandas as pd
from IPython.display import display, HTML

# Overall goal: take href_列表 as input and build the parameter dictionary

# Parse with urlparse and load the results into a DataFrame
df = pd.DataFrame([ urlparse(x) for x in href_列表])
df_qs = pd.DataFrame([{k:v[0] for k,v in parse_qs(x).items()} for x in df['query'] ])

display(df)
print(df.nunique())
display(df_qs)
print(df_qs.nunique())

df_qs.curPage
df_qs = df_qs.assign (curPage_int=df_qs.curPage.astype(int)) # convert to integers

| | scheme | netloc | path | params | query | fragment |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | | | /zhaopin/ | | init=-1&headckid=26f51855aa93f50f&fromSearchBt... | |
| 1 | | | /zhaopin/ | | init=-1&headckid=26f51855aa93f50f&fromSearchBt... | |
| 2 | | | /zhaopin/ | | init=-1&headckid=26f51855aa93f50f&fromSearchBt... | |
| 3 | | | /zhaopin/ | | init=-1&headckid=26f51855aa93f50f&fromSearchBt... | |
| 4 | | | /zhaopin/ | | init=-1&headckid=26f51855aa93f50f&fromSearchBt... | |
| 5 | | | /zhaopin/ | | init=-1&headckid=26f51855aa93f50f&fromSearchBt... | |


scheme      1
netloc      1
path        1
params      1
query       5
fragment    1
dtype: int64


Unnamed: 0,init,headckid,fromSearchBtn,keyword,ckid,siTag,d_sfrom,d_ckId,d_curPage,d_pageSize,d_headId,curPage
0,-1,26f51855aa93f50f,2,PRD,26f51855aa93f50f°radeFlag=0,1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw,search_unknown,d171c8b79eed664fd857fa9848a4bdea,0,40,d171c8b79eed664fd857fa9848a4bdea,1
1,-1,26f51855aa93f50f,2,PRD,26f51855aa93f50f°radeFlag=0,1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw,search_unknown,d171c8b79eed664fd857fa9848a4bdea,0,40,d171c8b79eed664fd857fa9848a4bdea,2
2,-1,26f51855aa93f50f,2,PRD,26f51855aa93f50f°radeFlag=0,1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw,search_unknown,d171c8b79eed664fd857fa9848a4bdea,0,40,d171c8b79eed664fd857fa9848a4bdea,3
3,-1,26f51855aa93f50f,2,PRD,26f51855aa93f50f°radeFlag=0,1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw,search_unknown,d171c8b79eed664fd857fa9848a4bdea,0,40,d171c8b79eed664fd857fa9848a4bdea,4
4,-1,26f51855aa93f50f,2,PRD,26f51855aa93f50f°radeFlag=0,1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw,search_unknown,d171c8b79eed664fd857fa9848a4bdea,0,40,d171c8b79eed664fd857fa9848a4bdea,1
5,-1,26f51855aa93f50f,2,PRD,26f51855aa93f50f°radeFlag=0,1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw,search_unknown,d171c8b79eed664fd857fa9848a4bdea,0,40,d171c8b79eed664fd857fa9848a4bdea,9


init             1
headckid         1
fromSearchBtn    1
keyword          1
ckid             1
siTag            1
d_sfrom          1
d_ckId           1
d_curPage        1
d_pageSize       1
d_headId         1
curPage          5
dtype: int64


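### Observations:
* Of the six URL components, only `query` varies across the pager links (5 distinct values among 6 links).
* Within `query`, only `curPage` varies: 5 distinct values with a maximum of 9. The current page does not seem to be counted among the links.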
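
A side observation on the output above: the `ckid` value ends in `°radeFlag=0`. A plausible explanation (an inference from the pattern, not something the page confirms) is that the page source writes `&degradeFlag=0` without escaping the ampersand, and the HTML parser reads `&deg` as the degree-sign entity. The sketch below reproduces the effect with the standard library's `html.unescape`, which is used here only as an illustration and is not the parser requests-html uses; the query string is a shortened, hypothetical example.

```python
from html import unescape
from urllib.parse import parse_qs

# Assumed form of the query in the page source (shortened, hypothetical)
raw_query = "ckid=26f51855aa93f50f&degradeFlag=0&curPage=1"

# Entity decoding treats "&deg" as the degree sign, even without a semicolon
decoded = unescape(raw_query)
print(decoded)

# Parsed as-is, degradeFlag is swallowed into the ckid value
garbled = parse_qs(decoded)

# Restoring the ampersand before parsing separates the two parameters again
repaired = parse_qs(decoded.replace("°radeFlag", "&degradeFlag"))
print(garbled["ckid"], repaired["ckid"], repaired["degradeFlag"])
```

Whether the site needs `degradeFlag` at all is unknown; the requests in this lecture returned results with the merged value left as it is.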

-----

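## Building the parameter template: curPage


In [22]:
# A-2 Build the parameter template: find the key parameters and the parameter structure
# six_parts holds the six components that urlparse splits a URL into


def parse_url_qs_for_curPage(url):
    six_parts = urlparse(url)
    out = parse_qs(six_parts.query)
    return (out)


# Take one link as the template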
参数模板 = parse_url_qs_for_curPage(href_列表[0])
print(参数模板)

print(href_字典)

{'init': ['-1'], 'headckid': ['83c3162f4f93f306'], 'fromSearchBtn': ['2'], 'keyword': ['PRD'], 'ckid': ['83c3162f4f93f306°radeFlag=0'], 'siTag': ['1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw'], 'd_sfrom': ['search_unknown'], 'd_ckId': ['c50c265a4c92032880cb36b10754865a'], 'd_curPage': ['0'], 'd_pageSize': ['40'], 'd_headId': ['c50c265a4c92032880cb36b10754865a'], 'curPage': ['1']}
{'2': '/zhaopin/?init=-1&headckid=83c3162f4f93f306&fromSearchBtn=2&keyword=PRD&ckid=83c3162f4f93f306°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=c50c265a4c92032880cb36b10754865a&d_curPage=0&d_pageSize=40&d_headId=c50c265a4c92032880cb36b10754865a&curPage=1', '3': '/zhaopin/?init=-1&headckid=83c3162f4f93f306&fromSearchBtn=2&keyword=PRD&ckid=83c3162f4f93f306°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=c50c265a4c92032880cb36b10754865a&d_curPage=0&d_pageSize=40&d_headId=c50c265a4c92032880cb36b10754865a&curPage=2'

In [23]:
# A-3 Build the parameter template generator: keyword, curPage
def 参数模板生成(keyword, curPage):
    参数 = 参数模板.copy()
    参数['curPage'] = curPage
    参数['keyword'] = keyword
    return (参数)

参数_keyword_用户体验_curPage = { 
    i:参数模板生成(curPage = [i], \
                  keyword = ['用户体验']) \
    for i,v in href_字典.items()\
    }

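# print(参数_keyword_用户体验_curPage) # only covers the pager links shown on this page; misses the pages up to &curPage=9 and the current page

print (df_qs.curPage_int.min()) # the minimum is 1
print (df_qs.curPage_int.max()) # the maximum is 9

# The range should be 0 (current page) ... 9 (maximum)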

参数_keyword_用户体验_curPage = { 
    i:参数模板生成(curPage = [i], \
                  keyword = ['用户体验']) \
    for i in range(0,df_qs.curPage_int.max()+1)\
    }
参数_keyword_用户体验_curPage

1
9


{0: {'init': ['-1'],
  'headckid': ['83c3162f4f93f306'],
  'fromSearchBtn': ['2'],
  'keyword': ['用户体验'],
  'ckid': ['83c3162f4f93f306°radeFlag=0'],
  'siTag': ['1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw'],
  'd_sfrom': ['search_unknown'],
  'd_ckId': ['c50c265a4c92032880cb36b10754865a'],
  'd_curPage': ['0'],
  'd_pageSize': ['40'],
  'd_headId': ['c50c265a4c92032880cb36b10754865a'],
  'curPage': [0]},
 1: {'init': ['-1'],
  'headckid': ['83c3162f4f93f306'],
  'fromSearchBtn': ['2'],
  'keyword': ['用户体验'],
  'ckid': ['83c3162f4f93f306°radeFlag=0'],
  'siTag': ['1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw'],
  'd_sfrom': ['search_unknown'],
  'd_ckId': ['c50c265a4c92032880cb36b10754865a'],
  'd_curPage': ['0'],
  'd_pageSize': ['40'],
  'd_headId': ['c50c265a4c92032880cb36b10754865a'],
  'curPage': [1]},
 2: {'init': ['-1'],
  'headckid': ['83c3162f4f93f306'],
  'fromSearchBtn': ['2'],
  'keyword': ['用户体验'],
  'ckid': ['83c3162f4f93f306°radeFlag=0'],
  'siTag': ['1B2M2Y8AsgTpgAmY
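
To check what such a parameter dictionary turns into, the generated URLs can be previewed without sending any request. This is a minimal sketch: `template` is a shortened, hypothetical stand-in for 参数模板, and `make_params` is an illustrative helper, not the lecture's 参数模板生成.

```python
from urllib.parse import urlencode

base_url = "https://www.liepin.com/zhaopin/"

# Shortened, hypothetical template; the real one has about a dozen keys
template = {"keyword": ["PRD"], "d_pageSize": ["40"], "curPage": ["1"]}

def make_params(keyword, cur_page):
    """Copy the template and override the two adjustable parameters."""
    params = dict(template)              # shallow copy; the template stays unchanged
    params["keyword"] = [keyword]
    params["curPage"] = [str(cur_page)]
    return params

# One parameter dictionary per page, keyed by page number
pages = {i: make_params("用户体验", i) for i in range(0, 3)}

# doseq=True expands list values the same way a query string would
urls = [base_url + "?" + urlencode(p, doseq=True) for p in pages.values()]
for u in urls:
    print(u)
```

Printing the URLs first is a cheap way to catch a wrong page range before the crawl starts.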

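# Pagination: systematic iteration

## Crawling with good manners
* robots.txt: the rules the webmaster or site owner sets out for search engines
* Frequency and timing
  * Don't crawl too fast
  * Be as polite as a human visitor
  * time.sleep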
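
One way to honor robots.txt in code is the standard library's `urllib.robotparser`. The sketch below parses a made-up robots.txt so that it runs offline; the rules shown are hypothetical and are not Liepin's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
# In real use: rp.set_url("https://www.liepin.com/robots.txt"); rp.read()
sample_robots = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(sample_robots.splitlines())

# can_fetch(useragent, url) answers whether that agent may request that URL
allowed = rp.can_fetch("*", "https://www.liepin.com/zhaopin/?keyword=PRD")
blocked = rp.can_fetch("*", "https://www.liepin.com/private/page")
print(allowed, blocked)
```

A crawl loop can call `can_fetch` before each request and skip any URL the site disallows.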
  
```python

# Last week C-3: multiple pages
url = "https://www.liepin.com/zhaopin/"

list_df = list()
for k,v in 参数_compTag_用户体验.items():
    payload = v
    df = requests_liepin( url, params = payload)
    df = df.assign (热门公司类型 = k)    
    list_df.append(df)

df_all = pd.concat(list_df)
df_all
```

In [24]:
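# B-1: last week's C-2, simplified from A-1 two weeks ago: crawl and parse a single page
session = HTMLSession()

def requests_liepin( url, params):
    # Query with the params argument passed in by the caller
    r = session.get( url , params = params)

    # Select the container elements first, then query only their descendants
    主要元素 = r.html.xpath( '//ul[@class="sojob-list"]/li')

    # xpath dictionary: keys are the fields to extract, values are xpath expressions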
    dict_xpaths={ 
        'text': {
            'edu':      '//div[contains(@class,"job-info")]/p/span[@class="edu"]',
            '经验':      '//div[contains(@class,"job-info")]/p/span[@class="edu"]/following-sibling::span',
            '薪水':    '//div[contains(@class,"job-info")]/p/span[@class="text-warning"]', 
            '时间':    '//div[contains(@class,"job-info")]/p/time/@title', 
            '职称':    '//div[contains(@class,"job-info")]/h3/a', 
            '公司地点': '//div[contains(@class,"job-info")]/p/a',
            '公司名称': '//div[contains(@class,"sojob-item-main")]//p[@class="company-name"]/a', 
        },
        'text_content': {
        },
        'href': {
            '链结':    '//div[contains(@class,"job-info")]/h3/a', 
            '公司URL': '//div[contains(@class,"sojob-item-main")]//p[@class="company-name"]/a', 
        }
    }

    def get_e_text_content(_xpath_):
        # 高级列表推导
        暂存结果 = [e.xpath(_xpath_)[0].lxml.text_content() for e in 主要元素]
        return(暂存结果)

    def get_e_text(_xpath_):
        # 高级列表推导
        暂存结果 = ["".join([x.strip() if type(x) is str else x.text.strip() for x in e.xpath(_xpath_)]) for e in 主要元素]
        return(暂存结果)

    def get_e_href(_xpath_):
        # 高级列表推导
        暂存结果 = [list(e.xpath(_xpath_, first=True).absolute_links)[0] \
                   if len(e.xpath(_xpath_, first=True).absolute_links) >= 1  \
                   else "" for e in 主要元素]
        return(暂存结果)

    # 只对主要元素下进行.xpath取值
    数据字典 = dict()

    数据字典 = {k:get_e_text_content(v) for k,v in dict_xpaths['text_content'].items()}
    数据字典.update({k:get_e_text(v) for k,v in dict_xpaths['text'].items()})
    数据字典.update({k:get_e_href(v) for k,v in dict_xpaths['href'].items()})

    数据 = pd.DataFrame(数据字典)
    #数据.to_excel("20春_Web数据挖掘_week03_liepin.xlsx", sheet_name="搜查结果")
    return (数据)


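## Crawling with good manners: don't crawl too fast
time.sleep

In [31]:
%%time
import time
from random import random

time.sleep(3+4*random())  # pause for 3 to 7 seconds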

Wall time: 3.57 s
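
The random pause can be wrapped in a small helper so the bounds are explicit. This is only a sketch: `polite_pause` and its `sleeper` parameter are hypothetical names, and the demo call passes a no-op sleeper so that it returns immediately.

```python
import random
import time

def polite_pause(min_s=3.0, max_s=7.0, sleeper=time.sleep):
    """Sleep for a random duration between min_s and max_s seconds; return it."""
    delay = random.uniform(min_s, max_s)
    sleeper(delay)
    return delay

# Demo with a no-op sleeper; in a real crawl, call polite_pause() with the defaults
waited = polite_pause(sleeper=lambda seconds: None)
print(round(waited, 2))
```

With bounds of 3 and 7 seconds the average pause is about 5 seconds, so 10 pages add roughly 50 seconds of waiting.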


In [32]:
%%time
# B-2: multiple pages, at a slower pace with time.sleep
import time
from random import random

url = "https://www.liepin.com/zhaopin/"

list_df = list()
for k,v in 参数_keyword_用户体验_curPage.items():
    payload = v
    df = requests_liepin( url, params = payload)
    time.sleep(3+4*random())  # slow down: 3 to 7 seconds, about 5 on average
    df = df.assign (curPage = k)  # label each row with its curPage
    list_df.append(df)

df_all = pd.concat(list_df).reset_index()
df_all.index.name = '序'

# Last week C-4: output
df_all.to_excel("20春_Web数据挖掘_week04_liepin_翻页.xlsx",\
                sheet_name="用户体验")

# Estimated time: 5 s * 10 pages = 50 s
# Estimated rows: 40 * 10 pages = 400

Wall time: 1min


## Multiple pages + multiple keywords
time.sleep

In [34]:
%%time
# B-3: multiple pages + multiple keywords
import time
from random import random

url = "https://www.liepin.com/zhaopin/"
xpath_翻页a = '//div[@class="pagerbar"]/a[starts-with(@href,"/zhaopin")]'

keywords = ['用户体验','UX']
list_df = list()

## Probe the first page of each keyword to find out how many pages there are
for key in keywords:
    payload = 参数模板生成(keyword=[key], curPage=['0'])
    r = session.get( url, params = payload)  # first-page response for this keyword
    href_列表 = [x.xpath('//@href')[0] for x in r.html.xpath(xpath_翻页a)]
    df = pd.DataFrame([ urlparse(x) for x in href_列表])
    df_qs = pd.DataFrame([{k:v[0] for k,v in parse_qs(x).items()} for x in df['query'] ])
    df_qs = df_qs.assign (curPage_int=df_qs.curPage.astype(int)) # convert to integers
    长度 = df_qs.curPage_int.max()+1
    参数_keyword_X_curPage = { 
        i:参数模板生成(curPage = [i], \
                      keyword = [key]) \
        for i in range(0,长度)\
        }
    #print (参数_keyword_X_curPage)
    print (key,长度)
    
    for k,v in 参数_keyword_X_curPage.items():
        payload = v
        df = requests_liepin( url, params = payload)
        time.sleep(3+4*random())  # slow down: 3 to 7 seconds, about 5 on average
        df = df.assign (keyword = key)  # label each row with its keyword
        df = df.assign (curPage = k)  # label each row with its curPage
        list_df.append(df)
        
df_all = pd.concat(list_df).reset_index()
df_all.index.name = '序'

df_all.to_excel("20春_Web数据挖掘_week04_liepin_翻页.xlsx",\
                sheet_name="_".join(keywords))
# Estimated time: 2 keywords * 5 s * 10 pages = 100 s
# Estimated rows: 2 keywords * 40 * 10 pages = 800

用户体验 10
UX 10
Wall time: 2min 5s
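
The page-count probe in the loop above can also be written without pandas. This is a minimal sketch: `page_count_from_hrefs` is a hypothetical helper name, and the sample links are shortened stand-ins for the real pager links.

```python
from urllib.parse import urlparse, parse_qs

def page_count_from_hrefs(hrefs, page_key="curPage"):
    """Return max(curPage) + 1 over the pager links, or 1 if no link has one."""
    pages = []
    for href in hrefs:
        values = parse_qs(urlparse(href).query).get(page_key)
        if values and values[0].isdigit():
            pages.append(int(values[0]))
    # Pages are numbered from 0, so the count is the largest index plus one
    return max(pages) + 1 if pages else 1

# Shortened, hypothetical links mirroring the pager seen earlier: 2, 3, 4, 5, next, last
sample = [f"/zhaopin/?keyword=UX&curPage={n}" for n in (1, 2, 3, 4, 1, 9)]
count = page_count_from_hrefs(sample)
print(count)
```

Returning 1 when no pager link is found keeps the crawl loop running for a keyword whose results fit on a single page.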


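# Pagination: backing up and consolidating data
When crawling multiple pages for multiple keywords, it is best to save each page's df as an intermediate backup in case the run is interrupted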

In [11]:
%%time
# C-1: multiple pages + multiple keywords, with per-page backups
import time
from random import random

url = "https://www.liepin.com/zhaopin/"
xpath_翻页a = '//div[@class="pagerbar"]/a[starts-with(@href,"/zhaopin")]'

keywords = ['用户体验','UX','产品需求','PRD']
list_df = list()

## Probe the first page of each keyword to find out how many pages there are
for key in keywords:
    payload = 参数模板生成(keyword=[key], curPage=['0'])
    r = session.get( url, params = payload)  # first-page response for this keyword
    href_列表 = [x.xpath('//@href')[0] for x in r.html.xpath(xpath_翻页a)]
    df = pd.DataFrame([ urlparse(x) for x in href_列表])
    df_qs = pd.DataFrame([{k:v[0] for k,v in parse_qs(x).items()} for x in df['query'] ])
    df_qs = df_qs.assign (curPage_int=df_qs.curPage.astype(int)) # convert to integers
    长度 = df_qs.curPage_int.max()+1
    参数_keyword_X_curPage = { 
        i:参数模板生成(curPage = [i], \
                      keyword = [key]) \
        for i in range(0,长度)\
        }
    #print (参数_keyword_X_curPage)
    print (key,长度)
    
    for k,v in 参数_keyword_X_curPage.items():
        payload = v
        df = requests_liepin( url, params = payload)
        time.sleep(3+4*random())  # slow down: 3 to 7 seconds, about 5 on average
        ## Backup: one TSV file per keyword and page
        df.to_csv("20春_Web数据挖掘_week04_liepin_{key}_{k}.tsv"\
                  .format(key=key, k=k), sep="\t", encoding="utf8")
        
        df = df.assign (keyword = key)  # label each row with its keyword
        df = df.assign (curPage = k)  # label each row with its curPage
        list_df.append(df)
        
df_all = pd.concat(list_df).reset_index()
df_all.index.name = '序'

df_all.to_excel("20春_Web数据挖掘_week04_liepin_翻页_4.xlsx",\
                sheet_name="_".join(keywords))
# Estimated time: 4 keywords * 5 s * 10 pages = 200 s
# Estimated rows: 4 keywords * 40 * 10 pages = 1600

用户体验 10
UX 10
产品需求 10
PRD 10
Wall time: 3min 54s
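
If a run is interrupted, the per-page TSV backups can be consolidated afterwards. The sketch below is one possible way to do it: `merge_backups` and the `source_file` column are hypothetical names, and the demo writes two tiny made-up backup files to a temporary folder so that it runs on its own.

```python
import glob
import os
import tempfile

import pandas as pd

def merge_backups(pattern):
    """Read every TSV backup matching the glob pattern and stack them."""
    frames = []
    for path in sorted(glob.glob(pattern)):
        df = pd.read_csv(path, sep="\t", index_col=0)
        # Record which backup file each row came from
        df = df.assign(source_file=os.path.basename(path))
        frames.append(df)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

# Demo with made-up data; real backups follow the naming used in C-1
with tempfile.TemporaryDirectory() as tmp:
    for k in (0, 1):
        demo = pd.DataFrame({"职称": [f"职位{k}a", f"职位{k}b"]})
        demo.to_csv(os.path.join(tmp, f"demo_UX_{k}.tsv"), sep="\t", encoding="utf8")
    merged = merge_backups(os.path.join(tmp, "demo_UX_*.tsv"))

print(len(merged))
```

For the files written in C-1, a pattern such as `"20春_Web数据挖掘_week04_liepin_*.tsv"` would collect all keywords and pages in one call.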


# This week's exercises

* Start experimenting with different parameter settings


## xpath pager-parsing exercise

In [4]:
href_列表 = [x.xpath('//@href')[0] for x in r.html.xpath(xpath_翻页a)]
print (href_列表)

['/zhaopin/?init=-1&headckid=7f82f3e88db0770e&fromSearchBtn=2&keyword=PRD&ckid=7f82f3e88db0770e°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=0a634699f6acc0f4f8992a7bab869028&d_curPage=0&d_pageSize=40&d_headId=0a634699f6acc0f4f8992a7bab869028&curPage=1', '/zhaopin/?init=-1&headckid=7f82f3e88db0770e&fromSearchBtn=2&keyword=PRD&ckid=7f82f3e88db0770e°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=0a634699f6acc0f4f8992a7bab869028&d_curPage=0&d_pageSize=40&d_headId=0a634699f6acc0f4f8992a7bab869028&curPage=2', '/zhaopin/?init=-1&headckid=7f82f3e88db0770e&fromSearchBtn=2&keyword=PRD&ckid=7f82f3e88db0770e°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=0a634699f6acc0f4f8992a7bab869028&d_curPage=0&d_pageSize=40&d_headId=0a634699f6acc0f4f8992a7bab869028&curPage=3', '/zhaopin/?init=-1&headckid=7f82f3e88db0770e&fromSearchBtn=2&keyword=PRD&ckid=7f82f3e88db0

In [5]:
文字_列表 = [x.text for x in r.html.xpath(xpath_翻页a)]
print (文字_列表)

['2', '3', '4', '5', '下一页', '']


In [6]:
href_字典 = {x.text:x.xpath('//@href')[0]  for x in r.html.xpath(xpath_翻页a)}
print (href_字典)

{'2': '/zhaopin/?init=-1&headckid=7f82f3e88db0770e&fromSearchBtn=2&keyword=PRD&ckid=7f82f3e88db0770e°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=0a634699f6acc0f4f8992a7bab869028&d_curPage=0&d_pageSize=40&d_headId=0a634699f6acc0f4f8992a7bab869028&curPage=1', '3': '/zhaopin/?init=-1&headckid=7f82f3e88db0770e&fromSearchBtn=2&keyword=PRD&ckid=7f82f3e88db0770e°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=0a634699f6acc0f4f8992a7bab869028&d_curPage=0&d_pageSize=40&d_headId=0a634699f6acc0f4f8992a7bab869028&curPage=2', '4': '/zhaopin/?init=-1&headckid=7f82f3e88db0770e&fromSearchBtn=2&keyword=PRD&ckid=7f82f3e88db0770e°radeFlag=0&siTag=1B2M2Y8AsgTpgAmY7PhCfg%7EfA9rXquZc5IkJpXC-Ycixw&d_sfrom=search_unknown&d_ckId=0a634699f6acc0f4f8992a7bab869028&d_curPage=0&d_pageSize=40&d_headId=0a634699f6acc0f4f8992a7bab869028&curPage=3', '5': '/zhaopin/?init=-1&headckid=7f82f3e88db0770e&fromSearchBtn=2&keyword=P

## Try changing the keyword

In [35]:
def 参数模板生成(keyword, curPage):
    参数 = 参数模板.copy()
    参数['curPage'] = curPage
    参数['keyword'] = keyword
    return (参数)

参数_keyword_UX_curPage = { 
    i:参数模板生成(curPage = [i], \
                  keyword = ['UX']) \
    for i,v in href_字典.items()\
    }


print (df_qs.curPage_int.min()) # the minimum is 1
print (df_qs.curPage_int.max()) # the maximum is 9

# The range should be 0 (current page) ... 9 (maximum)

参数_keyword_UX_curPage = { 
    i:参数模板生成(curPage = [i], \
                  keyword = ['UX']) \
    for i in range(0,df_qs.curPage_int.max()+1)\
    }
参数_keyword_UX_curPage

1
9


{0: {'init': ['-1'],
  'headckid': ['83c3162f4f93f306'],
  'fromSearchBtn': ['2'],
  'keyword': ['UX'],
  'ckid': ['83c3162f4f93f306°radeFlag=0'],
  'siTag': ['1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw'],
  'd_sfrom': ['search_unknown'],
  'd_ckId': ['c50c265a4c92032880cb36b10754865a'],
  'd_curPage': ['0'],
  'd_pageSize': ['40'],
  'd_headId': ['c50c265a4c92032880cb36b10754865a'],
  'curPage': [0]},
 1: {'init': ['-1'],
  'headckid': ['83c3162f4f93f306'],
  'fromSearchBtn': ['2'],
  'keyword': ['UX'],
  'ckid': ['83c3162f4f93f306°radeFlag=0'],
  'siTag': ['1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw'],
  'd_sfrom': ['search_unknown'],
  'd_ckId': ['c50c265a4c92032880cb36b10754865a'],
  'd_curPage': ['0'],
  'd_pageSize': ['40'],
  'd_headId': ['c50c265a4c92032880cb36b10754865a'],
  'curPage': [1]},
 2: {'init': ['-1'],
  'headckid': ['83c3162f4f93f306'],
  'fromSearchBtn': ['2'],
  'keyword': ['UX'],
  'ckid': ['83c3162f4f93f306°radeFlag=0'],
  'siTag': ['1B2M2Y8AsgTpgAmY7PhCfg

In [37]:
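session = HTMLSession()

def requests_liepin( url, params):
    # Query with the params argument passed in by the caller
    r = session.get( url , params = params)

    # Select the container elements first, then query only their descendants
    主要元素 = r.html.xpath( '//ul[@class="sojob-list"]/li')

    # xpath dictionary: keys are the fields to extract, values are xpath expressions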
    dict_xpaths={ 
        'text': {
            'edu':      '//div[contains(@class,"job-info")]/p/span[@class="edu"]',
            '经验':      '//div[contains(@class,"job-info")]/p/span[@class="edu"]/following-sibling::span',
            '薪水':    '//div[contains(@class,"job-info")]/p/span[@class="text-warning"]', 
            '时间':    '//div[contains(@class,"job-info")]/p/time/@title', 
            '职称':    '//div[contains(@class,"job-info")]/h3/a', 
            '公司地点': '//div[contains(@class,"job-info")]/p/a',
            '公司名称': '//div[contains(@class,"sojob-item-main")]//p[@class="company-name"]/a', 
        },
        'text_content': {
        },
        'href': {
            '链结':    '//div[contains(@class,"job-info")]/h3/a', 
            '公司URL': '//div[contains(@class,"sojob-item-main")]//p[@class="company-name"]/a', 
        }
    }

    def get_e_text_content(_xpath_):
        # Nested list comprehension: text_content() of the first match in each container
        暂存结果 = [e.xpath(_xpath_)[0].lxml.text_content() for e in 主要元素]
        return(暂存结果)

    def get_e_text(_xpath_):
        # Nested list comprehension: join the stripped text of every match in each container
        暂存结果 = ["".join([x.strip() if type(x) is str else x.text.strip() for x in e.xpath(_xpath_)]) for e in 主要元素]
        return(暂存结果)

    def get_e_href(_xpath_):
        # List comprehension: first absolute link of the match, or "" when there is none
        暂存结果 = [list(e.xpath(_xpath_, first=True).absolute_links)[0] \
                   if len(e.xpath(_xpath_, first=True).absolute_links) >= 1  \
                   else "" for e in 主要元素]
        return(暂存结果)

    # Run .xpath only underneath the container elements
    数据字典 = dict()

    数据字典 = {k:get_e_text_content(v) for k,v in dict_xpaths['text_content'].items()}
    数据字典.update({k:get_e_text(v) for k,v in dict_xpaths['text'].items()})
    数据字典.update({k:get_e_href(v) for k,v in dict_xpaths['href'].items()})

    数据 = pd.DataFrame(数据字典)
    数据.to_excel("20春_Web数据挖掘_week04_liepin.xlsx", sheet_name="搜查结果")
    return (数据)

In [38]:
%%time
time.sleep(3+4*random()) # crawl politely: pause before the next request

Wall time: 4.65 s


In [39]:
%%time
# Several pages, but slow down between requests with time.sleep
import time
from random import random

url = "https://www.liepin.com/zhaopin/"

list_df = list()
for k,v in 参数_keyword_用户体验_curPage.items():
    payload = v
    df = requests_liepin( url, params = payload)
    time.sleep(3+4*random())  # slow down: 3-7 seconds, about 5 seconds on average
    df = df.assign (curPage = k)  # tag each row with the curPage it came from
    list_df.append(df)

df_all = pd.concat(list_df).reset_index()
df_all.index.name = '序'

# Step C-4 from last week: export
df_all.to_excel("20春_Web数据挖掘_week04try_liepin_翻页.xlsx",\
                sheet_name="UX")


Wall time: 55.6 s
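
The pause logic in the loop above can be isolated into a small helper. The following is only a sketch: `polite_delay` and `crawl_politely` are hypothetical names that do not appear in the notebook, and `fetch` stands in for a function such as `requests_liepin`. Unlike the loop above, this version pauses only between requests, not after the last one.

```python
import random
import time


def polite_delay(base=3.0, jitter=4.0):
    """Return a random pause in seconds, uniform in [base, base + jitter)."""
    return base + jitter * random.random()


def crawl_politely(pages, fetch, sleep=time.sleep):
    """Call fetch(page) for each page, pausing between consecutive requests.

    fetch is a stand-in (assumption) for the real request function;
    sleep is injectable so the loop can be tried without actually waiting.
    """
    results = []
    for i, page in enumerate(pages):
        if i > 0:
            sleep(polite_delay())  # pause only between requests
        results.append(fetch(page))
    return results
```

With the defaults the pause falls between 3 and 7 seconds, which matches the `3+4*random()` expression used in the cell.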


## Parsing practice

In [15]:
from urllib.parse import urlparse, parse_qs
import pandas as pd
from IPython.display import display, HTML

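# Overall goal: take href_列表 as input and build the parameter dictionary

# Parse each link with urlparse and load the results into a DataFrame
df = pd.DataFrame([ urlparse(x) for x in href_列表])
df

# Next: use df.nunique() to check which columns differ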

Unnamed: 0,scheme,netloc,path,params,query,fragment
0,,,/zhaopin/,,init=-1&headckid=26f51855aa93f50f&fromSearchBt...,
1,,,/zhaopin/,,init=-1&headckid=26f51855aa93f50f&fromSearchBt...,
2,,,/zhaopin/,,init=-1&headckid=26f51855aa93f50f&fromSearchBt...,
3,,,/zhaopin/,,init=-1&headckid=26f51855aa93f50f&fromSearchBt...,
4,,,/zhaopin/,,init=-1&headckid=26f51855aa93f50f&fromSearchBt...,
5,,,/zhaopin/,,init=-1&headckid=26f51855aa93f50f&fromSearchBt...,


In [16]:
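df.nunique  # without parentheses this only displays the bound method (see the output below); call df.nunique() to get the per-column counts
# Finding: the query column is the one to work on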

<bound method DataFrame.nunique of   scheme netloc       path params  \
0                /zhaopin/          
1                /zhaopin/          
2                /zhaopin/          
3                /zhaopin/          
4                /zhaopin/          
5                /zhaopin/          

                                               query fragment  
0  init=-1&headckid=26f51855aa93f50f&fromSearchBt...           
1  init=-1&headckid=26f51855aa93f50f&fromSearchBt...           
2  init=-1&headckid=26f51855aa93f50f&fromSearchBt...           
3  init=-1&headckid=26f51855aa93f50f&fromSearchBt...           
4  init=-1&headckid=26f51855aa93f50f&fromSearchBt...           
5  init=-1&headckid=26f51855aa93f50f&fromSearchBt...           >

In [20]:
列表暂存 = [{k:v[0] for k,v in parse_qs(q).items()} for q in df['query'] ]
print(列表暂存)

[{'init': '-1', 'headckid': '26f51855aa93f50f', 'fromSearchBtn': '2', 'keyword': 'PRD', 'ckid': '26f51855aa93f50f°radeFlag=0', 'siTag': '1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw', 'd_sfrom': 'search_unknown', 'd_ckId': 'd171c8b79eed664fd857fa9848a4bdea', 'd_curPage': '0', 'd_pageSize': '40', 'd_headId': 'd171c8b79eed664fd857fa9848a4bdea', 'curPage': '1'}, {'init': '-1', 'headckid': '26f51855aa93f50f', 'fromSearchBtn': '2', 'keyword': 'PRD', 'ckid': '26f51855aa93f50f°radeFlag=0', 'siTag': '1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw', 'd_sfrom': 'search_unknown', 'd_ckId': 'd171c8b79eed664fd857fa9848a4bdea', 'd_curPage': '0', 'd_pageSize': '40', 'd_headId': 'd171c8b79eed664fd857fa9848a4bdea', 'curPage': '2'}, {'init': '-1', 'headckid': '26f51855aa93f50f', 'fromSearchBtn': '2', 'keyword': 'PRD', 'ckid': '26f51855aa93f50f°radeFlag=0', 'siTag': '1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw', 'd_sfrom': 'search_unknown', 'd_ckId': 'd171c8b79eed664fd857fa9848a4bdea', 'd_curPage': '0'
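
Why the comprehension takes `v[0]` can be seen on a single link. This is a minimal sketch using a made-up relative href (not one taken from the site): `urlparse` leaves `scheme` and `netloc` empty for a relative link, and `parse_qs` wraps every value in a list because a key may repeat in a query string.

```python
from urllib.parse import urlparse, parse_qs

# Made-up relative link for illustration only
href = "/zhaopin/?init=-1&keyword=PRD&curPage=2"

parts = urlparse(href)
# A relative link has no scheme or host, so only path and query are filled
print(repr(parts.scheme), repr(parts.netloc), parts.path, parts.query)

raw = parse_qs(parts.query)                  # every value is a list
flat = {k: v[0] for k, v in raw.items()}     # keep the first value of each key
print(raw)
print(flat)
```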

In [21]:
df_qs = pd.DataFrame([{k:v[0] for k,v in parse_qs(x).items()} for x in df['query'] ])
df_qs

Unnamed: 0,init,headckid,fromSearchBtn,keyword,ckid,siTag,d_sfrom,d_ckId,d_curPage,d_pageSize,d_headId,curPage
0,-1,26f51855aa93f50f,2,PRD,26f51855aa93f50f°radeFlag=0,1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw,search_unknown,d171c8b79eed664fd857fa9848a4bdea,0,40,d171c8b79eed664fd857fa9848a4bdea,1
1,-1,26f51855aa93f50f,2,PRD,26f51855aa93f50f°radeFlag=0,1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw,search_unknown,d171c8b79eed664fd857fa9848a4bdea,0,40,d171c8b79eed664fd857fa9848a4bdea,2
2,-1,26f51855aa93f50f,2,PRD,26f51855aa93f50f°radeFlag=0,1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw,search_unknown,d171c8b79eed664fd857fa9848a4bdea,0,40,d171c8b79eed664fd857fa9848a4bdea,3
3,-1,26f51855aa93f50f,2,PRD,26f51855aa93f50f°radeFlag=0,1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw,search_unknown,d171c8b79eed664fd857fa9848a4bdea,0,40,d171c8b79eed664fd857fa9848a4bdea,4
4,-1,26f51855aa93f50f,2,PRD,26f51855aa93f50f°radeFlag=0,1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw,search_unknown,d171c8b79eed664fd857fa9848a4bdea,0,40,d171c8b79eed664fd857fa9848a4bdea,1
5,-1,26f51855aa93f50f,2,PRD,26f51855aa93f50f°radeFlag=0,1B2M2Y8AsgTpgAmY7PhCfg~fA9rXquZc5IkJpXC-Ycixw,search_unknown,d171c8b79eed664fd857fa9848a4bdea,0,40,d171c8b79eed664fd857fa9848a4bdea,9


In [22]:
df_qs.nunique()
# Finding: curPage is the only parameter that varies

init             1
headckid         1
fromSearchBtn    1
keyword          1
ckid             1
siTag            1
d_sfrom          1
d_ckId           1
d_curPage        1
d_pageSize       1
d_headId         1
curPage          5
dtype: int64
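
Once `curPage` is known to be the only varying parameter, a parameter dictionary for each page can be generated from one template. The sketch below uses only the standard library; `varying_keys` and `build_page_params` are hypothetical helper names, and the sample rows are shortened, made-up stand-ins for the real query dictionaries. Values here are plain scalars rather than the one-element lists seen in the earlier output.

```python
def varying_keys(dicts):
    """Return the keys whose values are not identical across all dicts."""
    keys = set().union(*dicts)
    return sorted(k for k in keys if len({d.get(k) for d in dicts}) > 1)


def build_page_params(template, page_key, pages):
    """Map each page number to a copy of template with page_key set to it."""
    return {p: {**template, page_key: p} for p in pages}


# Shortened, made-up rows standing in for the parsed query strings
rows = [
    {"init": "-1", "keyword": "PRD", "curPage": "1"},
    {"init": "-1", "keyword": "PRD", "curPage": "2"},
    {"init": "-1", "keyword": "PRD", "curPage": "9"},
]

changing = varying_keys(rows)                       # plays the role of nunique() > 1
page_params = build_page_params(rows[0], "curPage", range(3))
print(changing)
print(page_params)
```

Each value of `page_params` can then be passed as `params` to the request function, one page at a time.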

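# Summary

* The carving technique used to get at the "beef" also works on any other page with the same structure, so the same parser can be reused.
* Keep practising XPath: right-click > Inspect in the browser, `//` (any descendant), `[@attr]` (attribute filter), and `/` (step between parent and child).
* `starts-with(@href,"/zhaopin")` keeps only the links whose `href` begins with `/zhaopin` and filters out the rest. Matching on the start of an attribute value is a convenient way to pick out the URLs you want.
* `df.nunique()` shows which columns differ; here only `curPage` varies.
* robots.txt is a widely observed convention rather than a law in itself, but ignoring it can still carry legal risk. If a path is disallowed, do not crawl it.
* `%%time` is a cell magic and must be the first line of the cell.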
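
The robots.txt point above can be checked in code with the standard library's `urllib.robotparser`. This is a sketch under stated assumptions: the rules and the `example.com` URLs below are invented for illustration and are not the rules of any real site. For a live site, `set_url()` followed by `read()` would load the actual file instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Invented rules for illustration; a real crawler would read the site's own file
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

rp = RobotFileParser()
rp.parse(robots_lines)

allowed = rp.can_fetch("*", "https://example.com/zhaopin/?curPage=1")
blocked = rp.can_fetch("*", "https://example.com/private/data")
delay = rp.crawl_delay("*")   # None when the file sets no Crawl-delay

print(allowed, blocked, delay)
```

Checking `can_fetch` before each request, and honouring `crawl_delay` when it is set, keeps the crawl within what the site has published.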