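# Data Mining: Interactive Dynamic Content

*  This week's focus: a hands-on XHR request example
*  20春_Web数据挖掘_week10 (Spring 2020, Web Data Mining, week 10)
*  E-handout designers: 廖汉腾, 许智超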

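## Last week: Selenium interaction practice

Using the WeChat Official Accounts Platform (微信公众平台) website as the example, we drove a browser with Selenium. Both logging in and retrieving data require fairly complex form interaction. The goal was to have Selenium follow the correct sequence of steps, simulating the user's input for a given keyword, and return the relevant content.

## This week: XHR requests in practice

Using the National Data database (国家数据库) as the example: when a scraper fails to extract anything, you need to be able to reverse-engineer the page and find the mechanism that delivers its interactive dynamic content.

If that mechanism turns out to be an XHR request, learn to use the requests module, the library that requests_html builds on, to download the dynamic content.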


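### About this e-handout

This e-handout is the main teaching material for a series of courses.
* Courses:
  * 20春_Web数据挖掘 (Spring 2020, Web Data Mining; 中山大学南方学院, Nanfang College of Sun Yat-sen University)
* E-handout designers: 廖汉腾, 许智超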

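# National Data (国家数据库)

On the left side of the National Data [annual data by province (fsnd) entry page](http://data.stats.gov.cn/adv.htm?cn=E0103), you can expand the tree to see different regional groupings, such as "八大经济区域" (the eight economic regions). How can we scrape them while preserving that structure?

Apply what you know about XPath: once your expression works in Chrome, try scraping with requests_html.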


In [1]:
# National Data (国家数据库) with requests_html
import pandas as pd
from requests_html import HTMLSession

In [2]:
url = "http://data.stats.gov.cn/adv.htm?cn=E0103"
session = HTMLSession()
r = session.get( url )

In [3]:
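# Once an XPath works in the Chrome console via $x('...'), try the same expression here.
# Example: the XPath for the "八大经济区域" link, which does match in the Chrome console.
r.html.lxml.xpath('//a[@id="reg_tree_4_a"]')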

[]

## Why the extraction failed

The extraction in this example fails: the XPath returns an empty list.

Why?


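### The Inspector and the page source show different results

"八大经济区域" can be found in the browser's Inspector tool, and an XPath expression in the console matches it as well: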

```
$x('//a[@id="reg_tree_4_a"]')
```

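Checking the page source (right-click and choose View Page Source) shows that the a element we want to scrape is actually empty there.

Watching the Inspector while switching between regions and indicators in the left-hand panel shows the HTML changing.


This changing, dynamically loaded content is exactly why the extraction failed. The mechanism behind it is the XHR request (XMLHttpRequest), which is basic knowledge that front-end developers and interaction designers should have as well.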
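
One way to confirm this diagnosis from Python is to look for the target element in the raw HTML, before any JavaScript has run. The sketch below uses only the standard library. The two HTML strings are hypothetical stand-ins rather than the real page, and `text_by_id` is an illustrative helper name.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth count.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TextById(HTMLParser):
    """Collect the text inside the first element whose id matches."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0        # > 0 while the parser is inside the target element
        self.found = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def text_by_id(html, target_id):
    """Return the text inside the element with this id, or None if it is absent."""
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip() if parser.found else None


# Hypothetical stand-ins: the raw page source vs. the DOM seen in the Inspector.
static_source = '<ul id="reg_tree" class="ztree"></ul>'
rendered_dom = ('<ul id="reg_tree" class="ztree"><li><a id="reg_tree_4_a" '
                'title="八大经济区域"><span>八大经济区域</span></a></li></ul>')

print(text_by_id(static_source, "reg_tree_4_a"))  # None: not in the raw source
print(text_by_id(rendered_dom, "reg_tree_4_a"))   # 八大经济区域: added by JavaScript
```

If the same check is run on `r.text` from a real request and the element is missing or empty, the content is being filled in after the page loads, and the next step is to find the request that supplies it.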

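### Introduction to XHR

* From the first-year textbook: 上野宣, 《图解HTTP》
    * 9.2.1 Bottlenecks of HTTP
        * To find out via HTTP whether content on the server has been updated, the client has to check with the server frequently. If nothing has been updated, that communication is wasted.
        * The following HTTP conventions become bottlenecks:
            * Only one request can be sent per connection.
            * Requests can only originate from the client. The client cannot receive anything other than responses.
            * Request/response headers are sent uncompressed. The more header information, the greater the delay.
            * Headers are verbose. Sending the same headers back and forth every time is wasteful.
            * Data compression is optional, not mandatory.
        * Figure: HTTP communication in the past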
![](HTTP_issues.png)        
[![](https://mermaid.ink/img/eyJjb2RlIjoic2VxdWVuY2VEaWFncmFtXG4gICAgcGFydGljaXBhbnQg5a6i5oi356uvXG4gICAgcGFydGljaXBhbnQg5pyN5Yqh5ZmoXG4gICAg5a6i5oi356uvLT4-5pyN5Yqh5ZmoOiDor7fmsYLnu5nkuKrnvZHpobUgICBcbiAgICDmnI3liqHlmagtLT4-5a6i5oi356uvOiDlpb3nmoTnu5nmlbTkuKpIVE1MXG4gICAgTm90ZSBsZWZ0IG9mIOWuouaIt-errzogIOS4gOmYteWtkOWQjjxici8-5YaN6K-35rGCXG4gICAg5a6i5oi356uvLT4-5pyN5Yqh5ZmoOiDor7fmsYLnu5nkuKrmm7TmlrDlhoXlrrko5bCR6YePKVxuICAgIOacjeWKoeWZqC0tPj7lrqLmiLfnq686IOWlveeahOe7meaVtOS4qkhUTUxcbiAgICBOb3RlIHJpZ2h0IG9mIOacjeWKoeWZqDogIOWQjOatpeivt-axgiIsIm1lcm1haWQiOnsidGhlbWUiOiJkZWZhdWx0In0sInVwZGF0ZUVkaXRvciI6ZmFsc2V9)](https://mermaid-js.github.io/mermaid-live-editor/#/edit/eyJjb2RlIjoic2VxdWVuY2VEaWFncmFtXG4gICAgcGFydGljaXBhbnQg5a6i5oi356uvXG4gICAgcGFydGljaXBhbnQg5pyN5Yqh5ZmoXG4gICAg5a6i5oi356uvLT4-5pyN5Yqh5ZmoOiDor7fmsYLnu5nkuKrnvZHpobUgICBcbiAgICDmnI3liqHlmagtLT4-5a6i5oi356uvOiDlpb3nmoTnu5nmlbTkuKpIVE1MXG4gICAgTm90ZSBsZWZ0IG9mIOWuouaIt-errzogIOS4gOmYteWtkOWQjjxici8-5YaN6K-35rGCXG4gICAg5a6i5oi356uvLT4-5pyN5Yqh5ZmoOiDor7fmsYLnu5nkuKrmm7TmlrDlhoXlrrko5bCR6YePKVxuICAgIOacjeWKoeWZqC0tPj7lrqLmiLfnq686IOWlveeahOe7meaVtOS4qkhUTUxcbiAgICBOb3RlIHJpZ2h0IG9mIOacjeWKoeWZqDogIOWQjOatpeivt-axgiIsIm1lcm1haWQiOnsidGhlbWUiOiJkZWZhdWx0In0sInVwZGF0ZUVkaXRvciI6ZmFsc2V9)
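    * ... The Ajax solution
        * Ajax (Asynchronous JavaScript and XML) is an asynchronous communication technique that makes effective use of JavaScript and DOM (Document Object Model) manipulation to replace and load only part of a web page. Compared with earlier synchronous communication, only part of the page is updated, so less data is transferred in the response, which is an obvious advantage.
        * The core technology of Ajax is an API called XMLHttpRequest. By calling it from JavaScript, a page can communicate with the server over HTTP. This makes it possible to issue requests from a web page that has already finished loading and update only part of the page.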
![](XHR_solution.png)     

[![](https://mermaid.ink/img/eyJjb2RlIjoic2VxdWVuY2VEaWFncmFtXG4gICAgcGFydGljaXBhbnQg5a6i5oi356uvXG4gICAgcGFydGljaXBhbnQg5pyN5Yqh5ZmoXG4gICAg5a6i5oi356uvLT4-5pyN5Yqh5ZmoOiDor7fmsYLnu5nkuKrnvZHpobUgICBcbiAgICDmnI3liqHlmagtLT4-5a6i5oi356uvOiDlpb3nmoTnu5nmlbTkuKpIVE1MXG4gICAgTm90ZSByaWdodCBvZiDmnI3liqHlmag6ICDmnInmlrDmlbDmja7nvZdcbiAgICDmnI3liqHlmagtLT4-5a6i5oi356uvOiDmnInmlrDmlbDmja5qc29u57uZ5L2g5o6o5LiA54K554K5XG4gICAg5pyN5Yqh5ZmoLS0-PuWuouaIt-errzog5pyJ5paw5pWw5o2uWE1M57uZ5L2g5o6o5LiA54K554K5XG4gICAgTm90ZSBsZWZ0IG9mIOWuouaIt-errzogIOS4gOmYteWtkOWQjjxici8-5YaN6K-35rGCXG4gICAg5a6i5oi356uvLT4-5pyN5Yqh5ZmoOiDor7fmsYLnu5nkuKrmm7TmlrDlhoXlrrko5bCR6YePKVxuICAgIOacjeWKoeWZqC0tPj7lrqLmiLfnq686IOacieaWsOaVsOaNrmpzb27nu5nkvaDmjqjkuIDngrnngrlcbiAgICBOb3RlIHJpZ2h0IG9mIOacjeWKoeWZqDogIOW8guatpeivt-axgiIsIm1lcm1haWQiOnsidGhlbWUiOiJkZWZhdWx0In0sInVwZGF0ZUVkaXRvciI6ZmFsc2V9)](https://mermaid-js.github.io/mermaid-live-editor/#/edit/eyJjb2RlIjoic2VxdWVuY2VEaWFncmFtXG4gICAgcGFydGljaXBhbnQg5a6i5oi356uvXG4gICAgcGFydGljaXBhbnQg5pyN5Yqh5ZmoXG4gICAg5a6i5oi356uvLT4-5pyN5Yqh5ZmoOiDor7fmsYLnu5nkuKrnvZHpobUgICBcbiAgICDmnI3liqHlmagtLT4-5a6i5oi356uvOiDlpb3nmoTnu5nmlbTkuKpIVE1MXG4gICAgTm90ZSByaWdodCBvZiDmnI3liqHlmag6ICDmnInmlrDmlbDmja7nvZdcbiAgICDmnI3liqHlmagtLT4-5a6i5oi356uvOiDmnInmlrDmlbDmja5qc29u57uZ5L2g5o6o5LiA54K554K5XG4gICAg5pyN5Yqh5ZmoLS0-PuWuouaIt-errzog5pyJ5paw5pWw5o2uWE1M57uZ5L2g5o6o5LiA54K554K5XG4gICAgTm90ZSBsZWZ0IG9mIOWuouaIt-errzogIOS4gOmYteWtkOWQjjxici8-5YaN6K-35rGCXG4gICAg5a6i5oi356uvLT4-5pyN5Yqh5ZmoOiDor7fmsYLnu5nkuKrmm7TmlrDlhoXlrrko5bCR6YePKVxuICAgIOacjeWKoeWZqC0tPj7lrqLmiLfnq686IOacieaWsOaVsOaNrmpzb27nu5nkvaDmjqjkuIDngrnngrlcbiAgICBOb3RlIHJpZ2h0IG9mIOacjeWKoeWZqDogIOW8guatpeivt-axgiIsIm1lcm1haWQiOnsidGhlbWUiOiJkZWZhdWx0In0sInVwZGF0ZUVkaXRvciI6ZmFsc2V9)
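
The difference between a full page load and an XHR-style request can be tried out locally. The sketch below is a toy, not the National Data site: it starts a throwaway server on 127.0.0.1 with two made-up paths, `/` (a whole HTML page) and `/data` (only the data, as JSON), and fetches both with the standard library. All names and the page content are hypothetical.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener

# Hypothetical data: two region names that appear in the tree on the page.
REGIONS = [{"id": "800001", "name": "东北地区"}, {"id": "800002", "name": "北部沿海"}]
PAGE = ("<html><body><h1>Regions</h1>"
        + "<p>layout, menus, scripts...</p>" * 50
        + "</body></html>")


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.startswith("/data"):
            # What an XHR endpoint typically returns: just the data, as JSON.
            body, ctype = json.dumps(REGIONS).encode("utf-8"), "application/json"
        else:
            # What a classic page load returns: the whole HTML document.
            body, ctype = PAGE.encode("utf-8"), "text/html; charset=utf-8"
        self.send_response(200)
        self.send_header("Content-Type", ctype)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def compare_full_page_and_xhr():
    """Fetch both paths; return (page size in bytes, JSON size in bytes, parsed JSON)."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    opener = build_opener(ProxyHandler({}))  # talk to localhost directly, ignoring proxies
    try:
        with opener.open(base + "/") as resp:
            page = resp.read()
        with opener.open(base + "/data") as resp:
            payload = resp.read()
    finally:
        server.shutdown()
        server.server_close()
    return len(page), len(payload), json.loads(payload)


page_size, json_size, regions = compare_full_page_and_xhr()
print(page_size, json_size, [item["name"] for item in regions])
```

The JSON response is much smaller than the page and is already structured data, which is why scraping the XHR endpoint directly is usually easier than parsing the rendered HTML.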

## The lazy approach: only suitable for a one-off scrape
Tip: in the Chrome Inspector's Elements panel, right-click the element and choose "Copy outerHTML".

In [4]:
import requests_html

In [19]:
## Parse the copied HTML snippet
HTML_text = """
<ul id="reg_tree" class="ztree">
<li id="reg_tree_1" class="level0" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_1_switch" title="" class="button level0 switch noline_docu" treenode_switch=""></span><a id="reg_tree_1_a" class="level0 curSelectedNode" treenode_a="" onclick="" target="_blank" style="" title="全部地区"><span id="reg_tree_1_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_1_span">全部地区</span></a></li><li id="reg_tree_2" class="level0" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_2_switch" title="" class="button level0 switch noline_open" treenode_switch=""></span><a id="reg_tree_2_a" class="level0" treenode_a="" onclick="" target="_blank" style="" title="常规分类"><span id="reg_tree_2_ico" title="" treenode_ico="" class="button ico_open" style=""></span><span id="reg_tree_2_span">常规分类</span></a><ul id="reg_tree_2_ul" class="level0 " style=""><li id="reg_tree_6" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_6_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_6_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="华北"><span id="reg_tree_6_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_6_span">华北</span></a></li><li id="reg_tree_7" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_7_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_7_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="东北"><span id="reg_tree_7_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_7_span">东北</span></a></li><li id="reg_tree_8" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_8_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_8_a" class="level1" treenode_a="" onclick="" target="_blank" style="" 
title="华东"><span id="reg_tree_8_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_8_span">华东</span></a></li><li id="reg_tree_9" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_9_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_9_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="中南"><span id="reg_tree_9_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_9_span">中南</span></a></li><li id="reg_tree_10" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_10_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_10_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="西南"><span id="reg_tree_10_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_10_span">西南</span></a></li><li id="reg_tree_11" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_11_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_11_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="西北"><span id="reg_tree_11_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_11_span">西北</span></a></li></ul></li><li id="reg_tree_3" class="level0" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_3_switch" title="" class="button level0 switch noline_open" treenode_switch=""></span><a id="reg_tree_3_a" class="level0" treenode_a="" onclick="" target="_blank" style="" title="热点地区"><span id="reg_tree_3_ico" title="" treenode_ico="" class="button ico_open" style=""></span><span id="reg_tree_3_span">热点地区</span></a><ul id="reg_tree_3_ul" class="level0 " style=""><li id="reg_tree_12" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_12_switch" title="" 
class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_12_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="长江三角洲"><span id="reg_tree_12_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_12_span">长江三角洲</span></a></li><li id="reg_tree_13" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_13_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_13_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="环渤海地区"><span id="reg_tree_13_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_13_span">环渤海地区</span></a></li><li id="reg_tree_14" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_14_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_14_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="泛珠三角"><span id="reg_tree_14_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_14_span">泛珠三角</span></a></li><li id="reg_tree_15" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_15_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_15_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="东部地区"><span id="reg_tree_15_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_15_span">东部地区</span></a></li><li id="reg_tree_16" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_16_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_16_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="西部地区"><span id="reg_tree_16_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span 
id="reg_tree_16_span">西部地区</span></a></li></ul></li><li id="reg_tree_4" class="level0" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_4_switch" title="" class="button level0 switch noline_open" treenode_switch=""></span><a id="reg_tree_4_a" class="level0" treenode_a="" onclick="" target="_blank" style="" title="八大经济区域"><span id="reg_tree_4_ico" title="" treenode_ico="" class="button ico_open" style=""></span><span id="reg_tree_4_span">八大经济区域</span></a><ul id="reg_tree_4_ul" class="level0 " style=""><li id="reg_tree_17" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_17_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_17_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="东北地区"><span id="reg_tree_17_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_17_span">东北地区</span></a></li><li id="reg_tree_18" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_18_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_18_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="北部沿海"><span id="reg_tree_18_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_18_span">北部沿海</span></a></li><li id="reg_tree_19" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_19_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_19_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="东部沿海"><span id="reg_tree_19_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_19_span">东部沿海</span></a></li><li id="reg_tree_20" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_20_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_20_a" 
class="level1" treenode_a="" onclick="" target="_blank" style="" title="南部沿海"><span id="reg_tree_20_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_20_span">南部沿海</span></a></li><li id="reg_tree_21" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_21_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_21_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="黄河中游"><span id="reg_tree_21_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_21_span">黄河中游</span></a></li><li id="reg_tree_22" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_22_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_22_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="长江中游"><span id="reg_tree_22_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_22_span">长江中游</span></a></li><li id="reg_tree_23" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_23_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_23_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="西南地区"><span id="reg_tree_23_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_23_span">西南地区</span></a></li><li id="reg_tree_24" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_24_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_24_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="大西北地区"><span id="reg_tree_24_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_24_span">大西北地区</span></a></li></ul></li><li id="reg_tree_5" class="level0" tabindex="0" hidefocus="true" 
treenode=""><span id="reg_tree_5_switch" title="" class="button level0 switch noline_open" treenode_switch=""></span><a id="reg_tree_5_a" class="level0" treenode_a="" onclick="" target="_blank" style="" title="三大地带"><span id="reg_tree_5_ico" title="" treenode_ico="" class="button ico_open" style=""></span><span id="reg_tree_5_span">三大地带</span></a><ul id="reg_tree_5_ul" class="level0 " style=""><li id="reg_tree_25" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_25_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_25_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="东部地带"><span id="reg_tree_25_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_25_span">东部地带</span></a></li><li id="reg_tree_26" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_26_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_26_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="中部地带"><span id="reg_tree_26_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_26_span">中部地带</span></a></li><li id="reg_tree_27" class="level1" tabindex="0" hidefocus="true" treenode=""><span id="reg_tree_27_switch" title="" class="button level1 switch noline_docu" treenode_switch=""></span><a id="reg_tree_27_a" class="level1" treenode_a="" onclick="" target="_blank" style="" title="西部地带"><span id="reg_tree_27_ico" title="" treenode_ico="" class="button ico_docu" style=""></span><span id="reg_tree_27_span">西部地带</span></a></li></ul></li></ul>
"""
html_load = requests_html.HTML(html=HTML_text, url='https://localhost/')
reg_dict = {}

# Use the lxml.xpath method of the requests_html HTML object.
# Matching class="level0" exactly skips "全部地区", whose class is "level0 curSelectedNode".
for a in html_load.lxml.xpath('//a[@class="level0"]'):
    # The <ul> right after each top-level <a> holds that category's regions.
    one = {a.get("title"): a.getnext().xpath('li/a/@title')}
    print(one)
    reg_dict.update(one)

{'常规分类': ['华北', '东北', '华东', '中南', '西南', '西北']}
{'热点地区': ['长江三角洲', '环渤海地区', '泛珠三角', '东部地区', '西部地区']}
{'八大经济区域': ['东北地区', '北部沿海', '东部沿海', '南部沿海', '黄河中游', '长江中游', '西南地区', '大西北地区']}
{'三大地带': ['东部地带', '中部地带', '西部地带']}


In [23]:
# Use the xpath method of the requests_html HTML object
outcomes = {}
for li in html_load.xpath('//li[@class="level0"]'):
    one = {li.xpath("//a/@title", first=True): li.xpath('//ul/li/a/@title')}
    outcomes.update(one)
    print(one)


{'全部地区': []}
{'常规分类': ['华北', '东北', '华东', '中南', '西南', '西北']}
{'热点地区': ['长江三角洲', '环渤海地区', '泛珠三角', '东部地区', '西部地区']}
{'八大经济区域': ['东北地区', '北部沿海', '东部沿海', '南部沿海', '黄河中游', '长江中游', '西南地区', '大西北地区']}
{'三大地带': ['东部地带', '中部地带', '西部地带']}


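## Reverse engineering: observing the parameters submitted by the XHR

* To scrape interactive dynamic content, we need to understand how the page loads that data, a process that can also be described as reverse engineering. Continuing the example from the previous section, open the Network tab in the browser's developer tools and then perform a search; every request made for the page will be listed.

* If the mechanism behind the dynamic content turns out to be an XHR request, use the requests module, the library that requests_html builds on, to download that content.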

In [7]:
# Use urllib.parse to take URLs apart and requests to send HTTP requests

import requests
from urllib.parse import urlparse, parse_qs

import pandas as pd


In [8]:
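# Build the parameters: find the key parameters and how they are structured

def parse_url_qs(url):
    """Split a URL and parse its query string.

    Input: url
    Output: (query parameters as a dict, the six-part ParseResult)
    """
    six_parts = urlparse(url)
    out = parse_qs(six_parts.query)
    return (out, six_parts)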

In [9]:
parse_url_qs ("http://data.stats.gov.cn/adv.htm?m=findZbXl&wd=reg&db=fsnd")

({'m': ['findZbXl'], 'wd': ['reg'], 'db': ['fsnd']},
 ParseResult(scheme='http', netloc='data.stats.gov.cn', path='/adv.htm', params='', query='m=findZbXl&wd=reg&db=fsnd', fragment=''))

In [10]:
parse_url_qs ("http://data.stats.gov.cn/adv.htm?m=findZbXl&db=fsnd&wd=reg&treeId=900002")

({'m': ['findZbXl'], 'db': ['fsnd'], 'wd': ['reg'], 'treeId': ['900002']},
 ParseResult(scheme='http', netloc='data.stats.gov.cn', path='/adv.htm', params='', query='m=findZbXl&db=fsnd&wd=reg&treeId=900002', fragment=''))

In [49]:
# treeId can be swapped for another value
treeId_new = '900002'

参数, s = parse_url_qs ("http://data.stats.gov.cn/adv.htm?m=findZbXl&db=fsnd&wd=reg&treeId=800005")
参数['treeId'] = [treeId_new]
参数

{'m': ['findZbXl'], 'db': ['fsnd'], 'wd': ['reg'], 'treeId': ['900002']}
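
The parsed dictionary can also be turned back into a complete URL, which is handy for pasting into a browser to check a guess. The following is a minimal sketch using only `urllib.parse`; `swap_query_param` is an illustrative helper name, not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


def swap_query_param(url, key, value):
    """Return url with one query parameter replaced (or added)."""
    parts = urlparse(url)
    query = parse_qs(parts.query)      # e.g. {'treeId': ['800005'], ...}
    query[key] = [value]
    # doseq=True turns the list values from parse_qs back into key=value pairs.
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


captured = "http://data.stats.gov.cn/adv.htm?m=findZbXl&db=fsnd&wd=reg&treeId=800005"
print(swap_query_param(captured, "treeId", "900002"))
# http://data.stats.gov.cn/adv.htm?m=findZbXl&db=fsnd&wd=reg&treeId=900002
```

Note that `parse_qs` drops parameters with blank values by default, so a URL that relies on empty parameters would need `keep_blank_values=True`.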

## Using the requests module

In [52]:
# Use the requests module
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2900.1 Iron Safari/537.36"
}

url = "http://data.stats.gov.cn/adv.htm"

# Ask for the children of node 800004 ("南部沿海")
treeId_new = '800004'
参数['treeId'] = [treeId_new]

r = requests.get(url, params=参数, headers=headers)
r.json()


[{'dbcode': 'fsnd',
  'exp': '',
  'id': '350000',
  'isParent': True,
  'name': '福建省',
  'open': False,
  'pid': '800004',
  'wd': 'reg'},
 {'dbcode': 'fsnd',
  'exp': '',
  'id': '440000',
  'isParent': True,
  'name': '广东省',
  'open': False,
  'pid': '800004',
  'wd': 'reg'},
 {'dbcode': 'fsnd',
  'exp': '',
  'id': '460000',
  'isParent': True,
  'name': '海南省',
  'open': False,
  'pid': '800004',
  'wd': 'reg'}]

### Parsing the JSON object into a DataFrame

In [13]:
pd.DataFrame(r.json())

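Output (captured from a run with `treeId = '800000'`, as the `pid` column shows):

| | dbcode | exp | id | isParent | name | open | pid | wd |
|---|---|---|---|---|---|---|---|---|
| 0 | fsnd | | 800001 | False | 东北地区 | False | 800000 | reg |
| 1 | fsnd | | 800002 | False | 北部沿海 | False | 800000 | reg |
| 2 | fsnd | | 800003 | False | 东部沿海 | False | 800000 | reg |
| 3 | fsnd | | 800004 | False | 南部沿海 | False | 800000 | reg |
| 4 | fsnd | | 800005 | False | 黄河中游 | False | 800000 | reg |
| 5 | fsnd | | 800006 | False | 长江中游 | False | 800000 | reg |
| 6 | fsnd | | 800007 | False | 西南地区 | False | 800000 | reg |
| 7 | fsnd | | 800008 | False | 大西北地区 | False | 800000 | reg |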


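### The response object returned by requests

The response object returned by the requests module exposes its content in several ways:

* r.content: the raw bytes
* r.text: the body decoded to a str
* r.json(): the body parsed from JSON into Python objects (a method, so it needs the parentheses)

The data types differ, so the pandas API used to take over from there differs too.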
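
The three data types can be reproduced offline. The sketch below takes one record from the response shown in this handout, encodes it as UTF-8 bytes (an assumption about how the server encodes the body), and walks through the same three stages with the standard library only.

```python
import json

# One record from the response shown in this handout, as it would travel over the network.
raw_bytes = ('[{"dbcode":"fsnd","exp":"","id":"350000","isParent":true,'
             '"name":"福建省","open":false,"pid":"800004","wd":"reg"}]').encode("utf-8")

text = raw_bytes.decode("utf-8")   # comparable to r.text (assuming UTF-8)
records = json.loads(text)         # comparable to r.json()

print(type(raw_bytes).__name__, type(text).__name__, type(records).__name__)
# bytes str list
print(records[0]["name"], records[0]["isParent"])
# 福建省 True
```

A parsed list of dicts like `records` can go straight into `pd.DataFrame(...)`, while the text form is what `pd.read_json` reads. Recent pandas versions warn when `read_json` is given a literal JSON string and expect a file-like object instead, for example `io.StringIO(text)`.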

In [54]:
r.text  # r.content holds the same body as raw bytes

'[{"dbcode":"fsnd","exp":"","id":"350000","isParent":true,"name":"福建省","open":false,"pid":"800004","wd":"reg"},{"dbcode":"fsnd","exp":"","id":"440000","isParent":true,"name":"广东省","open":false,"pid":"800004","wd":"reg"},{"dbcode":"fsnd","exp":"","id":"460000","isParent":true,"name":"海南省","open":false,"pid":"800004","wd":"reg"}]'

In [31]:
pd.read_json(r.text)

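Output (captured from a run with `treeId = '800000'`, as the `pid` column shows):

| | dbcode | exp | id | isParent | name | open | pid | wd |
|---|---|---|---|---|---|---|---|---|
| 0 | fsnd | | 800001 | False | 东北地区 | False | 800000 | reg |
| 1 | fsnd | | 800002 | False | 北部沿海 | False | 800000 | reg |
| 2 | fsnd | | 800003 | False | 东部沿海 | False | 800000 | reg |
| 3 | fsnd | | 800004 | False | 南部沿海 | False | 800000 | reg |
| 4 | fsnd | | 800005 | False | 黄河中游 | False | 800000 | reg |
| 5 | fsnd | | 800006 | False | 长江中游 | False | 800000 | reg |
| 6 | fsnd | | 800007 | False | 西南地区 | False | 800000 | reg |
| 7 | fsnd | | 800008 | False | 大西北地区 | False | 800000 | reg |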


-----

### Packaging the code above for reuse
* Can be used on its own

In [55]:
# Standalone version: everything needed to call get_df
import pandas as pd
import requests
from urllib.parse import urlparse, parse_qs

def parse_url_qs(url):
    """Return (query parameters as a dict, the six-part ParseResult) for a URL."""
    six_parts = urlparse(url)
    out = parse_qs(six_parts.query)
    return (out, six_parts)

参数, s = parse_url_qs("http://data.stats.gov.cn/adv.htm?m=findZbXl&db=fsnd&wd=reg&treeId=800005")

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2900.1 Iron Safari/537.36"
}

url = "http://data.stats.gov.cn/adv.htm"

def get_df(treeId_new='800000'):
    """Fetch the child nodes of treeId_new as a DataFrame, or None if the body is not usable JSON."""
    参数['treeId'] = [treeId_new]
    r = requests.get(url, params=参数, headers=headers)
    try:
        df = pd.DataFrame(r.json())
    except ValueError:
        # r.json() raises a ValueError subclass when the body is not valid JSON.
        df = None
    return df


In [56]:
get_df( '800000' )

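| | dbcode | exp | id | isParent | name | open | pid | wd |
|---|---|---|---|---|---|---|---|---|
| 0 | fsnd | | 800001 | False | 东北地区 | False | 800000 | reg |
| 1 | fsnd | | 800002 | False | 北部沿海 | False | 800000 | reg |
| 2 | fsnd | | 800003 | False | 东部沿海 | False | 800000 | reg |
| 3 | fsnd | | 800004 | False | 南部沿海 | False | 800000 | reg |
| 4 | fsnd | | 800005 | False | 黄河中游 | False | 800000 | reg |
| 5 | fsnd | | 800006 | False | 长江中游 | False | 800000 | reg |
| 6 | fsnd | | 800007 | False | 西南地区 | False | 800000 | reg |
| 7 | fsnd | | 800008 | False | 大西北地区 | False | 800000 | reg |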
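
Since each response lists the children of one node, the whole region tree could in principle be collected by calling the endpoint again for every returned `id`. The sketch below shows that idea with the network call passed in as a function, so it can be tried offline. `walk_tree`, `fetch_children`, and `FAKE_TREE` are hypothetical names, and the fake data only reuses a few values from the outputs in this handout. In those outputs, node 800004 is listed with `isParent` False yet returns provinces when queried, so this sketch does not rely on `isParent` and uses a depth limit instead.

```python
def walk_tree(fetch_children, root_id, max_depth=2):
    """Depth-first walk; returns a flat list of (depth, id, name) tuples.

    fetch_children(node_id) must return a list of dicts with "id" and "name" keys,
    or an empty list when the node has no children.
    """
    rows = []

    def visit(node_id, depth):
        if depth > max_depth:
            return
        for node in fetch_children(node_id):
            rows.append((depth, node["id"], node["name"]))
            visit(node["id"], depth + 1)

    visit(root_id, 1)
    return rows


# Hypothetical offline stand-in for the network call.
FAKE_TREE = {
    "800000": [{"id": "800001", "name": "东北地区"},
               {"id": "800004", "name": "南部沿海"}],
    "800004": [{"id": "350000", "name": "福建省"},
               {"id": "440000", "name": "广东省"},
               {"id": "460000", "name": "海南省"}],
}


def fake_fetch(node_id):
    return FAKE_TREE.get(node_id, [])


for depth, node_id, name in walk_tree(fake_fetch, "800000"):
    print("  " * (depth - 1) + node_id, name)
```

Against the real site, `fetch_children` would wrap the request made in `get_df`, ideally with a short pause between calls so the server is not flooded.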


-----
# production

In [None]:
# Milestone: indicators

In [None]:
# Milestone: annual data by province

In [None]:
# Final goal: integrate annual provincial data + indicators + regions