# Crawling China's National Administrative Divisions


1. Site URL: [http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017/index.html](http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017/index.html)


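## Crawl Strategy
1. Fetch the home page and push the links it contains onto the queue.
2. Take each URL off the queue in turn, fetch that page, and push its links onto the queue.
3. On every page, store the text of each link in the database, row by row.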
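
The three steps above can be sketched without touching the network. This is only a minimal illustration: `FAKE_SITE` is a made-up in-memory stand-in for the website and `crawl` is a hypothetical helper. The notebook's real loop further down uses a stack and HTTP requests instead.

```python
from collections import deque

# Hypothetical stand-in for the website: page -> list of (link text, target page).
# A page with no outgoing links is a leaf.
FAKE_SITE = {
    "index": [("Province A", "a"), ("Province B", "b")],
    "a": [("City A1", "a1")],
    "b": [],
    "a1": [],
}

def crawl(start):
    """Visit every reachable page once and collect link texts in visit order."""
    queue = deque([start])
    seen = {start}
    names = []  # stands in for the database
    while queue:
        page = queue.popleft()
        for text, target in FAKE_SITE[page]:
            names.append(text)          # step 3: store the link text
            if target not in seen:      # steps 1 and 2: enqueue unseen links
                seen.add(target)
                queue.append(target)
    return names

names = crawl("index")
```

A `deque` popped from the left visits pages level by level; popping from the right instead would give the depth-first order used later in the notebook.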


## Key Points
1. requests
2. beautifulsoup
    - soup.select('tr[class="villagetr"]')
    - soup.find_all("a")

3. Working around anti-crawler checks: build request headers
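
To make the two BeautifulSoup calls and the header idea concrete, here is a small offline sketch. The HTML fragment is invented for illustration and only imitates the shape of a village-level page (three cells per row, name in the last cell). The `headers` dict is the kind of value passed as `headers=` to `requests.get`.

```python
from bs4 import BeautifulSoup

# Invented fragment shaped like a village-level page.
html = """
<table>
  <tr class="villagetr"><td>110101001001</td><td>111</td><td>Community One</td></tr>
  <tr class="villagetr"><td>110101001002</td><td>111</td><td>Community Two</td></tr>
</table>
<a href="http://example.com/">footer link</a>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS attribute selector: every <tr> whose class attribute is "villagetr".
rows = soup.select('tr[class="villagetr"]')
village_names = [row.find_all("td")[2].get_text() for row in rows]

# find_all("a") returns every anchor tag in the document.
links = [a.get("href") for a in soup.find_all("a")]

# Browser-like request headers; no request is sent in this sketch.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "http://www.stats.gov.cn/",
}
```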


## References
1. [A simple Python crawler](http://cache.baiducontent.com/c?m=9d78d513d9810ae902b0c8690d67c0171e43f1612ba7d10208d08448e2320c1e1a72a4fb792d4a4295873d7000dc5441afb57365377471ebcb96d51f9cac925f7ed57829234cd11f539404edd64126c327975ce9b81990e0b66dcd&p=b4769a4786cc4ae000a48e2c4f&newp=82769a47928911a053a4d6275953d8224216ed623fd4c44324b9d71fd325001c1b69e7bc2d261702d4c4796d0bad4d5aeef63078341766dada9fca458ae7c46c65&user=baidu&fm=sc&query=python+%C5%C0%B3%E6+demo&qid=bd8c9e6500034ce3&p1=2)
2. [A beginner's five hours with a web crawler](https://www.cnblogs.com/panzi/p/6421826.html)
3. [32 crawler projects: all you can eat in one sitting](https://blog.csdn.net/qq_41396296/article/details/79428834)

In [None]:
# Import libraries
import requests
import re
import pandas as pd
import time
import random
from bs4 import BeautifulSoup


dir_base = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017"

In [None]:
# Fetch and parse an HTML document
def get_html(url):
    """Fetch url and return its content parsed as a BeautifulSoup tree."""
    session = requests.Session()

    # Browser-like headers, so the request looks like it comes from a normal browser
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
        'Referer': 'http://www.stats.gov.cn/',
    }
    # timeout is in seconds; the original 50000 would effectively never time out
    res = session.get(url, headers=header, timeout=30)
    res.encoding = "gb2312"  # decode the response body as GB2312
    # Name the parser explicitly so bs4 neither warns nor picks a different parser per machine
    return BeautifulSoup(res.text, "html.parser")

soup = get_html("%s/%s" %(dir_base, "62/09/22/620922215.html"))
soup.select("a")

In [None]:
len(soup.find_all("a"))

In [None]:
for link in soup.select('tr[class="villagetr"]'):
    print(link.find_all("td")[2].get_text())

In [None]:
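%%time
# Address format (five levels):
# province, city, district/county, town/subdistrict, neighborhood committee
"""
Algorithm:
    1. Create a row counter and an empty table to hold the administrative divisions.
    2. Starting from the root page (index), search depth-first:
        Fetch the current page. If it has links, push them onto the stack.
                      If it has no links (other than external ones), a leaf has been reached:
                      write the page's rows into the table, incrementing row for each one.
        Pop the next entry from the stack.
            If the entry is not a URL, step back one row (row-1), write the entry's data
            into the matching columns, then step forward again (row+1).
            Pop the next entry from the stack.
    3. Fill the remaining empty cells with a backward fill.
"""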
row = 0
df_district = pd.DataFrame()

html_list = [{"url":"%s/%s" %(dir_base, "index.html")}]

f = open(r"E:\workspace\0. git\project\4. 网络爬虫\2. 爬虫实战\1. 全国行政区划\addr.txt", "w")

while len(html_list) > 0:
    time.sleep(random.random())   # sleep for a random 0-1 s between requests
    node = html_list.pop()
    if "url" in node:
        url = node["url"]
    else:
        key_no = node["key_no"]
        value_no = node["value_no"]
        tag = node["tag"]
        value_tag = node["value_tag"]
        
        row -= 1
        df_district.loc[row, key_no] = value_no
        df_district.loc[row, tag] = value_tag
        row += 1
        continue
    
    soup = get_html(url)
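    # A page whose only <a> tag is the footer ICP link, i.e.
    # [<a class="STYLE3" href="http://www.miibeian.gov.cn/" target="_blank">京ICP备05034670号</a>],
    # is a leaf (village-level) page
    if len(soup.find_all("a")) == 1:
        for link in soup.select('tr[class="villagetr"]'):
            cells = link.find_all("td")
            # Note: judging by the 3-digit values in the output below, cells[1] looks like
            # a classification code rather than a unique village code
            df_district.loc[row, "v_no"] = cells[1].get_text()
            df_district.loc[row, "village"] = cells[2].get_text()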
            
#             df_district.loc[row, "t_no"] = link.find_all("td")[1].get_text()
#             df_district.loc[row, "town"] = link.find_all("td")[2].get_text()
            
            row += 1
        continue
        
    for link in soup.find_all("a"):
        html = link.get("href")
        tag = link.get_text()

        # Skip anchors without an href, external links, and links whose text is the numeric code
        if html is None or html.startswith("http") or re.match(r"\d+", tag) is not None:
            continue

        # The length of the relative href identifies the level of the linked page
        if len(html) == 7:        # e.g. "62.html"
            html_list.append({"tag":"province", "value_tag":tag, "key_no":"p_no", "value_no":html[:2]})
            r = "%s/%s" %(dir_base, html)
        elif len(html) == 12:     # e.g. "62/6209.html"
            html_list.append({"tag":"city", "value_tag":tag, "key_no":"c_no", "value_no":html[3:7]})
            r = "%s/%s" %(dir_base, html)
        elif len(html) == 14:     # e.g. "09/620922.html"
            html_list.append({"tag":"district", "value_tag":tag, "key_no":"d_no", "value_no":html[3:9]})
            r = "%s/%s/%s" %(dir_base, html[3:5], html)
        elif len(html) == 17:     # e.g. "22/620922215.html"
            html_list.append({"tag":"town", "value_tag":tag, "key_no":"t_no", "value_no":html[3:12]})
            r = "%s/%s/%s/%s" %(dir_base, html[3:5], html[5:7], html)
        else:
            # Unexpected href shape: skip it rather than reuse the previous (stale) URL in r
            continue

        html_list.append({"url":r})
        f.write("%s\n" %(r))
f.close()

In [13]:
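# Backward fill: each empty cell takes the next non-empty value below it.
# fillna(method=...) is deprecated in recent pandas releases; bfill() is the equivalent call.
df_district = df_district.bfill()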

df_district.head(10)

| | v_no | village | t_no | town | d_no | district | c_no | city |
|---|---|---|---|---|---|---|---|---|
| 0 | 121 | 虚拟社区 | 659006101 | 双丰镇 | 659006 | 铁门关市 | 6590 | 自治区直辖县级行政区划 |
| 1 | 111 | 虚拟社区 | 659006100 | 博古其镇 | 659006 | 铁门关市 | 6590 | 自治区直辖县级行政区划 |
| 2 | 123 | 团部 | 659004502 | 兵团一零三团 | 659004 | 五家渠市 | 6590 | 自治区直辖县级行政区划 |
| 3 | 220 | 一连 | 659004502 | 兵团一零三团 | 659004 | 五家渠市 | 6590 | 自治区直辖县级行政区划 |
| 4 | 220 | 二连 | 659004502 | 兵团一零三团 | 659004 | 五家渠市 | 6590 | 自治区直辖县级行政区划 |
| 5 | 220 | 三连 | 659004502 | 兵团一零三团 | 659004 | 五家渠市 | 6590 | 自治区直辖县级行政区划 |
| 6 | 220 | 六连 | 659004502 | 兵团一零三团 | 659004 | 五家渠市 | 6590 | 自治区直辖县级行政区划 |
| 7 | 220 | 七连 | 659004502 | 兵团一零三团 | 659004 | 五家渠市 | 6590 | 自治区直辖县级行政区划 |
| 8 | 220 | 八连 | 659004502 | 兵团一零三团 | 659004 | 五家渠市 | 6590 | 自治区直辖县级行政区划 |
| 9 | 220 | 九连 | 659004502 | 兵团一零三团 | 659004 | 五家渠市 | 6590 | 自治区直辖县级行政區划 |
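
A backward fill suits this table because the crawl loop writes a parent's code and name only on the last row of its group (the `row -= 1` step), leaving the earlier rows of that group empty. A small sketch with made-up rows, assuming pandas is installed:

```python
import pandas as pd

# Made-up data: two villages under town "T1", then one village under town "T2".
# As in the crawl loop, the town is written only on the last row of each group.
df = pd.DataFrame({
    "village": ["v1", "v2", "v3"],
    "town": [None, "T1", "T2"],
})

# Each missing cell takes the next non-missing value below it in the same column.
filled = df.bfill()
towns = filled["town"].tolist()
```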


In [7]:
len(html_list)

85

In [17]:
df_district.to_csv(r"C:\Users\Progress\Desktop\district.csv")

In [15]:
html_list

[{'key_no': 'p_no', 'tag': 'province', 'value_no': '11', 'value_tag': '北京市'},
 {'url': 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017/11.html'},
 {'key_no': 'p_no', 'tag': 'province', 'value_no': '12', 'value_tag': '天津市'},
 {'url': 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017/12.html'},
 {'key_no': 'p_no', 'tag': 'province', 'value_no': '13', 'value_tag': '河北省'},
 {'url': 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017/13.html'},
 {'key_no': 'p_no', 'tag': 'province', 'value_no': '14', 'value_tag': '山西省'},
 {'url': 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017/14.html'},
 {'key_no': 'p_no',
  'tag': 'province',
  'value_no': '15',
  'value_tag': '内蒙古自治区'},
 {'url': 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017/15.html'},
 {'key_no': 'p_no', 'tag': 'province', 'value_no': '21', 'value_tag': '辽宁省'},
 {'url': 'http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2017/21.html'},
 {'key_no': 'p_no', 'tag': 'province', 'value_no': '22', 'value_tag': '吉林