# Scraping Lianjia Residential Community (xiaoqu) Data

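**Approach**:
1. Scrape the data from a single community page
2. Collect the links to every community listed on one page
3. Collect the links to every page
4. Collect the links to every district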

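## Scraping a single community page

We use the community 华龙美晟 as the example: https://bj.lianjia.com/xiaoqu/1111027375142/

First, import the libraries we need:
- `requests` fetches the web page
- `BeautifulSoup` (`from bs4 import BeautifulSoup`) parses the HTML
- `urljoin` (`from urllib.parse import urljoin`) joins URLs in the later steps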

In [1]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  ## used to join URLs

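### Getting the price

- Locate: open the page and press `F12` to bring up the page information, then click the button marked 1 in the screenshot (the element picker) and click the item you want to scrape. Clicking the price, for example, shows that it sits in a `span` tag with `class=xiaoquUnitPrice`. Pass that tag name and class to `find` or `find_all` to select the element.
- Extract: once the element is located, `get_text()` returns its text, and `[]` reads an attribute of the tag; `['href']`, for instance, returns a link.
- Store: scraped values usually go into a dictionary. Here we define a dictionary `contents` and store the price in `contents['房价']` (房价 means "price").

- `find` returns the first element that matches
- `find_all` returns every element that matches
- `get_text()` returns the text content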
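
The locate / extract / store steps can be tried offline before touching the real site. The sketch below runs them on a small hand-written HTML snippet; the snippet, its `detail` class, and the `html.parser` backend are assumptions made for illustration, not the actual Lianjia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML that imitates the structure described above.
html = """
<div class="xiaoquPrice">
  <span class="xiaoquUnitPrice">64154</span>
  <a class="detail" href="/xiaoqu/1111027375142/">details</a>
</div>
"""

# 'html.parser' ships with Python, so this sketch does not need lxml.
soup = BeautifulSoup(html, 'html.parser')

# Locate the first matching tag and extract its text.
price = soup.find('span', class_='xiaoquUnitPrice').get_text()

# Read an attribute of a tag with [].
link = soup.find('a')['href']

# find returns None when nothing matches, so check before calling get_text().
missing = soup.find('span', class_='noSuchClass')

# Store the result in a dictionary.
contents = {'房价': price}
print(contents, link, missing)
```

Because `find` returns `None` for a missing element, calling `get_text()` on the result directly raises an `AttributeError` whenever the page layout differs from what is expected.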

In [2]:
href = 'https://bj.lianjia.com/xiaoqu/1111027375142/'
## fetch the page
res = requests.get(href)
## parse the HTML
soup = BeautifulSoup(res.text, 'lxml')

## get the price
price = soup.find('span', class_='xiaoquUnitPrice').get_text()
print('Price:', price)

## store the price ('房价' means "price")
contents = {}
contents['房价'] = price

Price: 64154


In [3]:
## the community name is obtained the same way ('小区名称' means "community name")
contents['小区名称'] = soup.find('h1').get_text()

![image.png](./fig/1.png)

### Basic community information

![image.png](./fig/2.png)

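The basic information is scraped with the same three steps as before: 1. locate 2. extract 3. store.

The screenshot above shows that every item uses the same tag structure. So we only need to locate the parent tag, use `find_all` to collect every tag that holds one item, and loop over those tags to read each label and value.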

In [4]:
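## locate the parent tag of the basic information, then collect every tag that holds one item
infos = soup.find('div', class_="xiaoquInfo").find_all('div')
## loop over those tags, reading and storing each label/value pair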
for info in infos:
    key = info.find('span', class_="xiaoquInfoLabel").get_text()
    value = info.find('span', class_="xiaoquInfoContent").get_text()
    contents[key] = value
    print(key, value)

建筑年代 2009年建成 
建筑类型 塔楼/塔板结合
物业费用 2.38元/平米/月
物业公司 北京喜莱达物业管理有限公司
开发商 华龙置业房地产开发有限公司
楼栋总数 5栋
房屋总数 1124户
附近门店 新天天家园一店/东城天天家园3号楼1层102
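
Every value above is stored as a string with its unit attached. If numbers are needed later, one possible cleaning step is to pull the first number out of each string. This is only a sketch: the helper name `first_number` is hypothetical and the sample values are copied from the output above.

```python
import re

# Sample values copied from the output above.
raw = {
    '物业费用': '2.38元/平米/月',
    '楼栋总数': '5栋',
    '房屋总数': '1124户',
}

def first_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else None

cleaned = {key: first_number(value) for key, value in raw.items()}
print(cleaned)
```

Values without any digits, such as the building type, come back as `None`, so they are easy to leave as text.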


### Getting the community's longitude and latitude

![image.png](./fig/3.png)

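The coordinates are stored in a `span` inside the last tag of the basic information, in its `xiaoqu` attribute. The same steps apply: locate, extract, store.

Longitude and latitude are stored together in one string, so we strip the surrounding brackets and use `split` with `,` as the separator to store them separately.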

In [5]:
area = infos[-1].find('span', class_='actshowMap')['xiaoqu']
print(area)
## strip the brackets, then split into longitude ('经度') and latitude ('纬度')
contents['经度'], contents['纬度'] = area[1:-1].split(',')

[116.411858,39.869364]
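
The slicing and `split` approach yields two strings. Since the printed value happens to be a valid JSON array, an alternative sketch is to parse it with `json.loads`, which returns floats directly. This assumes the attribute always has the `[longitude,latitude]` shape shown above.

```python
import json

area = '[116.411858,39.869364]'  # the attribute value printed above

# Approach used in the notebook: strip the brackets and split, giving strings.
lng_str, lat_str = area[1:-1].split(',')

# Alternative: parse the JSON array, giving floats.
lng, lat = json.loads(area)

print(lng_str, lat_str)
print(lng, lat)
```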


### Wrapping the community scraping steps into one function for reuse

In [6]:
def get_xiaoqu_content(href):
    res = requests.get(href)
    soup = BeautifulSoup(res.text, 'lxml')
    contents = {}

    ## community name
    contents['小区名称'] = soup.find('h1').get_text()

    ## price
    contents['房价'] = soup.find('span', class_="xiaoquUnitPrice").get_text()

    ## basic information
    infos = soup.find('div', class_="xiaoquInfo").find_all('div')
    for info in infos:
        key = info.find('span', class_="xiaoquInfoLabel").get_text()
        value = info.find('span', class_="xiaoquInfoContent").get_text()
        contents[key] = value

    ## longitude ('经度') and latitude ('纬度'), read from the last info tag
    contents['经度'], contents['纬度'] = infos[-1].find('span', class_="actshowMap")['xiaoqu'][1:-1].split(',')

    return contents

In [7]:
## calling the function returns the data for this community
get_xiaoqu_content(href)

{'小区名称': '华龙美晟',
 '建筑年代': '2009年建成 ',
 '建筑类型': '塔楼/塔板结合',
 '开发商': '华龙置业房地产开发有限公司',
 '房价': '64154',
 '房屋总数': '1124户',
 '楼栋总数': '5栋',
 '物业公司': '北京喜莱达物业管理有限公司',
 '物业费用': '2.38元/平米/月',
 '纬度': '39.869364',
 '经度': '116.411858',
 '附近门店': '新天天家园一店/东城天天家园3号楼1层102'}

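## Collecting the links to all communities
With `get_xiaoqu_content` in place, we can get the data for every community as long as we have every community's link, so the next task is to collect those links.

The pages we can open each show one page of one district, such as page 1 of Dongcheng District, so the plan for collecting every community link is:
>get all district links ==> get all page links for each district ==> get all community links on each page ==> call `get_xiaoqu_content` to get the data for every community

### Collecting all community links on one page of a district
We start with the community links shown on a single page, using page 1 of Dongcheng District as the example: https://bj.lianjia.com/xiaoqu/dongcheng/

The screenshot shows that every community link sits in the `href` attribute of an `a` tag inside an `li` of `<ul class="listContent" log-mod="list">`. So we locate the parent tag, the `ul`, find all the `li` tags under it, and loop over them to read each link.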

![image.png](./fig/4.png)

In [8]:
url = 'https://bj.lianjia.com/xiaoqu/dongcheng/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
lis = soup.find('ul', class_='listContent').find_all('li')
for li in lis:
    href = li.find('div', class_='title').find('a')['href']
    print(href)

https://bj.lianjia.com/xiaoqu/1111027375142/
https://bj.lianjia.com/xiaoqu/1111027380930/
https://bj.lianjia.com/xiaoqu/1111027381005/
https://bj.lianjia.com/xiaoqu/1111027380567/
https://bj.lianjia.com/xiaoqu/1111027382485/
https://bj.lianjia.com/xiaoqu/1111027380027/
https://bj.lianjia.com/xiaoqu/1111027382283/
https://bj.lianjia.com/xiaoqu/1111027375280/
https://bj.lianjia.com/xiaoqu/1111027380931/
https://bj.lianjia.com/xiaoqu/1111027380242/
https://bj.lianjia.com/xiaoqu/1111027374680/
https://bj.lianjia.com/xiaoqu/1111027374691/
https://bj.lianjia.com/xiaoqu/1111027382190/
https://bj.lianjia.com/xiaoqu/1111027374707/
https://bj.lianjia.com/xiaoqu/1111027377413/
https://bj.lianjia.com/xiaoqu/1111027382035/
https://bj.lianjia.com/xiaoqu/1111027379543/
https://bj.lianjia.com/xiaoqu/1111027375998/
https://bj.lianjia.com/xiaoqu/1111027374300/
https://bj.lianjia.com/xiaoqu/1111027375430/
https://bj.lianjia.com/xiaoqu/1111027377351/
https://bj.lianjia.com/xiaoqu/1111027382286/
https://bj

In [9]:
## wrap it in a function for reuse
## yield makes this a generator function: iterating over its result (usually with a for loop) produces one link at a time
def get_xiaoqu_one_page(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'lxml')
    lis = soup.find('ul', class_='listContent').find_all('li')
    for li in lis:
        href = li.find('div', class_='title').find('a')['href']
        yield href
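
To see what `yield` changes, the sketch below uses a hypothetical stand-in, `numbered_links`, that yields fake `example.com` links instead of scraping. It shows that calling a generator function runs nothing until the result is iterated.

```python
def numbered_links(n):
    """Hypothetical stand-in for get_xiaoqu_one_page: yields n fake links."""
    for i in range(1, n + 1):
        print('producing link', i)
        yield 'https://example.com/xiaoqu/{}/'.format(i)

gen = numbered_links(3)   # nothing runs yet, so no 'producing' line is printed
first = next(gen)         # runs the body up to the first yield
rest = list(gen)          # consumes the remaining two links

print(first)
print(rest)
```

A generator can only be consumed once; call the function again to iterate a second time.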

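### Collecting all page links for one district
We again use page 1 of Dongcheng District as the example: https://bj.lianjia.com/xiaoqu/dongcheng/

Looking at the link of each page, for example page 2: `https://bj.lianjia.com/xiaoqu/dongcheng/pg2/`, only the number after `pg` changes from page to page. So once we know how many pages the district has, we can build every page link by changing that number.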

![image.png](./fig/5.png)


In [10]:
url = 'https://bj.lianjia.com/xiaoqu/dongcheng/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')

In [11]:
import ast

## total number of pages
total_page_str = soup.find('div', class_='page-box house-lst-page-box')['page-data']
total_page = ast.literal_eval(total_page_str)['totalPage']  ## parses the dict text into a dict without executing code, unlike eval

In [12]:
## build every page link
for i in range(1, int(total_page)+1):
    href = urljoin(url, 'pg'+ str(i))
    print(href)

https://bj.lianjia.com/xiaoqu/dongcheng/pg1
https://bj.lianjia.com/xiaoqu/dongcheng/pg2
https://bj.lianjia.com/xiaoqu/dongcheng/pg3
https://bj.lianjia.com/xiaoqu/dongcheng/pg4
https://bj.lianjia.com/xiaoqu/dongcheng/pg5
https://bj.lianjia.com/xiaoqu/dongcheng/pg6
https://bj.lianjia.com/xiaoqu/dongcheng/pg7
https://bj.lianjia.com/xiaoqu/dongcheng/pg8
https://bj.lianjia.com/xiaoqu/dongcheng/pg9
https://bj.lianjia.com/xiaoqu/dongcheng/pg10
https://bj.lianjia.com/xiaoqu/dongcheng/pg11
https://bj.lianjia.com/xiaoqu/dongcheng/pg12
https://bj.lianjia.com/xiaoqu/dongcheng/pg13
https://bj.lianjia.com/xiaoqu/dongcheng/pg14
https://bj.lianjia.com/xiaoqu/dongcheng/pg15
https://bj.lianjia.com/xiaoqu/dongcheng/pg16
https://bj.lianjia.com/xiaoqu/dongcheng/pg17
https://bj.lianjia.com/xiaoqu/dongcheng/pg18
https://bj.lianjia.com/xiaoqu/dongcheng/pg19
https://bj.lianjia.com/xiaoqu/dongcheng/pg20
https://bj.lianjia.com/xiaoqu/dongcheng/pg21
https://bj.lianjia.com/xiaoqu/dongcheng/pg22
https://bj.lianjia.
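
The page-link construction can be checked without any network access. In the sketch below, the `page_data` string is an assumed sample of what the `page-data` attribute might contain; the real value comes from the page. It also shows why the trailing slash on the district URL matters to `urljoin`.

```python
import ast
from urllib.parse import urljoin

# Assumed sample of the page-data attribute, for illustration only.
page_data = '{"totalPage":3,"curPage":1}'
total_page = ast.literal_eval(page_data)['totalPage']

url = 'https://bj.lianjia.com/xiaoqu/dongcheng/'
pages = [urljoin(url, 'pg' + str(i)) for i in range(1, total_page + 1)]

# Without the trailing slash, urljoin replaces the last path segment instead.
wrong = urljoin('https://bj.lianjia.com/xiaoqu/dongcheng', 'pg2')

print(pages)
print(wrong)
```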

In [13]:
## wrap it in a function
import ast

def get_xiaoqu_all_page(url):
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'lxml')

    ## total number of pages
    total_page_str = soup.find('div', class_='page-box house-lst-page-box')['page-data']
    total_page = ast.literal_eval(total_page_str)['totalPage']  ## parses the dict text without executing code, unlike eval

    ## build every page link
    for i in range(1, int(total_page)+1):
        href = urljoin(url, 'pg'+ str(i))
        yield href

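### Collecting all district links

The approach is the same as for the links on one page: locate the parent tag of the district list and loop over its `a` tags. The difference is that the `href` values here are mostly partial paths rather than complete links, so they still need to be joined with the site's base URL.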

![image.png](./fig/6.png)

In [14]:
url = 'https://bj.lianjia.com/xiaoqu/'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
regions = soup.find('div', {'data-role': 'ershoufang'}).find_all('a')
base_url = 'https://bj.lianjia.com' ## used for joining URLs
for region in regions:
    href = urljoin(base_url, region['href'])
    print(href)

https://bj.lianjia.com/xiaoqu/dongcheng/
https://bj.lianjia.com/xiaoqu/xicheng/
https://bj.lianjia.com/xiaoqu/chaoyang/
https://bj.lianjia.com/xiaoqu/haidian/
https://bj.lianjia.com/xiaoqu/fengtai/
https://bj.lianjia.com/xiaoqu/shijingshan/
https://bj.lianjia.com/xiaoqu/tongzhou/
https://bj.lianjia.com/xiaoqu/changping/
https://bj.lianjia.com/xiaoqu/daxing/
https://bj.lianjia.com/xiaoqu/yizhuangkaifaqu/
https://bj.lianjia.com/xiaoqu/shunyi/
https://bj.lianjia.com/xiaoqu/fangshan/
https://bj.lianjia.com/xiaoqu/mentougou/
https://bj.lianjia.com/xiaoqu/pinggu/
https://bj.lianjia.com/xiaoqu/huairou/
https://bj.lianjia.com/xiaoqu/miyun/
https://bj.lianjia.com/xiaoqu/yanqing/
https://lf.lianjia.com/xiaoqu/yanjiao/
https://lf.lianjia.com/xiaoqu/xianghe/
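
The last two links in the output point to `lf.lianjia.com` rather than the base URL. That is consistent with how `urljoin` behaves: a relative path is attached to the base, while an `href` that is already a complete URL is returned unchanged. The two `href` values below are assumptions inferred from the output, not copied from the page source.

```python
from urllib.parse import urljoin

base_url = 'https://bj.lianjia.com'

# A relative path is joined onto the base URL.
relative = urljoin(base_url, '/xiaoqu/dongcheng/')

# An href that is already absolute replaces the base entirely.
absolute = urljoin(base_url, 'https://lf.lianjia.com/xiaoqu/yanjiao/')

print(relative)
print(absolute)
```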


In [15]:
## wrap it in a function
def get_xiao_regions(url):
    ## use the url argument instead of overwriting it with a hard-coded value
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'lxml')
    regions = soup.find('div', {'data-role': 'ershoufang'}).find_all('a')
    base_url = 'https://bj.lianjia.com' ## used for joining URLs
    for region in regions:
        href = urljoin(base_url, region['href'])

        yield href

## Putting it all together, with added fault tolerance

In [16]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  ## used to join URLs

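def get_html(url):
    ## send a browser-like User-Agent with the request
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
    }
    res = requests.get(url, headers = headers)
    if res.ok:
        soup = BeautifulSoup(res.text, 'lxml')
        return soup
    else:
        ## returns None implicitly; the callers' try/except handles the resulting error
        print('Error!', res.status_code, url)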

    
def get_xiaoqu_content(href):
    try:
        soup = get_html(href)
        contents = {}

        ## community name
        contents['小区名称'] = soup.find('h1').get_text()

        ## price
        contents['房价'] = soup.find('span', class_="xiaoquUnitPrice").get_text()

        ## basic information
        infos = soup.find('div', class_="xiaoquInfo").find_all('div')
        for info in infos:
            key = info.find('span', class_="xiaoquInfoLabel").get_text()
            value = info.find('span', class_="xiaoquInfoContent").get_text()
            contents[key] = value

        ## longitude ('经度') and latitude ('纬度'), read from the last info tag
        contents['经度'], contents['纬度'] = infos[-1].find('span', class_="actshowMap")['xiaoqu'][1:-1].split(',')

        return contents
    except Exception:
        print('Failed to scrape community data:', href)
        
def get_xiaoqu_one_page(url):
    try:
        soup = get_html(url)
        lis = soup.find('ul', class_='listContent').find_all('li')
        for li in lis:
            href = li.find('div', class_='title').find('a')['href']
            yield href
    except Exception:  ## a bare except would also swallow GeneratorExit
        print('Failed to get the community links on this page:', url)
        
import ast  ## used to parse the page-data attribute

def get_xiaoqu_all_page(url):
    try:
        soup = get_html(url)

        ## total number of pages
        total_page_str = soup.find('div', class_='page-box house-lst-page-box')['page-data']
        total_page = ast.literal_eval(total_page_str)['totalPage']  ## parses the dict text without executing code, unlike eval

        ## build every page link
        for i in range(1, int(total_page)+1):
            href = urljoin(url, 'pg'+ str(i))

            yield href
    except Exception:
        print('Failed to get the page links:', url)
        
def get_xiao_regions(url):
    try:
        soup = get_html(url)
        regions = soup.find('div', {'data-role': 'ershoufang'}).find_all('a')
        base_url = 'https://bj.lianjia.com' ## used for joining URLs
        for region in regions:
            href = urljoin(base_url, region['href'])

            yield href
    except Exception:
        print('Failed to get the district links:', url)
        
def get_xiaoqu_all_url(url):
    for url_region in get_xiao_regions(url):
        for url_page in get_xiaoqu_all_page(url_region):
            for url_xiaoqu in get_xiaoqu_one_page(url_page):
                yield url_xiaoqu
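
The functions above report a failure and move on. Another possible form of fault tolerance, not used in this notebook, is to retry a failed call a few times before giving up. The sketch below is generic: `with_retries` and `flaky` are hypothetical names, and `flaky` only simulates a request that fails twice.

```python
import time

def with_retries(func, attempts=3, delay=1.0):
    """Call func() up to `attempts` times, sleeping `delay` seconds after a failure.

    Returns func()'s result, or None if every attempt raised an exception.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as exc:
            print('attempt', attempt, 'failed:', exc)
            if attempt < attempts:
                time.sleep(delay)
    return None

# Hypothetical flaky function: fails twice, then succeeds.
calls = {'count': 0}

def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('temporary failure')
    return 'ok'

result = with_retries(flaky, attempts=3, delay=0)
print(result, calls['count'])
```

In the scraper, the call being retried could be the `requests.get` call, which raises an exception on connection problems.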

In [None]:
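data = []
url = 'https://bj.lianjia.com/xiaoqu/'

for href in get_xiaoqu_all_url(url):
    contents = get_xiaoqu_content(href)
    if contents is not None:  ## get_xiaoqu_content returns None when scraping fails
        data.append(contents)
        print('Scraped:', href)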
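
Once `data` is filled, one way to keep the results is to write them to a CSV file. The sketch below is an assumption about a possible next step, not part of the original notebook. It uses two sample records (the second is made up and lacks a key) and writes to an in-memory buffer; the file name in the comment is hypothetical.

```python
import csv
import io

# Two sample records; the second is hypothetical and lacks a key on purpose.
data = [
    {'小区名称': '华龙美晟', '房价': '64154', '楼栋总数': '5栋'},
    {'小区名称': 'Example', '房价': '50000'},
]

# Collect every key that appears, in first-seen order, for the header row.
fieldnames = []
for row in data:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

# For a real file, replace the buffer with something like:
# open('xiaoqu.csv', 'w', newline='', encoding='utf-8-sig')
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
writer.writeheader()
writer.writerows(data)

csv_text = buffer.getvalue()
print(csv_text)
```

Building the header from all records matters because different communities may expose different fields; `restval=''` leaves the missing cells empty.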