A-BigTree/hotel_Crawling

@author: Shuxin_Wang

@time: 2023.03.24


A software engineering course project needed Booking hotel data: hotel information plus images. In the end the crawl produced 2 GB+ of data, covering basic information for 10,000+ hotels and 80,000+ hotel images. Because the dataset is large (and I'd rather not risk legal trouble 🥲), it is not published here. If you are interested or need it, you can crawl it yourself with this code 😀 (friendly reminder: it's a lot of data and won't finish quickly).

Example URLs:

Query by region name

Query by dest_id


1 Directory layout

|--config (configuration folder)
|	|--backup (manual backup folder)
|	|
|	`--cookie.txt (Cookie setting)
|	`--page_num.csv (city information records)
|	`--user-agent.txt (User-Agent setting)
|
`--params_setting.py (parameter settings)
`--crawling_process.py (crawling functions)
`--pretreatment.py (preprocessing)
`--Main.py (main entry point)
|
|--data (crawled data)
|	`--README.md
|	`--hotels.csv (hotel detail-page links) [name, link, region, page offset, cover image link, hotel index]
|	`--image_error.csv (cover-image download failure log) [hotel index, cover image link]
|	`--province.csv (province table) [province, html index]
|	`--city.csv (city table) [province, city, html index, city postal code]
|	`--county.csv (county table) [province, city, county, html index, county postal code]
|	`--pre_city.csv (preprocessed county table) [province, city, county, province, city, county]
|	|
|	|--backup (manual backup folder)
|	|
|	|--image (hotel cover images)
|	|	`--0.jpg
|	|	`--...
|	|
|	|--info (hotel details)
|	|	`--info.csv (hotel detail records) [index, name, province, region, full address, score, image count, star rating]
|	|	`--address.csv (normalized hotel addresses) [index, province, city, county]
|	|	`--desc.csv (hotel descriptions) [index, description]
|	|	`--image.csv (hotel detail-image links) [index, sub-index, image link]
|	|	`--image_error.csv (detail-image download failures) [index, sub-index, image link]
|	|	|
|	|	|--html (saved HTML files for later offline parsing; 10+ GB, reserve enough disk space)
|	|	|	`--0.txt
|	|	|	`--1.txt
|	|	|	`--...
|	|	|
|	|	|--image (hotel detail images)
|	|	|	|--0 (images for hotels 0-999)
|	|	|	|	`--0_0.jpg
|	|	|	|	`--0_1.jpg
|	|	|	|	`--...
|	|	|	|
|	|	|	|--1000 (images for hotels 1000-1999)
|	|	|	|	`--1000_0.jpg
|	|	|	|	`--...
|	|	|	|
|	|	|	|--...
|	|	|	|
|	|	|	|--10000 (images for hotels 10000 and up)
|	|	|	|	`--10000_0.jpg
|	|	|	|	`--...

  • The data folder is not included in the repo — create it yourself first, otherwise the code will error out.
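Since the code assumes these folders exist, a minimal bootstrap sketch (the helper name `init_data_dirs` is hypothetical, not part of the repo; the paths are taken from the layout above):

```python
import os

# Subfolders of the data tree described above (leaf file folders only;
# intermediate directories are created implicitly by makedirs).
DATA_DIRS = [
    "data/backup",
    "data/image",
    "data/info/html",
    "data/info/image",
]


def init_data_dirs(root: str = ".") -> None:
    """Create the data directories if they do not exist yet."""
    for d in DATA_DIRS:
        os.makedirs(os.path.join(root, d), exist_ok=True)
```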

2 Configuration files

Request header configuration

For the files cookie.txt and user-agent.txt: open one of the example URLs above in a browser, open the Network panel of the developer tools, and copy the request's Cookie 🍪 and User-Agent values into the corresponding files. If you need to send additional headers, add them to PARAMS_REQUEST in params_setting.py:

PARAMS_REQUEST = {
    "User-Agent": None,
    "Cookie": None  # add params here
}
"""Parameters of request for *Booking*"""

Also add a txt file with the **matching name** to the config folder; it is read automatically at initialization:

def init_params_request():
    """Init the parameters used in requests"""
    for key in PARAMS_REQUEST.keys():
        try:
            with open("config/" + key.lower() + ".txt", encoding='utf-8') as f:
                # Re-encode so non-ASCII bytes survive as latin-1 header values
                PARAMS_REQUEST[key] = f.readline().encode('utf-8').decode('latin1')
        except Exception as e:
            print(e)
            raise RuntimeError("Init process error.")

City information configuration

The city configuration lives in config/page_num.csv, whose columns are:

city name | number of hotel result pages | Booking's destination ID (dest_id)

Don't ask how the IDs were obtained (I'm definitely not admitting they were tried out province by province 🥲) — so I can't guarantee they won't expire; use them while they last.

Before the program starts running, the city information is automatically loaded into CITY_INFO in params_setting.py:

CITY_INFO = {
    "city_num": 0,
    "city_list": []
}
"""The dict of cities' information"""


def init_city_dict():
    """Init city dict using .csv file"""
    city_list = read_csv(file_name="config/page_num.csv")
    CITY_INFO['city_num'] = len(city_list)
    for info in city_list:
        CITY_INFO['city_list'].append({
            "name": info[0],
            "page_num": info[1],
            "dest_id": info[2]
        })
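The `read_csv` call here is one of the project's own helpers and is not shown in this README; a minimal stand-in sketch, assuming it skips the header row and returns each row as a list of strings:

```python
import csv
from typing import List


def read_csv(file_name: str, skip_header: bool = True) -> List[List[str]]:
    """Minimal sketch of the read_csv helper assumed by init_city_dict:
    read every data row of a CSV file into a list of string lists."""
    with open(file_name, encoding="utf-8") as f:
        rows = list(csv.reader(f))
    return rows[1:] if skip_header else rows
```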

3 URL settings

Base request URL

The base URL for requests is URL_BOOKING in params_setting.py:

URL_BOOKING = "https://www.booking.cn/searchresults.zh-cn.html"
"""URL using in query"""

All query pages are produced by varying the request parameters on this base URL.

Parameters for querying by city name (unstable, not recommended)

PARAMS_URL_CITY_NAME = {
    "aid": 3976,  # visitor ID (maybe); update as needed
    "ss": None,  # city name
    "lang": "zh-cn",  # language
    "sb": 1,
    "src_elem": "sb",
    "src": "index",
    "group_adults": 2,  # number of adults
    "no_rooms": 1,  # number of rooms
    "group_children": 0,  # number of children
    "sb_travel_purpose": "leisure",  # purpose of the stay
    "offset": 0  # page offset (usually 25 results per page -> (page - 1) * 25)
}
"""Parameters using city name in query URL"""

Querying directly by city name sometimes returns odd, unstable results, so this method is not recommended.

Parameters for querying by city ID (recommended)

The parameters mean the same as above; each city's dest_id is stored in the city information dict CITY_INFO. See "Parameters for querying by city name" above for the field explanations.

PARAMS_URL_CITY_ID = {
    "aid": 3976,  # visitor ID; update as needed
    "lang": "zh-cn",
    "sb": 1,
    "src_elem": "sb",
    "src": "searchresults",
    "dest_id": None,  # city ID
    "dest_type": "region",
    "group_adults": 2,
    "no_rooms": 1,
    "group_children": 0,
    "sb_travel_purpose": "leisure",
    "offset": 0  # page offset
}
"""Parameters using city ID in query URL"""
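Putting URL_BOOKING and these parameters together, pagination works by stepping offset in multiples of 25, as noted in the parameter comments. A sketch (`build_query_url` is a hypothetical helper, not a function from the repo):

```python
from urllib.parse import urlencode

URL_BOOKING = "https://www.booking.cn/searchresults.zh-cn.html"


def build_query_url(params: dict, dest_id: str, page: int) -> str:
    """Fill in dest_id and the page offset (25 results per page),
    then append the encoded query string to the base URL."""
    p = dict(params)  # copy so the shared template is not mutated
    p["dest_id"] = dest_id
    p["offset"] = (page - 1) * 25
    return URL_BOOKING + "?" + urlencode(p)
```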

4 Parsing with XPath

I'm used to parsing pages with XPath; the generic request and parsing helpers are as follows:

from typing import List, Union

import requests
from requests import Response
from lxml import etree
from lxml.etree import _Element

from params_setting import PARAMS_REQUEST


# ----------Request and analysis
def get_response(url: str, params: dict = None, headers: dict = PARAMS_REQUEST) -> Response:
    """Get `Response` Object using URL and Request parameters"""
    try:
        response = requests.get(url=url,
                                params=params,
                                headers=headers)
    except Exception as e:
        print(e)
        raise RuntimeError("Request error.")
    if response.status_code != 200:
        raise RuntimeError("Bad status code: %d" % response.status_code)
    return response


def get_html_element(text: str) -> _Element:
    """Get `etree._Element` Object from a text"""
    return etree.HTML(text)


def get_html_element_by_file(file: str, encoding="utf-8") -> _Element:
    """Get `etree._Element` Object from a file"""
    try:
        with open(file, "r", encoding=encoding) as f:
            text = f.read()
    except Exception as e:
        print(e)
        raise RuntimeError("Read file: %s error" % file)
    return get_html_element(text)


def xpath_analysis(xpath_data: _Element, xpath_: Union[str, List[str]]) -> dict:
    """Analysis `etree._Element` Object using list of xpath"""
    if isinstance(xpath_, str):
        xpath_ = [xpath_]
    results = dict()
    for xp in xpath_:
        result = xpath_data.xpath(xp)
        results[xp] = result
    return results


def xpath_analysis_by_text(text: str, xpath_: Union[str, List[str]]) -> dict:
    """Analysis article from string using list of xpath"""
    xpath_data = get_html_element(text)
    return xpath_analysis(xpath_data, xpath_)


def xpath_analysis_by_response(response: Response, xpath_: Union[str, List[str]]) -> dict:
    """Analysis article from response using list of xpath"""
    response.encoding = "utf-8"
    return xpath_analysis_by_text(response.text, xpath_)


def xpath_analysis_by_file(file: str, xpath_: Union[str, List[str]], encoding: str = "utf-8") -> dict:
    """Analysis article from file using list of xpath"""
    return xpath_analysis(get_html_element_by_file(file, encoding), xpath_)
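The helpers above can be exercised without any network access. A small usage sketch on an inline HTML snippet (the snippet and XPath are illustrative only; the function is re-declared here so the example is self-contained, with the same shape as the helper above):

```python
from typing import List, Union

from lxml import etree


def xpath_analysis_by_text(text: str, xpath_: Union[str, List[str]]) -> dict:
    """Map each XPath expression to its list of matches in the HTML text."""
    if isinstance(xpath_, str):
        xpath_ = [xpath_]
    root = etree.HTML(text)
    return {xp: root.xpath(xp) for xp in xpath_}


sample = "<h3><a href='/hotel/foo.html'><div>Foo Hotel</div></a></h3>"
result = xpath_analysis_by_text(sample, ["//h3/a/div/text()", "//h3/a/@href"])
```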

Parsing the search results page

# Number of hotel result pages for a city
XPATH_HOTEL_PAGE_NUM = "//*[@id='search_results_table']/div[2]/div/div/div[4]/div[2]/nav/div/div[2]/ol//li/button/text()"
"""The number of hotels' page"""

# Title summarizing the city's results
XPATH_HOTEL_PAGE_TITLE = "//*[@id='right']/div[1]/div/div/div/h1/text()"
"""The title in this page"""

# List of hotel names on this page
XPATH_HOTEL_PAGE_NAME = "//*[@id='search_results_table']/div[2]/div/div/div[3]//div/div[1]/div[2]/div/div/div/div[1]/div/div[1]/div/h3/a/div[1]/text()"
"""The name of hotel in the page"""

# List of hotel detail-page links on this page
XPATH_HOTEL_PAGE_HREF = "//*[@id='search_results_table']/div[2]/div/div/div[3]//div/div[1]/div[2]/div/div/div/div[1]/div/div[1]/div/h3/a/@href"
"""The link of each hotel on the page"""

# List of cover-image links on this page
XPATH_HOTEL_PAGE_IMAGE = "//*[@id='search_results_table']/div[2]/div/div/div[3]//div/div[1]/div[1]/div/a/img/@src"
"""The image of hotel in the page"""
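The name, link, and cover-image lists come back index-aligned, so one page's rows for data/hotels.csv can be assembled by zipping them. A sketch (`rows_for_hotels_csv` is a hypothetical helper illustrating the column order, not a repo function):

```python
from typing import List


def rows_for_hotels_csv(names: List[str], hrefs: List[str], images: List[str],
                        city: str, offset: int, start_index: int = 0) -> List[list]:
    """Zip the index-aligned XPath result lists into hotels.csv rows:
    [name, link, region, page offset, cover image link, index]."""
    return [[n, h, city, offset, img, start_index + i]
            for i, (n, h, img) in enumerate(zip(names, hrefs, images))]
```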

Parsing the hotel detail page

# Hotel star rating
XPATH_HOTEL_STAR = "//*[@id='hp_hotel_name']/span/span[2]/div/span/div/span//span"
"""The star of hotel"""

# City where the hotel is located
XPATH_HOTEL_CITY = "//*[@id='breadcrumb']/ol//li/div/a/text()"
"""The city of the hotel"""

# Hotel name
XPATH_HOTEL_NAME = "//*[@id='hp_hotel_name']/div/h2/text()"
"""The name of hotel"""

# Full hotel address
XPATH_HOTEL_ADDRESS = "//*[@id='showMap2']/span/text()"
"""The address of the hotel"""

# Hotel score
XPATH_HOTEL_POINT = "//*[@id='js--hp-gallery-scorecard']/a/div/div/div/div[1]/text()"
"""The point of the hotel"""

# List of hotel image links
XPATH_HOTEL_IMAGES = "//*[@id='hotel_main_content']//a/img/@src"
"""The images of the hotel"""

# Hotel description
XPATH_HOTEL_DESC = "//*[@id='property_description_content']//p/text()"
"""The description of the hotel"""

5 Running the crawl

Getting text information

  1. Collect all hotels' detail-page links and cover-image links from the search pages, number them, and store them in data/hotels.csv:
  • get_all_city_hotel()

Columns of data/hotels.csv:

name | detail-page link | province | offset | cover image link | index

  2. Crawl every hotel's detail page for the basic information; descriptions go to data/info/desc.csv, image links to data/info/images.csv, and everything else to data/info/info.csv (this takes a very, very long time — be patient):
  • get_all_hotel_info(): if it errors out, adjust the start position INDEX_START and resume crawling.

Columns of data/info/desc.csv:

index | description

Columns of data/info/images.csv:

index | image link

Columns of data/info/info.csv:

index | name | province | region | full address | score | image count | star rating

Image download

The generic image download function is:

def get_image_from_url(image_url: str, file_name: str):
    """Get image from a image url"""
    try:
        response = get_response(image_url)
    except Exception as e:
        print(e)
        raise RuntimeError("Get image error")
    if response.status_code != 200:
        raise RuntimeError
    try:
        with open(file_name + ".jpg", "wb") as f:
            f.write(response.content)
    except Exception as e:
        print(e)
        raise RuntimeError("Write image error")
  1. Download the cover images into the data/image/ folder:
  • get_page_image()
  2. Download all hotel images; since there are many (70k+), this uses **multithreading**, with every 1000 hotels grouped into one folder:
  • get_all_images(): the number of threads and the batch can be changed inside this function (adjust the image folders accordingly)
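The actual get_all_images() lives in crawling_process.py; the batching-plus-threading idea can be sketched as follows (`batch_folder` and `download_all` are illustrative names, not the repo's functions; the folder scheme follows the layout in section 1):

```python
import os
from concurrent.futures import ThreadPoolExecutor


def batch_folder(index: int) -> str:
    """Map a hotel index to its image folder: 1000 hotels per folder,
    with everything from 10000 up sharing the last folder."""
    return str(min(index // 1000 * 1000, 10000))


def download_all(tasks, fetch, n_threads: int = 8, root: str = "data/info/image"):
    """tasks: iterable of (hotel_index, sub_index, image_url) tuples.
    fetch: a callable(url, file_name) that writes file_name + '.jpg',
    e.g. get_image_from_url above. Submits one download per task to a
    thread pool and waits for all of them to finish."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for idx, sub, url in tasks:
            folder = os.path.join(root, batch_folder(idx))
            os.makedirs(folder, exist_ok=True)
            pool.submit(fetch, url, os.path.join(folder, "%d_%d" % (idx, sub)))
```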
