# BS4, XPath, re

## BS4

BeautifulSoup4 [tutorial](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/)

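1. Beautiful Soup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.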

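2. Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object. These objects fall into four kinds (`Tag`, `NavigableString`, `BeautifulSoup`, and `Comment`); the basic elements you work with on a tag are: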

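+ Tag: a tag, the most basic unit of information; it opens with `<>` and closes with `</>`

+ Name: the name of the tag; the name of `<p>...</p>` is `'p'`. Access: `<tag>.name`

+ Attributes: the attributes of the tag, organized as a dictionary. Access: `<tag>.attrs`

+ NavigableString: the non-attribute string inside a tag, that is, the text between `<>` and `</>`. Access: `<tag>.string`

+ Comment: the comment part of a string inside a tag; `Comment` is a special type of `NavigableString`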
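
As a rough illustration of the first four elements, the sketch below parses a small inline snippet instead of fetching a page. The markup and the variable names are made up for this example and are not part of the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample markup, used only for this illustration.
html = '<p class="title" id="intro">Hello</p>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.p          # Tag object for the first <p>
name = tag.name       # the tag's name, a plain str
attrs = tag.attrs     # the tag's attributes, a dict
text = tag.string     # the text inside the tag, a NavigableString

print(type(tag).__name__, name)
print(attrs)
print(type(text).__name__, repr(text))
```

Note that `class` comes back as a list because HTML allows several class names on one tag, while `id` stays a plain string.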

In [1]:
# Import the bs4 library
from bs4 import BeautifulSoup
import requests # used to fetch the page
r = requests.get('https://python123.io/ws/demo.html') # demo URL
demo = r.text # the fetched page text
demo

'<html><head><title>This is a python demo page</title></head>\r\n<body>\r\n<p class="title"><b>The demo python introduces several python courses.</b></p>\r\n<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a href="http://www.icourse163.org/course/BIT-268001" class="py1" id="link1">Basic Python</a> and <a href="http://www.icourse163.org/course/BIT-1001870001" class="py2" id="link2">Advanced Python</a>.</p>\r\n</body></html>'

In [2]:
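# Parse the HTML page
soup = BeautifulSoup(demo, 'html.parser') # arguments: the fetched page text; the parser bs4 should use
# Print the parsed HTML page with indentation that shows its nesting
print(soup.prettify())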

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>


**1) Tags: access a tag with `soup.<tag>`:**

+ When the HTML document contains several `<tag>` elements of the same kind, `soup.<tag>` returns the first one

In [3]:
soup.a # access the first <a> tag

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

In [4]:
soup.title

<title>This is a python demo page</title>
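
Because `soup.<tag>` only returns the first match, reaching the other tags of the same kind needs a search method such as `find_all`. The following is a minimal sketch on an inline snippet; the `example.com` links are placeholders rather than the demo page's real URLs.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two <a> tags; the hrefs are placeholders.
html = ('<p><a href="http://example.com/1" id="link1">Basic Python</a> and '
        '<a href="http://example.com/2" id="link2">Advanced Python</a></p>')
soup = BeautifulSoup(html, 'html.parser')

first_link = soup.a              # only the first <a>
all_links = soup.find_all('a')   # every <a>, in document order
missing = soup.table             # no <table> in the document, so this is None

print(first_link['id'])
print([a['id'] for a in all_links])
print(missing)
```

Checking for `None` before using the result of `soup.<tag>` avoids an `AttributeError` when the tag is absent.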

**2) Tag names: every `<tag>` has a name, available as `soup.<tag>.name`; it is a string**

In [5]:
soup.a.name

'a'

In [6]:
soup.a.parent.name

'p'

In [7]:
soup.p.parent.name

'body'

**3) Tag attributes: a `<tag>` can have zero or more attributes, stored as a dictionary in `soup.<tag>.attrs`**

In [8]:
tag = soup.a
print(tag.attrs)
print(tag.attrs['class'])
print(type(tag.attrs))

{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
['py1']
<class 'dict'>
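
Besides going through `.attrs`, a tag can be indexed like a dictionary. The sketch below shows the usual access patterns on an inline `<a>` tag; the second class name `external` is added here only to show the multi-valued case and does not appear on the demo page.

```python
from bs4 import BeautifulSoup

# Inline copy of the first link, with a hypothetical extra class name.
html = ('<a class="py1 external" '
        'href="http://www.icourse163.org/course/BIT-268001" '
        'id="link1">Basic Python</a>')
soup = BeautifulSoup(html, 'html.parser')
tag = soup.a

href = tag['href']           # same as tag.attrs['href']; raises KeyError if missing
classes = tag['class']       # multi-valued attribute, returned as a list
target = tag.get('target')   # .get returns None for an absent attribute
has_id = tag.has_attr('id')  # membership test without fetching the value

print(href)
print(classes)
print(target, has_id)
```

Using `.get` is the safer choice when an attribute may not be present on every tag.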


**4) NavigableString: the non-attribute string inside a tag, accessed as `soup.<tag>.string`; `.string` can reach across several nesting levels**

In [10]:
print(soup.a.string)
print(type(soup.a.string))

Basic Python
<class 'bs4.element.NavigableString'>
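
The "several nesting levels" behavior has a limit worth seeing: `.string` follows a chain of single children, but returns `None` as soon as a tag has more than one child. The sketch below uses a shortened, hypothetical version of the demo page's two paragraphs.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the demo page's two <p> tags.
html = ('<p class="title"><b>The demo python introduces several python courses.</b></p>'
        '<p class="course">Python is wonderful. <a href="#">Basic Python</a></p>')
soup = BeautifulSoup(html, 'html.parser')
title_p, course_p = soup.find_all('p')

nested = title_p.string          # <p> has one child <b>, so .string reaches into it
ambiguous = course_p.string      # <p> has two children (text and <a>), so None
all_text = course_p.get_text()   # concatenates every string below the tag

print(repr(nested))
print(ambiguous)
print(repr(all_text))
```

When a tag mixes text and child tags, `get_text()` is usually the more convenient way to read its content.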


**5) Comment: the comment part of a string inside a tag; `Comment` is a special type of `NavigableString` (written `<!-- ... -->` in the markup)**

In [11]:
print(type(soup.p.string))

<class 'bs4.element.NavigableString'>
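
The demo page contains no comments, so the cell above only shows the ordinary case. As a small sketch of the other case, the snippet below uses made-up markup in which one tag holds a comment and another holds normal text.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical markup: <b> holds a comment, <p> holds ordinary text.
html = '<b><!--This is a comment--></b><p>This is not a comment</p>'
soup = BeautifulSoup(html, 'html.parser')

b_string = soup.b.string
p_string = soup.p.string

# Printing the value alone drops the <!-- --> markers, so both look like text.
print(b_string)
print(p_string)

# The type is what tells them apart.
print(type(b_string).__name__, type(p_string).__name__)
print(isinstance(b_string, Comment), isinstance(p_string, Comment))
```

Since the printed value of a comment looks like normal text, code that must skip comments should check `isinstance(value, Comment)` explicitly.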


**6) `.prettify()` adds a `'\n'` after each HTML tag and its content, producing indented output that shows the nesting**

**`.prettify()` also works on a single tag: `<tag>.prettify()`**

In [12]:
print(soup.prettify())

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>


In [13]:
print(soup.a.prettify())

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
 Basic Python
</a>



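**7) bs4 converts any HTML input to Unicode and writes output as UTF-8 by default**

**Python 3.x strings are Unicode and its default source encoding is UTF-8, so non-ASCII text parses without extra steps**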

In [14]:
newsoup = BeautifulSoup('<a>中文</a>', 'html.parser')
print(newsoup.prettify())

<a>
 中文
</a>
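
To see the conversion at work, the sketch below feeds bs4 bytes in a non-UTF-8 encoding. The choice of GBK and the explicit `from_encoding` argument are assumptions made for this illustration; without that hint bs4 would try to detect the encoding itself.

```python
from bs4 import BeautifulSoup

# Hypothetical input: the same markup as above, but as GBK-encoded bytes.
raw = '<a>中文</a>'.encode('gbk')
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')

text = soup.a.string                # decoded into a Unicode string
utf8_bytes = soup.encode('utf-8')   # serialized back out as UTF-8 bytes

print(text)
print(type(utf8_bytes).__name__)
```

Passing `from_encoding` is only needed when the source encoding is known in advance or when detection guesses wrong.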


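### 3. Traversing HTML content with bs4

Basic HTML structure: `<>...</>` pairs nest inside one another, forming a tree of tags

+ Downward traversal of the tag tree

    `.contents`: a list of child nodes; every direct child of `<tag>` is stored in the list

    `.children`: an iterator over child nodes; similar to `.contents`, intended for looping over the direct children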
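
A minimal sketch of the two attributes follows, using a small inline `<body>` rather than the demo page. The markup is hypothetical; the newline characters in it are deliberate, because whitespace between tags also counts as a child node.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup; the '\n' characters become text nodes of their own.
html = '<body>\n<p class="title">A</p>\n<p class="course">B</p></body>'
soup = BeautifulSoup(html, 'html.parser')
body = soup.body

child_list = body.contents   # a list: '\n', <p>, '\n', <p>
print(len(child_list))

# .children yields the same nodes one at a time; keep only the Tag objects.
tag_names = [child.name for child in body.children if isinstance(child, Tag)]
print(tag_names)
```

Filtering with `isinstance(child, Tag)` is one simple way to skip the whitespace strings when only the child tags matter.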