# Chapter 1

## Fetching and Parsing HTML

In [1]:
from urllib.request import urlopen
# urllib is a package in Python's standard library for working with URLs
html = urlopen('http://pythonscraping.com/pages/page1.html')
content = html.read()
print(content.decode('utf-8'))

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>
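
The cell above hard-codes UTF-8 when decoding. As a minimal sketch, assuming a hypothetical helper named `decode_body` (not part of urllib), the declared charset could be tried first with UTF-8 as a fallback. The sample bytes are made up so that the example runs without a network connection.

```python
def decode_body(raw, charset=None):
    """Decode response bytes, preferring the declared charset when one is given."""
    for encoding in (charset, 'utf-8'):
        if encoding is None:
            continue
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name, or bytes that do not match it: try the next candidate.
            continue
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD.
    return raw.decode('utf-8', errors='replace')


# With a live response this could be called as (not executed here):
#   decode_body(html.read(), html.headers.get_content_charset())
print(decode_body('<h1>caf\u00e9</h1>'.encode('latin-1'), 'latin-1'))
```

The fallback order is a design choice for this sketch; a stricter scraper might prefer to fail loudly instead of replacing bytes.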



In [2]:
from bs4 import BeautifulSoup

html = urlopen('http://pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html.parser')  # .read() is optional here; BeautifulSoup also accepts the response object directly
print(bs.h1)

<h1>An Interesting Title</h1>


In [3]:
# The following are different ways of reaching the same element
print(bs.html.body.h1) # type: ignore
print(bs.body.h1) # type: ignore
print(bs.html.h1) # type: ignore
print(bs.h1)

<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>
<h1>An Interesting Title</h1>
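
All four paths agree because attribute access on a tag behaves like a search for the first matching descendant. The sketch below illustrates this on an inline document; the markup (with two `h1` tags) is an assumption made for the example, not the page fetched above.

```python
from bs4 import BeautifulSoup

# Inline markup standing in for a downloaded page (an assumption for this sketch).
markup = ('<html><head><title>T</title></head>'
          '<body><div><h1>First</h1></div><h1>Second</h1></body></html>')
bs = BeautifulSoup(markup, 'html.parser')

# Attribute access is shorthand for find(): it returns the first match
# anywhere below the starting tag, not only among direct children.
first = bs.h1
print(first.get_text())
print(bs.html.body.h1 is first, bs.body.h1 is first, bs.html.h1 is first)

# find_all() is needed to reach the later matches.
print([h.get_text() for h in bs.find_all('h1')])
```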


In [4]:
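# Creating a BeautifulSoup object takes two arguments:
# the first is the content of the HTML document, the second is the parser type.
# html.parser is Python's built-in HTML parser; it needs no installation and works in most cases.
html = urlopen('http://pythonscraping.com/pages/page1.html')  # fetch again: a response body can only be read once
bs = BeautifulSoup(html.read(), 'html.parser')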

In [5]:
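# Compared with html.parser, the lxml parser is faster and more capable, but it must be installed separately
# lxml tolerates and repairs some problems, such as unclosed tags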
html = urlopen('http://pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'lxml')
bs

<html>
<head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>
</html>

In [6]:
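# Another commonly used parser is html5lib, which can parse HTML5 documents
# It is the most capable of the three and copes with even messier HTML, but it is slower and must be installed separately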
html = urlopen('http://pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(), 'html5lib')
bs

<html><head>
<title>A Useful Page</title>
</head>
<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>


</body></html>
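
On a well-formed page the three parsers produce nearly the same tree. The differences show up on broken markup. The sketch below, which assumes a made-up malformed fragment, runs whichever of the three parsers are installed and prints each result side by side; parsers that are missing are reported instead of raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed markup: no <html>/<body> wrapper and an unclosed <p>.
broken = '<p>Unclosed paragraph<li>item'

results = {}
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # lxml and html5lib are optional third-party packages.
        results[parser] = None

for parser, output in results.items():
    print(f'{parser:12} {output if output is not None else "(not installed)"}')
```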

## Exception Handling

In [7]:
html = urlopen('http://pythonscraping.com/pages/page1.html')

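The line above can raise two main kinds of exception:
+ The page does not exist on the server (or an error occurs while retrieving it): an HTTPError is raised
+ The server cannot be reached: a URLError is raised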
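
Because HTTPError is a subclass of URLError, the more specific handler has to come first. The following is a minimal sketch of that ordering, assuming a hypothetical helper `describe_failure` whose `opener` parameter exists only so the two cases can be simulated without a network connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def describe_failure(url, opener=urlopen):
    """Return None on success, or a short label naming the failure."""
    try:
        response = opener(url)
    except HTTPError as e:
        # The server answered, but with an error status such as 404.
        return f'HTTPError {e.code}'
    except URLError as e:
        # No usable answer at all, e.g. the host name could not be resolved.
        return f'URLError: {e.reason}'
    response.close()
    return None


# Stand-in openers that simulate the two failures (assumptions for the demo).
def fake_missing_page(url):
    raise HTTPError(url, 404, 'Not Found', None, None)


def fake_missing_server(url):
    raise URLError('server not found')


print(issubclass(HTTPError, URLError))
print(describe_failure('http://example.invalid/page', fake_missing_page))
print(describe_failure('http://example.invalid/page', fake_missing_server))
```

If the two `except` clauses were swapped, the URLError branch would also catch every HTTPError and the status code would never be reported.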

In [8]:
from urllib.request import urlopen
from urllib.error import HTTPError

try:
    html = urlopen('http://pythonscraping.com/pages/page111.html')
except HTTPError as e:
    print(e)

print(html)  # html still holds the response from the previous cell; the failed urlopen never reassigned it

HTTP Error 404: Not Found
<http.client.HTTPResponse object at 0x000001657A8ADF90>


In [9]:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    html = urlopen('http://pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print('HTTPError')
except URLError as e:   # disconnect from the network to trigger URLError
    print('URLError')

print(html)

<http.client.HTTPResponse object at 0x000001657BC67D00>


In [None]:
print(bs.nonExistentTag)    # accessing a tag that does not exist returns None
print(bs.nonExistentTag.someTag)    # accessing a child tag of None raises an AttributeError



None


AttributeError: 'NoneType' object has no attribute 'someTag'

In [16]:
# Handle both failure modes separately: an AttributeError from chaining on None, and a None result
try:
    badContent = bs.nonExistentTag.someTag # type: ignore
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)

Tag was not found
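
An alternative to catching AttributeError is to check for None at every step of the chain. This is a sketch only: `find_nested` is a hypothetical helper, not a BeautifulSoup function, and the inline markup is an assumption so the example runs offline.

```python
from bs4 import BeautifulSoup


def find_nested(root, *names):
    """Follow a chain of tag names; return None as soon as one is missing."""
    node = root
    for name in names:
        if node is None:
            return None
        node = node.find(name)
    return node


bs = BeautifulSoup('<html><body><h1>An Interesting Title</h1></body></html>', 'html.parser')
print(find_nested(bs, 'body', 'h1'))
print(find_nested(bs, 'nonExistentTag', 'someTag'))
```

Both failure modes collapse into a single None result, so the caller needs only one check.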




## Summary

In [None]:
# The code as a whole can be organized as follows:
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitle(url):
    # Handle errors raised while fetching the HTML
    try:
        html = urlopen(url)
    except HTTPError:
        return None
    except URLError:
        return None

    # Handle errors raised while navigating the parsed document
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1 # type: ignore
    except AttributeError:
        return None

    return title


url = 'http://pythonscraping.com/pages/page1.html'
title = getTitle(url)
if title is None:
    print('Title could not be found')
else:
    print(title)

Title could not be found
