# Chapter 11: Scraping Information from the Web

#### 江川
#### 20190721

### A first look at the webbrowser and requests modules

In [1]:
"""
Open a web page with the open() function of the webbrowser module
author@jiangchuan20190721
"""
import webbrowser
webbrowser.open('http://inventwithpython.com')

True

In [4]:
#!/usr/bin/env python3
# mapIt.py - Launches a map in the browser using an address from the command line or clipboard.
import webbrowser, sys, pyperclip
if len(sys.argv) > 1:             # At least one argument follows the script name
    # Get address from command line
    address = ' '.join(sys.argv[1:])
else:
    # Get address from clipboard
    address = pyperclip.paste()

webbrowser.open('https://www.google.com/maps/place/' + address)

True
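
mapIt.py appends the address to the URL unchanged, so spaces and commas reach the browser as typed. Most browsers tolerate this, but the URL can also be percent-encoded first. The sketch below is one way to do that with the standard library; `build_map_url` and `MAPS_PREFIX` are hypothetical names introduced for illustration, not part of the original program.

```python
from urllib.parse import quote

# Assumed URL prefix, matching the one used in mapIt.py
MAPS_PREFIX = 'https://www.google.com/maps/place/'

def build_map_url(address):
    """Return a map URL with the address percent-encoded."""
    # quote() turns spaces into %20 and commas into %2C
    return MAPS_PREFIX + quote(address.strip())

map_url = build_map_url('870 Valencia St, San Francisco, CA 94110')
print(map_url)
```

The result can be passed to `webbrowser.open()` in place of the plain string concatenation.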

In [8]:
"""
Download a web page with the get() function of the requests module
author@jiangchuan20190721
"""

import requests
res = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')
print(type(res))
print(res.status_code == requests.codes.ok)      # status_code tells whether the request succeeded
print(len(res.text))
print(res.text[:1000])                # Show the first 1000 characters

<class 'requests.models.Response'>
True
179378
﻿The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare


*******************************************************************
THIS EBOOK WAS ONE OF PROJECT GUTENBERG'S EARLY FILES PRODUCED AT A
TIME WHEN PROOFING METHODS AND TOOLS WERE NOT WELL DEVELOPED. THERE
IS AN IMPROVED EDITION OF THIS TITLE WHICH MAY BE VIEWED AS EBOOK
(#100) at https://www.gutenberg.org/ebooks/100
*******************************************************************


This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org/license


Title: Romeo and Juliet

Author: William Shakespeare

Posting Date: May 25, 2012 [EBook #1112]
Release Date: November, 1997  [Etext #1112]

Language: English


*** START OF THIS PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***











In [9]:
"""
Checking for errors
author@jiangchuan20190721
"""

import requests
res = requests.get('http://xiadade.com/buzhidaodea')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))


There was a problem: 404 Client Error: Not Found for url: http://xiadade.com/buzhidaodea


In [10]:
import requests
res = requests.get('http://xiadade.com/buzhidaodea')
res.raise_for_status()

HTTPError: 404 Client Error: Not Found for url: http://xiadade.com/buzhidaodea
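
The two cells above show that `raise_for_status()` raises an `HTTPError` for a failed request. A narrower handler can catch that exception type alone instead of every `Exception`. The following is a minimal sketch: `describe_failure` is a hypothetical helper, and the hand-built `Response` object with the `example.com` URL is only a stand-in for a real download so that the cell runs without network access.

```python
import requests

def describe_failure(res):
    """Return an error message for a failed response, or None on success."""
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return 'There was a problem: %s' % exc
    return None

# Stand-in for a real download: a Response filled in by hand (assumption
# made for illustration; normally requests.get() returns this object)
fake = requests.models.Response()
fake.status_code = 404
fake.reason = 'Not Found'
fake.url = 'http://example.com/missing'

message = describe_failure(fake)
print(message)
```

Catching only `HTTPError` lets unrelated bugs, such as a misspelled variable name, surface as normal tracebacks.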

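### Steps for downloading and saving a file:
1. Call requests.get() to download the file
2. Call open() with 'wb' to create a new file in write-binary mode
3. Loop over the Response object's iter_content() method
4. Call write() on each iteration to write the content to the file
5. Call close() to close the file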

In [11]:
"""
Save a downloaded file to the hard drive
author@jiangchuan20190721
"""

import requests
res = requests.get('http://www.gutenberg.org/cache/epub/1112/pg1112.txt')

# Check for errors
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))

# Open a new file in write-binary mode
playFile = open('RomeoAndJuliet.txt', 'wb')

# iter_content() returns one chunk of the content on each iteration of the loop.
# Each chunk is of the bytes data type; the argument sets how many bytes a chunk holds (here, 100,000)
for chunk in res.iter_content(100000):
    playFile.write(chunk)
playFile.close()
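
As a variation on the write loop above, the file can be opened in a `with` statement, which closes it even if a write fails partway through. This is a sketch under stated assumptions: `save_chunks` is a hypothetical helper, and the short list of bytes objects stands in for `res.iter_content(100000)` so that the cell runs offline.

```python
import os
import tempfile

def save_chunks(chunks, filename):
    """Write an iterable of bytes objects to filename; return the byte count."""
    total = 0
    # 'wb' opens the file in write-binary mode; the with block closes it
    with open(filename, 'wb') as f:
        for chunk in chunks:
            total += f.write(chunk)
    return total

# Stand-in data; with requests this would be res.iter_content(100000)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
written = save_chunks([b'Romeo ', b'and ', b'Juliet'], demo_path)
print(written)
```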

### Creating a BeautifulSoup object from HTML

In [3]:
"""
Parse HTML with the BeautifulSoup module
author@jiangchuan20190721
"""

import requests, bs4
res = requests.get('http://baidu.com')
res.raise_for_status()
noStarchSoup = bs4.BeautifulSoup(res.text, 'html.parser')   # Name the parser explicitly to avoid a warning
print(type(noStarchSoup))

<class 'bs4.BeautifulSoup'>


In [4]:
import bs4
exampleFile = open('example.html')
exampleSoup = bs4.BeautifulSoup(exampleFile, 'html.parser')
print(type(exampleSoup))

<class 'bs4.BeautifulSoup'>


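### Finding elements with the select() method
```
Selector passed to select()         Will match

soup.select('div')                  All elements named <div>
soup.select('#author')              The element with an id attribute of author
soup.select('.notice')              All elements that use a CSS class attribute named notice
soup.select('div span')             All <span> elements anywhere within a <div> element
soup.select('div > span')           All <span> elements directly within a <div> element, with no other element in between
soup.select('input[name]')          All elements named <input> that have a name attribute with any value
soup.select('input[type="button"]') All elements named <input> that have a type attribute with the value button
```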
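
To see the selectors in the table at work, the sketch below runs each one against a small inline HTML string and counts the matches. The markup is made up for illustration and is not the `example.html` file used elsewhere in this chapter.

```python
import bs4

# Made-up markup for illustration only
html = """
<div id="main">
  <span id="author">Al Sweigart</span>
  <p class="notice"><span>nested</span></p>
  <input name="q" type="text">
  <input type="button" value="Go">
</div>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

selectors = ['div', '#author', '.notice', 'div span', 'div > span',
             'input[name]', 'input[type="button"]']

# Map each selector to the number of elements it matches
counts = {sel: len(soup.select(sel)) for sel in selectors}
for sel, n in counts.items():
    print(sel, n)
```

Here `'div span'` matches both `<span>` elements, while `'div > span'` matches only the one that is a direct child of the `<div>`.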

In [7]:
import bs4
exampleFile = open('example.html')
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
elems = exampleSoup.select('#author')                 # Find the element with id="author" in the example HTML

print(elems)
print(type(elems))
print(len(elems))
print(type(elems[0]))
print(elems[0].getText())                        # getText() returns the element's text, or inner HTML (the content between the opening and closing tags)
print(str(elems[0]))                             # Passing the element to str() returns a string with the opening and closing tags and the element's text
print(elems[0].attrs)                            # attrs gives a dictionary containing the element's attribute 'id' and its value 'author'

[<span id="author">Al Sweigart</span>]
<class 'list'>
1
<class 'bs4.element.Tag'>
Al Sweigart
<span id="author">Al Sweigart</span>
{'id': 'author'}


In [8]:
import bs4
exampleFile = open('example.html')
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
pElems = exampleSoup.select('p')

print(str(pElems[0]))
print(pElems[0].getText())
print(str(pElems[1]))
print(pElems[1].getText())
print(str(pElems[2]))
print(pElems[2].getText())

<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
Download my Python book from my website.
<p class="slogan">Learn Python the easy way!</p>
Learn Python the easy way!
<p>By <span id="author">Al Sweigart</span></p>
By Al Sweigart


### Getting data from an element's attributes

In [10]:
import bs4
soup = bs4.BeautifulSoup(open('example.html'), 'html.parser')
spanElem = soup.select('span')[0]

print(str(spanElem))
print(spanElem.get('id'))
print(spanElem.get('some_nonexistent_addr') is None)
print(spanElem.attrs)

<span id="author">Al Sweigart</span>
author
True
{'id': 'author'}


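### “I'm Feeling Lucky” Google search


#### What the program does:
- Gets the search keywords from the command line arguments
- Retrieves the search results page
- Opens a browser tab for each result

#### This means the code needs to:
- Read the command line arguments from sys.argv
- Fetch the search results page with the requests module
- Find the link to each search result
- Call the webbrowser.open() function to open the web browser

Open a new file editor window and save it as lucky.py.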
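
The steps listed above can be sketched as follows. This is a rough outline under stated assumptions, not a finished lucky.py: the search URL and the `.r a` CSS selector are guesses about Google's results markup, which changes over time, and the names `build_search_url`, `extract_links`, and `lucky` are hypothetical.

```python
import webbrowser
from urllib.parse import quote_plus, urljoin

import bs4
import requests

SEARCH_URL = 'https://www.google.com/search?q='   # assumed search endpoint
RESULT_SELECTOR = '.r a'                          # assumed selector; adjust to the real page

def build_search_url(terms):
    """Join the keywords and encode them into a search URL."""
    return SEARCH_URL + quote_plus(' '.join(terms))

def extract_links(html, base_url, limit=5):
    """Return up to `limit` absolute result links found in the page."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    links = []
    for elem in soup.select(RESULT_SELECTOR):
        href = elem.get('href')
        if href:
            links.append(urljoin(base_url, href))
    return links[:limit]

def lucky(terms):
    """Fetch the results page and open each result in a browser tab."""
    url = build_search_url(terms)
    res = requests.get(url)
    res.raise_for_status()
    for link in extract_links(res.text, url):
        webbrowser.open(link)
```

In a script, `lucky(sys.argv[1:])` (after `import sys`) would pass the command line keywords to the search. Keeping the URL building and link extraction in separate functions makes them easy to try on a saved HTML page before touching the network.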

