# The urllib library

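urllib is the most basic network request library in Python. It can imitate a browser: it sends requests to a given server and captures the data the server returns.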

## Modules in urllib

1. urllib.request: opens and reads URLs
2. urllib.error: defines the exceptions raised by urllib.request
3. urllib.parse: parses URLs
4. urllib.robotparser: parses robots.txt files

## Common functions in urllib.request

### The urlopen function

Signature and parameters:

```python
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
```

In Python 3, every urllib function for making requests to a site lives in the `urllib.request` module. Some examples follow:

In [1]:
from urllib import request

'''
    Open a URL
'''

# urlopen returns an http.client.HTTPResponse object, which behaves like a file handle
req = request.urlopen('http://localhost/get') # http://www.httpbin.org/get

print('Status code:', req.getcode())
print('\n--------------------separator--------------------\n')
text = req.read()
print('Content read (not decoded):\n', text)
print('\n--------------------separator--------------------\n')
print('Content read (decoded as utf-8):\n', text.decode('utf-8'))

Status code: 200

--------------------separator--------------------

Content read (not decoded):
 b'{\n  "args": {}, \n  "headers": {\n    "Accept-Encoding": "identity", \n    "Connection": "close", \n    "Host": "localhost", \n    "User-Agent": "Python-urllib/3.5"\n  }, \n  "origin": "172.17.0.1", \n  "url": "http://localhost/get"\n}\n'

--------------------separator--------------------

Content read (decoded as utf-8):
 {
  "args": {}, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Host": "localhost", 
    "User-Agent": "Python-urllib/3.5"
  }, 
  "origin": "172.17.0.1", 
  "url": "http://localhost/get"
}
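
The cell above never closes the response. As a minimal sketch, the response can also be used as a context manager, and the charset can be read from the response headers instead of being hard-coded. The local server, `_EchoHandler`, `fetch_text` and `demo` are hypothetical stand-ins for the httpbin service, not part of the original notebook.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request


class _EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for httpbin: echoes the request path as JSON."""

    def do_GET(self):
        payload = ('{"url": "%s"}' % self.path).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_text(url, timeout=5):
    # The with-block closes the connection even if read() raises
    with request.urlopen(url, timeout=timeout) as resp:
        # Fall back to utf-8 when the server does not declare a charset
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.status, resp.read().decode(charset)


def demo():
    # Port 0 lets the OS pick a free port for the throwaway server
    server = HTTPServer(("127.0.0.1", 0), _EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_text("http://127.0.0.1:%d/get" % server.server_port)
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns the status code and the decoded body as a tuple.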



In [2]:
from urllib import parse
from urllib import request

'''
    Set the data parameter
'''
data = bytes(parse.urlencode({"message":"Hello, httpbin. This is from urllib","date":"2019-07-25"}), encoding='utf8')
# https://httpbin.org/post
response = request.urlopen("http://localhost/post", data=data)
print(response.read().decode('utf-8'))

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "date": "2019-07-25", 
    "message": "Hello, httpbin. This is from urllib"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "61", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "localhost", 
    "User-Agent": "Python-urllib/3.5"
  }, 
  "json": null, 
  "origin": "172.17.0.1", 
  "url": "http://localhost/post"
}
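
The `data` argument has to be bytes; `urlopen` rejects a `str` body with a `TypeError`. The sketch below only builds the body from the cell above, without sending anything, to show where the `Content-Length` of 61 in the output comes from. The names `fields`, `encoded` and `body` are illustrative.

```python
from urllib import parse

fields = {"message": "Hello, httpbin. This is from urllib", "date": "2019-07-25"}

# urlencode returns a str: spaces become '+', the comma becomes '%2C'
encoded = parse.urlencode(fields)

# urlopen needs bytes, so encode the str before passing it as data
body = encoded.encode("utf-8")

print(encoded)
print(len(body))
```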



In [3]:
from urllib import request

'''
    Set the timeout parameter
'''

res = request.urlopen("https://baidu.com", timeout=1)
print('Get code:',res.getcode(),'. No Exception')

Get code: 200 . No Exception


In [4]:
from urllib import request, error
import socket

try:
    response = request.urlopen("https://httpbin.org", timeout=0.2)
except error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print("Exception: ",e.reason)

Exception:  _ssl.c:630: The handshake operation timed out
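
A timeout does not always arrive wrapped in `URLError`: a timeout while waiting for the response can surface as a bare `socket.timeout`. The following is a sketch that handles both cases. `status_or_timeout` and `demo` are hypothetical helpers, and the listening socket that never answers is only there to provoke a timeout without touching the network.

```python
import socket
from urllib import error, request


def status_or_timeout(url, timeout):
    try:
        with request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except error.URLError as e:
        # Timeout while connecting or sending: wrapped in URLError
        if isinstance(e.reason, socket.timeout):
            return "timeout"
        raise
    except socket.timeout:
        # Timeout while waiting for the response: raised directly
        return "timeout"


def demo():
    # A socket that listens but never accepts or replies
    silent = socket.socket()
    silent.bind(("127.0.0.1", 0))
    silent.listen(1)
    try:
        port = silent.getsockname()[1]
        return status_or_timeout("http://127.0.0.1:%d/" % port, timeout=0.3)
    finally:
        silent.close()
```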


### Response

In [7]:
from urllib import request

# Response type
res = request.urlopen('http://localhost')
print("Response type:", type(res))

# Status code
print('\nStatus code:', res.status)

# Response headers
print("\nResponse headers:")
print(res.getheader('Server'))
print(res.getheaders())

Response type: <class 'http.client.HTTPResponse'>

Status code: 200

Response headers:
gunicorn/19.9.0
[('Server', 'gunicorn/19.9.0'), ('Date', 'Thu, 25 Jul 2019 08:09:13 GMT'), ('Connection', 'close'), ('Content-Type', 'text/html; charset=utf-8'), ('Content-Length', '9593'), ('Access-Control-Allow-Origin', '*'), ('Access-Control-Allow-Credentials', 'true')]


In [6]:
from urllib import request, parse

url = "http://localhost/post"
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362',
    'Host':'localhost'
}
data = {
    'title':'For more test',
    'date':'2019-07-25',
    'message':'Go to the world',
    'day_of_outing' : 10
}
byte_data = bytes(parse.urlencode(data), encoding='utf8')
req = request.Request(url, data=byte_data,headers=headers, method='POST')
res = request.urlopen(req)
print(res.read().decode('utf-8'))

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "date": "2019-07-25", 
    "day_of_outing": "10", 
    "message": "Go to the world", 
    "title": "For more test"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "76", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "localhost", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362"
  }, 
  "json": null, 
  "origin": "172.17.0.1", 
  "url": "http://localhost/post"
}
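
A `Request` object can be inspected before it is sent, which helps when checking which method and headers will go out. A small sketch, with no network access; the `/put` path and the `demo-client/0.1` user agent are assumptions made for illustration.

```python
from urllib import parse, request

form = parse.urlencode({"title": "For more test"}).encode("utf-8")

# Without data the method defaults to GET; with data it defaults to POST
get_req = request.Request("http://localhost/get")
post_req = request.Request(
    "http://localhost/post",
    data=form,
    headers={"User-Agent": "demo-client/0.1"},  # hypothetical UA string
)
# An explicit method overrides the default
put_req = request.Request("http://localhost/put", data=form, method="PUT")

print(get_req.get_method(), post_req.get_method(), put_req.get_method())

# Header names are stored capitalized, so look them up as 'User-agent'
print(post_req.get_header("User-agent"))
print(post_req.full_url, post_req.data)
```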



### Handler 

In [None]:
# Proxy settings
from urllib import  request

proxy_handler = request.ProxyHandler({
    'http':'192.168.88.4:1080',
    'https':'192.168.88.4:1080'
})
opener = request.build_opener(proxy_handler)
res = opener.open('https://httpbin.org/')
print(res.read())
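
`build_opener` merges the handlers it is given with a set of default handlers. The sketch below builds an opener without sending a request and lists what it contains; the `demo-client/0.1` user agent is an assumption.

```python
from urllib import request

proxy_handler = request.ProxyHandler({"http": "http://192.168.88.4:1080"})
opener = request.build_opener(proxy_handler)

# Headers in addheaders are sent with every request made through this opener
opener.addheaders = [("User-Agent", "demo-client/0.1")]  # hypothetical UA

# The custom ProxyHandler replaces the default one; the rest are defaults
handler_names = [type(h).__name__ for h in opener.handlers]
print(handler_names)
```

Passing the opener to `request.install_opener` would make plain `urlopen` calls use it as well.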

### Cookie

In [8]:
import http.cookiejar, urllib.request
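'''
Read the cookies that the site stores on the client
'''
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
# Send the request through the opener so the cookie handler sees the response;
# a plain urlopen() call bypasses it and leaves the jar empty
response = opener.open("https://weibo.com")
for item in cookie:
    print(item.name+"="+item.value)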


In [9]:
import http.cookiejar, urllib.request

'''
Save the cookies to a file
'''
filename="cookie.txt"
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
# Use the opener, not urlopen(), so the cookies land in the jar
response = opener.open("https://weibo.com")
cookie.save(ignore_discard=True,ignore_expires=True)

In [10]:
import http.cookiejar, urllib.request

'''
Load the saved cookies
'''
filename = "cookie.txt"
cookie = http.cookiejar.LWPCookieJar(filename)
# Load the previously saved cookies from the file
cookie.load(filename, ignore_discard=True,ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("https://weibo.com")
cookie.save(ignore_discard=True,ignore_expires=True)
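
The `ignore_discard=True` flag matters because session cookies are marked as discardable and are skipped on save by default. The sketch below shows this with a hand-built cookie and a temporary file, so it runs without network access. `make_session_cookie`, `saved_names` and the cookie values are hypothetical.

```python
import http.cookiejar
import os
import tempfile


def make_session_cookie(name, value, domain):
    # expires=None together with discard=True marks a session cookie
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def saved_names(ignore_discard):
    path = os.path.join(tempfile.mkdtemp(), "cookie.txt")
    jar = http.cookiejar.LWPCookieJar(path)
    jar.set_cookie(make_session_cookie("session_id", "abc123", "weibo.com"))
    jar.save(ignore_discard=ignore_discard)

    # Reload into a fresh jar to see what actually reached the file
    restored = http.cookiejar.LWPCookieJar(path)
    restored.load(ignore_discard=True)
    return [c.name for c in restored]


print(saved_names(ignore_discard=True))
print(saved_names(ignore_discard=False))
```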

### Exception handling

In [11]:
from urllib import request,error

try:
    request.urlopen('http://asdasdasdsadasdm.abs')
except error.URLError as e:
    print(e.reason)
    
try:
    request.urlopen('https://heymax.site/about')
except error.URLError as e:
    print(e.reason)

[Errno 11001] getaddrinfo failed
Not Found


In [12]:
from urllib import request,error

try:
    request.urlopen('https://heymax.site/article')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request successful')

Not Found

404

Server: nginx/1.14.2
Date: Thu, 25 Jul 2019 08:09:29 GMT
Content-Type: text/html
Content-Length: 0
Connection: close
ETag: "5c6cb500-0"




In [13]:
from urllib import request,error

# Catch both kinds of exception, HTTPError and URLError
# The site exists, but the requested page does not
try:
    request.urlopen('https://heymax.site/article')
except error.HTTPError as e:
    print(type(e.reason))
    
try:
    request.urlopen('https://heymax.site/article')
except error.URLError as e:
    print(type(e.reason))


# The site does not exist
# The reason is an instance of 'socket.gaierror'
try:
    request.urlopen('https://heymasdsad.asde/article')
except error.URLError as e:
    print(type(e.reason))
    
# The site exists; catch the timeout exception
# A timeout reason is an instance of 'socket.timeout'
try:
    request.urlopen('https://heymax.site', timeout=0.1)
except error.URLError as e:
    print(type(e.reason))

<class 'str'>
<class 'str'>
<class 'socket.gaierror'>
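
`HTTPError` is a subclass of `URLError`, which is why the second `try` block above also catches the 404. The sketch below classifies the errors seen in these cells; `describe` is a hypothetical helper and the exceptions are constructed by hand so that it runs offline.

```python
import socket
from urllib import error


def describe(exc):
    # HTTPError must be tested first because it is also a URLError
    if isinstance(exc, error.HTTPError):
        return "http %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        if isinstance(exc.reason, socket.timeout):
            return "timeout"
        if isinstance(exc.reason, socket.gaierror):
            return "dns failure"
        return "url error: %s" % exc.reason
    raise TypeError("not a urllib error")


not_found = error.HTTPError("https://heymax.site/article", 404, "Not Found", {}, None)
print(describe(not_found))
print(describe(error.URLError(socket.gaierror(11001, "getaddrinfo failed"))))
print(describe(error.URLError(socket.timeout("timed out"))))
```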


## Common functions in urllib.parse (URL parsing)

### urlparse

```python
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
```

In [14]:
from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html;user=5#comment")
print(result)

ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user=5', query='', fragment='comment')


In [15]:
from urllib.parse import urlparse

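# scheme: the protocol
# If the URL has no scheme, the value passed as scheme is used as the default;
# a scheme already present in the URL takes precedence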
result = urlparse("www.baidu.com/index.html;user=5#comment",scheme="https")
print(result)

result = urlparse("http://www.baidu.com/index.html;user=5#comment", scheme="https")
print(result)

ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user=5', query='', fragment='comment')
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user=5', query='', fragment='comment')
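
In the first result above `netloc` is empty and the host name ends up in `path`, because `urlparse` only recognizes a network location that starts with `//`. A short sketch of the difference; the port 8080 and the `wd=python` query are assumptions added for illustration.

```python
from urllib.parse import urlparse

no_slashes = urlparse("www.baidu.com/index.html", scheme="https")
with_slashes = urlparse("//www.baidu.com:8080/index.html?wd=python", scheme="https")

# Without '//' the whole string is treated as a path
print(repr(no_slashes.netloc), no_slashes.path)

# With '//' the host is recognized, and hostname/port are split out
print(with_slashes.netloc, with_slashes.hostname, with_slashes.port)

# geturl() reassembles the parts, including the default scheme
print(with_slashes.geturl())
```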


In [16]:
from urllib.parse import urlparse

# When allow_fragments is False, fragment is left empty
result = urlparse("www.baidu.com/index.html;user=5#comment",scheme="https", allow_fragments=False)
print(result)

ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user=5#comment', query='', fragment='')


In [17]:
from urllib.parse import urlparse

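# When allow_fragments is False, the '#...' text is not split off;
# it stays attached to the last non-empty component before it.
# If the URL has a query, the text ends up in query
# Otherwise it ends up in params or, if those are empty too, in path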
result = urlparse("https://www.baidu.com/index.html#comment", allow_fragments=False)
print(result)

ParseResult(scheme='https', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')


### urlunparse

Builds a URL from a tuple or list.

In [18]:
from urllib.parse import urlunparse

# The tuple or list must contain exactly 6 items
url_block = ('http','www.baidu.com','api','user','name=jeryy','comment')
print(urlunparse(url_block))

http://www.baidu.com/api;user?name=jeryy#comment
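
`urlunparse` is the inverse of `urlparse`, so the two can be combined to change one part of a URL. A minimal sketch; the replacement values `https` and `name=tom` are made up for illustration.

```python
from urllib.parse import urlparse, urlunparse

url = "http://www.baidu.com/api;user?name=jeryy#comment"
parts = urlparse(url)

# Round trip: unparsing the parsed result gives the original URL back
round_trip = urlunparse(parts)

# ParseResult is a named tuple, so _replace returns a modified copy
updated = urlunparse(parts._replace(scheme="https", query="name=tom"))

# Any length other than 6 fails when the items are unpacked
try:
    urlunparse(("http", "www.baidu.com", "api"))
    too_short_ok = True
except ValueError:
    too_short_ok = False

print(round_trip)
print(updated)
```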


### urljoin

Joins a base URL with another URL. Components missing from the second URL are taken from the base.

In [19]:
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com','https://heymax.site'))
print(urljoin('https://heymax.site/about-me','http://www.baidu.com'))
print(urljoin('http://www.baidu.com/about','https://heymax.site/about-me'))
print(urljoin('http://www.baidu.com/about','https://heymax.site/about-me?categray=happy'))
print(urljoin('http://www.baidu.com/about?word=asbc','https://heymax.site/about-me'))
print(urljoin('https://heymax.site/about-me','?categray=education'))
print(urljoin('https://heymax.site/#today','?categray=education'))
print(urljoin('heymax.site','?categray=education'))
print(urljoin('heymax.site','?categray=education#comment'))


https://heymax.site
http://www.baidu.com
https://heymax.site/about-me
https://heymax.site/about-me?categray=happy
https://heymax.site/about-me
https://heymax.site/about-me?categray=education
https://heymax.site/?categray=education
heymax.site?categray=education
heymax.site?categray=education#comment
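
The cell above joins mostly absolute URLs and query strings. The more common use is resolving relative links, as a browser would. A sketch with a few cases; the `/blog/post` paths are hypothetical.

```python
from urllib.parse import urljoin

base = "https://heymax.site/blog/post"

# A relative path replaces the last segment of the base path
sibling = urljoin(base, "about")

# A trailing slash on the base keeps that segment as a directory
child = urljoin(base + "/", "about")

# '..' climbs one directory level
parent = urljoin(base + "/", "../about")

# A leading '/' resolves against the site root
rooted = urljoin(base, "/about")

# A leading '//' keeps only the scheme of the base
other_host = urljoin(base, "//www.baidu.com/s")

for joined in (sibling, child, parent, rooted, other_host):
    print(joined)
```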


### urlencode

URL-encodes data; it can turn a dict into GET query parameters.

In [20]:
from urllib.parse import urlencode

params = {
    'name':'Jerry',
    'age':33
}
base_url = 'https://heymax.site/lookup?'
url = base_url + urlencode(params)
print(url)

https://heymax.site/lookup?age=33&name=Jerry
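
A few `urlencode` options are worth knowing beyond the basic case. The sketch below covers list values, space handling, non-ASCII text and decoding with `parse_qs`; the sample keys and values are assumptions. It relies on dicts keeping insertion order (Python 3.7+), unlike the 3.5 output above.

```python
from urllib.parse import parse_qs, quote, urlencode

# Sequence values: doseq=True emits one key=value pair per element
multi = urlencode({"name": "Jerry", "tag": ["a", "b"]}, doseq=True)

# Spaces become '+' by default; quote_via=quote switches to '%20'
plus = urlencode({"q": "hello world"})
percent = urlencode({"q": "hello world"}, quote_via=quote)

# Non-ASCII text is UTF-8 encoded, then percent-escaped
chinese = urlencode({"wd": "中文"})

# parse_qs reverses the encoding; every value comes back as a list
decoded = parse_qs(multi)

print(multi, plus, percent, chinese, decoded, sep="\n")
```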


## urllib.robotparser

The urllib.robotparser module provides a single class, `RobotFileParser`, which reads and parses a robots.txt file and answers questions about it.

In [None]:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.musi-cal.com/robots.txt")
rp.read()  # fetches and parses the file; returns None
# rrate = rp.request_rate("*")
# rrate.requests
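
The cell above needs network access. As a sketch, the same class can be fed rules directly through `parse`, which makes the query methods easy to try out. The rules in `robots_lines` are invented for illustration and are not the real robots.txt of that site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, supplied as the lines of a robots.txt file
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
    "Request-rate: 3/20",
]

rp = RobotFileParser()
rp.parse(robots_lines)

# can_fetch answers whether a user agent may request a URL
allowed = rp.can_fetch("*", "http://www.musi-cal.com/index.html")
blocked = rp.can_fetch("*", "http://www.musi-cal.com/private/data.html")

# crawl_delay and request_rate return None when the rule is absent
delay = rp.crawl_delay("*")
rate = rp.request_rate("*")

print(allowed, blocked, delay, rate.requests, rate.seconds)
```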