<a id='HOME'></a>
# CHAPTER 9 The Web, Untangled
## Anatomy of the Web

* [9.1 Clients](#Clients)
* [9.2 Servers](#Server)
* [9.3 Services and Automation](#ServicesAutomation)

---
<a id='Clients'></a>
## 9.1 Clients
[Back to contents](#HOME)

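A web browser is one kind of network client, and Python can act as a client too, for example to fetch files.
Every HTTP response carries a three-digit status code; the hundreds digit gives the broad category:

* 1xx: informational
* 2xx: success
* 3xx: redirection
* 4xx: client error
* 5xx: server error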
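
The hundreds digit can be read off with integer division. The following is a minimal sketch that uses the standard library's `http.HTTPStatus` for the reason phrases; the `status_category` helper and its `CATEGORIES` table are illustrative names, not part of any library.

```python
from http import HTTPStatus

# Broad categories keyed by the hundreds digit of the status code.
CATEGORIES = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}

def status_category(code):
    """Return the broad category for a three-digit HTTP status code."""
    return CATEGORIES.get(code // 100, 'unknown')

for code in (200, 301, 404, 500):
    # HTTPStatus(code).phrase is the standard reason phrase for the code.
    print(code, HTTPStatus(code).phrase, status_category(code), sep=' | ')
```

Codes outside the 1xx to 5xx range fall through to `'unknown'` instead of raising, which is a design choice of this sketch.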

In [1]:
# Standard library

import urllib.request as ur
url = 'http://nrl.iis.sinica.edu.tw/LASS/last-all-airbox.json'
conn = ur.urlopen(url)
print(conn)

data = conn.read()
print(data)

<http.client.HTTPResponse object at 0x000001F89F4C1E80>
b'{"num_of_records": 267, "original-data-source": "Edimax Technolgy", "source": "last-all-airbox by IIS-NRL", "version": "2016-09-20T09:15:01Z", "original-project-website": "http://airbox.edimaxcloud.com/", "feeds": [{"gps_lat": 23.741, "gps_num": 9, "s_t0": 33.12, "SiteName": "\\u9f8d\\u5b89\\u570b\\u5c0f", "timestamp": "2016-09-20T09:13:23Z", "gps_lon": 120.755, "s_d0": 32, "s_h0": 66, "device_id": "28C2DDDD415C"}, {"gps_lat": 22.373, "gps_num": 9, "s_t0": 25.62, "SiteName": "74DA3895C540", "timestamp": "2016-09-20T09:12:34Z", "gps_lon": 114.109, "s_d0": 32, "s_h0": 54, "device_id": "74DA3895C540"}, {"gps_lat": 25.062, "gps_num": 9, "s_t0": 27.5, "SiteName": "74DA3895C5AA", "timestamp": "2016-09-20T07:26:33Z", "gps_lon": 121.451, "s_d0": 22, "s_h0": 67, "device_id": "74DA3895C5AA"}, {"gps_lat": 25.062, "gps_num": 9, "s_t0": 28, "SiteName": "74DA3895C280", "timestamp": "2016-09-20T07:27:14Z", "gps_lon": 121.451, "s_d0": 25, "s_h0
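
`conn.read()` returns raw bytes, as the `b'...'` prefix shows, so the payload still has to be decoded and parsed before its fields can be used. The following is a minimal sketch with the standard `json` module; the `sample` bytes are a hand-shortened stand-in modelled on the output above (one feed, UTF-8 assumed), not the real response.

```python
import json

# Hand-shortened stand-in for conn.read(); the real payload has many feeds.
sample = (b'{"num_of_records": 1, "feeds": [{"SiteName": '
          b'"\\u9f8d\\u5b89\\u570b\\u5c0f", "s_t0": 33.12, "s_h0": 66, '
          b'"device_id": "28C2DDDD415C"}]}')

# Decode the bytes (UTF-8 is assumed here), then parse into dicts and lists.
payload = json.loads(sample.decode('utf-8'))

for feed in payload['feeds']:
    # json.loads turns the \uXXXX escapes into real characters.
    print(feed['device_id'], feed['SiteName'], feed['s_t0'])
```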

In [2]:
print(conn.getheader('Content-Type'))

print('===========================================================================\n')

for key, value in conn.getheaders():
    print(key, value, sep='：')

application/json

Date：Tue, 20 Sep 2016 09:19:51 GMT
Server：Apache/2.4.17 (FreeBSD) PHP/5.6.14 mod_perl/2.0.9 Perl/v5.20.3
Last-Modified：Tue, 20 Sep 2016 09:15:02 GMT
ETag："c7aa-53cecdd0b3180"
Accept-Ranges：bytes
Content-Length：51114
Connection：close
Content-Type：application/json
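
Header values arrive as plain strings. Date-valued headers such as `Date` and `Last-Modified` can be parsed with `email.utils.parsedate_to_datetime` from the standard library. The following is a minimal sketch that uses the two values from the listing above as literals instead of a live response; the `headers` dict is a stand-in for `dict(conn.getheaders())`.

```python
from email.utils import parsedate_to_datetime

# Values copied from the header listing above.
headers = {
    'Date': 'Tue, 20 Sep 2016 09:19:51 GMT',
    'Last-Modified': 'Tue, 20 Sep 2016 09:15:02 GMT',
}

fetched = parsedate_to_datetime(headers['Date'])
modified = parsedate_to_datetime(headers['Last-Modified'])

# How old the resource was at the moment it was fetched.
age = fetched - modified
print(modified.isoformat(), age.total_seconds())
```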


In [3]:
# Third-party library: requests

import requests
url = 'http://nrl.iis.sinica.edu.tw/LASS/last-all-airbox.json'
resp = requests.get(url)
resp

print(resp)
print(resp.text)

<Response [200]>
{"num_of_records": 267, "original-data-source": "Edimax Technolgy", "source": "last-all-airbox by IIS-NRL", "version": "2016-09-20T09:15:01Z", "original-project-website": "http://airbox.edimaxcloud.com/", "feeds": [{"gps_lat": 23.741, "gps_num": 9, "s_t0": 33.12, "SiteName": "\u9f8d\u5b89\u570b\u5c0f", "timestamp": "2016-09-20T09:13:23Z", "gps_lon": 120.755, "s_d0": 32, "s_h0": 66, "device_id": "28C2DDDD415C"}, {"gps_lat": 22.373, "gps_num": 9, "s_t0": 25.62, "SiteName": "74DA3895C540", "timestamp": "2016-09-20T09:12:34Z", "gps_lon": 114.109, "s_d0": 32, "s_h0": 54, "device_id": "74DA3895C540"}, {"gps_lat": 25.062, "gps_num": 9, "s_t0": 27.5, "SiteName": "74DA3895C5AA", "timestamp": "2016-09-20T07:26:33Z", "gps_lon": 121.451, "s_d0": 22, "s_h0": 67, "device_id": "74DA3895C5AA"}, {"gps_lat": 25.062, "gps_num": 9, "s_t0": 28, "SiteName": "74DA3895C280", "timestamp": "2016-09-20T07:27:14Z", "gps_lon": 121.451, "s_d0": 25, "s_h0": 64, "device_id": "74DA3895C280"}, {"gps_la

---
<a id='Server'></a>
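## 9.2 Servers
[Back to contents](#HOME)

Besides acting as a client, Python can also run as a web server. The standard library ships a simple one that serves the files in the current directory: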

```shell
# Default port: 8000
python -m http.server

# Custom port
python -m http.server 9999
```
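
The same server can also be started from inside a program, which makes it easy to pair with the client code from section 9.1. The following is a minimal sketch using `http.server` and `threading` from the standard library. Binding to port 0 (so the operating system picks a free port), serving the current directory, and bypassing any configured proxy are choices made for this example.

```python
import threading
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler

# Port 0 asks the operating system for any free port.
server = HTTPServer(('127.0.0.1', 0), SimpleHTTPRequestHandler)
port = server.server_address[1]

# serve_forever() blocks, so run it in a background thread.
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()

# An empty ProxyHandler keeps the loopback request away from any system proxy.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# Fetch the root of the current directory from our own server.
with opener.open('http://127.0.0.1:{}/'.format(port)) as conn:
    status = conn.status
    content_type = conn.getheader('Content-Type')

print(port, status, content_type)

# Stop the serving loop and release the socket.
server.shutdown()
server.server_close()
```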

Common lightweight frameworks:
* Bottle
* Flask

Non-Python web servers that can host Python applications:
* Apache with the mod_wsgi module
* nginx with a WSGI application server
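
Both setups rely on WSGI, the standard interface (PEP 3333) between a web server and a Python application: the server calls a callable with the request environment and a `start_response` function. The following is a minimal sketch with the standard library's `wsgiref`; the `app` function and its greeting are made up for illustration, and the callable is invoked by hand here instead of being mounted under Apache or nginx.

```python
from wsgiref.util import setup_testing_defaults

def app(environ, start_response):
    """A tiny WSGI application: reply with the requested path as plain text."""
    body = 'Hello from {}\n'.format(environ['PATH_INFO']).encode('utf-8')
    headers = [('Content-Type', 'text/plain; charset=utf-8'),
               ('Content-Length', str(len(body)))]
    start_response('200 OK', headers)
    return [body]  # WSGI bodies are iterables of bytes

# Build a minimal fake request, the way a WSGI server would for a real one.
environ = {}
setup_testing_defaults(environ)
environ['PATH_INFO'] = '/hello'

# Record what the application passes to start_response.
captured = {}
def start_response(status, headers):
    captured['status'] = status
    captured['headers'] = dict(headers)

result = b''.join(app(environ, start_response))
print(captured['status'], result)
```

Frameworks such as Bottle and Flask expose their applications through this same interface.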

Other frameworks:
* Django
* web2py
* Pyramid
* TurboGears
* wheezy.web

---
<a id='ServicesAutomation'></a>
## 9.3 Services and Automation
[Back to contents](#HOME)


In [4]:
import antigravity  # opens http://xkcd.com/353/ in the browser

import webbrowser
url = 'http://www.python.org/'
webbrowser.open(url)  # open the given page

webbrowser.open_new(url)  # open it in a new window

webbrowser.open_new_tab('http://www.python.org/')  # open it in a new tab

True

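## Crawling and Scraping Packages

Crawling is a technique for collecting data from the web by issuing HTTP requests.

Recommended by the author:
* Scrapy

For parsing HTML:
* BeautifulSoup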

In [5]:
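import requests
from bs4 import BeautifulSoup as soup

def get_links(url):
    """Return the href of every <a> element on the page at url."""
    result = requests.get(url)
    page = result.text
    # Name the parser explicitly; 'html.parser' also works if lxml is not installed.
    doc = soup(page, 'lxml')
    links = [element.get('href') for element in doc.find_all('a')]
    return links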


for num, link in enumerate(get_links('http://boingboing.net'), start=1):
    print(num, link)
    print()
    

1 http://boingboing.net

2 http://boingboing.net/sub

3 http://boingboing.net/search

4 http://boingboingpodcasts.com

5 javascript:void(0)

6 http://boingboing.net/blog

7 http://bbs.boingboing.net

8 https://bbs.boingboing.net/faq

9 http://store.boingboing.net

10 mailto:support+id154252@vipstack.zendesk.com

11 http://boingboing.net/about

12 http://boingboing.net/contact

13 http://boingboing.net/advertise

14 http://boingboing.net/privacy

15 http://boingboing.net/tos

16 http://boingboing.net/2016/09/19/terencecrutcher.html

17 http://boingboing.net/tag/terence-crutcher

18 http://boingboing.net/2016/09/19/terencecrutcher.html

19 http://boingboing.net/author/xeni_jardin

20 http://boingboing.net/2016/09/19/hp-detonates-its-timebomb-pri.html

21 http://boingboing.net/tag/printers

22 http://boingboing.net/2016/09/19/hp-detonates-its-timebomb-pri.html

23 http://boingboing.net/author/cory_doctorow_1

24 http://boingboing.net/2016/09/19/heres-what-you-need-to-know.html

25 http://



Beautiful Soup issues a warning when it is called without naming a parser. To avoid it, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "lxml")
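
The link listing above mixes ordinary pages with `javascript:` and `mailto:` entries, and `element.get('href')` returns `None` for anchors without an `href`. A crawler usually wants unique, absolute HTTP(S) URLs only. The following is a minimal sketch using `urllib.parse` from the standard library; `clean_links` is a hypothetical helper, and the sample list is taken from the output above plus two made-up entries (a relative path and `None`).

```python
from urllib.parse import urljoin, urlparse

def clean_links(base_url, hrefs):
    """Resolve hrefs against base_url and keep unique http(s) URLs in order."""
    seen = set()
    result = []
    for href in hrefs:
        if not href:                      # skip None and empty strings
            continue
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ('http', 'https'):
            continue                      # drops javascript:, mailto:, ...
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

sample = [
    'http://boingboing.net/sub',
    'javascript:void(0)',
    'mailto:support+id154252@vipstack.zendesk.com',
    '/about',                             # made-up relative link
    None,                                 # anchor without an href
    'http://boingboing.net/sub',          # duplicate
]
for link in clean_links('http://boingboing.net', sample):
    print(link)
```

The helper could be applied to the return value of `get_links(url)` with the same `url` as the base.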
