# Scraping Strategies for Static Web Pages


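* Understand the scraping strategy for static web pages
* Get to know a tool suited to static-page scraping: Requests
* Get to know a tool suited to static-page scraping: BeautifulSoup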

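## Assignment Goals

Use Requests + BeautifulSoup to fetch and parse the content of the following two sites:

1. Dcard: https://www.dcard.tw/f
2. Zhihu: https://www.zhihu.com/explore

Then answer the following questions:

1. After Requests returns a response, how do you extract the data, and what is its data type?
2. Why process the result with BeautifulSoup? What is the type after processing?
3. The data returned from Zhihu looks a bit odd. How can that be fixed?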

### 1. Dcard: https://www.dcard.tw/f

In [1]:
import requests
from bs4 import BeautifulSoup


In [12]:
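url = 'https://www.dcard.tw/f'

# Fetch the page and decode the body as UTF-8
r = requests.get(url)
r.encoding = 'utf-8'
# Print only the first 3000 characters of the decoded HTML
print(r.text[0:3000])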

<!DOCTYPE html><html lang="zh-TW"><head prefix="og: http://ogp.me/ns#" itemscope="" itemType="https://schema.org/WebSite"><meta charSet="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=edge"/><meta name="apple-mobile-web-app-status-bar-style" content="default"/><meta name="application-name" content="Dcard"/><meta name="apple-itunes-app" content="app-id=951353454"/><meta name="theme-color" content="#006aa6"/><meta name="mobile-web-app-capable" content="yes"/><meta name="apple-mobile-web-app-capable" content="yes"/><meta name="supported-color-schemes" content="light"/><meta property="fb:app_id" content="211628828926493"/><meta property="fb:pages" content="178875832200695,577748865730563,1333515469994506,619122564952487,804004803032067,178024805867764"/><meta property="al:ios:app_store_id" content="951353454"/><meta property="al:ios:app_name" content="Dcard"/><meta property="al:android:package" content="com.sparkslab.dcardreader"/><meta property="al:android:app_name" content="Dcard

In [21]:
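print('After Requests returns a response, how do you extract the data, and what is its data type? =>')

# r.text is the body decoded to a str; r.content holds the raw bytes.
# Note: this cell was run after the Zhihu API request in section 3,
# so the output is that JSON response rather than the Dcard page.
print(r.text[0:3000])

After Requests returns a response, how do you extract the data, and what is its data type? =>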
{"data":[{"id":637938925,"type":"answer","answer_type":"normal","question":{"type":"question","id":55493026,"title":"你们都是怎么学 Python 的？","question_type":"normal","created":1486390229,"updated_time":1582533957,"url":"https://www.zhihu.com/api/v4/questions/55493026","relationship":{}},"author":{"id":"e8c4768eaa41e3749f7e8bc5ac6aa74b","url_token":"Lanyuneet","name":"Slumbers","avatar_url":"https://pic1.zhimg.com/v2-f950cfef511d33500177be90030dcd3d_l.jpg?source=1940ef5c","avatar_url_template":"https://pic2.zhimg.com/v2-f950cfef511d33500177be90030dcd3d.jpg?source=1940ef5c","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/e8c4768eaa41e3749f7e8bc5ac6aa74b","user_type":"people","headline":"算法工程师","badge":[],"badge_v2":{"title":"","merged_badges":[],"detail_badges":[],"icon":"","night_icon":""},"gender":0,"is_advertiser":false,"is_privacy":false},"url":"https://www.zhihu.com/api/v4/answers/637938925","is_collapsed":false,"created_time":155
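
To make the answer to the first question concrete: `r.text` is the response body decoded to a Python `str`, and when that body is JSON, as in the output above, it can be parsed into dictionaries and lists. The following is a minimal offline sketch, assuming a hand-trimmed sample of that response. The `sample_text` variable and the selection of fields are illustrative and do not reproduce the full payload.

```python
import json

# Hand-trimmed stand-in for r.text; the real body has many more fields.
sample_text = (
    '{"data":[{"id":637938925,"type":"answer",'
    '"question":{"id":55493026,"title":"你们都是怎么学 Python 的？"},'
    '"author":{"name":"Slumbers","headline":"算法工程师"}}]}'
)

print(type(sample_text))          # the decoded body is a plain str

parsed = json.loads(sample_text)  # str -> nested dict/list structure
first_answer = parsed["data"][0]

print(type(parsed))
print(first_answer["question"]["title"])
print(first_answer["author"]["name"])
```

With a live response, `r.json()` performs the same parsing step directly on the response object.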

In [25]:
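print('Why process the result with BeautifulSoup? What is the type after processing? =>')

# BeautifulSoup parses markup into a navigable bs4.BeautifulSoup object.
# Note: r still holds the Zhihu JSON response here, so the parser simply wraps
# the text in <html><head></head><body>, as the output shows.
soup = BeautifulSoup(r.text[0:3000], "html5lib")
print(soup)


Why process the result with BeautifulSoup? What is the type after processing? =>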
<html><head></head><body>{"data":[{"id":637938925,"type":"answer","answer_type":"normal","question":{"type":"question","id":55493026,"title":"你们都是怎么学 Python 的？","question_type":"normal","created":1486390229,"updated_time":1582533957,"url":"https://www.zhihu.com/api/v4/questions/55493026","relationship":{}},"author":{"id":"e8c4768eaa41e3749f7e8bc5ac6aa74b","url_token":"Lanyuneet","name":"Slumbers","avatar_url":"https://pic1.zhimg.com/v2-f950cfef511d33500177be90030dcd3d_l.jpg?source=1940ef5c","avatar_url_template":"https://pic2.zhimg.com/v2-f950cfef511d33500177be90030dcd3d.jpg?source=1940ef5c","is_org":false,"type":"people","url":"https://www.zhihu.com/api/v4/people/e8c4768eaa41e3749f7e8bc5ac6aa74b","user_type":"people","headline":"算法工程师","badge":[],"badge_v2":{"title":"","merged_badges":[],"detail_badges":[],"icon":"","night_icon":""},"gender":0,"is_advertiser":false,"is_privacy":false},"url":"https://www.zhihu.com/api/v4/answers/637938925","is_coll
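
BeautifulSoup is most useful on real HTML, where it turns a string into a tree that can be searched by tag and attribute. The sketch below illustrates this with a short excerpt of the Dcard `<head>` shown earlier, closed by hand so it is a complete document. It assumes Python's built-in `html.parser` rather than `html5lib`, purely so that no extra parser package is needed. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Short excerpt of the Dcard <head> shown earlier, closed by hand for the example.
html_text = (
    '<!DOCTYPE html><html lang="zh-TW"><head>'
    '<meta name="application-name" content="Dcard"/>'
    '<meta name="theme-color" content="#006aa6"/>'
    '</head><body></body></html>'
)

soup = BeautifulSoup(html_text, "html.parser")
print(type(soup))  # a bs4.BeautifulSoup object rather than a str

# "name" is also find()'s tag-name parameter, so the attribute goes through attrs=.
app_name = soup.find("meta", attrs={"name": "application-name"})["content"]
theme_color = soup.find("meta", attrs={"name": "theme-color"})["content"]
html_lang = soup.html["lang"]

print(app_name, theme_color, html_lang)
```

The string returned by Requests can only be sliced or searched as text, whereas the parsed object exposes tags and attributes directly.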

### 2. Zhihu: https://www.zhihu.com/explore

In [4]:
url = 'https://www.zhihu.com/explore'
r = requests.get(url)
r.encoding = 'utf-8'

print(r.text[0:600])

<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>
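
The body above is an error page, not the explore page, so it is worth checking the status code before parsing anything. The sketch below builds a `requests.Response` by hand as a stand-in so that it runs without network access. In a real run, `r` would come from `requests.get(url)`.

```python
import requests

# Stand-in for the response above; normally this comes from requests.get(url).
r = requests.Response()
r.status_code = 400

print(r.status_code)  # numeric HTTP status
print(r.ok)           # False for 4xx and 5xx responses

try:
    r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    failed = False
except requests.HTTPError as exc:
    failed = True
    print("request failed:", exc)
```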



### 3. The data returned from Zhihu looks a bit odd. How can that be fixed?

In [16]:
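import requests

# The explore page returned 400 Bad Request above, so this cell sends an
# explicit User-Agent header and queries the JSON API endpoint instead.
url = 'https://www.zhihu.com/api/v4/questions/55493026/answers'
headers = {'user-agent': 'my-app/0.0.1'}

r = requests.get(url, headers=headers)

r.encoding = 'utf-8'
print(r.text[0:600])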

{"data":[{"id":637938925,"type":"answer","answer_type":"normal","question":{"type":"question","id":55493026,"title":"你们都是怎么学 Python 的？","question_type":"normal","created":1486390229,"updated_time":1582533957,"url":"https://www.zhihu.com/api/v4/questions/55493026","relationship":{}},"author":{"id":"e8c4768eaa41e3749f7e8bc5ac6aa74b","url_token":"Lanyuneet","name":"Slumbers","avatar_url":"https://pic1.zhimg.com/v2-f950cfef511d33500177be90030dcd3d_l.jpg?source=1940ef5c","avatar_url_template":"https://pic2.zhimg.com/v2-f950cfef511d33500177be90030dcd3d.jpg?source=1940ef5c","is_org":false,"type":"peo
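
The request above succeeded after a `user-agent` header was added and the API endpoint was targeted. By default Requests identifies itself as `python-requests/<version>`, which some sites reject. Whether that was the cause of the earlier 400 response is an assumption, since the URL changed as well. The sketch below prepares the request without sending it, so the outgoing header can be inspected offline.

```python
import requests

url = "https://www.zhihu.com/api/v4/questions/55493026/answers"
headers = {"user-agent": "my-app/0.0.1"}

# What Requests sends when no User-Agent is given, e.g. "python-requests/2.x.y".
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Build the request without sending it, so the outgoing headers can be inspected.
prepared = requests.Request("GET", url, headers=headers).prepare()
sent_ua = prepared.headers["User-Agent"]  # header lookup is case-insensitive
print(sent_ua)
```

Passing the same `headers` dictionary to `requests.get`, as the cell above does, sends the custom value in place of the default.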
