# Scraping Strategies for Static Web Pages


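* Understand the scraping strategy for static web pages
* Get to know a package suited to static-page scraping: requests
* Get to know a package suited to static-page scraping: BeautifulSoup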

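## Assignment Goals

Use requests + BeautifulSoup to fetch and parse the content of the following two sites:

1. Dcard: https://www.dcard.tw/f
2. Zhihu: https://www.zhihu.com/explore

Then answer the following questions:

1. After requests returns, how do we get the data out, and what type is it?
2. Why process the result with BeautifulSoup, and what type comes out?
3. The data returned from Zhihu looks odd. How do we fix it?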

### 1. Dcard: https://www.dcard.tw/f

In [2]:
import requests
from bs4 import BeautifulSoup


In [3]:
url = 'https://www.dcard.tw/f'
r = requests.get(url)
r.encoding = 'utf-8'
print(r.text[0:3000])

<!DOCTYPE html><html lang="zh-Hant-TW"><head prefix="og: http://ogp.me/ns#" itemscope="" itemType="https://schema.org/WebSite"><title data-react-helmet="true">Dcard</title><meta data-react-helmet="true" property="og:image" content="https://www.dcard.tw/build/landing-c9e7b8fb.png"/><meta data-react-helmet="true" property="og:image:secure_url" content="https://www.dcard.tw/build/landing-c9e7b8fb.png"/><meta data-react-helmet="true" charSet="utf-8"/><meta data-react-helmet="true" http-equiv="X-UA-Compatible" content="IE=edge"/><meta data-react-helmet="true" name="application-name" content="Dcard"/><meta data-react-helmet="true" name="apple-itunes-app" content="app-id=951353454"/><meta data-react-helmet="true" name="theme-color" content="#006aa6"/><meta data-react-helmet="true" name="mobile-web-app-capable" content="yes"/><meta data-react-helmet="true" name="apple-mobile-web-app-capable" content="yes"/><meta data-react-helmet="true" property="fb:app_id" content="211628828926493"/><meta dat

In [4]:
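# r is a requests Response object; the page source is in r.text (str) or r.content (bytes).
print('After requests returns, how do we get the data out, and what type is it? =>')
print(type(r))

After requests returns, how do we get the data out, and what type is it? =>
<class 'requests.models.Response'>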
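
The cell above only prints the type of `r` itself. As a rough sketch of where the data actually lives, the snippet below builds a `Response` by hand so it runs without network access. Setting the private `_content` attribute directly is an assumption made for illustration; in real use `requests.get` fills it in. It shows that `.content` is `bytes`, that `.text` is a `str` decoded with `.encoding`, and why the notebook sets `r.encoding = 'utf-8'`.

```python
import requests

# Hand-built Response so the sketch needs no network access.
# Assumption: `_content` is a private attribute that requests normally
# populates after a download; assigning it here is for illustration only.
resp = requests.models.Response()
resp.status_code = 200
resp._content = "<title>知乎</title>".encode("utf-8")

# .content is the raw body as bytes.
raw_type = type(resp.content)

# .text decodes those bytes with resp.encoding, so it is a str.
resp.encoding = "utf-8"
text_type = type(resp.text)
decoded = resp.text

# With the wrong encoding, the same bytes decode to garbled characters.
resp.encoding = "ISO-8859-1"
garbled = resp.text

print(raw_type, text_type)
print(decoded)
print(decoded == garbled)
```

If the decoded text looks garbled on a real page, checking `r.encoding` against the page's declared charset is usually the first thing to try.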


In [5]:
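print('Why process the result with BeautifulSoup, and what type comes out? =>')
# r.text is one long string; BeautifulSoup parses it into a tree
# that can be searched by tag name and attribute.
htmlcode = r.text
soup = BeautifulSoup(htmlcode, 'lxml')

print(type(soup))


Why process the result with BeautifulSoup, and what type comes out? =>
<class 'bs4.BeautifulSoup'>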
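
To make the benefit of the parsed tree concrete, here is a minimal sketch that queries a small HTML string. The markup is a made-up stand-in for `r.text`: the two `meta` values are taken from the Dcard output above, while the link is hypothetical. It uses the `html.parser` backend, which ships with Python, instead of `lxml`, so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# A small stand-in for r.text; the <a> link is invented for this example.
html = """
<html><head>
  <title>Dcard</title>
  <meta name="application-name" content="Dcard"/>
  <meta name="theme-color" content="#006aa6"/>
</head><body><a href="/f/trending">Trending</a></body></html>
"""

# "html.parser" is built into Python; the notebook itself uses "lxml".
soup = BeautifulSoup(html, "html.parser")

# Text inside the <title> tag.
title = soup.title.string

# Look up one <meta> tag by attribute, then read another attribute from it.
# `attrs=` is required here because `name` is find()'s own first parameter.
theme = soup.find("meta", attrs={"name": "theme-color"})["content"]

# Collect the href of every <a> tag.
links = [a["href"] for a in soup.find_all("a")]

print(title)
print(theme)
print(links)
```

Doing the same lookups on the raw string would need hand-written string searches or regular expressions, which is the main reason to parse first.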


### 2. Zhihu: https://www.zhihu.com/explore

In [6]:
url = 'https://www.zhihu.com/explore'
r = requests.get(url)
r.encoding = 'utf-8'

print(r.text[0:600])

<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white">
<center><h1>400 Bad Request</h1></center>
<hr><center>openresty</center>
</body>
</html>
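
The body above is an error page, not the Explore page, so parsing it would produce nothing useful. One way to catch this early is to check the status before parsing. The sketch below uses a hand-built `Response` with status 400 as an offline stand-in for the real one; `resp.ok` and `raise_for_status()` are standard `requests` features.

```python
import requests

# Offline stand-in for the 400 response above; a real one comes from requests.get.
resp = requests.models.Response()
resp.status_code = 400
resp.url = "https://www.zhihu.com/explore"

# .ok is False for any 4xx or 5xx status.
print(resp.ok)

# raise_for_status() turns a 4xx/5xx status into an exception.
try:
    resp.raise_for_status()
    failed = False
except requests.exceptions.HTTPError as exc:
    failed = True
    print("request failed:", exc)
```

In a scraper, calling `r.raise_for_status()` right after `requests.get(...)` stops the run before an error page reaches BeautifulSoup.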



### 3. The data returned from Zhihu looks odd. How do we fix it?

In [8]:
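import requests
url = 'https://www.zhihu.com/explore'
# The bare request was rejected with 400 Bad Request, so send
# headers copied from a browser session along with it.
Myheaders = {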
'accept': '*/*',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9,zh-TW;q=0.8,zh;q=0.7,ja;q=0.6',
'content-encoding': 'gzip',
'content-length': '425',
'content-type': 'application/x-protobuf',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-site',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36',
'x-za-batch-size': '1',
'x-za-clientid': '61cf7617-3d5a-492e-8acd-a561fd02c35a',
'x-za-log-version': '2.6.25',
'x-za-platform': 'DesktopWeb',
'x-za-product': 'Zhihu'
}
'''Myheaders = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-TW,zh;q=0.9,en-US;q=0.8,en;q=0.7,zh-CN;q=0.6,ja;q=0.5',
    'cache-control': 'max-age=0',
    'cookie': '_zap=a5634a47-a806-4828-8ab4-2cbc03734bda; d_c0="AFBoGjc1fg6PTus2_H76YgMM3xvztVHRnCs=|1541767835"; __gads=ID=18e1196642fc5994:T=1544975081:S=ALNI_MZ2lQjiLHPlLxjVqGGH6o-EiL2luQ; z_c0="2|1:0|10:1551204934|4:z_c0|92:Mi4xM1FkMERnQUFBQUFBVUdnYU56Vi1EaVlBQUFCZ0FsVk5SdEJpWFFDUXZ6REVHVWw0QWtXblh6WGt4T183ekdxNFhn|fa542d11758c34207cdd6a1edf85de768ac02624c8cb53aa687017af49accd7b"; tst=r; q_c1=aed50b9b158344d6ac78a230c8970d83|1560948516000|1543510922000; __utmv=51854390.100--|2=registration_date=20190226=1^3=entry_date=20181130=1; _xsrf=13cd498d-6139-4c50-99a6-df5c749dd64c; tgw_l7_route=4860b599c6644634a0abcd4d10d37251; __utma=51854390.567487929.1560948518.1560950333.1562469680.3; __utmb=51854390.0.10.1562469680; __utmc=51854390; __utmz=51854390.1562469680.3.3.utmcsr=localhost:8888|utmccn=(referral)|utmcmd=referral|utmcct=/notebooks/day2-example.ipynb',
    'referer': 'http://localhost:8888/notebooks/day2-example.ipynb',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
}'''
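r = requests.get(url, headers=Myheaders)

r.encoding = 'utf-8'
# Name the parser explicitly; omitting it makes bs4 emit a warning
# and fall back to whichever parser happens to be installed.
soup = BeautifulSoup(r.text, 'lxml')
print(soup)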

<!DOCTYPE html>
<html data-hairline="true" data-theme="light" lang="zh"><head><meta charset="utf-8"/><title data-react-helmet="true">发现 - 知乎</title><meta content="width=device-width,initial-scale=1,maximum-scale=1" name="viewport"/><meta content="webkit" name="renderer"/><meta content="webkit" name="force-rendering"/><meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/><meta content="FTeR0c8arOPKh8c5DYh_9uu98_zJbaWw53J-Sch9MTg" name="google-site-verification"/><meta content="有问题，上知乎。知乎，可信赖的问答社区，以让每个人高效获得可信赖的解答为使命。知乎凭借认真、专业和友善的社区氛围，结构化、易获得的优质内容，基于问答的内容生产方式和独特的社区机制，吸引、聚集了各行各业中大量的亲历者、内行人、领域专家、领域爱好者，将高质量的内容透过人的节点来成规模地生产和分享。用户通过问答等交流方式建立信任和连接，打造和提升个人影响力，并发现、获得新机会。" name="description" property="og:description"/><link data-react-helmet="true" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-152.67c7b278.png" rel="apple-touch-icon"/><link data-react-helmet="true" href="https://static.zhihu.com/heifetz/assets/apple-touch-icon-152.67c7b278.png" rel="apple-touch-icon" size
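
A likely reason the bare request was rejected is that `requests` identifies itself as `python-requests/<version>` by default. This is an assumption about Zhihu's filtering, not something the output confirms. The sketch below prints that default and prepares (without sending) a request carrying a trimmed-down, hypothetical header set, so the outgoing headers can be inspected offline. Whether Zhihu accepts this smaller set is untested.

```python
import requests

url = "https://www.zhihu.com/explore"

# What requests sends when no user-agent is supplied.
default_ua = requests.utils.default_user_agent()

# Hypothetical minimal header set, reusing values from the cell above.
headers = {
    "user-agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"
    ),
    "accept-language": "en-US,en;q=0.9,zh-TW;q=0.8,zh;q=0.7,ja;q=0.6",
}

# Prepare the request without sending it, to see the headers that
# would go on the wire once session defaults are merged in.
session = requests.Session()
prepared = session.prepare_request(requests.Request("GET", url, headers=headers))

print(default_ua)
print(prepared.headers["User-Agent"])
```

Headers such as `content-length` and `content-type` describe a request body, which a GET does not have, so they are probably safe to leave out when experimenting.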