# <center> Lesson 2: crawler </center>

# Goal of this lesson
Crawl data from a website, then clean up and export what was crawled

# Outline
> 1. Introduction to Crawler
> 2. Introduction to Website
> 3. Crawl Recipe Website
> 4. A little summary
> 5. Practice

![recipes](res/website.png)

With so many recipes, is there a way to download them all?

# 1. Introduction to Crawler
### Crawler: 
* A web crawler program
* Automates downloading web pages that share the same page structure

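### A handy automated download tool: [HTTrack](https://www.httrack.com/)
* Convenient for grabbing small amounts of data
* Give it a single root URL and it finds and downloads the other URLs automatically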

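### A well-known Python crawler framework: [scrapy](http://scrapy.org/)
* Implements many commonly needed details for you
* Takes some practice to pick up; you need to know the framework to know which files to define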

### Installing packages
* A handy package for downloading web pages: [requests](http://docs.python-requests.org/en/latest/user/quickstart/)
> sudo pip install requests
* A handy package for parsing web pages: [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
> sudo pip install beautifulsoup4

# Attention Please!!
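When using a crawler, be very careful not to fetch too many pages in one go <br> 
Doing so amounts to a malicious attack on the other party's server!! <br>
It may also get your IP banned, so that you can no longer browse the site!! <br>

One remedy is to add a delay between requests to avoid hitting the site too often <br>
Capping the number of pages fetched per hour is another <br>
And if you have several IPs available, splitting the work across them is also an option <br>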
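
The delay and the hourly cap described above can be sketched as a small throttle. This is only a minimal sketch written for Python 3 (the notebook cells themselves are Python 2); `Throttle`, `min_delay`, and `max_per_hour` are hypothetical names, and the clock and sleep functions are injectable purely so the logic can be exercised without actually waiting.

```python
import time


class Throttle:
    """Space requests at least `min_delay` seconds apart, with at most `max_per_hour` per hour."""

    def __init__(self, min_delay=1.0, max_per_hour=1000, clock=time.time, sleep=time.sleep):
        self.min_delay = min_delay
        self.max_per_hour = max_per_hour
        self.clock = clock
        self.sleep = sleep
        self.history = []  # timestamps of the requests made during the last hour

    def wait(self):
        """Block until the next request is allowed, then record it."""
        now = self.clock()
        # forget requests that are more than one hour old
        self.history = [t for t in self.history if now - t < 3600]
        pause = 0.0
        if self.history:
            # keep the minimum gap since the previous request
            pause = max(pause, self.history[-1] + self.min_delay - now)
        if len(self.history) >= self.max_per_hour:
            # hourly budget used up: wait until the oldest counted request leaves the window
            pause = max(pause, self.history[-self.max_per_hour] + 3600 - now)
        if pause > 0:
            self.sleep(pause)
        self.history.append(self.clock())
```

In a crawler, one would call `throttle.wait()` immediately before each download.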

# 2. Introduction to Website

# What is a website?

# This is a website.
![recipes](res/website.png)

# This is a website, too.

![recipes](res/website_code.png)

# Basic website architecture
In general, a website is divided into a <b>front-end</b> and a <b>back-end</b>

![](res/simple_website_architecher.jpg)

# HTML
HTML is the body of a web page. It is made up of tags, which define the page's content and structure

In [60]:
from IPython.core.display import HTML

In [97]:
HTML("""
<html>
    <body>
        <button class="btn1">hello</button><button class="btn2"> world</button>
        <p> the world is beautiful</p>
    </body>
</html>""")


# CSS
CSS styles the page's structure: with it we define how each element should be presented to the user

In [98]:
css_file = """ p#pdemo{ color: orange; } """
open("demo.css","w").write(css_file)

In [99]:
HTML("""
<html>
    <head> <link rel="stylesheet" href="demo.css"> </head>
    <body>
        <button class="btn1">hello</button><button class="btn2"> world</button>
        <p id="pdemo"> the world is beautiful</p>
    </body>
</html> """)


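# JavaScript
* JavaScript lets the user interact with the page in richer ways
* Animation effects, and even communication with the server, can all be done with JS
* JS libraries such as jQuery (loaded in the cell below) provide even richer effects, and the well-known [node.js](https://nodejs.org/en/) runs JS outside the browser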

In [101]:
HTML("""
<html>
    <head> <link rel="stylesheet" href="demo.css"> 
        <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.3/jquery.min.js"></script>
        <script>
        $(document).ready(function(){
            $(".btn1").click(function(){ 
                $("p.jdemo").animate({wordSpacing: "+=15px"}); });
            $(".btn2").click(function(){ 
                $("p.jdemo").animate({wordSpacing: "-=15px"}); });
        });
        </script>
    </head>
    <body>
        <button class="btn1">hello</button><button class="btn2"> world</button>
        <p id="pdemo" class="jdemo"> the world is beautiful</p>
    </body>
</html> """)

# 3. Crawl Recipe Website
### Now let's crawl a recipe website!


### First, let's try grabbing a cake!  [![recipes](res/cake.png)](http://allrecipes.com/recipe/8372/black-magic-cake/?internalSource=staff%20pick&referringId=79&referringContentType=recipe%20hub)

In [65]:
import requests
url = "http://allrecipes.com/recipe/8372/black-magic-cake/?internalSource=staff%20pick&referringId=79&referringContentType=recipe%20hub"
recipe = requests.get(url)
recipe_html = recipe.content
print repr(recipe_html[:500])
print repr(recipe_html[-500:])

'\r\n<!DOCTYPE html>\r\n<html>\r\n<!-- ARLOG SERVER: USSEAMOBILE32 LOCAL_IP: 192.168.5.121 MERCH_KEY: MerchData_4_1_1_8372_***_68 -->\r\n\r\n    <head>\r\n        <title>Black Magic Cake Recipe - Allrecipes.com</title>\r\n            <meta property="og:title" content="Black Magic Cake Recipe" />\r\n    \r\n        <meta property="og:site_name" content="Allrecipes" />\r\n        <meta charset="utf-8" />\r\n        <meta name="viewport" content="width=device-width, initial-scale=1.0" />\r\n            <meta id="metaDescri'
'n(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':new Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],j=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\'//www.googletagmanager.com/gtm.js?id=\'+i+dl;f.parentNode.insertBefore(j,f);})(window,document,\'script\',\'placeholderDL\',\'GTM-MW2LG9\');</script>\r\n        <!-- End Google Tag Manager -->\r\n\r\n        <script type="text/javascript">_satellite.pageBottom(); // I

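# Wait, what is requests.get?
* GET is an HTTP request method; it passes parameters to the server in the URL, after the `?` <br>
For example, http://allrecipes.com/recipe/8372/black-magic-cake/?internalSource=staff%20pick&referringId=79&referringContentType=recipe%20hub <br>
passes the parameter <i>internalSource</i> with the value <i>staff pick</i> (`%20` encodes a space) <br>
and the parameter <i>referringId</i> with the value <i>79</i> <br><br>

* Another common method is POST <br>
Parameters sent with POST travel in the request body, so they do not show up after the `?` <br>
This keeps them out of the address bar and the browser history, and it allows much larger payloads!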
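
To see the GET parameters concretely, the query string of the URL above can be taken apart with the standard library. This is a small Python 3 sketch (the notebook cells are Python 2); it only inspects the URL and sends nothing over the network.

```python
from urllib.parse import urlsplit, parse_qs

url = ("http://allrecipes.com/recipe/8372/black-magic-cake/"
       "?internalSource=staff%20pick&referringId=79&referringContentType=recipe%20hub")

parts = urlsplit(url)
params = parse_qs(parts.query)  # maps each parameter name to a list of decoded values

print(parts.path)  # everything before the "?"
for name, values in sorted(params.items()):
    print(name, "=", values[0])
```

With requests, the same kind of dict can be passed as `params=` to `requests.get`, or as `data=` to `requests.post` so that it travels in the request body instead.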

### Next, let's extract the "recipe name", "rating", "number of ratings", "ingredients", and "cooking time"!
![recipes](res/extract_cake.jpg)

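### But before we extract anything...

# How do we know which element on the page corresponds to which part of the HTML??

# Chrome developer tools
Right-click and choose "Inspect element"
1. Click the search icon
2. Select the element you want to look up
3. Read off the corresponding HTML tag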
![](res/chrome.jpg)

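# Now we've found the tag!
### The recipe name sits in the `<h1 class="recipe-summary__h1" itemprop="name">` tag, but we still have to confirm that it is unique
Right-click the page, choose "View page source",
and search for "recipe-summary__h1"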

![](res/chrome2.png)

The search returns only one match, which means this tag really is used only to display the recipe name!
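
The same uniqueness check can also be scripted instead of done by hand. A minimal Python 3 sketch follows; `count_class_uses` is a hypothetical helper, and `sample_html` is a made-up fragment standing in for the downloaded page source.

```python
import re


def count_class_uses(html, class_name):
    """Count the class="..." attributes in `html` that list `class_name` as a whole class."""
    count = 0
    for attr in re.findall(r'class="([^"]*)"', html):
        if class_name in attr.split():
            count += 1
    return count


# Made-up fragment; in the lesson this would be the downloaded recipe_html
sample_html = """
<h1 class="recipe-summary__h1" itemprop="name">Black Magic Cake</h1>
<span class="ingredient added">2 eggs</span>
<span class="ingredient added">1 teaspoon salt</span>
"""

print(count_class_uses(sample_html, "recipe-summary__h1"))
print(count_class_uses(sample_html, "ingredient"))
```

A count of exactly one suggests that the class is safe to use as a lookup key for a single value.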

# We've found the tag, but how do we point at it from Python?

# DOM(Document Object Model) Tree

### An HTML file can actually be represented as a tree

![](res/dom_tree.gif)

### So by representing the HTML page as a tree, we can search the tree for the tag we want
### For example with XPath, or with a CSS selector
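
To make the tree idea concrete, here is a Python 3 sketch that uses only the standard library. It assumes a tiny, well-formed page (made up for illustration) so that `xml.etree.ElementTree` can read it; real pages are rarely valid XML, which is why a forgiving HTML parser is needed in practice. ElementTree supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed page (hypothetical)
page = """
<html>
  <head><title>Black Magic Cake Recipe</title></head>
  <body>
    <h1 class="recipe-summary__h1">Black Magic Cake</h1>
    <ul>
      <li class="ingredient">2 eggs</li>
      <li class="ingredient">1 cup buttermilk</li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(page.strip())


def show(node, depth=0):
    """Return the tree as indented tag names, one per line."""
    lines = ["  " * depth + node.tag]
    for child in node:
        lines.extend(show(child, depth + 1))
    return lines


print("\n".join(show(root)))

# XPath-style search: any <h1> in the tree whose class attribute equals the given value
title = root.find(".//h1[@class='recipe-summary__h1']").text
ingredients = [li.text for li in root.findall(".//li[@class='ingredient']")]
print(title, ingredients)
```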

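# Some handy packages for parsing HTML:
1. beautiful soup <br>
  - Finds tags by tag name plus attributes, or by a selector that pins down the tag's position <br>
  - Searching this way can make parsing slower
2. pyquery <br>
  - Pins down the tag's position with a selector
3. lxml <br>
  - Pins down the tag's position with XPath
  - But it can be harder to install

### For convenience, the rest of this lesson parses mainly with beautiful soup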

# Now we have our parsing tool!
### Let's extract the "recipe name", "rating", "number of ratings", "ingredients", and "cooking time"!

![recipes](res/extract_cake.jpg)

In [66]:
import requests
from bs4 import BeautifulSoup
from pprint import pprint

def get_recipe_html(url):
    recipe = requests.get(url)
    recipe_html = recipe.content
    return recipe_html

def parse_a_recipe(url,recipe_html):
    # collect the fields of one recipe page into a dict
    parsed_data = {'url':url}
    soup = BeautifulSoup(recipe_html)
    # note: fb:app_id is the site's Facebook app id, so this value is the same on every recipe page
    parsed_data['id'] = int(soup.find("meta", property="fb:app_id").get("content"))
    parsed_data['recipe_name'] = soup.find('h1',class_="recipe-summary__h1").get_text()
    parsed_data['rating'] = float(soup.find('div',class_="rating-stars").get("data-ratingstars"))
    parsed_data['vote_number'] = int(soup.find('span',class_="review-count").get_text())
    # there is one <span itemprop="ingredients"> per ingredient
    parsed_data['ingredients'] = [ ingredient.get_text() for ingredient in soup.find_all("span",itemprop="ingredients") ]
    # the <time> tags carry durations such as 'PT35M' in their datetime attribute
    parsed_data['prep_time'] = soup.find("time", itemprop="prepTime").get("datetime")
    parsed_data['cook_time'] = soup.find("time", itemprop="cookTime").get("datetime")
    parsed_data['total_time'] = soup.find("time", itemprop="totalTime").get("datetime")
    return parsed_data
    

In [67]:
url = "http://allrecipes.com/recipe/8372/black-magic-cake/?internalSource=staff%20pick&referringId=79&referringContentType=recipe%20hub"
pprint( parse_a_recipe(url,get_recipe_html(url)) )

{'cook_time': 'PT35M',
 'id': 66102450266,
 'ingredients': [u'1 3/4 cups all-purpose flour',
                 u'2 cups white sugar',
                 u'3/4 cup unsweetened cocoa powder',
                 u'2 teaspoons baking soda',
                 u'1 teaspoon baking powder',
                 u'1 teaspoon salt',
                 u'2 eggs',
                 u'1 cup strong brewed coffee',
                 u'1 cup buttermilk',
                 u'1/2 cup vegetable oil',
                 u'1 teaspoon vanilla extract'],
 'prep_time': 'PT15M',
 'rating': 4.76773357391357,
 'recipe_name': u'Black Magic Cake',
 'total_time': 'PT1H',
 'url': 'http://allrecipes.com/recipe/8372/black-magic-cake/?internalSource=staff%20pick&referringId=79&referringContentType=recipe%20hub',
 'vote_number': 1786}


# The time format looks a bit awkward!
'PT35M' <br>
'PT15M' <br>
'PT1H' <br>

### Can we convert them all to minutes?

# Regular Expression
A notation for describing patterns whose form is fixed but whose values vary

In [68]:
import re
testing_case = [ 'PT1H35M' ,'PT15M' ,'PT1H' ]
for test in testing_case:
    # hours: the digits between "PT" and "H", if there are any
    hour = re.search(r"PT(\d+)H",test)
    hour = int(hour.group(1)) if hour else 0
        
    # minutes: the digits between "H" and "M", or between "PT" and "M" when there is no hour part
    minute = re.search(r"H(\d+)M",test)
    minute = minute if minute else re.search(r"PT(\d+)M",test)
    minute = int(minute.group(1)) if minute else 0
    
    total = hour*60+minute
    print "H: {0} M: {1:>2} Total: {2} mins".format(hour,minute,total)

H: 1 M: 35 Total: 95 mins
H: 0 M: 15 Total: 15 mins
H: 1 M:  0 Total: 60 mins
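
The same conversion can also be written as one pattern with optional groups. This is a Python 3 sketch (the cell above is Python 2); `duration_to_minutes` is a hypothetical helper, and it only covers the hour and minute parts seen on this site, not every ISO 8601 duration.

```python
import re

# "PT", then an optional "<digits>H" and an optional "<digits>M"
DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?$")


def duration_to_minutes(text):
    """Convert 'PT1H35M'-style strings to minutes; return None if the text does not match."""
    match = DURATION.match(text)
    if match is None or text == "PT":
        return None
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)


for case in ["PT1H35M", "PT15M", "PT1H"]:
    print(case, duration_to_minutes(case))
```

Anchoring the pattern with `^` and `$` makes unexpected input return `None` instead of a silently wrong number.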


# Now let's make the parser more complete!
* Use re to parse the times
* Add exception handling
* Add encoding conversion

In [77]:
import requests
import re
import traceback
from bs4 import BeautifulSoup
from pprint import pprint

def get_recipe_html(url):
    recipe = requests.get(url)
    recipe_html = recipe.content
    return recipe_html

def extract_time(t):
    # t is a <time> tag, or None when the page does not list that time
    if t is not None:
        t = t.get("datetime")
        # the datetime attribute looks like 'PT1H35M'; convert it to minutes
        hour = re.search(r"PT(\d+)H",t)
        hour = int(hour.group(1)) if hour else 0
        minute = re.search(r"H(\d+)M",t)
        minute = minute if minute else re.search(r"PT(\d+)M",t)
        minute = int(minute.group(1)) if minute else 0
        total = hour*60+minute
        return total
    else:
        return "NULL"

def parse_a_recipe(url,recipe_html):
    try:
        parsed_data = {'url':url}
        soup = BeautifulSoup(recipe_html)
        parsed_data['id'] = int(soup.find("meta", property="fb:app_id").get("content"))
        parsed_data['recipe_name'] = soup.find('h1',class_="recipe-summary__h1").get_text()
        parsed_data['rating'] = float(soup.find('div',class_="rating-stars").get("data-ratingstars"))
        parsed_data['vote_number'] = int(soup.find('span',class_="review-count").get_text())
        parsed_data['ingredients'] = [ ingredient.get_text() for ingredient in soup.find_all("span",itemprop="ingredients") ]
        parsed_data['prep_time'] = extract_time( soup.find("time", itemprop="prepTime") )
        parsed_data['cook_time'] = extract_time( soup.find("time", itemprop="cookTime") )
        parsed_data['total_time'] = extract_time( soup.find("time", itemprop="totalTime") )

        handle_encoding(parsed_data)
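    except Exception:
        # report the failing url and carry on; parsed_data then holds only the fields parsed so far
        print "url:{0}\n{1}".format(url,traceback.format_exc())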
    
    return parsed_data

def handle_encoding(parsed_data):
    for k,v in parsed_data.items():
        if type(v) == type(u"unicode"):
            parsed_data[k] = v.encode("utf8",'ignore')
        if type(v) == type([]):
            for idx in xrange(len(v)):
                v[idx] = v[idx].encode("utf8",'ignore')

In [78]:
url = "http://allrecipes.com/recipe/8372/black-magic-cake/?internalSource=staff%20pick&referringId=79&referringContentType=recipe%20hub"
parsed_result = parse_a_recipe(url,get_recipe_html(url))
pprint( parsed_result )

{'cook_time': 35,
 'id': 66102450266,
 'ingredients': ['1 3/4 cups all-purpose flour',
                 '2 cups white sugar',
                 '3/4 cup unsweetened cocoa powder',
                 '2 teaspoons baking soda',
                 '1 teaspoon baking powder',
                 '1 teaspoon salt',
                 '2 eggs',
                 '1 cup strong brewed coffee',
                 '1 cup buttermilk',
                 '1/2 cup vegetable oil',
                 '1 teaspoon vanilla extract'],
 'prep_time': 15,
 'rating': 4.76773357391357,
 'recipe_name': 'Black Magic Cake',
 'total_time': 60,
 'url': 'http://allrecipes.com/recipe/8372/black-magic-cake/?internalSource=staff%20pick&referringId=79&referringContentType=recipe%20hub',
 'vote_number': 1786}


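* The times were successfully converted to minutes
* And the strings are now all UTF-8 encoded

# Exporting the data to CSV
* Store each recipe's basic information in one CSV <br>
* Store the ingredients in another CSV <br>
* Both files carry the id column, so that the rows can be matched up<br>
* String values must be wrapped in quotes!!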

In [79]:
basic_info_order = ['id','recipe_name','rating','vote_number','prep_time','cook_time','total_time','url']
ingredient_info_order = ['id','ingredient']

def add_quote(text):
    return '"{0}"'.format(text) if type(text) == type('string') else str(text)

def output_csv(parsed_results):
    basic_writer = open( "basic_info.csv", 'w' )
    ingredient_writer = open( "ingredient_info.csv", 'w' )
    
    # write head
    basic_writer.write( '"{0}"\n'.format('","'.join(basic_info_order)) )
    ingredient_writer.write( '"{0}"\n'.format('","'.join(ingredient_info_order)) )
    
    # write content
    for result in parsed_results:
        basic_writer.write(','.join([ add_quote(result[order]) for order in basic_info_order ])+'\n')
        for igd in result['ingredients']:
            ingredient_writer.write('{id},"{igd}"\n'.format(id=result['id'],igd=igd))
    basic_writer.close()
    ingredient_writer.close()

In [80]:
output_csv([parsed_result])
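
Quoting by hand, as `output_csv` does, breaks as soon as a value itself contains a double quote. As an alternative, the standard `csv` module can do the quoting and escaping. This is a Python 3 sketch (the cells above are Python 2); `rows_to_csv` is a hypothetical helper, and the sample rows are made up.

```python
import csv
import io


def rows_to_csv(header, rows):
    """Serialize rows to CSV text, quoting every non-numeric field."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


text = rows_to_csv(
    ["id", "ingredient"],
    [[1, '1 cup "strong" brewed coffee'], [1, "2 eggs"]],
)
print(text)
```

The embedded quotes are doubled on output, so a CSV reader recovers the original value.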

In [81]:
import pandas as pd
pd.read_csv("basic_info.csv")

Unnamed: 0,id,recipe_name,rating,vote_number,prep_time,cook_time,total_time,url
0,66102450266,Black Magic Cake,4.767734,1786,15,35,60,http://allrecipes.com/recipe/8372/black-magic-...


In [82]:
pd.read_csv("ingredient_info.csv")

Unnamed: 0,id,ingredient
0,66102450266,1 3/4 cups all-purpose flour
1,66102450266,2 cups white sugar
2,66102450266,3/4 cup unsweetened cocoa powder
3,66102450266,2 teaspoons baking soda
4,66102450266,1 teaspoon baking powder
5,66102450266,1 teaspoon salt
6,66102450266,2 eggs
7,66102450266,1 cup strong brewed coffee
8,66102450266,1 cup buttermilk
9,66102450266,1/2 cup vegetable oil


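# We can now crawl a single page!
Next, let's crawl many recipes!

# robots.txt
* Usually tells crawlers what restrictions the site places on them
* So before crawling a large amount of data, it is worth looking here first to see what the site restricts
* That way your IP is less likely to get banned!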

In [94]:
print requests.get("http://allrecipes.com/robots.txt").content

User-agent: *
Disallow: /my/
Disallow: /search/
Disallow: /account/
Disallow: /error/
Disallow: /admin/

User-agent: Slurp
Crawl-delay: .05

User-agent: 008
Disallow: /
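
These rules can also be checked programmatically with the standard library's `urllib.robotparser`. In this Python 3 sketch the rules printed above are pasted in as a string so that it runs offline; `allowed` and the agent name `my-crawler` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The rules printed above, pasted in so the example runs offline
robots_txt = """\
User-agent: *
Disallow: /my/
Disallow: /search/
Disallow: /account/
Disallow: /error/
Disallow: /admin/

User-agent: Slurp
Crawl-delay: .05

User-agent: 008
Disallow: /
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())


def allowed(path, agent="my-crawler"):
    """True if `agent` (a hypothetical crawler name) may fetch `path` on allrecipes.com."""
    return rules.can_fetch(agent, "http://allrecipes.com" + path)


print(allowed("/recipe/8372/black-magic-cake/"))
print(allowed("/search/?q=cake"))
print(allowed("/recipe/8372/black-magic-cake/", agent="008"))
```

Against a live site, `set_url` followed by `read` would load the rules instead of `parse`.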


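# Crawling many recipes
Open the listing page [http://allrecipes.com/recipes/79/desserts/?](http://allrecipes.com/recipes/79/desserts/?), scroll all the way to the bottom, and watch what happens <br>
Once the loading indicator finishes, you can see that the URL has gained a page parameter!
![](res/ajax.jpg)

# Let's try crawling 5 pages!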

In [85]:
def get_recipes(page_size):
    url_head = "http://allrecipes.com"
    url_format = "http://allrecipes.com/recipes/79/desserts/?page={page}#{page}"
    pages = [ requests.get(url_format.format(page=i)) for i in xrange(page_size) ]
    recipes_urls = [ url.get('href') for page in pages for url in BeautifulSoup(page.content).find_all('a') ]
    # drop the unwanted query parameters and keep only links to individual recipes
    recipes_urls = [ "{0}{1}".format(url_head,url.split("?")[0]) for url in recipes_urls if url != None and url[:7] == '/recipe' and url[:8] != '/recipes']
    # remove duplicate pages
    recipes_urls = list(set(recipes_urls))
    return recipes_urls

In [89]:
recipes_urls = get_recipes(5)
print "Number of links:", len(recipes_urls)
parsed_results = [ parse_a_recipe(url,get_recipe_html(url)) for url in recipes_urls ]
output_csv(parsed_results)

Number of links: 89
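
The link filtering inside `get_recipes` can be pulled out into a function that is easy to test on its own. This is a Python 3 sketch; `recipe_links` is a hypothetical helper, the sample hrefs are made up, and it matches the prefix `/recipe/` exactly, which is slightly stricter than the slicing used above.

```python
from urllib.parse import urljoin, urlsplit


def recipe_links(hrefs, base="http://allrecipes.com"):
    """Keep links to single recipes, drop their query strings, and remove duplicates."""
    links = set()
    for href in hrefs:
        if not href:
            continue  # <a> tags without an href
        path = urlsplit(href).path
        if path.startswith("/recipe/"):  # "/recipes/..." are listing pages, not recipes
            links.add(urljoin(base, path))
    return sorted(links)


# Made-up hrefs of the kind found on a listing page
sample_hrefs = [
    "/recipe/8372/black-magic-cake/?internalSource=staff%20pick",
    "/recipe/8372/black-magic-cake/?internalSource=hub%20recipe",
    "/recipes/79/desserts/",
    None,
    "/recipe/67656/cake-balls/",
]

print(recipe_links(sample_hrefs))
```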


In [90]:
pd.read_csv("basic_info.csv")

Unnamed: 0,id,recipe_name,rating,vote_number,prep_time,cook_time,total_time,url
0.0,66102450266,Cake Balls,4.255517,1852,40,30,190,http://allrecipes.com/recipe/67656/cake-balls/
1.0,66102450266,Whipped Cream Cream Cheese Frosting,4.716981,1280,15,,15,http://allrecipes.com/recipe/19108/whipped-cre...
2.0,66102450266,Too Much Chocolate Cake,4.773054,4636,,,,http://allrecipes.com/recipe/7565/too-much-cho...
3.0,66102450266,Easy Lemon Cookies,4.580826,1139,,,,http://allrecipes.com/recipe/10776/easy-lemon-...
4.0,66102450266,Grandma Ople's Apple Pie,4.796273,1866,30,60,90,http://allrecipes.com/recipe/12377/grandma-opl...
5.0,66102450266,Apple Slab Pie,4.508772,43,30,60,90,http://allrecipes.com/recipe/19460/apple-slab-...
6.0,66102450266,Dark Chocolate Cake I,4.583763,1283,30,30,80,http://allrecipes.com/recipe/7736/dark-chocola...
7.0,66102450266,Raspberry and Almond Shortbread Thumbprints,4.673727,1554,30,18,75,http://allrecipes.com/recipe/10222/raspberry-a...
8.0,66102450266,Soft Sugar Cookies IV,4.544470,1414,,,,http://allrecipes.com/recipe/15295/soft-sugar-...
9.0,66102450266,Chewy Peanut Butter Chocolate Chip Cookies,4.350261,1090,15,15,60,http://allrecipes.com/recipe/9996/chewy-peanut...


In [91]:
pd.read_csv("ingredient_info.csv")

Unnamed: 0,id,ingredient
0.0,66102450266,1 (18.25 ounce) package chocolate cake mix
1.0,66102450266,1 (16 ounce) container prepared chocolate fros...
2.0,66102450266,1 (3 ounce) bar chocolate flavored confectione...
3.0,66102450266,1 (8 ounce) package cream cheese
4.0,66102450266,1 cup white sugar
5.0,66102450266,1/8 teaspoon salt
6.0,66102450266,1 teaspoon vanilla extract
7.0,66102450266,1 1/2 cups heavy whipping cream
8.0,66102450266,1 (18.25 ounce) package devil's food cake mix
9.0,66102450266,1 (5.9 ounce) package instant chocolate puddin...
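
Every row in the two tables above carries the same id, 66102450266, because `fb:app_id` does not vary from recipe to recipe, so the id column cannot link an ingredient to its recipe. One possible alternative, sketched here in Python 3, is to take the number that follows `/recipe/` in the URL. Treating that number as the recipe's id is an assumption, and `recipe_id_from_url` is a hypothetical helper.

```python
import re


def recipe_id_from_url(url):
    """Return the number that follows /recipe/ in the URL, or None if there is none."""
    match = re.search(r"/recipe/(\d+)/", url)
    return int(match.group(1)) if match else None


print(recipe_id_from_url("http://allrecipes.com/recipe/67656/cake-balls/"))
print(recipe_id_from_url("http://allrecipes.com/recipes/79/desserts/"))
```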


# 4. A little summary
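### So far we have defined 3 main functions
1. get_recipes
 * Finds the recipe pages to crawl
2. parse_a_recipe
 * Parses each recipe page
3. output_csv
 * Writes the parsed JSON-like records out as 2 CSV files
 
### In addition, requests plays the role of the downloader

# So what we did can be summarized in the following architecture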
* Data Center: output_csv
* Web Connector: requests
* Data Parser: get_recipes, parse_a_recipe

![](res/spider_arch.png)
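
The three roles can be wired together in a few lines. This Python 3 sketch uses stand-in functions so that it runs without a network connection; `crawl` and its parameter names are hypothetical.

```python
def crawl(urls, download, parse, store):
    """Run each url through download and parse, then hand all records to store."""
    records = []
    for url in urls:
        html = download(url)               # Web Connector
        records.append(parse(url, html))   # Data Parser
    store(records)                         # Data Center
    return records


# Stand-ins so the sketch runs offline
fake_pages = {"http://example.com/a": "<h1>A</h1>", "http://example.com/b": "<h1>B</h1>"}
saved = []

records = crawl(
    sorted(fake_pages),
    download=fake_pages.get,
    parse=lambda url, html: {"url": url, "length": len(html)},
    store=saved.extend,
)
print(records)
```

In this lesson, `get_recipe_html`, `parse_a_recipe`, and `output_csv` would take the places of `download`, `parse`, and `store`.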

# Scrapy
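The well-known Python crawler framework Scrapy has a richer and more complex architecture <br>
It already takes care of many complicated features for us, e.g. resuming an interrupted crawl <br>
So if you need to write a larger crawler, Scrapy is worth studying <br>
But for a small crawler, simply combining requests and beautiful soup is enough! <br>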

![](res/scrapy_arch.png)

# 5. Practice
### Now it's your turn!
* Follow the 4 steps below to write a PTT crawler that crawls the PTT Gossiping board! <br>
* Gossiping board URL: [https://www.ptt.cc/bbs/Gossiping/](https://www.ptt.cc/bbs/Gossiping/)
    1. Extract the link of every post (5 index pages is enough)
    2. Download each post page
    3. Parse the data into JSON
    4. Organize the data and export it as ptt.csv
    
* The fields to keep are...

|   | Description | Field | Example |
|---|--------------|------------------|------|
| 1.| Whether the post got enough pushes to be marked 爆 | overpush         | True or False |
| 2.| Post URL      | url              | https://www.ptt.cc/bbs/Gossiping/M.1119534821.A.8B6.html |
| 3.| Post author      | author           | benjamin0126 (LaNew衝五連勝吧) |
| 4.| Post title      | title            | 有那些藝人有高學歷 |
| 5.| Post time      | author_post_time | Thu Jun 23 21:56:57 2005 |
| 6.| Commenter account      | reviewer         | Armando |
| 7.| Comment text      | review_content   | 哈佛中文 ...是在幹麻?  到外國學中文 |
| 8.| Comment time      | review_time      | 06/23 |
| 9.| Push or boo tag          | review_status    | 推 |


In [120]:
print requests.get("http://www.ptt.cc/bbs/Gossiping/index1.html").content

SSLError: hostname 'www.ptt.cc' doesn't match either of 'images.ptt.cc', 'ptt.cc'

# Hint
Because of the connection's certificate, requests must be sent with the parameter verify=False, which turns off certificate verification

In [121]:
print requests.get("http://www.ptt.cc/bbs/Gossiping/index1.html",verify=False).content

<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8" />
		

<meta name="viewport" content="width=device-width">

<title>批踢踢實業坊</title>

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/v2.13/bbs-common.css" />
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/v2.13/bbs.css" media="screen" />
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/v2.13/pushstream.css" media="screen" />
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/v2.13/bbs-print.css" media="print" />


<script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="//images.ptt.cc/v2.13/bbs.js"></script>


		

<script type="text/javascript">

  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-32365737-1']);
  _gaq.push(['_setDomainName', 'ptt.cc']);
  _gaq.push(['_trackPageview']);

  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.pro

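# So what do we do if we want to read the gossip???
![](res/pttR18.png)

1. Open Chrome, right-click, and choose "Inspect element"
2. Click Network
3. Click "I agree" (我同意) and watch the requests that are sent; there will be one named over18, which POSTs parameters to https://www.ptt.cc/ask/over18
![](res/over18.png)

4. Inspect the parameters that were sent: the form data sets two fields, from and yes

![](res/from_data.png)

# requests.session()
To remember the check we have already passed, we must use a session to keep track of state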

In [251]:
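s = requests.session()
# form fields sent by the "I agree" button of the over-18 check
form_data = {
    "from": "/bbs/Gossiping/index.html",
    "yes": "yes"
}
# the session keeps the cookies from this POST and sends them with later requests
s.post("https://www.ptt.cc/ask/over18",verify=False,data=form_data)
result = s.get("https://www.ptt.cc/bbs/Gossiping/index.html",verify=False)
print result.text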

<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8" />
		

<meta name="viewport" content="width=device-width">

<title>看板 Gossiping 文章列表 - 批踢踢實業坊</title>

<link rel="stylesheet" type="text/css" href="//images.ptt.cc/v2.13/bbs-common.css" />
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/v2.13/bbs.css" media="screen" />
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/v2.13/pushstream.css" media="screen" />
<link rel="stylesheet" type="text/css" href="//images.ptt.cc/v2.13/bbs-print.css" media="print" />


<script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script src="//images.ptt.cc/v2.13/bbs.js"></script>


		

<script type="text/javascript">

  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-32365737-1']);
  _gaq.push(['_setDomainName', 'ptt.cc']);
  _gaq.push(['_trackPageview']);

  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == d

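# From here on, just use s for GET requests, remember to set verify to False, and pages download normally!

# 1. Extract the link of every post
And, while we're at it, whether the post was pushed to 爆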

In [239]:
def extract_post_page_url(start_page,end_page):
    url_format = "https://www.ptt.cc/bbs/Gossiping/index{index}.html"
    head = "https://www.ptt.cc"
    post_page_urls = []
    for i in xrange(start_page,end_page):
        url = url_format.format(index=i)
        # every post on an index page is listed in a <div class="r-ent">
        page_list = BeautifulSoup(s.get(url,verify=False).content).find_all('div',class_="r-ent")
        # keep (post url, listing entry) pairs; entries with no link (such as deleted posts) are skipped
        post_page_urls.extend([ ("{0}{1}".format(head,page.find('a').get('href')),page) for page in page_list if page.find('a') is not None ])
    return list(set(post_page_urls))

In [242]:
post_page_url = extract_post_page_url(3,5)
print len(post_page_url)

40


# 2. Download each post page

In [243]:
def download_pages(urls):
    return [ (url,overpush,BeautifulSoup(s.get(url,verify=False).content)) for url,overpush in urls ]

In [244]:
pages = download_pages(post_page_url)

# 3. Parse the data into JSON
|   | Description | Field | Example |
|---|--------------|------------------|------|
| 1.| Whether the post got enough pushes to be marked 爆 | overpush         | True or False |
| 2.| Post URL      | url              | https://www.ptt.cc/bbs/Gossiping/M.1119534821.A.8B6.html |
| 3.| Post author      | author           | benjamin0126 (LaNew衝五連勝吧) |
| 4.| Post title      | title            | 有那些藝人有高學歷 |
| 5.| Post time      | author_post_time | Thu Jun 23 21:56:57 2005 |
| 6.| Commenter account      | reviewer         | Armando |
| 7.| Comment text      | review_content   | 哈佛中文 ...是在幹麻?  到外國學中文 |
| 8.| Comment time      | review_time      | 06/23 |
| 9.| Push or boo tag          | review_status    | 推 |

In [245]:
def extract_review(review,class_):
    return review.find("span",class_=class_).get_text().encode("utf8")

def parser(pages):
    parsed_results = []
    for url,overpush,soup in pages:
        try:
            # overpush arrives as the post's listing entry; its <span class="hl"> holds the push marker
            overpush = overpush.find('span',class_="hl")
            tmp_parsed_result = {
                "overpush": True if overpush != None and overpush.get_text().encode("utf8") == '爆' else False,
                "url": url,
                "reviewer": [],
                "review_content": [],
                "review_time": [],
                "review_status": []
            }
            article_meta_line = soup.find_all("div",class_="article-metaline")
            tmp_parsed_result['author'] = article_meta_line[0].find('span',class_="article-meta-value").get_text().encode("utf8")
            tmp_parsed_result['title'] = article_meta_line[1].find('span',class_="article-meta-value").get_text().encode("utf8")
            tmp_parsed_result['author_post_time'] = article_meta_line[2].find('span',class_="article-meta-value").get_text().encode("utf8")
            for review in soup.find_all("div",class_="push"):
                if review.get_text().encode("utf8") == "檔案過大！部分文章無法顯示":
                    continue
                tmp_parsed_result['reviewer'].append( extract_review(review,"push-userid") )
                tmp_parsed_result['review_content'].append( extract_review(review,"push-content") )
                tmp_parsed_result['review_time'].append( extract_review(review,"push-ipdatetime") )
                tmp_parsed_result['review_status'].append( extract_review(review,"push-tag") )
            parsed_results.append(tmp_parsed_result)
        except Exception:
            # dump the page that failed to parse and move on to the next one
            print soup
    return parsed_results
        

In [246]:
parsed_results = parser(pages)

# 4. Organize the data and export it as ptt.csv

In [247]:
def output_ptt_csv(parsed_results):
    order = ['overpush','url','author','title','author_post_time','reviewer','review_content','review_time','review_status']
    writer = open("ptt.csv",'w')
    writer.write('"'+'","'.join(order)+'"\n')
    for parsed_result in parsed_results:
        tmp_raw = '"'+'","'.join([ str(parsed_result[o]) for o in order[:5] ])+'"'
        for i in xrange(len(parsed_result['reviewer'])):
            writer.write( '{0},"{1}"\n'.format(tmp_raw,'","'.join([ str(parsed_result[o][i]) for o in order[5:] ])) )
    writer.close()

In [248]:
output_ptt_csv(parsed_results)

In [249]:
pd.read_csv("ptt.csv")

Unnamed: 0,overpush,url,author,title,author_post_time,reviewer,review_content,review_time,review_status
0.0,False,https://www.ptt.cc/bbs/Gossiping/M.1119650484....,youngersheep (...),Re: 王童為什麼可以當黃河男的乾女兒,Sat Jun 25 06:19:58 2005,stationery,:她演過小阿珍嗎? (當藝人說髒話也沒關係吧@@),163.25.118.8 06/25\n,推
1.0,False,https://www.ptt.cc/bbs/Gossiping/M.1119650484....,youngersheep (...),Re: 王童為什麼可以當黃河男的乾女兒,Sat Jun 25 06:19:58 2005,stationery,:一般人還不是在說 沒必要當藝人就偽裝自己啦,163.25.118.8 06/25\n,→
2.0,False,https://www.ptt.cc/bbs/Gossiping/M.1119650484....,youngersheep (...),Re: 王童為什麼可以當黃河男的乾女兒,Sat Jun 25 06:19:58 2005,youngersheep,:有啊 她第一部戲就是阿扁與阿珍呀 深刻哩,218.165.106.83 06/25\n,推
3.0,False,https://www.ptt.cc/bbs/Gossiping/M.1119650484....,youngersheep (...),Re: 王童為什麼可以當黃河男的乾女兒,Sat Jun 25 06:19:58 2005,stationery,:她後來好像去拍華視的戲 忘了演什麼@@,163.25.118.8 06/25\n,→
4.0,False,https://www.ptt.cc/bbs/Gossiping/M.1119650484....,youngersheep (...),Re: 王童為什麼可以當黃河男的乾女兒,Sat Jun 25 06:19:58 2005,loveices,:女生向錢走,140.123.222.71 06/25\n,推
5.0,False,https://www.ptt.cc/bbs/Gossiping/M.1119622394....,ku23 (Good Luck),Re: [新聞] 有沒有911的八卦,Fri Jun 24 22:34:06 2005,zett,:民航機內不是禁用手機嗎？ 怎麼會....,61.217.145.45 06/24\n,推
6.0,False,https://www.ptt.cc/bbs/Gossiping/M.1119622394....,ku23 (Good Luck),Re: [新聞] 有沒有911的八卦,Fri Jun 24 22:34:06 2005,zett,:還是已知被劫機，才用手機聯絡？,61.217.145.45 06/24\n,→
7.0,False,https://www.ptt.cc/bbs/Gossiping/M.1119622394....,ku23 (Good Luck),Re: [新聞] 有沒有911的八卦,Fri Jun 24 22:34:06 2005,ykao,:美國國內線我沒記錯的話是可以用電話,61.70.169.60 06/24\n,→
8.0,False,https://www.ptt.cc/bbs/Gossiping/M.1119622394....,ku23 (Good Luck),Re: [新聞] 有沒有911的八卦,Fri Jun 24 22:34:06 2005,ykao,":有些機上還有AT&T的刷卡電話機, 所以應該是可以打的",61.70.169.60 06/24\n,→
9.0,False,https://www.ptt.cc/bbs/Gossiping/M.1119622394....,ku23 (Good Luck),Re: [新聞] 有沒有911的八卦,Fri Jun 24 22:34:06 2005,CRAZYDRAGON,:Autobots! Let's roll!~,11/26 12:16\n,推
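
In the output above, review_time still carries an IP address and a trailing newline, and review_content starts with a colon, while the field table asks for values like `06/23`. A possible cleanup step is sketched below in Python 3; both helpers are hypothetical, and the rule for spotting the IP (a first token with exactly three dots) is an assumption based only on the rows shown.

```python
def split_ipdatetime(text):
    """Split a push-ipdatetime string into (ip, time); ip is None when absent."""
    parts = text.split()
    if parts and parts[0].count(".") == 3:
        return parts[0], " ".join(parts[1:])
    return None, " ".join(parts)


def clean_content(text):
    """Drop the leading ':' and the surrounding whitespace of a push-content string."""
    text = text.strip()
    if text.startswith(":"):
        text = text[1:]
    return text.strip()


print(split_ipdatetime("163.25.118.8 06/25\n"))
print(split_ipdatetime("11/26 12:16\n"))
print(clean_content(":Autobots! Let's roll!~"))
```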


# Thanks for watching!