# Regular expressions

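## When to use: if a site is hard to scrape but you can still get its complete source code, regex is the most brute-force solution

## Otherwise Selenium tends to be easier, and the source code it returns is more complete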

## re.match()

## The most basic match

A common trick is to use .* to skip over everything in the middle.

In [4]:
import re

content = 'Hello 123 4567 World_this is a Regex Demo'
print(len(content))

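# .* is commonly used to skip over everything in the middle
result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)

r'''      Notes:
^Hello    ^ anchors the match at the start, so the string must begin with Hello
\s        one whitespace character (space, tab, newline, ...)
\d\d\d    three digits
\d{4}     four digits
\w{10}    ten word characters (letters, digits, or underscore)
.         any character except a newline; with re.DOTALL it also matches newlines
*         zero or more; + is one or more; ? is zero or one (placed after another quantifier, ? makes it non-greedy)
Demo$     $ anchors the match at the end, so the string must end with Demo
'''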

print(result)
print(result.group())
print(result.span())

41
<_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_this is a Regex Demo'>
Hello 123 4567 World_this is a Regex Demo
(0, 41)
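
The quantifiers listed in the notes above can be checked one at a time. This is a minimal sketch, assuming `re.fullmatch` (not used in the original cell) so that the whole string has to match; the short sample strings are made up for illustration.

```python
import re

# Each entry: (pattern, string to test)
cases = [
    (r'ab*c', 'ac'),            # * allows zero b's
    (r'ab+c', 'ac'),            # + needs at least one b
    (r'ab?c', 'abbc'),          # ? allows at most one b
    (r'\d{4}', '4567'),         # {4} means exactly four digits
    (r'\w{10}', 'World_this'),  # \w covers letters, digits and underscore
]

# True when the pattern matches the entire string
results = [re.fullmatch(pattern, text) is not None for pattern, text in cases]

for (pattern, text), matched in zip(cases, results):
    print(f'{pattern!r:12} {text!r:14} matched={matched}')
```
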


## Wildcard matching: .*

In [6]:
import re

content = 'Hello 123 4567 World_this is a Regex Demo'
result = re.match('^Hello.*Demo$', content)
print(result)
print(result.group())

<_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_this is a Regex Demo'>
Hello 123 4567 World_this is a Regex Demo


## Capturing a target

In [9]:
import re

content = 'Hello 1234567 World_this is a Regex Demo'
result = re.match(r'^Hello\s(\d+)\sWorld.*Demo$', content)
print(result)
print(result.group(1))  # prints what the parenthesized group captured

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_this is a Regex Demo'>
1234567
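
When a pattern has more than one pair of parentheses, each group gets its own number, and `groups()` returns them all at once. The sketch below reuses the same sample string; the group name `tail` is a made-up label for illustration.

```python
import re

content = 'Hello 1234567 World_this is a Regex Demo'

# Two numbered groups plus one named group (the name "tail" is hypothetical)
m = re.match(r'^(\w+)\s(\d+)\s(?P<tail>\w+)', content)

first = m.group(1)       # first parenthesized group
number = m.group(2)      # second parenthesized group
tail = m.group('tail')   # named group, also reachable as m.group(3)
all_groups = m.groups()  # tuple of every captured group

print(first, number, tail)
print(all_groups)
```
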


## Greedy matching: match as much as possible

In [14]:
import re

content = 'Hello 1234567 World_this is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(0))
print(result.group(1))  # the parenthesized group; the greedy .* swallows 123456 and leaves only 7

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_this is a Regex Demo'>
Hello 1234567 World_this is a Regex Demo
7


## Non-greedy matching: ?

In [11]:
import re

content = 'Hello 1234567 World_this is a Regex Demo'
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))  # the parenthesized group; the non-greedy .*? stops before the first digit

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_this is a Regex Demo'>
1234567
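
The same difference shows up when matching HTML-like tags, which is where it matters most for scraping. A minimal sketch on a made-up one-line snippet:

```python
import re

snippet = '<b>bold</b>'

# Greedy: .* runs to the last > on the line, so the whole snippet is one match
greedy = re.findall(r'<.*>', snippet)

# Non-greedy: .*? stops at the first >, so each tag is matched separately
lazy = re.findall(r'<.*?>', snippet)

print(greedy)
print(lazy)
```
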


## Match flags

re.S (re.DOTALL) lets . match any character, including newlines

In [15]:
import re

content = '''Hello 1234567 World_this 
is a Regex Demo'''

result = re.match(r'^He.*?(\d+).*Demo$', content, re.S)
print(result)
print(result.group(1))  # the parenthesized group; with re.S the trailing .* can cross the line break

<_sre.SRE_Match object; span=(0, 41), match='Hello 1234567 World_this \nis a Regex Demo'>
1234567
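
To see what the flag changes, the same pattern can be run with and without it on the two-line string. A small sketch:

```python
import re

content = 'Hello 1234567 World_this \nis a Regex Demo'
pattern = r'^He.*?(\d+).*Demo$'

# Without re.S, . cannot cross the newline, so Demo is never reached
without_flag = re.match(pattern, content)

# With re.S, . also matches the newline and the match succeeds
with_flag = re.match(pattern, content, re.S)

print(without_flag)
print(with_flag.group(1))
```
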


## Escaping: put \ in front of a special character to match it literally

In [16]:
import re

content = 'This apple is $5.00'
result = re.match('This apple is $5.00', content)  # the unescaped $ means "end of string", so this cannot match
print(result)

None


In [17]:
import re

content = 'This apple is $5.00'
result = re.match(r'This apple is \$5\.00', content)
print(result)

<_sre.SRE_Match object; span=(0, 19), match='This apple is $5.00'>
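
When the text to match is a plain literal, the standard library's `re.escape` can add the backslashes instead of writing them by hand. A minimal sketch with the same sample string; the variable names are chosen here for illustration.

```python
import re

content = 'This apple is $5.00'
literal = 'This apple is $5.00'

# Used as-is, the literal fails because $ and . are special characters
unescaped = re.match(literal, content)

# re.escape backslash-escapes every special character in the literal
escaped = re.match(re.escape(literal), content)

print(unescaped)
print(escaped.group())
```
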


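# Summary: use .\* for wildcard matching, parentheses to capture the target, ? for non-greedy matching, and re.S when the text contains newlines

# re.search: scans the whole string and returns the first successful match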

In [21]:
import re

content = 'Extra strings Hello 1234567 World_this is a Regex Demo'
result = re.match(r'He.*?(\d+).*Demo$', content)  # re.match only tries at the start of the string
print(result)

None


In [23]:
import re

content = 'Extra strings Hello 1234567 World_this is a Regex Demo'
result = re.search(r'He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))

<_sre.SRE_Match object; span=(14, 54), match='Hello 1234567 World_this is a Regex Demo'>
1234567
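
The three entry points differ only in where the match is allowed to start and end. A side-by-side sketch on the same string (`re.fullmatch` is added here for comparison and does not appear in the original):

```python
import re

content = 'Extra strings Hello 1234567 World_this is a Regex Demo'
pattern = r'Hello\s(\d+)'

at_start = re.match(pattern, content)   # must match at position 0
anywhere = re.search(pattern, content)  # first match anywhere in the string
whole = re.fullmatch(pattern, content)  # the entire string must match

print(at_start)
print(anywhere.group(1), anywhere.start())
print(whole)
```
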


## Matching practice

In [32]:
import re

html= '''
<HTML>
<HEAD>
<TITLE>Alice's Adventures in Wonderland (Project Gutenberg)</TITLE>
</HEAD>

<frameset Rows="50, *">
<frame src="alice-finfo.html">
<frame src="alice-ftitle.html" name="alice-main">
</frameset>

<noframes>
<BODY>
<H1>Alice's Adventures in Wonderland</H1>
                          <H1>Lewis Carroll</H1>
               <H1>The Millennium Fulcrum Edition 3.0</H1>

NOTE:  This is a hypertext formatted version of the Project Gutenberg edition.
For more information, check the 
<A HREF="alice-small.txt">small print</A> 
or check out the 
<A HREF="ftp://uiarchive.cso.uiuc.edu/pub/etext/gutenberg/etext91/alice30.txt">
full ascii text</A>.  The original Tenniel illustrations are also available 
due to the efforts of Project Gutenberg.  You can if you like, grab them as a 
<A HREF="ftp://uiarchive.cso.uiuc.edu/pub/etext/gutenberg/etext94/algif10.zip">
"zip file"</A> or read the <A HREF="algif-small.txt">small print</A> 
that comes with them.
This document is part of a small, but growing collection of html formatted
etexts.  (Others may be found in either my <A
HREF="http://www.cs.cmu.edu/Web/People/rgs/rgs-home.html">home page</A> or 
John Ockerbloom's indexes by <A
HREF="http://www.cs.cmu.edu/Web/bookauthors.html">author</A> and <A
HREF="http://www.cs.cmu.edu/Web/booktitles.html">title</A>.)
I am still trying to figure out whether anyone else is interested in these 
on-line readable documents.  If you appreciate this document or would like to 
see more such, send me mail at "rgs@cs.cmu.edu". 
<P>
<A HREF="alice01a.gif"><IMG SRC="alice01th.gif"></A>
<P>
<H2>CONTENTS</H2>
<PRE>
    CHAPTER I:    <A HREF="alice-I.html">Down the Rabbit-Hole</A>
    CHAPTER II:   <A HREF="alice-II.html">The Pool of Tears</A>
    CHAPTER III:  <A HREF="alice-III.html">A Caucus-Race and a Long Tale</A>
    CHAPTER IV:   <A HREF="alice-IV.html">The Rabbit Sends in a Little Bill</A>
    CHAPTER V:    <A HREF="alice-V.html">Advice from a Caterpillar</A>
    CHAPTER VI:   <A HREF="alice-VI.html">Pig and Pepper</A>
    CHAPTER VII:  <A HREF="alice-VII.html">A Mad Tea-Party</A>
    CHAPTER VIII: <A HREF="alice-VIII.html">The Queen's Croquet-Ground</A>
    CHAPTER IX:   <A HREF="alice-IX.html">The Mock Turtle's Story</A>
    CHAPTER X:    <A HREF="alice-X.html">The Lobster Quadrille</A>
    CHAPTER XI:   <A HREF="alice-XI.html">Who Stole the Tarts?</A>
    CHAPTER XII:  <A HREF="alice-XII.html">Alice's Evidence</A>
</PRE>

<ADDRESS><A HREF="mailto:rgs@cs.cmu.edu">Robert Stockton</A></ADDRESS>
<P>
<!- Access counter added 5/25 1:49am ->
<A href="http://www.dbasics.com/cgi-bin/pages.cgi?143205747"><IMG SRC="http://www.dbasics.com/cgi-bin/counter.cgi?143205747.2&https://www.google.com.tw/"></A> Access statistics from htmlZine
<!- This page has been visited 
A HREF="http://counter.digits.com/wc?--info=yes&--name=rgsalice"
IMG SRC="http://counter.digits.com/wc/-d/4/-r/-z/rgsalice"
 ALIGN=absmiddle WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 ALT="????"/A
times since March 2, 1996. ->
</BODY>
</noframes>
</HTML>
'''

# Extract the chapter title
# Note: follow .* with ? (non-greedy), otherwise the match overshoots the target
result = re.search('CHAPTER.*?<A HREF="ali.*?">(.*?)</A>', html, re.S)
if result:
    print(result.group(1))

Down the Rabbit-Hole


# re.findall: returns every match as a list

In [46]:
import re

html= '''
<HTML>
<HEAD>
<TITLE>Alice's Adventures in Wonderland (Project Gutenberg)</TITLE>
</HEAD>

<frameset Rows="50, *">
<frame src="alice-finfo.html">
<frame src="alice-ftitle.html" name="alice-main">
</frameset>

<noframes>
<BODY>
<H1>Alice's Adventures in Wonderland</H1>
                          <H1>Lewis Carroll</H1>
               <H1>The Millennium Fulcrum Edition 3.0</H1>

NOTE:  This is a hypertext formatted version of the Project Gutenberg edition.
For more information, check the 
<A HREF="alice-small.txt">small print</A> 
or check out the 
<A HREF="ftp://uiarchive.cso.uiuc.edu/pub/etext/gutenberg/etext91/alice30.txt">
full ascii text</A>.  The original Tenniel illustrations are also available 
due to the efforts of Project Gutenberg.  You can if you like, grab them as a 
<A HREF="ftp://uiarchive.cso.uiuc.edu/pub/etext/gutenberg/etext94/algif10.zip">
"zip file"</A> or read the <A HREF="algif-small.txt">small print</A> 
that comes with them.
This document is part of a small, but growing collection of html formatted
etexts.  (Others may be found in either my <A
HREF="http://www.cs.cmu.edu/Web/People/rgs/rgs-home.html">home page</A> or 
John Ockerbloom's indexes by <A
HREF="http://www.cs.cmu.edu/Web/bookauthors.html">author</A> and <A
HREF="http://www.cs.cmu.edu/Web/booktitles.html">title</A>.)
I am still trying to figure out whether anyone else is interested in these 
on-line readable documents.  If you appreciate this document or would like to 
see more such, send me mail at "rgs@cs.cmu.edu". 
<P>
<A HREF="alice01a.gif"><IMG SRC="alice01th.gif"></A>
<P>
<H2>CONTENTS</H2>
<PRE>
    CHAPTER I:    <A HREF="alice-I.html">Down the Rabbit-Hole</A>
    CHAPTER II:   <A HREF="alice-II.html">The Pool of Tears</A>
    CHAPTER III:  <A HREF="alice-III.html">A Caucus-Race and a Long Tale</A>
    CHAPTER IV:   <A HREF="alice-IV.html">The Rabbit Sends in a Little Bill</A>
    CHAPTER V:    <A HREF="alice-V.html">Advice from a Caterpillar</A>
    CHAPTER VI:   <A HREF="alice-VI.html">Pig and Pepper</A>
    CHAPTER VII:  <A HREF="alice-VII.html">A Mad Tea-Party</A>
    CHAPTER VIII: <A HREF="alice-VIII.html">The Queen's Croquet-Ground</A>
    CHAPTER IX:   <A HREF="alice-IX.html">The Mock Turtle's Story</A>
    CHAPTER X:    <A HREF="alice-X.html">The Lobster Quadrille</A>
    CHAPTER XI:   <A HREF="alice-XI.html">Who Stole the Tarts?</A>
    CHAPTER XII:  <A HREF="alice-XII.html">Alice's Evidence</A>
</PRE>

<ADDRESS><A HREF="mailto:rgs@cs.cmu.edu">Robert Stockton</A></ADDRESS>
<P>
<!- Access counter added 5/25 1:49am ->
<A href="http://www.dbasics.com/cgi-bin/pages.cgi?143205747"><IMG SRC="http://www.dbasics.com/cgi-bin/counter.cgi?143205747.2&https://www.google.com.tw/"></A> Access statistics from htmlZine
<!- This page has been visited 
A HREF="http://counter.digits.com/wc?--info=yes&--name=rgsalice"
IMG SRC="http://counter.digits.com/wc/-d/4/-r/-z/rgsalice"
 ALIGN=absmiddle WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 ALT="????"/A
times since March 2, 1996. ->
</BODY>
</noframes>
</HTML>
'''

# Extract the chapter titles
# Note: follow .* with ? (non-greedy), otherwise the match overshoots the target
result = re.findall('CHAPTER.*?<A HREF="ali.*?">(.*?)</A>', html, re.S)
print(type(result))
print(result)

for i, title in enumerate(result, start=1):
    print(f"Chapter {i}: {title}")

<class 'list'>
['Down the Rabbit-Hole', 'The Pool of Tears', 'A Caucus-Race and a Long Tale', 'The Rabbit Sends in a Little Bill', 'Advice from a Caterpillar', 'Pig and Pepper', 'A Mad Tea-Party', "The Queen's Croquet-Ground", "The Mock Turtle's Story", 'The Lobster Quadrille', 'Who Stole the Tarts?', "Alice's Evidence"]
Chapter 1: Down the Rabbit-Hole
Chapter 2: The Pool of Tears
Chapter 3: A Caucus-Race and a Long Tale
Chapter 4: The Rabbit Sends in a Little Bill
Chapter 5: Advice from a Caterpillar
Chapter 6: Pig and Pepper
Chapter 7: A Mad Tea-Party
Chapter 8: The Queen's Croquet-Ground
Chapter 9: The Mock Turtle's Story
Chapter 10: The Lobster Quadrille
Chapter 11: Who Stole the Tarts?
Chapter 12: Alice's Evidence
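
With more than one group in the pattern, `re.findall` returns a list of tuples, and `re.finditer` yields match objects when positions or named groups are needed. A short sketch on two lines taken from the contents list above:

```python
import re

snippet = '''
    CHAPTER I:    <A HREF="alice-I.html">Down the Rabbit-Hole</A>
    CHAPTER II:   <A HREF="alice-II.html">The Pool of Tears</A>
'''

# Two groups -> each result is a (link, title) tuple
pairs = re.findall(r'<A HREF="(.*?)">(.*?)</A>', snippet)

# finditer yields match objects, so group() and start() are available per match
numerals = [m.group(1) for m in re.finditer(r'CHAPTER (\w+):', snippet)]

for (link, title), numeral in zip(pairs, numerals):
    print(numeral, link, title)
```
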


# re.sub: replaces every matched substring in the string

In [47]:
import re

content = 'Extra strings Hello 1234567 World_this is a Regex Demo'
result = re.sub(r'\d+', 'Replacement', content)
print(result)

Extra strings Hello Replacement World_this is a Regex Demo


In [49]:
import re

content = 'Extra strings Hello 1234567 World_this is a Regex Demo'
result = re.sub(r'(\d+)', r'\1 8910', content)  # \1 refers back to the first captured group
print(result)

Extra strings Hello 1234567 8910 World_this is a Regex Demo
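
The replacement can also be a function that receives each match object and returns the new text, which helps when the replacement has to be computed. A minimal sketch; the helper name `double` is made up for illustration.

```python
import re

content = 'Extra strings Hello 1234567 World_this is a Regex Demo'

def double(match):
    # match.group() is the matched digits; return the replacement as a string
    return str(int(match.group()) * 2)

result = re.sub(r'\d+', double, content)
print(result)
```
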


Replacing with an empty string is the same as deleting.

In [129]:
import re

html= '''
<HTML>
<HEAD>
<TITLE>Alice's Adventures in Wonderland (Project Gutenberg)</TITLE>
</HEAD>

<frameset Rows="50, *">
<frame src="alice-finfo.html">
<frame src="alice-ftitle.html" name="alice-main">
</frameset>

<noframes>
<BODY>
<H1>Alice's Adventures in Wonderland</H1>
                          <H1>Lewis Carroll</H1>
               <H1>The Millennium Fulcrum Edition 3.0</H1>

NOTE:  This is a hypertext formatted version of the Project Gutenberg edition.
For more information, check the 
<A HREF="alice-small.txt">small print</A> 
or check out the 
<A HREF="ftp://uiarchive.cso.uiuc.edu/pub/etext/gutenberg/etext91/alice30.txt">
full ascii text</A>.  The original Tenniel illustrations are also available 
due to the efforts of Project Gutenberg.  You can if you like, grab them as a 
<A HREF="ftp://uiarchive.cso.uiuc.edu/pub/etext/gutenberg/etext94/algif10.zip">
"zip file"</A> or read the <A HREF="algif-small.txt">small print</A> 
that comes with them.
This document is part of a small, but growing collection of html formatted
etexts.  (Others may be found in either my <A
HREF="http://www.cs.cmu.edu/Web/People/rgs/rgs-home.html">home page</A> or 
John Ockerbloom's indexes by <A
HREF="http://www.cs.cmu.edu/Web/bookauthors.html">author</A> and <A
HREF="http://www.cs.cmu.edu/Web/booktitles.html">title</A>.)
I am still trying to figure out whether anyone else is interested in these 
on-line readable documents.  If you appreciate this document or would like to 
see more such, send me mail at "rgs@cs.cmu.edu". 
<P>
<A HREF="alice01a.gif"><IMG SRC="alice01th.gif"></A>
<P>
<H2>CONTENTS</H2>
<PRE>
    CHAPTER I:    <A HREF="alice-I.html">Down the Rabbit-Hole</A>
    CHAPTER II:   <A HREF="alice-II.html">The Pool of Tears</A>
    CHAPTER III:  <A HREF="alice-III.html">A Caucus-Race and a Long Tale</A>
    CHAPTER IV:   <A HREF="alice-IV.html">The Rabbit Sends in a Little Bill</A>
    CHAPTER V:    <A HREF="alice-V.html">Advice from a Caterpillar</A>
    CHAPTER VI:   <A HREF="alice-VI.html">Pig and Pepper</A>
    CHAPTER VII:  <A HREF="alice-VII.html">A Mad Tea-Party</A>
    CHAPTER VIII: <A HREF="alice-VIII.html">The Queen's Croquet-Ground</A>
    CHAPTER IX:   <A HREF="alice-IX.html">The Mock Turtle's Story</A>
    CHAPTER X:    <A HREF="alice-X.html">The Lobster Quadrille</A>
    CHAPTER XI:   <A HREF="alice-XI.html">Who Stole the Tarts?</A>
    CHAPTER XII:  <A HREF="alice-XII.html">Alice's Evidence</A>
</PRE>

<ADDRESS><A HREF="mailto:rgs@cs.cmu.edu">Robert Stockton</A></ADDRESS>
<P>
<!- Access counter added 5/25 1:49am ->
<A href="http://www.dbasics.com/cgi-bin/pages.cgi?143205747"><IMG SRC="http://www.dbasics.com/cgi-bin/counter.cgi?143205747.2&https://www.google.com.tw/"></A> Access statistics from htmlZine
<!- This page has been visited 
A HREF="http://counter.digits.com/wc?--info=yes&--name=rgsalice"
IMG SRC="http://counter.digits.com/wc/-d/4/-r/-z/rgsalice"
 ALIGN=absmiddle WIDTH=60 HEIGHT=20 BORDER=0 HSPACE=4 ALT="????"/A
times since March 2, 1996. ->
</BODY>
</noframes>
</HTML>
'''

result_1 = re.search('<PRE>(.*?)</PRE>', html, re.S)
print(type(result_1.group(1)))
print(result_1.group(1))

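# re.S has to be passed as flags=...; the 4th positional argument of re.sub is count,
# so passing re.S there would cap the number of replacements at 16.
# .*? keeps each match inside a single tag now that . can cross newlines.
result_2 = re.sub('<A HREF=.*?">|</A>', '', result_1.group(1), flags=re.S)
print(type(result_2))
print(result_2)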

<class 'str'>

    CHAPTER I:    <A HREF="alice-I.html">Down the Rabbit-Hole</A>
    CHAPTER II:   <A HREF="alice-II.html">The Pool of Tears</A>
    CHAPTER III:  <A HREF="alice-III.html">A Caucus-Race and a Long Tale</A>
    CHAPTER IV:   <A HREF="alice-IV.html">The Rabbit Sends in a Little Bill</A>
    CHAPTER V:    <A HREF="alice-V.html">Advice from a Caterpillar</A>
    CHAPTER VI:   <A HREF="alice-VI.html">Pig and Pepper</A>
    CHAPTER VII:  <A HREF="alice-VII.html">A Mad Tea-Party</A>
    CHAPTER VIII: <A HREF="alice-VIII.html">The Queen's Croquet-Ground</A>
    CHAPTER IX:   <A HREF="alice-IX.html">The Mock Turtle's Story</A>
    CHAPTER X:    <A HREF="alice-X.html">The Lobster Quadrille</A>
    CHAPTER XI:   <A HREF="alice-XI.html">Who Stole the Tarts?</A>
    CHAPTER XII:  <A HREF="alice-XII.html">Alice's Evidence</A>

<class 'str'>

    CHAPTER I:    Down the Rabbit-Hole
    CHAPTER II:   The Pool of Tears
    CHAPTER III:  A Caucus-Race and a Long Tale
    CHAPTER IV:   The

In [130]:
import re

html_1 = '''
    CHAPTER I:    <A HREF="alice-I.html">Down the Rabbit-Hole</A>
    CHAPTER II:   <A HREF="alice-II.html">The Pool of Tears</A>
    CHAPTER III:  <A HREF="alice-III.html">A Caucus-Race and a Long Tale</A>
    CHAPTER IV:   <A HREF="alice-IV.html">The Rabbit Sends in a Little Bill</A>
    CHAPTER V:    <A HREF="alice-V.html">Advice from a Caterpillar</A>
    CHAPTER VI:   <A HREF="alice-VI.html">Pig and Pepper</A>
    CHAPTER VII:  <A HREF="alice-VII.html">A Mad Tea-Party</A>
    CHAPTER VIII: <A HREF="alice-VIII.html">The Queen's Croquet-Ground</A>
    CHAPTER IX:   <A HREF="alice-IX.html">The Mock Turtle's Story</A>
    CHAPTER X:    <A HREF="alice-X.html">The Lobster Quadrille</A>
    CHAPTER XI:   <A HREF="alice-XI.html">Who Stole the Tarts?</A>
    CHAPTER XII:  <A HREF="alice-XII.html">Alice's Evidence</A>
'''

result_2 = re.sub('<A HREF=".*">|</A>', '', html_1, re.S)
print(type(result_2))
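# Why are the last few lines left untouched? The 4th positional argument of re.sub
# is count, not flags. re.S equals 16, so the call stops after 16 replacements:
# 8 opening tags plus 8 closing tags, which is exactly the first 8 lines.
print(result_2)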

<class 'str'>

    CHAPTER I:    Down the Rabbit-Hole
    CHAPTER II:   The Pool of Tears
    CHAPTER III:  A Caucus-Race and a Long Tale
    CHAPTER IV:   The Rabbit Sends in a Little Bill
    CHAPTER V:    Advice from a Caterpillar
    CHAPTER VI:   Pig and Pepper
    CHAPTER VII:  A Mad Tea-Party
    CHAPTER VIII: The Queen's Croquet-Ground
    CHAPTER IX:   <A HREF="alice-IX.html">The Mock Turtle's Story</A>
    CHAPTER X:    <A HREF="alice-X.html">The Lobster Quadrille</A>
    CHAPTER XI:   <A HREF="alice-XI.html">Who Stole the Tarts?</A>
    CHAPTER XII:  <A HREF="alice-XII.html">Alice's Evidence</A>
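
The fourth positional parameter of `re.sub` is `count`, so flags have to be passed by keyword. The sketch below reproduces the capped behaviour and the keyword fix on generated lines; the placeholder titles are made up, and `count=int(re.S)` spells out what passing `re.S` positionally amounts to.

```python
import re

# Twelve lines shaped like the contents list (titles are placeholders)
lines = [f'    CHAPTER {n}: <A HREF="alice-{n}.html">Title {n}</A>' for n in range(1, 13)]
html_1 = '\n'.join(lines)

# What passing re.S as the 4th positional argument amounts to: count=16
capped = re.sub(r'<A HREF=".*">|</A>', '', html_1, count=int(re.S))

# Keyword flags, plus non-greedy .*? so a match never runs past one tag
fixed = re.sub(r'<A HREF=".*?">|</A>', '', html_1, flags=re.S)

print(capped.count('<A HREF'))  # tags left over after 16 replacements
print(fixed)
```

Without `re.S` the greedy version also works line by line here, because `.` stops at each newline; the non-greedy form is the safer habit once the flag is on.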



# re.compile: compiles a regex string into a pattern object that can be reused

In [100]:
import re

content = '''Hello 1234567 
World_this is a Regex Demo'''

pattern = re.compile('Hello.*Demo', re.S)
result = re.match(pattern, content)
print(result)

<_sre.SRE_Match object; span=(0, 41), match='Hello 1234567 \nWorld_this is a Regex Demo'>
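
A compiled pattern also carries its own `match`, `search`, `findall` and `sub` methods, so it can be reused without going through the module-level functions. A small sketch; the pattern name `chapter` and the sample lines are chosen here for illustration.

```python
import re

# Compile once, reuse for every line
chapter = re.compile(r'CHAPTER\s+(\w+):\s+(.*)')

lines = [
    'CHAPTER I:    Down the Rabbit-Hole',
    'CHAPTER II:   The Pool of Tears',
    'not a chapter line',
]

parsed = []
for line in lines:
    m = chapter.match(line)  # same behaviour as re.match(chapter, line)
    if m:
        parsed.append(m.groups())

print(parsed)
```
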


# Hands-on practice: before and after each target, a "keyword" followed by .\*? is enough to skip ahead

In [143]:
import re
import requests

# Fetch the page source as text
content = requests.get('http://book.douban.com/').text
print(type(content))
print(content)

<class 'str'>


<!DOCTYPE html>
<html lang="zh-CN" class=" book-new-nav">
  <head>
    <meta charset="utf-8">
    <meta http-equiv="Pragma" content="no-cache">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <meta http-equiv="Expires" content="Sun, 6 Mar 2005 01:00:00 GMT">
    
  <meta http-equiv="mobile-agent" content="format=xhtml; url=http://m.douban.com/book/">
  <meta name="keywords" content="豆瓣读书,新书速递,畅销书,书评,书单"/>
  <meta name="description" content="记录你读过的、想读和正在读的书，顺便打分，添加标签及个人附注，写评论。根据你的口味，推荐适合的书给你。" />
  <meta name="verify-v1" content="EYARGSAVd5U+06FeTmxO8Mj28Fc/hM/9PqMfrlMo8YA=">
  <meta property="wb:webmaster" content="7c86191e898cd20d">
  <meta property="qc:admins" content="1520412177364752166375">

    <title>
    豆瓣读书
</title>
    <link rel="shortcut icon" href="https://img3.doubanio.com/favicon.ico"
      type="image/x-icon">
    <script src="https://img3.doubanio.com/f/book/0495cb173e298c28593766009c7b0a953246c5b5/js/book/lib/jquery/jquery.js"></

In [174]:
# Scrape combined book details from an online bookseller's page

import re
import requests

content = requests.get('http://book.douban.com/').text
results = re.findall('<li.*?cover.*?href="(.*?)".*?title="(.*?)">.*?more-meta.*?author">(.*?)</span>.*?year">(.*?)</span>.*?publisher">(.*?)</span>.*?</li>', content, re.S)

print(type(results))

for result in results:
    url, name, author, year, publisher = result
    # remove all whitespace from the captured fields
    author = re.sub(r'\s', "", author)
    year = re.sub(r'\s', "", year)
    publisher = re.sub(r'\s', "", publisher)
    print(url, name, author, year, publisher)


<class 'list'>
https://book.douban.com/subject/27052524/?icn=index-editionrecommend 时间煮雨 百草园 2017-5 北京联合出版社
https://book.douban.com/subject/27034242/?icn=index-editionrecommend 停靠，一座城 李婧&nbsp;/&nbsp;村上春花 2017-5-1 新星出版社
https://book.douban.com/subject/27073634/?icn=index-editionrecommend 巴慕达 [日]守山久子 2017-6-1 电子工业出版社
https://book.douban.com/subject/27000152/?icn=index-editionrecommend 海洋与文明 [美]林肯·佩恩 2017-4 后浪丨天津人民出版社
https://book.douban.com/subject/27046761/?icn=index-editionrecommend 1987，我们的红楼梦 欧阳奋强 2017-6 中国轻工业出版社
https://book.douban.com/subject/26842992/?icn=index-latestbook-subject 绛红雪白的花瓣 [英]米歇尔·法柏 2017-5 重庆出版社
https://book.douban.com/subject/27054637/?icn=index-latestbook-subject 隐性逻辑 [德]卡尔·诺顿 2017-7-1 九州出版社丨阳光博客
https://book.douban.com/subject/26956486/?icn=index-latestbook-subject 无根之木 [美]娜奥米·诺维克 2017-5 天地出版社
https://book.douban.com/subject/26865132/?icn=index-latestbook-subject 如何读懂经典 [英]亨利·希金斯 2017-6-10 中信出版集团/楚尘文化
https://book.douban.com/subject/27014578/?icn=index-latestboo
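
The `&nbsp;` left in one of the author fields above is an HTML entity, which the standard library's `html.unescape` can decode. The sketch below runs the same pattern offline on a hypothetical snippet shaped like what the pattern expects (it is not copied from the real page), and trims with `strip()` rather than deleting all whitespace so that spaces inside names survive.

```python
import html
import re

# Hypothetical markup; class names follow the keywords used in the pattern
sample = '''
<li class="">
  <div class="cover"><a href="https://example.com/subject/1/" title="Sample Book"><img></a></div>
  <div class="more-meta">
    <span class="author">
      A. Writer&nbsp;/&nbsp;B. Author
    </span>
    <span class="year">
      2017-5
    </span>
    <span class="publisher">
      Example Press
    </span>
  </div>
</li>
'''

pattern = re.compile(
    r'<li.*?cover.*?href="(.*?)".*?title="(.*?)">.*?more-meta.*?author">(.*?)</span>'
    r'.*?year">(.*?)</span>.*?publisher">(.*?)</span>.*?</li>',
    re.S,
)

rows = []
for url, name, author, year, publisher in pattern.findall(sample):
    # decode entities, turn non-breaking spaces into plain spaces, then trim
    author = html.unescape(author).replace('\xa0', ' ').strip()
    rows.append((url, name, author, year.strip(), publisher.strip()))

print(rows)
```
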

In [7]:
# Scrape the Diagnosaurus differential-diagnosis list for the letter A
import re
import requests

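#content = requests.get('https://accessmedicine.mhmedical.com/Diagnosaurus.aspx?categoryid=41309&selectedletter=A').text
# The site does not serve the page to a plain requests call, so Selenium is used instead
from selenium import webdriver
# Note: webdriver.PhantomJS was removed in Selenium 4; on current versions a headless Chrome or Firefox driver would be needed instead
browser = webdriver.PhantomJS()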
url = "https://accessmedicine.mhmedical.com/Diagnosaurus.aspx?categoryid=41309&selectedletter=A"
browser.get(url)
content = browser.page_source

result_dx_A = re.findall('"leftNav".*?"span10".*?"keep-params">(.*?)</a>.*?span', content, re.S)
print(type(result_dx_A))
print("dx_A number = "+str(len(result_dx_A)))

for dx in result_dx_A:
    print(dx)

<class 'list'>
dx_A number = 134
Abdominal aortic aneurysm
Abdominal pain
Abdominal pain and fever
Abdominal pain and hematuria
Abdominal pain and rash
Abdominal pain and weight loss
Abdominal pain in women
Abdominal pain, generalized
Abdominal pain, left lower quadrant
Abdominal pain, left upper quadrant
Abdominal pain, right lower quadrant
Abdominal pain, right upper quadrant
Abdominal pain, upper or epigastric
Abnormal premenopausal bleeding (increased or irregular)
Absent or decreased pulse
Acanthosis nigricans
Acetaminophen poisoning
Achalasia
Acne vulgaris
Acromegaly and gigantism
Actinic keratoses
Actinomycosis
Acute (angle-closure) glaucoma
Acute adrenal insufficiency
Acute bacterial prostatitis
Acute bronchitis
Acute cholecystitis
Acute colonic pseudo-obstruction (Ogilvie's syndrome)
Acute cough
Acute cough and fever
Acute cough and shortness of breath
Acute cough in hospitalized patient
Acute cystitis
Acute diarrhea
Acute fatty liver of pregnancy
Acute glaucoma
Acute hepatic 

# Scraping images from the PTT Beauty board

In [49]:
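'''
Plan:
1. Find the URL of the PTT Beauty board and get the board's source code
2. From the board's source code, match the list of post URLs
3. From each post's source code, match the list of image URLs in that post
4. For each image URL, download and save the image (write the binary content to a .jpg file)
'''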

from selenium import webdriver
from selenium.webdriver.common.by import By

import requests
import re

import os
from urllib.request import urlretrieve
from urllib import request, error
    
browser = webdriver.PhantomJS()
url = "https://www.ptt.cc/bbs/beauty/index.html"
browser.get(url)
browser.implicitly_wait(3)
# Prefer Selenium for the page source; requests may get blocked
content_beauty = browser.page_source 

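'''
# skip deleted posts
# fetching topics with Selenium returns a list of elements
topic = browser.find_elements_by_class_name('title')
'''

# Post links (topic_url) are extracted with regex, which returns a list of strings
topic_url = re.findall('mark.*?title.*?href="(.*?)">.*?</a>', content_beauty, re.S)

# Check that the post list was actually captured
'''
# skip deleted posts
print(len(topic)) # this is a list of elements
'''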
print("PTT_Beauty文章列表\n")
print(topic_url)  # this is a list of strings
print("\n")

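# Loop over all but the last 4 entries of topic_url, because those are the pinned board-rules posts
for i in range(0, len(topic_url)-4):
    
    # find_elements returns a list of elements, so .text is needed to read each one's content
    browser.get("https://www.ptt.cc"+topic_url[i])
    topic = browser.find_elements_by_class_name('article-meta-value') 
    # the list holds author [0], board [1], title [2], time [3]
    print("標題: "+topic[2].text+" 作者: "+topic[0].text+" 時間: "+topic[3].text+"\n")

    # get the post's source code; Selenium is the safer choice
    content_topic = browser.page_source

    # collect the image URLs found in the post into a list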
    pic_url = re.findall('<a href=".*?" target="_blank.*?nofollow.*?">(.*?)</a>', content_topic, re.S)

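    # Loop over all but the last entry of pic_url, because the last one is the link pinned at the bottom of the post
    # (j is used so the inner loop does not reuse the outer loop's i)
    for j in range(0, len(pic_url)-1):
        print(pic_url[j]+"\n")  # print each image link to confirm the URLs are right
        
        pic = requests.get(pic_url[j])
        
        # file name = post title + loop index; the .jpg suffix lets image viewers open it directly
        pic_title = topic[2].text+"_"+str(j)+".jpg"
        
        # write the downloaded bytes to the file; the with block closes it automatically
        with open(pic_title,'wb') as f:
            f.write(pic.content)
        # files are saved in the directory the script is run from (not where the script file lives)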
            
browser.close()

PTT_Beauty文章列表

['/bbs/Beauty/M.1499693589.A.A2B.html', '/bbs/Beauty/M.1499696636.A.095.html', '/bbs/Beauty/M.1499699086.A.142.html', '/bbs/Beauty/M.1499699320.A.D3A.html', '/bbs/Beauty/M.1499701494.A.D0D.html', '/bbs/Beauty/M.1499703757.A.916.html', '/bbs/Beauty/M.1499704450.A.301.html', '/bbs/Beauty/M.1499704746.A.21A.html', '/bbs/Beauty/M.1499731401.A.426.html', '/bbs/Beauty/M.1499733684.A.BCF.html', '/bbs/Beauty/M.1499734124.A.76D.html', '/bbs/Beauty/M.1499737722.A.816.html', '/bbs/Beauty/M.1499746025.A.5E5.html', '/bbs/Beauty/M.1443906121.A.65B.html', '/bbs/Beauty/M.1423752558.A.849.html', '/bbs/Beauty/M.1430099938.A.3B7.html', '/bbs/Beauty/M.1476111251.A.C20.html']


標題: [神人] 服飾粉絲團的麻豆 作者: rocky9137No2 (麥寮衛生棉) 時間: Mon Jul 10 21:33:06 2017

https://goo.gl/xczaVP

https://goo.gl/RjNT5Y

標題: [正妹] 喜歡你 作者: wow919980 (wow) 時間: Mon Jul 10 22:23:53 2017

http://i.imgur.com/VGnEorZ.jpg

http://i.imgur.com/3ndA3J0.jpg

http://i.imgur.com/E49R2dU.jpg

http://i.imgur.com/bHqpy91.jpg

http://i
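
Some of the links printed above (the goo.gl ones) are short links rather than direct image files, and post titles can contain characters that are not valid in file names. The sketch below shows one way regex could help with both; the names `image_ext` and `safe_filename` are hypothetical helpers, not part of the original script, and the set of replaced characters is an assumption about common file systems.

```python
import re

links = [
    'https://goo.gl/xczaVP',
    'http://i.imgur.com/VGnEorZ.jpg',
    'http://i.imgur.com/3ndA3J0.jpg',
]

# Keep only URLs that end in a common image extension (case-insensitive)
image_ext = re.compile(r'\.(?:jpe?g|png|gif)$', re.I)
image_links = [url for url in links if image_ext.search(url)]

def safe_filename(title, index):
    # Replace characters that many file systems reject in file names
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return f'{cleaned}_{index}.jpg'

print(image_links)
print(safe_filename('Re: a/b?', 1))
```
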