# Web Scraping

## 1.1 Understanding Web Page Structure

To display Chinese text correctly, call `decode()` on the result of `read()` so that the raw bytes are converted into a readable string.

In [165]:
from urllib.request import urlopen

# if the page contains Chinese, apply decode()
html=urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')

print(html)

<!DOCTYPE html>
<html lang="cn">
<head>
	<meta charset="UTF-8">
	<title>Scraping tutorial 1 | 莫烦Python</title>
	<link rel="icon" href="https://morvanzhou.github.io/static/img/description/tab_icon.png">
</head>
<body>
	<h1>爬虫测试1</h1>
	<p>
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
		<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
	</p>

</body>
</html>
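
`read()` returns raw `bytes`, and `decode('utf-8')` turns them into a `str`. The following is a minimal offline sketch of that difference; the hard-coded byte string is an assumption standing in for a real network response.

```python
# Stand-in for the bytes that read() would return (no network needed).
raw = '<h1>爬虫测试1</h1>'.encode('utf-8')

print(type(raw))   # bytes: each Chinese character is stored as several byte values
print(raw)

text = raw.decode('utf-8')
print(type(text))  # str: Chinese characters display normally
print(text)
```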


In [166]:
import re
res=re.findall(r'<title>(.+?)</title>',html)
print('\nPage title is: {}'.format(res[0]))


Page title is: Scraping tutorial 1 | 莫烦Python
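
The `?` in `(.+?)` makes the match non-greedy. A small sketch of why that matters, using a hypothetical snippet with two `<title>` elements (not taken from the page above):

```python
import re

# Hypothetical snippet with two <title> elements, to make the difference visible.
snippet = '<title>first</title><title>second</title>'

lazy = re.findall(r'<title>(.+?)</title>', snippet)   # stops at the nearest </title>
greedy = re.findall(r'<title>(.+)</title>', snippet)  # runs on to the last </title>

print(lazy)
print(greedy)
```

The non-greedy form returns each title separately, while the greedy form swallows everything between the first opening tag and the last closing tag.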


In [167]:
res=re.findall(r'<p>(.*?)</p>',html,flags=re.DOTALL)
print('\nPage paragraph is: {}'.format(res[0]))


Page paragraph is: 
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
		<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
	


In [168]:
res=re.findall(r'href="(.*?)"',html)
print('\nAll links: {}'.format(res))


All links: ['https://morvanzhou.github.io/static/img/description/tab_icon.png', 'https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/data-manipulation/scraping/']
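
The pattern `href="(.*?)"` matches every `href` attribute, so the result above also includes the icon URL from the `<link>` tag. One possible way to keep only anchor links is sketched below. The narrower pattern is an illustration only and does not cover every HTML variation (single quotes, unusual spacing), which is one reason an HTML parser is usually the safer choice.

```python
import re

# Trimmed copy of the page above, so the example runs offline.
html = '''
<link rel="icon" href="https://morvanzhou.github.io/static/img/description/tab_icon.png">
<a href="https://morvanzhou.github.io/">莫烦Python</a>
<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a>
'''

all_hrefs = re.findall(r'href="(.*?)"', html)                # every href attribute
anchor_hrefs = re.findall(r'<a\s[^>]*?href="(.*?)"', html)   # only href inside <a ...>

print(len(all_hrefs), len(anchor_hrefs))
print(anchor_hrefs)
```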


## Parsing Web Pages with BeautifulSoup
BeautifulSoup simplifies extracting content from a page.

[Official documentation (Chinese)](https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/)

In [169]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [170]:
html = urlopen(
    "https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
print(html)

<!DOCTYPE html>
<html lang="cn">
<head>
	<meta charset="UTF-8">
	<title>Scraping tutorial 1 | 莫烦Python</title>
	<link rel="icon" href="https://morvanzhou.github.io/static/img/description/tab_icon.png">
</head>
<body>
	<h1>爬虫测试1</h1>
	<p>
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
		<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
	</p>

</body>
</html>


In [171]:
soup=BeautifulSoup(html,features='lxml')
print(soup.h1)
print('\n',soup.p)

<h1>爬虫测试1</h1>

 <p>
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
	</p>


In [172]:
all_href=soup.find_all('a')
print(all_href)
#all_href=[l['href'] for l in all_href]
#print('\n',all_href)
for l in all_href:
    print(l['href'])

[<a href="https://morvanzhou.github.io/">莫烦Python</a>, <a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a>]
https://morvanzhou.github.io/
https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
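
For comparison, the same link collection can be approximated with only the standard library. This is a rough sketch built on `html.parser.HTMLParser`, not how BeautifulSoup works internally; the class name `LinkCollector` is made up for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)

# Trimmed copy of the paragraph from the page above.
html = '''<p>
<a href="https://morvanzhou.github.io/">莫烦Python</a>
<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a>
</p>'''

collector = LinkCollector()
collector.feed(html)
print(collector.links)
```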


In [173]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [174]:
html=urlopen(
    "https://morvanzhou.github.io/static/scraping/list.html"
).read().decode('utf-8')

In [175]:
soup=BeautifulSoup(html,features='lxml')

month=soup.find_all('li',{"class":"month"})
# specify the class with a dictionary
for m in month:
    print(m.get_text())

一月
二月
三月
四月
五月


In [176]:
jan=soup.find('ul',{"class":"jan"})
d_jan=jan.find_all('li')
for d in d_jan:
    print(d.get_text())

一月一号
一月二号
一月三号


In [177]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

html=urlopen(
    'https://morvanzhou.github.io/static/scraping/table.html'
).read().decode('utf-8')

In [178]:
soup=BeautifulSoup(html,features='lxml')

img_links=soup.find_all('img',{'src':re.compile(r'.*?\.jpg')})
for link in img_links:
    print(link['src'])

https://morvanzhou.github.io/static/img/course_cover/tf.jpg
https://morvanzhou.github.io/static/img/course_cover/rl.jpg
https://morvanzhou.github.io/static/img/course_cover/scraping.jpg
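
The backslash before the dot in the pattern matters, because an unescaped `.` matches any character. A short sketch with hypothetical `src` values (only the first comes from the page above) shows the difference:

```python
import re

# Hypothetical src values; only the first is a real .jpg path.
srcs = [
    'https://morvanzhou.github.io/static/img/course_cover/tf.jpg',
    '/static/img/logo.png',
    '/static/img/notajpg',
]

escaped = re.compile(r'\.jpg$')  # literal dot, anchored at the end of the value
unescaped = re.compile(r'.jpg')  # "." matches any character, so "ajpg" also qualifies

strict = [s for s in srcs if escaped.search(s)]
loose = [s for s in srcs if unescaped.search(s)]

print(strict)
print(loose)
```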


In [179]:
course_links=soup.find_all('a',{'href':re.compile('https://morvan.*')})

for link in course_links:
    print(link['href'])

https://morvanzhou.github.io/
https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/
https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/
https://morvanzhou.github.io/tutorials/data-manipulation/scraping/
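
The scraping tutorial link appears twice in the output above. If duplicates are unwanted, one simple option is to de-duplicate while keeping the original order, sketched here on a copy of that output:

```python
# Copy of the links printed above.
links = [
    'https://morvanzhou.github.io/',
    'https://morvanzhou.github.io/tutorials/data-manipulation/scraping/',
    'https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/',
    'https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/',
    'https://morvanzhou.github.io/tutorials/data-manipulation/scraping/',
]

# dict keys are unique and keep insertion order, so this drops later repeats.
unique_links = list(dict.fromkeys(links))

for link in unique_links:
    print(link)
```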


## Crawling Baidu Baike

In [180]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random

Set the starting page, and store every `/item/...` page in `his` as a record of the pages we have visited.

In [181]:
base_url='https://baike.baidu.com'
his = ['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711']

In [182]:
url=base_url+his[-1]

html=urlopen(url).read().decode('utf-8')
soup=BeautifulSoup(html,features='lxml')
print(soup.find('h1').get_text(),'url: {}'.format(his[-1]))

网络爬虫 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
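
The `%E7%BD%91...` sequences in the path are percent-encoded UTF-8 bytes. As a small illustration, `urllib.parse` can convert between the encoded and readable forms:

```python
from urllib.parse import quote, unquote

path = '/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711'

readable = unquote(path)   # decode the percent-encoded bytes as UTF-8
print(readable)

encoded = quote(readable)  # encode again; "/" is left untouched by default
print(encoded == path)
```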


In [183]:
sub_urls=soup.find_all('a',
                      {
                          'target':'_blank',
                          'href':re.compile('/item/(%.{2})+$')
                      })
#print(sub_urls)
if len(sub_urls)!=0:
    new=random.sample(sub_urls,1)[0]['href']
    his.append(new)
else:
    his.pop()
print(his)

['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711', '/item/%E7%88%AC%E8%99%AB%E7%A8%8B%E5%BA%8F']
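
The `href` filter `/item/(%.{2})+$` only accepts paths that consist of `/item/` followed by percent-encoded triples up to the end of the string. A sketch of how it behaves on a few candidate paths (the third one is a made-up non-item link):

```python
import re

item_pattern = re.compile('/item/(%.{2})+$')

candidates = [
    '/item/%E7%88%AC%E8%99%AB%E7%A8%8B%E5%BA%8F',          # only encoded triples: accepted
    '/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711',  # trailing /5162711: rejected
    '/wikicategory/view?categoryName=test',                # hypothetical non-item link: rejected
]

matching = [c for c in candidates if item_pattern.search(c)]
print(matching)
```

Note that the starting page itself would not pass this filter, because its path ends with a numeric ID.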


In [210]:
for i in range(10):
    url=base_url+his[-1]
    
    html=urlopen(url).read().decode('utf-8')
    soup=BeautifulSoup(html,features='lxml')
    print(i+1,soup.find('h1').get_text(),'url: {}'.format(his[-1]))
    
    sub_urls=soup.find_all('a',
                          {
                              'target':'_blank',
                              'href':re.compile('/item/(%.{2})+$')
                          })
    if len(sub_urls)!=0:
        his.append(random.sample(sub_urls,1)[0]['href'])
    else:
        his.pop()

1 第二次世界大战 url: /item/%E7%AC%AC%E4%BA%8C%E6%AC%A1%E4%B8%96%E7%95%8C%E5%A4%A7%E6%88%98
2 捷克斯洛伐克 url: /item/%E6%8D%B7%E5%85%8B%E6%96%AF%E6%B4%9B%E4%BC%90%E5%85%8B
3 古斯塔夫·胡萨克 url: /item/%E5%8F%A4%E6%96%AF%E5%A1%94%E5%A4%AB%C2%B7%E8%83%A1%E8%90%A8%E5%85%8B
4 天鹅绒革命 url: /item/%E5%A4%A9%E9%B9%85%E7%BB%92%E9%9D%A9%E5%91%BD
5 实验员 url: /item/%E5%AE%9E%E9%AA%8C%E5%91%98
6 国家职业资格 url: /item/%E5%9B%BD%E5%AE%B6%E8%81%8C%E4%B8%9A%E8%B5%84%E6%A0%BC
7 项目管理师 url: /item/%E9%A1%B9%E7%9B%AE%E7%AE%A1%E7%90%86%E5%B8%88
8 工程勘察 url: /item/%E5%B7%A5%E7%A8%8B%E5%8B%98%E5%AF%9F
9 遥感技术 url: /item/%E9%81%A5%E6%84%9F%E6%8A%80%E6%9C%AF
10 海南岛 url: /item/%E6%B5%B7%E5%8D%97%E5%B2%9B
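
The loop treats `his` as a stack: it moves forward to a random link when one exists and steps back with `pop()` at a dead end. The sketch below replays that logic offline on a tiny in-memory "site". The `pages` dictionary and the `crawl` helper are hypothetical, and the emptiness check is an added guard, since `his[-1]` would raise `IndexError` if every page on the stack were popped.

```python
import random

# Hypothetical in-memory "site": each path maps to the /item/ links found on that page.
pages = {
    '/item/A': ['/item/B', '/item/C'],
    '/item/B': ['/item/C'],
    '/item/C': [],  # dead end: no outgoing links
}

def crawl(start, steps, rng):
    his = [start]
    visited = []
    for _ in range(steps):
        if not his:  # guard: nothing left to go back to
            break
        current = his[-1]
        visited.append(current)
        sub_urls = pages[current]
        if sub_urls:
            his.append(rng.choice(sub_urls))  # go forward to a random link
        else:
            his.pop()                         # dead end: go back one page
    return visited, his

visited, his = crawl('/item/A', steps=6, rng=random.Random(0))
print(visited)
print(his)
```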



# Regular Expressions

## Importing the Module

In [184]:
import re

## Simple Matching in Python

In [185]:
pattern1='cat'
pattern2='bird'
string='dog runs to cat'
print(pattern1 in string)
print(pattern2 in string)

True
False


## Finding a Match with a Regular Expression

In [186]:
pattern1='cat'
pattern2='bird'
string='dog runs to cat'
print(re.search(pattern1,string))


<_sre.SRE_Match object; span=(12, 15), match='cat'>
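
`re.search` returns a match object on success and `None` otherwise. A brief sketch of how the result is typically inspected:

```python
import re

string = 'dog runs to cat'

match = re.search('cat', string)
if match:                     # a match object is truthy
    print(match.group())      # the matched text
    print(match.span())       # (start, end) indices into the string
    print(string[match.start():match.end()])

missing = re.search('bird', string)
print(missing)                # None when nothing matches
```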


## Matching Multiple Possibilities with []

In [187]:
ptn=r'r[au]n'
# r[au]n matches both "ran" and "run"
print(re.search(ptn,string))
string='dog ran to cat'
print(re.search(ptn,string))

<_sre.SRE_Match object; span=(4, 7), match='run'>
<_sre.SRE_Match object; span=(4, 7), match='ran'>


In [188]:
print(re.search(r'r[A-Z]n','dog runs to cat'))
print(re.search(r'r[0-9a-z]n','dog runs to cat'))

None
<_sre.SRE_Match object; span=(4, 7), match='run'>


## Special Character Classes

### Digits

In [189]:
# \d : any digit
print(re.search(r'r\dn','run r4n'))
# \D : any non-digit
print(re.search(r'r\Dn','run r4n'))

<_sre.SRE_Match object; span=(4, 7), match='r4n'>
<_sre.SRE_Match object; span=(0, 3), match='run'>


### Whitespace

In [190]:
# \s : any whitespace character
# \S : any non-whitespace character
print(re.search(r'r\sn','r\nn r4n'))
print(re.search(r'r\Sn','r\nn r4n'))


<_sre.SRE_Match object; span=(0, 3), match='r\nn'>
<_sre.SRE_Match object; span=(4, 7), match='r4n'>


### Letters, Digits, and "_"

In [191]:
# \w : [a-zA-Z0-9_]
print(re.search(r'r\wn','r\nn r4n'))
# \W : the opposite of \w

<_sre.SRE_Match object; span=(4, 7), match='r4n'>
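
One caveat worth illustrating: for `str` patterns in Python 3, `\w` is Unicode-aware by default, so it also matches Chinese characters. Passing `re.ASCII` restricts it to `[a-zA-Z0-9_]`. A small sketch:

```python
import re

text = '爬虫 r4n'

unicode_words = re.findall(r'\w+', text)                # Unicode-aware by default
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)  # only [a-zA-Z0-9_]

print(unicode_words)
print(ascii_words)
```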


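### Word Boundaries

In [192]:
# \b : the empty string at the beginning or end of a word
# \B : the empty string that is not at the beginning or end of a word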
print(re.search(r'\bruns\b','dog runs to cat'))
print(re.search(r'\b runs \b','dog  runs  to cat'))
print(re.search(r'\Bruns\B','dog  runs  to cat'))
print(re.search(r'\B runs \B','dog  runs  to cat'))

<_sre.SRE_Match object; span=(4, 8), match='runs'>
None
None
<_sre.SRE_Match object; span=(4, 10), match=' runs '>


### Special Characters

In [193]:
# \\ : matches a literal backslash
# . : matches any character except \n
print(re.search(r'r.n','r-ns to'))

<_sre.SRE_Match object; span=(0, 3), match='r-n'>
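
The exception for `\n` can be lifted with `flags=re.DOTALL`, which is why the paragraph search in the scraping section passes that flag. A minimal sketch:

```python
import re

text = 'r\nn'

without_flag = re.search(r'r.n', text)                # "." refuses to match "\n"
with_flag = re.search(r'r.n', text, flags=re.DOTALL)  # "." now matches "\n" too

print(without_flag)
print(with_flag)
```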


### Start and End of a String

In [194]:
# ^ : matches the start of the string
# $ : matches the end of the string
print(re.search(r'^dog','dog runs to cat'))
print(re.search(r'cat$','dog runs to cat'))
print(re.search(r'cat$','cat runs to dog'))

<_sre.SRE_Match object; span=(0, 3), match='dog'>
<_sre.SRE_Match object; span=(12, 15), match='cat'>
None


### Optional Occurrence

In [195]:
# ()? : the group in parentheses is optional; the pattern matches whether or not it appears
print(re.search(r'Mon(day)?','Monday'))
print(re.search(r'Mon(day)?','Mon'))

<_sre.SRE_Match object; span=(0, 6), match='Monday'>
<_sre.SRE_Match object; span=(0, 3), match='Mon'>


### Multiline Matching

In [196]:
string = '''
dog runs to cat.
I run to dog.
'''
print(re.search(r'^I',string))
print(re.search(r'^I',string,flags=re.M))
# flags=re.M makes ^ and $ match at the start and end of every line in the string

None
<_sre.SRE_Match object; span=(18, 19), match='I'>
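
To see the effect on every line at once, the same string can be searched with `re.findall`. This sketch collects the first word of each line and the last word (with its period) of each line:

```python
import re

string = '''
dog runs to cat.
I run to dog.
'''

no_flag = re.findall(r'^\w+', string)                  # ^ only matches at position 0
first_words = re.findall(r'^\w+', string, flags=re.M)  # ^ matches after every newline
last_words = re.findall(r'\w+\.$', string, flags=re.M) # $ matches before every newline

print(no_flag)
print(first_words)
print(last_words)
```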


### Zero or More Times

In [197]:
# * : zero or more repetitions
print(re.search(r'ab*','a'))
print(re.search(r'ab*','ab'))
print(re.search(r'ab*','abbbbb'))

<_sre.SRE_Match object; span=(0, 1), match='a'>
<_sre.SRE_Match object; span=(0, 2), match='ab'>
<_sre.SRE_Match object; span=(0, 6), match='abbbbb'>


### One or More Times

In [198]:
# + : one or more repetitions
print(re.search(r'ab+','a'))
print(re.search(r'ab+','ab'))
print(re.search(r'ab+','abbbbb'))

None
<_sre.SRE_Match object; span=(0, 2), match='ab'>
<_sre.SRE_Match object; span=(0, 6), match='abbbbb'>


### A Range of Repetitions

In [199]:
# {n,m} : repeated between n and m times
print(re.search(r'ab{2,10}','a'))
print(re.search(r'ab{2,10}','ab'))
print(re.search(r'ab{2,10}','abbbb'))

None
None
<_sre.SRE_Match object; span=(0, 5), match='abbbb'>
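
The braces also accept a single count or an open upper bound, and a trailing `?` makes the quantifier lazy. A short sketch of these variants:

```python
import re

exact = re.search(r'ab{3}', 'abbbbb').group()      # exactly 3 times
at_least = re.search(r'ab{2,}', 'abbbbb').group()  # 2 or more times
lazy = re.search(r'ab{2,10}?', 'abbbbb').group()   # 2 to 10 times, as few as possible

print(exact)
print(at_least)
print(lazy)
```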


### Groups

In [200]:
# () : capture a group
match=re.search(r'(\d+), Date: (.+)','ID: 021523, Date: Feb/12/2017')
print(match.group())
print(match.group(1)) # returns only what the first pair of parentheses matched

021523, Date: Feb/12/2017
021523


In [201]:
# ?P<name> : give the group a name
match=re.search(r'(?P<id>\d+), Date: (?P<date>.+)','ID: 021523, Date: Feb/12/2017')
print(match.group('id'))

021523
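
Beyond `group()`, a match object can return all groups at once. A brief sketch using the same pattern:

```python
import re

match = re.search(r'(?P<id>\d+), Date: (?P<date>.+)', 'ID: 021523, Date: Feb/12/2017')

all_groups = match.groups()     # tuple of every captured group, in order
named = match.groupdict()       # dict of the named groups only

print(all_groups)
print(named)
print(match.group('date') == match.group(2))  # a named group is also reachable by index
```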


### Finding All Matches

In [202]:
# findall
print(re.findall(r'r[ua]n','run ran ren'))
print(re.findall(r'(ran|run)','run ran ren'))

['run', 'ran']
['run', 'ran']
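
When the pattern contains groups, `re.findall` returns the group contents rather than the whole match, and `re.finditer` yields match objects when positions are needed. A small sketch:

```python
import re

text = 'run ran ren'

one_group = re.findall(r'r([ua])n', text)        # one group: list of that group's text
two_groups = re.findall(r'(r)([ua])n', text)     # several groups: list of tuples
spans = [m.span() for m in re.finditer(r'r[ua]n', text)]

print(one_group)
print(two_groups)
print(spans)
```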


### Substitution

In [203]:
# re.sub()
print(re.sub(r'r[au]ns','catches','dog runs to cats'))

dog catches to cats
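
`re.sub` also supports backreferences in the replacement, a `count` limit, and a function as the replacement. The inputs below are illustrative examples:

```python
import re

# Backreferences: \1, \2, \3 refer to the captured groups.
swapped = re.sub(r'(\w+)/(\d+)/(\d+)', r'\2 \1 \3', 'Date: Feb/12/2017')

# count=1 replaces only the first occurrence.
first_only = re.sub(r'r[au]n', 'walk', 'run ran run', count=1)

# A function receives each match object and returns the replacement text.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'a1 b20')

print(swapped)
print(first_only)
print(doubled)
```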


### Splitting

In [204]:
# re.split()
print(re.split(r'[;,\.]','a;b,c.d;e'))

['a', 'b', 'c', 'd', 'e']
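
Two variations that are often useful: wrapping the separator in a group keeps the separators in the result, and `maxsplit` limits the number of splits. A short sketch:

```python
import re

kept = re.split(r'([;,.])', 'a;b,c.d')                # capture group: separators are kept
limited = re.split(r'[;,.]', 'a;b,c.d;e', maxsplit=2) # stop after two splits

print(kept)
print(limited)
```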


### compile

In [205]:
compiled_re=re.compile(r'r[au]n')
print(compiled_re.search('dog ran to cat'))

<_sre.SRE_Match object; span=(4, 7), match='ran'>
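
A compiled pattern offers the same methods as the module-level functions and can carry flags, which is convenient when one pattern is reused many times. A sketch that adds `re.IGNORECASE` (not used in the cell above):

```python
import re

compiled_re = re.compile(r'r[au]n', flags=re.IGNORECASE)

found = compiled_re.findall('Run ran RUN ren')      # flag applies to every call
replaced = compiled_re.sub('walk', 'dog ran to cat')

print(found)
print(replaced)
print(compiled_re.pattern)                          # the original pattern string
```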
