# Regular Expressions

## Common patterns

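Pattern | Description |
--- | :--- |
`\w` | Matches a letter, digit, or underscore |
`\W` | Matches any character that is not a letter, digit, or underscore |
`\s` | Matches any whitespace character, equivalent to `[ \t\n\r\f\v]` for ASCII text |
`\S` | Matches any non-whitespace character |
`\d` | Matches a digit, equivalent to `[0-9]` for ASCII text |
`\D` | Matches any non-digit |
`\A` | Matches the start of the string |
`\Z` | Matches the end of the string. In Perl-style engines it also matches just before a trailing newline; in Python's `re` it matches only at the very end |
`\z` | Matches the very end of the string (Perl-style engines; Python's `re` has historically used `\Z` for this) |
`\G` | Matches the position where the previous match ended (not supported by Python's `re`) |
`\n` | Matches a newline |
`\t` | Matches a tab |
`^` | Matches the start of the string |
`$` | Matches the end of the string, or just before a trailing newline |
`.` | Matches any character except a newline. With the `re.DOTALL` flag it matches any character, including a newline |
`[...]` | Matches any one character listed inside the brackets |
`[^...]` | Matches any one character not listed inside the brackets |
`*` | Matches zero or more of the preceding expression (greedy) |
`+` | Matches one or more of the preceding expression (greedy) |
`?` | Matches zero or one of the preceding expression. Placed after another quantifier (`*?`, `+?`), it makes that quantifier non-greedy |
`{n}` | Matches exactly n repetitions of the preceding expression |
`{n,m}` | Matches n to m repetitions of the preceding expression (greedy) |
`a\|b` | Matches a or b |
`()` | Groups the enclosed expression and captures what it matches |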
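
A quick way to get a feel for these tokens is to run a few of them through `re.findall`. The sketch below uses a made-up sample string (`sample` is an assumption, chosen only to exercise the table entries):

```python
import re

# Hypothetical sample text, not taken from the notes above.
sample = "user_01 paid 42 USD\ton 2019-11-05"

digits = re.findall(r"\d+", sample)        # runs of digits
words = re.findall(r"\w+", sample)         # runs of letters, digits, underscores
four_digit = re.findall(r"\d{4}", sample)  # exactly four digits in a row
alt = re.findall(r"USD|EUR", sample)       # alternation

print(digits)      # ['01', '42', '2019', '11', '05']
print(words)       # ['user_01', 'paid', '42', 'USD', 'on', '2019', '11', '05']
print(four_digit)  # ['2019']
print(alt)         # ['USD']
```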


## re.match

`re.match` tries to match a pattern at the start of the string. If the pattern does not match there, `match()` returns `None`.

```python
re.match(pattern, string, flags=0)
```
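
The "start of the string" rule is easy to see on a tiny input. A minimal sketch, with a made-up `text` value:

```python
import re

text = "Hello 123"

at_start = re.match(r"Hello", text)    # the pattern begins at index 0
not_at_start = re.match(r"123", text)  # "123" is present, but not at index 0

print(at_start.group())  # Hello
print(not_at_start)      # None
```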

### Basic matching

In [50]:
import re

content = 'Hello 1237 127 World_This_is_Regex_Demo'
result = re.match(r'^Hello\s\d{4}\s\d{3}\s\w{10}.*Demo$',content)

print(result)
# print(result.group())
# print(result.span())

<_sre.SRE_Match object; span=(0, 39), match='Hello 1237 127 World_This_is_Regex_Demo'>


### Generic matching

In [51]:
import re

content = 'Hello 1237 127 World_This_is_Regex_Demo'
result = re.match(r'^Hello.*Demo$',content)
print(result)
print(result.group())
print(result.span())

<_sre.SRE_Match object; span=(0, 39), match='Hello 1237 127 World_This_is_Regex_Demo'>
Hello 1237 127 World_This_is_Regex_Demo
(0, 39)


### Capturing match targets

In [53]:
import re

content = "Hello 1234567 World_This is a Regex_demo"
result = re.match(r'^Hello\s(\d+)\s(\w+).*demo$', content)
print(result)
print(result.group(1))
print(result.group(2))

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex_demo'>
1234567
World_This


### Greedy matching
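
Quantifiers such as `*` and `+` are greedy by default: they consume as much as they can and give characters back only when the rest of the pattern would otherwise fail. A minimal sketch, reusing the sample string from the capturing example:

```python
import re

content = "Hello 1234567 World_This is a Regex_demo"

# The greedy ".*" swallows as many characters as possible, leaving only
# the single digit that "(\d+)" needs in order to match.
result = re.match(r"^Hello.*(\d+).*demo$", content)
print(result.group(1))  # 7
```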

### Non-greedy matching
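
Appending `?` to a quantifier makes it non-greedy (lazy): it matches as little as possible. A minimal sketch on the same sample string:

```python
import re

content = "Hello 1234567 World_This is a Regex_demo"

# The lazy ".*?" stops as soon as "(\d+)" can start matching,
# so the group captures the whole number.
result = re.match(r"^Hello.*?(\d+).*demo$", content)
print(result.group(1))  # 1234567
```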

### Match flags

In [55]:
import re

content = '''Hello 1234567 World_This
is a Regex_demo
'''

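result = re.match(r'^Hello\s(\d+).*?demo$', content)
print(result)

print('==='*10)

# Pattern used:
#     ^Hello\s(\d+).*?demo$
# Explanation:
#     ^Hello  the string starts with Hello
#     \s      matches one whitespace character
#     (\d+)   matches one or more digits and captures them
#     .*?     matches any characters except a newline, as few as possible (non-greedy)
#     demo$   the string ends with demo
#
#     With re.S, "." also matches newline characters.
result2 = re.match(r'^Hello\s(\d+).*?demo$', content, re.S)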
print(result2)
print(result2.group(1))

None
==============================
<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This\nis a Regex_demo'>
1234567
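
`re.S` is an alias of `re.DOTALL`, and the same switch can be written inline as `(?s)` at the start of the pattern. Flags can also be combined with `|`. A small sketch with a made-up two-line string:

```python
import re

text = "first line\nsecond line"

no_flag = re.match(r"first.*second", text)            # "." stops at "\n"
dotall = re.match(r"first.*second", text, re.DOTALL)  # "." may cross the newline
inline = re.match(r"(?s)first.*second", text)         # inline form of the same flag
# Flags combine with "|"; re.I additionally ignores case.
combined = re.match(r"FIRST.*SECOND", text, re.S | re.I)

print(no_flag)                           # None
print(inline.group() == dotall.group())  # True
print(combined is not None)              # True
```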


### Escaping

In [22]:
import re

content = 'price is $5.13'
result = re.match(r'^$(\d.\d+)', content)
print('Result without escaping:', result)

Result without escaping: None


In [23]:
import re

# After escaping
pattern = r'^.*\$(\d+\.\d+)'
def get_price(content_with_price):
    result2 = re.match(pattern, content_with_price)
    # print(result2)
    print('Matched price: $', result2.group(1), sep='')

get_price('price is $5.13')
get_price('price is $5.11231233')
get_price('price is $123125.11231233')

Matched price: $5.13
Matched price: $5.11231233
Matched price: $123125.11231233
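
When the literal text comes from a variable, `re.escape` can add the backslashes for you. A short sketch (the `currency` variable is only an illustration, not part of the example above):

```python
import re

currency = "$"  # literal text that happens to be a regex metacharacter
price_pattern = re.escape(currency) + r"(\d+\.\d+)"

found = re.search(price_pattern, "price is $5.13")
print(price_pattern)   # \$(\d+\.\d+)
print(found.group(1))  # 5.13
```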


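Summary: (1) prefer generic matching where possible; (2) use parentheses to capture the match target; (3) prefer non-greedy mode; (4) use `re.S` when the text contains newlines.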

## re.search

`re.search` scans the whole string and returns the first successful match.

In [26]:
import re

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra string'
result = re.match(r'Hello.*?(\d+).*?Demo', content)
print(result)

None


In [28]:
import re

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra string'
result = re.search(r'Hello.*?(\d+).*?Demo', content)
print(result)
print(result.group(1))

<_sre.SRE_Match object; span=(14, 54), match='Hello 1234567 World_This is a Regex Demo'>
1234567


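Summary: `match` only succeeds at the start of the string, so the pattern has to account for any leading text. When the target can appear anywhere in the string, prefer `search` over `match`.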

### Matching exercise

In [12]:
html = '''<div class="mdBox bgWrite">
            <div class="mdBoxHd">
                <a title="更多" class="fr" href="/laoge/70.htm" target="_blank">更多</a>
                <h2 class="mdBoxHdTit">
                    <span>70后喜欢的80年代经典老歌</span>
                </h2>
            </div>
            <div class="mdBoxBd">
                <div class="songList clearfix">
                    <ol id="f4">
                                                                    <li><input name="Url" class="check" type="checkbox" value="65937@"><span class="songNum  topRed">
                                    01.</span><a class="songName " href="http://www.jingdianlaoge.com/play/65937.htm" target="_1">
                                    再回首 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="66417@"><span class="songNum  topRed">
                                    02.</span><a class="songName " href="http://www.jingdianlaoge.com/play/66417.htm" target="_1">
                                    爱拼才会赢 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="35721@"><span class="songNum  topRed">
                                    03.</span><a class="songName " href="http://www.jingdianlaoge.com/play/35721.htm" target="_1">
                                    我只在乎你 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="81667@"><span class="songNum ">
                                    04.</span><a class="songName " href="http://www.jingdianlaoge.com/play/81667.htm" target="_1">
                                    千年等一回 《新白娘子传奇》电视剧主题曲 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="24865@"><span class="songNum ">
                                    05.</span><a class="songName " href="http://www.jingdianlaoge.com/play/24865.htm" target="_1">
                                    九百九十九朵玫瑰 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="49772@"><span class="songNum ">
                                    06.</span><a class="songName " href="http://www.jingdianlaoge.com/play/49772.htm" target="_1">
                                    女人花 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="82151@"><span class="songNum ">
                                    07.</span><a class="songName " href="http://www.jingdianlaoge.com/play/82151.htm" target="_1">
                                    走过咖啡屋 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="20753@"><span class="songNum ">
                                    08.</span><a class="songName " href="http://www.jingdianlaoge.com/play/20753.htm" target="_1">
                                    酒干倘卖无 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="199353@"><span class="songNum ">
                                    09.</span><a class="songName " href="http://www.jingdianlaoge.com/play/199353.htm" target="_1">
                                    你的柔情我永远不懂 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="2247@"><span class="songNum ">
                                    10.</span><a class="songName " href="http://www.jingdianlaoge.com/play/2247.htm" target="_1">
                                    红豆 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="1939@"><span class="songNum ">
                                    11.</span><a class="songName " href="http://www.jingdianlaoge.com/play/1939.htm" target="_1">
                                    童年 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="37778@"><span class="songNum ">
                                    12.</span><a class="songName " href="http://www.jingdianlaoge.com/play/37778.htm" target="_1">
                                    独角戏 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="29215@"><span class="songNum ">
                                    13.</span><a class="songName " href="http://www.jingdianlaoge.com/play/29215.htm" target="_1">
                                    一剪梅 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="42620@"><span class="songNum ">
                                    14.</span><a class="songName " href="http://www.jingdianlaoge.com/play/42620.htm" target="_1">
                                    千千阙歌 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="20823@"><span class="songNum ">
                                    15.</span><a class="songName " href="http://www.jingdianlaoge.com/play/20823.htm" target="_1">
                                    万水千山总是情 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="27247@"><span class="songNum ">
                                    16.</span><a class="songName " href="http://www.jingdianlaoge.com/play/27247.htm" target="_1">
                                    相见恨晚 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="3954@"><span class="songNum ">
                                    17.</span><a class="songName " href="http://www.jingdianlaoge.com/play/3954.htm" target="_1">
                                    风中有朵雨做的云 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="22985@"><span class="songNum ">
                                    18.</span><a class="songName " href="http://www.jingdianlaoge.com/play/22985.htm" target="_1">
                                    容易受伤的女人 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="79930@"><span class="songNum ">
                                    19.</span><a class="songName " href="http://www.jingdianlaoge.com/play/79930.htm" target="_1">
                                    我是不是该安静的走开 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="81655@"><span class="songNum ">
                                    20.</span><a class="songName " href="http://www.jingdianlaoge.com/play/81655.htm" target="_1">
                                    铁血丹心 《射雕英雄传之铁血丹心》电视剧主题曲 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="11411@"><span class="songNum ">
                                    21.</span><a class="songName " href="http://www.jingdianlaoge.com/play/11411.htm" target="_1">
                                    我的未来不是梦 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="64974@"><span class="songNum ">
                                    22.</span><a class="songName " href="http://www.jingdianlaoge.com/play/64974.htm" target="_1">
                                    粉红色的回忆 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="38248@"><span class="songNum ">
                                    23.</span><a class="songName " href="http://www.jingdianlaoge.com/play/38248.htm" target="_1">
                                    小城故事 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="49771@"><span class="songNum ">
                                    24.</span><a class="songName " href="http://www.jingdianlaoge.com/play/49771.htm" target="_1">
                                    一生爱你千百回 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="671@"><span class="songNum ">
                                    25.</span><a class="songName " href="http://www.jingdianlaoge.com/play/671.htm" target="_1">
                                    真的爱你 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="35504@"><span class="songNum ">
                                    26.</span><a class="songName " href="http://www.jingdianlaoge.com/play/35504.htm" target="_1">
                                    讲不出再见 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="45258@"><span class="songNum ">
                                    27.</span><a class="songName " href="http://www.jingdianlaoge.com/play/45258.htm" target="_1">
                                    又见炊烟 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="81669@"><span class="songNum ">
                                    28.</span><a class="songName " href="http://www.jingdianlaoge.com/play/81669.htm" target="_1">
                                    万里长城永不倒 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="188203@"><span class="songNum ">
                                    29.</span><a class="songName " href="http://www.jingdianlaoge.com/play/188203.htm" target="_1">
                                    爱江山更爱美人 </a>
                            </li>
                                                                    <li><input name="Url" class="check" type="checkbox" value="81654@"><span class="songNum ">
                                    30.</span><a class="songName " href="http://www.jingdianlaoge.com/play/81654.htm" target="_1">
                                    梅花三弄 </a>
                            </li>
                                                            </ol>
                </div>
                <div class="setPlay">
                    <a class="setPlayPlay" style="cursor: pointer;" onclick="javascript:new_lbplay('f4');">全部播放</a><a class="setPlayXuan" style="cursor: pointer;" onclick="javascript:quanxuan('f4');">全选/反选</a><a class="setPlayAdd" style="cursor: pointer;" onclick="javascript:clk('playadd','f4');">播放选中</a><a class="setPlaySui" style="cursor: pointer;" onclick="subrnd(40,'form','Url','index','rm1')">随机播放</a>
                </div>
            </div>
        </div>
'''

small_html='''<li><input name="Url" class="check" type="checkbox" value="81655@"><span class="songNum ">
        20.</span><a class="songName " href="http://www.jingdianlaoge.com/play/81655.htm" target="_1">
        铁血丹心 《射雕英雄传之铁血丹心》电视剧主题曲 </a>
</li>
'''

In [29]:
import re

pattern = r'<span.*?songNum.*?(\d+)\.</span><a.*?songName.*?href="(.*?)".*?([\u4e00-\u9fa5]+).?</a>'
pattern2 = r'<span.*?songNum.*?(\d+)\.</span><a.*?songName.*?href="(.*?)".*?([^\x00-\xff]+)\s?</a>'
result = re.search(pattern2, html, re.S)
print(result)
if result is not None:
    print('|-->No.', result.group(1), '|-->URL:', result.group(2), '|-->SongName:', result.group(3))

<_sre.SRE_Match object; span=(207, 732), match='<span>70后喜欢的80年代经典老歌</span>\n                </h2>
|-->No. 01 |-->URL: http://www.jingdianlaoge.com/play/65937.htm |-->SongName: 再回首


In [35]:
pattern = r'<span.*?songNum.*?(\d+)\.</span><a.*?songName.*?href="(.*?)".*?([^\x00-\xff]+\s?[^\x00-\xff]+).?</a>'
result = re.search(pattern, html, re.S)
if result is not None:
    print('|-->No.', result.group(1), '|-->SongName:', result.group(3), '|-->URL:', result.group(2))
else:
    print(result)

|-->No. 01 |-->SongName: 再回首 |-->URL: http://www.jingdianlaoge.com/play/65937.htm


## re.findall

`re.findall` searches the string and returns all non-overlapping matches as a list.
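
What each list element looks like depends on how many groups the pattern has. A minimal sketch with a made-up string:

```python
import re

text = "a=1, b=22, c=333"

whole = re.findall(r"\d+", text)        # no group: list of whole matches
single = re.findall(r"\w=(\d+)", text)  # one group: list of that group's text
pairs = re.findall(r"(\w)=(\d+)", text) # several groups: list of tuples

print(whole)   # ['1', '22', '333']
print(single)  # ['1', '22', '333']
print(pairs)   # [('a', '1'), ('b', '22'), ('c', '333')]
```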

In [40]:
import re

# Chinese characters:
#     [\u4e00-\u9fa5]
#
# Any character outside the Latin-1 range (often described as "double-byte" characters):
#     [^\x00-\xff]

# pattern = r'<span.*?songNum.*?(\d+)\.</span><a.*?songName.*?href="(.*?)".*?([\u4e00-\u9fa5]+\s?[\u4e00-\u9fa5]+)\s?</a>'
pattern = r'<span.*?songNum.*?(\d+)\.</span><a.*?songName.*?href="(.*?)".*?([^\x00-\xff]+\s?[^\x00-\xff]+)\s?</a>'
result = re.findall(pattern, html, re.S)
# print(result)
if result:
    print('Total songs: %d' % len(result))
    for item in result:
        print("No. %s: 《%s》, link: %s" % (item[0], item[2], item[1]))

Total songs: 30
No. 01: 《再回首》, link: http://www.jingdianlaoge.com/play/65937.htm
No. 02: 《爱拼才会赢》, link: http://www.jingdianlaoge.com/play/66417.htm
No. 03: 《我只在乎你》, link: http://www.jingdianlaoge.com/play/35721.htm
No. 04: 《千年等一回 《新白娘子传奇》电视剧主题曲》, link: http://www.jingdianlaoge.com/play/81667.htm
No. 05: 《九百九十九朵玫瑰》, link: http://www.jingdianlaoge.com/play/24865.htm
No. 06: 《女人花》, link: http://www.jingdianlaoge.com/play/49772.htm
No. 07: 《走过咖啡屋》, link: http://www.jingdianlaoge.com/play/82151.htm
No. 08: 《酒干倘卖无》, link: http://www.jingdianlaoge.com/play/20753.htm
No. 09: 《你的柔情我永远不懂》, link: http://www.jingdianlaoge.com/play/199353.htm
No. 10: 《红豆》, link: http://www.jingdianlaoge.com/play/2247.htm
No. 11: 《童年》, link: http://www.jingdianlaoge.com/play/1939.htm
No. 12: 《独角戏》, link: http://www.jingdianlaoge.com/play/37778.htm
No. 13: 《一剪梅》, link: http://www.jingdianlaoge.com/play/29215.htm
No. 14: 《千千阙歌》, link: http://www.jingdianlaoge.com/play/42620.htm
No. 15: 《万水千山总是情》, link: http://www.jingdianlaoge.com/play/20823.htm
No. 16: 《相见恨晚》, link

## re.sub

`re.sub` replaces every match in the string and returns the resulting string.

```python
re.sub(pattern, repl, string, count=0, flags=0)
```

In [3]:
import re

content1 = "Extra strings 123457 World_This is a Regex Demo Extra strings"
content1 = re.sub(r'\d+', "", content1)
print("Replaced with an empty string:", content1)

content2 = "Extra strings 123457 World_This is a Regex Demo Extra strings"
content2 = re.sub(r'\d+', "$Original is number$", content2)
print("Replaced with another string:", content2)

Replaced with an empty string: Extra strings  World_This is a Regex Demo Extra strings
Replaced with another string: Extra strings $Original is number$ World_This is a Regex Demo Extra strings


In [7]:
import re

content = "Extra strings 123457 World_This is a Regex Demo Extra strings"
# \1 refers to the text captured by the first group, \2 to the second, and so on
content = re.sub(r'(\d+)', r'\1-replacement', content)
print(content)

Extra strings 123457-replacement World_This is a Regex Demo Extra strings
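
The `count` parameter from the signature limits how many matches are replaced, and `repl` may also be a function that receives each match object. A minimal sketch (the `double` helper and the sample string are made up for illustration):

```python
import re

content = "a1 b22 c333"

# count=1 replaces only the first match.
first_only = re.sub(r"\d+", "#", content, count=1)

# A callable repl is called once per match and returns the replacement text.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, content)

print(first_only)  # a# b22 c333
print(doubled)     # a2 b44 c666
```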


## re.compile

`re.compile` compiles a regular expression string into a pattern object so that the pattern can be reused.

```python
re.compile(pattern, flags=0)
```

In [1]:
import re

content = """Hello 1234567 World_This
is a Regex Demo
"""
# Compiling a frequently used pattern into a pattern object makes it reusable, so the regex string does not have to be written out again
pattern = re.compile("Hello.*?Demo", re.S)
result = re.match(pattern, content)
print(result)

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This\nis a Regex Demo'>
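
A compiled pattern object also exposes the matching functions as methods, so the pattern does not have to be passed in each time. A short sketch with a made-up `number` pattern:

```python
import re

number = re.compile(r"\d+")

first = number.search("abc 123 def").group()
all_numbers = number.findall("1 22 333")
masked = number.sub("#", "1 22 333")

print(first)        # 123
print(all_numbers)  # ['1', '22', '333']
print(masked)       # # # #
```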


## Putting it all together

In [1]:
import re
import requests
from requests.exceptions import RequestException, HTTPError

In [1]:
url = "https://book.douban.com/"
html=""
headers = {
    'Accept': 'text/html, application/xhtml+xml, application/xml; q=0.9, */*; q=0.8',
    'Accept-Language': 'zh-Hans-CN, zh-Hans; q=0.8, en-US; q=0.5, en; q=0.3', 
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18362',
    'Host': 'book.douban.com'
}

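try:
    response = requests.get(url, headers=headers, timeout=10)
    print("Status_code: ", response.status_code)
    # raise_for_status() turns 4xx/5xx responses into HTTPError; without it the handler below never fires
    response.raise_for_status()
    html = response.text
except HTTPError as http_error:
    print(http_error)
except RequestException as e:
    print(e)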

Status_code:  200


In [1]:
# print(html)
import os

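demo_file = "demo_file/books_douban.txt"
if not os.path.isfile(demo_file):
    # Create the target directory first; open() fails if it does not exist
    os.makedirs(os.path.dirname(demo_file), exist_ok=True)
    with open(demo_file, 'w', encoding='utf-8') as f:
        f.write(html)
    print('ok')
else:
    print("exists")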

exists


In [7]:
# Start matching
test_text = '''<h2 class=''>
    <span class="">新书速递</span>
      <span class="link-more">
        <a class="" href="/latest?icn=index-latestbook-all"
          >更多»</a>
      </span>
  </h2>
'''

test_text2 = """<div class="hd">
      
  <h2 class=''>
    <span class="">新书速递</span>
      <span class="link-more">
        <a class="" href="/latest?icn=index-latestbook-all"
          >更多»</a>
      </span>
  </h2>
"""

test_text3='''<li class="">
    <div class="cover">
      <a onclick="moreurl(this, {from:'pop_fiction'})" href="https://book.douban.com/subject/30259677/?icn=index-topchart-subject">
        <img src="https://img3.doubanio.com/view/subject/m/public/s30022290.jpg"
          alt="焚舟纪" class="">
      </a>
    </div>
    <div class="info">
      <h4 class="title">
        <a onclick="moreurl(this, {from:'pop_fiction'})"
          href="https://book.douban.com/subject/30259677/?icn=index-topchart-subject" class="">焚舟纪</a>
      </h4>
      <p class="entry-star-small">
        <span class="allstar45 star-img">
        </span>
        <span class="average-rating">
          9.1
        </span>
      </p>
      <p class="author">
        作者：[英] 安吉拉·卡特
      </p>
      <p class="book-list-classification">
        英国文学&nbsp;/&nbsp;短篇小说
      </p>
      <p class="extra-info">
        
      </p>
        
        <p class="reviews">
          这本全集里的作品正显示我们的损失有多大。
          (<a onclick="moreurl(this, {from:'pop_fiction'})" href="https://book.douban.com/review/10283796/?icn=index-topchart-subject">purplepine评论</a>)
        </p>
    </div>
  </li>
'''
#pattern_title=re.compile(r'<div\s+class="hd">\s+<h2.*?>\s+<span.*?>(\w+).*?</span>.*?</h2>', re.S)
#result=re.findall(pattern_title, html)
#print(result)
# Note: "(.*)" in the author group is greedy; it works for this single <li>, but a lazy "(.*?)" would likely be safer when the text holds several items
pattern_c = re.compile(r'<li.*?cover.*?href="(.*?)">.*?src="(.*?)".*?title.*?>(\w+)</a>\s+</h4>.*?<p class="author">\s+(.*)\s+</p>.*?book-list-classification.*?([^\x00-\xff]+.*?[^\x00-\xff]+).*?</p>.*?</li>', re.S)
result2=re.findall(pattern_c, test_text3)
print(result2)
print('---'*10)
for url,img_url,title,author,classification in result2:
    print("URL:%s, img_url:%s, title:《%s》, author:%s, book-list-classification:%s"%(
        url, img_url, title, re.sub(r'\s+', "", author), re.sub('(&nbsp;)', "", classification)
    ))

[('https://book.douban.com/subject/30259677/?icn=index-topchart-subject', 'https://img3.doubanio.com/view/subject/m/public/s30022290.jpg', '焚舟纪', '作者：[英] 安吉拉·卡特\n     ', '英国文学&nbsp;/&nbsp;短篇小说')]
------------------------------
URL:https://book.douban.com/subject/30259677/?icn=index-topchart-subject, img_url:https://img3.doubanio.com/view/subject/m/public/s30022290.jpg, title:《焚舟纪》, author:作者：[英]安吉拉·卡特, book-list-classification:英国文学/短篇小说
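
With this many groups, positional tuples get hard to read. One possible alternative, sketched here on a much smaller made-up snippet (`snippet`, its markup, and the group names are assumptions, not part of the Douban page), is to use named groups with `re.finditer`:

```python
import re

# Hypothetical markup, far simpler than the real page.
snippet = '''
<li><a href="/book/1">Book One</a><span class="rating">9.1</span></li>
<li><a href="/book/2">Book Two</a><span class="rating">8.4</span></li>
'''

item = re.compile(
    r'<li><a href="(?P<url>.*?)">(?P<title>.*?)</a>'
    r'<span class="rating">(?P<rating>[\d.]+)</span></li>'
)

# groupdict() maps each group name to the text it captured.
books = [m.groupdict() for m in item.finditer(snippet)]
for book in books:
    print(book["title"], book["rating"], book["url"])
```

Each result is a dictionary keyed by group name, so reordering the groups in the pattern does not break the code that consumes them.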
