# Regular Expressions

## Common Matching Patterns

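| Pattern | Description |
|----|----|
| \w | Matches a letter, digit, or underscore |
| \W | Matches any character that is not a letter, digit, or underscore |
| \s | Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v] |
| \S | Matches any non-whitespace character |
| \d | Matches any digit; for ASCII text, equivalent to [0-9] |
| \D | Matches any non-digit |
| \A | Matches at the start of the string |
| \Z | Matches at the end of the string. In Perl-style engines it also matches just before a trailing newline; in Python's re it matches only at the very end |
| \z | Matches only at the very end of the string (Perl-style; in Python's re, use \Z) |
| \G | Matches at the position where the previous match ended (Perl-style; not supported by Python's re) |
| \n | Matches a newline |
| \t | Matches a tab |
| ^ | Matches at the start of the string |
| $ | Matches at the end of the string |
| . | Matches any character except a newline; when the re.DOTALL flag is set, it matches newlines as well |
| [...] | A set of characters: [amk] matches 'a', 'm', or 'k' |
| [^...] | Any character not in the set: [^abc] matches any character except a, b, and c |
| * | Matches 0 or more repetitions of the preceding expression |
| + | Matches 1 or more repetitions of the preceding expression |
| ? | Matches 0 or 1 occurrence of the preceding expression (greedy). Placed after another quantifier, as in `*?` or `+?`, it makes that quantifier non-greedy |
| {n} | Matches exactly n repetitions of the preceding expression |
| {n,m} | Matches n to m repetitions of the preceding expression, greedily (no space after the comma) |
| a&#124;b | Matches a or b |
| ( ) | Matches the expression inside the parentheses and captures it as a group |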
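
A few of the table entries can be tried directly with Python's re module. This is a minimal sketch; the sample string is made up for illustration and only a handful of patterns are covered.

```python
import re

# Hypothetical sample text, chosen only to exercise a few table entries.
sample = 'room_42 costs 1500 yen'

words = re.findall(r'\w+', sample)         # runs of letters, digits, underscores
digits = re.findall(r'\d+', sample)        # runs of digits
three = re.findall(r'\d{3}', sample)       # exactly three digits at a time
either = re.findall(r'yen|costs', sample)  # alternation, in order of appearance

print(words)   # ['room_42', 'costs', '1500', 'yen']
print(digits)  # ['42', '1500']
print(three)   # ['150']
print(either)  # ['costs', 'yen']
```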

## re.match
re.match tries to match a pattern at the start of the string; if the pattern does not match at the start, match() returns None.
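
Because match() can return None, it is usually worth checking the result before calling methods on it. The helper below is a minimal sketch; `first_word` is a hypothetical name introduced only for this illustration.

```python
import re

def first_word(text):
    """Return the leading run of word characters, or None if text does not start with one."""
    m = re.match(r'\w+', text)
    # re.match returns None when the pattern does not match at position 0,
    # so check the result before calling .group().
    return m.group() if m else None

print(first_word('Hello world'))   # Hello
print(first_word(' Hello world'))  # None, because of the leading space
```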

### Basic matching

In [50]:
import re

content = 'Hell1 123 4567 World_This is a Regex Demo'
# Match "Hello" (an ad hoc pattern)
# a = re.match(r'\w\w\S\S{2}',content)
# Match the whole content string

# a = re.match(r'\w{5}\s\d{3}\s\d{4}\s\w{10}\s\w{2}\s\w\s\w{5}\s\w{4}',content)
# a = re.match(r'.*?',content)

# a = re.match(r'[a-zA-Z0-9]{1,5}?',content)

a = re.match(r'^h.*Demo$',content,re.I)




dir(a)
print(a)
print(a.start())
print(a.end())
print(a.group())
content[a.start():a.end()]
# a.start()
# a.end()
# a.group()


<_sre.SRE_Match object; span=(0, 41), match='Hell1 123 4567 World_This is a Regex Demo'>
0
41
Hell1 123 4567 World_This is a Regex Demo


'Hell1 123 4567 World_This is a Regex Demo'

In [54]:
import re

content = 'Hello 123 4567 World_This is a Regex Demo'

a = re.match('Hello',content)
dir(a)
print(a.start())
print(a.end())
print(a.group())
print(a.span())



0
5
Hello
(0, 5)



### Generic matching

In [18]:
import re

content = 'Hello 123 4567 World_This is a Regex Demo'
result = re.match('^Hello.*Demo$', content)
print(result)
print(result.group())
print(result.span())

<_sre.SRE_Match object; span=(0, 41), match='Hello 123 4567 World_This is a Regex Demo'>
Hello 123 4567 World_This is a Regex Demo
(0, 41)


### Capturing targets

In [68]:
import re

content = 'Hello 1234567 World_This is a Regex Demo'

result = re.search(r'\d{7} (\w*) \w{2}.{3}(\w*) ',content)
result
# result.group() returns the whole text matched by the pattern
result.group()
# result.groups() returns the captured groups
groups = result.groups()
print(groups)
groups[1]

('World_This', 'Regex')


'Regex'

### Greedy matching

In [21]:
import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
7


### Non-greedy matching

In [51]:
import re

content = 'Hello 1234567 World_This is a Regex Demo'
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This is a Regex Demo'>
1234567


### Match flags

In [70]:
import re

content = '''Hello 1234567 World_This
is a Regex Demo
'''
print(re.S)  # DOTALL: '.' matches any character, including newlines
print(re.I)  # IGNORECASE: match case-insensitively
result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)
print(result.group(1))

RegexFlag.DOTALL
RegexFlag.IGNORECASE
1234567
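
Flags can also be combined with `|`. The sketch below uses a made-up three-line string to show re.M (MULTILINE), which is a separate flag from re.S, alongside a combined re.I | re.S search.

```python
import re

text = 'first line\nSECOND line\nthird line'

# re.M makes ^ and $ match at every line boundary, not only at the string ends.
line_starts = re.findall(r'^\w+', text, re.M)

# Flags are combined with |. Here re.I ignores case and re.S lets . cross newlines.
m = re.search(r'first.*second', text, re.I | re.S)

print(line_starts)      # ['first', 'SECOND', 'third']
print(repr(m.group()))  # 'first line\nSECOND'
```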


### Escaping

In [81]:
import re

content = r'price\ is $5.00'
result = re.match(r'price\\ is \$5\.00', content)
print(result)
print(result.group())
result.group()
content

<_sre.SRE_Match object; span=(0, 15), match='price\\ is $5.00'>
price\ is $5.00


'price\\ is $5.00'
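
When the literal text is not known in advance, re.escape can add the backslashes instead of writing them by hand. A minimal sketch, reusing the string from the cell above:

```python
import re

content = r'price\ is $5.00'

# re.escape backslash-escapes the regex metacharacters in the literal text.
pattern = re.escape('$5.00')
m = re.search(pattern, content)

print(pattern)    # \$5\.00
print(m.group())  # $5.00
```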

In [31]:
import re

content = 'price is $5.00'
result = re.match(r'price is \$5\.00', content)
print(result)

<_sre.SRE_Match object; span=(0, 14), match='price is $5.00'>


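Summary: prefer generic matching where possible, use parentheses to capture the target, prefer non-greedy mode, and use re.S when the text contains newlines.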

## re.search
re.search scans the whole string and returns the first successful match.

In [32]:
import re


content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
result = re.match(r'Hello.*?(\d+).*?Demo', content)
print(result)

None


In [102]:
import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
result = re.search(r'Hello.*?(?P<singer>\d+).*?Demo', content)
print(result)

print(result.group('singer'))
result.group('singer')

<_sre.SRE_Match object; span=(13, 53), match='Hello 1234567 World_This is a Regex Demo'>
1234567


'1234567'

In [60]:
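# Letters, underscores, digits @ letters/digits . letters/digits
import re
# mail = 'abc123@qq.com'
# mail = input('Please enter your email: ')
# result = re.search(r'^\w+@[a-zA-Z0-9]+(\.[a-zA-Z0-9]+)+$',mail)
# print(result)
# if result is not None:
#     print('The email you entered is in a valid format')


# On first registration the password must contain upper-case letters,
# lower-case letters, and digits, and be at least 8 characters long
# password = 'Password123'
password = input('Please enter a password: ')
result1 = re.search('[A-Z]',password)
result2 = re.search('[a-z]',password)
result3 = re.search('[0-9]',password)
result4 = re.search('.{8}',password)

if result1 and result2 and result3 and result4:
    print('The password you entered is in a valid format')
else:
    print('The password you entered is not in a valid format')

Please enter a password: 1Ty00
The password you entered is not in a valid format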
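
The four separate searches can also be folded into one pattern with lookaheads. This is a sketch of an alternative, not the approach used in the cell above; `PASSWORD_RE` and `is_valid_password` are hypothetical names, and it is intended to accept the same single-line passwords.

```python
import re

# Each (?=...) lookahead checks one condition without consuming characters;
# .{8,} then requires at least eight characters in total.
PASSWORD_RE = re.compile(r'(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9]).{8,}', re.S)

def is_valid_password(password):
    """Return True if password has upper case, lower case, a digit, and length >= 8."""
    return PASSWORD_RE.match(password) is not None

print(is_valid_password('Password123'))  # True
print(is_valid_password('1Ty00'))        # False: fewer than eight characters
```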


Summary: for convenience, prefer search over match.
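
The difference comes down to where each function is allowed to start and stop. A minimal comparison on a made-up string, with re.fullmatch included for contrast:

```python
import re

text = 'Extra Hello 123'

at_start = re.match(r'Hello', text)       # None: 'Hello' is not at position 0
anywhere = re.search(r'Hello', text)      # found at span (6, 11)
whole = re.fullmatch(r'Hello', text)      # None: the pattern must cover the whole string
anchored = re.search(r'^Hello', text)     # None: ^ pins search to the start as well

print(at_start, anywhere.span(), whole, anchored)
```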

### Matching practice

In [36]:
import re

html = '''<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君"><i class="fa fa-user"></i>但愿人长久</a>
        </li>
    </ul>
</div>'''
result = re.search('<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))

齐秦 往事随风


In [37]:
import re

html = '''<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
        </li>
    </ul>
</div>'''
result = re.search('<li.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))

任贤齐 沧海一声笑


In [40]:
import re

html = '''<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
        </li>
    </ul>
</div>'''
result = re.search('<li.*?singer="(.*?)">(.*?)</a>', html)
if result:
    print(result.group(1), result.group(2))

beyond 光辉岁月


## re.findall
Searches the string and returns all matching substrings as a list.
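
What the list contains depends on how many capture groups the pattern has. A short sketch on a made-up string:

```python
import re

text = 'a=1, b=22, c=333'

no_group = re.findall(r'\w=\d+', text)        # no groups: whole matches
one_group = re.findall(r'\w=(\d+)', text)     # one group: that group's text
two_groups = re.findall(r'(\w)=(\d+)', text)  # several groups: tuples of groups

print(no_group)    # ['a=1', 'b=22', 'c=333']
print(one_group)   # ['1', '22', '333']
print(two_groups)  # [('a', '1'), ('b', '22'), ('c', '333')]
```
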
### A small Neihan Duanzi example

In [93]:
import re

html = '''<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6">
            <a href="/4.mp3" singer="beyond">光辉岁月</a>
        </li>
        <li data-view="5">
            <a href="/5.mp3" singer="陈慧琳">记事本</a>
        </li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
        </li>
    </ul>
</div>'''
# results = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)

regObj = re.compile('<a.*?href="(.*?)".*?ger="(?P<singer>.*?)">(.*?)</a>',re.S)
print(regObj)

results = re.findall(regObj,html)


dir(results)







re.compile('<a.*?href="(.*?)".*?ger="(?P<singer>.*?)">(.*?)</a>', re.DOTALL)


['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

### re.split()


In [91]:
import re

with open('./students.txt','r') as f:
    student = f.read()
print(student)
regObj = re.compile(r'\n')
re.split(regObj,student)


小红 女 三年二班
小李 男 三年二班
小黑 男 三年二班


['小红 女 三年二班', '小李 男 三年二班', '小黑 男 三年二班']
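
re.split becomes more useful than str.split once the separator varies. The sketch below uses made-up strings to show a character-class separator, the maxsplit argument, and a capture group that keeps the delimiters.

```python
import re

line = 'apple, banana;cherry  date'

# Split on any run of commas, semicolons, or whitespace.
parts = re.split(r'[,;\s]+', line)

# maxsplit limits the number of splits; the rest of the string is left intact.
head_tail = re.split(r'[,;\s]+', line, maxsplit=1)

# A capture group in the pattern keeps the delimiters in the result.
with_delims = re.split(r'([,;])', 'a,b;c')

print(parts)        # ['apple', 'banana', 'cherry', 'date']
print(head_tail)    # ['apple', 'banana;cherry  date']
print(with_delims)  # ['a', ',', 'b', ';', 'c']
```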

In [None]:
import re
import requests

res = requests.get('http://www.neihanshequ.com')
text = res.text
listArr = re.findall(r'<div.*?class="name">(.*?)</.*?<p>(.*?)</p>',text,re.S)
#print(res.text)
#print(listArr)
for item in listArr:
    print(item[0])

### re.sub
Replaces every matched substring in the string and returns the resulting string.
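
Beyond a fixed replacement string, re.sub also accepts a function, a count limit, and has a sibling re.subn that reports how many replacements were made. A minimal sketch; `double` is a hypothetical callback and the input string is made up.

```python
import re

content = 'Extra 123 Hello 1234567 World 0987'

def double(match):
    """Replacement callback: receives a match object and returns the new text."""
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, content)            # replacement computed per match
first_only = re.sub(r'\d+', '#', content, count=1)   # replace only the first match
replaced, n = re.subn(r'\d+', '#', content)          # also return the number of replacements

print(doubled)      # Extra 246 Hello 2469134 World 1974
print(first_only)   # Extra # Hello 1234567 World 0987
print(replaced, n)  # Extra # Hello # World # 3
```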

In [86]:
import re

content = 'Extra 123 Hello 1234567 World_This is a Regex Demo 0987 stings'
result = re.sub(r'\d+','',content)
print(result)

Extra  Hello  World_This is a Regex Demo  stings


In [48]:
import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
content = re.sub(r'\d+', 'Replacement', content)
print(content)

Extra stings Hello Replacement World_This is a Regex Demo Extra stings


In [51]:
import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
content = re.sub(r'(\d+)', r'\1 8910', content)
print(content)

Extra stings Hello 1234567 8910 World_This is a Regex Demo Extra stings


In [54]:
import re

html = '''<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
        </li>
    </ul>
</div>'''
html = re.sub('<a.*?>|</a>', '', html)
print(html)
results = re.findall('<li.*?>(.*?)</li>', html, re.S)
print(results)
for result in results:
    print(result.strip())

<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            沧海一声笑
        </li>
        <li data-view="4" class="active">
            往事随风
        </li>
        <li data-view="6">光辉岁月</li>
        <li data-view="5">记事本</li>
        <li data-view="5">
            但愿人长久
        </li>
    </ul>
</div>
['一路上有你', '\n            沧海一声笑\n        ', '\n            往事随风\n        ', '光辉岁月', '记事本', '\n            但愿人长久\n        ']
一路上有你
沧海一声笑
往事随风
光辉岁月
记事本
但愿人长久


## re.compile
Compiles a regular expression string into a pattern object so that the pattern can be reused.
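
The compiled pattern object has its own search, findall, and sub methods, so the pattern does not need to be passed back into the module-level functions. A minimal sketch with a made-up string:

```python
import re

# Compile once, then call the pattern object's own methods.
number = re.compile(r'\d+')

text = 'Hello 1234567 World 89'

first = number.search(text).group()  # first run of digits
all_runs = number.findall(text)      # every run of digits
masked = number.sub('#', text)       # replace every run

print(first)     # 1234567
print(all_runs)  # ['1234567', '89']
print(masked)    # Hello # World #
```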

In [57]:
import re

content = '''Hello 1234567 World_This
is a Regex Demo'''
pattern = re.compile('Hello.*Demo', re.S)
result = re.match(pattern, content)
#result = re.match('Hello.*Demo', content, re.S)
print(result)

<_sre.SRE_Match object; span=(0, 40), match='Hello 1234567 World_This\nis a Regex Demo'>


## Hands-on practice

In [62]:
import requests
import re
content = requests.get('https://book.douban.com/').text
pattern = re.compile('<li.*?cover.*?href="(.*?)".*?title="(.*?)".*?more-meta.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>', re.S)
results = re.findall(pattern, content)
for result in results:
    url, name, author, date = result
    author = re.sub(r'\s', '', author)
    date = re.sub(r'\s', '', date)
    print(url, name, author, date)

https://book.douban.com/subject/26925834/?icn=index-editionrecommend 别走出这一步 [英]S.J.沃森 2017-1
https://book.douban.com/subject/26953532/?icn=index-editionrecommend 白先勇细说红楼梦 白先勇 2017-2-1
https://book.douban.com/subject/26959159/?icn=index-editionrecommend 岁月凶猛 冯仑 2017-2
https://book.douban.com/subject/26949210/?icn=index-editionrecommend 如果没有今天，明天会不会有昨天？ [瑞士]伊夫·博萨尔特（YvesBossart） 2017-1
https://book.douban.com/subject/27001447/?icn=index-editionrecommend 人类这100年 阿夏 2017-2
https://book.douban.com/subject/26864566/?icn=index-latestbook-subject 眼泪的化学 [澳]彼得·凯里 2017-2
https://book.douban.com/subject/26991064/?icn=index-latestbook-subject 青年斯大林 [英]西蒙·蒙蒂菲奥里 2017-3
https://book.douban.com/subject/26938056/?icn=index-latestbook-subject 带艾伯特回家 [美]霍默·希卡姆 2017-3
https://book.douban.com/subject/26954757/?icn=index-latestbook-subject 乳房 [美]弗洛伦斯·威廉姆斯 2017-2
https://book.douban.com/subject/26956479/?icn=index-latestbook-subject 草原动物园 马伯庸 2017-3
https://book.douban.com/subject/26956018/?icn=index-latestboo

In [1]:
import requests
help(requests)

Help on package requests:

NAME
    requests

DESCRIPTION
    Requests HTTP Library
    ~~~~~~~~~~~~~~~~~~~~~
    
    Requests is an HTTP library, written in Python, for human beings. Basic GET
    usage:
    
       >>> import requests
       >>> r = requests.get('https://www.python.org')
       >>> r.status_code
       200
       >>> 'Python is a programming language' in r.content
       True
    
    ... or POST:
    
       >>> payload = dict(key1='value1', key2='value2')
       >>> r = requests.post('http://httpbin.org/post', data=payload)
       >>> print(r.text)
       {
         ...
         "form": {
           "key2": "value2",
           "key1": "value1"
         },
         ...
       }
    
    The other HTTP methods are supported - see `requests.api`. Full documentation
    is at <http://python-requests.org>.
    
    :copyright: (c) 2017 by Kenneth Reitz.
    :license: Apache 2.0, see LICENSE for more details.

PACKAGE CONTENTS
    __version__
    _internal_utils
    ad

In [8]:
res = requests.get('http://pg.qq.com/gicp/news/103/2/2010/1.html')
res.encoding = 'gbk'
res.text

'<!DOCTYPE html>\n<html lang="zh-CN">\n\n<head>\n\t<meta charset="gbk">\n<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n<meta name="robots" content="all">\n<meta name="author" content="Tencent-CP">\n<meta name="Copyright" content="Tencent">\n<meta name="Description" content="腾讯光子工作室群自研大作，正版战斗特训手游《绝地求生：刺激战场》！虚幻4引擎,次世代完美画质,重现端游视听感受;8000M×8000M实景地图,全面自由施展战术;百人竞技,真实射击手感;好友一键组队,语音开黑;给您带来一场刺激的竞技体验" />\n<meta name="Keywords" content="绝地求生,绝地求生刺激战场,绝地求生手游,刺激战场,绝地求生手游下载,绝地求生刺激战场下载,绝地求生国服,吃鸡手游,刺激战场下载,绝地求生 刺激战场,绝地求生下载,掘地求生刺激战场,腾讯吃鸡手游,绝地求生大逃杀,腾讯绝地求生手游"\n/> \n\t<title>枪械-正版战斗特训手游《绝地求生：刺激战场》刺激开测，立即加入！-绝地求生刺激战场官网网站-腾讯游戏 </title>\n\t<script>var _gcatName = "枪械";</script>\n\t<link rel="stylesheet" href="//game.gtimg.cn/images/cjm/web201801/css/listpage.css">\n\t<link rel="stylesheet" href="//game.gtimg.cn/images/cjm/web201801/public.css">\n</head>\n<body>\n\t<div class="wrap">\n\t\t<!--公共top-->\n\t\t\n <div class="top-nav">\n    <!--[if lt IE 8]><p class="browser-tips">您的浏览器版本过低，请升级浏览器

In [9]:
import re

In [16]:
pattern = re.compile(r'<li>.*?a href="(.*?)".*?gllink.*?title="(.*?)".*?<span>\s*(.*?)\s*</span>',re.S)
result = re.findall(pattern,res.text)
result

[('/web201801/main.shtml', '近战爆发之王!你需要掌握的霰弹枪射击技巧', '2018-05-05'),
 ('/gicp/news/104/6424335.html', '狙击模式神器！VSS枪械实战解析', '2018-04-24'),
 ('/gicp/news/104/6424332.html', '极速战场玩法多，武器选择各不同', '2018-04-24'),
 ('/gicp/news/104/6422095.html', '步枪的最佳替身!UMP9冲锋枪介绍', '2018-04-13'),
 ('/gicp/news/104/6422094.html', '能扫射的狙击枪，VSS枪械瞄准镜运用浅析', '2018-04-13'),
 ('/gicp/news/104/6422093.html', '硬汉专属武器，M249重机枪剖析', '2018-04-13'),
 ('/gicp/news/104/6421856.html', '亲测无解!DP-28轻机枪扫射太可怕', '2018-04-12'),
 ('/gicp/news/104/6421859.html', '制霸近身钢枪，汤姆逊冲锋枪解读', '2018-04-12'),
 ('/gicp/news/104/6421858.html', '被忽略的武器，现代突击步枪SCAR-L', '2018-04-12'),
 ('/gicp/news/104/6421857.html', '综合性能最佳!M24狙击步枪解读', '2018-04-12'),
 ('/gicp/news/104/6421653.html', '新手秒成狙神？Mini14狙击枪介绍', '2018-04-11'),
 ('/gicp/news/104/6421387.html', '一枪一人头!AWM狙击步枪的爆头艺术', '2018-04-10'),
 ('/gicp/news/104/6420605.html', '无视防具真神器！霰弹枪全解析', '2018-04-06'),
 ('/gicp/news/104/6420162.html', '新手也能驰骋沙漠，新地图枪械推荐', '2018-04-05'),
 ('/gicp/news/104/6419328.html', '近战无敌！空

In [11]:
result

[]

In [None]:
import requests 
import re
res = requests.get('http://www.dytt8.net/html/gndy/dyzz/20180919/57491.html')
res.encoding = 'gb2312'
content = res.text



# '发布时间：' is the page's "published on" label
pattern = re.compile(r'<div class="title_all.*?#07519a>(.*?)\s*</font>.*?发布时间：\s*(.*?)\s*<tr>.*?<img.*?src="(.*?)".*?<td style="WORD-WRAP: break-word".*?href="(.*?)"',re.S)
result = re.search(pattern,content)
a = result.groups()
print(a)

imgUrl = a[2]

resImg = requests.get(imgUrl)

print(resImg.content)


with open('./{}.jpg'.format(a[0][:5]),'wb') as f:
    f.write(resImg.content)

# Fields to extract: title, time, url

In [4]:
import requests 
import re
def pageContent(url):
    res = requests.get(url)
    res.encoding = 'gb2312'
    content = res.text
    pattern = re.compile(r'<div class="title_all.*?#07519a>(.*?)\s*</font>.*?发布时间：\s*(.*?)\s*<tr>.*?<img.*?src="(.*?)".*?<td style="WORD-WRAP: break-word".*?href="(.*?)"',re.S)
    result = re.search(pattern,content)
    a = result.groups()
    return {
        'title':a[0],
        'time':a[1],
        'imgUrl':a[2],
        'download':a[3]
    }


pageContent('http://www.dytt8.net/html/gndy/dyzz/20180919/57492.html')

{'title': '2018年科幻惊悚《人类清除计划4》BD中英双字幕',
 'time': '2018-09-19',
 'imgUrl': 'https://lookimg.com/images/2018/09/18/MphLn.jpg',
 'download': 'ftp://ygdy8:ygdy8@yg72.dydytt.net:8282/阳光电影www.ygdy8.com.人类清除计划4.BD.720p.中英双字幕.mkv'}
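
pageContent calls result.groups() directly, which raises AttributeError whenever re.search returns None, for example if a page's layout differs from the pattern. The sketch below shows one way to guard against that using named groups. `FILM_RE`, `parse_film`, and the sample HTML are hypothetical and deliberately simplified; they do not reproduce the real page markup.

```python
import re

# Simplified, hypothetical pattern; the real page markup is not reproduced here.
FILM_RE = re.compile(r'<h1>(?P<title>.*?)</h1>.*?<a href="(?P<download>.*?)"', re.S)

def parse_film(html):
    """Return a dict of captured fields, or None when the page does not match."""
    m = FILM_RE.search(html)
    if m is None:
        return None
    # groupdict() maps each named group to the text it captured.
    return m.groupdict()

page = '<h1>Demo Film</h1>\n<p>info</p>\n<a href="ftp://example.invalid/demo.mkv">download</a>'
print(parse_film(page))
print(parse_film('<html>layout changed</html>'))  # None
```

A caller can then skip pages that return None instead of stopping the whole crawl.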

In [5]:
# How to get the total number of pages

In [6]:
import requests
import re
import json


num = 1
res = requests.get('http://www.dytt8.net/html/gndy/oumei/list_7_'+str(num)+'.html')
res.encoding = 'gb2312'
result = res.text
# print(result)


# The page contains '共N页' ("N pages in total")
yema = re.compile(r'''共(\d*?)页''',re.S)
pageNum = re.search(yema,result)
pageNum =pageNum.groups()[0]





for i in range(int(pageNum)):
    res = requests.get('http://www.dytt8.net/html/gndy/oumei/list_7_'+str(i+1)+'.html')
    res.encoding = 'gb2312'
    result = res.text
    
    pattern = re.compile('''<td height="26">.*?<a href="(.*?)" class="ulink">(.*?)</a>''',re.S)
    resultList = re.findall(pattern,result)
    print(resultList)

    filmList = []

    for item in resultList:
        url = 'http://www.dytt8.net'+item[0]
        filmDict = pageContent(url)
        filmList.append(filmDict)

    with open('./json/%s.json'%i,'w') as f:
        json.dump(filmList,f)
    









[('/html/gndy/dyzz/20180919/57492.html', '2018年科幻惊悚《人类清除计划4》BD中英双字幕'), ('/html/gndy/dyzz/20180918/57485.html', '2018年惊悚动作《心甘情愿/谍影丽人》BD中英双字幕'), ('/html/gndy/jddy/20180918/57484.html', '2018年惊悚动作《地狱之路》BD中英双字幕'), ('/html/gndy/dyzz/20180918/57481.html', '2018年科幻动作《巨齿鲨/极悍巨鲨》HD韩版中字'), ('/html/gndy/dyzz/20180917/57479.html', '2018年动作《讨债人》BD中英双字幕'), ('/html/gndy/jddy/20180917/57478.html', '2017年剧情《扎马/流亡将军沙马》BD中英双字幕'), ('/html/gndy/jddy/20180917/57476.html', '2018年动作惊悚《消音器/沉默者》BD中英双字幕'), ('/html/gndy/jddy/20180917/57475.html', '2018年动作喜剧《玩命毒师2》BD意大利语中字'), ('/html/gndy/dyzz/20180916/57470.html', '2018年科幻动作《游侠索罗：星球大战外传》BD国英双语双字'), ('/html/gndy/dyzz/20180916/57469.html', '2018年剧情运动《奇迹赛季》BD中英双字幕'), ('/html/gndy/jddy/20180916/57468.html', '2018年奇幻冒险《炭火仔：勇战巨魔王》BD中字'), ('/html/gndy/jddy/20180916/57467.html', '2018年剧情喜剧《篮球冠军》BD西班牙语中字'), ('/html/gndy/jddy/20180915/57464.html', '2018年剧情《切肤之痛》BD中英双字幕'), ('/html/gndy/jddy/20180915/57463.html', '2017年剧情《温柔女子》BD俄语中字'), ('/html/gndy/dyzz/20180915/57461.html',

ConnectionError: HTTPConnectionPool(host='www.dytt8.net', port=80): Max retries exceeded with url: /html/gndy/dyzz/20180830/57348.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000002446E298DA0>: Failed to establish a new connection: [WinError 10060] 由于连接方在一段时间后没有正确答复或连接的主机没有反应，连接尝试失败。',))

In [5]:
%%writefile app.py
from flask import Flask
app = Flask(__name__)


@app.route('/')
def index():
    return 

Writing app.py
