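### Data Parsing with RE (Regular Expressions)
- Basic steps:
  - Specify the URL
  - Send the request
  - Get the response data
  - Parse the data (regular expressions)
  - Store the data in a database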
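
The steps above can be sketched end to end as follows. This is a minimal sketch rather than the notebook's own code: to keep it runnable offline, the request and response steps are replaced by a hard-coded HTML string (`sample_html`, hypothetical), and the in-memory SQLite database and the table name `links` are assumptions chosen only for illustration.

```python
import re
import sqlite3

# Steps 1-3 (URL, request, response) are stubbed out: with a real site,
# sample_html would come from requests.get(url, headers=headers).text.
sample_html = '''
<ul>
  <li><a href="/news/1.html">First story</a></li>
  <li><a href="/news/2.html">Second story</a></li>
</ul>
'''

# Step 4: parse with a regular expression; each match is a (href, title) tuple.
link_pattern = re.compile(r'<a href="(.*?)">(.*?)</a>')
rows = link_pattern.findall(sample_html)

# Step 5: store the parsed rows in a database (in-memory SQLite here).
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE links (href TEXT, title TEXT)')
conn.executemany('INSERT INTO links VALUES (?, ?)', rows)
conn.commit()

stored = conn.execute('SELECT href, title FROM links ORDER BY href').fetchall()
print(stored)
```

Swapping `':memory:'` for a file path would persist the rows between runs.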
#### A quick review of regular-expression matching

In [2]:
import re

In [5]:
pattern = ".*"
text = "Hello, World!"
result = re.match(pattern, text)
print(result.group())

Hello, World!
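
The `re.match` call above succeeds because `.*` can match from the first character onward. As a small sketch of how the three matching functions differ (the variable names here are illustrative only): `re.match` anchors at the start of the string, `re.search` scans for the first match anywhere, and `re.fullmatch` requires the whole string to match. Note also that `.` does not match a newline by default.

```python
import re

text = "Hello, World!"

# re.match only tries to match at the very start of the string.
at_start = re.match(r"World", text)

# re.search scans the string and returns the first match anywhere.
anywhere = re.search(r"World", text)

# re.fullmatch requires the pattern to cover the entire string.
whole = re.fullmatch(r"Hello", text)

# By default "." does not match "\n", so ".*" stops at the first line break.
first_line = re.match(r".*", "Hello\nWorld").group()

print(at_start)          # None
print(anywhere.group())  # World
print(whole)             # None
print(first_line)        # Hello
```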


In [6]:
pattern = r'\d+'  # match one or more digits
result = re.findall(pattern, 'abc123def456')
print(result)

['123', '456']


In [7]:
pattern = r'[a-zA-Z]'  # match any single English letter, upper or lower case
text = 'Hello, World! 123'
 
matches = re.findall(pattern, text)
print(matches)

['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']


In [8]:
pattern = re.compile(r'[a-z]+', re.IGNORECASE)  # one or more letters; re.IGNORECASE lets [a-z] match uppercase too
result = pattern.findall('AbCdEfGhIjKl')
print(result)

['AbCdEfGhIjKl']


In [11]:
pattern = re.compile(r'.at')  # match any single character followed by "at"
result = pattern.findall('cat, hat, rat')
print(result)

['cat', 'hat', 'rat']


In [12]:
text = '''first line
second line
third line'''
pattern = re.compile(r'line$', re.MULTILINE)  # match "line" at the end of every line; re.MULTILINE makes $ match before each newline
result = pattern.findall(text)
print(result)

['line', 'line', 'line']


In [13]:
text = '''so much
so hot
so beautiful
'''
pattern = re.compile(r'^so', re.MULTILINE)  # match "so" at the start of every line; re.MULTILINE makes ^ match after each newline
result = pattern.findall(text)
print(result)

['so', 'so', 'so']
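
To see what `re.MULTILINE` actually changes, the following sketch runs the same anchored patterns with and without the flag. The flag does not ignore newlines; it changes where `^` and `$` are allowed to match. The variable names are illustrative only.

```python
import re

text = "so much\nso hot\nso beautiful\n"

# Without re.MULTILINE, ^ matches only at the very start of the string.
start_default = re.findall(r'^so', text)
# With re.MULTILINE, ^ also matches right after each newline.
start_multi = re.findall(r'^so', text, re.MULTILINE)

# $ is symmetric: by default it matches only at the end of the string
# (or just before a final newline); with re.MULTILINE, before every newline.
end_default = re.findall(r'\w+$', text)
end_multi = re.findall(r'\w+$', text, re.MULTILINE)

print(start_default)  # ['so']
print(start_multi)    # ['so', 'so', 'so']
print(end_default)    # ['beautiful']
print(end_multi)      # ['much', 'hot', 'beautiful']
```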


In [14]:
pattern = re.compile(r'\bword\b')  # match "word" only as a whole word (\b is a word boundary)
result = pattern.findall('hello word, word!')
print(result)

['word', 'word']


In [15]:
pattern = r'\s+'  # match one or more whitespace characters
result = re.split(pattern, 'hello world python')
print(result)

['hello', 'world', 'python']


In [17]:
pattern = r'\d+'  # match one or more digits
replacement = '***'
result = re.sub(pattern, replacement, 'abc123def456')
print(result)

abc***def***


In [19]:
pattern = "a|b"  # match either "a" or "b"
text = "appbleb"
result = re.findall(pattern, text)
print(result)

['a', 'b', 'b']
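
Alternation is usually combined with grouping, and grouping changes what `re.findall` returns, which matters for the scraping example later on. The sketch below uses made-up file names to show the three cases: no capturing group, one capturing group, and several capturing groups.

```python
import re

text = "logo.png banner.jpg notes.txt icon.gif"

# No capturing group: findall returns the whole match.
# (?:...) groups the alternatives without capturing them.
full = re.findall(r'\w+\.(?:png|jpg|gif)', text)

# One capturing group: findall returns only the text of that group.
stems = re.findall(r'(\w+)\.(?:png|jpg|gif)', text)

# Several capturing groups: findall returns a list of tuples.
pairs = re.findall(r'(\w+)\.(png|jpg|gif)', text)

print(full)   # ['logo.png', 'banner.jpg', 'icon.gif']
print(stems)  # ['logo', 'banner', 'icon']
print(pairs)  # [('logo', 'png'), ('banner', 'jpg'), ('icon', 'gif')]
```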


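### Example: scraping the images on the Nanjing Audit University home page
- Set the target URL
- Parse the HTML to extract the required fields
- Download the data and save it locally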
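
Before touching the network, the parsing step can be tried on a small HTML string. This is a sketch under assumptions: `sample_html` is hypothetical markup (the real page may differ), and joining paths against the site root with `urljoin` is one reasonable way to build absolute URLs, not necessarily what the site requires.

```python
import re
from urllib.parse import urljoin

# Hypothetical markup; the real page's HTML may differ.
sample_html = '''<div class="banner">
  <img alt="campus"
       src="/_upload/article/images/campus.jpg" width="300">
  <img alt="library" src="_upload/article/images/library.png">
</div>'''

pattern_text = r'<img alt=.*?src="(.*?)".*?>'

# .*? is non-greedy, so each piece stops at the first closing quote or bracket.
# re.S lets "." match newlines, which matters when a tag spans several lines.
img_paths = re.findall(pattern_text, sample_html, re.S)

# Without re.S, the first tag is missed because it contains a line break.
without_s = re.findall(pattern_text, sample_html)

# urljoin handles both "/path" and "path" forms against the site root.
base_url = 'https://www.nau.edu.cn/'
img_urls = [urljoin(base_url, p) for p in img_paths]
file_names = [p.split('/')[-1] for p in img_paths]

print(img_urls)
print(file_names)
```

Note that this pattern only matches tags in which `alt` appears before `src`; images written with a different attribute order are skipped.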

In [28]:
import os
import re
from urllib.parse import urljoin

import requests

url = 'https://www.nau.edu.cn/_t1096/main.psp'
base_url = 'https://www.nau.edu.cn/'  # site root used to resolve image paths
headers = {  # send a browser-like User-Agent with every request
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0'
}
page_text = requests.get(url=url, headers=headers).text

# Parse the page: capture the src value of every <img alt=... src="..."> tag
imgs_list = re.findall(r'<img alt=.*?src="(.*?)".*?>', page_text, re.S)
os.makedirs('NAU_imgs', exist_ok=True)  # create the output folder if it does not exist
for img in imgs_list:
    img_url = urljoin(base_url, img)  # build the absolute image URL (avoids a doubled slash)
    img_data = requests.get(url=img_url, headers=headers).content  # raw image bytes
    img_name = img.split('/')[-1]  # file name is the last path segment
    img_path = os.path.join('NAU_imgs', img_name)  # local save path
    with open(img_path, 'wb') as fp:
        fp.write(img_data)

print('Save Success')

Save Success
