# 2 Strings and Text

## 2.1 Splitting Strings on Multiple Delimiters

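problem: You need to split a string into fields, but the delimiters (and the whitespace around them) are not consistent throughout the string.

ans: the split() method of str objects handles simple cases; use re.split() when several delimiters are involved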

In [1]:
line = 'asdf fjdk; afed, fjek,asdf, foo'
import re
re.split(r'[;,\s]\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
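For comparison, here is a minimal sketch of why str.split() alone is not enough for this input. It reuses the sample line from the cell above; the variable names `by_whitespace` and `by_pattern` are only illustrative.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# str.split() with no argument splits on runs of whitespace only,
# so ';' and ',' stay attached to the neighbouring fields.
by_whitespace = line.split()
print(by_whitespace)

# re.split() takes a pattern that covers every delimiter at once:
# one of ';', ',' or whitespace, followed by any amount of whitespace.
by_pattern = re.split(r'[;,\s]\s*', line)
print(by_pattern)
```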

In [10]:
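# With a capture group (parentheses), the matched delimiters are also returned
fields = re.split(r'(;|,|\s)\s*', line)
print(fields)

# Keep the delimiters so the string can be rebuilt later
values = fields[::2]    # start=0, step=2: even indexes hold the fields
delimiters = fields[1::2] + ['']    # start=1, step=2: odd indexes hold the delimiters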
print(values)
print(delimiters)

# Reform the line using the same delimiters
''.join(v+d for v,d in zip(values, delimiters))

['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
[' ', ';', ',', ',', ',', '']


'asdf fjdk;afed,fjek,asdf,foo'
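If the parentheses are needed only for grouping and the delimiters should not appear in the result, a non-capturing group `(?:...)` can be used instead. A short sketch on the same sample line:

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'

# (?:...) groups the alternatives without capturing them,
# so re.split() returns only the fields, not the delimiters.
fields = re.split(r'(?:,|;|\s)\s*', line)
print(fields)
```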

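## 2.2 Matching Text at the Start or End of a String

problem: You need to check the start or end of a string against specific text patterns, such as filename extensions or URL schemes.

ans: the str.startswith() and str.endswith() methods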

In [15]:
# Check the start or end of a string
filename = 'spam.txt'
filename.endswith('.txt')
filename.startswith('file:')
url = 'https://www.python.org'
url.startswith('https:')

True

In [21]:
# Filter a directory listing by suffix; endswith() also accepts a tuple of suffixes to test several at once
import os

filenames = os.listdir('.')
print(filenames)
ipynb = [name for name in filenames if name.endswith('.ipynb')]
print(ipynb)
any(name.endswith('.ipynb') for name in filenames)

['2.string and text.ipynb']
['2.string and text.ipynb']


True
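The cell above tests a single suffix. As a sketch of the tuple form, the example below uses a hypothetical list of filenames (not taken from the directory listing above) and checks several suffixes in one call:

```python
# Hypothetical filenames, used only for illustration.
filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

# Passing a tuple makes endswith() return True if any suffix matches.
sources = [name for name in filenames if name.endswith(('.c', '.h'))]
print(sources)

# any() answers "is there at least one match?" without building a list.
has_python = any(name.endswith('.py') for name in filenames)
print(has_python)
```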

In [25]:
from urllib.request import urlopen

def read_data(name):
    if name.startswith(('http:', 'https:', 'ftp:')):  # input tuple
        return urlopen(name).read()
    else:
        with open(name) as f:
            return f.read()

# startswith() accepts a str or a tuple of str, so convert the list first
choices = ['https:', 'ftp:']
url = 'https://www.python.org'
data = read_data(url)    # performs a real network request
url.startswith(tuple(choices))

True
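The tuple is required, not optional: passing a list raises a TypeError. A minimal sketch that shows the failure and the fix (the name `error` is only illustrative):

```python
choices = ['http:', 'ftp:']
url = 'http://www.python.org'

# A list is rejected: startswith() wants a str or a tuple of str.
error = None
try:
    url.startswith(choices)
except TypeError as exc:
    error = exc
print(type(error).__name__)

# Converting the list to a tuple works.
matched = url.startswith(tuple(choices))
print(matched)
```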

In [29]:
# slice
filename = 'spam.txt'
filename[-4:] == '.txt'
url = 'https://www.python.org'
url[:6] == 'https:' or url[:5] == 'http:' or url[:4] == 'ftp:'

# re
import re
re.match('http:|https:|ftp:', url)

<re.Match object; span=(0, 6), match='https:'>

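## 2.3 Matching Strings Using Shell Wildcard Patterns

problem: You want to match text using the wildcard patterns commonly used in Unix shells (for example \*.py, Dat\[0-9\]\*.csv).

ans: The fnmatch module provides two functions, fnmatch() and fnmatchcase(), that perform this kind of matching.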

In [32]:
from fnmatch import fnmatch, fnmatchcase

fnmatch('foo.txt', '*.txt')          # True
fnmatch('foo.txt', '?txt')           # False: '?' matches exactly one character
fnmatch('Dat45.csv', 'Dat[0-9]*')    # True
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
[name for name in names if fnmatch(name, 'Dat*.csv')]

['Dat1.csv', 'Dat2.csv']
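The list comprehension above can also be written with fnmatch.filter(), which applies one pattern to a whole list of names. A short sketch with the same names:

```python
import fnmatch

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

# fnmatch.filter(names, pattern) returns the names that match the pattern.
csv_files = fnmatch.filter(names, 'Dat*.csv')
print(csv_files)
```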

In [34]:
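# fnmatch() follows the case-sensitivity rules of the underlying operating system (which vary from system to system)
fnmatch('foo.txt', '*.TXT')
# fnmatchcase() always compares case-sensitively, whatever the platform
fnmatchcase('foo.txt', '*.TXT')

False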
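The platform dependence comes from fnmatch() normalizing both arguments with os.path.normcase() before matching. The sketch below checks that relationship and shows one way to get the same answer on every system; lowercasing both sides is an assumption that suits simple patterns like this one, since it would also alter character classes such as `[A-Z]`.

```python
import os
from fnmatch import fnmatch, fnmatchcase

name, pattern = 'foo.txt', '*.TXT'

# Case-sensitive comparison: same result on every platform.
strict = fnmatchcase(name, pattern)
print(strict)

# Explicit case-insensitive comparison: lowercase both sides first.
relaxed = fnmatchcase(name.lower(), pattern.lower())
print(relaxed)

# fnmatch() behaves like fnmatchcase() applied to normcase()-d arguments.
same = fnmatch(name, pattern) == fnmatchcase(
    os.path.normcase(name), os.path.normcase(pattern)
)
print(same)
```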

In [36]:
# The same functions also work on strings that are not filenames
addresses = [
    '5412 N CLARK ST',
    '1060 W ADDISON ST',
    '1039 W GRANVILLE AVE',
    '2122 N CLARK ST',
    '4802 N BROADWAY',
]

from fnmatch import fnmatchcase
ls = [addr for addr in addresses if fnmatchcase(addr, '*ST')]
print(ls)
ls = [addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]
print(ls)

['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']
['5412 N CLARK ST']


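## 2.4 Matching and Searching for Text Patterns

problem: You want to match or search text for a specific pattern.

ans: for literal text, string methods such as str.find(), str.endswith(), and str.startswith(); for more complex patterns, the re module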
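As a rough sketch of the two approaches, the example below uses a made-up sentence (the variable `sentence` and its contents are assumptions for illustration). String methods cover literal text, while re handles patterns such as dates:

```python
import re

# Hypothetical sample text, used only for illustration.
sentence = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Literal text: str methods are enough.
starts = sentence.startswith('Today')
idx = sentence.find('PyCon')    # index of the first occurrence, or -1 if absent
print(starts, idx)

# Patterns: digits separated by slashes need a regular expression.
dates = re.findall(r'\d+/\d+/\d+', sentence)
print(dates)
```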

In [None]:
text = 'yeah, but no, but yeah, but no, but yeah'
# Exact match
