# Ch02-Strings and Text 

## 2.1. Splitting Strings on Any of Multiple Delimiters

### Problem 

You need to split a string into fields, but the delimiters (and spacing around them) aren’t consistent throughout the string.

### Solution 

The *split()* method of string objects is really meant for very simple cases, and does not allow for multiple delimiters or account for possible whitespace around the delimiters. In cases when you need a bit more flexibility, use the *re.split()* function:

In [1]:
line = 'asdf fjdk; afed, fjek,asdf,       foo'

In [2]:
import re

In [3]:
re.split(r'[;,\s]\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

### Discussion 

The *re.split()* function is useful because you can specify multiple patterns for the separator. For example, as shown in the solution, the separator is either a comma (,), semicolon (;), or whitespace, followed by any amount of extra whitespace. Whenever that pattern is found, the entire match becomes the delimiter between whatever fields lie on either side of the match. The result is a list of fields, just as with *str.split()*.

When using *re.split()*, you need to be a bit careful should the regular expression pattern involve a capture group enclosed in parentheses. If capture groups are used, then the matched text is also included in the result.

In [4]:
fields = re.split(r'(;|,|\s)\s*', line)

In [5]:
fields

['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

Getting the split characters might be useful in certain contexts. For example, maybe you need the split characters later on to reform an output string:

In [6]:
values = fields[::2] # every even-indexed item of fields

In [7]:
values

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

In [8]:
delimiters = fields[1::2] + ['']

In [9]:
delimiters

[' ', ';', ',', ',', ',', '']

In [10]:
# Reform the line using the same delimiters
''.join(v+d for v,d in zip(values, delimiters))

'asdf fjdk;afed,fjek,asdf,foo'

If you don’t want the separator characters in the result, but still need to use parentheses to group parts of the regular expression pattern, make sure you use a noncapture group, specified as *(?:...)* . 

In [11]:
re.split(r'(?:,|;|\s)\s*', line)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
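One edge case the examples above do not show: when a line starts or ends with a delimiter, or has two different delimiters in a row, *re.split()* returns empty strings for the missing fields. The following is a minimal sketch of one way to handle that. The helper name `split_fields` is hypothetical and not part of the recipe.

```python
import re

def split_fields(line):
    """Hypothetical helper: split on commas, semicolons, or whitespace
    and drop any empty fields."""
    # A leading, trailing, or doubled delimiter makes re.split() emit
    # empty strings, so filter them out afterwards.
    return [field for field in re.split(r'[;,\s]\s*', line) if field]

print(split_fields(' a, b ;c '))
```

Whether dropping empty fields is appropriate depends on the data. For column-oriented input, an empty field may be meaningful and should be kept.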

## 2.2. Matching Text at the Start or End of a String 

### Problem 

You need to check the start or end of a string for specific text patterns, such as filename extensions, URL schemes, and so on.

### Solution 

A simple way to check the beginning or end of a string is to use the *str.startswith()* or *str.endswith()* methods.

In [12]:
filename = 'spam.txt'

In [13]:
filename.endswith('.txt')

True

In [14]:
filename.startswith('file:')

False

In [15]:
url = 'http://www.python.org'

In [16]:
url.startswith('http:')

True

If you need to check against multiple choices, simply provide a tuple of possibilities to *startswith()* or *endswith()*:

In [17]:
from urllib.request import urlopen

In [18]:
def read_data(name):
    if name.startswith(('http:', 'https:', 'ftp:')):
        return urlopen(name).read()
    else:
        with open(name) as f:
            return f.read()

Oddly, this is one part of Python where a tuple is actually required as input. If you happen to have the choices specified in a list or set, just make sure you convert them using *tuple()* first.

In [19]:
choices = ['http:', 'ftp:']

In [20]:
url = 'http://www.python.org'

In [21]:
url.startswith(tuple(choices))

True
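To see the tuple requirement in action, the short sketch below passes the list directly and catches the resulting exception. The variable name `error` is only for illustration.

```python
choices = ['http:', 'ftp:']
url = 'http://www.python.org'

error = None
try:
    url.startswith(choices)      # a list is rejected
except TypeError as exc:
    error = exc
    print('TypeError:', exc)

# Converting to a tuple first works as expected.
print(url.startswith(tuple(choices)))
```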

### Discussion 

The startswith() and endswith() methods provide a very convenient way to perform basic prefix and suffix checking. Similar operations can be performed with slices, but are far less elegant. 

In [22]:
filename='spam.txt'

In [23]:
filename[-4:] == '.txt'

True

In [24]:
url = 'http://www.python.org'

In [25]:
url[:5] == 'http:' or url[:6] == 'https:' or url[:4] == 'ftp:'

True

You might also be inclined to use regular expressions as an alternative.

In [26]:
re.match('http:|https:|ftp:', url)

<_sre.SRE_Match object; span=(0, 5), match='http:'>

## 2.3. Matching Strings Using Shell Wildcard Patterns

### Problem 

You want to match text using the same wildcard patterns as are commonly used when working in Unix shells (e.g., \*.py, Dat[0-9]\*.csv, etc.).

### Solution 

The fnmatch module provides two functions— *fnmatch()* and *fnmatchcase()* —that can be used to perform such matching. 

In [27]:
from fnmatch import fnmatch, fnmatchcase

In [28]:
fnmatch('foo.txt','*.txt')

True

In [29]:
fnmatch('foo.txt', '?oo.txt')

True

In [30]:
fnmatch('Dat45.csv', 'Dat[0-9][0-9]')

False

In [31]:
names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

In [32]:
[name for name in names if fnmatch(name, 'Dat*.csv')]

['Dat1.csv', 'Dat2.csv']

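Normally, *fnmatch()* matches patterns using the same case-sensitivity rules as the operating system, because both arguments are first normalized with *os.path.normcase()*. The result of a mixed-case comparison can therefore differ between platforms. If this distinction matters, use *fnmatchcase()* instead. It matches exactly based on the lowercase and uppercase conventions that you supply.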

In [33]:
fnmatchcase('foo.txt', '*.TXT')

False

An often overlooked feature of these functions is their potential use with data processing of nonfilename strings. For example, suppose you have a list of street addresses like this:

In [34]:
addresses = [
'5412 N CLARK ST',
'1060 W ADDISON ST',
'1039 W GRANVILLE AVE',
'2122 N CLARK ST',
'4802 N BROADWAY',
]

In [35]:
[addr for addr in addresses if fnmatchcase(addr, '* ST')]

['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']

In [36]:
[addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]

['5412 N CLARK ST']

### Discussion 

The matching performed by *fnmatch* sits somewhere between the functionality of simple string methods and the full power of regular expressions. If you’re just trying to provide a simple mechanism for allowing wildcards in data processing operations, it’s often a reasonable solution.
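As a rough illustration of where *fnmatch* sits relative to regular expressions, the sketch below uses two other functions from the same standard-library module: *fnmatch.filter()*, which applies a pattern to a whole list, and *fnmatch.translate()*, which converts a wildcard pattern into a regular expression string. The variable names are chosen for this example only.

```python
import fnmatch
import re

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

# fnmatch.filter() applies one pattern to a whole list of names.
csv_names = fnmatch.filter(names, 'Dat*.csv')

# fnmatch.translate() converts a wildcard pattern to a regular expression
# string, which can then be compiled and reused.
dat_regex = re.compile(fnmatch.translate('Dat[0-9]*.csv'))

print(csv_names)
print(bool(dat_regex.match('Dat45.csv')))
```

The exact text returned by *translate()* varies between Python versions, so it is safer to compile and use it than to rely on its literal form.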

## 2.4. Matching and Searching for Text Patterns 

### Problem 

You want to match or search text for a specific pattern.

### Solution 

If the text you’re trying to match is a simple literal, you can often just use the basic string methods, such as *str.find()* , *str.endswith()* , *str.startswith()* , or similar. 

In [37]:
text = 'yeah, but no, but yeah, but no, but yeah'

In [38]:
# Exact match
text == 'yeah'

False

In [39]:
# Match at start or end
text.startswith('yeah')

True

In [40]:
text.endswith('no')

False

In [41]:
# Search for the location of the first occurrence
text.find('no')

10

For more complicated matching, use regular expressions and the *re* module. To illustrate the basic mechanics of using regular expressions, suppose you want to match dates specified as digits, such as “11/27/2012.” Here is a sample of how you would do it:

In [42]:
text1 = '11/27/2012'

In [43]:
text2 = 'Nov 27, 2012'

In [44]:
# Simple matching: \d+ means match one or more digits
if re.match(r'\d+/\d+/\d+', text1):
    print('yes')
else:
    print('no')

yes


If you’re going to perform a lot of matches using the same pattern, it usually pays to precompile the regular expression pattern into a pattern object first.

In [45]:
datepat = re.compile(r'\d+/\d+/\d+')

In [46]:
if datepat.match(text1):
    print('yes')

yes


In [47]:
if datepat.match(text2):
    print('yes')
else:
    print('no')

no


*match()* always tries to find the match at the start of a string. If you want to search text for all occurrences of a pattern, use the *findall()* method instead. 

In [48]:
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

In [49]:
datepat.findall(text)

['11/27/2012', '3/13/2013']
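Between *match()* (start of string only) and *findall()* (every occurrence) there is also *search()*, which returns a match object for the first occurrence anywhere in the text. The sketch below contrasts the two on the same sample text. The variable names `at_start` and `first` are illustrative.

```python
import re

datepat = re.compile(r'\d+/\d+/\d+')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

at_start = datepat.match(text)   # None: the text does not begin with a date
first = datepat.search(text)     # first date found anywhere in the text

print(at_start)
print(first.group())
```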

When defining regular expressions, it is common to introduce capture groups by enclosing parts of the pattern in parentheses.

In [51]:
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

Capture groups often simplify subsequent processing of the matched text because the contents of each group can be extracted individually.

In [52]:
m = datepat.match('11/27/2012')

In [53]:
m

<_sre.SRE_Match object; span=(0, 10), match='11/27/2012'>

In [54]:
# Extract the contents of each group
m.group(0)

'11/27/2012'

In [55]:
m.group(1)

'11'

In [56]:
m.group(2)

'27'

In [57]:
m.group(3)

'2012'

In [58]:
m.groups()

('11', '27', '2012')

In [59]:
month, day, year = m.groups()

In [60]:
# Find all matches (notice splitting into tuples)
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

In [61]:
datepat.findall(text)

[('11', '27', '2012'), ('3', '13', '2013')]

In [64]:
for month, day, year in datepat.findall(text):
    print('{}-{}-{}'.format(year, month, day))

2012-11-27
2013-3-13


The *findall()* method searches the text and finds all matches, returning them as a list. If you want to find matches iteratively, use the *finditer()* method instead. 

In [65]:
for m in datepat.finditer(text):
    print(m.groups())

('11', '27', '2012')
('3', '13', '2013')
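One practical difference from *findall()* is that *finditer()* yields full match objects, so information such as the position of each match is still available. A small sketch, reusing the same pattern and text as above:

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Record the matched text together with its (start, end) offsets.
located = [(m.group(0), m.span()) for m in datepat.finditer(text)]

for matched, (start, end) in located:
    print(matched, start, end)
```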


### Discussion

Be aware that the match() method only checks the beginning of a string. It’s possible that it will match things you aren’t expecting.

In [66]:
m = datepat.match('11/27/2012abcdef')

In [67]:
m

<_sre.SRE_Match object; span=(0, 10), match='11/27/2012'>

In [68]:
m.group()

'11/27/2012'

If you want an exact match, make sure the pattern includes the end-marker ($):

In [69]:
datepat = re.compile(r'(\d+)/(\d+)/(\d+)$')

In [70]:
datepat.match('11/27/2012abcdef')

In [71]:
datepat.match('11/27/2012')

<_sre.SRE_Match object; span=(0, 10), match='11/27/2012'>
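An alternative to adding the end-marker, assuming Python 3.4 or later, is the *fullmatch()* method, which succeeds only if the whole string matches. The sketch below also shows one subtle difference: `$` still matches just before a trailing newline, whereas *fullmatch()* does not. The variable names are illustrative.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

exact = datepat.fullmatch('11/27/2012')            # whole string is a date
trailing = datepat.fullmatch('11/27/2012abcdef')   # extra text: no match
with_newline = datepat.fullmatch('11/27/2012\n')   # newline counts as extra text

# For comparison, '$' also matches just before a final newline.
dollar_newline = re.match(r'(\d+)/(\d+)/(\d+)$', '11/27/2012\n')

print(bool(exact), bool(trailing), bool(with_newline), bool(dollar_newline))
```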

Last, if you’re just doing a simple text matching/searching operation, you can often skip the compilation step and use module-level functions in the re module instead.

In [72]:
re.findall(r'(\d+)/(\d+)/(\d+)', text)

[('11', '27', '2012'), ('3', '13', '2013')]

## 2.5. Searching and Replacing Text 

### Problem 

You want to search for and replace a text pattern in a string.

### Solution 

For simple literal patterns, use the *str.replace()* method.

In [73]:
text = 'yeah, but no, but yeah, but no, but yeah'

In [74]:
text.replace('yeah', 'yep')

'yep, but no, but yep, but no, but yep'

For more complicated patterns, use the *sub()* functions/methods in the *re* module.

In [77]:
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)

'Today is 2012-11-27. PyCon starts 2013-3-13.'

The first argument to *sub()* is the pattern to match and the second argument is the replacement pattern. Backslashed digits such as \3 refer to capture group numbers in the pattern.
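Numbered references can become hard to read as patterns grow. As a sketch of one alternative, the *re* module also supports named groups, written `(?P<name>...)` in the pattern and referenced as `\g<name>` in the replacement. The group names `month`, `day`, and `year` below are chosen for this example.

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Same date pattern as above, but with named rather than numbered groups.
named_datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')
result = named_datepat.sub(r'\g<year>-\g<month>-\g<day>', text)

print(result)
```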

If you’re going to perform repeated substitutions of the same pattern, consider compiling it first for better performance. 

In [82]:
datepat=re.compile(r'(\d+)/(\d+)/(\d+)')
datepat.sub(r'\3-\1-\2', text)

'Today is 2012-11-27. PyCon starts 2013-3-13.'

For more complicated substitutions, it’s possible to specify a substitution callback function instead. 

In [85]:
from calendar import month_abbr
def change_date(m):
    mon_name = month_abbr[int(m.group(1))]
    return '{} {} {}'.format(m.group(2), mon_name, m.group(3))

In [86]:
datepat.sub(change_date, text)

'Today is 27 Nov 2012. PyCon starts 13 Mar 2013.'

As input, the argument to the substitution callback is a match object, as returned by *match()* or *search()*. Use the *.group()* method to extract specific parts of the match. The function should return the replacement text.
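A callback is also useful when the replacement needs computation that a replacement pattern cannot express. The sketch below, with a hypothetical callback named `to_iso`, converts each date to a zero-padded year-month-day form.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

def to_iso(m):
    """Hypothetical callback: rewrite month/day/year as zero-padded
    year-month-day."""
    month, day, year = (int(g) for g in m.groups())
    return '{:04d}-{:02d}-{:02d}'.format(year, month, day)

iso_text = datepat.sub(to_iso, text)
print(iso_text)
```

Note that this sketch does not validate the numbers, so a string such as 99/99/2012 would be rewritten as well.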

If you want to know how many substitutions were made in addition to getting the replacement text, use *re.subn()* instead. 

In [87]:
newtext, n = datepat.subn(r'\3-\1-\2', text)

In [88]:
newtext

'Today is 2012-11-27. PyCon starts 2013-3-13.'

In [89]:
n

2

### Discussion 

There isn’t much more to regular expression search and replace than the *sub()* method shown. The trickiest part is specifying the regular expression pattern—something that’s best left as an exercise to the reader.

## 2.6. Searching and Replacing Case-Insensitive Text  

### Problem 

You need to search for and possibly replace text in a case-insensitive manner.

### Solution

To perform case-insensitive text operations, you need to use the re module and supply the *re.IGNORECASE* flag to various operations. 

In [90]:
text = 'UPPER PYTHON, lower python, Mixed Python'

In [91]:
re.findall('python', text, flags=re.IGNORECASE)

['PYTHON', 'python', 'Python']

In [92]:
re.sub('python', 'snake', text, flags=re.IGNORECASE)

'UPPER snake, lower snake, Mixed snake'

The last example reveals a limitation: the replacement text won’t match the case of the matched text. If you need to fix this, you might have to use a support function, as in the following:

In [94]:
def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

Here is an example of using this last function:

In [95]:
re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)

'UPPER SNAKE, lower snake, Mixed Snake'

### Discussion 

For simple cases, simply providing the *re.IGNORECASE* flag is enough to perform case-insensitive matching. However, be aware that this may not be enough for certain kinds of Unicode matching involving case folding.
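As one illustration of the case-folding caveat, consider the German letter 'ß', whose uppercase form is conventionally written 'SS'. The sketch below compares a regular expression match with a comparison based on *str.casefold()*. The sample words are chosen for this example only.

```python
import re

pattern, candidate = 'straße', 'STRASSE'

# IGNORECASE compares character by character, so 'ß' never matches 'SS'.
regex_hit = re.fullmatch(pattern, candidate, flags=re.IGNORECASE)

# str.casefold() applies full Unicode case folding ('ß' becomes 'ss').
folded_equal = pattern.casefold() == candidate.casefold()

print(regex_hit, folded_equal)
```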

## 2.7. Specifying a Regular Expression for the Shortest Match

### Problem 

You’re trying to match a text pattern using regular expressions, but it is identifying the longest possible matches of a pattern. Instead, you would like to change it to find the shortest possible match.

### Solution 

In [96]:
str_pat = re.compile(r'\"(.*)\"')

In [97]:
text1 = 'Computer says "no."'

In [98]:
str_pat.findall(text1)

['no.']

In [99]:
text2 = 'Computer says "no." Phone says "yes."'

In [100]:
str_pat.findall(text2)

['no." Phone says "yes.']

In this example, the pattern r'\"(.\*)\"' is attempting to match text enclosed inside quotes. However, the * operator in a regular expression is greedy, so matching is based on finding the longest possible match. Thus, in the second example involving *text2*, it incorrectly matches across the two quoted strings as if they were one.

To fix this, add the *?* modifier after the * operator in the pattern, like this:

In [101]:
str_pat = re.compile(r'\"(.*?)\"')

In [102]:
str_pat.findall(text2)

['no.', 'yes.']

### Discussion 

This recipe addresses one of the more common problems encountered when writing regular expressions involving the dot (.) character. In a pattern, the dot matches any character except a newline. However, if you bracket the dot with starting and ending text (such as a quote), matching will try to find the longest possible match to the pattern. This causes multiple occurrences of the starting or ending text to be skipped altogether and included in the results of the longer match. Adding the ? right after operators such as * or + forces the matching algorithm to look for the shortest possible match instead.
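For the specific case of quoted text, another common approach is to avoid the dot altogether and use a negated character class that cannot run past the closing quote. This is a sketch of that alternative rather than part of the recipe, and the name `quoted` is illustrative.

```python
import re

text2 = 'Computer says "no." Phone says "yes."'

# [^"]* matches any run of characters that are not a double quote,
# so each match necessarily stops at the next closing quote.
quoted = re.compile(r'"([^"]*)"')
found = quoted.findall(text2)

print(found)
```

Unlike the dot, a negated character class also matches newlines, which may or may not be what you want.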

## 2.8. Writing a Regular Expression for Multiline Patterns

### Problem 

You’re trying to match a block of text using a regular expression, but you need the match to span multiple lines.

### Solution 

This problem typically arises in patterns that use the dot (.) to match any character but forget to account for the fact that it doesn’t match newlines. For example, suppose you are trying to match C-style comments:

In [103]:
comment = re.compile(r'/\*(.*?)\*/')

In [104]:
text1 = '/* this is a comment */'

In [105]:
text2 = '''/* this is a
                multiline comment */
        '''        

In [106]:
comment.findall(text1)

[' this is a comment ']

In [107]:
comment.findall(text2)

[]

To fix the problem, you can add support for newlines.

In [108]:
comment = re.compile(r'/\*((?:.|\n)*?)\*/')

In [109]:
comment.findall(text2)

[' this is a\n               multiline comment ']

In this pattern, *(?:.|\n)* specifies a noncapture group (i.e., it defines a group for the purposes of matching, but that group is not captured separately or numbered).

### Discussion

The *re.compile()* function accepts a flag, *re.DOTALL* , which is useful here. It makes the . in a regular expression match all characters, including newlines. 

In [110]:
comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)

In [111]:
comment.findall(text2)

[' this is a\n               multiline comment ']
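To show the flag in a slightly larger setting, the sketch below strips C-style comments from a short piece of text. The sample string `source` is invented for this example. The non-greedy *.\*?* still matters here, because otherwise the first `/*` would be paired with the last `*/`.

```python
import re

comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)

# Invented sample input containing a multiline and a single-line comment.
source = 'int x; /* first\n   comment */ int y; /* second */'

stripped = comment.sub('', source)
print(repr(stripped))
```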

## 2.9. Normalizing Unicode Text to a Standard Representation

### Problem 

You’re working with Unicode strings, but need to make sure that all of the strings have the same underlying representation.

### Solution

In Unicode, certain characters can be represented by more than one valid sequence of code points. 

In [112]:
s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'

In [113]:
s1

'Spicy Jalapeño'

In [114]:
s2

'Spicy Jalapeño'

In [115]:
s1 == s2

False

Having multiple representations is a problem for programs that compare strings. In order to fix this, you should first normalize the text into a standard representation using the *unicodedata* module:

In [116]:
import unicodedata

In [117]:
t1 = unicodedata.normalize('NFC', s1)

In [118]:
t2 = unicodedata.normalize('NFC', s2)

In [119]:
t1 == t2

True

In [120]:
print(ascii(t1))

'Spicy Jalape\xf1o'


In [121]:
t3 = unicodedata.normalize('NFD', s1)
t4 = unicodedata.normalize('NFD', s2)
t3 == t4

True

In [122]:
print(ascii(t3))

'Spicy Jalapen\u0303o'


The first argument to *normalize()* specifies how you want the string normalized. NFC means that characters should be fully composed (i.e., use a single code point if possible). NFD means that characters should be fully decomposed with the use of combining characters.
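One use of the decomposed form is removing diacritical marks: after NFD normalization, each mark is a separate combining character that can be filtered out with *unicodedata.combining()*. The helper name `strip_accents` below is hypothetical, and this is only a sketch of the idea.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: decompose the text, then drop combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    # combining() returns 0 for ordinary characters and a nonzero
    # combining class for marks such as U+0303 (combining tilde).
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents('Spicy Jalape\u00f1o'))
```

Because the input is normalized first, both representations of the sample string give the same result.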

## 2.11. Stripping Unwanted Characters from Strings

### Problem 

You want to strip unwanted characters, such as whitespace, from the beginning, end, or middle of a text string.

### Solution

The *strip()* method can be used to strip characters from the beginning or end of a string. *lstrip()* and *rstrip()* perform stripping from the left or right side, respectively. By default, these methods strip whitespace, but other characters can be given. 

In [123]:
# Whitespace stripping
s = ' hello world \n'

In [124]:
s.strip()

'hello world'

In [125]:
s.lstrip()

'hello world \n'

In [126]:
s.rstrip()

' hello world'

In [129]:
# Character stripping
t = '----hello===='

In [130]:
t.lstrip('-')

'hello===='

In [131]:
t.rstrip('=')

'----hello'

In [132]:
t.strip('-=')

'hello'

### Discussion 

The various *strip()* methods are commonly used when reading and cleaning up data for later processing. For example, you can use them to get rid of whitespace, remove quotation marks, and perform similar cleanup tasks.

Be aware that stripping does not apply to any text in the middle of a string. 

In [134]:
s = ' hello             world  \n'
s = s.strip()
s

'hello             world'

If you needed to do something to the inner space, you would need to use another technique, such as using the *replace()* method or a regular expression substitution. 

In [135]:
s.replace(' ','')

'helloworld'

In [136]:
re.sub(r'\s+', ' ', s)

'hello world'
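Stripping is often combined with iteration when reading lines from a file. The sketch below uses a generator expression so that lines are cleaned one at a time rather than loaded into a temporary list first. An *io.StringIO* object with invented contents stands in for a real file here.

```python
import io

# io.StringIO stands in for a real file opened with open(...).
f = io.StringIO('  alpha  \n\tbeta\n gamma   \n')

# The generator strips each line lazily as it is consumed.
lines = (line.strip() for line in f)
cleaned = list(lines)

print(cleaned)
```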