# Text Strings: Regular Expressions

> It’s time to explore more complex pattern matching by using regular expressions. These are provided in the standard module re, which we’ll import.

In [1]:
import re
# Here, "you" is the pattern we’re looking for, and "young money" is the source
# (the string we want to search). match() checks whether the source begins with the
# pattern.
result = re.match("you", "young money")
print(result)
print(result.end())
print(result.endpos)
print(result.lastindex)
print(result.group())
print(result.re)
print(result.pos)
print(result.regs)
print(result.string)

# ............................................... and more

<re.Match object; span=(0, 3), match='you'>
3
11
None
you
re.compile('you')
0
((0, 3),)
young money
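
match() returns None when the source does not begin with the pattern, so calling result.end() or result.group() directly on a failed match would raise AttributeError. Here is a minimal sketch of a guarded version; the helper name describe_match is hypothetical and made up for this illustration:

```python
import re

def describe_match(pattern, source):
    """Return (matched text, span) if source begins with pattern, else None.

    describe_match is a hypothetical helper, not part of the re module.
    """
    result = re.match(pattern, source)
    if result is None:  # no match at the beginning of source
        return None
    return result.group(), result.span()

hit = describe_match("you", "young money")
miss = describe_match("You", "young money")  # matching is case-sensitive by default
print(hit)
print(miss)
```

The second call fails only because of the capital Y, which is an easy mistake to make when copying patterns between examples.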


>For more complex matches, you can compile your pattern first to speed up the match later:

In [7]:
import re
thepattern = re.compile("You")

result = thepattern.match('Young Frankenstein')
# Isn’t this fun? I like Python for these things; it’s flexible.
print(result)

<re.Match object; span=(0, 3), match='You'>
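
Compiling pays off mainly when the same pattern is applied to many strings. The sketch below reuses one compiled pattern over a short list; the name title_pattern and the sample titles are arbitrary choices for illustration:

```python
import re

# Compile once, then reuse the same pattern object for every source string.
title_pattern = re.compile("You")  # illustrative name

titles = ["Young Frankenstein", "Young Guns", "The Young Ones"]  # sample data
starts_with_you = [t for t in titles if title_pattern.match(t)]
contains_you = [t for t in titles if title_pattern.search(t)]

print(starts_with_you)
print(contains_you)
```

The compiled object offers the same match() and search() methods as the module-level functions, minus the pattern argument.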


>I’ll say it again here:
match() only matches a pattern starting at the beginning of the
source. search() matches a pattern anywhere in the source.

match() is not the only way to compare the pattern and source. Here are several other
methods you can use (we discuss each in the following sections):
• search() returns the first match, if any.
• findall() returns a list of all non-overlapping matches, if any.
• split() splits source at matches with pattern and returns a list of the string
pieces.
• sub() takes another replacement argument, and changes all parts of source that
are matched by pattern to replacement.
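
As a quick preview of split() and sub() as described above, here is a small sketch using the single-letter pattern 'n'; the variable names pieces and replaced are illustrative:

```python
import re

source = 'Young Frankenstein'

pieces = re.split('n', source)       # cut source wherever 'n' matches
replaced = re.sub('n', '?', source)  # replace every 'n' with '?'

print(pieces)    # ['You', 'g Fra', 'ke', 'stei', '']
print(replaced)  # You?g Fra?ke?stei?
```

The trailing empty string in the split result appears because the source ends with a match.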

In [5]:
import re
thepattern = re.compile("You")

result = thepattern.search("Let's Test Young Frankenstein")
print(result)

<re.Match object; span=(11, 14), match='You'>


>Most of the regular expression examples here use ASCII, but
Python’s string functions, including regular expressions, work with
any Python string and any Unicode characters.
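
A short sketch of the Unicode point, assuming Python 3 str patterns. It uses \w (word characters), a pattern that has not been introduced in the text so far, and the sample string is made up:

```python
import re

# Escapes are used so the example does not depend on the file's encoding.
source = 'caf\u00e9 na\u00efve'      # 'café naïve'

words = re.findall(r'\w+', source)   # \w matches Unicode letters for str patterns
m = re.search('\u00e9', source)      # look for the non-ASCII character 'é'

print(words)
print(m.span())
```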

# Find Exact Beginning Match with match()

In [9]:
import re
source = 'Young Frankenstein'
m = re.match('You', source) # match starts at "the beginning of source"

if m: # match returns an object; do this to see what matched
    print(m.group())

m = re.match('^You', source) # start anchor does the same  # ^_^
if m:
    print(m.group())

You
You
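
If you are on Python 3.8 or later (an assumption, since the text does not state a version), the assignment and the test can be combined with the := operator. A minimal sketch:

```python
import re

source = 'Young Frankenstein'

# Assumes Python 3.8+, where := assigns and tests in one expression.
if m := re.match('You', source):
    matched = m.group()
    print(matched)
```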


In [10]:
# How about 'Frank'?

import re
source = 'Young Frankenstein'
m = re.match('Frank', source) # match starts at "the beginning of source"

if m: # match returns an object; do this to see what matched
    print(m.group())

# Nothing is printed: 'Frank' is not at the beginning of source, so match() returns None.
# search() would find it; we try that in a later section.

Let’s change the pattern and try a beginning match with match() again:

In [11]:
import re
source = 'Young Frankenstein'
m = re.match('.*Frank', source)
if m: # match returns an object
    print(m.group())


Young Frank


> WTF? 

# I know this sounds crazy, but:

>Here’s a brief explanation of how our new '.*Frank' pattern works:
• . means any single character.
• * means zero or more of the preceding thing. Together, .* means any number of
characters (even zero).
• Frank is the phrase that we wanted to match, somewhere.
match() returned the string that matched .*Frank: 'Young Frank'.
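
To see the "even zero" part in action, the sketch below runs two patterns against the same source. The pattern '.*Young' is an extra example that does not appear in the text:

```python
import re

source = 'Young Frankenstein'

# .* can match zero characters, so this succeeds even though
# nothing precedes 'Young' in the source.
zero = re.match('.*Young', source)
print(zero.group())

# .* can also consume several characters before the literal part.
some = re.match('.*Frank', source)
print(some.group(), some.span())
```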

# Find First Match with search()
You can use search() to find the pattern 'Frank' anywhere in the source string
'Young Frankenstein', without the need for the .* wildcards:

In [12]:
import re
source = 'Young Frankenstein'
m = re.search('Frank', source)
if m: # search returns an object
    print(m.group())

Frank
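
The difference between the two functions is easiest to see side by side. This sketch runs both with the same pattern and uses span() to show where search() found it; the names at_start and anywhere are illustrative:

```python
import re

source = 'Young Frankenstein'

at_start = re.match('Frank', source)    # only looks at the beginning
anywhere = re.search('Frank', source)   # scans the whole source

print(at_start)
if anywhere:
    start, end = anywhere.span()        # position of the match in source
    print(start, end)
    print(source[start:end])            # slicing with the span gives the match back
```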


# Find All Matches with findall()
The preceding examples looked for one match only. But what if you want to know
how many instances of the single-letter string 'n' are in the string?


In [13]:
import re
source = 'Young Frankenstein'
m = re.findall('n', source)
print(m) # findall returns a list
print('Found', len(m), 'matches')

['n', 'n', 'n', 'n']
Found 4 matches
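
findall() gives you the matched strings but not where they occurred. If positions matter, one option is re.finditer(), which is not covered in the text above; treat this as a side sketch:

```python
import re

source = 'Young Frankenstein'

# finditer() yields one match object per non-overlapping match.
positions = [m.start() for m in re.finditer('n', source)]
print(positions)
```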


In [None]:
# How about 'n' followed by any character?
import re
source = 'Young Frankenstein'
m = re.findall('n.', source)
print(m)
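
The pattern 'n.' needs a character after each 'n', so the final 'n' in the source is not matched. The sketch below compares it with 'n.?', where the ? (an addition that is not in the cell above) makes the following character optional:

```python
import re

source = 'Young Frankenstein'

followed = re.findall('n.', source)    # 'n' plus exactly one more character
optional = re.findall('n.?', source)   # 'n' plus an optional character

print(followed)   # ['ng', 'nk', 'ns']
print(optional)   # ['ng', 'nk', 'ns', 'n']
```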