## Regular Expressions

Regular expressions are special patterns, written as strings, that are useful for searching for elements in a text. Mastering them takes practice and time, but they are powerful in many ways. In this lesson we will review some of the ways that regular expressions can be used and also how to combine them with functions.

## Characters

Characters match themselves. That means that the regular expression for 'a' is 'a' and so on.

In [3]:
import re ##re is the python module you need for regular expressions
t = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'
match1=re.search(r'a', t)
print(match1)

<_sre.SRE_Match object; span=(55, 56), match='a'>


There are several things to notice in the previous code. First, we need to import the module re, which is Python's standard-library module for regular expressions. (A separate third-party module named regex offers extra features, but it is not part of re and is not needed for this lesson.)

Next, re.search() scans the string and returns a match object for the first location where the pattern matches, or None if there is no match. Usage is re.search(pattern, string). The match object records the span (start and end indexes) and the matched text. It prints as SRE_Match here; recent Python versions print it as re.Match.

Finally, the r'a'. The r prefix marks the pattern as a raw string, meaning that backslashes are left as they are: r'\n' is the two characters \ and n, not a newline.
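Because re.search() can return None, it is usually worth checking the result before using it. The following is a minimal sketch of that check; the variable names found and missing and the pattern 'cat' are chosen here only for illustration.

```python
import re

t = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'

found = re.search(r'a', t)      # a match object, because 'a' occurs in t
missing = re.search(r'cat', t)  # None, because 'cat' never occurs in t

for label, m in (('a', found), ('cat', missing)):
    if m is None:
        print(label, 'not found')
    else:
        # span() gives (start, end); slicing t with it recovers the matched text
        start, end = m.span()
        print(label, 'found at', (start, end), repr(t[start:end]))
```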

### Metacharacters

Metacharacters are characters that have a special meaning in a pattern instead of matching themselves.

In [4]:
match1=re.search('T.e', t) ## '.' matches any character except for \n
print(match1)

match1=re.search('[ie]', t)  ## []	Matches one character specified inside square brackets []; e.g., [aeiou]
print(match1)

match1=re.search('[a-d]', t)  ## -   Matches one character in range inside []: e.g., [0-9] matches any digit
print(match1)

match1=re.search('[A-Z]', t)  ##     To match any letter (upper/lower case) or digit we write [A-Za-z0-9]
print(match1)

match1=re.search('[^The ]', t)  ##  [^]	Matches any one character NOT specified inside [^]; e.g., [^aeiou]
print(match1)




<_sre.SRE_Match object; span=(0, 3), match='The'>
<_sre.SRE_Match object; span=(2, 3), match='e'>
<_sre.SRE_Match object; span=(7, 8), match='c'>
<_sre.SRE_Match object; span=(0, 1), match='T'>
<_sre.SRE_Match object; span=(4, 5), match='q'>


The following are called anchors. They don't match any characters in the text; instead they refer to a position, namely the beginning or the end of a line.

In [5]:
## Anchors

match1=re.search('^...', t)  ##  ^	matches beginning of line (when not used in [^])
print(match1)

match1=re.search('...$', t)  ##   $	matches end of line (when not used in [] or [^])
print(match1)

<_sre.SRE_Match object; span=(0, 3), match='The'>
<_sre.SRE_Match object; span=(76, 79), match='10.'>
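One detail worth knowing: by default, ^ only matches at the very start of the whole string, even if the string contains several lines. The sketch below uses the re.MULTILINE flag to make ^ match at the start of every line. The sample text meals is made up for this illustration, and re.findall() (covered later in this lesson) is used only to list every match.

```python
import re

meals = 'burgers\nenchiladas\ntacos'

# Default behaviour: ^ matches only at the start of the string
first_only = re.findall(r'^.', meals)

# With re.MULTILINE, ^ also matches right after each newline
each_line = re.findall(r'^.', meals, re.MULTILINE)

print(first_only)  # ['b']
print(each_line)   # ['b', 'e', 't']
```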


Regular expressions can be combined and repeated to build larger patterns. In the notation below, consider R, Ra and Rb to be three different expressions.

In [6]:
## Patterns

match1=re.search('T.e' ' quick', t)  ##RaRb	Matches a sequence (one after the other) of Ra followed by Rb
print(match1)

match1=re.search('Tha|The', t)  ##   Ra|Rb	Matches either alternative Ra or Rb 
print(match1)

match1=re.search('Tha?', t)  ##   R?	Matches regular expression R 0/1 time: e.g., R is optional
print(match1)

match1=re.search('s*', t)  ##   R*	Matches regular expression R 0 or more times
print(match1)



<_sre.SRE_Match object; span=(0, 9), match='The quick'>
<_sre.SRE_Match object; span=(0, 3), match='The'>
<_sre.SRE_Match object; span=(0, 2), match='Th'>
<_sre.SRE_Match object; span=(0, 0), match=''>


In [16]:
match1=re.search('s+', t)  ##  R+	Matches regular expression R 1 or more times (note */+ difference)
print(match1)

match1=re.search('q{2}', t)  ##  R{m}	Matches regular expression R exactly m times: e.g., R{5} = RRRRR
print(match1)

match1=re.search('o{3,4}', t)  ##   R{m,n}	Matches regular expression R at least m and at most n times:
                             ##   R{3,5} = RRR|RRRR|RRRRR = RRRR?R?
print(match1)

None
None
None
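The three searches above all return None because the text has no 's', no 'qq', and no run of three o's. As a contrast, here is a small sketch where repetition counts do find something: the dates in the original sentence. The pattern name date_pattern is an assumption made for this example, and the pattern is deliberately loose (it does not check that the date is valid).

```python
import re

t = 'The quick brown fox born on 1/23/2013 jumped over the lazy dog born on 10/6/10.'

# 1-2 digits, a slash, 1-2 digits, a slash, then 2-4 digits
date_pattern = r'[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}'

first_date = re.search(date_pattern, t)
print(first_date.group())  # 1/23/2013
print(first_date.span())   # (28, 37)
```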


## Parentheses

Groups are marked by the '(', ')' metacharacters. '(' and ')' have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them, and you can repeat the contents of a group with a repeating qualifier, such as *, +, ?, or {m,n}. For example, (ab)* will match zero or more repetitions of ab.

In [42]:
m = re.compile ('(f((o)x))..')
match1=m.match('foxes')  ##  (R)	     Matches R and delimits a group (1, 2, ...) (remembers/captures matched text in a group)
print(match1.group(0))
print(match1.group(1))
print(match1.group(3))
print(match1.groups())
print(match1)

foxes
fox
o
('fox', 'ox', 'o')
<_sre.SRE_Match object; span=(0, 5), match='foxes'>


We used three different functions here:

groups() returns a tuple containing the strings for all the subgroups in a matched pattern.

group(n) returns the text captured by the group with index n.

compile() transforms the expression into a pattern object.

Each new left parenthesis starts a new group (unless it is "(?:...)"). Some groups are sequential (one after another); some groups are nested (one inside another).

Group 0 (in Python) is the text matched by the entire regular expression, even when it is not in any parentheses. So, in the pattern "a(b(c.d)+)?e"

  Group 0 is a(b(c.d)+)?e

  Group 1 is (b(c.d)+)

  Group 2 is (c.d)
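To see this numbering in action, here is a short sketch that runs the pattern a(b(c.d)+)?e against two sample strings. The test strings 'abcxdcyde' and 'ae' are invented for this illustration. Note that when a group is repeated, it keeps only its last repetition, and a group that did not take part in the match is reported as None.

```python
import re

p = re.compile(r'a(b(c.d)+)?e')

m = p.match('abcxdcyde')
print(m.group(0))  # abcxdcyde  (the whole match)
print(m.group(1))  # bcxdcyd    (the optional outer group)
print(m.group(2))  # cyd        (last repetition of the inner group)

# The optional group can be skipped entirely
m2 = p.match('ae')
print(m2.groups())  # (None, None)
```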


## Escape characters

We will look more into grouping later on, but for now let's look at escape characters. These are characters with special meanings, and you may already know some of them. First we need to make the text a little more interesting.

In [46]:
t = 'These are my favorite meals: \n \t burgers \n \t enchiladas \n \t tacos'
print(t)  ## \n means new line and \t means tab; both can also be used in patterns

These are my favorite meals: 
 	 burgers 
 	 enchiladas 
 	 tacos


In [51]:
t = 'This is a list of numbers: \n 1 \n \n    2  \r 3' 
print(t) ##\r is carriage return, i.e. return to beginning of line

This is a list of numbers: 
 1 
 
    2   3


In [58]:
t = 'Top of the world \v bottom of the world \f ?' 
print(t) ## \v is a vertical tab; most terminals and notebooks no longer render it visibly
## \f is hard to visualize here, but it means form feed, equivalent to "next page"

Top of the world  bottom of the world  ?


In [61]:
t = 'This is a list of numbers: \n 1 \n \n    2  \r 3' 
## Raw strings keep Python from treating \d, \s, \w as (invalid) string escapes
match1=re.search(r'\d', t)  ##  [0-9]              Digit
print(match1)

match1=re.search(r'\D', t)  ##  [^0-9]             non-Digit
print(match1)

match1=re.search(r'\s', t)  ##  [ \t\n\r\f\v]      White space
print(match1)

match1=re.search(r'\S', t)  ##  [^ \t\n\r\f\v]     non-White space
print(match1)

match1=re.search(r'\w', t)  ##  [a-zA-Z0-9_]       alphanumeric (or underscore): used in identifiers
print(match1)

match1=re.search(r'\W', t)  ##  [^a-zA-Z0-9_]      non alphanumeric
print(match1)
print(match1)


<_sre.SRE_Match object; span=(29, 30), match='1'>
<_sre.SRE_Match object; span=(0, 1), match='T'>
<_sre.SRE_Match object; span=(4, 5), match=' '>
<_sre.SRE_Match object; span=(0, 1), match='T'>
<_sre.SRE_Match object; span=(0, 1), match='T'>
<_sre.SRE_Match object; span=(4, 5), match=' '>
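The bracket equivalents listed in the comments above describe the ASCII behaviour. In Python 3, when the pattern is a str, \w (and likewise \d and \s) also matches non-ASCII Unicode characters unless the re.ASCII flag is given. The sketch below illustrates the difference; the sample word is made up for this example.

```python
import re

word = 'caf\u00e9'  # the word "café", with an accented final letter

# Default for str patterns: \w matches Unicode word characters
unicode_chars = re.findall(r'\w', word)

# With re.ASCII, \w means exactly [a-zA-Z0-9_]
ascii_chars = re.findall(r'\w', word, re.ASCII)

print(unicode_chars)  # ['c', 'a', 'f', 'é']
print(ascii_chars)    # ['c', 'a', 'f']
```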


In [68]:
## Interesting equivalences

match1=re.search('s+', t)
print(match1)
match1=re.search('ss*', t)
print(match1)

match1=re.search('l(a|e|i)st', t)
print(match1)
match1=re.search('l[aei]st', t)
print(match1)

match1=re.search('i.{0,1}', t) # 0 or 1 times means the same as optional
print(match1)
match1=re.search('i.?', t)
print(match1)

<_sre.SRE_Match object; span=(3, 4), match='s'>
<_sre.SRE_Match object; span=(3, 4), match='s'>
<_sre.SRE_Match object; span=(10, 14), match='list'>
<_sre.SRE_Match object; span=(10, 14), match='list'>
<_sre.SRE_Match object; span=(2, 4), match='is'>
<_sre.SRE_Match object; span=(2, 4), match='is'>


## Problems

There is still a lot to learn about regular expressions, and we won't be able to cover it all in class. If you want to know more, feel free to consult other resources. The official Python documentation (https://docs.python.org/3/howto/regex.html) is an excellent reference for regular expressions.

Let's break here and look at some exercises. Go to https://regex101.com/ and solve the following problems (taken from Prof. Pattis' notes):

1. Write a regular expression pattern that matches the strings Jul 4, July 4,
   Jul 4th, July 4th, July fourth, and July Fourth.
   Hint: my re pattern was 24 characters.

2. Write a regular expression pattern that matches strings representing times on
   a 12 hour clock. An example time is  5:09am or 11:23pm. Allow only times that
   are legal (not 1:73pm nor 13:02pm)
   Hint: my re pattern was 32 characters.

3. Write a regular expression pattern that matches strings representing phone
   numbers of the following form.

   Normal: a three digit exchange, followed by a dash, followed by a four digit
           number: e.g., 555-1212

   Long Distance: a 1, followed by a dash, followed by a three digit area code enclosed in parentheses, followed by a three digit exchange, followed by a dash, followed by a four digit number: e.g.,
           1-(800)555-1212

   Interoffice: a single digit followed by a dash followed by a four digit
            number: e.g., 8-2404.

   Hint: my re pattern was 30 characters; note that you must use \( and \) to
   match parentheses.



# Function Methods

We have discussed functions that operate on a regular expression pattern (specified by a string) and text (also specified by a string). These functions produce information (capture groups: see parenthesized patterns above) related to attempting to match the pattern and text: which parts of the text matched which parts of the pattern.

Though we used some functions in previous examples, we have mostly focused on the expressions themselves. Now let's focus on how to use each function.

## Compile function

We already saw one example of the compile() function. We can use compile() to compile a pattern (producing a regex object) and then call methods on that object directly. These methods perform the same operations as the module-level functions, but more efficiently if the pattern is to be used repeatedly, since the pattern is compiled into the regex once and not in each function call.

The compile() function also has some optional flags. We will not review them in detail but it would be beneficial for you to know them.


In [69]:
p = re.compile('ab*', re.IGNORECASE)
print (p)

re.compile('ab*', re.IGNORECASE)


In this piece of code we can see that the pattern ab* is bound to an object p. We enabled the flag IGNORECASE which instructs the function to match regardless of case.

In [71]:
s= p.match('ABC abc')
print(s)

<_sre.SRE_Match object; span=(0, 2), match='AB'>


See how we used the pattern object? We can now reuse it as many times as we wish. This allows us to easily manage, mix, and organize our patterns.
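As a small illustration of that reuse, the sketch below compiles the pattern once and applies it to several strings in a loop. The list samples and the dictionary results are hypothetical names introduced just for this example.

```python
import re

p = re.compile('ab*', re.IGNORECASE)  # compiled once

samples = ['ABC abc', 'abbbb', 'xyz', 'Ab']
results = {}
for s in samples:
    m = p.match(s)  # the same pattern object is reused on every string
    results[s] = m.group() if m else None
    print(repr(s), '->', results[s])
```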

# Matching functions

There are four matching functions:

match(): Determines if the regular expression matches at the beginning of the string

search(): Scans a string to find the first location where the expression matches

findall(): Returns a list with all the matches for the expression

finditer(): Returns an iterator over all the matches, as match objects

In [79]:
t= "gattacaagatacattacc"
p=re.compile('ga..')
print(p.match(t))

p=re.compile('tt..')
print(p.search(t))

p=re.compile('ta')
print(p.findall(t))

p=re.compile('ta')
print(p.finditer(t))

for i in p.finditer(t):
    print (i)

<_sre.SRE_Match object; span=(0, 4), match='gatt'>
<_sre.SRE_Match object; span=(2, 6), match='ttac'>
['ta', 'ta', 'ta']
<callable_iterator object at 0x00000122522E76A0>
<_sre.SRE_Match object; span=(3, 5), match='ta'>
<_sre.SRE_Match object; span=(10, 12), match='ta'>
<_sre.SRE_Match object; span=(15, 17), match='ta'>
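The difference between match() and search() is easiest to see when the pattern does not occur at the start of the text. The following sketch reuses the same DNA-like string; the name starts is an assumption made for this example.

```python
import re

t = "gattacaagatacattacc"
p = re.compile('tt..')

at_start = p.match(t)    # None: the text does not begin with 'tt'
anywhere = p.search(t)   # found further along the string
print(at_start)          # None
print(anywhere.span())   # (2, 6)

# finditer() yields match objects, so positions are available for every hit
starts = [m.start() for m in re.finditer('ta', t)]
print(starts)            # [3, 10, 15]
```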


## Match object functions

search(), match(), and finditer() give us match objects, which we can bind to a name. This lets us use the methods that match objects provide:

group(): Returns the string matched

start(): Returns the starting position of the match

end(): Returns the end position of the match

span(): Returns a tuple with the starting and end positions of the match

In [80]:
t= "gattacaagatacattacc"
p=re.compile('ga..')
q=p.match(t)
print(q.group())
print(q.start())
print(q.end())
print(q.span())

gatt
0
4
(0, 4)
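These positions are ordinary string indexes, so they can be used to slice the original text. Each method also accepts a group number when the pattern contains parentheses. A brief sketch, using a variant of the pattern with two groups added purely for illustration:

```python
import re

t = "gattacaagatacattacc"
q = re.compile('(ga)(..)').match(t)

# Slicing the text with start() and end() recovers the same string as group()
whole = t[q.start():q.end()]
print(whole == q.group())     # True

# Passing a group number restricts each method to that group
print(q.group(2), q.span(2))  # tt (2, 4)
```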


## Non-Greedy Matching

So far we have only discussed greedy matching. That is, our repetition metacharacters match as much of the string as possible. But here is an example where that fails (taken from the Python documentation):

In [81]:
s = '<html><head><title>Title</title>'
len(s)
print(re.match('<.*>', s).span())
print(re.match('<.*>', s).group())

(0, 32)
<html><head><title>Title</title>


As you can see, the expression .* consumes the rest of the string, including every '>' in it.

Behind the scenes, < matches the first character, and .* then grabs everything up to the end of the string. The engine then backtracks from the end, giving back one character at a time, until the rest of the pattern (the >) can match. That happens at the last > in the string, so the whole string is the match.

For this example we need non-greedy matching, and for that we add a ? after the repetition. We already used ? to mark something as optional; placed after another repetition such as *, it makes that repetition non-greedy:

In [82]:
print(re.match('<.*?>', s).group())

<html>


Using a non-greedy qualifier (*?, +?, ??, or {m,n}?) means that the repetition will match as little text as possible. Therefore .*? stops at the first > it can reach instead of the last one.
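The contrast is clearer when every match is listed. The sketch below runs the greedy and non-greedy versions through findall(), and also shows a common alternative that avoids the issue with a negated character class. The variable names are chosen for this example only.

```python
import re

s = '<html><head><title>Title</title>'

greedy = re.findall('<.*>', s)      # one match that swallows the whole string
lazy = re.findall('<.*?>', s)       # each tag on its own
negated = re.findall('<[^>]*>', s)  # same result without a non-greedy qualifier

print(greedy)   # ['<html><head><title>Title</title>']
print(lazy)     # ['<html>', '<head>', '<title>', '</title>']
print(negated)  # ['<html>', '<head>', '<title>', '</title>']
```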

In [84]:
t= "gattacaagatacattacc"
p=re.compile('ga..', re.VERBOSE)
print(p)

re.compile('ga..', re.VERBOSE)
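The re.VERBOSE flag makes no visible difference for a pattern as short as 'ga..'. Its purpose is to let a pattern be spread over several lines: whitespace inside the pattern is ignored and anything after a # is treated as a comment. A minimal sketch of the same pattern written that way:

```python
import re

p = re.compile(r"""
    ga      # the literal letters g and a
    ..      # followed by any two characters
""", re.VERBOSE)

t = "gattacaagatacattacc"
hits = p.findall(t)
print(hits)  # ['gatt', 'gata']
```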


## Modifying Strings

These functions are used to modify strings. They resemble string methods you already know, such as str.split() and str.replace(), but they take a regular expression as the pattern. These are:

split(): Split the string into a list wherever the pattern matches

sub(): Find all substrings matched and replace them

subn(): Same as sub() but returns the new string and the number of replacements

In [95]:
print(re.split(r'[\W]+', 'Words, words, words.'))
print(re.split(r'([\W]+)', 'Words, words, words.'))
print(re.split(r'[\W]+', 'Words, words, words.', maxsplit=1))

['Words', 'words', 'words', '']
['Words', ', ', 'words', ', ', 'words', '.', '']
['Words', 'words, words.']


In [98]:
p = re.compile('(blue|white|red)')
print(p.sub('colour', 'blue socks and red shoes'))
print(p.sub('colour', 'blue socks and red shoes', count=1))

colour socks and colour shoes
colour socks and red shoes


In [100]:
p = re.compile('(blue|white|red)')
print(p.subn('colour', 'blue socks and red shoes'))
print(p.subn('colour', 'no colours at all'))

('colour socks and colour shoes', 2)
('no colours at all', 0)
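The replacement passed to sub() does not have to be a fixed string. It can refer back to a captured group with \1, or it can be a function that receives each match object and returns the replacement text. The sketch below shows both; the helper name shout is hypothetical and introduced only for this example.

```python
import re

p = re.compile('(blue|white|red)')
text = 'blue socks and red shoes'

def shout(match):
    # match.group(1) is the colour that was found
    return match.group(1).upper()

by_function = p.sub(shout, text)
by_backref = p.sub(r'[\1]', text)  # \1 inserts the text captured by group 1

print(by_function)  # BLUE socks and RED shoes
print(by_backref)   # [blue] socks and [red] shoes
```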


## Final Notes

We already talked about this, but remember: raw strings are preceded by r, which means that backslashes are kept as literal characters instead of starting an escape sequence.

In [86]:
print(len('\n'))
print(len(r'\n'))

1
2
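This matters most when the pattern itself needs a backslash. To match one literal backslash, the regular expression engine must receive two of them, which takes four in a normal string but only two in a raw string. A small sketch, using a made-up Windows-style path:

```python
import re

path = 'C:\\temp'  # the seven characters C:\temp (one backslash)
print(len(path))   # 7

# The raw string and the doubled-up normal string are the same pattern
same_pattern = (r'\\' == '\\\\')
print(same_pattern)  # True

m = re.search(r'\\', path)  # finds the single backslash in the path
print(m.span())             # (2, 3)
```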
