## Introduction to Regular Expressions (regex's)

There is a nice basic regular expressions tutorial here:

https://regexone.com/references/python (click on Interactive Tutorial)

The Python 3 documentation:

https://docs.python.org/3/library/re.html

This is also a nice helpful tutorial:

https://www.tutorialspoint.com/python3/python_reg_expressions.htm

One more:

https://docs.python.org/3.6/howto/regex.html

And there is a link to a book

https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf

Some of the most common tasks for regex's are:

1) determine whether a certain pattern of text matches some substring of a string starting at the beginning of the string - and return the size of the substring that matches the pattern

2) determine whether a pattern matches some substring anywhere in the string, and return the location of the match in the string

3) find all locations in a string where a pattern matches
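
As a rough preview, these three tasks line up with three functions in Python's `re` module: `re.match`, `re.search`, and `re.finditer`. The sample text and variable names below are made up for illustration; this is only a sketch of the idea, and the individual functions are discussed in more detail in what follows.

```python
import re

text = "cat, concat, cat"  # illustrative sample text

# 1) Match at the beginning of the string only.
m = re.match("cat", text)
match_length = m.end() - m.start()   # size of the matching substring

# 2) Match anywhere in the string; reports the first location found.
s = re.search("con", text)
first_span = s.span()

# 3) All non-overlapping locations where the pattern matches.
all_spans = [found.span() for found in re.finditer("cat", text)]

print(match_length)
print(first_span)
print(all_spans)
```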

A regular expression consists of ordinary characters and special characters. The simplest form of regular expression consists of a single (non-special) character.

We can combine regular expressions by concatenating them. Thus, if A is a regular expression and B is a regular expression, so is AB.

For example, an ordinary character is a regular expression, so "d" is a regular expression. So is "a" and so is "n", so "dan" is a regular expression.

Here is a simple example of regular expression matching, i.e. determining whether the initial portion of a string matches the pattern. If we get a match, we print a message confirming it and provide some range information.

We use "match" to do the match. When we get a match, the function returns a match object. Its span() method gives the "span" of the match as a pair, i.e. the start and end positions.

In [25]:
import re
pattern='Is'
string="Is there a match?"
m=re.match(pattern,string)
if m:
    print("matches")
    print(m.start())
    print(m.end())
    print(m.span())
    print(string[m.start():m.end()])
else:
    print("doesn't match")

matches
0
2
(0, 2)
Is


We can also use re to match bytes objects. However, when we do that, both the pattern and the string being searched must be bytes objects. We can't mix the two types of strings.

In [101]:
import re
pattern=b'\x20\x30\x40'
string=b'\x20\x30\x40\x20\x30\x40'
m=re.match(pattern,string)
if m:
    print("matches")
    print(m.start())
    print(m.end())
    print(m.span())
    print(string[m.start():m.end()])
else:
    print("doesn't match")

matches
0
3
(0, 3)
b' 0@'
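
To see what "can't mix" means in practice, here is a small sketch with made-up inputs. Passing a str pattern with a bytes subject raises a `TypeError`; encoding the pattern first (here with ASCII, which is an assumption about the data) avoids it.

```python
import re

try:
    re.match("abc", b"abcdef")   # str pattern, bytes subject
    mixed_ok = True
except TypeError as err:
    mixed_ok = False
    print("TypeError:", err)

# Encoding the pattern makes both sides bytes, so the match succeeds.
m = re.match("abc".encode("ascii"), b"abcdef")
print(m.span())
```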


With "match", the pattern is required to match the initial portion of the string.

In [27]:
import re
pattern='match'
string="A match must occur with the initial portion of the string."
m=re.match(pattern,string)
if m:
    print("matches")
    print(m.start())
    print(m.end())
    print(m.span())
    print(string[m.start():m.end()])
else:
    print("doesn't match")

doesn't match


In [30]:
import re
pattern='m'
string="Matches are case-sensitive by default."
m=re.match(pattern,string)
if m:
    print("matches")
    print(m.start())
    print(m.end())
    print(m.span())
    print(string[m.start():m.end()])
else:
    print("doesn't match")

doesn't match


We can turn off case-sensitivity using the "I" flag.

In [119]:
import re
pattern='m'
string="Matches are case-sensitive by default."
m=re.match(pattern,string,re.I)
if m:
    print("matches")
    print(m.start())
    print(m.end())
    print(m.span())
    print(string[m.start():m.end()])
else:
    print("doesn't match")

matches
0
1
(0, 1)
M


On the other hand, we can **search** for a pattern appearing in any position in text using "search".

In [177]:
import re
pattern="dog"
string="My dog is named Sasha"
s=re.search(pattern,string)
if s:
    print("pattern found")
    print(s.start())
    print(s.end())
    print(s.span())
    print(string[s.start():s.end()])
else:
    print("doesn't match")

pattern found
3
6
(3, 6)
dog


When we get no match, the value returned is None.

In [31]:
import re
text="My name is Joan."
pattern="John"
s=re.search(pattern,text)
if s:
    print("found a match")
else:
    print("no match")
    print(s)
if s is None:
    print("we get none")

no match
None
we get none


When we use the search method, the information we get back is about the first match.

In [32]:
import re
text="My name is John. Are you also John?"
pattern="John"
s=re.search(pattern,text)
if s:
    print("found a match")
    print(type(s))
    print(s)
    print(text[s.start():s.end()])
else:
    print("no match")

found a match
<class 're.Match'>
<re.Match object; span=(11, 15), match='John'>
John


We can also search for special characters.

In [20]:
import re
text="What is your name?\n\r My name is John."
pattern="\n"
m=re.search(pattern,text)
if m:
    print("found a match")
    print(type(m))
    print(m)
    print(text[m.start():m.end()])
else:
    print("no match")

found a match
<class '_sre.SRE_Match'>
<_sre.SRE_Match object; span=(18, 19), match='\n'>




# Some simple functions

Instead of repeating the same code over and over, let's create a couple of functions that do what we did in the above examples.

In [5]:
def check_match(pattern, string): 
    m=re.match(pattern,string)
    if m:
        pos0=m.start()
        pos1=m.end()
        print("pattern " + pattern + 
              "matches from " + str(pos0) + " to " 
              + str(pos1) + " with substring = " 
              + string[m.start():m.end()])
    else:
        print("no match")   
check_match("dog","My dog")
check_match("M","My dog")

def check_search(pattern, string): 
    s=re.search(pattern,string)
    if s:
        pos0=s.start()
        pos1=s.end()
        print("pattern matches from " + str(pos0) + " to " + str(pos1) + " with substring = " 
              + string[s.start():s.end()])
    else:
        print("no match")   
check_search("dog","My dog ate my homework.")

no match
pattern Mmatches from 0 to 1 with substring = M
pattern matches from 3 to 6 with substring = dog


# Iterating over lists

In [52]:
strings=["What a beautiful day to be learning regex.", 
         "I feel so fortunate to have Dr. Miller as a mentor.", 
         "What a powerful thing, to be able to match patterns in strings!"]
p="to"
for s in strings:
    print(s)
    check_search(p,s)
    print("\n")

What a beautiful day to be learning regex.
pattern matches from 21 to 23 with substring = to


I feel so fortunate to have Dr. Miller as a mentor.
pattern matches from 20 to 22 with substring = to


What a powerful thing, to be able to match patterns in strings!
pattern matches from 23 to 25 with substring = to




## Special characters

As you can imagine, without further tools what can be done is rather limited. Additional functionality is obtained by using characters with special meanings in our patterns.

These special characters are referred to as meta-characters. Here is a list of them:

. ^ $ * + ? { } [ ] \ | ( )

We will proceed to describe the uses of these characters in patterns.

# The dot/period: .

By default, the period (.) means any single character except newline "\n".

In [60]:
import re
p="d.n"
strings=["dan","don","dn","d\nn"]
for s in strings:
    print("searching for pattern " + p + " in string = " + s)
    check_search(p,s)
    print("\n")

searching for pattern d.n in string = dan
pattern matches from 0 to 3 with substring = dan


searching for pattern d.n in string = don
pattern matches from 0 to 3 with substring = don


searching for pattern d.n in string = dn
no match


searching for pattern d.n in string = d
n
no match




If we want "." to be interpreted as any single character, including the 
newline character, we use the DOTALL flag. So in the following example, 
we do indeed get a match.

In [62]:
import re
p="d.n"
string="d\nn"
if re.search(p,string,re.DOTALL):
    print("found!")

found!


# The circumflex character: ^

The ^ character is used to indicate that the pattern must appear at the start of a string.

In [67]:
import re
p="^Jo.n"
strings=["John, are you home?","Joan, are you home?","Hey, has anybody seen John or Joan around?"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = John, are you home?
pattern matches from 0 to 4 with substring = John


string = Joan, are you home?
pattern matches from 0 to 4 with substring = Joan


string = Hey, has anybody seen John or Joan around?
no match
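
By default ^ only matches at the very start of the string. As a side note that goes slightly beyond the examples above, the `re.MULTILINE` flag makes ^ also match right after each newline. A small sketch with a made-up two-line string:

```python
import re

text = "Hey, is anybody home?\nJohn, are you home?"

default_result = re.search("^Jo.n", text)                  # ^ = start of string only
multiline_result = re.search("^Jo.n", text, re.MULTILINE)  # ^ = start of any line

print(default_result)
print(multiline_result.span())
```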




# The dollar-sign: $

$ matches the end of the string, or just before the new line at the end of the string.

In [73]:
import re
p="dog.$"
strings=["Is that my dog?", "My dog seems to be missing.","Hey, look at my dog."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = Is that my dog?
pattern matches from 11 to 15 with substring = dog?


string = My dog seems to be missing.
no match


string = Hey, look at my dog.
pattern matches from 16 to 20 with substring = dog.
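
The phrase "just before the new line at the end of the string" can be checked directly. The strings below are made up for illustration: with one trailing newline $ still matches after "dog", but with two trailing newlines it does not, because "dog" is no longer followed by the final newline.

```python
import re

with_newline = re.search("dog$", "my dog\n")    # $ matches just before the final "\n"
no_newline = re.search("dog$", "my dog")        # $ matches at the very end
not_at_end = re.search("dog$", "my dog\n\n")    # "dog" is not next to the last "\n"

print(with_newline.span())
print(no_newline.span())
print(not_at_end)
```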




The \ can be used to tell the engine that we want a literal period rather than the meta-character.

In [178]:
import re
p=r"dog\.$"  # raw string: Python passes the backslash through to the regex engine
strings=["Is that my dog?", "My dog seems to be missing.","Hey, look at my dog."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = Is that my dog?
no match


string = My dog seems to be missing.
no match


string = Hey, look at my dog.
pattern matches from 16 to 20 with substring = dog.




# The asterisk: *

The * matches zero or more repetitions of the preceding regular expression, and it looks for the most repetitions possible. In the following, the preceding regular expression is the letter "o". Note that the search continues until the first match start is found, but the search finds the largest match starting at that point.

This is what is meant by regex searching being greedy. It looks for a match starting as early as possible in a string, but it looks for the string that is as long as possible and still matches.

In the following, the pattern "Do*r" matches a "D", then any number of o's (including none), then an "r". So "Dr" and "Door" both give a match, but "Dooooooomed!" does not, because no "r" follows the o's.

In [179]:
import re
p="Do*r"
strings=["Hi Doris!","Are you a Dr.?","Doors should not be left opened.","Dooooooomed!"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = Hi Doris!
pattern matches from 3 to 6 with substring = Dor


string = Are you a Dr.?
pattern matches from 10 to 12 with substring = Dr


string = Doors should not be left opened.
pattern matches from 0 to 4 with substring = Door


string = Dooooooomed!
no match




What happens if we end our expression with o's?

In [181]:
import re
p="do*"
strings=["We are dooooooomed! Really dooooooooooomed!!!!"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = We are dooooooomed! Really dooooooooooomed!!!!
pattern matches from 7 to 15 with substring = dooooooo




# The plus sign: +

The + symbol means match 1 or more repetitions of the preceding expression, so unlike the *, we need to have the preceding expression appear at least once to get a match.

In [89]:
import re
p="Ca+"
strings=["Cats are not like dogs!","Could you come here?","Caanan is spoken about in the bible"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")



string = Cats are not like dogs!
pattern matches from 0 to 2 with substring = Ca


string = Could you come here?
no match


string = Caanan is spoken about in the bible
pattern matches from 0 to 3 with substring = Caa




# The question mark: ?

One use of the ? character is obtained by putting it after a regular expression, which means match exactly 0 or 1 repetitions of that expression. In other words, it indicates that the expression is optional. For an example of its use, English spellings in the U.K. can differ from English spellings in the U.S. For example, in the U.S. we would use "humor" and in the U.K. they would write "humour". We can test for either word matching using this:

In [182]:
import re
p="humou?r"
strings=["Americans are known for their great sense of humor",
         "British people generally lack a sense of humour"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Americans are known for their great sense of humor
pattern matches from 45 to 50 with substring = humor


string = British people generally lack a sense of humour
pattern matches from 41 to 47 with substring = humour




Another use of the ? character is in creating a lazy instead of greedy attempt at matching, as discussed below.

# Use of braces: {}

Braces are used to indicate the number of times an expression appears, or a range of numbers of times.

We can use {n} to indicate exactly n occurrences, and {m,n} to indicate at least m and at most n occurrences.

We can also write {,n} to refer to at most n occurrences, and {m,} to mean at least m occurrences.

In [183]:
import re
p="ma{2}"
strings=["Hi mama.","When I get heartburn I use maalox.",
        "What about maaaaa?"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Hi mama.
no match


string = When I get heartburn I use maalox.
pattern matches from 27 to 30 with substring = maa


string = What about maaaaa?
pattern matches from 11 to 14 with substring = maa




In [184]:
import re
p="ma{2}"
strings=["ma","maa","maaa","maaaa"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = ma
no match


string = maa
pattern matches from 0 to 3 with substring = maa


string = maaa
pattern matches from 0 to 3 with substring = maa


string = maaaa
pattern matches from 0 to 3 with substring = maa




In [104]:
import re
p="ma{2,3}"
strings=["ma","maa","maaa","maaaa"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = ma
no match


string = maa
pattern matches from 0 to 3 with substring = maa


string = maaa
pattern matches from 0 to 4 with substring = maaa


string = maaaa
pattern matches from 0 to 4 with substring = maaa




In [185]:
import re
p="ma{3,}"
strings=["ma","maa","maaa","maaaa"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = ma
no match


string = maa
no match


string = maaa
pattern matches from 0 to 4 with substring = maaa


string = maaaa
pattern matches from 0 to 5 with substring = maaaa
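
The examples above cover {n}, {m,n} and {m,}. For completeness, here is a brief sketch of the remaining {,n} form, using made-up test strings. Since zero occurrences are allowed, even a bare "m" gives a match.

```python
import re

strings = ["m", "ma", "maa", "maaaa"]
# "a{,2}" means between 0 and 2 occurrences of "a".
matched = [re.search("ma{,2}", s).group() for s in strings]
print(matched)
```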




# Use of parentheses for grouping

The \*, +, ?, {n}, and {m,n} are referred to as *quantifiers*. By default, a quantifier applies to the previous regular expression, which in a pattern made of several ordinary characters is just the single preceding character. In the following example, the + refers to the "a", not to "ma".



In [186]:
import re
p="ma+"
strings=["Hi ma.","Hi mama.","When I get heartburn I use maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Hi ma.
pattern matches from 3 to 5 with substring = ma


string = Hi mama.
pattern matches from 3 to 5 with substring = ma


string = When I get heartburn I use maalox.
pattern matches from 27 to 30 with substring = maa




Parentheses can be used to make the special characters apply to more complex expressions. Here we make the + refer to repetitions of "ma".

In [94]:
import re
p="(ma)+"
strings=["Hi ma.","Hi mama.","When I get heartburn I use maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Hi ma.
pattern matches from 3 to 5 with substring = ma


string = Hi mama.
pattern matches from 3 to 7 with substring = mama


string = When I get heartburn I use maalox.
pattern matches from 27 to 29 with substring = ma




What is going on in this example?

In [98]:
import re
p="(ma).*ma+"
strings=["Don't forget to tell your mama that you took the maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Don't forget to tell your mama that you took the maalox.
pattern matches from 26 to 52 with substring = mama that you took the maa
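
One way to see what is going on is to put each piece of the pattern in its own parentheses and ask the match object which text each piece consumed. The `groups()` method used here is a standard method of match objects that has not been introduced above; the extra parentheses are added purely for inspection and do not change what the pattern matches.

```python
import re

s = "Don't forget to tell your mama that you took the maalox."
# Same pattern as above, with ".*" and "ma+" wrapped in their own groups.
m = re.search("(ma)(.*)(ma+)", s)
print(m.span())
print(m.groups())
```

The greedy .* first runs to the end of the string and then backs up until "ma+" can still match, which is why the match ends inside "maalox" rather than inside "mama".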




In [187]:
import re
p="(ma){2}.*ma"
strings=["Don't forget to tell your mama that you took the maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Don't forget to tell your mama that you took the maalox.
pattern matches from 26 to 51 with substring = mama that you took the ma




In [106]:
import re
p="m.*a{3}"
strings=["Don't forget to tell your mama that you took the maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Don't forget to tell your mama that you took the maalox.
no match




In [188]:
import re
p="(m.*a){3}"
strings=["Don't forget to tell your mama that you took the maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Don't forget to tell your mama that you took the maalox.
no match




In [191]:
import re
p="(m.*a){2}.*ma"
strings=["Don't forget to tell your mama that you took the maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Don't forget to tell your mama that you took the maalox.
pattern matches from 26 to 51 with substring = mama that you took the ma




In [118]:
import re
p="(m.*a){2}.*(m.*a)"
strings=["Don't forget to tell your mama that you took the maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Don't forget to tell your mama that you took the maalox.
pattern matches from 26 to 52 with substring = mama that you took the maa




# Using ? to eliminate greediness and encourage laziness

In order to ensure that matching is done in a non-greedy fashion, we can use \*? in place of \*. In the following example, we first see greediness in action.

In [144]:
import re
p="<item>.*</item>"
strings=["<item>blah1</item><item>blah2</item>"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = <item>blah1</item><item>blah2</item>
pattern matches from 0 to 36 with substring = <item>blah1</item><item>blah2</item>




But we wanted to stop when we got to the first closing tag. The .\*? says match any number of characters between the two tags, but include a minimal amount to get the match. In other words, we want to stop searching once we have matched the smallest portion of the string that gives a match.

In [192]:
import re
p="<item>(.*?)</item>"
strings=["<item>blah1</item><item>blah2</item>"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = <item>blah1</item><item>blah2</item>
pattern matches from 0 to 18 with substring = <item>blah1</item>




In [193]:
import re
p="<item>.*?</item>"
strings=["<item>blah1</item><item>blah2</item>"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = <item>blah1</item><item>blah2</item>
pattern matches from 0 to 18 with substring = <item>blah1</item>
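
The difference becomes more visible when we collect every match instead of only the first. The sketch below uses `re.findall`, which is not covered above; when the pattern has one parenthesized group, it returns the text captured by that group for each match.

```python
import re

s = "<item>blah1</item><item>blah2</item>"
greedy = re.findall("<item>(.*)</item>", s)   # one long match
lazy = re.findall("<item>(.*?)</item>", s)    # one match per item
print(greedy)
print(lazy)
```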




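The ? character also works when appearing after the + quantifier. Again, the + ensures that at least one repetition occurs, but the ? makes the search lazy by minimizing the portion needed to match. Note that laziness does not change where a match starts: in the second string below, the match still begins at the first opening tag, and since .+? needs at least one character it has to run on to the second closing tag.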

In [194]:
import re
p="<item>(.+?)</item>"
strings=["<item></item>","<item></item><item>some stuff</item>",
         "<item>some stuff</item>" ]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = <item></item>
no match


string = <item></item><item>some stuff</item>
pattern matches from 0 to 36 with substring = <item></item><item>some stuff</item>


string = <item>some stuff</item>
pattern matches from 0 to 23 with substring = <item>some stuff</item>




Finally, ? by itself is greedy, while ?? is lazy in the sense that the attempt to match stops after finding the initial pattern without the optional expression.

In [164]:
import re
p="friends?"
strings=["Can I play with your friends?" ]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Can I play with your friends?
pattern matches from 21 to 28 with substring = friends




In [195]:
import re
p="friends??"
strings=["Can I play with your friends?" ]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Can I play with your friends?
pattern matches from 21 to 27 with substring = friend




The ? character allows us to create non-greedy versions of the {} patterns.

In [168]:
import re
p="(xo){2,}"
strings=["xo i really love you xoxo",
     "xo i really love you xoxoxox"]        
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = xo i really love you xoxo
pattern matches from 21 to 25 with substring = xoxo


string = xo i really love you xoxoxox
pattern matches from 21 to 27 with substring = xoxoxo




In [169]:
import re
p="(xo){2,}?"
strings=["xo i really love you xoxo",
     "xo i really love you xoxoxox"]        
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = xo i really love you xoxo
pattern matches from 21 to 25 with substring = xoxo


string = xo i really love you xoxoxox
pattern matches from 21 to 25 with substring = xoxo




# Character groups and []'s

We use square brackets to define character groups. For example, suppose we want a match for any expression of the form
"p_t" where the underscore character can be any vowel from among a,e,i, or o. We can define a character group [aeio].

In [206]:
import re
p="p[aeio]t?"
strings=["Is that your pet?",
         "When I see you I get a funny feeling in the pit of my stomach",
         "Are you a pot-smoker?",
         "I've got to give you a pat on the back!",
         "Is 'pyt' a word?"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = Is that your pet?
pattern matches from 13 to 16 with substring = pet


string = When I see you I get a funny feeling in the pit of my stomach
pattern matches from 44 to 47 with substring = pit


string = Are you a pot-smoker?
pattern matches from 10 to 13 with substring = pot


string = I've got to give you a pat on the back!
pattern matches from 23 to 26 with substring = pat


string = Is 'pyt' a word?
no match




To simplify writing down certain character groups, we can use the dash. For example, for all digits 0,1,2,3,4,5,6,7
we can use [0-7].

In [207]:
import re
p="[0-7]+"
strings=["My office phone number is 410-516-7203.",
     "My social security number is 897999131."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = My office phone number is 410-516-7203.
pattern matches from 26 to 29 with substring = 410


string = My social security number is 897999131.
pattern matches from 31 to 32 with substring = 7




In [209]:
import re
p="[a-i]+"
strings=["My office phone number is 410-516-7203.",
     "My social security number is 897999131."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = My office phone number is 410-516-7203.
pattern matches from 4 to 9 with substring = ffice


string = My social security number is 897999131.
pattern matches from 5 to 8 with substring = cia




# Exercise: 

Assume a phone number is always of the form: yxx-yxx-xxxx or 1-yxx-yxx-xxxx where the x's are all digits from 0-9, and y's are 1-9

Write code to find the first phone number in a string.

In [214]:
p="(1-)?[1-9]{1}[0-9]{2}-[1-9]{1}[0-9]{2}-[0-9]{4}"
strings=["Is this a phone number 1-234-123-3134 or not",
         "Is this a phone number 456-234-1088 or not",
         "Is this a phone number 0-131-143-4353 or not"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Is this a phone number 1-234-123-3134 or not
pattern matches from 23 to 37 with substring = 1-234-123-3134


string = Is this a phone number 456-234-1088 or not
pattern matches from 23 to 35 with substring = 456-234-1088


string = Is this a phone number 0-131-143-4353 or not
pattern matches from 25 to 37 with substring = 131-143-4353




In [215]:
p="(1-)?[1-9][0-9]{2}-[1-9][0-9]{2}-[0-9]{4}"
strings=["Is this a phone number 1-234-123-3134 or not",
         "Is this a phone number 456-234-1088 or not",
         "Is this a phone number 0-131-143-4353 or not"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Is this a phone number 1-234-123-3134 or not
pattern matches from 23 to 37 with substring = 1-234-123-3134


string = Is this a phone number 456-234-1088 or not
pattern matches from 23 to 35 with substring = 456-234-1088


string = Is this a phone number 0-131-143-4353 or not
pattern matches from 25 to 37 with substring = 131-143-4353




# Meta-characters  inside square brackets

Inside square brackets, there are only some meta-characters that have special meaning. Others are interpreted as literal characters. 

The ones that do have meaning are the backslash \\, the hyphen -, circumflex ^ and the right square bracket (as a closing bracket for the character group).

The circumflex only has special meaning when it appears immediately after the \[ and it means "not" among these characters.

In [216]:
import re
p="p[^aeio]t?"
strings=["Is that your pet?",
         "When I see you I get a funny feeling in the pit of my stomach",
         "Are you a pot-smoker?",
         "I've got to give you a pat on the back!",
         "Is 'pyt' a word?"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Is that your pet?
no match


string = When I see you I get a funny feeling in the pit of my stomach
no match


string = Are you a pot-smoker?
no match


string = I've got to give you a pat on the back!
no match


string = Is 'pyt' a word?
pattern matches from 4 to 7 with substring = pyt
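
To illustrate that the position of the circumflex matters, here is a short sketch with made-up strings. When ^ comes first inside the brackets it negates the group; anywhere else it is just a literal "^".

```python
import re

negated = re.search("[^a]", "a^b")      # ^ first: any character except "a"
literal = re.search("[a^]+", "2^a=b")   # ^ not first: a literal "^" or an "a"
print(negated.group())
print(literal.group())
```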




# The backslash: \

The backslash character is used to escape a special character in a regex pattern. So, for example, to search for a +, you would need to do something like the following.

In [200]:
import re
p=r"\+"  # raw string keeps the backslash for the regex engine
strings=["2+5=7","To add two numbers, we use the '+' sign."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = 2+5=7
pattern matches from 1 to 2 with substring = +


string = To add two numbers, we use the '+' sign.
pattern matches from 32 to 33 with substring = +




Escaping characters with a backslash also works inside square brackets. For example, suppose we want to match an expression like what mathematicians use to represent intervals with endpoints x and y in the real line.

For example, these are all matching expressions:

    (x,y), (x,y], [x,y), [x,y]

Here, we can use two character groups, one of these consists of ( and \[, neither of which has any special meaning inside square brackets. The other has a \] character in it, which needs to be escaped.

In [14]:
import re
p=r"[([]x,y[)\]]"
strings=["Consider the open interval (x,y)","Consider the closed interval [x,y]",
         "Or an interval closed on the left and open on the right like [x,y)"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Consider the open interval (x,y)
pattern matches from 27 to 32 with substring = (x,y)


string = Consider the closed interval [x,y]
pattern matches from 29 to 34 with substring = [x,y]


string = Or an interval closed on the left and open on the right like [x,y)
pattern matches from 61 to 66 with substring = [x,y)




Exercise: How to find all matches of expressions of the form: integer + integer = integer?
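
One possible approach to this exercise, offered as a sketch rather than the intended answer: describe each integer with a digit group, escape the +, and collect every match with `re.findall`. The assumptions here are that integers are unsigned runs of digits and that optional spaces may surround "+" and "="; the sample text is made up. The pattern is written as a raw string (r"..."), which is discussed later in these notes.

```python
import re

# Assumption: unsigned integers, optional spaces around "+" and "=".
p = r"[0-9]+ *\+ *[0-9]+ *= *[0-9]+"
text = "We know 2+5=7 and 10 + 20 = 30, but x+1=2 has no integer on the left."
found = re.findall(p, text)
print(found)
```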

# The pipe character: |

The pipe character is for creating patterns in which there are choices for matches. Either what is to the left of the | needs to match, or what is to the right of the | needs to match.

In [227]:
import re
p="dog|cat"
strings=["Do dogs eat cats?","Do cats eat dogs?"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = Do dogs eat cats?
pattern matches from 3 to 6 with substring = dog


string = Do cats eat dogs?
pattern matches from 3 to 6 with substring = cat




In [228]:
import re
p="dog|cat|bird"
strings=["Do dogs eat bird?","Do birds eat cats?"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = Do dogs eat bird?
pattern matches from 3 to 6 with substring = dog


string = Do birds eat cats?
pattern matches from 3 to 7 with substring = bird




We can do more with parentheses.

In [235]:
import re
p="(bird)|(plane)"
strings=["It's a bird!", "It's a plane!", "No! It's superman!"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = It's a bird!
pattern matches from 7 to 11 with substring = bird


string = It's a plane!
pattern matches from 7 to 12 with substring = plane


string = No! It's superman!
no match




How do we try to match any integer that is either 0 or which is expressed without 0 as its first digit?

In [242]:
import re
p="(^0$)|(^[1-9]+[0-9]*$)"
strings=["f0","0","035","1423452520"]
for s in strings:
    print("string = " + s)
    check_match(p,s)
    print("\n")

string = f0
no match


string = 0
pattern (^0$)|(^[1-9]+[0-9]*$)matches from 0 to 1 with substring = 0


string = 035
no match


string = 1423452520
pattern (^0$)|(^[1-9]+[0-9]*$)matches from 0 to 10 with substring = 1423452520
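
An alternative sketch of the same test uses `re.fullmatch`, which only succeeds when the whole string matches the pattern, so the ^ and $ anchors can be dropped. The pattern is also shortened slightly: one digit from 1-9 followed by any digits describes the same strings as `[1-9]+[0-9]*`.

```python
import re

p = "0|[1-9][0-9]*"
strings = ["f0", "0", "035", "1423452520"]
# True where the entire string is a valid integer in the sense above.
results = [re.fullmatch(p, s) is not None for s in strings]
print(results)
```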




We have now said something about every meta-character.

Progress!!!

# How to create a string with a single literal backslash?

We want to talk about the special status of the backslash character, which we saw above can be used to escape the properties of a meta-character.

In [219]:
p="\""
print(p)

"


In [221]:
p="\\"
print(p)

\


In [223]:
p="\\\\"
print(p)

\\


In [226]:
p=r'\\'
print(p)
p=r'\n'
print(p)

\\
\n


# Backslash Issues
In regular expressions, care has to be taken when using a literal backslash (a backslash to be interpreted as a character) in a pattern.

Things get a little bit complicated when dealing with the backslash for two reasons:

1) it has special status in a pattern to create an escape

2) when we put a backslash in a string we need to escape it to make it literal

Let's first review how to get a backslash into a string.

Recall that the backslash is used by Python in various ways in string literals. For example, "\n" stands for a new line, and "\x" followed by two hexadecimal digits represents an ASCII character by its hexadecimal value.

In [172]:
txt="\x43\n\x41\n\x54"
print(txt)

C
A
T


So some care is required when using backslash in a pattern.

Getting a backslash character in a string can be a slight challenge. The Unicode code point for backslash is 005c (hexadecimal), i.e. 92 in decimal, so chr(92) produces it. We can also put a backslash in a string by escaping it with a backslash. We can also try to create a "raw" string with a single backslash, r"\", but that fails because \" is interpreted as escaping the " character, so the string is never closed. To get a single backslash we therefore write two backslashes in an ordinary (non-raw) string.

In [201]:
str1="\\"
print(str1)
str2=chr(92)
print(str2)
str1==str2

\
\


True
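
The points above can be checked by looking at string lengths. In this sketch the variable names are illustrative, and `compile()` is used only so that the invalid literal can be demonstrated without breaking the cell itself.

```python
escaped = "\\"       # one backslash, written with an escape
raw_two = r"\\"      # raw string: both backslashes are kept
print(len(escaped), len(raw_two))

# The source text r"\" is not a valid string literal. Compiling that text
# at run time shows the SyntaxError without stopping this script.
try:
    compile('r"\\"', "<example>", "eval")
    raw_single_ok = True
except SyntaxError:
    raw_single_ok = False
print(raw_single_ok)
```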

When backslashes appear in patterns, they might be *interpreted* rather than taken literally.

In [1]:
import re
pattern="\x43"
text="ABCDEF"
if re.search(pattern,text):
    print("match")

match


The same goes for backslashes in text.

In [2]:
import re
pattern="\x43"
text="AB\x43DEF"
if re.search(pattern,text):
    print("match")

match


In [179]:
mytext="Roses are red,\nviolets are blue,\nI stink at math\nHow about You?"
print(mytext)

Roses are red,
violets are blue,
I stink at math
How about You?


There will be instances in which the backslash needs to be treated literally rather than interpreted. In such cases, we are advised to use *raw* strings.

In [181]:
text=r'in this raw string there is a backslash character (i.e. \) appearing'
print(text)
mytext=r'Roses are red,\nviolets are blue,\nI stink at math\nHow about You?'
print(mytext)

in this raw string there is a backslash character (i.e. \) appearing
Roses are red,\nviolets are blue,\nI stink at math\nHow about You?


Now what about backslashes in patterns? If we want to search for the position of a period in some text, the following won't work, because an unescaped period matches any character.

In [205]:
import re
strings=["Here is a sentence. Here is another sentence."]
p="."
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = Here is a sentence. Here is another sentence.
pattern matches from 0 to 1 with substring = H




Remember that most special characters inside []'s are taken literally, i.e. they are not interpreted as having any special meaning.

In [6]:
import re
p=r"[\/*+]"
strings=["8*9=72"]
for s in strings:
    print(s)
    check_search(p,s)
    print("\n")

8*9=72
pattern matches from 1 to 2 with substring = *
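
As a rough illustration of which characters lose their special meaning inside []'s and which keep it, consider the sketch below. The sample strings are made up for this purpose.

```python
import re

# Inside a set, the characters . * + ( are ordinary characters.
literal_hits = re.findall(r"[.*+(]", "a.b*c+d(e)")
print(literal_hits)    # ['.', '*', '+', '(']

# A backslash still escapes inside a set; here it lets us list ] and [.
bracket_hits = re.findall(r"[\]\[]", "x[1]")
print(bracket_hits)    # ['[', ']']

# A ^ in first position negates the set.
negated_hits = re.findall(r"[^a-z]", "ab-cd")
print(negated_hits)    # ['-']
```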




# Character classes

As noted above, we can create our own character classes. There are also standard pre-defined character classes.

Here is an example of one of these and the documentation from Python:

\d

For Unicode (str) patterns:
Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters. If the ASCII flag is used only [0-9] is matched.

For 8-bit (bytes) patterns: Matches any decimal digit; this is equivalent to [0-9].

Let's see what kinds of matches can occur if we aren't careful. Now that you know all about unicode, you can appreciate what is going on!

In [14]:
import re
b=b'\xE0\xA9\xA7'
x=b.decode()
p=r"\d"
strings=["Sometimes he behaves like a 2 year old",
        "Really, because I thought he was actually a "+x+" year old"]
for s in strings:
    print(s)
    check_search(p,s)
    print("\n")

Sometimes he behaves like a 2 year old
pattern matches from 28 to 29 with substring = 2


Really, because I thought he was actually a ੧ year old
pattern matches from 44 to 45 with substring = ੧




We get a match because the character we decoded from those three bytes is U+0A67 (GURMUKHI DIGIT ONE), 
which belongs to the Unicode decimal digit category [Nd].

https://www.fileformat.info/info/unicode/category/Nd/list.htm

If we want to only allow for the ascii characters 0,1,...,9, we need to set a flag.

In [15]:
import re
b=b'\xE0\xA9\xA7'
x=b.decode()
p=r"\d"
strings=["Sometimes he behaves like a 2 year old",
        "Really, because I thought he was actually a "+x+" year old"]
for s in strings:
    res=re.search(p,s,re.ASCII)
    print(s)
    if res:
        print("match")
    else:
        print("no match")
    print("\n")

Sometimes he behaves like a 2 year old
match


Really, because I thought he was actually a ੧ year old
no match
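
There is more than one way to restrict the match to 0,1,...,9. The sketch below compares the default behaviour with the ASCII flag, the inline form (?a) of the same flag, and an explicit [0-9] set. The sample string is invented for the comparison.

```python
import re

gurmukhi_one = "\u0a67"   # the same character as in the example above
s = "digits: 7 and " + gurmukhi_one

unicode_hits = re.findall(r"\d", s)                # default for str patterns
ascii_flag_hits = re.findall(r"\d", s, re.ASCII)   # flag passed as an argument
inline_flag_hits = re.findall(r"(?a)\d", s)        # flag written inside the pattern
explicit_hits = re.findall(r"[0-9]", s)            # explicit character set

print(len(unicode_hits))                                  # 2
print(ascii_flag_hits, inline_flag_hits, explicit_hits)   # ['7'] ['7'] ['7']
```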




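Here is a list of pre-defined character classes provided by the package. The bracketed equivalents are exact for bytes patterns or when the ASCII flag is set; for str patterns without that flag, the classes also include the corresponding Unicode characters, as we just saw for \d: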

\d Matches any decimal digit; 
    this is equivalent to the class [0-9].

\D Matches any non-digit character; 
    this is equivalent to the class [^0-9].

\s Matches any whitespace character; 
    this is equivalent to the class [ \t\n\r\f\v].

\S Matches any non-whitespace character; 
    this is equivalent to the class [^ \t\n\r\f\v].

\w Matches any alphanumeric character; 
    this is equivalent to the class [a-zA-Z0-9_].

\W Matches any non-alphanumeric character; 
    this is equivalent to the class [^a-zA-Z0-9_].
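
A quick way to get a feel for these classes is to run each of them over one sample string. This is only a sketch: the string `sample` is made up, and the ASCII flag is set so that the bracketed equivalents above apply exactly.

```python
import re

sample = "Room 42_b,\tfloor 3!"   # contains a tab between the comma and "floor"

digits = re.findall(r"\d", sample, re.ASCII)
spaces = re.findall(r"\s", sample, re.ASCII)
words = re.findall(r"\w+", sample, re.ASCII)
non_word = re.findall(r"\W", sample, re.ASCII)

print(digits)     # ['4', '2', '3']
print(spaces)     # [' ', '\t', ' ']
print(words)      # ['Room', '42_b', 'floor', '3']
print(non_word)   # [' ', ',', '\t', ' ', '!']
```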
    
   

We can put pre-defined character classes inside our own character sets. So, for example, what does this do?

    [\w. ]+
    
    
Note that \ retains its special use inside []'s, so \w still means a word character there, while the period is taken literally.

In [45]:
import re
p=r"[\w. ]+"
strings=["Sometimes he behaves like a 2 year old.",
        "Really, because I thought he was actually a 3 year old"]
         
for s in strings:
    print(s)
    check_search(p,s)
    print("\n")

Sometimes he behaves like a 2 year old.
pattern matches from 0 to 39 with substring = Sometimes he behaves like a 2 year old.


Really, because I thought he was actually a 3 year old
pattern matches from 0 to 6 with substring = Really




# Anchors/word boundaries

One final special term that can be useful is \b, which matches at a word boundary. A word is a contiguous sequence of alphanumeric characters (underscore included), and its left-hand boundary can be:

- no character if the sequence is the start of a string
- a non-alphanumeric character

and its right-hand boundary can be

- no character if the sequence is the end of a string
- a non-alphanumeric character

Note: When you enter \b in a Python string, this has special meaning as a backspace. So you need to escape the \ or use a raw string.


In [50]:
import re
p="\\bSasha\\b"
strings=["Do you spell Sasha with a capital S?",
        "Sasha is 8 years old",
        "Do you know Sasha?",
         "I know Sasha",
        "I met her. Sasha is nice."]
for s in strings:
    print(s)
    check_search(p,s)
    print("\n")

Do you spell Sasha with a capital S?
pattern matches from 13 to 18 with substring = Sasha


Sasha is 8 years old
pattern matches from 0 to 5 with substring = Sasha


Do you know Sasha?
pattern matches from 12 to 17 with substring = Sasha


I know Sasha
pattern matches from 7 to 12 with substring = Sasha


I met her. Sasha is nice.
pattern matches from 11 to 16 with substring = Sasha
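
The following sketch shows both points made above: what "\b" means to Python outside a raw string, and what the word boundary adds compared with searching for the bare name. The test strings are invented for illustration.

```python
import re

# In an ordinary string literal, \b is a single backspace character.
print("\b" == "\x08", len("\b"))   # True 1
# In a raw string it stays as two characters, which is what re needs.
print(len(r"\b"))                  # 2

strings = ["Sasha's book", "Sashas are rare", "MsSasha"]
with_boundary = [bool(re.search(r"\bSasha\b", s)) for s in strings]
without_boundary = [bool(re.search("Sasha", s)) for s in strings]

print(with_boundary)      # [True, False, False]
print(without_boundary)   # [True, True, True]
```

The apostrophe in the first string is a non-alphanumeric character, so it counts as a boundary. In the other two strings a letter sits directly next to the name, so \b rejects them.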




# Named groups

There are several advanced features that we won't discuss in detail here, but it is important to be aware of their existence. For example, we can define something to search for, then refer to the pattern found later on.

In the following example, we define a group called "pet" that matches "dog", "cat" or "bird"

(?P<pet>(dog|cat|bird))
    
and once a match occurs, the term (?P=pet) refers to the text previously matched by that group, i.e. "dog" in the first example and "bird" in the second example.

In [36]:
import re
p="(?P<pet>(dog|cat|bird)).*(?P=pet)"
strings=["Do you have a dog as a pet? I don't have a dog.",
        "I really am not terribly fond of birds but wish my cat didn't hunt birds!"]
for s in strings:
    print(s)
    check_search(p,s)
    print("\n")

Do you have a dog as a pet? I don't have a dog.
pattern matches from 14 to 46 with substring = dog as a pet? I don't have a dog


I really am not terribly fond of birds but wish my cat didn't hunt birds!
pattern matches from 33 to 71 with substring = birds but wish my cat didn't hunt bird
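
Once a named group has matched, we can also look it up by name on the match object. The sketch below uses a slightly simplified version of the pattern (one pair of parentheses instead of two), and the last test string is made up.

```python
import re

p = r"(?P<pet>dog|cat|bird).*(?P=pet)"
m = re.search(p, "Do you have a dog as a pet? I don't have a dog.")

pet = m.group("pet")        # text matched by the named group
pets = m.groupdict()        # all named groups as a dictionary
pet_span = m.span("pet")    # position of the first "dog"
print(pet, pets, pet_span)  # dog {'pet': 'dog'} (14, 17)

# No pet is mentioned twice here, so the back-reference cannot succeed.
no_repeat = re.search(p, "I have a dog and a cat.")
print(no_repeat)            # None
```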




# Compiling patterns

In [66]:
import re
pattern="(dog|cat|bird)"
strings=["That dog sure looks like a catbird.",
         "Whose dog is that anyway?",
         "Let's go bird-watching with the dog but leave the cat at home."]
pc=re.compile(pattern)
for s in strings:
    print(pc.search(s))

<re.Match object; span=(5, 8), match='dog'>
<re.Match object; span=(6, 9), match='dog'>
<re.Match object; span=(9, 13), match='bird'>

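A compiled pattern offers the same operations as the functions in the re module (search, match, findall and so on), with the pattern and any flags fixed once at compile time. Here is a minimal sketch. The IGNORECASE flag and the upper-case "DOG" are additions for illustration and are not part of the example above.

```python
import re

s = "That DOG sure looks like a catbird."

# Flags are supplied once, when the pattern is compiled.
pc = re.compile(r"(dog|cat|bird)", re.IGNORECASE)

compiled_span = pc.search(s).span()
module_span = re.search(r"(dog|cat|bird)", s, re.IGNORECASE).span()
print(compiled_span == module_span)   # True

found = pc.findall(s)
print(found)                          # ['DOG', 'cat', 'bird']
```

Compiling is mainly a convenience when the same pattern is applied to many strings, as in the loop above.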

# Finding all occurrences

We might want to identify all occurrences of a pattern in a string. For this we can use findall.

In [70]:
import re
string="Mrs. Smith came home and said hello to Mr. Smith. He said to her,"
string=string+" 'How did things go at the office today?' But Mrs. Smith did not answer."
p=re.compile("Smith")
res=p.findall(string)
len(res)

3

The algorithm finds the first matching substring, then tries to find the next match starting at the position just after the match found. Do you see what the difference is between the following two examples?

In [82]:
import re
string="Mrs. Smith came home and said hello to Mr. Smith. He said to her,"
string=string+" 'How did things go at the office today?' But Mrs. Smith did not answer."
p=re.compile("Mrs?.*Smith")
res=p.findall(string)
len(res)
print(res)

["Mrs. Smith came home and said hello to Mr. Smith. He said to her, 'How did things go at the office today?' But Mrs. Smith"]


In [83]:
import re
string="Mrs. Smith came home and said hello to Mr. Smith. He said to her,"
string=string+" 'How did things go at the office today?' But Mrs. Smith did not answer."
p=re.compile("Mrs?.*?Smith")
res=p.findall(string)
len(res)
print(res)

['Mrs. Smith', 'Mr. Smith', 'Mrs. Smith']
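
The difference is easier to see on a very short string. In this sketch (the string is made up), .* takes as much as it can while .*? takes as little as it can.

```python
import re

s = "<a><b>"

# Greedy: .* runs to the last > it can reach, giving one long match.
greedy = re.findall(r"<.*>", s)
# Non-greedy: .*? stops at the first > it meets, giving two short matches.
lazy = re.findall(r"<.*?>", s)

print(greedy)   # ['<a><b>']
print(lazy)     # ['<a>', '<b>']
```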


In [90]:
import re
string="aaabbb aaabbbabbbb bbbaaaa"
p=re.compile("a+b+|b+a+")
res=p.findall(string)
len(res)
print(res)

['aaabbb', 'aaabbb', 'abbbb', 'bbbaaaa']


We can get more information using the finditer function.

In [112]:
import re
string="Mrs. Smith came home and said hello to Mr. Smith. He said to her,"
string=string+" 'How did things go at the office today?' But Mrs. Smith did not answer."
p=re.compile("Mrs?.*?Smith")
res=p.finditer(string)
for x in res:
    print("position of match = " + str(x.span()))
    start=x.span()[0]
    end=x.span()[1]
    print("substring = " + string[start:end])
        

position of match = (0, 10)
substring = Mrs. Smith
position of match = (39, 48)
substring = Mr. Smith
position of match = (111, 121)
substring = Mrs. Smith
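
Each item produced by finditer is a match object, so instead of slicing the string ourselves we can ask it for the matched text with group(). A small sketch on a shorter, made-up string, with the period escaped so that it is matched literally:

```python
import re

string = "Mr. Smith met Mrs. Smith."

# Collect (start, end, matched text) for every match.
rows = [(m.start(), m.end(), m.group())
        for m in re.finditer(r"Mrs?\. Smith", string)]

for row in rows:
    print(row)
# (0, 9, 'Mr. Smith')
# (14, 24, 'Mrs. Smith')
```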


# Getting more information

When we get a match we can recover the substrings making up groups in the pattern to be matched. Here, group() refers to the entire match.

In [113]:
text = "Students in financial math are smarter than anyone, I think."
pattern = "(St.*ts) .* (math).* are (.*) than .*(one).*"
m = re.match(pattern, text)
print(m.group())
for i in range(5):
    print(m.group(i))

Students in financial math are smarter than anyone, I think.
Students in financial math are smarter than anyone, I think.
Students
math
smarter
one


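We can determine the positions of the groups that match using the *regs* attribute of the match object, which holds a tuple of tuples when a match occurs. (This attribute does not appear in the Python documentation for match objects; span(i) gives the same pair for group i.)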

In [155]:
text = "Students in financial math are smarter than anyone, I think."
pattern = "(St.*ts) .* (math).* are (.*) than .*(one).*"
m = re.match( pattern, text, re.M|re.I)
print(m.regs)
print(type(m.regs))
print(type(m.regs[2]))
for i in range(len(m.regs)):
    print(text[m.regs[i][0]:m.regs[i][1]])

((0, 60), (0, 8), (22, 26), (31, 38), (47, 50))
<class 'tuple'>
<class 'tuple'>
Students in financial math are smarter than anyone, I think.
Students
math
smarter
one
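
The same positions can be obtained from the documented span() method, one group at a time. The sketch below rebuilds the tuple of tuples that way. The comparison with regs uses getattr on the assumption that the attribute may not be available in every Python implementation.

```python
import re

text = "Students in financial math are smarter than anyone, I think."
pattern = "(St.*ts) .* (math).* are (.*) than .*(one).*"
m = re.match(pattern, text)

# Group 0 is the whole match; groups 1..n are the parenthesized parts.
spans = tuple(m.span(i) for i in range(len(m.groups()) + 1))
print(spans)
# ((0, 60), (0, 8), (22, 26), (31, 38), (47, 50))

# Compare with regs where it is available.
print(spans == getattr(m, "regs", spans))
```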


Similarly, we can use search to find locations of matching portions of a pattern.

In [154]:
text = "I believe that students in financial math are smarter than anyone."
pattern = "(st.*ts) .* (math) .*(any)"
m = re.search(pattern, text)
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(3))

students in financial math are smarter than any
students
math
any


# Splitting strings

We can split a line using a regular expression as a delimiter. The result is a list of substrings if there are matches.

In [183]:
import re
text="this is the craziest idea you have ever presented in all my years"
pattern="e"
re.split(pattern,text)

['this is th',
 ' crazi',
 'st id',
 'a you hav',
 ' ',
 'v',
 'r pr',
 's',
 'nt',
 'd in all my y',
 'ars']

If there are no matches, then the output is a list with the whole string.

In [184]:
import re
text="this is the craziest idea you have ever presented in all my years"
pattern="q"
re.split(pattern,text)

['this is the craziest idea you have ever presented in all my years']

If the start of the line matches, then the list starts with an empty string.

In [182]:
import re
text="5AM: Woke up from a dream. 6AM: fell back asleep. 12PM: woke up and felt refreshed."
pattern=r"\d+[AP]M: "
re.split(pattern,text)


['',
 'Woke up from a dream. ',
 'fell back asleep. ',
 'woke up and felt refreshed.']
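
Two variations on split are worth knowing. The sketch below, on a shortened made-up version of the text, limits the number of splits with maxsplit and keeps part of each delimiter by putting it in parentheses.

```python
import re

text = "5AM: Woke up. 6AM: fell asleep. 12PM: woke again."

# maxsplit=1 stops after the first delimiter.
first_only = re.split(r"\d+[AP]M: ", text, maxsplit=1)
print(first_only)
# ['', 'Woke up. 6AM: fell asleep. 12PM: woke again.']

# A parenthesized group in the delimiter is returned in the list as well.
with_times = re.split(r"(\d+[AP]M): ", text)
print(with_times)
# ['', '5AM', 'Woke up. ', '6AM', 'fell asleep. ', '12PM', 'woke again.']
```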

Final comments:
1) How do we find all the \newsection in text?
2) Greediness (.* vs .*?)
3) Flags
4) Compiling regex's
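
For the first question, one possible approach is sketched here: write the pattern as a raw string with the backslash escaped for re, and use finditer to collect the positions. The sample text is made up.

```python
import re

# "\\" in an ordinary string literal is one backslash in the text itself.
text = "intro \\newsection first part \\newsection second part"

# In the raw-string pattern, \\ tells re to match one literal backslash.
hits = [m.start() for m in re.finditer(r"\\newsection", text)]
print(hits)   # [6, 29]
```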

In [14]:
text="here is some text with \\newsection appearing in it"
pattern="he.*t"
re.search(pattern,text)

<_sre.SRE_Match object; span=(0, 50), match='here is some text with \\newsection appearing in >

In [15]:
text="here is some text with \\newsection appearing in it"
pattern="he.*?t"
re.search(pattern,text)

<_sre.SRE_Match object; span=(0, 14), match='here is some t'>

In [112]:
import re
string="Here is a sentence. Here is another sentence."
pattern="."
s=re.search(pattern, string)
if s:
    print("match")
else:
    print("no match")
print(s.start())
print(s.end())

match
0
1


The problem is that the period in a pattern is a wild-card, meaning it matches any character (even a whitespace character, though by default not a newline).
So we need to tell re to use a literal period, not a period with special meaning. We do this by escaping it with the backslash.

In [113]:
import re
string="Here is a sentence. Here is another sentence."
pattern=r"\."
s=re.search(pattern, string)
if s:
    print("match")
else:
    print("no match")
print(s.start())
print(s.end())

match
18
19
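
When the text to search for comes from somewhere else and may contain special characters, the re module can do the escaping for us with re.escape. A small sketch, using the same sentence:

```python
import re

string = "Here is a sentence. Here is another sentence."

# re.escape puts a backslash in front of characters that are special to re.
literal = re.escape(".")
print(literal)      # \.

positions = [m.start() for m in re.finditer(literal, string)]
print(positions)    # [18, 44]
```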


So \ has a special meaning in a pattern. So how do we search for the \ character?

In [127]:
import re
string="Here is a sentence that has a \ in it. Here is another sentence."
print(string)
pattern="\"
s=re.search(pattern, string)
if s:
    print("match")
else:
    print("no match")
print(s.start())
print(s.end())
print(string[s.start()])

SyntaxError: EOL while scanning string literal (<ipython-input-127-4a9273fc909b>, line 4)

This fails because the pattern isn't even well-defined as a string. The backslash is escaping the ", which is what we would use if we wanted a string with a literal quote in it, so the string has no closing quote. To get a string with just " in it, we need to do this.

In [128]:
p="\""
print(p)

"


What about this?

In [130]:
import re
string="Here is a sentence that has a \ in it. Here is another sentence."
print(string)
pattern="\\"
print(pattern)
s=re.search(pattern, string)
if s:
    print("match")
else:
    print("no match")
print(s.start())
print(s.end())
print(string[s.start()])

Here is a sentence that has a \ in it. Here is another sentence.
\


error: bad escape (end of pattern) at position 0

This fails because the pattern that re receives contains just a single \, and \ has a special meaning in regular expression patterns. So we need to tell re that this character is not meant to be a "meta" character. We need to escape it for this purpose, so we actually need to type 4 backslashes: Python turns each \\ in the string literal into one backslash, and re then reads the resulting pair as an escaped, literal backslash.

The following works just fine.

In [131]:
import re
string="Here is a sentence that has a \ in it. Here is another sentence."
print(string)
pattern="\\\\"
print(pattern)
s=re.search(pattern, string)
if s:
    print("match")
else:
    print("no match")
print(s.start())
print(s.end())
print(string[s.start()])

Here is a sentence that has a \ in it. Here is another sentence.
\\
match
30
31
\
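
A raw string saves half of the typing, because Python then leaves the backslashes alone and only re interprets them. The sketch below checks that the two spellings give the same pattern and the same match. The sentence is shortened, and its backslash is written as \\ so that the string literal itself is unambiguous.

```python
import re

string = "Here is a sentence that has a \\ in it."

four = "\\\\"    # ordinary string: four typed backslashes, two characters
raw = r"\\"      # raw string: two typed backslashes, two characters
print(four == raw, len(raw))   # True 2

m1 = re.search(four, string)
m2 = re.search(raw, string)
print(m1.span(), m2.span())    # (30, 31) (30, 31)
```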
