## Introduction to Regular Expressions (regex's)

There is a nice basic regular expressions tutorial here 

https://regexone.com/references/python (click on Interactive Tutorial)

The Python 3 documentation:

https://docs.python.org/3/library/re.html

This is also a nice helpful tutorial:

https://www.tutorialspoint.com/python3/python_reg_expressions.htm

One more:

https://docs.python.org/3.6/howto/regex.html

And there is a link to a book

https://www.princeton.edu/~mlovett/reference/Regular-Expressions.pdf

Some of the most common tasks for regex's are:

1) determine whether a certain pattern of text matches some substring of a string starting at the beginning of the string - and return the size of the substring that matches the pattern

2) determine whether a pattern matches some substring anywhere in the string, and return the location of the match in the string

3) find all locations in a string where a pattern matches
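
As a rough preview of these three tasks, here is a minimal sketch using re.match, re.search and re.finditer from the standard library. The sample text is made up for illustration.

```python
import re

text = "cat hat cat"

# Task 1: does the pattern match at the very beginning of the string?
m = re.match("cat", text)
prefix_length = m.end() - m.start()   # size of the matching prefix
print(prefix_length)

# Task 2: where is the first match anywhere in the string?
s = re.search("hat", text)
print(s.span())

# Task 3: where are all the (non-overlapping) matches?
all_spans = [hit.span() for hit in re.finditer("cat", text)]
print(all_spans)
```

Each of these functions is discussed in more detail in the rest of these notes.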

A regular expression consists of ordinary characters and special characters. The simplest form of regular expression consists of a single (non-special) character.

We can combine regular expressions by concatenating them. Thus, if A is a regular expression and B is a regular expression, so is AB.

For example, an ordinary character is a regular expression, so "d" is a regular expression. So is "a" and so is "n", so "dan" is a regular expression.

Here is a simple example of regular expression matching i.e. determining whether the initial portion of a string matches the pattern. If we get a match, we print a message confirming it and provide some range information.

We use "match" to do the match. When we get a match, the function returns a match object; its span() method gives the "span" of the match as a pair, i.e. the start and end positions.

In [48]:
import re
pattern='Is'
string="Is there a match?"
m=re.match(pattern,string)
if m:
    print("matches")
    print(m.start())
    print(m.end())
    print(m.span())
    print(string[m.start():m.end()])
else:
    print("doesn't match")

matches
0
2
(0, 2)
Is


We can also use re to match bytes objects. However, when we do that, both the pattern and the string being searched must be bytes objects. We can't mix the two types of strings.

In [49]:
import re
pattern=b'\x20\x30\x40'
string=b'\x20\x30\x40\x20\x30\x40'
m=re.match(pattern,string)
if m:
    print("matches")
    print(m.start())
    print(m.end())
    print(m.span())
    print(string[m.start():m.end()])
else:
    print("doesn't match")

matches
0
3
(0, 3)
b' 0@'
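
To see what happens when the two types are mixed, here is a small sketch (the sample values are made up). The mixed call raises a TypeError, which we catch so that the cell keeps running.

```python
import re

# A bytes pattern applied to a bytes object works.
ok = re.match(b"ab", b"abc")
print(ok.span())

# Mixing a bytes pattern with a str raises TypeError.
try:
    re.match(b"ab", "abc")
    mixed_error = None
except TypeError as err:
    mixed_error = err
    print("TypeError:", err)
```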


For a match the pattern is required to match the initial portion of the string.

In [50]:
import re
pattern='match'
string="A match must occur with the initial portion of the string."
m=re.match(pattern,string)
if m:
    print("matches")
    print(m.start())
    print(m.end())
    print(m.span())
    print(string[m.start():m.end()])
else:
    print("doesn't match")

doesn't match


In [51]:
import re
pattern='m'
string="Matches are case-sensitive by default."
m=re.match(pattern,string)
if m:
    print("matches")
    print(m.start())
    print(m.end())
    print(m.span())
    print(string[m.start():m.end()])
else:
    print("doesn't match")

doesn't match


We can turn off case-sensitivity using the "I" flag.

In [52]:
import re
pattern='m'
string="Matches are case-sensitive by default."
m=re.match(pattern,string,re.I)
if m:
    print("matches")
    print(m.start())
    print(m.end())
    print(m.span())
    print(string[m.start():m.end()])
else:
    print("doesn't match")

matches
0
1
(0, 1)
M


On the other hand, we can **search** for a pattern appearing in any position in text using "search".

In [53]:
import re
pattern="dog"
string="My dog is named Sasha"
s=re.search(pattern,string)
if s:
    print("pattern found")
    print(s.start())
    print(s.end())
    print(s.span())
    print(string[s.start():s.end()])
else:
    print("doesn't match")

pattern found
3
6
(3, 6)
dog


When we get no match, the value returned is None.

In [54]:
import re
text="My name is Joan."
pattern="John"
s=re.search(pattern,text)
if s:
    print("found a match")
else:
    print("no match")
    print(s)
if s is None:
    print("we get none")

no match
None
we get none


When we use the search method, the information we get back is about the first match.

In [55]:
import re
text="My name is John. Are you also John?"
pattern="John"
s=re.search(pattern,text)
if s:
    print("pattern found")
    print(text[s.start():s.end()])
else:
    print("no match")

pattern found
John
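
If we want the next match as well, one option is to resume the search just past the end of the first one. The sketch below assumes a compiled pattern (re.compile is discussed later in these notes), because the search method of a compiled pattern accepts a starting position.

```python
import re

text = "My name is John. Are you also John?"
p = re.compile("John")

first = p.search(text)
# Resume the search just past the end of the first match.
second = p.search(text, first.end())
print(first.span(), second.span())
```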


We can also search for special characters.

In [56]:
import re
text="What is your name?\n\r My name is John."
pattern="\n"
m=re.search(pattern,text)
if m:
    print("found a match")
    print(type(m))
    print(m)
    print(text[m.start():m.end()])
else:
    print("no match")

found a match
<class 're.Match'>
<re.Match object; span=(18, 19), match='\n'>




# Some simple functions

Instead of repeating the same code over and over, let's create a couple of functions that do what we did in the above examples.

In [57]:
import re

def check_match(pattern, string):
    # report where the pattern matches at the start of the string, if it does
    m=re.match(pattern,string)
    if m:
        pos0=m.start()
        pos1=m.end()
        print("pattern " + pattern + 
              " matches from " + str(pos0) + " to " 
              + str(pos1) + " with substring = " 
              + string[m.start():m.end()])
    else:
        print("no match")
check_match("dog","My dog")
check_match("M","My dog")

def check_search(pattern, string):
    # report where the pattern first matches anywhere in the string
    s=re.search(pattern,string)
    if s:
        pos0=s.start()
        pos1=s.end()
        print("pattern matches from " + str(pos0) + " to " + str(pos1) + " with substring = " 
              + string[s.start():s.end()])
    else:
        print("no match")
check_search("dog","My dog ate my homework.")

no match
pattern M matches from 0 to 1 with substring = M
pattern matches from 3 to 6 with substring = dog


# Iterating over lists

In [58]:
strings=["What a beautiful day to be learning regex.", 
         "I feel so fortunate to have Dr. Miller as a mentor.", 
         "What a powerful thing, to be able to match patterns in strings!"]
p="to"
for s in strings:
    print(s)
    check_search(p,s)
    print("\n")

What a beautiful day to be learning regex.
pattern matches from 21 to 23 with substring = to


I feel so fortunate to have Dr. Miller as a mentor.
pattern matches from 20 to 22 with substring = to


What a powerful thing, to be able to match patterns in strings!
pattern matches from 23 to 25 with substring = to




## Special characters

As you can imagine, without further tools this is rather limited in what can be done. Additional functionality is obtained using characters with special meanings in our patterns.

These special characters are referred to as meta-characters. Here is a list of them:

. ^ $ * + ? { } [ ] \ | ( )

We will proceed to describe the uses of these characters in patterns.

# The dot/period: .

By default, the period (.) means any single character except the newline "\n".

In [59]:
import re
p="d.n"
strings=["dan","don","dn","d\nn"]
for s in strings:
    print("searching for pattern " + p + " in string = " + s)
    check_search(p,s)
    print("\n")

searching for pattern d.n in string = dan
pattern matches from 0 to 3 with substring = dan


searching for pattern d.n in string = don
pattern matches from 0 to 3 with substring = don


searching for pattern d.n in string = dn
no match


searching for pattern d.n in string = d
n
no match




If we want "." to be interpreted as any single character, including the 
newline character, we use the DOTALL flag. So in the following example, 
we do indeed get a match.

In [60]:
import re
p="d.n"
string="d\nn"
if re.search(p,string,re.DOTALL):
    print("found!")

found!


# The circumflex character: ^

The ^ character is used to indicate that the pattern must appear at the start of a string.

In [61]:
import re
p="^Jo.n"
strings=["John, are you home?","Joan, are you home?","Hey, has anybody seen John or Joan around?"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = John, are you home?
pattern matches from 0 to 4 with substring = John


string = Joan, are you home?
pattern matches from 0 to 4 with substring = Joan


string = Hey, has anybody seen John or Joan around?
no match




# The dollar-sign: $

$ matches the end of the string, or just before the new line at the end of the string.

In [62]:
import re
p="dog.$"
strings=["Is that my dog?", "My dog seems to be missing.","Hey, look at my dog."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = Is that my dog?
pattern matches from 11 to 15 with substring = dog?


string = My dog seems to be missing.
no match


string = Hey, look at my dog.
pattern matches from 16 to 20 with substring = dog.
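
The "just before the new line at the end of the string" case is easy to miss, so here is a small sketch of it. The sample strings are made up, and no flags are used.

```python
import re

with_newline = "Hey, look at my dog\n"

# $ matches just before a newline that ends the string...
end_hit = re.search("dog$", with_newline)
print(end_hit)

# ...but not before a newline in the middle of the string (without re.M).
mid_hit = re.search("dog$", "my dog\nran away")
print(mid_hit)
```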




# The asterisk: *

The * matches zero or more repetitions of the preceding regular expression, and it takes as many repetitions as possible. In the following, the preceding regular expression is the letter "o". Note that the search continues until the first position where a match can start is found, and then it finds the longest match starting at that point.

This is what is meant by regex searching being greedy. It looks for a match starting as early as possible in a string, but among the matches starting there it looks for the one that is as long as possible.

In the following, the pattern matches a "D", then any number of "o"s (possibly none), then an "r", so "Dr", "Dor" and "Door" all give a match.

In [63]:
import re
p="Do*r"
strings=["Hi Doris!","Are you a Dr.?","Doors should not be left opened."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = Hi Doris!
pattern matches from 3 to 6 with substring = Dor


string = Are you a Dr.?
pattern matches from 10 to 12 with substring = Dr


string = Doors should not be left opened.
pattern matches from 0 to 4 with substring = Door




# The plus sign: +

The + symbol means match 1 or more repetitions of the preceding expression, so unlike the *, we need to have the preceding expression appear at least once to get a match.

In [64]:
import re
p="Ca+"
strings=["Cats are not like dogs!","Could you come here?","Caanan is spoken about in the bible"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")



string = Cats are not like dogs!
pattern matches from 0 to 2 with substring = Ca


string = Could you come here?
no match


string = Caanan is spoken about in the bible
pattern matches from 0 to 3 with substring = Caa




# The question mark: ?

One use of the ? character is obtained by putting it after a regular expression, which means match 0 or 1 repetitions of that expression. In other words, it indicates that the expression is optional. For example, English spellings in the U.K. can differ from English spellings in the U.S.: in the U.S. we would write "humor", while in the U.K. they would write "humour". We can test for either spelling using this:

In [65]:
import re
p="humou?r"
strings=["Americans are known for their great sense of humor","British people generally lack a sense of humour"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Americans are known for their great sense of humor
pattern matches from 45 to 50 with substring = humor


string = British people generally lack a sense of humour
pattern matches from 41 to 47 with substring = humour




Another use of the ? character is in creating a lazy instead of greedy attempt at matching, as discussed below.

# Use of braces: {}

Braces are used to indicate the number of times an expression appears, or a range of numbers of times.

In [66]:
import re
p="ma{2}"
strings=["Hi mama.","When I get heartburn I use maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Hi mama.
no match


string = When I get heartburn I use maalox.
pattern matches from 27 to 30 with substring = maa




In [67]:
import re
p="ma{2}"
strings=["ma","maa","maaa","maaaa"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = ma
no match


string = maa
pattern matches from 0 to 3 with substring = maa


string = maaa
pattern matches from 0 to 3 with substring = maa


string = maaaa
pattern matches from 0 to 3 with substring = maa




In [68]:
import re
p="ma{2,3}"
strings=["ma","maa","maaa","maaaa"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = ma
no match


string = maa
pattern matches from 0 to 3 with substring = maa


string = maaa
pattern matches from 0 to 4 with substring = maaa


string = maaaa
pattern matches from 0 to 4 with substring = maaa




# Use of parentheses for grouping

The \*, +, ?, {n}, and {m,n} constructs are referred to as *quantifiers*. By default, a quantifier applies to the regular expression immediately before it, which in a run of ordinary characters is just the single preceding character. In the following example, the + refers to the "a", not to "ma".

We can also write {,n} to refer to at most n occurrences, and {m,} to mean at least m occurrences.
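
Here is a small sketch of the two open-ended forms. It uses re.fullmatch, which is not used elsewhere in these notes and which requires the whole string to match the pattern; that makes the counts easy to see.

```python
import re

results = {}
for s in ["", "a", "aa", "aaa", "aaaa"]:
    at_most_two = bool(re.fullmatch("a{,2}", s))    # 0, 1 or 2 copies of "a"
    at_least_two = bool(re.fullmatch("a{2,}", s))   # 2 or more copies of "a"
    results[s] = (at_most_two, at_least_two)
    print(repr(s), at_most_two, at_least_two)
```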

In [69]:
import re
p="ma+"
strings=["Hi ma.","Hi mama.","When I get heartburn I use maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Hi ma.
pattern matches from 3 to 5 with substring = ma


string = Hi mama.
pattern matches from 3 to 5 with substring = ma


string = When I get heartburn I use maalox.
pattern matches from 27 to 30 with substring = maa




In [70]:
import re
p="(ma)+"
strings=["Hi ma.","Hi mama.","When I get heartburn I use maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Hi ma.
pattern matches from 3 to 5 with substring = ma


string = Hi mama.
pattern matches from 3 to 7 with substring = mama


string = When I get heartburn I use maalox.
pattern matches from 27 to 29 with substring = ma




What is going on in this example?

In [71]:
import re
p="(ma).*ma+"
strings=["Don't forget to tell your mama that you took the maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Don't forget to tell your mama that you took the maalox.
pattern matches from 26 to 52 with substring = mama that you took the maa




In [72]:
import re
p="(ma){2}.*ma"
strings=["Don't forget to tell your mama that you took the maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Don't forget to tell your mama that you took the maalox.
pattern matches from 26 to 51 with substring = mama that you took the ma




In [73]:
import re
p="m.*a{3}"
strings=["Don't forget to tell your mama that you took the maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Don't forget to tell your mama that you took the maalox.
no match




In [74]:
import re
p="(m.*a){3}"
strings=["Don't forget to tell your mama that you took the maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Don't forget to tell your mama that you took the maalox.
no match




In [75]:
import re
p="(m.*a){2}.*(m.*a)"
strings=["Don't forget to tell your mama that you took the maalox."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Don't forget to tell your mama that you took the maalox.
pattern matches from 26 to 52 with substring = mama that you took the maa




# Using ? to eliminate greediness and encourage laziness

In order to ensure that matching is done in a non-greedy fashion, we can use  \*?. In the following example, we see greediness in action.

In [76]:
import re
p="<item>.*</item>"
strings=["<item>blah1</item><item>blah2</item>"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = <item>blah1</item><item>blah2</item>
pattern matches from 0 to 36 with substring = <item>blah1</item><item>blah2</item>




But we wanted to stop when we got to the first closing tag. The .\*? says match any number of characters between the two tags, but include only the minimal amount needed to get the match. In other words, we want to stop searching as soon as we have matched the smallest portion of the string that gives a match.

In [77]:
import re
p="<item>(.*?)</item>"
strings=["<item>blah1</item><item>blah2</item>"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = <item>blah1</item><item>blah2</item>
pattern matches from 0 to 18 with substring = <item>blah1</item>




The ? character also works when appearing after the + quantifier. The + still requires at least one repetition, 
but the ? makes the search lazy by minimizing the portion needed to match.

In [78]:
import re
p="<item>(.+?)</item>"
strings=["<item></item>","<item>some stuff</item>" ]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = <item></item>
no match


string = <item>some stuff</item>
pattern matches from 0 to 23 with substring = <item>some stuff</item>




Finally, ? on its own is greedy, while ?? is lazy in the sense that the attempt to match stops after finding the initial pattern without the optional expression.

In [79]:
import re
p="friends?"
strings=["Can I play with your friends?" ]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Can I play with your friends?
pattern matches from 21 to 28 with substring = friends




In [80]:
import re
p="friends??"
strings=["Can I play with your friends?" ]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Can I play with your friends?
pattern matches from 21 to 27 with substring = friend




The ? character allows us to create non-greedy versions of the {} patterns.

In [81]:
import re
p="(xo){2,}"
strings=["xo i really love you xoxo",
     "xo i really love you xoxoxox"]        
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = xo i really love you xoxo
pattern matches from 21 to 25 with substring = xoxo


string = xo i really love you xoxoxox
pattern matches from 21 to 27 with substring = xoxoxo




In [82]:
import re
p="(xo){2,}?"
strings=["xo i really love you xoxo",
     "xo i really love you xoxoxox"]        
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = xo i really love you xoxo
pattern matches from 21 to 25 with substring = xoxo


string = xo i really love you xoxoxox
pattern matches from 21 to 25 with substring = xoxo




# Character groups and []'s

We use square brackets to define character groups. For example, suppose we want a match for any expression of the form
"p_t" where the underscore character can be any vowel from among a,e,i, or o. We can define a character group [aeio].

In [83]:
import re
p="p[aeio]t?"
strings=["Is that your pet?",
         "When I see you I get a funny feeling in the pit of my stomach",
         "Are you a pot-smoker?",
         "I've got to give you a pat on the back!",
         "Is 'pyt' a word?"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = Is that your pet?
pattern matches from 13 to 16 with substring = pet


string = When I see you I get a funny feeling in the pit of my stomach
pattern matches from 44 to 47 with substring = pit


string = Are you a pot-smoker?
pattern matches from 10 to 13 with substring = pot


string = I've got to give you a pat on the back!
pattern matches from 23 to 26 with substring = pat


string = Is 'pyt' a word?
no match




To simplify writing down certain character groups, we can use the dash. For example, for all digits 0,1,2,3,4,5,6,7,8,9
we can use [0-9].

In [84]:
import re
p="[0-9]+"
strings=["My office phone number is 410-516-7203.",
     "My social security number is 897999131."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")


string = My office phone number is 410-516-7203.
pattern matches from 26 to 29 with substring = 410


string = My social security number is 897999131.
pattern matches from 29 to 38 with substring = 897999131




# Exercise: 

Assume a phone number is always of the form yxx-yxx-xxxx or 1-yxx-yxx-xxxx, where each x is a digit from 0-9 and each y is a digit from 1-9. Write code to find the first phone number in a string.

In [133]:
import re
p="(1-)?([1-9][0-9]{2}-){2}[0-9]{4}"
strings=["My office phone number is 110-116-1203.",
     "My social security number is 897999131."]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = My office phone number is 110-116-1203.
pattern matches from 26 to 38 with substring = 110-116-1203


string = My social security number is 897999131.
no match




# Meta-characters inside square brackets

Inside square brackets, there are only some meta-characters that have special meaning. Others are interpreted as literal characters. 

The ones that do have meaning are the backslash \, the hyphen, and the circumflex.

The circumflex only has special meaning when it appears immediately after the \[ and it means "not" among these characters.
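
As a quick illustration of how much the position of the circumflex matters, here is a small sketch with made-up strings.

```python
import re

# ^ first inside the brackets: match any character that is NOT a digit.
negated = re.search("[^0-9]", "42^7")
print(negated)

# ^ anywhere else inside the brackets: an ordinary caret character.
literal = re.search("[0-9^]+", "x42^7")
print(literal)
```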

In [85]:
import re
p="p[^aeio]t?"
strings=["Is that your pet?",
         "When I see you I get a funny feeling in the pit of my stomach",
         "Are you a pot-smoker?",
         "I've got to give you a pat on the back!",
         "Is 'pyt' a word?"]
for s in strings:
    print("string = " + s)
    check_search(p,s)
    print("\n")

string = Is that your pet?
no match


string = When I see you I get a funny feeling in the pit of my stomach
no match


string = Are you a pot-smoker?
no match


string = I've got to give you a pat on the back!
no match


string = Is 'pyt' a word?
pattern matches from 4 to 7 with substring = pyt




# The backslash: \

The backslash character is used to escape a special character in a regex pattern. So, for example, to search for a +, you would need to do something like the following.

In [86]:
import re
pattern=r"\+"
text="2+5=7"
re.search(pattern,text)

<re.Match object; span=(1, 2), match='+'>

The backslash is used by Python in various ways in string literals. For example,
for new line "\n", or to represent an ascii character using a hexadecimal value.

In [87]:
txt="\x43\n\x41\n\x54"
print(txt)

C
A
T


So some care is required when using backslash in a pattern.

Getting a backslash character in a string can be a slight challenge. The Unicode code point for backslash is 005c (hexadecimal), i.e. 92 in decimal. We can also put a backslash in a string by escaping it with a backslash. We can also try to create a "raw" string with a single backslash, but that fails because \" is interpreted as escaping the " character, so the string literal never ends. For this we need two backslashes.

In [88]:
txt1=u'\u005c'
txt2=chr(92)
txt3='\\'
txt4='\\'
mystr=txt1+txt2+txt3+txt4
print(mystr)
bytestr=mystr.encode('utf-8')
print(bytestr)
print(bytestr.decode())
print(txt1==txt2)
print(txt1==txt3)
print(txt1==txt4)

\\\\
b'\\\\\\\\'
\\\\
True
True
True


When backslashes appear in patterns, they might be interpreted rather than taken literally.

In [89]:
import re
pattern="\x43"
text="ABCDEF"
re.search(pattern,text)

<re.Match object; span=(2, 3), match='C'>

The same goes for backslashes in text.

In [90]:
import re
pattern="\x43"
text="AB\x43DEF"
re.search(pattern,text)

<re.Match object; span=(2, 3), match='C'>

In [91]:
mytext="Roses are red,\nviolets are blue,\nI stink at math\nHow about You?"
print(mytext)

Roses are red,
violets are blue,
I stink at math
How about You?


There will be instances in which the backslash needs to be treated literally rather than interpreted. In such cases, we are advised to use raw strings.

In [92]:
text=r'in this raw string there is a backslash character (i.e. \) appearing'
print(text)
mytext=r'Roses are red,\nviolets are blue,\nI stink at math\nHow about You?'
print(mytext)

in this raw string there is a backslash character (i.e. \) appearing
Roses are red,\nviolets are blue,\nI stink at math\nHow about You?
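
To make the difference concrete, here is a small sketch comparing ordinary and raw string literals. The sample text is made up.

```python
import re

# "\n" is one character (a newline); r"\n" is two (a backslash, then n).
lengths = (len("\n"), len(r"\n"))
print(lengths)

# These two literals are the same two-character pattern: backslash, d.
same = ("\\d" == r"\d")
print(same)

# The regex engine itself also understands \n, so both patterns find the newline.
text = "line one\nline two"
spans = (re.search("\n", text).span(), re.search(r"\n", text).span())
print(spans)
```

Writing patterns as raw strings means the only layer of escaping we have to think about is the regex one.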


The following attempt to create a single backslash in a string fails.

In [93]:
txt="\"
print(txt)

SyntaxError: EOL while scanning string literal (<ipython-input-93-7859c47daaaa>, line 1)

This fails also. 

In [94]:
txt=r"\"
print(txt)

SyntaxError: EOL while scanning string literal (<ipython-input-94-96bc300b1b45>, line 1)

But this works.

In [95]:
text="\\"
print(text)

\


But a raw string with two backslashes does not give a single backslash; both backslashes are kept.

In [96]:
text=r"\\"

In [97]:
print(text)

\\


The backslash is also used to create special sequences of regular expressions, as we will see below.

Square brackets are used to represent sets of characters. For example, to match one of the letters a, b or c, we can use [abc].

In [98]:
import re
pattern="[abc]"
print(re.search(pattern,"d"))
print(re.search(pattern,"help me please"))
pattern="[abcde]{2}"
print(re.search(pattern,"can you help me find my lost cat please"))

None
<re.Match object; span=(11, 12), match='a'>
<re.Match object; span=(0, 2), match='ca'>


Ranges can be used.

In [99]:
import re
pattern="[a-c]{2}"
print(re.search(pattern,"can you help me find my lost cat please"))

<re.Match object; span=(0, 2), match='ca'>


In [100]:
import re
pattern="[g-m][t-z]"
print(re.search(pattern,"can you help me find my lost cat please"))

<re.Match object; span=(21, 23), match='my'>


In [101]:
import re
pattern="[g-mt-z]"
print(re.search(pattern,"can you help me find my lost cat please"))

<re.Match object; span=(4, 5), match='y'>


In [102]:
import re
pattern="[3-5][4-8]"
print(re.search(pattern,"5823824854782102786467438"))

<re.Match object; span=(0, 2), match='58'>


If you want your set to include the "-" character, it needs to be escaped (or placed first or last inside the brackets). Here we search for a three character sequence using "-" or " ".

In [103]:
import re
pattern=r"[\- ]{3}"
print(re.search(pattern,"If you are around today - can you please email me?"))

<re.Match object; span=(23, 26), match=' - '>


Most special characters inside []'s are taken literally, i.e. they are not interpreted as having any special meaning.

In [104]:
import re
pattern=r"[\/*+]"
print(re.search(pattern,"8*9=72"))

<re.Match object; span=(1, 2), match='*'>


There are special classes of characters that can appear inside sets. For example, \d refers to any digit (0-9), \D refers to a non-digit.

In [105]:
import re
print(re.match(r"[\d]","9"))
print(re.match(r"[\d]","-"))
print(re.match(r"[\D]","9"))
print(re.match(r"[\D]","-"))

<re.Match object; span=(0, 1), match='9'>
None
None
<re.Match object; span=(0, 1), match='-'>


The construction A|B is used to match occurrences of one regular expression A or another B.

In [106]:
import re
pattern="dog|cat"
string="I don't like dogs, I do like cats"
re.search(pattern,string)

<re.Match object; span=(13, 16), match='dog'>

When using A|B if A matches, B is no longer tried, even if it produces a longer match.

In [107]:
import re
pattern="dog|dogs"
string="I don't like dogs, I do like cats"
re.search(pattern,string)

<re.Match object; span=(13, 16), match='dog'>
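
If the longer match is the one we want, two possible workarounds are sketched below: list the longer alternative first, or make the trailing "s" optional with ?.

```python
import re

string = "I don't like dogs, I do like cats"

# Listing the longer alternative first lets it win.
longer_first = re.search("dogs|dog", string)
print(longer_first)

# Alternatively, make the trailing "s" optional.
optional_s = re.search("dogs?", string)
print(optional_s)
```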

Multiple regular expressions separated by | can be used.

In [108]:
import re
pattern="dog|cat|bird"
string="I don't like birds, or cats or dogs."
re.search(pattern,string)

<re.Match object; span=(13, 17), match='bird'>

In Python, some care is required when newline characters can appear in the text.
We might need to use the DOTALL flag to indicate that "." refers to all characters, even the newline.

Here is an example.

In [109]:
import re
text="aj fjewkj xxx fw \n fwjfk  dfewj fejh geueiu yyy \n w hfdjwhf"
pattern="xxx.*yyy"
print(re.search(pattern,text))
print(re.search(pattern,text,re.DOTALL))


None
<re.Match object; span=(10, 47), match='xxx fw \n fwjfk  dfewj fejh geueiu yyy'>


It is common to want to extract what is between two expressions.

In [110]:
import re
text="aj fjewkj xxx fw \n fwjfk  dfewj fejh geueiu yyy \n w hfdjwhf"
pattern="xxx.*yyy"
re.search(pattern,text,re.DOTALL).group(0)


'xxx fw \n fwjfk  dfewj fejh geueiu yyy'
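
Note that group(0) still includes the two delimiters. If only the text in between is wanted, one option is to put parentheses around that part of the pattern and ask for group(1). The sketch below uses a shorter made-up text and a lazy .*? so that the match stops at the first "yyy".

```python
import re

text = "junk xxx keep\nthis yyy more junk"
m = re.search("xxx(.*?)yyy", text, re.DOTALL)

print(repr(m.group(0)))   # whole match, delimiters included
print(repr(m.group(1)))   # only the text between the delimiters
```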

We might want to count occurrences of a pattern.

In [111]:
import re
text="eieqwor iorwi xxx .fw uifej iyyy  aj fjewkj xxx fw \n fwjfk  dfewj fejh geueiu yyy \n w hfdjxxx hfehj hjjfhejh jhhjhjhfjeh yyy whf"
pattern="xxx.*?yyy"
res=re.findall(pattern,text,re.DOTALL)
len(res)

3

If we are going to use search repeatedly, it helps to compile the pattern.

In [112]:
import re
text="eieqwor iorwi xxx .fw uifej iyyy  aj fjewkj xxx fw \n fwjfk  dfewj fejh geueiu yyy \n w hfdjxxx hfehj hjjfhejh jhhjhjhfjeh yyy whf"
pattern="xxx.*?yyy"
p=re.compile(pattern, re.DOTALL)
res=p.findall(text)
len(res)


3
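
A compiled pattern can then be reused across many strings. Here is a minimal sketch with made-up lines; note that any flags are given once, when the pattern is compiled.

```python
import re

word = re.compile("help", re.I)   # flags are fixed at compile time

lines = ["Can you HELP me?", "No.", "Help is on the way; help yourself."]
counts = [len(word.findall(line)) for line in lines]
print(counts)
```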

Suppose we have text files separated into documents delimited by <a> ... </a> that look like:

<a> <b>1 .... </a> <a><b>3 .... </a> <a><b>1 ... </a> 

so the <b>n tag indicates what type of document we have. We want to extract all of the documents of the form:

<a><b>1 .... </a> 

where there are no </a> inside the ... .
How do we do this?



One possible approach is to allow only whitespace between <a> and <b>1, and to use a lazy .*? up to the closing tag so that the match ends at the first </a>.

In [113]:
import re
text="line <a>fehuj<b>1 number 1 </a><a><b>2 dwjjfkwj </a> <a><b>1 dehjfhe</a><a>dwhjf <b>1</a>"
# \s* allows only whitespace between <a> and <b>1; the lazy .*? stops at the first </a>
p=re.compile(r"<a>\s*<b>1.*?</a>",re.DOTALL)
pf=p.findall(text)
for doc in pf:
    print(doc)

<a><b>1 dehjfhe</a>

In [116]:
pattern="[ab]+[c]+"
text="abababcccababccccc"
re.search(pattern,text)

<re.Match object; span=(0, 9), match='abababccc'>

# Getting more information

When we get a match we can recover the substrings making up groups in the pattern to be matched. Here, group() refers to the entire match.

In [117]:
text = "Students in financial math are smarter than anyone, I think."
pattern = "(St.*ts) .* (math).* are (.*) than .*(one).*"
m = re.match(pattern, text)
print(m.group())
for i in range(5):
    print(m.group(i))

Students in financial math are smarter than anyone, I think.
Students in financial math are smarter than anyone, I think.
Students
math
smarter
one


We can determine the positions of those groups that match using the *regs* attribute, which gives a tuple of (start, end) tuples when a match occurs.

In [118]:
text = "Students in financial math are smarter than anyone, I think."
pattern = "(St.*ts) .* (math).* are (.*) than .*(one).*"
m = re.match( pattern, text, re.M|re.I)
print(m.regs)
print(type(m.regs))
print(type(m.regs[2]))
for i in range(len(m.regs)):
    print(text[m.regs[i][0]:m.regs[i][1]])

((0, 60), (0, 8), (22, 26), (31, 38), (47, 50))
<class 'tuple'>
<class 'tuple'>
Students in financial math are smarter than anyone, I think.
Students
math
smarter
one
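
The same positions can also be obtained one group at a time with the span, start and end methods of the match object, each of which takes a group number. A minimal sketch using the same text and pattern:

```python
import re

text = "Students in financial math are smarter than anyone, I think."
pattern = "(St.*ts) .* (math).* are (.*) than .*(one).*"
m = re.match(pattern, text)

# m.lastindex is the index of the last capturing group that matched (4 here).
for i in range(m.lastindex + 1):
    start, end = m.span(i)
    print(i, (start, end), text[start:end])
```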


Similarly, we can use search to find locations of matching portions of a pattern.

In [119]:
text = "I believe that students in financial math are smarter than anyone."
pattern = "(st.*ts) .* (math) .*(any)"
m = re.search(pattern, text)
print(m.group(0))
print(m.group(1))
print(m.group(2))
print(m.group(3))

students in financial math are smarter than any
students
math
any


# Re Flags

There are special flags that can be used to help accomplish certain tasks. Here are some examples.

The re.I flag means ignore case.

In [120]:
import re
text="Can I help you?"
pattern="i"
print(re.search(pattern,text))
print(re.search(pattern,text,re.I))

None
<re.Match object; span=(4, 5), match='I'>


The re.M flag means to interpret ^ as the beginning of any line, and $ as the end of any line.

In [121]:
import re
text="I need some help. \nCan you help me please?"
pattern="^Can.*help"
print(re.search(pattern,text))
print(re.search(pattern,text,re.M))

None
<re.Match object; span=(19, 31), match='Can you help'>


In [122]:
import re
text="I need some help.\nCan you help me please?"
pattern=r".*help\.$"
print(re.search(pattern,text))
print(re.search(pattern,text,re.M))

None
<re.Match object; span=(0, 17), match='I need some help.'>
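
Flags can be combined with the | operator. In the following sketch (a made-up variation of the text above, with a lowercase "can"), the match needs both re.M and re.I.

```python
import re

text = "I need some help.\ncan you help me please?"

# re.M lets ^ match right after the newline; re.I lets "Can" match "can".
both = re.search("^Can.*help", text, re.M | re.I)
print(both)

# With only one of the two flags there is no match.
only_m = re.search("^Can.*help", text, re.M)
only_i = re.search("^Can.*help", text, re.I)
print(only_m, only_i)
```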


We can split a line using a regular expression as a delimiter. The result is a list of substrings if there are matches.

In [123]:
import re
text="this is the craziest idea you have ever presented in all my years"
pattern="e"
re.split(pattern,text)

['this is th',
 ' crazi',
 'st id',
 'a you hav',
 ' ',
 'v',
 'r pr',
 's',
 'nt',
 'd in all my y',
 'ars']

If there are no matches, then the output is a list with the whole string.

In [124]:
import re
text="this is the craziest idea you have ever presented in all my years"
pattern="q"
re.split(pattern,text)

['this is the craziest idea you have ever presented in all my years']

If the start of the line matches, then the list starts with an empty string.

In [125]:
import re
text="5AM: Woke up from a dream. 6AM: fell back asleep. 12PM: woke up and felt refreshed."
pattern=r"\d+[AP]M: "
re.split(pattern,text)


['',
 'Woke up from a dream. ',
 'fell back asleep. ',
 'woke up and felt refreshed.']
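
Two further options of re.split are sketched below with a shortened version of the text: parentheses in the delimiter pattern keep the captured text in the output list, and maxsplit limits the number of splits.

```python
import re

text = "5AM: Woke up. 6AM: fell asleep. 12PM: woke up again."

# Parentheses around part of the delimiter keep the captured text in the result.
parts = re.split(r"(\d+[AP]M): ", text)
print(parts)

# maxsplit limits the number of splits that are made.
limited = re.split(r"\d+[AP]M: ", text, maxsplit=2)
print(limited)
```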

Final comments:
1) How do we find all the \newsection in text?
2) Greediness (.* vs .*?)
3) Flags
4) Compiling regex's
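
For the first question, here is one possible sketch. The sample text is made up. In the pattern a literal backslash has to be written as \\, and a raw string keeps both of those backslashes.

```python
import re

text = r"intro \newsection first part \newsection second part"

# In the regex, a literal backslash is written \\ ; the raw string keeps both.
hits = [m.start() for m in re.finditer(r"\\newsection", text)]
print(hits)

# re.escape builds the same pattern from the literal text.
escaped_same = (re.escape("\\newsection") == r"\\newsection")
print(escaped_same)
```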

In [14]:
text="here is some text with \\newsection appearing in it"
pattern="he.*t"
re.search(pattern,text)

<_sre.SRE_Match object; span=(0, 50), match='here is some text with \\newsection appearing in >

In [15]:
text="here is some text with \\newsection appearing in it"
pattern="he.*?t"
re.search(pattern,text)

<_sre.SRE_Match object; span=(0, 14), match='here is some t'>