# Regular Expressions in Python


Regular expressions (sometimes called regex for short) allow a user to search for strings using almost any sort of rule they can come up with, for example to find all capital letters in a string or to find a phone number in a document.

Regular expressions are notorious for their seemingly strange syntax. This strange syntax is a byproduct of their flexibility. Regular expressions have to be able to filter out any string pattern you can imagine, which is why they have a complex string pattern format.



In [None]:
"""
Regular expressions allow for pattern searching in a text document.

The syntax for a regular expression can look like this:

r'\d{3}-\d{3}-\d{4}'

The key thing to keep in mind is that every character type has a
corresponding pattern code.

For example, digits have the placeholder pattern code \d.

The backslash tells the regex engine that this is a special
code and not the letter "d".

"""

In [1]:
text = "The phone number of machine learning engineer is 555-123-1879. \
call soon"



In [2]:
"555-123-1879" in text

True

In [4]:
# The in operator only finds an exact string. To ask whether the text
# contains any phone number in a particular format, we need regular expressions.

import re


In [9]:
pattern = "phone"

In [11]:
re.search(pattern,text)


<re.Match object; span=(4, 9), match='phone'>

In [12]:
my_match=re.search(pattern,text)
my_match.span()

(4, 9)

In [13]:
my_match.start()

4

In [14]:
my_match.end()

9
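
The start and end positions can be used to slice the original string, and `re.search` returns `None` when nothing matches. A minimal sketch, assuming the same sentence as above; the names `sample`, `found`, and `missing` are only for illustration.

```python
import re

sample = "The phone number of machine learning engineer is 555-123-1879. call soon"

found = re.search("phone", sample)
if found is not None:
    start, end = found.span()
    # Slicing with the span gives back the matched text
    matched_text = sample[start:end]
    print(matched_text)

# "fax" does not occur in the sentence, so there is no match object
missing = re.search("fax", sample)
print(missing)
```

Checking for `None` before calling `.span()` or `.group()` avoids an `AttributeError` when the pattern is absent.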

In [15]:
txt ="my phone is a new phone"


In [17]:
match= re.search(pattern,txt)
match.span()

(3, 8)

In [21]:
re.findall(pattern,txt)
# to find all matches


['phone', 'phone']

In [22]:
len(re.findall(pattern,txt))

2

In [25]:
# To get match objects (with their positions) instead of
# just a list of the matched strings, use re.finditer.

for match in re.finditer("phone",txt):
    print(match.span())



(3, 8)
(18, 23)
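
If both the matched text and its position are needed, the match objects from `re.finditer` can be collected in one pass. This is a small sketch; `sample` and `hits` are illustrative names.

```python
import re

sample = "my phone is a new phone"

# Each match object knows its own text and span
hits = [(m.group(), m.span()) for m in re.finditer("phone", sample)]
print(hits)
```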


# Patterns

So far we've seen how to search for a basic string. What about more complex examples? Such as trying to find a telephone number in a large string of text? Or an email address?

We could just use the search method if we know the exact phone number or email address, but what if we don't know it? We may know the general format, and we can use that along with regular expressions to search the document for strings that match a particular pattern.

This is where the syntax may appear strange at first, but take your time with this; often it's just a matter of looking up the pattern code.



## Identifiers for Characters in Patterns

Character types such as digits, word characters, and whitespace each have a code that represents them. You can combine these codes to build up a pattern string. Notice how they make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the r in front of the string makes it a raw string, so Python passes each \ through to the regex engine instead of treating it as the start of a string escape sequence.
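
One way to see what the r prefix changes is to compare a raw string with its hand-escaped equivalent. A minimal sketch; the variable names are only for illustration.

```python
# Without the r prefix, every backslash has to be doubled by hand
escaped = "\\d{3}"
# With the r prefix, the backslash is kept exactly as typed
raw = r"\d{3}"
print(escaped == raw)

# "\n" is one character (a newline); r"\n" is two (a backslash and an n)
newline_len = len("\n")
raw_len = len(r"\n")
print(newline_len, raw_len)
```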

The most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (letters, digits, and underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
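
The rows of the table can be checked directly. The sketch below uses `re.fullmatch`, which is not used elsewhere in this notebook; it succeeds only when the whole example string matches the pattern. The names `identifier_examples` and `identifier_results` are illustrative.

```python
import re

# (pattern code, example match) pairs taken from the table
identifier_examples = [
    (r"file_\d\d", "file_25"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

identifier_results = {}
for pattern, example in identifier_examples:
    # fullmatch returns a match object, or None if the string does not fit
    identifier_results[pattern] = re.fullmatch(pattern, example) is not None
    print(pattern, example, identifier_results[pattern])
```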

In [32]:
text ="The phone number of machine learning engineer is 555-123-1879. call soon"
text

'The phone number of machine learning engineer is 555-123-1879. call soon'

In [28]:
pattern = r'\d\d\d-\d\d\d-\d\d\d\d'



In [30]:
phone_number = re.search(pattern,text)
phone_number

<re.Match object; span=(49, 61), match='555-123-1879'>

In [31]:
phone_number.group()

# group() returns the matched text, i.e. the characters in span (49, 61)


'555-123-1879'

To avoid typing "\d" multiple times, let's explore the possible quantifiers.

## Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many we expect.



<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
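
As with the identifiers, each quantifier row can be verified with a short script. This is a sketch using `re.fullmatch` (an assumption, since the notebook itself only uses `re.search` and `re.findall`); the zero-or-more pattern is written here as plain `A*B*C*`.

```python
import re

# (pattern code, example match) pairs based on the table
quantifier_examples = [
    (r"Version \w-\w+", "Version A-b1_1"),
    (r"\D{3}", "abc"),
    (r"\d{2,4}", "123"),
    (r"\w{3,}", "anycharacters"),
    (r"A*B*C*", "AAACC"),
    (r"plurals?", "plural"),
]

quantifier_results = [
    re.fullmatch(pattern, example) is not None
    for pattern, example in quantifier_examples
]
print(quantifier_results)

# A counterexample: five digits is outside the {2,4} range
too_long = re.fullmatch(r"\d{2,4}", "12345")
print(too_long)
```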

In [38]:
pattern = r'\d{3}-\d{3}-\d{4}'

phone_number=re.search(pattern,text)
phone_number

<re.Match object; span=(49, 61), match='555-123-1879'>

In [37]:
phone_number.group()

'555-123-1879'

In [39]:
# A regular expression can capture separate groups.
# pattern = r'\d{3}-\d{3}-\d{4}'

# Currently the entire pattern is one solid group.
# We can define separate groups by wrapping parts of
# the pattern in parentheses ().

pattern = r'(\d{3})-(\d{3})-(\d{4})'

phone_number=re.search(pattern,text)
phone_number


<re.Match object; span=(49, 61), match='555-123-1879'>

In [40]:
phone_number.group()

'555-123-1879'

In [41]:
phone_number.group(1)

'555'

In [42]:
phone_number.group(2)

'123'

In [43]:
phone_number.group(3)

'1879'

In [45]:
# phone_number.group(4)
# raises IndexError: no such group,
# because the pattern defines only 3 groups
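
All captured groups can also be read at once with `groups()`, and the out-of-range case can be handled explicitly. A minimal sketch, assuming the same phone number pattern; `sample`, `parts`, and `error_message` are illustrative names.

```python
import re

sample = "The phone number of machine learning engineer is 555-123-1879. call soon"
phone = re.search(r"(\d{3})-(\d{3})-(\d{4})", sample)

# groups() returns every captured group as a tuple
parts = phone.groups()
print(parts)

# Asking for a group that the pattern does not define raises IndexError
error_message = None
try:
    phone.group(4)
except IndexError as err:
    error_message = str(err)
print(error_message)
```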

In [52]:
# The pipe | operator means "or" (alternation).
# re.search returns the match that comes first in the text,
# regardless of the order of the alternatives.

re.search(r"male|female","This male and female are here")


<re.Match object; span=(5, 9), match='male'>

In [50]:

re.search(r"female|male","This male and female are here")


<re.Match object; span=(5, 9), match='male'>

In [51]:
re.search(r"female|male","This female and male are here")


<re.Match object; span=(5, 11), match='female'>
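
The order of the alternatives does matter when two of them can match at the same position: the alternatives are tried from left to right and the first one that succeeds is used. A small sketch with made-up words to illustrate this:

```python
import re

# Both alternatives can match starting at the "c" of "catfish"
short_first = re.search(r"cat|catfish", "one catfish").group()
long_first = re.search(r"catfish|cat", "one catfish").group()

print(short_first)
print(long_first)
```

Putting the longer alternative first is a common way to avoid a shorter one matching only part of the word.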

In [54]:
# The . wildcard matches any single character (except a newline)

re.findall(r".at","The cat in the hat sat pet")


['cat', 'hat', 'sat']

In [55]:
re.findall(r"...at","The cat in the hat sat plat slat")


['e cat', 'e hat', ' plat', ' slat']

In [56]:
# Start With and Ends With
# we can use the ^ to signal starts with and
# the $ to signal ends with



In [58]:
re.findall(r"\d$","This text ends with a number 2")



['2']

In [60]:
re.findall(r"^\d","100 divide by 2 is 50")


['1']
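
The anchors combine naturally with quantifiers. The sketch below reuses the sentence above to grab the full leading and trailing numbers; the variable names are only for illustration.

```python
import re

sentence = "100 divide by 2 is 50"

# One or more digits at the very start of the string
leading = re.findall(r"^\d+", sentence)
# One or more digits at the very end of the string
trailing = re.findall(r"\d+$", sentence)
# Both anchors together require the whole string to be digits
whole = re.findall(r"^\d+$", sentence)

print(leading, trailing, whole)
```

Note that `^` and `$` here refer to the start and end of the entire string, not of each word.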

In [62]:
txt = "There are 5 number inside this 3 sentence"
txt


'There are 5 number inside this 3 sentence'

In [65]:
re.findall(r"[^\d]",txt)

# inside [], a leading ^ negates the set: match any character that is not a digit

['T',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e']

In [69]:
# add + to group consecutive non-digit characters back into longer strings
re.findall(r"[^\d]+",txt)


['There are ', ' number inside this ', ' sentence']
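
The negated set `[^\d]` matches the same characters as the `\D` identifier from the table earlier, so the two patterns should give the same result. A quick check, with `sample` as an illustrative name:

```python
import re

sample = "There are 5 number inside this 3 sentence"

with_set = re.findall(r"[^\d]+", sample)
with_code = re.findall(r"\D+", sample)

print(with_set == with_code)
print(with_code)
```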

In [73]:
# To remove punctuation from a sentence
# which is a common thing we have to do when working
# with text data

test_txt = "This is a sample string!!! but it has \
punctuation. please remove it. possible? "



In [76]:
# match runs of characters that are not !, ? or .
my_list = re.findall(r"[^!?.]+", test_txt)
my_list

['This is a sample string',
 ' but it has punctuation',
 ' please remove it',
 ' possible',
 ' ']

In [78]:
" ".join(my_list)

'This is a sample string  but it has punctuation  please remove it  possible  '
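
The joined result above contains doubled spaces. An alternative, not used in this notebook, is `re.sub`, which replaces every match of a pattern with a given string. The sketch below removes the punctuation directly and then normalizes the whitespace with `str.split`; the variable names are illustrative.

```python
import re

sample = "This is a sample string!!! but it has punctuation. please remove it. possible? "

# Replace every !, ? or . with an empty string
no_punct = re.sub(r"[!?.]", "", sample)

# split() with no arguments drops leading, trailing, and repeated whitespace
tidy = " ".join(no_punct.split())
print(tidy)
```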

In [None]:
# Combine character sets [] with the + quantifier to match hyphenated words



In [91]:
text = "Only find the machine-learning topics. where \
can you-find computer vision learning sources?"

In [93]:
re.findall(r"[\w]+-[\w]+",text)
# grab hyphenated words: word characters, a hyphen, then more word characters


['machine-learning', 'you-find']
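
The quantifier after the second character set decides how much is captured after the hyphen. A small comparison, assuming the same sentence; `sample`, `one_char`, and `full_word` are illustrative names.

```python
import re

sample = ("Only find the machine-learning topics. where "
          "can you-find computer vision learning sources?")

# Without a trailing +, only one character after the hyphen is matched
one_char = re.findall(r"[\w]+-[\w]", sample)
# With a trailing +, the whole second word is matched
full_word = re.findall(r"[\w]+-[\w]+", sample)

print(one_char)
print(full_word)
```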