# Regular Expressions

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. 

The Python module re provides full support for Perl-like regular expressions in Python. The re module raises the exception re.error if an error occurs while compiling or using a regular expression.
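
As a minimal sketch of the error behaviour described above, the following checks whether a pattern compiles. The helper name `is_valid_pattern` is hypothetical and only used for illustration.

```python
import re

def is_valid_pattern(pattern):
    """Return True if `pattern` compiles, False if re raises re.error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

valid = is_valid_pattern(r"\d+")     # well-formed pattern
invalid = is_valid_pattern(r"(\d+")  # unbalanced parenthesis
print(valid, invalid)
```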

# Searching for Basic Patterns

The syntax for this function is:

re.search(pattern, string, flags=0)

In [12]:
text = "The person's phone number is 408-555-1234. Call soon!"

In [13]:
'phone' in text

True

Now let's look at the format for regular expressions.

In [14]:
import re

In [17]:
pattern = 'number'
print(text)
re.search(pattern, text)

The person's phone number is 408-555-1234. Call soon!


<re.Match object; span=(19, 25), match='number'>

In [None]:
pattern = 'Not in Text'
re.search(pattern, text)
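
The cell above displays nothing because re.search returns None when the pattern is not found. A short sketch of how one might guard against that before calling match methods (the variable name `status` is just for illustration):

```python
import re

text = "The person's phone number is 408-555-1234. Call soon!"

match = re.search('Not in Text', text)
print(match)  # None

# Check for None before calling methods such as .span()
if match is None:
    status = "no match"
else:
    status = f"matched at {match.span()}"
print(status)
```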

In [24]:
pattern = 'phone'
match = re.search(pattern, text)

In [25]:
match

<re.Match object; span=(13, 18), match='phone'>

In [27]:
match.start()

13

In [28]:
match.end()

18

In [30]:
match.span()

(13, 18)

In [31]:
text = "my phone is a new phone"
match = re.search(pattern, text)

In [32]:
match.span()

(3, 8)

Notice it only matches the first instance. If we want a list of all matches, we can use the .findall() method:

In [33]:
match = re.findall(pattern, text)

In [34]:
match

['phone', 'phone']

In [36]:
len(match)

2

To get actual match objects for every occurrence, iterate over re.finditer(). Each match object provides .span() for the position and .group() for the actual text that matched.

In [42]:
for sp in re.finditer('phone', text):
    
    print(sp.span())
    print(sp.group())

(3, 8)
phone
(18, 23)
phone
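
As a compact variation on the loop above, the match objects from re.finditer can be gathered in a list comprehension. The name `found` is only illustrative.

```python
import re

text = "my phone is a new phone"

# Each item yielded by re.finditer is a match object
found = [(m.group(), m.start(), m.end()) for m in re.finditer('phone', text)]
print(found)  # [('phone', 3, 8), ('phone', 18, 23)]
```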


# Patterns

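What if we want to find a telephone number in a large string of text? Or an email address? For that we need patterns that describe a general format rather than a literal string.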

# Identifiers for Characters in Patterns

When defining a pattern string for a regular expression, we use the format:

r'mypattern'

Placing the r in front of the string tells Python to treat it as a raw string, so the backslashes (\) in the pattern are passed to the regex engine rather than being interpreted as string escape characters.
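
To see why the r prefix matters, here is a small sketch using \b, which is not used elsewhere in this notebook: in a normal string it is the backspace character, while in a regular expression it means a word boundary.

```python
import re

# Without the r prefix, "\b" is a single backspace character
plain = "\bcat\b"
# With the r prefix, the backslashes reach the regex engine, where \b is a word boundary
raw = r"\bcat\b"

print(len(plain), len(raw))  # 5 7

sentence = "a cat sat"
plain_hit = re.search(plain, sentence)
raw_hit = re.search(raw, sentence)
print(plain_hit)        # None
print(raw_hit.group())  # cat
```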

In [3]:
import re

In [4]:
text = "My telephone number is 408-555-1234"

In [5]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d', text)

In [6]:
phone

<re.Match object; span=(23, 35), match='408-555-1234'>

In [7]:
phone.group()

'408-555-1234'

# Pattern With Quantifiers

In [8]:
phone = re.search(r'\d{3}-\d{3}-\d{4}', text)

In [9]:
phone

<re.Match object; span=(23, 35), match='408-555-1234'>

In [10]:
phone.group()

'408-555-1234'
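
Besides the exact count {3}, a few other common quantifiers can be sketched on the same text. The variable names are illustrative only.

```python
import re

text = "My telephone number is 408-555-1234"

# {3} means exactly three repetitions, as in the cell above
exact = re.search(r'\d{3}-\d{3}-\d{4}', text).group()

# + means one or more, so each run of digits is returned whole
runs = re.findall(r'\d+', text)

# {2,3} means two to three repetitions (greedy: takes three when available)
short_runs = re.findall(r'\d{2,3}', text)

print(exact)       # 408-555-1234
print(runs)        # ['408', '555', '1234']
print(short_runs)  # ['408', '555', '123']
```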

# Groups

What if we wanted to do two tasks: find phone numbers, and also quickly extract their area code (the first three digits)? Groups, written with parentheses, let us capture parts of a pattern so that we can later retrieve them individually.

In [35]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [36]:
result = re.search(phone_pattern, text)

In [37]:
result

<re.Match object; span=(23, 35), match='408-555-1234'>

In [38]:
result.group()

'408-555-1234'

In [39]:
result.group(1)

'408'

In [40]:
result.group(2)

'555'

In [41]:
result.group(3)

'1234'
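
As a possible extension, all groups can be fetched at once with .groups(), and groups can be given names with the (?P<name>...) syntax. The group names `area`, `prefix`, and `line` below are made up for this sketch.

```python
import re

text = "My telephone number is 408-555-1234"
phone_pattern = re.compile(r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})')

result = phone_pattern.search(text)
parts = result.groups()       # every captured group as a tuple
area = result.group('area')   # look a group up by name

print(parts)  # ('408', '555', '1234')
print(area)   # 408
```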

# Additional Regex Syntax

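Or operator |

Use | to match either of two alternatives. Whitespace inside a pattern is significant: the second alternative below is ' woman', with a leading space.

In [43]:
re.search(r"man| woman", "this man was here")

<re.Match object; span=(5, 8), match='man'>

In [44]:
re.search(r"man| woman", "this woman was here")

<re.Match object; span=(4, 10), match=' woman'>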
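
A small sketch of the same idea applied to several strings, assuming the pattern is written without a leading space in the second alternative. The third sentence is an invented example of a string with no match.

```python
import re

sentences = ["this man was here", "this woman was here", "nobody was here"]
pattern = r"man|woman"

results = []
for s in sentences:
    m = re.search(pattern, s)
    # re.search returns None when neither alternative is found
    results.append(m.group() if m else None)

print(results)  # ['man', 'woman', None]
```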

# The Wildcard Character

In [45]:
re.findall(r".at","The cat in the hat sat here.")

['cat', 'hat', 'sat']

In [47]:
re.findall(r"....at","The cat in the hat sat here.")

['he cat', 'he hat']

In [48]:
# One or more non-whitespace characters followed by 'at'
re.findall(r"\S+at", "The cat in the hat sat here")

['cat', 'hat', 'sat']
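
Two details of the wildcard are worth a quick sketch: by default the dot does not match a newline, and a literal dot must be escaped as \. in the pattern. The sample strings here are invented for illustration.

```python
import re

sample = "cat c t c\nt"

# By default . matches any character except a newline
default = re.findall(r"c.t", sample)
# With the re.DOTALL flag, . also matches a newline
dotall = re.findall(r"c.t", sample, flags=re.DOTALL)

# An escaped dot matches only a literal period
any_char = re.findall(r".at", "cat .at")
literal = re.findall(r"\.at", "cat .at")

print(default)   # ['cat', 'c t']
print(dotall)    # ['cat', 'c t', 'c\nt']
print(any_char)  # ['cat', '.at']
print(literal)   # ['.at']
```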

# Starts with and Ends With

We can use ^ to signal "starts with" and $ to signal "ends with". Both apply to the string as a whole, not to individual words:

In [49]:
re.findall(r"\d$","This ends with a number 2")

['2']

In [50]:
re.findall(r"^\d","1This ends with a number ")

['1']
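
If the string spans several lines, the re.MULTILINE flag makes ^ and $ apply to each line instead of the whole string. A brief sketch with an invented three-line string:

```python
import re

lines = "1 first\n2 second\n3 third"

# Without flags, ^ only matches at the very start of the string
whole = re.findall(r"^\d", lines)
# With re.MULTILINE, ^ also matches right after each newline
per_line = re.findall(r"^\d", lines, flags=re.MULTILINE)

print(whole)     # ['1']
print(per_line)  # ['1', '2', '3']
```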

# Exclusion

To exclude characters, we can use the ^ symbol inside a set of brackets []. When ^ is the first character inside the brackets, the set matches any character that is not listed.

In [51]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [54]:
exclu = re.findall(r'[^\d]', phrase)

In [56]:
exclu

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

To get the words back together, add a + sign after the brackets so that each match covers a run of one or more allowed characters:

In [57]:
re.findall(r'[^\d]+', phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

In [58]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [65]:
re.findall(r'[^!?]+', test_phrase)

['This is a string', ' But it has punctuation. How can we remove it']
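
The pattern above leaves the period in place because only ! and ? are excluded. One way to remove all three punctuation marks, sketched here as an assumption about the intended goal, is to exclude them together with the space and then rejoin the words:

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Exclude !, ., ? and the space so that each match is a single word
words = re.findall(r'[^!.? ]+', test_phrase)
clean = ' '.join(words)

print(clean)  # This is a string But it has punctuation How can we remove it
```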

# Brackets for Grouping

In [61]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

In [62]:
re.findall(r'[\w]+-[\w]+', text)

['hyphen-words', 'long-ish']
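
A pattern with a single hyphen stops after the second part of a word. As a hedged sketch, a repeated non-capturing group (?:...) can pick up words with several hyphens; the sample sentence is invented for illustration.

```python
import re

sample = "A well-known, state-of-the-art, long-ish example"

# One hyphen only: longer compounds are split into pieces
single = re.findall(r'[\w]+-[\w]+', sample)
# One or more "-word" parts after the first word
multi = re.findall(r'\w+(?:-\w+)+', sample)

print(single)  # ['well-known', 'state-of', 'the-art', 'long-ish']
print(multi)   # ['well-known', 'state-of-the-art', 'long-ish']
```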

# Parentheses for Multiple Options

In [66]:
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [68]:
re.search(r'cat(fish|nap|erpillar)', text)

<re.Match object; span=(27, 34), match='catfish'>
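
The same pattern can be run against all three strings defined above. This sketch collects the full match together with the option captured by the parentheses; the list name `matches` is illustrative.

```python
import re

texts = [
    'Hello, would you like some catfish?',
    "Hello, would you like to take a catnap?",
    "Hello, have you seen this caterpillar?",
]
pattern = re.compile(r'cat(fish|nap|erpillar)')

matches = []
for t in texts:
    m = pattern.search(t)
    if m:
        # group() is the whole match, group(1) is the option that matched
        matches.append((m.group(), m.group(1)))

print(matches)
# [('catfish', 'fish'), ('catnap', 'nap'), ('caterpillar', 'erpillar')]
```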