In [1]:
import re

# An email recognition pattern

In [2]:
pattern = re.compile(r'[\w\.]+@[a-z]+\.[a-z]+')

<code>r''</code>: This designates a raw string, so backslashes reach the regex engine unchanged<br>
<code>[\w\.]+</code>: Match one or more letters, digits, underscores, or dots. <font color="red">local-part</font><br>
<code>@</code>: Match a literal "@"<br>
<code>[a-z]+</code>: Match one or more lowercase letters. <font color="red">domain</font><br>
<code>\.</code>: Match a literal "."<br>
<code>[a-z]+</code>: Match one or more lowercase letters. <font color="red">com</font><br>
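
The breakdown above also implies some limits of this pattern. The sketch below is only an illustration, with made-up addresses that are not part of the notebook's data: a subdomain is cut off after the second label, and an uppercase domain is not matched unless the `re.IGNORECASE` flag is added.

```python
import re

pattern = re.compile(r'[\w\.]+@[a-z]+\.[a-z]+')

# Subdomain: the pattern allows only one dot after "@", so the match stops early
sub = pattern.search('a.b@mail.example.com').group()

# Uppercase domain: [a-z] does not match "E", so there is no match at all
upper = pattern.search('User@Example.COM')

# Same pattern compiled with re.IGNORECASE (an optional relaxation, not the notebook's pattern)
relaxed = re.compile(r'[\w\.]+@[a-z]+\.[a-z]+', re.IGNORECASE)
upper_relaxed = relaxed.search('User@Example.COM').group()

print(sub)            # a.b@mail.example
print(upper)          # None
print(upper_relaxed)  # User@Example.COM
```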

In [3]:
# Note: 'Bill''s' is two adjacent string literals, which Python joins into "Bills" (no apostrophe)
text = 'Bill''s email is bill.seymore@famousbox.com '\
'is the e-mail of Bill. He often emails his coworker at '\
'distraction@somecompany.com. But the messages get '\
'forwarded to another email junkkmail87@domain.com.'

# Return matches using <font color="blue">re.finditer()</font>

 - <code>re.finditer()</code> scans the string from left to right and returns an iterator that yields match objects in the order found. Empty matches are included in the result.

In [4]:
result = re.finditer(pattern, text)
print(result)
print("\n")

for match in result:
    print(match)

<callable_iterator object at 0x10aa4b190>


<re.Match object; span=(15, 41), match='bill.seymore@famousbox.com'>
<re.Match object; span=(97, 124), match='distraction@somecompany.com'>
<re.Match object; span=(174, 196), match='junkkmail87@domain.com'>
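
Two properties of `re.finditer()` are easy to miss: the returned iterator can be consumed only once, and a pattern that can match the empty string yields empty matches. The following is a small sketch on a made-up sample string, not the notebook's `text`.

```python
import re

pattern = re.compile(r'[\w\.]+@[a-z]+\.[a-z]+')
sample = 'a@x.com and b@y.org'  # hypothetical sample text

matches = pattern.finditer(sample)
first_pass = [m.group() for m in matches]   # consumes the iterator
second_pass = [m.group() for m in matches]  # nothing left to yield

# x* can match zero characters, so every position in 'ab' yields an empty match
empty_spans = [m.span() for m in re.finditer(r'x*', 'ab')]

print(first_pass)   # ['a@x.com', 'b@y.org']
print(second_pass)  # []
print(empty_spans)  # [(0, 0), (1, 1), (2, 2)]
```

This is why `re.finditer()` has to be called again before looping over the matches a second time.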


# Use re.finditer() with <font color="blue">.group(), .start(), and .end()</font> to get more information on each match object
 - <code>.group()</code>: returns the matched string
 - <code>.start()</code>: returns the index where the match starts
 - <code>.end()</code>: returns the index one past the last character of the match, so <code>text[match.start():match.end()]</code> is the matched string

In [5]:
result = re.finditer(pattern, text)

for match in result:
    print(f'The email match: {match.group()}')
    print(f'start index: {match.start()}')
    print(f'end index: {match.end()}')
    print("\n")

The email match: bill.seymore@famousbox.com
start index: 15
end index: 41


The email match: distraction@somecompany.com
start index: 97
end index: 124


The email match: junkkmail87@domain.com
start index: 174
end index: 196
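
As a quick check of how these methods relate, the sketch below slices the string with `.start()` and `.end()` and compares the result with `.group()`. The `grouped` pattern and the sample sentence are assumptions added for illustration; the capturing parentheses are not in the notebook's pattern.

```python
import re

# Hypothetical variant of the email pattern with two capturing groups
grouped = re.compile(r'([\w\.]+)@([a-z]+\.[a-z]+)')
sample = 'Contact: bill.seymore@famousbox.com today'

m = grouped.search(sample)
whole = m.group()                    # same as m.group(0): the full match
local_part = m.group(1)              # text captured by the first group
domain = m.group(2)                  # text captured by the second group
sliced = sample[m.start():m.end()]   # end() is exclusive, so this equals whole

print(whole)       # bill.seymore@famousbox.com
print(local_part)  # bill.seymore
print(domain)      # famousbox.com
print(m.span())    # (9, 35)
```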




# Get all matched instances with <font color="blue">re.findall()</font>

 - Returns a list of all non-overlapping matches of the pattern, not just the first one

In [6]:
emails = re.findall(pattern, text)
print(emails)

['bill.seymore@famousbox.com', 'distraction@somecompany.com', 'junkkmail87@domain.com']
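
The return type of `re.findall()` changes once the pattern contains capturing groups: with one group it returns the group's text, and with several it returns tuples. A minimal sketch, assuming a grouped variant of the pattern that is not part of the notebook:

```python
import re

sample = 'jyoliver@live.com, mbrown@att.net'

# Two groups: each result is a (local-part, domain) tuple
grouped = re.compile(r'([\w\.]+)@([a-z]+\.[a-z]+)')
pairs = grouped.findall(sample)

# One group: each result is just the captured domain
domains = re.findall(r'@([a-z]+\.[a-z]+)', sample)

print(pairs)    # [('jyoliver', 'live.com'), ('mbrown', 'att.net')]
print(domains)  # ['live.com', 'att.net']
```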


# Split text using <font color="blue">re.split()</font>

 - Splits the string at each occurrence of the pattern and returns the pieces in a list.
 - If capturing parentheses are used in the pattern, the text of all groups is also returned as part of the resulting list.

In [7]:
# Example one: remove emails from text
split_text = re.split(pattern, text)
print(split_text)

['Bills email is ', ' is the e-mail of Bill. He often emails his coworker at ', '. But the messages get forwarded to another email ', '.']
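
The effect of capturing parentheses on `re.split()` is easiest to see side by side. The sketch below uses a made-up comma-separated string; it also shows the `maxsplit` argument, which limits the number of splits.

```python
import re

sample = 'int, floats,strings'  # hypothetical sample text

# No capturing group: the delimiters are dropped
plain = re.split(r',\s*', sample)

# Capturing group around the comma: the captured text is kept in the result
with_delims = re.split(r'(,)\s*', sample)

# maxsplit=1: split only at the first occurrence
first_only = re.split(r',\s*', sample, maxsplit=1)

print(plain)        # ['int', 'floats', 'strings']
print(with_delims)  # ['int', ',', 'floats', ',', 'strings']
print(first_only)   # ['int', 'floats,strings']
```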


In [9]:
# Compile the regular expression
new_pattern = re.compile(r'[,:\.\s]+')

<code>r''</code>: This designates a raw string<br>
<code>[,:\.\s]+</code>: Each part of the character class is described below.
 - <code>[]+</code>: Match one or more of the characters inside the brackets<br>
 - <code>[,:]+</code>: Match a comma and/or a colon<br>
 - <code>[\.]+</code>: Match a period<br>
 - <code>[\s]+</code>: Match any whitespace character, such as a space, tab, or newline<br>

In [10]:
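# Example two: remove punctuation and whitespace and return a list of only words
text = 'The python datatypes include: int, floats, strings, category.'

# Split the text so that only words or numbers are left.
# The text ends with '.', so the last element of the result is an empty string.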
words = re.split(new_pattern, text)
print(words)

['The', 'python', 'datatypes', 'include', 'int', 'floats', 'strings', 'category', '']
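
The trailing `''` appears because the text ends with a delimiter. Two possible ways to avoid it are sketched below: filtering out empty strings after the split, or matching the words directly with `re.findall()`. The `\w+` pattern is an assumption for this example, not something the notebook defines.

```python
import re

new_pattern = re.compile(r'[,:\.\s]+')
text = 'The python datatypes include: int, floats, strings, category.'

# Option 1: split, then drop empty strings
words = [w for w in new_pattern.split(text) if w]

# Option 2: match runs of word characters instead of splitting on delimiters
words_alt = re.findall(r'\w+', text)

print(words)
print(words == words_alt)  # True
```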


# <font color="blue">match()</font> versus <font color="blue">search()</font>

 - The match() function checks for a match only at the beginning of the string
 - The search() function checks for a match anywhere in the string and returns the first one it finds

In [23]:
pattern = re.compile(r'[\w\.]+@[a-z]+\.[a-z]+')
new_text = 'jyoliver@live.com, mbrown@att.net, thaljef@verizon.net, boein@yahoo.com, noahb@me.com'

In [24]:
match_email = re.match(pattern, new_text)
print(match_email)

<re.Match object; span=(0, 17), match='jyoliver@live.com'>


match() only matches at the beginning of the string. If the pattern does not match there, it returns None.
 - In this example, we added the label 'emails:' before the first email, so the string no longer starts with a match

In [29]:
updated_text = 'emails: jyoliver@live.com, mbrown@att.net, thaljef@verizon.net, boein@yahoo.com, noahb@me.com'

match_email = re.match(pattern, updated_text)
print(match_email)

None


search() scans past leading text that does not match the pattern and returns only the first substring that matches.

In [30]:
updated_text = 'emails: jyoliver@live.com, mbrown@att.net, thaljef@verizon.net, boein@yahoo.com, noahb@me.com'
search_email = re.search(pattern, updated_text)
print(search_email)

<re.Match object; span=(8, 25), match='jyoliver@live.com'>


Again, use findall() if you want all substrings that match the pattern.

In [31]:
updated_text = 'emails: jyoliver@live.com, mbrown@att.net, thaljef@verizon.net, boein@yahoo.com, noahb@me.com'
return_all_email = re.findall(pattern, updated_text)
print(return_all_email)

['jyoliver@live.com', 'mbrown@att.net', 'thaljef@verizon.net', 'boein@yahoo.com', 'noahb@me.com']
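
To put the functions side by side, the sketch below runs `match()`, `search()`, and also `fullmatch()` on the same kind of labelled string. `fullmatch()` is not covered in this notebook; it is included here as an assumption about what a reader might want when validating a single address, since it succeeds only if the whole string matches.

```python
import re

pattern = re.compile(r'[\w\.]+@[a-z]+\.[a-z]+')
labelled = 'emails: jyoliver@live.com, mbrown@att.net'

at_start = pattern.match(labelled)                  # string starts with 'emails:', so None
anywhere = pattern.search(labelled)                 # first email found anywhere
whole_ok = pattern.fullmatch('jyoliver@live.com')   # entire string is one email
whole_bad = pattern.fullmatch(labelled)             # extra text, so None

print(at_start)           # None
print(anywhere.group())   # jyoliver@live.com
print(whole_ok.group())   # jyoliver@live.com
print(whole_bad)          # None
```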
