<h3>Regular Expressions</h3>
The need to search for and extract patterns of substrings from a larger string is so common that most programming languages provide an efficient way to do it: __Regular Expressions__ (`regex`).

In the following series of notebooks you will be introduced to Regular Expressions (`regex`) in Python.

A regular expression is a combination of characters that form a search pattern.  Many complex patterns can be created using different combinations of special characters.  These patterns can then be used together with different methods to accomplish many powerful search and extraction tasks.

In Python, we use the functions in the `re` module to achieve these capabilities.

The `search()` method in the `re` module can be used to find out whether the input string contains a specific substring.  We use the term `regex` to refer to the substring or pattern being searched for in the input string.

The syntax is:  `re.search(regex, inputString)`

In [1]:
import re
line = 'Received: from murder (mail.umich.edu [141.211.14.39]) by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA'
# Matching is case-sensitive: 'From' does not occur in line, so nothing is printed
if re.search('From', line):
    print(line)
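The cell above produces no output because `line` contains `from` but not `From`. As a small sketch of that behaviour (the `results` dictionary is only an illustrative helper, not part of the `re` module), the same test can be run with both capitalizations:

```python
import re

line = 'Received: from murder (mail.umich.edu [141.211.14.39]) by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA'

# Try the same word with two different capitalizations.
results = {}
for pattern in ['From', 'from']:
    if re.search(pattern, line):
        results[pattern] = 'found'
    else:
        results[pattern] = 'not found'

print(results)
```

Only the lowercase pattern should be reported as found, since the search is case-sensitive by default.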

The `search()` method returns a match object if a match is found.

The `match` object contains the following:
1. span -> the location (from start to end) where the match is found
2. match -> the exact string that was matched

If a match is not found, the `search()` method returns `None`.

The `search()` method returns only the first match, even if there are multiple occurrences of the regex.

In [4]:
import re
line = 'Received: from murder (mail.umich.edu [141.211.14.39]) by from frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA'
print(re.search('from', line))

<re.Match object; span=(10, 14), match='from'>


To retrieve just the matched string, we use the `group()` method of the match object.

To retrieve the start location of the matched string, we use the `start()` method of the match object.

To retrieve the end location of the matched string, we use the `end()` method of the match object.

In [3]:
import re
line = 'Received: from murder (mail.umich.edu [141.211.14.39]) by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA'
m = re.search('from', line)  # search once and reuse the match object
print('group: ', m.group())
print('start: ', m.start())
print('end: ', m.end())

group:  from
start:  10
end:  14
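Because `search()` returns `None` when nothing matches, calling `group()` directly on the result raises an `AttributeError` in that case. One possible way to guard against this is sketched below; `first_match` is a hypothetical helper written for this illustration, not a function provided by `re`.

```python
import re

line = 'Received: from murder (mail.umich.edu [141.211.14.39]) by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA'

def first_match(regex, text):
    """Return the first matched substring, or None if there is no match."""
    m = re.search(regex, text)
    # A match object is truthy; None is falsy.
    return m.group() if m else None

print(first_match('from', line))     # pattern occurs in line
print(first_match('subject', line))  # pattern does not occur in line
```

Checking the result before calling `group()`, `start()`, or `end()` keeps the code safe for inputs that do not contain the pattern.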


The `search()` method returns a match object if the regex is found anywhere within the input string.
However, if we only wish to match lines where the regex is at the start of the input string, 
we can use the `re.match()` method.  The `match()` method also returns a match object.

In [5]:
import re # import the re module
line = 'from: Received: from murder (mail.umich.edu [141.211.14.39]) by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA'
if re.match('from:', line):
    print(line)
print(re.match('from:', line))

from: Received: from murder (mail.umich.edu [141.211.14.39]) by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA
<re.Match object; span=(0, 5), match='from:'>
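To make the difference between `match()` and `search()` concrete, here is a minimal sketch that filters a small list of lines with each of them. The sample `lines` are made up for this illustration and are not taken from a real mailbox.

```python
import re

# Hypothetical sample lines, invented for this example.
lines = [
    'from: stephen@example.org',
    'Received: from murder (mail.umich.edu [141.211.14.39])',
    'subject: from: is not at the start here',
]

# match() only succeeds when the regex is at the start of the line.
starts_with_from = [ln for ln in lines if re.match('from:', ln)]

# search() succeeds when the regex appears anywhere in the line.
contains_from = [ln for ln in lines if re.search('from:', ln)]

print(starts_with_from)
print(contains_from)
```

With these sample lines, `match()` should keep only the first line, while `search()` should also keep the third one, where `from:` appears in the middle.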


The `search()` method stops immediately after it finds one occurrence of the regex.
The `findall()` method can be used to find and extract all occurrences.
It returns a list where each element is one occurrence of the substring we are matching.  If the substring does not appear even once, the returned list will have a length of 0.

In [6]:
s = 'is there and again there'
print(re.findall('there', s))
print(re.search('there', s))
print(re.match('there', s))

['there', 'there']
<re.Match object; span=(3, 8), match='there'>
None
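The empty-list case is not shown in the cell above, so here is a short sketch of it. The pattern `'where'` is chosen only because it does not occur in the sample string.

```python
import re

s = 'is there and again there'

hits = re.findall('there', s)    # pattern occurs twice
misses = re.findall('where', s)  # pattern does not occur

print(len(hits))
print(misses)

# An empty list is falsy, so it can be tested directly.
if not misses:
    print('no occurrences found')
```

Since `findall()` always returns a list, the result can be passed to `len()` or looped over without first checking for `None`.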
