# What Is a Regular Expression?

In computing, a regular expression, also referred to as "regex" or "regexp", provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.

# When to Use Regular Expressions?

Any time you need complex string matching or extraction, it is a good time to use a regular expression.

# Understanding Regular Expressions
- Very powerful and quite cryptic
- Fun once you understand them
- Regular expressions are a language unto themselves
- A language of "marker characters" - programming with characters
- It is kind of an "old school" language - compact

# The Regular Expression Module

- Before you can use regular expressions in your program, you must import the library using "**import re**"
- You can use **re.search()** to see if a string matches a regular expression, similar to using the **find()** method for strings
- You can use **re.findall()** to extract the portions of a string that match your regular expression, similar to a combination of **find()** and slicing: **var[5:10]**
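
The two calls can be sketched side by side as follows. The sample line and the variable names are made up for illustration and do not come from the mail file used later.

```python
import re

# Hypothetical sample line, made up for illustration
line = 'From: alice@example.com to bob@example.org'

# re.search() finds the pattern anywhere in the string, much like find() >= 0
found = re.search('From: ', line) is not None
print(found)

# re.findall() returns every matching substring as a list
addresses = re.findall(r'\S+@\S+', line)
print(addresses)
```

Here `found` is `True` and `addresses` holds both e-mail addresses.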


# Regular Expression Quick Guide
![regexp_quick_guide](regexp_quick_guide.png)

## Example 1: Using re.search() like find() and startswith()

In [None]:
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.find('From: ') >= 0:
        print(line)

In [None]:
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('From: ', line):
        print(line)

In [None]:
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if line.startswith('From: '):
        print(line)

In [None]:
import re

hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    if re.search('^From: ', line):
        print(line)
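
The cells above need the file mbox-short.txt. As a self-contained sketch of the same idea, the lines below are hypothetical stand-ins for the file contents; they show how the **^** anchor changes which lines are selected.

```python
import re

# Hypothetical lines standing in for the contents of a mail file
lines = [
    'From: stephen.marquard@uct.ac.za',
    'Reply-From: someone@example.com',
    'Subject: hello',
]

# Without an anchor, the pattern may match anywhere in the line (like find())
anywhere = [line for line in lines if re.search('From: ', line)]

# With ^, the pattern must match at the start of the line (like startswith())
at_start = [line for line in lines if re.search('^From: ', line)]

print(anywhere)
print(at_start)
```

The unanchored pattern selects the first two lines, while the anchored one selects only the first.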

## Wild-Card Characters

- The dot character matches any character (except a newline, by default)
- If you add the asterisk character, the preceding character is matched "any number of times" (zero or more times)

In [21]:
import re
str1 = 'X-Sieve: CMU Sieve 2.3'
str2 = 'X-DSPAM-Result: Innocent'
str3 = 'X-Plane is behind schedule: two weeks'
print(re.search('^X-.*:', str1))
print(re.search('^X-.*:', str2))
print(re.search('^X-.*:', str3))
print(100 * '-')
# Raw strings (r'...') keep the backslash in \S from being read as a string escape
print(re.search(r'^X-\S+:', str1))
print(re.search(r'^X-\S+:', str2))
print(re.search(r'^X-\S+:', str3))

<_sre.SRE_Match object; span=(0, 8), match='X-Sieve:'>
<_sre.SRE_Match object; span=(0, 15), match='X-DSPAM-Result:'>
<_sre.SRE_Match object; span=(0, 27), match='X-Plane is behind schedule:'>
----------------------------------------------------------------------------------------------------
<_sre.SRE_Match object; span=(0, 8), match='X-Sieve:'>
<_sre.SRE_Match object; span=(0, 15), match='X-DSPAM-Result:'>
None


## Matching and Extracting Data

- **re.search()** returns a match object if the string matches the regular expression and **None** otherwise, so its result can be used like True/False in an **if** statement
- If we actually want the matching strings to be extracted, we use **re.findall()**
- **re.findall()** returns a list of zero or more substrings that match the regular expression
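
A short sketch of what **re.search()** hands back and how to read the matched text from it. The variable names are chosen for illustration only.

```python
import re

text = 'My 2 favorite numbers are 19 and 42'

# A successful search returns a match object for the first match only
match = re.search('[0-9]+', text)
if match:
    first_number = match.group()   # the matched text
    position = match.span()        # (start, end) indexes of the match
    print(first_number, position)

# A failed search returns None, which is falsy in an if statement
no_match = re.search('[0-9]+', 'no digits here')
print(no_match)
```

Unlike **re.findall()**, which collects every match, **re.search()** stops at the first one.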

In [44]:
import re
str1 = 'My 2 favorite numbers are 19 and 42'
result1 = re.findall('[0-9]+', str1)
print(result1)
result2 = re.findall('[AEIOU]+', str1)
print(result2)
result3 = re.findall('^My ([0-9]+)', str1)
print(result3)

['2', '19', '42']
[]
['2']


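## Greedy Matching and Non-Greedy Matching

- The repeat characters (* and +) push outward in both directions (greedy) to match the largest possible string
- Adding **?** after a repeat character makes it non-greedy: it matches the shortest possible string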

In [33]:
import re
# Named text rather than str to avoid shadowing the built-in str type
text = 'From: Using the : character'
result1 = re.findall('^F.+:', text)
result2 = re.findall('^F.+?:', text)
print('the greedy match is:', result1)
print('the nongreedy match is:', result2)

the greedy match is: ['From: Using the :']
the nongreedy match is: ['From:']
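
The difference becomes more visible when a string contains several possible end points. The sketch below uses a made-up snippet of tagged text; it is an illustration, not a recommended way to parse markup.

```python
import re

# Hypothetical sample text with several '<' ... '>' pairs
text = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string
greedy = re.findall('<.+>', text)

# Non-greedy: .+? stops at the first '>' it can
non_greedy = re.findall('<.+?>', text)

print(greedy)
print(non_greedy)
```

The greedy pattern returns the whole string as a single match, while the non-greedy pattern returns the four tags separately.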


## The Use of Parentheses

- You can refine the match for **re.findall()** and separately determine which portion of the match is to be extracted **by using parentheses**
- **Parentheses** are not part of the match, but they tell where the string to extract starts and stops

In [37]:
import re
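# Named line rather than str to avoid shadowing the built-in str type
line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
# Raw string so that \S reaches the regex engine unchanged
result1 = re.findall(r'\S+@\S+', line)
print(result1)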

['stephen.marquard@uct.ac.za']
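
The cell above returns the whole match because its pattern has no parentheses. As a minimal sketch of the effect of adding them, the same line is searched three ways; the variable names are illustrative.

```python
import re

line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'

# No parentheses: findall() returns the entire matched text
whole = re.findall(r'^From \S+@\S+', line)

# One pair of parentheses: only the text inside them is returned
address = re.findall(r'^From (\S+@\S+)', line)

# Two pairs: each result is a tuple with one item per pair
parts = re.findall(r'^From (\S+)@(\S+)', line)

print(whole)
print(address)
print(parts)
```

The text outside the parentheses ("From ") still has to match, but it is left out of the extracted result.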


#### Extracting a host name - using find and string slicing

In [38]:
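# Named line rather than str to avoid shadowing the built-in str type
line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
atpos = line.find('@')           # index of the @ sign
print(atpos)
sppos = line.find(' ', atpos)    # index of the first space after the @ sign
print(sppos)
host = line[atpos+1:sppos]       # everything between the two
print(host)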

21
31
uct.ac.za


#### Extracting a host name - using regular expressions

In [41]:
import re
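# Named line rather than str to avoid shadowing the built-in str type
line = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008'
# Extract the run of non-space characters that follows the @ sign
result = re.findall('^From .*@([^ ]*)', line)
print(result)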

['uct.ac.za']


## Escape Character

- If you want a special regular expression character to be treated as an ordinary character (most of the time), prefix it with a backslash, '\'

In [46]:
import re
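# Named text rather than str to avoid shadowing the built-in str type
text = 'We just received $10.00 for cookies.'
# Raw string so that \$ reaches the regex engine as an escaped dollar sign
result = re.findall(r'\$[0-9.]+', text)
print(result)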

['$10.00']
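
To see why the backslash matters, the sketch below runs the same search with and without it. It also uses **re.escape()**, a standard-library helper that adds the backslashes for you when the text to match is known in advance. The variable names are illustrative.

```python
import re

text = 'We just received $10.00 for cookies.'

# Unescaped, $ means "end of the string", so nothing can follow it
unescaped = re.findall('$[0-9.]+', text)

# Escaped, \$ matches a literal dollar sign
escaped = re.findall(r'\$[0-9.]+', text)

# re.escape() escapes every special character in a literal string
literal = re.findall(re.escape('$10.00'), text)

print(unescaped)
print(escaped)
print(literal)
```

The unescaped pattern finds nothing, while the other two both find the amount. Inside the square brackets the dot needs no backslash because it is already treated as a literal character there.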
