# Regular Expression
A regular expression is a way to search text for patterns. It uses metacharacters to describe a pattern of strings, so you can find matches without hardcoding every case.   
We will use the `re` package for regular expressions in Python.

In [1]:
import re

mystring = "Hello World!"
# Compile the regular expression
regexp = re.compile("Hello")
regexp

re.compile(r'Hello', re.UNICODE)

In [2]:
# Press Tab after the dot to list the methods available on this object
s = regexp.search(mystring)
s

<_sre.SRE_Match object; span=(0, 5), match='Hello'>

In [3]:
# Start and End
[s.start(), s.end()]

[0, 5]

In [4]:
# Get the word
mystring[s.start():s.end()]

'Hello'
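As a small aside, the match object can return the matched text and its position directly, without slicing the original string. The sketch below assumes the same `"Hello World!"` string; the names `m`, `matched` and `bounds` are only illustrative.

```python
import re

mystring = "Hello World!"
m = re.search("Hello", mystring)

# group() returns the matched text directly
matched = m.group()
# span() returns (start, end) as a tuple
bounds = m.span()

print(matched)  # Hello
print(bounds)   # (0, 5)
```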

In [5]:
mystringl = mystring.lower()
mystringl

'hello world!'

To search for the lowercase hello as well, there are several ways to write the pattern.
* `|` means "or" in a regular expression.
* `()` groups the pattern inside the parentheses.
* `[]` matches any single character listed inside the brackets.

In [6]:
regexp1 = re.compile('Hello|hello')
regexp2 = re.compile('(H|h)ello')
regexp3 = re.compile('[hH]ello')

s1 = regexp1.search(mystringl)
s2 = regexp2.search(mystringl)
s3 = regexp3.search(mystringl)
# All three patterns give the same match
[mystringl[s1.start():s1.end()], mystringl[s2.start():s2.end()], mystringl[s3.start():s3.end()]]

['hello', 'hello', 'hello']
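One more option, not used in the cells above, is the `re.IGNORECASE` flag, which makes the whole pattern case-insensitive. A minimal sketch, where `regexp4`, `lower_match` and `upper_match` are illustrative names:

```python
import re

mystring = "Hello World!"
mystringl = mystring.lower()

# re.IGNORECASE makes the pattern match regardless of letter case
regexp4 = re.compile("hello", re.IGNORECASE)

lower_match = regexp4.search(mystringl).group()
upper_match = regexp4.search(mystring).group()

print(lower_match)  # hello
print(upper_match)  # Hello
```

Note that the flag applies to every letter in the pattern, so it would also match `HELLO`, which the three patterns above would not.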

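Special character | Meaning
--- | ---
`^` | beginning of the string
`.` | wildcard: any single character except a newline
`*` | zero or more repetitions of the preceding item
`$` | end of the string
`+` | one or more repetitions of the preceding item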
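The short sketch below tries each special character from the table on small made-up strings (the strings and variable names are only for illustration):

```python
import re

text = "aaa bbb"

# ^ anchors the match at the beginning of the string
at_start = re.findall('^a+', text)
# $ anchors the match at the end of the string
at_end = re.findall('b+$', text)
# . matches any single character except a newline (here, the space)
wildcard = re.findall('a.b', text)
# * allows zero or more repetitions, so 'ab*' also matches a lone 'a'
star = re.findall('ab*', 'a ab abb')
# + requires at least one repetition, so a lone 'a' is skipped
plus = re.findall('ab+', 'a ab abb')

print(at_start)  # ['aaa']
print(at_end)    # ['bbb']
print(wildcard)  # ['a b']
print(star)      # ['a', 'ab', 'abb']
print(plus)      # ['ab', 'abb']
```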

In [7]:
s = "I have $123 in my pocket!"

# Extract runs of digits
re.findall('[0-9]+', s)

['123']

In [8]:
# All lowercase letters in the string
re.findall('[a-z]', s)

['h', 'a', 'v', 'e', 'i', 'n', 'm', 'y', 'p', 'o', 'c', 'k', 'e', 't']

### Raw String
A raw string, written with an `r` prefix, is a different way to define a string in Python. In a normal string, Python interprets backslash sequences such as tab `\t` and newline `\n` before the regular expression engine ever sees the pattern, which can change the meaning of a pattern that contains backslashes. In a raw string, backslashes are kept as literal characters. The following example shows the difference.

In [9]:
print(r'\ten')

\ten


In [10]:
print('\ten')

	en
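One case where this matters for regular expressions is `\b`. In a normal Python string `\b` is a backspace character, while in a regular expression it means "word boundary". The sketch below uses a made-up string to show how the same pattern behaves with and without the `r` prefix.

```python
import re

text = "hello world"

# Normal string: Python turns '\b' into a backspace character (\x08)
# before the regex engine sees it, so nothing in the text matches.
no_raw = re.findall('\bworld\b', text)

# Raw string: the backslash is kept, so the regex engine sees \b,
# which means "word boundary".
raw = re.findall(r'\bworld\b', text)

print(no_raw)  # []
print(raw)     # ['world']
```

For this reason, writing regular expression patterns as raw strings is a common habit.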


### Greedy Search
By default, quantifiers such as `+` and `*` are greedy: when several substrings starting at the same position could match, they take the longest one. If you want the shortest match instead, add `?` after the quantifier to turn off greedy matching. See the following example:

In [11]:
s = "First: Second: Third"
# Greedy: returns the longest matching string
re.findall('^F.+:', s)

['First: Second:']

In [12]:
# Adding ? after the quantifier turns off greedy matching.
re.findall('^F.+?:', s)

['First:']
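As a further sketch on the same string, dropping the `^` anchor lets the non-greedy pattern return every shortest piece that ends in a colon. A negated character class such as `[^:]` (any character except `:`) is an alternative that gives the same result here without using `?`. The variable names are illustrative.

```python
import re

text = "First: Second: Third"

# Without ^, the non-greedy pattern matches each shortest piece ending in ':'
lazy = re.findall('.+?:', text)

# [^:]+ matches one or more characters that are not ':',
# so the match cannot run past the first colon it meets
negated = re.findall('[^:]+:', text)

print(lazy)     # ['First:', ' Second:']
print(negated)  # ['First:', ' Second:']
```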

### Substitute
Use `sub` to replace every match of a pattern with another string.

In [13]:
s = 'Hhello World!'
regexp = re.compile('Hhello')
regexp.sub("Hello", s)

'Hello World!'
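The module-level `re.sub` function does the same job without compiling the pattern first. The sketch below also shows two features not used above: the `count` argument, which limits the number of replacements, and backreferences such as `\1`, which reuse captured groups in the replacement. The variable names are illustrative.

```python
import re

s = 'Hhello World!'

# re.sub(pattern, replacement, string) works without re.compile
fixed = re.sub('Hhello', 'Hello', s)

# count=2 replaces only the first two matches
first_two = re.sub('l', 'L', fixed, count=2)

# \1 and \2 in the replacement refer to the first and second groups
swapped = re.sub(r'(\w+) (\w+)!', r'\2 \1!', fixed)

print(fixed)      # Hello World!
print(first_two)  # HeLLo World!
print(swapped)    # World Hello!
```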

### Example
We did some exercises in R earlier. We will redo those exercises in Python.    
The data we use is `mbox-short.txt` from the website of [Python for Informatics](http://www.pythonlearn.com/book.php).

In [14]:
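# Using "with" closes the file automatically when the block ends
with open('../../Data/mbox-short.txt') as handle:
    for i, line in enumerate(handle):
        # Only print the first 5 lines
        if i >= 5:
            break
        # strip() removes leading and trailing whitespace (tabs, newlines, spaces)
        print(line.strip())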

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.90])
by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
Sat, 05 Jan 2008 09:14:16 -0500


### Email

In [15]:
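with open('../../Data/mbox-short.txt') as handle:
    mbox = handle.read()
# Note: the local part [a-zA-Z0-9]+ excludes dots, so an address such as
# stephen.marquard@uct.ac.za is captured as marquard@uct.ac.za.
# Converting to a set removes duplicates; list() turns it back into a list.
list(set(re.findall(r"[a-zA-Z0-9]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9]+", mbox)))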

['louis@media.berkeley.edu',
 'm03MGhDa005292@nakamura.uits.iupui.edu',
 'horwitz@uct.ac.za',
 'm04K1cO0007738@nakamura.uits.iupui.edu',
 'marquard@uct.ac.za',
 'm04JmdwO007705@nakamura.uits.iupui.edu',
 'cwen@iupui.edu',
 'm049W2i5006493@nakamura.uits.iupui.edu',
 'm04Kiem3007881@nakamura.uits.iupui.edu',
 'rjlowe@iupui.edu',
 'm0495rWB006420@nakamura.uits.iupui.edu',
 'm040NpCc005473@nakamura.uits.iupui.edu',
 'source@collab.sakaiproject.org',
 'postmaster@collab.sakaiproject.org',
 'm04F21Jo007031@nakamura.uits.iupui.edu',
 'm04B6lK3006677@nakamura.uits.iupui.edu',
 'm04GB1Lb007221@nakamura.uits.iupui.edu',
 'm04Fb6Ci007092@nakamura.uits.iupui.edu',
 'zqian@umich.edu',
 'm04E3psW006926@nakamura.uits.iupui.edu',
 'm04L92hb007923@nakamura.uits.iupui.edu',
 'm04G8d7w007184@nakamura.uits.iupui.edu',
 'wagnermr@iupui.edu',
 'antranig@caret.cam.ac.uk',
 'josrodri@iupui.edu',
 'm049lUxo006517@nakamura.uits.iupui.edu',
 'm04N8v6O008125@nakamura.uits.iupui.edu',
 'm05ECIaH010327@nakamura.uit

### Website

In [16]:
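with open('../../Data/mbox-short.txt') as handle:
    mbox = handle.read()
# https? matches both http and https.
# Inside the brackets, /-? is a character range from '/' to '?'. It also covers
# digits, ':', ';', '<', '=' and '>', which is why '=' in query strings is matched.
# Converting to a set removes duplicates; list() turns it back into a list.
list(set(re.findall(r"https?://[-a-zA-Z0-9./-?&]+", mbox)))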

['https://source.sakaiproject.org/svn/site-manage/trunk/',
 'http://source.sakaiproject.org/viewsvn/?view=rev&rev=39769',
 'http://source.sakaiproject.org/viewsvn/?view=rev&rev=39764',
 'http://source.sakaiproject.org/viewsvn/?view=rev&rev=39765',
 'http://source.sakaiproject.org/viewsvn/?view=rev&rev=39770',
 'http://source.sakaiproject.org/viewsvn/?view=rev&rev=39758',
 'http://source.sakaiproject.org/viewsvn/?view=rev&rev=39745',
 'http://source.sakaiproject.org/viewsvn/?view=rev&rev=39754',
 'http://bugs.sakaiproject.org/jira/browse/SAK-12592',
 'http://source.sakaiproject.org/viewsvn/?view=rev&rev=39752',
 'http://source.sakaiproject.org/viewsvn/?view=rev&rev=39757',
 'http://source.sakaiproject.org/viewsvn/?view=rev&rev=39743',
 'https://source.sakaiproject.org/svn/msgcntr/trunk',
 'http://source.sakaiproject.org/viewsvn/?view=rev&rev=39751',
 'https://collab.sakaiproject.org/portal',
 'http://bugs.sakaiproject.org/jira/browse/SAK-12175',
 'https://source.sakaiproject.org/svn/gra