###Preface

Recently, I was learning the **Chapter 7: Data Wrangling:Clean, Transform, Merge, Reshape** from *Python for data analysis* by [Wes McKinney](http://wesmckinney.com/blog/), the creator of *Pandas*. I find the part about **regular expression* particualrly interesting and might be useful for things lik web crawling. So I did a bit more in-depth learning on this topic and noted it down here.  Main sources are from [Google Python course](https://developers.google.com/edu/python/regular-expressions?hl=en) and [Python documentation-Regular Expression HOWTO](https://docs.python.org/2/howto/regex.html#using-regular-expressions).  

BTW, they are now working on another project called [Ibis](http://www.ibis-project.org).Wathching!

### "Regex" by definition

>Regular Expression is a sequence of characters that define a search pattern, mainly for use in pattern matching,substitution and splitting. i.e. "find and replace"-like operations. / In Python, **re** module provides regular expression support.

###Examples in *Python for data analysis*

**Split** / seperate by defined delimiters


In [4]:
import re 
text = "foo bar\t baz \tqux"
re.split('\s+', text) #\s+ describes one or more whitespace

['foo', 'bar', 'baz', 'qux']

Equivalent to: 

In [9]:
regex = re.compile('\s+')  # recommended practice as the same parttern might apply to many strings
regex.split(text)

['foo', 'bar', 'baz', 'qux']

Where,**Compile** is to obtain a regular expression, and then the split is called on the passed text

In [10]:
regex.findall(text)

[' ', '\t ', ' \t']

**findall** returns a list of all patterns matching the regex.

In [12]:
# findall, search , match,
text = """Dave dave@google.com Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# According to Google python course: The 'r' at the start of the pattern string... 
# designates a python "raw" string which passes through backslashes without change . 
regex = re.compile(pattern, flags=re.IGNORECASE)  #IGNORECASE make the pattern case-insensitive
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

In [14]:
m = regex.search(text) # Scan string for the first match to pattern

In [15]:
text[m.start():m.end()]

'dave@google.com'

In [17]:
print regex.match(text) # Match pattern at start of string 

None


In [18]:
print regex.sub('REDACTED', text) #substitute the matched text with specified contents

Dave REDACTED Steve REDACTED
Rob REDACTED
Ryan REDACTED



If we want to further segment the pattern for subsequent processing, we can play with the pattern specified. 

In [48]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.group() #return the whole group

'wesm@bright.net'

In [50]:
m.groups() #return the seperated groups

('wesm', 'bright', 'net')

In [52]:
m.group(1) #return specified subgroup

'wesm'

In [20]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

In [21]:
print regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)

Dave Username: dave, Domain: google, Suffix: com Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com



Further, we can produce a handy dict by carefully signing the pattern:

In [22]:
regex = re.compile(r"""
(?P<username>[A-Z0-9._%+-]+)
@
(?P<domain>[A-Z0-9.-]+)
\.
(?P<suffix>[A-Z]{2,4})""", flags=re.IGNORECASE|re.VERBOSE)

m = regex.match('wesm@bright.net')
m.groupdict()

{'domain': 'bright', 'suffix': 'net', 'username': 'wesm'}

A summary of the methods involved in regex:
<a data-flickr-embed="true"  href="https://www.flickr.com/photos/108107823@N04/21662922381/in/album-72157659008201065/" title="regex methods"><img src="https://farm6.staticflickr.com/5717/21662922381_d10971bddf_z.jpg" width="563" height="224" alt="regex methods"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>

### Google Python Course

Regex in action: 

In [24]:
str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str) 
# If-statement after search() tests if it succeeded
if match:                      
    print 'found', match.group() ## 'found word:cat'
else:
    print 'did not find' 

found word:cat


####Basic Pattern Elements
- ordinary characters: a-z, A-Z,0-9,<...etc;
- meta-characters: ^ $ * + ? { [ ] \ | ( );

    -  ^ = start, $ = end : match the start or end of the string;
    -  \ : inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a slash. 
    -  [ ]: specify a character class. Metacharacters are not active inside classes,e.g.[akm$] will match any of the characters 'a', 'k', 'm', or '$';
  
- . (a period): matches any single character except newline '\n';
- \w (lowercase w): matches a "word" character: a **letter** or **digit** or **underbar** [a-zA-Z0-9_].
- \W (upper case W): matches any non-word character.
- \b : *boundary* between word and non-word
- \s(lowercase s): matches a single whitespace character -- space, newline, return, tab, form [ \n\r\t\f].
- \S(upper case S): matches any non-whitespace character,equivalent to the class [^ \t\n\r\f\v].
- \s+: matches one or more whitespace
- \t, \n, \r : tab, newline, return
- \d : decimal digit [0-9] 
- \D: Matches any non-digit character; this is equivalent to the class [^0-9]

**Repetition**
>
    + : 1 or more occurrences of the pattern to its left;
    * : 0 or more occurrences of the pattern to its left;
    ? : match 0 or 1 occurrences of the pattern to its left.
  
**Square Brackets**
Example: pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
It indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. 

**Options**

- IGNORECASE: ignore upper/lowercase differences for matching;

- DOTALL: allow dot (.) to match newline;

- MULTILINE: Within a string made of many lines, allow ^ and $ to match the start and end of each line. Normally ^/$ would just match the start and end of the whole string.

**Methods**

Methods       | Functions
------------- | -------------
group()       | Return the string matched by the RE
start()       | Return the starting position of the match
end()         | Return the ending position of the match
span()        | Return a tuple containing the (start, end) positions of the match

` f = open('filename.txt', 'r')

` strings = re.findall(r'some pattern', f.read())
//f.read() returns the whole text of a file in a single string

**Demo:**

In [29]:
match = re.search(r'.\w\dg', 'pii2g') 
match.group()    

'ii2g'

In [32]:
match = re.search(r'i+', 'piigiiii')
match.group()

'ii'

In [34]:
#\s* = zero or more whitespace chars
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')
match.group()

'1 2   3'

In [45]:
# ^ = matches the start of string
match = re.search(r'^b\w+', 'foobar') #Match == NONE

In [55]:
match = re.search(r'.*',' <b>foo</b> and <i>so on</i>')  # .* will go as far as it can (greedy)
match.group()

' <b>foo</b> and <i>so on</i>'

In [60]:
match = re.search(r'<.*?>',' <b>foo</b> and <i>so on</i>')# .*? will stop after find the first match
match.group()

'<b>'

###Exercise -U.S. Baby Names

The files are downloaded from [this page](https://developers.google.com/edu/python/exercises/baby-names).
Objective: 
> Give a file name for xxxx.html, returns a list starting with the year string
  followed by the name-rank strings in alphabetical order.
  
The name infor in the html has the following pattern:
e.g.
```
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
```

In [83]:
import sys
import re

def extract_names(filename):
    names = []
    f = open(filename,'rU') # filename are the html files containing names and rank 
    text = f.read()
    
    #extract year
    year_match = re.search(r'Popularity\sin\s(\d\d\d\d)', text)
    if not year_match:
        sys.stderr.write('Couldn\'t find the year!\n')
        sys.exit(1)
    year = year_match.group(1)
    names.append(year)
    
    #extract rank, boy name and female name 
    tuples = re.findall(r'<td>(\d+)</td><td>(\w+)</td>\<td>(\w+)</td>', text) 
    
    #store the names in a dict where key is the name, value is the rank.
    #(if the name is already in there, don't add it, since this new rank will be bigger than the previous rank).
    names_to_rank =  {}
    for rank_tuple in tuples:
        (rank, boyname, girlname) = rank_tuple  # unpack the tuple into 3 vars
        if boyname not in names_to_rank:
            names_to_rank[boyname] = rank
        if girlname not in names_to_rank:
            names_to_rank[girlname] = rank
    
    sorted_names = sorted(names_to_rank.keys())
    # build the required list 
    for name in sorted_names:
        names.append(name + " " + names_to_rank[name])
    return names[:10]

def main():
    args = sys.argv[1:]

    if not args:
        print 'usage: [--summaryfile] file [file ...]'
        sys.exit(1)
    summary = False
    if args[0] == '--summaryfile':
        summary = True
        del args[0]

    for filename in args:
        names = extract_names('/Users/jin/Python/Google-Python/google-python-exercises/babynames/baby2008.html')

    text = '\n'.join(names)
  
    if summary:
        outf = open(filename + '.summary', 'w')
        outf.write(text + '\n')
        outf.close()
    else:
        print text

if __name__ == '__main__':
    main()

2008
Aaden 343
Aaliyah 77
Aarav 921
Aaron 50
Abagail 874
Abbey 822
Abbie 737
Abbigail 508
Abby 259


!!Will come back to this example for more operations again!