# Regular Expressions
Regular expressions are text-matching patterns described with a formal syntax.

You'll often hear regular expressions referred to as 'regex' or 'regexp' in conversation. Regular expressions can include a variety of rules, from finding repetition to text matching and much more.

##### As you advance in Python you'll see that a lot of your parsing problems can be solved with regular expressions (they're also a common interview question!).

If you're familiar with Perl, you'll notice that the syntax for regular expressions is very similar in Python. We will be using the re module with Python for this lecture.

### Searching for patterns using regex

In [1]:
import re

In [2]:
# List of patterns to search for
patterns = ['term1', 'term2']

# Text to parse
text = 'This is string with term1, not with other'

for pattern in patterns:
    print('Searching for "%s" in : \n "%s"' % (pattern, text))

    # re.search() returns a match object if the pattern occurs, otherwise None
    if re.search(pattern, text):
        print('Match was found')
    else:
        print('Match was not Found')
    

Searching for "term1" in : 
 "This is string with term1, not with other"
Match was found
Searching for "term2" in : 
 "This is string with term1, not with other"
Match was not Found


As we've seen, re.search() takes the pattern, scans the text, and returns a Match object. If the pattern is not found, None is returned instead.

To give a clearer picture of this match object, check out the cell below:

In [11]:
# Pattern to search for
pattern = 'have'

# Text to parse
text = 'This is a string with term1, but it does not have the other term.'

match = re.search(pattern,  text)

type(match)

_sre.SRE_Match

The Match object returned by the search() method is more than just a Boolean or None. It contains information about the match, including the original input string, the regular expression that was used, and the location of the match. Let's see the methods we can use on the match object:

In [12]:
match.start()

45

In [13]:
match.end()

49
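The other pieces of information mentioned above can be read off the same match object. Here is a minimal sketch that reuses the pattern and text from the previous cell; the names matched_text, span and sliced are only illustrative, and print() is written with parentheses so the cell also runs under Python 3:

```python
import re

pattern = 'have'
text = 'This is a string with term1, but it does not have the other term.'

match = re.search(pattern, text)

# re.search() returns None when nothing matches, so check before using the result
if match is not None:
    matched_text = match.group()               # the substring that matched
    span = match.span()                        # (start, end) as a single tuple
    sliced = text[match.start():match.end()]   # slicing with the indices gives the same substring

    print(matched_text)            # have
    print(span)                    # (45, 49)
    print(match.re.pattern)        # have  (the regular expression that was used)
    print(match.string == text)    # True  (the original input string)
```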

# Split with Regex

In [14]:
splitby = "@"

phrase = 'My Email is dc@gmail.com'

re.split(splitby,phrase)

['My Email is dc', 'gmail.com']
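re.split() is not limited to a single literal character; the first argument can be any pattern. As a small sketch, the bracket syntax (covered in the Character Sets section below) splits the same phrase on either an @ or a dot:

```python
import re

phrase = 'My Email is dc@gmail.com'

# Split on either '@' or '.'; inside brackets the dot is a literal dot
parts = re.split('[@.]', phrase)
print(parts)   # ['My Email is dc', 'gmail', 'com']
```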

# Finding all instances of a pattern

In [15]:
#Lists all matches

re.findall('hey', 'hey hi hello hey how are you')

['hey', 'hey']
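When the positions of the matches are needed as well, re.finditer() yields a match object for each occurrence instead of just the matched text. A brief sketch using the same phrase (the variable name positions is only illustrative):

```python
import re

phrase = 'hey hi hello hey how are you'

# Each item produced by finditer() is a match object, so start() and end() are available
positions = [(m.start(), m.end()) for m in re.finditer('hey', phrase)]
print(positions)   # [(0, 3), (13, 16)]
```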

# Pattern re Syntax

This will be the bulk of this lecture on using re with Python. Regular expressions support a huge variety of patterns beyond simply finding where a single string occurred.

We can use metacharacters along with re to find specific types of patterns.

Since we will be testing multiple re syntax forms, let's create a function that will print out results given a list of various regular expressions and a phrase to parse:

In [16]:
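def multi_find(patterns, phrase):
    '''
    Print the matches re.findall() returns for each pattern in patterns.
    '''
    for pattern in patterns:
        # Note: the label reads "phrase", but the value shown is the pattern being tried
        print('This is the phrase : %r' % pattern)
        print(re.findall(pattern, phrase))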

# Repetition Syntax
There are five ways to express repetition in a pattern:

1.) A pattern followed by the meta-character * is repeated zero or more times.

2.) Replace the * with + and the pattern must appear at least once. 

3.) Using ? means the pattern appears zero or one time. 

4.) For a specific number of occurrences, use {m} after the pattern, where m is replaced with the number of times the pattern should repeat.

5.) Use {m,n} where m is the minimum number of repetitions and n is the maximum. Leaving out n ({m,}) means the value appears at least m times, with no maximum.


In [18]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = [ 'sd*',     # s followed by zero or more d's
                'sd+',          # s followed by one or more d's
                'sd?',          # s followed by zero or one d's
                'sd{3}',        # s followed by three d's
                'sd{2,3}',      # s followed by two to three d's
                ]

multi_find(test_patterns,test_phrase)

This is the phrase : 'sd*'
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']
This is the phrase : 'sd+'
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']
This is the phrase : 'sd?'
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']
This is the phrase : 'sd{3}'
['sddd', 'sddd', 'sddd', 'sddd']
This is the phrase : 'sd{2,3}'
['sddd', 'sddd', 'sddd', 'sddd']
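The open-ended form {m,} from the list of repetition rules is not exercised in the cell above. A minimal sketch on the same test phrase shows how it differs from {2,3}: with no maximum, the final match keeps all four d's instead of stopping at three.

```python
import re

test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

# s followed by two or more d's, with no upper limit
at_least_two = re.findall('sd{2,}', test_phrase)
print(at_least_two)   # ['sddd', 'sddd', 'sddd', 'sdddd']
```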


# Character Sets

Character sets are used when you wish to match any one of a group of characters at a point in the input. Brackets are used to construct character set inputs. For example: the input [ab] searches for occurrences of either a or b.

In [20]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = [ '[sd]',     # either s or d
                 
                 's[sd]+'      # s followed by one or more s or d characters
                ]

multi_find(test_patterns,test_phrase)

This is the phrase : '[sd]'
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']
This is the phrase : 's[sd]+'
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']


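It makes sense that the first pattern, [sd], returns every individual s and d.

The second pattern, s[sd]+, returns every run that starts with an s and continues with one or more s or d characters. For this particular test phrase, that is every group of letters, minus any leading d.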
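The [ab] example mentioned at the start of this section can be tried in the same way. This is a small sketch on a made-up sample word, with illustrative variable names:

```python
import re

sample = 'abracadabra'   # made-up sample text

# Any single character that is either a or b
singles = re.findall('[ab]', sample)
print(singles)   # ['a', 'b', 'a', 'a', 'a', 'b', 'a']

# One or more characters in a row, each of which is either a or b
runs = re.findall('[ab]+', sample)
print(runs)      # ['ab', 'a', 'a', 'ab', 'a']
```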

# Exclusion
We can use ^ to exclude terms by incorporating it into the bracket syntax notation. 

For example: [^...] will match any single character not in the brackets.

In [21]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

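Use [^!.?] to match any character that is not !, ., or ?.

Adding + requires one or more such characters in a row, so each match here is a whole sentence with its closing punctuation removed. Putting a space inside the brackets as well, [^!.? ], would also break the matches at spaces and return the individual words.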

In [23]:
re.findall('[^!.?]+',test_phrase)

['This is a string', ' But it has punctuation', ' How can we remove it']
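The cell above leaves the space out of the excluded set, so each match is a whole sentence. As a sketch of the variant with a space added inside the brackets, which yields the individual words instead:

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Excluding the space as well breaks the text at every gap between words
words = re.findall('[^!.? ]+', test_phrase)
print(words)
# ['This', 'is', 'a', 'string', 'But', 'it', 'has', 'punctuation', 'How', 'can', 'we', 'remove', 'it']
```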

# Character Ranges

As character sets grow larger, typing every character that should (or should not) match could become very tedious.

A more compact format using character ranges lets you define a character set to include all of the contiguous characters between a start and stop point. The format used is [start-end].

A common use case is to search for a specific range of letters in the alphabet, such as [a-f], which matches any letter between a and f.

In [24]:
test_phrase = 'This is an example sentence. Lets see if we can find some letters.'

test_patterns=[ '[a-z]+',      # sequences of lower case letters
                '[A-Z]+',      # sequences of upper case letters
                '[a-zA-Z]+',   # sequences of lower or upper case letters
                '[A-Z][a-z]+'] # one upper case letter followed by lower case letters
                
multi_find(test_patterns,test_phrase)


This is the phrase : '[a-z]+'
['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']
This is the phrase : '[A-Z]+'
['T', 'L']
This is the phrase : '[a-zA-Z]+'
['This', 'is', 'an', 'example', 'sentence', 'Lets', 'see', 'if', 'we', 'can', 'find', 'some', 'letters']
This is the phrase : '[A-Z][a-z]+'
['This', 'Lets']
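The [a-f] range mentioned earlier in this section, and the equivalent range for digits, can be sketched as follows. Both sample strings are made up for illustration:

```python
import re

# Runs of letters between a and f, inclusive
letters = re.findall('[a-f]+', 'a badge of face')
print(letters)   # ['a', 'bad', 'e', 'f', 'face']

# Ranges work for digits too: runs of characters between 0 and 9
digits = re.findall('[0-9]+', 'Room 12, floor 3')
print(digits)    # ['12', '3']
```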


# Escape Codes
You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more.

| Code | Meaning |
|------|---------|
| \d | a digit |
| \D | a non-digit |
| \s | whitespace (tab, space, newline, etc.) |
| \S | non-whitespace |
| \w | alphanumeric (letters, digits, and underscore) |
| \W | non-alphanumeric |

Escapes are indicated by prefixing the character with a backslash (\). Unfortunately, a backslash must itself be escaped in normal Python strings, which results in expressions that are difficult to read. Using raw strings, created by prefixing the literal value with r, eliminates this problem and keeps regular expressions readable.
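To make the difference concrete, here is a minimal sketch comparing the two spellings. The variable names and the sample sentence are only illustrative:

```python
import re

# The same pattern, written as a normal string and as a raw string
normal = '\\d+'
raw = r'\d+'
print(normal == raw)    # True

# Python itself interprets some escapes in normal strings:
# '\b' is one backspace character, while r'\b' keeps both characters for re
print(len('\b'))        # 1
print(len(r'\b'))       # 2

# Either spelling of the digit pattern finds the same matches
found = re.findall(raw, 'Order 66 shipped in 3 days')
print(found)            # ['66', '3']
```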

In [25]:
test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'

test_patterns=[ r'\d+', # sequence of digits
                r'\D+', # sequence of non-digits
                r'\s+', # sequence of whitespace
                r'\S+', # sequence of non-whitespace
                r'\w+', # alphanumeric characters
                r'\W+', # non-alphanumeric
                ]

multi_find(test_patterns,test_phrase)

This is the phrase : '\\d+'
['1233']
This is the phrase : '\\D+'
['This is a string with some numbers ', ' and a symbol #hashtag']
This is the phrase : '\\s+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']
This is the phrase : '\\S+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag']
This is the phrase : '\\w+'
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']
This is the phrase : '\\W+'
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']
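Escape codes combine naturally with the repetition syntax from earlier in the lecture. As a rough sketch, assuming phone numbers written as three digits, three digits and four digits separated by dashes (the sample text is made up):

```python
import re

text = 'Call 555-123-4567 or 555-987-6543 before 5pm'   # made-up sample text

# Three digits, a dash, three digits, a dash, four digits
phone_numbers = re.findall(r'\d{3}-\d{3}-\d{4}', text)
print(phone_numbers)   # ['555-123-4567', '555-987-6543']
```

The lone 5 in 5pm is skipped because it is not followed by the rest of the pattern.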
