# Regular Expressions

## Lesson Objectives
By the end of this lesson, you will be able to:
- Understand the relationship between regular expressions and Python
- Compile regular expressions using the `re` module
- Use regular expressions to find pattern matches in strings
- Use regular expressions to transform strings based on pattern matches

## Table of Contents
 - [Regular Expressions](#regular expressions)
 - [The `re` Module](#re)
 - [Compiling Patterns and Querying Match Objects](#compiling)
 - [The Regular Expression Syntax](#syntax)
     - [Special Metacharacters](#meta)
     - [Special Sequences](#special)
     - [Quantifiers](#quant)
 - [Modifying Strings](#mod)
 - [Takeaways](#takeaways)
 - [Applications](#applications)

<a id='regular expressions' ></a>
## Regular Expressions
Regular expressions are a powerful language for matching text patterns. They provide a lot of flexibility to help you find and/or replace a sequence of characters. Mastering regular expressions is a stepping stone to [Natural Language Processing](https://en.wikipedia.org/wiki/Natural-language_processing). 

This lesson will introduce you to regular expressions as well as the Python `re` module, which provides regular expression support.

A handy companion for this lesson is https://pythex.org/, a quick and easy way to test your Python regular expressions.

<a id='re' ></a>
## The `re` Module
Regular expressions are made available through the `re` module in Python. Think of the module as access to another language through Python. 

You use regular expressions to specify patterns that you want to find within a string. Your pattern can be just about anything, like proper names or e-mail addresses, and the `re` module has a lot of tools that will let you find these patterns and even modify strings based on these patterns.

> Behind the scenes, regular expression patterns are compiled into a series of bytecodes and then executed in C. These details won't be covered in this lesson (but maybe in a future one!).

<a id='compiling' ></a>
## Compiling Patterns and Querying Match Objects
Before we dive into the regular expression language, we need to see how the `re` module works. We'll see here how to compile some basic patterns and then query the *match objects* that they return after searching a string.

In [1]:
import re

### Compiling Regular Expressions
The `re` module is an interface to the regular expression language. Since regular expressions aren't baked into Python's syntax, you typically use the `re.compile()` function to compile regular expressions into *pattern objects*. These objects then have methods for various operations, such as searching for matches or performing string substitutions.

Here's how we compile a pattern for a simple character match (looking for the single character `"a"`):

In [2]:
a = re.compile('a')  #the single character `a` is our pattern
a

re.compile(r'a', re.UNICODE)

Above, we passed our regular expression (the single character `"a"`) to `re.compile()` as a string. We had to pass our regular expression as a string because regular expressions aren’t a part of the core Python language. This keeps the Python language simpler for all other purposes, but creates a little headache when it comes to writing patterns that contain the backslash character.

#### A quick note on escape characters and Python's raw string notation
> Regular expressions use the backslash character (`\`) to indicate special forms or to escape special characters. This collides with Python’s usage of the same character for the same purpose in string literals. For example, to match a literal backslash with a regular expression, one would have to write `\\\\` as the pattern string, because the regular expression must be `\\`, and each backslash must be expressed as `\\` inside a regular Python string literal.

> The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with `'r'`. So `r"\n"` is a two-character string containing `'\'` and `'n'`, while `"\n"` is a one-character string containing a newline. **Thus it's best to type regular expression patterns  using this raw string notation.**
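
To make the difference concrete, here is a minimal sketch comparing a regular string literal with a raw one. The sample path `C:\temp` and the variable names are made up for illustration.

```python
import re

# A regular string literal turns "\n" into one newline character,
# while the raw string keeps the backslash and the letter n.
print(len("\n"))    # 1
print(len(r"\n"))   # 2

# Matching one literal backslash: the regex itself is \\ (two characters).
plain = re.compile("\\\\")   # regular literal: four backslashes typed
raw = re.compile(r"\\")      # raw literal: two backslashes typed

# Both literals produce the same pattern string.
print(plain.pattern == raw.pattern)   # True

# The backslash in this sample path sits at index 2.
found = raw.search(r"C:\temp")
print(found.start())                  # 2
```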

So let's re-compile our character using Python's raw string notation.

In [3]:
a = re.compile(r'a')  # Good practice to use the raw string notation
a

re.compile(r'a', re.UNICODE)

### Finding Matches
Now that we have an object representing a *compiled* regular expression, there's an array of object methods and attributes at our disposal. Here we'll cover just the most common/useful ones:

<table border="1">
<colgroup>
<col width="28%" />
<col width="72%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th>Method</th>
<th class="head">Purpose</th>
</tr>
</thead>
<tbody valign="top">
<tr><td><code><span>match()</span></code></td>
<td>Determines if the regular expression matches at the beginning
of the string, returning a match object.</td>
</tr>
<tr><td><code><span>search()</span></code></td>
<td>Scans through a string, looking for any
location where the regular expression matches.</td>
</tr>
<tr><td><code><span>findall()</span></code></td>
<td>Finds all substrings where the regular expression matches,
returning them as a list.</td>
</tr>
<tr><td><code><span>finditer()</span></code></td>
<td>Finds all substrings where the regular expression matches,
returning them as an <a href="https://docs.python.org/3.4/glossary.html#term-iterator"><span>iterator</span></a>.</td>
</tr>
</tbody>
</table>

#### `match()`
> Determines if the regular expression matches at *the beginning of the string*, returning a `match object`.

In [4]:
apple = a.match('apple')
banana = a.match('banana')

In [5]:
type(apple)

_sre.SRE_Match

In [6]:
type(banana)

NoneType

> #### `match()` returns `None` if there's no match.
> Since the `match()` method didn't find an `"a"` at the beginning of `"banana"`, it returned `None` instead of a match object. This is handy for conditional tests, because `None` is falsy while a match object is always truthy:

In [7]:
if banana:
    print("banana!")
else:
    print("no banana :(")

no banana :(


> Above, we used `match()` to test the strings `'apple'` and `'banana'` against the single character `"a"` as our regular expression. We assigned the results to two variables, only one of which (`apple`) actually holds a match object. We can now query that match object using some methods. A few of the most common ones are:

<table border="1">
<colgroup>
<col width="29%" />
<col width="71%" />
</colgroup>
<thead valign="bottom">
<tr><th>Method/Attribute</th>
<th>Purpose</th>
</tr>
</thead>
<tbody>
<tr><td><code><span>group()</span></code></td>
<td>Return the string matched by the RE</td>
</tr>
<tr><td><code><span>start()</span></code></td>
<td>Return the starting position of the match</td>
</tr>
<tr><td><code><span>end()</span></code></td>
<td>Return the ending position of the match</td>
</tr>
<tr><td><code><span>span()</span></code></td>
<td>Return a tuple containing the (start, end)
positions  of the match</td>
</tr>
</tbody>
</table>

Let's try these out!

In [8]:
apple.group()

'a'

In [9]:
apple.start()

0

In [10]:
apple.end()

1

In [11]:
apple.span()

(0, 1)
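
The other three methods from the table above behave as follows. This is a small sketch that reuses the single-character pattern on the string `'banana'`; the variable names are only illustrative.

```python
import re

a = re.compile(r'a')
text = 'banana'

# search() scans the whole string and stops at the first match.
first = a.search(text)
print(first.span())        # (1, 2)

# findall() returns every non-overlapping match as a list of strings.
all_matches = a.findall(text)
print(all_matches)         # ['a', 'a', 'a']

# finditer() yields one match object per match, so positions are available.
spans = [m.span() for m in a.finditer(text)]
print(spans)               # [(1, 2), (3, 4), (5, 6)]
```

Unlike `match()`, `search()` finds the `"a"` in `'banana'` because it is not limited to the beginning of the string.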

<a id='syntax' ></a>
## The Regular Expression Syntax
Regular Expression Syntax can be roughly divided into 3 categories:
1. Special Metacharacters
2. Special Sequences
3. Quantifiers

We'll run through several, but not all, techniques within each of these categories. Full documentation can be found [here](https://docs.python.org/3.4/library/re.html#regular-expression-syntax).

<a id='meta' ></a>
### Special Metacharacters
As we saw with the letter `"a"` above, most letters and characters match themselves. Naturally, however, there are exceptions to this rule. Some characters are **special metacharacters** and so they don’t match themselves. Instead, they signal that something special should happen.

Here's a table listing some of the most common special metacharacters.
<table>
<tr><th>Special Metacharacters</th>
<th>Purpose</th>
</tr>
  <tr>
    <td><code>[ ]</code></td>
    <td>matches any single character listed within the brackets</td>
  </tr>
  <tr>
    <td ><code>\</code></td>
    <td>escape special characters</td>
  </tr>
  <tr>
    <td><code>.</code></td>
    <td>matches any character except a newline</td>
  </tr>
  <tr>
    <td><code>^</code></td>
    <td>matches beginning of string</td>
  </tr>
  <tr>
    <td><code>$</code></td>
    <td>matches end of string</td>
  </tr>
  <tr>
    <td><code>|</code></td>
    <td>the OR operator matches either the left or right operand</td>
  </tr>
  <tr>
    <td><code>()</code></td>
    <td>creates capture groups and indicates precedence</td>
  </tr>
</table>

#### Character Classes:  [  ] 
> Square brackets are used to specify a *character class*, which is a set of characters that you wish to match. 

In [20]:
a = re.compile(r'[a]')
apple = a.match('apple')

# Using a conditional check to reinforce this practice
if apple:
    print(apple.group())
else:
    print("No match.")

a


> Inside the class, you can either list characters individually or specify a range of characters using a dash (`'-'`). For example, `[abc]` will match any of the characters `a`, `b`, or `c`. If you were to use the range approach, you would type `[a-c]`. These character classes are case sensitive, so if you wanted to match only uppercase letters, your regular expression would be `[A-Z]`.

In [14]:
az = re.compile(r'[a-z]') #any lowercase letters
alpha_num = az.search('01234w56789')

if alpha_num:
    print(alpha_num.group())
else:
    print("No match.")

w


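> Most special metacharacters lose their special meaning inside classes. The exceptions are the caret (`^`) at the start of the class, the backslash (`\`), the dash (`-`) between two characters, and the closing bracket (`]`). For example, `[abc*]` will match any of the characters `'a'`, `'b'`, `'c'`, or `'*'`.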

In [15]:
abc_star = re.compile(r'[abc*]')
star = abc_star.search('01234*56789')

if star:
    print(star.group())
else:
    print("No match.")

*


#### The Escape Character:  `\`
> As with Python string literals, the backslash escapes special characters (including itself!).

In [17]:
slash = re.compile(r'\\') #the first backslash is to escape the second backslash, which itself is a metacharacter
slash_match = slash.search(r"....path\file")

if slash_match:
    print(slash_match.group())
else:
    print("No match.")

\


#### The Wildcard:   `.`
> Another useful metacharacter is the dot (`.`). It'll match anything except a newline character, unless you use the `re.DOTALL` flag, in which case it will also match a newline.

In [27]:
after_dash = re.compile(r'[-].') 
after_dash_match = after_dash.search(r"a-b")

if after_dash_match:
    print(after_dash_match.group())
else:
    print("No match.")

-b
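
The effect of `re.DOTALL` can be sketched like this. The test string, which puts a newline right after the dash, is an assumption chosen to show the difference.

```python
import re

text = "-\nb"   # a dash, a newline, then the letter b

# By default the dot refuses to match the newline after the dash.
default_dot = re.compile(r'-.')
no_match = default_dot.search(text)
print(no_match)                       # None

# With re.DOTALL the dot matches the newline too.
dotall_dot = re.compile(r'-.', re.DOTALL)
with_flag = dotall_dot.search(text)
print(with_flag.group() == "-\n")     # True
```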


#### The Complementing Caret:   `^` 
> You can match the characters *not* listed within a character class by complementing the set with the caret character (`^`) as the first character of the class. For example, `[^a]` will match any character *except* 'a'.

In [28]:
not_a = re.compile(r'[^a]')
non_a = not_a.search('aaaabaaaaa')

if non_a :
    print(non_a.group())
else:
    print("No match.")

b


> When `^` is inside `[]` but not at the start, it means the actual `^` character. When it's escaped with the backslash (`\^`), it also means the actual `^` character.

In [30]:
caret = re.compile(r'[\^]')
c = caret.search('aaaa^aaaaa')

if c :
    print(c.group())
else:
    print("No match.")

^
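
As a quick comparison of the three caret cases described so far, the sketch below runs them against the same made-up string `'ab^'`.

```python
import re

text = 'ab^'

not_a = re.findall(r'[^a]', text)        # caret first: anything except 'a'
a_or_caret = re.findall(r'[a^]', text)   # caret later: 'a' or a literal '^'
escaped = re.findall(r'\^', text)        # escaped caret outside a class

print(not_a)        # ['b', '^']
print(a_or_caret)   # ['a', '^']
print(escaped)      # ['^']
```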


> Outside of a character class, the caret (`^`) matches the start of the string. If your string spans multiple lines, you can pass the `re.MULTILINE` flag, which makes `^` also match immediately after each newline. Note that `match()` still only tries the very start of the string, even with this flag (we'll see this further down).

In [31]:
s1 = "Kate is the most common spelling of that name"
s2 = "Another spelling is Cate, but that's rarer."
print(re.search(r"^[CK]ate", s1))

<_sre.SRE_Match object; span=(0, 4), match='Kate'>


In [32]:
print(re.search(r"^[CK]ate", s2)) #this will return None because Cate isn't at the 
                                  #beginning of the string

None


> If we join `s2` and `s1` with a newline, the resulting string doesn't start with `Kate` or `Cate`, but `Kate` does come immediately after the newline character. In this case, we can use the `re.MULTILINE` flag.

In [33]:
s = s2 + "\n" + s1
print(re.search(r"^[CK]ate", s, re.MULTILINE))

<_sre.SRE_Match object; span=(44, 48), match='Kate'>


In [34]:
print(re.match(r"^[CK]ate", s, re.MULTILINE))

None


> The last example shows that the `re.MULTILINE` flag doesn't change where `match()` looks: `match()` only checks the very beginning of the string, and that position holds `Another`, not `Cate` or `Kate`.
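
To see every line-start match at once, `findall()` can be combined with the flag. This sketch builds a three-line string from the same two sentences; the line order is an assumption made for the example.

```python
import re

s1 = "Kate is the most common spelling of that name"
s2 = "Another spelling is Cate, but that's rarer."
s = s1 + "\n" + s2 + "\n" + s1   # three lines; the first and third start with Kate

# Without the flag, ^ only anchors at the very start of the string.
without_flag = re.findall(r"^[CK]ate", s)
print(without_flag)   # ['Kate']

# With re.MULTILINE, ^ also anchors right after each newline.
with_flag = re.findall(r"^[CK]ate", s, re.MULTILINE)
print(with_flag)      # ['Kate', 'Kate']
```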

#### The Dollar Sign:  `$`
> The dollar sign (`$`) matches at the end of the string, or just before a newline at the very end of the string. With the `re.MULTILINE` flag, it also matches just before every newline, i.e. at the end of each line.

> Here it is in action:

In [13]:
endsWithExclamation = re.compile(r'!$')
m = endsWithExclamation.search("Find the end of this sentence!")
m.group()

'!'

> And note how it won't find the exclamation point that's in the middle of a word.

In [14]:
endsWithExclamation = re.compile(r'!$')
m = endsWithExclamation.search("Find the end of this sent!ence")
m.group()

AttributeError: 'NoneType' object has no attribute 'group'

> You can use the `^` and `$` together to indicate that the entire string must match the regex:

In [17]:
number = re.compile(r'^\d+$')
n = number.search('1234567890')
n.group()

'1234567890'

In [19]:
#this won't work!
number = re.compile(r'^\d+$')
n = number.search('1234a567890')
n.group()

AttributeError: 'NoneType' object has no attribute 'group'
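
Calling `group()` on a failed search raises the `AttributeError` shown above, so in practice it helps to check for `None` first. The sketch below wraps that check in a hypothetical helper named `describe` (not part of the `re` module) and also shows `fullmatch()`, which performs a similar whole-string check without needing `^` and `$`.

```python
import re

number = re.compile(r'^\d+$')

def describe(text):
    """Return the matched digits, or a message when there is no match."""
    m = number.search(text)
    if m is None:
        return 'no match'
    return m.group()

print(describe('1234567890'))    # 1234567890
print(describe('1234a567890'))   # no match

# fullmatch() succeeds only if the entire string matches the pattern.
digits = re.compile(r'\d+')
print(digits.fullmatch('1234567890') is not None)   # True
print(digits.fullmatch('1234a567890') is None)      # True
```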

#### The Pipe:  `|`
> You use the pipe character when you want to match one of many expressions. For example, if your expression is `A|B`, where `A` and `B` are regular expressions themselves, you'll have a regular expression that will match either `A` or `B`. The regular expressions separated by `'|'` are tried from left to right. When one of the patterns completely matches, that pattern is accepted and the search is over. This means that once `A` matches, `B` will not be tested further, even if it would produce a longer overall match. (The `findall()` method will still return every non-overlapping match in the string, whichever alternative produced it.) Practically speaking, this means the `'|'` operator is never *greedy*, which is a concept we'll cover later.

In [2]:
country = re.compile(r'America|France')
m = country.search('America and France are both countries.')
m.group()

'America'

In [21]:
m_all = country.findall('America and France are both countries.')
m_all

['America', 'France']
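
The left-to-right rule matters most when one alternative is a prefix of another. A minimal sketch, using the made-up string `'abc'`:

```python
import re

# 'a' is tried first and succeeds, so 'ab' is never attempted.
short_first = re.search(r'a|ab', 'abc').group()
print(short_first)   # a

# Listing the longer alternative first lets it win.
long_first = re.search(r'ab|a', 'abc').group()
print(long_first)    # ab
```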

#### Grouping with Parentheses:  `( )`
> Regular expressions are useful for dissecting strings into several subgroups based on whether or not they match different regular expressions. Adding parentheses to your regex will create *groups* in the expression, which then allow you to use the `group()` and `groups()` match object methods to return the matches from just one group or all of the groups.

In [56]:
g = re.compile(r'(ab)-(cd)')
g.match('ab-cd').groups()          #return all the groups as a tuple

('ab', 'cd')

> By passing integers to `group()`, you can return different groups of the match. Passing 0 or nothing to `group()` returns the whole match as a single string.

In [57]:
g = re.compile(r'(ab)-(cd)')

for i in range(3):
    print(g.match('ab-cd').group(i))

ab-cd
ab
cd
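
Groups are most useful when each one captures a different piece of the text. The sketch below pulls a name and a number out of a made-up `key=value` string; the pattern and sample text are assumptions for illustration. Match object methods such as `start()` and `span()` also accept a group number.

```python
import re

setting = re.compile(r'([a-z]+)=([0-9]+)')
m = setting.search('set width=80 now')

if m:
    print(m.groups())    # ('width', '80')
    print(m.group(0))    # width=80
    print(m.group(1))    # width
    print(m.group(2))    # 80
    print(m.span(2))     # (10, 12)
```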


<a id='special' ></a>
### Special Sequences
As we saw above, the backslash escapes special characters. But it can also be followed by certain characters to signal a special sequence. These special sequences represent some pretty useful predefined sets of characters, such as the set of digits, the set of letters, or the set of anything that isn’t whitespace.

Here's a list of some of the special sequences:

<table>
<tr><th>Special Sequence</th>
<th>Purpose</th>
</tr>
<tbody>
<tr><td><code><span>\d</span></code></td>
<td>Matches any decimal digit; this is equivalent to the class [0-9]</td>
</tr>
<tr><td><code><span>\D</span></code></td>
<td>Matches any non-digit character; this is equivalent to the class [^0-9]</td>
</tr>
<tr><td><code><span>\s</span></code></td>
<td>Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]</td>
</tr>
<tr><td><code><span>\S</span></code></td>
<td>Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v]</td>
</tr>
<tr><td><code><span>\w</span></code></td>
<td>Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class [a-zA-Z0-9_]</td>
</tr>
<tr><td><code><span>\W</span></code></td>
<td>Matches any character that is not alphanumeric or an underscore; for ASCII text this is equivalent to the class [^a-zA-Z0-9_]</td>
</tr>
</tbody>
</table>

Additionally, these special sequences can be included inside a character class. For example, `[\s,.]` is a character class that will match any whitespace character, or ',' or '.'.
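
Here is a short sketch of a few of these sequences at work. The sample strings are made up, and `findall()` is used only because it shows every match at once.

```python
import re

text = "Room 42, Floor 7."

single_digits = re.findall(r'\d', text)     # one digit at a time
numbers = re.findall(r'\d+', text)          # runs of digits
words = re.findall(r'\w+', text)            # runs of word characters
separators = re.findall(r'[\s,.]', text)    # whitespace, ',' or '.'

print(single_digits)   # ['4', '2', '7']
print(numbers)         # ['42', '7']
print(words)           # ['Room', '42', 'Floor', '7']
print(separators)      # [' ', ',', ' ', ' ', '.']

# \w also accepts the underscore, so this identifier stays in one piece.
identifiers = re.findall(r'\w+', 'snake_case name')
print(identifiers)     # ['snake_case', 'name']
```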

<a id='quant' ></a>
### Quantifiers

<table>
<tr><th>Quantifiers</th>
<th>Purpose</th>
</tr>
  <tr>
    <td><code>*</code></td>
    <td>0 or more (append <code>?</code> for non-greedy)</td>
  </tr>
  <tr>
    <td><code>+</code></td>
    <td>1 or more (append <code>?</code> for non-greedy)</td>
  </tr>
  <tr>
    <td><code>?</code></td>
    <td>0 or 1; i.e. to mark something as being optional (append <code>?</code> for non-greedy)</td>
  </tr>
  <tr>
    <td><code>{m}</code></td>
    <td>exactly <code>m</code> occurrences</td>
  </tr>
  <tr>
    <td><code>{m,n}</code></td>
    <td>from <code>m</code> to <code>n</code> occurrences, written with no space after the comma. Omitting <code>m</code> means a minimum of 0; omitting <code>n</code> means no upper limit</td>
  </tr>
  <tr>
    <td><code>{m,n}?</code></td>
    <td>from <code>m</code> to <code>n</code>, as few as possible</td>
  </tr>
</table>

#### Matching Zero or More with `*`
> The asterisk matches the group (or single character) that precedes it zero or more times.

In [59]:
ab_infinite = re.compile(r'(ab)*')
m = ab_infinite.search('abababababababababababababababababababababababababababab')
m.group()

'abababababababababababababababababababababababababababab'
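
The "zero" in "zero or more" has a consequence that is easy to miss: the pattern can succeed with an empty match. A small sketch with made-up input strings:

```python
import re

ab_infinite = re.compile(r'(ab)*')

# No 'ab' anywhere, yet the search succeeds with an empty match.
m = ab_infinite.search('xyz')
print(m is not None)     # True
print(repr(m.group()))   # ''
print(m.span())          # (0, 0)

# search() returns the leftmost match, which is still the empty
# match at position 0, even though 'abab' appears later on.
m2 = ab_infinite.search('xyzabab')
print(m2.span())         # (0, 0)
```

If an empty match isn't acceptable, `+` (covered below) requires at least one repetition.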

#### Optional Matching with `?`
> The question mark indicates that the group that precedes it is an optional part of the pattern.

In [61]:
person = re.compile(r'(wo)?man')
m = person.search('man')
m.group()

'man'

In [62]:
person = re.compile(r'(wo)?man')
m = person.search('woman')
m.group()

'woman'

#### Matching One or More with `+`
> Unlike the asterisk, which does not require its group to appear in the matched string, the group preceding a plus must appear at least once. 

In [68]:
person = re.compile(r'(wo)+man')
m = person.search('woman')
m.group()

'woman'

In [69]:
person = re.compile(r'(wo)+man')
m = person.search('man')
m.group()

AttributeError: 'NoneType' object has no attribute 'group'

#### Matching Specific Repetitions with {}
> If you have a group that you want to repeat a specific number of times, follow the group in your regex with a number in curly brackets. For example, the regex `(ab){2}` will match the string `'abab'` and nothing else, since it requires exactly two repetitions of the `(ab)` group.

In [74]:
two_matches = re.compile(r'(ab){2}')
m = two_matches.search('abababababababababababab')
m.group()

'abab'

> You can also specify a range by placing a minimum and maximum within the curly brackets. For example, the regex `(ab){2,5}` will match `'abab'`, `'ababab'`, `'abababab'`, and `'ababababab'`.

In [79]:
two_matches = re.compile(r'(ab){2,5}')
m = two_matches.search('abababababababababababab')
m.group()

'ababababab'

> You can also leave out the first or second number to make the minimum or maximum unbounded. The following example will find at least two but up to infinite repetitions of `'ab'`.

In [81]:
two_matches = re.compile(r'(ab){2,}')
m = two_matches.search('abababababababababababab')
m.group()

'abababababababababababab'

#### Greedy vs Nongreedy Matching
Above, we saw how `(ab){2,5}` could match two, three, four, or five instances of `'ab'`. But then why did `group()` return `'ababababab'` - the maximum match -  instead of all of the shorter matches?

This happened because Python’s regular expressions are **greedy** by default. This means regex patterns involving repetition will match the longest string possible. To make a quantifier non-greedy, i.e. to match the shortest string possible, append a question mark to it.

In [82]:
two_matches = re.compile(r'(ab){2,5}?')
m = two_matches.search('abababababababababababab')
m.group()

'abab'
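
The same distinction applies to `*` and `+`. A common illustration, sketched here with a made-up snippet of markup, is matching a single tag:

```python
import re

html = '<b>bold</b>'

# Greedy: .* runs to the last '>' in the string.
greedy = re.search(r'<.*>', html).group()
print(greedy)    # <b>bold</b>

# Non-greedy: .*? stops at the first '>' it can.
lazy = re.search(r'<.*?>', html).group()
print(lazy)      # <b>

tags = re.findall(r'<.*?>', html)
print(tags)      # ['<b>', '</b>']
```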

<a id='mod' ></a>
## Modifying Strings
So far we’ve just been searching for patterns in a string. But we can also use regular expressions to modify strings. Here are a few methods that'll help us with this:

<table>
<colgroup>
<col width="28%" />
<col width="72%" />
</colgroup>
<thead valign="bottom">
<tr><th>Method/Attribute</th>
<th>Purpose</th>
</tr>
</thead>
<tbody valign="top">
<tr><td><code><span>split()</span></code></td>
<td>Split the string into a list, splitting it
wherever the RE matches</td>
</tr>
<tr><td><code><span>sub()</span></code></td>
<td>Find all substrings where the RE matches, and
replace them with a different string</td>
</tr>
<tr><td><code><span>subn()</span></code></td>
<td>Does the same thing as <code><span>sub()</span></code>,  but
returns the new string and the number of
replacements</td>
</tr>
</tbody>
</table>

#### Splitting Strings
> `split(string[, maxsplit=0])`

> Split string by the matches of the regular expression. If capturing parentheses are used in the regex, then their contents will also be returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits are performed.

In [83]:
p = re.compile(r'\W+')
p.split('This is a test, short and sweet, of split().')

['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']

In [84]:
p.split('This is a test, short and sweet, of split().', 3)

['This', 'is', 'a', 'test, short and sweet, of split().']
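
The effect of capturing parentheses on `split()` can be sketched as follows. The delimiter pattern and sample text are assumptions for illustration.

```python
import re

text = 'one, two; three'

# Without a capture group, the delimiters are discarded.
plain = re.compile(r'[,;]\s*')
parts = plain.split(text)
print(parts)        # ['one', 'two', 'three']

# With a capture group, each captured delimiter is kept in the list.
keeping = re.compile(r'([,;])\s*')
with_delims = keeping.split(text)
print(with_delims)  # ['one', ',', 'two', ';', 'three']
```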

#### Search and Replace
Another common task is to find all the matches for a pattern, and replace them with a different string. The `sub()` method takes a replacement value, which can be either a string or a function, and the string to be processed.
> `sub(replacement, string[, count=0])`

>Returns the string obtained by replacing the leftmost non-overlapping occurrences of the regex in `string` by the replacement `replacement`. If the pattern isn’t found, `string` is returned unchanged.

>The optional argument `count` is the maximum number of pattern occurrences to be replaced; `count` must be a non-negative integer. The default value of 0 means to replace all occurrences.

>Here’s a simple example of using the `sub()` method. It replaces colour names with the word colour:

In [86]:
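p = re.compile(r'(blue|white|red)')  # raw string, as recommended earlier

# count=0 replaces every match; count=1 and count=2 cap the replacements
for i in range(3):
    print(p.sub('colour', 'blue socks, red shoes, and white pants', count=i))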

colour socks, colour shoes, and colour pants
colour socks, red shoes, and white pants
colour socks, colour shoes, and white pants
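
The replacement can also be a function, and `subn()` reports how many replacements were made. In this sketch, `shout` is a hypothetical helper written for the example; it receives each match object and returns the replacement text.

```python
import re

p = re.compile(r'(blue|white|red)')
text = 'blue socks, red shoes, and white pants'

def shout(match):
    """Return the matched colour name in upper case."""
    return match.group().upper()

loud = p.sub(shout, text)
print(loud)      # BLUE socks, RED shoes, and WHITE pants

# subn() returns a tuple: (new string, number of replacements).
result = p.subn('colour', text)
print(result)    # ('colour socks, colour shoes, and colour pants', 3)
```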


<a id='takeaways' ></a>
## Takeaways
Regular expressions let you specify and find character patterns. They're super helpful in Python but are also ubiquitous across and beyond programming languages. For example, Google Sheets has a find-and-replace feature that lets you search using regular expressions:
<img src="assets/gs_regex.png">
You can read more about regular expressions in general [here](http://www.regular-expressions.info/).

<a id='applications' ></a>
## Applications
[Phone Number Extractor](https://stackoverflow.com/questions/16699007/regular-expression-to-match-standard-10-digit-phone-number)

[Email Address Extractor](http://www.regular-expressions.info/email.html)