# String operations and regular expressions

Today we talk about strings. When we have a string, we might want to ask whether it has particular characteristics---does it start with a particular character? Does it contain within it another string?---or try to extract smaller parts of the string, like the first fifteen characters, or say, the part of the string inside parentheses. Or we may want to transform the string into another string altogether, by (for example) converting its characters to upper case, or replacing substrings within it with other substrings. Today we discuss how to do these things in Python.

This notebook is adapted from Dennis Tenin's [Lede Program](https://github.com/ledeprogram/courses/blob/master/README.md)

## Simple string checks

There are a number of functions, methods and operators that can tell us whether or not a Python string matches certain characteristics. Let's talk about the `in` operator first:

In [40]:
"foo" in "buffoon"

True

In [41]:
"foo" in "reginald"

False

The `in` operator takes one expression evaluating to a string on the left and another on the right, and returns `True` if the string on the left occurs somewhere inside of the string on the right.

We can check to see if a string begins with or ends with another string using that string's `.startswith()` and `.endswith()` methods, respectively:

In [42]:
"foodie".startswith("foo")

True

In [43]:
"foodie".endswith("foo")

False

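The `.isdigit()` method returns `True` if the string is non-empty and every character in it is a digit, and `False` otherwise. (This means that strings like `"-5"` or `"3.14"` give `False`, even though they represent numbers.)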

In [44]:
print("foodie".isdigit())
print("4567".isdigit())

False
True
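A few edge cases make the behavior of `.isdigit()` easier to see. This is a small sketch; the sample strings are made up for illustration and are not part of the original notebook:

```python
# .isdigit() looks at the characters, not at the numeric value:
# every character must be a digit, and the string must not be empty.
samples = ["4567", "-4567", "45.67", "", "4 567"]
results = {s: s.isdigit() for s in samples}
for s, is_digits in results.items():
    print(repr(s), is_digits)
```

Only the first string passes; the minus sign, the decimal point, the empty string and the embedded space all make the check fail.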


And the `.islower()` and `.isupper()` methods return `True` if all of the letters in the string are lower case or upper case, respectively (and `False` otherwise). Characters without case, like spaces and punctuation, are ignored.

In [45]:
print("foodie".islower())
print("foodie".isupper())

True
False


In [46]:
print("YELLING ON THE INTERNET".islower())
print("YELLING ON THE INTERNET".isupper())

False
True


## Finding substrings

The `in` operator discussed above will tell us if a substring occurs in some other string. If we want to know *where* that substring occurs, we can use the `.find()` method. The `.find()` method takes a single parameter between its parentheses: an expression evaluating to a string, which will be searched for within the string whose `.find()` method was called. If the substring is found, the entire expression will evaluate to the index at which the substring is found. If the substring is not found, the expression evaluates to `-1`. To demonstrate:

In [47]:
print("Now is the winter of our discontent".find("win"))
print("Now is the winter of our discontent".find("lose"))

11
-1
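`.find()` only reports the first occurrence. It also accepts an optional second argument, the index at which to start searching, so one possible way to collect every position is to call it in a loop. A minimal sketch (the variable names are illustrative):

```python
text = "I got rhythm, I got music, I got my man"
positions = []
start = text.find("I got")
while start != -1:
    positions.append(start)
    # resume the search one character past the previous hit
    start = text.find("I got", start + 1)
print(positions)
```

The loop stops as soon as `.find()` returns `-1`, which is its way of saying "nothing more to find."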


The `.count()` method will return the number of times a particular substring is found within the larger string:

In [48]:
print("I got rhythm, I got music, I got my man, who could ask for anything more".count("I got"))

3


## String slices

As has been alluded to previously, string slices work exactly like list slices---except you're getting characters from the string, instead of elements from a list. Observe:

In [49]:
message = "bungalow"
message[3]

'g'

In [50]:
message[1:6]

'ungal'

In [51]:
message[:3]

'bun'

In [52]:
message[2:]

'ngalow'

In [53]:
message[-2]

'o'

Combine this with the `.find()` method and you can write expressions that evaluate to everything from where a substring matches up to the end of the string:

In [54]:
shakespeare = "Now is the winter of our discontent"
substr_index = shakespeare.find("win")
print(shakespeare[substr_index:])

winter of our discontent
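The same combination of `.find()` and slicing can pull out the part of a string inside parentheses. Here is a sketch under the assumption that there is at most one pair of parentheses; the sample string is invented for illustration:

```python
title = "Blue Girl (option expires today) - need to know"
open_idx = title.find("(")
close_idx = title.find(")", open_idx + 1)
if open_idx != -1 and close_idx != -1:
    # slice from just after "(" up to, but not including, ")"
    inside = title[open_idx + 1:close_idx]
else:
    inside = ""
print(inside)
```

Checking for `-1` first matters: slicing with `-1` as an index would silently give the wrong substring rather than raise an error.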


## Simple string transformations

Python strings have a number of different methods which, when called on a string, return a copy of that string with a simple transformation applied to it. These are helpful for normalizing and cleaning up data, or preparing it to be displayed.

Let's start with `.lower()`, which evaluates to a copy of the string in all lower case:

In [55]:
"ARGUMENTATION! DISAGREEMENT! STRIFE!".lower()

'argumentation! disagreement! strife!'

The converse of `.lower()` is `.upper()`:

In [56]:
"e.e. cummings is. not. happy about this.".upper()

'E.E. CUMMINGS IS. NOT. HAPPY ABOUT THIS.'

The method `.title()` evaluates to a copy of the string it's called on, replacing every letter at the beginning of a word in the string with a capital letter:

In [57]:
"dr. strangelove, or, how I learned to love the bomb".title()

'Dr. Strangelove, Or, How I Learned To Love The Bomb'

The `.strip()` method removes any whitespace from the beginning and end of the string (but not between characters later in the string):

In [58]:
" got some random whitespace in some places here     ".strip()

'got some random whitespace in some places here'

Finally, the `.replace()` method takes two parameters: a string to find, and a string to replace that string with whenever it's found. You can use this to make sad stories.

In [59]:
"I got rhythm, I got music, I got my man, who could ask for anything more".replace("I got", "I used to have")

'I used to have rhythm, I used to have music, I used to have my man, who could ask for anything more'
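Because each of these methods returns a new string, calls can be chained. As a rough sketch of the "normalizing and cleaning up data" use case mentioned above (the list of names is hypothetical):

```python
raw_names = ["  Alice SMITH ", "bob jones", "CAROL   "]
# strip the stray whitespace first, then fix the capitalization
cleaned = [name.strip().title() for name in raw_names]
print(cleaned)
```

The original strings in `raw_names` are left unchanged; Python strings are immutable, so every transformation produces a copy.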

### "Escape" sequences in strings

Inside of strings that you type into your Python code, there are certain sequences of characters that have a special meaning. These sequences start with a backslash character (`\`) and allow you to insert into your string characters that would otherwise be difficult to type, or that would go against Python syntax. Here's some code illustrating a few common sequences:

In [60]:
print("include \"double quotes\" (inside of a double-quoted string)")
print('include \'single quotes\' (inside of a single-quoted string)')
print("one\ttab, two\t\ttabs")
print("new\nline")
print("include an actual backslash \\ (two backslashes in the string)")

include "double quotes" (inside of a double-quoted string)
include 'single quotes' (inside of a single-quoted string)
one	tab, two		tabs
new
line
include an actual backslash \ (two backslashes in the string)


## Regular expressions

So far, we've discussed how to write programs and expressions that are able to check whether strings meet very simple criteria, such as “does this string begin with a particular character” or “does this string contain another string”? But imagine writing a program that performs the following task: find and print all ZIP codes in a string (i.e., a five-character sequence of digits). Give up? Here’s my attempt, using only the tools we’ve discussed so far:

In [61]:
input_str = "here's a zip code: 12345. 567 isn't a zip code, but 45678 is. 23456? yet another zip code."
current = ""
zips = []
for ch in input_str:
    if ch in '0123456789':
        current += ch
    else:
        current = ""
    if len(current) == 5:
        zips.append(current)
        current = ""
print(zips)

['12345', '45678', '23456']


Basically, we have to iterate over each character in the string, check to see if that character is a digit, append to a string variable if so, continue reading characters until we reach a non-digit character, check to see if we found exactly five digit characters, and add it to a list if so. At the end, we print out the list that has all of our results. Problems with this code: it’s messy; it doesn’t overtly communicate what it’s doing; it’s not easily generalized to other, similar tasks (e.g., if we wanted to write a program that printed out phone numbers from a string, the code would likely look completely different).

Our ancient UNIX pioneers had this problem, and in pursuit of a solution, thought to themselves, "Let’s make a tiny language that allows us to write specifications for textual patterns, and match those patterns against strings. No one will ever have to write fiddly code that checks strings character-by-character ever again." And thus regular expressions were born.

Here's the code for accomplishing the same task with regular expressions, by the way:

In [62]:
import re
zips = re.findall(r"\d{5}", input_str)
print(zips)

['12345', '45678', '23456']


I’ll allow that the `r"\d{5}"` in there is mighty cryptic (though hopefully it won’t be when you’re done reading this page and/or participating in the associated lecture). But the overall structure of the program is much simpler.
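To illustrate the point about generalizing, here is a sketch of the phone number task mentioned earlier. Only the pattern changes; the surrounding code is the same. The `NNN-NNN-NNNN` format and the sample text are assumptions made for this example:

```python
import re

text = "call 555-867-5309 or 555-123-4567, not 12-34"
# three digits, a hyphen, three digits, a hyphen, four digits
phones = re.findall(r"\d{3}-\d{3}-\d{4}", text)
print(phones)
```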

### Fetching our corpus

For this section of class, we'll be using the subject lines of all e-mails in the [EnronSent corpus](http://verbs.colorado.edu/enronsent/), kindly put into the public domain by the United States Federal Energy Regulatory Commission.

### Matching strings with regular expressions

The most basic operation that regular expressions perform is matching strings: you’re asking the computer whether a particular string matches some description. We're going to be using regular expressions to print only those lines from our `enronsubjects.txt` corpus that match particular sequences. Let's load our corpus into a list of lines first:

In [63]:
subjects = [x.strip() for x in open("data/enronsubjects.txt").readlines()]

In [64]:
subjects

['# This file contains the subject lines from every message in the EnronSent corpus.',
 '# For more information, see http://verbs.colorado.edu/enronsent',
 '',
 'Headcount',
 'utilities roll',
 'utilities roll',
 'TIME SENSITIVE: Executive Impact & Influence Program Survey',
 'TIME SENSITIVE: Executive Impact & Influence Program Survey',
 'Wow',
 'Wow',
 'Wow',
 'Wow',
 'Re:',
 'Re:',
 'Re:',
 'RE: Receipt of Team Selection Form - Executive Impact & Influence',
 'RE: Receipt of Team Selection Form - Executive Impact & Influence',
 'Receipt of Team Selection Form - Executive Impact & Influence',
 'FYI',
 'FYI',
 'Re: Transportation Reports',
 'Re: Western Gas Market Report -- Draft',
 'Receipt of Team Selection Form - Executive Impact & Influence',
 'Receipt of Team Selection Form - Executive Impact & Influence Program',
 'Re: (No Subject)',
 'Re: Security Request: CLOG-4NNJEZ has been Denied.',
 'New Generation',
 'New Generation',
 'Re: Meeting to discuss 2001 direct expense plan?',
 

We can check whether or not a pattern matches a given string in Python with the `re.search()` function. The first parameter to search is the regular expression you're trying to match; the second parameter is the string you're matching against.

Here's an example, using a very simple regular expression. The following code prints out only those lines in our Enron corpus that match the (very simple) regular expression `shipping`:

In [65]:
import re
[line for line in subjects if re.search("shipping", line)]

['FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'lng shipping']

At its simplest, a regular expression matches a string if that string contains exactly the characters you've specified in the regular expression. So the expression `shipping` matches strings that contain exactly the sequences of `s`, `h`, `i`, `p`, `p`, `i`, `n`, and `g` in a row. If the regular expression matches, `re.search()` evaluates to `True` and the matching line is included in the evaluation of the list comprehension.

> BONUS TECH TIP: `re.search()` doesn't actually evaluate to `True` or `False`---it evaluates to either a `Match` object if a match is found, or `None` if no match was found. Those two count as `True` and `False` for the purposes of an `if` statement, though.
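To see what the tip above means in practice, this short sketch inspects the value that `re.search()` returns. The two subject lines are taken from the corpus output shown earlier:

```python
import re

match = re.search(r"shipping", "Re: lng shipping")
if match:
    print(match.group())               # the text that matched
    print(match.start(), match.end())  # where the match begins and ends

no_match = re.search(r"shipping", "Headcount")
print(no_match)
```

The `Match` object carries the matched text and its position, which is why it is more useful than a plain `True`.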

### Metacharacters: character classes

The "shipping" example is pretty boring. (There was hardly any fan fiction in there at all.) Let's go a bit deeper into detail with what you can do with regular expressions. There are certain characters or strings of characters that we can insert into a regular expression that have special meaning. For example:

In [66]:
[line for line in subjects if re.search("sh.pping", line)]

['FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'FW: How to use UPS for shipping on the internet',
 'How to use UPS for shipping on the internet',
 "FW: We've been shopping!",
 'Re: Start shopping...',
 'Start shopping...',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'lng shipping/mosk meeting in tokyo 2nd of feb',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'lng shipping',
 'lng shipping',
 'Re: lng shipping',
 'lng shipping',
 'FW: Online shopping',
 'Online shopping']

In a regular expression, the character `.` means "match any character here" (any character except a newline, by default). So, using the regular expression `sh.pping`, we get lines that match `shipping` but also `shopping`. The `.` is an example of a regular expression *metacharacter*: a character (or string of characters) that has a special meaning.

Here are a few more metacharacters. These metacharacters allow you to say that a character belonging to a particular *class* of characters should be matched in a particular position:

| metacharacter | meaning |
|---------------|---------|
| `.` | match any character (except a newline, by default) |
| `\w` | match any alphanumeric ("*w*ord") character (lowercase and capital letters, 0 through 9, underscore) |
| `\s` | match any whitespace character (e.g., space, tab, newline) |
| `\S` | match any non-whitespace character (the inverse of `\s`) |
| `\d` | match any digit (0 through 9) |
| `\.` | match a literal `.` |

Here, for example, is a (clearly imperfect) regular expression to search for all subject lines containing a time of day:

In [67]:
[line for line in subjects if re.search(r"\d:\d\d\wm", line)]

['RE: 3:17pm',
 '3:17pm',
 "RE: It's On!!! - 2:00pm Today",
 "FW: It's On!!! - 2:00pm Today",
 "It's On!!! - 2:00pm Today",
 'Re: Registration Confirmation: Larry Summers on 12/6 at 1:45pm (was',
 'Re: Conference Call today 2/9/01 at 11:15am PST',
 'Conference Call today 2/9/01 at 11:15am PST',
 '5/24 1:00pm conference call.',
 '5/24 1:00pm conference call.',
 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',
 'FW: 07:33am EDT 15-Aug-01 Prudential Securities (C',
 '07:33am EDT 15-Aug-01 Prudential Securities (C',
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Re: Updated Mar'00 Requirements Received at 11:25am from CES",
 "Updated Mar'00 Requirements Received at 11:25am from CES",
 'Reminder: Legal Team Meeting -- Friday, 9:00am Houston time',
 'Thursday, March 7th 1:30-3:00pm: REORIENTATION',
 'Meeting at 2:00pm Friday',
 'Meeting at 2:00pm Friday',
 'Fw: 12:30pm Deadline for changes to letters or

Here's that regular expression again: `r"\d:\d\d\wm"`. I'm going to show you how to read this, one unit at a time.

"Hey, regular expression engine. Tell me if you can find this pattern in the current string. First of all, look for any digit (`\d`). If you find that, look for a colon right after it (`:`). If you find that, look for another digit right after it (`\d`), and then one more digit after that (`\d`). If you find *those*, look for any alphanumeric character (`\w`): you know, a letter, a number, an underscore. If you find that, then look for an `m`. Good? If you found all of those things in a row, then the pattern matched."

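The reading above can be checked against a handful of strings. This is only a sketch; the candidate strings are partly drawn from the corpus output and partly invented to show where the pattern is too strict or too loose:

```python
import re

pattern = r"\d:\d\d\wm"
candidates = [
    "Meeting at 2:00pm Friday",  # digit, colon, two digits, "p", "m"
    "11:15am PST",               # the match starts at the second "1"
    "2:00 pm",                   # a space is not a \w character
    "score was 3:17",            # nothing after the digits
    "7:45xm",                    # not a real time, but \w accepts "x"
]
matches = {s: bool(re.search(pattern, s)) for s in candidates}
for s, found in matches.items():
    print(found, s)
```

The last two cases show why the expression was called "clearly imperfect": it misses `2:00 pm` and happily accepts `7:45xm`.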
#### But what about that weirdo `r""`?

Python provides another way to include string literals in your program, in addition to the single- and double-quoted strings we've already discussed. The `r""` string literal, or "raw" string, includes all characters inside the quotes literally, without interpreting backslash escape sequences. Here's an example:

In [68]:
print("this is\na test")
print(r"this is\na test")

this is
a test
this is\na test


In [69]:
print("one\ttab, two\t\ttabs")
print(r"one\ttab, two\t\ttabs")

one	tab, two		tabs
one\ttab, two\t\ttabs


As you can see, whereas a double- or single-quoted string literal interprets `\n` as a newline character, the raw string includes those characters exactly as they were written. More important, for our purposes at least, is that in a raw string we only need to write one backslash in order to get a literal backslash in our string.

Why is this important? Because regular expressions use backslashes all the time, and we don't want Python to try to interpret those backslashes as special characters. (Inside a regular string, we'd have to write a simple regular expression like `\b\w+\b` as `\\b\\w+\\b`---yecch.)

So the basic rule of thumb is this: use r"" to quote any regular expressions in your program. All of the examples you'll see below will use this convention.
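A quick way to convince yourself that the two spellings produce the same pattern is to compare them directly. A minimal sketch:

```python
plain = "\\d{5}"   # ordinary string: each backslash has to be doubled
raw = r"\d{5}"     # raw string: written the way the regex reads
same_pattern = (plain == raw)
print(same_pattern)

# In a raw string, \n stays as two characters: a backslash and an "n".
raw_len = len(r"\n")
plain_len = len("\n")
print(raw_len, plain_len)
```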

### Character classes in-depth

You can define your own character classes by enclosing a list of characters, or range of characters, inside square brackets:

| regex | explanation |
|-------|-------------|
| `[aeiou]` | matches any vowel |
| `[02468]` | matches any even digit |
| `[a-z]` | matches any lower-case letter |
| `[A-Z]` | matches any upper-case letter |
| `[^0-9]` | matches any non-digit (the ^ inverts the class, matches anything not in the list) |
| `[Ee]` | matches either `E` or `e` |
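To try the classes in the table on something small, here is a sketch using `re.findall()`, which returns every match in a string. The test string is made up:

```python
import re

word = "Zephyr 2048"
vowels = re.findall(r"[aeiou]", word)       # "y" is not in the class
even_digits = re.findall(r"[02468]", word)
capitals = re.findall(r"[A-Z]", word)
non_digits = re.findall(r"[^0-9]", word)    # includes the space
print(vowels, even_digits, capitals, non_digits)
```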

Let's find every subject line where we have four or more vowels in a row:

In [70]:
[line for line in subjects if re.search(r"[aeiou][aeiou][aeiou][aeiou]", line)]

['Re: Natural gas quote for Louiisiana-Pacific (L-P)',
 'WooooooHoooooo more Vacation',
 'Re: Clickpaper Counterparties waiting to clear the work queue',
 'Gooooooooooood Bye!',
 'Gooooooooooood Bye!',
 'RE: Hello Sweeeeetie',
 'Hello Sweeeeetie',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'Re: FW: Wasss Uuuuuup STG?',
 'RE: Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'FW: The Osama Bin Laden Song ( Soooo Funny !! )',
 'Fw: The Osama Bin Laden Song ( Soooo Funny !! )',
 'The Osama Bin Laden Song ( Soooo Funny !! )',
 'RE: duuuuhhhhh',
 'RE: duuuuhhhhh',
 'RE: duuuuhhhhh',
 'duuuuhhhhh',
 'RE: duuuuhhhhh',
 'duuuuhhhhh',
 'RE: FPL Queue positions 1-15',
 'Re: FPL Queue positions 1-15',
 'Re: Helloooooo!!!',
 'Re: Helloooooo!!!',
 'Fw: FW: OOOooooops',
 'FW: FW: OOOooooops',
 'Re: yeeeeha',
 'yeeeeha',
 'yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo

### Metacharacters: anchors

The next important kind of metacharacter is the *anchor*. An anchor doesn't match a character, but matches a particular place in a string.

| anchor | meaning |
|--------|---------|
| `^` | match at beginning of string |
| `$` | match at end of string |
| `\b` | match at word boundary |

> Note: `^` in a character class has a different meaning from `^` outside a character class!

> Note #2: If you want to search for a literal dollar sign (`$`), you need to put a backslash in front of it, like so: `\$`
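Anchors are easiest to understand by comparing patterns with and without them. A small sketch, using an invented list of lines:

```python
import re

lines = ["New York Hotel", "Trip to New York", "oil prices", "boiling point"]

starts = [l for l in lines if re.search(r"^New York", l)]   # at the beginning
ends = [l for l in lines if re.search(r"New York$", l)]     # at the end
whole_word = [l for l in lines if re.search(r"\boil\b", l)]  # "oil" as a word
anywhere = [l for l in lines if re.search(r"oil", l)]        # "oil" anywhere
print(starts, ends, whole_word, anywhere)
```

Without the `\b` anchors, `oil` also matches inside `boiling`; with them, only the standalone word counts.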

Now we have enough regular expression knowledge to do some fairly sophisticated matching. As an example, all the subject lines that begin with the string `New York`, regardless of whether or not the initial letters were capitalized:

In [71]:
[line for line in subjects if re.search(r"^[Nn]ew [Yy]ork", line)]

['New York Details',
 'New York Power Authority',
 'New York Power Authority',
 'New York Power Authority',
 'New York Power Authority',
 'New York',
 'New York',
 'New York',
 'New York, etc.',
 'New York, etc.',
 'New York sites',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York Hotel',
 'New York',
 'New York',
 'New York City Marathon Guaranteed Entry',
 'new york rest reviews',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas Corporation ("NYSEG")',
 'New York State Electric & Gas ("NYSEG")',
 'New York regulatory restriccions',
 'New York regulatory restriccions',
 'New York Bar Numbers']

Every subject line that ends with an ellipsis:

In [72]:
[line for line in subjects if re.search(r"\.\.\.$", line)]

['Re: Inquiry....',
 'Re: Inquiry....',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'RE: the candidate we spoke about this morning...',
 'the candidate we spoke about this morning...',
 'Re: Hmmmmm........',
 'Hmmmmm........',
 'FW: Bumping into the husband....',
 'FW: Bumping into the husband....',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 'Re: try this one...',
 'try this one...',
 'RE: try this one...',
 'RE: try this one...',
 '

Every subject line that has the word `oil` in it:

In [73]:
[line for line in subjects if re.search(r"\b[Oo]il\b", line)]

['Re: PIRA Global Oil and Natural Outlooks- Save these dates.',
 'PIRA Global Oil and Natural Outlooks- Save these dates.',
 'Re: PIRA Global Oil and Natural Outlooks- Save these dates.',
 '=09PIRA Global Oil and Natural Outlooks- Save these dates.',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Cabot Oil & Gas Marketing Corp. - Amendment and Confirmations to',
 'Cabot Oil & Gas Marketing Corp. - Amendment and Confirmations to',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil & Gas Marketing Corp. - 9/99 production - price',
 'Re: Cabot Oil &

### Metacharacters: quantifiers

Above we had a regular expression that looked like this:

    [aeiou][aeiou][aeiou][aeiou]
    
Typing out all of those things is kind of a pain. Fortunately, there’s a way to specify how many times to match a particular character, using quantifiers. A quantifier affects the character (or character class, or group) that immediately precedes it:

| quantifier | meaning |
|------------|---------|
| `{n}` | match exactly n times |
| `{n,m}` | match at least n times, but no more than m times |
| `{n,}` | match at least n times |
| `+` | match at least once (same as {1,}) |
| `*` | match zero or more times |
| `?` | match one time or zero times |
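To make the table concrete, here is a sketch that runs each kind of quantifier against tiny made-up strings. The `\b` anchors keep a short match from being found inside a longer run of letters:

```python
import re

text = "a aa aaa aaaa"
exactly_two = re.findall(r"\ba{2}\b", text)
two_to_three = re.findall(r"\ba{2,3}\b", text)
three_plus = re.findall(r"\ba{3,}\b", text)

one_or_more = re.findall(r"go+d", "gd god good")   # needs at least one "o"
zero_or_more = re.findall(r"go*d", "gd god good")  # "gd" is fine too

# "?" makes the "u" optional, so both spellings match
optional_u = [w for w in ["color", "colour", "colouur"]
              if re.search(r"^colou?r$", w)]
print(exactly_two, two_to_three, three_plus)
print(one_or_more, zero_or_more, optional_u)
```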

Here's a regular expression that finds subjects that contain at least fifteen capital letters in a row:

In [74]:
[line for line in subjects if re.search(r"[A-Z]{15,}", line)]

['CONGRATULATIONS!',
 'CONGRATULATIONS!',
 'Re: FW: Fw: Fw: Fw: Fw: Fw: Fw: PLEEEEEEEEEEEEEEEASE READ!',
 'ACCOMPLISHMENTS',
 'ACCOMPLISHMENTS',
 'Re: FW: FORM: BILATERAL CONFIDENTIALITY AGREEMENT',
 'FORM: BILATERAL CONFIDENTIALITY AGREEMENT',
 'Re: CONGRATULATIONS!',
 'CONGRATULATIONS!',
 'Re: ORDER ACKNOWLEDGEMENT',
 'ORDER ACKNOWLEDGEMENT',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'Re: CONGRATULATIONS',
 'CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'RE: CONGRATULATIONS',
 'Re: CONGRATULATIONS',
 'CONGRATULATIONS',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'VEPCO INTERCONNECTION AGREEMENT',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'Re: VEPCO INTERCONNECTION AGREEMENT',
 'VEPCO INTERCONNECTION AGREEMENT',
 'Re: CONGRATULATIONS !',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'Re: FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAAAAAAAAAAAAAABI!',
 'FW: WASSSAA

Lines that contain five consecutive vowels:

In [75]:
[line for line in subjects if re.search(r"[aeiou]{5}", line)]

['WooooooHoooooo more Vacation',
 'Gooooooooooood Bye!',
 'Gooooooooooood Bye!',
 'RE: Hello Sweeeeetie',
 'Hello Sweeeeetie',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'FW: Waaasssaaaaabi !',
 'Re: FW: Wasss Uuuuuup STG?',
 'RE: Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'Rrrrrrrooooolllllllllllll TIDE!!!!!!!!',
 'Re: Helloooooo!!!',
 'Re: Helloooooo!!!',
 'Fw: FW: OOOooooops',
 'FW: FW: OOOooooops',
 'yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'RE: yahoooooooooooooooooooo',
 'yahoooooooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 'RE: I hate yahooooooooooooooo',
 'I hate yahooooooooooooooo',
 "FW: duuuuuuuuuuuuuuuuude...........what's up?",
 "RE: duuuuuuuuuuuuuuuuude...........what's up?",
 "RE: duuuuuuuuuuuuuuuuude...........what's up?",
 'Re: skiiiiiiiiing',
 'skiiiiiiiiin

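Count the number of lines that are e-mail forwards, regardless of whether the subject line begins with `Fw:`, `FW:` or `Fwd:`. (As written, the pattern does not match `FWD:` with a capital `D`, because `d?` only allows a lower-case `d`.)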

In [76]:
# can we improve this search?
len([line for line in subjects if re.search(r"^F[Ww]d?:", line)])

20159
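One possible answer to the question in the code comment: the pattern `^F[Ww]d?:` accepts an optional lower-case `d` but not a capital `D`. The sketch below compares it with a variant that uses `[Dd]?`, on a small invented list rather than the real corpus, so the count above is not re-computed here:

```python
import re

sample = ["Fw: hello", "FW: hello", "Fwd: hello", "FWD: hello",
          "Re: hello", "Forward this"]

original = [s for s in sample if re.search(r"^F[Ww]d?:", s)]
improved = [s for s in sample if re.search(r"^F[Ww][Dd]?:", s)]
print(original)
print(improved)
```

Another option would be to pass `re.IGNORECASE` as a third argument to `re.search()`, at the cost of also accepting spellings like `fw:`.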

Lines that have the word `news` in them and end in an exclamation point:

In [77]:
[line for line in subjects if re.search(r"\b[Nn]ews\b.*!$", line)]

['RE: Christmas Party News!',
 'FW: Christmas Party News!',
 'Christmas Party News!',
 'Good News!',
 'Good News--Twice!',
 'Re: VERY Interesting News!',
 'Great News!',
 'Re: Great News!',
 'News Flash!',
 'RE: News Flash!',
 'RE: News Flash!',
 'News Flash!',
 'RE: Good News!',
 'RE: Good News!',
 'RE: Good News!',
 'RE: Good News!',
 'Good News!',
 'RE: Good News!!!',
 'Good News!!!',
 'RE: Big News!',
 'Big News!',
 'Individual.com - News From a Friend!',
 'Individual.com - News From a Friend!',
 'Re: Individual.com - News From a Friend!',
 'RE: We need news!',
 '=09We need news!',
 'RE: Big News!',
 'FW: Big News!',
 'RE: Big News!',
 'FW: Big News!',
 'Big News!',
 'FW: NW Wine News- Eroica, Sineann, Bergstrom, Hamacher, And more!',
 '=09NW Wine News- Eroica, Sineann, Bergstrom, Hamacher, And more!',
 'RE: Good News!!!',
 'Good News!!!',
 'Re: Big News!',
 'Big News!',
 'RE: Good  News!',
 'Good  News!']

### Metacharacters: alternation

One final bit of regular expression syntax: alternation.

* `(?:x|y)`: match either x or y
* `(?:x|y|z)`: match x, y or z
* etc.

So for example, if you wanted to count every subject line that begins with either `Re:` or `Fwd:`:

In [78]:
len([line for line in subjects if re.search(r"^(?:Re|Fwd):", line)])

39901
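The `(?:...)` wrapper matters because `|` otherwise splits the entire pattern in two. A sketch on a made-up list shows the difference:

```python
import re

sample = ["Re: lunch", "Fwd: lunch", "Reminder: lunch", "lunch"]

# grouped: the line must start with "Re" or "Fwd", followed by a colon
grouped = [s for s in sample if re.search(r"^(?:Re|Fwd):", s)]
# ungrouped: this means "^Re" OR "Fwd:", which is not the same thing
ungrouped = [s for s in sample if re.search(r"^Re|Fwd:", s)]
print(grouped)
print(ungrouped)
```

The ungrouped version also picks up `Reminder: lunch`, because the `^Re` alternative no longer requires a colon.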

Every subject line that mentions a primary color:

In [79]:
[line for line in subjects if re.search(r"\b(?:[Rr]ed|[Yy]ellow|[Bb]lue)\b", line)]

['Re: Blue Dolphin Pipe',
 'Blue Dolphin Pipe',
 'FW: Red Rock expansion',
 'FW: Red Rock expansion',
 'Red Rock expansion',
 'Re: Red Rock GE/NP Emissions',
 'Re: Red Rock GE/NP Emissions',
 'Re: Red Rock GE/NP Emissions',
 'Red Rock GE/NP Emissions',
 'Air Permit Delay, Red Rock Expansion',
 'RE: Red Rock Air Permits Heads up!',
 'RE: Red Rock Air Permits Heads up!',
 'Red Rock Air Permits Heads up!',
 'RE: FW: Red Rock Expansion Station 4',
 'RE: FW: Red Rock Expansion Station 4',
 'RE: FW: Red Rock Expansion Station 4',
 'Re: FW: Red Rock Expansion Station 4',
 'FW: Red Rock Expansion Station 4',
 'Red Rock Expansion Station 4',
 'summary of red rock contracts',
 'Re: Yellow Book',
 'Re: Now we have red ones',
 'Re: Red Herring Res',
 'Blue Jean Shirts',
 'Blue Jean Shirts',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',
 'Blue Girl - $1.2MM option expires today - need to know whether to',


## Capturing what matches

The `re.search()` function allows us to check to see *whether or not* a string matches a regular expression. Sometimes we want to find out not just if the string matches, but also what, exactly, in the string matched. In other words, we want to *capture* whatever it was that matched.

The easiest way to do this is with the `re.findall()` function, which takes a regular expression and a string to match it against, and returns a list of all parts of the string that the regular expression matched. Here's an example:

In [80]:
import re
print(re.findall(r"\b\w{5}\b", "alpha beta gamma delta epsilon zeta eta theta"))

['alpha', 'gamma', 'delta', 'theta']


The regular expression above, `\b\w{5}\b`, means "find me strings of exactly five word characters (letters, digits or underscores) between word boundaries"; in other words, find me five-letter words. The `re.findall()` function returns a list of strings, telling us not just whether or not the string matched, but which parts of the string matched.

For the following `re.findall()` examples, we'll be operating on the entire file of subject lines as a single string, instead of using a list comprehension for individual subject lines. Here's how to read in the entire file as one string, instead of as a list of strings:

In [None]:
all_subjects = open("data/enronsubjects.txt").read()

Having done that, let's write a regular expression that finds all domain names in the subject lines:

In [None]:
re.findall(r"\b\w+\.(?:com|net|org)", all_subjects)
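The `?:` inside the parentheses is doing real work in that pattern. When a pattern contains a plain capturing group, `re.findall()` returns only what the group matched instead of the whole match. A sketch on an invented string:

```python
import re

text = "see enron.com and example.org"

# non-capturing group: findall returns the whole match
non_capturing = re.findall(r"\b\w+\.(?:com|net|org)", text)
# capturing group: findall returns only the text inside the parentheses
capturing = re.findall(r"\b\w+\.(com|net|org)", text)
print(non_capturing)
print(capturing)
```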

Every time the string `New York` is found, along with the word that comes directly afterward:

In [None]:
re.findall(r"New York \b\w+\b", all_subjects)

And just to bring things full-circle, everything that looks like a zip code, sorted:

In [None]:
sorted(re.findall(r"\b\d{5}\b", all_subjects))

## Full example: finding the dollar value of the Enron e-mail subject corpus

Here's an example that combines our regular expression prowess with our ability to do smaller manipulations on strings. We want to find all dollar amounts in the subject lines, and then figure out what their sum is.

To understand what we're working with, let's start by writing a list comprehension that finds strings that just have the dollar sign (`$`) in them:

In [None]:
[line for line in subjects if re.search(r"\$", line)]

Based on this data, we can guess at the steps we'd need to take in order to figure out these values. We're going to ignore anything that doesn't have "k", "million" or "billion" after it as chump change. So what we need to find is: a dollar sign, followed by any series of digits (or periods), followed potentially by a space (but sometimes not), followed by a "k", "m" or "b" (which will sometimes start the word "million" or "billion" but sometimes not... so we won't bother looking).

Here's how I would translate that into a regular expression:

    \$[0-9.]+ ?(?:[Kk]|[Mm]|[Bb])
    
We can use `re.findall()` to capture all instances where we found this regular expression in the text. Here's what that would look like:

In [None]:
re.findall(r"\$[0-9.]+ ?(?:[Kk]|[Mm]|[Bb])", all_subjects)

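If we want to actually make a sum, though, we're going to need to do a little massaging. Note that the code below matches whole-number amounts only (it uses `\d+` rather than `[0-9.]+`), so an amount like `$1.2MM` is left out of the total.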

In [None]:
total_value = 0
dollar_amounts = re.findall(r"\$\d+ ?(?:[Kk]|[Mm]|[Bb])", all_subjects)
for amount in dollar_amounts:
    # the last character will be 'k', 'm', or 'b'; "normalize" by making lowercase.
    multiplier = amount[-1].lower()
    # trim off the beginning $ and ending multiplier value
    amount = amount[1:-1]
    # remove any remaining whitespace
    amount = amount.strip()
    # convert to a floating-point number
    float_amount = float(amount)
    # multiply by an amount, based on what the last character was
    if multiplier == 'k':
        float_amount = float_amount * 1000
    elif multiplier == 'm':
        float_amount = float_amount * 1000000
    elif multiplier == 'b':
        float_amount = float_amount * 1000000000
    # add to total value
    total_value = total_value + float_amount

print(int(total_value))

That's over one trillion dollars! Nice work, guys.
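The chain of `if`/`elif` branches can also be expressed as a dictionary lookup. The following is a sketch of that variation, not the notebook's own method: the function name `total_dollars`, the `MULTIPLIERS` table and the sample text are all invented for illustration, and `[KkMmBb]` is used as a shorter equivalent of `(?:[Kk]|[Mm]|[Bb])`.

```python
import re

# hypothetical lookup table: suffix letter -> multiplier
MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}

def total_dollars(text):
    """Sum every whole-dollar amount followed by k, m or b."""
    total = 0.0
    for amount in re.findall(r"\$\d+ ?[KkMmBb]", text):
        suffix = amount[-1].lower()
        # drop the leading "$" and the trailing letter, then any space
        number = float(amount[1:-1].strip())
        total += number * MULTIPLIERS[suffix]
    return total

sample_text = "deal worth $5 million\nsaved $250k\nlunch was $12"
result = int(total_dollars(sample_text))
print(result)
```

One caveat applies to this sketch and to the pattern it borrows: any word starting with `k`, `m` or `b` after the number counts as a multiplier, so a subject like `$12 bonus` would be read as twelve billion dollars.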

### Conclusion

Regular expressions are a great way to take some raw text and find the parts that are of interest to you. Python's string methods and string slicing syntax are a great way to massage and clean up data. You know them both now, which makes you powerful. But as powerful as you are, we have only scratched the surface of what's possible with regular expressions.

## Next

#### Check A02_TextProc_YourFirstName_YourLastName.ipynb !

