In [None]:
%matplotlib inline
import matplotlib
import seaborn as sns
matplotlib.rcParams['savefig.dpi'] = 2 * matplotlib.rcParams['savefig.dpi']

# Dealing with Strings in Python

### Goals

 - Strings in Python (...and other things)
 - Basic string processing in Python
 - StringIO package in Python
 - Regular expressions

## The string data structure

A string is a sequence of characters.  In Python it's indicated by surrounding it with either single or double quotes:

        'The quick brown fox jumped over the lazy dog'
        "The quick brown fox jumped over the lazy dog"

They are pretty much interchangeable.  The only difference has to do with __escaping__:


### Escaping

Suppose you wanted to enter the string 

        I'm Anatoly, but some people call me "Toly."

how would you do it?  You can't just surround it with `".."` like 

        "I'm Anatoly, but some people call me "Toly.""
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

because Python would get confused and think that the string was over early (as shown).  Similarly, you can't just enclose it in single quotes because of the single quote in `I'm`.  Instead, when you want to insert a quote into a quoted string, you _escape_ it by writing it as `\"` or `\'`.  So we could represent the previous string as either:
        
        "I'm Anatoly, but some people call me \"Toly.\""
        'I\'m Anatoly, but some people call me "Toly."'      

(This also means that if you want to represent a backslash, you have to write `\\`.)
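
One way to convince yourself that an escape sequence stands for a single character is to look at lengths.  The snippet below is only a small sketch (the variable names are made up for illustration), written with function-style `print` so that it should behave the same under Python 2.7 and Python 3:

```python
# Each escape sequence takes two characters to type but stores only one.
quote = "\""          # a single double-quote character
backslash = "\\"      # a single backslash character

print(len(quote))
print(len(backslash))

# A Windows-style path needs every backslash doubled in the source code.
windows_path = "C:\\temp\\new"
print(windows_path)
print(len(windows_path))
```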


### Gotcha 1  (file away for later, don't think about now)

That's in Python.  Other languages have their own rules and conventions and you often have to interface with them.  You should try to avoid dealing with these quotation and escaping rules in code -- it is a source of many bugs.  

In many languages, notably bash shell scripts and SQL, the two types of quotes are not the same.  What's even worse, each of three popular SQL backends (SQLite, MySQL, and Postgres) has substantively different rules about strings and about quoting!
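
One common way to stay out of the quoting business is to let a library do it.  As a rough sketch, assuming an in-memory SQLite database and a made-up `people` table, the standard library's `sqlite3` module accepts `?` placeholders so that the string never has to be spliced into the SQL text by hand:

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")

name = 'I\'m Anatoly, but some people call me "Toly."'

# The ? placeholder hands the value to the driver separately from the SQL,
# so neither kind of quote in the string needs escaping for SQL.
conn.execute("INSERT INTO people (name) VALUES (?)", (name,))

rows = conn.execute("SELECT name FROM people").fetchall()
print(rows[0][0] == name)
conn.close()
```

Other database drivers offer a similar mechanism, though the placeholder syntax differs between them.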


### Gotcha 2  (file away for later, don't think about now)

We never said what a character is, and it turns out to be painfully complicated.  After all, there are a whole lot of possible characters in a whole lot of alphabets, many more than the 128 characters in the [ASCII](http://en.wikipedia.org/wiki/ASCII) character encoding that was once predominant.  The [Unicode](http://en.wikipedia.org/wiki/Unicode) standard assigns a number (a _code point_) to each of these characters, and a variety of _encodings_ (UTF-8, UTF-16, and others) turn those numbers into bytes.

When you see a Python string that has a `u` in front, e.g.

        u'I\'m Anatoly, but some people call me "Toly."'

that tells Python that it is a _Unicode string_.

This is something that you _shouldn't_ have to worry about: so long as you use libraries, and those libraries are smart, all the needed conversions should be handled for you.  Unfortunately, it is still possible to hit these rough edges so it's good to know about them.
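
To make the difference between characters and bytes concrete, here is a minimal sketch (the example word is arbitrary) that should run under both Python 2.7 and Python 3.  It encodes a Unicode string to UTF-8 and back:

```python
text = u"caf\u00e9"               # four characters; the last is an accented e
encoded = text.encode("utf-8")    # the bytes that would go to disk or over a network

print(len(text))       # counts characters
print(len(encoded))    # counts bytes; the accented e needs two of them in UTF-8

# Decoding with the same encoding recovers the original text.
decoded = encoded.decode("utf-8")
print(decoded == text)
```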

In [None]:
str1="I'm Anatoly, but some people call me \"Toly.\""
str2='I\'m Anatoly, but some people call me "Toly."' 
print str1
print str2
print str1==str2

In [None]:
str3=u'I\'m Anatoly, but some people call me "Toly."'
print str1==str3
print type(str1)
print type(str3)
print type(str1)==type(str3)

print isinstance(str1, basestring)
print isinstance(str3, basestring)

### Exercises


1. Fill in the following Python code
        >>> s = ...
        >>> print s
   
   so that the resulting output is
   
        Bob said "I'm not sure, but I think that the quick brown fox said it would 'jump over' the lazy dog."
   
2. Without running your code, what do you think is the output of just typing
        >>> s
   at the REPL?


## Basic string processing

The Python standard library provides a bunch of basic string functions.  For a complete list see https://docs.python.org/2/library/string.html

The general pattern is that everything is invoked in `str.operation(arguments)` notation.  Let's just jump to examples for:

- `split`: Splits a string along a substring
- `join` : The opposite of split
- `strip` : Removes leading / trailing whitespace
- `format` / `%` : String substitution and formatting
- `in`: check if one string is contained in another one
- `startswith`: checks if, well, one string starts with another one
- `+`: You can concatenate strings
- slices: Strings behave like arrays, so you can "slice them"

In [None]:
"Once upon a midnight dreary, while I pondered weak and weary, over many a quaint and curious volume of forgotten lore.".split(",")

In [None]:
"Note that the splitting string does not have to be just one character long.".split("not")

In [None]:
", ".join(["a","b","c"])

In [None]:
print "\n".join([
"Look I can make a string",
"that crosses lines!"
])
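
A detail worth knowing about `split` and `join`: calling `split()` with no argument behaves differently from passing an explicit separator.  The following sketch (the sentence is made up) shows the difference, and also that `join` undoes `split` when both use the same separator:

```python
sentence = "the  quick brown   fox"

# With no argument, split() breaks on runs of whitespace.
words = sentence.split()
print(words)

# An explicit separator is taken literally, so repeated spaces
# produce empty strings in the result.
pieces = sentence.split(" ")
print(pieces)

# Joining with the same separator rebuilds the original string.
rebuilt = " ".join(pieces)
print(rebuilt == sentence)
```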

In [None]:
"    why is there so much whitespace around this?    ".strip()

In [None]:
print "Plug {0} into {1}".format("this", "that")
print "Hi {first} {last}!   Bye {first}.".format(first="Jane", last="Doe")

print "Bob is {:+.2f} feet tall".format(5.526)

location = { 'city' : 'New York', 'state': 'NY' }
print "Welcome to {city}, {state}".format( **location )

# There is also an alternate substitution system:
print "Can I buy a %s for $%.2f" % ("salad", 2.56)

In [None]:
"ea" in "team"

In [None]:
"I" in "there is no I in team"

In [None]:
"The quick brown fox...".startswith("The")

In [None]:
"Left me" + "et right"

In [None]:
"The quick brown fox.."[4:9]

In [None]:
# printing out tables
import string
for i, c in enumerate(string.ascii_lowercase):
    print "{num:<2} {lower:>2} {upper:>2}".format(num=i, lower=c, upper=c.upper())
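
Coming back to slices for a moment: they accept the same start, stop, and step notation as lists.  Here is a short sketch with an arbitrary example string:

```python
s = "The quick brown fox"

first = s[:3]        # the first three characters
last = s[-3:]        # negative indices count from the end
middle = s[4:9]      # positions 4 through 8 (the stop index is excluded)
backwards = s[::-1]  # a step of -1 walks the string in reverse

print(first)
print(last)
print(middle)
print(backwards)
```

Strings are immutable, so every slice builds a new string and the original is left unchanged.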

## StringIO in Python

If you have a file, it's easy to turn it into a string using `open` (likely wrapped in a `with` statement).  What if you want to turn a string into a file object?  Some Python libraries take file objects as arguments.  How might you use their functionality on strings (e.g. from web scraping)?  The answer is `StringIO`.

In [None]:
from StringIO import StringIO
import csv

with open("small_data/fha_by_tract.csv") as fh:
    data = [row for row in csv.reader(fh)]
print data[:5]
print

string = """
2012,0000000435,5,1,1,1,1,00253,1,1,11260,02,020,0014.00,3,5,6, , , , ,8, , , , ,1,5,0138,0, , , ,NA   ,2,1,6,0000452,00005224,047.68,00085200,058.97,00000657,00001074,0
2012,0000001281,3,1,1,3,1,00361,3,5,11260,02,020,0028.13,3,3,6, , , , ,6, , , , ,3,3,0212,0, , , ,NA   ,2,1,6,0000492,00004579,010.50,00085200,189.96,00001614,00001761,0
2012,0000001281,3,1,1,3,2,00391,3,1,11260,02,020,0029.00,2,2,5, , , , ,5, , , , ,1,2,0862,3, , , ,NA   ,2,1,6,0001175,00002570,010.86,00085200,123.43,00000671,00001357,0
"""

fh = StringIO(string)
# Parse the in-memory "file" with csv.reader, just like the real file above
data = [row for row in csv.reader(fh)]
print data[:5]
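
The import above is specific to Python 2.  As a hedged sketch of a version that should work on either Python 2.7 or Python 3 (where the class lives in the `io` module), using a small made-up CSV string rather than the data file:

```python
import csv

try:
    from StringIO import StringIO   # Python 2
except ImportError:
    from io import StringIO         # Python 3

# Hypothetical CSV text, e.g. the body of a web response.
text = "city,state\nNew York,NY\nBoston,MA\n"

# csv.reader wants something file-like; StringIO wraps the string to provide it.
rows = list(csv.reader(StringIO(text)))
print(rows)
```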

**Exercises**

1. Here's a string (gotten from running `ps auxww|tail -5` somewhere):

        root     31457  0.0  0.0  65996  3444 ?        Ss   04:21   0:00 sshd: preygel [priv]
        preygel  31459  0.0  0.0  65996  1444 ?        S    04:21   0:00 sshd: preygel@pts/3 
        preygel  31460  0.1  0.0  22492  3632 pts/3    Ss   04:21   0:00 -bash
        preygel  31478  0.0  0.0  18448  1256 pts/3    R+   04:22   0:00 ps auxww
        preygel  31479  0.0  0.0   7236   684 pts/3    S+   04:22   0:00 tail -10
   
   make a Python string that contains this as its contents.  
2. Write a function to extract just the second column of each row.
3. In the above example, why do we use `with` when opening a true file but not for `StringIO`?

## Regular expressions

Tasks like #2 in the exercises above are ubiquitous.  In this case -- because we have it on good authority that the column layout is fixed -- it's easy to do just by counting.  What if instead we had a file containing lines like the following

        Docket S13-396 . ID 30546 :  A photonic micro-structured vacuum-ultraviolet radiation source based on solid-state frequency conversion .  4/3/2014
        Docket S13-202 . ID 30260 :  Performance Enhancement of Transparent Conducting Electrodes by Mesoscale Metal Wires .  3/28/2014
        Docket S13-211 . ID 30257 :  The Self-Assembly of Semiconducting Single-Walled Carbon Nanotubes into Dense and Aligned Rafts on Patterned Substrates .  4/3/2014
        Docket S13-198 . ID 30246 :  Polymer matrices for ambient ionization mass spectrometry .  3/12/2014
        Docket S13-360 . ID 30476 :  High-Performance Silicon Photoanode Passivated with an Ultrathin Nickel Film .  3/19/2014
               ^^^^^^^      ^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
mixed in among other content, and we wanted to pick them out -- and to pick out the three underlined parts of each?

(P.S. Each of these is intended to be one line -- it just wraps in this view!)

This is a general class of problems: We want to be able to identify all strings that "look like _this_," and then to "extract _that_ bit of the string."  __Regular expressions__ provide a concise language for specifying _this_ and _that_, and the regular expression engines in most programming languages (including Python) are a great tool for solving this class of problems.

Before we get down to the dry stuff, let's see what _this_ and _that_ look like in our example:

In [None]:
import re

lines = ["Docket S13-396 . ID 30546 :  A photonic micro-structured vacuum-ultraviolet radiation source based on solid-state frequency conversion .  4/3/2014",
"Docket S13-202 . ID 30260 :  Performance Enhancement of Transparent Conducting Electrodes by Mesoscale Metal Wires .  3/28/2014",
"One might imagine I could go back to the file I copied from and insert the lines that were actually there.  But why?",
"Docket S13-211 . ID 30257 :  The Self-Assembly of Semiconducting Single-Walled Carbon Nanotubes into Dense and Aligned Rafts on Patterned Substrates .  4/3/2014",
"Docket S13-198 . ID 30246 :  Polymer matrices for ambient ionization mass spectrometry .  3/12/2014",
"I surely must be a fish out of water.",
"Docket S13-360 . ID 30476 :  High-Performance Silicon Photoanode Passivated with an Ultrathin Nickel Film .  3/19/2014",
"  Docket S66-666 . ID 66666 :  On the effects of white space at the start of the string.  5/16/2014"]

# only create regex once!
regex = re.compile(r"Docket (S.*) [.] ID (\d*) : (.*) [.]  (\d+/\d+/\d+)")
for line in lines:
    m = regex.match(line)
    if m:
        print "Aha, we've found them", m.groups()
    else:
        print "Can't fool me that easy"

Regular expressions provide a very concise way of specifying _sets of strings_ from a few building blocks and operations.  It is good to think of a regular expression as a special type of program that tries to "eat" a string, but is picky about what it eats.  For a given regular expression, the "set of strings" mentioned earlier consists of those strings that it's willing to eat.  The more formal word for this is **matches**: A regular expression *matches* some set of strings.

Here are some building blocks that apply to matching a single character:
  - `.` : Matches any character (except a newline)
  - `\s`: Matches any whitespace character (`\S` is the opposite)
  - `\d`: Matches any digit (`\D` is the opposite)
  - `\w`: Matches any alphanumeric character or underscore (`\W` is the opposite)
  - `c` : Matches the character 'c' (and similarly for all characters that don't have some special meaning like `.` or `\` or `[`)
  - `[   ]`: Lists of characters, with ranges allowed: e.g. `[a-zA-Z0-9_]` is the same as `\w` in ordinary ASCII
  - `[^  ]`: Some characters have special meaning inside brackets, for instance a caret `^` at the start indicates negation.  That is: `[^a-zA-Z0-9]` matches everything _other than_ what `[a-zA-Z0-9]` matches.
  - `$`: Matches the end of the string (or right before a newline)
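
The single-character building blocks are easy to try out one at a time.  The sketch below pairs a few patterns with one test character each (the pairs are chosen purely for illustration) and reports whether `re.match` accepts them:

```python
import re

# (pattern, single character to test) pairs; the choices are illustrative.
cases = [
    (r".", "x"),        # any character except a newline
    (r".", "\n"),       # ... so a newline is rejected
    (r"\s", "\t"),      # a tab is whitespace
    (r"\d", "7"),       # a digit
    (r"\w", "_"),       # underscore counts as a word character
    (r"[a-f]", "q"),    # q lies outside the range a-f
    (r"[^a-f]", "q"),   # ... so the negated class accepts it
]

results = [re.match(pattern, char) is not None for pattern, char in cases]
for (pattern, char), ok in zip(cases, results):
    print("%-8s %-6r %s" % (pattern, char, ok))
```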
  
Now the fun comes in when we build in notation for repetition and concatenation:
  - `AB`: If `A` and `B` are regular expressions, then a string $s$ will match `AB` if and only if it is the concatenation $s = s_A s_B$ of a string $s_A$ matching `A` with a string $s_B$ matching `B`.  In other words, `A B` will eat a string only if it can first let `A` eat some amount and then let `B` eat from what's left over -- and they must both be happy with what they get.
  - `*`: If `A` is any regular expression, then `A*` matches zero or more repetitions of `A`.
  - `+`: ... matches one or more repetitions.
  - `?`: ... matches 0 or 1 repetitions.
  - `{m,n}`: ... matches between m and n repetitions.
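
Here is a small sketch of concatenation and the repetition operators in action.  The helper `eats_all` is hypothetical (it is not part of `re`); it appends `$` to the pattern so that we only count matches that eat the entire string:

```python
import re

def eats_all(pattern, s):
    # Hypothetical helper: True when the pattern consumes the whole string.
    return re.match(pattern + "$", s) is not None

checks = [
    eats_all(r"\d\s", "7 "),         # concatenation: a digit, then whitespace
    eats_all(r"ab*", "a"),           # * allows zero b's
    eats_all(r"ab+", "a"),           # + needs at least one b
    eats_all(r"ab?", "abb"),         # ? allows at most one b
    eats_all(r"ab{2,3}", "abbb"),    # two or three b's
]
print(checks)
```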
  
Finally, if a regular expression _does_ match there's another verb that applies: **captures**.  If we put part of our expression in _parentheses_ `(   )` then it will be _captured_.  This  means that the matcher will remember which part of the string was eaten by the sub-expression inside of the parentheses, so that we can access it afterwards.
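
As a minimal sketch of capturing (the phone-like string is invented for the example), each pair of parentheses remembers the part of the string that it ate:

```python
import re

m = re.match(r"(\d+)-(\d+)", "555-1234 ext. 9")

first = m.group(1)     # text eaten by the first pair of parentheses
second = m.group(2)    # text eaten by the second pair
both = m.groups()      # all captures at once, as a tuple

print(first)
print(second)
print(both)
```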
  

**Breaking down our example: the RE**

Now we're ready to break down our example.  Let's start with just the regular expression:

                         6
                     |vvvvvvvv|
        r"Docket (S.*) [.] ID (\d*) : (.*) [.]"
        ^ |^^^^^^|^^^| ^^^     ^^^
        1   2       3   4       5
        
1. We saw before that `u"..."` told Python that something is a Unicode string.  Well, `r"..."` tells it that it is a _raw_ string. That means that escaping rules we talked about earlier do not apply!  This is helpful for regular expressions because otherwise we'd have to write things like `\\d` in place of `\d`.
2. The regular expression `r"Docket "` would match exactly the string `"Docket "`.  None of the characters involved are special, not even space.
3. This is a capture group.  The regular expression inside matches any string that starts with an `'S'`: an `'S'` followed by zero or more of any character (other than a newline).  Why doesn't it gobble up the rest of the string in our example?  Because for the whole expression to match, the next bit, 4, has to get to "eat" as well.
4. This regular expression matches precisely one string: `"."`
    We could also have written this as `\.`, but we could _not_ have written just `.`.  That would match _any_ one-character string.
5. This matches a string of digits (possibly empty, since `*` allows zero repetitions).
6. What does this segment match?  Notice that it has no `+`, `*`, `?`, etc. in it so that it matches exactly one string: " . ID "

So in short, we're matching a string that looks like _this_:
  - It starts with "Docket ";
  - Then comes a string, starting with an "S", that we capture;  
  - Then the string " . ID ";
  - Then comes a string of digits that we capture;
  - Then comes the string " : ", then anything, followed by a period.
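
To see the breakdown at work, the sketch below runs the shortened regular expression against one of the docket lines from earlier and unpacks the three captures (the variable names are just for illustration):

```python
import re

line = "Docket S13-198 . ID 30246 :  Polymer matrices for ambient ionization mass spectrometry .  3/12/2014"
regex = re.compile(r"Docket (S.*) [.] ID (\d*) : (.*) [.]")

m = regex.match(line)
docket, id_number, title = m.groups()

print(docket)
print(id_number)
# The third capture begins with a space, because two spaces follow the colon.
print(repr(title))
```

Note that the match succeeds even though the date at the end of the line is never eaten.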

**Breaking down our example: Python's re library**

There are three potential __gotchas__:

1. In Python's `re` library, `re.match` requires that the match begin at the start of the string (though it need not eat all the way to the end).  This is not the "normal" behavior for regular expression libraries, which allow the match to begin anywhere: `re.search` gives this behavior.

2. `re.match` returns a "match object".  It is not just a boolean value, but it _is_ "truthy" which is why we could write `if m:`

3. To get the capture groups we use the match object: `m.group(..)`.  Note that this is **1** based, not zero based.  (More precisely, `m.group(0)` is the entire portion of the string that was matched.  This is useful because the match doesn't have to eat the whole string.)
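
All three gotchas show up in the following sketch, which uses a made-up line with leading whitespace:

```python
import re

text = "  Docket S66-666 . ID 66666"
pattern = re.compile(r"ID (\d+)")

# match only succeeds at the start of the string, so this gives None.
at_start = pattern.match(text)
print(at_start)

# search is allowed to begin anywhere in the string.
found = pattern.search(text)
whole = found.group(0)     # the entire portion that was matched
digits = found.group(1)    # just the first capture group
print(whole)
print(digits)

# A failed match is None, which is falsy, so test it before calling group().
if not at_start:
    print("no match at the start of the string")
```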

### Exercises
1. Write a regular expression that'll match (US) phone numbers.  Use this to write a function that'll take in a string and output the area code and separately the rest of the number (all punctuation, etc. removed).
1. Write a regular expression that matches [ipv4 addresses](https://en.wikipedia.org/wiki/IP_address).
1. Check out http://regexone.com/ for a **fun** interactive "tutorial".

### Further resources

See https://docs.python.org/2/library/re.html for Python's syntax / support of regular expressions.  

*Copyright &copy; 2015 The Data Incubator.  All rights reserved.*