##Regular expressions

Regular expressions are patterns that match text. Why are they useful?

Our sample problem: We have 20-30 files containing several hundred measurements. 

**Problem:** they are in several different formats and so we can not simply put them all in the same file right away… Some use tabs to separate fields, others use commas… and the dates are written in several different styles.

We will need to use regular expressions to extract the relevant information and rearrange it so that it is formatted and organized in a uniform manner.

You might have seen similar patterns before, e.g. `*` as a wildcard in the shell, where `*.txt` matches any filename ending in `.txt`. Regular expressions use a richer notation, and in them `*` means something different.

Warning - the notation for regular expressions is ugly…

Let's load in our text from the notebook-1.txt and notebook-2.txt files

In [82]:
##Initialize a variable to hold the text
readings = []

##Create a loop to read in each notebook-*.txt file
for filename in ('notebook-1.txt', 'notebook-2.txt'):
    
    ##read() reads the whole file into one string
    ##
    ##strip() removes leading and trailing whitespace
    ##
    ##split('\n') splits that string at every newline character,
    ##so that each line of text is stored as a separate item in
    ##a list. If this is not done, all of the text from 
    ##notebook-1.txt stays together as one long string, and 
    ##likewise for notebook-2.txt, instead of one item per line
    lines = open(filename, 'r').read().strip().split('\n')
    
    ##Each time through the loop we add all the lines from each 
    ##notebook to readings
    readings += lines[:]

##Loop through each item in readings and print it to the screen
for r in readings:
    print(r)

Baker 1 2009-11-17      1223.0
Baker 1 2010-06-24      1122.7
Baker 2 2009-07-24      2819.0
Baker 2 2010-08-25      2971.6
Baker 1 2011-01-05      1410.0
Baker 2 2010-09-04      4671.6
Davison/May 23, 2010/1724.7
Pertwee/May 24, 2010/2103.8
Davison/June 19, 2010/1731.9
Davison/July 6, 2010/2010.7
Pertwee/Aug 4, 2010/1731.3
Pertwee/Sept 3, 2010/4981.0
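
The loading step above can also be written with a `with` block, which closes each file as soon as it has been read. This is only a sketch: the helper name `read_records` is made up for illustration, and it assumes the files are plain text with one record per line.

```python
def read_records(filenames):
    """Return every non-empty line from the given files as one list."""
    records = []
    for filename in filenames:
        # The with block closes the file even if reading fails
        with open(filename, 'r') as handle:
            # splitlines() breaks the text at line endings and drops them
            for line in handle.read().splitlines():
                # Skip lines that are empty or only whitespace
                if line.strip():
                    records.append(line)
    return records
```

With the two notebook files in the working directory, `readings = read_records(['notebook-1.txt', 'notebook-2.txt'])` should give the same list as the loop above.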


To use regular expressions in Python we will use the re library

In [83]:
# import libraries
import re

##Loop through each item in readings
for r in readings:
    
    ##Usage: re.search(pattern_to_search_for, string_to_search_in) 
    ##Here we search each item in readings for those that contain
    ##the pattern '06'. If '06' is found then we print that item.
    if re.search('06', r):
        print(r)


Baker 1 2010-06-24      1122.7


In [84]:
##Loop through each item in readings
for r in readings:
    
    ##Usage: re.search(pattern_to_search_for, string_to_search_in) 
    ##Here we search each item in readings for those that contain
    ##the pattern 'Aug'. If 'Aug' is found then we print that item.
    if re.search('Aug', r):
        print(r)

Pertwee/Aug 4, 2010/1731.3


## Exercise:
~~~
Caenorhabditis elegans
Caenorhabditis briggsae
Caenorhabditis remanei
Caenorhabditis remanei
Caenorhabditis elegans
Caenorhabditis remanei
Caenorhabditis elegans
~~~
From the above list of nematode species names (assume it is stored in the list `readings`), we want to extract only those from the species Caenorhabditis elegans.

Which command below would allow us to do so?

a. 
~~~
for nematodes in readings:
    if re.search(nematodes, 'Caenorhabditis elegans'):
        print(nematodes)
~~~
b.
~~~
for nematodes in readings:
    if re.search('Caenorhabditis elegans', nematodes):
        print(nematodes)
~~~
c. 
~~~
for nematodes in readings:
    if re.search(nematodes, 'elegans'):
        print(nematodes)
~~~
d.
~~~
for nematodes in readings:
    if re.search('elegans', nematodes):
        print(nematodes)
~~~
e. Both b & d


In [85]:
##Loop through each item in readings
for r in readings:
    
    ##Here we search each item in readings for those that contain
    ##the pattern '06' OR '07'. If '06' OR '07' is found then we 
    ##print that item.
    if re.search('06|07', r):
        print(r)

Baker 1 2010-06-24      1122.7
Baker 2 2009-07-24      2819.0


##Exercise
Print all records from the month of June

##Answer:
~~~
for r in readings:
    if re.search('06|June', r):
        print(r)
~~~

We will be trying to match a lot of patterns in this lesson, so let's write a function to tell us which records match a particular pattern:

In [86]:
##Define the function
def show_matches(pattern, strings):
    
    ##Loop through the given list of strings
    for s in strings:
        
        ##If pattern matches a string in the list 
        ##print '**' beside it, if not, just print 
        ##'  ' (i.e. nothing, but we keep the spacing
        ##nice and readable)
        if re.search(pattern, s):
            print('**', s)
        else:
            print('  ', s)

##Call our function show_matches and ask it to tell 
##us which records contain '06' or '07' 
show_matches('06|07', readings)

   Baker 1 2009-11-17      1223.0
** Baker 1 2010-06-24      1122.7
** Baker 2 2009-07-24      2819.0
   Baker 2 2010-08-25      2971.6
   Baker 1 2011-01-05      1410.0
   Baker 2 2010-09-04      4671.6
   Davison/May 23, 2010/1724.7
   Pertwee/May 24, 2010/2103.8
   Davison/June 19, 2010/1731.9
   Davison/July 6, 2010/2010.7
   Pertwee/Aug 4, 2010/1731.3
   Pertwee/Sept 3, 2010/4981.0


## Exercise
What happens if we try this:
~~~
show_matches('06|7', readings)
~~~
Will we get the same output? Try it!

In [87]:
show_matches('06|7', readings)

** Baker 1 2009-11-17      1223.0
** Baker 1 2010-06-24      1122.7
** Baker 2 2009-07-24      2819.0
** Baker 2 2010-08-25      2971.6
   Baker 1 2011-01-05      1410.0
** Baker 2 2010-09-04      4671.6
** Davison/May 23, 2010/1724.7
   Pertwee/May 24, 2010/2103.8
** Davison/June 19, 2010/1731.9
** Davison/July 6, 2010/2010.7
** Pertwee/Aug 4, 2010/1731.3
   Pertwee/Sept 3, 2010/4981.0


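Order of operations is important: putting characters next to each other binds more tightly than `|`, so `06|7` means "`06` or `7`", not "`0` followed by `6` or `7`". As in mathematics, we can force the grouping we want using parentheses: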

In [88]:
show_matches('0(6|7)', readings)

   Baker 1 2009-11-17      1223.0
** Baker 1 2010-06-24      1122.7
** Baker 2 2009-07-24      2819.0
   Baker 2 2010-08-25      2971.6
   Baker 1 2011-01-05      1410.0
   Baker 2 2010-09-04      4671.6
   Davison/May 23, 2010/1724.7
   Pertwee/May 24, 2010/2103.8
   Davison/June 19, 2010/1731.9
   Davison/July 6, 2010/2010.7
   Pertwee/Aug 4, 2010/1731.3
   Pertwee/Sept 3, 2010/4981.0


But in general show_matches('06|07', readings) is more readable, and so we will stick with that.
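
The three spellings can be compared side by side on a few strings. This is a small sketch; the `samples` list is made up for illustration and is not part of the notebook data.

```python
import re

# Made-up date strings, just to compare the patterns
samples = ['2010-06-24', '2009-11-17', '2011-01-05']

results = {}
for pattern in ('06|7', '0(6|7)', '06|07'):
    # Keep only the strings in which the pattern is found somewhere
    results[pattern] = [s for s in samples if re.search(pattern, s)]
    print(pattern, '->', results[pattern])
```

Only `06|7` should pick up `2009-11-17`, because a lone `7` is enough for that pattern to match.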

Now let's try to ask for records from the 5th month

In [89]:
show_matches('05|May', readings)

   Baker 1 2009-11-17      1223.0
   Baker 1 2010-06-24      1122.7
   Baker 2 2009-07-24      2819.0
   Baker 2 2010-08-25      2971.6
** Baker 1 2011-01-05      1410.0
   Baker 2 2010-09-04      4671.6
** Davison/May 23, 2010/1724.7
** Pertwee/May 24, 2010/2103.8
   Davison/June 19, 2010/1731.9
   Davison/July 6, 2010/2010.7
   Pertwee/Aug 4, 2010/1731.3
   Pertwee/Sept 3, 2010/4981.0


But `Baker 1 2011-01-05      1410.0` isn't from the 5th month, it is from the 5th day... What went wrong here?

How might we fix it? Answer - take advantage of context!!

In [90]:
show_matches('-05-|May', readings)

   Baker 1 2009-11-17      1223.0
   Baker 1 2010-06-24      1122.7
   Baker 2 2009-07-24      2819.0
   Baker 2 2010-08-25      2971.6
   Baker 1 2011-01-05      1410.0
   Baker 2 2010-09-04      4671.6
** Davison/May 23, 2010/1724.7
** Pertwee/May 24, 2010/2103.8
   Davison/June 19, 2010/1731.9
   Davison/July 6, 2010/2010.7
   Pertwee/Aug 4, 2010/1731.3
   Pertwee/Sept 3, 2010/4981.0


Matching is great, but what is even more useful is remembering what we matched so we can extract the data!

`re.search` does this using parentheses. When a regular expression matches, the library remembers what matched against every parenthesized sub-expression.

Let's extract the year from the first record. Here `re.search` returns a match object **IF** a match is found. Otherwise it returns None.

In [91]:
match = re.search('(2009|2010|2011)', 'Baker 1\t2009-11-17\t1223.0')
print(match.group(1))

2009


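`match.group(k)` returns the text matched by the kth subexpression in the regular expression
(the one inside the kth pair of parentheses, counting from the left).

*Note - unlike most things in Python, the groups are numbered from 1 to N, not 0 to N-1. `match.group(0)` is the text matched by the whole pattern.*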
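
A short sketch of the numbering, using a slightly longer pattern than the one above so that the whole match and the first group differ:

```python
import re

match = re.search('(2009|2010|2011)-11', 'Baker 1\t2009-11-17\t1223.0')

print(match.group(0))   # everything the pattern matched: 2009-11
print(match.group(1))   # only the first parenthesized part: 2009
print(match.groups())   # all parenthesized parts as a tuple: ('2009',)
```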

What if we wanted to match the month, not the year, from the first record? We could use | for OR again, such as:

In [92]:
match = re.search('(01|02|03|04|05|06|07|08|09|10|11|12)', 'Baker 1\t2009-11-17\t1223.0')

print(match.group(1))

09


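But that is a LOT of typing, and it doesn't even give the right answer: the `09` it found is part of the year `2009`, not the month. Isn't there a better way? We can use `.` to match any single character and use the dashes around the month as context. To extract the month, we can type: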

In [93]:
match = re.search('....-(..)-..', 'Baker 1\t2009-11-17\t1223.0')

print(match.group(1))

11


##Exercise

Where pattern is a variable containing a string, and typing:
~~~
match = re.search(pattern, 'Baker 1\t2009-11-17\t1223.0')
print(match.group(1))
~~~
returns:
~~~
17
~~~

What string in pattern would result in the above output?

a. `'....-(..)-..'`

b. `'....-..-(..)'`

c. `'....-(..)-(..)'`

d. `'(....-..-..)'`

As you might have imagined, we can use this to match and extract several sub-expressions at once. To extract the year, month, and day we can type:

In [94]:
match = re.search('(....)-(..)-(..)', 'Baker 1\t2009-11-17\t1223.0')

print(match.group(1), match.group(2), match.group(3))

2009 11 17


## Exercise
From the record `'Baker 1\t2009-11-17\t1223.0'`, extract the researcher's name and the date (including the dashes) and print them to the screen. The output should be:
~~~
Baker 2009-11-17
~~~

##Answer:
~~~
match = re.search('(.....) 1\t(....-..-..)', 'Baker 1\t2009-11-17\t1223.0')
print(match.group(1), match.group(2))
~~~

Let's now try to break the first record of `notebook-2.txt` into pieces

In [95]:
print(readings[6])

Davison/May 23, 2010/1724.7


To do this we can use the `*` character. `*` is a postfix operator that means "0 or more" of whatever comes before it, so `.*` matches any run of characters, including none at all.

In [96]:
match = re.search('(.*)/(.*)/(.*)', 'Davison/May 23, 2010/1724.7')

print(match.group(1))
print(match.group(2))
print(match.group(3))

Davison
May 23, 2010
1724.7


Let's test how well this works on a case we might not want a match for (e.g. `//`)

In [97]:
match = re.search('(.*)/(.*)/(.*)', '//')

print('*', match.group(1))
print('*', match.group(2))
print('*', match.group(3))

* 
* 
* 


What is going on here? Why are we getting a false positive? `.*` can match the empty string (zero characters) because of the `*`. So what can we do to prevent our pattern from accepting badly formatted data?

We can use `+` instead of `*`. `+` is a postfix operator which means 1 or more.

In [98]:
match = re.search('(.+)/(.+)/(.+)', '//')

print('*', match.group(1))
print('*', match.group(2))
print('*', match.group(3))

AttributeError: 'NoneType' object has no attribute 'group'

We get an error because the pattern no longer matches `//`, so `re.search` returns None, and None has no `group` method.

In [99]:
print(re.search('(.+)/(.+)/(.+)', '//'))

None
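
One way to avoid the `AttributeError` is to test the result of `re.search` before asking for any groups. The helper below is a sketch only; the name `describe` is made up for illustration.

```python
import re

def describe(pattern, text):
    """Return the matched groups as a tuple, or None if there is no match."""
    match = re.search(pattern, text)
    # re.search gives back None when the pattern is not found
    if match is None:
        return None
    return match.groups()

print(describe('(.*)/(.*)/(.*)', '//'))      # * accepts empty fields
print(describe('(.+)/(.+)/(.+)', '//'))      # + needs at least one character
print(describe('(.+)/(.+)/(.+)', 'a/b/c'))
```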


Now let's double check that this still works with our real data:

In [100]:
match = re.search('(.+)/(.+)/(.+)', 'Davison/May 23, 2010/1724.7')

print(match.group(1))
print(match.group(2))
print(match.group(3))

Davison
May 23, 2010
1724.7


Since we are going to be doing a lot of matching again, let's write a function which applies a pattern to a piece of text. If there is no match it reports that; if there is a match it prints out all the groups in order.

In [101]:
##Function to show matched groups
def show_groups(pattern, text):
    
    m = re.search(pattern, text)
    
    ##If no match is found (i.e. re.search returns 'None')
    ##then we print 'No Match' and exit the function
    if m is None:
        print('No Match')
        return
    
    ##If there is a match, we print out the group number,
    ##an arrow, and the string that matched the pattern.
    ##This will be done for every parenthesized regular
    ##expression in the pattern
    for i in range(1, 1 + len(m.groups())):
        print(i, "->", m.group(i))

Let's try it out:

In [102]:
show_groups('(.+)/(.+)/(.+)', 'Davison/May 23, 2010/1724.7')

1 -> Davison
2 -> May 23, 2010
3 -> 1724.7


Now, since we could break our record up into 3 parts, why not break up the date while we are at it:

In [103]:
show_groups('(.+)/(.+) (.+), (.+)/(.+)', 'Davison/May 23, 2010/1724.7')

1 -> Davison
2 -> May
3 -> 23
4 -> 2010
5 -> 1724.7


What if we had an irregular record where a comma was forgotten?

In [104]:
show_groups('(.+)/(.+) (.+), (.+)/(.+)', 'Davison/May 23 2010/1724.7')

No Match


How can we deal with such a case? We can use the `?` postfix operator which means 0 or 1:

In [105]:
show_groups('(.+)/(.+) (.+),? (.+)/(.+)', 'Davison/May 23 2010/1724.7')

1 -> Davison
2 -> May
3 -> 23
4 -> 2010
5 -> 1724.7


Does this still work on data with a comma?

In [106]:
show_groups('(.+)/(.+) (.+),? (.+)/(.+)', 'Davison/May 23, 2010/1724.7')

1 -> Davison
2 -> May
3 -> 23,
4 -> 2010
5 -> 1724.7


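It still matches, but notice that group 3 is now `23,`: `.+` grabs as much as it can, including the comma, and `,?` is satisfied by matching nothing. We will tighten this up shortly.

What if there was an entry error in the year? E.g. someone wrote 201 instead of 2010 or 2011?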

In [107]:
show_groups('(.+)/(.+) (.+),? (.+)/(.+)', 'Davison/May 23, 201/1724.7')

1 -> Davison
2 -> May
3 -> 23,
4 -> 201
5 -> 1724.7


We probably don't want to match that. We could fix this by specifying that the year has to be 4 characters long:

In [108]:
show_groups('(.+)/(.+) (.+),? (....)/(.+)', 'Davison/May 23, 201/1724.7')

No Match


This isn't very readable (are there four, five, twenty dots there???). We can use a `{N}` postfix operator, which means "exactly N times", to make this more readable:

In [109]:
show_groups('(.+)/(.+) (.+),? (.{4})/(.+)', 'Davison/May 23, 201/1724.7')

No Match


##Exercise

Write a regular expression to match records whose day is one or two characters long, but not longer (e.g. it would match May 22 or May 2, but not May 222).

Hint: `{M,N}` matches from M to N times.

##Answer:
~~~
show_groups('(.+)/(.+) (.{1,2}),? (.{4})/(.+)', 'Davison/May 23, 2010/1724.7')
show_groups('(.+)/(.+) (.{1,2}),? (.{4})/(.+)', 'Davison/May 232, 2010/1724.7')
show_groups('(.+)/(.+) (.{1,2}),? (.{4})/(.+)', 'Davison/May 2010/1724.7')
~~~
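
The effect of `{1,2}` can also be seen in isolation. The sketch below uses a cut-down pattern and made-up strings rather than full records:

```python
import re

# Cut-down pattern: only the month name, the day and the comma
day_pattern = 'May (.{1,2}),'

days = {}
for text in ('May 2, 2010', 'May 22, 2010', 'May 222, 2010'):
    match = re.search(day_pattern, text)
    # Store the captured day, or None when the pattern does not match
    days[text] = match.group(1) if match else None
    print(text, '->', days[text])
```

The third string gives None because three characters sit between the space and the comma, and `{1,2}` allows at most two.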

Does our solution work on a record where the day is missing, such as `Davison/May , 2010/1724.7`?

In [110]:
show_groups('(.+)/(.+) (.{1,2}),? (.{4})/(.+)', 'Davison/May , 2010/1724.7')

1 -> Davison
2 -> May
3 -> ,
4 -> 2010
5 -> 1724.7


Why does this return a match? It's because `.{1,2}` matches any characters at all, including the `,`, and the `?` postfix operator lets the pattern carry on whether or not a comma follows.

We can fix this by limiting what type of characters are matched for the day. `[...]` matches any one character from a set defined by the user, so `[0-9]` matches a single digit:

In [111]:
show_groups('(.+)/(.+) ([0-9]{1,2}),? (.{4})/(.+)', 'Davison/May , 2010/1724.7')

No Match


This still works when our records have no unreasonable mistakes:

In [112]:
show_groups('(.+)/(.+) ([0-9]{1,2}),? (.{4})/(.+)', 'Davison/May 22, 2010/1724.7')

1 -> Davison
2 -> May
3 -> 22
4 -> 2010
5 -> 1724.7


Final matching pattern for records from notebook-2.txt:

In [113]:
p = '(.+)/([A-Za-z]+) ([0-9]{1,2}),? ([0-9]{4})/(.+)'
show_groups(p, 'Davison/May 22, 2010/1724.7')

1 -> Davison
2 -> May
3 -> 22
4 -> 2010
5 -> 1724.7
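
As a quick check, the final pattern can be run against a mix of good and bad records. This is a sketch; the `records` list below reuses the malformed examples from earlier in this section.

```python
import re

p = '(.+)/([A-Za-z]+) ([0-9]{1,2}),? ([0-9]{4})/(.+)'

records = [
    'Davison/May 23, 2010/1724.7',
    'Pertwee/Sept 3, 2010/4981.0',
    'Davison/May 23 2010/1724.7',   # comma forgotten: still accepted
    'Davison/May , 2010/1724.7',    # day missing: rejected
    'Davison/May 23, 201/1724.7',   # year too short: rejected
]

results = {}
for record in records:
    match = re.search(p, record)
    # Keep the groups for a match, None otherwise
    results[record] = match.groups() if match else None
    print(record, '->', results[record] if match else 'No Match')
```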


Now we have some tools to start to organize this data. Let's take a look at our data again.

In [114]:
##Loop through each item in readings and print it to the screen
for r in readings:
    print(r)

Baker 1 2009-11-17      1223.0
Baker 1 2010-06-24      1122.7
Baker 2 2009-07-24      2819.0
Baker 2 2010-08-25      2971.6
Baker 1 2011-01-05      1410.0
Baker 2 2010-09-04      4671.6
Davison/May 23, 2010/1724.7
Pertwee/May 24, 2010/2103.8
Davison/June 19, 2010/1731.9
Davison/July 6, 2010/2010.7
Pertwee/Aug 4, 2010/1731.3
Pertwee/Sept 3, 2010/4981.0


Let's write a function that returns the date in the order Y, M, D, or None if there is no match:

In [115]:
def get_date(record):
    '''Return (Y,M,D) as strings or None.'''
    
    #try re.search() for dates which looks like 2010-01-01
    m = re.search('([0-9]{4})-([0-9]{2})-([0-9]{2})', record)
    
    ##test if re.search() returned a match from the above pattern
    if m:
        return m.group(1), m.group(2), m.group(3)
    
    ##if re.search() returned None from the above pattern
    ##try the other pattern
    m = re.search('/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/', record)
    
    ##test if re.search() returned a match from the second pattern.
    ##Here group 1 is the month, group 2 the day and group 3 the
    ##year, so we reorder them to Y, M, D
    if m:
        return m.group(3), m.group(1), m.group(2)
    
    ##if neither pattern returned a match from re.search(), return None
    return None

Let's apply this function to all our records in readings:

In [116]:
for r in readings:
    print(get_date(r)[0], get_date(r)[1], get_date(r)[2])

2009 11 17
2010 06 24
2009 07 24
2010 08 25
2011 01 05
2010 09 04
2010 May 23
2010 May 24
2010 June 19
2010 July 6
2010 Aug 4
2010 Sept 3


This is how we would normally approach a problem like this: write a pattern for each case and put them together in a function. This is easier and more readable than trying to make one gigantic pattern that works for all cases.
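
The same one-pattern-per-case approach can take the dates all the way to a single format. The sketch below is an assumption about how that might look, not part of the original lesson: the `MONTHS` table and the name `normalize_date` are made up, and the table only covers the month spellings that appear in our data.

```python
import re

# Hypothetical lookup table: only the spellings seen in notebook-2.txt
MONTHS = {'May': '05', 'June': '06', 'July': '07', 'Aug': '08', 'Sept': '09'}

def normalize_date(record):
    """Return the date in a record as 'YYYY-MM-DD', or None if none is found."""
    # Case 1: dates already written like 2010-01-31
    m = re.search('([0-9]{4})-([0-9]{2})-([0-9]{2})', record)
    if m:
        return m.group(1) + '-' + m.group(2) + '-' + m.group(3)

    # Case 2: dates written like May 23, 2010
    m = re.search('/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/', record)
    if m and m.group(1) in MONTHS:
        # zfill(2) pads single-digit days with a leading zero
        return m.group(3) + '-' + MONTHS[m.group(1)] + '-' + m.group(2).zfill(2)

    # Neither pattern matched, or the month name is not in the table
    return None

print(normalize_date('Baker 1\t2009-11-17\t1223.0'))   # 2009-11-17
print(normalize_date('Davison/July 6, 2010/2010.7'))   # 2010-07-06
```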

##CHALLENGE!!

Modify the function to get all 3 fields (Researcher, Date, Measurement) and put them in the same order (remember to list the date as Y, M, D).

For an extra challenge, try to not include the experiment number (e.g.
get only 'Baker' not 'Baker 1')

##Answer (first part of challenge):

In [119]:
def get_date(record):
    '''Return (Researcher, Y, M, D, Measurement) as strings or None.'''
    
    #try re.search() for dates which looks like 2010-01-01
    m = re.search('(.+)\\s*([0-9]{4})-([0-9]{2})-([0-9]{2})\\s*(.+)', record)
    
    ##test if re.search() returned a match from the above pattern
    if m:
        return m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)
    
    ##if re.search() returned None from the above pattern
    ##try the other pattern
    m = re.search('([A-Za-z]+)/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/(.*)', record)
    
    ##test if re.search() returned a match from the second pattern
    if m:
        return m.group(1), m.group(4), m.group(2), m.group(3), m.group(5)
    
    ##if neither pattern returned a match from re.search(), return None
    return None


##Let's apply this function to all our records in readings
for r in readings:
    print(get_date(r)[0], get_date(r)[1], get_date(r)[2], get_date(r)[3], get_date(r)[4])

Baker 1  2009 11 17 1223.0
Baker 1  2010 06 24 1122.7
Baker 2  2009 07 24 2819.0
Baker 2  2010 08 25 2971.6
Baker 1  2011 01 05 1410.0
Baker 2  2010 09 04 4671.6
Davison 2010 May 23 1724.7
Pertwee 2010 May 24 2103.8
Davison 2010 June 19 1731.9
Davison 2010 July 6 2010.7
Pertwee 2010 Aug 4 1731.3
Pertwee 2010 Sept 3 4981.0


##Answer (second part of challenge):

In [120]:
def get_date(record):
    '''Return (Researcher, Y, M, D, Measurement) as strings or None.'''
    
    #try re.search() for dates which looks like 2010-01-01
    m = re.search('([A-Za-z]+)\\s*.\\s*([0-9]{4})-([0-9]{2})-([0-9]{2})\\s*(.+)', record)
    
    ##test if re.search() returned a match from the above pattern
    if m:
        return m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)
    
    ##if re.search() returned None from the above pattern
    ##try the other pattern
    m = re.search('([A-Za-z]+)/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/(.*)', record)
    
    ##test if re.search() returned a match from the second pattern
    if m:
        return m.group(1), m.group(4), m.group(2), m.group(3), m.group(5)
    
    ##if neither pattern returned a match from re.search(), return None
    return None

##Let's apply this function to all our records in readings
for r in readings:
    print(get_date(r)[0], get_date(r)[1], get_date(r)[2], get_date(r)[3], get_date(r)[4])

Baker 2009 11 17 1223.0
Baker 2010 06 24 1122.7
Baker 2009 07 24 2819.0
Baker 2010 08 25 2971.6
Baker 2011 01 05 1410.0
Baker 2010 09 04 4671.6
Davison 2010 May 23 1724.7
Pertwee 2010 May 24 2103.8
Davison 2010 June 19 1731.9
Davison 2010 July 6 2010.7
Pertwee 2010 Aug 4 1731.3
Pertwee 2010 Sept 3 4981.0
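
The loops above search every record several times and would fail on a record that matches neither pattern. A possible alternative, sketched here under the assumption that the function is renamed `get_fields` (a made-up name), is to call it once per record, check for None, and unpack the tuple. The patterns are the same as in the answer above, written as raw strings so that `\s` needs only one backslash.

```python
import re

def get_fields(record):
    """Return (name, year, month, day, reading) as strings, or None."""
    # Records like 'Baker 1<tab>2009-11-17<tab>1223.0'
    m = re.search(r'([A-Za-z]+)\s*.\s*([0-9]{4})-([0-9]{2})-([0-9]{2})\s*(.+)', record)
    if m:
        return m.groups()
    # Records like 'Davison/May 23, 2010/1724.7'
    m = re.search(r'([A-Za-z]+)/([A-Z][a-z]+) ([0-9]{1,2}),? ([0-9]{4})/(.*)', record)
    if m:
        return m.group(1), m.group(4), m.group(2), m.group(3), m.group(5)
    return None

records = ['Baker 1\t2009-11-17\t1223.0',
           'Davison/May 23, 2010/1724.7',
           'not a record']

for record in records:
    fields = get_fields(record)          # search once per record
    if fields is None:
        print('No Match:', record)
        continue
    name, year, month, day, reading = fields
    print(name, year, month, day, reading)
```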


## Finding more than one thing

What if we wanted to extract multiple occurrences of a pattern? For example, what if we wanted to know how many times 'atg' occurs in a DNA sequence?

In [121]:
DNA_sequence = 'atcccttatatatgggatcatttaatatacgtgtatgcaactatttaaaagcgatgggc'

m = re.findall('atg', DNA_sequence)

for hit in m:
    print(hit)

atg
atg
atg


We can use more complicated regular expressions like we did before. For example, if we want to know the 3 nucleotides that follow each 'atg'

In [123]:
m = re.findall('atg(.{3})', DNA_sequence)

for hit in m:
    print(hit)

gga
caa
ggc
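
`re.findall` returns only the matched text. If the position of each match is also of interest, `re.finditer` yields one match object per occurrence instead. A minimal sketch on the same sequence:

```python
import re

DNA_sequence = 'atcccttatatatgggatcatttaatatacgtgtatgcaactatttaaaagcgatgggc'

positions = []
for match in re.finditer('atg(.{3})', DNA_sequence):
    # start() is the index where this 'atg' begins, counting from 0
    positions.append(match.start())
    print(match.start(), match.group(1))

# Counting the occurrences is just the length of the findall list
count = len(re.findall('atg', DNA_sequence))
print(count)
```

Both functions report non-overlapping matches, scanning the string from left to right.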


Now, if we are going to be using the pattern over and over again, we can compile it to save time:

In [125]:
after_atg = re.compile('atg(.{3})')
m = after_atg.findall(DNA_sequence)

for hit in m:
    print(hit)

gga
caa
ggc
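
A compiled pattern is an object with its own `search` and `findall` methods, so it can be handed around and applied to many strings. The sketch below uses made-up sequences. Note that the `re` module also caches recently used patterns, so for a handful of patterns the time saved is likely to be small; a named, compiled pattern mainly keeps the code tidy.

```python
import re

after_atg = re.compile('atg(.{3})')

# Made-up sequences for illustration
sequences = ['atggga', 'ccatgcaatt', 'ttttt']

found = {}
for seq in sequences:
    # The compiled object takes only the string to search in
    found[seq] = after_atg.findall(seq)
    print(seq, found[seq])

# The original pattern text is kept on the compiled object
print(after_atg.pattern)
```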
