# Regular Expressions:

https://docs.python.org/2/library/re.html

## !!!!IMPORTANT!!!!
Basic Tutorial: Please go through the tutorial here and complete the exercises at the end of the tutorial: https://regexone.com/lesson/introduction_abcs

More advanced tutorial, optional:
https://developers.google.com/edu/python/regular-expressions

### In class:

We are going to look at the functions `re.search()`, `re.findall()`, `re.finditer()`, and `re.sub()`.

In [None]:
import re

In [None]:
b = "AAAA"
m = re.search('AA',b)

In [None]:
print type(m)
print m
dir(m)

In [None]:
print m.group(), m.start(), m.end(), m.span()

In [None]:
b = "AAAA"
m = re.findall('AA',b)

In [None]:
print type(m)
print m

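`re.findall()` returns only non-overlapping matches: it finds `AA` twice in `AAAA`, although the pattern occurs at three overlapping positions (0, 1 and 2). To find overlapping matches we have to use a different syntax, a lookahead, which tells Python not to consume the match.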

In [None]:
m = re.findall('(?=(AA))',b)
m

In [None]:
for m in re.finditer('AA',b):
    print m.group(), m.start(), m.end(), m.span()

In [None]:
for m in re.finditer('(?=(AA))',b):
    # the lookahead match itself is empty, so the matched text is in group 1
    print m.group(1), m.start(), m.end(), m.span()
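
The difference between consuming and non-consuming matches can be checked by collecting the start positions from both patterns. This is a minimal sketch on the same string as above; the names `plain_starts` and `overlap_starts` are only illustrative, and `print()` is called with a single argument so the cell behaves the same under Python 2 and Python 3:

```python
import re

b = "AAAA"

# Plain pattern: each match consumes its characters, so matches cannot overlap.
plain_starts = [m.start() for m in re.finditer('AA', b)]

# Lookahead pattern: the match itself is empty (the text sits in group 1),
# so the scan moves forward one character at a time.
overlap_starts = [m.start() for m in re.finditer('(?=(AA))', b)]

print(plain_starts)    # [0, 2]
print(overlap_starts)  # [0, 1, 2]
```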

We can use regular expressions to find restriction sites in DNA sequences, for example. The EcoRI site has the sequence GAATTC. We can write a program to find all EcoRI sites in a sequence. We will use BioPython to read the sequence from a file.

In [None]:
from Bio import SeqIO
a = SeqIO.read("testDNA.fasta", "fasta")

result = re.search('GAATTC',str(a.seq))

print result.group(), result.start(), result.end(), result.span()

In [None]:
result = re.findall('GAATTC',str(a.seq))
result

Only one site at position 2023. What if we wanted to find all the sites for the enzyme BisI? Its restriction site is GCNGC, where N stands for any nucleotide.

Now we will start to look at the power of regular expressions. Regular expressions have a syntax that uses special characters to perform specific functions. I got this nice summary from the GitHub repository below, and it is the same as the handout.

https://github.com/tartley/python-regex-cheatsheet/blob/master/cheatsheet.rst


__Special Characters:__


```
\       Escape special char or start a sequence.
.       Match any char except newline.
^       Match start of the string.
$       Match end of the string.
[]      Enclose a set of matchable chars.
R|S     Match either regex R or regex S.
()      Create capture group, & indicate precedence
```

Inside a set (after the opening '['), the only special chars are:

```
]   End the set, if not the 1st char
-   A range, eg. a-c matches a, b or c
^   Negate the set only if it is the 1st char
```
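
The special characters and set rules above can be tried on a short string. This is a rough sketch; the sequence `seq` and the variable names are made up for illustration:

```python
import re

seq = "GATTACA"

# '.' matches any single character except a newline.
dot_hits = re.findall('A.', seq)          # ['AT', 'AC']

# '^' and '$' anchor the pattern to the start and end of the string.
starts_with_ga = re.search('^GA', seq) is not None   # True
ends_with_tt = re.search('TT$', seq) is not None     # False

# A set matches one character out of those listed.
set_hits = re.findall('[GC]A', seq)       # ['GA', 'CA']

# '-' inside a set gives a range: A, B or C here.
range_hits = re.findall('[A-C]', seq)     # ['A', 'A', 'C', 'A']

# '^' as the first character of a set negates it.
negated_hits = re.findall('[^A]A', seq)   # ['GA', 'TA', 'CA']

print(dot_hits)
print(negated_hits)
```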

__Quantifiers:__

```
{m}     Exactly m repetitions
{m,n}   From m (default 0) to n (default infinity)
*       0 or more. Same as {,}
+       1 or more. Same as {1,}
?       0 or 1. Same as {,1}
```
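
A small sketch of how the quantifiers behave, using made-up test strings (the variable names are illustrative only). Quantifiers are greedy by default, so each match takes as many repetitions as it can:

```python
import re

seq = "GAAATTTTC"

exactly_three = re.findall('A{3}', seq)    # ['AAA']
two_to_three = re.findall('T{2,3}', seq)   # ['TTT'], the fourth T is left over

words = "G GA GAA"

zero_or_more = re.findall('GA*', words)    # ['G', 'GA', 'GAA']
one_or_more = re.findall('GA+', words)     # ['GA', 'GAA']
zero_or_one = re.findall('GA?', words)     # ['G', 'GA', 'GA']

print(zero_or_more)
print(one_or_more)
print(zero_or_one)
```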

__Special sequences:__

```
\A  Start of string
\d  Digit
\D  Non-digit
\s  Whitespace [ \t\n\r\f\v].
\S  Non-whitespace
\w  Alphanumeric: [0-9a-zA-Z_].
\W  Non-alphanumeric
\Z  End of string
\f  ASCII Formfeed
\n  ASCII Linefeed
\r  ASCII Carriage return
\t  ASCII Tab
\v  ASCII Vertical tab
\\  A single backslash
```
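
The special sequences are easiest to understand on a concrete line of text. The sketch below uses an invented tab-separated line; the `r''` prefix makes a raw string, so Python passes the backslashes to `re` unchanged:

```python
import re

# Hypothetical line: a sample name, a tab, a date and a trailing newline.
line = "sample_7\t2017-02-03\n"

digits = re.findall(r'\d+', line)    # ['7', '2017', '02', '03']
words = re.findall(r'\w+', line)     # ['sample_7', '2017', '02', '03']
fields = re.findall(r'\S+', line)    # ['sample_7', '2017-02-03']

# \A only matches at the very start of the string.
starts_with_sample = re.search(r'\Asample', line) is not None   # True

print(digits)
print(words)
print(fields)
```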

__Extensions:__

```
(?=...)       Lookahead assertion, match without consuming
There are several others but we won't go that deep in this course
```

__Case Insensitive:__

```
Add re.I as an argument to the re function.
```

_______

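For our purpose we need an expression that matches a 'GC', followed by any nucleotide, followed by another 'GC'. The following expressions all work on a sequence that contains only A, T, G and C. Note that `\w` is the loosest of the three, since it also matches any other letter, digit or underscore: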

```
GC\wGC
GC[ATGC]GC
GCAGC|GCTGC|GCCGC|GCGGC
```

Let's test:

In [None]:
result = re.findall('GC\wGC',str(a.seq))
result

In [None]:
result = re.findall('(?=(GC\wGC))',str(a.seq))
result

In [None]:
result = re.findall('GC[ATGC]GC',str(a.seq))
result

In [None]:
result = re.findall('GCAGC|GCTGC|GCCGC|GCGGC',str(a.seq))
result
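
The three patterns give the same result on a clean A/T/G/C sequence, but they are not strictly equivalent. As a sketch, here they are applied to an invented toy sequence (not `testDNA.fasta`) whose second site contains the ambiguity code `N`:

```python
import re

# Hypothetical toy sequence; the second site has an ambiguous base 'N'.
toy = "GCAGCTTGCNGC"

any_word_char = re.findall(r'GC\wGC', toy)
nucleotide_set = re.findall('GC[ATGC]GC', toy)
alternation = re.findall('GCAGC|GCTGC|GCCGC|GCGGC', toy)

print(any_word_char)    # ['GCAGC', 'GCNGC']
print(nucleotide_set)   # ['GCAGC']
print(alternation)      # ['GCAGC']
```

Which behavior is preferable depends on whether ambiguous bases in the input should count as a site.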

In [None]:
for m in re.finditer('GC\wGC',str(a.seq)):
    print m.start(), m.group(), m.span()

In [None]:
for m in re.finditer('gc\wgc',str(a.seq)):
    print m.start(), m.group(), m.span()

In [None]:
for m in re.finditer('gc\wgc',str(a.seq), re.I):
    print m.start(), m.group(), m.span()

### Pattern capture:

You can put parentheses around part of your pattern to capture just that piece of the match. To read a captured piece, pass its number to the `group()` method: `group(0)` (or `group()`) is the whole match and `group(1)` is the first pair of parentheses. For example, if we wanted to find which nucleotide is between the GCs in each occurrence of the BisI site, we can do:

In [None]:
for m in re.finditer('gc(\w)gc',str(a.seq), re.I):
    print m.start(), m.group(), m.group(0), m.span(), m.group(1)
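
One possible use of the captured group is to tally which nucleotide sits in the middle of each site. This is a sketch only: `toy` is an invented stand-in for `str(a.seq)`, and `collections.Counter` is just one convenient way to count:

```python
import re
from collections import Counter

# Hypothetical toy sequence standing in for str(a.seq).
toy = "ttGCaGCttgctgcaaGCAGC"

# group(1) is the single character captured between the two GCs.
middle = Counter(m.group(1).upper() for m in re.finditer(r'gc(\w)gc', toy, re.I))

print(sorted(middle.items()))   # [('A', 2), ('T', 1)]
```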

You can use several pairs of parentheses to capture several parts of your pattern. If we had a restriction enzyme with the site `GCN{0,3}GCNGC` (GC, then zero to three nucleotides, then GC, one nucleotide and GC), we could use `GC(\w{0,3})GC(\w)GC`:

In [None]:
mstring = """
GCaGCaGC
GCGCgGC
GCatgGCtGC
"""
for m in re.finditer('gc(\w{0,3})gc(\w)gc', mstring, re.I):
    print m.start(), m.group(), m.group(0), m.span(), m.group(1), m.span(1), m.group(2), m.span(2)
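
When a pattern has several groups, the match object's `groups()` method returns all of them at once as a tuple. A minimal sketch on a single invented sequence (the name `site` is illustrative):

```python
import re

site = re.search(r'gc(\w{0,3})gc(\w)gc', "ttGCatgGCtGCaa", re.I)

print(site.group(0))   # GCatgGCtGC
print(site.groups())   # ('atg', 't')
print(site.span(1))    # (4, 7)
print(site.span(2))    # (9, 10)
```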

### Substitutions:

In [None]:
text = "Andre is lame"
print text
text = re.sub("lame", "GREAT", text, count=1)
print text
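
`re.sub()` also accepts regular expressions with capture groups, and the replacement string can refer back to a group with `\1`. The sketch below masks BisI-like sites in an invented sequence while keeping the middle base; the bracket notation is just a choice made for this example:

```python
import re

# Hypothetical toy sequence with two GCNGC sites.
seq = "ttGCaGCttgctgc"

# Replace every site with its middle base in brackets (\1 is group 1).
masked = re.sub(r'gc(\w)gc', r'[\1]', seq, flags=re.I)

# count=1 replaces only the first site.
first_only = re.sub(r'gc(\w)gc', r'[\1]', seq, count=1, flags=re.I)

print(masked)       # tt[a]tt[t]
print(first_only)   # tt[a]ttgctgc
```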

## Exercise:


Add the regular expressions necessary to find the names (use the fact that people's names in this text always follow "Mr."), emails, dates and proper nouns in the string below. I already wrote the scaffold of the code, and you just need to insert the regular expressions in the right locations. Each loop prints `m.group(1)`, so every expression needs at least one capture group:

In [None]:
text = """
Hello Mr. Andre Cavalcanti, this is Mr. Anthony Cavalcanti, my email is anthony@pomona.edu, today is 2/3/2017.

Other dates can come anywhere in the text, like here: 12/03/17, 04/05/1976, etc.

"""

# find names in the string:
print "Names:"
for m in re.finditer('',text):
    print '\t', m.start(), m.group(1), m.span()
# find emails:
print "Emails:"
for m in re.finditer('',text):
    print '\t', m.start(), m.group(1), m.span()
# Find date:
print "Dates:"
for m in re.finditer('',text):
    print '\t', m.start(), m.group(1), m.span()
    
print "Proper nouns:"
for m in re.finditer('',text):
    print '\t', m.start(), m.group(1), m.span()