### Summary: 
* new data types that are more efficient in many cases than lists:
    
        * dictionaries
            * MUTABLE
            * paired
            * HAVE METHODS
            
        * tuples 
            * IMMUTABLE
            * can act as KEYS for dictionaries
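            * HAVE ONLY TWO METHODS (.count and .index); NEITHER CHANGES THE TUPLE BECAUSE TUPLES ARE IMMUTABLE (strings are immutable too, yet have many methods that return new strings)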

# Regular Expressions (AKA: Regex)
* Much of what we do in biology can be described as searching for patterns in strings
Examples:
    * Sequence data, Protein domains, DNA transcription factor binding motifs, degenerate PCR binding sites, Read mapping locations, BLAST

* Features of regular expressions allow us to find patterns and variations in patterns that we are searching for

    * If you have any familiarity with linux, you will probably know about the **grep** command: http://www.gnu.org/software/grep/manual/grep.html

*  When would a regular expression be inadequate? When regular expressions are not flexible enough for the 'poor spelling' of DNA or protein sequences, the next level of searching is Hidden Markov Models, which we don't cover in this course. Cases that are hard for a regex:
    * spelling variants you did not anticipate: a pattern has to spell out every variant it accepts (colou?r covers both "color" and "colour"), so it cannot cope with arbitrary misspellings
    * fuzzy data - when there are mistakes in sequencing
    * proteins with the same function can be spelled differently in different organisms
          * e.g. ribosome binding/splice sites

* Many other programming languages use regular expressions. There are other common regular expression features that we aren’t going to discuss in depth in this course; they are listed here in case you encounter them in your studies. You can find descriptions in the Python documentation (https://docs.python.org/3/library/re.html)

        * Greedy quantifiers: *, + and ? match as much text as possible so they are 'greedy' 
        * Minimal quantifiers: *?, +? and ?? match as few characters as possible
        * Back references
        * Lookahead assertions/lookbehind assertions
        * Built-in character classes such as \d, \w and \s (these are unrelated to the classes of OOP, which we’ll talk about in lecture 14)
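The difference between greedy and minimal quantifiers is easiest to see side by side. The following is a small sketch; the sequence `dna` is made up for illustration and is not part of the course data.

```python
import re

# Made-up sequence containing "CC" twice after the first "AT"
dna = "GGATCCTTATCC"

# Greedy: .* grabs as much as it can while still letting the pattern match
greedy = re.search(r"AT.*CC", dna).group()

# Minimal: .*? stops at the first place where the pattern can finish
minimal = re.search(r"AT.*?CC", dna).group()

print(greedy)   # ATCCTTATCC
print(minimal)  # ATCC
```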

Example: 

    * Using a regex: [AT][CG][AC][ACTG]*A[TG][CG] would not differentiate between the following two sequences: 
    
                            1. TGCTAGG
                            2. ACACATC
    
    * This is fine except in cases where you know that one of these two sequences is far more common (2 is the consensus sequence) than the other (1 is an exceptionally rare sequence) --> codon bias might mean that you need a more sophisticated tool than regex
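A quick check of the claim above: the pattern accepts both sequences and gives no hint about which one is the common one. This is only a sketch using `re.search`; the variable names are chosen for illustration.

```python
import re

pattern = r"[AT][CG][AC][ACTG]*A[TG][CG]"
sequences = ["TGCTAGG", "ACACATC"]

# True means the pattern was found in the sequence
results = {seq: re.search(pattern, seq) is not None for seq in sequences}
print(results)  # {'TGCTAGG': True, 'ACACATC': True}
```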
    
Modularization:

* Tools when you need them, they don’t automatically load
* Collection of specialized functions, data types
* More efficient
* There is a module for regular expressions: **import re**
        * re.search identifies a pattern occurring anywhere in the string (usually what we want to do)
        * re.match only identifies a pattern if it occurs at the very beginning of the string (not usually what we want since it is more limited than re.search). If the pattern has to match the entire string, use re.fullmatch.
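To see where each function looks for the pattern, here is a minimal sketch using the same `dna` string as the code cells further down. `re.fullmatch` is included for contrast; the variable names are illustrative.

```python
import re

dna = "ATGACGTACGTACGACTG"

anywhere = re.search(r"GACGT", dna)    # found: GACGT starts at index 2
at_start = re.match(r"GACGT", dna)     # None: the string does not begin with GACGT
prefix = re.match(r"ATGAC", dna)       # found: the string begins with ATGAC
whole = re.fullmatch(r"ATGAC", dna)    # None: ATGAC is not the entire string

print(anywhere is not None, at_start is not None,
      prefix is not None, whole is not None)
# True False True False
```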

<div class="alert alert-block alert-warning">
Example: 
        
        re.search(arg1, arg2) #arg1 = pattern; arg2= string
* Returns a "match object" if the pattern is present in the given string and None if it is not, so the result can be used directly as a True/False test in an if statement
* The match object has useful methods like .group. We'll see that in the last two lines of the cell below. 
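A short sketch of what `re.search` hands back in the two cases. The second pattern, `GGGGG`, is simply one that does not occur in this sequence.

```python
import re

dna = "ATGACGTACGTACGACTG"

hit = re.search(r"GACGT", dna)    # a match object
miss = re.search(r"GGGGG", dna)   # None, because GGGGG is not in dna

print(bool(hit), bool(miss))      # True False

# Guard before using match-object methods: None has no .group()
if miss is None:
    print("no match, so there is nothing to extract")
else:
    print(miss.group())
```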


Revisiting "hidden characters":

* Sometimes the pattern you are searching for contains a backslash, which Python itself normally reads as the start of a special character (\n newline, \t tab, \r carriage return). To avoid that, place the letter r (stands for "raw") directly in front of the opening quote; Python then leaves the backslashes untouched and passes them on to the re module as written. 
<div class="alert alert-block alert-warning">
    
Example:
	
        print(r"\t\n")
    
        Output: \t\n
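The effect of the `r` prefix can be checked by counting characters. This is a small sketch; the string `line` is made up for illustration.

```python
import re

normal = "\t\n"   # Python converts the escapes: a tab and a newline
raw = r"\t\n"     # kept as written: backslash, t, backslash, n

print(len(normal))        # 2
print(len(raw))           # 4
print(raw == "\\t\\n")    # True: same as doubling every backslash

# The re module understands \t itself, so the raw pattern still finds a real tab
line = "ATG\tCCC"
tab_hit = re.search(r"\t", line)
print(tab_hit.start())    # 3
```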

In the next cell, we are going to search for restriction enzyme motifs!

* we will demonstrate how to efficiently search for particular expressions in the data 
* We will then improve the efficiency of regex by introducing a small set of rules:

    * __ALTERNATION__
        * | is the "or" character. Keep in mind that | has low precedence so it is usually a good idea to put the alternatives into parentheses (to impose order of operations on them)
        
    * __CHARACTER GROUPS__ 
        * NOT: ^
        * ANY: .
        * Set of allowed characters: [] matches exactly one character out of those listed (grouping to impose order of operations is done with () instead)
            * ^ as the first character inside [] means NOT: [^XYZ] matches any one character that is not X, Y or Z
            * ^ outside of [] means the exact beginning of the string
            
    * __Quantifiers__: allows us to describe variation in the number of times a section of a pattern is repeated 
        1. __?__ 
            * Example: 
                    GAT?C <-- means that T can be present in the pattern **0 or 1 times** (the pattern can be GAC or GATC). You can use () to group, e.g. GA(TT)?AA, which will match what? 
        2. __+__ 
            * character or group **MUST be present at least once** but can be repeated ‘n’ times
            * Example: 
                    GGGA+TTT will match: GGGATTT or GGGAATTT or GGGAAATTT but it will NOT match GGGTTT
        3. __*__
            * '*' – character (or group) is optional but, if it is present, it can be present ‘n’ times
            * Example: 
                GG(AAA)*TT  will match: GGTT or GGAAATT or......or GGAAAAAAAAAAAAAAATT	
        4. __{}__ 
            * curly brackets denote a specific number of repeats or a range, and apply only to the character or group immediately before them
            * Example: 
                * GGAAA{4}TT  will match ONLY: GGAAAAAATT (GGAA followed by exactly four more A's)
                * You can also specify a range: GGAAA{1,5}TT  will match the following:
                            GGAAATT
                            GGAAAATT
                            GGAAAAATT
                            GGAAAAAATT
                            GGAAAAAAATT
            
    * __Positions__: 
            * Regular expression tools can also involve positions (rather than characters) in an input string
                    * ^ - caret matches beginning of string
                    * $ - matches end of string
                    * Example: 
                           * ^AAA will match: AAATT but not GGAAATT
                           * GGG$ will match: AAAGGG but not AAAGGGTT

* You can combine all of the above search tools to specify flexible patterns efficiently
<div class="alert alert-block alert-warning">
    Example: What does this mean?  
        Identifying full-length eukaryotic messenger RNA sequence:
        ^ATG[ATGC]{30,1000}A{5,10}$
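The quantifier and position examples listed above can be checked mechanically. The sketch below runs `re.search` on pattern/string pairs taken from the notes; the two toy "mRNA" strings at the end are made up for illustration and are not real transcripts.

```python
import re

# (pattern, string, should it match?) taken from the examples in the notes
checks = [
    (r"GAT?C", "GAC", True),
    (r"GAT?C", "GATC", True),
    (r"GGGA+TTT", "GGGAATTT", True),
    (r"GGGA+TTT", "GGGTTT", False),
    (r"GG(AAA)*TT", "GGTT", True),
    (r"GG(AAA)*TT", "GGAAATT", True),
    (r"GGAAA{4}TT", "GGAAAAAATT", True),
    (r"^AAA", "AAATT", True),
    (r"^AAA", "GGAAATT", False),
    (r"GGG$", "AAAGGG", True),
    (r"GGG$", "AAAGGGTT", False),
]

mismatches = []
for pattern, text, expected in checks:
    found = re.search(pattern, text) is not None
    print(pattern, text, found)
    if found != expected:
        mismatches.append((pattern, text))

# Toy sequences (hypothetical): start codon + 40 bases, with or without a poly-A tail
mrna_pattern = r"^ATG[ATGC]{30,1000}A{5,10}$"
with_tail = "ATG" + "ACGT" * 10 + "A" * 8
without_tail = "ATG" + "ACGT" * 10

print(re.search(mrna_pattern, with_tail) is not None)      # True
print(re.search(mrna_pattern, without_tail) is not None)   # False
```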
 



In [2]:
# Begin each program by importing the re module
import re
dna="ATGACGTACGTACGACTG"
# instead of hard wiring the dna sequence into our code, we could ask the user to input
# the sequence or, as we saw in an earlier PS, we could read it in from a file like
# so (you will need to make up your own file here): 
#----------------------------------------
#f=open("Lect_3B_opening_sequence_from_file.txt")
#dna=f.read()
#f.close()
#print(dna)
#----------------------------------------

#----------------------------------------
# demonstrates the difference between r in front of a string and no r in front of a string
#----------------------------------------
print("Demonstrating difference between raw and not raw")
print("\n")
print(r"\n")
#----------------------------------------
# a straight up search with no necessary modifications
# plain ol' re module with search method
#----------------------------------------
if re.search(r"GACGT",dna):
    print("restriction fragment found!")
#we want to expand our search a bit and find
#either of these two patterns in our data

#----------------------------------------
# ALTERNATION: What if you want to test two patterns within the string? 
# You could use a BOOLEAN OPERATOR
#----------------------------------------

if re.search(r"GACT",dna) or re.search(r"GACG",dna):
    print("restriction fragment found! with OR statement")
#----------------------------------------
#more efficiently writing this with the PIPE CHARACTER
#----------------------------------------
if re.search(r"GAC(T|G)",dna):
    print("restriction fragment found! with | statement")
#----------------------------------------
# A character set is handier than a pipe if you want to search for >2 possibilities at a position.
# Slightly lame example of caret usage: we are matching the pattern
# GANG here where the N is NOT X, Y or Z
#----------------------------------------
if re.search(r"GA[^XYZ]G",dna):
    print("restriction fragment found with caret statement")
# using . will match any single character (except a newline)
if re.search(r"GT.CG",dna): 
    print("restriction fragment GT.CG found with . matching to any character!")

#----------------------------------------
#store the match object in the variable m
#----------------------------------------
m=re.search(r"GA[ATGC]{3}AC",dna)#GANNNAC
print(m.group())

Demonstrating difference between raw and not raw


\n
restriction fragment found!
restriction fragment found! with OR statement
restriction fragment found! with | statement
restriction fragment found with caret statement
restriction fragment GT.CG found with . matching to any character!
GACGTAC
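The low precedence of `|` mentioned in the alternation rule is worth seeing once. In this sketch the string `text` is made up for illustration: without parentheses the alternatives are the whole of `GACT` or a lone `G`, not `GAC` followed by `T` or `G`.

```python
import re

text = "TTGACG"

# Parentheses limit the alternation to the last position: GACT or GACG
grouped = re.search(r"GAC(T|G)", text).group()

# No parentheses: the alternatives are "GACT" or "G", so a lone G is enough
ungrouped = re.search(r"GACT|G", text).group()

print(grouped)    # GACG
print(ungrouped)  # G
```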


## String extraction 
* Answers the question which part of the string matched? 
* Stores the part of the string that matched 

* Call .group() on the match object returned by re.search to get the part of the string that matched (we have seen file objects previously but we will see objects formally later – an object has its own methods, like a file object has .close etc). re.search returns None when there is no match, so check the result before calling .group()
* the very last two lines of the code cell above demonstrate this feature

## Capturing
* What happens when we want to extract more than one bit of the pattern?
* For instance,
        GA[ATGC]{3}AC[ATGC]{2}AC
* What does the above mean?
* We want to capture __*two*__ parts of this pattern: the [ATGC]{3} and the [ATGC]{2}
* We enclose the parts of the pattern that we want to extract in parentheses:
        GA([ATGC]{3})AC([ATGC]{2})AC
* A useful method of match objects is .group()
* Example:
       * .group(1) <-return piece of string matched by the section of the pattern in the first set of parentheses
       * .group(2) <-return piece specified in second parentheses

## Position of the match
* A match object contains information about the contents of the match and also the position of the match
    * .start() <- index of the first character of the match (counting from 0)
    * .end() <- index one past the last character of the match, so string[m.start():m.end()] is exactly the matched text

## Splitting a string
* Use a regular expression as the delimiter when you want to split a string
* The general string method .split won’t do this because it only splits on a fixed delimiter; you must use re.split(pattern, string), which is part of the re module

## Finding Multiple matches
* Finding every place where a pattern occurs in a string
    * re.findall <-returns a list of strings (not match objects) so you can’t determine the position of matches, you can just extract the strings; if the pattern contains a capturing group, the list holds the text of the group rather than the whole match
    * re.finditer <-returns an iterator of match objects, so you need to use the return value in a loop


In [7]:
import re
#a more complicated example than the previous one
dna="ATGACGTACGTACGACTG"
#----------------------------------------
#groups!
#store the match object in the variable m
#we are searching for the pattern: GANNNACNNAC
#----------------------------------------
m=re.search(r"GA([ATGC]{3})AC([ATCG]{2})AC",dna)
#----------------------------------------
#this allows you to specify groups that you can then individually call
#----------------------------------------
print("Entire match: " + m.group())
print("First match: "+ m.group(1))
print("Second match: " + m.group(2))
#in this case this shouldn't give a different answer to the search
#because our data set doesn't contain any letters that aren't
# A,C,G,or T but the search result isn't divided into groups anymore
#----------------------------------------
n=re.search(r"GA...AC..AC",dna)
print(n.group())

Entire match: GACGTACGTAC
First match: CGT
Second match: GT
GACGTACGTAC


In [9]:
# Slightly expanded version
import re
dna="ATGACGTACGTACGACTG"
#store the match object in the variable m
m=re.search(r"GA([ATGC]{3})AC([ATCG]{2})AC",dna)
print("Start: "+ str(m.start()))
print("End: "+ str(m.end()))

#----------------------------------------
#We can get the start and end positions of individual groups by supplying a number:
#----------------------------------------
print("The first group starts at: "+ str(m.start(1)))
print("The first group ends at: "+ str(m.end(1)))
print("The second group starts at: "+ str(m.start(2)))
print("The second group ends at: "+ str(m.end(2)))

Start: 2
End: 13
The first group starts at: 4
The first group ends at: 7
The second group starts at: 9
The second group ends at: 11
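The positions printed above can be used directly as slice indices. As a sketch, slicing the original string from `.start()` to `.end()` gives back the matched text, and the last matched character sits at index `end - 1`.

```python
import re

dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA([ATGC]{3})AC([ATCG]{2})AC", dna)

start, end = m.start(), m.end()

print(dna[start:end])                 # GACGTACGTAC, same as m.group()
print(dna[end - 1])                   # C, the last character of the match
print(dna[m.start(1):m.end(1)])       # CGT, same as m.group(1)
```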


In [1]:
# simple example with split
import re
dna1="ACTNGCATRGCTACGTYACGATSCGAWTCG"
#what is this going to produce/print?
runs=re.split(r"[^ATGC]",dna1)
print(runs)

['ACT', 'GCAT', 'GCTACGT', 'ACGAT', 'CGA', 'TCG']
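For contrast with the cell above, here is a sketch of what the ordinary string method does with the same sequence: it can split on one fixed delimiter only, so the other ambiguity codes stay inside the pieces.

```python
import re

dna1 = "ACTNGCATRGCTACGTYACGATSCGAWTCG"

# str.split takes a fixed delimiter, here the single letter N
by_n_only = dna1.split("N")

# re.split takes a pattern, here "any character that is not A, T, G or C"
by_any_ambiguous = re.split(r"[^ATGC]", dna1)

print(by_n_only)          # ['ACT', 'GCATRGCTACGTYACGATSCGAWTCG']
print(by_any_ambiguous)   # ['ACT', 'GCAT', 'GCTACGT', 'ACGAT', 'CGA', 'TCG']
```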


In [2]:
# simple example with findall
import re
dna3="ACTGCATTATATCGTACGAAATTATACGCGCG"
#findall returns a list of the matching strings, stored in the variable runs
runs=re.findall(r"[AT]{6,10}",dna3)
print(runs)

['ATTATAT', 'AAATTATA']


In [1]:
# the more sophisticated finditer - positions too!
import re
dna3="ACTGCATTATATCGTACGAAATTATACGCGCG"
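#finditer returns an iterator of match objects, stored in the variable runs
runs=re.finditer(r"[AT]{6,10}",dna3)
#printing the iterator itself only shows the object, not the matches
print(runs)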
for match in runs:
    run_start=match.start()
    run_end=match.end()
    print("AT rich region from " + str(run_start) +" to " +str(run_end))

<callable_iterator object at 0x10fa42650>
AT rich region from 5 to 12
AT rich region from 18 to 26


In [22]:
import re
dna3="ACTGCATTATATCGTACCAAATTATACGCGCG"
runs=re.findall(r"AC(T|C)",dna3)
print(runs)

['T', 'C']
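The output above contains only `T` and `C` because `re.findall` returns the text of the capturing group when the pattern has one. If the whole match is wanted, two options are sketched below: a non-capturing group `(?:...)`, which is not covered elsewhere in these notes, or a character set.

```python
import re

dna3 = "ACTGCATTATATCGTACCAAATTATACGCGCG"

captured = re.findall(r"AC(T|C)", dna3)     # only the group is returned
whole = re.findall(r"AC(?:T|C)", dna3)      # non-capturing group: whole match
char_set = re.findall(r"AC[TC]", dna3)      # character set: whole match

print(captured)   # ['T', 'C']
print(whole)      # ['ACT', 'ACC']
print(char_set)   # ['ACT', 'ACC']
```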


In [8]:
?str.find

In [6]:
'   spacious    mountains  '.strip().replace("    "," ")

'spacious mountains'