# The Power of Regular Expressions
    
What's a regular expression? A tool that allows you to:
1. Find patterns in text.
1. Match and replace text.
1. Locate every occurrence of a pattern and its position.
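
As a rough illustration of these three uses, here is a minimal sketch on a made-up sequence. The name `toy_seq` and the T-to-U replacement are assumptions chosen for illustration and are not part of the tutorial's data.

```python
import re

# Hypothetical toy sequence, used only for illustration.
toy_seq = "ATGCAATGGCAACTT"

# 1. Find a pattern: the first occurrence of CAA followed by C or T.
first = re.search(r"CAA[CT]", toy_seq)
print(first.group())

# 2. Match and replace: swap every T for U.
rna_like = re.sub(r"T", "U", toy_seq)
print(rna_like)

# 3. Locate every occurrence together with its (start, end) position.
positions = [(m.group(), m.span()) for m in re.finditer(r"CAA[CT]", toy_seq)]
print(positions)
```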

Check out this [Python Regular Expression Tutorial!](https://www.w3schools.com/python/gloss_python_regex_functions.asp)

In [1]:
import re
seq1 = "GTCTCAATGCATGTCTTCTATGCAACTAACCTCCATGTATGCCAAT"
# Match CAA followed by C or T (CAAC or CAAT), then one or more characters
# from the class [AGCT], then CAA followed by T or C (CAAT or CAAC).
# Inside [...] a | is a literal character, so we write [CT] rather than [C|T].
#caat = re.search(r"CAAT", seq1)
#print(caat)
s = re.search(r"(CAA[CT])[AGCT]+(CAA[TC])", seq1)
print(s.group(1))
print(s.group(2))

CAAT
CAAT


In [None]:
# The group method returns the text captured by each group. In this example
# the groups are the sub-patterns wrapped in parentheses.
# There are two capture groups, and we access the second one like this:
#print(s.group(2))
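
To make the group numbering concrete, the sketch below reuses `seq1` and the two-group pattern from the first cell. The variable names `whole`, `both`, `first_span`, and `second_span` are introduced here only for illustration.

```python
import re

seq1 = "GTCTCAATGCATGTCTTCTATGCAACTAACCTCCATGTATGCCAAT"
s = re.search(r"(CAA[CT])[AGCT]+(CAA[TC])", seq1)

whole = s.group(0)       # the entire matched text, same as s.group()
both = s.groups()        # tuple with the text of every capture group
first_span = s.span(1)   # (start, end) of the first capture group
second_span = s.span(2)  # (start, end) of the second capture group

print(both)
print(first_span, second_span)
print(len(whole))
```

Group 0 is always the whole match, and the numbered groups follow the order of their opening parentheses.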

## Predicting Exons In a Sequence
We can use the grammar of genomics to get an idea of what exists in a
nucleotide sequence. Let's demonstrate with regular expressions.

In [2]:
dna = """
ggtattgatttaggtacaacatactcgtgtgttgctcactttgctaatga
tcgtgtggacattattgccaacgatcaaggtaacagaaccactccatctt
ttgtcgctttcactgacactgaaagattgattggtgatgctgctaagaat
caagctgctatgaatccttcgaataccgttttcgacgctaagcgtttgat
cggtagaaacttcaacgacccagaagtgcaggctgacatgaagcacttcc
cattcaagttgatcgatgttgacggtaagccacaaattcaagttgaattt
aagggtgaaaccaagaactttaccccagaacaaatctcctccatggtctt
gggtaagatgaaggaaactgccgaatcttacttgggtgccaaggtcaatg
acgctgtcgtcactgtcccagcttacttcaacgattctcaaagacaagct
accaaggatgctggtaccattgctggtttgaatgtcttgcgtattattaa
cgaacctaccgccgctgccattgcttacggtttggacaagaagggtaagg
aagaacacgtcttgattttcgacttgggtggtggtactttcgatgtctct
ttgttgtccattgaagacggtatctttgaagttaaggccaccgctggtga
cacccatttgggtggtgaagattttgacaacagattggtcaaccacttca
tccaagaattcaagagaaagaacaagaaggacttgtctaccaaccaaaga
gctttgagaagattaagaactgcttgtgaaagagccaagagaactttgtc
ttcctccgctcaaacttccgttgaaattgactctttgttcgaaggtatcg
atttctacacttccatcaccagagccagattcgaagaattgtgtgctgac
ttgttcagatctactttggacccagttgaaaaggtcttgagagatgctaa
attggacaaatctcaagtcgatgaaattgtcttggtcggtggttctacca
gaattccaaaggtccaaaaattggtcactgactacttcaacggtaaggaa
ccaaacagatctatcaacccagatgaagctgttgcttacggtgctgctgt
tcaagctgctattttgactggtgacgaatcttccaagactcaagatctat
tgttgttggatgtcgctccattatccttgggtattgaaactgctggtggt
gtcatgaccaagttgattccaagaaactctaccattccaacaaagaagtc
cgagatcttttccacttatgctgataaccaaccaggtgtcttgattcaag
tctttgaaggtgaaagagccaagactaaggacaacaacttgttgggtaag
ttcgaattgagtggtattccaccagctccaagaggtgtcccacaaattga
agtcactttcgatgtcgactctaacggtattttgaatgtttccgccgtcg
aaaagggtactggtaagtctaacaagatcactattaccaacgacaagggt
agattgtccaaggaagatatcgaaaagatggttgctgaagccgaaaaatt
caaggaagaagatgaaaaggaatctcaaagaattgcttccaagaaccaat
tggaatccattgcttactctttgaagaacaccatttctgaagctggtgac
aaattggaacaagctgacaaggacaccgtcaccaagaaggctgaagagac
tatttcttggttagacagcaacaccactgccagcaagga"""
dna = dna.split('\n') # Create list by splitting on newlines
del dna[0] # delete the first index because it's empty
print(dna)

['ggtattgatttaggtacaacatactcgtgtgttgctcactttgctaatga', 'tcgtgtggacattattgccaacgatcaaggtaacagaaccactccatctt', 'ttgtcgctttcactgacactgaaagattgattggtgatgctgctaagaat', 'caagctgctatgaatccttcgaataccgttttcgacgctaagcgtttgat', 'cggtagaaacttcaacgacccagaagtgcaggctgacatgaagcacttcc', 'cattcaagttgatcgatgttgacggtaagccacaaattcaagttgaattt', 'aagggtgaaaccaagaactttaccccagaacaaatctcctccatggtctt', 'gggtaagatgaaggaaactgccgaatcttacttgggtgccaaggtcaatg', 'acgctgtcgtcactgtcccagcttacttcaacgattctcaaagacaagct', 'accaaggatgctggtaccattgctggtttgaatgtcttgcgtattattaa', 'cgaacctaccgccgctgccattgcttacggtttggacaagaagggtaagg', 'aagaacacgtcttgattttcgacttgggtggtggtactttcgatgtctct', 'ttgttgtccattgaagacggtatctttgaagttaaggccaccgctggtga', 'cacccatttgggtggtgaagattttgacaacagattggtcaaccacttca', 'tccaagaattcaagagaaagaacaagaaggacttgtctaccaaccaaaga', 'gctttgagaagattaagaactgcttgtgaaagagccaagagaactttgtc', 'ttcctccgctcaaacttccgttgaaattgactctttgttcgaaggtatcg', 'atttctacacttccatcaccagagccagattcgaagaattgtgtgctgac', 'ttgttcagatctactttggacccagt

In [None]:
dna_string = ''.join(dna)
print(dna_string)

In [None]:
len(dna_string)

In [9]:
import re
# My intron pattern
dna_string = """>UDW38242.1 |surface glycoprotein|MS|GenBank|ssRNA(+)
MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLDVYYHKNNKSWMESGVYSSANNCTFEYVSQPFLMDLEGK
QGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLL
ALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCT
LKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCV
ADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYN
YKLPDDFTGCVIAWNSNNLDSKVGGNYNYRYRLFRKSNLKPFERDISTEIYQAGSKPCNG
VEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFN
FNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGT
NTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYEC
DIPIGAGICASYQTQTNSRRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISV
TTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVF
AQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLG
DIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQM
AYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQNVVNQNAQALNTL
VKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASA
NLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAIC
HDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQ
PELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQE
LGKYEQYIKWPWYIWLGFIAGLIAILMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSE
PVLKGVKLHYT
>UDW40541.1 |surface glycoprotein|TX|GenBank|ssRNA(+)
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAISGTNGTKRXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXNN
ATNVVIKVCEFQFCNDPFLGVYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQ
GNFKNLREFVFKNIDGYFKIYXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
LHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQXXXXXXXXXXXXXX
XXXXXXXXXXXXXXSNXXXXXXXXXXXXXXXXXXXXXXXPFERDISTEIYQAGSTPCXXX
XXFNCYFPLQSYGFQPTYGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNF
NGLTGTGVLTESNKKFLPFQQFGRDIDDTTDAVRDPQTLEILDITPCSFGGVSVITPGTN
TSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECD
IPIGAGICASYQTQTNSHRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPINFTISVT
TEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXIAQYTSALLAGTIXXXXXXXXXXXXQIPFAMQMA
YRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLV
KQLSSNFGAISSVLNDILARLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASAN
LAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICH
DGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTHNTFVSGNCDVVIGIVNNTVYDXXXX
XXXXXXXXXXKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQEL
GKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEP
VLKGVKLHYT
>UDW40565.1 |surface glycoprotein|TX|GenBank|ssRNA(+)
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAISGTNGTKRXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXNN
ATNVVIKVCEFQFCNDPFLGVYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQ
GNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLA
LHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTL
KSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVA
DYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNY
KLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGV
EGFNCYFPLQSYGFQPTYGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNF
NGLTGTGVLTESNKKFLPFQQFGRDIDDTTDAVRDPQTLEILDITPCSFGGVSVITPGTN
TSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECD
IPIGAGICASYQTQTNSHRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPINFTISVT
TEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFA
QVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKXXXXCLGD
IAARDLICAQKFNGLTVLXXXXXXXXIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMA
YRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLV
KQLSSNFGAISSVLNDILARLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASAN
LAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICH
DGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTHNTFVSGNCDVVIGIVNNTVYDPLQP
ELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQEL
GKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEP
VLKGVKLHYT
>UDW49791.1 |surface glycoprotein|GA|GenBank|ssRNA(+)
MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXGVYFASTXXXXXXXXXIFGTTXXXXXXXXXXX
XXXXXXXIXXXXFQFCNDPFLGVYYHKNNKSWMESGVYSSANNCTFEYVSQPFLMDLEGK
QGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLL
ALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCT
LKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCV
ADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYN
YKLPDDFTGCVIAWNSNNLDSKVGGNYNYRYRLFRKSNLKPFERDISTEIYQAGSKPCNG
VEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFN
FNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGT
NTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYEC
DIPIGAGICASYQTQTNSRRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISV
TTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVF
AQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLG
DIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQM
AYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQNVVNQNAQALNTL
VKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASA
NLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAIC
HDGKAH
>UDW51945.1 |surface glycoprotein|NC|GenBank|ssRNA(+)
MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESGVYSSANNCTFEYVSQPFLMDLEGK
QGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSVLEPLVDLPIGINITRFQTLL
ALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAIDCALDPLSETKCT
LKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCV
ADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYN
YKLPDDFTGCVIAWNSNNLDSKVGGNYNYRYRLFRKSNLKPFERDISTEIYQAGSKPCNG
VEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFN
FNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGT
NTSNQVAVLYQGVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYEC
DIPIGAGICASYQTQTNSRRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISV
TTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVF
AQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLG
DIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQM
AYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQNVVNQNAQALNTL
VKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASA
NLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAIC
HDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQ
PELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQE
LGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSE
PVLKGVKLHYTFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTV
YDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESL
IDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFD
EDDSEPVLKGVKLHYT
"""
# In the sequence dna_string, look for 'gt' followed by 30 to 40 characters,
# each one of 'a', 't', 'g', or 'c' in any combination, followed by 'ag'.
# The pattern is lowercase and case-sensitive, so it only matches
# lowercase nucleotide text.
get_introns = re.finditer(r'gt[atgc]{30,40}ag', dna_string)
# Create an empty list that will be populated from the get_introns matches
introns = []
for match in get_introns:
    # match.start() is where 'gt' begins and match.end() is just past 'ag';
    # together they form the span of the match in 'dna_string'.
    gt = match.start()  # start of match
    ag = match.end()    # end of match
    introns.append((gt, ag))  # same tuple as match.span()

In [None]:
print(introns) # The list that was populated from the get_introns matches
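
To see how the `gt[atgc]{30,40}ag` pattern behaves, here is a minimal sketch on a made-up lowercase sequence. The string `toy` is a hypothetical example built for illustration, and the use of `re.IGNORECASE` is an assumption about how one might handle uppercase input.

```python
import re

# Hypothetical toy sequence: 2 flanking bases, 'gt', 30 filler bases, 'ag', 3 flanking bases.
toy = "cc" + "gt" + "a" * 30 + "ag" + "ttt"

pattern = re.compile(r"gt[atgc]{30,40}ag")
spans = [m.span() for m in pattern.finditer(toy)]
print(spans)

# The same lowercase pattern finds nothing in uppercase text unless case is ignored.
upper_spans = [m.span() for m in pattern.finditer(toy.upper())]
relaxed = re.compile(r"gt[atgc]{30,40}ag", re.IGNORECASE)
relaxed_spans = [m.span() for m in relaxed.finditer(toy.upper())]
print(upper_spans, relaxed_spans)
```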

## Get a List of the Counts and Coordinates of Every Group and Span

Capture groups are numbered from 1 in the order their opening parentheses appear, so we access the first or second group by using
'1' or '2' as the argument for the group method.
Calling the group method for each captured group is often part of deciding which path to take in your analysis. Let's print the groups with a more descriptive formatted message.

These operations are useful in genomics. The location of certain patterns can provide information about regulatory elements, and the match positions give us genomic coordinates within raw sequence data.
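
When turning spans into coordinates, it helps to remember that Python reports a 0-based start and an exclusive end. The sketch below is one possible way to convert them, assuming a 1-based, inclusive reporting convention. The `report` list is a name introduced here for illustration.

```python
import re

seqx = "GTCTCAATGCATGTCTTCTATGCAACTAACCTCCATGTATGC"

report = []
for m in re.finditer(r"CAA[CT]", seqx):
    start, end = m.span()  # 0-based start, end is exclusive
    # Slicing with the span returns exactly the matched text.
    assert seqx[start:end] == m.group()
    # Assumed convention for reporting: 1-based, inclusive coordinates.
    report.append((m.group(), start + 1, end))

print(report)
```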

In [None]:
#Reformat this without the 'str' function. Remember the other formatting options?
seqx = "GTCTCAATGCATGTCTTCTATGCAACTAACCTCCATGTATGC"
# The pattern needs two capture groups so that group(1) and group(2) both exist.
s = re.search(r"(CAA[CT])[AGCT]+(CAA[TC])", seqx)

print("The first match is " + str(s.group(1)) + " location coordinates are " + str(s.span(1)) + ".\n"
+ "The second match is " + str(s.group(2)) + " location coordinates are " + str(s.span(2)) + ".\n")

In [3]:
# Assignment part 1 is to get all of the occurrences of the match and each
# matching span
import re
cDNA = """CCACGCGTCCGCCGCGACCTGCGTTTTCCTGGGGGTCCGCAACTCTGGCTTG
ACCCAAGGACCCGGCCACCCCCTCAGGACAACTCGCATCTCAGCAGAGCAGCCCCTGGCCCAGG
TCCACCCATAGGCAGATGGCCTCCGCCCCACCCCCGGGAGGATTTCTTAATGGGGTGAAAATGC
CAGTCCCCGAAGCCAGGGTTCCGGGACCCCCGGGGCCGAGCTGGGCGCGGGAGCCCCGCAGGGG
CGGAGAGACCCGGCGAGCCTGAAGAAGTGGAGGAGAGATTACACAACTTCAGTGGGGCGTACCC
ATCCTGTCTTCCTGTTCACCCTCCTCTGCCCCCAACTCCATCTCTGAGACCTCCTGCCCCCCCA
AAAAAAGAAGTCGCTCGCGGGCTCTGTCTGCAGAGAGCCAGGGTGA"""
# create RE using search method
s = re.search(r"(TGA)", cDNA) # Get the group and the span
print(s.group(1))
print(s.span())

TGA
(173, 176)


How would we get the other occurrences? Play around in the cell below to come up with a solution.

In [None]:
# Work on code to get all of the matches

## Let's demonstrate with another example!

In [None]:
#Assignment part 2 is to get all of the occurrences of the match 
# and each matching span
t = re.search(r"(AAA[TA])", cDNA)
print(t.group())
print(t.span())

In [None]:
# Do the same here. Get all of the matches of the RE.

The following are examples of performing operations on match
objects. The code example works through finding each match,
then getting the matched string and genomic coordinates of each
match object. The match objects are produced by the finditer
regex method, and we visit each of them with a for loop.
We query each search/finditer match object with the group and span methods.
Let's start the demonstration!

In [7]:
import re 
seqA = """TGCCAGCTGCATCTCAAGGGCAGGGGCCAGGGTCAGAACAGGAGGCCCCTTTCTAGTGGATGCAACACCG
CACAGCAGCGCAGGACAAGGCTGTCTCACCTGCTGTTACCAACGCCCCCAGGTTGTGTTTCTCTTCGGAA
CGCTCCAGATGCAATGCCGATCTGGGCACGCAGGGAGGTGGGGGGTGATGGGAAGCTCACCAGGAACCAG
GGACTGGGGCCCAGACTGACCTGTCAGAAGGCACCAAGGTGGACGTGGTGTGGCCGGTCCTGCTTCACTA
GCCTTGCTCCTCCTAGTGGGAGGCCGTGCACTCTGCCAACCCGCTATCCCCCACCCTCACCCTGCCTCGG
ATCCACACCACCTTCCTGCTGGGAGGTGTGGGGTGACAGGAAGCTCTTGCCCCACACCATGGTCCTGGCG
CGGTCCTGGTGTGAGACATCCTGCATTTGAAACAGCTGTGTGACTTCAGGAGAGTTACTTGACCTTTCTG
TCTCAGCGTCTTCTGCAAAAGACTGAACAGGAACTCCGTGGCTCGGGCGAACTCACTCCTGTCAAGTGCA
TAGGAAGGGGCACGAGTGGGCGGGGAGGTGCTGCCCTGTTATTCACAGAACACTTTGCCCAGGCAACACA
CTACAAATCCACAACCTCGCTCCCTGCAGGTGCACTGAGACCACCCACGCCCTCCCGGACACCAACGCCC
ACCATCAGATTCGCTGCGCAAAGTCCCAGAGCCGCCGGCGCACGCTCACACCCCGGCGAGCAGCCCCCAG
CTCCCTCCCTCCGAGAGGAGCCCGGTCCGCGACCAGCCCAGCCCATCCCAGTCCCGCGCGGAGTCCTGGA
TTCCAGCCGCTCGCAGTGACTCGGTACTCGGGATAGTGCCGGGGGCCGCAGCCCTGTCCCGCTGCCGCCG
CCGGATGCCCCGAGTCGGCCGTCACGCACCCCCCGCGGGAGCCCGCGCCGCCCGCCGCGCCGGGGCCGTT
TAAATGGGCCAAGTTGTGGCGGCGGCGTCGGCGGCGGAGTCTCCCAAGTCCCCGCCGGGCGGGCGCGCGC
CAGTGGACGCGGGTGCACGACTGACGCGGCCCGGGCGGCGGGGCGGGGGCTTGGGACCCCCGAGAGGGGC
GGGGACTCCGCGACTCCTCGCTGCCGGGCTCGGCCTGGCGGGTGGGTCGGCGAGCCGGGCGTGGGACTGC
CCCGGGCGCGGGCGCTGGTGGCCGGGGCGCGGGACTCCAGACGCCCCGGGGAGCCCCGAGGCCCTGGAAC
TGCGGCGCTCGGCGAGTCGATCCGGGATCGATAGCAGCTCCATGTCTCCGGCCTCTGAGGCCCCGCCGGC
CGGCTGGGCAGTCCGGGGAGGCCTGGCGGGCGGCGCGTAGGCGGCGGCTGCGGGCGCCGGGGCGCACTAG
CGGACGGCGTGGGCGCGCGGCCAGGCGCCTCCCCGGCCCCCGCGACCCAACTCCAGCCCGGGCCGGAATA"""
# Create an iterator of match objects with finditer
gc_motif = re.finditer(r"GGCGGC", seqA)
# First version of the output (group and span on separate lines):
# for n in gc_motif:
#     print(n.group())
#     print(n.span())
#     # GGCGGC: (1011, 1017)
# Print each match and its span on one line
for n in gc_motif:
    print(n.group() + "->" + str(n.span()) + "\n")

GGCGGC->(1011, 1017)

GGCGGC->(1023, 1029)

GGCGGC->(1098, 1104)

GGCGGC->(1377, 1383)

GGCGGC->(1388, 1394)



The above output is a lot cleaner! This would look good in a report!
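
If a report also needs a count, keep in mind that the object returned by finditer can only be looped over once. The sketch below uses a made-up sequence (`toy` is hypothetical) and stores the match objects in a list so they can be counted and reused.

```python
import re

# Hypothetical toy sequence containing the motif twice.
toy = "AAGGCGGCTTGGCGGCAA"

hits_iter = re.finditer(r"GGCGGC", toy)
hits = list(hits_iter)      # store the match objects for reuse
leftover = list(hits_iter)  # the iterator is already exhausted here

count = len(hits)
spans = [m.span() for m in hits]
print(count, spans, leftover)
```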

## Let's Look at Another Way Of Getting all of the Exons with the RE split Method!

We will use the following steps:
1. Get all occurrences that may be exons.
1. Remove the newlines from the sequences.
1. Push each potential exon into a list.


In [None]:
import re
dna = """TGCCAGCTGCATCTCAAGGGCAGGGGCCAGGGTCAGAACAGGAGGCCCCTTTCTAGTGGATGCAACACCG
CACAGCAGCGCAGGACAAGGCTGTCTCACCTGCTGTTACCAACGCCCCCAGGTTGTGTTTCTCTTCGGAA
CGCTCCAGATGCAATGCCGATCTGGGCACGCAGGGAGGTGGGGGGTGATGGGAAGCTCACCAGGAACCAG
GGACTGGGGCCCAGACTGACCTGTCAGAAGGCACCAAGGTGGACGTGGTGTGGCCGGTCCTGCTTCACTA
GCCTTGCTCCTCCTAGTGGGAGGCCGTGCACTCTGCCAACCCGCTATCCCCCACCCTCACCCTGCCTCGG
ATCCACACCACCTTCCTGCTGGGAGGTGTGGGGTGACAGGAAGCTCTTGCCCCACACCATGGTCCTGGCG
CGGTCCTGGTGTGAGACATCCTGCATTTGAAACAGCTGTGTGACTTCAGGAGAGTTACTTGACCTTTCTG
TCTCAGCGTCTTCTGCAAAAGACTGAACAGGAACTCCGTGGCTCGGGCGAACTCACTCCTGTCAAGTGCA
TAGGAAGGGGCACGAGTGGGCGGGGAGGTGCTGCCCTGTTATTCACAGAACACTTTGCCCAGGCAACACA
CTACAAATCCACAACCTCGCTCCCTGCAGGTGCACTGAGACCACCCACGCCCTCCCGGACACCAACGCCC
ACCATCAGATTCGCTGCGCAAAGTCCCAGAGCCGCCGGCGCACGCTCACACCCCGGCGAGCAGCCCCCAG
CTCCCTCCCTCCGAGAGGAGCCCGGTCCGCGACCAGCCCAGCCCATCCCAGTCCCGCGCGGAGTCCTGGA
TTCCAGCCGCTCGCAGTGACTCGGTACTCGGGATAGTGCCGGGGGCCGCAGCCCTGTCCCGCTGCCGCCG
CCGGATGCCCCGAGTCGGCCGTCACGCACCCCCCGCGGGAGCCCGCGCCGCCCGCCGCGCCGGGGCCGTT
TAAATGGGCCAAGTTGTGGCGGCGGCGTCGGCGGCGGAGTCTCCCAAGTCCCCGCCGGGCGGGCGCGCGC
CAGTGGACGCGGGTGCACGACTGACGCGGCCCGGGCGGCGGGGCGGGGGCTTGGGACCCCCGAGAGGGGC
GGGGACTCCGCGACTCCTCGCTGCCGGGCTCGGCCTGGCGGGTGGGTCGGCGAGCCGGGCGTGGGACTGC
CCCGGGCGCGGGCGCTGGTGGCCGGGGCGCGGGACTCCAGACGCCCCGGGGAGCCCCGAGGCCCTGGAAC
TGCGGCGCTCGGCGAGTCGATCCGGGATCGATAGCAGCTCCATGTCTCCGGCCTCTGAGGCCCCGCCGGC
CGGCTGGGCAGTCCGGGGAGGCCTGGCGGGCGGCGCGTAGGCGGCGGCTGCGGGCGCCGGGGCGCACTAG
CGGACGGCGTGGGCGCGCGGCCAGGCGCCTCCCCGGCCCCCGCGACCCAACTCCAGCCCGGGCCGGAATA"""
    

In [None]:
splice = re.split(r"GT[AGCT]{14,20}GA", dna)
print(splice)

In [None]:
for match in splice:
    line = match.replace('\n', '')
    print(line)

In [None]:
exons = []
for match in splice:
    # Every element of splice is a candidate exon, so no extra test is needed
    line = match.replace('\n', '')
    exons.append(line)
print(exons)
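
One detail to watch for: the character class `[AGCT]` does not match a newline, so a pattern cannot match across a line break inside a multi-line string. The sketch below shows the effect on made-up data (`joined` and `wrapped` are hypothetical) and suggests removing the newlines before splitting.

```python
import re

# Hypothetical toy data: two flanking pieces around one 'GT...GA' stretch.
joined = "AAACCC" + "GTACGTACGTACGTACGA" + "TTTCCC"
# The same text wrapped across two lines, as in a multi-line string.
wrapped = joined[:14] + "\n" + joined[14:]

pattern = r"GT[AGCT]{14,20}GA"

split_wrapped = re.split(pattern, wrapped)                  # the newline blocks the match
split_clean = re.split(pattern, wrapped.replace("\n", ""))  # newline removed first

print(split_wrapped)
print(split_clean)
```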

## Using RE on a List
In this example we will demonstrate how we can use regex on a list.
We will perform the following operations:
1. Create an empty list.
1. Populate the list with list method **_extend_**.
1. Loop over the list to check whether each element contains the pattern, and print every element that does to standard output.

In [6]:
import re
seq10 = "ATGCCAATGTCCTGTTTCTAATGTATATGCAACACCATGCACAAT"
seq11 = "GTCTCAATGCATGTCTTCTATGCAACTAACCTCCATGTATGC"
seq12 = "TGTGGCGGCGGCGTCGGCGGCGGAGTCTCCCAAGT"
seq13 = "GGTGGCCGGGGCGCGGGACTCCAGACGCCCCGGGGA"
#empty list
NTs = []
#Populate the empty list 
NTs.extend((seq10,seq11,seq12,seq13))
# Loop through the list to check for the pattern. If the pattern exists, print the
# sequence with a message.
for nt in NTs:
    if re.search(r"CAA[CT]", nt):
        print(nt + " has the motif.\n")

ATGCCAATGTCCTGTTTCTAATGTATATGCAACACCATGCACAAT has the motif.

GTCTCAATGCATGTCTTCTATGCAACTAACCTCCATGTATGC has the motif.



How do we know which sequence ID we're working with? We can define a counter outside of the for loop and increment it on every pass through the loop, before the if condition. Our counter will start with **_9_**, because the first sequence ID is **_10_**. If we incremented the counter only inside the if condition, any sequence without the motif would be skipped in the count and every later match would get the wrong sequence ID, which would be a logical error.

In [None]:
nt = 9
for n in NTs:
    nt += 1 # Count every sequence, so the first one is labeled seq10.
    if re.search(r"CAA[CT]", n):
        print("seq" + str(nt) + ":" + n + " has the motif.\n")

This looks a lot cleaner. We can easily look at a label such as seq10 and see that it
matches the originally defined sequence.

There are other instances where we will define an empty list and populate it while we are in the loop, then use a condition with regex to determine the flow of the program. This is a pretty common operation in Bioinformatics.
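
A minimal sketch of that idea follows. It reuses the four sequences from the list example, but storing them in a dictionary keyed by ID, and the list names `with_motif` and `without_motif`, are assumptions made here for illustration.

```python
import re

# Sequences copied from the list example, keyed by their IDs (an assumed layout).
sequences = {
    "seq10": "ATGCCAATGTCCTGTTTCTAATGTATATGCAACACCATGCACAAT",
    "seq11": "GTCTCAATGCATGTCTTCTATGCAACTAACCTCCATGTATGC",
    "seq12": "TGTGGCGGCGGCGTCGGCGGCGGAGTCTCCCAAGT",
    "seq13": "GGTGGCCGGGGCGCGGGACTCCAGACGCCCCGGGGA",
}

with_motif = []     # filled inside the loop
without_motif = []  # filled inside the loop

for seq_id, seq in sequences.items():
    # The regex test decides which list receives the sequence ID.
    if re.search(r"CAA[CT]", seq):
        with_motif.append(seq_id)
    else:
        without_motif.append(seq_id)

print(with_motif)
print(without_motif)
```

Keeping the ID next to each sequence avoids the need for a separate counter.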

# Bringing It All Together

We will use the **_dna_** sequence again as the input for our program. 
We will perform the following operations:
1. Create a conditional statement that searches for the existence of the regex.
1. If the pattern exists, report its existence to stdout.
1. Locate and return each match with its positions.
 

In [None]:
import re
# Create a conditional statement that searches for the existence of the regex.
# The same motif is used for the check and for the search, so the messages agree.
motif = r"GGCGGC"
if re.search(motif, dna):
    print("The motif GGCGGC exists in the sequence.")
# We call on the re.finditer method. Using the return value
# in a loop provides a means of matching the motif
# and doing something with it.
motif_match = re.finditer(motif, dna)
for match in motif_match:
    match_start = match.start()
    match_end = match.end()
    print("The motif GGCGGC is located at positions " + str(match_start) + " to " + str(match_end))
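
A caveat when locating every match: finditer reports non-overlapping matches, so a second copy of the motif that starts inside the first one is skipped. The sketch below, on a made-up sequence (`toy` is hypothetical), shows one common workaround that wraps the motif in a lookahead.

```python
import re

# Hypothetical toy sequence in which two copies of the motif overlap.
toy = "TGGCGGCGGCT"

plain_starts = [m.start() for m in re.finditer(r"GGCGGC", toy)]
# A lookahead matches at a position without consuming characters,
# so overlapping occurrences are reported as well.
overlap_starts = [m.start() for m in re.finditer(r"(?=GGCGGC)", toy)]

print(plain_starts)
print(overlap_starts)
```

With the lookahead, each match is empty, so only the start position is meaningful. The end can be computed by adding the motif length.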


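**_Exercise_**

Replace the line of code that uses the RE method finditer with the method findall. Check out the difference. finditer offers more flexibility because it yields match objects, which carry the position of each match as well as its text, while findall returns only a list of the matched strings.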

In [None]:
# Perform exercise here.