# Regular Expression with Python

### Learning Goals
- learn basic syntax for regular expression in Python and be able to write simple regular expressions using common operations in pattern matching
- understand the flexibility regular expression affords in searching and articulate at least one real world example where this is useful
- become familiar with Python's re module and some of its functions such as findall(), search(), and sub()
- become aware of online resources such as documentation, "cheat sheets" and Stack Exchange to independently learn more about a technical topic

Regular expressions (shortened as regex or regexp) are used for pattern matching in text.  For example, when you use the "find and replace" feature in Microsoft Word, you are asking the computer to find specific strings which match a pattern and replace them with another string.  We might desire a more flexible way to search and replace.  For example, we might wish to locate and replace a word spelled two different ways in a text: serialise and serialize (British and American spelling).  The regular expression seriali[sz]e matches both "serialise" and "serialize". Wildcard characters also achieve this, but are more limited in the patterns they can describe.
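As a small illustration of the spelling example, the sketch below applies seriali[sz]e to a made-up sentence (the sentence and variable names are assumptions for illustration only).

```python
import re

# Hypothetical sentence containing both spellings
text = "Please serialise the data; the docs say to serialize it first."
pattern = r"seriali[sz]e"  # [sz] accepts either an s or a z in that position

found = re.findall(pattern, text)             # every spelling that occurs
unified = re.sub(pattern, "serialize", text)  # rewrite both to one spelling

print(found)
print(unified)
```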

Other examples where this flexibility is useful include searching for and extracting email addresses from a file.  We know there will be an at sign (@), but we don't know in advance how long the text in front of it is or which characters it uses.  It is possible to read through texts and look for patterns using string methods like split() and find().  However, searching and extracting is so common that there is a powerful library for these tasks (re).

The re module provides a set of powerful regular expression facilities, which allow you to quickly check whether a given string matches a given pattern at its beginning (using the match function), or contains such a pattern anywhere (using the search function). A regular expression is a string pattern written in a compact (and quite cryptic) syntax.
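The difference between the two functions can be seen in a short sketch: re.match() only looks at the start of the string, while re.search() scans the whole string.

```python
import re

txt = "The rain in Spain"

at_start = re.match("rain", txt)   # None: the string does not begin with "rain"
anywhere = re.search("rain", txt)  # match object: "rain" occurs later in the string

print(at_start)
print(anywhere.group(), anywhere.span())
```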

The module functions fall into three categories:
- pattern matching
- substitution
- splitting

The regex describes a pattern to locate in the text, and then we can use specific methods to accomplish tasks.  You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?” You can also use REs to modify a string or to split it apart in various ways.

The documentation provides more details: https://docs.python.org/3/library/re.html.

### Look for a match


In [4]:
import re

#Check if the string starts with "The" and ends with "Spain":

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

if x:
  print("YES! We have a match!")
else:
  print("No match")



YES! We have a match!


In [5]:
import re

#Check if the string starts with "The" and ends with "Spain":

txt = "The Running of the Bulls occurs in Pamplona, Spain"
x = re.search("^The.*Spain$", txt)

if x:
  print("YES! We have a match!")
else:
  print("No match")



YES! We have a match!


In [6]:
import re

#Check if the string starts with "The" and ends with "Spain":

txt = "The Louvre is in Paris, France"
x = re.search("^The.*Spain$", txt)

if x:
  print("YES! We have a match!")
else:
  print("No match")



No match


### Print Matches

In [8]:
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)
len(x) # how many matches are found

['ai', 'ai']


2

In [9]:
import re
# returns an empty list if no matches are found
txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

[]


### Search
The search() function searches the string for a match and returns a match object if there is a match.  If there is more than one match, only the first occurrence will be returned.  If there are no matches, None is returned.  The match object returned has properties and methods which can provide more information about the search, such as
- span(), which returns a tuple containing the start and end positions of the match
- group(), which returns the part of the string where there was a match

In [10]:
import re

txt = "The rain in Spain"
# search for first white space character \s
x = re.search(r"\s", txt)  # raw string so the backslash reaches the regex engine unchanged

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3


In [11]:
import re

txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)

None
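Because search() returns None when nothing matches, calling a method such as start() on the result would raise an AttributeError. A minimal sketch of guarding against this follows (the -1 fallback value is an assumption chosen for illustration).

```python
import re

txt = "The rain in Spain"
m = re.search("Portugal", txt)

if m is None:
    position = -1          # hypothetical sentinel meaning "not found"
else:
    position = m.start()   # only safe when a match object was returned

print(position)
```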


In [18]:
import re

txt = "The rain in Spain"
x = re.search("ai", txt)
print(x) #this will print an object

<re.Match object; span=(5, 7), match='ai'>


In [19]:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
# span returns the start and end position of the first match occurrence.
print(x.span())

(12, 17)


In [20]:
import re

txt = "The rain in Spain"
# look for upper case S
x = re.search(r"\bS\w+", txt)
# print the part of the string where there was a match
print(x.group())


Spain


### Split
The split() function returns a list where the string has been split on each match.  For simple cases, such as the single spaces in this example, the same result can be obtained with the string method split().

In [13]:
import re

txt = "The rain in Spain"
# split on white space
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


In [15]:
txt = "The rain in Spain"
x = txt.split()
print(x)

['The', 'rain', 'in', 'Spain']
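Where re.split() goes beyond the string method is in splitting on several different delimiters at once. The sketch below uses a made-up string that mixes semicolons, commas and spaces.

```python
import re

# Hypothetical input with three different separators
record = "The rain;in,Spain mainly"

# Split on any run of semicolons, commas or whitespace
parts = re.split(r"[;,\s]+", record)
print(parts)
```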


### Substitution
The sub() function replaces the matches found with a string you indicate.  You can control how many replacements are done with the optional count parameter.

In [16]:
import re

txt = "The rain in Spain"
# replace spaces with the number nine as a string
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


In [17]:
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)  # replace only the first two matches
print(x)

The9rain9in Spain


### Task
We have now seen several pieces of syntax that flexibly control what we want to match or replace.  Google "regular expression in Python cheat sheet" and download one of your choosing.  Take a moment to read through it.
Choose four different syntaxes on the cheat sheet and write a small example with a string of your choosing like "The rain in Spain" to test them.  
Specifically experiment with
- [] vs ()
- \+ vs \*
- {}

Find one question on Stack Overflow or Stack Exchange related to regular expression in Python and read the answer and test it out with code.  (If the first one you find doesn't make sense, look for another.)  Did you learn anything about syntax from the example you found?  Be prepared to explain the problem and solution to a peer.

### Solution (many solutions are possible)

In [27]:
# grouping with ()
import re
s = 'The rain in Spain'
x = re.sub('(ai)', '9', s)
x

'The r9n in Sp9n'

In [28]:
# greedy
# replaces every instance of one or more ai together
import re
s = 'The rain in Spain makes me shout aiaiaiaiaiaiaiai'
x = re.sub('(ai)+', '9', s)
x

'The r9n in Sp9n makes me shout 9'

In [30]:
# nongreedy
# +? matches as few repetitions as possible, so each single ai is replaced separately
import re
s = 'The rain in Spain makes me shout aiaiaiaiaiaiaiai'
x = re.sub('(ai)+?', '9', s)
x

'The r9n in Sp9n makes me shout 99999999'

In [31]:
# replaces every instance of one or more a or i (doesn't look for ai)
import re
s = 'The rain in Spain makes me shout aiaiaiaiaiaiaiai'
x = re.sub('[ai]+', '9', s)
x

'The r9n 9n Sp9n m9kes me shout 9'

In [32]:
# replaces every instance of exactly two consecutive ai (that is, aiai)
import re
s = 'The rain in Spain makes me shout aiaiaiaiaiaiaiai'
x = re.sub('(ai){2}', '9', s)
x

'The rain in Spain makes me shout 9999'

### Task
We will use the .txt file emails.txt for this exercise.  This file needs to be in the same directory as your program or Jupyter notebook.  

Here is a snippet of the file:

From bkirschn@umich.edu Fri Dec 21 09:55:06 2007
Return-Path: <postmaster@collab.sakaiproject.org>
Received: from murder (mail.umich.edu [141.211.14.25])
	 by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;
	 Fri, 21 Dec 2007 09:55:06 -0500
X-Sieve: CMU Sieve 2.3
Received: from murder ([unix socket])
	 by mail.umich.edu (Cyrus v2.2.12) with LMTPA;
	 Fri, 21 Dec 2007 09:55:06 -0500
Received: from dreamcatcher.mr.itd.umich.edu (dreamcatcher.mr.itd.umich.edu [141.211.14.43])
	by panther.mail.umich.edu () with ESMTP id lBLEt6x8006098;
	Fri, 21 Dec 2007 09:55:06 -0500
Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])
	BY dreamcatcher.mr.itd.umich.edu ID 476BD3C4.BFDC1.28307 ; 
	21 Dec 2007 09:55:03 -0500
Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])
	by paploo.uhi.ac.uk (Postfix) with ESMTP id A4CC6A7DD7;
	Fri, 21 Dec 2007 14:51:39 +0000 (GMT)
Message-ID: <200712211454.lBLEs7d9009944@nakamura.uits.iupui.edu>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit


Use the search() function to return the lines in the .txt file which contain the word "From".

Recall that the rstrip() method removes trailing characters (characters at the end of a string), with whitespace as the default set of characters to remove.
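A quick sketch of rstrip() on made-up strings: with no argument it removes trailing whitespace (including the newline at the end of each line read from a file), and with an argument it removes the given trailing characters.

```python
# Hypothetical line as it might be read from a file, with trailing spaces and a newline
line = "From: zqian@umich.edu  \n"
stripped = line.rstrip()            # default: strip trailing whitespace

custom = "value;;;".rstrip(";")     # strip trailing semicolons instead

print(repr(line))
print(repr(stripped))
print(custom)
```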



### Solution

In [49]:
import re
hand = open('emails.txt')
for line in hand:
    line = line.rstrip() # strip trailing white space
    if re.search('From', line):
        print(line)
hand.close()

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From: stephen.marquard@uct.ac.za
From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008
From: louis@media.berkeley.edu
From zqian@umich.edu Fri Jan  4 16:10:39 2008
From: zqian@umich.edu
From rjlowe@iupui.edu Fri Jan  4 15:46:24 2008
From: rjlowe@iupui.edu
From zqian@umich.edu Fri Jan  4 15:03:18 2008
From: zqian@umich.edu
From cwen@iupui.edu Wed Dec 19 09:45:21 2007
From: cwen@iupui.edu
From gsilver@umich.edu Wed Dec 19 09:35:37 2007
From: gsilver@umich.edu
From cwen@iupui.edu Wed Dec 19 08:47:43 2007
From: cwen@iupui.edu
From stuart.freeman@et.gatech.edu Wed Dec 19 08:41:55 2007
From: stuart.freeman@et.gatech.edu
From wagnermr@iupui.edu Wed Dec 19 08:39:56 2007
From: wagnermr@iupui.edu
From stuart.freeman@et.gatech.edu Wed Dec 19 08:35:04 2007
From: stuart.freeman@et.gatech.edu
From stephen.marquard@uct.ac.za Wed Dec 19 03:40:33 2007
From: stephen.marquard@uct.ac.za
From stephen.marquard@uct.ac.za Wed Dec 19 03:20:57 2007
From:

The above task could also be accomplished with the string method line.find() or the in operator.  The real power of regular expression comes from adding special characters to the search string to more precisely control which lines match the string.  For example, we can search for lines that start with From: and contain an @ sign.
- the ^ symbol indicates the start of the line
- the .+ means one or more characters.  Think of this as a wildcard expanding to match an unspecified number of characters.
- the @ looks for this sign
- putting this together, we can search for '^From:.+@'

Notice that we don't have to specify how many characters are before the @.  This is good because email addresses vary.  For example, Humboldt's Math Department email is math@humboldt.edu and the Biology Department's email is biosci@humboldt.edu.  They have a different number of characters before the @ sign.  Therefore we need flexibility in the searching.
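To see the pattern at work without a file, the sketch below tests '^From:.+@' against a few made-up lines built around the two addresses mentioned above (the list contents are assumptions for illustration).

```python
import re

# Hypothetical lines; only the first two start with "From:" and contain an @
lines = [
    "From: math@humboldt.edu",
    "From: biosci@humboldt.edu",
    "From math@humboldt.edu Fri",   # no colon after From, so ^From: fails
    "Subject: hello @ noon",        # has an @ but does not start with From:
]

matched = [ln for ln in lines if re.search(r"^From:.+@", ln)]
print(matched)
```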

Here is code to accomplish this task.

In [50]:
import re
hand = open('emails.txt')
for line in hand:
    line = line.rstrip()
    # search for lines starting with From: followed by one or more characters (.+) and then an @ sign
    if re.search('^From:.+@', line):
        print(line)
hand.close()

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: cwen@iupui.edu
From: stuart.freeman@et.gatech.edu
From: wagnermr@iupui.edu
From: stuart.freeman@et.gatech.edu
From: stephen.marquard@uct.ac.za
From: stephen.marquard@uct.ac.za
From: stephen.marquard@uct.ac.za
From: stephen.marquard@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: ian@caret.cam.ac.uk
From: antranig@caret.cam.ac.uk
From: ian@caret.cam.ac.uk
From: ian@caret.cam.ac.uk
From: ian@caret.cam.ac.uk
From: wagnermr@iupui.edu
From: david.horwitz@uct.ac.za
From: jleasia@umich.edu
From: david.horwitz@uct.ac.za
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From: mmmay@indiana.edu
From

### Extracting Data with Regular Expression
The function findall() finds *all* the matches and returns them as a list of strings, with each string representing one match.
If we would like to find all strings that look like email addresses, we can search for '\S+@\S+'.  This works because
- \S matches a single character other than whitespace.  Adding the + means one or more such characters, so \S+ matches as many non-whitespace characters as possible (greedy).
- @ matches the at sign found in all email addresses
- \S+ again matches one or more non-whitespace characters

The terms greedy and lazy describe how much text a quantifier such as + or * consumes:
- greedy (default): match as many characters as possible while still allowing the overall pattern to match
- lazy (indicated with a ? after the quantifier, as in +? or *?): match as few characters as possible
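The difference is easiest to see side by side. In this sketch the input string is made up from two bracketed addresses that appear in the file; the greedy pattern runs from the first < to the last >, while the lazy pattern stops at the first > it can.

```python
import re

# Hypothetical line containing two bracketed addresses
header = "<postmaster@collab.sakaiproject.org> <source@collab.sakaiproject.org>"

greedy = re.findall(r"<.+>", header)   # .+ grabs as much as possible
lazy = re.findall(r"<.+?>", header)    # .+? grabs as little as possible

print(greedy)
print(lazy)
```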

In [44]:
import re
hand = open('emails.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'\S+@\S+', line) # one or more non-whitespace characters, an @, then one or more non-whitespace characters
    if len(x) > 0:
        print(x)
hand.close()

['bkirschn@umich.edu']
['<postmaster@collab.sakaiproject.org>']
['<200712211454.lBLEs7d9009944@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['bkirschn@umich.edu']
['source@collab.sakaiproject.org']
['bkirschn@umich.edu']
['bkirschn@umich.edu']
['david.horwitz@uct.ac.za']
['<postmaster@collab.sakaiproject.org>']
['<200712211453.lBLErI1D009932@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['david.horwitz@uct.ac.za']
['source@collab.sakaiproject.org']
['david.horwitz@uct.ac.za']
['david.horwitz@uct.ac.za']
['david.horwitz@uct.ac.za']
['<postmaster@collab.sakaiproject.org>']
['<200712211449.lBLEniVf009920@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sa

We see some of the email addresses returned have characters we might not want.  For example, we might want to remove the < and > in this address: <postmaster@collab.sakaiproject.org>.  We can keep just the portion of the string that starts with a letter or a number and ends with a letter.
- square brackets are used to indicate a set of acceptable characters, any one of which may match
- [a-zA-Z0-9]\S*@\S*[a-zA-Z] tells us to look for substrings that
    - start with a single lowercase letter, uppercase letter or digit ([a-zA-Z0-9])
    - followed by zero or more non-whitespace characters (\S*)
    - followed by an @ sign
    - followed by zero or more non-whitespace characters (\S*) and ending with a letter ([a-zA-Z]), which drops trailing punctuation such as > or ;
    - note that * means zero or more and + means one or more, applied to the single character or bracketed set immediately to the left
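The effect of the stricter pattern can be checked on three of the untidy strings returned by the earlier search. This is a minimal sketch; the list and variable names are chosen here for illustration.

```python
import re

# Strings taken from the earlier output, each with unwanted leading or trailing characters
raw = [
    "<postmaster@collab.sakaiproject.org>",
    "source@collab.sakaiproject.org;",
    "apache@localhost)",
]

pattern = r"[a-zA-Z0-9]\S*@\S*[a-zA-Z]"

# Each string contains exactly one address, so take the first match
cleaned = [re.findall(pattern, s)[0] for s in raw]
print(cleaned)
```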

In [51]:
import re
hand = open('emails.txt')
for line in hand:
    line = line.rstrip()
    x = re.findall(r'[a-zA-Z0-9]\S*@\S*[a-zA-Z]', line)
    if len(x) > 0:
        print(x)
hand.close()

['stephen.marquard@uct.ac.za']
['postmaster@collab.sakaiproject.org']
['200801051412.m05ECIaH010327@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['postmaster@collab.sakaiproject.org']
['200801042308.m04N8v6O008125@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['louis@media.berkeley.edu']
['zqian@umich.edu']
['postmaster@collab.sakaiproject.org']
['200801042109.m04L92hb007923@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject

### Task
Consider the following string: 'X-DSPAM-Confidence: 0.8475'
- Method one: use find and string slicing to extract the number and convert it to a float.
- Method two: instead, complete this task with regular expression.

### Solution With String Slicing

In [1]:
phrase = 'X-DSPAM-Confidence: 0.8475'

col_pos = phrase.find(':')                  # Finds the colon character
number = phrase[col_pos+1:]                 # Extracts portion after colon
number = float(number)                  # Converts to floating point number
print(number)

0.8475


### Solution with Regular Expression

In [14]:
import re
phrase = 'X-DSPAM-Confidence: 0.8475'
number = re.findall(r'\d+\.\d+', phrase) # returns a list of matching strings
number = float(number[0])
print(number)

0.8475


### Task
Extract the hour of day that email messages were sent.  We will use the .txt file emails.txt.  This file needs to be in the same directory as your program or Jupyter notebook.  

- Method one: complete this task with regular expression.
- Method two: instead, do this with two calls to split.  Splitting on the colon and then on spaces is one way to extract the hour (e.g., 09 for 9 am).

### Solution With Regular Expression

In [48]:
import re
hand = open('emails.txt')
for line in hand:
    line = line.rstrip()
    # Look for lines that start with "From " followed by any characters, then a space,
    # two digits, and a colon.  The parentheses form a capture group, so findall()
    # returns only the two digits (the hour) rather than the whole match.
    x = re.findall(r'^From .* ([0-9][0-9]):', line)
    if len(x) > 0: print(x)
hand.close()        

['09']
['18']
['16']
['15']
['15']
['09']
['09']
['08']
['08']
['08']
['08']
['03']
['03']
['03']
['02']
['02']
['20']
['12']
['12']
['12']
['11']
['11']
['08']
['07']
['07']
['05']
['18']
['18']
['18']
['18']
['18']
['18']
['18']
['18']
['18']
['18']
['17']
['17']
['17']
['17']
['17']
['17']
['17']
['17']
['17']
['17']
['17']
['17']
['17']
['17']
['17']
['17']
['17']
['17']
['17']
['16']
['16']
['16']
['16']
['16']
['16']
['16']
['16']
['16']
['16']
['16']
['16']
['15']
['15']
['15']
['15']
['15']
['15']
['15']
['14']
['14']
['14']
['14']
['14']
['14']
['14']
['14']
['14']
['14']
['13']
['13']
['13']
['13']
['13']
['12']
['12']
['12']
['11']
['11']
['11']
['11']
['10']
['10']
['10']
['10']
['10']
['10']
['10']
['08']
['08']
['08']
['07']
['06']
['06']
['01']
['20']
['23']
['19']
['19']
['19']
['17']
['16']
['16']
['15']
['14']
['14']
['14']
['14']
['14']
['14']
['14']
['13']
['13']
['13']
['11']
['11']
['11']
['11']
['10']
['10']
['10']
['10']
['10']
['08']
['08']
['22']
['22']
['21']

### Solution with String Splitting

In [83]:
fhand = open('emails.txt')
for line in fhand:
    line = line.rstrip()
    if not line.startswith('From '): continue # do nothing if it is not a line we are interested in
    x = line.split(':') # split on colon
    y = x[0].split(' ') # split on space
    print(y[-1]) # print hour
fhand.close()

09
18
16
15
15
09
09
08
08
08
08
03
03
03
02
02
20
12
12
12
11
11
08
07
07
05
18
18
18
18
18
18
18
18
18
18
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
17
16
16
16
16
16
16
16
16
16
16
16
16
15
15
15
15
15
15
15
14
14
14
14
14
14
14
14
14
14
13
13
13
13
13
12
12
12
11
11
11
11
10
10
10
10
10
10
10
08
08
08
07
06
06
01
20
23
19
19
19
17
16
16
15
14
14
14
14
14
14
14
13
13
13
11
11
11
11
10
10
10
10
10
08
08
22
22
21
20
20
17
17
16
15
15
15
15
14
14
12
12
12
11
11
11
11
09
09
09
07
07
05
18
16
16
15
15
15
15
15
12
11
11
11
09
08
08
07
04
21
21
19
19
19
19
19
18
18
17
17
17
14
12
12
12
11
11
10
09
09
09
08
08
07
07
19
18
16
15
15
14
14
14
14
13
13
13
13
15
13
13
13
13
12
11
11
10
10
10
10
10
09
09
09
09
09
06
06
06
06
06
06
06
06
06
02
18
16
16
16
16
16
13
13
12
12
11
11
10
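The two approaches can be compared on a single line. This sketch uses the first header line from the file snippet shown earlier and confirms that both methods extract the same hour.

```python
import re

# First line of the emails.txt snippet shown earlier
line = "From bkirschn@umich.edu Fri Dec 21 09:55:06 2007"

# Regular expression: the capture group returns only the two digits before the first colon
hour_regex = re.findall(r"^From .* ([0-9][0-9]):", line)[0]

# String splitting: text before the first colon, then the last space-separated piece
hour_split = line.split(":")[0].split(" ")[-1]

print(hour_regex, hour_split)
```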


### Task (Main Course)
In this assignment you will read through and parse a file with text and numbers. You will extract all the numbers in the file and compute the sum of the numbers.  


The file contains text from a data science textbook introduction with random numbers inserted throughout the text.

For example, the text might look like this:

Why should you learn to write programs? 7746
12 1929 8827
Writing programs (or programming) is a very creative 
7 and rewarding activity.  You can write programs for 
many reasons, ranging from making your living to solving
8837 a difficult data analysis problem to having fun to helping 128
someone else solve a problem.  This book assumes that 
everyone needs to know how to program ...


The data can be found at this link: http://py4e-data.dr-chuck.net/regex_sum_1742785.txt. 

The basic outline of this problem is to 
- read the file
- look for integers using re.findall() with the regular expression '[0-9]+'
- convert the extracted strings to integers
- sum up the integers.


### Solution

In [None]:
import re

total = 0  # named total to avoid shadowing the built-in sum()

file = open('regex_sum_1742785.txt')
for line in file:
    numbers = re.findall('[0-9]+', line)  # empty list if the line has no digits
    for number in numbers:
        total += int(number)  # convert from string to int, then add

print(total)
file.close()
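The same idea can be tried without downloading the file. This sketch runs the pattern over the sample text shown in the task (stopping after the last line that contains a number) and sums the result in one expression.

```python
import re

# Sample text from the task description
sample = """Why should you learn to write programs? 7746
12 1929 8827
Writing programs (or programming) is a very creative
7 and rewarding activity.  You can write programs for
many reasons, ranging from making your living to solving
8837 a difficult data analysis problem to having fun to helping 128
"""

numbers = re.findall(r"[0-9]+", sample)   # all digit runs, as strings
total = sum(int(n) for n in numbers)      # convert and add

print(numbers)
print(total)
```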

### Applications to Bioinformatics
The flexibility of regular expression is particularly useful in bioinformatics.  A codon is a DNA or RNA sequence of 3 nucleotides that encodes a particular amino acid or gives a stop signal.  For DNA, there are three stop codons: TAG, TAA, and TGA.  If we want to match any sequence of DNA terminated by a stop codon, we can use this syntax:
([ACTG])+(TAG|TAA|TGA).
- [ACTG] indicates any of the nucleotide bases (A, C, T, G) 
- the parentheses group patterns
- + modifies the previous group to match one or more times
- (TAG|TAA|TGA) indicates followed by one of the stop codons (the | notation signifies or)
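A minimal sketch of this pattern on a short made-up DNA string (the sequence is an assumption for illustration, not real data). Group 2 holds the stop codon that ended the match.

```python
import re

seq = "ATGCCGTAAGG"  # hypothetical sequence containing the stop codon TAA

m = re.search(r"([ACTG])+(TAG|TAA|TGA)", seq)

matched_region = m.group(0)  # bases up to and including the stop codon
stop_codon = m.group(2)      # the alternative that matched inside the second group

print(matched_region)
print(stop_codon)
```

Because + is greedy, a sequence with several stop codons would be matched up to the last one; a lazy +? would stop at the first.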

Curly brackets allow flexibility in terms of how many repetitions we are searching for.  For example, (AT){10,100} matches an "AT" repeated 10 to 100 times.  (AT){10,} matches an "AT" repeated 10 or more times (no upper bound).
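The bounds are easier to experiment with on a small scale. The sketch below uses a made-up sequence with four AT repeats and smaller bounds than in the text.

```python
import re

seq = "GG" + "AT" * 4 + "CC"  # hypothetical sequence: GGATATATATCC

three_or_more = re.search(r"(AT){3,}", seq)   # no upper bound: takes all four repeats
two_to_three = re.search(r"(AT){2,3}", seq)   # upper bound of three repeats
five_or_more = re.search(r"(AT){5,}", seq)    # only four repeats exist, so no match

print(three_or_more.group(0))
print(two_to_three.group(0))
print(five_or_more)
```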

Open the .txt file grape.txt provided.  This file contains information about the Vitis vinifera (common grape) genome.  

The GATA protein is a transcription factor and is important for regulating transcription (the process where cells make an RNA copy of a piece of DNA which will later be used to make proteins). It binds to any short DNA sequence which matches the pattern GATA with either an A or a T before and either a G or an A after.  For example, in this sequence 
- AAAAAAATGATAGAAAAAGATAAAAAA
there are two matches (find the substring GATA, then check that the base before it is an A or a T and the base after it is a G or an A).

Given a specific string, we can use regular expression to find out how many times this motif occurs.

In [20]:
import re


def count_motifs(seq, motif):
    pieces = re.split(motif, seq)
    return len(pieces) - 1

seq = 'AAAAAAATGATAGAAAAAGATAAAAAA'
print(count_motifs(seq, '[AT]GATA[GA]'))


2
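An alternative sketch counts the motif with findall() instead of split(). Both approaches count only non-overlapping matches; if overlapping occurrences should be counted too, one option is to wrap the motif in a lookahead. The second sequence below is made up to show the difference.

```python
import re

motif = r"[AT]GATA[GA]"

seq = "AAAAAAATGATAGAAAAAGATAAAAAA"
count_findall = len(re.findall(motif, seq))

# Hypothetical sequence where two occurrences share the bases "AG"
overlapping_seq = "AGATAGATAG"
non_overlapping = len(re.findall(motif, overlapping_seq))
# (?=(...)) matches at every position where the motif starts, without consuming it
overlapping = len(re.findall(rf"(?=({motif}))", overlapping_seq))

print(count_findall, non_overlapping, overlapping)
```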


### Task  (Dessert)
Huntington's Disease is a neurodegenerative disorder and is linked to the anomalous expansion of the number of trinucleotide repeats in particular genes.  Human beings have 23 pairs of chromosomes in our cells and each of our parents contributes one chromosome to each pair. The gene that causes Huntington's Disease (HD) is found on chromosome 4.  Each of us gets one copy of the gene from our mother and one copy from our father.

The gene responsible for HD contains a sequence with several CAG repeats (Cytosine, Adenine, Guanine). We all have these CAG repeats in the gene that codes for the huntingtin protein, but people with HD have a greater number than usual of CAG repeats in one of the genes they inherited.

The actual number of repeats of a specific codon determines the risk of developing HD.  Both the CAG and CAA codons specify glutamine, and more than 35 repeats virtually assure the disease.  In this task, we will use regular expression to find the polyglutamine (called polyQ) repeat number for a specific mRNA sequence.


- go to the NCBI nucleotide database for the htt mRNA sequence and download it to your working directory.  https://www.ncbi.nlm.nih.gov/nuccore/NM_002111.8?report=fasta.  The beginning of the sequence is GCTCGGGGAC.  (You can copy and paste the sequence into a .txt file.)
- open the file and read it
- use regular expression to determine how many times either codon is repeated (you can run your code with different numbers to experiment and take the maximum repetition number).



### Solution
Note: the minimum repeat count of 18 was tuned by hand; it is the largest value for which the pattern still finds a match.

In [11]:
import re
fhand = open('HTTmRNA.txt')
htt_mRNA = fhand.read()
# look for 18 or more consecutive codons that encode glutamine (CAG or CAA)
htt_pattern = '(CAG|CAA){18,}'

# findall() returns one entry per run that satisfies the pattern
match = re.findall(htt_pattern, htt_mRNA)
print('The number of polyQ repeats found is ' + str(len(match)))
fhand.close()

The number of polyQ repeats found is 1


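This tells us that the mRNA sequence contains exactly one run of at least 18 consecutive CAG or CAA codons.  Note that the printed value counts such runs, not the codons within them; since 18 is the largest minimum that still matches, the longest polyQ run has 18 repeats.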
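Instead of tuning the minimum by hand, the longest run can be measured directly. The following is a sketch under stated assumptions: the short sequence is made up for illustration (it is not the htt mRNA), and whitespace is removed first on the assumption that a pasted sequence may contain line breaks, which would otherwise split a run.

```python
import re

# Hypothetical sequence with a line break inside the first run of codons
raw_text = "GGC" + "CAG" * 3 + "\n" + "CAG" * 2 + "CAA" + "CAG" + "TTT" + "CAGCAG"

sequence = re.sub(r"\s+", "", raw_text)  # drop line breaks and other whitespace

# (?:...) groups without capturing, so finditer reports each whole run
runs = [m.group(0) for m in re.finditer(r"(?:CAG|CAA)+", sequence)]

# Each codon is 3 bases long, so run length in codons is len(run) // 3
longest = max((len(run) // 3 for run in runs), default=0)

print(runs)
print(longest)
```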

### References
- Python for Everybody: Exploring Data in Python 3 by Charles Severance.  https://www.py4e.com/book.php
- Möncke‐Buchner, Elisabeth, et al. "Counting CAG repeats in the Huntington’s disease gene by restriction endonuclease Eco P15I cleavage." Nucleic Acids Research 30.16 (2002): e83-e83.
- A Primer for Computational Biology by Shawn T. O'Neil.  https://open.oregonstate.education/computationalbiology/chapter/bioinformatics-knick-knacks-and-regular-expressions/
- Using Regular Expression in Genetics with Python by Stephen Fordham.  https://towardsdatascience.com/using-regular-expression-in-genetics-with-python-175e2b9395c2