# Regular expressions in Python

### Sources

Python for Biologists - Martin Jones, Chapter 7: Regular expressions (Pages 149 - 178)

[Regular expressions in genomics](https://towardsdatascience.com/using-regular-expression-in-genetics-with-python-175e2b9395c2)

## What are regular expressions (Regex)?
A regular expression (regex) is a sequence of characters that defines a search pattern.

## Common regular expression syntax
Source: [Regular expressions in genomics](https://towardsdatascience.com/using-regular-expression-in-genetics-with-python-175e2b9395c2)

| Syntax | Description|
| --- | --- |
| . | Matches any single character |
| ^ | Anchor; matches from the start of a string |
| $ | Anchor; matches at the end of a string |
| \ | Escape character |
| &#124; | Pipe character OR; C&#124;T will match C or T |
| * | Matches zero or more repetitions of the previous character; GGGA*TTT matches GGGTTT, GGGATTT, GGGAATTT, etc. |
| + | Matches one or more repetitions of the previous character; GGGA+TTT matches GGGATTT, GGGAATTT, etc. |
| ? | Matches zero or one repetition of the previous character; GAT?C means that T is optional and matches GAC or GATC |
| {n} | Quantifier; matches exactly n repetitions of the previous character |
| {n,x} | Quantifier; matches n to x repetitions of the previous character (no space after the comma) |
| [ ] | Character group; e.g. [AGCT] matches any one of the characters A, G, C or T |
| [^] | Negated character group; e.g. [^AGCT] matches any one character not in this group |
| ( ) | Grouping; treats the enclosed pattern as a single unit and captures the text it matches |
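
The quantifier rows of the table can be checked directly. The following is a minimal sketch using `re.fullmatch()`, which succeeds only when the whole string matches the pattern; the helper name `matches` is made up for this illustration.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# * : zero or more A
print(matches(r"GGGA*TTT", "GGGTTT"))    # True
print(matches(r"GGGA*TTT", "GGGAATTT"))  # True
# + : at least one A is required
print(matches(r"GGGA+TTT", "GGGTTT"))    # False
# ? : the T is optional, but at most one is allowed
print(matches(r"GAT?C", "GAC"))          # True
print(matches(r"GAT?C", "GATTC"))        # False
# {n,x} : between 2 and 3 A
print(matches(r"GA{2,3}T", "GAAT"))      # True
print(matches(r"GA{2,3}T", "GAT"))       # False
```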



In [1]:
import re

## Raw strings

Writing regular expressions requires us to type a lot of special characters. 

If we put the letter `r` immediately before the opening quotation mark, Python does not interpret backslash escape sequences inside the string, so every backslash is kept as a literal character:

In [6]:
print(r"\t\n")

\t\n


The `r` stands for *raw*, which is Python's term for a string literal in which backslashes are kept as they are typed instead of starting an escape sequence such as `\t` (tab) or `\n` (newline).
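
To see why this matters for regular expressions, here is a small sketch comparing a normal and a raw string. It uses `\b`, the regex word-boundary anchor, which is not in the table above and is chosen only because Python itself reads `"\b"` as a backspace character; the variable names are illustrative.

```python
import re

normal = "\t\n"   # two characters: a tab and a newline
raw = r"\t\n"     # four characters: backslash, t, backslash, n
print(len(normal), len(raw))  # 2 4

text = "site GAATTC here"
# Without r, Python turns \b into a backspace before re ever sees it
plain_hit = re.search("\bGAATTC\b", text)
# With r, the regex engine receives \b and treats it as a word boundary
raw_hit = re.search(r"\bGAATTC\b", text)
print(plain_hit)        # None
print(raw_hit.group())  # GAATTC
```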

## Exercises

1. Use `re.search()` to test whether the DNA sequence `dna = "ATCGCGAATTCAC"` contains an EcoRI restriction site. The EcoRI recognition sequence is `GAATTC`.

In [7]:
dna = "ATCGCGAATTCAC"
pat = re.search(r"GAATTC", dna)
if pat:
    print("pattern is found")

pattern is found


2. Use `re.search()` and the pipe syntax `|` to test whether the DNA sequence `dna = "ATCGCGAATTCAC"` contains an AvaII recognition site. AvaII recognises two different sequences, `GGACC` and `GGTCC`.

In [15]:
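# The sequence from the question with GGACC appended;
# the original sequence contains no AvaII site.
dna = "ATCGCGAATTCACGGACC"

# Character group: A or T at the third position
pat = re.search(r"(GG[AT]CC)", dna)
if pat:
    print("pattern is found")

# Pipe syntax: the same two alternatives written as A|T
pat1 = re.search(r"(GG(A|T)CC)", dna)
if pat1:
    print("pattern is found")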

pattern is found
pattern is found
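
The searches from exercises 1 and 2 can be combined to test a sequence against several enzymes at once. This is a minimal sketch: the dictionary `sites` and the function `find_sites` are illustrative names, and the recognition sequences are the ones given in the exercises.

```python
import re

# Recognition patterns taken from exercises 1 and 2
sites = {
    "EcoRI": r"GAATTC",
    "AvaII": r"GGACC|GGTCC",   # equivalent to GG[AT]CC
}

def find_sites(dna):
    """Return the names of the enzymes whose site occurs in dna."""
    return [name for name, pat in sites.items() if re.search(pat, dna)]

print(find_sites("ATCGCGAATTCAC"))       # ['EcoRI']
print(find_sites("ATCGCGAATTCACGGACC"))  # ['EcoRI', 'AvaII']
```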


3. Use `re.search()` and a character group `[]` to test whether the DNA sequence `dna = "ATCGCGAATTCACGCTGC"` contains the recognition site for BisI. The BisI restriction enzyme cuts at an even wider range of motifs: its pattern is `GCNGC`, where `N` stands for any base. Can you use another regex syntax for this exercise instead of `[]`? What is the downside of that alternative?

In [25]:
# Proper way

dna = "ATCGCGAATTCACGCTGCGCAGCXN"
pat = re.search(r"(GC[ATGC]GC)", dna)
if pat:
    print("pattern is found")
    
pat = re.search(r"(GC(A|T|C|G)GC)", dna)
if pat:
    print("pattern is found")
    
# Alternatives. In a biological sequence context these are risky: they also accept characters that are not valid bases, so sequence errors go unnoticed.
pat = re.search(r"(GC.GC)", dna)
if pat:
    print("pattern is found")

pat = re.search(r"(GC[^XN]GC)", dna)
if pat:
    print("pattern is found")

pattern is found
pattern is found
pattern is found
pattern is found
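
The downside of the `.` syntax is easiest to see on a sequence that contains an invalid character. A short sketch with a made-up sequence (`bad`), in which `X` sits between two `GC` pairs:

```python
import re

bad = "ATCGCXGCTT"   # X is not a valid base

strict = re.search(r"GC[ATGC]GC", bad)  # only real bases allowed in the middle
loose = re.search(r"GC.GC", bad)        # any character allowed in the middle

print(strict)          # None
print(loose.group())   # GCXGC
```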


4. Write a pattern to identify full-length eukaryotic messenger RNA sequences.

Imagine the sequence of eukaryotic messenger RNA looks like this:
It starts with `AUG`, is followed by a stretch of the bases `A`, `U`, `G` and `C` that is, for example, between 30 and 1000 bases long, and ends with a poly-A tail of 5 to 10 adenine nucleotides.

In [26]:
# Anchored with ^ and $ (full length); no spaces inside {m,n}, or the braces are matched literally
pat = r"^(AUG[AUGC]{30,1000}A{5,10})$"
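
A pattern like this is worth testing on toy sequences. The sketch below uses invented test strings (`good`, `short`) and also shows that a space inside the quantifier braces stops Python from treating them as a quantifier; the anchored form of the pattern is an assumption about what "full-length" means here.

```python
import re

mrna_pat = r"^AUG[AUGC]{30,1000}A{5,10}$"

good = "AUG" + "GCUA" * 10 + "AAAAAAA"   # 40-base body, 7-base poly-A tail
short = "AUG" + "GCUA" * 2 + "AAAAA"     # body far too short

print(bool(re.search(mrna_pat, good)))   # True
print(bool(re.search(mrna_pat, short)))  # False

# With a space after the comma, "{5, 10}" is matched as literal text
spaced_pat = r"A{5, 10}"
print(re.search(spaced_pat, "AAAAAAA"))           # None
print(re.search(spaced_pat, "A{5, 10}").group())  # A{5, 10}
```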

5. What is the type of the value returned by `re.search()`? Which method can we use to extract the part of the string that matched?

In [33]:
dna = "ATCGCGAATTCACGCTGCGCAGCXN"
pat = re.search(r"(GC[ATGC]GC)", dna)
print(type(pat))
print(type(2))
print(type("Niki"))
print(type(3.2))

# Answer: a match object (shown as re.Match in Python 3.7+ and as
# _sre.SRE_Match in older versions), or None if nothing matched.
# pat.group() returns the part of the string that matched.

<class '_sre.SRE_Match'>
<class 'int'>
<class 'str'>
<class 'float'>


6. For the DNA sequence `dna = "ATGACGTACGTACGACTG"`, the match object is stored in the variable  
`m = re.search(r"GA[ATGC]{3}AC", dna)`. Use the appropriate method to extract the matched string from `dna`.

In [35]:
dna = "ATGACGTACGTACGACTG"
m = re.search(r"GA[ATGC]{3}AC", dna)
print(m.group())

GACGTAC
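
Note that `group()` only exists on a match object. When nothing matches, `re.search()` returns `None`, and calling `.group()` on it raises an `AttributeError`. A minimal sketch of a guard; the helper name `matched_text` and the second motif are chosen only for illustration.

```python
import re

def matched_text(pattern, seq):
    """Return the matched substring, or None when the pattern is absent."""
    m = re.search(pattern, seq)
    if m is None:
        return None
    return m.group()

dna = "ATGACGTACGTACGACTG"
print(matched_text(r"GA[ATGC]{3}AC", dna))  # GACGTAC
print(matched_text(r"GGGGGG", dna))         # None
```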


7. In the DNA sequence `dna = "ATGACGTACGTACGACTG"`, do the following:
Use `re.search()` with a suitable regex to find the following pattern:
a sequence that starts with GA, is followed by up to 3 of the nucleotides A, T, G, C, 
then has the fixed characters AC, is followed by up to 2 of the nucleotides A, T, G, C, and ends with AC.

Use the appropriate method to extract:

7A. The entire match

7B. The first bit of the match

7C. The second bit of the match

In [67]:
dna = "ATGACGTACGTACGACTG"
pat = r"(GA[ATGC]{0,3})(AC)([ATGC]{0,2}AC)"
m = re.search(pat, dna)
print(m.group())
print(m.group(1))
print(m.group(2))
print(m.group(3))

GACGTACGTAC
GACGT
AC
GTAC
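
Besides calling `group(n)` once per group, a match object can return several groups in one call. A short sketch using the same pattern as above:

```python
import re

dna = "ATGACGTACGTACGACTG"
m = re.search(r"(GA[ATGC]{0,3})(AC)([ATGC]{0,2}AC)", dna)

print(m.group(0))     # GACGTACGTAC, the same as m.group()
print(m.groups())     # ('GACGT', 'AC', 'GTAC'), all groups as a tuple
print(m.group(1, 3))  # ('GACGT', 'GTAC'), selected groups as a tuple
```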


8. In the DNA sequence `dna = "ATGACGTACGTACGACTG"`, do the following:
Using the same regex constructed in question 7, use the appropriate methods to obtain the start and end positions of the matched pattern in the string.

In [71]:
dna = "ATGACGTACGTACGACTG"
pat = r"(GA[ATGC]{0,3})(AC)([ATGC]{0,2}AC)"
m = re.search(pat, dna)
print("start is ", m.start())
print("end is ", m.end())
print("start group 1 is ", m.start(1))
print("end group 1 is ", m.end(1))
print("start group 2 is ", m.start(2))
print("end group 2 is ", m.end(2))

start is  2
end is  13
start group 1 is  2
end group 1 is  7
start group 2 is  7
end group 2 is  9
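
The positions follow Python's slicing convention: `start()` is the index of the first matched character and `end()` is one past the last, so slicing the original string with them gives back the match. A brief sketch, which also uses `span()` to get both positions at once:

```python
import re

dna = "ATGACGTACGTACGACTG"
m = re.search(r"(GA[ATGC]{0,3})(AC)([ATGC]{0,2}AC)", dna)

# Slicing with start/end reproduces the matched text
print(dna[m.start():m.end()] == m.group())      # True
print(dna[m.start(1):m.end(1)] == m.group(1))   # True

# span() returns (start, end) as a tuple, for the whole match or one group
print(m.span())    # (2, 13)
print(m.span(3))   # (9, 13)
```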


9. The DNA sequence `dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"` contains several IUPAC codes, such as N, R and Y, that indicate missing data or mutations. We want the list of DNA segments that occur between the IUPAC codes, that is:
['ACT', 'GCAT', 'GCTACGT', 'ACGAT', 'CGA', 'TCG']
How could we achieve this?

In [89]:
dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"
# re.split(pattern, string, maxsplit=0, flags=0)

# First way: list the IUPAC codes that occur in this sequence
print("first way ", re.split(r"[NYSWR]", dna))
# Second way (more general): split on anything that is not A, T, G or C
print("second way ", re.split(r"[^ATGC]", dna))

first way  ['ACT', 'GCAT', 'GCTACGT', 'ACGAT', 'CGA', 'TCG']
second way  ['ACT', 'GCAT', 'GCTACGT', 'ACGAT', 'CGA', 'TCG']
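
Two details of `re.split()` are worth knowing in this context. The sketch below uses a shorter, made-up sequence with two adjacent ambiguity codes to show where empty strings come from, and how a capturing group keeps the separators in the result.

```python
import re

seq = "ACTNNGCATRTCG"   # two adjacent codes (NN) and a single one (R)

# Each single character is a separator, so NN leaves an empty string behind
single = re.split(r"[^ATGC]", seq)
print(single)      # ['ACT', '', 'GCAT', 'TCG']

# Adding + treats a run of codes as one separator
runs = re.split(r"[^ATGC]+", seq)
print(runs)        # ['ACT', 'GCAT', 'TCG']

# Parentheses around the separator keep the codes in the output list
with_codes = re.split(r"([^ATGC]+)", seq)
print(with_codes)  # ['ACT', 'NN', 'GCAT', 'R', 'TCG']
```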


10. Let's find every place in the DNA that is AT or GC rich. 
Define a DNA sequence and find all runs of A and T that are 3 or more bases long.

In [107]:
dna = "ACGTACATATATCACACGTATATATCACACTTTTTTTTAAGCAAAAAAATTTTTTTGCTTT"
# re.findall(pattern, string, flags=0)

# Runs of 3 or more bases made of A and/or T, mixed freely
re.findall(r"[AT]{3,}", dna)
# Runs of 3 or more of the same base (all A or all T). As the last
# expression in the cell, only this result is displayed below.
re.findall(r"[A]{3,}|[T]{3,}", dna)

['TTTTTTTT', 'AAAAAAA', 'TTTTTTT', 'TTT']
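
The same approach covers the GC-rich half of the question by swapping the character group. A small sketch on a made-up sequence (`seq`), since the sequence above contains no G/C run of 3 or more bases:

```python
import re

seq = "ATGCGCGCATTTAGGCCA"

gc_runs = re.findall(r"[GC]{3,}", seq)   # runs of G and/or C
at_runs = re.findall(r"[AT]{3,}", seq)   # runs of A and/or T

print(gc_runs)   # ['GCGCGC', 'GGCC']
print(at_runs)   # ['ATTTA']
```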

11. Define a DNA sequence, find all runs of A and T that are 3 or more bases long, and report the start and end positions of each matched sequence.

In [122]:
#?re.finditer()
# re.finditer(pattern, string, flags=0)
# Return an iterator over all non-overlapping matches in the
# string.  For each match, the iterator returns a match object.

dna = "ACGTACATATATCACACGTATATATCACACTTTTTTTTAAGCAAAAAAATTTTTTTGCTTT"
m = re.finditer(r"[AT]{3,}", dna)
for match in m:
    print(match.group(), match.start(), match.end())
    
# re.finditer() returns an iterator of match objects, in contrast to
# re.findall(), which returns a list of strings. Therefore, we can use
# match object methods such as group(), start() and end() on each item
# produced by re.finditer().

ATATAT 6 12
TATATAT 18 25
TTTTTTTTAA 30 40
AAAAAAATTTTTTT 42 56
TTT 58 61
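
The minimum run length is set by the lower bound of the quantifier. For runs strictly longer than 5 bases, for example, the bound becomes 6. A sketch that collects the results into a list of tuples (the name `long_runs` is illustrative):

```python
import re

dna = "ACGTACATATATCACACGTATATATCACACTTTTTTTTAAGCAAAAAAATTTTTTTGCTTT"

# {6,} means 6 or more, i.e. longer than 5 bases
long_runs = [(m.group(), m.start(), m.end())
             for m in re.finditer(r"[AT]{6,}", dna)]

for run in long_runs:
    print(run)
# ('ATATAT', 6, 12)
# ('TATATAT', 18, 25)
# ('TTTTTTTTAA', 30, 40)
# ('AAAAAAATTTTTTT', 42, 56)
```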


12. Given the accession names below:

accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]

Write a program that will print only the accession names that satisfy the following criteria:

* contain the number 5
* contain the letter d or e
* contain the letters d and e in that order
* contain the letters d and e in that order with a single letter between them
* contain both the letters d and e in any order
* start with x or y
* start with x or y and end with e
* contain three or more digits in a row
* end with d followed by either a, r or p

In [149]:
# "edhi" is added to the list from the question as an extra test case (e before d)
accs = ["edhi","xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]
# contain the number 5
for i in accs:
    if re.search(r"5", i):
        print(i, "found")
# contain the letter d or e
# (inside [] the pipe is a literal character, so [d|e] would also match "|")
for i in accs:
    if re.search(r"[de]", i):
        print(i, "found2")
# contain the letters d and e in that order
for i in accs:
    if re.search(r"(de)", i):
        print(i, "found3")
# contain the letters d and e in that order with a single letter between them
# ([A-Z, a-z] would also match a comma and a space; {1} is redundant)
for i in accs:
    if re.search(r"(d[A-Za-z]e)", i):
        print(i, "found4")
# contain both the letters d and e in any order
for i in accs:
    if re.search(r"(d.*e)|(e.*d)", i):
        print(i, "found5")
# start with x or y
for i in accs:
    if re.search(r"^[xy]", i):
        print(i, "found6")
# start with x or y and end with e
for i in accs:
    if re.search(r"^[xy].*e$", i):
        print(i, "found7")
# contain three or more numbers in a row
for i in accs:
    if re.search(r"[0-9]{3,}", i):
        print(i, "found8")
# end with d followed by either a, r or p
# (re.search() scans the whole string, so a leading .* is not needed)
for i in accs:
    if re.search(r"d[arp]$", i):
        print(i, "found9")

xkn59438 found
hedle3455 found
xjhd53e found
45da found
edhi found2
yhdck2 found2
eihd39d9 found2
chdsye847 found2
hedle3455 found2
xjhd53e found2
45da found2
de37dp found2
de37dp found3
hedle3455 found4
edhi found5
eihd39d9 found5
chdsye847 found5
hedle3455 found5
xjhd53e found5
de37dp found5
xkn59438 found6
yhdck2 found6
xjhd53e found6
xjhd53e found7
xkn59438 found8
chdsye847 found8
hedle3455 found8
45da found9
de37dp found9
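
The nine loops differ only in their pattern, so they can be driven from a single table of criteria. This is one possible restructuring rather than the book's solution; the names `criteria` and `results` are illustrative, and the list used is the one given in the question (without `edhi`).

```python
import re

accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847",
        "hedle3455", "xjhd53e", "45da", "de37dp"]

# One regex per criterion, in the order of the exercise
criteria = {
    "contains 5": r"5",
    "contains d or e": r"[de]",
    "d then e": r"de",
    "d, one letter, e": r"d[a-z]e",
    "d and e in any order": r"d.*e|e.*d",
    "starts with x or y": r"^[xy]",
    "starts with x or y, ends with e": r"^[xy].*e$",
    "three or more digits in a row": r"[0-9]{3,}",
    "ends with d then a, r or p": r"d[arp]$",
}

# Map each criterion to the accessions that satisfy it
results = {
    label: [acc for acc in accs if re.search(pattern, acc)]
    for label, pattern in criteria.items()
}

for label, hits in results.items():
    print(f"{label}: {hits}")
```

Adding or changing a criterion then only requires editing one entry of the dictionary.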


13. Use the HTT.fasta file to explore the problem described in [Regular expressions in genomics](https://towardsdatascience.com/using-regular-expression-in-genetics-with-python-175e2b9395c2).
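
As a possible starting point, the sketch below assumes that the task involves locating runs of the `CAG` trinucleotide repeat in the HTT sequence; that reading of the article, the file location, and the function names `read_fasta_sequence` and `longest_cag_run` are all assumptions. It uses `(?:...)`, a group that is not captured, so that `+` applies to the whole triplet. A toy sequence stands in for the file contents.

```python
import re

def read_fasta_sequence(path):
    """Return the sequence of a single-record FASTA file as one uppercase string."""
    with open(path) as handle:
        lines = [line.strip() for line in handle if not line.startswith(">")]
    return "".join(lines).upper()

def longest_cag_run(seq):
    """Return (repeat_count, start, end) of the longest CAG run, or None."""
    runs = list(re.finditer(r"(?:CAG)+", seq))
    if not runs:
        return None
    best = max(runs, key=lambda m: m.end() - m.start())
    return (len(best.group()) // 3, best.start(), best.end())

# Toy sequence standing in for the contents of HTT.fasta
toy = "ATGGCG" + "CAG" * 5 + "CCGCCA" + "CAG" * 2 + "TT"
print(longest_cag_run(toy))   # (5, 6, 21)

# With the real file (the path is an assumption):
# seq = read_fasta_sequence("HTT.fasta")
# print(longest_cag_run(seq))
```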