# Regular expressions and patterns in sequences.

A recurring theme of patterns in biological sequences is that some positions are more conserved than others. The variable positions may vary more either because mutations are more likely to occur at these positions or because the positions are not functionally or structurally crucial.

In coding regions of exons, the degeneracy of the genetic code means that mutations at the third position of a codon may not change the resulting protein. At the protein level, some amino acid residues are similar enough to be interchangeable. For example, both Asp and Glu supply a negatively charged side chain, and both Arg and Lys a positively charged one.

Regular expressions are a flexible way to specify an underlying pattern while still allowing for such variation. For this reason various syntaxes based on regular expressions are found in user interfaces of bioinformatic programs and databases.

## Regular expressions in Python

Python has sophisticated regular expression functions available in the module *re*. 

In [1]:
import re

Regular expressions use a range of special characters in patterns. As there is only a limited number of special characters, some of these clash with their usage in Python strings.

One way around this is to preface strings with r for raw. Compare the print output of the following two strings. 

In [2]:
print("\t1\n2")

	1
2


In [3]:
print(r"\t1\n2")

\t1\n2
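
The difference matters for regular expressions because a sequence such as \b means one thing to Python (a backspace character) and another to re (a word boundary). The following is a minimal sketch of that clash; the text being searched is made up for illustration.

```python
import re

# In a normal string "\b" is a single backspace character;
# in a raw string it stays as backslash + b, which re reads as a word boundary.
plain = "\bGAATTC\b"
raw = r"\bGAATTC\b"

text = "site GAATTC here"                # made-up text
plain_hit = re.search(plain, text)       # looks for backspace characters
raw_hit = re.search(raw, text)           # looks for GAATTC as a whole word

print(len("\b"), len(r"\b"))             # 1 2
print(plain_hit is None, raw_hit is not None)   # True True
```

Using raw strings for every pattern, as in the rest of this section, avoids having to remember which sequences clash.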


The regular expression search function can be used to find patterns in reasonably short sequences. For example, restriction enzymes have specific recognition sites in DNA. For simplicity, the example below ignores the fact that DNA has two strands.

In [4]:
dna = "TATAGAATTCATAAATT"
if re.search(r"GAATTC", dna):
    print("EcoRI site found.")

EcoRI site found.
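
The EcoRI site GAATTC reads the same on both strands, so a single search is enough for it. For a site that is not its own reverse complement, one possible approach is to search the reverse complement as well. The sketch below assumes an unambiguous sequence; the helper names, the example motif GGTCTC and the sequence are chosen purely for illustration.

```python
import re

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def site_on_either_strand(pattern, seq):
    """True if pattern occurs in seq or in its reverse complement."""
    return bool(re.search(pattern, seq)
                or re.search(pattern, reverse_complement(seq)))

# Made-up sequence containing GAGACC, the reverse complement of GGTCTC.
dna = "TATAGAGACCATAAATT"
print(site_on_either_strand(r"GGTCTC", dna))   # True, found on the other strand
print(site_on_either_strand(r"GAATTC", dna))   # False, absent from both strands
```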


Of course, for an exact match like this the usual string methods of the form text.find(subString) are available. But some restriction enzymes recognise ambiguous sequences. You will be aware of the ambiguity codes for DNA bases: for example, R is any purine (A or G) and Y is any pyrimidine (C or T). Unfortunately Python does not recognise these codes out of the box.

As an example, *AvaII* cuts the pattern GGWCC, where W is either an A or a T (its counterpart is S, for C or G). This can be expressed using the regular expression symbol | for alternatives.

In [12]:
dna = "TTATCGGTCCGC"
if re.search(r"GG(A|T)CC", dna):
    print("AvaII site found.")

AvaII site found.


*BisI* cuts the pattern GCNGC where, as you know, N stands for a nucleotide with any base.

In [12]:
dna = "TCTTAGCAGCAATTCCGC"
if re.search(r"GC(A|C|G|T)GC", dna):
    print("BisI site found.")

BisI site found.


Alternatively, the same pattern can be written with a character class, which lists the alternatives inside square brackets.

In [13]:
dna = "TCTTAGCAGCAATTCCGC"
if re.search(r"GC[ACGT]GC", dna):
    print("BisI site found.")

BisI site found.
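
Character classes also suggest a general way of handling the ambiguity codes that Python does not recognise out of the box. The sketch below translates a site written in IUPAC nucleotide codes into a regular expression; the dictionary and the function name iupac_to_regex are illustrative and not part of the re module.

```python
import re

# Standard IUPAC nucleotide ambiguity codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}

def iupac_to_regex(site):
    """Translate a recognition site in IUPAC codes into a regex string."""
    return "".join(IUPAC[base] for base in site.upper())

print(iupac_to_regex("GGWCC"))   # GG[AT]CC
print(iupac_to_regex("GCNGC"))   # GC[ACGT]GC
print(bool(re.search(iupac_to_regex("GGWCC"), "TTATCGGTCCGC")))   # True
```

A site containing a character outside the table raises a KeyError, which is arguably preferable to silently building a pattern that can never match.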


The symbol . will match any character - so that could match any nucleotide.

In [14]:
dna = "TCTTAGCAGCAATTCCGC"
if re.search(r"GC.GC", dna):
    print("BisI site found.")

BisI site found.


Unfortunately it would also match GCQGC, GCWGC, and even GC.GC, because the symbol . accepts any character and not just the four bases.

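
A small comparison, using made-up test strings, shows how much more permissive the symbol . is than an explicit character class.

```python
import re

# Made-up five-character strings; only the first is a genuine GCNGC site.
candidates = ["GCAGC", "GCQGC", "GC.GC", "GC-GC"]

results = {}
for seq in candidates:
    dot_hit = bool(re.search(r"GC.GC", seq))           # any character allowed
    class_hit = bool(re.search(r"GC[ACGT]GC", seq))    # only A, C, G or T allowed
    results[seq] = (dot_hit, class_hit)
    print(seq, dot_hit, class_hit)
```

Every candidate satisfies the pattern with . but only GCAGC satisfies the character class.
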
Repeats of a character can be specified using the symbols ? (0 or 1 times), * (0 or more times), or + (1 or more times). Notice that, unlike the * wildcard in a Linux shell, each of these modifiers applies to the symbol immediately in front of it.

In [16]:
dna = "TCTTAGCAGCAAAAAAAAAAAAATTCCGC"
if re.search(r"GCA+GC", dna):
    print("poly(A) found.")

poly(A) found.
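
Since + means one or more, a single A is enough to satisfy A+. The sketch below tabulates the three modifiers on made-up test strings and then inspects which substring of the sequence above was matched, using the group method of the match object (described later in this section). The pattern CA+T is an illustrative alternative, not a standard poly(A) definition.

```python
import re

# How ?, * and + treat zero, one and several A's between GC and GC.
table = {}
for pattern in (r"GCA?GC", r"GCA*GC", r"GCA+GC"):
    for seq in ("GCGC", "GCAGC", "GCAAAGC"):
        table[(pattern, seq)] = bool(re.search(pattern, seq))
        print(pattern, seq, table[(pattern, seq)])

dna = "TCTTAGCAGCAAAAAAAAAAAAATTCCGC"
first = re.search(r"GCA+GC", dna)
print(first.group())    # GCAGC: a single A sufficed, the long run was not needed

run = re.search(r"CA+T", dna)
print(run.group())      # the C, the whole run of A's, and the following T
```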


Specific repeat counts can be given in {} after the character: {n} for exactly n times, {n,m} for n to m times, {n,} for n or more times, and {,m} for 0 to m times. Multicharacter patterns can be grouped together using parentheses so that the count applies to the whole group.

For example, an intronic region in the human VWF gene contains a variable number of tetranucleotide repeats that are used for forensic identification; alleles differ in the number of repeats. Here is a check on an individual for the commonest short variants, which have the form TCTA[TCTG]3-4[TCTA]7-11.

In [4]:
dna = "TTGATTCTATCTGTCTGTCTGTCTGTCTATCTATCTATCTATCTATCTATCTATCTTCCA"
if re.search(r"TCTA(TCTG){3,4}(TCTA){7,11}", dna):
    print("STR allele found.")

STR allele found.
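
Because alleles differ in the number of repeats, it can be useful to count them rather than just detect them. One possible way, sketched below, is to capture each repeat block as a whole and divide its length by four. The sequence is a synthetic one built to the layout described above with arbitrary flanks, and the variable names are illustrative.

```python
import re

# Synthetic allele: flank, TCTA, four TCTG repeats, seven TCTA repeats, flank.
dna = "TTGAT" + "TCTA" + "TCTG" * 4 + "TCTA" * 7 + "TCTTCCA"

# (?:...) groups a repeat unit without capturing it;
# the outer (...) captures each complete block of repeats.
strRegex = re.compile(r"TCTA((?:TCTG){3,4})((?:TCTA){7,11})")
matchObj = strRegex.search(dna)

tctg_repeats = tcta_repeats = None
if matchObj:
    tctg_repeats = len(matchObj.group(1)) // 4   # each repeat unit is 4 bases
    tcta_repeats = len(matchObj.group(2)) // 4
    print(tctg_repeats, tcta_repeats)            # 4 7
```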


The full specification of the special characters is in the documentation at https://docs.python.org/3/library/re.html

## Match and regex objects
The examples above give the impression that re.search is a true/false function. When it finds a match, however, it returns not True but a match object, which evaluates as true in a conditional context; when it finds nothing it returns None. A match object represents the results of a regular expression search and has a number of useful methods for getting data out of it.

Another useful object is the compiled form of the search pattern, produced by the re.compile function. If the same pattern is to be used for several searches, it is worth compiling it once and reusing the resulting regex object.

Going back to the *AvaII* example.

In [5]:
dna = "TTATCGGTCCGC"
regexObj = re.compile(r"GG(A|T)CC")
matchObj = regexObj.search(dna)
print(matchObj.group()) # string that was matched
print(matchObj.start()) # start index for the match
print(matchObj.end()) # index just after match
print(matchObj.span()) # (start, end) tuple; end is the index just after the match

GGTCC
5
10
(5, 10)
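
The values from span can be used directly as slice indices, and the parentheses in the pattern make the variable base available as group 1. A minimal sketch using the same sequence:

```python
import re

dna = "TTATCGGTCCGC"
matchObj = re.search(r"GG(A|T)CC", dna)

start, end = matchObj.span()
print(dna[start:end])       # GGTCC, the same text as matchObj.group()
print(matchObj.group(1))    # T, the base matched by (A|T)
```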


In general you need to check that the pattern found a match; otherwise matchObj is None and there is nothing to interrogate or print. Here are the original sequence and a mutated form, only the first of which returns a match.

In [9]:
seqs = ["TTATCGGTCCGC","TTATCGGGCCGC"]
regexObj = re.compile(r"GG(A|T)CC") # not needed if created already
for seq in seqs:
    matchObj = regexObj.search(seq)
    if matchObj is None:
        print("AvaII site not found.")
    else:
        print(matchObj.span())

(5, 10)
AvaII site not found.


For finding multiple occurrences there is the function re.finditer(). Ambiguous bases in a sequence can be found using the expression [^ATGC] where the ^ inverts the selection.  
(Outside [ ] the ^ is used to mark the position of the pattern as the start of the string).   

In [15]:
dna = "GGTGAGRTAAGAAGGGGYTAAGAGAGGATWAGG" 
regexObj = re.compile(r"[^ATGC]")
matchObj = regexObj.finditer(dna) 
for match in matchObj: 
    base = match.group() 
    pos  = match.start() + 1 # convert the 0-based index to a 1-based position
    print(base + " found at position " + str(pos))

R found at position 7
Y found at position 18
W found at position 30
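
The same approach lists every occurrence of a recognition site. The sketch below collects the 1-based position and matched text of each AvaII site in a made-up sequence; note that finditer reports non-overlapping matches only.

```python
import re

dna = "GGACCTTGGTCCAAGGCCC"      # made-up sequence with two AvaII sites
avaII = re.compile(r"GG[AT]CC")

# One (position, matched text) pair per site, positions counted from 1.
sites = [(m.start() + 1, m.group()) for m in avaII.finditer(dna)]
print(sites)                     # [(1, 'GGACC'), (8, 'GGTCC')]

# findall returns just the matched strings when positions are not needed.
print(avaII.findall(dna))        # ['GGACC', 'GGTCC']
```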


## Splitting a sequence using a regular expression
There is a function to split a string based on a regular expression. Here the sequence is split at each ambiguous base using the regexObj from above. Notice that the matched characters themselves are omitted from the output strings.

In [16]:
unambig = regexObj.split(dna)
print(unambig)

['GGTGAG', 'TAAGAAGGGG', 'TAAGAGAGGAT', 'AGG']
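
If the ambiguous bases themselves should be kept, the pattern can be wrapped in parentheses: when the pattern contains a capturing group, split also returns the captured text between the pieces. A minimal sketch with the same sequence:

```python
import re

dna = "GGTGAGRTAAGAAGGGGYTAAGAGAGGATWAGG"

# The capturing group makes split keep each ambiguous base as its own item.
pieces = re.split(r"([^ATGC])", dna)
print(pieces)
# ['GGTGAG', 'R', 'TAAGAAGGGG', 'Y', 'TAAGAGAGGAT', 'W', 'AGG']

# Joining the pieces restores the original sequence.
print("".join(pieces) == dna)    # True
```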


Further examples of regular expressions for sequence manipulation are covered in *Chapter 5* of Rocha & Ferreira (2008) *Bioinformatics Algorithms*.