<h1 style = "fontsize:3rem;color:white;">An ADAPTED Introduction to Regular Expressions in Python</h1>

#### Regular expressions, often shortened to regexes, are among the most valuable tools available to Python programmers. A regex is essentially a string-processing tool that allows us to pattern-match text, which makes regexes useful across a wide variety of applications: everything from finding simple phrases in a mass of text to parsing and handling large datasets for scientific analyses.

By importing the re module, which is part of Python's standard library, we can perform basic pattern matching using regexes. A regular expression is itself a string that allows the user to define complex matching and replacement rules for other strings. These rules can be combined into a single-line, powerful searching, filtering and editing tool for analysing even the most expansive datasets. <p>Using a minimalist syntax, often referred to as regex 'special characters', we can tell the re module exactly what to match. For example, the '.' special character matches a single instance of any character (letters, numbers or punctuation) except a newline. The '+' special character serves as a quantifier that searches for one or more of the character that preceded it. Thus, 'a+' will match anything from a single 'a' to 'aa', 'aaaaaaa' and beyond, until that run is interrupted by another character.  <p><b>Figure 1</b> outlines some commonly-used basic pattern matching syntax; this will be enough for us to get started matching simple alphanumeric patterns and sequences in the following exercises.

![title](Regex1.jpg)

<b>Figure 1: </b> Table of commonly-used regular expression syntax. (Adapted from: https://www.debuggex.com/cheatsheet/regex/python).
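As a quick check of the special characters described above, the sketch below tries '.', '+' and '*' on a few short strings. The test strings are made up purely for illustration.

```python
import re

# '.' matches any single character except a newline.
dot_matches = re.findall(r'a.', 'ab a1 a!')

# 'a+' matches one or more consecutive 'a' characters.
plus_matches = re.findall(r'a+', 'a aa aaaa b ab')

# '*' means zero or more, so 'ba*' accepts a lone 'b' while 'ba+' does not.
star_accepts_b = re.fullmatch(r'ba*', 'b') is not None
plus_accepts_b = re.fullmatch(r'ba+', 'b') is not None

print(dot_matches)     # ['ab', 'a1', 'a!']
print(plus_matches)    # ['a', 'aa', 'aaaa', 'a']
print(star_accepts_b)  # True
print(plus_accepts_b)  # False
```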

Let's start by writing a simple piece of code. Complete the regular expression line to match only the numbers in the sentence, below (note the decimal point):

In [None]:
import re

sequence ='The male human genome is 6.27 Gbp. '

pattern = re.compile(r'\d\.\d\d')
#This line here contains our regular expression.
#Add your special characters between the quotation marks.

matches = pattern.finditer(sequence)
#This line collates our search results into 'matches'.

for match in matches:
    print (match)
#This for loop allows us to sequentially print each matched pattern.

Next, modify the code to match and output all the vowels in the following sentence:

In [18]:
import re

sequence ='The male human genome is 6.27 Gbp. '

pattern = re.compile('[aeiou]')

matches = pattern.finditer(sequence)

for match in matches:
    print (match)

<re.Match object; span=(2, 3), match='e'>
<re.Match object; span=(5, 6), match='a'>
<re.Match object; span=(7, 8), match='e'>
<re.Match object; span=(10, 11), match='u'>
<re.Match object; span=(12, 13), match='a'>
<re.Match object; span=(16, 17), match='e'>
<re.Match object; span=(18, 19), match='o'>
<re.Match object; span=(20, 21), match='e'>
<re.Match object; span=(22, 23), match='i'>
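Each line printed above is a match object rather than a plain string. As a minimal sketch, the methods below pull the matched text and its position out of such an object; the decimal-number pattern from the first exercise is reused here for brevity.

```python
import re

sequence = 'The male human genome is 6.27 Gbp. '

match = re.search(r'\d\.\d\d', sequence)

print(match.group())  # the matched text: 6.27
print(match.span())   # (start, end) indices: (25, 29)

# The end index is exclusive, so slicing with the span returns the match itself.
start, end = match.span()
print(sequence[start:end])  # 6.27
```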


Note that the regex returns both the match and the 'span': a pair of indices marking where the individual match starts and ends in the searched string (counting from 0, with the end index excluded). <p>Next, let's try and work with a short nucleic acid sequence. Match all cytosine bases, and output the total number found in this sequence. Create a basic counter to tally up the number of matches, and print the total as a sentence ('There are xx cytosine bases in this sequence').

In [19]:
import re

sequence ='CTTGGGATGAAACGTTTAGCGGATAATCGTGAGATCGGACTGGCTGGATTGATCGGATTAACGGATCTGATCTGGGATCGGTAA'

pattern = re.compile('C')

matches = pattern.finditer(sequence)

counter = 0

for match in matches:
    counter += 1
    
print ("There are", counter,"cytosine bases in this sequence.")


There are 12 cytosine bases in this sequence.
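An explicit counter is not the only way to arrive at this tally. The sketch below shows two shorter alternatives, offered as a cross-check rather than a replacement for the exercise: the length of the list returned by re.findall(), and the built-in str.count() method, which is sufficient whenever the pattern is a fixed piece of text.

```python
import re

sequence = 'CTTGGGATGAAACGTTTAGCGGATAATCGTGAGATCGGACTGGCTGGATTGATCGGATTAACGGATCTGATCTGGGATCGGTAA'

# findall() returns a list of every match, so its length is the tally.
regex_count = len(re.findall('C', sequence))

# For a literal pattern, the string method gives the same number.
method_count = sequence.count('C')

print('There are', regex_count, 'cytosine bases in this sequence.')
```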


Based on this example, amend the following code to identify the stop codons (TAG, TGA or TAA) in the sequence given above. Think about which special character would let a single regex match any one of several alternatives.

In [23]:
pattern = re.compile('TGA|TAG|TAA')

matches1 = pattern.finditer(sequence)

counter = 0

for match in matches1:
    counter += 1
    print (match)
    
print ("Matches =",counter)

<re.Match object; span=(7, 10), match='TGA'>
<re.Match object; span=(16, 19), match='TAG'>
<re.Match object; span=(23, 26), match='TAA'>
<re.Match object; span=(29, 32), match='TGA'>
<re.Match object; span=(49, 52), match='TGA'>
<re.Match object; span=(58, 61), match='TAA'>
<re.Match object; span=(67, 70), match='TGA'>
<re.Match object; span=(81, 84), match='TAA'>
Matches = 8


How many stop codons did you identify?

Answer:

The code above should now report every stop codon in the sequence, along with the total number found. It does not, however, account for the different possible open reading frames (ORFs). <p> For the purposes of this exercise, let's focus solely on the plus-sense ORFs. <p> The code below splits the sequence into units of 3, starting from the first character of the sequence, thus giving the codons that comprise the +1 ORF. The output is a list.

In [5]:
orf1 = []

n=3
for index in range(0, len(sequence), n):
    orf1.append(sequence[index : index + n])
print (orf1)

['CTT', 'GGG', 'ATG', 'AAA', 'CGT', 'TTA', 'GCG', 'GAT', 'AAT', 'CGT', 'GAG', 'ATC', 'GGA', 'CTG', 'GCT', 'GGA', 'TTG', 'ATC', 'GGA', 'TTA', 'ACG', 'GAT', 'CTG', 'ATC', 'TGG', 'GAT', 'CGG', 'TAA']
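The same split can be written more compactly as a list comprehension. This is a sketch on a hypothetical 14-base stretch (the opening bases of the sequence), chosen so that the final, incomplete codon is visible.

```python
# Hypothetical short stretch: the first 14 bases of the sequence above.
sequence = 'CTTGGGATGAAACG'

n = 3
# Step through the string three characters at a time and slice out each codon.
codons = [sequence[index:index + n] for index in range(0, len(sequence), n)]

print(codons)  # ['CTT', 'GGG', 'ATG', 'AAA', 'CG']
```

When the length is not a multiple of three, the last item is shorter than a full codon, so it can never match a three-base pattern.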


In order for us to apply the above regex, it is easier to convert this list into a single string. We can do this by using the join() method in Python, joining each codon into a string. Using a separator of our choice, we can then make it easy to pattern match each 3-base codon (with our regex) as a separate unit, within the string. <p> Run the code in the cell below, and note the differences in the output from the cell above.

In [6]:
orfjoined = ' '.join(orf1)
print (orfjoined)

CTT GGG ATG AAA CGT TTA GCG GAT AAT CGT GAG ATC GGA CTG GCT GGA TTG ATC GGA TTA ACG GAT CTG ATC TGG GAT CGG TAA


In the cell below, write your own code to analyse the orfjoined string for stop codons. 

In [7]:
#WRITE YOUR CODE HERE
pattern = re.compile('TGA|TAG|TAA')

matches2 = pattern.finditer(orfjoined)

counter = 0

for match in matches2:
    counter += 1
    print (match)
    
print ('\nTotal +1 ORF matches:', counter, '\n')

<re.Match object; span=(108, 111), match='TAA'>

Total +1 ORF matches: 1 
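The span reported above refers to the space-separated string, not to the original nucleotide sequence. A minimal sketch of converting one to the other is given below, assuming a single-space separator so that every codon occupies four characters in the joined string; the variable names codon_index and nucleotide_start are introduced here for illustration.

```python
import re

sequence = 'CTTGGGATGAAACGTTTAGCGGATAATCGTGAGATCGGACTGGCTGGATTGATCGGATTAACGGATCTGATCTGGGATCGGTAA'

codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]
orfjoined = ' '.join(codons)

positions = []
for match in re.finditer('TGA|TAG|TAA', orfjoined):
    # Each codon plus its trailing space takes up 4 characters.
    codon_index = match.start() // 4
    # In the unspaced +1 frame, codon number k starts at index 3 * k.
    nucleotide_start = codon_index * 3
    positions.append((match.group(), codon_index, nucleotide_start))

print(positions)  # [('TAA', 27, 81)]
```

For the +2 and +3 frames, the frame offset (1 or 2) would need to be added to nucleotide_start.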



How many do you find? What are their positions in the nucleotide sequence?

Answer:

Let's now take a look at the +2 ORF. Everything we have done up to this point is applicable, except we need to write code that shifts the above processes one base along the sequence. <p> An easy way to do this is to create a new string with the first character of our sequence removed. We can do this using string slicing, as follows. <p> Name the string you wish to slice, and enter the range you wish to return in square brackets. Python counts positions from 0 and excludes the end of the range, so to return the third to eighth bases of the sequence (indices 2 to 7) the code is: <p>sequence[2:8] <p> Since we only wish to remove the first character and keep the rest of the sequence intact, modify the code below to do this.
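String slicing can be tried out on a short stretch before applying it to the full sequence. The sketch below uses the first 12 bases only; slice indices count from 0, and the end index is not included in the result.

```python
# First 12 bases of the sequence, used here as a small test string.
sequence = 'CTTGGGATGAAA'

middle = sequence[2:8]    # indices 2 to 7
without_first = sequence[1:]  # everything except the first character
first_codon = sequence[:3]    # the first three characters

print(middle)         # TGGGAT
print(without_first)  # TTGGGATGAAA
print(first_codon)    # CTT
```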

In [8]:
orftwoseq = sequence[1:]
orf2 = []

n=3
for index in range(0, len(orftwoseq), n):
    orf2.append(orftwoseq[index : index + n])

orfjoined = ' '.join(orf2)
print (orfjoined)

TTG GGA TGA AAC GTT TAG CGG ATA ATC GTG AGA TCG GAC TGG CTG GAT TGA TCG GAT TAA CGG ATC TGA TCT GGG ATC GGT AA


The regex code we used earlier in the exercise is given below. Modify this to ascertain which of the codons in the +2 ORF are stop codons.

In [9]:
pattern = re.compile('TGA|TAG|TAA')

matches3 = pattern.finditer(orfjoined)

counter = 0

for match in matches3:
    counter += 1
    print (match)
    
print ("+2 ORF Matches =",counter)

<re.Match object; span=(8, 11), match='TGA'>
<re.Match object; span=(20, 23), match='TAG'>
<re.Match object; span=(64, 67), match='TGA'>
<re.Match object; span=(76, 79), match='TAA'>
<re.Match object; span=(88, 91), match='TGA'>
+2 ORF Matches = 5


How many stop codons are there in the +2 open reading frame? <p> Finally, fill the cell below with the appropriate code to ascertain the number of stop codons present in the +3 ORF.

In [10]:
#WRITE CODE HERE

orf3seq = sequence[2:]
orf3 = []

n=3
for index in range(0, len(orf3seq), n):
    orf3.append(orf3seq[index : index + n])

orfjoined3 = ' '.join(orf3)

pattern = re.compile('TGA|TAG|TAA')

matches = pattern.finditer(orfjoined3)

counter = 0

for match in matches:
    counter += 1
    print(match)
print ('\nTotal +3 ORF matches:', counter)

<re.Match object; span=(28, 31), match='TAA'>
<re.Match object; span=(36, 39), match='TGA'>

Total +3 ORF matches: 2
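As a cross-check on the three frame-by-frame searches, the reading frame of each stop codon can also be read straight from its start position in the unsplit sequence: the remainder after dividing by 3 is 0 for the +1 frame, 1 for +2 and 2 for +3. This sketch relies on the fact that no two of these three codons can overlap one another, so finditer() does not miss any occurrences.

```python
import re
from collections import Counter

sequence = 'CTTGGGATGAAACGTTTAGCGGATAATCGTGAGATCGGACTGGCTGGATTGATCGGATTAACGGATCTGATCTGGGATCGGTAA'

pattern = re.compile('TGA|TAG|TAA')

# Tally the matches by reading frame (1, 2 or 3).
frame_counts = Counter(match.start() % 3 + 1 for match in pattern.finditer(sequence))

for frame in (1, 2, 3):
    print('+%d ORF stop codons: %d' % (frame, frame_counts[frame]))
```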


How many stop codons did you find in the +3 ORF? <p> Based on the stop codons counted in each of the three plus-sense ORFs, which of these reading frames is most likely to encode the protein product of the gene in question?

Answer:

# Special Characters and Microplastics:

The following exercises use Python regular expressions to sort through a microplastics dataset sampled in late 2014 across three tributaries of the River Thames. The data provide information on the site characteristics, the dry weight of sediment (in grams), the number of extracted microplastic particles, and their characteristics. The data are in the form of a CSV file (comprising comma-separated values), and this needs to be accounted for in how we read these data into Python and pattern match them using regular expressions. <p> The columns from left to right are: <p>Sample_ID, Site, Replicate, Size_fraction, Subsample_40_grams, Sorting_step, Other_manmade, Natural_substance, Unidentifiable.<p>Firstly, we must read the file into Python.<p>

In [63]:
with open('ThamesMicroplastics_ParticleCharacteristics.csv', 'r') as f:
    contents = f.read()
print(contents)

Sample_ID,Site,Replicate,Size_fraction,Subsample_40_grams,Sorting_step,Other_manmade,Natural_substance,Unidentifiable
1,Lambourn,1,1-2 mm,No,Eye,0,0,1
7,Lambourn,4,1-2 mm,Yes,Eye,1,0,0
9,Lambourn,1,1-2 mm,No,Floated,0,0,0
9,Lambourn,1,1-2 mm,No,Floated,0,0,1
9,Lambourn,1,1-2 mm,No,Floated,0,0,1
10,Lambourn,1,2-4 mm,No,Floated,0,0,1
10,Lambourn,1,2-4 mm,No,Floated,0,0,1
10,Lambourn,1,2-4 mm,No,Floated,0,0,1
11,Lambourn,2,1-2 mm,No,Floated,0,0,1
11,Lambourn,2,1-2 mm,No,Floated,0,1,0
12,Lambourn,2,1-2 mm,Yes,Floated,1,0,0
13,Lambourn,2,2-4 mm,No,Floated,1,0,0
13,Lambourn,2,2-4 mm,No,Floated,0,0,1
14,Lambourn,3,1-2 mm,No,Floated,0,0,1
14,Lambourn,3,1-2 mm,No,Floated,0,0,1
15,Lambourn,3,1-2 mm,Yes,Floated,0,0,1
15,Lambourn,3,1-2 mm,Yes,Floated,0,0,1
15,Lambourn,3,1-2 mm,Yes,Floated,0,0,1
16,Lambourn,3,2-4 mm,No,Floated,0,0,1
16,Lambourn,3,2-4 mm,No,Floated,0,0,1
16,Lambourn,3,2-4 mm,No,Floated,0,0,1
17,Lambourn,4,1-2 mm,No,Floated,0,0,1
17,Lambourn,4,1-2 mm,No,Floated,0,0,1
17,Lambourn,4,1-

The Lambourn tributary is considerably deeper than both the Cut Site and Leach tributaries, exhibiting fast flow rates in excess of 3000 cubic feet per second (CFS), making it less favourable for studying deposited microplastics. The Cut Site and Leach tributaries are both shallower, with slower flow rates of 2-300 CFS on average. <p>Using special characters from the tables given in Figure 1, write a single regex to isolate data from all samples taken from both The Cut Site and Leach tributaries. As with the earlier exercises, keep a counter to output the number of matches returned by your regex. (Tip: You might want to use quantifier special characters to help specify your search).

In [9]:
#WRITE YOUR CODE HERE:
import re

with open('ThamesMicroplastics_ParticleCharacteristics.csv', 'r') as f:
    contents = f.read()

pattern = re.compile('Leach|The Cut Site')
#print (contents)
matches = pattern.finditer(contents)

counter = 0

for match in matches:
    counter += 1
    #print (match) #Keep this to show what's matching, then comment out.

print ('\nNumber of matching entries:', counter)


Number of matching entries: 306


How many data entries are there for these two sites, in total?

Answer:

Copy and paste this code into the cell below, and modify it to count the entries from these data that have a 2-4 mm particle size. This time, do so only for data collected at The Cut Site.

In [10]:
#WRITE YOUR CODE HERE:
import re

with open('ThamesMicroplastics_ParticleCharacteristics.csv', 'r') as f:
    contents = f.read()

pattern = re.compile(r'The Cut Site.*2-4\smm')

matches = pattern.finditer(contents)

counter = 0

for match in matches:
    counter += 1
    #print (match) #Keep this to show what's matching, then comment out.

print ('\nNumber of matching entries:', counter)


Number of matching entries: 106
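The pattern above counts rows correctly because '.' does not match a newline by default, so '.*' cannot run from one row of the file into the next. The sketch below illustrates this on four hypothetical rows laid out like the dataset (the values are invented), and shows what would happen if the re.DOTALL flag were switched on.

```python
import re

# Hypothetical rows in the same layout as the dataset; values are invented.
contents = (
    '1,Lambourn,1,1-2 mm,No,Eye,0,0,1\n'
    '56,The Cut Site,1,2-4 mm,No,Eye,0,0,1\n'
    '57,The Cut Site,1,1-2 mm,No,Eye,0,0,1\n'
    '90,Leach,2,2-4 mm,No,Floated,0,0,1\n'
)

# Default behaviour: '.*' stops at the end of each line.
pattern = re.compile(r'The Cut Site.*2-4\smm')
matches = [m.group() for m in pattern.finditer(contents)]

# With re.DOTALL, '.' also matches newlines, so one match spans several rows.
dotall = re.compile(r'The Cut Site.*2-4\smm', re.DOTALL)
dotall_matches = [m.group() for m in dotall.finditer(contents)]

print(matches)                     # ['The Cut Site,1,2-4 mm']
print('Leach' in dotall_matches[0])  # True
```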


Paste your code into the cell below and modify it, this time keeping only those results that were sorted by eye and were classed as unidentifiable (marked by a '1' as opposed to a '0' in the data).

In [13]:
#WRITE YOUR CODE HERE:
import re

with open('ThamesMicroplastics_ParticleCharacteristics.csv', 'r') as f:
    contents = f.read()
    
pattern = re.compile(r'The Cut Site.*2-4\smm.*Eye\W\d\W\d\W1')

matches = pattern.finditer(contents)

counter = 0

for match in matches:
    counter += 1
    #print (match) #Keep this to show what's matching, then comment out.

print ('\nNumber of matching entries:', counter)


Number of matching entries: 9
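A count like this can be sanity-checked by applying the same criteria column by column with Python's built-in csv module. The sketch below does so on a few hypothetical rows held in memory (the values are invented, and io.StringIO stands in for the real file), and compares the result with the regex from the cell above.

```python
import csv
import io
import re

# Hypothetical extract with the dataset's header; the row values are invented.
sample = (
    'Sample_ID,Site,Replicate,Size_fraction,Subsample_40_grams,'
    'Sorting_step,Other_manmade,Natural_substance,Unidentifiable\n'
    '1,Lambourn,1,1-2 mm,No,Eye,0,0,1\n'
    '56,The Cut Site,1,2-4 mm,No,Eye,0,0,1\n'
    '57,The Cut Site,1,2-4 mm,No,Floated,0,0,1\n'
    '58,The Cut Site,2,2-4 mm,No,Eye,1,0,0\n'
)

# Column-by-column check using the header names as dictionary keys.
csv_count = 0
for row in csv.DictReader(io.StringIO(sample)):
    if (row['Site'] == 'The Cut Site' and row['Size_fraction'] == '2-4 mm'
            and row['Sorting_step'] == 'Eye' and row['Unidentifiable'] == '1'):
        csv_count += 1

# The same criteria expressed as a single regex.
pattern = re.compile(r'The Cut Site.*2-4\smm.*Eye\W\d\W\d\W1')
regex_count = len(pattern.findall(sample))

print(csv_count, regex_count)  # 1 1
```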


How many data entries fit these criteria?

Answer:

# Group()

Currently, your code should be matching everything from the first pattern specified in the regex through to the last. It is also possible to capture just part of a match, as opposed to printing the entire matched line. To achieve this, we can place '()' around the search terms of interest in the regex, and use 'group()' to print just these captured patterns. Take the following example code:

In [14]:
import re

sequence ='CTTGGGATGAAACGTTTAGCGGATAATCGTGAGATCGGACTGGCTGGATTGATCGGATTAACGGATCTGATCTGGGATCGGTAA'

pattern = re.compile('(C)|T|(A)|G')

matches = pattern.finditer(sequence)

counter = 0
counter2 = 0

for match in matches:
    if match.group(1) == 'C':
        #print(match.group(1)) #Keep while writing, then comment out.
        counter += 1
    if match.group(2) == 'A':
        #print(match.group(2)) #Keep while writing, then comment out.
        counter2 += 1

print('\nThere were', counter, 'occurrences of C, and', counter2, 'occurrences of A.')


There were 12 occurrences of C, and 21 occurrences of A.
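To see what each numbered group holds for a given match, the groups() method returns all of them at once as a tuple. The sketch below applies the pattern from the cell above to each of the four bases in turn.

```python
import re

pattern = re.compile('(C)|T|(A)|G')

captured = {}
for base in 'CTAG':
    match = pattern.match(base)
    # groups() lists group 1 and group 2; unused groups appear as None.
    captured[base] = match.groups()
    print(base, match.groups())

# C ('C', None)
# T (None, None)
# A (None, 'A')
# G (None, None)
```

'T' and 'G' still match the pattern, but because they are not wrapped in brackets nothing is captured for them.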


Using 'group()', we successfully isolated the bracketed portions of our regex. Note that group() returns 'None' for any bracketed group that did not take part in a given match, and so there are two if statements in the code above that pick out only the groups that did match. (Feel free to try uncommenting the commented lines to see what is being matched.) <p>In light of the above, incorporate 'group()' to isolate the sample ID, site name and whether substances in the samples were manmade, natural or unidentified.

In [22]:
#WRITE YOUR CODE HERE:
import re

with open('ThamesMicroplastics_ParticleCharacteristics.csv', 'r') as f:
    contents = f.read()
    
pattern = re.compile(r'(\d*).*(The Cut Site).*2-4\smm.*Eye,(\d)\W(\d)\W(1)')
matches = pattern.finditer(contents)

counter = 0  # Reset here, so that a tally left over from an earlier cell is not carried forward.

for match in matches:
    counter += 1
    if match.group(3) != '0':
        category1 = 'manmade'
    else:
        category1 = ''
    if match.group(4) != '0':
        category2 = 'natural substance'
    else:
        category2 = ''
    if match.group(5) != '0':
        category3 = 'unidentifiable'
    else:
        category3 = ''
    if match.group(2) == 'The Cut Site':
        print ('Sample',match.group(1), '-', match.group(2),':', category1, '-', category2, '-', category3)


print ('\nTotal number of matching entries:', counter)

Sample 56 - The Cut Site :  -  - unidentifiable
Sample 58 - The Cut Site :  - natural substance - unidentifiable
Sample 58 - The Cut Site :  -  - unidentifiable
Sample 58 - The Cut Site : manmade -  - unidentifiable
Sample 60 - The Cut Site :  -  - unidentifiable
Sample 60 - The Cut Site :  -  - unidentifiable
Sample 60 - The Cut Site :  -  - unidentifiable
Sample 60 - The Cut Site :  -  - unidentifiable
Sample 88 - The Cut Site :  -  - unidentifiable

Total number of matching entries: 9
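Once a regex has several bracketed groups, keeping track of them by number becomes error-prone. One option is to give each group a name with the '(?P<name>...)' syntax. The sketch below applies this to a single hypothetical row (the values are invented), and the group names sample_id, site, manmade, natural and unidentifiable are labels chosen here for illustration.

```python
import re

# Hypothetical row in the same layout as the dataset.
row = '58,The Cut Site,2,2-4 mm,No,Eye,1,0,1'

pattern = re.compile(
    r'(?P<sample_id>\d+),(?P<site>[^,\n]+),.*Eye,'
    r'(?P<manmade>\d),(?P<natural>\d),(?P<unidentifiable>\d)'
)

match = pattern.match(row)

# Named groups can be fetched individually...
print(match.group('sample_id'))  # 58
print(match.group('site'))       # The Cut Site

# ...or all together as a dictionary.
fields = match.groupdict()
print(fields['manmade'], fields['natural'], fields['unidentifiable'])  # 1 0 1
```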


# HERV Loci

Endogenous retroviruses (ERVs) are among the most abundant sequences in mammalian genomes. Human ERVs (HERVs) are numerous, and many of them are poorly described. As a result, researchers are in the process of devising a novel classification and nomenclature system to better catalogue and describe these genetic elements, but the process is long and cumbersome. <p>In order to get there, data are being collated that comprise valuable information about these HERV insertions, including their loci and genomic records. <p>In this exercise, we'll take a look at a small portion of one such dataset, and attempt to write Python regexes to capture each data criterion, so that we can parse these data down to smaller, more useful chunks. <p>As before, the data are given as a CSV (comma-separated values) file. Run the code in the cell below to extract the title of each column in the dataset.

In [115]:
import re

with open('HERV_loci.csv', 'r') as f:
    contents = f.read()

pattern = re.compile(r'^(.*)\n')

matches = pattern.finditer(contents)

for match in matches:
    print (match.group(1))

Record_ID,Organism,Assigned_to,Extract_start,Extract_end,Orientation,Chunk_name,Scaffold,Genome_structure
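Only the header row is printed because, by default, '^' matches at the very start of the string and nowhere else. The sketch below shows the difference the re.MULTILINE flag makes, using a hypothetical three-line string with shortened rows.

```python
import re

# Hypothetical, shortened contents: a header and two data rows.
contents = 'Record_ID,Organism\n45432,Homo_sapiens\n51742,Homo_sapiens\n'

# Without a flag, '^' anchors only to the start of the whole string.
first_line = re.findall(r'^(.*)\n', contents)

# With re.MULTILINE, '^' also matches at the start of every line.
all_lines = re.findall(r'^(.*)\n', contents, flags=re.MULTILINE)

print(first_line)  # ['Record_ID,Organism']
print(all_lines)   # ['Record_ID,Organism', '45432,Homo_sapiens', '51742,Homo_sapiens']
```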


Now that we know what each column of data comprises, use this information to complete the code in the cell below, creating a regex that captures a value for every single criterion (column), for each row of data. 

In [116]:
import re

with open('HERV_loci.csv', 'r') as f:
    contents = f.read()

pattern = re.compile(r'(\d{5,6}),(Homo_sapiens),(.[^,]*),(\d*),(\d*),(\w*),(.[^,]*),(.[^,]*\w*),(.[^,]*\w*)')

matches = pattern.finditer(contents)
#
for match in matches:
    Record_ID = match.group(1)
    Organism = match.group(2)
    Assigned_to = match.group(3)
    Extract_start = match.group(4)
    Extract_end = match.group(5)
    Orientation = match.group(6)
    Chunk_name = match.group(7)
    Scaffold = match.group(8)
    Genome_structure = match.group(9)

    print(Record_ID,Assigned_to,Genome_structure)


45432 ERV-9 LTR-LEA
51742
51796 ERV-9 LTR-LEA
52047
52126 ERV-9 LTR-LEA
58575
44262 ERV-9 LTR-LEA-env-LTR
37491
38646 ERV-9 LTR-LEA-gag
40592
40688 ERV-9 LTR-LEA-gag
42769
44185 ERV-9 LTR-LEA-gag
44867
44897 ERV-9 LTR-LEA-gag
45408
45547 ERV-9 LTR-LEA-gag
50355
50368 ERV-9 LTR-LEA-gag
51657
52411 ERV-9 LTR-LEA-gag
52520
58560 ERV-9 LTR-LEA-gag
37207
39577 ERV-9 LTR-LEA-gag-env
39709
37644 ERV-9 LTR-LEA-gag-env-LTR
40624
40690 ERV-9 LTR-LEA-gag-env-LTR
42858
44181 ERV-9 LTR-LEA-gag-env-LTR
45005
45394 ERV-9 LTR-LEA-gag-env-LTR
45446
45519 ERV-9 LTR-LEA-gag-env-LTR
49952
50152 ERV-9 LTR-LEA-gag-env-LTR
51697
51797 ERV-9 LTR-LEA-gag-env-LTR
52062
52079 ERV-9 LTR-LEA-gag-env-LTR
36017
38672 ERV-9 LTR-LEA-gag-LTR
39695
44220 ERV-9 LTR-LEA-gag-LTR
45402
45583 ERV-9 LTR-LEA-gag-LTR
47083
49949 ERV-9 LTR-LEA-gag-LTR
36108
36135 ERV-9 LTR-LEA-gag-pol
37449
37589 ERV-9 LTR-LEA-gag-pol
37658
38658 ERV-9 LTR-LEA-gag-pol
38679
39667 ERV-9 LTR-LEA-gag-pol
43400
44979 ERV-9 LTR-LEA-gag-pol
45215
4543
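The alternating lines in the output above that show only a record ID are consistent with the final group running past the end of its row: a negated character class such as '[^,]' matches newline characters too, so the last group keeps going until the first comma of the following row, and that row is then never matched in its own right. The sketch below reproduces the effect on two hypothetical rows (all values invented) and shows one possible remedy, which is to exclude the newline from each class as well.

```python
import re

# Hypothetical two-row extract; the values are invented for illustration.
contents = (
    '11111,Homo_sapiens,ERV-9,100,200,positive,chunk_a,scaffold_a,LTR-LEA\n'
    '22222,Homo_sapiens,ERV-9,300,450,negative,chunk_b,scaffold_b,LTR-LEA-gag\n'
)

# The pattern as used in the cell above: '[^,]' can cross line breaks.
loose = re.compile(
    r'(\d{5,6}),(Homo_sapiens),(.[^,]*),(\d*),(\d*),(\w*),(.[^,]*),(.[^,]*\w*),(.[^,]*\w*)'
)

# One possible fix: '[^,\n]' stops at either a comma or a newline.
strict = re.compile(
    r'(\d{5,6}),(Homo_sapiens),(.[^,\n]*),(\d*),(\d*),(\w*),(.[^,\n]*),(.[^,\n]*\w*),(.[^,\n]*\w*)'
)

# group(1, 9) returns the record ID and the genome structure as a tuple.
loose_rows = [m.group(1, 9) for m in loose.finditer(contents)]
strict_rows = [m.group(1, 9) for m in strict.finditer(contents)]

print(loose_rows)   # [('11111', 'LTR-LEA\n22222')]
print(strict_rows)  # [('11111', 'LTR-LEA'), ('22222', 'LTR-LEA-gag')]
```

The same consideration applies to any later pattern that reuses these character classes on the full file contents.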

Adapting the code you just wrote, use the cell below to keep only those matches that are classified as ERV-9 and are in a positive orientation, and calculate the size of each described sequence. <p>Print the filtered matches as follows: <p>'Match number 48738 is a ERV-9 element, and is xxxx bases long'.

In [119]:
#WRITE YOUR CODE HERE
import re

with open('HERV_loci.csv', 'r') as f:
    contents = f.read()
    
pattern = re.compile(r'(\d{5,6}),(.[^,]*),(.[^,]*),(\d*),(\d*),(\w*),(.[^,]*),(.[^,]*\w*),(.[^,]*\w*)')

matches = pattern.finditer(contents)
counter = 0
for match in matches:
    # print (match)
    Record_ID = match.group(1)
    Organism = match.group(2)
    Assigned_to = match.group(3)
    Extract_start = match.group(4)
    Extract_end = match.group(5)
    Orientation = match.group(6)
    Chunk_name = match.group(7)
    Scaffold = match.group(8)
    Genome_structure = match.group(9)

    # Keep only ERV-9 elements in a positive orientation.
    if Assigned_to == 'ERV-9' and Orientation == 'positive':
        # The captured coordinates are strings, so convert them before subtracting.
        Extract_length = int(Extract_end) - int(Extract_start)
        print ('Match number',Record_ID,'is a', Assigned_to, 'element, and is', Extract_length, 'bases long:\n', match,'\n')

Match number 45432 is a ERV-9 element, and is 1036 bases long:
 <re.Match object; span=(106, 224), match='45432,Homo_sapiens,ERV-9,146373701,146374737,posi> 

Match number 44185 is a ERV-9 element, and is 3269 bases long:
 <re.Match object; span=(1476, 1597), match='44185,Homo_sapiens,ERV-9,39289063,39292332,positi> 

Match number 44897 is a ERV-9 element, and is 2721 bases long:
 <re.Match object; span=(1707, 1827), match='44897,Homo_sapiens,ERV-9,74548527,74551248,positi> 

Match number 52411 is a ERV-9 element, and is 3274 bases long:
 <re.Match object; span=(2399, 2519), match='52411,Homo_sapiens,ERV-9,48664389,48667663,positi> 

Match number 58560 is a ERV-9 element, and is 3106 bases long:
 <re.Match object; span=(2629, 2747), match='58560,Homo_sapiens,ERV-9,408,3514,positive,hs_alt> 

Match number 39577 is a ERV-9 element, and is 4856 bases long:
 <re.Match object; span=(2861, 2985), match='39577,Homo_sapiens,ERV-9,80997504,81002360,positi> 

Match number 44181 is a ERV-9 elemen

Having scratched the surface of using regular expressions in Python to handle data, it is easy to see their power and usefulness as applied to the biological sciences. Should you wish to delve into these further, below are some useful online resources: <p>https://docs.python.org/3/howto/regex.html <p>https://realpython.com/regex-python/