<a href="https://colab.research.google.com/github/Tycour/crisanti-toolshed/blob/main/docs/lessons/Regular_Expression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Documentation: https://docs.python.org/3/howto/regex.html
Video: https://youtu.be/yAeZD-MzBVY

# Regular Expression (RegEx)

Regular expressions are essentially a tiny, highly specialized programming language embedded inside Python and made available through the `re` module.

Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or anything you like.

You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.
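As a rough illustration of those two questions, the sketch below contrasts `re.match`, which only looks at the beginning of the string, with `re.search`, which scans the whole string. The example text is made up for this purpose.

```python
import re

text = "gene ATG1 encodes a kinase"  # made-up example text

# "Does this string match the pattern?" -> match() only tries at the start.
at_start = re.match(r"ATG\d", text)

# "Is there a match anywhere in this string?" -> search() scans the whole string.
anywhere = re.search(r"ATG\d", text)

print(at_start)          # None, because the text starts with 'gene'
print(anywhere.group())  # ATG1
```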

To use regular expressions in your code, first import the standard-library module:
```
import re
```

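Alternatively, import the third-party `regex` module (installed separately, for example with `pip install regex`), which is largely compatible with `re` and adds features such as overlapping matches: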
```
import regex
```

# Simple real-life examples

* Find all countries that end in a unique letter
* Keep only the email addresses that contain the first and last name of an author on a paper
* Find all PAM motifs in a given gene
* Codon optimisation
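As one possible take on the email example, the sketch below keeps only addresses whose local part (the text before `@`) contains both names, in either order. The author name and the addresses are hypothetical, and the lookahead approach is just one way to express "contains both".

```python
import re

# Hypothetical author and address list, made up for this illustration.
first, last = "Ada", "Lovelace"
emails = [
    "ada.lovelace@example.org",
    "a.lovelace@example.org",
    "lovelace_ada@example.org",
    "charles.babbage@example.org",
]

# Each lookahead (?=...) requires one of the names somewhere before the '@'.
# re.escape guards against names containing regex metacharacters.
pattern = re.compile(
    rf"^(?=[^@]*{re.escape(first)})(?=[^@]*{re.escape(last)})[^@]+@",
    flags=re.IGNORECASE,
)

kept = [email for email in emails if pattern.search(email)]
print(kept)  # ['ada.lovelace@example.org', 'lovelace_ada@example.org']
```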

# RegEx Functions

`search('pattern', string)` - Searches the string for a match, and returns a `Match object` if there is a match.

If there is more than one match, only the first occurrence of the match will be returned.

If no matches are found, the value None is returned.
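A minimal sketch of what a `Match object` offers, and of why the result should be checked for `None` before use. The short sequence is chosen only for illustration.

```python
import re

seq = "CCGATGCATGCC"

match = re.search("ATG", seq)
if match is not None:
    print(match.group())  # ATG     (the matched text)
    print(match.start())  # 3       (0-based index where the match begins)
    print(match.end())    # 6       (index just past the match)
    print(match.span())   # (3, 6)

# No match: search() returns None, so test the result before calling methods on it.
missing = re.search("TTT", seq)
print(missing)  # None
```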

In [None]:
import regex as re
from country_list import countries_for_language

countries_dict = dict(countries_for_language('en'))
countries_list = list(countries_dict.values())

# Example: find countries that end in the letter 'k'
for country in countries_list:
  if re.search('k$', country):
    print(country)

Denmark


In [None]:
seq = 'AGCTGATCGTATCATGGATCCGGCTAATGCCTAGCTTAGTTCAGTTAATG'

result = re.search('ATG', seq)  # search once and reuse the Match object
if result:
  print('seq has a start codon')
  print("Start codon's position is:", result.start())  # 0-based index

seq has a start codon
Start codon's position is: 13


`split('pattern', string)` - Returns a list where the string has been split at each match.

In [None]:
sentence = 'Coding is silly'

words = re.split(' ', sentence)  # pattern first, then the string to split

print(words)

['Coding', 'is', 'silly']


In [None]:
seq = 'CCGATGCATGCC'

split_pieces = re.split('ATG', seq)
print(split_pieces)

['CCG', 'C', 'CC']
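Where `re.split` goes beyond the plain string method is in splitting on a pattern rather than a fixed separator. The sketch below uses a made-up record with mixed delimiters and also shows the optional `maxsplit` argument.

```python
import re

record = "sample1; ATG,CCG  TTA"  # made-up record with mixed delimiters

# Split on any run of semicolons, commas, or whitespace.
fields = re.split(r"[;,\s]+", record)
print(fields)  # ['sample1', 'ATG', 'CCG', 'TTA']

# maxsplit limits the number of splits; the remainder stays in one piece.
first_cut = re.split("ATG", "CCGATGCATGCC", maxsplit=1)
print(first_cut)  # ['CCG', 'CATGCC']
```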


`sub('pattern', 'substitute', string)` - Replaces every match of `pattern` in `string` with `substitute` and returns the new string

In [None]:
DNA_seq = 'GATTACA'

RNA_seq = re.sub('T', 'U', DNA_seq)

print(RNA_seq)

GAUUACA
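Two further options of `re.sub` are sketched below: the `count` argument limits how many matches are replaced, and a function can be passed as the replacement to compute each substitution from its match. The complement table is written out here only for the illustration.

```python
import re

dna = "GATTACA"

# count=1 replaces only the first match.
first_only = re.sub("T", "U", dna, count=1)
print(first_only)  # GAUTACA

# A function used as the replacement receives each Match object
# and returns the text to put in its place.
complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
complemented = re.sub("[ACGT]", lambda m: complement[m.group()], dna)
print(complemented)  # CTAATGT
```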


`findall('pattern', string)` - Returns a list of all matches of `pattern` in `string`

The list contains the matches in the order they are found.

If no matches are found, an empty list is returned.

In [None]:
import regex as re

seq = 'ACGGTGCAGCGGTGGCTTGACGAACGG'

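# '.' matches any single character (except a newline), so here it plays the role of 'N' (any base).
# overlapped=True is provided by the third-party `regex` module, not by the standard `re` module.
pams = re.findall('.GG', seq, overlapped=True)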

print(pams)

['CGG', 'CGG', 'TGG', 'CGG']
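The `overlapped=True` argument exists only in the third-party `regex` module. With the standard `re` module, one common workaround is to wrap the pattern in a lookahead with a capturing group; the sketch below assumes that approach and reuses the sequence from the cell above.

```python
import re

seq = "ACGGTGCAGCGGTGGCTTGACGAACGG"

# A lookahead (?=...) consumes no characters, so the scan moves on one
# position at a time and the capturing group reports every hit.
pams = re.findall(r"(?=(.GG))", seq)
print(pams)  # ['CGG', 'CGG', 'TGG', 'CGG']

# The difference shows up when matches share characters.
plain = re.findall("AA", "AAAA")
overlapping = re.findall("(?=(AA))", "AAAA")
print(plain)        # ['AA', 'AA']
print(overlapping)  # ['AA', 'AA', 'AA']
```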


`finditer('pattern', string)` - Returns an iterator yielding `Match objects` over all non-overlapping matches for the RE `pattern` in `string`.

The string is scanned left-to-right, and matches are returned in the order found.
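A small in-memory sketch of `finditer`, using a made-up sequence so that it runs without any data file. Each item yielded is a `Match object`, so positions and matched text are both available.

```python
import re

seq = "TTAGATAGCCTGATAAGG"  # made-up sequence containing two GATA motifs
motif = "[AT]GATA[GA]"

# Collect (start, end, matched text) for every non-overlapping match.
hits = [(m.start(), m.end(), m.group()) for m in re.finditer(motif, seq)]

for start, end, text in hits:
    print(start, end, text)
# 2 8 AGATAG
# 10 16 TGATAA
```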

In [5]:
import regex as re
motif = '[AT]GATA[GA]' # Pattern for a 'GATA' box, a transcription factor binding motif
                       # Square brackets '[]' match any one of the enclosed characters at that position in the motif.

from google.colab import drive
drive.mount('/content/drive', force_remount=True)
 
with open('/content/drive/My Drive/Coding club/_data/11_AGAP004050.txt') as file:
  # Keeps the last line read; this assumes the file holds the whole sequence on a single line.
  for seq in file:
    dsx_seq = seq

# pams = re.finditer(motif, seq, overlapped=True)
# for pam in pams:
#   print(pam.group()) # Prints the actual motif it finds

def find_PAMs(motif, sequence):
  pams = re.finditer(motif, sequence, overlapped=True)
  pam_tuple = [(pam.start(), pam.group()) for pam in pams]
  return pam_tuple

# Show the first 19 matches for convenience
print(find_PAMs(motif, dsx_seq)[:19])

Mounted at /content/drive
[(537, 'AGATAG'), (688, 'TGATAG'), (741, 'TGATAA'), (1361, 'AGATAG'), (1593, 'AGATAA'), (2087, 'AGATAA'), (2960, 'AGATAA'), (3536, 'AGATAA'), (5642, 'TGATAA'), (9615, 'AGATAA'), (11686, 'AGATAG'), (12890, 'TGATAG'), (13965, 'AGATAG'), (14454, 'TGATAA'), (14505, 'TGATAA'), (18552, 'AGATAA'), (19567, 'TGATAG'), (20321, 'AGATAA'), (25461, 'TGATAA')]


In [2]:
import re
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
 
with open('/content/drive/My Drive/Coding club/_data/11_AGAP004050.txt') as file:
  # Keeps the last line read; this assumes the file holds the whole sequence on a single line.
  for seq in file:
    dsx_seq = seq

def find_pams(seq, pam):
  iupac_dict = {
      'A': 'A',
      'C': 'C',
      'G': 'G',
      'T': 'T',
      'R': '[AG]',
      'Y': '[CT]',
      'S': '[GC]',
      'W': '[AT]',
      'K': '[GT]',
      'M': '[AC]',
      'B': '[CGT]',
      'D': '[AGT]',
      'H': '[ACT]',
      'V': '[ACG]',
      'N': '[ACGT]'}
  iupac_pam = ''.join([iupac_dict[letter] for letter in pam])

  # 40 is an arbitrary margin used to skip guides that are too close to either end of the sequence.
  pams = [(m.start(), m.group()) for m in re.finditer(iupac_pam, seq)
          if m.start() - 40 > 0 and m.end() + 40 < len(seq)]

  return pams

pam = 'BGG'

# Show the first 9 matches for convenience
print(find_pams(dsx_seq, pam)[:9])

Mounted at /content/drive
[(44, 'TGG'), (50, 'CGG'), (61, 'CGG'), (183, 'CGG'), (196, 'TGG'), (205, 'CGG'), (216, 'TGG'), (224, 'CGG'), (309, 'CGG')]
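The cell above uses `re.finditer` from the standard `re` module, which reports only non-overlapping matches, so a PAM that shares bases with the previous one (for example `GGG` right after `TGG`) is skipped. The sketch below is one way to recover those hits with a lookahead. The function name `find_pams_overlapping`, the `margin` parameter (standing in for the fixed 40 used above), the reduced IUPAC table, and the toy sequence are all assumptions made for this illustration.

```python
import re

# Subset of the IUPAC codes, enough for this illustration.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "B": "[CGT]", "N": "[ACGT]"}

def find_pams_overlapping(seq, pam, margin=0):
    """Return (start, match) pairs for pam, including overlapping hits.

    margin is a hypothetical parameter: hits closer than this to either
    end of seq are dropped, using the same strict comparison as above.
    """
    pattern = "".join(IUPAC[letter] for letter in pam)
    hits = []
    # Wrapping the pattern in a lookahead lets stdlib re report overlapping hits.
    for m in re.finditer(f"(?=({pattern}))", seq):
        start, end = m.start(1), m.end(1)
        if start - margin > 0 and end + margin < len(seq):
            hits.append((start, m.group(1)))
    return hits

toy = "AATGGGCATCGGA"  # made-up sequence
all_hits = find_pams_overlapping(toy, "BGG")
inner_hits = find_pams_overlapping(toy, "BGG", margin=2)
print(all_hits)    # [(2, 'TGG'), (3, 'GGG'), (9, 'CGG')]
print(inner_hits)  # [(3, 'GGG')]
```

A plain `re.finditer('[CGT]GG', toy)` would return only the hits at positions 2 and 9, because the scan resumes after the end of each match.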
