# Regular expressions in Python

### Sources

Python for Biologists - Martin Jones, Chapter 7: Regular expressions (Pages 149 - 178)

[Regular expressions in genomics](https://towardsdatascience.com/using-regular-expression-in-genetics-with-python-175e2b9395c2)

## What are regular expressions (Regex)?
A regular expression (regex) is a sequence of characters that defines a search pattern.

## Common regular expression syntax
Source: [Regular expressions in genomics](https://towardsdatascience.com/using-regular-expression-in-genetics-with-python-175e2b9395c2)

| Syntax | Description|
| --- | --- |
| . | Matches any single character |
| ^ | Anchor; matches at the start of a string |
| $ | Anchor; matches at the end of a string |
| \ | Escape character |
| &#124; | Pipe character OR; C&#124;T will match C or T |
| * | Matches zero or more repetitions of the previous character; GGGA*TTT matches GGGTTT, GGGATTT, GGGAATTT, etc. |
| + | Matches one or more repetitions of the previous character; GGGA+TTT matches GGGATTT, GGGAATTT, etc. |
| ? | Matches zero or one repetition of the previous character; GAT?C means that T is optional, so it matches GAC or GATC |
| {n} | Quantifier; matches exactly n repetitions of the previous character |
| {n,x} | Quantifier; matches n to x repetitions of the previous character |
| [ ] | Character group; e.g. [AGCT] will match any one of the characters A, G, C or T |
| [^] | Negated character group; e.g. [^AGCT] will match any one character that is not in this group |
| ( ) | Group; treats the pattern inside the parentheses as a single unit and captures the text it matches |
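
To make the table concrete, here is a minimal sketch that tries several of these patterns with `re.fullmatch()`, which succeeds only if the whole string matches the pattern. The test sequences and the `checks` list are made up for illustration and are not taken from the sources.

```python
import re

# Each entry: (pattern, sequence, expected result). Sequences are illustrative only.
checks = [
    (r"GGGA*TTT", "GGGTTT", True),    # * allows zero A's
    (r"GGGA+TTT", "GGGTTT", False),   # + needs at least one A
    (r"GAT?C", "GAC", True),          # ? makes the T optional
    (r"GA{2}T", "GAAT", True),        # {n}: exactly two A's
    (r"GA{2,3}T", "GAAAAT", False),   # {n,x}: four A's is too many
    (r"G[AT]C", "GTC", True),         # [ ] matches one character from the set
    (r"G[^AT]C", "GTC", False),       # [^ ] matches one character not in the set
    (r"G(AT)+C", "GATATC", True),     # ( ) groups AT so that + repeats the pair
    (r"C|T", "T", True),              # | matches either alternative
]

# re.fullmatch() returns a match object or None; bool() turns that into True/False.
results = [bool(re.fullmatch(pattern, seq)) for pattern, seq, _ in checks]

for (pattern, seq, expected), got in zip(checks, results):
    print(f"{pattern!r} against {seq!r}: {got}")
```

`re.fullmatch()` is used here only so that each row gives a clear yes or no; `re.search()`, which the exercises use, looks for the pattern anywhere inside the string.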



In [1]:
import re

## Raw strings

Writing regular expressions requires us to type a lot of special characters. 

If we put the letter `r` immediately before the opening quotation mark, Python does not interpret backslash escape sequences such as `\t` (tab) or `\n` (newline) inside the string; they are kept as literal characters:

In [2]:
print(r"\t\n")

\t\n


The `r` stands for *raw*, which is Python's term for a string in which backslashes are kept as they are instead of starting an escape sequence.
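
The difference matters for regular expressions because many regex tokens begin with a backslash. The sketch below compares a normal and a raw string, and then uses the regex token `\b` (word boundary) as an example; the variable names and the sample text are illustrative assumptions.

```python
import re

normal = "\t\n"   # two characters: a tab and a newline
raw = r"\t\n"     # four characters: backslash, t, backslash, n

print(len(normal), len(raw))

# In a regular expression, \b means "word boundary". In a normal Python string,
# "\b" is a backspace character, so the pattern silently means something else.
text = "the cat sat"
with_raw = re.findall(r"\bcat\b", text)      # word boundary, cat, word boundary
without_raw = re.findall("\bcat\b", text)    # backspace, cat, backspace

print(with_raw)
print(without_raw)
```

Writing every pattern as a raw string, as the exercises below do, avoids having to remember which backslash sequences Python would otherwise interpret.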

## Exercises

1. Use `re.search()` to test whether the DNA sequence `dna = "ATCGCGAATTCAC"` contains an EcoRI restriction site. The EcoRI recognition sequence is `GAATTC`.

2. Use `re.search()` and the pipe syntax `|` to test whether the DNA sequence `dna = "ATCGCGAATTCAC"` contains an AvaII recognition site. AvaII recognises two different sequences, `GGACC` and `GGTCC`.

3. Use `re.search()` and a character group `[]` to test whether the DNA sequence `dna = "ATCGCGAATTCACGCTGC"` contains a recognition site for BisI. The BisI restriction enzyme cuts at an even wider range of motifs: its pattern is `GCNGC`, where `N` stands for any base. Can you use another regex syntax for this exercise instead of `[]`? What is the downside of that alternative?

4. Write a pattern to identify full-length eukaryotic messenger RNA sequences.
Imagine that the sequence of a eukaryotic messenger RNA looks like this:
it starts with `AUG`, is followed by, for example, 30 to 1000 bases drawn from `A`, `U`, `G` and `C`, and ends with a poly-A tail of 5 to 10 adenine nucleotides.

5. What type of object does `re.search()` return? Which method can we use to extract the part of the string that matched?

6. For the DNA sequence `dna = "ATGACGTACGTACGACTG"`, the match object is stored in the variable  
`m = re.search(r"GA[ATGC]{3}AC", dna)`. Use the appropriate method to extract the matched string from `dna`.

7. In the DNA sequence `dna = "ATGACGTACGTACGACTG"`, do the following:
Use `re.search()` with a suitable regex to find this pattern:
a sequence that starts with GA, then has any of the nucleotides A, T, G, C up to 3 times,
then has the fixed characters AC, then has any of the nucleotides A, T, G, C up to 2 times, and ends with AC.

Use the appropriate method to extract:

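7A. The entire match

7B. The first variable part of the match (the bases between GA and the first AC)

7C. The second variable part of the match (the bases between the first AC and the final AC)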

8. In the DNA sequence `dna = "ATGACGTACGTACGACTG"`, do the following:
Using the regex constructed in question 7, use the appropriate methods to obtain the start and end positions of the matched pattern in the string.
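
Exercises 5 to 8 all work with the object that `re.search()` returns. As a hedged sketch of how such an object can be inspected, the example below uses a made-up sequence and pattern (not the ones from the exercises) and the match methods `group()`, `start()` and `end()`.

```python
import re

# Hypothetical sequence and pattern, chosen only to illustrate the methods.
dna = "CCGATTACAGGT"
m = re.search(r"GA([ATGC]{2})AC", dna)

# re.search() returns None when nothing matches, so check before using the result.
if m is None:
    print("no match")
else:
    whole = m.group()                     # the entire matched text
    inner = m.group(1)                    # text captured by the first pair of parentheses
    span = (m.start(), m.end())           # end() is one past the last matched index
    inner_span = (m.start(1), m.end(1))   # the same positions for the captured group
    print(whole, inner, span, inner_span)
```

Positions are zero-based, so `dna[m.start():m.end()]` gives the same text as `m.group()`.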

9. In the DNA sequence `dna = "ACTNGCATRGCTACGTYACGATSCGAWTCG"`, we have several IUPAC codes, such as N, R and Y, indicating missing data or mutations in our DNA. We want the list of DNA segments that occur between the IUPAC codes, that is:
['ACT', 'GCAT', 'GCTACGT', 'ACGAT', 'CGA', 'TCG']
How could we achieve this?

10. Let's find every place in a DNA sequence that is AT-rich or GC-rich.
Define a DNA sequence and find all runs of A and T in it that are longer than 3 bases.

11. Define a DNA sequence, find all runs of A and T in it that are longer than 5 bases, and report the start and end positions of each match.
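
Exercises 9 to 11 need more than one result per sequence. The following sketch shows three `re` functions that return multiple results, `re.split()`, `re.findall()` and `re.finditer()`, on a made-up sequence; the sequence, the thresholds and the variable names are assumptions for illustration, not the exercise answers.

```python
import re

# Hypothetical sequence containing two ambiguity codes (N and R).
dna = "GGATTTAACCGNATATATACGRTTC"

# re.split() cuts the string wherever the pattern matches and returns the pieces.
segments = re.split(r"[^ATGC]", dna)

# re.findall() returns every non-overlapping match as a list of strings.
# {4,} means "four or more", i.e. runs longer than 3 bases.
at_runs = re.findall(r"[AT]{4,}", dna)

# re.finditer() yields one match object per match, so positions are available too.
at_positions = [(m.start(), m.end()) for m in re.finditer(r"[AT]{4,}", dna)]

print(segments)
print(at_runs)
print(at_positions)
```

`re.findall()` is enough when only the matched text is needed; `re.finditer()` is the usual choice when the start and end positions matter as well.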

12. Given the accession names below:

accs = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847", "hedle3455", "xjhd53e", "45da", "de37dp"]

Write a program that will print only the accession names that satisfy the following criteria:

* contain the number 5
* contain the letter d or e
* contain the letters d and e in that order
* contain the letters d and e in that order with a single letter between them
* contain both the letters d and e in any order
* start with x or y
* start with x or y and end with e
* contain three or more numbers in a row
* end with d followed by either a, r or p

13. Use the `HTT.fasta` file to explore the problem in [Regular expressions in genomics](https://towardsdatascience.com/using-regular-expression-in-genetics-with-python-175e2b9395c2)