# Regular Expressions

Regular expressions let a single pattern account for variation in the text being matched. Common syntax:

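- ```[^XXX]``` matches any single character except the ones listed
  - [^ABC]
- ```(X|Y)``` matches either of the alternatives
- ```.``` matches any single character (except a newline by default)
  - for a DNA sequence this behaves like [ATCG]; inside square brackets no ```|``` is needed
- ```?``` after a character or a group in parentheses means match 0 or 1 time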
  - AAT?CCC -> AATCCC or AACCC
- ```+``` means it will match one or more times
  - AAT+CCC -> AATCCC or AATTCCC or AATTTCCC etc
- ```*``` means it will match 0 or more times
  - AAT*CCC -> AACCC or AATCCC or AATTCCC etc
- ```{number}``` after a character or group matches it exactly that number of times
  - AAT{3}CCC -> AATTTCCC
- ```{number1,number2}``` matches any number of times in that range (no space after the comma)
  - AAT{2,3}CCC -> AATTCCC or AATTTCCC
- ```^AAA``` means find pattern at start of string
- ```AAA$``` means find pattern at end of string
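
The rules above can be checked directly in Python. The following is a minimal sketch, assuming the standard `re` module; the `cases` dictionary and its test strings are made up for illustration. `re.fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

# Each pattern from the notes, paired with strings it should
# and should not match in full.
cases = {
    r"AAT?CCC": (["AATCCC", "AACCC"], ["AATTCCC"]),
    r"AAT+CCC": (["AATCCC", "AATTTCCC"], ["AACCC"]),
    r"AAT*CCC": (["AACCC", "AATTCCC"], ["AAGCCC"]),
    r"AAT{3}CCC": (["AATTTCCC"], ["AATTCCC"]),
    r"AAT{2,3}CCC": (["AATTCCC", "AATTTCCC"], ["AATCCC", "AATTTTCCC"]),
    r"[^ABC]": (["T"], ["A"]),
}

for pattern, (should_match, should_not_match) in cases.items():
    for text in should_match:
        assert re.fullmatch(pattern, text), (pattern, text)
    for text in should_not_match:
        assert not re.fullmatch(pattern, text), (pattern, text)

print("all patterns behave as described")
```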

# Exercises
# <b> Accession names </b>

Here's a list of made-up gene accession names:
- xkn59438, yhdck2, eihd39d9, chdsye847, hedle3455, xjhd53e, 45da, de37dp

Write a program that will print only the accession names that satisfy the following criteria – treat each criterion separately:
- contain the number 5
- contain the letter d or e
- contain the letters d and e in that order
- contain the letters d and e in that order with a single letter between them
- contain both the letters d and e in any order
- start with x or y
- start with x or y and end with e
- contain three or more numbers in a row
- end with d followed by either a, r or p


In [1]:
import re
accessions = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847', 'hedle3455', 'xjhd53e', '45da', 'de37dp']

# Contain the number 5
for acc in accessions:
    run = re.search(r"5",acc)
    if run:
        print(acc)

xkn59438
hedle3455
xjhd53e
45da


In [2]:
# Contain the letter d or e
for acc in accessions:
    run = re.search(r"d|e",acc)
    if run:
        print(acc)

yhdck2
eihd39d9
chdsye847
hedle3455
xjhd53e
45da
de37dp


In [3]:
# Contain the letters d and e in that order
for acc in accessions:
    run = re.search(r"d.*e",acc)
    if run:
        print(acc)

chdsye847
hedle3455
xjhd53e
de37dp


In [4]:
# Contain the letters d and e in that order with a single character in between
for acc in accessions:
    run = re.search(r"d.e",acc)
    if run:
        print(acc)

hedle3455


In [5]:
# contain both the letters d and e in any order
for acc in accessions:
    # d before e, or e before d: either order means both letters are present
    run = re.search(r"d.*e",acc)
    run2 = re.search(r"e.*d",acc)
    if run or run2:
        print(acc)

eihd39d9
chdsye847
hedle3455
xjhd53e
de37dp


In [6]:
# start with x or y
for acc in accessions:
    run = re.search(r"^[xy]",acc)
    if run:
        print(acc)

xkn59438
yhdck2
xjhd53e


In [7]:
# start with x or y and end with e
for acc in accessions:
    run = re.search(r"^(x|y).*e$",acc)
    if run:
        print(acc)

xjhd53e


In [8]:
# contain three or more numbers in a row
for acc in accessions:
    run = re.search(r"\d{3,}",acc)
    if run:
        print(acc)

xkn59438
chdsye847
hedle3455


In [9]:
# end with d followed by either a, r or p
for acc in accessions:
    run = re.search(r"d(a|r|p)$",acc)
    if run:
        print(acc)

45da
de37dp
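
Since every answer uses the same loop, the criteria can also be collected in one place. This is a sketch rather than the intended solution: the `matching` helper and the `patterns` dictionary are hypothetical names, and the lookahead pattern for "both d and e" is an alternative to running two separate searches.

```python
import re

accessions = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
              'hedle3455', 'xjhd53e', '45da', 'de37dp']

def matching(pattern, names):
    """Hypothetical helper: keep the names in which the pattern matches anywhere."""
    return [name for name in names if re.search(pattern, name)]

# One pattern per criterion from the exercise.
patterns = {
    "contains 5": r"5",
    "contains d or e": r"[de]",
    "d then e": r"d.*e",
    "d, one character, e": r"d.e",
    "both d and e": r"^(?=.*d)(?=.*e)",   # two lookaheads, order-independent
    "starts with x or y": r"^[xy]",
    "starts with x or y, ends with e": r"^[xy].*e$",
    "three or more digits in a row": r"\d{3,}",
    "ends with d then a, r or p": r"d[arp]$",
}

for label, pattern in patterns.items():
    print(label, "->", matching(pattern, accessions))
```

Character classes such as `[xy]` and `[arp]` are equivalent to the `(x|y)` and `(a|r|p)` groups for single characters and avoid creating a capture group.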


# <b> Double digest </b>

In the chapter_7 folder inside the exercises download, there's a file called dna.txt which contains a made-up DNA sequence.

Predict the fragment lengths that we will get if we digest the sequence with two made-up restriction enzymes – AbcI, whose recognition site is ANT```*```AAT, and AbcII, whose recognition site is GCRW```*```TG (asterisks indicate the position of the cut site).
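
The recognition sites use IUPAC ambiguity codes: N is any base, R is A or G, and W is A or T. As a rough sketch, the translation into a regular expression can be automated; `IUPAC` and `site_to_regex` are hypothetical names, and the table only covers the codes that appear in this exercise.

```python
# Assumption: only the ambiguity codes used in this exercise are needed.
# N = any base, R = purine (A or G), W = weak (A or T).
IUPAC = {"N": "[ATGC]", "R": "[AG]", "W": "[AT]"}

def site_to_regex(site):
    """Turn a recognition site such as 'ANT*AAT' into (pattern, cut offset)."""
    cut_offset = site.index("*")        # number of bases before the cut site
    bases = site.replace("*", "")
    pattern = "".join(IUPAC.get(base, base) for base in bases)
    return pattern, cut_offset

print(site_to_regex("ANT*AAT"))   # ('A[ATGC]TAAT', 3)
print(site_to_regex("GCRW*TG"))   # ('GC[AG][AT]TG', 4)
```

The cut offset is the number to add to the start of each match to get the cut position.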

In [10]:
ABCI = r"A[ATGC]TAAT"
ABCII = r"GC[AG][AT]TG"

In [11]:
# Read the sequence and strip the trailing newline
with open("../ExerciseAnswers/regular_expressions/exercises/dna.txt") as dna:
    dna_read = dna.read().rstrip("\n")

# Start with both ends of the sequence, then add every cut position
cut_positions = [0, len(dna_read)]

for match in re.finditer(ABCI, dna_read):
    cut_positions.append(match.start() + 3)   # AbcI cuts after ANT
for match in re.finditer(ABCII, dna_read):
    cut_positions.append(match.start() + 4)   # AbcII cuts after GCRW

# Walk from the end of the sequence towards the start
cut_positions.sort(reverse=True)
print(cut_positions)
for pos in range(len(cut_positions) - 1):
    frag = cut_positions[pos] - cut_positions[pos + 1]
    print("A predicted fragment length is: " + str(frag) + "bp")

[2012, 1628, 1577, 1143, 488, 0]
A predicted fragment length is: 384bp
A predicted fragment length is: 51bp
A predicted fragment length is: 434bp
A predicted fragment length is: 655bp
A predicted fragment length is: 488bp
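
The same calculation can be wrapped in a function and tried on a short sequence where the answer is easy to check by hand. This is a sketch under stated assumptions: `fragment_lengths` is a hypothetical helper, the `toy` sequence is made up and is not the contents of dna.txt, and fragments are reported from the start of the sequence rather than from the end.

```python
import re

def fragment_lengths(sequence, enzymes):
    """enzymes: list of (pattern, cut offset) pairs; returns lengths in sequence order."""
    cuts = {0, len(sequence)}           # both ends of the sequence
    for pattern, offset in enzymes:
        for match in re.finditer(pattern, sequence):
            cuts.add(match.start() + offset)
    positions = sorted(cuts)
    # Each fragment runs from one cut position to the next.
    return [end - start for start, end in zip(positions, positions[1:])]

# Made-up toy sequence with one AbcI site and one AbcII site.
toy = "GGGG" + "ACTAAT" + "CCCC" + "GCATTG" + "TT"
enzymes = [(r"A[ATGC]TAAT", 3), (r"GC[AG][AT]TG", 4)]
print(fragment_lengths(toy, enzymes))   # [7, 11, 4]
```

Note that `re.finditer` returns non-overlapping matches, so recognition sites that overlap each other would need a different approach.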
