<a href="https://colab.research.google.com/github/Tycour/crisanti-toolshed/blob/main/docs/lessons/11_Finding_Motifs_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Recording Link: https://youtu.be/oiMi8MNfcz4


# **Finding Motifs in DNA** (http://rosalind.info/problems/subs/)

Given two strings `s` and `t`, `t` is a substring of `s` if `t` is contained as a contiguous collection of symbols in `s` (as a result, `t` must be no longer than `s`).

In bioinformatics terms, a motif is a commonly shared interval in DNA. So trying to find a given motif in DNA comes down to searching a string for a given substring.
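
As a quick sketch of this definition, Python's `in` operator tests whether one string is a substring of another, and `str.find` returns the 0-based index of the first occurrence (or `-1` if there is none). The names `s` and `t` follow the definition above; the sequences are made up for illustration.

```python
s = 'GATTACA'
t = 'TTA'

# 'in' checks whether t occurs contiguously anywhere in s
is_substring = t in s
print(is_substring)   # True

# T, C and A all occur in s, but never side by side in that order
print('TCA' in s)     # False

# str.find gives the 0-based index of the first occurrence, or -1 if absent
first_hit = s.find(t)
print(first_hit)      # 2
print(s.find('TCA'))  # -1
```

Note that `str.find` only reports the first occurrence, which is why the exercise below needs a loop.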


In [None]:
# Strings are indexed as though they were a list of characters such that:
seq_1 = 'GCTA'
for nt in seq_1:
  print(nt)

# is functionally the same as:
seq_2 = ['G','C','T','A']
for nt in seq_2:
  print(nt)

print('Are these exactly the same thing to you, Oh Mighty Python?')
if seq_1 == seq_2:
  print('Computer says Yes')
else:
  print('Computer Says NO')

G
C
T
A
G
C
T
A
Are these exactly the same thing to you, Oh Mighty Python?
Computer Says NO


Python's 0-based indexing means that you subtract 1 from our 1-based notion of position:

e.g. In the sequence `'GATTACA'`, the positions of the `'A'`s are 2, 5, and 7.

But in Python those positions are 1, 4, and 6 respectively.

In [None]:
i = 0
for nt in 'GATTACA':
  if nt == 'A':
    print(i)
  i += 1

for i, nt in enumerate('GATTACA'):
  if nt == 'A':
    print(i)

# 'x += y' means adding y to x. It is the same as writing 'x = x + y'.

1
4
6
1
4
6
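
If you would rather report positions in the 1-based convention, one option is to tell `enumerate` where to start counting. This is only a sketch; the list name `one_based` is chosen here for illustration.

```python
seq = 'GATTACA'

one_based = []
# start=1 makes enumerate count from 1 instead of 0
for position, nt in enumerate(seq, start=1):
  if nt == 'A':
    one_based.append(position)

print(one_based)  # [2, 5, 7]
```

Keep in mind that these numbers are for reporting only: `seq[position]` would now point one character too far to the right.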


In [None]:
# Bit of revision about extracting elements you already know
seq = 'GATTACA'

# What do these return?
# print(seq[0])
# print(seq[:3])
# print(seq[3:])
# print(seq[::-1])

A substring of a string can thus be represented as `string[x:y]`, where `x` gives the starting index and `y` the index of the character following the end of the substring.

i.e. `string[x:y]` will return a substring from `string[x]` up to, but not including, `string[y]`

In [None]:
seq = 'GATTACA'

print(seq[1:4]) # Prints a substring from indices 1 to 3
print(seq[1]) # Prints the single character substring of index 1
print(seq[4]) # Prints the single character substring of index 4

ATT
A
A
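
Two consequences of this slicing rule are worth seeing in action, as a small sketch: a slice `string[x:y]` holds `y - x` characters when both indices lie inside the string, and a slice that runs past the end is simply cut short, whereas indexing a single position past the end raises an error. The names `piece` and `tail` are made up for this example.

```python
seq = 'GATTACA'

x, y = 1, 4
piece = seq[x:y]
print(piece, len(piece) == y - x)  # ATT True

# A slice that runs past the end of the string is cut short, with no error
tail = seq[5:9]
print(tail)  # CA

# Indexing a single position past the end does raise an IndexError
try:
  seq[9]
except IndexError:
  print('index out of range')
```

The forgiving behaviour of slices is what lets a sliding window run all the way to the end of a sequence without crashing.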


# Exercise

Given: Two DNA strings `seq` and `motif`.

Return: All locations of `motif` as a substring of `seq`.


---


Sample Dataset:
```
GATATATGCATATACTT
ATAT
```
Sample Output:
```
[1, 3, 9]
```

In [None]:
# Given a string, 'seq', and a substring, 'motif':
seq = 'GATATATGCATATACTT'
motif = 'ATAT'

# for i, nt in enumerate(seq):
#   print(i, '-->', nt, '-->', seq[i:i+4])

def find_motifs(seq, motif):
  indices = []
  for i, nt in enumerate(seq):
    if seq[i:i+len(motif)] == motif:
      indices.append(i)
  return indices

test = find_motifs(seq, motif)
print(test)

[1, 3, 9]
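
The result above uses Python's 0-based indices. Rosalind counts positions from 1, and its own sample output for this dataset is `2 4 10`. The variant below is a sketch of one way to produce that format; the name `find_motifs_one_based` is hypothetical. It also stops the loop early, since a window starting in the last `len(motif) - 1` characters is too short to match.

```python
def find_motifs_one_based(seq, motif):
  positions = []
  # Only start positions that leave room for a full-length motif are checked
  for i in range(len(seq) - len(motif) + 1):
    if seq[i:i+len(motif)] == motif:
      positions.append(i + 1)  # shift from 0-based to 1-based
  return positions

result = find_motifs_one_based('GATATATGCATATACTT', 'ATAT')
print(*result)  # 2 4 10
```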


In [None]:
def double(x):
  return x*2


print([double(x) for x in range(10)])

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]


# Introduction to Regular Expressions (RegEx)

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.

A successful RegEx search returns a `Match` object containing information about the search and its result, such as the start and end indices of the match.
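
To make the `Match` object concrete, here is a minimal sketch using Python's built-in `re` module (rather than the third-party `regex` package used in the next cell). It looks only at the first hit.

```python
import re  # Python's built-in regular expression module

seq = 'GATATATGCATATACTT'

# re.search returns a Match object for the first hit, or None if there is no hit
match = re.search('ATAT', seq)

print(match.group())  # ATAT    (the text that matched)
print(match.start())  # 1       (index where the match begins)
print(match.end())    # 5       (index just after the match ends)
print(match.span())   # (1, 5)  (start and end together)
```

As with slices, the end index is exclusive, so `seq[match.start():match.end()]` gives back the matched text.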

In [None]:
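# 'regex' is a third-party package, not the built-in 're'; the 'overlapped' option used below is specific to it
import regex as re # 'import [module] as [x]' allows you to call the module by any name 'x'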

seq = 'GATATATGCATATACTT'
motif = 'ATAT'

# m = re.findall(motif, seq, overlapped=True)

# print(m)

matches = re.finditer(motif, seq, overlapped=True)

indices=[]
for match in matches:
  indices.append(match.start())

print(indices)
# It gets better, but wait until next week!

[1, 3, 9]
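
If the `regex` package is not available (it can usually be installed with `pip install regex`), a common workaround with the built-in `re` module is a zero-width lookahead. This is a sketch of an alternative technique, not the approach used in the cell above: the pattern `(?=ATAT)` matches the empty position just before each `ATAT`, so overlapping hits are not skipped.

```python
import re  # built-in module; it has no 'overlapped' option

seq = 'GATATATGCATATACTT'
motif = 'ATAT'

# Wrap the motif in a lookahead so each match consumes no characters
pattern = '(?=' + motif + ')'

indices = []
for match in re.finditer(pattern, seq):
  indices.append(match.start())

print(indices)  # [1, 3, 9]
```

Without the lookahead, `re.finditer('ATAT', seq)` would report only the non-overlapping hits at 1 and 9.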


# If we have time - List Comprehensions

List comprehensions provide a concise way to create lists. 

A list comprehension consists of brackets containing an expression followed by a `for` clause, then zero or more `for` or `if` clauses. The expression can be anything, so you can put all kinds of objects in a list.

The result will be a new list resulting from evaluating the expression in the
context of the `for` and `if` clauses which follow it. 

The list comprehension always returns a result list.


---



Basically, this:
```
new_list = []
for i in old_list:
    if filter(i):
        new_list.append(expression(i))
```

Becomes this:
```
new_list = [expression(i) for i in old_list if filter(i)]
```
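
As a concrete sketch of that pattern, here are the two forms side by side for the earlier task of finding every `'A'` in `'GATTACA'`. The list names are chosen for illustration.

```python
seq = 'GATTACA'

# Loop version
a_positions = []
for i, nt in enumerate(seq):
  if nt == 'A':
    a_positions.append(i)

# Equivalent list comprehension: expression, then 'for' clause, then 'if' filter
a_positions_lc = [i for i, nt in enumerate(seq) if nt == 'A']

print(a_positions)     # [1, 4, 6]
print(a_positions_lc)  # [1, 4, 6]
```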

In [None]:
import regex as re

seq = 'GATATATGCATATACTT'
motif = 'ATAT'

matches = re.finditer(motif, seq, overlapped=True)

# Go on then... You thought this code was gonna write itself?
indices = [match.start() for match in matches]

# Example with find_motifs()
def find_motifs(seq, motif):
  indices = [i for i, nt in enumerate(seq) if seq[i:i+len(motif)] == motif]
  return indices


# Homework

* Read the dsx sequence from `AGAP004050.txt` in the `_data` folder
* Write a function to find all instances of `'GG'` and their corresponding indices
* Return a list of tuples containing:


1.   Index of each `'GG'`
2.   Three-nucleotide PAM (i.e. `'GG'` together with the nucleotide preceding it)

Sample Output:
```
[(45, 'AGG'), (87, 'TGG'), (152, 'GGG') ...]
```

# Extra Points

* Include PAMs on the negative strand (add a third element to your tuple)


---


```
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

with open('/content/drive/My Drive/Coding club/_data/11_AGAP004050.txt') as file:
  for seq in file:
    dsx_seq = seq.strip()  # keeps the last line read, without its trailing newline
```