In [2]:
import numpy as np
import numpy.linalg as la
from helper_funcs import generate_genomic_sequences

# Genome classification

In computational biology, a typical problem is:

- given a small strand of DNA, determine whether that DNA belongs to a known, mapped organism.

DNA consists of long chains of base pairs.
Four nucleobases make up the DNA of every organism.
These four nucleobases are A, T, C, and G.

Here is an example of what a short snippet of a single DNA sequence could look like:

```python
genome_seq = 'ATCGATTGAGCTCTAGCG'
```

Now, suppose we have sequenced some DNA from an unknown organism.
```python
small_sample = 'ATCG'
```

Does the small sample of DNA belong to the same organism as the provided genomic sequence?

We are going to be using our knowledge of **norms** to answer this question. But first, we need to convert the strand of DNA into an array of numbers:

### Write a function to convert a DNA sequence to an array of numbers

For this conversion, you are going to make the following assumption:

A = 1, T = 2, C = 3, G = 4.

In [5]:
#grade (enter your code in this cell - DO NOT DELETE THIS LINE) 
def char_to_num(char):
    lookup = {
        'A' : 1,
        'T' : 2,
        'C' : 3,
        'G' : 4
    }
    return lookup[char]

def seq_to_array(dna):
    # Converts the string dna into a numpy array
    # where each element corresponds to the numeric value of a nucleobase 
    # dna: string
    # numeric_dna: 1d numpy array of type integer
    
    # complete the function
    numeric_dna = [char_to_num(x) for x in dna]
    return np.array(numeric_dna)

Test your function by using it on the sequence `genome_seq`.

In [6]:
genome_seq = 'ATCGATTGAGCTCTAGCG'
genome_numeric = seq_to_array(genome_seq)
print(genome_numeric)

[1 2 3 4 1 2 2 4 1 4 3 2 3 2 1 4 3 4]


Now that we have the numpy array, we can use vector norms to determine whether a small sample of DNA belongs to a larger known DNA sequence.

Suppose that $v_1$ is a contiguous subsequence of the larger known DNA sequence, with the same length as $v_2$, the small unknown sample of DNA.

The small unknown sample belongs to the DNA sequence if we can find a $v_1$ such that:

$$
\|v_1 - v_2\|_1 = 0.
$$
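
As a small worked illustration (the vectors are chosen for this example and use the encoding A = 1, T = 2, C = 3, G = 4), recall that the 1-norm sums the absolute values of the entries, $\|x\|_1 = \sum_i |x_i|$. Comparing the sample `ATCG` with the window `ATTG` gives a nonzero norm, so that window is not a match:

$$
\|v_1 - v_2\|_1 = \|(1,2,2,4) - (1,2,3,4)\|_1 = |0| + |0| + |-1| + |0| = 1 \neq 0.
$$

Only a window that agrees with the sample entry by entry gives a difference of exactly zero.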


In this example, we are trying to find a match by comparing a DNA sample with the known sequences for a group of animals. We give you the list of animals `animals_list`, the dictionary `animal_dna`, which maps each animal in the list to its DNA sequence, and the smaller DNA sample `unknown_dna`.

In [7]:
# generate inputs for students
animals_list = ['dog', 'bear', 'giraffe', 'tiger']
animal_dna, unknown_dna = generate_genomic_sequences(animals_list)
print(unknown_dna)
print(animal_dna)

CTCGCATGCGCTCTCG
{'dog': 'ATCGATTGAGCTCTAGCCAGCTAGGAACGCAACTAGATTGATCGAGCTATCTAGTTATCTCTATCCAGCTAGGAACGCAAATCGATTGAGCTCTAGTTCGTAAGCAATGTAATTCGTAAGCAATGTAAATCTAGTTATCTCTATATCTAGTTATCTCTATCCAGCTAGGAACGCAAATCTAGTTATCTCTATCTCGCATGCGCTCTCGCCAGCTAGGAACGCAAATCGATTGAGCTCTAGCTAGATTGATCGAGCT', 'bear': 'ATCGATTGAGCTCTAGTTCGTAAGCAATGTAAATCGATTGAGCTCTAGTTCGTAAGCAATGTAATTCGTAAGCAATGTAACCAGCTAGGAACGCAAATCGATTGAGCTCTAGATCGATTGAGCTCTAGATCTAGTTATCTCTATATCTAGTTATCTCTATATCTAGTTATCTCTATATCTAGTTATCTCTATTTCGTAAGCAATGTAAATCTAGTTATCTCTATCTAGATTGATCGAGCTTTCGTAAGCAATGTAA', 'giraffe': 'ATCTAGTTATCTCTATATCGATTGAGCTCTAGCTAGATTGATCGAGCTCTAGATTGATCGAGCTATCTAGTTATCTCTATATCTAGTTATCTCTATATCGATTGAGCTCTAGCCAGCTAGGAACGCAACTAGATTGATCGAGCTATCTAGTTATCTCTATCTAGATTGATCGAGCTCCAGCTAGGAACGCAAATCTAGTTATCTCTATATCGATTGAGCTCTAGCTAGATTGATCGAGCTATCTAGTTATCTCTAT', 'tiger': 'ATCGATTGAGCTCTAGCTAGATTGATCGAGCTATCGATTGAGCTCTAGCTAGATTGATCGAGCTCCAGCTAGGAACGCAATTCGTAAGCAATGTAACCAGCTAGGAACGCAAATCTAGTTATCTCTATATCGATTGAGCTCTAGATCTAGTTATCTCTATCCAGCT

Take a look at the code snippet below. It uses the function `find_the_match`, which is not yet defined (so you will get an error if you try to run it now!).

In [48]:
print('Trying to find a match for ', unknown_dna)
for animal in animal_dna:
    
    known_dna = animal_dna[animal]

    numeric_genome = seq_to_array(known_dna)
    numeric_sample = seq_to_array(unknown_dna)
    
    pos,diff = find_the_match(numeric_sample, numeric_genome)
    
    if pos >= 0:
        print('The sample DNA matches the sequence of the', animal, 'starting at position', pos)
        break
        
if pos < 0:  
    print('Could not find a match')
    


Trying to find a match for  CTCGCATGCGCTCTCG
The sample DNA matches the sequence of the dog starting at position 192


### Write the function `find_the_match`

The function uses the 1-norm to find the position in a DNA sequence that matches the sample DNA.

```python
def find_the_match(numeric_sample, numeric_genome):
    # your code here
    return match_pos, diff_arr
```

The function takes the 1d numpy arrays that were converted from the DNA strings, and returns 2 outputs:
- an integer <code>match_pos</code>, which is the position in the DNA sequence where the match starts (recall that Python indexing starts at zero). If no match is found, return -1.

- a numpy array <code>diff_arr</code> that contains the differences (computed using the 1-norm) between the sample DNA <code>numeric_sample</code> and every contiguous subsequence of <code>numeric_genome</code> with the same length as the sample. For example, <code>diff_arr[1]</code> is the difference between <code>numeric_sample</code> and <code>numeric_genome[1:1+len(numeric_sample)]</code>.
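
To make the expected outputs concrete, the sketch below computes the entries of `diff_arr` by hand for a tiny input. The arrays and the names `expected_diff_arr` and `expected_match_pos` are illustrative assumptions, not part of the assignment.

```python
import numpy as np
import numpy.linalg as la

# Illustrative inputs (not generated by the helper functions)
numeric_sample = np.array([1, 2, 3])
numeric_genome = np.array([1, 2, 3, 4, 5])
n = len(numeric_sample)

# The genome has three windows of length n: [1 2 3], [2 3 4], [3 4 5]
diff_0 = la.norm(numeric_sample - numeric_genome[0:0 + n], 1)  # identical window
diff_1 = la.norm(numeric_sample - numeric_genome[1:1 + n], 1)  # each entry off by 1
diff_2 = la.norm(numeric_sample - numeric_genome[2:2 + n], 1)  # each entry off by 2

# What find_the_match should return for these inputs
expected_diff_arr = np.array([diff_0, diff_1, diff_2])
expected_match_pos = 0  # the first window has a difference of zero
print(expected_match_pos, expected_diff_arr)
```

Note that `diff_arr` has `len(numeric_genome) - len(numeric_sample) + 1` entries, one per window.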


Run the code snippet above, which uses your now-defined function, and check your results. You can also generate other input values (for example, a different animal list) and re-run the code snippet.

In [49]:
#grade (enter your code in this cell - DO NOT DELETE THIS LINE) 
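def find_the_match(numeric_sample, numeric_genome):
    
    n = len(numeric_sample)
    diff_arr = []
    
    # 1-norm of the difference between the sample and every window of length n
    for start in range(0, len(numeric_genome) - n + 1):
        subgenome = numeric_genome[start : start + n]
        diff_arr.append(la.norm(numeric_sample - subgenome, 1))
    diff_arr = np.array(diff_arr)
    
    # No window fits if the sample is longer than the genome
    if diff_arr.size == 0:
        return -1, diff_arr
    
    # The smallest difference counts as a match only if it is exactly zero
    match_pos = np.argmin(diff_arr)
    if diff_arr[match_pos] != 0:
        match_pos = -1
    
    return match_pos, diff_arr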

# numeric_sample = np.array([1, 2, 3])
# numeric_genome = np.array([1, 2, 3, 4, 5])
# pos, arr = find_the_match(numeric_sample, numeric_genome)
# print(pos, arr)
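
The loop over start positions can also be written without an explicit Python loop. The following is a minimal sketch of one alternative, assuming NumPy 1.20 or later (for `sliding_window_view`); the function name `find_the_match_vectorized` is hypothetical and is not part of the assignment.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def find_the_match_vectorized(numeric_sample, numeric_genome):
    # Hypothetical alternative to find_the_match; same inputs and outputs
    n = len(numeric_sample)
    # No window exists for an empty sample or a sample longer than the genome
    if n == 0 or n > len(numeric_genome):
        return -1, np.array([])

    # Each row is one window of length n: shape (len(genome) - n + 1, n)
    windows = sliding_window_view(numeric_genome, n)
    # The 1-norm of each row difference is the sum of absolute values
    diff_arr = np.abs(windows - numeric_sample).sum(axis=1).astype(float)

    # First position with a difference of exactly zero, or -1 if there is none
    zeros = np.flatnonzero(diff_arr == 0)
    match_pos = int(zeros[0]) if zeros.size else -1
    return match_pos, diff_arr

# Illustrative inputs
sample = np.array([3, 4, 1])
genome = np.array([1, 2, 3, 4, 1, 2])
pos, diffs = find_the_match_vectorized(sample, genome)
print(pos, diffs)
```

For genomes of a few hundred bases the explicit loop is fast enough; the windowed version mainly shows how the same 1-norm comparison maps onto array operations.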