# Identifying Butterfly Species

In this notebook you will use your transcription and translation functions to analyze the same gene from four butterflies and find out which of them belong to the same species!

First, copy and paste your transcription and translation functions here:

In [1]:
# Paste your transcription function here

def transcribe(DNAseq):
    
    RNA = ""
    
    for base in DNAseq:
        if base == "A":
            RNA += "U"
        elif base == "T":
            RNA += "A"
        elif base == "G":
            RNA += "C"
        elif base == "C":
            RNA += "G"
        else:
            # Skip anything that is not A, T, G, or C (for example, a newline character)
            continue
    return RNA
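
The same base-by-base mapping can also be written as a dictionary lookup. The following is only a sketch of an alternative; the names `RNA_COMPLEMENT` and `transcribe_with_dict` are made up for illustration and are not part of the notebook.

```python
# Hypothetical alternative to transcribe(): look each base up in a dictionary
RNA_COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe_with_dict(dna_seq):
    # Keep only A, T, G and C; anything else (such as a newline) is skipped
    return "".join(RNA_COMPLEMENT[base] for base in dna_seq if base in RNA_COMPLEMENT)

print(transcribe_with_dict("TACGGA\n"))  # prints AUGCCU
```

Both versions should give the same RNA string for the same input, so either one can be pasted into the later cells.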

In [16]:
# Paste your translation function here

def translate(RNAseq):
    protein = ""
    codon = ""

    codon_dict = {"UUU":"F","UUC":"F","UUA":"L","UUG":"L", "CUU":"L","CUC":"L", 
             "CUA":"L","CUG":"L","AUU":"I","AUC":"I","AUA":"I","AUG":"M",
             "GUU":"V","GUC":"V","GUA":"V","GUG":"V","UCU":"S","UCC":"S",
             "UCA":"S","UCG":"S","CCU":"P","CCC":"P","CCA":"P","CCG":"P",
             "ACU":"T","ACC":"T","ACA":"T","ACG":"T","GCU":"A","GCC":"A",
             "GCA":"A","GCG":"A","UAU":"Y","UAC":"Y","UAA":"STOP","UAG":"STOP",
             "UGA":"STOP","CAU":"H","CAC":"H","CAA":"Q","CAG":"Q","AAU":"N",
             "AAC":"N","AAA":"K","AAG":"K","GAU":"D","GAC":"D","GAA":"E",
             "GAG":"E","UGU":"C","UGC":"C","UGG":"W","CGU":"R","CGC":"R",
             "CGA":"R","CGG":"R","AGU":"S","AGC":"S","AGA":"R","AGG":"R",
             "GGU":"G","GGC":"G","GGA":"G","GGG":"G"}
    
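    for i in RNAseq:
        if len(codon) == 3:
            # A full codon has been collected: translate it,
            # then start the next codon with the current base
            protein += codon_dict[codon]
            codon = ""
            codon += i
        else:
            codon += i
    # Note: a codon is only translated once the base after it arrives,
    # so a complete codon at the very end of RNAseq is never translated
    return protein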
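
The loop above translates a codon only when the next base arrives, so a complete codon at the very end of the RNA is left out. A possible alternative, sketched here under a few assumptions, is to step through the string three bases at a time. The names `CODON_TABLE` and `translate_by_slicing` are hypothetical, the table is built from the compact form of the standard genetic code, and stop codons are written as `*` rather than `"STOP"`.

```python
# Build the standard codon table from its compact form (bases in U, C, A, G order)
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate_by_slicing(rna_seq):
    # Hypothetical helper: read the RNA one whole codon (three bases) at a time
    protein = ""
    for start in range(0, len(rna_seq) - 2, 3):
        codon = rna_seq[start:start + 3]
        protein += CODON_TABLE[codon]
    return protein

print(translate_by_slicing("AUGUUUGGG"))  # prints MFG
```

Any one or two leftover bases at the end are ignored, because they do not form a whole codon.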

## Pair Coding Time!!! 
### Step 1: Make sure you know who the typing partner and the guiding partner are.
Open the four butterfly DNA files that have been shared on Slack, read them as strings, and assign them to variables:

In [4]:
# You will need the open() function and the .read() method
A = open("Butterfly_A_DNA.txt").read()
B = open("Butterfly_B_DNA.txt").read()
C = open("Butterfly_C_DNA.txt").read()
D = open("Butterfly_D_DNA.txt").read()
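
If you would like the files to be closed automatically after reading, a `with` block is one option. This is a minimal sketch: the helper name `read_sequence` and the throwaway demo file are assumptions for illustration, not part of the shared materials.

```python
import os
import tempfile

def read_sequence(path):
    # Hypothetical helper: read a whole file and strip surrounding whitespace
    with open(path) as handle:
        return handle.read().strip()

# Demonstration with a throwaway file; in the notebook you would pass
# a real file name such as "Butterfly_A_DNA.txt"
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo_DNA.txt")
    with open(demo_path, "w") as handle:
        handle.write("TACGGA\n")
    demo_seq = read_sequence(demo_path)

print(demo_seq)  # prints TACGGA
```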

Next, find the amino acid sequence for each butterfly and store it in a variable.

In [6]:
# Hint: You will need to use your functions

aRNA = transcribe(A)
pA = translate(aRNA)
print(pA)

print()

bRNA = transcribe(B)
pB = translate(bRNA)
print(pB)

print()

cRNA = transcribe(C)
pC = translate(cRNA)
print(pC)

print()

dRNA = transcribe(D)
pD = translate(dRNA)
print(pD)

RFQRSSHAKLQALWLEAHYQEAERLRGRPLGPVDKYRVRKKFPLPRTIWDGEQKTHCFKERTRSLLREWYLQDPYPNPTKKRELAAATGLTPTQVGNWFKNRRQRDRAAAAKNRSAVLGRGFASSSTYDEDSADSEINVDE

RFQRSSHAKLQALWLEAHYQEAERLRGRPLGPVDKYRVRKKFPLPRTIWDGEQKTHCFKERTRSLLREWYLQDPYPNPTKKRELAAATGLTPTQVGNWFKNRRQRDRAAAAKNRSAVLGRGFASSSTYDEDSADSEINVDE

RFQRSSHAKLQALWLEAHYQEAERLRGRPLGPVDKYRVRKKFPLPRTIWDGEQKTHCFKERTRSLLREWYLQDPYPNPTKKRELAAATGLTPTQVGNWFKNRRQRDRAAAAKNRSAVLGRGFASSSTYDEDSADSEINVDE

RFQRSSHAKLQALWLEAHYQEAERLRGRPLGPVDKYRVRKKFPLPRTIWDGEQKTHCFKERTRSLLREWYLQDPYPNPTKKRELAAATGLTPTQVGNWFKNRRQRDRAAAAKNRSAVLGRGFASSSTYDEDSADSEINVDE


### Step 2: Swap the typing and guiding partners!
How can we find out how many differences there are between each pair of protein sequences? 
Create a function that compares two sequences and returns the number of differences:

In [10]:
# Hint: you will need the enumerate() function. You can use it in "for loops" that loop through strings or lists.
# Remember, you must specify two variables in your "for loop": one for the index counter and one for the character.

# Example:
# my_list = [0, 1, 3, 5, 6, 7]
# for c, i in enumerate(my_list):
#     print(str(i) + " is element " + str(c))


    
def diff(seq1, seq2):
    # Count the positions at which the two sequences differ
    # (assumes seq2 is at least as long as seq1)
    count = 0

    for c, i in enumerate(seq1):
        if i != seq2[c]:
            count += 1
    return count
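
The comparison can also be written with `zip()`, which pairs up the characters of the two sequences. This is a sketch of one possible variant, and `count_differences` is a made-up name. Unlike the function above, it also counts leftover positions when one sequence is longer than the other, which is a design choice rather than something the exercise asks for.

```python
def count_differences(seq1, seq2):
    # Hypothetical variant of diff(): compare the sequences position by position
    mismatches = sum(1 for a, b in zip(seq1, seq2) if a != b)
    # zip() stops at the shorter sequence, so count any leftover positions too
    return mismatches + abs(len(seq1) - len(seq2))

print(count_differences("GATTACA", "GACTATA"))  # prints 2
print(count_differences("GATT", "GA"))          # prints 2
```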
    

### Step 3: Swap the typing and guiding partners!

Run your function for each pair of butterfly protein sequences:

In [11]:
# Hint: You will need to run the function 6 times

print(diff(pA, pB))
print(diff(pA, pC))
print(diff(pA, pD))
print(diff(pB, pC))
print(diff(pB, pD))
print(diff(pC, pD))

0
0
0
0
0
0


If you got 0 differences for all 6, that's correct! We'll have to look elsewhere for the differences in the butterfly genes.

### Step 4: Swap the typing and guiding partners!

Try running your function on the DNA sequences instead of the protein sequences (*notice that you can use the same function for both kinds of sequences!*):

In [15]:
print("A and B")
print(diff(A, B))
print()

print("A and C")
print(diff(A, C))
print()

print("A and D")
print(diff(A, D))
print()

print("B and C")
print(diff(B, C))
print()

print("B and D")
print(diff(B, D))
print()

print("C and D")
print(diff(C, D))

A and B
15

A and C
4

A and D
20

B and C
15

B and D
13

C and D
20


Based on the number of differences, do you have a guess about which two butterflies are the same species?
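
Instead of typing all six comparisons by hand, the pairs can be generated in a loop. The sketch below uses `itertools.combinations` on made-up toy sequences (not the real butterfly data) and a compact stand-in for your `diff` function, then lists the pairs from most to least similar.

```python
from itertools import combinations

# Made-up toy sequences, NOT the real butterfly data
toy_seqs = {"A": "GATTACA", "B": "GACTATA", "C": "GATTACC", "D": "CACTATA"}

def diff(seq1, seq2):
    # Compact stand-in for the diff() function written in Step 2
    return sum(1 for c, i in enumerate(seq1) if i != seq2[c])

# Every unordered pair of names: 4 sequences give 6 pairs
distances = {}
for name1, name2 in combinations(toy_seqs, 2):
    distances[(name1, name2)] = diff(toy_seqs[name1], toy_seqs[name2])

# Print the pairs with the fewest differences first
for pair, count in sorted(distances.items(), key=lambda item: item[1]):
    print(pair[0], "and", pair[1], ":", count)
```

Pairs with the fewest differences appear at the top of the list.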

## Exploring SNPs
### Step 5: Swap the typing and guiding partners!
These DNA differences are known variations between the species. They are called "single nucleotide polymorphisms", or SNPs, and can be used to identify a particular species.

Using the two dictionaries below, create a function that takes a DNA sequence and a basepair position and returns the butterfly species:

In [17]:
# melpomene and erato are two Heliconius butterfly species
# These dictionaries have the DNA base-pair position as the key and the chemical base as the value

melpomene = {36:"A", 156:"C", 177:"T", 213:"C", 276:"A", 291:"A", 315:"A", 333:"G", 405:"C"}
erato = {36:"G", 156:"T", 177:"C", 213:"T", 276:"C", 291:"G", 315:"C", 333:"A", 405:"T"}

# Create your function here:

def snp(seq, basepair):
    # Uses the melpomene and erato dictionaries defined above.
    # basepair is a 1-based position, so the matching string index is basepair - 1.
    if seq[basepair-1] == melpomene[basepair]:
        return "Melpomene"
    elif seq[basepair-1] == erato[basepair]:
        return "Erato"
    else:
        return "Unknown"

Call your function for the following combinations and print out the results:
- Butterfly A, bp 36
- Butterfly B, bp 177
- Butterfly C, bp 291
- Butterfly D, bp 405

In [20]:
print(snp(A, 36))

print(snp(B, 177))

print(snp(C, 291))

print(snp(D, 405))


Melpomene
Erato
Melpomene
Erato
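
A single SNP gives one piece of evidence. One way to extend the idea, sketched here with a hypothetical helper called `snp_votes`, is to check every position in the dictionaries and count how many match each species. The test sequence is made up: 405 placeholder characters with the erato base written at each SNP position.

```python
melpomene = {36:"A", 156:"C", 177:"T", 213:"C", 276:"A", 291:"A", 315:"A", 333:"G", 405:"C"}
erato = {36:"G", 156:"T", 177:"C", 213:"T", 276:"C", 291:"G", 315:"C", 333:"A", 405:"T"}

def snp_votes(seq):
    # Hypothetical helper: count how many SNP positions match each species
    votes = {"Melpomene": 0, "Erato": 0, "Unknown": 0}
    for basepair in melpomene:
        base = seq[basepair - 1]  # positions are 1-based, string indices are 0-based
        if base == melpomene[basepair]:
            votes["Melpomene"] += 1
        elif base == erato[basepair]:
            votes["Erato"] += 1
        else:
            votes["Unknown"] += 1
    return votes

# Made-up sequence, NOT real butterfly data
toy = ["N"] * 405
for basepair, base in erato.items():
    toy[basepair - 1] = base
toy_seq = "".join(toy)

print(snp_votes(toy_seq))  # prints {'Melpomene': 0, 'Erato': 9, 'Unknown': 0}
```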


## Coding Challenge
Why do you think these changes to the DNA sequences don't have any effect on the protein sequence?

Write a function that tests your hypothesis:
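
One possible starting point, offered as a sketch rather than the intended answer, is to ask where in its codon each SNP falls. The helper `codon_position` is a made-up name, and it assumes the reading frame starts at base pair 1, which is how `translate` reads the sequence.

```python
# SNP positions taken from the melpomene and erato dictionaries
snp_positions = [36, 156, 177, 213, 276, 291, 315, 333, 405]

def codon_position(basepair):
    # Position within the codon (1, 2 or 3), assuming the reading frame starts at base pair 1
    return (basepair - 1) % 3 + 1

for basepair in snp_positions:
    print(basepair, codon_position(basepair))
```

Compare the positions it prints with the codon dictionary in `translate` and see whether a pattern suggests a hypothesis you can test on the full sequences.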