# Identifying Butterfly Species

In this notebook you will use your transcription and translation functions to analyze the same gene from four butterflies and work out which pairs belong to the same species.

First, copy and paste your transcription and translation functions here:

In [1]:
# Paste your transcription function here
def transcribe(DNAseq):
    # Build the complementary RNA strand one base at a time (A pairs with U in RNA)
    bpDict = {"C":"G", "G":"C", "A":"U", "T":"A"}
    RNA = ""
    for base in DNAseq:
        RNA += bpDict[base]
    return RNA
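
A quick way to confirm that the function behaves as expected is to run it on a short sequence you can check by hand. The sketch below repeats the transcription function so that it runs on its own; the nine-base test sequence is made up for illustration. Note that the function returns the complementary RNA, so it treats its input as the template strand.

```python
def transcribe(DNAseq):
    # Same logic as the cell above: complement each DNA base into RNA
    bpDict = {"C": "G", "G": "C", "A": "U", "T": "A"}
    RNA = ""
    for base in DNAseq:
        RNA += bpDict[base]
    return RNA

test_dna = "TACAAAATT"  # made-up sequence, not from the butterfly files
test_rna = transcribe(test_dna)
print(test_rna)  # AUGUUUUAA
```

Read three bases at a time, this RNA is AUG (M), UUU (F), and UAA (stop).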

In [2]:
# Paste your translation function here
def translate(rnaseq):
    # Stop codons (UAA, UAG, UGA) map to an empty string, so they add nothing to the protein
    ProteinDict = {"UUU":"F", "UUC":"F", "UUA":"L", "UUG":"L", "CUU":"L", "CUC":"L", "CUA":"L", "CUG":"L",
                   "AUU":"I", "AUC":"I", "AUA":"I", "AUG":"M", "GUU":"V", "GUC":"V", "GUA":"V", "GUG":"V",
                   "UCU":"S", "UCC":"S", "UCA":"S", "UCG":"S", "CCU":"P", "CCC":"P", "CCA":"P", "CCG":"P",
                   "ACU":"T", "ACC":"T", "ACA":"T", "ACG":"T", "GCU":"A", "GCC":"A", "GCA":"A", "GCG":"A",
                   "UAU":"Y", "UAC":"Y", "UAA":"", "UAG":"", "CAU":"H", "CAC":"H", "CAA":"Q", "CAG":"Q",
                   "AAU":"N", "AAC":"N", "AAA":"K", "AAG":"K", "GAU":"D", "GAC":"D", "GAA":"E", "GAG":"E",
                   "UGU":"C", "UGC":"C", "UGA":"", "UGG":"W", "CGU":"R", "CGC":"R", "CGA":"R", "CGG":"R",
                   "AGU":"S", "AGC":"S", "AGA":"R", "AGG":"R", "GGU":"G", "GGC":"G", "GGA":"G", "GGG":"G"}
    codon = ""
    protein = ""
    for base in rnaseq:
        codon += base
        if len(codon) == 3:
            protein += ProteinDict[codon]
            codon = ""
    return protein

Next, open the four files, read them as strings, and assign them to variables:

In [3]:
# You will need the open() function and the .read() method
DNAseqA = open("Butterfly_A_DNA.txt").read()
DNAseqB = open("Butterfly_B_DNA.txt").read() 
DNAseqC = open("Butterfly_C_DNA.txt").read() 
DNAseqD = open("Butterfly_D_DNA.txt").read()
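
If a file ends with a newline or contains line breaks, `transcribe()` will raise a `KeyError`, because those characters are not in its base-pairing dictionary. One way to guard against that is sketched below with a hypothetical helper named `read_sequence` (not part of the original notebook). It closes the file with a `with` block and removes all whitespace. Converting to uppercase is an extra assumption about how the files might be formatted.

```python
def read_sequence(path):
    """Read a sequence file and return it as one uppercase string without whitespace."""
    with open(path) as handle:  # the file is closed when the block ends
        text = handle.read()
    # split() with no arguments breaks on any whitespace, including newlines
    return "".join(text.split()).upper()
```

With this helper, a call such as `DNAseqA = read_sequence("Butterfly_A_DNA.txt")` would replace the corresponding line above.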

Next, find the amino acid sequence for each butterfly and store it in a variable.

In [7]:
# Hint: You will need to use your functions
protA = translate(transcribe(DNAseqA))
protB = translate(transcribe(DNAseqB))
protC = translate(transcribe(DNAseqC))
protD = translate(transcribe(DNAseqD))

Are there any differences in the protein sequences? 
Create a function that compares two sequences and prints out the number of differences:

In [9]:
# Hint: you will need the enumerate() function, which works in for loops over strings or lists.
# Remember to give the for loop two variables: one for the index and one for the character.

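def diffNum(seq1, seq2):
    # Count the positions at which the two sequences differ.
    # Assumes seq2 is at least as long as seq1; otherwise seq2[i] raises an IndexError.
    count = 0  # named count so it does not shadow the function name
    for i, aa in enumerate(seq1):
        if seq2[i] != aa:
            count += 1
    print("There are", count, "differences")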
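
Because `diffNum()` only prints its result, the count cannot be reused later. A possible variant, shown here under the hypothetical name `diff_positions`, returns the 1-based positions that differ. It stops at the end of the shorter sequence rather than raising an `IndexError`. The two test strings are made up.

```python
def diff_positions(seq1, seq2):
    """Return the 1-based positions at which seq1 and seq2 differ."""
    positions = []
    # zip() pairs up characters and stops when the shorter sequence ends
    for i, (a, b) in enumerate(zip(seq1, seq2), start=1):
        if a != b:
            positions.append(i)
    return positions

print(diff_positions("ACGTAC", "ACCTAG"))  # [3, 6]
```

The number of differences is then `len(diff_positions(seq1, seq2))`.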

Run your function for each pair of butterfly protein sequences:

In [11]:
# Hint: You will need to run the function 6 times
diffNum(protA, protB)
diffNum(protA, protC)
diffNum(protA, protD)
diffNum(protB, protC)
diffNum(protB, protD)
diffNum(protC, protD)

There are 0 differences
There are 0 differences
There are 0 differences
There are 0 differences
There are 0 differences
There are 0 differences


If you got 0 differences for all 6, that's correct! We'll have to look elsewhere for the differences in the butterfly genes.
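
Six calls are needed because four sequences form six unordered pairs. Rather than typing each call, the pairs can be generated with `itertools.combinations`. This is a sketch with made-up toy sequences standing in for the real data, and `count_differences` is a hypothetical helper that returns the count instead of printing it.

```python
from itertools import combinations

def count_differences(seq1, seq2):
    """Return how many aligned positions differ between two sequences."""
    return sum(1 for a, b in zip(seq1, seq2) if a != b)

# Toy sequences for illustration only, not the butterfly data
toy = {"A": "ATGC", "B": "ATGC", "C": "ATGA", "D": "TTGA"}

pair_counts = {}
for (name1, seq1), (name2, seq2) in combinations(toy.items(), 2):
    pair_counts[(name1, name2)] = count_differences(seq1, seq2)
    print(name1, "vs", name2, ":", pair_counts[(name1, name2)], "differences")
```

Keeping the counts in a dictionary makes it easy to look for the pair with the fewest differences afterwards.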

Try running your function for the DNA sequences instead of the protein sequences:

In [12]:
diffNum(DNAseqA, DNAseqB)
diffNum(DNAseqA, DNAseqC)
diffNum(DNAseqA, DNAseqD)
diffNum(DNAseqB, DNAseqC)
diffNum(DNAseqB, DNAseqD)
diffNum(DNAseqC, DNAseqD)

There are 15 differences
There are 4 differences
There are 20 differences
There are 15 differences
There are 13 differences
There are 20 differences


Based on the number of differences, do you have a guess about which two butterflies are the same species?

## Exploring SNPs
These DNA differences are known variations between species. Called single nucleotide polymorphisms (SNPs), they are single-base differences at specific positions in the sequence, and they can be used to identify a particular species.

Using the two dictionaries below, create a function that takes a DNA sequence and identifies the butterfly species:

In [32]:
# melpomene and erato are two Heliconius butterfly species
# Each dictionary maps a DNA base position (counted from 1) to the base found there in that species

melpomene = {36:"A", 156:"C", 177:"T", 213:"C", 276:"A", 291:"A", 315:"A", 333:"G", 405:"C"}
erato = {36:"G", 156:"T", 177:"C", 213:"T", 276:"C", 291:"G", 315:"C", 333:"A", 405:"T"}

def speciesID(seq):
    for key in melpomene:
        if seq[key-1] == melpomene[key]:
            print("According to base",key,"this butterfly is a melpomene")
        elif seq[key-1] == erato[key]:
            print("According to base",key,"this butterfly is an erato")
        else:
            print("I do not know what species this butterfly is")
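
`speciesID()` prints one line per SNP, so nine lines per butterfly. A variant that tallies the matches and returns a single answer may be easier to read. The sketch below uses the two dictionaries from the cell above. The name `species_vote`, the majority rule, and the synthetic test sequence are assumptions made for illustration.

```python
melpomene = {36: "A", 156: "C", 177: "T", 213: "C", 276: "A", 291: "A", 315: "A", 333: "G", 405: "C"}
erato = {36: "G", 156: "T", 177: "C", 213: "T", 276: "C", 291: "G", 315: "C", 333: "A", 405: "T"}

def species_vote(seq):
    """Return (species, votes), where votes counts the SNPs matching each species."""
    votes = {"melpomene": 0, "erato": 0}
    for pos in melpomene:
        base = seq[pos - 1]  # SNP positions are 1-based, Python indexes from 0
        if base == melpomene[pos]:
            votes["melpomene"] += 1
        elif base == erato[pos]:
            votes["erato"] += 1
    if votes["melpomene"] > votes["erato"]:
        return "melpomene", votes
    if votes["erato"] > votes["melpomene"]:
        return "erato", votes
    return "unknown", votes

# Synthetic sequence: 405 placeholder bases with the erato base written in at each SNP
bases = ["N"] * 405
for pos, base in erato.items():
    bases[pos - 1] = base
print(species_vote("".join(bases)))  # ('erato', {'melpomene': 0, 'erato': 9})
```

Returning the vote counts alongside the species name shows how strongly the SNPs agree with each other.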

Call your function for all four butterfly DNA sequences:

In [33]:
speciesID(DNAseqA)
speciesID(DNAseqB)
speciesID(DNAseqC)
speciesID(DNAseqD)

According to base 177 this butterfly is a melpomene
According to base 291 this butterfly is a melpomene
According to base 36 this butterfly is a melpomene
According to base 213 this butterfly is a melpomene
According to base 276 this butterfly is a melpomene
According to base 315 this butterfly is a melpomene
According to base 156 this butterfly is a melpomene
According to base 333 this butterfly is a melpomene
According to base 405 this butterfly is a melpomene
According to base 177 this butterfly is an erato
According to base 291 this butterfly is an erato
According to base 36 this butterfly is an erato
According to base 213 this butterfly is an erato
According to base 276 this butterfly is an erato
According to base 315 this butterfly is an erato
According to base 156 this butterfly is an erato
According to base 333 this butterfly is an erato
According to base 405 this butterfly is an erato
According to base 177 this butterfly is a melpomene
According to base 291 this butterfly is a

## Coding Challenge
Why do you think these changes to the DNA sequences don't have any effect on the protein sequence?

Write a function that tests your hypothesis:
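
One hypothesis worth testing is that every SNP turns a codon into another codon for the same amino acid (a synonymous change). A first hint is that all nine SNP positions in the dictionaries above are multiples of 3. Each one therefore falls on the third base of a codon when the reading frame starts at base 1, as `translate()` assumes. The sketch below is one possible approach, not the only answer. It rebuilds the standard codon table compactly, with stop codons mapped to an empty string as in `ProteinDict`. The name `synonymous_report` is hypothetical and the short test sequences are made up.

```python
BASES = "UCAG"
# Standard genetic code in UCAG order; "*" marks a stop codon
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k].replace("*", "")
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
PAIRING = {"C": "G", "G": "C", "A": "U", "T": "A"}  # same pairing as transcribe()

def synonymous_report(dna1, dna2):
    """List codons that differ between two DNA sequences.

    Each entry is (codon number, RNA codon 1, RNA codon 2, same amino acid?).
    """
    report = []
    shortest = min(len(dna1), len(dna2))
    for start in range(0, shortest - 2, 3):  # complete codons only
        codon1 = "".join(PAIRING[b] for b in dna1[start:start + 3])
        codon2 = "".join(PAIRING[b] for b in dna2[start:start + 3])
        if codon1 != codon2:
            same = CODON_TABLE[codon1] == CODON_TABLE[codon2]
            report.append((start // 3 + 1, codon1, codon2, same))
    return report

# Are all SNP positions on the third base of a codon?
snp_positions = [36, 156, 177, 213, 276, 291, 315, 333, 405]
print(all(pos % 3 == 0 for pos in snp_positions))  # True

# Made-up sequences that differ only at base 6
print(synonymous_report("TACAAAATT", "TACAAGATT"))  # [(2, 'UUU', 'UUC', True)]
```

If every entry in the report for two butterflies ends in `True`, the DNA differences are all synonymous, which would explain why the protein sequences are identical.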