### Summary
* in module 2, we learned about all the basic syntax and tools available in Python. Specifically, we learned about: 
    * how to break down a problem into 'sub problems'
    * data types (strings, lists, int etc) and their associated methods
    * I/O
    * loops, for and while
    * Conditions: if, elif, else
    
* in module 3, we are going to increase the sophistication level of the tools that we incorporate into our programs. 
    * By choosing the best tool for our situation, instead of just using the fundamental tools of Python, we will improve the efficiency of our code. 

## complex data types

#### Warning: We will refine just ONE example for most of this notebook.
    * this is an attempt to demonstrate how much more efficient and clear the solution to the problem becomes as we use slightly more complex tools to solve it. 

* We often have paired data sets in biology
    * Protein sequences and names
    * DNA restriction enzymes and motifs
    * Codons and their amino acid translation
    
* Dictionaries allow you to store and access paired data efficiently
    * Dictionaries allow you to look up a word and a definition 
    * Of course, we could see what this type of problem looks like with a list…
    
* let's say we want to count trinucleotides present in the following sequence: 

        dna= "AATGATCGATCGTACGCTGA"

(BTW: creating a 'sliding window' that slides across a haplotype sequence - usually three nucleotides at a time, not just one - is extremely common in genomics and bioinformatics, so this question isn't as 'theoretical' or manufactured for the sake of an example as it might initially seem)

-----------------------------

What does the following program do?

In [3]:
%%time

# %%time is a magic command that runs the entire cell ONCE and reports
# how long it took. (%%timeit is the magic that runs a cell many times
# and averages the result.)
# IT MUST BE THE VERY FIRST THING YOU TYPE IN A CELL - you can't
# even put comments above it --- I learned this the hard way. 
# "user" time is CPU time spent running our own Python code, "sys" time
# is CPU time the operating system spends on our behalf (for example,
# when writing output), and "Wall time" is the elapsed clock time. 

# For loops revisited
dna="AATGATCGATCGTACGCTGA"
# set up an empty list
all_counts=[]
# remember that we can loop through a list
all_bases=["A","C","G","T"]
# Best practice is to loop over the all_bases variable defined above,
# because then there is only ONE place in the code where you have to
# make changes. I am writing the list out in full in each loop here
# just to illustrate. You'll see the all_bases version in the next cell.  
for base1 in ["A","C","G","T"]:
    for base2 in ["A","C","G","T"]:
        for base3 in ["A","C","G","T"]:
            trinucleotide=base1+base2+base3
            count=dna.count(trinucleotide)
            print("count is "+str(count)+" for "+trinucleotide)
            all_counts.append(count)

print("*"*20)
# remember the idea of 'scope' of a variable? 
# In Python a for loop does NOT create a new scope, so the trinucleotide variable
# still exists after the loops finish. It holds the last value it was given ("TTT"). 
print(trinucleotide)
print("-"*20)
print(all_counts)
print("~"*20)
print(sum(all_counts))

count is 0 for AAA
count is 0 for AAC
count is 0 for AAG
count is 1 for AAT
count is 0 for ACA
count is 0 for ACC
count is 1 for ACG
count is 0 for ACT
count is 0 for AGA
count is 0 for AGC
count is 0 for AGG
count is 0 for AGT
count is 0 for ATA
count is 2 for ATC
count is 1 for ATG
count is 0 for ATT
count is 0 for CAA
count is 0 for CAC
count is 0 for CAG
count is 0 for CAT
count is 0 for CCA
count is 0 for CCC
count is 0 for CCG
count is 0 for CCT
count is 1 for CGA
count is 1 for CGC
count is 0 for CGG
count is 1 for CGT
count is 0 for CTA
count is 0 for CTC
count is 1 for CTG
count is 0 for CTT
count is 0 for GAA
count is 0 for GAC
count is 0 for GAG
count is 2 for GAT
count is 0 for GCA
count is 0 for GCC
count is 0 for GCG
count is 1 for GCT
count is 0 for GGA
count is 0 for GGC
count is 0 for GGG
count is 0 for GGT
count is 1 for GTA
count is 0 for GTC
count is 0 for GTG
count is 0 for GTT
count is 0 for TAA
count is 1 for TAC
count is 0 for TAG
count is 0 for TAT
count is 0 f

#### Problem: we want a count of the trinucleotides present in the following sequence: 
    
    dna="AATGATCGATCGTACGCTGA"

* Using list/for loop: 
    * Output is inefficient (takes a long time) and hard to read – especially if you want to know about the frequency of a specific trinucleotide (such as ATG)
    * We can try to do things like generate two lists (one of counts, one of trinucleotides) but this makes life more difficult because we have to keep track of TWO lists and make sure that the one-to-one correspondence of the elements remains the same. Let's modify the program above to do just that. 

In [49]:
%%time
dna="AATGATCGATCGTACGCTGA"
all_trinucleotides=[]
all_counts=[]
all_bases=["A","C","G","T"]
for base1 in all_bases:
    for base2 in all_bases:
        for base3 in all_bases:
            trinucleotide=base1+base2+base3
            count=dna.count(trinucleotide)
            #print("count is "+str(count)+" for "+trinucleotide)
            all_trinucleotides.append(trinucleotide)
            all_counts.append(count)

print(all_counts)
print("~"*20)
print(all_trinucleotides)

[0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0]
~~~~~~~~~~~~~~~~~~~~
['AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', 'AGA', 'AGC', 'AGG', 'AGT', 'ATA', 'ATC', 'ATG', 'ATT', 'CAA', 'CAC', 'CAG', 'CAT', 'CCA', 'CCC', 'CCG', 'CCT', 'CGA', 'CGC', 'CGG', 'CGT', 'CTA', 'CTC', 'CTG', 'CTT', 'GAA', 'GAC', 'GAG', 'GAT', 'GCA', 'GCC', 'GCG', 'GCT', 'GGA', 'GGC', 'GGG', 'GGT', 'GTA', 'GTC', 'GTG', 'GTT', 'TAA', 'TAC', 'TAG', 'TAT', 'TCA', 'TCC', 'TCG', 'TCT', 'TGA', 'TGC', 'TGG', 'TGT', 'TTA', 'TTC', 'TTG', 'TTT']
CPU times: user 243 µs, sys: 106 µs, total: 349 µs
Wall time: 260 µs


* our program has slightly more informative output but it is still difficult to search for the count of the particular codon (trinucleotide) that you want to know
* it is easy to imagine these two lists becoming separated or one list being modified but forgetting to modify the second list
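
Here is a small sketch of how that can go wrong. The list names and the three trinucleotides are a cut-down illustration (an assumption for brevity), not the full 64-entry lists from the cell above.

```python
# Hypothetical, cut-down version of the two parallel lists
trinucleotide_names = ["AAT", "ATC", "ATG"]
trinucleotide_counts = [1, 2, 1]

# Someone removes an entry from ONE list and forgets the other one
trinucleotide_names.remove("AAT")

# Every lookup after that point is shifted by one position
i = trinucleotide_names.index("ATC")
print("count for ATC is " + str(trinucleotide_counts[i]))  # prints 1, but the real count is 2
```

Nothing in Python warns you that the two lists no longer line up, which is why this kind of bug is so easy to miss.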

In [51]:
%%time
#Let's add yet another level of sophistication to this example:
dna="AATGATCGATCGTACGCTGA"
# Using only lists and for loops (things that we have already learned about)
# we can solve this problem, but the solution is inelegant
all_trinucleotides=[]
all_counts=[]
all_bases=["A","C","G","T"]
for base1 in all_bases:
    for base2 in all_bases:
        for base3 in all_bases:
            trinucleotide=base1+base2+base3
            count=dna.count(trinucleotide)
            # you could print off the trinucleotide count as it is iterated here
            # probably not the most efficient thing to do but still useful to 
            # understanding what the program is doing
            print("count is "+str(count)+" for "+trinucleotide)
            all_trinucleotides.append(trinucleotide)
            all_counts.append(count)

print("."*25)
print(all_counts)
print("~"*25)
print(all_trinucleotides)
#we can look up the counts for specific trinucleotides
i=all_trinucleotides.index("TGA")
c=all_counts[i]
print("*"*25)
print("Here are the counts for TGA which is located at position: "+str(i))
print("count for TGA is "+str(c))
# of course, you could replace the hard-coded "TGA" in the .index() call above with USER INPUT
# so that the user chooses which trinucleotide to look up.

.........................
[0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0]
~~~~~~~~~~~~~~~~~~~~~~~~~
['AAA', 'AAC', 'AAG', 'AAT', 'ACA', 'ACC', 'ACG', 'ACT', 'AGA', 'AGC', 'AGG', 'AGT', 'ATA', 'ATC', 'ATG', 'ATT', 'CAA', 'CAC', 'CAG', 'CAT', 'CCA', 'CCC', 'CCG', 'CCT', 'CGA', 'CGC', 'CGG', 'CGT', 'CTA', 'CTC', 'CTG', 'CTT', 'GAA', 'GAC', 'GAG', 'GAT', 'GCA', 'GCC', 'GCG', 'GCT', 'GGA', 'GGC', 'GGG', 'GGT', 'GTA', 'GTC', 'GTG', 'GTT', 'TAA', 'TAC', 'TAG', 'TAT', 'TCA', 'TCC', 'TCG', 'TCT', 'TGA', 'TGC', 'TGG', 'TGT', 'TTA', 'TTC', 'TTG', 'TTT']
*************************
Here are the counts for TGA which is located at position: 56
count for TGA is 2
CPU times: user 1.11 ms, sys: 871 µs, total: 1.98 ms
Wall time: 2.51 ms


There is still the problem of potential modification of one list but not the other since they aren't linked <-- A REALLY BIG PROBLEM. Python has a way to solve it, though!
 
## Dictionaries
* most important built-in Python data structure
* Similar to a list but with __keys instead of an index__
    * These are called key-value pairs
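    * looking up a value by its key is much faster than searching through a list (the trade-off is that a dictionary uses more memory than a list)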
    * non-biological common use: email addresses (keys) and passwords (values)
    
* sometimes called "associative arrays" (less commonly: "hash map")
* __UNORDERED__ (historically) <--important difference from a list
    * In older versions of Python, looping through a dictionary could give you the keys in a different order than the one in which you added them.
    * __SINCE PYTHON 3.7, DICTIONARIES ARE GUARANTEED TO BE 'INSERTION ORDERED' (CPython 3.6 already behaved this way). Unlike a list, you still cannot look up an entry by its position. You should still know that traditionally, dictionaries are 'unordered' compared to lists since it is possible that you will use an older version of Python.__ 

* __Use {} instead of []__
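
A quick sketch of what 'insertion ordered' does and does not mean, assuming Python 3.7 or later. The dictionary name and its contents are made up for illustration.

```python
# Hypothetical dictionary, filled in a deliberately non-alphabetical order
codon_names = {}
codon_names["ATG"] = "start"
codon_names["TGA"] = "stop"
codon_names["AAA"] = "lysine"

# Keys come back in the order they were added, not in alphabetical order
print(list(codon_names))  # ['ATG', 'TGA', 'AAA']

# Position-based lookup still does not work: 0 is treated as a KEY
try:
    print(codon_names[0])
except KeyError:
    print("0 is not a key in this dictionary")
```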

<div class="alert alert-block alert-warning">
Syntax:
			d={'key1':1,'key2':2,'key3':3}

* 'key1' is the key and points to 1, which is the value
* Common (non-biological) uses include login user name/email address

<div class="alert alert-block alert-warning">
Accessing/retrieving an element of a dictionary:

				d['key2'] would result in 2

## Creating a dictionary
* You can add values to a dictionary (like you can with a list)
* Adding a new key-value pair. 
    * Dictionaries come with the restriction that keys can only be certain data types: strings and numbers work (lists and other dictionaries do not), but values can be any type of data
    * Keys must be unique
    * Keys have to be IMMUTABLE - once set, they cannot be changed. We have only learned about two immutable types (strings and numbers), which limits our repertoire for keys at this point. We will learn about 'TUPLES' in the next lecture, though. 
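
To see the restriction on keys in action, here is a minimal sketch. The dictionary name `cut_sites` and its entries are sample data chosen for illustration.

```python
cut_sites = {}
cut_sites["EcoRI"] = "GAATTC"             # a string key works
cut_sites[42] = "a number key works too"  # so does a number

try:
    cut_sites[["Eco", "RI"]] = "GAATTC"   # a list is mutable, so it cannot be a key
except TypeError as error:
    print("TypeError: " + str(error))

# Values, on the other hand, can be anything, including a list
cut_sites["AvaII"] = ["GGACC", "GGTCC"]
print(len(cut_sites))  # 3
```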


<div class="alert alert-block alert-warning">
Example:

    #instantiate empty dictionary	
	
    menu ={}
    
    #add price to item
	
    menu["Chicken Alfredo"]=14.50
    
    # len(menu) can be used to determine the number of key-value pairs


Example: 

    Zoo_animals={"Unicorn":"Cotton Candy House","Sloth":"Rainforest Exhibit","Salmon":"Alaskan Estuary"}

‘Unicorn’ is the Key and ‘Cotton Candy House’ is the value (probably the location of the unicorn)

* We can use for to loop through the keys of a dictionary

<div class="alert alert-block alert-warning">
Example: 

    d = {'a': 'apple', 'b': 'berry', 'c': 'cherry'}
    for key in d:
        print(key +" "+ d[key])
The above would print out: 
    
    a apple
    
    b berry
    
    c cherry


In [None]:
#create an empty dictionary that associates restriction enzymes with their
# particular cut sequence

# remember that dictionaries use {} whereas lists use []
enzymes ={}

#assign key value pairs
# we could also have done this in the following way: 
#enzymes={"EcoRI":r"GAATTC","AvaII":r"GG(A|T)CC","BisI":r"GC[ATGC]GC"}

enzymes["EcoRI"]="GAATTC"
# sneaking in a tiny bit of REGEX for the next two lines:
enzymes["AvaII"]="GG(A|T)CC"
enzymes["BisI"]="GC[ATGC]GC"
#ask for the length of the dictionary
print(len(enzymes))
#print out all the key value pairs
print(enzymes)
for key in enzymes:
    print(key)

# Remove the "AvaII" entry. 
del(enzymes["AvaII"])
print(enzymes)
#let's add another restriction enzyme to the dictionary
enzymes["PamII"]="GRCG(C|T)C"

print(enzymes)

Back to our trinucleotide example...

In [5]:
%%time
# same as above but USING A DICTIONARY NOW! 
# --> waaaaay more efficient way to handle the problem  
dna="AATGATCGATCGTACGCTGA"
counts={}
all_bases=["A","C","G","T"]
# the nested for loops are the same as before, but each count is now
# stored in the dictionary under its trinucleotide key
for base1 in all_bases:
    for base2 in all_bases:
        for base3 in all_bases:
            trinucleotide=base1+base2+base3
            count=dna.count(trinucleotide)
            counts[trinucleotide]=count
print("-----------------------")
#Print out counts dictionary
print(counts)
print("-----------------------")
#print counts for a specific trinucleotide
print(counts["TGA"])

-----------------------
{'AAA': 0, 'AAC': 0, 'AAG': 0, 'AAT': 1, 'ACA': 0, 'ACC': 0, 'ACG': 1, 'ACT': 0, 'AGA': 0, 'AGC': 0, 'AGG': 0, 'AGT': 0, 'ATA': 0, 'ATC': 2, 'ATG': 1, 'ATT': 0, 'CAA': 0, 'CAC': 0, 'CAG': 0, 'CAT': 0, 'CCA': 0, 'CCC': 0, 'CCG': 0, 'CCT': 0, 'CGA': 1, 'CGC': 1, 'CGG': 0, 'CGT': 1, 'CTA': 0, 'CTC': 0, 'CTG': 1, 'CTT': 0, 'GAA': 0, 'GAC': 0, 'GAG': 0, 'GAT': 2, 'GCA': 0, 'GCC': 0, 'GCG': 0, 'GCT': 1, 'GGA': 0, 'GGC': 0, 'GGG': 0, 'GGT': 0, 'GTA': 1, 'GTC': 0, 'GTG': 0, 'GTT': 0, 'TAA': 0, 'TAC': 1, 'TAG': 0, 'TAT': 0, 'TCA': 0, 'TCC': 0, 'TCG': 2, 'TCT': 0, 'TGA': 2, 'TGC': 0, 'TGG': 0, 'TGT': 0, 'TTA': 0, 'TTC': 0, 'TTG': 0, 'TTT': 0}
-----------------------
2
CPU times: user 361 µs, sys: 119 µs, total: 480 µs
Wall time: 401 µs


In [54]:
%%time
# This code is the same as above but even more efficient since you have gotten rid of a
# lot of unnecessary entries in the dictionary and only kept the non-0 values 
dna="AATGATCGATCGTACGCTGA"
counts={}
all_bases=["A","C","G","T"]
for base1 in all_bases:
    for base2 in all_bases:
        for base3 in all_bases:
            trinucleotide=base1+base2+base3
            count=dna.count(trinucleotide)
            #add in an if statement so that we can get rid of all the trinucleotides
            #that have a count of 0
            if count >0:
                counts[trinucleotide]=count
print("*"*25)
print(counts)
print("-"*20)
#print counts for a specific trinucleotide 
print(counts["TGA"])
# What happens now? 
print(counts["AAA"])

*************************
{'AAT': 1, 'ACG': 1, 'ATC': 2, 'ATG': 1, 'CGA': 1, 'CGC': 1, 'CGT': 1, 'CTG': 1, 'GAT': 2, 'GCT': 1, 'GTA': 1, 'TAC': 1, 'TCG': 2, 'TGA': 2}
--------------------
2


KeyError: 'AAA'

Hmmmm. Well, well, well -  what happened here? 

We asked the dictionary for a key that it doesn't contain. The code raised an exception (a `KeyError`) since "AAA" was never added: its count was 0, so the if statement skipped it. 

We have a way to address that issue, though! 
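
One simple option, sketched here with a cut-down copy of the dictionary above (an assumption for brevity), is to ask whether the key is present before looking it up. The `in` operator checks the keys of a dictionary.

```python
# Cut-down copy of the counts dictionary, for illustration only
counts = {"AAT": 1, "ATC": 2, "TGA": 2}

for trinucleotide in ["TGA", "AAA"]:
    if trinucleotide in counts:
        print(trinucleotide + " : " + str(counts[trinucleotide]))
    else:
        print(trinucleotide + " is not in the dictionary")
```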

In [57]:
%%time
# we can impose other criteria on the dictionary - like only print out the
# keys if their count is exactly 2
# This will only print out the trinucleotide, not its count
dna="AATGATCGATCGTACGCTGA"
counts={}
all_bases=["A","C","G","T"]
for base1 in all_bases:
    for base2 in all_bases:
        for base3 in all_bases:
            trinucleotide=base1+base2+base3
            #add in an if statement because we only care about trinucleotides
            #with 2 counts
            count=dna.count(trinucleotide)
            counts[trinucleotide]=count
            if counts[trinucleotide] == 2:
                print(trinucleotide)
                
# the underlying dictionary is still present of course and we could print it out, if we wanted to 
print(counts)
# we could also print out the counts for any given key
print("_________")
#print counts for a specific trinucleotide
print(counts["TGA"])
# ----------------------------
# AND NOW WHAT HAPPENS? COMPARE TO CELL ABOVE
# ----------------------------
print(counts["AAA"])

ATC
GAT
TCG
TGA
{'AAA': 0, 'AAC': 0, 'AAG': 0, 'AAT': 1, 'ACA': 0, 'ACC': 0, 'ACG': 1, 'ACT': 0, 'AGA': 0, 'AGC': 0, 'AGG': 0, 'AGT': 0, 'ATA': 0, 'ATC': 2, 'ATG': 1, 'ATT': 0, 'CAA': 0, 'CAC': 0, 'CAG': 0, 'CAT': 0, 'CCA': 0, 'CCC': 0, 'CCG': 0, 'CCT': 0, 'CGA': 1, 'CGC': 1, 'CGG': 0, 'CGT': 1, 'CTA': 0, 'CTC': 0, 'CTG': 1, 'CTT': 0, 'GAA': 0, 'GAC': 0, 'GAG': 0, 'GAT': 2, 'GCA': 0, 'GCC': 0, 'GCG': 0, 'GCT': 1, 'GGA': 0, 'GGC': 0, 'GGG': 0, 'GGT': 0, 'GTA': 1, 'GTC': 0, 'GTG': 0, 'GTT': 0, 'TAA': 0, 'TAC': 1, 'TAG': 0, 'TAT': 0, 'TCA': 0, 'TCC': 0, 'TCG': 2, 'TCT': 0, 'TGA': 2, 'TGC': 0, 'TGG': 0, 'TGT': 0, 'TTA': 0, 'TTC': 0, 'TTG': 0, 'TTT': 0}
_________
2
0
CPU times: user 474 µs, sys: 134 µs, total: 608 µs
Wall time: 490 µs


In [58]:
%%time
dna="AATGATCGATCGTACGCTGA"
counts={}
all_bases=["A","C","G","T"]
for base1 in all_bases:
    for base2 in all_bases:
        for base3 in all_bases:
            trinucleotide=base1+base2+base3
            count=dna.count(trinucleotide)
            counts[trinucleotide]=count
print(counts)
# ooooooh! Now we are going to see some dictionary methods
print("~~~~~~~~~~")
for trinucleotides,count in counts.items():
    if count>=2:
        print(trinucleotides +" : "+str(count))

{'AAA': 0, 'AAC': 0, 'AAG': 0, 'AAT': 1, 'ACA': 0, 'ACC': 0, 'ACG': 1, 'ACT': 0, 'AGA': 0, 'AGC': 0, 'AGG': 0, 'AGT': 0, 'ATA': 0, 'ATC': 2, 'ATG': 1, 'ATT': 0, 'CAA': 0, 'CAC': 0, 'CAG': 0, 'CAT': 0, 'CCA': 0, 'CCC': 0, 'CCG': 0, 'CCT': 0, 'CGA': 1, 'CGC': 1, 'CGG': 0, 'CGT': 1, 'CTA': 0, 'CTC': 0, 'CTG': 1, 'CTT': 0, 'GAA': 0, 'GAC': 0, 'GAG': 0, 'GAT': 2, 'GCA': 0, 'GCC': 0, 'GCG': 0, 'GCT': 1, 'GGA': 0, 'GGC': 0, 'GGG': 0, 'GGT': 0, 'GTA': 1, 'GTC': 0, 'GTG': 0, 'GTT': 0, 'TAA': 0, 'TAC': 1, 'TAG': 0, 'TAT': 0, 'TCA': 0, 'TCC': 0, 'TCG': 2, 'TCT': 0, 'TGA': 2, 'TGC': 0, 'TGG': 0, 'TGT': 0, 'TTA': 0, 'TTC': 0, 'TTG': 0, 'TTT': 0}
~~~~~~~~~~
ATC : 2
GAT : 2
TCG : 2
TGA : 2
CPU times: user 2.29 ms, sys: 1.25 ms, total: 3.54 ms
Wall time: 4.28 ms


## Dictionary Methods
* Items can be removed from a dictionary

			del(dict_name[key_name])
            
* The above command, removes the key and value from the dictionary
* You can .pop() a value off of a dictionary which returns the value of the key and simultaneously deletes it (we saw this method with lists, too)

            dict_name.pop("key_1")
            
* You can replace the value matched to a key by:
            
            dict_name[key]=new_value
* Built in functions for dictionaries include: 
            
            .keys() - returns a view of the dictionary's keys (wrap it in list() if you need an actual list)

            .values() - returns a view of the dictionary's values
            
            .get() - takes two arguments, but the second argument is optional; it is useful if the dictionary does not contain the key that is being searched for. It may not be immediately obvious to you why the .get method is so useful. However, if you accidentally call a key that doesn't exist, the program will stop and raise an exception which could de-rail your project. By using .get() you can specify the second argument for what to return if the key doesn't exist so that your program doesn't just stop
            
             sorted() - a built-in function rather than a dictionary method: sorted(dict_name) returns a sorted list of the keys, so we can sort the keys before looping to impose a specific, repeatable order
             .items() - Instead of returning a single value, the items method returns paired values so we can iterate over pairs of data (this is a bit weird…) but we have seen something similar with the enumerate function in lists. The more technical definition is that the .items method returns a view of tuples, each of which consists of a key/value pair from the dictionary
                         - usually see it as: 
                             example: 
                    d = {"Name": "DAP","Age": 100, "Statistics": True}
                    print(d.items())
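
Here is a short sketch that exercises several of the tools listed above in one place. It reuses the restriction enzyme entries from earlier in the notebook as sample data; the variable names are illustrative.

```python
enzymes = {"EcoRI": "GAATTC", "BisI": "GC[ATGC]GC", "AvaII": "GG(A|T)CC"}

# .keys() and .values() return views; list() turns them into lists
print(list(enzymes.keys()))
print(list(enzymes.values()))

# sorted() is a built-in function; given a dictionary it returns a sorted list of the keys
for name in sorted(enzymes):
    print(name + " cuts at " + enzymes[name])

# .pop() returns the value and removes the key-value pair at the same time
removed = enzymes.pop("AvaII")
print(removed)
print(len(enzymes))  # 2

# .items() hands back (key, value) pairs that we can unpack in the for statement
for name, motif in enzymes.items():
    print(name, motif)
```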

<div class="alert alert-block alert-warning">          

    This is a common pattern: 
        for key in my_dict.keys():
            value =my_dict.get(key)
            #do something with key and value

    So common, in fact, that python has a built in feature to handle it: 
        for key,value in my_dict.items():
            #do something with key and value


<div class="alert alert-block alert-warning">

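    print(dict_name["Key1"]) gives the same result as print(dict_name.get("Key1")) when the key exists. If the key is missing, the first raises a KeyError while the second returns None.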

    But! With .get(), you can specify what to return if key doesn’t exist! For instance, in the nucleotide example, we happened to know that, with how the loop was structured, the trinucleotide would have to be present 0 times if it wasn't in our final dictionary so we could specify the count of "0" if the trinucleotide wasn't present.  

    print(dict_name.get("Key1",0)) 
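
The difference between the two kinds of lookup can be seen side by side in this small sketch. The two-entry dictionary is a cut-down stand-in for the trinucleotide counts.

```python
# Cut-down stand-in for the counts dictionary
counts = {"ATC": 2, "TGA": 2}

print(counts.get("TGA"))      # 2, the key exists
print(counts.get("AAA"))      # None, no exception is raised
print(counts.get("AAA", 0))   # 0, the default we asked for

# Square brackets raise an exception for a missing key
try:
    print(counts["AAA"])
except KeyError as error:
    print("KeyError: " + str(error))
```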


* Remember that in older versions of Python these functions did not return the keys or values in any user-specified order (memory was allocated without regard to which item was added first or last). Since Python 3.7, dictionaries keep the order in which items were inserted. Even so, you can't look up an entry by its position like you can with a list: you always go through the key. It is still useful to appreciate this limitation when you are dealing with older code, though!

In [1]:
d = {"Name":"DAP","Age": 100, "Statistics": True}
print(d.items())

dict_items([('Name', 'DAP'), ('Age', 100), ('Statistics', True)])


In [42]:
# A note on the update() method of dictionaries: both dictionaries contain
# the key "red". When updating, python keeps a single "red" entry and
# overwrites its value with the one from w1. 
w={"house":"Haus","cat":"Katze","red":"rot"}
w1 = {"red":"rouge","blau":"bleu"}
w.update(w1)
print(w)
#{'house': 'Haus', 'blau': 'bleu', 'red': 'rouge', 'cat': 'Katze'}

{'house': 'Haus', 'cat': 'Katze', 'red': 'rouge', 'blau': 'bleu'}


In [7]:
my_children={"Darwin":"Chihuahua","Fisher":"small furry dog","Violet":"child 1", "Henry":"Child 2"}
print(my_children.keys())
print("++++++++++++")
print(my_children.values())
print("~~~~~~~~~~~~")
#You can iterate over a dictionary
for key in my_children:
    print(key, my_children[key])

# an imperfect example but, since Daven is not a key in my dictionary of children, when I search for him
# I want to be told that he is not a member of this dictionary
print(my_children.get("Daven","NOT my child"))
# if I used square brackets to look up the key, python would raise an exception instead of telling me something useful:
# that he isn't a member of this dictionary
#print(my_children["Daven"])


dict_keys(['Darwin', 'Fisher', 'Violet', 'Henry'])
++++++++++++
dict_values(['Chihuahua', 'small furry dog', 'child 1', 'Child 2'])
~~~~~~~~~~~~
Darwin Chihuahua
Fisher small furry dog
Violet child 1
Henry Child 2
NOT my child


In [59]:
%%time
# Here are some more important methods for dictionaries!
dna="AATGATCGATCGTACGCTGA"
#dna="AATGATCGA"
counts={} 
all_bases=["A","C","G","T"]
for base1 in all_bases:
    for base2 in all_bases:
        for base3 in all_bases:
            trinucleotide=base1+base2+base3
            count=dna.count(trinucleotide)
            #add in an if statement so that we can get rid of all the trinucleotides
            #that have a count of 0
            if count >0:
                counts[trinucleotide]=count
print(counts)
print("Is AAA present? ")
# ******************************
# INCREDIBLY IMPORTANT METHOD ALERT! .get() allows you to search a dictionary 
# but also specify what to return in the case that what you are searching
# for doesn't exist in the dictionary! This will serve two purposes:
# 1. Your program doesn't crash when it doesn't find something that you specified
# 2. it gives you information that the item doesn't exist
# ******************************
print(counts.get("AAA",0))
# -------------------------------------
# Alert!!! Here is the sorted function in use!
# -------------------------------------
print("Here is the sorted function with criteria of 2")
print("**********************")
# sorted() returns the (key, value) pairs in alphabetical order of the keys
for trinucleotide,count in sorted(counts.items()):
    if count == 2:
        print(trinucleotide)
print("Here are all the trinucleotides that are present in this string")
for trinucleotide,count in counts.items():
    print("The trinucleotide: "+trinucleotide+" is represented "+str(count)+" times")

{'AAT': 1, 'ACG': 1, 'ATC': 2, 'ATG': 1, 'CGA': 1, 'CGC': 1, 'CGT': 1, 'CTG': 1, 'GAT': 2, 'GCT': 1, 'GTA': 1, 'TAC': 1, 'TCG': 2, 'TGA': 2}
Is AAA present? 
0
Here is the sorted function with criteria of 2
**********************
ATC
GAT
TCG
TGA
Here are all the trinucleotides that are present in this string
The trinucleotide: AAT is represented 1 times
The trinucleotide: ACG is represented 1 times
The trinucleotide: ATC is represented 2 times
The trinucleotide: ATG is represented 1 times
The trinucleotide: CGA is represented 1 times
The trinucleotide: CGC is represented 1 times
The trinucleotide: CGT is represented 1 times
The trinucleotide: CTG is represented 1 times
The trinucleotide: GAT is represented 2 times
The trinucleotide: GCT is represented 1 times
The trinucleotide: GTA is represented 1 times
The trinucleotide: TAC is represented 1 times
The trinucleotide: TCG is represented 2 times
The trinucleotide: TGA is represented 2 times
CPU times: user 1.25 ms, sys: 379 µs, total: 1
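
The cells above build every possible trinucleotide and ask the string to count each one. A sketch of the sliding window mentioned at the start of this section is another way to fill the dictionary: slide across the sequence three bases at a time and use `.get()` with a default of 0 to add one to the running count. The names `window_size` and `window` are illustrative. One thing to be aware of is that `str.count()` counts non-overlapping occurrences, whereas a sliding window also counts overlapping ones; for this particular sequence the two approaches give the same counts.

```python
dna = "AATGATCGATCGTACGCTGA"
window_size = 3
counts = {}

# Slide a window of three bases across the sequence, one base at a time
for start in range(len(dna) - window_size + 1):
    window = dna[start:start + window_size]
    # .get() returns 0 the first time we meet a trinucleotide
    counts[window] = counts.get(window, 0) + 1

print(counts)
print(counts.get("TGA", 0))  # 2
print(counts.get("AAA", 0))  # 0, "AAA" never appears
```

Only the trinucleotides that actually occur end up in the dictionary, so there is no need for the `if count > 0` test.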

In [19]:
%%time
# Here is an example where I try to incorporate all of the ideas that we have seen so 
# far in this module: 
dna="AATGATCGATCGTACGCTGA"
counts={}
all_bases=["A","C","G","T"]
for base1 in all_bases:
    for base2 in all_bases:
        for base3 in all_bases:
            trinucleotide=base1+base2+base3
            count=dna.count(trinucleotide)
            #add in an if statement so that we can get rid of all the trinucleotides
            #that have a count of 0
            if count >0:
                counts[trinucleotide]=count

print(counts)
#print counts for a specific trinucleotide
#print(counts["TGA"])
for base1 in all_bases:
    for base2 in all_bases:
        for base3 in all_bases:
            trinucleotide=base1+base2+base3
            #add in an if statement because we only care about trinucleotides
            #with a count of 1 or 2; .get() returns 0 for keys that were never stored
            #count=dna.count(trinucleotide)
            #counts[trinucleotide]=count
            if counts.get(trinucleotide,0) == 1 or counts.get(trinucleotide,0) == 2:
                print(trinucleotide)
# .keys() returns a view of all the keys in the dictionary.                
print(counts.keys())
#print counts for a specific trinucleotide
print(counts["TGA"])

{'AAT': 1, 'ACG': 1, 'ATC': 2, 'ATG': 1, 'CGA': 1, 'CGC': 1, 'CGT': 1, 'CTG': 1, 'GAT': 2, 'GCT': 1, 'GTA': 1, 'TAC': 1, 'TCG': 2, 'TGA': 2}
AAT
ACG
ATC
ATG
CGA
CGC
CGT
CTG
GAT
GCT
GTA
TAC
TCG
TGA
dict_keys(['AAT', 'ACG', 'ATC', 'ATG', 'CGA', 'CGC', 'CGT', 'CTG', 'GAT', 'GCT', 'GTA', 'TAC', 'TCG', 'TGA'])
2


###  In-Lecture example (post on discussion board): How can we make this dictionary (a part of one of your PS) more efficient? 
___________

___________
gencode = {
    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',
    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'}

## Lists can be entries - values, not keys - in a dictionary and vice versa
* Lists can be an entry in a dictionary

* They can't be keys - why not? 

* You access a list in a dictionary by its key:

        dict_name['list_key']

* This returns the whole list. Getting at the individual items can be a bit challenging, and if you want to print out all the information you will need to use a for loop
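
As a sketch of a dictionary whose values are lists, using the running example: the dictionary below (the name `positions` is illustrative) stores, for each trinucleotide, a list of the positions where it starts in the sequence.

```python
dna = "AATGATCGATCGTACGCTGA"
positions = {}

for start in range(len(dna) - 2):
    trinucleotide = dna[start:start + 3]
    # the first time we see a trinucleotide, give it an empty list
    if trinucleotide not in positions:
        positions[trinucleotide] = []
    positions[trinucleotide].append(start)

# the key gives back the whole list...
print(positions["TGA"])      # [2, 17]
# ...and a second set of square brackets picks one item out of that list
print(positions["TGA"][0])   # 2

# a for loop walks through every list in the dictionary
for trinucleotide in positions:
    print(trinucleotide, positions[trinucleotide])
```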

In [60]:
# A dictionary of lists from Codecademy:
inventory = {
    'gold' : 500,
    'pouch' : ['flint', 'twine', 'gemstone'], # Assigned a new list as the value to 'pouch' key
    'backpack' : ['xylophone','dagger', 'bedroll','bread loaf']
}

# Adding a key 'burlap bag' and assigning a list to it
inventory['burlap bag'] = ['apple', 'small ruby', 'three-toed sloth']
inventory['pocket']=['seashell','strange berry','lint']

# Sorting the list found under the key 'pouch'
print(inventory["pouch"])
inventory['pouch'].sort() 
print(inventory["pouch"])
inventory['backpack'].sort()
inventory['backpack'].remove('dagger')
inventory['gold']=550

for stuff in inventory: 
    print(inventory[stuff])

['flint', 'twine', 'gemstone']
['flint', 'gemstone', 'twine']
550
['flint', 'gemstone', 'twine']
['bedroll', 'bread loaf', 'xylophone']
['apple', 'small ruby', 'three-toed sloth']
['seashell', 'strange berry', 'lint']


In [20]:
#example from Codecademy which has a great section on dictionaries (and other basics of python)
#A list of dictionaries
lloyd = {
    "name": "Lloyd",
    "homework": [90.0,97.0,75.0,92.0],
    "quizzes": [88.0,40.0,94.0],
    "tests": [75.0,90.0]
}
alice = {
    "name": "Alice",
    "homework": [100.0, 92.0, 98.0, 100.0],
    "quizzes": [82.0, 83.0, 91.0],
    "tests": [89.0, 97.0]
}
tyler = {
    "name": "Tyler",
    "homework": [0.0, 87.0, 75.0, 22.0],
    "quizzes": [0.0, 75.0, 78.0],
    "tests": [100.0, 100.0]
}
#a BIG list of dictionaries
students=[lloyd,alice,tyler]
#try to print out all the information in the dictionary
print("-"*10)
print(students)
print("*"*10)
#we can also print out the information from a specific dictionary
print(tyler.items())
print("."*10)
#this is not really what we want....it's pretty messy
for student in students:
    print("-------")
    print(student["name"])
    print("-------")
    print(student["homework"])
    print(student["quizzes"])
    print(student["tests"])

----------
[{'name': 'Lloyd', 'homework': [90.0, 97.0, 75.0, 92.0], 'quizzes': [88.0, 40.0, 94.0], 'tests': [75.0, 90.0]}, {'name': 'Alice', 'homework': [100.0, 92.0, 98.0, 100.0], 'quizzes': [82.0, 83.0, 91.0], 'tests': [89.0, 97.0]}, {'name': 'Tyler', 'homework': [0.0, 87.0, 75.0, 22.0], 'quizzes': [0.0, 75.0, 78.0], 'tests': [100.0, 100.0]}]
**********
dict_items([('name', 'Tyler'), ('homework', [0.0, 87.0, 75.0, 22.0]), ('quizzes', [0.0, 75.0, 78.0]), ('tests', [100.0, 100.0])])
..........
-------
Lloyd
-------
[90.0, 97.0, 75.0, 92.0]
[88.0, 40.0, 94.0]
[75.0, 90.0]
-------
Alice
-------
[100.0, 92.0, 98.0, 100.0]
[82.0, 83.0, 91.0]
[89.0, 97.0]
-------
Tyler
-------
[0.0, 87.0, 75.0, 22.0]
[0.0, 75.0, 78.0]
[100.0, 100.0]


## You can pass a dictionary to a function!

In [26]:
# you can also pass dictionaries to functions
def average(numbers):
    total = sum(numbers)
    total = float(total)
    print("********** HERE I AM IN THE FIRST FUNCTION!!******")
    total=total/len(numbers)
    return total
    
def get_average(student):
    homework=average(student["homework"])
    print("***Here I am in the second function!")
    quizzes=average(student["quizzes"])
    tests=average(student["tests"])
    return 0.1*homework+0.3*quizzes+0.6*tests

def get_letter_grade(score):
    print("Here I am in the third function!")
    if score >= 90:
        return "A"
    elif score >=80:
        return "B"
    elif score >=70:
        return "C"
    elif score >=60:
        return "D"
    else: 
        return "F"

# Here is the dictionary that you will then pass in the function
lloyd = {
    "name": "Lloyd",
    "homework": [90.0,97.0,75.0,92.0],
    "quizzes": [88.0,40.0,94.0],
    "tests": [75.0,90.0]
}
#Passing a dictionary, lloyd, to the get_average function, whose result is then passed to the get_letter_grade function
print(get_letter_grade(get_average(lloyd)))


********** HERE I AM IN THE FIRST FUNCTION!!******
***Here I am in the second function!
********** HERE I AM IN THE FIRST FUNCTION!!******
********** HERE I AM IN THE FIRST FUNCTION!!******
Here I am in the third function!
B
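
The same functions can be applied to every student in the list of dictionaries from the earlier cell. The sketch below repeats the functions without the print statements so that it runs on its own; the `letter_grades` dictionary is an illustrative addition that maps each name to a grade.

```python
def average(numbers):
    return sum(numbers) / len(numbers)

def get_average(student):
    # weighted average: 10% homework, 30% quizzes, 60% tests
    return (0.1 * average(student["homework"])
            + 0.3 * average(student["quizzes"])
            + 0.6 * average(student["tests"]))

def get_letter_grade(score):
    if score >= 90:
        return "A"
    elif score >= 80:
        return "B"
    elif score >= 70:
        return "C"
    elif score >= 60:
        return "D"
    else:
        return "F"

lloyd = {"name": "Lloyd", "homework": [90.0, 97.0, 75.0, 92.0],
         "quizzes": [88.0, 40.0, 94.0], "tests": [75.0, 90.0]}
alice = {"name": "Alice", "homework": [100.0, 92.0, 98.0, 100.0],
         "quizzes": [82.0, 83.0, 91.0], "tests": [89.0, 97.0]}
tyler = {"name": "Tyler", "homework": [0.0, 87.0, 75.0, 22.0],
         "quizzes": [0.0, 75.0, 78.0], "tests": [100.0, 100.0]}
students = [lloyd, alice, tyler]

# Build a new dictionary: student name -> letter grade
letter_grades = {}
for student in students:
    letter_grades[student["name"]] = get_letter_grade(get_average(student))

for name, grade in letter_grades.items():
    print(name + " : " + grade)
```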


In [22]:
# I have written this function in TWO ways - with the first way commented out 
# first way: I have two dictionaries, prices and stock 
# second way: I have one combined dictionary which has the value of a list at each key
#def compute_lab_cost(reagent):
#    total_here = 0
#    for item in reagent:
#        # prices is a global variable, so the function can read it even though
#        # it hasn't been passed into the function
#        total_here +=prices[item]
#    return total_here
def compute_lab_cost(reagent, price_stock):
    total_here = 0
    for item in reagent:
        # element 0 of each list is the price
        total_here +=price_stock[item][0]
    return total_here
# two dictionaries here: one that tells you the price of each item
#prices = { "reagent1":10,  "reagent2":25, "reagent3":1, "reagent4":5}
# second dictionary that tells you how many of each item you have in stock
#stock = {"reagent1":6,"reagent2": 0, "reagent3":32, "reagent4":10}

# We can combine these two dictionaries into one dictionary that has a list for each reagent key:
# the first element of that list is price and the second element is stock.
price_stock={"reagent1":[10,6],"reagent2":[25,0],"reagent3":[1,32],"reagent4":[5,10]}


total=0
for key in price_stock:
    price=price_stock[key][0]
    in_stock=price_stock[key][1]
    print(total)
    print(key)
    print("**********")
    print("price: "+ str(price))
    print("stock: "+ str(in_stock))
    print(price*in_stock)
    total = total + price*in_stock

print("*")
#print(total)
print("-")


shopping_for_lab = ["reagent4", "reagent2", "reagent1"]
# PASSING A DICTIONARY (along with a list of its keys) TO A FUNCTION!
print(compute_lab_cost(shopping_for_lab, price_stock))

0
reagent1
**********
price: 10
stock: 6
60
60
reagent2
**********
price: 25
stock: 0
0
60
reagent3
**********
price: 1
stock: 32
32
92
reagent4
**********
price: 5
stock: 10
50
*
-
40
