# Practical Python Programming for Biologists
Author: Dr. Daniel Pass | www.CompassBioinformatics.com

---

# Dictionaries

A dictionary may look like a 2D list at first, but it works differently: all data is stored as key-value pairs, and a value is looked up by its key rather than by its position. Dictionaries are particularly useful when you have one value and need to find its partner (like a real dictionary!), for example:

- ```geneID => sequence```
- ```aminoAcid => frequency```
- ```site => longitude/latitude```

This example dictionary has a frequency count of each base, where the first element in each pair is the "key" reference, and the second element is the "value".

A dictionary is defined as comma-separated ```'key' : 'value'``` pairs between braces ```{ }``` (curly brackets).

In [1]:
# Create a dictionary of DNA base counts
base_counts = {'A': 101, 'T': 250, 'G': 125, 'C': 92}
print(base_counts)

{'A': 101, 'T': 250, 'G': 125, 'C': 92}


We can use the square bracket method ```dictionary['key']``` to return the value of a dictionary key, similar to using the index number for a list. However, the ```.get()``` method is often safer because it returns ```None``` for a missing key instead of raising an error. Note the difference:

In [2]:
print("Number of Thymine:",  base_counts['T'])

Number of Thymine: 250


In [3]:
# Here we search for a key that does not exist, which raises a KeyError and stops the program
print("Number of Unknowns:", base_counts['N'])

KeyError: 'N'

In [4]:
print("Number of Unknowns:", base_counts.get('N'))

Number of Unknowns: None
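
A printed ```None``` is not always what we want. As a small sketch (using the same example counts, and assuming that a missing base should be reported as a count of 0), ```.get()``` accepts an optional second argument that is returned when the key is missing, and the ```in``` keyword checks whether a key exists:

```python
# Same counts as the example dictionary above
base_counts = {'A': 101, 'T': 250, 'G': 125, 'C': 92}

# .get() takes an optional second argument, returned when the key is missing
unknown_count = base_counts.get('N', 0)
print("Number of Unknowns:", unknown_count)

# The 'in' keyword tests whether a key exists, without raising an error
has_n = 'N' in base_counts
print("Is N a key?", has_n)
```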


First, let's add some new pairs to the dictionary. This can be done in two ways:

In [None]:
# Add two new key-value pairs to the dictionary
base_counts['N'] = 5
# Here we use the .update() dictionary method to add data into the dictionary
base_counts.update({ 'MISSING' : 0 })

print(base_counts)
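
It may help to know that ```.update()``` is not limited to a single pair. The sketch below (the replacement count of 99 is made up for illustration) adds two pairs in one call and then shows that updating an existing key overwrites its value:

```python
base_counts = {'A': 101, 'T': 250, 'G': 125, 'C': 92}

# .update() can add several pairs at once...
base_counts.update({'N': 5, 'MISSING': 0})

# ...and it overwrites the value of any key that already exists
base_counts.update({'A': 99})

print(base_counts)
```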

We can then access individual values based on their key just like looking them up in a dictionary, or modify/delete them:

In [None]:
base_counts = {'A': 101, 'T': 250, 'G': 125, 'C': 92, 'N': 5, 'MISSING': 0}

# Print corresponding values using .get()
print("Number of Adenine:", base_counts.get('A'))
print("Number of Thymine:", base_counts.get('T'))
print("Number of Unknowns:", base_counts.get('N'))
print(base_counts)

# Modify the value for a key
base_counts['A'] = 65

# Remove the 'MISSING' key-value pair from the dictionary
base_counts.pop('MISSING')
print(base_counts)

In [None]:
# There is also the del statement
del base_counts['T']
print(base_counts)

# Print corresponding values using .get() (a deleted key now returns None)
print("Number of Adenine:", base_counts.get('A'))
print("Number of Thymine:", base_counts.get('T'))
print("Number of Unknowns:", base_counts.get('N'))
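
One difference between the two removal options is that ```.pop()``` hands back the value it removed, while ```del``` does not. A minimal sketch, assuming a small dictionary in roughly the state left by the cells above; the ```'not found'``` text is just an illustrative default:

```python
base_counts = {'A': 65, 'G': 125, 'C': 92, 'N': 5}

# .pop() removes the pair and returns its value
removed_value = base_counts.pop('N')
print("Removed value:", removed_value)

# A second argument is returned instead of a KeyError if the key is absent
missing_value = base_counts.pop('N', 'not found')
print("Second attempt:", missing_value)
print(base_counts)
```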

Two useful methods are ```.keys()``` and ```.values()```, which return dictionary view objects rather than lists (similar to what we saw when we made a range). They can be turned into a list using the ```list()``` function in the same way.

In [5]:
# Print just the values
print(base_counts.values())

# Print just the keys
print(base_counts.keys())

dict_values([101, 250, 125, 92])
dict_keys(['A', 'T', 'G', 'C'])
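
As a short sketch of the ```list()``` conversion mentioned above (the variable names ```bases``` and ```counts``` are only for illustration):

```python
base_counts = {'A': 101, 'T': 250, 'G': 125, 'C': 92}

# Convert the view objects into ordinary lists
bases = list(base_counts.keys())
counts = list(base_counts.values())

print(bases)
print(counts)

# Lists support indexing; the view objects themselves do not
print("First base:", bases[0])
print("Counts sorted:", sorted(counts))
```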


We can also use the ```.items()``` method to return the key:value pairs as a view of tuples, which ```list()``` can convert into a list of tuples. We'll return to this once we know how to loop through data.

In [6]:
# Print the whole dictionary, then its key-value pairs
print(base_counts)
print(base_counts.items())

{'A': 101, 'T': 250, 'G': 125, 'C': 92}
dict_items([('A', 101), ('T', 250), ('G', 125), ('C', 92)])
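
Until we reach loops, one way to work with the items is to convert them to a list first. The following is a sketch only, with illustrative variable names:

```python
base_counts = {'A': 101, 'T': 250, 'G': 125, 'C': 92}

# Turn the items view into a list of (key, value) tuples
pairs = list(base_counts.items())
print(pairs)

# Each tuple can be unpacked into two variables
first_base, first_count = pairs[0]
print(first_base, first_count)

# dict() rebuilds a dictionary from a list of pairs
rebuilt = dict(pairs)
print(rebuilt == base_counts)
```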


### Dictionary exercise

1. Create a dictionary named insects where keys are the common short names of insects and values are their Latin names using this data.
- honeybee  - Apis mellifera
- fruitfly  - Drosophila melanogaster
- butterfly - Papilio machaon
- ladybird  - Coccinella septempunctata
- fireant   - Solenopsis invicta

2. Add a new insect to the insects dictionary with Mosquito & Culex pipiens
3. Actually we have incorrectly identified the bee! Modify the honeybee to Apis cerana
4. Print the Latin names (as a ```dict_values``` object) and how many species are in the dictionary

In [10]:
# Code goes here
insects = {"honeybee" : "Apis mellifera",
           "fruitfly" : "Drosophila melanogaster",
           "butterfly" : "Papilio machaon",
           "ladybird" : "Coccinella septempunctata",
           "fireant" : "Solenopsis invicta"}
# Add new key-value pair
insects.update({"mosquito" : "Culex pipiens"})

# Correct honeybee
insects["honeybee"] = "Apis cerana"

# Print the Latin names as a dict_values object, and how many species are in the dictionary
dict_list = insects.values()
print(dict_list)
print("Number of species:", len(insects))

dict_values(['Apis cerana', 'Drosophila melanogaster', 'Papilio machaon', 'Coccinella septempunctata', 'Solenopsis invicta', 'Culex pipiens'])
Number of species: 6


---
#### Dictionary Example
Dictionaries are really important and powerful. Let's look at another example where we can attach specific information to geneIDs, and use the ```.get()``` method to search the dictionary.

Here we can combine a gene sequence dictionary with our 2D list exercise from earlier

In [13]:
## The data

# Dictionary of gene names and sequences
gene_dict = {'BRCA1': 'ATGTTGTCATCGTTGAGCTTTGCTTCCT',
             'TP53': 'ATGGAGGAGCCGCAGTCAGATC',
             'EGFR': 'ATGACCATCCAAGATGATGGTGTC',
             'KRAS': 'ATGACTGAATATAAACTTGTGGTAG',
             'BRAF': 'ATGGTCCAGCTTGGACCCACTCC',
             'ALK': 'ATGAAGGAGCCCTCAGATTTCTTG',
             'RET': 'ATGGGTGGGTTGTCGGAAGATCTT',
             'ROS1': 'ATGAGCCACCCAGGTCCCTGTAGT',
             'MET1': 'ATGGCTTCAAGCTGTTGTCGTGAAGA'}

# gene confidence values
gene_confs = [[0.92, 'MET1', 2205], [0.82, 'EGFR', 1567], [0.93, 'KRAS', 6523], [0.4, 'TP53', 5002], [0.94, 'ROS5', 1999], [0.87, 'BRCA1', 2323]]

# sort and get lowest conf gene ID
gene_confs.sort()
lowest_gene_conf = gene_confs[0][0]
lowest_gene_ID = gene_confs[0][1]

print(lowest_gene_ID)
print(lowest_gene_conf)

TP53
0.4


We don't need to type the exact gene name; we can use the variable as the key:

In [14]:
# Search dictionary keys for that ID
print(gene_dict.get(lowest_gene_ID))

# easier to read!
print("Gene sequence for", lowest_gene_ID, "(Confidence value:", str(lowest_gene_conf) + ") is:",  gene_dict.get(lowest_gene_ID))

ATGGAGGAGCCGCAGTCAGATC
Gene sequence for TP53 (Confidence value: 0.4) is: ATGGAGGAGCCGCAGTCAGATC
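
Not every ID in ```gene_confs``` has a matching key: ```'ROS5'``` is listed there, but ```gene_dict``` only contains ```'ROS1'```. The sketch below uses a shortened copy of the dictionary, and the fallback text ```'no sequence stored'``` is an assumption chosen for illustration:

```python
# A shortened copy of gene_dict, for illustration only
gene_dict = {'TP53': 'ATGGAGGAGCCGCAGTCAGATC',
             'ROS1': 'ATGAGCCACCCAGGTCCCTGTAGT'}

# 'ROS5' appears in gene_confs but is not a key in gene_dict
query_ID = 'ROS5'

# Without a default, .get() returns None for a missing key
print(gene_dict.get(query_ID))

# With a default, the output explains itself
sequence = gene_dict.get(query_ID, 'no sequence stored')
print("Gene sequence for", query_ID, "is:", sequence)
```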


Let's continue our example from 2D lists and remove the three lowest-confidence genes from our dictionary:

In [15]:
print("gene_dict contains", len(gene_dict), "genes")
gene_dict.pop("EGFR")
gene_dict.pop("TP53")
gene_dict.pop("BRCA1")
print("gene_dict contains", len(gene_dict), "genes")

gene_dict contains 9 genes
gene_dict contains 6 genes


Actually, here's an even better idea! Instead of just throwing the data out, let's put the DNA sequences of the three lowest-confidence genes from the dictionary into a new list called ```bad_genes``` before they get discarded.

Note: what happens if you run this code straight after the previous cell? Read the error and identify what is wrong.

In [16]:
bad_genes = []

print("gene_dict contains", len(gene_dict), "genes")
bad_genes.append(gene_dict.pop("EGFR"))
bad_genes.append(gene_dict.pop("TP53"))
bad_genes.append(gene_dict.pop("BRCA1"))
print("gene_dict contains", len(gene_dict), "genes")

gene_dict contains 6 genes


KeyError: 'EGFR'

In [18]:
print(bad_genes)
print(gene_dict)

[]
{'KRAS': 'ATGACTGAATATAAACTTGTGGTAG', 'BRAF': 'ATGGTCCAGCTTGGACCCACTCC', 'ALK': 'ATGAAGGAGCCCTCAGATTTCTTG', 'RET': 'ATGGGTGGGTTGTCGGAAGATCTT', 'ROS1': 'ATGAGCCACCCAGGTCCCTGTAGT', 'MET1': 'ATGGCTTCAAGCTGTTGTCGTGAAGA'}
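
The ```KeyError``` appears because cell ```In [15]``` had already removed these three genes, so ```.pop()``` has nothing left to remove and ```bad_genes``` stays empty. One possible fix, sketched here with a shortened four-gene copy of the dictionary, is to rebuild ```gene_dict``` before popping:

```python
# Rebuild a shortened gene_dict so the keys exist again (illustration only)
gene_dict = {'BRCA1': 'ATGTTGTCATCGTTGAGCTTTGCTTCCT',
             'TP53': 'ATGGAGGAGCCGCAGTCAGATC',
             'EGFR': 'ATGACCATCCAAGATGATGGTGTC',
             'KRAS': 'ATGACTGAATATAAACTTGTGGTAG'}

bad_genes = []

# .pop() returns the removed sequence, which .append() stores in the list
bad_genes.append(gene_dict.pop("EGFR"))
bad_genes.append(gene_dict.pop("TP53"))
bad_genes.append(gene_dict.pop("BRCA1"))

print("bad_genes contains", len(bad_genes), "sequences")
print("gene_dict contains", len(gene_dict), "genes")
```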


Dictionaries and lists are powerful ways to handle data, and working with data often involves going between the two. There are also lots of powerful methods to search through not just the keys but also the values to find relevant keys, which can be very useful. But first we need to learn a bit about loops and conditionals!
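
As one example of going between the two, here is a sketch that builds a dictionary from two parallel lists using the built-in ```zip()``` function, which has not been covered in this course so far. It assumes the lists are the same length and in matching order:

```python
# Two parallel lists: the first holds keys, the second holds values
bases = ['A', 'T', 'G', 'C']
counts = [101, 250, 125, 92]

# zip() pairs the items up by position and dict() turns the pairs into a dictionary
base_counts = dict(zip(bases, counts))
print(base_counts)

# Going back the other way gives the original lists
print(list(base_counts.keys()) == bases)
print(list(base_counts.values()) == counts)
```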

## Exercises

1. Create a dictionary of bacteria and confluence values (growth) using the data:

```
["E. coli", "S. aureus", "P. aeruginosa", "K. pneumoniae", "A. baumannii"]
[60, 82, 75, 91, 70]
```
2. Print the confluence value of *K. pneumoniae* from the dictionary
3. A function we haven't used yet is ```sum()```, which is called just like ```len()``` but adds up the values instead of counting them. Use both of these to calculate the average confluence of all the samples in the dictionary (what's the easiest way to get all the values?)
4. Oops! We just found another sample! Add ```{L. Beijerinick : 65}``` to the dictionary.
5. Print the species name with the highest growth.

In [None]:
# Format reminder
bacteria_dict = {"E. coli": 60,}