# Problem 0.1: Gene Ontology Dictionary (5 points)

## Build a dictionary while iterating over a list

This problem is a simplified version of a type of problem that will come up frequently in this course, especially when we do enrichment calculations with genomic data.

Imagine you performed an experiment in which you measured gene expression in a sample before and after some perturbation. You have obtained gene ontology annotations for the genes whose expression increased after the perturbation. Your task is to count the number of times each gene ontology term shows up in your results. To keep track of the counts, you will create a dictionary in which the *keys* are gene ontology terms and the *values* are the number of times each term occurs in the data.

In the cell below is a list of gene ontology terms, one for each gene whose expression increased. You're going to iterate over the list with a `for` loop. In the block of the loop, your code should build the dictionary by performing one of the following tasks:

1) Add *new* GO terms to the dictionary as you encounter them OR:

2) Increment the count for GO terms that already exist in the dictionary. 

To write the code, you can use the keywords `in` and `not in` to check if a GO term is already in the dictionary. Below is a simple example of how this works. If the entry *doesn't* exist, one is created and given an initial value of 1. If the entry already exists, the code increments the count by 1:

```python
### Counting coin tosses

coin_tosses = {} # empty dictionary
toss = 'heads' # the data

if toss not in coin_tosses:
    coin_tosses[toss] = 1 # create a new entry and set value to 1
else:
    coin_tosses[toss] += 1 # increment existing entry
```
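
The membership check matters because `+=` reads the current value before updating it, so incrementing a key that is not yet in the dictionary raises a `KeyError`. Here is a minimal sketch of that failure, reusing the `coin_tosses` name from the example above (the `error_seen` flag is a hypothetical name used only for illustration):

```python
coin_tosses = {}  # empty dictionary, as in the example above

try:
    coin_tosses['tails'] += 1  # 'tails' has no entry yet, so this fails
    error_seen = False
except KeyError:
    error_seen = True  # Python refuses to increment a missing key

print(error_seen)
print('tails' in coin_tosses)  # the failed increment did not create an entry
```

This is why the example first checks `not in` and creates the entry before any increment happens.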

Below, apply the logic of the coin toss example to count the occurrences of gene ontology terms in the data. Write the code to do the following:

1. Create an empty dictionary called `go_counts`.
2. Use a `for` loop to iterate over the list `go_annotations`.
3. In the block of the `for` loop, add new GO terms to the dictionary as they come up, and increment the count of terms that are already in the dictionary.

Use `print()` to see the results.

(If you're not familiar with gene ontology analysis, you can find an introduction here: http://geneontology.org.)

In [None]:
# Some simplified GO annotations
go_annotations = ['neurogenesis', 'eye development', 'neurogenesis', 'cell differentiation', 'detoxification',
                  'cell differentiation', 'neurogenesis', 'detoxification', 'exocytosis', 'lipid transport',
                  'signaling', 'neurogenesis']

# YOUR ANSWER HERE

print(go_counts)

In [None]:
assert go_counts['neurogenesis'] == 4
assert go_counts['eye development'] == 1
assert len(go_counts) == 7

# Problem 0.2: Restriction enzymes (3 points)
## Run a sliding window over a DNA sequence to find restriction sites

Your goal in this problem is to identify which restriction enzymes cut a DNA sequence. You'll write code to do the following:

1) Create a dictionary of restriction enzyme specificities in which the cut sequence is the *key* and the enzyme name is the *value*.

2) Run a `for` loop over a DNA sequence, using `range()` to take 6 bp windows. Check if those 6 bp sequences match a dictionary entry of enzyme specificities. If yes, append the name of the restriction enzyme to a list.

It sounds like a simple problem, but the code will bring together most of the concepts we've learned so far. We'll solve this in two steps.

### 0.2.1: Build the dictionary by looping over two lists.
In the cell below are two lists, `enzyme_names` and `specificities`. Each enzyme name and its specificity sit at the same position in their respective lists.

Build a dictionary called `enzymes` by using `for` with `zip()` to loop over the two lists. Add dictionary entries as you go, with the *specificity as the key and the enzyme name as the value.* (Intuitively you might think the enzyme name should be the key, but in this problem we need to look up enzymes by their cut sites, not their names.)

HINT: Define your empty dictionary *before* the `for` loop. (Why is this important?)
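
As a reminder of how `zip()` behaves, here is a small sketch with two made-up lists (`codons`, `amino_acids`, and `pairs` are illustrative names, not part of the assignment). On each pass through the loop, `zip()` hands back one item from each list, matched by position:

```python
codons = ['ATG', 'TGG', 'TAA']
amino_acids = ['Met', 'Trp', 'Stop']

pairs = []  # defined before the loop so it is not reset on every pass
for codon, amino_acid in zip(codons, amino_acids):
    pairs.append(codon + ' -> ' + amino_acid)

print(pairs)
```

The same two-variable loop header works for any pair of equal-length lists.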

In [None]:
enzyme_names = ['EcoRI', 'BamHI', 'SphI', 'XhoI', 'XbaI', 'SacII', 'HindIII']
specificities = ['GAATTC', 'GGATCC', 'GCATGC', 'CTCGAG', 'TCTAGA', 'CCGCGG', 'AAGCTT']

# YOUR ANSWER HERE

print(enzymes)

In [None]:
assert sorted(enzymes.values()) == ['BamHI', 'EcoRI', 'HindIII', 'SacII', 'SphI', 'XbaI', 'XhoI']
assert sorted(enzymes.keys()) == ['AAGCTT', 'CCGCGG', 'CTCGAG', 'GAATTC', 'GCATGC', 'GGATCC', 'TCTAGA']

### 0.2.2: Loop over a DNA sequence & take 6 bp slices

In the cell below, you're given a DNA sequence. Write code to do the following:

1) Create an empty list called `cutters` to hold the list of enzymes that cut the sequence.

2) Iterate over the DNA sequence, 1 bp at a time, using `range()`.

3) In the block of the loop, do the following:
    1. Take a 6 bp slice of the DNA sequence.
    2. Check whether that 6 bp sequence is a key in the dictionary.
    3. If it is, append the enzyme name to `cutters`.

#### Hints
To loop over a string (such as a DNA sequence), we use two tools: slices and `range()`. You've learned how to use `range()` in a for loop. Let's look at slices:

**Grabbing three bases with a slice**

Strings, like lists, can be accessed with indexing and slices, like this:

```python
dna = 'ATGAGCAGGTCAGTGACTGAT' # a DNA sequence
dna[0] # the first base
dna[0:3] # a slice that grabs the first three bases
i = 4 # an example starting index
dna[i:i+3] # a slice that grabs three bases beginning at index i
```
By iterating a `for` loop over an index `i`, you can use the same syntax with a 6 bp slice, `dna[i:i+6]`, to slide across the DNA one base at a time.
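
To make the sliding window concrete, here is a short sketch that collects every 3 bp window from a toy sequence (the names `toy_dna`, `window_size`, and `windows` are placeholders chosen for this example). The loop stops at `len(toy_dna) - window_size + 1` so that the last window is still full length:

```python
toy_dna = 'ATGAGC'  # a short made-up sequence
window_size = 3

windows = []
for i in range(len(toy_dna) - window_size + 1):
    windows.append(toy_dna[i:i + window_size])  # bases i, i+1, i+2

print(windows)
```

For the restriction site problem, the same pattern applies with a window size of 6.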

In [None]:
# Check this sequence for cut sites
dna_seq = 'TAGGCTGGATCCTCGATTCGATGGGGCCCATTAATCTAGAGATCGGATCGGACTGAAAGCTCTTTTGATTCGAAGCTTGCGATGCGAAGCTTGCTAGCTA'

# YOUR ANSWER HERE
        
print(cutters)

In [None]:
assert len(cutters) == 4
assert cutters.count('HindIII') == 2
assert 'XbaI' in cutters