# 0: Data Structures and Gene Regulatory Networks (10 points)

Cell cycle progression is controlled by a set of cyclin-dependent kinases and transcription factors. Because these kinases and transcription factors interact with each other, they can be thought of as a network in which the **nodes** are the genes or proteins and the **edges** are the interactions.

In this problem, you'll use a **dictionary of lists** to examine properties of the yeast cell cycle network (in simplified form).

A dictionary of lists is a natural way to represent networks: the network nodes are represented by the dictionary keys, while the edges are represented as a list of other nodes to which the key node is connected.

For example, if `geneA` regulates `geneX` and `geneY`, the dictionary entry would look like this:

```python
network['geneA'] = ['geneX', 'geneY'] # list of regulatory targets of geneA in dictionary 'network'
```
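
As a small illustration, here is a sketch of a complete toy network built this way. The gene names (`geneA`, `geneX`, `geneY`) are placeholders chosen for the example and do not come from the yeast file.

```python
# Toy network with hypothetical gene names (not from yeast_cc_network.txt)
network = {}
network['geneA'] = ['geneX', 'geneY']  # geneA regulates geneX and geneY
network['geneX'] = ['geneY']           # geneX regulates geneY
network['geneY'] = []                  # geneY has no regulatory targets

targets_of_a = network['geneA']        # look up the targets (edges) of one node
print(targets_of_a)
print('geneY' in network['geneX'])     # is there an edge geneX -> geneY?
```

Each key is a node, and the length of its list is the number of edges leaving that node.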

## 0.1: Read in file data as a dictionary of lists that holds network nodes and edges
The accompanying file `yeast_cc_network.txt` contains the regulatory interactions of one version of the yeast cell cycle network. (The network interactions are taken from Fig. 2 of this paper: https://www.ncbi.nlm.nih.gov/pubmed/25970087.) Each line of the tab-delimited file lists a single network interaction, with the regulator in the left column and the target gene in the right column (feel free to take a look at the file via the Jupyter dashboard):

```
Cln3    MBF
```

**INSTRUCTIONS: In the cell below, read in the file data and store it as a dictionary of lists called `cc_network`.** As you read in the lines, use the usual `line.strip('\n').split('\t')` syntax.

For each line, the **gene on the left is the dictionary key** (`Cln3` in the case above), and the **gene on the right should be appended** to that key's list of regulatory targets.

**HINT: Review of populating dictionaries on the fly:**

In several of the problem sets you wrote code to create dictionary entries while reading in file data. You'll need to use that technique to solve this problem.

For each line of the file, you need to determine if the regulator gene already exists as a key in the dictionary. If it does, then simply append the new target to that regulator's list of targets. If this regulator is not yet in the dictionary, then you need to create an entry for it, with its target as the first item in the list of targets.

Recall that you did something like this for ps3 problem 1, when you used a dictionary as part of code to count how often different transcript types occurred:

```python
transcript_counts = {} # create empty dictionary
for transcript in transcripts: # go transcript by transcript
    if transcript in transcript_counts: # if the transcript entry exists in the dict, increment its count
        transcript_counts[transcript] += 1
    else:
        transcript_counts[transcript] = 1  # create a new entry if the transcript isn't yet in the dict
```
(Note: There is **no header line** in the cell cycle network file. Also, feel free to create a blank cell below to try out pieces of your code as you work on this problem.)
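
The same "check, then append or create" pattern carries over from counts to lists. The sketch below applies it to a few made-up tab-delimited lines held in a Python list; the names `toy_lines` and `toy_network` and the gene names are hypothetical, and opening and reading the actual file is deliberately left out.

```python
# Hypothetical tab-delimited lines standing in for lines read from a file
toy_lines = ['geneA\tgeneX\n', 'geneA\tgeneY\n', 'geneB\tgeneX\n']

toy_network = {}  # empty dictionary of lists
for line in toy_lines:
    regulator, target = line.strip('\n').split('\t')
    if regulator in toy_network:
        # regulator already has an entry: extend its list of targets
        toy_network[regulator].append(target)
    else:
        # first time we see this regulator: start a new one-item list
        toy_network[regulator] = [target]

print(toy_network)  # {'geneA': ['geneX', 'geneY'], 'geneB': ['geneX']}
```

The only change from the counting example is that the value is a list that grows with `append()`, rather than an integer that grows with `+= 1`.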

In [None]:
cc_network = {} # empty dictionary that will hold the network

# Read in the file and add dictionary entries:

# YOUR ANSWER HERE

# Remember to close the file after reading the data.
print(cc_network) # display the result

In [None]:
assert len(cc_network) == 10
assert cc_network['Clb5,6'][2] == 'Cdh1'
assert len(cc_network['Cdc14']) == 3

### Fix the dictionary: add one gene with zero targets
Run the cell below to add a dictionary entry for `Cdh1` with no targets. `Cdh1` shows up as a target but never as a regulator, so its list is empty. I left it out of the input file to avoid complications when reading in the data, hence the fix below. (You could easily tweak your file-reading code to include entries with no targets. For example, you could test whether `line.split()` produces a list of one item or two.)
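
For the curious, here is one way that tweak might look. This is only a sketch under an assumed file layout in which a gene with no targets appears alone on its line; the provided file does not contain such lines, and the names `toy_lines` and `toy_network` are made up for the example.

```python
# Hypothetical lines: the second one lists a gene with no target column
toy_lines = ['geneA\tgeneX\n', 'geneX\n']

toy_network = {}
for line in toy_lines:
    fields = line.split()  # no argument: split on any whitespace, newline dropped
    regulator = fields[0]
    if regulator not in toy_network:
        toy_network[regulator] = []  # every gene gets an entry, even with no targets
    if len(fields) == 2:
        toy_network[regulator].append(fields[1])  # two items: record the edge

print(toy_network)  # {'geneA': ['geneX'], 'geneX': []}
```

Note that splitting on any whitespace only works because the gene names themselves contain no spaces.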

In [None]:
# Just run this cell
cc_network['Cdh1'] = [] # Add Cdh1 with no edges

### 0.2 How many edges does a node have?

We now want to know how connected each node is – how many regulatory targets does each regulator have? This is simple to do with a dictionary of lists. **In the cell below, write one line of code to determine the number of regulatory targets of `Mcm1`.** This line of code should look up the dictionary entry for `Mcm1` and determine the number of targets.

Save your answer to a variable called `mcm1_targets`.

In [None]:
# YOUR ANSWER HERE

In [None]:
assert mcm1_targets == 3

### 0.3 How many edges does each node have?

That last question was (hopefully) ridiculously easy. Next, write a `for` loop to determine the number of regulatory targets for *each* entry in the dictionary. Save the result as a list called `n_targets`.

Remember that to iterate over a dictionary with a `for` loop, you can use the following syntax:

```python
for key in my_dictionary.keys():  # replace my_dictionary with your variable name
```

Alternatively, to iterate over the keys in sorted order:
```python
for key in sorted(my_dictionary.keys()):
```
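
To see the iteration syntax in action, here is a short sketch on a made-up dictionary (`toy_network` and its gene names are hypothetical). It only visits each entry and prints it; what you compute inside the loop is up to you.

```python
toy_network = {'geneB': ['geneX'], 'geneA': ['geneX', 'geneY']}

visited = []  # record the order in which keys are visited
for key in sorted(toy_network.keys()):
    visited.append(key)
    print(key, toy_network[key])  # the key and its list of targets

print(visited)  # ['geneA', 'geneB']
```

Inside the loop, `toy_network[key]` is an ordinary list, so any list operation can be applied to it.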

In [None]:
# YOUR ANSWER HERE

print(n_targets)

In [None]:
assert len(n_targets) == 11
assert max(n_targets) == 6

How many connections between regulators and targets are there in the network overall? 
Use `n_targets` to determine the answer and assign the result to the variable `total_edges`, using just one line of code.

In [None]:
# YOUR ANSWER HERE

In [None]:
assert total_edges == 26

### 0.4 Write a function to find regulators of a gene

One common programming task is to write functions that operate on different data sets of the same structure. Our data is structured so that you can easily look up the regulatory targets of a gene using ordinary lookup by key. But what if you want to know what the upstream regulators of a gene are?

Your task in this problem is to write a function, `find_regulators()`, that takes as input a network (dictionary of lists) and **one or more** query genes (as strings, each separated by commas). The function should return **a single list of all upstream regulators of the query genes.** (In other words, do NOT return separate lists for each query.) If there are no upstream regulators, the function returns an empty list.

In the cell below, complete the function `find_regulators()`.

**Hint:** A simple solution is to use a `for` loop to iterate over the network dictionary for each query. For each dictionary entry, determine whether the query gene is in the list of targets for that entry. Recall that you can test for presence of a value in a list by using the keyword **`in`**:

```python
my_list = [1,2,3,4,5]
number = 4
if number in my_list: # key syntax to test for presence in list
    print("It's in the list")
```
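
Because the function accepts one or more query genes, its signature uses the `*args` syntax. As a quick refresher, the sketch below shows how extra positional arguments are gathered into a tuple that you can loop over. The helper `collect_queries` and its argument names are hypothetical and are not part of the assignment.

```python
def collect_queries(network_name, *queries):
    """Hypothetical helper: return the extra positional arguments as a list."""
    collected = []
    for query in queries:  # queries is a tuple of everything after network_name
        collected.append(query)
    return collected

print(collect_queries('toy', 'geneX'))           # ['geneX']
print(collect_queries('toy', 'geneX', 'geneY'))  # ['geneX', 'geneY']
print(collect_queries('toy'))                    # []
```

The caller passes the queries as separate comma-separated arguments, not as a list.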

In [None]:
# Review the *args syntax if necessary

def find_regulators(network, *queries):
    # YOUR ANSWER HERE
# test your function
find_regulators(cc_network,'Mcm1', 'Cdc14')

In [None]:
assert len(find_regulators(cc_network, 'SBF', 'Swi5', 'Mcm1')) == 9