# A2 Conditionally Expressed
In this assignment, you'll apply what you know about lists, conditionals, and `for` loops to interact with a microarray dataset from the Allen Brain Institute. We've already processed the raw data so that it is normalized and organized into a file arranged by gene names and brain areas (brainarea_vs_genes_exp_w_reannotations.tsv). Before you can begin this assignment, you need to download this dataset from datahub and upload it to the same folder as this assignment. We'll review this in class.

This assignment is worth 50 points (5 points or 5% of your grade for the class).

**PLEASE DO NOT CHANGE THE NAME OF THIS FILE.**

**PLEASE DO NOT COPY & PASTE OR DELETE CELLS INCLUDED IN THE ASSIGNMENT.**


## How to complete assignments

Whenever you see:

```
# YOUR CODE HERE
raise NotImplementedError()
```

You need to **replace (meaning, delete) these lines of code with code that answers the questions** and meets the specified criteria. Make sure you remove the 'raise' line when you do this (or your notebook will raise an error, regardless of any other code, and thus fail the grading tests).

You should write the answer to the questions in those cells (the ones with `# YOUR CODE HERE`), but you can also add extra cells to explore / investigate things if you need / want to. 

Any cell with `assert` statements in it is a test cell. You should not try to change or delete these cells. Note that there might be more than one assert that tests a particular question. 

If a test does fail, reading the error that is printed out should let you know which test failed, which may be useful for fixing it.
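
As a rough illustration of how test cells behave, the sketch below uses a made-up variable (`example_value` is hypothetical and not part of the assignment). An `assert` passes silently when its condition is true and raises an `AssertionError` when it is false.

```python
# Hypothetical variable, used only to illustrate how asserts behave
example_value = [1, 2, 3]

# This assert passes silently because the condition is True
assert isinstance(example_value, list)

# A failing assert raises AssertionError; we catch it here so the cell keeps running
try:
    assert len(example_value) == 5
    test_passed = True
except AssertionError:
    test_passed = False

print(test_passed)
```

In a real test cell the error is not caught, so the traceback points at the exact `assert` line that failed.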

Note that some cells, including the test cells, may be read only, which means they won't let you edit them. If you cannot edit a cell - that is normal, and you shouldn't need to edit that cell.


## Tips & Tricks

The following are a couple of tips and tricks that may help you if you get stuck on anything.

#### Printing Variables
You can (and should) print and check variables as you go. This allows you to check what values they hold, and fix things if anything unexpected happens.
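
For instance, a minimal sketch with a made-up variable (`expression_value` is hypothetical) shows how printing both the value and its type can reveal something unexpected, such as a number stored as a string:

```python
# Hypothetical variable for illustration; values read from a text file are strings
expression_value = '2.5'

print(expression_value)         # shows the value it holds
print(type(expression_value))   # shows that it is a string, not a number

# Storing the type lets us check it programmatically as well
value_type = type(expression_value)
```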

#### Restarting the Kernel
- If you run cells out of order, you can end up overwriting things in your namespace. 
- If things seem to go weird, a good first step is to restart the kernel, which you can do from the kernel menu above.
- Even if everything seems to be working, it's a nice check to 'Restart & Run All', to make sure everything runs properly in order.

### Loading in the data
First, we'll take a few steps to load up the dataset. After you have uploaded the 'brainarea_vs_genes_exp_w_reannotations.tsv' file to your directory, simply run the code below -- you don't need to change anything.

In [1]:
# Import necessary packages
from csv import reader

# Open the tab-delimited file
opened_file = open('brainarea_vs_genes_exp_w_reannotations.tsv')
read_file = reader(opened_file, delimiter = '\t')
gene_data = list(read_file)

## Q1

Above, the variable `gene_data` is a list of lists. The first list is a list of headers for the array, containing a first item 'gene_symbol', followed by a list of brain regions.
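
To make the "list of lists" structure concrete, here is a small sketch that reads a made-up, tab-delimited table the same way. The names `toy_text` and `toy_data` and all of the values are invented for illustration; they are not taken from the real dataset.

```python
from csv import reader
from io import StringIO

# Made-up, tab-delimited text standing in for the real file
toy_text = "gene_symbol\tregion A\tregion B\nGENE1\t0.5\t1.2\nGENE2\t-0.3\t2.0\n"

# StringIO lets the csv reader treat a string like an opened file
toy_data = list(reader(StringIO(toy_text), delimiter='\t'))

print(toy_data[0])      # the header row
print(toy_data[1])      # the row for the first gene
print(toy_data[1][2])   # the value for GENE1 in 'region B' (still a string)
```

Each row is its own list, so the first index picks a row and the second index picks an item within that row.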

In the cell below, assign the first list of `gene_data` to a variable called `brain_regions`. The first entry of the list, 'gene_symbol' isn't a brain region, but that's okay for this exercise. Leave it in the list.

In [2]:
### BEGIN SOLUTION
brain_regions = gene_data[0]
### END SOLUTION

In [None]:
# Tests for Q1, worth 5 points total. Note: includes hidden tests.
assert isinstance(brain_regions,list)

### BEGIN HIDDEN TESTS
assert len(brain_regions) == 233
### END HIDDEN TESTS

## Q2

For our study, we're interested in seeing if the superior colliculus and visual cortex have different gene expression. First, we need to know if they're in our list of brain regions.

Write two statements to check if 'superior colliculus' and 'visual cortex' are in your list of brain regions (`brain_regions`). Save the boolean outputs of these membership checks as `SC_bool` and `VC_bool`, respectively. Print the values of `SC_bool` and `VC_bool` so that you can see them.
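
As a reminder of how membership checks on a list behave, here is a short sketch on a made-up list (`toy_regions` and its contents are hypothetical). Note that `in` on a list looks for an item that matches exactly; it does not search inside each string.

```python
# Made-up list for illustration
toy_regions = ['gene_symbol', 'amygdala', 'occipital pole']

# True: 'amygdala' is exactly equal to one of the items
has_amygdala = 'amygdala' in toy_regions

# False: no item is exactly 'occipital', even though one item contains that word
has_occipital = 'occipital' in toy_regions

print(has_amygdala)
print(has_occipital)
```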

In [None]:
### BEGIN SOLUTION
SC_bool = 'superior colliculus' in brain_regions
VC_bool = 'visual cortex' in brain_regions
print(SC_bool)
print(VC_bool)
### END SOLUTION

In [None]:
# Tests for Q2, worth 5 points
assert isinstance(SC_bool,bool)
assert isinstance(VC_bool,bool)

In [None]:
# Hidden tests for Q2, worth 5 points
### BEGIN HIDDEN TESTS
assert SC_bool == True
assert VC_bool == False
### END HIDDEN TESTS

## Q3
Hmm, looks like the data has superior colliculus but not visual cortex. In humans, primary visual cortex is often called "striate cortex" because of the appearance of a dense band of myelinated fibers that runs through it, called the Line of Gennari (details <a href="https://webvision.med.utah.edu/book/part-ix-brain-visual-areas/the-primary-visual-cortex/">here</a>, if you're curious). It's also a part of the occipital lobe, and the gyri and sulci there are named accordingly.

To get a sense of what possible visual regions are in our list, we can look for _striate_ and _occipital_ in the strings for each brain region.

1. Write a `for` loop that loops through the list of brain regions and looks for *either* "striate" or "occipital" within the string for each of the brain regions in your list. Save all of the possible matches to a list called `possible_regions`.
2. Create a counter (called `counter`) that tracks how many matching brain regions you have found by the end of the loop. Using this counter, create a string variable called `regions_message` that says "There are X possible visual regions." where "X" is the value of your counter (note the period at the end).
3. At the end, print your list of possible regions so that you can see what it includes.
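
The general pattern in the steps above (search inside each string, collect the matches, and count them) can be sketched on a made-up list. All names and search terms here (`toy_regions`, `toy_matches`, `toy_counter`, `toy_message`, 'pole', 'caudate') are hypothetical and chosen only for illustration.

```python
# Made-up list for illustration
toy_regions = ['amygdala', 'frontal pole', 'temporal pole', 'caudate']

toy_matches = []
toy_counter = 0

for region in toy_regions:
    # On a string, `in` checks whether the text appears anywhere inside it
    if 'pole' in region or 'caudate' in region:
        toy_matches.append(region)
        toy_counter = toy_counter + 1

# str() is needed because a number cannot be joined to a string with +
toy_message = "Found " + str(toy_counter) + " matches."
print(toy_message)
print(toy_matches)
```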

In [None]:
### BEGIN SOLUTION
possible_regions = []
counter = 0

# Keep any region whose name contains 'occipital' or 'striate'
for region in brain_regions:
    if 'occipital' in region or 'striate' in region:
        possible_regions.append(region)
        counter = counter + 1

regions_message = "There are " + str(counter) + " possible visual regions."
print(regions_message)
print(possible_regions)
### END SOLUTION

In [None]:
# Tests for Q3, worth 5 points.
assert isinstance(possible_regions,list)
assert isinstance(counter,int)
assert isinstance(regions_message,str)

In [None]:
# Hidden Tests for Q3, worth 5 points.
### BEGIN HIDDEN TESTS
assert len(possible_regions) == 11
### END HIDDEN TESTS

In [None]:
# Hidden Tests for Q3, worth 5 points.
### BEGIN HIDDEN TESTS
assert counter == 11
### END HIDDEN TESTS

In [None]:
# Hidden Tests for Q3, worth 5 points.
### BEGIN HIDDEN TESTS
assert regions_message == "There are 11 possible visual regions."
### END HIDDEN TESTS

## Q4

![](https://resource.loni.usc.edu/wp-content/uploads/2012/06/LINGUAL01.jpg)

Let's go with '_lingual gyrus, striate_' -- that's a nice chunk of brain that encompasses visual cortex in humans (see the pink area above, details <a href="https://resource.loni.usc.edu/resources/downloads/research-protocols/masking-regions/lingual-gyrus/">here</a>).

Now that we know that 'lingual gyrus, striate' and 'superior colliculus' are both in our list, we need to know their indices so that we can look for their corresponding values in the lists for each gene. For that, we can use the `index` method on our list (see the help for `index`, or <a href="https://www.programiz.com/python-programming/methods/list/index">this tutorial</a>).

Find the indices of 'lingual gyrus, striate' and 'superior colliculus' and save them as `LG_index` and `SC_index`, respectively.
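
As a quick sketch of how the `index` method behaves, here it is on a made-up list (`toy_regions` and the other names are hypothetical). It returns the position of the first exact match and raises a `ValueError` if the item is not in the list.

```python
# Made-up list for illustration
toy_regions = ['gene_symbol', 'amygdala', 'thalamus']

# Positions start at 0, so 'thalamus' is at index 2
thalamus_index = toy_regions.index('thalamus')
print(thalamus_index)

# Asking for an item that is not in the list raises ValueError
try:
    toy_regions.index('cerebellum')
    found_cerebellum = True
except ValueError:
    found_cerebellum = False

print(found_cerebellum)
```

This is one reason it helps to confirm membership with `in` before calling `index`.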

In [None]:
### BEGIN SOLUTION
LG_index = brain_regions.index('lingual gyrus, striate')
SC_index = brain_regions.index('superior colliculus')
print(LG_index)
print(SC_index)
### END SOLUTION

In [None]:
# Tests for Q4, worth 5 points.
assert isinstance(LG_index,int)
assert isinstance(SC_index,int)

In [None]:
# Hidden Tests for Q4, worth 10 points.
### BEGIN HIDDEN TESTS
assert LG_index == 125
assert SC_index == 206
### END HIDDEN TESTS

## Q5

Searching for our gene in this dataset is a little tricky, since each row is a different list, but we can do it with a for loop. Let's say we're interested in **DISC1**, <a href="https://www.nature.com/articles/tp2016282">a gene that is associated with schizophrenia</a>.

Write a `for` loop that loops through each row (list) of our data, and checks if the first entry in that list is DISC1. When it finds DISC1, assign the entire list of values (including the DISC1 label) to `DISC1_data`.
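
The same kind of row search can be sketched on a made-up table. The table, the gene names, and the variable names (`toy_data`, `gene2_row`) are all hypothetical; this is one possible way to write the loop, not the only one.

```python
# Made-up table for illustration: a header row followed by one row per gene
toy_data = [
    ['gene_symbol', 'region A', 'region B'],
    ['GENE1', '0.5', '1.2'],
    ['GENE2', '-0.3', '2.0'],
]

gene2_row = None  # stays None if the gene is never found

for row in toy_data:
    # The first entry of each row is the gene name
    if row[0] == 'GENE2':
        gene2_row = row
        break  # stop looking once we have a match

print(gene2_row)
```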

In [None]:
### BEGIN SOLUTION
for i in range(len(gene_data)):
    if gene_data[i][0] == 'DISC1':
        DISC1_data = gene_data[i]
### END SOLUTION

In [None]:
# Tests for Q5, worth 5 points.
assert isinstance(DISC1_data,list)

In [None]:
# Hidden Tests for Q5, worth 10 points.
### BEGIN HIDDEN TESTS
assert DISC1_data[0] == 'DISC1'
assert len(DISC1_data) == 233
### END HIDDEN TESTS

## Q6
Using the indices we saved above, we can now check whether expression of DISC1 is higher in the superior colliculus or in the lingual gyrus (our occipital lobe region).

1. Save the gene expression values for the superior colliculus and the lingual gyrus as `SC_DISC1` and `LG_DISC1`, respectively, by using the indices you saved in the previous step.
2. Check the type of these. If they're not floats, convert each of them into a float (still assigned to `SC_DISC1` and `LG_DISC1`).
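
To see why the type check matters, here is a small sketch on a made-up row (`toy_row` and the other names are hypothetical). Values read from a text file are strings, and strings compare character by character rather than by numeric size.

```python
# Made-up row for illustration; values read from a file are strings
toy_row = ['GENE1', '0.5', '1.2']

raw_value = toy_row[2]
print(type(raw_value))            # a string, not a number

numeric_value = float(raw_value)  # convert so numeric comparisons work
is_greater = numeric_value > 1
print(is_greater)

# Strings are compared character by character, which can mislead
string_comparison = '10' > '9'                 # False, because '1' sorts before '9'
float_comparison = float('10') > float('9')    # True, as expected for numbers
```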

In [None]:
### BEGIN SOLUTION
SC_DISC1 = float(DISC1_data[SC_index])
LG_DISC1 = float(DISC1_data[LG_index])
print(SC_DISC1)
print(LG_DISC1)
### END SOLUTION

In [None]:
# Tests for Q6, worth 5 points
assert isinstance(SC_DISC1,float)
assert isinstance(LG_DISC1,float)

In [None]:
# Hidden tests for Q6, worth 10 points

### BEGIN HIDDEN TESTS
assert SC_DISC1 > 1
assert LG_DISC1 < 1
### END HIDDEN TESTS

## Q7

Given the data points that we have here in `SC_DISC1` and `LG_DISC1`, what could we reasonably claim?

**Note:** Remember that you can indicate your response to a multiple-choice question by assigning a string with your one-letter response to `answer`.

* `A` : superior colliculus has greater expression of DISC1 than other genes
* `B` : superior colliculus has less expression of DISC1 than other genes
* `C` : superior colliculus has greater expression of DISC1 than the lingual gyrus
* `D` : superior colliculus has less expression of DISC1 than the lingual gyrus

In [None]:
### BEGIN SOLUTION
answer = 'C'
### END SOLUTION

In [None]:
# Tests for Q7, worth 5 points (note: includes hidden tests).

assert answer in ['A','B','C','D']

### BEGIN HIDDEN TESTS
assert answer == 'C'
### END HIDDEN TESTS

## Q8

We could also choose brain regions of interest based on high DISC1 expression. Loop through all of the DISC1 values, find the expression values that are greater than **1.5**, and save the names of the corresponding brain areas in a list called `high_DISC1`. In the end, `high_DISC1` should contain the brain areas with DISC1 expression values higher than 1.5 (not the values themselves).

**Note**: Remember that the first value in each list is the name of the gene; you might need to skip it.
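
As a sketch of the general idea, the example below filters a made-up row by a threshold and collects the matching header names. The lists, the threshold of 1.0, and the names (`toy_header`, `toy_row`, `high_regions`) are hypothetical. It relies on the header and the row lining up position by position, which is assumed here.

```python
# Made-up header and row for illustration; positions line up between the two lists
toy_header = ['gene_symbol', 'region A', 'region B', 'region C']
toy_row = ['GENE1', '0.5', '1.2', '2.0']

high_regions = []

# Start at 1 to skip the gene name in position 0
for i in range(1, len(toy_row)):
    if float(toy_row[i]) > 1.0:
        # Save the region name from the same position, not the value itself
        high_regions.append(toy_header[i])

print(high_regions)
```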

In [None]:
### BEGIN SOLUTION
high_DISC1 = []

for i in range(len(DISC1_data)):
    if i == 0:
        continue
    if float(DISC1_data[i]) > 1.5:
        high_DISC1.append(brain_regions[i])
### END SOLUTION

In [None]:
# Tests for Q8, worth 5 points
assert isinstance(high_DISC1,list)

In [None]:
# Hidden tests for Q8, worth 10 points
### BEGIN HIDDEN TESTS
assert len(high_DISC1) == 14
assert high_DISC1[0] == 'cingulum bundle'
### END HIDDEN TESTS