# Part 3 in the continuing saga of Lalama Evans: dictionaries, calculator stuff

You, meaning Lalama, want to study something useful that involves serotonin. That's why you figured it was worth spending all this time writing a parser for csv files from the Jackson website.

You have a hunch, or as we say in science, a "hypothesis".
You know that tests of schizophrenic patients reveal impaired response times to sudden loud noises.
This is measured in a test called the Auditory Startle Response (ASR). 
The ASR is fairly easy to measure in humans and in mice.

Your hunch is that the impaired response time results from dysregulation of the serotonin system, since it is known to contribute to sensory processing.

You think you can find evidence for your hypothesis using a mouse model.

So first you need to find some good candidate mouse strains.

Conveniently, the Jax site has data on ASR in different strains that you can download as a csv file.

Time to parse that file with your code.
Here's one version of the complete function we just wrote that does all the boring stuff for us.

In [None]:
import csv

def parse_jackson_csv(csv_filename,keep_header=False):
    """
    Parses csv files.
    A wrapper around csv.reader that deals with idiosyncrasies
    of csv files from phenome.jax.org.
    Specifically, it discards comments at the top of files and
    removes extra columns in header row (that probably result
    from converting an .xls file to a .csv)
    
    arguments
    ---------
    csv_filename : string, name of csv file
    keep_header : if True, returns header. default is False.
    
    returns
    -------
    csv_file_parsed : list of lists, csv file after
                      parsing by csv.reader
    header : list of strings, header row,
             only returned if keep_header = True
    """
    
    with open(csv_filename,'r') as csv_file_open:
        csv_file = csv_file_open.readlines()
        
    counter = 0
    while True:
        line = csv_file[counter]
        if '//' not in line:
            break 
        counter += 1
    reader = csv.reader(csv_file[counter+1:],delimiter=',')
    parsed_file = list(reader)
    if keep_header:
        header = parsed_file[0]
        header = [col_name
                     for col_name in header
                         if col_name !='']
        return parsed_file[1:], header
    else:
        return parsed_file[1:]

We use it to parse our file of ASR data.

In [None]:
filename = 'Willott1_table-1.csv'
parsed,header = parse_jackson_csv(filename,keep_header=True)

So a natural question to ask would be:
**Which strains have the strongest and the weakest ASR on average?**

But we have a problem to solve before we can answer that.

**The rows of the file are *individual* responses. We need to group them by strain.**

Here, look at the header and a couple of rows:

In [None]:
print(header)
print('')
for row in parsed[:2]:
    print(row)

Notice that each row starts with the strain to which the individual belongs, followed by different measures.

How are we even going to group them?

Should we just use a list?

Maybe we could have another list of lists like we had for our parsed csv file, only this time each item in the list will be a `[string, list]` pair, i.e., the name of the strain, paired with all the ASRs from that strain.

Something like this:
```Python
[['129S1/SvImJ',[0.121,0.211,0.11,0.15]],
 ['A/J',[0.31,0.289,0.33,0.37]],
 ...
 ]
```

But here we don't really care about the order of the strains.

And in fact if we ask the computer to find the ASRs for a certain strain, and it has to search through the whole list in order, this can take a long time, especially if we have a lot of data.
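
To see what "search through the whole list in order" means, here is a minimal sketch. The helper name `find_in_list` is made up for this illustration, and the numbers are the toy values from the example above, not real data.

```python
# Toy data in the [string, list] layout sketched above (illustrative numbers only).
strain_list = [
    ['129S1/SvImJ', [0.121, 0.211, 0.11, 0.15]],
    ['A/J', [0.31, 0.289, 0.33, 0.37]],
]

def find_in_list(strain_list, strain):
    """Walk the list from the start until the strain name matches."""
    for name, values in strain_list:
        if name == strain:
            return values
    return None  # we checked every item and never found the strain

print(find_in_list(strain_list, 'A/J'))
```

With two strains this is instant, but the number of comparisons grows with the length of the list, and the worst case checks every item.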

Seems like it's time for a **dictionary**.

## dictionaries

* a **dictionary**
  - consists of `key`,`value` pairs
  ```Python
  mouse_dict = {
            '129S1/SvImJ' : [0.121,0.211,0.11,0.15],
            'A/J' : [0.31,0.289,0.33,0.37]
                }
  ```
  - you give the dictionary a key, and it returns the value paired with that key
  - the algorithm used to look up keys in a dictionary (hashing) is fast: on average it takes about the same time no matter how many keys there are, which is much faster than checking the items of a list one by one
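
Here is a short sketch of the basic things you can do with a dictionary: look up a key, check whether a key exists, and add a new pair. The values for the first two strains are the toy numbers from above, and the `'C58/J'` entry is made up just to show assignment.

```python
mouse_dict = {
    '129S1/SvImJ': [0.121, 0.211, 0.11, 0.15],
    'A/J': [0.31, 0.289, 0.33, 0.37],
}

a_j_values = mouse_dict['A/J']     # give the dictionary a key, get the value back
has_c58 = 'C58/J' in mouse_dict    # `in` checks the keys; False at this point
mouse_dict['C58/J'] = [0.2]        # assigning to a new key adds a pair (made-up value)
n_strains = len(mouse_dict)        # number of key, value pairs

print(a_j_values, has_c58, n_strains)
```

Asking for a key that is not in the dictionary, e.g. `mouse_dict['nope']`, raises a `KeyError`, which is why checking with `in` first can be useful.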
  
  
Let's write a function that sorts all the rows from our csv file into a dictionary. Each `key` is a strain, and the `value` for that `key` is a list of all the values for that strain from a user-specified column of the csv file.

In [None]:
def make_mouse_dict(parsed_file,key_index,value_index):
    """
    takes csv file of individual animal data
    and sorts values of a column into a dictionary
    where the keys are strains
    
    arguments
    ---------
    parsed_file : list of lists, rows returned by parse_jackson_csv
    key_index : index of column in each row that is the name
                of the mouse strain
    value_index : index of column in each row that is the 
                  value to associate with each strain
                  
    returns
    -------
    mouse_dict: dictionary where keys are strains of mice and
                values are lists, all entries in column of
                interest that correspond to a given strain
    """
    
    mouse_dict = {} # make empty dictionary
    for row in parsed_file:
        key = row[key_index] 
        val = float(row[value_index]) # convert from string to float!
        if key in mouse_dict:
            mouse_dict[key].append(val)
        else:
            mouse_dict[key] = [val]
    return mouse_dict

Note this function doesn't use anything all that new to us, except for the dictionary.
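
As an aside, the "append if the key exists, otherwise start a new list" pattern is common enough that the standard library has a shortcut for it, `collections.defaultdict`. Below is a sketch of the same grouping idea on a few made-up rows; `toy_rows` and `grouped` are names invented for this example, and this is an alternative, not the function we just wrote.

```python
from collections import defaultdict

# Made-up rows: strain name in column 0, a measurement (as a string) in column 1.
toy_rows = [
    ['A/J', '0.31'],
    ['C58/J', '0.20'],
    ['A/J', '0.29'],
]

grouped = defaultdict(list)  # a missing key starts out as an empty list
for row in toy_rows:
    grouped[row[0]].append(float(row[1]))  # convert from string to float

grouped = dict(grouped)  # convert back to a plain dictionary
print(grouped)
```

The explicit `if`/`else` version is arguably easier to read when you are new to dictionaries, so either choice is reasonable.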

Let's call it with `value_index = 12`, the index of the column that contains the value of the ASR at 100 decibels.

In [None]:
mouse_dict = make_mouse_dict(parsed,0,12)

We can call the `keys` method on our `dictionary` to verify that it contains the names of all the mouse strains.

In [None]:
mouse_dict.keys()

Looks like mouse strain names to me.

What about if we give the dictionary a key?

In [None]:
print(mouse_dict['MA/MyJ'])

Looks like ASR values to me.
Cool.

But we want to find out **which strains have the highest ASR on average**.

Let's make another dictionary that will contain the *mean* values for each strain.

We'll write another lil' function to do that.

In [None]:
def compute_mouse_dict_mean(mouse_dict):
    """
    takes mouse_dict returned by make_mouse_dict
    and computes mean for each strain.
    returns mean_mouse_dict.
    """
    mean_mouse_dict = {}
    
    for key,val in mouse_dict.items():
        mean = sum(val) / len(val)
        mean_mouse_dict[key] = mean
    return mean_mouse_dict

So now we can get a dictionary where each `key` is a strain and each `value` is the mean for that strain:

In [None]:
mean_mouse_dict = compute_mouse_dict_mean(mouse_dict)

In [None]:
key = 'C58/J'
print('mean for strain {} is {}'.format(key,mean_mouse_dict[key]))
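
If you have met list comprehensions, the same "one mean per strain" computation can also be written as a dictionary comprehension. This is just a sketch on made-up numbers (`toy_dict` and `toy_means` are invented names), picked so the means are easy to check by hand.

```python
# Made-up values for two strains.
toy_dict = {
    'A/J': [1.0, 3.0],
    'C58/J': [2.0, 4.0, 6.0],
}

# For each strain, pair its name with the mean of its list of values.
toy_means = {strain: sum(vals) / len(vals)
             for strain, vals in toy_dict.items()}

print(toy_means)
```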

Lastly we need a function to figure out which strain has the **highest** average ASR.

In [1]:
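def max_ASR(mean_mouse_dict):
    """
    returns the strain with the highest mean ASR,
    and that mean.
    """
    max_strain = None
    max_val = float('-inf') # so any real mean beats the starting value
    for key,val in mean_mouse_dict.items():
        if val > max_val:
            max_val = val
            max_strain = key
    return max_strain,max_val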

In [None]:
max_strain,max_val = max_ASR(mean_mouse_dict)
print("strain {0} had the maximum ASR, {1}".format(max_strain,max_val))
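
Our original question asked about both the strongest and the weakest strains. One possible shortcut, sketched here on made-up means (`toy_means` is not real data), is to use the built-in `max`, `min`, and `sorted` functions with a `key` argument that tells them to compare the dictionary's values instead of its keys.

```python
# Made-up mean ASR values for three strains.
toy_means = {'A/J': 0.32, 'C58/J': 0.20, 'MA/MyJ': 0.45}

# key=toy_means.get makes max/min compare the values, but return the key.
strongest = max(toy_means, key=toy_means.get)
weakest = min(toy_means, key=toy_means.get)

# Sort the (strain, mean) pairs by mean, highest first.
ranked = sorted(toy_means.items(), key=lambda pair: pair[1], reverse=True)

print(strongest, weakest)
print(ranked)
```

Writing the loop by hand, as above, shows what these built-ins are doing for you.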

## Exercises

1) Write a function that takes `mouse_dict` and `mean_mouse_dict` and returns `stdev_mouse_dict`, a dictionary where the `key`s are strain names and the `value`s are the standard deviations for each strain.

Recall that standard deviation is the square root of the mean of the squared differences from the mean, i.e.,
```Python
from math import sqrt # square root function
# below, ** means 'power of'
mean = mean_mouse_dict[key] # mean for this strain
diffs = [(val - mean)**2 for val in mouse_dict[key]]
var = sum(diffs) / len(diffs)
st_dev = sqrt(var)
```
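
One way to check your answer, sketched below on made-up numbers, is to compare it with `pstdev` from Python's `statistics` module, which computes the same "population" standard deviation (dividing by the number of values). The list `vals` is invented so the arithmetic works out to round numbers.

```python
from math import sqrt, isclose
from statistics import pstdev

vals = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values
mean = sum(vals) / len(vals)                     # 5.0

diffs = [(val - mean)**2 for val in vals]        # squared differences from the mean
var = sum(diffs) / len(diffs)                    # 4.0
st_dev = sqrt(var)                               # 2.0

# isclose allows for tiny floating-point differences
print(st_dev, isclose(st_dev, pstdev(vals)))
```

Note that `statistics.stdev` (without the `p`) divides by one less than the number of values, so it will give a slightly larger answer than the formula above.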