# Problem 0: Measures of Location (27 points)

In this problem, we will write code to calculate the mean, median, and mode of different data sets, and explore how those measures are influenced by outliers in the data.

First, let's define some data sets. Run the cell below to define the lists `toy_data` and `exon_lengths`.

In [None]:
toy_data = [1, 1, 2, 4, 32] # same as our in-class example

# now read in the exon-length data:
exon_lengths = []
with open('gene_table.txt') as file:
    header = file.readline()  # skip the header row
    for line in file:
        values = line.strip('\n').split('\t')
        exon_lengths.append(int(values[3]))

## Medians (5 points)

Let's write a function to calculate the median. The procedure for finding the median is:
  - sort the data
  - if there are an odd number of data points, pick the middle one
  - otherwise, choose the number halfway between the middle two

There are several useful built-in Python functions that will help with the above. First, sorting. The `sorted` function can be applied to a list just like `len` or `sum`:
```python
ordered_list = sorted([3,4,1,2])
# the below will raise an error if ordered_list is not [1,2,3,4]
assert ordered_list == [1,2,3,4] 
```
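
One detail worth checking for yourself: `sorted` returns a new list and leaves its argument untouched. A minimal sketch (the names `original` and `ordered` are just for illustration):

```python
original = [3, 4, 1, 2]
ordered = sorted(original)

print(ordered)   # a new, sorted list
print(original)  # the input list keeps its original order
```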

Next, how do we find out if a number (like the length of a list) is odd or even? The easiest way is with the "modulus" operator `%`, otherwise known as the [remainder](https://en.wikipedia.org/wiki/Remainder).

Some background: we know the standard math operators: `+`, `-`, `*`, and `/`. Python provides two useful additional operators: `//` and `%`. The first, `//`, does integer division (otherwise known as "floor division"): `a//b` gives you the value of `a/b` rounded down to the nearest integer. This tells you the number of whole times that `b` will fit into `a`. For example, you can fit two whole 4s into 9, but not three. So `9//4` gives `2`. Similarly, `10//4` also gives `2`. But `12//4` gives `3`.

The modulus operator `%` gives the remainder that's left over from such a division. `12%4` gives 0 because 12 is an integer multiple of 4. On the other hand `9%4` is 1, because the remainder left over after dividing 9 into 4 equal integer-size pieces is 1. For the math inclined, there is a simple invariant:

`b == a*(b//a) + b%a`

I.e. "**a** multiplied by the number of times that **b** can be evenly divided by **a**, plus the remainder of that division, is **b**." (Try it with 9 and 4: `9 == 4*2 + 1`.)
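
If you would like to see the invariant hold for more than one example, here is a small sketch that checks it for a few pairs. The particular `(b, a)` pairs are arbitrary choices for illustration:

```python
# Check the invariant b == a*(b//a) + b%a for a few illustrative pairs.
pairs = [(9, 4), (10, 4), (12, 4), (7, 2)]  # (b, a) pairs, chosen arbitrarily
for b, a in pairs:
    quotient = b // a   # how many whole times a fits into b
    remainder = b % a   # what is left over
    assert b == a * quotient + remainder
    print(b, '//', a, 'is', quotient, 'and', b, '%', a, 'is', remainder)
```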

So, how do we use this to tell if a number is odd or even? Well, even numbers divide by two with no remainder! If `a%2 == 0`, then we know that `a` is even.
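
As a quick illustration of the even/odd test, the sketch below wraps it in a small helper. The name `is_even` is made up for this example and is not used elsewhere in the problem:

```python
def is_even(n):
    # An even number leaves no remainder when divided by two.
    return n % 2 == 0

for n in [4, 5, 100, 101]:
    if is_even(n):
        print(n, 'is even')
    else:
        print(n, 'is odd')
```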

Finally, how do we pick middle elements from a list? Let's start with the odd-sized case. Let's say we have a list `[0,1,2,3,4]` of length 5. The middle element, `2`, is at position `2`. (I.e. there are two values to the left of position 2, and two values to the right.) Do you know an operation that will take 5 and give 2? How about for a length-7 list? We would want position 3. For a length-101 list, position 50 would be right in the middle. (Hint: it's one of the two new operators described above...)

How about the even-size case? Let's start with the list `[0,1,2,3,4,5,6,7]`, of length 8. The middle positions are 3 and 4. For a list of length 100, the middle positions would be 49 and 50. (There are 49 list entries to the left of position 49, and 49 entries between positions 51 and 99, inclusive.) Once we know the 'left-middle' and 'right-middle' values, we just need to get the number halfway between those two values, and that's the median.

Important note: you cannot access list values using floating-point (i.e. real) numbers. Though for math, the integer `5` is equivalent to the floating-point `5.0`, only the first of the below works:
```python
my_list = [1,2,3,4,5]
my_list[4] # OK
my_list[4.0] # No!
```

Now, also note that the regular division operator `/` *always* returns floating-point values. `8/2` gives `4.0`, not `4`. In contrast, `8//2` gives `4` as an integer. However, `8.0//2` will still give `4.0`. When calculating positions in a list, always use integer division, and if you might ever be using a floating-point number as input, then make sure to convert the output back to a true integer with `int(8.0//2)`. 
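
The following sketch shows the types that each kind of division produces, and the `int(...)` conversion in use. The list contents and the name `position` are only for illustration:

```python
print(8 / 2, type(8 / 2))        # regular division gives a float
print(8 // 2, type(8 // 2))      # floor division of two ints gives an int
print(8.0 // 2, type(8.0 // 2))  # a float input makes the result a float

position = int(8.0 // 2)         # convert back to a true integer before indexing
my_list = [10, 20, 30, 40, 50]
print(my_list[position])         # indexing works because position is an int
```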

Try to fill in the code below, using this information.

In [None]:
def median(values):
    # First, sort values into a new, ordered list called ordered_values:
    # YOUR ANSWER HERE
    
    n = len(ordered_values)
    # Next, use if and else to save the median value in a variable named median,
    # in the cases that n is odd vs. n is even.
    # Don't worry about handling errors if an empty list is provided (i.e. n is 0)
    # YOUR ANSWER HERE
    return median

print(median([1,2,3]))
print(median([1,2,3,4]))

In [None]:
# Run this cell to test your answers. If there is no error, you'll get full credit!
assert median([1,2,3,4]) == 2.5
assert median([1,2,3,4,5]) == 3
assert median(toy_data) == 2
assert median(exon_lengths) == 4

## The median property (5 points)

In class, we showed that the median minimizes $\sum_i|S-d_i|$, but we never actually calculated this value. Below, we'll calculate this sum at several different positions, so you can convince yourself that the median really is the position with the smallest total absolute distance to the data points.

First, write a function to calculate this summation. The classic way to calculate a running sum is to initialize a variable to zero, then loop through a list, adding each value to the running sum. For example, if you wanted to replace the built-in function `sum` (though don't! `sum` is faster):
```python
def my_sum(values):
    running_sum = 0
    for value in values:
        running_sum = running_sum + value
    return running_sum
```
Note that a shortcut for `a = a + b` is simply `a += b`. That syntax means "update a by adding to it the value of b". (Of course, `-=`, `*=`, &c., are also available.)

Finally, note that the built-in function `abs` can be used to calculate the absolute value: `abs(-1) == 1`
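
Here is a short sketch of the update operators and of `abs` in action. The starting value and the name `total` are arbitrary:

```python
total = 10
total += 5   # same as total = total + 5
total -= 3   # same as total = total - 3
total *= 2   # same as total = total * 2
print(total)

# abs gives the distance from zero, so abs(a - b) is the distance between a and b
print(abs(-7), abs(7), abs(3 - 10))
```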

In [None]:
def sum_abs_distance(S, values):
    running_sum = 0
    # YOUR ANSWER HERE
    return running_sum

print(sum_abs_distance(5, [1,10]))
print(sum_abs_distance(5, [1,3,5,7,9]))

In [None]:
assert sum_abs_distance(0, [4]) == 4
assert sum_abs_distance(0, [-4]) == 4
assert sum_abs_distance(0, [-4, 4]) == 8
assert sum_abs_distance(4, exon_lengths) == 729956

In math, we would say, "the median minimizes $\sum_i|S-d_i|$". In Python, the same statement would be expressed as: "`sum_abs_distance(S, values)` will be at a minimum when `S = median(values)`".

Let's test if that is the case! Below is a list of values. Note that the median of these values is 4. There is also a list of candidate `S` values, `S_vals`. For each of those possible `S` values, you will need to calculate `sum_abs_distance(S, values)`, and see at what `S` that sum is minimized.

The easiest way to do this will be to make a new list, `abs_distance_sums`, which will contain `sum_abs_distance(S, values)` for each different `S` in `S_vals`. Write a for loop to do this.

In [None]:
values = [-2, 1, 2, 6, 48, 100]
S_vals = [-10, 0, 1, 2, 4, 4.5, 6, 8, 50]

abs_distance_sums = []
# YOUR ANSWER HERE

for S, abs_distance_sum in zip(S_vals, abs_distance_sums):
    print(S, ':', abs_distance_sum)

In [None]:
assert len(abs_distance_sums) == len(S_vals)
assert abs_distance_sums[3] == abs_distance_sums[4] == abs_distance_sums[5] == abs_distance_sums[6]
assert min(abs_distance_sums) == sum_abs_distance(median(values), values)

Note that the distance sum is at a minimum everywhere between 2 and 6. As discussed above, the convention is that the median is the middle-most position, but as you can see, any value in between the two middle-most data points has the same property.
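
To see why the sum is flat on this interval, take any $S$ with $2 \le S \le 6$ and the `values` list above. Three data points ($-2$, $1$, $2$) lie at or below $S$ and three ($6$, $48$, $100$) lie at or above it, so each absolute value can be written out without the bars:

$$\sum_i |S-d_i| = (S+2) + (S-1) + (S-2) + (6-S) + (48-S) + (100-S) = 153$$

The three $+S$ terms cancel the three $-S$ terms, so the total does not depend on $S$ anywhere between the two middle data points.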

## Means (5 points)

First, write a function to calculate the mean value of a list. Just use built-in functions, no for-loops.

In [None]:
def mean(values):
    # YOUR ANSWER HERE

print(mean(toy_data))
print(mean(exon_lengths))

In [None]:
assert mean([1,2,3,4]) == 2.5
assert mean([0,0,0,0,0,1]) == 1/6

Now, let's examine how the mean versus the median changes in the presence of outliers.

One classic way to deal with outliers is to "trim" a dataset. Each time you have an outlier, instead of just ignoring it (bad idea!), you set the value to some pre-specified maximum. So, for example, to trim our `exon_lengths` dataset, we could declare that genes with over 10 exons are "outliers", and so any time we see more than 10 exons in the list, we'll just replace that outlier value with 10.

Let's do this: use a for loop to make a new version of `exon_lengths` called `trimmed_lengths`. For each count in `exon_lengths`, `trimmed_lengths` should contain the count of exons if there are 10 or fewer exons, otherwise it should just contain 10.

Within the for loop, you could either use an `if` statement to choose what value to put in the new list, or if you want style points think about how you could use the built-in `min` function. (The `min` function can either be applied to a list as in `min(toy_data)`, or to two or more separate values as in `min(3, 5)`.)
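
In case the two calling styles of `min` are unfamiliar, here is a brief sketch of both, using made-up numbers:

```python
print(min([1, 1, 2, 4, 32]))  # smallest element of a list
print(min(3, 5))              # smaller of two separate values
print(min(3, 5, 2))           # works for more than two values, too
```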

Examine the difference in the mean values between the trimmed and untrimmed data. Compare this to the difference in the medians...

After creating `trimmed_lengths`, write a second for loop that counts the number of genes with more than 10 exons.

In [None]:
trimmed_lengths = []
# Now fill in the list
# YOUR ANSWER HERE
print('Means:', mean(exon_lengths), mean(trimmed_lengths))
print('Medians:', median(exon_lengths), median(trimmed_lengths))

num_outliers = 0
# Now count the outliers
# YOUR ANSWER HERE
print('Percent outliers:', 100 * num_outliers / len(exon_lengths))

In [None]:
assert len(trimmed_lengths) == len(exon_lengths)
assert max(trimmed_lengths) == 10
assert median(trimmed_lengths) == median(exon_lengths) == 4
assert 4.9 < mean(trimmed_lengths) < 5
assert num_outliers == 28504

## Modes (12 points)

Last, let's write a function to calculate the mode of a dataset. This will involve using a dictionary to keep a running count of how many times a particular value has been seen.

Remember, to test whether a key called `my_key` is in a dictionary called `my_dict`, use `if my_key in my_dict`. You can use `not in` to test the opposite.
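
A small sketch of both tests, with a throwaway dictionary. Note that `in` checks the keys of a dictionary, not its values:

```python
my_dict = {'a': 1, 'b': 2}
my_key = 'a'

if my_key in my_dict:
    print(my_key, 'is a key, with value', my_dict[my_key])

if 'z' not in my_dict:
    print('z is not a key')

# 1 is a value in my_dict, but not a key, so this membership test is False
print(1 in my_dict)
```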

Think carefully about how to update a dictionary entry for a specific key.
```python
my_dict = {5: 0}
print(my_dict[5])
value = my_dict[5]
print(value)
value += 1
print(value)
print(my_dict[5])
```
What do you think the last line will print? Zero or one? Try this below.


In [None]:
my_dict = {5: 0}
print(my_dict[5])
value = my_dict[5]
print(value)
value += 1
print(value)
print(my_dict[5])

What?

So, the line: `value = my_dict[5]` meant that both `value` and `my_dict[5]` are different "names" for the number 0. When we do `value += 1`, we don't somehow change the number 0 into the number 1. Instead, `value` just becomes a synonym for the number 1, while `my_dict[5]` stays as a synonym for the number 0.

If we want to update the value in the dictionary, we could do any of the following:
```python
my_dict[5] = value + 1
my_dict[5] = my_dict[5] + 1
my_dict[5] += 1
```

**Aside:** note that the situation is a little different for "containers" like lists and so forth. Let's make a dictionary where the key 5 maps to a value that is a list.
```python
my_dict = {5: [1, 2, 3]}
value = my_dict[5]
value.append(4)
print(value)
print(my_dict[5])
```
What do you think will happen? Try it below.

In [None]:
my_dict = {5: [1, 2, 3]}
value = my_dict[5]
value.append(4)
print(value)
print(my_dict[5])

The difference here is that the list `[1, 2, 3]` is a specific *thing* (we call it an "object"), that lives somewhere in your computer's memory. After we run the line `value = my_dict[5]`, then both `value` and `my_dict[5]` are each synonyms for the exact same object. Using something like `append` modifies that specific object in memory. Then when we look up that object in memory, regardless of whether we do so using the name `value` or the name `my_dict[5]`, we get the same, modified thing.

If you want to make a new copy of the list to modify, without changing the original list in-place, then pass it to `list` first:
```python
value = list(my_dict[5]) # make a new copy, rather than just point the name 'value' at the same list
value.append(4) # doesn't change the list pointed to by my_dict[5]
```
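
One way to check which names refer to the same object is the `is` operator. The sketch below compares a plain second name with a copy. The names `alias` and `copied` are just for illustration:

```python
my_dict = {5: [1, 2, 3]}

alias = my_dict[5]         # a second name for the same list object
copied = list(my_dict[5])  # a new list with the same contents

print(alias is my_dict[5])   # True: same object
print(copied is my_dict[5])  # False: a different object

copied.append(4)
print(copied)      # [1, 2, 3, 4]
print(my_dict[5])  # [1, 2, 3]
```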

**Anyhow!** From the above, you now know how to modify a number in a dictionary:
```python
my_dict = {}
my_dict['hello'] = 0
my_dict['hello'] += 1
```

Use this to write a function that will count the number of times each value shows up in a list. Remember that if a key isn't already in the dictionary, you need to add it first before using `+=` to increase the count.

In [None]:
def count_data(data):
    counts = {}
    # YOUR ANSWER HERE
    return counts

print(count_data([1,1,1,2,2,2,2,3]))
print(count_data(toy_data))

In [None]:
exon_counts = count_data(exon_lengths)
assert exon_counts[1] == 17748
assert 0 not in exon_counts
assert 200 not in exon_counts
assert exon_counts[74] == 6

Now write a function to calculate the mode, using the `count_data` function.

Basically, we need to find the key in our `counts` dictionary that is associated with the largest value. There is a way to step through the keys and values of a dictionary at once:
```python
my_dict = {'a': 1, 'b': 2, 'c': 3}
for key, value in my_dict.items():
    print(key, value)
```

Try this below:

In [None]:
my_dict = {'a': 1, 'b': 2, 'c': 3}
for key, value in my_dict.items():
    print(key, value)

So to find the mode, we need to step through the data and counts in the dictionary. We will keep track of the current largest count value. For each new pair of datum and its count, see if that count is larger than the current largest. If so, then we'll save the associated datum value as the current best candidate for the mode. After the loop is finished, this will be our modal value. (Right now we won't worry about multi-modal distributions.)

In [None]:
def mode(data):
    counts = count_data(data)
    largest_count = 0
    for datum, count in counts.items():
        # if the count is larger than the largest count:
        # update the largest count to the new value, and save the
        # value of datum in a variable named mode
        # YOUR ANSWER HERE
    return mode

print(mode([1,1,1,2,2,2,2,3]))
print(mode(toy_data))
print(mode(exon_lengths))
# verify that 2 is indeed the mode by printing the number of exons with a few different lengths
print(exon_counts[1], exon_counts[2], exon_counts[3], exon_counts[4])

In [None]:
assert mode(exon_lengths) == 2
assert mode([1,1,1,2]) == 1
assert mode([1,2,3,2]) == 2

How would we make our code handle multi-modal distributions? Easy. Instead of keeping a single value for `mode`, make a list of `modes`. If we encounter a count that is strictly greater than the largest previous count seen, then set `modes` to a one-element list containing just the new datum. If we encounter a count that is equal to the largest previous, then append the new datum to the `modes` list.

In [None]:
def modes(data):
    counts = count_data(data)
    largest_count = 0
    for datum, count in counts.items():
        # YOUR ANSWER HERE
    return modes

print(modes([0,1,1,2,2,3]))
print(modes(exon_lengths))

In [None]:
assert modes([0,1,1,2,2,3]) == [1, 2]
assert modes(exon_lengths) == [2]