# Histograms (13 points)

In this problem, we will draw simple histograms.

First, run the cell below to load some data.

In [None]:
gene_lengths = []
exon_lengths = []
# the with-statement closes the file automatically, even if an error occurs
with open('gene_table.txt') as file:
    header = file.readline()  # read (and skip past) the header row
    for line in file:
        values = line.strip('\n').split('\t')
        gene_lengths.append(int(values[2]) - int(values[1]))
        exon_lengths.append(int(values[3]))

First, we need to calculate what bin a given data value should go into. Let's start by using 10-unit wide bins, starting at zero. So the bins will be:
```
[0, 10)
[10, 20)
[20, 30)
[30, 40)
[40, 50)
...
```

Note that the bins are specified in interval notation, where $[$ means the bound is inclusive and $)$ means it is exclusive. So $10$ goes into the bin $[10,20)$, but $20$ goes into the bin $[20,30)$.

We will number the bins like we do Python list indices, so bin 0 is `[0, 10)`, bin 4 is `[40, 50)`, etc.

Write a function, `bin_index`, that calculates the bin index for any non-negative number. Remember, any number from 0 up to (but not including) 10 should be mapped to the value 0. Any number from 10 up to (but not including) 20 should go to 1, and so forth. Hint: the floor division operator `//` will be very helpful here!

Also, note that the input values might sometimes be floating-point numbers. Although `//` always gives a whole number, with floating-point input the result is a whole floating-point number (e.g. `4.0`) rather than a true integer (e.g. `4`). To solve this, use `int` to convert the output to an integer before returning it. This is important because using `4.0` to index into a list raises an error, while using `4` does not.
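
As a quick illustration of these two points, the snippet below shows how `//` behaves with integer and floating-point inputs, and why the result needs `int` before it can be used as a list index. The `counts` list and the other variable names are made up for this example and are not part of the assignment data.

```python
# Floor division discards the fractional part of the quotient.
int_result = 25 // 10        # 2 (an int)
float_result = 25.5 // 10    # 2.0 (a float, even though it is a whole number)

counts = [0, 0, 0]           # hypothetical list of three bin counts

# Indexing with a float raises a TypeError...
try:
    counts[float_result] += 1
    float_index_failed = False
except TypeError:
    float_index_failed = True

# ...but converting to int first works.
counts[int(float_result)] += 1

print(int_result, float_result, float_index_failed, counts)
```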


In [None]:
def bin_index(value):
    # YOUR ANSWER HERE

print('0:', bin_index(0))
print('5:', bin_index(5))
print('10:', bin_index(10))
print('25:', bin_index(25))
print('52145:', bin_index(52145))

In [None]:
assert bin_index(0) == 0
assert bin_index(5) == 0
assert bin_index(25) == 2
assert bin_index(34434) == 3443
assert bin_index(5.5) == 0 and type(bin_index(5.5)) is int

The above assumed a fixed bin width of 10 and a fixed minimum value of 0. Generalize the code to use an arbitrary bin width, and to subtract off a specified minimum value before computing the index.

So for example, with a bin width of 20 and a minimum value of 100, the data point 105 should go to bin 0, and the data point 125 should go to bin 1. Don't worry about dealing with cases where the provided value is less than the minimum. And again, don't forget to convert to `int` before returning the index.
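
To see why subtracting the minimum works, it can help to work this example by hand. The sketch below only reproduces the arithmetic for the two data points mentioned above; the variable names (`offset`, `indices`, and so on) are illustrative choices rather than anything the assignment requires.

```python
minimum = 100
bin_width = 20

indices = []
for value in (105, 125):
    offset = value - minimum          # distance above the minimum: 5, then 25
    index = int(offset // bin_width)  # whole bin widths that fit: 0, then 1
    indices.append(index)
    print(value, '-> offset', offset, '-> bin', index)
```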

In [None]:
def ranged_bin_index(value, minimum, bin_width):
    # YOUR ANSWER HERE

print(ranged_bin_index(105, 100, 20))
print(ranged_bin_index(125, 100, 20))
print(ranged_bin_index(-50, -100, 10))
print(ranged_bin_index(-51, -100, 10))

In [None]:
assert ranged_bin_index(105, 100, 20) == 0
assert ranged_bin_index(125, 100, 20) == 1
assert ranged_bin_index(-50, -100, 10) == 5
assert ranged_bin_index(-51, -100, 10) == 4
assert ranged_bin_index(-49, -100, 5) == 10
assert type(ranged_bin_index(-49.5, -100, 5)) is int

Now we will write a function to generate a histogram! It will take a set of input values, plus a minimum value, a maximum value, and a number of bins.

You will have to calculate the bin width, then make a new list containing a zero for every bin. Then step through the data, and if the data value is less than or equal to the max and greater than or equal to the min, increment the count in the correct bin (e.g. `bin_counts[index] += 1`).
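
The setup steps might look something like the sketch below. It assumes the bin width is the range from minimum to maximum divided evenly among the bins; the text does not spell out a formula, so treat that as one reasonable reading. All names prefixed with `demo_` are hypothetical.

```python
demo_min, demo_max, demo_n_bins = 1, 10, 9

# Assumed definition: split the range from min to max into equally wide bins.
demo_width = (demo_max - demo_min) / demo_n_bins   # 9 / 9 = 1.0

# Multiplying a list repeats its elements, giving one zero per bin.
demo_counts = [0] * demo_n_bins

# Incrementing one entry leaves the others untouched.
demo_counts[3] += 1

print(demo_width, demo_counts)
```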

Note that while every bin below the last one is defined as a half-open interval `[bin_min, bin_max)`, the last bin must be the closed interval `[bin_min, bin_max]`. That is, our `ranged_bin_index` function above will give the wrong bin index for values that are exactly at our maximum: it will say to go to the next bin up, but we don't want a whole extra bin just to count the values that sit at the exact maximum. So you will need to write special-case code for when a data value is exactly the maximum value.
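
To make the edge case concrete, here is a small sketch with five bins covering 0 to 50. The helper `demo_index` is a hypothetical stand-in that simply repeats the subtract-then-floor-divide arithmetic, and the fix shown at the end is only one possible approach, not necessarily the intended one.

```python
def demo_index(value, minimum, bin_width):
    # Hypothetical stand-in for the subtract-then-floor-divide calculation.
    return int((value - minimum) // bin_width)

hist_min, hist_max, n_bins = 0, 50, 5
bin_width = (hist_max - hist_min) / n_bins            # 10.0

below_max = demo_index(49, hist_min, bin_width)       # 4, the last valid bin
at_max = demo_index(hist_max, hist_min, bin_width)    # 5, one past the end

# One possible fix (an assumption): send a value equal to the maximum
# to the last bin instead of the nonexistent bin after it.
fixed = n_bins - 1 if at_max == n_bins else at_max

print(below_max, at_max, fixed)
```

Without the special case, `bin_counts[at_max]` would raise an `IndexError` here, because a list with five entries only has indices 0 through 4.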

In [None]:
def histogram(values, hist_min, hist_max, n_bins):
    # make sure we have sane input:
    assert hist_max > hist_min
    
    # now, calculate the bin width and store it in a variable bin_width.
    # YOUR ANSWER HERE
    
    # now make a list with n_bins entries, filled with zeros
    bin_counts = [0] * n_bins # silly python trick: you can multiply lists to repeat elements.
    # Try it out in a different notebook...
    
    # now loop through the data values and increment the bin_counts values.
    for value in values:
        # make sure to skip out-of-range values, and to do the right thing when 
        # the value equals hist_max
        # YOUR ANSWER HERE
    return bin_counts

print(histogram(exon_lengths, 1, 10, 9))
print(histogram(exon_lengths, 1, 10, 4))
print(histogram(gene_lengths, 1, 25000, 15))
print(histogram(gene_lengths, 25000, max(gene_lengths), 15))

In [None]:
assert histogram(exon_lengths, 1, 10, 9)[3] == 26430
assert sum(histogram(exon_lengths, 1, 10, 4)) == sum(histogram(exon_lengths, 1, 10, 9))
assert sum(histogram(gene_lengths, 0, max(gene_lengths), 100)) == len(gene_lengths)
assert histogram(gene_lengths, 25000, max(gene_lengths), 15)[0] == 51435