# Basic Probabilities (11 points)

In this question, we will use simple counting to calculate the probabilities of various events.

We will use data from `transcript_table.csv`, which contains information about each annotated transcript in the human genome. (Because of alternative splicing, a single gene can have many transcripts.) Try viewing the data table via the Jupyter file browser (from which you opened this notebook file): you'll see it has columns for transcript name, number of exons, gene type, and transcript length.

Run the cell below to load these data and define some basic functions that you wrote in the previous homework.

In [1]:
exon_counts = []
gene_types = []
transcript_lengths = []
with open('transcript_table.csv') as file:
    header = file.readline()  # skip the column names
    for line in file:
        values = line.strip('\n').split(',')
        exon_counts.append(int(values[1]))  # column 1: number of exons
        gene_types.append(values[2])  # column 2: gene type
        transcript_lengths.append(int(values[3]))  # column 3: transcript length

def mean(values):
    return sum(values) / len(values)

def variance(values):  # population variance (divides by n)
    mean_value = mean(values)
    squared_deviations = []
    for value in values:
        squared_deviations.append((value - mean_value)**2)
    return mean(squared_deviations)

def std(values):  # standard deviation
    return variance(values)**0.5

def median(values):  # median
    sorted_values = sorted(values)
    midpoint = len(sorted_values) // 2
    if len(sorted_values) % 2 == 1: # odd length
        return sorted_values[midpoint]
    else: # even length
        return (sorted_values[midpoint - 1] + sorted_values[midpoint]) / 2

def median_absolute_deviation(values):  # median absolute deviation (MAD)
    median_value = median(values)
    absolute_deviations = []
    for value in values:
        absolute_deviations.append(abs(value - median_value))
    return median(absolute_deviations)

Recall that we have seen two basic types of for loops so far.

First, loops that build a list by transforming elements of another list:
```python
values = [1, 2, 3, 4]
squared_values = []
for value in values:
    squared_values.append(value**2)
```
And second, loops that count:
```python
values = [1, 2, 3, 4]
num_even = 0
for value in values:
    if value % 2 == 0:
        num_even += 1
```

For this problem, we will mostly be using the latter type of loop.

**Note**: Some of the following problems are a bit repetitive. This is intentional! You just have to write the same basic kinds of loops a lot of times to get to the point where you see and think about the whole loop as a single unit of code.


## Simple probabilities via counting
First, write a for-loop below that will count how many transcripts are longer than the mean.

**Hint**: do not re-calculate the mean each time in the body of the for loop! Instead, calculate the mean once, before the loop, and store the result in a variable that you will use within the for loop.

**Hint 2**: do not name this variable `mean`, as that would replace the function that we defined above with the name `mean`...
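
To see why Hint 2 matters, here is a minimal sketch on made-up data. The names `toy_mean`, `toy_lengths`, and `toy_average` are hypothetical stand-ins chosen so that running the snippet does not overwrite the notebook's own `mean` function. Once a function's name is rebound to a number, calling it fails:

```python
def toy_mean(values):
    return sum(values) / len(values)

toy_lengths = [2, 4, 6, 12]  # made-up data, not from transcript_table.csv

toy_average = toy_mean(toy_lengths)  # safe: the function keeps its name

toy_mean = toy_mean(toy_lengths)  # unsafe: toy_mean is now the number 6.0
try:
    toy_mean(toy_lengths)  # calling a float raises TypeError
    shadowing_error = None
except TypeError as error:
    shadowing_error = error

print(toy_average)                     # 6.0
print(type(shadowing_error).__name__)  # TypeError
```

If this happens to `mean` in the notebook, re-running the cell that defines the function restores it.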

In [35]:
larger_than_mean = 0
# YOUR ANSWER HERE
average = mean(transcript_lengths)  # computed once, outside the loop
for value in transcript_lengths:
    if value > average:
        larger_than_mean += 1
print(larger_than_mean)

24154


In [36]:
assert larger_than_mean == 24154

Now divide by the relevant denominator to convert this count into a probability. Store the resulting probability in a variable named `p_larger_than_mean`.

In [37]:
# YOUR ANSWER HERE
# Number of rows in the table (one row per transcript)
arrow_number = len(transcript_lengths)
# Fraction of transcripts that are longer than the mean
p_larger_than_mean = larger_than_mean / arrow_number
print('Percent larger than the mean:', p_larger_than_mean * 100)

Percent larger than the mean: 24.17817817817818


In [38]:
assert 0.24 < p_larger_than_mean < 0.25

Now calculate the probability that a transcript is larger than the mean plus two standard deviations. (Again, don't re-calculate this threshold each time through the for loop: that will take forever!) Store the result in a variable named `p_much_larger_than_mean`.

In [22]:
# YOUR ANSWER HERE
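standard_deviation = std(transcript_lengths)
# Threshold computed once, outside the loop. (For a normal distribution,
# about 95% of values lie within the mean plus or minus 2 standard deviations.)
larger_than_mean_plus_2std = average + 2 * standard_deviation
much_larger = 0
for value in transcript_lengths:  # loop over lengths, not exon counts
    if value > larger_than_mean_plus_2std:
        much_larger += 1
p_much_larger_than_mean = much_larger / arrow_number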
print('Percent much larger than the mean:', p_much_larger_than_mean * 100)

Percent much larger than the mean: 4.1741741741741745


In [23]:
assert 0.041 < p_much_larger_than_mean < 0.043

What's the probability that a transcript is larger than the median transcript length? (Assume that the dataset has an even number of transcripts...) **Hint**: you shouldn't need to write any code to answer this one. Write your answer below.
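
As a quick sanity check of the reasoning, here is a small sketch on made-up lists (`distinct_lengths`, `tied_lengths`, and `fraction_above_median` are hypothetical names, not part of the homework). It suggests that the "no code needed" answer relies on the two middle values being different:

```python
def median(values):
    sorted_values = sorted(values)
    midpoint = len(sorted_values) // 2
    if len(sorted_values) % 2 == 1:  # odd length
        return sorted_values[midpoint]
    return (sorted_values[midpoint - 1] + sorted_values[midpoint]) / 2

def fraction_above_median(values):
    median_value = median(values)  # computed once, outside the loop
    count = 0
    for value in values:
        if value > median_value:
            count += 1
    return count / len(values)

distinct_lengths = [3, 1, 4, 10, 9, 2]  # even count, no ties (made up)
tied_lengths = [1, 2, 2, 5]             # even count, middle values tie (made up)

print(fraction_above_median(distinct_lengths))  # 0.5
print(fraction_above_median(tied_lengths))      # 0.25
```

With a strict `>` comparison, ties at the middle of the sorted list pull the fraction below one half.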

In [24]:
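# YOUR ANSWER HERE
# With an even number of transcripts, the median lies between the two middle
# values, so half of the transcripts are longer than it (assuming those two
# middle values differ).
p_larger_than_median = 0.5
assert p_larger_than_median == 0.5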

## Compound probabilities: *and* and *or*

Calculate the probability that a transcript length is greater than the median **and** less than the mean and store this as `p_between`.

In [45]:
# YOUR ANSWER HERE
median_value = median(transcript_lengths)
mean_value = mean(transcript_lengths)
between_number = 0
for value in transcript_lengths:
    if median_value < value < mean_value:
        between_number += 1
p_between = between_number / len(transcript_lengths)
print('Percent between mean and median:', p_between * 100)

Percent between mean and median: 25.82182182182182


In [46]:
assert 0.25 < p_between < 0.26

Calculate the probability of a transcript length being within one standard deviation of the mean in either direction. Store the low and high thresholds for transcript length in variables `low_thresh` and `high_thresh`, and the probability in `p_near_mean`.

In [39]:
# YOUR ANSWER HERE
# For a normal distribution, about 68% of values lie within one standard
# deviation of the mean.
length_mean = mean(transcript_lengths)  # computed once
length_std = std(transcript_lengths)    # computed once
low_thresh = length_mean - length_std
high_thresh = length_mean + length_std
near_mean_number = 0
for value in transcript_lengths:
    if low_thresh < value < high_thresh:
        near_mean_number += 1
p_near_mean = near_mean_number / len(transcript_lengths)
print('Range: [', low_thresh, ',', high_thresh, ']')
print('Percent within range:', p_near_mean * 100)

Range: [ -55909.60434689818 , 143874.35798053181 ]
Percent within range: 92.990990990991


In [40]:
assert 0.929 < p_near_mean < 0.93

Woah! Two things are a bit odd here, right? First, the low threshold (the mean minus one standard deviation) is *negative*. But a transcript can't have a negative length! That means that the standard deviation is much larger than the mean, which is a result of the fact that there are some *really* long transcripts. And as we can see, outliers have a disproportionate effect on the standard deviation just as they have on the mean.

Second, over 90% of the data are "near" the mean, if we use one standard deviation on either side as a criterion for "near". But the standard deviation is supposed to give something like an "average" distance from the mean. So 90% of the data are below average (in terms of distance from the mean)? Again, this is a result of extreme outliers.
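
Both effects can be reproduced on a tiny made-up list with a single extreme value. The sketch below uses hypothetical names (`toy_values`, `toy_low`, `toy_high`, `toy_fraction_near`) and the same population-style `mean` and `std` definitions as above:

```python
def mean(values):
    return sum(values) / len(values)

def std(values):
    mean_value = mean(values)
    squared_deviations = []
    for value in values:
        squared_deviations.append((value - mean_value) ** 2)
    return mean(squared_deviations) ** 0.5

toy_values = [10, 12, 11, 13, 12, 11, 10, 500]  # seven small values, one outlier

toy_mean_value = mean(toy_values)  # 72.375, larger than seven of the eight values
toy_std = std(toy_values)          # larger than the mean itself
toy_low = toy_mean_value - toy_std
toy_high = toy_mean_value + toy_std

near_count = 0
for value in toy_values:
    if toy_low < value < toy_high:
        near_count += 1
toy_fraction_near = near_count / len(toy_values)

print(toy_low < 0)        # True: the low threshold is negative
print(toy_fraction_near)  # 0.875: only the outlier falls outside
```

One value out of eight is enough to push the low threshold below zero and to put every other value "near" the mean.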

Modify your code above to find out the probability that a transcript length is within $\pm1$ median absolute deviation of the median. (Use $\ge$ or $\le$, not strict inequality, for the comparisons. Recall that in Python, these operators are written `>=` and `<=`, respectively.) Store the result in `p_near_median`.

In [43]:
# YOUR ANSWER HERE
length_median = median(transcript_lengths)                 # computed once
length_mad = median_absolute_deviation(transcript_lengths)  # computed once
low_thresh = length_median - length_mad
high_thresh = length_median + length_mad
near_median_number = 0
for value in transcript_lengths:
    if low_thresh <= value <= high_thresh:
        near_median_number += 1
p_near_median = near_median_number / len(transcript_lengths)
print('Range: [', low_thresh, ',', high_thresh, ']')
print('Percent within range:', p_near_median * 100)

Range: [ 1051.0 , 25148.0 ]
Percent within range: 50.001001001001


In [44]:
assert 0.5 < p_near_median < 0.501

Interesting! Almost exactly half of the data are within $\pm1$ median absolute deviation of the median. Think about this and convince yourself why this is so by definition.
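
One way to convince yourself is to work through a small case by hand. The sketch below uses a made-up list (`toy_values` and the other `toy_` names are hypothetical): the MAD is the median of the absolute deviations, so about half of those deviations are at or below it.

```python
def median(values):
    sorted_values = sorted(values)
    midpoint = len(sorted_values) // 2
    if len(sorted_values) % 2 == 1:  # odd length
        return sorted_values[midpoint]
    return (sorted_values[midpoint - 1] + sorted_values[midpoint]) / 2

def median_absolute_deviation(values):
    median_value = median(values)
    absolute_deviations = []
    for value in values:
        absolute_deviations.append(abs(value - median_value))
    return median(absolute_deviations)

toy_values = [1, 2, 5, 8, 10, 14, 30, 100]  # made up, includes one outlier

toy_median = median(toy_values)                  # 9.0
toy_mad = median_absolute_deviation(toy_values)  # 6.0

within_count = 0
for value in toy_values:
    if toy_median - toy_mad <= value <= toy_median + toy_mad:
        within_count += 1
toy_fraction_within = within_count / len(toy_values)

print(toy_median, toy_mad)  # 9.0 6.0
print(toy_fraction_within)  # 0.5
```

Ties among the deviations can push the fraction slightly above one half, which is consistent with a result just over 50%.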

Last, we will try making some "or" measurements. To switch things up a bit, we will look at several attributes for each transcript at once.

Recall the syntax for stepping through multiple lists in parallel with a for loop:
```python
letters = ['a', 'b', 'c']
numbers = [1, 2, 3]
for letter, number in zip(letters, numbers):
    print(letter, number)
```
This of course prints:
```
a 1
b 2
c 3
```
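
One detail worth keeping in mind: `zip` stops as soon as the shortest list runs out, so parallel lists should have the same length. A minimal sketch, with a deliberately shortened `numbers` list for illustration:

```python
letters = ['a', 'b', 'c']
numbers = [1, 2]  # one element short, on purpose

pairs = list(zip(letters, numbers))
print(pairs)  # [('a', 1), ('b', 2)]: 'c' is silently dropped
```

For the transcript data this is not a concern, as long as every row of the file appended one value to each list.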

Using this sort of loop, calculate the probability that a transcript either is really long (> 200,000 bp) or has a lot of exons (> 20). Store it in `p_huge`.

In [50]:
# YOUR ANSWER HERE
huge_number = 0
for transcript, exon in zip(transcript_lengths, exon_counts):
    if transcript > 200000 or exon > 20:
        huge_number += 1
p_huge = huge_number / len(transcript_lengths)
print('Percent of huge transcripts:', p_huge * 100)

Percent of huge transcripts: 9.653653653653652


In [51]:
assert 0.096 < p_huge < 0.097

Now we will verify our formula for "or": $$P(A \mathbin{or} B) = P(A) + P(B) - P(A \mathbin{and} B)$$
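
The formula can be checked by hand on a toy example before applying it to the real data. In the sketch below, the two lists and the thresholds (200 and 20) are made up for illustration. The subtraction removes the items that were counted twice, once under each condition:

```python
toy_lengths = [50, 300, 120, 400, 80, 250]  # made-up "lengths"
toy_exons = [3, 25, 30, 5, 2, 40]           # made-up "exon counts"

num_long = 0
num_many = 0
num_both = 0
num_either = 0
for length, exons in zip(toy_lengths, toy_exons):
    if length > 200:
        num_long += 1
    if exons > 20:
        num_many += 1
    if length > 200 and exons > 20:
        num_both += 1
    if length > 200 or exons > 20:
        num_either += 1

print(num_long, num_many, num_both, num_either)  # 3 3 2 4
print(num_either == num_long + num_many - num_both)  # True
```

Dividing every count by the same total (here 6) turns the identity on counts into the identity on probabilities.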

Below, calculate the relevant probabilities `p_long_transcript` (length > 200,000), `p_many_exons` (> 20), and `p_both`, the probability that a transcript is long *and* has many exons. (Try to write just one for loop...)

In [54]:
# YOUR ANSWER HERE
long_transcripts_number = 0
many_exons_number = 0
both_number = 0
for transcript, exon in zip(transcript_lengths, exon_counts):
    if transcript > 200000 and exon > 20:
        both_number += 1
    if transcript > 200000:
        long_transcripts_number += 1
    if exon > 20:
        many_exons_number += 1

p_long_transcript = long_transcripts_number / len(transcript_lengths)
p_many_exons = many_exons_number / len(exon_counts)
p_both = both_number / len(transcript_lengths)
print('Percent with long transcripts:', p_long_transcript * 100)
print('Percent with many exons:', p_many_exons * 100)
print('Percent with both:', p_both * 100)
calculated_or = p_long_transcript + p_many_exons - p_both
print('Percent long transcripts or many exons:', calculated_or * 100)

Percent with long transcripts: 4.34034034034034
Percent with many exons: 6.731731731731731
Percent with both: 1.4184184184184185
Percent long transcripts or many exons: 9.653653653653652


In [55]:
assert 0.043 < p_long_transcript < 0.044
assert 0.067 < p_many_exons < 0.068
assert 0.014 < p_both < 0.015

## Independence
Last, let's examine the definition of independence:
$$P(A \mathbin{and} B) = P(A) \cdot P(B) \iff \textrm{A and B are independent}$$
(Where $\iff$ means "if and only if".)
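
As an illustration of the definition, consider one roll of a fair die, an example that is not part of the homework data. The sketch below uses `fractions.Fraction` so that the comparison is exact; the helper `probability` and the event functions are hypothetical names.

```python
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]  # one fair die, all outcomes equally likely

def probability(event):
    """Probability of an event (a True/False function) by counting outcomes."""
    count = 0
    for outcome in outcomes:
        if event(outcome):
            count += 1
    return Fraction(count, len(outcomes))

def is_even(n):
    return n % 2 == 0

def is_small(n):
    return n <= 2

def is_large(n):
    return n >= 4

p_even = probability(is_even)    # 1/2
p_small = probability(is_small)  # 1/3
p_large = probability(is_large)  # 1/2
p_even_and_small = probability(lambda n: is_even(n) and is_small(n))  # 1/6
p_even_and_large = probability(lambda n: is_even(n) and is_large(n))  # 1/3

print(p_even_and_small == p_even * p_small)  # True: independent
print(p_even_and_large == p_even * p_large)  # False: 1/3 is not 1/4
```

With probabilities estimated from data as floats, exact equality is rarely meaningful, so it is more useful to compare how far the observed value is from the product.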

We will use this to test if having many exons is independent of having a long transcript. Calculate `p_expected_both`, the probability that both of these events would be true if they were independent.

In [66]:
# YOUR ANSWER HERE
# If the two events were independent, P(A and B) would equal P(A) * P(B).
p_expected_both = p_long_transcript * p_many_exons
print('Expected percent with both:', p_expected_both * 100)

Expected percent with both: 0.09548412276140003


In [68]:
assert 0.0009 < p_expected_both < 0.001

So we can see that we encounter long transcripts with many exons almost ten times more frequently than we would expect if there were no relationship between transcript length and exon count... Thus (as we expect) these are not independent properties of a transcript.