# Conditional Probability (8 points)

In this problem we will explore conditional probability, or the probability of A given that B is also the case: $P(A \mid B)$.

Recall that in class we defined conditional probability: $P(A \mid B) = P(A \mathbin{and} B) \mathbin{/} P(B)$.

Rearranging this, we have, equivalently, $P(A \mid B) \cdot P(B) = P(A \mathbin{and} B)$.

That should make sense: If 1/2 of all genetics students study cancer (that is, $P(\textrm{studying cancer} \mid \textrm{genetics student}) = 0.5$), and 1/4 of all students study genetics (i.e. $P(\textrm{genetics student}) = 0.25$), then 1/8 of all students are genetics students studying cancer (i.e. $P(\textrm{studying cancer} \mathbin{and} \textrm{genetics student}) = 0.125$).
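
As a quick arithmetic check, the student example above can be reproduced in a few lines. This is only a sketch using the fractions stated in the text; the variable names are made up for illustration.

```python
# Fractions taken from the example above (stated values, not measured data).
p_cancer_given_genetics = 0.5   # P(studying cancer | genetics student)
p_genetics = 0.25               # P(genetics student)

# Product rule: P(A and B) = P(A | B) * P(B)
p_cancer_and_genetics = p_cancer_given_genetics * p_genetics
print(p_cancer_and_genetics)

# Dividing the joint probability by P(B) recovers the conditional probability.
p_recovered = p_cancer_and_genetics / p_genetics
print(p_recovered)
```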

Run the cell below to load the dataset and define `mean`.

In [2]:
exon_counts = []
gene_types = []
transcript_lengths = []
with open('transcript_table.csv') as file:
    header = file.readline()  # skip the header row
    for line in file:
        values = line.strip('\n').split(',')
        exon_counts.append(int(values[1]))
        gene_types.append(values[2])
        transcript_lengths.append(int(values[3]))

def mean(values):
    return sum(values) / len(values)

First, let's make a list of the transcript lengths of *just* the microRNAs. (The `gene_types` list has `miRNA` as the entry for microRNA transcripts.) To do this, step through both the `gene_types` list and the `transcript_lengths` list, using `zip`, and add to the `miRNA_tx_lens` list only when the gene type is a microRNA:

In [3]:
miRNA_tx_lens = []
# YOUR ANSWER HERE
for gene_type, length in zip(gene_types, transcript_lengths):
    if gene_type == 'miRNA':
        miRNA_tx_lens.append(length)  # lengths were already converted to int on load

print('number of annotated microRNAs:', len(miRNA_tx_lens))
print('mean miRNA transcript length:', mean(miRNA_tx_lens))

number of annotated microRNAs: 3197
mean miRNA transcript length: 88.50547388176416


In [4]:
assert len(miRNA_tx_lens) == 3197
assert miRNA_tx_lens[1233] == 88

Now, use `miRNA_tx_lens` to calculate the probability that a given microRNA transcript is less than 70 nucleotides. Save this as `p_short_miRNA`. 

**Hint**: Remember that the relevant denominator here is the number of miRNA transcripts, not the total number of transcripts. 

In [5]:
# YOUR ANSWER HERE
short_miRNA_number = 0
for length in miRNA_tx_lens:
    if length < 70:
        short_miRNA_number += 1
p_short_miRNA = short_miRNA_number / len(miRNA_tx_lens)
print('Percent of microRNAs that have short transcripts:', p_short_miRNA * 100)

Percent of microRNAs that have short transcripts: 11.197998123240538


In [6]:
assert 0.111 < p_short_miRNA < 0.112

The value `p_short_miRNA` is the probability, among only the set of microRNA transcripts, that the length is less than 70 nt. In other words, this is the *conditional probability* that a transcript is < 70 nt, given that the gene is a microRNA:

$P(\mathrm{transcript\_len} < 70 \mid \mathrm{miRNA})$

Recall that by definition, $P(A \mid B) = P(A \mathbin{and} B)\mathbin{/}P(B)$. In the previous cell, you calculated a conditional probability directly: you extracted the subset of the data where one condition was true (i.e. the gene type was 'miRNA'), and found the fraction of that subset where another condition was true (i.e. the transcript was < 70 nt).
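
The "subset, then count" recipe described above can be seen on a tiny example. The sketch below uses made-up data (the `toy_` lists are hypothetical and not taken from `transcript_table.csv`) and follows the same two steps.

```python
# Hypothetical toy data, invented for illustration only.
toy_gene_types = ['miRNA', 'protein_coding', 'miRNA', 'lncRNA', 'miRNA', 'miRNA']
toy_lengths = [65, 1500, 90, 60, 110, 68]

# Step 1: extract the subset where the condition (gene type is miRNA) holds.
toy_miRNA_lens = []
for gene_type, length in zip(toy_gene_types, toy_lengths):
    if gene_type == 'miRNA':
        toy_miRNA_lens.append(length)

# Step 2: within that subset, count how often the other condition (< 70 nt) holds.
num_short = 0
for length in toy_miRNA_lens:
    if length < 70:
        num_short += 1

# The denominator is the size of the subset, not the size of the whole dataset.
p_short_given_miRNA = num_short / len(toy_miRNA_lens)
print(p_short_given_miRNA)
```

Note that the short lncRNA (60 nt) never enters the count, because it was excluded in step 1.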

Now let's calculate $P(\mathrm{transcript\_len} < 70 \mathbin{and} \mathrm{miRNA})$ and $P(\mathrm{miRNA})$ to check the formula for conditional probability.

Re-write the for loop in your first answer above to calculate the probability that a given transcript is both a microRNA *and* has a transcript less than 70 nt. Save the answer as `p_miRNA_and_short`. Also calculate `p_miRNA`, the number of microRNA transcripts divided by the total number of transcripts. (**Hint**: You could use `len(miRNA_tx_lens)` to get the number of microRNA transcripts without having to count them directly...) 

Finally, calculate the conditional probability using the formula above. Save this as `p_short_miRNA_calculated`.

# YOUR ANSWER HERE
short_miRNA_number = 0
for length in miRNA_tx_lens:
    if length < 70:
        short_miRNA_number += 1
# P(B): microRNA transcripts as a fraction of all transcripts
p_miRNA = len(miRNA_tx_lens) / len(transcript_lengths)
# P(A and B): short microRNA transcripts as a fraction of all transcripts
p_miRNA_and_short = short_miRNA_number / len(transcript_lengths)
# P(A | B) = P(A and B) / P(B)
p_short_miRNA_calculated = p_miRNA_and_short / p_miRNA
print('percent microRNA:', p_miRNA * 100)
print('percent microRNA and short:', p_miRNA_and_short * 100)
print('calculated percent short miRNA:', p_short_miRNA_calculated * 100 )

In [14]:
assert 0.032 < p_miRNA < 0.0321
assert 0.00358 < p_miRNA_and_short < 0.00359
assert 0.111 < p_short_miRNA_calculated < 0.112

In some sense, this result is obvious, even trivial. In the first task, you made a list of microRNAs and then counted how many transcripts in that list were short. That count is simply the number of transcripts that are both microRNAs *and* short. Then, to calculate `p_short_miRNA`, you divided this count by the number of microRNAs, i.e.:
```
p_short_miRNA = num_short_and_miRNA / num_miRNA
```


In the second case, you calculated from the formula:
```
p_miRNA_and_short = num_short_and_miRNA / num_transcripts
p_miRNA = num_miRNA / num_transcripts
p_short_miRNA_calculated = p_miRNA_and_short / p_miRNA
```

A little algebra shows that `num_transcripts` cancels out of the result, so:
```
p_short_miRNA_calculated = num_short_and_miRNA / num_miRNA
```

Thus, as we saw in class, the difference between conditional probability of "A given B" and unconditional probability of "A and B" is simply one of choosing the correct denominator.

More specifically, the numerators for $P(A \mid B)$, $P(A \mathbin{and} B)$, and $P(B \mid A)$ are all identical: the number of times that $A$ and $B$ occur together (i.e. the area of the overlap between the two circles in a Venn diagram). The denominators differ, however: for $P(A \mid B)$, you divide that number of co-occurrences (of $A$ and $B$ together) by the total number of occurrences of $B$ (i.e. the overlap area in the Venn diagram divided by the total area of $B$).

For $P(B \mid A)$, you divide by the total number of occurrences of $A$. And for $P(A \mathbin{and} B)$, you divide by the total size of the whole dataset.
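
The "same numerator, different denominator" picture can be sketched with a single loop. The data below is hypothetical (invented for illustration), with $A$ standing for "transcript is shorter than 70 nt" and $B$ for "gene type is miRNA".

```python
# Hypothetical toy data, invented for illustration only.
toy_gene_types = ['miRNA', 'protein_coding', 'miRNA', 'lncRNA',
                  'miRNA', 'miRNA', 'snRNA', 'protein_coding']
toy_lengths = [65, 1500, 90, 60, 110, 68, 640, 2200]

num_A = 0        # short transcripts
num_B = 0        # miRNA transcripts
num_A_and_B = 0  # short miRNA transcripts: the shared numerator
for gene_type, length in zip(toy_gene_types, toy_lengths):
    is_short = length < 70
    is_miRNA = gene_type == 'miRNA'
    if is_short:
        num_A += 1
    if is_miRNA:
        num_B += 1
    if is_short and is_miRNA:
        num_A_and_B += 1

num_total = len(toy_lengths)

# One numerator, three denominators.
p_A_given_B = num_A_and_B / num_B      # divide by occurrences of B
p_B_given_A = num_A_and_B / num_A      # divide by occurrences of A
p_A_and_B = num_A_and_B / num_total    # divide by the whole dataset

print(p_A_given_B, p_B_given_A, p_A_and_B)
```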

As a last exercise in thinking about conditional probabilities, we will verify Bayes' law: $P(B \mid A) = P(A \mid B)\cdot P(B)\mathbin{/}P(A)$
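
For reference, this identity follows from the definition of conditional probability given at the start of this problem. As a short derivation sketch (assuming $P(A) > 0$ and $P(B) > 0$), apply the rearranged definition once with $B$ as the condition and once with $A$ as the condition:

$$P(A \mid B)\cdot P(B) = P(A \mathbin{and} B) = P(B \mid A)\cdot P(A)$$

Dividing both ends by $P(A)$ gives:

$$P(B \mid A) = \frac{P(A \mid B)\cdot P(B)}{P(A)}$$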

Above, we calculated $P(\mathrm{transcript\_len} < 70 \mid \mathrm{miRNA})$. Now let's calculate $P(\mathrm{miRNA} \mid \mathrm{transcript\_len} < 70)$.

First, do this calculation directly. Write a for loop to count the number of transcripts that are both for microRNAs and have lengths < 70 nt (you have this code above) and also to count the total number with lengths < 70. The fraction of short microRNA transcripts among all the short transcripts is of course $P(\mathrm{miRNA} \mid \mathrm{transcript\_len} < 70)$. Save this as `p_miRNA_given_short`.

In [28]:
# YOUR ANSWER HERE
num_short = 0            # all transcripts with length < 70 nt
num_short_and_miRNA = 0  # short transcripts that are also microRNAs
for gene_type, length in zip(gene_types, transcript_lengths):
    if length < 70:
        num_short += 1
        if gene_type == 'miRNA':
            num_short_and_miRNA += 1
# Direct calculation: short microRNAs as a fraction of all short transcripts
p_miRNA_given_short = num_short_and_miRNA / num_short
print('percent short transcripts that are microRNAs:', p_miRNA_given_short * 100)


In [None]:
assert 0.44 < p_miRNA_given_short < 0.441

So, around 44% of short transcripts are microRNAs, while only around 11% of microRNA transcripts are short. With a bit of thinking, you should be able to convince yourself that this can only be true if there are considerably more microRNA transcripts in total than short transcripts.

Using Bayes' law, we have:
$$ P(\mathrm{miRNA} \mid \mathrm{transcript\_len} < 70) = P(\mathrm{transcript\_len} < 70 \mid \mathrm{miRNA}) \cdot \frac{P(\mathrm{miRNA})}{P(\mathrm{transcript\_len} < 70)}$$ 

That is:

$$0.44 \approx 0.11 \cdot \frac{P(\mathrm{miRNA})}{P(\mathrm{transcript\_len} < 70)}$$

Note that this implies that:
$$\frac{P(\mathrm{miRNA})}{P(\mathrm{transcript\_len} < 70)} \approx 4$$

That is, any given transcript is about 4 times as likely to be a microRNA as to be short. In other words, microRNA transcripts are about 4 times as numerous as short transcripts. We can see, then, that Bayes' law is really just a formalization of an intuitive picture: the relationship between the percent of microRNA transcripts that are short and the percent of short transcripts that are microRNAs depends on the relative abundance of those two classes.
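
To make the role of this ratio concrete, here is a minimal sketch using hypothetical counts (chosen only so the arithmetic is easy to follow; they are not the assignment's data). It compares the direct calculation of $P(\mathrm{miRNA} \mid \mathrm{transcript\_len} < 70)$ with the one obtained through Bayes' law.

```python
import math

# Hypothetical counts, chosen only for illustration (not the assignment's data).
num_transcripts = 1000
num_miRNA = 40            # transcripts that are microRNAs
num_short = 10            # transcripts shorter than 70 nt
num_short_and_miRNA = 4   # transcripts that are both

# Direct conditional probabilities: same numerator, different denominators.
p_short_given_miRNA = num_short_and_miRNA / num_miRNA
p_miRNA_given_short = num_short_and_miRNA / num_short

# The ratio P(miRNA) / P(short), computed from probabilities and from raw counts.
ratio_from_probs = (num_miRNA / num_transcripts) / (num_short / num_transcripts)
ratio_from_counts = num_miRNA / num_short

# Bayes' law: P(miRNA | short) = P(short | miRNA) * P(miRNA) / P(short)
p_miRNA_given_short_bayes = p_short_given_miRNA * ratio_from_counts

# Compare with a tolerance, since these are floating-point values.
print(math.isclose(ratio_from_probs, ratio_from_counts))
print(math.isclose(p_miRNA_given_short_bayes, p_miRNA_given_short))
```

In this toy setup the ratio of probabilities equals the ratio of raw counts, because the total number of transcripts appears in both the numerator and the denominator.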

Write code to calculate $\frac{P(\mathrm{miRNA})}{P(\mathrm{transcript\_len} < 70)}$. (Use a for loop to re-calculate any necessary values from scratch, rather than reusing them from previous cells.) Store this as `miRNA_short_ratio`. You might note that actually calculating the probabilities (i.e. dividing the counts by the total number of transcripts) is a bit redundant because, just as above, that denominator cancels out of the equation.

In [None]:
# YOUR ANSWER HERE
print('relative probability of microRNAs to short transcripts:', miRNA_short_ratio)
print('calculated p_miRNA_given_short:', p_short_miRNA * miRNA_short_ratio)
print('measured p_miRNA_given_short:', p_miRNA_given_short)

In [None]:
assert 3.93 < miRNA_short_ratio < 3.94