<a href="https://colab.research.google.com/github/Lawrence-Krukrubo/Probability-and-Statistics/blob/master/bayes_rule_for_testing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The UK has an HIV prevalence of about 1.7 per 1,000 people ([source: Avert](https://www.avert.org/professionals/hiv-around-world/western-central-europe-north-america/uk)).

In [1]:
p_hiv = round(1.7 / 1000,4)
p_no_hiv = round(1 - p_hiv, 4)

print(f'P(HIV) is {p_hiv} and P(No-HIV) is {p_no_hiv}')

P(HIV) is 0.0017 and P(No-HIV) is 0.9983


The latest HIV tests have 99% accuracy. This means that if you have HIV, you test positive 99% of the time (sensitivity), and if you don't have it, you test negative 99% of the time (specificity).

In [2]:
p_positive_given_hiv = 0.99  # Sensitivity
p_negative_given_no_hiv = 0.99  # Specificity

# Now find the complements of sensitivity and specificity.
# The first one below is the false-negative rate and the second one is the
# false-positive rate.

p_negative_given_hiv = round(1 - p_positive_given_hiv, 4)
p_positive_given_no_hiv = round(1 - p_negative_given_no_hiv, 4)

print(f'p_negative_given_hiv({p_negative_given_hiv}), p_positive_given_no_hiv({p_positive_given_no_hiv})')

p_negative_given_hiv(0.01), p_positive_given_no_hiv(0.01)


So, after receiving a positive HIV test result, what is the probability of actually having HIV?

We use Bayes' theorem to infer this new probability.

<h2>Bayes' theorem: $P(A|B) = \frac{P(B|A) * P(A)} {P(B)}$</h2>

In our case, Bayes' rule is:
<h2>$P(HIV|Positive) = \frac{P(Positive|HIV) * P(HIV)} {P(Positive)}$</h2>

This means the Probability of HIV given a Positive result is equal to the Probability of a Positive result given HIV times the Probability of HIV, divided by the Total Probability of a Positive test result.

Let's solve the numerator of Bayes' theorem first. This is the part that says: <h2>$P(Positive|HIV) * P(HIV)$</h2>

In [3]:
# Keep full precision here: rounding the numerator to 4 decimal places
# (0.001683 -> 0.0017) would distort the final answer.
numerator = p_positive_given_hiv * p_hiv
print(f'Numerator is {numerator:.6f}')

Numerator is 0.001683


Now, let's solve the denominator of Bayes' theorem. This is the part that says: <h2>$P(Positive)$</h2>

To find this probability, we need what is called the total probability of testing positive for HIV.<br>
This is the probability of testing positive given HIV times the probability of HIV, plus the probability of testing positive given no HIV times the probability of no HIV.<br>
This is often the difficult part of applying Bayes' theorem, written as 
<h2>$P(Positive|HIV)*P(HIV)+P(Positive|No-HIV)*P(No-HIV)$</h2>

In [4]:
# Since we have already calculated all of these, let's plug them in
denominator = p_positive_given_hiv * p_hiv + p_positive_given_no_hiv * p_no_hiv
print(f'Denominator is {denominator:.6f}')

Denominator is 0.011666


Therefore, to find the probability of having HIV given a positive test result, we divide the numerator by the denominator.

In [5]:
p_hiv_given_positive = round(numerator / denominator, 4)
print(p_hiv_given_positive)

0.1443
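
The numerator and denominator steps above can be folded into one small helper. This is a minimal sketch rather than part of the original notebook: the function name `posterior_given_positive` and its parameter names are hypothetical, and it keeps full precision until the result is printed.

```python
def posterior_given_positive(prior, sensitivity, specificity):
    """Return P(condition | positive test) using Bayes' theorem."""
    # Numerator: P(Positive | condition) * P(condition)
    true_positive_mass = sensitivity * prior
    # P(Positive | no condition) is the complement of specificity
    false_positive_mass = (1 - specificity) * (1 - prior)
    # Denominator: total probability of a positive result
    return true_positive_mass / (true_positive_mass + false_positive_mass)


# Same inputs as the cells above: prevalence of 1.7 per 1,000,
# with 99% sensitivity and 99% specificity
result = posterior_given_positive(prior=0.0017, sensitivity=0.99, specificity=0.99)
print(f'{result:.4f}')
```

Writing it as a function makes it easy to try other prevalence or accuracy values without repeating the individual steps.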


Really? Only about a 14% chance of having HIV in the UK given a positive test result, when the test has a proven record of 99% accuracy? How is this even possible?

To find out, let's start with 1,000,000 people, each of whom either has HIV or does not. Since 1.7 of every 1,000 people have HIV, 1.7 * 1000 = 1,700 of these 1,000,000 people will have HIV.

In [6]:
# given that 1.7 out of every 1,000 people have HIV, then for 1,000,000 people...
hiv = 1.7 * 1000
no_hiv  = 1000000 - hiv
print(f'hiv:{hiv}, no_hiv:{no_hiv}')

hiv:1700.0, no_hiv:998300.0


Now remember that out of every 100 cases, 99 get correctly diagnosed (99% accurate) and 1 gets incorrectly diagnosed. This applies to both groups: hiv and no_hiv. Let's define some important terms.

1. False-Positive: This is when a test says a person is Positive, but it's false, as the person is truly Negative.
2. False-Negative: This is when a test says a person is Negative, but it's false, as the person is truly Positive.
3. True-Positive: This is when a test says a person is Positive and it's true.
4. True-Negative: This is when a test says a person is Negative and it's true.
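
The four definitions above can be expressed as a tiny lookup on two facts: whether the person truly has HIV and whether the test came back positive. This is only an illustrative sketch; the function name `classify_outcome` and its arguments are hypothetical.

```python
def classify_outcome(has_condition, tested_positive):
    """Label one test result as True/False Positive/Negative."""
    if tested_positive:
        # The test said Positive: it is true only if the person has the condition
        return 'True-Positive' if has_condition else 'False-Positive'
    # The test said Negative: it is true only if the person does not have the condition
    return 'False-Negative' if has_condition else 'True-Negative'


# Enumerate all four combinations
for has_condition in (True, False):
    for tested_positive in (True, False):
        label = classify_outcome(has_condition, tested_positive)
        print(f'has HIV: {has_condition}, test positive: {tested_positive} -> {label}')
```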

In [7]:
# For the hiv group
true_positive_hiv_group = round((99/100) * hiv)
false_negative_hiv_group = hiv - true_positive_hiv_group

print(f'true-positive in hiv-group: {true_positive_hiv_group}, false-negative in hiv-group: {false_negative_hiv_group}')

true-positive in hiv-group: 1683, false-negative in hiv-group: 17.0


In [8]:
# For the no-hiv group
true_negative_no_hiv_group = round((99/100) * no_hiv)
false_positive_no_hiv_group = no_hiv - true_negative_no_hiv_group

print(f'false-positive in no-hiv-group: {false_positive_no_hiv_group}, true-negative in no-hiv-group: {true_negative_no_hiv_group}')

false-positive in no-hiv-group: 9983.0, true-negative in no-hiv-group: 988317


So we have seen four distinct groups above:

1. true_positive_hiv_group: Those who have HIV and are correctly classified as positive, and who are ideally given drugs and care.
2. false_negative_hiv_group: Those who have HIV but are sent home on the false evidence that they don't have it.
3. true_negative_no_hiv_group: Those who don't have HIV and are sent home on the true evidence of no HIV.
4. false_positive_no_hiv_group: Those who don't have HIV but are diagnosed with it, and who are sent for more tests and care, with all the trauma that comes with it.
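
To see the four groups side by side, the counts can be laid out as a 2x2 confusion matrix. The sketch below hard-codes the counts printed by the cells above (per 1,000,000 people); the variable names and the table layout are assumptions made for illustration.

```python
# Counts taken from the outputs of the cells above (per 1,000,000 people)
true_positive = 1683
false_negative = 17
false_positive = 9983
true_negative = 988317

# One row per actual condition, one column per test result
rows = [
    ('Has HIV', true_positive, false_negative),
    ('No HIV', false_positive, true_negative),
]

print(f"{'':<10}{'Test +':>12}{'Test -':>12}")
for label, positive_count, negative_count in rows:
    print(f'{label:<10}{positive_count:>12,}{negative_count:>12,}')

# Everyone who receives a positive result, rightly or wrongly
total_positive_tests = true_positive + false_positive
print(f'Total positive tests: {total_positive_tests:,}')
```

The "Test +" column is the one that matters for someone who has just tested positive: it mixes a small number of true positives with a much larger number of false positives.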

So, back to the result of about 14% for HIV given a positive test. Since we tested positive, we must be in one of the two positive groups.<br>
That is, we are in either true_positive_hiv_group or false_positive_no_hiv_group. So what is the probability that we are actually in the true_positive_hiv_group?

In [9]:
p_true_positive_hiv_group = true_positive_hiv_group / (true_positive_hiv_group + false_positive_no_hiv_group)
p_false_positive_no_hiv_group = false_positive_no_hiv_group / (true_positive_hiv_group + false_positive_no_hiv_group)

print(f'Probability of being in true-positive-hiv-group; {round(p_true_positive_hiv_group, 4)}')
print(f"Probability of being in false-positive-no-hiv-group; {round(p_false_positive_no_hiv_group, 4)}")

Probability of being in true-positive-hiv-group; 0.1443
Probability of being in false-positive-no-hiv-group; 0.8557


So we can clearly see that, given the 99% accuracy of the HIV test and the additional information that 1.7 per 1,000 people in the UK have HIV, we have a 14.4% chance of actually having HIV after testing positive, and an 85.6% chance of having been misdiagnosed as HIV-positive when we are in fact negative.
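
The count-based reasoning above can be cross-checked with a quick simulation. This is a rough sketch under stated assumptions: each simulated person is drawn independently, the seed and variable names are arbitrary choices, and the result will fluctuate slightly around the exact value because it is random.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

prevalence = 0.0017
sensitivity = 0.99
specificity = 0.99
n_people = 1_000_000

true_positives = 0
false_positives = 0
for _ in range(n_people):
    has_hiv = random.random() < prevalence
    if has_hiv:
        # People with HIV test positive with probability = sensitivity
        tested_positive = random.random() < sensitivity
    else:
        # People without HIV test positive with probability = 1 - specificity
        tested_positive = random.random() >= specificity
    if tested_positive:
        if has_hiv:
            true_positives += 1
        else:
            false_positives += 1

# Share of positive results that belong to people who really have HIV
simulated_posterior = true_positives / (true_positives + false_positives)
print(f'Simulated P(HIV | Positive): {simulated_posterior:.3f}')
```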

Let's see some important metrics from the confusion matrix for this case

In [10]:
total_assumed = true_positive_hiv_group + false_positive_no_hiv_group + true_negative_no_hiv_group + false_negative_hiv_group
print(total_assumed)

1000000.0


In [11]:
accuracy = (true_positive_hiv_group + true_negative_no_hiv_group) / total_assumed
print(accuracy)

0.99


In [12]:
# Recall or Sensitivity or True Positive Rate:
# Percent of positive cases identified from actual total positive cases
recall = true_positive_hiv_group / (true_positive_hiv_group + false_negative_hiv_group)
recall = round(recall, 4)
print(recall)

0.99


In [13]:
# Precision or Positive-Predicted-Value:
# Percent of cases identified as positive that are truly positive
precision = true_positive_hiv_group / (true_positive_hiv_group + false_positive_no_hiv_group)
precision = round(precision, 4)
print(precision)

0.1443


In [14]:
# True-negative-rate: Aka Specificity,
# Percent of negative cases identified from actual total negative cases.
specificity = true_negative_no_hiv_group / (true_negative_no_hiv_group + false_positive_no_hiv_group)
specificity = round(specificity, 4)
print(specificity)

0.99


In [16]:
# False-positive-rate: the complement of Specificity (1 - Specificity),
# Percent of total negative cases misclassified as positive.
fpr = false_positive_no_hiv_group / (true_negative_no_hiv_group + false_positive_no_hiv_group)
fpr = round(fpr, 4)
print(fpr)

0.01


As we can see, the test does a good job of catching 99% of infections (sensitivity), and it correctly clears 99% of people without an infection (specificity). But it does a poor job on precision: of the 11,666 cases identified as positive, 9,983 are actually negative. This is the crux of the matter in this post.
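
Precision is low here because HIV is rare in this population, not because the test itself is weak. The sketch below recomputes precision for a few prevalence values while holding sensitivity and specificity at 99%. Apart from 0.0017, the prevalence values are hypothetical and chosen only for illustration, as is the function name `precision_at`.

```python
def precision_at(prevalence, sensitivity=0.99, specificity=0.99):
    """Precision (positive predictive value) for a given prevalence."""
    true_positive_mass = sensitivity * prevalence
    false_positive_mass = (1 - specificity) * (1 - prevalence)
    return true_positive_mass / (true_positive_mass + false_positive_mass)


# 0.0017 is the UK figure used above; the other values are illustrative only
prevalences = [0.0017, 0.01, 0.05, 0.2, 0.5]
for prevalence in prevalences:
    print(f'prevalence {prevalence:>6.2%} -> precision {precision_at(prevalence):.4f}')
```

With the same test, precision reaches 50% once prevalence is 1% and 99% when prevalence is 50%, so the rarer the condition, the more the false positives dominate.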

Let's look at the F1-score, the harmonic mean of the test's precision and recall.

In [15]:
# Let's see the F1-score
f1_score  = 2 * ((precision * recall) / (precision + recall))
f1_score = round(f1_score, 4)
print(f1_score)

0.2519


So we see that although the test is 99% accurate and has a TPR and TNR of 99%, it performs poorly overall on this population, with an F1-score of only about 25%.
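
All of the metrics above can also be computed in one place, directly from the four confusion-matrix counts. This is a sketch, not the notebook's original code: the function name `summarize` and its argument names are hypothetical. Because precision is not rounded before it is used here, the last digit of the F1-score may differ slightly from the cell above.

```python
def summarize(tp, fp, tn, fn):
    """Compute common metrics directly from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        'accuracy': (tp + tn) / (tp + fp + tn + fn),
        'precision': precision,
        'recall': recall,
        'specificity': tn / (tn + fp),
        # F1 is the harmonic mean of precision and recall
        'f1': 2 * precision * recall / (precision + recall),
    }


# Counts from the cells above (per 1,000,000 people)
metrics = summarize(tp=1683, fp=9983, tn=988317, fn=17)
for name, value in metrics.items():
    print(f'{name}: {value:.4f}')
```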

<h2>To Recap:</h2>

<h2>Bayes' theorem: $P(A|B) = \frac{P(B|A) * P(A)} {P(B)}$</h2>

1. $P(A|B)$: The probability of HIV ($A$) given a positive test result ($B$) is called the posterior. This is what we want to compute given the evidence.
2. $P(B|A)$: The conditional probability of testing positive given HIV is called the likelihood. Here it is the test's sensitivity.
3. $P(A)$: The probability of HIV is the prior. This is the base rate of HIV in the population, known before we see the test result.
4. $P(B)$: The probability of testing positive is the marginal likelihood. Simply put, it's the total probability of testing positive: the probability of testing positive given HIV times the probability of HIV, plus the probability of testing positive given no HIV times the probability of no HIV.
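
As a worked instance of the four terms above, plugging in the values used in this notebook (prior 0.0017, sensitivity and specificity 0.99) and keeping full precision until the last step gives:

$$P(HIV|Positive) = \frac{0.99 * 0.0017}{0.99 * 0.0017 + 0.01 * 0.9983} = \frac{0.001683}{0.011666} \approx 0.1443$$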