# Conditional probability

## Rare illness test

Alice takes a test for some illness and the test is positive.
However, the test is not perfect
and might have false positives
(the result is positive while the person is not sick)
and false negatives
(the result is negative while the person is actually sick).
The illness is very rare in the population.

**Notation:**
  - $p_\mathrm{ill}$: fraction of the population affected by the illness,
  - $p_\mathrm{FP}$: probability of a false positive, $p(P|\bar{S})$,
  - $p_\mathrm{FN}$: probability of a false negative, $p(\bar{P}|S)$,
  - $S$: the **random** event "Alice is sick",
  - $P$: the **random** event "the test is positive".

**Numerical application:**

$p_\mathrm{ill}=0.001,\ p_\mathrm{FP}=0.01,\ p_\mathrm{FN}=0.01$.

### 1. Theory
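Is Alice really sick, i.e. what is the probability that she is sick given that her test is positive?

The required probability is $p(S|P)$. By Bayes' theorem,

$p(S|P) = \frac{p(P|S)\cdot p(S)}{p(P)}$

where, by the law of total probability, $p(P) = p(P\cap S) + p(P \cap \bar{S}) = p(P|S) \cdot p(S) + p(P|\bar{S}) \cdot p(\bar{S})$

$\Rightarrow p(P) = (1-p_{FN})p_{ill} + p_{FP}(1-p_{ill})$

since $p(P|S) = 1-p_{FN}$, $p(P|\bar{S}) = p_{FP}$ and $p(S) = p_{ill}$. Therefore,

$p(S|P) = \frac{(1-p_{FN})p_{ill}}{(1-p_{FN})p_{ill} + p_{FP}(1-p_{ill})}$

Substituting the numerical values gives

$p(S|P) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} \approx 0.0902$

Even after a positive test, Alice has only about a 9% chance of being sick: the illness is so rare that false positives outnumber true positives by roughly ten to one.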
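
The arithmetic can be checked with a few lines of Python. This is a minimal sketch; the function name `posterior_sick_given_positive` is a hypothetical helper introduced here for illustration and is not part of the exercise.

```python
def posterior_sick_given_positive(p_ill, p_FP, p_FN):
    """Bayes' theorem for a single positive test: returns p(S|P)."""
    p_pos_and_sick = (1 - p_FN) * p_ill      # p(P|S) * p(S)
    p_pos_and_healthy = p_FP * (1 - p_ill)   # p(P|not S) * p(not S)
    return p_pos_and_sick / (p_pos_and_sick + p_pos_and_healthy)


posterior = posterior_sick_given_positive(p_ill=0.001, p_FP=0.01, p_FN=0.01)
print(round(posterior, 4))  # 0.0902
```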

### 2. Numerical experiment with Python

#### Imports

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# See https://numpy.org/doc/stable/reference/index.html

#### Generate random booleans

1. Single boolean

In [2]:
def randbool_single(p):
    """
    This function returns True with probability p,
    and False with probability (1-p).
    """
    # See https://numpy.org/doc/stable/reference/random/legacy.html#functions-in-numpy-random
    # np.random.random() draws a float uniformly from [0, 1),
    # so the comparison below is True with probability p.
    x = np.random.random()
    return x < p

Check that the function works as intended...

In [3]:
n = 100000
s = 0
for _ in range(n):
    s += randbool_single(0.25)
print(s/n)

0.24853


2. Vector of booleans

In [4]:
def randbool_vector(p):
    """
    p is an array of probabilities.
    This function returns a vector of booleans with the same size as p.
    For each element, the value is True with probability p[k],
    and False with probability (1-p[k]).
    """
    # Draw one uniform sample in [0, 1) per entry and compare element-wise.
    # Calling randbool_single in a loop would also work, but it is good
    # practice to work directly with vectors (operations are better optimized).
    u = np.random.random(p.size)
    return u < p

Check that the function works as intended...

In [5]:
# Draw n booleans with the same probability and check the empirical frequency
n = 100000
pp = np.full(n, 0.27)
xx = randbool_vector(pp)
# Both lines compute the fraction of True values
print(np.sum(xx)/n)
print(np.mean(xx))

0.27109
0.27109
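
The functions above rely on NumPy's legacy `np.random.random`. As a hedged alternative, the same element-wise comparison can be written with the `Generator` interface returned by `np.random.default_rng`, which makes the run reproducible through an explicit seed. The name `randbool_vector_rng` and the seed value are assumptions made for this sketch; they are not part of the exercise.

```python
import numpy as np


def randbool_vector_rng(p, rng):
    """Return booleans that are True with probability p[k] (hypothetical variant).

    rng is a numpy.random.Generator, e.g. from np.random.default_rng().
    """
    p = np.asarray(p, dtype=float)
    # One uniform sample in [0, 1) per entry, compared element-wise with p
    return rng.random(p.shape) < p


rng = np.random.default_rng(seed=0)  # the seed is arbitrary, chosen for reproducibility
draws = randbool_vector_rng(np.full(100000, 0.27), rng)
print(draws.mean())  # should be close to 0.27
```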


#### Generate a random population of healthy/sick individuals

In [6]:
def generate_rand_population(p_ill, n):
    """
    This function generates a random population of size n,
    with a probability of illness p_ill.
    """
    # Every individual is sick independently with the same probability p_ill
    probabilities = np.full(n, p_ill)
    return randbool_vector(probabilities)

Check that the function works as intended...

In [7]:
# Todo
population = generate_rand_population(0.001, 100000)
print(np.sum(population)/100000)

0.0009


In [8]:
pop1 = generate_rand_population(0.1, 10)
print(pop1)
print(np.mean(pop1))
pop2 = np.where(pop1, 0.2, 0.3)
print(pop2)
pop3 = randbool_vector(pop2)
print(pop3)
print(np.mean(pop3))

[False False  True False  True False False False False False]
0.2
[0.3 0.3 0.2 0.3 0.2 0.3 0.3 0.3 0.3 0.3]
[False False  True  True False False False False False  True]
0.3


#### Generate test results

In [9]:
def generate_test_results(population, p_FP, p_FN):
    """
    This function generates test results for each individual
    in the population (as a boolean vector).
    population is a vector of booleans giving whether
    an individual is sick or not.
    p_FP is the test false positive rate,
    and p_FN is the false negative rate.
    """
    # Probability of a positive result for each individual:
    # 1-p_FN if the individual is sick, p_FP if healthy.
    # np.where(condition, a, b) picks a where condition is True, else b.
    p_positive = np.where(population, 1-p_FN, p_FP)
    return randbool_vector(p_positive)

Check that the function works as intended...

In [10]:
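# All sick: the fraction of negative results estimates p_FN
test_all_sick = generate_test_results(np.full(10000, True), 0.01, 0.01)
print("All sick:", 1-np.mean(test_all_sick))
# All healthy: the fraction of negative results estimates 1-p_FP
test_all_healthy = generate_test_results(np.full(10000, False), 0.01, 0.01)
print("All healthy:", 1-np.mean(test_all_healthy))
# Whole population: the fraction of positive results estimates p(P)
test = generate_test_results(population, 0.01, 0.01)
print("Population:", np.mean(test))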

All sick: 0.010499999999999954
All healthy: 0.9882
Population: 0.01145


#### Compute the conditional probability

In [11]:
def compute_conditional(population, test):
    """
    This function computes the conditional probability of
    being sick when the test result is positive.
    population is a vector of booleans giving whether
    an individual is sick or not.
    test is a vector of booleans giving the test result
    for each individual in the population.
    """
    # Keep only the individuals whose test is positive
    sick_among_positive = population[test]
    # Fraction of those individuals who are actually sick
    return np.sum(sick_among_positive)/len(sick_among_positive)
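
For a small population it can happen that no test is positive at all, in which case the ratio above is 0/0. The following sketch shows one possible way to handle that case explicitly by returning `nan`. The name `compute_conditional_safe` is hypothetical, and returning `nan` is a design choice made for this example, not something prescribed by the exercise.

```python
import numpy as np


def compute_conditional_safe(population, test):
    """Estimate p(sick | positive); returns nan when no test is positive."""
    population = np.asarray(population, dtype=bool)
    test = np.asarray(test, dtype=bool)
    n_positive = np.count_nonzero(test)
    if n_positive == 0:
        # The conditional probability cannot be estimated without positives
        return float("nan")
    n_sick_and_positive = np.count_nonzero(population & test)
    return n_sick_and_positive / n_positive


# Two positive tests, one of which belongs to a sick individual
example = compute_conditional_safe([True, False, True, False],
                                   [True, True, False, False])
print(example)  # 0.5
```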

Use all the previously defined functions to give a numerical answer to question 1.
Use different values for the population size (for instance with a for loop), and compare the results with each other and with the analytical answer.

In [12]:
p_ill = 0.001
p_FP = 0.01
p_FN = 0.01
# Empirical estimate for population sizes 10^3, ..., 10^8
for pop_size in [10**k for k in range(3,9)]:
    pop = generate_rand_population(p_ill, pop_size)
    test = generate_test_results(pop, p_FP, p_FN)
    p_cond = compute_conditional(pop, test)
    print('Empirical', p_cond)
# Analytical formula (see first question)
p_ill_and_pos = (1-p_FN)*p_ill
p_healthy_and_pos = p_FP*(1-p_ill)
p_pos = p_ill_and_pos + p_healthy_and_pos
print("Analytical", p_ill_and_pos/p_pos)

Empirical 0.05555555555555555
Empirical 0.09523809523809523
Empirical 0.08845829823083404
Empirical 0.08563586459286368
Empirical 0.09086529283967676
Empirical 0.09005698034228261
Analytical 0.09016393442622951


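Comment on your results.

With a small population the estimate fluctuates strongly: for $10^3$ individuals only about 11 positive tests are expected, of which about one is a true positive. As the population size increases, the empirical value converges to the analytical value $0.0902$; for $10^8$ individuals the two differ by only about $10^{-4}$.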
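
To get a rough idea of how fast the empirical value approaches the analytical one, the sketch below treats the number of sick individuals among the positive tests as binomial. The standard error of the estimated fraction is then about $\sqrt{p(1-p)/n_P}$, where $p = p(S|P)$ and $n_P$ is the number of positive tests. Replacing $n_P$ by its expected value is an approximation made for this example, and the helper name `approx_standard_error` is hypothetical.

```python
import math

p_ill, p_FP, p_FN = 0.001, 0.01, 0.01
p_pos = (1 - p_FN) * p_ill + p_FP * (1 - p_ill)   # p(P)
p_cond = (1 - p_FN) * p_ill / p_pos               # p(S|P)


def approx_standard_error(pop_size):
    """Approximate standard error of the empirical p(S|P) for a given population size."""
    expected_positives = pop_size * p_pos
    return math.sqrt(p_cond * (1 - p_cond) / expected_positives)


for k in range(3, 9):
    n = 10**k
    print(f"N = 10^{k}: expected positives = {n * p_pos:.0f}, "
          f"approx. standard error = {approx_standard_error(n):.5f}")
```

Under these assumptions the error shrinks like $1/\sqrt{N}$, so each factor of 100 in population size gains roughly one decimal digit of accuracy.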

### 3. Two tests
Alice takes two successive and independent tests, and both are positive.
What is the probability that she is sick (analytical and numerical answer)?

Let
  - $P_1$: the event "the first test is positive",
  - $P_2$: the event "the second test is positive",
  - $Q = P_1 \cap P_2$: the event "both tests are positive".

We want to find $p(S|Q)$. By Bayes' theorem,

\begin{equation*}
    p(S|Q) = \frac{p(Q|S)\cdot p(S)}{p(Q)}
\end{equation*}

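Now, $p(Q|S) = p(P_1 \cap P_2|S) = p(P_1|S) \cdot p(P_2|S) = (1-p_{FN})^2$, since the two test results are independent once Alice's health status is known (conditional independence given $S$, and likewise given $\bar{S}$). Furthermore, we can express $p(Q)$ as,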

\begin{equation*}
    \begin{split}
        p(Q) &= p(Q \cap S) + p(Q \cap \bar{S}) \\
        &= p(Q|S)p(S) + p(Q|\bar{S})p(\bar{S}) \\
        &= p(P_1 \cap P_2|S)p(S) + p(P_1 \cap P_2|\bar{S})p(\bar{S}) \\
        &= p(P_1|S)p(P_2|S) p(S) + p(P_1|\bar{S})p(P_2|\bar{S})p(\bar{S}) \\
        &= (1-p_{FN})^2 p_{ill} + p_{FP}^2 (1-p_{ill})
    \end{split}
\end{equation*}

Substituting into Bayes' theorem gives

\begin{equation*}
    p(S|Q) = \frac{(1-p_{FN})^2 p_{ill}}{(1-p_{FN})^2 p_{ill} + p_{FP}^2 (1-p_{ill})}
\end{equation*}

With the numerical values above, one gets

$$p(S|Q) \approx 0.9075$$
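
The same result can be obtained by applying Bayes' theorem twice, using the probability after the first positive test as the prior for the second test. This works because the two tests are independent given Alice's health status. The sketch below illustrates this; the function name `bayes_update` is a hypothetical helper introduced for this example.

```python
def bayes_update(prior, p_FP, p_FN):
    """Probability of being sick after one positive test, starting from prior."""
    p_pos_and_sick = (1 - p_FN) * prior
    p_pos_and_healthy = p_FP * (1 - prior)
    return p_pos_and_sick / (p_pos_and_sick + p_pos_and_healthy)


p_ill, p_FP, p_FN = 0.001, 0.01, 0.01
after_first = bayes_update(p_ill, p_FP, p_FN)         # one positive test
after_second = bayes_update(after_first, p_FP, p_FN)  # two positive tests
print(round(after_first, 4), round(after_second, 4))  # 0.0902 0.9075
```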

In [13]:
def two_tests_cond(popu, t1, t2):
    """
    This function computes the conditional probability of
    being sick when both test results are positive.
    popu is a vector of booleans giving whether
    an individual is sick or not.
    t1 and t2 are vectors of booleans giving the results of the
    first and second test for each individual in the population.
    """
    # Keep only the individuals for whom both tests are positive
    sick_among_both_positive = popu[t1 & t2]
    return np.sum(sick_among_both_positive)/len(sick_among_both_positive)

p_ill = 0.001
p_FP = 0.01
p_FN = 0.01
for pop_size in [10**k for k in range(3,9)]:
    pop = generate_rand_population(p_ill, pop_size)
    # The two tests are drawn independently for the same population
    test1 = generate_test_results(pop, p_FP, p_FN)
    test2 = generate_test_results(pop, p_FP, p_FN)
    p_cond = two_tests_cond(pop, test1, test2)
    print('Empirical', p_cond)
# Analytical formula for two positive tests (see derivation above)
p_ill_and_pos = ((1-p_FN)**2)*p_ill
p_healthy_and_pos = p_FP*p_FP*(1-p_ill)
p_pos = p_ill_and_pos + p_healthy_and_pos
print("Analytical", p_ill_and_pos/p_pos)

Empirical 1.0
Empirical 0.8125
Empirical 0.8938053097345132
Empirical 0.9018691588785047
Empirical 0.9096689732560315
Empirical 0.907321378797733
Analytical 0.9075
