In [9]:

import random
import pandas as pd
import numpy as np

# **DIAGNOSING A RARE DISEASE**

Imagine that $0.5\%$ of the population has a certain disease:

$P(D) = 0.005$

There exists a $99\%$-accurate test to diagnose it. That is, when a person is sick, the test shows a positive (+) result in $99\%$ of cases. Similarly, when a person is not sick, the test shows a negative (-) result in $99\%$ of cases:

$P(+ | D) = P(- | no \ D) = 0.99$

Imagine that a random person tests positive. Should they be worried? How likely is it that they are indeed sick? That is, what is the probability

$P(D | +) = \ ?$
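
For reference, the exact value follows from Bayes' theorem and the law of total probability, using only the numbers stated above (with $P(+ | no \ D) = 1 - 0.99 = 0.01$ and $P(no \ D) = 1 - 0.005 = 0.995$). The simulation should land close to it:

$$
P(D | +) = \frac{P(+ | D) \, P(D)}{P(+ | D) \, P(D) + P(+ | no \ D) \, P(no \ D)} = \frac{0.99 \cdot 0.005}{0.99 \cdot 0.005 + 0.01 \cdot 0.995} = \frac{0.00495}{0.0149} \approx 0.332
$$

So roughly one in three people who test positive is actually sick.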

## The task

In this exercise, your task is to simulate this scenario and estimate $P(D | +)$ from the generated data. You will need to complete the following steps:

1. Create a sample of N = 100000 people and mark a random 0.5% of them as sick.
2. Run the test for the disease on all of these people. The test is 99% accurate, meaning that some false negative and false positive results are possible.
3. To approximate the probability $P(D | +)$, look at the people who tested positive. How many of them are actually sick?
4. Play around with the parameters of the model to see how they influence the result.

Complete the code below following the instructions.

### Setting up
Here is everything we know about the setting:

In [2]:
# Population size
N = 100000

# The probability of getting the disease
# P(D)
p_disease = 0.005

# The probability that the test is positive
# if the person is ill P(+|D)
true_pos_rate = 0.99

# The probability that the test is negative
# if the person is not ill P(-|no D)
true_neg_rate = 0.99

### Generating N people

Since the probability of having the disease is `p_disease`, roughly `p_disease * N` out of `N` people should be sick.

Let's generate the data that models this.

In [None]:
# Expected number of sick people among N, with p_disease = 0.005.
# Note: this is a single count, not a per-person label (0 - healthy, 1 - sick).

is_sick = N * p_disease
not_sick = N - is_sick
is_sick

500.0
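
The cell above computes only the expected number of sick people. A minimal sketch of per-person labels, as step 1 asks, might look like the following. The names `rng`, `n_sick`, and `sick_labels` and the seed value are illustrative assumptions, not part of the original notebook. The sketch marks exactly `p_disease * N` people, chosen at random without replacement.

```python
import numpy as np

N = 100000          # population size, as in the setup cell
p_disease = 0.005   # P(D)

# Seed chosen arbitrarily so the sketch is reproducible (assumption)
rng = np.random.default_rng(seed=0)

# Number of people to mark as sick: p_disease * N, rounded to an integer
n_sick = int(round(N * p_disease))

# 0 = healthy, 1 = sick; pick n_sick distinct people at random
sick_labels = np.zeros(N, dtype=int)
sick_labels[rng.choice(N, size=n_sick, replace=False)] = 1

print(100 * sick_labels.sum() / N, '% of people in the population have the disease')
```

An alternative is to draw each person independently with probability `p_disease`, in which case the number of sick people varies slightly from run to run.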

In [None]:
# np.sum works whether is_sick is a single count or a per-person 0/1 array
print(100 * np.sum(is_sick) / N, '% of people in the population have the disease')


### Testing for disease

Now, let's test the entire population for the disease in question.

For the sick people, the outcome of the test should be positive (1) with probability `true_pos_rate` and negative (0) otherwise. Similarly, for healthy individuals, it should be negative (0) with probability `true_neg_rate` and positive (1) otherwise.
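
One way to turn this description into code is to draw one uniform random number per person and compare it with that person's probability of testing positive. This is a sketch under stated assumptions: `sick_labels`, `p_positive`, and `test_positive` are hypothetical names, and the seed is arbitrary.

```python
import numpy as np

N = 100000
p_disease = 0.005
true_pos_rate = 0.99   # P(+|D)
true_neg_rate = 0.99   # P(-|no D)

rng = np.random.default_rng(seed=0)  # arbitrary seed (assumption)

# Hypothetical per-person labels: 1 = sick, 0 = healthy
sick_labels = np.zeros(N, dtype=int)
n_sick = int(round(N * p_disease))
sick_labels[rng.choice(N, size=n_sick, replace=False)] = 1

# Probability of a positive result for each person:
# sick people test positive with probability true_pos_rate,
# healthy people with probability 1 - true_neg_rate (a false positive).
p_positive = np.where(sick_labels == 1, true_pos_rate, 1 - true_neg_rate)

# One uniform draw per person decides the outcome (1 = positive, 0 = negative)
test_positive = (rng.random(N) < p_positive).astype(int)

print('Positive tests:', test_positive.sum())
print('Sick people:   ', sick_labels.sum())
```

Because healthy people vastly outnumber sick ones, even a 1% false positive rate produces more false positives than there are sick people.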

In [11]:
np.zeros(shape=(N,2))

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       ...,
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [17]:
# DataFrame that will hold the test results, one row per person.
# Both columns start at 0 and are filled in by the next cell.

test_results = pd.DataFrame(np.zeros(shape=(N,2)), columns=['Healthy','Sick'])
test_results

       Healthy  Sick
0          0.0   0.0
1          0.0   0.0
2          0.0   0.0
3          0.0   0.0
4          0.0   0.0
...        ...   ...
99995      0.0   0.0
99996      0.0   0.0
99997      0.0   0.0
99998      0.0   0.0


In [22]:
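# .loc slices are label-based and include the end label, so stop at n_sick - 1
n_sick = int(is_sick)
test_results.loc[:n_sick - 1, 'Sick'] = 1 * true_pos_rate   # stores the rate P(+|D), not a simulated 0/1 outcome
test_results.loc[n_sick:, 'Healthy'] = 0 * true_neg_rate    # remaining rows stay at 0
test_results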

       Healthy  Sick
0          0.0  0.99
1          0.0  0.99
2          0.0  0.99
3          0.0  0.99
4          0.0  0.99
...        ...   ...
99995      0.0  0.00
99996      0.0  0.00
99997      0.0  0.00
99998      0.0  0.00


### Analysing results of the tests

How many positive test results are there compared to the actual number of sick people?



In [23]:
# Add up the cell values with .to_numpy().sum(); the built-in sum() would iterate over the column labels.
print(str(100 * test_results.to_numpy().sum() / N) + '% of the population is diagnosed with the disease.')
print('Only ' + str(100 * np.sum(is_sick) / N) + '% of people in the population are actually sick.')

Among the people who tested positive, what share is actually sick?

This ratio approximates the probability $P(D | +)$ that we are interested in.
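
As a hedged sketch of how this ratio could be estimated and checked, the helper below simulates the whole scenario and compares the result with the value given by Bayes' theorem. The function names `estimate_p_d_given_pos` and `bayes_p_d_given_pos` are hypothetical, the seed is arbitrary, and here each person is drawn as sick independently with probability `p_disease` (an assumption that differs slightly from marking exactly 0.5% of the population).

```python
import numpy as np

def estimate_p_d_given_pos(n, p_disease, true_pos_rate, true_neg_rate, rng):
    """Simulate n people; return the share of positive tests that belong to sick people."""
    sick = rng.random(n) < p_disease                       # True = sick
    p_positive = np.where(sick, true_pos_rate, 1 - true_neg_rate)
    positive = rng.random(n) < p_positive                  # True = positive test
    n_positive = positive.sum()
    if n_positive == 0:
        return float('nan')                                # no positives to condition on
    return (sick & positive).sum() / n_positive

def bayes_p_d_given_pos(p_disease, true_pos_rate, true_neg_rate):
    """Exact P(D|+) from Bayes' theorem."""
    true_pos = true_pos_rate * p_disease
    false_pos = (1 - true_neg_rate) * (1 - p_disease)
    return true_pos / (true_pos + false_pos)

rng = np.random.default_rng(seed=0)  # arbitrary seed (assumption)

# Vary the prevalence to see how strongly it drives the answer
for p in (0.005, 0.05, 0.5):
    est = estimate_p_d_given_pos(100000, p, 0.99, 0.99, rng)
    exact = bayes_p_d_given_pos(p, 0.99, 0.99)
    print(f'P(D) = {p}: simulated {est:.3f}, Bayes {exact:.3f}')
```

The loop covers step 4 of the task: the rarer the disease, the smaller the share of positive results that correspond to sick people, even though the test itself is unchanged.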

In [None]:
# Your code here