In [9]:

import random
import pandas as pd
import numpy as np

# **DIAGNOSING A RARE DISEASE**

Imagine that $0.5\%$ of the population has a certain disease:

$P(D) = 0.005$

There exists a $99\%$-accurate test to diagnose it. That is, when a person is sick, the test shows a positive (+) result in $99\%$ of cases. Similarly, when a person is not sick, the test shows a negative (-) result in $99\%$ of cases:

$P(+|D) = P(-|no \ D) = 0.99$

Imagine that a random person tests positive. Should they be worried? How likely is it that they are indeed sick? That is, what is the probability

$P(D | +) = \ ?$

## The task

In this exercise, your task is to simulate this scenario and estimate $P(D | +)$ from the generated data. You will need to complete the following steps:

1. Create a sample of N = 100000 people and mark a random 0.5% of them as sick.
2. Run the test for the disease on all of these people. The test is 99% accurate, meaning that some false negative and false positive results are possible.
3. To approximate the probability $P(D | +)$, look at the people who tested positive. How many of them are actually sick?
4. Play around with the parameters of the model to see how they influence the result.

Complete the code below following the instructions.

### Setting up
Here is everything we know about the setting:

In [6]:
# Population size
N = 100000

# The probability of getting the disease
# P(D)
p_disease = 0.005

# The probability that the test is positive
# if the person is ill P(+|D)
true_pos_rate = 0.99

# The probability that the test is negative
# if the person is not ill P(-|no D)
true_neg_rate = 0.99

### Generating N people

Since the probability of having the disease is `p_disease`, roughly `p_disease * N` out of `N` people should be sick.

Let's generate the data that models this.

In [21]:
# Generate N people and mark each as healthy (0) or sick (1).
# Each person is sick independently with probability p_disease = 0.005.
categories = np.array([0, 1])
population = np.random.choice(categories, size=N, p=[1. - p_disease, p_disease])  # N Bernoulli trials

# Check that the number of sick people (ones) is close to p_disease * N
np.sum(population)



486
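As a rough plausibility check on the count above: the number of sick people among `N` independent draws follows a binomial distribution, so it should land near `N * p_disease`, give or take about one standard deviation. A minimal sketch, re-declaring the parameters so it runs on its own (the names `expected_sick` and `std_sick` are introduced here for illustration only):

```python
import math

N = 100000
p_disease = 0.005

# Mean and standard deviation of a Binomial(N, p_disease) count
expected_sick = N * p_disease
std_sick = math.sqrt(N * p_disease * (1 - p_disease))

print(f"Expected number of sick people: {expected_sick:.0f}")
print(f"Standard deviation: {std_sick:.1f}")
```

A count such as 486 lies within one standard deviation of the expected 500, so the generated sample looks reasonable.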

### Testing for disease

Now, let's test the entire population for the disease in question.

For the sick people, the outcome of the test should be positive (1) with probability `true_pos_rate` and negative (0) otherwise. Similarly, for healthy individuals, it should be negative (0) with probability `true_neg_rate` and positive (1) otherwise.

In [45]:
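# Simulated test results.
# 0 = negative test, 1 = positive test
positive = 1
negative = 0

# Outcome each person would get if they were sick:
# positive with probability true_pos_rate
is_sick = np.array([positive, negative])
test_for_sick_positive = np.random.choice(is_sick, size=N, p=[true_pos_rate, 1. - true_pos_rate])

# Outcome each person would get if they were healthy:
# negative with probability true_neg_rate
is_healthy = np.array([negative, positive])
test_for_healthy_negative = np.random.choice(is_healthy, size=N, p=[true_neg_rate, 1. - true_neg_rate])

# Row 0 holds the "healthy" outcomes, row 1 the "sick" outcomes,
# so a person's status (0 or 1) can later be used as the row index.
test_outcomes = np.array([test_for_healthy_negative,
                          test_for_sick_positive])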


test_outcomes.shape

(2, 100000)

In [None]:
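# For each person, take the result from the row that matches their status:
# row 0 if healthy (population == 0), row 1 if sick (population == 1)
test_outcome = np.choose(population, test_outcomes)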
print(test_outcome.shape)
print(test_outcome.sum())
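To make the row selection done by `np.choose` easier to follow, here is a small toy example with four people. The arrays `toy_population`, `toy_outcomes`, and `toy_result` are made up for illustration and are not part of the exercise data:

```python
import numpy as np

# Toy example with 4 people: 0 = healthy, 1 = sick
toy_population = np.array([0, 1, 1, 0])

toy_outcomes = np.array([
    [0, 0, 1, 0],  # row 0: result each person would get if healthy
    [1, 1, 0, 1],  # row 1: result each person would get if sick
])

# Person i gets toy_outcomes[toy_population[i], i]
toy_result = np.choose(toy_population, toy_outcomes)
print(toy_result)  # [0 1 0 0]

# The same selection written with np.where
same_result = np.where(toy_population == 1, toy_outcomes[1], toy_outcomes[0])
```

Person 2 is sick but takes the 0 from the "sick" row, which is a false negative in this toy data.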

### Analysing results of the tests

How many positive test results are there compared to the actual number of sick people?



In [49]:
print(f'{100 * np.sum(test_outcome) / N}% of the population is diagnosed with the disease.')
print(f'Only {100 * np.sum(population) / N}% of people in the population are actually sick.')

1.442% of the population is diagnosed with the disease.
Only 0.486% of people in the population are actually sick.


Among the people who tested positive, what share is actually sick?

This ratio approximates the probability $P(D | +)$ that we are interested in.

In [64]:
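# Count true positives: people who are sick AND tested positive.
# Dividing the total number of sick people by the number of positives
# would also count sick people whose test came back negative.
true_positives = np.sum((population == 1) & (test_outcome == 1))
prob_of_disease_gn_positive = (true_positives / np.sum(test_outcome)) * 100
print(f'Chances that the person is sick and is positive: P(D|+) {np.round(prob_of_disease_gn_positive)}%')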

Chances that the person is sick and is positive: P(D|+) 34.0%


In [62]:
# Sanity check: theoretical probability of a positive test, P(+),
# from the law of total probability

p_of_test_positive = true_pos_rate*p_disease + (1-true_neg_rate)*(1-p_disease)
p_of_test_positive

0.01490000000000001
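The same quantities give the exact value of $P(D | +)$ through Bayes' theorem, $P(D | +) = P(+ | D) \, P(D) / P(+)$, which the simulated estimate can be compared against. A minimal sketch that re-declares the parameters so it runs on its own (`p_pos` and `p_d_given_pos` are names introduced here):

```python
p_disease = 0.005
true_pos_rate = 0.99
true_neg_rate = 0.99

# P(+) by the law of total probability
p_pos = true_pos_rate * p_disease + (1 - true_neg_rate) * (1 - p_disease)

# Bayes' theorem: P(D|+) = P(+|D) * P(D) / P(+)
p_d_given_pos = true_pos_rate * p_disease / p_pos

print(f"Exact P(D|+) = {p_d_given_pos:.4f}")  # Exact P(D|+) = 0.3322
```

So even with a 99%-accurate test, only about one in three people who test positive is actually sick, because healthy people vastly outnumber sick ones.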

In [52]:
#chatgpt solution

import numpy as np

# Parameters
N = 100000  # Total population
disease_prob = 0.005  # Probability of having the disease
test_positive_if_disease = 0.99  # Sensitivity: P(+ | D)
test_negative_if_no_disease = 0.99  # Specificity: P(- | no D)

# Step 1: Simulate the population
# Mark 0.5% of the population as sick
sick_population = np.random.rand(N) < disease_prob

# Step 2: Simulate the test results
# Initialize test results
test_results = np.zeros(N, dtype=bool)

# For sick individuals, test positive with 99% probability
test_results[sick_population] = np.random.rand(sick_population.sum()) < test_positive_if_disease

# For healthy individuals, test positive with 1% probability
test_results[~sick_population] = np.random.rand((~sick_population).sum()) < (1 - test_negative_if_no_disease)

# Step 3: Calculate P(D | +)
# Count all positive tests and the true positives among them
positive_tests = test_results.sum()
true_positives = (sick_population & test_results).sum()

# Estimate P(D | +)
p_d_given_positive = true_positives / positive_tests if positive_tests > 0 else 0

print("Estimated P(D | +):", p_d_given_positive)


Estimated P(D | +): 0.334716459197787
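For step 4 of the task (playing with the parameters), one option is to wrap the simulation in a function and call it for several prevalence values. This is only a sketch under stated assumptions: the function name `estimate_p_d_given_pos`, the fixed seed, and the list of prevalences are all illustrative choices, and it uses NumPy's `default_rng` generator rather than the legacy `np.random` functions used earlier.

```python
import numpy as np

def estimate_p_d_given_pos(n, p_disease, true_pos_rate, true_neg_rate, seed=0):
    """Simulate n people and return the share of positives who are sick."""
    rng = np.random.default_rng(seed)

    # True status: sick with probability p_disease
    sick = rng.random(n) < p_disease

    # Each person's chance of a positive test depends on their status
    p_positive = np.where(sick, true_pos_rate, 1 - true_neg_rate)
    positive = rng.random(n) < p_positive

    n_positive = positive.sum()
    if n_positive == 0:
        return float("nan")  # no positives, so the ratio is undefined
    return (sick & positive).sum() / n_positive

# Hypothetical prevalence values, chosen only to show the trend
for prevalence in [0.001, 0.005, 0.05, 0.2]:
    estimate = estimate_p_d_given_pos(100000, prevalence, 0.99, 0.99)
    print(f"P(D) = {prevalence} -> estimated P(D|+) = {estimate:.3f}")
```

The estimate should rise as the disease becomes more common, since false positives from the healthy majority matter less. Changing `true_neg_rate` instead shows how strongly the result depends on the false positive rate.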
