# Assignment 1: Probability Review

## Submitted By: Aarathi Vijayachandran (244267)

### Assignment question is available here: https://ovgu-ailab.github.io/lgm2024/assignment1.html

# PART-1
## Solving via simulation

In [2]:
import numpy as np

def simulate_illness_test(population_size):
    # Generating the sickness status: 1% chance of being sick
    sickness_status = np.random.rand(population_size) < 0.01

    # Generating test results
    # Sick individuals: 99.9% chance of testing positive
    test_results_sick = np.random.rand(population_size) < 0.999
    test_results_sick = np.logical_and(sickness_status, test_results_sick)

    # Healthy individuals: 1% chance of testing positive
    test_results_healthy = np.random.rand(population_size) < 0.01
    test_results_healthy = np.logical_and(~sickness_status, test_results_healthy)

    test_results = np.logical_or(test_results_sick, test_results_healthy)

    # Estimate P(sick | positive) as true positives / all positives
    true_positives = np.sum(test_results_sick)
    total_positives = np.sum(test_results)
    if total_positives == 0:
        print(f"For a population size of {population_size}: No positives detected.")
        return None  # The estimate is undefined when nobody tests positive
    probability = true_positives / total_positives
    print(f"For a population size of {population_size}: Probability that an individual actually has the illness, given a positive test result: {probability*100:.2f}%")
    print(f"Out of {total_positives} people that tested positive, {true_positives} are actually sick.\n\n\n")
    return probability

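# Population sizes to test
# Note: each np.random.rand(n) call allocates n float64 values (8 bytes each),
# so the largest size (10^9) needs about 8 GB of RAM per array.
population_sizes = [1000, 10000, 100000, 1000000, 10000000, 100000000, 1000000000]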

for size in population_sizes:
    simulate_illness_test(size)


For a population size of 1000: Probability that an individual actually has the illness, given a positive test result: 46.67%
Out of 15 people that tested positive, 7 are actually sick.



For a population size of 10000: Probability that an individual actually has the illness, given a positive test result: 49.55%
Out of 222 people that tested positive, 110 are actually sick.



For a population size of 100000: Probability that an individual actually has the illness, given a positive test result: 49.61%
Out of 1935 people that tested positive, 960 are actually sick.



For a population size of 1000000: Probability that an individual actually has the illness, given a positive test result: 50.71%
Out of 19725 people that tested positive, 10002 are actually sick.



For a population size of 10000000: Probability that an individual actually has the illness, given a positive test result: 50.26%
Out of 199637 people that tested positive, 100333 are actually sick.



For a population size of 10
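The per-person arrays above become expensive for the largest populations. As a rough sketch of a lighter alternative (not the method used in the cell above), the same three counts can be drawn directly from binomial distributions. The function name `simulate_counts` and the `seed` argument are illustrative assumptions.

```python
import numpy as np

def simulate_counts(population_size, seed=0):
    """Estimate P(sick | positive) from binomial counts instead of per-person arrays."""
    rng = np.random.default_rng(seed)
    # Number of sick people: each person is sick with probability 1%
    n_sick = rng.binomial(population_size, 0.01)
    # Sick people who test positive (sensitivity 99.9%)
    true_positives = rng.binomial(n_sick, 0.999)
    # Healthy people who test positive (false positive rate 1%)
    false_positives = rng.binomial(population_size - n_sick, 0.01)
    total_positives = true_positives + false_positives
    if total_positives == 0:
        return None  # Undefined when nobody tests positive
    return true_positives / total_positives

estimate = simulate_counts(1_000_000_000)
print(f"Estimated P(sick | positive): {estimate:.4f}")
```

Because only three integers are drawn, memory use stays constant regardless of the population size.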

## Solving via mathematics

### Given:

1. **Prevalence of the illness/disease** (\( P(D) \)): 1% or 0.01
   - This is the prior probability that a randomly selected person from the population is sick.
2. **Sensitivity of the test** (\( P(T|D) \)): 99.9% or 0.999
   - This is the probability that the test is positive given that the person is actually sick.
3. **False positive rate** (\( P(T|-D) \)): 1% or 0.01
   - This is the probability that the test is positive given that the person is not sick.

### To find:

The probability that a person is sick given that his/her test is positive (\( P(D|T) \)).

### Applying Bayes' Theorem:

Bayes' Theorem states:

$$ P(D|T) = \frac{P(T|D) \cdot P(D)}{P(T)} $$

where \( P(T) \) (the probability of testing positive) is calculated using the law of total probability:

$$ P(T) = P(T|D) \cdot P(D) + P(T|-D) \cdot P(-D) $$

### Calculations:

1. **Calculate \( P(-D) \)**, the probability of not being sick:
$$ P(-D) = 1 - P(D) = 1 - 0.01 = 0.99 $$

2. **Calculate \( P(T) \)**:
$$ P(T) = (0.999 \cdot 0.01) + (0.01 \cdot 0.99) = 0.00999 + 0.0099 = 0.01989 $$

3. **Apply Bayes' Theorem**:
$$ P(D|T) = \frac{0.999 \cdot 0.01}{0.01989} \approx 0.50226 $$

### Conclusion:

The probability that a person actually has the illness, given a positive test result, is approximately 50.23%, which agrees with the simulation estimates above. This shows the effect of low prevalence: healthy people outnumber sick people 99 to 1, so even a 1% false positive rate produces about as many false positives (0.0099) as true positives (0.00999). In such a setting, the test's specificity (the complement of the false positive rate) largely determines how much a positive result can be trusted.
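The calculation above can be checked with a few lines of Python. This is a minimal sketch; the helper name `posterior_given_positive` is an illustrative assumption, and the inputs are the values from the "Given" section.

```python
def posterior_given_positive(prior, sensitivity, false_positive_rate):
    """Return P(D|T) via Bayes' theorem and the law of total probability."""
    # P(T) = P(T|D) * P(D) + P(T|-D) * P(-D)
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

p_sick_given_positive = posterior_given_positive(
    prior=0.01, sensitivity=0.999, false_positive_rate=0.01
)
print(f"P(D|T) = {p_sick_given_positive:.5f}")
```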

# PART-2

## Case-1:

### Given:

1. **Prevalence of the illness/disease** (\( P(D) \)): 1% or 0.01
   - This is the prior probability that a randomly selected person from the population is sick.
2. **Sensitivity of the test** (\( P(T|D) \)): 99.9% or 0.999
   - This is the probability that the test is positive given that the person is actually sick.
3. **False positive rate** (\( P(T|-D) \)): 1% or 0.01
   - This is the probability that the test is positive given that the person is not sick.

### To find:

The probability that a person is sick given that his/her test is negative (\( P(D|-T) \)).

### Applying Bayes' Theorem:

Bayes' Theorem states:

$$ P(D|-T) = \frac{P(-T|D) \cdot P(D)}{P(-T)} $$

where \( P(-T|D) \) is the probability of a negative test given that the person is sick (the complement of the test's sensitivity):

$$ P(-T|D) = 1 - P(T|D) = 1 - 0.999 = 0.001 $$

and \( P(-T) \) (the probability of testing negative) is calculated using the law of total probability:

$$ P(-T) = P(-T|D) \cdot P(D) + P(-T|-D) \cdot P(-D) $$

### Calculations:

1. **Calculate \( P(-D) \)**, the probability of not being sick:
$$ P(-D) = 1 - P(D) = 0.99 $$

2. **Calculate \( P(-T|-D) \)**, the probability of a negative test given that the person is not sick (the complement of the false positive rate):
$$ P(-T|-D) = 1 - P(T|-D) = 1 - 0.01 = 0.99 $$

3. **Calculate \( P(-T) \)**:
$$ P(-T) = (0.001 \cdot 0.01) + (0.99 \cdot 0.99) = 0.00001 + 0.9801 = 0.98011 $$

4. **Apply Bayes' Theorem**:
$$ P(D|-T) = \frac{0.001 \cdot 0.01}{0.98011} \approx 0.0000102 $$

### Conclusion:

The probability that a person actually has the illness, given a negative test result, is approximately 0.00102% (about 1 in 98,000). A negative result in this scenario is therefore very reliable at ruling out the disease, because the test's high sensitivity (only 0.1% of sick people test negative) combines with the low prevalence of the illness.
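As a quick numerical check of Case-1, the same steps can be sketched in Python. The helper name `posterior_given_negative` is an illustrative assumption; the inputs are the values listed under "Given".

```python
def posterior_given_negative(prior, sensitivity, false_positive_rate):
    """Return P(D|-T): the probability of being sick despite a negative test."""
    false_negative_rate = 1 - sensitivity       # P(-T|D)
    specificity = 1 - false_positive_rate       # P(-T|-D)
    # P(-T) = P(-T|D) * P(D) + P(-T|-D) * P(-D)
    evidence = false_negative_rate * prior + specificity * (1 - prior)
    return false_negative_rate * prior / evidence

p_sick_given_negative = posterior_given_negative(
    prior=0.01, sensitivity=0.999, false_positive_rate=0.01
)
print(f"P(D|-T) = {p_sick_given_negative:.7f}")
```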

## Case-2:

We define the following events:

- \( D \): the person is sick.
- \( T_1 \): the first test is positive.
- \( T_2 \): the second test is positive.

### Given:

1. **Prevalence of the illness/disease** (\( P(D) \)): 1% or 0.01.
2. **Sensitivity of the first test** (\( P(T_1|D) \)): 99.9% or 0.999.
3. **False positive rate of the first test** (\( P(T_1|-D) \)): 1% or 0.01.
4. **Sensitivity of the second test** (\( P(T_2|D) \)): 96% or 0.96.
5. **False positive rate of the second test** (\( P(T_2|-D) \)): 2% or 0.02.

### To find:

The probability that you are sick given that both tests are positive (\( P(D|T_1 ∩ T_2) \)).

### Using Bayes' Theorem for Two Tests:

We apply Bayes' theorem to two tests that are assumed to be conditionally independent given the person's disease status:

$$ P(D|T_1 ∩ T_2) = \frac{P(T_1 ∩ T_2|D) \cdot P(D)}{P(T_1 ∩ T_2)} $$

Where \( P(T_1 ∩ T_2|D) \) (the joint probability of both tests being positive given the person is sick) and \( P(T_1 ∩ T_2) \) (the joint probability of both tests being positive) need to be calculated.

### Calculations:

1. **Calculate \( P(-D) \)**, the probability of not being sick:
$$ P(-D) = 1 - P(D) = 0.99 $$

2. **Calculate \( P(T_1 ∩ T_2|D) \) using independence**:
$$ P(T_1 ∩ T_2|D) = P(T_1|D) \cdot P(T_2|D) = 0.999 \cdot 0.96 = 0.95904 $$

3. **Calculate \( P(T_1 ∩ T_2|-D) \) using independence**:
$$ P(T_1 ∩ T_2|-D) = P(T_1|-D) \cdot P(T_2|-D) = 0.01 \cdot 0.02 = 0.0002 $$

4. **Calculate \( P(T_1 ∩ T_2) \) using the law of total probability**:
$$ P(T_1 ∩ T_2) = P(T_1 ∩ T_2|D) \cdot P(D) + P(T_1 ∩ T_2|-D) \cdot P(-D) = 0.95904 \cdot 0.01 + 0.0002 \cdot 0.99 = 0.0095904 + 0.000198 = 0.0097884 $$

5. **Apply Bayes' Theorem**:
$$ P(D|T_1 ∩ T_2) = \frac{0.95904 \cdot 0.01}{0.0097884} \approx 0.9798 $$

### Conclusion:

The probability that you actually have the illness, given that both test results are positive, is approximately 97.98%. This result shows how combining multiple tests that are independent given the disease status can significantly increase confidence in the diagnosis, even though the second test has higher error rates than the first.
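Under the same conditional-independence assumption, the two-test posterior can also be obtained by applying Bayes' theorem twice, using the posterior after the first test as the prior for the second. The sketch below compares that sequential update with the joint formula; the helper name `bayes_update` is an illustrative assumption.

```python
def bayes_update(prior, sensitivity, false_positive_rate):
    """Posterior probability of being sick after one positive test."""
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Sequential update: first test, then second test
after_first = bayes_update(0.01, 0.999, 0.01)
after_both = bayes_update(after_first, 0.96, 0.02)

# Joint formula from the calculation steps
numerator = 0.999 * 0.96 * 0.01
joint = numerator / (numerator + 0.01 * 0.02 * 0.99)

print(f"Sequential update: {after_both:.4f}")
print(f"Joint formula:     {joint:.4f}")
```

Both routes give the same value because, given the disease status, the likelihood of the two results factorizes into the product of the individual likelihoods.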

# Modeling Waiting Times

To model the waiting times for appointments, we should consider several characteristics of the data:

1. **Non-negativity**: Waiting times cannot be negative.
2. **Continuity**: Waiting times could be considered as real numbers (e.g., 12.5 minutes), although in some practical scenarios, they may be rounded to the nearest minute.
3. **Shape of Distribution**: Waiting times are likely to be right-skewed, as most people experience short to moderate waiting times, but a few might experience very long waits.
4. **Common Distribution**: A good candidate for modeling waiting times is the exponential distribution due to its simplicity and properties that align with the nature of waiting times (non-negative, continuous, and memoryless).
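The properties listed above can be checked on simulated data. The following sketch draws exponential samples and tests right-skewness (mean above median) and memorylessness, \( P(X > s + t \mid X > s) = P(X > t) \). The mean of 5 minutes and the values of `s` and `t` are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
waits = rng.exponential(scale=5.0, size=1_000_000)  # assumed mean wait: 5 minutes

# Non-negativity and right skew: the mean exceeds the median
print(f"min = {waits.min():.4f}, mean = {waits.mean():.3f}, median = {np.median(waits):.3f}")

# Memorylessness: having already waited s minutes does not change
# the probability of waiting at least t more minutes
s, t = 3.0, 4.0
p_unconditional = np.mean(waits > t)
p_conditional = np.mean(waits[waits > s] > s + t)
print(f"P(X > t)           = {p_unconditional:.4f}")
print(f"P(X > s+t | X > s) = {p_conditional:.4f}")
```

Memorylessness is a strong assumption for real appointment data, so this check is worth repeating on observed waiting times before committing to the exponential model.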

### Exponential Distribution

The exponential distribution is defined by the probability density function (PDF):

$$ f(x|λ) = λ e^{-λ x} \text{ for } x \geq 0 $$

where \( λ > 0 \) is the rate parameter.

#### Estimation via Maximum Likelihood

For a set of observed waiting times \( X = \{x_1, x_2, ..., x_n\} \), the likelihood function for the exponential distribution is:

$$ L(λ) = λ^n e^{-λ \sum_{i=1}^n x_i}  $$

The log-likelihood function is:

$$  \ell(λ) = n \log(λ) - λ \sum_{i=1}^n x_i  $$

To find the maximum likelihood estimate (MLE) of \( λ \), we take the derivative of \( \ell(λ) \) with respect to \( λ \) and set it to zero:

$$  \frac{d\ell}{dλ} = \frac{n}{λ} - \sum_{i=1}^n x_i = 0  $$
$$  \hat{λ} = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar{x}} $$

This gives us the MLE \( \hat{λ} \) as the reciprocal of the sample mean \( \bar{x} \).
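To confirm that this stationary point is a maximum rather than a minimum, one can check the second derivative of the log-likelihood, which is negative for every admissible rate:

$$ \frac{d^2\ell}{dλ^2} = -\frac{n}{λ^2} < 0 \quad \text{for all } λ > 0 $$

The log-likelihood is therefore strictly concave in \( λ \), so the solution of the first-order condition is its unique global maximum.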


In [5]:
import numpy as np

# Set the true parameter for exponential distribution
lambda_true = 0.2  # This corresponds to an average waiting time of 5 minutes

# Generate random samples from exponential distribution
np.random.seed(42)
sample_size = 10000
data = np.random.exponential(1/lambda_true, sample_size)

# MLE for lambda
lambda_mle = sample_size / np.sum(data)

print(f"True lambda: {lambda_true}")
print(f"MLE of lambda: {lambda_mle}")
print(f"Mean of dataset: {np.mean(data)}")
print(f"Reciprocal of MLE (estimated mean waiting time): {1/lambda_mle}")


True lambda: 0.2
MLE of lambda: 0.2046037993599483
Mean of dataset: 4.8874947734511744
Reciprocal of MLE (estimated mean waiting time): 4.8874947734511744
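As an additional sanity check, the closed-form estimate can be compared with a brute-force maximization of the log-likelihood over a grid of candidate rates. This is only a sketch: the grid bounds and resolution are arbitrary assumptions, and the data are regenerated with the same seed and parameters as in the cell above.

```python
import numpy as np

# Regenerate the same synthetic data as above
np.random.seed(42)
data = np.random.exponential(1 / 0.2, 10000)

def log_likelihood(lam, x):
    """Exponential log-likelihood: n * log(lam) - lam * sum(x)."""
    return len(x) * np.log(lam) - lam * np.sum(x)

# Closed-form MLE: reciprocal of the sample mean
lambda_mle = len(data) / np.sum(data)

# Grid search over candidate rates (step size 1e-4)
grid = np.linspace(0.05, 0.5, 4501)
lambda_grid = grid[np.argmax(log_likelihood(grid, data))]

print(f"Closed-form MLE: {lambda_mle:.4f}")
print(f"Grid-search MLE: {lambda_grid:.4f}")
```

The two estimates should agree up to the grid resolution, which supports the derivation of the closed-form estimator.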
