# Assignment 1: Probability Review - Kai Ponel & Hannan Mahadik 

# Testing Illnesses

## Given:

About **1%** of the population has the **illness**. That is, any given person has a 1% “a priori” probability of being sick \\
Probability that someone is ill => $$P(I)=0.01$$


If a **sick** person is tested, the test returns a **positive** result 99.9% of the time \\
Probability that the test is positive given a sick person is tested => $$P(+ | I) = 0.999$$ 

If a **healthy** person is tested, the test still returns a **positive** result 1% of the time \\
Probability that the test is positive given a healthy person is tested => $$P(+ | H) = 0.01$$

Your test result is positive. What is the probability that you have the illness? \\
To find: \\
Probability that a person is ill, given that the test was positive => $$P(I|+) = ?$$

## Solution: 

### Mathematically

About 1% of the population has the illness, so the probability that someone is healthy is the complement: $$P(H) = 1 - P(I) = 1 - 0.01 = 0.99$$

Using Bayes' Theorem:
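$$P(I|+) = \frac{P(+|I) \cdot P(I)}{P(+)} $$

The denominator is the overall probability of a positive test. By the law of total probability, it splits into positives from ill people and positives from healthy people:

$$P(I|+) = \frac{P(+|I) \cdot P(I)}{P(+|I) \cdot P(I)+P(+|H) \cdot P(H)} $$

$$P(I|+) = \frac{0.999 \cdot 0.01}{0.999 \cdot 0.01 + 0.01 \cdot 0.99} = \frac{0.00999}{0.01989} \approx 0.502 $$

So even after a positive result, the probability of actually being ill is only about 50%, because healthy people vastly outnumber ill ones.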
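The arithmetic above can be checked with a few lines of Python. This is a minimal sketch using only the values given in the problem; the variable names are chosen here for illustration.

```python
# Values given in the problem statement
p_ill = 0.01                 # P(I)
p_pos_given_ill = 0.999      # P(+|I)
p_pos_given_healthy = 0.01   # P(+|H)

# Complement: P(H) = 1 - P(I)
p_healthy = 1 - p_ill

# Law of total probability: P(+) = P(+|I) P(I) + P(+|H) P(H)
p_pos = p_pos_given_ill * p_ill + p_pos_given_healthy * p_healthy

# Bayes' theorem: P(I|+) = P(+|I) P(I) / P(+)
p_ill_given_pos = p_pos_given_ill * p_ill / p_pos

print(f"P(+)   = {p_pos:.5f}")
print(f"P(I|+) = {p_ill_given_pos:.4f}")
```

The posterior comes out at roughly 0.502, which is the value the simulation should approach as the sample grows.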

### Simulation

In [None]:
import random

In [None]:
# Given Parameters

p_ill = 0.01 # Probability that someone has an illness  
p_positive_ill = 0.999 # Probability that someone tested positive, given that they were ill
p_positive_healthy = 0.01 # Probability that someone tested positive, given that they were healthy

In [None]:
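# Other variables
# Both counters are running totals: they are never reset, so every result
# printed below covers all sample sizes simulated so far.

sick_count = 0 # Number of people that are sick (whether or not they tested positive)
positive_count = 0 # Number of people that tested positive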

In [None]:
for population_sample in [1000, 10000, 100000, 1000000]:

  # population_sample is the number of people tested in this round
  for _ in range(population_sample):
      is_sick = random.random() < p_ill

      if is_sick:
          sick_count += 1
          test_result = random.random() < p_positive_ill
      else:
          test_result = random.random() < p_positive_healthy
      
      if test_result:
          positive_count += 1

  proportion = sick_count / positive_count if positive_count > 0 else 0

  print(f"Out of {positive_count} people that tested positive, {sick_count} are actually sick.")
  print(f"The proportion of people that tested positive and are actually sick is {proportion:.2f}.")
  print()

Out of 15 people that tested positive, 5 are actually sick.
The proportion of people that tested positive and are actually sick is 0.33.

Out of 186 people that tested positive, 88 are actually sick.
The proportion of people that tested positive and are actually sick is 0.47.

Out of 2149 people that tested positive, 1070 are actually sick.
The proportion of people that tested positive and are actually sick is 0.50.

Out of 22221 people that tested positive, 11130 are actually sick.
The proportion of people that tested positive and are actually sick is 0.50.
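
The loop above keeps running totals across sample sizes and counts every sick person, including the rare sick person who tests negative. A variant that resets its counters for each sample size and counts only true positives (sick and positive) could look like the sketch below. The function name `simulate`, its parameters, and the fixed seed are assumptions made for this illustration and are not part of the original notebook.

```python
import random

def simulate(n_people, p_ill=0.01, p_pos_ill=0.999, p_pos_healthy=0.01, seed=0):
    """Return (positives, true_positives) for one independent run.

    Hypothetical helper: counters start at zero on every call.
    """
    rng = random.Random(seed)  # local generator so runs are reproducible
    positives = 0
    true_positives = 0
    for _ in range(n_people):
        is_sick = rng.random() < p_ill
        # Pick the positive-test probability that matches the person's state
        p_pos = p_pos_ill if is_sick else p_pos_healthy
        if rng.random() < p_pos:
            positives += 1
            if is_sick:
                true_positives += 1
    return positives, true_positives

for n in [1000, 10000, 100000]:
    positives, true_positives = simulate(n)
    proportion = true_positives / positives if positives > 0 else 0.0
    print(f"n={n}: {true_positives} of {positives} positives are sick ({proportion:.2f})")
```

Because the test detects 99.9% of sick people, the two ways of counting differ only slightly, and both estimates should settle near 0.50 for large samples.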



# Modeling Waiting Times

## Solution:

Suppose we have a dataset of waiting times $$x_1, x_2, \dots, x_n$$ that we assume are independent and identically distributed (i.i.d.) samples from an exponential distribution with unknown rate parameter λ. For $$x \geq 0$$, the PDF of the exponential distribution is given by: \\

$$ f(x | \lambda) = \lambda e^{-\lambda x} $$ \\
 
The likelihood function for the data given the parameter λ is the product of the individual PDFs for each data point:

$$ L(\lambda | x_1, x_2, \dots, x_n) = f(x_1 | \lambda) \cdot f(x_2 | \lambda) \dots f(x_n | \lambda) $$ \\

Taking the natural logarithm of the likelihood function gives us the log-likelihood function:

$$ \ln(L(\lambda | x_1, x_2, \dots, x_n)) = \ln(f(x_1 | \lambda)) + \ln(f(x_2 | \lambda)) + \dots + \ln(f(x_n | \lambda)) $$ \\

For a single observation, the log of the PDF simplifies to: $$ \ln(\lambda e^{-\lambda x}) = \ln(\lambda) + \ln(e^{-\lambda x}) = \ln(\lambda) - \lambda x $$ \\

Substituting this into the log-likelihood gives:
$$ \ln(L(\lambda | x_1, x_2, \dots, x_n)) = \ln(\lambda) - \lambda x_1 + \ln(\lambda) - \lambda x_2 + \dots + \ln(\lambda) - \lambda x_n $$ \\
$$ = n\ln(\lambda) - \lambda(x_1 + x_2 + \dots + x_n) $$ \\
$$ = n\ln(\lambda) - \lambda\sum_{i=1}^n x_i $$ \\
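
As a quick sanity check on this simplification, the sum of the individual log-PDFs can be compared with the closed form on a small example. The waiting times and the value of λ below are made-up illustration values, not data from the assignment.

```python
import math

# Hypothetical waiting times and rate, chosen only for illustration
xs = [2.0, 5.5, 11.0, 0.7]
lam = 0.1
n = len(xs)

# Sum of log-PDFs, taking the log of lambda * exp(-lambda * x) term by term
sum_of_logs = sum(math.log(lam * math.exp(-lam * x)) for x in xs)

# Closed form: n * ln(lambda) - lambda * sum(x_i)
closed_form = n * math.log(lam) - lam * sum(xs)

print(sum_of_logs, closed_form)
print(math.isclose(sum_of_logs, closed_form))
```

The two values agree up to floating-point rounding, which is what the algebra predicts.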

To find the MLE for λ, we take the derivative of the log-likelihood function with respect to λ and set it equal to zero: \\

$$ \frac{d}{d\lambda}\left(n\ln(\lambda) - \lambda\sum_{i=1}^n x_i\right) = \frac{n}{\lambda} - \sum_{i=1}^n x_i = 0 $$ \\

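Solving for λ gives us the maximum likelihood estimate, which is the reciprocal of the sample mean:
$$ \hat{\lambda} = \frac{n}{\sum_{i=1}^n x_i} = \frac{1}{\bar{x}} $$

The second derivative, $$ -\frac{n}{\lambda^2} $$, is negative for every λ > 0, so this stationary point is a maximum.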
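
To illustrate that the closed-form estimate n / Σx really maximizes the log-likelihood, the sketch below draws synthetic waiting times with the standard library and compares the log-likelihood at the estimate with its value at nearby rates. The true rate of 0.1, the sample size, and the seed are assumptions chosen for this example.

```python
import math
import random

def log_likelihood(lam, xs):
    """Exponential log-likelihood: n * ln(lam) - lam * sum(xs)."""
    return len(xs) * math.log(lam) - lam * sum(xs)

rng = random.Random(42)   # fixed seed, assumed here for reproducibility
true_rate = 0.1           # assumed rate used to generate the synthetic data
xs = [rng.expovariate(true_rate) for _ in range(10_000)]

# Closed-form MLE: n divided by the sum of the observations
mle = len(xs) / sum(xs)
print(f"MLE for lambda: {mle:.4f}")

# The log-likelihood at the MLE should exceed its value at nearby rates
for factor in (0.9, 0.99, 1.01, 1.1):
    nearby = factor * mle
    gap = log_likelihood(mle, xs) - log_likelihood(nearby, xs)
    print(f"lambda = {nearby:.4f}: log-likelihood is lower by {gap:.3f}")
```

Every gap is positive, as expected for a strictly concave log-likelihood with a single maximum.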


In [5]:
import numpy as np
from scipy.stats import expon

# Using the exponential distribution
rate = 0.1
x = expon(scale=1/rate)

# Generate random waiting times
n_samples = 10000
waiting_times = x.rvs(size=n_samples)

# Compute the log-likelihood of the data under the true rate
log_likelihood = np.sum(x.logpdf(waiting_times))

# Compute the theoretical expected value (1/rate) and the sample mean of the data
expected_value = 1/rate
mean = np.mean(waiting_times)

# Compute the maximum likelihood estimate
mle_rate = n_samples/np.sum(waiting_times)

print(f"Log-likelihood: {log_likelihood:.2f}")
print(f"Sample mean of the waiting times: {mean:.2f}")
print(f"Expected value of the distribution: {expected_value:.2f}")
print(f"Maximum likelihood estimate for lambda: {mle_rate:.2f}")

Log-likelihood: -33218.66
Sample mean of the waiting times: 10.19
Expected value of the distribution: 10.00
Maximum likelihood estimate for lambda: 0.10
