# Diagnostic Testing Accuracy Practice

A healthcare provider is testing a new diagnostic method using a blood test to detect a particular disease. Your task is to analyze the performance of this test using simulated data.

In [7]:
import pandas as pd
import numpy as np
from simulate import DiagnosticSimulator

np.random.seed(30)

## Your Task

1. Understand what this simulated dataset consists of
2. Build a confusion matrix
3. Compute test performance metrics
4. Reflection

In [8]:
sim = DiagnosticSimulator()
data = sim.simulate()

### Understand what this simulated dataset consists of

In [10]:
data

,PatientID,TrueDisease,TestResult
0,1,0,1
1,2,0,1
2,3,0,1
3,4,1,1
4,5,0,1
...,...,...,...
995,996,0,1
996,997,0,1
997,998,0,1
998,999,1,1


In [12]:
data["TrueDisease"].value_counts()

TrueDisease
0    616
1    384
Name: count, dtype: int64

In [13]:
data["TestResult"].value_counts()

TestResult
1    948
0     52
Name: count, dtype: int64

### Build a confusion matrix

**Note:** You are building this from scratch. Do not use external libraries to create the confusion matrix.

In [40]:
# Count patients in each (TestResult, TrueDisease) cell
confusion = data.groupby(['TestResult', 'TrueDisease']).count()
confusion

TestResult,TrueDisease,PatientID
0,0,38
0,1,14
1,0,578
1,1,370
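The grouped counts above can also be labeled explicitly as TP, FP, FN and TN. The following is a minimal sketch rather than the intended solution: `confusion_counts` and `demo` are hypothetical names, and `demo` is rebuilt from the four counts shown in the output above so that the example runs without the simulator.

```python
import pandas as pd

# Rebuild a frame with the same cell counts as the groupby output above.
# Keys are (TestResult, TrueDisease); values are the number of patients.
counts = {(0, 0): 38, (0, 1): 14, (1, 0): 578, (1, 1): 370}
rows = [
    {"TestResult": test, "TrueDisease": disease}
    for (test, disease), n in counts.items()
    for _ in range(n)
]
demo = pd.DataFrame(rows)


def confusion_counts(df):
    """Count the four confusion-matrix cells using boolean masks."""
    test_pos = df["TestResult"] == 1
    has_disease = df["TrueDisease"] == 1
    return {
        "TP": int((test_pos & has_disease).sum()),
        "FP": int((test_pos & ~has_disease).sum()),
        "FN": int((~test_pos & has_disease).sum()),
        "TN": int((~test_pos & ~has_disease).sum()),
    }


cm = confusion_counts(demo)

# Arrange the counts in the usual 2x2 layout: rows = test result, columns = truth.
matrix = pd.DataFrame(
    [[cm["TP"], cm["FP"]], [cm["FN"], cm["TN"]]],
    index=["Test positive", "Test negative"],
    columns=["Disease present", "Disease absent"],
)
print(matrix)
```

Calling `confusion_counts(data)` on the simulated frame should give the same four numbers, assuming `data` has the `TestResult` and `TrueDisease` columns shown earlier.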


### Compute test performance metrics

- Sensitivity
- Specificity
- Positive Predictive Value
- Negative Predictive Value
- Likelihood Ratios (LR+ and LR-)
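
For reference, these metrics have standard definitions in terms of the confusion-matrix cells. Here TP, FP, FN and TN denote the counts of true positives, false positives, false negatives and true negatives; the symbols are shorthand introduced for this summary.

```latex
\begin{aligned}
\text{Sensitivity} &= \frac{TP}{TP + FN}, &
\text{Specificity} &= \frac{TN}{TN + FP}, \\[4pt]
\text{PPV} &= \frac{TP}{TP + FP}, &
\text{NPV} &= \frac{TN}{TN + FN}, \\[4pt]
LR^{+} &= \frac{\text{Sensitivity}}{1 - \text{Specificity}}, &
LR^{-} &= \frac{1 - \text{Sensitivity}}{\text{Specificity}}.
\end{aligned}
```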

In [42]:
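# Cell counts read from the grouped table above
tn = 38   # true negatives:  TestResult = 0, TrueDisease = 0
fn = 14   # false negatives: TestResult = 0, TrueDisease = 1
fp = 578  # false positives: TestResult = 1, TrueDisease = 0
tp = 370  # true positives:  TestResult = 1, TrueDisease = 1

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
lr_po = sensitivity / (1 - specificity)
lr_ne = (1 - sensitivity) / specificity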

print('sensitivity =',sensitivity)
print('specificity =',specificity)
print('ppv =',ppv)
print('npv =',npv)
print('Likelihood Ratio positive =', lr_po)
print('Likelihood Ratio negative =', lr_ne)

sensitivity = 0.9635416666666666
specificity = 0.06168831168831169
ppv = 0.39029535864978904
npv = 0.7307692307692307
Likelihood Ratio positive = 1.0268886966551325
Likelihood Ratio negative = 0.5910087719298251


### Reflection

1. Based on sensitivity and specificity, how accurate is the test?

In [48]:
# Sensitivity is high, at about 96%: among patients who have the disease, the test
# misses very few (14 false negatives against 370 true positives).
# Specificity is very poor, at about 6.2%: the test flags almost every healthy
# patient as positive (578 false positives against 38 true negatives).
# Taken together, the test returns a positive result for nearly everyone, so it barely
# separates sick from healthy patients (LR+ is close to 1) and cannot be called accurate.

2. If a patient tests positive, how confident can we be that they truly have the disease?

In [49]:
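# The PPV is about 39%, so roughly 39% of patients who test positive truly have the disease.
# That is barely above the 38.4% prevalence in this sample, so a positive result adds very little information.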

3. How would the predictive values change if the disease were rare (e.g., <5% prevalence)?

In [None]:
# 
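
One way to explore this question without rerunning the simulation is to apply Bayes' theorem directly, holding sensitivity and specificity at the values computed above and varying only the prevalence. This is a sketch under that assumption; `predictive_values` is a hypothetical helper name, not part of the `simulate` module.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) from Bayes' theorem for a given prevalence."""
    # Expected fraction of the population in each confusion-matrix cell
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence

    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Sensitivity and specificity taken from the counts in this notebook
sensitivity = 370 / (370 + 14)
specificity = 38 / (38 + 578)

# Compare a rare disease (5%) with the prevalence observed in the sample (38.4%)
for prevalence in (0.05, 0.384):
    ppv, npv = predictive_values(sensitivity, specificity, prevalence)
    print(f"prevalence={prevalence:.3f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With these inputs, the PPV at 5% prevalence falls to roughly 0.05 while the NPV rises to roughly 0.97: when the disease is rare, most positives are false positives, and a negative result becomes more reassuring.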

4. Which metric would be most important for:
    - A screening test for early detection
    - A confirmatory test to finalize diagnosis

### Optional Exercise

Try rerunning the simulation with a different prevalence (e.g., 5%, 50%, 70%) and observe how the PPV and NPV change, even if sensitivity and specificity remain the same.

In [3]:
# reset the same seed from earlier
np.random.seed(30)

# pass additional arguments to simulator
sim = DiagnosticSimulator(n_patients=1000, prevalence=0.384)
data = sim.simulate()
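
The internals of `DiagnosticSimulator` are not shown in this notebook, so the sketch below uses a hypothetical NumPy stand-in, `simulate_predictive_values`, to illustrate the exercise. It assumes each patient has the disease independently with probability `prevalence` and that the test errs independently according to a fixed sensitivity and specificity. It also uses `np.random.default_rng` rather than the global seed, which is a choice made for this example only.

```python
import numpy as np


def simulate_predictive_values(prevalence, sensitivity, specificity,
                               n_patients=100_000, seed=30):
    """Simulate patients and return the empirical (PPV, NPV).

    Hypothetical stand-in for the notebook's simulator, for illustration only.
    """
    rng = np.random.default_rng(seed)

    # True disease status: each patient is diseased with probability `prevalence`
    disease = rng.random(n_patients) < prevalence

    # Diseased patients test positive with probability `sensitivity`;
    # healthy patients test positive with probability 1 - `specificity`
    p_positive = np.where(disease, sensitivity, 1 - specificity)
    test = rng.random(n_patients) < p_positive

    tp = np.sum(test & disease)
    fp = np.sum(test & ~disease)
    fn = np.sum(~test & disease)
    tn = np.sum(~test & ~disease)
    return tp / (tp + fp), tn / (tn + fn)


# Sensitivity and specificity taken from the counts in this notebook
sens = 370 / (370 + 14)
spec = 38 / (38 + 578)

for prev in (0.05, 0.50, 0.70):
    ppv_sim, npv_sim = simulate_predictive_values(prev, sens, spec)
    print(f"prevalence={prev:.2f}  PPV={ppv_sim:.3f}  NPV={npv_sim:.3f}")
```

The same loop could be written around `DiagnosticSimulator(n_patients=..., prevalence=...)`, recomputing the confusion matrix from each simulated `data` frame.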