In [1]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None

# Bayesian COVID-19 Classifier
In this task you will develop a simple COVID-19 Classifier based on Bayesian Networks.
For this, you should fill in the dots.

## Load Data
Here we will use a fragment of the dataset, which has been published alongside the paper ["Machine learning-based prediction of COVID-19 diagnosis based on symptoms" Yazeed Zoabi, Shira Deri-Rozov & Noam Shomron](https://www.nature.com/articles/s41746-020-00372-6).

It is a dataset with various symptoms and the associated COVID-19 test results.

First you need to load in the data from the `covid_symptoms_data.csv` supplied along with the exercise.



In [2]:
data = pd.read_csv('./covid_symptoms_data.csv')

There are 5 different syptoms considered in this dataset: `cough`, `fever`, `sore_throat`,	`shortness_of_breath`, `head_ache`. If the value is 1 then the symptom is present if 0 it is not. The `corona_result` column provides the result of the COVID-19 test. It can be either "*negative*", "*positive*" or "*other*". Here we are only interested in the "*negative*" and "*positive*" outcomes.
Let's clean up the dataset to only contain data that is relevant to our classification task.

In [3]:
# Filter the negative and positive results
clean_data = data.query('corona_result == "negative" or corona_result == "positive"')

# Change the negative and positive results into a boolean variable
clean_data['corona_result'] = clean_data['corona_result'].apply(lambda result: np.int0(result == "positive"))

symptoms = ['cough', 'fever', 'sore_throat', 'shortness_of_breath', 'head_ache']

# Select only the columns that are relevant
clean_data = clean_data[symptoms + ['corona_result']]

clean_data

Unnamed: 0,cough,fever,sore_throat,shortness_of_breath,head_ache,corona_result
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
...,...,...,...,...,...,...
55404,0,0,0,0,0,0
55405,0,0,0,0,0,0
55406,0,0,0,0,0,0
55408,0,0,0,0,0,0


## Split into train and test datasets

After loading and cleaning the data, we now have to split it into train and test data (ratio 10:1)

In [14]:
from sklearn.model_selection import train_test_split

# Split the data into train and test sets
train_data, test_data = train_test_split(clean_data, test_size=(1/11), random_state=42)
# If ratio is 10:1 then the size of the test_data is 1/11 of the whole dataset
# Random state is introduced for reproducibility

# Take a look at randomly splitted data
train_data.head()

Unnamed: 0,cough,fever,sore_throat,shortness_of_breath,head_ache,corona_result
4798,0,1,0,1,0,0
46852,0,0,0,0,0,0
32768,1,1,0,0,0,0
41810,1,0,0,0,0,0
52005,0,0,0,0,0,0


In [22]:
# Confirm correct split ratio
print(f'Size of train_data: {len(train_data)}')
print(f'Size of test_data:  {len(test_data)}')
print(f'The split ratio has been implemented successfully!')

Size of train_data: 49527
Size of test_data:  4953
The split ratio has been implemented successfully!


## Compute the probabilities
Now using the training data compute the probabilities $p(C)$ and $p(s_i \mid C)$ from the data.

In [23]:
# Prior of covid test
p_covid = len(train_data[train_data['corona_result'] == 1]) / len(train_data)

# Likelihood of the symptoms given covid
p_symptom_covid = {}
for symptom in symptoms:
    p_symptom_covid[symptom] = len(train_data.query(f'{symptom} == 1 and corona_result == 1')) / len(train_data.query('corona_result == 1'))

# Evidence of the symptoms
p_symptom = {}
for symptom in symptoms:
    p_symptom[symptom] = len(train_data.query(f'{symptom} == 1')) / len(train_data)

In [29]:
print(f'Prior of covid test: {p_covid}\n')
print(f'Likelihood of the symptoms given covid:\n {p_symptom_covid}\n')
print(f'Evidence of the symptoms:\n{p_symptom}\n')

Prior of covid test: 0.08205625214529448

Likelihood of the symptoms given covid:
 {'cough': 0.5014763779527559, 'fever': 0.48302165354330706, 'sore_throat': 0.14739173228346455, 'shortness_of_breath': 0.13410433070866143, 'head_ache': 0.2217027559055118}

Evidence of the symptoms:
{'cough': 0.16380963918670624, 'fever': 0.10208573101540575, 'sore_throat': 0.022371635673471037, 'shortness_of_breath': 0.019100692551537544, 'head_ache': 0.02798473559876431}



## Compute the posterior for the data
Given the probabilites implement the posterior 
$$ p\left(C = 1 \mid \mathbf{S=s}\right) = \frac{p\left(C = 1\right)}{Z}\prod_{j=1}^5 p\left(S_j=s_j \mid C = 1\right),$$
where $Z$ is a normalizing factor of the form $Z = \sum_{i=0}^1 p\left(C = i\right)\prod_{j=1}^5 p\left(S_j=s_j \mid C = i\right)$.

In [33]:
X_train = train_data[symptoms].to_numpy()
y_train = train_data['corona_result'].to_numpy()

# Given probabilities
p_covid = len(train_data[train_data['corona_result'] == 1]) / len(train_data)  # Prior of COVID test
p_s_c1 = p_symptom_covid  # Likelihood of symptoms given COVID
p_s_c0 = {symptom: 1 - p for symptom, p in p_symptom_covid.items()}  # Probability of symptom given COVID is false (complement)

# Calculate the normalizing factor Z
Z = sum(p_covid * np.prod([p_s_c1[symptom] if symptom_value == 1 else p_s_c0[symptom] for symptom, symptom_value in s.items()]) for _, s in train_data[symptoms].iterrows())

# Calculate the posterior probability of COVID given symptoms
p_covid_symptoms = {}
for _, s in train_data[symptoms].iterrows():
    numerator = p_covid * np.prod([p_s_c1[symptom] if symptom_value == 1 else p_s_c0[symptom] for symptom, symptom_value in s.items()])
    p_covid_symptoms[tuple(s)] = numerator / Z

In [36]:
print(f'Probability of symptom given COVID is True:\n{p_s_c1}\n')
print(f'Probability of symptom given COVID is false (complement):\n{p_s_c0}\n')
print(f'Posterior: {p_covid_symptoms}')

Probability of symptom given COVID is True:
{'cough': 0.5014763779527559, 'fever': 0.48302165354330706, 'sore_throat': 0.14739173228346455, 'shortness_of_breath': 0.13410433070866143, 'head_ache': 0.2217027559055118}

Probability of symptom given COVID is false (complement):
{'cough': 0.4985236220472441, 'fever': 0.5169783464566929, 'sore_throat': 0.8526082677165354, 'shortness_of_breath': 0.8658956692913385, 'head_ache': 0.7782972440944882}

Posterior: {(0, 1, 0, 1, 0): 3.0746024272760453e-06, (0, 0, 0, 0, 0): 2.12479718865702e-05, (1, 1, 0, 0, 0): 1.9969926613609327e-05, (1, 0, 0, 0, 0): 2.1373823645029645e-05, (1, 0, 0, 0, 1): 6.088465097746352e-06, (1, 1, 0, 0, 1): 5.688556395466963e-06, (0, 0, 0, 0, 1): 6.052615450458347e-06, (1, 0, 1, 0, 0): 3.69492651179589e-06, (1, 1, 1, 0, 0): 3.452232623824526e-06, (1, 1, 1, 0, 1): 9.833896914530186e-07, (0, 0, 1, 1, 1): 1.6204808420988842e-07, (1, 0, 0, 1, 1): 9.429421648967782e-07, (0, 1, 0, 0, 0): 1.9852341177219084e-05, (1, 0, 0, 1, 0): 3

In [58]:
# Check that all values of posterior are between 0 and 1
posterior_values = p_covid_symptoms.values()
if all(0 <= value <= 1 for value in posterior_values):
    print("All values are between 0 and 1!")
else:
    print("Warning: Not all values are between 0 and 1!")

sorted_probabilities = sorted(posterior_values)
sorted_probabilities

All values are between 0 and 1!


[1.5140427858353685e-07,
 1.5230104627504837e-07,
 1.6204808420988842e-07,
 1.6300789517263208e-07,
 5.315113575579656e-07,
 5.346594998534716e-07,
 5.688769038356017e-07,
 5.722463622985962e-07,
 8.758194078329803e-07,
 8.810068870501547e-07,
 9.373900029837452e-07,
 9.429421648967782e-07,
 9.775993694228735e-07,
 9.833896914530186e-07,
 1.0463251529075179e-06,
 1.0525225378210866e-06,
 3.0746024272760453e-06,
 3.0928133004879463e-06,
 3.290748700818631e-06,
 3.310239808622097e-06,
 3.4319054444889552e-06,
 3.452232623824526e-06,
 3.6731703203623524e-06,
 3.69492651179589e-06,
 5.655061460851848e-06,
 5.688556395466963e-06,
 6.052615450458347e-06,
 6.088465097746352e-06,
 1.9852341177219084e-05,
 1.9969926613609327e-05,
 2.12479718865702e-05,
 2.1373823645029645e-05]

If $P(C=1 \mid \mathbf{s}) > 0.5$ we assume that the patient is classified as positive for COVID-19. Get the classifications $y_{pred}$ for testing set $X_{test}$

In [None]:
X_test = test_data[symptoms].to_numpy()
y_test = test_data['corona_result'].to_numpy()

y_pred = ... # P(C = 1 | s) > 0.5 for X_test

## Evaluate the accuracy
Verify the accuracy of this classifier on the test data.

\begin{equation}
ACC = \frac{\sum_{i=0}^N \mathbb{I}[y_{pred}^{(i)} = y_{test}^{(i)}]}{|y_{test}|},
\end{equation}
$\mathbb{I}[true] = 1$ and $\mathbb{I}[false] = 0$.

In [None]:
acc = ...

#### Questions
- What do you observe when evaluating the classifier?
- How well does this classifier perform and why?
- What could you do to improve the classifier?

Your Answers: