In [1]:
import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None

# Bayesian COVID-19 Classifier
In this task you will develop a simple COVID-19 Classifier based on Bayesian Networks.
For this, you should fill in the dots.

## Load Data
Here we will use a fragment of the dataset, which has been published alongside the paper ["Machine learning-based prediction of COVID-19 diagnosis based on symptoms" Yazeed Zoabi, Shira Deri-Rozov & Noam Shomron](https://www.nature.com/articles/s41746-020-00372-6).

It is a dataset with various symptoms and the associated COVID-19 test results.

First you need to load in the data from the `covid_symptoms_data.csv` supplied along with the exercise.



In [2]:
data = pd.read_csv('./covid_symptoms_data.csv')

There are 5 different symptoms considered in this dataset: `cough`, `fever`, `sore_throat`, `shortness_of_breath`, `head_ache`. A value of 1 means the symptom is present; a value of 0 means it is absent. The `corona_result` column provides the result of the COVID-19 test. It can be either "*negative*", "*positive*" or "*other*". Here we are only interested in the "*negative*" and "*positive*" outcomes.
Let's clean up the dataset to only contain data that is relevant to our classification task.

In [3]:
# Keep only the rows with a negative or positive test result
clean_data = data.query('corona_result == "negative" or corona_result == "positive"').copy()

# Encode the test result as an integer: 1 for positive, 0 for negative
# (np.int0 is deprecated and was removed in NumPy 2.0)
clean_data['corona_result'] = (clean_data['corona_result'] == "positive").astype(int)

symptoms = ['cough', 'fever', 'sore_throat', 'shortness_of_breath', 'head_ache']

# Select only the columns that are relevant
clean_data = clean_data[symptoms + ['corona_result']]

clean_data

Unnamed: 0,cough,fever,sore_throat,shortness_of_breath,head_ache,corona_result
0,0,0,0,0,0,0
1,0,0,0,0,0,0
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
...,...,...,...,...,...,...
55404,0,0,0,0,0,0
55405,0,0,0,0,0,0
55406,0,0,0,0,0,0
55408,0,0,0,0,0,0


## Split into train and test datasets

After loading and cleaning the data, we now have to split it into train and test data (ratio 9:1, i.e. 90% for training and 10% for testing).

In [5]:

train_data = clean_data.sample(frac=0.9,random_state=20)
test_data = clean_data.drop(train_data.index)

train_data

Unnamed: 0,cough,fever,sore_throat,shortness_of_breath,head_ache,corona_result
29255,0,0,0,0,0,1
38554,0,0,0,0,0,0
39010,0,0,0,0,0,0
4828,0,0,0,0,0,0
30340,1,1,0,0,0,0
...,...,...,...,...,...,...
3951,0,0,1,0,0,0
17387,0,0,0,0,0,0
1565,0,0,0,0,0,0
43981,0,0,0,0,0,0
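
The split can be sanity-checked on a small stand-in frame. The sketch below assumes a toy DataFrame `toy` (hypothetical, not the exercise data) and only illustrates that `sample(frac=0.9)` followed by `drop` yields two disjoint parts in a 9:1 ratio.

```python
import numpy as np
import pandas as pd

# Toy stand-in for clean_data (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    'cough': np.arange(100) % 2,
    'corona_result': (np.arange(100) % 10 == 0).astype(int),
})

toy_train = toy.sample(frac=0.9, random_state=20)  # 90% of the rows, shuffled
toy_test = toy.drop(toy_train.index)               # the remaining 10%

print(len(toy_train), len(toy_test))
# True when no row index appears in both parts
print(toy_train.index.intersection(toy_test.index).empty)
```

Note that `drop` relies on the index labels being unique; if the index contained duplicates, more rows than intended could be removed.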


## Compute the probabilities
Now using the training data compute the probabilities $p(C)$ and $p(s_i \mid C)$ from the data.

In [10]:
p_covid = train_data['corona_result'].sum() / len(train_data)  # prior p(C = 1)

# Split the training data by test result
cov_1 = train_data.loc[train_data['corona_result'] == 1]
cov_0 = train_data.loc[train_data['corona_result'] == 0]

def psc(data):
    """Product of p(s_i = 1 | C) over all symptoms, i.e. the probability that
    all five symptoms are present, assuming conditional independence."""
    product = 1
    for symptom in symptoms:
        product = data[symptom].sum() / len(data) * product
    return product

p_symptom_covid = psc(cov_1)  # likelihood of all symptoms being present given covid
p_symptom = psc(cov_1) * p_covid + psc(cov_0) * (1 - p_covid)  # evidence for all symptoms being present


print(p_covid)
print(p_symptom_covid)
print(p_symptom)



0.08188529939631262
0.001104310074318736
9.043527845772005e-05
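
As an alternative way to read off $p(C)$ and every $p(s_i \mid C)$ at once, the per-class means of the 0/1 columns can be computed with a `groupby`. This is a minimal sketch on a toy frame; `toy` and `toy_symptoms` are hypothetical names with made-up values, not the exercise data.

```python
import pandas as pd

# Toy training data (hypothetical values chosen for easy mental arithmetic)
toy = pd.DataFrame({
    'cough':         [1, 1, 0, 0, 1, 0, 0, 0],
    'fever':         [1, 0, 1, 0, 0, 0, 0, 1],
    'corona_result': [1, 1, 1, 1, 0, 0, 0, 0],
})
toy_symptoms = ['cough', 'fever']

# Prior p(C = 1): the mean of a 0/1 column is the fraction of ones
prior = toy['corona_result'].mean()

# p(s_i = 1 | C = c): per-class mean of each symptom column,
# one row per class (0 and 1), one column per symptom
cond = toy.groupby('corona_result')[toy_symptoms].mean()

print(prior)
print(cond)
```

Here `cond.loc[1]` holds the estimates for positive tests and `cond.loc[0]` those for negative tests; $p(s_i = 0 \mid C)$ follows as one minus the corresponding entry.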


## Compute the posterior for the data
Given the probabilities, implement the posterior
$$ p\left(C = 1 \mid \mathbf{S=s}\right) = \frac{p\left(C = 1\right)}{Z}\prod_{j=1}^5 p\left(S_j=s_j \mid C = 1\right),$$
where $Z$ is a normalizing factor of the form $Z = \sum_{i=0}^1 p\left(C = i\right)\prod_{j=1}^5 p\left(S_j=s_j \mid C = i\right)$.
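
To make the formula concrete, here is a small worked sketch for a single symptom vector. The prior and the per-symptom probabilities are copied (rounded) from the values this notebook prints; the helper name `posterior` is hypothetical. Each factor $p(S_j = s_j \mid C)$ is $p_j$ when the symptom is present and $1 - p_j$ when it is absent.

```python
import numpy as np

# Values copied (rounded) from this notebook's printed output
prior = 0.0818853  # p(C = 1)
p_s_c1 = np.array([0.50884184, 0.48393524, 0.1494396, 0.13599004, 0.22067248])  # p(S_j = 1 | C = 1)
p_s_c0 = np.array([0.13381611, 0.06755226, 0.0110847, 0.00877446, 0.01055157])  # p(S_j = 1 | C = 0)

def posterior(s):
    """p(C = 1 | S = s) for one binary symptom vector s (hypothetical helper)."""
    s = np.asarray(s)
    # p(S_j = s_j | C) is p_j when s_j = 1 and 1 - p_j when s_j = 0
    joint1 = prior * np.prod(np.where(s == 1, p_s_c1, 1 - p_s_c1))
    joint0 = (1 - prior) * np.prod(np.where(s == 1, p_s_c0, 1 - p_s_c0))
    return joint1 / (joint0 + joint1)  # the denominator is Z

print(posterior([1, 0, 0, 1, 0]))  # cough and shortness of breath
print(posterior([0, 0, 0, 0, 0]))  # no symptoms
```

With these inputs the two results come out at roughly 0.66 and 0.016, in line with the values printed by the notebook for the same symptom patterns.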

In [29]:
X_train = train_data[symptoms].to_numpy()
y_train = train_data['corona_result'].to_numpy()




def prob_symp(symptom,data):
    return data[symptom].sum()/len(data)

p_s_c1 = np.array([prob_symp(symptom, cov_1) for symptom in symptoms]) # probability of symptom given covid is true
p_s_c0 = np.array([prob_symp(symptom, cov_0) for symptom in symptoms]) # probability of symptom given covid is false

print(p_s_c1)
print(p_s_c0)

def post(system):
    # Per-symptom likelihood: p(s_j = 1 | C) if the symptom is present, else 1 - p(s_j = 1 | C).
    # Selecting on the symptom value itself stays correct even if an estimated probability is exactly 0.
    likelihood1 = np.where(system == 1, p_s_c1, 1 - p_s_c1)  # given C = 1
    likelihood0 = np.where(system == 1, p_s_c0, 1 - p_s_c0)  # given C = 0

    # Normalizing factor Z: joint probability summed over both classes
    Z = np.prod(likelihood0) * (1 - p_covid) + np.prod(likelihood1) * p_covid

    return p_covid * np.prod(likelihood1) / Z

print(X_train[20, :])
print(post(X_train[20, :]))
print(X_train)

p_covid_symptoms = np.apply_along_axis(post, axis=1, arr=X_train)  # posterior probability of covid given each row's symptoms

print(p_covid_symptoms)

[0.50884184 0.48393524 0.1494396  0.13599004 0.22067248]
[0.13381611 0.06755226 0.0110847  0.00877446 0.01055157]
[1 0 0 1 0]
0.6633800536958459
[[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 ...
 [0 0 0 0 0]
 [0 0 0 0 0]
 [1 0 1 0 0]]
[0.01625922 0.01625922 0.01625922 ... 0.01625922 0.01625922 0.63467816]
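
The row-by-row evaluation can also be written as one matrix expression in log space, which avoids the Python-level loop and is less prone to underflow when there are many features. This is a sketch under the assumption that all estimated probabilities lie strictly between 0 and 1; `posterior_batch` and the toy parameters are hypothetical and not part of the exercise.

```python
import numpy as np

def posterior_batch(X, prior, p_s_c1, p_s_c0):
    """Posterior p(C = 1 | s) for every row of a binary matrix X (hypothetical helper)."""
    X = np.asarray(X)
    # log p(s | C) = sum_j [ s_j * log p_j + (1 - s_j) * log(1 - p_j) ]
    log_joint1 = np.log(prior) + X @ np.log(p_s_c1) + (1 - X) @ np.log(1 - p_s_c1)
    log_joint0 = np.log(1 - prior) + X @ np.log(p_s_c0) + (1 - X) @ np.log(1 - p_s_c0)
    # Normalize: p1 / (p0 + p1) = 1 / (1 + exp(log p0 - log p1))
    return 1.0 / (1.0 + np.exp(log_joint0 - log_joint1))

# Toy parameters (hypothetical, not estimated from the dataset)
toy_prior = 0.5
toy_p1 = np.array([0.8, 0.6])  # p(s_j = 1 | C = 1)
toy_p0 = np.array([0.2, 0.4])  # p(s_j = 1 | C = 0)
toy_X = np.array([[1, 1], [0, 0], [1, 0]])

print(posterior_batch(toy_X, toy_prior, toy_p1, toy_p0))
```

For the first toy row the joint probabilities are 0.24 and 0.04, so the posterior is 0.24 / 0.28, which the function reproduces up to floating-point rounding.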


If $P(C=1 \mid \mathbf{s}) > 0.5$, the patient is classified as positive for COVID-19. Compute the classifications $y_{pred}$ for the test set $X_{test}$.

In [42]:
X_test = test_data[symptoms].to_numpy()
y_test = test_data['corona_result'].to_numpy()

p_covid_symptoms_test = np.apply_along_axis(post, axis=1, arr=X_test)
y_pred = (p_covid_symptoms_test > 0.5).astype(int)  # 1 where P(C = 1 | s) > 0.5 for X_test, else 0

print(y_pred)
y_pred[20:350]

[0 1 1 ... 0 0 0]


array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

## Evaluate the accuracy
Verify the accuracy of this classifier on the test data.

\begin{equation}
ACC = \frac{\sum_{i=1}^{N} \mathbb{I}[y_{pred}^{(i)} = y_{test}^{(i)}]}{|y_{test}|},
\end{equation}
where $N = |y_{test}|$ is the number of test samples, $\mathbb{I}[true] = 1$ and $\mathbb{I}[false] = 0$.

In [43]:
def accuracy(pred,test):
    correct = 0
    for i in range(len(pred)):
        if pred[i] == test[i]:
            correct +=1
    return correct/len(test)    

print(len(y_test))
print(len(y_pred))
acc = accuracy(y_pred,y_test)
print(acc)

5448
5448
0.8907856093979442
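
Accuracy alone can be hard to interpret when the classes are imbalanced; the training prior printed above is about 0.08, so most tests are negative. One way to put the number in context is to compare it with the accuracy of always predicting "negative" and to look at the confusion counts. The sketch below uses toy arrays `y_true` and `y_hat` (hypothetical values, not the notebook's `y_test` and `y_pred`).

```python
import numpy as np

# Toy labels and predictions (hypothetical, for illustration only)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

acc = np.mean(y_hat == y_true)   # same quantity as the ACC formula
baseline = np.mean(y_true == 0)  # accuracy of always predicting "negative"

# Confusion counts
tp = int(np.sum((y_hat == 1) & (y_true == 1)))  # true positives
fp = int(np.sum((y_hat == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_hat == 0) & (y_true == 1)))  # false negatives
tn = int(np.sum((y_hat == 0) & (y_true == 0)))  # true negatives

print(acc, baseline)
print(tp, fp, fn, tn)
```

In this toy case the classifier scores 0.7 while the trivial baseline scores 0.8, which shows how a seemingly reasonable accuracy can still fall below a constant prediction. The same comparison could be run with `y_pred` and `y_test`.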


#### Questions
- What do you observe when evaluating the classifier?
- How well does this classifier perform and why?
- What could you do to improve the classifier?

Your Answers: