# COVID19 - Building a model from the clinical data

In this notebook I will unify, explore and clean the clinical data in this repository. After that I will try to build a model that predicts whether the coronavirus test will be positive or negative, i.e., a model that clinically diagnoses COVID19.

In [147]:
import pandas as pd
import numpy as np
import seaborn as sns

import glob
import os

os.chdir('../data/') # Change the working directory to the data directory
all_data_available = glob.glob('*.csv')
print(all_data_available)

['04-07_carbonhealth_and_braidhealth.csv', '04-14_carbonhealth_and_braidhealth.csv', '04-21_carbonhealth_and_braidhealth.csv', '04-28_carbonhealth_and_braidhealth.csv', '05-05_carbonhealth_and_braidhealth.csv', '05-12_carbonhealth_and_braidhealth.csv', '05-19_carbonhealth_and_braidhealth.csv', '05-26_carbonhealth_and_braidhealth.csv', '06-02_carbonhealth_and_braidhealth.csv', '06-09_carbonhealth_and_braidhealth.csv', '06-16_carbonhealth_and_braidhealth.csv', '06-23_carbonhealth_and_braidhealth.csv', '06-30_carbonhealth_and_braidhealth.csv', '07-07_carbonhealth_and_braidhealth.csv', '07-14_carbonhealth_and_braidhealth.csv', '07-21_carbonhealth_and_braidhealth.csv', '07-28_carbonhealth_and_braidhealth.csv', '08-04_carbonhealth_and_braidhealth.csv', '08-11_carbonhealth_and_braidhealth.csv', '08-18_carbonhealth_and_braidhealth.csv', '08-25_carbonhealth_and_braidhealth.csv', '09-01_carbonhealth_and_braidhealth.csv', '09-08_carbonhealth_and_braidhealth.csv', '09-15_carbonhealth_and_braidheal

In [166]:
# We load all data from the repo

all_data = None #A workaround to declare the all_data variable for use later

for file in all_data_available:     
    df = pd.read_csv("../data/" + file)    
    print(file, df["covid19_test_results"].value_counts()["Positive"] / len(df["covid19_test_results"]))
    try:
        df["rapid_flu_results"] = df["rapid_flu_results"].astype("object") #Because in 2 files all values are null and because of that pandas changes the column type to float
        if all_data is None:
            all_data = df
        else:
            all_data = pd.merge(all_data, df, how="outer")

    except Exception as e:
        print(file, "could not be merged:", e)
        print(len(df), "rows were left out")

    print("All data size:", len(all_data))

04-07_carbonhealth_and_braidhealth.csv 0.061224489795918366
All data size: 735
04-14_carbonhealth_and_braidhealth.csv 0.08365019011406843
All data size: 998
04-21_carbonhealth_and_braidhealth.csv 0.03074141048824593
All data size: 2104
04-28_carbonhealth_and_braidhealth.csv 0.011996572407883462
All data size: 3271
05-05_carbonhealth_and_braidhealth.csv 0.018738288569643973
All data size: 4872
05-12_carbonhealth_and_braidhealth.csv 0.03541315345699832
All data size: 5465
05-19_carbonhealth_and_braidhealth.csv 0.05353319057815846
All data size: 6399
05-26_carbonhealth_and_braidhealth.csv 0.04533333333333334
All data size: 7149
06-02_carbonhealth_and_braidhealth.csv 0.03134479271991911
All data size: 8138
06-09_carbonhealth_and_braidhealth.csv 0.010725552050473186
All data size: 9723
06-16_carbonhealth_and_braidhealth.csv 0.011756569847856155
All data size: 11169
06-23_carbonhealth_and_braidhealth.csv 0.020618556701030927
All data size: 13594
06-30_carbonhealth_and_braidhealth.csv 0.03353
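
The loop above grows `all_data` through repeated outer merges. As a hedged alternative, assuming every weekly file is a separate batch with the same columns, the frames could be collected in a list and stacked once with `pd.concat`. The two toy frames below are made up for illustration and stand in for the CSV files.

```python
import pandas as pd

# Toy stand-ins for two weekly files (all values are made up)
week_1 = pd.DataFrame({
    "covid19_test_results": ["Negative", "Positive"],
    "age": [34, 51],
    "rapid_flu_results": [None, None],  # all null in this batch
})
week_2 = pd.DataFrame({
    "covid19_test_results": ["Negative"],
    "age": [27],
    "rapid_flu_results": ["Negative"],
})

frames = []
for df in (week_1, week_2):
    df = df.copy()
    # Same dtype fix as in the loop above, so the column type is consistent
    df["rapid_flu_results"] = df["rapid_flu_results"].astype("object")
    frames.append(df)

# Stack all batches in one call and renumber the rows
stacked = pd.concat(frames, ignore_index=True)
positive_share = (stacked["covid19_test_results"] == "Positive").mean()
print(len(stacked), positive_share)
```

Unlike the outer merge, `pd.concat` keeps rows that are identical across files, so the resulting row counts may differ from the ones printed above.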

In [180]:
all_data.drop(columns=["batch_date", "test_name", "swab_type", "er_referral"], inplace=True) #Dropping the columns that have nothing to do with the information about the disease itself

In [181]:
#All Data info
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 93995 entries, 0 to 93994
Data columns (total 42 columns):
covid19_test_results             93995 non-null object
age                              93995 non-null int64
high_risk_exposure_occupation    93826 non-null object
high_risk_interactions           69168 non-null object
diabetes                         93995 non-null object
chd                              93995 non-null object
htn                              93995 non-null object
cancer                           93995 non-null object
asthma                           93995 non-null object
copd                             93995 non-null object
autoimmune_dis                   93995 non-null object
smoker                           93995 non-null object
temperature                      47542 non-null float64
pulse                            48279 non-null float64
sys                              46523 non-null float64
dia                              46523 non-null float64
rr      

### Removing columns with almost no info

The first candidates for removal are the rapid flu and rapid strep results, since they are present in less than 1 percent of the data (counting both positive and negative results). I will also get rid of the radiological findings, since there are too few of them as well.

After that, I am somewhat suspicious of the cough severity and shortness of breath (sob) severity columns, since my intuition is that the severity is only recorded when the patient has a cough or sob. I'll check that in the cells after dropping and, if that is the case, I will join each pair of columns so that a value of 0 means no cough or sob and values from 1 to N encode the severity levels.

I was hesitant to check the er_referral (whether the patient was referred to ER or not) but since the goal is to build an online predictor for the covid test

In [183]:
columns_to_drop = ["rapid_flu_results", "rapid_strep_results", "cxr_findings", "cxr_impression", "cxr_label", "cxr_link"]
all_data.drop(columns=columns_to_drop, inplace = True)

<b>Cough and cough severity</b>

In [186]:
print(all_data["cough"].value_counts())
print("Not nan values:", all_data["cough_severity"].notna().sum())
print(" ")
print("All that have a cough and have cough severity")
print(all_data[(all_data["cough"].notna()) & (all_data["cough"] == True)]["cough_severity"].value_counts())
print("All that do not have a cough and have cough severity")
print(all_data[(all_data["cough"].notna()) & (all_data["cough"] == False)]["cough_severity"].value_counts()) #The cough value for this one is going to be changed to true
print("All that have cough as nan and have cough severity")
print(all_data[(all_data["cough"].isna())].loc[:,"cough_severity"].value_counts()) #This is empty so it is correct
print("Number of rows with cough and no cough severity: ", len(all_data[(all_data["cough"] == True) & (all_data["cough_severity"].isna())])) #Since these are just 15 rows, they will be imputed with the most frequent severity

False    87488
True      6492
Name: cough, dtype: int64
Not nan values: 6492
 
All that have a cough and have cough severity
Mild        4747
Moderate    1627
Severe       118
Name: cough_severity, dtype: int64
All that do not have a cough and have cough severity
Series([], Name: cough_severity, dtype: int64)
All that have cough as nan and have cough severity
Series([], Name: cough_severity, dtype: int64)
Number of rows with cough and no cough severity:  0


In [187]:
cough_false_severity_not_na = all_data.loc[(all_data["cough"].notna()) & (all_data["cough"] == False) & (all_data["cough_severity"].notna())].index
all_data.loc[cough_false_severity_not_na, "cough"] = True

all_data.loc[(all_data["cough"] == True) & (all_data["cough_severity"].isna()), "cough_severity"] = all_data["cough_severity"].mode()[0] #Since it returns a series, the 0 subscript is to retrieve the value

<b>SOB (shortness of breath) and SOB severity</b>

In [188]:
print(all_data["sob"].value_counts())
print("Not nan values:", all_data["sob_severity"].notna().sum())
print(" ")
print("All that have sob and have sob severity")
print(all_data[(all_data["sob"].notna()) & (all_data["sob"] == True)]["sob_severity"].value_counts())
print("All that do not have sob and have sob severity")
print(all_data[(all_data["sob"].notna()) & (all_data["sob"] == False)]["sob_severity"].value_counts()) #The sob value for these rows is going to be changed to True
print("All that have sob as nan and have sob severity")
print(all_data[(all_data["sob"].isna())].loc[:,"sob_severity"].value_counts()) #This is empty so it is correct
print("Number of rows with sob and no sob severity: ", len(all_data[(all_data["sob"] == True) & (all_data["sob_severity"].isna())])) #Any such rows will be imputed with the most frequent severity

False    90461
True      3328
Name: sob, dtype: int64
Not nan values: 3328
 
All that have sob and have sob severity
Mild        2096
Moderate    1106
Severe       126
Name: sob_severity, dtype: int64
All that do not have sob and have sob severity
Series([], Name: sob_severity, dtype: int64)
All that have sob as nan and have sob severity
Series([], Name: sob_severity, dtype: int64)
Number of rows with sob and no sob severity:  0


In [189]:
sob_false_severity_not_na = all_data.loc[(all_data["sob"].notna()) & (all_data["sob"] == False) & (all_data["sob_severity"].notna())].index
all_data.loc[sob_false_severity_not_na, "sob"] = True

all_data.loc[(all_data["sob"] == True) & (all_data["sob_severity"].isna()), "sob_severity"] = all_data["sob_severity"].mode()[0] #Since it returns a series, the 0 subscript is to retrieve the value
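
The checks above show that a severity is only recorded when the symptom is present, so each pair of columns could be joined into a single ordinal column as proposed earlier. The following is a minimal sketch on a made-up toy frame. The helper `to_ordinal`, the `SEVERITY_LEVELS` mapping and the `cough_level` column are hypothetical names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical encoding: 0 = symptom absent, 1-3 = severity level
SEVERITY_LEVELS = {"Mild": 1, "Moderate": 2, "Severe": 3}

def to_ordinal(present: pd.Series, severity: pd.Series) -> pd.Series:
    """Join a True/False symptom column and its severity column into one ordinal column."""
    codes = pd.to_numeric(severity.map(SEVERITY_LEVELS))  # unmapped or missing values become NaN
    codes = codes.where(present != False, 0)               # symptom explicitly absent -> 0
    return codes.astype("Int64")                           # nullable integer keeps unknowns as <NA>

# Toy data (made up): the last row has an unknown cough status
toy = pd.DataFrame({
    "cough": [True, False, True, np.nan],
    "cough_severity": ["Mild", np.nan, "Severe", np.nan],
})
toy["cough_level"] = to_ordinal(toy["cough"], toy["cough_severity"])
print(toy)
```

Rows where the symptom status itself is unknown stay missing in this sketch, so they would still need to be imputed or dropped before modelling.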

### Data imputation and separation in different datasets

In this part, having removed all columns that will not be used, either for lack of data or because they do not provide useful information per se, I will impute missing values in the columns where only a relatively small portion of the data is missing. I use the most frequent value of each such column first. I might also experiment with imputation by most frequent neighbor.

Next, for the others I will explore the option of separating them into different datasets or imputing the data.
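
As a rough sketch of the first step, the snippet below fills missing values with the most frequent value, but only in columns whose share of missing values is small. The 5% cut-off `MAX_MISSING_SHARE` and the toy data are assumptions made for illustration, not values taken from this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical cut-off: only impute columns with at most 5% missing values
MAX_MISSING_SHARE = 0.05

# Toy data (made up): "fatigue" has 1 of 40 values missing, "fever" has half missing
toy = pd.DataFrame({
    "fatigue": [False, False, True, np.nan] + [False] * 36,
    "fever": [np.nan] * 20 + [True] * 20,
})

missing_share = toy.isna().mean()  # fraction of NaN per column
to_impute = missing_share[(missing_share > 0) & (missing_share <= MAX_MISSING_SHARE)].index

imputed = toy.copy()
for column in to_impute:
    most_frequent = imputed[column].mode()[0]  # mode() ignores NaN by default
    imputed.loc[imputed[column].isna(), column] = most_frequent

print(list(to_impute))
print(imputed.isna().sum())
```

Columns above the cut-off, such as `fever` in the toy frame, are left untouched here and would be handled by the second step described above.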

In [190]:
#Positive data info
positive_data = all_data[all_data["covid19_test_results"] == "Positive"]
positive_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1313 entries, 21 to 93813
Data columns (total 36 columns):
covid19_test_results             1313 non-null object
age                              1313 non-null int64
high_risk_exposure_occupation    1308 non-null object
high_risk_interactions           1078 non-null object
diabetes                         1313 non-null object
chd                              1313 non-null object
htn                              1313 non-null object
cancer                           1313 non-null object
asthma                           1313 non-null object
copd                             1313 non-null object
autoimmune_dis                   1313 non-null object
smoker                           1313 non-null object
temperature                      1071 non-null float64
pulse                            1082 non-null float64
sys                              1064 non-null float64
dia                              1064 non-null float64
rr                      

In [185]:
#Negative data info
negative_data = all_data[all_data["covid19_test_results"] == "Negative"]
negative_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 92682 entries, 0 to 93994
Data columns (total 36 columns):
covid19_test_results             92682 non-null object
age                              92682 non-null int64
high_risk_exposure_occupation    92518 non-null object
high_risk_interactions           68090 non-null object
diabetes                         92682 non-null object
chd                              92682 non-null object
htn                              92682 non-null object
cancer                           92682 non-null object
asthma                           92682 non-null object
copd                             92682 non-null object
autoimmune_dis                   92682 non-null object
smoker                           92682 non-null object
temperature                      46471 non-null float64
pulse                            47197 non-null float64
sys                              45459 non-null float64
dia                              45459 non-null float64
rr      

In [191]:
#Let's see how many unique values each column has
#len(unique()) would count NaN as an additional value, which is why .nunique() is preferred
#for column in all_data.columns:
    #print(column, len(all_data[column].unique()), sep=": ") 
    
all_data.nunique(axis=0)

covid19_test_results               2
age                               95
high_risk_exposure_occupation      2
high_risk_interactions             2
diabetes                           2
chd                                2
htn                                2
cancer                             2
asthma                             2
copd                               2
autoimmune_dis                     2
smoker                             2
temperature                       86
pulse                            113
sys                              103
dia                               92
rr                                28
sats                              21
ctab                               2
labored_respiration                2
rhonchi                            2
wheezes                            2
days_since_symptom_onset          20
cough                              2
cough_severity                     3
fever                              2
sob                                2
s

In [91]:
all_data.describe() #After running the line we can see that initially we only have 8 continuous variables

| | age | temperature | pulse | sys | dia | rr | sats | days_since_symptom_onset |
|---|---|---|---|---|---|---|---|---|
| count | 93995.0 | 47626.0 | 48296.0 | 46640.0 | 46640.0 | 41958.0 | 47618.0 | 18904.0 |
| mean | 39.701144 | 37.901694 | 76.51702 | 116.996334 | 76.128838 | 18.068712 | 94.189739 | 16.126111 |
| std | 14.837268 | 3.662426 | 13.896021 | 25.650222 | 12.07255 | 10.500889 | 13.408977 | 23.253454 |
| min | 0.0 | 33.5 | 50.0 | 50.0 | 15.0 | 0.0 | 50.0 | 1.0 |
| 25% | 29.0 | 36.65 | 68.0 | 109.0 | 70.0 | 14.0 | 97.0 | 2.0 |
| 50% | 38.0 | 36.85 | 76.0 | 120.0 | 77.0 | 16.0 | 98.0 | 4.0 |
| 75% | 50.0 | 37.0 | 85.0 | 131.0 | 84.0 | 16.0 | 99.0 | 21.0 |
| max | 91.0 | 50.0 | 160.0 | 235.0 | 150.0 | 50.0 | 100.0 | 300.0 |


I just found in the data dictionary that noise has been added to the age, so I'll replace ages below 0 with the mean age. I chose the mean since it shouldn't noticeably affect the distribution.
For the pulse, since a heart rate below 60 bpm is medically considered bad unless you are an athlete, I will replace every heart rate below 60 with 50, so that we know it is a low heart rate.

In [71]:
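#Assign through .loc: assigning to an attribute of a filtered copy would leave all_data unchanged
all_data.loc[all_data["age"] < 0, "age"] = int(round(all_data["age"].mean())) #Rounded so the age column stays integer
all_data.loc[all_data["pulse"] < 60, "pulse"] = 50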
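
Filtering a DataFrame with a boolean mask returns a new object, so a value assigned to an attribute of the filtered result does not reach the original frame. The sketch below applies the two replacement rules described above through `.loc` on a made-up toy frame, so the effect can be checked on a few rows.

```python
import pandas as pd

# Toy data (made up): one negative age and two pulses below 60
toy = pd.DataFrame({
    "age": [-3.0, 25.0, 40.0],
    "pulse": [35.0, 72.0, 58.0],
})

mean_age = toy["age"].mean()  # computed before any replacement

# .loc with a row mask and a column label writes into the original frame
toy.loc[toy["age"] < 0, "age"] = mean_age
toy.loc[toy["pulse"] < 60, "pulse"] = 50

print(toy)
```

After running it, no negative ages remain and every pulse below 60 has been set to the marker value 50.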
         

In [90]:
all_data[all_data["sats"] < 90].sats.value_counts()

50.0    3977
89.0       2
81.0       2
88.0       2
84.0       2
85.0       1
87.0       1
55.0       1
79.0       1
76.0       1
Name: sats, dtype: int64

In [62]:
positive_data.describe()

| | age | temperature | pulse | sys | dia | rr | sats | days_since_symptom_onset |
|---|---|---|---|---|---|---|---|---|
| count | 1313.0 | 1071.0 | 1082.0 | 1064.0 | 1064.0 | 968.0 | 1076.0 | 670.0 |
| mean | 35.577304 | 36.923576 | 81.547135 | 124.839286 | 79.663534 | 15.049587 | 98.151487 | 4.640299 |
| std | 15.52281 | 0.374418 | 13.625631 | 16.547304 | 9.646506 | 2.071529 | 1.450039 | 5.56015 |
| min | 0.0 | 35.8 | 49.0 | 88.0 | 41.0 | 10.0 | 87.0 | 1.0 |
| 25% | 24.0 | 36.7 | 72.0 | 114.0 | 74.0 | 14.0 | 97.0 | 2.0 |
| 50% | 33.0 | 36.9 | 80.0 | 123.0 | 79.0 | 16.0 | 98.0 | 3.0 |
| 75% | 46.0 | 37.1 | 90.0 | 134.0 | 85.0 | 16.0 | 99.0 | 5.0 |
| max | 83.0 | 39.2 | 140.0 | 210.0 | 130.0 | 26.0 | 100.0 | 60.0 |


In [63]:
negative_data.describe()

| | age | temperature | pulse | sys | dia | rr | sats | days_since_symptom_onset |
|---|---|---|---|---|---|---|---|---|
| count | 92682.0 | 46471.0 | 47197.0 | 45459.0 | 45459.0 | 40480.0 | 46459.0 | 15195.0 |
| mean | 39.226668 | 36.791796 | 76.827913 | 123.109483 | 78.265096 | 14.702495 | 98.240233 | 7.17078 |
| std | 15.023769 | 0.286291 | 13.210527 | 16.10079 | 9.45775 | 1.96849 | 1.42333 | 17.561869 |
| min | -3.0 | 33.5 | 35.0 | 50.0 | 15.0 | 0.0 | 55.0 | 1.0 |
| 25% | 28.0 | 36.65 | 68.0 | 112.0 | 72.0 | 13.0 | 97.0 | 2.0 |
| 50% | 37.0 | 36.8 | 76.0 | 121.0 | 78.0 | 15.0 | 98.0 | 3.0 |
| 75% | 50.0 | 36.95 | 85.0 | 132.0 | 84.0 | 16.0 | 99.0 | 7.0 |
| max | 91.0 | 39.6 | 160.0 | 235.0 | 150.0 | 40.0 | 100.0 | 300.0 |


In [194]:
for column in all_data.columns:
    if all_data[column].dtype == "object":
        print(column + ":", all_data[column].unique(), "NaN values:", all_data[column].isna().sum())

covid19_test_results: ['Negative' 'Positive'] NaN values: 0
high_risk_exposure_occupation: [True False nan] NaN values: 169
high_risk_interactions: [nan True False] NaN values: 24827
diabetes: [False True] NaN values: 0
chd: [False True] NaN values: 0
htn: [False True] NaN values: 0
cancer: [False True] NaN values: 0
asthma: [False True] NaN values: 0
copd: [False True] NaN values: 0
autoimmune_dis: [False True] NaN values: 0
smoker: [False True] NaN values: 0
ctab: [False nan True] NaN values: 58528
labored_respiration: [False nan True] NaN values: 45248
rhonchi: [False nan True] NaN values: 70651
wheezes: [False nan True] NaN values: 66507
cough: [True False nan] NaN values: 15
cough_severity: ['Severe' 'Mild' nan 'Moderate'] NaN values: 87503
fever: [nan False True] NaN values: 22921
sob: [False nan True] NaN values: 206
sob_severity: [nan 'Moderate' 'Mild' 'Severe'] NaN values: 90667
diarrhea: [False nan True] NaN values: 187
fatigue: [False nan True] NaN values: 176
headache: [Fal

In [24]:
all_data["covid19_test_results"].value_counts()["Positive"] / len(all_data["covid19_test_results"])

1232

In [195]:
all_data[all_data["covid19_test_results"] == "Positive"].iloc[:,14:24].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1313 entries, 21 to 93813
Data columns (total 10 columns):
sys                         1064 non-null float64
dia                         1064 non-null float64
rr                          968 non-null float64
sats                        1076 non-null float64
ctab                        885 non-null object
labored_respiration         1077 non-null object
rhonchi                     591 non-null object
wheezes                     690 non-null object
days_since_symptom_onset    670 non-null float64
cough                       1313 non-null object
dtypes: float64(5), object(5)
memory usage: 112.8+ KB
