# Your Health, You Decide (But Decide Wisely for the Rest of Us, Please)

Hey let me tell you somethin you might wanna hear.

Hospital workers are swamped with patients who now have the coronavirus, on top of sewing masks for themselves, caring for their quarantined family members, and so on. So, to make everyone's life a little better, please ask yourself: Should I really be going out to my doctor's clinic or a hospital?

Who knows? You might walk in as a precaution without the virus, get to the medical facility, and contract the virus from another patient. You could add to a doctor's headaches. So here we are. You're torn about what to do. Well, relax. Let's think about this rationally and make the best decision for you.

Here we have a multi-class classifier that groups you into one of three risk categories and suggests whether it's best to go to a hospital, stay home and keep a watch on your health status, or just sit tight. (DISCLAIMER: These lines of code do not and should not dictate your choices or course of action. Make sure to check in with your doctor on the phone about any decision you make that relates to this crisis. Plus, who am I to tell you what to do? So, choose for yourself. I hope for all our sakes that you end up making the right choice.)

This is a prototype classifier trained on only a handful of symptoms. A later release will take the ailments of every COVID-19 patient into consideration and evaluate them for relevance using a Bayesian network. That release should hopefully cover all symptoms and be (near-)completely comprehensive in triaging you into a risk category. Even more hopefully, this whole thing blows over before I release it.

In [1]:
# Do not import any libraries other than those listed here.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn as skl
import seaborn as sns
import keras
import keras.utils
from keras import utils as np_utils
from keras.models import Sequential
from keras.layers import Dense, Dropout
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import datasets, neighbors
import sklearn.datasets 
from sklearn.preprocessing import LabelBinarizer
plt.style.use('ggplot')

Using TensorFlow backend.


In [None]:
def DGP(n_patients = 1000):
    symptoms = 'pid,fever,tiredness,dry_cough,breathing_difficulty,ache_somewhere,nasal_congestion,runny_nose,sore_throat,diarrhea,hospital_need'
    columns = symptoms.split(',')
    # Nine random 0/1 symptom flags per patient, plus a random initial hospital_need in {0, 1, 2}.
    data = np.random.randint(2, size = (n_patients, 9))
    hosp = np.random.randint(3, size = (n_patients, 1))
    pid = np.arange(1, n_patients + 1).reshape(-1, 1)
    data = np.hstack([pid, data, hosp])
    for i in range(len(data)):
        line = data[i]  # a view into data, so the edits below are written in place
        # Breathing difficulty (column 4) means hospital_need = 2; this rule also switches fever on.
        if line[4] == 1:
            line[-1] = 2
            line[1] = 1
        # Fever (column 1) means hospital_need = 2.
        if line[1] == 1:
            line[-1] = 2
        if not line[1:-1].any():
            # No symptoms at all: no need for a hospital.
            line[-1] = 0
        elif line[-1] != 2:
            # Minor symptoms only. A randomly drawn 2 is left in place.
            line[-1] = 1
    print(data)
    df = pd.DataFrame(data = data, index = np.arange(1, n_patients + 1), columns = columns)
    # pid is an identifier rather than a symptom, so it is left out of the features.
    X = df[columns[1:-1]].values
    Y = df['hospital_need'].values
    
    #test_size can change as more updates reveal more features. Currently 9 major symptoms listed on WHO website are the focus,
    #the ones that the WHO recommends hospitalization for are the ones with hospital_need = 2. 
    #test_size proportion = sqrt(1/number of features) = sqrt(1/9) = 1/3
    
    x_train, x_val, y_train, y_val = train_test_split(X, Y, test_size=0.33, random_state = 0)
    
    return x_train, x_val, y_train, y_val

# The unpacking order must match the return order of DGP.
x_train, x_val, y_train, y_val = DGP()
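
The labelling rule inside `DGP` can also be written out by hand, which makes it easier to see what the classifier is supposed to learn. The sketch below is only an illustration: `triage_label` and `SYMPTOMS` are hypothetical names that do not appear in the notebook, the mapping of codes 0 and 1 to "sit tight" and "stay home" is my reading of the text, and the function ignores the random initial label that `DGP` leaves in place for some rows.

```python
# Hypothetical helper, not part of the notebook: restates the labelling rule by hand.
SYMPTOMS = ['fever', 'tiredness', 'dry_cough', 'breathing_difficulty',
            'ache_somewhere', 'nasal_congestion', 'runny_nose',
            'sore_throat', 'diarrhea']


def triage_label(flags):
    """Map a dict of 0/1 symptom flags to a hospital_need code (0, 1 or 2)."""
    if flags.get('fever', 0) == 1 or flags.get('breathing_difficulty', 0) == 1:
        return 2  # fever and/or breathing difficulty: call a doctor
    if any(flags.get(name, 0) == 1 for name in SYMPTOMS):
        return 1  # minor symptoms only (assumed to mean: stay home and keep watch)
    return 0      # no symptoms (assumed to mean: sit tight)


print(triage_label({'dry_cough': 1}))  # prints 1
```

Writing the rule as a plain function also gives a baseline: a trained model that cannot match it on clean data has not learned the rule.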

In [None]:
#For a sanity check, we use the K-nearest neighbors method to sort patients into triage categories. We check the accuracy.

model = KNeighborsClassifier()
model.fit(x_train, y_train)

# score() returns the mean accuracy: the fraction of rows predicted correctly.
training_accuracy = model.score(x_train, y_train)
print('accuracy of training data inputs: '+str(training_accuracy))

val_acc = model.score(x_val, y_val)
print('accuracy of validation data inputs: '+str(val_acc))
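
A single accuracy number can hide which triage category is being missed, and the category that matters most here is 2 (call a doctor). As a rough sketch, accuracy can be broken down per true class with plain NumPy. `per_class_accuracy` is a hypothetical helper, and the labels below are placeholders for illustration, not output from this notebook.

```python
import numpy as np


def per_class_accuracy(y_true, y_pred, n_classes=3):
    """Fraction of correct predictions within each true class (NaN if a class is absent)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        mask = y_true == c  # rows whose true label is class c
        if mask.any():
            scores.append(float((y_pred[mask] == c).mean()))
        else:
            scores.append(float('nan'))
    return scores


# Placeholder labels for illustration only.
y_true = [0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 2, 2, 2, 1]
print(per_class_accuracy(y_true, y_pred))
```

With the notebook's variables this would be called as `per_class_accuracy(y_val, model.predict(x_val))`.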

Since there are 9 symptoms, I think a reasonable neural network will have 81 (that is, 9 × 9) nodes in each hidden layer (I mean, we start asking questions once we have 2 minor symptoms, but hopefully it learns the rule that the WHO recommends: call a doctor if you have a fever and/or breathing difficulty). To assess the accuracy, there will be long pages of tests where I tweak the number of nodes in a layer and the number of layers. The hidden layers use ReLU, which tends to train quickly when there is little training time to spare; since the classification is not binary, the output layer uses a softmax over the three categories.
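
One detail worth illustrating: with three classes and a cross-entropy loss, the output layer is usually given a softmax so that its scores form a probability distribution, while ReLU is the usual choice for hidden layers. The NumPy sketch below compares the two on made-up raw scores. The functions `relu` and `softmax` are written out here for illustration and are not the Keras implementations.

```python
import numpy as np


def relu(z):
    """Clip negative values to zero; the outputs are not normalised."""
    return np.maximum(z, 0.0)


def softmax(z):
    """Turn raw scores into non-negative values that sum to 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()


logits = np.array([-1.0, 0.5, 2.0])  # made-up raw scores for classes 0, 1, 2
print(relu(logits))     # negative score clipped to 0, others unchanged
print(softmax(logits))  # three probabilities that sum to 1
```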

In [None]:
x_train, x_val, y_train, y_val = DGP()

In [None]:
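#We define an abstracted function since we will be testing a variety of layer widths and numbers of layers to see which one is best.
#layers = number of hidden layers, input_count = nodes per hidden layer, n_features = number of symptom columns in X.
def NN_model(layers, input_count, n_features = 9):
    model = Sequential()
    for i in range(layers):
        if i == 0:
            # Only the first layer needs the input size (pass x_train.shape[1] if it is not 9).
            model.add(Dense(input_count, input_dim = n_features, activation = 'relu'))
        else:
            model.add(Dense(input_count, activation = 'relu'))
        model.add(Dropout(0.1))
    # Softmax gives one probability per triage category, as categorical_crossentropy expects.
    model.add(Dense(3, activation = 'softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model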
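
The `categorical_crossentropy` loss expects one-hot targets, meaning one column per class, rather than the integer codes 0, 1, 2. Keras provides `keras.utils.to_categorical` for this conversion. The NumPy sketch below shows what that encoding looks like; `one_hot` is a hypothetical stand-in written for illustration, not the Keras function.

```python
import numpy as np


def one_hot(labels, n_classes=3):
    """Turn integer class labels into rows with a single 1 in the label's column."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(n_classes, dtype=int)[labels]


print(one_hot([0, 2, 1]))
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]]
```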

In [None]:
#We try a network with one 81-node layer and an output layer to see how it works for a start. We will build on this and try out many
#more combinations with new randomly-generated data for each trial, so the network should not overfit to one particular sample.
#Yet we are looking for the best of many random trials, so we are effectively chasing the maximum for all data.

model1 = NN_model(1, 81)
# categorical_crossentropy expects one-hot targets, so the integer labels are converted first.
model1.fit(x_train, np_utils.to_categorical(y_train, num_classes = 3), epochs = 12, batch_size = 81, verbose = True)
print(model1.evaluate(x_val, np_utils.to_categorical(y_val, num_classes = 3)))

In [None]:
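n_vals = []
for n in range(1, 11, 1):
    n_vals.append(n)
    i_vals = []
    test_accs = []  # reset for each depth so it stays the same length as i_vals
    for i in range(1, 729, 81):
        # DGP: draw fresh random data for every trial
        x_train, x_val, y_train, y_val = DGP()
        model2 = NN_model(n, i)
        model2.fit(x_train, np_utils.to_categorical(y_train, num_classes = 3), epochs = 12, batch_size = 81, verbose = True)
        i_vals.append(i)
        test_accs.append(model2.evaluate(x_val, np_utils.to_categorical(y_val, num_classes = 3))[1])
    # One scatter plot per depth: layer width on the x-axis, validation accuracy on the y-axis.
    print(str(n)+" layer network:")
    plt.scatter(i_vals, test_accs)
    plt.show()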
#plt.scatter(n_vals, test_accs)

The number of layers and inputs for which the accuracy is highest is ___ based on all the scatter plots. The combination whose accuracy is closest to 1 is the one that works best for randomly generated patient data. Our algorithm learns the WHO recommendation rule as quickly as that. Similarly, when more data becomes available and easily machine-readable, the new release will also run a lot of simulations like these to find the best-fitting network structure.
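
Reading the best combination off many scatter plots by eye is error-prone, so one option is to record each trial's accuracy in a dictionary keyed by `(layers, width)` and take the maximum. This is only a sketch: `best_config` is a hypothetical helper, and the numbers in `example` are placeholders to show the call, not measured results.

```python
def best_config(results):
    """Return the (layers, width) key with the highest recorded accuracy."""
    return max(results, key=results.get)


# Placeholder numbers purely to show the call; they are NOT measured results.
example = {(1, 81): 0.50, (2, 81): 0.75, (2, 162): 0.60}
print(best_config(example))  # prints (2, 81)
```

In the sweep, this would mean storing `results[(n, i)] = accuracy` after each call to `evaluate`.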

But for now, we're gonna try another 1000 values and see how it does just as a sanity check on our prototype.