# Alternative Doctore
### **Sami Lababidi**

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

The dataset available is divided into four csv tables. The first table contains a list of diseases and their descriptions, 41 rows. The second contains a set of recommendations for each disease, 41 rows. The third table is a list of all unique symptoms reported by patients, 133 rows. The final table has a list of cases with a combination of symptoms that previous patients reported and their final diagnoses, 4919 rows.

In [2]:
# storing all diseases in an array
disease_description = pd.read_csv("archive/symptom_Description.csv")
diseases = []
for i in range(len(disease_description['Disease'])):
    diseases.append(disease_description['Disease'][i].strip())
    print(f"{i}: {disease_description['Disease'][i]}")

0: Drug Reaction
1: Malaria
2: Allergy
3: Hypothyroidism
4: Psoriasis
5: GERD
6: Chronic cholestasis
7: hepatitis A
8: Osteoarthristis
9: (vertigo) Paroymsal  Positional Vertigo
10: Hypoglycemia
11: Acne
12: Diabetes
13: Impetigo
14: Hypertension
15: Peptic ulcer diseae
16: Dimorphic hemorrhoids(piles)
17: Common Cold
18: Chicken pox
19: Cervical spondylosis
20: Hyperthyroidism
21: Urinary tract infection
22: Varicose veins
23: AIDS
24: Paralysis (brain hemorrhage)
25: Typhoid
26: Hepatitis B
27: Fungal infection
28: Hepatitis C
29: Migraine
30: Bronchial Asthma
31: Alcoholic hepatitis
32: Jaundice
33: Hepatitis E
34: Dengue
35: Hepatitis D
36: Heart attack
37: Pneumonia
38: Arthritis
39: Gastroenteritis
40: Tuberculosis


In [3]:
# an array for all reported symptoms 
symptom_severity = pd.read_csv("archive/Symptom-severity.csv")
reported_symptoms = []
for i in range(len(symptom_severity['Symptom'])):
    reported_symptoms.append(symptom_severity['Symptom'][i].strip())
    print(f"{i}: {symptom_severity['Symptom'][i]}")

0: itching
1: skin_rash
2: nodal_skin_eruptions
3: continuous_sneezing
4: shivering
5: chills
6: joint_pain
7: stomach_pain
8: acidity
9: ulcers_on_tongue
10: muscle_wasting
11: vomiting
12: burning_micturition
13: spotting_urination
14: fatigue
15: weight_gain
16: anxiety
17: cold_hands_and_feets
18: mood_swings
19: weight_loss
20: restlessness
21: lethargy
22: patches_in_throat
23: irregular_sugar_level
24: cough
25: high_fever
26: sunken_eyes
27: breathlessness
28: sweating
29: dehydration
30: indigestion
31: headache
32: yellowish_skin
33: dark_urine
34: nausea
35: loss_of_appetite
36: pain_behind_the_eyes
37: back_pain
38: constipation
39: abdominal_pain
40: diarrhoea
41: mild_fever
42: yellow_urine
43: yellowing_of_eyes
44: acute_liver_failure
45: fluid_overload
46: swelling_of_stomach
47: swelled_lymph_nodes
48: malaise
49: blurred_and_distorted_vision
50: phlegm
51: throat_irritation
52: redness_of_eyes
53: sinus_pressure
54: runny_nose
55: congestion
56: chest_pain
57: weaknes

After storing all unique symptoms and diseases, now we create an array of buckets. Each bucket represents a case.

In [4]:
# loading patient cases wich contains diagnosis and symptoms
cases = pd.read_csv("archive/dataset.csv")
# cleaning the data
before = ['Dimorphic hemmorhoids(piles)', ' dischromic _patches',' spotting_ urination',' foul_smell_of urine']
after =  ['Dimorphic hemorrhoids(piles)', 'dischromic_patches'  ,'spotting_urination'  ,'foul_smell_ofurine']

cases['Disease'] = cases['Disease'].str.strip()

cases = cases.replace(before,after)



# 2D array storing a bucket of symptoms for each case
# the first element of each row represents the number of symptoms reported
case_buckets = {}

# 1D array with patients diagnosis
patient_diagnosis = []




In [5]:
# run this only once
for row in range(len(cases)):
    dis = diseases.index(cases.iloc[row,0])
    patient_diagnosis.append(dis)
    sym_list = [reported_symptoms.index(x.strip()) for x in cases.iloc[row,1:] if not pd.isnull(x)]
#     print(sym_list)
    # the first element of symptoms array is the number of symptoms reported
    
    sym = set()
    for i in sym_list:
        sym.add(i)
    case_buckets[row] = sym
    
    

Now, we have the data to find the most frequent symptom using A-Priori. However, we still need to create a matrix to find nearest k-neighbors so we can recommend a list of possible diagnosis using collaborative filtering.

The matrix will have n rows and m columns where n is the number of cases and m is the number of unique symptoms.

In [6]:
# initialize the matrix with all values equal -1
case_symptom_matrix = np.array([[-1]*len(reported_symptoms)]*len(case_buckets))
for c in range(len(case_buckets)):
    for s in case_buckets[c]:
        case_symptom_matrix[c][s] = 1


        
def add_patient(all_sym, curr_sym, ignored_sym):
    new_patient = [-1]*len(all_sym)
    for i in all_sym:
        if i in curr_sym:
            new_patient[i] = 1
        if i in ignored_sym:
            new_patient[i] = 0

    return new_patient

Run A-Priori algorithm to find the most frequent symptoms reported by patients.

In [7]:
frequency = []
for i in range(len(reported_symptoms)):
    f = 0
    for bucket in case_buckets.keys():
        if i in case_buckets[bucket]:
            f += 1
    new_list = [reported_symptoms[i], f]
    frequency.append(new_list)
freq_df = pd.DataFrame(frequency, columns=["symptom", "frequency"])
freq_df = freq_df.sort_values(by=['frequency'],ascending=False)
freq_df.head()

Unnamed: 0,symptom,frequency
14,fatigue,1932
11,vomiting,1914
25,high_fever,1362
35,loss_of_appetite,1152
34,nausea,1146


In [8]:
#calculating support for each single item

# return support for each set in the list lst
def A_priori(lst, buckets):
    supp = [0]*len(lst)
    for i in range(len(lst)):
        supp[i] = np.sum([set(lst[i]) <= set(case_buckets[k]) for k in case_buckets.keys()])/len(case_buckets.keys())
    return supp

# creates a combination of reported symptoms from the new patient 
#   with all other symptoms.
#   ignored symptoms are frequent symptoms that the patient is not Experiencing 
def add_sym(symptom_list, cur_sym, ignored_symptoms):
    new_combination = []
    for i in symptom_list:
        if i not in ignored_symptoms and i not in cur_sym:
            sym = cur_sym.copy()
            sym.append(i)
            new_combination.append(sym)
    return new_combination

# return an 2D array with each element consists of list of symptoms and support
#   with minimum of limit
def return_supp(sym, supp, limit):
    lst = []
    for i,s in zip(sym,supp):
        if s >= limit:
            lst.append([i,s])
    return lst

def print_sym_supp(lst, all_sym):
    for i in lst:
        print(all_sym[i[0][-1]], i[1])
        
# storing integers instead of dealing with strings
sym = [[x] for x in range(len(reported_symptoms))]
all_sym = [x for x in range(len(reported_symptoms))]
supp = A_priori(sym, case_buckets)

print('Symptom','Support')
for i,s in zip(reported_symptoms,supp):
    if s >= 0.2:
        print(i,s)
    

Symptom Support
vomiting 0.38902439024390245
fatigue 0.3926829268292683
high_fever 0.27682926829268295
headache 0.2304878048780488
nausea 0.2329268292682927
loss_of_appetite 0.23414634146341465
abdominal_pain 0.2097560975609756


In [9]:
# the patient is experiencing these symptoms from the list above
curr_sym = [reported_symptoms.index('vomiting')]
curr_sym.append(reported_symptoms.index('fatigue'))
curr_sym.append(reported_symptoms.index('high_fever'))

# the rest of the symptoms are added to ignored symptoms
ignored_sym = []
ignored_sym.append(reported_symptoms.index('headache'))
ignored_sym.append(reported_symptoms.index('nausea'))
ignored_sym.append(reported_symptoms.index('loss_of_appetite'))
ignored_sym.append(reported_symptoms.index('abdominal_pain'))

# create new combination
new_combo = add_sym(all_sym, curr_sym, ignored_sym)
supp = A_priori(new_combo, case_buckets)
sym_supp = return_supp(new_combo, supp, 0.04)
print_sym_supp(sym_supp, reported_symptoms)

chills 0.06219512195121951
yellowing_of_eyes 0.041463414634146344
malaise 0.04024390243902439


In [10]:
# add symptoms
curr_sym.append(reported_symptoms.index('yellowing_of_eyes'))

# patient not experiencing the symptom
ignored_sym.append(reported_symptoms.index('chills'))
ignored_sym.append(reported_symptoms.index('malaise'))

# create a row for the new patient to be added to recommendation array
new_patient = add_patient(all_sym, curr_sym, ignored_sym)
case_buckets[len(case_buckets)] = curr_sym

#### Implementing colleboratibve filtering recommendation algorithom

In [11]:
def mag(x):
    return np.sqrt(np.sum(x**2))

def cosine_sim(x,y):
    return np.sum(x*y)/(mag(x)*mag(y))


def nearest_k_neighbors(M, k, patient):
    sim_patients = {}
    y = np.where(patient == -1, 0, patient)
    for i in range(len(M)):
        x = np.where(M[i]==-1,0,M[i]) # replace all 99 with 0 for the cosine
        cos = cosine_sim(x,y)
        sim_patients[i] = cos
    sim_patients_sorted = {k: v for k, v in sorted(sim_patients.items(), key=lambda item: item[1])}
#     sim_users_sorted = sim_users_sorted.reverse()
    data = list(reversed(sim_patients_sorted.keys()))[:k]
    return data

def suggestion_set(patient_diagnosis, k_neighbors, diseases):
    suggestions = set()
    for i in k_neighbors:
        suggestions.add(diseases[patient_diagnosis[i]])
    return suggestions


k_Nearest = nearest_k_neighbors(case_symptom_matrix, 50, new_patient)

print(suggestion_set(patient_diagnosis, k_Nearest, diseases))


{'Hepatitis C', 'Hepatitis D', 'Chronic cholestasis'}


The system recommended we check for `Chronic cholestasis`, `Hepatitis D` and `Hepatitis C`.

Let's run some test to see if the symptoms reported by patients with these diagnosis have similar symptoms.

To do that, we create a bucket for each disease and add all symptoms reported by all patients and compare between the new patient symptom and the suggested diagnosis.

In [12]:
def print_symptoms(symps, lst):
    for i in lst:
        print(symps[i])

disease_buckets = {}
for i in range(len(patient_diagnosis)):
    if patient_diagnosis[i] in disease_buckets.keys():
        disease_buckets[patient_diagnosis[i]].union(set(case_buckets[i]))
    else:
        disease_buckets[patient_diagnosis[i]] = set(case_buckets[i])

print("**The symptoms reported by patients for Chronic cholestasis: ")
print_symptoms(reported_symptoms, disease_buckets[diseases.index('Chronic cholestasis')])

print('**************************')
print("**The symptoms reported by patients for Hepatitis D: ")
print_symptoms(reported_symptoms, disease_buckets[diseases.index('Hepatitis D')])


print('**************************')
print("**The symptoms reported by patients for Hepatitis C: ")
print_symptoms(reported_symptoms, disease_buckets[diseases.index('Hepatitis C')])


print('**************************')

print("**The symptoms reported by the new patient: ")
print_symptoms(reported_symptoms, curr_sym)

**The symptoms reported by patients for Chronic cholestasis: 
itching
yellowish_skin
nausea
loss_of_appetite
abdominal_pain
yellowing_of_eyes
vomiting
**************************
**The symptoms reported by patients for Hepatitis D: 
yellowish_skin
dark_urine
nausea
loss_of_appetite
joint_pain
abdominal_pain
vomiting
yellowing_of_eyes
fatigue
**************************
**The symptoms reported by patients for Hepatitis C: 
yellowish_skin
nausea
loss_of_appetite
family_history
fatigue
**************************
**The symptoms reported by the new patient: 
vomiting
fatigue
high_fever
yellowing_of_eyes


**Another possibe approach to recommend a diagnosis:**

- ask the patients to rate the severity of a symptom from 1 to 10
- use the rating for a more accurate diagnosis for future patients

This approach is going to require additional datasets that I am unable to find. random weights will be given for cases if this theory is implemented.