## Getting the data

http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html

Copy the table from this link and paste it into a CSV file named `dataset_uncleaned.csv`. Then follow the steps below to preprocess it.

## Cleaning our data

In [1]:
import pandas as pd

In [5]:
import csv
from collections import defaultdict

disease_list = []

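def return_list(disease):
    # The cell is expected to hold code_name pairs joined by '^'. After turning
    # '^' into '_' and splitting, every second field (2nd, 4th, ...) is a name.
    disease_list = []
    match = disease.replace('^','_').split('_')
    ctr = 1
    for group in match:
        if ctr%2==0:
            disease_list.append(group)
        ctr = ctr + 1

    return disease_list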

with open("dataset_uncleaned.csv") as csvfile:
    reader = csv.reader(csvfile)
    disease=""
    weight = 0
    disease_list = []
    dict_wt = {}
    dict_=defaultdict(list)
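    for row in reader:
        # A non-empty first column starts a new disease entry; blank cells
        # ("" or a mis-decoded non-breaking space) continue the previous one.
        if row[0]!="\xc2\xa0" and row[0]!="":
            disease = row[0]
            disease_list = return_list(disease)
            weight = row[1]

        # Attach every symptom in the third column to each current disease name.
        if row[2]!="\xc2\xa0" and row[2]!="":
            symptom_list = return_list(row[2])

            for d in disease_list:
                for s in symptom_list:
                    dict_[d].append(s)
                dict_wt[d] = weight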

    #print (dict_)
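
To make the parsing step concrete, here is a minimal sketch of what `return_list` does to a single cell. The sample strings are an assumption about how cells in `dataset_uncleaned.csv` look (a code and a name joined by `_`, with several such pairs joined by `^`), and `parse_names` is a hypothetical stand-in that should behave like `return_list`.

```python
def parse_names(cell):
    # Normalise the '^' separator, then split into alternating code/name fields.
    parts = cell.replace('^', '_').split('_')
    # Fields at odd indices (2nd, 4th, ...) are the names; the rest are codes.
    return parts[1::2]

# Hypothetical cells in the assumed "code_name^code_name" layout.
single = parse_names("UMLS:C0020538_hypertensive disease")
multiple = parse_names("UMLS:C0008031_pain chest^UMLS:C0392680_shortness of breath")
print(single)
print(multiple)
```

Note that a name which itself contains an underscore would be split into several fields by this approach.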

Writing our cleaned data

In [6]:
with open("dataset_clean.csv","w",newline="") as csvfile:
    writer = csv.writer(csvfile)
    # One row per (disease, symptom) pair, together with the disease's weight.
    for key,values in dict_.items():
        for v in values:
            writer.writerow([key,v,dict_wt[key]])

In [7]:
columns = ['Source','Target','Weight']

In [8]:
data = pd.read_csv("dataset_clean.csv",names=columns, encoding ="ISO-8859-1")

In [9]:
data.head()

Unnamed: 0,Source,Target,Weight
0,hypertensive disease,pain chest,3363
1,hypertensive disease,shortness of breath,3363
2,hypertensive disease,dizziness,3363
3,hypertensive disease,asthenia,3363
4,hypertensive disease,fall,3363


In [10]:
data.to_csv("dataset_clean.csv",index=False)

In [11]:
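slist = []  # symptoms already written to the node table
dlist = []  # diseases already written to the node table
with open("nodetable.csv","w",newline="") as csvfile:
    writer = csv.writer(csvfile)

    # Each node gets one row: Id, Label and its type (symptom or disease).
    for key,values in dict_.items():
        for v in values:
            if v not in slist:
                writer.writerow([v,v,"symptom"])
                slist.append(v)
        if key not in dlist:
            writer.writerow([key,key,"disease"])
            dlist.append(key)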

In [12]:
nt_columns = ['Id','Label','Attribute']

In [13]:
nt_data = pd.read_csv("nodetable.csv",names=nt_columns, encoding ="ISO-8859-1",)

In [14]:
nt_data.head()

Unnamed: 0,Id,Label,Attribute
0,pain chest,pain chest,symptom
1,shortness of breath,shortness of breath,symptom
2,dizziness,dizziness,symptom
3,asthenia,asthenia,symptom
4,fall,fall,symptom


In [15]:
nt_data.to_csv("nodetable.csv",index=False)

## Analysing our cleaned data

In [16]:
data = pd.read_csv("dataset_clean.csv", encoding ="ISO-8859-1")

In [17]:
data.head()

Unnamed: 0,Source,Target,Weight
0,hypertensive disease,pain chest,3363
1,hypertensive disease,shortness of breath,3363
2,hypertensive disease,dizziness,3363
3,hypertensive disease,asthenia,3363
4,hypertensive disease,fall,3363


In [18]:
len(data['Source'].unique())

149

In [19]:
len(data['Target'].unique())

405

In [20]:
df = pd.DataFrame(data)

In [21]:
df_1 = pd.get_dummies(df.Target)

In [22]:
df_1.head()

Unnamed: 0,Heberden's node,Murphy's sign,Stahli's line,abdomen acute,abdominal bloating,abdominal tenderness,abnormal sensation,abnormally hard consistency,abortion,abscess bacterial,...,vision blurred,vomiting,weepiness,weight gain,welt,wheelchair bound,wheezing,withdraw,worry,yellow sputum
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
df.head()

Unnamed: 0,Source,Target,Weight
0,hypertensive disease,pain chest,3363
1,hypertensive disease,shortness of breath,3363
2,hypertensive disease,dizziness,3363
3,hypertensive disease,asthenia,3363
4,hypertensive disease,fall,3363


In [24]:
df_s = df['Source']

In [25]:
df_pivoted = pd.concat([df_s,df_1], axis=1)

In [26]:
df_pivoted.drop_duplicates(keep='first',inplace=True)

In [27]:
df_pivoted[:5]

Unnamed: 0,Source,Heberden's node,Murphy's sign,Stahli's line,abdomen acute,abdominal bloating,abdominal tenderness,abnormal sensation,abnormally hard consistency,abortion,...,vision blurred,vomiting,weepiness,weight gain,welt,wheelchair bound,wheezing,withdraw,worry,yellow sputum
0,hypertensive disease,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,hypertensive disease,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,hypertensive disease,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,hypertensive disease,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,hypertensive disease,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
len(df_pivoted)

2116

In [29]:
cols = df_pivoted.columns

In [30]:
cols = cols[1:]

In [31]:
df_pivoted = df_pivoted.groupby('Source').sum()
df_pivoted = df_pivoted.reset_index()
df_pivoted[:5]

Unnamed: 0,Source,Heberden's node,Murphy's sign,Stahli's line,abdomen acute,abdominal bloating,abdominal tenderness,abnormal sensation,abnormally hard consistency,abortion,...,vision blurred,vomiting,weepiness,weight gain,welt,wheelchair bound,wheezing,withdraw,worry,yellow sputum
0,Alzheimer's disease,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,HIV,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,PneumocystisÃ¿cariniiÃ¿pneumonia,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,accidentÃ¿cerebrovascular,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,acquiredÃ¿immuno-deficiency syndrome,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [32]:
len(df_pivoted)

149

In [33]:
df_pivoted.to_csv("df_pivoted.csv")

In [35]:
x = df_pivoted[cols]
y = df_pivoted['Source']
print (x[:5])
print (y[:5])

   Heberden's node  Murphy's sign  Stahli's line  abdomen acute  \
0                0              0              0              0   
1                0              0              0              0   
2                0              0              0              0   
3                0              0              0              0   
4                0              0              0              0   

   abdominal bloating  abdominal tenderness  abnormal sensation  \
0                   0                     0                   0   
1                   0                     0                   0   
2                   0                     0                   0   
3                   0                     0                   0   
4                   0                     0                   0   

   abnormally hard consistency  abortion  abscess bacterial      ...        \
0                            0         0                  0      ...         
1                            0        

### Trying out our classifier to learn diseases from the symptoms

In [36]:
import pandas as pd
#import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split



In [37]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

In [38]:
mnb = MultinomialNB()
mnb = mnb.fit(x_train, y_train)

In [39]:
mnb.score(x_test, y_test)

0.0

### Inferences on train and test split
The model scores 0.0 on the test split because every disease appears exactly once in the pivoted data: any disease placed in the test set is absent from the training set, so the classifier can never predict it. A train/test split is therefore meaningless here, and we train on the full dataset instead. That leaves open what to test on: possibly records with missing symptoms, or queries such as "given one symptom, what is the disease?", which is closer to a multilabel problem. We could also work symptom by symptom. Ultimately, what we want to replicate is differential diagnosis.
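
The following sketch illustrates the point using only the standard library. The label names are hypothetical; the only assumption carried over from the notebook is that there are 149 diseases with one sample each and that about a third of them are held out.

```python
import random

# Hypothetical labels: one sample per disease, as in the pivoted table.
labels = ["disease_%d" % k for k in range(149)]

rng = random.Random(42)
shuffled = labels[:]
rng.shuffle(shuffled)

# Hold out roughly a third of the samples, mirroring test_size=0.33.
n_test = round(0.33 * len(shuffled))
test_labels = shuffled[:n_test]
train_labels = shuffled[n_test:]

# No test label was ever seen during training, so a classifier that can
# only predict labels from its training set cannot get any of them right.
overlap = set(train_labels) & set(test_labels)
print(len(train_labels), len(test_labels), len(overlap))
```

Whatever the random seed, the overlap is empty, which is why the score on the held-out split is exactly zero.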

In [40]:
mnb_tot = MultinomialNB()
mnb_tot = mnb_tot.fit(x, y)

In [41]:
mnb_tot.score(x, y)

0.8993288590604027

In [42]:
disease_pred = mnb_tot.predict(x)

In [43]:
disease_real = y.values

In [44]:
for i in range(0, len(disease_real)):
    if disease_pred[i]!=disease_real[i]:
        print ('Pred: {0} Actual:{1}'.format(disease_pred[i], disease_real[i]))

Pred: HIV Actual:acquiredÃ¿immuno-deficiency syndrome
Pred: biliary calculus Actual:cholelithiasis
Pred: coronary arteriosclerosis Actual:coronary heart disease
Pred: depression mental Actual:depressive disorder
Pred: HIV Actual:hiv infections
Pred: carcinoma breast Actual:malignant neoplasm of breast
Pred: carcinoma of lung Actual:malignant neoplasm of lung
Pred: carcinoma prostate Actual:malignant neoplasm of prostate
Pred: carcinoma colon Actual:malignant tumor of colon
Pred: candidiasis Actual:oralcandidiasis
Pred: effusion pericardial Actual:pericardial effusion body substance
Pred: malignant neoplasms Actual:primary malignant neoplasm
Pred: sepsis (invertebrate) Actual:septicemia
Pred: sepsis (invertebrate) Actual:systemic infection
Pred: tonic-clonic epilepsy Actual:tonic-clonic seizures


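These are the predicted versus actual diseases for the 15 of 149 samples that our classifier misclassifies. Most pairs look like synonyms (for example, biliary calculus and cholelithiasis), which suggests that these diseases share the same symptom list in the source table and cannot be told apart from symptoms alone.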
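
One way to check whether misclassified diseases simply share a symptom list is to group disease names by their set of symptoms. This is a sketch on made-up data: `toy` is a hypothetical stand-in for `dict_`, and the symptom values are invented for illustration.

```python
from collections import defaultdict

# Toy stand-in for dict_ (disease -> list of symptoms); values are made up.
toy = {
    "biliary calculus": ["nausea", "pain abdominal"],
    "cholelithiasis": ["pain abdominal", "nausea"],
    "asthma": ["wheezing", "cough"],
}

# Group disease names by their (order-independent) set of symptoms.
groups = defaultdict(list)
for disease, symptoms in toy.items():
    groups[frozenset(symptoms)].append(disease)

# Any group with more than one name is indistinguishable from symptoms alone.
ambiguous = [sorted(names) for names in groups.values() if len(names) > 1]
print(ambiguous)
```

Running the same grouping over `dict_` would list every set of disease names that no classifier could separate using these features.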

### Training a decision tree

In [45]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

In [46]:
print ("DecisionTree")
dt = DecisionTreeClassifier()
clf_dt=dt.fit(x,y)
print ("Accuracy: ", clf_dt.score(x,y))

DecisionTree
Accuracy:  0.8993288590604027


In [47]:
from sklearn import tree 
from sklearn.tree import export_graphviz

export_graphviz(dt, 
                out_file='tree.dot', 
                feature_names=cols)

In [53]:
import pydot

(graph,) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('tree.png')
from IPython.display import Image
Image(filename='tree.png')

FileNotFoundError: [WinError 2] "dot.exe" not found in path.

According to the plotted decision tree, the root node splits on the symptom `Jugular venous distention`, and the Gini impurity of the samples at that node is 0.9846. Because it is chosen for the first split, this symptom plays a major role in predicting diseases.
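
When Graphviz is not available, as in the `dot.exe` error above, a fitted tree can still be inspected as plain text. The sketch below assumes a scikit-learn version that provides `sklearn.tree.export_text` (0.21 or later); the tiny dataset and the names `toy_tree`, `fever` and `cough` are made up for illustration.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny made-up dataset: two binary symptoms, three diseases.
X = [[0, 0], [0, 1], [1, 0]]
y = ["a", "b", "c"]
names = ["fever", "cough"]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text renders the fitted tree as indented plain text.
text = export_text(toy_tree, feature_names=names)
print(text)
```

In the notebook, the equivalent call would pass `dt` and `list(cols)` instead of the toy objects.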
<hr>

## Analysis of the Manual data

In [54]:
data = pd.read_csv("Training.csv")

In [55]:
data.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
1,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
3,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
4,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection


In [56]:
data.columns

Index(['itching', 'skin_rash', 'nodal_skin_eruptions', 'continuous_sneezing',
       'shivering', 'chills', 'joint_pain', 'stomach_pain', 'acidity',
       'ulcers_on_tongue',
       ...
       'blackheads', 'scurring', 'skin_peeling', 'silver_like_dusting',
       'small_dents_in_nails', 'inflammatory_nails', 'blister',
       'red_sore_around_nose', 'yellow_crust_ooze', 'prognosis'],
      dtype='object', length=133)

In [57]:
len(data.columns)

133

In [58]:
len(data['prognosis'].unique())

41

The manual training dataset contains 41 distinct target diseases.

In [59]:
df = pd.DataFrame(data)

In [60]:
df.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
1,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
3,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection
4,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal infection


In [61]:
len(df)

4920

The manual data contains 4920 rows.

In [62]:
cols = df.columns

In [63]:
cols = cols[:-1]

In [64]:
cols

Index(['itching', 'skin_rash', 'nodal_skin_eruptions', 'continuous_sneezing',
       'shivering', 'chills', 'joint_pain', 'stomach_pain', 'acidity',
       'ulcers_on_tongue',
       ...
       'pus_filled_pimples', 'blackheads', 'scurring', 'skin_peeling',
       'silver_like_dusting', 'small_dents_in_nails', 'inflammatory_nails',
       'blister', 'red_sore_around_nose', 'yellow_crust_ooze'],
      dtype='object', length=132)

In [65]:
len(cols)

132

We have 132 symptoms in the manual data.

In [None]:
x = df[cols]
y = df['prognosis']
print (x[:5])
print (y[:5])

In [None]:
import os

In [67]:
dest_addr = r'Users\Aayush Bhargava\Desktop\Disease-Predictor-master'


In [68]:
import csv


In [70]:
with open('Training.csv') as f:
    reader = csv.reader(f)
    i = next(reader)
    rest = [row for row in reader]

In [71]:
print (i)

['itching', 'skin_rash', 'nodal_skin_eruptions', 'continuous_sneezing', 'shivering', 'chills', 'joint_pain', 'stomach_pain', 'acidity', 'ulcers_on_tongue', 'muscle_wasting', 'vomiting', 'burning_micturition', 'spotting_ urination', 'fatigue', 'weight_gain', 'anxiety', 'cold_hands_and_feets', 'mood_swings', 'weight_loss', 'restlessness', 'lethargy', 'patches_in_throat', 'irregular_sugar_level', 'cough', 'high_fever', 'sunken_eyes', 'breathlessness', 'sweating', 'dehydration', 'indigestion', 'headache', 'yellowish_skin', 'dark_urine', 'nausea', 'loss_of_appetite', 'pain_behind_the_eyes', 'back_pain', 'constipation', 'abdominal_pain', 'diarrhoea', 'mild_fever', 'yellow_urine', 'yellowing_of_eyes', 'acute_liver_failure', 'fluid_overload', 'swelling_of_stomach', 'swelled_lymph_nodes', 'malaise', 'blurred_and_distorted_vision', 'phlegm', 'throat_irritation', 'redness_of_eyes', 'sinus_pressure', 'runny_nose', 'congestion', 'chest_pain', 'weakness_in_limbs', 'fast_heart_rate', 'pain_during_bow

In [72]:
for ix in i:
    ix = ix.replace('_',' ')
    print (ix)

itching
skin rash
nodal skin eruptions
continuous sneezing
shivering
chills
joint pain
stomach pain
acidity
ulcers on tongue
muscle wasting
vomiting
burning micturition
spotting  urination
fatigue
weight gain
anxiety
cold hands and feets
mood swings
weight loss
restlessness
lethargy
patches in throat
irregular sugar level
cough
high fever
sunken eyes
breathlessness
sweating
dehydration
indigestion
headache
yellowish skin
dark urine
nausea
loss of appetite
pain behind the eyes
back pain
constipation
abdominal pain
diarrhoea
mild fever
yellow urine
yellowing of eyes
acute liver failure
fluid overload
swelling of stomach
swelled lymph nodes
malaise
blurred and distorted vision
phlegm
throat irritation
redness of eyes
sinus pressure
runny nose
congestion
chest pain
weakness in limbs
fast heart rate
pain during bowel movements
pain in anal region
bloody stool
irritation in anus
neck pain
dizziness
cramps
bruising
obesity
swollen legs
swollen blood vessels
puffy face and eyes
enlarged thyroi

### Trying out our classifier to learn diseases from the symptoms

In [74]:
import pandas as pd
#import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

In [75]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

In [76]:
mnb = MultinomialNB()
mnb = mnb.fit(x_train, y_train)

In [77]:
mnb.score(x_test, y_test)

0.0

In [81]:
from sklearn.model_selection import cross_val_score
print ("cross result========")
scores = cross_val_score(mnb, x_test, y_test, cv=3)
print (scores)
print (scores.mean())



ValueError: All the n_labels for individual classes are less than 3 folds.

We use the separate testing dataset to test our Multinomial Naive Bayes model.

In [None]:
test_data = pd.read_csv("Testing.csv")

In [None]:
test_data.head()

In [None]:
testx = test_data[cols]
testy = test_data['prognosis']

In [None]:
mnb.score(testx, testy)

### Training a decision tree

In [None]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

In [None]:
print ("DecisionTree")
dt = DecisionTreeClassifier()
clf_dt=dt.fit(x_train,y_train)
print ("Accuracy: ", clf_dt.score(x_test,y_test))

In [None]:
from sklearn.model_selection import cross_val_score
print ("cross result========")
scores = cross_val_score(dt, x_test, y_test, cv=3)
print (scores)
print (scores.mean())

In [None]:
print ("Accuracy on the actual test data: ", clf_dt.score(testx,testy))

In [None]:
from sklearn import tree 
from sklearn.tree import export_graphviz

export_graphviz(dt, 
                out_file='DOT-files/tree.dot', 
                feature_names=cols)

Running the following command converts the exported DOT file into an image of the decision tree.

```dot -Tpng DOT-files/tree.dot -o tree.png```

In [None]:
from IPython.display import Image
Image(filename='tree.png')

In [None]:
dt.__getstate__()

#### Finding the Feature importances

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

importances = dt.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

In [None]:
features = cols

In [None]:
for f in range(5):
    print("%d. feature %d - %s (%f)" % (f + 1, indices[f], features[indices[f]] ,importances[indices[f]]))

Thus the top features are symptoms such as redness of eyes and internal itching, which play the largest role in predicting diseases. This can be verified against the exported decision tree.

In [None]:
export_graphviz(dt, 
                out_file='DOT-files/tree-top5.dot', 
                feature_names=cols,
                max_depth = 5
               )

In [None]:
from IPython.display import Image
Image(filename='tree-top5.png')

The root of the tree splits on redness_of_eyes, where the [Gini impurity](https://en.wikipedia.org/wiki/Decision_tree_learning#Gini_impurity) of the samples is 0.9755; the next split is on internal_itching, at a node with impurity 0.9749, and so on. The value shown at a node measures how mixed the classes reaching that node are, not a score for the symptom itself: 0.9755 is close to the maximum of 1 - 1/41 ≈ 0.9756 for 41 equally frequent diseases. At each node the tree picks the split that reduces the impurity the most, so redness_of_eyes is selected as the root, and the impurity falls as we move down the tree.
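
As a worked check of the Gini impurity figure, the sketch below computes it directly from class counts. The counts are an assumption: 120 samples per class corresponds to 4920 rows spread evenly over 41 diseases, which the training split only approximates. The function name `gini_impurity` is hypothetical.

```python
def gini_impurity(class_counts):
    # Gini impurity: 1 minus the sum of squared class proportions.
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

# 41 equally frequent classes give the maximum impurity for 41 classes.
balanced = gini_impurity([120] * 41)
# A node holding a single class is pure.
pure = gini_impurity([120])
print(round(balanced, 4), pure)
```

The balanced case evaluates to 1 - 1/41, which is why the root of a tree trained on nearly balanced data shows a value so close to it.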

In [None]:
feature_dict = {}
for i,f in enumerate(features):
    feature_dict[f] = i

In [None]:
feature_dict['hip_joint_pain']

In [None]:
print (feature_dict)

In [None]:
sample_x = [1 if i == 52 else 0 for i in range(len(features))]  # index 52 is redness_of_eyes

This means predicting the disease where the only symptom is redness_of_eyes.
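
Hard-coding a column index is fragile, so a sample can also be built from symptom names. This is a minimal sketch: `encode_symptoms` is a hypothetical helper, and `demo_features` is a shortened, made-up feature list standing in for `features`.

```python
def encode_symptoms(present, feature_names):
    # Map each feature name to its column position.
    index = {name: pos for pos, name in enumerate(feature_names)}
    vector = [0] * len(feature_names)
    for symptom in present:
        vector[index[symptom]] = 1  # raises KeyError for unknown symptoms
    return vector

# Hypothetical, shortened feature list; the notebook would pass `features`.
demo_features = ["itching", "skin_rash", "redness_of_eyes", "cough"]
demo_x = encode_symptoms(["redness_of_eyes"], demo_features)
print(demo_x)
```

The resulting list still has to be wrapped as a single row, for example with the same `reshape` used in the notebook, before it is passed to `dt.predict`.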

In [None]:
len(sample_x)

In [None]:
sample_x = np.array(sample_x).reshape(1,len(sample_x))

In [None]:
dt.predict(sample_x)

In [None]:
dt.predict_proba(sample_x)

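Hence the tree assigns a probability of 100% to Common Cold. This reflects a fully grown tree whose leaves each hold a single class rather than a calibrated confidence; the prediction should become more meaningful once we take more symptoms as input.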
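
The sketch below shows why a fully grown decision tree tends to report probabilities of exactly 0 or 1. The data and the name `toy_tree` are made up; the only assumption is scikit-learn's default behaviour of growing the tree until every leaf is pure.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up training data: each row is a distinct symptom pattern.
X = [[1, 0], [0, 1], [1, 1]]
y = ["a", "b", "c"]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Probabilities are class fractions in the leaf the sample lands in.
proba = toy_tree.predict_proba([[1, 0]])
print(toy_tree.classes_, proba)
```

Limiting the tree with parameters such as `max_depth` or `min_samples_leaf` would leave mixed leaves and therefore less extreme probabilities.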

<hr>