## Getting the data

http://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html

Copy the table from the link above and paste it into a CSV file named `dataset_uncleaned.csv`, then follow the steps below to preprocess it.

## Cleaning our data

In [1]:
import pandas as pd

In [2]:
import csv
from collections import defaultdict

def return_list(disease):
    # Treat both '^' and '_' as separators, then keep every second piece
    # (the 2nd, 4th, ...) of the cell.
    disease_list = []
    match = disease.replace('^','_').split('_')
    ctr = 1
    for group in match:
        if ctr%2==0:
            disease_list.append(group)
        ctr = ctr + 1

    return disease_list

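def is_blank(cell):
    # A blank cell is either empty or a lone non-breaking space, which reads as
    # "\xa0" when decoded as UTF-8 and as "\xc2\xa0" when the UTF-8 bytes are
    # read as Latin-1 (or under Python 2).
    return cell in ("", "\xa0", "\xc2\xa0")

with open("dataset_uncleaned.csv", newline="") as csvfile:
    reader = csv.reader(csvfile)
    disease=""
    weight = 0
    disease_list = []
    dict_wt = {}
    dict_=defaultdict(list)
    for row in reader:

        # A non-blank first column starts a new disease; blank ones continue the previous disease.
        if not is_blank(row[0]):
            disease = row[0]
            disease_list = return_list(disease)
            weight = row[1]

        # Attach this row's symptoms to every disease name from the current disease cell.
        if not is_blank(row[2]):
            symptom_list = return_list(row[2])

            for d in disease_list:
                for s in symptom_list:
                    dict_[d].append(s)
                dict_wt[d] = weight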

    #print (dict_)
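
To make the parsing step concrete, here is a minimal sketch of what `return_list` does to a single raw cell. The two sample strings are assumptions about the format of the pasted table (a `UMLS:<code>_<name>` entry, with several entries joined by `^`); check them against your own `dataset_uncleaned.csv`. The slice `[1::2]` is just a shorter way of writing the counter loop.

```python
def return_list(cell):
    """Return every second piece of a cell once '^' and '_' are both separators."""
    parts = cell.replace('^', '_').split('_')
    # Pieces at odd indices (2nd, 4th, ...) are the ones the notebook keeps.
    return parts[1::2]

# Hypothetical sample cells, assumed to mirror the pasted table's format.
single = "UMLS:C0020538_hypertensive disease"
double = "UMLS:C0011570_depression mental^UMLS:C0011581_depressive disorder"

print(return_list(single))
print(return_list(double))
```

Note that a name which itself contains an underscore would be split in two by this approach.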

### Writing our cleaned data

In [3]:
with open("dataset_clean.csv","w",newline="") as csvfile:
    writer = csv.writer(csvfile)
    # One row per disease-symptom pair: disease, symptom, disease weight.
    for key,values in dict_.items():
        for v in values:
            writer.writerow([key,v,dict_wt[key]])

In [4]:
columns = ['Source','Target','Weight']

In [5]:
data = pd.read_csv("dataset_clean.csv",names=columns, encoding ="ISO-8859-1")

In [6]:
data.head()

Unnamed: 0,Source,Target,Weight
0,hernia hiatal,pain abdominal,61
1,hernia hiatal,fatigability,61
2,hernia hiatal,prodrome,61
3,hernia hiatal,vomiting,61
4,hernia hiatal,nausea,61


In [7]:
data.to_csv("dataset_clean.csv",index=False)
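
The write, read back, and re-write sequence above serves only to put a header row on `dataset_clean.csv`. As a possible shortcut, the header can be written in the same pass as the data. The sketch below uses tiny hypothetical stand-ins for `dict_` and `dict_wt` and writes to an in-memory buffer so that it runs on its own; `write_edges` is an illustrative helper name, not part of the notebook.

```python
import csv
import io

# Hypothetical stand-ins for the dict_ and dict_wt built during cleaning.
dict_ = {"hernia hiatal": ["pain abdominal", "vomiting"]}
dict_wt = {"hernia hiatal": "61"}

def write_edges(handle, symptoms_by_disease, weight_by_disease):
    """Write a header, then one Source,Target,Weight row per disease-symptom pair."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for disease, symptoms in symptoms_by_disease.items():
        for symptom in symptoms:
            writer.writerow([disease, symptom, weight_by_disease[disease]])

buffer = io.StringIO()
write_edges(buffer, dict_, dict_wt)
print(buffer.getvalue())
```

With a real file, the buffer would be replaced by `open("dataset_clean.csv", "w", newline="")`.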

In [8]:
slist = []
dlist = []
with open("nodetable.csv","w",newline="") as csvfile:
    writer = csv.writer(csvfile)

    for key,values in dict_.items():
        for v in values:
            if v not in slist:
                writer.writerow([v,v,"symptom"])
                slist.append(v)
        if key not in dlist:
            writer.writerow([key,key,"disease"])
            dlist.append(key)

In [9]:
nt_columns = ['Id','Label','Attribute']

In [10]:
nt_data = pd.read_csv("nodetable.csv",names=nt_columns, encoding ="ISO-8859-1")

In [11]:
nt_data.head()

Unnamed: 0,Id,Label,Attribute
0,pain abdominal,pain abdominal,symptom
1,fatigability,fatigability,symptom
2,prodrome,prodrome,symptom
3,vomiting,vomiting,symptom
4,nausea,nausea,symptom


In [12]:
nt_data.to_csv("nodetable.csv",index=False)

## Analysing our cleaned data

In [183]:
data = pd.read_csv("dataset_clean.csv", encoding ="ISO-8859-1")

In [184]:
data.head()

Unnamed: 0,Source,Target,Weight
0,hernia hiatal,pain abdominal,61
1,hernia hiatal,fatigability,61
2,hernia hiatal,prodrome,61
3,hernia hiatal,vomiting,61
4,hernia hiatal,nausea,61


In [185]:
len(data['Source'].unique())

149

In [186]:
len(data['Target'].unique())

405

In [187]:
df = pd.DataFrame(data)

In [188]:
df_1 = pd.get_dummies(df.Target)

In [189]:
df_1.head()

Unnamed: 0,Heberden's node,Murphy's sign,Stahli's line,abdomen acute,abdominal bloating,abdominal tenderness,abnormal sensation,abnormally hard consistency,abortion,abscess bacterial,...,vision blurred,vomiting,weepiness,weight gain,welt,wheelchair bound,wheezing,withdraw,worry,yellow sputum
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [190]:
df.head()

Unnamed: 0,Source,Target,Weight
0,hernia hiatal,pain abdominal,61
1,hernia hiatal,fatigability,61
2,hernia hiatal,prodrome,61
3,hernia hiatal,vomiting,61
4,hernia hiatal,nausea,61


In [191]:
df_s = df['Source']

In [192]:
df_pivoted = pd.concat([df_s,df_1], axis=1)

In [193]:
df_pivoted.drop_duplicates(keep='first',inplace=True)

In [194]:
df_pivoted[:5]

Unnamed: 0,Source,Heberden's node,Murphy's sign,Stahli's line,abdomen acute,abdominal bloating,abdominal tenderness,abnormal sensation,abnormally hard consistency,abortion,...,vision blurred,vomiting,weepiness,weight gain,welt,wheelchair bound,wheezing,withdraw,worry,yellow sputum
0,hernia hiatal,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,hernia hiatal,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,hernia hiatal,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,hernia hiatal,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,hernia hiatal,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [195]:
len(df_pivoted)

2116

In [196]:
cols = df_pivoted.columns

In [197]:
cols = cols[1:]

In [198]:
df_pivoted = df_pivoted.groupby('Source').sum()
df_pivoted = df_pivoted.reset_index()
df_pivoted[:5]

Unnamed: 0,Source,Heberden's node,Murphy's sign,Stahli's line,abdomen acute,abdominal bloating,abdominal tenderness,abnormal sensation,abnormally hard consistency,abortion,...,vision blurred,vomiting,weepiness,weight gain,welt,wheelchair bound,wheezing,withdraw,worry,yellow sputum
0,Alzheimer's disease,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,HIV,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Pneumocystis carinii pneumonia,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,accident cerebrovascular,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,acquired immuno-deficiency syndrome,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [199]:
len(df_pivoted)

149

In [200]:
df_pivoted.to_csv("dfp.csv")

In [201]:
x = df_pivoted[cols]
y = df_pivoted['Source']
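
The same disease-by-symptom indicator matrix can probably be built in one step with `pd.crosstab`. This is an alternative sketch on a toy edge list (the rows are made up for illustration), not the method used above; `clip(upper=1)` plays the role of `drop_duplicates` by capping repeated pairs at 1.

```python
import pandas as pd

# Toy edge list in the same Source/Target layout; the rows are illustrative only.
edges = pd.DataFrame({
    "Source": ["hernia hiatal", "hernia hiatal", "HIV", "hernia hiatal"],
    "Target": ["vomiting", "nausea", "fever", "vomiting"],  # last row repeats a pair
})

# Count each (disease, symptom) pair, then cap the counts at 1 to get indicators.
matrix = pd.crosstab(edges["Source"], edges["Target"]).clip(upper=1)

x_toy = matrix.to_numpy()        # features: one row per disease
y_toy = matrix.index.to_numpy()  # labels: the disease names
print(matrix)
```

Here the diseases end up in the index rather than in a `Source` column, so no `reset_index` is needed before separating features from labels.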

### Trying out our classifier to learn diseases from the symptoms

In [204]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20



In [205]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)

In [206]:
mnb = MultinomialNB()
mnb = mnb.fit(x_train, y_train)

In [207]:
mnb.score(x_test, y_test)

0.0

### Inferences on train and test split
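The train/test split cannot work here. After the pivot, each disease has exactly one row, so every disease in the test set is absent from the training set and the classifier can never predict it; that is why the score is 0.0. The model therefore has to be trained on all of the rows. What, then, do we test it on? One option is incomplete data: given only one or a few symptoms, which disease is it? Because a symptom can be shared by several diseases, the answer is a set of candidate diseases rather than a single label, which makes this closer to multi-label classification. We could also work symptom by symptom, narrowing the candidates as each one is added. This is the kind of reasoning a differential diagnosis involves, and it is what we ultimately want to replicate.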
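
One way to approach the "given a few symptoms, which diseases?" question is to return a ranked list of candidates instead of a single label. The sketch below is an assumption about how that might look, not something the notebook implements: it scores each disease by the fraction of the observed symptoms that appear in its symptom list. `rank_diseases`, `disease A`, `disease B`, and the toy `knowledge_base` are hypothetical names and data.

```python
def rank_diseases(observed, knowledge_base, top_k=3):
    """Rank diseases by the share of observed symptoms each one covers."""
    observed = set(observed)
    if not observed:
        return []
    scores = []
    for disease, symptoms in knowledge_base.items():
        overlap = len(observed & set(symptoms))
        if overlap:  # skip diseases that share no observed symptom
            scores.append((overlap / len(observed), disease))
    # Highest score first; ties are broken alphabetically by disease name.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(disease, score) for score, disease in scores[:top_k]]

# Toy knowledge base; in the notebook this role would be played by dict_.
knowledge_base = {
    "hernia hiatal": ["pain abdominal", "vomiting", "nausea"],
    "disease A": ["vomiting", "fever"],
    "disease B": ["cough"],
}

ranking = rank_diseases(["vomiting", "nausea"], knowledge_base)
print(ranking)
```

A simple overlap score ignores how common each symptom is; a probabilistic model would weight rare symptoms more heavily, so treat this only as a starting point.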

In [208]:
mnb_tot = MultinomialNB()
mnb_tot = mnb_tot.fit(x, y)

In [209]:
mnb_tot.score(x, y)

0.89932885906040272

In [210]:
disease_pred = mnb_tot.predict(x)

In [211]:
disease_real = y.values

In [214]:
for i in range(0, len(disease_real)):
    if disease_pred[i]!=disease_real[i]:
        print ('Pred: {0} Actual:{1}'.format(disease_pred[i], disease_real[i]))

Pred: HIV Actual:acquired immuno-deficiency syndrome
Pred: biliary calculus Actual:cholelithiasis
Pred: coronary arteriosclerosis Actual:coronary heart disease
Pred: depression mental Actual:depressive disorder
Pred: HIV Actual:hiv infections
Pred: carcinoma breast Actual:malignant neoplasm of breast
Pred: carcinoma of lung Actual:malignant neoplasm of lung
Pred: carcinoma prostate Actual:malignant neoplasm of prostate
Pred: carcinoma colon Actual:malignant tumor of colon
Pred: candidiasis Actual:oralcandidiasis
Pred: effusion pericardial Actual:pericardial effusion body substance
Pred: malignant neoplasms Actual:primary malignant neoplasm
Pred: sepsis (invertebrate) Actual:septicemia
Pred: sepsis (invertebrate) Actual:systemic infection
Pred: tonic-clonic epilepsy Actual:tonic-clonic seizures


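These are the cases the classifier gets wrong, shown as predicted versus actual disease. Most pairs appear to be synonyms for the same condition (for example, `biliary calculus` and `cholelithiasis`). Since the cleaning step gives every disease name from the same raw row the same symptom list, such names likely have identical feature vectors, and the classifier has no way to tell them apart.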
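
To check which diseases share exactly the same symptom set, one can group them by that set. The following is a minimal sketch on toy data, where `toy` stands in for `dict_`, the symptom values are illustrative, and `groups_with_identical_symptoms` is a hypothetical helper name.

```python
from collections import defaultdict

def groups_with_identical_symptoms(symptoms_by_disease):
    """Return lists of diseases whose symptom sets are exactly equal."""
    by_symptoms = defaultdict(list)
    for disease, symptoms in symptoms_by_disease.items():
        # frozenset ignores order and duplicates, and can be used as a dict key.
        by_symptoms[frozenset(symptoms)].append(disease)
    # Keep only symptom sets shared by two or more diseases.
    return [sorted(names) for names in by_symptoms.values() if len(names) > 1]

# Toy stand-in for dict_; the symptom values are illustrative.
toy = {
    "depression mental": ["worry", "weepiness"],
    "depressive disorder": ["weepiness", "worry"],
    "hernia hiatal": ["vomiting", "nausea"],
}

print(groups_with_identical_symptoms(toy))
```

Any group this returns on the real data is a set of labels that no classifier trained on these features could distinguish, so merging them into one label before training may be worth considering.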

<hr>

# More analysis to be done soon...