## Diagnosis

While diagnosis isn't the purpose of Flaredown for its users, we would be remiss to ignore the fact that the data makes a very solid training set for a diagnostic system. We will reshape the data so that the features are one-hot encoded symptoms and the prediction target is the condition.

There are many algorithms that take a list of self-reported symptoms and attempt a diagnosis. But the depth and breadth of the ever-growing Flaredown data may provide new opportunities for this task, especially since Flaredown may collect any number of additional variables in the future.

I do recommend giving this some time before it's used.  At this time (Aug 22, 2016) the data only describes about 900 conditions.  For a diagnosis engine to be useful it is likely to require a huge breadth of conditions.
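To make the reshaping described above concrete, here is a minimal sketch on a made-up table. The column names mirror the Flaredown export, but the rows, the symptom and condition names, and the variable names `toy`, `sym`, and `features` are all invented for illustration. It assumes a pandas version recent enough to accept `dtype` in `get_dummies`.

```python
import pandas as pd

# Toy long-format data: one row per tracked item per check-in (values are made up).
toy = pd.DataFrame({
    "user_id":        [1, 1, 1, 2, 2],
    "checkin_date":   ["2016-01-01"] * 3 + ["2016-01-02"] * 2,
    "trackable_type": ["Symptom", "Symptom", "Condition", "Symptom", "Condition"],
    "trackable_name": ["fatigue", "nausea", "lupus", "fatigue", "migraine"],
})

# Keep only the symptom rows; conditions become the prediction target later.
sym = toy[toy["trackable_type"] == "Symptom"]

# One indicator column per symptom name.
onehot = pd.get_dummies(sym["trackable_name"], dtype=int)
onehot = pd.concat([sym[["user_id", "checkin_date"]], onehot], axis=1)

# Collapse to one row per (user, check-in); max() acts as a logical OR on 0/1 columns.
features = onehot.groupby(["user_id", "checkin_date"]).max().reset_index()
print(features)
```

Each row of `features` is now one check-in, with a 1 in every symptom column the user reported that day.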

In [48]:
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("flaredown_trackable_data_080316.csv")
df['checkin_date'] = pd.to_datetime(df['checkin_date'])

In [28]:
#reshape and one-hot the symptoms
symptoms = pd.get_dummies(df[(df['trackable_type'] == "Symptom") & (df['trackable_value'] != 0)], columns=['trackable_name'])
symptoms = symptoms.drop(['trackable_id', 'trackable_type', 'trackable_value'], axis=1)

def numericOr(x):
    if 1 in x.values:
        return 1
    else:
        return 0
symptoms = symptoms.groupby(['user_id', 'checkin_date']).agg(numericOr).reset_index()

In [78]:
from sklearn.preprocessing import MultiLabelBinarizer

def combineConditions(x):
    return set(x)

def makeList(x):
    return list(x)

newdf = df[df['trackable_type'] == 'Condition'].groupby(['user_id', 'checkin_date'])['trackable_name'].agg(combineConditions).reset_index()
newdf = newdf.merge(symptoms, on=['user_id','checkin_date'])

#newdf = newdf.drop_duplicates().drop(['user_id','checkin_date','trackable_id','trackable_type', 'trackable_value'], axis=1)
newdf = newdf.drop(['user_id','checkin_date'], axis=1)
X = newdf.drop('trackable_name', axis=1)
Y = newdf['trackable_name'].apply(makeList)  # each row of Y is a list, because this is a multilabel problem
Y = MultiLabelBinarizer().fit_transform(Y)  
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42) 

### A note about model selection

It is important to note that in this case Y is a matrix: a binary indicator matrix with one column per condition. This is an example of multi-label classification, in that each user may be suffering from any number of conditions at once. Because of this, a classifier that can handle multi-label targets must be used. Fortunately, sklearn has several options for this, including:

Decision Trees, Random Forests, Nearest Neighbors, Ridge Regression

TODO should try all of the above

### Resources

http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_multilabel_classification.html#sklearn.datasets.make_multilabel_classification

http://scikit-learn.org/stable/modules/multiclass.html


In [80]:
#This block uses SVM, which slows way down for problems of this size, skip this block if you're in a hurry

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

clf = OneVsRestClassifier(SVC(kernel='poly'))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print(Y_test)
print(Y_pred)
print(accuracy_score(Y_test, Y_pred))  #TODO accuracy score doesn't paint a complete picture for multilabel, should use something else

0.0
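For multilabel targets, `accuracy_score` reports subset accuracy: a row counts as correct only if every label in it matches, which is why a score of 0.0 is easy to hit with hundreds of conditions. The sketch below compares it with two per-label alternatives on a tiny hand-made example. The arrays `Y_true` and `Y_hat` are invented, and which metric suits this problem best is left open.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Three samples, three labels; values are made up for illustration.
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_hat = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 0, 0]])

# Subset accuracy: only the second row matches exactly, so 1 of 3 rows.
subset_acc = accuracy_score(Y_true, Y_hat)

# Hamming loss: fraction of individual label cells that are wrong (2 of 9).
ham = hamming_loss(Y_true, Y_hat)

# Micro-averaged F1: pools true/false positives and negatives over all labels.
micro_f1 = f1_score(Y_true, Y_hat, average="micro")

print(subset_acc, ham, micro_f1)
```

Here the predictions get most individual labels right, yet subset accuracy is only one third. Hamming loss and micro-F1 give partial credit for partially correct rows.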


In [None]:
#Blah random forest
#TODO Can we get some multilabel Gradient Boosting up in here?
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=3)
rf.fit(X_train, Y_train)
Y_pred = rf.predict(X_test)
print(accuracy_score(Y_test, Y_pred))
print(Y_test)
print(Y_pred)