# Tree-based Analyses

Decision trees work a bit like a game of 20 Questions. At each step, the algorithm considers every variable in the dataset and looks for the "split point" that best separates the outcome classes. Variables with more predictive power are used earlier in the splitting process. For example, imagine a dataset describing various vegetables, where we want to predict whether each one has spoiled. If one of the features is the farm of origin, and 80% of spoiled vegetables come from one particular farm while most fresh ones come from elsewhere, then the farm-of-origin feature will have high predictive value and be used early in the tree-building process. From there, additional features are considered for their value in predicting freshness.
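To make "predictive power" concrete, here is a minimal sketch of how a candidate split can be scored with Gini impurity, which is the default criterion for scikit-learn's `DecisionTreeClassifier`. The vegetable data and the helper names `gini` and `split_gain` are made up for illustration; they are not part of the readmission data set.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (1 = spoiled)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def split_gain(labels, mask):
    """Impurity reduction from splitting `labels` by a boolean mask."""
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Hypothetical data: 10 vegetables, 5 of them spoiled.
# 4 of the 5 spoiled vegetables (80%) come from farm A.
spoiled = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
from_farm_a = [True, True, True, True, False, True, False, False, False, False]
is_green = [True, False, True, False, True, False, True, False, True, False]

gain_farm = split_gain(spoiled, from_farm_a)
gain_color = split_gain(spoiled, is_green)
print(f"farm-of-origin gain: {gain_farm:.3f}")
print(f"color gain:          {gain_color:.3f}")
```

The farm-of-origin split removes far more impurity than the color split, so a tree built on this toy data would split on farm first.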

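When using a simple tree, it is easy to produce a visualization (with a tree shape, of course) that shows splits and features in decreasing order of importance moving down the "tree." However, in this particular instance, I'm using AdaBoost, which is a version of boosted trees. Boosting fits many trees in sequence, each one giving more weight to the observations the earlier trees misclassified, and it often predicts better than a single tree. Because the final model is a weighted vote of many trees rather than one tree, there is no single tree to draw, so the output in this case is a bit different.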
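The reweighting idea behind boosting can be sketched in a few lines. This is a textbook version of discrete AdaBoost with one-split "stumps" on made-up one-dimensional data, not scikit-learn's implementation and not the model fit in this notebook; the function names `fit_stump`, `adaboost`, and `predict` are illustrative.

```python
import math


def stump_predict(x, threshold, polarity):
    """Predict `polarity` when x <= threshold, otherwise `-polarity`."""
    return polarity if x <= threshold else -polarity


def fit_stump(xs, ys, weights):
    """Return (error, threshold, polarity) with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def adaboost(xs, ys, n_rounds):
    """Fit `n_rounds` stumps; labels must be -1 or +1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's vote weight
        ensemble.append((alpha, threshold, polarity))
        # Up-weight misclassified points, down-weight the rest, renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical data that no single threshold can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

one_stump = adaboost(xs, ys, 1)
three_stumps = adaboost(xs, ys, 3)
errors_one = sum(predict(one_stump, x) != y for x, y in zip(xs, ys))
errors_three = sum(predict(three_stumps, x) != y for x, y in zip(xs, ys))
print(errors_one, errors_three)
```

A single stump misclassifies three of the ten points, while the weighted vote of three stumps classifies all of them correctly. That is also why there is no single tree to visualize: the prediction comes from the vote, not from any one member.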

In [2]:
from sklearn.ensemble import AdaBoostClassifier 
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
import pandas as pd

#import mapped version of data set here
readmit = pd.read_csv('diabetes_readmission_onehot.csv') 

In [3]:
# capture independent variables in list
features = list(readmit)
features = [e for e in features if e not in ('Unnamed: 0', 'readmit30')]

In [4]:
# split the data into a training and test set
Xtrain, Xtest, Ytrain, Ytest = train_test_split(readmit[features].values, 
                                              (readmit.readmit30 == 1).values, test_size = .2, random_state = 7)

In [5]:
# build model on training data
# (in scikit-learn >= 1.2 the `base_estimator` argument is named `estimator`)
dt = DecisionTreeClassifier() 
clf = AdaBoostClassifier(base_estimator = dt, n_estimators = 50, learning_rate = 1, random_state = 7)
model = clf.fit(Xtrain, Ytrain)

# check performance (model accuracy) on test data
model.score(Xtest, Ytest)

0.84576258508489788

In [32]:
# results summary: feature importances (a boosted ensemble has no single tree to plot)
importance = model.feature_importances_
importance_df = pd.DataFrame({'feature': features, 'importance': importance})
importance_df.sort_values('importance', ascending = False)

                  feature  importance
7              num_visits    0.110377
2         num_medications    0.100254
0        days_in_hospital    0.050011
1          num_procedures    0.044256
6        number_diagnoses    0.038636
5        number_inpatient    0.023498
3       number_outpatient    0.018242
12            gender_Male    0.015974
139  third_diag_neoplasms    0.015031
21            age_[80-90)    0.014437


In [None]:
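# feature_importances_ is impurity-based: each feature's share of the total
# impurity reduction across the boosted trees, normalized so all values sum to 1
# with at least 140 features sharing that total, small individual values are
# expected, so low importances alone do not show that the model is weak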

## Confusion Matrix

Another way to evaluate model performance is a confusion matrix, which sorts our model's predictions into four categories:

- In the top-left quadrant is the number of observations classified as not readmitted within 30 days that were in fact not readmitted within 30 days. This is the true negative count. 
- In the top-right quadrant is the number of observations classified as readmitted within 30 days that were in fact not readmitted within 30 days. This is the false positive count. 
- In the lower-left quadrant is the number of observations classified as not readmitted within 30 days that were in fact readmitted within 30 days. This is the false negative count. 
- In the lower-right quadrant is the number of observations classified as readmitted within 30 days that were in fact readmitted within 30 days. This is the true positive count. 
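The four quadrants can be counted directly from paired labels. The sketch below uses a small made-up set of labels (1 = readmitted within 30 days); the function name `confusion_counts` and the variables `y_true` and `y_pred` are illustrative and not part of the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for 0/1 or boolean labels."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual and predicted:
            tp += 1  # readmitted, and predicted readmitted
        elif actual and not predicted:
            fn += 1  # readmitted, but predicted not readmitted
        elif not actual and predicted:
            fp += 1  # not readmitted, but predicted readmitted
        else:
            tn += 1  # not readmitted, and predicted not readmitted
    return tn, fp, fn, tp


# Hypothetical labels for eight patients.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print("            pred 0  pred 1")
print(f"actual 0    {tn:6d}  {fp:6d}")
print(f"actual 1    {fn:6d}  {tp:6d}")
```

The printed layout matches the quadrant order described above: true negatives top-left, false positives top-right, false negatives lower-left, true positives lower-right.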

In [8]:
# set actual and predicted vectors; generate confusion matrix
# note: predictions cover the full data set, including the rows used for training
actual = pd.Series(readmit['readmit30'], name = 'Actual')
predicted = pd.Series(clf.predict(readmit[features]), name = 'Predicted')
cm = pd.crosstab(actual, predicted, margins = True)
print(cm)

Predicted  False  True    All
Actual                       
0          59629  1141  60770
1            921  5153   6074
All        60550  6294  66844


In [10]:
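# present matrix as row proportions: divide each row by its 'All' total
cm_pct = cm.div(cm['All'], axis = 0)
print(cm_pct)

Predicted     False      True  All
Actual                            
0          0.981224  0.018776  1.0
1          0.151630  0.848370  1.0
All        0.905840  0.094160  1.0


This final version -- perhaps more useful -- expresses each row as a share of its actual class. The true negative rate is about 98%, and the true positive rate is about 85%. These figures are optimistic, though, because the predictions cover the full data set, including the 80% of rows the model was trained on. On the held-out test set, accuracy was about 85%, whereas simply predicting "not readmitted" for every patient would be right for about 91% of the full data set (60,770 of 66,844 observations). Improvement is needed. 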
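The per-class rates can be checked by hand from the four counts in the crosstab printed earlier. This sketch hard-codes those counts rather than recomputing them from the data; the variable names are illustrative.

```python
# Counts copied from the printed crosstab (rows = actual, columns = predicted).
tn, fp = 59629, 1141   # actual 0: predicted False, predicted True
fn, tp = 921, 5153     # actual 1: predicted False, predicted True
total = tn + fp + fn + tp

tnr = tn / (tn + fp)           # true negative rate (specificity)
tpr = tp / (tp + fn)           # true positive rate (recall)
accuracy = (tn + tp) / total   # share of all predictions that are correct
baseline = (tn + fp) / total   # accuracy of always predicting "not readmitted"

print(f"true negative rate: {tnr:.3f}")
print(f"true positive rate: {tpr:.3f}")
print(f"accuracy:           {accuracy:.3f}")
print(f"majority baseline:  {baseline:.3f}")
```

Because about 91% of observations are not readmitted, accuracy on its own says little here; comparing against the majority-class baseline gives a fairer picture than comparing against 50%.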