# Tree-based Analyses

Decision trees work a bit like a game of 20 Questions. At each step, the algorithm considers each variable in the dataset and finds the "split point" that best separates the outcomes. Variables with more predictive power are used earlier in the splitting process. For example, imagine a dataset describing various vegetables, where we want to predict whether each one has spoiled. If one of the features is the farm of origin, and 80% of spoiled vegetables come from one particular farm, then the farm-of-origin feature has high predictive value and will be used early in the tree-building process. From there, additional features are considered for their value in predicting freshness.
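To make the idea of a "split point" concrete, here is a minimal sketch that scores two candidate splits by weighted Gini impurity (lower is better). The toy vegetable records, the `storage` feature, and the helper names `gini` and `split_impurity` are all made up for illustration; real tree libraries use more elaborate search, but the scoring idea is similar.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, feature, value, target):
    """Weighted Gini impurity after splitting rows on `feature == value`."""
    left = [r[target] for r in rows if r[feature] == value]
    right = [r[target] for r in rows if r[feature] != value]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical toy data: 4 of the 5 spoiled vegetables (80%) come from farm A.
vegetables = [
    {"farm": "A", "storage": "cold", "spoiled": 1},
    {"farm": "A", "storage": "warm", "spoiled": 1},
    {"farm": "A", "storage": "cold", "spoiled": 1},
    {"farm": "A", "storage": "warm", "spoiled": 1},
    {"farm": "B", "storage": "cold", "spoiled": 0},
    {"farm": "B", "storage": "warm", "spoiled": 1},
    {"farm": "C", "storage": "cold", "spoiled": 0},
    {"farm": "C", "storage": "warm", "spoiled": 0},
    {"farm": "C", "storage": "cold", "spoiled": 0},
    {"farm": "B", "storage": "warm", "spoiled": 0},
]

farm_score = split_impurity(vegetables, "farm", "A", "spoiled")
storage_score = split_impurity(vegetables, "storage", "cold", "spoiled")
print("farm == A:      ", round(farm_score, 3))     # 0.167
print("storage == cold:", round(storage_score, 3))  # 0.48
```

In this toy data the farm split leaves much purer groups than the storage split, so a tree built this way would ask about the farm first.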

## Decision Trees
When using a simple tree, it is easy to produce a visualization (with a tree shape, of course) that shows splits and features in decreasing order of importance moving down the "tree." However, in this particular instance, I'm using AdaBoost, which is a version of boosted trees that typically offers more powerful predictions. Because the boosted model is an ensemble of many trees, there is no single tree to draw, so the output in this case is a bit different.

In [1]:
from sklearn.ensemble import AdaBoostClassifier 
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
import pandas as pd

# import mapped version of data set here
readmit = pd.read_csv('diabetes_readmission_onehot.csv')

In [2]:
# capture independent variables in list
features = list(readmit)
features = [e for e in features if e not in ('Unnamed: 0', 'readmit30')]

In [4]:
# split the data into a training and test set
X = readmit[features].values
y = readmit.readmit30.values
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size = .2, 
                                                random_state = 31, stratify = y)

In [11]:
# build model on training data
# (scikit-learn 1.2 and later name the `base_estimator` parameter `estimator`)
dt = DecisionTreeClassifier()
clf = AdaBoostClassifier(base_estimator = dt, n_estimators = 50, learning_rate = 1, random_state = 7)
model = clf.fit(Xtrain, Ytrain)

# check performance (model accuracy) on test data
print(model.score(Xtest, Ytest))

0.847856982572


In [6]:
# results summary: feature importances rather than a tree diagram,
# because the boosted model is an ensemble of many trees
importance = model.feature_importances_
importance_df = pd.DataFrame({'feature': features, 'importance': importance})
importance_df.sort_values('importance', ascending = False)

                   feature  importance
7               num_visits    0.119798
2          num_medications    0.100397
0         days_in_hospital    0.063651
6         number_diagnoses    0.043483
1           num_procedures    0.029324
5         number_inpatient    0.020059
3        number_outpatient    0.019957
12             gender_Male    0.014830
20             age_[70-80)    0.014345
148  num_lab_procs_[41-50]    0.014087


In [None]:
# importance = normalized impurity reduction attributed to a feature (values sum to 1 across all features)
# small individual values are expected with this many one-hot columns; the 85% accuracy still needs checking against recall

## Confusion Matrix and Recall Score

Another way to evaluate model performance is a confusion matrix, which sorts our model's predictions into four categories:

- In the top-left quadrant is the number of observations classified as not readmitted within 30 days that were in fact not readmitted within 30 days. This is the true negative count.
- In the top-right quadrant is the number of observations classified as readmitted within 30 days that were in fact not readmitted within 30 days. This is the false positive count.
- In the lower-left quadrant is the number of observations classified as not readmitted within 30 days that were in fact readmitted within 30 days. This is the false negative count.
- In the lower-right quadrant is the number of observations classified as readmitted within 30 days that were in fact readmitted within 30 days. This is the true positive count.
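As a cross-check on the four quadrants, the counts can be tallied by hand. This is a small sketch on made-up labels for eight patients (1 = readmitted within 30 days); the function name `confusion_counts` and the toy lists are hypothetical and only meant to show which pair of actual and predicted values lands in which cell.

```python
def confusion_counts(actual, predicted):
    """Return (TN, FP, FN, TP) for binary 0/1 labels, where 1 = readmitted."""
    tn = fp = fn = tp = 0
    for a, p in zip(actual, predicted):
        if a == 0 and p == 0:
            tn += 1   # top-left: correctly predicted not readmitted
        elif a == 0 and p == 1:
            fp += 1   # top-right: predicted readmitted, actually not
        elif a == 1 and p == 0:
            fn += 1   # lower-left: predicted not readmitted, actually readmitted
        else:
            tp += 1   # lower-right: correctly predicted readmitted
    return tn, fp, fn, tp

# Hypothetical labels for eight patients
actual_toy    = [0, 0, 0, 0, 1, 1, 1, 0]
predicted_toy = [0, 1, 0, 0, 1, 0, 0, 0]

tn, fp, fn, tp = confusion_counts(actual_toy, predicted_toy)
print([[tn, fp], [fn, tp]])  # [[4, 1], [2, 1]]
```

The nested list is laid out the same way as the quadrants described above: actual classes as rows, predicted classes as columns.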

In [26]:
# set actual and predicted vectors; generate confusion matrix for test data
actual = pd.Series(Ytest, name = 'Actual')
predicted = pd.Series(clf.predict(Xtest), name = 'Predicted')
train_ct = pd.crosstab(actual, predicted, margins = True)
print(train_ct)

Predicted      0     1    All
Actual                       
0          11062  1092  12154
1            942   273   1215
All        12004  1365  13369


In [35]:
# as percentages: share of each actual class that was predicted correctly
TN = train_ct.iloc[0,0] / train_ct.iloc[0,2]
TP = train_ct.iloc[1,1] / train_ct.iloc[1,2]
print('Accuracy for not readmitted: {}'.format('%0.3f' % TN))
print('Accuracy for readmitted: {}'.format('%0.3f' % TP))

Accuracy for not readmitted: 0.910
Accuracy for readmitted: 0.225


Finally, we can consider the model's recall score. Recall is the fraction of actual positive cases that the model identified correctly: true positives divided by the sum of true positives and false negatives. In the context of this analysis, if 100 patients were readmitted within thirty days and the model detected 23 of them, then the model's recall would be .23 (or 23%).
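Recall can also be worked out by hand as true positives divided by all actual positives. The short sketch below uses the two counts from the "Actual = 1" row of the AdaBoost confusion matrix printed earlier; the variable names are just for illustration.

```python
# Counts copied from the "Actual = 1" row of the confusion matrix above
false_negatives = 942   # readmitted, but predicted as not readmitted
true_positives = 273    # readmitted and predicted as readmitted

# recall = TP / (TP + FN)
recall = true_positives / (true_positives + false_negatives)
print(round(recall, 3))  # 0.225
```

This is the same figure reported as "Accuracy for readmitted" in the previous cell.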

In [30]:
# find recall for AdaBoost tree model
recall_score(actual, predicted)

0.22469135802469137

## Random Forest

A single decision tree has limited accuracy on its own: deep trees tend to overfit the training data, and very shallow trees are weak learners that may do only somewhat better than chance. To improve accuracy, we can take an ensemble approach, random forest, that combines many trees, each trained on a random sample of the rows and features. The idea is that many imperfect learners pool their "knowledge" to create a stronger learner, which is usually more accurate.
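The intuition behind combining learners can be checked with a little probability. The sketch below assumes, unrealistically, that each learner is right 60% of the time and that their errors are fully independent, then computes how often a simple majority vote is right. Trees in a real forest are correlated, so actual gains are smaller; the function name `majority_vote_accuracy` is made up for this illustration.

```python
from math import comb

def majority_vote_accuracy(n_learners, p_correct):
    """Probability that more than half of n independent learners are correct."""
    needed = n_learners // 2 + 1
    return sum(
        comb(n_learners, k) * p_correct**k * (1 - p_correct) ** (n_learners - k)
        for k in range(needed, n_learners + 1)
    )

# Odd ensemble sizes avoid tied votes
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the vote becomes more reliable as learners are added, which is the effect a random forest tries to approximate by decorrelating its trees.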

In [31]:
# build and fit model with random forest
clf_rf = RandomForestClassifier(random_state = 7, class_weight = {0: .1, 1: .9})
model_rf = clf_rf.fit(Xtrain, Ytrain)

In [32]:
# model accuracy on test data
print(model_rf.score(Xtest, Ytest))

0.906350512379


In [33]:
# confusion matrix
actual_rf = pd.Series(Ytest, name = 'Actual')
predicted_rf = pd.Series(clf_rf.predict(Xtest), name = 'Predicted')
rf_ct = pd.crosstab(actual_rf, predicted_rf, margins = True)
print(rf_ct)

Predicted      0   1    All
Actual                     
0          12098  56  12154
1           1196  19   1215
All        13294  75  13369


In [37]:
# confusion matrix with percentages
TN_rf = rf_ct.iloc[0,0] / rf_ct.iloc[0,2]
TP_rf = rf_ct.iloc[1,1] / rf_ct.iloc[1,2]
print('Accuracy for not readmitted: {}'.format('%0.3f' % TN_rf))
print('Accuracy for readmitted: {}'.format('%0.3f' % TP_rf))

Accuracy for not readmitted: 0.995
Accuracy for readmitted: 0.016


In [38]:
# recall
recall_score(actual_rf, predicted_rf)

0.015637860082304528

In [None]:
# read in under-sampled df and try rf again
# add markdown cell to explain what's up
# may want to try with other algorithms if metrics indicate appropriate after fixing maths
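Random under-sampling is one way to address the class imbalance seen in the confusion matrices above: keep every minority-class row and only an equally sized random sample of the majority class. The following is a minimal sketch on made-up records, not the procedure used to build the under-sampled data in this notebook; the function `random_under_sample` and the toy rows are hypothetical.

```python
import random

def random_under_sample(rows, label_key, seed=7):
    """Keep all minority-class rows plus an equal-sized random sample of the majority class."""
    positives = [r for r in rows if r[label_key] == 1]
    negatives = [r for r in rows if r[label_key] == 0]
    minority, majority = sorted([positives, negatives], key=len)
    rng = random.Random(seed)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)  # avoid leaving the classes in two contiguous blocks
    return balanced

# Hypothetical data with a 10:1 imbalance (10 readmitted, 100 not readmitted)
toy = [{"id": i, "readmit30": 1 if i % 11 == 0 else 0} for i in range(110)]

balanced = random_under_sample(toy, "readmit30")
n_pos = sum(r["readmit30"] for r in balanced)
print(len(balanced), n_pos)  # 20 10
```

A common practice is to under-sample only the training split and to evaluate on an untouched test split, so that the reported metrics reflect the original class balance.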

In [61]:
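# run random forest again w/ under-sampled dataset
# NOTE: X_res and y_res (the under-sampled features and labels) are not created
# in any cell shown here; as written, the model is still fit on Xtrain/Ytrain
clf_rus = RandomForestClassifier(random_state = 7)
model_rus = clf_rus.fit(Xtrain, Ytrain)
print(model_rus.score(X_res, y_res))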

0.916666666667


In [62]:
# confusion matrix for random forest with random under-sampling
# (predictions are made on the full data set, which includes the training rows)
actual_rf = pd.Series(readmit['readmit30'], name = 'Actual')
predicted_rus = pd.Series(clf_rus.predict(readmit[features]), name = 'Predicted')
cm_rus = pd.crosstab(actual_rf, predicted_rus, margins = True)
print(cm_rus)

Predicted  False  True    All
Actual                       
0          60705    65  60770
1           1981  4093   6074
All        62686  4158  66844


In [63]:
# confusion matrix with percentages (each row divided by its 'All' total)
cm_rus_pct = cm_rus.div(cm_rus['All'], axis=0)
print(cm_rus_pct)

Predicted     False      True  All
Actual                            
0          0.998930  0.001070  1.0
1          0.326144  0.673856  1.0
All        0.937795  0.062205  1.0


In [64]:
# check new model's recall
recall_score(actual_rf, predicted_rus)

0.67385577872900893