# Module 9 Exercises - Decision Trees

### Exercise 1:

Using the diabetes.csv file from the Module 8 Exercises notebook, load the file as a dataframe. Repeat the steps from exercises 1 & 2 in the Module 8 Exercise notebook to prepare your dataset for modeling.

In [17]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [3]:
location = "datasets/diabetes.csv"
df = pd.read_csv(location)
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [4]:
df.head()

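| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |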


In [5]:
# Check all columns have same number of rows
df.count()

Pregnancies                 768
Glucose                     768
BloodPressure               768
SkinThickness               768
Insulin                     768
BMI                         768
DiabetesPedigreeFunction    768
Age                         768
Outcome                     768
dtype: int64

In [6]:
# Check for null data
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [7]:
# Check for NA values (isna() is an alias of isnull(), so this repeats the check above)
df.isna().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [None]:
small_df = df[['Glucose','BloodPressure','Insulin','BMI','DiabetesPedigreeFunction','Age','Outcome']]
small_df

In [9]:
# Extract target variable
# Select the 'Outcome' column as the target
Y = small_df['Outcome']

In [10]:
# Make dataframe that only contains predictive features
X = small_df.drop('Outcome', axis=1)

### Exercise 2:

Using the decision tree function in the scikit-learn library (sklearn), fit the model with the training dataset. Then score the model for training; how well did it do?

In [11]:
# test_size defaults to 0.25; it is set explicitly here for clarity
# train_test_split returns 4 objects:
# 2 for X (the predictive features) - training and testing
# 2 for Y (the target) - training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=10)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

(576, 6)
(192, 6)
(576,)
(192,)


In [18]:
# Assign a decision tree classifier to the model variable
treeModel = tree.DecisionTreeClassifier()

In [19]:
# Fit the model using the training data
# Setting hyperparameters such as max_depth or min_samples_leaf can help prevent overfitting; the defaults are used here
treeModel.fit(X_train, Y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [20]:
#accuracy score of model on training data
treeModel.score(X_train, Y_train)

1.0
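
A training score of 1.0 usually means the unrestricted tree has memorized the training set rather than learned a general pattern. The sketch below illustrates this on synthetic data, not on diabetes.csv: `make_classification` is used as a stand-in with the same shape (768 rows, 6 features), and the `max_depth=3` and `min_samples_leaf=5` settings are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data with the same shape as the diabetes features
X_demo, y_demo = make_classification(
    n_samples=768, n_features=6, n_informative=4, n_redundant=0,
    flip_y=0.1, random_state=10,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=10
)

# Unrestricted tree: keeps splitting until every training leaf is pure
deep_tree = DecisionTreeClassifier(random_state=10).fit(X_tr, y_tr)

# Restricted tree: depth and leaf size are capped (illustrative values)
shallow_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=10
).fit(X_tr, y_tr)

deep_train = deep_tree.score(X_tr, y_tr)
deep_test = deep_tree.score(X_te, y_te)
shallow_train = shallow_tree.score(X_tr, y_tr)
shallow_test = shallow_tree.score(X_te, y_te)

print(f"unrestricted: train={deep_train:.3f}, test={deep_test:.3f}")
print(f"restricted:   train={shallow_train:.3f}, test={shallow_test:.3f}")
```

A large gap between the training and test scores of the unrestricted tree is the usual symptom of overfitting. Whether a restricted tree actually scores better on the test set depends on the data, so it is worth checking on the real split.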

### Exercise 3:

Now run the fitted decision tree model on the test dataset and get its score.

In [21]:
#run the predictions on the test data
Y_pred = treeModel.predict(X_test)

In [22]:
#accuracy score of model on test data
treeModel.score(X_test, Y_test)

0.6927083333333334

### Exercise 4:

Make a confusion matrix for the predicted outcomes to compare it against the "true" outcomes. How many values for each outcome did the model get incorrect?

In [23]:
# Confusion matrix shows which values model predicted correctly vs incorrectly

cm = pd.DataFrame(
    confusion_matrix(Y_test, Y_pred),
    columns=['Predicted Not Diabetic', 'Predicted Diabetic'],
    index=['True Not Diabetic', 'True Diabetic']
)

cm

| | Predicted Not Diabetic | Predicted Diabetic |
|---|---|---|
| True Not Diabetic | 93 | 28 |
| True Diabetic | 31 | 40 |
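
To answer the question directly, the off-diagonal cells of the matrix are the errors. The short sketch below copies the four counts from the output above and tallies them. The variable names (`tn`, `fp`, `fn`, `tp`) are illustrative and not part of the original notebook.

```python
# Counts copied from the confusion matrix above
# (rows = true outcome, columns = predicted outcome)
tn, fp = 93, 28   # true not diabetic: predicted not diabetic / predicted diabetic
fn, tp = 31, 40   # true diabetic:     predicted not diabetic / predicted diabetic

total = tn + fp + fn + tp
incorrect = fp + fn
accuracy = (tn + tp) / total

print(f"Not diabetic but predicted diabetic (false positives): {fp}")
print(f"Diabetic but predicted not diabetic (false negatives): {fn}")
print(f"Total incorrect: {incorrect} of {total}")
print(f"Accuracy: {accuracy:.4f}")
```

The accuracy computed this way agrees with the test score reported in Exercise 3.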


### Exercise 5:

Get a classification report on the model for the predicted data. Which outcome is the model more accurate at predicting?

In [24]:
# From the precision column, the model is better at predicting patients who are not diabetic (class 0)
print(classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.75      0.77      0.76       121
           1       0.59      0.56      0.58        71

   micro avg       0.69      0.69      0.69       192
   macro avg       0.67      0.67      0.67       192
weighted avg       0.69      0.69      0.69       192
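
The precision and recall figures in the report can be reproduced by hand from the confusion matrix in Exercise 4, which makes it easier to see what each column measures. This is a sketch using the four counts from that matrix; the variable names are illustrative.

```python
# Counts from the Exercise 4 confusion matrix
tn, fp = 93, 28   # true class 0 (not diabetic)
fn, tp = 31, 40   # true class 1 (diabetic)

# Precision: of everything predicted as a class, how much was right
precision_0 = tn / (tn + fn)
precision_1 = tp / (tp + fp)

# Recall: of everything truly in a class, how much was found
recall_0 = tn / (tn + fp)
recall_1 = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

Every metric is higher for class 0, so the model is more reliable at identifying patients who are not diabetic.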



### Exercise 6:

Compare the predictions from the decision tree model to the logistic regression model in the Module 8 Exercise notebook. Which model was best at predicting the outcome of diabetes for a patient?
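
One way to make the comparison fair is to fit both models on the same train/test split and compare their test scores and classification reports side by side. The sketch below shows that pattern on synthetic stand-in data from `make_classification` (an assumption, since the Module 8 results are not reproduced here). In the notebook, the existing `X_train`, `X_test`, `Y_train`, and `Y_test` would take the place of the demo variables. `max_iter=1000` is an illustrative setting to help the logistic regression converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data, not the diabetes dataset
X_demo, y_demo = make_classification(
    n_samples=768, n_features=6, n_informative=4, n_redundant=0,
    flip_y=0.1, random_state=10,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=10
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=10),
    "logistic regression": LogisticRegression(max_iter=1000),
}

# Fit each model on the same training data and score it on the same test data
test_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    test_scores[name] = model.score(X_te, y_te)
    print(f"--- {name}: test accuracy = {test_scores[name]:.3f}")
    print(classification_report(y_te, model.predict(X_te)))

best = max(test_scores, key=test_scores.get)
print(f"Higher test accuracy on this split: {best}")
```

Accuracy alone can hide differences between the classes, so the per-class precision and recall for outcome 1 (diabetic) are worth comparing as well.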