### Decision Tree Classifier
Using data from: https://www.kaggle.com/datasets/luvharishkhati/heart-disease-patients-details/data

Code help from: https://www.datacamp.com/tutorial/decision-tree-classification-python

Article help from: https://www.techtarget.com/searchenterpriseai/definition/data-splitting


In [182]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import confusion_matrix

# Data source: https://www.kaggle.com/datasets/luvharishkhati/heart-disease-patients-details/data
df = pd.read_csv('heart_disease.csv')

# X is the input data in the shape (m x n) aka (examples x features)
X = df.drop(columns=['result']).values

# y is true/target value (this can only be 0 or 1) 
y = df['result'].values

# Print the number of each result
result_counts = df['result'].value_counts()
print(result_counts)

df.head() # show the data

result
0    150
1    120
Name: count, dtype: int64


Unnamed: 0,age,sex,chest,resting_blood_pressure,serum_cholestoral,fasting_blood_sugar,resting_electrocardiographic_results,maximum_heart_rate_achieved,exercise_induced_angina,oldpeak,slope,number_of_major_vessels,thal,result
0,70,1,4,130,322,0,2,109,0,2.4,2,3,3,1
1,67,0,3,115,564,0,2,160,0,1.6,2,0,7,0
2,57,1,2,124,261,0,0,141,0,0.3,1,0,7,1
3,64,1,4,128,263,0,0,105,1,0.2,2,1,7,0
4,74,0,2,120,269,0,2,121,1,0.2,1,1,3,0


In [183]:

n_runs = 1000  # Number of random train/test splits to average over
accSum = 0

for i in range(n_runs):

    # Split dataset into training set and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test

    # Create Decision Tree classifier object
    clf = DecisionTreeClassifier()

    # Train Decision Tree classifier
    clf = clf.fit(X_train,y_train)

    # Predict the response for test dataset
    y_pred = clf.predict(X_test)

    # Running total of model accuracy scores to be averaged
    # (How often is the classifier correct?)
    accSum += metrics.accuracy_score(y_test, y_pred)


# Average model accuracy over all runs, converted from a fraction to a percentage
print(f"Accuracy: {(accSum / n_runs * 100):.1f}%")

Accuracy: 73.9%


The portion of the data chosen for training the model versus testing it will affect the *measured* model accuracy. The more representative the 70% chosen for training is of the entire data set, the more accurate the model will appear. However, you wouldn't just pick the split that gives you the highest score: while it raises your accuracy score, that's somewhat deceptive and doesn't practically improve your model. You'd be tuning to one particular test set, which is a form of overfitting, and your model may generalize poorly to any new data.

The setup above averages the accuracy over 1000 random splits, which gives a reasonably stable measurement of how well the model does on a typical split: about 74%.

In a real-world case, you would hopefully have a large enough data set that the choice of which portion to hold out for testing would affect the measured accuracy much less.
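To get a feel for how much the split alone moves the score, here is a minimal sketch that records each run's accuracy instead of only a running total. It assumes a synthetic stand-in for the heart disease data built with `make_classification` (270 rows, 13 features); `X_demo`, `y_demo` and `scores` are names made up for this illustration, and the numbers it prints are not the notebook's results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data standing in for heart_disease.csv (270 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=270, n_features=13, random_state=0)

scores = []
for seed in range(100):
    # A different random 70/30 split on every run
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed
    )
    # Same tree settings every time, so only the split changes
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    scores.append(tree.score(X_te, y_te))

scores = np.array(scores)
print(f"mean accuracy:      {scores.mean() * 100:.1f}%")
print(f"standard deviation: {scores.std() * 100:.1f} points")
print(f"best / worst split: {scores.max() * 100:.1f}% / {scores.min() * 100:.1f}%")
```

The gap between the best and worst split is the part of the score that comes purely from which rows happened to land in the test set.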

### Tuning decision tree settings

Okay, so what can we mess with to increase the accuracy of the model that isn't just cherry-picking the data? How about modifying the settings on the decision tree? Currently we're just using the defaults, which can be found on the manual page here: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

For our case I am going to play with "criterion", which selects the function used to measure the quality of a split, and "max_features", which determines the number of features to consider when looking for the best split.

Criterion:
    gini: (This is the default selection) Gini impurity measures the probability of misclassifying a randomly chosen element in the set if it were labeled at random according to the set's class distribution

    entropy: Measures the amount of uncertainty or randomness in a set

Max features: The number of features to consider when looking for the best split

    None: no selection, max_features=n_features (This is the default selection)

    sqrt: max_features=sqrt(n_features)

    log2: max_features=log2(n_features)

    int/float: an integer sets the number of features directly; a float is treated as a fraction of n_features

    (There are more but these are the choices for my testing)
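The two criteria can be computed by hand from the class counts in a node. This is a small sketch using the textbook formulas, not scikit-learn's internal code; `gini` and `entropy` are helper names I made up for it. The first set of counts is the 150/120 split of results in this data set.

```python
import numpy as np

def gini(counts):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # drop empty classes, since 0 * log2(0) is taken as 0
    return -np.sum(p * np.log2(p))

# Class counts for a node: our data set, a perfectly even node, a lopsided node
for counts in ([150, 120], [135, 135], [243, 27]):
    print(counts, f"gini={gini(counts):.3f}", f"entropy={entropy(counts):.3f}")
```

Both measures peak for an even split and drop to zero for a pure node, which is why swapping one for the other usually changes the tree only slightly.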


Since 270 examples isn't a huge dataset, I avoided modifying parameters that risked overfitting, since I didn't want to get a model that was "dishonestly" accurate.

In [184]:

n_runs = 1000  # Number of random train/test splits to average over
accSum = 0

for i in range(n_runs):

    # Split dataset into training set and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test

    # Create Decision Tree classifier object
    clf = DecisionTreeClassifier(criterion="entropy")

    # Train Decision Tree classifier
    clf = clf.fit(X_train,y_train)

    # Predict the response for test dataset
    y_pred = clf.predict(X_test)

    # Running total of model accuracy scores to be averaged
    # (How often is the classifier correct?)
    accSum += metrics.accuracy_score(y_test, y_pred)


# Average model accuracy over all runs, converted from a fraction to a percentage
print(f"Accuracy: {(accSum / n_runs * 100):.1f}%")

Accuracy: 74.4%


The max_features parameter controls the maximum number of features the decision tree is allowed to consider when making a split.

max_features="log2" alone decreased accuracy to about 73% 

In fact, any value tested other than None reduced the accuracy of the model. The only exception was setting an integer value high enough that every feature is considered at each split, which behaves the same as None.

With a small dataset of only 270 examples, None worked best here, since every split can choose from all of the features. Lowering max_features is mainly a regularization technique: it keeps the model from becoming too complex and helps address overfitting.
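To see what each option actually resolves to, a fitted tree exposes the inferred value as `max_features_`. The sketch below assumes a synthetic 13-feature stand-in for the heart disease data (`X_demo`, `y_demo` and `resolved` are made-up names), so only the feature counts are meaningful, not any accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data with the same shape as the notebook's (270 x 13)
X_demo, y_demo = make_classification(n_samples=270, n_features=13, random_state=0)

resolved = {}
for option in (None, "sqrt", "log2", 5, 0.5):
    tree = DecisionTreeClassifier(max_features=option, random_state=0)
    tree.fit(X_demo, y_demo)
    # max_features_ is the number of features scikit-learn inferred from the option
    resolved[option] = tree.max_features_
    print(f"max_features={option!r:>6} -> {tree.max_features_} "
          f"of {tree.n_features_in_} features per split")
```

With 13 features, "sqrt" and "log2" both leave only a few candidates per split, which fits with the drop in accuracy seen above.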


criterion="entropy" alone increased accuracy to about 74.5%

This small improvement is not surprising, as usually the difference in performance is minimal. There are situations where using entropy might lead to slightly higher model accuracy, however. Namely, entropy is more sensitive to changes in class probabilities than Gini impurity. For datasets with imbalanced class distributions (cases where the outcome is biased towards one option) or scenarios where subtle differences in class probabilities matter, entropy might be better at capturing these nuances. Since the breakdown of results is about 55% to 45%, there shouldn't be much if any imbalance in the class distribution, so the slight increase with entropy is more likely due to subtle differences in class probabilities, or simply to run-to-run variation, since the gap is only about half a percentage point.


Most of the non-default options provided by DecisionTreeClassifier exist specifically to control overfitting by restricting the tree, so on a data set this small, changing them tends to reduce measured accuracy. The other options allow a closer fit, but these are generally better suited to larger data sets. As mentioned earlier, these were avoided so as not to present "dishonest" model accuracy.


#### Bagging: Random Forest

In [190]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Create a bagging random forest classifier
random_forest = RandomForestClassifier(n_estimators=100)
# Note: n_estimators is the number of decision trees to be used in the ensemble

# Perform k-fold cross-validation
# (cross_val_score fits a fresh copy of the model on each fold, so no separate
# train/test split or fit call is needed here)
k = 10  # Number of folds
cv_scores = cross_val_score(random_forest, X, y, cv=k, scoring='accuracy')
print(f'Mean accuracy across {k}-fold cross-validation: {(np.mean(cv_scores)*100):.1f}%')

# Get a cross-validated prediction for every example from the random forest
# (passing the earlier decision tree `clf` here would count the decision tree's errors instead)
y_pred = cross_val_predict(random_forest, X, y, cv=k)

# Calculate the confusion matrix
conf_matrix = confusion_matrix(y, y_pred)

# Extract the number of False Negatives (true label 1, predicted 0)
false_negatives = conf_matrix[1, 0]

print(f'Number of False Negatives: {false_negatives} / {conf_matrix.sum()}')


Mean accuracy across 10-fold cross-validation: 82.6%
Number of False Negatives: 32 / 270


#### Boosting: Gradient Tree

In [191]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize the Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier()

# Perform k-fold cross-validation
# (cross_val_score fits a fresh copy of the model on each fold, so no separate
# train/test split or fit call is needed here)
k = 10  # Number of folds
cv_scores = cross_val_score(gb_classifier, X, y, cv=k, scoring='accuracy')
print(f'Mean accuracy across {k}-fold cross-validation: {(np.mean(cv_scores)*100):.1f}%')

# Get a cross-validated prediction for every example from the gradient boosting model
# (passing the earlier decision tree `clf` here would count the decision tree's errors instead)
y_pred = cross_val_predict(gb_classifier, X, y, cv=k)

# Calculate the confusion matrix
conf_matrix = confusion_matrix(y, y_pred)

# Extract the number of False Negatives (true label 1, predicted 0)
false_negatives = conf_matrix[1, 0]

print(f'Number of False Negatives: {false_negatives} / {conf_matrix.sum()}')


Mean accuracy across 10-fold cross-validation: 80.0%
Number of False Negatives: 35 / 270


## Bagging and Boosting Results

The random forest bagging algorithm had an accuracy of 82.6% in the run shown above, while the gradient tree boosting algorithm had an accuracy of 80.0%. Random forest bagging was therefore a little more effective. Why was this? Let's discuss in the comparison section below.

## Compare and Contrast Models

### Metric for Evaluation

Firstly, you may have noticed in the beginning how clunky (and slow!) it was to analyze our decision tree classifier by just running it 1000 times. A single run wasn't a good approximation, since the limited data set meant the model's accuracy was hugely dependent on the random selection of training data, and averaging 1000 runs to smooth that out was really slow.

To fix this, we use K-fold cross-validation to measure the next models. The data is divided into 'k' equally sized folds, and the model is trained and tested 'k' times, each time using a different fold as the test set and the remaining folds as the training set. The performance metrics from each iteration are then averaged to provide a more accurate evaluation of the model. This helps ensure that the model's performance is not heavily influenced by the specific random split of the data. K-fold cross-validation provides a more reliable estimate of how well a model generalizes to unseen data, making it a valuable tool in model evaluation and selection.

Essentially, K-fold cross-validation is great for getting a more accurate estimate of the accuracy of a model that's been fed limited data. Every example is used for testing exactly once, so the estimate is less biased by one lucky or unlucky split, and overfitting or underfitting is easier to spot.
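Written out as a loop, the procedure looks roughly like the sketch below. It assumes synthetic stand-in data (`X_demo`, `y_demo`) rather than the heart disease CSV, and it uses `StratifiedKFold` with shuffling and a fixed seed as my own choice. For a classifier with an integer `cv`, `cross_val_score` also uses stratified folds, but without shuffling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data standing in for heart_disease.csv (270 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=270, n_features=13, random_state=0)

k = 10
folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
times_tested = np.zeros(len(y_demo), dtype=int)

for train_idx, test_idx in folds.split(X_demo, y_demo):
    # Train on k-1 folds, test on the one fold that was held out
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
    tree.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(tree.score(X_demo[test_idx], y_demo[test_idx]))
    times_tested[test_idx] += 1

print("per-fold accuracy:", np.round(fold_scores, 3))
print(f"mean accuracy: {np.mean(fold_scores) * 100:.1f}%")
print("every example tested exactly once:", bool(np.all(times_tested == 1)))
```

The spread of the per-fold scores is worth a glance too, since it shows how much a single split could mislead.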

Really quick, let's re-run the decision tree classifier with k-fold analysis to better contrast results.

In [187]:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy")

# Perform k-fold cross-validation
# (cross_val_score fits a fresh copy of the model on each fold, so no separate
# train/test split or fit call is needed here)
k = 10  # Number of folds
cv_scores = cross_val_score(clf, X, y, cv=k, scoring='accuracy')
print(f'Mean accuracy across {k}-fold cross-validation: {(np.mean(cv_scores)*100):.1f}%')

# Get a cross-validated prediction for every example
y_pred = cross_val_predict(clf, X, y, cv=k)

# Calculate the confusion matrix
conf_matrix = confusion_matrix(y, y_pred)

# Extract the number of False Negatives (true label 1, predicted 0)
false_negatives = conf_matrix[1, 0]

print(f'Number of False Negatives: {false_negatives} / {conf_matrix.sum()}')

Mean accuracy across 10-fold cross-validation: 74.1%
Number of False Negatives: 34 / 270


Okay, now that we have all three algorithms evaluated by k-fold cross validation what are our final rankings?
(Rankings for when I ran the code, results may vary.)

82.6% Random Forest Bagging

80.0% Gradient Tree Boosting

74.1% Decision Tree Classifier

Random Forest is an ensemble learning method that builds multiple decision trees during training and merges their predictions to improve accuracy and control overfitting. Bagging is used to train each tree on a random bootstrap sample of the training data (rows drawn with replacement). Notably, random forests perform well on high-dimensional data; with 13 features plus the result and 270 examples, our data isn't exactly high-dimensional, but it isn't super low-dimensional either. Since random forests are also less prone to overfitting, holding part of the data out for testing is less likely to produce a less accurate model.
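The bagging idea can be sketched by hand with plain decision trees. This is a simplified illustration, not how `RandomForestClassifier` is implemented: it uses hard majority voting, whereas scikit-learn's forest averages the trees' predicted probabilities. The data is a synthetic stand-in, and `votes` and `bagged_pred` are names made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data standing in for heart_disease.csv (270 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=270, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 majority vote can never tie
votes = np.zeros((n_trees, len(y_te)), dtype=int)

for t in range(n_trees):
    # Bootstrap sample: draw as many rows as the training set has, with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    # Each tree also considers only a random subset of features at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
    tree.fit(X_tr[idx], y_tr[idx])
    votes[t] = tree.predict(X_te)

# Labels are 0/1, so the column mean is the share of trees voting for class 1
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)

single_tree_acc = np.mean(votes[0] == y_te)
bagged_acc = np.mean(bagged_pred == y_te)
print(f"first tree alone: {single_tree_acc * 100:.1f}%")
print(f"majority vote of {n_trees} trees: {bagged_acc * 100:.1f}%")
```

Because each tree sees a slightly different sample, their individual mistakes tend not to line up, and the vote smooths them out.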

Gradient Tree Boosting is also an ensemble method. It builds a series of decision trees sequentially, with each tree correcting the errors of the previous ones. It optimizes a loss function by adding weak learners (trees) in an iterative manner. Gradient Boosting is powerful for capturing complex relationships and subtle patterns in data. That description fits our data, so it is unsurprising that it did fairly well.
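One way to watch the sequential correction happen is `staged_predict`, which yields the ensemble's predictions after each added tree. The sketch assumes synthetic stand-in data and a fixed seed; `train_curve` and `test_curve` are names invented for the illustration, and the printed figures are not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: synthetic data standing in for heart_disease.csv (270 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=270, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# staged_predict yields predictions after 1, 2, ..., n_estimators trees
train_curve = [accuracy_score(y_tr, pred) for pred in gb.staged_predict(X_tr)]
test_curve = [accuracy_score(y_te, pred) for pred in gb.staged_predict(X_te)]

for n in (1, 10, 50, 100):
    print(f"{n:>3} trees: train {train_curve[n - 1] * 100:.1f}%, "
          f"test {test_curve[n - 1] * 100:.1f}%")
```

Training accuracy typically keeps climbing as trees are added while test accuracy levels off, which is one reason boosting can be sensitive to settings like the number of trees.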

Decision trees are non-linear models that recursively split data based on feature values to make predictions. A decision tree classifier, in its basic form, can suffer from overfitting, as it tends to create deep and complex trees. This tendency to overfit tracks with it being the worst-performing of our three models. The model sometimes overfits to the data selected for training and then performs poorly on the tests. Random Forest and Gradient Tree both represent ways to help manage this issue, and as such it is no surprise they both performed better.
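The overfitting shows up clearly when training and test accuracy are put side by side. Below is a minimal sketch on synthetic stand-in data. The `max_depth=3` tree is only there for contrast and is not a setting used elsewhere in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data standing in for heart_disease.csv (270 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=270, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Default settings: the tree keeps splitting until every training leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Depth-limited tree, for comparison only
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in (("unrestricted", full_tree), ("max_depth=3", shallow_tree)):
    print(f"{name:>12}: depth {tree.get_depth()}, "
          f"train {tree.score(X_tr, y_tr) * 100:.1f}%, "
          f"test {tree.score(X_te, y_te) * 100:.1f}%")
```

The unrestricted tree memorizes its training rows, so the gap between its training and test scores is a rough gauge of how much it has overfit.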

In summary,

Random Forest Bagging and Gradient Tree Boosting perform better than the basic Decision Tree Classifier, which is consistent with expectations.

Random Forest Bagging's higher accuracy can be attributed to the diversity introduced by training multiple trees on different subsets of the data, reducing overfitting.

Gradient Tree Boosting's slightly lower accuracy may be due to the iterative nature of the algorithm, which can be sensitive to hyperparameter tuning. However, its effectiveness in capturing complex patterns is reflected in the performance.

Decision Tree Classifier has the lowest accuracy, likely because it tends to overfit the training data by creating deep trees.


### What about another way to evaluate?

It's important to note that accuracy (which we used both in the 1000 runs and in K-fold) is just one metric, and the choice of a different metric could lead to different rankings, especially in scenarios where class imbalances exist or where certain types of errors are more important to avoid than others. For example, in medical models like this one, minimizing the number of false negatives is crucial. It would be terrible for a patient to go home without treatment thinking they're perfectly healthy! Recall score is great for this, since it measures the share of actual positive cases the model catches: every false negative lowers it directly, while false positives don't affect it at all.

Since our distribution of positive and negative results is about even, 55% to 45%, and since the number of false negatives (that's why I have been calculating them) is somewhat low, I don't expect our results to change much.
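For reference, recall is TP / (TP + FN), so it comes straight out of the confusion matrix. The sketch below uses ten made-up patient labels purely for illustration (they are not taken from the data set) and checks the hand calculation against `recall_score`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels for ten patients (1 = heart disease), for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary 0/1 labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

manual_recall = tp / (tp + fn)  # share of sick patients the model caught
manual_accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy: {manual_accuracy:.2f} (sklearn: {accuracy_score(y_true, y_hat):.2f})")
print(f"recall:   {manual_recall:.2f} (sklearn: {recall_score(y_true, y_hat):.2f})")
```

Here two missed patients out of four pull recall down to 0.50 even though accuracy is 0.70, which is the kind of gap accuracy alone would hide.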



In [189]:
from sklearn.model_selection import cross_val_score

# scoring='recall' asks cross_val_score to compute recall on each held-out fold.
# cross_val_score fits a fresh copy of each model per fold, so no separate
# train/test split or fit call is needed here.
k = 10  # Number of folds

# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy")

cv_scores = cross_val_score(clf, X, y, cv=k, scoring='recall')
print(f'Decision Tree Classifier mean recall score across {k}-fold cross-validation: {(np.mean(cv_scores)*100):.1f}%')

# Initialize the Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier()

cv_scores = cross_val_score(gb_classifier, X, y, cv=k, scoring='recall')
print(f'Gradient Tree Boosting mean recall score across {k}-fold cross-validation: {(np.mean(cv_scores)*100):.1f}%')

# Create a bagging random forest classifier
random_forest = RandomForestClassifier(n_estimators=100)

cv_scores = cross_val_score(random_forest, X, y, cv=k, scoring='recall')
print(f'Random Forest Bagging mean recall score across {k}-fold cross-validation: {(np.mean(cv_scores)*100):.1f}%')

Decision Tree Classifier mean recall score across 10-fold cross-validation: 69.2%
Gradient Tree Boosting mean recall score across 10-fold cross-validation: 76.7%
Random Forest Bagging mean recall score across 10-fold cross-validation: 74.2%


74.2% Random Forest Bagging

76.7% Gradient Tree Boosting

69.2% Decision Tree Classifier

Okay, so our scores went down. This isn't that surprising: the 35 or so false negatives from each model were bound to drag the scores down. What is interesting is that Gradient Tree is now outperforming Random Forest. As far as I can tell (which isn't a very scientific measurement), there doesn't seem to be a significant difference in the number of false negatives. It's tempting to credit Gradient Tree's focus on correcting errors in each iteration, but the scoring setting only changes how the trained models are judged, not how they are trained, so neither model is actually steering away from false negatives. With only 120 positive patients in the data, a gap of a point or two in recall comes down to a handful of patients and could easily flip on another run.
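Since the metric only affects scoring, both numbers can be collected from the same cross-validation pass with `cross_validate`. This is a sketch under stated assumptions: synthetic stand-in data, fixed seeds, and 50 trees per ensemble to keep it quick, so its output will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data standing in for heart_disease.csv (270 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=270, n_features=13, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

results = {}
for name, model in models.items():
    # The same fitted models are scored twice per fold: once for accuracy, once for recall
    cv = cross_validate(model, X_demo, y_demo, cv=10, scoring=["accuracy", "recall"])
    results[name] = (cv["test_accuracy"].mean(), cv["test_recall"].mean())

for name, (acc, rec) in results.items():
    print(f"{name:<18} accuracy {acc * 100:.1f}%   recall {rec * 100:.1f}%")
```

Scoring both metrics on identical folds removes one source of noise from the comparison, since each model's accuracy and recall then come from exactly the same fitted trees.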

