## Problem Description

Congenital heart ailments are one of the leading causes of death in the USA. What if we could predict the likelihood that a person has a heart ailment from minute heart valve measurements? Today, we will use a decision tree algorithm to estimate that likelihood.

A decision tree is a supervised learning algorithm that can be used for both classification and regression problems, that is, for categorical and continuous input and output variables. It is popular with researchers because it is fairly robust to noisy, missing, irrelevant and redundant data. Decision trees are also computationally inexpensive while remaining relatively accurate, and the fitted model is easy to read.

However, decision trees have some real drawbacks: they are inclined to overfit the training data, and they can oversimplify a problem that has more layers of complexity.
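
The overfitting tendency can be illustrated with a minimal sketch. It uses synthetic data from `make_classification` as a stand-in (not the heart dataset), and the names `x_demo`, `deep_tree` and `shallow_tree` are assumptions made for this illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the heart dataset); flip_y adds label noise
x_demo, y_demo = make_classification(n_samples=1000, n_features=20, n_informative=5,
                                     flip_y=0.1, random_state=100)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=100)

# An unrestricted tree keeps splitting until the training rows are separated
deep_tree = DecisionTreeClassifier(random_state=100).fit(x_tr, y_tr)
# Limiting the depth gives a simpler tree that cannot memorize the noise
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=100).fit(x_tr, y_tr)

deep_train_acc = deep_tree.score(x_tr, y_tr)
deep_test_acc = deep_tree.score(x_te, y_te)
shallow_train_acc = shallow_tree.score(x_tr, y_tr)
shallow_test_acc = shallow_tree.score(x_te, y_te)

print('Unrestricted tree: train {:.2f}, test {:.2f}'.format(deep_train_acc, deep_test_acc))
print('max_depth=3 tree:  train {:.2f}, test {:.2f}'.format(shallow_train_acc, shallow_test_acc))
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting; the depth-limited tree narrows that gap at the cost of training accuracy.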

## Feature Engineering and Exploratory Data Analysis (EDA)

For today's example, we will load a heart ailment dataset and apply some preliminary data cleaning.

In [1]:
#Load Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#Load dataset
df = pd.read_csv('./MurmurInfoRaw20201.csv')
df.Class = [x if x != 2 else 1 for x in df.Class]
df.describe()

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,...,x32,x33,x34,x35,x36,x37,x38,x39,x40,Class
count,5001.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,...,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,-0.018634,0.3281,0.65308,1.00654,1.35706,2.01562,2.67282,2.68864,2.68382,3.00448,...,0.0303,0.01562,0.03336,0.01152,0.01904,-0.0256,-0.0113,-0.00444,0.0254,0.6616
std,1.007407,1.044018,1.183895,1.42673,1.673702,1.842113,2.036861,1.76131,1.672908,1.527788,...,0.996712,1.02023,1.008364,1.009334,0.997772,0.997697,1.001626,1.015645,1.007789,0.473213
min,-3.5,-3.2,-3.1,-3.3,-3.5,-3.2,-3.0,-2.7,-2.3,-1.7,...,-3.4,-3.4,-3.7,-3.5,-3.5,-3.4,-3.3,-3.5,-3.8,0.0
25%,-0.7,-0.4,-0.2,0.0,0.1,0.6,1.1,1.375,1.5,1.9,...,-0.6,-0.7,-0.7,-0.7,-0.7,-0.7,-0.7,-0.7,-0.7,0.0
50%,0.0,0.3,0.6,0.9,1.2,1.8,2.5,2.7,2.8,3.0,...,0.0,0.0,0.1,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,0.7,1.0,1.4,2.0,2.5,3.4,4.2,4.0,3.9,4.1,...,0.7,0.7,0.7,0.7,0.7,0.7,0.7,0.7,0.7,1.0
max,3.4,4.3,4.8,5.7,6.7,7.8,8.5,7.6,7.3,7.9,...,3.2,3.4,4.1,3.6,3.3,3.4,3.7,3.3,3.5,1.0


`DESCRIPTION:` Columns x1 to x40 are individual heart valve measurements, and each row is an individual patient. The `Class` column indicates whether a patient has a heart ailment (1) or no heart ailment (0).

In [2]:
#drop NaNs
df_nonans = df.dropna().reset_index(drop = True)

#Remove Outliers via Tukey method

# import required libraries
from collections import Counter

# Outlier detection 
def detect_outliers(df,n,features):
    
    outlier_indices = []
    
    # iterate over features(columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col],75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        
        # outlier step
        outlier_step = 1.5 * IQR
        
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step )].index
        
        # append the found outlier indices for col to the list of outlier indices 
        outlier_indices.extend(outlier_list_col)
        
    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)        
    multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )
    
    return multiple_outliers   

# List of Outliers
Outliers_to_drop = detect_outliers(df_nonans.drop('Class', axis=1),0,list(df_nonans.drop('Class', axis=1)))
df_nonans.drop('Class', axis=1).loc[Outliers_to_drop]

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,...,x31,x32,x33,x34,x35,x36,x37,x38,x39,x40
296,3.0,0.3,-1.2,-0.9,0.7,2.6,-0.3,1.1,1.7,3.8,...,0.2,-1.3,0.4,0.8,-1.5,-1.0,1.5,-0.3,0.7,-0.5
1171,-3.0,0.4,1.3,1.2,3.2,2.0,0.3,1.6,1.0,2.1,...,-0.8,0.6,1.0,-0.8,0.2,-0.4,-0.6,0.1,1.3,0.6
1228,3.2,-0.4,0.8,-0.8,0.2,1.5,3.2,2.4,3.4,5.1,...,0.3,0.4,1.4,0.6,-0.4,0.9,0.9,0.6,-1.4,0.3
1411,-3.2,-0.5,1.0,-0.1,1.3,2.5,3.1,2.0,2.0,-0.5,...,-0.2,0.2,0.1,-2.0,1.1,1.4,-1.1,0.2,0.5,-0.8
1453,-3.0,0.7,0.9,-0.6,1.2,0.2,0.9,3.5,4.1,4.2,...,-1.1,-0.5,-0.6,-0.6,-0.3,0.3,0.6,-1.2,-1.9,-0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3107,2.4,0.3,0.7,2.8,4.1,5.3,7.2,5.7,3.9,2.3,...,-0.3,-0.5,0.9,-0.9,-2.5,0.9,-0.9,-2.7,0.0,-2.9
3308,0.5,-1.1,-0.1,0.8,-0.2,1.7,0.1,1.5,1.3,2.1,...,0.5,-0.1,-0.4,2.5,-2.0,0.8,0.5,-0.6,-0.5,-2.9
3537,0.1,-0.2,-1.4,-0.9,-0.7,-0.3,1.1,0.6,2.0,6.0,...,-0.2,-0.1,0.6,1.2,-1.8,1.1,0.2,-0.9,-1.7,-3.0
4601,0.8,-0.1,1.6,2.1,2.7,3.3,6.5,3.8,3.4,3.3,...,0.2,0.4,-1.9,1.4,1.9,1.0,1.7,0.2,0.1,-3.5
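
As a worked example of the Tukey fences used in `detect_outliers`, here is a minimal sketch on a small made-up array (the values are hypothetical, not taken from the dataset). Note that the notebook calls the function with `n = 0`, so a row is dropped as soon as it is flagged in at least one column.

```python
import numpy as np

# Hypothetical values: nine ordinary readings and one extreme one
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

q1 = np.percentile(values, 25)   # 3.25
q3 = np.percentile(values, 75)   # 7.75
iqr = q3 - q1                    # 4.5

# Tukey fences: anything beyond 1.5 * IQR from the quartiles is flagged
lower_fence = q1 - 1.5 * iqr     # -3.5
upper_fence = q3 + 1.5 * iqr     # 14.5

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers)  # [100.]
```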


In [3]:
#Create New Dataset without Outliers
df_clean = df_nonans.drop(df_nonans.index[Outliers_to_drop]).reset_index(drop = True)
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4258 entries, 0 to 4257
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x1      4258 non-null   float64
 1   x2      4258 non-null   float64
 2   x3      4258 non-null   float64
 3   x4      4258 non-null   float64
 4   x5      4258 non-null   float64
 5   x6      4258 non-null   float64
 6   x7      4258 non-null   float64
 7   x8      4258 non-null   float64
 8   x9      4258 non-null   float64
 9   x10     4258 non-null   float64
 10  x11     4258 non-null   float64
 11  x12     4258 non-null   float64
 12  x13     4258 non-null   float64
 13  x14     4258 non-null   float64
 14  x15     4258 non-null   float64
 15  x16     4258 non-null   float64
 16  x17     4258 non-null   float64
 17  x18     4258 non-null   float64
 18  x19     4258 non-null   float64
 19  x20     4258 non-null   float64
 20  x21     4258 non-null   float64
 21  x22     4258 non-null   float64
 22  

In [4]:
#Create x and y variables
x=df_clean.drop('Class',axis=1).values
y=df_clean['Class'].values

#Train and Test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.20,stratify=y,random_state=100)

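#Scale x variables
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train2 = sc.fit_transform(x_train)
#Reuse the scaler fitted on the training data for the test data
x_test2 = sc.transform(x_test)

#Scale the full dataset with its own scaler so that sc stays fitted on the training data
x_2 = StandardScaler().fit_transform(x)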

#Import Models
from sklearn.svm import SVC
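
The scaling step is easy to get wrong, so here is a minimal sketch of the usual pattern: learn the mean and standard deviation from the training rows only, then apply those same statistics to the test rows. The data is synthetic and the names `x_demo`, `scaler`, `x_tr_scaled` and `x_te_scaled` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(100)
x_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic stand-in features
y_demo = rng.integers(0, 2, size=200)                   # synthetic stand-in labels
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=100)

scaler = StandardScaler()
x_tr_scaled = scaler.fit_transform(x_tr)  # learn mean and std from the training rows only
x_te_scaled = scaler.transform(x_te)      # apply those same statistics to the test rows

print(x_tr_scaled.mean(axis=0).round(3))  # close to zero by construction
print(x_te_scaled.mean(axis=0).round(3))  # near zero, but not exactly
```

Calling `fit_transform` on the test rows instead would let the test data influence the scaling, which can make the evaluation look slightly better than it should.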

In [5]:
#Run SMOTE over the dataset
from imblearn.over_sampling import SMOTE
smt=SMOTE(random_state=100)
#fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases
x_train_smt, y_train_smt = smt.fit_resample(x_train,y_train)
seed = 100

#Run Class Balance
print('SMOTE - Class Split')
num_zeroes_smt = (y_train_smt == 0).sum()
num_ones_smt = (y_train_smt == 1).sum()
print('Class zeroes -',num_zeroes_smt)
print('Class ones -', num_ones_smt)

SMOTE - Class Split
Class zeroes - 2256
Class ones - 2256
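
To give an intuition for what SMOTE does, here is a simplified sketch of its core idea: each synthetic row is placed at a random point on the line between a minority-class row and one of its nearest minority-class neighbours. This is an illustration only, not imbalanced-learn's implementation; the function `smote_like_oversample` and the data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_oversample(x_minority, n_new, k=5, seed=100):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each row is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(x_minority)
    _, neighbour_idx = nn.kneighbors(x_minority)
    # Pick a random base row and one of its k neighbours (columns 1..k)
    base = rng.integers(0, len(x_minority), size=n_new)
    picked = neighbour_idx[base, rng.integers(1, k + 1, size=n_new)]
    # Move a random fraction of the way from the base row towards the neighbour
    gap = rng.random((n_new, 1))
    return x_minority[base] + gap * (x_minority[picked] - x_minority[base])

rng = np.random.default_rng(100)
x_minority = rng.normal(size=(20, 3))   # synthetic stand-in for the minority class
synthetic = smote_like_oversample(x_minority, n_new=30)
print(synthetic.shape)  # (30, 3)
```

Because every synthetic row is a blend of two real rows, the new rows stay inside the region already occupied by the minority class.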


In [6]:
#Saving and renaming the Dataset
data_clean_DT = pd.concat([pd.DataFrame(x_train_smt), pd.DataFrame(y_train_smt)], axis=1)
data_clean_DT.columns = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6','x7', 'x8', 'x9', 'x10','x11', 'x12', 'x13', 'x14', 'x15', 'x16', 'x17', 'x18','x19','x20','x21', 'x22', 'x23', 'x24', 'x25', 'x26','x27', 'x28', 'x29', 'x30','x31', 'x32', 'x33', 'x34', 'x35', 'x36', 'x37', 'x38','x39','x40','Class']

`DESCRIPTION:` This completes our EDA. The following steps were performed to reduce the influence of outliers and class imbalance on the dataset:

(1) Converted the dataset to Class 0 (no heart ailment) and Class 1 (heart ailment)

(2) Dropped rows with missing values and removed the outliers

(3) Checked the class distribution and balanced the training split with SMOTE

(4) Saved the new dataset as data_clean_DT

In [7]:
#Create x and y variables
x = data_clean_DT.drop('Class', axis=1).values
y = data_clean_DT['Class'].values

# Train and Test Splitting
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.10,random_state=100)

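#Scale the Data
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train2 = sc.fit_transform(x_train)
#Transform the test data with the scaler fitted on the training data
x_test2 = sc.transform(x_test)

#Scale the full dataset with its own scaler so that sc stays fitted on the training data
x_2 = StandardScaler().fit_transform(x)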

#Import Models
from sklearn.tree import DecisionTreeClassifier

In [8]:
#Base Model DT Output
from sklearn.metrics import classification_report, confusion_matrix  

for name,method in [('DT', DecisionTreeClassifier())]: 
    method.fit(x_train2,y_train)
    predict = method.predict(x_test2)
    print('\nEstimator: {}'.format(name)) 
    print(confusion_matrix(y_test,predict))  
    print(classification_report(y_test,predict)) 


Estimator: DT
[[185  27]
 [ 41 199]]
              precision    recall  f1-score   support

         0.0       0.82      0.87      0.84       212
         1.0       0.88      0.83      0.85       240

    accuracy                           0.85       452
   macro avg       0.85      0.85      0.85       452
weighted avg       0.85      0.85      0.85       452



## ML Classifier and Metrics

`DESCRIPTION:` Right from the start we can see how well the decision tree algorithm performs: the base model reaches an accuracy of 0.85 on the test split without any parameter tuning. The report also shows the precision, recall and f1-score for each class.

What are precision, recall, and f1-score? Precision is the share of patients the model diagnosed as positive who actually have a heart murmur. Recall, also known as sensitivity, is the share of patients who actually have a heart murmur that the model correctly diagnosed as positive. The f1-score is the harmonic mean of precision and recall, so it summarizes both in a single number.
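
These definitions can be checked by hand against the confusion matrix printed above. The sketch below recomputes the Class 1 metrics from its four cells; scikit-learn lays the matrix out with true classes as rows and predicted classes as columns. The variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix from the base model output (rows: true class, columns: predicted class)
cm = np.array([[185, 27],
               [41, 199]])
tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)                          # 199 / 226
recall = tp / (tp + fn)                             # 199 / 240
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / cm.sum()                     # 384 / 452

print('precision {:.2f}, recall {:.2f}, f1 {:.2f}, accuracy {:.2f}'.format(
    precision, recall, f1, accuracy))
# precision 0.88, recall 0.83, f1 0.85, accuracy 0.85
```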

In [9]:
# Construct some pipelines 
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

#Create Pipeline

pipeline =[]

pipe_dt = Pipeline([('scl', StandardScaler()),
                    ('clf', DecisionTreeClassifier(random_state=100))])
pipeline.append(pipe_dt)

# Set grid search params 

modelpara =[]


max_depth = range(1,100)
param_griddt = {'clf__criterion':['gini','entropy'],
                'clf__max_depth':max_depth}
modelpara.insert(0,param_griddt)
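
The parameter grid defined above is not used in the cells that follow. As a hedged sketch of how such a grid could be searched, the example below passes a similar pipeline and grid to `GridSearchCV`. It runs on synthetic data with a smaller `max_depth` range to keep it fast; `demo_pipe`, `demo_grid` and `search` are names assumed for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the heart dataset)
x_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=100)

demo_pipe = Pipeline([('scl', StandardScaler()),
                      ('clf', DecisionTreeClassifier(random_state=100))])

# The 'clf__' prefix routes each parameter to the pipeline step named 'clf'
demo_grid = {'clf__criterion': ['gini', 'entropy'],
             'clf__max_depth': list(range(1, 6))}  # reduced range, for speed only

search = GridSearchCV(demo_pipe, demo_grid, scoring='accuracy', cv=5)
search.fit(x_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 2))
```

Because the scaler sits inside the pipeline, it is refitted on the training part of every fold, so no information from the validation part leaks into the scaling.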

In [10]:
#Model Analysis
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score

models=[]
models.append(('Decision Tree',pipe_dt))

#Model Evaluation
results =[]
names=[]
scoring ='accuracy'
print('Model Evaluation - Accuracy Score')
for name, model in models:
    rkf=RepeatedKFold(n_splits=10, n_repeats=5, random_state=100)
    cv_results = cross_val_score(model,x,y,cv=rkf,scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print('{} {:.2f} +/- {:.2f}'.format(name,cv_results.mean(),cv_results.std()))
print('\n')

Model Evaluation - Accuracy Score
Decision Tree 0.86 +/- 0.01
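
The mean and standard deviation above are computed over many scores, not ten. A small sketch on a toy array (hypothetical data, 20 rows) shows how `RepeatedKFold` with 10 splits and 5 repeats produces 50 train/test partitions.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

x_demo = np.arange(40).reshape(20, 2)  # toy data: 20 rows, 2 columns

rkf = RepeatedKFold(n_splits=10, n_repeats=5, random_state=100)
# Each repeat reshuffles the rows and splits them into 10 folds
fold_sizes = [len(test_idx) for _, test_idx in rkf.split(x_demo)]

print(len(fold_sizes))  # 50
print(set(fold_sizes))  # {2}
```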




In [11]:
#Feature Importance
import eli5
from eli5.sklearn import PermutationImportance
from IPython.display import display

for name, model in models:
    print(name)
    perm=PermutationImportance(model.fit(x_train2,y_train),random_state=100).fit(x_test2,y_test)
    features=data_clean_DT.drop('Class', axis=1).columns.tolist()
    print('\nPermutation Importance')
    print('\n')
    #Use a new name here so the dataset variable df is not overwritten
    weights=eli5.show_weights(perm,feature_names=features)
    display(weights)



Decision Tree

Permutation Importance




Weight,Feature
0.1217  ± 0.0261,x11
0.0593  ± 0.0233,x10
0.0571  ± 0.0071,x12
0.0447  ± 0.0076,x6
0.0358  ± 0.0141,x9
0.0230  ± 0.0172,x15
0.0204  ± 0.0225,x17
0.0199  ± 0.0115,x13
0.0199  ± 0.0158,x7
0.0173  ± 0.0090,x5


## ML Classifiers and Datasets (Training and Test)

`DESCRIPTION:` After running the model, we can see that the following heart valve measurements carry the largest weights in a patient's diagnosis: x11, x10, x12, x6 and x9.
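
Permutation importance measures how much the test score drops when one feature's values are shuffled. As an alternative sketch that does not depend on eli5, the same idea is available in scikit-learn as `sklearn.inspection.permutation_importance`. The example uses synthetic data, so its feature names and numbers are illustrative only and will not match the table above.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 6 features, only 2 of them informative
x_demo, y_demo = make_classification(n_samples=500, n_features=6, n_informative=2,
                                     n_redundant=0, random_state=100)
feature_names = ['x{}'.format(i + 1) for i in range(x_demo.shape[1])]
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=100)

tree = DecisionTreeClassifier(random_state=100).fit(x_tr, y_tr)

# Shuffle each feature 10 times on the test split and record the drop in accuracy
result = permutation_importance(tree, x_te, y_te, n_repeats=10, random_state=100)

ranking = sorted(zip(feature_names, result.importances_mean, result.importances_std),
                 key=lambda row: row[1], reverse=True)
for name, mean, std in ranking:
    print('{}: {:.4f} +/- {:.4f}'.format(name, mean, std))
```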

Note that this is based on the base model, with no special modifications and no optimization of parameters (max_depth, etc.). The original data was split into 80% training and 20% test before SMOTE; the balanced training data was then split again, with 10% (452 rows) held out for the classification report above. There were no other revisions. The cross-validation uses RepeatedKFold with 10 splits repeated 5 times, and the random state is 100.