### Introduction

About this dataset

1. age: age of the patient
2. sex: sex of the patient
3. exng: exercise-induced angina (1 = yes; 0 = no)
4. caa: number of major vessels (0-3)
5. cp: chest pain type (note: this file codes the column 0-3 rather than 1-4)
   * Value 1: typical angina
   * Value 2: atypical angina
   * Value 3: non-anginal pain
   * Value 4: asymptomatic
6. trtbps: resting blood pressure (in mm Hg)
7. chol: cholesterol in mg/dl, fetched via BMI sensor
8. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
9. restecg: resting electrocardiographic results
   * Value 0: normal
   * Value 1: ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
   * Value 2: probable or definite left ventricular hypertrophy by Estes' criteria
10. thalachh: maximum heart rate achieved
11. output: 0 = lower chance of heart attack; 1 = higher chance of heart attack
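A quick sanity check can confirm that the categorical columns only hold the codes listed above. The sketch below is illustrative only: `sample` is a five-row stand-in whose values are copied from the first rows of `heart.csv` shown in the next section, and the `allowed` mapping is an assumption (in particular, `cp` is taken to be coded 0-3, as it appears in this file).

```python
import pandas as pd

# Five-row stand-in for the real data (values copied from df.head())
sample = pd.DataFrame({
    "cp": [3, 2, 1, 1, 0],
    "restecg": [0, 1, 0, 1, 1],
    "fbs": [1, 0, 0, 0, 0],
})

# Assumed valid codes per column; adjust to match the actual data dictionary
allowed = {"cp": {0, 1, 2, 3}, "restecg": {0, 1, 2}, "fbs": {0, 1}}

# Collect any codes that fall outside the allowed set for each column
unexpected = {}
for col, codes in allowed.items():
    extra = set(sample[col]) - codes
    if extra:
        unexpected[col] = sorted(extra)

print(unexpected)  # an empty dict means every code is accounted for
```

Running the same check on the full `df` would flag any column whose codes disagree with the description.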

### Importing libraries

In [44]:
# Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

### Importing data

In [45]:
df = pd.read_csv("/content/drive/MyDrive/DS Course Uploads/Datasets/heart.csv")
df.head()

,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


### EDA

In [46]:
df.shape

(303, 14)

In [47]:
df.isnull().sum()

age,0
sex,0
cp,0
trtbps,0
chol,0
fbs,0
restecg,0
thalachh,0
exng,0
oldpeak,0


In [48]:
df.duplicated().sum()

1

In [49]:
df.drop_duplicates(inplace=True)

In [50]:
df.shape

(302, 14)
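Before dropping a duplicate it can help to look at the rows involved. The sketch below uses a hypothetical four-row frame (`toy`) rather than the heart data; `keep=False` and `ignore_index=True` are standard pandas options, but whether renumbering the index is wanted here is an assumption.

```python
import pandas as pd

# Hypothetical frame with one repeated row
toy = pd.DataFrame({"age": [63, 37, 37, 41], "chol": [233, 250, 250, 204]})

# keep=False marks every member of a duplicate group, not just the later copies
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# drop_duplicates returns a new frame; ignore_index=True renumbers rows 0..n-1
deduped = toy.drop_duplicates(ignore_index=True)
print(deduped.shape)
```

Without `ignore_index=True` the original labels are kept, which is why `df.info()` below reports an index running from 0 to 302 for 302 rows.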

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 302 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       302 non-null    int64  
 1   sex       302 non-null    int64  
 2   cp        302 non-null    int64  
 3   trtbps    302 non-null    int64  
 4   chol      302 non-null    int64  
 5   fbs       302 non-null    int64  
 6   restecg   302 non-null    int64  
 7   thalachh  302 non-null    int64  
 8   exng      302 non-null    int64  
 9   oldpeak   302 non-null    float64
 10  slp       302 non-null    int64  
 11  caa       302 non-null    int64  
 12  thall     302 non-null    int64  
 13  output    302 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 35.4 KB


In [52]:
df.exng.value_counts()

exng,count
0,203
1,99


In [53]:
cat_vars = ['restecg','fbs', 'cp', 'sex', 'exng']

In [54]:
# one hot encoding for cat_vars columns

df_new = pd.get_dummies(df, columns=cat_vars, dtype=int)

In [55]:
df_new

,age,trtbps,chol,thalachh,oldpeak,slp,caa,thall,output,restecg_0,...,fbs_0,fbs_1,cp_0,cp_1,cp_2,cp_3,sex_0,sex_1,exng_0,exng_1
0,63,145,233,150,2.3,0,0,1,1,1,...,0,1,0,0,0,1,0,1,1,0
1,37,130,250,187,3.5,0,0,2,1,0,...,1,0,0,0,1,0,0,1,1,0
2,41,130,204,172,1.4,2,0,2,1,1,...,1,0,0,1,0,0,1,0,1,0
3,56,120,236,178,0.8,2,0,2,1,0,...,1,0,0,1,0,0,0,1,1,0
4,57,120,354,163,0.6,2,0,2,1,0,...,1,0,1,0,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,140,241,123,0.2,1,0,3,0,0,...,1,0,1,0,0,0,1,0,0,1
299,45,110,264,132,1.2,1,0,3,0,0,...,1,0,0,0,0,1,0,1,1,0
300,68,144,193,141,3.4,1,2,3,0,0,...,0,1,1,0,0,0,0,1,1,0
301,57,130,131,115,1.2,1,1,3,0,0,...,1,0,1,0,0,0,0,1,0,1
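The encoded frame keeps one indicator per category, so binary columns such as `sex` become two perfectly redundant columns. A decision tree tolerates this, but the sketch below shows the `drop_first=True` alternative on a hypothetical three-row frame (`toy`), in case a more compact encoding is preferred.

```python
import pandas as pd

# Hypothetical three-row frame with two categorical columns
toy = pd.DataFrame({"sex": [1, 1, 0], "cp": [3, 2, 1], "age": [63, 37, 41]})

# Full one-hot encoding: one indicator per observed category
full = pd.get_dummies(toy, columns=["sex", "cp"], dtype=int)

# drop_first=True drops the first level of each encoded column
reduced = pd.get_dummies(toy, columns=["sex", "cp"], dtype=int, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```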


### Modeling

In [56]:
# Separating input and output

X = df_new.drop('output', axis=1)
y = df_new['output']

In [57]:
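# Standardise input columns
# Note: tree-based models do not depend on feature scale, and fitting the scaler
# before the split lets test-set statistics influence the transform.

scaler = StandardScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)  # keep row labels aligned with y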
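One way to keep the scaler from seeing test rows is to put it in a pipeline that is fitted on the training split only. This is a minimal sketch under stated assumptions: the data is a synthetic stand-in generated with `make_classification` (302 rows, 10 features), and the names `X_demo`, `y_demo` and `pipe` are illustrative, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (assumption: 302 rows, binary target)
X_demo, y_demo = make_classification(n_samples=302, n_features=10, random_state=42)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on X_tr only, then reuses those statistics on X_te
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=42))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(round(test_accuracy, 2))
```

The same pipeline object can be passed to a cross-validated search, in which case the scaler is refitted inside every fold.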

In [58]:
# Separating train test data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
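With only about 60 test rows, the class balance of the two splits can drift apart by chance. Passing `stratify` keeps the proportions close. The sketch below uses hypothetical class counts (164 positives, 138 negatives) purely for illustration; they are not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 164 positives and 138 negatives
y_demo = np.array([1] * 164 + [0] * 138)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of positives in each split
print(y_tr.mean().round(3), y_te.mean().round(3))
```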

In [59]:
# Create model

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [60]:
# Evaluate the model

print("Accuracy:", round(metrics.accuracy_score(y_test, y_pred), 2))
print("Precision:", round(metrics.precision_score(y_test, y_pred), 2))
print("Recall:", round(metrics.recall_score(y_test, y_pred), 2))
print("F1 Score:", round(metrics.f1_score(y_test, y_pred), 2))

Accuracy: 0.79
Precision: 0.85
Recall: 0.72
F1 Score: 0.78
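The four summary scores hide where the errors fall. A confusion matrix and a per-class report make that visible. The labels below are hypothetical (eight made-up predictions), not the notebook's `y_test` and `y_pred`.

```python
from sklearn import metrics

# Hypothetical true labels and predictions
y_true_demo = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Precision, recall and F1 for each class, plus averages
print(metrics.classification_report(y_true_demo, y_pred_demo, digits=2))
```

In this toy case the matrix is `[[3, 1], [1, 3]]`: one false positive and one false negative.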


In [61]:
# Evaluate model on training data

y_pred_train = model.predict(X_train)

print("Accuracy:", round(metrics.accuracy_score(y_train, y_pred_train), 2))
print("Precision:", round(metrics.precision_score(y_train, y_pred_train), 2))
print("Recall:", round(metrics.recall_score(y_train, y_pred_train), 2))
print("F1 Score:", round(metrics.f1_score(y_train, y_pred_train), 2))

Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
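Perfect scores on the training data next to 0.79 accuracy on the test data suggest that the unrestricted tree is overfitting. The sketch below illustrates the effect on synthetic data (an assumption; `make_classification` with some label noise stands in for the heart data) by comparing train and test accuracy at several depths.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 10% label noise
X_demo, y_demo = make_classification(
    n_samples=302, n_features=10, flip_y=0.1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# max_depth=None lets the tree grow until every training row is classified correctly
scores = {}
for depth in [2, 4, 8, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, round(scores[depth][0], 2), round(scores[depth][1], 2))
```

A large gap between the two numbers at `max_depth=None` is the usual sign that the tree should be constrained, which is what the hyperparameter search in the next section attempts.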


### Implementing GridSearchCV

In [62]:
# Implementing GridSearchCV

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_leaf_nodes': list(range(2, 10)),
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'min_samples_split': [2, 3, 4, 5],
    'min_impurity_decrease': [0.0001, 0.001, 0.01, 0.1],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'max_features': [None, 'sqrt', 'log2']
}

grid_search = GridSearchCV(model, param_grid, cv=5, scoring='accuracy', verbose=1)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_.round(2))

Fitting 5 folds for each of 7680 candidates, totalling 38400 fits
Best parameters: {'criterion': 'gini', 'max_features': None, 'max_leaf_nodes': 9, 'min_impurity_decrease': 0.0001, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'random'}
Best score: 0.81
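The candidate count follows from the grid sizes: 8 × 2 × 2 × 4 × 4 × 5 × 3 = 7680 combinations, and 7680 × 5 folds = 38400 fits. Once the search finishes, the refitted winner is available as `best_estimator_`, and `cv_results_` shows how close the runners-up were. The sketch below uses a deliberately small grid on synthetic data; `small_grid`, `X_demo` and `y_demo` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# 3 x 2 = 6 candidates, so 30 fits with cv=5
small_grid = {"max_leaf_nodes": [3, 6, 9], "criterion": ["gini", "entropy"]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), small_grid, cv=5, scoring="accuracy"
)
search.fit(X_demo, y_demo)

# With the default refit=True, the best model is already refitted on all of X_demo
best_tree = search.best_estimator_

# Tabulate the cross-validation results, best candidates first
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[cols]
print(results.sort_values("rank_test_score").head(3))
```

Using `best_estimator_` avoids retyping the winning parameters by hand, and passing `n_jobs=-1` to `GridSearchCV` can shorten a large search by running fits in parallel.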


### Optimising model with GridSearchCV

In [63]:
# Optimise model

dtc = DecisionTreeClassifier(criterion='gini', max_leaf_nodes=9, min_impurity_decrease=0.0001, min_samples_leaf=1, min_samples_split=2, splitter='random', random_state=42)
dtc.fit(X_train, y_train)
y_pred_dtc = dtc.predict(X_test)

In [64]:
# Evaluate optimised dtc

print("Accuracy:", round(metrics.accuracy_score(y_test, y_pred_dtc), 2))
print("Precision:", round(metrics.precision_score(y_test, y_pred_dtc), 2))
print("Recall:", round(metrics.recall_score(y_test, y_pred_dtc), 2))
print("F1 Score:", round(metrics.f1_score(y_test, y_pred_dtc), 2))

Accuracy: 0.79
Precision: 0.85
Recall: 0.72
F1 Score: 0.78


In [65]:
# Evaluate training data

y_pred_dtc_train = dtc.predict(X_train)

print("Accuracy:", round(metrics.accuracy_score(y_train, y_pred_dtc_train), 2))
print("Precision:", round(metrics.precision_score(y_train, y_pred_dtc_train), 2))
print("Recall:", round(metrics.recall_score(y_train, y_pred_dtc_train), 2))
print("F1 Score:", round(metrics.f1_score(y_train, y_pred_dtc_train), 2))

Accuracy: 0.84
Precision: 0.89
Recall: 0.81
F1 Score: 0.85


### Implementing RandomizedSearchCV

In [66]:
# Implementing RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV

param_grid = {
    'max_leaf_nodes': list(range(2, 10)),
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'min_samples_split': [2, 3, 4, 5],
    'min_impurity_decrease': [0.0001, 0.001, 0.01, 0.1],
    'min_samples_leaf': [1, 2, 3, 4, 5],
    'max_features': [None, 'sqrt', 'log2']
}

random_search = RandomizedSearchCV(model, param_grid, cv=5, scoring='accuracy', verbose=1)
random_search.fit(X_train, y_train)

print("Best parameters:", random_search.best_params_)
print("Best score:", random_search.best_score_.round(2))

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best parameters: {'splitter': 'best', 'min_samples_split': 3, 'min_samples_leaf': 3, 'min_impurity_decrease': 0.01, 'max_leaf_nodes': 9, 'max_features': 'sqrt', 'criterion': 'entropy'}
Best score: 0.77
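Only 10 of the 7680 combinations were sampled here, and because no `random_state` was passed to the search, a rerun may pick different candidates. The sketch below shows one way to make the search repeatable and a little broader. It runs on synthetic data, and the distributions, `n_iter=20` and the seed are assumptions chosen for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Lists are sampled uniformly; scipy distributions are sampled via their rvs method
param_distributions = {
    "max_leaf_nodes": randint(2, 10),   # integers 2..9 (upper bound is exclusive)
    "min_samples_leaf": randint(1, 6),  # integers 1..5
    "criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions,
    n_iter=20,        # number of sampled candidates (the default is 10)
    cv=5,
    scoring="accuracy",
    random_state=42,  # fixes which candidates are drawn
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 2))
```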


### Optimising model with RandomizedSearchCV

In [67]:
# Optimise model

dtc_r = DecisionTreeClassifier(criterion='entropy', max_leaf_nodes=9, min_impurity_decrease=0.01, min_samples_leaf=3, min_samples_split=3, splitter='best', max_features='sqrt', random_state=42)
dtc_r.fit(X_train, y_train)
y_pred_dtc_r = dtc_r.predict(X_test)

In [68]:
# Evaluate optimised dtc_r

print("Accuracy:", round(metrics.accuracy_score(y_test, y_pred_dtc_r), 2))
print("Precision:", round(metrics.precision_score(y_test, y_pred_dtc_r), 2))
print("Recall:", round(metrics.recall_score(y_test, y_pred_dtc_r), 2))
print("F1 Score:", round(metrics.f1_score(y_test, y_pred_dtc_r), 2))

Accuracy: 0.82
Precision: 0.86
Recall: 0.78
F1 Score: 0.82


In [69]:
# Evaluate training data on dtc_r

y_pred_dtc_train_r = dtc_r.predict(X_train)

print("Accuracy:", round(metrics.accuracy_score(y_train, y_pred_dtc_train_r), 2))
print("Precision:", round(metrics.precision_score(y_train, y_pred_dtc_train_r), 2))
print("Recall:", round(metrics.recall_score(y_train, y_pred_dtc_train_r), 2))
print("F1 Score:", round(metrics.f1_score(y_train, y_pred_dtc_train_r), 2))

Accuracy: 0.84
Precision: 0.81
Recall: 0.93
F1 Score: 0.87
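The same four metric calls are repeated for every model and split. A small helper makes the comparison easier to read. The function name `report` and the toy labels below are hypothetical; in the notebook it would be called with `y_test` and each model's predictions.

```python
from sklearn import metrics


def report(y_true, y_pred):
    """Return the four headline metrics, rounded to two decimals."""
    return {
        "accuracy": round(float(metrics.accuracy_score(y_true, y_pred)), 2),
        "precision": round(float(metrics.precision_score(y_true, y_pred)), 2),
        "recall": round(float(metrics.recall_score(y_true, y_pred)), 2),
        "f1": round(float(metrics.f1_score(y_true, y_pred)), 2),
    }


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives
y_true_demo = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

summary = report(y_true_demo, y_pred_demo)
print(summary)
```

Collecting one such dictionary per model (baseline, grid search, randomized search) into a `pandas.DataFrame` would put the train and test scores side by side.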
