In [5]:
import pandas as pd

## Loading the dataset

In [6]:
df = pd.read_csv("cleaned_df (1).csv")

In [7]:
df.head()

Unnamed: 0.1,Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,0,63,1,3,145,233,1,0,150,0,2.3,0,0.0,1.0,1
1,1,37,1,2,130,250,0,1,187,0,3.5,0,0.0,2.0,1
2,2,41,0,1,130,204,0,0,172,0,1.4,2,0.0,2.0,1
3,3,56,1,1,120,236,0,1,178,0,0.8,2,0.0,2.0,1
4,4,57,0,0,120,354,0,1,163,1,0.6,2,0.0,2.0,1


**Here, we can observe a column named Unnamed: 0, which is a saved index column. We don't need it, so we can either drop it after loading or set it as the index while loading the dataset.**
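**Both options can be sketched on a tiny in-memory CSV. The three-row csv_text below is a made-up stand-in for the real file, and the variable names are illustrative only.**

```python
import io

import pandas as pd

# Made-up stand-in for the real file: the first, unnamed column is a saved index.
csv_text = ",age,sex,target\n0,63,1,1\n1,37,1,1\n2,41,0,1\n"

# Option 1: load everything, then drop the stray column
# (pandas names a blank header "Unnamed: 0").
df_dropped = pd.read_csv(io.StringIO(csv_text)).drop(columns=["Unnamed: 0"])

# Option 2: use the first column as the index while loading.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_dropped.columns.tolist())
print(df_indexed.columns.tolist())
```

**Either way, the remaining columns are the same; the second option avoids carrying the extra column at all.**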

In [9]:
df = pd.read_csv("cleaned_df (1).csv", index_col=0)  # use the first column (position 0) as the index

In [10]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0.0,1.0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0.0,2.0,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0.0,2.0,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0.0,2.0,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0.0,2.0,1


## Attribute information 

In [14]:
# age: The person’s age in years
# sex: The person’s sex (1 = male, 0 = female)
# cp: chest pain type
# — Value 0: asymptomatic
# — Value 1: atypical angina
# — Value 2: non-anginal pain
# — Value 3: typical angina
# trestbps: The person’s resting blood pressure (mm Hg on admission to the hospital)
# chol: The person’s cholesterol measurement in mg/dl
# fbs: The person’s fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)
# restecg: resting electrocardiographic results
# — Value 0: showing probable or definite left ventricular hypertrophy by Estes’ criteria
# — Value 1: normal
# — Value 2: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
# thalach: The person’s maximum heart rate achieved
# exang: Exercise induced angina (1 = yes; 0 = no)
# oldpeak: ST depression induced by exercise relative to rest (‘ST’ relates to positions on the ECG plot)
# slope: the slope of the peak exercise ST segment — 0: downsloping; 1: flat; 2: upsloping
# ca: The number of major vessels (0–3)
# thal: A blood disorder called thalassemia
# Value 0: NULL (dropped from the dataset previously)
# Value 1: fixed defect (no blood flow in some part of the heart)
# Value 2: normal blood flow
# Value 3: reversible defect (a blood flow is observed but it is not normal)
# target: Heart disease (1 = no, 0= yes)

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    float64
 12  thal      303 non-null    float64
 13  target    303 non-null    int64  
dtypes: float64(3), int64(11)
memory usage: 35.5 KB


**Our dataset has 303 rows and 14 columns, with no null values. Three columns have dtype float64 and eleven have dtype int64.**

In [18]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,303.0,54.366337,9.082101,29.0,47.5,55.0,61.0,77.0
sex,303.0,0.683168,0.466011,0.0,0.0,1.0,1.0,1.0
cp,303.0,0.966997,1.032052,0.0,0.0,1.0,2.0,3.0
trestbps,303.0,131.623762,17.538143,94.0,120.0,130.0,140.0,200.0
chol,303.0,245.194719,48.488324,126.0,211.0,240.0,274.0,417.0
fbs,303.0,0.148515,0.356198,0.0,0.0,0.0,0.0,1.0
restecg,303.0,0.528053,0.52586,0.0,0.0,1.0,1.0,2.0
thalach,303.0,149.646865,22.905161,71.0,133.5,153.0,166.0,202.0
exang,303.0,0.326733,0.469794,0.0,0.0,0.0,1.0,1.0
oldpeak,303.0,1.039604,1.161075,0.0,0.0,0.8,1.6,6.2


## Train, test split

In [21]:
y = df["target"]
X = df.drop("target", axis=1)

In [23]:
from sklearn.model_selection import train_test_split

In [25]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Scaling the data
**The ranges of the columns differ significantly: 'age' spans 29-77, while 'thal' spans 1-3. When feature ranges differ this much, models that are sensitive to feature scale can be affected, so we will use StandardScaler to scale the data.**

**Decision Tree and Random Forest do not require data to be scaled. Splitting is done based on the feature values, so the scale of the data does not affect their performance.**

**Since we are going to implement Logistic Regression as well, we need the data to be scaled**
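**The claim that tree models are insensitive to scaling can be checked with a minimal sketch. It uses synthetic data from make_classification rather than the heart dataset, and all names in it are illustrative. One feature is deliberately blown up by a factor of 1000 to exaggerate the range difference.**

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration, not the heart dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_demo[:, 0] *= 1000  # exaggerate the scale of one feature

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Fit the scaler on the training split only.
sc = StandardScaler().fit(Xtr)

# Same tree settings, once on raw features and once on standardized features.
tree_raw = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(sc.transform(Xtr), ytr)

# Fraction of test rows on which the two trees make the same prediction.
agreement = (tree_raw.predict(Xte) == tree_scaled.predict(sc.transform(Xte))).mean()
print(agreement)
```

**Standardization only shifts and rescales each column, so the ordering of values within a column is preserved and the tree should find equivalent splits. The two trees are therefore expected to agree on essentially every test row.**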

In [30]:
from sklearn.preprocessing import StandardScaler

In [32]:
scaler = StandardScaler()

In [36]:
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

**Here, we have used fit_transform for X_train and just transform for X_test. Let us understand this**

**In the first line of code, X_train = scaler.fit_transform(X_train), two operations are performed: fit and transform. The fit step computes the mean and standard deviation of each column of the training data. The transform step then uses those values to rescale each column so that its mean is 0 and its standard deviation is 1. (For roughly normally distributed data, standardized values mostly fall within [-3, 3], since about 99.7% of values lie within 3 standard deviations of the mean.) The transformation is therefore based on the statistics learned while fitting the training data.**

**For X_test, calling fit_transform again would recompute the mean and standard deviation from the test data, giving a scale different from the one learned on X_train. Because we want both sets on the same scale, and do not want the test set to influence preprocessing, we call only .transform() on X_test.**
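**A minimal sketch of the fit/transform split on toy numbers (the arrays and names here are made up for illustration). The scaler learns its mean and standard deviation from the training values only, and the test value is rescaled with those same statistics.**

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data, made up for illustration.
X_train_demo = np.array([[10.0], [20.0], [30.0]])
X_test_demo = np.array([[40.0]])

scaler_demo = StandardScaler()

# fit learns mean_ and scale_ from the training data; transform applies them.
train_scaled = scaler_demo.fit_transform(X_train_demo)

# Only transform here: the test value is scaled with the training statistics.
test_scaled = scaler_demo.transform(X_test_demo)

print(scaler_demo.mean_, scaler_demo.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

**The training mean is 20, so the test value 40 is mapped to (40 - 20) divided by the training standard deviation. It lands outside the range of the scaled training values, which is expected when a test point lies outside the training range.**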

## Model Implementation

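**First, we will implement a baseline model using DummyClassifier. It ignores the input features; with the default strategy in recent scikit-learn versions ("prior"), it always predicts the most frequent class in the training data. It serves as a reference point that any useful model should beat.**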

**Following this, we will implement LogisticRegression, DecisionTreeClassifier, and RandomForestClassifier and compare the 3 models to determine the best one.**

### Dummy Classifier

In [66]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [70]:
dummy = DummyClassifier()
dummy.fit(X_train,y_train)
accuracy_dummy = dummy.score(X_test, y_test)
print("Dummy accuracy", accuracy_dummy)

Dummy accuracy 0.5245901639344263


In [68]:
dummy_manual = DummyClassifier()
dummy_manual.fit(X_train,y_train)
y_pred = dummy_manual.predict(X_test)
accuracy_dummy = accuracy_score(y_test, y_pred)
print("Dummy accuracy", accuracy_dummy)

Dummy accuracy 0.5245901639344263


**The two cells differ in one respect: the first does not call .predict, while the second does. Let us understand this.**

**In the first cell we use the .score() method of the dummy classifier. This method predicts the labels for X_test and compares them to y_test internally, so there is no need to predict the labels for X_test separately.**

**In the second cell we use accuracy_score to compute the accuracy, which requires both the predicted labels and the actual labels. Hence the need to compute y_pred separately.**

**Our baseline gives an accuracy of 52.46%. This is a reference point rather than a guaranteed lower bound: a model can score below it, but any model worth keeping should score clearly above it.**
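**The baseline's behaviour depends on its strategy parameter, and the default has differed between scikit-learn versions. The sketch below compares a constant majority-class baseline with a random one on made-up labels (the arrays and names are illustrative). Note that 0.5246 equals 32/61, the share of class 1 in our test set, which is what a constant prediction of class 1 would score.**

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels; the features are ignored by DummyClassifier, so zeros suffice.
X_tr = np.zeros((100, 1))
y_tr = np.array([1] * 60 + [0] * 40)
X_te = np.zeros((20, 1))
y_te = np.array([1] * 13 + [0] * 7)

# Always predicts the most frequent training class (here, class 1).
majority = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)

# Guesses at random, following the training class proportions.
random_guess = DummyClassifier(strategy="stratified", random_state=0).fit(X_tr, y_tr)

acc_majority = majority.score(X_te, y_te)                           # via .score()
acc_majority_manual = accuracy_score(y_te, majority.predict(X_te))  # via accuracy_score
acc_random = random_guess.score(X_te, y_te)

print(acc_majority, acc_majority_manual, acc_random)
```

**The majority baseline scores 13/20 = 0.65 here, exactly the share of class 1 in the toy test labels, and .score() and accuracy_score give the same number.**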

### Logistic Regression

In [79]:
from sklearn.linear_model import LogisticRegression

In [83]:
log_model = LogisticRegression()
log_model.fit(X_train,y_train)
y_pred_log = log_model.predict(X_test)

cm_log = confusion_matrix(y_test, y_pred_log)
clf_log = classification_report(y_test, y_pred_log)

print(f"Confusion matrix: \n {cm_log}")
print(clf_log)

Confusion matrix: 
 [[26  3]
 [ 4 28]]
              precision    recall  f1-score   support

           0       0.87      0.90      0.88        29
           1       0.90      0.88      0.89        32

    accuracy                           0.89        61
   macro avg       0.88      0.89      0.89        61
weighted avg       0.89      0.89      0.89        61



### Decision Tree

In [86]:
from sklearn.tree import DecisionTreeClassifier

In [90]:
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train,y_train)
y_pred_tree = tree_model.predict(X_test)

cm_tree = confusion_matrix(y_test, y_pred_tree)
clf_tree = classification_report(y_test,y_pred_tree)

print(f"Confusion matrix: \n {cm_tree}")
print(clf_tree)

Confusion matrix: 
 [[23  6]
 [ 7 25]]
              precision    recall  f1-score   support

           0       0.77      0.79      0.78        29
           1       0.81      0.78      0.79        32

    accuracy                           0.79        61
   macro avg       0.79      0.79      0.79        61
weighted avg       0.79      0.79      0.79        61



### Random Forest

In [93]:
from sklearn.ensemble import RandomForestClassifier

In [95]:
rfc_model = RandomForestClassifier()
rfc_model.fit(X_train,y_train)
y_pred_rfc = rfc_model.predict(X_test)

cm_rfc = confusion_matrix(y_test, y_pred_rfc)
clf_report = classification_report(y_test,y_pred_rfc)

print(f"confusion matrix: \n {cm_rfc}")
print(clf_report)

confusion matrix: 
 [[24  5]
 [ 5 27]]
              precision    recall  f1-score   support

           0       0.83      0.83      0.83        29
           1       0.84      0.84      0.84        32

    accuracy                           0.84        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.84      0.84      0.84        61



**When evaluating a model, it is important to consider accuracy, precision, recall, and f1-score together. A model with high accuracy and precision but poor recall misses many true cases of that class, which can make it unsuitable in practice. It is also important to check how balanced the scores are between the classes. In our models, the recall and f1-score are similar for both classes, '0' and '1'. If a model performs well for one class, say '0', and poorly for the other, say '1', then the model is unreliable.**
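**To make these metrics concrete, the sketch below recomputes them by hand from the Logistic Regression confusion matrix printed earlier. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.**

```python
import numpy as np

# Logistic Regression confusion matrix from the output above.
# Rows = true class (0, 1), columns = predicted class (0, 1).
cm = np.array([[26, 3],
               [4, 28]])

correct = np.diag(cm)                # correctly predicted count per class

precision = correct / cm.sum(axis=0)  # of all predicted as class k, how many are k
recall = correct / cm.sum(axis=1)     # of all true class k, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
print("accuracy: ", accuracy)
```

**Rounded to two decimals, these values agree with the classification report: precision 0.87 and 0.90, recall 0.90 and 0.88, f1-score 0.88 and 0.89, and accuracy 0.89.**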

**If we observe the classification report of our models, all 3 models have a good balance between class '0' and class '1'. There is no huge difference in the precision, recall, and f1-score of the two classes**

**We also observe that LogisticRegression performs best among the 3 models on all metrics: accuracy, precision, recall, and f1-score.**

## Random Forest Classifier (Hyper Parameter Tuning)

In [111]:
# class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, 
#                                                 criterion='gini', 
#                                                 max_depth=None, 
#                                                 min_samples_split=2, 
#                                                 min_samples_leaf=1, 
#                                                 min_weight_fraction_leaf=0.0, 
#                                                 max_features='sqrt', 
#                                                 max_leaf_nodes=None, 
#                                                 min_impurity_decrease=0.0, 
#                                                 bootstrap=True, 
#                                                 oob_score=False, 
#                                                 n_jobs=None, 
#                                                 random_state=None, 
#                                                 verbose=0, 
#                                                 warm_start=False, 
#                                                 class_weight=None, 
#                                                 ccp_alpha=0.0, 
#                                                 max_samples=None, 
#                                                 # monotonic_cst=None)


In [114]:
from sklearn.model_selection import GridSearchCV

In [118]:
param_grid_rfc = {"n_estimators": [100,200,300,400],
                 "criterion": ['gini','entropy'],
                 "max_depth":[None, 10, 20, 30],
                 "min_samples_split": [2,4],
                 "oob_score":[False, True]}

In [120]:
grid_search_rfc = GridSearchCV(estimator=rfc_model, param_grid=param_grid_rfc, cv=5, scoring="accuracy", n_jobs=-1)
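**Before running the search, it helps to know how much work it involves. The sketch below counts the parameter combinations in the grid above using ParameterGrid. The grid is copied from the earlier cell, the 5 comes from cv=5, and the variable names are illustrative.**

```python
from sklearn.model_selection import ParameterGrid

# Same grid as param_grid_rfc above, repeated so the snippet is self-contained.
param_grid_demo = {"n_estimators": [100, 200, 300, 400],
                   "criterion": ["gini", "entropy"],
                   "max_depth": [None, 10, 20, 30],
                   "min_samples_split": [2, 4],
                   "oob_score": [False, True]}

n_candidates = len(ParameterGrid(param_grid_demo))  # 4 * 2 * 4 * 2 * 2
n_fits = n_candidates * 5                           # one fit per candidate per fold

print(n_candidates, n_fits)
```

**Each candidate is fitted once per fold, so this search trains 128 x 5 = 640 forests, plus one final refit of the best candidate on the full training set.**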

In [122]:
grid_search_rfc.fit(X_train,y_train)
rfc_best_param = grid_search_rfc.best_params_
rfc_best_score = grid_search_rfc.best_score_

In [124]:
rfc_best_param

{'criterion': 'entropy',
 'max_depth': None,
 'min_samples_split': 2,
 'n_estimators': 100,
 'oob_score': True}

In [126]:
rfc_best_score

0.8388605442176871

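**The best cross-validated accuracy found by the grid search (about 0.839) is essentially the same as the 0.84 test accuracy of the untuned Random Forest, so tuning did not yield a clear improvement.**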
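**Note that best_score_ is the mean cross-validated accuracy on the training folds, so it is not measured on the same data as a test-set accuracy. A minimal sketch of reading both numbers from one search follows. It uses synthetic data and a deliberately small grid, all chosen for illustration. It relies on GridSearchCV refitting the best candidate on the full training set by default (refit=True) and exposing it as best_estimator_.**

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data (an assumption for illustration, not the heart dataset).
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# Small grid so the sketch runs quickly.
search = GridSearchCV(estimator=RandomForestClassifier(random_state=0),
                      param_grid={"n_estimators": [10, 30], "max_depth": [None, 3]},
                      cv=3,
                      scoring="accuracy")
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_                          # mean accuracy over CV folds
test_accuracy = search.best_estimator_.score(X_te, y_te)  # accuracy on held-out test data

print(search.best_params_)
print(cv_accuracy, test_accuracy)
```

**Using best_estimator_ avoids rebuilding the model by hand from best_params_, although both routes fit a forest with the same hyperparameters.**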

In [129]:
rfc_model_tune = RandomForestClassifier(**rfc_best_param)  # ** unpacks the dict of best parameters into keyword arguments
rfc_model_tune.fit(X_train, y_train)
y_pred_rfc_model_tune = rfc_model_tune.predict(X_test)

cm_rfc_model_tune = confusion_matrix(y_test, y_pred_rfc_model_tune)
clf_rfc_model_tune = classification_report(y_test, y_pred_rfc_model_tune)

print(f"confusion matrix: \n {cm_rfc_model_tune}")
print(f"Classification report: \n {clf_rfc_model_tune}")

confusion matrix: 
 [[24  5]
 [ 5 27]]
Classification report: 
               precision    recall  f1-score   support

           0       0.83      0.83      0.83        29
           1       0.84      0.84      0.84        32

    accuracy                           0.84        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.84      0.84      0.84        61



**As observed earlier, the accuracy, precision, recall, and f1-score on the test set remain the same after tuning the parameters.**

**Logistic Regression remains the best model for this dataset**