<h1>3. Predicting Heart Disease Using Logistic Regression</h1>
<h3><b>Preprocessing Steps:</b></h3>
<ul>
    <li>Handle missing values (e.g., fill missing values with mean).</li>
    <li>Encode categorical variables (e.g., one-hot encoding for gender, chest pain type, etc.).</li>
    <li>Standardize numerical features.</li>
</ul>
<h3><b>Task:</b> Implement logistic regression to predict heart disease and evaluate the model using accuracy and ROC-AUC.</h3>
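
-> As a rough illustration of the three preprocessing steps listed above, the sketch below chains mean imputation, standardization, and one-hot encoding with scikit-learn. The toy values and the choice of which columns count as numeric or categorical are assumptions made for this example only; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "age": [52.0, 61.0, np.nan, 47.0],
    "chol": [212.0, np.nan, 294.0, 254.0],
    "cp": [0, 2, 1, 0],
})

numeric_cols = ["age", "chol"]   # assumed continuous features
categorical_cols = ["cp"]        # assumed nominal feature

preprocess = ColumnTransformer([
    # Step 1 and 3: fill missing numeric values with the column mean, then standardize
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    # Step 2: one-hot encode the categorical column
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

toy_prepared = preprocess.fit_transform(toy)
print(toy_prepared.shape)
```

Wrapping the steps in a single transformer keeps the fitted statistics (means, standard deviations, category lists) in one object that can later be applied unchanged to new data.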

In [34]:
# Importing Libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

In [35]:
# Loading the dataset
heart_dataset = pd.read_csv('..\\..\\Datasets\\HeartDisease.csv')
print(heart_dataset.shape, '\n')
heart_dataset.head()

(1025, 14) 



,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


In [36]:
# Printing basic statistics of the dataset
heart_dataset.describe()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0
mean,54.434146,0.69561,0.942439,131.611707,246.0,0.149268,0.529756,149.114146,0.336585,1.071512,1.385366,0.754146,2.323902,0.513171
std,9.07229,0.460373,1.029641,17.516718,51.59251,0.356527,0.527878,23.005724,0.472772,1.175053,0.617755,1.030798,0.62066,0.50007
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,48.0,0.0,0.0,120.0,211.0,0.0,0.0,132.0,0.0,0.0,1.0,0.0,2.0,0.0
50%,56.0,1.0,1.0,130.0,240.0,0.0,1.0,152.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,275.0,0.0,1.0,166.0,1.0,1.8,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [37]:
# Printing the information of the dataset
heart_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


<h2>Data Preprocessing</h2>

<h3>1. Handling Missing Values</h3>

In [38]:
# Checking for the missing values in the dataset
heart_dataset.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

-> Since there are no missing values in the dataset, no imputation is needed and we can proceed to the next preprocessing step, i.e., <b>encoding categorical variables</b>.

<h3>2. Encoding Categorical Variables</h3>

In [39]:
# Selecting the string-typed (object) columns, i.e. categorical variables stored as text
categorical_features = heart_dataset.select_dtypes('object').columns
categorical_features

Index([], dtype='object')

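-> No column has the object dtype, so there are no text categories to encode. The categorical features (sex, cp, fbs, restecg, exang, slope, ca, thal) are already stored as integer codes, and this notebook keeps those codes as they are. We can therefore proceed to the next preprocessing step, i.e., <b>standardizing numerical features</b>.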
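
-> Because the categories are stored as integers, a dtype check cannot find them. If one-hot encoding were wanted anyway, the columns would have to be named explicitly. The sketch below shows this on a few made-up rows; treating `cp` and `thal` as nominal is an assumption for illustration, not something the notebook does.

```python
import pandas as pd

# Toy rows with illustrative values; column names mirror the dataset
toy = pd.DataFrame({
    "age": [52, 53, 70],
    "cp": [0, 2, 3],
    "thal": [3, 2, 3],
})

# The dtype-based check finds nothing because the codes are stored as integers
object_cols = list(toy.select_dtypes("object").columns)
print(object_cols)

# Assumed nominal columns, listed by hand
nominal_cols = ["cp", "thal"]

# drop_first=True removes one level per column to avoid redundant indicator columns
encoded = pd.get_dummies(toy, columns=nominal_cols, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Whether one-hot encoding helps here depends on whether the integer codes carry a meaningful order; that is a modelling choice the source leaves open.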

<h3>3. Standardizing Numerical Features</h3>

In [40]:
# Separating the features and the target variable
X = heart_dataset.drop('target', axis=1)
Y = heart_dataset['target']

# Only the continuous features are standardized; the other features are integer-coded categories and are left unscaled
features_to_be_scaled = X[['age', 'trestbps', 'chol', 'thalach', 'oldpeak']]
features_to_be_scaled

,age,trestbps,chol,thalach,oldpeak
0,52,125,212,168,1.0
1,53,140,203,155,3.1
2,70,145,174,125,2.6
3,61,148,203,161,0.0
4,62,138,294,106,1.9
...,...,...,...,...,...
1020,59,140,221,164,0.0
1021,60,125,258,141,2.8
1022,47,110,275,118,1.0
1023,50,110,254,159,0.0


In [41]:
# Standardizing the selected features to zero mean and unit variance
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features_to_be_scaled)

# Creating a dataframe of the scaled features, keeping the original row index so it aligns with X
features_scaled = pd.DataFrame(features_scaled, columns=features_to_be_scaled.columns, index=features_to_be_scaled.index)
features_scaled

,age,trestbps,chol,thalach,oldpeak
0,-0.268437,-0.377636,-0.659332,0.821321,-0.060888
1,-0.158157,0.479107,-0.833861,0.255968,1.727137
2,1.716595,0.764688,-1.396233,-1.048692,1.301417
3,0.724079,0.936037,-0.833861,0.516900,-0.912329
4,0.834359,0.364875,0.930822,-1.874977,0.705408
...,...,...,...,...,...
1020,0.503520,0.479107,-0.484803,0.647366,-0.912329
1021,0.613800,-0.377636,0.232705,-0.352873,1.471705
1022,-0.819834,-1.234378,0.562371,-1.353113,-0.060888
1023,-0.488996,-1.234378,0.155137,0.429923,-0.912329


In [42]:
# Replacing the original columns in X with their scaled versions
X = X.drop(columns=features_scaled.columns)
X = pd.concat([X, features_scaled], axis=1)
X.describe().round(2)

,sex,cp,fbs,restecg,exang,slope,ca,thal,age,trestbps,chol,thalach,oldpeak
count,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0,1025.0
mean,0.7,0.94,0.15,0.53,0.34,1.39,0.75,2.32,-0.0,-0.0,-0.0,-0.0,-0.0
std,0.46,1.03,0.36,0.53,0.47,0.62,1.03,0.62,1.0,1.0,1.0,1.0,1.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-2.8,-2.15,-2.33,-3.4,-0.91
25%,0.0,0.0,0.0,0.0,0.0,1.0,0.0,2.0,-0.71,-0.66,-0.68,-0.74,-0.91
50%,1.0,1.0,0.0,1.0,0.0,1.0,0.0,2.0,0.17,-0.09,-0.12,0.13,-0.23
75%,1.0,2.0,0.0,1.0,1.0,2.0,1.0,3.0,0.72,0.48,0.56,0.73,0.62
max,1.0,3.0,1.0,2.0,1.0,2.0,4.0,3.0,2.49,3.91,6.17,2.3,4.37


<h2>Model Training</h2>

In [43]:
# Splitting the dataset into train and test data in 80/20 ratio
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

In [44]:
# Fitting a logistic regression model on the training data
lr_model = LogisticRegression()
lr_model.fit(X_train, Y_train)
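
-> In the cells above, the scaler is fitted on all rows before the train/test split, so the test rows contribute to the means and standard deviations used for scaling. One possible alternative, sketched below on synthetic data, is to split first and let a pipeline fit the scaler on the training rows only. The column names `f0` to `f5`, the choice of which columns to scale, and `max_iter=1000` are assumptions for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart dataset (hypothetical feature names)
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=42)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(6)])

# Split first, so the test rows stay unseen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scale only the assumed continuous columns; pass the others through unchanged
scale_some = ColumnTransformer(
    [("scale", StandardScaler(), ["f0", "f1", "f2"])],
    remainder="passthrough",
)

# fit() learns the scaling statistics from the training rows only
pipe = make_pipeline(scale_some, LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# The same fitted scaler is reused when predicting on the test rows
test_proba = pipe.predict_proba(X_te)[:, 1]
```

With roughly a thousand rows the effect on the reported scores is likely small, but the pipeline form also makes the model easier to reuse with cross-validation.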

In [46]:
# Predicting class labels and the probability of the positive class (target = 1) for the test set
Y_pred = lr_model.predict(X_test)
Y_pred_proba = lr_model.predict_proba(X_test)[:, 1]

<h2>Model Evaluation</h2>

<h3>1. Accuracy Score</h3>

In [47]:
# Calculating the accuracy score of the model
accuracy = accuracy_score(Y_test, Y_pred)
print("Accuracy of the model:", accuracy)

Accuracy of the model: 0.7951219512195122
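
-> An accuracy figure is easier to judge next to a trivial baseline that always predicts the most frequent class. The sketch below shows how such a baseline could be computed with scikit-learn's `DummyClassifier`; the label arrays are made up for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: class 1 is the majority in the training labels
y_train_demo = np.array([1] * 6 + [0] * 4)
y_test_demo = np.array([1, 0, 1, 1, 0])

# The baseline ignores the features, so placeholder columns are enough
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train_demo, y_train_demo)
baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_test_demo))
print("Baseline accuracy:", baseline_acc)
```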


<h3>2. ROC-AUC Score</h3>

In [50]:
# Calculating the roc-auc score of the model
roc_auc = roc_auc_score(Y_test, Y_pred_proba)
print("ROC-AUC score of the model:", roc_auc)

ROC-AUC score of the model: 0.8771178374262326
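
-> ROC-AUC can be read as the fraction of (positive, negative) pairs in which the positive case receives the higher score. The small check below computes that fraction by hand on made-up labels and scores and compares it with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, scores))
```

Here 8 of the 9 pairs are ordered correctly, so both values come out to 8/9.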


-> The model classifies about 79.5% of the test cases correctly (163 of the 205 test rows) at the default 0.5 probability threshold. The ROC-AUC score of 0.877 means that, given one randomly chosen positive case (target = 1) and one randomly chosen negative case, the model assigns the higher predicted probability to the positive case about 87.7% of the time. Unlike accuracy, ROC-AUC summarizes the trade-off between the true positive rate and the false positive rate across all thresholds, so it indicates good separation between the two classes regardless of the cut-off chosen. Both figures come from a single 80/20 split; validation on other datasets and additional metrics such as precision, recall, and the confusion matrix would give a more complete evaluation.
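
-> The additional checks mentioned above could look roughly like the following sketch, which reports a confusion matrix, precision, recall, and a 5-fold cross-validated ROC-AUC. It runs on synthetic data from `make_classification`; the sample size, feature count, and `max_iter=1000` are assumptions for this example, and the numbers it prints say nothing about the heart dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in data (not the heart dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# Confusion matrix counts: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_te, y_hat).ravel()
precision = precision_score(y_te, y_hat)   # tp / (tp + fp)
recall = recall_score(y_te, y_hat)         # tp / (tp + fn)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Precision:", precision, "Recall:", recall)

# Cross-validated ROC-AUC gives a spread of scores instead of a single split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring="roc_auc"
)
print("ROC-AUC per fold:", cv_auc.round(3))
```

In a medical screening setting, recall (the share of actual positive cases that are detected) is often the metric to watch, since a missed case is usually costlier than a false alarm.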

<hr>