📊 Data Exploration & Visualization
- Dataset: heart.csv loaded using pandas.

- EDA Visualizations: Created with plotly.express to understand distributions and relationships:

  - Age distribution.

  - Chest pain type vs. heart disease.

  - Max heart rate vs. heart disease.

  - Fasting blood sugar vs. heart disease.



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly as py
import plotly.express as px
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv("/content/heart.csv")
df.head(5)

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


In [4]:
df.describe()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


In [5]:
df.describe(include='O')

Unnamed: 0,Sex,ChestPainType,RestingECG,ExerciseAngina,ST_Slope
count,918,918,918,918,918
unique,2,4,3,2,3
top,M,ASY,Normal,N,Flat
freq,725,496,552,547,460


In [6]:
df.duplicated().sum()

np.int64(0)

In [7]:
fig = px.histogram(df,
                   x='Age',
                   nbins=20,
                   title='Age Distribution')

fig.update_traces(marker=dict(color='lightcoral', line=dict(color='black', width=1)))
fig.update_layout(xaxis_title='Age', yaxis_title='Frequency')
fig.show()

- Purpose: Visualize the distribution of ages in the dataset.

- Insight: Helps understand the age spread of patients. Peaks in certain age groups may indicate higher occurrences of heart-related issues within specific demographics.

In [8]:
fig = px.histogram(df, x='ChestPainType', color='HeartDisease', title='Chest Pain Type vs Heart Disease', barmode='group',color_discrete_sequence=['skyblue', 'lightcoral'])
fig.update_layout(xaxis_title='Chest Pain Type', yaxis_title='Count')
fig.show()

- Purpose: Examine the relationship between types of chest pain and heart disease presence.

- Insight: Certain chest pain types (e.g., ASY, TA) may be more frequently associated with heart disease, providing valuable information for feature importance.

In [9]:
fig = px.box(df, x='HeartDisease', y='MaxHR', title='Max Heart Rate vs Heart Disease')
fig.update_layout(xaxis_title='Heart Disease', yaxis_title='Max Heart Rate')
fig.show()

- Purpose: Compare the distribution of maximum heart rate between those with and without heart disease.

- Insight: Box plots highlight differences in medians and ranges, showing whether higher/lower heart rates are more common in heart disease cases.

In [10]:
fig = px.histogram(df,
                   x='FastingBS',
                   color='HeartDisease',
                   title='Fasting Blood Sugar vs Heart Disease',
                   barmode='group',
                   color_discrete_sequence=['skyblue', 'lightcoral'])
fig.update_layout(xaxis_title='Fasting Blood Sugar', yaxis_title='Count')
fig.show()

- Purpose: Analyze the correlation between fasting blood sugar levels and heart disease.

- Insight: Shows how elevated blood sugar (typically a binary 0/1 feature) may correlate with heart disease presence.

In [11]:
color_palette = ['skyblue', 'red']
fig = px.scatter_matrix(
    df,
    dimensions=['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak'],
    color='HeartDisease',
    title='Pairplot of Numerical Features',
    color_continuous_scale=color_palette
)
fig.show()

- Purpose: Visualize pairwise relationships between key numerical features, grouped by heart disease status.

- Insight: Identifies trends, clusters, and possible correlations between features (e.g., high cholesterol vs. age, Oldpeak vs. MaxHR) that may differentiate heart disease presence.

🔧 Data Preprocessing
- Categorical Encoding: One-hot encoding applied to features like Sex, ChestPainType, RestingECG, etc.

- Feature Scaling: Standardized all features using StandardScaler.

- Target & Feature Split: Defined X (features) and y (target), followed by train-test split.

In [12]:
categorical_columns = ['Sex', 'ChestPainType', 'RestingECG','ExerciseAngina','ST_Slope']
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

In [13]:
df.corr()['HeartDisease'].sort_values(ascending=False)

Unnamed: 0,HeartDisease
HeartDisease,1.0
ST_Slope_Flat,0.554134
ExerciseAngina_Y,0.494282
Oldpeak,0.403951
Sex_M,0.305445
Age,0.282039
FastingBS,0.267291
RestingBP,0.107589
RestingECG_ST,0.102527
ChestPainType_TA,-0.05479


🏗️ Model Development & Evaluation
1. Logistic Regression
  - Applied L2 regularization and tuned the regularization strength (C=10).

  - Evaluated using accuracy, recall, and classification report on both training and testing data.

2. Random Forest Classifier
  - Used cross-validation (KFold) with accuracy scoring.

  - Parameters: n_estimators=100, criterion="entropy", max_depth=4.

  - Reported mean accuracy and standard deviation from cross-validation.

  - Model evaluated on both train and test splits.

3. Support Vector Machine (SVM)
  - Trained with a linear kernel and regularization parameter C=0.1.

  - Evaluated using train-test accuracy.

In [14]:
X=df.drop('HeartDisease',axis=1)
y=df['HeartDisease']

In [15]:
scaler_mas = StandardScaler()
for col in X.columns:
    scaler_mas.fit(X[[col]])
    X[col] = scaler_mas.transform (X[[col]])

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42, shuffle=True)

Logistic Regression

In [17]:
lr=LogisticRegression(solver='liblinear')
lr.fit(X_train,y_train)

In [18]:
y_predtest= lr.predict(X_test)
y_predtrain=lr.predict(X_train)

In [19]:
print("\nAccuracy Score:")
print(f"Train Accuracy: {accuracy_score(y_train, y_predtrain)*100:.2f}")
print(f"Test Accuracy: {accuracy_score(y_test, y_predtest)*100:.2f}")


Accuracy Score:
Train Accuracy: 86.14
Test Accuracy: 88.04


In [20]:
print("\nRecall Score:")
print(f"Train Recall: {recall_score(y_train, y_predtrain)}")
print(f"Test Recall: {recall_score(y_test, y_predtest)}")


Recall Score:
Train Recall: 0.8866279069767442
Test Recall: 0.8780487804878049


In [21]:
print("\nClassification Report (Test):")
print(classification_report(y_test, y_predtest))


Classification Report (Test):
              precision    recall  f1-score   support

           0       0.83      0.88      0.86       112
           1       0.92      0.88      0.90       164

    accuracy                           0.88       276
   macro avg       0.87      0.88      0.88       276
weighted avg       0.88      0.88      0.88       276



In [22]:
print("Confusion Matrix (Test):")
print(confusion_matrix(y_test, y_predtest))

Confusion Matrix (Test):
[[ 99  13]
 [ 20 144]]


KFold

In [23]:
k_fold = KFold(n_splits= 5, shuffle=True, random_state=42)

In [24]:
model = RandomForestClassifier(n_estimators=100 , criterion = "entropy" , max_depth= 4 ,random_state= 0)
scores = cross_val_score(model, X, y, cv=k_fold, scoring='accuracy')
print("Cross-validation scores:", scores)
print("Mean accuracy:", scores.mean()*100)
print("Standard deviation:", scores.std())
model.fit(X_train, y_train)

Cross-validation scores: [0.85869565 0.88586957 0.88043478 0.83060109 0.84699454]
Mean accuracy: 86.05191256830601
Standard deviation: 0.020594205281784923


In [25]:
y_predTest=model.predict(X_test)
y_predTrain=model.predict(X_train)

In [26]:
print("\nAccuracy Score:")
print(f"Train Accuracy: {accuracy_score(y_train, y_predTrain)*100:.2f}")
print(f"Test Accuracy: {accuracy_score(y_test, y_predTest)*100:.2f}")


Accuracy Score:
Train Accuracy: 87.23
Test Accuracy: 85.51


In [27]:
print("Confusion Matrix (Test):")
print(confusion_matrix(y_test, y_predTest))

Confusion Matrix (Test):
[[ 92  20]
 [ 20 144]]


Support Vector Machine

In [28]:
from sklearn.svm import SVC
svc = SVC(gamma='scale')
svc.fit(X_train, y_train)

In [29]:
y_predtest= lr.predict(X_test)
y_predtrain=lr.predict(X_train)

In [30]:
print("\nAccuracy Score:")
print(f"Train Accuracy: {accuracy_score(y_train, y_predtrain)*100:.2f}")
print(f"Test Accuracy: {accuracy_score(y_test, y_predtest)*100:.2f}")


Accuracy Score:
Train Accuracy: 86.14
Test Accuracy: 88.04


In [31]:
print("Confusion Matrix (Test):")
print(confusion_matrix(y_test, y_predtest))

Confusion Matrix (Test):
[[ 99  13]
 [ 20 144]]


📈 Model Comparison
- All models were compared using:

- Accuracy

- Recall

- Classification Report

- Cross-validation Scores (for SVM)