INTRODUCTION:

The Cardiovascular Disease Dataset is a structured dataset collected from a multispecialty hospital in India. It contains 1,000 patient records with 14 columns: a patient identifier, 12 demographic and clinical features, and a binary target. The dataset is intended for early-stage heart disease prediction and for developing machine learning models that detect cardiovascular risk.

The columns are the patient ID, age, gender, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiogram results, maximum heart rate, exercise-induced angina, oldpeak, slope, number of major vessels, and a binary target variable indicating the presence or absence of cardiovascular disease. This dataset is suitable for classification tasks and allows exploration of various predictive models such as Logistic Regression, SVM, Random Forest, and XGBoost.

In [None]:
# Import libraries
import numpy as np
from scipy import stats
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix , classification_report
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

In [55]:
Dataset = pd.read_csv("Cardiovascular_Disease_Dataset.csv")
Dataset.head()

,patientid,age,gender,chestpain,restingBP,serumcholestrol,fastingbloodsugar,restingrelectro,maxheartrate,exerciseangia,oldpeak,slope,noofmajorvessels,target
0,103368,53,1,2,171,0,0,1,147,0,5.3,3,3,1
1,119250,40,1,0,94,229,0,1,115,0,3.7,1,1,0
2,119372,49,1,2,133,142,0,0,202,1,5.0,1,0,0
3,132514,43,1,0,138,295,1,1,153,0,3.2,2,2,1
4,146211,31,1,1,199,0,0,2,136,0,5.3,3,2,1


In [56]:
Dataset.shape

(1000, 14)

In [57]:
Dataset.describe()

,patientid,age,gender,chestpain,restingBP,serumcholestrol,fastingbloodsugar,restingrelectro,maxheartrate,exerciseangia,oldpeak,slope,noofmajorvessels,target
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,5048704.0,49.242,0.765,0.98,151.747,311.447,0.296,0.748,145.477,0.498,2.7077,1.54,1.222,0.58
std,2895905.0,17.86473,0.424211,0.953157,29.965228,132.443801,0.456719,0.770123,34.190268,0.500246,1.720753,1.003697,0.977585,0.493805
min,103368.0,20.0,0.0,0.0,94.0,0.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0
25%,2536440.0,34.0,1.0,0.0,129.0,235.75,0.0,0.0,119.75,0.0,1.3,1.0,0.0,0.0
50%,4952508.0,49.0,1.0,1.0,147.0,318.0,0.0,1.0,146.0,0.0,2.4,2.0,1.0,1.0
75%,7681877.0,64.25,1.0,2.0,181.0,404.25,1.0,1.0,175.0,1.0,4.1,2.0,2.0,1.0
max,9990855.0,80.0,1.0,3.0,200.0,602.0,1.0,2.0,202.0,1.0,6.2,3.0,3.0,1.0


In [58]:
Dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   patientid          1000 non-null   int64  
 1   age                1000 non-null   int64  
 2   gender             1000 non-null   int64  
 3   chestpain          1000 non-null   int64  
 4   restingBP          1000 non-null   int64  
 5   serumcholestrol    1000 non-null   int64  
 6   fastingbloodsugar  1000 non-null   int64  
 7   restingrelectro    1000 non-null   int64  
 8   maxheartrate       1000 non-null   int64  
 9   exerciseangia      1000 non-null   int64  
 10  oldpeak            1000 non-null   float64
 11  slope              1000 non-null   int64  
 12  noofmajorvessels   1000 non-null   int64  
 13  target             1000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 109.5 KB


In [59]:
Dataset.isnull().sum()
# No missing values: every column has 0 nulls.

patientid            0
age                  0
gender               0
chestpain            0
restingBP            0
serumcholestrol      0
fastingbloodsugar    0
restingrelectro      0
maxheartrate         0
exerciseangia        0
oldpeak              0
slope                0
noofmajorvessels     0
target               0
dtype: int64
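
A null check does not catch placeholder values. The describe() output shows a minimum of 0 for serumcholestrol, which is not a plausible measurement, so it may be worth counting zeros as well. The sketch below uses a small hypothetical frame (sample) that mirrors the first five rows shown by Dataset.head(); whether zeros really encode missing readings in this dataset is an assumption.

```python
import pandas as pd

# Hypothetical toy frame copying two columns from the first five rows above;
# in the notebook you would run the same check on `Dataset`.
sample = pd.DataFrame({
    "restingBP": [171, 94, 133, 138, 199],
    "serumcholestrol": [0, 229, 142, 295, 0],
})

# isnull() reports nothing here, because 0 is a valid integer, not NaN.
null_counts = sample.isnull().sum()

# Counting exact zeros per column exposes possible placeholder values.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

If the zeros do turn out to be missing readings, they could be replaced with NaN and imputed before modeling.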

In [60]:
x = Dataset.iloc[:,:-1]
x.head()

,patientid,age,gender,chestpain,restingBP,serumcholestrol,fastingbloodsugar,restingrelectro,maxheartrate,exerciseangia,oldpeak,slope,noofmajorvessels
0,103368,53,1,2,171,0,0,1,147,0,5.3,3,3
1,119250,40,1,0,94,229,0,1,115,0,3.7,1,1
2,119372,49,1,2,133,142,0,0,202,1,5.0,1,0
3,132514,43,1,0,138,295,1,1,153,0,3.2,2,2
4,146211,31,1,1,199,0,0,2,136,0,5.3,3,2
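
Note that the feature matrix above still contains patientid. Assuming this column is only a record identifier and carries no clinical meaning, a common alternative is to drop it by name along with the target. The frame below (sample) is a hypothetical three-row stand-in; the results reported in this notebook were produced with patientid included.

```python
import pandas as pd

# Hypothetical stand-in for `Dataset`, with a subset of its columns.
sample = pd.DataFrame({
    "patientid": [103368, 119250, 119372],
    "age": [53, 40, 49],
    "restingBP": [171, 94, 133],
    "target": [1, 0, 0],
})

# Select features by name instead of by position: drop the ID and the label.
features = sample.drop(columns=["patientid", "target"])
labels = sample["target"]
print(list(features.columns))
```

Selecting by name also keeps the split correct if the column order of the CSV ever changes.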


In [61]:
y = Dataset.iloc[:,-1]
y.head()

0    1
1    0
2    0
3    1
4    1
Name: target, dtype: int64

In [62]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)

In [63]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(x_train)
X_test_scaled = scaler.transform(x_test)

Standard scaling is applied here: the scaler is fit on the training set only and then used to transform the test set.
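
To make the scaling step concrete, the sketch below checks by hand what StandardScaler computes: each test value is shifted by the training mean and divided by the training standard deviation. The arrays train_rows and test_rows are hypothetical, built from the restingBP and maxheartrate values of the first five rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical mini split: four training rows and one test row.
train_rows = np.array([[171.0, 147.0], [94.0, 115.0], [133.0, 202.0], [138.0, 153.0]])
test_rows = np.array([[199.0, 136.0]])

# Fit on the training rows only, then reuse those statistics for the test row.
scaler_demo = StandardScaler().fit(train_rows)
test_scaled = scaler_demo.transform(test_rows)

# Manual z-score using the training mean and population std (ddof=0).
manual = (test_rows - train_rows.mean(axis=0)) / train_rows.std(axis=0)
print(test_scaled, manual)
```

The two arrays match, which shows that the test set never contributes to the scaling statistics.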


Now, we will perform hyperparameter tuning on this dataset to find the best algorithm along with its optimal parameters. Once we identify the best-performing model and its parameters, we will apply that algorithm separately for training and evaluation.

In [66]:
# Dictionary of models with hyperparameters
models = {
    "Logistic Regression": {
        "model": LogisticRegression(max_iter=1000),
        "params": {
            "C": [0.01, 0.1, 1, 10],
            "solver": ["liblinear", "lbfgs"]
        }
    },
    "SVM": {
        "model": SVC(),
        "params": {
            "C": [0.1, 1, 10],
            "kernel": ["linear", "rbf", "poly"]
        }
    },
    "Random Forest": {
        "model": RandomForestClassifier(),
        "params": {
            "n_estimators": [50, 100, 200],
            "max_depth": [None, 5, 10],
            "min_samples_split": [2, 5]
        }
    },
    "XGBoost": {
        "model": XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
        "params": {
            "n_estimators": [50, 100, 200],
            "learning_rate": [0.01, 0.1, 0.2],
            "max_depth": [3, 5, 7]
        }
    }
}

# Run GridSearchCV for each model
results = []

for name, config in models.items():
    print(f"\nTuning {name}...")
    grid = GridSearchCV(config["model"], config["params"], cv=5, scoring='accuracy', n_jobs=-1)
    grid.fit(X_train_scaled, y_train)  
    best_model = grid.best_estimator_
    test_acc = accuracy_score(y_test, best_model.predict(X_test_scaled))  
    results.append({
        "Model": name,
        "Best Parameters": grid.best_params_,
        "CV Accuracy": grid.best_score_,
        "Test Accuracy": test_acc
    })

# convert into dataframe.
results_df = pd.DataFrame(results)
print("\nModel Comparison Results:")
print(results_df)



Tuning Logistic Regression...

Tuning SVM...

Tuning Random Forest...

Tuning XGBoost...

Model Comparison Results:
                 Model                                    Best Parameters  \
0  Logistic Regression                    {'C': 1, 'solver': 'liblinear'}   
1                  SVM                     {'C': 0.1, 'kernel': 'linear'}   
2        Random Forest  {'max_depth': None, 'min_samples_split': 5, 'n...   
3              XGBoost  {'learning_rate': 0.1, 'max_depth': 7, 'n_esti...   

   CV Accuracy  Test Accuracy  
0      0.96250          0.965  
1      0.96000          0.965  
2      0.97375          0.980  
3      0.97375          0.975  


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
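
In the search above, the scaler was fit on the full training set before cross-validation, so each validation fold contributed to the scaling statistics. One way to avoid this, sketched here under assumptions, is to put the scaler and the model in a scikit-learn Pipeline so that scaling is refit inside every fold. The data comes from make_classification and is purely synthetic; the step names "scaler" and "clf" are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook would pass the unscaled x_train / y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

test_score = search.score(X_te, y_te)
print(search.best_params_, test_score)
```

With this layout the grid search receives unscaled data, and the fitted pipeline applies the same scaling automatically at prediction time.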


Now, we will check which one is the best estimator.

In [70]:
print(grid.best_estimator_)
print(grid.best_params_)

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, feature_weights=None, gamma=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=7, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=100, n_jobs=None,
              num_parallel_tree=None, ...)
{'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100}


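The printed estimator is the tuned XGBoost classifier. Note that the variable grid holds only the last search run in the loop, so this output by itself does not compare the models. In the results table above, XGBoost ties with Random Forest for the highest CV accuracy (0.97375), while Random Forest scores slightly higher on the test set (0.980 vs. 0.975). We continue with XGBoost.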
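
Because the loop reuses one variable for every search, a more direct way to pick a winner is to keep each fitted search and compare their cross-validation scores. The following is a minimal sketch on synthetic data with two scikit-learn models; the names candidates, searches, and best_name are hypothetical and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

candidates = {
    "Logistic Regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1]}),
    "Decision Tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4]}),
}

# Store every fitted search instead of overwriting a single variable.
searches = {}
for name, (estimator, params) in candidates.items():
    searches[name] = GridSearchCV(estimator, params, cv=3, scoring="accuracy").fit(X_demo, y_demo)

# Choose the model with the highest mean cross-validated accuracy.
best_name = max(searches, key=lambda n: searches[n].best_score_)
print(best_name, searches[best_name].best_params_)
```

Selecting on the CV score rather than the test score keeps the test set as an untouched final check.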

Next, we train XGBoost with the parameters found by the search above.


Feature scaling is optional for XGBoost, so the model is fit on the unscaled features.

In [71]:
# creating the model
xgb_model = XGBClassifier(
    use_label_encoder=False,  
    eval_metric='logloss',    
    n_estimators=100,         
    max_depth=7,              
    learning_rate=0.1,        
    random_state=42
)

In [72]:
xgb_model.fit(x_train,y_train)

Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


XGBClassifier(objective='binary:logistic', enable_categorical=False, ...)


In [73]:
# predictions
train_pred = xgb_model.predict(x_train)
test_pred = xgb_model.predict(x_test)


Evaluating the model

In [74]:
print("XGBoost Results:")
print("Train Accuracy:", accuracy_score(y_train, train_pred))
print("Test Accuracy:", accuracy_score(y_test, test_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, test_pred))
print("\nClassification Report:\n", classification_report(y_test, test_pred))


XGBoost Results:
Train Accuracy: 1.0
Test Accuracy: 0.975

Confusion Matrix:
 [[ 81   2]
 [  3 114]]

Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.98      0.97        83
           1       0.98      0.97      0.98       117

    accuracy                           0.97       200
   macro avg       0.97      0.98      0.97       200
weighted avg       0.98      0.97      0.98       200
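
The figures in the classification report can be reproduced by hand from the confusion matrix. The sketch below reads the four cells from the output above (rows are true classes, columns are predicted classes) and recomputes accuracy and the class 1 metrics; the variable names are illustrative.

```python
# Cells of the confusion matrix printed above:
# [[ 81   2]   -> true class 0: 81 correct, 2 predicted as 1
#  [  3 114]]  -> true class 1: 3 predicted as 0, 114 correct
tn, fp = 81, 2
fn, tp = 3, 114

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 195 / 200
precision_1 = tp / (tp + fp)                 # 114 / 116
recall_1 = tp / (tp + fn)                    # 114 / 117
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.3f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
```

The three misclassified patients with disease (false negatives) are the cases that matter most in a screening setting, so recall for class 1 is worth tracking alongside accuracy.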

