<a href="https://colab.research.google.com/github/Srija-Burugula/PulseGuard/blob/main/PulseGuard.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DATA PREPROCESSING

In [None]:
import numpy as np
import pandas as pd


In [None]:
import seaborn as sns  # statistical plotting library built on top of matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df=pd.read_csv("/content/heart_prediction.csv") # four databases: Cleveland, Hungary, Switzerland, and Long Beach
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0


## Check Skewness of Data

In [None]:
import pandas as pd
from scipy.stats import skew


# Select numerical columns to check skewness
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

# Calculate skewness for each numerical column
skewness = df[numerical_columns].apply(lambda x: skew(x.dropna()))

print("Skewness of numerical columns:")
print(skewness)


Skewness of numerical columns:
age        -0.248502
sex        -0.850202
cp          0.528680
trestbps    0.738685
chol        1.072500
fbs         1.968452
restecg     0.180176
thalach    -0.513025
exang       0.691641
oldpeak     1.209127
slope      -0.478433
ca          1.259342
thal       -0.523622
target     -0.052701
dtype: float64


Analysis (using the common rule-of-thumb cut-offs of 0.5 and 1 on the absolute skewness; for binary or coded columns such as sex, fbs, and exang, skewness mainly reflects how unevenly the categories are populated):
- **Low Skewness (absolute value below 0.5):**
  age, restecg, slope, and target are close to symmetric.
- **Moderate Positive Skewness (0.5 to 1):**
  cp, trestbps, and exang have longer tails on the right side: most values are concentrated at the low end, with fewer large values.
- **Moderate Negative Skewness (-1 to -0.5):**
  sex, thalach, and thal have longer tails on the left side: most values are concentrated at the high end, with fewer small values.
- **High Skewness (above 1):**
  chol, fbs, oldpeak, and ca have skewness values greater than 1, so they are highly right-skewed and are the candidates for transformation.
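
The grouping can be made reproducible by bucketing each feature on the magnitude of its skewness. The following is a minimal sketch, assuming cut-offs of 0.5 and 1.0 (a rule of thumb, not something the notebook defines) and using the values printed above; `skew_bucket` is a hypothetical helper.

```python
# Skewness values copied from the notebook output above.
skewness = {
    "age": -0.248502, "sex": -0.850202, "cp": 0.528680, "trestbps": 0.738685,
    "chol": 1.072500, "fbs": 1.968452, "restecg": 0.180176, "thalach": -0.513025,
    "exang": 0.691641, "oldpeak": 1.209127, "slope": -0.478433, "ca": 1.259342,
    "thal": -0.523622, "target": -0.052701,
}


def skew_bucket(value, moderate=0.5, high=1.0):
    """Label a skewness value; the 0.5 / 1.0 cut-offs are assumed rules of thumb."""
    magnitude = abs(value)
    if magnitude > high:
        return "high"
    if magnitude >= moderate:
        return "moderate"
    return "low"


# Group feature names by bucket, preserving the original column order.
buckets = {}
for name, value in skewness.items():
    buckets.setdefault(skew_bucket(value), []).append(name)

for label in ("low", "moderate", "high"):
    print(f"{label:>8}: {', '.join(buckets[label])}")
```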

## Transformation:

In [None]:
import pandas as pd
import numpy as np
from scipy import stats

Log Transformation: For chol, oldpeak, ca, and fbs.


In [None]:
df['chol'] = np.log1p(df['chol'])
df['oldpeak'] = np.log1p(df['oldpeak'])
df['ca'] = np.log1p(df['ca'])
df['fbs'] = np.log1p(df['fbs'])

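Box-Cox Transformation: applied on top of the log transform to the same four features (chol, oldpeak, ca, and fbs). No transformation is applied to trestbps, cp, or exang.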

In [None]:
df['chol'], _ = stats.boxcox(df['chol'] + 1)
df['oldpeak'], _ = stats.boxcox(df['oldpeak'] + 1)
df['ca'], _ = stats.boxcox(df['ca'] + 1)
df['fbs'], _ = stats.boxcox(df['fbs'] + 1)
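
`scipy.stats.boxcox` requires strictly positive input, which is why the calls above add 1 to each column. When no `lmbda` is passed, it returns both the transformed values and the fitted λ. Below is a small sketch on synthetic right-skewed data; the lognormal sample and the seed are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=0.8, size=2000)  # synthetic, strictly positive

# With lmbda omitted, boxcox picks lambda by maximum likelihood and returns it.
transformed, fitted_lambda = stats.boxcox(sample)

skew_before = stats.skew(sample)
skew_after = stats.skew(transformed)
print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```

Keeping the fitted λ instead of discarding it as `_` makes it possible to apply the same transform to new data later with `stats.boxcox(x, lmbda=fitted_lambda)`.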

Check Skewness after transformation

In [None]:
skewness = df[numerical_columns].apply(lambda x: skew(x.dropna()))

print("Skewness of numerical columns:")
print(skewness)

Skewness of numerical columns:
age        -0.248502
sex        -0.850202
cp          0.528680
trestbps    0.738685
chol        0.002623
fbs         1.968452
restecg     0.180176
thalach    -0.513025
exang       0.691641
oldpeak     0.014556
slope      -0.478433
ca          0.338165
thal       -0.523622
target     -0.052701
dtype: float64
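
The skewness of `fbs` is 1.968452 both before and after the transformations. That is expected for a two-valued column: a strictly increasing transform only relabels the two values, and skewness does not change under shifts or positive rescaling. Here is a minimal sketch with a hypothetical 0/1 column, where the 15% share of ones is an assumption chosen for illustration.

```python
import math


def skewness(values):
    """Population (biased) skewness m3 / m2**1.5, the default of scipy.stats.skew."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


binary = [1] * 15 + [0] * 85              # hypothetical binary flag
logged = [math.log1p(v) for v in binary]  # values become 0.0 and log(2)

print(round(skewness(binary), 6))
print(round(skewness(logged), 6))
```

Both lines print the same value, so transforming a binary feature cannot reduce its skewness.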


## Check for Imbalance

In [None]:
class_distribution = df['target'].value_counts()
print("Class distribution in the target variable:")
print(class_distribution)

Class distribution in the target variable:
target
1    526
0    499
Name: count, dtype: int64


In [None]:
ratio = class_distribution.min() / class_distribution.max()
print(f"Class ratio: {ratio:.2f}")

Class ratio: 0.95


The data is well balanced: 499 samples of class 0 against 526 of class 1 (ratio 0.95), so no resampling is needed.

## Feature Scaling:

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Prepare feature matrix X
X = df.drop(columns=['target'])  # Exclude target column

# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X)

# Optionally, convert back to DataFrame
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)
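
The cell above fits the scaler on every row, and `X_scaled` is not passed to the models later on (tree ensembles do not depend on feature scale). If scaled features are needed for a scale-sensitive model, a common pattern is to fit the scaler on the training split only. This is a hedged sketch on synthetic data; the array shapes, the seed, and the `*_demo` names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y_demo = rng.integers(0, 2, size=200)                     # synthetic binary target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics come from the training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(6))
print(X_tr_scaled.std(axis=0).round(6))
```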

## Handling Missing Values

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
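
Mean imputation only changes anything if the data actually contains missing values, and `X_imputed` is not used in the later cells. A quick check before imputing is sketched below on a toy frame; the column values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "chol": [212.0, np.nan, 174.0],
    "thalach": [168.0, 155.0, np.nan],
})

missing_per_column = toy.isna().sum()  # number of NaNs in each column
print(missing_per_column)

# Impute only when at least one column has missing entries.
if missing_per_column.any():
    imputer_demo = SimpleImputer(strategy="mean")
    toy_imputed = pd.DataFrame(imputer_demo.fit_transform(toy), columns=toy.columns)
    print(toy_imputed)
```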

## Encode Categorical Variables

In [None]:
# Apply One-Hot Encoding to categorical columns
df_encoded = pd.get_dummies(df, columns=['sex', 'cp', 'restecg', 'slope', 'thal', 'exang'])

print("Data after encoding:")
print(df_encoded.head())


Data after encoding:
   age  trestbps      chol       fbs  thalach   oldpeak        ca  target  \
0   52       125  2.661129  0.000000      168  0.514557  0.438585       0   
1   53       140  2.647601  0.079188      155  0.846776  0.000000       0   
2   70       145  2.599080  0.000000      125  0.795337  0.000000       0   
3   61       148  2.647601  0.000000      161  0.000000  0.358643       0   
4   62       138  2.761367  0.079188      106  0.702315  0.475040       0   

   sex_0  sex_1  ...  restecg_2  slope_0  slope_1  slope_2  thal_0  thal_1  \
0  False   True  ...      False    False    False     True   False   False   
1  False   True  ...      False     True    False    False   False   False   
2  False   True  ...      False     True    False    False   False   False   
3  False   True  ...      False    False    False     True   False   False   
4   True  False  ...      False    False     True    False   False   False   

   thal_2  thal_3  exang_0  exang_1  
0   False

# TRAIN TEST SPLIT

In [None]:
from sklearn.model_selection import train_test_split

# Assuming 'target' is the name of your target column
X = df_encoded.drop(columns=['target'])  # Features
y = df_encoded['target']  # Target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the shapes of the resulting datasets to verify
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")
print(f"Training target shape: {y_train.shape}")
print(f"Testing target shape: {y_test.shape}")


Training features shape: (717, 25)
Testing features shape: (308, 25)
Training target shape: (717,)
Testing target shape: (308,)
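
A random split assumes the rows are independent. If a file contains repeated records, identical rows can land in both the training and the test set and inflate test scores. Whether that applies to `heart_prediction.csv` is not established here. The check itself is short, sketched on a toy frame with made-up values.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [52, 53, 52, 70],
    "chol": [212, 203, 212, 174],
    "target": [0, 0, 0, 1],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"duplicate rows: {n_duplicates}")
print(f"rows before: {len(toy)}, after: {len(deduplicated)}")
```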


In [None]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,2.661129,0.0,1,168,0,0.514557,2,0.438585,3,0
1,53,1,0,140,2.647601,0.079188,0,155,1,0.846776,0,0.0,3,0
2,70,1,0,145,2.59908,0.0,1,125,1,0.795337,0,0.0,3,0
3,61,1,0,148,2.647601,0.0,1,161,0,0.0,2,0.358643,3,0
4,62,0,0,138,2.761367,0.079188,1,106,0,0.702315,1,0.47504,2,0


# MODELLING

## Random Forest:

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.98
Random Forest Classification Report:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       159
           1       1.00      0.96      0.98       149

    accuracy                           0.98       308
   macro avg       0.98      0.98      0.98       308
weighted avg       0.98      0.98      0.98       308

Confusion Matrix:
[[159   0]
 [  6 143]]
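
The figures in the report follow directly from the confusion matrix. Here is a short worked check, assuming scikit-learn's layout, in which rows are true classes and columns are predicted classes.

```python
# Confusion matrix from the Random Forest output above: rows = true, columns = predicted.
tn, fp = 159, 0
fn, tp = 6, 143

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)  # of the samples predicted as class 1, how many are class 1
recall_1 = tp / (tp + fn)     # of the true class-1 samples, how many were found
precision_0 = tn / (tn + fn)  # same idea for class 0
recall_0 = tn / (tn + fp)

print(f"accuracy={accuracy:.2f}")
print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
```

All six errors are class-1 samples predicted as class 0, which is why recall for class 1 (0.96) is lower than its precision (1.00).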


## XGBoost

In [None]:
import xgboost as xgb
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

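# Initialize the XGBoost model
# (use_label_encoder is ignored by recent XGBoost releases, hence the "not used" warning in the output)
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42)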

# Train the model
xgb_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = xgb_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

print("XGBoost Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.99
XGBoost Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       159
           1       1.00      0.98      0.99       149

    accuracy                           0.99       308
   macro avg       0.99      0.99      0.99       308
weighted avg       0.99      0.99      0.99       308

Confusion Matrix:
[[159   0]
 [  3 146]]


Parameters: { "use_label_encoder" } are not used.



# HYPERPARAMETER TUNING:

Why Hyperparameter Tuning?

**Optimize Performance:** Even though the default models already perform well, tuning can show whether any accuracy is left on the table.

**Improve Generalization:** Proper tuning can help models generalize better to unseen data.

**1. Random Forest:**

  Key hyperparameters to tune:

  **n_estimators:** Number of trees in the forest.

  **max_depth:** Maximum depth of the trees.

  **min_samples_split:** Minimum number of samples required to split an internal node.

  **min_samples_leaf:** Minimum number of samples required to be at a leaf node.

  **max_features:** Number of features to consider when looking for the best split.
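
An exhaustive grid search fits one model per parameter combination per cross-validation fold, so the cost grows multiplicatively with every parameter added. The sketch below counts this for an illustrative grid; the values and the `demo_grid` name are assumptions for this example.

```python
from itertools import product
from math import prod

# Illustrative grid; the values are assumptions for this sketch.
demo_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
}
cv_folds = 5

n_combinations = prod(len(values) for values in demo_grid.values())
n_fits = n_combinations * cv_folds
print(n_combinations, n_fits)

# Enumerate the candidates the way an exhaustive search would.
candidates = [dict(zip(demo_grid, combo)) for combo in product(*demo_grid.values())]
print(candidates[0])
```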

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


In [None]:
# Define the parameter grid
rf_param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the trees
    'min_samples_split': [2, 5, 10],  # Minimum number of samples required to split an internal node
    'min_samples_leaf': [1, 2, 4],    # Minimum number of samples required to be at a leaf node
    'max_features': ['sqrt', 'log2']  # Number of features to consider per split ('auto' was removed in scikit-learn 1.3 and fails parameter validation)
}


In [None]:
# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV with the Random Forest model and parameter grid
rf_grid_search = GridSearchCV(estimator=rf_model,
                              param_grid=rf_param_grid,
                              cv=5,           # 5-fold cross-validation
                              n_jobs=-1,      # Use all available cores
                              scoring='accuracy')  # Metric to optimize


In [None]:
# Fit GridSearchCV to the training data
rf_grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print("Best parameters for Random Forest:")
print(rf_grid_search.best_params_)
print("Best score achieved:")
print(rf_grid_search.best_score_)


Best parameters for Random Forest:
{'max_depth': None, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Best score achieved:
0.9762820512820513


In [None]:
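# Retrieve the best model (GridSearchCV already refit it on the full training set, since refit=True by default)
best_rf_model = rf_grid_search.best_estimator_

# Refitting here is therefore redundant but harmless
best_rf_model.fit(X_train, y_train)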

# Make predictions on the test set
y_pred = best_rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with best parameters: {accuracy:.2f}")

print("Random Forest Classification Report with best parameters:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix with best parameters:")
print(confusion_matrix(y_test, y_pred))


Accuracy with best parameters: 0.98
Random Forest Classification Report with best parameters:
              precision    recall  f1-score   support

           0       0.96      1.00      0.98       159
           1       1.00      0.96      0.98       149

    accuracy                           0.98       308
   macro avg       0.98      0.98      0.98       308
weighted avg       0.98      0.98      0.98       308

Confusion Matrix with best parameters:
[[159   0]
 [  6 143]]


**2. XGBoost:**

  Key hyperparameters to tune:

  **n_estimators:** Number of boosting rounds (trees) to build.

  **learning_rate:** Step size at each iteration (learning rate).

  **max_depth:** Maximum depth of the trees.

  **min_child_weight:** Minimum sum of instance weight needed in a child.

  **subsample:** Fraction of samples used for fitting each tree.

  **colsample_bytree:** Fraction of features used for each tree.

  **gamma:** Minimum loss reduction required to make a further partition on a leaf node.

In [None]:
import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


In [None]:
# Define the parameter grid
xgb_param_grid = {
    'n_estimators': [100, 200, 300],       # Number of boosting rounds
    'learning_rate': [0.01, 0.1, 0.2],     # Step size at each iteration
    'max_depth': [3, 6, 9],                # Maximum depth of a tree
    'min_child_weight': [1, 5, 10],        # Minimum sum of instance weight needed in a child
    'subsample': [0.8, 0.9, 1.0],          # Fraction of samples used for fitting the trees
    'colsample_bytree': [0.8, 0.9, 1.0],   # Fraction of features used for each tree
    'gamma': [0, 0.1, 0.2]                 # Minimum loss reduction required to make a further partition
}
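
With three values for each of seven parameters, this grid has 3^7 = 2187 combinations, or 10,935 fits with 5-fold cross-validation. When that is too slow, a common alternative is to evaluate a random subset of the grid (scikit-learn provides `RandomizedSearchCV` for this). The sketch below only illustrates the sampling step with the standard library. The budget of 50 candidates and the seed are arbitrary assumptions.

```python
import random
from itertools import product

# Same values as xgb_param_grid above.
grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 6, 9],
    "min_child_weight": [1, 5, 10],
    "subsample": [0.8, 0.9, 1.0],
    "colsample_bytree": [0.8, 0.9, 1.0],
    "gamma": [0, 0.1, 0.2],
}
cv_folds = 5

# Every combination an exhaustive search would try.
all_candidates = [dict(zip(grid, combo)) for combo in product(*grid.values())]

# Draw a fixed budget of distinct candidates without replacement.
rng = random.Random(42)
sampled = rng.sample(all_candidates, k=50)

print(len(all_candidates), len(all_candidates) * cv_folds)
print(len(sampled), len(sampled) * cv_folds)
```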


In [None]:
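# Initialize the XGBoost model
# (use_label_encoder is ignored by recent XGBoost releases, hence the "not used" warning in the output)
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss', random_state=42)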

# Initialize GridSearchCV with the XGBoost model and parameter grid
xgb_grid_search = GridSearchCV(estimator=xgb_model,
                                param_grid=xgb_param_grid,
                                cv=5,                # 5-fold cross-validation
                                n_jobs=-1,           # Use all available cores
                                scoring='accuracy') # Metric to optimize


In [None]:
# Fit GridSearchCV to the training data
xgb_grid_search.fit(X_train, y_train)

# Print the best parameters and best score
print("Best parameters for XGBoost:")
print(xgb_grid_search.best_params_)
print("Best score achieved:")
print(xgb_grid_search.best_score_)


Parameters: { "use_label_encoder" } are not used.



Best parameters for XGBoost:
{'colsample_bytree': 0.8, 'gamma': 0, 'learning_rate': 0.1, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 200, 'subsample': 1.0}
Best score achieved:
0.966520979020979


In [None]:
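# Retrieve the best model (GridSearchCV already refit it on the full training set, since refit=True by default)
best_xgb_model = xgb_grid_search.best_estimator_

# Refitting here is therefore redundant but harmless
best_xgb_model.fit(X_train, y_train)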

# Make predictions on the test set
y_pred = best_xgb_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy with best parameters: {accuracy:.2f}")

print("XGBoost Classification Report with best parameters:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix with best parameters:")
print(confusion_matrix(y_test, y_pred))


Parameters: { "use_label_encoder" } are not used.



Accuracy with best parameters: 0.99
XGBoost Classification Report with best parameters:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       159
           1       1.00      0.98      0.99       149

    accuracy                           0.99       308
   macro avg       0.99      0.99      0.99       308
weighted avg       0.99      0.99      0.99       308

Confusion Matrix with best parameters:
[[159   0]
 [  3 146]]


# ENSEMBLING: VOTING CLASSIFIER


A VotingClassifier is an ensemble method that combines multiple base models. It predicts either by majority vote (hard voting) or by averaging the predicted class probabilities (soft voting). Here we combine the tuned Random Forest and XGBoost models with soft voting.
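
With `voting='soft'` and equal weights, the ensemble averages the class probabilities of the base models and predicts the class with the highest average. Below is a minimal sketch with made-up probabilities for two hypothetical models; `soft_vote` is an illustrative helper, not part of scikit-learn.

```python
# Predicted probabilities [P(class 0), P(class 1)] for three samples; values are made up.
rf_proba = [[0.70, 0.30], [0.45, 0.55], [0.20, 0.80]]
xgb_proba = [[0.60, 0.40], [0.65, 0.35], [0.10, 0.90]]


def soft_vote(*model_probas):
    """Average per-class probabilities across models and pick the arg-max class."""
    predictions = []
    for sample_probas in zip(*model_probas):
        n_models = len(sample_probas)
        n_classes = len(sample_probas[0])
        averaged = [sum(p[c] for p in sample_probas) / n_models for c in range(n_classes)]
        predictions.append(averaged.index(max(averaged)))
    return predictions


print(soft_vote(rf_proba, xgb_proba))
```

For the second sample the two models disagree, and the averaged probabilities side with the more confident one.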

In [None]:
from sklearn.ensemble import VotingClassifier

# Define base models
voting_clf = VotingClassifier(estimators=[('rf', best_rf_model), ('xgb', best_xgb_model)], voting='soft')

# Train and evaluate the Voting Classifier
voting_clf.fit(X_train, y_train)
y_pred_voting = voting_clf.predict(X_test)



Parameters: { "use_label_encoder" } are not used.



In [None]:
print("Accuracy of Voting Classifier:", accuracy_score(y_test, y_pred_voting))
classification_report_voting = classification_report(y_test, y_pred_voting)

# Generate confusion matrix
confusion_matrix_voting = confusion_matrix(y_test, y_pred_voting)

# Print the detailed report
print("\nVoting Classifier Classification Report:")
print(classification_report_voting)
print("\nVoting Classifier Confusion Matrix:")
print(confusion_matrix_voting)

Accuracy of Voting Classifier: 0.9902597402597403

Voting Classifier Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       159
           1       1.00      0.98      0.99       149

    accuracy                           0.99       308
   macro avg       0.99      0.99      0.99       308
weighted avg       0.99      0.99      0.99       308


Voting Classifier Confusion Matrix:
[[159   0]
 [  3 146]]


# CONCLUSION
We developed a heart disease classifier using ensemble methods. Data preparation covered skewness-reducing transformations (log and Box-Cox), a class-balance check, and one-hot encoding of the categorical features. We trained and tuned Random Forest and XGBoost individually, then combined them with a soft Voting Classifier. On the 308-sample test set, the ensemble reached 0.99 accuracy with 3 misclassified samples, matching XGBoost alone and improving on Random Forest (0.98).