Skip to content

PiyushKBhattacharyya/Heart_Attack_Prediction

Repository files navigation

Machine Learning Project Report: Heart Attack Prediction

NOTE:

To install all necessary libraries, run pip install -r requirements.txt

1. Executive Summary

Objective: The objective of this project is to develop a machine learning model that predicts the likelihood of a heart attack based on patient data. This model aims to assist healthcare providers in early diagnosis and preventive measures.

Key Findings: The final model, a Multi-Layer Perceptron (MLP) neural network, achieved an accuracy of 91.67%, a precision of 100%, a recall of 85%, and an AUC-ROC score of 0.92. Key predictors identified include age, cholesterol levels, resting blood pressure, and maximum heart rate achieved.

2. Introduction

Background: Heart disease is one of the leading causes of death globally. Early prediction and diagnosis are crucial for effective treatment and prevention. Machine learning models can analyze complex datasets to identify patterns and predict outcomes.

Purpose: This project aims to develop a predictive model for heart attacks using patient health data. The model will help in early detection and intervention.

Scope: The project uses the UCI Heart Disease Dataset, which includes various health metrics such as age, sex, cholesterol levels, blood pressure, and electrocardiographic results.

3. Literature Review

Existing Research: Previous studies have utilized logistic regression, neural networks, and support vector machines for heart attack prediction. Each method has shown varying degrees of success, but challenges remain in terms of accuracy and generalizability.

Gap Analysis: Many existing models do not generalize well to different populations or are not interpretable by healthcare providers. This project aims to build a robust and interpretable model.

4. Methodology

Data Collection: The dataset used is the UCI Heart Disease Dataset, comprising 303 records and 13 features, including age, sex, chest pain type, resting blood pressure, cholesterol, fasting blood sugar, resting ECG results, maximum heart rate, exercise-induced angina, ST depression, slope, number of vessels colored, and thalassemia.

Data Preprocessing:

  • Data Normalization: Scaled numerical features using Standard scaling.

Exploratory Data Analysis (EDA):

  • Descriptive Statistics: Mean age of patients is 54.43 years with a standard deviation of 9.08 years.
  • Visualizations: Correlation matrix shows strong correlations between target variable and features like age, cholesterol, and max heart rate.

Model Selection:

  • Algorithms Evaluated: Multi-Layer Perceptron (MLP), Support Vector Machine (SVM), and Deep Neural Network (DNN).
  • Final Model Choice: MLP due to its high accuracy and balance between precision and recall.

Model Training:

  • Training Process: Split the data into 80% training and 20% testing sets.
  • Parameter Tuning: Used Grid Search for hyperparameter tuning, optimizing the architecture and parameters of the MLP model.

Model Evaluation:

  • Evaluation Metrics: Accuracy, precision, recall, F1 score, and AUC-ROC.
  • Cross-Validation: Performed 10-fold cross-validation to ensure model robustness.

5. Results

Final Model Performance:

  • Accuracy: 91.67%
  • Precision: 100%
  • Recall: 85%
  • F1 Score: 92%
  • AUC-ROC: 0.92
  • Loss: 0.3051
  • Training Time: 3.18 seconds

Feature Importance:

  • Top Features: Age, cholesterol levels, resting blood pressure, maximum heart rate achieved.
  • Visualization: [Provide Feature Importance Plot]

6. Discussion

Interpretation of Results:

  • The MLP model shows high accuracy and reliable performance in predicting heart attacks.
  • Age and cholesterol are significant predictors, indicating the importance of monitoring these factors in patients.

Limitations:

  • Dataset Size: The relatively small dataset may limit the model’s generalizability.
  • Biases: Potential biases in the dataset could affect the predictions. More diverse datasets are needed for broader applicability.

7. Conclusion

Summary: The project successfully developed a machine learning model with high accuracy for predicting heart attacks. The model's interpretability makes it useful for healthcare providers.

Future Work:

  • Larger Datasets: Incorporate larger and more diverse datasets.
  • Feature Expansion: Explore additional features such as genetic factors and lifestyle habits.
  • Model Enhancement: Experiment with ensemble methods and advanced neural networks.

Impact: The model can significantly impact early diagnosis and preventive healthcare, potentially reducing the incidence of heart attacks through timely intervention.

8. References

  • Datasets: UCI Machine Learning Repository. (Heart Disease Dataset)
  • Papers: Refer to studies on heart attack prediction using machine learning in journals such as IEEE Transactions on Biomedical Engineering, Journal of the American Medical Informatics Association, etc.
  • Books: "Machine Learning in Healthcare" by S. Joshi.

9. Appendices

Appendix A: Detailed Data Description

  • age: Age of the patient
  • sex: Sex of the patient
    • 0 = Female
    • 1 = Male
  • cp: Chest pain type
    • 0 = Typical Angina
    • 1 = Atypical Angina
    • 2 = Non-anginal Pain
    • 3 = Asymptomatic
  • trtbps: Resting blood pressure (in mm Hg)
  • chol: Cholesterol level in mg/dl fetched via BMI sensor
  • fbs: Fasting blood sugar
    • 1 = (fasting blood sugar > 120 mg/dl)
    • 0 = (fasting blood sugar <= 120 mg/dl)
  • restecg: Resting electrocardiographic results
    • 0 = Normal
    • 1 = ST-T wave abnormality
    • 2 = Left ventricular hypertrophy
  • thalachh: Maximum heart rate achieved
  • oldpeak: Previous peak
  • slp: Slope
  • caa: Number of major vessels
  • thall: Thallium Stress Test result
    • (0,3)
  • exng: Exercise induced angina
    • 1 = Yes
    • 0 = No
  • output: Target variable
    • 0 = No Attack
    • 1 = Attack

Appendix B: Additional Charts and Visualizations

  • Correlation Heatmap: plot
  • Pair Plot: plot

Appendix C: Code Snippets

Data Preprocessing:

# Extract feature variables (all columns except the first and last) and target variable (last column)
x = df.iloc[:, 1:-1].values
y = df.iloc[:, -1].values

# Print the feature variables (x) and target variable (y)
print("Feature Variables (x):")
print(x)
print("\nTarget Variable (y):")
print(y)

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Initialize StandardScaler
scaler = StandardScaler()

# Standardize the training data (x_train)
x_train = scaler.fit_transform(x_train)

# Standardize the testing data (x_test) using the same scaler as training data
x_test = scaler.transform(x_test)

Model Training:

# Initialize necessary variables
best_accuracy_MLP = 0.0
best_auc_roc_MLP = 0.0
best_f1_score_MLP = 0.0
best_loss_MLP = float('inf')
best_precision_MLP = 0.0
best_training_time_MLP = float('inf')
best_model_MLP = None
best_confusion_matrix_MLP = None
best_recall_MLP = 0.0
fold_histories = {}
accuracy_list = []
loss_list = []
precision_list = []
recall_list = []
f1_list = []
auc_list = []
confusion_matrices = [] 
training_time_list = []
fold_histories_MLP = {}

results_df_mlp = pd.DataFrame(columns=['Fold', 'Early_Stopping', 'Accuracy', 'Loss', 'Precision', 'Recall', 'F1-Score', 'AUC-ROC', 'Training_Time'])

# Perform grid search
for fold_index, (train_index, val_index) in enumerate(kfold.split(x_train, y_train), 1):
    print(f"\nFold {fold_index}:")

    x_train_fold, x_val_fold = x_train[train_index], x_train[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]

    smote = SMOTE(random_state=42)
    x_train_fold, y_train_fold = smote.fit_resample(x_train_fold, y_train_fold)

    accuracies_MLP = []
    losses_MLP = []
    precisions_MLP = []
    recalls_MLP = []
    f1_scores_MLP = []
    auc_rocs_MLP = []
    training_times_MLP = []

    for early_stopping_param in early_stopping_params:
        model_MLP = Sequential([
            Dense(32, activation='relu', input_shape=(x_train.shape[1],), kernel_regularizer=regularizers.l2(0.001)),
            BatchNormalization(),
            Dropout(0.5),
            Dense(16, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
            BatchNormalization(),
            Dropout(0.3),
            Dense(1, activation='sigmoid')
        ])

        model_MLP.compile(optimizer='RMSProp', loss='binary_crossentropy', metrics=['accuracy'])

        early_stopping = EarlyStopping(monitor='val_loss', patience=early_stopping_param[0], restore_best_weights=early_stopping_param[1])

        start_time = time.time()
        history = model_MLP.fit(x_train_fold, y_train_fold, epochs=75, batch_size=32, validation_split=0.2, callbacks=[early_stopping], verbose=0)
        training_time = time.time() - start_time
        training_times_MLP.append(training_time)

        eval_loss, accuracy_MLP = model_MLP.evaluate(x_val_fold, y_val_fold)
        accuracies_MLP.append(accuracy_MLP)
        losses_MLP.append(eval_loss)

        y_pred = (model_MLP.predict(x_val_fold) > 0.5).astype("int32")

        precision = precision_score(y_val_fold, y_pred)
        recall = recall_score(y_val_fold, y_pred)
        f1 = f1_score(y_val_fold, y_pred)
        auc = roc_auc_score(y_val_fold, y_pred)

        precisions_MLP.append(precision)
        recalls_MLP.append(recall)
        f1_scores_MLP.append(f1)
        auc_rocs_MLP.append(auc)

        results_df_mlp = pd.concat([results_df_mlp, pd.DataFrame({
            'Fold': [fold_index],
            'Early_Stopping': [early_stopping_param],
            'Accuracy': [accuracy_MLP],
            'Loss': [eval_loss],
            'Precision': [precision],
            'Recall': [recall],
            'F1-Score': [f1],
            'AUC-ROC': [auc],
            'Training_Time': [training_time]
        })], ignore_index=True)

        print(f"Early Stopping: {early_stopping_param}, Accuracy = {accuracy_MLP:.4f}, Loss = {eval_loss:.4f}, Precision = {precision:.4f}, Recall = {recall:.4f}, F1-Score = {f1:.4f}, AUC-ROC = {auc:.4f}, Training Time = {training_time:.2f} seconds")

        accuracy_list.extend(accuracies_MLP)
        loss_list.extend(losses_MLP)
        precision_list.extend(precisions_MLP)
        recall_list.extend(recalls_MLP)
        f1_list.extend(f1_scores_MLP)
        auc_list.extend(auc_rocs_MLP)
        training_time_list.extend(training_times_MLP)

        mean_accuracy_MLP = np.mean(accuracies_MLP)
        mean_loss_MLP = np.mean(losses_MLP)

        fold_histories[fold_index] = history

        if mean_accuracy_MLP > best_accuracy_MLP:
            best_accuracy_MLP = mean_accuracy_MLP
            best_model_MLP = model_MLP
            best_confusion_matrix_MLP = confusion_matrix(y_val_fold, y_pred)
            best_precision_MLP = precision
            best_recall_MLP = recall
            best_f1_score_MLP = f1
            best_auc_roc_MLP = auc
            best_recall_MLP = recall
            best_loss_MLP = eval_loss
            best_training_time_MLP = training_time

Model Evaluation:

    # Function to compute softmax error (categorical cross-entropy loss)
    def compute_softmax_error(model, x_test, y_test):
        predictions = model.predict(x_test)
        loss_fn = CategoricalCrossentropy()
        softmax_error = loss_fn(y_test, predictions).numpy()
        return softmax_error

    # Placeholder function to simulate teacher model predictions
    def get_teacher_predictions(x_data):
        # In practice, this would be obtained from a trained teacher model
        return np.random.rand(x_data.shape[0], num_classes)

    # Function to compute distillation error (KLDivergence)
    def compute_distillation_error(student_model, x_data, teacher_predictions):
        predictions = student_model.predict(x_data)
        distillation_loss = KLDivergence()
        distillation_error = distillation_loss(teacher_predictions, predictions).numpy()
        return distillation_error

    # Function to compute ConfWeight Error
    def compute_confweight_error(model, x_data, y_data):
        predictions = model.predict(x_data)
        confidences = np.max(predictions, axis=1)
        correct_predictions = np.argmax(predictions, axis=1) == np.argmax(y_data, axis=1)
        errors = 1 - correct_predictions
        confweight_error = np.mean(errors * (1 - confidences))
        return confweight_error

    # Function to compute SRatio Error
    def compute_sratio_error(model, x_data, y_data):
        predictions = model.predict(x_data)
        true_probabilities = y_data / np.sum(y_data, axis=1, keepdims=True)
        predicted_probabilities = predictions / np.sum(predictions, axis=1, keepdims=True)
        sratio_error = np.mean(np.abs(true_probabilities - predicted_probabilities))
        return sratio_error

    model_MLP = load_model(r'saved_model/MLP_Model.h5')

    # Compute and print Softmax Error
    softmax_error_MLP = compute_softmax_error(model_MLP, x_test, y_test)
    print(f"Softmax Error: {softmax_error_MLP}")

    # Compute and print Distillation Error
    teacher_predictions_MLP = get_teacher_predictions(x_test)
    distillation_error_MLP = compute_distillation_error(model_MLP, x_test, teacher_predictions_MLP)
    print(f"Distillation Error: {distillation_error_MLP}")

    # Compute and print ConfWeight Error
    confweight_error_MLP = compute_confweight_error(model_MLP, x_test, y_test)
    print(f"ConfWeight Error: {confweight_error_MLP}")

    # Compute and print SRatio Error
    sratio_error_MLP = compute_sratio_error(model_MLP, x_test, y_test)
    print(f"SRatio Error: {sratio_error_MLP}")

    # Track and print computation time for predictions
    start_time_MLP = time.time()
    predictions_MLP = model_MLP.predict(x_test)
    computation_time_MLP = time.time() - start_time_MLP
    print(f"Computation Time: {computation_time_MLP} seconds") 

About

Heart-Attack Prediction Model using TensorFlow on UCI Dataset

Topics

Resources

Stars

Watchers

Forks

Packages

No packages published