# Logistic Regression - Heart Failure Prediction

This notebook builds and evaluates a **Logistic Regression** model for predicting the likelihood of heart failure based on patient health attributes.  

---

## Objective

Use Logistic Regression as a **baseline classifier** to:
- Identify which health factors are most strongly associated with heart failure
- Establish a performance benchmark for comparison with more complex models such as Decision Trees, Random Forests, and Gradient Boosted Trees

---

## Dataset Summary

The dataset contains **918 patient records** and **11 health-related features**, along with a binary target variable:

- **Target:** `HeartDisease` (1 = patient has or is at risk of heart failure, 0 = otherwise)

**Features include:**
- *Age, Sex, ChestPainType, RestingBP, Cholesterol, FastingBS, RestingECG, MaxHR, ExerciseAngina, Oldpeak, ST_Slope*

---

## Overview

1. **Load the dataset** (`dataset/heart.csv`)  
2. **Explore and clean data**  
3. **Preprocess features** (encode categorical variables, scale numerical features)  
4. **Split the data** into train and test sets to evaluate model generalization  
5. **Build and train** a Logistic Regression model  
6. **Evaluate performance** using metrics such as Accuracy, F1-score, and ROC-AUC  
7. **Interpret coefficients** to understand key risk indicators

---

## Notes

- All preprocessing for this model (encoding, scaling, and splitting) is handled **within this notebook** for simplicity.  
- The same random seed (`random_state=42`) will be used across models to ensure consistent splits and fair performance comparison.  
- Results here will serve as a **baseline** for evaluating more advanced models later in the project.
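
To illustrate why a fixed seed gives consistent splits, the sketch below splits a toy frame twice with `random_state=42` and checks that the same rows land in the test set both times. The toy data, the `split` helper, and the `test_size=0.2` and `stratify` settings are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the patient table (values are made up).
toy = pd.DataFrame({"x": range(20), "HeartDisease": [0, 1] * 10})


def split(frame, seed=42):
    """Hypothetical helper: stratified train/test split with a fixed seed."""
    X = frame.drop(columns="HeartDisease")
    y = frame["HeartDisease"]
    return train_test_split(X, y, test_size=0.2, stratify=y, random_state=seed)


first = split(toy)
second = split(toy)

# Index 1 of the returned list is X_test; compare which rows were held out.
same_rows = list(first[1].index) == list(second[1].index)
print(same_rows)
```

Reusing the same seed and split settings in each model's notebook means every model is scored on the same held-out patients.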


## Loading the Dataset

We load the Kaggle dataset from its CSV file into a pandas dataframe (`df`).

Afterwards, we verify that `df` has been populated and check it for missing values.

In [3]:
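import pandas as pd

# Load the heart failure dataset into a dataframe
df = pd.read_csv("dataset/heart.csv")

# Preview the first rows (a cell only displays its last expression)
df.head()
# Count fully duplicated rows; stored because duplicated().sum() returns a single integer
n_duplicates = df.duplicated().sum()
# Count missing values per column (this is what produces the per-column output)
df.isnull().sum()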


Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64
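
The sanity checks in this step can be gathered into one small helper. The sketch below is an illustration on a tiny in-memory CSV with made-up rows, not on `dataset/heart.csv`, and `summarize` is a hypothetical name. Note that `isnull().sum()` returns one count per column, whereas `duplicated().sum()` returns a single integer.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the real file (rows are made up).
csv_text = """Age,Sex,RestingBP,HeartDisease
40,M,140,0
49,F,160,1
49,F,160,1
37,M,,0
"""
sample = pd.read_csv(io.StringIO(csv_text))


def summarize(frame):
    """Hypothetical helper: collect basic data-quality checks in one dict."""
    return {
        "shape": frame.shape,
        "missing_per_column": frame.isnull().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
    }


report = summarize(sample)
print(report)
```

Calling `summarize(df)` on the loaded dataframe would report the same three quantities for the real data.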