# Logistic Regression - Heart Disease Prediction

## Introduction

The World Health Organization estimates that approximately **12 million deaths** occur worldwide each year due to heart diseases. In fact, half of the deaths in the United States and other developed countries are attributed to cardiovascular diseases. Early prognosis of cardiovascular diseases can significantly aid in making informed decisions regarding lifestyle changes for high-risk patients, ultimately reducing complications and improving health outcomes.

This research aims to identify the most relevant risk factors associated with heart disease and to predict the overall risk using **logistic regression**. By leveraging statistical methods, we can enhance the understanding of how various factors contribute to the likelihood of developing heart disease, facilitating timely interventions.

**Please note that this project is conducted as part of my studies, and the findings presented herein should be verified and interpreted with caution.**

### Source

The dataset used for this analysis is publicly available on the **Kaggle** platform and originates from an ongoing cardiovascular study conducted on residents of **Framingham, Massachusetts**. The classification goal of this study is to predict whether a patient has a **10-year risk** of future coronary heart disease (CHD).

The dataset comprises over **4,000 records** and **16 columns**: 15 patient attributes plus the target variable. This wealth of data enables a comprehensive analysis of the risk factors associated with heart disease and supports the development of a predictive model.


## Project Column Overview

### Demographic
- **Sex**: The patient's sex, "male" or "female", stored in the `male` column as 1 (male) or 0 (female) (Nominal).
- **Age**: Age of the patient, measured in years. Although recorded as whole numbers, the concept of age is continuous (Continuous).
- **Education**: The patient's education level, stored in the `education` column as a code from 1 to 4 (Ordinal).

### Behavioral
- **Current Smoker**: Indicates whether the patient is a current smoker, "Yes" or "No", stored in the `currentSmoker` column as 1 or 0 (Nominal).
- **Cigs Per Day**: The average number of cigarettes smoked per day by the patient. It can be considered continuous, as it is possible to smoke a fraction of a cigarette (Continuous).

### Medical (History)
- **BP Meds**: Indicates whether the patient was on blood pressure medication (Nominal).
- **Prevalent Stroke**: Indicates whether the patient had previously experienced a stroke (Nominal).
- **Prevalent Hyp**: Indicates whether the patient was hypertensive (Nominal).
- **Diabetes**: Indicates whether the patient had diabetes (Nominal).

### Medical (Current)
- **Tot Chol**: Total cholesterol level measured in the blood (Continuous).
- **Sys BP**: Systolic blood pressure, the pressure in blood vessels during heartbeats (Continuous).
- **Dia BP**: Diastolic blood pressure, the pressure in blood vessels when the heart is resting between beats (Continuous).
- **BMI**: Body Mass Index, a measure of body fat based on weight and height (Continuous).
- **Heart Rate**: The number of heartbeats per minute. Although typically considered discrete, it is treated as continuous due to the large number of possible values (Continuous).
- **Glucose**: Glucose level in the blood (Continuous).

### Target Variable (Desired Prediction)
- **10-Year Risk of Coronary Heart Disease (CHD)**: Indicates the risk of developing coronary heart disease within ten years following the assessment. The values are binary: "1" means "Yes," and "0" means "No."
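
For reference, logistic regression models the probability of the positive class with the logistic (sigmoid) function. The formula below is a sketch in standard notation: $x_1, \dots, x_p$ stand for the predictors listed above, and the $\beta$ symbols are coefficient names introduced here for illustration only.

$$
P(\text{TenYearCHD} = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)}}
$$

Equivalently, the log-odds are linear in the predictors:

$$
\log \frac{P(\text{TenYearCHD} = 1 \mid x)}{1 - P(\text{TenYearCHD} = 1 \mid x)} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p
$$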

In [1]:
import pandas as pd

In [12]:
# Load data
df = pd.read_csv("Dataset/framingham.csv")

## Preliminary Data Exploration

In [13]:
# Display first five rows
df.head()

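| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 39 | 4.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 195.0 | 106.0 | 70.0 | 26.97 | 80.0 | 77.0 | 0 |
| 1 | 0 | 46 | 2.0 | 0 | 0.0 | 0.0 | 0 | 0 | 0 | 250.0 | 121.0 | 81.0 | 28.73 | 95.0 | 76.0 | 0 |
| 2 | 1 | 48 | 1.0 | 1 | 20.0 | 0.0 | 0 | 0 | 0 | 245.0 | 127.5 | 80.0 | 25.34 | 75.0 | 70.0 | 0 |
| 3 | 0 | 61 | 3.0 | 1 | 30.0 | 0.0 | 0 | 1 | 0 | 225.0 | 150.0 | 95.0 | 28.58 | 65.0 | 103.0 | 1 |
| 4 | 0 | 46 | 3.0 | 1 | 23.0 | 0.0 | 0 | 0 | 0 | 285.0 | 130.0 | 84.0 | 23.1 | 85.0 | 85.0 | 0 |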


In [15]:
# Display basic statistics of the DataFrame
df.describe()

| | male | age | education | currentSmoker | cigsPerDay | BPMeds | prevalentStroke | prevalentHyp | diabetes | totChol | sysBP | diaBP | BMI | heartRate | glucose | TenYearCHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4238.0 | 4238.0 | 4133.0 | 4238.0 | 4209.0 | 4185.0 | 4238.0 | 4238.0 | 4238.0 | 4188.0 | 4238.0 | 4238.0 | 4219.0 | 4237.0 | 3850.0 | 4238.0 |
| mean | 0.429212 | 49.584946 | 1.97895 | 0.494101 | 9.003089 | 0.02963 | 0.005899 | 0.310524 | 0.02572 | 236.721585 | 132.352407 | 82.893464 | 25.802008 | 75.878924 | 81.966753 | 0.151958 |
| std | 0.495022 | 8.57216 | 1.019791 | 0.500024 | 11.920094 | 0.169584 | 0.076587 | 0.462763 | 0.158316 | 44.590334 | 22.038097 | 11.91085 | 4.080111 | 12.026596 | 23.959998 | 0.359023 |
| min | 0.0 | 32.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 107.0 | 83.5 | 48.0 | 15.54 | 44.0 | 40.0 | 0.0 |
| 25% | 0.0 | 42.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 206.0 | 117.0 | 75.0 | 23.07 | 68.0 | 71.0 | 0.0 |
| 50% | 0.0 | 49.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 234.0 | 128.0 | 82.0 | 25.4 | 75.0 | 78.0 | 0.0 |
| 75% | 1.0 | 56.0 | 3.0 | 1.0 | 20.0 | 0.0 | 0.0 | 1.0 | 0.0 | 263.0 | 144.0 | 89.875 | 28.04 | 83.0 | 87.0 | 0.0 |
| max | 1.0 | 70.0 | 4.0 | 1.0 | 70.0 | 1.0 | 1.0 | 1.0 | 1.0 | 696.0 | 295.0 | 142.5 | 56.8 | 143.0 | 394.0 | 1.0 |


In [16]:
# Display the total number of missing values in each column
df.isna().sum()

male                 0
age                  0
education          105
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
TenYearCHD           0
dtype: int64
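
Raw counts are easier to compare against a percentage rule once they are expressed as shares. The sketch below shows one way to do that; it runs on a small made-up frame (`toy` and its values are hypothetical, not taken from the Framingham data), but the same expression would apply to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real DataFrame; the values are invented for illustration.
toy = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "BMI": [26.97, np.nan, 25.34, 28.58],
    "glucose": [77.0, np.nan, np.nan, 103.0],
})

# isna() yields booleans, so the column mean is the fraction of missing entries.
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)
```

On the real data, `df.isna().mean().mul(100)` would give the percentage of missing values per column in the same way.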

In [22]:
# Store the number of rows and columns as a (rows, columns) tuple
shape_df = df.shape

In [23]:
shape_df

(4238, 16)

In [18]:
# Display the data types of each column in the DataFrame
df.dtypes

male                 int64
age                  int64
education          float64
currentSmoker        int64
cigsPerDay         float64
BPMeds             float64
prevalentStroke      int64
prevalentHyp         int64
diabetes             int64
totChol            float64
sysBP              float64
diaBP              float64
BMI                float64
heartRate          float64
glucose            float64
TenYearCHD           int64
dtype: object

## Data Cleaning

### Manage missing values

In [24]:
# Total number of rows, and 5% of that count (rounded down) as the drop threshold
row_sum = shape_df[0]
five_percent = int(row_sum * 0.05)

In [26]:
print(f"Rule of thumb: if a column has fewer than 5% missing values, the rows containing them can be dropped. Here, that applies to columns with fewer than {five_percent} missing values.")

Rule of thumb: if a column has fewer than 5% missing values, the rows containing them can be dropped. Here, that applies to columns with fewer than 211 missing values.


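The following columns each have fewer than 211 missing values, so the affected rows can be dropped. The **glucose** column, with 388 missing values, exceeds the threshold and is therefore not covered by this rule: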

- **education**: 105 missing values
- **cigsPerDay**: 29 missing values
- **BPMeds**: 53 missing values
- **totChol**: 50 missing values
- **BMI**: 19 missing values
- **heartRate**: 1 missing value
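
The rule above can be sketched as a small helper. This is an illustration under stated assumptions, not the notebook's own implementation: `drop_sparse_missing` and `max_share` are hypothetical names, the threshold is expressed as a share of rows rather than a row count, and the example runs on an invented frame rather than the Framingham data.

```python
import numpy as np
import pandas as pd

def drop_sparse_missing(frame: pd.DataFrame, max_share: float = 0.05) -> pd.DataFrame:
    """Drop rows with NaN in columns whose missing share is below max_share."""
    share = frame.isna().mean()
    # Columns that have some missing values, but fewer than the threshold.
    cols = share[(share > 0) & (share < max_share)].index.tolist()
    if not cols:
        return frame.copy()
    return frame.dropna(subset=cols)

# Toy data: "a" is 2.5% missing (below 5%), "b" is 25% missing (above 5%).
toy = pd.DataFrame({
    "a": [np.nan] + [1.0] * 39,
    "b": [1.0] * 30 + [np.nan] * 10,
})

cleaned = drop_sparse_missing(toy)
print(cleaned.shape)
```

Only the row with a missing value in `a` is removed; the missing values in `b` are left in place and would need a different treatment, such as imputation.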