# **EECS 3401 Final Project: Understanding Heart Disease Risk**
### Group Members: Deep Patel, Yukta Bhutani, Abdul Wasay Faizan
#### Original Dataset Source: [link](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?select=2022)
#### Modified Dataset Source: [link](https://github.com/Deep26053/EECS3401_Heart/blob/main/heart_2022_with_nans.csv)
### Project Description: 
This project analyzes a dataset of heart disease risk factors sourced from the Centers for Disease Control and Prevention (CDC). The dataset contains survey responses from over 400,000 adults collected in 2022 and focuses on key indicators associated with heart disease prevalence.

### Attributes for the heart disease dataset:

1. Age: Age of the respondent.
2. Gender: Gender of the respondent.
3. Race: Race of the respondent.
4. HighBloodPressure: Presence of high blood pressure (binary: "Yes" or "No").
5. HighCholesterol: Presence of high cholesterol (binary: "Yes" or "No").
6. SmokingStatus: Smoking status of the respondent.
7. DiabetesStatus: Diabetes status of the respondent.
8. Obesity: Obesity status of the respondent (measured by BMI).
9. PhysicalActivity: Level of physical activity.
10. AlcoholConsumption: Level of alcohol consumption.
11. OtherMedicalConditions: Presence of other medical conditions.
12. FamilyHistory: Family history of heart disease (binary: "Yes" or "No").
13. MedicationUsage: Usage of medications for heart disease or related conditions.
14. StressLevel: Level of stress reported by the respondent.
15. Diet: Dietary habits of the respondent.
16. ExerciseRoutine: Regularity of exercise routines.
17. SleepQuality: Quality of sleep reported by the respondent.
18. SocioeconomicStatus: Socioeconomic status of the respondent.
19. EducationLevel: Level of education attained by the respondent.
20. AccessToHealthcare: Access to healthcare facilities and services.
21. GeographicLocation: Geographic location of the respondent.
22. DateOfSurvey: Date when the survey was conducted.
23. HadHeartAttack: Whether the respondent reported having had a heart attack (binary: "Yes" or "No"); used here as the indicator of heart disease.

# 1. Frame the Problem:

### Task: 
Predict the likelihood of heart disease based on demographic and health-related variables using the Heart Disease dataset.

### Key Questions:
1. What are the demographic and health-related variables associated with heart disease risk?
2. How do factors such as age, gender, blood pressure, cholesterol levels, smoking status, diabetes status, obesity, physical activity, and alcohol consumption impact the likelihood of heart disease?
3. What is the distribution of heart disease cases among different demographic groups?
4. Can we identify patterns or correlations between variables and heart disease prevalence?
5. Which machine learning algorithm provides the most accurate predictions for heart disease likelihood?

- **Supervised Learning:** Since the dataset includes labelled data (e.g., presence or absence of heart disease), it is a supervised learning problem.
- **Classification Task:** The objective is to predict whether an individual is at risk of heart disease (binary classification).
- **Batch Learning:** The dataset is a finite set of data collected at a specific time, which makes batch learning suitable for model training. No continuous flow of data comes into the system, and there is no need to adapt rapidly to changing data.

This framing fixes the learning approach (supervised), the nature of the prediction task (binary classification), and the training technique (batch learning) used to develop the heart disease prediction model.
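
To make the binary classification framing concrete, the sketch below turns a "Yes"/"No" label column into a 1/0 target. It uses a small toy DataFrame, not the project data; only the column name `HadHeartAttack` comes from the dataset, and the values and the names `toy`, `target`, and `positive_rate` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the label column (illustrative values, not real survey data).
toy = pd.DataFrame({"HadHeartAttack": ["No", "Yes", "No", None, "No"]})

# Map the labels to 1/0; a missing answer stays missing (NaN) instead of
# being silently counted as a negative case.
target = toy["HadHeartAttack"].map({"Yes": 1, "No": 0})

# Share of positive cases among the rows that have a label.
positive_rate = target.mean()
print(target.tolist())
print(positive_rate)
```

Keeping missing labels as NaN leaves the decision about those rows (drop or impute) to a later, explicit step.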


# Look at the Big Picture:
The model's output (prediction of an individual's likelihood of heart disease) will be used as one of many signals in a broader healthcare system. This downstream system will aid healthcare professionals in making informed decisions regarding patient care, such as recommending preventive measures, lifestyle modifications, or medical interventions.

## Key Analyses:
- **Distribution of Heart Disease Risk Factors:**
Explored the prevalence and distribution of heart disease risk factors including blood pressure, cholesterol levels, smoking status, diabetes status, obesity, physical activity levels, and alcohol consumption to identify common patterns and variations.
- **Impact of Demographic Factors on Heart Disease Risk:**
Investigated the relationship between demographic factors (age, gender, ethnicity) and heart disease prevalence to identify any significant associations or disparities.
- **Correlation between Risk Factors and Heart Disease:**
Identified correlations between variables such as blood pressure, cholesterol levels, smoking status, and heart disease to understand their impact on heart disease risk.
- **Feature Importance Analysis:**
Assessed which demographic and health-related factors have the greatest impact on heart disease risk.
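
Two of the analyses listed above, prevalence by demographic group and correlation with a risk factor, can be sketched with a group-by and a correlation call. The example runs on a toy DataFrame with made-up values; the column names `Sex`, `BMI`, and `HadHeartAttack` appear in the dataset, while `toy`, `target`, `prevalence_by_sex`, and `bmi_corr` are hypothetical names. It is not necessarily how the project computes these quantities.

```python
import pandas as pd

# Toy data with made-up values, used only to show the mechanics.
toy = pd.DataFrame({
    "Sex": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "BMI": [22.0, 31.5, 28.0, 26.5, 35.0, 24.0],
    "HadHeartAttack": ["No", "Yes", "No", "No", "Yes", "No"],
})

# Encode the label as 1/0 so that a mean is a prevalence.
toy["target"] = (toy["HadHeartAttack"] == "Yes").astype(int)

# Prevalence of the positive label within each demographic group.
prevalence_by_sex = toy.groupby("Sex")["target"].mean()

# Pearson correlation between a numeric risk factor and the 1/0 label.
bmi_corr = toy["BMI"].corr(toy["target"])

print(prevalence_by_sex)
print(bmi_corr)
```

Feature importance is not covered here because it depends on which model is trained.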

### Impact: 
By accurately predicting the likelihood of heart disease, the model can contribute to reducing the burden of cardiovascular disease, improving patient outcomes, and reducing healthcare costs associated with preventable heart-related complications.


```python
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

# 2. Load Dataset:
Open the dataset using Pandas and load it into a DataFrame, the object Pandas uses to store tabular data.
Pandas has two core objects for storing data: the Series and the DataFrame.
A Series holds a single labelled column of values, while a DataFrame holds a table made up of one or more columns.
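
The difference between the two objects shows up when selecting columns. The following minimal sketch uses a toy DataFrame; the column names are taken from the dataset, while the values and the names `toy`, `one_column`, and `sub_table` are assumptions for illustration.

```python
import pandas as pd

# Toy table with two columns (illustrative values).
toy = pd.DataFrame({
    "State": ["Alabama", "Alabama"],
    "SleepHours": [8.0, 6.0],
})

one_column = toy["SleepHours"]    # a single column name returns a Series
sub_table = toy[["SleepHours"]]   # a list of column names returns a DataFrame

print(type(one_column).__name__)
print(type(sub_table).__name__)
print(sub_table.shape)
```

Note that a DataFrame can also have a single column, as `sub_table` does; what distinguishes it from a Series is that it remains two-dimensional.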

```python
heart_data = pd.read_csv('heart_2022_with_nans.csv') # Read dataset from the CSV file into a DataFrame
```

```python
heart_data.head()
```

| Unnamed: 0 | State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | ... | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Female | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | No | 8.0 |  | No | ... |  |  |  | No | No | Yes | No | Yes, received tetanus shot but not sure what type | No | No |
| 1 | Alabama | Female | Excellent | 0.0 | 0.0 |  | No | 6.0 |  | No | ... | 1.6 | 68.04 | 26.57 | No | No | No | No | No, did not receive any tetanus shot in the pa... | No | No |
| 2 | Alabama | Female | Very good | 2.0 | 3.0 | Within past year (anytime less than 12 months ... | Yes | 5.0 |  | No | ... | 1.57 | 63.5 | 25.61 | No | No | No | No |  | No | Yes |
| 3 | Alabama | Female | Excellent | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 7.0 |  | No | ... | 1.65 | 63.5 | 23.3 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | No |
| 4 | Alabama | Female | Fair | 2.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 |  | No | ... | 1.57 | 53.98 | 21.77 | Yes | No | No | Yes | No, did not receive any tetanus shot in the pa... | No | No |
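
The preview already shows empty cells (for example in `RemovedTeeth` and `BMI`), so a natural next check is how many values are missing per column. The sketch below runs on a toy DataFrame that mirrors three columns from the first three preview rows; `toy`, `missing_counts`, and `missing_share` are hypothetical names, and applying the same calls to `heart_data` is an assumed follow-up, not a step shown in the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring three columns of the first three preview rows.
toy = pd.DataFrame({
    "SleepHours": [8.0, 6.0, 5.0],
    "BMI": [np.nan, 26.57, 25.61],
    "RemovedTeeth": [np.nan, np.nan, np.nan],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Fraction of missing values in each column (0.0 = complete, 1.0 = all missing).
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)
```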
