#### Team members

1. Mostafa Allahmoradi - 9087818
2. Cemil Caglar Yapici - 9081058
3. Jarius Bedward - 8841640

### Import all necessary libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA


#### Problem statement

Area of Focus:  
In today’s world, health conditions such as heart disease, obesity, and diabetes are rapidly increasing. Many of these illnesses are directly linked to preventable lifestyle factors such as poor diet, lack of physical activity, smoking, alcohol consumption, and inadequate sleep. Despite increased awareness of healthy living, many individuals remain at risk because they underestimate how daily habits affect their well-being over the long term.

Outcome:  
Early detection and prevention could save lives and reduce medical costs. Traditional screening methods, however, are often ineffective: they typically identify problems only after symptoms appear, missing critical opportunities for prevention.

This project aims to explore how machine learning can help predict the likelihood of developing a heart condition (or general illness) based on an individual’s diet, lifestyle, and fitness-related factors. By analyzing real-world health data, our goal is to identify key risk factors and build a predictive model that can help individuals take proactive measures toward healthier living.

#### Data Sources: 

1. Health and Lifestyle Data for Regression
2. Heart Attack Prediction in Indonesia

#### 1. Load data sources

In [None]:
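# Load the heart attack prediction dataset
health_condition_dataset = pd.read_csv('data/heart_attack_prediction_dataset.csv')

# Preview the first five rows
display(health_condition_dataset.head(5))

# info() prints its summary directly and returns None, so it is not wrapped in display()
health_condition_dataset.info()

# Summary statistics for the numeric columns, one row per column
display(health_condition_dataset.describe().T)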

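| | age | gender | region | income_level | hypertension | diabetes | cholesterol_level | obesity | waist_circumference | family_history | ... | blood_pressure_diastolic | fasting_blood_sugar | cholesterol_hdl | cholesterol_ldl | triglycerides | EKG_results | previous_heart_disease | medication_usage | participated_in_free_screening | heart_attack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | Male | Rural | Middle | 0 | 1 | 211 | 0 | 83 | 0 | ... | 62 | 173 | 48 | 121 | 101 | Normal | 0 | 0 | 0 | 0 |
| 1 | 53 | Female | Urban | Low | 0 | 0 | 208 | 0 | 106 | 1 | ... | 76 | 70 | 58 | 83 | 138 | Normal | 1 | 0 | 1 | 0 |
| 2 | 62 | Female | Urban | Low | 0 | 0 | 231 | 1 | 112 | 1 | ... | 74 | 118 | 69 | 130 | 171 | Abnormal | 0 | 1 | 0 | 1 |
| 3 | 73 | Male | Urban | Low | 1 | 0 | 202 | 0 | 82 | 1 | ... | 65 | 98 | 52 | 85 | 146 | Normal | 0 | 1 | 1 | 0 |
| 4 | 52 | Male | Urban | Middle | 1 | 0 | 232 | 0 | 89 | 0 | ... | 75 | 104 | 59 | 127 | 139 | Normal | 1 | 0 | 1 | 1 |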


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 158355 entries, 0 to 158354
Data columns (total 28 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   age                             158355 non-null  int64  
 1   gender                          158355 non-null  object 
 2   region                          158355 non-null  object 
 3   income_level                    158355 non-null  object 
 4   hypertension                    158355 non-null  int64  
 5   diabetes                        158355 non-null  int64  
 6   cholesterol_level               158355 non-null  int64  
 7   obesity                         158355 non-null  int64  
 8   waist_circumference             158355 non-null  int64  
 9   family_history                  158355 non-null  int64  
 10  smoking_status                  158355 non-null  object 
 11  alcohol_consumption             63507 non-null   object 
 12  physical_activit

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 158355.0 | 54.543778 | 11.910897 | 25.0 | 46.0 | 55.0 | 63.0 | 90.0 |
| hypertension | 158355.0 | 0.299069 | 0.457851 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| diabetes | 158355.0 | 0.199804 | 0.399854 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| cholesterol_level | 158355.0 | 199.533264 | 39.737565 | 100.0 | 172.0 | 199.0 | 226.0 | 350.0 |
| obesity | 158355.0 | 0.249901 | 0.432957 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| waist_circumference | 158355.0 | 93.268504 | 16.382205 | 20.0 | 82.0 | 93.0 | 104.0 | 173.0 |
| family_history | 158355.0 | 0.300218 | 0.458354 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| sleep_hours | 158355.0 | 6.480064 | 1.425398 | 3.0 | 5.492985 | 6.507461 | 7.52064 | 9.0 |
| blood_pressure_systolic | 158355.0 | 129.515772 | 15.005641 | 61.0 | 119.0 | 130.0 | 140.0 | 199.0 |
| blood_pressure_diastolic | 158355.0 | 79.490809 | 10.002964 | 37.0 | 73.0 | 80.0 | 86.0 | 127.0 |
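
The `info()` summary shows `alcohol_consumption` with only 63507 non-null values out of 158355 rows, so it may be worth quantifying missing data per column before modeling. The snippet below is a minimal sketch of one way to do that. It runs on a small hypothetical frame (`toy`) rather than the real dataset, and the category values `"Moderate"` and `"High"` are placeholders, not values confirmed from the source file.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for health_condition_dataset.
# The category labels are illustrative assumptions, not taken from the real file.
toy = pd.DataFrame({
    "age": [60, 53, 62, 73, 52],
    "alcohol_consumption": ["Moderate", np.nan, np.nan, "High", np.nan],
})

# Count and share of missing values per column
missing_summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_share": toy.isna().mean(),
})

# Keep only the columns that actually contain missing values
missing_summary = missing_summary[missing_summary["missing_count"] > 0]
print(missing_summary)
```

Applying the same two lines to `health_condition_dataset` instead of `toy` would list every column with gaps, which helps decide between dropping, imputing, or recoding such columns.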


#### How clustering applies to our term project (50-word summary)

Clustering can segment individuals into groups based on shared health attributes such as blood pressure, cholesterol, and lifestyle factors. By identifying natural clusters, we can discover hidden risk patterns, target prevention programs, and personalize health recommendations without prior labels—enhancing early detection and understanding of heart attack risk profiles.

In [None]:
# Select relevant health indicators for clustering
features = [
    'age', 'cholesterol_level', 'blood_pressure_systolic',
    'blood_pressure_diastolic', 'fasting_blood_sugar',
    'cholesterol_hdl', 'cholesterol_ldl', 'triglycerides'
]
X = health_condition_dataset[features]

# Standardize each feature to zero mean and unit variance so that
# features on larger scales do not dominate the distance computation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
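
To illustrate how standardized features could feed into the imported `KMeans` and `PCA` classes, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the array `X_demo` is randomly generated with two artificial groups, and `n_clusters=2` is chosen only because of that construction. It is not the clustering configuration or result for the heart attack dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for eight health indicators (assumption: two well-separated
# groups of 100 rows each; this is not real patient data).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=1.0, size=(100, 8))
group_b = rng.normal(loc=4.0, scale=1.0, size=(100, 8))
X_demo = np.vstack([group_a, group_b])

# Standardize so every feature contributes on the same scale
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# k=2 matches the two synthetic groups; on real data k would need to be chosen,
# for example by comparing inertia or silhouette scores across several values
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_demo_scaled)

# Project to two principal components for a 2D view of the clusters
pca = PCA(n_components=2)
X_demo_2d = pca.fit_transform(X_demo_scaled)

print("cluster sizes:", np.bincount(labels))
print("explained variance ratio:", pca.explained_variance_ratio_)
```

On the project data, `X_scaled` would take the place of `X_demo_scaled`, and the two PCA components could be passed to a scatter plot colored by `labels` to inspect the clusters visually.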