# Unveiling Heart Health: Discovering Patterns and Trends Through Data Exploration

Let's dive into our Heart Disease data to learn more about it! Exploratory Data Analysis (EDA) is like a detective mission where we uncover clues about each piece of information. We'll see how they're related and how they affect whether someone has heart disease. Using graphs and charts, we'll make sense of everything and find any surprises along the way. This will help us understand heart health better and make smart choices to keep our hearts strong.

* Source: Kaggle
* Objective: Predict heart disease and identify its risk factors



## About dataset

* HeartDisease (Target): A binary variable indicating the presence or absence of heart disease.

* BMI (Body Mass Index): A numerical value relating a person's mass to their height, providing insight into weight status.

* Smoking: A binary variable indicating whether an individual is a smoker, a significant risk factor for cardiovascular disease. Smoking affects heart rate, blood pressure, and blood clotting.

* AlcoholDrinking: A binary variable indicating alcohol consumption, which can lead to both temporary and permanent heart-related issues.

* Stroke: A binary variable indicating the occurrence of stroke, with a focus on ischemic stroke caused by heart-related factors such as impaired heart function and thromboembolism.

* PhysicalHealth: A count of the number of days in a month an individual experienced poor physical health.

* MentalHealth: A count of the number of days in a month an individual experienced poor mental health.

* DiffWalking: A binary variable indicating difficulty walking or climbing stairs, reflecting mobility issues.

* Sex: A categorical variable indicating the sex of the individual.

* AgeCategory: A categorical variable representing the age category of the individual.

* Race: A categorical variable indicating the race of the individual.

* Diabetic: A binary variable indicating whether an individual has diabetes.

* PhysicalActivity: A binary variable indicating whether an individual engaged in physical activity or exercise in the past 30 days.

* GenHealth: A categorical variable representing the general well-being of the individual.

* SleepTime: A numerical variable indicating the number of hours of sleep an individual gets.

* Asthma: A binary variable indicating whether an individual has asthma.

* KidneyDisease: A binary variable indicating whether an individual has kidney disease.

* SkinCancer: A binary variable indicating whether an individual has skin cancer.

Import essential libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load the Dataset

In [21]:
df = pd.read_csv("C:\\Users\\anura\\Downloads\\heart_2021.csv")

Exploring the dataset

In [22]:
df.head()

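| | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | No | 16.6 | Yes | No | No | 3.0 | 30.0 | No | Female | 55-59 | White | Yes | Yes | Very good | 5.0 | Yes | No | Yes |
| 1 | No | 20.34 | No | No | Yes | 0.0 | 0.0 | No | Female | 80 or older | White | No | Yes | Very good | 7.0 | No | No | No |
| 2 | No | 26.58 | Yes | No | No | 20.0 | 30.0 | No | Male | 65-69 | White | Yes | Yes | Fair | 8.0 | Yes | No | No |
| 3 | No | 24.21 | No | No | No | 0.0 | 0.0 | No | Female | 75-79 | White | No | No | Good | 6.0 | No | No | Yes |
| 4 | No | 23.71 | No | No | No | 28.0 | 0.0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8.0 | No | No | No |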


In [23]:
df.tail()

| | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 319790 | Yes | 27.41 | Yes | No | No | 7.0 | 0.0 | Yes | Male | 60-64 | Hispanic | Yes | No | Fair | 6.0 | Yes | No | No |
| 319791 | No | 29.84 | Yes | No | No | 0.0 | 0.0 | No | Male | 35-39 | Hispanic | No | Yes | Very good | 5.0 | Yes | No | No |
| 319792 | No | 24.24 | No | No | No | 0.0 | 0.0 | No | Female | 45-49 | Hispanic | No | Yes | Good | 6.0 | No | No | No |
| 319793 | No | 32.81 | No | No | No | 0.0 | 0.0 | No | Female | 25-29 | Hispanic | No | No | Good | 12.0 | No | No | No |
| 319794 | No | 46.56 | No | No | No | 0.0 | 0.0 | No | Female | 80 or older | Hispanic | No | Yes | Good | 8.0 | No | No | No |


In [5]:
df.shape

(319795, 18)

In [17]:
df.columns

Index(['HeartDisease', 'BMI', 'Smoking', 'AlcoholDrinking', 'Stroke',
       'PhysicalHealth', 'MentalHealth', 'DiffWalking', 'Sex', 'AgeCategory',
       'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth', 'SleepTime',
       'Asthma', 'KidneyDisease', 'SkinCancer'],
      dtype='object')
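The dataset description calls several of these columns binary and others categorical. One way to check that against the data is to count the distinct values in each text column. The sketch below does this on five rows copied from the `df.head()` output, keeping only a few columns for brevity. The names `sample` and `binary_candidates` are illustrative, and the two-value threshold is our assumption.

```python
import pandas as pd

# Five rows copied from the df.head() output; only a few columns are kept.
sample = pd.DataFrame(
    {
        "HeartDisease": ["No", "No", "No", "No", "No"],
        "Smoking": ["Yes", "No", "Yes", "No", "No"],
        "Sex": ["Female", "Female", "Male", "Female", "Female"],
        "GenHealth": ["Very good", "Very good", "Fair", "Good", "Very good"],
        "BMI": [16.6, 20.34, 26.58, 24.21, 23.71],
    }
)

# Keep the non-numeric columns and count distinct values in each one.
text_columns = sample.select_dtypes(exclude="number").columns
distinct_counts = sample[text_columns].nunique()

# Assumption: a column with at most two distinct values is a binary candidate.
binary_candidates = distinct_counts[distinct_counts <= 2].index.tolist()
print(distinct_counts)
print(binary_candidates)
```

A five-row sample can undercount the distinct values, so the same two lines would need to run on the full `df` before trusting the result.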

In [18]:
df.dtypes

HeartDisease         object
BMI                 float64
Smoking              object
AlcoholDrinking      object
Stroke               object
PhysicalHealth      float64
MentalHealth        float64
DiffWalking          object
Sex                  object
AgeCategory          object
Race                 object
Diabetic             object
PhysicalActivity     object
GenHealth            object
SleepTime           float64
Asthma               object
KidneyDisease        object
SkinCancer           object
dtype: object
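Fourteen of the eighteen columns are stored as `object`, which means text. A possible next step, not something this notebook does, is to encode Yes/No columns as booleans and ordinal columns as ordered categories. The sketch below works on a small sample copied from the `df.head()` output. The list `yes_no_columns` and the ordering in `health_order` are assumptions: only "Fair", "Good" and "Very good" appear in the rows shown here, so "Poor" and "Excellent" are guesses at the remaining levels.

```python
import pandas as pd

# Five rows copied from the df.head() output; a subset of columns for brevity.
sample = pd.DataFrame(
    {
        "HeartDisease": ["No", "No", "No", "No", "No"],
        "Smoking": ["Yes", "No", "Yes", "No", "No"],
        "GenHealth": ["Very good", "Very good", "Fair", "Good", "Very good"],
        "SleepTime": [5.0, 7.0, 8.0, 6.0, 8.0],
    }
)

# Assumption: these columns contain only the strings "Yes" and "No".
yes_no_columns = ["HeartDisease", "Smoking"]

encoded = sample.copy()
for column in yes_no_columns:
    # map() yields NaN for any value missing from the dict, which exposes surprises.
    encoded[column] = encoded[column].map({"Yes": True, "No": False})

# Assumption: GenHealth follows this ordering; "Poor" and "Excellent" are guessed levels.
health_order = ["Poor", "Fair", "Good", "Very good", "Excellent"]
encoded["GenHealth"] = pd.Categorical(
    encoded["GenHealth"], categories=health_order, ordered=True
)

print(encoded.dtypes)
```

An ordered categorical allows comparisons such as `encoded["GenHealth"] >= "Good"`, which plain text columns do not support.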

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   HeartDisease      319795 non-null  object 
 1   BMI               319795 non-null  float64
 2   Smoking           319795 non-null  object 
 3   AlcoholDrinking   319795 non-null  object 
 4   Stroke            319795 non-null  object 
 5   PhysicalHealth    319795 non-null  float64
 6   MentalHealth      319795 non-null  float64
 7   DiffWalking       319795 non-null  object 
 8   Sex               319795 non-null  object 
 9   AgeCategory       319795 non-null  object 
 10  Race              319795 non-null  object 
 11  Diabetic          319795 non-null  object 
 12  PhysicalActivity  319795 non-null  object 
 13  GenHealth         319795 non-null  object 
 14  SleepTime         319795 non-null  float64
 15  Asthma            319795 non-null  object 
 16  KidneyDisease     31

In [8]:
df.describe()

| | BMI | PhysicalHealth | MentalHealth | SleepTime |
|---|---|---|---|---|
| count | 319795.0 | 319795.0 | 319795.0 | 319795.0 |
| mean | 28.325399 | 3.37171 | 3.898366 | 7.097075 |
| std | 6.3561 | 7.95085 | 7.955235 | 1.436007 |
| min | 12.02 | 0.0 | 0.0 | 1.0 |
| 25% | 24.03 | 0.0 | 0.0 | 6.0 |
| 50% | 27.34 | 0.0 | 0.0 | 7.0 |
| 75% | 31.42 | 2.0 | 3.0 | 8.0 |
| max | 94.85 | 30.0 | 30.0 | 24.0 |
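The summary statistics hint at extreme values, for example a maximum BMI of 94.85 against a 75th percentile of 31.42. As a rough check, we can work out the common 1.5 × IQR fences for BMI from the quartiles in the `df.describe()` output. The 1.5 multiplier is a convention we assume here, not a rule taken from the dataset.

```python
# Values copied from the BMI column of the df.describe() output.
bmi_q1, bmi_q3 = 24.03, 31.42
bmi_min, bmi_max = 12.02, 94.85

# Interquartile range and the conventional 1.5 * IQR fences (an assumed rule of thumb).
iqr = bmi_q3 - bmi_q1
lower_fence = bmi_q1 - 1.5 * iqr
upper_fence = bmi_q3 + 1.5 * iqr

print(f"IQR: {iqr:.2f}")
print(f"Fences: [{lower_fence:.3f}, {upper_fence:.3f}]")
print("Maximum above upper fence:", bmi_max > upper_fence)
print("Minimum below lower fence:", bmi_min < lower_fence)
```

Both the minimum and the maximum fall outside the fences, so BMI has values on each side that deserve a closer look in a box plot before any decision about removing them.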


In [24]:
# Count the rows that exactly repeat an earlier row
duplicates = df.duplicated().sum()
duplicates

np.int64(18078)

In [27]:
df.duplicated().any()  # True if at least one row exactly repeats an earlier row

np.True_

We should not drop these duplicates. The dataset is compiled from many individuals, and multiple people can have exactly the same health attributes. Rows that look identical may still represent different patients, so removing them would lose real information.
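To see what the duplicate count means in practice, it helps to know that `duplicated()` flags only the second and later copies of a row, while `keep=False` flags every copy. The sketch below shows the difference on a small made-up frame (`toy` is hypothetical, not part of the dataset) and then computes the share of flagged rows from the two figures reported above.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are identical on purpose.
toy = pd.DataFrame(
    {
        "HeartDisease": ["No", "No", "Yes", "No"],
        "BMI": [24.21, 24.21, 27.41, 23.71],
        "Smoking": ["No", "No", "Yes", "No"],
    }
)

# Default keep="first": only the repeat (row 1) is flagged.
repeat_count = int(toy.duplicated().sum())

# keep=False flags every member of a group of identical rows.
all_copies = toy[toy.duplicated(keep=False)]

# Share of flagged rows in the full dataset, using the counts reported above.
share = 18078 / 319795

print(repeat_count)
print(all_copies)
print(f"{share:.2%}")
```

On the full data, `df[df.duplicated(keep=False)]` would list every row that has an identical twin, which makes it easier to judge whether the matches look like plausible coincidences.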