# Analyse Health and Demographic Data to identify common traits leading to Heart Disease_Anudip

Import Libraries

In [1]:
import pandas as pd

Pandas is a powerful and flexible open-source data analysis and manipulation library for Python, providing data structures like DataFrames that make it easy to clean, transform, and analyze large datasets efficiently.

Import the Dataset

In [2]:
df = pd.read_csv('C:/Users/Vilas/Downloads/heart_disease_health_indicators_BRFSS2015 2 (1).csv')

Convert categorical variables to appropriate data types

In [3]:
categorical_columns = [
    'HeartDiseaseorAttack', 'HighBP', 'HighChol', 'CholCheck', 'Smoker',
    'Stroke', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump', 'AnyHealthcare',
    'NoDocbcCost', 'DiffWalk', 'Sex', 'Age', 'Education', 'Income'
]

In [4]:
for col in categorical_columns:
    df[col] = df[col].astype('category')

In [5]:
print(df.dtypes)

HeartDiseaseorAttack    category
HighBP                  category
HighChol                category
CholCheck               category
BMI                        int64
Smoker                  category
Stroke                  category
Diabetes                   int64
PhysActivity            category
Fruits                  category
Veggies                 category
HvyAlcoholConsump       category
AnyHealthcare           category
NoDocbcCost             category
GenHlth                    int64
MentHlth                   int64
PhysHlth                   int64
DiffWalk                category
Sex                     category
Age                     category
Education               category
Income                  category
dtype: object


Displaying the data types of each column in the DataFrame. The columns converted above (e.g., 'HeartDiseaseorAttack', 'HighBP', 'Smoker') now have dtype 'category', while the remaining columns ('BMI', 'Diabetes', 'GenHlth', 'MentHlth', 'PhysHlth') stay 'int64'. Of these, 'BMI', 'MentHlth', and 'PhysHlth' are numeric measures, whereas 'Diabetes' and 'GenHlth' are coded scales that are merely stored as integers. Knowing which columns are which helps in planning appropriate analysis and preprocessing.
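As a minimal sketch of the same idea on a toy frame (the values below are made up for illustration and are not taken from the BRFSS file), several columns can be converted in one `astype` call by passing a dictionary. Treating 'GenHlth' as an ordered categorical is an assumption added here; the notebook itself leaves it as 'int64'.

```python
import pandas as pd

# Toy data for illustration only; not taken from the dataset.
toy = pd.DataFrame({
    "HighBP": [0.0, 1.0, 1.0, 0.0],
    "GenHlth": [1, 3, 5, 2],
    "BMI": [24, 31, 28, 40],
})

# Assumption: GenHlth is an ordered 1-5 scale, so it can be an ordered categorical.
genhlth_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

# astype accepts a {column: dtype} mapping, so no explicit loop is needed.
toy = toy.astype({"HighBP": "category", "GenHlth": genhlth_dtype})

# Ordered categoricals support comparisons such as min() and max().
worst_rating = toy["GenHlth"].max()

print(toy.dtypes)
```

An ordered dtype keeps the ranking of the codes, which a plain 'category' column would not.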


Checking for Missing Values in the Dataset

In [6]:
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)

Missing values in each column:
 HeartDiseaseorAttack    0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
Diabetes                0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
Education               0
Income                  0
dtype: int64


Displaying the count of missing values for each column in the DataFrame. The output indicates that there are no missing values in any of the columns, suggesting that the dataset is complete and ready for further analysis.
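For a dataset that does contain gaps, the same check can report the share of missing values alongside the count. The sketch below uses a small made-up frame with deliberately missing entries, since the actual dataset has none.

```python
import numpy as np
import pandas as pd

# Toy data with deliberate gaps; the real dataset has no missing values.
toy = pd.DataFrame({
    "BMI": [24.0, np.nan, 28.0, 40.0],
    "Smoker": [1.0, 0.0, np.nan, np.nan],
})

# isnull() yields a boolean frame; sum() counts True values per column.
missing_count = toy.isnull().sum()

# The mean of a boolean column is the fraction of True values.
missing_share = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary)
```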

Removing Duplicates

In [7]:
duplicate_rows = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")

Number of duplicate rows: 23899


Calculating and printing the number of duplicate rows in the DataFrame.
Initially, the DataFrame has 23,899 duplicate rows.

In [8]:
df = df.drop_duplicates()

In [9]:
print("Shape of the dataset after removing duplicates:", df.shape)

Shape of the dataset after removing duplicates: (229781, 22)


Removing duplicate rows from the DataFrame and printing the shape of the dataset after cleanup.
After removing the 23,899 duplicates, the shape of the DataFrame is (229,781, 22), down from 253,680 rows.

In [10]:
print("Number of duplicate rows after cleanup:", df.duplicated().sum())

Number of duplicate rows after cleanup: 0


Re-checking the number of duplicate rows to confirm the cleanup was successful.
After cleanup, 0 duplicate rows remain.
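The duplicate handling above can be sketched on a toy frame (made-up values). `duplicated()` flags only the repeats of an earlier row, while `keep=False` flags every copy, which is useful for inspecting them before dropping. The `reset_index` call is an optional extra not used in the notebook; without it the original row labels are kept.

```python
import pandas as pd

# Toy data for illustration: rows 0, 1 and 3 are identical.
toy = pd.DataFrame({
    "HighBP": [1, 1, 0, 1],
    "BMI": [30, 30, 22, 30],
})

# By default only the second and later copies count as duplicates.
n_dupes = int(toy.duplicated().sum())

# keep=False marks every member of a duplicated group, including the first.
all_copies = toy[toy.duplicated(keep=False)]

# Drop the repeats and renumber the rows from 0 (optional).
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(all_copies), deduped.shape)
```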

In [11]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 229781 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype   
---  ------                --------------   -----   
 0   HeartDiseaseorAttack  229781 non-null  category
 1   HighBP                229781 non-null  category
 2   HighChol              229781 non-null  category
 3   CholCheck             229781 non-null  category
 4   BMI                   229781 non-null  int64   
 5   Smoker                229781 non-null  category
 6   Stroke                229781 non-null  category
 7   Diabetes              229781 non-null  int64   
 8   PhysActivity          229781 non-null  category
 9   Fruits                229781 non-null  category
 10  Veggies               229781 non-null  category
 11  HvyAlcoholConsump     229781 non-null  category
 12  AnyHealthcare         229781 non-null  category
 13  NoDocbcCost           229781 non-null  category
 14  GenHlth               229781 non-nul

Displaying the DataFrame information, including the number of entries, data columns, and data types. The DataFrame has 229,781 entries and 22 columns with a mix of categorical and numerical data types.

In [12]:
print(df.describe(include='all'))

        HeartDiseaseorAttack    HighBP  HighChol  CholCheck           BMI  \
count               229781.0  229781.0  229781.0   229781.0  229781.00000   
unique                   2.0       2.0       2.0        2.0           NaN   
top                      0.0       0.0       0.0        1.0           NaN   
freq                206064.0  125359.0  128273.0   220483.0           NaN   
mean                     NaN       NaN       NaN        NaN      28.68567   
std                      NaN       NaN       NaN        NaN       6.78636   
min                      NaN       NaN       NaN        NaN      12.00000   
25%                      NaN       NaN       NaN        NaN      24.00000   
50%                      NaN       NaN       NaN        NaN      27.00000   
75%                      NaN       NaN       NaN        NaN      32.00000   
max                      NaN       NaN       NaN        NaN      98.00000   

          Smoker    Stroke       Diabetes  PhysActivity    Fruits  ...  \
c

Displaying descriptive statistics for all columns, including categorical and numerical data.
This provides insights into the distribution and unique values of each column.
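Because `include='all'` mixes both kinds of statistics and fills the inapplicable cells with NaN, it can be easier to read the two summaries separately. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "HighBP": pd.Categorical([0, 1, 1, 0]),
    "BMI": [24, 31, 28, 40],
})

# With mixed dtypes, describe() summarises only the numeric columns by default.
numeric_summary = toy.describe()

# include="category" restricts the summary to categorical columns
# (count, unique, top, freq).
categorical_summary = toy.describe(include="category")

print(numeric_summary)
print(categorical_summary)
```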

In [13]:
df.head()

Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0,1,1,1,40,1,0,0,0,0,...,1,0,5,18,15,1,0,9,4,3
1,0,0,0,0,25,1,0,0,1,0,...,0,1,3,0,0,0,0,7,6,1
2,0,1,1,1,28,0,0,0,0,1,...,1,1,5,30,30,1,0,9,4,8
3,0,1,0,1,27,0,0,0,1,1,...,1,0,2,0,0,0,0,11,3,6
4,0,1,1,1,24,0,0,0,1,1,...,1,0,2,3,0,0,0,11,5,4


Displaying the first few rows of the DataFrame to get a preview of the data.
This helps in understanding the structure and content of the dataset.

In [14]:
# Columns coded 0 = No, 1 = Yes.
# HeartDiseaseorAttack is binary; the 0/1/2 coding belongs to 'Diabetes', which stays numeric here.
yes_no_columns = [
    'HeartDiseaseorAttack', 'HighBP', 'HighChol', 'CholCheck', 'Smoker',
    'Stroke', 'PhysActivity', 'Fruits', 'Veggies', 'HvyAlcoholConsump',
    'AnyHealthcare', 'NoDocbcCost', 'DiffWalk'
]
# These columns are categorical, so relabel their categories and assign the result back,
# rather than calling replace(..., inplace=True) on a column selection.
for col in yes_no_columns:
    df[col] = df[col].cat.rename_categories({0: 'No', 1: 'Yes'})
df['Sex'] = df['Sex'].cat.rename_categories({0: 'Female', 1: 'Male'})

1) Diabetes: 0 = no diabetes 1 = prediabetes 2 = diabetes
2) HighBP: 0 = no high BP 1 = high BP
3) HighChol: 0 = no high cholesterol 1 = high cholesterol
4) CholCheck: 0 = no cholesterol check in 5 years 1 = yes cholesterol check in 5 years
5) BMI: Body Mass Index
6) Smoker: Have you smoked at least 100 cigarettes in your entire life? (Note: 5 packs = 100 cigarettes) 0 = no 1 = yes
7) Stroke: (Ever told) you had a stroke. 0 = no 1 = yes
8) HeartDiseaseorAttack: coronary heart disease (CHD) or myocardial infarction (MI) 0 = no 1 = yes
9) PhysActivity: physical activity in past 30 days - not including job 0 = no 1 = yes
10) Fruits: Consume Fruit 1 or more times per day 0 = no 1 = yes
11) Veggies: Consume Vegetables 1 or more times per day 0 = no 1 = yes
12) HvyAlcoholConsump: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) 0 = no 1 = yes
13) AnyHealthcare: Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no 1 = yes
14) NoDocbcCost: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no 1 = yes
15) GenHlth: Would you say that in general your health is: scale 1-5 1 = excellent 2 = very good 3 = good 4 = fair 5 = poor
16) MentHlth: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? scale 0-30 days
17) PhysHlth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? scale 0-30 days
18) DiffWalk: Do you have serious difficulty walking or climbing stairs? 0 = no 1 = yes
19) Sex: 0 = female 1 = male
20) Age: 13-level age category (_AGEG5YR see codebook) 1 = 18-24 9 = 60-64 13 = 80 or older
21) Education: Education level (EDUCA see codebook) scale 1-6 
1 = Never attended school or only kindergarten 
2 = Grades 1 through 8 (Elementary) 
3 = Grades 9 through 11 (Some high school) 
4 = Grade 12 or GED (High school graduate) 
5 = College 1 year to 3 years (Some college or technical school) 
6 = College 4 years or more (College graduate)
22) Income: Income scale (INCOME2 see codebook) scale 1-8 
1 = less than $10,000 
5 = less than $35,000 
8 = $75,000 or more
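The 'Diabetes' column is left as integer codes in the notebook. If readable labels are wanted for it as well, one option is `Series.map`. The sketch below assumes the coding 0 = no diabetes, 1 = prediabetes, 2 = diabetes, and the column name `DiabetesLabel` is a hypothetical name chosen for illustration.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"Diabetes": [0, 2, 1, 0]})

# Assumed coding for the Diabetes column.
diabetes_labels = {0: "No diabetes", 1: "Prediabetes", 2: "Diabetes"}

# map() returns NaN for any code missing from the dictionary,
# so unexpected values are easy to detect afterwards.
toy["DiabetesLabel"] = toy["Diabetes"].map(diabetes_labels).astype("category")
unmapped = int(toy["DiabetesLabel"].isnull().sum())

print(toy)
print("Unmapped codes:", unmapped)
```

Keeping the original numeric column next to the labelled one leaves the codes available for any later numeric work.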

In [15]:
df.head()

Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,No,Yes,Yes,Yes,40,Yes,No,0,No,No,...,Yes,No,5,18,15,Yes,Female,9,4,3
1,No,No,No,No,25,Yes,No,0,Yes,No,...,No,Yes,3,0,0,No,Female,7,6,1
2,No,Yes,Yes,Yes,28,No,No,0,No,Yes,...,Yes,Yes,5,30,30,Yes,Female,9,4,8
3,No,Yes,No,Yes,27,No,No,0,Yes,Yes,...,Yes,No,2,0,0,No,Female,11,3,6
4,No,Yes,Yes,Yes,24,No,No,0,Yes,Yes,...,Yes,No,2,3,0,No,Female,11,5,4


The output above shows the first 5 rows of the DataFrame with the updated labels.

In [17]:
df.to_csv('C:/Users/Vilas/Downloads/cleaned_dataset.csv', index=False)

The cleaned DataFrame is being saved to a CSV file at the specified path, and the index is excluded from the output file.
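One thing to keep in mind is that CSV stores no type information, so 'category' columns come back as plain text or numbers when the file is reloaded. A minimal sketch of the round trip, using an in-memory buffer instead of a file path and made-up values:

```python
import io

import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "Sex": pd.Categorical(["Female", "Male", "Female"]),
    "BMI": [24, 31, 28],
})

# Write without the index, as in the notebook, but to an in-memory buffer.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain reload infers dtypes from the text; the category dtype is lost.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Passing dtype= restores the categorical type on load.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={"Sex": "category"})

print(reloaded.dtypes)
print(restored.dtypes)
```

Since `index=False` was used when writing, no extra "Unnamed: 0" column appears on reload.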