<a href="https://colab.research.google.com/github/Arwa678/IT326-Group-3/blob/main/Phase1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About the project
In our project, we chose the **Heart Health dataset** because heart disease is one of the leading causes of death worldwide.
Understanding the factors that contribute to heart disease is crucial for prevention. The dataset provides important information such as
age, gender, blood pressure, cholesterol levels, and other health-related factors, which are key indicators of heart health.

By analyzing this data, we hope to identify patterns and risk factors that can help predict heart disease.
This could be valuable for healthcare professionals in developing strategies to prevent heart disease and improve patient care.
Our goal is to contribute to better heart health by using data to spot trends and offer insights into managing heart disease risk.

## Dataset Source:
The Heart Health dataset is provided for research and analysis to better understand the risk factors contributing to heart disease.

link: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease?resource=download

## Goal:
The goal is to predict the likelihood of a heart attack from various health metrics, including blood pressure, cholesterol, glucose levels, smoking habits, and exercise patterns. This analysis can support early detection and prevention strategies for heart disease.

## Class label
The class label "HadHeartAttack" is a binary variable indicating whether a respondent has experienced a heart attack. It is "Yes" if the respondent has had a heart attack and "No" otherwise. This label is the target for classification tasks aimed at identifying individuals at risk of a heart attack.
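
Because the label is binary, it helps to check how the two classes are balanced before any modeling. The sketch below shows one way to do that with `value_counts`; the `labels` series is a made-up stand-in for the real column, not data from the dataset.

```python
import pandas as pd

# Toy stand-in for the HadHeartAttack column (illustrative values only)
labels = pd.Series(["No", "No", "Yes", "No", None], name="HadHeartAttack")

# Raw counts per answer, keeping missing responses visible
counts = labels.value_counts(dropna=False)

# Share of each answer among the rows that actually have a response
shares = labels.value_counts(normalize=True)

print(counts)
print(shares)
```

On the real data, the same two calls on `df['HadHeartAttack']` would show whether one class dominates the other.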

In [None]:
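import pandas as pd
import matplotlib.pyplot as plt

# Load the raw CSV; read_csv already returns a DataFrame
df = pd.read_csv('/content/heart_2022_with_nans.csv')
# Keep a separate copy of the raw data for later reference
df1 = df.copy()
print(df1)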

                 State     Sex GeneralHealth  PhysicalHealthDays  \
0              Alabama  Female     Very good                 0.0   
1              Alabama  Female     Excellent                 0.0   
2              Alabama  Female     Very good                 2.0   
3              Alabama  Female     Excellent                 0.0   
4              Alabama  Female          Fair                 2.0   
...                ...     ...           ...                 ...   
445127  Virgin Islands  Female          Good                 0.0   
445128  Virgin Islands  Female     Excellent                 2.0   
445129  Virgin Islands  Female          Poor                30.0   
445130  Virgin Islands    Male     Very good                 0.0   
445131  Virgin Islands    Male     Very good                 0.0   

        MentalHealthDays                                    LastCheckupTime  \
0                    0.0  Within past year (anytime less than 12 months ...   
1                    0.0 

In [None]:
# Move the class label to the last column
class_label = df['HadHeartAttack']
df = df.drop(columns=['HadHeartAttack'])
df['HadHeartAttack'] = class_label

In [None]:
# Coerce non-numeric entries to NaN
df['PhysicalHealthDays'] = pd.to_numeric(df['PhysicalHealthDays'], errors='coerce')
df['MentalHealthDays'] = pd.to_numeric(df['MentalHealthDays'], errors='coerce')
# RemovedTeeth has no numeric values, so this coerces the whole column to NaN
# (the df.info() output reports 0 non-null for it)
df['RemovedTeeth'] = pd.to_numeric(df['RemovedTeeth'], errors='coerce')

# Apply the type conversion to all columns.
# Note: astype('bool') treats every non-empty string and NaN as True,
# so the Yes/No columns need an explicit mapping before they are analyzed.
df = df.astype({'State': 'string', 'Sex': 'string', 'GeneralHealth': 'string',
    'PhysicalHealthDays': 'Int64',  # nullable integer, keeps missing values
    'MentalHealthDays': 'Int64',    # nullable integer, keeps missing values
    'RemovedTeeth': 'Int64',        # nullable integer, keeps missing values
    'LastCheckupTime': 'string', 'PhysicalActivities': 'bool', 'SleepHours': 'float',
    'HadHeartAttack': 'bool', 'HadAngina': 'bool', 'HadStroke': 'bool', 'HadAsthma': 'bool',
    'HadSkinCancer': 'bool', 'HadCOPD': 'bool', 'HadDepressiveDisorder': 'bool',
    'HadKidneyDisease': 'bool', 'HadArthritis': 'bool', 'HadDiabetes': 'bool',
    'DeafOrHardOfHearing': 'bool', 'BlindOrVisionDifficulty': 'bool',
    'DifficultyConcentrating': 'bool', 'DifficultyWalking': 'bool',
    'DifficultyDressingBathing': 'bool', 'DifficultyErrands': 'bool',
    'SmokerStatus': 'string', 'ECigaretteUsage': 'bool', 'ChestScan': 'bool',
    'RaceEthnicityCategory': 'string', 'AgeCategory': 'string', 'HeightInMeters': 'float',
    'WeightInKilograms': 'float', 'BMI': 'float', 'AlcoholDrinkers': 'bool',
    'HIVTesting': 'bool', 'FluVaxLast12': 'bool', 'PneumoVaxEver': 'bool',
    'TetanusLast10Tdap': 'bool', 'HighRiskLastYear': 'bool', 'CovidPos': 'bool'})
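
One caveat with converting text columns straight to `bool`: pandas falls back on Python truthiness, so "No" and missing values both come out as `True`. The sketch below contrasts that with an explicit Yes/No mapping into the nullable `boolean` dtype. The `toy` frame and its values are made up for illustration, and the mapping assumes the column holds only "Yes", "No", or missing entries.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for two columns (illustrative values only).
# The text column is forced to object dtype to mimic raw CSV text.
toy = pd.DataFrame({
    "HadAngina": pd.Series(["Yes", "No", np.nan], dtype=object),
    "PhysicalHealthDays": [0.0, 2.0, np.nan],
})

# Plain bool conversion relies on truthiness: every entry becomes True
naive = toy["HadAngina"].astype("bool")

# Explicit mapping keeps "No" as False and missing values as <NA>
mapped = toy["HadAngina"].map({"Yes": True, "No": False}).astype("boolean")

# Whole-number floats convert to nullable Int64 with missing values preserved
days = toy["PhysicalHealthDays"].astype("Int64")

print(naive.tolist())
print(mapped)
print(days)
```

Columns with more than two answer categories would not fit this mapping and are probably better kept as `string` or `category`.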


The code below shows the data type of each attribute.

In [None]:
print(df.dtypes)

State                        string[python]
Sex                          string[python]
GeneralHealth                string[python]
PhysicalHealthDays                    Int64
MentalHealthDays                      Int64
LastCheckupTime              string[python]
PhysicalActivities                     bool
SleepHours                          float64
RemovedTeeth                          Int64
HadAngina                              bool
HadStroke                              bool
HadAsthma                              bool
HadSkinCancer                          bool
HadCOPD                                bool
HadDepressiveDisorder                  bool
HadKidneyDisease                       bool
HadArthritis                           bool
HadDiabetes                            bool
DeafOrHardOfHearing                    bool
BlindOrVisionDifficulty                bool
DifficultyConcentrating                bool
DifficultyWalking                      bool
DifficultyDressingBathing       

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      445132 non-null  string 
 1   Sex                        445132 non-null  string 
 2   GeneralHealth              443934 non-null  string 
 3   PhysicalHealthDays         434205 non-null  Int64  
 4   MentalHealthDays           436065 non-null  Int64  
 5   LastCheckupTime            436824 non-null  string 
 6   PhysicalActivities         445132 non-null  bool   
 7   SleepHours                 439679 non-null  float64
 8   RemovedTeeth               0 non-null       Int64  
 9   HadAngina                  445132 non-null  bool   
 10  HadStroke                  445132 non-null  bool   
 11  HadAsthma                  445132 non-null  bool   
 12  HadSkinCancer              445132 non-null  bool   
 13  HadCOPD                    44
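
The non-null counts reported by `df.info()` can also be turned into a per-column summary of missing values. The following is a minimal sketch on a made-up `toy` frame (the values are illustrative, not taken from the dataset); the same expressions should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (illustrative values only)
toy = pd.DataFrame({
    "GeneralHealth": ["Good", None, "Fair", "Poor"],
    "SleepHours": [7.0, np.nan, np.nan, 6.0],
    "State": ["Alabama", "Alabama", "Alaska", "Alaska"],
})

# Count and percentage of missing entries per column, worst columns first
missing = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(2),
}).sort_values("missing_count", ascending=False)

print(missing)
```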

The code below lists all the attributes of the dataset.

In [None]:
col_name = df1.columns
print(col_name)

Index(['State', 'Sex', 'GeneralHealth', 'PhysicalHealthDays',
       'MentalHealthDays', 'LastCheckupTime', 'PhysicalActivities',
       'SleepHours', 'RemovedTeeth', 'HadHeartAttack', 'HadAngina',
       'HadStroke', 'HadAsthma', 'HadSkinCancer', 'HadCOPD',
       'HadDepressiveDisorder', 'HadKidneyDisease', 'HadArthritis',
       'HadDiabetes', 'DeafOrHardOfHearing', 'BlindOrVisionDifficulty',
       'DifficultyConcentrating', 'DifficultyWalking',
       'DifficultyDressingBathing', 'DifficultyErrands', 'SmokerStatus',
       'ECigaretteUsage', 'ChestScan', 'RaceEthnicityCategory', 'AgeCategory',
       'HeightInMeters', 'WeightInKilograms', 'BMI', 'AlcoholDrinkers',
       'HIVTesting', 'FluVaxLast12', 'PneumoVaxEver', 'TetanusLast10Tdap',
       'HighRiskLastYear', 'CovidPos'],
      dtype='object')


Our dataset contains 445,132 samples and 40 attributes.

In [None]:
col = df1.columns
print('Number of attributes :',len(col))

Number of attributes : 40


In [None]:
num_of_rows = len(df1)
print('Number of samples :',num_of_rows)

Number of samples : 445132
