# HAI - Heart Attack Indicators

## Data Loading

## Introduction

The primary objective of this notebook is to load the dataset used in the HAI project. The dataset is derived from the CDC's Behavioral Risk Factor Surveillance System (BRFSS), a self-reported survey, and contains 400K+ responses from across the United States.

Cardiovascular disease is a leading cause of mortality globally, so identifying and understanding the factors that contribute to heart attacks is imperative. This dataset includes many variables related to health conditions that are potentially associated with heart disease. This initial phase focuses on the following key task:

 - Data Importation: Importing the full CDC dataset with null values.

By the end of this notebook, the raw dataset will be loaded and ready for Exploratory Data Analysis (EDA) in the next notebook. This step is crucial because it sets the stage for accurate and insightful analyses that may help inform the health sector about which features are most strongly associated with heart attacks.

## Libraries

Only three libraries need to be imported: pandas, numpy, and matplotlib.pyplot.

In [None]:
# Imported libraries
import pandas as pd # pandas: data loading and manipulation
import numpy as np # numpy: numerical operations
import matplotlib.pyplot as plt # pyplot: the plotting interface of the matplotlib library

## Global Parameters

This cell sets the global parameters for the project. All dataframe columns are displayed, and the default figure size is set to 8.0 by 6.0 inches, a good general-purpose size that shows plots with enough context without squashing them until they are unreadable.

In [None]:
# Show all dataframe columns
pd.set_option('display.max_columns', None)
# Set matplotlib global settings
plt.rcParams['figure.figsize'] = (8.0, 6.0)

## Helper Functions

Below is the helper function used in this project.

The function dq_checks is a sanity check that takes a dataframe and returns a summary of its irregularities: its shape, the number of missing values, and the number of duplicated rows and columns.

In [None]:
# Summarises basic data-quality issues in the given dataframe
def dq_checks(df):
    print("+----------DataFrame Quality Report----------+")
    n_rows, n_cols = df.shape
    n_nulls = df.isna().sum().sum() # two sums: null count per column, then the total across all columns
    n_row_dups = df.duplicated().sum() # rows identical to an earlier row
    n_col_dups = df.T.duplicated().sum() # transpose so that duplicated columns are checked as rows
    return (
        f"""
    No. of rows: {n_rows}
    No. of columns: {n_cols}
    No. of missing values: {n_nulls}
    No. of duplicated rows: {n_row_dups}
    No. of duplicated columns: {n_col_dups}
    """
    )
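
To make the three counts in dq_checks concrete, here is a minimal sketch that runs the same pandas expressions on a small made-up dataframe. The dataframe toy and its column names are assumptions for illustration only and are not part of the CDC dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" copies column "a",
# row 2 repeats row 1, and one value is missing.
toy = pd.DataFrame({
    "a": [1, 2, 2, 4],
    "b": [1, 2, 2, 4],
    "c": [5.0, 6.0, 6.0, np.nan],
})

# Same expressions as in dq_checks, cast to plain ints for readability
n_nulls = int(toy.isna().sum().sum())       # total missing values across all columns
n_row_dups = int(toy.duplicated().sum())    # rows identical to an earlier row
n_col_dups = int(toy.T.duplicated().sum())  # columns identical to an earlier column

print(n_nulls, n_row_dups, n_col_dups)
```

Each count here is 1: one missing value, one repeated row, and one repeated column.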

## Data Loading

Loading the data needed to progress with the capstone project.

In [None]:
# Data loading for github only

CLN_DATA_PATH = '../data/heart_2022_with_nans.csv'

try:
    heart_attack_raw = pd.read_csv(CLN_DATA_PATH)
    print("Data loaded successfully.")
except FileNotFoundError:
    # heart_attack_raw is not defined if loading fails
    print(f"ERROR: The data file does not exist: {CLN_DATA_PATH}")
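
As an alternative to printing an error, the sketch below shows one possible way to fail loudly when the file is missing, so later cells never run against an undefined dataframe. The helper load_csv and the demo file with its column names are hypothetical and only illustrate the idea; they are not part of the project code.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv(path):
    """Hypothetical helper: read a CSV, or raise a clear error if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Data file not found: {path}")
    return pd.read_csv(path)


# Demo with a throwaway file (made-up columns, not the CDC data)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("col_a,col_b\nNo,27.5\nYes,\n")
    demo = load_csv(demo_path)

print(demo.shape)
```

Raising instead of printing stops the notebook at the loading cell, which makes a wrong path easier to spot than a NameError in a later cell.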

## Conclusion

In this notebook, I imported the heart_2022_with_nans.csv dataset. It will serve as the main dataset for the analysis of heart attack indicators, and loading it establishes the foundation for the subsequent Exploratory Data Analysis (EDA) notebook.

The next steps are detailed inspection, data cleaning, and insight extraction, all of which will be carried out in the EDA notebook. Ensuring the integrity and readiness of the data is crucial as I progress towards identifying and understanding the factors that contribute to heart disease. Careful preparation of the dataset will make it possible to derive meaningful insights and potentially inform public health strategies aimed at reducing the prevalence of heart attacks.

---