# Data Exploration

# Data First Look
The following descriptions come from the Kaggle page https://www.kaggle.com/datasets/bhaveshmisra/heart-disease-indicators/, where the data was obtained.

| **Column Name** | **Description** | **Markers** |
|:---:|:---:|:---:|
| HeartDiseaseorAttack | Indicates whether the individual has had a heart disease or heart attack | (binary: 0 = No, 1 = Yes) |
| HighBP | High blood pressure status | (binary: 0 = No, 1 = Yes) |
| HighChol | High cholesterol status | (binary: 0 = No, 1 = Yes) |
| CholCheck | Frequency of cholesterol check | categorical |
| BMI | Body Mass Index | continuous |
| Smoker | Smoking status | (binary: 0 = No, 1 = Yes) |
| Stroke | History of stroke | (binary: 0 = No, 1 = Yes) |
| Diabetes | Diabetes status | (binary: 0 = No, 1 = Yes) |
| PhysActivity | Level of physical activity | categorical |
| Fruits | Frequency of fruit consumption | categorical |
| Veggies | Frequency of vegetable consumption | categorical |
| HvyAlcoholConsump | Heavy alcohol consumption status | (binary: 0 = No, 1 = Yes) |
| AnyHealthcare | Access to any healthcare | (binary: 0 = No, 1 = Yes) |
| NoDocbcCost | No doctor because of cost | (binary: 0 = No, 1 = Yes) |
| GenHlth | General health assessment | categorical |
| MentHlth | Mental health assessment | categorical |
| PhysHlth | Physical health assessment | categorical |
| DiffWalk | Difficulty walking status | (binary: 0 = No, 1 = Yes) |
| Sex | Gender of the individual | (binary: 0 = Female, 1 = Male) |
| Age | Age of the individual | continuous |
| Education | Educational level | categorical |
| Income | Income level | categorical |

In [None]:
import pandas as pd 

df = pd.read_csv('data/heart_disease_health_indicators.csv')
df.head()

In [None]:
## Check DF for missing and duplicate values
def rework_dataframe(data):
    # Count fully duplicated rows
    num_dups = data.duplicated().sum()
    # isna().sum() returns one count per column; sum again to get a single total
    num_na = data.isna().sum().sum()
    if num_dups > 0:
        print('Duplicates are detected: {}'.format(num_dups))
    if num_na > 0:
        print('Missing Values detected: {}'.format(num_na))

In [None]:

#rework_dataframe(df)
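# Print explicitly: only the last expression in a cell is displayed automatically
print('Duplicate rows: {}'.format(df.duplicated().sum()))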
df.info()

In [None]:
# Correlation of every column with the target, computed once and sorted ascending
target_corr = df.corrwith(df.HeartDiseaseorAttack)
sorted_corr = sorted(zip(target_corr.index, target_corr.values), key=lambda x: x[1])
sorted_corr

## First Look
1. Duplicates are acceptable because rows carry no individual identifier, so identical rows may belong to different respondents.
2. All columns have an integer data type. This means the categorical columns are binned as integers rather than string categories.
3. The dataset has **253,661** entries.
4. General Health and Age are the two indicators most correlated with heart disease.
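
To make the ranking in point 4 easier to read, the correlations can be ordered by magnitude, so that strong negative correlations are not pushed to the far end of the list. The following is a minimal sketch: the helper name `rank_by_abs_corr` and the toy frame are assumptions for illustration, not part of the notebook or the Kaggle data.

```python
import pandas as pd

def rank_by_abs_corr(data: pd.DataFrame, target: str) -> pd.Series:
    """Return Pearson correlations with `target`, largest magnitude first."""
    corr = data.drop(columns=target).corrwith(data[target])
    # Order by |r| but keep the signed values so the direction stays visible
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]

# Tiny made-up frame, for illustration only (not the Kaggle data)
toy = pd.DataFrame({
    "HeartDiseaseorAttack": [0, 0, 0, 1, 1, 1],
    "GenHlth":              [1, 2, 2, 4, 5, 4],
    "Income":               [8, 7, 6, 3, 2, 4],
    "Fruits":               [1, 0, 1, 0, 1, 0],
})
ranked = rank_by_abs_corr(toy, "HeartDiseaseorAttack")
print(ranked)
```

On the real data the call would be `rank_by_abs_corr(df, 'HeartDiseaseorAttack')`.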

Surprisingly, the indicators I expected to be more impactful were lower in the list. It might be interesting to look at how the indicators correlate with one another rather than only with the target.
Since age and general health are such broad terms, they could be umbrellas for other issues. For example, assuming that a higher general health number means a worse assessment, higher blood pressure and cholesterol levels would be expected to accompany it.
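
One way to look at how the indicators relate to each other is to list the predictor pairs with the largest absolute Pearson correlation. This is only a sketch under assumptions: `top_pairwise_corr` is a hypothetical helper, and the toy frame is made up for illustration.

```python
import pandas as pd

def top_pairwise_corr(data: pd.DataFrame, target: str, n: int = 5) -> pd.DataFrame:
    """Return the n predictor pairs with the largest |Pearson r|."""
    corr = data.drop(columns=target).corr()
    cols = list(corr.columns)
    # Visit each unordered pair once (upper triangle of the matrix)
    rows = [(a, b, corr.loc[a, b])
            for i, a in enumerate(cols)
            for b in cols[i + 1:]]
    rows.sort(key=lambda row: abs(row[2]), reverse=True)
    return pd.DataFrame(rows[:n], columns=["col_a", "col_b", "r"])

# Tiny made-up frame, for illustration only (not the Kaggle data)
toy_pairs = pd.DataFrame({
    "HeartDiseaseorAttack": [0, 0, 1, 0, 1, 1],
    "GenHlth":              [1, 2, 2, 4, 5, 4],
    "HighBP":               [0, 0, 0, 1, 1, 1],
    "Fruits":               [1, 0, 1, 0, 1, 0],
})
top = top_pairwise_corr(toy_pairs, "HeartDiseaseorAttack")
print(top)
```

A strong correlation between, say, `GenHlth` and `HighBP` on the real data would support the umbrella idea, but it would not by itself show which direction the `GenHlth` scale runs.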

This presents a problem: we are assuming what the metrics mean, and since the dataset didn't come with an explanation of the columns, it would be wrong to treat them as factual. Therefore, I will be removing the columns with less meaningful metrics.
General Health and Mental Health are two of the uncertain metrics: the description doesn't state whether a higher number is better or worse.
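
A quick sanity check on the direction of such a scale is to compare the heart disease rate at each level. The sketch below assumes a hypothetical helper `rate_by_level` and uses made-up toy values. A rate that rises with the code would suggest, but not prove, that higher numbers mean worse health; only the dataset's codebook can settle it.

```python
import pandas as pd

def rate_by_level(data: pd.DataFrame, column: str, target: str) -> pd.Series:
    """Share of rows with target == 1 for each value of `column`."""
    # The mean of a 0/1 column is the proportion of ones
    return data.groupby(column)[target].mean().sort_index()

# Tiny made-up frame, for illustration only (not the Kaggle data)
toy_levels = pd.DataFrame({
    "GenHlth":              [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "HeartDiseaseorAttack": [0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
})
rates = rate_by_level(toy_levels, "GenHlth", "HeartDiseaseorAttack")
print(rates)
```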

In [None]:
df.corrwith(df.GenHlth)

In [None]:
import matplotlib.pyplot as plt
for column in df.columns[1:]:
    temp_df = df[[column, 'HeartDiseaseorAttack']]
    

In [None]:
cont_col = ['BMI', 'Age']
cat_col = ['CholCheck', 'PhysActivity', 'Fruits', 'Veggies', 'GenHlth','MentHlth', 'PhysHlth', 'Education', 'Income']
bin_col = ['HighBP', 'HighChol', 'Smoker', 'Stroke', 'Diabetes','HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'DiffWalk', 'Sex']
target = 'HeartDiseaseorAttack'

cat_df = df[cat_col]
bin_df = df[bin_col]
cont_df = df[cont_col]

In [None]:
for col in cat_df.columns:
    display(cat_df[col].value_counts())

In [None]:
for col in cont_df.columns:
    display(cont_df[col].value_counts())

In [None]:
for col in bin_df.columns:
    display(bin_df[col].value_counts())

## Adjustments
It appears that some of the columns were not labeled correctly and should be moved to the correct type. Some of the mislabels make sense: for example, Diabetes takes the values 0, 1, and 2, which I infer to mean 0 (not diabetic), 1 (type 1), and 2 (type 2). This, however, requires an assumption, which I would rather not rely on until I have more information to use for this project.

There is also some ambiguity: some of the columns described as categorical contain only binary values. Following the dataset author's convention, we can assume that 0 is the absence and 1 the presence of the trait. Since no reference metric is given, I will assume that 1 means the patient meets the recommended amount of fruit, vegetables, and physical activity in the United States.
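
The regrouping could also be checked programmatically by counting the distinct values in each column. The following is a rough sketch, not the method used in this notebook: `split_by_cardinality` is a hypothetical helper, the cutoff `max_categories=10` is an arbitrary assumption, and the toy frame is made up.

```python
import pandas as pd

def split_by_cardinality(data: pd.DataFrame, target: str, max_categories: int = 10) -> dict:
    """Group columns by how many distinct values they contain."""
    groups = {"binary": [], "categorical": [], "continuous": []}
    for col in data.columns:
        if col == target:
            continue
        n_unique = data[col].nunique()
        if n_unique <= 2:
            groups["binary"].append(col)
        elif n_unique <= max_categories:
            groups["categorical"].append(col)
        else:
            # Many distinct values: treat as continuous (a heuristic only)
            groups["continuous"].append(col)
    return groups

# Tiny made-up frame, for illustration only (not the Kaggle data)
toy_types = pd.DataFrame({
    "HeartDiseaseorAttack": [0, 1] * 6,
    "Fruits":               [0, 1] * 6,
    "Diabetes":             [0, 1, 2] * 4,
    "BMI":                  list(range(20, 32)),
})
groups = split_by_cardinality(toy_types, "HeartDiseaseorAttack")
print(groups)
```

The cutoff only flags candidates. Whether an integer-coded column is truly ordinal, nominal, or continuous still depends on what the codes mean.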

In [66]:
cont_col = ['BMI', 'Age']
cat_col = ['Education', 'Income', 'Diabetes']
bin_col = ['CholCheck', 'PhysActivity', 'Fruits', 'Veggies', 'HighBP', 'HighChol', 'Smoker', 'Stroke', 'HvyAlcoholConsump', 'AnyHealthcare', 'NoDocbcCost', 'DiffWalk', 'Sex']
on_hold = ['GenHlth','MentHlth', 'PhysHlth']
target = 'HeartDiseaseorAttack'

cat_df = df[cat_col]
bin_df = df[bin_col]
cont_df = df[cont_col]

In [70]:
print('Binary Columns')
for col in bin_df.columns:
    display(bin_df[col].value_counts())
print('\n') 
print('Continuous Columns')    
for col in cont_df.columns:
    display(cont_df[col].value_counts())
print('\n') 
print('Categorical Columns')
for col in cat_df.columns:
    display(cat_df[col].value_counts())

Binary Columns


1    244191
0      9470
Name: CholCheck, dtype: int64

1    191914
0     61747
Name: PhysActivity, dtype: int64

1    160888
0     92773
Name: Fruits, dtype: int64

1    205830
0     47831
Name: Veggies, dtype: int64

0    144843
1    108818
Name: HighBP, dtype: int64

0    146080
1    107581
Name: HighChol, dtype: int64

0    141242
1    112419
Name: Smoker, dtype: int64

0    243370
1     10291
Name: Stroke, dtype: int64

0    239405
1     14256
Name: HvyAlcoholConsump, dtype: int64

1    241244
0     12417
Name: AnyHealthcare, dtype: int64

0    232312
1     21349
Name: NoDocbcCost, dtype: int64

0    210990
1     42671
Name: DiffWalk, dtype: int64

0    141962
1    111699
Name: Sex, dtype: int64



Continuous Columns


27    24604
26    20562
24    19550
25    17144
28    16543
      ...  
78        1
85        1
86        1
90        1
91        1
Name: BMI, Length: 84, dtype: int64

9     33243
10    32193
8     30831
7     26313
11    23531
6     19815
13    17362
5     16153
12    15979
4     13823
3     11121
2      7597
1      5700
Name: Age, dtype: int64



Categorical Columns


6    107316
5     69907
4     62748
3      9476
2      4040
1       174
Name: Education, dtype: int64

8    90384
7    43217
6    36468
5    25882
4    20131
3    15994
2    11777
1     9808
Name: Income, dtype: int64

0    213690
2     35342
1      4629
Name: Diabetes, dtype: int64