## Spring Board Data Science - Capstone 2

### Heart Disease Indicator - Data Wrangling

##### Introduction

##### What topic does the dataset cover?

According to the CDC, heart disease is one of the leading causes of death for people of most races in the US (African Americans, American Indians and Alaska Natives, and white people). About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Other key indicators include diabetic status, obesity (high BMI), not getting enough physical activity, and drinking too much alcohol. Detecting and preventing the factors that have the greatest impact on heart disease is very important in healthcare. Computational developments, in turn, allow machine learning methods to detect patterns in the data that can predict a patient's condition.


##### Recap of the Problem

The purpose of this project is to conduct an interpretability analysis: to study how variation in each feature raises or lowers the likelihood of heart disease. We start with data wrangling, examine the numerical and categorical features, and handle any missing data.

##### Imports

In [25]:
import pandas as pd
import numpy as np

Path = r"C:\\Users\\hanna\\OneDrive\\Desktop\\All Folders\\Data Science\\SpringBoard\\Capstone 2\\Personal Key Indicators of Heart Disease\\heart_2020_cleaned.csv"

hearts = pd.read_csv(Path)

In [26]:
hearts.head()

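|  | HeartDisease | BMI | Smoking | AlcoholDrinking | Stroke | PhysicalHealth | MentalHealth | DiffWalking | Sex | AgeCategory | Race | Diabetic | PhysicalActivity | GenHealth | SleepTime | Asthma | KidneyDisease | SkinCancer |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | No | 16.6 | Yes | No | No | 3.0 | 30.0 | No | Female | 55-59 | White | Yes | Yes | Very good | 5.0 | Yes | No | Yes |
| 1 | No | 20.34 | No | No | Yes | 0.0 | 0.0 | No | Female | 80 or older | White | No | Yes | Very good | 7.0 | No | No | No |
| 2 | No | 26.58 | Yes | No | No | 20.0 | 30.0 | No | Male | 65-69 | White | Yes | Yes | Fair | 8.0 | Yes | No | No |
| 3 | No | 24.21 | No | No | No | 0.0 | 0.0 | No | Female | 75-79 | White | No | No | Good | 6.0 | No | No | Yes |
| 4 | No | 23.71 | No | No | No | 28.0 | 0.0 | Yes | Female | 40-44 | White | No | Yes | Very good | 8.0 | No | No | No |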


In [27]:
hearts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   HeartDisease      319795 non-null  object 
 1   BMI               319795 non-null  float64
 2   Smoking           319795 non-null  object 
 3   AlcoholDrinking   319795 non-null  object 
 4   Stroke            319795 non-null  object 
 5   PhysicalHealth    319795 non-null  float64
 6   MentalHealth      319795 non-null  float64
 7   DiffWalking       319795 non-null  object 
 8   Sex               319795 non-null  object 
 9   AgeCategory       319795 non-null  object 
 10  Race              319795 non-null  object 
 11  Diabetic          319795 non-null  object 
 12  PhysicalActivity  319795 non-null  object 
 13  GenHealth         319795 non-null  object 
 14  SleepTime         319795 non-null  float64
 15  Asthma            319795 non-null  object 
 16  KidneyDisease     31

We have a total of 18 columns and the data seems to have no missing values. Let us learn more about the numerical features.

In [28]:
hearts.describe().T

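|  | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BMI | 319795.0 | 28.325399 | 6.3561 | 12.02 | 24.03 | 27.34 | 31.42 | 94.85 |
| PhysicalHealth | 319795.0 | 3.37171 | 7.95085 | 0.0 | 0.0 | 0.0 | 2.0 | 30.0 |
| MentalHealth | 319795.0 | 3.898366 | 7.955235 | 0.0 | 0.0 | 0.0 | 3.0 | 30.0 |
| SleepTime | 319795.0 | 7.097075 | 1.436007 | 1.0 | 6.0 | 7.0 | 8.0 | 24.0 |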


PhysicalHealth records on how many of the past 30 days a person's physical health (illness or injury) was not good, and MentalHealth records the same for mental health. Both columns range from 0 to 30, so they look consistent. SleepTime also stays within bounds; after all, a person cannot sleep more than 24 hours in a day.

BMI, however, has some extreme values. A BMI of 30 or above is considered obese, yet here we see values as high as 94.85. These outliers need a closer look.
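
One way to gauge how unusual such values are is to measure the share of rows above a few BMI cut-offs. The sketch below is only an illustration: it uses the five BMI values shown by `hearts.head()` plus the reported maximum, and the helper name `share_above` and the cut-offs of 30 and 50 are assumptions rather than part of the original notebook.

```python
import pandas as pd

# Sample BMI values: the five rows shown by hearts.head() plus the reported maximum.
bmi = pd.Series([16.6, 20.34, 26.58, 24.21, 23.71, 94.85], name="BMI")


def share_above(values: pd.Series, threshold: float) -> float:
    """Return the fraction of values strictly greater than threshold."""
    return float((values > threshold).mean())


# Hypothetical cut-offs, chosen for illustration only.
for threshold in (30, 50):
    print(f"BMI > {threshold}: {share_above(bmi, threshold):.1%} of rows")
```

On the full dataset, the same helper could be called with `hearts['BMI']` in place of `bmi`.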

In [29]:
for col in hearts:
    print(f"{col} unique values are:")
    print(hearts[col].unique())

HeartDisease unique values are:
['No' 'Yes']
BMI unique values are:
[16.6  20.34 26.58 ... 62.42 51.46 46.56]
Smoking unique values are:
['Yes' 'No']
AlcoholDrinking unique values are:
['No' 'Yes']
Stroke unique values are:
['No' 'Yes']
PhysicalHealth unique values are:
[ 3.  0. 20. 28.  6. 15.  5. 30.  7.  1.  2. 21.  4. 10. 14. 18.  8. 25.
 16. 29. 27. 17. 24. 12. 23. 26. 22. 19.  9. 13. 11.]
MentalHealth unique values are:
[30.  0.  2.  5. 15.  8.  4.  3. 10. 14. 20.  1.  7. 24.  9. 28. 16. 12.
  6. 25. 17. 18. 21. 29. 22. 13. 23. 27. 26. 11. 19.]
DiffWalking unique values are:
['No' 'Yes']
Sex unique values are:
['Female' 'Male']
AgeCategory unique values are:
['55-59' '80 or older' '65-69' '75-79' '40-44' '70-74' '60-64' '50-54'
 '45-49' '18-24' '35-39' '30-34' '25-29']
Race unique values are:
['White' 'Black' 'Asian' 'American Indian/Alaskan Native' 'Other'
 'Hispanic']
Diabetic unique values are:
['Yes' 'No' 'No, borderline diabetes' 'Yes (during pregnancy)']
PhysicalActivity un

In [30]:
hearts.AgeCategory.value_counts()

65-69          34151
60-64          33686
70-74          31065
55-59          29757
50-54          25382
80 or older    24153
45-49          21791
75-79          21482
18-24          21064
40-44          21006
35-39          20550
30-34          18753
25-29          16955
Name: AgeCategory, dtype: int64

In [31]:
cols = hearts.loc[:, ['Race', 'Diabetic', 'GenHealth', 'AgeCategory']]

for col in cols:
    print(" ")
    print(f"{col}:")
    print(hearts[col].value_counts())

 
Race:
White                             245212
Hispanic                           27446
Black                              22939
Other                              10928
Asian                               8068
American Indian/Alaskan Native      5202
Name: Race, dtype: int64
 
Diabetic:
No                         269653
Yes                         40802
No, borderline diabetes      6781
Yes (during pregnancy)       2559
Name: Diabetic, dtype: int64
 
GenHealth:
Very good    113858
Good          93129
Excellent     66842
Fair          34677
Poor          11289
Name: GenHealth, dtype: int64
 
AgeCategory:
65-69          34151
60-64          33686
70-74          31065
55-59          29757
50-54          25382
80 or older    24153
45-49          21791
75-79          21482
18-24          21064
40-44          21006
35-39          20550
30-34          18753
25-29          16955
Name: AgeCategory, dtype: int64


Here we will need to make a few adjustments:

 - The Race column contains 10,928 rows with 'Other' as the value.
 - The Diabetic column should contain only 'Yes' or 'No', but it currently has four distinct values.
 - GenHealth can be turned into another binary column: assign 0 to 'Excellent', 'Very good', and 'Good', since all three indicate good health, and 1 to 'Fair' and 'Poor'.
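
The Diabetic and GenHealth adjustments could be sketched as follows. This is a minimal illustration on a hand-built sample, not the notebook's actual recoding. In particular, mapping 'No, borderline diabetes' to 'No' and 'Yes (during pregnancy)' to 'Yes' is an assumption; the text above does not say how those two values should be treated.

```python
import pandas as pd

# Small sample using category labels that appear in the value counts above.
sample = pd.DataFrame({
    "Diabetic": ["Yes", "No", "No, borderline diabetes", "Yes (during pregnancy)"],
    "GenHealth": ["Very good", "Fair", "Excellent", "Poor"],
})

# Assumption: borderline cases count as "No", pregnancy-related cases as "Yes".
diabetic_map = {
    "Yes": "Yes",
    "No": "No",
    "No, borderline diabetes": "No",
    "Yes (during pregnancy)": "Yes",
}
sample["Diabetic"] = sample["Diabetic"].map(diabetic_map)

# 0 = 'Excellent', 'Very good' or 'Good'; 1 = 'Fair' or 'Poor'.
sample["GenHealth"] = sample["GenHealth"].isin(["Fair", "Poor"]).astype(int)

print(sample)
```

Using an explicit dictionary keeps the treatment of each original label visible, so the assumption is easy to revisit later.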

In [40]:
# distinct BMI values, in order of first appearance
visited = hearts['BMI'].unique()

# number of distinct BMI values
cnt = len(visited)

# every BMI reading above 50, one entry per row
values_above_50 = hearts.loc[hearts['BMI'] > 50, 'BMI'].tolist()

# note: despite its name, this counts rows with BMI > 50, not distinct values
values_above_50_unique = len(values_above_50)

print("No.of.unique values :", cnt)

#print("unique values :",
#      visited)


No.of.unique values : 3604


Sheesh! The BMI column contains 3,604 unique values, which makes sense because BMI is reported to two decimal places. A BMI of 30 or above is considered obese, yet our dataset contains values above 50, reaching a maximum of 94.85. We need to look into this.

In [41]:
# values_above_50_unique counts rows (not distinct values) with BMI > 50
print(f"No. of rows with BMI above 50: {values_above_50_unique}")
#print(f"Values above 50: {values_above_50}")

No. of rows with BMI above 50: 2511


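Note that 2511 is a count of rows, not of unique values: the counter increases once for every row whose BMI exceeds 50. In other words, 2,511 of the 319,795 respondents (about 0.8%) have a BMI above 50. Whether these are data-entry errors or genuine extreme cases remains to be checked.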
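
Counting rows and counting distinct values give different answers whenever a value repeats. The sketch below shows the difference on made-up BMI values; the numbers and variable names are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical BMI values for illustration; 55.0 appears in three rows.
bmi = pd.Series([24.2, 31.0, 55.0, 55.0, 55.0, 62.42])

above_50 = bmi[bmi > 50]

rows_above_50 = len(above_50)              # one count per row
distinct_above_50 = above_50.nunique()     # one count per distinct value

print(f"Rows with BMI > 50: {rows_above_50}")
print(f"Distinct BMI values > 50: {distinct_above_50}")
```

Applied to `hearts['BMI']`, the same two expressions would report both figures side by side.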

#### Checking for Missing Values

In [34]:
missing = pd.concat([hearts.isnull().sum(), 100 * hearts.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by='count', ascending=False)

|  | count | % |
| --- | --- | --- |
| HeartDisease | 0 | 0.0 |
| BMI | 0 | 0.0 |
| KidneyDisease | 0 | 0.0 |
| Asthma | 0 | 0.0 |
| SleepTime | 0 | 0.0 |
| GenHealth | 0 | 0.0 |
| PhysicalActivity | 0 | 0.0 |
| Diabetic | 0 | 0.0 |
| Race | 0 | 0.0 |
| AgeCategory | 0 | 0.0 |


There are a few things we can say about this dataset:

 - There are 319,795 rows and 18 columns
     - No missing values detected. 
    
 - There are 14 columns with categorical values
     - HeartDisease
     - Smoking
     - AlcoholDrinking
     - Stroke
     - DiffWalking
     - Sex
     - AgeCategory
     - Race
     - Diabetic
     - PhysicalActivity
     - GenHealth
     - Asthma
     - KidneyDisease
     - SkinCancer
 - There are 4 columns with numerical values:
     - BMI
     - PhysicalHealth
     - MentalHealth
     - SleepTime
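
The split between categorical and numerical columns listed above can also be derived from the dtypes instead of being written out by hand. The sketch below assumes that every non-numeric column is categorical, which matches the `hearts.info()` output; it runs on two rows from `hearts.head()`, restricted to a few columns for brevity.

```python
import pandas as pd

# Two rows from hearts.head(), restricted to a few columns for brevity.
sample = pd.DataFrame({
    "HeartDisease": ["No", "No"],
    "BMI": [16.6, 20.34],
    "Smoking": ["Yes", "No"],
    "PhysicalHealth": [3.0, 0.0],
    "SleepTime": [5.0, 7.0],
})

# Columns stored as numbers vs. everything else (string-typed categories here).
numerical_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numerical_cols)
print("Categorical:", categorical_cols)
```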
     
In the next notebook, we can explore the relationship between BMI, AlcoholDrinking, Smoking, Asthma, and HeartDisease. It will also be interesting to see what AgeCategory tells us about how heart disease varies with age.
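
As a possible starting point for that exploration, a row-normalised cross-tabulation shows the share of each HeartDisease outcome within each group of a categorical feature. The rows below are hypothetical and exist only to make the sketch runnable; they say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from the dataset.
sample = pd.DataFrame({
    "Smoking": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "HeartDisease": ["Yes", "No", "No", "No", "Yes", "Yes"],
})

# Row-normalised cross-tabulation: each row (Smoking group) sums to 1.
rates = pd.crosstab(sample["Smoking"], sample["HeartDisease"], normalize="index")
print(rates)
```

The same call could be repeated with `AlcoholDrinking`, `Asthma`, or `AgeCategory` in place of `Smoking`.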