# Heart Disease

## I. Project Tasks

### Define your scope
My primary goal is to continue using the skills I learned on Codecademy. Specifically, I want to clean/tidy data, look for trends, and create a visualization that tells a story. This is a low-key project; machine learning and data mining algorithms are outside of my scope. After thinking it over, I plan to look for correlations and run statistical tests (e.g., Pearson) in Jupyter, and display the most strongly correlated variables in Tableau.

### Decide on a question or topic
I decided to choose a topic related to medicine - heart disease. I dispense all kinds of medications that treat heart conditions on a daily basis.

### Find a dataset
I went to Kaggle and found the dataset "Heart Disease Dataset" provided by Mexwell (link here https://www.kaggle.com/datasets/mexwell/heart-disease-dataset).

### Define your problem
After a brief glance at the dataset and the documentation, I would like to find what makes people with heart disease different from those without heart disease. 

### Load and check data
Documentation is mostly clear (the Units column consists of units and/or ranges and/or descriptions) and I can identify what each variable is measuring, except for the ST segment variables. Specifically, for the 'oldpeak' variable, I do not know what the numbers represent (documentation says the units are 'depression') or how it relates to the 'ST slope' variable. I was not familiar with the term 'old peak', so I decided to familiarize myself with it before getting too far along.

I found that the units for _oldpeak_ are likely mm on an EKG. The first hits on Google were unhelpful, and I am making this assumption based on what I read about ST slopes.

Image describing ST slopes: https://litfl.com/wp-content/uploads/2018/10/ST-segment-depression-upsloping-downsloping-horizontal.png


### Data wrangling and tidying (in progress)
Kaggle gave this dataset a score of 10.00 out of 10.00 for usability, but one person on Kaggle commented that about 150 people had a cholesterol level of zero, so I will review the dataset for missing values or other possible errors. 

### Find the story (to do)
What, in one sentence, do you want your audience to take away from your project? You may have to create all of your visualizations, do all of your explorations, and even get feedback before you know what it is. But, take a moment to decide the one thing people should take away from your project and make that the first thing. Make that story the easiest one to find.

### Communicate your findings (to do)
This could be in the shape of a report, a Tableau Dashboard, a slide deck, etc. The best way to decide is to imagine your audience. Who are they? Where do you envision reaching them? What is the best medium for that?\
Do you want to reach other people interested in your topic? Do you want to persuade someone to care about something? Do you want to focus on business stakeholders? Where might you find each group? Tailor your project to that imagined scenario.\
From there, you can start to create a project to communicate your story with them.

### Wrap up (to do)
This step might be the most impactful for finding a job. Now that you’ve written your report, created your deck, or built your dashboard, check your work. Go through and proofread. Try to view your project from wherever you decided to host it. Make sure that all of the links work, and ask a friend (or visit the forums) to find someone to go through it and tell you if there are any inconsistencies, jumps of logic, or confusing parts. This final round of checks is essential for creating a polished project.

## II. Load data

In [1]:
import pandas as pd
pd.options.mode.copy_on_write = True
import numpy as np
heart_df = pd.read_csv(r'C:\Users\jsbit\OneDrive\Documents\Coding 2023\Git\heart-disease\heart_statlog_cleveland_hungary_final.csv', encoding_errors='replace')

In [2]:
print('There were', str(len(heart_df)), 'participants in this dataset.')
# Although 1,190 participants is a lot, 'heart disease' is a large umbrella term and can affect almost anyone.
# Therefore, I estimate that this dataset does not meet power and we can't extrapolate findings to the general public.

There were 1190 participants in this dataset.


In [3]:
print("The column names are:\n", heart_df.columns)

The column names are:
 Index(['age', 'sex', 'chest pain type', 'resting bp s', 'cholesterol',
       'fasting blood sugar', 'resting ecg', 'max heart rate',
       'exercise angina', 'oldpeak', 'ST slope', 'target'],
      dtype='object')


In [4]:
# Explanation of changes to variable names:
    # Because of my German heritage, I prefer EKG instead of ECG.
    # I see an inconsistency in abbreviating 'bp' but spelling out 'blood sugar', so I will abbreviate them all.
        # For clarification I added 'systolic' before 'bp'
    # I think 'ST depression' is a better variable name than 'oldpeak'. 
        # I can't find much information on why it is called old peak
        # It's a measure of the ST depression in mm on an EKG
        # If it's a measure of depression, why is it called peak?
    # I decided I prefer the variable name 'stress_test' over 'exercise_angina', 
        # especially since I will be converting values to 'angina' or 'no angina'
    # I prefer 'heart disease category' to 'target', it feels more descriptive

heart_df = heart_df.rename(columns={'resting ecg': 'ekg', 
                                    'chest pain type': 'pain type', 
                                    'fasting blood sugar': 'fbs', 
                                    'max heart rate': 'max hr', 
                                    'resting bp s': 'systolic bp',
                                    'oldpeak': 'ST depression',
                                    'exercise angina': 'stress test',
                                    'target': 'heart disease category'})

In [5]:
# While I'm at it, might as well replace spaces with underscores.
heart_df.columns = [variable.replace(' ', '_') for variable in heart_df.columns]

print("The new column names are:\n", heart_df.columns)

The new column names are:
 Index(['age', 'sex', 'pain_type', 'systolic_bp', 'cholesterol', 'fbs', 'ekg',
       'max_hr', 'stress_test', 'ST_depression', 'ST_slope',
       'heart_disease_category'],
      dtype='object')


## III. Initial Exploratory Data Analysis
### Dataset overview and descriptive statistics

In [6]:
print(heart_df.describe(include='all'))

               age          sex    pain_type  systolic_bp  cholesterol  \
count  1190.000000  1190.000000  1190.000000  1190.000000  1190.000000   
mean     53.720168     0.763866     3.232773   132.153782   210.363866   
std       9.358203     0.424884     0.935480    18.368823   101.420489   
min      28.000000     0.000000     1.000000     0.000000     0.000000   
25%      47.000000     1.000000     3.000000   120.000000   188.000000   
50%      54.000000     1.000000     4.000000   130.000000   229.000000   
75%      60.000000     1.000000     4.000000   140.000000   269.750000   
max      77.000000     1.000000     4.000000   200.000000   603.000000   

               fbs          ekg       max_hr  stress_test  ST_depression  \
count  1190.000000  1190.000000  1190.000000  1190.000000    1190.000000   
mean      0.213445     0.698319   139.732773     0.387395       0.922773   
std       0.409912     0.870359    25.517636     0.487360       1.086337   
min       0.000000     0.0000

* Each variable has a count of 1,190, so the dataset is free of NaN.
* I know that some values are zero; this is just initial exploration and I will review these numbers again later.
* Age:
    * I am somewhat surprised that the max age was only 77; it would be nice to know the inclusion criteria for these studies (e.g., if there was a max age for some but not others).
    * It looks like this will have a normal distribution; the distance of the 25th and 75th percentile from the median and the distance of the min and max from the median are similar.
* Sex: The mean and median indicate that most participants were male.
* Resting blood pressure (systolic):
    * Minimum was zero; not a realistic number.
    * The average seems lower than I might have guessed. I am assuming that lots of people have high blood pressure (i.e. systolic over 130 mmHg); hopefully the average isn't skewed a lot by missing/zero values.
* Cholesterol:
    * Minimum was zero; not a realistic number.
    * It seems like most people had an elevated cholesterol (over 200 mg/dL)
* Fasting blood sugar over 120 mg/dL: Looks like most people were normal (value of zero).
* Max heart rate (during stress test and in beats per minute, I assume):
    * I don't normally see stress test results, so this is interesting data to me.
    * A min of 60 bpm - someone must be on some strong beta blockers or experienced angina rather quickly.
* Exercise-induced angina: Looks like most people did not experience angina.
* Oldpeak: This could get trippy - a negative ST depression is ST elevation (min of -2.6 mm)
* Target: A slight majority were classified as having heart disease.    
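
The zeros noted in the list above can be counted per column before deciding how to treat them. The following is a minimal sketch, not part of the notebook run: `toy_df` and its values are made up for illustration and stand in for `heart_df`.

```python
import pandas as pd

# Hypothetical stand-in for heart_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    'age': [54, 61, 47, 39],
    'systolic_bp': [130, 0, 120, 140],
    'cholesterol': [229, 0, 0, 188],
})

# Comparing the frame to zero gives a True/False frame;
# summing it counts the zeros in each column.
zero_counts = (toy_df == 0).sum()
print(zero_counts)
```

Columns where a zero is a legitimate code (such as `sex` or `fbs` in the real data) would need to be left out of this kind of check.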

## IV. Data wrangling and tidying

### Data types and observations

In [7]:
print(heart_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1190 entries, 0 to 1189
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   age                     1190 non-null   int64  
 1   sex                     1190 non-null   int64  
 2   pain_type               1190 non-null   int64  
 3   systolic_bp             1190 non-null   int64  
 4   cholesterol             1190 non-null   int64  
 5   fbs                     1190 non-null   int64  
 6   ekg                     1190 non-null   int64  
 7   max_hr                  1190 non-null   int64  
 8   stress_test             1190 non-null   int64  
 9   ST_depression           1190 non-null   float64
 10  ST_slope                1190 non-null   int64  
 11  heart_disease_category  1190 non-null   int64  
dtypes: float64(1), int64(11)
memory usage: 111.7 KB
None


#### Data values
* Leave as numbers:
    * age
    * systolic_bp 
    * cholesterol
    * max_hr
    * ST_depression
    
* Change from numbers to description (using description found in documentation unless otherwise noted):
    * sex: 0: female, 1: male
    * pain_type: 1: typical, 2: atypical, 3: non-anginal, 4: asymptomatic
    * fbs (using my own words): 0: normal, 1: elevated
    * ekg (using my own words): 0: normal, 1: ST abnormality, 2: LVH (abbreviated from left ventricular hypertrophy)
    * stress_test: 0: no angina, 1: angina
    * ST_slope: 0 (my own words - not found in documentation): not evaluated, 1: upsloping, 2: flat, 3: downsloping
    * heart_disease_category: 0: normal, 1: heart disease

In [8]:
# Map each numeric code to its description with one dictionary per column.
heart_df['sex'] = heart_df['sex'].replace({0: 'female', 1: 'male'})

heart_df['pain_type'] = heart_df['pain_type'].replace({
    1: 'typical', 2: 'atypical', 3: 'non-anginal', 4: 'asymptomatic'})

heart_df['fbs'] = heart_df['fbs'].replace({0: 'normal', 1: 'elevated'})

heart_df['ekg'] = heart_df['ekg'].replace({
    0: 'normal', 1: 'ST abnormality', 2: 'LVH'})

heart_df['stress_test'] = heart_df['stress_test'].replace({
    0: 'no angina', 1: 'angina'})

heart_df['ST_slope'] = heart_df['ST_slope'].replace({
    0: 'not evaluated', 1: 'upsloping', 2: 'flat', 3: 'downsloping'})

heart_df['heart_disease_category'] = heart_df['heart_disease_category'].replace({
    0: 'normal', 1: 'heart disease'})

print('Values for the sex variable:', str(heart_df.sex.unique()))
print('Values for pain type:', str(heart_df.pain_type.unique()))
print('Values for fasting blood sugar:', str(heart_df.fbs.unique()))
print('Values for EKG:', str(heart_df.ekg.unique()))
print('Values for stress test:', str(heart_df.stress_test.unique()))
print('Values for ST slope:', str(heart_df.ST_slope.unique()))
print('Values for heart disease category:', str(heart_df.heart_disease_category.unique()))

Values for the sex variable: ['male' 'female']
Values for pain type: ['atypical' 'non-anginal' 'asymptomatic' 'typical']
Values for fasting blood sugar: ['normal' 'elevated']
Values for EKG: ['normal' 'ST abnormality' 'LVH']
Values for stress test: ['no angina' 'angina']
Values for ST slope: ['upsloping' 'flat' 'downsloping' 'not evaluated']
Values for heart disease category: ['normal' 'heart disease']


#### Data types
* Appropriate data type:
    * Age
    * Blood pressure, systolic (systolic_bp)
    * Cholesterol
    * Max heart rate
    * ST depression
    
* Needs changed:
    * Sex: Binary nominal categorical
    * Pain type: Nominal categorical
    * EKG: Nominal categorical
    * Fasting blood sugar (fbs): Binary nominal categorical
    * Exercise-induced angina: Binary nominal categorical
    * ST slope: Ordinal categorical
    * Heart disease category: Nominal categorical  

In [22]:
heart_df = heart_df.astype({
    'sex': 'category',
    'pain_type': 'category',
    'ekg': 'category',
    'fbs': 'category', 
    'stress_test': 'category',
    'ST_slope': 'category',
    'heart_disease_category': 'category'})

print('Data types for all variables:\n', heart_df.dtypes)

Data types for all variables:
 age                          int64
sex                       category
pain_type                 category
systolic_bp                  int64
cholesterol                  int64
fbs                       category
ekg                       category
max_hr                       int64
stress_test               category
ST_depression              float64
ST_slope                  category
heart_disease_category    category
dtype: object
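
The list above describes ST slope as ordinal, while `astype('category')` creates an unordered category. If the order matters for a later analysis, one option is an ordered `CategoricalDtype`. This is a sketch only: the series `slope` is made up, and the ordering used here (upsloping < flat < downsloping) is an assumption rather than something taken from the dataset documentation.

```python
import pandas as pd

# Made-up values standing in for heart_df['ST_slope'].
slope = pd.Series(['flat', 'upsloping', 'downsloping', 'flat'])

# Assumed ordering, for illustration only.
slope_dtype = pd.CategoricalDtype(
    categories=['upsloping', 'flat', 'downsloping'], ordered=True)
slope_ordered = slope.astype(slope_dtype)

# With an ordered category, min/max and comparisons follow the category order.
print(slope_ordered.min(), slope_ordered.max())
print((slope_ordered > 'upsloping').sum())
```

Any value that is not listed in `categories` (for example 'not evaluated') would become NaN in this conversion, so it would need a place in the ordering or separate handling.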


### Checking for and handling duplicates

In [10]:
duplicated_hearts = heart_df[heart_df.duplicated()].reset_index()
print(duplicated_hearts)
# The index numbers seem interesting in how they are consecutive - are they all that way?
print(duplicated_hearts['index'].to_string())

     index  age     sex     pain_type  systolic_bp  cholesterol       fbs  \
0      163   49  female      atypical          110          208    normal   
1      604   58    male   non-anginal          150          219    normal   
2      887   63    male       typical          145          233  elevated   
3      888   67    male  asymptomatic          160          286    normal   
4      889   67    male  asymptomatic          120          229    normal   
..     ...  ...     ...           ...          ...          ...       ...   
267   1156   42    male   non-anginal          130          180    normal   
268   1157   61    male  asymptomatic          140          207    normal   
269   1158   66    male  asymptomatic          160          228    normal   
270   1159   46    male  asymptomatic          140          311    normal   
271   1160   71  female  asymptomatic          112          149    normal   

                ekg  max_hr stress_test  ST_depression     ST_slope  \
0   

In [11]:
print(duplicated_hearts['index'])

0       163
1       604
2       887
3       888
4       889
       ... 
267    1156
268    1157
269    1158
270    1159
271    1160
Name: index, Length: 272, dtype: int64


In [12]:
# Yes, it looks like each row from row 887 onward in heart_df is a duplicate. Calculating if visual analysis was accurate:
dup_range = 1160-887 + 1
other_dups = 2
dup_count_check = dup_range + other_dups
print(dup_count_check)
# Since my calculated guess of 276 is higher than 272, it looks like not all rows in the range 887-1160 are duplicates.

276
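
To see which rows inside a block were not flagged, the flagged index labels can be compared with the full range they span. A sketch with made-up labels: `flagged` is hypothetical and stands in for `duplicated_hearts['index']`, and the cutoff of 10 plays the role of 887.

```python
import pandas as pd

# Made-up index labels of rows flagged as duplicates.
flagged = pd.Series([3, 10, 11, 12, 14, 15])

# Keep only the labels in the apparently consecutive block.
block = flagged[flagged >= 10]

# Every label the block would contain if it were truly consecutive.
expected = set(range(int(block.min()), int(block.max()) + 1))

# Labels in the range that were not flagged as duplicates.
not_flagged = sorted(expected - set(block.tolist()))
print(not_flagged)
```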


##### What percent of the data are duplicates?

In [13]:
perc_dup = round(len(duplicated_hearts) / len(heart_df) * 100, 2)
print('About', str(perc_dup), '% of the data is duplicated.')
# Almost a quarter of the data is duplicated - that is a lot.

About 22.86 % of the data is duplicated.


##### How many rows would be left after dropping duplicates?

In [14]:
remainder = len(heart_df) - len(duplicated_hearts)
print('There will be', str(remainder), 'rows remaining after dropping duplicates.')

There will be 918 rows remaining after dropping duplicates.


##### Discussion on handling duplicates: 
Given how unlikely it is to have so many duplicates by chance, and the fact that they are basically all grouped together, I think it is safe to drop the duplicates.\
It would have been nice to have a 'trial ID' or 'participant ID' variable, since this dataset is a conglomeration of multiple datasets. I think this would help pinpoint why there are duplicates.

In [15]:
heart = heart_df.drop_duplicates()
print(len(heart))

918
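
Before dropping, it can help to look at each duplicate next to the row it copies. `duplicated()` flags only the later copies by default, while `duplicated(keep=False)` flags every member of a duplicate group. A sketch on a made-up frame (`toy_df` is hypothetical):

```python
import pandas as pd

# Made-up rows; rows 0/2 and 1/4 are exact copies of each other.
toy_df = pd.DataFrame({
    'age': [49, 58, 49, 63, 58],
    'cholesterol': [208, 219, 208, 233, 219],
})

# keep=False marks all members of each duplicate group.
all_copies = toy_df[toy_df.duplicated(keep=False)]

# Sorting on all columns places matching rows next to each other.
all_copies = all_copies.sort_values(list(toy_df.columns))
print(all_copies)

# drop_duplicates keeps the first occurrence of each group by default.
deduped = toy_df.drop_duplicates()
print(len(deduped))
```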


### Handling missing data

I noticed two lab values with a min of zero. A true zero would indicate that the participant was dead, so I am treating these zeros as missing values. These lab values are:
* systolic_bp: Resting blood pressure (systolic)
* cholesterol

First, I will see how many rows have a zero in these two columns.

#### How many values are missing from the bp and cholesterol variables?

In [16]:
zero_heart = heart[(heart['systolic_bp'] == 0) | (heart['cholesterol'] == 0)]
print('There are', str(len(zero_heart)), 'rows with a zero.\n')
# 172 rows is unmanageable for me to look at. I will separate these zeros into their own dataframes and take a closer look.

There are 172 rows with a zero.



#### Identify zeros in resting blood pressure (systolic)

I am curious what other numbers might be out there, so I will include rows with bp from zero to 100. 'Normal' values can go down to 90, below which is considered low.

In [17]:
zero_range_bp = heart[heart['systolic_bp'] < 100]

print('There are', str(len(zero_range_bp)), 'rows that recorded a resting systolic blood pressure less than 100:\n')
# Thirteen rows is manageable for me to look at.
print(zero_range_bp)

There are 13 rows that recorded a resting systolic blood pressure less than 100:

     age     sex     pain_type  systolic_bp  cholesterol       fbs  \
228   38    male  asymptomatic           92          117    normal   
268   34    male      atypical           98          220    normal   
295   32    male       typical           95            0  elevated   
305   51    male  asymptomatic           95            0  elevated   
310   57    male  asymptomatic           95            0  elevated   
315   53    male  asymptomatic           80            0    normal   
329   52    male  asymptomatic           95            0  elevated   
334   40    male  asymptomatic           95            0  elevated   
340   64  female  asymptomatic           95            0  elevated   
450   55    male   non-anginal            0            0    normal   
520   63    male  asymptomatic           96          305    normal   
694   39  female   non-anginal           94          199    normal   
834   51

#### Review of zeros in zero_range_bp
* There was one row with a bp of zero. 
    * This row also had a zero for cholesterol.
* There were 12 other rows with a bp less than 100.
    * The lowest value was 80, the rest were over 90.
    * Eight of these had a cholesterol of zero.

#### Handling missing values
* The only variables I could realistically check for were bp and cholesterol.
* There was only one row with a missing bp and cholesterol.
* There were 171 other rows with a missing cholesterol value.
* All other variables had complete information, as far as I could tell.
* I will replace zero with NaN for the bp and cholesterol columns.

In [18]:
# Replace zero with NaN in the bp and cholesterol columns:
heart['systolic_bp'] = heart['systolic_bp'].replace(0, np.nan)
heart['cholesterol'] = heart['cholesterol'].replace(0, np.nan)

# Checking that zeros were replaced:
print(heart.isna().sum())

# Percent missing values:
print('About {}% of data from the cholesterol column is missing.'.format(round(
    100*heart['cholesterol'].isna().sum()/len(heart))))
# If 1/5 of the data is missing, then the average cholesterol value will go up compared 
    # to my initial calculations... it was already high at 210

age                         0
sex                         0
pain_type                   0
systolic_bp                 1
cholesterol               172
fbs                         0
ekg                         0
max_hr                      0
stress_test                 0
ST_depression               0
ST_slope                    0
heart_disease_category      0
dtype: int64
About 19% of data from the cholesterol column is missing.
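
The comment above about the average going up can be illustrated on a made-up series. pandas skips NaN when computing `mean()`, so zeros that stood in for missing values no longer pull the average down. The numbers below are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up cholesterol values; the zeros stand in for missing measurements.
chol = pd.Series([200, 0, 250, 0, 300])

# The zeros count toward the mean: 750 / 5.
mean_with_zeros = chol.mean()

# After replacing zeros with NaN, mean() skips them: 750 / 3.
chol_clean = chol.replace(0, np.nan)
mean_without_zeros = chol_clean.mean()

print(mean_with_zeros, mean_without_zeros)
```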


### Tidying the data
* All variables appropriately named
* All columns are variables

## V. Statistical analyses

Current plan: Use correlational statistics to identify variables that have a high correlation with having heart disease.

In [None]:
# Next things to do: identify how to compare variables (chi-square, crosstabs, etc.).
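
One way the categorical comparison could start is a crosstab of two variables, followed by a chi-square statistic computed from the observed and expected counts. This is a sketch under stated assumptions, not the analysis itself: `toy_df` is made up, and the statistic is computed by hand without a continuity correction. `scipy.stats.chi2_contingency` is a library alternative that also returns a p-value.

```python
import pandas as pd

# Made-up data: 6 males (4 with heart disease) and 4 females (1 with heart disease).
toy_df = pd.DataFrame({
    'sex': ['male'] * 6 + ['female'] * 4,
    'heart_disease_category': (['heart disease'] * 4 + ['normal'] * 2
                               + ['heart disease'] * 1 + ['normal'] * 3),
})

# Observed counts for each combination of the two variables.
observed = pd.crosstab(toy_df['sex'], toy_df['heart_disease_category'])
print(observed)

# Expected counts if the variables were independent:
# (row total * column total) / grand total.
obs = observed.to_numpy()
row_totals = obs.sum(axis=1, keepdims=True)
col_totals = obs.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / obs.sum()

# Pearson chi-square statistic (no continuity correction).
chi_square = ((obs - expected) ** 2 / expected).sum()
print(round(chi_square, 4))
```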