# Heart Disease

## I. Project Tasks

### Define your scope
My primary goal is to continue using the skills I learned on Codecademy. Specifically, I want to clean and tidy data, look for trends, and create a visualization that tells a story. This is a low-key project; machine learning and data mining algorithms are outside of my scope. After thinking it over, I plan to look for correlations and run statistical tests (e.g., Pearson) in Jupyter, then display the most strongly correlated variables in Tableau.

### Decide on a question or topic
I decided to choose a topic related to medicine: heart disease. I dispense all kinds of medications that treat heart conditions on a daily basis.

### Find a dataset
I went to Kaggle and found the "Heart Disease Dataset" provided by Mexwell (https://www.kaggle.com/datasets/mexwell/heart-disease-dataset).

### Define your problem
After a brief glance at the dataset and the documentation, I would like to find what makes people with heart disease different from those without heart disease. 

### Load and check data
The documentation is mostly clear (the Units column consists of units, ranges, and/or descriptions), and I can identify what each variable is measuring, except for the ST segment variables. Specifically, for the 'oldpeak' variable, I do not know what the numbers represent (the documentation says the units are 'depression') or how it relates to the 'ST slope' variable. I was not familiar with the term 'old peak' and decided to familiarize myself with it before getting too far along.

I found that the units for _oldpeak_ are likely mm on an EKG. The first hits on Google were unhelpful, so I am making this assumption based on what I read about ST slopes.

Image describing ST slopes: https://litfl.com/wp-content/uploads/2018/10/ST-segment-depression-upsloping-downsloping-horizontal.png


### Data wrangling and tidying (in progress)
Kaggle gave this dataset a score of 10.00 out of 10.00 for usability, but one person on Kaggle commented that about 150 people had a cholesterol level of zero, so I will review the dataset for missing values or other possible errors. 
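
One quick way to check the commenter's claim is to count zeros per column. The sketch below uses a small made-up frame (`toy_df` is hypothetical and its values are invented, not taken from the Kaggle file) so that it runs on its own; only the column names mirror the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; these are not values from the Kaggle file.
toy_df = pd.DataFrame({
    'cholesterol': [289, 0, 283, 0, 195],
    'resting bp s': [140, 160, 0, 138, 150],
    'age': [40, 49, 37, 48, 54],
})

# Comparing to 0 gives a boolean frame; summing counts the True values per column.
zero_counts = (toy_df == 0).sum()
print(zero_counts)
```

Running the same comparison on the real frame should show whether the roughly 150 zero cholesterol values are there.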

### Find the story (to do)
What, in one sentence, do you want your audience to take away from your project? You may have to create all of your visualizations, do all of your explorations, and even get feedback before you know what it is. But, take a moment to decide the one thing people should take away from your project and make that the first thing. Make that story the easiest one to find.

### Communicate your findings (to do)
This could be in the shape of a report, a Tableau Dashboard, a slide deck, etc. The best way to decide is to imagine your audience. Who are they? Where do you envision reaching them? What is the best medium for that?\
Do you want to reach other people interested in your topic? Do you want to persuade someone to care about something? Do you want to focus on business stakeholders? Where might you find each group? Tailor your project to that imagined scenario.\
From there, you can start to create a project to communicate your story with them.

### Wrap up (to do)
This step might be the most impactful for finding a job. Now that you’ve written your report, created your deck, or built your dashboard, check your work. Go through and proofread. Try to view your project from wherever you decided to host it. Make sure that all of the links work, and ask a friend (or visit the forums) to find someone to go through it and tell you if there are any inconsistencies, jumps of logic, or confusing parts. This final round of checks is essential for creating a polished project.

## II. Load data

In [1]:
import pandas as pd
heart_df = pd.read_csv(r'C:\Users\jsbit\OneDrive\Documents\Coding 2023\Git\heart-disease\heart_statlog_cleveland_hungary_final.csv', encoding_errors='replace')

In [2]:
print('There were', str(len(heart_df)), 'participants in this dataset.')
# Although 1,190 participants is a lot, 'heart disease' is a large umbrella term and can affect almost anyone.
# Therefore, I estimate that this dataset does not meet power, and we can't extrapolate findings to the general public.

There were 1190 participants in this dataset.


In [3]:
print("The column names are:\n", heart_df.columns)

The column names are:
 Index(['age', 'sex', 'chest pain type', 'resting bp s', 'cholesterol',
       'fasting blood sugar', 'resting ecg', 'max heart rate',
       'exercise angina', 'oldpeak', 'ST slope', 'target'],
      dtype='object')


In [4]:
# Explanation of changes to variable names:
    # Because of my German heritage, I prefer EKG instead of ECG.
    # I see an inconsistency in abbreviating 'bp' but spelling out 'blood sugar', so I will abbreviate them all.
    # I think 'ST depression' is a better variable name than 'oldpeak'. 
        # I can't find much information on why it is called old peak
        # It's a measure of the ST depression in mm on an EKG
        # If it's a measure of depression, why is it called peak?

heart_df = heart_df.rename(columns={'resting ecg': 'ekg', 
                                    'chest pain type': 'pain type', 
                                    'fasting blood sugar': 'fbs', 
                                    'max heart rate': 'max hr', 
                                    'resting bp s': 'bp',
                                    'oldpeak': 'ST depression'})

In [5]:
# While I'm at it, might as well replace spaces with underscores.
heart_df.columns = [variable.replace(' ', '_') for variable in heart_df]

print("The new column names are:\n", heart_df.columns)

The new column names are:
 Index(['age', 'sex', 'pain_type', 'bp', 'cholesterol', 'fbs', 'ekg', 'max_hr',
       'exercise_angina', 'ST_depression', 'ST_slope', 'target'],
      dtype='object')


## III. Exploratory Data Analysis
### Dataset overview and descriptive statistics
#### Descriptive statistics and observations

In [6]:
print(heart_df.describe(include='all'))

               age          sex    pain_type           bp  cholesterol  \
count  1190.000000  1190.000000  1190.000000  1190.000000  1190.000000   
mean     53.720168     0.763866     3.232773   132.153782   210.363866   
std       9.358203     0.424884     0.935480    18.368823   101.420489   
min      28.000000     0.000000     1.000000     0.000000     0.000000   
25%      47.000000     1.000000     3.000000   120.000000   188.000000   
50%      54.000000     1.000000     4.000000   130.000000   229.000000   
75%      60.000000     1.000000     4.000000   140.000000   269.750000   
max      77.000000     1.000000     4.000000   200.000000   603.000000   

               fbs          ekg       max_hr  exercise_angina  ST_depression  \
count  1190.000000  1190.000000  1190.000000      1190.000000    1190.000000   
mean      0.213445     0.698319   139.732773         0.387395       0.922773   
std       0.409912     0.870359    25.517636         0.487360       1.086337   
min       0.0

* Each variable has a count of 1,190, so there are no missing (NaN) values.
* I know that some values are zero; this is just initial exploration and I will review these numbers again later.
* Age:
    * I am somewhat surprised that the max age was only 77; it would be nice to know the inclusion criteria for these studies (e.g., whether there was a max age for some but not others).
    * It looks like this will have a normal distribution; the distance of the 25th and 75th percentile from the median and the distance of the min and max from the median are similar.
* Sex: The mean and median indicate that most participants were male.
* Resting blood pressure (systolic):
    * Minimum was zero; not a realistic number.
    * The average seems lower than I might have guessed. I am assuming that lots of people have high blood pressure (i.e. systolic over 130 mmHg); hopefully the average isn't skewed much by missing/zero values.
* Cholesterol:
    * Minimum was zero; not a realistic number.
    * It seems like most people had elevated cholesterol (over 200 mg/dL).
* Fasting blood sugar over 120 mg/dL: Looks like most people were normal (value of zero).
* Max heart rate (during stress test and in beats per minute, I assume):
    * I don't normally see stress test results, so this is interesting data to me.
    * A min of 60 bpm - someone must be on some strong beta blockers or experienced angina rather quickly.
* Exercise-induced angina: Looks like most people did not experience angina.
* Oldpeak: This could get trippy - a negative ST depression is ST elevation (min of -2.6 mm)
* Target: A slight majority were classified as having heart disease.    
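
To get a feel for how much placeholder zeros can pull an average down, here is a minimal sketch on invented numbers. `toy_bp` is hypothetical and is not the real `bp` column; it only assumes that a zero means "not recorded".

```python
import pandas as pd

# Invented systolic readings; the 0 stands in for a placeholder/missing value.
toy_bp = pd.Series([120, 130, 140, 0, 150], name='bp')

# mask() replaces values where the condition is True with NaN,
# and mean() skips NaN by default.
mean_with_zeros = toy_bp.mean()
mean_without_zeros = toy_bp.mask(toy_bp == 0).mean()

print(mean_with_zeros, mean_without_zeros)
```

In this toy case a single zero out of five readings lowers the mean from 135 to 108, so comparing the two means on the real columns seems worthwhile before interpreting them.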

#### Data types and observations

In [7]:
print(heart_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1190 entries, 0 to 1189
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   age              1190 non-null   int64  
 1   sex              1190 non-null   int64  
 2   pain_type        1190 non-null   int64  
 3   bp               1190 non-null   int64  
 4   cholesterol      1190 non-null   int64  
 5   fbs              1190 non-null   int64  
 6   ekg              1190 non-null   int64  
 7   max_hr           1190 non-null   int64  
 8   exercise_angina  1190 non-null   int64  
 9   ST_depression    1190 non-null   float64
 10  ST_slope         1190 non-null   int64  
 11  target           1190 non-null   int64  
dtypes: float64(1), int64(11)
memory usage: 111.7 KB
None


* Appropriate data type:
    * Age
    * Blood pressure (bp)
    * Cholesterol
    * Fasting blood sugar (fbs)
    * Max heart rate
    * ST depression
    
* Needs to be changed:
    * Sex: Binary nominal categorical
    * Pain type: Nominal categorical
    * EKG: Nominal categorical
    * Exercise-induced angina: Binary nominal categorical
    * ST slope: Ordinal categorical
    * Target: Nominal categorical  
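
A minimal sketch of what those conversions could look like in pandas. The rows in `toy_df` are invented, and the category order `[1, 2, 3]` for ST slope is an assumption for illustration; the real code-to-label mapping should come from the dataset documentation.

```python
import pandas as pd

# Invented rows; codes are kept as integers because the code-to-label
# mapping should come from the dataset documentation, not from this sketch.
toy_df = pd.DataFrame({
    'sex': [1, 0, 1, 1],
    'pain_type': [2, 3, 4, 4],
    'ST_slope': [1, 2, 1, 3],
})

# Nominal variables: unordered categories.
for col in ['sex', 'pain_type']:
    toy_df[col] = toy_df[col].astype('category')

# Ordinal variable: the order [1, 2, 3] is an assumption for illustration.
slope_dtype = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)
toy_df['ST_slope'] = toy_df['ST_slope'].astype(slope_dtype)

print(toy_df.dtypes)
```

One thing to keep in mind: a value that is not listed in `categories` becomes NaN after the conversion, so it is worth checking the unique values of each column first.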

In [8]:
print(heart_df.head())

   age  sex  pain_type   bp  cholesterol  fbs  ekg  max_hr  exercise_angina  \
0   40    1          2  140          289    0    0     172                0   
1   49    0          3  160          180    0    0     156                0   
2   37    1          2  130          283    0    1      98                0   
3   48    0          4  138          214    0    0     108                1   
4   54    1          3  150          195    0    0     122                0   

   ST_depression  ST_slope  target  
0            0.0         1       0  
1            1.0         2       1  
2            0.0         1       0  
3            1.5         2       1  
4            0.0         1       0  


#### Checking for duplicates and handling duplicates

In [9]:
duplicated_hearts = heart_df[heart_df.duplicated()].reset_index()
print(duplicated_hearts)
# The index numbers seem interesting in how they are consecutive - are they all that way?
print(duplicated_hearts['index'].to_string())

     index  age  sex  pain_type   bp  cholesterol  fbs  ekg  max_hr  \
0      163   49    0          2  110          208    0    0     160   
1      604   58    1          3  150          219    0    1     118   
2      887   63    1          1  145          233    1    2     150   
3      888   67    1          4  160          286    0    2     108   
4      889   67    1          4  120          229    0    2     129   
..     ...  ...  ...        ...  ...          ...  ...  ...     ...   
267   1156   42    1          3  130          180    0    0     150   
268   1157   61    1          4  140          207    0    2     138   
269   1158   66    1          4  160          228    0    2     138   
270   1159   46    1          4  140          311    0    0     120   
271   1160   71    0          4  112          149    0    0     125   

     exercise_angina  ST_depression  ST_slope  target  
0                  0            0.0         1       0  
1                  1            0.0

In [10]:
print(duplicated_hearts['index'])

0       163
1       604
2       887
3       888
4       889
       ... 
267    1156
268    1157
269    1158
270    1159
271    1160
Name: index, Length: 272, dtype: int64


In [11]:
# Yes, it looks like each row after row 887 in heart_df is a duplicate. Calculating if visual analysis was accurate:
dup_range = 1160-887 + 1
other_dups = 2
dup_count_check = dup_range + other_dups
print(dup_count_check)
# Since my calculated guess of 276 is higher than the actual 272, not all rows in the range 887-1160 are duplicates.

276
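
Rather than estimating by eye, the rows inside a block that are not duplicates can be listed directly. This is a sketch on an invented frame (`toy_df` and the block `3:5` are hypothetical stand-ins for the real data and the 887-1160 range).

```python
import pandas as pd

# Invented frame: rows 3 and 5 repeat earlier rows, row 4 is new.
toy_df = pd.DataFrame({
    'age': [40, 49, 37, 40, 61, 49],
    'bp':  [140, 160, 130, 140, 120, 160],
})

# duplicated() marks every repeat of an earlier row as True.
is_dup = toy_df.duplicated()

# Slice a block of consecutive index labels (.loc is inclusive on both ends)
# and keep only the rows NOT flagged as duplicates.
block = is_dup.loc[3:5]
non_dups_in_block = block[~block].index.tolist()

print(non_dups_in_block)
```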


##### What percent of the data are duplicates?

In [12]:
perc_dup = round(len(duplicated_hearts) / len(heart_df) * 100, 2)
print('About', str(perc_dup), '% of the data is duplicated.')
# Almost a quarter of the data is duplicated - that is a lot.

About 22.86 % of the data is duplicated.


##### How many rows would be left after dropping duplicates?

In [13]:
remainder = len(heart_df) - len(duplicated_hearts)
print('There will be', str(remainder), 'rows remaining after dropping duplicates.')

There will be 918 rows remaining after dropping duplicates.


##### Discussion on handling duplicates: 
Given how many duplicates there are, and the fact that they are basically all grouped together, I think it is safe to drop them.\
It would have been nice to have a 'trial ID' or 'participant ID' variable, since this dataset is a conglomeration of multiple datasets. I think this would help pinpoint why there are duplicates.
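
Without an ID column, one way to see where each duplicated row first appears is to flag every copy, originals included, and collect their index labels. This is a sketch on invented data (`toy_df` is hypothetical), not a result from the real dataset.

```python
import pandas as pd

# Invented frame: row 3 repeats row 0 and row 4 repeats row 1.
toy_df = pd.DataFrame({
    'age': [40, 49, 37, 40, 49],
    'cholesterol': [289, 180, 283, 289, 180],
})

# keep=False flags every member of a duplicate group, originals included.
all_copies = toy_df[toy_df.duplicated(keep=False)]

# Group identical rows and collect the index labels where each one appears.
positions = (
    all_copies.reset_index()
    .groupby(list(toy_df.columns))['index']
    .apply(list)
)
print(positions)
```

If the originals in the real data cluster in one index range and the copies in another, that would hint that one source dataset was appended twice.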

In [14]:
heart_nd_df = heart_df.drop_duplicates()
print(len(heart_nd_df))

918


Next step: looking at inappropriate zeros (e.g., cholesterol and bp).