# Heart Disease

## I. Project Tasks

### Define your scope
My primary goal is to continue using the skills I learned on Codecademy. Specifically, I want to clean/tidy data, look for trends, and create a visualization that tells a story. This is a low-key project; machine learning and data mining algorithms are outside of my scope. After thinking it over, I plan to look for correlations and run statistical tests (e.g., Pearson correlation) in Jupyter, and display the most strongly correlated variables in Tableau.
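
A minimal sketch of how the correlation step might look, assuming the outcome column is named `target` (as it is in this dataset). The helper `rank_correlations` and the toy values are hypothetical and only illustrate the idea; they are not rows from the Kaggle file.

```python
import pandas as pd


def rank_correlations(df, target="target"):
    """Pearson r of each numeric column with `target`, strongest first."""
    numeric = df.select_dtypes("number")
    corr = numeric.corr(method="pearson")[target].drop(target)
    # Sort by absolute value so strong negative correlations are not buried.
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Toy values for illustration only; NOT taken from the heart disease dataset.
toy = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "max_hr": [170, 150, 160, 120],
    "target": [0, 0, 1, 1],
})
ranked = rank_correlations(toy)
print(ranked)
```

Sorting by absolute value keeps the sign of each coefficient visible while still putting the strongest relationships at the top, which is the list I would carry over to Tableau.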

### Decide on a question or topic
I chose a topic related to medicine: heart disease. I dispense all kinds of medications that treat heart conditions on a daily basis.

### Find a dataset
I went to Kaggle and found the "Heart Disease Dataset" provided by Mexwell (https://www.kaggle.com/datasets/mexwell/heart-disease-dataset).

### Define your problem
After a brief glance at the dataset and the documentation, I would like to find what makes people with heart disease different from those without heart disease. 
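
One simple way to start on this question is to compare per-group averages. The sketch below assumes that `target` is coded 1 for heart disease and 0 otherwise (my reading of the dataset, not something verified here); `compare_groups` is a hypothetical helper and the toy values are made up for illustration.

```python
import pandas as pd


def compare_groups(df, target="target"):
    """Mean of every numeric column, split by the value of `target`."""
    return df.groupby(target).mean(numeric_only=True)


# Toy values for illustration only; NOT taken from the heart disease dataset.
toy = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "max_hr": [170, 150, 160, 120],
    "target": [0, 0, 1, 1],  # assumed coding: 1 = heart disease, 0 = none
})
summary = compare_groups(toy)
print(summary)
```

Large gaps between the two rows of `summary` point to variables worth a closer look; they are not evidence of a real difference until a statistical test backs them up.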

### Load and check data
The documentation is mostly clear (the Units column consists of units and/or ranges and/or descriptions), and I can identify what each variable is measuring, except for the ST segment variables. Specifically, for the 'oldpeak' variable, I do not know what the numbers represent (the documentation says the units are 'depression') or how it relates to the 'ST slope' variable. I was not familiar with the term 'old peak', so I decided to familiarize myself with it before getting too far along.

I found that the units for _oldpeak_ are likely millimeters (mm) of ST depression on an EKG. The first hits on Google were unhelpful, so I am basing this assumption on what I read about ST slopes.

Image describing ST slopes: https://litfl.com/wp-content/uploads/2018/10/ST-segment-depression-upsloping-downsloping-horizontal.png


### Data wrangling and tidying (in progress)
Kaggle gave this dataset a score of 10.00 out of 10.00 for usability, but one person on Kaggle commented that about 150 people had a cholesterol level of zero, so I will review the dataset for missing values or other possible errors. 
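
A rough sketch of how that review might go, assuming a zero in a column like `cholesterol` is a placeholder for a missing measurement (an assumption on my part, not something the documentation confirms). The helpers `count_zeros` and `zeros_to_nan` are hypothetical, and the toy values are made up for illustration.

```python
import pandas as pd


def count_zeros(df, columns):
    """Number of exact zeros in each of the given columns."""
    return (df[columns] == 0).sum()


def zeros_to_nan(df, columns):
    """Return a copy of df with zeros in `columns` replaced by NaN."""
    cleaned = df.copy()
    cleaned[columns] = cleaned[columns].mask(cleaned[columns] == 0)
    return cleaned


# Toy values for illustration only; NOT taken from the heart disease dataset.
toy = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "cholesterol": [0, 210, 0, 250],
})
zero_counts = count_zeros(toy, ["cholesterol"])
cleaned = zeros_to_nan(toy, ["cholesterol"])
print(zero_counts)
print(cleaned)
```

Converting the zeros to NaN, rather than leaving them in place, keeps them from dragging down means and correlations, since pandas skips NaN in those calculations by default.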

### Find the story (to do)
What, in one sentence, do you want your audience to take away from your project? You may have to create all of your visualizations, do all of your explorations, and even get feedback before you know what it is. But, take a moment to decide the one thing people should take away from your project and make that the first thing. Make that story the easiest one to find.

### Communicate your findings (to do)
This could be in the shape of a report, a Tableau Dashboard, a slide deck, etc. The best way to decide is to imagine your audience. Who are they? Where do you envision reaching them? What is the best medium for that?\
Do you want to reach other people interested in your topic? Do you want to persuade someone to care about something? Do you want to focus on business stakeholders? Where might you find each group? Tailor your project to that imagined scenario.\
From there, you can start to create a project to communicate your story with them.

### Wrap up (to do)
This step might be the most impactful for finding a job. Now that you’ve written your report, created your deck, or built your dashboard, check your work. Go through and proofread. Try to view your project from wherever you decided to host it. Make sure that all of the links work, and ask a friend (or visit the forums) to find someone to go through it and tell you if there are any inconsistencies, jumps of logic, or confusing parts. This final round of checks is essential for creating a polished project.

## II. Load and check data

In [None]:
import pandas as pd
heart_df = pd.read_csv(r'C:\Users\jsbit\OneDrive\Documents\Coding 2023\Git\heart-disease\heart_statlog_cleveland_hungary_final.csv', encoding_errors='replace')

In [2]:
print('There were', str(len(heart_df)), 'participants in this dataset.')
# Although 1,190 participants is a lot, 'heart disease' is a large umbrella term and can affect almost anyone.
# Therefore, I suspect this dataset is not adequately powered, and we can't extrapolate findings to the general public.

There were 1190 participants in this dataset.


In [3]:
print("The column names are:\n", heart_df.columns)

The column names are:
 Index(['age', 'sex', 'chest pain type', 'resting bp s', 'cholesterol',
       'fasting blood sugar', 'resting ecg', 'max heart rate',
       'exercise angina', 'oldpeak', 'ST slope', 'target'],
      dtype='object')


In [4]:
# Explanation of changes to variable names:
    # Because of my German heritage, I prefer EKG instead of ECG.
    # I see an inconsistency in abbreviating 'bp' but spelling out 'blood sugar', so I will abbreviate them all.

heart_df = heart_df.rename(columns={'resting ecg': 'ekg', 
                                    'chest pain type': 'pain type', 
                                    'fasting blood sugar': 'fbs', 
                                    'max heart rate': 'max hr', 
                                    'resting bp s': 'bp'})

In [5]:
# While I'm at it, might as well replace spaces with underscores.
heart_df.columns = [name.replace(' ', '_') for name in heart_df.columns]

print("The new column names are:\n", heart_df.columns)

The new column names are:
 Index(['age', 'sex', 'pain_type', 'bp', 'cholesterol', 'fbs', 'ekg', 'max_hr',
       'exercise_angina', 'oldpeak', 'ST_slope', 'target'],
      dtype='object')


### Initial Exploration

In [8]:
print(heart_df.describe(include='all'))

               age          sex    pain_type           bp  cholesterol  \
count  1190.000000  1190.000000  1190.000000  1190.000000  1190.000000   
mean     53.720168     0.763866     3.232773   132.153782   210.363866   
std       9.358203     0.424884     0.935480    18.368823   101.420489   
min      28.000000     0.000000     1.000000     0.000000     0.000000   
25%      47.000000     1.000000     3.000000   120.000000   188.000000   
50%      54.000000     1.000000     4.000000   130.000000   229.000000   
75%      60.000000     1.000000     4.000000   140.000000   269.750000   
max      77.000000     1.000000     4.000000   200.000000   603.000000   

               fbs          ekg       max_hr  exercise_angina      oldpeak  \
count  1190.000000  1190.000000  1190.000000      1190.000000  1190.000000   
mean      0.213445     0.698319   139.732773         0.387395     0.922773   
std       0.409912     0.870359    25.517636         0.487360     1.086337   
min       0.000000   

#### Observations
* Each variable has a count of 1,190, so there are no NaN values.
* Age:
    * I am somewhat surprised that the max age was only 77; it would be nice to know the inclusion criteria for these studies (e.g., whether some had a max age and others did not).
    * It looks roughly symmetric and may be normally distributed: the 25th and 75th percentiles are about equally far from the median (7 and 6 years), as are the min and max (26 and 23 years).
* Sex: The mean and median indicate that most participants were male.
* Resting blood pressure (systolic):
    * The minimum was zero, which is not a realistic value.
    