## Part 2: Clinical Application

### Contents
Fill out this notebook as part 2 of your final project submission.

**You will have to complete the Code (Load Metadata & Compute Resting Heart Rate) and Project Write-up sections.**  

- [Code](#Code) is where you will implement some parts of the **pulse rate algorithm** you created and tested in Part 1 and already includes the starter code.
  - [Imports](#Imports) - These are the imports needed for Part 2 of the final project. 
    - [glob](https://docs.python.org/3/library/glob.html)
    - [os](https://docs.python.org/3/library/os.html)
    - [numpy](https://numpy.org/)
    - [pandas](https://pandas.pydata.org/)
  - [Load the Dataset](#Load-the-dataset)  
  - [Load Metadata](#Load-Metadata)
  - [Compute Resting Heart Rate](#Compute-Resting-Heart-Rate)
  - [Plot Resting Heart Rate vs. Age Group](#Plot-Resting-Heart-Rate-vs.-Age-Group)
- [Project Write-up](#Project-Write-Up) is where you describe the clinical significance you observe when the **pulse rate algorithm** is applied to this dataset, what additional information could improve your results, and whether we validated a trend known in the scientific community. 

### Dataset (CAST)

The data for this project comes from the [Cardiac Arrhythmia Suppression Trial (CAST)](https://physionet.org/content/crisdb/1.0.0/), which was sponsored by the National Heart, Lung, and Blood Institute (NHLBI). CAST collected 24 hours of heart rate data from ECGs from people who have had a myocardial infarction (MI) within the past two years.[1] This data has been smoothed and resampled to more closely resemble PPG-derived pulse rate data from a wrist wearable.[2]

1. **CAST RR Interval Sub-Study Database Citation** - Stein PK, Domitrovich PP, Kleiger RE, Schechtman KB, Rottman JN. Clinical and demographic determinants of heart rate variability in patients post myocardial infarction: insights from the Cardiac Arrhythmia Suppression Trial (CAST). Clin Cardiol 23(3):187-94; 2000 (Mar)
2. **PhysioNet Citation** - Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23):e215-e220; 2000

-----

### Code
#### Imports

When you implement the functions, you'll only need to use the packages you've used in the classroom, like [Pandas](https://pandas.pydata.org/) and [Numpy](http://www.numpy.org/). These packages are imported for you here. We recommend you don't import other packages outside of the [Standard Library](https://docs.python.org/3/library/); otherwise the grader might not be able to run your code.

In [1]:
import glob
import os

import numpy as np
import pandas as pd

# My imports
import pdb

#### Load the dataset

The dataset is stored as [.npz](https://docs.scipy.org/doc/numpy/reference/generated/numpy.savez.html) files. Each file contains roughly 24 hours of heart rate data in the 'hr' array, sampled at 1 Hz. The subject ID is the name of the file. You will use these files to compute resting heart rate.

Demographics metadata is stored in a file called 'metadata.csv'. This CSV has three columns: subject ID, age group, and sex. You will use this file to make the association between resting heart rate and age group for each sex.

Find the dataset in `../datasets/crisdb/`

In [2]:
hr_filenames = glob.glob('/data/crisdb/*.npz')

#### Load Metadata
Load the metadata file into a datastructure that allows for easy lookups from subject ID to age group and sex.

In [3]:
metadata_filename = '/data/crisdb/metadata.csv'

# Load the metadata file into this variable.
# with open(metadata_filename, 'r') as f:
#     metadata = pass

# I'd rather just parse this with pandas
metadata = pd.read_csv(metadata_filename)

metadata.head()

| Unnamed: 0 | subject | age | sex |
|---|---|---|---|
| 0 | e198a | 20-24 | Male |
| 1 | e198b | 20-24 | Male |
| 2 | e028b | 30-34 | Male |
| 3 | e028a | 30-34 | Male |
| 4 | e061b | 30-34 | Male |
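
Because every lookup goes from subject ID to age group and sex, one option is to index the frame by subject. The following is a minimal sketch, not the required solution: it uses a small hand-made frame in place of `metadata.csv` (rows copied from the preview above), and `by_subject` is a hypothetical name.

```python
import pandas as pd

# Stand-in for pd.read_csv(metadata_filename); rows copied from the preview.
metadata = pd.DataFrame({
    "subject": ["e198a", "e198b", "e028b"],
    "age": ["20-24", "20-24", "30-34"],
    "sex": ["Male", "Male", "Male"],
})

# Index by subject ID so that each lookup is a single .loc call.
by_subject = metadata.set_index("subject")

# Unpack the two requested fields for one subject.
age_group, sex = by_subject.loc["e028b", ["age", "sex"]]
print(age_group, sex)
```

The `Unnamed: 0` column in the preview is most likely a saved row index; passing `index_col=0` to `pd.read_csv` should read it as the index rather than as a data column.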


#### Compute Resting Heart Rate
For each subject we want to compute the resting heart rate while keeping track of which age group this subject belongs to. An easy, robust way to compute the resting heart rate is to use the lowest 5th percentile value in the heart rate timeseries.

In [4]:
def AgeAndRHR(metadata, filename):
    """Return (age_group, sex, resting heart rate) for one subject's file."""
    # Load the heart rate timeseries
    hr_data = np.load(filename)['hr']

    # Resting heart rate: the 5th percentile of the heart rate timeseries
    rhr = pd.Series(hr_data).quantile(0.05)

    # The subject ID is the file name without its directory or extension.
    subject = os.path.splitext(os.path.basename(filename))[0]

    # Look up this subject's metadata row once, then read age group and sex.
    row = metadata.loc[metadata["subject"] == subject].iloc[0]
    age_group = row["age"]
    sex = row["sex"]

    return age_group, sex, rhr

df = pd.DataFrame(data=[AgeAndRHR(metadata, filename) for filename in hr_filenames],
                  columns=['age_group', 'sex', 'rhr'])
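
To illustrate why the 5th percentile is preferred over the raw minimum, here is a small sketch on synthetic data. The heart rate values are made up (normally distributed around 75 bpm), and the block of very low samples is an assumed stand-in for sensor dropouts, not something taken from the CAST files.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one subject's 'hr' array: 24 hours at 1 Hz.
hr = rng.normal(loc=75.0, scale=8.0, size=24 * 60 * 60)

# Overwrite a few samples with an implausibly low value to mimic dropouts.
hr[:100] = 5.0

# The minimum is dominated by the dropouts; the 5th percentile is not.
rhr = np.percentile(hr, 5)
print(f"minimum: {hr.min():.1f} bpm, 5th percentile: {rhr:.1f} bpm")
```

The dropouts make up about 0.1% of the samples, so they pull the minimum down to 5 bpm while leaving the 5th percentile essentially unchanged.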

Update the environment (run as a notebook cell):
```python
import sys
!conda install --yes --prefix {sys.prefix} numpy
```

#### Plot Resting Heart Rate vs. Age Group
We'll use [seaborn](https://seaborn.pydata.org/) to plot the relationship. Seaborn is a thin wrapper around matplotlib, which we've used extensively in this class, that enables higher-level statistical plots.

We will use [lineplot](https://seaborn.pydata.org/generated/seaborn.lineplot.html#seaborn.lineplot) to plot the mean of the resting heart rates for each age group along with the 95% confidence interval around the mean. Learn more about making plots that show uncertainty [here](https://seaborn.pydata.org/tutorial/relational.html#aggregation-and-representing-uncertainty).

In [None]:
import seaborn as sns
from matplotlib import pyplot as plt

labels = sorted(np.unique(df.age_group))
df['xaxis'] = df.age_group.map(lambda x: labels.index(x)).astype('float')
plt.figure(figsize=(12, 8))
sns.lineplot(x='xaxis', y='rhr', hue='sex', data=df)
_ = plt.xticks(np.arange(len(labels)), labels)
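
For readers who want the numbers behind such a plot, the sketch below computes a per-group mean and an approximate 95% interval with pandas alone. It is an illustration under assumptions: the `demo` frame is made-up data, and the interval uses a normal approximation (1.96 standard errors), whereas seaborn estimates its interval by bootstrapping, so the two will not match exactly.

```python
import numpy as np
import pandas as pd

# Made-up resting heart rates, for illustration only.
demo = pd.DataFrame({
    "age_group": ["60-64"] * 4 + ["65-69"] * 4,
    "sex": ["Female", "Female", "Male", "Male"] * 2,
    "rhr": [66.0, 70.0, 60.0, 64.0, 64.0, 68.0, 61.0, 63.0],
})

# Mean, sample standard deviation, and group size per (age group, sex).
summary = demo.groupby(["age_group", "sex"])["rhr"].agg(["mean", "std", "count"])

# Half-width of a normal-approximation 95% interval around each mean.
summary["ci95"] = 1.96 * summary["std"] / np.sqrt(summary["count"])
print(summary)
```

Small groups produce wide intervals, which is worth keeping in mind when reading the shaded bands at the youngest age groups.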

I kept running into issues updating the environment on these machines, so I ran this cell locally and added the resulting plot as a markdown image:
 ![](./plot_upload.png)

### Clinical Conclusion
Answer the following prompts to draw a conclusion about the data.
> 1. For women, we see .... 
> 2. For men, we see ... 
> 3. In comparison to men, women's heart rate is .... 
> 4. What are some possible reasons for what we see in our data?
> 5. What else can we do or go and find to figure out what is really happening? How would that improve the results?
> 6. Did we validate the trend that average resting heart rate increases up until middle age and then decreases into old age? How?


**> 1. For women, we see ....**  
Women's heart rate is on average higher than men's at comparatively younger ages (35 to ~60), but decreases to match men's heart rate from ~65 years onward.


**> 2. For men, we see ...** 
Male heart rate appears fairly consistent across age, with potentially more variability at younger ages and a small, steady decrease with age (though _likely_ not a significant one).


**> 3. In comparison to men, women's heart rate is ....** 
Higher and more variable (see the tables below).

In [8]:
# Female HR
df[df["sex"] == "Female"]["rhr"].describe()

count    277.000000
mean      65.965632
std       14.393868
min        1.558870
25%       57.744361
50%       66.300003
75%       75.247986
max      101.726316
Name: rhr, dtype: float64

In [9]:
# Male HR
df[df["sex"] == "Male"]["rhr"].describe()

count    1260.000000
mean       63.016196
std        13.064686
min        16.365223
25%        53.782763
50%        61.687742
75%        70.369667
max       109.714286
Name: rhr, dtype: float64

**> 4. What are some possible reasons for what we see in our data?**  
I note that there are substantially more subjects in the male dataset than in the female dataset (1260 vs. 277). The age-group counts below add to this: the female population that _is_ represented within this dataset is also older.

In summary, disparities between male and female heart rate at younger ages could result from bias due to an underrepresentation of younger female participants in the sample. 

In [17]:
df[df["sex"] == "Male"]["age_group"].value_counts()

60-64    246
65-69    230
55-59    215
70-74    157
50-54    146
45-49    109
75-79     79
40-44     54
35-39     24
Name: age_group, dtype: int64

In [18]:
df[df["sex"] == "Female"]["age_group"].value_counts()

60-64    67
65-69    61
55-59    46
70-74    39
75-79    19
50-54    18
45-49    15
40-44     8
35-39     4
Name: age_group, dtype: int64

**> 5. What else can we do or go and find to figure out what is really happening? How would that improve the results?**  
I'm not prepared to state that this effect is not physiological, but owing to the aforementioned disparities in sample size, it's unlikely to be statistically reliable. A larger sample of _young_ women is required to confirm it. 

However... I now note from the next section that this effect (higher female HR than male) _does_ hold across larger sample sizes. I would only be speculating as to the physiological rationale, but it seems reasonable to suggest that this has something to do with the biomechanics required to circulate blood across different body sizes; perhaps a more rapid HR is better suited to a smaller body size. 
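
To make the sample-size concern concrete, the sketch below bootstraps a 95% interval for the difference in mean resting heart rate between two groups. Everything here is an assumption for illustration: `bootstrap_mean_diff` is a hypothetical helper, and the heart rates are synthetic draws sized like the 35-39 bin (4 women, 24 men) and then 50 times larger.

```python
import numpy as np


def bootstrap_mean_diff(a, b, n_boot=2000, seed=0):
    """Return a 95% bootstrap interval for mean(a) - mean(b)."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement and record the mean difference.
        diffs[i] = (rng.choice(a, size=a.size).mean()
                    - rng.choice(b, size=b.size).mean())
    return np.percentile(diffs, [2.5, 97.5])


rng = np.random.default_rng(1)

# Synthetic groups sized like the 35-39 age bin; the values are made up.
small_f = rng.normal(70, 14, size=4)
small_m = rng.normal(63, 13, size=24)

# The same distributions with 50 times as many subjects.
large_f = rng.normal(70, 14, size=200)
large_m = rng.normal(63, 13, size=1200)

lo_s, hi_s = bootstrap_mean_diff(small_f, small_m)
lo_l, hi_l = bootstrap_mean_diff(large_f, large_m)
print(f"small groups: width {hi_s - lo_s:.1f} bpm")
print(f"large groups: width {hi_l - lo_l:.1f} bpm")
```

The interval for the small groups is several times wider than for the large ones, which is why a difference seen in the youngest bins is hard to distinguish from noise.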

**> 6. Did we validate the trend that average resting heart rate increases up until middle age and then decreases into old age? How?**  
_We_ did not. Disparities in the data contribute to an overall variance that is too large to support conclusive statements about this trend. The "trend", from what we can see, is that female HR is higher than male HR and _perhaps_ decreases with age... But these statements/trends are not sufficiently supported by the data.

