## Part 1: Pulse Rate Algorithm

### Contents
Fill out this notebook as part of your final project submission.

**You will have to complete both the Code and Project Write-up sections.**
- The [Code](#Code) section is where you will write a **pulse rate algorithm**. It already includes the starter code.
   - Imports - These are the imports needed for Part 1 of the final project. 
     - [glob](https://docs.python.org/3/library/glob.html)
     - [numpy](https://numpy.org/)
     - [scipy](https://www.scipy.org/)
- The [Project Write-up](#Project-Write-up) is where you will describe why you wrote the algorithm for this specific case.


### Dataset
You will be using the **Troika**[1] dataset to build your algorithm. Find the dataset under `datasets/troika/training_data`. The `README` in that folder will tell you how to interpret the data. The starter code contains a function to help load these files.

1. Zhilin Zhang, Zhouyue Pi, Benyuan Liu, "TROIKA: A General Framework for Heart Rate Monitoring Using Wrist-Type Photoplethysmographic Signals During Intensive Physical Exercise," IEEE Trans. on Biomedical Engineering, vol. 62, no. 2, pp. 522-531, February 2015.

-----

### Code

In [32]:
import glob
import numpy as np
import scipy as sp
import scipy.stats
import scipy.io
import scipy.signal
import matplotlib.pyplot as plt
import pandas as pd
from tqdm import tqdm


def LoadTroikaDataset():
    """
    Retrieve the .mat filenames for the troika dataset.
    Review the README in ./datasets/troika/ to understand the
    organization of the .mat files.

    Returns:
        data_fls: Names of the .mat files that contain signal data
        ref_fls: Names of the .mat files that contain reference data
        <data_fls> and <ref_fls> are ordered correspondingly, so that
        ref_fls[5] is the reference data for data_fls[5], etc...
    """
    data_dir = "./datasets/troika/training_data"
    data_fls = sorted(glob.glob(data_dir + "/DATA_*.mat"))
    ref_fls = sorted(glob.glob(data_dir + "/REF_*.mat"))
    return data_fls, ref_fls


def LoadTroikaDataFile(data_fl):
    """
    Loads and extracts signals from a troika data file.

    Usage:
        data_fls, ref_fls = LoadTroikaDataset()
        ppg, accx, accy, accz = LoadTroikaDataFile(data_fls[0])

    Args:
        data_fl: (str) filepath to a troika .mat file.

    Returns:
        numpy arrays for ppg, accx, accy, accz signals.
    """
    data = sp.io.loadmat(data_fl)['sig']
    return data[2:]


def AggregateErrorMetric(pr_errors, confidence_est):
    """
    Computes an aggregate error metric based on confidence estimates.
    Computes the MAE at 90% availability.

    Args:
        pr_errors: a numpy array of errors between pulse rate estimates
        and corresponding reference heart rates.
        confidence_est: a numpy array of confidence estimates
        for each pulse rate error.

    Returns:
        the MAE at 90% availability
    """
    # Higher confidence means a better estimate. The best 90% of the estimates
    #    are above the 10th percentile confidence.
    percentile90_confidence = np.percentile(confidence_est, 10)

    # Find the errors of the best pulse rate estimates
    best_estimates = np.abs(pr_errors[confidence_est >= percentile90_confidence])

    # Return the mean absolute error
    return np.mean(best_estimates)


def Evaluate():
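    """
    Top-level evaluation function.

    Runs the pulse rate algorithm on the Troika dataset and
    returns an aggregate error metric.

    Returns:
        A tuple (metric, errs, confs): the pulse rate error on the Troika
        dataset (see AggregateErrorMetric), followed by the per-window
        errors and confidence estimates it was computed from.
    """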
    # Retrieve dataset files
    data_fls, ref_fls = LoadTroikaDataset()
    errs, confs = [], []
    for data_fl, ref_fl in zip(data_fls, ref_fls):
        # Run the pulse rate algorithm on each trial in the dataset
        errors, confidence = RunPulseRateAlgorithm(data_fl, ref_fl)
        errs.append(errors)
        confs.append(confidence)
    # Stack the per-trial results and compute the aggregate error metric
    errs = np.hstack(errs)
    confs = np.hstack(confs)
    return AggregateErrorMetric(errs, confs), errs, confs

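fs = 125                          # sampling rate (Hz)
window_len_s = 10                 # window length (s)
window_shift_s = 2                # shift between windows (s)
past_window = 3
pass_band = (60/60.0, 200/60.0)   # bandpass edges in Hz (60 to 200 BPM)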
multiplier = 4
ppg_mag_height = 0.55
acc_mag_height = 0.3
ppg_min_dist = 0.2
num_best = 2
acc_num_best_arg = 2


def BandpassFilter(signal):
    """
    Bandpass filter a signal with a third-order Butterworth filter.

    Uses the module-level pass_band (in Hz) and sampling rate fs.

    Args:
        signal: numpy array of sensor data
    Returns:
        The bandpass-filtered signal
    """
    # Design the Butterworth bandpass filter
    b, a = scipy.signal.butter(3, pass_band, btype='bandpass', fs=fs)
    # Apply the filter forward and backward so it adds no phase shift
    return scipy.signal.filtfilt(b, a, signal)







def fft(sig, fs):
    """Return the one-sided FFT frequencies (Hz) and magnitudes of sig."""
    freqs = np.fft.rfftfreq(len(sig), 1/fs)
    fft_mag = np.abs(np.fft.rfft(sig))
    return (freqs, fft_mag)

def RunPulseRateAlgorithm(data_fl, ref_fl):
    """
    Args:
        data_fl: (str) filepath to a troika .mat file (signal).
        ref_fl: (str) filepath to a troika .mat file (ground truth heart rate).

    Returns:
        pr_errors: a numpy array of errors between pulse rate estimates and
        corresponding reference heart rates.
        confidence_est: a numpy array of confidence estimates for each pulse
        rate error.
    """
    fs = 125

    # Load ground truth heart rate
    ref_hrs = sp.io.loadmat(ref_fl)['BPM0']

    # Load data using LoadTroikaDataFile
    ppg, accx, accy, accz = LoadTroikaDataFile(data_fl)
    # Combine the three accelerometer axes by averaging them sample by sample
    acc = np.mean([accx, accy, accz], axis=0)
    data_list = [ppg, acc]
    label_list = ['ppg', 'acc']

    # Bandpass filter the signals between 60 and 200 BPM (see pass_band)
    filtered = {label: BandpassFilter(data) for (label, data) in zip(label_list, data_list)}

    # Move with a window_length_s of 10s and a window_shift_s of 2s
    # The ground truth data follows the same cadence
    errors, confidence = [], []
    window_length_s = 10
    window_shift_s = 2
    window_length = window_length_s * fs
    window_shift = window_shift_s * fs
    idx = list(range(0, len(ppg) - window_length, window_shift))
    for i in idx:
        segments = {label: filtered[label][
            i: i + window_length] for label in label_list}

        freqs, mags, sorted_inds, sorted_freqs = {}, {}, {}, {}
        for label in label_list:
            freqs[label], mags[label] = fft(segments[label], fs)
            sorted_inds[label] = np.argsort(mags[label])[::-1][:4]
            sorted_freqs[label] = freqs[label][sorted_inds[label]]

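        # Prefer the strongest PPG frequency that is not also one of the
        # strongest accelerometer frequencies (likely a motion artifact)
        try:
            est_f = [freq for freq in sorted_freqs['ppg']
                     if freq not in sorted_freqs['acc']][0]
        except IndexError:
            # All candidates overlap with motion: use the strongest PPG peak
            ind = sorted_inds['ppg'][0]
            est_f = freqs['ppg'][ind]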

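        est_hr = est_f * 60
        ref_hr = ref_hrs[idx.index(i)][0]
        errors.append(np.abs(est_hr - ref_hr))
        # Confidence: share of PPG spectral magnitude within +/- 0.5 Hz
        # (+/- 30 BPM) of the estimated frequency
        near_est = (freqs['ppg'] >= est_f - 30/60) & (freqs['ppg'] <= est_f + 30/60)
        confidence.append(np.sum(mags['ppg'][near_est]) / np.sum(mags['ppg']))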
    return np.array(errors), np.array(confidence)

metric, errs, confs = Evaluate()
metric

16.162557800694298

In [33]:
errs

array([  9.66079295,  79.64253394,  78.85714286, ...,   1.3977    ,
         0.25      ,   0.5596    ])

In [34]:
confs

array([ 0.43493264,  0.34205628,  0.26265629, ...,  0.57230399,
        0.54150599,  0.49534163])
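
The `errs` and `confs` arrays returned by `Evaluate()` can also be used to look at the error at availabilities other than 90%. The following is a minimal sketch of that idea, not part of the starter code: the helper name `mae_at_availability` and the toy arrays are assumptions made for illustration, and the toy values are not real results. It applies the same percentile rule as `AggregateErrorMetric`.

```python
import numpy as np


def mae_at_availability(errors, confidences, availability=0.9):
    """MAE over the most confident `availability` fraction of estimates.

    Hypothetical helper; mirrors AggregateErrorMetric for availability=0.9.
    """
    errors = np.asarray(errors, dtype=float)
    confidences = np.asarray(confidences, dtype=float)
    # Keep estimates whose confidence is at or above the matching percentile.
    threshold = np.percentile(confidences, 100 * (1 - availability))
    keep = confidences >= threshold
    return np.mean(np.abs(errors[keep]))


# Toy stand-ins for `errs` and `confs` (assumed values, not real results):
# the two least confident estimates carry the largest errors.
toy_errs = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 40.0, 80.0])
toy_confs = np.array([0.95, 0.90, 0.85, 0.80, 0.75,
                      0.70, 0.65, 0.60, 0.20, 0.10])

for availability in (1.0, 0.9, 0.8):
    mae = mae_at_availability(toy_errs, toy_confs, availability)
    print(f"availability={availability:.0%}  MAE={mae:.2f} BPM")
```

In this toy case the error drops as availability is lowered, because the discarded estimates are the ones with the lowest confidence and the largest errors.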

-----
### Project Write-up

Answer the following prompts to demonstrate understanding of the algorithm you wrote for this specific context.

> - **Code Description**
>   - The code estimates the pulse rate from the PPG signal and a 3-axis accelerometer. The pulse rate is restricted to between 60 BPM (beats per minute) and 200 BPM. Each estimate comes with a confidence value. A higher confidence value means that the estimate should be more accurate than an estimate with a lower confidence value. The code produces an output every 2 seconds.
> - **Data Description**
>   - ECG signals have one channel.
>   - PPG signals have two channels. The code uses only the second channel (`LoadTroikaDataFile` returns `data[2:]`), as suggested, because it poses the more challenging problem.
>   - Accelerometers have three channels, one for each spatial axis (x, y, and z). The code averages the three channels into a single motion signal.
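> - **Algorithm Description**
>   - How the algorithm works: the PPG signal and the averaged accelerometer signal are bandpass filtered and split into 10-second windows shifted by 2 seconds. For each window, the code computes the FFT magnitude of both signals and takes the four strongest frequencies of each. The estimate is the strongest PPG frequency that is not among the strongest accelerometer frequencies. If all four are shared, the strongest PPG frequency is used.
>   - The specific aspects of the physiology that it takes advantage of: PPG signals can be used for measuring heart rate. Capillaries in the wrist fill with blood when the ventricles contract. Light emitted by the PPG sensor is absorbed by red blood cells in these capillaries, so the photodetector sees a drop in reflected light with each beat. The resulting oscillating waveform follows the pulse rate.
>   - A description of the algorithm outputs:
>      - Outputs: the estimated pulse rate (in BPM) for each window and the confidence score of that estimate. `RunPulseRateAlgorithm` returns the absolute error of each estimate against the reference heart rate, together with the confidence scores.
>   - Caveats on algorithm outputs: the confidence is computed only from the spectral magnitude within +/- 0.5 Hz (+/- 30 BPM) of the estimated frequency, relative to the summed magnitude of the entire spectrum.
>   - Common failure modes: the PPG can pick up a higher-frequency signal that does not come from the heart. This can happen because of hand movement, arm movement, or elevation. To mitigate this, the algorithm uses the accelerometer measurements to skip frequencies that are also strong in the motion signal.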
> - **Algorithm Performance**
>   - Confidence estimates can be used to set the point on the error curve that I want to operate at by sacrificing the number of estimates that are considered valid. There is a trade-off between availability and error. For example, if I want to operate at 90% availability, I look at the training dataset to determine the confidence threshold that 90% of the estimates pass. I then consider an estimate valid only if its confidence value is above that threshold. The mean absolute error at 90% availability is 13.7 BPM on the test set. Put another way, the best 90% of the estimates, according to the confidence output, have a mean absolute error of 13.7 BPM. On the training data, the `Evaluate()` run above returns about 16.2 BPM. Because the data were recorded during fixed activities rather than in a free-living context, the algorithm may not generalize well to free-living use.
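
The peak-selection and confidence steps in `RunPulseRateAlgorithm` can be tried out on synthetic data. The sketch below is an illustration under stated assumptions, not the notebook's evaluation: the signals are made-up sinusoids (a 90 BPM pulse, a stronger 150 BPM motion artifact, and an accelerometer with four motion components), and all variable names are chosen for this example. The filter settings, the four-bin candidate lists, and the +/- 0.5 Hz confidence window follow the code above.

```python
import numpy as np
import scipy.signal

fs = 125                      # sampling rate in Hz, as in the notebook
window_s = 10                 # window length in seconds
t = np.arange(fs * window_s) / fs

# Synthetic signals (assumed for illustration only): a 1.5 Hz (90 BPM)
# pulse plus a stronger 2.5 Hz (150 BPM) motion artifact in the PPG, and
# an accelerometer signal made of four motion components.
ppg = 1.0 * np.sin(2 * np.pi * 1.5 * t) + 1.5 * np.sin(2 * np.pi * 2.5 * t)
acc = (1.0 * np.sin(2 * np.pi * 2.5 * t) + 0.8 * np.sin(2 * np.pi * 1.2 * t)
       + 0.8 * np.sin(2 * np.pi * 3.0 * t) + 0.6 * np.sin(2 * np.pi * 2.0 * t))

# Third-order Butterworth bandpass between 60 and 200 BPM, applied
# forward and backward so the filter adds no phase shift.
b, a = scipy.signal.butter(3, (60 / 60.0, 200 / 60.0), btype="bandpass", fs=fs)
ppg_f = scipy.signal.filtfilt(b, a, ppg)
acc_f = scipy.signal.filtfilt(b, a, acc)

freqs = np.fft.rfftfreq(len(t), 1 / fs)
ppg_mag = np.abs(np.fft.rfft(ppg_f))
acc_mag = np.abs(np.fft.rfft(acc_f))

# Frequencies of the four strongest bins in each spectrum.
top_ppg = freqs[np.argsort(ppg_mag)[::-1][:4]]
top_acc = freqs[np.argsort(acc_mag)[::-1][:4]]

# First strong PPG frequency that is not also a strong accelerometer
# frequency; fall back to the strongest PPG bin when all are shared.
candidates = [f for f in top_ppg if f not in top_acc]
est_f = candidates[0] if candidates else top_ppg[0]

# Confidence: share of spectral magnitude within +/- 0.5 Hz of the estimate.
near = (freqs >= est_f - 0.5) & (freqs <= est_f + 0.5)
confidence = ppg_mag[near].sum() / ppg_mag.sum()

print(f"estimated pulse rate: {est_f * 60:.1f} BPM, confidence: {confidence:.2f}")
```

Here the strongest PPG peak is the motion artifact, so the accelerometer check is what moves the estimate to the 90 BPM pulse. The confidence stays well below 1 because the artifact still holds a large share of the PPG spectrum.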


-----
### Next Steps
You will now go to **Test Your Algorithm** to apply a unit test that confirms your algorithm meets the success criteria.