<a href="https://colab.research.google.com/github/DigitalBiomarkerDiscoveryPipeline/DBDP-Ed/blob/main/dbdPy-tutorials/sleep-algorithm-preprocessing/dbdy_Sleep_Algorithm_Preprocessing_Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **About**

By: Angelica Pan

---

A "dbdPy Tutorial" helps new users get familiar with different dbdPy features by walking users through real examples of actual digital biomarker studies that have used dbdPy.

This Colab is a first draft of what a dbdPy Tutorial might look like once the dbdPy Python library is more complete and ready for use.

There are two parts to this Colab:

- [**1 - Context**](https://colab.research.google.com/drive/1ccG9IxpGcUGXEIk_EQh_U2-tWeFb-rtT#scrollTo=Vv5kJEiYP8bd&line=1&uniqifier=1) contains code and background information that would be absent in a real, more polished Companion Colab.
    - In a real tutorial, the digital biomarker researcher would be able to download and import the dbdPy library, and immediately use the relevant functionality. Since dbdPy does not actually exist yet, Part One contains initial implementations of various functions, to be refined by future dbdPy developers.
- [**2 - Tutorial**](https://colab.research.google.com/drive/1ccG9IxpGcUGXEIk_EQh_U2-tWeFb-rtT#scrollTo=tlzbo4bvZHsC&line=5&uniqifier=1) contains the code and prose that a reader would see in a real tutorial.


# 1 - Context







## 1.1 - Outline

The data preprocessing in this Colab has 5 main steps, as outlined in this [Miro board](https://miro.com/app/board/uXjVMpNxpE4=/?share_link_id=253748706773):
![](https://drive.google.com/uc?export=view&id=1we5u2Itvb6gN8d9rICQbCpKnH-hhiWqZ)

1. `read_raw_data()`: Imports raw activity, sleep, and heart rate data
2. `clean_raw_data()`: Transforms raw activity, sleep, and heart rate dataframes into "dbdPy dataframes"
3. `normalize_df()` / `normalize_hr_df()`: Transform "dbdPy dataframes" into "normalized" dataframes (e.g. all observations/rows are normalized to the same duration)
4. `create_master_df()`: Creates a single, master dataframe from the normalized activity, sleep, and heart rate dataframes
5. `create_subject_dict()`: Saves all the dataframes (raw, dbdPy-style, normalized, master) into a dictionary, the precursor to the [`dbdPy.CommercialDevices`](https://miro.com/app/board/uXjVMrcNUwI=/?share_link_id=7039683671) class


## 1.2 - dbdPy Preprocessing Functions

This section provides context and code behind the implementation of the five dbdPy preprocessing functions used in this Colab.

In [3]:
import pandas as pd
import numpy as np
import datetime as dt

### 1.2.1 - `read_raw_data()`

`read_raw_data()` imports raw activity, sleep, and heart rate data.
- Returns: `raw_activity_df`, `raw_sleep_df`, `raw_hr_df`

![](https://drive.google.com/uc?export=view&id=1-JqCX0dfq3G7_TkbMQnlZ8tWdR1FpFzL)

Notes:
- In this particular study/Colab, "raw" is a misnomer — instead of starting with raw data directly imported from a wearable device, the original authors of the study provide [`data.zip`](https://drive.google.com/file/d/1FZHYiqQ5HChyM9cdfrXQj-GyygnZqNWh/view?usp=drive_link), a collection of CSVs that contain data that has already been transformed. In a real dbdPy pipeline, we would expect to start with data directly imported from the device.
- The author-provided data contains "aggregate" CSVs — these are the original, author-provided versions of the `master_df` dataframes, and are not used in this Colab.
- In the author-provided data, the "epochs" CSVs are what this Colab calls "activity" data.
- **This implementation of `read_raw_data()` is hard-coded to the file structure of the author-provided data**, and will probably need to be refactored in future versions.

    - File structure of [`data.zip`](https://drive.google.com/file/d/1bAPZvGfFzSCan4YLOber9K0ltva4goRX/view), the author-provided data:
    ```
    data/
      ├── aggregate/
      │   ├── aggregate_S1.csv
      │   ├── aggregate_S2.csv
      │   └── ...
      ├── epochs_cleaned/
      │   ├── epochs_cleaned_S1.csv
      │   ├── epochs_cleaned_S2.csv
      │   └── ...
      ├── hr_cleaned/
      │   ├── hr_cleaned_S1.csv
      │   ├── hr_cleaned_S2.csv
      │   └── ...
      └── sleep_cleaned/
          ├── sleep_cleaned_S1.csv
          ├── sleep_cleaned_S2.csv
          └── ...
    ```
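The hard-coded layout above can be captured in a small path helper. This is a minimal sketch that assumes the directory and file names shown in the tree; `subject_csv_paths` is a hypothetical name used only for illustration and is not part of dbdPy.

```python
from pathlib import Path


def subject_csv_paths(data_dir, subject_number):
    """Build the CSV paths for one subject (hypothetical helper).

    Assumes the `data.zip` layout shown above; the "epochs" files hold
    what this Colab calls activity data.
    """
    data_dir = Path(data_dir)
    folders = {
        "activity": "epochs_cleaned",
        "sleep": "sleep_cleaned",
        "hr": "hr_cleaned",
    }
    return {
        name: data_dir / folder / f"{folder}_{subject_number}.csv"
        for name, folder in folders.items()
    }


paths = subject_csv_paths("data", "S1")

# Collect any expected files that are absent before calling pd.read_csv()
missing = [str(path) for path in paths.values() if not path.exists()]
```

Checking for missing files up front gives a clearer error than a failed `pd.read_csv()` call halfway through loading a subject.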

In [12]:
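def read_raw_data(data_dir, subject_number):
    """
    Reads the raw activity, sleep, and heart rate CSVs for one subject.

    Parameters
    ----------
    data_dir : str
        Path to the unzipped, author-provided `data/` directory.
    subject_number : str
        Subject identifier as used in the file names, e.g. "S1".

    Returns
    ----------
    raw_activity_df, raw_sleep_df, raw_hr_df : pd.DataFrame
    """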
    activity_path = f"{data_dir}/epochs_cleaned/epochs_cleaned_{subject_number}.csv"
    heart_rate_path = f"{data_dir}/hr_cleaned/hr_cleaned_{subject_number}.csv"
    sleep_path = f"{data_dir}/sleep_cleaned/sleep_cleaned_{subject_number}.csv"


    raw_activity_df = pd.read_csv(activity_path)
    # Optional print statement
    # print(f"Loaded data from: {activity_path}")

    raw_sleep_df = pd.read_csv(sleep_path)
    # Optional print statement
    # print(f"Loaded data from: {sleep_path}")

    raw_hr_df = pd.read_csv(heart_rate_path)
    # Optional print statement
    # print(f"Loaded data from: {heart_rate_path}")

    return raw_activity_df, raw_sleep_df, raw_hr_df

`raw_activity_df` columns and types:
```
startTimeStamp           int64
endTimeStamp             int64
startDate               object
startTime               object
durationInSeconds        int64
maxMotionIntensity       int64
meanMotionIntensity    float64
steps                    int64
distanceInMeters       float64
activityType            object
activeKilocalories       int64
dtype: object
```

`raw_sleep_df` columns and types:
```
startTimeStamp         int64
endTimeStamp           int64
awakeTimeInSeconds     int64
startDate             object
startTime             object
endDate               object
endTime               object
dtype: object
```

`raw_hr_df` columns and types:
```
timeStamp    float64
date          object
time          object
dateTime      object
heartRate      int64
dtype: object
```

### 1.2.2 - `clean_raw_data()`

`clean_raw_data()` transforms raw activity, sleep, and heart rate dataframes into "dbdPy dataframes", the working name for the clean, standardized data format used across the dbdPy library.
- Returns: `activity_df`, `sleep_df`, `hr_df`

![](https://drive.google.com/uc?export=view&id=1-MXa_4oQXV2Rss3KIkP35pkNjXW88hYU)

Notes:
- The main objective of the dbdPy library is to make the process of working with wearable data more interoperable and standardized. The "dbdPy dataframe" is the foundation of that stability: the dbdPy functions, classes, methods, etc. should be designed to work with the "dbdPy dataframe" as a base, and should therefore chain together seamlessly.
- The "dbdPy dataframe" format also provides a buffer between the device manufacturers and the rest of the dbdPy library. If a manufacturer changes the way a wearable device's data is exported, the function that transforms raw data into a "dbdPy dataframe" (`clean_raw_data()` or its future equivalent) will need to be updated to handle those changes, but other functions will not, because they only use the intermediary "dbdPy dataframe" format.
- The exact form (column names and order) of a "dbdPy dataframe" is to be decided by the dbdPy team.
- **This implementation of `clean_raw_data()` is hard-coded to the structure of the author-provided data**, and will probably need to be refactored in future versions.
    - Currently, timestamp information (`start_timestamp`, `end_timestamp`) is saved as `int`; this is arbitrary and future implementations can use `pd.Timestamp` if that makes more sense.
    - Currently, timestamps are saved as Unix timestamps. **Going forward, dbdPy developers will need to decide which timezone(s), if any, dbdPy supports or uses by default.**
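
To make the timezone question concrete, the sketch below shows how pandas treats Unix timestamps. The sample values and the `America/New_York` zone are arbitrary assumptions chosen for illustration; they are not taken from the study data and are not a proposed dbdPy default.

```python
import pandas as pd

timestamps = pd.Series([0, 1_600_000_000])

# unit="s" interprets the integers as seconds since the Unix epoch (UTC),
# but the resulting datetimes are timezone-naive.
naive = pd.to_datetime(timestamps, unit="s")

# utc=True makes the UTC assumption explicit (timezone-aware result).
aware_utc = pd.to_datetime(timestamps, unit="s", utc=True)

# Converting to a local zone changes the wall-clock time, not the instant.
local = aware_utc.dt.tz_convert("America/New_York")
```

`clean_raw_data()` below produces the naive form, so any function that reasons about local time of day (for example, typical sleep hours) would need the timezone supplied separately.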


In [13]:
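def clean_raw_data(subject_number, raw_activity_df, raw_sleep_df, raw_hr_df):
    """
    Transforms the raw activity, sleep, and heart rate dataframes into
    "dbdPy dataframes": snake_case column names, a `subject_number` column,
    and datetime columns derived from the Unix timestamps (in seconds).
    """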
    # activity
    activity_df = raw_activity_df.copy()
    activity_df.insert(0, "subject_number", subject_number)

    activity_df.rename(
        columns = {
            "startTimeStamp": "start_timestamp",
            "endTimeStamp": "end_timestamp",
            "startDate": "start_date",
            "startTime": "start_time",
            "durationInSeconds": "duration",
            "maxMotionIntensity": "max_motion_intensity",
            "meanMotionIntensity": "mean_motion_intensity",
            "distanceInMeters": "step_distance",
            "activityType": "activity_type",
            "activeKilocalories": "active_kilocalories"
        },
        inplace = True
    )

    activity_df.insert(
        3, "start_datetime", pd.to_datetime(
            activity_df["start_timestamp"], unit = "s"
        )
    )
    activity_df.insert(
        4, "end_datetime", pd.to_datetime(
            activity_df["end_timestamp"], unit = "s"
        )
    )
    activity_df.drop(
        columns = ["start_date", "start_time"], inplace = True
    )

    # sleep
    sleep_df = raw_sleep_df[["startTimeStamp", "endTimeStamp"]].copy()
    sleep_df.insert(0, "subject_number", subject_number)

    sleep_df.rename(columns = {"startTimeStamp": "start_timestamp",
                               "endTimeStamp": "end_timestamp"},
                    inplace = True)

    sleep_df["start_datetime"] = pd.to_datetime(
        sleep_df["start_timestamp"], unit = "s"
    )
    sleep_df["end_datetime"] = pd.to_datetime(
        sleep_df["end_timestamp"], unit = "s"
    )
    sleep_df["duration"] = sleep_df["end_timestamp"] - sleep_df["start_timestamp"]
    sleep_df["sleep_label"] = 1

    # heart rate
    hr_df = raw_hr_df[["timeStamp", "heartRate"]].copy().astype("int64")
    hr_df.insert(0, "subject_number", subject_number)

    hr_df.rename(columns = {"timeStamp": "timestamp",
                            "heartRate": "heart_rate"},
                 inplace = True)

    hr_df.insert(2, "datetime", pd.to_datetime(hr_df["timestamp"], unit = "s"))

    return activity_df, sleep_df, hr_df

`activity_df` columns and types:
```
subject_number                   object
start_timestamp                   int64
end_timestamp                     int64
start_datetime           datetime64[ns]
end_datetime             datetime64[ns]
duration                          int64
max_motion_intensity              int64
mean_motion_intensity           float64
steps                             int64
step_distance                   float64
activity_type                    object
active_kilocalories               int64
dtype: object
```

`sleep_df` columns and types:
```
subject_number             object
start_timestamp             int64
end_timestamp               int64
start_datetime     datetime64[ns]
end_datetime       datetime64[ns]
duration                    int64
sleep_label                 int64
dtype: object
```

`hr_df` columns and types:
```
subject_number            object
timestamp                  int64
datetime          datetime64[ns]
heart_rate                 int64
dtype: object
```

### 1.2.3 - `normalize_df()` / `normalize_hr_df()`

`normalize_df()` / `normalize_hr_df()` transform dbdPy-formatted data into a "normalized" format, where all rows/observations are time-regularized into intervals of the same duration.

For example, a "dbdPy dataframe" might have one row where the duration of the observation is 12 minutes and another row where the duration is 7 minutes; a normalized version of the same dataframe might trim or transform the data so that all rows have a duration of exactly 10 minutes.

- Returns: `normalized_activity_df`, `normalized_sleep_df`, `normalized_hr_df`

![](https://drive.google.com/uc?export=view&id=1-pJS2-E2DuhAOZufIz14Urjigf99lEDS)

Notes:
- This Colab uses two normalization functions, `normalize_df()` for the activity and sleep data and `normalize_hr_df()` for the heart rate data, because the two kinds of data require different transforms and it was easier to write two separate functions.
    - If future dbdPy developers want to keep this timestamp normalization functionality, it may be cleaner to have a single function that can handle both formats.
    - Rows in `activity_df` and `sleep_df` contain start and end timestamps for each observation, because these dataframes record periods of activity or sleep. Normalizing this format of data involves trimming/segmenting observations that are too long or expanding/discarding observations that are too short.
    - Rows in `hr_df` contain only a single timestamp for each observation, because this dataframe records individual measurements at individual points in time. Normalizing this format of data could be approached in multiple ways, but probably involves pivoting the original dataframe, i.e. grouping multiple heart rate observations into a single row.
- **The implementations of `normalize_df()` and `normalize_hr_df()` are hard-coded to the current, tentative "dbdPy dataframe" structure**, and will probably need to be refactored in future versions.
    - `normalize_hr_df()` assumes that the provided heart rate measurements are taken every X seconds, where X is some factor of the `interval` parameter. For example, in the author-provided data, heart rate is measured every 15 seconds and the interval that all data is normalized to is 900 seconds (15 minutes). `normalize_hr_df()` creates groups of heart rate measurements such that each row in the resulting data frame contains 60 heart rate measurements, because there are 60 instances of 15 seconds within a period of 900 seconds.
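
The grouping described in the last note can be illustrated on synthetic data. This is a minimal sketch, assuming readings exactly every 15 seconds that start on an epoch boundary; `toy_hr_df` and its values are made up for illustration, and the real `normalize_hr_df()` below uses `pivot_table()` on the "dbdPy dataframe" columns instead.

```python
import numpy as np
import pandas as pd

interval = 900      # epoch length in seconds (15 minutes)
sample_period = 15  # assumed spacing between heart rate readings

# Synthetic readings covering two full epochs
timestamps = np.arange(0, 2 * interval, sample_period)
toy_hr_df = pd.DataFrame({
    "timestamp": timestamps,
    "heart_rate": 60 + (timestamps // sample_period) % 5,
})

# Offset of each reading within its epoch, and the epoch it belongs to
toy_hr_df["sub_epoch"] = toy_hr_df["timestamp"] % interval
toy_hr_df["start_timestamp"] = toy_hr_df["timestamp"] - toy_hr_df["sub_epoch"]

# One row per epoch, one column per offset (0, 15, ..., 885)
wide = toy_hr_df.pivot(
    index="start_timestamp", columns="sub_epoch", values="heart_rate"
)
wide.columns = [f"hr_{int(offset):03d}" for offset in wide.columns]

print(wide.shape)  # (2, 60)
```

Each of the two epochs ends up with 900 / 15 = 60 heart rate columns, named `hr_000` through `hr_885`.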


#### 1.2.3.1 - Helper functions

In [14]:
def next_epoch_start_timestamp(timestamp, interval):
    """
    Given a timestamp in epoch N, returns the timestamp that marks the
    start of epoch N+1.

    >>> next_epoch_start_timestamp(14, 5)
    15
    >>> next_epoch_start_timestamp(15, 5)
    20
    >>> next_epoch_start_timestamp(16, 5)
    20

    Parameters
    ----------
    timestamp : int
        A timestamp, in seconds.

    interval : int
        The duration of each epoch, in seconds.
    """
    next_epoch_start_timestamp = timestamp + (interval - (timestamp % interval))

    return next_epoch_start_timestamp

def current_epoch_start_timestamp(timestamp, interval):
    """
    Given a timestamp in epoch N, returns the timestamp that marks the
    start of epoch N.

    >>> current_epoch_start_timestamp(14, 5)
    10
    >>> current_epoch_start_timestamp(15, 5)
    15
    >>> current_epoch_start_timestamp(16, 5)
    15
    """
    current_epoch_start_timestamp = timestamp - (timestamp % interval)

    return current_epoch_start_timestamp

def current_epoch_end_timestamp(timestamp, interval):
    """
    Given a timestamp in epoch N, returns the timestamp that marks the
    end of epoch N.

    >>> current_epoch_end_timestamp(14, 5)
    15
    >>> current_epoch_end_timestamp(15, 5)
    20
    >>> current_epoch_end_timestamp(16, 5)
    20

    Parameters
    ----------
    timestamp: int
        A timestamp, in seconds.

    interval: int
        The duration of each epoch, in seconds.
    """
    current_epoch_end_timestamp = timestamp + interval - (timestamp % interval)

    return current_epoch_end_timestamp

def normalize_timestamps(df, interval, expand_smaller = True, expand_larger = True):
    """
    Normalizes the start and end timestamps of a dataframe, such that
    all timestamps are exactly divisible by the `interval` parameter.

    The original start timestamp of a row is transformed into the normalized
    start timestamp, the first multiple of `interval` that occurs within the
    range (start inclusive, end exclusive) of the original start and end
    timestamps. The row is dropped if there is no such multiple.

    For example, if `interval` is 5, a row with a start timestamp of 6 and an
    end timestamp of 10 is dropped, because there is no multiple of 5 between
    [6, 10).

    Parameters
    ----------
    df : pd.DataFrame
        A dataframe that records the start and stop timestamps of an
        event, like activity or sleep.
    interval : int
        The duration (in seconds) that all timestamps are normalized to.
    expand_smaller : bool, default True
        If True, keeps "partial intervals", rows where the difference between
        the original end timestamp and the normalized start timestamp is less
        than `interval`. This difference is saved as the `duration`.

        If False, drops such rows.

        For example, if `interval` is 5 and `expand_smaller` is True, a row
        with original start and end timestamps of 6 and 11 is transformed into
        a row with normalized start and end timestamps of 10 and 15, and a
        duration of 1 (11 - 10).
    expand_larger : bool, default True
        If True, breaks rows with an original duration greater than `interval`
        into two or more rows.
        If False, keeps such rows as a single row.

        For example, if `interval` is 5 and `expand_larger` is True, a row
        with original start and end timestamps of 6 and 20 is split into
        two rows, one with normalized start and end timestamps of 10 and 15,
        and another with normalized start and end timestamps of 15 and 20.

        If `expand_larger` is False, this row is kept as a single row with
        normalized start and end timestamps of 10 and 20.

    Returns
    ----------
    A list of 4-tuples of the shape (a, b, c, (d, e)), where:
      a : int, normalized start timestamp
      b : int, normalized end timestamp
      c : int, duration
          If `expand_smaller` is True (partial intervals are kept), the
          duration is equal to the original end timestamp minus the normalized
          start timestamp.
          If `expand_smaller` is False (partial intervals are dropped), the
          duration is equal to the normalized end timestamp minus the
          normalized start timestamp.
      d : int, original start timestamp
      e : int, original end timestamp
    """
    normalized_timestamps = []

    # Iterate by position rather than by index label, so the function also
    # works when `df` does not have a default RangeIndex (e.g. after
    # filtering rows).
    for start, end in zip(df["start_timestamp"], df["end_timestamp"]):

        if start % interval == 0:
            normalized_start = start
        else:
            normalized_start = next_epoch_start_timestamp(start, interval)

        if end % interval == 0:
            normalized_end = end
        else:
            normalized_end = current_epoch_end_timestamp(end, interval)

        if normalized_start == normalized_end:
            continue
        else:
            duration = end - normalized_start

            if expand_smaller is False:
                if duration < interval:
                    continue

                # Trim a trailing partial interval, if there is one. Rows
                # that already span whole intervals are kept unchanged.
                if duration % interval != 0:
                    normalized_end = current_epoch_start_timestamp(
                        end, interval)

                normalized_timestamps.append(
                    (normalized_start, normalized_end,
                    normalized_end - normalized_start, (start, end))
                )
            else:
                normalized_timestamps.append(
                    (normalized_start, normalized_end, duration, (start, end))
                )

    if expand_larger is False:
        return normalized_timestamps
    else:
        expanded_timestamps = []

        for i in range(len(normalized_timestamps)):
            start = normalized_timestamps[i][0]
            end = normalized_timestamps[i][1]
            duration = normalized_timestamps[i][2]
            original_timestamps = normalized_timestamps[i][3]

            if duration <= interval:
                expanded_timestamps.append(normalized_timestamps[i])
            else:
                if duration % interval == 0:
                    expanded_timestamps.extend(
                        [(x, x + interval, interval, original_timestamps)
                        for x in range(start, end, interval)]
                    )
                else:
                    expanded_timestamps.extend(
                        [(x, x + interval, interval, original_timestamps)
                        for x in range(start, end - interval, interval)]
                    )

                    if expand_smaller is False:
                        continue
                    else:
                        expanded_timestamps.append(
                            (end - interval, end, duration % interval,
                              original_timestamps)
                        )

        return expanded_timestamps

**Examples of `normalize_timestamps()`**:

```
>>> df = pd.DataFrame({"start_timestamp": [0, 5, 15, 21, 27, 31, 42, 53, 66],
                       "end_timestamp": [5, 15, 18, 25, 29, 39, 51, 60, 83]})

>>> df
start_timestamp  end_timestamp
              0              5
              5             15
             15             18
             21             25
             27             29
             31             39
             42             51
             53             60
             66             83

>>> normalize_timestamps(df, interval = 5, expand_larger = True, expand_smaller = True)
[(0, 5, 5, (0, 5)),
 (5, 10, 5, (5, 15)),
 (10, 15, 5, (5, 15)),
 (15, 20, 3, (15, 18)),
 (35, 40, 4, (31, 39)),
 (45, 50, 5, (42, 51)),
 (50, 55, 1, (42, 51)),
 (55, 60, 5, (53, 60)),
 (70, 75, 5, (66, 83)),
 (75, 80, 5, (66, 83)),
 (80, 85, 3, (66, 83))]

>>> normalize_timestamps(df, interval = 5, expand_larger = False, expand_smaller = True)
[(0, 5, 5, (0, 5)),
 (5, 15, 10, (5, 15)),
 (15, 20, 3, (15, 18)),
 (35, 40, 4, (31, 39)),
 (45, 55, 6, (42, 51)),
 (55, 60, 5, (53, 60)),
 (70, 85, 13, (66, 83))]

>>> normalize_timestamps(df, interval = 5, expand_larger = True, expand_smaller = False)
[(0, 5, 5, (0, 5)),
 (5, 10, 5, (5, 15)),
 (10, 15, 5, (5, 15)),
 (45, 50, 5, (42, 51)),
 (55, 60, 5, (53, 60)),
 (70, 75, 5, (66, 83)),
 (75, 80, 5, (66, 83))]

>>> normalize_timestamps(df, interval = 5, expand_larger = False, expand_smaller = False)
[(0, 5, 5, (0, 5)),
 (5, 15, 10, (5, 15)),
 (45, 50, 5, (42, 51)),
 (55, 60, 5, (53, 60)),
 (70, 80, 10, (66, 83))]
```
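
The core idea behind the examples above, rounding a period's start up and its end down to epoch boundaries and then slicing what remains, can be written compactly. The function below is a simplified illustration rather than the notebook's implementation: `split_into_epochs` is a hypothetical name, it handles a single period instead of a dataframe, and it always discards partial epochs (similar in spirit to `expand_smaller = False` with `expand_larger = True`).

```python
def split_into_epochs(start, end, interval):
    """Return the complete epochs that fit inside [start, end).

    Hypothetical helper for illustration only.
    """
    # Round the start up to the next multiple of `interval` (ceiling)
    first = -(-start // interval) * interval
    # Round the end down to the previous multiple of `interval` (floor)
    last = (end // interval) * interval
    return [(t, t + interval) for t in range(first, last, interval)]


print(split_into_epochs(6, 20, 5))  # [(10, 15), (15, 20)]
print(split_into_epochs(6, 10, 5))  # [] because no complete epoch fits
```

Both calls correspond to cases described in the `normalize_timestamps()` docstring: the period from 6 to 20 yields two 5-second epochs, and the period from 6 to 10 is dropped.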

#### 1.2.3.2 - Main functions


In [15]:
def normalize_df(df, interval, expand_smaller = True, expand_larger = True):
    """
    Normalizes the start and end timestamps of a dataframe that measures
    duration of an event, using `normalize_timestamps()`.

    Parameters
    ----------
    df : pd.DataFrame
        A dataframe that records the start and stop timestamps of an
        event, like activity or sleep.
    interval : int
        The duration (in seconds) that start and end timestamps are normalized
        to.
    expand_smaller : bool, default True
        See `normalize_timestamps()`.
    expand_larger : bool, default True
        See `normalize_timestamps()`.
    """
    normalized_timestamps = normalize_timestamps(df, interval,
        expand_smaller = expand_smaller, expand_larger = expand_larger)

    normalized_starts = [normalized_start for (normalized_start, b, c, d)
        in normalized_timestamps
    ]
    normalized_ends = [normalized_end for (a, normalized_end, c, d)
        in normalized_timestamps
    ]
    original_duration = [original_duration for (a, b, original_duration, d)
        in normalized_timestamps
    ]
    original_start = [start for (a, b, c, (start, end))
        in normalized_timestamps
    ]

    normalized_dict = {
      "normalized_start_timestamp": normalized_starts,
      "normalized_end_timestamp": normalized_ends,
      "normalized_start_datetime": pd.to_datetime(
          normalized_starts, unit = "s"
      ),
      "normalized_end_datetime": pd.to_datetime(normalized_ends, unit = "s"),
      "original_duration": original_duration,
      "start_timestamp": original_start
    }

    normalized_df = pd.DataFrame(normalized_dict).set_index("start_timestamp")

    df_copy = df.copy().set_index("start_timestamp")

    normalized_df = normalized_df.join(df_copy).reset_index()

    drop_columns = [
        "start_timestamp", "end_timestamp", "start_datetime",
        "end_datetime", "duration"
    ]

    normalized_df.drop(
        inplace = True,
        columns = [col for col in normalized_df if col in drop_columns],
    )

    normalized_df.rename(
        columns = {
            "normalized_start_timestamp": "start_timestamp",
            "normalized_end_timestamp": "end_timestamp",
            "normalized_start_datetime": "start_datetime",
            "normalized_end_datetime": "end_datetime",
            "original_duration": "duration",

        },
        inplace = True
    )

    normalized_df.insert(
        0, "subject_number", normalized_df.pop("subject_number")
    )

    return normalized_df

def normalize_hr_df(hr_df, interval):
    """
    Groups the timestamps of a dataframe that measures an event at discrete
    points in time.

    Parameters
    ----------
    hr_df : pd.DataFrame
        A dataframe that records measurements at discrete points in time, like
        heart rate.
    interval : int
        The duration (in seconds) that the original timestamps are grouped to.
    """
    subject_number = hr_df.loc[0, "subject_number"]

    normalized_hr_df = hr_df.copy()
    normalized_hr_df["sub_epoch"] = normalized_hr_df["timestamp"] % interval
    normalized_hr_df["start_timestamp"] = (
        normalized_hr_df["timestamp"] - normalized_hr_df["sub_epoch"]
    )
    # Zero-pad the offset so the pivoted columns sort in time order
    normalized_hr_df["hr_col_name"] = (
        "hr_" + normalized_hr_df["sub_epoch"].astype(str).str.zfill(3)
    )

    normalized_hr_df.drop(
        inplace = True,
        columns = ["subject_number", "timestamp", "datetime", "sub_epoch"]
    )

    normalized_hr_df = normalized_hr_df.pivot_table(
        index = "start_timestamp",
        columns = "hr_col_name"
    )

    # flatten multi index from pivoting
    normalized_hr_df.columns = normalized_hr_df.columns.droplevel()
    normalized_hr_df.columns.name = None
    normalized_hr_df.reset_index(inplace = True)

    normalized_hr_df.insert(0, "subject_number", subject_number)

    return normalized_hr_df

### 1.2.4 - `create_master_df()`

`create_master_df()` creates a single, master dataframe from the normalized activity, sleep, and heart rate dataframes.
- Returns: `master_df`

![](https://drive.google.com/uc?export=view&id=1-VrN7thgBDQtAgGFk08lXSiJ7MefFxL-)

Notes:
- The master dataframe joins individual dataframes into a single source of truth that can be used by other dbdPy functions down the line. The individual dataframes are normalized to the same interval first, in order to create a uniform index column that they can actually be joined on.
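
The effect of joining on a shared epoch index can be seen on toy data. This is a minimal sketch with made-up values; for brevity the activity frame stands in for the generated epoch range that `create_master_df()` builds, and only one heart rate column is shown.

```python
import pandas as pd

# Toy normalized frames, each indexed by the epoch start (interval = 900 s)
activity = pd.DataFrame(
    {"start_timestamp": [0, 900, 1800], "steps": [12, 0, 40]}
).set_index("start_timestamp")
sleep = pd.DataFrame(
    {"start_timestamp": [900], "sleep_label": [1]}
).set_index("start_timestamp")
hr = pd.DataFrame(
    {"start_timestamp": [0, 900], "hr_000": [71.0, 58.0]}
).set_index("start_timestamp")

# A left join keeps every epoch of the left frame; epochs that are absent
# from `sleep` or `hr` are filled with NaN.
toy_master = activity.join([sleep, hr])

# Epochs with no sleep record are treated as awake
toy_master["sleep_label"] = toy_master["sleep_label"].fillna(0).astype("int64")

print(toy_master)
```

Because all three frames use the same multiples of 900 as their index, rows line up without any fuzzy matching on time.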

In [16]:
def create_master_df(
    normalized_activity_df, normalized_sleep_df, normalized_hr_df, interval
):
    subject_number = normalized_activity_df.loc[0, "subject_number"]

    # Create the structure of `master_df`
    def create_timestamp_range(df, interval):
        first_start = min(df["start_timestamp"])
        last_start = max(df["start_timestamp"])
        last_end = max(df["end_timestamp"])

        first_epoch_start = next_epoch_start_timestamp(
            first_start, interval)

        if interval > (last_end - last_start):
            last_epoch_start = current_epoch_start_timestamp(
                last_start, interval) - interval
        else:
            last_epoch_start = current_epoch_start_timestamp(
                last_start, interval)

        timestamps = [(x, x + interval) for x in
                      range(first_epoch_start,
                            last_epoch_start + interval,
                            interval)
                      ]

        return timestamps

    epoch_timestamps = create_timestamp_range(normalized_activity_df, interval)
    epoch_starts = [start for (start, end) in epoch_timestamps]
    epoch_ends = [end for (start, end) in epoch_timestamps]

    master_dict = {
        "subject_number": subject_number,
        "start_timestamp": epoch_starts,
        "end_timestamp": epoch_ends,
        "start_datetime": pd.to_datetime(epoch_starts, unit = "s"),
        "end_datetime": pd.to_datetime(epoch_ends, unit = "s")
    }

    # setting `drop` to False preserves the current column order when the index
    #  is reset after joining in all the other DataFrames.
    master_df = pd.DataFrame(master_dict).set_index(
        "start_timestamp", drop = False
    )

    # Transform normalized dataframes
    drop_cols= [
        "subject_number", "end_timestamp", "start_datetime", "end_datetime"
    ]

    normalized_activity_df = normalized_activity_df.drop(
        columns = [col for col in normalized_activity_df if col in drop_cols]
    ).rename(
        columns = {"duration": "activity_duration"}
    ).set_index("start_timestamp")

    normalized_sleep_df = normalized_sleep_df.drop(
        columns = [col for col in normalized_sleep_df if col in drop_cols]
    ).rename(
        columns = {"duration": "sleep_duration"}
    ).set_index("start_timestamp")

    normalized_hr_df = normalized_hr_df.drop(
        columns = [col for col in normalized_hr_df if col in drop_cols]
    ).set_index("start_timestamp")

    # Join dataframes
    master_df = master_df.join(
        other = [normalized_activity_df, normalized_sleep_df, normalized_hr_df]
    )

    master_df.reset_index(drop = True, inplace = True)

    master_df.fillna(inplace = True,
                     value = {"sleep_label": 0,
                              "sleep_duration": 0})

    # DataFrame.join() fills in missing values with NaN, which upcasts
    # integer columns to float. Recast columns after replacing NaN with None.
    master_df = master_df.replace(np.nan, None).convert_dtypes()

    return master_df

### 1.2.5 - `create_subject_dict()`

`create_subject_dict()` saves all the dataframes (raw, dbdPy-style, normalized, master) into a dictionary, the precursor to the [`dbdPy.CommercialDevices`](https://miro.com/app/board/uXjVMrcNUwI=/?share_link_id=7039683671) class.
- Returns: A dictionary

![](https://drive.google.com/uc?export=view&id=1BiVy6RXzXtgJ1B66SEFnSVW1ROw8MlC6)

Notes:
- This dictionary is the working version of the `CommercialDevices` class, a key feature of dbdPy that does not exist yet.
- The exact structure of this class is to be determined by future dbdPy developers, but it will likely have at least these properties:
    - `device_info` (dict): Information about the specific device the data was imported from, like manufacturer or model. This particular study did not provide any device info.
    - `user_info` (dict): Information about the wearer of the device the data was imported from, like their age or gender.
    - `raw_X_df`, `raw_Y_df`, `raw_Z_df`, etc. (pd.DataFrame): Data imported directly from the device, like sleep, activity, and heart rate.
    - `X_df`, `Y_df`, `Z_df`, etc. (pd.DataFrame): Raw dataframes transformed into "dbdPy dataframes".
    - `normalized_X_df`, `normalized_Y_df`, `normalized_Z_df`, etc. (pd.DataFrame): "dbdPy dataframes" with normalized intervals.
    - `master_df` (pd.DataFrame): A single dataframe that contains all of the individual, normalized, dataframes
- A dbdPy user should be able to read in data from a device (aka create a new `CommercialDevices` object) and get all of the properties for free, without having to write any additional code.
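
One way the listed properties could be grouped is sketched below. Everything here is an assumption for illustration: the class name `CommercialDeviceSketch`, the choice of a dataclass, and the decision to key dataframes by data type in dictionaries are not decided by the dbdPy team.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

import pandas as pd


@dataclass
class CommercialDeviceSketch:
    """Hypothetical container mirroring the properties listed above."""

    device_info: dict = field(default_factory=dict)
    user_info: dict = field(default_factory=dict)
    # Each dictionary is keyed by data type, e.g. "activity", "sleep", "hr"
    raw_dfs: Dict[str, pd.DataFrame] = field(default_factory=dict)
    dfs: Dict[str, pd.DataFrame] = field(default_factory=dict)
    normalized_dfs: Dict[str, pd.DataFrame] = field(default_factory=dict)
    master_df: Optional[pd.DataFrame] = None


subject = CommercialDeviceSketch(user_info={"subject_number": "S1"})
subject.raw_dfs["hr"] = pd.DataFrame({"timeStamp": [0.0], "heartRate": [60]})
```

Keying by data type avoids hard-coding attribute names such as `raw_hr_df`, which may help when a device exports a different set of measurements.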

In [17]:
def create_subject_dict(data_dir, interval, subject_number, user_info):
    raw_activity_df, raw_sleep_df, raw_hr_df = read_raw_data(
        data_dir, subject_number
    )

    activity_df, sleep_df, hr_df = clean_raw_data(
        subject_number, raw_activity_df, raw_sleep_df, raw_hr_df
    )

    normalized_activity_df = normalize_df(
        activity_df, interval, expand_smaller = True, expand_larger = True
    )
    normalized_sleep_df = normalize_df(
        sleep_df, interval, expand_smaller = False, expand_larger = True
    )
    normalized_hr_df = normalize_hr_df(hr_df, interval)

    master_df = create_master_df(
        normalized_activity_df, normalized_sleep_df, normalized_hr_df, interval
    )

    subject_dict = {
        "user_info": user_info,
        "raw_activity_df": raw_activity_df,
        "raw_sleep_df": raw_sleep_df,
        "raw_hr_df": raw_hr_df,
        "activity_df": activity_df,
        "sleep_df": sleep_df,
        "hr_df": hr_df,
        "normalized_activity_df": normalized_activity_df,
        "normalized_sleep_df": normalized_sleep_df,
        "normalized_hr_df": normalized_hr_df,
        "master_df": master_df
    }

    return subject_dict

## 1.3 - dbdPy Feature Engineering Functions

This section provides possible implementations of relevant dbdPy feature engineering functions.

The functions defined in this section (`create_summary_df()` and `downsample_sleep_df()`) are not used in the main tutorial, although they are part of the pipeline defined by the original authors. Their output is shown only in Section 2.7 - Appendix.

In [19]:
def create_summary_df(subjects, subject_number, df):
    """
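    Calculates the per-epoch min, max, mean, standard deviation, variance, and
    mean absolute deviation from the overall mean of a dbdPy-style dataframe.

    Parameters
    ----------
    subjects : dict
        A dictionary of subjects.
    subject_number : str
        The number of the subject to summarize.
    df : str
        The key (name) of the subject's dataframe to summarize, e.g.
        "normalized_hr_df".

    Returns
    ----------
    summary_df : pd.DataFrame
        One row per epoch with the summary statistics, alongside any
        identifying columns (e.g. subject number and start timestamp).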
    """
    values_df = subjects.get(subject_number).get(df).copy()

    # Drop any rows where at least one value is missing
    values_df = values_df.dropna().reset_index(drop = True)

    # Temporarily drop non-measurement columns
    cols = [
        "subject_number", "start_timestamp", "end_timestamp",
        "start_datetime", "end_datetime", "duration"
    ]

    temp_columns = values_df.filter(
        items = [col for col in values_df if col in cols],
        axis = 1
    )

    values_df = values_df.drop(
        columns = [col for col in values_df if col in cols]
    )

    # Calculate per-epoch deviation from overall mean
    mean_deviations = []
    overall_mean = np.mean(values_df.values)

    for i in range(len(values_df)):
        values = values_df.iloc[i].values.tolist()
        mean_deviation = np.mean(np.absolute(values - overall_mean))
        mean_deviations.append(mean_deviation)

    summary_dict = {
        "min": values_df.min(axis = 1),
        "max": values_df.max(axis = 1),
        "mean": values_df.mean(axis = 1),
        "std": values_df.std(axis = 1),
        "var": values_df.var(axis = 1),
        "mean_deviation": mean_deviations
    }

    summary_df = pd.concat(
        [temp_columns, pd.DataFrame(summary_dict)], axis = 1
    )

    return summary_df
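
The per-epoch deviation from the overall mean that `create_summary_df()` computes can also be written without an explicit loop. The sketch below uses a small made-up dataframe (the values are not from the study) and checks that a vectorized pandas expression gives the same numbers as the loop-based calculation.

```python
import numpy as np
import pandas as pd

# Toy measurements: two epochs (rows), two samples per epoch (columns)
values_df = pd.DataFrame({
    "hr_000": [60.0, 80.0],
    "hr_015": [62.0, 78.0],
})

# Mean over every value in the dataframe
overall_mean = np.mean(values_df.values)

# Loop-based version: one mean absolute deviation per row
loop_result = [
    np.mean(np.absolute(values_df.iloc[i].values - overall_mean))
    for i in range(len(values_df))
]

# Vectorized version: subtract, take absolute values, average across columns
vectorized_result = values_df.sub(overall_mean).abs().mean(axis=1)

print(np.allclose(loop_result, vectorized_result))
```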

In [20]:
def downsample_sleep_df(subjects, subject_number, interval, window):
    """
    Downsamples a subject's sleep information dataframe by discarding any rows
    of non-sleep that do not occur within a specified window of time before and
    after an actual sleep instance.

    Parameters
    ----------
    subjects : dict
        A dictionary of subjects.
    subject_number : str
        The number of the subject to downsample.
    interval : int
        The duration (in seconds) that start and end timestamps are normalized
        to.
    window : int
        The duration (in seconds) of the period before and after a sleep
        instance to keep.

    Returns
    ----------
    downsampled_sleep_df : pd.DataFrame
        A downsampled version of a subject's sleep instances, where only rows
        that indicate sleep instance and rows that are part of a specified
        window before and after a sleep instance are kept.
    """
    sleep_df = subjects.get(subject_number).get("sleep_df")

    sleep_instances_df = normalize_df(
        sleep_df, interval, expand_smaller = False, expand_larger = False
    )
    sleep_instance_starts = sleep_instances_df["start_timestamp"]
    sleep_instance_ends = sleep_instances_df["end_timestamp"]

    sleep_window_starts = [start - window for start in sleep_instance_starts]
    sleep_window_ends = [end + window for end in sleep_instance_ends]

    # All start timestamps for all preceding and following windows
    all_window_starts = []

    for i in range(len(sleep_instance_starts)):
        preceding_window_starts = range(
            sleep_window_starts[i], sleep_instance_starts[i], interval
        )

        following_window_starts = range(
            sleep_instance_ends[i] - interval, sleep_window_ends[i] - interval,
            interval
        )

        all_window_starts.extend(preceding_window_starts)
        all_window_starts.extend(following_window_starts)

    all_window_ends = [start + interval for start in all_window_starts]
    sleep_windows_dict = {
      "subject_number": subject_number,
      "start_timestamp": all_window_starts,
      "end_timestamp": all_window_ends,
      "start_datetime": pd.to_datetime(all_window_starts, unit = "s"),
      "end_datetime": pd.to_datetime(all_window_ends, unit = "s"),
      "duration": interval,
      "sleep_label": -1
    }

    sleep_windows_df = pd.DataFrame(sleep_windows_dict)

    normalized_sleep_df = subjects.get(subject_number).get(
        "normalized_sleep_df")

    sleep_labels_df = normalized_sleep_df.loc[
        normalized_sleep_df["sleep_label"] == 1]

    downsampled_sleep_df = pd.concat(
        [sleep_windows_df, sleep_labels_df]
    ).sort_values(axis = 0, by = "start_timestamp"
    ).reset_index(drop = True)

    return downsampled_sleep_df

# 2 - The Tutorial

In this tutorial, we'll show why and how the authors of ["Field-Based Assessments of Behavioral Patterns During Shiftwork in Police Academy Trainees Using Wearable Technology" (2022, Erickson et al)](https://pubmed.ncbi.nlm.nih.gov/35416084/) used dbdPy to prepare their study data for analysis.

Click here to jump to [Section 2.4 - Code](https://colab.research.google.com/drive/1ccG9IxpGcUGXEIk_EQh_U2-tWeFb-rtT#scrollTo=N70Cg1oy5dtz&line=6&uniqifier=1).

## 2.1 - About dbdPy

dbdPy is an open-source Python library designed for the processing, feature engineering, and visualization of commercial wearable device data.

dbdPy has (*...this section is where you would describe the structure of dbdPy, once it's more fleshed out*).

## 2.2 - About the Study

Accurate sleep tracking, especially for individuals with non-standard sleeping patterns, is an ongoing challenge for commercial wearable devices.

In ["Field-Based Assessments of Behavioral Patterns During Shiftwork in Police Academy Trainees Using Wearable Technology" (2022, Erickson et al)](https://pubmed.ncbi.nlm.nih.gov/35416084/), the authors use wearable technology to monitor the behavioral changes of participants before and during shiftwork, i.e. as their sleeping patterns either stay circadian-aligned (dayshift workers) or become circadian-misaligned (nightshift workers).

The study observes 27 police academy trainees in two 6-week phases:
- **In-class Training** (baseline): All 27 participants attended class during normal daytime hours (circadian-aligned).
- **Field-based Training**: 13 participants were assigned to a day shift (circadian-aligned) and 14 participants were assigned to a night shift (circadian-misaligned).

![](https://drive.google.com/uc?export=view&id=1vO8--qfKWAPcRndd0c6RBqpixtVuOoUX)

Participants wore a Garmin vívosmart® HR activity tracker on the wrist 24/7, except for when the tracker was being charged. The tracker recorded activity level, heart rate, and algorithmically imputed sleep/wake labels every 15 minutes.



## 2.3 - A New Sleep Imputation Algorithm

In the study, the authors develop and present a new algorithm for imputing sleep/wake labels.

The original, Garmin-developed sleep detection algorithm relies on user input of anticipated regular bedtime, which may be irregular for night shift workers or circadian-misaligned individuals. The novel algorithm instead relies on heart rate and activity data collected by an activity tracker in order to determine periods of sleep or wake.

The authors follow a standard machine learning algorithm development workflow:

1. Clean and transform data
2. Explore and visualize data
3. Transform data and engineer features
4. Train and evaluate machine learning models

In this tutorial we focus on the first step — the data preprocessing — and how the dbdPy Python library makes it easier to work with commercial wearable device data.

### 2.3.1 - Algorithm Development Challenges

The authors present a new sleep imputation algorithm that uses heart rate and activity data to predict the sleep labels ("wake" or "sleep") of every 15-minute epoch. This **requires the heart rate and activity data to be on the same time axis**, such that the heart rate and activity dataframes have corresponding observations that start and end on the same timestamps, in 15-minute epochs.

Furthermore, the authors use the Garmin-provided sleep labels collected during the initial in-class training as a baseline, so the sleep data (as predicted by Garmin) will also need to be on this same time axis.

We might want a dataframe like this, where each row represents a 15-minute epoch (900 seconds) and contains values for the relevant activity, heart rate, and sleep observations:

| start_timestamp | end_timestamp | activity_data | heart_rate_data | sleep_data |
|-----------------|---------------|---------------|-----------------|------------|
| 0               | 900           | ...           | ...             | ...        |
| 900             | 1800          | ...           | ...             | ...        |
| 1800            | 2700          | ...           | ...             | ...        |
| 2700            | 3600          | ...           | ...             | ...        |
| ...             | ...           | ...           | ...             | ...        |

Unfortunately, the data recorded by the Garmin vívosmart® HR does not come in this format out of the box. Each imported dataframe has to be transformed into the appropriate shape:

- Activity data: Garmin continuously records activity information in 15-minute intervals by default — minimal transformation required.
- Heart rate data: Garmin records heart rate measurements every 15 seconds — observations will need to be pivoted into groups of 15 minutes.
- Sleep data: Garmin records sleep instances as single observations, e.g. one row might indicate that sleep was detected from 12:10AM to 6:08AM — observations will need to be segmented into groups of 15 minutes.
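
All three transformations rest on the same arithmetic: mapping a Unix timestamp to the 15-minute epoch that contains it. Here is a minimal sketch, assuming integer timestamps in seconds and epochs aligned to multiples of 900 seconds; the helper name `epoch_bounds` is hypothetical and not part of dbdPy.

```python
INTERVAL = 900  # seconds in a 15-minute epoch


def epoch_bounds(timestamp, interval=INTERVAL):
    """Return the (start, end) timestamps of the epoch containing `timestamp`."""
    # Dropping the remainder snaps the timestamp down to an epoch boundary
    start = timestamp - (timestamp % interval)
    return start, start + interval


# 1486223280 is 2017-02-04 15:48:00, three minutes into its epoch
print(epoch_bounds(1486223280))  # (1486223100, 1486224000)
```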


## 2.4 - Code

Before we dive into the differences between using or not using dbdPy, we'll download the data and set some global variables.

In [21]:
# Uncomment if you have not already imported these packages
# import pandas as pd
# import numpy as np
# import datetime as dt

%%capture

# Download and unzip author-provided dataset
!gdown 1bAPZvGfFzSCan4YLOber9K0ltva4goRX
!unzip data.zip

# Google Colab unzips the data into a local ./data directory
data_dir = "./data"

# 15 minutes is 900 seconds
interval = 900

subject_info = {
    # "subject_number": [field_training_start_date, is_circadian_aligned]
    "S1": [dt.date(2017, 3, 21), True],
    "S2": [dt.date(2017, 9, 25), True],
    "S3": [dt.date(2017, 9, 25), True],
    "S4": [dt.date(2018, 2, 28), False],
    "S5": [dt.date(2018, 2, 28), False]
}

### 2.4.1 - With dbdPy

dbdPy makes it incredibly easy to preprocess the Garmin data! Just call `create_subject_dict()` to store user information, intermediate dataframes, and transformed dataframes in a single dictionary.

![](https://drive.google.com/uc?export=view&id=1BiVy6RXzXtgJ1B66SEFnSVW1ROw8MlC6)

Under the hood, `create_subject_dict()` calls 5 other dbdPy functions:

- `read_raw_data()`: Reads in data from a commercial wearable device.
- `clean_raw_data()`: Transforms raw commercial wearable device data into a standardized intermediary form.
- `normalize_df()` and `normalize_hr_df()`: Standardize the timestamps of health metric data.
- `create_master_df()`: Joins multiple types of health metrics, like activity or sleep data, into a single dataframe.

![](https://drive.google.com/uc?export=view&id=1we5u2Itvb6gN8d9rICQbCpKnH-hhiWqZ)

In [22]:
# Create a dictionary of "subject_number": "subject_dict" pairs
subjects = {}

for (subject_number, user_info) in subject_info.items():
    subject_dict = create_subject_dict(
        data_dir, interval, subject_number, user_info
    )
    subjects[subject_number] = subject_dict

### 2.4.2 - Without dbdPy

Without dbdPy, you would have to write the code for all of these various transformations manually. You can imagine the amount of work that would take!


In [23]:
# Example of activity data directly imported from device
# Garmin records activity information in 15-minute intervals by default

subjects.get("S1").get("raw_activity_df").head()

Unnamed: 0,startTimeStamp,endTimeStamp,startDate,startTime,durationInSeconds,maxMotionIntensity,meanMotionIntensity,steps,distanceInMeters,activityType,activeKilocalories
0,1486223280,1486223820,2017-02-04,15:48:00,540,7,4.916667,0,0.0,SEDENTARY,0
1,1486224000,1486224900,2017-02-04,16:00:00,900,4,3.733333,0,0.0,SEDENTARY,0
2,1486224900,1486225200,2017-02-04,16:15:00,300,7,2.266667,0,0.0,SEDENTARY,0
3,1486225800,1486226700,2017-02-04,16:30:00,900,4,4.0,0,0.0,SEDENTARY,0
4,1486226700,1486226880,2017-02-04,16:45:00,180,7,6.066667,0,0.0,SEDENTARY,0


In [24]:
# Example of what you might want a transformed activity dataframe to look like

subjects.get("S1").get("normalized_activity_df").head()

Unnamed: 0,subject_number,start_timestamp,end_timestamp,start_datetime,end_datetime,duration,max_motion_intensity,mean_motion_intensity,steps,step_distance,activity_type,active_kilocalories
0,S1,1486224000,1486224900,2017-02-04 16:00:00,2017-02-04 16:15:00,900,4,3.733333,0,0.0,SEDENTARY,0
1,S1,1486224900,1486225800,2017-02-04 16:15:00,2017-02-04 16:30:00,300,7,2.266667,0,0.0,SEDENTARY,0
2,S1,1486225800,1486226700,2017-02-04 16:30:00,2017-02-04 16:45:00,900,4,4.0,0,0.0,SEDENTARY,0
3,S1,1486226700,1486227600,2017-02-04 16:45:00,2017-02-04 17:00:00,180,7,6.066667,0,0.0,SEDENTARY,0
4,S1,1486227600,1486228500,2017-02-04 17:00:00,2017-02-04 17:15:00,811,7,4.4,0,0.0,SEDENTARY,0


In [25]:
# Example of heart rate data directly imported from device
# Garmin records heart rate measurements every 15 seconds

subjects.get("S1").get("raw_hr_df").head()

Unnamed: 0,timeStamp,date,time,dateTime,heartRate
0,1486224000.0,2017-02-04,15:54:15.000000,2017-02-04 15:54:15,88
1,1486224000.0,2017-02-04,15:54:30.000000,2017-02-04 15:54:30,88
2,1486224000.0,2017-02-04,15:54:45.000000,2017-02-04 15:54:45,88
3,1486224000.0,2017-02-04,15:55:00.000000,2017-02-04 15:55:00,88
4,1486224000.0,2017-02-04,15:55:15.000000,2017-02-04 15:55:15,88


In [26]:
# Example of what you might want a transformed heart rate dataframe to look like

subjects.get("S1").get("normalized_hr_df").head()

Unnamed: 0,subject_number,start_timestamp,hr_000,hr_015,hr_030,hr_045,hr_060,hr_075,hr_090,hr_105,...,hr_750,hr_765,hr_780,hr_795,hr_810,hr_825,hr_840,hr_855,hr_870,hr_885
0,S1,1486223100,,,,,,,,,...,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0
1,S1,1486224000,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0,...,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0
2,S1,1486224900,88.0,88.0,88.0,88.0,88.0,88.0,88.0,88.0,...,88.0,88.0,88.0,,,,,,,
3,S1,1486227600,,,,,,,,,...,78.0,78.0,78.0,78.0,78.0,78.0,78.0,78.0,78.0,78.0
4,S1,1486228500,78.0,78.0,78.0,78.0,78.0,,,,...,,,,,,,,81.0,81.0,81.0
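
To give a sense of the manual work involved, here is one way the long-to-wide heart rate pivot could be written by hand. This is a sketch under stated assumptions, not the implementation of `normalize_hr_df()`: the input column names `timestamp` and `heart_rate`, the helper name `pivot_heart_rate`, and the choice to keep the first sample per 15-second slot are all hypothetical.

```python
import pandas as pd

INTERVAL = 900       # seconds per epoch (15 minutes)
SAMPLE_PERIOD = 15   # seconds between heart rate samples


def pivot_heart_rate(hr_df, interval=INTERVAL, sample_period=SAMPLE_PERIOD):
    """Pivot long heart rate samples into one row per epoch."""
    df = hr_df.copy()
    # Start of the epoch each sample falls into
    df["start_timestamp"] = df["timestamp"] - df["timestamp"] % interval
    # Offset within the epoch, snapped down to the nearest sample slot
    offset = df["timestamp"] - df["start_timestamp"]
    offset = (offset - offset % sample_period).astype(int)
    df["slot"] = "hr_" + offset.astype(str).str.zfill(3)

    # Assumption: if two samples share a slot, keep the first one
    wide = df.pivot_table(
        index="start_timestamp", columns="slot",
        values="heart_rate", aggfunc="first",
    )
    # Make sure every slot column exists, even if no sample landed in it
    all_slots = [f"hr_{s:03d}" for s in range(0, interval, sample_period)]
    wide = wide.reindex(columns=all_slots)
    wide.columns.name = None
    return wide.reset_index()


# Toy input: three samples in one epoch, one sample in the next
toy_hr_df = pd.DataFrame({
    "timestamp": [1486224000, 1486224015, 1486224885, 1486224900],
    "heart_rate": [88, 87, 86, 90],
})
wide_hr_df = pivot_heart_rate(toy_hr_df)
print(wide_hr_df[["start_timestamp", "hr_000", "hr_015", "hr_885"]])
```

Slots with no sample stay empty (NaN), which mirrors the gaps visible in the normalized heart rate output above.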


In [27]:
# Example of sleep data directly imported from device
# Garmin records entire sleep instances as single observations

subjects.get("S1").get("raw_sleep_df").head()

Unnamed: 0,startTimeStamp,endTimeStamp,awakeTimeInSeconds,startDate,startTime,endDate,endTime
0,1486247760,1486275120,0,2017-02-04,22:36:00,2017-02-05,06:12:00
1,1486333980,1486361280,0,2017-02-05,22:33:00,2017-02-06,06:08:00
2,1486417620,1486448520,0,2017-02-06,21:47:00,2017-02-07,06:22:00
3,1486509120,1486534560,0,2017-02-07,23:12:00,2017-02-08,06:16:00
4,1486599600,1486631160,0,2017-02-09,00:20:00,2017-02-09,09:06:00


In [28]:
# Example of what you might want a transformed sleep dataframe to look like

subjects.get("S1").get("normalized_sleep_df").head()

Unnamed: 0,subject_number,start_timestamp,end_timestamp,start_datetime,end_datetime,duration,sleep_label
0,S1,1486248300,1486249200,2017-02-04 22:45:00,2017-02-04 23:00:00,900,1
1,S1,1486249200,1486250100,2017-02-04 23:00:00,2017-02-04 23:15:00,900,1
2,S1,1486250100,1486251000,2017-02-04 23:15:00,2017-02-04 23:30:00,900,1
3,S1,1486251000,1486251900,2017-02-04 23:30:00,2017-02-04 23:45:00,900,1
4,S1,1486251900,1486252800,2017-02-04 23:45:00,2017-02-05 00:00:00,900,1
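
Segmenting a single sleep instance into 15-minute rows could be sketched by hand as follows. The helper name `segment_sleep_instance` is hypothetical, and the rule of keeping only epochs that lie entirely inside the sleep instance is an assumption. It reproduces the first row shown above (sleep recorded from 22:36, first epoch starting at 22:45), but how `normalize_df()` treats the trailing partial epoch is not confirmed here.

```python
import pandas as pd


def segment_sleep_instance(start_ts, end_ts, interval=900):
    """Split one sleep instance into fully covered epochs of `interval` seconds."""
    # Round the start up and the end down to the nearest epoch boundary
    first_start = -(-start_ts // interval) * interval
    last_end = (end_ts // interval) * interval

    starts = list(range(first_start, last_end, interval))
    ends = [start + interval for start in starts]

    return pd.DataFrame({
        "start_timestamp": starts,
        "end_timestamp": ends,
        "start_datetime": pd.to_datetime(starts, unit="s"),
        "end_datetime": pd.to_datetime(ends, unit="s"),
        "duration": interval,
        "sleep_label": 1,
    })


# First raw sleep instance for S1: 2017-02-04 22:36:00 to 2017-02-05 06:12:00
segments_df = segment_sleep_instance(1486247760, 1486275120)
print(segments_df.head())
```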


## 2.5 - Explore Transformed Data

A subject dictionary contains the following key-value pairs, making it easy to retrieve any intermediate or transformed dataframe you might be interested in:

```
subject_dict = {
    "user_info": user_info,
    "raw_activity_df": raw_activity_df,
    "raw_sleep_df": raw_sleep_df,
    "raw_hr_df": raw_hr_df,
    "activity_df": activity_df,
    "sleep_df": sleep_df,
    "hr_df": hr_df,
    "normalized_activity_df": normalized_activity_df,
    "normalized_sleep_df": normalized_sleep_df,
    "normalized_hr_df": normalized_hr_df,
    "master_df": master_df
}
```

In [29]:
# master_df: The normalized activity, sleep, and heart rate dataframes, joined

subjects.get("S1").get("master_df")

Unnamed: 0,subject_number,start_timestamp,end_timestamp,start_datetime,end_datetime,activity_duration,max_motion_intensity,mean_motion_intensity,steps,step_distance,...,hr_750,hr_765,hr_780,hr_795,hr_810,hr_825,hr_840,hr_855,hr_870,hr_885
0,S1,1486224900,1486225800,2017-02-04 16:15:00,2017-02-04 16:30:00,300,7,2.266667,0,0.0,...,88,88,88,,,,,,,
1,S1,1486225800,1486226700,2017-02-04 16:30:00,2017-02-04 16:45:00,900,4,4.0,0,0.0,...,,,,,,,,,,
2,S1,1486226700,1486227600,2017-02-04 16:45:00,2017-02-04 17:00:00,180,7,6.066667,0,0.0,...,,,,,,,,,,
3,S1,1486227600,1486228500,2017-02-04 17:00:00,2017-02-04 17:15:00,811,7,4.4,0,0.0,...,78,78,78,78,78,78,78,78,78,78
4,S1,1486228500,1486229400,2017-02-04 17:15:00,2017-02-04 17:30:00,651,7,5.533333,0,0.0,...,,,,,,,,81,81,81
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8918,S1,1494251100,1494252000,2017-05-08 13:45:00,2017-05-08 14:00:00,892,7,4.0,0,0.0,...,66,66,66,65,65,65,65,66,66,66
8919,S1,1494252000,1494252900,2017-05-08 14:00:00,2017-05-08 14:15:00,900,2,0.866667,0,0.0,...,63,63,63,63,63,63,63,63,63,63
8920,S1,1494252900,1494253800,2017-05-08 14:15:00,2017-05-08 14:30:00,900,3,1.8,0,0.0,...,62,62,62,62,62,62,62,64,64,64
8921,S1,1494253800,1494254700,2017-05-08 14:30:00,2017-05-08 14:45:00,900,3,0.8,0,0.0,...,63,63,63,63,63,63,63,62,62,62
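
Once every dataframe shares the same epoch boundaries, joining them comes down to merging on the shared keys. The sketch below illustrates the idea on made-up data and is not the implementation of `create_master_df()`: the left join anchored on the activity dataframe and the rule that epochs without a sleep row count as wake (label 0) are both assumptions.

```python
import pandas as pd

keys = ["subject_number", "start_timestamp"]

# Toy normalized dataframes that share 900-second epoch boundaries
activity = pd.DataFrame({
    "subject_number": "S1",
    "start_timestamp": [0, 900, 1800],
    "steps": [0, 12, 40],
})
heart_rate = pd.DataFrame({
    "subject_number": "S1",
    "start_timestamp": [0, 1800],
    "hr_000": [88.0, 78.0],
})
sleep = pd.DataFrame({
    "subject_number": "S1",
    "start_timestamp": [900],
    "sleep_label": [1],
})

# Keep every activity epoch; attach heart rate and sleep where available
master = (
    activity
    .merge(heart_rate, on=keys, how="left")
    .merge(sleep, on=keys, how="left")
)

# Assumption: epochs with no sleep row are treated as wake (label 0)
master["sleep_label"] = master["sleep_label"].fillna(0).astype(int)
print(master)
```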


## 2.6 - Conclusion

Using dbdPy makes it incredibly easy and fast to transform and work with commercial wearable device data!

(*Here is where you could add some sort of call-to-action, like asking readers to follow Big Ideas Lab on Twitter.*)

## 2.7 - Appendix

This section shows the results of running the functions in [Section 1.3 - dbdPy Feature Engineering Functions](https://colab.research.google.com/drive/1ccG9IxpGcUGXEIk_EQh_U2-tWeFb-rtT#scrollTo=NO23jaPKQav9&line=5&uniqifier=1).

In [30]:
S1_hr_summary_df = create_summary_df(subjects, "S1", "normalized_hr_df")

S1_downsampled_sleep_df = downsample_sleep_df(
    subjects, "S1", interval = 900, window = 4 * 3600)

In [31]:
S1_hr_summary_df

Unnamed: 0,subject_number,start_timestamp,min,max,mean,std,var,mean_deviation
0,S1,1486224000,88.0,88.0,88.000000,0.000000,0.000000,18.306998
1,S1,1486229400,74.0,106.0,91.050000,9.643519,92.997458,21.356998
2,S1,1486230300,67.0,74.0,69.300000,2.010987,4.044068,1.715500
3,S1,1486231200,64.0,85.0,67.900000,5.685843,32.328814,4.941534
4,S1,1486232100,66.0,95.0,84.433333,5.063551,25.639548,14.863432
...,...,...,...,...,...,...,...,...
8404,S1,1494250200,58.0,90.0,65.783333,10.939110,119.664124,10.709368
8405,S1,1494251100,58.0,83.0,71.483333,9.606657,92.287853,9.070467
8406,S1,1494252000,62.0,66.0,62.583333,0.671241,0.450565,7.109668
8407,S1,1494252900,62.0,66.0,62.650000,1.070799,1.146610,7.043002


In [32]:
S1_downsampled_sleep_df

Unnamed: 0,subject_number,start_timestamp,end_timestamp,start_datetime,end_datetime,duration,sleep_label
0,S1,1486233900,1486234800,2017-02-04 18:45:00,2017-02-04 19:00:00,900,-1
1,S1,1486234800,1486235700,2017-02-04 19:00:00,2017-02-04 19:15:00,900,-1
2,S1,1486235700,1486236600,2017-02-04 19:15:00,2017-02-04 19:30:00,900,-1
3,S1,1486236600,1486237500,2017-02-04 19:30:00,2017-02-04 19:45:00,900,-1
4,S1,1486237500,1486238400,2017-02-04 19:45:00,2017-02-04 20:00:00,900,-1
...,...,...,...,...,...,...,...
5271,S1,1494250200,1494251100,2017-05-08 13:30:00,2017-05-08 13:45:00,900,-1
5272,S1,1494251100,1494252000,2017-05-08 13:45:00,2017-05-08 14:00:00,900,-1
5273,S1,1494252000,1494252900,2017-05-08 14:00:00,2017-05-08 14:15:00,900,-1
5274,S1,1494252900,1494253800,2017-05-08 14:15:00,2017-05-08 14:30:00,900,-1


In [33]:
# The downsampled sleep dataframe drops about 3.6k rows

print(subjects.get("S1").get("master_df").shape)
# (8923, 74)

print(S1_downsampled_sleep_df.shape)
# (5276, 7)

(8923, 74)
(5276, 7)
