From 55e8186120784f1fa81ab0f2e5a8ed041ee2f1e0 Mon Sep 17 00:00:00 2001 From: mariakakis Date: Wed, 29 Nov 2023 18:11:50 -0500 Subject: [PATCH] updating function in old assignment --- .../2e - Data Windowing.ipynb | 801 +++++++++++++++++- 1 file changed, 800 insertions(+), 1 deletion(-) diff --git a/lectures/fall/2_timeseries_timedomain/2e - Data Windowing.ipynb b/lectures/fall/2_timeseries_timedomain/2e - Data Windowing.ipynb index 75f490c..8180b44 100644 --- a/lectures/fall/2_timeseries_timedomain/2e - Data Windowing.ipynb +++ b/lectures/fall/2_timeseries_timedomain/2e - Data Windowing.ipynb @@ -1 +1,800 @@ -{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"provenance":[],"collapsed_sections":["OBADxK39qJDx","a_476M_ApXfO","ty37-Ret2vIG","UrwSKjb07ymm","KKFKnVHbIjCR","sijx62Kozleq","AhNAP1_ejAp5","DObQGe8dbLY7"],"authorship_tag":"ABX9TyMcYqq0r1NuzzBoTesmnHMQ"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["In this notebook, we're going to talk about how we can segment time series data into smaller, more manageable chunks for analysis.\n"],"metadata":{"id":"Vj8n7JvdwZ8S"}},{"cell_type":"markdown","source":["# Important: Run this code cell each time you start a new session!"],"metadata":{"id":"OBADxK39qJDx"}},{"cell_type":"code","source":["!pip install numpy\n","!pip install pandas\n","!pip install matplotlib\n","!pip install ipywidgets\n","!pip install os\n","import numpy as np\n","import pandas as pd\n","import matplotlib.pyplot as plt\n","import os"],"metadata":{"id":"yuzcYs-MUBvs"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["!wget -Ncnp https://physionet.org/files/accelerometry-walk-climb-drive/1.0.0/raw_accelerometry_data/id00b70b13.csv"],"metadata":{"id":"Q8hsDyCZ8Hrr"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["df = pd.read_csv('id00b70b13.csv')\n","\n","# Filter to only walking activity, which is given a code of 1\n","df = df[df['activity'] == 1]\n","\n","# Process the time\n","df.rename(columns={'time_s': 'Time'}, inplace=True)\n","df = df[(df['Time']>=700) & (df['Time']<=710)]\n","df['Time'] = df['Time'] - df['Time'].min()\n","\n","# Process the accel\n","df['Accel'] = np.sqrt(df['la_x']**2 + df['la_y']**2 + df['la_z']**2)*9.8\n","\n","# Keep only crucial columns\n","keep_cols = ['Time', 'Accel']\n","df = df[keep_cols]\n","df.to_csv('walking.csv',index=False)"],"metadata":{"id":"3b4MgOU-8Rrs"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# What Is a Sliding Window?"],"metadata":{"id":"a_476M_ApXfO"}},{"cell_type":"markdown","source":["A common approach to handling time-series data is to break it up into a sequence of data chunks, or ***windows***, and then analyzing those chunks rather than the entire sequence at once. Since windows are often generated sequentially over a signal, you will often hear them referred to as ***sliding windows***."],"metadata":{"id":"JAjYdF2DqOn8"}},{"cell_type":"markdown","source":["Basic data windowing requires deciding upon two important parameters:\n","1. **Width/length:** Describes how long the window is along the time axis.\n","2. **Stride:** Describes the temporal separation between successive windows.\n","\n","The units of measurement for both of these parameters can be expressed in various ways:\n","1. **Time:** 5-second window with a 1-second stride\n","2. **Number of samples:** 100-sample window with a 20-sample stride\n","3. **Percentage:** 5-second window with a 1-second stride == (5-1)/5 = 80% overlap\n","\n","As long we know the sampling rate of our signal, we can easily change between any of these units."],"metadata":{"id":"WYxJfhamFJSl"}},{"cell_type":"markdown","source":["Let's write a function to manually slide a window along our signal assuming that we use the number of samples to define the window parameters:"],"metadata":{"id":"5Wip6mM1O_Px"}},{"cell_type":"code","source":["def sliding_windows(x_values, y_values, width, stride):\n"," \"\"\"\n"," x_values: the time component of the signal\n"," y_values: the measured value of the signal\n"," width: the width of the windows measured in # of samples\n"," stride: the stride of the windows measured in # of samples\n"," \"\"\"\n"," # Initialize the start and end of the window\n"," start_idx = 0\n"," end_idx = width\n","\n"," # Stop generating windows it would go past the end of the signal\n"," signal_length = df.shape[0]\n"," while end_idx < signal_length:\n"," # Grab the current window\n"," x_window = x_values.iloc[start_idx:end_idx]\n"," y_window = y_values.iloc[start_idx:end_idx]\n"," print(\"X data:\", x_window.values)\n"," print(\"Y data:\", y_window.values)\n"," print(\"============\")\n","\n"," # Move the window over by a stride\n"," start_idx += stride\n"," end_idx += stride"],"metadata":{"id":"MuSNngicXjg5"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# Visualizing A Sliding Window"],"metadata":{"id":"ty37-Ret2vIG"}},{"cell_type":"markdown","source":["To see this algorithm in action, let's try it on a simple signal that we create on our own:\n","\n"],"metadata":{"id":"1jQO99ttl5xG"}},{"cell_type":"code","source":["x = np.arange(1, 21)\n","y = np.array([0, 1, 2, 3, 3, 3, 3, 4, 5, 6,\n"," 6, 5, 4, 3, 3, 3, 3, 2, 1, 0])\n","df = pd.DataFrame({'Time': x, 'Value': y})"],"metadata":{"id":"fT-zz_e0LvYc"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["plt.figure(figsize=(5, 3))\n","plt.plot(df['Time'], df['Value'])\n","plt.xlabel('Time')\n","plt.ylabel('Value')\n","plt.show()"],"metadata":{"id":"1hZcmsM2atDr"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Let's run the function we created earlier to see what windowing does:"],"metadata":{"id":"n4CnctAl22En"}},{"cell_type":"code","source":["sliding_windows(df['Time'], df['Value'], 5, 1)"],"metadata":{"id":"naO8AqjR27BX"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["To make this concept a bit more clear, here is some code that provides an animated version of our manual implementation above. This code is a bit more complicated in order to make the animation work, so you can just focus on the end result for now."],"metadata":{"id":"Q2HlsWA10bPl"}},{"cell_type":"code","source":["import matplotlib.animation as animation\n","from matplotlib import rc\n","rc('animation', html='jshtml')\n","\n","def show_animated_window(x_values, y_values, width, stride,\n"," show_points=True):\n"," # Calculate bounds\n"," xmin, xmax = min(x_values), max(x_values)\n"," ymin, ymax = min(y_values), max(y_values)\n","\n"," # Create the figure and plot the line\n"," fig, ax = plt.subplots(figsize=(5, 3))\n"," linestyle = 'k-*' if show_points else 'k-'\n"," line, = ax.plot(x_values, y_values, linestyle)\n"," ax.set_xlim(xmin, xmax)\n"," ax.xaxis.set_ticks(np.arange(xmin, xmax+1, 1))\n","\n"," # Calculate window parameters\n"," n = len(x_values)\n"," box_height = ymax - ymin\n"," box_width = x_values[width] - x_values[0]\n"," box_stride = x_values[stride] - x_values[0]\n"," num_windows = min(((n-width)//stride), 40) # cap the number of windows to 40\n","\n"," # Create the animated box and add it to the graph\n"," box = plt.Rectangle((xmin, ymin), box_width, box_height,\n"," fill=True, color='orange', alpha=0.5)\n"," ax.add_patch(box)\n","\n"," # Calculate and update the new box position\n"," def update(frame):\n"," box.set_x(xmin+frame*box_stride)\n"," return [box]\n","\n"," # Create the animation\n"," anim = animation.FuncAnimation(fig, update, frames=range(num_windows), interval=1000)\n"," return anim"],"metadata":{"id":"ynVQ_zCUYV2j"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Run the animation as it is currently set up (`window_width = 5` and `window_stride = 1`) to see how it aligns with the output of the `sliding_windows()` function we created earlier. After that, play around with the sliders to see how changing `window_width` and `window_stride` impacts the sliding window."],"metadata":{"id":"7Z0hqEZgNnp-"}},{"cell_type":"code","source":["window_width = 5 #@param {type:\"slider\", min:1, max:5, step:1}\n","window_stride = 5 #@param {type:\"slider\", min:1, max:5, step:1}\n","show_animated_window(df['Time'], df['Value'], window_width, window_stride)"],"metadata":{"id":"d-g3gLKIb1Im"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["When we decrease `window_width`, the shaded area becomes narrower and the window contains fewer values with each iteration. When we increase `window_width`, the shaded area becomes wider and the window contains more values with each iterations."],"metadata":{"id":"ehyVWa815-Sk"}},{"cell_type":"markdown","source":["When we decrease `window_stride`, the shaded area moves over the signal more gradually and consecutive windows will share more values. When we increase `window_stride`, the shaded area moves over the signal more aggressively and consectuive windows will share fewer values."],"metadata":{"id":"tN1G6S746IgU"}},{"cell_type":"markdown","source":["# Example with Sliding Windows: Peak Detection"],"metadata":{"id":"UrwSKjb07ymm"}},{"cell_type":"markdown","source":["Now that we know about sliding windows, let's go through an example to show how powerful they can be."],"metadata":{"id":"pykZdAMMOcur"}},{"cell_type":"markdown","source":["We are going to write an algorithm that is going to process motion data from a sensor strapped to a person's ankle and count the number of steps the person has taken. An example of the input to our algorithm is shown below:"],"metadata":{"id":"si4mo5Fg76bG"}},{"cell_type":"code","source":["# Plot the data\n","df = pd.read_csv('walking.csv')\n","plt.figure(figsize=(5,3))\n","plt.plot(df['Time'], df['Accel'])\n","plt.xlabel('Time (s)')\n","plt.ylabel('Accelerometer (m/s^2)')\n","plt.show()"],"metadata":{"id":"NlRKW_VC76HK"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["You might notice that this signal seems to have alternating peaks: a small one followed by a large one with roughly equal spacing. This is because we are measuring the motion of one foot during walking. The large peak corresponds to when the person steps with the same foot that has the sensor, and the smaller peak corresponds to when the person steps with the opposite foot."],"metadata":{"id":"7IFmj6MmW8jZ"}},{"cell_type":"markdown","source":["Conceptually, our step-counting algorithm will incrementally slide a window across the signal, and update the step counter if the window contains a step (i.e., a peak in the accelerometer signal). The pseudocode for this algorithm is as follows:"],"metadata":{"id":"NWDRwWUPIECa"}},{"cell_type":"markdown","source":["```\n","initialize our window\n","initialize our step counter\n","\n","while the window has not reached the end of the signal:\n"," grab the data within the window\n"," if there is a peak in the middle of the window:\n"," update our step counter\n"," move the window\n","```"],"metadata":{"id":"vVY_VKMXghYx"}},{"cell_type":"markdown","source":["For now, we are going to skip through some of the details and jump straight to the implementation."],"metadata":{"id":"HTMfNsCmIvAe"}},{"cell_type":"code","source":["def detect_steps(x_values, y_values, width, stride):\n"," \"\"\"\n"," x_values: the time component of the signal\n"," y_values: the measured value of the signal\n"," width: the width of the windows measured in # of samples\n"," stride: the stride of the windows measured in # of samples\n"," \"\"\"\n"," # Initialize the start and end of the window\n"," start_idx = 0\n"," end_idx = width\n"," middle_idx = width // 2\n","\n"," # NEW: Initialize the step counter\n"," steps = {}\n","\n"," # Stop generating windows it would go past the end of the signal\n"," signal_length = df.shape[0]\n"," while end_idx < signal_length:\n"," # Grab the current window\n"," x_window = x_values.iloc[start_idx:end_idx]\n"," y_window = y_values.iloc[start_idx:end_idx]\n","\n"," # NEW: Check if there is a peak in the middle\n"," if y_window.argmax() == middle_idx:\n"," # NEW: Update the contents of the step counter\n"," step_timestamp = x_window.values[middle_idx]\n"," step_value = y_window.values[middle_idx]\n"," steps[step_timestamp] = step_value\n","\n"," # Move the window over by a stride\n"," start_idx += stride\n"," end_idx += stride\n","\n"," # NEW: Show the steps overlaid on the graph\n"," plt.figure(figsize=(5,3))\n"," plt.plot(df['Time'], df['Accel'])\n"," plt.plot(steps.keys(), steps.values(), 'k*')\n"," plt.xlabel('Time (s)')\n"," plt.ylabel('Accelerometer (m/s^2)')\n"," plt.title(f'Number of detected steps: {len(steps)}')\n"," plt.show()"],"metadata":{"id":"L9Ip_0bR8EkK"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["detect_steps(df['Time'], df['Accel'], 50, 1)"],"metadata":{"id":"5J7mYFQ-NkcG"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Our algorithm isn't perfect, as you can see that we've missed the last step. We could fix this oversight by modifying our algorithm so that the window can effectively go past the end of the signal without throwing an exception, but we'll forgo that exercise for now."],"metadata":{"id":"HzqOhWzMRdPQ"}},{"cell_type":"markdown","source":["## Deciding On Window Width"],"metadata":{"id":"KKFKnVHbIjCR"}},{"cell_type":"markdown","source":["There were three major details that we skipped when writing our step-counting algorithm. Let's start with the first one:\n","1. How did we know to set `width = 50`?"],"metadata":{"id":"GpWRxldJF77U"}},{"cell_type":"markdown","source":["To investigate this question, play around with the setting of `window_width` in the code block below.\n","\n"],"metadata":{"id":"FJZWaUmnUWOL"}},{"cell_type":"code","source":["window_width = 100 #@param {type:\"slider\", min:10, max:100, step:10}\n","detect_steps(df['Time'], df['Accel'], window_width, 1)"],"metadata":{"id":"DWF2GJ9hTSPm"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Notice that when the width is too narrow, the algorithm detects insignificant peaks between footsteps. When the width is too wide, the algorithm misses the more subtle steps by the non-dominant foot."],"metadata":{"id":"2GCmWigrUaXT"}},{"cell_type":"markdown","source":["So how did we know that `window_size = 50` would work? This number came from two sources of information:\n","1. The **sampling rate** of the signal, which we can find out by inspecting the time-series or doing a bit of math:"],"metadata":{"id":"9H7Q-H3JXES0"}},{"cell_type":"code","source":["time_diff = df['Time'].iloc[1] - df['Time'].iloc[0]\n","print(f\"Sampling rate: {1/(time_diff):0.2f} Hz\")"],"metadata":{"id":"Q1HrTB-XyK8P"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["2. **Domain expertise** of knowing how quickly people typically walk. Using either our best guess or looking through literature, we can estimate that people take roughly 1-2 steps per second while walking."],"metadata":{"id":"LMFEmKW_yJ2X"}},{"cell_type":"markdown","source":["If we assume a person takes 1 step per second, that means there will be `100 samples per second / 1 step per second = 100 samples per step` in our signal. If we assume a person takes 2 steps per second, that means there will be `100 samples per second / 2 steps per second = 50 samples per step` in our signal.\n","\n","Recall that when we used the wider setting (100 samples), the window missed all the softer steps. Therefore, it makes to lean more toward the narrower windows as long as we don't go *too* narrow."],"metadata":{"id":"MfuO0B_Yy-a2"}},{"cell_type":"markdown","source":["## Deciding On Window Stride"],"metadata":{"id":"sijx62Kozleq"}},{"cell_type":"markdown","source":["There were two other details we skipped over when writing our step-counting algorithm:\n","2. How did we know to set `stride = 1`?\n","3. Why did we only count a step if it happened exactly in the middle of the sliding window?"],"metadata":{"id":"MOFtG_w04jRZ"}},{"cell_type":"markdown","source":["These two decisions went hand-in-hand for this particular algorithm, but first play around with the setting of window_stride in the code block below to see what happens."],"metadata":{"id":"lHWOBXxT5Fid"}},{"cell_type":"code","source":["window_stride = 1 #@param {type:\"slider\", min:1, max:5, step:1}\n","detect_steps(df['Time'], df['Accel'], 50, window_stride)"],"metadata":{"id":"CXUzLpQ-4k-m"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Since our algorithm is designed to detect peaks that land right in the middle of the window, setting the stride to the smallest unit possible ensures that every data point lands in the middle of a window at least once. By increasing the stride and not changing the requirement that the point lands in the middle, we miss peaks and undercount the number of steps taken."],"metadata":{"id":"cDpCBDblZ96s"}},{"cell_type":"markdown","source":["Had we loosened the requirement about the peak landing in the middle of the window but used the smallest stride possible, we would have counted the same peak multiple times and inflated our step count."],"metadata":{"id":"hppQXY7klHxY"}},{"cell_type":"markdown","source":["As a compromise, we could change both the stride **AND** the requirement of where the peak landed within the window, but we will leave that as a thought exercise."],"metadata":{"id":"ARJJ7PVCl6Kr"}},{"cell_type":"markdown","source":["## Another Way To Write Our Algorithm"],"metadata":{"id":"AhNAP1_ejAp5"}},{"cell_type":"markdown","source":["Knowing that the window parameters are tied to real-world time and that we want to check if there's a peak at every single point of our signal, we can actually rewrite our algorithm in a different way that only requires us to specify the window width in seconds:"],"metadata":{"id":"_XsnY8mpjEbC"}},{"cell_type":"code","source":["def detect_steps(x_values, y_values, width):\n"," \"\"\"\n"," x_values: the time component of the signal\n"," y_values: the measured value of the signal\n"," width: the width of the windows measured in seconds\n"," \"\"\"\n"," # Initialize the start and end of the window\n"," start_time = 0\n"," end_time = width\n","\n"," # NEW: Calculate the sampling period and where the\n"," # middle index of the window will be\n"," sample_period = x_values.iloc[1] - x_values.iloc[0]\n"," middle_idx = int((width / sample_period) // 2)\n","\n"," # Initialize the step counter\n"," steps = {}\n","\n"," # Stop generating windows it would go past the end of the signal\n"," signal_duration = x_values.iloc[-1]\n"," while end_time < signal_duration:\n"," # NEW: Grab the current window by filtering indexes according to time\n"," window_idxs = (x_values >= start_time) & (x_values <= end_time)\n"," x_window = x_values[window_idxs]\n"," y_window = y_values[window_idxs]\n","\n"," # Check if there is a peak in the middle\n"," if y_window.argmax() == middle_idx:\n"," # Update the contents of the step counter\n"," step_timestamp = x_window.values[middle_idx]\n"," step_value = y_window.values[middle_idx]\n"," steps[step_timestamp] = step_value\n","\n"," # Move the window over by a stride\n"," start_time += sample_period\n"," end_time += sample_period\n","\n"," # Show the steps overlaid on the graph\n"," plt.figure(figsize=(5,3))\n"," plt.plot(df['Time'], df['Accel'])\n"," plt.plot(steps.keys(), steps.values(), 'k*')\n"," plt.xlabel('Time (s)')\n"," plt.ylabel('Accelerometer (m/s^2)')\n"," plt.title(f'Number of detected steps: {len(steps)}')\n"," plt.show()"],"metadata":{"id":"WJCeOSxWjEq4"},"execution_count":null,"outputs":[]},{"cell_type":"code","source":["detect_steps(df['Time'], df['Accel'], 0.5)"],"metadata":{"id":"fK12Z6WtkAbI"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["# General Guidelines for Sliding Windows"],"metadata":{"id":"DObQGe8dbLY7"}},{"cell_type":"markdown","source":["**Window Width**\n","* Longer windows are better at accounting for long-term trends.\n","* Shorter windows are better at accounting for short-term trends."],"metadata":{"id":"C3B0qSEhd8TH"}},{"cell_type":"markdown","source":["**Window Stride**\n","* Longer window strides will decrease the temporal resolution of your analysis but make your analysis more efficient.\n","* Shorter window strides will increase the temperal resolution of your analysis but may result in redundant calculations if the signal is stagnant over time."],"metadata":{"id":"GTmbz1ZAbSLo"}},{"cell_type":"markdown","source":["**Window Width and Stride**\n","* The window stride almost never exceeds the window width. Otherwise, this would cause some parts of the signal to never be included in any window.\n","* Having the window width and the window stride being set to the same value essentially splits up your data into distinct, non-overlapping segments. In some cases, samples at the boundaries may be handled incorrectly (e.g., peak detection).\n","* Common window configurations typically involve overlap percentages like 90%, 75%, and 50% that ensure every sample is seen the same number of times."],"metadata":{"id":"KVeUcJCMb8ag"}},{"cell_type":"markdown","source":["All in all, you will need to use a combination of domain expertise, intuition, and trial-and-error to determine a reasonable configuration for a sliding window. You will rarely need to be super precise with how you construct your sliding window; otherwise, your analysis plan is probably too fragile to generalize to other scenarios. Still, thinking about how to subdivide time-series data into managable chunks will help as you prepare for downstream analyses like machine learning."],"metadata":{"id":"IyAN77ukgSpd"}}]} \ No newline at end of file +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "provenance": [], + "collapsed_sections": [ + "OBADxK39qJDx", + "a_476M_ApXfO", + "ty37-Ret2vIG", + "UrwSKjb07ymm", + "KKFKnVHbIjCR", + "sijx62Kozleq", + "AhNAP1_ejAp5", + "DObQGe8dbLY7" + ] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "language_info": { + "name": "python" + } + }, + "cells": [ + { + "cell_type": "markdown", + "source": [ + "In this notebook, we're going to talk about how we can segment time series data into smaller, more manageable chunks for analysis.\n" + ], + "metadata": { + "id": "Vj8n7JvdwZ8S" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Important: Run this code cell each time you start a new session!" + ], + "metadata": { + "id": "OBADxK39qJDx" + } + }, + { + "cell_type": "code", + "source": [ + "!pip install numpy\n", + "!pip install pandas\n", + "!pip install matplotlib\n", + "!pip install ipywidgets\n", + "!pip install os\n", + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "import os" + ], + "metadata": { + "id": "yuzcYs-MUBvs" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "!wget -Ncnp https://physionet.org/files/accelerometry-walk-climb-drive/1.0.0/raw_accelerometry_data/id00b70b13.csv" + ], + "metadata": { + "id": "Q8hsDyCZ8Hrr" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "df = pd.read_csv('id00b70b13.csv')\n", + "\n", + "# Filter to only walking activity, which is given a code of 1\n", + "df = df[df['activity'] == 1]\n", + "\n", + "# Process the time\n", + "df.rename(columns={'time_s': 'Time'}, inplace=True)\n", + "df = df[(df['Time']>=700) & (df['Time']<=710)]\n", + "df['Time'] = df['Time'] - df['Time'].min()\n", + "\n", + "# Process the accel\n", + "df['Accel'] = np.sqrt(df['la_x']**2 + df['la_y']**2 + df['la_z']**2)*9.8\n", + "\n", + "# Keep only crucial columns\n", + "keep_cols = ['Time', 'Accel']\n", + "df = df[keep_cols]\n", + "df.to_csv('walking.csv',index=False)" + ], + "metadata": { + "id": "3b4MgOU-8Rrs" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# What Is a Sliding Window?" + ], + "metadata": { + "id": "a_476M_ApXfO" + } + }, + { + "cell_type": "markdown", + "source": [ + "A common approach to handling time-series data is to break it up into a sequence of data chunks, or ***windows***, and then analyzing those chunks rather than the entire sequence at once. Since windows are often generated sequentially over a signal, you will often hear them referred to as ***sliding windows***." + ], + "metadata": { + "id": "JAjYdF2DqOn8" + } + }, + { + "cell_type": "markdown", + "source": [ + "Basic data windowing requires deciding upon two important parameters:\n", + "1. **Width/length:** Describes how long the window is along the time axis.\n", + "2. **Stride:** Describes the temporal separation between successive windows.\n", + "\n", + "The units of measurement for both of these parameters can be expressed in various ways:\n", + "1. **Time:** 5-second window with a 1-second stride\n", + "2. **Number of samples:** 100-sample window with a 20-sample stride\n", + "3. **Percentage:** 5-second window with a 1-second stride == (5-1)/5 = 80% overlap\n", + "\n", + "As long we know the sampling rate of our signal, we can easily change between any of these units." + ], + "metadata": { + "id": "WYxJfhamFJSl" + } + }, + { + "cell_type": "markdown", + "source": [ + "Let's write a function to manually slide a window along our signal assuming that we use the number of samples to define the window parameters:" + ], + "metadata": { + "id": "5Wip6mM1O_Px" + } + }, + { + "cell_type": "code", + "source": [ + "def sliding_windows(x_values, y_values, width, stride):\n", + " \"\"\"\n", + " x_values: the time component of the signal\n", + " y_values: the measured value of the signal\n", + " width: the width of the windows measured in # of samples\n", + " stride: the stride of the windows measured in # of samples\n", + " \"\"\"\n", + " # Initialize the start and end of the window\n", + " start_idx = 0\n", + " end_idx = width\n", + "\n", + " # Stop generating windows it would go past the end of the signal\n", + " signal_length = df.shape[0]\n", + " while end_idx < signal_length:\n", + " # Grab the current window\n", + " x_window = x_values.iloc[start_idx:end_idx]\n", + " y_window = y_values.iloc[start_idx:end_idx]\n", + " print(\"X data:\", x_window.values)\n", + " print(\"Y data:\", y_window.values)\n", + " print(\"============\")\n", + "\n", + " # Move the window over by a stride\n", + " start_idx += stride\n", + " end_idx += stride" + ], + "metadata": { + "id": "MuSNngicXjg5" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# Visualizing A Sliding Window" + ], + "metadata": { + "id": "ty37-Ret2vIG" + } + }, + { + "cell_type": "markdown", + "source": [ + "To see this algorithm in action, let's try it on a simple signal that we create on our own:\n", + "\n" + ], + "metadata": { + "id": "1jQO99ttl5xG" + } + }, + { + "cell_type": "code", + "source": [ + "x = np.arange(1, 21)\n", + "y = np.array([0, 1, 2, 3, 3, 3, 3, 4, 5, 6,\n", + " 6, 5, 4, 3, 3, 3, 3, 2, 1, 0])\n", + "df = pd.DataFrame({'Time': x, 'Value': y})" + ], + "metadata": { + "id": "fT-zz_e0LvYc" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "plt.figure(figsize=(5, 3))\n", + "plt.plot(df['Time'], df['Value'])\n", + "plt.xlabel('Time')\n", + "plt.ylabel('Value')\n", + "plt.show()" + ], + "metadata": { + "id": "1hZcmsM2atDr" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Let's run the function we created earlier to see what windowing does:" + ], + "metadata": { + "id": "n4CnctAl22En" + } + }, + { + "cell_type": "code", + "source": [ + "sliding_windows(df['Time'], df['Value'], 5, 1)" + ], + "metadata": { + "id": "naO8AqjR27BX" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "To make this concept a bit more clear, here is some code that provides an animated version of our manual implementation above. This code is a bit more complicated in order to make the animation work, so you can just focus on the end result for now." + ], + "metadata": { + "id": "Q2HlsWA10bPl" + } + }, + { + "cell_type": "code", + "source": [ + "import matplotlib.animation as animation\n", + "from matplotlib import rc\n", + "rc('animation', html='jshtml')\n", + "\n", + "def show_animated_window(x_values, y_values, width, stride,\n", + " show_points=True):\n", + " # Calculate bounds\n", + " xmin, xmax = min(x_values), max(x_values)\n", + " ymin, ymax = min(y_values), max(y_values)\n", + "\n", + " # Create the figure and plot the line\n", + " fig, ax = plt.subplots(figsize=(5, 3))\n", + " linestyle = 'k-*' if show_points else 'k-'\n", + " line, = ax.plot(x_values, y_values, linestyle)\n", + " ax.set_xlim(xmin, xmax)\n", + " ax.xaxis.set_ticks(np.arange(xmin, xmax+1, 1))\n", + "\n", + " # Calculate window parameters\n", + " n = len(x_values)\n", + " box_height = ymax - ymin\n", + " box_width = x_values[width] - x_values[0]\n", + " box_stride = x_values[stride] - x_values[0]\n", + " num_windows = min(((n-width)//stride), 40) # cap the number of windows to 40\n", + "\n", + " # Create the animated box and add it to the graph\n", + " box = plt.Rectangle((xmin, ymin), box_width, box_height,\n", + " fill=True, color='orange', alpha=0.5)\n", + " ax.add_patch(box)\n", + "\n", + " # Calculate and update the new box position\n", + " def update(frame):\n", + " box.set_x(xmin+frame*box_stride)\n", + " return [box]\n", + "\n", + " # Create the animation\n", + " anim = animation.FuncAnimation(fig, update, frames=range(num_windows), interval=1000)\n", + " return anim" + ], + "metadata": { + "id": "ynVQ_zCUYV2j" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Run the animation as it is currently set up (`window_width = 5` and `window_stride = 1`) to see how it aligns with the output of the `sliding_windows()` function we created earlier. After that, play around with the sliders to see how changing `window_width` and `window_stride` impacts the sliding window." + ], + "metadata": { + "id": "7Z0hqEZgNnp-" + } + }, + { + "cell_type": "code", + "source": [ + "window_width = 5 #@param {type:\"slider\", min:1, max:5, step:1}\n", + "window_stride = 5 #@param {type:\"slider\", min:1, max:5, step:1}\n", + "show_animated_window(df['Time'], df['Value'], window_width, window_stride)" + ], + "metadata": { + "id": "d-g3gLKIb1Im" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "When we decrease `window_width`, the shaded area becomes narrower and the window contains fewer values with each iteration. When we increase `window_width`, the shaded area becomes wider and the window contains more values with each iterations." + ], + "metadata": { + "id": "ehyVWa815-Sk" + } + }, + { + "cell_type": "markdown", + "source": [ + "When we decrease `window_stride`, the shaded area moves over the signal more gradually and consecutive windows will share more values. When we increase `window_stride`, the shaded area moves over the signal more aggressively and consectuive windows will share fewer values." + ], + "metadata": { + "id": "tN1G6S746IgU" + } + }, + { + "cell_type": "markdown", + "source": [ + "# Example with Sliding Windows: Peak Detection" + ], + "metadata": { + "id": "UrwSKjb07ymm" + } + }, + { + "cell_type": "markdown", + "source": [ + "Now that we know about sliding windows, let's go through an example to show how powerful they can be." + ], + "metadata": { + "id": "pykZdAMMOcur" + } + }, + { + "cell_type": "markdown", + "source": [ + "We are going to write an algorithm that is going to process motion data from a sensor strapped to a person's ankle and count the number of steps the person has taken. An example of the input to our algorithm is shown below:" + ], + "metadata": { + "id": "si4mo5Fg76bG" + } + }, + { + "cell_type": "code", + "source": [ + "# Plot the data\n", + "df = pd.read_csv('walking.csv')\n", + "plt.figure(figsize=(5,3))\n", + "plt.plot(df['Time'], df['Accel'])\n", + "plt.xlabel('Time (s)')\n", + "plt.ylabel('Accelerometer (m/s^2)')\n", + "plt.show()" + ], + "metadata": { + "id": "NlRKW_VC76HK" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "You might notice that this signal seems to have alternating peaks: a small one followed by a large one with roughly equal spacing. This is because we are measuring the motion of one foot during walking. The large peak corresponds to when the person steps with the same foot that has the sensor, and the smaller peak corresponds to when the person steps with the opposite foot." + ], + "metadata": { + "id": "7IFmj6MmW8jZ" + } + }, + { + "cell_type": "markdown", + "source": [ + "Conceptually, our step-counting algorithm will incrementally slide a window across the signal, and update the step counter if the window contains a step (i.e., a peak in the accelerometer signal). The pseudocode for this algorithm is as follows:" + ], + "metadata": { + "id": "NWDRwWUPIECa" + } + }, + { + "cell_type": "markdown", + "source": [ + "```\n", + "initialize our window\n", + "initialize our step counter\n", + "\n", + "while the window has not reached the end of the signal:\n", + " grab the data within the window\n", + " if there is a peak in the middle of the window:\n", + " update our step counter\n", + " move the window\n", + "```" + ], + "metadata": { + "id": "vVY_VKMXghYx" + } + }, + { + "cell_type": "markdown", + "source": [ + "For now, we are going to skip through some of the details and jump straight to the implementation." + ], + "metadata": { + "id": "HTMfNsCmIvAe" + } + }, + { + "cell_type": "code", + "source": [ + "def detect_steps(df, width, stride):\n", + " \"\"\"\n", + " df: a DataFrame containing accelerometer values over time\n", + " width: the width of the windows measured in # of samples\n", + " stride: the stride of the windows measured in # of samples\n", + " \"\"\"\n", + " # Initialize the start and end of the window\n", + " start_idx = 0\n", + " end_idx = width\n", + " middle_idx = width // 2\n", + "\n", + " # NEW: Initialize the step counter\n", + " steps = {}\n", + "\n", + " # Stop generating windows it would go past the end of the signal\n", + " signal_length = df.shape[0]\n", + " while end_idx < signal_length:\n", + " # Grab the current window\n", + " x_window = df['Time'].iloc[start_idx:end_idx]\n", + " y_window = df['Accel'].iloc[start_idx:end_idx]\n", + "\n", + " # NEW: Check if there is a peak in the middle\n", + " if y_window.argmax() == middle_idx:\n", + " # NEW: Update the contents of the step counter\n", + " step_timestamp = x_window.values[middle_idx]\n", + " step_value = y_window.values[middle_idx]\n", + " steps[step_timestamp] = step_value\n", + "\n", + " # Move the window over by a stride\n", + " start_idx += stride\n", + " end_idx += stride\n", + "\n", + " # NEW: Show the steps overlaid on the graph\n", + " plt.figure(figsize=(5,3))\n", + " plt.plot(df['Time'], df['Accel'])\n", + " plt.plot(steps.keys(), steps.values(), 'k*')\n", + " plt.xlabel('Time (s)')\n", + " plt.ylabel('Accelerometer (m/s^2)')\n", + " plt.title(f'Number of detected steps: {len(steps)}')\n", + " plt.show()" + ], + "metadata": { + "id": "L9Ip_0bR8EkK" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "detect_steps(df, 50, 1)" + ], + "metadata": { + "id": "5J7mYFQ-NkcG" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Our algorithm isn't perfect, as you can see that we've missed the last step. We could fix this oversight by modifying our algorithm so that the window can effectively go past the end of the signal without throwing an exception, but we'll forgo that exercise for now." + ], + "metadata": { + "id": "HzqOhWzMRdPQ" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Deciding On Window Width" + ], + "metadata": { + "id": "KKFKnVHbIjCR" + } + }, + { + "cell_type": "markdown", + "source": [ + "There were three major details that we skipped when writing our step-counting algorithm. Let's start with the first one:\n", + "1. How did we know to set `width = 50`?" + ], + "metadata": { + "id": "GpWRxldJF77U" + } + }, + { + "cell_type": "markdown", + "source": [ + "To investigate this question, play around with the setting of `window_width` in the code block below.\n", + "\n" + ], + "metadata": { + "id": "FJZWaUmnUWOL" + } + }, + { + "cell_type": "code", + "source": [ + "window_width = 100 #@param {type:\"slider\", min:10, max:100, step:10}\n", + "detect_steps(df, window_width, 1)" + ], + "metadata": { + "id": "DWF2GJ9hTSPm" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Notice that when the width is too narrow, the algorithm detects insignificant peaks between footsteps. When the width is too wide, the algorithm misses the more subtle steps by the non-dominant foot." + ], + "metadata": { + "id": "2GCmWigrUaXT" + } + }, + { + "cell_type": "markdown", + "source": [ + "So how did we know that `window_size = 50` would work? This number came from two sources of information:\n", + "1. The **sampling rate** of the signal, which we can find out by inspecting the time-series or doing a bit of math:" + ], + "metadata": { + "id": "9H7Q-H3JXES0" + } + }, + { + "cell_type": "code", + "source": [ + "time_diff = df['Time'].iloc[1] - df['Time'].iloc[0]\n", + "print(f\"Sampling rate: {1/(time_diff):0.2f} Hz\")" + ], + "metadata": { + "id": "Q1HrTB-XyK8P" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "2. **Domain expertise** of knowing how quickly people typically walk. Using either our best guess or looking through literature, we can estimate that people take roughly 1-2 steps per second while walking." + ], + "metadata": { + "id": "LMFEmKW_yJ2X" + } + }, + { + "cell_type": "markdown", + "source": [ + "If we assume a person takes 1 step per second, that means there will be `100 samples per second / 1 step per second = 100 samples per step` in our signal. If we assume a person takes 2 steps per second, that means there will be `100 samples per second / 2 steps per second = 50 samples per step` in our signal.\n", + "\n", + "Recall that when we used the wider setting (100 samples), the window missed all the softer steps. Therefore, it makes to lean more toward the narrower windows as long as we don't go *too* narrow." + ], + "metadata": { + "id": "MfuO0B_Yy-a2" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Deciding On Window Stride" + ], + "metadata": { + "id": "sijx62Kozleq" + } + }, + { + "cell_type": "markdown", + "source": [ + "There were two other details we skipped over when writing our step-counting algorithm:\n", + "2. How did we know to set `stride = 1`?\n", + "3. Why did we only count a step if it happened exactly in the middle of the sliding window?" + ], + "metadata": { + "id": "MOFtG_w04jRZ" + } + }, + { + "cell_type": "markdown", + "source": [ + "These two decisions went hand-in-hand for this particular algorithm, but first play around with the setting of window_stride in the code block below to see what happens." + ], + "metadata": { + "id": "lHWOBXxT5Fid" + } + }, + { + "cell_type": "code", + "source": [ + "window_stride = 1 #@param {type:\"slider\", min:1, max:5, step:1}\n", + "detect_steps(df, 50, window_stride)" + ], + "metadata": { + "id": "CXUzLpQ-4k-m" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "Since our algorithm is designed to detect peaks that land right in the middle of the window, setting the stride to the smallest unit possible ensures that every data point lands in the middle of a window at least once. By increasing the stride and not changing the requirement that the point lands in the middle, we miss peaks and undercount the number of steps taken." + ], + "metadata": { + "id": "cDpCBDblZ96s" + } + }, + { + "cell_type": "markdown", + "source": [ + "Had we loosened the requirement about the peak landing in the middle of the window but used the smallest stride possible, we would have counted the same peak multiple times and inflated our step count." + ], + "metadata": { + "id": "hppQXY7klHxY" + } + }, + { + "cell_type": "markdown", + "source": [ + "As a compromise, we could change both the stride **AND** the requirement of where the peak landed within the window, but we will leave that as a thought exercise." + ], + "metadata": { + "id": "ARJJ7PVCl6Kr" + } + }, + { + "cell_type": "markdown", + "source": [ + "## Another Way To Write Our Algorithm" + ], + "metadata": { + "id": "AhNAP1_ejAp5" + } + }, + { + "cell_type": "markdown", + "source": [ + "Knowing that the window parameters are tied to real-world time and that we want to check if there's a peak at every single point of our signal, we can actually rewrite our algorithm in a different way that only requires us to specify the window width in seconds:" + ], + "metadata": { + "id": "_XsnY8mpjEbC" + } + }, + { + "cell_type": "code", + "source": [ + "def detect_steps(df, width):\n", + " \"\"\"\n", + " df: a DataFrame containing accelerometer values over time\n", + " width: the width of the windows measured in seconds\n", + " \"\"\"\n", + " # Initialize the start and end of the window\n", + " start_time = 0\n", + " end_time = width\n", + "\n", + " # NEW: Calculate the sampling period and where the\n", + " # middle index of the window will be\n", + " sample_period = df['Time'].iloc[1] - df['Time'].iloc[0]\n", + " middle_idx = int((width / sample_period) // 2)\n", + "\n", + " # Initialize the step counter\n", + " steps = {}\n", + "\n", + " # Stop generating windows it would go past the end of the signal\n", + " signal_duration = df['Time'].max()\n", + " while end_time < signal_duration:\n", + " # NEW: Grab the current window by filtering indexes according to time\n", + " window_idxs = (df['Time'] >= start_time) & (df['Time'] <= end_time)\n", + " x_window = df['Time'][window_idxs]\n", + " y_window = df['Accel'][window_idxs]\n", + "\n", + " # Check if there is a peak in the middle\n", + " if y_window.argmax() == middle_idx:\n", + " # Update the contents of the step counter\n", + " step_timestamp = x_window.values[middle_idx]\n", + " step_value = y_window.values[middle_idx]\n", + " steps[step_timestamp] = step_value\n", + "\n", + " # Move the window over by a stride\n", + " start_time += sample_period\n", + " end_time += sample_period\n", + "\n", + " # Show the steps overlaid on the graph\n", + " plt.figure(figsize=(5,3))\n", + " plt.plot(df['Time'], df['Accel'])\n", + " plt.plot(steps.keys(), steps.values(), 'k*')\n", + " plt.xlabel('Time (s)')\n", + " plt.ylabel('Accelerometer (m/s^2)')\n", + " plt.title(f'Number of detected steps: {len(steps)}')\n", + " plt.show()" + ], + "metadata": { + "id": "WJCeOSxWjEq4" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "source": [ + "detect_steps(df, 0.5)" + ], + "metadata": { + "id": "fK12Z6WtkAbI" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "source": [ + "# General Guidelines for Sliding Windows" + ], + "metadata": { + "id": "DObQGe8dbLY7" + } + }, + { + "cell_type": "markdown", + "source": [ + "**Window Width**\n", + "* Longer windows are better at accounting for long-term trends.\n", + "* Shorter windows are better at accounting for short-term trends." + ], + "metadata": { + "id": "C3B0qSEhd8TH" + } + }, + { + "cell_type": "markdown", + "source": [ + "**Window Stride**\n", + "* Longer window strides will decrease the temporal resolution of your analysis but make your analysis more efficient.\n", + "* Shorter window strides will increase the temperal resolution of your analysis but may result in redundant calculations if the signal is stagnant over time." + ], + "metadata": { + "id": "GTmbz1ZAbSLo" + } + }, + { + "cell_type": "markdown", + "source": [ + "**Window Width and Stride**\n", + "* The window stride almost never exceeds the window width. Otherwise, this would cause some parts of the signal to never be included in any window.\n", + "* Having the window width and the window stride being set to the same value essentially splits up your data into distinct, non-overlapping segments. In some cases, samples at the boundaries may be handled incorrectly (e.g., peak detection).\n", + "* Common window configurations typically involve overlap percentages like 90%, 75%, and 50% that ensure every sample is seen the same number of times." + ], + "metadata": { + "id": "KVeUcJCMb8ag" + } + }, + { + "cell_type": "markdown", + "source": [ + "All in all, you will need to use a combination of domain expertise, intuition, and trial-and-error to determine a reasonable configuration for a sliding window. You will rarely need to be super precise with how you construct your sliding window; otherwise, your analysis plan is probably too fragile to generalize to other scenarios. Still, thinking about how to subdivide time-series data into managable chunks will help as you prepare for downstream analyses like machine learning." + ], + "metadata": { + "id": "IyAN77ukgSpd" + } + } + ] +} \ No newline at end of file