In this notebook, we're going to explore a time-series gait dataset collected from patients with Parkinson's disease. The data is taken from a [PhysioNet](https://www.physionet.org/) repository called [Gait in Parkinson's Disease](https://www.physionet.org/content/gaitpdb/1.0.0/) by Jeffrey Hausdorff.

# Important: Run this code cell each time you start a new session!

In [None]:
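!pip install numpy
!pip install pandas
!pip install matplotlib
# scipy is used later for FFTs, spectrograms, and correlations
!pip install scipy
# xlrd lets pandas read the .xls demographics file
!pip install xlrd
# Note: os is part of the Python standard library, so it does not need to be installed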
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os

In [None]:
!wget -rNcnp https://physionet.org/files/gaitpdb/1.0.0/

# Overview of the Dataset

Parkinson's disease is a chronic and progressive neurological disorder that affects the central nervous system. It primarily affects movement and is characterized by a variety of symptoms, including tremors, stiffness, slow movements, and difficulty with balance and coordination. A disturbed gait is a common, debilitating symptom; patients with severe gait disturbances are prone to falls and may lose their functional independence.

The goal of this dataset is to enable researchers to investigate whether characteristics of gait can be used to automatically monitor the severity of Parkinson's disease over time. This dataset is actually composed of data collected by three institutions. Together, these institutions recruited 93 patients with idiopathic PD and 73 healthy controls. During enrollment, subjects were asked to complete a number of clinical scales to assess the severity of their Parkinsonian symptoms. The clinical scale we will focus on the most is the Unified Parkinson's Disease Rating Scale (UPDRS). This scale is composed of four different parts, but we will focus on the portion that deals with motor control function (Part III).


During the study itself, subjects were asked to walk at their usual pace for approximately 2 minutes on level ground. Subjects were asked to repeat this protocol for multiple trials depending on the institution where the data was collected. Underneath each foot were 8 sensors that measure force (in Newtons) as a function of time; the researchers who compiled this dataset refer to the sensor data as the vertical ground reaction force (VGRF). The output of each of these 16 sensors has been digitized and recorded at 100 Hz, and the records also include two signals that reflect the sum of the 8 sensor outputs for each foot.

All of our data, which has already been downloaded, is located in the folder `physionet.org/files/gaitpdb/1.0.0/`. Along with the recording files, we will also look at the file `demographics.xls`, which contains both subject demographics and their clinical assessment scores.

In [None]:
# The relevant folders and files associated with this dataset
base_folder = os.path.join('physionet.org', 'files', 'gaitpdb', '1.0.0')
demo_filename = os.path.join(base_folder, 'demographics.xls')

The recording files are named according to the following convention: `{study_prefix}{subject_type}{subject_id}_{trial_id}.txt` (e.g., `GaCo01_01.txt`)

* `study_prefix`: Specifies the institution where the subject was recruited (either `Ga`, `Ju`, or `Si`)
* `subject_type`: Specifies whether the subject was a control (`Co`) or a patient (`Pt`)
* `subject_id`: Numerical identifier indicating the subject's number within the institution's cohort
* `trial_id`: Numerical identifier indicating the trial number. We will be looking at all trials except for any numbered `10`, which relates to a special protocol used by a single institution.
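
As a small illustration of this naming convention, the sketch below splits a recording filename into its parts with a regular expression. The helper name `parse_recording_name` and the pattern itself are assumptions made for this example; in particular, the pattern assumes two-digit subject and trial numbers, as in `GaCo01_01.txt`.

```python
import re

# Hypothetical pattern: assumes two-digit subject and trial numbers
RECORDING_PATTERN = re.compile(
    r'^(?P<study_prefix>Ga|Ju|Si)'
    r'(?P<subject_type>Co|Pt)'
    r'(?P<subject_id>\d{2})_'
    r'(?P<trial_id>\d{2})\.txt$')

def parse_recording_name(filename):
    """Return the parts of a recording filename, or None if it does not match."""
    match = RECORDING_PATTERN.match(filename)
    if match is None:
        return None
    return match.groupdict()

print(parse_recording_name('GaCo01_01.txt'))
print(parse_recording_name('demographics.xls'))
```

A parser like this returns `None` for files that are not recordings, which could be one way to decide which files to skip when looping over the data folder.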

These files are structured as tab-separated spreadsheets with the following columns:

| Column # | Description |
|----------|-------------|
| 1 | Time in seconds |
| 2–9 | VGRF on each of the 8 sensors located under the left foot |
| 10–17 | VGRF on each of the 8 sensors located under the right foot |
| 18 | Total force under the left foot |
| 19 | Total force under the right foot |

In this notebook, we are going to see if we can extract useful information from the VGRF time-series data that is correlated with patients' UPDRS scores. We are going to exclude the healthy controls from our analyses so that we do not have an excess of negligible UPDRS scores. Nevertheless, most of the steps in this notebook could be repeated with that population if you so choose.

# Inspecting the Dataset

Before we start trying to extract information from our recordings, let's look at a hand-selected example of VGRF time-series data to see what we are working with.

In [None]:
# The names of the columns in the recordings
column_names = ['Time']
for i in range(1, 9):
    column_names.append(f'Left Sensor {i}')
for i in range(1, 9):
    column_names.append(f'Right Sensor {i}')
column_names.append('Left Foot')
column_names.append('Right Foot')

In [None]:
# Show the structure of one of the files
example_filename = 'GaCo01_01.txt'
example_df = pd.read_csv(os.path.join(base_folder, example_filename),
                         sep="\t", header=None, names=column_names)
example_df

Inspecting time-series data in a table will only give us information about the duration of the recording and the range of values the measurement can take. It's usually a good idea to plot time-series data so we can get a better understanding of its structure.

In [None]:
# Plot the data
plt.figure(figsize=(9, 3))
plt.subplot(1, 2, 1)
plt.plot(example_df['Time'], example_df['Left Foot'], 'k-', label='Left')
plt.xlabel('Time (s)'), plt.ylabel('VGRF (N)'), plt.title('Entire Recording, Single Foot')
plt.subplot(1, 2, 2)
plt.plot(example_df['Time'], example_df['Left Foot'], 'k-', label='Left')
plt.plot(example_df['Time'], example_df['Right Foot'], 'r-', label='Right')
plt.xlabel('Time (s)'), plt.ylabel('VGRF (N)'), plt.title('Short Snippet, Both Feet')
plt.xlim(0, 5)
plt.legend()
plt.show()

We can verify that the data makes sense for a couple of reasons:
* The measurements recorded from each foot fluctuate between 0 N and roughly 1000 N. While the magnitude of the range's upper limit may not have an intuitive interpretation, the fact that the signals periodically approach zero makes sense since there should be negligible force exerted by the foot while it is up in the air.
* The measurements recorded on each foot oppose one another. In other words, if one signal is high, the other is low. This makes sense since feet alternate when they touch the ground while a person is walking.
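
The second observation can also be checked numerically: if the two feet alternate, their force signals should be negatively correlated. The sketch below uses synthetic half-wave signals as stand-ins for the two feet, so it only illustrates the idea. Real recordings include moments when both feet are on the ground, so the correlation there will likely be weaker.

```python
import numpy as np

fs = 100                          # sampling rate in Hz (matches the dataset)
t = np.arange(10 * fs) / fs       # 10 seconds of samples
stride = np.sin(2 * np.pi * 1.0 * t)

# Synthetic stand-ins: each foot bears weight during opposite half-cycles
left = 1000 * np.clip(stride, 0, None)
right = 1000 * np.clip(-stride, 0, None)

# A negative correlation means one signal tends to be high when the other is low
r = np.corrcoef(left, right)[0, 1]
print(f'Correlation between left and right: {r:.2f}')
```

With a real recording, the same call could be applied to `example_df['Left Foot']` and `example_df['Right Foot']`.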

Let's look at the measurements for a single foot according to three different representations:
1. In the time domain (force vs. time)
2. In the frequency domain as an FFT (FFT amplitude vs. frequency)
3. In the frequency domain as a spectrogram (FFT amplitude vs. frequency vs. time)

In [None]:
from scipy.fft import fft, fftfreq
from scipy import signal
def view_recording(filename, column_names, fs=100):
    """
    Show the force measurements over time from the left foot according to
    three different representations
    filename: the name of the file that should be loaded
    column_names: the names of the file's columns
    fs: the sampling rate of our data (set to 100 Hz since we know
    that is the case for our dataset)
    """
    # Load the file
    df = pd.read_csv(os.path.join(base_folder, filename),
                     sep="\t", header=None, names=column_names)
    time = df['Time'].values
    values = df['Left Foot'].values

    # Calculate the FFT
    values_centered = values - values.mean()
    fft_mag = np.abs(fft(values_centered))
    fft_freqs = fftfreq(len(values_centered), 1/fs)

    # Calculate the spectrogram
    spec_freqs, spec_times, spectro = signal.spectrogram(values_centered, fs)

    # Show the three signal representations
    plt.figure(figsize=(12, 3))
    plt.subplot(1, 3, 1)
    plt.plot(time, values)
    plt.xlabel('Time (s)'), plt.ylabel('VGRF (N)')
    plt.title('Time Domain')

    plt.subplot(1, 3, 2)
    plt.stem(fft_freqs, fft_mag, markerfmt=" ", basefmt="-")
    plt.xlabel('Freq (Hz)'), plt.ylabel('FFT Amplitude')
    plt.ticklabel_format(axis='y', style='sci', scilimits=(4,4))
    plt.xlim(-0.1, 8)
    plt.title('Frequency Domain: FFT')

    plt.subplot(1, 3, 3)
    plt.pcolormesh(spec_times, spec_freqs, spectro, shading='gouraud')
    plt.xlabel('Time (s)'), plt.ylabel('Frequency (Hz)')
    plt.ylim(0, 8)
    plt.title('Frequency Domain: Spectrogram')
    plt.show()

In [None]:
view_recording(example_filename, column_names)

Here are a few notes and observations about this function:
* We removed the mean of the overall signal in the time domain before translating our data into the frequency domain in order to remove the FFT components at 0 Hz.
* Since our data was recorded at 100 Hz, the Nyquist–Shannon sampling theorem states that we should be able to extract frequency information as high as 50 Hz. However, given that we are talking about gross motor coordination, we aren't going to worry too much about frequency information beyond 8 Hz.
* We could apply a digital filter to clean up our data a bit, but given that there does not seem to be a great deal of high-frequency information, we will forgo that step for now.
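
To make the first two notes concrete, here is a minimal sketch on a synthetic signal (a 1 Hz oscillation riding on a constant 500 N offset, both values chosen arbitrarily for illustration). It uses NumPy's `rfft`, which returns only the non-negative frequencies, rather than the full FFT used in `view_recording()`.

```python
import numpy as np

fs = 100                          # sampling rate in Hz (matches the dataset)
t = np.arange(20 * fs) / fs       # 20 seconds of samples
# Synthetic stand-in for a force signal: 1 Hz oscillation on a 500 N offset
x = 500 + 200 * np.sin(2 * np.pi * 1.0 * t)

freqs = np.fft.rfftfreq(len(x), 1 / fs)
mag_raw = np.abs(np.fft.rfft(x))
mag_centered = np.abs(np.fft.rfft(x - x.mean()))

# The highest frequency bin sits at half the sampling rate
print('Highest frequency bin (Hz):', freqs[-1])
# Without centering, the constant offset dominates the spectrum at 0 Hz
print('Peak frequency without centering (Hz):', freqs[np.argmax(mag_raw)])
# After centering, the peak moves to the oscillation frequency
print('Peak frequency with centering (Hz):', freqs[np.argmax(mag_centered)])
```

Removing the mean matters whenever we search for a peak frequency, because otherwise the 0 Hz component can mask the frequency we actually care about.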

# Extracting Information from a Recording

In a previous session, we discussed various time-domain and frequency-domain analysis techniques to summarize time-series data. If we apply all of these techniques, we will be left with hundreds of different metrics we will need to sort through. Although this can be an unbiased way of approaching data analysis, it can also be very time-consuming. For this dataset, we will look at a few standard calculations for each recording; however, we will also apply domain expertise about the problem we are trying to solve to extract information that should ideally be more meaningful to us.

There are four cardinal symptoms of Parkinson's disease:

1. **Tremor:** Tremor is one of the most common symptoms of Parkinson's disease. It usually begins with a slight trembling or shaking of a hand, finger, or thumb. The tremor typically occurs when the affected limb is at rest and may subside during voluntary movement.

2. **Bradykinesia:** Bradykinesia refers to slowness of movement and is another key symptom of Parkinson's disease. It can manifest as a general reduction in spontaneous movements, including reduced arm swinging while walking, difficulty initiating movement, and a gradual decline in the speed of repetitive actions.

3. **Rigidity:** Rigidity is characterized by stiffness and resistance to movement in the muscles. It can be noticed as increased muscle tone that causes stiffness and resistance, leading to decreased range of motion, muscle aches, and general discomfort.

4. **Postural Instability:** Postural instability is commonly observed in the later stages of Parkinson's disease. It may result in impaired balance and increased risk of falling. People with Parkinson's disease may have difficulty making rapid, automatic, and involuntary adjustments to maintain balance.

Because tremor is more prominent in the hands and arms than it is in the legs, we are going to skip that symptom and focus on the latter three.

Knowing how to translate these English explanations into code is a difficult skill that comes with practice, exposure to a diverse toolbox of techniques, and a healthy amount of internet searching for code examples and academic papers. We will cover one way of extracting information related to each symptom in order of increasing complexity, but bear in mind two things:

1. There are likely alternative ways of quantifying each symptom that are just as valid.
2. Some of these techniques are not going to be obvious at first glance, but this is a skill you will develop over time.

## Standard Time-Domain Calculations

To start, let's extract some simple descriptive statistics in the time domain that will summarize the amplitude of the entire signal. We will calculate the average, standard deviation, 95th percentile, and root mean square (RMS). All of these numbers should be higher under two conditions:
1. Subjects who exert a greater force with their foot (i.e., higher amplitude)
2. Subjects who maintain foot contact for longer periods of time (i.e., wider force pulses)

In [None]:
def compute_arbitrary_time_domain_metrics(times, values, fs=100):
    """
    Calculates generic time-domain statistics on the signal
    times: the times associated with the VGRF data
    values: the VGRF data
    fs: the sampling rate
    """
    return {'average VGRF': np.mean(values),
            'stdev VGRF': np.std(values),
            '95th percentile VGRF': np.percentile(values, 95),
            'rms VGRF': np.sqrt(np.mean(values**2))}
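
To see how step force and contact time each influence these statistics, the sketch below builds three synthetic signals and compares their mean and RMS. The helper `synthetic_steps` and the half-sine pulse shape are assumptions made purely for illustration; real VGRF pulses have a more complex shape.

```python
import numpy as np

def synthetic_steps(peak, duty, fs=100, stride_time=1.0, duration=10):
    """Hypothetical stand-in for one foot's VGRF: a half-sine pulse of height
    `peak` (N) that lasts a fraction `duty` of every stride."""
    t = np.arange(int(duration * fs)) / fs
    phase = (t % stride_time) / stride_time
    in_contact = phase < duty
    force = np.zeros_like(t)
    force[in_contact] = peak * np.sin(np.pi * phase[in_contact] / duty)
    return force

def rms(values):
    """Root mean square of a signal."""
    return np.sqrt(np.mean(values**2))

baseline = synthetic_steps(peak=800, duty=0.5)
harder_steps = synthetic_steps(peak=1000, duty=0.5)    # greater force
longer_contact = synthetic_steps(peak=800, duty=0.7)   # longer foot contact

for name, sig in [('baseline', baseline), ('harder steps', harder_steps),
                  ('longer contact', longer_contact)]:
    print(f'{name}: mean={np.mean(sig):.0f} N, rms={rms(sig):.0f} N')
```

Both changes raise the mean and the RMS, which is why these statistics alone cannot tell us whether a subject steps harder or simply keeps their foot on the ground longer.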

## Standard Frequency-Domain Calculations

Let's also calculate some standard metrics in the frequency domain. We will look at peak frequency a bit later, but for now, we are going to look at the signal's power at different frequency ranges.

When we inspected our example signal in the frequency domain, we saw that most of the frequency information was within 0–3 Hz, and looking beyond 8 Hz did not give us much new information given how smooth the signal already is. Therefore, we will use 0–3 Hz to define our "low frequency" information and 3–8 Hz to define our "high frequency" information. These decisions are somewhat arbitrary though, so you could try different ranges and see how they influence your results down the road.

The total power within these frequency ranges is heavily correlated with the overall magnitude with which the subject is walking. The harder their steps, the more power there is likely to be across both ranges. To quantify how the frequency content is distributed across these ranges, we will calculate the ratio between the total power in the two ranges.

In [None]:
def compute_arbitrary_freq_domain_metrics(times, values, fs=100):
    """
    Calculates generic frequency-domain statistics on the signal
    times: the times associated with the VGRF data
    values: the VGRF data
    fs: the sampling rate
    """
    # Calculate the FFT
    values_centered = values - values.mean()
    fft_mag = np.abs(fft(values_centered))
    fft_freqs = fftfreq(len(values_centered), 1/fs)

    # Calculate the indices relevant to our frequency bands of interest.
    # The bands are (0, 3] Hz and (3, 8] Hz so that the 3 Hz bin is not
    # counted twice and the 0 Hz bin (removed by centering) is excluded.
    low_indices = np.where((fft_freqs > 0) & (fft_freqs <= 3))
    high_indices = np.where((fft_freqs > 3) & (fft_freqs <= 8))

    # Calculate the power at the low and high frequencies
    low_power = np.sum(fft_mag[low_indices]**2)
    high_power = np.sum(fft_mag[high_indices]**2)

    # Calculate the ratio between the two band powers in decibels
    high_to_low_ratio = 10*np.log10(high_power / low_power)
    return {'power at low freqs': low_power,
            'power at high freqs': high_power,
            'high-to-low power ratio': high_to_low_ratio}
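
The reason for reporting a ratio can be seen with a quick sketch: scaling a signal changes the power in both bands, but leaves their ratio untouched. The signal below is synthetic (a 1 Hz component plus a weaker 5 Hz component), and `band_power` is a helper written for this example that uses NumPy's one-sided `rfft`.

```python
import numpy as np

def band_power(values, fs, f_lo, f_hi):
    """Sum of squared FFT magnitudes for frequencies in (f_lo, f_hi]."""
    centered = values - values.mean()
    mag = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(len(centered), 1 / fs)
    in_band = (freqs > f_lo) & (freqs <= f_hi)
    return np.sum(mag[in_band]**2)

fs = 100
t = np.arange(20 * fs) / fs
# Synthetic stand-in: a strong 1 Hz component plus a weaker 5 Hz component
soft = 300 * np.sin(2 * np.pi * 1 * t) + 30 * np.sin(2 * np.pi * 5 * t)
hard = 2 * soft  # same shape, twice the force

for name, sig in [('soft', soft), ('hard', hard)]:
    low = band_power(sig, fs, 0, 3)
    high = band_power(sig, fs, 3, 8)
    print(f'{name}: low={low:.3g}, high={high:.3g}, '
          f'ratio={10 * np.log10(high / low):.1f} dB')
```

Doubling the force quadruples the power in each band, while the ratio in decibels stays the same, so the ratio describes the shape of the spectrum rather than how hard the subject steps.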

## Amplitude Measurements for Rigidity and Postural Instability

The descriptive statistics we have calculated so far have a couple of limitations:
1. They do not account for the fact that a person's gait characteristics can change over time (e.g., speed up, slow down).
2. For the time-domain calculations in particular, the rate at which the person walks and the force with which they step can affect these statistics.

To account for these shortcomings, we will calculate the signal amplitude (according to RMS) over non-overlapping 5-second windows. This will give us a collection of amplitude measurements that we can aggregate to summarize the entire signal. The decision to use non-overlapping windows is strictly for computational efficiency. The decision to use a 5-second window is somewhat arbitrary, but the general intuition is that it is long enough to include multiple steps and short enough to capture a relatively consistent gait pattern.

The more stiff someone is, the more likely they are to take soft steps. Therefore, we will take the average of the amplitude measurements as a potential metric of rigidity. People who have an unstable gait are more likely to vary their walking behavior over time. Therefore, we will also calculate the standard deviation of the amplitude measurements as a potential metric for postural instability.

In [None]:
def compute_amplitude_metrics(times, values, fs=100):
    """
    Calculate metrics related to the transient amplitude of the signal over time
    using a 5-second window with 0% overlap
    times: the times associated with the VGRF data
    values: the VGRF data
    fs: the sampling rate
    """
    # Set the sliding window parameters (in seconds)
    window_width = 5
    start_time = 0
    end_time = window_width

    # Stop generating windows once the next one would go past the end of the signal
    window_amplitudes = []
    while end_time < times.max():
        # Grab the current window by filtering indexes according to time;
        # the end is exclusive so that consecutive windows share no samples
        window_idxs = (times >= start_time) & (times < end_time)
        window_values = values[window_idxs]

        # Calculate the amplitude
        window_rms = np.sqrt(np.mean(window_values**2))
        window_amplitudes.append(window_rms)

        # Move the window over by a stride
        start_time += window_width
        end_time += window_width

    # Summarize the amplitude over time
    return {'average amplitude': np.mean(window_amplitudes),
            'stdev amplitude': np.std(window_amplitudes)}
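
When the signal is sampled at a constant rate, the same windowed RMS can be computed without an explicit loop by reshaping the array. The sketch below is one such alternative, not a drop-in replacement: the helper `windowed_rms` is written for this example, it selects windows by sample count rather than by timestamp, and it drops any trailing samples that do not fill a whole window. The signal is synthetic, with an amplitude that doubles after 10 seconds.

```python
import numpy as np

def windowed_rms(values, fs=100, window_seconds=5):
    """RMS over consecutive non-overlapping windows; trailing samples that do
    not fill a whole window are dropped."""
    samples_per_window = int(window_seconds * fs)
    n_windows = len(values) // samples_per_window
    trimmed = values[:n_windows * samples_per_window]
    # Each row of the reshaped array holds one window
    windows = trimmed.reshape(n_windows, samples_per_window)
    return np.sqrt(np.mean(windows**2, axis=1))

fs = 100
t = np.arange(22 * fs) / fs
# Synthetic stand-in: the amplitude doubles after the first 10 seconds
amplitude = np.where(t < 10, 400.0, 800.0)
signal_values = amplitude * np.sin(2 * np.pi * 1 * t)

rms_per_window = windowed_rms(signal_values, fs)
print(rms_per_window.round(1))
print('average amplitude:', rms_per_window.mean())
print('stdev amplitude:', rms_per_window.std())
```

The change in amplitude shows up as a jump between the second and third windows, which is exactly the kind of variation over time that the standard deviation of the window amplitudes is meant to capture.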

## Cadence Measurements for Bradykinesia and Postural Instability

Walking speed is considered by some clinical researchers to be the "6th vital sign", which makes it an important metric for most gait analyses. Although we cannot easily determine subjects' walking speed in meters per second, we can establish the cadence of their gait in steps per second. Because we analyze one foot at a time, the dominant frequency we measure reflects how often that foot strikes the ground.

To calculate a subject's cadence over time, we will generate a spectrogram and then look for the peak frequency within each time window. We will then aggregate those frequencies in a similar fashion to how we aggregated the RMS amplitudes earlier. More specifically, we will calculate the average peak frequency over time as a potential metric of bradykinesia, and we will calculate the standard deviation of the peak frequency over time as a potential metric of postural instability.

In [None]:
def compute_cadence_metrics(times, values, fs=100):
    """
    Calculate metrics related to the transient peak frequency of the signal
    over time
    times: the times associated with the VGRF data
    values: the VGRF data
    fs: the sampling rate
    """
    # Calculate the spectrogram
    values_centered = values - values.mean()
    spec_freqs, spec_times, spectro = signal.spectrogram(values_centered, fs)

    # Find the largest bin along the frequency dimension
    dominant_bins = np.argmax(spectro, axis=0)

    # Map those bin indices to frequencies
    peak_freqs = spec_freqs[dominant_bins]

    # Summarize the step rate over time
    return {'average cadence': np.mean(peak_freqs),
            'stdev cadence': np.std(peak_freqs)}
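
One detail worth keeping in mind is that the peak frequency can only take values on the spectrogram's frequency grid, whose spacing is `fs / nperseg`. `scipy.signal.spectrogram` uses `nperseg=256` by default, which at 100 Hz gives bins roughly 0.39 Hz apart. The sketch below compares that default with a longer segment on a synthetic signal; the 0.9 Hz rate and the choice of `nperseg=1024` are arbitrary values picked for illustration.

```python
import numpy as np
from scipy import signal

fs = 100
t = np.arange(60 * fs) / fs
# Synthetic stand-in: one foot striking the ground 0.9 times per second
values = np.clip(np.sin(2 * np.pi * 0.9 * t), 0, None)
values_centered = values - values.mean()

results = {}
for nperseg in (256, 1024):
    freqs, times, spectro = signal.spectrogram(values_centered, fs,
                                               nperseg=nperseg)
    # Peak frequency within each time window of the spectrogram
    peak_freqs = freqs[np.argmax(spectro, axis=0)]
    spacing = freqs[1] - freqs[0]
    results[nperseg] = (spacing, peak_freqs)
    print(f'nperseg={nperseg}: bin spacing={spacing:.3f} Hz, '
          f'mean peak={peak_freqs.mean():.3f} Hz')
```

Longer segments give a finer frequency grid at the cost of fewer time windows, so there is a trade-off between how precisely we can estimate cadence and how well we can track it over time.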

## Difference Measurements for Postural Instability

We could calculate most of the aforementioned features for both the left and right foot separately, which would nearly double the number of metrics we have. However, should we interpret the step cadence on the left side any differently than the step cadence on the right side? Would we interpret them differently if they were flipped? Probably not.

To make our lives simple, we are going to only calculate the aforementioned metrics on a single side of the body (left). However, it is important to know if the two sides are different. For example, a subject who has a limp may have a heavier footstep on one side compared to the other.

Therefore, we are going to calculate the difference between corresponding metrics on the left and right side of the body. Since we do not care whether the higher value is on the right or left side, we will compute the absolute value of the difference.

In [None]:
def compute_differences(left, right):
    """
    Compares corresponding metrics across two feet
    left: the dictionary of metrics from the left side
    right: the dictionary of metrics from the right side
    """
    diffs_dict = {}
    for key in left:
        diffs_dict[key] = np.abs(left[key] - right[key])
    return diffs_dict

## Processing a Single Recording

Now that we have helper functions to extract information from our recordings, let's put everything together into a single function. This function will take a single recording filename as input and return all of the information calculated for that recording as a `dict`.

In [None]:
def process_recording(filename):
    """
    Process a VGRF recording and produce all of the metrics as a dictionary
    (one value per key)
    filename: the name of the recording file
    """
    # Get the useful columns
    df = pd.read_csv(os.path.join(base_folder, filename),
                     sep="\t", header=None, names=column_names)
    time = df['Time'].values
    left_values = df['Left Foot'].values
    right_values = df['Right Foot'].values

    # Extract metrics from the left side
    left_time = compute_arbitrary_time_domain_metrics(time, left_values)
    left_freq = compute_arbitrary_freq_domain_metrics(time, left_values)
    left_amplitude = compute_amplitude_metrics(time, left_values)
    left_cadence = compute_cadence_metrics(time, left_values)

    # Extract metrics from the right side
    right_time = compute_arbitrary_time_domain_metrics(time, right_values)
    right_freq = compute_arbitrary_freq_domain_metrics(time, right_values)
    right_amplitude = compute_amplitude_metrics(time, right_values)
    right_cadence = compute_cadence_metrics(time, right_values)

    # Extract difference metrics
    diff_time = compute_differences(left_time, right_time)
    diff_freq = compute_differences(left_freq, right_freq)
    diff_amplitude = compute_differences(left_amplitude, right_amplitude)
    diff_cadence = compute_differences(left_cadence, right_cadence)

    # Combine everything into a dictionary
    info_dict = {}
    for left_dict in [left_time, left_freq, left_amplitude, left_cadence]:
        for key in left_dict:
            info_dict['Single foot ' + key] = left_dict[key]
    for diff_dict in [diff_time, diff_freq, diff_amplitude, diff_cadence]:
        for key in diff_dict:
            info_dict['Difference ' + key] = diff_dict[key]
    return info_dict

In [None]:
# Test our function
process_recording(example_filename)

# Creating Our Processed Dataset

To process all of our recordings, we will iterate through all of the files and call our `process_recording()` function on each recording. Because there are a variety of files in our data folder, we will want to ignore any files that either (1) are not recording files, (2) come from control subjects, or (3) are numbered as trial 10. We will gather the results in a single `DataFrame`.

In [None]:
data_filenames = sorted(os.listdir(base_folder))

# Iterate through the filenames
rows = []
for data_filename in data_filenames:
    # Skip the file if we want to ignore it
    patient_name = data_filename[0:6]
    patient_type = data_filename[2:4]
    trial_id = data_filename[7:9]
    if (patient_type == 'Co') or (trial_id == '10') or ('_' not in data_filename):
        continue

    # Generate the features
    result_dict = process_recording(data_filename)

    # Add the patient's name as the identifier
    result_dict['ID'] = patient_name
    rows.append(result_dict)

# Build the DataFrame in one step and set the index to the subject ID
info_df = pd.DataFrame(rows).set_index('ID')
info_df

Because some subjects had multiple recordings, their ID will appear multiple times in the index. That is usually not ideal, since each index value should uniquely point to a single row; however, we are only going to be using the index to combine information about subjects' recordings with their UPDRS scores, so this will not be a big issue.

In [None]:
info_df.loc['JuPt15']

On that note, the demographic information about our subjects is provided in an `.xls` file (note: it is also provided in a tab-delimited `.txt` file, but that file has a formatting issue). Let's load this file and see what it looks like:

In [None]:
demo_df = pd.read_excel(demo_filename, index_col='ID')
demo_df

This file contains many different columns, but we are only going to concern ourselves with the UPDRSM score (i.e., UPDRS Part III) for now.

We are going to do a few things to make this `DataFrame` better serve our needs:
1. We will get rid of all of the rows corresponding to healthy controls
2. We will get rid of all of the columns except for the index and the UPDRSM score.

In [None]:
# Keep only patient data
demo_df = demo_df[demo_df['Group'] == 'PD']

# Get rid of unnecessary columns
score_df = demo_df['UPDRSM']
score_df

We now have two `DataFrames` (or technically, a `DataFrame` and a `Series`):
1. `info_df`, which holds all of the characteristics we have extracted from our recordings
2. `score_df`, which holds the UPDRSM scores associated with the subjects

We can combine these together using the function `pd.merge()`. Since we have set the index of `info_df` and `score_df` to be the subject ID, we can use this column as our reference for merging.

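It is important to note that each subject completed multiple trials, so each subject ID can appear several times in `info_df` but only once in `score_df`. `pd.merge()` treats this as a many-to-one join: every recording row is paired with its subject's UPDRSM score, so we end up with one row per recording. The `how` argument controls what happens to unmatched rows. The default inner join would silently drop any recording whose subject ID does not appear in `score_df`, whereas a left join keeps every row of `info_df` and fills in `NaN` when no score is found. We will use a left join so that we can see exactly which recordings lack a score.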
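
To make the effect of the `how` argument concrete, here is a toy example with made-up values. `toy_info` plays the role of `info_df` (subject `S1` has two recordings) and `toy_scores` plays the role of `score_df` (subject `S3` has no score); both names and all numbers are invented for this sketch.

```python
import pandas as pd

# Toy stand-ins for info_df and score_df (values are made up for illustration)
toy_info = pd.DataFrame({'metric': [1.0, 1.1, 2.0, 3.0]},
                        index=pd.Index(['S1', 'S1', 'S2', 'S3'], name='ID'))
toy_scores = pd.Series([10, 20], index=pd.Index(['S1', 'S2'], name='ID'),
                       name='UPDRSM')

# Inner join: keeps only the rows whose ID appears in both objects
inner = pd.merge(toy_info, toy_scores, how='inner',
                 left_index=True, right_index=True)
# Left join: keeps every row of toy_info, with NaN where no score exists
left = pd.merge(toy_info, toy_scores, how='left',
                left_index=True, right_index=True)

print(inner)
print(left)
```

In both results subject `S1` keeps its two rows. The only difference is that the inner join drops the row for `S3`, while the left join keeps it with a missing score.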

In [None]:
df = pd.merge(info_df, score_df, how='left', left_index=True, right_index=True)
df

We can confirm this worked by making sure that we still have multiple entries for subjects who contributed multiple trials.

In [None]:
df.loc['JuPt15']

It is also important to note that some subjects did not complete the UPDRS, so they will have `NaN` as their entry under that column. We will get rid of those rows so that we do not have to deal with missing data:

In [None]:
df = df[~pd.isna(df['UPDRSM'])]
df

# Exploring Our Recording Characteristics

At this point, you could export your `DataFrame` into a `.csv` and explore the metrics using any tool that you desire (e.g., R, Excel). Nevertheless, we're going to stick with Python to explore our processed dataset.

To make it easy for us to test all of the relevant columns in our `DataFrame`, we will generate a list that holds all of the names of our recording characteristics:

In [None]:
metrics = list(info_df.columns)
metrics

## Descriptive Statistics

Let's start by extracting some descriptive statistics from our data. We can use the `.describe()` method to compute statistics like the mean and range for each column in our entire `DataFrame`:

In [None]:
df.describe()

Using these results, we can get a sanity check of whether the values we are calculating follow our expectations. Here are some example observations:
* **Single foot 95th percentile VGRF**: Nearly all of the recordings have VGRF values that go as high as 950 N while a person is walking.
* **Difference in cadence**: For the most part, subjects have a negligible difference between the cadence of their left and right feet. This makes sense considering most people walk at an even pace.


## Histograms

Although descriptive statistics can distill a lot of data into a small handful of numbers, they can also hide important information about the distribution of our data. Therefore, it can be helpful to generate histograms of our recording characteristics.

The most intuitive way of generating histograms is by calling the corresponding `matplotlib` function. Since we have lots of recording characteristics, we will only display the histograms for the first five:

In [None]:
def generate_histogram(df, col):
    plt.figure(figsize=(3,3))
    plt.hist(df[col], bins=20, color='blue', edgecolor='black',
            alpha=0.7)
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title(f'Histogram of {col}')
    plt.show()

In [None]:
for col in metrics[:5]:
    generate_histogram(df, col)

If you are interested in shortcuts, `pandas` has built-in methods for generating histograms on `Series`:

In [None]:
def generate_histogram(df, col):
    plt.figure(figsize=(3,3))
    df[col].hist(bins=20, alpha=0.7, legend=True)
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.title(f'Histogram of {col}')
    plt.show()

In [None]:
for col in metrics[:5]:
    generate_histogram(df, col)

## Correlations

Up until now, we have only looked at the distribution of individual recording characteristics without any other context. We have not looked at how they vary with respect to UPDRSM score.

One way we can do that is by calculating the correlation coefficient between the UPDRSM score and each recording characteristic. There are different correlations that can be calculated depending on whether data is normally distributed or not. For now, we are going to calculate ***Pearson's correlation coefficient*** under the assumption that data is normally distributed rather than ***Spearman's correlation coefficient*** which does not make that assumption. We can calculate these using functions in the `scipy` library.

In [None]:
from scipy import stats

for col in metrics:
    r, p_value = stats.pearsonr(df[col], df['UPDRSM'])
    print(f"Pearson r for {col}: {r:0.2f}, p-value is {p_value:0.3f}")
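
For comparison, here is a minimal sketch of how the two coefficients can differ. It uses synthetic data in which one variable grows monotonically but nonlinearly with the other; the exponential relationship and the random seed are arbitrary choices for illustration and say nothing about the gait data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in: y grows monotonically but nonlinearly with x
x = rng.uniform(0, 5, size=200)
y = np.exp(x) + rng.normal(0, 1, size=200)

# Pearson measures linear association; Spearman works on the ranks of the data
pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)
print(f'Pearson r: {pearson_r:0.2f}, Spearman rho: {spearman_rho:0.2f}')
```

Swapping `stats.pearsonr` for `stats.spearmanr` in the loop above would be one way to check whether the conclusions depend on the choice of coefficient.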

We can also generate scatter plots to illustrate the relationship between these pairs of variables. For brevity, we will only look at the first five recording characteristics:

In [None]:
def show_correlation(df, col):
    plt.figure(figsize=(3,3))
    plt.plot(df[col], df['UPDRSM'], '*')
    plt.xlabel(col)
    plt.ylabel('UPDRSM')
    plt.title(f'Correlation between UPDRSM and \n{col}')
    plt.show()

In [None]:
for col in metrics[:5]:
    show_correlation(df, col)

Here are some example observations:
* **Weak correlations:** Nearly all of the correlation coefficients fall between −0.2 and 0.2, indicating that no single variable is strongly associated with UPDRSM
* **Statistically significant results:** At the very least, we do see a few recording characteristics with statistically significant correlations, such as "Single foot average VGRF", "Single foot average cadence", and "Difference power at high freqs"

Although these results may be less than ideal for trying to predict a subject's UPDRSM score, that's okay! We're working with real data, and real data is usually messy. In the near future, we will see how machine learning can be used to make sense of this data to achieve that very task.