In this notebook, we're going to revisit the regression model we trained to predict the UPDRS score of patients with Parkinson's disease from their gait. Most of the code in this notebook is copied from the previous session, but we will use the following techniques to boost the model's performance:
* Imputation of missing data
* Normalizing the feature distributions
* Feature selection



# Important: Run this code cell each time you start a new session!

In [None]:
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install scipy
!pip install xlrd
!pip install librosa
!pip install scikit-learn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import librosa
import sklearn

In [None]:
!wget -rNcnp https://physionet.org/files/gaitpdb/1.0.0/

# Step 1: Define the Problem You Are Trying to Solve

As before, our overarching goal is to generate a regression model that predicts a person's UPDRSM score based on characteristics of their gait.

In [None]:
# The relevant folders and files associated with this dataset
base_folder = os.path.join('physionet.org', 'files', 'gaitpdb', '1.0.0')
label_filename = os.path.join(base_folder, 'demographics.xls')

In [None]:
# The names of the columns in the recordings
column_names = ['Time']
for i in range(1, 9):
    column_names.append(f'Left Sensor {i}')
for i in range(1, 9):
    column_names.append(f'Right Sensor {i}')
column_names.append('Left Foot')
column_names.append('Right Foot')

In [None]:
# Show the structure of one of the files
example_filename = 'GaCo01_01.txt'
example_df = pd.read_csv(os.path.join(base_folder, example_filename),
                         sep="\t", header=None, names=column_names)
example_df

In [None]:
# Plot the data
plt.figure(figsize=(9, 3))
plt.subplot(1, 2, 1)
plt.plot(example_df['Time'], example_df['Left Foot'], 'k-', label='Left')
plt.xlabel('Time (s)'), plt.ylabel('VGRF (N)'), plt.title('Entire Recording, Single Foot')
plt.subplot(1, 2, 2)
plt.plot(example_df['Time'], example_df['Left Foot'], 'k-', label='Left')
plt.plot(example_df['Time'], example_df['Right Foot'], 'r-', label='Right')
plt.xlabel('Time (s)'), plt.ylabel('VGRF (N)'), plt.title('Short Snippet, Both Feet')
plt.xlim(0, 5)
plt.legend()
plt.show()

# Step 2: Create Your Features and Labels

We are going to keep our labels and features the same as before. However, we are going to dive deeper into the availability and the distribution of our data to address some of its limitations.

All of the code from the previous session is copied below, so refer to that notebook if you need a reminder of how we came up with these code blocks.

In [None]:
def compute_arbitrary_time_domain_metrics(times, values, fs=100):
    """
    Calculates generic time-domain statistics on the signal
    times: the times associated with the VGRF data
    values: the VGRF data
    fs: the sampling rate
    """
    return {'average': np.mean(values),
            'stdev': np.std(values),
            '95th percentile': np.percentile(values, 95),
            'rms': np.sqrt(np.mean(values**2))}

In [None]:
from numpy.fft import fftfreq
from scipy.fftpack import fft
from scipy import signal
def compute_arbitrary_freq_domain_metrics(times, values, fs=100):
    """
    Calculates generic frequency-domain statistics on the signal
    times: the times associated with the VGRF data
    values: the VGRF data
    fs: the sampling rate
    """
    # Calculate the FFT
    values_centered = values - values.mean()
    fft_mag = np.abs(fft(values_centered))
    fft_freqs = fftfreq(len(values_centered), 1/fs)

    # Calculate the indices relevant to our frequency bands of interest
    low_indices = np.where((fft_freqs >= 0) & (fft_freqs <= 3))
    high_indices = np.where((fft_freqs >= 3) & (fft_freqs <= 8))

    # Calculate the power at the low and high frequencies
    low_power = np.sum(fft_mag[low_indices]**2)
    high_power = np.sum(fft_mag[high_indices]**2)

    # Calculate the ratio of high- to low-frequency power in decibels
    high_to_low_ratio = 10*np.log10(high_power / low_power)
    return {'power at low freqs': low_power,
            'power at high freqs': high_power,
            'high-to-low power ratio': high_to_low_ratio}

In [None]:
def compute_amplitude_metrics(times, values, fs=100):
    """
    Calculate metrics related to the transient amplitude of the signal over time
    using a 5-second window with 0% overlap
    times: the times associated with the VGRF data
    values: the VGRF data
    fs: the sampling rate
    """
    # Set the sliding window parameters
    window_width = 5
    start_time = 0
    end_time = window_width
    sample_period = 1/fs
    middle_idx = int((window_width / sample_period) // 2)

    # Stop generating windows once the next one would go past the end of the signal
    window_amplitudes = []
    while end_time < times.max():
        # Grab the current window by filtering indexes according to time
        window_idxs = (times >= start_time) & (times <= end_time)
        window_values = values[window_idxs]

        # Calculate the amplitude
        window_rms = np.sqrt(np.mean(window_values**2))
        window_amplitudes.append(window_rms)

        # Move the window over by a stride
        start_time += window_width
        end_time += window_width

    # Summarize the amplitude over time
    return {'average amplitude': np.mean(window_amplitudes),
            'stdev amplitude': np.std(window_amplitudes)}

In [None]:
def compute_cadence_metrics(times, values, fs=100):
    """
    Calculate metrics related to the transient peak frequency of the signal
    over time
    times: the times associated with the VGRF data
    values: the VGRF data
    fs: the sampling rate
    """
    # Calculate the spectrogram
    values_centered = values - values.mean()
    spec_freqs, spec_times, spectro = signal.spectrogram(values_centered, fs)

    # Find the largest bin along the frequency dimension
    dominant_bins = np.argmax(spectro, axis=0)

    # Map those bin indices to frequencies
    peak_freqs = spec_freqs[dominant_bins]

    # Summarize the step rate over time
    return {'average cadence': np.mean(peak_freqs),
            'stdev cadence': np.std(peak_freqs)}

In [None]:
def compute_differences(left, right):
    """
    Compares corresponding metrics across two feet
    left: the dictionary of metrics from the left side
    right: the dictionary of metrics from the right side
    """
    diffs_dict = {}
    for key in left:
        diffs_dict[key] = np.abs(left[key] - right[key])
    return diffs_dict

In [None]:
def process_recording(filename):
    """
    Process a VGRF recording and produce all of the features as a dictionary
    (one value per key)
    filename: the name of the recording file
    """
    # Get the useful columns
    df = pd.read_csv(os.path.join(base_folder, filename),
                     sep="\t", header=None, names=column_names)
    time = df['Time'].values
    left_values = df['Left Foot'].values
    right_values = df['Right Foot'].values

    # Extract metrics from the left side
    left_time = compute_arbitrary_time_domain_metrics(time, left_values)
    left_freq = compute_arbitrary_freq_domain_metrics(time, left_values)
    left_amplitude = compute_amplitude_metrics(time, left_values)
    left_cadence = compute_cadence_metrics(time, left_values)

    # Extract metrics from the right side
    right_time = compute_arbitrary_time_domain_metrics(time, right_values)
    right_freq = compute_arbitrary_freq_domain_metrics(time, right_values)
    right_amplitude = compute_amplitude_metrics(time, right_values)
    right_cadence = compute_cadence_metrics(time, right_values)

    # Extract difference metrics
    diff_time = compute_differences(left_time, right_time)
    diff_freq = compute_differences(left_freq, right_freq)
    diff_amplitude = compute_differences(left_amplitude, right_amplitude)
    diff_cadence = compute_differences(left_cadence, right_cadence)

    # Combine everything into a dictionary
    feature_dict = {}
    for left_dict in [left_time, left_freq, left_amplitude, left_cadence]:
        for key in left_dict:
            feature_dict['Single foot ' + key] = left_dict[key]
    for diff_dict in [diff_time, diff_freq, diff_amplitude, diff_cadence]:
        for key in diff_dict:
            feature_dict['Difference ' + key] = diff_dict[key]
    return feature_dict

In [None]:
data_filenames = os.listdir(base_folder)

# Iterate through the filenames
features_df = pd.DataFrame()
for data_filename in data_filenames:
    # Skip the file if we want to ignore it
    patient_name = data_filename[0:6]
    patient_type = data_filename[2:4]
    trial_id = data_filename[7:9]
    if (patient_type == 'Co') or (trial_id == '10') or not ('_' in data_filename):
        continue

    # Generate the features
    feature_dict = process_recording(data_filename)

    # Add the patient's name as the identifier
    feature_dict['ID'] = patient_name
    feature_dict = pd.DataFrame([feature_dict])
    features_df = pd.concat([features_df, feature_dict], axis=0)

# Set the index to the patient ID
features_df.set_index(['ID'], inplace=True)
features_df

In [None]:
labels_df = pd.read_excel(label_filename, index_col='ID')
labels_df

In [None]:
# Keep only patient data
labels_df = labels_df[labels_df['Group'] == 'PD']

# Get rid of unnecessary columns
labels_df = labels_df[['Gender', 'Age', 'Height (meters)',
                       'Weight (kg)', 'UPDRSM']]

# Rename the columns
labels_df.rename(columns={'Gender': 'Sex', 'Height (meters)': 'Height',
                          'Weight (kg)': 'Weight', 'UPDRSM': 'Label'}, inplace=True)
labels_df

In [None]:
# Convert gender to a binary sex variable
labels_df['Sex'] = labels_df['Sex'].replace({'male': 0, 'female': 1})
labels_df

In [None]:
# Fix the Ju rows so that the height is in meters
bad_height_rows = labels_df.index.str.startswith('Ju')
labels_df.loc[bad_height_rows, 'Height'] /= 100
labels_df

In [None]:
df = pd.merge(labels_df, features_df, how='right', left_index=True, right_index=True)
df

## Missing Feature Values

Recall that we have missing entries for some people's UPDRS scores and demographics. We can identify how many rows have missing values in these columns by looking at the "count" row that results from calling the `.describe()` method:

In [None]:
df.describe()
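
As a complement to `.describe()`, missing entries can also be counted directly. The following is a minimal sketch on a small made-up table; the values in `toy_df` are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for the merged dataframe
toy_df = pd.DataFrame({
    'Age': [70, 65, 80, 72],
    'Height': [1.70, np.nan, 1.65, np.nan],
    'Weight': [80.0, 72.0, np.nan, 68.0],
    'Label': [20.0, np.nan, 31.0, 15.0],
})

# Number of missing entries in each column
missing_per_column = toy_df.isna().sum()
print(missing_per_column)

# Number of rows that are missing at least one value
rows_with_missing = toy_df.isna().any(axis=1).sum()
print(rows_with_missing)
```

Running the same two expressions on `df` should give the per-column and per-row counts for the real data.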

We will still exclude the patients who did not complete the UPDRS because there is no use in trying to guess what their scores would have been had they completed the assessment.

In [None]:
# Remove rows with missing data
df = df[~pd.isna(df['Label'])]
df

We will fill in the missing demographic variables with reasonable guesses so that we have access to more data during model training. `scikit-learn` provides a class called `SimpleImputer` that generates reasonable guesses based on the distribution of known values according to a `strategy` like mean or mode. For example, we could fill in missing weights by calculating the average of the weights reported in our dataset.

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

Because this imputer learns from our dataset, it is important that we treat it the same way we treat our machine learning model. In other words, we should only determine the distribution of known values in our training dataset; otherwise, we would be cheating by inspecting our test data before the final evaluation step. Since we are doing k-fold cross-validation, we will add this imputer in Step 8 of this notebook.
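
To make the train-only idea concrete, here is a minimal sketch with made-up weights (the arrays `weights_train` and `weights_test` are hypothetical). The imputer learns the mean from the training values and reuses it to fill the gap in the test values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical weights (kg); np.nan marks a missing entry
weights_train = np.array([[60.0], [80.0], [np.nan], [70.0]])
weights_test = np.array([[np.nan], [90.0]])

# Learn the mean from the training data only
toy_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
toy_imputer.fit(weights_train)

# Both sets are filled using the mean of the known training values
weights_train_filled = toy_imputer.transform(weights_train)
weights_test_filled = toy_imputer.transform(weights_test)
print(toy_imputer.statistics_)
print(weights_test_filled.ravel())
```

The known test value (90 kg) never influences the guess, which is what keeps the test data untouched.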

## Scaling Feature Values

If we look at the output of the `.describe()` method one more time, we will notice that each column has a very different range.

In [None]:
df.describe()

Some models are inherently robust to feature scaling, such as decision trees and random forests. These models make decisions based on threshold rules, so the scale of one feature does not impact the threshold that is optimal for another feature. Models that rely on distance metrics (e.g., k-nearest neighbors) or linear combinations of features (e.g., linear regression, SVM) can be more sensitive to feature scaling, as features with larger scales may dominate the calculations underlying these models.
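
As a rough illustration of how a large-scale feature can dominate a distance calculation, consider three made-up patients described by height and weight (all values below are hypothetical):

```python
import numpy as np

# Hypothetical patients described by (height in meters, weight in kg)
patient_a = np.array([1.60, 70.0])
patient_b = np.array([1.90, 70.0])  # very different height, same weight
patient_c = np.array([1.60, 72.0])  # same height, slightly different weight

# Euclidean distance, the default metric in KNeighborsRegressor
dist_ab = np.linalg.norm(patient_a - patient_b)
dist_ac = np.linalg.norm(patient_a - patient_c)
print(dist_ab, dist_ac)
```

Even though a 30 cm height difference is substantial and a 2 kg weight difference is small, patient A ends up closer to patient B simply because weight is expressed in larger numbers.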

While there are ways of making these models account for features with varied scaling, normalizing the scale of our features can prevent this issue and allow models to give appropriate weight to each feature. We will use the `StandardScaler` from `scikit-learn`. This class scales each feature independently so that it has zero mean and unit variance in the data used for fitting. This matches the center and spread of a standard normal distribution, although the shape of each feature's distribution is left unchanged.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

Like our data-driven imputer for filling in missing values, this feature scaling step depends on the distribution of known values in our training dataset. Therefore, we will incorporate this scaler in Step 8 of this notebook.
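
A minimal sketch of this behavior on made-up numbers (the arrays `toy_train` and `toy_test` are hypothetical): the scaler learns the mean and standard deviation of each column from the training rows and applies the same shift and scale to the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: column 0 is height (m), column 1 is weight (kg)
toy_train = np.array([[1.60, 60.0], [1.70, 70.0], [1.80, 80.0]])
toy_test = np.array([[1.75, 90.0]])

# Learn the per-column mean and standard deviation from the training rows
toy_scaler = StandardScaler()
toy_scaler.fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# The scaled training columns have zero mean and unit standard deviation
print(toy_train_scaled.mean(axis=0))
print(toy_train_scaled.std(axis=0))
print(toy_test_scaled)
```

Note that the scaled test row is not guaranteed to have zero mean or unit variance, since it is transformed with statistics learned elsewhere.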

# Step 3: Decide How the Data Should Be Split for Training and Testing

We will train our model as before, using `GroupKFold` to split our data into 5 folds in a way that keeps samples associated with the same `ID` in the same fold.
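
As a reminder of what `GroupKFold` guarantees, here is a small sketch on made-up data (the patient names in `toy_groups` are hypothetical). It checks that no patient ever appears in both the training and test portions of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 10 recordings from 5 patients (2 recordings each)
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)
toy_groups = np.array(['P1', 'P1', 'P2', 'P2', 'P3', 'P3',
                       'P4', 'P4', 'P5', 'P5'])

toy_kfold = GroupKFold(n_splits=5)
shared_groups = []
for train_idxs, test_idxs in toy_kfold.split(toy_x, toy_y, toy_groups):
    train_groups = set(toy_groups[train_idxs])
    test_groups = set(toy_groups[test_idxs])
    # Record any patient who appears on both sides of the split
    shared_groups.append(train_groups & test_groups)
    print('Test patients:', sorted(test_groups))
```

Every entry of `shared_groups` should be an empty set.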

# Step 4: (Optional) Add Feature Selection

We intentionally created some features arbitrarily and others based on domain expertise. While the number of features we have is still significantly smaller than the number of recordings, using fewer features could reduce overfitting and lead to better test accuracy.

We can select features in a variety of ways. We can remove features that are redundant, we can pick features that have a certain amount of variance, or we could even pick features according to a second machine learning model.

For simplicity, we will use the `SelectKBest` class from `scikit-learn`. This feature selector removes all but the top-`k` best features according to a scoring function (`score_func`). `scikit-learn` provides multiple scoring functions for classification and regression tasks, each with its own mathematical underpinnings. It is important to note that the default scoring function is for classification tasks, so we will need to specify a different one. We will use the `f_regression` score, which determines the utility of each feature according to the F-statistic of univariate linear regression tests.

For this particular feature selector, we also need to carefully consider the number of features we keep. We could treat this number like a model hyperparameter and try different settings to identify the optimal one. For now, however, we will fix this hyperparameter to `k=5`.

In [None]:
from sklearn.feature_selection import SelectKBest, f_regression
feat_select = SelectKBest(f_regression, k=5)

As with the other feature pre-processing steps we have added, feature selection depends on the distribution of known values in our training dataset. Therefore, we will incorporate this feature selector in Step 8 of this notebook.
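
To see how this selector behaves, here is a minimal sketch on synthetic data in which only one feature is related to the label. The data, the variable names, and the choice of `k=1` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical data: 100 samples, 4 features, only feature 2 drives the label
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(100, 4))
toy_y = 3 * toy_x[:, 2] + rng.normal(scale=0.1, size=100)

# Keep the single feature with the highest F-statistic
toy_select = SelectKBest(f_regression, k=1)
toy_select.fit(toy_x, toy_y)

# One score per feature, followed by the indices of the kept features
print(toy_select.scores_)
selected_idxs = toy_select.get_support(indices=True)
print(selected_idxs)
```

The `scores_` attribute holds the F-statistic for every feature, and `get_support()` reports which columns survive the selection.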

# Step 5: (Optional) Balance Your Dataset

We saw that the UPDRSM scores were fairly normally distributed, so we won't worry about trying to balance our dataset.

# Step 6: Select an Appropriate Model

We will use the same k-nearest neighbors regressor as before.

In [None]:
from sklearn.neighbors import KNeighborsRegressor
reg = KNeighborsRegressor()

# Step 7: (Optional) Select Your Hyperparameters

We could optimize the hyperparameter `n_neighbors` in our regressor; however, we will skip that step since we have already made many other changes in our pipeline.

# Step 8: Train and Test Your Model

We will train and test our model as we have in the past, but with three key differences:
1. We will fill in missing demographic values
2. We will scale our features to the same distribution
3. We will apply feature selection to identify the most informative features

The logic of putting imputation first is that the other steps are easiest to work with when the dataset is complete. Since we don't have too much missing data, it is also unlikely that filling in missing values will completely change the scale or utility of those features.

The logic of putting feature scaling before feature selection is that the model will be trained on scaled features rather than unscaled ones. Therefore, we should pick which scaled features are most useful.

Depending on how your model pipeline is configured and the type of machine learning model you are using, you might find that a different order may make more sense.

In [None]:
from sklearn.model_selection import GroupKFold

# Get the features, labels, and grouping variables
x = df.drop(['Label'], axis=1).values
y = df['Label'].values
groups = df.index.values

# Initialize a data structure to save our final results,
# assuming all of the predictions are 0 to start
y_pred = np.zeros(y.shape)

# Split the data into folds
group_kfold = GroupKFold(n_splits=5)
for train_idxs, test_idxs in group_kfold.split(x, y, groups):
    # Split the data into train and test
    x_train = x[train_idxs]
    y_train = y[train_idxs]
    x_test = x[test_idxs]
    y_test = y[test_idxs]

    # Impute the missing features
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    imputer.fit(x_train)
    x_train = imputer.transform(x_train)
    x_test = imputer.transform(x_test)

    # Scale the features
    scaler = StandardScaler()
    scaler.fit(x_train)
    x_train = scaler.transform(x_train)
    x_test = scaler.transform(x_test)

    # Select the most useful features
    feat_select = SelectKBest(f_regression, k=5)
    feat_select.fit(x_train, y_train)
    x_train = feat_select.transform(x_train)
    x_test = feat_select.transform(x_test)

    # Train a model on the transformed training data
    reg = KNeighborsRegressor(n_neighbors=5)
    reg.fit(x_train, y_train)

    # Predict on the transformed test data
    y_test_pred = reg.predict(x_test)
    y_pred[test_idxs] = y_test_pred

As an aside, recall that using cross-validation means that we are training distinct models for each fold of our dataset. Because we are doing feature selection within each fold, we could easily end up selecting different features for each model. This could have implications for how we communicate our results, but since we are focusing on optimizing performance, we will leave things as they are.
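
If we did want to know how stable the selection is, one option is to count how often each feature is chosen across folds. The sketch below does this on synthetic data; the feature names, the data, and `k=2` are all hypothetical stand-ins for our real pipeline.

```python
import numpy as np
from collections import Counter
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GroupKFold

# Hypothetical data: 60 recordings, 6 features, 20 patients (3 recordings each)
rng = np.random.default_rng(0)
toy_names = np.array([f'feature {i}' for i in range(6)])
toy_x = rng.normal(size=(60, 6))
toy_y = 2 * toy_x[:, 0] + toy_x[:, 1] + rng.normal(size=60)
toy_groups = np.repeat(np.arange(20), 3)

# Count how often each feature is selected across the folds
selection_counts = Counter()
for train_idxs, _ in GroupKFold(n_splits=5).split(toy_x, toy_y, toy_groups):
    toy_select = SelectKBest(f_regression, k=2)
    toy_select.fit(toy_x[train_idxs], toy_y[train_idxs])
    selection_counts.update(toy_names[toy_select.get_support()])

print(selection_counts)
```

Features that are chosen in every fold are likely the ones worth highlighting when communicating results.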

# Step 9: Use an Appropriate Method for Interpreting Results

We are going to use the same function we created earlier to view the regression accuracy of our model.

In [None]:
from scipy.stats import pearsonr

def regression_evaluation(y_true, y_pred):
    """
    Generate a series of graphs that will help us determine the performance of
    a regression model
    y_true: the target labels
    y_pred: the predicted labels
    """
    # Calculate the distance metrics and Pearson's correlation
    mean_error = np.mean(y_pred-y_true)
    std_error = np.std(y_pred-y_true)
    mean_absolute_error = np.mean(np.abs(y_pred-y_true))
    corr, pval = pearsonr(y_true, y_pred)

    # Set up the graphs
    fig_bounds = [y_true.min()-1, y_true.max()+1]
    corr_title = f'Correlation = {corr:0.2f}'
    corr_title += ', p<.05' if pval <.05 else ', n.s.'
    ba_title = f'Mean Error = {mean_error:0.2f} ± {std_error:0.2f}'

    # Generate a correlation plot with the scores in the title
    plt.figure(figsize=(9, 3))
    plt.subplot(1, 2, 1)
    plt.plot(y_true, y_pred, '*')
    plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'k--')
    plt.grid()
    plt.xlim(fig_bounds), plt.ylim(fig_bounds)
    plt.xlabel('Actual Score'), plt.ylabel('Predicted Score')
    plt.title(corr_title)

    plt.subplot(1, 2, 2)
    plt.plot(y_true, y_pred-y_true, '*')
    plt.axhline(y=mean_error, color='k', linestyle='--')
    plt.axhline(y=mean_error+std_error, color='r', linestyle='--')
    plt.axhline(y=mean_error-std_error, color='r', linestyle='--')
    plt.xlim(fig_bounds)
    plt.xlabel('Actual Score'), plt.ylabel('Error')
    plt.title(ba_title)
    plt.show()

In [None]:
regression_evaluation(y, y_pred)

So after all of that work, we ended up with worse accuracy...what gives? In this case, it would be good to re-run the code block from Step 8 while trying different combinations of the changes we made to identify where the issues may lie. If you were to do that, you would find out that doing imputation is the main culprit.

From there, it would make sense to see if the imputed values are reasonable and/or whether a different imputation method leads to vastly different results. If you were to do that, you would see that different imputation methods generally lead to comparable and reasonable results.
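
One simple way to run that kind of check is to loop over the `strategy` options of `SimpleImputer` and compare the values they fill in. The sketch below uses made-up weights (hypothetical values, not from the dataset) with a single missing entry at the end.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical weights (kg) with one missing entry at the end
toy_weights = np.array([[60.0], [62.0], [62.0], [100.0], [np.nan]])

# Record the value each strategy uses to fill in the missing entry
filled_values = {}
for strategy in ['mean', 'median', 'most_frequent']:
    toy_imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = toy_imputer.fit_transform(toy_weights)
    filled_values[strategy] = filled[-1, 0]

print(filled_values)
```

In this toy case the mean is pulled upward by the one heavy participant, while the median and mode agree with each other.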

Let's take a step back and look at the data that was missing demographic information in the first place:

In [None]:
df = pd.merge(labels_df, features_df, how='right', left_index=True, right_index=True)
df = df[pd.isna(df['Weight']) | pd.isna(df['Height'])]
df

Here are a few observations:
* All of the participants came from the same cohort: `Ga`.
* The average age across the entire dataset was 68.2 ± 8.7, yet almost half of these patients are over 75.
* The average UPDRSM score across the entire dataset was 17.9 ± 7.1, yet almost half of these patients had scores above 25.

So what does this mean? It could be that by doing imputation and increasing the size of our dataset, we just so happen to be adding anomalous examples back into our training and test folds. Furthermore, these samples are significantly deflating all of our performance metrics because they are coming from individuals with high UPDRSM scores.

Although we would ideally not want to exclude patients from our analyses, let's see what happens when we re-run our supposedly improved pipeline without imputation:

In [None]:
from sklearn.model_selection import GroupKFold

# Generate the original dataset while removing patients with missing features or labels
# (same as in the previous notebook)
df = pd.merge(labels_df, features_df, how='right', left_index=True, right_index=True)
df = df.dropna(how='any')

# Get the features, labels, and grouping variables
x = df.drop(['Label'], axis=1).values
y = df['Label'].values
groups = df.index.values

# Initialize a data structure to save our final results,
# assuming all of the predictions are 0 to start
y_pred = np.zeros(y.shape)

# Split the data into folds
group_kfold = GroupKFold(n_splits=5)
for train_idxs, test_idxs in group_kfold.split(x, y, groups):
    # Split the data into train and test
    x_train = x[train_idxs]
    y_train = y[train_idxs]
    x_test = x[test_idxs]
    y_test = y[test_idxs]

    # Scale the features
    scaler = StandardScaler()
    scaler.fit(x_train)
    x_train = scaler.transform(x_train)
    x_test = scaler.transform(x_test)

    # Select the most useful features
    feat_select = SelectKBest(f_regression, k=5)
    feat_select.fit(x_train, y_train)
    x_train = feat_select.transform(x_train)
    x_test = feat_select.transform(x_test)

    # Train a model on the transformed training data
    reg = KNeighborsRegressor(n_neighbors=5)
    reg.fit(x_train, y_train)

    # Predict on the transformed test data
    y_test_pred = reg.predict(x_test)
    y_pred[test_idxs] = y_test_pred

In [None]:
regression_evaluation(y, y_pred)

It turns out that by ignoring the patients with missing data and just focusing on our other pipeline modifications (feature scaling and selection), the model goes back to a positive correlation. However, we still aren't doing as well as our original model.

While this may be disappointing, it shows that there is a lot of trial and error that happens in machine learning. There is no guarantee that every tool in your arsenal is going to improve model performance. In this particular case, our feature selection might be removing too much useful information.

So what are some things we could do instead to improve model performance? Here are some ideas:
* **Hyperparameter search:** Both our model architecture and the steps that we added had hyperparameters, namely the `k` for the `SelectKBest` feature selector and the number of neighbors for the `KNeighborsRegressor` model. We kept these values fixed throughout this notebook, but tuning these numbers could have led to better results.
* **Examine outliers:** We identified outliers in our dataset that worsened the performance of our model. It could be worth our time investigating these individuals more closely and seeing how their features compare to the rest of our dataset. It might also be worth considering if these individuals should be even more represented in our dataset (either by resampling or more data collection).
* **Reconsider the task:** Admittedly, this is not an easy dataset to work with. The UPDRSM examines motor symptoms throughout the body, asking patients to perform tasks ranging from finger tapping and hand movements to leg agility and speech evaluation. We are attempting to predict these comprehensive scores solely using what was measured from the soles of patients' feet while they walked. This doesn't even give us a full picture of their gait, as we're missing out on the length of their stride and the angles of their joints. Perhaps we need more signals to get a meaningful model? Perhaps we should be trying classification (high vs. low score) instead of regression?
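
The hyperparameter search idea could be sketched as follows. This is not the approach used in this notebook: it swaps our manual loop for `scikit-learn`'s `Pipeline` and `GridSearchCV` classes, and it runs on synthetic data. The data, the step names (`'scale'`, `'select'`, `'knn'`), the candidate values in `param_grid`, and the scoring choice are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 60 recordings, 6 features, 20 patients (3 recordings each)
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(60, 6))
toy_y = 2 * toy_x[:, 0] + toy_x[:, 1] + rng.normal(size=60)
toy_groups = np.repeat(np.arange(20), 3)

# Chain the pre-processing steps and the model so that every step
# is re-fit on the training portion of each fold
toy_pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(f_regression)),
    ('knn', KNeighborsRegressor()),
])

# Candidate values for the two hyperparameters (arbitrary choices)
param_grid = {
    'select__k': [2, 4, 6],
    'knn__n_neighbors': [3, 5, 7],
}

# Try every combination using grouped 5-fold cross-validation
search = GridSearchCV(toy_pipeline, param_grid,
                      cv=GroupKFold(n_splits=5),
                      scoring='neg_mean_absolute_error')
search.fit(toy_x, toy_y, groups=toy_groups)
print(search.best_params_)
```

Keep in mind that reporting performance on the same folds used to pick the hyperparameters tends to give an optimistic estimate, so a separate held-out set or nested cross-validation would be needed for a fair final evaluation.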