# Gait Classification 

This notebook covers three steps:

1. **Create the Training Dataset**: We will preprocess and combine data from different sources to create a comprehensive training dataset.
2. **Feature Selection and Dimensionality Reduction**: We will identify the most relevant features for classification using dimensionality reduction techniques.
3. **Model Evaluation**: We will test and compare the performance of multiple machine learning and deep learning algorithms for gait classification.

In [11]:
# Libraries
import os
import pandas as pd
import numpy as np
from scipy.stats import pointbiserialr
from data_preprocessing import merge_all_types

# Constants
healthy_dir = 'Data/Healthy'
stroke_dir = 'Data/Stroke'

In [8]:
# Create the data file combining the results of feature extraction for both healthy and stroke patients
merge_all_types(healthy_dir, stroke_dir)
data = pd.read_csv('final_dataset.csv')

# Drop the column subject_id as we don't need it
data = data.drop(columns=['subject_id'])
data.head()

Saved dataset with shape (14, 58) to final_dataset.csv


Unnamed: 0,left-z-axis-(deg/s)-mean,left-z-axis-(deg/s)-std,left-z-axis-(deg/s)-max,left-z-axis-(deg/s)-min,left-z-axis-(deg/s)-rms,left-z-axis-(deg/s)-mad,left-z-axis-(deg/s)-range,left-z-axis-(deg/s)-iqr,left-z-axis-(deg/s)-skew,left-z-axis-(deg/s)-kurt,...,stride_duration_symmetry_ratio_std,left_peak_mean,left_peak_std,right_peak_mean,right_peak_std,peak_diff_mean,peak_diff_std,z_corr_mean,z_corr_std,label
0,3.517065,159.709732,441.89,-335.854,159.742786,51.037,777.744,75.549,0.983153,3.727653,...,0.215819,255.738435,192.021695,241.707494,186.276608,51.220594,98.289195,0.355017,0.598094,0
1,5.582664,110.967208,403.659,-266.768,111.105428,13.841,670.427,46.463,1.430622,5.716239,...,0.1991,145.11723,172.010967,152.168704,166.921704,62.353264,109.251306,0.355703,0.500949,0
2,-1.558208,119.737293,423.963,-293.171,119.743424,18.232,717.134,48.39975,1.192807,5.407293,...,0.215823,171.33485,181.01232,194.829357,208.993551,37.156625,53.607212,0.430178,0.523371,0
3,1.881431,121.682049,390.549,-314.024,121.694034,35.427,704.573,53.354,1.261703,4.868585,...,0.164696,170.333325,170.836405,157.977763,159.326704,45.641111,79.686294,0.458434,0.486268,0
4,0.746919,128.397431,427.317,-247.439,128.396801,42.866,674.756,63.232,1.34739,4.846586,...,0.200439,166.902381,185.688745,183.529052,197.64833,77.646721,136.361102,0.33473,0.561701,0


# Feature Selection and Dimensionality Reduction

### 1. Point-biserial Correlation
The point-biserial correlation coefficient measures the correlation between a binary variable (such as a yes/no or pass/fail variable) and a continuous variable. It is mathematically equivalent to the Pearson correlation coefficient computed with the binary variable coded as 0 and 1. The coefficient ranges from -1 to 1: positive values indicate a positive correlation, negative values indicate a negative correlation, and values close to 0 indicate little or no linear correlation. The p-value is the probability of observing a correlation at least this strong if the two variables were in fact uncorrelated. Typically, a p-value below 0.05 is considered statistically significant.
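
As a quick illustration of the relationship to Pearson's coefficient, the sketch below compares `pointbiserialr` and `pearsonr` on synthetic data. The arrays `group` and `values` are made up for this example and are not part of the gait dataset.

```python
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

# Synthetic data for illustration only (not the gait dataset)
rng = np.random.default_rng(0)
group = np.array([0] * 7 + [1] * 7)               # binary label, 7 samples per class
values = rng.normal(loc=group * 1.5, scale=1.0)   # continuous feature shifted by class

r_pb, p_pb = pointbiserialr(group, values)
r_pearson, p_pearson = pearsonr(group, values)

# Both calls should report the same coefficient and p-value
print(f"point-biserial r = {r_pb:.4f}, p = {p_pb:.4f}")
print(f"Pearson r        = {r_pearson:.4f}, p = {p_pearson:.4f}")
```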

In [None]:
label = data['label']
features = data.drop(columns=['label'])

correlations, p_values = [], []

# Calculate the Point Biserial Correlation Coefficient for each feature
for feature in features:
    correlation, p_value = pointbiserialr(label, features[feature])
    correlations.append(correlation)
    p_values.append(p_value)

# Sort the features by their correlation with the label and create dataframes for better visualization
sorted_indices = np.argsort(correlations)[::-1]
sorted_features = features.columns[sorted_indices]
sorted_correlations = np.array(correlations)[sorted_indices]
sorted_p_values = np.array(p_values)[sorted_indices]

corr_df = pd.DataFrame({'Feature': sorted_features, 'Correlation': sorted_correlations, 'p-value': sorted_p_values})
corr_df

Feature: right-z-axis-(deg/s)-zcr, Correlation: nan, p-value: nan
Feature: right-z-axis-(deg/s)-pkcnt, Correlation: nan, p-value: nan
Feature: left-z-axis-(deg/s)-zcr, Correlation: nan, p-value: nan
Feature: left-z-axis-(deg/s)-pkcnt, Correlation: nan, p-value: nan
Feature: left-z-axis-(deg/s)-min, Correlation: 0.9127, p-value: 0.0000
Feature: right-z-axis-(deg/s)-min, Correlation: 0.7721, p-value: 0.0012
Feature: left-z-axis-(deg/s)-kurt, Correlation: 0.5088, p-value: 0.0632
Feature: left-z-axis-(deg/s)-skew, Correlation: 0.5064, p-value: 0.0646
Feature: right_dominant_freq_std, Correlation: 0.5023, p-value: 0.0672
Feature: left_dominant_freq_std, Correlation: 0.3964, p-value: 0.1606
Feature: left_stride_duration_mean, Correlation: 0.3481, p-value: 0.2226
Feature: right-z-axis-(deg/s)-kurt, Correlation: 0.3277, p-value: 0.2527
Feature: left_dominant_freq_mean, Correlation: 0.3229, p-value: 0.2602
Feature: stride_duration_diff_std, Correlation: 0.2259, p-value: 0.4374
Feature: stride_d

  rpb, prob = pearsonr(x, y)
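
The `nan` entries in the output above appear for the `zcr` and `pkcnt` features. A correlation is undefined when a feature takes a single value across all samples, which is the likely cause here. One possible way to handle this, sketched below under that assumption, is to drop constant columns first and rank the rest by absolute correlation, so that strong negative correlations are not pushed to the bottom. The helper `rank_by_point_biserial` and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd
from scipy.stats import pointbiserialr

def rank_by_point_biserial(features: pd.DataFrame, label: pd.Series) -> pd.DataFrame:
    """Rank non-constant features by |point-biserial correlation| with a binary label."""
    # Constant columns have zero variance, so their correlation is undefined (NaN)
    varying = features.loc[:, features.nunique() > 1]
    rows = []
    for name in varying.columns:
        r, p = pointbiserialr(label, varying[name])
        rows.append({'Feature': name, 'Correlation': r, 'p-value': p})
    ranked = pd.DataFrame(rows)
    # Sort by magnitude so strong negative correlations rank as high as positive ones
    order = ranked['Correlation'].abs().sort_values(ascending=False).index
    return ranked.loc[order].reset_index(drop=True)

# Toy data for illustration only
toy = pd.DataFrame({
    'constant': [1.0] * 6,
    'weak': [0.1, 0.9, 0.4, 0.5, 0.8, 0.2],
    'strong_negative': [5.0, 5.2, 4.9, 1.0, 1.1, 0.8],
})
toy_label = pd.Series([0, 0, 0, 1, 1, 1])
result = rank_by_point_biserial(toy, toy_label)
print(result)
```

With only 14 samples in the dataset, the p-values should be read with caution, and testing dozens of features at once makes some chance findings likely.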


### 2. Feature Correlation Check
Next, we check the pairwise correlation between features. When two features have an absolute correlation above 0.9, they carry largely redundant information, so one of the two is removed.
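
A minimal sketch of such a filter is shown here, assuming Pearson correlation and a policy of keeping the first feature of each correlated pair. The function name `drop_highly_correlated` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(features: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop the later feature of every pair whose |correlation| exceeds the threshold."""
    corr = features.corr().abs()
    # Keep only the upper triangle (diagonal excluded) so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)

# Toy data for illustration only: 'b' is roughly 2 * 'a', 'c' is only loosely related
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'b': [2.1, 3.9, 6.2, 7.8, 10.1, 12.0],
    'c': [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
})
reduced = drop_highly_correlated(toy, threshold=0.9)
print(list(reduced.columns))
```

Which member of a pair to keep is a design choice. Keeping the one more strongly correlated with the label is a reasonable alternative to keeping the first.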


### 3. Feature Importance (Model-Based)
Train a Random Forest classifier and rank the features by their importance scores. Besides supporting feature selection, the ranking gives some intuition about which features drive the classification.
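
The idea can be sketched with scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute. The data below is synthetic, and the hyperparameters (`n_estimators=200`, `random_state=0`) are illustrative assumptions rather than values taken from this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only: one informative feature, two pure-noise features
rng = np.random.default_rng(42)
y = np.array([0] * 20 + [1] * 20)
X = pd.DataFrame({
    'informative': y * 3.0 + rng.normal(size=40),
    'noise_a': rng.normal(size=40),
    'noise_b': rng.normal(size=40),
})

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances, one score per feature, sorted from most to least important
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can be unstable on very small datasets and tend to split credit between correlated features, so it may help to run this step after the correlation filter.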


### 4. Principal Component Analysis
Use PCA as a dimensionality reduction technique to reduce noise in the data while retaining most of the meaningful information.
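
A possible setup is sketched below, assuming the features are standardized first (PCA is sensitive to feature scale) and that enough components are kept to explain 95% of the variance. Both the 95% target and the synthetic matrix `X` are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: 6 observed features driven by 2 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(50, 6))

# A float n_components in (0, 1) keeps the fewest components reaching that variance share
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver='full'),
)
X_reduced = pipeline.fit_transform(X)

pca = pipeline.named_steps['pca']
print("Reduced shape:", X_reduced.shape)
print("Cumulative explained variance:", pca.explained_variance_ratio_.cumsum())
```

When PCA is combined with a classifier, fitting the scaler and PCA inside the cross-validation loop (for example through a pipeline) avoids leaking information from the test folds.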