#Detecting Freezing of Gait Episodes in Parkinson's Disease: A Comprehensive Step-by-Step Approach




Objective: develop a model to predict freezing-of-gait episodes in Parkinson's disease.

##1. Introduction

Freezing of gait (FOG) is a debilitating symptom of Parkinson’s disease that significantly impairs patients' ability to walk and limits their mobility and independence. Machine learning (ML) techniques can provide valuable insights into the occurrence and causes of FOG episodes. By leveraging ML, medical professionals can improve their evaluation, monitoring, and prevention of FOG events. This notebook is based on the Parkinson's Freezing of Gait Prediction competition dataset, which contains data collected from a wearable 3D lower-back sensor. The goal of this project is to detect the start and stop of each freezing episode and to identify three types of FOG events: Start Hesitation, Turn, and Walking.

##2. The Big Picture

The objective is to develop a model to detect and predict Parkinson's FOG episodes. These episodes are predicted from time series data recorded for each patient during the execution of a specific protocol, together with some provided patient characteristics. Because the dataset contains labeled targets, a supervised learning approach is suitable. Since there are multiple targets (Start Hesitation, Turn, and Walking), the problem is a multi-class classification. The evaluation metric in this project is the mean average precision, that is, the average precision computed for each event class and then averaged over the classes. Because average precision depends on how the predicted scores rank the samples of each class, well-ranked confidence scores for each event type matter more than hard labels at a single threshold. Considering these requirements, this notebook presents a LightGBM (Light Gradient Boosting Machine) model developed to optimize the desired evaluation metric.
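
As a minimal sketch of the metric described above, the following pure-Python code computes the average precision for each event class and then their mean. The function names and the toy labels and scores are illustrative assumptions, not part of the competition code, and tied scores are not handled specially.

```python
def average_precision(y_true, y_score):
    """Average precision for one binary label.

    Samples are ranked by descending score; the result is the mean of
    precision@k taken at the rank k of every positive sample.
    Assumes scores are not tied.
    """
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    # No positives: define AP as 0.0 for this sketch
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(y_true_by_class, y_score_by_class):
    """Return (mean AP over classes, dict of per-class AP)."""
    per_class = {
        name: average_precision(y_true_by_class[name], y_score_by_class[name])
        for name in y_true_by_class
    }
    return sum(per_class.values()) / len(per_class), per_class


# Toy example (made-up labels and scores for four samples)
y_true = {
    "StartHesitation": [0, 0, 1, 1],
    "Turn": [1, 0, 1, 0],
    "Walking": [0, 1, 0, 0],
}
y_score = {
    "StartHesitation": [0.1, 0.4, 0.35, 0.8],
    "Turn": [0.9, 0.2, 0.7, 0.1],
    "Walking": [0.3, 0.6, 0.2, 0.1],
}
map_score, per_class_ap = mean_average_precision(y_true, y_score)
print(per_class_ap, map_score)
```

Only the ordering of the scores within each class affects the result, so a model can score well on this metric without committing to a single decision threshold.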

##3. Data Collection

Each patient in the dataset is treated as a subject. The dataset includes two types of experiments conducted to assess the patient's condition:

1. TDCSFOG dataset: data series collected in a lab, where subjects completed a FOG-provoking protocol.
2. DeFOG dataset: data series collected in the subject's home, where the subject also completed a FOG-provoking protocol.

Each series in the TDCSFOG dataset is identified in the "tdcsfog_metadata.csv" file by its Subject, Visit, Test, and Medication condition. Similarly, each series in the DeFOG dataset is identified in the "defog_metadata.csv" file by its Subject, Visit, and Medication condition.
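
The identification scheme above can be checked with a short sketch. It assumes pandas is available and uses a small made-up metadata table; only the column names Subject, Visit, Test, and Medication come from the description above, while the `Id` column, its values, and the helper name `key_is_unique` are hypothetical.

```python
import pandas as pd


def key_is_unique(meta, key_cols):
    """Return True if no two rows share the same combination of key_cols."""
    return not meta.duplicated(subset=key_cols).any()


# Made-up rows in the style of tdcsfog_metadata.csv (values are illustrative)
tdcsfog_meta = pd.DataFrame({
    "Id": ["a1", "a2", "a3", "a4"],
    "Subject": ["s1", "s1", "s1", "s2"],
    "Visit": [1, 1, 1, 1],
    "Test": [1, 2, 2, 1],
    "Medication": ["on", "on", "off", "on"],
})

full_key_ok = key_is_unique(tdcsfog_meta, ["Subject", "Visit", "Test", "Medication"])
partial_key_ok = key_is_unique(tdcsfog_meta, ["Subject", "Visit"])
print(full_key_ok, partial_key_ok)
```

Running the same check on the real metadata files would confirm whether the listed columns are enough to tell the series apart.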

In [None]:
'''
Connecting Google Drive
'''
from google.colab import drive
drive.mount("/content/gdrive")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
'''
install catboost module
'''
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.1-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.7/98.7 MB 7.4 MB/s eta 0:00:00
Installing collected packages: catboost
Successfully installed catboost-1.2.1


In [None]:
'''
Load modules
'''
import pandas as pd
import glob
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
import os
import warnings
import pickle
# Suppress FutureWarning messages
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
'''
Unzipping the data folder

'''
!unzip gdrive/My\ Drive/defog.zip > /dev/null

In [None]:
'''
Creating the home (DeFOG) dataset by merging all CSV files
'''
csv_folder = '/content/defog'  # Update this with the path to your CSV files
# Read each CSV file, then concatenate once
# (DataFrame.append was removed in pandas 2.0 and is slow inside a loop)
frames = []
for csv_file in os.listdir(csv_folder):
    if csv_file.endswith('.csv'):
        csv_path = os.path.join(csv_folder, csv_file)
        frames.append(pd.read_csv(csv_path))
df = pd.concat(frames, ignore_index=True)
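
A possible variation of the merge step keeps track of which series each row came from, which is useful when rows from the same recording should stay together later. This is a sketch under assumptions: the helper name `load_series_folder` and the added `Id` column (taken from the file name) are hypothetical and not part of the original notebook.

```python
import os
import tempfile

import pandas as pd


def load_series_folder(csv_folder):
    """Read every CSV in csv_folder and tag each row with its file name stem."""
    frames = []
    for csv_file in sorted(os.listdir(csv_folder)):
        if csv_file.endswith(".csv"):
            frame = pd.read_csv(os.path.join(csv_folder, csv_file))
            frame["Id"] = os.path.splitext(csv_file)[0]  # hypothetical series id
            frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Demonstration on two tiny made-up files written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"Time": [0, 1], "AccV": [-0.99, -0.98]}).to_csv(
        os.path.join(tmp, "series_a.csv"), index=False)
    pd.DataFrame({"Time": [0], "AccV": [-0.93]}).to_csv(
        os.path.join(tmp, "series_b.csv"), index=False)
    demo = load_series_folder(tmp)

print(demo)
```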

##4. Data Exploration

We first look at the merged DeFOG dataset.

In [None]:
'''
Displaying shape of data
'''
df.shape

(13525702, 9)

In [None]:
'''
Displaying content of data
'''
df

Unnamed: 0,Time,AccV,AccML,AccAP,StartHesitation,Turn,Walking,Valid,Task
0,0,-0.989968,-0.096900,-0.032962,0,0,0,False,False
1,1,-0.990478,-0.098616,-0.035393,0,0,0,False,False
2,2,-0.989018,-0.098869,-0.034183,0,0,0,False,False
3,3,-0.992178,-0.099121,-0.034910,0,0,0,False,False
4,4,-0.991226,-0.098661,-0.034451,0,0,0,False,False
...,...,...,...,...,...,...,...,...,...
13525697,300283,-0.929958,-0.158213,-0.334716,0,0,0,False,False
13525698,300284,-0.934397,-0.159220,-0.333716,0,0,0,False,False
13525699,300285,-0.936518,-0.162630,-0.332749,0,0,0,False,False
13525700,300286,-0.931836,-0.163615,-0.332280,0,0,0,False,False


In [None]:
'''
Checking the class counts of the target attribute (Turn)
'''
df.Turn.value_counts()

0    12938216
1      587486
Name: Turn, dtype: int64
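
The counts above imply a strongly imbalanced target. The arithmetic below, using only the two counts printed by `value_counts()`, shows the positive fraction (about 4.3%), the accuracy of a trivial classifier that always predicts 0, and the negative-to-positive ratio. The variable names are illustrative.

```python
# Counts copied from the value_counts() output above
n_negative = 12938216  # Turn == 0
n_positive = 587486    # Turn == 1
n_total = n_negative + n_positive

positive_fraction = n_positive / n_total            # share of rows with a Turn FOG event
majority_baseline_accuracy = n_negative / n_total   # accuracy of always predicting 0
neg_to_pos_ratio = n_negative / n_positive          # negatives per positive sample

print(f"positive fraction: {positive_fraction:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
print(f"negative-to-positive ratio: {neg_to_pos_ratio:.2f}")
```

Because the baseline accuracy is already about 0.957, accuracy alone says little about how well the positive class is detected; a metric such as the F1 score or average precision is more informative for this target.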

##5. Data Preprocessing

In [None]:
def preprocessing_inputs(df):
    df = df.copy()
    # Drop columns that are not used as features; only Turn is kept as the target
    unneeded_cols = ['Time', 'StartHesitation', 'Walking', 'Valid', 'Task']
    df = df.drop(unneeded_cols, axis=1)
    # Split into features X (AccV, AccML, AccAP) and target y
    X = df.drop("Turn", axis=1)
    y = df['Turn']
    # Random 70/30 train/test split with a fixed seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    return X_train, X_test, y_train, y_test

In [None]:
X_train,X_test,y_train,y_test= preprocessing_inputs(df)
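
Because the positive class is rare, one option (not used in the function above) is a stratified split, which keeps the class ratio the same in the training and test sets. The sketch below uses synthetic data and the `stratify` argument of scikit-learn's `train_test_split`; the data and the `demo_` variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the accelerometer features and the Turn target
rng = np.random.default_rng(42)
n_rows = 1000
demo_X = pd.DataFrame(rng.normal(size=(n_rows, 3)), columns=["AccV", "AccML", "AccAP"])
demo_y = pd.Series(np.r_[np.ones(50, dtype=int), np.zeros(950, dtype=int)], name="Turn")

# stratify=demo_y preserves the 5% positive rate in both partitions
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=42, stratify=demo_y
)

print(demo_y_train.mean(), demo_y_test.mean())
```

Note that a row-level split, stratified or not, can place neighbouring samples of the same recording in both partitions; whether that matters for this task is left as an open question here.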

##6. Model

Because the dataset contains labeled targets, a supervised learning approach is suitable. The models in this section predict a single target (whether a FOG event occurs during a turn), so the problem is a binary classification. The competition metric is the mean average precision, which averages the per-class average precision over the event classes; the models below are compared using the F1 score and accuracy on a held-out test split.

###6.1. Model Selection

For this project, after tuning different ML algorithms, the LightGBM, XGBoost, CatBoost, and decision tree algorithms are considered for the following reasons.

The dataset is large (about 13.5 million rows), so a highly efficient model is required.

The task involves binary classification.

The binary problem has imbalanced classes.

It is crucial to capture complex and non-linear relationships in the data.

### 6.2. Model Training

In [None]:
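# Create a function for testing models
def test_model(algorithm, X_train, y_train, X_test, y_test):
    # Instantiate the classifier with its default hyperparameters
    model = algorithm()
    model.fit(X_train, y_train)
    # Predict hard class labels on the held-out test set
    y_pred = model.predict(X_test)
    # F1 score for the positive class (Turn == 1) and overall accuracy
    f1 = f1_score(y_test, y_pred)
    accuracy = accuracy_score(y_test, y_pred)
    return f1, accuracy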
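
Since `test_model` calls `algorithm()` with no arguments, every model is trained with default hyperparameters. One way to pass a configured model through the same interface is `functools.partial`. The sketch below, which is an illustration rather than part of the original experiments, builds a decision tree factory with scikit-learn's `class_weight="balanced"` option to counter the class imbalance; the name `balanced_tree` and the toy data are hypothetical.

```python
from functools import partial

import numpy as np
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

# Factory that behaves like a class: calling it returns a configured model
balanced_tree = partial(DecisionTreeClassifier, class_weight="balanced", random_state=42)

# Toy imbalanced data: the label is 1 only when the first feature is large
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(2000, 3))
toy_y = (toy_X[:, 0] > 1.7).astype(int)

toy_model = balanced_tree()  # same call pattern as algorithm() in test_model
toy_model.fit(toy_X[:1500], toy_y[:1500])
toy_pred = toy_model.predict(toy_X[1500:])
print("F1 on toy data:", f1_score(toy_y[1500:], toy_pred))
```

With the notebook's helper this would be called as `test_model(balanced_tree, X_train, y_train, X_test, y_test)`. Comparable options exist in the other libraries (for example `class_weight` in `LGBMClassifier`, `scale_pos_weight` in `XGBClassifier`, and `auto_class_weights` in `CatBoostClassifier`), but their effect on this dataset has not been measured here.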

##7. Prediction and Submission

The DeFOG models are evaluated on their corresponding test data, and the resulting predictions are submitted to the competition.

Model 1

XGBoost Classification

In [None]:
%%time

f1_xgb, accuracy_xgb = test_model(XGBClassifier, X_train, y_train, X_test, y_test)
# Save the F1 score (this stores the score, not the fitted model)
with open('XGB_unbalanced.pkl', 'wb') as f:
    pickle.dump(f1_xgb, f)
print(f"XGBClassifier | F1 score - {f1_xgb}, | Accuracy - {accuracy_xgb}")

XGBClassifier | F1 score - 0.07171134973114959, | Accuracy - 0.9574112104090212
CPU times: user 10min 55s, sys: 18.2 s, total: 11min 13s
Wall time: 6min 57s


Model 2

LightGBM Classification

In [None]:
%%time

f1_lgbm, accuracy_lgbm = test_model(LGBMClassifier, X_train, y_train, X_test, y_test)
# Save the F1 score (this stores the score, not the fitted model)
with open('lgbm_unbalanced.pkl', 'wb') as f:
    pickle.dump(f1_lgbm, f)
print(f"LGBMClassifier | F1 score - {f1_lgbm}, | Accuracy - {accuracy_lgbm}")

[LightGBM] [Info] Number of positive: 411720, number of negative: 9056271
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data points in the train set: 9467991, number of used features: 3
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.043485 -> initscore=-3.090869
[LightGBM] [Info] Start training from score -3.090869
LGBMClassifier | F1 score - 0.04677849811209084, | Accuracy - 0.9571332211682892
CPU times: user 1min 24s, sys: 504 ms, total: 1min 25s
Wall time: 1min 25s


Model 3

DecisionTree Classification

In [None]:
%%time

f1_dt, accuracy_dt = test_model(DecisionTreeClassifier, X_train, y_train, X_test, y_test)
# Save the F1 score (this stores the score, not the fitted model)
with open('dt_unbalanced.pkl', 'wb') as f:
    pickle.dump(f1_dt, f)
print(f"DecisionTreeClassifier | F1 score - {f1_dt}, | Accuracy - {accuracy_dt}")

DecisionTreeClassifier | F1 score - 0.19743986273690864, | Accuracy - 0.9299131949022491
CPU times: user 1min 58s, sys: 507 ms, total: 1min 58s
Wall time: 1min 58s


Model 4

CatBoost Classification

In [19]:
%%time

f1_cb, accuracy_cb = test_model(CatBoostClassifier, X_train, y_train, X_test, y_test)
# Save the F1 score (this stores the score, not the fitted model)
with open('cb_unbalanced.pkl', 'wb') as f:
    pickle.dump(f1_cb, f)
print(f"CatBoostClassifier | F1 score - {f1_cb}, | Accuracy - {accuracy_cb}")

Learning rate set to 0.5
0:	learn: 0.2241002	total: 1.63s	remaining: 27m 5s
1:	learn: 0.1696193	total: 2.97s	remaining: 24m 44s
2:	learn: 0.1572404	total: 4.33s	remaining: 24m
3:	learn: 0.1537746	total: 5.72s	remaining: 23m 44s
4:	learn: 0.1515606	total: 7.01s	remaining: 23m 15s
5:	learn: 0.1506310	total: 8.3s	remaining: 22m 56s
6:	learn: 0.1501469	total: 9.59s	remaining: 22m 40s
7:	learn: 0.1489744	total: 12s	remaining: 24m 50s
8:	learn: 0.1476190	total: 14s	remaining: 25m 45s
9:	learn: 0.1470813	total: 15.3s	remaining: 25m 19s
10:	learn: 0.1467926	total: 16.7s	remaining: 25m 2s
11:	learn: 0.1464822	total: 18.1s	remaining: 24m 51s
12:	learn: 0.1459094	total: 19.4s	remaining: 24m 35s
13:	learn: 0.1451271	total: 20.8s	remaining: 24m 28s
14:	learn: 0.1448424	total: 22.3s	remaining: 24m 27s
15:	learn: 0.1446939	total: 23.7s	remaining: 24m 19s
16:	learn: 0.1442547	total: 26.2s	remaining: 25m 17s
17:	learn: 0.1433368	total: 28.1s	remaining: 25m 34s
18:	learn: 0.1430748	total: 29.6s	remainin