![ML](https://raw.githubusercontent.com/AniMilina/Parkinson-s-Freezing-of-Gait-Prediction/main/ML.jpg)


To detect episodes of freezing of gait (FOG) and classify their types along the timeline, we use a dataset with the following columns:

* Unique identifier of the recording (stored in the 'Id' column)  
* Patient number (stored in the 'Subject' column)  
* Visit number (stored in the 'Visit' column)  
* Medications administered (stored in the 'Medication' column)  
* Time step of the measurement (stored in the 'Time' column)  
* Start time of FOG event (recorded in the 'Init' column)  
* End time of FOG event (recorded in the 'Completion' column)  
* Vertical axis acceleration (captured in the 'AccV' column)  
* Mediolateral axis acceleration (captured in the 'AccML' column)  
* Anteroposterior axis acceleration (captured in the 'AccAP' column)  
* Indicator of FOG at gait initiation, i.e. start hesitation (stored in the 'StartHesitation' column)  
* Indicator of FOG during turning (stored in the 'Turn' column)  
* Indicator of FOG during walking (stored in the 'Walking' column)  

Our goal is to develop a model that predicts freezing-of-gait episodes and their types from this dataset. To evaluate the model, we use mean Average Precision (mAP): the average precision (AP) is computed separately for each of the three event classes, and the three values are averaged.
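As a minimal sketch of how this metric can be computed, the snippet below averages the per-class AP returned by scikit-learn's `average_precision_score`. The helper name `mean_average_precision` and the toy labels and scores are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """Return the mean of the per-class AP values and the per-class list.

    Hypothetical helper: y_true holds 0/1 labels and y_score holds predicted
    scores, one column per event class.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    per_class = [
        average_precision_score(y_true[:, k], y_score[:, k])
        for k in range(y_true.shape[1])
    ]
    return float(np.mean(per_class)), per_class

# Made-up labels and scores for StartHesitation, Turn, Walking (4 time steps)
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 0, 0]])
y_score = np.array([
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.7, 0.3],
    [0.2, 0.1, 0.6],
])

map_value, per_class_ap = mean_average_precision(y_true, y_score)
print(f"Per-class AP: {per_class_ap}, mAP: {map_value:.4f}")
```

Note that AP is a ranking metric, so it is meant to be fed predicted scores or probabilities rather than hard 0/1 predictions.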

In [1]:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import os

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
import joblib

from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import average_precision_score, accuracy_score, confusion_matrix, roc_auc_score, f1_score



## Settings

In [2]:
# Set the display format for floats

pd.options.display.float_format = '{:.2f}'.format

In [3]:
def fill_missing_values(df):
    """
    Replaces missing values with the median value for each numerical column.

    :param df: pandas DataFrame, the dataset in which missing values need to be replaced
    :return: pandas DataFrame, the dataset with replaced missing values
    """
    numeric_cols = df.select_dtypes(include='number').columns  # Select all numerical columns
    for col in numeric_cols:
        median = df[col].median()  # Median of the column (NaN values are skipped)
        df[col] = df[col].fillna(median)  # Assign back instead of using chained inplace fillna
    return df

In [4]:
# Function to Get Data Information

def explore_dataframe(df):
    print("Shape of dataframe:", df.shape)
    display(df.head())
    print("Info of dataframe:\n")
    df.info()
    print("Summary statistics of dataframe:\n", df.describe())
    print("Missing values in dataframe:\n", df.isnull().sum())
    print("Duplicate rows in dataframe:", df.duplicated().sum())

In [5]:
# Checking for Missing Values in Each Column

def check_missing_values(df):
    """
    Checks the count of missing values in each column of a DataFrame.

    :param df: pandas.DataFrame, the DataFrame to check for missing values.
    :return: pandas.Series, the number of missing values per column.
    """
    return df.isnull().sum()

## Reading Data

In [6]:
# Reading Data from a CSV File

data = pd.read_csv('/kaggle/input/eda-parkinson/EDA_Parkinson.csv', low_memory=False)

In [7]:
data

Unnamed: 0,Id,Subject,Visit,Medication,Time,Init,Completion,AccV,AccML,AccAP,StartHesitation,Turn,Walking
0,003f117e14,4dc2f8,3,on,0.00,8.61,14.77,-9.78,0.11,-0.54,0.00,1.00,0.00
1,003f117e14,4dc2f8,3,on,1.00,8.61,14.77,-9.79,0.09,-0.53,0.00,1.00,0.00
2,003f117e14,4dc2f8,3,on,2.00,8.61,14.77,-9.79,0.07,-0.54,0.00,1.00,0.00
3,003f117e14,4dc2f8,3,on,3.00,8.61,14.77,-9.78,0.12,-0.55,0.00,1.00,0.00
4,003f117e14,4dc2f8,3,on,4.00,8.61,14.77,-9.79,0.12,-0.54,0.00,1.00,0.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...
15856286,f9fc61ce85,040587,1,on,4469.00,1172.65,1173.44,-9.68,0.55,-0.88,0.00,1.00,0.00
15856287,f9fc61ce85,040587,1,on,4470.00,1172.65,1173.44,-9.56,0.52,-0.81,0.00,1.00,0.00
15856288,f9fc61ce85,040587,1,on,4471.00,1172.65,1173.44,-9.61,0.57,-0.79,0.00,1.00,0.00
15856289,f9fc61ce85,040587,1,on,4472.00,1172.65,1173.44,-9.54,0.49,-0.73,0.00,1.00,0.00


In [None]:
# Removing unnecessary columns

data.drop(['Id', 'Subject', 'Visit', 'Medication'], axis=1, inplace=True)

# Splitting into features (X) and target variable (y)

X = data.drop(['StartHesitation', 'Turn', 'Walking'], axis=1)
y = data[['StartHesitation', 'Turn', 'Walking']]

# Splitting into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a pipeline for each column

pipelines = {}
hyperparameters = {
    'max_depth': 3,
    'n_estimators': 100,
    'learning_rate': 0.1
}

# Processing each column separately

for column in y.columns:
    pipeline = make_pipeline(StandardScaler(), xgb.XGBClassifier(**hyperparameters))
    pipeline.fit(X_train, y_train[column])
    pipelines[column] = pipeline

# Model evaluation

for column, pipeline in pipelines.items():
    train_score = pipeline.score(X_train, y_train[column])
    test_score = pipeline.score(X_test, y_test[column])
    print(f"{column}: Train Score - {train_score}, Test Score - {test_score}")

# Predictions on the test set

predictions = pd.DataFrame()

for column, pipeline in pipelines.items():
    predictions[column] = pipeline.predict(X_test)

In [None]:
# Metrics evaluation for each column

for column, pipeline in pipelines.items():
    # Use the pipeline fitted for this target, not the one left over from the previous loop
    y_train_pred = pipeline.predict(X_train)
    y_test_pred = pipeline.predict(X_test)
    # Probability of the positive class, used by the ranking-based metrics
    y_test_score = pipeline.predict_proba(X_test)[:, 1]

    train_accuracy = accuracy_score(y_train[column], y_train_pred)
    test_accuracy = accuracy_score(y_test[column], y_test_pred)

    average_precision = average_precision_score(y_test[column], y_test_score)

    confusion = confusion_matrix(y_test[column], y_test_pred)

    try:
        roc_auc = roc_auc_score(y_test[column], y_test_score)
    except ValueError:
        roc_auc = "Not defined"

    f1 = f1_score(y_test[column], y_test_pred)

    print(f"Metrics for column '{column}':")
    print(f"Train Accuracy: {train_accuracy}")
    print(f"Test Accuracy: {test_accuracy}")
    print(f"Average Precision: {average_precision}")
    print("Confusion Matrix:")
    print(confusion)
    print(f"ROC AUC Score: {roc_auc}")
    print(f"F1 Score: {f1}")
    print()

In [None]:
# Reading the sample_submission.csv file

test_data = pd.read_csv('/kaggle/input/tlvmc-parkinsons-freezing-gait-prediction/sample_submission.csv')

# Creating the submission dataframe
# `predictions` holds the hold-out predictions for X_test, so its rows are
# paired with the Ids of sample_submission.csv by position only

target_cols = ['StartHesitation', 'Turn', 'Walking']
submission_df = pd.concat([test_data[['Id']], predictions[target_cols].reindex(test_data.index)], axis=1)

# Saving the submission file as submission.csv

submission_df.to_csv('submission.csv', index=False)

In [None]:
import joblib

# Saving the pipelines

joblib.dump(pipelines, 'pipeline.pkl')

# Loading the pipelines

pipelines = joblib.load('pipeline.pkl')

## Conclusion:

Our model for predicting freezing of gait (FOG) episodes and their types uses the following features from the dataset: the time step of each measurement, the start and completion times of FOG events, and acceleration along the vertical, mediolateral, and anteroposterior axes. The `'Id'`, `'Subject'`, `'Visit'`, and `'Medication'` columns were dropped before training and are not used by the model.  

To train the model, we employed a pipeline consisting of several stages.  

* First, we separated the data into **features (X)** and the **target variable (y)**, where X contains all remaining columns except the targets, and y contains the `'StartHesitation'`, `'Turn'`, and `'Walking'` columns.  

* Next, we split the data into training and test sets with the `train_test_split` function, using a **test_size of 0.2** and `random_state=42`.
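The two steps above can be sketched on a small synthetic frame. The toy values below are randomly generated stand-ins for the real data (an assumption made so the snippet runs on its own); only the column names and the `train_test_split` arguments come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (values are random, not real recordings)
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "Time": np.arange(n),
    "AccV": rng.normal(-9.8, 0.1, n),
    "AccML": rng.normal(0.0, 0.1, n),
    "AccAP": rng.normal(-0.5, 0.1, n),
    "StartHesitation": rng.integers(0, 2, n),
    "Turn": rng.integers(0, 2, n),
    "Walking": rng.integers(0, 2, n),
})

# Separate features (X) from the three target columns (y)
target_cols = ["StartHesitation", "Turn", "Walking"]
X = toy.drop(columns=target_cols)
y = toy[target_cols]

# Hold out 20% of the rows for testing, with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Because `train_test_split` shuffles rows by default, neighbouring time steps of the same recording can end up on both sides of the split.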

For each target column, we created a separate pipeline that scales the features with StandardScaler and trains an XGBClassifier with `max_depth=3`, `n_estimators=100`, and `learning_rate=0.1`. We chose **XGBClassifier** because gradient-boosted trees can model non-linear relationships and interactions between features.
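The per-target setup can be sketched as follows. To keep the snippet dependent on scikit-learn only, `GradientBoostingClassifier` stands in for `xgb.XGBClassifier` here; that swap, the synthetic features, and the synthetic targets are assumptions for illustration and not the notebook's actual data or model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and targets (made up so the example is self-contained)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "AccV": rng.normal(0.0, 1.0, n),
    "AccML": rng.normal(0.0, 1.0, n),
    "AccAP": rng.normal(0.0, 1.0, n),
})
y = pd.DataFrame({
    "StartHesitation": (X["AccV"] > 0).astype(int),
    "Turn": (X["AccML"] > 0).astype(int),
    "Walking": (X["AccAP"] > 0).astype(int),
})

# Same hyperparameter values as in the notebook
hyperparameters = {"max_depth": 3, "n_estimators": 100, "learning_rate": 0.1}

# One scaler + classifier pipeline per target column
pipelines = {}
for column in y.columns:
    pipeline = make_pipeline(
        StandardScaler(),
        GradientBoostingClassifier(random_state=0, **hyperparameters),  # stand-in for XGBClassifier
    )
    pipeline.fit(X, y[column])
    pipelines[column] = pipeline

# Positive-class probabilities, one column per event type
scores = pd.DataFrame(
    {column: p.predict_proba(X)[:, 1] for column, p in pipelines.items()}
)
print(scores.head())
```

Keeping the pipelines in a dictionary keyed by target name makes it straightforward to look up the right model when scoring each column later.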

We trained a separate model for each target column because each column marks a distinct type of FOG episode. This lets each model fit the characteristics of its own event type instead of sharing one decision boundary across all three.

In conclusion, we obtained a model that predicts FOG episodes and their types from timing information and acceleration along three axes. Training one classifier per event type lets each one specialize, and predictions of this kind could support the monitoring of Parkinson's disease patients.  
