- GAIT ANALYSIS (Optimized Approach with GridSearchCV, No train/test split)

- Aim: Multi-class classification to identify subjects (16 individuals) based on gait characteristics, using model comparison and hyperparameter tuning WITHOUT train/test split.

- Dataset: As mentioned in notebook named 'gait_final_output_v1_1' 

- Data Credits: https://archive.ics.uci.edu/dataset/604/gait+classification

- Key Improvements:
    - No train/test split: All 48 samples are used for training, with cross-validation for evaluation, to maximize the available training data

    - Six Models Tested
        - Gaussian Naive Bayes: Simple probabilistic classifier (4 param combinations) 
        - SVC (RBF/Linear): Support Vector Machine with different kernels (12 param combinations)
        - Logistic Regression: Linear classifier with regularization (16 param combinations)
        - K-Nearest Neighbors: Distance-based classifier (36 param combinations) - BEST
        - Random Forest: Ensemble of shallow trees (36 param combinations)
        - Decision Tree: Single tree with depth constraints (36 param combinations)

    - Hyperparameter Tuning with GridSearchCV
        - Systematic search across parameter combinations
        - StratifiedKFold cross-validation with n_splits = 3 (min samples/class)
        - Scoring: Accuracy metric

    - Model Selection for Small Datasets
        - Appropriate: K-Nearest Neighbors, SVM, Gaussian Naive Bayes, Logistic Regression
        - Poor: Random Forest, XGBoost, Gradient Boosting, Extra Trees 
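
The combination counts listed above can be checked with scikit-learn's `ParameterGrid`. This is a minimal sketch: the grids are copied from the `models` dictionary defined later in this notebook, while the names `param_grids`, `grid_sizes` and `n_folds` are introduced here only for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Grids copied from the notebook's `models` dictionary
param_grids = {
    "GaussianNB": {"var_smoothing": np.logspace(-10, -7, num=4)},
    "SVC": {"C": [0.1, 1, 10], "gamma": ["scale", 0.1], "kernel": ["rbf", "linear"]},
    "Logistic": {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"],
                 "solver": ["liblinear", "saga"]},
    "KNeighborsClassifier": {"n_neighbors": [1, 2, 3], "weights": ["uniform", "distance"],
                             "metric": ["euclidean", "manhattan", "minkowski"], "p": [1, 2]},
    "RandomForest": {"n_estimators": [50, 100], "max_depth": [2, 3, 4],
                     "min_samples_split": [2, 3, 5], "min_samples_leaf": [1, 2]},
    "DecisionTree": {"max_depth": [2, 3, 4, 5], "min_samples_split": [2, 3, 5],
                     "min_samples_leaf": [1, 2, 3]},
}

n_folds = 3  # StratifiedKFold(n_splits=3) as used in this notebook

# ParameterGrid enumerates the Cartesian product of each grid
grid_sizes = {name: len(ParameterGrid(grid)) for name, grid in param_grids.items()}
for name, size in grid_sizes.items():
    print(f"{name}: {size} combinations, {size * n_folds} cross-validation fits")
```

Each combination is fitted once per fold, so the number of pipeline fits per model is the grid size times 3 (plus one final refit on all data).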

- Methodology
    - Pipeline Components:
        - Oversampling: SMOTEENN or SMOTETomek requested; with only 3 samples/class the code falls back to RandomOverSampler for both
        - Scaling: StandardScaler standardization (zero mean, unit variance)
        - Feature Selection: SelectKBest with mutual information (top 4 features, from k = min(10, n_features, n_samples // 10))
        - Model: One of six algorithms with tuned hyperparameters
        - Validation: StratifiedKFold CV (3 folds)

    - GridSearchCV Configuration: Each model searches through parameter combinations

    - Evaluation Strategy:
        - Cross-validation score: Performance across 3 folds (best_cv_score)
        - Training accuracy: Performance on full dataset (no train/test split)
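
The two evaluation numbers above can be illustrated with a small sketch. It assumes synthetic stand-in data from `make_classification` (not the gait dataset) and a reduced KNN grid; the names `X_toy`, `y_toy`, `cv_toy` and `search` are hypothetical. It shows that `best_score_` is the mean of the per-fold test scores of the winning combination, while `score(X, y)` on the fitting data is only a training accuracy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: this is NOT the gait dataset)
X_toy, y_toy = make_classification(
    n_samples=60, n_features=8, n_informative=4, n_classes=3, random_state=0
)

cv_toy = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 2, 3]},
    cv=cv_toy,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

# Per-fold test scores of the best parameter combination
best = search.best_index_
fold_scores = np.array(
    [search.cv_results_[f"split{i}_test_score"][best] for i in range(3)]
)

# Accuracy on the same data used for the final refit (training accuracy)
train_acc = search.score(X_toy, y_toy)

print("per-fold scores:", fold_scores)
print("mean of folds  :", fold_scores.mean())
print("best_score_    :", search.best_score_)
print("training acc   :", train_acc)
```

Reading `cv_results_["std_test_score"]` next to the mean gives a rough idea of how much the estimate moves between folds.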

- Results
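    - KNeighborsClassifier under the 'smotetomek' label achieves the highest CV accuracy (0.92 in the run shown below)
    - With 3 samples/class both sampler labels fall back to RandomOverSampler, so the gap between the 'smoteenn' and 'smotetomek' rows reflects run-to-run randomness, not the sampler
    - KNN works well because it classifies by proximity to stored training samples (it memorizes local patterns)
    - With only 3 samples/subject, distance-based classification is effective
    - KNN's performance is more stable than that of more complex models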
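
The memorization point can be made concrete with a short sketch. It assumes purely random features and random labels (names `X_rand`, `y_rand` and `knn` are hypothetical) and uses the same KNN settings that the grid search selects later in this notebook. Because each training point is its own zero-distance neighbor, distance weighting returns its own label, so training accuracy is perfect even when there is nothing to learn.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Random data with no signal: 48 samples, 4 features, 16 arbitrary labels
X_rand = rng.normal(size=(48, 4))
y_rand = rng.integers(0, 16, size=48)

knn = KNeighborsClassifier(n_neighbors=2, weights="distance", metric="manhattan")
knn.fit(X_rand, y_rand)

# Scoring on the training data itself: every point matches its own stored copy
train_acc_rand = knn.score(X_rand, y_rand)
print("training accuracy on random labels:", train_acc_rand)
```

This is why the training accuracy of a distance-weighted KNN says little about generalization; only the cross-validated score is informative.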
   
- Performance Variability: 
    - Results are not fully consistent across runs
    - Small dataset leads to high variance
    - Different random seeds can produce different "best" models
    - CV scores provide better estimates than single train/test split
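
One way to quantify this variability is to repeat the 3-fold split with different shuffles and look at the spread. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` (not the gait dataset), a fixed KNN configuration, and scikit-learn's `RepeatedStratifiedKFold`, which the notebook itself does not use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: NOT the gait dataset)
X_toy, y_toy = make_classification(
    n_samples=48, n_features=10, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# 10 repetitions of a 3-fold stratified split, each with a different shuffle
rcv = RepeatedStratifiedKFold(n_splits=3, n_repeats=10, random_state=42)
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=2, weights="distance"),
    X_toy, y_toy, cv=rcv, scoring="accuracy",
)

# Scores arrive repeat by repeat, so each row below is one full 3-fold run
per_repeat = scores.reshape(10, 3).mean(axis=1)
print("mean accuracy per repeat:", np.round(per_repeat, 3))
print("overall mean:", scores.mean(), "| std across repeats:", per_repeat.std())
```

Reporting the mean together with the spread across repeats is more honest on a dataset this small than quoting a single best score.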
      
- Limitations:
    - Only 3 samples/class is still fundamentally insufficient
    - High variance in results despite optimization
    - Perfect training accuracy indicates memorization, not learning
    - Cross-validation with 3 folds is minimal (each test fold holds only 1 sample/class, leaving 2 samples/class for training)
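
The fold composition can be verified directly. This sketch rebuilds only the label layout of the dataset (16 subjects, 3 samples each) with placeholder features; `y_demo`, `X_demo` and `fold_summary` are names introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Label layout of the dataset: 16 subjects x 3 samples each
y_demo = np.repeat(np.arange(16), 3)
X_demo = np.zeros((len(y_demo), 1))  # features do not influence the split

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

fold_summary = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X_demo, y_demo), start=1):
    # Count how many samples of each subject land in train and test
    train_counts = np.bincount(y_demo[train_idx], minlength=16)
    test_counts = np.bincount(y_demo[test_idx], minlength=16)
    fold_summary.append((train_counts.tolist(), test_counts.tolist()))
    print(f"fold {fold}: train per class = {set(train_counts.tolist())}, "
          f"test per class = {set(test_counts.tolist())}")
```

Every model in a fold is therefore trained on 2 samples per subject and judged on a single sample per subject, which is why each fold score can only move in steps of 1/16.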

- Summary:
    - Using all 48 samples instead of splitting improves data utilization
    - GridSearchCV finds better hyperparameters than default values
    - Multiple model comparison identifies KNN as most suitable
    - Cross-validation provides more reliable evaluation than single test set
    - This approach maximizes learning from limited data by combining three steps: no train/test split (uses all data), hyperparameter optimization, and model selection appropriate for small datasets
      
- Conclusion: See end of this notebook.

- Fundamental limitation remains: 48 samples for 16-class classification is insufficient for reliable machine learning, regardless of methodology. Results demonstrate proper workflow but are not scientifically generalizable due to extreme data scarcity. KNeighborsClassifier performs best, but even optimized models cannot overcome the curse of dimensionality with only 3 samples/class.

> Recommendations: As mentioned in notebook named 'gait_final_output_v1_1'

In [2]:
# Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score, cross_val_predict
from sklearn.model_selection import LeaveOneOut, LeaveOneGroupOut, StratifiedKFold, GroupKFold, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
#from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import NearestNeighbors

from imblearn.pipeline import Pipeline

# these clean up the noisy data
from imblearn.combine import SMOTETomek, SMOTEENN

# these do not clean up the noisy data
from imblearn.over_sampling import RandomOverSampler, ADASYN, SMOTE

# avoid as it only duplicates data
# from imblearn.over_sampling import RandomOverSampler 

from sklearn.feature_selection import SelectKBest, mutual_info_classif

import time

import warnings
warnings.filterwarnings("ignore")

In [3]:
# Load data

df = pd.read_csv('gait_final_output.csv')
print(f'df.shape: {df.shape}')
print("---")
df.sample(3)


df.shape: (48, 322)
---


Unnamed: 0,Speed_R1,Variability_R1,Symmetry_R1,HeelPressTime_R1,CycleTime_R1,Cadence_R1,Posture_R1,Oscillation_R1,Loading_R1,FootPress_R1,...,P99_R3,P100_R3,P101_R3,P102_R3,P103_R3,P104_R3,P105_R3,P106_R3,P107_R3,Subject_ID_Y
33,1.31,3.48,5.1,1.06,1.048,1.05,1.045,0.12,0.037,0.05,...,0.016,0.03,0.02,0.069,0.146,0.185,0.222,0.711,1.406,11
23,1.28,4.46,0.5,1.158,1.16,1.163,1.157,0.071,0.052,0.065,...,0.015,0.071,0.024,-0.014,0.0129,0.212,0.199,0.658,1.519,7
0,1.32,4.15,4.0,1.054,1.054,1.05,1.06,0.043,0.044,0.044,...,0.027,0.044,0.039,0.073,0.097,0.232,0.215,0.928,1.078,0


In [4]:
# check data types

df.dtypes


Speed_R1            float64
Variability_R1      float64
Symmetry_R1         float64
HeelPressTime_R1    float64
CycleTime_R1        float64
                     ...   
P104_R3             float64
P105_R3             float64
P106_R3             float64
P107_R3             float64
Subject_ID_Y          int64
Length: 322, dtype: object

In [5]:
# check for NA values

df.isna().sum().sum()


np.int64(1)

In [7]:
# Get row with na value

row_index_of_na = df[df.isna().any(axis=1)].index[0]
print("row_index_of_na:", row_index_of_na)
print(type(row_index_of_na), '|', row_index_of_na.dtype)


# Columns that contain at least one NA value
cols_with_na = df.columns[df.isna().any(axis=0)]
print(cols_with_na)
col_name_of_na = cols_with_na[0]
print("col_name_of_na:", col_name_of_na)


row_index_of_na: 43
<class 'numpy.int64'> | int64
Index(['CycleTime_R2'], dtype='object')
col_name_of_na: CycleTime_R2


In [144]:
# Impute the NA value

# The dataset is small: 48 rows, 3 rows per subject, 16 subjects.
# The missing value is in row 43 (subject 14's 2nd iteration), column CycleTime_R2
# (R1, R2, R3 are three different sensors on the body).
# With so few rows, imputing with the column median of CycleTime_R2 is a reasonable choice.
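
For comparison, the sketch below contrasts the column-median imputation used in this notebook with a per-subject median on a tiny made-up frame. The toy values and the names `toy`, `global_value` and `subject_value` are hypothetical. The per-subject variant is shown only as an alternative to consider: it uses the label column, so in a subject-identification task it would have to be restricted to training folds to avoid leakage.

```python
import numpy as np
import pandas as pd

# Made-up toy frame with the same structure: 3 rows per subject, one NaN
toy = pd.DataFrame({
    "Subject_ID_Y": [0, 0, 0, 1, 1, 1],
    "CycleTime_R2": [1.0, 1.2, 1.1, 2.0, np.nan, 3.0],
})

# Approach used in this notebook: median of the whole column
global_value = toy["CycleTime_R2"].median()

# Alternative: median of the same subject's other rows (uses the label)
subject_median = toy.groupby("Subject_ID_Y")["CycleTime_R2"].transform("median")
subject_value = subject_median.loc[4]

toy_imputed = toy.copy()
toy_imputed["CycleTime_R2"] = toy["CycleTime_R2"].fillna(subject_median)

print("column median     :", global_value)
print("per-subject median:", subject_value)
print(toy_imputed)
```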


In [9]:
# Impute the missing value with the median of CycleTime_R2

print("Value BEFORE imputation:", df.loc[row_index_of_na, col_name_of_na] )

median_CycleTime_R2 = df[col_name_of_na].median()

df_new = df.copy()
    
# impute the missing value in df_new
df_new.loc[row_index_of_na, col_name_of_na] = median_CycleTime_R2

print("Value AFTER imputation:", df_new.loc[row_index_of_na, col_name_of_na] )


Value BEFORE imputation: nan
Value AFTER imputation: 6.15


In [10]:
# check NA values in the df_new dataframe

df_new.isna().sum().sum()


np.int64(0)

In [147]:
# Save this "df_new" as a new CSV file with name "gait_final_output_updated.csv"
# df_new.to_csv("gait_final_output_updated.csv", index=False)
# Update the data to 'Kaggle' and 'GitHub'

In [11]:
# Check for class imbalance

print( df_new['Subject_ID_Y'].value_counts().to_list() )

min_y_count = df_new['Subject_ID_Y'].value_counts().min()
max_y_count = df_new['Subject_ID_Y'].value_counts().max()
max_to_min_ratio = max_y_count / min_y_count

# Treat the classes as imbalanced when the largest class is more than 5x the smallest
if max_to_min_ratio > 5:
    print(f"Classes are imbalanced. Max-to-min count ratio is: {max_to_min_ratio}")
else:
    print(f"Classes are balanced. Max-to-min count ratio is: {max_to_min_ratio}")
    

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
Classes are balanced. Max-to-min count ratio is: 1.0


In [12]:
# STEP 0: Split X and y

y =  df_new['Subject_ID_Y']
X =  df_new.drop('Subject_ID_Y', axis='columns')

print(f"Total samples: {len(y)}")
print(f"Total features: {X.shape[1]}")
print(f"Number of classes: {len(y.unique())}")
print(f"Class distribution in full dataset:")
print(y.value_counts().sort_index().to_list())


Total samples: 48
Total features: 321
Number of classes: 16
Class distribution in full dataset:
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]


In [13]:
# Number of features kept by SelectKBest: at most 10, and about one feature per 10 samples
min(10, X.shape[1], len(X) // 10) 

4

In [150]:
# Define oversampler TYPES
oversampler_types = ['smoteenn', 'smotetomek']
oversampler_count = len(oversampler_types)

results = {}

for key in oversampler_types:
    print(f'\n=== {key} ===')
    results[key] = {}
    
    # Train-test split: intentionally skipped; GridSearchCV evaluates with cross-validation instead
    # X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42 )
    # print("y_train distribution:", y_train.value_counts().sort_index().to_list())
    
    # Get the smallest per-class sample count in the full dataset
    n_samples = y.value_counts().min()
    k_neighbors = max(1, min(3, n_samples - 1))
    print("k_neighbors:", k_neighbors)
    print("y.value_counts().min():", n_samples)
    
    # Choose appropriate oversampler based on sample size
    if n_samples < 6:  # Too few samples for SMOTEENN/SMOTETomek
        print(f"Only {n_samples} samples per class - using RandomOverSampler instead of {key}")
        oversampler = RandomOverSampler(random_state=42)
    else:
        # ASSUMPTION: target_samples_per_class was never defined in this notebook, so this
        # branch would raise NameError. The largest class count is used as a placeholder.
        target_samples_per_class = y.value_counts().max()
        sampling_strategy = {cls: target_samples_per_class for cls in np.unique(y)}
        
        if key == 'smoteenn':
            oversampler = SMOTEENN(
                smote=SMOTE(random_state=42, sampling_strategy=sampling_strategy, k_neighbors=k_neighbors),
                random_state=42,
                n_jobs=-1
            )
        elif key == 'smotetomek':
            oversampler = SMOTETomek(
                smote=SMOTE(random_state=42, sampling_strategy=sampling_strategy, k_neighbors=k_neighbors),
                random_state=42,
                n_jobs=-1
            )

    print("oversampler used: ", oversampler)

    # Now try different models in the pipeline, not just one model.
    models = {
        'GaussianNB': {
            'model': GaussianNB(),
            'params': {                
                'var_smoothing': np.logspace(-10, -7, num=4) 
            }
        },
        'SVC-RBF': {
            'model': SVC(random_state=42, decision_function_shape='ovr'),
            'params': {
                'C':      [0.1, 1, 10],
                'gamma':  ['scale', 0.1],
                'kernel': ['rbf', 'linear']
            }
        },    
        'Logistic': {
            'model': LogisticRegression(max_iter=1000, random_state=42, multi_class='ovr'),
            'params': {
                'C':       [0.01, 0.1, 1, 10],
                'penalty': ['l1', 'l2'],
                'solver':  ['liblinear', 'saga']
            }
        },
        'KNeighborsClassifier': {
            'model': KNeighborsClassifier(),
            'params': {
                'n_neighbors': [1, 2, 3],
                'weights':     ['uniform', 'distance'],
                'metric':      ['euclidean', 'manhattan', 'minkowski'],
                'p':           [1, 2]   # Minkowski power parameter (1 = manhattan, 2 = euclidean)
            }
        },
        'RandomForest': {
            'model': RandomForestClassifier(random_state=42, n_jobs=-1),
            'params': {
                'n_estimators':      [50, 100],        # Not too many
                'max_depth':         [2, 3, 4],        # VERY shallow (key!)
                'min_samples_split': [2, 3, 5],        # Require multiple samples
                'min_samples_leaf':  [1, 2]            # Prevent tiny leaves
                }
        },
        'DecisionTree': {
            'model': DecisionTreeClassifier(random_state=42),
            'params': {
                'max_depth':         [2, 3, 4, 5],
                'min_samples_split': [2, 3, 5],
                'min_samples_leaf':  [1, 2, 3]
                }
        }
    }

    
    # Check each model's performance
    for models_name, models_config in models.items():
        print('---')
        model_start_time = time.time()
                    
        # Pipeline        
        pipeline = Pipeline([
            ('oversample', oversampler),
            ('scaler',     StandardScaler()),
            ('selector',   SelectKBest(mutual_info_classif, k= min(10, X.shape[1], len(X) // 10) )),
            ('model',      models_config['model'] )
        ])
        # Prefix each hyperparameter with 'model__' so GridSearchCV routes it to the pipeline's 'model' step
        param_grid = {f'model__{k}': v for k, v in models_config['params'].items()}
        
        # GridSearchCV 
        cv = StratifiedKFold(n_splits=y.value_counts().min(), shuffle=True, random_state=42)
        # n_splits=3 cannot be greater than the number of members in each class.
        try:
            grid_search = GridSearchCV(
                estimator  = pipeline,
                param_grid = param_grid,
                cv         = cv,
                scoring    = 'accuracy', # can explore this later for precision/recall
                n_jobs     = -1,
                verbose    = 0
            )
            grid_search.fit(X, y)

            # evaluate on the same data used for fitting (training accuracy, not a held-out score)
            train_acc = grid_search.score(X, y)
            #test_acc  = grid_search.score(X_test, y_test)

            results[key][models_name] = {
                'grid'          : grid_search,
                'train_acc'     : train_acc,
                #'test_acc'      : test_acc,
                'best_param'    : grid_search.best_params_,
                'best_estimator': grid_search.best_estimator_,
                'best_cv_score' : grid_search.best_score_,
                'model_run_time': time.time() - model_start_time
            }
            
            
        except Exception as e:
            print(f"{models_name} ERROR: {str(e)}")
            continue
            
        print("-------end of model for loop iter---------")
    print("-------end of sampler for loop iter---------")
print("---------end of execution---------")
    
# Print results after the execution ends
       


=== smoteenn ===
k_neighbors: 2
y.value_counts().min(): 3
Only 3 samples per class - using RandomOverSampler instead of smoteenn
oversampler used:  RandomOverSampler(random_state=42)
---
-------end of model for loop iter---------
---
-------end of model for loop iter---------
---
-------end of model for loop iter---------
---
-------end of model for loop iter---------
---
-------end of model for loop iter---------
---
-------end of model for loop iter---------
-------end of sampler for loop iter---------

=== smotetomek ===
k_neighbors: 2
y.value_counts().min(): 3
Only 3 samples per class - using RandomOverSampler instead of smotetomek
oversampler used:  RandomOverSampler(random_state=42)
---
-------end of model for loop iter---------
---
-------end of model for loop iter---------
---
-------end of model for loop iter---------
---
-------end of model for loop iter---------
---
-------end of model for loop iter---------
---
-------end of model for loop iter---------
-------end of sampl

In [151]:
# print a nice looking table
data = []
for keys, vals in results.items():
    for key, val in vals.items():
        data.append({
            'sampler':       keys,
            'model'  :       key,
            'train_acc':     round(val['train_acc'],2),
            'best_cv_score': round(val['best_cv_score'],2),
            'model_run_time':round(val['model_run_time'],2)
        })

# NOTE: this rebinds `df` (previously the raw data) to the results table
df = pd.DataFrame(data).sort_values(by='best_cv_score', ascending=False)
print( df )


       sampler                 model  train_acc  best_cv_score  model_run_time
9   smotetomek  KNeighborsClassifier       1.00           0.92          326.14
1     smoteenn               SVC-RBF       0.94           0.85          120.59
7   smotetomek               SVC-RBF       0.92           0.85          114.52
2     smoteenn              Logistic       0.88           0.83          165.67
3     smoteenn  KNeighborsClassifier       1.00           0.83          324.41
8   smotetomek              Logistic       0.88           0.81          151.37
4     smoteenn          RandomForest       0.96           0.79          351.13
10  smotetomek          RandomForest       0.96           0.79          349.63
6   smotetomek            GaussianNB       0.90           0.52           43.45
0     smoteenn            GaussianNB       0.96           0.46           48.17
5     smoteenn          DecisionTree       0.44           0.40          334.93
11  smotetomek          DecisionTree       0.44     

In [154]:
print("The above scores vary from run to run. \
However, KNeighborsClassifier (under the smotetomek label) \
consistently ranks highest.")

The above scores vary from run to run. However, KNeighborsClassifier (under the smotetomek label) consistently ranks highest.


In [152]:
# Get the first value in the table
first_sampler = df['sampler'].iloc[0]
first_model   = df['model'].iloc[0]

# Get the best estimator and params from results
estimator   = results[first_sampler][first_model]['best_estimator']
model_param = results[first_sampler][first_model]['best_param']

print('-'*60)
print("Best param: ", model_param)
print('-'*60)
print("\nBest estimator: ", estimator)
print('-'*60)

# Verify they match by extracting params from estimator
best_estimator_params = estimator.get_params()
tuned_params = {k: v for k, v in best_estimator_params.items() if k in model_param}

print("\nBest params from results:", model_param)
print('-'*60)
print("\nBest params from estimator:", tuned_params)
print('-'*60)

------------------------------------------------------------
Best param:  {'model__metric': 'manhattan', 'model__n_neighbors': 2, 'model__p': 2, 'model__weights': 'distance'}
------------------------------------------------------------

Best estimator:  Pipeline(steps=[('oversample', RandomOverSampler(random_state=42)),
                ('scaler', StandardScaler()),
                ('selector',
                 SelectKBest(k=4,
                             score_func=<function mutual_info_classif at 0x71503f173a30>)),
                ('model',
                 KNeighborsClassifier(metric='manhattan', n_neighbors=2,
                                      weights='distance'))])
------------------------------------------------------------

Best params from results: {'model__metric': 'manhattan', 'model__n_neighbors': 2, 'model__p': 2, 'model__weights': 'distance'}
------------------------------------------------------------

Best params from estimator: {'model__metric': 'manhattan', 'model_

In [153]:
# Define the best model/params to see if I get the same result

# Use the best_estimator directly
best_pipeline = results[first_sampler][first_model]['best_estimator']

# Fit on the full dataset and predict on the same data (these are training predictions)
best_pipeline.fit(X, y)
y_pred = best_pipeline.predict(X)

# Evaluate
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score

# `cv` is the StratifiedKFold object left over from the grid-search loop
cv_accuracy = cross_val_score(
    estimator = best_pipeline, 
    X=X,
    y=y, 
    scoring='accuracy',
    cv=cv
)
print('Mean CV Accuracy:', cv_accuracy.mean())

report = classification_report(y, y_pred)
print(report)


Mean CV Accuracy: 0.8125
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         3
           1       1.00      1.00      1.00         3
           2       1.00      1.00      1.00         3
           3       1.00      1.00      1.00         3
           4       1.00      1.00      1.00         3
           5       1.00      1.00      1.00         3
           6       1.00      1.00      1.00         3
           7       1.00      1.00      1.00         3
           8       1.00      1.00      1.00         3
           9       1.00      1.00      1.00         3
          10       1.00      1.00      1.00         3
          11       1.00      1.00      1.00         3
          12       1.00      1.00      1.00         3
          13       1.00      1.00      1.00         3
          14       1.00      1.00      1.00         3
          15       1.00      1.00      1.00         3

    accuracy                           1.00        48
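
The mean CV accuracy printed here (0.8125) differs from the `best_cv_score` reported for the same pipeline in the results table (0.92), even though the folds are seeded. One plausible contributor, offered as an assumption and not a confirmed diagnosis, is that `mutual_info_classif` is called without a `random_state`, so the selected features can change between fits. The sketch below uses synthetic data and `functools.partial` to pin the seed; `seeded_mi`, `mask_a` and `mask_b` are hypothetical names.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data (assumption: NOT the gait dataset)
X_toy, y_toy = make_classification(
    n_samples=48, n_features=20, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Bind a fixed seed so the mutual-information estimate is repeatable
seeded_mi = partial(mutual_info_classif, random_state=42)

# Two independent fits with the seeded score function
mask_a = SelectKBest(seeded_mi, k=4).fit(X_toy, y_toy).get_support()
mask_b = SelectKBest(seeded_mi, k=4).fit(X_toy, y_toy).get_support()

print("selected features (fit A):", np.flatnonzero(mask_a))
print("selected features (fit B):", np.flatnonzero(mask_b))
print("identical selection:", np.array_equal(mask_a, mask_b))
```

Passing `seeded_mi` in place of `mutual_info_classif` in the pipeline's `selector` step would remove this source of randomness, which makes repeated runs easier to compare.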
 

SUMMARY:
- The best performing model is KNeighborsClassifier
