Gait Analysis (baseline approach)

Aim: Multi-class classification (16 subjects) using a train/test split with SMOTE-based oversampling inside the pipeline. This is the first attempt and serves as the baseline.

Data Source/Credit: https://archive.ics.uci.edu/dataset/604/gait+classification

Dataset
- Size:         48 samples (16 subjects * 3 trials)
- Features:     321 gait cycle measurements from 3 sensors (R1, R2, R3)
- Target:       Subject identification (16 classes)
- Missing data: 1 NA value in CycleTime_R2 (imputed with median)
- Split:        67% train / 33% test -> ~2 training samples/class + 1 test sample/class 
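
The per-class counts in the Split line follow from simple arithmetic. A minimal sketch, assuming the test count is rounded up and the remainder goes to training (this agrees with the 2-per-class training distribution printed later in the notebook); the helper name `split_sizes` is illustrative only:

```python
import math

def split_sizes(n_samples, test_size, n_classes):
    """Return (n_train, n_test, train_per_class, test_per_class)."""
    n_test = math.ceil(test_size * n_samples)   # assumed rounding rule
    n_train = n_samples - n_test
    return n_train, n_test, n_train // n_classes, n_test // n_classes

n_train, n_test, train_pc, test_pc = split_sizes(48, 0.33, 16)
print(n_train, n_test, train_pc, test_pc)  # 32 16 2 1
```

With 16 classes, any test fraction that leaves at least one test sample per class cannot leave more than 2 training samples per class.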

Methodology: Standard supervised learning pipeline:
- Data Cleaning: Median imputation for missing value
- Stratified Split: 67/33 train/test (ensures all classes represented)
- Oversampling (in pipeline): SMOTEENN or SMOTETomek
- Feature Scaling: StandardScaler normalization
- Feature Selection: SelectKBest with mutual information; k = min(10, n_features, n_train // 10), which is 3 for the 32 training rows here
- Model: Gaussian Naive Bayes (fixed hyperparameters)
- Validation: Adaptive cross-validation (skipped when n<3 samples/class)
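
The number of features kept by SelectKBest is derived from the training set size rather than fixed. A small sketch of that rule, using the expression from the modelling cell (`min(10, X_train.shape[1], len(X_train) // 10)`); the helper name `select_k` and its `cap` parameter are hypothetical:

```python
def select_k(n_features, n_train, cap=10):
    """Number of features to keep: at most `cap`, and at most one
    feature per ten training rows (expression taken from the notebook)."""
    return min(cap, n_features, n_train // 10)

print(select_k(n_features=321, n_train=32))  # 3
```

Note that the rule returns 0 for fewer than 10 training rows, so a floor of 1 may be worth adding if the training set shrinks further.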

Oversampling strategy
- SMOTEENN: SMOTE + Edited Nearest Neighbors (removes samples, original or synthetic, whose neighbours mostly belong to another class)
- SMOTETomek: SMOTE + Tomek Links (cleans class boundaries)
- Fallback: RandomOverSampler when the smallest class has fewer than 6 training samples (too few for SMOTE)
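
The fallback rule can be sketched without any library dependency. The function below mirrors the thresholds used in the modelling cell (`n_samples < 6` and `k_neighbors = max(1, min(3, n_samples - 1))`); it returns names instead of sampler objects, and `choose_oversampler` is a hypothetical helper, not part of imblearn:

```python
def choose_oversampler(requested, min_class_count, threshold=6):
    """Return (oversampler_name, k_neighbors) for a requested method.

    Falls back to RandomOverSampler when the smallest class has fewer
    than `threshold` samples; otherwise keeps the requested method.
    """
    if min_class_count < threshold:
        # Random duplication does not use nearest neighbours.
        return "RandomOverSampler", None
    k_neighbors = max(1, min(3, min_class_count - 1))
    return requested, k_neighbors

print(choose_oversampler("smoteenn", 2))    # ('RandomOverSampler', None)
print(choose_oversampler("smotetomek", 8))  # ('smotetomek', 3)
```

With 2 training samples per class, both requested methods resolve to the same fallback, so the two loop iterations in this notebook train the same pipeline.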

Adaptive CV: Handling of the tiny dataset (n_samples = smallest per-class count in the training set)
- If n_samples < 3: Skip CV, just train and evaluate on test
- If n_samples >= 3: Use StratifiedKFold with n_splits = min(5, n_samples)
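
A compact way to express this decision, as a sketch: the function returns `None` when cross-validation should be skipped and the fold count otherwise. The name `choose_n_splits` is hypothetical; the thresholds are the ones listed above.

```python
def choose_n_splits(min_class_count, max_splits=5):
    """Return the number of StratifiedKFold splits, or None to skip CV."""
    if min_class_count < 3:
        return None
    return min(max_splits, min_class_count)

for n in (2, 3, 10):
    print(n, choose_n_splits(n))
```

For this dataset the smallest class has 2 training samples, so the function returns `None` and only the single test-set evaluation is run.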

Results: Very poor performance 
- 0.00 recall/precision for many classes
- Model never predicts certain classes
- High variance, no generalization
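
With one test sample per class, per-class recall can only be 0 or 1, and a class that is never predicted is reported with precision 0 when `zero_division=0` is used. A toy illustration in plain Python; the labels below are made up for the example and are not the notebook's predictions:

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall)}, using 0.0 when undefined."""
    report = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        report[label] = (precision, recall)
    return report

# Toy example: four classes, one test sample each (hypothetical labels).
y_true = [0, 1, 2, 3]
y_pred = [3, 3, 2, 3]   # class 3 absorbs the samples of classes 0 and 1
report = per_class_report(y_true, y_pred)
for label, (precision, recall) in report.items():
    print(label, round(precision, 2), recall)
```

Classes 0 and 1 score 0.00 on both metrics, while class 3 keeps recall 1.00 but its precision drops because it collects other subjects' samples. The same pattern (recall 1.00 with low precision) appears in the test report further down.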

Why This Failed
1. Insufficient training data: only 2 training samples per class:
    - Impossible to learn distinguishing patterns
    - Model cannot identify what makes each subject unique
    - Not enough examples to capture gait variability

2. Unreliable evaluation: only 1 test sample per class:
    - Single misclassification = 100% error for that subject
    - Metrics are completely unstable
    - Cannot distinguish signal from noise

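3. Oversampling limitations: SMOTE/SMOTEENN/SMOTETomek with 2 samples:
    - SMOTE creates synthetic samples by interpolating between only 2 points
    - Very limited diversity in synthetic data: variations of the same 2 examples
    - Cannot add meaningful new information
    - In this run the n_samples < 6 fallback selected RandomOverSampler, and with already balanced classes (2 per class) its default strategy adds no rows, so the model effectively trained on the 32 original samples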

4. Curse of Dimensionality
    - 321 features vs 32 training samples
    - Features >> Samples (severe overfitting risk)
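    - Feature selection reduces this to a handful of features (k = 3 under the rule used in the code), but with 2 samples per class the per-class estimates remain unreliable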
    - Model cannot learn reliable patterns
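
To make point 3 concrete: SMOTE-style interpolation places each synthetic sample on the segment between a sample and one of its same-class neighbours. With two samples per class, the only neighbour is the other sample. The sketch below is a simplified stand-in for that interpolation step, not imblearn's implementation; `smote_like` and the feature values are hypothetical.

```python
import random

def smote_like(a, b, n_new, seed=0):
    """Generate n_new points on the segment between feature vectors a and b."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_new):
        u = rng.random()   # interpolation weight in [0, 1)
        points.append([ai + u * (bi - ai) for ai, bi in zip(a, b)])
    return points

# Two hypothetical training samples of one subject (values are made up).
a, b = [1.0, 4.0], [3.0, 8.0]
synthetic = smote_like(a, b, n_new=5)

# The line through a and b is y = 2x + 2; every synthetic point lies on it,
# so no variation appears in any direction other than b - a.
on_line = all(abs(y - (2 * x + 2)) < 1e-9 for x, y in synthetic)
print(on_line)  # True
```

However many points are generated, they stay on a one-dimensional segment inside a 321-dimensional feature space, which is why the synthetic data adds so little.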

What This Baseline Established
1. Lessons Learned
    - Proper methodology: No data leakage (oversampling in pipeline after split)
    - Adaptive approach: Handles edge cases (n_samples < 3)
    - Smart oversampling: Tests multiple methods with fallback
    - Feature selection: Reduces dimensionality

2. Fundamental Problems Identified
    - Dataset too small: 48 samples insufficient for 16-class problem
    - Train/test split wasteful: Loses 16 samples to test set
    - Single split unstable: Results vary drastically with random_state
    - Not person-independent: Tests on subjects seen during training

3. Comparison with later versions: what v1_1 did right
    - Correct pipeline implementation (no data leakage)
    - Stratified split preserves class balance
    - Adaptive handling of tiny samples
    - Tests multiple oversampling methods

What later versions improved 
- v1_2: Added GridSearchCV for hyperparameter tuning, tested multiple models (KNN, SVC, Logistic, RandomForest, DecisionTree), no train/test split (uses all 48 samples)
- v1_3: Tested aggressive oversampling (100 samples/class) to see if more synthetic data helps - showed it doesn't (BAD RESULT DUE TO OVERFITTING)
- v1_4: Demonstrated data leakage by oversampling BEFORE split - educational example of what NOT to do (VERY BAD)
- v1_5: Tested oversampling outside pipeline (**UPDATE**)
- v1_6: Tested LOOCV with pre-CV oversampling (480 folds) - showed 99.4% accuracy but with data leakage (Invalid - educational)
- v1_7: Removed train/test split, used StratifiedKFold (2-fold) with pipeline oversampling, tested multiple k values (2,3,4,5,10,15) - BEST APPROACH
- v1_8: Same as v1_7 but with train/test split added back, showed why split is inferior for tiny data (88.2% accuracy, unstable)
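
Why oversampling before cross-validation (v1_4, v1_6) inflates accuracy can be shown with a toy count. The sketch assumes, purely for illustration, that oversampling is plain duplication to 10 rows per original sample (48 x 10 = 480 rows, matching the fold count quoted for v1_6); the oversampler actually used in v1_6 is not shown in this notebook, and `loo_hits_with_duplicates` is a hypothetical helper.

```python
def loo_hits_with_duplicates(n_original, copies):
    """Count leave-one-out folds whose held-out row has an exact copy in training.

    Each row is represented by the id of the original sample it was copied from.
    """
    rows = [i for i in range(n_original) for _ in range(copies)]
    hits = 0
    for held_out in range(len(rows)):
        training = rows[:held_out] + rows[held_out + 1:]
        if rows[held_out] in training:
            hits += 1
    return hits, len(rows)

hits, folds = loo_hits_with_duplicates(n_original=48, copies=10)
print(f"{hits} of {folds} folds contain a copy of the held-out row")
```

Every held-out row has a copy in its training fold, so a model can score near-perfectly by memorising. When oversampling runs inside the pipeline, it is applied to the training fold only and this cannot happen.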

What v1_1 did that others avoided
- Train/test split: Later versions realized this wastes data with only 48 samples
- Fixed hyperparameters: v1_2 added systematic tuning
- Single model: Later versions compared multiple algorithms

Valid conclusions
- Data scarcity is the problem: No amount of clever methodology can overcome 2 samples/class
- Oversampling has limits: Cannot create meaningful diversity from 2 examples
- Need different approach: Train/test split inappropriate for this dataset size

Conclusions that cannot be drawn
- Model quality (data too small to evaluate)
- Feature quality (not enough samples to test)
- Biometric viability (not person-independent anyway)

Other possible solutions
- Collect more data: 30-50 samples/subject minimum
- Simplify problem: Binary classification instead of 16-class
- Alternative approaches: Unsupervised learning, one-vs-rest
- Use GroupKFold: Test person-independent performance
 
This baseline established:
- What proper methodology looks like (no leakage, stratified splits)
- That the dataset is too small for reliable classification
- Need for alternative approaches (led to v1_2, v1_7 improvements)

> This was the starting point that established both the methodology and the fundamental challenge of working with extremely small datasets.


In [1]:
# Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score, cross_val_predict
from sklearn.model_selection import LeaveOneOut, LeaveOneGroupOut, StratifiedKFold, GroupKFold
from sklearn.metrics import classification_report, confusion_matrix
#from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import NearestNeighbors

# imblearn's Pipeline accepts resampling steps, which sklearn's Pipeline does not
from imblearn.pipeline import Pipeline

# combined samplers: SMOTE followed by a cleaning step
from imblearn.combine import SMOTETomek, SMOTEENN

# plain over-samplers (no cleaning step); RandomOverSampler only duplicates
# existing rows and is used below as the fallback for very small classes
from imblearn.over_sampling import RandomOverSampler, ADASYN, SMOTE


from sklearn.feature_selection import SelectKBest, mutual_info_classif


In [2]:
# Load data

df = pd.read_csv('gait_final_output.csv')
print(f'df.shape: {df.shape}')
print("---")
df.sample(3)


df.shape: (48, 322)
---


Unnamed: 0,Speed_R1,Variability_R1,Symmetry_R1,HeelPressTime_R1,CycleTime_R1,Cadence_R1,Posture_R1,Oscillation_R1,Loading_R1,FootPress_R1,...,P99_R3,P100_R3,P101_R3,P102_R3,P103_R3,P104_R3,P105_R3,P106_R3,P107_R3,Subject_ID_Y
20,1.27,4.12,1.8,1.113,1.112,1.115,1.115,0.044,0.046,0.04,...,0.02,0.026,0.027,0.13,0.128,0.216,0.216,1.01,0.99,6
40,1.39,0.0,-5.3,1.162,1.161,1.078,1.07,0.702,0.687,0.035,...,0.018,0.017,0.015,0.168,0.122,0.238,0.239,1.007,0.993,13
47,1.42,0.0,-8.1,1.123,1.19,1.125,1.125,0.04,0.603,0.031,...,0.017,0.024,0.017,0.143,0.171,0.235,0.249,0.93,1.075,15


In [3]:
# check column dtypes

df.dtypes


Speed_R1            float64
Variability_R1      float64
Symmetry_R1         float64
HeelPressTime_R1    float64
CycleTime_R1        float64
                     ...   
P104_R3             float64
P105_R3             float64
P106_R3             float64
P107_R3             float64
Subject_ID_Y          int64
Length: 322, dtype: object

In [4]:
# check for NA values

df.isna().sum().sum()


np.int64(1)

In [5]:
# Locate the row and column of the single NA value

row_index_of_na = df[df.isna().any(axis=1)].index[0]
print("row_index_of_na:", row_index_of_na)
print(type(row_index_of_na), '|', row_index_of_na.dtype)

# Boolean mask over columns: True where the column contains at least one NA
cols_with_na = df.columns[df.isna().any(axis=0)]
print(cols_with_na)
col_name_of_na = cols_with_na[0]
print("col_name_of_na:", col_name_of_na)


row_index_of_na: 43
<class 'numpy.int64'> | int64
Index(['CycleTime_R2'], dtype='object')
col_name_of_na: CycleTime_R2


In [6]:
# Impute the NA value

# The dataset has only 48 rows: 16 subjects with 3 trials each.
# The missing value is in row 43 (subject 14, 2nd trial), column CycleTime_R2
# (R1, R2, R3 are three different sensors on the body).
# With so few rows, the value is imputed with the median of CycleTime_R2
# rather than dropping the row.


In [7]:
# Impute with the median of CycleTime_R2

print("Value BEFORE imputation:", df.loc[row_index_of_na, col_name_of_na] )

median_CycleTime_R2 = df[col_name_of_na].median()

df_new = df.copy()

# impute the missing value in df_new
df_new.loc[row_index_of_na, col_name_of_na] = median_CycleTime_R2

print("Value AFTER imputation:", df_new.loc[row_index_of_na, col_name_of_na] )


Value BEFORE imputation: nan
Value AFTER imputation: 6.15


In [8]:
# check NA values in the df_new dataframe

df_new.isna().sum().sum()


np.int64(0)

In [9]:
# Save this "df_new" as a new CSV file with name "gait_final_output_updated.csv"
# df_new.to_csv("gait_final_output_updated.csv", index=False)
# Update the data to 'Kaggle' and 'GitHub'

In [10]:
# Check for class imbalance

class_counts = df_new['Subject_ID_Y'].value_counts()
print( class_counts.to_list() )

min_y_count = class_counts.min()
max_y_count = class_counts.max()
max_to_min_ratio = max_y_count / min_y_count

# Flag imbalance when the largest class is more than 5x the smallest
if max_to_min_ratio > 5:
    print(f"Classes are imbalanced. Max-to-min count ratio is: {max_to_min_ratio}")
else:
    print(f"Classes are balanced. Max-to-min count ratio is: {max_to_min_ratio}")
    

[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
Classes are balanced. Max-to-min count ratio is: 1.0


In [11]:
# STEP 0: Split X and y

y =  df_new['Subject_ID_Y']
X =  df_new.drop('Subject_ID_Y', axis='columns')

print(f"Total samples: {len(y)}")
print(f"Total features: {X.shape[1]}")
print(f"Number of classes: {len(y.unique())}")
print(f"Class distribution in full dataset:")
print(y.value_counts().sort_index().to_list())


Total samples: 48
Total features: 321
Number of classes: 16
Class distribution in full dataset:
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]


In [15]:
# Define oversampler TYPES
oversampler_types = ['smoteenn', 'smotetomek']
oversampler_count = len(oversampler_types)

for key in oversampler_types:
    print(f'\n=== {key} ===')
    
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.33, random_state=42 )
    print("y_train distribution:", y_train.value_counts().sort_index().to_list())
    
    # Get sample counts in train data
    n_samples = y_train.value_counts().min()
    k_neighbors = max(1, min(3, n_samples - 1))
    print("k_neighbors:", k_neighbors)
    print("y_train.value_counts().min():", n_samples)
    
    # Choose appropriate oversampler based on sample size
    if n_samples < 6:  # Too few samples for SMOTEENN/SMOTETomek
        print(f"Only {n_samples} samples per class - using RandomOverSampler instead of {key}")
        oversampler = RandomOverSampler(random_state=42)
    else:
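        # NOTE: target_samples_per_class was not defined anywhere in this notebook.
        # Balancing every class up to the largest class is an assumed placeholder
        # so this branch does not raise NameError; set the intended target explicitly.
        target_samples_per_class = y_train.value_counts().max()
        sampling_strategy = {cls: target_samples_per_class for cls in np.unique(y_train)}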
        
        if key == 'smoteenn':
            oversampler = SMOTEENN(
                smote=SMOTE(random_state=42, sampling_strategy=sampling_strategy, k_neighbors=k_neighbors),
                random_state=42,
                n_jobs=-1
            )
        elif key == 'smotetomek':
            oversampler = SMOTETomek(
                smote=SMOTE(random_state=42, sampling_strategy=sampling_strategy, k_neighbors=k_neighbors),
                random_state=42,
                n_jobs=-1
            )

    print("oversampler used: ", oversampler)    
    
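    # Pipeline with GaussianNB
    # k for SelectKBest: capped at 10 and at one feature per 10 training rows
    # (32 // 10 = 3 for this split)
    max_k = min(10, X_train.shape[1], len(X_train) // 10)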
    pipeline = Pipeline([
        ('oversample', oversampler),
        ('scaler', StandardScaler()),
        ('selector', SelectKBest(mutual_info_classif, k=max_k)),
        ('model', GaussianNB())
    ])
    
    try:
        # Check if we have enough samples for cross-validation
        if n_samples < 3:
            print(f"Skipping cross-validation - only {n_samples} samples per class")
            print("Training on training set and evaluating on test set only...")
            
            # Just fit on training data and evaluate on test
            pipeline.fit(X_train, y_train)
            y_pred_test = pipeline.predict(X_test)
            
            print("\nTest Report:")
            print(classification_report(y_test, y_pred_test, zero_division=0))
        else:
            # Cross-validation
            n_splits = min(5, n_samples)
            cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
            
            y_pred_train = cross_val_predict(pipeline, X_train, y_train, cv=cv, n_jobs=-1)
            print("\nTraining CV Report:")
            print(classification_report(y_train, y_pred_train, zero_division=0))
            
            # Final evaluation
            pipeline.fit(X_train, y_train)
            y_pred_test = pipeline.predict(X_test)
            print("\nTest Report:")
            print(classification_report(y_test, y_pred_test, zero_division=0))

        
    except ValueError as e:
        print(f"Error with {key}: {e}")
        print("Debug info:")
        print(f"  - X_train shape: {X_train.shape}")
        print(f"  - y_train distribution: {y_train.value_counts().to_dict()}")
        continue


=== smoteenn ===
y_train distribution: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
k_neighbors: 1
y_train.value_counts().min(): 2
Only 2 samples per class - using RandomOverSampler instead of smoteenn
oversampler used:  RandomOverSampler(random_state=42)
Skipping cross-validation - only 2 samples per class
Training on training set and evaluating on test set only...

Test Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.00      0.00      0.00         1
           2       0.00      0.00      0.00         1
           3       0.20      1.00      0.33         1
           4       1.00      1.00      1.00         1
           5       1.00      1.00      1.00         1
           6       0.33      1.00      0.50         1
           7       1.00      1.00      1.00         1
           8       0.00      0.00      0.00         1
           9       0.00      0.00      0.00         1
          10     

Conclusion from this analysis:
- Terrible model performance: precision and recall are 0.00 for many of the 16 classes
- For those classes the model either never predicts the class or never predicts it correctly
- Only 1 test sample/class: per-class recall can only be 0 or 1, so the evaluation is unreliable
- Only 2 training samples/class: too few for the model to learn distinguishing features
- Oversampling did not help: the fallback RandomOverSampler only duplicates existing rows, and with already balanced classes it adds none, so no new diversity enters the training set
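
The run above fell back to RandomOverSampler, so it is worth checking how many rows that step could have added. The sketch below imitates a "bring every class up to the largest class" strategy in plain Python, which is my understanding of the sampler's default `sampling_strategy='auto'`; it is not imblearn code, and `rows_added_by_auto_oversampling` is a hypothetical helper.

```python
from collections import Counter

def rows_added_by_auto_oversampling(y):
    """Rows a 'match the largest class' strategy would add, per class."""
    counts = Counter(y)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}

# Balanced training labels as in this notebook: 16 classes x 2 samples.
y_train_balanced = [subject for subject in range(16) for _ in range(2)]
added = rows_added_by_auto_oversampling(y_train_balanced)
print(sum(added.values()))  # 0

# A hypothetical imbalanced case for contrast.
print(rows_added_by_auto_oversampling([0, 0, 0, 1]))  # {0: 0, 1: 2}
```

Under that assumption the oversampling step is a no-op for this split, and the classifier is fitted on the 32 original rows.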
