Gait Analysis (Train/Test Split with Proper Pipeline Oversampling)

Aim: Multi-class classification (16 subjects) using a train/test split with oversampling inside the pipeline to avoid data leakage.

Data Source: UCI Machine Learning Repository – Gait Classification Dataset

Dataset
- Original: 48 samples (16 subjects, 3 samples/class)
- Train Set: 31 samples (65%) => 2 samples per class, except one class with 1
- Test Set: 17 samples (35%) => 1 sample per class, except one class with 2
- Split: Stratified to ensure all classes are represented
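
The split sizes above follow from simple arithmetic. The sketch below reproduces them, assuming that a fractional test_size is rounded up to a whole number of samples and that a stratified split gives every class one test sample before handing out the remainder; both are stated here as assumptions, not checked against the library source.

```python
import math

n_samples, n_classes, test_size = 48, 16, 0.35

# Assumption: a fractional test_size is rounded up to a whole sample count.
n_test = math.ceil(test_size * n_samples)   # 16.8 rounds up to 17
n_train = n_samples - n_test                # the rest is used for training

# Assumption: each class first receives one test sample; the leftover
# test slots go to that many classes, which then keep fewer training samples.
extra_test = n_test - n_classes

print(f"train={n_train}, test={n_test}")
print(f"classes with 2 test / 1 train sample: {extra_test}")
print(f"classes with 1 test / 2 train samples: {n_classes - extra_test}")
```

This matches the classification report in the output, where one class (6) has a support of 2 and the others shown have a support of 1.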
  
Key Strategy: Oversampling inside the pipeline (after split)
- Step 1: Split data (65/35 stratified)
- Step 2: Pipeline applies oversampling only to training data
- No data leakage: Test set remains original, unmodified
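
To make the "oversample only the training data" step concrete, here is a minimal pure-Python sketch of random oversampling. It is a simplified stand-in for RandomOverSampler, not its implementation; the function name, the toy sample names, and the seed are hypothetical. The class counts mirror this split, where subject 6 keeps one training sample because it has two test samples.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until
    every class has as many samples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))  # sample with replacement
            out_labels.append(cls)
    return out_samples, out_labels

# Toy training split: 16 subjects with two samples each, except subject 6 with one
train_labels = [c for c in range(16) for _ in range(1 if c == 6 else 2)]
train_samples = [f"sample_{i}" for i in range(len(train_labels))]
# Toy test split: one sample per subject plus a second one for subject 6
test_labels = list(range(16)) + [6]

bal_samples, bal_labels = random_oversample(train_samples, train_labels)
print("train size before/after:", len(train_labels), "->", len(bal_labels))
print("test size (untouched):", len(test_labels))
```

Only the training lists pass through the function; the test labels are never modified, which is the property the pipeline is meant to guarantee.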

Pipeline Components
- StandardScaler: Feature normalization
- RandomOverSampler: Class balancing (applied only to training)
- SelectKBest: Feature selection (tested k = 2, 3, 4, 5, 10, 15)
- SVC: RBF kernel classifier

Results
- Best Performance: k=10 features, 88.2% accuracy (15 of 17 test samples correct)
- Observation: Complete failure on a few subjects (0% precision and recall, e.g. class 1 in the report)
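
Because the test set has only 17 samples, every accuracy printed by the run is a multiple of 1/17. The short sketch below converts the per-k accuracies from the notebook output back into counts of correct predictions; the variable names are illustrative.

```python
n_test = 17

# Accuracies as printed in the notebook output, keyed by k
reported = {
    2: 0.47058823529411764,
    3: 0.4117647058823529,
    4: 0.47058823529411764,
    5: 0.6470588235294118,
    10: 0.8823529411764706,
    15: 0.7058823529411765,
}

# Recover the number of correctly classified test samples for each k
correct = {k: round(acc * n_test) for k, acc in reported.items()}

for k, n in correct.items():
    print(f"k={k:>2}: {n}/{n_test} correct, {n_test - n} misclassified")
```

Seen this way, the gap between k=15 and k=10 is three test samples, which is worth keeping in mind when comparing settings.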

What This Approach Gets Right 
- No data leakage: Oversampling only on training data (via pipeline)
- Stratified split: All classes represented in train/test
- Feature selection: Compares multiple k values to find the best subset (the best k is picked by test-set accuracy, so the reported score is somewhat optimistic)
- Meaningful result: 88.2% accuracy with proper train/test separation
- Demonstrates correct ML workflow: Model learns discriminative features

Critical Limitations
- Not Person-Independent
    - Training: All 16 subjects (about 2 samples each)
    - Test: The same 16 subjects (their remaining samples)
    - Problem: Model learns to recognize known individuals, not new ones
    - Because every subject is seen during training, the result says nothing about generalization to unseen people
- Extremely Unstable Evaluation
    - Only 1 test sample for most classes -> a single misclassification means 0% recall for that class
    - A few subjects/classes show 0% precision/recall (complete failure)
    - High variance: Results are highly sensitive to individual predictions

- Single Random Split Dependency
    - Results may vary substantially with the random split: random_state=42 gave 88.2% accuracy; another seed (e.g. random_state=43) could give a different result
    - No averaging across multiple splits
    - Cannot assess performance stability
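
The 0% and 50% entries in the classification report are a direct consequence of having one test sample per class. The sketch below uses hypothetical labels for three subjects to show the effect; the helper mirrors zero_division=0 by returning 0.0 when a class is never predicted.

```python
def precision_recall(y_true, y_pred, cls):
    """Per-class precision and recall; returns 0.0 where the denominator is zero."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    predicted = sum(p == cls for p in y_pred)
    actual = sum(t == cls for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Hypothetical test set: one sample per subject; subject 1 is mistaken for subject 2
y_true = [0, 1, 2]
y_pred = [0, 2, 2]

for cls in sorted(set(y_true)):
    precision, recall = precision_recall(y_true, y_pred, cls)
    print(f"class {cls}: precision={precision:.2f}, recall={recall:.2f}")
```

One wrong prediction drives the missed class to 0.00/0.00 and halves the precision of the class it was confused with, the same pattern as classes 1, 4 and 7 in the report.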

Comparison: StratifiedKFold vs Train/Test Split
- This approach uses a train/test split with a single random split 
- Train/test split is inferior to StratifiedKFold for tiny datasets because:
    - Data utilization: StratifiedKFold uses all 48 samples for both training and testing (across folds), while this approach uses 17 samples only for testing and 31 only for training
    - Stability: StratifiedKFold averages results across folds (more stable), while this approach has high variance due to a single split
    - Reliability: StratifiedKFold provides more robust estimates with multiple evaluation points
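
The data-utilization point can be illustrated without any library. The sketch below is a simplified, deterministic stand-in for stratified fold assignment (not the StratifiedKFold implementation), and n_splits=3 is an assumption chosen to match the 3 samples per class.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample a test fold, cycling through the folds within each class."""
    seen = defaultdict(int)
    folds = []
    for label in labels:
        folds.append(seen[label] % n_splits)
        seen[label] += 1
    return folds

# 16 subjects with 3 samples each, as in the dataset description
labels = [c for c in range(16) for _ in range(3)]
folds = stratified_folds(labels, n_splits=3)

tested = 0
for f in range(3):
    test_idx = [i for i, fold in enumerate(folds) if fold == f]
    train_idx = [i for i, fold in enumerate(folds) if fold != f]
    per_class_train = Counter(labels[i] for i in train_idx)
    tested += len(test_idx)
    print(f"fold {f}: train={len(train_idx)}, test={len(test_idx)}, "
          f"train samples per class={set(per_class_train.values())}")

print("samples tested across all folds:", tested)
```

Under this assignment every sample is tested exactly once and every training fold holds two samples per class, whereas the single split tests only 17 of the 48 samples.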

Valid Interpretations 
- Session classification: Can distinguish different walking sessions of known subjects
- Intra-subject variability: How consistent gait features are across recordings
- Feature discriminability: Features contain subject-specific information
- Pipeline correctness: Proper implementation without data leakage

Invalid Interpretations
- Biometric identification: Cannot identify new/unseen people
- Deployment readiness: Not ready for real-world identification system
- Person-independent performance: Doesn't test recognition of strangers
- Stable estimates: Single split with 1 test sample/class is unreliable
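
One way to put a number on this unreliability is a confidence interval for the accuracy. The sketch below computes a 95% Wilson score interval for 15 correct out of 17, treating the test predictions as independent trials; that independence is an assumption, and the interval is an illustration, not part of the original analysis.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

low, high = wilson_interval(15, 17)
print(f"accuracy 15/17 = {15 / 17:.3f}, 95% interval: {low:.2f} to {high:.2f}")
```

The interval spans roughly 0.66 to 0.97, so the 88.2% figure is compatible with a fairly wide range of true accuracies.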

Valid Applications
- Drift detection: Identifying changes in known subjects' gait over time
- Session authentication: Verifying same person across different sessions
- Feature engineering: Understanding which features work best
- Algorithm comparison: Comparing different models on same task

Invalid Applications
- New person identification: Cannot recognize people not in training set
- Security/access control: Not suitable for biometric authentication
- Forensic applications: Cannot identify unknown individuals
- Medical diagnosis: Cannot generalize to new patient populations

Key Lessons
- This approach correctly implements ML methodology (no data leakage, proper pipeline), but is inferior to StratifiedKFold for tiny datasets due to:
    - Unstable evaluation (1 test sample per class)
    - Data waste (17 samples only for testing)
    - High variance (single random split)
    - Complete failure on some subjects (0% precision/recall)

- Both approaches share a fundamental limitation: 
    - Neither is person-independent. 
    - Both test on different samples from subjects already seen during training.

For this dataset: 
- StratifiedKFold with pipeline oversampling is the better approach because it maximizes data usage AND provides more stable estimates
- 88.2% accuracy shows features are discriminative, but represents within-subject recognition (different sessions of same people), not cross-subject identification (recognizing new people). 
- The unstable results for a few classes (complete failures) confirm that 3 samples/class is insufficient for reliable train/test split evaluation.

Summary
- Sound methodologically for testing intra-subject discrimination (can features distinguish known subjects?)
- Not sound for inter-subject generalization (can the model identify unseen individuals?).
- It validates feature discriminability and pattern consistency, but not real-world generalization.



In [None]:
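# Imports: scikit-learn for splitting, modelling and metrics; imbalanced-learn for oversampling
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from imblearn.pipeline import Pipeline  # imblearn's Pipeline, which accepts sampler steps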
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from imblearn.over_sampling import RandomOverSampler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd
import numpy as np
import pprint

import warnings
warnings.filterwarnings("ignore")

In [48]:
# Load the gait features; 'Subject_ID_Y' holds the subject label (16 classes)
df_new = pd.read_csv("gait_final_output_updated.csv")

X = df_new.drop(columns=['Subject_ID_Y'])
y = df_new['Subject_ID_Y']

# Stratified 65/35 split so that every subject appears in both train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.35,
    stratify=y,
    random_state=42
)

accuracy = {}

# Number of features kept by SelectKBest in each run
k_vals = [2, 3, 4, 5, 10, 15]
for k in k_vals:
    # The sampler step runs during fit only, so the test set is never resampled.
    # RandomOverSampler and mutual_info_classif are unseeded here,
    # so scores can change from run to run.
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("oversample", RandomOverSampler()),
        ("feature_selector", SelectKBest(mutual_info_classif, k=k)),
        ("clf", SVC(kernel='rbf'))
    ])

    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)

    # Keep the accuracy and the per-class report for this k
    accuracy[k] = {
        'score': accuracy_score(y_test, y_pred),
        'report': classification_report(y_test, y_pred, zero_division=0),
    }

# Print the accuracy for every k and collect the scores in k_vals order
best_acc = []
for i, k in enumerate(k_vals):
    print(f"i:{i}, k:{k}, accuracy:{accuracy[k]['score']}")
    best_acc.append(accuracy[k]['score'])

# Index of the best k (chosen by test-set accuracy)
best_idx = np.argmax(best_acc)
print("best_idx:", best_idx)

print("="*50)
print(f"Classification report for best accuracy at k={k_vals[best_idx]}, accuracy:{accuracy[k_vals[best_idx]]['score']}")
# pprint shows the report as quoted string fragments; print() would render a plain table
pprint.pprint(accuracy[k_vals[best_idx]]['report'])
print("="*50)


i:0, k:2, accuracy:0.47058823529411764
i:1, k:3, accuracy:0.4117647058823529
i:2, k:4, accuracy:0.47058823529411764
i:3, k:5, accuracy:0.6470588235294118
i:4, k:10, accuracy:0.8823529411764706
i:5, k:15, accuracy:0.7058823529411765
best_idx: 4
Classification report for best accuracy at k=10, accuracy:0.8823529411764706
('              precision    recall  f1-score   support\n'
 '\n'
 '           0       1.00      1.00      1.00         1\n'
 '           1       0.00      0.00      0.00         1\n'
 '           2       1.00      1.00      1.00         1\n'
 '           3       1.00      1.00      1.00         1\n'
 '           4       0.50      1.00      0.67         1\n'
 '           5       1.00      1.00      1.00         1\n'
 '           6       1.00      1.00      1.00         2\n'
 '           7       0.50      1.00      0.67         1\n'
 '           8       1.00      1.00      1.00         1\n'
 '           9       1.00      1.00      1.00         1\n'
 '          10       1.0

CONCLUSION:
- (StratifiedKFold + no data split) is BETTER than (no StratifiedKFold + data split)
    - This is because in the latter we see precision/recall = 0.0 for some classes
- With k=10, the features chosen by SelectKBest performed best in this run (88.2% accuracy).