GAIT ANALYSIS (Oversampling INSIDE PIPELINE + StratifiedKFold + No Train/Test Split)

Data Credits: https://archive.ics.uci.edu/dataset/604/gait+classification

Dataset
- Size: 48 samples (16 subjects, 3 samples per subject; each subject is one class)
- No train/test split: All 48 samples used for cross-validation
- CV Strategy: StratifiedKFold (n_splits=2)

Key Strategy: Oversampling inside pipeline
- StratifiedKFold: Splits data into 2 folds while preserving class distribution
- Each fold:
    - Oversampling applied only to training fold (via pipeline)
    - Test fold remains original, unmodified data
- No oversampling leakage: Duplicated samples created by RandomOverSampler never appear in test sets
- Pipeline Components:
    - StandardScaler
    - RandomOverSampler: Class balancing (inside pipeline, no leakage)
    - SelectKBest: Feature selection (tested with k = 2, 3, 4, 5, 10, 15)
    - SVC: RBF kernel classifier
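
The per-fold behaviour described above can be sketched without any libraries. This is a minimal illustration, not the notebook's code: the `oversample` helper is a hypothetical hand-rolled stand-in for RandomOverSampler, and the 2-fold split is built by hand (each subject contributes two samples to one fold and one to the other) rather than by StratifiedKFold.

```python
import random
from collections import Counter

def oversample(labels, indices, rng):
    """Hypothetical stand-in for RandomOverSampler: duplicate indices of
    smaller classes until every class matches the largest class."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    out = []
    for members in by_class.values():
        out.extend(members)
        out.extend(rng.choice(members) for _ in range(target - len(members)))
    return out

rng = random.Random(0)
labels = [s for s in range(16) for _ in range(3)]  # 16 subjects x 3 samples

# Hand-made stratified 2-fold split (assumption: mimics the 2/1 allocation
# that a stratified split produces with 3 samples per class).
folds = [[], []]
for subject in range(16):
    idx = [i for i, s in enumerate(labels) if s == subject]
    rng.shuffle(idx)
    big = subject % 2  # alternate which fold receives two samples
    folds[big].extend(idx[:2])
    folds[1 - big].append(idx[2])

results = []
for test_fold in (0, 1):
    test_idx = folds[test_fold]
    train_idx = folds[1 - test_fold]
    balanced_train = oversample(labels, train_idx, rng)  # training fold only
    results.append({
        "train_size_before": len(train_idx),
        "train_size_after": len(balanced_train),
        "test_size": len(test_idx),
        "overlap": len(set(balanced_train) & set(test_idx)),
        "train_class_counts": Counter(labels[i] for i in balanced_train),
    })

for r in results:
    print(r["train_size_before"], r["train_size_after"], r["test_size"], r["overlap"])
```

The duplicates are drawn only from the training indices, so the test fold keeps its original size and shares no sample with the balanced training set.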
  
Why StratifiedKFold Works Better (vs. Person-Independent)
- StratifiedKFold ensures balanced class distribution in each fold
- Each fold contains at least 1 sample per subject
- Model sees all subjects during training (1–2 samples per subject in each fold)
- Oversampling equalizes the number of training samples per subject in each fold (it duplicates existing samples and adds no new information)
- Model learns subject-specific patterns since all subjects are represented
- No data leakage: Oversampling occurs after split, inside pipeline

Consequences:
- Model learns from every subject in every fold
- Evaluation reflects intra-subject discrimination (can it distinguish known subjects?)

Upside:
- All subjects appear in training
- Oversampling can balance every class, since each subject has at least one sample in every training fold
- Model can learn subject-specific patterns

Downside: Not person-independent
- Person-independent: Train on subjects 1–12 -> Test on 13–16 (unseen people)
- This setup: Train on all 16 subjects -> Test on different samples from the same 16 subjects
- Problem: Subject leakage across folds
- Model learns to recognize known individuals, not new ones
- Overly optimistic accuracy due to familiarity with subjects
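
The difference between the two setups can be sketched with plain index lists. This is an illustrative example only: the subject numbering 1–16 and the fixed "last sample of each subject goes to test" rule are assumptions made for brevity, not the StratifiedKFold allocation used in the notebook.

```python
# 16 subjects x 3 samples; labels[i] is the subject of sample i.
labels = [s for s in range(1, 17) for _ in range(3)]

def shared_subjects(train_idx, test_idx):
    """Subjects that appear on both sides of a split."""
    return {labels[i] for i in train_idx} & {labels[i] for i in test_idx}

# Person-independent: train on subjects 1-12, test on subjects 13-16.
train_subjects = set(range(1, 13))
pi_train = [i for i, s in enumerate(labels) if s in train_subjects]
pi_test = [i for i, s in enumerate(labels) if s not in train_subjects]

# This setup (simplified): every subject contributes samples to both sides.
within_train = [i for i in range(len(labels)) if i % 3 != 2]
within_test = [i for i in range(len(labels)) if i % 3 == 2]

print("person-independent, shared subjects:", len(shared_subjects(pi_train, pi_test)))
print("within-subject, shared subjects:", len(shared_subjects(within_train, within_test)))
```

In the person-independent split no subject is shared between training and test; in the within-subject split all 16 are.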

Valid Interpretations
- Features can distinguish between known subjects
- Subject gait patterns are consistent across trials
- Classification is possible given sufficient data
- Feature selection identifies the most discriminative features among the tested values of k
- Methodology is correct for this question (no oversampling leakage into test folds)

Invalid Interpretations
- Generalization to new people: Cannot identify unseen individuals
- Biometric deployment: Not ready for real-world identification
- Person-independent recognition: Does not test recognition of strangers
- Cross-population performance: Only validates within 16 subjects

Conclusion
- Sound methodology for testing intra-subject discrimination (can features distinguish known subjects?)
- Not sound for inter-subject generalization (can the model identify unseen individuals?)
- It validates feature discriminability and pattern consistency, but not real-world generalization.

Other notes
- Test samples here are NEVER oversampled
- Test samples are ORIGINAL data, not oversampled copies
- Oversampled copies are duplicates of that fold's training samples only, so no test sample has a copy in training
- No Train/Test Split ≠ Data Leakage
- Question: If there's no train/test split, isn't everything mixed together?
- Answer: No, because StratifiedKFold CREATES splits internally
          There is no permanent train/test split here
          But StratifiedKFold creates temporary train/test splits for each fold
          Oversampling happens WITHIN each fold (only on that fold's training data)
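
The temporary splits can be inspected directly. The sketch below uses placeholder data with the same shape as the dataset (16 subjects, 3 samples each); the all-zero feature matrix is an assumption for illustration, since StratifiedKFold only needs the labels to build its folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.repeat(np.arange(16), 3)   # 16 subjects x 3 samples
X = np.zeros((y.size, 1))         # placeholder features (not the real data)

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

tested = []
train_counts_per_fold = []
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y)):
    # Within a fold, training and test indices never overlap.
    assert np.intersect1d(train_idx, test_idx).size == 0
    tested.extend(test_idx.tolist())
    counts = np.bincount(y[train_idx], minlength=16)
    train_counts_per_fold.append(counts)
    print(f"fold {fold}: train={train_idx.size}, test={test_idx.size}, "
          f"train samples per subject: min={counts.min()}, max={counts.max()}")

# Across the two folds, every sample is used for testing exactly once.
every_sample_tested_once = sorted(tested) == list(range(y.size))
print("every sample tested once:", every_sample_tested_once)
```

Each fold is a separate train/test split, and an oversampler placed inside the pipeline only ever sees `train_idx` for that fold.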

In [37]:
# Import libraries
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from imblearn.pipeline import Pipeline  # imblearn's Pipeline applies samplers during fit only
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from imblearn.over_sampling import RandomOverSampler
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import pandas as pd
import numpy as np
import pprint

import warnings
warnings.filterwarnings("ignore")

In [46]:
# Load and identify X and y
df_new = pd.read_csv("gait_final_output_updated.csv")

X = df_new.drop(columns=['Subject_ID_Y'])
y = df_new['Subject_ID_Y']

cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)

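# Per-k results: accuracy[k] = {'score': ..., 'report': ...}
accuracy = {}

k_vals = [2, 3, 4, 5, 10, 15]
for k in k_vals:
    accuracy[k] = {}

    # The oversampler runs only when the pipeline is fitted on a training fold.
    # Note: RandomOverSampler and mutual_info_classif are unseeded here,
    # so scores can vary between runs.
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("oversample", RandomOverSampler()),
        ("feature_selector", SelectKBest(mutual_info_classif, k=k)),
        ("clf", SVC(kernel='rbf'))
    ])

    # Out-of-fold predictions: each sample is predicted by the model
    # that was trained on the other fold.
    y_pred = cross_val_predict(pipeline, X, y, cv=cv, n_jobs=-1)
    score = accuracy_score(y, y_pred)
    accuracy[k]['score'] = score
    accuracy[k]['report'] = classification_report(y, y_pred, zero_division=0)
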
# pprint.pprint("accuracy: ", accuracy)

best_acc = []
for i, k in enumerate(k_vals):
    print(f"i:{i}, k:{k}, accuracy:{accuracy[k]['score']}" )
    best_acc.append( accuracy[k]['score'] )

best_idx = np.argmax(best_acc)
print("best_idx:", best_idx)

print("="*50)
print(f"Classification report for best accuracy at k={ k_vals[best_idx] }, accuracy:{accuracy[ k_vals[best_idx] ]['score']}" )
pprint.pprint( accuracy[ k_vals[best_idx] ]['report'] )
print("="*50)

i:0, k:2, accuracy:0.4375
i:1, k:3, accuracy:0.6041666666666666
i:2, k:4, accuracy:0.5416666666666666
i:3, k:5, accuracy:0.6041666666666666
i:4, k:10, accuracy:0.75
i:5, k:15, accuracy:0.7083333333333334
best_idx: 4
Classification report for best accuracy at k=10, accuracy:0.75
('              precision    recall  f1-score   support\n'
 '\n'
 '           0       0.60      1.00      0.75         3\n'
 '           1       0.67      0.67      0.67         3\n'
 '           2       0.40      0.67      0.50         3\n'
 '           3       0.50      0.33      0.40         3\n'
 '           4       1.00      1.00      1.00         3\n'
 '           5       0.67      0.67      0.67         3\n'
 '           6       0.75      1.00      0.86         3\n'
 '           7       1.00      0.33      0.50         3\n'
 '           8       0.50      0.33      0.40         3\n'
 '           9       1.00      1.00      1.00         3\n'
 '          10       1.00      0.67      0.80         3\n'
 '     