GAIT ANALYSIS (Leave-One-Out cross-validation (LOOCV) with pre-CV oversampling, without train/test split)

Aim: Multi-class classification (16 subjects), with oversampling applied before the data is split into cross-validation folds.
- Can features distinguish between subjects?

Data Credits: https://archive.ics.uci.edu/dataset/604/gait+classification

Dataset:
- Before Oversampling: 48 samples, 16 classes (3/class)
- After Oversampling:  480 samples (30/class)
- CV:                  480 folds   (1/oversampled sample)
  
Approach:
- 1. Oversampling: Apply SMOTE (falling back to RandomOverSampler if any class has fewer than 3 samples) to balance all classes to 30 samples each.
- 2. CV:           Leave-One-Out CV on the already oversampled dataset (train on 479 samples / test on 1 sample)
- 3. Pipeline:     StandardScaler -> SelectKBest/k=5 -> SVM/RBF kernel (rather than KNeighbors)

Why?
- With extremely small data, LOOCV maximizes the use of available data: each fold trains on 479 of the 480 samples.
- LOOCV provides a nearly unbiased estimate of model performance on the available data, provided the held-out sample is independent of the training set.
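
The fold arithmetic can be checked with a minimal sketch. The array `X_demo` is a hypothetical placeholder with 480 rows standing in for the oversampled dataset; only `LeaveOneOut` is taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Hypothetical placeholder: 480 rows stand in for the oversampled dataset.
X_demo = np.zeros((480, 1))

loo = LeaveOneOut()
n_folds = loo.get_n_splits(X_demo)             # one fold per sample
train_idx, test_idx = next(loo.split(X_demo))  # indices of the first fold

print(f"folds: {n_folds}, train size: {len(train_idx)}, test size: {len(test_idx)}")
```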

Upside:
- Near-perfect accuracy (99.4%) suggests strong feature discriminability, but only for subjects already seen in training.
- Almost all classes achieve perfect precision and recall.
- Misclassifications are confined to one or two classes (for example, class 12 has recall 0.90).
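
As a rough consistency check on these figures, the sketch below back-calculates error counts from the rounded values in the classification report (accuracy 0.9938 over 480 samples, recall 0.90 for class 12 with support 30). The variable names are illustrative, and the counts are inferred from rounded figures rather than read from a confusion matrix.

```python
# Values copied from the reported results (rounded in the report).
n_samples = 480
support_per_class = 30
reported_accuracy = 0.9938
class_12_recall = 0.90

# Back out integer counts from the rounded figures.
n_correct = round(reported_accuracy * n_samples)
n_errors = n_samples - n_correct
class_12_errors = support_per_class - round(class_12_recall * support_per_class)

print(f"total misclassified: {n_errors}")
print(f"class 12 misclassified: {class_12_errors}")
```

Under that rounding assumption both counts come out to 3, which would mean every misclassified sample belongs to class 12.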

Downside (data leakage due to pre-CV oversampling):
- Oversampling generates new samples from existing data: SMOTE interpolates between same-class neighbors, while RandomOverSampler duplicates existing samples
- When oversampling before CV, synthetic samples derived from the same original sample can appear in both training and test sets
- In each CV fold, the test sample may be nearly identical to synthetic samples in the training set
- The model is tested on variations of data it has already seen during training
- The model can therefore score well by memorizing the original samples (overfitting) rather than generalizing, which inflates accuracy
- Results are severely over-optimistic and do not reflect true generalization ability
- The fundamental assumption of independent test samples is violated
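
The effect described above can be illustrated with a small, self-contained sketch. It uses pure-noise features (so there is nothing real to learn) and a hand-written 1-nearest-neighbour classifier instead of the notebook's SVM pipeline. The oversampling step is plain deterministic duplication, a simplification of what RandomOverSampler does. All names and data here are hypothetical. Note that with 3 samples per class the notebook takes the SMOTE branch, whose interpolated points sit near the originals rather than on top of them, so the effect there is likely similar in kind but less extreme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 16 classes x 3 samples, 5 pure-noise features.
n_classes, n_per_class, n_features = 16, 3, 5
X = rng.normal(size=(n_classes * n_per_class, n_features))
y = np.repeat(np.arange(n_classes), n_per_class)


def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    # Pairwise squared Euclidean distances between all rows.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)  # the held-out sample cannot match itself
    nearest = d.argmin(axis=1)
    return float((y[nearest] == y).mean())


# Oversample BEFORE CV by duplication: every original row appears 10 times,
# so each held-out row has 9 exact copies left in the training set.
X_over = np.repeat(X, 10, axis=0)
y_over = np.repeat(y, 10)

acc_original = loo_1nn_accuracy(X, y)
acc_leaky = loo_1nn_accuracy(X_over, y_over)

print(f"LOO accuracy on original noise data:      {acc_original:.3f}")
print(f"LOO accuracy after pre-CV oversampling:   {acc_leaky:.3f}")
```

Because every held-out row has an exact copy in its training fold, the leaky score reaches 1.0 even though the features carry no information about the labels.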

Valid Interpretation:
- Features CAN distinguish between subjects when sufficient samples are available
- Proof of concept that the gait measurements contain subject-specific patterns
- Feature discriminability is confirmed
  
Invalid Interpretation:
- NOT real-world accuracy on unseen subjects
- NOT generalizable to new people
- NOT deployment-ready performance estimate

Summary:
- This approach is INVALID for true performance estimation (as explained in the 'Downside' section).
- The 99.4% accuracy does not represent real-world performance on unseen subjects.
- Correct approach: Oversample within each LOOCV fold (oversample training set only, never the test sample).
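
A minimal sketch of the fold-wise alternative, under several assumptions: the data is synthetic (the real CSV is not loaded), the helper `oversample_by_duplication` is a hypothetical NumPy stand-in for RandomOverSampler, and `mutual_info_classif` is given a fixed `random_state` for repeatability. SMOTE is not used here because, inside a fold, the held-out subject has only 2 training samples left, so `k_neighbors=2` would likely fail. With imbalanced-learn, placing the sampler as a step in `imblearn.pipeline.Pipeline` achieves the same thing, since samplers there are applied only during `fit`.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def oversample_by_duplication(X, y, target_per_class, rng):
    """Resample each class with replacement up to target_per_class rows."""
    keep = []
    for cls in np.unique(y):
        cls_idx = np.flatnonzero(y == cls)
        n_extra = target_per_class - len(cls_idx)
        extra = rng.choice(cls_idx, size=n_extra, replace=True)
        keep.append(np.concatenate([cls_idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]


# Hypothetical synthetic data: 16 subjects x 3 samples, 8 features.
rng = np.random.default_rng(42)
n_classes, n_per_class, n_features = 16, 3, 8
centers = rng.normal(scale=3.0, size=(n_classes, n_features))
y = np.repeat(np.arange(n_classes), n_per_class)
X = centers[y] + rng.normal(size=(len(y), n_features))

predictions = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    # Oversample the TRAINING rows only; the held-out row is never touched.
    X_train, y_train = oversample_by_duplication(
        X[train_idx], y[train_idx], target_per_class=30, rng=rng
    )
    model = Pipeline([
        ("scaler", StandardScaler()),
        ("feature_selector",
         SelectKBest(partial(mutual_info_classif, random_state=0), k=5)),
        ("clf", SVC(kernel="rbf", C=1.0, gamma="scale")),
    ])
    model.fit(X_train, y_train)
    predictions[test_idx] = model.predict(X[test_idx])

accuracy = float((predictions == y).mean())
print(f"LOO accuracy with fold-wise oversampling: {accuracy:.3f}")
```

The loop runs 48 folds (one per original sample) rather than 480, and the score on the real data would be expected to differ from the 99.4% reported with pre-CV oversampling.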

Conclusion:
- While this experiment confirms that the features can distinguish between subjects when given enough synthetic examples,
  the methodology fundamentally violates the principle of keeping training and test data independent.
  The near-perfect results are an artifact of data leakage, not genuine model performance.


In [None]:
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE, RandomOverSampler
from joblib import Parallel, delayed
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

In [3]:
print("="*70)
print("LEAVE-ONE-OUT CV ON OVERSAMPLED DATA")
print("="*70)

# Load data
df = pd.read_csv("gait_final_output_updated.csv")
X = df.drop("Subject_ID_Y", axis=1)
y = df["Subject_ID_Y"]

print(f"Original: {len(y)} samples, {len(y.unique())} classes")

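# Oversample FIRST (before CV); this ordering is the source of the leakage discussed above
target_per_class = 30  # target sample count per class after oversampling
sampling_strategy = {cls: target_per_class for cls in np.unique(y)}

min_samples = y.value_counts().min()
if min_samples >= 3:
    # SMOTE with k_neighbors=2 needs at least 3 samples in every class
    oversampler = SMOTE(random_state=42, k_neighbors=2,
                        sampling_strategy=sampling_strategy)
else:
    # Too few samples for SMOTE: fall back to duplicating existing samples
    oversampler = RandomOverSampler(random_state=42,
                                    sampling_strategy=sampling_strategy)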

X_resampled, y_resampled = oversampler.fit_resample(X, y)

print(f"Oversampled: {len(y_resampled)} samples")
print(f"Samples per class: {target_per_class}")

# Pipeline
pipeline = Pipeline([
    ("scaler", StandardScaler() ),
    ("feature_selector", SelectKBest(mutual_info_classif, k=5) ),  
    ("clf", SVC(kernel="rbf", C=1.0, gamma='scale', random_state=42) )
])

# Leave-One-Out CV
loo = LeaveOneOut()
predictions = []
actuals = []

print(f"\nRunning LOO CV on {len(y_resampled)} samples...")
print("(This may take a few minutes...)")

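from sklearn.base import clone

# Define a function for a single fold
def run_fold(train_idx, test_idx):
    """Fit a fresh copy of the pipeline on one LOO training split and predict the held-out sample."""
    X_train_fold = X_resampled.iloc[train_idx]
    X_test_fold = X_resampled.iloc[test_idx]
    y_train_fold = y_resampled.iloc[train_idx]
    y_test_fold = y_resampled.iloc[test_idx]

    # Clone so each fold fits its own unfitted copy instead of refitting the shared pipeline
    fold_pipeline = clone(pipeline)
    fold_pipeline.fit(X_train_fold, y_train_fold)

    # predict() returns an array with one element (e.g., [label]), since only one sample is tested per fold
    y_pred = fold_pipeline.predict(X_test_fold)
    return y_pred[0], y_test_fold.iloc[0]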


LEAVE-ONE-OUT CV ON OVERSAMPLED DATA
Original: 48 samples, 16 classes


In [7]:
# This line runs all LOOCV folds in parallel, each calling run_fold() 
# with the respective train/test indices, and collects their outputs.

results = Parallel(n_jobs=-1, verbose=10)( delayed(run_fold)(train_idx, test_idx) 
                                           for train_idx, test_idx in loo.split(X_resampled) )


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  2.9min
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  6.1min
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed:  7.6min
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:  8.7min
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed: 10.7min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed: 12.2min
[Parallel(n_jobs=-1)]: Done 105 tasks      | elapsed: 14.3min
[Parallel(n_jobs=-1)]: Done 120 tasks      | elapsed: 15.8min
[Parallel(n_jobs=-1)]: Done 137 tasks      | elapsed: 18.3min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 20.4min
[Parallel(n_jobs=-1)]: Done 173 tasks      | elapsed: 22

In [8]:
# Unpack results
predictions, actuals = zip(*results)

# Convert to lists
predictions, actuals = list(predictions), list(actuals)

# Results
accuracy = accuracy_score(actuals, predictions)

print("\n" + "="*70)
print("LOO CV RESULTS")
print("="*70)
print(f"Accuracy: {accuracy:.4f} ({accuracy*100:.1f}%)")
print("\nThis is an estimate of performance, but still affected by")
print("data leakage since we oversampled before CV.")

print("\nClassification Report:")
print(classification_report(actuals, predictions, zero_division=0))



LOO CV RESULTS
Accuracy: 0.9938 (99.4%)

This is an estimate of performance, but still affected by
data leakage since we oversampled before CV.

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        30
           1       1.00      1.00      1.00        30
           2       1.00      1.00      1.00        30
           3       1.00      1.00      1.00        30
           4       1.00      1.00      1.00        30
           5       1.00      1.00      1.00        30
           6       1.00      1.00      1.00        30
           7       1.00      1.00      1.00        30
           8       1.00      1.00      1.00        30
           9       1.00      1.00      1.00        30
          10       1.00      1.00      1.00        30
          11       1.00      1.00      1.00        30
          12       1.00      0.90      0.95        30
          13       1.00      1.00      1.00        30
          14       1