Gait Analysis (Train/Test Split with Proper Pipeline Oversampling)

Aim: Multi-class classification (16 subjects) using a train/test split with oversampling inside the pipeline to avoid data leakage.

Data Source: UCI Machine Learning Repository – Gait Classification Dataset

Dataset
- Original: 48 samples (16 subjects, 3 samples/class)
- Train Set: ~31 samples (65%) => ~2 samples per class
- Test Set: ~17 samples (35%) => ~1 sample per class
- Split: Stratified to ensure all classes are represented
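
The split sizes above can be checked with a little arithmetic. This is a minimal sketch, assuming scikit-learn's rule of rounding a fractional test_size up and giving the remainder to training; the variable names are illustrative only.

```python
import math

n_samples, n_classes, test_size = 48, 16, 0.35

n_test = math.ceil(test_size * n_samples)   # 0.35 * 48 = 16.8, rounded up
n_train = n_samples - n_test

# 17 test samples cannot be spread evenly over 16 classes,
# so this many classes contribute 2 test samples (and keep only 1 for training)
extra_test = n_test - n_classes

print(n_train, n_test, extra_test)
```

Under that assumption, one subject ends up with a single training sample, which is the only imbalance the oversampler has to correct.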
  
Key Strategy: Oversampling inside the pipeline (after split)
- Step 1: Split data (65/35 stratified)
- Step 2: Pipeline applies oversampling only to training data
- No data leakage: Test set remains original, unmodified

Pipeline Components
- StandardScaler: Feature normalization
- RandomOverSampler: Class balancing (applied only to training)
- SelectKBest: Feature selection (tested k = 2, 3, 4, 5, 10, 15)
- SVC: RBF kernel classifier

Results
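- Best Performance: k=10 features give the highest mean accuracy across the five tested splits (76.5%, std 5.3%); the best single-split accuracy is 88.2% (reached with k=4, 5, or 15, depending on the split)
- Observation: Complete failure on a few subjects (0% precision/recall)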

What This Approach Gets Right 
- No data leakage: Oversampling only on training data (via pipeline)
- Stratified split: All classes represented in train/test
- Feature selection: Tests multiple k values to find optimal subset
- Meaningful result: Up to 88.2% accuracy on a single split (76.5% mean for k=10 over five splits) with proper train/test separation
- Demonstrates correct ML workflow: Model learns discriminative features

Critical Limitations
- Not Person-Independent
    - Training: All 16 subjects (about 2 samples each)
    - Test: The same 16 subjects (the remaining sample, or two for one subject)
    - Problem: The model sees every subject during training, so it learns to recognize known individuals and cannot generalize to unseen people
- Extremely Unstable Evaluation
    - Only ~1 test sample per class -> one misclassification = 100% error for that class
    - A few subjects/classes: 0% precision/recall (complete failure)
    - High variance: With 17 test samples, each individual prediction moves overall accuracy by about 5.9 percentage points
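
The coarseness of the metric can be made concrete. This is a small sketch, assuming the 17-sample test set from the 65/35 split; it lists the accuracies that correspond to 11 through 15 correct predictions.

```python
n_test = 17  # test-set size from the 65/35 split of 48 samples

# Every achievable accuracy is a multiple of 1/17
step = 1 / n_test

# Correct-prediction counts and the accuracy each one produces
rows = [(correct, f"{correct / n_test:.3f}") for correct in range(11, 16)]
for correct, acc in rows:
    print(f"{correct}/{n_test} correct -> accuracy {acc}")
```

An accuracy of 0.882 therefore means 15 of 17 test samples were correct, and 0.706 means 12 of 17: a three-prediction difference.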

- Single Random Split Dependency
    - Results vary with the random split: the best per-split accuracy ranges from 70.6% (random_state=42) to 88.2% (random_state=123)
    - A single split gives no averaging, so the code below repeats the split with 5 random states
    - Performance stability cannot be assessed from one split alone

Comparison: StratifiedKFold vs Train/Test Split
- This approach uses a train/test split with a single random split 
- Train/test split is inferior to StratifiedKFold for tiny datasets because:
    - Data utilization: StratifiedKFold uses all 48 samples for both training and testing across folds, while this approach reserves 17 samples for testing only, leaving 31 for training
    - Stability: StratifiedKFold averages results across folds (more stable), while this approach has high variance due to a single split
    - Reliability: StratifiedKFold provides more robust estimates with multiple evaluation points

Valid Interpretations 
- Session classification: Can assign a new walking session to the correct known subject
- Intra-subject variability: How consistent gait features are across recordings
- Feature discriminability: Features contain subject-specific information
- Pipeline correctness: Proper implementation without data leakage

Invalid Interpretations
- Biometric identification: Cannot identify new/unseen people
- Deployment readiness: Not ready for real-world identification system
- Person-independent performance: Doesn't test recognition of strangers
- Stable estimates: Single split with 1 test sample/class is unreliable

Valid Applications
- Drift detection: Identifying changes in known subjects' gait over time
- Session authentication: Verifying same person across different sessions
- Feature engineering: Understanding which features work best
- Algorithm comparison: Comparing different models on same task

Invalid Applications
- New person identification: Cannot recognize people not in training set
- Security/access control: Not suitable for biometric authentication
- Forensic applications: Cannot identify unknown individuals
- Medical diagnosis: Cannot generalize to new patient populations

Key Lessons
- This approach correctly implements ML methodology (no data leakage, proper pipeline), but is inferior to StratifiedKFold for tiny datasets due to:
    - Unstable evaluation (1 test sample per class)
    - Data waste (17 samples only for testing)
    - High variance (single random split)
    - Complete failure on some subjects (0% precision/recall)

- Both approaches share a fundamental limitation: 
    - Neither is person-independent. 
    - Both test on different samples from subjects already seen during training.

For this dataset: 
- StratifiedKFold with pipeline oversampling is the better approach because it maximizes data usage AND provides more stable estimates
- Accuracy of up to 88.2% shows the features are discriminative, but it represents within-subject recognition (different sessions of the same people), not cross-subject identification (recognizing new people).
- The unstable results for a few classes (complete failures) confirm that 3 samples/class is insufficient for reliable train/test split evaluation.

Summary
- Sound methodologically for testing intra-subject discrimination (can features distinguish known subjects?)
- Not sound for inter-subject generalization (can the model identify unseen individuals?).
- It validates feature discriminability and pattern consistency, but not real-world generalization.



In [31]:
# Import libraries
from collections import Counter
import pprint
import warnings

import numpy as np
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline  # imblearn's Pipeline accepts sampler steps
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_predict,
    cross_val_score,
    train_test_split,
)
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [32]:
# Tests multiple train/test splits (5 different random states)
# Finds which k value is most stable across splits (either most frequently best, or highest mean accuracy)
# Uses that k value to train final model with random_state=42
# Extracts feature names for that final model

df_new = pd.read_csv("gait_final_output_updated.csv")
X = df_new.drop(columns=['Subject_ID_Y'])
y = df_new['Subject_ID_Y']

k_vals = [2, 3, 4, 5, 10, 15]
random_states = [42, 123, 456, 789, 101112]  # Test 5 different splits

# Store results across all random states
all_results = {k: [] for k in k_vals}
best_k_per_state = []

for rs in random_states:
    print(f"\n{'='*50}")
    print(f"Testing with random_state={rs}")
    print(f"{'='*50}")
    
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, 
        test_size=0.35,
        stratify=y,
        random_state=rs
    )
    
    accuracy = {}
    
    for k in k_vals:
        pipeline = Pipeline([
            ("scaler", StandardScaler()),
            ("oversample", RandomOverSampler(random_state=42)),
            ("feature_selector", SelectKBest(mutual_info_classif, k=k)),
            ("clf", SVC(kernel='rbf', random_state=42))
        ])
        
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)    
        score = accuracy_score(y_test, y_pred)
        
        accuracy[k] = {
            'score': score,
            'report': classification_report(y_test, y_pred, zero_division=0),
            'pipeline': pipeline
        }
        
        all_results[k].append(score)
    
    # Find best k for this random state (on ties, max() keeps the first, i.e. smallest, k)
    best_k_this_state = max(accuracy.keys(), key=lambda k: accuracy[k]['score'])
    best_k_per_state.append(best_k_this_state)
    print(f"Best k for this split: {best_k_this_state} (accuracy: {accuracy[best_k_this_state]['score']:.3f})")




Testing with random_state=42
Best k for this split: 10 (accuracy: 0.706)

Testing with random_state=123
Best k for this split: 15 (accuracy: 0.882)

Testing with random_state=456
Best k for this split: 10 (accuracy: 0.824)

Testing with random_state=789
Best k for this split: 4 (accuracy: 0.824)

Testing with random_state=101112
Best k for this split: 4 (accuracy: 0.882)


In [33]:
# Analyze results across all random states
print("\n" + "="*50)
print("SUMMARY ACROSS ALL RANDOM STATES")
print("="*50)

for k in k_vals:
    scores = all_results[k]
    print(f"k={k}: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}, scores={[f'{s:.3f}' for s in scores]}")

# Find most frequently selected best k
k_frequency = Counter(best_k_per_state)
most_common_k = k_frequency.most_common(1)[0][0]
print(f"\nMost frequently best k: {most_common_k} (selected {k_frequency[most_common_k]}/{len(random_states)} times)")

# Find k with highest mean accuracy
best_mean_k = max(k_vals, key=lambda k: np.mean(all_results[k]))
print(f"k with highest mean accuracy: {best_mean_k} (mean={np.mean(all_results[best_mean_k]):.3f})")

# Use the k with highest mean accuracy
final_k = best_mean_k

print("\n" + "="*50)
print(f"FINAL RECOMMENDATION: k={final_k}")
print("="*50)

# Now get features for the final k using the original random_state=42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.35,
    stratify=y,
    random_state=42
)

final_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("oversample", RandomOverSampler(random_state=42)),
    ("feature_selector", SelectKBest(mutual_info_classif, k=final_k)),
    ("clf", SVC(kernel='rbf', random_state=42))
])

final_pipeline.fit(X_train, y_train)
y_pred = final_pipeline.predict(X_test)

# Get selected features
selected_features_mask = final_pipeline.named_steps['feature_selector'].get_support()
selected_feature_names = X.columns[selected_features_mask].tolist()

print(f"\nFinal model performance (random_state=42):")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"\nTop {final_k} features selected:")
for i, feature in enumerate(selected_feature_names, 1):
    print(f"  {i}. {feature}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=0))


SUMMARY ACROSS ALL RANDOM STATES
k=2: mean=0.529, std=0.153, scores=['0.412', '0.588', '0.294', '0.706', '0.647']
k=3: mean=0.471, std=0.207, scores=['0.118', '0.529', '0.471', '0.471', '0.765']
k=4: mean=0.729, std=0.109, scores=['0.647', '0.588', '0.706', '0.824', '0.882']
k=5: mean=0.718, std=0.101, scores=['0.588', '0.647', '0.706', '0.765', '0.882']
k=10: mean=0.765, std=0.053, scores=['0.706', '0.824', '0.824', '0.706', '0.765']
k=15: mean=0.706, std=0.158, scores=['0.412', '0.882', '0.765', '0.706', '0.765']

Most frequently best k: 10 (selected 2/5 times)
k with highest mean accuracy: 10 (mean=0.765)

FINAL RECOMMENDATION: k=10

Final model performance (random_state=42):
Accuracy: 0.647

Top 10 features selected:
  1. P33_R1
  2. P81_R1
  3. P83_R2
  4. Posture_R3
  5. Loading_R3
  6. P52_R3
  7. P53_R3
  8. P55_R3
  9. P66_R3
  10. P68_R3

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00         1
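
Note that the same split (random_state=42) with k=10 scored 0.706 in the sweep but 0.647 in the final fit. The scaler, oversampler and classifier are all deterministic or seeded, so the likely cause is mutual_info_classif, which adds small random noise to continuous features and is called here without a random_state. One way to make the selection repeatable is sketched below on synthetic data (the random features and the feature count of 20 are assumptions; wrapping the score function with functools.partial is a suggestion, not part of the original notebook).

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in: 16 subjects x 2 training samples, 20 made-up features
rng = np.random.default_rng(0)
y = np.repeat(np.arange(16), 2)
X = rng.normal(size=(32, 20))

# Fixing random_state makes the mutual-information estimate repeatable
seeded_score = partial(mutual_info_classif, random_state=42)

mask_a = SelectKBest(seeded_score, k=10).fit(X, y).get_support()
mask_b = SelectKBest(seeded_score, k=10).fit(X, y).get_support()
print(np.array_equal(mask_a, mask_b))
```

Passing seeded_score in place of mutual_info_classif in the pipeline's feature_selector step should make the sweep and the final fit select the same features for a given split.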
      

CONCLUSION: 
- (StratifiedKFold + no data split) is BETTER than (no StratifiedKFold + data split)
    - This is because in the latter we see precision/recall = 0.0 for some classes
- With SelectKBest, k=10 features performs most consistently across splits (highest mean accuracy 0.765, lowest std 0.053).