# Feature Engineering Pipeline

This notebook implements three versions of feature engineering for the Basketball Motion Capture Data Challenge. Each version builds upon insights derived from our `exhaustive_analysis.ipynb`.

## Setup
Ensure `utils.py` is in the same directory as this notebook; it provides `load_data`, `DT`, and `NUM_FRAMES`.

In [None]:
import pandas as pd
import numpy as np
import os
import sys
from sklearn.model_selection import train_test_split

# Add current directory to path to import utils
sys.path.append(os.getcwd())
from utils import load_data, DT, NUM_FRAMES

In [None]:
# Configuration
DATA_DIR = '../../data'  # Adjust if needed
OUTPUT_DIR = '../../data/processed'
VAL_SIZE = 0.2
RANDOM_SEED = 42
os.makedirs(OUTPUT_DIR, exist_ok=True)

## Load Data & Create Validation Split
We split the provided training data into a Training set (80%) and a Validation set (20%) to evaluate our models locally.

In [None]:
print("Loading Data...")
df_full, X_full, kp_cols = load_data(DATA_DIR, train=True)
df_test, X_test, _ = load_data(DATA_DIR, train=False)

# Validation Split
# A simple random split suffices because the test set contains the same participants (interpolation task)
train_idx, val_idx = train_test_split(np.arange(len(df_full)), test_size=VAL_SIZE, random_state=RANDOM_SEED)

df_train = df_full.iloc[train_idx].reset_index(drop=True)
X_train = X_full[train_idx]
df_val = df_full.iloc[val_idx].reset_index(drop=True)
X_val = X_full[val_idx]

print(f"Train Stats: {len(df_train)} shots")
print(f"Val Stats:   {len(df_val)} shots")
print(f"Test Stats:  {len(df_test)} shots")

---
## V1: The "Naive" Baseline

### Logic
This approach treats the time-series data as a single massive bag of statistics. We flatten the time dimension by computing global aggregations (`mean`, `std`, `min`, `max`) for every single keypoint axis.

### Source of Idea
Standard machine learning practice. Before building complex temporal models, we must establish a **lower bound** on performance. If a complex model cannot beat this simple statistical aggregation, the complex model is flawed.

### Features
- **Input:** 240 Frames x 207 Coords
- **Output:** 207 Coords * 4 Stats = 828 Features
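
The flattening described above can be sketched in vectorized form. This is a minimal illustration on synthetic data, not the notebook's implementation: `aggregate_stats` and the random array are hypothetical, and the shapes (240 frames, 207 coordinates) are taken from the list above.

```python
import numpy as np

def aggregate_stats(X: np.ndarray) -> np.ndarray:
    """Collapse the time axis of X (shots, frames, coords) into 4 stats per coord.

    Hypothetical helper for illustration. Output columns are ordered
    coord0_mean, coord0_std, coord0_min, coord0_max, coord1_mean, ...
    """
    stats = np.stack(
        [
            np.nanmean(X, axis=1),
            np.nanstd(X, axis=1),
            np.nanmin(X, axis=1),
            np.nanmax(X, axis=1),
        ],
        axis=-1,
    )  # shape: (shots, coords, 4)
    return stats.reshape(X.shape[0], -1)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 240, 207))  # synthetic stand-in for real shots
feats = aggregate_stats(X_demo)
print(feats.shape)  # (5, 828)
```

Each shot ends up as one row of 828 values, so all information about when something happened within the shot is discarded. That is the limitation the later versions try to address.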

In [None]:
def extract_features_v1(X: np.ndarray, keypoint_cols: list) -> pd.DataFrame:
    """Global mean/std/min/max over all frames for every keypoint coordinate."""
    # Column order: <col>_mean, <col>_std, <col>_min, <col>_max for each coordinate
    stats = ['mean', 'std', 'min', 'max']
    feature_names = [f"{col}_{stat}" for col in keypoint_cols for stat in stats]
    features = []

    print(f"Extracting V1 features for {X.shape[0]} shots...")
    for i in range(X.shape[0]):
        # NaN-aware aggregation over the time axis of a single shot
        means = np.nanmean(X[i], axis=0)
        stds = np.nanstd(X[i], axis=0)
        mins = np.nanmin(X[i], axis=0)
        maxs = np.nanmax(X[i], axis=0)

        # (4, coords) -> (coords, 4) -> flat, so the values line up with feature_names
        shot_feats = np.stack([means, stds, mins, maxs], axis=0).T.flatten()
        features.append(shot_feats)

    return pd.DataFrame(features, columns=feature_names)

# Generate V1
train_v1 = extract_features_v1(X_train, kp_cols)
val_v1 = extract_features_v1(X_val, kp_cols)
test_v1 = extract_features_v1(X_test, kp_cols)

# Save
os.makedirs(f"{OUTPUT_DIR}/v1", exist_ok=True)
pd.concat([train_v1, df_train[['angle', 'depth', 'left_right']]], axis=1).to_csv(f"{OUTPUT_DIR}/v1/train_v1.csv", index=False)
pd.concat([val_v1, df_val[['angle', 'depth', 'left_right']]], axis=1).to_csv(f"{OUTPUT_DIR}/v1/val_v1.csv", index=False)
test_v1.to_csv(f"{OUTPUT_DIR}/v1/test_v1.csv", index=False)
print("V1 Saved (Train, Val, Test).")

---
## V2: The "Windowed" Approach (Data-Driven)

### Logic
Instead of looking at the whole shot, we focus on the **critical moments**. A basketball shot is determined by the mechanics immediately preceding the release. We isolate this "Pre-Release" window and the "Follow-Through" separately.

### Source of Idea
Derived from **Level 3** of our `exhaustive_analysis.ipynb`. 
1.  **Why Windows?** The analysis showed large variance during the "setup" phase (dribbling), and that variance is noise.
2.  **Why Dynamic?** Players release at different times, so a hardcoded release frame such as "Frame 180" fails for fast shooters.
3.  **The Algorithm:** We reuse the detection logic validated in the analysis: the frame of **maximum wrist speed** (velocity magnitude), searched within frames 140-230, marks the moment the ball leaves the hand.
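
The detection idea in the list above can be sketched on a synthetic trajectory. This is only an illustration under stated assumptions: `peak_speed_frame` is a hypothetical helper, the wrist path is made up, and the frame interval `dt = 1/60` is assumed here rather than read from `utils.DT`.

```python
import numpy as np

def peak_speed_frame(pos: np.ndarray, dt: float, start: int, end: int) -> int:
    """Return the frame in [start, end) where the 3D speed of `pos` is largest."""
    vel = np.gradient(pos, dt, axis=0)      # finite-difference velocity, (frames, 3)
    speed = np.linalg.norm(vel, axis=1)     # speed magnitude per frame
    return int(np.nanargmax(speed[start:end]) + start)

# Synthetic wrist path: one coordinate follows a smooth step centred on
# frame 170, so the speed peaks there. dt is an assumed frame interval.
dt = 1.0 / 60.0
frames = np.arange(240)
z = np.tanh((frames - 170) / 10.0)
wrist = np.column_stack([np.zeros(240), np.zeros(240), z])

print(peak_speed_frame(wrist, dt, 140, 230))  # 170
```

Restricting the search to a window keeps early, fast movements such as dribbling from being mistaken for the release.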

### Features
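- **Dynamic Release Frame:** Calculated per shot as the `argmax` of right-wrist speed within frames 140-230, falling back to frame 180 if detection fails.
- **Pre-Release Stats:** Mean/Std of each coordinate over the 50 frames before release, i.e. [Release - 50, Release).
- **Release Snapshot:** Position and velocity of each coordinate at the release frame.
- **Follow-Through Stats:** Mean of each coordinate over [Release, Release + 40).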
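
The window slicing around the release frame can be sketched as follows. `split_windows` is a hypothetical helper that mirrors the slicing used in the code cell below (50 frames before release, 40 frames from release onward); the input array is synthetic.

```python
import numpy as np

def split_windows(shot: np.ndarray, release: int, pre: int = 50, post: int = 40):
    """Slice a (frames, coords) array into pre-release and follow-through windows.

    Windows are half-open and clamped to the array bounds:
    pre = [release - pre, release), post = [release, release + post).
    """
    n = shot.shape[0]
    pre_win = shot[max(0, release - pre):release]
    post_win = shot[release:min(n, release + post)]
    return pre_win, post_win

shot_demo = np.arange(240 * 2, dtype=float).reshape(240, 2)  # synthetic shot, 2 coords
pre_win, post_win = split_windows(shot_demo, release=220)
print(pre_win.shape, post_win.shape)  # (50, 2) (20, 2)
```

A late release shortens the follow-through window because of the clamping, so statistics from that window are computed over fewer frames for such shots.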

In [None]:
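def detect_release_frame(shot_data: np.ndarray, kp_cols: list) -> int:
    """Estimate the release frame as the frame of peak right-wrist speed.

    Searches frames 140-229 only and falls back to frame 180 if the wrist
    column is missing or the speed is NaN throughout the window.
    """
    default_frame = 180
    # Locate the wrist's x column; y and z are assumed to follow it directly
    wrist_start = next((i for i, col in enumerate(kp_cols) if 'right_wrist_x' in col), None)
    if wrist_start is None:
        return default_frame

    try:
        # Velocity Magnitude of Wrist
        wrist_pos = shot_data[:, wrist_start:wrist_start + 3]
        vel = np.gradient(wrist_pos, DT, axis=0)
        vel_mag = np.linalg.norm(vel, axis=1)

        # Search 140-230
        search_start, search_end = 140, 230
        window_vel = vel_mag[search_start:search_end]
        if len(window_vel) == 0:
            return default_frame

        return int(np.nanargmax(window_vel) + search_start)
    except (ValueError, IndexError):
        # e.g. nanargmax on an all-NaN window, or a malformed shot array
        return default_frame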

def extract_features_v2(X: np.ndarray, keypoint_cols: list) -> pd.DataFrame:
    """Windowed features around a per-shot release frame.

    Per coordinate: pre-release mean/std, position and velocity at release,
    and follow-through mean. The detected release frame is the last column.
    """
    features = []
    feature_names = []
    for col in keypoint_cols:
        feature_names.extend([f"{col}_pre_mean", f"{col}_pre_std", f"{col}_rel_pos", f"{col}_rel_vel", f"{col}_post_mean"])
    feature_names.append("release_frame")
    n_cols = len(keypoint_cols)

    print(f"Extracting V2 features for {X.shape[0]} shots...")
    for i in range(X.shape[0]):
        shot = X[i]
        rel_frame = detect_release_frame(shot, keypoint_cols)

        # Half-open windows: pre = [release - 50, release), post = [release, release + 40)
        pre_win = shot[max(0, rel_frame-50):rel_frame, :n_cols]
        post_win = shot[rel_frame:min(NUM_FRAMES, rel_frame+40), :n_cols]
        vel_full = np.gradient(shot, DT, axis=0)

        # Column-wise statistics; an empty window falls back to zeros
        p_mean = np.nanmean(pre_win, axis=0) if len(pre_win) > 0 else np.zeros(n_cols)
        p_std = np.nanstd(pre_win, axis=0) if len(pre_win) > 0 else np.zeros(n_cols)
        r_pos = shot[rel_frame, :n_cols]
        r_vel = vel_full[rel_frame, :n_cols]
        post_mean = np.nanmean(post_win, axis=0) if len(post_win) > 0 else np.zeros(n_cols)

        # Interleave so each coordinate contributes 5 consecutive values, matching feature_names
        row = np.stack([p_mean, p_std, r_pos, r_vel, post_mean], axis=1).ravel().tolist()
        row.append(rel_frame)
        features.append(row)

    return pd.DataFrame(features, columns=feature_names)

# Generate V2
train_v2 = extract_features_v2(X_train, kp_cols)
val_v2 = extract_features_v2(X_val, kp_cols)
test_v2 = extract_features_v2(X_test, kp_cols)

# Save
os.makedirs(f"{OUTPUT_DIR}/v2", exist_ok=True)
pd.concat([train_v2, df_train[['angle', 'depth', 'left_right']]], axis=1).to_csv(f"{OUTPUT_DIR}/v2/train_v2.csv", index=False)
pd.concat([val_v2, df_val[['angle', 'depth', 'left_right']]], axis=1).to_csv(f"{OUTPUT_DIR}/v2/val_v2.csv", index=False)
test_v2.to_csv(f"{OUTPUT_DIR}/v2/test_v2.csv", index=False)
print("V2 Saved (Train, Val, Test).")

---
## V3: The "Kinematic" Approach (Physics-Based)

### Logic
This version transforms raw, noisy 3D coordinates into **biomechanical signals**. A player's position on the court (X,Y) shouldn't matter; their **form** (Joint Angles) does. We calculate the "Kinetic Chain": the sequential extension of Knee -> Elbow -> Wrist.

### Source of Idea
Domain Knowledge & Physics. 
- **Joint Angles:** We expect the `angle` target to correlate with the elbow angle and the release angle. Raw X, Y, Z coordinates obscure this relationship.
- **Smoothness:** A common coaching metric is "fluidity". We quantify it with the jerk (the time derivative of acceleration) of the wrist.
- **Projectile Motion:** Under idealized, drag-free projectile motion, the `depth` target depends on release speed, release angle, and release height. We compute release speed and height explicitly, using the wrist as a proxy for the ball.
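
The projectile-motion point above can be made concrete with the textbook drag-free range formula. This is a sketch for intuition, not something taken from the analysis notebook. The symbols are introduced here for illustration: $v$ is release speed, $\theta$ is release angle above the horizontal, $\Delta h$ is rim height minus release height, and $g$ is gravitational acceleration.

$$
d = \frac{v\cos\theta}{g}\left(v\sin\theta + \sqrt{v^{2}\sin^{2}\theta - 2g\,\Delta h}\right)
$$

Here $d$ is the horizontal distance travelled when the ball comes back down to rim height. The $\Delta h$ term shows that release height matters alongside speed and angle, which is one reason to track the height at release.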

### Features
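- **Joint Angles:** Knee and elbow angles computed via vector geometry. The exported feature is the minimum knee angle (`min_knee_angle`); the elbow angle is computed but not yet exported.
- **Kinetic Timing:** Maximum knee extension velocity (`max_knee_ext_vel`), the peak rate of change of the knee angle.
- **Smoothness:** Mean jerk magnitude of the wrist (`smoothness`).
- **Release Physics:** Wrist speed (`release_speed`, in m/s if coordinates are in meters) and vertical height (`release_height`) at the frame of peak vertical wrist velocity.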
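
The two geometric building blocks behind these features, a joint angle from three points and jerk from repeated differentiation, can be sketched on toy data. The helper names `joint_angle_deg` and `mean_jerk`, the two-frame "leg", and the time step of 0.1 are all hypothetical and chosen only so the results are easy to check by hand.

```python
import numpy as np

def joint_angle_deg(a, b, c):
    """Angle at vertex b (degrees) for per-frame points a, b, c of shape (frames, 3)."""
    ba, bc = a - b, c - b
    cos = np.sum(ba * bc, axis=1) / (
        np.linalg.norm(ba, axis=1) * np.linalg.norm(bc, axis=1) + 1e-10
    )
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def mean_jerk(pos, dt):
    """Mean magnitude of the third time derivative of a (frames, 3) trajectory."""
    jerk = pos
    for _ in range(3):  # position -> velocity -> acceleration -> jerk
        jerk = np.gradient(jerk, dt, axis=0)
    return float(np.nanmean(np.linalg.norm(jerk, axis=1)))

# Toy two-frame "leg": frame 0 is bent at a right angle, frame 1 is straight.
hip = np.array([[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]])
knee = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
ankle = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
angles = joint_angle_deg(hip, knee, ankle)
print(np.round(angles))  # approximately 90 and 180 degrees

# A straight-line, constant-velocity path has (numerically) zero jerk.
t = np.arange(20)[:, None] * 0.1
line = np.hstack([t, 2 * t, np.zeros_like(t)])
print(mean_jerk(line, 0.1))  # close to 0
```

Because angles depend only on the relative positions of three joints, they are unchanged when the whole body is translated, which is the motivation for preferring them over raw coordinates.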

In [None]:
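def calculate_angle(a, b, c):
    """Per-frame angle in degrees at vertex b, for (frames, 3) arrays a, b, c."""
    ba = a - b
    bc = c - b
    # Row-wise dot product over the product of norms; epsilon avoids division by zero
    cosine = np.einsum('ij,ij->i', ba, bc) / (np.linalg.norm(ba, axis=1) * np.linalg.norm(bc, axis=1) + 1e-10)
    # Clip guards arccos against round-off slightly outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))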

def extract_features_v3(X: np.ndarray, keypoint_cols: list) -> pd.DataFrame:
    """Kinematic features: knee-angle metrics, wrist smoothness, and release speed/height."""
    def get_xyz(shot, name):
        # (frames, 3) trajectory of a joint; assumes its x, y, z columns are contiguous
        for i, c in enumerate(keypoint_cols):
            if f"{name}_x" in c:
                return shot[:, i:i+3]
        return None

    def angle_or_zeros(a, b, c):
        # Fall back to zeros when any of the three joints is missing
        if a is None or b is None or c is None:
            return np.zeros(NUM_FRAMES)
        return calculate_angle(a, b, c)

    features = []
    print(f"Extracting V3 features for {X.shape[0]} shots...")

    for i in range(X.shape[0]):
        shot = X[i]
        # Joints
        r_sh = get_xyz(shot, 'right_shoulder')
        r_el = get_xyz(shot, 'right_elbow')
        r_wr = get_xyz(shot, 'right_wrist')
        r_hip = get_xyz(shot, 'right_hip')
        r_kn = get_xyz(shot, 'right_knee')
        r_an = get_xyz(shot, 'right_ankle')

        # Angles (elbow_ang is computed but not yet exported as a feature)
        elbow_ang = angle_or_zeros(r_sh, r_el, r_wr)
        knee_ang = angle_or_zeros(r_hip, r_kn, r_an)

        # Metrics
        min_knee = np.nanmin(knee_ang)
        max_knee_ext_vel = np.nanmax(np.gradient(knee_ang, DT))

        # Smoothness and release physics; stays at zeros if the wrist is missing
        rel_params = [0, 0, 0]
        if r_wr is not None:
            wr_vel = np.gradient(r_wr, DT, axis=0)
            # Jerk = third time derivative of position
            wr_jerk = np.gradient(np.gradient(wr_vel, DT, axis=0), DT, axis=0)
            smoothness = np.nanmean(np.linalg.norm(wr_jerk, axis=1))

            # Release frame = frame of peak vertical wrist velocity (index 1 = Y,
            # assumed to be the vertical axis). Unlike V2, the whole shot is searched.
            try:
                max_v_frame = np.nanargmax(wr_vel[:, 1])
                rel_speed = np.linalg.norm(wr_vel[max_v_frame])
                rel_height = r_wr[max_v_frame, 1]
                rel_params = [smoothness, rel_speed, rel_height]
            except ValueError:
                # nanargmax raises when the vertical velocity is NaN throughout
                pass

        features.append([min_knee, max_knee_ext_vel] + rel_params)

    cols = ['min_knee_angle', 'max_knee_ext_vel', 'smoothness', 'release_speed', 'release_height']
    return pd.DataFrame(features, columns=cols)

# Generate V3
train_v3 = extract_features_v3(X_train, kp_cols)
val_v3 = extract_features_v3(X_val, kp_cols)
test_v3 = extract_features_v3(X_test, kp_cols)

# Save
os.makedirs(f"{OUTPUT_DIR}/v3", exist_ok=True)
pd.concat([train_v3, df_train[['angle', 'depth', 'left_right']]], axis=1).to_csv(f"{OUTPUT_DIR}/v3/train_v3.csv", index=False)
pd.concat([val_v3, df_val[['angle', 'depth', 'left_right']]], axis=1).to_csv(f"{OUTPUT_DIR}/v3/val_v3.csv", index=False)
test_v3.to_csv(f"{OUTPUT_DIR}/v3/test_v3.csv", index=False)
print("V3 Saved (Train, Val, Test).")