# Chapter 3: Data Transformation & Preprocessing
## Tennis Analysis - Missing Value Imputation and Data Smoothing

This notebook demonstrates the concepts from ML4QS Chapter 3 applied to tennis analysis:
- Missing value imputation using pandas interpolation
- Rolling mean smoothing (a simple form of low-pass filtering)
- Bounding box utilities: Data normalization and coordinate transformations

In [None]:
import sys
sys.path.append('../tennis_analysis-main')

import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import butter, filtfilt
from utils.bbox_utils import get_center_of_bbox, measure_distance

## 1. Load Raw Detection Data

Load the ball detection data that contains missing values (similar to sensor data with gaps).
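To see why frames without a detection show up as missing values, here is a minimal sketch on made-up data. It assumes each frame is a dict with the bounding box stored under key `1` (as in the cell below) and that an empty dict means no detection; `toy_detections` and its coordinates are hypothetical.

```python
import pandas as pd

# Hypothetical detections: one dict per frame, bounding box under key 1.
# An empty dict stands for a frame where nothing was detected (assumption).
toy_detections = [
    {1: [100.0, 200.0, 110.0, 210.0]},
    {},
    {1: [104.0, 196.0, 114.0, 206.0]},
]

# Frames without key 1 yield an empty list, which pandas pads with NaN.
toy_positions = [d.get(1, []) for d in toy_detections]
toy_df = pd.DataFrame(toy_positions, columns=['x1', 'y1', 'x2', 'y2'])

# Fraction of frames that carry a complete bounding box.
toy_detection_rate = toy_df.notna().all(axis=1).mean()

print(toy_df)
print(f"Detection rate: {toy_detection_rate:.2%}")
```

The middle row comes out as all NaN, which is the gap that the imputation step later fills.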

In [None]:
# Load ball detection data
with open('../tennis_analysis-main/tracker_stubs/ball_detections.pkl', 'rb') as f:
    ball_detections = pickle.load(f)

# Extract the bounding box stored under key 1; frames without it yield an empty list
ball_positions = [x.get(1, []) for x in ball_detections]

# Build a DataFrame; empty lists become rows of NaN
df_ball = pd.DataFrame(ball_positions, columns=['x1','y1','x2','y2'])

print(f"Total frames: {len(df_ball)}")
print(f"Missing values per column:")
print(df_ball.isnull().sum())
print(f"\nDetection rate: {(len(df_ball) - df_ball.isnull().any(axis=1).sum()) / len(df_ball):.2%}")

## 2. Missing Value Imputation

Apply interpolation techniques similar to ML4QS Chapter 3's imputation methods.
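As a small illustration of how linear interpolation and backward fill divide the work, the sketch below uses a made-up column with a leading gap and an interior gap. The values and the names `toy`, `toy_interp`, and `toy_filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up column: one leading NaN and a two-frame interior gap.
toy = pd.DataFrame({'x1': [np.nan, 10.0, np.nan, np.nan, 40.0]})

# Linear interpolation fills the interior gap. With default settings it
# does not fill NaNs that come before the first valid value.
toy_interp = toy.interpolate(method='linear')

# Backward fill copies the first valid value into the leading gap.
toy_filled = toy_interp.bfill()

print(toy_interp['x1'].tolist())
print(toy_filled['x1'].tolist())
```

The interior gap is filled with evenly spaced values (20 and 30), while the leading NaN survives interpolation and is only removed by `bfill()`.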

In [None]:
# Show raw data with missing values
print("Raw data (first 20 rows):")
print(df_ball.head(20))

# Apply interpolation - equivalent to ImputationMissingValues.py
df_ball_interpolated = df_ball.interpolate(method='linear')
# Backward fill the leading NaNs that remain before the first detection
df_ball_interpolated = df_ball_interpolated.bfill()

print("\nAfter interpolation:")
print(df_ball_interpolated.head(20))

print(f"\nMissing values after interpolation:")
print(df_ball_interpolated.isnull().sum())

## 3. Coordinate Transformations

Convert bounding box corner coordinates to center points and box dimensions.
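The conversion can be sketched on two made-up boxes as follows. The helper `add_center_and_size` is hypothetical and written only for this illustration; it is not the `get_center_of_bbox` function imported from `utils.bbox_utils`.

```python
import pandas as pd

def add_center_and_size(df):
    """Return a copy of df with center and size columns derived from x1, y1, x2, y2."""
    out = df.copy()
    out['center_x'] = (out['x1'] + out['x2']) / 2
    out['center_y'] = (out['y1'] + out['y2']) / 2
    out['width'] = out['x2'] - out['x1']
    out['height'] = out['y2'] - out['y1']
    return out

# Two made-up bounding boxes in pixel coordinates.
toy_boxes = pd.DataFrame({
    'x1': [100.0, 104.0], 'y1': [200.0, 196.0],
    'x2': [110.0, 116.0], 'y2': [210.0, 204.0],
})
toy_centers = add_center_and_size(toy_boxes)
print(toy_centers[['center_x', 'center_y', 'width', 'height']])
```

Working on a copy keeps the corner columns of the input unchanged, which makes it easy to compare the two representations side by side.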

In [None]:
# Calculate center coordinates from bounding boxes
df_ball_interpolated['center_x'] = (df_ball_interpolated['x1'] + df_ball_interpolated['x2']) / 2
df_ball_interpolated['center_y'] = (df_ball_interpolated['y1'] + df_ball_interpolated['y2']) / 2

# Calculate bounding box dimensions
df_ball_interpolated['width'] = df_ball_interpolated['x2'] - df_ball_interpolated['x1']
df_ball_interpolated['height'] = df_ball_interpolated['y2'] - df_ball_interpolated['y1']

print("Coordinate transformations complete:")
print(df_ball_interpolated[['center_x', 'center_y', 'width', 'height']].describe())

## 4. Data Normalization

Normalize coordinates to the [0, 1] range by dividing by the frame size, assumed here to be 1920×1080.
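Because the frame size is an assumption, a quick range check after normalizing can reveal a mismatch with the real video. The sketch below uses made-up pixel positions; `FRAME_WIDTH`, `FRAME_HEIGHT`, and `in_unit_range` are names chosen for this illustration.

```python
import pandas as pd

FRAME_WIDTH = 1920   # assumed frame width in pixels
FRAME_HEIGHT = 1080  # assumed frame height in pixels

# Made-up center positions at the corners and the middle of the frame.
toy_px = pd.DataFrame({
    'center_x': [0.0, 960.0, 1920.0],
    'center_y': [0.0, 540.0, 1080.0],
})

toy_norm = pd.DataFrame({
    'center_x_norm': toy_px['center_x'] / FRAME_WIDTH,
    'center_y_norm': toy_px['center_y'] / FRAME_HEIGHT,
})

# Values outside [0, 1] would suggest the assumed dimensions are wrong.
in_unit_range = bool(((toy_norm >= 0) & (toy_norm <= 1)).all().all())
print(toy_norm)
print(f"All values within [0, 1]: {in_unit_range}")
```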

In [None]:
# Assumed video dimensions (1080p); replace with the actual frame size of the source video
VIDEO_WIDTH = 1920
VIDEO_HEIGHT = 1080

# Normalize coordinates
df_ball_normalized = df_ball_interpolated.copy()
df_ball_normalized['center_x_norm'] = df_ball_normalized['center_x'] / VIDEO_WIDTH
df_ball_normalized['center_y_norm'] = df_ball_normalized['center_y'] / VIDEO_HEIGHT
df_ball_normalized['width_norm'] = df_ball_normalized['width'] / VIDEO_WIDTH
df_ball_normalized['height_norm'] = df_ball_normalized['height'] / VIDEO_HEIGHT

print("Normalized coordinates (0-1 range):")
print(df_ball_normalized[['center_x_norm', 'center_y_norm', 'width_norm', 'height_norm']].describe())

## 5. Low-Pass Filtering (Rolling Mean Smoothing)

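Apply a centered rolling mean, a simple moving-average smoother that attenuates frame-to-frame jitter. It acts as a crude low-pass filter; Section 6 applies a Butterworth filter with an explicit cutoff frequency.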
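The effect of a centered window with `min_periods=1` is easiest to see on a made-up signal with a single-frame spike. The window size of 3 and the name `toy_signal` are chosen only for this sketch.

```python
import pandas as pd

# Made-up signal: flat at zero with a one-frame spike in the middle.
toy_signal = pd.Series([0.0, 0.0, 10.0, 0.0, 0.0])

# center=True averages over neighbors on both sides of each frame;
# min_periods=1 lets the edge frames use a shorter window instead of NaN.
toy_smooth = toy_signal.rolling(window=3, min_periods=1, center=True).mean()

print(toy_smooth.round(3).tolist())
```

The spike is spread over the three frames whose windows contain it and its peak drops to a third of its height, while the edge frames stay at zero because their shortened windows contain only zeros.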

In [None]:
# Rolling mean smoothing (a simple form of low-pass filtering)
window_size = 5
df_ball_smoothed = df_ball_normalized.copy()

# Apply rolling mean to center coordinates
df_ball_smoothed['center_x_smooth'] = df_ball_smoothed['center_x'].rolling(window=window_size, min_periods=1, center=True).mean()
df_ball_smoothed['center_y_smooth'] = df_ball_smoothed['center_y'].rolling(window=window_size, min_periods=1, center=True).mean()

print(f"Applied rolling mean with window size: {window_size}")
print("Smoothed coordinates:")
print(df_ball_smoothed[['center_x', 'center_x_smooth', 'center_y', 'center_y_smooth']].head(10))

## 6. Butterworth Low-Pass Filter

Apply a Butterworth low-pass filter with an explicit cutoff frequency, similar to the low-pass filtering in ML4QS Chapter 3.
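To illustrate what the cutoff does, the sketch below filters a synthetic signal made of a slow 0.5 Hz component plus a fast 10 Hz component and checks how close the result is to the slow component alone. The signal, the 30 Hz sampling rate, and all variable names are assumptions made for this example.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 30.0      # assumed sampling rate in Hz (frames per second)
cutoff = 2.0   # cutoff frequency in Hz
t = np.arange(0, 10, 1 / fs)  # 10 seconds of samples

slow = np.sin(2 * np.pi * 0.5 * t)          # 0.5 Hz, below the cutoff
fast = 0.5 * np.sin(2 * np.pi * 10.0 * t)   # 10 Hz, above the cutoff
noisy = slow + fast

# butter expects the cutoff as a fraction of the Nyquist frequency (fs / 2).
b, a = butter(5, cutoff / (0.5 * fs), btype='low')

# filtfilt applies the filter forward and backward, giving zero phase lag.
recovered = filtfilt(b, a, noisy)

# Root-mean-square error against the slow component, before and after.
err_before = np.sqrt(np.mean((noisy - slow) ** 2))
err_after = np.sqrt(np.mean((recovered - slow) ** 2))
print(f"RMS error before: {err_before:.3f}, after: {err_after:.3f}")
```

The normalized cutoff must lie strictly between 0 and 1, so at 30 fps the cutoff has to stay below 15 Hz. The input also must not contain NaN values, which is why filtering comes after imputation.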

In [None]:
# Apply Butterworth low-pass filter
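def apply_lowpass_filter(data, sampling_freq=30, cutoff_freq=2, order=5):
    """
    Apply a zero-phase Butterworth low-pass filter.

    data: 1-D sequence of samples without NaNs
    sampling_freq: Sampling rate in Hz (video frame rate, assumed 30 fps)
    cutoff_freq: Cutoff frequency in Hz; must be below sampling_freq / 2
    order: Filter order
    """
    nyq = 0.5 * sampling_freq  # Nyquist frequency
    normalized_cutoff = cutoff_freq / nyq  # butter expects a fraction of Nyquist

    b, a = butter(order, normalized_cutoff, btype='low', analog=False)
    # filtfilt runs the filter forward and backward, so the output has no phase lag
    filtered_data = filtfilt(b, a, data)

    return filtered_data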

# Apply low-pass filter to coordinates
df_ball_smoothed['center_x_lowpass'] = apply_lowpass_filter(df_ball_smoothed['center_x'])
df_ball_smoothed['center_y_lowpass'] = apply_lowpass_filter(df_ball_smoothed['center_y'])

print("Applied Butterworth low-pass filter (cutoff: 2 Hz)")

## 7. Visualization of Preprocessing Effects

Compare raw, interpolated, smoothed, and filtered data.

In [None]:
# Plot preprocessing effects
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 10))

frames = range(len(df_ball_smoothed))

# df_ball holds only the corner columns, so derive the raw centers here (NaN gaps are kept)
raw_center_x = (df_ball['x1'] + df_ball['x2']) / 2
raw_center_y = (df_ball['y1'] + df_ball['y2']) / 2

# X coordinate preprocessing
ax1.plot(frames, raw_center_x, 'r-', alpha=0.7, label='Raw (with gaps)', linewidth=1)
ax1.plot(frames, df_ball_interpolated['center_x'], 'b-', alpha=0.8, label='Interpolated', linewidth=1)
ax1.set_title('X Coordinate: Raw vs Interpolated')
ax1.set_ylabel('X Position (pixels)')
ax1.legend()
ax1.grid(True)

# Y coordinate preprocessing
ax2.plot(frames, raw_center_y, 'r-', alpha=0.7, label='Raw (with gaps)', linewidth=1)
ax2.plot(frames, df_ball_interpolated['center_y'], 'b-', alpha=0.8, label='Interpolated', linewidth=1)
ax2.set_title('Y Coordinate: Raw vs Interpolated')
ax2.set_ylabel('Y Position (pixels)')
ax2.legend()
ax2.grid(True)

# Smoothing comparison (X)
ax3.plot(frames, df_ball_interpolated['center_x'], 'b-', alpha=0.5, label='Interpolated', linewidth=1)
ax3.plot(frames, df_ball_smoothed['center_x_smooth'], 'g-', label='Rolling Mean', linewidth=2)
ax3.plot(frames, df_ball_smoothed['center_x_lowpass'], 'orange', label='Low-pass Filter', linewidth=2)
ax3.set_title('X Coordinate: Smoothing Comparison')
ax3.set_ylabel('X Position (pixels)')
ax3.set_xlabel('Frame')
ax3.legend()
ax3.grid(True)

# Smoothing comparison (Y)
ax4.plot(frames, df_ball_interpolated['center_y'], 'b-', alpha=0.5, label='Interpolated', linewidth=1)
ax4.plot(frames, df_ball_smoothed['center_y_smooth'], 'g-', label='Rolling Mean', linewidth=2)
ax4.plot(frames, df_ball_smoothed['center_y_lowpass'], 'orange', label='Low-pass Filter', linewidth=2)
ax4.set_title('Y Coordinate: Smoothing Comparison')
ax4.set_ylabel('Y Position (pixels)')
ax4.set_xlabel('Frame')
ax4.legend()
ax4.grid(True)

plt.tight_layout()
plt.show()

## 8. Outlier Detection and Removal

Detect and handle outliers in ball position data.
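The velocity rule can be tried out on a made-up track before applying it to real data. The sketch below assumes a ball moving 2 pixels per frame with one spurious 200-pixel jump at frame 10; the track and all names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up track: steady motion of 2 px/frame with one bad frame.
toy_x = pd.Series(np.arange(100, dtype=float) * 2.0)
toy_x.iloc[10] += 200.0

# Absolute frame-to-frame displacement; the first frame is NaN.
step = toy_x.diff().abs()

# Same rule as in the cell below: mean + 3 * std of the displacement.
threshold = step.mean() + 3 * step.std()
flagged = step > threshold

# Blank out flagged frames and fill them by linear interpolation.
toy_clean = toy_x.mask(flagged).interpolate(method='linear')

print(flagged[flagged].index.tolist())  # [10, 11]
print(f"Threshold: {threshold:.2f} pixels/frame")
```

Frame 11 is flagged along with frame 10, because the jump back to the true position is as large as the jump away from it. In this toy case re-interpolation restores both frames exactly. Large jumps also inflate the mean and standard deviation, so on short tracks a single spike can raise the threshold enough to hide itself.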

In [None]:
# Simple outlier detection using velocity-based approach
def detect_velocity_outliers(df, position_col, threshold_multiplier=3):
    """
    Flag frames whose frame-to-frame displacement is unusually large.

    Returns a boolean Series of flagged frames and the threshold used.
    """
    # Absolute change in position between consecutive frames (pixels/frame)
    velocity = df[position_col].diff().abs()

    # Define threshold as mean + threshold_multiplier * std
    threshold = velocity.mean() + threshold_multiplier * velocity.std()

    # Mark outliers; the first frame has no predecessor (NaN) and is never flagged
    outliers = velocity > threshold

    return outliers, threshold

# Detect outliers in X and Y coordinates
outliers_x, threshold_x = detect_velocity_outliers(df_ball_interpolated, 'center_x')
outliers_y, threshold_y = detect_velocity_outliers(df_ball_interpolated, 'center_y')

# Combine outliers
outliers_combined = outliers_x | outliers_y

print(f"Detected {outliers_combined.sum()} outliers ({outliers_combined.sum()/len(df_ball_interpolated):.2%})")
print(f"X velocity threshold: {threshold_x:.2f} pixels/frame")
print(f"Y velocity threshold: {threshold_y:.2f} pixels/frame")

# Create cleaned dataset: set flagged centers to NaN, then fill them by linear interpolation
df_ball_cleaned = df_ball_interpolated.copy()
df_ball_cleaned.loc[outliers_combined, ['center_x', 'center_y']] = np.nan
df_ball_cleaned = df_ball_cleaned.interpolate(method='linear')

print("\nOutliers removed and re-interpolated")

## 9. Data Quality Metrics

Calculate metrics to assess preprocessing quality.
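One simple way to quantify smoothness is the standard deviation of the second difference, which is near zero for straight-line motion and grows with jitter. The sketch below compares a made-up straight track with a noisy copy; `second_difference_std` is a hypothetical helper name and the noise level is arbitrary.

```python
import numpy as np

def second_difference_std(values):
    """Standard deviation of the second difference; lower means smoother."""
    return np.std(np.diff(values, n=2))

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Straight-line motion has a second difference of (numerically) zero.
line = np.linspace(0.0, 100.0, 200)

# The same track with made-up Gaussian jitter of 1 pixel added.
jittery = line + rng.normal(0.0, 1.0, size=line.size)

smooth_score = second_difference_std(line)
jitter_score = second_difference_std(jittery)
print(f"Straight line: {smooth_score:.4f}, with jitter: {jitter_score:.4f}")
```

The metric rewards any reduction in curvature, so a filter that flattens genuine bounces or hits also scores well. It is best read alongside the plots rather than on its own.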

In [None]:
# Calculate preprocessing quality metrics
def calculate_smoothness(data):
    """Return the std of the second difference; lower values mean a smoother signal"""
    second_derivative = np.diff(data, n=2)  # discrete approximation of the second derivative
    return np.std(second_derivative)

# Compare smoothness before and after preprocessing
raw_smoothness_x = calculate_smoothness(df_ball_interpolated['center_x'].dropna())
filtered_smoothness_x = calculate_smoothness(df_ball_smoothed['center_x_lowpass'])

raw_smoothness_y = calculate_smoothness(df_ball_interpolated['center_y'].dropna())
filtered_smoothness_y = calculate_smoothness(df_ball_smoothed['center_y_lowpass'])

print("Data Quality Metrics:")
print(f"X Coordinate Smoothness:")
print(f"  Raw (interpolated): {raw_smoothness_x:.2f}")
print(f"  After filtering: {filtered_smoothness_x:.2f}")
print(f"  Improvement: {(raw_smoothness_x - filtered_smoothness_x)/raw_smoothness_x:.2%}")
print(f"\nY Coordinate Smoothness:")
print(f"  Raw (interpolated): {raw_smoothness_y:.2f}")
print(f"  After filtering: {filtered_smoothness_y:.2f}")
print(f"  Improvement: {(raw_smoothness_y - filtered_smoothness_y)/raw_smoothness_y:.2%}")

## 10. Export Preprocessed Data

Save the preprocessed data for use in subsequent chapters.

In [None]:
# Create final preprocessed dataset
df_final = pd.DataFrame({
    'frame': range(len(df_ball_smoothed)),
    'center_x_raw': df_ball_interpolated['center_x'],
    'center_y_raw': df_ball_interpolated['center_y'],
    'center_x_smooth': df_ball_smoothed['center_x_smooth'],
    'center_y_smooth': df_ball_smoothed['center_y_smooth'],
    'center_x_filtered': df_ball_smoothed['center_x_lowpass'],
    'center_y_filtered': df_ball_smoothed['center_y_lowpass'],
    'width': df_ball_interpolated['width'],
    'height': df_ball_interpolated['height']
})

# Save preprocessed data
df_final.to_csv('ball_positions_preprocessed.csv', index=False)
print("Preprocessed data saved to 'ball_positions_preprocessed.csv'")
print(f"Final dataset shape: {df_final.shape}")
print("\nFinal dataset summary:")
print(df_final.describe())

## Summary

This notebook demonstrated Chapter 3 concepts:

1. **Missing Value Imputation**: Used pandas interpolation to fill gaps in ball detection data
2. **Rolling Mean Smoothing**: Applied window-based smoothing, a simple form of low-pass filtering
3. **Butterworth Low-Pass Filter**: Applied a zero-phase Butterworth filter with a 2 Hz cutoff
4. **Coordinate Transformations**: Converted bounding boxes to center coordinates
5. **Data Normalization**: Scaled coordinates to normalized ranges
6. **Outlier Detection**: Identified and removed velocity-based outliers
7. **Quality Assessment**: Measured preprocessing effectiveness

The preprocessed ball position data is now ready for feature extraction and analysis in subsequent chapters.