# Step 5: Train/Test Split (Temporal)

**Goal:** Split data chronologically to avoid data leakage and simulate real-world prediction.

**Date:** 1/3/2026

**Split Strategy:**
- Training: 2010 - 2020 (older fights)
- Validation: 2021 - 2022 (recent fights for tuning)
- Test: 2023 - 2024 (hold-out set for final evaluation)

**Why temporal?**
- No data leakage (can't use future to predict past)
- Simulates real betting scenario
- Tests model generalization to new data
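The leakage argument above can be seen on a toy example. The following is a minimal sketch with made-up fights (the `toy` frame and its dates are hypothetical, not taken from the dataset): a cutoff-based split guarantees that every training row predates every test row.

```python
import pandas as pd

# Toy fight log (hypothetical data, for illustration only)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2019-05-01", "2020-08-15", "2021-03-20", "2023-07-08"]),
    "RedWon": [1, 0, 1, 1],
})

cutoff = pd.Timestamp("2020-12-31")

# Temporal split: fights up to the cutoff train, later fights test
toy_train = toy[toy["Date"] <= cutoff]
toy_test = toy[toy["Date"] > cutoff]

# Every training fight predates every test fight
no_leakage = toy_train["Date"].max() < toy_test["Date"].min()
print(no_leakage)  # True
```

A random shuffle gives no such guarantee, since later fights can land in the training set while earlier ones land in the test set.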

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load engineered features
df = pd.read_csv('../data/processed/ufc_engineered.csv')
df['Date'] = pd.to_datetime(df['Date'])

X = pd.read_csv('../data/processed/X_features.csv')
y = pd.read_csv('../data/processed/y_target.csv')['RedWon']  # Extract series

print(f"Loaded: {len(df)} fights")
print(f"X shape: {X.shape}")
print(f"y shape: {len(y)}")
print(f"Date range: {df['Date'].min().date()} to {df['Date'].max().date()}")

Loaded: 6290 fights
X shape: (6290, 24)
y shape: 6290
Date range: 2010-03-21 to 2024-12-07
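The split below builds boolean masks from `df` and applies them to `X` and `y`, which only works if all three objects are row-for-row aligned. A minimal sketch of a sanity check follows; `check_aligned` is a hypothetical helper, not part of the notebook, and it only verifies lengths and indexes, not that the rows truly describe the same fights.

```python
import pandas as pd

def check_aligned(df, X, y):
    """Raise if df, X and y cannot be treated as row-for-row aligned."""
    if not (len(df) == len(X) == len(y)):
        raise ValueError(f"Row counts differ: df={len(df)}, X={len(X)}, y={len(y)}")
    if not (df.index.equals(X.index) and df.index.equals(y.index)):
        raise ValueError("Indexes differ; boolean masks built on df would misalign")
    return True

# Small stand-ins for the real frames (hypothetical values)
df_demo = pd.DataFrame({"Date": pd.to_datetime(["2020-01-01", "2021-01-01", "2023-01-01"])})
X_demo = pd.DataFrame({"WinDif": [1, -2, 0]})
y_demo = pd.Series([1, 0, 1], name="RedWon")

print(check_aligned(df_demo, X_demo, y_demo))  # True
```

In the notebook this would be called as `check_aligned(df, X, y)` right after loading, assuming the three CSV files were written from the same frame in the same row order.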


In [5]:
# Create temporal train/validation/test split

# Define split dates
train_end = '2020-12-31'
val_end = '2022-12-31'

# Create boolean masks
train_mask = df['Date'] <= train_end
val_mask = (df['Date'] > train_end) & (df['Date'] <= val_end)
test_mask = df['Date'] > val_end

# Split X and y using these masks
X_train = X[train_mask].reset_index(drop=True)
y_train = y[train_mask].reset_index(drop=True)

X_val = X[val_mask].reset_index(drop=True)
y_val = y[val_mask].reset_index(drop=True)

X_test = X[test_mask].reset_index(drop=True)
y_test = y[test_mask].reset_index(drop=True)

# Print split summary
print("=== Data Split Summary ===\n")
print("Training set (2010-2020):")
print(f"  Samples: {len(X_train)}")
print(f"  Red wins: {y_train.sum()} ({y_train.mean()*100:.1f}%)")

print("\nValidation set (2021-2022):")
print(f"  Samples: {len(X_val)}")
print(f"  Red wins: {y_val.sum()} ({y_val.mean()*100:.1f}%)")

print("\nTest set (2023-2024):")
print(f"  Samples: {len(X_test)}")
print(f"  Red wins: {y_test.sum()} ({y_test.mean()*100:.1f}%)")

print(f"\n✓ Total: {len(X_train) + len(X_val) + len(X_test)} samples (matches {len(X)})")

=== Data Split Summary ===

Training set (2010-2020):
  Samples: 4524
  Red wins: 2651 (58.6%)

Validation set (2021-2022):
  Samples: 980
  Red wins: 567 (57.9%)

Test set (2023-2024):
  Samples: 786
  Red wins: 438 (55.7%)

✓ Total: 6290 samples (matches 6290)
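The mask logic in this cell can be wrapped in a small reusable function. This is a sketch under assumptions: `temporal_masks` is a hypothetical name, and the internal assertion is an addition that checks each row lands in exactly one split (it would fail on missing dates, which match none of the three conditions).

```python
import pandas as pd

def temporal_masks(dates, train_end, val_end):
    """Return (train, val, test) boolean masks for a datetime Series."""
    train_end = pd.Timestamp(train_end)
    val_end = pd.Timestamp(val_end)
    if train_end >= val_end:
        raise ValueError("train_end must be earlier than val_end")
    train = dates <= train_end
    val = (dates > train_end) & (dates <= val_end)
    test = dates > val_end
    # Each row must fall in exactly one split
    assert (train.astype(int) + val.astype(int) + test.astype(int)).eq(1).all()
    return train, val, test

# Hypothetical dates chosen to sit on and around the two cutoffs
dates = pd.Series(pd.to_datetime(
    ["2015-06-01", "2020-12-31", "2021-01-01", "2022-12-31", "2023-01-01"]
))
train, val, test = temporal_masks(dates, "2020-12-31", "2022-12-31")
print(train.sum(), val.sum(), test.sum())  # 2 2 1
```

Both cutoffs are inclusive on the earlier side, matching the `<=` comparisons used in the cell above.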


In [7]:
# Verify temporal ordering

# Check that dates don't overlap
print(f"Train Max Date: {df[train_mask]['Date'].max().date()}")
print(f"Val min date: {df[val_mask]['Date'].min().date()}")

# Check class balance across splits
print("\n=== Class Balance Check ===")

# Compare the share of Red wins in each split
print(f"y_train mean: {y_train.mean()}")
print(f"y_val mean: {y_val.mean()}")
print(f"y_test mean: {y_test.mean()}")

print("\nNo temporal leakage - splits are clean!")

Train Max Date: 2020-12-19
Val min date: 2021-01-16

=== Class Balance Check ===
y_train mean: 0.5859858532272325
y_val mean: 0.5785714285714286
y_test mean: 0.5572519083969466

No temporal leakage - splits are clean!
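The cell above prints the train/validation boundary and reports success unconditionally. A stricter variant, sketched here, would assert the ordering at every boundary, including validation/test. `assert_chronological` is a hypothetical helper; the first two dates below are the boundary dates printed above, and the last two are made up for illustration.

```python
import pandas as pd

def assert_chronological(dates, masks):
    """Check that each split ends strictly before the next one begins."""
    for (name_a, mask_a), (name_b, mask_b) in zip(masks, masks[1:]):
        last_a = dates[mask_a].max()
        first_b = dates[mask_b].min()
        # An empty split yields NaT, which also fails this comparison
        if not last_a < first_b:
            raise AssertionError(f"{name_a} ends {last_a}, but {name_b} starts {first_b}")
    return True

dates = pd.Series(pd.to_datetime(["2020-12-19", "2021-01-16", "2022-12-10", "2023-01-14"]))
train_end = pd.Timestamp("2020-12-31")
val_end = pd.Timestamp("2022-12-31")
masks = [
    ("train", dates <= train_end),
    ("val", (dates > train_end) & (dates <= val_end)),
    ("test", dates > val_end),
]
print(assert_chronological(dates, masks))  # True
```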


In [9]:
# Verify feature distributions are similar across splits

# Check if training data is representative of test data
# by comparing the means of a few key features
key_features = ['WinDif', 'ReachDif', 'WinPctDif', 'AgeDif']

print("=== Feature Distribution Comparison ===\n")
print(f"{'Feature':<20} {'Train Mean':>10} {'Val Mean':>10} {'Test Mean':>10}")
print("-" * 60)

for feat in key_features:
    train_mean = X_train[feat].mean()
    val_mean = X_val[feat].mean()
    test_mean = X_test[feat].mean()
    # One row per feature so the values line up under the header
    print(f"{feat:<20} {train_mean:>10.4f} {val_mean:>10.4f} {test_mean:>10.4f}")

# If distributions are very different, model might not generalize well

=== Feature Distribution Comparison ===

Feature              Train Mean   Val Mean  Test Mean
------------------------------------------------------------
WinDif                  -1.3762    -1.8082    -1.7532
ReachDif                -0.2320    -0.4510    -0.3199
WinPctDif                0.0460     0.0336     0.0633
AgeDif                   0.4231    -0.7306    -0.6209
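Raw means are hard to judge without a sense of scale: the `AgeDif` mean changes sign between training (about 0.42) and validation/test (about -0.73 and -0.62), but whether that is a large shift depends on how widely `AgeDif` varies. One possible way to put the gap in context, sketched here as an assumption rather than the notebook's method, is to divide the difference in means by the training standard deviation. The helper name and the sample values are hypothetical.

```python
import pandas as pd

def standardized_shift(train_col, other_col):
    """Difference in means, in units of the training standard deviation."""
    return (other_col.mean() - train_col.mean()) / train_col.std()

# Hypothetical values for illustration only
train_age = pd.Series([2.0, -1.0, 0.0, 3.0, -2.0])
test_age = pd.Series([-1.0, -2.0, 0.0, 1.0, -3.0])

shift = standardized_shift(train_age, test_age)
print(f"{shift:.2f}")
```

In the notebook this could be applied as `standardized_shift(X_train[feat], X_test[feat])` inside the loop over `key_features`.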


In [11]:
# Save train/val/test splits

# Save training data
X_train.to_csv('../data/processed/X_train.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)

# Save validation data
X_val.to_csv('../data/processed/X_val.csv', index=False)
y_val.to_csv('../data/processed/y_val.csv', index=False)

# Save test data
X_test.to_csv('../data/processed/X_test.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)


print("Saved all splits")
print(f"  - X_train: {X_train.shape}")
print(f"  - X_val: {X_val.shape}")
print(f"  - X_test: {X_test.shape}")

Saved all splits
  - X_train: (4524, 24)
  - X_val: (980, 24)
  - X_test: (786, 24)
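When the saved targets are loaded in a later step, `pd.read_csv` returns a one-column DataFrame rather than a Series, so the column has to be selected again, as was done for `y_target.csv` at the top of this notebook. A minimal round-trip sketch, using a temporary directory and a hypothetical three-row target instead of the real files:

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for y_train
y_demo = pd.Series([1, 0, 1], name="RedWon")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "y_train.csv")
    y_demo.to_csv(path, index=False)
    # read_csv yields a DataFrame; select the column to get a Series back
    y_loaded = pd.read_csv(path)["RedWon"]

print(y_loaded.tolist())  # [1, 0, 1]
```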
