# Ultra-Marathon Pace Prediction - Example Usage

This notebook demonstrates how to use the ultramarathon pace prediction pipeline.

## Overview

The pipeline consists of several modules that work together to:
1. Load and clean ultramarathon data
2. Engineer features that capture athlete progression
3. Train a machine learning model to predict race pace
4. Evaluate model performance using pace-specific metrics

## 1. Import Required Modules

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import pipeline modules
from src.data.load import load_raw_data
from src.data.clean import clean_data
from src.data.features import engineer_features
from src.data.split import split_train_test
from src.models.prepare import prepare_model_data
from src.models.train import train_evaluate_lgbm
from src.evaluation.metrics import print_pace_metrics
from src.visualization.eda import plot_pace_distribution, plot_model_performance
from src.pipeline import run_pipeline

# Set plotting style
plt.style.use('seaborn-v0_8')  # this style name requires matplotlib >= 3.6
sns.set_palette("husl")

## 2. Load and Explore Data

In [None]:
# Load the raw data
print("Loading data...")
df = load_raw_data("data/raw/ultra_marathons.csv")

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Display basic information
print("\nBasic info:")
df.info()  # info() prints its report directly and returns None

print("\nFirst few rows:")
print(df.head())

## 3. Data Cleaning

In [None]:
# Clean the data
print("Cleaning data...")
df_clean = clean_data(df)

print(f"\nAfter cleaning: {df_clean.shape}")
print(f"Removed {df.shape[0] - df_clean.shape[0]} rows")

# Check for any remaining missing values
missing_cols = df_clean.columns[df_clean.isnull().any()].tolist()
if missing_cols:
    print(f"\nColumns with missing values: {missing_cols}")
    print(df_clean[missing_cols].isnull().sum())
else:
    print("\nNo missing values found!")

## 4. Feature Engineering

In [None]:
# Engineer features
print("Engineering features...")
df_features = engineer_features(df_clean)

print(f"\nAfter feature engineering: {df_features.shape}")

# Identify the columns added by feature engineering. Counting them directly
# stays correct even if engineer_features also drops some columns.
new_features = [col for col in df_features.columns if col not in df_clean.columns]
print(f"Added {len(new_features)} new features")

# Show new features
print("\nNew features created:")
for feature in new_features:
    print(f"  - {feature}")
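
The implementation of `engineer_features` is not shown in this notebook. As a rough illustration of what cumulative progression features such as `cum_num_races`, `cum_avg_pace` and `cum_best_pace` could look like, here is a minimal pandas sketch. The `athlete_id`, `year` and `pace` columns and the sample values are assumptions made for the example, not the pipeline's actual schema. Each statistic is shifted by one race so that a row only sees results from earlier races.

```python
import pandas as pd

# Hypothetical race history; column names and values are illustrative only.
races = pd.DataFrame({
    "athlete_id": [1, 1, 1, 2, 2],
    "year": [2019, 2020, 2021, 2020, 2021],
    "pace": [6.0, 5.5, 5.8, 7.0, 6.5],  # minutes per km
})

# Order each athlete's races chronologically before accumulating.
races = races.sort_values(["athlete_id", "year"]).reset_index(drop=True)
by_athlete = races.groupby("athlete_id")["pace"]

# Number of races the athlete finished before the current one.
races["cum_num_races"] = races.groupby("athlete_id").cumcount()

# shift(1) drops the current race, so each row uses prior results only.
races["cum_avg_pace"] = by_athlete.transform(lambda s: s.shift(1).expanding().mean())

# A lower pace is faster, so the best prior pace is the running minimum.
races["cum_best_pace"] = by_athlete.transform(lambda s: s.shift(1).cummin())

print(races)
```

An athlete's first race has no history, so its cumulative pace features are `NaN` in this sketch; how the real pipeline handles that case is not shown here.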

## 5. Train/Test Split

In [None]:
# Split into train/test sets
print("Splitting data...")
df_train, df_test, feature_cols = split_train_test(df_features)

print(f"\nTrain set: {df_train.shape}")
print(f"Test set:  {df_test.shape}")
print(f"\nFeature columns ({len(feature_cols)}):")
for i, col in enumerate(feature_cols, 1):
    print(f"  {i:2d}. {col}")
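
How `split_train_test` separates the data is not shown in this notebook. One common way to avoid leakage with race histories is a chronological split, where the model is trained on earlier races and tested on later ones. The sketch below assumes that approach; `split_by_year`, the cutoff year and the sample values are hypothetical, and only the `Year of event` column name comes from the data used in this notebook.

```python
import pandas as pd

def split_by_year(df, year_col, cutoff_year):
    """Train on races before cutoff_year, test on races from cutoff_year onward."""
    train = df[df[year_col] < cutoff_year].copy()
    test = df[df[year_col] >= cutoff_year].copy()
    return train, test

# Made-up results used only to exercise the helper.
results = pd.DataFrame({
    "Year of event": [2019, 2020, 2021, 2022, 2022],
    "pace": [6.1, 5.9, 6.4, 6.0, 5.7],
})

train, test = split_by_year(results, "Year of event", cutoff_year=2022)
print(f"Train rows: {len(train)}, test rows: {len(test)}")
```

A random row-level split would let an athlete's later races inform predictions for their earlier ones, which is the kind of leakage a time-based cutoff is meant to prevent.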

## 6. Model Training

In [None]:
# Prepare data for modeling
print("Preparing data for modeling...")
X_train, X_test, y_train, y_test = prepare_model_data(df_train, df_test, feature_cols)

print(f"\nTraining features shape: {X_train.shape}")
print(f"Test features shape:     {X_test.shape}")
print(f"Training target shape:   {y_train.shape}")
print(f"Test target shape:       {y_test.shape}")

# Train the model
print("\nTraining model...")
model, y_pred = train_evaluate_lgbm(X_train, y_train, X_test, y_test)

## 7. Model Evaluation

In [None]:
# Print detailed metrics
print_pace_metrics(y_test, y_pred, "Ultra-Marathon Pace Predictor")

# Plot model performance
plot_model_performance(y_test, y_pred, "Ultra-Marathon Pace Predictor")
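
The exact metrics reported by `print_pace_metrics` are not shown here. As a sketch of what pace-specific metrics might include, the hypothetical helper below computes the mean absolute error and root mean squared error in pace units, plus the share of predictions that fall within a chosen tolerance of the true pace. The function name, the 0.5 tolerance and the sample values are assumptions for illustration.

```python
import numpy as np

def pace_metrics(y_true, y_pred, tolerance=0.5):
    """Summarise prediction error in the same units as the pace values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        # Fraction of predictions within `tolerance` of the true pace.
        "within_tolerance": float(np.mean(np.abs(errors) <= tolerance)),
    }

metrics = pace_metrics([6.0, 7.0, 8.0], [6.2, 7.0, 9.0])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```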

## 8. Feature Importance

In [None]:
# Get feature importance
from src.models.train import get_feature_importance

importance_df = get_feature_importance(model, X_train.columns)

print("Top 10 most important features:")
print(importance_df.head(10)[['feature', 'importance']])

# Plot feature importance
from src.visualization.eda import plot_feature_importance
plot_feature_importance(importance_df, top_n=15)
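
The body of `get_feature_importance` is not shown in this notebook. Estimators that follow the scikit-learn convention, including LightGBM's `LGBMRegressor`, expose a `feature_importances_` attribute, so a table like `importance_df` could plausibly be built as in the sketch below. The `importance_table` helper and the stand-in model with made-up scores are hypothetical and exist only to make the example self-contained.

```python
from types import SimpleNamespace

import pandas as pd

def importance_table(model, feature_names):
    """Return one row per feature, sorted from most to least important."""
    table = pd.DataFrame({
        "feature": list(feature_names),
        "importance": list(model.feature_importances_),
    })
    return table.sort_values("importance", ascending=False).reset_index(drop=True)

# Stand-in object with invented scores, not a trained model.
fake_model = SimpleNamespace(feature_importances_=[12, 40, 3])
table = importance_table(
    fake_model, ["athlete_age", "cum_avg_pace", "cum_num_races"]
)
print(table)
```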

## 9. Using the Full Pipeline

In [None]:
# The full pipeline can be run in one step.
# The call is left commented out so this cell does not repeat the work done
# step by step above. To run it, uncomment it and point it at your data file:
# model, X_train, X_test, y_train, y_test, y_pred = run_pipeline('path/to/your/data.csv')

print("To run the full pipeline with your own data:")
print("model, X_train, X_test, y_train, y_test, y_pred = run_pipeline('your_data.csv')")

## 10. Making Predictions on New Data

In [None]:
# Example: Making predictions on new data
print("Example: Making predictions on new data")

# Create some example data (this would normally come from your dataset)
example_data = {
    'Year of event': [2023, 2023],
    'Event number of finishers': [150, 200],
    'Athlete gender': ['M', 'F'],
    'Event distance_numeric': [100, 50],
    'cum_num_races': [5, 3],
    'cum_avg_pace': [10.5, 11.2],
    'cum_best_pace': [9.8, 10.5],
    'cum_ws_finishes': [1, 0],
    'cum_total_distance': [400, 180],
    'cum_avg_distance': [80, 60],
    'cum_shortest_distance': [50, 42],
    'cum_longest_distance': [100, 80],
    'recent_avg_distance': [85, 65],
    'distance_gap_from_longest': [0, -30],
    'athlete_age': [35, 28]
}

example_df = pd.DataFrame(example_data)
print(f"Example data shape: {example_df.shape}")
print("\nExample data:")
print(example_df)

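# Make predictions
# Note: the columns must match the training features (feature_cols) in name
# and order, and 'Athlete gender' must be encoded the same way as in the
# training data. LightGBM does not accept raw string (object) columns.
if 'model' in locals():
    predictions = model.predict(example_df)
    print(f"\nPredicted paces (min/km): {predictions}")
else:
    print("\nModel not available in this example")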
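
Before calling `model.predict` on new rows, it can help to align them with the training features explicitly. The sketch below shows one way to do that, under the assumption that string columns are passed to the model as pandas categoricals. `align_to_training` is a hypothetical helper, not part of the pipeline, and the sample rows are invented.

```python
import pandas as pd

def align_to_training(new_df, feature_cols, categorical_cols=()):
    """Select the training columns in training order and cast categoricals."""
    missing = [col for col in feature_cols if col not in new_df.columns]
    if missing:
        raise ValueError(f"Missing feature columns: {missing}")
    # Extra columns are dropped; remaining columns follow feature_cols order.
    aligned = new_df[list(feature_cols)].copy()
    for col in categorical_cols:
        aligned[col] = aligned[col].astype("category")
    return aligned

# Invented rows with one column the model would not expect.
new_rows = pd.DataFrame({
    "athlete_age": [35, 28],
    "Athlete gender": ["M", "F"],
    "extra_column": [1, 2],
})

aligned = align_to_training(
    new_rows,
    feature_cols=["Athlete gender", "athlete_age"],
    categorical_cols=["Athlete gender"],
)
print(aligned.dtypes)
```

Whether a categorical cast is the right encoding depends on what `prepare_model_data` does to the training data, which is not shown in this notebook; the new rows should go through the same transformation.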

## Summary

This notebook demonstrated the complete workflow for ultramarathon pace prediction:

1. **Data Loading**: Load raw ultramarathon data
2. **Data Cleaning**: Remove inconsistencies and missing values
3. **Feature Engineering**: Create cumulative and rolling features
4. **Train/Test Split**: Split data while preventing data leakage
5. **Model Training**: Train a LightGBM model
6. **Evaluation**: Assess model performance with pace-specific metrics
7. **Feature Importance**: Understand which features matter most
8. **Pipeline Usage**: Run the complete pipeline in one step
9. **Predictions**: Make predictions on new data

The pipeline is designed to be:
- **Reproducible**: All steps are deterministic
- **Scalable**: Handles large datasets efficiently
- **Interpretable**: Provides insights into feature importance
- **Extensible**: Easy to add new features or models