# F1 Race Outcome Predictor - Data Exploration

## Project Overview
This notebook explores the data behind a data-driven Formula 1 race outcome predictor built on the **OpenF1 API**.

### Key Components:
- **Data Collection**: Historical race data (lap times, weather, pit stops)
- **Feature Engineering**: Average pace, consistency, team performance
- **ML Models**: Random Forest and XGBoost
- **Predictions**: Finishing positions and podium probabilities

### Goals:
- Validate feasibility of forecasting F1 race results
- Achieve MAE < 2.5 positions
- Build foundation for advanced predictive dashboard
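To make the MAE target concrete, the sketch below computes mean absolute error over finishing positions. The function name `position_mae` and the example values are illustrative assumptions, not results from this project.

```python
def position_mae(actual, predicted):
    """Mean absolute error between actual and predicted finishing positions."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one position is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical example: five drivers, made-up values for illustration only
actual_positions = [1, 2, 3, 4, 5]
predicted_positions = [2, 1, 3, 6, 5]

mae = position_mae(actual_positions, predicted_positions)
print(f"MAE: {mae:.2f} positions")
```

An MAE below 2.5 would mean the predicted finishing position is, on average, fewer than 2.5 places away from the actual result.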

## 1. Import Required Libraries

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Add project root to path
import sys
sys.path.append('..')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## 2. Load Processed Data

We'll load the processed data that was collected from the OpenF1 API and preprocessed.

In [None]:
# Define data paths
data_path = Path('../data/processed')

# Load processed datasets
sessions_df = pd.read_csv(data_path / 'sessions_processed.csv')
laps_df = pd.read_csv(data_path / 'laps_processed.csv')
positions_df = pd.read_csv(data_path / 'positions_processed.csv')
weather_df = pd.read_csv(data_path / 'weather_processed.csv')
pit_stops_df = pd.read_csv(data_path / 'pit_stops_processed.csv')
drivers_df = pd.read_csv(data_path / 'drivers_processed.csv')

print("✅ Data loaded successfully!")
print("\nDataset Sizes:")
print(f"  Sessions: {len(sessions_df):,} records")
print(f"  Laps: {len(laps_df):,} records")
print(f"  Positions: {len(positions_df):,} records")
print(f"  Weather: {len(weather_df):,} records")
print(f"  Pit Stops: {len(pit_stops_df):,} records")
print(f"  Drivers: {len(drivers_df):,} records")
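Before running the loading cell, it can help to confirm that every expected file exists, since `pd.read_csv` stops at the first missing path. The helper below is a minimal sketch; `find_missing_files` and `EXPECTED_FILES` are hypothetical names introduced here, and the file names are copied from the loading cell above.

```python
from pathlib import Path

# File names taken from the loading cell above
EXPECTED_FILES = (
    'sessions_processed.csv',
    'laps_processed.csv',
    'positions_processed.csv',
    'weather_processed.csv',
    'pit_stops_processed.csv',
    'drivers_processed.csv',
)

def find_missing_files(data_dir, filenames=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in filenames if not (data_dir / name).is_file()]

missing = find_missing_files('../data/processed')
if missing:
    print(f"Missing files: {missing}")
else:
    print("All expected files are present.")
```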

## 3. Explore Race Sessions

In [None]:
# Display session information
print("Session Data Overview:")
sessions_df.info()  # info() prints directly and returns None, so no print() wrapper
print("\n" + "="*60)
print("\nSample Sessions:")
sessions_df.head()

In [None]:
# Count sessions by type
print("Sessions by Type:")
session_counts = sessions_df['session_type'].value_counts()
print(session_counts)

# Visualize session distribution
plt.figure(figsize=(10, 5))
session_counts.plot(kind='bar', color='steelblue')
plt.title('Distribution of Session Types', fontsize=14, fontweight='bold')
plt.xlabel('Session Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 4. Analyze Lap Times

In [None]:
# Lap times overview
print("Lap Times Statistics:")
print(laps_df[['lap_time_seconds', 'sector_1_seconds', 'sector_2_seconds', 'sector_3_seconds']].describe())

# Distribution of lap times
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(laps_df['lap_time_seconds'].dropna(), bins=50, color='coral', edgecolor='black', alpha=0.7)
plt.xlabel('Lap Time (seconds)')
plt.ylabel('Frequency')
plt.title('Distribution of Lap Times')

plt.subplot(1, 2, 2)
plt.boxplot(laps_df['lap_time_seconds'].dropna(), vert=True)
plt.ylabel('Lap Time (seconds)')
plt.title('Lap Time Box Plot')

plt.tight_layout()
plt.show()
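The box plot will typically show very slow laps (pit stops, safety car periods) as outliers. One common way to filter them is the interquartile range (IQR) rule, sketched below. The function name `filter_lap_outliers`, the multiplier `k=1.5`, and the demo values are assumptions for illustration; the project may use a different rule.

```python
import pandas as pd

def filter_lap_outliers(laps, column='lap_time_seconds', k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR].

    Rows with a missing value in `column` are dropped as well.
    """
    values = laps[column]
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = values.between(q1 - k * iqr, q3 + k * iqr)
    return laps[mask]

# Hypothetical lap times with one unusually slow lap
demo_laps = pd.DataFrame(
    {'lap_time_seconds': [90.1, 90.4, 89.8, 90.0, 90.2, 125.0]}
)
clean_laps = filter_lap_outliers(demo_laps)
print(f"Kept {len(clean_laps)} of {len(demo_laps)} laps")
```

In practice the filter is probably best applied per session, since typical lap times differ between circuits.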

In [None]:
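# Lap times by driver (top 5 drivers by number of laps)
# Restrict to one session when possible so lap numbers are not mixed across sessions
plot_laps = laps_df
if 'session_key' in laps_df.columns and len(laps_df) > 0:
    plot_laps = laps_df[laps_df['session_key'] == laps_df['session_key'].iloc[0]]
top_drivers = plot_laps['driver_number'].value_counts().head(5).index

plt.figure(figsize=(14, 6))
for driver in top_drivers:
    # Sort by lap number so each line is drawn in lap order
    driver_laps = plot_laps[plot_laps['driver_number'] == driver].sort_values('lap_number')
    plt.plot(driver_laps['lap_number'], driver_laps['lap_time_seconds'], 
             marker='o', alpha=0.6, label=f'Driver {driver}')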

plt.xlabel('Lap Number')
plt.ylabel('Lap Time (seconds)')
plt.title('Lap Times Evolution for Top 5 Drivers')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 5. Weather Data Analysis

In [None]:
# Weather statistics
print("Weather Data Overview:")
weather_cols = ['air_temperature', 'track_temperature', 'humidity', 'pressure']
available_cols = [col for col in weather_cols if col in weather_df.columns]
print(weather_df[available_cols].describe())

# Visualize weather conditions
if 'air_temperature' in weather_df.columns and 'track_temperature' in weather_df.columns:
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    axes[0, 0].hist(weather_df['air_temperature'].dropna(), bins=30, color='skyblue', edgecolor='black')
    axes[0, 0].set_xlabel('Air Temperature (°C)')
    axes[0, 0].set_ylabel('Frequency')
    axes[0, 0].set_title('Air Temperature Distribution')
    
    axes[0, 1].hist(weather_df['track_temperature'].dropna(), bins=30, color='orange', edgecolor='black')
    axes[0, 1].set_xlabel('Track Temperature (°C)')
    axes[0, 1].set_ylabel('Frequency')
    axes[0, 1].set_title('Track Temperature Distribution')
    
    if 'humidity' in weather_df.columns:
        axes[1, 0].hist(weather_df['humidity'].dropna(), bins=30, color='lightgreen', edgecolor='black')
        axes[1, 0].set_xlabel('Humidity (%)')
        axes[1, 0].set_ylabel('Frequency')
        axes[1, 0].set_title('Humidity Distribution')
    
    if 'pressure' in weather_df.columns:
        axes[1, 1].hist(weather_df['pressure'].dropna(), bins=30, color='pink', edgecolor='black')
        axes[1, 1].set_xlabel('Pressure (mbar)')
        axes[1, 1].set_ylabel('Frequency')
        axes[1, 1].set_title('Pressure Distribution')
    
    plt.tight_layout()
    plt.show()
else:
    print("Weather temperature columns not found in data")

## 6. Pit Stop Analysis

In [None]:
# Pit stop statistics
print("Pit Stop Overview:")
print(f"Total pit stops: {len(pit_stops_df):,}")

if 'pit_duration_seconds' in pit_stops_df.columns:
    print("\nPit Stop Duration Statistics:")
    print(pit_stops_df['pit_duration_seconds'].describe())
    
    # Visualize pit stop durations
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    plt.hist(pit_stops_df['pit_duration_seconds'].dropna(), bins=40, color='purple', edgecolor='black', alpha=0.7)
    plt.xlabel('Pit Duration (seconds)')
    plt.ylabel('Frequency')
    plt.title('Distribution of Pit Stop Durations')
    
    plt.subplot(1, 2, 2)
    pit_stops_per_driver = pit_stops_df['driver_number'].value_counts().head(10)
    pit_stops_per_driver.plot(kind='bar', color='teal')
    plt.xlabel('Driver Number')
    plt.ylabel('Number of Pit Stops')
    plt.title('Top 10 Drivers by Pit Stops')
    plt.xticks(rotation=45)
    
    plt.tight_layout()
    plt.show()
else:
    print("Pit duration data not available")

## 7. Position Data and Race Results

In [None]:
# Analyze race positions
print("Position Data Overview:")
print(positions_df.head())

# Get final race results for a sample session
if 'session_key' in positions_df.columns and len(positions_df) > 0:
    sample_session = positions_df['session_key'].iloc[0]
    session_data = positions_df[positions_df['session_key'] == sample_session]
    
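    # Get final positions (sort by time first so .last() picks the latest record)
    if 'date' in session_data.columns:
        session_data = session_data.sort_values('date', key=pd.to_datetime)
    final_positions = session_data.groupby('driver_number').last().reset_index()
    final_positions = final_positions.sort_values('position')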
    
    print(f"\nFinal Results for Session {sample_session}:")
    print(final_positions[['driver_number', 'position']].head(10))
    
    # Visualize position changes
    plt.figure(figsize=(14, 6))
    for driver in final_positions['driver_number'].head(5):
        driver_data = session_data[session_data['driver_number'] == driver]
        if 'date' in driver_data.columns:
            plt.plot(pd.to_datetime(driver_data['date']), driver_data['position'], 
                    marker='o', label=f'Driver {driver}', alpha=0.7)
    
    plt.xlabel('Time')
    plt.ylabel('Position')
    plt.title('Position Evolution During Race (Top 5 Finishers)')
    plt.legend()
    plt.gca().invert_yaxis()  # Lower position numbers at top
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

## 8. Data Quality Assessment

In [None]:
# Check for missing values
datasets = {
    'Sessions': sessions_df,
    'Laps': laps_df,
    'Positions': positions_df,
    'Weather': weather_df,
    'Pit Stops': pit_stops_df,
    'Drivers': drivers_df
}

print("Missing Values Summary:\n")
for name, df in datasets.items():
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    missing_df = pd.DataFrame({
        'Missing Count': missing[missing > 0],
        'Percentage': missing_pct[missing > 0]
    })
    
    if not missing_df.empty:
        print(f"\n{name}:")
        print(missing_df.sort_values('Missing Count', ascending=False).head())
    else:
        print(f"\n{name}: No missing values ✅")
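Missing values are one aspect of data quality; repeated rows are another that may be worth checking before modeling. The sketch below counts duplicates with `DataFrame.duplicated`. The helper name `count_duplicate_rows` and the demo data are assumptions for illustration.

```python
import pandas as pd

def count_duplicate_rows(df, subset=None):
    """Count rows that repeat an earlier row, optionally on a subset of columns."""
    return int(df.duplicated(subset=subset).sum())

# Hypothetical laps table in which the last row repeats the third
demo = pd.DataFrame({
    'driver_number': [1, 1, 44, 44],
    'lap_number': [1, 2, 1, 1],
    'lap_time_seconds': [90.1, 90.3, 90.5, 90.5],
})

full_duplicates = count_duplicate_rows(demo)
key_duplicates = count_duplicate_rows(demo, subset=['driver_number', 'lap_number'])
print(f"Fully duplicated rows: {full_duplicates}")
print(f"Rows repeating a (driver, lap) pair: {key_duplicates}")
```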

## 9. Key Insights

### Data Summary:
- Successfully loaded and explored F1 race data from the OpenF1 API
- Data includes sessions, laps, positions, weather, pit stops, and driver information
- Multiple seasons of historical data available for model training

### Next Steps:
1. **Feature Engineering**: Create predictive features (pace, consistency, team performance)
2. **Model Training**: Train Random Forest and XGBoost models
3. **Evaluation**: Assess model performance with MAE and accuracy metrics
4. **Predictions**: Generate race outcome predictions
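As a preview of the feature engineering step, the sketch below derives two per-driver features from lap times: average pace (mean lap time) and consistency (standard deviation of lap time). These definitions, the function name `driver_pace_features`, and the demo values are assumptions; the next notebook may define the features differently, and team performance is not covered here.

```python
import pandas as pd

def driver_pace_features(laps):
    """Aggregate lap times into per-driver pace and consistency features."""
    return (
        laps.groupby('driver_number')['lap_time_seconds']
        .agg(avg_pace='mean', consistency='std', laps_completed='count')
        .reset_index()
    )

# Hypothetical lap times for two drivers
demo_laps = pd.DataFrame({
    'driver_number': [1, 1, 1, 44, 44, 44],
    'lap_time_seconds': [90.0, 91.0, 92.0, 89.0, 89.0, 89.0],
})

features = driver_pace_features(demo_laps)
print(features)
```

A lower `consistency` value means less lap-to-lap variation, so it may be clearer to think of it as a spread measure rather than a score.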

---

Continue to **02_feature_engineering.ipynb** for the next phase!