# F1 Race Position Prediction Model

## Project Overview
In this notebook, we'll build a machine learning model to predict final race positions in Formula 1. We'll use historical race data and various features like qualifying position, driver performance, and circuit characteristics.

## What You'll Learn:
1. **Data Collection**: How to fetch F1 data using APIs
2. **Exploratory Data Analysis (EDA)**: Understanding patterns in the data
3. **Feature Engineering**: Creating meaningful features for predictions
4. **Model Training**: Using Random Forest and XGBoost
5. **Model Evaluation**: Measuring how well our model performs
6. **Making Predictions**: Using the model for future races

## Step 1: Import Libraries

Let's import all the libraries we'll need. Each library has a specific purpose:
- **pandas**: Working with data in table format (DataFrames)
- **numpy**: Numerical operations and arrays
- **matplotlib & seaborn**: Creating visualizations
- **fastf1**: Fetching F1 data
- **scikit-learn**: Machine learning algorithms and tools
- **xgboost**: Advanced gradient boosting algorithm

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# F1 data
import fastf1
from fastf1 import plotting

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_absolute_error, classification_report
from sklearn.preprocessing import LabelEncoder
import xgboost as xgb

# Utilities
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Configure plotting
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

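# Enable FastF1 cache to speed up data loading.
# The cache directory must exist before it is enabled, so create it first.
import os
os.makedirs("cache", exist_ok=True)
fastf1.Cache.enable_cache("cache")

print("Libraries imported successfully!")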
schedule = fastf1.get_event_schedule(2025)
print(schedule)
race = fastf1.get_session(2024, 'Las Vegas Grand Prix', 'R')
#race.load()  # This fetches the data
print(race)

def get_qualifying_results(year, event_name):
    """Return qualifying results for one event, sorted by position (None on failure)."""
    try:
        # Load qualifying session
        quali = fastf1.get_session(year, event_name, 'Q')
        quali.load()
        
        # Get qualifying results
        results = quali.results
        
        # Select relevant columns
        quali_data = results[[
            'Position', 'Abbreviation', 'TeamName', 'Q1', 'Q2', 'Q3'
        ]].copy()
        
        # Sort by position
        quali_data = quali_data.sort_values('Position')
        
        return quali_data
    
    except Exception as e:
        print(f"Error loading data for {year} {event_name}: {e}")
        return None

# Fetch Las Vegas GP qualifying results
print("="*80)
print("LAS VEGAS GRAND PRIX - QUALIFYING RESULTS")
print("="*80)

# 2023 Las Vegas GP
print("\n🏁 2023 LAS VEGAS GRAND PRIX - QUALIFYING\n")
quali_2023 = get_qualifying_results(2023, 'Las Vegas Grand Prix')
if quali_2023 is not None:
    print(quali_2023.to_string(index=False))

print("\n" + "="*80)

# 2024 Las Vegas GP
print("\n🏁 2024 LAS VEGAS GRAND PRIX - QUALIFYING\n")
quali_2024 = get_qualifying_results(2024, 'Las Vegas Grand Prix')
if quali_2024 is not None:
    print(quali_2024.to_string(index=False))

print("\n" + "="*80)

# 2025 Las Vegas GP
print("\n🏁 2025 LAS VEGAS GRAND PRIX - QUALIFYING\n")
quali_2025 = get_qualifying_results(2025, 'Las Vegas Grand Prix')
if quali_2025 is not None:
    print(quali_2025.to_string(index=False))

print("\n" + "="*80)

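def get_race_results(year, event_name):
    """Return race results for one event, sorted by finishing order (None on failure)."""
    try:
        # Load race session
        race = fastf1.get_session(year, event_name, 'R')
        race.load()
        
        # Get race results
        results = race.results.copy()
        
        # Select relevant columns
        race_data = results[[
            'Position', 'ClassifiedPosition', 'Abbreviation', 'TeamName', 'Time', 'GridPosition'
        ]].copy()
        
        # Sort by the numeric 'Position' column. 'ClassifiedPosition' holds strings
        # (for example 'R' for retired), so sorting on it would be lexicographic.
        race_data = race_data.sort_values('Position')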
        
        return race_data
    
    except Exception as e:
        print(f"Error loading data for {year} {event_name}: {e}")
        return None
    
# Fetch Las Vegas GP race results
print("="*80)
print("LAS VEGAS GRAND PRIX - RACE RESULTS")
print("="*80)

# 2023 Las Vegas GP
print("\n🏁 2023 LAS VEGAS GRAND PRIX - RACE\n")
race_2023 = get_race_results(2023, 'Las Vegas Grand Prix')
if race_2023 is not None:
    print(race_2023.to_string(index=False))

print("\n" + "="*80)

# 2024 Las Vegas GP
print("\n🏁 2024 LAS VEGAS GRAND PRIX - RACE\n")
race_2024 = get_race_results(2024, 'Las Vegas Grand Prix')
if race_2024 is not None:
    print(race_2024.to_string(index=False))

print("\n" + "="*80)

ModuleNotFoundError: No module named 'pandas'

In [None]:
import sys
print("Python executable:", sys.executable)
print("Python version:", sys.version)
print("\nInstalled packages location:")
print(sys.path[:3])
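
The `ModuleNotFoundError` above usually means the notebook kernel is running a different interpreter from the one the packages were installed into. As a rough sketch, the check below lists which of the notebook's imports the current interpreter cannot find. The `REQUIRED` list and the `find_missing` helper are illustrative names, not part of the original notebook.

```python
import importlib.util
import sys

# Import names, not pip names: scikit-learn is imported as "sklearn".
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn",
            "fastf1", "sklearn", "xgboost", "tqdm"]

def find_missing(modules):
    """Return the module names the current interpreter cannot locate."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED)
print("Interpreter:", sys.executable)
print("Missing modules:", missing if missing else "none")
```

If anything is reported missing, installing with the same interpreter that is printed above (for example by running that executable with `-m pip install <package>`) keeps the kernel and the packages in sync.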

## Step 2: Data Collection

We'll use the FastF1 library to fetch historical race data. FastF1 provides:
- Race results (finishing positions)
- Qualifying results (starting positions)
- Lap times and telemetry data
- Driver and team information

**Key Concept**: In machine learning, we need historical data to train our model. More high-quality data generally leads to better predictions.

Let's start by fetching data from recent F1 seasons (2022-2023).
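
Race results and qualifying results come from separate sessions, so they have to be joined per driver. The sketch below shows that join on small made-up tables (the values are invented for illustration, not real results). A left merge keeps every race entry and leaves `QualiPosition` empty for a driver with no qualifying record.

```python
import pandas as pd

# Toy race results (invented values)
race = pd.DataFrame({
    "Abbreviation": ["VER", "LEC", "HAM"],
    "Position": [1.0, 2.0, 3.0],
})

# Toy qualifying results; HAM is deliberately missing
quali = pd.DataFrame({
    "Abbreviation": ["LEC", "VER"],
    "Position": [1.0, 2.0],
})

# Rename before merging so the two 'Position' columns do not collide
quali = quali.rename(columns={"Position": "QualiPosition"})
merged = race.merge(quali, on="Abbreviation", how="left")
print(merged)
```

Rows where `QualiPosition` is missing can then be dropped or handled explicitly before training.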

In [None]:
'''
def fetch_season_data(year):
    print(f"\nFetching data for {year} season...")
    season_data = []
    
    # Get the schedule for the year
    schedule = fastf1.get_event_schedule(year)
    
    # Iterate through each race
    for idx, event in tqdm(schedule.iterrows(), total=len(schedule), desc=f"{year} Races"):
        # Skip pre-season testing only; sprint weekends still include a main race
        if event['EventFormat'] == 'testing':
            continue
            
        try:
            # Load the race session
            race = fastf1.get_session(year, event['EventName'], 'R')
            race.load()
            
            # Get race results (copy so the added columns do not modify the session object)
            results = race.results.copy()
            
            # Add metadata
            results['Year'] = year
            results['RaceName'] = event['EventName']
            results['Country'] = event['Country']
            results['RoundNumber'] = event['RoundNumber']
            
            # Try to get qualifying data
            try:
                quali = fastf1.get_session(year, event['EventName'], 'Q')
                quali.load()
                quali_results = quali.results[['Abbreviation', 'Position']]
                quali_results = quali_results.rename(columns={'Position': 'QualiPosition'})
                
                # Merge qualifying position
                results = results.merge(quali_results, on='Abbreviation', how='left')
            except Exception as e:
                print(f"Could not load qualifying for {event['EventName']}: {e}")
                results['QualiPosition'] = None
            
            season_data.append(results)
            
        except Exception as e:
            print(f"Error loading {event['EventName']}: {e}")
            continue
    
    # Combine all races into one DataFrame
    if season_data:
        return pd.concat(season_data, ignore_index=True)
    else:
        return pd.DataFrame()

# Fetch data for 2022 and 2023 seasons
# You can add more years if you want more training data
df_2022 = fetch_season_data(2022)
df_2023 = fetch_season_data(2023)

print(df_2023)

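# Combine all data (later cells rely on df_raw)
df_raw = pd.concat([df_2022, df_2023], ignore_index=True)

print(f"\n{'='*50}")
# Count races per (Year, RoundNumber); race names repeat across seasons
print(f"Total races fetched: {df_raw.groupby(['Year', 'RoundNumber']).ngroups}")
print(f"Total records: {len(df_raw)}")
print(f"{'='*50}")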
'''

## Step 3: First Look at the Data

Before building any model, we need to understand our data:
- What columns do we have?
- Are there any missing values?
- What do the values look like?
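
The three questions above map onto a few pandas calls. This is a minimal sketch on an invented table, not the real `df_raw`:

```python
import numpy as np
import pandas as pd

# Toy data with deliberate gaps (invented values)
toy = pd.DataFrame({
    "Abbreviation": ["VER", "LEC", "HAM", "SAI"],
    "Position": [1.0, 2.0, np.nan, 4.0],
    "QualiPosition": [2.0, 1.0, 3.0, np.nan],
})

print(toy.dtypes)                      # which columns, and of what type?
missing_per_column = toy.isna().sum()  # how many missing values per column?
print(missing_per_column)
print(toy.describe())                  # what do the numeric values look like?

# Rows that have both fields needed for modelling
complete_rows = toy.dropna(subset=["Position", "QualiPosition"])
```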

In [None]:
'''
# Display first few rows
print("First 5 rows of data:")

print(df_raw.head())
print("\n" + "="*50)
print("Data Info:")
df_raw.info()  # info() prints its own summary and returns None

print("\n" + "="*50)
print("Key columns:")
print(df_raw[['DriverNumber', 'Abbreviation', 'TeamName', 'Position', 'QualiPosition', 'GridPosition']].head(10))
'''

## Step 4: Exploratory Data Analysis (EDA)

**What is EDA?**
EDA is the process of analyzing data to discover patterns, spot anomalies, and test assumptions. This helps us:
1. Understand which features are important
2. Identify relationships between variables
3. Detect outliers or missing data

Let's visualize some key relationships!
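
Before plotting, a single number can already summarize a relationship. As a small sketch on invented data, the Pearson correlation between grid and finishing position measures how closely the two move together, and the per-driver difference shows places gained or lost.

```python
import pandas as pd

# Toy data for six drivers in one race (invented values)
toy = pd.DataFrame({
    "GridPosition": [1, 2, 3, 4, 5, 6],
    "Position":     [2, 1, 3, 6, 4, 5],
})

# Pearson correlation: close to 1 means the finishing order mostly follows the grid
pearson = toy["GridPosition"].corr(toy["Position"])
print(f"Correlation between grid and finish: {pearson:.2f}")

# Positive = places gained during the race, negative = places lost
places_gained = toy["GridPosition"] - toy["Position"]
print(places_gained.tolist())
```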

In [None]:
'''
# Create a figure with multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Qualifying Position vs Race Position
# This shows how starting position affects finishing position
ax1 = axes[0, 0]
valid_data = df_raw.dropna(subset=['QualiPosition', 'Position'])
ax1.scatter(valid_data['QualiPosition'], valid_data['Position'], alpha=0.5)
ax1.plot([1, 20], [1, 20], 'r--', label='Finish = qualifying position')
ax1.set_xlabel('Qualifying Position', fontsize=12)
ax1.set_ylabel('Final Race Position', fontsize=12)
ax1.set_title('Qualifying Position vs Final Position\n(Shows importance of qualifying)', fontsize=14)
ax1.legend()
ax1.grid(True, alpha=0.3)

# 2. Distribution of finishing positions
ax2 = axes[0, 1]
df_raw['Position'].value_counts().sort_index().plot(kind='bar', ax=ax2, color='skyblue')
ax2.set_xlabel('Final Position', fontsize=12)
ax2.set_ylabel('Frequency', fontsize=12)
ax2.set_title('Distribution of Final Positions', fontsize=14)
ax2.grid(True, alpha=0.3, axis='y')

# 3. Wins by team
ax3 = axes[1, 0]
winners = df_raw[df_raw['Position'] == 1]
team_wins = winners['TeamName'].value_counts().head(10)
team_wins.plot(kind='barh', ax=ax3, color='coral')
ax3.set_xlabel('Number of Wins', fontsize=12)
ax3.set_title('Race Wins by Team (Top 10)', fontsize=14)
ax3.grid(True, alpha=0.3, axis='x')

# 4. Correlation heatmap for numerical features
ax4 = axes[1, 1]
numerical_cols = ['Position', 'QualiPosition', 'GridPosition', 'Points']
corr_data = df_raw[numerical_cols].dropna()
correlation = corr_data.corr()
sns.heatmap(correlation, annot=True, fmt='.2f', cmap='coolwarm', ax=ax4, center=0)
ax4.set_title('Correlation Between Features', fontsize=14)

plt.tight_layout()
import os
os.makedirs('./data', exist_ok=True)  # savefig fails if the folder is missing
plt.savefig('./data/eda_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nKey Insights from EDA:")
print("1. Qualifying position strongly influences final position")
print("2. Some teams dominate race wins")
print("3. Grid position and qualifying position are highly correlated")
'''

## Step 5: Feature Engineering

**What is Feature Engineering?**
It's the process of creating new features (variables) from existing data to help the model make better predictions.

**Features we'll create:**
1. **Driver Performance**: Average finishing position in previous races
2. **Team Performance**: Team's average position
3. **Starting Position Quality**: Qualifying performance
4. **Driver Experience**: Number of races completed
5. **Recent Form**: Performance in last 3 races
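
Historical features must only use races that happened before the one being predicted, otherwise the model sees the answer. The sketch below shows the pattern on invented data: compute a running statistic per driver, then `shift(1)` so each row only sees earlier races. The first race of each driver has no history and stays empty.

```python
import pandas as pd

# Toy data: two drivers over four rounds (invented values)
toy = pd.DataFrame({
    "Abbreviation": ["VER", "HAM", "VER", "HAM", "VER", "HAM", "VER", "HAM"],
    "RoundNumber":  [1, 1, 2, 2, 3, 3, 4, 4],
    "Position":     [1, 4, 3, 2, 2, 6, 4, 3],
}).sort_values("RoundNumber", kind="stable").reset_index(drop=True)

# Average of all previous finishes (shift(1) excludes the current race)
toy["DriverAvgPosition"] = toy.groupby("Abbreviation")["Position"].transform(
    lambda s: s.expanding().mean().shift(1)
)

# Average of up to the last 3 previous finishes
toy["RecentForm"] = toy.groupby("Abbreviation")["Position"].transform(
    lambda s: s.rolling(window=3, min_periods=1).mean().shift(1)
)
print(toy)
```

For VER, whose finishes are 1, 3, 2, 4, the running average before each race is empty, 1.0, 2.0, 2.0.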

In [None]:
'''
# Make a copy of the data
df = df_raw.copy()

# Clean data: remove rows without essential information
df = df.dropna(subset=['Position', 'QualiPosition'])

# Convert position to integer
df['Position'] = df['Position'].astype(int)
df['QualiPosition'] = df['QualiPosition'].astype(int)

# Sort by year and round to ensure chronological order
df = df.sort_values(['Year', 'RoundNumber']).reset_index(drop=True)

print("Creating features...\n")

# Feature 1: Driver's average position in previous races
df['DriverAvgPosition'] = df.groupby('Abbreviation')['Position'].transform(
    lambda x: x.expanding().mean().shift(1)
)

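# Feature 2: Team's average position
# Caveat: each team has two cars per race, so after shift(1) the second car's
# row still includes its teammate's result from the same race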
df['TeamAvgPosition'] = df.groupby('TeamName')['Position'].transform(
    lambda x: x.expanding().mean().shift(1)
)

# Feature 3: Driver's race count (experience)
df['DriverRaceCount'] = df.groupby('Abbreviation').cumcount()

# Feature 4: Recent form - average of last 3 races
df['RecentForm'] = df.groupby('Abbreviation')['Position'].transform(
    lambda x: x.rolling(window=3, min_periods=1).mean().shift(1)
)

# Feature 5: Qualifying improvement - difference from average quali
df['QualiImprovement'] = df.groupby('Abbreviation')['QualiPosition'].transform(
    lambda x: x.expanding().mean().shift(1)
) - df['QualiPosition']

# Feature 6: Grid penalty (difference between qualifying and grid position)
df['GridPenalty'] = df['GridPosition'] - df['QualiPosition']

# Encode categorical variables
le_driver = LabelEncoder()
le_team = LabelEncoder()
le_circuit = LabelEncoder()

df['Driver_Encoded'] = le_driver.fit_transform(df['Abbreviation'])
df['Team_Encoded'] = le_team.fit_transform(df['TeamName'])
df['Circuit_Encoded'] = le_circuit.fit_transform(df['RaceName'])

print("Features created successfully!\n")
print("New features:")
print(df[['Abbreviation', 'Position', 'QualiPosition', 'DriverAvgPosition', 
          'TeamAvgPosition', 'DriverRaceCount', 'RecentForm']].head(15))
'''

## Step 6: Prepare Data for Machine Learning

**Key Concepts:**

1. **Features (X)**: The input variables our model uses to make predictions
2. **Target (y)**: What we're trying to predict (final race position)
3. **Train/Test Split**: We split data into:
   - **Training set**: Used to teach the model
   - **Test set**: Used to evaluate how well the model performs on unseen data

**Why split the data?**
If we test on the same data we trained on, we can't tell if the model actually learned patterns or just memorized the training data.
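
One caveat for race data: a random split mixes past and future races, so the model can be tested on a race that happened before some of its training races. A chronological split is one alternative. The sketch below is an assumption about how this could be done, not the notebook's method; it holds out the most recent races of an invented table.

```python
import pandas as pd

# Toy data: four races, two drivers each (invented values)
toy = pd.DataFrame({
    "Year":          [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "RoundNumber":   [1, 1, 2, 2, 1, 1, 2, 2],
    "QualiPosition": [1, 2, 2, 1, 1, 2, 1, 2],
    "Position":      [1, 2, 1, 2, 2, 1, 1, 2],
})

# One sortable key per race, e.g. 202301 for round 1 of 2023
race_key = toy["Year"] * 100 + toy["RoundNumber"]
ordered_races = sorted(race_key.unique())

# Hold out roughly the last 25% of races (at least one)
n_test_races = max(1, round(len(ordered_races) * 0.25))
test_keys = ordered_races[-n_test_races:]

is_test = race_key.isin(test_keys)
train, test = toy[~is_test], toy[is_test]
print(f"Train rows: {len(train)}, test rows: {len(test)}")
```

With this split every test race comes after every training race, which is closer to how the model would be used on a future Grand Prix.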

In [None]:
'''
# Remove rows with missing values in our features
df_clean = df.dropna(subset=[
    'QualiPosition', 'DriverAvgPosition', 'TeamAvgPosition', 
    'DriverRaceCount', 'RecentForm', 'GridPenalty'
])

# Define features (X) - what the model will use to make predictions
feature_columns = [
    'QualiPosition',        # Starting position from qualifying
    'DriverAvgPosition',    # Driver's historical average
    'TeamAvgPosition',      # Team's performance
    'DriverRaceCount',      # Driver experience
    'RecentForm',           # Recent performance
    'GridPenalty',          # Any grid penalties
    'Driver_Encoded',       # Driver identity (encoded)
    'Team_Encoded',         # Team identity (encoded)
    'Circuit_Encoded'       # Circuit characteristics (encoded)
]

X = df_clean[feature_columns]
y = df_clean['Position']  # Target: final race position

# Split data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("Data Preparation Complete!\n")
print(f"Training set size: {len(X_train)} samples")
print(f"Test set size: {len(X_test)} samples")
print(f"\nFeatures being used: {len(feature_columns)}")
print(f"Feature names: {feature_columns}")
print(f"\nTarget variable range: {y.min()} to {y.max()}")
'''

## Step 7: Train Machine Learning Models

We'll train two types of models and compare them:

### 1. Random Forest Regressor
**What is it?**
- Creates many decision trees and averages their predictions
- Each tree learns different patterns from the data
- Very robust and handles complex relationships well

**Why use it?**
- Good for beginners - easy to understand
- Handles non-linear relationships
- Less prone to overfitting than a single decision tree
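
To make the "many trees, averaged" idea concrete, the sketch below trains a small forest on synthetic data (randomly generated, not real F1 results) and checks that the forest's prediction matches the mean of its individual trees' predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: finishing position is roughly grid position plus noise
rng = np.random.default_rng(42)
grid = rng.integers(1, 21, size=300)
finish = np.clip(grid + rng.normal(0, 3, size=300), 1, 20)
X = grid.reshape(-1, 1).astype(float)

forest = RandomForestRegressor(n_estimators=25, random_state=42).fit(X, finish)

# Ask every tree individually, then average by hand
X_new = np.array([[1.0], [10.0], [20.0]])
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("Average of trees:", manual_average)
print("Forest predict(): ", forest.predict(X_new))
```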

### 2. XGBoost (Extreme Gradient Boosting)
**What is it?**
- Builds trees sequentially, each correcting errors of previous ones
- A consistently strong performer on tabular data like ours
- Often wins Kaggle competitions

**Why use it?**
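- Often more accurate than Random Forest on tabular data, though this is not guaranteed
- Fast training and prediction
- Handles missing values natively, without a separate imputation step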
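
The "each tree corrects the errors of the previous ones" idea can be sketched by hand. The loop below is a simplified illustration using scikit-learn's `DecisionTreeRegressor` on synthetic data; it is not XGBoost's actual implementation, which adds regularization and many optimizations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (randomly generated, not real F1 results)
rng = np.random.default_rng(0)
X = rng.uniform(1, 20, size=(200, 1))
y = 0.7 * X[:, 0] + 3 + rng.normal(0, 2, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())        # start from a constant guess
initial_mae = np.mean(np.abs(y - prediction))

for _ in range(50):
    residuals = y - prediction                # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # small corrective step

final_mae = np.mean(np.abs(y - prediction))
print(f"Training MAE before boosting: {initial_mae:.2f}, after: {final_mae:.2f}")
```

The learning rate plays the same role as `learning_rate` in `XGBRegressor`: smaller steps need more trees but tend to generalize better.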

In [None]:
'''
print("Training Machine Learning Models...\n")
print("="*60)

# Model 1: Random Forest Regressor
print("\n1. Training Random Forest Model...")
rf_model = RandomForestRegressor(
    n_estimators=200,      # Number of trees in the forest
    max_depth=15,          # Maximum depth of each tree
    min_samples_split=5,   # Minimum samples to split a node
    random_state=42,       # For reproducibility
    n_jobs=-1              # Use all CPU cores
)

rf_model.fit(X_train, y_train)
print("   Random Forest trained!")

# Make predictions
rf_predictions = rf_model.predict(X_test)

# Evaluate
rf_mae = mean_absolute_error(y_test, rf_predictions)
print(f"   Mean Absolute Error: {rf_mae:.2f} positions")
print(f"   This means on average, predictions are off by {rf_mae:.2f} positions")

# Model 2: XGBoost
print("\n2. Training XGBoost Model...")
xgb_model = xgb.XGBRegressor(
    n_estimators=200,
    max_depth=8,
    learning_rate=0.1,
    random_state=42,
    n_jobs=-1
)

xgb_model.fit(X_train, y_train)
print("   XGBoost trained!")

# Make predictions
xgb_predictions = xgb_model.predict(X_test)

# Evaluate
xgb_mae = mean_absolute_error(y_test, xgb_predictions)
print(f"   Mean Absolute Error: {xgb_mae:.2f} positions")
print(f"   This means on average, predictions are off by {xgb_mae:.2f} positions")

print("\n" + "="*60)
print("Model Comparison:")
print(f"Random Forest MAE: {rf_mae:.2f}")
print(f"XGBoost MAE: {xgb_mae:.2f}")
print(f"\nBest Model: {'XGBoost' if xgb_mae < rf_mae else 'Random Forest'}")
print("="*60)
'''

## Step 8: Feature Importance Analysis

**What is Feature Importance?**
It tells us which features (variables) have the most influence on predictions. This helps us:
1. Understand what drives race outcomes
2. Simplify the model by removing unimportant features
3. Gain insights into F1 racing
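
As a sanity check of what importance scores mean, the sketch below trains a forest on synthetic data with one informative feature and one column of pure noise. The feature names and data are invented for illustration. The informative feature should receive nearly all of the importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on qualifying position only
rng = np.random.default_rng(0)
n = 400
quali = rng.integers(1, 21, size=n).astype(float)
noise_feature = rng.normal(size=n)            # unrelated to the target
finish = quali + rng.normal(0, 2, size=n)

X = np.column_stack([quali, noise_feature])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, finish)

importances = dict(zip(["QualiPosition", "RandomNoise"], model.feature_importances_))
for name, score in importances.items():
    print(f"{name}: {score:.3f}")
```

Impurity-based importances like these sum to 1 and can overrate features with many distinct values, such as label-encoded driver or circuit IDs, so they are best read as a rough ranking.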

In [None]:
'''
# Get feature importances from both models
rf_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

xgb_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': xgb_model.feature_importances_
}).sort_values('Importance', ascending=False)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Random Forest importance
axes[0].barh(rf_importance['Feature'], rf_importance['Importance'], color='steelblue')
axes[0].set_xlabel('Importance Score', fontsize=12)
axes[0].set_title('Random Forest - Feature Importance', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='x')

# XGBoost importance
axes[1].barh(xgb_importance['Feature'], xgb_importance['Importance'], color='coral')
axes[1].set_xlabel('Importance Score', fontsize=12)
axes[1].set_title('XGBoost - Feature Importance', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.savefig('./data/feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nTop 5 Most Important Features (XGBoost):")
print(xgb_importance.head())
'''

## Step 9: Model Evaluation and Visualization

Let's visualize how well our model performs by comparing:
- Actual positions vs Predicted positions
- Distribution of prediction errors
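
The error metrics used in this step are simple to compute by hand. The sketch below uses invented numbers and also adds a naive baseline (predict that everyone finishes where they started), which is an addition for comparison rather than something the notebook computes.

```python
import numpy as np

# Toy values for five drivers (invented)
actual    = np.array([1, 2, 3, 4, 5])
predicted = np.array([1.5, 2.0, 4.0, 4.0, 4.5])
grid      = np.array([2, 1, 3, 5, 4])   # naive baseline: finish = grid position

errors = actual - predicted
mae = np.mean(np.abs(errors))            # average miss, in positions
median_ae = np.median(np.abs(errors))    # typical miss, less sensitive to outliers
baseline_mae = np.mean(np.abs(actual - grid))

print(f"Model MAE: {mae:.2f}")
print(f"Median absolute error: {median_ae:.2f}")
print(f"Baseline MAE: {baseline_mae:.2f}")
```

A model is only useful if its MAE is clearly below the baseline's.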

In [None]:
'''
# Create evaluation visualizations
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Choose the better model
best_predictions = xgb_predictions if xgb_mae < rf_mae else rf_predictions
best_model_name = 'XGBoost' if xgb_mae < rf_mae else 'Random Forest'

# 1. Actual vs Predicted scatter plot
axes[0, 0].scatter(y_test, best_predictions, alpha=0.6, color='purple')
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
                'r--', lw=2, label='Perfect Prediction')
axes[0, 0].set_xlabel('Actual Position', fontsize=12)
axes[0, 0].set_ylabel('Predicted Position', fontsize=12)
axes[0, 0].set_title(f'Actual vs Predicted Positions\n({best_model_name})', 
                     fontsize=14, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Prediction error distribution
errors = y_test - best_predictions
axes[0, 1].hist(errors, bins=30, edgecolor='black', color='skyblue', alpha=0.7)
axes[0, 1].axvline(x=0, color='red', linestyle='--', linewidth=2, label='Zero Error')
axes[0, 1].set_xlabel('Prediction Error (positions)', fontsize=12)
axes[0, 1].set_ylabel('Frequency', fontsize=12)
axes[0, 1].set_title('Distribution of Prediction Errors', fontsize=14, fontweight='bold')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3, axis='y')

# 3. Error by actual position
error_df = pd.DataFrame({'Actual': y_test, 'Error': np.abs(errors)})
error_by_position = error_df.groupby('Actual')['Error'].mean()
axes[1, 0].bar(error_by_position.index, error_by_position.values, color='orange', alpha=0.7)
axes[1, 0].set_xlabel('Actual Race Position', fontsize=12)
axes[1, 0].set_ylabel('Mean Absolute Error', fontsize=12)
axes[1, 0].set_title('Prediction Accuracy by Position', fontsize=14, fontweight='bold')
axes[1, 0].grid(True, alpha=0.3, axis='y')

# 4. Model comparison
model_comparison = pd.DataFrame({
    'Model': ['Random Forest', 'XGBoost'],
    'MAE': [rf_mae, xgb_mae]
})
axes[1, 1].bar(model_comparison['Model'], model_comparison['MAE'], 
               color=['steelblue', 'coral'], alpha=0.7)
axes[1, 1].set_ylabel('Mean Absolute Error (positions)', fontsize=12)
axes[1, 1].set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
axes[1, 1].grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for i, v in enumerate(model_comparison['MAE']):
    axes[1, 1].text(i, v + 0.05, f'{v:.2f}', ha='center', fontweight='bold')

plt.tight_layout()
plt.savefig('./data/model_evaluation.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"\n{best_model_name} Performance Summary:")
print(f"Mean Absolute Error: {mean_absolute_error(y_test, best_predictions):.2f} positions")
print(f"Standard Deviation of Errors: {np.std(errors):.2f}")
print(f"Median Absolute Error: {np.median(np.abs(errors)):.2f}")
'''

## Step 10: Save the Model

Let's save our trained model so we can use it later without retraining!
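
As a minimal sketch of the save-and-reload round trip, the example below writes a fitted `LabelEncoder` to a temporary folder and loads it back. The temporary folder is only for illustration; note that `joblib.dump` does not create missing directories, so the target folder has to exist first.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit(["VER", "HAM", "LEC"])

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "models"
    model_dir.mkdir(parents=True, exist_ok=True)   # create the folder before saving
    path = model_dir / "driver_encoder.pkl"

    joblib.dump(encoder, path)
    restored = joblib.load(path)

restored_classes = list(restored.classes_)
print(restored_classes)
```

Only load `.pkl` files from sources you trust, because unpickling can execute arbitrary code.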

In [None]:
'''
import os
import joblib

# Save the best model
best_model = xgb_model if xgb_mae < rf_mae else rf_model

# Save model and encoders (create the output folder first)
os.makedirs('./models', exist_ok=True)
joblib.dump(best_model, './models/f1_position_predictor.pkl')
joblib.dump(le_driver, './models/driver_encoder.pkl')
joblib.dump(le_team, './models/team_encoder.pkl')
joblib.dump(le_circuit, './models/circuit_encoder.pkl')

# Save feature names
with open('./models/feature_names.txt', 'w') as f:
    f.write(','.join(feature_columns))

print("Model and encoders saved successfully!")
print(f"\nSaved files:")
print("  - ./models/f1_position_predictor.pkl")
print("  - ./models/driver_encoder.pkl")
print("  - ./models/team_encoder.pkl")
print("  - ./models/circuit_encoder.pkl")
print("  - ./models/feature_names.txt")
'''

## Step 11: Make Predictions for a New Race

Now let's use our model to predict race results! We'll create a function that takes qualifying results and predicts the race outcome.
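
A regressor returns fractional positions such as 2.4, and two drivers can receive the same value. One way to turn these scores into a finishing order is to rank them, as in the sketch below. The scores are invented, and the column names are illustrative rather than taken from the notebook.

```python
import pandas as pd

# Invented model outputs for four drivers
preds = pd.DataFrame({
    "Driver": ["VER", "PER", "LEC", "SAI"],
    "PredictedScore": [1.8, 2.4, 2.4, 5.1],
})

# Lower score = better finish; a stable sort keeps the input order for ties
preds = preds.sort_values("PredictedScore", kind="stable").reset_index(drop=True)

# rank(method="first") gives every driver a unique integer position
preds["PredictedFinish"] = preds["PredictedScore"].rank(method="first").astype(int)
print(preds)
```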

In [None]:
'''
def predict_race_result(quali_results_dict):
    """
    Predict race results based on qualifying positions and historical data.
    
    Parameters:
    -----------
    quali_results_dict : list of dict
        One dict per driver with keys: 'driver', 'team', 'quali_position', 'circuit'
    
    Returns:
    --------
    pd.DataFrame
        Predicted race results
    """
    predictions = []
    
    for driver_data in quali_results_dict:
        driver = driver_data['driver']
        team = driver_data['team']
        quali_pos = driver_data['quali_position']
        circuit = driver_data['circuit']
        
        # Get historical stats
        driver_stats = df_clean[df_clean['Abbreviation'] == driver]
        team_stats = df_clean[df_clean['TeamName'] == team]
        
        if len(driver_stats) == 0 or len(team_stats) == 0:
            print(f"Warning: No historical data for {driver} or {team}")
            continue
        
        # Create feature vector
        features = {
            'QualiPosition': quali_pos,
            'DriverAvgPosition': driver_stats['DriverAvgPosition'].iloc[-1],
            'TeamAvgPosition': team_stats['TeamAvgPosition'].iloc[-1],
            'DriverRaceCount': driver_stats['DriverRaceCount'].iloc[-1] + 1,
            'RecentForm': driver_stats['RecentForm'].iloc[-1],
            'GridPenalty': 0,  # Assume no penalty
            'Driver_Encoded': le_driver.transform([driver])[0],
            'Team_Encoded': le_team.transform([team])[0],
            'Circuit_Encoded': le_circuit.transform([circuit])[0]
        }
        
        # Make prediction
        X_pred = pd.DataFrame([features])[feature_columns]
        predicted_position = best_model.predict(X_pred)[0]
        
        predictions.append({
            'Driver': driver,
            'Team': team,
            'Qualifying Position': quali_pos,
            'Predicted Race Position': round(predicted_position, 1)
        })
    
    results_df = pd.DataFrame(predictions)
    results_df = results_df.sort_values('Predicted Race Position').reset_index(drop=True)
    results_df.index = results_df.index + 1  # Start from 1
    
    return results_df

# Example: Predict a hypothetical race
print("\nExample Prediction: Hypothetical Race\n")
print("="*60)

# Sample qualifying results (you can modify these)
example_quali = [
    {'driver': 'VER', 'team': 'Red Bull Racing', 'quali_position': 1, 'circuit': 'Bahrain Grand Prix'},
    {'driver': 'PER', 'team': 'Red Bull Racing', 'quali_position': 2, 'circuit': 'Bahrain Grand Prix'},
    {'driver': 'LEC', 'team': 'Ferrari', 'quali_position': 3, 'circuit': 'Bahrain Grand Prix'},
    {'driver': 'SAI', 'team': 'Ferrari', 'quali_position': 4, 'circuit': 'Bahrain Grand Prix'},
    {'driver': 'HAM', 'team': 'Mercedes', 'quali_position': 5, 'circuit': 'Bahrain Grand Prix'},
]

predicted_results = predict_race_result(example_quali)
print(predicted_results)
print("\n" + "="*60)
'''

## Conclusion and Next Steps

Congratulations! You've built a complete F1 race prediction model! Here's what you learned:

### What We Covered:
1. **Data Collection**: Using APIs (FastF1) to get real F1 data
2. **Data Exploration**: Understanding patterns through visualizations
3. **Feature Engineering**: Creating meaningful variables from raw data
4. **Model Training**: Using Random Forest and XGBoost algorithms
5. **Model Evaluation**: Measuring performance with MAE
6. **Making Predictions**: Using the model for new races

### Model Performance:
- Expect an average error of roughly 2-3 positions; the exact MAE depends on the seasons and features you use
- The model picks up strong patterns such as the importance of qualifying position
- Top teams and drivers are usually predicted more accurately than the midfield

### Ways to Improve:
1. **More Data**: Include more seasons (2018-2023)
2. **Weather Data**: Add weather conditions as features
3. **Tire Strategy**: Include pit stop and tire compound data
4. **Circuit Characteristics**: Add track-specific features (length, corners, etc.)
5. **Driver Head-to-Head**: Add historical performance against specific drivers
6. **Deep Learning**: Try neural networks, although gains over tree models are not guaranteed on a dataset this small

### Practice Exercises:
1. Try predicting results for the 2024 season
2. Add new features and see if accuracy improves
3. Create a classification model for podium predictions (top 3)
4. Visualize prediction accuracy for specific drivers/teams
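
As a starting point for exercise 3, the regression target can be turned into a binary podium label. This is a small sketch on invented data; the `Podium` column name is an illustrative choice.

```python
import pandas as pd

# Toy finishing positions (invented values)
toy = pd.DataFrame({
    "Abbreviation": ["VER", "LEC", "HAM", "SAI", "ALB"],
    "Position": [1, 2, 3, 4, 10],
})

# 1 if the driver finished in the top 3, otherwise 0
toy["Podium"] = (toy["Position"] <= 3).astype(int)
podium_rate = toy["Podium"].mean()
print(toy)
print(f"Share of podium finishes: {podium_rate:.0%}")
```

In a full grid only 3 of about 20 drivers reach the podium, so the classes are imbalanced and plain accuracy can look better than the model really is.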

Keep experimenting and learning!