<a href="https://colab.research.google.com/github/JUANITOsvg/proyecto_ds_jueves/blob/main/pipelines/notebooks/modelo_implementado.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Formula 1 Race Prediction Models
# This notebook implements two ML models for F1 race data analysis

print("Starting F1 ML Model Development...")

# **Formula 1 Machine Learning Models**

This notebook implements two classification models built on several linked F1 datasets:

1. **Race Result Predictor**: Predicts podium finishes (top 3 positions)
2. **Driver Performance Classifier**: Classifies drivers into performance tiers

## Dataset Overview
We'll use multiple interconnected F1 datasets:
- **races.csv**: Race information (circuits, dates, years)
- **results.csv**: Race results and performance data
- **drivers.csv**: Driver information and demographics
- **constructors.csv**: Team/constructor data
- **qualifying.csv**: Qualifying session results
- **lap_times.csv**: Detailed lap timing data
- **pit_stops.csv**: Pit stop strategy data
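
The tables are linked through shared ID columns (`raceId`, `driverId`, `constructorId`), which is how section 3 merges them. As a minimal sketch of that join pattern, the toy frames below are hypothetical stand-ins for the real CSVs; the `validate` argument is an optional pandas check, not something the notebook itself uses, that raises if a lookup table unexpectedly contains duplicate keys.

```python
import pandas as pd

# Hypothetical miniature versions of results.csv, races.csv and drivers.csv
results_toy = pd.DataFrame({
    "raceId": [1, 1, 2],
    "driverId": [10, 11, 10],
    "position": [1, 2, 3],
})
races_toy = pd.DataFrame({
    "raceId": [1, 2],
    "year": [2023, 2023],
    "name": ["Race A", "Race B"],
})
drivers_toy = pd.DataFrame({
    "driverId": [10, 11],
    "surname": ["Alpha", "Beta"],
})

# Left joins keep every result row; validate="many_to_one" asserts that
# each key appears at most once in the lookup table on the right.
merged = (
    results_toy
    .merge(races_toy, on="raceId", how="left", validate="many_to_one")
    .merge(drivers_toy, on="driverId", how="left", validate="many_to_one")
)

print(merged)
```

If a lookup table had duplicate keys, the left join would silently multiply result rows; with `validate` set, pandas raises a `MergeError` instead.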

## 1. Data Imports and Setup

In [None]:
# Data Analysis and Visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Machine Learning
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Additional utilities
import joblib
import os

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("All libraries imported successfully!")

## 2. Data Loading and Initial Exploration

In [None]:
# Load all F1 datasets
data_path = '../data/'

# Primary datasets
races = pd.read_csv(data_path + 'races.csv')
results = pd.read_csv(data_path + 'results.csv')
drivers = pd.read_csv(data_path + 'drivers.csv')
constructors = pd.read_csv(data_path + 'constructors.csv')
qualifying = pd.read_csv(data_path + 'qualifying.csv')

# Additional datasets for feature engineering
lap_times = pd.read_csv(data_path + 'lap_times.csv')
pit_stops = pd.read_csv(data_path + 'pit_stops.csv')
driver_standings = pd.read_csv(data_path + 'driver_standings.csv')
constructor_standings = pd.read_csv(data_path + 'constructor_standings.csv')
status = pd.read_csv(data_path + 'status.csv')

print("Dataset shapes:")
print(f"Races: {races.shape}")
print(f"Results: {results.shape}")
print(f"Drivers: {drivers.shape}")
print(f"Constructors: {constructors.shape}")
print(f"Qualifying: {qualifying.shape}")
print(f"Lap Times: {lap_times.shape}")
print(f"Pit Stops: {pit_stops.shape}")
print(f"Driver Standings: {driver_standings.shape}")
print(f"Constructor Standings: {constructor_standings.shape}")
print(f"Status: {status.shape}")

print("\nAll datasets loaded successfully!")

In [None]:
# Explore key datasets
print("RACES Dataset:")
print(races.head())
print(f"\nDate range: {races['date'].min()} to {races['date'].max()}")
print(f"Total races: {len(races)}")
print(f"Years covered: {races['year'].min()} to {races['year'].max()}")

print("\nRESULTS Dataset:")
print(results.head())
print(f"\nTotal race results: {len(results)}")
print(f"Positions range: {results['position'].min()} to {results['position'].max()}")

print("\nDRIVERS Dataset:")
print(drivers.head())
print(f"\nTotal drivers: {len(drivers)}")
print(f"Nationalities: {drivers['nationality'].nunique()}")

print("\nCONSTRUCTORS Dataset:")
print(constructors.head())
print(f"\nTotal constructors: {len(constructors)}")
print(f"Constructor nationalities: {constructors['nationality'].nunique()}")

## 3. Data Preprocessing and Feature Engineering

In [None]:
# Create comprehensive F1 dataset by merging multiple tables
def create_f1_dataset():
    """
    Create a comprehensive F1 dataset by merging multiple related tables
    """
    # Start with results as the base
    df = results.copy()
    
    # Add race information
    df = df.merge(races[['raceId', 'year', 'round', 'circuitId', 'name', 'date']], 
                  on='raceId', how='left')
    df.rename(columns={'name': 'race_name'}, inplace=True)
    
    # Add driver information
    df = df.merge(drivers[['driverId', 'driverRef', 'code', 'forename', 'surname', 
                          'dob', 'nationality']], 
                  on='driverId', how='left')
    df.rename(columns={'nationality': 'driver_nationality'}, inplace=True)
    
    # Add constructor information
    df = df.merge(constructors[['constructorId', 'constructorRef', 'name', 'nationality']], 
                  on='constructorId', how='left')
    df.rename(columns={'name': 'constructor_name', 'nationality': 'constructor_nationality'}, inplace=True)
    
    # Add qualifying information
    qualifying_subset = qualifying[['raceId', 'driverId', 'position']].rename(
        columns={'position': 'qualifying_position'})
    df = df.merge(qualifying_subset, on=['raceId', 'driverId'], how='left')
    
    # Add status information
    df = df.merge(status[['statusId', 'status']], on='statusId', how='left')
    
    return df

# Create the main dataset
f1_data = create_f1_dataset()
print(f"Comprehensive F1 dataset created with shape: {f1_data.shape}")
print(f"Columns: {list(f1_data.columns)}")

# Display sample
print("\nSample of merged dataset:")
print(f1_data[['year', 'race_name', 'forename', 'surname', 'constructor_name', 
               'grid', 'position', 'points', 'status']].head(10))

In [None]:
# Feature Engineering for ML Models
def engineer_features(df):
    """
    Create additional features for machine learning models
    """
    df = df.copy()
    
    # Convert date to datetime
    df['date'] = pd.to_datetime(df['date'])
    
    # Create age at race date
    df['dob'] = pd.to_datetime(df['dob'])
    df['driver_age'] = (df['date'] - df['dob']).dt.days / 365.25
    
    # Create target variables for our models
    
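    # Model 1: Podium finish prediction (top 3 positions)
    # Coerce to numeric first: if the CSV stores non-finishers as text (e.g. '\N'),
    # the column loads as strings and the comparisons below would never match
    df['position'] = pd.to_numeric(df['position'], errors='coerce')
    df['podium_finish'] = (df['position'].isin([1, 2, 3])).astype(int)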
    
    # Model 2: Driver performance tiers
    # Calculate driver career statistics up to each race
    driver_stats = []
    
    for idx, row in df.iterrows():
        current_driver = row['driverId']
        current_date = row['date']
        
        # Get all previous races for this driver
        prev_races = df[(df['driverId'] == current_driver) & 
                       (df['date'] < current_date)]
        
        if len(prev_races) == 0:
            # First race for this driver
            career_wins = 0
            career_podiums = 0
            career_points = 0
            career_races = 0
            avg_position = np.nan
        else:
            career_wins = len(prev_races[prev_races['position'] == 1])
            career_podiums = len(prev_races[prev_races['position'].isin([1, 2, 3])])
            career_points = prev_races['points'].sum()
            career_races = len(prev_races[prev_races['position'].notna()])
            avg_position = prev_races['position'].mean() if career_races > 0 else np.nan
        
        driver_stats.append({
            'career_wins': career_wins,
            'career_podiums': career_podiums,
            'career_points': career_points,
            'career_races': career_races,
            'avg_position': avg_position
        })
    
    # Add driver statistics to dataframe
    driver_stats_df = pd.DataFrame(driver_stats)
    df = pd.concat([df, driver_stats_df], axis=1)
    
    # Create performance tier based on career achievements
    def assign_performance_tier(row):
        if pd.isna(row['avg_position']) or row['career_races'] < 5:
            return 'Rookie'  # New drivers
        elif row['career_wins'] >= 5:
            return 'Elite'   # Multiple race winners
        elif row['career_podiums'] >= 5:
            return 'Strong'  # Regular podium finishers
        elif row['avg_position'] <= 10:
            return 'Solid'   # Consistent points scorers
        else:
            return 'Developing'  # Others
    
    df['performance_tier'] = df.apply(assign_performance_tier, axis=1)
    
    # Additional engineered features
    df['grid_position'] = df['grid'].fillna(df['grid'].max() + 1)  # Fill missing grid positions
    df['qualifying_grid_diff'] = df['qualifying_position'] - df['grid_position']
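    # Open-ended outer bins so seasons before 2009 or after 2024 still get an era label
    df['championship_era'] = pd.cut(df['year'], 
                                   bins=[-np.inf, 2008, 2013, 2021, np.inf], 
                                   labels=['pre-2009', '2009-2013', '2014-2021', '2022+'])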
    
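    # Constructor historical performance (simplified): total podiums over the whole
    # dataset, so the count includes races later than the row it is attached to
    constructor_podiums = df.groupby('constructorId')['podium_finish'].sum().to_dict()
    df['constructor_total_podiums'] = df['constructorId'].map(constructor_podiums)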
    
    return df

# Apply feature engineering
print("Engineering features...")
f1_enhanced = engineer_features(f1_data)

print(f"Feature engineering completed!")
print(f"Enhanced dataset shape: {f1_enhanced.shape}")
print(f"\nTarget variable distributions:")
print(f"Podium finishes: {f1_enhanced['podium_finish'].value_counts()}")
print(f"Performance tiers: {f1_enhanced['performance_tier'].value_counts()}")

## 4. Exploratory Data Analysis and Visualization

In [None]:
# Exploratory Data Analysis
plt.figure(figsize=(15, 12))

# Plot 1: Podium finishes by starting grid position
plt.subplot(2, 3, 1)
podium_by_grid = f1_enhanced.groupby('grid_position')['podium_finish'].mean()
podium_by_grid.head(20).plot(kind='bar')
plt.title('Podium Probability by Grid Position')
plt.xlabel('Grid Position')
plt.ylabel('Podium Probability')
plt.xticks(rotation=45)

# Plot 2: Performance tier distribution
plt.subplot(2, 3, 2)
f1_enhanced['performance_tier'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Driver Performance Tier Distribution')

# Plot 3: Podium finishes over years
plt.subplot(2, 3, 3)
yearly_podiums = f1_enhanced.groupby('year')['podium_finish'].mean()
yearly_podiums.plot()
plt.title('Average Podium Rate by Year')
plt.xlabel('Year')
plt.ylabel('Podium Rate')

# Plot 4: Constructor performance
plt.subplot(2, 3, 4)
constructor_podiums = f1_enhanced.groupby('constructor_name')['podium_finish'].sum().sort_values(ascending=False).head(10)
constructor_podiums.plot(kind='barh')
plt.title('Top 10 Constructors by Total Podiums')

# Plot 5: Driver age vs performance
plt.subplot(2, 3, 5)
f1_enhanced.boxplot(column='driver_age', by='performance_tier', ax=plt.gca())
plt.title('Driver Age by Performance Tier')
plt.suptitle('')

# Plot 6: Grid position vs final position
plt.subplot(2, 3, 6)
valid_positions = f1_enhanced.dropna(subset=['grid_position', 'position'])
plt.scatter(valid_positions['grid_position'], valid_positions['position'], alpha=0.3)
plt.plot([1, 20], [1, 20], 'r--', label='Finish = grid')
plt.xlabel('Grid Position')
plt.ylabel('Final Position')
plt.title('Grid vs Final Position')
plt.legend()

plt.tight_layout()
plt.show()

# Additional statistics
print("KEY INSIGHTS:")
print(f"• Total podium finishes: {f1_enhanced['podium_finish'].sum()}")
print(f"• Podium rate from pole position: {f1_enhanced[f1_enhanced['grid_position']==1]['podium_finish'].mean():.2%}")
print(f"• Average driver age: {f1_enhanced['driver_age'].mean():.1f} years")
print(f"• Most successful constructor: {f1_enhanced.groupby('constructor_name')['podium_finish'].sum().idxmax()}")
print(f"• Years covered: {f1_enhanced['year'].min()}-{f1_enhanced['year'].max()}")

## 5. Model 1: Race Result Predictor (Podium Finish Prediction)

In [None]:
# Prepare data for Model 1: Podium Prediction
def prepare_podium_model_data(df):
    """
    Prepare features for podium prediction model
    """
    # Select relevant features
    feature_columns = [
        'grid_position', 'driver_age', 'career_wins', 'career_podiums', 
        'career_points', 'career_races', 'constructor_total_podiums',
        'year', 'round'
    ]
    
    # Categorical features
    categorical_features = ['constructor_name', 'driver_nationality', 'championship_era']
    
    # Create the feature matrix
    df_model = df.dropna(subset=['podium_finish'] + feature_columns)
    
    X_numeric = df_model[feature_columns]
    X_categorical = df_model[categorical_features]
    
    # Target variable
    y = df_model['podium_finish']
    
    return X_numeric, X_categorical, y, df_model

# Prepare data
X_num_podium, X_cat_podium, y_podium, data_podium = prepare_podium_model_data(f1_enhanced)

print(f"Model 1 - Podium Prediction Dataset:")
print(f"Features shape: {X_num_podium.shape}")
print(f"Categorical features shape: {X_cat_podium.shape}")
print(f"Target distribution: {y_podium.value_counts()}")
print(f"Podium rate: {y_podium.mean():.2%}")

# Create preprocessing pipeline for Model 1
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')

preprocessor_podium = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, X_num_podium.columns),
        ('cat', categorical_transformer, X_cat_podium.columns)
    ])

# Split the data
X_combined_podium = pd.concat([X_num_podium, X_cat_podium], axis=1)
X_train_podium, X_test_podium, y_train_podium, y_test_podium = train_test_split(
    X_combined_podium, y_podium, test_size=0.2, random_state=42, stratify=y_podium
)

print(f"\nTrain/Test Split:")
print(f"Training set: {X_train_podium.shape[0]} samples")
print(f"Test set: {X_test_podium.shape[0]} samples")
print(f"Train podium rate: {y_train_podium.mean():.2%}")
print(f"Test podium rate: {y_test_podium.mean():.2%}")

In [None]:
# Model 1: Train and evaluate multiple algorithms
models_podium = {
    'Logistic Regression': LogisticRegression(random_state=42, class_weight='balanced', max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

# Create pipelines
pipelines_podium = {}
for name, model in models_podium.items():
    pipelines_podium[name] = Pipeline([
        ('preprocessor', preprocessor_podium),
        ('classifier', model)
    ])

# Cross-validation evaluation
print("MODEL 1: PODIUM PREDICTION - Cross Validation Results")
print("=" * 60)

cv_scores_podium = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, pipeline in pipelines_podium.items():
    # Cross-validation scores
    cv_scores = cross_val_score(pipeline, X_train_podium, y_train_podium, 
                               cv=cv, scoring='roc_auc', n_jobs=-1)
    cv_scores_podium[name] = cv_scores
    
    print(f"{name}:")
    print(f"  ROC-AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

# Train final models and evaluate on test set
print(f"\nMODEL 1: PODIUM PREDICTION - Test Set Results")
print("=" * 60)

trained_models_podium = {}
for name, pipeline in pipelines_podium.items():
    # Train the model
    pipeline.fit(X_train_podium, y_train_podium)
    trained_models_podium[name] = pipeline
    
    # Predictions
    y_pred = pipeline.predict(X_test_podium)
    y_prob = pipeline.predict_proba(X_test_podium)[:, 1]
    
    # Metrics
    accuracy = accuracy_score(y_test_podium, y_pred)
    precision = precision_score(y_test_podium, y_pred)
    recall = recall_score(y_test_podium, y_pred)
    f1 = f1_score(y_test_podium, y_pred)
    roc_auc = roc_auc_score(y_test_podium, y_prob)
    
    print(f"\n{name}:")
    print(f"  Accuracy:  {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")
    print(f"  F1 Score:  {f1:.4f}")
    print(f"  ROC-AUC:   {roc_auc:.4f}")

# Select the best model for Model 1
best_model_name_podium = max(cv_scores_podium.keys(), 
                            key=lambda x: cv_scores_podium[x].mean())
best_model_podium = trained_models_podium[best_model_name_podium]

print(f"\nBest Model for Podium Prediction: {best_model_name_podium}")
print(f"Cross-validation ROC-AUC: {cv_scores_podium[best_model_name_podium].mean():.4f}")

## 6. Model 2: Driver Performance Classifier

In [None]:
# Prepare data for Model 2: Driver Performance Classification
def prepare_driver_performance_data(df):
    """
    Prepare features for driver performance tier classification
    """
    # Only include drivers with sufficient race history (exclude rookies for training)
    df_experienced = df[df['career_races'] >= 5].copy()
    
    # Feature columns for driver performance
    feature_columns = [
        'driver_age', 'career_races', 'career_wins', 'career_podiums', 
        'career_points', 'avg_position', 'constructor_total_podiums',
        'year', 'grid_position'
    ]
    
    # Categorical features
    categorical_features = ['driver_nationality', 'constructor_name', 'championship_era']
    
    # Create the feature matrix
    df_model = df_experienced.dropna(subset=['performance_tier'] + feature_columns)
    
    X_numeric = df_model[feature_columns]
    X_categorical = df_model[categorical_features]
    
    # Target variable ('Rookie' cannot occur here: the career_races >= 5 filter above removes those rows)
    y = df_model['performance_tier']
    
    return X_numeric, X_categorical, y, df_model

# Prepare data
X_num_perf, X_cat_perf, y_perf, data_perf = prepare_driver_performance_data(f1_enhanced)

print(f"Model 2 - Driver Performance Classification Dataset:")
print(f"Features shape: {X_num_perf.shape}")
print(f"Categorical features shape: {X_cat_perf.shape}")
print(f"Target distribution:")
print(y_perf.value_counts())

# Create preprocessing pipeline for Model 2
preprocessor_perf = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), X_num_perf.columns),
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), X_cat_perf.columns)
    ])

# Split the data
X_combined_perf = pd.concat([X_num_perf, X_cat_perf], axis=1)
X_train_perf, X_test_perf, y_train_perf, y_test_perf = train_test_split(
    X_combined_perf, y_perf, test_size=0.2, random_state=42, stratify=y_perf
)

print(f"\nTrain/Test Split:")
print(f"Training set: {X_train_perf.shape[0]} samples")
print(f"Test set: {X_test_perf.shape[0]} samples")
print(f"\nTrain performance tier distribution:")
print(y_train_perf.value_counts())
print(f"\nTest performance tier distribution:")
print(y_test_perf.value_counts())

In [None]:
# Model 2: Train and evaluate multiple algorithms for multi-class classification
models_perf = {
    'Logistic Regression': LogisticRegression(random_state=42, class_weight='balanced', max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

# Create pipelines
pipelines_perf = {}
for name, model in models_perf.items():
    pipelines_perf[name] = Pipeline([
        ('preprocessor', preprocessor_perf),
        ('classifier', model)
    ])

# Cross-validation evaluation
print("MODEL 2: DRIVER PERFORMANCE CLASSIFICATION - Cross Validation Results")
print("=" * 70)

cv_scores_perf = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, pipeline in pipelines_perf.items():
    # Cross-validation scores (using accuracy for multi-class)
    cv_scores = cross_val_score(pipeline, X_train_perf, y_train_perf, 
                               cv=cv, scoring='accuracy', n_jobs=-1)
    cv_scores_perf[name] = cv_scores
    
    print(f"{name}:")
    print(f"  Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

# Train final models and evaluate on test set
print(f"\nMODEL 2: DRIVER PERFORMANCE CLASSIFICATION - Test Set Results")
print("=" * 70)

trained_models_perf = {}
for name, pipeline in pipelines_perf.items():
    # Train the model
    pipeline.fit(X_train_perf, y_train_perf)
    trained_models_perf[name] = pipeline
    
    # Predictions
    y_pred = pipeline.predict(X_test_perf)
    
    # Metrics
    accuracy = accuracy_score(y_test_perf, y_pred)
    
    print(f"\n{name}:")
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Detailed Classification Report:")
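    # Pass labels explicitly so the report rows always line up with the tier names
    tier_labels = sorted(y_perf.unique())
    print(classification_report(y_test_perf, y_pred, labels=tier_labels, target_names=tier_labels))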

# Select the best model for Model 2
best_model_name_perf = max(cv_scores_perf.keys(), 
                          key=lambda x: cv_scores_perf[x].mean())
best_model_perf = trained_models_perf[best_model_name_perf]

print(f"\nBest Model for Driver Performance Classification: {best_model_name_perf}")
print(f"Cross-validation Accuracy: {cv_scores_perf[best_model_name_perf].mean():.4f}")

## 7. Model Optimization and Hyperparameter Tuning

In [None]:
# Hyperparameter tuning for the best models
print("HYPERPARAMETER TUNING")
print("=" * 50)

# Tune Model 1 (Podium Prediction) only if Random Forest won the CV comparison;
# otherwise the untuned best model is kept
if best_model_name_podium == 'Random Forest':
    param_grid_podium = {
        'classifier__n_estimators': [100, 200],
        'classifier__max_depth': [10, 20, None],
        'classifier__min_samples_split': [2, 5],
        'classifier__min_samples_leaf': [1, 2]
    }
    
    grid_search_podium = GridSearchCV(
        pipelines_podium['Random Forest'], 
        param_grid_podium, 
        cv=StratifiedKFold(3, shuffle=True, random_state=42),
        scoring='roc_auc', 
        n_jobs=-1, 
        verbose=1
    )
    
    print("Tuning Model 1 (Podium Prediction)...")
    grid_search_podium.fit(X_train_podium, y_train_podium)
    
    print(f"Best parameters for Model 1: {grid_search_podium.best_params_}")
    print(f"Best CV score for Model 1: {grid_search_podium.best_score_:.4f}")
    
    # Update best model
    best_model_podium = grid_search_podium.best_estimator_

# Tune Model 2 (Driver Performance) only if Random Forest won the CV comparison;
# otherwise the untuned best model is kept
if best_model_name_perf == 'Random Forest':
    param_grid_perf = {
        'classifier__n_estimators': [100, 200],
        'classifier__max_depth': [10, 20, None],
        'classifier__min_samples_split': [2, 5],
        'classifier__min_samples_leaf': [1, 2]
    }
    
    grid_search_perf = GridSearchCV(
        pipelines_perf['Random Forest'], 
        param_grid_perf, 
        cv=StratifiedKFold(3, shuffle=True, random_state=42),
        scoring='accuracy', 
        n_jobs=-1, 
        verbose=1
    )
    
    print("\nTuning Model 2 (Driver Performance Classification)...")
    grid_search_perf.fit(X_train_perf, y_train_perf)
    
    print(f"Best parameters for Model 2: {grid_search_perf.best_params_}")
    print(f"Best CV score for Model 2: {grid_search_perf.best_score_:.4f}")
    
    # Update best model
    best_model_perf = grid_search_perf.best_estimator_

print("\nHyperparameter tuning completed!")

## 8. Final Model Evaluation and Visualization

In [None]:
# Final evaluation and visualization
plt.figure(figsize=(15, 10))

# Model 1 Evaluation
print("FINAL MODEL EVALUATION")
print("=" * 50)

# Model 1: Podium Prediction
y_pred_podium = best_model_podium.predict(X_test_podium)
y_prob_podium = best_model_podium.predict_proba(X_test_podium)[:, 1]

print(f"MODEL 1 - PODIUM PREDICTION ({best_model_name_podium}):")
print(f"Final Test Accuracy: {accuracy_score(y_test_podium, y_pred_podium):.4f}")
print(f"Final Test ROC-AUC: {roc_auc_score(y_test_podium, y_prob_podium):.4f}")
print(f"Final Test F1-Score: {f1_score(y_test_podium, y_pred_podium):.4f}")

# Plot 1: ROC Curve for Model 1
plt.subplot(2, 3, 1)
fpr, tpr, _ = roc_curve(y_test_podium, y_prob_podium)
plt.plot(fpr, tpr, label=f'ROC (AUC = {roc_auc_score(y_test_podium, y_prob_podium):.3f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Model 1: ROC Curve')
plt.legend()

# Plot 2: Confusion Matrix for Model 1
plt.subplot(2, 3, 2)
cm = confusion_matrix(y_test_podium, y_pred_podium)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Model 1: Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')

# Model 2: Driver Performance Classification
y_pred_perf = best_model_perf.predict(X_test_perf)

print(f"\nMODEL 2 - DRIVER PERFORMANCE ({best_model_name_perf}):")
print(f"Final Test Accuracy: {accuracy_score(y_test_perf, y_pred_perf):.4f}")
print(f"\nDetailed Classification Report:")
print(classification_report(y_test_perf, y_pred_perf))

# Plot 3: Confusion Matrix for Model 2
plt.subplot(2, 3, 3)
# Fix the label order so the matrix rows/columns match the tick labels below
unique_labels = sorted(y_perf.unique())
cm_perf = confusion_matrix(y_test_perf, y_pred_perf, labels=unique_labels)
sns.heatmap(cm_perf, annot=True, fmt='d', cmap='Greens', 
           xticklabels=unique_labels, yticklabels=unique_labels)
plt.title('Model 2: Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.xticks(rotation=45)
plt.yticks(rotation=0)

# Feature Importance for Model 1 (if Random Forest)
if 'Random Forest' in best_model_name_podium:
    plt.subplot(2, 3, 4)
    
    # Get feature names after preprocessing
    feature_names = (list(X_num_podium.columns) + 
                    list(best_model_podium.named_steps['preprocessor']
                         .named_transformers_['cat']
                         .get_feature_names_out(X_cat_podium.columns)))
    
    importances = best_model_podium.named_steps['classifier'].feature_importances_
    indices = np.argsort(importances)[::-1][:10]  # Top 10 features
    
    plt.barh(range(10), importances[indices])
    plt.yticks(range(10), [feature_names[i] for i in indices])
    plt.title('Model 1: Top 10 Feature Importances')
    plt.gca().invert_yaxis()

# Feature Importance for Model 2 (if Random Forest)
if 'Random Forest' in best_model_name_perf:
    plt.subplot(2, 3, 5)
    
    # Get feature names after preprocessing
    feature_names_perf = (list(X_num_perf.columns) + 
                         list(best_model_perf.named_steps['preprocessor']
                              .named_transformers_['cat']
                              .get_feature_names_out(X_cat_perf.columns)))
    
    importances_perf = best_model_perf.named_steps['classifier'].feature_importances_
    indices_perf = np.argsort(importances_perf)[::-1][:10]  # Top 10 features
    
    plt.barh(range(10), importances_perf[indices_perf])
    plt.yticks(range(10), [feature_names_perf[i] for i in indices_perf])
    plt.title('Model 2: Top 10 Feature Importances')
    plt.gca().invert_yaxis()

# Model Performance Comparison
plt.subplot(2, 3, 6)
# The two bars use different metrics, so name the metric in each label
models_comparison = ['Model 1\n(Podium, ROC-AUC)', 'Model 2\n(Performance, accuracy)']
scores_comparison = [
    roc_auc_score(y_test_podium, y_prob_podium),
    accuracy_score(y_test_perf, y_pred_perf)
]
colors = ['skyblue', 'lightgreen']

bars = plt.bar(models_comparison, scores_comparison, color=colors)
plt.title('Model Performance Comparison')
plt.ylabel('Score')
plt.ylim(0, 1)

# Add value labels on bars
for bar, score in zip(bars, scores_comparison):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01, 
             f'{score:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("\nModel evaluation completed!")

## 9. Model Persistence and Export

In [None]:
# Save trained models for API deployment
import os

# Create models directory if it doesn't exist
models_dir = '../models'
os.makedirs(models_dir, exist_ok=True)

# Save Model 1: Podium Prediction
model1_path = os.path.join(models_dir, 'podium_prediction_model.joblib')
joblib.dump(best_model_podium, model1_path)
print(f"Model 1 saved to: {model1_path}")

# Save Model 2: Driver Performance Classification
model2_path = os.path.join(models_dir, 'driver_performance_model.joblib')
joblib.dump(best_model_perf, model2_path)
print(f"Model 2 saved to: {model2_path}")

# Save feature information for API
feature_info = {
    'podium_model': {
        'numeric_features': list(X_num_podium.columns),
        'categorical_features': list(X_cat_podium.columns),
        'model_type': best_model_name_podium
    },
    'performance_model': {
        'numeric_features': list(X_num_perf.columns),
        'categorical_features': list(X_cat_perf.columns),
        'model_type': best_model_name_perf,
        'performance_tiers': sorted(y_perf.unique())
    }
}

import json
feature_info_path = os.path.join(models_dir, 'feature_info.json')
with open(feature_info_path, 'w') as f:
    json.dump(feature_info, f, indent=2)
print(f"Feature information saved to: {feature_info_path}")

# Create sample prediction functions
def predict_podium_finish(grid_position, driver_age, career_wins, career_podiums, 
                         career_points, career_races, constructor_total_podiums,
                         year, round_num, constructor_name, driver_nationality, championship_era):
    """
    Sample function to predict podium finish probability
    """
    # Create input dataframe
    input_data = pd.DataFrame({
        'grid_position': [grid_position],
        'driver_age': [driver_age],
        'career_wins': [career_wins],
        'career_podiums': [career_podiums],
        'career_points': [career_points],
        'career_races': [career_races],
        'constructor_total_podiums': [constructor_total_podiums],
        'year': [year],
        'round': [round_num],
        'constructor_name': [constructor_name],
        'driver_nationality': [driver_nationality],
        'championship_era': [championship_era]
    })
    
    # Make prediction
    probability = best_model_podium.predict_proba(input_data)[0, 1]
    prediction = best_model_podium.predict(input_data)[0]
    
    return {
        'podium_probability': probability,
        'predicted_podium': bool(prediction),
        'confidence': 'High' if probability > 0.7 or probability < 0.3 else 'Medium'
    }

def predict_driver_performance(driver_age, career_races, career_wins, career_podiums,
                              career_points, avg_position, constructor_total_podiums,
                              year, grid_position, driver_nationality, constructor_name, 
                              championship_era):
    """
    Sample function to predict driver performance tier
    """
    # Create input dataframe
    input_data = pd.DataFrame({
        'driver_age': [driver_age],
        'career_races': [career_races],
        'career_wins': [career_wins],
        'career_podiums': [career_podiums],
        'career_points': [career_points],
        'avg_position': [avg_position],
        'constructor_total_podiums': [constructor_total_podiums],
        'year': [year],
        'grid_position': [grid_position],
        'driver_nationality': [driver_nationality],
        'constructor_name': [constructor_name],
        'championship_era': [championship_era]
    })
    
    # Make prediction
    prediction = best_model_perf.predict(input_data)[0]
    probabilities = best_model_perf.predict_proba(input_data)[0]
    
    # Get class labels
    classes = best_model_perf.named_steps['classifier'].classes_
    prob_dict = dict(zip(classes, probabilities))
    
    return {
        'predicted_tier': prediction,
        'tier_probabilities': prob_dict,
        'confidence': max(probabilities)
    }

# Test the functions with sample data
print("\nTesting prediction functions:")

# Test podium prediction
sample_podium = predict_podium_finish(
    grid_position=1, driver_age=28, career_wins=5, career_podiums=15,
    career_points=500, career_races=50, constructor_total_podiums=100,
    year=2023, round_num=10, constructor_name='Mercedes', 
    driver_nationality='British', championship_era='2022+'
)
print(f"Sample podium prediction: {sample_podium}")

# Test performance classification
sample_performance = predict_driver_performance(
    driver_age=28, career_races=50, career_wins=5, career_podiums=15,
    career_points=500, avg_position=6.5, constructor_total_podiums=100,
    year=2023, grid_position=3, driver_nationality='British',
    constructor_name='Mercedes', championship_era='2022+'
)
print(f"Sample performance prediction: {sample_performance}")

print("\nModels ready for API deployment!")

## 10. Conclusions and Summary

### Project Summary
This project implemented two machine learning models using Formula 1 racing data:

#### Model 1: Podium Prediction
- **Objective**: Predict whether a driver will finish in the top 3 positions
- **Type**: Binary Classification
- **Features**: Grid position, driver career statistics, constructor performance, race context
- **Performance**: Evaluated using ROC-AUC score
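
As a rough illustration of what the ROC-AUC score measures, the sketch below computes it directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `pairwise_roc_auc` and the four sample scores are hypothetical; the notebook itself uses `sklearn.metrics.roc_auc_score`.

```python
def pairwise_roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")

    # Count wins over all positive/negative pairs; a tie counts as half a win
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = podium) and predicted podium probabilities
y_example = [0, 0, 1, 1]
score_example = [0.1, 0.4, 0.35, 0.8]
auc_example = pairwise_roc_auc(y_example, score_example)
print(auc_example)  # 3 of the 4 pairs are ranked correctly -> 0.75
```

Because the score depends only on ranking, it is unaffected by the podium class being much rarer than the non-podium class, which is one reason it suits this task better than plain accuracy.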

#### Model 2: Driver Performance Classification
- **Objective**: Classify drivers into performance tiers (Elite, Strong, Solid, Developing)
- **Type**: Multi-class Classification
- **Features**: Career achievements, age, team performance, historical data
- **Performance**: Evaluated using accuracy and classification report
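
The tier labels come from a fixed rule on career statistics (`assign_performance_tier` in section 3). The sketch below restates that rule as a standalone function to make the order of precedence explicit; the name `assign_tier` and the sample inputs are hypothetical. Note that all four inputs to the rule are also among Model 2's numeric features, so a classifier can in principle recover the labels from them directly.

```python
import math


def assign_tier(career_races, career_wins, career_podiums, avg_position):
    """Restates the rule order of assign_performance_tier from section 3."""
    # Too little history (or no recorded finishing position) comes first
    if avg_position is None or math.isnan(avg_position) or career_races < 5:
        return "Rookie"
    if career_wins >= 5:
        return "Elite"       # multiple race winners
    if career_podiums >= 5:
        return "Strong"      # regular podium finishers
    if avg_position <= 10:
        return "Solid"       # consistent points scorers
    return "Developing"


# Hypothetical career profiles, one per tier
tier_examples = {
    "3 races, 6 wins": assign_tier(3, 6, 10, 2.0),
    "50 races, 5 wins": assign_tier(50, 5, 15, 6.5),
    "50 races, 7 podiums": assign_tier(50, 2, 7, 8.0),
    "50 races, avg 9.5": assign_tier(50, 0, 1, 9.5),
    "50 races, avg 14.2": assign_tier(50, 0, 0, 14.2),
}
for profile, tier in tier_examples.items():
    print(f"{profile}: {tier}")
```

The first example shows that the race-count check wins over everything else: a driver with fewer than five prior races is labelled "Rookie" regardless of wins.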

### Technical Implementation

1. **Data Integration**
   - Merged multiple F1 datasets (races, results, drivers, constructors, qualifying, etc.)
   - Created training samples with feature engineering
   - Handled missing data and categorical variables

2. **Feature Engineering**
   - Career statistics calculation
   - Time-based features (championship eras)
   - Driver age calculation
   - Constructor performance metrics

3. **Model Development**
   - Tested multiple algorithms (Logistic Regression, Random Forest, Gradient Boosting)
   - Cross-validation evaluation
   - Hyperparameter tuning with GridSearchCV
   - Performance evaluation with multiple metrics

4. **API Implementation**
   - FastAPI implementation with documentation
   - Input validation with Pydantic models
   - Error handling and health checks
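
The career statistics under Feature Engineering are computed in section 3 with a row-by-row loop over all results. As one possible alternative, the sketch below derives the same "before this race" totals with grouped cumulative sums. It assumes one result row per driver per race date; the function name `add_career_stats` and the toy data are hypothetical.

```python
import pandas as pd

# Hypothetical results: driver 1 has three races, driver 2 has two (one DNF)
toy = pd.DataFrame({
    "driverId": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2020-03-01", "2020-03-15", "2020-04-01", "2020-03-01", "2020-03-15"]
    ),
    "position": [1.0, 4.0, 2.0, None, 3.0],
    "points": [25.0, 12.0, 18.0, 0.0, 15.0],
})


def add_career_stats(df):
    """Add per-driver career totals that exclude the current race."""
    out = df.sort_values(["driverId", "date"]).copy()
    out["_win"] = (out["position"] == 1).astype(int)
    out["_podium"] = out["position"].isin([1, 2, 3]).astype(int)
    out["_finished"] = out["position"].notna().astype(int)
    out["_pos"] = out["position"].fillna(0)

    # Running total up to and including this race, minus this race's own value
    pairs = [
        ("_win", "career_wins"),
        ("_podium", "career_podiums"),
        ("points", "career_points"),
        ("_finished", "career_races"),
        ("_pos", "_pos_before"),
    ]
    for src, dst in pairs:
        out[dst] = out.groupby("driverId")[src].cumsum() - out[src]

    # Mean of earlier classified positions; NaN when there are none yet
    prior_races = out["career_races"].where(out["career_races"] > 0)
    out["avg_position"] = out["_pos_before"] / prior_races

    helper_cols = ["_win", "_podium", "_finished", "_pos", "_pos_before"]
    return out.drop(columns=helper_cols).sort_index()


stats = add_career_stats(toy)
print(stats)
```

Subtracting the current row from the running total is what keeps each race's own result out of its features, matching the `date < current_date` filter in the loop version.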

### Model Performance
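- Both models are scored on a held-out 20% test split; the ROC-AUC (Model 1) and accuracy (Model 2) values are printed by the evaluation cells in sections 5, 6, and 8
- Feature importance analysis (plotted in section 8 when Random Forest is the selected model) showed grid position and career statistics as key predictors
- 5-fold stratified cross-validation on the training split was used to check model stability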

### Technologies Used
- **Data Processing**: Pandas, NumPy
- **Machine Learning**: Scikit-learn
- **Visualization**: Matplotlib, Seaborn
- **API Development**: FastAPI, Pydantic, Uvicorn
- **Model Persistence**: Joblib

This project demonstrates a complete machine learning pipeline from data exploration to model deployment.