# F1 Ideal Lap Analysis: Constructing the Fastest Possible Lap from Sector Times

This project focuses on **Formula 1 qualifying data** collected using the **FastF1 Python library**.

The primary goal is to determine the ***Ideal Lap*** for specific Grand Prix events, defined as the sum of the fastest Sector 1, Sector 2, and Sector 3 times recorded by *any* driver during a session. This allows us to answer the question: “What would the fastest possible lap look like if the best sector performances were combined into a single perfect lap?”
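
As a minimal sketch of this definition, the toy example below builds an Ideal Lap from three made-up laps at a single circuit. The driver codes and sector times are hypothetical values chosen for illustration, not FastF1 data.

```python
import pandas as pd

# Hypothetical sector times (seconds) for three laps at one circuit
laps = pd.DataFrame({
    "driver": ["AAA", "BBB", "CCC"],
    "sector1_time": [25.1, 24.9, 25.3],
    "sector2_time": [30.4, 30.6, 30.2],
    "sector3_time": [22.0, 22.3, 21.9],
})
sector_cols = ["sector1_time", "sector2_time", "sector3_time"]

# In this toy data each lap time is simply the sum of its three sectors
laps["lap_time"] = laps[sector_cols].sum(axis=1)

# Ideal Lap: the fastest time in each sector, regardless of who set it, summed
ideal_lap = laps[sector_cols].min().sum()

# Best real lap: the fastest complete lap by any single driver
best_real_lap = laps["lap_time"].min()

print(f"Ideal lap: {ideal_lap:.3f}s, best real lap: {best_real_lap:.3f}s")
```

Here each of the three sectors has a different fastest lap, so the Ideal Lap is faster than any lap actually driven.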


### **Objectives**  

- Collect and clean raw sector time data across all the circuits during the season.
- Calculate the **Ideal Lap** time for each targeted event (the best theoretically possible lap).
- Quantify and compare the Ideal Lap time against the **Actual Fastest Lap** achieved.
- Provide insights into which drivers or teams demonstrated the strongest performance in specific sectors of the circuit.


### **Data & Scope**  

| Field | Detail |
| :--- | :--- |
| **Source** | **FastF1 Python API** (Official Formula 1 timing feeds) |
| **Scope** | All **Qualifying Sessions** of the chosen F1 season (2021 in this notebook). |
| **Key Variables** | Lap times, Sector 1, Sector 2, and Sector 3 times, Driver identification, Circuit details. |


### **Information**  
**Author:** Paulo Castro  
**Date:** August 2025  
**Tools:** Python (pandas, matplotlib, seaborn, FastF1)  

---

## 1. Load and Inspect Data

### 1.1 Initialization and Data Collection

This section initializes the required libraries and defines the parameters for data collection. We use the **FastF1 API** to fetch data for the **entire `2021` season's qualifying sessions `(Q)`**. We iterate over each round of the schedule, extract the key timing columns, and aggregate them into a single DataFrame for analysis.

The columns collected are: `Driver`, `LapTime`, `Sector1Time`, `Sector2Time`, `Sector3Time`, and `IsAccurate`. We also add `Year` and `Circuit` for context.


In [None]:
# Install FastF1 (if not already installed)
!pip install fastf1 --quiet

In [None]:
# Import necessary libraries
import fastf1 as f1
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Parameters for data collection
year = 2021
session_type = "Q"  # Qualifying

all_data = []
# Get the event schedule for the specified year excluding testing events
schedule = f1.get_event_schedule(year, include_testing=False)

# Iterate through all Grand Prix rounds and extract sector times
for round_number in schedule["RoundNumber"]:
    try:
        session = f1.get_session(year, round_number, session_type)
        # Load the session data, only laps needed, telemetry skipped for speed
        session.load(laps=True, telemetry=False)
        laps = session.laps

        # Select and copy relevant columns
        df = laps[['Driver', 'LapTime', 'Sector1Time', 'Sector2Time', 'Sector3Time', 'IsAccurate']].copy()
        df['Year'] = year
        # Use the event name as the circuit label
        df['Circuit'] = session.event['EventName']

        all_data.append(df)

    except Exception as e:
        # Handle cases where session data might be unavailable.
        # `session` may be undefined (or stale from the previous round) if
        # get_session itself failed, so only the round number is reported.
        print(f"Could not load data for Round {round_number}. Error: {e}")

# Concatenate all individual DataFrames into a single one
df_all = pd.concat(all_data, ignore_index=True)

### 1.2 Initial Data Inspection: head(), shape, info()

A quick inspection of the collected raw data is essential to understand its structure, size, and data types before proceeding to cleaning.

In [None]:
# View first rows
print("Initial Data Sample:")
display(df_all.head())

In [None]:
# Analyse Dataframe shape
print("Dataframe Shape:")
print(df_all.shape)

In [None]:
# General Information and Data Types
print("\nGeneral Information:")
# info() prints its report directly and returns None, so it is not wrapped in print()
df_all.info()

### 1.3 Descriptive Statistics

To facilitate numerical analysis and statistics, the Timedelta columns (LapTime and all SectorTime columns) are converted to a common unit of seconds.

In [None]:
# Convert Timedelta to Seconds
# Work on a copy so that df_all keeps the original Timedelta columns
df_sec = df_all.copy()

# Convert time columns (timedelta) into total seconds for easier analysis
for col in ['LapTime','Sector1Time','Sector2Time','Sector3Time']:
    df_sec[col] = df_sec[col].dt.total_seconds()

print("\nDescriptive Statistics (in seconds):")
display(df_sec.describe())

### 1.4 Initial Assessment of Raw Data State
Based on the `info()` and `describe()` outputs, the raw data presents several issues that need to be addressed in the subsequent cleaning phase:

- **Missing Values `(NaN)` and Inaccurate Laps**: The `info()` output shows that the time-based columns (`LapTime`, `Sector1Time`, etc.) contain significantly fewer non-null values than the total number of rows. These `NaN` values, along with non-representative laps (e.g., pit-in/pit-out and slow laps), must be filtered out. We will primarily use the `IsAccurate` flag to retain only reliably timed laps for the analysis.

- **Time Column Conversion (Completed)**: The initial step of converting the time columns from *Timedelta* objects to *float* (seconds) has been successfully executed. This is fundamental for enabling the numerical calculations required for the *Ideal Lap* metric.

- **Categorical Data Optimization**: The `Driver` and `Circuit` columns, which are essential categorical keys for grouping and calculating the *Ideal Lap*, are currently stored as the inefficient `object` data type. To optimize memory usage and improve processing speed, these columns will be explicitly converted to the `category` data type.

- **Column Name Standardization**: Following standard data cleaning best practices, the column names will be standardized to the `snake_case` format. This improves code readability and consistency throughout the project.
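
To illustrate the memory argument behind the `category` conversion, here is a small sketch comparing both dtypes on a made-up column. The driver codes and the repetition count are arbitrary assumptions; the actual savings on `df_all` depend on its size.

```python
import pandas as pd

# Hypothetical column: a handful of driver codes repeated many times
drivers = pd.Series(["HAM", "VER", "BOT", "NOR"] * 5000, dtype=object)
as_category = drivers.astype("category")

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = drivers.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object: {object_bytes} bytes, category: {category_bytes} bytes")
```

The categorical version stores each distinct string once plus a small integer code per row, which is why it pays off for columns with few unique values such as `Driver` and `Circuit`.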

## 2. Data Cleaning & Preparation

This section focuses on transforming the raw aggregated data into a reliable and optimized dataset suitable for the *Ideal Lap* analysis. We will systematically address data quality issues, including inaccuracies, missing values, timing inconsistencies, and ensure efficient data types.

### 2.1 Data Integrity and Initial Filtering

We begin by removing redundant rows and eliminating inaccurate or incomplete laps, as indicated by the `IsAccurate` flag and `NaN` values in timing columns.

In [None]:
# Create a copy for cleaning operations to maintain the original aggregated data (df_sec)
df_clean = df_sec.copy()

# Remove duplicated rows
print("Rows before removing duplicates:", df_clean.shape[0])
print("Number of duplicate rows found:", df_clean.duplicated().sum())
df_clean.drop_duplicates(inplace=True)
print("Rows after removing duplicates:", df_clean.shape[0])

#### Filtering by Accuracy and Completeness

Only laps marked as accurate (`IsAccurate=True`) are retained, as they represent official, non-interrupted flying laps. We then remove the `IsAccurate` column as its purpose has been served.

In [None]:
# Retain only laps marked as accurate
df_clean = df_clean[df_clean['IsAccurate']].copy()

# Remove the 'IsAccurate' column as it is no longer necessary
df_clean.drop(columns=['IsAccurate'], inplace=True)

print("Data sample after filtering for accurate laps:")
display(df_clean.head())

#### Handling Missing Time Values

We verify the presence of any remaining `NaN` values in the timing columns, which can occur despite the `IsAccurate` filter. These rows are subsequently dropped to ensure all data points used have complete timing records across all three sectors.

In [None]:
# Check for any remaining missing values
print("Missing values per column:")
display(df_clean.isna().sum())

In [None]:
# Remove remaining rows with missing time values
df_clean.dropna(subset=['LapTime', 'Sector1Time', 'Sector2Time', 'Sector3Time'], inplace=True)
df_clean.reset_index(drop=True, inplace=True)
# info() prints its report directly and returns None, so it is not wrapped in print()
df_clean.info()

### 2.2 Standardizing Column Names and Data Types
To improve code readability and memory efficiency, we standardize column naming to snake_case and optimize categorical columns.

#### Column Name Standardization

We adopt the `snake_case` convention for all timing columns, ensuring consistency with Python naming standards.

In [None]:
# Standardize column names
df_clean.rename(columns={
    'LapTime': 'lap_time',
    'Sector1Time': 'sector1_time',
    'Sector2Time': 'sector2_time',
    'Sector3Time': 'sector3_time',
    'Driver': 'driver',
    'Circuit': 'circuit',
    'Year': 'year'
}, inplace=True)

# Verify changes
print("Updated column names:")
print(df_clean.columns)

#### Categorical Data Optimization

The `driver` and `circuit` columns are converted from the memory-intensive `object` type to the more efficient `category` type, which is ideal for fixed sets of strings used for grouping and plotting.

In [None]:
# Identify potential categorical columns
cols_cat = df_clean.select_dtypes(include='object').nunique().sort_values().index.tolist()
print(f"Columns converted to category: {cols_cat if cols_cat else 'None found'}")
print("\n")

# Apply category conversion
df_clean[cols_cat] = df_clean[cols_cat].astype('category')
df_clean.info()

### 2.3 Timing Consistency Check
While filtering for accuracy minimizes errors, a final check is performed to ensure that the sum of the three sector times is logically consistent with the total `lap_time`. A tolerance of `±0.1` seconds is used to account for minor API timing precision differences.

In [None]:
# Check for negative or zero times (Expected to be clean after initial describe check)
# This step is commented out as initial checks confirmed positive minimum values.
'''
print("Laps with lap_time <= 0:")
display(df_clean[df_clean['lap_time'] <= 0])
print("Sectors with time <= 0:")
display(df_clean[(df_clean['sector1_time'] <= 0) |
                 (df_clean['sector2_time'] <= 0) |
                 (df_clean['sector3_time'] <= 0)])
'''

In [None]:
# Calculate the sum of sectors
df_clean['sectors_sum'] = df_clean['sector1_time'] + df_clean['sector2_time'] + df_clean['sector3_time']

# Identify rows where the difference between LapTime and the sum of sectors exceeds the 0.1s tolerance
inconsistent = df_clean[abs(df_clean['lap_time'] - df_clean['sectors_sum']) > 0.1]  # tolerance: 0.1 s
print("Inconsistent Laps (LapTime vs. Sum of Sectors > 0.1s):")
display(inconsistent)

> **Observation**: No inconsistencies greater than 0.1s were detected between the recorded `lap_time` and the sum of sector times. This indicates that lap and sector times are internally consistent in the FastF1 data and that sector-level data can be safely used for Ideal Lap reconstruction.

### 2.4 Final Data State
The data is now clean, optimized, and ready for the Exploratory Data Analysis (EDA) and the core calculation of the Ideal Lap.

In [None]:
# Drop the temporary 'sectors_sum' column before proceeding
df_clean.drop(columns=['sectors_sum'], inplace=True)

# Final check of descriptive statistics on the cleaned dataset
print("Descriptive Statistics on the Cleaned Dataset:")
display(df_clean.describe())

> With a fully consistent and optimized dataset, we can now explore sector-based dynamics and proceed to the analytical construction of the Ideal Lap in the next section.

## 3. Exploratory Data Analysis (EDA)

### 3.1 Global Lap Time Distribution

Before analyzing performance circuit by circuit, we start with a global overview of the entire dataset. This serves as a vital **sanity check** to ensure the recorded lap times are within plausible and expected values for modern Formula 1 circuits.


In [None]:
# Create Histogram with KDE
plt.figure(figsize=(12,6))
ax = sns.histplot(df_clean['lap_time'], bins=50, kde=True, color='sandybrown')
# Setting the KDE line color
ax.lines[0].set_color('black')
plt.title("Overall Lap Time Distribution (Global View)", fontsize=14, weight='bold')
plt.xlabel("Lap Time (Seconds)")
plt.ylabel("Frequency")
plt.show()

# Create Boxplot
plt.figure(figsize=(12,6))
sns.boxplot(x=df_clean['lap_time'], color='sandybrown')
plt.title("Boxplot of Lap Times (Global View)", fontsize=14, weight='bold')
plt.xlabel("Lap Time (Seconds)")
plt.show()

#### Observations on Global Distribution

-   **Typical Range:** The majority of recorded laps fall between **60 and 90 seconds**, which is consistent with the standard length of contemporary Formula 1 tracks during a qualifying session.
-   **Central Tendency:** The distribution's center (median, visible in the boxplot) is situated near the **85–90 second** range.
-   **Right Skew and Outliers:** As anticipated, the distribution is **right-skewed**, reflecting the presence of **outliers above 140 seconds**. These longer times most likely correspond to slow laps (for example preparatory or cool-down laps, or laps compromised by traffic or yellow flags) that passed the `IsAccurate` filter and have complete timing data but are much slower than flying laps.

This preliminary inspection validates the consistency of the cleaned data, allowing us to proceed confidently with the **circuit-specific and sector-based analysis**, which is the primary objective of this project.
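
If the slow outliers need to be separated from flying laps, one possible approach is to keep only laps within a fixed percentage of each circuit's fastest lap. The sketch below is not part of this notebook's pipeline: the toy data and the 7% threshold are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical lap times (seconds) at two circuits
laps = pd.DataFrame({
    "circuit": ["A", "A", "A", "B", "B"],
    "lap_time": [80.0, 81.5, 140.0, 95.0, 96.0],
})

# Assumed cut-off: keep laps no more than 7% slower than the circuit's fastest lap
threshold = 1.07

# Broadcast each circuit's fastest lap back onto every row of that circuit
fastest = laps.groupby("circuit")["lap_time"].transform("min")
flying = laps[laps["lap_time"] <= fastest * threshold]

print(flying)
```

Because the Ideal Lap only uses sector minima, slow laps do not change its value; a filter like this mainly affects medians and distribution plots.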


### 3.2 Circuit-by-Circuit Analysis

In this step, we analyze the **minimum and median lap times per circuit**. The goal is to compare the performance recorded at each event and establish a baseline for the core calculation: the *Ideal Lap* (the sum of the best sectors).


In [None]:
# Calculate minimum, median, and maximum lap times grouped by circuit
circuit_stats = df_clean.groupby('circuit', observed=True)['lap_time'].agg(['min','median','max']).reset_index()

# Sort by the minimum time (fastest)
circuit_stats = circuit_stats.sort_values(by='min').reset_index(drop=True)

print("Top 10 Fastest Circuits by Minimum Lap Time:")
display(circuit_stats.head(10))

In [None]:
# Visualization: Boxplot by Circuit
plt.figure(figsize=(14,7))
sns.boxplot(data=df_clean, x='circuit', y='lap_time', color='sandybrown')
plt.xticks(rotation=90)
plt.title("Distribution of Lap Times per Circuit (2021 Qualifying)", fontsize=14, fontweight='bold')
plt.ylabel("Lap Time (Seconds)")
plt.xlabel("Circuit")
plt.show()

#### Observations on Circuit Performance

-   **Track Length vs. Time:** The dispersion of lap times clearly shows differences based on track length:
    -   Shorter tracks (e.g., Red Bull Ring) feature minimum times well below 70 seconds.
    -   Longer, demanding circuits (e.g., Spa-Francorchamps) easily surpass 100 seconds.
-   **Performance Metrics:** The `median` lap time reflects the typical competitive pace of each circuit, while the `min` lap time indicates the absolute fastest lap achieved during the session.
-   **Ideal Lap Justification:** This visualization confirms that lap times vary logically according to the circuit layout. However, to fairly assess maximum performance, we must calculate the *Ideal Lap* for each circuit by summing the **best sectors**, as different drivers may specialize in distinct track sectors.


### 3.3 Ideal Lap Calculation and Visualization

The objective of this section is to calculate the **Ideal Lap** for every circuit.

-   For each circuit, we sum the best times recorded for **Sector 1, Sector 2, and Sector 3** across all drivers.
-   This sum represents the theoretical perfect lap, should a single driver combine all optimal sector performances.
-   We also include the **Best Real Lap Time** for a direct comparison.


In [None]:
# Calculate the minimum sector times per circuit
min_sectors = df_clean.groupby('circuit', observed=True)[['sector1_time','sector2_time','sector3_time']].min().reset_index()

# Calculate the Ideal Lap as the sum of the minimum sectors
min_sectors['ideal_lap'] = min_sectors['sector1_time'] + min_sectors['sector2_time'] + min_sectors['sector3_time']

# Add the actual best real lap time per circuit
best_lap = df_clean.groupby('circuit', observed=True)['lap_time'].min().reset_index()
ideal_vs_real = pd.merge(min_sectors, best_lap, on='circuit')
ideal_vs_real.rename(columns={'lap_time':'best_real_lap'}, inplace=True)

# Display results
print("Comparison of Ideal vs. Best Real Lap (Top 10 Fastest Ideal Laps):")
display(ideal_vs_real.sort_values('ideal_lap').head(10))

In [None]:
# Comparative Visualization: Ideal Lap vs Best Real Lap
plt.figure(figsize=(14,7))
sns.scatterplot(data=ideal_vs_real, x='circuit', y='best_real_lap', label='Best Real Lap', color='black', s=100)
sns.scatterplot(data=ideal_vs_real, x='circuit', y='ideal_lap', label='Ideal Lap', color='sandybrown', s=100)
plt.xticks(rotation=90)
plt.ylabel("Lap Time (Seconds)")
plt.xlabel("Circuit")
plt.title("Comparison: Best Real Lap vs. Ideal Lap per Circuit (2021 Qualifying)", fontsize=14, fontweight='bold')
plt.legend()
plt.show()

#### Observations on Ideal vs. Real Performance

-   The **Ideal Lap (sandy brown)** is, as expected, always equal to or faster than the **Best Real Lap (black)**.
-   The distance between the two points indicates the potential for sector-by-sector optimization: a larger gap signifies that different drivers or different laps contributed the fastest times to the individual sectors.
-   This calculation provides a more precise view of the maximum theoretical performance possible at each circuit, isolating the performance potential from single-lap execution errors or inconsistencies.
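
The expectation that the Ideal Lap is never slower than the Best Real Lap can be checked explicitly. The sketch below uses a made-up summary table with the same column names as `ideal_vs_real`. The tolerance is an assumption that mirrors the 0.1s used in the timing consistency check, since a recorded lap time need not equal the exact sum of its sectors.

```python
import pandas as pd

# Hypothetical per-circuit summary using the same column names as ideal_vs_real
summary = pd.DataFrame({
    "circuit": ["A", "B"],
    "ideal_lap": [77.0, 101.2],
    "best_real_lap": [77.4, 101.2],
})

# Assumed tolerance in seconds for small lap-time vs. sector-sum differences
tolerance = 0.1

# Any row where the Ideal Lap is slower than the best real lap would be suspicious
violations = summary[summary["ideal_lap"] > summary["best_real_lap"] + tolerance]

print(f"Circuits violating the invariant: {len(violations)}")
```

An empty `violations` frame is the expected outcome; a non-empty one would point to a data quality issue rather than a real performance effect.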


### 3.4 Ideal Lap vs Best Real Lap Comparison

Here we quantify the difference between the **Ideal Lap** and the **Best Real Lap** in both seconds and percentage terms. This metric highlights which circuits have the highest potential for time optimization across sectors.


In [None]:
# Calculate the absolute difference in seconds between Ideal Lap and Best Real Lap
ideal_vs_real['diff_seconds'] = (ideal_vs_real['ideal_lap'] - ideal_vs_real['best_real_lap']).abs()

# Calculate percentage difference relative to the Best Real Lap
ideal_vs_real['diff_percent'] = (ideal_vs_real['diff_seconds'] / ideal_vs_real['best_real_lap']) * 100

# Sort by the largest optimization potential
ideal_vs_real_sorted = ideal_vs_real.sort_values('diff_seconds', ascending=False).reset_index(drop=True)

# Display top 10 circuits with the largest difference (highest optimization potential)
print("Top 10 Circuits with Highest Optimization Potential:")
display(ideal_vs_real_sorted.head(10))

In [None]:
# Visualization: time difference per circuit
plt.figure(figsize=(14,7))
barplot = sns.barplot(data=ideal_vs_real_sorted, x='circuit', y='diff_seconds', hue='diff_percent', dodge=False, palette='gist_heat_r', legend=False, order=ideal_vs_real_sorted['circuit'])
plt.xticks(rotation=90)
plt.ylabel("Time Difference (seconds)")
plt.xlabel("Circuit")
plt.title("Ideal vs Real Lap Difference by Circuit", fontsize=14, fontweight='bold')

# Add value labels
for container in barplot.containers:
    barplot.bar_label(container, fmt="%.3f", label_type='edge', fontsize=9)

plt.tight_layout()
plt.show()

#### Observations on Optimization Potential

-   Circuits with the **largest difference** indicate that the best sector times were achieved by different drivers or on different laps. This suggests a higher collective maximum potential that no single driver was able to fully realize on one flying lap.
-   Low values indicate that the best real lap was already very close to the *Ideal Lap*, meaning marginal gains would be difficult to achieve.
-   This metric is crucial for focusing analysis and development efforts on circuits where the gap between real and theoretical maximum performance is largest.


### 3.5 Sector-Specific Difference Analysis

To understand *where* the optimization potential lies, we calculate the difference between the fastest sectors of the Ideal Lap and the corresponding sectors of the single **Best Real Lap**. This helps localize performance gaps.


In [None]:
# Calculate the Ideal Lap sectors (minimum of each sector per circuit)
sector_min = df_clean.groupby('circuit', observed=True)[['sector1_time','sector2_time','sector3_time']].min().reset_index()

# Find the sectors for the single best real lap per circuit
best_lap_idx = df_clean.groupby('circuit', observed=True)['lap_time'].idxmin()

# Retrieve the sector times corresponding to the fastest overall lap
best_laps = df_clean.loc[best_lap_idx, ['circuit','sector1_time','sector2_time','sector3_time']].reset_index(drop=True)

# Calculate Difference: Ideal Lap Sector - Best Real Lap Sector
sector_diff = sector_min.copy()

# Note: Since Ideal Lap sectors are minimal, these differences will always be <= 0
sector_diff['sector1_diff'] = sector_min['sector1_time'] - best_laps['sector1_time']
sector_diff['sector2_diff'] = sector_min['sector2_time'] - best_laps['sector2_time']
sector_diff['sector3_diff'] = sector_min['sector3_time'] - best_laps['sector3_time']

# Prepare data for Heatmap (set circuit as index)
heatmap_data = sector_diff.set_index('circuit')[['sector1_diff','sector2_diff','sector3_diff']]

plt.figure(figsize=(10,6))

sns.heatmap(heatmap_data.abs(), annot=True, fmt=".3f", cmap= 'gist_heat_r', cbar_kws={'label': 'Gap (s)'})
plt.title("Sector-Level Ideal vs Real Lap Gaps by Circuit")
plt.xlabel("Sector")
plt.ylabel("Circuit")
plt.show()
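
The cell above subtracts `best_laps` from `sector_min` row by row, which works because both tables come out of a `groupby` on `circuit` in the same order. As a hedged alternative, the sketch below indexes both tables by circuit so that pandas aligns them by label instead of position. The data is made up, and the sign is flipped (best real lap minus ideal sector) so that the gaps come out non-negative without calling `abs()`.

```python
import pandas as pd

# Hypothetical laps at two circuits, deliberately not sorted by circuit
laps = pd.DataFrame({
    "circuit": ["B", "A", "A", "B"],
    "lap_time": [95.0, 77.4, 77.5, 95.4],
    "sector1_time": [30.0, 25.3, 25.1, 29.8],
    "sector2_time": [40.0, 30.2, 30.4, 40.5],
    "sector3_time": [25.0, 21.9, 22.0, 25.1],
})
sector_cols = ["sector1_time", "sector2_time", "sector3_time"]

# Fastest time per sector and circuit (index: circuit)
sector_min = laps.groupby("circuit")[sector_cols].min()

# Sector times of the single fastest lap per circuit (index: circuit)
best_idx = laps.groupby("circuit")["lap_time"].idxmin()
best_laps = laps.loc[best_idx].set_index("circuit")[sector_cols]

# Subtraction aligns on the circuit index, so row order no longer matters
gaps = best_laps - sector_min

print(gaps)
```

The resulting `gaps` frame can be passed to `sns.heatmap` directly, in the same way as `heatmap_data.abs()`.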

### 3.6 Driver Contribution to Ideal Lap

Finally, we identify which drivers contributed the fastest time to **Sector 1, Sector 2, or Sector 3** across the entire season. This reveals the sectorial strengths of individual drivers.


In [None]:
# Function to identify the driver who set the minimum sector time per circuit
def sector_contributors(df, sector):
    # Find the index of the row with the minimum sector time, grouped by circuit
    idx = df.groupby('circuit', observed=True)[sector].idxmin()
    return df.loc[idx, ['circuit', 'driver']]

# Get the driver who contributed the minimum sector time in each circuit
s1_contrib = sector_contributors(df_clean, 'sector1_time')
s2_contrib = sector_contributors(df_clean, 'sector2_time')
s3_contrib = sector_contributors(df_clean, 'sector3_time')


# Count contributions per driver for each sector
s1_count = s1_contrib['driver'].value_counts().reset_index()
s1_count.columns = ['driver','sector1_count']

s2_count = s2_contrib['driver'].value_counts().reset_index()
s2_count.columns = ['driver','sector2_count']

s3_count = s3_contrib['driver'].value_counts().reset_index()
s3_count.columns = ['driver','sector3_count']

# Convert driver to string for merging (avoids category issues)
s1_count['driver'] = s1_count['driver'].astype(str)
s2_count['driver'] = s2_count['driver'].astype(str)
s3_count['driver'] = s3_count['driver'].astype(str)

# Combine all sector contribution counts
pilot_sector_summary = (s1_count.merge(s2_count, on='driver', how='outer').merge(s3_count, on='driver', how='outer').fillna(0))

# Ensure counters are integers
pilot_sector_summary[['sector1_count','sector2_count','sector3_count']] = pilot_sector_summary[['sector1_count','sector2_count','sector3_count']].astype(int)

# Add column for total contributions to the Ideal Lap
pilot_sector_summary['total_contrib'] = pilot_sector_summary['sector1_count'] + pilot_sector_summary['sector2_count'] + pilot_sector_summary['sector3_count']

# Sort by total contributions (overall most dominant)
pilot_sector_summary = pilot_sector_summary.sort_values('total_contrib', ascending=False).reset_index(drop=True)

print("Top 10 Driver Ranking by Sector Contribution (2021 Season Ideal Laps):")
display(pilot_sector_summary.head(10))
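
The same ranking can be produced more compactly with `pd.crosstab`. The sketch below is an alternative formulation on made-up data, not the method used above. Note that `idxmin` returns the first row in case of a tie, so in both versions only one driver is credited when two drivers share a fastest sector time.

```python
import pandas as pd

# Hypothetical laps: two drivers at two circuits
laps = pd.DataFrame({
    "circuit": ["A", "A", "B", "B"],
    "driver": ["AAA", "BBB", "AAA", "BBB"],
    "sector1_time": [25.1, 24.9, 29.8, 30.0],
    "sector2_time": [30.2, 30.6, 40.0, 40.5],
    "sector3_time": [22.0, 21.9, 25.0, 25.1],
})
sector_cols = ["sector1_time", "sector2_time", "sector3_time"]

# Collect one record per (circuit, sector) naming the driver with the fastest time
records = []
for col in sector_cols:
    idx = laps.groupby("circuit")[col].idxmin()
    for driver in laps.loc[idx, "driver"]:
        records.append({"sector": col, "driver": driver})
fastest = pd.DataFrame(records)

# Count how often each driver set the fastest time in each sector
summary = pd.crosstab(fastest["driver"], fastest["sector"])
summary["total_contrib"] = summary.sum(axis=1)
summary = summary.sort_values("total_contrib", ascending=False)

print(summary)
```

`crosstab` fills missing driver and sector combinations with zero, which removes the need for the outer merges and the `fillna(0)` step.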

## 4. Conclusion and Key Insights

This analysis used `FastF1` data to calculate the theoretical **Ideal Lap** for every circuit in the `2021` season's qualifying sessions, providing a benchmark for maximum potential performance.

### 4.1 Key Findings

1.  **Ideal Lap as Performance Ceiling:** The comparison plots consistently showed that the `Ideal Lap` is never slower than the `Best Real Lap`: it is either equal or faster (lower time). This confirms its role as a valid theoretical performance ceiling, providing a truer measure of the combined potential of the cars and drivers at any given circuit.
2.  **Optimization Potential:** The `Difference Analysis` (Section 3.4) highlighted specific circuits, such as **Spa-Francorchamps** and **Emilia-Romagna**, as having the largest `Ideal Lap` gap. This suggests that optimal performance requires combining different driving styles or strategies across sectors, which no single driver executed perfectly on one lap.
3.  **Sector-Level Gaps:** The `Heatmap` (Section 3.5) provided granular insight, showing which sectors contribute most to the gap. For circuits with high overall optimization potential, the largest values in the heatmap pinpoint the exact sector (e.g., Sector 1 or 2) where the best real lap was significantly slower than the theoretical fastest sector time. This directs performance development focus.
4.  **Driver Dominance:** The `Driver Contribution Ranking` (Section 3.6) effectively measured the sectorial mastery of each driver over the entire season. The leader in this ranking demonstrated the highest frequency of setting the benchmark time across all three sectors, making them the most dominant performer at the sector level during the `2021` Qualifying sessions.

### 4.2 Final Remarks

The *Ideal Lap* metric moves beyond simple overall lap times to offer a sophisticated view of potential performance, crucial for race strategy and engineering optimization. The clean, structured data pipeline established in this project ensures the reliability of these calculated insights. Furthermore, while this analysis provides macro-level sector insights, utilizing micro-sector data (detailed timing points within each sector) would offer even finer granularity, pinpointing the exact corners or straight-line segments responsible for the observed performance losses.
