# F1 Lap Time Data Pipeline
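Load FastF1 race sessions for the 2022-2024 seasons, filter laps, and build the feature table. In the run recorded here no 2024 session loaded, so the saved table contains laps from 2022 and 2023 only.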


In [15]:
%pip install -r requirements.txt


Collecting fastf1>=3.4.1 (from -r requirements.txt (line 1))
  Using cached fastf1-3.7.0-py3-none-any.whl.metadata (5.2 kB)
Collecting pandas>=2.1.4 (from -r requirements.txt (line 2))
  Downloading pandas-2.3.3-cp311-cp311-win_amd64.whl.metadata (19 kB)
Collecting scikit-learn>=1.3.2 (from -r requirements.txt (line 4))
  Using cached scikit_learn-1.8.0-cp311-cp311-win_amd64.whl.metadata (11 kB)
Collecting matplotlib>=3.8.2 (from -r requirements.txt (line 5))
  Using cached matplotlib-3.10.8-cp311-cp311-win_amd64.whl.metadata (52 kB)
Collecting seaborn>=0.13.1 (from -r requirements.txt (line 6))
  Using cached seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting pyarrow>=14.0.2 (from -r requirements.txt (line 7))
  Downloading pyarrow-22.0.0-cp311-cp311-win_amd64.whl.metadata (3.3 kB)
Collecting tqdm>=4.66.1 (from -r requirements.txt (line 8))
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting xgboost>=2.0.0 (from -r requirements.txt (line 9))
  Using cache


[notice] A new release of pip is available: 24.0 -> 25.3
[notice] To update, run: python.exe -m pip install --upgrade pip


In [16]:
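from IPython.display import display, Javascript
# Restart the kernel so the newly installed packages are picked up.
# Note: the `Jupyter.notebook` JavaScript object exists only in the classic Notebook UI.
display(Javascript("Jupyter.notebook.kernel.restart()"))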

<IPython.core.display.Javascript object>

In [17]:
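from pathlib import Path
import sys

def find_project_root(start: Path) -> Path:
    """Return `start` or its nearest ancestor that holds both `src/` and `requirements.txt`."""
    for parent in [start] + list(start.parents):
        if (parent / "src").is_dir() and (parent / "requirements.txt").exists():
            return parent
    # No match found: fall back to the starting directory.
    return start

project_root = find_project_root(Path.cwd().resolve())
# Put the project root first on sys.path so that `src` is importable.
sys.path.insert(0, str(project_root))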


In [18]:
%pip install numpy

Note: you may need to restart the kernel to use updated packages.





In [19]:
from pathlib import Path
import numpy as np
import pandas as pd
import random

from src.data_loader import enable_cache, load_laps_for_seasons, clean_laps
from src.features import build_feature_table

RANDOM_STATE = 42
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)

DATA_DIR = Path("data")
CACHE_DIR = DATA_DIR / "cache"
PROCESSED_DIR = DATA_DIR / "processed"
FEATURES_PATH = PROCESSED_DIR / "feature_table.parquet"


In [20]:
enable_cache(CACHE_DIR)

raw_laps = load_laps_for_seasons([2022, 2023, 2024], cache_dir=CACHE_DIR)
clean_laps_df = clean_laps(raw_laps)

feature_df, numeric_features, categorical_features = build_feature_table(clean_laps_df)
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
feature_df.to_parquet(FEATURES_PATH, index=False)

feature_df.head()


Race sessions:   0%|          | 0/68 [00:00<?, ?it/s]core           INFO 	Loading data for Bahrain Grand Prix - Race [v3.7.0]
req            INFO 	No cached data found for session_info. Loading data...
_api           INFO 	Fetching session info data...
req            INFO 	Data has been written to cache!
req            INFO 	No cached data found for driver_info. Loading data...
_api           INFO 	Fetching driver list...
req            INFO 	Data has been written to cache!
req            INFO 	No cached data found for session_status_data. Loading data...
_api           INFO 	Fetching session status data...
req            INFO 	Data has been written to cache!
req            INFO 	No cached data found for lap_count. Loading data...
_api           INFO 	Fetching lap count data...
req            INFO 	Data has been written to cache!
req            INFO 	No cached data found for track_status_data. Loading data...
_api           INFO 	Fetching track status data...
req            INFO 	Data 

Skipping 2023_11 (Hungarian Grand Prix): The data you are trying to access has not been loaded yet. See `Session.load`


req            INFO 	No cached data found for driver_info. Loading data...
_api           INFO 	Fetching driver list...
req            INFO 	No cached data found for session_status_data. Loading data...
_api           INFO 	Fetching session status data...
req            INFO 	No cached data found for lap_count. Loading data...
_api           INFO 	Fetching lap count data...
req            INFO 	No cached data found for track_status_data. Loading data...
_api           INFO 	Fetching track status data...
req            INFO 	No cached data found for _extended_timing_data. Loading data...
_api           INFO 	Fetching timing data...
req            INFO 	No cached data found for race_control_messages. Loading data...
_api           INFO 	Fetching race control messages...
core           INFO 	Finished loading data for 0 drivers: []
Race sessions:  50%|█████     | 34/68 [04:48<02:35,  4.57s/it]core           INFO 	Loading data for Dutch Grand Prix - Race [v3.7.0]
req            INFO 	No cac

Skipping 2023_12 (Belgian Grand Prix): The data you are trying to access has not been loaded yet. See `Session.load`


req            INFO 	No cached data found for driver_info. Loading data...
_api           INFO 	Fetching driver list...
req            INFO 	No cached data found for session_status_data. Loading data...
_api           INFO 	Fetching session status data...
req            INFO 	No cached data found for lap_count. Loading data...
_api           INFO 	Fetching lap count data...
req            INFO 	No cached data found for track_status_data. Loading data...
_api           INFO 	Fetching track status data...
req            INFO 	No cached data found for _extended_timing_data. Loading data...
_api           INFO 	Fetching timing data...
req            INFO 	No cached data found for race_control_messages. Loading data...
_api           INFO 	Fetching race control messages...
core           INFO 	Finished loading data for 0 drivers: []
Race sessions:  51%|█████▏    | 35/68 [04:50<02:07,  3.87s/it]core           INFO 	Loading data for Italian Grand Prix - Race [v3.7.0]
req            INFO 	No c

Skipping 2023_13 (Dutch Grand Prix): The data you are trying to access has not been loaded yet. See `Session.load`


req            INFO 	No cached data found for driver_info. Loading data...
_api           INFO 	Fetching driver list...
req            INFO 	No cached data found for session_status_data. Loading data...
_api           INFO 	Fetching session status data...
req            INFO 	No cached data found for lap_count. Loading data...
_api           INFO 	Fetching lap count data...
req            INFO 	No cached data found for track_status_data. Loading data...
_api           INFO 	Fetching track status data...
req            INFO 	No cached data found for _extended_timing_data. Loading data...
_api           INFO 	Fetching timing data...
req            INFO 	No cached data found for race_control_messages. Loading data...
_api           INFO 	Fetching race control messages...
core           INFO 	Finished loading data for 0 drivers: []
Race sessions:  53%|█████▎    | 36/68 [04:53<01:48,  3.38s/it]core           INFO 	Loading data for Singapore Grand Prix - Race [v3.7.0]
req            INFO 	No

Skipping 2023_14 (Italian Grand Prix): The data you are trying to access has not been loaded yet. See `Session.load`


req            INFO 	No cached data found for driver_info. Loading data...
_api           INFO 	Fetching driver list...
req            INFO 	No cached data found for session_status_data. Loading data...
_api           INFO 	Fetching session status data...
req            INFO 	No cached data found for lap_count. Loading data...
_api           INFO 	Fetching lap count data...
req            INFO 	No cached data found for track_status_data. Loading data...
_api           INFO 	Fetching track status data...
req            INFO 	No cached data found for _extended_timing_data. Loading data...
_api           INFO 	Fetching timing data...
req            INFO 	No cached data found for race_control_messages. Loading data...
_api           INFO 	Fetching race control messages...
core           INFO 	Finished loading data for 0 drivers: []
Race sessions:  54%|█████▍    | 37/68 [04:55<01:34,  3.05s/it]core           INFO 	Loading data for Japanese Grand Prix - Race [v3.7.0]
req            INFO 	No 

Skipping 2023_15 (Singapore Grand Prix): The data you are trying to access has not been loaded yet. See `Session.load`


req            INFO 	No cached data found for driver_info. Loading data...
_api           INFO 	Fetching driver list...
req            INFO 	No cached data found for session_status_data. Loading data...
_api           INFO 	Fetching session status data...
req            INFO 	No cached data found for lap_count. Loading data...
_api           INFO 	Fetching lap count data...
req            INFO 	No cached data found for track_status_data. Loading data...
_api           INFO 	Fetching track status data...
req            INFO 	No cached data found for _extended_timing_data. Loading data...
_api           INFO 	Fetching timing data...
req            INFO 	No cached data found for race_control_messages. Loading data...
_api           INFO 	Fetching race control messages...
core           INFO 	Finished loading data for 0 drivers: []
Race sessions:  56%|█████▌    | 38/68 [04:57<01:24,  2.81s/it]core           INFO 	Loading data for Qatar Grand Prix - Race [v3.7.0]
req            INFO 	No cac

Skipping 2023_16 (Japanese Grand Prix): The data you are trying to access has not been loaded yet. See `Session.load`


req            INFO 	No cached data found for driver_info. Loading data...
_api           INFO 	Fetching driver list...
req            INFO 	No cached data found for session_status_data. Loading data...
_api           INFO 	Fetching session status data...
req            INFO 	No cached data found for lap_count. Loading data...
_api           INFO 	Fetching lap count data...
req            INFO 	No cached data found for track_status_data. Loading data...
_api           INFO 	Fetching track status data...
req            INFO 	No cached data found for _extended_timing_data. Loading data...
_api           INFO 	Fetching timing data...
req            INFO 	No cached data found for race_control_messages. Loading data...
_api           INFO 	Fetching race control messages...
core           INFO 	Finished loading data for 0 drivers: []
Race sessions:  57%|█████▋    | 39/68 [04:59<01:16,  2.64s/it]

Skipping 2023_17 (Qatar Grand Prix): The data you are trying to access has not been loaded yet. See `Session.load`


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  59%|█████▉    | 40/68 [05:00<00:58,  2.07s/it]

Skipping 2023_18 (United States Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  60%|██████    | 41/68 [05:01<00:45,  1.68s/it]

Skipping 2023_19 (Mexico City Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  62%|██████▏   | 42/68 [05:02<00:36,  1.40s/it]

Skipping 2023_20 (São Paulo Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  63%|██████▎   | 43/68 [05:02<00:30,  1.20s/it]

Skipping 2023_21 (Las Vegas Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  65%|██████▍   | 44/68 [05:03<00:25,  1.07s/it]

Skipping 2023_22 (Abu Dhabi Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  66%|██████▌   | 45/68 [05:04<00:22,  1.03it/s]

Skipping 2024_01 (Bahrain Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  68%|██████▊   | 46/68 [05:05<00:19,  1.10it/s]

Skipping 2024_02 (Saudi Arabian Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  69%|██████▉   | 47/68 [05:05<00:18,  1.16it/s]

Skipping 2024_03 (Australian Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  71%|███████   | 48/68 [05:06<00:16,  1.21it/s]

Skipping 2024_04 (Japanese Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  72%|███████▏  | 49/68 [05:07<00:15,  1.25it/s]

Skipping 2024_05 (Chinese Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  74%|███████▎  | 50/68 [05:08<00:14,  1.27it/s]

Skipping 2024_06 (Miami Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  75%|███████▌  | 51/68 [05:08<00:13,  1.29it/s]

Skipping 2024_07 (Emilia Romagna Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  76%|███████▋  | 52/68 [05:09<00:12,  1.30it/s]

Skipping 2024_08 (Monaco Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  78%|███████▊  | 53/68 [05:10<00:11,  1.31it/s]

Skipping 2024_09 (Canadian Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  79%|███████▉  | 54/68 [05:11<00:10,  1.32it/s]

Skipping 2024_10 (Spanish Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  81%|████████  | 55/68 [05:11<00:09,  1.32it/s]

Skipping 2024_11 (Austrian Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  82%|████████▏ | 56/68 [05:12<00:09,  1.33it/s]

Skipping 2024_12 (British Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  84%|████████▍ | 57/68 [05:13<00:08,  1.33it/s]

Skipping 2024_13 (Hungarian Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  85%|████████▌ | 58/68 [05:14<00:07,  1.33it/s]

Skipping 2024_14 (Belgian Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  87%|████████▋ | 59/68 [05:14<00:06,  1.33it/s]

Skipping 2024_15 (Dutch Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  88%|████████▊ | 60/68 [05:15<00:06,  1.33it/s]

Skipping 2024_16 (Italian Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  90%|████████▉ | 61/68 [05:16<00:05,  1.33it/s]

Skipping 2024_17 (Azerbaijan Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  91%|█████████ | 62/68 [05:17<00:04,  1.33it/s]

Skipping 2024_18 (Singapore Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  93%|█████████▎| 63/68 [05:17<00:03,  1.33it/s]

Skipping 2024_19 (United States Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  94%|█████████▍| 64/68 [05:18<00:02,  1.33it/s]

Skipping 2024_20 (Mexico City Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  96%|█████████▌| 65/68 [05:19<00:02,  1.33it/s]

Skipping 2024_21 (São Paulo Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  97%|█████████▋| 66/68 [05:20<00:01,  1.33it/s]

Skipping 2024_22 (Las Vegas Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions:  99%|█████████▊| 67/68 [05:20<00:00,  1.33it/s]

Skipping 2024_23 (Qatar Grand Prix): Failed to load any schedule data.


req            INFO 	No cached data found for season_schedule. Loading data...
_api           INFO 	Fetching season schedule...
Race sessions: 100%|██████████| 68/68 [05:21<00:00,  4.73s/it]


Skipping 2024_24 (Abu Dhabi Grand Prix): Failed to load any schedule data.


|   | LapNumber | Stint | TyreLife | LapTimeLag1 | LapTimeLag2 | LapTimeLag3 | RollingMean3 | Driver | Team | Compound | TrackStatusFlag | Circuit | LapTimeSeconds | Season | RoundNumber | EventName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 1.0 | 2.0 | NaN | NaN | NaN | NaN | ALB | Williams | SOFT | green | Sakhir | 100.548 | 2022 | 1 | Bahrain Grand Prix |
| 1 | 3.0 | 1.0 | 3.0 | 100.548 | NaN | NaN | 100.548 | ALB | Williams | SOFT | green | Sakhir | 100.664 | 2022 | 1 | Bahrain Grand Prix |
| 2 | 4.0 | 1.0 | 4.0 | 100.664 | 100.548 | NaN | 100.606 | ALB | Williams | SOFT | green | Sakhir | 101.126 | 2022 | 1 | Bahrain Grand Prix |
| 3 | 5.0 | 1.0 | 5.0 | 101.126 | 100.664 | 100.548 | 100.779333 | ALB | Williams | SOFT | green | Sakhir | 102.303 | 2022 | 1 | Bahrain Grand Prix |
| 4 | 6.0 | 1.0 | 6.0 | 102.303 | 101.126 | 100.664 | 101.364333 | ALB | Williams | SOFT | green | Sakhir | 101.708 | 2022 | 1 | Bahrain Grand Prix |
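
The lag columns in the preview above are consistent with shifted copies of `LapTimeSeconds`: `LapTimeLag1` holds the previous lap's time, and `RollingMean3` matches the mean of up to three preceding laps. `build_feature_table` itself is not shown in this notebook, so the following is only a minimal sketch of how such columns could be derived with pandas. The helper name `add_lag_features` and the grouping keys (one lap sequence per driver and race) are assumptions, not the project's implementation.

```python
import pandas as pd


def add_lag_features(laps: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add lag and rolling-mean columns per driver and race."""
    # Assumed grouping: one lap sequence per driver, per race.
    keys = ["Season", "RoundNumber", "Driver"]
    out = laps.sort_values(keys + ["LapNumber"]).copy()
    grouped = out.groupby(keys, sort=False)["LapTimeSeconds"]

    # Lap time of the lap 1, 2 and 3 positions earlier in the same group.
    for k in (1, 2, 3):
        out[f"LapTimeLag{k}"] = grouped.shift(k)

    # Mean of up to three *previous* laps: shift first so the current lap is excluded.
    out["RollingMean3"] = grouped.transform(
        lambda s: s.shift(1).rolling(window=3, min_periods=1).mean()
    )
    return out


# The five lap times from the preview (ALB, Bahrain Grand Prix 2022).
demo = pd.DataFrame({
    "Season": 2022,
    "RoundNumber": 1,
    "Driver": "ALB",
    "LapNumber": [2.0, 3.0, 4.0, 5.0, 6.0],
    "LapTimeSeconds": [100.548, 100.664, 101.126, 102.303, 101.708],
})
demo_features = add_lag_features(demo)
print(demo_features[["LapNumber", "LapTimeLag1", "RollingMean3"]])
```

On these five laps the sketch reproduces the `LapTimeLag1` and `RollingMean3` values shown in the preview. Shifting before taking the rolling mean keeps the current lap's time, which is the prediction target, out of its own features.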


In [21]:
pd.Series({
    "rows": len(feature_df),
    "columns": feature_df.shape[1],
    "seasons": sorted(feature_df["Season"].unique().tolist()),
})


rows              29820
columns              16
seasons    [2022, 2023]
dtype: object
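
The summary lists only 2022 and 2023 although three seasons were requested, and the log shows many rounds being skipped. One way to see how many rounds actually reached the table is to count distinct `RoundNumber` values per season and compare them with the number requested. The sketch below runs on a small stand-in frame so that it is self-contained; in the notebook the same function could be applied to `feature_df`. The helper name `rounds_per_season` is hypothetical, and the per-season race counts are those implied by the progress log (68 sessions in total).

```python
import pandas as pd


def rounds_per_season(features: pd.DataFrame, requested: dict[int, int]) -> pd.DataFrame:
    """Hypothetical helper: compare loaded rounds with the rounds requested per season."""
    loaded = features.groupby("Season")["RoundNumber"].nunique()
    report = pd.DataFrame({"requested": pd.Series(requested)})
    # Seasons with no rows in the table get 0 loaded rounds.
    report["loaded"] = loaded.reindex(report.index, fill_value=0)
    report["missing"] = report["requested"] - report["loaded"]
    return report


# Stand-in data (illustrative values only, not the real feature table).
toy = pd.DataFrame({
    "Season": [2022, 2022, 2022, 2023, 2023],
    "RoundNumber": [1, 1, 2, 1, 1],
})
# Race counts per season as implied by the progress log.
report = rounds_per_season(toy, {2022: 22, 2023: 22, 2024: 24})
print(report)
```

A non-zero `missing` value for a season signals that the loader skipped sessions and that the cell may need to be rerun once the cache is populated.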