# 📊 Notebook 01: Data Collection

**Predicting F1 Race Finishing Positions Using Practice & Qualifying Data**

This notebook collects all F1 session data for the 2023, 2024, and 2025 seasons using the **FastF1** library:

- Practice sessions (FP1, FP2, FP3)
- Qualifying results
- Race results (target variable)
- Weather conditions

All data is cached locally and saved as Parquet files for downstream use.

In [5]:
!pip install fastf1



In [1]:
# Standard imports
import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Add project root to path
sys.path.insert(0, os.path.abspath('..'))

import pandas as pd
import numpy as np
import fastf1

import config
from src.data_collection import (
    get_race_schedule,
    collect_season_data,
    collect_and_save_all_data,
    load_saved_data,
)

print(f"FastF1 version: {fastf1.__version__}")
print(f"Cache dir: {config.CACHE_DIR}")
print(f"Processed dir: {config.PROCESSED_DIR}")

FastF1 version: 3.8.1
Cache dir: /Users/t.joongpiriyapong/Documents MacHDD/Workspace/f1-race-data-analytics/data/cache
Processed dir: /Users/t.joongpiriyapong/Documents MacHDD/Workspace/f1-race-data-analytics/data/processed
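
The `config` module itself is not shown in this notebook. As a rough sketch only, it might look like the following, assuming the `data/cache` and `data/processed` layout implied by the printed paths and the year lists printed in the summary at the end of this notebook. `PROJECT_ROOT` is a hypothetical name.

```python
# Hypothetical sketch of config.py; the real module is not shown in this notebook.
from pathlib import Path

# Assumption: config.py sits at the project root. Fall back to the working
# directory when __file__ is unavailable (e.g. in an interactive session).
PROJECT_ROOT = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()

CACHE_DIR = PROJECT_ROOT / "data" / "cache"          # FastF1 download cache
PROCESSED_DIR = PROJECT_ROOT / "data" / "processed"  # saved Parquet files

# Year lists as printed in the final summary cell.
TRAIN_YEARS = [2023, 2024]
TEST_YEARS = [2025]
ALL_YEARS = TRAIN_YEARS + TEST_YEARS
```

Keeping the year split in one place means the collection, verification, and summary cells all iterate over the same seasons.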


## 1. Race Schedule Overview

Let's first inspect the race calendar for each season.

In [3]:
# Show race schedules for all years
for year in config.ALL_YEARS:
    schedule = get_race_schedule(year)
    print(f"\n{'='*50}")
    print(f"{year} Season: {len(schedule)} races")
    print(f"{'='*50}")
    display(schedule[['RoundNumber', 'EventName', 'Location', 'EventDate']].head(25))


2023 Season: 22 races


Unnamed: 0,RoundNumber,EventName,Location,EventDate
0,1,Bahrain Grand Prix,Sakhir,2023-03-05
1,2,Saudi Arabian Grand Prix,Jeddah,2023-03-19
2,3,Australian Grand Prix,Melbourne,2023-04-02
3,4,Azerbaijan Grand Prix,Baku,2023-04-30
4,5,Miami Grand Prix,Miami,2023-05-07
5,6,Monaco Grand Prix,Monaco,2023-05-28
6,7,Spanish Grand Prix,Barcelona,2023-06-04
7,8,Canadian Grand Prix,Montréal,2023-06-18
8,9,Austrian Grand Prix,Spielberg,2023-07-02
9,10,British Grand Prix,Silverstone,2023-07-09



2024 Season: 24 races


Unnamed: 0,RoundNumber,EventName,Location,EventDate
0,1,Bahrain Grand Prix,Sakhir,2024-03-02
1,2,Saudi Arabian Grand Prix,Jeddah,2024-03-09
2,3,Australian Grand Prix,Melbourne,2024-03-24
3,4,Japanese Grand Prix,Suzuka,2024-04-07
4,5,Chinese Grand Prix,Shanghai,2024-04-21
5,6,Miami Grand Prix,Miami,2024-05-05
6,7,Emilia Romagna Grand Prix,Imola,2024-05-19
7,8,Monaco Grand Prix,Monaco,2024-05-26
8,9,Canadian Grand Prix,Montréal,2024-06-09
9,10,Spanish Grand Prix,Barcelona,2024-06-23



2025 Season: 24 races


Unnamed: 0,RoundNumber,EventName,Location,EventDate
0,1,Australian Grand Prix,Melbourne,2025-03-16
1,2,Chinese Grand Prix,Shanghai,2025-03-23
2,3,Japanese Grand Prix,Suzuka,2025-04-06
3,4,Bahrain Grand Prix,Sakhir,2025-04-13
4,5,Saudi Arabian Grand Prix,Jeddah,2025-04-20
5,6,Miami Grand Prix,Miami Gardens,2025-05-04
6,7,Emilia Romagna Grand Prix,Imola,2025-05-18
7,8,Monaco Grand Prix,Monaco,2025-05-25
8,9,Spanish Grand Prix,Barcelona,2025-06-01
9,10,Canadian Grand Prix,Montréal,2025-06-15


## 2. Collect All Season Data

This step fetches all practice, qualifying, and race data for 2023, 2024, and 2025.

⚠️ **First run will take 30-60 minutes** as data is downloaded from FastF1 servers.  
Subsequent runs use the local cache and are much faster.

In [None]:
# Collect and save all data
# This is the main data collection step
collect_and_save_all_data()
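
Because the first run is slow, it can be useful to check which outputs already exist before calling the collection step again. The sketch below assumes a hypothetical `<dataset>_<year>.parquet` naming scheme; the actual file names written by `collect_and_save_all_data()` are not shown in this notebook, and `missing_outputs` is an illustrative helper, not part of `src.data_collection`.

```python
from pathlib import Path


def missing_outputs(processed_dir, years, datasets):
    """Return (year, dataset) pairs whose Parquet file does not exist yet."""
    processed_dir = Path(processed_dir)
    missing = []
    for year in years:
        for name in datasets:
            # Hypothetical naming scheme: <dataset>_<year>.parquet
            if not (processed_dir / f"{name}_{year}.parquet").exists():
                missing.append((year, name))
    return missing
```

In the notebook this could gate the slow cell: call `collect_and_save_all_data()` only when `missing_outputs(config.PROCESSED_DIR, config.ALL_YEARS, ["practice_laps", "qualifying", "race_results", "weather"])` is non-empty.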

## 3. Verify Collected Data

Load the saved data and inspect its quality.

In [4]:
# Verify saved data for each year
for year in config.ALL_YEARS:
    print(f"\n{'='*60}")
    print(f"Verifying {year} data")
    print(f"{'='*60}")
    data = load_saved_data(year)
    
    for key, df in data.items():
        if not df.empty:
            print(f"\n  {key}: {df.shape[0]} rows × {df.shape[1]} columns")
            print(f"    Columns: {list(df.columns)}")
        else:
            print(f"\n  {key}: EMPTY")


Verifying 2023 data

  practice_laps: 23474 rows × 26 columns
    Columns: ['DriverNumber', 'Driver', 'Team', 'LapNumber', 'SpeedI1', 'SpeedI2', 'SpeedFL', 'SpeedST', 'Compound', 'TyreLife', 'Stint', 'IsPersonalBest', 'FreshTyre', 'TrackStatus', 'LapTime_sec', 'Sector1Time_sec', 'Sector2Time_sec', 'Sector3Time_sec', 'Year', 'RoundNumber', 'EventName', 'SessionType', 'LapTime_seconds', 'Sector1Time_seconds', 'Sector2Time_seconds', 'Sector3Time_seconds']

  qualifying: 440 rows × 15 columns
    Columns: ['DriverNumber', 'Driver', 'Team', 'quali_position', 'Q1_sec', 'Q2_sec', 'Q3_sec', 'quali_best_time', 'quali_gap_to_pole', 'Year', 'RoundNumber', 'EventName', 'Q1_seconds', 'Q2_seconds', 'Q3_seconds']

  race_results: 440 rows × 13 columns
    Columns: ['DriverNumber', 'Driver', 'Team', 'grid_position', 'race_position', 'ClassifiedPosition', 'Status', 'Points', 'Year', 'RoundNumber', 'EventName', 'circuit_key', 'is_street_circuit']

  weather: 22 rows × 8 columns
    Columns: ['track
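
The column lists above include both `_sec` and `_seconds` variants of the same timing fields (for example `LapTime_sec` and `LapTime_seconds`, or `Q1_sec` and `Q1_seconds`). Whether the two variants always hold identical values is not stated here, so a quick check may be worthwhile before dropping either. A minimal sketch, with `redundant_seconds_columns` as a hypothetical helper name:

```python
import pandas as pd


def redundant_seconds_columns(df: pd.DataFrame) -> list[str]:
    """List `*_seconds` columns that exactly duplicate their `*_sec` twin."""
    redundant = []
    for col in df.columns:
        if col.endswith("_seconds"):
            twin = col[: -len("_seconds")] + "_sec"
            # Series.equals treats NaNs in the same positions as equal.
            if twin in df.columns and df[col].equals(df[twin]):
                redundant.append(col)
    return redundant
```

If the check confirms the duplication, `df.drop(columns=redundant_seconds_columns(df))` would remove the redundant copies.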

In [5]:
# Quick look at the 2023 race results
data_2023 = load_saved_data(2023)
print("\n── 2023 Race Results (sample) ──")
display(data_2023['race_results'].head(20))


── 2023 Race Results (sample) ──


Unnamed: 0,DriverNumber,Driver,Team,grid_position,race_position,ClassifiedPosition,Status,Points,Year,RoundNumber,EventName,circuit_key,is_street_circuit
0,1,VER,Red Bull Racing,1.0,1.0,1,Finished,25.0,2023,1,Bahrain Grand Prix,Sakhir,0
1,11,PER,Red Bull Racing,2.0,2.0,2,Finished,18.0,2023,1,Bahrain Grand Prix,Sakhir,0
2,14,ALO,Aston Martin,5.0,3.0,3,Finished,15.0,2023,1,Bahrain Grand Prix,Sakhir,0
3,55,SAI,Ferrari,4.0,4.0,4,Finished,12.0,2023,1,Bahrain Grand Prix,Sakhir,0
4,44,HAM,Mercedes,7.0,5.0,5,Finished,10.0,2023,1,Bahrain Grand Prix,Sakhir,0
5,18,STR,Aston Martin,8.0,6.0,6,Finished,8.0,2023,1,Bahrain Grand Prix,Sakhir,0
6,63,RUS,Mercedes,6.0,7.0,7,Finished,6.0,2023,1,Bahrain Grand Prix,Sakhir,0
7,77,BOT,Alfa Romeo,12.0,8.0,8,Finished,4.0,2023,1,Bahrain Grand Prix,Sakhir,0
8,10,GAS,Alpine,20.0,9.0,9,Finished,2.0,2023,1,Bahrain Grand Prix,Sakhir,0
9,23,ALB,Williams,15.0,10.0,10,Finished,1.0,2023,1,Bahrain Grand Prix,Sakhir,0


In [6]:
# Quick look at practice lap data
print("\n── 2023 Practice Laps (sample) ──")
display(data_2023['practice_laps'].head(10))

print("\nLaps per session type:")
print(data_2023['practice_laps'].groupby('SessionType').size())


── 2023 Practice Laps (sample) ──


Unnamed: 0,DriverNumber,Driver,Team,LapNumber,SpeedI1,SpeedI2,SpeedFL,SpeedST,Compound,TyreLife,...,Sector2Time_sec,Sector3Time_sec,Year,RoundNumber,EventName,SessionType,LapTime_seconds,Sector1Time_seconds,Sector2Time_seconds,Sector3Time_seconds
0,1,VER,Red Bull Racing,1.0,106.0,157.0,280.0,187.0,MEDIUM,1.0,...,69.383,28.687,2023,1,Bahrain Grand Prix,Practice 1,,,69.383,28.687
1,1,VER,Red Bull Racing,2.0,235.0,265.0,280.0,314.0,MEDIUM,2.0,...,41.153,23.741,2023,1,Bahrain Grand Prix,Practice 1,95.429,30.535,41.153,23.741
2,1,VER,Red Bull Racing,3.0,124.0,127.0,,148.0,MEDIUM,3.0,...,72.105,38.71,2023,1,Bahrain Grand Prix,Practice 1,,52.814,72.105,38.71
3,1,VER,Red Bull Racing,4.0,138.0,192.0,282.0,144.0,MEDIUM,4.0,...,67.418,28.889,2023,1,Bahrain Grand Prix,Practice 1,,69.699,67.418,28.889
4,1,VER,Red Bull Racing,5.0,233.0,228.0,,316.0,MEDIUM,5.0,...,55.102,28.289,2023,1,Bahrain Grand Prix,Practice 1,113.662,30.271,55.102,28.289
5,1,VER,Red Bull Racing,6.0,149.0,177.0,284.0,175.0,SOFT,1.0,...,71.255,34.098,2023,1,Bahrain Grand Prix,Practice 1,,,71.255,34.098
6,1,VER,Red Bull Racing,7.0,238.0,267.0,284.0,317.0,SOFT,2.0,...,40.148,23.72,2023,1,Bahrain Grand Prix,Practice 1,93.375,29.507,40.148,23.72
7,1,VER,Red Bull Racing,8.0,158.0,174.0,,214.0,SOFT,3.0,...,59.322,31.209,2023,1,Bahrain Grand Prix,Practice 1,128.682,38.151,59.322,31.209
8,1,VER,Red Bull Racing,9.0,190.0,197.0,273.0,195.0,SOFT,4.0,...,48.727,29.491,2023,1,Bahrain Grand Prix,Practice 1,,,48.727,29.491
9,1,VER,Red Bull Racing,10.0,228.0,245.0,273.0,286.0,SOFT,5.0,...,42.527,24.007,2023,1,Bahrain Grand Prix,Practice 1,97.9,31.366,42.527,24.007



Laps per session type:
SessionType
Practice 1    9114
Practice 2    8287
Practice 3    6073
dtype: int64
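
Several rows in the practice sample have no `LapTime_seconds` value, so a downstream step will likely need to drop them before aggregating. As an illustration only (this is not part of the collection code shown here), the fastest timed lap per driver and session could be derived like this; `fastest_practice_laps` and `best_lap_seconds` are hypothetical names.

```python
import pandas as pd


def fastest_practice_laps(laps: pd.DataFrame) -> pd.DataFrame:
    """Fastest timed lap per driver per practice session."""
    keys = ["Year", "RoundNumber", "SessionType", "Driver"]
    # Laps without a recorded lap time cannot be ranked, so drop them first.
    timed = laps.dropna(subset=["LapTime_seconds"])
    return (
        timed.groupby(keys, as_index=False)["LapTime_seconds"]
        .min()
        .rename(columns={"LapTime_seconds": "best_lap_seconds"})
    )
```

Called as `fastest_practice_laps(data_2023['practice_laps'])`, this would return one row per driver per session.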


In [7]:
# Qualifying data
print("\n── 2023 Qualifying Data (sample) ──")
display(data_2023['qualifying'].head(10))


── 2023 Qualifying Data (sample) ──


Unnamed: 0,DriverNumber,Driver,Team,quali_position,Q1_sec,Q2_sec,Q3_sec,quali_best_time,quali_gap_to_pole,Year,RoundNumber,EventName,Q1_seconds,Q2_seconds,Q3_seconds
0,1,VER,Red Bull Racing,1.0,91.295,90.503,89.708,89.708,0.0,2023,1,Bahrain Grand Prix,91.295,90.503,89.708
1,11,PER,Red Bull Racing,2.0,91.479,90.746,89.846,89.846,0.138,2023,1,Bahrain Grand Prix,91.479,90.746,89.846
2,16,LEC,Ferrari,3.0,91.094,90.282,90.0,90.0,0.292,2023,1,Bahrain Grand Prix,91.094,90.282,90.0
3,55,SAI,Ferrari,4.0,90.993,90.515,90.154,90.154,0.446,2023,1,Bahrain Grand Prix,90.993,90.515,90.154
4,14,ALO,Aston Martin,5.0,91.158,90.645,90.336,90.336,0.628,2023,1,Bahrain Grand Prix,91.158,90.645,90.336
5,63,RUS,Mercedes,6.0,91.057,90.507,90.34,90.34,0.632,2023,1,Bahrain Grand Prix,91.057,90.507,90.34
6,44,HAM,Mercedes,7.0,91.543,90.513,90.384,90.384,0.676,2023,1,Bahrain Grand Prix,91.543,90.513,90.384
7,18,STR,Aston Martin,8.0,91.184,91.127,90.836,90.836,1.128,2023,1,Bahrain Grand Prix,91.184,91.127,90.836
8,31,OCO,Alpine,9.0,91.508,90.914,90.984,90.914,1.206,2023,1,Bahrain Grand Prix,91.508,90.914,90.984
9,27,HUL,Haas F1 Team,10.0,91.204,90.809,,90.809,1.101,2023,1,Bahrain Grand Prix,91.204,90.809,
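
The sample rows suggest that `quali_best_time` is each driver's fastest time across Q1 to Q3 and that `quali_gap_to_pole` is the difference to the fastest best time in the event (for OCO, 90.914 - 89.708 = 1.206). The implementation in `src.data_collection` is not shown, so the following is only a sketch that reproduces the sample values under that assumption; `add_gap_to_pole` is a hypothetical name.

```python
import pandas as pd


def add_gap_to_pole(quali: pd.DataFrame) -> pd.DataFrame:
    """Derive best qualifying time and gap to the fastest time per event."""
    out = quali.copy()
    # Fastest time across the three segments; missing segments are skipped.
    out["quali_best_time"] = out[["Q1_sec", "Q2_sec", "Q3_sec"]].min(axis=1)
    # Assumption: the reference ("pole") time is the fastest best time in the event.
    pole_time = out.groupby(["Year", "RoundNumber"])["quali_best_time"].transform("min")
    out["quali_gap_to_pole"] = (out["quali_best_time"] - pole_time).round(3)
    return out
```

Note that a driver eliminated in Q2 can show a smaller gap than a driver who reached Q3, as HUL (1.101) and OCO (1.206) do in the sample, because the gap is based on the best time from any segment.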


In [8]:
# Weather data
print("\n── 2023 Weather Data ──")
display(data_2023['weather'])


── 2023 Weather Data ──


Unnamed: 0,track_temp_avg,air_temp_avg,humidity_avg,rainfall,wind_speed_avg,Year,RoundNumber,EventName
0,31.011801,27.431677,21.496894,0,0.68323,2023,1,Bahrain Grand Prix
1,31.792568,26.091892,57.790541,0,1.772297,2023,2,Saudi Arabian Grand Prix
2,30.13964,17.44955,54.157658,0,1.127027,2023,3,Australian Grand Prix
3,41.21,24.860625,49.225,0,1.083125,2023,4,Azerbaijan Grand Prix
4,36.689032,27.117419,59.425806,0,3.970323,2023,5,Miami Grand Prix
5,39.255682,25.065909,45.863636,1,1.005682,2023,6,Monaco Grand Prix
6,33.838961,22.674026,63.077922,0,1.869481,2023,7,Spanish Grand Prix
7,30.59321,18.530247,63.901235,0,1.941975,2023,8,Canadian Grand Prix
8,32.054248,22.398039,52.045752,1,1.097386,2023,9,Austrian Grand Prix
9,30.94106,21.47351,56.18543,0,3.023841,2023,10,British Grand Prix
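
Each event has a single weather row, so the per-sample readings were presumably aggregated. The sketch below shows one way to do that, assuming the input has the column names of FastF1's `session.weather_data` (`TrackTemp`, `AirTemp`, `Humidity`, `Rainfall`, `WindSpeed`), that the averages are plain means, and that `rainfall` flags any rain during the session. Which session the saved weather comes from is not stated in this notebook, and `summarize_weather` is a hypothetical helper.

```python
import pandas as pd


def summarize_weather(weather: pd.DataFrame) -> dict:
    """Collapse per-sample weather readings into one summary record."""
    return {
        "track_temp_avg": weather["TrackTemp"].mean(),
        "air_temp_avg": weather["AirTemp"].mean(),
        "humidity_avg": weather["Humidity"].mean(),
        # 1 if rain was reported in any sample, else 0 (assumed definition).
        "rainfall": int(weather["Rainfall"].any()),
        "wind_speed_avg": weather["WindSpeed"].mean(),
    }
```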


## 4. Data Summary Statistics

Overview of what we collected for downstream analysis.

In [9]:
# Summary across all years
summary_rows = []
for year in config.ALL_YEARS:
    data = load_saved_data(year)
    row = {
        'Year': year,
        'Practice Laps': len(data['practice_laps']),
        'Qualifying Entries': len(data['qualifying']),
        'Race Results': len(data['race_results']),
        'Weather Records': len(data['weather']),
        'Races': data['race_results']['RoundNumber'].nunique() if not data['race_results'].empty else 0,
        'Drivers': data['race_results']['Driver'].nunique() if 'Driver' in data['race_results'].columns else 0,
    }
    summary_rows.append(row)

summary_df = pd.DataFrame(summary_rows)
print("\n📊 DATA COLLECTION SUMMARY")
print("=" * 60)
display(summary_df)

print("\n✅ Data collection complete!")
print(f"   Training: {config.TRAIN_YEARS} → {summary_df[summary_df['Year'].isin(config.TRAIN_YEARS)]['Race Results'].sum()} samples")
print(f"   Testing:  {config.TEST_YEARS} → {summary_df[summary_df['Year'].isin(config.TEST_YEARS)]['Race Results'].sum()} samples")


📊 DATA COLLECTION SUMMARY


Unnamed: 0,Year,Practice Laps,Qualifying Entries,Race Results,Weather Records,Races,Drivers
0,2023,23474,440,440,22,22,22
1,2024,26076,479,479,24,24,24
2,2025,27986,480,479,24,24,21



✅ Data collection complete!
   Training: [2023, 2024] → 919 samples
   Testing:  [2025] → 479 samples
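
The 2025 row shows 480 qualifying entries but 479 race results, so at least one qualifying entry has no matching race row. The cause is not established here. One way to locate such rows is an anti-join on the shared key columns; the sketch below uses `pandas.DataFrame.merge` with `indicator=True`, and `qualifying_without_race` is a hypothetical helper name.

```python
import pandas as pd


def qualifying_without_race(quali: pd.DataFrame, race: pd.DataFrame) -> pd.DataFrame:
    """Return qualifying rows that have no race result for the same driver and event."""
    keys = ["Year", "RoundNumber", "Driver"]
    merged = quali.merge(
        race[keys].drop_duplicates(), on=keys, how="left", indicator=True
    )
    # "left_only" marks rows present in qualifying but absent from race results.
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
```

For example, with `data = load_saved_data(2025)`, calling `qualifying_without_race(data['qualifying'], data['race_results'])` would list the unmatched entries for inspection before modelling.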
