# F1 Strategy - Notebook 01: Exploratory Data Analysis

## Objectives
- Verify integrity of OpenF1 aggregates produced by `get_data.py`.
- Build a clean race- and driver-race-level snapshot for 2023-2024.
- Define modeling targets for strategy prediction (pit count, first pit lap, compound sequence).
- Identify data gaps to exclude or impute.

## Inputs (from `data/openf1_full/`)
- `sessions_all.csv`
- `stints_all.csv`
- `pit_all.csv`
- `weather_all.csv`
- `starting_grid_all.csv`
- `session_result_all.csv`
- `race_control_all.csv`
- `meetings_all.csv`

## Sanity checks
- Row counts by season and by session type.
- Uniqueness: `session_key × driver_number × stint_number`.
- Consistency: sum of stint lengths ≈ laps completed for finishers.
- Pit-stint coherence: number of pit entries ≈ number of stints minus one (a stop may refit the same compound, so compound changes alone can undercount).
- Weather coverage: timestamps span the race window.
- Missingness matrix per table and per season.
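
The uniqueness and lap-consistency checks above can be sketched as follows. This is a minimal illustration on toy frames that only mimic the columns of `stints_all.csv` and `session_result_all.csv`; the one-lap tolerance and the inclusive `lap_end - lap_start + 1` stint length are assumptions, not something the data documentation confirms.

```python
import pandas as pd

# Toy stand-ins for stints_df and session_result_df (values are illustrative only).
stints = pd.DataFrame({
    "session_key":   [1, 1, 1, 1, 1],
    "driver_number": [44, 44, 16, 16, 16],
    "stint_number":  [1, 2, 1, 2, 2],      # driver 16 has a duplicated stint 2
    "lap_start":     [1, 21, 1, 16, 16],
    "lap_end":       [20, 57, 15, 57, 57],
})
results = pd.DataFrame({
    "session_key":    [1, 1],
    "driver_number":  [44, 16],
    "number_of_laps": [57, 57],
    "dnf":            [False, False],
})

KEY = ["session_key", "driver_number", "stint_number"]

# Uniqueness: collect every row whose key occurs more than once.
dupes = stints[stints.duplicated(KEY, keep=False)]

# Consistency: sum stint lengths per driver-race (assumed inclusive lap range)
# and compare with laps completed, for finishers only.
clean = stints.drop_duplicates(KEY).copy()
clean["stint_len"] = clean["lap_end"] - clean["lap_start"] + 1
laps_from_stints = (
    clean.groupby(["session_key", "driver_number"], as_index=False)["stint_len"]
    .sum()
    .rename(columns={"stint_len": "laps_from_stints"})
)
check = results[~results["dnf"]].merge(
    laps_from_stints, on=["session_key", "driver_number"], how="left"
)
check["lap_gap"] = check["laps_from_stints"] - check["number_of_laps"]

# Tolerance of one lap is an assumption chosen for this sketch.
mismatches = check[check["lap_gap"].abs() > 1]
```

On real data, `dupes` and `mismatches` would be the two frames to inspect before deciding whether a session is excluded or repaired.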

## Core EDA questions
- Distribution of pit counts per race and per season.
- Stint length by compound and by circuit.
- First-pit lap vs. starting position buckets.
- SC/VSC frequency by circuit and its relation to pit timing.
- Track temperature vs. stint length for each compound.
- Outliers: extreme pit durations, micro stints, irregular weather segments.
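
For the outlier question, one simple option is an interquartile-range fence on `pit_duration`. The sketch below uses made-up rows and the conventional 1.5 × IQR multiplier, which is an assumption rather than a threshold this notebook has settled on.

```python
import pandas as pd

# Toy pit table; the missing duration mirrors the nulls seen in pit_all.csv.
pit = pd.DataFrame({
    "driver_number": [4, 77, 27, 24, 21, 18],
    "pit_duration":  [37.7, 23.6, 22.2, 23.4, 22.5, None],
})

# Quartiles ignore missing values by default.
q1, q3 = pit["pit_duration"].quantile([0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr  # 1.5 is the usual Tukey multiplier (assumed here)

# Comparisons against NaN evaluate to False, so missing durations are not flagged.
pit["is_extreme"] = pit["pit_duration"] > upper
```

Flagging rather than dropping keeps the rows available for later inspection, for example to see whether long stops cluster around red flags or penalties.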

## Temporary data models (in-memory)
- **races_master**: one row per race with date, circuit hint, SC/VSC, weather summary.
- **driver_race**: one row per driver-race with targets and covariates:
    - Targets: `pit_count`, `first_pit_lap`, `compound_seq` (e.g. S-M-H).
    - Covariates: grid_position, finish status, SC/VSC exposure, simple weather summary.
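
A minimal sketch of how the three targets could be derived from stint rows alone. It assumes that `pit_count` equals the number of stints minus one, that `first_pit_lap` is the `lap_end` of the first stint, and that `compound_seq` joins the first letter of each compound in stint order. These are working definitions for illustration; the final ones are left to Notebook 02.

```python
import pandas as pd

# Toy stints for three drivers in one race (values are illustrative only).
stints = pd.DataFrame({
    "session_key":   [1, 1, 1, 1, 1, 1],
    "driver_number": [44, 44, 44, 16, 16, 63],
    "stint_number":  [1, 2, 3, 1, 2, 1],
    "compound":      ["SOFT", "MEDIUM", "HARD", "MEDIUM", "HARD", "HARD"],
    "lap_start":     [1, 15, 38, 1, 22, 1],
    "lap_end":       [14, 37, 57, 21, 57, 57],
})

# Sort so that "first" and the joined sequence follow stint order.
ordered = stints.sort_values(["session_key", "driver_number", "stint_number"])

driver_race = (
    ordered.groupby(["session_key", "driver_number"], as_index=False)
    .agg(
        n_stints=("stint_number", "nunique"),
        first_stint_end=("lap_end", "first"),
        compound_seq=("compound", lambda s: "-".join(c[0] for c in s)),
    )
)

# Assumed definitions: one stop per stint boundary; no stop means no first pit lap.
driver_race["pit_count"] = driver_race["n_stints"] - 1
driver_race["first_pit_lap"] = driver_race["first_stint_end"].where(
    driver_race["pit_count"] > 0
)
```

Stints opened by a tyre change under a red flag would be counted as stops by this definition, so cross-checking against `pit_all.csv` is still needed.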

## Visuals (preview plan)
- Bar: pit count distributions per race.
- Box: stint length by compound.
- Hist: first-pit lap.
- Line: track temperature over race time with vertical lines at SC and pit windows.
- Heatmap: compound usage across grid for a selected race.

## Outputs
- Short EDA conclusions and data caveats.
- Final target definitions for Notebook 02 (feature engineering).
- List of sessions with incomplete data to exclude or impute.

## Next
- Notebook 02: feature engineering and target creation.
- Notebook 03: baseline models and calibrations.
- Notebook 04: error analysis per circuit and weather conditions.

In [1]:
from __future__ import annotations
from pathlib import Path
import sys, platform, warnings

import numpy as np
import pandas as pd

In [2]:
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 120)
pd.set_option('display.width', 140)

In [3]:
TARGET_YEARS: tuple[int, ...] | None = None
SESSION_SCOPE = "ALL"
USE_LAPS = False

In [4]:
def env_report() -> dict[str, str]:
    return {
        'python': sys.version.split()[0],
        'pandas': pd.__version__,
        'numpy': np.__version__,
        'platform': platform.platform(),
        'years': 'all available' if TARGET_YEARS is None else ','.join(map(str, TARGET_YEARS)),
        'session_scope': SESSION_SCOPE,
        'use_laps': str(USE_LAPS),
        'seed': str(RANDOM_SEED),
    }
env_report()

{'python': '3.13.7',
 'pandas': '2.3.2',
 'numpy': '2.3.3',
 'platform': 'Windows-11-10.0.26100-SP0',
 'years': 'all available',
 'session_scope': 'ALL',
 'use_laps': 'False',
 'seed': '42'}

In [11]:
sessions_df       = pd.read_csv('../data/openf1_full/sessions_all.csv')
stints_df         = pd.read_csv('../data/openf1_full/stints_all.csv')
pit_df            = pd.read_csv('../data/openf1_full/pit_all.csv')
weather_df        = pd.read_csv('../data/openf1_full/weather_all.csv')
starting_grid_df  = pd.read_csv('../data/openf1_full/starting_grid_all.csv')
session_result_df = pd.read_csv('../data/openf1_full/session_result_all.csv')
race_control_df   = pd.read_csv('../data/openf1_full/race_control_all.csv')
laps_df           = pd.read_csv('../data/openf1_full/laps_all.csv') if USE_LAPS else None

In [12]:
target_years = (
    sorted(sessions_df['year'].unique().tolist())
    if TARGET_YEARS is None else list(TARGET_YEARS)
)
target_years

[2023, 2024, 2025]

In [13]:
df = sessions_df[sessions_df['year'].isin(target_years)]
session_keys = df['session_key'].tolist()

len(df), len(session_keys)

(323, 323)

In [9]:
from pandas.api.types import is_datetime64_any_dtype as is_dt

INT64 = 'Int64'
INT32 = 'Int32'

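def as_int(df: pd.DataFrame, cols, kind=INT64):
    """Cast columns in place to a nullable integer dtype; unparseable values become <NA>."""
    for c in cols:
        if c in df.columns:  # columns absent from the frame are skipped silently
            df[c] = pd.to_numeric(df[c], errors='coerce').astype(kind)

def as_float(df: pd.DataFrame, cols):
    """Coerce columns in place to numeric; unparseable values become NaN."""
    # Note: pd.to_numeric keeps integer columns as int64, so this does not force a float dtype.
    for c in cols:
        if c in df.columns:
            df[c] = pd.to_numeric(df[c], errors='coerce')

def as_dt_utc(df: pd.DataFrame, cols):
    """Parse columns in place as timezone-aware UTC datetimes; already-datetime columns are left alone."""
    for c in cols:
        if c in df.columns and not is_dt(df[c]):
            df[c] = pd.to_datetime(df[c], errors='coerce', utc=True)

def upper_str(df: pd.DataFrame, cols):
    """Normalize columns in place to stripped, upper-case pandas strings."""
    for c in cols:
        if c in df.columns:
            df[c] = df[c].astype('string').str.strip().str.upper()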

In [14]:
sessions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 323 entries, 0 to 322
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   circuit_key         323 non-null    int64 
 1   circuit_short_name  323 non-null    object
 2   country_code        323 non-null    object
 3   country_key         323 non-null    int64 
 4   country_name        323 non-null    object
 5   date_end            323 non-null    object
 6   date_start          323 non-null    object
 7   gmt_offset          323 non-null    object
 8   location            323 non-null    object
 9   meeting_key         323 non-null    int64 
 10  session_key         323 non-null    int64 
 11  session_name        323 non-null    object
 12  session_type        323 non-null    object
 13  year                323 non-null    int64 
dtypes: int64(5), object(9)
memory usage: 35.5+ KB


In [16]:
# 'session_start_utc' and 'session_end_utc' are absent from this extract, so as_dt_utc skips them.
as_dt_utc(sessions_df, ['date_start', 'date_end', 'session_start_utc', 'session_end_utc'])
upper_str(sessions_df, ['session_name', 'session_type', 'country_name'])
sessions_df

Unnamed: 0,circuit_key,circuit_short_name,country_code,country_key,country_name,date_end,date_start,gmt_offset,location,meeting_key,session_key,session_name,session_type,year
0,63,Sakhir,BRN,36,BAHRAIN,2023-02-23 16:30:00+00:00,2023-02-23 07:00:00+00:00,03:00:00,Sakhir,1140,9222,PRACTICE 1,PRACTICE,2023
1,63,Sakhir,BRN,36,BAHRAIN,2023-02-24 16:30:00+00:00,2023-02-24 07:00:00+00:00,03:00:00,Sakhir,1140,7763,PRACTICE 2,PRACTICE,2023
2,63,Sakhir,BRN,36,BAHRAIN,2023-02-25 16:30:00+00:00,2023-02-25 07:00:00+00:00,03:00:00,Sakhir,1140,7764,PRACTICE 3,PRACTICE,2023
3,63,Sakhir,BRN,36,BAHRAIN,2023-03-03 12:30:00+00:00,2023-03-03 11:30:00+00:00,03:00:00,Sakhir,1141,7765,PRACTICE 1,PRACTICE,2023
4,63,Sakhir,BRN,36,BAHRAIN,2023-03-03 16:00:00+00:00,2023-03-03 15:00:00+00:00,03:00:00,Sakhir,1141,7766,PRACTICE 2,PRACTICE,2023
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318,39,Monza,ITA,13,ITALY,2025-09-07 15:00:00+00:00,2025-09-07 13:00:00+00:00,02:00:00,Monza,1268,9912,RACE,RACE,2025
319,144,Baku,AZE,30,AZERBAIJAN,2025-09-19 09:30:00+00:00,2025-09-19 08:30:00+00:00,04:00:00,Baku,1269,9897,PRACTICE 1,PRACTICE,2025
320,144,Baku,AZE,30,AZERBAIJAN,2025-09-19 13:00:00+00:00,2025-09-19 12:00:00+00:00,04:00:00,Baku,1269,9898,PRACTICE 2,PRACTICE,2025
321,144,Baku,AZE,30,AZERBAIJAN,2025-09-20 09:30:00+00:00,2025-09-20 08:30:00+00:00,04:00:00,Baku,1269,9899,PRACTICE 3,PRACTICE,2025


In [17]:
stints_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4881 entries, 0 to 4880
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   compound           4881 non-null   object 
 1   driver_number      4881 non-null   int64  
 2   lap_end            4840 non-null   float64
 3   lap_start          4840 non-null   float64
 4   meeting_key        4881 non-null   int64  
 5   session_key        4881 non-null   int64  
 6   stint_number       4881 non-null   int64  
 7   tyre_age_at_start  4881 non-null   int64  
dtypes: float64(2), int64(5), object(1)
memory usage: 305.2+ KB


In [18]:
as_int(stints_df, ['stint_number', 'lap_start', 'lap_end'], kind=INT32)
upper_str(stints_df, ['compound'])
stints_df

Unnamed: 0,compound,driver_number,lap_end,lap_start,meeting_key,session_key,stint_number,tyre_age_at_start
0,SOFT,2,12,1,1141,7953,1,0
1,SOFT,22,10,1,1141,7953,1,0
2,SOFT,23,11,1,1141,7953,1,0
3,SOFT,63,13,1,1141,7953,1,3
4,SOFT,18,15,1,1141,7953,1,3
...,...,...,...,...,...,...,...,...
4876,MEDIUM,6,53,33,1268,9912,2,0
4877,HARD,55,53,31,1268,9912,2,0
4878,HARD,87,53,19,1268,9912,2,0
4879,HARD,22,53,20,1268,9912,2,0


In [19]:
pit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3033 entries, 0 to 3032
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           3033 non-null   object 
 1   driver_number  3033 non-null   int64  
 2   lap_number     3031 non-null   float64
 3   meeting_key    3033 non-null   int64  
 4   pit_duration   2761 non-null   float64
 5   session_key    3033 non-null   int64  
dtypes: float64(2), int64(3), object(1)
memory usage: 142.3+ KB


In [21]:
as_int(pit_df, ['lap_number'], kind=INT32)
as_dt_utc(pit_df, ['date'])
pit_df

Unnamed: 0,date,driver_number,lap_number,meeting_key,pit_duration,session_key
0,2023-06-04 13:05:22.607000+00:00,4,1,1211,37.7,9102
1,2023-06-04 13:10:42.773000+00:00,77,5,1211,23.6,9102
2,2023-06-04 13:14:44.785000+00:00,27,8,1211,22.2,9102
3,2023-06-04 13:16:02.690000+00:00,24,9,1211,23.4,9102
4,2023-06-04 13:16:07.674000+00:00,21,9,1211,22.5,9102
...,...,...,...,...,...,...
3028,2025-09-07 14:06:26.941000+00:00,81,45,1268,23.6,9912
3029,2025-09-07 14:07:50.629000+00:00,4,46,1268,27.4,9912
3030,2025-09-07 14:12:58.481000+00:00,10,49,1268,25.0,9912
3031,2025-09-07 14:13:01.946000+00:00,18,49,1268,38.4,9912


In [23]:
weather_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12179 entries, 0 to 12178
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   air_temperature    12179 non-null  float64
 1   date               12179 non-null  object 
 2   humidity           12179 non-null  float64
 3   meeting_key        12179 non-null  int64  
 4   pressure           12179 non-null  float64
 5   rainfall           12179 non-null  int64  
 6   session_key        12179 non-null  int64  
 7   track_temperature  12179 non-null  float64
 8   wind_direction     12179 non-null  int64  
 9   wind_speed         12179 non-null  float64
dtypes: float64(5), int64(4), object(1)
memory usage: 951.6+ KB


In [24]:
as_dt_utc(weather_df, ['date'])
as_float(weather_df, ['rainfall', 'wind_direction'])
weather_df

Unnamed: 0,air_temperature,date,humidity,meeting_key,pressure,rainfall,session_key,track_temperature,wind_direction,wind_speed
0,29.8,2023-03-05 14:01:47.286000+00:00,19.0,1141,1016.5,0,7953,35.1,176,1.2
1,29.7,2023-03-05 14:02:47.301000+00:00,19.0,1141,1016.5,0,7953,35.0,182,1.2
2,29.7,2023-03-05 14:03:47.300000+00:00,19.0,1141,1016.5,0,7953,34.9,156,1.1
3,29.6,2023-03-05 14:04:47.314000+00:00,19.0,1141,1016.5,0,7953,34.9,201,0.8
4,29.6,2023-03-05 14:05:47.297000+00:00,19.0,1141,1016.5,0,7953,34.8,219,0.8
...,...,...,...,...,...,...,...,...,...,...
12174,26.9,2025-09-07 14:25:03.794000+00:00,44.0,1268,996.2,0,9912,42.4,166,1.0
12175,26.9,2025-09-07 14:26:03.802000+00:00,44.0,1268,996.2,0,9912,42.4,166,1.0
12176,26.9,2025-09-07 14:27:03.793000+00:00,44.0,1268,996.2,0,9912,42.4,166,1.0
12177,26.9,2025-09-07 14:28:03.802000+00:00,44.0,1268,996.2,0,9912,42.4,166,1.0


In [25]:
starting_grid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 298 entries, 0 to 297
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   driver_number  298 non-null    int64  
 1   lap_duration   279 non-null    float64
 2   meeting_key    298 non-null    int64  
 3   position       298 non-null    int64  
 4   session_key    298 non-null    int64  
dtypes: float64(1), int64(4)
memory usage: 11.8 KB


In [26]:
as_int(starting_grid_df, ['position'], kind=INT32)
starting_grid_df

Unnamed: 0,driver_number,lap_duration,meeting_key,position,session_key
0,16,101.697,1207,1,9278
1,11,101.844,1207,2,9278
2,1,101.987,1207,3,9278
3,63,102.252,1207,4,9278
4,55,102.287,1207,5,9278
...,...,...,...,...,...
293,23,103.212,1265,16,9930
294,27,103.217,1265,17,9930
295,44,103.408,1265,18,9930
296,12,105.394,1265,19,9930


In [27]:
session_result_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1836 entries, 0 to 1835
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   dnf             1836 non-null   bool   
 1   dns             1836 non-null   bool   
 2   driver_number   1836 non-null   int64  
 3   dsq             1836 non-null   bool   
 4   duration        1394 non-null   object 
 5   gap_to_leader   1661 non-null   object 
 6   meeting_key     1836 non-null   int64  
 7   number_of_laps  1831 non-null   float64
 8   points          1834 non-null   float64
 9   position        1681 non-null   float64
 10  session_key     1536 non-null   float64
dtypes: bool(3), float64(4), int64(2), object(2)
memory usage: 120.3+ KB


In [28]:
as_int(session_result_df, ['position', 'number_of_laps'], kind=INT32)
session_result_df

Unnamed: 0,dnf,dns,driver_number,dsq,duration,gap_to_leader,meeting_key,number_of_laps,points,position,session_key
0,False,False,1,False,5636.736,0,1141,57,25.0,1,7953.0
1,False,False,11,False,5648.723,11.987,1141,57,18.0,2,7953.0
2,False,False,14,False,5675.373,38.637,1141,57,15.0,3,7953.0
3,False,False,55,False,5684.788,48.052,1141,57,12.0,4,7953.0
4,False,False,44,False,5687.713,50.977,1141,57,10.0,5,7953.0
...,...,...,...,...,...,...,...,...,...,...,...
1831,False,False,10,False,,+1 LAP,1268,52,0.0,16,9912.0
1832,False,False,43,False,,+1 LAP,1268,52,0.0,17,9912.0
1833,False,False,18,False,,+1 LAP,1268,52,0.0,18,9912.0
1834,True,False,14,False,,,1268,24,0.0,,9912.0


In [29]:
race_control_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6901 entries, 0 to 6900
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   category       6901 non-null   object 
 1   date           6901 non-null   object 
 2   driver_number  1506 non-null   float64
 3   flag           3497 non-null   object 
 4   lap_number     6167 non-null   float64
 5   meeting_key    6901 non-null   int64  
 6   message        6901 non-null   object 
 7   scope          3497 non-null   object 
 8   sector         1607 non-null   float64
 9   session_key    6901 non-null   int64  
dtypes: float64(3), int64(2), object(5)
memory usage: 539.3+ KB


In [30]:
as_dt_utc(race_control_df, ['date'])
as_int(race_control_df, ['driver_number'], kind=INT64)
as_int(race_control_df, ['lap_number', 'sector'], kind=INT32)
upper_str(race_control_df, ['category', 'flag', 'message', 'scope'])
race_control_df

Unnamed: 0,category,date,driver_number,flag,lap_number,meeting_key,message,scope,sector,session_key
0,OTHER,2023-03-05 14:04:34+00:00,,,1,1141,PINK HEAD PADDING MATERIAL MUST BE USED,,,7953
1,FLAG,2023-03-05 14:20:00+00:00,,GREEN,1,1141,GREEN LIGHT - PIT EXIT OPEN,TRACK,,7953
2,FLAG,2023-03-05 14:23:00+00:00,,YELLOW,1,1141,YELLOW IN TRACK SECTOR 10,SECTOR,10,7953
3,FLAG,2023-03-05 14:23:04+00:00,,CLEAR,1,1141,CLEAR IN TRACK SECTOR 10,SECTOR,10,7953
4,OTHER,2023-03-05 14:30:00+00:00,,,1,1141,PIT EXIT CLOSED,,,7953
...,...,...,...,...,...,...,...,...,...,...
6896,FLAG,2025-09-07 14:16:59+00:00,,CHEQUERED,53,1268,CHEQUERED FLAG,TRACK,,9912
6897,OTHER,2025-09-07 14:17:05+00:00,,,53,1268,CAR 5 (BOR) TIME 1:22.619 DELETED - TRACK LIMI...,,,9912
6898,OTHER,2025-09-07 14:17:56+00:00,,,53,1268,CAR 43 (COL) TIME 1:25.035 DELETED - TRACK LIM...,,,9912
6899,OTHER,2025-09-07 14:18:26+00:00,,,53,1268,FIA STEWARDS: PENALTY SERVED - 5 SECOND TIME P...,,,9912


In [33]:
def schema_summary(name, df):
    return {
        'table': name,
        'rows': len(df),
        'cols': len(df.columns),
    }

summary = []
# The loop variable is named `frame` so it does not overwrite the filtered `df` from In [13].
for name, frame in [
    ('sessions', sessions_df),
    ('stints', stints_df),
    ('pit', pit_df),
    ('weather', weather_df),
    ('starting_grid', starting_grid_df),
    ('session_result', session_result_df),
    ('race_control', race_control_df),
]:
    if frame is not None:
        summary.append(schema_summary(name, frame))

pd.DataFrame(summary)

Unnamed: 0,table,rows,cols
0,sessions,323,14
1,stints,4881,8
2,pit,3033,6
3,weather,12179,10
4,starting_grid,298,5
5,session_result,1836,11
6,race_control,6901,10
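
The row and column counts could be complemented by the per-table missingness report listed under the sanity checks. The sketch below is one possible shape for it; the helper name `missingness` and the toy tables are hypothetical and stand in for the real frames loaded above.

```python
import pandas as pd

def missingness(tables: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Return a long-format frame with the share of null values per table and column."""
    frames = []
    for name, frame in tables.items():
        share = (
            frame.isna().mean()          # fraction of nulls per column
            .rename("null_share")
            .rename_axis("column")
            .reset_index()
        )
        share.insert(0, "table", name)
        frames.append(share)
    return pd.concat(frames, ignore_index=True)

# Toy tables (illustrative values only).
toy_tables = {
    "pit": pd.DataFrame({
        "lap_number":   [1, None, 8, 9],
        "pit_duration": [37.7, None, None, 23.4],
    }),
    "stints": pd.DataFrame({"lap_start": [1, 1], "lap_end": [12, 10]}),
}

report = missingness(toy_tables)
gaps = report[report["null_share"] > 0]  # columns with at least one null
```

Grouping each frame by season before calling `isna().mean()` would give the per-season view; that step is omitted here because only some tables carry a `year` column directly.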
