# COGS 108 - Data Checkpoint

## Authors

- Aditya Jadhav: Conceptualization, Data curation, Analysis, Writing – review & editing
- Swayam Dani: Conceptualization, Methodology, Analysis, Visualization
- Albert Bunyi: Data curation, Software, Analysis
- Sean Yang: Background research, Visualization, Writing – original draft
- Benjamin Balingit: Project administration, Methodology, Writing – original draft, Writing – review & editing

## Research Question

How does tyre degradation within individual stints, measured as the difference between the fastest lap and the final lap of that stint, differ between Red Bull Racing drivers across circuits during the 2023 to 2025 regulation era, and how is that degradation related to the tyre age at the end of the stint?

## Background and Prior Work

Formula 1 is a technologically advanced motorsport in which race performance is shaped not only by final results, but also by underlying factors such as tyre behavior, car setup, and track characteristics. Modern Formula 1 cars operate at the intersection of mechanical engineering, aerodynamics, and data-driven decision making. Over recent decades, advancements in engineering and data collection have made it possible to analyze race performance at a much finer granularity, including lap-by-lap pace and stint-level trends. Prior work has shown that these factors play a critical role in shaping race dynamics, particularly through tyre degradation and consistency of lap times over a race distance.

Engineering-focused analyses of Formula 1 emphasize how regulatory stability and technical innovation influence performance trends over time. For example, a historical review of Formula 1 engineering development highlights how improvements in materials, aerodynamics, and power units have steadily increased both performance and data availability in the sport (Evolution of Formula One Motorsport, Academia.edu). This work demonstrates that as engineering systems become more sophisticated, performance differences increasingly emerge through operational factors such as tyre management and race strategy rather than raw speed alone. This context motivates a closer examination of lap-level performance metrics, which can reveal meaningful patterns beyond race outcomes.

More focused prior work has examined tyre compounds and degradation directly. The official Formula 1 tyre guide explains how different compounds are designed to trade off grip and durability, with softer tyres providing higher initial performance at the cost of faster degradation (Formula1.com, Beginner's Guide to F1 Tyres). Complementing this practical perspective, recent academic research has modeled tyre degradation using lap-time data, demonstrating that degradation can be quantified as a systematic increase in lap times over a stint rather than random noise (arXiv:2512.00640). These studies show that tyre degradation and lap-time variability are measurable, interpretable phenomena. However, much of this work focuses on individual races or modeling approaches rather than descriptive, multi-season analysis. Our project builds on these insights by examining lap-time degradation and driver consistency across multiple seasons (2023–2025), tyre compounds, and circuits using publicly available race data.

- Evolution of Formula One Motorsport (Academia.edu): https://www.academia.edu/129272726/EVOLUTION_OF_FORMULA_ONE_F1_MOTORSPORTS_AND_ITS_TOP_NOTCH_ADVANCEMENT_IN_ENGINEERING_INNOVATIONS_ACROSS_THE_RACING_INDUSTRY
- Formula1.com, Beginner's Guide to F1 Tyres: https://www.formula1.com/en/latest/article/the-beginners-guide-to-formula-1-tyres.61SvF0Kfg29UR2SPhakDqd
- arXiv:2512.00640: https://arxiv.org/abs/2512.00640

## Hypothesis


We hypothesize that softer tyre compounds will show faster lap-time degradation than harder compounds across all tracks. We also expect lap times to be shorter early in a stint, when the tyres are fresher, than late in the stint, when the tyres are older. We anticipate that tyre degradation will be similar for both Red Bull Racing drivers because they run the same car. Finally, we anticipate that driver consistency will vary by track type but remain relatively stable for a given driver across races.

## Data

### Data overview

- Dataset #1
  - Dataset Name: OpenF1 dataset
  - Link to the dataset: https://openf1.org/?python#csv-format
    - OpenF1 exposes multiple endpoints; we downloaded three (Drivers, Sessions, Stints) and merged them, then added lap durations from the Laps endpoint
    - Drivers: https://api.openf1.org/v1/drivers?&csv=true
    - Sessions: https://api.openf1.org/v1/sessions?&csv=true
    - Stints: https://api.openf1.org/v1/stints?&csv=true
  - Number of observations: 
    - Final dataset observations = 743
  - Number of variables: 
    - Final dataset variables = 10
    - Total dataset variables = 28
  - Description of the variables most relevant to this project
    - compound = describes the tyre compound type during the given stint
    - driver_number = the unique identifying number given to each driver
    - lap_end = the last lap of the stint on one set of tyres
    - lap_start = the first lap of the stint on one set of tyres
    - meeting_key = the unique key for the Grand Prix weekend in which the stint takes place
    - session_key = the unique key for the session in which the stint takes place (e.g. practice, qualifying, sprint, race)
    - stint_number = the number for the current sequence of laps between pitstops that the driver is on
    - tyre_age_at_start = the age of the tyres, in number of laps, when they are put on
    - fastest_lap_duration = the duration, in seconds, of the fastest lap of the stint
    - lap_end_duration = the duration, in seconds, of the last lap of the stint
  - Descriptions of any shortcomings this dataset has with respect to the project
    - The dataset does not account for factors that may affect the rate of tyre degradation, such as track temperature, how hard the driver is pushing, whether the driver is defending, whether the driver is stuck in the dirty air of another car, or whether the circuit's design causes higher tyre wear in general.
    - The dataset does not account for how weather affects when drivers pit to change tyres, or how changes in weather affect lap times.


In [None]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [None]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# datafiles is a list of dictionaries, one per file to download;
# each dict has the keys
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored
datafiles = [
    { 'url': 'https://api.openf1.org/v1/drivers?&csv=true', 'filename':'drivers.csv'},
    { 'url': 'https://api.openf1.org/v1/stints?&csv=true', 'filename':'stints.csv'},
    { 'url': 'https://api.openf1.org/v1/sessions?&csv=true', 'filename':'sessions.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

### OpenF1 Red Bull drivers' stint tyre degradation and lap times

Description:
OpenF1 is an open data source with 18 endpoints. It includes, among other things, car telemetry, lap times, race positions, pit stops, team radio, weather, race control messages, and championship standings. We access the data through API calls that return CSV. Because live data requires a paid account, we use historical data only. The endpoints we use are sessions, drivers, and stints, with lap durations taken from the laps endpoint. The variables most important to our dataset are: compound, driver_number, lap_end, lap_start, meeting_key, session_key, stint_number, tyre_age_at_start, lap_end_duration, fastest_lap_duration.

tyre_age_at_start is an integer number of laps. lap_end_duration and fastest_lap_duration are both measured in seconds. We compare against the fastest lap of the stint rather than the first lap because tyres need to warm up before they reach their optimal working temperature, where lap times are at their minimum. This can take several laps, so the first lap of a stint is not a good estimate of the best possible lap time on that set of tyres. We use the last lap of the stint as the other end of the comparison because it is the lap on which the tyres are most worn, so it gives the largest contrast.
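
The comparison described above can be sketched as two derived columns. This is a minimal illustration, not our final analysis code: the column names `degradation_gap` and `tyre_age_at_end` are hypothetical, the rows are made up, and the tyre-age formula assumes the age on the final lap equals the age at the start plus the number of laps between the first and last lap of the stint.

```python
import pandas as pd


def add_degradation_columns(stints: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `stints` with two derived (hypothetical) columns."""
    out = stints.copy()
    # Degradation gap in seconds: final lap of the stint minus its fastest lap.
    out["degradation_gap"] = out["lap_end_duration"] - out["fastest_lap_duration"]
    # Assumption: tyre age on the final lap = age at the start of the stint
    # plus the number of laps between the first and last lap of the stint.
    out["tyre_age_at_end"] = out["tyre_age_at_start"] + (out["lap_end"] - out["lap_start"])
    return out


# Made-up rows for illustration only (not real OpenF1 data).
example = pd.DataFrame({
    "lap_start": [1, 21],
    "lap_end": [20, 45],
    "tyre_age_at_start": [0, 3],
    "fastest_lap_duration": [92.0, 91.5],
    "lap_end_duration": [93.5, 92.25],
})
result = add_degradation_columns(example)
print(result[["degradation_gap", "tyre_age_at_end"]])
```

Because the fastest lap is the minimum over the whole stint, including the final lap, the gap computed this way cannot be negative.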

We limit variability by restricting the dataset to one constructor, Red Bull Racing, so that differences between cars do not drive differences in tyre degradation. One concern is that the dataset does not capture the unique characteristics of each circuit, which could lengthen or shorten tyre life. Additionally, F1 is a very dynamic sport, and many simultaneous events lead to strategy changes during a race: a driver may extend a stint to "overcut" an opponent or pit early to "undercut" one. We cannot predict or account for this variable.



In [None]:
# Load the raw drivers table and keep only the Red Bull entries
import pandas as pd
import numpy as np

drivers = pd.read_csv("./data/00-raw/drivers.csv")

redbull_drivers = (
    drivers[drivers["team_name"].str.contains("Red Bull", case=False, na=False)]
    .drop(columns=["headshot_url"])
    .reset_index(drop=True)
)

redbull_drivers.to_csv("./data/01-interim/redbull_drivers.csv", index=False)
print("done")

In [None]:
redbull_drivers = pd.read_csv("./data/01-interim/redbull_drivers.csv")
stints = pd.read_csv("./data/00-raw/stints.csv")

rb_numbers = redbull_drivers["driver_number"].unique()
rb_stints = stints[stints["driver_number"].isin(rb_numbers)].reset_index(drop=True)

rb_stints.to_csv("./data/01-interim/rb_stints.csv", index=False)
print("done")
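
The cell above filters stints by `driver_number` alone. Driver numbers follow a driver between teams, so a driver who raced for another team in some sessions would keep those stints too. A stricter variant is sketched below. It assumes the drivers table has one row per driver per session with a `session_key` column, which should be checked against the downloaded `drivers.csv`; `stints_for_team` is a hypothetical helper name and the rows are made up.

```python
import pandas as pd


def stints_for_team(stints: pd.DataFrame, drivers: pd.DataFrame,
                    team_pattern: str = "Red Bull") -> pd.DataFrame:
    """Keep only stints whose driver was entered for the team in that session."""
    # One (session_key, driver_number) pair per matching team entry.
    team_entries = drivers.loc[
        drivers["team_name"].str.contains(team_pattern, case=False, na=False),
        ["session_key", "driver_number"],
    ].drop_duplicates()
    # The inner merge acts as a filter on both keys at once.
    return stints.merge(team_entries, on=["session_key", "driver_number"], how="inner")


# Made-up example: driver 102 is entered for another team in session 1 only.
drivers_example = pd.DataFrame({
    "session_key": [1, 1, 2, 2],
    "driver_number": [101, 102, 101, 102],
    "team_name": ["Red Bull Racing", "Other Team", "Red Bull Racing", "Red Bull Racing"],
})
stints_example = pd.DataFrame({
    "session_key": [1, 1, 2, 2],
    "driver_number": [101, 102, 101, 102],
    "stint_number": [1, 1, 1, 1],
})
kept = stints_for_team(stints_example, drivers_example)
print(kept)
```

Switching to this filter would change the row counts reported in the data overview, so the counts would need to be recomputed.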

In [None]:
sessions = pd.read_csv("./data/00-raw/sessions.csv")
rb_stints = pd.read_csv("./data/01-interim/rb_stints.csv")

rb_stints = rb_stints.merge(
    sessions[["session_key", "year", "session_type"]],
    on="session_key",
    how="left"
)

rb_stints = rb_stints[
    (rb_stints["year"].notna()) &
    (rb_stints["year"] != 2026) &
    (rb_stints["session_type"] == "Race")
]

rb_stints = rb_stints.drop(columns=["year", "session_type"]).reset_index(drop=True)

rb_stints.to_csv("./data/01-interim/rb_stints.csv", index=False)
print("done")

In [None]:
# Session dates to exclude; these appear to be sprint-race days
bad_dates = pd.to_datetime([
    "2023-04-29",
    "2023-07-29",
    "2023-10-07",
    "2023-10-21",
    "2023-11-04",
    "2024-04-20",
    "2024-05-04",
    "2024-06-29",
    "2024-10-19",
    "2024-11-02",
    "2024-11-30",
    "2025-03-22",
    "2025-05-03",
    "2025-07-26",
    "2025-10-18",
    "2025-11-08",
    "2025-11-29",
]).date

tmp = rb_stints.merge(
    sessions[["session_key", "date_start"]],
    on="session_key",
    how="left"
)

tmp["date_start"] = pd.to_datetime(tmp["date_start"], errors="coerce").dt.date

filtered = tmp[~tmp["date_start"].isin(bad_dates)] \
            .drop(columns=["date_start"]) \
            .reset_index(drop=True)

filtered.to_csv("./data/01-interim/rb_stints.csv", index=False)

print("Dimensions of the dataset:")
filtered.shape  # report the shape of the filtered data that was just saved
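
The hard-coded date list in the cell above appears to target sprint sessions. A hedged alternative is to filter on the session name instead of on dates. The sketch below assumes `sessions.csv` has a `session_name` column that distinguishes "Sprint" from "Race"; we have not verified this against the downloaded file, `drop_sprint_stints` is a hypothetical helper name, and the rows are made up.

```python
import pandas as pd


def drop_sprint_stints(stints: pd.DataFrame, sessions: pd.DataFrame) -> pd.DataFrame:
    """Keep only stints from sessions whose session_name is exactly "Race"."""
    # Left merge keeps every stint row and attaches its session name.
    named = stints.merge(
        sessions[["session_key", "session_name"]], on="session_key", how="left"
    )
    # Stints whose session is unknown (NaN name) are dropped as well.
    is_grand_prix = named["session_name"].eq("Race")
    return named.loc[is_grand_prix].drop(columns=["session_name"]).reset_index(drop=True)


# Made-up example: session 1 is a Grand Prix, session 2 is a sprint.
sessions_example = pd.DataFrame({
    "session_key": [1, 2],
    "session_name": ["Race", "Sprint"],
})
stints_example = pd.DataFrame({
    "session_key": [1, 2, 1],
    "stint_number": [1, 1, 2],
})
grand_prix_only = drop_sprint_stints(stints_example, sessions_example)
print(grand_prix_only)
```

This avoids maintaining the date list by hand, at the cost of depending on how the API labels sessions.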

In [None]:
rb_stints = pd.read_csv("./data/01-interim/rb_stints.csv")

rb_stints["lap_start"] = pd.to_numeric(rb_stints["lap_start"], errors="coerce")
rb_stints["lap_end"]   = pd.to_numeric(rb_stints["lap_end"], errors="coerce")

rb_stints["lap_end_duration"] = np.nan
rb_stints["fastest_lap_duration"] = np.nan

groups = rb_stints.groupby(["driver_number", "session_key"]).groups

for (driver_number, session_key), idx in groups.items():
    driver_number = int(driver_number)
    session_key = int(session_key)

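    # Fetch every lap for this driver in this session from the OpenF1 laps endpoint
    url = f"https://api.openf1.org/v1/laps?driver_number={driver_number}&session_key={session_key}"
    laps = pd.read_json(url)

    # Skip driver/session pairs for which the API returns no laps
    if laps.empty:
        continue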

    laps["lap_number"] = pd.to_numeric(laps["lap_number"], errors="coerce")
    laps["lap_duration"] = pd.to_numeric(laps["lap_duration"], errors="coerce")

    for i in idx:
        lap_start = rb_stints.at[i, "lap_start"]
        lap_end = rb_stints.at[i, "lap_end"]

        if pd.isna(lap_start) or pd.isna(lap_end):
            continue

        lap_start = int(lap_start)
        lap_end = int(lap_end)

        stint_laps = laps[(laps["lap_number"] >= lap_start) & (laps["lap_number"] <= lap_end)]
        if stint_laps.empty:
            continue

        rb_stints.at[i, "fastest_lap_duration"] = stint_laps["lap_duration"].min(skipna=True)

        end_row = stint_laps[stint_laps["lap_number"] == lap_end]
        if not end_row.empty:
            rb_stints.at[i, "lap_end_duration"] = end_row["lap_duration"].iloc[0]

rb_stints.to_csv("./data/02-processed/rb_stints.csv", index=False)

print("Updated rb_stints.csv")
# Count rows (not cells) that are missing at least one of the required values
required = ["lap_start", "lap_end", "lap_end_duration", "fastest_lap_duration"]
print("Rows with missing lap_start or lap_end or lap_end_duration or fastest_lap_duration:", rb_stints[required].isna().any(axis=1).sum())

In [None]:
rb_stints = rb_stints.dropna(subset=["lap_start", "lap_end", "lap_end_duration", "fastest_lap_duration"]).reset_index(drop=True)

rb_stints.to_csv("./data/02-processed/rb_stints.csv", index=False)
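
The per-stint lookup inside the lap-fetching loop above can be pulled out into a small function so it can be checked without calling the API. The sketch below is one way to do that; `summarize_stint` is a hypothetical helper name and the lap values are made up.

```python
import math

import pandas as pd


def summarize_stint(laps: pd.DataFrame, lap_start: int, lap_end: int):
    """Return the fastest and final lap durations for laps lap_start..lap_end."""
    # `between` is inclusive on both ends by default.
    stint_laps = laps[laps["lap_number"].between(lap_start, lap_end)]
    if stint_laps.empty:
        return None
    final = stint_laps.loc[stint_laps["lap_number"] == lap_end, "lap_duration"]
    return {
        # min() skips missing durations.
        "fastest_lap_duration": stint_laps["lap_duration"].min(),
        # NaN when the final lap itself is absent from the laps table.
        "lap_end_duration": final.iloc[0] if not final.empty else math.nan,
    }


# Made-up laps for illustration only.
laps_example = pd.DataFrame({
    "lap_number": [1, 2, 3, 4, 5],
    "lap_duration": [95.0, 92.0, 91.0, 91.5, 93.0],
})
summary = summarize_stint(laps_example, 2, 5)
missing = summarize_stint(laps_example, 10, 12)
print(summary, missing)
```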

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist

- put an X there if you've considered the item
- IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.

Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section. You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?\
<br>This project relies exclusively on publicly available Formula 1 race data, including lap times, tyre compounds, and circuit characteristics released by Formula 1, the FIA, and reputable third-party motorsport databases. No direct interaction with drivers, teams, or other individuals occurs, and no private or proprietary data are collected. Because the data are historical, observational, and already in the public domain, informed consent from individuals is not applicable in this context.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?\
<br>We recognize that Formula 1 data collection may introduce biases related to race-specific events such as safety cars, red flags, weather changes, and retirements, which can disproportionately affect lap-time degradation measurements. Additionally, differences in track layouts and race lengths between street circuits and permanent circuits may lead to uneven representation across categories. To address these issues, we explicitly document data exclusions, normalize lap-time metrics where appropriate, and interpret results within the context of these known sources of bias rather than attributing outcomes solely to tyre compounds or driver performance.

- **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?
- **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

### B. Data Storage
<br>All our data comes from public databases<br>
- **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?
- **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?
- **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

Data will only be stored for the duration of the course project. After the project is completed, local copies will be deleted and only final analysis outputs will remain in the project repository.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?\
<br>Our analysis focuses on quantitative performance metrics and does not capture qualitative factors such as driver feedback, team communications, or real-time strategic decision-making during races. These perspectives may influence tyre management and lap-time consistency but are not directly observable in the available data. To mitigate this limitation, we contextualize findings using existing Formula 1 regulations, race reports, and established motorsport knowledge, and we avoid making causal claims that would require access to these missing perspectives.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?\
<br>The dataset may reflect imbalances in tyre compound usage, circuit types, and competitive conditions across the 2023–2025 seasons. For example, certain tyre compounds may be used more frequently at specific circuits, and stronger teams may complete longer stints under favorable conditions. We address these biases by stratifying analyses by circuit type, controlling for stint length, and clearly stating assumptions and limitations where confounding variables cannot be fully removed.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?\
<br>All visualizations and summary statistics are designed to accurately reflect the underlying race data without exaggerating trends or masking variability. Axis scales, aggregation choices, and comparisons between tyre compounds and circuit types are selected to avoid misleading interpretations. Where variability is high or sample sizes differ across circuits or seasons, this uncertainty is clearly communicated rather than smoothed over.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?\
<br>The analysis does not involve private or sensitive information. All data used consist of publicly available lap times, tyre choices, and circuit classifications that are widely reported in Formula 1 coverage. No attempts are made to infer personal characteristics or private behavior beyond what is explicitly observable in race performance data.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?\
<br>The full analytical process, including data sources, preprocessing steps, filtering criteria, and modeling choices, is documented to ensure reproducibility. This documentation allows results to be independently verified and enables future review if errors, biases, or alternative interpretations are identified.

### D. Modeling
- **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?
- **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?
 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?\
<br>We carefully select modeling metrics to align with the research question and avoid misleading optimization. Instead of relying on a single outcome measure, we analyze multiple complementary metrics, including lap-time degradation rates and measures of lap-time variability to capture driver consistency. This approach reduces the risk that conclusions are driven by artifacts of a single metric and allows for a more nuanced comparison across tyre compounds and circuit types.

- **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?
 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?\
<br>We clearly communicate the limitations of our modeling approach, including the observational nature of Formula 1 race data, the presence of unobserved confounding variables such as weather and strategic decisions, and simplifications involved in classifying circuits and tyre behavior. Model outputs are presented as descriptive or associative findings rather than causal claims, and conclusions are framed to avoid overgeneralization beyond the scope of the data.

### E. Deployment
- **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?
- **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?
- **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?
 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?\
<br>Although this project is not deployed as a production system, we acknowledge the possibility that analytical results could be misinterpreted or taken out of context. For example, findings about lap-time degradation or driver consistency could be incorrectly attributed solely to driver skill rather than broader factors such as team resources, race strategy, or external conditions. To mitigate this risk, results are presented with clear explanations, appropriate caveats, and explicit statements about the observational and non-causal nature of the analysis.


## Team Expectations 

- All team members will communicate regularly via Messages, GitHub issues, and biweekly Zoom meetings to update one another on the current progress of their tasks.  
- Expected tone: All team members will communicate in a friendly tone when raising issues and will not personally target one another. For example: “I am confused about X, could you explain further what you mean by that?”  
- Expectations around tasks: Tasks will be assigned based on need, so there will be no specialized roles.  
- Expectations for struggling to deliver on time: Members will be expected to communicate if they are struggling to complete tasks on time, but if there are issues contacting anyone within the group, the group will try to take over the task within 2 days of the original deadline assigned.  
- Each member will complete assigned tasks before agreed deadlines and notify the group early if delays arise.  
- All analysis and writing will be reviewed by at least one other team member before submission.  
- Disagreements will be resolved respectfully through discussion and reference to project goals.


## Project Timeline Proposal

## Phase 1: Data Extraction & Stint Structuring  
**Feb 6 – Feb 14**

Since our unit of analysis is the stint, this phase focuses on building a clean stint-level dataset.

### Tasks:
- Filter dataset to Red Bull Racing drivers (2023–2025)  
- Identify and separate individual stints  
- For each stint, compute:
  - Fastest lap  
  - Final lap  
  - Degradation gap (final lap time minus fastest lap time)  
  - Tyre age on the final lap  
- Validate data consistency and remove abnormal stints  

### Deliverable:
Clean, structured stint-level dataset ready for analysis  

---

## Phase 2: Exploratory Data Analysis (Checkpoint #2)  
**Feb 15 – Feb 23**

### Tasks:
- Examine distributions of:
  - Stint lengths  
  - Tyre age at final lap  
  - Degradation gaps  
- Identify outliers and unusual patterns  
- Generate initial visualizations:
  - Fastest lap vs final lap scatter plot  
  - Distribution of degradation gaps  
  - Degradation vs tyre age  

### Deliverable:
EDA notebook with interpretations ready for submission  

---

## Phase 3: Intra-Team & Circuit-Level Analysis  
**Feb 24 – Mar 3**

### Tasks:
- Compare degradation distributions between Red Bull drivers  
- Analyze degradation across circuits  
- Evaluate the relationship between tyre age and degradation  
- Assess variation across seasons (2023, 2024, 2025)  
- Finalize summary statistics and key comparisons  

### Deliverable:
Core analysis completed and metrics finalized  

---

## Phase 4: Visualization & Writing  
**Mar 4 – Mar 9**

### Tasks:
- Create polished final visualizations:
  - Driver comparison plots  
  - Degradation vs tyre age  
  - Circuit-level variation  
- Write:
  - Background & Prior Work  
  - Methods section (clearly stating stint as the unit of analysis)  
  - Results interpretation  
  - Ethics and bias discussion  
  - Limitations  

### Deliverable:
Complete draft of final project notebook  

---

## Phase 5: Final Review & Submission  
**Mar 10 – Mar 13**

### Tasks:
- Proofread and refine writing  
- Ensure the research question is clearly answered  
- Confirm all plots are labeled and interpreted  
- Restart kernel and run the notebook end-to-end  
- Final GitHub commit and submission  
- Complete individual peer evaluations  

### Deliverable:
Final project submission (internal target March 13; actual deadline March 20)
