Can 10 competing F1 teams collaborate on intelligence without sharing a single byte of secret telemetry?
This repository is the answer. Argus is a federated learning system that predicts F1 lap times across 22 circuits and 10 constructors, built from scratch in PyTorch. No Flower. No frameworks. Every line of FedAvg and FedProx was written by hand, from the weighted aggregation math to the proximal regularizer.
The result: a 42.9% collective accuracy gain over what any team could achieve training alone, while keeping every team's raw telemetry exactly where it belongs: on their own servers.
Formula 1 teams spend hundreds of millions of dollars per season on simulation, strategy, and data infrastructure. Every team collects telemetry on tire degradation, fuel loads, track conditions, weather sensitivity. And every team guards that data like a state secret, because it is one.
But there is a problem with secrecy: small teams starve.
Williams had 1,770 clean training laps in 2023. Red Bull had 2,131. That gap of 361 laps doesn't sound like much until you realize it means Williams has 17% less information to learn tire behavior, track sensitivity, and degradation curves. When Williams builds a model in isolation, it gets a mean absolute error of 3.35 seconds. Red Bull gets 2.42 seconds. The gap compounds. The teams that need data the most are the teams that have the least of it.
Federated learning solves this exact asymmetry. Instead of sharing raw telemetry, each team trains a local model on its own data, then shares only the model weights. A central aggregator averages those weights into a global model that benefits from all 10 teams' experience. No telemetry leaves any team. No competitive advantage is leaked. But every team, especially the smaller ones, gets a model that is dramatically better than anything they could train alone.
This is not theoretical. Google deploys federated learning across billions of phones for Gboard predictions. Apple uses it for Siri and QuickType. Hospital consortiums use it through NVIDIA FLARE to train diagnostic models without sharing patient data. The pattern is proven. I applied it to F1 because the incentive structure is identical: competitors who would all benefit from collaboration but cannot share raw data.
Argus processes 19,590 clean laps from the complete 2023 FIA Formula 1 World Championship, spanning all 22 Grand Prix weekends from Bahrain to Abu Dhabi. The data pipeline pulls from the official F1 live timing system via FastF1, filters for accuracy, merges weather telemetry, and produces a single structured dataset.
The system trains and evaluates three models on the same data, under the same conditions:
1. Centralized (Oracle): All 19,590 laps pooled together, trained as one model. This is the theoretical ceiling. It cannot exist in a real federated deployment because it requires every team to hand over their raw data. It exists here purely as a benchmark.
2. Federated (FedAvg/FedProx): 10 clients, one per constructor. Each client trains locally on only its own laps. After each round, clients send their updated model weights to a central server. The server computes a weighted average (proportional to each team's data volume) and sends the new global model back. 20 communication rounds, 2 local epochs per round.
3. Local Only: Each team trains its own model in complete isolation. No communication. No collaboration. This is what F1 teams effectively do today.
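The federated setup in point 2 can be sketched as follows. This is illustrative only: `clients` here are assumed to expose `.n_samples` and a `.train()` returning a state dict, and the real loop lives in `src/federated_train.py`.

```python
import copy
import torch

def fedavg_round(global_model, clients, local_epochs=2):
    """One FedAvg communication round (sketch). Each client trains a copy of
    the global model locally, then the server computes a sample-count-weighted
    average of the returned weights."""
    updates, sizes = [], []
    for client in clients:
        local_model = copy.deepcopy(global_model)          # broadcast global weights
        state = client.train(local_model, epochs=local_epochs)
        updates.append(state)
        sizes.append(client.n_samples)

    total = sum(sizes)
    avg_state = {
        key: sum((n / total) * state[key].float() for state, n in zip(updates, sizes))
        for key in updates[0]
    }
    global_model.load_state_dict(avg_state)                # new global model
    return global_model
```

Running this 20 times with 2 local epochs per round reproduces the schedule described above.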
```
┌──────────────────────────────────────────────────────────────────────────────┐
│                                 ARGUS SERVER                                 │
│                                                                              │
│        Receives weight updates from all 10 clients each round                │
│        Computes: w_global = Σ (n_k / n_total) * w_k                          │
│        Broadcasts new global weights back to all clients                     │
└──┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬──┘
   │       │       │       │       │       │       │       │       │       │
┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐
│Red   ││Ferra-││Merc- ││Aston ││McLa- ││Alfa  ││Alpi- ││Alpha ││Haas  ││Willi-│
│Bull  ││ri    ││edes  ││Martin││ren   ││Romeo ││ne    ││Tauri ││F1    ││ams   │
│2131  ││2021  ││2120  ││2032  ││2004  ││1896  ││1885  ││1897  ││1834  ││1770  │
│laps  ││laps  ││laps  ││laps  ││laps  ││laps  ││laps  ││laps  ││laps  ││laps  │
└──────┘└──────┘└──────┘└──────┘└──────┘└──────┘└──────┘└──────┘└──────┘└──────┘
```
The model itself is a three-layer MLP with 128 → 64 → 1 neurons, ReLU activations, and no batch normalization (deliberately, because batch statistics don't aggregate cleanly in federated settings). It takes 31 input features: 6 numerical (lap number, stint, tire life, air temperature, track temperature, rainfall) and 25 one-hot encoded categoricals (tire compound + event name). The target is raw lap time in seconds, optimized with Huber loss to handle occasional outlier laps.
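As a sketch, the architecture described above might look like this in PyTorch (the actual definition lives in `src/model.py`; the names here are illustrative):

```python
import torch
import torch.nn as nn

class LapTimeMLP(nn.Module):
    """Sketch of the lap-time regressor described above. No BatchNorm on
    purpose: batch statistics do not average cleanly under federated
    weight aggregation."""
    def __init__(self, in_features=31):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),  # raw lap time in seconds
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)
```

Training would pair this with `torch.nn.HuberLoss()` as the task loss.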
The server aggregation is FedAvg (McMahan et al. 2017, Algorithm 1), weighted by each client's sample count so that a team with more finishing laps contributes proportionally more to the global model. This matters: a naive unweighted average would let Haas (1,834 laps) pull the model as hard as Red Bull (2,131 laps), which is statistically wrong.
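The difference between a naive mean and the sample-count-weighted mean is easy to see on toy numbers. This is a minimal sketch; `weighted_average` here is a stand-in for the version in `src/server.py`, and the single-scalar "models" are purely illustrative.

```python
import torch

def weighted_average(states, sizes):
    """Sample-count-weighted average of client state dicts (FedAvg)."""
    total = sum(sizes)
    return {
        k: sum((n / total) * s[k].float() for s, n in zip(states, sizes))
        for k in states[0]
    }

# Two "clients" with one scalar weight each.
a = {"w": torch.tensor([1.0])}   # smaller client (Haas-sized: 1,834 laps)
b = {"w": torch.tensor([3.0])}   # larger client (Red Bull-sized: 2,131 laps)

naive = (a["w"] + b["w"]) / 2                          # 2.0: ignores data volume
fedavg = weighted_average([a, b], [1834, 2131])["w"]   # ~2.07: pulled toward the larger client
```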
| Model | Test MAE (seconds) | vs. Local Only |
|---|---|---|
| Centralized (Oracle) | 0.853 | Theoretical ceiling |
| Federated (FedAvg, 20 rounds) | 1.677 | 42.9% better |
| Local Only (average across 10 teams) | 2.936 | Baseline |
| Local Only (worst: Alfa Romeo) | 3.803 | Suffers most from isolation |
| Local Only (best: Aston Martin) | 2.276 | Even the best local model loses to federated |
The federated model at 1.677s beats every single team's local model. Not just the average. Even Aston Martin, the best performing local model at 2.276s, is still 0.6 seconds worse than the federated model that Aston Martin contributed to. That is the fundamental value proposition: collaboration makes even the strongest participant better.
| Team | Local Only MAE | Federated MAE | Improvement |
|---|---|---|---|
| Red Bull Racing | 2.424s | 1.677s | 30.8% |
| Ferrari | 2.593s | 1.677s | 35.3% |
| Mercedes | 2.503s | 1.677s | 33.0% |
| Aston Martin | 2.276s | 1.677s | 26.3% |
| McLaren | 2.486s | 1.677s | 32.5% |
| Alfa Romeo | 3.803s | 1.677s | 55.9% |
| Alpine | 3.170s | 1.677s | 47.1% |
| Williams | 3.353s | 1.677s | 50.0% |
| Haas F1 Team | 3.302s | 1.677s | 49.2% |
| AlphaTauri | 3.447s | 1.677s | 51.3% |
The pattern is unmistakable. Backmarker teams gain the most. Alfa Romeo sees a 55.9% improvement. AlphaTauri gets 51.3%. Williams gets a clean 50%. These are the teams that would benefit most from a federated protocol in reality, and the ones who have the strongest economic incentive to participate.
Three bars per constructor. Green is the oracle (centralized, impossible in practice). Blue is the federated model. Red is what each team achieves alone. The blue bar is constant across teams because the federated model is a single global model evaluated on the same test set. The red bars tell the real story: isolation hurts, and it hurts the small teams worst.
This is where the project gets interesting, and where most toy federated learning demos fall apart.
In a perfect world, every client would have data drawn from the same distribution. The textbook term is "IID": independent and identically distributed. In that world, FedAvg converges quickly and cleanly because every client is essentially a random shard of the same dataset.
F1 is the opposite of that world.
In the 80% training split, Mercedes has 1,726 laps and Williams has 1,420. That is quantity imbalance: some clients simply have more data.
But quantity imbalance is the easy part. The hard part is distribution imbalance:
At Bahrain alone, Red Bull's median lap time sits around 97.7 seconds. Williams' median is 98.4 seconds. The interquartile ranges barely overlap for some team pairs. Across the full season, team pace varies by track type (power circuits vs high downforce), tire strategy (aggressive vs conservative), and car characteristics (drag level, mechanical grip, reliability).
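The per-team skew is easy to quantify directly from the lap table. A toy sketch with made-up lap times and assumed column names (the real analysis lives in `notebooks/01_explore.ipynb`):

```python
import pandas as pd

# Hypothetical mini-sample of one event's laps; values chosen to mirror the
# Bahrain medians quoted above, not drawn from the actual dataset.
laps = pd.DataFrame({
    "Team": ["Red Bull"] * 4 + ["Williams"] * 4,
    "LapTimeSeconds": [97.5, 97.7, 97.7, 97.9, 98.2, 98.4, 98.4, 98.6],
})
medians = laps.groupby("Team")["LapTimeSeconds"].median()
# Red Bull ~97.7 s, Williams ~98.4 s: two distinct distributions, barely overlapping
```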
When a team like Red Bull trains locally, it learns "fast lap" patterns. When Haas trains locally, it learns "slow lap" patterns. When you federate them naively, the global model has to reconcile these contradictory signals. FedAvg handles this reasonably well because the weighted average ensures that disagreements are resolved proportionally. But the convergence curve shows the strain.
I implemented FedProx (Li et al. 2020) as the extension. The core idea is simple: during local training, add a proximal term to the loss function that penalizes the local model from drifting too far from the global model.
The modified local loss becomes:
L_local = L_task + (μ/2) * ||w_local - w_global||²
Here μ controls the strength of the "rubber band" pulling the local model back toward the global consensus. At μ = 0, you recover FedAvg exactly. At higher μ, local updates stay closer to the global model, which stabilizes convergence in non-IID settings at the cost of some local expressiveness.
I swept μ across [0.0, 0.001, 0.01, 0.1].
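In code, the proximal term is a few lines on top of the task loss. A sketch under assumed names (the real version lives in `src/client.py`); `global_params` is a list of detached global-weight tensors aligned with the local model's parameters:

```python
import torch

def fedprox_loss(task_loss, local_model, global_params, mu):
    """FedProx local objective: task loss + (mu/2) * ||w_local - w_global||^2.
    With mu = 0 this reduces to plain FedAvg local training."""
    prox = sum(
        torch.sum((p_local - p_global) ** 2)
        for p_local, p_global in zip(local_model.parameters(), global_params)
    )
    return task_loss + (mu / 2.0) * prox
```

The proximal term participates in autograd like any other loss term, so `loss.backward()` pulls the local weights toward the global ones with strength μ.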
The convergence curves overlap almost perfectly in this regime. That is actually an insightful result: the non-IID effect in this dataset, while visible in the data distributions, is not severe enough to cause the catastrophic client drift that FedProx was designed to fix. The quantity-weighted averaging in standard FedAvg already does most of the stabilization work here because the data imbalance across teams is moderate (1,770 to 2,131 laps, roughly a 1.2x ratio, not a 10x ratio).
In production F1 deployments with more heterogeneous telemetry streams (car sensor data, not just lap-level features), or in seasons where one team dominates catastrophically (2023 Red Bull), I would expect μ = 0.1 to produce visible stabilization gains. For this lap-level dataset, FedAvg is already well behaved.
This is an honest finding. If I inflated the FedProx gains, anyone who ran the code would see the truth in the plot. The value of the extension is not that it improved accuracy here. The value is that I understand when and why it would matter, and the infrastructure is in place to activate it when conditions demand it.
Why Huber loss instead of MSE? A few laps in the dataset are genuine outliers: safety car restarts, pit exit laps that barely passed the accuracy filter, wet weather anomalies. MSE squares the error, so a single 10 second outlier dominates the gradient. Huber loss transitions to L1 (linear penalty) for large errors, making training robust without manually cleaning every edge case.
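A toy check of that claim: with one 10-second outlier, MSE's gradient on that lap is roughly 20x larger than Huber's (with delta = 1), so the outlier dominates the update under MSE but not under Huber.

```python
import torch

pred = torch.tensor([90.0, 91.0, 101.0], requires_grad=True)  # third lap is an outlier
target = torch.tensor([90.2, 91.1, 91.0])

mse = torch.nn.functional.mse_loss(pred, target)
(g_mse,) = torch.autograd.grad(mse, pred)       # outlier gradient: 2 * 10 / 3

huber = torch.nn.functional.huber_loss(pred, target, delta=1.0)
(g_huber,) = torch.autograd.grad(huber, pred)   # outlier gradient capped at delta / 3
```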
Why time based split instead of random split? Random splitting would leak information. If the model sees laps 5, 15, 35, and 48 from Bahrain during training, predicting lap 25 is trivially easy because the track conditions are nearly identical. Splitting by lap number (train on the first 80% of laps per event, test on the final 20%) forces the model to generalize to late race conditions it has never seen. This simulates the real use case: predicting future laps from historical patterns.
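The per-event time split can be sketched in a few lines of pandas. Column names (`EventName`, `LapNumber`) are assumptions about the processed parquet schema; the real split lives in the pipeline code.

```python
import pandas as pd

def time_based_split(laps: pd.DataFrame, frac: float = 0.8):
    """Train on the first ~80% of laps of each event, test on the final ~20%,
    so the model must generalize to late-race conditions it has never seen."""
    cutoffs = laps.groupby("EventName")["LapNumber"].transform(
        lambda s: s.quantile(frac)
    )
    train = laps[laps["LapNumber"] <= cutoffs]
    test = laps[laps["LapNumber"] > cutoffs]
    return train, test
```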
Why no Driver or Team as input features? Team is the partition key for federation. Including it as a feature would let the model learn "this is a Red Bull lap, therefore fast," which defeats the purpose of showing that collaboration helps. Driver is excluded because per driver pace is emergent from the team's car and the driver's skill, and encoding it would be information that a federated model should learn from context (tire, track, stint), not from a label.
Why a three-layer MLP and not something deeper? The dataset is 19,590 samples with 31 features. A deeper or wider network would overfit. The 128 → 64 → 1 architecture is deliberately small. The purpose of this project is to demonstrate federated learning, not to build a state-of-the-art lap time predictor. A more complex model (LSTM on sequential lap data, transformer on stint sequences) would improve accuracy but obscure the FL mechanics.
Why CPU/MPS and no GPU requirement? The dataset fits in memory on a laptop. Training the full federated simulation (20 rounds × 10 clients × 2 local epochs) completes in under 2 minutes on an M-series Mac. Requiring a GPU would gatekeep reproducibility for no benefit.
```shell
git clone https://github.com/Forgingalex/argus.git
cd argus
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

```shell
python -m src.extract
```

Downloads the complete 2023 season from the F1 live timing system via FastF1. Saves 19,590 clean laps to `data/processed/laps_2023.parquet`.

```shell
python -m src.train_baseline
```

Trains for 30 epochs on all data. Saves a checkpoint to `experiments/results/baseline.pt`. Expect test MAE around 0.85s.

```shell
python -m src.federated_train
```

Runs 20 communication rounds of FedAvg across 10 team clients. Saves a checkpoint and round-by-round MAE to `experiments/results/`.

```shell
python -m src.compare
```

Trains the local-only models, loads the baseline and federated checkpoints, and produces the comparison table and plot.

```shell
python -m src.fedprox_experiment
```

Sweeps μ values [0.0, 0.001, 0.01, 0.1] and produces the convergence comparison plot.
```
argus/
├── src/
│   ├── extract.py             # FastF1 data pipeline: 22 races → parquet
│   ├── dataset.py             # PyTorch Dataset + feature engineering
│   ├── model.py               # LapTimeMLP: 128 → 64 → 1 regression head
│   ├── train_baseline.py      # Centralized oracle training
│   ├── client.py              # FLClient: local training with FedProx support
│   ├── server.py              # Weighted FedAvg aggregation
│   ├── federated_train.py     # Full FL simulation loop
│   ├── fedprox_experiment.py  # μ sweep experiment
│   └── compare.py             # Three-way evaluation + plotting
├── notebooks/
│   └── 01_explore.ipynb       # Data exploration and non-IID analysis
├── experiments/
│   └── results/               # Checkpoints, CSVs, plots
├── data/
│   ├── raw_cache/             # FastF1 cache (gitignored)
│   └── processed/             # Cleaned parquet (gitignored)
├── requirements.txt
└── .gitignore
```
Federated learning is not magic, but it is surprisingly effective. The 42.9% improvement over local only models is not because the algorithm is clever. It is because pooling gradient information from 10 different data distributions teaches the model patterns that no single distribution contains. Alfa Romeo alone never sees enough Red Bull type fast laps to learn low degradation tire curves. Through federation, it does, without ever seeing a Red Bull lap.
Non-IID is the real problem, but not always the catastrophic one. The FedProx experiment showed that for this dataset, vanilla FedAvg already handles the heterogeneity well enough. The team pace distributions differ, but not by orders of magnitude. In domains where non-IID is more severe (medical imaging across different scanner hardware, keyboard predictions across different languages), FedProx and its descendants become essential. Here, they are insurance.
The oracle gap is real and permanent. The centralized model at 0.853s is ~50% better than the federated model at 1.677s. That gap represents the cost of privacy. In a world where all teams shared everything, you would get better predictions. The argument for federation is not that it matches centralized performance. The argument is that it gets you dramatically closer to centralized performance than isolation does, without requiring anyone to trust anyone.
Small teams are the economic engine of federation. If this were a real protocol, the pitch to Williams and Haas is easy: "your model improves by 50%." The pitch to Red Bull is harder: "your model improves by 31%, and you are subsidizing your competitors." This is the core tension in any federated system. The clients who contribute the most data get the least marginal benefit. Mechanism design for fair compensation in federated protocols is an active research area.
Building from scratch forces understanding. I wrote every line of `weighted_average`, every line of the proximal term, every line of the training loop. I know why `detach().cpu().clone()` is necessary in the client return (autograd graph cleanup, device consistency, memory safety). I know why the optimizer is recreated each round (carrying momentum across global weight resets would corrupt the update direction). I could not have explained any of this if I had imported `flwr.client.NumPyClient` and overridden two methods.
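Both of those details fit in one client-side sketch. This is illustrative, not the exact code in `src/client.py`: the function name and signature are assumptions, but the two design points (fresh optimizer, detached CPU-cloned return) are the ones described above.

```python
import copy
import torch

def local_update(global_model, loader, epochs=2, lr=1e-3, device="cpu"):
    """One client's local training pass. The optimizer is recreated every
    round so no stale momentum survives the global weight reset, and the
    returned weights are detached CPU clones: no autograd graph or device
    handle leaks back to the server."""
    model = copy.deepcopy(global_model).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # fresh each round
    loss_fn = torch.nn.HuberLoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
    return {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
```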
Per driver personalization. The current model predicts a global lap time. A production system would fine tune the global model per driver, keeping a personalized head while sharing a common backbone. This is the personalization vs generalization tradeoff that current FL research is actively investigating.
Secure aggregation. Right now, the server sees raw weight updates. In theory, an adversarial server could reconstruct training data from gradients (Zhu et al. 2019). Adding secure aggregation or differential privacy (DP FedAvg with proper moments accountant) would make the protocol cryptographically private, not just architecturally private.
Port to Flower. Now that I understand what FedAvg does at the tensor level, using Flower for the next iteration would let me focus on scale (hundreds of clients, asynchronous communication, client selection strategies) instead of plumbing.
Sequential lap modeling. An LSTM or Transformer over stint sequences would capture tire degradation dynamics that the current per lap MLP cannot. Lap 15 on mediums is not independent of lap 14. The current model treats it that way.
McMahan, H.B. et al. (2017). Communication Efficient Learning of Deep Networks from Decentralized Data. AISTATS. arXiv:1602.05629.
Li, T. et al. (2020). Federated Optimization in Heterogeneous Networks (FedProx). MLSys. arXiv:1812.06127.
Zhu, L. et al. (2019). Deep Leakage from Gradients. NeurIPS. arXiv:1906.08935.
Kairouz, P. et al. (2021). Advances and Open Problems in Federated Learning. Foundations and Trends in Machine Learning.
FastF1 by Philipp SchΓ€fer: https://docs.fastf1.dev
Released under the MIT License. Copyright (c) 2026 Argus.
Argus: because the best intelligence is the intelligence you never had to share.