Can 10 competing F1 teams collaborate on intelligence without sharing a single byte of secret telemetry?
This repository is the answer. Argus is a federated learning system that predicts F1 lap times across 22 circuits and 10 constructors, built from scratch in PyTorch. No Flower. No frameworks. Every line of FedAvg and FedProx was written by hand, from the weighted aggregation math to the proximal regularizer.
The result: a 42.9% collective accuracy gain over what any team could achieve training alone, while keeping every team's raw telemetry exactly where it belongs: on their own servers.
Formula 1 teams spend hundreds of millions of dollars per season on simulation, strategy, and data infrastructure. Every team collects telemetry on tire degradation, fuel loads, track conditions, weather sensitivity. And every team guards that data like a state secret, because it is one.
But there is a problem with secrecy: small teams starve.
Williams had 1,770 clean training laps in 2023. Red Bull had 2,131. That gap of 361 laps doesn't sound like much until you realize it means Williams has 17% less information to learn tire behavior, track sensitivity, and degradation curves. When Williams builds a model in isolation, it gets a mean absolute error of 3.35 seconds. Red Bull gets 2.42 seconds. The gap compounds. The teams that need data the most are the teams that have the least of it.
Federated learning solves this exact asymmetry. Instead of sharing raw telemetry, each team trains a local model on its own data, then shares only the model weights. A central aggregator averages those weights into a global model that benefits from all 10 teams' experience. No telemetry leaves any team. No competitive advantage is leaked. But every team, especially the smaller ones, gets a model that is dramatically better than anything they could train alone.
This is not theoretical. Google deploys federated learning across billions of phones for Gboard predictions. Apple uses it for Siri and QuickType. Hospital consortiums use it through NVIDIA FLARE to train diagnostic models without sharing patient data. The pattern is proven. I applied it to F1 because the incentive structure is identical: competitors who would all benefit from collaboration but cannot share raw data.
Argus processes 19,590 clean laps from the complete 2023 FIA Formula 1 World Championship, spanning all 22 Grand Prix weekends from Bahrain to Abu Dhabi. The data pipeline pulls from the official F1 live timing system via FastF1, filters for accuracy, merges weather telemetry, and produces a single structured dataset.
The system trains and evaluates three models on the same data, under the same conditions:
1. Centralized (Oracle): All 19,590 laps pooled together, trained as one model. This is the theoretical ceiling. It cannot exist in a real federated deployment because it requires every team to hand over their raw data. It exists here purely as a benchmark.
2. Federated (FedAvg/FedProx): 10 clients, one per constructor. Each client trains locally on only its own laps. After each round, clients send their updated model weights to a central server. The server computes a weighted average (proportional to each team's data volume) and sends the new global model back. 20 communication rounds, 2 local epochs per round.
3. Local Only: Each team trains its own model in complete isolation. No communication. No collaboration. This is what F1 teams effectively do today.
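The federated setup in point 2 can be sketched as follows. This is illustrative only: `clients` here are assumed to expose `.n_samples` and a `.train()` returning a state dict, and the real loop lives in `src/federated_train.py`.

```python
import copy
import torch

def fedavg_round(global_model, clients, local_epochs=2):
    """One FedAvg communication round (sketch). Each client trains a copy of
    the global model locally, then the server computes a sample-count-weighted
    average of the returned weights."""
    updates, sizes = [], []
    for client in clients:
        local_model = copy.deepcopy(global_model)          # broadcast global weights
        state = client.train(local_model, epochs=local_epochs)
        updates.append(state)
        sizes.append(client.n_samples)

    total = sum(sizes)
    avg_state = {
        key: sum((n / total) * state[key].float() for state, n in zip(updates, sizes))
        for key in updates[0]
    }
    global_model.load_state_dict(avg_state)                # new global model
    return global_model
```

Running this 20 times with 2 local epochs per round reproduces the schedule described above.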
```
┌──────────────────────────────────────────────────────────────────────────────┐
│                                 ARGUS SERVER                                 │
│                                                                              │
│        Receives weight updates from all 10 clients each round                │
│        Computes: w_global = Σ (n_k / n_total) * w_k                          │
│        Broadcasts new global weights back to all clients                     │
└──┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬──┘
   │       │       │       │       │       │       │       │       │       │
┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐┌──┴───┐
│Red   ││Ferra-││Merc- ││Aston ││McLa- ││Alfa  ││Alpi- ││Alpha ││Haas  ││Willi-│
│Bull  ││ri    ││edes  ││Martin││ren   ││Romeo ││ne    ││Tauri ││F1    ││ams   │
│2131  ││2021  ││2120  ││2032  ││2004  ││1896  ││1885  ││1897  ││1834  ││1770  │
│laps  ││laps  ││laps  ││laps  ││laps  ││laps  ││laps  ││laps  ││laps  ││laps  │
└──────┘└──────┘└──────┘└──────┘└──────┘└──────┘└──────┘└──────┘└──────┘└──────┘
```
The model itself is a three-layer MLP with 128 → 64 → 1 neurons, ReLU activations, and no batch normalization (deliberately, because batch statistics don't aggregate cleanly in federated settings). It takes 31 input features: 6 numerical (lap number, stint, tire life, air temperature, track temperature, rainfall) and 25 one-hot encoded categoricals (tire compound + event name). The target is raw lap time in seconds, optimized with Huber loss to handle occasional outlier laps.
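As a sketch, the architecture described above might look like this in PyTorch (the actual definition lives in `src/model.py`; the names here are illustrative):

```python
import torch
import torch.nn as nn

class LapTimeMLP(nn.Module):
    """Sketch of the lap-time regressor described above. No BatchNorm on
    purpose: batch statistics do not average cleanly under federated
    weight aggregation."""
    def __init__(self, in_features=31):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1),  # raw lap time in seconds
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)
```

Training would pair this with `torch.nn.HuberLoss()` as the task loss.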
The server aggregation is FedAvg (McMahan et al. 2017, Algorithm 1), weighted by each client's sample count so that a team with more finishing laps contributes proportionally more to the global model. This matters: a naive unweighted average would let Haas (1,834 laps) pull the model as hard as Red Bull (2,131 laps), which is statistically wrong.
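The difference between a naive mean and the sample-count-weighted mean is easy to see on toy numbers. This is a minimal sketch; `weighted_average` here is a stand-in for the version in `src/server.py`, and the single-scalar "models" are purely illustrative.

```python
import torch

def weighted_average(states, sizes):
    """Sample-count-weighted average of client state dicts (FedAvg)."""
    total = sum(sizes)
    return {
        k: sum((n / total) * s[k].float() for s, n in zip(states, sizes))
        for k in states[0]
    }

# Two "clients" with one scalar weight each.
a = {"w": torch.tensor([1.0])}   # smaller client (Haas-sized: 1,834 laps)
b = {"w": torch.tensor([3.0])}   # larger client (Red Bull-sized: 2,131 laps)

naive = (a["w"] + b["w"]) / 2                          # 2.0: ignores data volume
fedavg = weighted_average([a, b], [1834, 2131])["w"]   # ~2.07: pulled toward the larger client
```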
| Model | Test MAE (seconds) | vs. Local Only |
|---|---|---|
| Centralized (Oracle) | 0.853 | Theoretical ceiling |
| Federated (FedAvg, 20 rounds) | 1.677 | 42.9% better |
| Local Only (average across 10 teams) | 2.936 | Baseline |
| Local Only (worst: Alfa Romeo) | 3.803 | Suffers most from isolation |
| Local Only (best: Aston Martin) | 2.276 | Even the best local model loses to federated |
The federated model at 1.677s beats every single team's local model. Not just the average. Even Aston Martin, the best performing local model at 2.276s, is still 0.6 seconds worse than the federated model that Aston Martin contributed to. That is the fundamental value proposition: collaboration makes even the strongest participant better.
| Team | Local Only MAE | Federated MAE | Improvement |
|---|---|---|---|
| Red Bull Racing | 2.424s | 1.677s | 30.8% |
| Ferrari | 2.593s | 1.677s | 35.3% |
| Mercedes | 2.503s | 1.677s | 33.0% |
| Aston Martin | 2.276s | 1.677s | 26.3% |
| McLaren | 2.486s | 1.677s | 32.5% |
| Alfa Romeo | 3.803s | 1.677s | 55.9% |
| Alpine | 3.170s | 1.677s | 47.1% |
| Williams | 3.353s | 1.677s | 50.0% |
| Haas F1 Team | 3.302s | 1.677s | 49.2% |
| AlphaTauri | 3.447s | 1.677s | 51.3% |
The pattern is unmistakable. Backmarker teams gain the most. Alfa Romeo sees a 55.9% improvement. AlphaTauri gets 51.3%. Williams gets a clean 50%. These are the teams that would benefit most from a federated protocol in reality, and the ones who have the strongest economic incentive to participate.
Three bars per constructor. Green is the oracle (centralized, impossible in practice). Blue is the federated model. Red is what each team achieves alone. The blue bar is constant across teams because the federated model is a single global model evaluated on the same test set. The red bars tell the real story: isolation hurts, and it hurts the small teams worst.
This is where the project gets interesting, and where most toy federated learning demos fall apart.
In a perfect world, every client would have data drawn from the same distribution. The textbook term is "IID": independent and identically distributed. In that world, FedAvg converges quickly and cleanly because every client is essentially a random shard of the same dataset.
F1 is the opposite of that world.
In the 80% training split, Mercedes has 1,726 laps and Williams has 1,420. That is quantity imbalance: some clients simply have more data.
But quantity imbalance is the easy part. The hard part is distribution imbalance:
At Bahrain alone, Red Bull's median lap time sits around 97.7 seconds. Williams' median is 98.4 seconds. The interquartile ranges barely overlap for some team pairs. Across the full season, team pace varies by track type (power circuits vs high downforce), tire strategy (aggressive vs conservative), and car characteristics (drag level, mechanical grip, reliability).
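The per-team skew is easy to quantify directly from the lap table. A toy sketch with made-up lap times and assumed column names (the real analysis lives in `notebooks/01_explore.ipynb`):

```python
import pandas as pd

# Hypothetical mini-sample of one event's laps; values chosen to mirror the
# Bahrain medians quoted above, not drawn from the actual dataset.
laps = pd.DataFrame({
    "Team": ["Red Bull"] * 4 + ["Williams"] * 4,
    "LapTimeSeconds": [97.5, 97.7, 97.7, 97.9, 98.2, 98.4, 98.4, 98.6],
})
medians = laps.groupby("Team")["LapTimeSeconds"].median()
# Red Bull ~97.7 s, Williams ~98.4 s: two distinct distributions, barely overlapping
```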
When a team like Red Bull trains locally, it learns "fast lap" patterns. When Haas trains locally, it learns "slow lap" patterns. When you federate them naively, the global model has to reconcile these contradictory signals. FedAvg handles this reasonably well because the weighted average ensures that disagreements are resolved proportionally. But the convergence curve shows the strain.
I implemented FedProx (Li et al. 2020) as the extension. The core idea is simple: during local training, add a proximal term to the loss function that penalizes the local model from drifting too far from the global model.
The modified local loss becomes:
L_local = L_task + (μ/2) * ||w_local - w_global||²
Here μ controls the strength of the "rubber band" pulling the local model back toward the global consensus. At μ = 0, you recover FedAvg exactly. At higher μ, local updates stay closer to the global model, which stabilizes convergence in non-IID settings at the cost of some local expressiveness.
I swept μ across [0.0, 0.001, 0.01, 0.1].
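In code, the proximal term is a few lines on top of the task loss. A sketch under assumed names (the real version lives in `src/client.py`); `global_params` is a list of detached global-weight tensors aligned with the local model's parameters:

```python
import torch

def fedprox_loss(task_loss, local_model, global_params, mu):
    """FedProx local objective: task loss + (mu/2) * ||w_local - w_global||^2.
    With mu = 0 this reduces to plain FedAvg local training."""
    prox = sum(
        torch.sum((p_local - p_global) ** 2)
        for p_local, p_global in zip(local_model.parameters(), global_params)
    )
    return task_loss + (mu / 2.0) * prox
```

The proximal term participates in autograd like any other loss term, so `loss.backward()` pulls the local weights toward the global ones with strength μ.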
The convergence curves overlap almost perfectly in this regime. That is actually an insightful result: the non-IID effect in this dataset, while visible in the data distributions, is not severe enough to cause the catastrophic client drift that FedProx was designed to fix. The quantity-weighted averaging in standard FedAvg already does most of the stabilization work here because the data imbalance across teams is moderate (1,770 to 2,131 laps, roughly a 1.2x ratio, not a 10x ratio).
In production F1 deployments with more heterogeneous telemetry streams (car sensor data, not just lap-level features), or in seasons where one team dominates catastrophically (2023 Red Bull), I would expect μ = 0.1 to produce visible stabilization gains. For this lap-level dataset, FedAvg is already well behaved.
This is an honest finding. If I inflated the FedProx gains, anyone who ran the code would see the truth in the plot. The value of the extension is not that it improved accuracy here. The value is that I understand when and why it would matter, and the infrastructure is in place to activate it when conditions demand it.
Why Huber loss instead of MSE? A few laps in the dataset are genuine outliers: safety car restarts, pit exit laps that barely passed the accuracy filter, wet weather anomalies. MSE squares the error, so a single 10 second outlier dominates the gradient. Huber loss transitions to L1 (linear penalty) for large errors, making training robust without manually cleaning every edge case.
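A toy check of that claim: with one 10-second outlier, MSE's gradient on that lap is roughly 20x larger than Huber's (with delta = 1), so the outlier dominates the update under MSE but not under Huber.

```python
import torch

pred = torch.tensor([90.0, 91.0, 101.0], requires_grad=True)  # third lap is an outlier
target = torch.tensor([90.2, 91.1, 91.0])

mse = torch.nn.functional.mse_loss(pred, target)
(g_mse,) = torch.autograd.grad(mse, pred)       # outlier gradient: 2 * 10 / 3

huber = torch.nn.functional.huber_loss(pred, target, delta=1.0)
(g_huber,) = torch.autograd.grad(huber, pred)   # outlier gradient capped at delta / 3
```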
Why time based split instead of random split? Random splitting would leak information. If the model sees laps 5, 15, 35, and 48 from Bahrain during training, predicting lap 25 is trivially easy because the track conditions are nearly identical. Splitting by lap number (train on the first 80% of laps per event, test on the final 20%) forces the model to generalize to late race conditions it has never seen. This simulates the real use case: predicting future laps from historical patterns.
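The per-event time split can be sketched in a few lines of pandas. Column names (`EventName`, `LapNumber`) are assumptions about the processed parquet schema; the real split lives in the pipeline code.

```python
import pandas as pd

def time_based_split(laps: pd.DataFrame, frac: float = 0.8):
    """Train on the first ~80% of laps of each event, test on the final ~20%,
    so the model must generalize to late-race conditions it has never seen."""
    cutoffs = laps.groupby("EventName")["LapNumber"].transform(
        lambda s: s.quantile(frac)
    )
    train = laps[laps["LapNumber"] <= cutoffs]
    test = laps[laps["LapNumber"] > cutoffs]
    return train, test
```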
Why no Driver or Team as input features? Team is the partition key for federation. Including it as a feature would let the model learn "this is a Red Bull lap, therefore fast," which defeats the purpose of showing that collaboration helps. Driver is excluded because per driver pace is emergent from the team's car and the driver's skill, and encoding it would be information that a federated model should learn from context (tire, track, stint), not from a label.
Why a three-layer MLP and not something deeper? The dataset is 19,590 samples with 31 features. A deeper or wider network would overfit. The 128 → 64 → 1 architecture is deliberately small. The purpose of this project is to demonstrate federated learning, not to build a state-of-the-art lap time predictor. A more complex model (LSTM on sequential lap data, transformer on stint sequences) would improve accuracy but obscure the FL mechanics.
Why CPU/MPS and no GPU requirement? The dataset fits in memory on a laptop. Training the full federated simulation (20 rounds × 10 clients × 2 local epochs) completes in under 2 minutes on an M-series Mac. Requiring a GPU would gatekeep reproducibility for no benefit.
```shell
git clone https://github.com/Forgingalex/argus.git
cd argus
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

```shell
python -m src.extract
```

Downloads the complete 2023 season from the F1 live timing system via FastF1. Saves 19,590 clean laps to `data/processed/laps_2023.parquet`.

```shell
python -m src.train_baseline
```

Trains for 30 epochs on all data. Saves a checkpoint to `experiments/results/baseline.pt`. Expect test MAE around 0.85s.

```shell
python -m src.federated_train
```

Runs 20 communication rounds of FedAvg across 10 team clients. Saves a checkpoint and round-by-round MAE to `experiments/results/`.

```shell
python -m src.compare
```

Trains the local-only models, loads the baseline and federated checkpoints, and produces the comparison table and plot.

```shell
python -m src.fedprox_experiment
```

Sweeps μ values [0.0, 0.001, 0.01, 0.1] and produces the convergence comparison plot.
```
argus/
├── src/
│   ├── extract.py             # FastF1 data pipeline: 22 races → parquet
│   ├── dataset.py             # PyTorch Dataset + feature engineering
│   ├── model.py               # LapTimeMLP: 128 → 64 → 1 regression head
│   ├── train_baseline.py      # Centralized oracle training
│   ├── client.py              # FLClient: local training with FedProx support
│   ├── server.py              # Weighted FedAvg aggregation
│   ├── federated_train.py     # Full FL simulation loop
│   ├── fedprox_experiment.py  # μ sweep experiment
│   └── compare.py             # Three-way evaluation + plotting
├── notebooks/
│   └── 01_explore.ipynb       # Data exploration and non-IID analysis
├── experiments/
│   └── results/               # Checkpoints, CSVs, plots
├── data/
│   ├── raw_cache/             # FastF1 cache (gitignored)
│   └── processed/             # Cleaned parquet (gitignored)
├── requirements.txt
└── .gitignore
```
Federated learning is not magic, but it is surprisingly effective. The 42.9% improvement over local only models is not because the algorithm is clever. It is because pooling gradient information from 10 different data distributions teaches the model patterns that no single distribution contains. Alfa Romeo alone never sees enough Red Bull type fast laps to learn low degradation tire curves. Through federation, it does, without ever seeing a Red Bull lap.
Non-IID is the real problem, but not always the catastrophic one. The FedProx experiment showed that for this dataset, vanilla FedAvg already handles the heterogeneity well enough. The team pace distributions differ, but not by orders of magnitude. In domains where non-IID is more severe (medical imaging across different scanner hardware, keyboard predictions across different languages), FedProx and its descendants become essential. Here, they are insurance.
The oracle gap is real and permanent. The centralized model at 0.853s is ~50% better than the federated model at 1.677s. That gap represents the cost of privacy. In a world where all teams shared everything, you would get better predictions. The argument for federation is not that it matches centralized performance. The argument is that it gets you dramatically closer to centralized performance than isolation does, without requiring anyone to trust anyone.
Small teams are the economic engine of federation. If this were a real protocol, the pitch to Williams and Haas is easy: "your model improves by 50%." The pitch to Red Bull is harder: "your model improves by 31%, and you are subsidizing your competitors." This is the core tension in any federated system. The clients who contribute the most data get the least marginal benefit. Mechanism design for fair compensation in federated protocols is an active research area.
Building from scratch forces understanding. I wrote every line of `weighted_average`, every line of the proximal term, every line of the training loop. I know why `detach().cpu().clone()` is necessary in the client return (autograd graph cleanup, device consistency, memory safety). I know why the optimizer is recreated each round (carrying momentum across global weight resets would corrupt the update direction). I could not have explained any of this if I had imported `flwr.client.NumPyClient` and overridden two methods.
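Both of those details fit in one client-side sketch. This is illustrative, not the exact code in `src/client.py`: the function name and signature are assumptions, but the two design points (fresh optimizer, detached CPU-cloned return) are the ones described above.

```python
import copy
import torch

def local_update(global_model, loader, epochs=2, lr=1e-3, device="cpu"):
    """One client's local training pass. The optimizer is recreated every
    round so no stale momentum survives the global weight reset, and the
    returned weights are detached CPU clones: no autograd graph or device
    handle leaks back to the server."""
    model = copy.deepcopy(global_model).to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # fresh each round
    loss_fn = torch.nn.HuberLoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()
    return {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
```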
Per driver personalization. The current model predicts a global lap time. A production system would fine tune the global model per driver, keeping a personalized head while sharing a common backbone. This is the personalization vs generalization tradeoff that current FL research is actively investigating.
Secure aggregation. Right now, the server sees raw weight updates. In theory, an adversarial server could reconstruct training data from gradients (Zhu et al. 2019). Adding secure aggregation or differential privacy (DP FedAvg with proper moments accountant) would make the protocol cryptographically private, not just architecturally private.
Port to Flower. Now that I understand what FedAvg does at the tensor level, using Flower for the next iteration would let me focus on scale (hundreds of clients, asynchronous communication, client selection strategies) instead of plumbing.
Sequential lap modeling. An LSTM or Transformer over stint sequences would capture tire degradation dynamics that the current per lap MLP cannot. Lap 15 on mediums is not independent of lap 14. The current model treats it that way.
McMahan, H.B. et al. (2017). Communication Efficient Learning of Deep Networks from Decentralized Data. AISTATS. arXiv:1602.05629.
Li, T. et al. (2020). Federated Optimization in Heterogeneous Networks (FedProx). MLSys. arXiv:1812.06127.
Zhu, L. et al. (2019). Deep Leakage from Gradients. NeurIPS. arXiv:1906.08935.
Kairouz, P. et al. (2021). Advances and Open Problems in Federated Learning. Foundations and Trends in Machine Learning.
FastF1 by Philipp SchΓ€fer: https://docs.fastf1.dev
Released under the MIT License. Copyright (c) 2026 Argus.
Argus: because the best intelligence is the intelligence you never had to share.