ideas/2024: propose an analytics project (time budgeted builds) by SomeoneSerge · Pull Request #16 · NixOS/GSoC

SomeoneSerge · 2024-02-21T01:57:20Z

I'd like to CC @Mic92, @RaitoBezarius, @GuillaumeDesforges, @GaetanLepage, @ConnorBaker, and @samuela for comments and as potential "potential mentors" (e.g. I've never looked into the implementations of nix-index, nix-eval-jobs, nix-fast-build, etc., so I may lack some of the expertise required for the project to succeed in time).

I didn't write this up, but I think one prerequisite of a clean solution is solving the problem of identifying derivations across different nixpkgs instantiations (different revisions, different config arguments, etc.), which by design "lack identity". What we can easily match is e.g. nixpkgs' attribute paths. However, derivations overridden or defined in e.g. let-in expressions will contribute non-trivially to the total cost, and we need to be able to identify these too.

Janik-Haag · 2024-02-21T17:33:31Z

+### Nixpkgs analytics: [`nixpkgs-review`](https://github.com/Mic92/nixpkgs-review) with a time-budget
+
+This project is a small step in analyzing and understanding the data generated
+by the Nixpkgs' fifteen years of modeling (bootstrapping, building and testing)
+"much of the world of open source software". This data includes:
+
+- [Hydra](https://hydra.nixos.org/job/nixpkgs/trunk/python311Packages.torch.aarch64-linux)'s evaluation errors, output paths, [build times](https://hydra.nixos.org/build/250198091#tabs-details), and build logs.
+- The `.narinfo` files stored by https://cache.nixos.org, which together describe runtime dependency graphs between packages built by Hydra.
+- Logs for the builds and `passthru` tests run by [Ofborg](https://logs.ofborg.org/).
+- The [Nixpkgs](https://github.com/NixOS/nixpkgs) Git repository, where each
+  revision includes Nix code that can be evaluated or built, as well as
+  human-readable comments left by maintainers possibly providing insights into
+  the evolution of coding patterns and conventions used by Nixpkgs, as well as
+  into the details of upstream projects.
+- The NixOS/Nixpkgs GitHub repository, which features conversations in GitHub issues and code reviews.
+- IRC, Discourse, and Matrix logs.
+
+Some of this data can be explored and visualized in Grafana hosted at https://monitoring.nixos.org/grafana.
+This data allows tools like [nix-index](https://github.com/nix-community/nix-index/) (`nix-locate`, `comma`, [envfs](https://github.com/Mic92/envfs), etc.) to exist.
+Nonetheless, we currently lack tools to use this data to (conveniently) answer
+sometimes very simple queries like "how long has a given package been taking to
+build on average". To contrast, our ambition could have been to reason about questions such as "how likely is the next build to succeed?", "how long is it likely to take until termination?", "given a Nixpkgs revision and a PR, what attributes is it likely to break?", "given Nixpkgs git history and a failing attribute, which commit is likely to have introduced the breakage?", "why was a given change introduced?". Some of these questions have brute-force solutions: termination times can be obtained by executing the builds, the offending commit may be found by bisection. The accumulated data offers us an opportunity to consult a model prior to performing the expensive computation.
+
+In this project we'll attempt a modest task: write a version of [`nixpkgs-review`](https://github.com/Mic92/nixpkgs-review) or [`nix-fast-build`](https://github.com/Mic92/nix-fast-build) that can be given a time-budget to follow. The program would use historical data to estimate each package's contribution to the time complexity of the full review. The program would discard packages that "do not fit into the budget" and report their build status as "uncertain". When the observed build times deviate from the estimates, the program would dynamically adjust and schedule fewer or more builds as appropriate. The program would require a calibration procedure to attune to specific builders. Some sort of a simple linear model should suffice for the initial implementation.
+
+Skills: the project might take understanding the codebase of [`nix-eval-jobs`](https://github.com/nix-community/nix-eval-jobs) and [`nix-fast-build`](https://github.com/Mic92/nix-fast-build) in order to learn how to control and schedule nix evaluation and builds; experience of working with data and the basic numerical sympathy would be helpful.


This is a bit too long I guess. The guide suggests 2-5 sentences per idea.

samuela · 2024-02-21T19:21:04Z

I was once hired as an intern to do a project like this at one of the FAANGs. My takeaway at the time was that we ought to have just built better static analysis/build infrastructure instead of trying to ML it.

That's not to say that this project is not worth pursuing... Experiments are worthwhile. And even if we got just a dashboard showing plots of build times (in CPU hours) for each package that would be a worthwhile success IMHO.

RaitoBezarius · 2024-02-21T21:11:27Z

Hmm, where is the machine learning component in the proposal here? Or do we consider basic statistical analysis to be a machine learning algorithm :P ?

RaitoBezarius · 2024-02-21T21:12:36Z

Either way, I think it's important to separate the following parts of the project:

- analysis
- prediction
- scheduling

Even building something that can collect the data and ship it somewhere else is already great and can be reused by other people to do other parts of this idea.
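As a rough illustration of the "analysis" part only, here is a minimal sketch that turns collected build durations into a per-attribute estimate. The record layout (attribute path plus duration in seconds) and the function name `mean_build_times` are assumptions made for the example; nothing here reflects how Hydra actually exposes its data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (attribute path, build duration in seconds).
# Real records would have to be collected from Hydra or a similar source.
records = [
    ("hello", 28.0),
    ("hello", 32.0),
    ("ripgrep", 290.0),
    ("ripgrep", 310.0),
]


def mean_build_times(records):
    """Group durations by attribute path and average them."""
    by_attr = defaultdict(list)
    for attr, seconds in records:
        by_attr[attr].append(seconds)
    return {attr: mean(durations) for attr, durations in by_attr.items()}


estimates = mean_build_times(records)
print(estimates)
```

A plain mean is the simplest possible estimator. Keeping this step separate means the prediction and scheduling parts could later swap in something better without touching the data collection.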

SomeoneSerge · 2024-02-22T07:52:16Z

Sorry I didn't follow up on the reviews yesterday.

Threw out most of the generic blathering (there's some left in the opening sentence). @Janik-Haag is this short enough now? I could delete more

Updated the complexity rating to "hard" (350h)

I suppose we shouldn't try to outline every detail in the proposal, but I included @RaitoBezarius's decomposition in the description, because it actually structures the proposal well and maybe makes the "Skills" redundant.

SomeoneSerge · 2024-02-22T08:03:59Z

+The problem decomposes into at least three tasks:
+- Obtaining the data from Hydra and preparing it for redistribution and downstream usage.
+- Working out a statistical model for the build times, including online update rules.
+- Writing the evaluation and build scheduler that can consult such a model.


Maybe I should have kept the bit about "marking status as uncertain" and "scheduling fewer or more builds" 🤔
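For reference, those two behaviours could look roughly like the sketch below. It is only an illustration under stated assumptions: the class `BudgetedPlanner`, the single per-builder scale factor, the exponential smoothing rule, and the cheapest-first greedy order are all hypothetical choices, and the sketch ignores dependencies between derivations and parallel builds.

```python
from dataclasses import dataclass


@dataclass
class BudgetedPlanner:
    budget_seconds: float
    smoothing: float = 0.5  # weight given to the newest observation
    factor: float = 1.0     # builder-specific scale: observed is approx. factor * estimate
    spent: float = 0.0      # wall-clock seconds already consumed

    def observe(self, estimated: float, observed: float) -> None:
        """Online update: move the scale factor towards the observed/estimated ratio."""
        self.spent += observed
        ratio = observed / estimated
        self.factor = (1 - self.smoothing) * self.factor + self.smoothing * ratio

    def plan(self, estimates: dict) -> tuple:
        """Greedily schedule the cheapest builds that fit; the rest are 'uncertain'."""
        remaining = self.budget_seconds - self.spent
        scheduled, uncertain = [], []
        for attr, est in sorted(estimates.items(), key=lambda kv: kv[1]):
            cost = self.factor * est
            if cost <= remaining:
                scheduled.append(attr)
                remaining -= cost
            else:
                uncertain.append(attr)
        return scheduled, uncertain


# Hypothetical estimates in seconds; real ones would come from historical data.
planner = BudgetedPlanner(budget_seconds=600.0)
pending = {"hello": 30.0, "ripgrep": 300.0, "python311Packages.torch": 7200.0}
first_plan = planner.plan(pending)

# "hello" took three times longer than estimated on this builder.
planner.observe(estimated=30.0, observed=90.0)
del pending["hello"]
second_plan = planner.plan(pending)

print(first_plan)
print(second_plan)
```

In this run the first plan schedules `hello` and `ripgrep`; after the slow `hello` build the factor rises to 2.0, `ripgrep` no longer fits into the remaining budget, and it is reported as uncertain along with `python311Packages.torch`.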

Janik-Haag

I like it, and think we can merge it.

nixos-discourse · 2026-03-20T01:12:08Z

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/how-should-we-handle-software-created-with-llms/76061/62

samuela · 2026-03-20T21:36:00Z

+Nixpkgs and its infrastructure feature fifteen years of history of the open source software: in the form of build and test logs, dependency graphs, and conversations. Compared to the opportunities offered by this data, we'll attempt a modest task: write a version of [`nixpkgs-review`](https://github.com/Mic92/nixpkgs-review) or [`nix-fast-build`](https://github.com/Mic92/nix-fast-build) that can follow a fixed time-budget.
+
+The problem decomposes into at least three tasks:
+- Obtaining the data from Hydra and preparing it for redistribution and downstream usage.


One thing that would be cool IMHO: add a field to search.nixos.org or similar that shows the expected CPU-minutes to build each derivation. It could even link to a dashboard with plots showing statistics from Hydra over time, etc.

samuela · 2026-03-20T21:37:52Z

+
+### Nixpkgs analytics: [`nixpkgs-review`](https://github.com/Mic92/nixpkgs-review) with a time-budget
+
+Nixpkgs and its infrastructure feature fifteen years of history of the open source software: in the form of build and test logs, dependency graphs, and conversations. Compared to the opportunities offered by this data, we'll attempt a modest task: write a version of [`nixpkgs-review`](https://github.com/Mic92/nixpkgs-review) or [`nix-fast-build`](https://github.com/Mic92/nix-fast-build) that can follow a fixed time-budget.


I agree that there is interesting data here, but I'm not sure that I understand the motivation: What is the value proposition for having time-bounded nixpkgs-review? Eg supposing that this project is completed successfully, what problem in the nix community would be solved after all is said and done?

SomeoneSerge force-pushed the proposal/time-budget branch 4 times, most recently from 629b77a to 6a54c5c Compare February 21, 2024 06:58

SomeoneSerge added 2 commits February 22, 2024 07:46

ideas/2024/analytics: init (time budgeted builds)

786097c

ideas/2024/analytics: 1st round of pruning

4d62e31

SomeoneSerge force-pushed the proposal/time-budget branch from 6a54c5c to 4d62e31 Compare February 22, 2024 07:46

Janik-Haag approved these changes Feb 22, 2024

Janik-Haag merged commit a1170b6 into NixOS:main Feb 22, 2024

samuela reviewed Mar 20, 2026
