Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split build.cc -- new version of #3098 #4114

Merged
merged 35 commits into from
Oct 13, 2020

Conversation

Ericson2314
Copy link
Member

@Ericson2314 Ericson2314 commented Oct 6, 2020

Opening this because I think @edolstra's opinion has changed, and this would certainly be useful to to me doing incremental debug builds and being able to page less code into my head at a time.

I merged each of the original commits in succession, and basically re-copped code as needed. There were no interesting conflicts.

CC @regnat

@Ericson2314
Copy link
Member Author

If this is wanted, I hope we can merge it quickly, because it will bit-rot very fast :).

@Ericson2314 Ericson2314 mentioned this pull request Oct 6, 2020
@thufschmitt
Copy link
Member

If this is wanted, I hope we can merge it quickly, because it will bit-rot very fast :).

image

Too late ;)

@edolstra
Copy link
Member

edolstra commented Oct 7, 2020

The other issues are that it breaks history (which I do actually use), invalidates existing PRs (e.g. #3600, which BTW moves UserLock into a separate file because it was largely rewritten in that PR anyway) and makes cherrypicking changes to the maintenance branch a lot harder.

@Ericson2314
Copy link
Member Author

So git has recently been horrendous with conflicts for me with this, I'll admit. I ended up manually computing a diff, manually partitioning the hunks based one what new file they changes, and apply thing to fix conflicts. The modified patches at least applied cleanly, but that still is a rather bad process.

Still, I think we should do this.

  • The diff cost is temporary; going forward with PRs made after this there is no problem

  • As @edolstra says:

    it makes cherry-picking changes to the maintenance branch a lot harder

    But we're soon going to release 3.0, so this is also one of the best times to make cherry-picking harder. I'm quite convinced this is something we will at some point end up anyways as the schedule is hard to modify/grok right now and many changes around scheduling (especially with remote builders) are desire. If so, we might as well pay the cost now when it's cheaper.

  • After 3.0 and by whenever floating CA derivations become stable, I think the current flurry of activity around libnixstore will subside. Similar to the above, I want to get the big invasive changes "out of the way".

In particular, there are some big scheduler changes we'll need/want to do in the short term I rather do after this:

  • support building only drvs that lack substitutes (aka nix-build-uncached) #3946: This seems quite popular, but my idea of doing nix-build --store $remote_store --builders auto requires making scheduled ensurePath work for stores beyond LocalStore so that they can use the builders. This is a pretty hefty refactor that builds very nicely on this as it's further separating scheduling from building.

  • Build loops trying to substitute when output is missing #3964: My solution for that is to a split derivation goals into "plain goals", for which a method isn't yet choosen, and "build goals", for which no substitution is attempted and wantedOuptuts isn't needed. This is also a pretty major change to the scheduler that benefits from separating out these concepts.

  • Trust map substituting. This is needed so we don't need to download all the build-time dependencies of things just to figure out the resolved derivation we look up: (e.g don't want to download GCC just to figure out whether we can download bash). The "build goal" vs "plain goal" distinction from above might also come in handy, as dealing with trust maps and DRV resolving can be another planning aspect of plain goals and not clutter up building.

  • Derivations that build derivations: I decided this is best way to make [RFC 0040] "Ret-cont" recursive Nix rfcs#40. Just producing them is easy and done in Derivations can output "text-hashed" data #3959, but depending on those built derivations means changes to inputDrvs and the scheduler.

Hope this helps motivate this reactor :).

@edolstra
Copy link
Member

edolstra commented Oct 9, 2020

The diff cost is temporary

It's actually a permanent cost: the history becomes permanently polluted. It's too bad git isn't better at following history across these kinds of refactors... Although apparently there is a very hacky way: https://devblogs.microsoft.com/oldnewthing/20190916-00/?p=102892

I'm not sure it's really worth it though. Incremental build times probably won't go down by much (since build/derivation-goal.cc is still huge) and non-incremental build times will go up.

@Ericson2314
Copy link
Member Author

Ericson2314 commented Oct 9, 2020

Although apparently there is a very hacky way: https://devblogs.microsoft.com/oldnewthing/20190916-00/?p=102892

I'm more than happy to redo this with that trick.

[FWIW, it doesn't even feel "very hacky" to me. The underlying issue is git only knows states not patches, computing patches as needed. In those circumstances, it may well be better to hand-hold git with a careful history than increase the complexity of the diff calculation, as one would pay for the latter with every computed diff.

There are patch-theory-oriented version control systems like Darcs and https://pijul.org/ that solve this better by making the patch rather than state authoritative, but I don't think you are proposing that we switch to one of those at this time :).]

I'm not sure it's really worth it though. Incremental build times probably won't go down by much (since build/derivation-goal.cc is still huge) and non-incremental build times will go up.

Well to me the build times is not main point for this. Rather, it's structuring the code better so the subsequent refactors are easier to understand and maintain. If we were to do all those changes in the single build.cc, I'm worried we could easily hit, say, 11,000 lines. Surely we the large file and combination of concerns (building vs scheduling) is already hard to maintain and not something we'd want to make worse?

(Lastly, didn't you experiment with a way to do unity builds? We can always do that in non-incremental builds and never get a performance downside from splitting .cc files.)

@Mathnerd314
Copy link
Contributor

It's too bad git isn't better at following history

There is also the -C flag. Trying it on the foods example -C4 gives full history (because peas is 4 characters and he splits up the file by line). Here the chunks are larger and the default of 40 mostly works, although the merge commits do show up occasionally. -C10 is a little slower but resolved every nontrivial line of code I looked at. And git gui blame uses -C by default. So overall the history issue seems like a non-issue to me, although I guess that doesn't apply to GitHub or third-party blame tools missing the option.

Regarding the merge commits, is there a reason for them? Rebasing patches is based on the "patch-theory-oriented" view and I have found it much easier to resolve conflicts when working with patches. And of course the commit history looks cleaner when rebasing instead of merging.

@Ericson2314
Copy link
Member Author

Ericson2314 commented Oct 11, 2020

OK I did the Raymond Chen method for now, but I have the old stuff at https://github.com/obsidiansystems/nix/commits/split_build_cc_old so we switch methods and recover @regnat's breaking down the big split into interated splits if we like.


@Mathnerd314

although I guess that doesn't apply to GitHub or third-party blame tools missing the option.

Sadly that might matter

Regarding the merge commits, is there a reason for them? Rebasing patches is based on the "patch-theory-oriented" view and I have found it much easier to resolve conflicts when working with patches. And of course the commit history looks cleaner when rebasing instead of merging.

So since git will not store nice patches that show how the file is split, when a PR that predates the now-mainlined split is rebased git will have a hard time figuring out what to do. (Is there e -C flag for patch applying, let alone git rebase?) The merge trick basically forces git to track the files properly despite having the lousy info per delta, by breaking up the interesting patch into boring steps.

@Mathnerd314
Copy link
Contributor

Regarding the merge commits

I was talking about the old version: master...obsidiansystems:split_build_cc_old. If you rebased you'd end up with only 4 commits, instead of 17. But handling all the open PRs nicely is more important than clean commit history.

Is there a -C flag for patch applying, let alone git rebase?

There is actually git apply -C but it's not useful here. And git rebase/merge use an interface into diff which only supports rename detection - although with enough work copy detection could be added. So the Raymond Chen method seems to be the most effective at present.

@Ericson2314
Copy link
Member Author

In other news, I added another round of this splitting build.hh, so now each of the new .cc files has a corresponding new .hh. I hope this makes the components more "genuinely separated".

@edolstra edolstra merged commit 2653801 into NixOS:master Oct 13, 2020
@edolstra
Copy link
Member

@Ericson2314 Thanks, looks great now!

@Ericson2314 Ericson2314 deleted the split_build_cc branch October 13, 2020 14:22
@Ericson2314
Copy link
Member Author

Ericson2314 commented Oct 13, 2020

Thank you! It means a lot :).

@Ericson2314
Copy link
Member Author

Ericson2314 commented Oct 13, 2020

Bad news :(

Having done more merges since this was merged, I'm noticing Raymon Chen's method isn't working as well as it should. The issue is 3bab1c5, despite being a very simple removing of code, shows up as hot garbage.

On the command line, one can demonstrate this by comparing

git show --patience 3bab1c5bb0a56f850a7bc1bacc9f974b108cf601

with

git show 3bab1c5bb0a56f850a7bc1bacc9f974b108cf601

Evidentially the shear amount of deleted code is fooling git, and it starts aligning blank lines instead, with chaos naturally ensuing.


One thing I could attempt is to make a new branch off the parent of that commit, deleting the code little by little so git wouldn't get lost, and then merging that back in.

Another thing is my original offer of fixing merge conflicts for others still stands.

Very sorry this latest version of the history wasn't the panacea I thought it would be.

@Mathnerd314
Copy link
Contributor

There are a few settings that can be tweaked, git merge -X diff-algorithm=patience -X find-renames=<n>. But I don't have a merge handy to test.

@Ericson2314
Copy link
Member Author

Oh whew! That makes me feel a lot better. Thanks so much, @Mathnerd314.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants