Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory corruption #4200

Closed
kirelagin opened this issue Oct 29, 2020 · 17 comments
Closed

Memory corruption #4200

kirelagin opened this issue Oct 29, 2020 · 17 comments

Comments

@kirelagin
Copy link
Member

kirelagin commented Oct 29, 2020

Describe the bug

When trying to build a certain expression, I am getting:

malloc(): invalid size (unsorted)

or, sometimes, much less often either:

malloc(): invalid next size (unsorted)

or

malloc(): invalid next->prev_inuse (unsorted)

or some other malloc error, which I failed to write down.

There seems to be something interesting going on with string allocation.

Steps To Reproduce

I managed to minimise our case to a simple build with haskell.nix of an empty pseudo-Haskell project with a long enough stack.yaml file. By varying the size of stack.yaml you can get different errors and have fun (🥳).

Unfortunately, since it uses haskell.nix, you might need to setup the IOHK binary cache. Or maybe not, I’m not sure... maybe we can share our cache...

  • git clone https://github.com/serokell/nix-malloc-repro.git
  • cd nix-malloc-repro
  • nix build -f .

Note that both nix-build and nix-daemon dump core.

Expected behavior

Some error related to the Haskell project being completely empty.

Additional context

The repository has its own pinned dependencies and I am using nixUnstable from the nixos-20.09 branch, although I don’t think any of this matters.

  • It reproduces with current master (e0ca98c).
  • @notgne2, who has older nixpkgs with Nix f156513 cannot reproduce it.
@kirelagin kirelagin added the bug label Oct 29, 2020
@kirelagin
Copy link
Member Author

I guess I’ll just leave it bisecting overnight. So far, if I did not mess up, I know that:

git bisect start
# bad: [ed02d20e1d437ea2c3f99b13f1dd2ff31c48e0c8] Merge pull request #4085 from 0mp/patch-2
git bisect bad ed02d20e1d437ea2c3f99b13f1dd2ff31c48e0c8
# good: [6387550d58884ac09204c339ed3a3cd700780a75] Get rid of confusing `std::optional<bool>` for validity
git bisect good 6387550d58884ac09204c339ed3a3cd700780a75

@kirelagin
Copy link
Member Author

Current status:

# bad: [dac8a1a5e178f3a000bb6995c41b6c36d3a0de6c] Merge pull request #4052 from ujjwaljainnn/fix-typo
git bisect bad dac8a1a5e178f3a000bb6995c41b6c36d3a0de6c
# good: [ecc8672aa007af045d77434b495ca09541e9fee3] fmt.hh: Don't include boost/algorithm/string/replace.hpp
git bisect good ecc8672aa007af045d77434b495ca09541e9fee3

The next step is ca30abb3fb36440e5a13161c39647189036fc18f and either I am crazy or when I build this commit I get nix-instantiate that just hangs.

@kirelagin
Copy link
Member Author

If I skip it, the next step ecc8088cb7e91e4c45787f290c8ff61547fb838a does not compile. Ugh. Doesn’t look like it is going to bisect itself without a manual oversight. Maybe tomorrow.

@kirelagin
Copy link
Member Author

Looks like 97b5154 is good too, but what’s crazy is that 35a0ac1 hangs too, even though it is nowhere close the other hanging commit. Something is very odd...

@kirelagin
Copy link
Member Author

kirelagin commented Oct 29, 2020

git bisect start
# bad: [a5019f0508be961bf0230d2a528d30d3ded4b12a] Consistency
git bisect bad a5019f0508be961bf0230d2a528d30d3ded4b12a
# good: [6387550d58884ac09204c339ed3a3cd700780a75] Get rid of confusing `std::optional<bool>` for validity
git bisect good 6387550d58884ac09204c339ed3a3cd700780a75
# bad: [69afaeace355ab78010f0d849206ef4a4805d016] Merge remote-tracking branch 'upstream/master' into templated-daemon-protocol
git bisect bad 69afaeace355ab78010f0d849206ef4a4805d016
# bad: [5b107f2c5f5bb86042f9f65b022cf0ed0bfaccbd] Merge pull request #4038 from maljub01/master
git bisect bad 5b107f2c5f5bb86042f9f65b022cf0ed0bfaccbd
# good: [649d3aaf2481b928120b6ce77d68b1b7c68f69e6] Merge pull request #3829 from obsidiansystems/remove-storetype-delegate-regStore
git bisect good 649d3aaf2481b928120b6ce77d68b1b7c68f69e6
# good: [7dd8baafe1af81703d551e29ae329b0077fb8e39] Merge pull request #4041 from cole-h/enum-stringify
git bisect good 7dd8baafe1af81703d551e29ae329b0077fb8e39
# bad: [7c682640857106f18d7020c9c75ea39b1ef8dd2c] wopAddToStore: add RepairFlag
git bisect bad 7c682640857106f18d7020c9c75ea39b1ef8dd2c
# good: [14b30b3f3d5af75c210a15cb128e67c0eff66149] Move FramedSource and FramedSink, extract withFramedSink
git bisect good 14b30b3f3d5af75c210a15cb128e67c0eff66149
# skip: [ecc8088cb7e91e4c45787f290c8ff61547fb838a] wopAddToStore: Throw to clarify unused refs
git bisect skip ecc8088cb7e91e4c45787f290c8ff61547fb838a
# skip: [c602ebfb34de3626fa0b9110face6ea4b171ac0f] Refactor wopAddToStore to make wopAddTextToStore obsolete
git bisect skip c602ebfb34de3626fa0b9110face6ea4b171ac0f
# skip: [8279178b078103f816bf85f5a153a4845d30cc83] Move FramedSink next to FramedSource
git bisect skip 8279178b078103f816bf85f5a153a4845d30cc83
# bad: [e34fe47d0ce19fc7657970fb0e610bffbc3e43f0] Overhaul wopAddToStore
git bisect bad e34fe47d0ce19fc7657970fb0e610bffbc3e43f0
# first bad commit: [e34fe47d0ce19fc7657970fb0e610bffbc3e43f0] Overhaul wopAddToStore

(“skip” means that it either does not build or hangs during evaluation)

@kirelagin
Copy link
Member Author

That would be #4030 (cc @roberth @Ericson2314 @edolstra).

@kirelagin
Copy link
Member Author

kirelagin commented Oct 29, 2020

Ok, after reading the PR I see why it hangs... I guess, I’d need to bisect it with some specific version of nix-daemon that will not hang? :/ (I was bisecting it against nix-daemon 2.3.7 and apparently, unlike the newer one, it was not dumping.)

@kirelagin
Copy link
Member Author

kirelagin commented Oct 29, 2020

Actually, I double-checked manually and it looks like all those skips are compilation failures, not freezes. And the commit it points to fails for a different reason, ugh.

@kirelagin
Copy link
Member Author

kirelagin commented Oct 29, 2020

So, here is the latest run:

git bisect start
# bad: [a5019f0508be961bf0230d2a528d30d3ded4b12a] Consistency
git bisect bad a5019f0508be961bf0230d2a528d30d3ded4b12a
# good: [6387550d58884ac09204c339ed3a3cd700780a75] Get rid of confusing `std::optional<bool>` for validity
git bisect good 6387550d58884ac09204c339ed3a3cd700780a75
# bad: [69afaeace355ab78010f0d849206ef4a4805d016] Merge remote-tracking branch 'upstream/master' into templated-daemon-protocol
git bisect bad 69afaeace355ab78010f0d849206ef4a4805d016
# bad: [5b107f2c5f5bb86042f9f65b022cf0ed0bfaccbd] Merge pull request #4038 from maljub01/master
git bisect bad 5b107f2c5f5bb86042f9f65b022cf0ed0bfaccbd
# good: [649d3aaf2481b928120b6ce77d68b1b7c68f69e6] Merge pull request #3829 from obsidiansystems/remove-storetype-delegate-regStore
git bisect good 649d3aaf2481b928120b6ce77d68b1b7c68f69e6
# good: [7dd8baafe1af81703d551e29ae329b0077fb8e39] Merge pull request #4041 from cole-h/enum-stringify
git bisect good 7dd8baafe1af81703d551e29ae329b0077fb8e39
# bad: [7c682640857106f18d7020c9c75ea39b1ef8dd2c] wopAddToStore: add RepairFlag
git bisect bad 7c682640857106f18d7020c9c75ea39b1ef8dd2c
# good: [14b30b3f3d5af75c210a15cb128e67c0eff66149] Move FramedSource and FramedSink, extract withFramedSink
git bisect good 14b30b3f3d5af75c210a15cb128e67c0eff66149
# skip: [ecc8088cb7e91e4c45787f290c8ff61547fb838a] wopAddToStore: Throw to clarify unused refs
git bisect skip ecc8088cb7e91e4c45787f290c8ff61547fb838a
# skip: [c602ebfb34de3626fa0b9110face6ea4b171ac0f] Refactor wopAddToStore to make wopAddTextToStore obsolete
git bisect skip c602ebfb34de3626fa0b9110face6ea4b171ac0f
# skip: [8279178b078103f816bf85f5a153a4845d30cc83] Move FramedSink next to FramedSource
git bisect skip 8279178b078103f816bf85f5a153a4845d30cc83
# skip: [e34fe47d0ce19fc7657970fb0e610bffbc3e43f0] Overhaul wopAddToStore
git bisect skip e34fe47d0ce19fc7657970fb0e610bffbc3e43f0
# skip: [fbf509c1137fd59ebb3e7a993c8da42c17deb68a] parseContentAddressMethodPrefix: use string_view
git bisect skip fbf509c1137fd59ebb3e7a993c8da42c17deb68a
# only skipped commits left to test
# possible first bad commit: [7c682640857106f18d7020c9c75ea39b1ef8dd2c] wopAddToStore: add RepairFlag
# possible first bad commit: [fbf509c1137fd59ebb3e7a993c8da42c17deb68a] parseContentAddressMethodPrefix: use string_view
# possible first bad commit: [8279178b078103f816bf85f5a153a4845d30cc83] Move FramedSink next to FramedSource
# possible first bad commit: [ecc8088cb7e91e4c45787f290c8ff61547fb838a] wopAddToStore: Throw to clarify unused refs
# possible first bad commit: [c602ebfb34de3626fa0b9110face6ea4b171ac0f] Refactor wopAddToStore to make wopAddTextToStore obsolete
# possible first bad commit: [e34fe47d0ce19fc7657970fb0e610bffbc3e43f0] Overhaul wopAddToStore

I believe, this time it’s correct and basically what it says is that the issue is somewhere in that PR, but since it all either doesn’t compile or crashes with protocol errors, it’s hard to say where exactly the specific memory corruption was introduced.

@roberth
Copy link
Member

roberth commented Oct 29, 2020

Seems related to #4178.

With master and an unmodified https://github.com/serokell/nix-malloc-repro I get

trace: gitSource.nix: /home/user/h/nix-malloc-repro/squirrel does not seem to be a git repository,
assuming it is a clean checkout.
trace: WARNING: `cleanGit` called on /home/user/h/nix-malloc-repro/squirrel without a `name`. Consider adding `name = "squirrel";`
corrupted size vs. prev_size
Aborted (core dumped)

After fixing the trace messages it hangs while draining the source (builtins.path on the squirrel directory), based on print statements in my modified Nix. Although I can't exclude an infinite recursion in the source expression (if an accidental change in the language causes that), I suspect something is up with the new use of FramedSink. That's still just speculation though :(

EDIT: not an infinite recursion because the process is idle.

@roberth
Copy link
Member

roberth commented Oct 29, 2020

I believe, this time it’s correct and basically what it says is that the issue is somewhere in that PR, but since it all either doesn’t compile or crashes with protocol errors, it’s hard to say where exactly the specific memory corruption was introduced.

The PR iterates on the protocol itself, so intermediate versions are only compatible with a daemon built from the exact same commit.

@kirelagin
Copy link
Member Author

I see, it makes sense. Well, if I remember correctly, only the very last skip is due to a protocol error, all others are compilation errors. I think what we can do is try that last commit in single-user mode, although I’m not sure if it will provide any useful information.

roberth added a commit to hercules-ci/nix that referenced this issue Oct 30, 2020
Crucially this introduces BoehmGCStackAllocator, but it also
adds a bunch of wiring to avoid making libutil depend on bdw-gc.

Part of the solutions for NixOS#4178, NixOS#4200
@roberth
Copy link
Member

roberth commented Oct 30, 2020

#4206 fixes the memory corruption, but I also had to increase the stack size and --store "daemon?max-connections=2" to avoid a deadlock involving the connection pool.

@roberth
Copy link
Member

roberth commented Oct 30, 2020

#4207 takes care of the max-connections issue.

@blaggacao
Copy link
Contributor

blaggacao commented Jan 7, 2021

Toady involving a stack.yaml with haskell.nix as well, I observe:

nix develop
warning: Git tree '/home/blaggacao/ghq/git.joeyh.name/git/arduino-copilot' is dirty
trace: gitSource.nix: /nix/store/mzqnhviawcj3lhdimjybj776177h7dzx-source does not seem to be a git repository,
assuming it is a clean checkout.
trace: To make project.stack-nix for arduino-copilot a fixed-output derivation but not materialized, set `stack-sha256` to the output of the 'calculateMaterializedSha' script in 'passthru'.
trace: To materialize project.stack-nix for arduino-copilot entirely, pass a writable path as the `materialized` argument and run the 'updateMaterialized' script in 'passthru'.
malloc(): invalid next size (unsorted)
[1]    1366203 abort (core dumped)  nix develop

on

nix --version
nix (Nix) 3.0pre20200829_f156513

Is this closeable and I need to update my nix version? (EDIT: I think I get the clue now: 20200829)

@blaggacao
Copy link
Contributor

I confirm it does not happen any more with:

nix --version
nix (Nix) 2.4pre20201201_5a6ddb3

→ close?

@stale
Copy link

stale bot commented Jul 8, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the stale label Jul 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants