Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

emacsMacport: build fails on aarch64-darwin #127902

Closed
shaunsingh opened this issue Jun 23, 2021 · 68 comments · Fixed by #252244
Closed

emacsMacport: build fails on aarch64-darwin #127902

shaunsingh opened this issue Jun 23, 2021 · 68 comments · Fixed by #252244
Labels
6.topic: darwin Running or building packages on Darwin 6.topic: emacs

Comments

@shaunsingh
Copy link

Issue description

If you install emacsMacport on an aarch64-darwin machine (running nix natively) the build fails. Note that on homebrew, the emacsMacport formula (railway cat) builds fine on aarch64 and includes a prebuilt release (cask). The only arm-specific change they have is this: https://github.com/railwaycat/homebrew-emacsmacport/blob/master/patches/mac-arm-fix.diff, which patches the codesigning process.

Steps to reproduce

Install nix
Modify /etc/nix/nix.conf to include the aarch64 architecture
Rebuild nix
run 'nix-env -iA nixpkgs.emacsMacport` to install emacs

Build fails with the following output:

[  6%] Building CXX object lib/sanitizer_common/CMakeFiles/RTSanitizerCommonNoHooks.osx.dir/sanitizer_persistent_allocator.cc.o
[  7%] Building CXX object lib/sanitizer_common/CMakeFiles/RTSanitizerCommonNoHooks.osx.dir/sanitizer_platform_limits_linux.cc.o
[  7%] Building CXX object lib/sanitizer_common/CMakeFiles/RTSanitizerCommonNoHooks.osx.dir/sanitizer_platform_limits_netbsd.cc.o
[  7%] Building CXX object lib/sanitizer_common/CMakeFiles/RTSanitizerCommonNoHooks.osx.dir/sanitizer_platform_limits_posix.cc.o
/tmp/nix-build-compiler-rt-6.0.1.drv-0/compiler-rt-6.0.1.src/lib/sanitizer_common/sanitizer_platform_limits_posix.cc:202:31: error: invalid application of 'sizeof' to an incomplete type 'struct stat64'
  unsigned struct_stat64_sz = sizeof(struct stat64);
                              ^     ~~~~~~~~~~~~~~~
/tmp/nix-build-compiler-rt-6.0.1.drv-0/compiler-rt-6.0.1.src/lib/sanitizer_common/sanitizer_platform_limits_posix.cc:202:45: note: forward declaration of '__sanitizer::stat64'
  unsigned struct_stat64_sz = sizeof(struct stat64);
                                            ^
/tmp/nix-build-compiler-rt-6.0.1.drv-0/compiler-rt-6.0.1.src/lib/sanitizer_common/sanitizer_platform_limits_posix.cc:227:33: error: invalid application of 'sizeof' to an incomplete type 'struct statfs64'
  unsigned struct_statfs64_sz = sizeof(struct statfs64);
                                ^     ~~~~~~~~~~~~~~~~~
/tmp/nix-build-compiler-rt-6.0.1.drv-0/compiler-rt-6.0.1.src/lib/sanitizer_common/sanitizer_platform_limits_posix.cc:227:47: note: forward declaration of '__sanitizer::statfs64'
  unsigned struct_statfs64_sz = sizeof(struct statfs64);
                                              ^
2 errors generated.
make[2]: *** [lib/sanitizer_common/CMakeFiles/RTSanitizerCommonNoHooks.osx.dir/build.make:303: lib/sanitizer_common/CMakeFiles/RTSanitizerCommonNoHooks.osx.dir/sanitizer_platform_limits_posix.cc.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:764: lib/sanitizer_common/CMakeFiles/RTSanitizerCommonNoHooks.osx.dir/all] Error 2
make: *** [Makefile:149: all] Error 2
builder for '/nix/store/b11kjv1kkh489zr3hi0iscq62faa45bw-compiler-rt-6.0.1.drv' failed with exit code 2
building '/nix/store/s6r151m6wsm7xs0y777vfqhdf1fa122f-apple-framework-Quartz-11.0.0.drv'...
cannot build derivation '/nix/store/gq19zfgabj5b53frj43vlk69lprdi9db-clang-wrapper-6.0.1.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/axmp969633hzkbl4733xl4w44sq6n034-stdenv-darwin.drv': 1 dependencies couldn't be built
cannot build derivation '/nix/store/sxjawx6ls6n1s5c2lnm1vblynpc9mfl6-emacs-mac-27.2-8.2.drv': 1 dependencies couldn't be built
error: build of '/nix/store/sxjawx6ls6n1s5c2lnm1vblynpc9mfl6-emacs-mac-27.2-8.2.drv' failed
```zsh

## Technical details

Please run `nix-shell -p nix-info --run "nix-info -m"` and paste the result.
```zsh
 - system: `"aarch64-darwin"`
 - host os: `Darwin 21.0.0, macOS 12.0`
 - multi-user?: `no`
 - sandbox: `no`
 - version: `nix-env (Nix) 2.3.12`
 - channels(shauryasingh): `"nixpkgs-21.11pre297243.b27eaa18b47, unstable-21.11pre297243.b27eaa18b47"`
 - nixpkgs: `/Users/shauryasingh/.nix-defexpr/channels/nixpkgs`
@veprbl veprbl added the 6.topic: darwin Running or building packages on Darwin label Jun 23, 2021
@npajkovsky
Copy link

npajkovsky commented Sep 16, 2021

Why is it compiled with ugly old llvm6?

@npajkovsky
Copy link

Aha, so when I have updated the llvm6 to llvm11 for emacsMacports

stdenv = if stdenv.cc.isClang then llvmPackages_11.stdenv else stdenv;

I have got another build issue.

./macappkit.h:30:9: fatal error: 'UniformTypeIdentifiers/UniformTypeIdentifiers.h' file not found
#import <UniformTypeIdentifiers/UniformTypeIdentifiers.h>
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.

@npajkovsky
Copy link

I have working build.

@npajkovsky
Copy link

npajkovsky commented Sep 17, 2021

The patch below fix the build issue on aarch64-darwin. The terminal emacs binary works like charm, but it's very
different story for Emacs.app, because it segfaults

> % ~/.nix-profile/Applications/Emacs.app/Contents/MacOS/Emacs -Q
Fatal error 11: Segmentation fault
Backtrace:
0   Emacs                               0x0000000104fac34c deliver_thread_signal + 60
1   Emacs                               0x0000000104faacf8 seed_random + 0
2   Emacs                               0x0000000104fac3c8 handle_sigsegv + 64
3   libsystem_platform.dylib            0x000000018b656c44 _sigtramp + 56
4   libobjc.A.dylib                     0x000000018b4c680c _ZN19AutoreleasePoolPage12releaseUntilEPP11objc_object + 204
5   libobjc.A.dylib                     0x000000018b4a4e0c objc_autoreleasePoolPop + 212
6   Emacs                               0x00000001050d937c mac_gui_loop + 92
7   Emacs                               0x00000001050d92a8 main + 1116
8   libdyld.dylib                       0x000000018b629430 start + 4

This is were I can help more.

pkgs/applications/editors/emacs/macport.nix | 12 +++++++++---
 pkgs/top-level/all-packages.nix             |  5 +++--
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/pkgs/applications/editors/emacs/macport.nix b/pkgs/applications/editors/emacs/macport.nix
index 8c395219aeb3..358890960870 100644
--- a/pkgs/applications/editors/emacs/macport.nix
+++ b/pkgs/applications/editors/emacs/macport.nix
@@ -1,5 +1,5 @@
-{ lib, stdenv, fetchurl, ncurses, pkg-config, texinfo, libxml2, gnutls, gettext, autoconf, automake, jansson
-, AppKit, Carbon, Cocoa, IOKit, OSAKit, Quartz, QuartzCore, WebKit
+{ lib, stdenv, fetchurl, fetchpatch, ncurses, pkg-config, texinfo, libxml2, gnutls, gettext, autoconf, automake, jansson
+, AppKit, Carbon, Cocoa, IOKit, OSAKit, Quartz, QuartzCore, WebKit, UniformTypeIdentifiers, sigtool
 , ImageCaptureCore, GSS, ImageIO # These may be optional
 }:

@@ -26,12 +26,18 @@ stdenv.mkDerivation rec {
     sha256 = "0f2wzdw2a3ac581322b2y79rlj3c9f33ddrq9allj97r1si6v5xk";
   };

+  patches = fetchpatch {
+      name = "fix-aarch64-darwin-triplet.patch";
+      url = "https://git.savannah.gnu.org/cgit/emacs.git/patch/?id=a88f63500e475f842e5fbdd9abba4ce122cdb082";
+      sha256 = "sha256-RF9b5PojFUAjh2TDUW4+HaWveV30Spy1iAXhaWf1ZVg=";
+  };
+
   enableParallelBuilding = true;

   nativeBuildInputs = [ pkg-config autoconf automake ];

   buildInputs = [ ncurses libxml2 gnutls texinfo gettext jansson
-    AppKit Carbon Cocoa IOKit OSAKit Quartz QuartzCore WebKit
+    AppKit Carbon Cocoa IOKit OSAKit Quartz QuartzCore WebKit UniformTypeIdentifiers sigtool
     ImageCaptureCore GSS ImageIO   # may be optional
   ];

diff --git a/pkgs/top-level/all-packages.nix b/pkgs/top-level/all-packages.nix
index 01c955087234..7aa55b682447 100644
--- a/pkgs/top-level/all-packages.nix
+++ b/pkgs/top-level/all-packages.nix
@@ -23899,9 +23899,10 @@ with pkgs;

   emacsMacport = callPackage ../applications/editors/emacs/macport.nix {
     inherit (darwin.apple_sdk.frameworks)
-      AppKit Carbon Cocoa IOKit OSAKit Quartz QuartzCore WebKit
+      AppKit Carbon Cocoa IOKit OSAKit Quartz QuartzCore WebKit UniformTypeIdentifiers
       ImageCaptureCore GSS ImageIO;
-    stdenv = if stdenv.cc.isClang then llvmPackages_6.stdenv else stdenv;
+    inherit (darwin) sigtool;
+    stdenv = if stdenv.cc.isClang then llvmPackages_11.stdenv else stdenv;
   };

   emacsPackagesFor = emacs: import ./emacs-packages.nix {

@peterbecich
Copy link
Contributor

Label suggestion: "6.topic: emacs"
https://github.com/NixOS/nixpkgs/labels/6.topic%3A%20emacs

@srid
Copy link
Contributor

srid commented Nov 25, 2021

Why is it compiled with ugly old llvm6?

I get this exact error with compiler-rt-libc-9.0.1. srid/haskell-template#1 (comment)

@npajkovsky
Copy link

@srid Highly likely. Try a bit saner version of llvm.

@soulomoon
Copy link

Why is it compiled with ugly old llvm6?

I get this exact error with compiler-rt-libc-9.0.1. srid/haskell-template#1 (comment)

compiler-rt{5,6,7,8,9,10} are marked broken in aarch64-darwin,
broken

@Atemu
Copy link
Member

Atemu commented Jan 17, 2022

In the process of integrating macport into the generic emacs drv, I also wanted to make it build on regular stdenv and ran across this issue.

What I've noticed though: The crash only happens sometimes. If you run it a couple times, you get a Window that works just as expected though it did crash later on me.
This leads me to believe it's either a bug in the GUI handling of macport or it's because of our use of ancient system libraries.

@npajkovsky
Copy link

@Atemu There's a new version of macport and maybe it's fixed. It should not be a problem to build it on stdenv. If I recall correctly, it worked.

I can give it a shake this week.

@Atemu
Copy link
Member

Atemu commented Jan 17, 2022

Do you mean the experimental emacs28 version? The reason for my PR is actually to not have to duplicate the native-comp infra of the generic drv for that version.

I'm in the process of trying it out now.

@npajkovsky
Copy link

@Atemu Ouch. Sorry, my bad. I thought that a new version of macport was released[1]

[1] https://bitbucket.org/mituharu/emacs-mac/downloads/?tab=tags

@Atemu
Copy link
Member

Atemu commented Jan 18, 2022

Well, sort of. It's not officially "released" yet but Yamamoto-san pushed emacs-28 to their work branch: https://bitbucket.org/mituharu/emacs-mac/branch/work

The derivation from my PR above works now and can be used with that branch too FYI. Newer stdenv still doesn't work though unfortunately.

@npajkovsky
Copy link

That's great news. What's the issue now? I had to add UniformTypeIdentifiers to get it built.

@Atemu
Copy link
Member

Atemu commented Jan 18, 2022

I actually don't own an aarch64-darwin mac yet, this is just an oddity I wanted to get rid of during the refactor since the regular NS emacs is also built on regular stdenv.

It builds just fine with newer llvm but it still crashes just like before: #127902 (comment)

@Atemu
Copy link
Member

Atemu commented Jan 24, 2022

Does it work with Apple clang perhaps? Something worth trying.

The sandbox is not enforced on macOS, so you could just set CC to /usr/bin/cc etc.

@SirensOfTitan
Copy link

Running into similar problems on aarch64-darwin here with my own emacs-mac overlay based on Yamamoto's work branch. I can get successful builds, working libgccjit support, but graphical eventually segfaults (usually immediately, occasionally I can get as far as bringing up projectile before I beachball).

I've tried trying different llvm versions to same effect. I'll try building using Apple clang tonight and see where it gets me.

@teehemkay
Copy link

@SirensOfTitan any update on building w/ Apple clang?

@SirensOfTitan
Copy link

@teehemkay: No go, I still get crashes. I'd classify myself as a nix novice though, perhaps I did something wrong. @Atemu: any ideas or instructions from you and I'm happy to try them out. Will try to be a little more responsive.

From lldb:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000019ed50334 libobjc.A.dylib`objc_msgSend + 52
libobjc.A.dylib`objc_msgSend:
->  0x19ed50334 <+52>: ldp    x17, x9, [x13], #-0x10
    0x19ed50338 <+56>: cmp    x9, x1
    0x19ed5033c <+60>: b.ne   0x19ed5034c               ; <+76>
    0x19ed50340 <+64>: eor    x10, x10, x1
Target 0: (emacs-28.0.91) stopped.

Applications/Emacs.app/Contents/MacOS/Emacs -Q:

Fatal error 11: Segmentation fault
Backtrace:
0   Emacs                               0x000000010022386c deliver_thread_signal + 60
1   Emacs                               0x0000000100221fc4 seed_random + 0
2   Emacs                               0x00000001002238e8 handle_sigsegv + 64
3   libsystem_platform.dylib            0x000000019eedc4e4 _sigtramp + 56
4   libobjc.A.dylib                     0x000000019ed5738c _ZN19AutoreleasePoolPage12releaseUntilEPP11objc_object + 200
5   libobjc.A.dylib                     0x000000019ed53d68 objc_autoreleasePoolPop + 208
6   Emacs                               0x0000000100368630 mac_gui_loop + 92
7   Emacs                               0x000000010036855c main + 1116
8   dyld                                0x0000000100c250f4 start + 520
[1]    54325 segmentation fault  Applications/Emacs.app/Contents/MacOS/Emacs -Q

@LoganBarnett
Copy link

Looks like there's at least two pull requests converging on this: #155360 by @Atemu and #138424 by @mikroskeem.

I'm not versed enough with nix nor building from source to know which one of these are further along. From what I've gathered @Atemu's PR is trying to roll the emacsMacport derivation into the standard Emacs derivation (and therefore adding aarch64 support) and @mikroskeem is just trying to get the emacsMacport derivation working on aarch64.

I apologize in advance if my ignorance here misrepresents the efforts here. Thanks for all the help!

@Atemu
Copy link
Member

Atemu commented Mar 1, 2022

No, what I'm trying to do is get rid of the extra macport derivation to make it easier to add things like using native-comp to macport. I also wanted to get rid of the stdenv override so that it builds on aarch64 but I wasn't able to.
I have tried integrating @mikroskeem's patches as I do have an aarch64-darwin machine now but they don't actually seem to work.

Right now, my PR is in development hell as there's an issue I cannot debug. I think I might just build macport from source instead, I don't really see the point in using release tarballs anyways.

@flurie
Copy link
Contributor

flurie commented Apr 16, 2022

I'm actively working on reestablishing a baseline for emacsMacport for aarch64-darwin. My goal is to get it running and then slowly add back things or fix extreme changes as needed. I pulled in some stuff from @mikroskeem's work, and I've been looking at what's common among the macports and homebrew builds. I was getting the same segfaults everyone else seemed to be experiencing when using various clang envs, but they disappeared when I went nuclear witih CC=/usr/bin/clang. Replacing that, even with a clang13 stdenv, seems to bring back the intermittent segfaults. I noticed as well that the homebrew formula has a hack for getting the codesign working, but not having that doesn't seem to impede running the version built with Xcode clang.

Perhaps someone who is more familiar with the macport code has an idea of what's going on here? I'm not opposed to doing what some other packages need to do in order to build, but I hope to understand a bit more about why it's necessary.

@Atemu
Copy link
Member

Atemu commented Apr 17, 2022

Interesting that it doesn't crash when built with Apple clang. I'd have assumed it's due to some outdated system libraries (the bringup is WIP but has been long and painful AFAIK) but I don't think Apple clang will use that in a drv? There's nothing preventing it though.

Could you double check Apple clang'd Emacs is linked against Nixpkgs' libsystem and not the regular one?

@flurie
Copy link
Contributor

flurie commented Apr 17, 2022

aha, good thinking.

❯ otool -L result/bin/emacs
result/bin/emacs:
        /System/Library/Frameworks/Carbon.framework/Versions/A/Carbon (compatibility version 2.0.0, current version 165.0.0)
        /System/Library/Frameworks/WebKit.framework/Versions/A/WebKit (compatibility version 1.0.0, current version 613.1.17)        /System/Library/Frameworks/Quartz.framework/Versions/A/Quartz (compatibility version 1.0.0, current version 1.0.0)
        /System/Library/Frameworks/QuartzCore.framework/Versions/A/QuartzCore (compatibility version 1.2.0, current version 1.11.0)
        /System/Library/Frameworks/IOKit.framework/Versions/A/IOKit (compatibility version 1.0.0, current version 275.0.0)
        /System/Library/Frameworks/OSAKit.framework/Versions/A/OSAKit (compatibility version 1.0.0, current version 113.0.0)
        /System/Library/Frameworks/Accelerate.framework/Versions/A/Accelerate (compatibility version 1.0.0, current version 4.0.0)
        /System/Library/Frameworks/IOSurface.framework/Versions/A/IOSurface (compatibility version 1.0.0, current version 1.0.0)
        /System/Library/Frameworks/UniformTypeIdentifiers.framework/Versions/A/UniformTypeIdentifiers (compatibility version 1.0.0, current version 709.0.0)
        /System/Library/Frameworks/Cocoa.framework/Versions/A/Cocoa (compatibility version 1.0.0, current version 23.0.0)
        /nix/store/qqp3sbsh57n33ppkrl8sby7f7gpgli84-libxml2-2.9.13/lib/libxml2.2.dylib (compatibility version 12.0.0, current version 12.13.0)
        /nix/store/piz5f9n6pyj947jvn0wg54iw1an6w1pw-ncurses-6.3/lib/libncursesw.6.dylib (compatibility version 6.0.0, current version 6.0.0)
        /nix/store/vx9v678g9gxwqgihd3s44g1mb2q1cgwq-gnutls-3.7.3/lib/libgnutls.30.dylib (compatibility version 62.0.0, current version 62.0.0)
        /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.11)
        /nix/store/ms2973z4qbddbwnl3xgkqwiwza0fcmmm-jansson-2.13.1/lib/libjansson.4.dylib (compatibility version 18.0.0, current version 18.0.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1292.60.1)
        /System/Library/Frameworks/AppKit.framework/Versions/C/AppKit (compatibility version 45.0.0, current version 2113.40.126)
        /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation (compatibility version 150.0.0, current version 1858.112.0)
        /System/Library/Frameworks/CoreGraphics.framework/Versions/A/CoreGraphics (compatibility version 64.0.0, current version 1557.5.4)
        /System/Library/Frameworks/CoreImage.framework/Versions/A/CoreImage (compatibility version 1.0.1, current version 5.0.0)
        /System/Library/Frameworks/CoreServices.framework/Versions/A/CoreServices (compatibility version 1.0.0, current version 1141.1.0)
        /System/Library/Frameworks/CoreText.framework/Versions/A/CoreText (compatibility version 1.0.0, current version 1.0.0)
        /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation (compatibility version 300.0.0, current version 1858.112.0)
        /System/Library/Frameworks/ImageIO.framework/Versions/A/ImageIO (compatibility version 1.0.0, current version 1.0.0)
        /System/Library/Frameworks/PDFKit.framework/Versions/A/PDFKit (compatibility version 1.0.0, current version 1.0.0)
        /usr/lib/libobjc.A.dylib (compatibility version 1.0.0, current version 228.0.0)

Unless I'm mistaken, we could start swapping these out to figure out which library is causing the segfaults. Do I have to add the library paths to the build command explicitly? Will clang not use the build shell path to resolve those? Please forgive my lack of C build knowledge.

@Atemu
Copy link
Member

Atemu commented Apr 17, 2022

Sorry, I don't know either.

Perhaps @toonn could help you with that, they're very knowledgable when it comes to Darwin and its stdenv.

@toonn
Copy link
Contributor

toonn commented Apr 22, 2022

I think we can change out dylibs one by one using install_name_tool?

@Atemu
Copy link
Member

Atemu commented Jan 18, 2023

gccStdenv causes no GUI toolkit to be built at all for some odd reason.

libgccjit works with clang and has been working in macPort for a while.

tnytown added a commit to tnytown/nixpkgs-overlay-tny that referenced this issue Feb 14, 2023
Patch mac_gui_loop in macappkit.m to fix the block lifetime issue.

NixOS/nixpkgs#127902
@tnytown
Copy link
Contributor

tnytown commented Feb 15, 2023

I attempted to triage the issue and I think I got it:
tnytown/nixpkgs-overlay-tny@03deb14
It seems to be working on macOS 13.2. You can test it with Flakes by running nix build github:tnytown/nixpkgs-overlay-tny#emacsMacport, then open result/Applications/Emacs.app. I want to try to get this patch into nixpkgs if it works on other machines and if it doesn't cause any runtime regressions with x86_64.

@Atemu
Copy link
Member

Atemu commented Feb 15, 2023

How and why would that patch fix it? Also, why does it require a new header when there's no header change?

Emacs seems to be running quite stably with it but executing any action on the menu bar causes an immediate crash.

Nixpkgs patch
diff --git a/pkgs/applications/editors/emacs/0002-mac-gui-loop-block-autorelease.patch b/pkgs/applications/editors/emacs/0002-mac-gui-loop-block-autorelease.patch
new file mode 100644
index 00000000000..4908a01ba07
--- /dev/null
+++ b/pkgs/applications/editors/emacs/0002-mac-gui-loop-block-autorelease.patch
@@ -0,0 +1,28 @@
+diff --git a/src/macappkit.m b/src/macappkit.m
+index babb3560c7..29b0ab3511 100644
+--- a/src/macappkit.m
++++ b/src/macappkit.m
+@@ -16145,9 +16145,9 @@ - (void)sound:(NSSound *)sound didFinishPlaying:(BOOL)finishedPlaying
+       dispatch_semaphore_wait (mac_gui_semaphore, DISPATCH_TIME_FOREVER);
+       block = [mac_gui_queue dequeue];
+       if (block)
+-	block ();
+-      dispatch_semaphore_signal (mac_lisp_semaphore);
++	block (); 
+       END_AUTORELEASE_POOL;
++      dispatch_semaphore_signal (mac_lisp_semaphore);
+     }
+   while (block);
+ }
+@@ -16161,9 +16161,9 @@ - (void)sound:(NSSound *)sound didFinishPlaying:(BOOL)finishedPlaying
+   dispatch_semaphore_wait (mac_gui_semaphore, DISPATCH_TIME_FOREVER);
+   block = [mac_gui_queue dequeue];
+   eassert (block);
+-  block ();
+-  dispatch_semaphore_signal (mac_lisp_semaphore);
++  block (); 
+   END_AUTORELEASE_POOL;
++  dispatch_semaphore_signal (mac_lisp_semaphore);
+ }
+ 
+ /* Ask execution of BLOCK to the GUI thread synchronously.  The
diff --git a/pkgs/applications/editors/emacs/generic.nix b/pkgs/applications/editors/emacs/generic.nix
index feed7ba5b41..9d0f48be647 100644
--- a/pkgs/applications/editors/emacs/generic.nix
+++ b/pkgs/applications/editors/emacs/generic.nix
@@ -7,7 +7,7 @@
   , patches ? _: [ ]
   , macportVersion ? null
 }:
-{ stdenv, llvmPackages_6, lib, fetchurl, fetchpatch, substituteAll, ncurses, libXaw, libXpm
+{ stdenv, llvmPackages_14, lib, fetchurl, fetchpatch, substituteAll, ncurses, libXaw, libXpm
 , Xaw3d, libXcursor,  pkg-config, gettext, libXft, dbus, libpng, libjpeg, giflib
 , libtiff, librsvg, libwebp, gconf, libxml2, imagemagick, gnutls, libselinux
 , alsa-lib, cairo, acl, gpm, m17n_lib, libotf
@@ -17,7 +17,7 @@
 , fetchFromSavannah, fetchFromBitbucket
 
   # macOS dependencies for NS and macPort
-, AppKit, Carbon, Cocoa, IOKit, OSAKit, Quartz, QuartzCore, WebKit
+, AppKit, Carbon, Cocoa, IOKit, OSAKit, Quartz, QuartzCore, WebKit, UniformTypeIdentifiers
 , ImageCaptureCore, GSS, ImageIO # These may be optional
 
 , withX ? !stdenv.isDarwin && !withPgtk
@@ -60,7 +60,7 @@ assert withPgtk -> withGTK3 && !withX && gtk3 != null;
 assert withXwidgets -> withGTK3 && webkitgtk != null;
 
 
-let emacs = (if withMacport then llvmPackages_6.stdenv else stdenv).mkDerivation (lib.optionalAttrs nativeComp {
+let emacs = (if withMacport then llvmPackages_14.stdenv else stdenv).mkDerivation (lib.optionalAttrs nativeComp {
   NATIVE_FULL_AOT = "1";
   LIBRARY_PATH = "${lib.getLib stdenv.cc.libc}/lib";
 } // {
@@ -159,7 +159,7 @@ let emacs = (if withMacport then llvmPackages_6.stdenv else stdenv).mkDerivation
     ++ lib.optionals (withX && withXwidgets) [ webkitgtk glib-networking ]
     ++ lib.optionals withNS [ AppKit GSS ImageIO ]
     ++ lib.optionals withMacport [
-      AppKit Carbon Cocoa IOKit OSAKit Quartz QuartzCore WebKit
+      AppKit Carbon Cocoa IOKit OSAKit Quartz QuartzCore WebKit UniformTypeIdentifiers
       # TODO are these optional?
       ImageCaptureCore GSS ImageIO
     ]
diff --git a/pkgs/applications/editors/emacs/macport.nix b/pkgs/applications/editors/emacs/macport.nix
index 9a30b2b5b40..48acfc4648b 100644
--- a/pkgs/applications/editors/emacs/macport.nix
+++ b/pkgs/applications/editors/emacs/macport.nix
@@ -3,4 +3,7 @@ import ./generic.nix rec {
   version = "28.2";
   macportVersion = "emacs-${version}-mac-9.1";
   sha256 = "sha256-Ne2jQ2nVLNiQmnkkOXVc5AkLVkTpm8pFC7VNY2gQjPE=";
+  patches = _: [
+    ./0002-mac-gui-loop-block-autorelease.patch
+  ];
 }
diff --git a/pkgs/top-level/all-packages.nix b/pkgs/top-level/all-packages.nix
index b538e2ad58e..700aa265a6c 100644
--- a/pkgs/top-level/all-packages.nix
+++ b/pkgs/top-level/all-packages.nix
@@ -29060,7 +29060,7 @@ with pkgs;
     gconf = null;
 
     inherit (darwin.apple_sdk.frameworks)
-      AppKit Carbon Cocoa IOKit OSAKit Quartz QuartzCore WebKit
+      AppKit Carbon Cocoa IOKit OSAKit Quartz QuartzCore WebKit UniformTypeIdentifiers
       ImageCaptureCore GSS ImageIO;
     inherit (darwin) sigtool;
   };

@bestlem
Copy link

bestlem commented Feb 15, 2023

That patch would imply a bug in Mitsuharu's port of emacs. However the original code works on Macports and Homebrew without a patch for this.

Thus you might have stoped a crash but I think you have introduced a memory leak (moving things out of autorelease sections) or a race condition.

It does run on my Apple Silicon mac mini.

@tnytown
Copy link
Contributor

tnytown commented Feb 16, 2023

Sorry all, I should have elaborated further on what my patch is meant to do. I'll start from the top.

As a disclaimer, I am not an Emacs maintainer and I do not know what I am doing. If anybody reading this has a better idea of what's going on, please chime in!

First, I fixed the compilation issue by adding the UniformTypeIdentifiers framework. This was also discovered at #127902 (comment).

Once the compilation error was fixed, I got the familiar segfault(s):

$ outputs/out/Applications/Emacs.app/Contents/MacOS/Emacs
Fatal error 11: Segmentation fault
Backtrace:
0   Emacs                               0x0000000100754e48 emacs_backtrace + 500
1   Emacs                               0x0000000100d5b010 terminate_due_to_signal + 568
2   Emacs                               0x000000010075aab0 deliver_thread_signal + 0
3   Emacs                               0x0000000100753724 deliver_process_signal + 736
4   Emacs                               0x0000000100753d88 deliver_fatal_signal + 32
5   libsystem_platform.dylib            0x00000001a29cc2a4 _sigtramp + 56
6   libdispatch.dylib                   0x00000001a2816aa0 _dispatch_sema4_wait + 28
7   libdispatch.dylib                   0x00000001a2817154 _dispatch_semaphore_wait_slow + 132
8   Emacs                               0x0000000100cd2718 mac_within_gui_and_here + 280
9   Emacs                               0x0000000100be5254 mac_within_gui + 28
10  Emacs                               0x0000000100c14ec8 mac_set_frame_window_title + 928
11  Emacs                               0x0000000100b2e074 mac_set_name_internal + 188
12  Emacs                               0x0000000100b2de88 mac_set_name + 1100
13  Emacs                               0x0000000100b33634 mac_window + 932
14  Emacs                               0x0000000100b31724 Fx_create_frame + 8740
15  faces-b9447c93-029fbb21.eln         0x0000000103a3eaa0 F782d6372656174652d6672616d652d776974682d6661636573_x_create_frame_with_faces_0 + 304
16  Emacs                               0x0000000100916e8c funcall_subr_1 + 1668
17  Emacs                               0x0000000100916788 __funcall_subr_block_invoke + 208
18  Emacs                               0x0000000100bd6b28 mac_autorelease_loop + 104
19  Emacs                               0x0000000100915ce4 funcall_subr + 872
20  Emacs                               0x0000000100912b34 Ffuncall + 1044
21  Emacs                               0x00000001009e9dbc exec_byte_code + 11848
22  Emacs                               0x000000010091e010 fetch_and_exec_byte_code + 144
23  Emacs                               0x0000000100916064 funcall_lambda + 652
24  Emacs                               0x0000000100912b80 Ffuncall + 1120
25  Emacs                               0x000000010090c960 Fapply + 484
26  Emacs                               0x0000000100916c10 funcall_subr_1 + 1032
27  Emacs                               0x0000000100916788 __funcall_subr_block_invoke + 208
28  Emacs                               0x0000000100bd6b28 mac_autorelease_loop + 104
29  Emacs                               0x0000000100915ce4 funcall_subr + 872
30  Emacs                               0x0000000100912b34 Ffuncall + 1044
31  Emacs                               0x00000001009e9dbc exec_byte_code + 11848
32  Emacs                               0x000000010091e010 fetch_and_exec_byte_code + 144
33  Emacs                               0x0000000100916064 funcall_lambda + 652
34  Emacs                               0x0000000100912b80 Ffuncall + 1120
35  frame-b40fc590-ab21ad97.eln         0x0000000109ef71c8 F6d616b652d6672616d65_make_frame_0 + 1684
36  Emacs                               0x0000000100916e8c funcall_subr_1 + 1668
37  Emacs                               0x0000000100916788 __funcall_subr_block_invoke + 208
38  Emacs                               0x0000000100bd6b28 mac_autorelease_loop + 104
39  Emacs                               0x0000000100915ce4 funcall_subr + 872
40  Emacs                               0x0000000100912b34 Ffuncall + 1044
...
fish: Job 1, 'outputs/out/Applications/Emac...' terminated by signal SIGSEGV (Erreur de frontière d'adresse)

To try to gain visibility into the segfault, I tried many things, including:

  • Running under XCode Instruments
  • Setting NSZombieEnabled and CFZombieLevel
    • These environment variables enable zombie object generation. With zombie objects, the Objective-C runtime replaces released objects with a placeholder that contains helpful debugging info.
    • The program crashed without no extra info. Also, breakpoints I set on NSZombie in lldb were never hit.
    • I initially believed that these environment variables were broken, but I tested them with a minimal repro case and zombie objects appeared correctly.
  • Stepping through with lldb
    • Not very helpful, a lot of context was lost by the time the process crashed.
  • Compiling with ASAN
    • Most of the crashes involved a null pointer dereference for me as opposed to deref of free'd memory, so this did not work.
  • Compiling with -O0, -g3, -glldb
    • Still crashed, but with debug info!
  • Disabling mitigations in the derivation with hardeningDisable = [ "all" ]
    • This removed a lot of the extra flags that Nix passes to clang, whittling down the possibility of the issue being in the compiler invocation.
    • The underlying issue still may lie in the compiler invocation! I can't be sure that the flags I didn't touch don't affect codegen.
  • Enabling ARC with -fobjc-arc
    • Emacs launched and ran for a little bit with this, then crashed again.
    • There are a bunch of adjustments for ARC in the tree and compiling with it enables a different codepath for many memory-related operations.
    • I thought that the default Apple toolchain might have been compiling with ARC, but this turned out to be false.

All of this (and more) did not yield any tangible results.

What finally did lead me closer to what was happening was digging deeper into one of the crash backtraces in lldb.

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00000001a2631834 libobjc.A.dylib`objc_msgSend + 52
    frame #1: 0x00000001a2638168 libobjc.A.dylib`AutoreleasePoolPage::releaseUntil(objc_object**) + 196
    frame #2: 0x00000001a2634870 libobjc.A.dylib`objc_autoreleasePoolPop + 256
    frame #3: 0x00000001009f7048 Emacs`mac_gui_loop at macappkit.m:16150:7
    frame #4: 0x00000001009f6940 Emacs`main(argc=1, argv=0x000000016fdfe948) at macappkit.m:16728:3
    frame #5: 0x00000001a2673e50 dyld`start + 2544

This crash occurs in the deallocation of the autorelease pool. When Objective-C objects are allocated and autoreleased, they are added to the running thread's topmost autorelease pool.

Our issue is (allegedly) with an invalid object being released. How do we track down the source of that object?
We can get the pointer from the autorelease pool (print the first argument passed to objc_msgSend, but we don't know what it held before its backing memory was freed.

One potentially useful source of information that occurred to me was breakpointing the function that adds objects to the autorelease pool and printing out the objects. This was a very silly idea, as there are a lot of objects in a program like Emacs, but I decided to give it a try. It felt very tedious to do manually, so I wrote a script. Here's the latest version of that:

trace.py
#!/usr/bin/python3

import sys
import os
import subprocess

_p = subprocess.getoutput("lldb -P")
sys.path.append(_p.strip())

import lldb

assert len(sys.argv) >= 2

debugger = lldb.SBDebugger.Create()
debugger.SetAsync(False) 


def _eval(c):
    rtn = lldb.SBCommandReturnObject()
    interp = debugger.GetCommandInterpreter()
    interp.HandleCommand(c, rtn)
    if not rtn.Succeeded():
        raise Exception(rtn.GetError())
    return rtn.GetOutput()


def _exec(c):
    debugger.HandleCommand(c)


def _target():
    return debugger.GetSelectedTarget()


def _thread():
    return _target().GetProcess().GetSelectedThread()


def _frame(n=0):
    return _thread().GetFrameAtIndex(n)


def _expr(e):
    return _frame().EvaluateExpression(e)


def _symbol(e):
    addr = _target().ResolveLoadAddress(e)
    return _target().ResolveSymbolContextForAddress(addr, lldb.eSymbolContextEverything)


def _thr_not_stopped():
    return _thread().GetStopReason() != lldb.eStopReasonException

target = debugger.CreateTarget(sys.argv[1])
assert target

# mac_gui_loop > if(block) ...
_exec("b macappkit.m:16146")
pool_bp = target.BreakpointCreateByName("_objc_rootAutorelease")
assert pool_bp

pool_bp.enabled = False

proc = target.LaunchSimple([], [f"{k}={v}" for k, v in os.environ.items()], os.getcwd())
assert proc

# _exec("c")
while proc.state is not lldb.eStateCrashed and _thr_not_stopped():
    print(_frame().name)
    pool_bp.enabled = True
    _eval("thread until 16149")
    in_fn = _frame().name
    while "_objc_rootAutorelease" in in_fn and _thr_not_stopped(): 
        obj = _expr("(id) $x0")
        print(f"autoreleased: 0x{obj.unsigned:X}, {obj.GetObjectDescription()}")
        _eval("fr se 1")
        _exec("fr inf")
        # print(in_fn, "continuing")
        proc.Continue()

    pool_bp.enabled = False
    print("--- mac_gui_loop ---")
    addr = _expr("*((uint64_t*) block + 2)")
    # _exec("po (id) block") 
    sym = _symbol(addr.unsigned)
    print("0x{:X}, {}".format(addr.unsigned, sym.symbol.name))
    proc.Continue() 
    

#_exec("po (id) block")
_exec("bt")

_exec("br list")
# _("thr ba all")

while True:
    cmd = input("] ")
    _exec(cmd)

This script prints out all autoreleased objects from execution of mac_gui_loop's loop, the problematic section. After Emacs crashes, it emulates a lldb shell. Since the script prints the objects while they are being added to the pool, we get to see far more detail than just opaque invalid objects. With this tool in my belt, I again looked at the offending object from the crash:

] fr info
frame #0: 0x00000001a2631834 libobjc.A.dylib`objc_msgSend + 52
] p (id) $x0
(id) $2575 = 0x00000001707885c0

This happened to match one of the objects that was printed! (Notice the matching address, 0x1707885c0.)

autoreleased: 0x1707885C0, <__NSGlobalBlock__: 0x1707885c0>
 signature: "v8@?0"
 invoke   : 0x10094ec94 (/Users/tnytown/dev/emacs-mac/outputs/out/Applications/Emacs.app/Contents/MacOS/Emacs`__mac_set_frame_window_background_block_invoke)
frame #1: 0x00000001008edd70 Emacs`-[NSMutableArray(self=@"1 element", _cmd="dequeue") dequeue] at macappkit.m:249:40

This was an Objective-C block from set_frame_window_background! NSGlobalBlock is (according to my searches) supposed to describe a block that is allocated in some kind of static memory. Its memory management related selectors (retain, release, etc) are supposed to be no-ops:

(lldb) dis -n '-[__NSGlobalBlock__ retain]'
libsystem_blocks.dylib`-[__NSGlobalBlock__ retain]:
    0x1a26fb0b8 <+0>: ret    
(lldb) dis -n '-[__NSGlobalBlock__ release]'
libsystem_blocks.dylib`-[__NSGlobalBlock__ release]:
    0x1a26fb0b4 <+0>: ret   

How does an object that cannot be deallocated get deallocated? It turns out that in this instance, the compiler constructed NSGlobalBlock on the stack. The address to the object I was seeing was a stack address.

This brings me to my patch:

patch
diff --git a/src/macappkit.m b/src/macappkit.m
index babb3560c7..29b0ab3511 100644
--- a/src/macappkit.m
+++ b/src/macappkit.m
@@ -16145,9 +16145,9 @@ - (void)sound:(NSSound *)sound didFinishPlaying:(BOOL)finishedPlaying
       dispatch_semaphore_wait (mac_gui_semaphore, DISPATCH_TIME_FOREVER);
       block = [mac_gui_queue dequeue];
       if (block)
-	block ();
-      dispatch_semaphore_signal (mac_lisp_semaphore);
+	block (); 
       END_AUTORELEASE_POOL;
+      dispatch_semaphore_signal (mac_lisp_semaphore);
     }
   while (block);
 }
@@ -16161,9 +16161,9 @@ - (void)sound:(NSSound *)sound didFinishPlaying:(BOOL)finishedPlaying
   dispatch_semaphore_wait (mac_gui_semaphore, DISPATCH_TIME_FOREVER);
   block = [mac_gui_queue dequeue];
   eassert (block);
-  block ();
-  dispatch_semaphore_signal (mac_lisp_semaphore);
+  block (); 
   END_AUTORELEASE_POOL;
+  dispatch_semaphore_signal (mac_lisp_semaphore);
 }
 
 /* Ask execution of BLOCK to the GUI thread synchronously.  The

The Emacs Mac port handles Lisp execution on a thread separate from the GUI thread. It seems like these threads are meant to be cooperative, that is, they block on one another and yield when a task needs to be done by the other. The patch moves the yield in the main GUI thread loop outside of the loop's autorelease pool.

Why does this matter? Let's walk through a hypothetical scenario with the old code.

[LISP] denotes an operation on the Lisp thread. [MAIN] denotes an operation on the main thread. (Concurrently) denotes an operation that is concurrent with operations on the opposite thread. Ordering of concurrent options is variable.

  1. [LISP] set_frame_window_background, the function of interest, is called on the Lisp thread.
  2. [LISP] set_frame_window_background schedules work on the GUI thread by passing a block to mac_within_gui.
    i. By the way, a block is just a fancy closure, which itself is a function pointer with extra steps.
  3. [LISP] mac_within_gui adds the block to the queue and politely waits for the GUI thread to signal completion.
  4. [MAIN] mac_gui_loop wakes up on the main thread. It dequeues the block.
    i. The dequeue function autoreleases its value, adding block to the main thread's active autorelease pool.
  5. [MAIN] block is invoked.
  6. [MAIN] mac_gui_loop tells the Lisp thread that it's done!
  7. [LISP] (Concurrently) mac_within_gui on the Lisp thread resumes.
  8. [LISP] (Concurrently) mac_within_gui returns, then set_frame_window_background returns. Their stack frames are popped.
  9. [LISP] (Concurrently) Does some more stuff, probably in mac_window in src/macfns.c.
  10. [MAIN] (Concurrently) Pops autorelease pool containing block. A message is sent to the block pointer. block points to the set_frame_window_background stack frame.
  11. DOOM!

Moving the Lisp semaphore signal operation out of the autorelease pool is one way of solving this issue.

Now for Q&A!

How and why would that patch fix it?

With the patch, the Lisp thread will not invalidate block before the autorelease pool messages block. I'm not sure why this is a Nix clang specific issue, there may be compiler differences.

Also, why does it require a new header when there's no header change?

UniformTypeIdentifiers is included conditionally on newer macOS minimum versions from autoconf. Maybe the stdenv for aarch64-darwin specifies a newer -mmacos-version-min?

Emacs seems to be running quite stably with it but executing any action on the menu bar causes an immediate crash.

Yikes, that is not good :( I will investigate the menu bar issue when I have spare time. I spent quite a lot tracking this issue down.

That patch would imply a bug in Mitsuharu's port of emacs. However the original code works on Macports and Homebrew without a patch for this.

Fair point. I am not sure why the non-Nix versions work. My pet theory is that deep in the bowels of Apple, there exists a clang patch that changes NSGlobalBlock codegen behavior. I plan on diffing relevant generated code once I get more spare time.

Thus you might have stoped a crash but I think you have introduced a memory leak (moving things out of autorelease sections) or a race condition.

It is possible that there is a memory leak. However, the memory leak will be present in the context of the Lisp thread, which does not have an autorelease pool. I cannot guess the maintainer's intentions, but I do not think that the author of the Mac port intended for Lisp thread objects to end up in the GUI autorelease pool.

My current Emacs instance on my work machine is sitting at ~300MB of memory usage. It's been open for two days and has gone through heavy use (magit, lsp-mode, etc). If there is a memory leak, it doesn't seem to be egregious.

As for the race condition, I think I have conclusively tracked down a race condition in the unchanged code compiled with Nix clang :)

It does run on my Apple Silicon mac mini.

Glad to hear it works for you!

@Atemu
Copy link
Member

Atemu commented Feb 16, 2023

@tnytown Thanks a lot for digging into this and the write-up!

Could you post your findings towards YAMAMOTO Mitsuharu? Perhaps he has more ideas what might be going on here or how to fix it properly.

@bestlem
Copy link

bestlem commented Feb 16, 2023

A good writeup but...
In that case shouldn't your fix be passed upstream,

connorfeeley added a commit to connorfeeley/dotfiles that referenced this issue Mar 3, 2023
GitHub user @tnytown did an incredible job figuring out the cause of
emacsMacport crashing:
  NixOS/nixpkgs#127902 (comment)

Bless your soul, @tnytown.
@lillycat332
Copy link
Contributor

@tnytown nice writeup!
It doesn't seem to specifically be the menu bar that is the issue but rather anything that is invoked via a native dialog prompt; the file picker consistently causes segfaults, right click menus have the same issue, etc.

@RaitoBezarius
Copy link
Member

@tnytown Would you be interested into turning your writeup into a PR? I am interested into dropping LLVM6 and emacsMacport is the last obstacle for this.

@tnytown
Copy link
Contributor

tnytown commented Apr 5, 2023

Would you be interested into turning your writeup into a PR? I am interested into dropping LLVM6 and emacsMacport is the last obstacle for this.

I'll be open to doing this, @RaitoBezarius! Unfortunately, my changes do not completely fix the issues with the Mac port. As mentioned earlier, there are still bugs with OS integration. I'm still working on fixing those issues, at which point I plan on upstreaming the changes for a more permanent solution.

If there's any appetite for it, I can try to get these changes as-is into nixpkgs, keeping these issues in mind. Another option is to just mark the package as broken.

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/is-it-currently-possible-to-use-emacs-on-aarch64-darwin/31571/1

@nathanscully
Copy link

Has anyone had any success building the recently merged in emacs29-macport from #246203?
I am getting the same error as OP:

error: builder for '/nix/store/38g7vlx4xam4jfjbdi00bacg4l74bng1-compiler-rt-6.0.1.drv' failed with exit code 2;
       last 10 log lines:
       > [ 68%] Building C object lib/builtins/CMakeFiles/clang_rt.cc_kext_arm64_osx.dir/atomic_thread_fence.c.o
       > [ 68%] Building C object lib/builtins/CMakeFiles/clang_rt.cc_kext_arm64_osx.dir/gcc_personality_v0.c.o
       > [ 68%] Building C object lib/builtins/CMakeFiles/clang_rt.cc_kext_arm64_osx.dir/__/profile/InstrProfiling.c.o
       > [ 68%] Building C object lib/builtins/CMakeFiles/clang_rt.cc_kext_arm64_osx.dir/clear_cache.c.o
       > [ 68%] Building C object lib/builtins/CMakeFiles/clang_rt.cc_kext_arm64_osx.dir/__/profile/InstrProfilingBuffer.c.o
       > [ 69%] Building C object lib/builtins/CMakeFiles/clang_rt.cc_kext_arm64_osx.dir/__/profile/InstrProfilingPlatformDarwin.c.o
       > [ 69%] Building C object lib/builtins/CMakeFiles/clang_rt.cc_kext_arm64_osx.dir/__/profile/InstrProfilingWriter.c.o
       > [ 69%] Linking C static library libclang_rt.cc_kext_arm64_osx.a
       > [ 69%] Built target clang_rt.cc_kext_arm64_osx
       > make: *** [Makefile:136: all] Error 2
       For full logs, run 'nix log /nix/store/38g7vlx4xam4jfjbdi00bacg4l74bng1-compiler-rt-6.0.1.drv'.

Like others I've enabled Homebrew and using the railwaycat formula with success. I tried to review the overlays and patches posted but as a Nix begineer its not clear how I am meant to leverage them. If someone can point me to a guide or list out some steps I'm happy to try it out.

nix-info -m                                                       
 - system: `"aarch64-darwin"`
 - host os: `Darwin 22.5.0, macOS 13.4.1`
 - multi-user?: `yes`
 - sandbox: `no`
 - version: `nix-env (Nix) 2.17.0`
 - channels(root): `"nixpkgs"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixpkgs`

@sielicki
Copy link
Contributor

@nathanscully I have confirmed that the same crash is present with the current 29.1 derivation, but am looking to test against #241692 and potentially #229210 when I can.

I am skeptical of the above patch; there's nothing nix-specific about it and given the extent to which apple freely contributes patches in llvm upstream (per git log grep, 5307 commits with apple email addresses since 2020), I don't think it would make sense that they've been internally maintaining a patch for fixes to core objective-c language features like autorelease. More likely IMO is that some component in the dependency chain of this package is omitting a barrier that goes unnoticed under other platforms because of how weakly-ordered armv8a is, and we're just using a super old toolchain here. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697 for an example.

As for testing, you can just clone nixpkgs, hack on pkgs/applications/editors/emacs/generic.nix, then run nix-build -A emacs29-macport to build it and run it directly from ./result.

@Atemu
Copy link
Member

Atemu commented Aug 27, 2023

The LLVM breakage is unrelated to emacs. LLVM6 is ancient and actualy due for removal but we need it for building a non-crashing macport some reason. LLVM6 does not support modern aarch64-darwin, it's no surprise that it's broken.

Given the status of LLVM6, macport will soon be broken on x86_64-darwin too.

@RaitoBezarius
Copy link
Member

@Atemu I know we discussed some of this IRL, feel free to clean up this and put me as a reviewer for removals.

@tnytown
Copy link
Contributor

tnytown commented Aug 30, 2023

I worked on this a bit more and discovered the root cause of the issue.

First off, changes are on my overlay as usual. I also opened a draft PR if there's any interest in getting these changes into nixpkgs. More testers would be appreciated: the menu bar and context menu crashes should be gone, but there may be other bugs lurking. I am contacting the maintainer to find a solution upstream.

TL;DR: I have empirically observed differences between how LLVM clang and Apple clang generates code for Objective-C blocks. These differences are, to my knowledge, not Nix-specific; they may present themselves in any code that uses Objective-C blocks.

If you're curious about how I came to this conclusion, read on!

As I mentioned in my previous comment, the Mac port operates two threads that schedule tasks on each other. These tasks are Objective-C blocks that are constructed on one thread, then executed on the other. This would lead to situations where one thread would reference a block that was no longer valid because the other thread had continued execution and popped the stack frame that defined the block. My older patch fixed an instance of this use after stack return in the main loop, but unfortunately that was not enough to resolve all instability.

To resolve the menu bar crash, I decided to approach the problem from a different angle. I downloaded a known-good Homebrew build and diffed it against my own. One interesting difference was in how blocks were constructed:

Excerpt from Nix's mac_set_frame_window_background
1001e309c  e88a3c58   ldr     x8, __NSConcreteGlobalBlock
1001e30a0  e80700f9   str     x8, [sp, #0x8 {var_58}]
1001e30a4  1f2003d5   nop     
1001e30a8  c01d1e5c   ldr     d0, data_10021f460  {0xd0800000}
1001e30ac  e00b00fd   str     d0, [sp, #0x10]  {0xd0800000}
1001e30b0  88020010   adr     x8, ___mac_set_frame_window_background_block_invoke
1001e30b4  1f2003d5   nop     
1001e30b8  499b4210   adr     x9, ___block_descriptor_48_e5_v8�?0ls32l8
1001e30bc  1f2003d5   nop     
1001e30c0  e8a701a9   stp     x8, x9, [sp, #0x18] {var_48} {var_40}  {___mac_set_frame_window_background_block_invoke}  {___block_descriptor_48_e5_v8�?0ls32l8}
1001e30c4  e0cf02a9   stp     x0, x19, [sp, #0x28] {var_38} {var_30}
1001e30c8  e0230091   add     x0, sp, #0x8 {var_58}
1001e30cc  010080d2   mov     x1, #0
1001e30d0  bd6b0094   bl      _mac_within_gui_and_here
Excerpt from Homebrew's mac_set_frame_window_background
100232098  c8664b58   ldr     x8, __NSConcreteStackBlock
10023209c  e80300f9   str     x8, [sp {var_50}]
1002320a0  1f2003d5   nop     
1002320a4  e0212a5c   ldr     d0, _handle_services_invocation.names[0x10]  {0xc2000000}
1002320a8  e00700fd   str     d0, [sp, #0x8]  {0xc2000000}
1002320ac  a8010010   adr     x8, ___mac_set_frame_window_background_block_invoke
1002320b0  1f2003d5   nop     
1002320b4  a9335210   adr     x9, ___block_descriptor_48_e8_32o_e5_v8�?0l
1002320b8  1f2003d5   nop     
1002320bc  e82701a9   stp     x8, x9, [sp, #0x10] {var_40} {var_38}  {___mac_set_frame_window_background_block_invoke}  {___block_descriptor_48_e8_32o_e5_v8�?0l}
1002320c0  e04f02a9   stp     x0, x19, [sp, #0x20] {var_30} {var_28}
1002320c4  e0030091   mov     x0, sp {var_50}
1002320c8  010080d2   mov     x1, #0
1002320cc  d25e0094   bl      _mac_within_gui_and_here

Nix clang and Homebrew (which uses Apple clang) both construct Objective-C blocks that conform to the Block ABI. However, Nix clang has a difference: it makes some blocks NSConcreteGlobalBlock, while Apple clang apparently uses NSConcreteStackBlock. This extends beyond mac_set_frame_window_background: the Homebrew binary only ever constructs blocks of type NSConcreteStackBlock and never references NSConcreteGlobalBlock.

So, this is all well and good, you might say, but what is the difference between NSConcreteStackBlock and NSConcreteGlobalBlock? I found out last time that NSConcreteGlobalBlock does not implement retain, release, copy, or any memory-related operation, really:

(lldb) dis -n '-[__NSGlobalBlock__ retain]'
libsystem_blocks.dylib`-[__NSGlobalBlock__ retain]:
    0x1a26fb0b8 <+0>: ret    
(lldb) dis -n '-[__NSGlobalBlock__ release]'
libsystem_blocks.dylib`-[__NSGlobalBlock__ release]:
    0x1a26fb0b4 <+0>: ret   

This is because NSConcreteGlobalBlock blocks, as the name suggests, are meant to be global: stored in read-only memory and referenced at its consistent location. There would be no point to copying or ref-counting, blocks of type NSConcreteGlobalBlock should never have any instance-specific data.

In Homebrew Emacs, retain is called on NSConcreteStackBlock when it is dequeued. [NSConcreteStackBlock retain] converts the block into a NSConcreteMallocBlock which, as the name implies, is heap-allocated. Any reference to this NSConcreteMallocBlock would not touch the defining thread's stack frame. Unfortunately, in the Nix clang-compiled binary, stack-local NSConcreteGlobalBlock are queued. [NSConcreteGlobalBlock retain] then does not do any conversion, leading to invalid stack references and the crashes that we've been seeing.

With that in mind, I isolated the core queue logic in the Mac port and produced an example that also exhibits this codegen difference between Nix and Apple clang:

Reproducer

A brief aside on compiler flags: the Nix CC wrapper adds a bunch of flags to the invocation by default. To ensure purity, I used llvmPackages_14.clang-unwrapped with system headers:

nix shell nixpkgs#llvmPackages_14.clang-unwrapped -c clang-14 \
    -isystem /Library/Developer/CommandLineTools/usr/lib/clang/*/include/ \
    -isystem /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/ \
    -iframework /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/System/Library/Frameworks/ \
    -framework CoreFoundation \
    reproducer.m -o reproducer_nix

Here's the invocation for Apple clang:

/usr/bin/clang -framework CoreFoundation reproducer.m -o reproducer_apple

And reproducer.m:

#import <Foundation/Foundation.h>
#include <stdio.h>

#define MY_NOESCAPE __attribute__ ((noescape))

NSMutableArray *queue;
void queue_block(void (^ MY_NOESCAPE block) (void)) {
    puts("queue_block");
    [queue addObject:block];
}

void call_block(void) {
    puts("call_block");
    void (^dequeued_block) (void) = [[queue[0] retain] autorelease];
    [queue removeObjectAtIndex:0];
    dequeued_block();
}

int main(int argc, char **argv) {
    queue = [[NSMutableArray alloc] init];

    queue_block(^{
        printf("block: %s\n", argv[0]);
    });
    call_block();
}

The block defined in main is a stack NSConcreteGlobalBlock on Nix clang and a NSConcreteStackBlock on Apple clang, just like what we saw in Emacs. Interestingly enough, if MY_NOESCAPE (which is identical to CF_NOESCAPE in the macport codebase) is empty, Nix clang generates a NSConcreteStackBlock.

Armed with this knowledge, I set off to https://github.com/llvm/llvm-project. I looked for NSConcreteGlobalBlock and found
the following in clang/lib/CodeGen/CGBlocks.cpp:

    // If the block is non-escaping, set field 'isa 'to NSConcreteGlobalBlock
    // and set the BLOCK_IS_GLOBAL bit of field 'flags'. Copying a non-escaping
    // block just returns the original block and releasing it is a no-op.
    llvm::Constant *blockISA = blockInfo.NoEscape
                                   ? CGM.getNSConcreteGlobalBlock()
                                   : CGM.getNSConcreteStackBlock();

(Incidentally, this was committed by an Apple employee and released in LLVM 7.0. This nicely explains why LLVM 6 does not cause crashes :)

So, the generation of NSConcreteGlobalBlock in stack literals is intentional and depends on __attribute__((noescape)).
It remains to be seen why Apple clang won't generate a NSConcreteGlobalBlock in the same way. I tried all optimization levels with and without noescape, and I was unable to make it match the Nix behavior. FWIW, the issue tracker for the Homebrew build seems to have a few reports of a similar segfault issue when compiled with Homebrew-built clang, so I think it is likely that Apple clang is divergent here (edit: the Homebrew case seems a bit more complex, I'm looking into it).

With all that out of the way, my fix this time does not involve a code patch. I defined CFLAGS to globally include a header that clears CF_NOESCAPE, coaxing Nix clang into generating the correct code.

CC @Atemu and @RaitoBezarius for visibility :)

@sielicki
Copy link
Contributor

sielicki commented Aug 30, 2023

So, the generation of NSConcreteGlobalBlock in stack literals is intentional and depends on __attribute__((noescape)).

I don't think that's totally right -- __attribute__((noescape)) is just a way for the programmer to say "trust me" to the compiler, to force the compiler to optimize as if it's a temporary. But there are also noescape blocks that aren't explicitly labeled as such, as far as I understand it.

https://clang.llvm.org/docs/AttributeReference.html#noescape

noescape placed on a function parameter of a pointer type is used to inform the compiler that the pointer cannot escape: that is, no reference to the object the pointer points to that is derived from the parameter value will survive after the function returns. Users are responsible for making sure parameters annotated with noescape do not actually escape. Calling free() on such a parameter does not constitute an escape.

The example file that you've posted is doing this, too:

// global memory, accessible by any thread from any callsite
NSMutableArray *queue;

// [MY_NOESCAPE makes a promise to the compiler
//  that the block passed to this function will not 
//  survive this function call]
NSMutableArray *queue;
void queue_block(void (^ MY_NOESCAPE block) (void)) {
    puts("queue_block");
    // [it's stored into a queue in global memory where it can be pulled out
    //  at some unknown future time; eg: the `block' ref is escaping]
    [queue addObject:block];
}

If emacs-macport is doing something similar, it's 100% a bug on his end.

@tnytown
Copy link
Contributor

tnytown commented Aug 30, 2023

So, the generation of NSConcreteGlobalBlock in stack literals is intentional and depends on __attribute__((noescape)).

Sorry, I worded this poorly. What I meant was that in this context, NSConcreteGlobalBlock is generated as a direct result of the annotation. clang definitely also does some escape analysis to determine if un-annotated blocks can be global.

I want to clarify that I find both approaches (respecting __attribute__((noescape)) or not in this scenario) reasonable. When I say that Apple clang diverges, I just mean that it behaves differently, and I'm not saying that its behavior is worse.

I don't think that's totally right -- __attribute__((noescape)) is just a way for the programmer to say "trust me" to the compiler, to force the compiler to optimize as if it's a temporary. But there are also noescape blocks that aren't explicitly labeled as such, as far as I understand it.

That is how I understand it as well :)

If emacs-macport is doing something similar, it's 100% a bug on his end.

emacs-mac does indeed label the block parameters of the queuing functions as CF_NOESCAPE, just like my reproducer. The intent here may have been to avoid unnecessary allocations that come with copying the blocks to the queue, which would require either thread's execution to stop until the other is done executing. Of course, that doesn't happen in practice, as evidenced by my last patch. I think it's possible that, at some revision of emacs-mac in the past, both threads were synchronized perfectly and were able to refer to stack blocks from the other thread without any issues.

Anyway, the system clang does not generate stack local blocks with isa = NSConcreteGlobalBlock, so the optimization there clearly isn't working correctly if it was meant as such.

@rytswd
Copy link

rytswd commented Aug 31, 2023

Sorry to jump in, but I just wanted to thank @tnytown for the great work with the investigations, patches and fixes, and other folks involved to get them shipped! 🥰 With the latest fix following #252244 and #252611, I got my build to finally succeed and not crash randomly on my M1 machine for the first time. It may be a bit too early to celebrate, but it's simply amazing to see investigations and fixes with such great details, and while I have learned a lot of technical details in there, the way this issue was tackled is a treasure in itself. Thank you very much for the inspiring work! ☺️

@montchr
Copy link
Member

montchr commented Sep 2, 2023

I've also just installed emacs29-macport and I am blown away. I've avoided installing GUI apps with Nix on macOS for a while after many frustrations (especially with various flavors of Emacs). But so far, this package works wonderfully. Thank you so much!

@nathanscully
Copy link

+1 @tnytown - thanks for the hard work here (and all the reviewers across the threads). emacs29-macport building and working without issue for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.topic: darwin Running or building packages on Darwin 6.topic: emacs
Projects
None yet
Development

Successfully merging a pull request may close this issue.