Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libfabric 1.4.1 #6551

Closed
wants to merge 1 commit into from
Closed

libfabric 1.4.1 #6551

wants to merge 1 commit into from

Conversation

ilovezfs
Copy link
Contributor

@ilovezfs ilovezfs commented Nov 3, 2016

Created with brew bump-formula-pr.

@ilovezfs ilovezfs added the clock_gettime CI fails due to clock_gettime symbol label Nov 3, 2016
@jsquyres
Copy link
Contributor

jsquyres commented Nov 4, 2016

I can't tell why the CI test failed on El Cap...? Is this a libfabric failure, or a brew infrastructure failure, or ...?

@ilovezfs
Copy link
Contributor Author

ilovezfs commented Nov 4, 2016

@jsquyres it's a libfabric failure, specifically a problem when using the Xcode 8 SDK on El Capitan caused by the weak symbol clock_gettime, which is defined in the time.h header but not actually available at run time unless you're on Sierra.

@jsquyres
Copy link
Contributor

jsquyres commented Nov 4, 2016

Can you point me to the specific failure? I'm able to build libfabric 1.4 on my El Cap-based (10.11.6) laptop with Xcode 8.1 (8B62), and run its executables.

@ilovezfs
Copy link
Contributor Author

ilovezfs commented Nov 4, 2016

You're probably building against the Xcode 7 Command Line Tools, not Xcode 8. The full log is here: https://bot.brew.sh/job/Homebrew%20Core/10613/version=el_capitan/consoleText

If you make sure xcode-select -p is pointing at /Applications/Xcode.app/Contents/Developer and you set SDKROOT=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk you should be able to reproduce it outside of Homebrew.

If not, brew install -dvs libfrabic and you can enter the shell by choosing option 5 when it goes 💥

@jsquyres
Copy link
Contributor

jsquyres commented Nov 4, 2016

Forgive me, on multiple levels...

  1. Yes, xcode-select -p is pointing at /Applications/Xcode.app/Contents/Developer.
  2. I exported SDKROOT to the value you specified, but I apparently only had one meaningful subdir in there, anyway.
  3. I did the build again (from a vanilla libfabric 1.4.0 tarball), and it still worked fine. Here's a gist showing the full output.

So I installed Homebrew (gasp!) and then tried brew install -dvs libfabric It downloads and builds libfabric 1.3.0 -- not 1.4.0. So I can't see if the fix in v1.4.0 made a difference (via homebrew, anyway).

@ilovezfs
Copy link
Contributor Author

ilovezfs commented Nov 4, 2016

brew pull 6551

Then you can test 1.4.0

@jsquyres
Copy link
Contributor

jsquyres commented Nov 4, 2016

Perfect; that got it.

I see off the bat that when I run configure/make myself, it uses gcc. But when homebrew runs configure/make, it uses clang. I'll dig into this...

@ilovezfs
Copy link
Contributor Author

ilovezfs commented Nov 4, 2016

@jsquyres thanks. You may also want to peruse our collection of these and the various workarounds: https://github.com/Homebrew/homebrew-core/pulls?q=is%3Apr+is%3Aclosed+label%3Aclock_gettime

Also, you may find the logs in ~/Library/Logs/Homebrew/libfabric useful since you can see what command was handed to our so-called "superenv" shims, and then what command was actually executed.

@jsquyres
Copy link
Contributor

jsquyres commented Nov 6, 2016

After a bunch of builds, I figured out the difference. Recall:

  1. When I build libfabric outside of Homebrew, it compiles and runs fine.
  2. When Homebrew builds libfabric, it fails to compile.
  3. I'm on an El Cap-based (10.11.6) laptop with Xcode 8.1 (8B62).

The difference is in the environment that Homebrew is using to build libfabric. If I add these two environment variables to my environment, I can replicate the error that Homebrew encounters (i.e., that libfabric fails to build):

export 'SDKROOT=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk'
export 'ac_cv_search_clock_gettime=no'

Specifically: Homebrew is setting these env variables before it invokes libfabric's configure script. If I set either one of these without the other, libfabric still builds/works fine. It is the combination of both of these env variables that causes the problem.

I'll have to admit ignorance here of why the SDKROOT variable matters -- to my unknowning eye, I only have one SDK installed:

$ ls -l /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs
total 8
drwxr-xr-x  5 root  wheel  170 Jan 27  2019 MacOSX.sdk/
lrwxr-xr-x  1 root  wheel   10 Oct 29 10:23 MacOSX10.12.sdk@ -> MacOSX.sdk

That being said, I think if you update your libfabric formula to not set ac_cv_search_clock_gettime and instead let libfabric's configure figure it out, then the Right Things will happen.

@ilovezfs
Copy link
Contributor Author

ilovezfs commented Nov 7, 2016

@jsquyres you're saying a cached value of ac_cv_search_clock_gettime=no causes your configure to choose to use clock_gettime? How can that possibly be correct?

@jsquyres
Copy link
Contributor

jsquyres commented Nov 7, 2016

@ilovezfs Libfabric provides its own clock_gettime on platforms that don't have it; that was our workaround to the portability issue. It means we don't have to do things like this in our code base:

#if HAVE_CLOCK_GETTIME
    clock_gettime(...);
#else
    something_else(...);
#endif

@ilovezfs
Copy link
Contributor Author

ilovezfs commented Nov 7, 2016

OK, but what does that have to do with the standard ac_cv_search_clock_gettime check returning "no"?

@jsquyres
Copy link
Contributor

jsquyres commented Nov 7, 2016

The problem appears to be that MacOSX10.12.sdk does contain clock_gettime, even on OS X 10.11.6. Hence, when you force the use of MacOSX10.12.sdk and you force the (incorrect) answer of no in ac_cv_search_clock_gettime, things go badly -- as they should.

Here's a trivial test that shows this:

# My simple configure script for this test
$ cat configure.ac
AC_INIT([bogus], [0.1], [bogus@example.com])
AC_SEARCH_LIBS([clock_gettime])
# My simple test script
$ cat run
#!/bin/sh

autoconf

export SDKROOT=
echo "======= WITH DEFAULT SDK (SDKROOT is unset)"
echo "======= SDKROOT=$SDKROOT"
./configure

export 'SDKROOT=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk'
echo "======= WITH 10.12 SDK"
echo "======= SDKROOT=$SDKROOT"
./configure

Ok, let's now run this script on my OS X 10.11.x El Cap machine with Xcode 8.1 (8B62):

$ ./run
======= WITH DEFAULT SDK (SDKROOT is unset)
======= SDKROOT=
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for library containing clock_gettime... no
======= WITH 10.12 SDK
======= SDKROOT=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for library containing clock_gettime... none required

Notice the "none required" output -- meaning that clock_gettime was found.

@jsquyres
Copy link
Contributor

jsquyres commented Nov 7, 2016

@bturrubiates and I were looking closer at this, and he found Homebrew/brew@1d7aa1fe0.

Is this perhaps a difference between Xcode 8.0 and 8.1? I ask because if I build libfabric out of the box, calls to clock_gettime() seem to be working fine...?

@ilovezfs
Copy link
Contributor Author

ilovezfs commented Nov 8, 2016

10.11 does not have clock_gettime. It's only defined in the Xcode 8/8.1 time.h header and in the libsystem_c.tbd file as a weak symbol.

jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 8, 2016
clock_gettime(3) on MacOS is... complicated.

1. CLOCK_REALTIME (and friends) may or may not exist
2. clockid_t may or may not exist
3. clock_gettime, as a symbol, may or may not exist
4. Even if clock_gettime exists as a symbol, it may be a weak symbol
   that has no strong symbol behind it (which causes a run-time linker
   error when you call it).

Separate off the clock_gettime(3) checks into their own .m4 to keep
the insanity limited.

The m4 probably could have been written a bit more compactly, but I
tried for maximum clarity instead (i.e., small, simple macros that
call each other).

Thanks to Homebrew/homebrew-core#6551 for
identifying the issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 8, 2016
clock_gettime(3) on MacOS is... complicated.

1. CLOCK_REALTIME (and friends) may or may not exist
2. clockid_t may or may not exist
3. clock_gettime, as a symbol, may or may not exist
4. Even if clock_gettime exists as a symbol, it may be a weak symbol
   that has no strong symbol behind it (which causes a run-time linker
   error when you call it).

Separate off the clock_gettime(3) checks into their own .m4 to keep
the insanity limited.

The m4 probably could have been written a bit more compactly, but I
tried for maximum clarity instead (i.e., small, simple macros that
call each other).

Thanks to Homebrew/homebrew-core#6551 for
identifying the issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 8, 2016
clock_gettime(3) on MacOS is... complicated.

1. CLOCK_REALTIME (and friends) may or may not exist
2. clockid_t may or may not exist
3. clock_gettime, as a symbol, may or may not exist
4. Even if clock_gettime exists as a symbol, it may be a weak symbol
   that has no strong symbol behind it (which causes a run-time linker
   error when you call it).

Separate off the clock_gettime(3) checks into their own .m4 to keep
the insanity limited.

The m4 probably could have been written a bit more compactly, but I
tried for maximum clarity instead (i.e., small, simple macros that
call each other).

Thanks to Homebrew/homebrew-core#6551 for
identifying the issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 8, 2016
clock_gettime(3) on MacOS is... complicated.

1. CLOCK_REALTIME (and friends) may or may not exist
2. clockid_t may or may not exist
3. clock_gettime, as a symbol, may or may not exist
4. Even if clock_gettime exists as a symbol, it may be a weak symbol
   that has no strong symbol behind it (which causes a run-time linker
   error when you call it).

Separate off the clock_gettime(3) checks into their own .m4 to keep
the insanity limited.

The m4 probably could have been written a bit more compactly, but I
tried for maximum clarity instead (i.e., small, simple macros that
call each other).

Thanks to Homebrew/homebrew-core#6551 for
identifying the issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@jsquyres
Copy link
Contributor

jsquyres commented Nov 8, 2016

Ok, I see the issue now: MacOS can be evil with regards to clock_gettime(3). Sigh.

I opened a PR with what I hope is the fix: ofiwg/libfabric#2508

If this is the fix, can you pull this patch into your brew recipe? Or do we need to do a v1.4.1 release? (I'm a little hesitant to do a 1.4.1 release just for this issue...)

@ilovezfs
Copy link
Contributor Author

ilovezfs commented Nov 8, 2016

@jsquyres LOL. You aren't alone in that sentiment. Erlang upstream had a similar reaction: https://bugs.erlang.org/browse/ERL-256?focusedCommentId=11634&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-11634

Adding the patch is no problem. No need for you to do a new release until you're ready.

@jsquyres
Copy link
Contributor

jsquyres commented Nov 8, 2016

Thanks @ilovezfs. Let's see how my PR shakes out.

@ilovezfs
Copy link
Contributor Author

ilovezfs commented Nov 8, 2016

jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 8, 2016
clock_gettime(3) on MacOS is... complicated.

1. CLOCK_REALTIME (and friends) may or may not exist
2. clockid_t may or may not exist
3. clock_gettime, as a symbol, may or may not exist
4. Even if clock_gettime exists as a symbol, it may be a weak symbol
   that has no strong symbol behind it (which causes a run-time linker
   error when you call it).

Separate off the clock_gettime(3) checks into their own .m4 to keep
the insanity limited.

The m4 probably could have been written a bit more compactly, but I
tried for maximum clarity instead (i.e., small, simple macros that
call each other).

Thanks to Homebrew/homebrew-core#6551 for
identifying the issue.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@jsquyres
Copy link
Contributor

jsquyres commented Nov 8, 2016

Blech! After all that, I forgot to test the SDKROOT+ac_cv_search_clock_gettime combination. PR updated with what I think is the fix.

@ilovezfs
Copy link
Contributor Author

@jsquyres this looks like it works. However, I think you may want to use a different name for your fallback implementation. If you attempt to link libfabric with -Wl,-no_weak_imports and export MACOSX_DEPLOYMENT_TARGET=10.11 or pass -mmacosx-version-min=10.11, then it will fail with

ld: weak import of symbol '_clock_gettime' not supported because of option: -no_weak_imports for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

since that symbol name is not permitted on 10.11 with -no_weak_imports even if you've provided a custom implementation. Since Apple is recommending people use -no_weak_imports when possible, that is likely to become a problem for you given that you're using the colliding symbol name.

This has already caused problems elsewhere where software has provided a custom implementation and called it "clock_gettime"

@jsquyres
Copy link
Contributor

Yoinks. Ok.

jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 10, 2016
clock_gettime(3) on MacOS is... complicated.

Base facts:

1. OS X 10.11 (El Capitan) does not have clock_gettime(3).
1. MacOS 10.12 (Sierra) has clock_gettime(3).
1. XCode 8.x allows you to build 10.12 applications on 10.11 by
   setting the env var SDKROOT to point to the 10.12 root. On 10.11,
   the SDK libraries contain a weak symbol for clock_gettime(3) (and
   some other POSIX functions).  Meaning:
   1. If you run the resulting executable on 10.11, you get a linker
      error because there is no strong symbol for clock_gettime behind
      the weak symbol.
   1. If you run the resulting executable on 10.12, it works fine.

What this boils down to is: not only do we have to AC_SEARCH_LIBS to
check for the presence of the clock_gettime symbol, we also have to
make sure that it actually works (with AC_RUN_IFELSE).  This is a
little gross, but... pass the beer nuts.

If we don't have a functioning clock_gettime(3), then emulate it.

Also due to OS X/MacOS shenanigans, update the rest of the code base
to rename our emulated clock_gettime() to be ofi_clock_gettime().  In
systems with a functioning clock_gettime(3), this is just an inline
passthru to clock_gettime(3).  Otherwise, we call an OS-specific
emulation function.

The reason for this is because Apple is encouraging developers to link
with "-Wl,-no_weak_imports", which will fail to link if the
clock_gettime weak symbol is pulled in (per the problematic scenario
described above).  So just avoid the whole situation by always using
ofi_clock_gettime().

Finally, separate off the clock_gettime(3) checks into their own .m4
to keep the insanity limited.

Thanks to @ilovezfs for identifying the issue on
Homebrew/homebrew-core#6551.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 18, 2016
The portability of clock_gettime(3) on Mac OS / OS X
is... complicated.  Per
Homebrew/homebrew-core#6551 and
ofiwg#2508, remove the use of
clock_gettime(3) from the core of libfabric.  clock_gettime(3) is
still used in some providers that do not compile on OS X / MacOS, but
that's ok).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 18, 2016
The portability of clock_gettime(3) on Mac OS / OS X
is... complicated.  Per
Homebrew/homebrew-core#6551 and
ofiwg#2508, remove the use of
clock_gettime(3) from the core of libfabric.  clock_gettime(3) is
still used in some providers that do not compile on OS X / MacOS, but
that's ok).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 19, 2016
The portability of clock_gettime(3) on Mac OS / OS X
is... complicated.  Per
Homebrew/homebrew-core#6551 and
ofiwg#2508, remove the use of
clock_gettime(3) from the core of libfabric.  clock_gettime(3) is
still used in some providers that do not compile on OS X / MacOS, but
that's ok).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 19, 2016
The portability of clock_gettime(3) on Mac OS / OS X
is... complicated.  Per
Homebrew/homebrew-core#6551 and
ofiwg#2508, remove the use of
clock_gettime(3) from the core of libfabric.  clock_gettime(3) is
still used in some providers that do not compile on OS X / MacOS, but
that's ok).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 19, 2016
The portability of clock_gettime(3) on Mac OS / OS X
is... complicated.  Per
Homebrew/homebrew-core#6551 and
ofiwg#2508, remove the use of
clock_gettime(3) from the core of libfabric.  clock_gettime(3) is
still used in some providers that do not compile on OS X / MacOS, but
that's ok).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 28, 2016
The portability of clock_gettime(3) on Mac OS / OS X
is... complicated.  Per
Homebrew/homebrew-core#6551 and
ofiwg#2508, remove the use of
clock_gettime(3) from the core of libfabric.  clock_gettime(3) is
still used in some providers that do not compile on OS X / MacOS, but
that's ok).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 28, 2016
The portability of clock_gettime(3) on Mac OS / OS X
is... complicated.  Per
Homebrew/homebrew-core#6551 and
ofiwg#2508, remove the use of
clock_gettime(3) from the core of libfabric.  clock_gettime(3) is
still used in some providers that do not compile on OS X / MacOS, but
that's ok).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
jsquyres added a commit to jsquyres/libfabric that referenced this pull request Nov 28, 2016
The portability of clock_gettime(3) on Mac OS / OS X
is... complicated.  Per
Homebrew/homebrew-core#6551 and
ofiwg#2508, remove the use of
clock_gettime(3) from the core of libfabric.  clock_gettime(3) is
still used in some providers that do not compile on OS X / MacOS, but
that's ok.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@jsquyres
Copy link
Contributor

@ilovezfs It took forever, but I think we finally got the right fix committed upstream: ofiwg/libfabric#2508

@ilovezfs
Copy link
Contributor Author

@jsquyres nice work! Let's give it a try

@ilovezfs ilovezfs changed the title libfabric 1.4.0 libfabric 1.4.1 Nov 28, 2016
@ilovezfs
Copy link
Contributor Author

@jsquyres it worked for me locally and now CI 🍏. As you can see, I've modified the formula to build 16326406506adf8b3f8b30d802453064395d3341 as 1.4.1-alpha1 so that it could go through CI, but it would be great if you could create a new version (preferably a release tarball but just a new tag would work too if we throw in the Autotools dependencies, as I've done here).

@jsquyres
Copy link
Contributor

I don't know if we're ready to do a 1.4.1 release, or if we'll do one just for this fix.

Would it be possible to download the 1.4.0 tarball, but then apply a patch file corresponding to ofiwg/libfabric@0b0c889?

jsquyres added a commit to jsquyres/libfabric that referenced this pull request Dec 3, 2016
The portability of clock_gettime(3) on Mac OS / OS X
is... complicated.  Per
Homebrew/homebrew-core#6551 and
ofiwg#2508, remove the use of
clock_gettime(3) from the core of libfabric.  clock_gettime(3) is
still used in some providers that do not compile on OS X / MacOS, but
that's ok.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@ilovezfs
Copy link
Contributor Author

ilovezfs commented Jan 5, 2017

@jsquyres shipped! My apologies for the delay.

jsquyres added a commit to jsquyres/libfabric that referenced this pull request Feb 24, 2017
The portability of clock_gettime(3) on Mac OS / OS X
is... complicated.  Per
Homebrew/homebrew-core#6551 and
ofiwg#2508, remove the use of
clock_gettime(3) from the core of libfabric.  clock_gettime(3) is
still used in some providers that do not compile on OS X / MacOS, but
that's ok.

As noted above, this issue was discovered after v1.4.0 and was fixed
on master.  However, we later branched from the v1.4.0 tag to make the
v1.4.x branch to fix some specific bugs, and then released v1.4.1.

Unfortunately, the fix for this issue was forgotten, and not included
in the v1.4.1 release (see
Homebrew/homebrew-core#10292).  Although it is
unlikely, I'm cherry-picking this fix from master to the v1.4.x
branch just in case we ever do a v1.4.2 release.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

(cherry picked from commit 0b0c889)
jsquyres added a commit to jsquyres/libfabric that referenced this pull request Feb 24, 2017
The portability of clock_gettime(3) on Mac OS / OS X
is... complicated.  Per
Homebrew/homebrew-core#6551 and
ofiwg#2508, remove the use of
clock_gettime(3) from the core of libfabric.  clock_gettime(3) is
still used in some providers that do not compile on OS X / MacOS, but
that's ok.

This issue was discovered after v1.4.0 was released and was fixed on
master.  However, we later branched from the v1.4.0 tag to make the
v1.4.x branch in order to fix some specific bugs.  We then released
v1.4.1.

Unfortunately, the fix for this issue was forgotten / not included in
the v1.4.1 release (see
Homebrew/homebrew-core#10292).  Although it is
unlikely, I'm cherry-picking this fix from master to the v1.4.x branch
just in case we ever do a v1.4.2 release.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

(cherry picked from commit 0b0c889)
@Homebrew Homebrew locked and limited conversation to collaborators May 4, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
clock_gettime CI fails due to clock_gettime symbol
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants