Testing GEOSgcm with M1 GNU #417

Closed
5 tasks done
mathomp4 opened this issue May 12, 2022 · 27 comments

@mathomp4
Member

mathomp4 commented May 12, 2022

I'm going to open this as a tracking issue for GEOSgcm issues when running with GNU on M1.

Thanks to @iains, these are not needed anymore for M1 and GEOS.

At present the only real "bug" would be in ESMA_env where we need a slight change to g5_modules to make it not complain:

@mathomp4 mathomp4 added the bug Something isn't working label May 12, 2022
@mathomp4 mathomp4 self-assigned this May 12, 2022
@mathomp4
Member Author

mathomp4 commented May 12, 2022

Additional issue. I can get GEOSgcm to build just fine, but it can't run. The issue seems to be something in the M1 GNU I built. I did this following how @fxcoudert (or someone??) did in the Homebrew recipe for GCC, but by hand since on my laptop, I'm not allowed to install in /opt/homebrew and I could not get Homebrew M1 gcc to work when installed under $HOME. (And, frankly, I prefer to control my compilers and MPI myself. I only want GCC 11 if I module load gcc so I don't mind building by hand.)

I built as per my modulefile for gcc 11.2.0, where I use the same @fxcoudert tarball as well as a patch that I see for arm64 in the brew recipe.

The issue seems to be that the executable built by our CMake is not handling rpaths the same on Arm as it does on Intel. When rs_numtiles.x tries to run on M1, I get:

dyld[48868]: Library not loaded: @rpath/libgomp.1.dylib
  Referenced from: /Users/mathomp4/Models/GEOSgcm/install-Debug/bin/rs_numtiles.x
  Reason: tried: '/Users/mathomp4/installed/MPI/gcc-gfortran-11.2.0/openmpi-4.1.3/Baselibs/7.1.0/Darwin/lib/libgomp.1.dylib' (no such file), '/Users/mathomp4/Models/GEOSgcm/install-Debug/lib/libgomp.1.dylib' (no such file), '/Users/mathomp4/installed/MPI/gcc-gfortran-11.2.0/openmpi-4.1.3/Baselibs/7.1.0/Darwin/lib/libgomp.1.dylib' (no such file), '/Users/mathomp4/Models/GEOSgcm/install-Debug/lib/libgomp.1.dylib' (no such file), '/usr/local/lib/libgomp.1.dylib' (no such file), '/usr/lib/libgomp.1.dylib' (no such file)
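
The dyld message above lists every location it tried for @rpath/libgomp.1.dylib. Here is a minimal sketch of that lookup. It assumes a simplified model: each LC_RPATH directory is tried in order, followed by the two fallback directories seen at the end of the error. The names `candidate_paths`, `resolve`, and the `exists` callback are hypothetical and exist only for illustration; they are not dyld APIs.

```python
from typing import Callable, Iterable, List, Optional

# Assumption: only plain directory rpaths are modelled. Real dyld also
# expands @loader_path/@executable_path and honours environment overrides.
FALLBACK_DIRS = ["/usr/local/lib", "/usr/lib"]  # last two paths in the error


def candidate_paths(install_name: str, rpaths: Iterable[str]) -> List[str]:
    """Return the paths that would be tried for an install name."""
    prefix = "@rpath/"
    if not install_name.startswith(prefix):
        return [install_name]  # absolute install names are used as-is
    leaf = install_name[len(prefix):]
    return [f"{d.rstrip('/')}/{leaf}" for d in list(rpaths) + FALLBACK_DIRS]


def resolve(install_name: str, rpaths: Iterable[str],
            exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate for which exists() is true, else None."""
    for path in candidate_paths(install_name, rpaths):
        if exists(path):
            return path
    return None


if __name__ == "__main__":
    # The two rpath directories that appear in the dyld error above.
    rpaths = [
        "/Users/mathomp4/installed/MPI/gcc-gfortran-11.2.0/openmpi-4.1.3/Baselibs/7.1.0/Darwin/lib",
        "/Users/mathomp4/Models/GEOSgcm/install-Debug/lib",
    ]
    for p in candidate_paths("@rpath/libgomp.1.dylib", rpaths):
        print(p)
```

Under this model the load can only succeed if one of the embedded rpaths (or a fallback directory) actually contains libgomp.1.dylib. That fits the symptom: neither the Baselibs nor the install-Debug lib directory is where GCC put its runtime libraries.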

and if I otool -l I see:

otool -l /Users/mathomp4/Models/GEOSgcm/install-Debug/bin/rs_numtiles.x
...
Load command 31
          cmd LC_LOAD_DYLIB
      cmdsize 56
         name @rpath/libstdc++.6.dylib (offset 24)
   time stamp 2 Wed Dec 31 19:00:02 1969
      current version 7.29.0
compatibility version 7.0.0
...
Load command 39
          cmd LC_LOAD_DYLIB
      cmdsize 48
         name @rpath/libgomp.1.dylib (offset 24)
   time stamp 2 Wed Dec 31 19:00:02 1969
      current version 2.0.0
compatibility version 2.0.0
Load command 40
          cmd LC_LOAD_DYLIB
      cmdsize 56
         name @rpath/libgfortran.5.dylib (offset 24)
   time stamp 2 Wed Dec 31 19:00:02 1969
      current version 6.0.0
compatibility version 6.0.0
Load command 41
          cmd LC_LOAD_DYLIB
      cmdsize 56
         name @rpath/libgcc_s.1.1.dylib (offset 24)
   time stamp 2 Wed Dec 31 19:00:02 1969
      current version 1.1.0
compatibility version 1.0.0
...

Now, let us compare that to what I see on my Intel Mac (on Big Sur):

...
Load command 30
          cmd LC_LOAD_DYLIB
      cmdsize 104
         name /Users/mathomp4/installed/Core/gcc-gfortran/11.3.0/lib/libstdc++.6.dylib (offset 24)
   time stamp 2 Wed Dec 31 19:00:02 1969
      current version 7.29.0
compatibility version 7.0.0
...
Load command 36
          cmd LC_LOAD_DYLIB
      cmdsize 96
         name /Users/mathomp4/installed/Core/gcc-gfortran/11.3.0/lib/libgomp.1.dylib (offset 24)
   time stamp 2 Wed Dec 31 19:00:02 1969
      current version 2.0.0
compatibility version 2.0.0
Load command 37
          cmd LC_LOAD_DYLIB
      cmdsize 104
         name /Users/mathomp4/installed/Core/gcc-gfortran/11.3.0/lib/libgfortran.5.dylib (offset 24)
   time stamp 2 Wed Dec 31 19:00:02 1969
      current version 6.0.0
compatibility version 6.0.0
...
Load command 40
          cmd LC_LOAD_DYLIB
      cmdsize 104
         name /Users/mathomp4/installed/Core/gcc-gfortran/11.3.0/lib/libquadmath.0.dylib (offset 24)
   time stamp 2 Wed Dec 31 19:00:02 1969
      current version 1.0.0
compatibility version 1.0.0
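
To make the M1/Intel difference easier to see, the two `otool -l` listings can be reduced to their LC_LOAD_DYLIB names and split into @rpath-relative and absolute entries. This is a rough sketch that parses saved `otool -l` text. The function names are made up for this example, and it assumes library paths contain no spaces.

```python
from typing import List, Tuple


def load_dylib_names(otool_output: str) -> List[str]:
    """Collect the 'name' field of every LC_LOAD_DYLIB load command."""
    names = []
    in_load_dylib = False
    for line in otool_output.splitlines():
        fields = line.split()
        if fields[:1] == ["cmd"]:
            # A new load command starts; remember whether it is a dylib load.
            in_load_dylib = fields[1:2] == ["LC_LOAD_DYLIB"]
        elif in_load_dylib and fields[:1] == ["name"] and len(fields) > 1:
            names.append(fields[1])
    return names


def split_by_rpath(names: List[str]) -> Tuple[List[str], List[str]]:
    """Return (rpath-relative names, all other names)."""
    relative = [n for n in names if n.startswith("@rpath/")]
    other = [n for n in names if not n.startswith("@rpath/")]
    return relative, other


if __name__ == "__main__":
    # Two load commands copied from the listings above (one M1, one Intel).
    sample = """\
Load command 39
          cmd LC_LOAD_DYLIB
      cmdsize 48
         name @rpath/libgomp.1.dylib (offset 24)
Load command 36
          cmd LC_LOAD_DYLIB
      cmdsize 96
         name /Users/mathomp4/installed/Core/gcc-gfortran/11.3.0/lib/libgomp.1.dylib (offset 24)
"""
    relative, other = split_by_rpath(load_dylib_names(sample))
    print("via @rpath:", relative)
    print("absolute:  ", other)
```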

So...not sure what to do.

I do see some output in this Homebrew issue:

Homebrew/homebrew-core#97874

like:

==> Changing dylib ID of /opt/homebrew/Cellar/gcc/11.2.0_3/lib/gcc/11/libgomp.1.dylib
  from @rpath/libgomp.1.dylib
    to /opt/homebrew/opt/gcc/lib/gcc/11/libgomp.1.dylib

and I wonder if I need to do the same thing in my by-hand install of GCC? I never have to do it on my version of GCC on Intel on Big Sur, but that is both architecture and OS changing!
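
For reference, the Homebrew step quoted above corresponds to `install_name_tool -id`, which rewrites a dylib's own install name. The sketch below only builds the command lines and runs nothing. The library directory and the list of dylibs are assumptions for illustration, and nothing in this thread confirms that this is the right fix for a hand-built GCC.

```python
from pathlib import PurePosixPath
from typing import List


def dylib_id_commands(lib_dir: str, dylib_names: List[str]) -> List[List[str]]:
    """Build `install_name_tool -id <abs path> <file>` argv lists.

    Each dylib's ID is set to its own absolute path, mirroring the
    '@rpath/... -> /opt/homebrew/...' change shown in the Homebrew log.
    """
    commands = []
    for name in dylib_names:
        target = str(PurePosixPath(lib_dir) / name)
        commands.append(["install_name_tool", "-id", target, target])
    return commands


if __name__ == "__main__":
    # Hypothetical install location and library list (assumptions).
    lib_dir = "/Users/mathomp4/installed/Core/gcc-gfortran/11.2.0/lib"
    libs = ["libgomp.1.dylib", "libgfortran.5.dylib", "libstdc++.6.dylib"]
    for argv in dylib_id_commands(lib_dir, libs):
        print(" ".join(argv))
```

Changing the ID only affects binaries linked afterwards, since the install name is copied into the executable at link time. On Apple Silicon, editing a signed binary may also require re-signing it; that is not covered here.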

A few swings around Google did bring up a configure flag called --disable-darwin-at-rpath, which sounded interesting. But when I tried it, the GCC build was not happy. This seems to be something that @iains said wasn't exactly supported in this issue.

Sadly, I nuked my original build of GCC to try --disable-darwin-at-rpath so I'm doing the "rebuild from scratch" sort of thing. I'll check once that's done if the libgomp.1.dylib I get on M1 looks different under otool -l compared to that on Intel.

@iains

iains commented May 12, 2022

OK. Can we start from which version of gcc you have built and how you have configured it.

It should be entirely possible (on Arm64 or x86_64) [gcc-11.3-darwin-r0 or gcc-12.1-darwin-r0] to configure it to be installed in your home directory (or /Users/Shared), i.e. somewhere writable by a non-admin user.

If you then use that compiler to build a dependent project, the compiler should embed the rpaths that it needs (additional to any that are supplied by CMAKE or other build system).

This is (probably) not an issue with GEOSgcm but more with the mechanisms being used to combine pieces of the build environments.

@mathomp4
Member Author

Oooh. Interesting! I see from this comment, @iains has recommended using:

https://github.com/iains/gcc-11-branch/releases/tag/gcc-11.3-darwin-r0

(associated README)

Well then, I have a task for tomorrow I do believe! Let's try building this!

@mathomp4
Member Author

> OK. Can we start from which version of gcc you have built and how you have configured it.

@iains I essentially echo'd what Homebrew did as a first attempt, I grabbed the tarball they use:

wget https://github.com/fxcoudert/gcc/archive/refs/tags/gcc-11.2.0-arm-20211124.tar.gz

and a patch:

wget 'https://gcc.gnu.org/git/?p=gcc.git;a=patch;h=fabe8cc41e9b01913e2016861237d1d99d7567bf'

and apply the patch.

Then it's the usual way I build gcc:

./contrib/download_prerequisites
cd ..
mkdir build && cd build
../gcc-gcc-11.2.0-arm-20211124/configure --prefix=$HOME/installed/Core/gcc-gfortran/11.2.0 \
  --enable-languages=c,c++,fortran \
  --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk \
  |& tee configure.log
make -j4 |& tee make.log
make install |& tee makeinstall.log

It's not as fancy as Homebrew's method, but I decided to start from what I knew.

As for the "not in $HOME" that was for Homebrew's GCC. For some reason, it just wasn't happy being in somewhere other than /opt/homebrew. I know I found a post somewhere online where someone else had the same issue, but I seem to have lost it. Thus my thought to build by hand. I'm going to try gcc-11.3-darwin-r0 tomorrow

@fxcoudert

That gcc-11.2.0-arm-20211124 was an experimental branch that Homebrew has been using for some time, but we're migrating. I would now recommend building directly from the branches in Iain's repo, either gcc-11.3-darwin-r0 (for 11.3) or the 12.1 one.

@mathomp4
Member Author

Okay. Per @iains to fix an issue with ieee_arithmetic, I built a clone of:

https://github.com/iains/gcc-11-branch

This seems to have support for quad-precision, so we can stop caring about many of the PRs and issues above.

However, I seem to now have an error with HDF5:

make[4]: Entering directory '/Users/mathomp4/Baselibs/ESMA-Baselibs-7.1.0/src/hdf5/hl/test'
  CCLD     test_lite
ld: warning: option -s is obsolete and being ignored
ld: in ../../hl/src/.libs/libhdf5_hl.a(H5LTparse.o), in section __TEXT,__text reloc 347: symbol index out of range for architecture arm64
collect2: error: ld returned 1 exit status
make[4]: *** [Makefile:983: test_lite] Error 1

I've filed an issue (iains/gcc-11-branch#3)

@mathomp4
Member Author

Welp, thanks to @iains in iains/gcc-11-branch#3 (comment), I can run GEOS on M1 with GCC 11.3.

I'll use this space to talk about speed once I can get runs on my Intel laptop, my M1 laptop, and Discover at C12. I'll do C12 with and without ExtData just to be fair.

@mathomp4
Member Author

mathomp4 commented May 31, 2022

Performance Results of GEOSgcm (6 hours, C12, 1x6, No ExtData)

Release

| Processor | Compiler | SDYN (s) | PHYS (s) | Last NoRad Tput (d/d) | Last Rad Tput (d/d) | Final Tput (d/d) |
| --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (Discover) | GCC 11.3 | 11.45 | 36.39 | 501.7 | 372.2 | 348.802 |
| Coffee Lake (Intel Mac) | GCC 11.3 | 15.23 | 58.37 | 287.8 | 224.1 | 232.768 |
| M1 (Arm Mac) | GCC 11.3 | 4.41 | 41.30 | 533.3 | 362.5 | 340.575 |
| --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (Discover) | Intel 2021.3 | 2.67 | 28.94 | 806.3 | 526.4 | 472.667 |
| Coffee Lake (Intel Mac) | Intel 2022.1 | 3.90 | 36.06 | 633.5 | 401.1 | 347.835 |

Aggressive

| Processor | Compiler | SDYN (s) | PHYS (s) | Last NoRad Tput (d/d) | Last Rad Tput (d/d) | Final Tput (d/d) |
| --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (Discover) | GCC 11.3 | 3.02 | 33.91 | 666.4 | 482.4 | 399.860 |
| Coffee Lake (Intel Mac) | GCC 11.3 | 4.87 | 53.38 | 418.4 | 313.1 | 272.146 |
| M1 (Arm Mac) | GCC 11.3 | 3.45 | 23.81 | 935.3 | 573.4 | 461.952 |
| --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (Discover) | Intel 2021.3 | 2.48 | 21.30 | 1107.8 | 675.4 | 603.894 |
| Coffee Lake (Intel Mac) | Intel 2022.1 | 3.78 | 29.92 | 763.8 | 477.4 | 362.927 |

@mathomp4
Member Author

mathomp4 commented May 31, 2022

So, I'll fill out the table more tomorrow when I build Intel models on my Intel Mac.

It looks like our Release flags for M1 are equivalent to Aggressive flags on Intel. One odd thing is that the M1 Mac crashes with our M1 Aggressive flags...which aren't that different from Release. Here are essentially the flags:

  • Release: -O3 -march=armv8-a -mtune=generic -funroll-loops -ffpe-trap=zero,overflow
  • Aggressive: -O2 -march=native -ffast-math -ftree-vectorize -funroll-loops --param max-unroll-times=4

Everything else is pretty much the same. I wonder which one is the one M1 does not like...

@iains

iains commented May 31, 2022

Would larger final Tput be "better"?

Could you try the aggressive with -O2 -march=armv8-a -ffast-math -ftree-vectorize -funroll-loops --param max-unroll-times=4

TBH I am not sure what we would do for native at present, and we have not even started to figure out tuning for M1 (I am not sure that the relevant data are even published, so we might have to infer from what clang does).

@iains

iains commented May 31, 2022

I'd hazard that the loop unrolling heuristics might be geared to Intel? (they have the appearance of being experimentally-determined) .. presumably, there will be some similar sweet spot for M1.

@mathomp4
Member Author

mathomp4 commented Jun 1, 2022

> Would larger final Tput be "better"?

Ah. Yes. "Tput" = "throughput" (in model-days/wall-clock-day). I just abbreviated it because the table was already too wide :)
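
As a worked example of the unit: model-days per wall-clock-day is simulated time divided by elapsed time. The sketch below shows that arithmetic only. The function name and the 60-second wall time are illustrative assumptions, not values taken from the runs in the tables.

```python
SECONDS_PER_DAY = 86400.0


def throughput_days_per_day(model_seconds: float, wall_seconds: float) -> float:
    """Simulated days completed per wall-clock day of running."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    model_days = model_seconds / SECONDS_PER_DAY
    wall_days = wall_seconds / SECONDS_PER_DAY
    return model_days / wall_days


if __name__ == "__main__":
    # A 6-hour simulation finishing in a hypothetical 60 s of wall time.
    print(throughput_days_per_day(6 * 3600, 60.0))
```

Larger is better: doubling the throughput means the same simulated period takes half the wall time.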

> Could you try the aggressive with -O2 -march=armv8-a -ffast-math -ftree-vectorize -funroll-loops --param max-unroll-times=4
>
> TBH I am not sure what we would do for native at present, and we have not even started to figure out tuning for M1 (I am not sure that the relevant data are even published, so we might have to infer from what clang does).

Sadly, moving to -march=armv8-a didn't help. But I suppose the "good news" is that there are only four flags that might be causing it:

  • -ffast-math
  • -ftree-vectorize
  • -funroll-loops
  • --param max-unroll-times=4

(If -O2 is the cause, I'll eat my hat 😄 )

The GNU Aggressive flags we have are from Jerry DeLisle on the GCC list, so they were a bit above me. It will be interesting to figure out which flag(s) might be causing this. Though it is about 1 hour between each test thanks to big build!
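
With roughly an hour per build, one way to narrow down the four suspects is a leave-one-out pass: four builds, each dropping a single flag. This is only a sketch of how the flag strings could be generated. The helper name is hypothetical, and the base flags are the ones from the -march=armv8-a test above.

```python
from typing import List, Tuple

BASE_FLAGS = ["-O2", "-march=armv8-a"]
SUSPECT_FLAGS = [
    "-ffast-math",
    "-ftree-vectorize",
    "-funroll-loops",
    "--param max-unroll-times=4",
]


def leave_one_out(base: List[str], suspects: List[str]) -> List[Tuple[str, str]]:
    """Return (dropped flag, remaining flag string) for each suspect."""
    trials = []
    for i, dropped in enumerate(suspects):
        kept = suspects[:i] + suspects[i + 1:]
        trials.append((dropped, " ".join(base + kept)))
    return trials


if __name__ == "__main__":
    for dropped, flags in leave_one_out(BASE_FLAGS, SUSPECT_FLAGS):
        print(f"without {dropped}: {flags}")
```

This needs 4 builds instead of the 16 required to cover every subset. It only pinpoints the culprit if a single flag is responsible. If the crash needs two flags together, more than one of the four trials will run cleanly and a second round is needed.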

Though, honestly, our "Release" flags seem to have all the performance we might get anyway, so the easy answer might be to just code in CMake "if APPLE and ARM64, set Aggflag=Relflags".

@mathomp4
Member Author

mathomp4 commented Jun 1, 2022

I also did a set of runs at C24 (so 4x the number of columns per process) as this is a bit closer to the per-process number-of-columns we run with. And the numbers are much the same.

Good news is that GCC 11 M1 is roughly as good as Intel 2022 on Coffee Lake with Aggressive flags! Kudos to @iains and the GCC Team (and I suppose the Apple chip engineers)!


Performance Results of GEOSgcm (6 hours, C24, 1x6, No ExtData)

Release

| Processor | Compiler | SDYN (s) | PHYS (s) | Last NoRad Tput (d/d) | Last Rad Tput (d/d) | Final Tput (d/d) |
| --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (Discover) | GCC 11.3 | 38.08 | 64.33 | 244.7 | 152.8 | 184.589 |
| Coffee Lake (Intel Mac) | GCC 11.3 | 48.04 | 94.44 | 180.1 | 109.4 | 131.083 |
| M1 (Arm Mac) | GCC 11.3 | 11.44 | 51.83 | 413.3 | 235.5 | 267.983 |
| M1 (Arm Mac) | GCC 12.1 | 11.66 | 60.01 | 389.1 | 208.6 | 239.256 |
| --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (Discover) | Intel 2021.3 | 8.15 | 49.15 | 497.3 | 232.4 | 306.755 |
| Coffee Lake (Intel Mac) | Intel 2022.1 | 13.30 | 71.36 | 369.2 | 177.1 | 198.048 |

Aggressive

| Processor | Compiler | SDYN (s) | PHYS (s) | Last NoRad Tput (d/d) | Last Rad Tput (d/d) | Final Tput (d/d) |
| --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (Discover) | GCC 11.3 | 8.43 | 53.34 | 438.2 | 234.4 | 271.357 |
| Coffee Lake (Intel Mac) | GCC 11.3 | 14.54 | 89.60 | 256.7 | 150.7 | 166.561 |
| M1 (Arm Mac) | GCC 11.3 | CRASH | CRASH | CRASH | CRASH | CRASH |
| --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (Discover) | Intel 2021.3 | 7.89 | 38.66 | 641.8 | 278.4 | 362.466 |
| Coffee Lake (Intel Mac) | Intel 2022.1 | 11.38 | 55.17 | 364.1 | 184.2 | 225.260 |
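
The "roughly as good as" comparison can be checked by taking ratios of the Final Tput column. A small sketch follows, using only values copied from the C24 tables above; the dictionary layout and the `ratio` helper are made up for this example.

```python
# Final Tput (d/d) at C24, copied from the Release and Aggressive tables.
FINAL_TPUT_C24 = {
    ("M1", "GCC 11.3", "Release"): 267.983,
    ("Coffee Lake", "GCC 11.3", "Release"): 131.083,
    ("Coffee Lake", "Intel 2022.1", "Aggressive"): 225.260,
    ("Cascade Lake", "Intel 2021.3", "Aggressive"): 362.466,
}


def ratio(numerator: tuple, denominator: tuple) -> float:
    """Throughput of one configuration relative to another."""
    return FINAL_TPUT_C24[numerator] / FINAL_TPUT_C24[denominator]


if __name__ == "__main__":
    m1 = ("M1", "GCC 11.3", "Release")
    for other in FINAL_TPUT_C24:
        if other != m1:
            print(f"M1 GCC 11.3 Release vs {other}: {ratio(m1, other):.2f}x")
```

By this measure the M1 Release build is about 1.19x the Coffee Lake Intel 2022.1 Aggressive result and about 0.74x the Cascade Lake Intel 2021.3 Aggressive result. Each figure comes from the single run reported in the tables, so run-to-run variability is not captured.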

@iains

iains commented Jun 1, 2022

> > Would larger final Tput be "better"?
>
> Ah. Yes. "Tput" = "throughput" (in model-days/wall-clock-day). I just abbreviated it because the table was already too wide :)
>
> > Could you try the aggressive with -O2 -march=armv8-a -ffast-math -ftree-vectorize -funroll-loops --param max-unroll-times=4
> > TBH I am not sure what we would do for native at present, and we have not even started to figure out tuning for M1 (I am not sure that the relevant data are even published, so we might have to infer from what clang does).
>
> Sadly, moving to -march=armv8-a didn't help. But I suppose the "good news" is that there are only four flags that might be causing it:
>
>   • -ffast-math
>   • -ftree-vectorize
>   • -funroll-loops
>   • --param max-unroll-times=4

-ffast-math might be the first to try, since that does alter the rules significantly (the problem then will be to find what breaks).

> (If -O2 is the cause, I'll eat my hat 😄 )

Unless one has a chocolate hat, always a dodgy statement :)

> The GNU Aggressive flags we have are from Jerry DeLisle on the GCC list, so they were a bit above me. It will be interesting to figure out which flag(s) might be causing this. Though it is about 1 hour between each test thanks to big build!

Well, obv. that's a reliable source - was that also for 'aarch64'?
(my feeling is that this would be somewhat arch-dependent)

> Though, honestly, our "Release" flags seem to have all the performance we might get anyway, so the easy answer might be to just code in CMake "if APPLE and ARM64, set Aggflag=Relflags".

I did some simplistic benchmarking in early days (on a DTK) using a Fortran code - which suggested that the Arm chip did very well against regular (non-AVX512) chips ... but I need to dig that out and re-run on my Cascade box.

@mathomp4
Member Author

mathomp4 commented Jun 1, 2022

> > The GNU Aggressive flags we have are from Jerry DeLisle on the GCC list, so they were a bit above me. It will be interesting to figure out which flag(s) might be causing this. Though it is about 1 hour between each test thanks to big build!
>
> Well, obv. that's a reliable source - was that also for 'aarch64'? (my feeling is that this would be somewhat arch-dependent)

Oh, no. He had suggestions just for Intel. My only experience with aarch64 is with Graviton2 on AWS. And there I spent a few days figuring out this:

-march=armv8.2-a+crypto+crc+fp16+rcpc+dotprod

let things work with Release and Aggressive (and it was happy with the rest of our flags).

I think I/we are just in pioneer space which is when you start to try all the flags and see what helps! Maybe once more people start using M1 + gfortran, I can steal/borrow the flags from OpenBenchmarking Fortran tests (though they usually just do -march=native so maybe once that is shaken out, it won't matter!)

@iains

iains commented Jun 1, 2022

  • I have priorities (correctness ones) that trump performance (however interesting that is)

  • the flags I was using in that early benchmark might not be relevant (or might be)
    -Ofast -fexternal-blas -framework Accelerate (so, essentially, making use of Apple's crafted vector ops)

  • The Big Trouble with being in 'flags pioneer space' is combinatorial explosion - one needs some way to direct the effort ...

@mathomp4
Member Author

mathomp4 commented Jun 1, 2022

>   • the flags I was using in that early benchmark might not be relevant (or might be)
>     -Ofast -fexternal-blas -framework Accelerate (so, essentially, making use of Apple's crafted vector ops)

Interesting. I tried doing: -Ofast -march=armv8-a and C12 ran...and it did a LOT. Made physics much faster. Sadly, though, at C24 it still crashes. I am using the same restarts, etc. So I wonder what is happening.

Oh well, at least I learned about -Ofast. New flag for me!

I suppose my next work is to move on to GCC 12 and see if it has the same issues with our model on M1 as it does on Intel.

@iains

iains commented Jun 7, 2022

did you have a chance to try the 12.1-pre-r1 version?

edit: is the Intel issue compiler-related or something else?
(equally, we want to fix bugs there too)

@mathomp4
Member Author

mathomp4 commented Jun 7, 2022

> did you have a chance to try the 12.1-pre-r1 version?

Not yet. I'm hoping to try building it today. I was pulled away to other work, but I have some time now to build compilers, etc. I'm finalizing my "get our flags right" setup first. Hopefully then GCC 12 will be plug-and-play!

> edit: is the Intel issue compiler-related or something else? (equally, we want to fix bugs there too)

This issue is...weird. For some reason, GCC 12 does not like GEOS and I don't know why. I recently ran 10 runs of C12 with GCC 11 and GCC 12. With GCC 11, all 10 ran successfully. With GCC 12, only 2 of 10 ran. The other eight died in the same vertical remapping call. But no flags or source code changed between these two runs, just changed the compiler! And the crash is sporadic. One dies at 1000z, another at 0700z, another at 1330z... it's so random!

I'm going to try doing a 10-run set at a higher resolution and see what I see. C12 is so coarse that I might just be exposing a weird setup issue. But C24 is sort of my "gold standard" low-res case.

But maybe we're seeing some sort of memory corruption on our end that GCC 12 is just more sensitive to?

@mathomp4
Member Author

mathomp4 commented Jun 7, 2022

> I'm going to try doing a 10-run set at a higher resolution and see what I see. C12 is so coarse that I might just be exposing a weird setup issue. But C24 is sort of my "gold standard" low-res case.
>
> But maybe we're seeing some sort of memory corruption on our end that GCC 12 is just more sensitive to?

Well, same issue at C24. Not as "severe" as 70% of the runs succeeded, but the ones that failed failed on the same line:

          extm(i,k) = (a4(2,i,k)-a4(1,i,k)) * (a4(3,i,k)-a4(1,i,k)) > 0.

with:

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

That is some boring math to die on. And we've never seen it die with our debugging flags with Intel or GCC before. Weird...

@iains

iains commented Jun 7, 2022

yeah, probably what's more likely is that one of the array accesses is going wrong and then some inappropriate number is being pulled into the calculation.

edit : well unless there's some possibility of underflow in the subtractions or overflow in the product.

edit2: does this happen without the fancy loop unroll params?
(of course, O3 will do some loop unrolling anyway) - we do have scope there for something to go wrong with those 3D array accesses. Debug runs would likely suppress some optimisations, so they are not always terribly indicative ...

Is there any way to produce the assembly / .s or .i even (plus the flags used to compile) .. ?
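
To illustrate the overflow possibility raised above: the failing line multiplies two differences only to test their sign. If a bad array access pulls in a huge value, the product can overflow even though both inputs are finite, and with -ffpe-trap=zero,overflow that overflow raises SIGFPE. The sketch below shows the arithmetic in Python, which uses double precision and returns inf rather than trapping. The function names and the sign-comparison alternative are illustrative assumptions, not a proposed patch.

```python
import math


def same_sign_via_product(d1: float, d2: float) -> bool:
    """Mirror of the Fortran test: (d1 * d2) > 0."""
    return d1 * d2 > 0.0


def same_sign_via_compare(d1: float, d2: float) -> bool:
    """Same question asked without a multiplication."""
    return (d1 > 0.0 and d2 > 0.0) or (d1 < 0.0 and d2 < 0.0)


def product_overflows(d1: float, d2: float) -> bool:
    """True when two finite inputs give a non-finite product."""
    return math.isfinite(d1) and math.isfinite(d2) and math.isinf(d1 * d2)


if __name__ == "__main__":
    # Ordinary values: both tests agree and nothing overflows.
    print(same_sign_via_product(2.0, 3.0), product_overflows(2.0, 3.0))
    # Garbage-sized values: the product overflows to inf.
    print(product_overflows(1e200, 1e200))
    # Tiny values: the product underflows to 0.0, so the product test
    # says "not same sign" while the comparison test says "same sign".
    print(same_sign_via_product(1e-200, 1e-200),
          same_sign_via_compare(1e-200, 1e-200))
```

Note that underflow is not in the -ffpe-trap list used here, so of the two cases only the overflow one would trap. The single-precision overflow threshold is far lower than the 1e200 used for illustration.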

@mathomp4
Member Author

mathomp4 commented Jun 8, 2022

@iains Looks like your GCC 12.1 branch runs! Using a sample size of 1 run, it seems a bit slower than 11.3...but that could just be variability of the laptop (maybe it hit the efficiency cores a bit more?).

One interesting thing is that it looks like C12 GCC 12.1 on M1 is stable. I ran it a few times and nothing went wiggy. So I thought I should look at the default Intel GNU flags:

-O3 -march=westmere -mtune=generic -funroll-loops -g 
  -ffree-line-length-none -fno-range-check -Wno-missing-include-dirs 
  -fbacktrace -ffpe-trap=zero,overflow -fbacktrace 
  -fallow-argument-mismatch -fallow-invalid-boz -falign-commons

I mean, westmere is ancient at this point and I think everything we might run on is Broadwell/Haswell at least (and with Intel Fortran we don't support anything below that because we run with core-avx2) so I decided, heck, let's try -march=haswell. And when I do that, 🎉 , I can run 10, 12 runs at C12 and nothing crashes!

Does this tell me why GCC 12 doesn't seem to like GEOS anymore? No. But is it a valid workaround if it means not supporting a processor we don't use? I mean... 😄

@iains

iains commented Jun 8, 2022

heh. I have to confess that nothing leaps out as meaningful from those options - probably using -mtune=haswell too might be more reasonable.

I do have (and use for testing) westmere and nehalem machines - but only used for testing older OS revs.

good to read that M1 on 12.1 is better, although that is probably nothing to do with the Darwin port - we have pretty much the same code on 11.3 and 12.1 (and if I get to it in time on the upcoming 10.4).

I've no way to judge what effect we might get with the efficiency cores ...

@mathomp4
Member Author

mathomp4 commented Jun 8, 2022

> I've no way to judge what effect we might get with the efficiency cores ...

Yeah. I might need to stroll through the Open MPI lists/repos and ask/annoy the gurus there (sorry in advance @jsquyres) and see if they have info about what happens if you do mpirun -np 6 on an Apple M1 with 4 performance and 4 efficiency cores. I mean, I suppose if I'm not pinning, I just get 4/6 Firestorm and 2/6 Icestorm over the course of my runs sort of smoothing things out? (And I suppose Alder Lake will be much the same...)

Heck, from an lstopo image I made, I'm not even sure if the version of hwloc I had on hand can see the difference (though that could be from my not knowing the right options to pass hwloc...)

@mathomp4
Member Author

Final thought: With #426 we should get M1 compatibility. Huzzah!

@mathomp4
Member Author

Let's do an update of the timings with the newer GNU flags and now that I have access to an M1 Max. These are all 1-day and no extdata, history, or checkpointing to interfere.

I think the first take away is "M1 Max be nice". I mean, it's right there with Cascade Lake with Intel compilers on Aggressive optimizations. And I'm guessing the fine folks at GNU could probably figure out how to tune for the M1 given time. At the moment I'm mainly doing:

Fortran_FLAGSarm64 = -O3 -march=armv8-a -mtune=generic -funroll-loops 
  -g -ffree-line-length-none -fno-range-check -Wno-missing-include-dirs 
  -fbacktrace -Wno-unused-dummy-argument -ffpe-trap=zero,overflow 
  -fbacktrace -fallow-argument-mismatch -fallow-invalid-boz -falign-commons -fPIC -fopenmp

because "it works" but it's not tuned at all and Aggressive is pretty much just Release with the M1.

It's now time to try fiddling around. The -Ofast flag @iains informed me about above would be nice to get working, so I'll try for that first...


Performance Results of GEOSgcm (1 day, C24, 1x6, No ExtData, No history, No checkpointing)

Release

| Processor | Compiler | SDYN (s) | PHYS (s) | Last NoRad Tput (d/d) | Last Rad Tput (d/d) | Final Tput (d/d) |
| --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (Discover) | GCC 12.1 | 150.19 | 236.33 | 259.4 | 160.5 | 217.227 |
| Coffee Lake (Intel Mac) | GCC 12.1 | 180.25 | 397.95 | 185.8 | 107.0 | 144.761 |
| M1 (Arm Mac) | GCC 12.1 | 42.57 | 189.82 | 454.0 | 247.7 | 350.090 |
| M1 Max (Arm Mac) | GCC 12.1 | 23.65 | 140.27 | 662.5 | 334.8 | 498.292 |
| --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (Discover) | Intel 2022.1 | 31.39 | 183.81 | 526.0 | 239.7 | 378.333 |
| Coffee Lake (Intel Mac) | Intel 2022.1 | XXX.XX | XXX.XX | XXX.X | XXX.X | XXX.XXX |

Aggressive

| Processor | Compiler | SDYN (s) | PHYS (s) | Last NoRad Tput (d/d) | Last Rad Tput (d/d) | Final Tput (d/d) |
| --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (Discover) | GCC 12.1 | 33.07 | 197.00 | 462.3 | 246.7 | 352.724 |
| Coffee Lake (Intel Mac) | GCC 12.1 | 60.25 | 379.81 | 265.9 | 132.1 | 186.253 |
| M1 (Arm Mac) | GCC 12.1 | 42.04 | 188.70 | 466.0 | 247.4 | 347.438 |
| M1 Max (Arm Mac) | GCC 12.1 | 23.83 | 141.65 | 653.5 | 342.5 | 486.099 |
| --- | --- | --- | --- | --- | --- | --- |
| Cascade Lake (Discover) | Intel 2022.1 | 29.18 | 139.93 | 695.4 | 294.3 | 475.918 |
| Coffee Lake (Intel Mac) | Intel 2022.1 | XX.XX | XXX.XX | XXX.X | XXX.X | XXX.XXX |

@mathomp4
Member Author

Update: -Ofast still goes boom with 12.1. I figured it would, but you never know!
