Segfault on v3.20.2 and Ryzen 5 5500U #379

Closed
slightlyskepticalpotat opened this issue Aug 27, 2022 · 83 comments

Comments

@slightlyskepticalpotat

I compiled the latest version of cpuminer-opt on Ubuntu 22.04 x86_64 with GCC 11.2.0, trying each of the following sets of compiler flags:
-march=native -Wall
-O3 -march=znver2 -mvaes -Wall
-O3 --march=znver2 -Wall
-O3 --march=znver1 -Wall
-O3 --march=znver3 -Wall
All of them gave the following output when run:

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 12:13:13] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 12:13:13] Throughput 8/thr, Buffer 256 kiB/thr, Total 3072 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 12:13:13] CPU affinity [!!!!!!!!!!!!]
Segmentation fault (core dumped)

Changing the thread count didn't help. I was trying to solo mine dogecoin as an experiment with --algo=scrypt. I later tried the same setup on a Ryzen 5 3500U, and everything worked.

@JayDDee
Owner

JayDDee commented Aug 27, 2022

I'll need to know where in the code it's crashing. Please add --debug, and if you are familiar with gdb a backtrace would be helpful.
I'm concerned that it crashes on a more capable CPU. This is not typical of a SW issue or an incompatible build.
It's also crashing very early, so it might not even be in the hash code yet. Try to reproduce with different algos. Scryptn2 should be included; it shares much code with the smaller scrypt but has a different memory profile. This will help identify if it's an algo issue.

All testing should be done using the default build, and please provide some more details about the faulting system, like the amount of RAM and any differences from the working system.

Edit: also, since the issue is not thread related, testing would be better with only one miner thread.

@slightlyskepticalpotat
Author

slightlyskepticalpotat commented Aug 27, 2022

Alright, I think I'm going to exclusively use build.sh from now on. The faulting and working systems should have near-identical software (clean installs of 22.04), but the faulting system has 16gb of ram and the working system has 8. Secure boot is also enabled on the working system, but I don't think that's relevant.

Output with --debug:

$ ./cpuminer --algo=scrypt --url=http://127.0.0.1:44555 --user=user --pass=pass --coinbase-addr=[address] --debug --threads=1

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 13:15:15] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 13:15:15] Throughput 8/thr, Buffer 256 kiB/thr, Total 256 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 13:15:15] Coinbase address uses B58 coding
[2022-08-27 13:15:15] CPU affinity [!!!!!!!!!!!!]
[2022-08-27 13:15:15] 1 of 12 miner threads started using 'scrypt' algorithm
[2022-08-27 13:15:15] Default miner thread priority 0 (nice 19)
[2022-08-27 13:15:15] Binding thread 0 to cpu 0
Segmentation fault (core dumped)

gdb output of the same:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 13:19:21] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 13:19:21] Throughput 8/thr, Buffer 256 kiB/thr, Total 256 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 13:19:21] Coinbase address uses B58 coding
[2022-08-27 13:19:21] CPU affinity [!!!!!!!!!!!!]
[New Thread 0x7ffff6870600 (LWP 50901)]
[New Thread 0x7ffff606f600 (LWP 50902)]
[2022-08-27 13:19:21] Default miner thread priority 0 (nice 19)
[2022-08-27 13:19:21] Binding thread 0 to cpu 0
[2022-08-27 13:19:21] 1 of 12 miner threads started using 'scrypt' algorithm

Thread 2 "cpuminer" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff6870600 (LWP 50901)]
0x000055555555f231 in ?? ()
(gdb) bt 10
#0  0x000055555555f231 in ?? ()
#1  0x0000555555564a0e in ?? ()
#2  0x00007ffff7565b43 in start_thread (arg=<optimized out>)
    at ./nptl/pthread_create.c:442
#3  0x00007ffff75f7a00 in clone3 ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

I don't think this is an algo issue—allium, x11, neoscrypt, scryptn2, and any other algorithm I try gives the same output.

@slightlyskepticalpotat
Author

Just remembered benchmark mode existed and tested with it, doesn't seem to be an algo issue:

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 14:00:12] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 14:00:12] Throughput 8/thr, Buffer 256 kiB/thr, Total 3072 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 14:00:12] CPU affinity [!!!!!!!!!!!!]
[2022-08-27 14:00:12] 12 of 12 miner threads started using 'scrypt' algorithm
[2022-08-27 14:00:16] Total: 48.95 kH/s, Temp: 39C, Freq: 3.650/3.701 GHz
[2022-08-27 14:00:22] Total: 88.88 kH/s, Temp: 39C, Freq: 3.381/3.451 GHz

@JayDDee
Owner

JayDDee commented Aug 27, 2022

You're solo mining. I don't think it's the issue if it works on the 3500U but it gives another opportunity to narrow the crash location.

The only thing after the last message is calling thread_init, which does nothing for most algos, then it enters the loop and starts looking for work to hash. The next expected log is a new block report from GBT. Stratum generates a different report, and benchmark doesn't look for work, it just makes up its own.

If you could test with stratum & benchmark the code would take different paths looking for work and might change the symptoms. Beyond that some additional debug messages can be added to zoom in on the exact code that's causing the crash.

However, from a higher level, the fact that it works on the other system indicates a problem specific to the one system.
Possibly a corrupt miner or even the OS. I suggest downloading a fresh copy of cpuminer, or using the copy from the working system. Reinstalling the OS is another option.

Let me know if you're comfortable enough with code to add more debug messages with some coaching.

Edit: adding -P will produce protocol logs and may tell us if it's even trying to connect to the server.

Edit: I was starting to think it's an issue with solo mining. It's not well tested or maintained. You could try Tpruvot cpuminer-multi and/or Pooler cpuminer to see if either of them works. But that theory is shot down by the fact that cpuminer-opt works on another system.

@slightlyskepticalpotat
Author

I think I'm going to try stratum as a starting point since that's better maintained. I think I could stumble through adding some debug messages to the code if you point me in the right direction, but would prefer not to start with that. As for the miner and the os, I've tried downloading cpuminer-opt several times (even the version before the latest version), and they showed similar issues. Going to also try it on a new Live USB to see if it's the OS.

Adding -P produces this. It looks normal to me up to the segfault, but hopefully you can make more sense of this than I can.

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 14:57:36] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 14:57:36] Throughput 8/thr, Buffer 256 kiB/thr, Total 256 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 14:57:36] Coinbase address uses B58 coding
[2022-08-27 14:57:36] CPU affinity [!!!!!!!!!!!!]
[2022-08-27 14:57:36] 1 of 12 miner threads started using 'scrypt' algorithm
[2022-08-27 14:57:36] Default miner thread priority 0 (nice 19)
[2022-08-27 14:57:36] Binding thread 0 to cpu 0
[2022-08-27 14:57:36] JSON protocol request:
{"method": "getblocktemplate", "params": [{"capabilities": ["coinbasetxn", "coinbasevalue", "longpoll", "workid"], "rules": ["segwit"]}], "id":0}


*   Trying 127.0.0.1:44555...
* Connected to 127.0.0.1 (127.0.0.1) port 44555 (#0)
* Server auth using Basic with user 'user'
> POST / HTTP/1.1
Host: 127.0.0.1:44555
Authorization: Basic dXNlcjpwYXNz
Accept: */*
Accept-Encoding: deflate, gzip, br, zstd
Transfer-Encoding: chunked
Content-Type: application/json
Content-Length: 147
User-Agent: cpuminer-opt/3.20.2
X-Mining-Extensions: longpoll reject-reason
Expect: 100-continue

* Mark bundle as not supporting multiuse
< HTTP/1.1 100 Continue
* Signaling end of chunked upload via terminating chunk.
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: application/json
< Date: Sat, 27 Aug 2022 18:57:36 GMT
< Content-Length: 635
< 
* Connection #0 to host 127.0.0.1 left intact
[2022-08-27 14:57:36] JSON protocol response:
{
   "result": {
      "capabilities": [
         "proposal"
      ],
      "version": 6422532,
      "rules": [],
      "vbavailable": {},
      "vbrequired": 0,
      "previousblockhash": "bbe9e18b65c42a0a4e7773fdb2ce7af303d42e86d8b79a4b4180c9ae47266372",
      "transactions": [],
      "coinbaseaux": {
         "flags": ""
      },
      "coinbasevalue": 1000000000000,
      "longpollid": "bbe9e18b65c42a0a4e7773fdb2ce7af303d42e86d8b79a4b4180c9ae472663722",
      "target": "00000fffff000000000000000000000000000000000000000000000000000000",
      "mintime": 1661625752,
      "mutable": [
         "time",
         "transactions",
         "prevblock"
      ],
      "noncerange": "00000000ffffffff",
      "sigoplimit": 20000,
      "sizelimit": 1000000,
      "curtime": 1661626656,
      "bits": "1e0fffff",
      "height": 4013362
   },
   "error": null,
   "id": 0
}
Segmentation fault (core dumped)

Thanks for all the help so far!

@JayDDee
Owner

JayDDee commented Aug 27, 2022

From the protocol logs I can tell that the server sent work and the miner crashed trying to decode it. I have no idea why that would happen on one system but not another. It's also crashing in GBT code so your stratum test might produce different results.

The focus for the GBT crash is on cpu-miner.c:get_upstream_work. That function sends the getblocktemplate request and processes the result by calling gbt_work_decode, then produces the new block log. This is the window where it crashes, and the place to put some debug printf calls as checkpoints to help narrow it down further.

I'll wait for the stratum test results. If it's reproducible using stratum it will make troubleshooting easier.
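
The checkpoint idea above could look roughly like the sketch below. This is an illustration only, not code from cpuminer-opt: the `CHECKPOINT` macro and the `decode_step` function are hypothetical names. The macro writes to stderr and flushes immediately, so the last message should still appear if the process segfaults right after it. One general caveat is that adding output calls can change how the compiler optimizes the surrounding code.

```c
#include <stdio.h>

/* Hypothetical helper, not part of cpuminer-opt: print the file, line and
 * a label to stderr, then flush so the message is not lost if the
 * process crashes immediately afterwards. */
#define CHECKPOINT( label ) \
   do { \
      fprintf( stderr, "CHECKPOINT %s:%d %s\n", __FILE__, __LINE__, (label) ); \
      fflush( stderr ); \
   } while ( 0 )

/* Stand-in for the function under investigation (hypothetical). */
static int decode_step( int input )
{
   CHECKPOINT( "before decode" );
   int result = input * 2;
   CHECKPOINT( "after decode" );
   return result;
}
```

Calling `decode_step( 21 )` prints two checkpoint lines and returns 42. If only the first line appeared before a crash, the fault would lie between the two checkpoints.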

@slightlyskepticalpotat
Author

Stratum seems to work, I let it run for a while and it was stable:

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 19:12:52] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 19:12:52] Throughput 8/thr, Buffer 256 kiB/thr, Total 3072 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 19:12:52] CPU affinity [!!!!!!!!!!!!]
[2022-08-27 19:12:52] Creating stratum thread
[2022-08-27 19:12:52] Stratum connect stratum+tcp://stratum.aikapool.com:7915
[2022-08-27 19:12:52] Threads restarted for new work.
[2022-08-27 19:12:52] Default miner thread priority 0 (nice 19)
[2022-08-27 19:12:52] Binding thread 0 to cpu 0
[2022-08-27 19:12:52] Thread 0 waiting for first job
[2022-08-27 19:12:52] Binding thread 1 to cpu 1
[2022-08-27 19:12:52] Thread 1 waiting for first job
[2022-08-27 19:12:52] Binding thread 2 to cpu 2
[2022-08-27 19:12:52] Thread 2 waiting for first job
[2022-08-27 19:12:52] Binding thread 3 to cpu 3
[2022-08-27 19:12:52] Thread 3 waiting for first job
[2022-08-27 19:12:52] Binding thread 4 to cpu 4
[2022-08-27 19:12:52] Thread 4 waiting for first job
[2022-08-27 19:12:52] Binding thread 5 to cpu 5
[2022-08-27 19:12:52] Thread 5 waiting for first job
[2022-08-27 19:12:52] Binding thread 6 to cpu 6
[2022-08-27 19:12:52] Thread 6 waiting for first job
[2022-08-27 19:12:52] Binding thread 7 to cpu 7
[2022-08-27 19:12:52] Thread 7 waiting for first job
[2022-08-27 19:12:52] Binding thread 8 to cpu 8
[2022-08-27 19:12:52] Binding thread 9 to cpu 9
[2022-08-27 19:12:52] Thread 9 waiting for first job
[2022-08-27 19:12:52] Thread 8 waiting for first job
[2022-08-27 19:12:52] Binding thread 10 to cpu 10
[2022-08-27 19:12:52] Thread 10 waiting for first job
[2022-08-27 19:12:52] 12 of 12 miner threads started using 'scrypt' algorithm
[2022-08-27 19:12:52] Binding thread 11 to cpu 11
[2022-08-27 19:12:52] Thread 11 waiting for first job
*   Trying 84.234.52.190:7915...
* Connected to stratum.aikapool.com (84.234.52.190) port 7915 (#0)
* Connection #0 to host stratum.aikapool.com left intact
[2022-08-27 19:12:52] > {"id": 1, "method": "mining.subscribe", "params": ["cpuminer-opt/3.20.2"]}
[2022-08-27 19:12:52] < {"id":1,"result":[[["mining.set_difficulty","deadbeefcafebabe747c130000000000"],["mining.notify","deadbeefcafebabe747c130000000000"]],"780195aa",4],"error":null}
[2022-08-27 19:12:52] Stratum session id: deadbeefcafebabe747c130000000000
[2022-08-27 19:12:52] Stratum extranonce1 0x780195aa, extranonce2 size 4
[2022-08-27 19:12:52] > {"id": 2, "method": "mining.authorize", "params": ["user", "pass"]}
[2022-08-27 19:12:53] < {"id":null,"method":"mining.set_difficulty","params":[16384]}
[2022-08-27 19:12:53] < {"id":null,"method":"mining.notify","params":["5187","3838d1c26496b014b8928cb8f6d2e881fe7cd962067f377ab3496310b0c37b0f","01000000010000000000000000000000000000000000000000000000000000000000000000ffffffff20032ea04204eba40a6308","0d2f6e6f64655374726174756d2f00000000010010a5d4e80000001976a914f6c7f1c2cd06849dd836bb2f40244741dbc0c4fd88ac00000000",[],"00620004","1a03a131","630aa4eb",true]}
[2022-08-27 19:12:53] < {"id":2,"result":true,"error":null}
[2022-08-27 19:12:53] > {"id": 3, "method": "mining.extranonce.subscribe", "params": []}
[2022-08-27 19:12:53] Thread 0 waiting for first job
[2022-08-27 19:12:53] Thread 1 waiting for first job
[2022-08-27 19:12:53] Thread 2 waiting for first job
[2022-08-27 19:12:53] Thread 3 waiting for first job
[2022-08-27 19:12:53] Thread 4 waiting for first job
[2022-08-27 19:12:53] Thread 5 waiting for first job
[2022-08-27 19:12:53] Thread 6 waiting for first job
[2022-08-27 19:12:53] Thread 7 waiting for first job
[2022-08-27 19:12:53] Thread 9 waiting for first job
[2022-08-27 19:12:53] Thread 8 waiting for first job
[2022-08-27 19:12:53] Thread 10 waiting for first job
[2022-08-27 19:12:53] Thread 11 waiting for first job
[2022-08-27 19:12:54] Thread 0 waiting for first job
[2022-08-27 19:12:54] Thread 1 waiting for first job
[2022-08-27 19:12:54] Thread 2 waiting for first job
[2022-08-27 19:12:54] Thread 3 waiting for first job
[2022-08-27 19:12:54] Thread 4 waiting for first job
[2022-08-27 19:12:54] Thread 5 waiting for first job
[2022-08-27 19:12:54] Thread 6 waiting for first job
[2022-08-27 19:12:54] Thread 7 waiting for first job
[2022-08-27 19:12:54] Thread 9 waiting for first job
[2022-08-27 19:12:54] Thread 8 waiting for first job
[2022-08-27 19:12:54] Thread 10 waiting for first job
[2022-08-27 19:12:54] Thread 11 waiting for first job
[2022-08-27 19:12:55] Thread 0 waiting for first job
[2022-08-27 19:12:55] Thread 1 waiting for first job
[2022-08-27 19:12:55] Thread 2 waiting for first job
[2022-08-27 19:12:55] Thread 3 waiting for first job
[2022-08-27 19:12:55] Thread 4 waiting for first job
[2022-08-27 19:12:55] Thread 5 waiting for first job
[2022-08-27 19:12:55] Thread 6 waiting for first job
[2022-08-27 19:12:55] Thread 7 waiting for first job
[2022-08-27 19:12:55] Thread 10 waiting for first job
[2022-08-27 19:12:55] Thread 8 waiting for first job
[2022-08-27 19:12:55] Thread 9 waiting for first job
[2022-08-27 19:12:55] Thread 11 waiting for first job
[2022-08-27 19:12:56] Extranonce disabled, subscribe timed out
[2022-08-27 19:12:56] Stratum connection established
[2022-08-27 19:12:56] Threads restarted for new work.
[2022-08-27 19:12:56] New Stratum Diff 16384, Block 4366382, Job 5187
                      Diff: Net 4.6222e+06, Stratum 16384, Target 0.25
[2022-08-27 19:13:04] < {"id":null,"method":"mining.notify","params":["5188","a5bea714b19a2490f7aacde03277812a3c74193a6bba2d588478374e4812284c","01000000010000000000000000000000000000000000000000000000000000000000000000ffffffff20032fa0420400a50a6308","0d2f6e6f64655374726174756d2f00000000012780cffde80000001976a914f6c7f1c2cd06849dd836bb2f40244741dbc0c4fd88ac00000000",["43ff4bbcc7526c375f6f22b7a816b6b2cbc699f7afd87b154a137e04ec37c5c2","cfefdbffb5c18a48a57ac2de56c2ddd6e53e98165afafd10f219df7956053cdf","41c53c48499cdc1dccc7a2f19ef6fc7fd87775bf63619982951ff665d373eba7"],"00620004","1a034445","630aa500",true]}
[2022-08-27 19:13:05] CPU temp: curr 41 C max 0, Freq: 3.211/3.272 GHz
[2022-08-27 19:13:05] Threads restarted for new work.
[2022-08-27 19:13:05] New Block 4366383, Net diff 5.1358e+06, Job 5188
                      Diff: Net 5.1358e+06, Stratum 16384, Target 0.25
                      TTF @ 72.32 kh/s: Block 9671y181d, Share 4h07m
                      Net hash rate (est) 1838.17 Th/s
[2022-08-27 19:13:39] < {"id":null,"method":"mining.notify","params":["5189","752dc8a3330703e8f89a125bb58aac4e3113e0467b0c9ba0ac41eff721f1e42a","01000000010000000000000000000000000000000000000000000000000000000000000000ffffffff200330a0420423a50a6308","0d2f6e6f64655374726174756d2f0000000001386934dce80000001976a914f6c7f1c2cd06849dd836bb2f40244741dbc0c4fd88ac00000000",["57f6b91a72bfa999cbfc5bf9334555bb18a7b339cdfa54f65baa35f41825507f","f9e883f0f65cb06725c521566f470ca44e763c389691e584cda4db0c0e30a58f"],"00620004","1a02fe94","630aa523",true]}
[2022-08-27 19:13:39] Threads restarted for new work.
[2022-08-27 19:13:39] New Block 4366384, Net diff 5.6027e+06, Job 5189
                      Diff: Net 5.6027e+06, Stratum 16384, Target 0.25
                      TTF @ 83.53 kh/s: Block 9134y9d, Share 3h34m
                      Net hash rate (est) 1046.23 Th/s

Going to try to narrow down the point where it crashes now.

@JayDDee
Owner

JayDDee commented Aug 28, 2022

I'm starting to suspect an issue with the wallet, do both systems have their own wallets? If they're on the same network try mining on the other's wallet.

@slightlyskepticalpotat
Author

They were originally on different wallets. I tried mining with the 3500u wallet, 5500u wallet, and a newly created wallet on both systems, but the 3500u system always worked and the 5500u system always gave a segfault. Now trying to narrow down the point of the crash.

@slightlyskepticalpotat
Author

As you mentioned, I was able to confirm that it first crashes here. Going further into the code, it crashes here. I was able to track it to this for loop, where it looked like the program looped through it a few times, then crashed.

This is where the mystery deepens.

I changed the for loop (with no other changes to the code) to:

for ( i = 0; i < ARRAY_SIZE( work->target ); i++ )
{
    applog( LOG_INFO, "working");
    work->target[7 - i] = be32dec( target + i );
}

And it began solving blocks. The hashrate seems to match what I was seeing in benchmarks, and the miner was indistinguishable from the working system apart from the junk output. Could there be some sort of race condition here?

$ ./cpuminer --algo=scrypt --url=http://127.0.0.1:44555 --user=user --pass=pass --coinbase-addr=nfPAPyGGjsuyqRyxFfCmnA4C9cH5smSi6g
         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 21:30:40] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 21:30:40] Throughput 8/thr, Buffer 256 kiB/thr, Total 3072 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 21:30:40] CPU affinity [!!!!!!!!!!!!]
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] working
[2022-08-27 21:30:40] 12 of 12 miner threads started using 'scrypt' algorithm
[2022-08-27 21:30:40] CPU temp: curr 42 C max 0, Freq: 0.997/1.812 GHz
[2022-08-27 21:30:40] scrypt: http://127.0.0.1:44555
                      Periodic Report     584942417355y130d        0m00s
                      Share rate        -0.00/min     0.00/min
                      Hash rate         -0.00h/s      0.00h/s   (0.00h/s)
                      Submitted             0            0
                      Accepted              0            0        0.0%
                      Hi/Lo Share Diff  0 /  9e+99
[2022-08-27 21:30:40] New Block 4013576, Net Diff 0.00024414, Ntime 40c50a63
                      Miner TTF @ 240.00 h/s 1h12m, Net TTF @ 9922.63 h/s 1m45s
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:47] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:52] working
[2022-08-27 21:30:56] 1 Submitted Diff 0.00039597, Block 4013576, Ntime 4cc50a63
[2022-08-27 21:30:56] 1 A1 S0 R0 BLOCK SOLVED 1, 15.620 sec (1ms)
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] working
[2022-08-27 21:30:56] New Block 4013577, Net Diff 0.00026158, Ntime 50c50a63
                      Miner TTF @ 83.73 kh/s 0m13s, Net TTF @ 9970.20 h/s 1m52s
[2022-08-27 21:30:58] 2 Submitted Diff 0.00047437, Block 4013577, Ntime 50c50a63
[2022-08-27 21:30:58] 2 A2 S0 R0 BLOCK SOLVED 2, 2.305 sec (1ms)

@JayDDee
Owner

JayDDee commented Aug 28, 2022

Holy shit, good work. Can you get the loop counter and ARRAY_SIZE?

Edit: It's silly code, ARRAY_SIZE controls the loop but hard coded 7 is used inside, but they should match. The target is just the 256 bit hash expressed as a uint32 array. I don't like that ARRAY_SIZE macro, might as well hard code it to 8 since the array's size is assumed inside the loop anyway.

Edit2: I realize the stupidity of my first question. Capturing the loop counter makes the problem go away so it will always be 8.
Maybe ARRAY_SIZE can be captured before entering the loop without changing the behaviour.

I suspect the compiler is building that section of code differently when you add the printf. The loop is more likely to be
unrolled, or even vectorized, without the printf. Try compiling with lower optimization to see if that makes a difference.

I'm not sure we'll get to the root cause, but getting rid of the ARRAY_SIZE macro might be a good start. I'm not a C expert so I'm not sure if its implementation is correct. On the surface I don't see a problem with it.
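
For reference, an element-count macro of this kind is commonly written as in the sketch below. The definition shown is the usual idiom and is assumed here; the exact macro in cpuminer-opt may differ. `struct work_sketch` and `target_words` are hypothetical stand-ins for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Common idiom, assumed here: total size of the array divided by the
 * size of one element. */
#define ARRAY_SIZE( arr ) ( sizeof( arr ) / sizeof( (arr)[0] ) )

/* Simplified, hypothetical stand-in for the real struct work. */
struct work_sketch
{
   uint32_t target[8];   /* 256-bit target as eight 32-bit words */
};

/* The argument is a true array member, so the macro yields 8.
 * If the argument were a pointer instead, the macro would divide the
 * pointer size by the element size and give the wrong count. */
static size_t target_words( const struct work_sketch *w )
{
   return ARRAY_SIZE( w->target );
}
```

Because the result is computed from `sizeof`, it is a compile-time constant for a fixed-size array, so under this definition the loop bound cannot vary at run time.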

@slightlyskepticalpotat
Author

It's a bit of a challenge as placing a printf or a file write there also seems to fix it. The code I am using is

   for ( i = 0; i < ARRAY_SIZE( work->target ); i++ )
   {
      // applog( LOG_INFO, "working");
      printf ("%d %d\n", i, ARRAY_SIZE( work->target ));
      work->target[7 - i] = be32dec( target + i );
   }
   fflush(stdout);

It generates output like this:

         **********  cpuminer-opt 3.20.2  *********** 
     A CPU miner with multi algo support and optimized for CPUs
     with AVX512, SHA and VAES extensions by JayDDee.
     BTC donation address: 12tdvfF7KmAsihBXQXynT6E6th2c2pByTT

[2022-08-27 21:47:32] Scrypt paramaters: N= 1024, R= 1
[2022-08-27 21:47:32] Throughput 8/thr, Buffer 256 kiB/thr, Total 3072 kiB

CPU: AMD Ryzen 5 5500U with Radeon Graphics         
SW built on Aug 27 2022 with GCC 11.2.0
CPU features:  AVX2    AES SHA
SW features:   AVX2    AES SHA
Algo features: AVX512

Starting miner with AVX2...

[2022-08-27 21:47:32] CPU affinity [!!!!!!!!!!!!]
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
[2022-08-27 21:47:32] 12 of 12 miner threads started using 'scrypt' algorithm
[2022-08-27 21:47:32] CPU temp: curr 43 C max 0, Freq: 1.032/2.298 GHz
[2022-08-27 21:47:32] scrypt: http://127.0.0.1:44555
                      Periodic Report     584942417355y130d        0m00s
                      Share rate        -0.00/min     0.00/min
                      Hash rate         -0.00h/s      0.00h/s   (0.00h/s)
                      Submitted             0            0
                      Accepted              0            0        0.0%
                      Hi/Lo Share Diff  0 /  9e+99
[2022-08-27 21:47:32] New Block 4013605, Net Diff 0.00071352, Ntime 34c90a63
                      Miner TTF @ 240.00 h/s 3h32m, Net TTF @ 13.12 kh/s 3m53s
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
[2022-08-27 21:48:40] 1 Submitted Diff 0.00083736, Block 4013605, Ntime 75c90a63
[2022-08-27 21:48:40] 1 A1 S0 R0 BLOCK SOLVED 1, 67.927 sec (2ms)
0 8
1 8
2 8
3 8
4 8
5 8
6 8
7 8
[2022-08-27 21:48:40] New Block 4013606, Net Diff 0.00065864, Ntime 78c90a63
                      Miner TTF @ 80.72 kh/s 0m35s, Net TTF @ 13.44 kh/s 3m30s

Any ideas on how I could output i while not writing to stdout or a file? Also, what does be32dec do? I think the problem may be inside there.

@JayDDee
Owner

JayDDee commented Aug 28, 2022

We're getting some crosstalk, I'm getting caught up with what you just wrote, I made some further comments above.

Edit: as I suspected might happen. Compiler optimization is playing a factor, but hard coding the array's size might solve the problem (meaning the crash).

Edit: be32dec is a byte swap function used to convert from Little Endian to Big Endian. It's written to be agnostic, it will return
big endian data regardless of the current byte order. Intel (I mean x86 including Ryzen of course, duh) CPUs are Little Endian so it always does a byte swap.
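
A byte-order-agnostic decode of this kind can be sketched as follows. This is an assumption about the general technique, not the cpuminer-opt source: `be32dec_sketch` and `decode_target` are hypothetical names. Reading the value byte by byte gives the same result on any host byte order, and on a little endian CPU it amounts to a byte swap.

```c
#include <stdint.h>

/* Hypothetical portable sketch: interpret 4 bytes in memory as a
 * big endian 32-bit value, whatever the host byte order is. */
static inline uint32_t be32dec_sketch( const void *pp )
{
   const uint8_t *p = (const uint8_t *)pp;
   return ( (uint32_t)p[0] << 24 ) | ( (uint32_t)p[1] << 16 )
        | ( (uint32_t)p[2] <<  8 ) |   (uint32_t)p[3];
}

/* Same shape as the loop discussed in the thread: decode each word
 * and store the words in reverse order. */
static void decode_target( uint32_t dst[8], const uint32_t src[8] )
{
   for ( int i = 0; i < 8; i++ )
      dst[7 - i] = be32dec_sketch( src + i );
}
```

For example, the bytes `12 34 56 78` decode to `0x12345678`.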

@slightlyskepticalpotat
Author

I'm even less of a C expert, but I gave it a shot. I started every try with a fresh clone of the repo. Just curious, how did you guess that compiler optimisation was playing a factor in this? Past compiler horror stories?
-O0 -march=native -Wall: errors out during compilation
-O1 -march=native -Wall: works normally, appears slightly slower
-O2 -march=native -Wall: works normally, appears slightly faster
-O3 -march=native -Wall: segfaults
-Os -march=native -Wall: also errors during compilation

@JayDDee
Owner

JayDDee commented Aug 28, 2022

I'm not sure at what level loop unrolling occurs but vectorization is possible on a fixed sized loop and needs -O3. The entire array can be byte swapped in one shot using AVX2. With a printf in the loop vectorization isn't possible but loop unrolling still is.

@slightlyskepticalpotat
Author

It's definitely vectorization. -O3 -fno-tree-vectorize -march=native -Wall builds and works properly.
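
To narrow this down further without changing flags for the whole build, GCC also allows the vectorizer to be turned off for a single function. The sketch below assumes GCC's `optimize` function attribute, which GCC's documentation describes as intended for debugging rather than production use. `NO_VECTORIZE` and `swap_and_reverse` are hypothetical names, and the loop is a standalone stand-in rather than the miner's code.

```c
#include <stdint.h>

#if defined( __GNUC__ ) && !defined( __clang__ )
/* GCC only: equivalent to -fno-tree-vectorize for this one function. */
#define NO_VECTORIZE __attribute__(( optimize( "no-tree-vectorize" ) ))
#else
/* Other compilers: the macro expands to nothing. */
#define NO_VECTORIZE
#endif

/* Byte swap one 32-bit word using portable shifts. */
static inline uint32_t swap32( uint32_t x )
{
   return ( x >> 24 ) | ( ( x >> 8 ) & 0x0000ff00u )
        | ( ( x << 8 ) & 0x00ff0000u ) | ( x << 24 );
}

/* Stand-in for the loop in question: byte swap each word and
 * reverse the word order, with vectorization disabled on GCC. */
NO_VECTORIZE
static void swap_and_reverse( uint32_t dst[8], const uint32_t src[8] )
{
   for ( int i = 0; i < 8; i++ )
      dst[7 - i] = swap32( src[i] );
}
```

If a build with only the suspect function marked this way stopped crashing at -O3, that would point at the vectorized form of that specific loop.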

@JayDDee
Owner

JayDDee commented Aug 28, 2022

The only possibilities are wrong loop size, bad target pointer, or data is misaligned for AVX2.

I've seen misaligned data before in hand coded vector instructions, but here the compiler is deciding to vectorize so it should check alignment before doing so. Also the data is defined with 64 bit alignment, which is more than enough for AVX2.
I'm dismissing this as a possibility.

Array size error seems more likely, especially if it looped a couple of times before crashing. That's a classic buffer overflow.
A bad pointer would be expected to segfault on the first loop iteration.

Capturing ARRAY_SIZE( work->target ) is critical. Displaying it before the for loop should still allow the loop to be vectorized and crash. Or just hard code the loop to 8 and see if the crash goes away.

I think I'll get rid of the macro. It's used mostly for target and hash, whose size is fixed. Using the macro is unnecessary.

Edit: I think I've found part of the problem, misalignment is a possibility for the source target, I was only thinking of the destination work->target. This still involves a compiler bug because it should have detected the misalignment before vectorizing.

Here's a look at the definitions with alignment added where necessary:

static bool gbt_work_decode( const json_t *val, struct work *work )
{
   int i, n;
   uint32_t version, curtime, bits;
   uint32_t prevhash[8] __attribute__ ((aligned (32)));
   uint32_t target[8] __attribute__ ((aligned (32)));
   unsigned char final_sapling_hash[32] __attribute__ ((aligned (32)));
   int cbtx_size;
   uchar *cbtx = NULL;
   int tx_count, tx_size;
   uchar txc_vi[9];
   uchar(*merkle_tree)[32] = NULL;
   bool coinbase_append = false;
   bool submit_coinbase = false;
   bool version_force = false;
   bool version_reduce = false;
   json_t *tmp, *txa;
   bool rc = false;

@slightlyskepticalpotat
Author

   printf("%d\n", ARRAY_SIZE( work->target ));
   fflush(stdout);
   for ( i = 0; i < ARRAY_SIZE( work->target ); i++ )
   {
      work->target[7 - i] = be32dec( target + i );
   }

Gives array size as 8 before the loop starts. Additionally, if I hardcode i < 8 it still segfaults.

especially if it looped a couple of time before crashing

Unfortunately, I later realised that I wasn't sure if it looped before crashing. I originally had it print the iteration count at the end of each iteration and saw it increase, but that was before I realised it would fix the issue. Hence, I'm not sure now if the loop completes any iterations before crashing.

@JayDDee
Owner

JayDDee commented Aug 28, 2022

Stay tuned I think I've found it!!!

@JayDDee
Owner

JayDDee commented Aug 28, 2022

Damn, code formatting never works for me.

I think I've found part of the problem, misalignment is a possibility for the source target, I was only thinking of the destination work->target. This still involves a compiler bug because it should have detected the misalignment before vectorizing.

Here's a look at the definitions with alignment added where necessary:

static bool gbt_work_decode( const json_t *val, struct work *work )
{
int i, n;
uint32_t version, curtime, bits;
uint32_t prevhash[8] attribute ((aligned (32)));
uint32_t target[8] attribute ((aligned (32)));
unsigned char final_sapling_hash[32] attribute ((aligned (32)));
int cbtx_size;
uchar *cbtx = NULL;
int tx_count, tx_size;
uchar txc_vi[9];
uchar(*merkle_tree)[32] = NULL;
bool coinbase_append = false;
bool submit_coinbase = false;
bool version_force = false;
bool version_reduce = false;
json_t *tmp, *txa;
bool rc = false;

I don't know why attribute was in bold but it helps identify the three lines that need to be changed.
That int i being the first local variable guarantees that the following arrays are misaligned. Always define arrays first.
I need to do a code review to look for other similar situations.

@slightlyskepticalpotat
Author

I'm probably going to go to sleep soon, but do let me know if you need any help testing! I don't understand vectorization enough to guess—do you have a guess as to why this problem only shows up on some systems?

@JayDDee
Owner

JayDDee commented Aug 28, 2022

Can you do a quick test with alignment added? You can reproduce the crash and I can't. I'm pretty confident now but confirmation would be nice.

Agree on the sleep, we must be in the same time zone. If this works I'll sleep well tonight.

@slightlyskepticalpotat
Author

slightlyskepticalpotat commented Aug 28, 2022

static bool gbt_work_decode( const json_t *val, struct work *work )
{
int i, n;
uint32_t version, curtime, bits;
uint32_t prevhash[8] attribute ((aligned (32)));
uint32_t target[8] attribute ((aligned (32)));
unsigned char final_sapling_hash[32] attribute ((aligned (32)));
int cbtx_size;
uchar *cbtx = NULL;
int tx_count, tx_size;
uchar txc_vi[9];
uchar(*merkle_tree)[32] = NULL;
bool coinbase_append = false;
bool submit_coinbase = false;
bool version_force = false;
bool version_reduce = false;
json_t *tmp, *txa;
bool rc = false;

Are you able to put this up on pastebin so I can download and test it? I think GitHub may have removed some of the underscores.

Edit: nevermind, I got it.

@JayDDee
Owner

JayDDee commented Aug 28, 2022

You're right. Two leading and two trailing underscores in attribute. There are many examples in the code if you grep -r attribute.

@slightlyskepticalpotat
Author

slightlyskepticalpotat commented Aug 28, 2022

Oops. I did it with this and it segfaulted again.

static bool gbt_work_decode( const json_t *val, struct work *work )
{
   int i, n;
   uint32_t version, curtime, bits;
   uint32_t prevhash[8] __attribute__(( aligned(32)));
   uint32_t target[8] __attribute__(( aligned(32)));
   unsigned char final_sapling_hash[32] __attribute__(( aligned(32)));
   int cbtx_size;
   uchar *cbtx = NULL;
   int tx_count, tx_size;
   uchar txc_vi[9];
   uchar(*merkle_tree)[32] = NULL;
   bool coinbase_append = false;
   bool submit_coinbase = false;
   bool version_force = false;
   bool version_reduce = false;
   json_t *tmp, *txa;
   bool rc = false;

Edit: I did figure out the minor mystery of attribute being in bold though. Turns out that when you type __this__ on GitHub it shows as this in bold.

@JayDDee
Owner

JayDDee commented Aug 28, 2022

Oh well, maybe have to sleep on it.

@JayDDee
Owner

JayDDee commented Aug 28, 2022

Some thoughts to sleep on...

The crash is indeed reported as a segfault. A misaligned address should throw a processor exception in the same way a divide by zero or invalid instruction does. A segfault should always be an invalid pointer address. At least that's the way it works on some processor architectures I'm more familiar with.

Counterpoint, it only crashes when greater-than-default data alignment is required, that is when the loop is vectorized.

We need to see the work->target & target pointers.

@slightlyskepticalpotat
Author

Some last tests before sleep.

Code:

   printf("%p %p\n", work->target, target);
   for ( i = 0; i < ARRAY_SIZE( work->target ); i++ )
      work->target[7 - i] = be32dec( target + i );

With the __attribute__(( aligned(32))) patch:

0x7f677c002150 0x7f6782a40ce0
Segmentation fault (core dumped)

Without the patch:

0x7f88b8002150 0x7f88beabfce0
Segmentation fault (core dumped)

Unfortunately, I don't have a very good understanding of pointers so I'm mostly lost here.

@JayDDee
Owner

JayDDee commented Aug 28, 2022

You're one up on me, I wasn't aware of %p.

Both those pointers are properly aligned. The low 6 address bits are zero which provides 64 byte alignment, more than requested.
Both pointers also look good. I don't know the memory mapping but both are within 4 GB of each other.

There's something else going on that seems to be specific to your CPU. This copy loop is used frequently in stratum code and has never crashed before. It also doesn't crash on your other CPU with every other controllable variable the same.

I assume the same vectorization occurs on that CPU.

The only difference architecturally is the addition of VAES in Ryzen 5000. That will result in kernel changes as well as any AES related code. Much of the affected code is in cpuminer-opt but only in the hashing code and definitely not anywhere near where it's crashing.

At this point it looks like one or both properly aligned and apparently valid pointers is causing a segfault when the optimiser auto-vectorizes a loop. But if auto-vectorization is disabled in the compiler there is no segfault. It occurs persistently on one particular CPU and never on a very similar CPU with identical OS, compiler and source code.

The crash itself is a mystery, from all the data available it shouldn't crash. That it doesn't crash on the Ryzen 3500U, or apparently anywhere else, makes it even more mysterious. That it only crashes when the code is auto-vectorized, well...

I'm stumped.

@slightlyskepticalpotat
Author

slightlyskepticalpotat commented Aug 28, 2022 via email

@JayDDee
Owner

JayDDee commented Aug 30, 2022

That test has narrowed the problem to a misaligned access fault when writing the byte-swapped 256 bit vector back to memory.
The only thing needed to fix it was to remove the alignment requirement by using _mm256_storeu_si256 instead of _mm256_store_si256 or letting the compiler use an aligned store by doing a direct assignment.

You can confirm by removing the "u" to force an aligned store to see if the segfault comes back.
Display the work->target pointer at the same time and it will prove the address was aligned and the misaligned fault is bogus.

You can easily toggle back and forth and prove the CPU is improperly faulting an aligned access. You can do the same test on the 3500U and prove it doesn't fault.

BTW loadu/storeu just splits the memory access into multiple smaller chunks to avoid alignment issues, at a significant performance penalty.

@slightlyskepticalpotat
Author

   __m256i x = mm256_bswap_32_test( _mm256_loadu_si256( (__m256i*)target ) );
   _mm256_storeu_si256( ( (__m256i*)(work->target)), x);

Gives 0x7ff550002150 0x7ff557021ce0 (no segfault)

   __m256i x = mm256_bswap_32_test( _mm256_loadu_si256( (__m256i*)target ) );
   _mm256_store_si256( ( (__m256i*)(work->target)), x);

Gives 0x7f4c88002150 0x7f4c8ffbcce0 (segfault)

You were right, the segfault returns when I remove the u despite the pointers apparently being aligned. I tried it on the 3500U and both of those do not segfault. Just wondering, how often do you hand-write vectorization code instead of letting the compiler optimise?

Incidentally, a warranty ticket for my 5500U laptop I put in a while ago has finally been processed. A usb port is busted, so they're going to replace the motherboard sometime. After that, I'm going to test to see if it also happens on another cpu of the same model.

@JayDDee
Owner

JayDDee commented Aug 31, 2022

Hash function vectorization operates on multiple parallel data streams so each lane is like a separate thread. The compiler is limited to simpler stuff like fixed iteration loops with no dependencies and data copying. I was surprised the compiler was able to vectorize the bswap loop, but I guess it was looking specifically for inverting arrays.

@JayDDee
Owner

JayDDee commented Aug 31, 2022

I was reading a bit about AMD64 architecture (they actually had 64 bit before Intel) and how alignment actually works.
It is indeed a processor exception rather than an MMU fault. Alignment Checking (AC) is programmable but different programming doesn't explain faulting a properly aligned access. For the programming to be different it would have to be assumed that the same OS would program the 3500U and 5500U differently. That seems very unlikely. It would also have to be assumed the compiler was oblivious to the AC setting and generated an aligned access without guaranteeing the address would be aligned when AC checking was being fully enforced. And that would also have to assume the address was in fact misaligned.

Edited to remove the reference to zen2 architecture.

@JayDDee
Owner

JayDDee commented Aug 31, 2022

Incidentally, a warranty ticket for my 5500U laptop I put in a while ago has finally been processed. A usb port is busted, so they're going to replace the motherboard sometime. After that, I'm going to test to see if it also happens on another cpu of the same model.

This also gives AMD an opportunity to reproduce this problem on the very same CPU.
BTW I opened a ticket with customer care for this segfault: 8201225820
You might want to link it to your ticket.

@slightlyskepticalpotat
Author

Just a correction, the 3500U is actually based on Zen+, not Zen 2. AMD naming conventions will never cease to surprise me 😅. I'm going to try to link your ticket to mine—you opened it with AMD, right?

@JayDDee
Owner

JayDDee commented Sep 1, 2022

I used the online support to fire off a question as a teaser to see if a human would pick it up. They sent me an email with a ticket number but no link to it. I'll let you know if I hear anything back.

I think my bit counting of alignment was wrong. The source pointer (target) is not aligned to 32 bytes but the destination work->target is. It doesn't matter much because the fault is on the destination pointer.

@JayDDee
Owner

JayDDee commented Sep 1, 2022

Got a reply from AMD, it's being escalated to an "expert".

@JayDDee
Owner

JayDDee commented Sep 1, 2022

Just a correction, the 3500U is actually based on Zen+, not Zen 2. AMD naming conventions will never cease to surprise me 😅. I'm going to try to link your ticket to mine—you opened it with AMD, right?

I just had a thought about this. AFAIK Zen+ has a different AVX2 implementation, the same as Zen (1). AVX2 (256 bit ) operations are executed as two AVX (128 bit) operations. Zen2 implemented full 256 bit wide execution units. This could effectively reduce the required data alignment for AVX2 on Zen+ and could partially explain the different behaviour on the two CPUs. This is just speculation, I look forward to the AMD experts explaining what's really happening.

@slightlyskepticalpotat
Author

Interesting...I don't have any experience with AMD's support system but I hope their "experts" are better than Apple's "geniuses".

@JayDDee
Owner

JayDDee commented Sep 5, 2022

Reply from AMD. They want a service request from you. Let me know if you want any help with the information requested.
You can also keep me in the loop as I will be able to better answer their questions about cpuminer-opt.


Dear Jay,

Your service request : SR #{ticketno:[8201225820]} has been reviewed and updated.

Response and Service Request History:

Thank you for your email.

We'd be happy to investigate this issue, however it would be easier to work with the user affected directly.

Please could you ask the user to open a service request here: https://www.amd.com/en/support/contact-email-form

Please provide the following information in the service request:

Description of the issue and a link to the Github page
Full System Specs - Including BIOS version
OS/Distribution Version/Kernel etc
System Name/Model (eg if a laptop what is the model and where was it purchased from)
dmesg log and similar logs from OS

Once we have that information, we will work with the user directly and investigate the issue that is seen with a segfault.

In order to update this service request, please respond without deleting or modifying the service request reference number in the email subject or in the email correspondence below.

Please Note: This service request will automatically close if we do not receive a response within 10 days and cannot be reopened.

If it is not feasible to respond within 10 days, feel free to open a new service request and reference this ticket for continued support.

Best regards,

Matt

AMD Global Customer Care

@slightlyskepticalpotat
Author

slightlyskepticalpotat commented Sep 5, 2022 via email

@JayDDee
Owner

JayDDee commented Sep 5, 2022

"Segfault" is a good start because that is how the OS is reporting it. You can expand by explaining where the fault is occurring, how it was determined to actually be a misaligned fault, and that the faulting address is in fact aligned to 32 bytes.

@slightlyskepticalpotat
Author

slightlyskepticalpotat commented Sep 6, 2022 via email

@JayDDee
Owner

JayDDee commented Sep 6, 2022

Done. That should close the loop so we are all informed. I'm reopening this issue since it's still active.

@JayDDee JayDDee reopened this Sep 6, 2022
@JayDDee
Owner

JayDDee commented Sep 6, 2022

For reference here is a summary of the main points as I understand them at this point.

  • Two test laptop PCs, similar except for CPU generation. Target has 5500U, control has 3500U.
  • Testing uses same OS, Ubuntu-22.04, same compiler version, same compile options, same application source code, same application options.
  • Subject source code is a looped copy and byte order reversal of a 256 bit array composed of 8 32 bit integers.
  • Control never crashes.
  • Target crashes when compiled with auto-vectorization, otherwise works correctly.
  • Target crashes when array byte swap source code is replaced with AVX2 intrinsics using aligned store _mm256_store_si256.
  • Target does not crash and works correctly when using AVX2 intrinsics with unaligned store _mm256_storeu_si256.
  • Displaying the faulting pointer with printf or gdb shows it always aligned to 32 bytes as required by AVX2.

@slightlyskepticalpotat
Author

slightlyskepticalpotat commented Sep 6, 2022 via email

@JayDDee
Owner

JayDDee commented Sep 7, 2022

AMD is closing my ticket saying the issue is resolved but will work with you to find the root cause of your issue. Please let me know what they find, if they find anything.

@slightlyskepticalpotat
Author

Just to let you know, they haven't responded to my ticket (8201227170) since they sent me an automated email saying it had been opened. Are you able to ask them what the status is from your closed ticket?

@JayDDee
Owner

JayDDee commented Sep 17, 2022

They told me to open a new ticket if I wanted further support so they'd likely ignore any queries about the old one.

I think no news is good news. That your ticket hasn't been closed yet is a good sign. There is always pressure to close tickets quickly to improve metrics. AMD techs are probably waiting to get their hands on the laptop. I expect a reply soon after because it will be a hot potato. What's in that reply will be the interesting part.

If you don't have a contact for your ticket you could use TECH.SUPPORT@amd.com. That was used by the "experts" for my ticket and I was able to reply.

@slightlyskepticalpotat
Author

slightlyskepticalpotat commented Sep 19, 2022 via email

@JayDDee
Owner

JayDDee commented Sep 20, 2022

Disappointing but not entirely unexpected. Unfortunately AMD took the easy way out and blamed the software, ignoring the evidence to the contrary. There's nothing I can do because I don't own the CPU, or type of CPU, and can't reproduce the problem.

@JayDDee
Owner

JayDDee commented Feb 15, 2023

There has been another report of the same problem, #389, this time with an Intel CPU. This eliminates the CPU as the problem.
Both users were using Ubuntu-22.04 and GCC-11.2, which points to a possible compiler problem.

@slightlyskepticalpotat
Author

Interesting. I will see if I can test it on a newer version of GCC sometime this week to see if they've fixed it since then.

@JayDDee
Owner

JayDDee commented Feb 16, 2023

I'm thinking of adding some debug code just for this issue. The code will be inserted just before the loop that crashes and will test the alignment of the target pointers before the crash. It will be compiled whenever AVX2 is present regardless of compiler optimization and is activated at run time with the --debug option.

I suggest adding it for testing. Feel free to make any modifications.

#if defined(__AVX2__)
   if ( opt_debug )
   {
      if ( (uint64_t)target % 32 )
         applog( LOG_ERR, "Misaligned target %p", target );
      if ( (uint64_t)(work->target) % 32 )
         applog( LOG_ERR, "Misaligned work->target %p", work->target );
   }
#endif

@JayDDee
Owner

JayDDee commented Feb 16, 2023

Some statistical observations:

There is a 50% random chance that any address will be aligned to 32 bytes or better. The absence of a crash is not conclusive.
The crash seems to be consistent, so far, for a given environment (OS, compiler, CPU).
Changing any variable could flip the random result, for example the CPU that now does not crash could crash when compiled with a different GCC version.

Test results need to be interpreted carefully. I hated statistics in school, so much uncertainty.
