Unable to replicate RPi0 tests #4701

Open
samveen opened this issue Apr 22, 2023 · 3 comments
samveen commented Apr 22, 2023

I am trying to verify the test results for the Raspberry Pi Zero W listed in the table under the Raspberry Pi section of the README.md. However, I am unable to get sane (or the same) results:

samveen@facez:~/XNNPACK/build/local $ time ./end2end-bench --benchmark_min_time=5
Error in cpuinfo: failed to parse file /sys/devices/system/cpu/kernel_max: "-1
" is not an unsigned number
2023-04-22T02:17:08+01:00
Running ./end2end-bench
Run on (1 X 1000 MHz CPU )
Load Average: 0.99, 1.00, 1.00
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
FP32MobileNetV1/T:1/real_time                 3952697 us      3950282 us            2 cpufreq=1000M
FP32MobileNetV2/T:1/real_time                 2014865 us      2009419 us            3 cpufreq=1000M
^\Quit

real	163m3.618s
user	162m56.411s
sys	0m0.407s

As can be seen from the above, the test for FP32 MobileNet v3 Large does not complete in the expected timeframe (i.e. less than the time taken for FP32 MobileNet v2 1.0X). In fact, it never completes at all (the longest I waited before my patience ran out was a 16-hour overnight run).
I ran strace on the binary in verbose mode with sudo strace ./build/local/end2end-bench --benchmark_min_time=5 --v=1000. The last relevant bit of output is as follows:

brk(0x2197d000)                         = 0x2197d000
write(2, "-- LOG(", 7-- LOG()                  = 7
write(2, "2", 12)                        = 1
write(2, "): ", 3): )                      = 3
write(2, "Ran in ", 7Ran in )                  = 7
write(2, "6.01356", 76.01356)                  = 7
write(2, "/", 1/)                        = 1
write(2, "6.02472", 76.02472)                  = 7
write(2, "\n", 1
)                       = 1
write(1, "\33[0;32mFP32MobileNetV2/T:1/real_"..., 133FP32MobileNetV2/T:1/real_time                 2008239 us      2004521 us            3 cpufreq=1000M
) = 133
write(1, "\33[m", 3)                    = 3
write(2, "-- LOG(", 7-- LOG()                  = 7
write(2, "2", 12)                        = 1
write(2, "): ", 3): )                      = 3
write(2, "Running ", 8Running )                 = 8
write(2, "FP32MobileNetV3Large/T:1/real_ti"..., 34FP32MobileNetV3Large/T:1/real_time) = 34
write(2, " for ", 5 for )                    = 5
write(2, "1", 11)                        = 1
write(2, "\n", 1
)                       = 1
openat(AT_FDCWD, "/dev/urandom", O_RDONLY) = 3
read(3, "\271\0364N", 4)                = 4
futex(0x1fe5c0d8, FUTEX_WAKE_PRIVATE, 2147483647) = 0

After that there is nothing (no system calls of any sort). Even the quit signal doesn't prompt any further output:

futex(0x1fe5c0d8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
^\strace: Process 958 detached
Quit

However top shows that end2end-bench is consuming all the compute resources on the RPi0 (18 hours into the run):

top - 06:10:47 up 1 day, 26 min,  2 users,  load average: 1.08, 1.02, 1.01
Tasks:  87 total,   2 running,  85 sleeping,   0 stopped,   0 zombie
%Cpu(s): 99.0 us,  1.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :    476.9 total,    112.0 free,    155.6 used,    209.3 buff/cache
MiB Swap:   1000.0 total,    986.7 free,     13.2 used.    265.4 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
19235 root      20   0  527284 131436   3548 R  98.7  26.9   1101:41 end2end-bench
21863 samveen   20   0   11096   2984   2496 R   1.0   0.6   0:00.10 top
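
A minimal sketch for confirming from /proc that the time is being spent in user space rather than in the kernel, which would be consistent with the absence of system calls in the strace output. The helper name cpu_ticks is hypothetical; it assumes the standard Linux /proc/<pid>/stat layout, in which fields 14 and 15 are utime and stime in clock ticks.

```shell
# Hypothetical helper: print "utime stime" (clock ticks) from a /proc/<pid>/stat line.
cpu_ticks() {
  stat_line=$1
  # Drop "pid (comm) "; comm may itself contain spaces, so cut at the last ") ".
  rest=${stat_line##*') '}
  # After the cut, $1 is field 3 (state), so utime and stime are $12 and $13.
  set -- $rest
  echo "${12} ${13}"
}

# Usage on the device (19235 is the PID from the top output above):
#   cpu_ticks "$(cat /proc/19235/stat)"
```

If utime keeps growing between two samples while stime barely moves, the process is spinning in user code, not blocked in a system call.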

My build was created using scripts/build-local.sh (as per the Raspberry Pi section) using the following parameters:
bash -x ./scripts/build-local.sh -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DXNNPACK_ENABLE_ARM_DOTPROD:BOOL=OFF

The parameter XNNPACK_ENABLE_ARM_DOTPROD:BOOL=OFF is required for local builds on the Raspberry Pi Zero because of the NEON-specific SIMD assembly contained in the sources in the folders src/qc8-igemm/ and src/qc8-gemm/, which the native armv6 assembler doesn't support:

[ 69%] Building C object CMakeFiles/microkernels-all.dir/src/qc8-gemm/gen/qc8-gemm-1x8c4-minmax-fp32-neondot.c.o
/usr/bin/cc -DFXDIV_USE_INLINE_ASSEMBLY=0 -DPTHREADPOOL_NO_DEPRECATED_API=1 -DXNN_ENABLE_ARM_BF16=1 -DXNN_ENABLE_ARM_DOTPROD=1 -DXNN_ENABLE_ARM_FP16_SCALAR=1 -DXNN_ENABLE_ARM_FP16_VECTOR=1 -DXNN_ENABLE_ASSEMBLY=1 -DXNN_ENABLE_DWCONV_MULTIPASS=0 -DXNN_ENABLE_GEMM_M_SPECIALIZATION=1 -DXNN_ENABLE_JIT=0 -DXNN_ENABLE_MEMOPT=1 -DXNN_ENABLE_RISCV_VECTOR=1 -DXNN_ENABLE_SPARSE=1 -I/home/samveen/XNNPACK/src -I/home/samveen/XNNPACK/build/local/pthreadpool-source/include -I/home/samveen/XNNPACK/build/local/FXdiv-source/include -I/home/samveen/XNNPACK/build/local/FP16-source/include -O3 -DNDEBUG -fPIC -Wno-psabi -O2 -pthread -std=c99  -fno-math-errno  -marm  -march=armv8.2-a+dotprod -mfpu=neon-fp-armv8  -o CMakeFiles/microkernels-all.dir/src/qc8-gemm/gen/qc8-gemm-1x8c4-minmax-fp32-neondot.c.o -c /home/samveen/XNNPACK/src/qc8-gemm/gen/qc8-gemm-1x8c4-minmax-fp32-neondot.c
/tmp/ccEN2Cuz.s: Assembler messages:
/tmp/ccEN2Cuz.s:62: Error: selected processor does not support `vsdot.s8 q8,q11,d7[0]' in ARM mode
/tmp/ccEN2Cuz.s:64: Error: selected processor does not support `vsdot.s8 q10,q9,d7[0]' in ARM mode
/tmp/ccEN2Cuz.s:67: Error: selected processor does not support `vsdot.s8 q8,q9,d7[1]' in ARM mode
/tmp/ccEN2Cuz.s:71: Error: selected processor does not support `vsdot.s8 q10,q9,d7[1]' in ARM mode
/tmp/ccEN2Cuz.s:143: Error: selected processor does not support `vsdot.s8 q10,q9,d7[0]' in ARM mode
/tmp/ccEN2Cuz.s:146: Error: selected processor does not support `vsdot.s8 q8,q11,d7[0]' in ARM mode
gmake[2]: *** [CMakeFiles/microkernels-all.dir/build.make:44129: CMakeFiles/microkernels-all.dir/src/qc8-gemm/gen/qc8-gemm-1x8c4-minmax-fp32-neondot.c.o] Error 1
gmake[2]: Leaving directory '/home/samveen/XNNPACK/build/local'
gmake[1]: *** [CMakeFiles/Makefile2:12284: CMakeFiles/microkernels-all.dir/all] Error 2
gmake[1]: Leaving directory '/home/samveen/XNNPACK/build/local'
gmake: *** [Makefile:163: all] Error 2
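
A quick way to check whether the target CPU advertises NEON at all, sketched under the assumption that /proc/cpuinfo has a "Features" line in the usual format. The helper name has_neon is hypothetical. On 32-bit ARM kernels the flag is spelled "neon"; 64-bit kernels report "asimd" instead.

```shell
# Hypothetical helper: report whether a /proc/cpuinfo "Features" line lists NEON.
has_neon() {
  features_line=$1
  # Strip everything up to the first colon, then scan the space-separated flags.
  for flag in ${features_line#*:}; do
    case "$flag" in
      neon|asimd) echo yes; return 0 ;;
    esac
  done
  echo no
  return 1
}

# Usage on the device itself:
#   has_neon "$(grep -m1 '^Features' /proc/cpuinfo)"
```

A result of "no" means the NEON and dot-product kernels cannot run on this CPU even when they compile.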

The environment is as below:

samveen@facez:~ $ date --utc
Sat 22 Apr 04:56:02 UTC 2023
samveen@facez:~ $ grep NAME /etc/os-release; echo; uname -a; echo; cat /proc/cpuinfo|grep -iv serial; echo; vcgencmd get_mem arm; vcgencmd get_mem gpu; echo; free -h; echo; df -h /; echo; sudo apt update && sudo apt upgrade
PRETTY_NAME="Raspbian GNU/Linux 11 (bullseye)"
NAME="Raspbian GNU/Linux"
VERSION_CODENAME=bullseye

Linux facez.cluster.samveen.in 6.1.21+ #1642 Mon Apr  3 17:19:14 BST 2023 armv6l GNU/Linux

processor	: 0
model name	: ARMv6-compatible processor rev 7 (v6l)
BogoMIPS	: 997.08
Features	: half thumb fastmult vfp edsp java tls 
CPU implementer	: 0x41
CPU architecture: 7
CPU variant	: 0x0
CPU part	: 0xb76
CPU revision	: 7

Hardware	: BCM2835
Revision	: 9000c1
Model		: Raspberry Pi Zero W Rev 1.1

arm=496M
gpu=16M

               total        used        free      shared  buff/cache   available
Mem:           476Mi        44Mi       205Mi       0.0Ki       227Mi       382Mi
Swap:          999Mi          0B       999Mi

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        15G   13G  1.4G  90% /

Get:1 http://mirror.ossplanet.net/raspbian/raspbian bullseye InRelease [15.0 kB]
Hit:2 http://archive.raspberrypi.org/debian bullseye InRelease                          
Fetched 15.0 kB in 3s (5,019 B/s)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
All packages are up to date.
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
@samveen
Author

samveen commented Apr 23, 2023

Digging further into the NEON dot-product error shows that I'm hitting the issue described in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101723 (via #1465). As of now, the fix has unfortunately not made it downstream into Debian bullseye, so successful local builds on Debian bullseye with the neon-dot kernels are a pipe dream.

Other than that, the other issues still stand.

@Maratyszcza
Contributor

Try an older revision of XNNPack. We don't regularly test on pre-NEON ARM systems, and probably a recent refactoring broke something.

@samveen
Author

samveen commented Apr 24, 2023

@Maratyszcza, I tried that too, using the revision that last updated the Raspberry Pi benchmarks table (your commit 3c6d6b4 from Oct 16, 2021). The build failed with the error about unsupported NEON instructions, which led me to the GCC bug report.

I'm in the process of upgrading to Debian Bookworm/Sid to see if I can get the build working before proceeding further. Upgrade done.

Debian Bookworm/testing is armel, not armhf, and gcc-12/g++-12 generated armel (no VFP) code. So I discarded it and went with Raspberry Pi OS Bookworm/testing, which is armhf (armv6+VFP).
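
One way to tell the two ABIs apart before building is to look at the compiler's target triplet, which gcc -dumpmachine prints. This is only a sketch: the helper name abi_of_triplet is hypothetical, and it assumes the usual Debian-style triplet suffixes (gnueabihf for armhf, gnueabi for armel).

```shell
# Hypothetical helper: map a GCC target triplet to the Debian ARM ABI name.
abi_of_triplet() {
  case "$1" in
    *-gnueabihf) echo armhf ;;   # hard-float ABI (VFP registers used for FP arguments)
    *-gnueabi)   echo armel ;;   # soft-float ABI
    *)           echo other ;;
  esac
}

# Usage on the device itself:
#   abi_of_triplet "$(gcc -dumpmachine)"
```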

GCC 10.4 and 12.2 are available.

  • Building 3c6d6b4 fails with both 10.4 and 12.2 as below:
[ 45%] Building C object CMakeFiles/all_microkernels.dir/src/f16-f32-vcvt/gen/vcvt-neonfp16-x8.c.o
/usr/bin/cc -DFXDIV_USE_INLINE_ASSEMBLY=0 -DPTHREADPOOL_NO_DEPRECATED_API=1 -DXNN_ENABLE_ASSEMBLY=1 -DXNN_ENABLE_MEMOPT=1 -DXNN_ENABLE_SPARSE=1 -I/home/samveen/XNNPACK/include -I/home/samveen/XNNPACK/src -I/home/samveen/XNNPACK/build/local/pthreadpool-source/include -I/home/samveen/XNNPACK/build/local/FXdiv-source/include -I/home/samveen/XNNPACK/build/local/FP16-source/include -D__ARM_FP16_FORMAT_IEEE=1 -O3 -DNDEBUG -fPIC -Wno-psabi -pthread  -marm  -march=armv7-a -mfpu=neon-fp16  -O2  -MD -MT CMakeFiles/all_microkernels.dir/src/f16-f32-vcvt/gen/vcvt-neonfp16-x8.c.o -MF CMakeFiles/all_microkernels.dir/src/f16-f32-vcvt/gen/vcvt-neonfp16-x8.c.o.d -o CMakeFiles/all_microkernels.dir/src/f16-f32-vcvt/gen/vcvt-neonfp16-x8.c.o -c /home/samveen/XNNPACK/src/f16-f32-vcvt/gen/vcvt-neonfp16-x8.c
/home/samveen/XNNPACK/src/f16-f32-vcvt/gen/vcvt-neonfp16-x8.c: In function ‘xnn_f16_f32_vcvt_ukernel__neonfp16_x8’:
/home/samveen/XNNPACK/src/f16-f32-vcvt/gen/vcvt-neonfp16-x8.c:31:11: error: unknown type name ‘float16x8_t’
   31 |     const float16x8_t vh = vreinterpretq_f16_u16(vld1q_u16(i)); i += 8;
      |           ^~~~~~~~~~~
