The original STREAMS benchmark measures the Speed Not To Exceed bandwidth of a CPU to it's attached RAM by performing large numbers of simple operations. This version for EMS.js measures the rate of different atomic operations on shared integer data and is implemented using two different parallelization schemes: Bulk Synchronous Parallel (BSP), and Fork-Join.
The EMS.js STREAMS benchmark is implemented twice, once using Bulk Synchronous Parallelism and again using Fork-Join parallelism.
BSP goes parallel at program startup, decomposes loop iterations across processes, and performs a barrier at the end of each loop. FJ starts serial and goes parallel for each loop, joining back to serial execution before the next loop.
To run either version of STREAMS benchmark, specify the number of processes as an argument to the program:
node streams_bulk_sync_parallel.js 8
Running the BSP version of the STREAMS benchmark reports the rate at which each loop is performed, each loop exercising a single EMS read and write primitive. A second set of "Native" loops performs similar operations on ordinary volatile private memory, taking advantage of all possible native optimizations and just-in-time compilation. These results show a performance difference of 20-50x accessing EMS memory versus native memory, which is still very favorable compared to the 1000-5000x difference accessing a remote key-value store, or worse accessing magnetic media.
dunlin> time node streams_bulk_sync_parallel.js 3
EMStask 0: 1,000,000 write 6,896,551 ops/sec
EMStask 0: 1,000,000 writeXE 7,246,376 ops/sec
EMStask 0: 1,000,000 writeXF 7,246,376 ops/sec
EMStask 0: 1,000,000 read 7,042,253 ops/sec
EMStask 0: 1,000,000 reread 6,896,551 ops/sec
EMStask 0: 1,000,000 readFF 7,299,270 ops/sec
EMStask 0: 1,000,000 readFE 7,194,244 ops/sec
EMStask 0: 1,000,000 writeEF 7,299,270 ops/sec
EMStask 0: 1,000,000 copy 3,521,126 ops/sec
EMStask 0: 1,000,000 recopy 3,484,320 ops/sec
EMStask 0: 1,000,000 c=a*b 2,364,066 ops/sec
EMStask 0: 1,000,000 c+=a*b 1,715,265 ops/sec
EMStask 0: 1,000,000 checksum Infinity ops/sec
------------------------ START NATIVE ARRAY
EMStask 0: 1,000,000 write 333,333,333 ops/sec
EMStask 0: 1,000,000 rd/write 250,000,000 ops/sec
EMStask 0: 1,000,000 read 125,000,000 ops/sec
EMStask 0: 1,000,000 reread 250,000,000 ops/sec
EMStask 0: 1,000,000 copy 200,000,000 ops/sec
EMStask 0: 1,000,000 rmwcopy 250,000,000 ops/sec
EMStask 0: 1,000,000 c=a*b 100,000,000 ops/sec
EMStask 0: 1,000,000 c+=a*b 142,857,142 ops/sec
EMStask 0: 1,000,000 sum Infinity ops/sec
real 0m2.938s
user 0m8.190s
sys 0m0.172s
NOTE: Adding timing instrumentation to the Fork-Join version is left as an exercise to the reader.
Benchmark results for the BSP implementation on an AWS
c4.8xlarge (132 ECUs, 36 vCPUs, 2.9 GHz, Intel Xeon E5-2666v3, 60 GiB memory
.