STREAM results

ict edited this page Dec 16, 2021 · 35 revisions

This page stores various results obtained from running the STREAM benchmark included in this package. Bold text indicates the best result in a given category.

Jump to: Embedded Systems, Mobile Phones, Laptops and Portables, Desktops/PCs, Workstations, (+non-unix), Servers, Summary
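For readers unfamiliar with the column names used in every table on this page, the sketch below shows the four STREAM kernels and how a rate is derived from a timed pass. It is a minimal illustration, not the benchmark source shipped in this package: the function names are hypothetical, and it assumes 8-byte double elements and decimal megabytes (10^6 bytes).

```c
#include <assert.h>
#include <stddef.h>

/* Copy: c = a (one read, one write per element) */
void stream_copy(double *c, const double *a, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j];
}

/* Scale: b = scalar * c (one read, one write per element) */
void stream_scale(double *b, const double *c, double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        b[j] = scalar * c[j];
}

/* Add: c = a + b (two reads, one write per element) */
void stream_add(double *c, const double *a, const double *b, size_t n)
{
    for (size_t j = 0; j < n; j++)
        c[j] = a[j] + b[j];
}

/* Triad: a = b + scalar * c (two reads, one write per element) */
void stream_triad(double *a, const double *b, const double *c,
                  double scalar, size_t n)
{
    for (size_t j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}

/* Rate in MB/s for one pass. arrays_touched is 2 for Copy and Scale,
   3 for Add and Triad. Assumes 8-byte doubles and 1 MB = 10^6 bytes. */
double stream_mb_per_s(int arrays_touched, size_t n, double seconds)
{
    assert(seconds > 0.0);
    return (double)arrays_touched * sizeof(double) * (double)n
           / seconds / 1.0e6;
}
```

Under these assumptions, a Copy pass over a 10,000,000 element array that takes one second corresponds to 160 MB/s.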

Embedded Systems

HP t5700

Released in early 2003, the t5700 was among HP's first PC-compatible thin clients, featuring Transmeta's Crusoe TM5800 x86-compatible 128-bit VLIW microprocessor with independent 64 KiB instruction and data caches as well as a 512 KiB unified secondary cache. One of the Crusoe's dual on-die memory controllers is directly interfaced to 256 MiB of 266 MT/s DDR memory through a 64-bit wide data bus. All tests are performed under Debian 9.0.3 on a system not specifically configured for benchmarking.

GCC 6.3.0 w/ OpenMP, 1 thread, -O2: 5,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 1.194 GB/s | 1.375 GB/s | 1.603 GB/s | 1.485 GB/s |

GCC 6.3.0 w/ OpenMP, 1 thread, -O2: 500,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 345.6 MB/s | 338.2 MB/s | 408.2 MB/s | 408.5 MB/s |
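The gap between the two t5700 runs is consistent with the smaller run fitting inside the Crusoe's 512 KiB secondary cache, so that it measures cache rather than DRAM bandwidth. The sketch below checks the arithmetic. It assumes three arrays of 8-byte doubles, which this page does not state explicitly, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define STREAM_ARRAYS 3   /* assumed: the a, b and c arrays */

/* Total bytes occupied by the benchmark arrays. */
size_t stream_footprint_bytes(size_t elements, size_t element_size)
{
    return STREAM_ARRAYS * elements * element_size;
}

/* Returns 1 if all arrays fit in a cache of cache_bytes, else 0. */
int fits_in_cache(size_t elements, size_t element_size, size_t cache_bytes)
{
    assert(cache_bytes > 0);
    return stream_footprint_bytes(elements, element_size) <= cache_bytes;
}
```

With these assumptions the 5,000 element run occupies 120,000 bytes, well under 512 KiB (524,288 bytes), while the 500,000 element run occupies 12,000,000 bytes and must stream from main memory.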

HP t5325

The t5325 is a minuscule low-power thin client unveiled by HP in late 2009, designed around a Marvell Kirkwood 88F6281 system-on-a-chip implementing a Marvell-designed ARMv5TE-compliant "Sheeva" processor core clocked at 1.2 GHz with independent 16 KiB instruction and data caches and a 256 KiB unified secondary cache. The Sheeva core is connected to an internal DDR2 memory controller through a 64-bit internal data bus, which interfaces to 512 MiB of off-chip 800 MT/s DDR2 memory through a 16-bit external data bus. All tests are performed under the HP "ThinPro" operating system, a lightly customized variant of Debian Lenny, on a system not specifically configured for benchmarking.

Because the Kirkwood memory controller's external interface is only 16 bits wide, the t5325's bandwidth is severely limited despite the fast 800 MT/s DDR2 memory it uses. Its sustained triad rate is only a fraction of that of even SDR-based systems like the HP VISUALIZE C3000.
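The effect of the narrow external bus can be put in rough numbers. The sketch below computes a theoretical peak from the data rate and bus width quoted on this page and compares it with a measured rate. The helper names are hypothetical, and the peak ignores refresh, protocol overhead and anything else that lowers real-world bandwidth.

```c
#include <assert.h>

/* Theoretical peak in MB/s (10^6 bytes/s): mega-transfers per second
   multiplied by the bytes moved in each transfer. */
double peak_mb_per_s(double mega_transfers_per_s, unsigned bus_width_bits)
{
    assert(bus_width_bits % 8 == 0);
    return mega_transfers_per_s * (double)(bus_width_bits / 8);
}

/* Measured rate as a percentage of the theoretical peak. */
double percent_of_peak(double measured_mb_per_s, double peak)
{
    assert(peak > 0.0);
    return 100.0 * measured_mb_per_s / peak;
}
```

For the t5325, 800 MT/s over a 16-bit bus gives a peak of 1,600 MB/s, so the 114.1 MB/s triad result listed below is roughly 7% of peak. For comparison, the t5700's 266 MT/s over a 64-bit bus gives 2,128 MB/s, and its 408.5 MB/s large-array triad result is roughly 19% of peak.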

GCC 4.2.4 w/ OpenMP, 1 thread, -O2: 10,000,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 784.0 MB/s | 184.2 MB/s | 178.6 MB/s | 114.1 MB/s |

GL.iNet GL-MT1300 "Beryl"

GCC 7.4.0, 1 thread, -O2: 5,000,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 259.8 MB/s | 79.8 MB/s | 158 MB/s | 68.3 MB/s |

I-O DATA USL-5P

GCC 4.2.1, 1 thread, -O2: 100,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 147.4 MB/s | 125.8 MB/s | 132.6 MB/s | 118.9 MB/s |

Mobile Phones

Palm Pre Plus

Announced at CES 2010 and launched on Verizon Wireless in March 2010, the Pre Plus was an updated version of Palm's innovative Pre smartphone with double the RAM (512 MiB) and storage (16 GiB), as well as a new touch-based gesture area rather than the previous home button. Like the original Pre, the Pre Plus is designed around Texas Instruments' OMAP3430 multimedia processor featuring an ARM Cortex-A8 core clocked at 500 MHz with independent 16 KiB instruction and data caches as well as a unified 256 KiB second-level cache. Although the OMAP3430 does contain a VFP floating-point unit within its NEON SIMD coprocessor, this "VFPLite" implementation is not a fully fledged design like that found on most other Cortex-A8s and is significantly slower.

The Pre Plus features 512 MiB of 400 MT/s LPDDR memory mounted directly on the OMAP3430 package and attached to its on-die memory controller via a 32-bit bus. All tests are performed under WebOS 1.4.5 with WebOS Internals' UberKernel, which allows a greater range of clock frequency tweaking, but otherwise with no benchmarking-specific configuration.

GCC 4.2.3: -O2, 10,000,000 element array, Palm default profile (500 MHz underclock)

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 337.6 MB/s | 97.4 MB/s | 128.9 MB/s | 66.4 MB/s |

GCC 4.2.3: -O2, 10,000,000 element array, OMAP3430 standard clock (600 MHz)

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 361.7 MB/s | 113.8 MB/s | 141.2 MB/s | 91.6 MB/s |

GCC 4.2.3: -O2, 10,000,000 element array, 1 GHz overclock

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 429.3 MB/s | 172.3 MB/s | 206.1 MB/s | 143.2 MB/s |

Laptops and Portables

Panasonic ToughBook U1

The ToughBook U1 is a unique and highly ruggedized UMPC released by Panasonic in 2008, and built around Intel's hyper-threaded "Silverthorne" Atom microprocessor, featuring a 32 KiB instruction cache and a 24 KiB data cache, along with a unified 512 KiB secondary cache. The Z520 model featured in the U1 has a 1.33 GHz clock frequency, and is connected to an Intel US15W System Controller Hub using a 533 MHz front-side bus. The US15W SCH features an integrated memory controller connected to 1 GiB of on-board DDR2 memory, likely with 533 MT/s data rate, though 400 MT/s is possible. All tests are performed under Windows XP with the Cygwin environment, on a system not specifically configured for benchmarking.

GCC 5.4.0 w/ OpenMP, 2 threads, -O2: 25,000,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 1.506 GB/s | 1.506 GB/s | 1.746 GB/s | 1.746 GB/s |

Dell Latitude E6420

The E6420 is a midrange 14-inch business notebook introduced in early 2012. This particular configuration features Intel's dual-core, hyper-threaded Core i5 "Sandy Bridge" microprocessor with 32+32 KiB per-core instruction and data caches, a 256 KiB per-core second-level cache, a 3 MiB shared tertiary cache, a standard clock of 2.6 GHz and a maximum frequency of 3.3 GHz. All models of the E6420 are built around Intel's QM67 express chipset, connected to the system processor through a 5 GT/s Direct Media Interface. This system is configured with 4 GiB of DDR3 SDRAM clocked at 667 MHz (for 1333 MT/s data rate) and directly interfaced to the processor's on-die memory controller. All tests are performed under CentOS 7.5.1804 on a system not specifically configured for benchmarking.

GCC 4.8.5 w/ OpenMP, 4 threads, -O3: 10,000,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 8.647 GB/s | 6.333 GB/s | 7.117 GB/s | 7.115 GB/s |

Apple iBook G4 (Mid-2005/1.33)

The last and fastest of the 12'' consumer-oriented iBook G4 line, the mid-2005 model is built around a 1.33 GHz-clocked, 32-bit PowerPC 7447a microprocessor fabricated by Freescale Semiconductor, which had been spun off from Motorola the previous year. The 7447a is the final desktop iteration of the PowerPC 7400 'G4' microprocessor used by Apple in their systems, featuring two 32 KiB primary caches for instructions and data, a single 512 KiB on-die unified secondary cache, and some additional mobile-oriented features, such as dynamic frequency scaling and an on-chip thermal diode. The 7447a is interfaced to 512 MiB of on-board 333 MT/s DDR memory through the Intrepid ASIC, to which it is attached via a 133 MHz, 64-bit wide data bus. Intrepid also provides I/O device control and most other functionality to the complete system. All tests are performed under Mac OS X 10.4 on a system not specifically configured for benchmarking.

Apple GCC 4.0.1, -O2: 10,000,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 425.6 MB/s | 421.7 MB/s | 433.4 MB/s | 450.7 MB/s |

Desktops/Personal Computers

Lenovo 3000 J115 (7387-26U)

Released in late 2006 as one of Lenovo's first entries into the United States market under their own name, the 3000 J115 is a fairly average entry-level PC built around AMD's dual-core Athlon 64 X2 microprocessor with 64+64 KiB per-core instruction and data caches, a 512 KiB per-core second-level cache and a 2 GHz clock frequency (model 3800+). The 3000 J115 employs NVIDIA's nForce 410 chipset, which connects to the system processor through a 1 GHz HyperTransport bus. This system is configured with 1 GiB of DDR2 SDRAM clocked at 266 MHz (for 533 MT/s data rate) and directly interfaced to the Athlon 64 X2's on-die memory controller. All tests are performed under CentOS 7.5.1804 on a system not specifically configured for benchmarking.

GCC 4.8.5 w/ OpenMP, 2 threads, -O3: 10,000,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 4.164 GB/s | 2.296 GB/s | 2.444 GB/s | 2.410 GB/s |

Apple iMac G5 (1.6 GHz 17'')

Introduced with the original lineup in August 2004, the 1.6 GHz 17'' model was the slowest generally available G5 system sold by Apple, designed around IBM's 64-bit PowerPC 970FX microprocessor featuring a 32 KiB data cache, 64 KiB instruction cache and a unified 512 KiB secondary cache on-chip. The 970FX interfaces to an off-chip DDR memory controller by a 533 MHz, 64-bit data bus composed of two separate 32-bit uni-directional buses. This system is outfitted with 256 MiB of DDR memory clocked at 200 MHz (with an effective 400 MT/s data rate).

All tests are performed under Mac OS 10.3.5, on a system not specifically configured for benchmarking.

Apple GCC 3.3, -O2: 1,000,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 1.209 GB/s | 1.224 GB/s | 1.347 GB/s | 1.350 GB/s |

Workstations

Apple Power Mac G5 (Late 2005/2.3DC)

The mid-range offering of Apple's final generation of PowerPC-based professional systems, the 2.3DC was introduced in October 2005 and was designed around IBM's new dual-core 64-bit PowerPC 970MP processor, which featured two PowerPC 970 cores each with 32 KiB data cache, 64 KiB instruction cache, and a unified 1 MiB secondary cache, all running at a clock frequency of 2.3 GHz. The 970MP is interfaced to an off-chip DDR2 memory controller by a 1.15 GHz, 64-bit data bus composed of two separate 32-bit uni-directional buses. This system is outfitted with 8 GiB of error-correcting DDR2 memory clocked at 266 MHz (with an effective 533 MT/s data rate).

All tests are performed under Mac OS 10.4.11, on a system not specifically configured for benchmarking. Note that the version of GCC 4.0.1 that ships with 10.4's Xcode tools does not support OpenMP.

Apple GCC 4.0.1, 1 thread, -O2: 10,000,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 2.759 GB/s | 2.767 GB/s | 3.227 GB/s | 3.225 GB/s |

HP VISUALIZE C3000 (9000/785/C3000)

The C3000 is a mid-range Unix workstation released in 1999, based on HP's in-house PA-8500 microprocessor with 1 MiB of on-die data cache, 512 KiB of on-die instruction cache and a clock frequency of 400 MHz. The microprocessor is interfaced to the "Astro" chipset through a 120 MHz Runway+ bus. The particular system tested had 2,560 megabytes of SDRAM, also running at 120 MHz, and was not specifically configured for benchmarking. All tests are performed under HP-UX 11.11 (11i v1).

HP C B.11.11.16, +O4: 10,000,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 732.1 MB/s | 681.1 MB/s | 631.6 MB/s | 625.6 MB/s |

GCC 4.2.3, -O3: 10,000,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 399.4 MB/s | 413.5 MB/s | 506.9 MB/s | 508.1 MB/s |

Non-Unix Workstations

The following results are from workstations not running a Unix-like operating system or compatibility environment, or otherwise lacking the proper accommodations to build the STREAM sources provided in this package as-is. Source/build tweaks are noted on a per-system basis.

DEC VAXstation 4000 VLC

Introduced by DEC in 1991 as the most inexpensive entry in the new VAXstation 4000 line, the VLC was the smallest full-featured VAX ever built, designed around DEC's highly integrated CVAX "SOC" microprocessor with a 1 KiB shared primary instruction/data cache and an innovative on-die 8 KiB DRAM secondary cache. Through a 32-bit data bus, the SOC attaches to the DC7201 "S-chip" ASIC, which provides the CPU with a 32-bit interface to up to 24 MiB of error-correcting memory, as well as to the Ethernet and SCSI subsystems via DMA channels.

This test was run under OpenVMS 6.1, with no specific configuration for benchmarking.

DEC C/C++ 1.2, /OPTIMIZE=ALL: 100,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 8.9 MB/s | 8.4 MB/s | 10.0 MB/s | 7.7 MB/s |

Build notes:

Due to the various differences of the OpenVMS environment, building STREAM on this system required some significant modifications to the timing logic, mainly a VMS-friendly reimplementation of the gettimeofday() function based on the following snippet:

/* purloined from source code for Xanim, from Mark Podlipec
   not clear if this was the original of this code

   mangled to kill the timezone pieces */

/*      
 *      Provide the UNIX gettimeofday() function for VMS C.
 *      The timezone is currently unsupported.
 */

#include <time.h>

int sys$gettim();
int lib$subx();
int lib$ediv();

int gettimeofday( struct timeval *tp, int * tzp)
{
   long curr_time[2];   /* Eight byte VAX time variable */
   long jan_01_1970[2] = { 0x4BEB4000,0x7C9567} ;
   long diff[2];
   long result;
   long vax_sec_conv = 10000000;
 
   result = sys$gettim( &curr_time );
 
   result = lib$subx( &curr_time, &jan_01_1970, &diff);
 
   if ( tp != 0) {
       result = lib$ediv( &vax_sec_conv, &diff,
                          &(tp->tv_sec), &(tp->tv_usec) );
       tp->tv_usec = tp->tv_usec / 10;  /* convert 1.e-7 to 1.e-6 */
   }
   return ( 0);
}
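The two-longword constant jan_01_1970 in the snippet above is the Unix epoch expressed in VMS system time, which counts 100 ns ticks from 17 November 1858. The portable sketch below redoes the same subtraction and division with 64-bit integers so the arithmetic can be checked on any C99 compiler. It is an illustration only, not the code used on the VAX: the function names are hypothetical, and it stands in for lib$subx() and lib$ediv() rather than calling them.

```c
#include <assert.h>
#include <stdint.h>

/* VMS system time counts 100 ns ticks, so 10^7 ticks per second.
   This matches vax_sec_conv in the snippet. */
#define TICKS_PER_SECOND 10000000ULL

/* Days from 17-Nov-1858 to 1-Jan-1970, i.e. the Modified Julian Day
   number of the Unix epoch. */
#define DAYS_TO_UNIX_EPOCH 40587ULL

/* jan_01_1970[] from the snippet, assembled low longword first. */
uint64_t unix_epoch_in_vms_ticks(void)
{
    return ((uint64_t)0x007C9567 << 32) | 0x4BEB4000u;
}

/* Hypothetical helper: split a VMS tick count into Unix seconds and
   microseconds. The subtraction plays the role of lib$subx() and the
   divide-with-remainder plays the role of lib$ediv(). */
void vms_ticks_to_unix(uint64_t vms_ticks, int64_t *sec, int32_t *usec)
{
    uint64_t diff;

    assert(vms_ticks >= unix_epoch_in_vms_ticks());
    diff  = vms_ticks - unix_epoch_in_vms_ticks();
    *sec  = (int64_t)(diff / TICKS_PER_SECOND);
    *usec = (int32_t)((diff % TICKS_PER_SECOND) / 10); /* 100 ns -> 1 us */
}
```

The constant works out to exactly 40,587 days of 86,400 seconds at 10^7 ticks per second, which is a quick way to confirm that the hexadecimal values were transcribed correctly.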

Additional modifications to the base source included:

  • Supplying the DEC run-time library functions sys$gettim(), lib$subx() and lib$ediv() as well as the timeval struct required inclusion of the following header files: lib$routines.h (for lib$subx() and lib$ediv()), starlet.h (for sys$gettim()) and socket.h (for timeval)

  • Any inclusions of sys/time.h were replaced with references to time.h

  • As OpenVMS does not define ssize_t, it was locally defined to int for this benchmark
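For context, the timing logic that depends on gettimeofday() is small. The sketch below shows a wall-clock helper of the kind a STREAM build typically uses, written for a POSIX system. It is modeled on the mysecond() helper in the reference stream.c; whether this package's copy is identical is an assumption, and the name wall_seconds() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>   /* replaced by time.h and socket.h on the VMS port */

/* Wall-clock time in seconds as a double, with microsecond resolution
   at best. Each kernel is timed by taking the difference between two
   calls made before and after the loop. */
double wall_seconds(void)
{
    struct timeval tp;
    int rc = gettimeofday(&tp, NULL);

    assert(rc == 0);
    (void)rc;   /* silence unused-variable warnings when NDEBUG is set */
    return (double)tp.tv_sec + (double)tp.tv_usec * 1.0e-6;
}
```

Because the helper only needs tv_sec and tv_usec, a replacement gettimeofday() that ignores the timezone argument, like the VMS version above, is sufficient.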

Performing the above modifications allowed STREAM to compile and run, delivering results in line with expectations compared to other workstations of the time. The 4000 VLC seems to be about 20-30% slower than the officially posted results of some higher-end competitors like the SPARCstation 1 or SGI 4D/25.

Servers

Sun Fire T1000

The Sun Fire T1000 is an entry-level 1U rackmounted server released in early 2006 as one of the first systems to use Sun's radically multi-threaded UltraSPARC T1 "Niagara" microprocessor, derived from a SPARC implementation originally developed by Afara Websystems that features four, six or eight relatively simple SPARC V9 cores with individual 16 KiB instruction caches and 8 KiB data caches, a shared 3 MiB secondary cache and a single floating-point unit shared among all cores. Each core also has four threads, all sharing a single pipeline and a massive register file composed of 640 64-bit registers that allows for a thread's state to be quickly saved and resumed in a single cycle in order to maximize processor utilization in heavily multi-threaded workloads.

The 8-core T1 utilized in this T1000 is clocked at 1 GHz, and is directly interfaced to 16 GiB of error-correcting 533 MT/s DDR2 memory through two on-die memory controllers with 128-bit data buses. The T1 possesses a total of four on-die memory controllers; however, the T1000 only utilizes two of them to support two banks of four memory modules each. The T1000 is capable of supporting modules of up to 4 GiB in size.

All tests are performed on a T1000 with an 8-core UltraSPARC T1 running Solaris 10 10/09 with no specific configuration for benchmarking purposes.

Sun Studio 12/Sun C 5.9 w/ OpenMP, 1 thread, -xO3: 10,000,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 542.1 MB/s | 273.4 MB/s | 301.1 MB/s | 220.8 MB/s |

Sun Studio 12/Sun C 5.9 w/ OpenMP, 32 threads, -xO3: 10,000,000 element array

| Copy | Scale | Add | Triad |
| --- | --- | --- | --- |
| 4.148 GB/s | 1.598 GB/s | 2.383 GB/s | 1.199 GB/s |
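The two T1000 runs differ only in thread count: 32 threads corresponds to 8 cores with 4 hardware threads each, and the triad rate rises from 220.8 MB/s to 1.199 GB/s, roughly 5.4 times. The sketch below shows how a kernel is typically spread across threads with OpenMP. It is a minimal illustration with a hypothetical function name, not the benchmark source; it assumes the compiler's OpenMP option is enabled (for example -fopenmp with GCC or -xopenmp with Sun Studio) and that the thread count is set through the OMP_NUM_THREADS environment variable.

```c
#include <assert.h>
#include <stddef.h>

/* Triad with the loop iterations divided among OpenMP threads.
   If OpenMP is not enabled at compile time the pragma is ignored
   and the loop simply runs on one thread. */
void triad_parallel(double *a, const double *b, const double *c,
                    double scalar, long n)
{
    long j;

    assert(n >= 0);
#pragma omp parallel for
    for (j = 0; j < n; j++)
        a[j] = b[j] + scalar * c[j];
}
```

Each iteration writes a distinct element of a and only reads b and c, so the iterations are independent and need no synchronization beyond the implicit barrier at the end of the loop.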

Summary

Embedded Systems

| System | Copy | Scale | Add | Triad | Year |
| --- | --- | --- | --- | --- | --- |
| I-O DATA USL-5P | 147.4 MB/s | 125.8 MB/s | 132.6 MB/s | 118.9 MB/s | 2003 |
| HP t5700 | 345.6 MB/s | 338.2 MB/s | 408.2 MB/s | 408.5 MB/s | 2003 |
| HP t5325 | 784.0 MB/s | 184.2 MB/s | 178.6 MB/s | 114.1 MB/s | 2009 |

Mobile Phones

| System | Copy | Scale | Add | Triad | Year |
| --- | --- | --- | --- | --- | --- |
| Palm Pre Plus (500 MHz) | 337.6 MB/s | 97.4 MB/s | 128.9 MB/s | 66.4 MB/s | 2010 |
| Palm Pre Plus (600 MHz) | 361.7 MB/s | 113.8 MB/s | 141.2 MB/s | 91.6 MB/s | 2010 |
| Palm Pre Plus (1 GHz) | 429.3 MB/s | 172.3 MB/s | 206.1 MB/s | 143.2 MB/s | 2010 |

Laptops and Portables

| System | Copy | Scale | Add | Triad | Year |
| --- | --- | --- | --- | --- | --- |
| Apple iBook G4 | 425.6 MB/s | 421.7 MB/s | 433.4 MB/s | 450.7 MB/s | 2005 |
| Panasonic ToughBook U1 | 1.506 GB/s | 1.506 GB/s | 1.746 GB/s | 1.746 GB/s | 2008 |
| Dell Latitude E6420 | 8.647 GB/s | 6.333 GB/s | 7.117 GB/s | 7.115 GB/s | 2012 |

Desktops/PCs

| System | Copy | Scale | Add | Triad | Year |
| --- | --- | --- | --- | --- | --- |
| Apple iMac G5 (1.6) | 1.209 GB/s | 1.224 GB/s | 1.347 GB/s | 1.350 GB/s | 2004 |
| Lenovo 3000 J115 | 4.164 GB/s | 2.296 GB/s | 2.444 GB/s | 2.410 GB/s | 2006 |

Workstations

| System | Copy | Scale | Add | Triad | Year |
| --- | --- | --- | --- | --- | --- |
| DEC VAXstation 4000 VLC | 8.9 MB/s | 8.4 MB/s | 10.0 MB/s | 7.7 MB/s | 1991 |
| HP VISUALIZE C3000 | 732.1 MB/s | 681.1 MB/s | 631.6 MB/s | 625.6 MB/s | 1999 |
| Apple Power Mac G5 (2.3DC) | 2.759 GB/s | 2.767 GB/s | 3.227 GB/s | 3.225 GB/s | 2005 |

Servers

| System | Copy | Scale | Add | Triad | Year |
| --- | --- | --- | --- | --- | --- |
| Sun Fire T1000 | 4.148 GB/s | 1.598 GB/s | 2.383 GB/s | 1.199 GB/s | 2006 |