
gregs1104/stream-scaling


Introduction

stream-scaling automates running the STREAM memory bandwidth test on Linux systems. It detects the number of CPUs and the size of each of their caches. The program then downloads STREAM, compiles it, and runs it with an array size large enough not to fit into cache. The number of threads is varied from 1 to the total number of cores in the server, so you can see how memory speed scales as more cores become involved.
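As a rough sketch of what that detection involves (simplified, and not the script's exact sizing formula; the factor of 4 below is the classic STREAM guideline, used here as an assumption):

```shell
#!/bin/sh
# Simplified sketch of stream-scaling's detection: sum every cache
# reported in /sys, then suggest an array size several times larger so
# STREAM runs from RAM rather than cache.

total_cache_bytes() {
    # Sum the index*/size entries (values like "32K" or "8192K") across
    # all CPUs; silently skip anything unreadable.
    total=0
    for f in /sys/devices/system/cpu/cpu*/cache/index*/size; do
        [ -r "$f" ] || continue
        kb=$(sed 's/K$//' "$f")
        total=$((total + kb * 1024))
    done
    echo "$total"
}

suggested_elements() {
    # One STREAM array of doubles (8 bytes each) should be several times
    # the total cache; 4x is the guideline assumed for this sketch.
    cache_bytes=$1
    echo $((cache_bytes * 4 / 8))
}

suggested_elements "$(total_cache_bytes)"
```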

Installation/Usage

Just run stream-scaling:

./stream-scaling

And it should do the rest. Note that a stream.c source file and a stream binary will be left behind afterwards.

Note that the program is only expected to work on systems using gcc 4.2 or later, as the OpenMP libraries are required.

Sample result

This sample is from an Intel i7 860 processor, featuring 4 real cores with Hyper Threading for a total of 8 virtual cores. It also supports the Turbo feature, which accelerates runs at low core counts. Memory is 4 X 2GB DDR3-1600:

$ ./stream-scaling 
=== CPU cache information ===
CPU /sys/devices/system/cpu/cpu0 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu0 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu0 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu0 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu1 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu1 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu1 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu1 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu2 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu2 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu2 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu2 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu3 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu3 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu3 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu3 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu4 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu4 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu4 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu4 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu5 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu5 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu5 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu5 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu6 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu6 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu6 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu6 Level 3 Cache: 8192K (Unified)
CPU /sys/devices/system/cpu/cpu7 Level 1 Cache: 32K (Data)
CPU /sys/devices/system/cpu/cpu7 Level 1 Cache: 32K (Instruction)
CPU /sys/devices/system/cpu/cpu7 Level 2 Cache: 256K (Unified)
CPU /sys/devices/system/cpu/cpu7 Level 3 Cache: 8192K (Unified)
Total CPU system cache: 69468160 bytes
Suggested minimum array elements needed: 31576436
Array elements used: 31576436

=== CPU Core Summary ===
processor   : 7
model name  : Intel(R) Core(TM) i7 CPU         860  @ 2.80GHz
cpu MHz     : 2898.023
siblings    : 8

=== Check and build stream ===
--2010-09-19 21:41:46--  http://www.cs.virginia.edu/stream/FTP/Code/stream.c
Resolving www.cs.virginia.edu... 128.143.137.29
Connecting to www.cs.virginia.edu|128.143.137.29|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11918 (12K) [text/plain]
Saving to: `stream.c'

100%[======================================>] 11,918      --.-K/s   in 0.03s   

2010-09-19 21:41:46 (373 KB/s) - `stream.c' saved [11918/11918]


=== Testing up to 8 cores ===

-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 31576436, Offset = 0
Total memory required = 722.7 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 1
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 38888 microseconds.
   (= 38888 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        9663.6238       0.0524       0.0523       0.0527
Scale:       9315.7724       0.0545       0.0542       0.0558
Add:        10429.7390       0.0729       0.0727       0.0732
Triad:      10108.3413       0.0753       0.0750       0.0758
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------

Number of Threads requested = 2
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13095.9151       0.0579       0.0579       0.0580

Number of Threads requested = 3
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13958.5017       0.0545       0.0543       0.0547

Number of Threads requested = 4
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      14293.3696       0.0532       0.0530       0.0537

Number of Threads requested = 5
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13663.0608       0.0563       0.0555       0.0571

Number of Threads requested = 6
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13757.0249       0.0559       0.0551       0.0567

Number of Threads requested = 7
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13463.7445       0.0564       0.0563       0.0566

Number of Threads requested = 8
Function      Rate (MB/s)   Avg time     Min time     Max time
Triad:      13230.8312       0.0575       0.0573       0.0583

Like many Intel processors from the Nehalem generation onward, this system gets quite good memory bandwidth even when running a single thread, with the Turbo feature helping a bit too. It has almost reached saturation of all available bandwidth with only two threads active, which is good for a system with this many cores; they don't all have to be busy to take advantage of all the memory bandwidth on this server.

Results database

Eventually it's hoped that this program can help build a database of per-core scaling information for STREAM, similar to the one the core STREAM project maintains for peak throughput. Guidelines for submission to such a project are still being worked on. Please contact the author if you have any ideas for helping organize this work.

In general the following information is needed:

  • Output from the stream-scaling command
  • CPU information
  • List of memory banks in the system, what size of RAM they hold, and what technology/speed they run at.

Common places you might assemble this info from include:

  • /proc/cpuinfo
  • lspci -v
  • dmidecode
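The gathering steps above can be bundled into a single file for submission. A hypothetical helper (not part of stream-scaling; the stream.txt and stream-report.txt names are assumptions for this sketch):

```shell
#!/bin/sh
# Hypothetical helper: collect the submission information listed above
# into one report file. Assumes stream-scaling output was saved to
# stream.txt; sections that are unavailable are noted rather than fatal.
out=stream-report.txt
{
    echo "== stream-scaling output =="
    cat stream.txt 2>/dev/null || echo "(run ./stream-scaling > stream.txt first)"
    echo "== /proc/cpuinfo =="
    cat /proc/cpuinfo 2>/dev/null || echo "(no /proc/cpuinfo on this system)"
    echo "== memory banks (dmidecode, needs root) =="
    dmidecode --type memory 2>/dev/null || echo "(dmidecode unavailable)"
} > "$out"
echo "wrote $out"
```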

Since CPU performance data of this sort is very generic, many submissions are sent to help this project without the company or individual wanting their name disclosed. Accordingly, unless credit for your submission is specifically requested, the source of reported results will remain private. So far all contributions have been anonymous.

Preliminary Samples

Here are some sample results from the program, showing how memory speeds have marched forward as the industry moved from slower DDR2 to increasingly fast DDR3. They also demonstrate why AMD was able to limp along with slower RAM for so long in their multi-socket configurations. While no single core gets great bandwidth, when the server is fully loaded the aggregate amount can be impressive.

  • T7200: Intel Core2 T7200. Dual core. 32K Data and Instruction L1 caches, 4096K L2 cache.
  • E5420: Intel Xeon E5420. Quad core. 32K Data and Instruction L1 caches, 6144K L2 cache. 8 X 4GB DDR2-667.
  • 2 X E5405: Dual Intel Xeon E5405. Quad core. 32K Data and Instruction L1 caches, 6144K L2 cache. 8 X 4GB DDR2-667.
  • 4 X 8347: AMD Opteron 8347 HE. Quad core, 4 sockets. 64K Data and Instruction L1 caches, 512K L2 cache, 2048K L3 cache. 32 X 2GB DDR2-667.
  • E2180: Intel Pentium E2180. Dual core. 32K Data and Instruction L1 caches, 1024K L2 cache. 2 X 1GB DDR2-800.
  • X2 4600+: AMD Athlon 64 X2 4600+. Dual core. 64K Data and Instruction L1 caches, 512K L2 cache. 4 X 2GB RAM.
  • 2 X 280: AMD Opteron 280. Dual core, 2 sockets. 64K Data and Instruction L1 caches, 1024K L2 cache. 8 X 1GB DDR2-800.
  • Q6600: Intel Q6600. Quad core. 32KB Data and Instruction L1 caches, 4096K L2 cache. 4 X 2GB RAM.
  • 8 X 8431: AMD Opteron 8431. 6 cores each, 8 sockets. 64K Data and Instruction L1 caches, 512K L2 cache, 5118K L3 cache. 256GB RAM.
  • E5506: Intel Xeon E5506 2.13GHz. Quad core. 32K Data and Instruction L1 caches, 256K L2 cache, 4096K L3 cache.
  • E5520: Dual Intel Xeon E5520. Quad core with Turbo and Hyper Threading for 8 virtual cores. 32K Data and Instruction L1 caches, 256K L2 cache, 8192K L3 cache. 18 X 4GB RAM.
  • X4 955: AMD Phenom II X4 955. 64K Data and Instruction L1 caches, 512K L2 cache, 6144K L3 cache. 4GB DDR3-1333.
  • X6 1055T: AMD Phenom II X6 1055T. 64K Data and Instruction L1 caches, 512K L2 cache, 6144K L3 cache. 8GB DDR3-1333.
  • i7-860: Intel Core i7 860. Quad core with Turbo and Hyper Threading for 8 virtual cores. 32K Data and Instruction L1 caches, 256K L2 cache, 8192K L3 cache. 4 X 2GB RAM.
  • i7-870: Intel Core i7 870. Quad core with Turbo and Hyper Threading for 8 virtual cores. 32K Data and Instruction L1 caches, 256K L2 cache, 8192K L3 cache. 2 X 2GB RAM.
  • i7-870[2]: Intel Core i7 870, as above, except with 4 X 4GB RAM.
  • 2 X E5620: Dual Intel Xeon E5620. Quad core with Turbo and Hyper Threading for 16 virtual cores. 32K Data and Instruction L1 cache, 256K L2 cache, 12288K L3 cache. 12 X 8GB DDR3/1333.
  • 2 X X5560: Dual Intel Xeon X5560. Quad core with Turbo and Hyper Threading for 8 virtual cores. 32K Data and Instruction L1 caches, 256K L2 cache, 8192K L3 cache. 6 X 2GB DDR3/1333.
  • 4 x E7540: Quad Intel Xeon E7540. Six cores with Turbo and Hyper Threading for 48 virtual cores, 32K Data and Instruction L1 caches, 256K L2 cache, 18432K L3 cache. 32 x 4096MB DDR3/1066.
  • 4 x X7550: Quad Intel Xeon X7550. Eight cores with Turbo, Hyper Threading disabled for 32 total. 32K Data and Instruction L1 caches, 256K L2 cache, 18432K L3 cache. 32 X 4096 DDR3/1333.
  • 4 X 6168: Quad AMD Opteron 6168. Twelve cores for 48 total, 64K Data and Instruction L1 caches, 512K L2 cache, 5118K L3 cache. 16 X 8192MB DDR3/1333.
  • 4 X 6172: Quad AMD Opteron 6172. Twelve cores for 48 total, 64K Data and Instruction L1 caches, 512K L2 cache, 5118K L3 cache. 32 X 4096MB DDR3/1333.
  • 4x X7560: Quad Intel X7560. Eight cores with Turbo and Hyper Threading for 64 virtual cores. 32K Data and Instruction L1 caches, 256K L2 cache, 24576K L3 cache. 32 X 4096 DDR3/1066.
  • X7560[2]: Quad Intel X7560. Eight cores with Turbo, Hyper Threading disabled, for 32 total. 32K Data and Instruction L1 caches, 256K L2 cache, 24576K L3 cache. 32 X 4096 DDR3/1066.
  • 4 X 4850: Quad Intel E7-4850. Ten cores with Turbo and Hyper Threading for 80 virtual cores. 32K Data and Instruction L1 caches, 256K L2 cache, 24576K L3 cache. 64 X 8192MB DDR3/1333.
All numeric columns are the STREAM Triad rate in MB/s at the given thread count:

Processor    Cores  Clock    Memory     1 Core      2      3      4      8     16     24     32     48
T7200            2  2.0GHz   DDR2/667     2965   3084
E5420            4  2.5GHz   DDR2/667     3596   3992   4305   4365   4452
2 X E5405        8  2.0GHz   DDR2/667     3651   3830   4941   5774   5773
4 X 8347        16  1.9GHz   DDR2/667     2684   5212   7542   8760   9389  14590
E2180            2  2.0GHz   DDR2/800     2744   2784
X2 4600+         2  2.4GHz   DDR2/800     3657   4460
2 X 280          4  2.4GHz   DDR2/800     3035   3263   3130   6264
Q6600            4  2.4GHz   DDR2/800     4383   4537   4480   4390
8 X 8431        48  2.4GHz   DDR2/800     4038   7996  11918  13520  23658  22801  23688  24522  27214
E5506            4  2.13GHz  DDR3/800     7826   9016   9273   9297
2 X E5520        8  2.27GHz  DDR3/1066    7548   9841   9377   9754  12101  13176
X4 955           4  3.2GHz   DDR3/1333    6750   7150   7286   7258
X6 1055T         6  3.2GHz   DDR3/1333    7207   8657   9873   9772  9932*
i7-860           8  2.8GHz   DDR3/1600    9664  13096  13959  14293  13231
i7-870           8  2.93GHz  DDR3/1600   10022  12714  13698  13909  12787
i7-870[2]        8  2.93GHz  DDR3/1600    9354  11935  13145  13853  12598
2 X E5620       16  2.4GHz   DDR3/1333    9514  16845  17960  22544  21744  19083
2 X X5560       16  2.8GHz   DDR3/1333   11658  18382  19918  24546  23407  29215
4 X E7540       48  2.0GHz   DDR3/1066    4992   9967  14926  18727  31685  35566  35488  35973  35284
4 X X7550       32  2.0GHz   DDR3/1333    5236  10482  15723  20963  32557  35941  35874  35819
4 X 6168        48  1.90GHz  DDR3/1333    5611  11148  15819  20943  34327  52206  67560  69517  65617
4 X 6172        48  2.1GHz   DDR3/1333    4958   9903  14493  19469  37613  51625  40611  47361  32301
4 X X7560       64  2.26GHz  DDR3/1066    4356   7710  13028  14561  18702  19761  19938  20011  15964
X7560[2]        32  2.26GHz  DDR3/1066    4345   8679  12970  16315  25293  27378  27368  28654
4 X 4850        80  2.0GHz   DDR3/1333    5932  11571  17404  16000  41932  72351  58657  71384  65395

  • * The result for 6-core processors with 6 threads is shown in the 8-core column. Only so much space to work with here...

Multiple runs

Since significant run-to-run variation is often observed in STREAM results, a set of tools to help average out this variation is included. The programs require that the Ruby programming language be installed. Using them looks like this, where the server hostname "grace" labels the output files and results are averaged across 10 runs:

./multi-stream-scaling 10 grace
./multi-averager grace > stream.txt
gnuplot stream-plot

A stream.png file will be produced with a graph showing the average of the values from the multiple runs. If you are interested in analyzing the run to run variation, the stream.txt file also includes the standard deviation of the results at each core count.
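The averaging the Ruby helpers perform can be sketched with awk; mean_stddev here is a hypothetical stand-in for what multi-averager computes, fed one Triad rate per line (sample rates below are illustrative):

```shell
#!/bin/sh
# Sketch of per-core-count averaging: given one Triad rate per line,
# print the mean and the (population) standard deviation, as stream.txt
# reports for each core count.
mean_stddev() {
    awk '{ s += $1; ss += $1 * $1; n++ }
         END { m = s / n; printf "%.1f %.1f\n", m, sqrt(ss / n - m * m) }'
}

printf '13096\n13231\n12787\n' | mean_stddev
```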

Todo

  • Adding compatibility with more operating systems than Linux would be nice. Some results have been submitted from FreeBSD that look correct, but the automatic cache detection code hasn't been validated on that OS.
  • A results processor that took the verbose output shown and instead produced a compact version for easy comparison with other systems, similar to the CSV output mode of bonnie++, would make this program more useful.
  • Some early cache size detection code has been written for OS X. But the program doesn't do anything useful there; it will just run normal STREAM many times. The standard Apple compiler chain doesn't support OpenMP, and changing an OpenMP environment variable is the way core count is limited in this program. You'll get this linker error:

    ld: library not found for -lgomp
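The OpenMP environment variable referred to above is OMP_NUM_THREADS. A dry-run sketch of the loop stream-scaling effectively performs (the commands are echoed rather than executed, so this is safe to run without a compiled stream binary):

```shell
#!/bin/sh
# Print the commands stream-scaling effectively runs to vary core count.
# OMP_NUM_THREADS is the standard OpenMP thread-count control; ./stream
# is assumed to be the compiled STREAM binary.
for n in 1 2 3 4; do
    echo "OMP_NUM_THREADS=$n ./stream"
done
```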

Limitations

On systems with many processors and large caches, most commonly AMD systems with 24 or more cores, the results at high core counts will vary significantly. This is theorized to come from two causes:

  • Thread scheduling will move the running stream processes between processors in a way that impacts results.
  • Despite attempting to use a large enough data set to avoid it, some amount of processor caching will inflate results.

If the variation of results at high core counts is high, running the program multiple times and considering the worst results seen at higher thread counts is recommended. The results listed above have included some work to try to eliminate incorrect data from these processors, but that may not have been entirely successful. For example, the 4 X 6172 results show extremely high figures from 16 to 32 cores. Determining whether those are accurate is still a work in progress.
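One way to apply that recommendation by hand is to collect the Triad rate from each run and keep only the minimum. A small sketch with made-up sample numbers:

```shell
#!/bin/sh
# Keep the worst (lowest) Triad rate from several runs, as recommended
# above for noisy high-core-count results. The rates piped in below are
# made-up sample values.
worst() { sort -n | head -n 1; }
printf '52206\n48810\n47102\n' | worst
```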

Bugs

On some systems, the amount of memory selected for the stream array ends up exceeding how large of a block of RAM the operating system (or in some cases the compiler) is willing to allocate at once. This seems a particular issue on 32-bit operating systems, but even 64-bit ones are not immune.

If your system fails to compile stream with an error such as this:

stream.c:(.text+0x34): relocation truncated to fit: R_X86_64_32S against `.bss'

After hitting this error, stream-scaling will try to compile stream again using the gcc "-mcmodel=large" option, which lets the program use larger data structures. If you are using a new enough version of the gcc compiler, believed to be at least version 4.4, the program will then run normally; you can ignore these "relocation truncated" warnings.

If you have both a large amount of cache (and therefore need a matching large block of memory) and an older version of gcc, the second compile attempt will also fail, with the following error:

stream.c:1: sorry, unimplemented: code model ‘large’ not supported yet

In that case, it is unlikely you will get accurate results from stream-scaling. You can try it anyway by manually decreasing the size of the array until the program compiles and links. A manual compile can be done like this:

gcc -O3 -DN=130000000 -fopenmp stream.c -o stream

And then reducing the -DN value until compilation is successful. After that upper limit is determined, adjust the setting for MAX_ARRAY_SIZE at the beginning of the stream-scaling program to reflect it. An upper limit on the stream array size of 130M as shown here allocates approximately 3GB of memory for the test array, with 4GB being the normal limit for 32-bit structures.
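That reduction can be automated by halving the size until a trial compile succeeds. A sketch, where fake_compile is a stand-in for the real gcc invocation (shown in the comment) so the search logic itself can be demonstrated:

```shell
#!/bin/sh
# Sketch of automating the -DN reduction described above: halve a
# starting value until the supplied "try" command succeeds, then report
# the first size that worked.
find_max_n() {
    try=$1; n=$2
    while [ "$n" -gt 0 ]; do
        if "$try" "$n"; then echo "$n"; return 0; fi
        n=$((n / 2))
    done
    return 1
}

# In practice "try" would be a compile attempt, e.g.:
#   try_compile() { gcc -O3 -DN="$1" -fopenmp stream.c -o stream 2>/dev/null; }
# Stand-in for demonstration: pretend sizes above 50M fail to compile.
fake_compile() { [ "$1" -le 50000000 ]; }

find_max_n fake_compile 130000000
```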

The fixes for this issue are new, and it is possible a problem here still exists. If you have gcc version 4.4 or later but stream-scaling still won't compile correctly, a problem report to the author would be appreciated. It's not clear yet why the exact cut-off value varies on some systems, or whether there are systems where the improved dynamic allocation logic may not be sufficient.

Documentation

The documentation (README.rst) for the program is in ReST markup. Tools that operate on ReST can be used to produce versions formatted for other purposes, such as rst2html to make an HTML version.

Contact

The project is hosted at http://github.com/gregs1104/stream-scaling

If you have any hints, changes or improvements, please contact the author.

Credits

The sample results given in this file have benefitted from private contributions all over the world. Most submissions ask to remain anonymous.

The multiple run averaging programs were originally contributed by Ben Bleything <ben@bleything.net>

License

stream-scaling is licensed under a standard 3-clause BSD license.

Copyright (c) 2010-2015, Gregory Smith All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  • Neither the name of the author nor the names of contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
