likwid-topology confused by non-standard core assignments? #46

Closed
gjbex opened this issue Aug 4, 2016 · 17 comments
@gjbex

gjbex commented Aug 4, 2016

This is about likwid 4.1.1 (release), built to use hwloc that comes with it (config.mk included for completeness).

For some obscure reason, the assignment of processors to physical id/core id is not what one would expect on Intel hardware. On a dual-socket machine with 12 cores per socket (Haswell E5-2680 v3) and hyperthreading disabled, one would normally expect:
0 -> 0:0
1 -> 0:1
..
11 -> 0:11
12 -> 1:0
13 -> 1:1
...
23 -> 1:11
The left-hand number is the processor, the first right-hand number is the physical id (socket), and the second is the core id, all as reported by /proc/cpuinfo.
On some machines, however, we get:
0 -> 0:0
1 -> 0:2
2 -> 0:4
..
5 -> 0:10
6 -> 1:0
7 -> 1:2
...
11 -> 1:10
12 -> 0:1
13 -> 0:3
..
17 -> 0:11
18 -> 1:1
19 -> 1:3
...
23 -> 1:11
Obviously, it is not what we want, but that is our problem.
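For readers who want to produce the same "processor -> physical id:core id" listing on their own node, here is a minimal sketch. It assumes the x86 Linux /proc/cpuinfo layout (blank-line separated blocks with `processor`, `physical id` and `core id` fields); `parse_cpuinfo` and `SAMPLE` are hypothetical names for illustration, and the sample text is a hand-written excerpt, not the attached cpuinfo_out.txt.

```python
import re


def parse_cpuinfo(text):
    """Return a list of (processor, physical_id, core_id) tuples.

    Assumes x86 Linux /proc/cpuinfo formatting: one block per logical
    processor, blocks separated by blank lines, "key : value" lines.
    Missing fields are reported as -1.
    """
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" in fields:
            entries.append((
                int(fields["processor"]),
                int(fields.get("physical id", -1)),
                int(fields.get("core id", -1)),
            ))
    return entries


# Hand-written excerpt for illustration only (not real output).
SAMPLE = """processor\t: 0
physical id\t: 0
core id\t\t: 0

processor\t: 1
physical id\t: 0
core id\t\t: 2
"""

for cpu, socket, core in parse_cpuinfo(SAMPLE):
    print(f"{cpu} -> {socket}:{core}")
```

On a real node, the text of /proc/cpuinfo can be passed to `parse_cpuinfo` instead of `SAMPLE`.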

However, when likwid-topology is run on such a node, it seems to get confused. It reports:
Sockets: 2
Cores per socket: 6
Threads per core: 2
Apparently, the odd round-robin assignment tricks likwid-topology into assuming that hyperthreading is enabled. The complete output of likwid-topology is attached.

The lscpu and lstopo (version 1.10.1) reports are consistent with /proc/cpuinfo, though (the output of both is attached as well). So it would seem that the information coming from hwloc is somehow misinterpreted.
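To make the expected numbers concrete, the sketch below derives sockets, cores per socket and threads per core by counting distinct (physical id, core id) pairs, which does not depend on how processors are numbered. This is only an illustration of the counting argument, not LIKWID's or hwloc's actual algorithm; `summarize` and `round_robin` are hypothetical names, and the code assumes all sockets have the same number of cores.

```python
def summarize(layout):
    """Return (sockets, cores_per_socket, threads_per_core).

    layout is a list of (processor, physical_id, core_id) tuples.
    Assumes a symmetric machine (same core count on every socket).
    """
    sockets = {socket for _, socket, _ in layout}
    cores = {(socket, core) for _, socket, core in layout}
    return (
        len(sockets),
        len(cores) // len(sockets),
        len(layout) // len(cores),
    )


# Round-robin numbering as described in this issue: processors 0-11
# get the even core ids on socket 0 and then socket 1, processors
# 12-23 get the odd core ids in the same socket order.
round_robin = [
    (cpu, (cpu // 6) % 2, 2 * (cpu % 6) + cpu // 12)
    for cpu in range(24)
]

print(summarize(round_robin))  # prints (2, 12, 1)
```

With this counting, the round-robin layout still yields 2 sockets, 12 cores per socket and 1 thread per core, which matches what lscpu and lstopo report.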

Thanks, best regards, Geert Jan Bex

lscpu_out.txt
cpuinfo_out.txt
likwid_topology_out.txt
lstopo_out.txt
config.txt

@TomTheBear
Member

Thanks for the perfect bug documentation. I will check it next week.

@TomTheBear
Member

It seems that the code I added to handle non-standard core assignments on AMD systems cannot cope with Intel systems. Can you please supply the output of likwid-topology -V 3?

@gjbex
Author

gjbex commented Aug 16, 2016

Sure, no problem, you'll find it as an attachment.
likwid-topology_out_V3.txt

@TomTheBear
Member

I cannot clearly identify where the problem comes from. Since you have hwloc installed, can you send me the topology tarball so I can run likwid-topology virtually on your hardware:
hwloc-gather-topology <tarballname>.

@gjbex
Author

gjbex commented Aug 16, 2016

The tarball is attached.

@gjbex
Author

gjbex commented Aug 16, 2016

Or not :) GitHub doesn't like bz2. Now it is zipped.
hwloc_gather_topology.tar.zip

@TomTheBear
Member

Hi, thanks for supplying the tarball. Please try the patch: likwid-non-std-cores.zip. Basically, it only skips the AMD fixup code on Intel systems.

@gjbex
Author

gjbex commented Aug 17, 2016

Thanks for the patch, but no cigar, I'm afraid. I've attached likwid-topology's output.

Just so that we're on the same page:
$ tar xaf likwid-4.1.1.tar.gz
$ cd likwid-4.1.1
$ cp ../likwid-non-std-cores.patch .
$ patch -p1 < likwid-non-std-cores.patch
Fuss with config.mk, same as before.
$ make
likwid_patched_out.txt

@TomTheBear
Member

Hmm, not the result I hoped for. I played around a bit with the tarball you supplied, and it looks fine. I attached the topology_hwloc.c file; just change the file suffix from txt to c and copy it into the src folder.
Please try that version. It performs additional checks in case a field is not found in hwloc (CPU family, CPU model, ...). If it doesn't work, please send me the output with -V 3 again.
topology_hwloc.txt

@gjbex
Author

gjbex commented Aug 24, 2016

Unfortunately, no go. The result is still the same. You'll find the -V 3 output attached.
likwid_patched2_out.txt

@TomTheBear
Member

Are you sure that you rebuilt LIKWID properly? In the topology_hwloc.c I sent, the debug print is on line 329, but in your -V 3 output it is on line 236. Always run make distclean && make if you are not starting from scratch.
Also the lines
DEBUG - [hwloc_init_nodeTopology:236] HWLOC Thread Pool PU -1 Thread -1 Core -1 Socket -1 inCpuSet 0
are strange as this is a bug in 4.1.1 but shouldn't happen with the sent topology_hwloc.c
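The line-number comparison above can be automated. The following is a minimal sketch that extracts the `[function:line]` tag from debug output and checks it against the line expected from the source file in use. It assumes the debug lines have exactly the `DEBUG - [function:line]` shape quoted in this thread; `debug_locations` and `matches_source` are hypothetical helper names.

```python
import re

# Matches tags such as "DEBUG - [hwloc_init_nodeTopology:236]".
DEBUG_RE = re.compile(r"DEBUG - \[(\w+):(\d+)\]")


def debug_locations(output):
    """Return the set of (function, line) pairs found in debug output."""
    return {(m.group(1), int(m.group(2))) for m in DEBUG_RE.finditer(output)}


def matches_source(output, function, expected_line):
    """True if every debug print from `function` comes from expected_line."""
    lines = {line for func, line in debug_locations(output) if func == function}
    return lines == {expected_line}


log = ("DEBUG - [hwloc_init_nodeTopology:236] HWLOC Thread Pool PU -1 "
       "Thread -1 Core -1 Socket -1 inCpuSet 0")

print(debug_locations(log))
print(matches_source(log, "hwloc_init_nodeTopology", 329))  # prints False
```

A `False` result here indicates that the running binary or library was not built from the source file that was expected.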

@gjbex
Author

gjbex commented Aug 24, 2016

In fact, I removed the directory, untarred, added my config.mk, replaced the topology_hwloc.c.
I verified the md5sums of the file you sent, and what is currently in the likwid/src:
$ md5sum src/topology_hwloc.c
70a0d091edf39f305836d751c8dc6c25 src/topology_hwloc.c
$ md5sum ~/Downloads/topology_hwloc.c
70a0d091edf39f305836d751c8dc6c25 ~/Downloads/topology_hwloc.c

So, are you sure I got the right file? ;)
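The same checksum comparison can be scripted, for instance when several files need checking at once. This is a small sketch using Python's standard `hashlib`; `file_md5` and `same_file` are hypothetical helper names, and MD5 is used only because it is the checksum quoted above.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(path_a, path_b):
    """True if both files have identical MD5 digests."""
    return file_md5(path_a) == file_md5(path_b)
```

Note that identical sources only show that the right file was copied into the tree; they say nothing about which compiled library is loaded at run time.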

@TomTheBear
Member

I just checked the file from above, and it should behave differently. Have you installed the patched version? Or could there be another liblikwid.so in your LD_LIBRARY_PATH that is used instead of the patched one?
This is the core assignment reported by likwid-topology (4.1.1 with the topology_hwloc.c from above) with your supplied hwloc tarball:

********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets:        2
Cores per socket:   12
Threads per core:   1
--------------------------------------------------------------------------------
HWThread    Thread      Core        Socket      Available
0           0           0           0
1           0           2           0
2           0           4           0
3           0           8           0
4           0           10          0
5           0           12          0
6           0           0           1
7           0           2           1
8           0           4           1
9           0           8           1
10          0           10          1
11          0           12          1
12          0           1           0
13          0           3           0
14          0           5           0
15          0           9           0
16          0           11          0
17          0           13          0
18          0           1           1
19          0           3           1
20          0           5           1
21          0           9           1
22          0           11          1
23          0           13          1

Some details differ because they are gathered from the actual system and not from the tarball (for example, there is no * in the Available column). But the core assignment should be valid.

@gjbex
Author

gjbex commented Aug 24, 2016

Hm, no. This is what I do:

$ rm -rf likwid-4.1.1
$ tar xaf likwid-4.1.1.tar.gz 
$ cp config.mk likwid-4.1.1
$ cd likwid-4.1.1
$ cp ~/Downloads/topology_hwloc.c src/topology_hwloc.c 
$ make &> make.log
$ chmod u+x likwid-topology 
$ sudo ssh some_haswell_node
# cd /someplace/likwid-4.1.1
# ./likwid-topology

(Since this is a public repository, I replaced a node name and a path by something uninformative.)

@TomTheBear
Member

There is the problem. You only make likwid-topology executable but don't set the LD_LIBRARY_PATH to the built liblikwid.so.
Try the following:

$ rm -rf likwid-4.1.1
$ tar xaf likwid-4.1.1.tar.gz 
$ cp config.mk likwid-4.1.1
$ cd likwid-4.1.1
$ cp ~/Downloads/topology_hwloc.c src/topology_hwloc.c 
$ make &> make.log
$ make local    (prints the new LD_LIBRARY_PATH, makes all scripts executable and fixes some paths in the Lua scripts)
$ export LD_LIBRARY_PATH=/someplace/likwid-4.1.1:$LD_LIBRARY_PATH
$ sudo ssh some_haswell_node
# cd /someplace/likwid-4.1.1
# export LD_LIBRARY_PATH=/someplace/likwid-4.1.1:$LD_LIBRARY_PATH (just copy from above)
# ./likwid-topology

With your setup, likwid-topology uses the already installed library, probably <PREFIX in config.mk>/liblikwid.so. You have to call make local because the path to the library needs to be changed in likwid.lua.
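To see which liblikwid.so an LD_LIBRARY_PATH setting would pick up first, a rough check can be scripted. The sketch below only walks the directories listed in LD_LIBRARY_PATH in order; it is an approximation and an assumption about what is useful here, since the real dynamic loader also consults rpath/runpath entries, the ld.so cache and default directories, and (as noted above) likwid.lua carries its own library path. `first_library_hit` is a hypothetical helper name.

```python
import os


def first_library_hit(name, ld_library_path):
    """Return the first existing <dir>/<name> along an LD_LIBRARY_PATH
    style string, or None if no listed directory contains the file.

    Empty entries are skipped here for simplicity; the real loader
    treats them as the current directory.
    """
    for directory in ld_library_path.split(os.pathsep):
        if not directory:
            continue
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None


print(first_library_hit("liblikwid.so", os.environ.get("LD_LIBRARY_PATH", "")))
```

If the printed path is not inside the freshly built source tree, or nothing is found at all, the test run is most likely using a previously installed library.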

@TomTheBear
Member

Any new finding?

@gjbex
Author

gjbex commented Sep 13, 2016

Dear Thomas,

This is weird: I answered quite a while ago, but apparently my reply got lost.

The problem is indeed solved. I had not taken into account that some paths would be hardcoded during the installation, and hence my tests were wrong.

Thanks for solving it, best regards, -gjb-

On Tue, Sep 13, 2016 at 10:56 AM, Thomas Roehl notifications@github.com
wrote:

Any new finding?


