
Something changed in lspci and the grep is failing is my guess #5

Closed · JStateson opened this issue Dec 7, 2019 · 72 comments
Labels: bug (Something isn't working)

@JStateson

Ubuntu 18.04
lspci version unknown (no -version argument)
from

lspci | grep -E \"^.*(VGA|Display).*\[AMD\/ATI\].*$\" | grep -Eo \"^([0-9a-fA-F]+:[0-9a-fA-F]+.[0-9a-fA-F])\"

to


lspci | grep -E \"^.*(VGA|Display).*$\" | grep -Eo \"^([0-9a-fA-F]+:[0-9a-fA-F]+.[0-9a-fA-F])\"

Removing the ATI and AMD match fixed the problem of the first grep feeding "null" into the second grep.

@Ricks-Lab
Owner

@JStateson Sorry about the late response to this. I think the notification came during travels, so I missed it. I think this problem is associated with the AMD-only energy measurements. I will need some time to dig into it and implement a fix.

@Ricks-Lab Ricks-Lab self-assigned this Jan 13, 2020
@Ricks-Lab Ricks-Lab added the bug label Jan 13, 2020
@JStateson
Author

When I was an undergrad (a century ago?) it was fun to see who could write the shortest program to translate Morse code. I thought my 3-line program was good, but the instructor showed us his 1-line APL program that did the trick. Unlike your grep above, his APL was understandable to me.

@Ricks-Lab
Owner

When I was an undergrad (a century ago?) it was fun to see who could write the shortest program to translate Morse code. I thought my 3-line program was good, but the instructor showed us his 1-line APL program that did the trick. Unlike your grep above, his APL was understandable to me.

The grep statement does 3 things: it looks for all GPUs, then selects only AMD GPUs from those results, and then gets the PCIe ID from the final results. Maybe it would be better to do this in 3 steps. Originally, I only intended to run this when the --energy option is used, which is only applicable for AMD at this time, but then I pulled it earlier in the flow to determine devmap, which maps BOINC device numbers to Linux card numbers. I will work on this with the plan of eventually including energy measurements for NVidia. Can you help by providing the output of the grep for NVidia GPUs?
lspci | grep -E \"^.*(VGA|Display).*$\"

The code on master has already been modified to work correctly, but I still want to make the longer-term improvements.
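
For illustration, the same pipeline split into three explicit steps might look like the sketch below. This is only a sketch, not the actual benchMT code; the function name and the use of Python's subprocess/re here are assumptions.

```python
# Illustrative 3-step version of the one-line grep; not the benchMT implementation.
import re
import subprocess

def amd_gpu_pcie_ids():
    """Return PCIe IDs (e.g. '01:00.0') of AMD/ATI VGA/Display devices found by lspci."""
    lspci_lines = subprocess.check_output(['lspci']).decode().splitlines()
    # Step 1: keep only display-class devices.
    gpus = [ln for ln in lspci_lines if re.search(r'VGA|Display', ln)]
    # Step 2: keep only AMD/ATI devices (the filter that yields nothing on NVidia-only rigs).
    amd_gpus = [ln for ln in gpus if '[AMD/ATI]' in ln]
    # Step 3: pull the PCIe ID off the front of each remaining line.
    ids = []
    for ln in amd_gpus:
        m = re.match(r'([0-9a-fA-F]+:[0-9a-fA-F]+\.[0-9a-fA-F])', ln)
        if m:
            ids.append(m.group(1))
    return ids

print(amd_gpu_pcie_ids())
```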

@JStateson
Author

JStateson commented Jan 18, 2020

The following did nothing:

jstateson@h110btc:~$ lspci | grep -E \"^.*(VGA|Display).*$\"
jstateson@h110btc:~$

This is what it had to work with on H110BTC (18.04, 9 NVidia GPUs and 1 Intel). Note that two of the NVidia cards are designated as 3D controllers and not VGA.

jstateson@h110btc:~$ lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x16) (rev 07)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
00:14.0 USB controller: Intel Corporation 100 Series/C230 Series Chipset Family USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation 100 Series/C230 Series Chipset Family Thermal Subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation 100 Series/C230 Series Chipset Family MEI Controller #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Q170/Q150/B150/H170/H110/Z170/CM236 Chipset SATA Controller [AHCI Mode] (rev 31)
00:1c.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #5 (rev f1)
00:1c.6 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #7 (rev f1)
00:1c.7 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #8 (rev f1)
00:1d.0 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #9 (rev f1)
00:1d.1 PCI bridge: Intel Corporation 100 Series/C230 Series Chipset Family PCI Express Root Port #10 (rev f1)
00:1f.0 ISA bridge: Intel Corporation H110 Chipset LPC/eSPI Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation 100 Series/C230 Series Chipset Family Power Management Controller (rev 31)
00:1f.3 Audio device: Intel Corporation 100 Series/C230 Series Chipset Family HD Audio Controller (rev 31)
00:1f.4 SMBus: Intel Corporation 100 Series/C230 Series Chipset Family SMBus (rev 31)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V (rev 31)
01:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
02:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
03:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
04:00.0 3D controller: NVIDIA Corporation GP106 [P106-100] (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
05:00.1 Audio device: NVIDIA Corporation GP104 High Definition Audio Controller (rev a1)
06:00.0 PCI bridge: ASMedia Technology Inc. Device 1187
07:01.0 PCI bridge: ASMedia Technology Inc. Device 1187
07:02.0 PCI bridge: ASMedia Technology Inc. Device 1187
07:03.0 PCI bridge: ASMedia Technology Inc. Device 1187
07:04.0 PCI bridge: ASMedia Technology Inc. Device 1187
07:05.0 PCI bridge: ASMedia Technology Inc. Device 1187
07:06.0 PCI bridge: ASMedia Technology Inc. Device 1187
07:07.0 PCI bridge: ASMedia Technology Inc. Device 1187
08:00.0 3D controller: NVIDIA Corporation GP106 [P106-090] (rev a1)
0a:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
0a:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
0b:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
0b:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)
0e:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
0e:00.1 Audio device: NVIDIA Corporation GP106 High Definition Audio Controller (rev a1)

On an 18.04 system with three AMD boards, lspci showed the following:

jstateson@jysdualxeon:~$ lspci
00:00.0 Host bridge: Intel Corporation 5500 I/O Hub to ESI Port (rev 22)
00:01.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 1 (rev 22)
00:03.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 3 (rev 22)
00:07.0 PCI bridge: Intel Corporation 5520/5500/X58 I/O Hub PCI Express Root Port 7 (rev 22)
00:09.0 PCI bridge: Intel Corporation 7500/5520/5500/X58 I/O Hub PCI Express Root Port 9 (rev 22)
00:13.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub I/OxAPIC Interrupt Controller (rev 22)
00:14.0 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub System Management Registers (rev 22)
00:14.1 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub GPIO and Scratch Pad Registers (rev 22)
00:14.2 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Control Status and RAS Registers (rev 22)
00:14.3 PIC: Intel Corporation 7500/5520/5500/X58 I/O Hub Throttle Registers (rev 22)
00:16.0 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.1 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.2 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.3 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.4 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.5 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.6 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:16.7 System peripheral: Intel Corporation 5520/5500/X58 Chipset QuickData Technology Device (rev 22)
00:1a.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #4
00:1a.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #5
00:1a.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #6
00:1a.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #2
00:1c.0 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 1
00:1c.4 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 5
00:1c.5 PCI bridge: Intel Corporation 82801JI (ICH10 Family) PCI Express Root Port 6
00:1d.0 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #1
00:1d.1 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #2
00:1d.2 USB controller: Intel Corporation 82801JI (ICH10 Family) USB UHCI Controller #3
00:1d.7 USB controller: Intel Corporation 82801JI (ICH10 Family) USB2 EHCI Controller #1
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
00:1f.0 ISA bridge: Intel Corporation 82801JIR (ICH10R) LPC Interface Controller
00:1f.2 SATA controller: Intel Corporation 82801JI (ICH10 Family) SATA AHCI Controller
00:1f.3 SMBus: Intel Corporation 82801JI (ICH10 Family) SMBus Controller
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev ef)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 580]
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev ef)
03:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 580]
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev ef)
04:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 580]
06:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
07:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
08:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)
fe:00.0 Host bridge: Intel Corporation Xeon 5600 Series QuickPath Architecture Generic Non-core Registers (rev 02)
fe:00.1 Host bridge: Intel Corporation Xeon 5600 Series QuickPath Architecture System Address Decoder (rev 02)
fe:02.0 Host bridge: Intel Corporation Xeon 5600 Series QPI Link 0 (rev 02)
fe:02.1 Host bridge: Intel Corporation Xeon 5600 Series QPI Physical 0 (rev 02)
fe:02.2 Host bridge: Intel Corporation Xeon 5600 Series Mirror Port Link 0 (rev 02)
fe:02.3 Host bridge: Intel Corporation Xeon 5600 Series Mirror Port Link 1 (rev 02)
fe:02.4 Host bridge: Intel Corporation Xeon 5600 Series QPI Link 1 (rev 02)
fe:02.5 Host bridge: Intel Corporation Xeon 5600 Series QPI Physical 1 (rev 02)
fe:03.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Registers (rev 02)
fe:03.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Target Address Decoder (rev 02)
fe:03.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller RAS Registers (rev 02)
fe:03.4 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Test Registers (rev 02)
fe:04.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Control (rev 02)
fe:04.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Address (rev 02)
fe:04.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Rank (rev 02)
fe:04.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Thermal Control (rev 02)
fe:05.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Control (rev 02)
fe:05.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Address (rev 02)
fe:05.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Rank (rev 02)
fe:05.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Thermal Control (rev 02)
fe:06.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Control (rev 02)
fe:06.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Address (rev 02)
fe:06.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Rank (rev 02)
fe:06.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Thermal Control (rev 02)
ff:00.0 Host bridge: Intel Corporation Xeon 5600 Series QuickPath Architecture Generic Non-core Registers (rev 02)
ff:00.1 Host bridge: Intel Corporation Xeon 5600 Series QuickPath Architecture System Address Decoder (rev 02)
ff:02.0 Host bridge: Intel Corporation Xeon 5600 Series QPI Link 0 (rev 02)
ff:02.1 Host bridge: Intel Corporation Xeon 5600 Series QPI Physical 0 (rev 02)
ff:02.2 Host bridge: Intel Corporation Xeon 5600 Series Mirror Port Link 0 (rev 02)
ff:02.3 Host bridge: Intel Corporation Xeon 5600 Series Mirror Port Link 1 (rev 02)
ff:02.4 Host bridge: Intel Corporation Xeon 5600 Series QPI Link 1 (rev 02)
ff:02.5 Host bridge: Intel Corporation Xeon 5600 Series QPI Physical 1 (rev 02)
ff:03.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Registers (rev 02)
ff:03.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Target Address Decoder (rev 02)
ff:03.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller RAS Registers (rev 02)
ff:03.4 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Test Registers (rev 02)
ff:04.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Control (rev 02)
ff:04.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Address (rev 02)
ff:04.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Rank (rev 02)
ff:04.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 0 Thermal Control (rev 02)
ff:05.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Control (rev 02)
ff:05.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Address (rev 02)
ff:05.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Rank (rev 02)
ff:05.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 1 Thermal Control (rev 02)
ff:06.0 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Control (rev 02)
ff:06.1 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Address (rev 02)
ff:06.2 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Rank (rev 02)
ff:06.3 Host bridge: Intel Corporation Xeon 5600 Series Integrated Memory Controller Channel 2 Thermal Control (rev 02)

@Ricks-Lab
Owner

@JStateson Thanks for providing the details! This will help in developing support for HW that I don't have. I see you have some experience with BOINC, so I have a question. It would be very useful if I could associate a BOINC device number with a Linux card number, but I have not been able to find details on how BOINC assigns the number. From observations so far, it appears to be the reverse of the Linux card numbers, but my few systems are not enough to know this for sure. Let me know if you have insight into this; otherwise, maybe I can produce output that can be validated on your systems.

@JStateson
Author

JStateson commented Jan 18, 2020

I have spent some time looking at this myself and have not figured it out.

The problem I was trying to solve was identifying which board was causing a computation failure when there are multiple identical GPUs. The solution I had been using was to manually stop the GPU fan from moving and see which board showed a temperature increase. This is obviously less than ideal, but it does work in both Windows and Linux using BoincTasks' capability of displaying temperatures. I would like to do this programmatically.

What I have learned:
BOINC does not ask the device manager (Windows) or the kernel (Linux) to enumerate the video boards.
Instead, BOINC runs a GPU-detect app that uses CUDA (NVidia only) and OpenCL (NVidia, ATI, Intel) to interrogate the boards and write out what is found. The file is named coproc_info.xml and it is read back in by BOINC to "see what is there". BOINC sorts what is reported in descending order of FLOPS, so that d0 is the best board and d1, d2, etc. follow in decreasing FLOPS. This method is generally correct and, more often than not, d0 is the best board.
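
A minimal sketch of that ordering step, using the peak-GFLOPS figures from the BOINC event log further down this thread (illustrative only; this is not BOINC source code and the field names are made up):

```python
# Illustrative sketch of the sort described above; not BOINC code.
detected = [
    {'name': 'GeForce GTX 1060 3GB', 'peak_flops': 3.936e12},
    {'name': 'GeForce GTX 1070',     'peak_flops': 6.561e12},
    {'name': 'P106-090',             'peak_flops': 1.960e12},
]

# Descending order of peak FLOPS: index 0 becomes d0, the "best" board.
ordered = sorted(detected, key=lambda d: d['peak_flops'], reverse=True)
for i, dev in enumerate(ordered):
    print('d{}: {}'.format(i, dev['name']))
```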

NVidia's CUDA reports bus IDs of 1..6 for 6 boards, but
their OpenCL package shows an "opencl_driver_index" of 0..5.

ATI shows opencl_driver_index also starting at 0.

The module that reads in the info and does the sorting does not have an entry in the C++ structure such as "bus id" or "driver index", nor even the name of the board such as gtx1660ti. All that is lost once gpu_detect returns.

Compounding the problem is the numbering of the boards by the NVidia driver. nvidia-smi lists boards using 0..5 (for 6 boards).
Note that GPU #1 is the GTX 1660 below.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 00000000:01:00.0 Off |                  N/A |
| 99%   42C    P2    97W / 151W |   1483MiB /  8116MiB |     87%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 166...  Off  | 00000000:02:00.0 Off |                  N/A |
|100%   58C    P2    96W / 120W |   1332MiB /  5944MiB |     91%      Default |
+-------------------------------+----------------------+----------------------+

but coproc_info shows bus id of "2"

<coproc_cuda>
   <count>1</count>
   <name>GeForce GTX 1660 Ti</name>
 ...
<pci_info>
   <bus_id>2</bus_id>
   <device_id>0</device_id>
   <domain_id>0</domain_id>
</pci_info>

but also device id of "0"

   <nvidia_opencl>
      <name>GeForce GTX 1660 Ti</name>
      <vendor>NVIDIA Corporation</vendor>

      <device_num>0</device_num>
      <peak_flops>5529600000000.000000</peak_flops>
      <opencl_available_ram>4164943872.000000</opencl_available_ram>
      <opencl_device_index>0</opencl_device_index>
      <warn_bad_cuda>0</warn_bad_cuda>
   </nvidia_opencl>

The value of 0 for the OpenCL ID does not correspond to the nvidia-smi table, but the bus id of 2 does seem to match that table. Unfortunately, this fails when multiplexing a slot. If a 4-in-1 riser is used in a slot, the numbers are no longer 1..6; instead there is a jump to (for example) 12 and a renumbering of the boards that come "after" the slot the 4-in-1 riser was in.

The net effect of all this is that the GPUs d0, d1, d2, etc. associated with work units cannot easily be matched back to the board, or to the slot the board is in.

Some ideas I was looking at:

  1. Command the client to upload the coproc_info.xml file for analysis (see the sketch after this list)
  2. Mod the client to store bus id, driver index and board name in a structure for access
  3. Send the bus id of a failing board to the manager so it can remove that board from the
    pool of available boards, i.e. when the driver signals "Unable to determine the device
    handle for GPU 0000:01:00.0: GPU is lost. Reboot the system to recover this GPU*" the value
    of 0000:01:00.0 can be translated somehow to "d3" and "d3" will no longer be assigned tasks.
    I brought this up as a talking point here
    https://forum.efmer.com/index.php?topic=1394.msg8047#msg8047
    and down at the bottom of here
    Computing prefs 2.0 BOINC/boinc#2993

EDIT: the efmer URL was corrected
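
As a sketch of ideas 1 and 2 above, the bus IDs could be pulled straight out of coproc_info.xml and printed as lspci-style slot IDs (the <bus_id> values are decimal, lspci slots are hex). This is only an illustration based on the snippets above; the path, and the idea that the file's order matches BOINC's d0, d1, ... order, are assumptions, not verified behavior.

```python
# Sketch: map BOINC device order to lspci-style PCIe slots via coproc_info.xml.
# Assumes the file order matches BOINC's d0, d1, ... order and device/function 0.
import re

with open('/var/lib/boinc/coproc_info.xml') as f:
    bus_ids = [int(b) for b in re.findall(r'<bus_id>(\d+)</bus_id>', f.read())]

for dev_num, bus in enumerate(bus_ids):
    # coproc_info.xml stores the bus number in decimal; lspci shows it in hex (10 -> 0a).
    print('d{} -> PCIe slot {:02x}:00.0'.format(dev_num, bus))
```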

@JStateson
Author

There does not appear to be any unique identifying serial number on any NVidia board. I recall reading somewhere that the manufacturers decided long ago not to put a serial number in, as that might be used to prevent software from working on a replacement board unless a "fee" was paid to the software developer. I had the idea of re-flashing the BIOS and incrementing the "date" so as to be able to identify which board had a problem. Another thought was to run a small "performance" test under direction of the gpu_detect module and have a program on the Linux or Windows remote system determine which one of the 6, 8 or 19 GPUs was the one running the test.

@Ricks-Lab
Owner

For now, I am rewriting the part of the code that generates the list of GPUs. Originally, I used lshw, but later I added the capability to estimate energy used, so I had to take parts from my amdgpu-utils and use lspci and driver files; this was just added on to what was already there. I plan to make the new GPU list the core of how benchMT uses GPU compute resources. This should make further improvements much easier.

To make the association in the past, I have used the benchMT command line option to run on a specific device, used amdgpu-utils to see which card number shows loading, and built a devmap which I stored in the benchCFG file. I hope a more generic implementation would allow it to work with GPUs other than AMD. It will take some time for the rewrite...

@Ricks-Lab
Owner

@JStateson
Since I only have AMD GPUs, can you help collect the output of lspci for NVidia and Intel GPUs?
lspci -k -s 43:00.0
where 43:00.0 is replaced by the PCIe ID of your cards?

@KeithMyers

There does not appear to be any unique identifying serial number on any NVidia board. I recall reading somewhere that the manufacturers decided long ago not to put a serial number in, as that might be used to prevent software from working on a replacement board unless a "fee" was paid to the software developer. I had the idea of re-flashing the BIOS and incrementing the "date" so as to be able to identify which board had a problem. Another thought was to run a small "performance" test under direction of the gpu_detect module and have a program on the Linux or Windows remote system determine which one of the 6, 8 or 19 GPUs was the one running the test.

As far as I know... EVERY Nvidia card gets an ID. EVGA has serial number stickers on the back of every card, for example. Also, every GPU in the system gets a unique GPU UUID that is a 32-character hexadecimal number.

@Ricks-Lab
Owner

There does not appear to be any unique identifying serial number on any NVidia board. I recall reading somewhere that the manufacturers decided long ago not to put a serial number in, as that might be used to prevent software from working on a replacement board unless a "fee" was paid to the software developer. I had the idea of re-flashing the BIOS and incrementing the "date" so as to be able to identify which board had a problem. Another thought was to run a small "performance" test under direction of the gpu_detect module and have a program on the Linux or Windows remote system determine which one of the 6, 8 or 19 GPUs was the one running the test.

As far as I know... EVERY Nvidia card gets an ID. EVGA has serial number stickers on the back of every card, for example. Also, every GPU in the system gets a unique GPU UUID that is a 32-character hexadecimal number.

I know for AMD there is a unique_id device file that returns a hex number, but I don't think it is useful in mapping between BOINC device number and Linux card number.

@Ricks-Lab
Owner

@JStateson @KeithMyers
I have just posted a test version of benchMT on master. It is not completely functional and will just display a list of GPUs with some device details. Can you run it and post the results here?

Also, I am trying to figure out which hwmon file can be used to read current power. Maybe the name of a file in the card's hwmon directory will make it obvious. If so, please cat it and send me the details.

Thanks!

@JStateson
Author

JStateson commented Jan 19, 2020

@JStateson
Since I only have AMD GPUs, can you help collect the output of lspci for NVidia and Intel GPUs?
lspci -k -s 43:00.0
where 43:00.0 is replaced by the PCIe ID of your cards?

jstateson@jysdualxeon:~$ lspci -k | grep VGA
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev ef)
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev ef)
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev ef)
08:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)



jstateson@tb85-nvidia:~$ lspci -k | grep VGA
01:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation Device 2182 (rev a1)
06:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] (rev a1)

jjstateson@tb85-nvidia:~$ lspci -k | grep 3D
03:00.0 3D controller: NVIDIA Corporation GP102 [P102-100] (rev a1)
04:00.0 3D controller: NVIDIA Corporation GP102 [P102-100] (rev a1)
05:00.0 3D controller: NVIDIA Corporation GP102 [P102-100] (rev a1)



jstateson@h110btc:~$ lspci -k | grep VGA
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
01:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
02:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
03:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
05:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
0a:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
0b:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
0e:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)

jstateson@h110btc:~$ lspci -k | grep 3D
04:00.0 3D controller: NVIDIA Corporation GP106 [P106-100] (rev a1)
08:00.0 3D controller: NVIDIA Corporation GP106 [P106-090] (rev a1)

Will add the -s later for you.

@JStateson
Author

JStateson commented Jan 19, 2020

jstateson@h110btc:~$ lspci -k -s 00:02.0
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
        Subsystem: ASRock Incorporation HD Graphics 530
        Kernel driver in use: i915
        Kernel modules: i915
jstateson@h110btc:~$ lspci -k -s 01:00.0
01:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 6GB] (rev a1)
        Subsystem: eVga.com. Corp. GP106 [GeForce GTX 1060 6GB]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
jstateson@h110btc:~$ lspci -k -s 02:00.0
02:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
        Subsystem: Device 196e:11da
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
jstateson@h110btc:~$ lspci -k -s 03:00.0
03:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
        Subsystem: eVga.com. Corp. GP106 [GeForce GTX 1060 3GB]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
jstateson@h110btc:~$ lspci -k -s 05:00.0
05:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. GP104 [GeForce GTX 1070]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
jstateson@h110btc:~$ lspci -k -s 0a:00.0
0a:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
        Subsystem: eVga.com. Corp. GP106 [GeForce GTX 1060 3GB]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
jstateson@h110btc:~$ lspci -k -s 0b:00.0
0b:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
        Subsystem: Device 196e:11da
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
jstateson@h110btc:~$ lspci -k -s 0e:00.0
0e:00.0 VGA compatible controller: NVIDIA Corporation GP106 [GeForce GTX 1060 3GB] (rev a1)
        Subsystem: eVga.com. Corp. GP106 [GeForce GTX 1060 3GB]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
jstateson@h110btc:~$ lspci -k -s 04:00.0
04:00.0 3D controller: NVIDIA Corporation GP106 [P106-100] (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. GP106 [P106-100]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
jstateson@h110btc:~$ lspci -k -s 08:00.0
08:00.0 3D controller: NVIDIA Corporation GP106 [P106-090] (rev a1)
        Subsystem: ZOTAC International (MCO) Ltd. GP106 [P106-090]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

@JStateson
Author

jstateson@jysdualxeon:~$ lspci -k -s 01:00.0
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev ef)
        Subsystem: Gigabyte Technology Co., Ltd Radeon RX 570 Gaming 4G
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu
jstateson@jysdualxeon:~$ lspci -k -s 03:00.0
03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev ef)
        Subsystem: Gigabyte Technology Co., Ltd Radeon RX 570 Gaming 4G
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu
jstateson@jysdualxeon:~$ lspci -k -s 04:00.0
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] (rev ef)
        Subsystem: Gigabyte Technology Co., Ltd Radeon RX 570 Gaming 4G
        Kernel driver in use: amdgpu
        Kernel modules: amdgpu
jstateson@jysdualxeon:~$ lspci -k -s 08:01.0
08:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)
        Subsystem: Super Micro Computer Inc MGA G200eW WPCM450
        Kernel driver in use: mgag200
        Kernel modules: mgag200
jstateson@jysdualxeon:~$

@Ricks-Lab
Owner

Is the last one on the list a server VGA card with no compute capabilities?
08:01.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)

@Ricks-Lab
Owner

It looks like my assumptions are correct. It would be great if you could run the latest test version of benchMT on master. It will exit after displaying GPU information.

@JStateson
Author

Yeah, that was built in, no compute, unlike the Intel 530.

@JStateson
Author

will do later, thanks!

@JStateson
Author

JStateson commented Jan 19, 2020

@JStateson @KeithMyers
I have just posted a test version of benchMT on master. It is not completely functional and will just display a list of GPUs with some device details. Can you run it and post the results here?
Also, I am trying to figure out which hwmon file can be used to read current power. Maybe the name of a file in the card's hwmon directory will make it obvious. If so, please cat it and send me the details.
Thanks!

https://stateson.net/images/h110btc_benchMT.txt

https://stateson.net/images/tb85_benchMT.txt

https://stateson.net/images/dualxeon_benchMT.txt

The only obvious hits were in the system with the AMD cards

root@jysdualxeon:/# find . -name "hwmon"
./sys/kernel/debug/tracing/events/hwmon
./sys/class/hwmon
./sys/devices/platform/coretemp.1/hwmon
./sys/devices/platform/coretemp.0/hwmon
./sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/hwmon
./sys/devices/pci0000:00/0000:00:14.3/hwmon
./sys/devices/pci0000:00/0000:00:07.0/0000:03:00.0/hwmon
./sys/devices/pci0000:00/0000:00:09.0/0000:04:00.0/hwmon

It took a while, but I navigated to the first AMD card and got a directory listing:

jstateson@jysdualxeon:/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/hwmon/hwmon3$ ls -l
total 0
lrwxrwxrwx 1 root root    0 Jan 19 12:21 device -> ../../../0000:01:00.0
-rw-r--r-- 1 root root 4096 Jan 19 13:08 fan1_enable
-r--r--r-- 1 root root 4096 Jan 19 12:21 fan1_input
-r--r--r-- 1 root root 4096 Jan 19 12:21 fan1_max
-r--r--r-- 1 root root 4096 Jan 19 12:21 fan1_min
-rw-r--r-- 1 root root 4096 Jan 19 13:08 fan1_target
-r--r--r-- 1 root root 4096 Jan 19 13:08 freq1_input
-r--r--r-- 1 root root 4096 Jan 19 13:08 freq1_label
-r--r--r-- 1 root root 4096 Jan 19 13:08 freq2_input
-r--r--r-- 1 root root 4096 Jan 19 13:08 freq2_label
-r--r--r-- 1 root root 4096 Jan 19 12:21 in0_input
-r--r--r-- 1 root root 4096 Jan 19 12:21 in0_label
-r--r--r-- 1 root root 4096 Jan 19 12:21 name
drwxr-xr-x 2 root root    0 Jan 19 12:53 power
-r--r--r-- 1 root root 4096 Jan 19 12:21 power1_average
-rw-r--r-- 1 root root 4096 Jan 19 12:21 power1_cap
-r--r--r-- 1 root root 4096 Jan 19 13:08 power1_cap_max
-r--r--r-- 1 root root 4096 Jan 19 13:08 power1_cap_min
-rw-r--r-- 1 root root 4096 Jan 19 13:08 pwm1
-rw-r--r-- 1 root root 4096 Jan 19 13:08 pwm1_enable
-r--r--r-- 1 root root 4096 Jan 19 13:08 pwm1_max
-r--r--r-- 1 root root 4096 Jan 19 13:08 pwm1_min
lrwxrwxrwx 1 root root    0 Jan 19 12:21 subsystem -> ../../../../../../class/hwmon
-r--r--r-- 1 root root 4096 Jan 19 12:21 temp1_crit
-r--r--r-- 1 root root 4096 Jan 19 12:21 temp1_crit_hyst
-r--r--r-- 1 root root 4096 Jan 19 12:21 temp1_input
-r--r--r-- 1 root root 4096 Jan 19 12:21 temp1_label
-rw-r--r-- 1 root root 4096 Jan 19 12:20 uevent
jstateson@jysdualxeon:/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/hwmon/hwmon3$ cat name
amdgpu
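
For reference, power1_average in that listing looks like the relevant file for current power; a minimal sketch of reading it (assuming the amdgpu hwmon path from the listing above, which will differ per card and per boot; the hwmon sysfs ABI reports power in microwatts):

```python
# Sketch: read current power draw from the amdgpu hwmon node shown in the listing above.
HWMON = '/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/hwmon/hwmon3'

def read_power_watts(hwmon_path=HWMON):
    # power1_average is reported in microwatts per the hwmon sysfs ABI.
    with open(hwmon_path + '/power1_average') as f:
        return int(f.read().strip()) / 1e6

print('{:.1f} W'.format(read_power_watts()))
```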

I did not find anything similar on the NVidia systems:

root@h110btc:/# find . -name "hwmon"
./usr/src/linux-headers-5.0.0-36/drivers/hwmon
./usr/src/linux-headers-5.3.0-26-generic/include/config/hwmon
./usr/src/linux-headers-5.0.0-36-generic/include/config/hwmon
./usr/src/linux-headers-5.0.0-37/drivers/hwmon
./usr/src/linux-headers-5.0.0-37-generic/include/config/hwmon
./usr/src/linux-headers-5.3.0-26/drivers/hwmon
find: ‘./proc/1795/task/1795/net’: Invalid argument
find: ‘./proc/1795/net’: Invalid argument
./lib/modules/5.3.0-26-generic/kernel/drivers/hwmon
./lib/modules/5.0.0-37-generic/kernel/drivers/hwmon
find: ‘./run/user/1000/gvfs’: Permission denied
./sys/kernel/debug/tracing/events/hwmon
./sys/class/hwmon
./sys/devices/platform/coretemp.0/hwmon

Poking around on both NVidia systems shows only core (CPU) info in any hwmon folder. Maybe something needs to be installed? Unlike the dualxeon, both of the NVidia systems are missing the Intel CPU frequency settings:

jstateson@tb85-nvidia:~/temp$ sudo ./chg_intel_freq.sh
have to enter a frequency. Available frequencies are:
cat: /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies: No such file or directory

that file is missing

The Xeon shows the following:
jstateson@jysdualxeon:~/bt_bin$ sudo ./chg_freq.sh
have to enter a frequency. Available frequencies are:
3068000 3067000 2933000 2800000 2667000 2533000 2400000 2267000 2133000 2000000 1867000 1733000 1600000

So I suspect something was not installed on the NVidia systems, but I don't know what it was. I needed to be able to step the frequency down on the Xeon since, even with water cooling, it overheated during the summer.

@KeithMyers

KeithMyers commented Jan 19, 2020

Don't really see anything useful on my host with Nvidia cards.

Set specified gpu_devices: [0, 1, 2]
GPU_ITEM: uuid: a3096ba3d91646389037d9478b791f70
pcie_id: 08:00.0
model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
vendor: NVIDIA
driver: nvidiafb, nouveau, nvidia_drm, nvidia
card number: 0
BOINC Device number: -1
card path: /sys/class/drm/card0/device
hwmon path: None
Compute compatible: True
Energy compatible: False
GPU_ITEM: uuid: eea5ffb9069e48859f4a81ba5ed9f302
pcie_id: 0a:00.0
model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
vendor: NVIDIA
driver: nvidiafb, nouveau, nvidia_drm, nvidia
card number: 1
BOINC Device number: -1
card path: /sys/class/drm/card1/device
hwmon path: None
Compute compatible: True
Energy compatible: False
GPU_ITEM: uuid: 6e3be3488c1b44ae95c49e7ccb8e629e
pcie_id: 0b:00.0
model: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
vendor: NVIDIA
driver: nvidiafb, nouveau, nvidia_drm, nvidia
card number: 2
BOINC Device number: -1
card path: /sys/class/drm/card2/device
hwmon path: None
Compute compatible: True
Energy compatible: False

@JStateson
Author

Keith: how did you get that info?

@Ricks-Lab
Owner

@JStateson @KeithMyers Thanks for posting all of the details. This makes things much clearer for me. It seems the NVidia implementation is quite different. I wonder if this is why Torvalds complains about them! But at least I now have an easy way to find out if a GPU can support energy measurements. Perhaps there is another way, like nvidia-smi. Do you know if there is a command line argument to give power for a given pcie_id or card number?

I have implemented a --lsgpu option for the benchMT currently on master. This will just display the GPU details and exit. It requires that clinfo is installed to get full details. It would be interesting to see your results posted. Does either of you know if OpenCL exists in parallel with CUDA, or are they installed separately?

I think the next step is to see if there is a predictable association between BOINC device number and Linux card number. I have manually mapped them by running benchMT with a specified device and monitoring the cards with another app. If you have some time to do this, please let me know your results.

@JStateson
Author

JStateson commented Jan 20, 2020

from clinfo
https://stateson.net/images/tb85_clinfo.txt

jstateson@tb85-nvidia:~/Projects/benchMT$ ./benchMT --lsgpu
{}
GPU_ITEM: uuid: 1b1fc994708748ba88a84d220199064e
      pcie_id: 01:00.0
      model: NVIDIA Corporation GP104 [GeForce GTX 1070] (rev a1)
      vendor: NVIDIA
      driver: nvidiafb, nouveau, nvidia_drm, nvidia
      openCL Device: None
      openCL Version: None
      card number: 0
      BOINC Device number: -1
      card path: /sys/class/drm/card0/device
      hwmon path: None
      Compute compatible: True
      Energy compatible: False
GPU_ITEM: uuid: 86a5915084774475b0073277c21d4e71
      pcie_id: 02:00.0
      model: NVIDIA Corporation Device 2182 (rev a1)
      vendor: NVIDIA
      driver: nvidiafb, nouveau, nvidia_drm, nvidia
      openCL Device: None
      openCL Version: None
      card number: 1
      BOINC Device number: -1
      card path: /sys/class/drm/card1/device
      hwmon path: None
      Compute compatible: True
      Energy compatible: False
GPU_ITEM: uuid: 7cf82b55d9194805ba0b469cf7dbe7de
      pcie_id: 03:00.0
      model: NVIDIA Corporation GP102 [P102-100] (rev a1)
      vendor: NVIDIA
      driver: nvidiafb, nouveau, nvidia_drm, nvidia
      openCL Device: None
      openCL Version: None
      card number: 2
      BOINC Device number: -1
      card path: /sys/class/drm/card2/device
      hwmon path: None
      Compute compatible: True
      Energy compatible: False
GPU_ITEM: uuid: 8ed4e584a7af4d3fba60024e64c23307
      pcie_id: 04:00.0
      model: NVIDIA Corporation GP102 [P102-100] (rev a1)
      vendor: NVIDIA
      driver: nvidiafb, nouveau, nvidia_drm, nvidia
      openCL Device: None
      openCL Version: None
      card number: 3
      BOINC Device number: -1
      card path: /sys/class/drm/card3/device
      hwmon path: None
      Compute compatible: True
      Energy compatible: False
GPU_ITEM: uuid: fbc482e8faa543f18bf9b491887f329a
      pcie_id: 05:00.0
      model: NVIDIA Corporation GP102 [P102-100] (rev a1)
      vendor: NVIDIA
      driver: nvidiafb, nouveau, nvidia_drm, nvidia
      openCL Device: None
      openCL Version: None
      card number: 4
      BOINC Device number: -1
      card path: /sys/class/drm/card4/device
      hwmon path: None
      Compute compatible: True
      Energy compatible: False
GPU_ITEM: uuid: 9f392b9710b34a11bd45e8654ad3dc4a
      pcie_id: 06:00.0
      model: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] (rev a1)
      vendor: NVIDIA
      driver: nvidiafb, nouveau, nvidia_drm, nvidia
      openCL Device: None
      openCL Version: None
      card number: 5
      BOINC Device number: -1
      card path: /sys/class/drm/card5/device
      hwmon path: None
      Compute compatible: True
      Energy compatible: False

@KeithMyers

You can use this nvidia-smi command for polling power usage on a card:
nvidia-smi stats -i <device#> -d pwrDraw
And this is the output for my device 0.
0, pwrDraw , 1579539474276611, 175
The large number is a timestamp and the 175 is the wattage.

Or for a snapshot of a single card:
nvidia-smi -i 0 --query-gpu=power.draw --format=csv
power.draw [W]
207.88 W

If you want all the cards at once:
nvidia-smi --query-gpu=power.draw --format=csv
power.draw [W]
205.83 W
210.04 W
57.35 W

Both the CUDA and OpenCL APIs are included in the standard Nvidia drivers. Sometimes the OpenCL API is dropped from packages, but it can always be installed separately if needed:
sudo apt-get install ocl-icd-libopencl1
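
For what it's worth, a sketch of wrapping the per-card query from Python, including the invalid-index case (an illustration of the nvidia-smi calls above, not the actual benchMT code):

```python
# Sketch: query per-GPU power draw via nvidia-smi; not the benchMT implementation.
import subprocess

def gpu_power_watts(index):
    """Return power draw in watts for one GPU index, or None if nvidia-smi rejects the index."""
    cmd = ['nvidia-smi', '-i', str(index),
           '--query-gpu=power.draw', '--format=csv,noheader,nounits']
    try:
        out = subprocess.check_output(cmd).decode().strip()
    except subprocess.CalledProcessError:
        # nvidia-smi exits non-zero for a GPU index that does not exist.
        return None
    return float(out)

for i in range(3):
    print(i, gpu_power_watts(i))
```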

@KeithMyers

Keith: how did you get that info?

That is what came up when I ran the test benchMT in the terminal.

@Ricks-Lab
Owner

@JStateson The empty braces at the top of your last posted output indicate that OpenCL is not installed. This means checking for OpenCL to judge compute capability is not going to work. I probably need to find another way to detect CUDA capability.

@KeithMyers @JStateson I have attempted an implementation of energy metrics for Nvidia. Can you run 'benchMT --lsgpu' and post your output here? I need to learn how to interpret the output first. I have also included a power read of a bad card number, meant to be an error, so I know how to manage it.

Thanks!
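
One possible fallback for judging CUDA capability when clinfo/OpenCL is not installed (just a sketch of an approach, not what benchMT does): check whether the proprietary NVidia driver is loaded and nvidia-smi is present.

```python
# Sketch: crude CUDA-capability check that does not depend on OpenCL/clinfo.
import os
import shutil

def cuda_likely_available():
    # /proc/driver/nvidia/version exists when the proprietary NVidia kernel driver is loaded,
    # and nvidia-smi ships with that driver.
    return os.path.exists('/proc/driver/nvidia/version') and shutil.which('nvidia-smi') is not None

print(cuda_likely_available())
```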

@KeithMyers

Something isn't correct in the code.

 ./benchMT --lsgpu
{}
nsmi_items: [['42.92', '']]
Traceback (most recent call last):
  File "./benchMT", line 2516, in <module>
    main()
  File "./benchMT", line 2166, in main
    gpu_list.set_gpu_list()
  File "./benchMT", line 360, in set_gpu_list
    return self.set_lspci_gpu_list()
  File "./benchMT", line 587, in set_lspci_gpu_list
    mb_const.cmd_nvidia_smi, '9'), shell=True).decode().split('\n')
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '/usr/bin/nvidia-smi -i 9 --query-gpu=power.draw --format=csv,noheader,nounits' returned non-zero exit status 6.

@JStateson
Author

JStateson commented Jan 23, 2020

Been busy working on my BoincTasks temperature projects. I wanted to show wattage used as an option, as a first step toward identifying which card on the motherboard corresponds to a problem showing up on the BoincTasks display. Anyway, I got the following information that might be helpful:

  1. Output from nvidia-smi: nine GPUs listed, no sorting by bus ID
  2. Output from clinfo: it looks like the sorting is done in clinfo, as the output matches what is shown when BOINC boots up. I had thought that BOINC did the sorting, but it seems clinfo does
    (correction: I just looked at another system and clinfo did not match the BOINC order)
  3. Output from BOINC

What I hope to accomplish: be able to look at a display of a problem work unit as shown by BoincTasks for any remote system running Linux, see the temps and wattage, identify the card and, somehow, if the cards have identical names, find some identifying serial number or code programmatically so as to know which board has the problem. I have my own version of BOINC, "MSboinc", that I can build for Windows or Linux easily; I will be adding these tools to it and, eventually, to my BoincTasks history reader. I want to eventually replace performance reports that have "d3" with something meaningful like "gtx-1660Ti".

jstateson@h110btc:~/Projects/BoincTasks$ nvidia-smi -L

GPU 0: GeForce GTX 1060 6GB (UUID: GPU-a2089043-23bd-3481-efb2-f3cbbce5906a)
GPU 1: GeForce GTX 1060 3GB (UUID: GPU-4b22e301-6b76-c4ad-f962-b40d3060dd20)
GPU 2: GeForce GTX 1060 3GB (UUID: GPU-c8eb4c00-ec6c-01de-d198-262fe9b93cb7)
GPU 3: P106-100 (UUID: GPU-df40a4fd-908f-f7cf-13aa-2654367aef88)
GPU 4: GeForce GTX 1070 (UUID: GPU-f3d9b16e-7878-e14b-3e43-71a37769e93a)
GPU 5: P106-090 (UUID: GPU-69fe3af3-2dfb-9cb1-3fc6-5650f3953f16)
GPU 6: GeForce GTX 1060 3GB (UUID: GPU-6c5723e9-6e00-dbd5-c890-ce278a20661e)
GPU 7: GeForce GTX 1060 3GB (UUID: GPU-a6c40c5f-d334-766a-6fa2-acfb8e572e88)
GPU 8: GeForce GTX 1060 3GB (UUID: GPU-d1137596-9cfa-a466-dcbb-0583a095bcdf)

jstateson@h110btc:~/Projects$ clinfo | grep "Device Topology"
Device Topology (NV) PCI-E, 05:00.0
Device Topology (NV) PCI-E, 01:00.0
Device Topology (NV) PCI-E, 04:00.0
Device Topology (NV) PCI-E, 02:00.0
Device Topology (NV) PCI-E, 03:00.0
Device Topology (NV) PCI-E, 0a:00.0
Device Topology (NV) PCI-E, 0b:00.0
Device Topology (NV) PCI-E, 0e:00.0
Device Topology (NV) PCI-E, 08:00.0

jstateson@h110btc:~/Projects$ clinfo | grep "Device Name"
Device Name GeForce GTX 1070
Device Name GeForce GTX 1060 6GB
Device Name P106-100
Device Name GeForce GTX 1060 3GB
Device Name GeForce GTX 1060 3GB
Device Name GeForce GTX 1060 3GB
Device Name GeForce GTX 1060 3GB
Device Name GeForce GTX 1060 3GB
Device Name P106-090
Device Name Intel(R) Gen9 HD Graphics NEO

jstateson@h110btc:~/Projects/BoincTasks/SystemdService$ nvidia-smi -q -d PIDS | grep "GPU 0"
GPU 00000000:01:00.0
GPU 00000000:02:00.0
GPU 00000000:03:00.0
GPU 00000000:04:00.0
GPU 00000000:05:00.0
GPU 00000000:08:00.0
GPU 00000000:0A:00.0
GPU 00000000:0B:00.0
GPU 00000000:0E:00.0

============ below from BOINC event messages on h110btc
5 CUDA: NVIDIA GPU 0: GeForce GTX 1070 (driver version 440.48, CUDA version 10.2, compute capability 6.1, 4096MB, 3972MB available, 6561 GFLOPS peak)
6 CUDA: NVIDIA GPU 1: GeForce GTX 1060 6GB (driver version 440.48, CUDA version 10.2, compute capability 6.1, 4096MB, 3974MB available, 4698 GFLOPS peak)
7 CUDA: NVIDIA GPU 2: P106-100 (driver version 440.48, CUDA version 10.2, compute capability 6.1, 4096MB, 3974MB available, 4374 GFLOPS peak)
8 CUDA: NVIDIA GPU 3: GeForce GTX 1060 3GB (driver version 440.48, CUDA version 10.2, compute capability 6.1, 3019MB, 2943MB available, 3936 GFLOPS peak)
9 CUDA: NVIDIA GPU 4: GeForce GTX 1060 3GB (driver version 440.48, CUDA version 10.2, compute capability 6.1, 3019MB, 2943MB available, 3936 GFLOPS peak)
10 CUDA: NVIDIA GPU 5: GeForce GTX 1060 3GB (driver version 440.48, CUDA version 10.2, compute capability 6.1, 3019MB, 2943MB available, 3936 GFLOPS peak)
11 CUDA: NVIDIA GPU 6: GeForce GTX 1060 3GB (driver version 440.48, CUDA version 10.2, compute capability 6.1, 3019MB, 2943MB available, 3936 GFLOPS peak)
12 CUDA: NVIDIA GPU 7: GeForce GTX 1060 3GB (driver version 440.48, CUDA version 10.2, compute capability 6.1, 3019MB, 2943MB available, 3936 GFLOPS peak)
13 CUDA: NVIDIA GPU 8: P106-090 (driver version 440.48, CUDA version 10.2, compute capability 6.1, 3022MB, 2965MB available, 1960 GFLOPS peak)
14 OpenCL: NVIDIA GPU 0: GeForce GTX 1070 (driver version 440.48.02, device version OpenCL 1.2 CUDA, 8120MB, 3972MB available, 6561 GFLOPS peak)
15 OpenCL: NVIDIA GPU 1: GeForce GTX 1060 6GB (driver version 440.48.02, device version OpenCL 1.2 CUDA, 6078MB, 3974MB available, 4698 GFLOPS peak)
16 OpenCL: NVIDIA GPU 2: P106-100 (driver version 440.48.02, device version OpenCL 1.2 CUDA, 6081MB, 3974MB available, 4374 GFLOPS peak)
17 OpenCL: NVIDIA GPU 3: GeForce GTX 1060 3GB (driver version 440.48.02, device version OpenCL 1.2 CUDA, 3019MB, 2943MB available, 3936 GFLOPS peak)
18 OpenCL: NVIDIA GPU 4: GeForce GTX 1060 3GB (driver version 440.48.02, device version OpenCL 1.2 CUDA, 3019MB, 2943MB available, 3936 GFLOPS peak)
19 OpenCL: NVIDIA GPU 5: GeForce GTX 1060 3GB (driver version 440.48.02, device version OpenCL 1.2 CUDA, 3019MB, 2943MB available, 3936 GFLOPS peak)
20 OpenCL: NVIDIA GPU 6: GeForce GTX 1060 3GB (driver version 440.48.02, device version OpenCL 1.2 CUDA, 3019MB, 2943MB available, 3936 GFLOPS peak)
21 OpenCL: NVIDIA GPU 7: GeForce GTX 1060 3GB (driver version 440.48.02, device version OpenCL 1.2 CUDA, 3019MB, 2943MB available, 3936 GFLOPS peak)
22 OpenCL: NVIDIA GPU 8: P106-090 (driver version 440.48.02, device version OpenCL 1.2 CUDA, 3022MB, 2965MB available, 1960 GFLOPS peak)
23 OpenCL: Intel GPU 0: Intel(R) Gen9 HD Graphics NEO (driver version 19.45.14764, device version OpenCL 2.1 NEO, 25449MB, 25449MB available, 221 GFLOPS peak)

From coproc_info.xml; note that it matches clinfo:

jstateson@h110btc:/var/lib/boinc$ grep -i "bus" coproc_info.xml
   <bus_id>5</bus_id>
   <bus_id>1</bus_id>
   <bus_id>4</bus_id>
   <bus_id>2</bus_id>
   <bus_id>3</bus_id>
   <bus_id>10</bus_id>
   <bus_id>11</bus_id>
   <bus_id>14</bus_id>
   <bus_id>8</bus_id>

from clinfo on a tb85 system

jstateson@tb85-nvidia:~$ clinfo | grep "Device Name"
  Device Name                                     Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz
  Device Name                                     GeForce GTX 1070
  Device Name                                     GeForce GTX 1660 Ti
  Device Name                                     P102-100
  Device Name                                     P102-100
  Device Name                                     P102-100
  Device Name                                     GeForce GTX 1070 Ti

BOINC shows a different ordering: the 1660 then the Ti, then the P102, then that 1070.
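
Since nvidia-smi can report the index, PCI bus ID, UUID and name in one query, one way to build part of the lookup table described above is sketched below (the query fields are standard nvidia-smi ones; tying the result back to BOINC's dN still needs the coproc_info.xml bus IDs as above):

```python
# Sketch: tie NVidia driver index, PCIe bus id, UUID and model name together in one call.
import subprocess

cmd = ['nvidia-smi', '--query-gpu=index,pci.bus_id,uuid,name', '--format=csv,noheader']
for line in subprocess.check_output(cmd).decode().splitlines():
    index, bus_id, uuid, name = [field.strip() for field in line.split(',', 3)]
    print(index, bus_id, uuid, name)
```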

@Ricks-Lab
Owner

from clinfo
https://stateson.net/images/tb85_clinfo.txt

@JStateson I noticed that this file has information on the nvidia compute platform, but my script is not picking it up. Can you post the output of clinfo --raw?

@Ricks-Lab
Owner

@JStateson @KeithMyers
I just made a change to benchMT to hopefully pick up Nvidia GPUs based on what I think the clinfo --raw output would look like. Can you check with benchMT --lsgpu?
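
For reference, a sketch of what parsing the clinfo --raw layout might look like, keyed on the `[NV/<n>] KEY VALUE` lines Keith posts below (illustrative only; this is not the change that went onto master):

```python
# Sketch: collect per-device fields from `clinfo --raw`, e.g. the NVidia PCI bus/slot ids.
import re
import subprocess

raw = subprocess.check_output(['clinfo', '--raw']).decode()
devices = {}
for m in re.finditer(r'^\[NV/(\d+)\]\s+(\S+)\s+(.*)$', raw, re.MULTILINE):
    dev, key, val = m.group(1), m.group(2), m.group(3).strip()
    devices.setdefault(dev, {})[key] = val

for dev in sorted(devices, key=int):
    props = devices[dev]
    print(dev, props.get('CL_DEVICE_NAME'),
          'bus', props.get('CL_DEVICE_PCI_BUS_ID_NV'),
          'slot', props.get('CL_DEVICE_PCI_SLOT_ID_NV'))
```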

@KeithMyers

keith@Serenity:~/Downloads/benchMT-master$ ./benchMT --lsgpu
benchMT workdir Path [ /home/keith/Downloads/benchMT-master/workdir/ ] does not exist, making...
TestData Path [ /home/keith/Downloads/benchMT-master/testData/ ] does not exist, making...
GPU_ITEM: uuid: 627ddcf8f20443faa5d215bc8473e5cc
    pcie_id: 08:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    card number: 0
    BOINC Device number: None
    card path: /sys/class/drm/card0/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: 8343a2ba785a4dfebaaf628530f73435
    pcie_id: 0a:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    card number: 1
    BOINC Device number: None
    card path: /sys/class/drm/card1/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: d50e742a546c427793cdf3f774c3675c
    pcie_id: 0b:00.0
    model: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    card number: 2
    BOINC Device number: None
    card path: /sys/class/drm/card2/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
keith@Serenity:~/Downloads/benchMT-master$ 

@Ricks-Lab
Owner

@KeithMyers Was this with the current benchMT on master? I made a change a few hours ago.

Also, can you post the output of clinfo --raw?

@KeithMyers

Yes, this was with the new master updated ten minutes ago.

keith@Serenity:~$ clinfo --raw
#PLATFORMS                                        1
  CL_PLATFORM_NAME                                NVIDIA CUDA
  CL_PLATFORM_VENDOR                              NVIDIA Corporation
  CL_PLATFORM_VERSION                             OpenCL 1.2 CUDA 10.2.115
  CL_PLATFORM_PROFILE                             FULL_PROFILE
  CL_PLATFORM_EXTENSIONS                          cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics
  CL_PLATFORM_ICD_SUFFIX_KHR                      NV

[NV/*]   CL_PLATFORM_NAME                                NVIDIA CUDA
[NV/*] #DEVICES                                          3
[NV/0]   CL_DEVICE_NAME                                  GeForce RTX 2080
[NV/0]   CL_DEVICE_VENDOR                                NVIDIA Corporation
[NV/0]   CL_DEVICE_VENDOR_ID                             0x10de
[NV/0]   CL_DEVICE_VERSION                               OpenCL 1.2 CUDA
[NV/0]   CL_DRIVER_VERSION                               440.48.02
[NV/0]   CL_DEVICE_OPENCL_C_VERSION                      OpenCL C 1.2 
[NV/0]   CL_DEVICE_TYPE                                  CL_DEVICE_TYPE_GPU
[NV/0]   CL_DEVICE_PCI_BUS_ID_NV                         8
[NV/0]   CL_DEVICE_PCI_SLOT_ID_NV                        0
[NV/0]   CL_DEVICE_PROFILE                               FULL_PROFILE
[NV/0]   CL_DEVICE_AVAILABLE                             CL_TRUE
[NV/0]   CL_DEVICE_COMPILER_AVAILABLE                    CL_TRUE
[NV/0]   CL_DEVICE_LINKER_AVAILABLE                      CL_TRUE
[NV/0]   CL_DEVICE_MAX_COMPUTE_UNITS                     46
[NV/0]   CL_DEVICE_MAX_CLOCK_FREQUENCY                   1800
[NV/0]   CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV           7
[NV/0]   CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV           5
[NV/0]   CL_DEVICE_PARTITION_MAX_SUB_DEVICES             1
[NV/0]   CL_DEVICE_PARTITION_PROPERTIES                  CL_NONE
[NV/0]   CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS              3
[NV/0]   CL_DEVICE_MAX_WORK_ITEM_SIZES                   1024 1024 64
[NV/0]   CL_DEVICE_MAX_WORK_GROUP_SIZE                   1024
[NV/0]   CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE    32
[NV/0]   CL_DEVICE_WARP_SIZE_NV                          32
[NV/0]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR           1
[NV/0]   CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR              1
[NV/0]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT          1
[NV/0]   CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT             1
[NV/0]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT            1
[NV/0]   CL_DEVICE_NATIVE_VECTOR_WIDTH_INT               1
[NV/0]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG           1
[NV/0]   CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG              1
[NV/0]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF           0
[NV/0]   CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF              0
[NV/0]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT          1
[NV/0]   CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT             1
[NV/0]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE         1
[NV/0]   CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE            1
[NV/0]   CL_DEVICE_SINGLE_FP_CONFIG                      CL_FP_DENORM | CL_FP_INF_NAN | CL_FP_ROUND_TO_NEAREST | CL_FP_ROUND_TO_ZERO | CL_FP_ROUND_TO_INF | CL_FP_FMA | CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT
[NV/0]   CL_DEVICE_DOUBLE_FP_CONFIG                      CL_FP_DENORM | CL_FP_INF_NAN | CL_FP_ROUND_TO_NEAREST | CL_FP_ROUND_TO_ZERO | CL_FP_ROUND_TO_INF | CL_FP_FMA
[NV/0]   CL_DEVICE_ADDRESS_BITS                          64
[NV/0]   CL_DEVICE_ENDIAN_LITTLE                         CL_TRUE
[NV/0]   CL_DEVICE_GLOBAL_MEM_SIZE                       8370061312
[NV/0]   CL_DEVICE_ERROR_CORRECTION_SUPPORT              CL_FALSE
[NV/0]   CL_DEVICE_MAX_MEM_ALLOC_SIZE                    2092515328
[NV/0]   CL_DEVICE_HOST_UNIFIED_MEMORY                   CL_FALSE
[NV/0]   CL_DEVICE_INTEGRATED_MEMORY_NV                  CL_FALSE
[NV/0]   CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE              128
[NV/0]   CL_DEVICE_MEM_BASE_ADDR_ALIGN                   4096
[NV/0]   CL_DEVICE_GLOBAL_MEM_CACHE_TYPE                 CL_READ_WRITE_CACHE
[NV/0]   CL_DEVICE_GLOBAL_MEM_CACHE_SIZE                 1507328
[NV/0]   CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE             128
[NV/0]   CL_DEVICE_IMAGE_SUPPORT                         CL_TRUE
[NV/0]   CL_DEVICE_MAX_SAMPLERS                          32
[NV/0]   CL_DEVICE_IMAGE_MAX_BUFFER_SIZE                 268435456
[NV/0]   CL_DEVICE_IMAGE_MAX_ARRAY_SIZE                  2048
[NV/0]   CL_DEVICE_IMAGE2D_MAX_HEIGHT                    32768
[NV/0]   CL_DEVICE_IMAGE2D_MAX_WIDTH                     32768
[NV/0]   CL_DEVICE_IMAGE3D_MAX_HEIGHT                    16384
[NV/0]   CL_DEVICE_IMAGE3D_MAX_WIDTH                     16384
[NV/0]   CL_DEVICE_IMAGE3D_MAX_DEPTH                     16384
[NV/0]   CL_DEVICE_MAX_READ_IMAGE_ARGS                   256
[NV/0]   CL_DEVICE_MAX_WRITE_IMAGE_ARGS                  32
[NV/0]   CL_DEVICE_LOCAL_MEM_TYPE                        CL_LOCAL
[NV/0]   CL_DEVICE_LOCAL_MEM_SIZE                        49152
[NV/0]   CL_DEVICE_REGISTERS_PER_BLOCK_NV                65536
[NV/0]   CL_DEVICE_MAX_CONSTANT_ARGS                     9
[NV/0]   CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE              65536
[NV/0]   CL_DEVICE_MAX_PARAMETER_SIZE                    4352
[NV/0]   CL_DEVICE_QUEUE_PROPERTIES                      CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE
[NV/0]   CL_DEVICE_PREFERRED_INTEROP_USER_SYNC           CL_FALSE
[NV/0]   CL_DEVICE_PROFILING_TIMER_RESOLUTION            1000
[NV/0]   CL_DEVICE_EXECUTION_CAPABILITIES                CL_EXEC_KERNEL
[NV/0]   CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV                CL_TRUE
[NV/0]   CL_DEVICE_GPU_OVERLAP_NV                        CL_TRUE
[NV/0]   CL_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT_NV       3
[NV/0]   CL_DEVICE_PRINTF_BUFFER_SIZE                    1048576
[NV/0]   CL_DEVICE_BUILT_IN_KERNELS                      
[NV/0]   CL_DEVICE_EXTENSIONS                            cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics

[NV/1]   CL_DEVICE_NAME                                  GeForce RTX 2080
[NV/1]   CL_DEVICE_VENDOR                                NVIDIA Corporation
[NV/1]   CL_DEVICE_VENDOR_ID                             0x10de
[NV/1]   CL_DEVICE_VERSION                               OpenCL 1.2 CUDA
[NV/1]   CL_DRIVER_VERSION                               440.48.02
[NV/1]   CL_DEVICE_OPENCL_C_VERSION                      OpenCL C 1.2 
[NV/1]   CL_DEVICE_TYPE                                  CL_DEVICE_TYPE_GPU
[NV/1]   CL_DEVICE_PCI_BUS_ID_NV                         10
[NV/1]   CL_DEVICE_PCI_SLOT_ID_NV                        0
[NV/1]   CL_DEVICE_PROFILE                               FULL_PROFILE
[NV/1]   CL_DEVICE_AVAILABLE                             CL_TRUE
[NV/1]   CL_DEVICE_COMPILER_AVAILABLE                    CL_TRUE
[NV/1]   CL_DEVICE_LINKER_AVAILABLE                      CL_TRUE
[NV/1]   CL_DEVICE_MAX_COMPUTE_UNITS                     46
[NV/1]   CL_DEVICE_MAX_CLOCK_FREQUENCY                   1800
[NV/1]   CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV           7
[NV/1]   CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV           5
[NV/1]   CL_DEVICE_PARTITION_MAX_SUB_DEVICES             1
[NV/1]   CL_DEVICE_PARTITION_PROPERTIES                  CL_NONE
[NV/1]   CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS              3
[NV/1]   CL_DEVICE_MAX_WORK_ITEM_SIZES                   1024 1024 64
[NV/1]   CL_DEVICE_MAX_WORK_GROUP_SIZE                   1024
[NV/1]   CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE    32
[NV/1]   CL_DEVICE_WARP_SIZE_NV                          32
[NV/1]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR           1
[NV/1]   CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR              1
[NV/1]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT          1
[NV/1]   CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT             1
[NV/1]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT            1
[NV/1]   CL_DEVICE_NATIVE_VECTOR_WIDTH_INT               1
[NV/1]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG           1
[NV/1]   CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG              1
[NV/1]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF           0
[NV/1]   CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF              0
[NV/1]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT          1
[NV/1]   CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT             1
[NV/1]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE         1
[NV/1]   CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE            1
[NV/1]   CL_DEVICE_SINGLE_FP_CONFIG                      CL_FP_DENORM | CL_FP_INF_NAN | CL_FP_ROUND_TO_NEAREST | CL_FP_ROUND_TO_ZERO | CL_FP_ROUND_TO_INF | CL_FP_FMA | CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT
[NV/1]   CL_DEVICE_DOUBLE_FP_CONFIG                      CL_FP_DENORM | CL_FP_INF_NAN | CL_FP_ROUND_TO_NEAREST | CL_FP_ROUND_TO_ZERO | CL_FP_ROUND_TO_INF | CL_FP_FMA
[NV/1]   CL_DEVICE_ADDRESS_BITS                          64
[NV/1]   CL_DEVICE_ENDIAN_LITTLE                         CL_TRUE
[NV/1]   CL_DEVICE_GLOBAL_MEM_SIZE                       8366784512
[NV/1]   CL_DEVICE_ERROR_CORRECTION_SUPPORT              CL_FALSE
[NV/1]   CL_DEVICE_MAX_MEM_ALLOC_SIZE                    2091696128
[NV/1]   CL_DEVICE_HOST_UNIFIED_MEMORY                   CL_FALSE
[NV/1]   CL_DEVICE_INTEGRATED_MEMORY_NV                  CL_FALSE
[NV/1]   CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE              128
[NV/1]   CL_DEVICE_MEM_BASE_ADDR_ALIGN                   4096
[NV/1]   CL_DEVICE_GLOBAL_MEM_CACHE_TYPE                 CL_READ_WRITE_CACHE
[NV/1]   CL_DEVICE_GLOBAL_MEM_CACHE_SIZE                 1507328
[NV/1]   CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE             128
[NV/1]   CL_DEVICE_IMAGE_SUPPORT                         CL_TRUE
[NV/1]   CL_DEVICE_MAX_SAMPLERS                          32
[NV/1]   CL_DEVICE_IMAGE_MAX_BUFFER_SIZE                 268435456
[NV/1]   CL_DEVICE_IMAGE_MAX_ARRAY_SIZE                  2048
[NV/1]   CL_DEVICE_IMAGE2D_MAX_HEIGHT                    32768
[NV/1]   CL_DEVICE_IMAGE2D_MAX_WIDTH                     32768
[NV/1]   CL_DEVICE_IMAGE3D_MAX_HEIGHT                    16384
[NV/1]   CL_DEVICE_IMAGE3D_MAX_WIDTH                     16384
[NV/1]   CL_DEVICE_IMAGE3D_MAX_DEPTH                     16384
[NV/1]   CL_DEVICE_MAX_READ_IMAGE_ARGS                   256
[NV/1]   CL_DEVICE_MAX_WRITE_IMAGE_ARGS                  32
[NV/1]   CL_DEVICE_LOCAL_MEM_TYPE                        CL_LOCAL
[NV/1]   CL_DEVICE_LOCAL_MEM_SIZE                        49152
[NV/1]   CL_DEVICE_REGISTERS_PER_BLOCK_NV                65536
[NV/1]   CL_DEVICE_MAX_CONSTANT_ARGS                     9
[NV/1]   CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE              65536
[NV/1]   CL_DEVICE_MAX_PARAMETER_SIZE                    4352
[NV/1]   CL_DEVICE_QUEUE_PROPERTIES                      CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE
[NV/1]   CL_DEVICE_PREFERRED_INTEROP_USER_SYNC           CL_FALSE
[NV/1]   CL_DEVICE_PROFILING_TIMER_RESOLUTION            1000
[NV/1]   CL_DEVICE_EXECUTION_CAPABILITIES                CL_EXEC_KERNEL
[NV/1]   CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV                CL_TRUE
[NV/1]   CL_DEVICE_GPU_OVERLAP_NV                        CL_TRUE
[NV/1]   CL_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT_NV       3
[NV/1]   CL_DEVICE_PRINTF_BUFFER_SIZE                    1048576
[NV/1]   CL_DEVICE_BUILT_IN_KERNELS                      
[NV/1]   CL_DEVICE_EXTENSIONS                            cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics

[NV/2]   CL_DEVICE_NAME                                  GeForce GTX 1080
[NV/2]   CL_DEVICE_VENDOR                                NVIDIA Corporation
[NV/2]   CL_DEVICE_VENDOR_ID                             0x10de
[NV/2]   CL_DEVICE_VERSION                               OpenCL 1.2 CUDA
[NV/2]   CL_DRIVER_VERSION                               440.48.02
[NV/2]   CL_DEVICE_OPENCL_C_VERSION                      OpenCL C 1.2 
[NV/2]   CL_DEVICE_TYPE                                  CL_DEVICE_TYPE_GPU
[NV/2]   CL_DEVICE_PCI_BUS_ID_NV                         11
[NV/2]   CL_DEVICE_PCI_SLOT_ID_NV                        0
[NV/2]   CL_DEVICE_PROFILE                               FULL_PROFILE
[NV/2]   CL_DEVICE_AVAILABLE                             CL_TRUE
[NV/2]   CL_DEVICE_COMPILER_AVAILABLE                    CL_TRUE
[NV/2]   CL_DEVICE_LINKER_AVAILABLE                      CL_TRUE
[NV/2]   CL_DEVICE_MAX_COMPUTE_UNITS                     20
[NV/2]   CL_DEVICE_MAX_CLOCK_FREQUENCY                   1860
[NV/2]   CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV           6
[NV/2]   CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV           1
[NV/2]   CL_DEVICE_PARTITION_MAX_SUB_DEVICES             1
[NV/2]   CL_DEVICE_PARTITION_PROPERTIES                  CL_NONE
[NV/2]   CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS              3
[NV/2]   CL_DEVICE_MAX_WORK_ITEM_SIZES                   1024 1024 64
[NV/2]   CL_DEVICE_MAX_WORK_GROUP_SIZE                   1024
[NV/2]   CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE    32
[NV/2]   CL_DEVICE_WARP_SIZE_NV                          32
[NV/2]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR           1
[NV/2]   CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR              1
[NV/2]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT          1
[NV/2]   CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT             1
[NV/2]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT            1
[NV/2]   CL_DEVICE_NATIVE_VECTOR_WIDTH_INT               1
[NV/2]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG           1
[NV/2]   CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG              1
[NV/2]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF           0
[NV/2]   CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF              0
[NV/2]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT          1
[NV/2]   CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT             1
[NV/2]   CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE         1
[NV/2]   CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE            1
[NV/2]   CL_DEVICE_SINGLE_FP_CONFIG                      CL_FP_DENORM | CL_FP_INF_NAN | CL_FP_ROUND_TO_NEAREST | CL_FP_ROUND_TO_ZERO | CL_FP_ROUND_TO_INF | CL_FP_FMA | CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT
[NV/2]   CL_DEVICE_DOUBLE_FP_CONFIG                      CL_FP_DENORM | CL_FP_INF_NAN | CL_FP_ROUND_TO_NEAREST | CL_FP_ROUND_TO_ZERO | CL_FP_ROUND_TO_INF | CL_FP_FMA
[NV/2]   CL_DEVICE_ADDRESS_BITS                          64
[NV/2]   CL_DEVICE_ENDIAN_LITTLE                         CL_TRUE
[NV/2]   CL_DEVICE_GLOBAL_MEM_SIZE                       8513978368
[NV/2]   CL_DEVICE_ERROR_CORRECTION_SUPPORT              CL_FALSE
[NV/2]   CL_DEVICE_MAX_MEM_ALLOC_SIZE                    2128494592
[NV/2]   CL_DEVICE_HOST_UNIFIED_MEMORY                   CL_FALSE
[NV/2]   CL_DEVICE_INTEGRATED_MEMORY_NV                  CL_FALSE
[NV/2]   CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE              128
[NV/2]   CL_DEVICE_MEM_BASE_ADDR_ALIGN                   4096
[NV/2]   CL_DEVICE_GLOBAL_MEM_CACHE_TYPE                 CL_READ_WRITE_CACHE
[NV/2]   CL_DEVICE_GLOBAL_MEM_CACHE_SIZE                 983040
[NV/2]   CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE             128
[NV/2]   CL_DEVICE_IMAGE_SUPPORT                         CL_TRUE
[NV/2]   CL_DEVICE_MAX_SAMPLERS                          32
[NV/2]   CL_DEVICE_IMAGE_MAX_BUFFER_SIZE                 268435456
[NV/2]   CL_DEVICE_IMAGE_MAX_ARRAY_SIZE                  2048
[NV/2]   CL_DEVICE_IMAGE2D_MAX_HEIGHT                    32768
[NV/2]   CL_DEVICE_IMAGE2D_MAX_WIDTH                     16384
[NV/2]   CL_DEVICE_IMAGE3D_MAX_HEIGHT                    16384
[NV/2]   CL_DEVICE_IMAGE3D_MAX_WIDTH                     16384
[NV/2]   CL_DEVICE_IMAGE3D_MAX_DEPTH                     16384
[NV/2]   CL_DEVICE_MAX_READ_IMAGE_ARGS                   256
[NV/2]   CL_DEVICE_MAX_WRITE_IMAGE_ARGS                  16
[NV/2]   CL_DEVICE_LOCAL_MEM_TYPE                        CL_LOCAL
[NV/2]   CL_DEVICE_LOCAL_MEM_SIZE                        49152
[NV/2]   CL_DEVICE_REGISTERS_PER_BLOCK_NV                65536
[NV/2]   CL_DEVICE_MAX_CONSTANT_ARGS                     9
[NV/2]   CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE              65536
[NV/2]   CL_DEVICE_MAX_PARAMETER_SIZE                    4352
[NV/2]   CL_DEVICE_QUEUE_PROPERTIES                      CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE
[NV/2]   CL_DEVICE_PREFERRED_INTEROP_USER_SYNC           CL_FALSE
[NV/2]   CL_DEVICE_PROFILING_TIMER_RESOLUTION            1000
[NV/2]   CL_DEVICE_EXECUTION_CAPABILITIES                CL_EXEC_KERNEL
[NV/2]   CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV                CL_TRUE
[NV/2]   CL_DEVICE_GPU_OVERLAP_NV                        CL_TRUE
[NV/2]   CL_DEVICE_ATTRIBUTE_ASYNC_ENGINE_COUNT_NV       2
[NV/2]   CL_DEVICE_PRINTF_BUFFER_SIZE                    1048576
[NV/2]   CL_DEVICE_BUILT_IN_KERNELS                      
[NV/2]   CL_DEVICE_EXTENSIONS                            cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics

[OCLICD/*]   CL_ICDL_NAME                                    OpenCL ICD Loader
[OCLICD/*]   CL_ICDL_VENDOR                                  OCL Icd free software
[OCLICD/*]   CL_ICDL_VERSION                                 2.2.11
[OCLICD/*]   CL_ICDL_OCL_VERSION                             OpenCL 2.1
keith@Serenity:~$ 

@Ricks-Lab
Copy link
Owner

@KeithMyers Thanks! Exactly what I needed. Looks like pcie ID is stored differently between NV and AMD. Should be an easy fix. I will let you know when I push an update with the changes.

@KeithMyers
Copy link

I would expect that to be the case since two different APIs are being used. BOINC depends on the vendor API to pull the identifying information and capabilities of the detected cards, so AMD's API stores the PCIe ID in a different format than Nvidia's API does.

@KeithMyers
Copy link

KeithMyers commented Jan 24, 2020

// Detection of AMD/ATI GPUs
//
// Docs:
// http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_CAL_Programming_Guide_v2.0%5B1%5D.pdf
// ?? why don't they have HTML docs??

// NvAPI now provides an API for getting #cores :-)
// But not FLOPs per clock cycle :-(
// Anyway, don't use this for now because server code estimates FLOPS
// based on compute capability, so we may as well do the same
// See http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/
//

// OpenCL interfaces are documented here:
// http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/ and
// http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/

@Ricks-Lab
Copy link
Owner

Ricks-Lab commented Jan 24, 2020

@KeithMyers I just posted a fix, but not able to test. Can you give it a try?

@JStateson In this latest release, I added an openCL Device Index, which may help in solving the mapping problem. I noticed that coproc_info.xml has both a device number and an opencl_device_index. This looks promising.
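
A minimal sketch (my illustration, not BOINC or benchMT code) of how the device number / opencl_device_index pairing mentioned above could be pulled out of coproc_info.xml; the <device_num> and <opencl_device_index> tag names are assumptions based on this discussion:

import re

def map_device_to_opencl_index(coproc_info_path="coproc_info.xml"):
    # Hypothetical sketch: collect device-number / opencl_device_index pairs
    # from BOINC's coproc_info.xml.  Tag names are assumed, and the simple
    # zip() presumes the two lists appear in the same order in the file.
    text = open(coproc_info_path).read()
    device_nums = re.findall(r"<device_num>\s*(\d+)\s*</device_num>", text)
    ocl_indexes = re.findall(r"<opencl_device_index>\s*(\d+)\s*</opencl_device_index>", text)
    return dict(zip(device_nums, ocl_indexes))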

@Ricks-Lab
Copy link
Owner

I just pushed another update. clinfo for NV reports back a Bus ID and a Slot ID, but the format of the PCIe ID is Bus:Device.Function, so I assumed Slot maps to Device. The only way to know for sure is to see if it matches the PCIe ID found by lspci.

It seems, from @JStateson's post above, that clinfo without --raw gives the Device Topology. It would be useful to see both the --raw and default output for NVIDIA.
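
A minimal sketch of the Bus/Slot to PCIe ID mapping described above (my illustration, not benchMT's implementation): format CL_DEVICE_PCI_BUS_ID_NV and CL_DEVICE_PCI_SLOT_ID_NV as an lspci-style bus:device.function string, assuming Slot maps to Device and the function is 0, then check whether lspci lists a device at that ID:

import subprocess

def nv_pcie_id(bus_id, slot_id, function=0):
    # Format clinfo's NV bus/slot values like lspci's bb:dd.f IDs.
    return "{:02x}:{:02x}.{}".format(bus_id, slot_id, function)

def lspci_has(pcie_id):
    # True if lspci lists a device at the given PCIe ID.
    out = subprocess.check_output(["lspci", "-s", pcie_id]).decode()
    return bool(out.strip())

# Example from the clinfo output above: bus 8, slot 0 -> "08:00.0"
print(nv_pcie_id(8, 0), lspci_has(nv_pcie_id(8, 0)))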

@KeithMyers
Copy link

Doesn't seem to be picking up the OpenCL device number.

keith@Serenity:~/Downloads/benchMT-master$ ./benchMT --lsgpu
benchMT workdir Path [ /home/keith/Downloads/benchMT-master/workdir/ ] does not exist, making...
TestData Path [ /home/keith/Downloads/benchMT-master/testData/ ] does not exist, making...
GPU_ITEM: uuid: 67ad16af443d45ee9c9b63240f90137e
    pcie_id: 08:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    openCL Index: None
    card number: 0
    BOINC Device number: None
    card path: /sys/class/drm/card0/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: 6174520b81954271bbfd17e7958237e2
    pcie_id: 0a:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    openCL Index: None
    card number: 1
    BOINC Device number: None
    card path: /sys/class/drm/card1/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: 966bcc7e65ce44fdbcf2c0b1868e5cd1
    pcie_id: 0b:00.0
    model: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    openCL Index: None
    card number: 2
    BOINC Device number: None
    card path: /sys/class/drm/card2/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
keith@Serenity:~/Downloads/benchMT-master$

@KeithMyers
Copy link

KeithMyers commented Jan 24, 2020

./benchMT --devmap 0:0,1:1,2:2
devmap:  {0: 0, 1: 1, 2: 2}
Set specified gpu_devices:  [0, 1, 2]
BOINC Home Path [ /home/boinc/BOINC/ ] does not exist
Please set the correct BOINC Home Path with the --boinc_home command line option
boinccmd [ /home/boinc/BOINC/boinccmd ] does not exist
Error in BOINC environment.  Exiting...
./benchMT ./benchMT --boinc_home /home/keith/Desktop/BOINC/ --devmap 0:0,1:1,2:2
usage: benchMT [-h] [--about] [-y] [--cfg_file CFG_FILE] [--run_name RUN_NAME]
               [--boinc_home BOINC_HOME] [--noBS] [--display_compact]
               [--display_slots] [--num_repetitions NUM_REPETITIONS]
               [--max_threads MAX_THREADS] [--max_gpus MAX_GPUS]
               [--gpu_devices GPU_DEVICES] [--devmap DEVMAP] [--energy]
               [--astropulse] [--std_signals] [--no_ref] [--force_ref]
               [--purge_kernels] [--lsgpu] [--admin_mkdirs] [-d]
benchMT: error: unrecognized arguments: ./benchMT

So why is setting my boinc_home an invalid argument now?

@Ricks-Lab
Copy link
Owner

It’s complaining about the second ./benchMT.
Can you run with lsgpu and debug options?

@KeithMyers
Copy link

Ah, silly me. Didn't see the extra one.

./benchMT --boinc_home /home/keith/Desktop/BOINC/ --devmap 0:0,1:1,2:2 --debug --lsgpu
Using python 3.6.9
mb_const.boinc_home: [/home/boinc/BOINC/]
mb_const.cpu_app_subdir: [APPS_CPU/]
mb_const.gpu_app_subdir: [APPS_GPU/]
mb_const.ref_app_subdir: [APPS_REF/]
mb_const.ref_results_subdir: [REF_RESULTS/]
mb_const.wu_subdir: [WU_test/]
mb_const.std_signal_subdir: [WU_std_signal/]
mb_const.testdata_subdir: [testData/]
mb_const.workdir_subdir: [workdir/]
mb_const.slots_subdir: [Slots/]
mb_const.command_line_filename: [BenchCFG]
mb_const.boinccmd: [boinccmd]
mb_const.template_file: [init_data.xml.template]
mb_const.wu_cmp: [rescmpv5_l]
mb_const.suspend_args: [['boinccmd --set_gpu_mode never 172800', 'boinccmd --set_run_mode never 172800']]
mb_const.resume_args: [['boinccmd --set_gpu_mode never 1', 'boinccmd --set_run_mode never 1']]
mb_const.activeWU: [work_unit.sah]
mb_const.activeAPWU: [in.dat]
mb_const.DEBUG: [True]
mb_const.noBS: [False]
mb_const.env: [<__main__.BENCH_ENV object at 0x7f4ed229b588>]
mb_const.card_root: [/sys/class/drm/]
mb_const.hwmon_sub: [hwmon/hwmon]
mb_const.cmd_lspci: [/usr/bin/lspci]
mb_const.cmd_lshw: [/usr/bin/lshw]
mb_const.cmd_lscpu: [/usr/bin/lscpu]
mb_const.cmd_clinfo: [/usr/bin/clinfo]
mb_const.cmd_time: [/usr/bin/time]
mb_const.cmd_lsb_release: [/usr/bin/lsb_release]
mb_const.cmd_nvidia_smi: [/usr/bin/nvidia-smi]

   Initial app list
┌────┬────┬───┬────────────────────────────────────────────────────────────┬────────┬────────┬───────────┬────────┐
│Job#│Slot│xPU│app_name                                                    │  start │ finish │tot_time   │ state  │
│    │    │   │app_args                                                    │wu_name                               │
├────┼────┼───┼────────────────────────────────────────────────────────────┼────────┬────────┬───────────┬────────┤
│0   │ NA │GPU│MBv8_8.22r3584_sse2_clAMD_HD5_x86_64-pc-linux-gnu           │  NA    │  NA    │  NA       │PENDING │
│    │    │   │-v 1 -instances_per_device 1 -sbs 2048 -period_iterations_nu│not assigned                          │
└────┴────┴───┴────────────────────────────────────────────────────────────┴──────────────────────────────────────┘
CFG_mode: yes None
CFG_mode: run_name None
CFG_mode: boinc_home None
CFG_mode: noBS None
CFG_mode: display_compact None
CFG_mode: display_slots None
CFG_mode: num_repetitions None
CFG_mode: max_threads None
CFG_mode: max_gpus None
CFG_mode: gpu_devices None
CFG_mode: devmap None
CFG_mode: std_signals None
CFG_mode: no_ref None
CFG_mode: force_ref None
CFG_mode: energy None
CFG_mode: astropulse None
ocl_device_name [GeForce RTX 2080]
ocl_device_version [OpenCL 1.2 CUDA]
ocl_pcie_id []
cl_index: ['GeForce RTX 2080', 'OpenCL 1.2 CUDA', '0']
ocl_device_name [GeForce RTX 2080]
ocl_device_version [OpenCL 1.2 CUDA]
ocl_pcie_id []
cl_index: ['GeForce RTX 2080', 'OpenCL 1.2 CUDA', '1']
ocl_device_name [GeForce GTX 1080]
ocl_device_version [OpenCL 1.2 CUDA]
ocl_pcie_id []
cl_index: ['GeForce GTX 1080', 'OpenCL 1.2 CUDA', '2']
{'': ['GeForce GTX 1080', 'OpenCL 1.2 CUDA', '2']}
Found 3 GPUs
GPU:  08:00.0
['08:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)', '\tSubsystem: eVga.com. Corp. Device 2184', '\tKernel driver in use: nvidia', '\tKernel modules: nvidiafb, nouveau, nvidia_drm, nvidia', '']
hw_file_search:  []
Power reading for 08:00.0: 93.96 W
GPU:  0a:00.0
['0a:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)', '\tSubsystem: eVga.com. Corp. Device 2184', '\tKernel driver in use: nvidia', '\tKernel modules: nvidiafb, nouveau, nvidia_drm, nvidia', '']
hw_file_search:  []
Power reading for 0a:00.0: 166.11 W
GPU:  0b:00.0
['0b:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)', '\tSubsystem: eVga.com. Corp. GP104 [GeForce GTX 1080]', '\tKernel driver in use: nvidia', '\tKernel modules: nvidiafb, nouveau, nvidia_drm, nvidia', '']
hw_file_search:  []
Power reading for 0b:00.0: 103.40 W
GPU_ITEM: uuid: 965baa7754644ac79453a94ba22d5b57
    pcie_id: 08:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    openCL Index: None
    card number: 0
    BOINC Device number: None
    card path: /sys/class/drm/card0/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: 728b2213f4b94b3293da860cd7b60774
    pcie_id: 0a:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    openCL Index: None
    card number: 1
    BOINC Device number: None
    card path: /sys/class/drm/card1/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: 70f2a1a30d124f2fba9ec5413f5dbbe0
    pcie_id: 0b:00.0
    model: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    openCL Index: None
    card number: 2
    BOINC Device number: None
    card path: /sys/class/drm/card2/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: 965baa7754644ac79453a94ba22d5b57
    pcie_id: 08:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    openCL Index: None
    card number: 0
    BOINC Device number: None
    card path: /sys/class/drm/card0/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: 728b2213f4b94b3293da860cd7b60774
    pcie_id: 0a:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    openCL Index: None
    card number: 1
    BOINC Device number: None
    card path: /sys/class/drm/card1/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: 70f2a1a30d124f2fba9ec5413f5dbbe0
    pcie_id: 0b:00.0
    model: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    openCL Index: None
    card number: 2
    BOINC Device number: None
    card path: /sys/class/drm/card2/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
devmap:  {0: 0, 1: 1, 2: 2}
GPU_ITEM: uuid: 965baa7754644ac79453a94ba22d5b57
    pcie_id: 08:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    openCL Index: None
    card number: 0
    BOINC Device number: 0
    card path: /sys/class/drm/card0/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: 728b2213f4b94b3293da860cd7b60774
    pcie_id: 0a:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    openCL Index: None
    card number: 1
    BOINC Device number: 1
    card path: /sys/class/drm/card1/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: 70f2a1a30d124f2fba9ec5413f5dbbe0
    pcie_id: 0b:00.0
    model: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: None
    openCL Version: None
    openCL Index: None
    card number: 2
    BOINC Device number: 2
    card path: /sys/class/drm/card2/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True

@KeithMyers
Copy link

See power readings on the cards now. Currently running a mix of Einstein and Milkyway tasks since Seti is still fubared.
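
As an aside, one way the per-card power readings could be gathered on NVIDIA (a sketch of the general idea only; benchMT's actual method may differ) is to query nvidia-smi keyed by PCI bus ID:

import subprocess

def nvidia_power_by_bus():
    # Power draw per card, keyed by nvidia-smi's PCI bus ID
    # (e.g. '00000000:08:00.0').
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=pci.bus_id,power.draw",
         "--format=csv,noheader,nounits"]).decode()
    readings = {}
    for line in out.strip().splitlines():
        bus_id, watts = [field.strip() for field in line.split(",")]
        readings[bus_id] = float(watts)
    return readings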

@Ricks-Lab
Copy link
Owner

Not sure why it’s not setting the clinfo PCIe ID, but I thought of a new way to test it myself. I will put your clinfo output into a file and cat it instead of executing clinfo. I will try this in the morning.
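
A minimal sketch of that test approach (hypothetical; the helper name is made up): read a saved clinfo --raw dump from a file when one is supplied, instead of invoking clinfo, so the parsing can be exercised on a machine without the cards:

import subprocess

def get_clinfo_output(saved_file=None):
    # If a captured clinfo dump is supplied, parse that instead of running
    # clinfo, so NV/AMD parsing can be tested on a machine without those GPUs.
    if saved_file:
        with open(saved_file) as f:
            return f.read()
    return subprocess.check_output(["clinfo", "--raw"]).decode()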

@Ricks-Lab
Copy link
Owner

@KeithMyers I just found and fixed what I think is the problem. Can you try the latest on master with the lsgpu and debug options again?

@KeithMyers
Copy link

KeithMyers commented Jan 24, 2020

keith@Serenity:~/Downloads/benchMT-master$ ./benchMT --boinc_home /home/keith/Desktop/BOINC/ --devmap 0:0,1:1,2:2 --debug --lsgpu
Using python 3.6.9
mb_const.boinc_home: [/home/boinc/BOINC/]
mb_const.cpu_app_subdir: [APPS_CPU/]
mb_const.gpu_app_subdir: [APPS_GPU/]
mb_const.ref_app_subdir: [APPS_REF/]
mb_const.ref_results_subdir: [REF_RESULTS/]
mb_const.wu_subdir: [WU_test/]
mb_const.std_signal_subdir: [WU_std_signal/]
mb_const.testdata_subdir: [testData/]
mb_const.workdir_subdir: [workdir/]
mb_const.slots_subdir: [Slots/]
mb_const.command_line_filename: [BenchCFG]
mb_const.boinccmd: [boinccmd]
mb_const.template_file: [init_data.xml.template]
mb_const.wu_cmp: [rescmpv5_l]
mb_const.suspend_args: [['boinccmd --set_gpu_mode never 172800', 'boinccmd --set_run_mode never 172800']]
mb_const.resume_args: [['boinccmd --set_gpu_mode never 1', 'boinccmd --set_run_mode never 1']]
mb_const.activeWU: [work_unit.sah]
mb_const.activeAPWU: [in.dat]
mb_const.DEBUG: [True]
mb_const.noBS: [False]
mb_const.env: [<__main__.BENCH_ENV object at 0x7f4f3286c550>]
mb_const.card_root: [/sys/class/drm/]
mb_const.hwmon_sub: [hwmon/hwmon]
mb_const.cmd_lspci: [/usr/bin/lspci]
mb_const.cmd_lshw: [/usr/bin/lshw]
mb_const.cmd_lscpu: [/usr/bin/lscpu]
mb_const.cmd_clinfo: [/usr/bin/clinfo]
mb_const.cmd_time: [/usr/bin/time]
mb_const.cmd_lsb_release: [/usr/bin/lsb_release]
mb_const.cmd_nvidia_smi: [/usr/bin/nvidia-smi]
benchMT workdir Path [ /home/keith/Downloads/benchMT-master/workdir/ ] does not exist, making...
TestData Path [ /home/keith/Downloads/benchMT-master/testData/ ] does not exist, making...

   Initial app list
┌────┬────┬───┬────────────────────────────────────────────────────────────┬────────┬────────┬───────────┬────────┐
│Job#│Slot│xPU│app_name                                                    │  start │ finish │tot_time   │ state  │
│    │    │   │app_args                                                    │wu_name                               │
├────┼────┼───┼────────────────────────────────────────────────────────────┼────────┬────────┬───────────┬────────┤
│0   │ NA │GPU│MBv8_8.22r3584_sse2_clAMD_HD5_x86_64-pc-linux-gnu           │  NA    │  NA    │  NA       │PENDING │
│    │    │   │-v 1 -instances_per_device 1 -sbs 2048 -period_iterations_nu│not assigned                          │
└────┴────┴───┴────────────────────────────────────────────────────────────┴──────────────────────────────────────┘
CFG_mode: yes None
CFG_mode: run_name None
CFG_mode: boinc_home None
CFG_mode: noBS None
CFG_mode: display_compact None
CFG_mode: display_slots None
CFG_mode: num_repetitions None
CFG_mode: max_threads None
CFG_mode: max_gpus None
CFG_mode: gpu_devices None
CFG_mode: devmap None
CFG_mode: std_signals None
CFG_mode: no_ref None
CFG_mode: force_ref None
CFG_mode: energy None
CFG_mode: astropulse None
ocl_device_name [GeForce RTX 2080]
ocl_device_version [OpenCL 1.2 CUDA]
ocl_pcie_id []
ocl_pcie_id [08:00.0]
cl_index: ['GeForce RTX 2080', 'OpenCL 1.2 CUDA', '0']
ocl_device_name [GeForce RTX 2080]
ocl_device_version [OpenCL 1.2 CUDA]
ocl_pcie_id []
ocl_pcie_id [0a:00.0]
cl_index: ['GeForce RTX 2080', 'OpenCL 1.2 CUDA', '1']
ocl_device_name [GeForce GTX 1080]
ocl_device_version [OpenCL 1.2 CUDA]
ocl_pcie_id []
ocl_pcie_id [0b:00.0]
cl_index: ['GeForce GTX 1080', 'OpenCL 1.2 CUDA', '2']
{'08:00.0': ['GeForce RTX 2080', 'OpenCL 1.2 CUDA', '0'], '0a:00.0': ['GeForce RTX 2080', 'OpenCL 1.2 CUDA', '1'], '0b:00.0': ['GeForce GTX 1080', 'OpenCL 1.2 CUDA', '2']}
Found 3 GPUs
GPU:  08:00.0
['08:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)', '\tSubsystem: eVga.com. Corp. Device 2184', '\tKernel driver in use: nvidia', '\tKernel modules: nvidiafb, nouveau, nvidia_drm, nvidia', '']
hw_file_search:  []
Power reading for 08:00.0: 100.63 W
GPU:  0a:00.0
['0a:00.0 VGA compatible controller: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)', '\tSubsystem: eVga.com. Corp. Device 2184', '\tKernel driver in use: nvidia', '\tKernel modules: nvidiafb, nouveau, nvidia_drm, nvidia', '']
hw_file_search:  []
Power reading for 0a:00.0: 107.63 W
GPU:  0b:00.0
['0b:00.0 VGA compatible controller: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)', '\tSubsystem: eVga.com. Corp. GP104 [GeForce GTX 1080]', '\tKernel driver in use: nvidia', '\tKernel modules: nvidiafb, nouveau, nvidia_drm, nvidia', '']
hw_file_search:  []
Power reading for 0b:00.0: 140.23 W
GPU_ITEM: uuid: 8ee7c608aabc41239c1094bd6569e3ec
    pcie_id: 08:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: GeForce RTX 2080
    openCL Version: OpenCL 1.2 CUDA
    openCL Index: 0
    card number: 0
    BOINC Device number: None
    card path: /sys/class/drm/card0/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: bf24108f54a946f080bb4fda43ddd5c6
    pcie_id: 0a:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: GeForce RTX 2080
    openCL Version: OpenCL 1.2 CUDA
    openCL Index: 1
    card number: 1
    BOINC Device number: None
    card path: /sys/class/drm/card1/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: cb6508fee15e4bcab3c6596008c884e9
    pcie_id: 0b:00.0
    model: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: GeForce GTX 1080
    openCL Version: OpenCL 1.2 CUDA
    openCL Index: 2
    card number: 2
    BOINC Device number: None
    card path: /sys/class/drm/card2/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: 8ee7c608aabc41239c1094bd6569e3ec
    pcie_id: 08:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: GeForce RTX 2080
    openCL Version: OpenCL 1.2 CUDA
    openCL Index: 0
    card number: 0
    BOINC Device number: None
    card path: /sys/class/drm/card0/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: bf24108f54a946f080bb4fda43ddd5c6
    pcie_id: 0a:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: GeForce RTX 2080
    openCL Version: OpenCL 1.2 CUDA
    openCL Index: 1
    card number: 1
    BOINC Device number: None
    card path: /sys/class/drm/card1/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: cb6508fee15e4bcab3c6596008c884e9
    pcie_id: 0b:00.0
    model: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: GeForce GTX 1080
    openCL Version: OpenCL 1.2 CUDA
    openCL Index: 2
    card number: 2
    BOINC Device number: None
    card path: /sys/class/drm/card2/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
devmap:  {0: 0, 1: 1, 2: 2}
GPU_ITEM: uuid: 8ee7c608aabc41239c1094bd6569e3ec
    pcie_id: 08:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: GeForce RTX 2080
    openCL Version: OpenCL 1.2 CUDA
    openCL Index: 0
    card number: 0
    BOINC Device number: 0
    card path: /sys/class/drm/card0/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: bf24108f54a946f080bb4fda43ddd5c6
    pcie_id: 0a:00.0
    model: NVIDIA Corporation TU104 [GeForce RTX 2080 Rev. A] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: GeForce RTX 2080
    openCL Version: OpenCL 1.2 CUDA
    openCL Index: 1
    card number: 1
    BOINC Device number: 1
    card path: /sys/class/drm/card1/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
GPU_ITEM: uuid: cb6508fee15e4bcab3c6596008c884e9
    pcie_id: 0b:00.0
    model: NVIDIA Corporation GP104 [GeForce GTX 1080] (rev a1)
    vendor: NVIDIA
    driver: nvidiafb, nouveau, nvidia_drm, nvidia
    openCL Device: GeForce GTX 1080
    openCL Version: OpenCL 1.2 CUDA
    openCL Index: 2
    card number: 2
    BOINC Device number: 2
    card path: /sys/class/drm/card2/device
    hwmon path: None
    Compute compatible: True
    Energy compatible: True
keith@Serenity:~/Downloads/benchMT-master$

@JStateson
Copy link
Author

JStateson commented Jan 25, 2020

Sorry, just got back to this thread. I've been putting out a fire: unaccountably, I cannot run clinfo repeatedly on a Linux system that has an AMD RX-570 with Einstein-at-home crunching.
https://askubuntu.com/questions/1205335/where-to-report-bugs-involving-opencl-in-ubuntu

I am working on getting my MSboinc program to write out the correlation with the bus IDs. On booting, it is supposed to read in the following (created by a Python script) and replace the "-1" entries with the IDs it is using:

<devmap>
<Num_GPUs>9</Num_GPUs>
<1>0 -1 01:00.0 NV GTX-1060-6GB</1>
<2>1 -1 02:00.0 NV GTX-1060-3GB</2>
<3>2 -1 03:00.0 NV GTX-1060-3GB</3>
<4>3 -1 04:00.0 NV P106-100</4>
<5>4 -1 05:00.0 NV GTX-1070</5>
<6>5 -1 08:00.0 NV P106-090</6>
<7>6 -1 0A:00.0 NV GTX-1060-3GB</7>
<8>7 -1 0B:00.0 NV GTX-1060-3GB</8>
<9>8 -1 0E:00.0 NV GTX-1060-3GB</9>
</devmap>

Is there anything I can do here on this thread?

@Ricks-Lab
Copy link
Owner

@JStateson I definitely ran with your original issue report and did a whole lot more! Probably resulted in more email than is desirable... If you could just test the latest on master on your nvidia system with ./benchMT --lsgpu --debug and post the results here. Just want to make sure it works fine. Then I will open a new help wanted issue for Beta testing and close this one. Thanks!

@JStateson
Copy link
Author

Ran OK, results here:
http://stateson.net/images/h110btc_benchMT_2.txt
It does not look as good as Keith's. Not sure how to format it. From a PC I used FTP to retrieve it from Linux, and then from the PC I used FileZilla to put it on my "junk" site.

@Ricks-Lab
Copy link
Owner

ran ok, results here
http://stateson.net/images/h110btc_benchMT_2.txt
Does not look as good as Keiths. Not sure how to format it. From a PC I used ftp to retrieve it from linux and then from the PC, I used FileZilla to put it my "junk" site.

Thanks! I was still able to see what I needed to in the attached file. Looks like the compute and energy capability determination is working correctly for Nvidia.

I noticed that clinfo indicates that the Intel 530 has compute capability. Does BOINC also see it as compute capable? It seems the pcie_id is stored differently for Intel. Can you provide the clinfo --raw output for this card?

Let me know if you are ok with me adding you to the benchMT credits for lspci/clinfo debug.

@KeithMyers
Copy link

Yes, BOINC detects the Intel iGPUs as capable of computing. There is a specific module in /client for detecting Intel GPUs.
extern vector<COPROC_INTEL> intel_gpus;
extern vector<OPENCL_DEVICE_PROP> intel_gpu_opencls;

@JStateson
Copy link
Author

JStateson commented Jan 25, 2020

Got the Python script "MakeTable.py" working for NVIDIA. The column starting with 1, 3, 4, 2, etc. is the BOINC ID; I correlated this from coproc_info.xml. The first column (0, 1, 2, ...) is the sensor ID from lm-sensors, or the ID in nvidia-smi. This still does not identify the exact card when the names are identical, but I am working on that.

<devmap>
<Num_GPUs>9</Num_GPUs>
<1>0 1 01:00.0 NV GTX-1060-6GB</1>
<2>1 3 02:00.0 NV GTX-1060-3GB</2>
<3>2 4 03:00.0 NV GTX-1060-3GB</3>
<4>3 2 04:00.0 NV P106-100</4>
<5>4 0 05:00.0 NV GTX-1070</5>
<6>5 8 08:00.0 NV P106-090</6>
<7>6 5 0A:00.0 NV GTX-1060-3GB</7>
<8>7 6 0B:00.0 NV GTX-1060-3GB</8>
<9>8 7 0E:00.0 NV GTX-1060-3GB</9>
</devmap>

Need to add the AMD ones, but my AMD system is offline. The above "xml" file is to be read into my MSboinc program and then written to the event log so that it shows up in BoincTasks for every remote system.
Note that the "0" goes to the GTX 1070 and the "8" goes to the slowest board, the P106-090.
Updated GitHub:
https://github.com/JStateson/BoincTasks/tree/master/SystemdService
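
A minimal sketch of reading such a devmap file back in Python (illustration only; MakeTable.py and MSboinc may do this differently). Each entry line has the form <n>sensor_id boinc_id pcie_id vendor model</n>:

import re

def read_devmap(path):
    # Parse devmap entry lines like:  <1>0 1 01:00.0 NV GTX-1060-6GB</1>
    entries = []
    for line in open(path):
        m = re.match(r"\s*<(\d+)>(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+?)</\1>", line)
        if m:
            idx, sensor_id, boinc_id, pcie_id, vendor, model = m.groups()
            entries.append({"sensor_id": int(sensor_id), "boinc_id": int(boinc_id),
                            "pcie_id": pcie_id, "vendor": vendor, "model": model})
    return entries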

@Ricks-Lab
Copy link
Owner

Yes, BOINC detects the Intel iGPUs as capable of computing. There is a specific module in /client for detecting Intel gpus.
extern vector<COPROC_INTEL> intel_gpus;
extern vector<OPENCL_DEVICE_PROP> intel_gpu_opencls;

Seems like the pcie_id is represented differently than for AMD or NV. Do you have an Intel card that you can get clinfo --raw output for?

@JStateson
Copy link
Author

JStateson commented Jan 25, 2020

Sorry, hit the wrong button.
The Intel raw data is here:
https://stateson.net/images/intel_raw.txt
Seems I cannot delete the closing. Access should have been denied; actually, I am the author of the issue! This thread is getting too long; maybe it is time to start another?

@Ricks-Lab
Copy link
Owner

sorry, hit wrong button
the intel raw data is here
https://stateson.net/images/intel_raw.txt
seems I cannot delete the closing. Access should have been denied ACTUALLY I AM THE AUTHOR OF THE ISSUE!! This thread is getting too long, maybe time to start another?

I agree, it is getting confusing to follow the multiple lines of thought. I will raise issues for the separate items and leave this one closed.
