Something changed in lspci and the grep is failing is my guess #5
@JStateson Sorry about the late response to this. I think the notification came during travels, so I missed it. I think this problem is associated with AMD-only energy measurements. I will need some time to dig into it and implement a fix.
When I was in undergrad (a century ago?) it was fun to see who could write the shortest program to translate Morse code. I thought my 3-line program was good, but the instructor showed us his 1-line program in APL that did the trick. Unlike your grep above, his APL was understandable to me.
The grep statement does 3 things: it looks for all GPUs, then selects only AMD GPUs from those results, and then gets the PCIe ID from the final results. Maybe it would be better to do this in 3 steps, as in the sketch below. Originally, I only intended to run this when the --energy option is used, which is only applicable for AMD at this time, but then pulled it early in the flow to determine devmap, which maps BOINC device numbers to Linux card numbers. I will work on this with the plan of eventually including energy measurements for NVidia. Can you help by providing the output of the grep for NVidia GPUs? The code on master has already been modified to work correctly, but I still want to make the longer-term improvements.
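A minimal sketch of the 3-step version; the exact patterns and the field extraction benchMT needs may differ, so treat these as illustrative:

```bash
# A sketch only; the actual benchMT patterns may differ.
lspci | grep -E "VGA|3D" > /tmp/gpus.txt          # step 1: all GPU-class devices
grep -E "AMD|ATI" /tmp/gpus.txt > /tmp/amd.txt    # step 2: keep only AMD GPUs
awk '{print $1}' /tmp/amd.txt                     # step 3: PCIe ID (bus:dev.func)
```

Doing it in explicit steps also makes the failure mode visible: if step 2 matches nothing, the file is empty rather than silently feeding null downstream.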
The following did nothing:
This was what it had to work with on the H110BTC with Ubuntu 18.04, 9 NVidia GPUs, and 1 Intel. Note that two of the NVidia cards are designated as 3D and not VGA.
On an Ubuntu 18.04 system with three AMD boards, lspci showed the following:
@JStateson Thanks for providing the details! This will help in the development of support for HW that I don't have. I see you have some experience with BOINC, so I have a question. It would be very useful if I could associate a BOINC device number with a Linux card number, but I have not been able to find details on how BOINC assigns the number. So far from observations, it appears to be the reverse of Linux card numbers, but my few systems are not enough to know this for sure. Let me know if you have insight into this; otherwise, maybe I can produce output that can be validated on your systems.
I have spent some time looking at this myself and have not figured it out. The problem I was trying to solve was which board was causing a computation failure when there are multiple identical GPUs. The solution I had been using was to manually stop the GPU fan from moving and see which board showed a temperature increase. This is obviously less than ideal, but it does work in both Windows and Linux using boinctasks' capability of displaying temperatures. I would like to do this programmatically.

What I have learned: NVidia's CUDA reports bus IDs of 1..6 for 6 boards, but ATI shows opencl_driver_index also starting at 0. The module that reads in the info and does the sorting does not have an entry in the C++ structure such as "bus id" or "driver index", nor even the name of the board such as gtx1660ti, etc. All that is lost once gpu_detect returns. Compounding the problem is the numbering of the board by the NVidia driver: nvidia-smi lists boards using 0..5 (for 6 boards), but coproc_info shows a bus ID of "2" and also a device ID of "0". The value of 0 for the openCL ID does not correspond to the nvidia-smi table. The net effect of all this is that the GPUs d0, d1, d2, etc. associated with work units cannot be matched to physical boards; the sketch after this comment shows one way to line the numbering schemes up. Some ideas I was looking at:

EDIT: URL to efmer was corrected.
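One hedged way to start lining up the three numbering schemes side by side (the coproc_info.xml field names are assumed from BOINC's pci_info block, so treat this as a sketch):

```bash
# nvidia-smi's own index next to the PCI bus ID it reports for each card:
nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv,noheader
# What BOINC recorded (field names assumed from BOINC's pci_info block):
grep -E "<(domain_id|bus_id|device_id)>" coproc_info.xml
```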
There does not appear to be any unique identifying serial number on any NVidia board. I recall reading somewhere that the manufacturers decided long ago not to put a serial number in, as that might be used to prevent software from working on a replacement board unless a "fee" was paid to the SW developer. I had the idea of re-flashing the BIOS and incrementing the "date" so as to be able to identify which board had a problem. Another thought was to run a small "performance" test under direction of the gpu_detect module and have a program on the Linux or Windows remote system determine which one of the 6, 8, or 19 GPUs was the one running the test.
For now, I am rewriting the part of the code that generates the list of GPUs. Originally, I used lshw, but later I added the capability to estimate energy used, so I had to take parts from my amdgpu-utils and use lspci and driver files; this was just added on to what was already there. I plan to make the new GPU list the core of how benchMT uses GPU compute resources. This should make further improvements much easier. To make the association in the past, I have used the benchMT command line option to run on a specific device and used amdgpu-utils to see which card number has loading, and built a devmap which I stored in the benchCFG file. I hope a more generic implementation would allow it to work with other than AMD GPUs. It will take some time for the rewrite...
@JStateson
As far as I know . . . . EVERY NVidia card gets an ID. EVGA has serial number stickers on the back of every card, for example. Also, every GPU in the system gets a unique GPU UUID, which is a 32-character hexadecimal number.
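For reference, both can be pulled from nvidia-smi; a minimal sketch (the serial field comes back as [N/A] on many consumer boards):

```bash
nvidia-smi -L                                          # one line per GPU with its UUID
nvidia-smi --query-gpu=index,uuid,serial --format=csv  # serial is often [N/A] on consumer cards
```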
I know that for AMD there is a unique_id device file that returns a hex number, but I don't think it is useful in mapping between BOINC device number and Linux card number.
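For anyone following along, the file lives under the card's device directory; a sketch, assuming a GPU and driver recent enough to provide it:

```bash
cat /sys/class/drm/card0/device/unique_id   # hex ID; absent on older GPUs/drivers
```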
@JStateson @KeithMyers Also, I am trying to figure out the hwmon file that can be used to read current power. Maybe the name of a file in the card's hwmon directory will be obvious. If so, please cat it and send me details. Thanks!
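If it helps, a hedged guess at what to look for on the amdgpu side (power1_average reports microwatts on the cards I have seen):

```bash
# Print each card's average-power file along with its current value:
grep . /sys/class/drm/card*/device/hwmon/hwmon*/power1_average
```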
Will add the -s later for you.
Is the last one on the list a server VGA card with no compute capabilities?
It looks like my assumptions are correct. It would be great if you could run the latest test version of benchMT on master. It will exit after displaying GPU information.
Yea, that was built-in, no compute, unlike the Intel 530.
Will do later, thanks!
https://stateson.net/images/h110btc_benchMT.txt
https://stateson.net/images/tb85_benchMT.txt
https://stateson.net/images/dualxeon_benchMT.txt

The only obvious hits were in the system with the AMD cards.
Took a while, but I navigated to the first AMD card and got a directory listing.
I did not find anything similar for the NVidia systems.
Poking around on both NVidia systems shows only core (CPU) info at any hwmon folder. Maybe something needs to be installed? Unlike the dualxeon, both of the NVidia systems are missing Intel CPU frequency settings:
that file is missing
So I suspect something was not installed on the NVidia systems, but I don't know what it was. I needed to be able to step the frequency down on the Xeon, as even with watercooling it overheated during the summer.
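A quick diagnostic sketch for seeing which drivers actually own the hwmon nodes on each box (the proprietary NVidia driver exposes power through NVML/nvidia-smi rather than hwmon, which would explain the missing files):

```bash
grep . /sys/class/hwmon/hwmon*/name   # e.g. coretemp, amdgpu; the NV driver adds none
```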
Don't really see anything useful on my host with Nvidia cards.

Set specified gpu_devices: [0, 1, 2]
Keith: how did you get that info?
@JStateson @KeithMyers Thanks for posting all of the details. This makes things much clearer for me. Seems like the NVidia implementation is quite different. I wonder if this is why Torvalds complains about them! But at least I now have an easy way to find out if a GPU can support energy measurements. Perhaps there is another way, like nvidia-smi. Do you know if there is a command line argument to give power for a given pcie_id or card number? I have implemented a --lsgpu option for benchMT, currently on master. This will just display the GPU details and exit. It requires that clinfo is installed to get full details. It would be interesting to see your results posted. Do either of you know if openCL exists in parallel with CUDA, or are they installed separately? I think the next step is to see if there is a predictable association between BOINC device number and Linux card number. I have manually mapped them by running benchMT with a specified device and monitoring the cards with another app. If you have some time to do this, please let me know your results.
From clinfo:
You can use this nvidia-smi command for polling power usage on a card. Or for a snapshot of a single card. If you want all the cards at once:

Both the CUDA and OpenCL APIs are included in the standard Nvidia drivers. Sometimes the OpenCL API is dropped from packages but can always be installed separately if needed.
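The commands themselves were stripped from the comment above; these are plausible equivalents offered as a sketch, not necessarily the exact lines:

```bash
nvidia-smi -i 0 -q -d POWER -l 5                            # poll power on card 0 every 5 s
nvidia-smi -i 0 --query-gpu=power.draw --format=csv         # snapshot of a single card
nvidia-smi --query-gpu=index,name,power.draw --format=csv   # all cards at once
# -i also accepts a UUID or PCI bus ID in place of the index.
```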
That was what came up when I ran the test benchMT in the terminal.
@JStateson The empty brace at the top of your last posted output indicates that openCL is not installed. This means checking for openCL to judge compute capability is not going to work; I probably need to find another way to detect CUDA capability. @KeithMyers @JStateson I have attempted an implementation of energy metrics for NVidia. Can you run 'benchMT --lsgpu' and post your output here? I need to learn to interpret the output first. I have also included a power read of a bad card number, meant to be an error, so I know how to manage it. Thanks!
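One openCL-free check I may try (an assumption on my part: it relies on the proprietary NVidia kernel driver being loaded, not on openCL):

```bash
ls /proc/driver/nvidia/gpus/                 # one subdirectory per GPU, named by PCI ID
cat /proc/driver/nvidia/gpus/*/information   # model, bus location, etc.
```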
Something isn't correct in the code.
Been busy working on my boinctasks temperature projects. I wanted to show wattage used as an option, as a first step toward trying to identify which card on the motherboard corresponds to a problem showing up on the boinctasks display. Anyway, I got the following information that might be helpful:
What I hope to accomplish: be able to look at a display of a problem work unit as shown by boinctasks for any remote system running Linux, see the temps and wattage, be able to identify the card, and, somehow, if the cards have identical names, find some identifying serial number or code programmatically so as to know which board has the problem. I have my own version of BOINC, "MSboinc", that I can build for Windows or Linux easily and will be adding these tools to it and, eventually, to my boinctasks history reader. I want to eventually replace performance reports that have "d3" with something meaningful like "gtx-1660Ti".

```
jstateson@h110btc:~/Projects/BoincTasks$ nvidia-smi -L
GPU 0: GeForce GTX 1060 6GB (UUID: GPU-a2089043-23bd-3481-efb2-f3cbbce5906a)
jstateson@h110btc:~/Projects$ clinfo | grep "Device Topology"
jstateson@h110btc:~/Projects$ clinfo | grep "Device Name"
jstateson@h110btc:~/Projects/BoincTasks/SystemdService$ nvidia-smi -q -d PIDS | grep "GPU 0"
```

Below is from the BOINC event messages on h110btc, from coproc_info; note that it matches clinfo.
From clinfo on a tb85 system:
@JStateson I noticed that this file has information on the NVidia compute platform, but my script is not picking it up. Can you post the output of
@JStateson @KeithMyers
@KeithMyers Was this with the current benchMT on master? I made a change a few hours ago. Also, can you post the output of clinfo --raw?
Yes, this was with the new master updated ten minutes ago.
@KeithMyers Thanks! Exactly what I needed. Looks like the pcie ID is stored differently between NV and AMD. Should be an easy fix. I will let you know when I push an update with the changes.
I would expect that to be the case, since two different APIs are being used. BOINC depends on the vendor API to pull the identifying information and capabilities of the detected cards. So AMD's API uses a different format compared to Nvidia's API to store the PCIe ID.
```
// Detection of AMD/ATI GPUs
// NvAPI now provides an API for getting #cores :-)
// OpenCL interfaces are documented here:
```
@KeithMyers I just posted a fix, but am not able to test. Can you give it a try? @JStateson In this latest release, I added openCL Device Index, which may help in solving the mapping problem. I noticed that
I just pushed another update. clinfo for NV reports back Bus ID and Slot ID, but the format of the pcie ID is Bus:Device:Function, so I assumed Slot maps to Device. The only way to know for sure is to see if it matches the pcie ID found by lspci. It seems from @JStateson's post above that clinfo without --raw is giving Device Topology. It would be useful to see the output of --raw and default for NVidia.
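A sketch of the comparison I have in mind; the property token names are assumptions based on the NV openCL extension, so check them against your clinfo build:

```bash
clinfo --raw | grep -E "PCI_BUS_ID_NV|PCI_SLOT_ID_NV"   # NV extension values
lspci | grep -E "VGA|3D"                                # bus:device.function to compare
```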
Doesn't seem to be picking up the OpenCL device number.
So why is setting my boinc_home an invalid argument now?
It’s complaining about the second ./benchMT.
Ah, silly me. Didn't see the extra one.
See power readings on the cards now. Currently running a mix of Einstein and Milkyway tasks since Seti is still fubared.
Not sure why it’s not setting the clinfo PCIe ID, but I thought of a new way to test it myself. I will put your clinfo output into a file and cat it instead of executing clinfo. I will try this in the morning.
@KeithMyers I just found and fixed what I think the problem is. Can you try the latest on master with the lsgpu and debug options again?
Sorry, just got back on this thread. Been putting out a fire: unaccountably, I cannot run clinfo too many times on a Linux system having an AMD RX-570 with Einstein@Home crunching. I am working on getting my MSboinc program to write out the correlation with the bus IDs. On booting, it is supposed to read in the following (created by a Python script) and replace the "-1" with the IDs it is using:
Is there anything I can do here on this thread?
@JStateson I definitely ran with your original issue report and did a whole lot more! Probably resulted in more email than is desirable... If you could just test the latest on master on your NVidia system with
ran ok, results here |
Thanks! I was still able to see what I needed in the attached file. Looks like the compute and energy capability determination is working correctly for NVidia. I noticed that clinfo indicates that the Intel 530 has compute capability. Does BOINC also see it as compute capable? It seems like pcie_id is stored differently for Intel. Can you provide the

Let me know if you are ok with me adding you to the benchMT credits for the lspci/clinfo debug.
Yes, BOINC detects the Intel iGPUs as capable of computing. There is a specific module in /client for detecting Intel GPUs.
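On the Intel pcie_id question, a hedged note: Intel's openCL driver exposes no vendor topology extension the way NV/AMD do, so the PCIe ID would have to come from lspci, where the iGPU shows up at a fixed address:

```bash
lspci -D | grep -E "VGA|3D|Display"
# Intel iGPUs typically sit at 0000:00:02.0 (domain:bus:device.function).
```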
Got the Python script "MakeTable.py" working for NVidia. The column starting with 1, 3, 4, 2, etc. is the BOINC ID; I correlated this from coproc_info.xml. The first column, 0, 1, 2, represents the sensor ID of lm-sensors or the ID in nvidia-smi. This still does not get the exact card when the names are identical, but I am working on that.
I need to add the AMD ones, but my AMD system is offline. The above "xml" file is to be read into my MSboinc program and then written to the event log so that it shows up on BoincTasks for every remote system.
Seems like the pcie_id is represented differently than AMD or NV. Do you have an Intel card that you get
Sorry, hit wrong button.
I agree, it is getting confusing to follow the multiple lines of thought. I will raise issues for the separate items and leave this one closed.
Ubuntu 18.04
lspci version unknown (no -version argument)
from:

to:

Removing the "ATI" and "AMD" terms fixed the problem of the first grep feeding "null" into the second.
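The before/after lines were lost above; a hypothetical reconstruction of the shape of the fix (the real benchMT patterns may differ):

```bash
# Hypothetical shapes only; the actual benchMT patterns differ.
lspci | grep -E "ATI|AMD" | grep -E "VGA|3D"   # before: vendor filter can match nothing
lspci | grep -E "VGA|3D"                       # after: vendor terms removed
```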