Crash in nvgpu module with 5.15 kernel 5.15.136-l4t-r36.3 #1613
Comments
Hi @kraj |
QT6 and all layers at |
Thanks. I will try to reproduce the bug and will get back to you ASAP |
Btw, I am not using Wayland or X11; it's using eglfs to launch the browser. |
The problem looks similar to problems in the past - the power management-related services have to start in a specific order, and before anything else touches the GPU, or you get these kinds of tracebacks. @kraj Are you using sysvinit or systemd as your init manager? |
I don't have a working r36 system with graphics quite yet, but for unrelated reasons, so I can't say whether I would run into this issue or not. I can say that I didn't run into this on a system with Qt and the eglfs gbm backend on r35. Which Qt eglfs backend is being used? For gbm, since this looks like an allocation problem, you could explore using
Edit: looks like Matt's message crossed with my own. His answer sounds more promising, but I'll leave mine here for posterity. |
I am using systemd, and I have looked at another issue where you fixed some sequencing of services; those changes are in master already. However, I do see this
and
not sure how |
for using
|
No, this is a mesa gbm backend for use with drm/kms |
using
The nvgpu crash reported above still remains. |
One more thought: Weston normally pulls in the |
then I added
Sadly, it gets past the above problem but fails with
|
Mesa should not be falling back on any DRI driver to initialize drm -- the drm implementation is provided by nvidia's libdrm, and mesa only gets used for buffer allocation in gbm. Your last message looks suspicious though, can you verify that the nvidia-drm kernel driver is loaded with the option |
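For anyone following along, one way to see which parameters a kernel module is currently running with is to read its `parameters` directory in sysfs. This is a minimal sketch, assuming the standard Linux layout `/sys/module/<name>/parameters/`; the function name `read_module_params` is hypothetical, and it only lists whatever parameters the module exposes rather than checking for any specific option.

```python
from pathlib import Path


def read_module_params(module, sysfs_root="/sys/module"):
    """Return {param: value} for a module, or None if it has no sysfs entry."""
    # sysfs directory names use underscores (nvidia-drm -> nvidia_drm).
    mod_dir = Path(sysfs_root) / module.replace("-", "_")
    if not mod_dir.is_dir():
        return None  # module not loaded (or not present at all)

    params = {}
    params_dir = mod_dir / "parameters"
    if params_dir.is_dir():
        for entry in sorted(params_dir.iterdir()):
            try:
                params[entry.name] = entry.read_text().strip()
            except OSError:
                # Some parameters are write-only or root-only.
                params[entry.name] = "<unreadable>"
    return params


if __name__ == "__main__":
    print(read_module_params("nvidia-drm"))
```

A result of `None` means the module is not loaded, which is worth ruling out before looking at individual option values.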
I built
|
That recipe is supposed to load nvidia-drm with the option |
right, I have rebooted.
|
Do you have a |
yep
|
Set up a file /usr/share/tegra.conf or something like that with these contents:
Then export these variables before launching your qt application:
|
It's already doing this in the .service file
and /etc/default/eglfs.json is
|
I guess it's also worth asking if you're still seeing the dma allocation failures from the gpu in the kernel logs, because if you are, maybe this is all just a tangent. |
Yes, I am seeing those messages consistently, as mentioned.
|
snippet of journal where this is seen whenever the yoe-kiosk-browser service or nvpmodel.service is restarted/started
|
For lack of anything better to try, it looks like the devkit sets up one of the ina3221's like this: /hardware/nvidia/t23x/nv-public/nv-platform/tegra234-p3701-0000.dtsi
And that could explain the missing label nodes. You could perhaps try removing this i2c device before loading the nvpower.sh? Or patch the devicetree, this is built by the recipe |
I tracked down an Orin devkit and installed
It looks like the warning from nvpower.sh is "normal".
Next I installed kmscube and nvidia-drm-loadconf, and rebooted. On the next boot, the nvidia-drm module was not loaded automatically; this is a problem. After
But still, after that, running kmscube works. I do not see any of the other aforementioned kernel traces from the gpu. I noticed that we were looking at the nvpower.service earlier, but you have a failure loading nvpmodel.service. Maybe we should proceed by looking into this.
|
Regarding |
@kekiefer I tried disabling the i2c device in the DT; it does not help. The second part is about failing
Not sure why yoe-kiosk-browser is failing too. The following messages, which might be of interest, are appearing in the journal
|
Yes, this is part of the default 30W power model, unless changed with |
|
|
That was my initial thought, but actually
Are you installing all kernel modules on the target? From both the out-of-tree collection and the kernel? |
Seems to be ok.
installed mods
|
The problems loading the power management bits early on still seem to be likely at the root of this, causing later issues dealing with the gpu. I've got the
For what it's worth, these are the git hashes I used to build the tegrademo demo-image-egl, where it is working on an orin devkit for me:
|
yoe-kiosk-browser (which is based on qtwebengine) gets a SIGSEGV and I could fathom the backtrace now.
|
I really think you need to solve the prior problems setting up power management for the gpu in the kernel before diving into the details of the graphics stack. |
One note though - without nvidia_modeset and nvidia_drm, you won't be able to load a graphics device with gbm. |
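A quick way to confirm that point on a target is to compare the loaded-module list against the modules named above. This is a minimal sketch, assuming the usual `/proc/modules` format where the module name is the first field on each line (the same first column `lsmod` prints); the helper name `missing_modules` is hypothetical.

```python
# Modules named in this thread as required for a gbm graphics device.
REQUIRED = ("nvidia_modeset", "nvidia_drm")


def missing_modules(proc_modules_text, required=REQUIRED):
    """Return the required modules that are absent from /proc/modules text."""
    loaded = {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }
    return [name for name in required if name not in loaded]
```

On the device this could be called as `missing_modules(open("/proc/modules").read())`; an empty list means both modules are loaded.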
Yeah, I was putting it here for reference, to see whether the path for an "eglfs"-based image is still OK or whether it is using the wrong libraries, etc. |
can you share your kernel |
Regarding nvidia_drm (from the oot modules recipe), it looks like you have it installed, but it wasn't loaded in your printout of lsmod, despite installing |
hmm
and on reboot I do see these modules
The SEGV seen before remains as it is. Comparing
|
Here are my dmesg logs. It seems that the nvgpu messages are key, as they happen with |
Here are a journal and dmesg from a run where I interactively log in and run kmscube. The kernel logs look substantially the same on quick review, up until the errors, so maybe there are some clues in the journal? Does it make a difference if you delay starting yoe-kiosk-browser until much later? |
From the dmesg log, it looks like you have a 64GiB module (P3701-0005) in there, rather than the 32GiB one (P3701-0000). You'll need to use |
Ha! That could be the root of it all. I must say the machine names are a bit confusing and I got tripped up. It would be good if there were some way to name them so they are more revealing. Are the SKU numbers readable in some form from the machine, via some NVRAM read, etc.? |
Thanks a lot @madisongh, this really helped and nailed the problem. A second, minor issue was that I have to use |
The full part number is stored in an EEPROM on the module. The
Yep, that's a problem, and it's worse now than with earlier L4T versions due to device trees being different between SKUs in the same family. It's less of a problem for NVIDIA, since everything's pre-built, and their flashing scripts read the module info before constructing the rootfs, so they can get away with using the same config name for all of the variants. That's harder for us, since we have to know some of these differences at build time. Still, I think there's something we could do to at least catch these mismatches earlier in the process. |
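To illustrate the kind of early mismatch check being discussed, here is a rough sketch that compares a part number found in some text (a boot log, for example) against the SKU the image was built for. It is not an existing tool in the layer. The only facts taken from this thread are that P3701-0000 is the 32GiB module and P3701-0005 the 64GiB one; the function names and the assumption that the part number appears in a form containing `3701-NNNN` are hypothetical.

```python
import re

# From this thread: P3701-0000 is the 32GiB module, P3701-0005 the 64GiB one.
KNOWN_SKUS = {"P3701-0000": "32GiB", "P3701-0005": "64GiB"}

# Assumption: the part number shows up as "...3701-NNNN..." somewhere in the text.
PART_RE = re.compile(r"3701-(\d{4})")


def find_sku(text):
    """Return the first P3701-NNNN part number found in text, or None."""
    match = PART_RE.search(text)
    return f"P3701-{match.group(1)}" if match else None


def check_sku(text, expected):
    """Describe whether the module found in text matches the expected SKU."""
    found = find_sku(text)
    if found is None:
        return "no P3701 part number found"
    if found != expected:
        found_mem = KNOWN_SKUS.get(found, "unknown memory size")
        expected_mem = KNOWN_SKUS.get(expected, "unknown memory size")
        return (f"mismatch: built for {expected} ({expected_mem}), "
                f"module reports {found} ({found_mem})")
    return "ok"
```

For the situation in this issue, `check_sku(log_text, "P3701-0000")` would report a mismatch when the log mentions P3701-0005.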
Describe the bug
nvpmodel.service fails to start and any other services needing OpenGL/EGL also do not start
To Reproduce
Build QTWebengine for MACHINE=jetson-agx-orin-devkit
Additional context
crash report as seen on console.