System crash when running inference with GpuAcc enabled on Mali T860 #116

Closed
noinhome opened this issue Jan 3, 2019 · 11 comments
Labels
Bug Something isn't working

Comments

@noinhome

noinhome commented Jan 3, 2019

Hi All,

I am using a development board with an ARM Mali-T860MP4 GPU, running MobileNet v2 (TensorFlow model).
CpuAcc and CpuRef work very well, but when I use GpuAcc the system freezes after running for anywhere between about 10 seconds and 5 minutes (the timing is random).

I can find error messages in the kernel log such as:
mali ff9a0000.gpu: GPU Fault 0x00000088 (UNKNOWN) at 0x00000000a5d7a240
I know this could be an OpenCL driver issue. Are there any suggestions for finding out what the problem is?

Greatly appreciated.

@MatthewARM
Collaborator

Hi @noinhome, sorry to hear you are having trouble. I'll look into this and get back to you as soon as possible.

@MatthewARM
Collaborator

Hi @noinhome, can you give more information about which board you are using? And do you know which version of the Mali OpenCL driver you have?

@noinhome
Author

Hi @MatthewARM,
We are using a Firefly RK3399 development board (link).
As for the Mali OpenCL driver, I have tried two versions: libmali-midgard-4th-r13p0.so (provided by Firefly) and libmali-midgard-t86x-r14p0.so (provided by Rockchip, link). Both are 64-bit builds, and both show the same system freeze.
Please let me know if there’s anything I can do to help.

@MatthewARM
Collaborator

Hi @noinhome thanks for the information. We're trying to set up a Firefly RK3399 with those driver versions and we'll let you know how we get on.

@Surmeh Surmeh added the Bug Something isn't working label Jan 16, 2019
@oleksiig

Hi All,
I have the same problem with the Mali-T860MP4 GPU inside the RK3399 SoC, but only during Android boot. Sometimes the graphics system hangs with a GPU fault, and sometimes boot continues after the following logs:

[ 55.198345] mali ff9a0000.gpu: GPU Fault 0x00000088 (UNKNOWN) at 0x00000000c9555080
[ 55.223759] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[ 55.229340] mali ff9a0000.gpu: Flush for GPU page table update did not complete. Issueing GPU soft-reset to recover
[ 55.241171] mali ff9a0000.gpu: Preparing to soft-reset GPU: Waiting (upto 3000 ms) for all jobs to complete soft-stop
[ 55.258033] mali ff9a0000.gpu: Resetting GPU (allowing up to 500 ms)
[ 55.265827] mali ff9a0000.gpu: Register state:
[ 55.270822] mali ff9a0000.gpu: GPU_IRQ_RAWSTAT=0x00000200 GPU_STATUS=0x00000003
[ 55.278922] mali ff9a0000.gpu: AS_ACTIVE bit stuck
[ 55.279331] mali ff9a0000.gpu: JOB_IRQ_RAWSTAT=0x00000000 JOB_IRQ_JS_STATE=0x00000000
[ 55.284426] mali ff9a0000.gpu: Flush for GPU page table update did not complete. Issueing GPU soft-reset to recover
[ 55.304929] mali ff9a0000.gpu: JS0_STATUS=0x00000000 JS0_HEAD_LO=0x00000000
[ 55.313411] mali ff9a0000.gpu: JS1_STATUS=0x00000000 JS1_HEAD_LO=0xfe9ebb00
[ 55.322057] mali ff9a0000.gpu: JS2_STATUS=0x00000000 JS2_HEAD_LO=0x00000000
[ 55.330532] mali ff9a0000.gpu: MMU_IRQ_RAWSTAT=0x00000000 GPU_FAULTSTATUS=0x00ff0388
[ 55.339369] mali ff9a0000.gpu: GPU_IRQ_MASK=0x00000000 JOB_IRQ_MASK=0x00000000 MMU_IRQ_MASK=0x00000000
[ 55.350701] mali ff9a0000.gpu: PWR_OVERRIDE0=0x00000000 PWR_OVERRIDE1=0x00000000
[ 55.367351] mali ff9a0000.gpu: SHADER_CONFIG=0x00010000 L2_MMU_CONFIG=0x00000000
[ 55.384878] mali ff9a0000.gpu: TILER_CONFIG=0x00000001 JM_CONFIG=0x00000038

Sometimes I see the following, and after that nothing more is shown on the display:

mali ff9a0000.gpu: GPU Fault 0x00000088 (UNKNOWN) at 0x00000000d77eb180
mali ff9a0000.gpu: ctx 745_2: Atom 131 still waiting for fence [ffffffc0c3a8b980] after 3000ms
mali ff9a0000.gpu: Guilty fence [ffffffc0c3a8b980] 259#9: signaled

The GPU driver is taken from the Firefly RK3399 kernel (r18p0-01rel0).
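
For anyone comparing logs like the ones above, here is a minimal Python sketch that pulls the fault code and address out of the "GPU Fault" kernel messages and counts how often each code appears. The function names (`parse_gpu_faults`, `count_by_code`) are made up for illustration, and the regular expression only assumes the line format quoted in this thread; other driver versions may print something different.

```python
import re
from collections import Counter

# Matches lines such as:
#   [ 55.198345] mali ff9a0000.gpu: GPU Fault 0x00000088 (UNKNOWN) at 0x00000000c9555080
# The leading "[ seconds ]" timestamp is optional because some lines in this
# thread were pasted without it.
FAULT_RE = re.compile(
    r"(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?"
    r"mali (?P<dev>\S+): GPU Fault (?P<code>0x[0-9a-fA-F]+) "
    r"\((?P<name>[^)]*)\) at (?P<addr>0x[0-9a-fA-F]+)"
)


def parse_gpu_faults(log_text):
    """Return one dict per 'GPU Fault' line found in log_text (hypothetical helper)."""
    faults = []
    for line in log_text.splitlines():
        m = FAULT_RE.search(line)
        if m is None:
            continue  # not a GPU Fault line, e.g. "AS_ACTIVE bit stuck"
        faults.append({
            "timestamp": float(m.group("ts")) if m.group("ts") else None,
            "device": m.group("dev"),
            "code": int(m.group("code"), 16),
            "name": m.group("name"),
            "address": int(m.group("addr"), 16),
        })
    return faults


def count_by_code(faults):
    """Count faults per fault code, to see whether it is always the same one."""
    return Counter(f["code"] for f in faults)


# Sample lines copied from the logs posted in this thread.
SAMPLE = """\
[ 55.198345] mali ff9a0000.gpu: GPU Fault 0x00000088 (UNKNOWN) at 0x00000000c9555080
[ 55.223759] mali ff9a0000.gpu: AS_ACTIVE bit stuck
mali ff9a0000.gpu: GPU Fault 0x00000088 (UNKNOWN) at 0x00000000d77eb180
"""

if __name__ == "__main__":
    for fault in parse_gpu_faults(SAMPLE):
        print(hex(fault["code"]), hex(fault["address"]), fault["timestamp"])
```

The same function could be fed a saved kernel log from the board. This only helps with grouping and comparing faults between reports; it does not explain what fault code 0x88 means, which the driver itself reports as UNKNOWN.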

@MatthewARM
Collaborator

Thanks @oleksiig, that suggests this might not be an ArmNN problem, but rather a general driver stability issue.

We're still working on getting an equivalent setup here to confirm.

@TelmoARM
Contributor

TelmoARM commented Feb 4, 2019

Hi @noinhome ,

Apologies for the delayed reply.

Using the following set-up:
Firefly RK3399
Firefly Linux 16.04 image
Mobilenet_V2_1.0_224.pb
ArmNN: https://review.mlplatform.org/#/c/ml/armnn/+/619/

We have been able to reproduce the GPU fault, but not consistently. It occurs occasionally, and it does not seem to be caused by ArmNN.

We have been unable to reproduce the freeze/hang. I'm still looking into it and will report back if I get it.

Telmo

@noinhome
Author

noinhome commented Feb 6, 2019

Hi @TelmoARM
Thanks for your clarification on this freeze issue. It seems this is much more likely an OpenCL driver issue than an ArmNN bug.
I will try to report this GPU issue to Rockchip. If there is any progress, I will report back too.

@noinhome
Author

noinhome commented Mar 6, 2019

Hi ARMNN Team,

I got a reply from Rockchip, and they gave me a large kernel patch (about 134 files...) for the GPU freeze issue. They said the freeze may be caused by some known Mali device driver bugs in their own kernel.
I think this issue can be classified as a vendor kernel issue rather than an ArmNN bug, and can be closed.

@MatthewARM
Collaborator

Thanks @noinhome, good luck.

@dorforer

Hi, can I get the solution Rockchip gave you? I'm having the same problem.
