
GPU speed-up on Raspberry Pi 5 #226

Open
flatsiedatsie opened this issue Jan 24, 2024 · 15 comments

Comments

@flatsiedatsie

I'm experimenting with Llamafile on a Raspberry Pi 5 with 8 GB of RAM, in order to integrate it with existing privacy-protecting smart home voice control. This is working great so far, as long as very small models are used.

I was wondering: would it be possible to speed up inference on the Raspberry Pi 5 by using the GPU?

Through this Stack Overflow post I've found some frameworks that already do this, such as:

The Raspberry Pi 5's VideoCore GPU has Vulkan drivers:
https://www.phoronix.com/news/Mesa-RPi-5-VideoCore-7.1.x

Curious to hear your thoughts.

Related:
#40

@jart jart changed the title from "Feature request: GPU speed-up on Raspberry Pi 5?" to "GPU speed-up on Raspberry Pi 5" on Jan 24, 2024
@jart
Collaborator

jart commented Jan 24, 2024

Ubuntu doesn't even support the Vulkan Mesa driver you linked yet, so I doubt Tencent and beatmup are using the GPU on the RPi 5. Vulkan Mesa is for graphics processing. You can't use it with OpenCL to multiply matrices. Even if we rewrote GGML in a shader language, libraries like OpenGL, GLFW, GLEW, etc. all depend on X Windows and can't run headlessly for general computation tasks like linear algebra. Broadcom claims their GPU is capable of general-purpose computation:

Although they are physically located within, and closely coupled to the 3D system, the QPUs are also capable of providing a general-purpose computation resource for non-3D software, such as video codecs and ISP tasks.
https://docs.broadcom.com/doc/12358545

The community project that lets Linux users write programs for Broadcom's GPU was abandoned three years ago and no longer builds. https://github.com/wimrijnders/V3DLib If you can show me how to multiply a matrix on this GPU without depending on frameworks, then I'll reopen this issue and strongly consider supporting it.

@jart jart closed this as completed Jan 24, 2024
@flatsiedatsie
Author

Thanks for the enlightening explanation. That is both good and bad news. Great that you're also enthusiastic about Raspberry Pi optimization, but sad to hear (and read) that there is so little support for the VideoCore hardware.

@jart
Collaborator

jart commented Jan 30, 2024

Looks like someone actually did rewrite GGML in a shader language. Yesterday ggerganov/llama.cpp#2059 got merged into llama.cpp, which adds Vulkan support and a whole bunch of shaders. This gives me new hope that Raspberry Pi 5 GPU support will be possible. Unfortunately it doesn't appear possible today. If I build llama.cpp at head with make LLAMA_VULKAN=1 and run TinyLlama Q4_0, I get this:

jart@pi5:~/llama.cpp$ ./main -e -m ~/TinyLlama-1.1B-Chat-v1.0.Q4_0.gguf -p '# Famous Speech\nFour score and' -n 50
Log start
main: build = 2008 (ceebbb5b)
main: built with cc (Ubuntu 13.2.0-4ubuntu3) 13.2.0 for aarch64-linux-gnu
main: seed  = 1706575520
TU: error: ../src/freedreno/vulkan/tu_knl.cc:251: device /dev/dri/renderD128 (v3d) is not compatible with turnip (VK_ERROR_INCOMPATIBLE_DRIVER)
ggml_vulkan: Using V3D 7.1.7 | fp16: 0 | warp size: 16

I'm going to leave this open until we can circle back, possibly in several months to a year, once the distro driver situation improves, or until someone else leaves a comment here helping us figure out how to do this. In the meantime, please do try this yourself. It's possible I broke my Ubuntu install by using a PPA earlier.

@jart jart reopened this Jan 30, 2024
@jart jart added the vulkan label Jan 30, 2024
@flatsiedatsie
Author

Awesome! It seems someone else in that thread also ran into an issue.

I'll attempt building Llamafile from source on the Pi 5 and let you know how it goes.

@flatsiedatsie
Author

flatsiedatsie commented Jan 30, 2024

It compiles and runs.

# Famous Speech\nFour score and seven years ago our etc

This is on a Pi 5 8 GB with the latest Raspberry Pi OS Lite, fully updated/upgraded, and the Mesa Vulkan drivers installed.

sudo apt-get update -y && sudo apt-get upgrade -y
sudo apt-get install libvulkan1 mesa-vulkan-drivers
git clone https://github.com/Mozilla-Ocho/llamafile.git
cd llamafile
make LLAMA_VULKAN=1
./o/llama.cpp/main/main -m YOUR_MODEL_PATH_HERE.gguf -p '# Famous Speech\nFour score and' -n 50

Whether it's actually GPU-accelerated though... I noticed this in the output:

llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU

The full log is below:

./o/llama.cpp/main/main -m /home/pi/.webthings/data/voco/llm/assistant/phi-2.Q4_K_S.gguf -p '# Famous Speech\nFour score and' -n 50
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
Log start
main: llamafile version 0.6.2
main: seed  = 1706622373
llama_model_loader: loaded meta data with 20 key-value pairs and 325 tensors from /home/pi/.webthings/data/voco/llm/assistant/phi-2.Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi2
llama_model_loader: - kv   1:                               general.name str              = Phi2
llama_model_loader: - kv   2:                        phi2.context_length u32              = 2048
llama_model_loader: - kv   3:                      phi2.embedding_length u32              = 2560
llama_model_loader: - kv   4:                   phi2.feed_forward_length u32              = 10240
llama_model_loader: - kv   5:                           phi2.block_count u32              = 32
llama_model_loader: - kv   6:                  phi2.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi2.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:          phi2.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi2.rope.dimension_count u32              = 32
llama_model_loader: - kv  10:                          general.file_type u32              = 14
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,51200]   = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,51200]   = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,50000]   = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 50256
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 50256
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 50256
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  195 tensors
llama_model_loader: - type q4_K:  125 tensors
llama_model_loader: - type q5_K:    4 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: mismatch in special tokens definition ( 910/51200 vs 944/51200 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = phi2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 51200
llm_load_print_meta: n_merges         = 50000
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2560
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 32
llm_load_print_meta: n_embd_head_k    = 80
llm_load_print_meta: n_embd_head_v    = 80
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2560
llm_load_print_meta: n_embd_v_gqa     = 2560
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 10240
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_K - Small
llm_load_print_meta: model params     = 2.78 B
llm_load_print_meta: model size       = 1.50 GiB (4.64 BPW) 
llm_load_print_meta: general.name     = Phi2
llm_load_print_meta: BOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token        = 50256 '<|endoftext|>'
llm_load_print_meta: UNK token        = 50256 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  1539.00 MiB
...........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   160.00 MiB
llama_new_context_with_model: KV self size  =  160.00 MiB, K (f16):   80.00 MiB, V (f16):   80.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     6.01 MiB
llama_new_context_with_model:        CPU compute buffer size =   115.50 MiB
llama_new_context_with_model: graph splits (measure): 1

system_info: n_threads = 4 / 4 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 512, n_batch = 512, n_predict = 50, n_keep = 0


# Famous Speech\nFour score and seven years ago our

llama_print_timings:        load time =   15421.26 ms
llama_print_timings:      sample time =       3.15 ms /     4 runs   (    0.79 ms per token,  1269.84 tokens per second)
llama_print_timings: prompt eval time =   18739.26 ms /     8 tokens ( 2342.41 ms per token,     0.43 tokens per second)
llama_print_timings:        eval time =   43841.51 ms /     3 runs   (14613.84 ms per token,     0.07 tokens per second)
llama_print_timings:       total time =   73117.41 ms /    11 tokens

@flatsiedatsie
Author

flatsiedatsie commented Jan 30, 2024

It seems they are speedily fixing bugs in llama.cpp:

Issue: interactive mode is broken on Vulkan
ggerganov/llama.cpp#5217

Pull request:
ggerganov/llama.cpp#5223

@Mar2ck

Mar2ck commented Feb 2, 2024

Whether it's actually GPU-accelerated though... I noticed this in the output:

llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU

You can offload layers to the GPU with the -ngl argument, which should give a much bigger speed improvement. Try -ngl 33, and if it crashes due to lack of GPU memory, keep reducing the number until it works.
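For example, a sketch reusing the binary and model path from the earlier comments (both are assumptions; adjust them for your own build and model):

# Run with all 33 layers offloaded to the GPU
./o/llama.cpp/main/main -m /home/pi/.webthings/data/voco/llm/assistant/phi-2.Q4_K_S.gguf -p '# Famous Speech\nFour score and' -n 50 -ngl 33
# If loading fails with an out-of-memory error from the Vulkan allocator,
# retry with a smaller count, e.g. -ngl 24, then -ngl 16, until it loads.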

@flatsiedatsie
Author

flatsiedatsie commented Feb 2, 2024

Thanks @Mar2ck !

It worked fine on the first try with -ngl 33.

The speed difference doesn't seem noticeable. Oddly, the base version itself seems to run much faster today compared to the last time I tried. Back then it generated one word per second. Not sure why it's different now.

## BEFORE

./o/llama.cpp/main/main -m /home/pi/.webthings/data/voco/llm/assistant/phi-2.Q4_K_S.gguf -p '# Famous Speech\nFour score and' -n 50

[screenshot: llamafile_without]

llama_print_timings:        load time =     431.10 ms
llama_print_timings:      sample time =      22.42 ms /    50 runs   (    0.45 ms per token,  2229.85 tokens per second)
llama_print_timings: prompt eval time =     886.69 ms /     8 tokens (  110.84 ms per token,     9.02 tokens per second)
llama_print_timings:        eval time =    8704.99 ms /    49 runs   (  177.65 ms per token,     5.63 tokens per second)
llama_print_timings:       total time =    9637.94 ms /    57 tokens

## AFTER

./o/llama.cpp/main/main -m /home/pi/.webthings/data/voco/llm/assistant/phi-2.Q4_K_S.gguf -p '# Famous Speech\nFour score and' -n 50 -ngl 33

[screenshot: llamafile_speedup]

llama_print_timings:        load time =     433.53 ms
llama_print_timings:      sample time =      23.30 ms /    50 runs   (    0.47 ms per token,  2145.46 tokens per second)
llama_print_timings: prompt eval time =     896.34 ms /     8 tokens (  112.04 ms per token,     8.93 tokens per second)
llama_print_timings:        eval time =    8670.61 ms /    49 runs   (  176.95 ms per token,     5.65 tokens per second)
llama_print_timings:       total time =    9615.66 ms /    57 tokens

// Edit: I added the logs. Technically speaking, the GPU version is actually a little slower. Strange.

// Edit 2: I tried again, with full system reboots in between.

Non-GPU version:

# Famous Speech\nFour score and seven years ago our fathers brought forth on this continent, a new nation...')
    output = speech.replace('Nation', 'Nation-State')
    print(output)

    Output:
    'Four score and seven years ago

llama_print_timings:        load time =   19469.40 ms
llama_print_timings:      sample time =      22.69 ms /    50 runs   (    0.45 ms per token,  2203.61 tokens per second)
llama_print_timings: prompt eval time =     855.38 ms /     8 tokens (  106.92 ms per token,     9.35 tokens per second)
llama_print_timings:        eval time =    8079.14 ms /    49 runs   (  164.88 ms per token,     6.07 tokens per second)
llama_print_timings:       total time =    8980.39 ms /    57 tokens

GPU version:

# Famous Speech\nFour score and seven years ago our fathers brought forth on this continent, a new nation...',
            'The United States of America is the world\'s oldest surviving federation.\n...'],
        ['I have a dream that my four little children

llama_print_timings:        load time =   25512.36 ms
llama_print_timings:      sample time =      23.41 ms /    50 runs   (    0.47 ms per token,  2135.57 tokens per second)
llama_print_timings: prompt eval time =     876.47 ms /     8 tokens (  109.56 ms per token,     9.13 tokens per second)
llama_print_timings:        eval time =    8216.39 ms /    49 runs   (  167.68 ms per token,     5.96 tokens per second)
llama_print_timings:       total time =    9142.51 ms /    57 tokens

Funny how both runs decided the prompt was programming-related.

@flatsiedatsie
Author

Wait a tick:

warning: --n-gpu-layers 33 was passed but no GPUs were found; falling back to CPU inference
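One quick way to check whether the Vulkan loader can see the V3DV device at all, a sketch assuming the mesa-vulkan-drivers and vulkan-tools packages are installed and the ICD manifests are in the usual Debian location:

# List the installed Vulkan ICD manifests; the V3DV (broadcom) JSON should appear here
ls /usr/share/vulkan/icd.d/
# Ask the loader which physical devices it enumerates
vulkaninfo --summary | grep deviceName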

@chuangtc

chuangtc commented Apr 10, 2024

https://www.phoronix.com/news/Raspberry-Pi-OS-Default-V3DV
With the Vulkan driver now installed by default in this OS, will it help you move forward?
This is how I tested local LLMs on a Raspberry Pi 5. It's around 1 token/sec, very slow.
https://aidatatools.com/2024/01/ollama-benchmark-on-raspberry-pi-5-ram-8gb/

@flatsiedatsie
Author

@chuangtc That's great news, thanks for sharing.

Which model are you running, though?

I got a lot more tokens per second than that running small models (tinyllama-1.1b-1t-openorca.Q4_K_M.gguf) on the CPU. On that topic, I look forward to seeing what the new mathematical functions created by @jart will do to improve running on the Pi further, as those are said to speed up context ingestion.

@chuangtc

Here is where I am asking for help on Reddit:
https://www.reddit.com/r/raspberry_pi/comments/1c24vga/how_to_make_llamafile_get_accelerated_during/
I noticed what could be a bug in vulkaninfo --summary:

jason@raspberrypi5:~ $ vulkaninfo --summary
WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Failed to CreateInstance in ICD 0.  Skipping ICD.
==========
VULKANINFO
==========

Vulkan Instance Version: 1.3.239

@chuangtc

Instance Extensions: count = 22
-------------------------------
VK_EXT_acquire_drm_display             : extension revision 1
VK_EXT_acquire_xlib_display            : extension revision 1
VK_EXT_debug_report                    : extension revision 10
VK_EXT_debug_utils                     : extension revision 2
VK_EXT_direct_mode_display             : extension revision 1
VK_EXT_display_surface_counter         : extension revision 1
VK_EXT_surface_maintenance1            : extension revision 1
VK_EXT_swapchain_colorspace            : extension revision 4
VK_KHR_device_group_creation           : extension revision 1
VK_KHR_display                         : extension revision 23
VK_KHR_external_fence_capabilities     : extension revision 1
VK_KHR_external_memory_capabilities    : extension revision 1
VK_KHR_external_semaphore_capabilities : extension revision 1
VK_KHR_get_display_properties2         : extension revision 1
VK_KHR_get_physical_device_properties2 : extension revision 2
VK_KHR_get_surface_capabilities2       : extension revision 1
VK_KHR_portability_enumeration         : extension revision 1
VK_KHR_surface                         : extension revision 25
VK_KHR_surface_protected_capabilities  : extension revision 1
VK_KHR_wayland_surface                 : extension revision 6
VK_KHR_xcb_surface                     : extension revision 6
VK_KHR_xlib_surface                    : extension revision 6

Instance Layers: count = 2
--------------------------
VK_LAYER_MESA_device_select Linux device selection layer 1.3.211  version 1
VK_LAYER_MESA_overlay       Mesa Overlay layer           1.3.211  version 1

Devices:
========
GPU0:
	apiVersion         = 1.2.255
	driverVersion      = 23.2.1
	vendorID           = 0x14e4
	deviceID           = 0x55701c33
	deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
	deviceName         = V3D 7.1.7
	driverID           = DRIVER_ID_MESA_V3DV
	driverName         = V3DV Mesa
	driverInfo         = Mesa 23.2.1-1~bpo12+rpt3
	conformanceVersion = 1.3.6.1
	deviceUUID         = 5fd8106e-741a-cafa-e080-fdb16cf11a80
	driverUUID         = 1698c6ef-161f-3213-5159-557202953ee9
GPU1:
	apiVersion         = 1.3.255
	driverVersion      = 0.0.1
	vendorID           = 0x10005
	deviceID           = 0x0000
	deviceType         = PHYSICAL_DEVICE_TYPE_CPU
	deviceName         = llvmpipe (LLVM 15.0.6, 128 bits)
	driverID           = DRIVER_ID_MESA_LLVMPIPE
	driverName         = llvmpipe
	driverInfo         = Mesa 23.2.1-1~bpo12+rpt3 (LLVM 15.0.6)
	conformanceVersion = 1.3.1.1
	deviceUUID         = 6d657361-3233-2e32-2e31-2d317e627000
	driverUUID         = 6c6c766d-7069-7065-5555-494400000000

@martincerven

Raspberry Pi 5 doesn't have all the Vulkan 1.3 capabilities: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10896

@nekoteai

Raspberry Pi 5 doesn't have all the Vulkan 1.3 capabilities: https://gitlab.freedesktop.org/mesa/mesa/-/issues/10896

Update: the missing extensions are now marked as supported.
