
Repeated token generation for gpu-based gemma models on device #5414

Open
adityak6798 opened this issue May 16, 2024 · 4 comments
Assignees: kuaashish
Labels: os:windows, platform:android, stat:awaiting response, task:LLM inference, type:support

Comments

@adityak6798

Have I written custom code (as opposed to using a stock example script provided in MediaPipe)

No

OS Platform and Distribution

Windows 11, Android 14

MediaPipe Tasks SDK version

No response

Task name (e.g. Image classification, Gesture recognition etc.)

LLMInference

Programming Language and version (e.g. C++, Python, Java)

Kotlin/Java

Describe the actual behavior

GPU Model generates garbage output (same token repeated indefinitely)

Describe the expected behaviour

A meaningful response from the model

Standalone code/steps you may have used to try to get what you need

I used the LLM Inference guide to get the code working for LLM inference with Gemma 1.1 2B (gemma-1.1-2b-it-gpu-int4.bin).
Using the CPU variant of the model and prompting it to output only one integer value, I got what I expected (e.g. 3).
When I used the GPU variant, I got garbage responses (nothing else changed in the codebase, just the model).
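
For context, a minimal sketch of the setup path described above, assuming the standard LLM Inference Android API; the prompt, option values, and on-device path are illustrative, and the only change between the CPU and GPU runs is the model file:

```kotlin
import android.content.Context
import com.google.mediapipe.tasks.genai.llminference.LlmInference

fun runGemma(context: Context): String {
    // Assumed on-device location of the downloaded model; swapping this path between
    // the CPU and GPU .bin files is the only difference between the two runs.
    val options = LlmInference.LlmInferenceOptions.builder()
        .setModelPath("/data/local/tmp/llm/gemma-1.1-2b-it-gpu-int4.bin")
        .setMaxTokens(256) // illustrative value
        .build()

    val llmInference = LlmInference.createFromOptions(context, options)

    // With the CPU model this returns a short answer such as "3"; with the GPU model
    // the observed output repeats a single token until the max token limit is hit.
    return llmInference.generateResponse("How many sides does a triangle have? Answer with one integer.")
}
```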

Other info / Complete Logs

Bad output examples:
1. <bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos><bos> (repeated until max tokens)
(Expected output: "3", achieved when using the CPU variant of the same model)
2. KíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKíchKích disagre disagre disagre disagre disagre disagre disagre disagre disagre disagre disagre disagre (repeated until max tokens)
(Expected output: "2", achieved when using the CPU variant of the same model)
@adityak6798 (Author)

I tried downloading the same Gemma model (non-quantized) from Hugging Face using the scripts given in the "Model Conversion Colab" (model: Gemma 2B for GPU). I see the same kind of error: garbage output.
Output sample: "VentajasVentajasVentajasVentajasVentajasVentajasVentajasVentajasVentajasVentajasVentajasVentajasVentajasVentajasVentajasVentajasVentajasVentajasVentajas"

@kuaashish kuaashish self-assigned this May 20, 2024
@kuaashish kuaashish added the os:windows, task:LLM inference, platform:android, and type:support labels May 20, 2024
@kuaashish (Collaborator) commented May 21, 2024

Hi @adityak6798,

Could you please provide the complete steps you are following from our documentation? Additionally, please specify whether you are using an emulator or a real device, along with the device's complete configuration and the MediaPipe version in use. This will help us understand the issue better and reproduce it accurately.

Thank you!!

@kuaashish kuaashish added the stat:awaiting response label May 21, 2024
@adityak6798 (Author)

Using a real device: Samsung Galaxy S23 Ultra

Steps followed for Kaggle model: (from here)

  1. Added dependency (com.google.mediapipe:tasks-genai:0.10.14; see the Gradle sketch after this list)
  2. Downloaded Gemma model from Kaggle - gemma-2b-it-gpu-int4
  3. Created the task (LLMInference object) with the right arguments
  4. Ran inference as specified.
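
Roughly, step 1 corresponds to the Gradle sketch below (Kotlin DSL form, using the version quoted in this thread):

```kotlin
// build.gradle.kts (app module), step 1: add the MediaPipe GenAI tasks dependency
dependencies {
    implementation("com.google.mediapipe:tasks-genai:0.10.14")
}
```

Step 3 then mirrors the LlmInference.createFromOptions(...) sketch in the issue description above, with setModelPath(...) pointing at the downloaded gemma-2b-it-gpu-int4 file.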

Steps followed for non-Kaggle model:

  1. Added dependency as earlier
  2. Downloaded Gemma-2b using the model conversion colab (GPU Variant)
  3. Created the task as earlier (model path updated as needed)
  4. Ran inference as earlier

Device config: Android 14
MediaPipe version: com.google.mediapipe:tasks-genai:0.10.14

@google-ml-butler google-ml-butler bot removed the stat:awaiting response label May 28, 2024
@kuaashish (Collaborator)

Hi @adityak6798,

Could you please try the example available in this repository and let us know if you continue to experience the same behavior.

Thank you!!

@kuaashish kuaashish added the stat:awaiting response label May 30, 2024