
Bug: RuntimeError: "topk_cpu" not implemented for 'Half' #42

Closed
lorr1 opened this issue Sep 25, 2022 · 3 comments · Fixed by huggingface/transformers#19468
Labels: documentation (Improvements or additions to documentation)

lorr1 commented Sep 25, 2022

When using a BLOOM model with generate, I get

RuntimeError: "topk_cpu" not implemented for 'Half'

when do_sample=True. For example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_NEW_TOKENS = 128
model_name = 'bigscience/bloom-560m'

text = """
Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes.
How many punches did he throw?\n
A: Let’s think step by step.\n"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_ids = tokenizer(text, return_tensors="pt").input_ids  # tensor stays on CPU

# Budget GPU memory per device, leaving ~6 GB of headroom
free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = f'{free_in_GB-6}GB'

n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir="/home/code/fm_in_context_eval/transformers_cache",
    device_map='auto',
    load_in_8bit=True,
    max_memory=max_memory,
)
generated_ids = model.generate(input_ids, max_length=len(input_ids[0]) + 1, do_sample=True)
TimDettmers (Owner) commented

This error happens when part of the model is too large to be held entirely on the GPU, so it is offloaded to the CPU, and the CPU does not support the topk operation in fp16.

The solution should be to increase the max_memory value so that the entire model fits on the GPU. If your GPU is too small, this error is unfortunately unavoidable. Since your example uses a relatively small model, I imagine you have an 8 GB GPU. How many GB do you have?
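For reference, a minimal sketch of how to check whether offloading happened, assuming a single GPU at index 0; the 30GB budget below is an illustrative value, not one taken from this thread:

# Minimal sketch: inspect where accelerate placed each module. Any "cpu" or
# "disk" entries mean part of the model was offloaded, and generation can then
# hit CPU-only fp16 kernels such as topk_cpu.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'bigscience/bloom-560m',
    device_map='auto',
    load_in_8bit=True,
    max_memory={0: '30GB'},  # illustrative budget; raise it until nothing lands on "cpu"
)
print(model.hf_device_map)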

TimDettmers added the documentation (Improvements or additions to documentation) label on Oct 10, 2022
lorr1 (Author) commented Oct 10, 2022

That's what I was thinking too, but I was on an A100 on GCP (so 40 GB of memory). And usually, when it's a memory issue, I get the half-kernel errors during a normal forward call. As far as I could tell, both normal forward and generate kept everything on the GPU; it was just with sampling.

younesbelkada (Collaborator) commented

Hey @lorr1!
Thanks for your message. It appears that this issue can be fixed with a small workaround (it also seems to be unrelated to 8-bit models).
Could you please try inserting input_ids = input_ids.to(0) before calling generate? I tried it locally and I believe it should do the trick here ;) I suggest sticking with this workaround for now, until a proper fix is merged in #19468!
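For reference, a minimal sketch of that workaround, continuing from the snippet above and assuming the first model shards sit on CUDA device 0:

# Suggested workaround: move the inputs onto GPU 0 before generating, so the
# top-k sampling op runs on the GPU instead of falling back to the CPU in fp16.
input_ids = input_ids.to(0)
generated_ids = model.generate(input_ids, max_length=len(input_ids[0]) + 1, do_sample=True)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))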
