
Bug: RuntimeError: "topk_cpu" not implemented for 'Half' #42

Closed
lorr1 opened this issue Sep 25, 2022 · 3 comments · Fixed by huggingface/transformers#19468
Labels: documentation (Improvements or additions to documentation)

lorr1 commented Sep 25, 2022

When using a BLOOM model with generate, I get

RuntimeError: "topk_cpu" not implemented for 'Half'

when do_sample=True. For example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_NEW_TOKENS = 128
model_name = 'bigscience/bloom-560m'

text = """
Q: On average Joe throws 25 punches per minute. A fight lasts 5 rounds of 3 minutes.
How many punches did he throw?\n
A: Let’s think step by step.\n"""
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_ids = tokenizer(text, return_tensors="pt").input_ids  # tensor stays on CPU

# Budget GPU memory per device, leaving ~6 GB of headroom
free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = f'{free_in_GB-6}GB'

n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    cache_dir="/home/code/fm_in_context_eval/transformers_cache",
    device_map='auto',
    load_in_8bit=True,
    max_memory=max_memory,
)
generated_ids = model.generate(input_ids, max_length=len(input_ids[0]) + 1, do_sample=True)
TimDettmers (Owner) commented

This error happens when part of the model is too large to be held entirely on the GPU, so it is offloaded to the CPU, and the CPU does not support the topk operation in fp16.

The solution should be to increase the max_memory value so that the entire model fits on the GPU. If your GPU is too small, this error is unfortunately unavoidable. Since your example uses a relatively small model, I imagine you have an 8 GB GPU. How many GB do you have?
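For reference, a minimal sketch of how to check whether offloading happened, assuming a single GPU at index 0; the 30GB budget below is an illustrative value, not one taken from this thread:

# Minimal sketch: inspect where accelerate placed each module. Any "cpu" or
# "disk" entries mean part of the model was offloaded, and generation can then
# hit CPU-only fp16 kernels such as topk_cpu.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'bigscience/bloom-560m',
    device_map='auto',
    load_in_8bit=True,
    max_memory={0: '30GB'},  # illustrative budget; raise it until nothing lands on "cpu"
)
print(model.hf_device_map)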

TimDettmers added the documentation (Improvements or additions to documentation) label on Oct 10, 2022
lorr1 (Author) commented Oct 10, 2022

That's what I was thinking too, but I was on an A100 on GCP (so 40 GB of memory). And usually, when it's a memory issue, I get the half-kernel errors during a normal forward call. As far as I could tell, both normal forward and generate kept everything on the GPU; it was just with sampling.

younesbelkada (Collaborator) commented

Hey @lorr1!
Thanks for your message. It appears that this issue can be fixed with a small workaround (it also seems to be unrelated to 8-bit models).
Could you please try inserting input_ids = input_ids.to(0) before calling generate? I tried it locally and I believe it should do the trick here ;) I suggest sticking with this workaround for now, until a proper fix is merged in #19468!
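For reference, a minimal sketch of that workaround, continuing from the snippet above and assuming the first model shards sit on CUDA device 0:

# Suggested workaround: move the inputs onto GPU 0 before generating, so the
# top-k sampling op runs on the GPU instead of falling back to the CPU in fp16.
input_ids = input_ids.to(0)
generated_ids = model.generate(input_ids, max_length=len(input_ids[0]) + 1, do_sample=True)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))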
