support user customized device_map
#47
Comments
currently

OK. It's just that initially I found this: So, I assumed that if
I've inspected the code and found out that, in theory, custom device maps should work. So I decided to just test the `max_memory` parameter:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import torch

model_path = "/opt/models/vicuna-13B-1.1-GPTQ-4bit-128g"

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
)

model = AutoGPTQForCausalLM.from_quantized(
    model_path,
    device="cpu",
    use_safetensors=True,
    use_triton=False,
    quantize_config=quantize_config,
    model_basename="vicuna-13B-1.1-GPTQ-4bit-128g.latest",
    max_memory={0: "2GIB", "cpu": "30GIB"},
)

mem_gb = round(torch.cuda.memory_allocated(0) / 1000 / 1000 / 1000)
print(f"USED VRAM: {mem_gb}GB")
```

This internally constructs the following device map:

```python
{'model.embed_tokens': 0,
 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0,
 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0,
 'model.layers.8': 'cpu', 'model.layers.9': 'cpu', 'model.layers.10': 'cpu',
 'model.layers.11': 'cpu', 'model.layers.12': 'cpu', 'model.layers.13': 'cpu',
 'model.layers.14': 'cpu', 'model.layers.15': 'cpu', 'model.layers.16': 'cpu',
 'model.layers.17': 'cpu', 'model.layers.18': 'cpu', 'model.layers.19': 'cpu',
 'model.layers.20': 'cpu', 'model.layers.21': 'cpu', 'model.layers.22': 'cpu',
 'model.layers.23': 'cpu', 'model.layers.24': 'cpu', 'model.layers.25': 'cpu',
 'model.layers.26': 'cpu', 'model.layers.27': 'cpu', 'model.layers.28': 'cpu',
 'model.layers.29': 'cpu', 'model.layers.30': 'cpu', 'model.layers.31': 'cpu',
 'model.layers.32': 'cpu', 'model.layers.33': 'cpu', 'model.layers.34': 'cpu',
 'model.layers.35': 'cpu', 'model.layers.36': 'cpu', 'model.layers.37': 'cpu',
 'model.layers.38': 'cpu', 'model.layers.39': 'cpu',
 'model.norm': 'cpu', 'lm_head': 'cpu'}
```

The limit for the GPU was set to 2GiB, but in reality the model still uses the same amount of VRAM.
I can make the following conclusions: even `max_memory` does not restrict VRAM usage here, so it seems like the problem is somewhere deeper.
It seems like `device_map` does not offload anything to CPU if it's constructed manually. Here's an example:
The model is https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g. The `device_map` is constructed so that only the first model layer is on GPU, and the rest is supposed to be on CPU. The last two lines measure the used VRAM.

When `full_gpu = True`, everything is on GPU, and the VRAM measurement reflects that, which is expected.
But now I set `full_gpu = False` so the device map is used. However, at the end I get the same result. I double-checked that the `device_map` is actually used, but it seems like it doesn't offload anything. Am I missing something?
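The example script itself did not survive in this thread. A hypothetical sketch of how such a `device_map` could be built follows; the function name `build_device_map`, the `full_gpu` flag's exact semantics, and the 40-layer count are assumptions inferred from the device map shown earlier in the thread, not the reporter's original code:

```python
# Sketch of a manually constructed accelerate-style device_map for a
# 40-layer LLaMA-style model. With full_gpu=False, only the embedding
# and the first transformer layer stay on GPU 0; the rest goes to CPU.
def build_device_map(num_layers=40, full_gpu=False):
    device_map = {"model.embed_tokens": 0}
    for i in range(num_layers):
        device_map[f"model.layers.{i}"] = 0 if (full_gpu or i == 0) else "cpu"
    device_map["model.norm"] = 0 if full_gpu else "cpu"
    device_map["lm_head"] = 0 if full_gpu else "cpu"
    return device_map
```

A map like this would then be passed as the `device_map` argument to `from_quantized`, which is exactly the code path this issue reports as not offloading.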