
llama3-70B-Instruct-AWQ causing CUDA error: an illegal memory access was encountered #1871

anindya-saha opened this issue May 8, 2024 · 0 comments


System Info

Hello Team,

I am using the following command to load the AWQ-quantized version of the Llama 3 model on a 4 x A100 (40GB) GCP machine. I cannot increase --max-batch-prefill-tokens, because I then get "CUDA error: an illegal memory access was encountered". I also observe through nvidia-smi that the server does not consume the entire GPU memory, yet it still raises the illegal memory access error.

# Load Llama 3 casperhansen/llama-3-70b-instruct-awq

DOCKER_IMAGE=ghcr.io/huggingface/text-generation-inference:2.0.2
CONTAINER_NAME=eval_llama_3
HF_TOKEN=<my-token>
CUDA_VISIBLE_DEVICES=0,1,2,3
MODEL_ID="casperhansen/llama-3-70b-instruct-awq"
QUANTIZE=awq
VOLUME=~/.cache/huggingface/hub


docker run --rm \
    --name ${CONTAINER_NAME} \
    --shm-size 4g \
    --env HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
    --env CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
    -p 8080:80 \
    -v ${VOLUME}:/data \
    --gpus all \
    $DOCKER_IMAGE \
    --model-id ${MODEL_ID} \
    --num-shard 4 \
    --sharded true \
    --max-concurrent-requests 3 \
    --max-batch-prefill-tokens 24000 \
    --max-stop-sequences 20 \
    --trust-remote-code \
    --quantize ${QUANTIZE}
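
Once the server is up, requests of roughly this shape are enough to hit the error (the prompt below is just a placeholder, not my actual workload):

# Minimal TGI /generate request against the port mapped above
curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'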

The GPUs are not even half utilized, though:

Wed May  8 09:32:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0              58W / 400W |  13321MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          Off | 00000000:00:05.0 Off |                    0 |
| N/A   36C    P0              75W / 400W |  13465MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          Off | 00000000:00:06.0 Off |                    0 |
| N/A   34C    P0              71W / 400W |  13465MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          Off | 00000000:00:07.0 Off |                    0 |
| N/A   35C    P0              70W / 400W |  13321MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     26648      C   /opt/conda/bin/python3.10                 13312MiB |
|    1   N/A  N/A     26649      C   /opt/conda/bin/python3.10                 13456MiB |
|    2   N/A  N/A     26650      C   /opt/conda/bin/python3.10                 13456MiB |
|    3   N/A  N/A     26652      C   /opt/conda/bin/python3.10                 13312MiB |
+---------------------------------------------------------------------------------------+
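
For reference, here is my own back-of-the-envelope arithmetic for the sharded weight footprint (ignoring the KV cache, CUDA context, and other runtime buffers), which is consistent with the ~13 GiB resident per GPU shown above:

# 70B parameters at 4-bit (0.5 bytes each), sharded across 4 GPUs
awk 'BEGIN { printf "%.2f GB of weights per GPU\n", 70e9 * 0.5 / 4 / 1e9 }'
# prints: 8.75 GB of weights per GPU

So roughly 8.75 GB of weights plus runtime overhead accounts for the ~13 GiB per card, leaving plenty of headroom, which suggests the illegal memory access is not a plain out-of-memory condition.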

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps are provided in the problem description above.

Expected behavior

The model should load and serve requests without raising exceptions.
