
llama3-70B-Instruct-AWQ causing CUDA error: an illegal memory access was encountered #1871

anindya-saha opened this issue May 8, 2024 · 0 comments


System Info

Hello Team,

I am using the following command to load the AWQ-quantized version of the Llama 3 model on a 4 x A100 (40GB) GCP machine. I cannot increase --max-batch-prefill-tokens, because I then get "CUDA error: an illegal memory access was encountered". I also observe through nvidia-smi that the server does not consume the entire GPU memory, yet it still raises the illegal memory access error.

# Load Llama 3 casperhansen/llama-3-70b-instruct-awq

DOCKER_IMAGE=ghcr.io/huggingface/text-generation-inference:2.0.2
CONTAINER_NAME=eval_llama_3
HF_TOKEN=<my-token>
CUDA_VISIBLE_DEVICES=0,1,2,3
MODEL_ID="casperhansen/llama-3-70b-instruct-awq"
QUANTIZE=awq
VOLUME=~/.cache/huggingface/hub


docker run --rm \
    --name ${CONTAINER_NAME} \
    --shm-size 4g \
    --env HUGGING_FACE_HUB_TOKEN=${HF_TOKEN} \
    --env CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES \
    -p 8080:80 \
    -v ${VOLUME}:/data \
    --gpus all \
    $DOCKER_IMAGE \
    --model-id ${MODEL_ID} \
    --num-shard 4 \
    --sharded true \
    --max-concurrent-requests 3 \
    --max-batch-prefill-tokens 24000 \
    --max-stop-sequences 20 \
    --trust-remote-code \
    --quantize ${QUANTIZE}
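
Once the server is up, requests of roughly this shape are enough to hit the error (the prompt below is just a placeholder, not my actual workload):

# Minimal TGI /generate request against the port mapped above
curl http://localhost:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 64}}'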

The GPUs are not even half utilized, though:

Wed May  8 09:32:39 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0              58W / 400W |  13321MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-40GB          Off | 00000000:00:05.0 Off |                    0 |
| N/A   36C    P0              75W / 400W |  13465MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-40GB          Off | 00000000:00:06.0 Off |                    0 |
| N/A   34C    P0              71W / 400W |  13465MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-40GB          Off | 00000000:00:07.0 Off |                    0 |
| N/A   35C    P0              70W / 400W |  13321MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     26648      C   /opt/conda/bin/python3.10                 13312MiB |
|    1   N/A  N/A     26649      C   /opt/conda/bin/python3.10                 13456MiB |
|    2   N/A  N/A     26650      C   /opt/conda/bin/python3.10                 13456MiB |
|    3   N/A  N/A     26652      C   /opt/conda/bin/python3.10                 13312MiB |
+---------------------------------------------------------------------------------------+
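
For reference, here is my own back-of-the-envelope arithmetic for the sharded weight footprint (ignoring the KV cache, CUDA context, and other runtime buffers), which is consistent with the ~13 GiB resident per GPU shown above:

# 70B parameters at 4-bit (0.5 bytes each), sharded across 4 GPUs
awk 'BEGIN { printf "%.2f GB of weights per GPU\n", 70e9 * 0.5 / 4 / 1e9 }'
# prints: 8.75 GB of weights per GPU

So roughly 8.75 GB of weights plus runtime overhead accounts for the ~13 GiB per card, leaving plenty of headroom, which suggests the illegal memory access is not a plain out-of-memory condition.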

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Steps are provided in the problem description above.

Expected behavior

The model should load and serve requests without raising exceptions.
