
Triton Error [CUDA]: invalid argument #80

@jemmyshin

Description

Issue description:

Got a CUDA error when sending a request to the server.

Steps to reproduce:

python -m lightllm.server.api_server --model_dir ~/.cache/huggingface/hub/models--decapoda-research--llama-7b-hf/snapshots/5f98eefcc80e437ef68d457ad7bf167c2c6a1348 --host 0.0.0.0 --port 8080 --tp 1 --max_total_token_num 120

And sending a request using:

import json
import requests

url = 'http://localhost:8080/generate'
headers = {'Content-Type': 'application/json'}
data = {
    'inputs': PROMPT,  # PROMPT, args, and max_new_tokens come from the surrounding script
    'parameters': {
        'do_sample': args.do_sample,
        'ignore_eos': False,
        'max_new_tokens': max_new_tokens,
    },
}
generated_text = requests.post(url, headers=headers, data=json.dumps(data)).json()
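For a self-contained repro, here is the same request with placeholder values in place of PROMPT, args.do_sample, and max_new_tokens (the values are arbitrary; the failure happens during prefill, so it shows up on the first request):

import json
import requests

# Same request as above, with hypothetical placeholder values.
url = 'http://localhost:8080/generate'
headers = {'Content-Type': 'application/json'}
data = {
    'inputs': 'Hello, my name is',   # arbitrary prompt
    'parameters': {
        'do_sample': False,          # placeholder for args.do_sample
        'ignore_eos': False,
        'max_new_tokens': 16,        # placeholder value
    },
}
print(requests.post(url, headers=headers, data=json.dumps(data)).json())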

Error log:

Task exception was never retrieved
future: <Task finished name='Task-5' coro=<RouterManager.loop_for_fwd() done, defined at /home/jina/jemfu/lightllm/lightllm/server/router/manager.py:88> exception=RuntimeError('Triton Error [CUDA]: invalid argument')>
Traceback (most recent call last):
  File "<string>", line 21, in _fwd_kernel
KeyError: ('2-.-0-.-0-7d1eb0d2fed8ff2032dccb99c2cc311a-2b0c5161c53c71b37ae20a9996ee4bb8-c1f92808b4e4644c1732e8338187ac87-d962222789c30252d492a16cca3bf467-12f7ac1ca211e037f62a7c0c323d9990-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float16, torch.float16, torch.float16, 'fp32', torch.int32, torch.int32, torch.float32, torch.float16, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (128, 128, 128), (True, True, True, (False,), True, True, True, True, (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (False, False), (False, True)))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lightllm/lightllm/server/router/manager.py", line 91, in loop_for_fwd
    await self._step()
  File "/home/lightllm/lightllm/server/router/manager.py", line 112, in _step
    await self._prefill_batch(self.running_batch)
  File "/home/lightllm/lightllm/server/router/manager.py", line 149, in _prefill_batch
    ans = await asyncio.gather(*rets)
  File "/home/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 201, in prefill_batch
    ans = self._prefill_batch(batch_id)
  File "/home/lightllm/lightllm/utils/infer_utils.py", line 49, in inner_func
    result = func(*args, **kwargs)
  File "/home/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 77, in exposed_prefill_batch
    return self.forward(batch_id, is_prefill=True)
  File "/home/lightllm/lightllm/server/router/model_infer/model_rpc.py", line 128, in forward
    logits = self.model.forward(**kwargs)
  File "/home/miniconda3/envs/jemfu/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/lightllm/lightllm/models/llama/layer_infer/model.py", line 116, in forward
    predict_logics = self._context_forward(input_ids, infer_state)
  File "/home/lightllm/lightllm/models/llama/layer_infer/model.py", line 154, in _context_forward
    input_embs = self.layers_infer[i].context_forward(input_embs, infer_state, self.trans_layers_weight[i])
  File "/home/lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 117, in context_forward
    self._context_flash_attention(input_embdings,
  File "/home/lightllm/lightllm/utils/infer_utils.py", line 21, in time_func
    ans = func(*args, **kwargs)
  File "/home//lightllm/lightllm/models/llama/layer_infer/transformer_layer_inference.py", line 76, in _context_flash_attention
    context_attention_fwd(q.view(calcu_shape1),
  File "/home/miniconda3/envs/jemfu/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/jemfu/lightllm/lightllm/models/llama/triton_kernel/context_flashattention_nopad.py", line 224, in context_attention_fwd
    _fwd_kernel[grid](
  File "/home/miniconda3/envs/jemfu/lib/python3.10/site-packages/triton/runtime/jit.py", line 106, in launcher
    return self.run(*args, grid=grid, **kwargs)
  File "<string>", line 43, in _fwd_kernel
RuntimeError: Triton Error [CUDA]: invalid argument
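For anyone triaging: the cache key in the traceback shows the kernel was compiled with block sizes (128, 128, 128). One guess on my side (not confirmed) is that this configuration needs more shared memory than a Turing card like the RTX TITAN allows, and CUDA surfaces that at launch time as a generic invalid argument. A quick check of the device, assuming PyTorch is available:

import torch

# The RTX TITAN is Turing (sm_75), which caps opt-in shared memory
# at 64 KB per block, versus 163 KB on A100 (sm_80); kernels tuned
# with 128-wide blocks for Ampere can fail to launch on older parts.
major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: sm_{major}{minor}")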

Environment:

  • Using a container

  • OS: Ubuntu

  • GPU info:

    • NVIDIA-SMI 510.108.03, Driver Version: 510.108.03, CUDA Version: 11.6
    • RTX TITAN
  • Python: 3.10

  • LightLLM: installed from source via git clone and pip install -e .

  • openai-triton: 2.0.0.dev20221202
