
LLaMA support #506

Open
michaelroyzen opened this issue Mar 16, 2023 · 176 comments
Labels
enhancement New feature or request

Comments

@michaelroyzen
Copy link

michaelroyzen commented Mar 16, 2023

Given existing support for GPT-J and its rotary embeddings, is LLaMA supported as well? Huggingface just shipped their implementation: huggingface/transformers@464d420

@byshiue

@byshiue
Copy link
Collaborator

byshiue commented Mar 17, 2023

Can you explain what the differences are between GPT-J and LLaMA?

@teknium1
Copy link

+1 for this

@michaelroyzen
Copy link
Author

They look very similar. HuggingFace's doc page says the implementation is based on the GPT-NeoX codebase, which seems to be supported by FasterTransformer already: https://huggingface.co/docs/transformers/main/model_doc/llama.

Do you think it'll work?

@yuikns
Copy link

yuikns commented Mar 24, 2023

+1

@byshiue According to our investigation, it is not difficult to port this model to Megatron as well. But I am not sure whether a single conversion script will work.

@byshiue
Copy link
Collaborator

byshiue commented Mar 24, 2023

Thank you for the suggestion and discussion. We may not have time to work on this issue right now. If you are interested, you can try to add support for it.
You are welcome to ask questions if you run into any, and to merge your work back into our repo once it works.

@byshiue byshiue added the enhancement New feature or request label Mar 24, 2023
@Hap-Zhang
Copy link

+1 for this

@michaelroyzen
Copy link
Author

michaelroyzen commented Apr 7, 2023

It seems to be quite a simple implementation @byshiue. All that needs to be done is to implement RMS layer norm in GPT-NeoX, as well as support the SiLU activation. It seems that both of these features are already implemented elsewhere in FasterTransformer.

I'd be happy to take the lead if you can help me with the general steps.

@ZZR0
Copy link

ZZR0 commented Apr 11, 2023

+1 for this

1 similar comment
@troycheng
Copy link

+1 for this

@moonscar
Copy link

I compared the GPT-J and LLaMA models in HuggingFace; they have the same attention layer. There are some differences in the FFN: LLaMA uses three weight matrices, and the forward function is as follows

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

I checked the relevant FFN-layer code in the source, and there doesn't seem to be a similar structure. Or maybe such a layer already exists in the current code and I just haven't found it; I'd appreciate some pointers.
@byshiue

@byshiue
Copy link
Collaborator

byshiue commented Apr 12, 2023

I compared the GPT-J and LLaMA models in HuggingFace; they have the same attention layer. There are some differences in the FFN: LLaMA uses three weight matrices, and the forward function is as follows

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

I checked the relevant FFN-layer code in the source, and there doesn't seem to be a similar structure. Or maybe such a layer already exists in the current code and I just haven't found it; I'd appreciate some pointers. @byshiue

It looks like a standard gated SiLU. Can you explain what difference you see?
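
For reference, a self-contained PyTorch sketch of the gated-SiLU (SwiGLU) FFN described above; the w1/w2/w3 names follow the quoted snippet, and the layer sizes are placeholders:

    # Minimal sketch of a LLaMA-style gated-SiLU FFN (names follow the quoted code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedSiluMLP(nn.Module):
        def __init__(self, hidden_size: int, intermediate_size: int):
            super().__init__()
            self.w1 = nn.Linear(hidden_size, intermediate_size, bias=False)  # gate projection
            self.w3 = nn.Linear(hidden_size, intermediate_size, bias=False)  # up projection
            self.w2 = nn.Linear(intermediate_size, hidden_size, bias=False)  # down projection

        def forward(self, x):
            # SiLU-gated elementwise product, then project back to hidden_size
            return self.w2(F.silu(self.w1(x)) * self.w3(x))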

@moonscar
Copy link

Thanks for the reminder, I missed this part.
I will try to make this work

@michaelroyzen
Copy link
Author

Wow, thank you @moonscar. Want any help? What's the status of your PR?

@AnShengqiang
Copy link

need this too

@Anychnn
Copy link

Anychnn commented Apr 17, 2023

@moonscar have you started this work? Or I can help with it.

@michaelroyzen
Copy link
Author

Don't think it's been started yet @Anychnn

@michaelroyzen
Copy link
Author

Given the interest and activity here, I'd like to offer a bounty of $2,500 USD to whoever can get Llama implemented in FT. Please email me at michael@phind.com if you're interested. @moonscar @AnShengqiang @Anychnn @byshiue

It seems that all that needs to be done is to copy T5's RMS layer norm (already implemented in FT) and UL2's gated SiLU (also already implemented elsewhere in FT) into GPT-NeoX. As per HuggingFace's implementation of Llama, it is otherwise completely identical to GPT-NeoX (which is already implemented in FT).
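
For reference, a minimal PyTorch sketch of the T5-style RMS layer norm mentioned above (no mean subtraction, no bias); the eps default is a placeholder:

    # Minimal RMSNorm sketch: scale by the reciprocal root-mean-square of the activations.
    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        def __init__(self, hidden_size: int, eps: float = 1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(hidden_size))
            self.eps = eps

        def forward(self, x):
            variance = x.pow(2).mean(-1, keepdim=True)
            return self.weight * x * torch.rsqrt(variance + self.eps)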

@michaelroyzen
Copy link
Author

The bounty will be $3,000 if a correct and working PR is opened by the end of Friday, April 21st (Pacific Time).

@jinluyang
Copy link

would be glad to help do a part of the work, for example converting the weights to FT

@cameronfr
Copy link

Made a lot of progress on this, but my current FT model is outputting seemingly random tokens, so there's something wrong with my weight conversion or maybe even the exact layer implementation. If someone wants to pick up the torch (I am done for now 😞), the next step would probably be to compare the output of the Huggingface model vs. this FT model layer by layer:

Weights conversion: https://github.com/cameronfr/FasterTransformer/blob/main/examples/cpp/llama/huggingface_llama_convert.py
FT Model:
https://github.com/cameronfr/FasterTransformer/tree/main/src/fastertransformer/models/llama
Testing:
https://github.com/cameronfr/FasterTransformer/tree/main/examples/cpp/llama

Everything is modified from the respective GPTNeoX versions. LlamaContextDecoder and LlamaDecoder essentially just have the changes of Gelu -> Gated Silu and LayerNorm -> LayerNormT5. LlamaDecoderLayerWeight and LlamaWeight set the parameters of these layers.

@Anychnn
Copy link

Anychnn commented Apr 22, 2023

@cameronfr The default layernorm_eps_ in llama.h is set to 1e-5, but llama-7b-torch defaults to 1e-6. The attention module output is also incorrect; I am fixing this.

@jinluyang
Copy link

@cameronfr I think the reshape of qkv here might not be correct: https://github.com/cameronfr/FasterTransformer/blob/45d48f9d06713cd006f7d95d4b2f99a4bd3abb11/examples/cpp/llama/huggingface_llama_convert.py#L97
The HuggingFace-format qkv projections are already permuted for its rotary embedding: https://github.com/huggingface/transformers/blob/d04ec99bec8a0b432fc03ed60cea9a1a20ebaf3c/src/transformers/models/llama/convert_llama_weights_to_hf.py#L101
So I tried something like:

    qkvArr[:, 0, :, :] = qArr.reshape(n_heads,2, head_size//2, hidden_size).transpose((3,0,2,1)).reshape(hidden_size,n_heads,head_size)

and fixed the layernorm_eps, but the output tokens are still seemingly incorrect, not a sentence. Also, I changed start_ids.csv not to use the one from gptneox, since they may not share the same token ids.
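
For what it's worth, here is a rough numpy sketch (an assumption, not a verified fix) of undoing the permutation that convert_llama_weights_to_hf.py applies to the q/k projection weights before they are packed for FT; hf_w is assumed to be the HF-format weight of shape (hidden_size, hidden_size):

    import numpy as np

    def unpermute(hf_w: np.ndarray, n_heads: int, hidden_size: int) -> np.ndarray:
        # The HF converter lays each head out as two half-blocks for its rotary
        # implementation; swapping the block axis back with the per-pair axis
        # recovers the original interleaved layout.
        head_size = hidden_size // n_heads
        return (hf_w.reshape(n_heads, 2, head_size // 2, hidden_size)
                    .transpose(0, 2, 1, 3)
                    .reshape(hidden_size, hidden_size))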

@michaelroyzen
Copy link
Author

Great progress @cameronfr @Anychnn @jinluyang. I'm doubling the bounty to $6k to whoever can get this working and merged in.

@void-main
Copy link

Hey @michaelroyzen @cameronfr @Anychnn @jinluyang , I got a self-tested working version and opened a pull request with it. Could you guys please take a look? Any chances we could get it merged?

@michaelroyzen
Copy link
Author

Nice! Works well so far in limited tests and is consistent with the Huggingface output using beam_size 1. One comment is that it should support max_position_embeddings (max_pos_seq_len in FT), but this is likely a simple change. Will continue testing and post the updates here.

@ZhuYuJin
Copy link

@michaelroyzen Does FT support LLaMA fine-tuned with LoRA? The training code is here: https://github.com/tloen/alpaca-lora/blob/main/finetune.py

@ZhuYuJin
Copy link

@michaelroyzen Does FT support LLaMA fine-tuned with LoRA? The training code is here: https://github.com/tloen/alpaca-lora/blob/main/finetune.py

Using the merge_adapter interface, you can merge the LoRA weights into the original linear weights: https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora.py#L279
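
For reference, a minimal sketch (the model and adapter paths are placeholders) of merging LoRA weights into the base model with peft before running the FT weight conversion; it uses merge_and_unload, which folds the adapter deltas into the original Linear weights:

    import torch
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained("path/to/llama-7b-hf", torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base, "path/to/alpaca-lora-adapter")
    merged = model.merge_and_unload()          # fold LoRA deltas into the base Linear weights
    merged.save_pretrained("llama-7b-merged")  # this directory can then be converted for FT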

@void-main
Copy link

Hey community, here are some updates:

  • supported bf16
  • supported triton decouple mode
  • verified that Llama 65B is working

@Anychnn
Copy link

Anychnn commented Jul 15, 2023 via email

@realgump
Copy link

Does llama support dynamic batching?
I enabled dynamic batching in the config file, but the server still runs inference serially. Following https://github.com/triton-inference-server/fastertransformer_backend/blob/6df8877bee99d0c6eefc2e9127edd5ee71b1ad06/all_models/gpt/fastertransformer/config.pbtxt to enable ragged input, requests do get batched, but the output contains a lot of garbled text.

@pai4451
Copy link

pai4451 commented Jul 19, 2023

Llama 2 released: https://ai.meta.com/resources/models-and-libraries/llama/

Is it possible to serve it with Triton?

@SamuraiBUPT
Copy link

SamuraiBUPT commented Jul 19, 2023

Llama 2 released: https://ai.meta.com/resources/models-and-libraries/llama/

Is it possible to serve it with Triton?

Worth trying. If there is no structural change, Llama 2 may be supported.

From the Meta AI blog and their paper, it looks like the new models were trained with some different methods.

@SamuraiBUPT
Copy link

The primary architectural differences from Llama 1 include increased context length and grouped-query attention (GQA).

Structural changes may exist.

@CN-COTER
Copy link

The primary architectural differences from Llama 1 include increased context length and grouped-query attention (GQA).

Structural changes may exist.


If we use Llama2-7B or Llama2-13B, which don't use GQA, maybe we can apply the current llama FT inference architecture.

@valtab
Copy link

valtab commented Jul 20, 2023

llama支持dynamic batching吗? 我在config文件里面打开了dynamic batching,但是server端仍然是串行推理的。仿照https://github.com/triton-inference-server/fastertransformer_backend/blob/6df8877bee99d0c6eefc2e9127edd5ee71b1ad06/all_models/gpt/fastertransformer/config.pbtxt 里面开ragged input,可以成功组batch,但输出会出现很多乱码。

Yes, dynamic batching works well with the latest commits. 33B, with decoupled mode both true and false.

@chuanzhao0626
Copy link

chuanzhao0626 commented Jul 21, 2023

llama-2-70B: the model configuration has one extra parameter:
'num_key_value_heads': 8

Converting the model weights with huggingface_llama_convert.py raises an error:
ValueError: all input arrays must have the same shape

Comparing the two versions of the llama model, the weight dimensions of q, k and v are different.
llama-2-70B:
q_proj.weight: (8192, 8192)
k_proj.weight: (1024, 8192)
v_proj.weight: (1024, 8192)
Given the dimensional differences, should np.vstack() be used to concatenate the parameters?

llama-65B:
k_proj.weight: (8192, 8192)
q_proj.weight: (8192, 8192)
v_proj.weight: (8192, 8192)

Can anyone suggest how the model conversion should be changed? And given this dimension change, doesn't the inference implementation in FT also need to be modified accordingly?
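
To illustrate the shape mismatch, here is a rough numpy sketch (not the actual converter code): with GQA, q projects to num_heads * head_size rows while k/v only project to num_key_value_heads * head_size rows, so packing them into one array of equal-shaped slices fails, whereas concatenation along the output dimension still works:

    import numpy as np

    hidden_size, num_heads, num_kv_heads, head_size = 8192, 64, 8, 128
    q = np.zeros((num_heads * head_size, hidden_size))     # (8192, 8192)
    k = np.zeros((num_kv_heads * head_size, hidden_size))  # (1024, 8192)
    v = np.zeros((num_kv_heads * head_size, hidden_size))  # (1024, 8192)

    # np.stack([q, k, v]) raises "all input arrays must have the same shape".
    # Concatenating along the output dimension works, but the loader then has
    # to know where q ends and k/v begin:
    qkv = np.vstack([q, k, v])
    print(qkv.shape)  # (10240, 8192)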

@Dimensionzw
Copy link

llama-2-70B: the model configuration has one extra parameter: 'num_key_value_heads': 8

Converting the model weights with huggingface_llama_convert.py raises an error: ValueError: all input arrays must have the same shape

Comparing the two versions of the llama model, the weight dimensions of q, k and v are different. llama-2-70B: q_proj.weight: (8192, 8192) k_proj.weight: (1024, 8192) v_proj.weight: (1024, 8192) Given the dimensional differences, should np.vstack() be used to concatenate the parameters?

llama-65B: k_proj.weight: (8192, 8192) q_proj.weight: (8192, 8192) v_proj.weight: (8192, 8192)

Can anyone suggest how the model conversion should be changed? And given this dimension change, doesn't the inference implementation in FT also need to be modified accordingly?

Hello, I would like to know if there are any structural changes between llama2 7B and 13B, and whether they can be directly converted and deployed using the FT framework?

@CN-COTER
Copy link

CN-COTER commented Jul 21, 2023

llama-2-70B: the model configuration has one extra parameter: 'num_key_value_heads': 8
Converting the model weights with huggingface_llama_convert.py raises an error: ValueError: all input arrays must have the same shape
Comparing the two versions of the llama model, the weight dimensions of q, k and v are different. llama-2-70B: q_proj.weight: (8192, 8192) k_proj.weight: (1024, 8192) v_proj.weight: (1024, 8192) Given the dimensional differences, should np.vstack() be used to concatenate the parameters?
llama-65B: k_proj.weight: (8192, 8192) q_proj.weight: (8192, 8192) v_proj.weight: (8192, 8192)
Can anyone suggest how the model conversion should be changed? And given this dimension change, doesn't the inference implementation in FT also need to be modified accordingly?

Hello, I would like to know if there are any structural changes between llama2 7B and 13B, and whether they can be directly converted and deployed using the FT framework?

I have tested llama2 13B with FT framework + int8 and I did not encounter any error.

@void-main
Copy link

Hey guys, don't try to use Llama 2 with the current Llama implementation.

The current implementation doesn't implement MultiQueryAttention (the num_key_value_heads field), so it is expected not to work.

If you are in a hurry to use Llama 2, I highly recommend turning to vllm, which now supports Llama 2.
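
For context, here is a minimal PyTorch sketch of the grouped-query attention expansion the kernels would need to handle (modeled on the repeat_kv helper in the HF Llama code, so treat it as an illustration rather than FT code): each of the num_key_value_heads K/V heads is shared by num_heads // num_key_value_heads query heads:

    import torch

    def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
        # x: (batch, num_kv_heads, seq_len, head_dim)
        # returns: (batch, num_kv_heads * n_rep, seq_len, head_dim)
        if n_rep == 1:
            return x
        b, kv_heads, s, d = x.shape
        return (x[:, :, None, :, :]
                .expand(b, kv_heads, n_rep, s, d)
                .reshape(b, kv_heads * n_rep, s, d))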

@fmac2000
Copy link

@void-main will there be work done to implement MQA?

@void-main
Copy link

Hey @fmac2000 , I'd like to try to implement MQA based on FlashAttention2, but I can't promise when this feature will be ready.

@Dimensionzw
Copy link

Hey @fmac2000 , I'd like to try to implement MQA based on FlashAttention2, but I can't promise when this feature will be ready.

@void-main Maybe you can refer to the implementation in this project, which is also built on the FT framework and recently added GQA support for llama2-70B. For the 7B and 13B llama2 models, the existing implementation should be directly usable:
https://github.com/InternLM/lmdeploy

@AnyangAngus
Copy link

llama-2-70B: the model configuration has one extra parameter: 'num_key_value_heads': 8
Converting the model weights with huggingface_llama_convert.py raises an error: ValueError: all input arrays must have the same shape
Comparing the two versions of the llama model, the weight dimensions of q, k and v are different. llama-2-70B: q_proj.weight: (8192, 8192) k_proj.weight: (1024, 8192) v_proj.weight: (1024, 8192) Given the dimensional differences, should np.vstack() be used to concatenate the parameters?
llama-65B: k_proj.weight: (8192, 8192) q_proj.weight: (8192, 8192) v_proj.weight: (8192, 8192)
Can anyone suggest how the model conversion should be changed? And given this dimension change, doesn't the inference implementation in FT also need to be modified accordingly?

Hello, I would like to know if there are any structural changes between llama2 7B and 13B, and whether they can be directly converted and deployed using the FT framework?

I have tested llama2 13B with FT framework + int8 and I did not encounter any error.

@CN-COTER
Hi:
Are the llama-2 output tokens from FT consistent with the HF transformer?
Thank you

@fmac2000
Copy link

Hey @fmac2000 , I'd like to try to implement MQA based on FlashAttention2, but I can't promise when this feature will be ready.

@void-main - that’s great news, thank you for all the work you’ve put in so far - it’s extremely appreciated. Let us know 👍

@realgump
Copy link

realgump commented Jul 31, 2023

Are there any bugs in batched inference? The model's response always contains garbled characters when a batch of requests is sent, like:
01.01.0395153939222e0 for 3010 for a neutral.tt.tt.222201401.01.01.5p.1.91.91.a01.20 with the first pitch-1.10 with01.1.10 with1.10 with01.1.20 with1.1.0 with1.20 with1.20 with1.22222222222201.1.1.1.1.133300 with1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1d1.1.1d1.1.1d1.1.1d1.
������������

@double-vin
Copy link

@cameronfr I think the reshape of qkv here might not be correct: https://github.com/cameronfr/FasterTransformer/blob/45d48f9d06713cd006f7d95d4b2f99a4bd3abb11/examples/cpp/llama/huggingface_llama_convert.py#L97 The HuggingFace-format qkv projections are already permuted for its rotary embedding: https://github.com/huggingface/transformers/blob/d04ec99bec8a0b432fc03ed60cea9a1a20ebaf3c/src/transformers/models/llama/convert_llama_weights_to_hf.py#L101 So I tried something like: qkvArr[:, 0, :, :] = qArr.reshape(n_heads,2, head_size//2, hidden_size).transpose((3,0,2,1)).reshape(hidden_size,n_heads,head_size) and fixed the layernorm_eps, but the output tokens are still seemingly incorrect, not a sentence. Also, I changed start_ids.csv not to use the one from gptneox, since they may not share the same token ids.

I have also encountered this issue. There is a problem with the output token id. Have you resolved it?

@double-vin
Copy link

update the inference speed:

  • 38ms per token on A6000, 13B llama model with FP16 precision.
  • 18ms per token on A800, 13B llama model with FP16 precision.
[1685187041.895424] [ee6d00936280:22964:f]        vfs_fuse.c:424  UCX  WARN  failed to connect to vfs socket '': Invalid argument
Total ranks: 1.
Device NVIDIA RTX A6000
P0 is running with GPU #0.
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
after allocation    : free: 22.86 GB, total: 47.54 GB, used: 24.67 GB
d_sequence_lengths 91 elements
Writing 1036 elements
    1 12968 29901 29896 29974 29896 29922 29973    13  7900
zeroCount = 946
[INFO] request_batch_size 1 beam_width 1 head_num 40 size_per_head 128 total_output_len 1036 decoder_layers 40 vocab_size 32000 FT-CPP-decoding-beamsearch-time 3052.38 ms
[INFO] batch 0: input_token_len 12, gen_token_len 79, total_token_len 91, ave 38.64 ms/token
Total ranks: 1.
Device NVIDIA A800-SXM4-80GB
P0 is running with GPU #0.
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
after allocation    : free: 54.51 GB, total: 79.32 GB, used: 24.82 GB
d_sequence_lengths 91 elements
Writing 1036 elements
    1 12968 29901 29896 29974 29896 29922 29973    13  7900
zeroCount = 946
[INFO] request_batch_size 1 beam_width 1 head_num 40 size_per_head 128 total_output_len 1036 decoder_layers 40 vocab_size 32000 FT-CPP-decoding-beamsearch-time 1471.19 ms
[INFO] batch 0: input_token_len 12, gen_token_len 79, total_token_len 91, ave 18.62 ms/token

Can you share your code? My output token ids are incorrect and I would like to compare. Thank you!

@CN-COTER
Copy link

CN-COTER commented Aug 3, 2023

Sorry for the late reply.

I have tested Llama2-13b-chat on HF transformers and FT.

The input_id is

    '1, 12968, 29901, 29871, 30406, 4691, 31479, 30287, 30502, 232, 194, 174, 31859, 233, 145, 149, 31463, 29871, 13, 13, 7900, 22137, 29901, 29871'
  • FT
    I used llama_example.cc and saved the input_id to start_ids.csv, then got the following output in the out file
1 12968 29901 29871 30406 4691 31479 30287 30502 232 194 174 31859 233 145 149 31463 29871 13 13 7900 22137 29901 29871 18585 29991 2266 338 385 1342 310 263 4996 6605 5687 297 5132 29901 13 28956 13 1753 4996 6605 29898 2749 1125 13 1678 565 7431 29898 2749 29897 5277 29871 29896 29901 13 4706 736 3948 13 1678 24438 353 3948 29961 29900 29962 13 1678 3109 353 518 29916 363 921 297 3948 29961 29896 17531 565 921 5277 24438 29962 13 1678 7621 353 518 29916 363 921 297 3948 29961 29896 17531 565 921 1405 24438 29962 13 1678 736 4996 6605 29898 2222 29897 718 518 29886 11002 29962 718 4996 6605 29898 7979 
  • Hf-Transformer
    input_id = '1, 12968, 29901, 29871, 30406, 4691, 31479, 30287, 30502, 232, 194, 174, 31859, 233, 145, 149, 31463, 29871, 13, 13, 7900, 22137, 29901, 29871'
    input_id = [[int(i) for i in input_id.split(', ')]]
    input_id = torch.tensor(input_id)
    generate_ids = model.generate(input_id, max_new_tokens=100, do_sample = True, top_k =1, top_p=0.95, temperature = 1, repetition_penalty=1.0, eos_token_id=2, bos_token_id=1, pad_token_id=0)
generate_ids is:
tensor([[    1, 12968, 29901, 29871, 30406,  4691, 31479, 30287, 30502,   232,
           194,   174, 31859,   233,   145,   149, 31463, 29871,    13,    13,
          7900, 22137, 29901, 29871, 18585, 29991,  2266,   338,   385,  1342,
           310,   263,  4996,  6605,  5687,   297,  5132, 29901,    13, 28956,
            13,  1753,  4996,  6605, 29898,  2749,  1125,    13,  1678,   565,
          7431, 29898,  2749, 29897,  5277, 29871, 29896, 29901,    13,  4706,
           736,  3948,    13,  1678, 24438,   353,  3948, 29961, 29900, 29962,
            13,  1678,  3109,   353,   518, 29916,   363,   921,   297,  3948,
         29961, 29896, 17531,   565,   921,  5277, 24438, 29962,    13,  1678,
          7621,   353,   518, 29916,   363,   921,   297,  3948, 29961, 29896,
         17531,   565,   921,  1405, 24438, 29962,    13,  1678,   736,  4996,
          6605, 29898,  2222, 29897,   718,   518, 29886, 11002, 29962,   718,
          4996,  6605, 29898,  7979]])

So, according to this example, the output from FT is consistent with the HF transformer.

@double-vin
Copy link

Thank you very much for your reply. When I used the commit on July 2nd, I received the correct results, but there was a problem with using the commit on April 23rd. I will use a new version to solve this problem.

@realgump
Copy link

realgump commented Aug 8, 2023

llama支持dynamic batching吗? 我在config文件里面打开了dynamic batching,但是server端仍然是串行推理的。仿照https://github.com/triton-inference-server/fastertransformer_backend/blob/6df8877bee99d0c6eefc2e9127edd5ee71b1ad06/all_models/gpt/fastertransformer/config.pbtxt 里面开ragged input,可以成功组batch,但输出会出现很多乱码。

Yes, dynamic batching works well with latest commits. 33B, both decoupled true and false.

Thank you for your reply. However, even after updating to the latest commits, my 13B model still produces garbled output when multiple requests are made concurrently.

@shixianc
Copy link

shixianc commented Aug 15, 2023

Do we have a working implementation for Llama1 using FlashAttention?

I tried to set $FMHA_ENABLE=ON but did not observe any difference in the output or the performance. I'm wondering if anyone has tested this feature and would like to share some more details?

@efwfe
Copy link

efwfe commented Aug 23, 2023

llama-2-70B: the model configuration has one extra parameter: 'num_key_value_heads': 8

Converting the model weights with huggingface_llama_convert.py raises an error: ValueError: all input arrays must have the same shape

Comparing the two versions of the llama model, the weight dimensions of q, k and v are different. llama-2-70B: q_proj.weight: (8192, 8192) k_proj.weight: (1024, 8192) v_proj.weight: (1024, 8192) Given the dimensional differences, should np.vstack() be used to concatenate the parameters?

llama-65B: k_proj.weight: (8192, 8192) q_proj.weight: (8192, 8192) v_proj.weight: (8192, 8192)

Can anyone suggest how the model conversion should be changed? And given this dimension change, doesn't the inference implementation in FT also need to be modified accordingly?

Same error.

@RobotGF
Copy link

RobotGF commented Sep 8, 2023

Does llama support dynamic batching? I enabled dynamic batching in the config file, but the server still runs inference serially. Following https://github.com/triton-inference-server/fastertransformer_backend/blob/6df8877bee99d0c6eefc2e9127edd5ee71b1ad06/all_models/gpt/fastertransformer/config.pbtxt to enable ragged input, requests do get batched, but the output contains a lot of garbled text.

Yes, dynamic batching works well with latest commits. 33B, both decoupled true and false.

Thank you for your reply. However, even after updating to the latest commits, my 13B model still produces garbled output when multiple requests are made concurrently.

#716
#742
These two pull requests may help. Both work well; pick whichever you need.

@double-vin
Copy link

Does llama support dynamic batching? I enabled dynamic batching in the config file, but the server still runs inference serially. Following https://github.com/triton-inference-server/fastertransformer_backend/blob/6df8877bee99d0c6eefc2e9127edd5ee71b1ad06/all_models/gpt/fastertransformer/config.pbtxt to enable ragged input, requests do get batched, but the output contains a lot of garbled text.

Yes, dynamic batching works well with latest commits. 33B, both decoupled true and false.

Thank you for your reply. However, even after updating to the latest commits, my 13B model still produces garbled output when multiple requests are made concurrently.

#716 #742 These two pull requests may help. Both work well; pick whichever you need.

Same problem here, but these two pull requests did not solve the garbled output problem.

@HuaYZhao
Copy link

HuaYZhao commented Oct 11, 2023

I built FasterTransformer following the llama_guide. With cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release .. the build succeeds and ./bin/llama_example in the build directory runs fine, but when I build in debug mode with cmake -DSM=80 -DCMAKE_BUILD_TYPE=Debug .. and run the llama_example executable the same way, I get the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: too many resources requested for launch /ft_workspace/FasterTransformer/src/fastertransformer/layers/FfnLayer.cc:311 

The model is llama-7b-hf and the device is an A100-80G.
Please help me figure this out, thank you!

@Anychnn
Copy link

Anychnn commented Oct 11, 2023 via email

@CN-COTER
Copy link

CN-COTER commented Oct 20, 2023

Hi, FYI:
TensorRT-LLM is publicly available: https://github.com/NVIDIA/TensorRT-LLM/tree/main. According to the docs, it integrates FasterTransformer and supports many of the latest features. Meanwhile, the TensorRT-LLM backend is now available in triton-inference-server (https://github.com/triton-inference-server/tensorrtllm_backend/blob/e514b4af5ec87477b095d3ba6fe63cc7b797055f/README.md#L31).
