Description
When trying to benchmark a 6-billion-parameter GPT-3 model for sampling, I get the following error when using fp16:
python ./pytorch/gpt_sample.py --output_len=100 --time --max_batch_size=1 --max_seq_len=2048 --layer_num=28 --head_num=32 --size_per_head=128 --fp16
Traceback (most recent call last):
File "./pytorch/gpt_sample.py", line 167, in
main()
File "./pytorch/gpt_sample.py", line 138, in main
tokens_batch = gpt(start_ids, start_lengths, attn_mask)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
result = self.forward(*input, **kwargs)
File "/workspace/FasterTransformer/build/pytorch/utils/gpt.py", line 210, in forward
output_ids, = self.model.forward(start_ids, start_lengths, attn_mask, self.output_len)
RuntimeError: [FT][ERROR] CUDA runtime error: CUBLAS_STATUS_EXECUTION_FAILED /workspace/FasterTransformer/fastertransformer/utils/functions.h:674
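For context (this is an illustration, not part of the original report): CUBLAS_STATUS_EXECUTION_FAILED on half-precision GEMMs is often an environment or GPU-architecture issue, but fp16's narrow numeric range is also worth keeping in mind when a model only fails under the fp16 path. A minimal numpy sketch of float16 limits (hypothetical diagnostic code, unrelated to FasterTransformer internals):

```python
import numpy as np

# float16 can represent magnitudes only up to 65504;
# larger values overflow to inf.
fp16_max = np.finfo(np.float16).max   # 65504.0
overflowed = np.float16(70000.0)      # rounds to inf

# A reduction that is fine in float32 overflows when
# naively accumulated in float16: 8 * (100 * 100) = 80000.
v = np.full(8, 100.0, dtype=np.float16)
acc32 = np.dot(v.astype(np.float32), v.astype(np.float32))  # 80000.0

acc16 = np.float16(0)
for x in v:
    acc16 = np.float16(acc16 + x * x)  # overflows to inf partway through

print(fp16_max, overflowed, acc32, acc16)
```

This is why mixed-precision GEMM implementations typically accumulate in float32 even when inputs are float16; it does not by itself explain a cuBLAS execution failure, which more commonly points at the GPU/toolkit combination in the container.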
The error does not occur when I run without the fp16 flag:
python ./pytorch/gpt_sample.py --output_len=1900 --time --max_batch_size=1 --sample_input_file=" " --max_seq_len=2048 --layer_num=28 --head_num=32 --size_per_head=128 --fp16
=============== Arguments ===============
layer_num: 28
output_len: 128
head_num: 32
size_per_head: 128
vocab_size: 50304
top_k: 1
top_p: 0.0
temperature: 1.0
tensor_para_size: 1
layer_para_size: 1
layer_para_batch_size: 1
ckpt_path: ../models/megatron-models/c-model/345m/1-gpu
lib_path: ./lib/libpyt_fastertransformer.so
vocab_file: ../models/gpt2-vocab.json
merges_file: ../models/gpt2-merges.txt
start_id: 50256
end_id: 50256
max_batch_size: 1
max_seq_len: 1024
fp16: False
time: True
sample_input_file:
sample_output_file: None
=========================================
[INFO] batch size: 1
[WARNING] decoding_gemm_config.in is not found
(the warning above is repeated 10 times)
[INFO] GPT time costs: 2432.03 ms
System Configuration
I am using the PyTorch NGC container: nvcr.io/nvidia/pytorch:20.12-py3