How to batch evaluate in inference? #40
Hi, how can I make the inference code evaluate videos in batch?
I naively concatenated the tensors along dimension 0 and got an error.
Can you help me figure it out? Thanks.

Comments
I did something similar, but instead of concatenating, I used a for loop. Got the exact same error. Were you able to resolve it?
I don't know the details of this problem, but I found that the input_ids tensor cannot exceed batch size 1, e.g. a [1, 50] input works but [4, 50] doesn't. Hope that helps.
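For what it's worth, here is a minimal sketch of that per-sample workaround (looping instead of concatenating, as mentioned above). The helper name and arguments are illustrative, not from the repo, and it assumes you already build input_ids and the video tensor for each sample the usual single-sample way:

import torch

def generate_one_by_one(model, tokenizer, per_sample_input_ids, per_sample_videos, **gen_kwargs):
    # Fall back to batch size 1: call generate once per sample and collect the decoded outputs.
    outputs = []
    with torch.inference_mode():
        for input_ids, video in zip(per_sample_input_ids, per_sample_videos):
            output_ids = model.generate(
                input_ids.unsqueeze(0),   # keep a leading batch dimension of 1
                images=[video],           # one video tensor per call, formatted as in the repo's demo
                **gen_kwargs,
            )
            outputs.append(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())
    return outputs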
I'm not sure how to solve this problem, but I implemented batch eval using the following solution. I hope this can help you.

@torch.no_grad()
def generate(
    self,
    inputs: Optional[torch.Tensor] = None,
    images: Optional[torch.Tensor] = None,
    **kwargs,
) -> Union[GenerateOutput, torch.LongTensor]:
    position_ids = kwargs.pop("position_ids", None)
    attention_mask = kwargs.pop("attention_mask", None)
    if "inputs_embeds" in kwargs:
        raise NotImplementedError("`inputs_embeds` is not supported")
    if images is not None:
        (
            inputs,
            position_ids,
            attention_mask,
            _,
            inputs_embeds,
            _
        ) = self.prepare_inputs_labels_for_multimodal(
            inputs,
            position_ids,
            attention_mask,
            None,
            None,
            images,
        )
    else:
        inputs_embeds = self.get_model().embed_tokens(inputs)
    return super().generate(
        position_ids=position_ids,
        attention_mask=attention_mask,
        inputs_embeds=inputs_embeds,
        **kwargs
    )

Then, at the inference step, I prepare the text part like this:

import torch
from torch.nn.utils.rnn import pad_sequence
from videollava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from videollava.conversation import conv_templates, SeparatorStyle
from videollava.mm_utils import tokenizer_image_token, KeywordsStoppingCriteria

def roll_padding_to_front(padded_input_ids, padding_value=0):
    padding_lengths = (padded_input_ids == padding_value).long().sum(dim=1)
    rolled_input_ids = torch.stack([torch.roll(input_id, shifts=padding_length.item()) for input_id, padding_length in zip(padded_input_ids, padding_lengths)])
    return rolled_input_ids

def the_generate(
    tokenizer, model, model_name, video_processor,
    questions, video_path, conv_mode, device, temperature, max_new_tokens
):
    video_outputs = video_processor(video_path, return_tensors='pt')
    video_tensor = video_outputs['pixel_values']
    video_prompts = video_outputs['prompts']
    # 'video_prompt' containing video frames and time points, same as videollama
    if type(video_tensor) is list:
        videos_tensor = [video.to(device, dtype=torch.float16) for video in video_tensor]
    else:
        videos_tensor = video_tensor.to(device, dtype=torch.float16)

    inputs_ids = []
    stopping_criterias = []
    for question, video_prompt in zip(questions, video_prompts):
        conv_m = "llava_v1"
        if conv_mode is not None and conv_m != conv_mode:
            print(
                "[WARNING] the auto inferred conversation mode is {}, while `--conv-mode` is {}, using {}".format(
                    conv_m, conv_mode, conv_mode
                )
            )
        else:
            conv_mode = conv_m
        conv = conv_templates[conv_mode].copy()
        roles = conv.roles
        question = ' '.join([DEFAULT_IMAGE_TOKEN] * model.get_video_tower().config.num_frames) + '\n' \
            + video_prompt + question
        conv.append_message(conv.roles[0], question)
        conv.append_message(conv.roles[1], None)
        prompt = conv.get_prompt()
        input_ids = (
            tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt')
            .unsqueeze(0)
        )
        inputs_ids.append(input_ids.squeeze(0))
        stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
        keywords = [stop_str]
        stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)
        stopping_criterias.append(stopping_criteria)

    padded_input_ids = pad_sequence(
        inputs_ids,
        batch_first=True,
        padding_value=tokenizer.pad_token_id
    ).to(device)
    # move padding tokens ahead
    rolled_input_ids = roll_padding_to_front(padded_input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            rolled_input_ids,
            images=videos_tensor,
            do_sample=True if temperature > 0 else False,
            temperature=temperature,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            stopping_criteria=stopping_criterias
        )
    outputs = tokenizer.batch_decode(
        output_ids,
        skip_special_tokens=True
    )
    outputs = [x.strip() for x in outputs]
    return outputs
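A note on the roll_padding_to_front helper above: decoder-only models continue generation from the last position of the prompt, so for batched generation the padding of shorter prompts has to sit on the left; otherwise the model would be asked to continue from pad tokens. Also, the helper's padding_value defaults to 0, which only matches if tokenizer.pad_token_id is 0; otherwise pass the real pad id. A toy example (values made up) of what the roll does:

import torch

padded = torch.tensor([[5, 6, 7, 0, 0],    # right-padded with 0 by pad_sequence
                       [8, 9, 1, 2, 3]])
# Roll each row by its number of pad tokens so the padding moves to the front
# and every sequence ends at the last position, which is where generation starts.
rolled = torch.stack([torch.roll(row, shifts=int((row == 0).sum())) for row in padded])
print(rolled)
# tensor([[0, 0, 5, 6, 7],
#         [8, 9, 1, 2, 3]])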
I'm having the same problem :-( Any advice? Not sure why @xiningin's code also returns an error for me :-(
If you are running inference in a loop, make sure that you reset the conversation template on every iteration.
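A minimal sketch of that pattern, using the conversation API already shown above (assuming the llava_v1 template):

from videollava.conversation import conv_templates

for question in questions:
    # Re-create the conversation from the template on every iteration;
    # reusing the same conv object would keep appending earlier turns to the prompt.
    conv = conv_templates["llava_v1"].copy()
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()
    # ... tokenize prompt and call model.generate as in the snippets above ...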
@RaulKite Maybe we aren't using the same versions of some packages. Here are my main package versions; you can try installing the same versions as mine, or post the error you encountered.

And I use:

for batch in tqdm(dataloader):
    ......
    outputs = the_generate(........)
    .......
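The dataloader itself is elided above; for completeness, here is one possible way to set it up, assuming each sample is a (question, video_path) pair, that the_generate expects aligned lists of questions and video paths, and that samples, the model, tokenizer, and processor are already prepared as in the snippets above. The dataset class and collate function below are illustrative, not from the thread:

from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

class VideoQADataset(Dataset):
    # Hypothetical dataset: each item is a (question, video_path) pair.
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

def collate_fn(batch):
    # Keep questions and paths as plain Python lists so the_generate can zip over them.
    questions, video_paths = zip(*batch)
    return list(questions), list(video_paths)

dataloader = DataLoader(VideoQADataset(samples), batch_size=4, collate_fn=collate_fn)

for questions, video_paths in tqdm(dataloader):
    outputs = the_generate(
        tokenizer, model, model_name, video_processor,
        questions, video_paths, conv_mode, device, temperature, max_new_tokens,
    )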