https://github.com/InternLM/lmdeploy/blob/main/lmdeploy/vl/engine.py#L103-L110
The current way ImageEncoder measures its forward time:
```python
def forward(self, inputs: List[Image]):
    """Model forward."""
    time_start = time.perf_counter()
    outputs = self.model.forward(inputs)
    time_end = time.perf_counter()
    logger.info(f'ImageEncoder forward {len(inputs)} images, '
                f'cost {time_end - time_start:.3f}s')
    return outputs
```
Because PyTorch launches CUDA kernels asynchronously, the time measured this way differs dramatically from the time obtained after adding torch.cuda.synchronize() before and after the forward call:
```python
def forward(self, inputs: List[Image]):
    """Model forward."""
    torch.cuda.synchronize()  # stream synchronization
    time_start = time.perf_counter()
    outputs = self.model.forward(inputs)
    torch.cuda.synchronize()  # stream synchronization
    time_end = time.perf_counter()
    logger.info(f'ImageEncoder forward {len(inputs)} images, '
                f'cost {time_end - time_start:.3f}s')
    return outputs
```
For example, with InternViT-6B on 2× L20 GPUs:
One more note: torch.cuda.synchronize() by default only synchronizes the stream on device 0. If the vision model is distributed across multiple GPUs, each device must be synchronized individually; otherwise the measured time ends up being 1/N of the real time (N = number of GPUs).
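A per-device synchronization loop might look like the following (a minimal sketch; the helper name is illustrative, not from lmdeploy):

```python
import torch


def synchronize_all_devices():
    """Wait for pending kernels on every visible GPU, not just device 0."""
    # torch.cuda.synchronize() with no argument only synchronizes the
    # current device, so iterate over all devices before reading the clock.
    for device_id in range(torch.cuda.device_count()):
        torch.cuda.synchronize(device_id)
```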
Fetching a CPU tensor directly here should be sufficient, shouldn't it?
Makes sense, that's simpler.
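To illustrate the suggestion (a sketch under the assumption that the encoder output is a CUDA tensor; function and variable names are hypothetical): copying the result to host memory blocks until the kernels that produce it have finished, so the copy itself acts as the synchronization point for timing.

```python
import time
import torch


def timed_forward(model, inputs):
    """Time a forward pass, using the device-to-host copy as the sync point."""
    time_start = time.perf_counter()
    outputs = model(inputs)
    # .cpu() blocks until all pending kernels producing `outputs` complete,
    # so no explicit torch.cuda.synchronize() is needed before stopping
    # the timer. (On a CPU tensor this copy is a no-op.)
    outputs = outputs.cpu()
    time_end = time.perf_counter()
    return outputs, time_end - time_start
```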
irexyc