[OpenCL][Kernel] Add OpenCL image kernel: split #4645
Merged
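【Problem】
split is a commonly used op, but it previously had no OpenCL implementation. This caused unnecessary io_copy and layout_cast executions, which in turn increased run time.

【Solution】
This PR adds an OpenCL image implementation of the split op.

【Results】
Tested on shufflenetv2; accuracy has been verified to show no diff against ARM. The model contains 16 blocks with the structure concat -> shuffle_channel -> split. The PR that adds the GPU implementation of shuffle_channel has not been merged yet. As a result, a GPU implementation of split alone does not reduce the number of io_copy and layout_cast executions. In addition, split is a pure data copy with no computation, so in theory running this op on the GPU offers no advantage by itself (the third row of the table below confirms this). However, once both split and shuffle_channel have GPU implementations, each block saves the execution time of 3 io_copy and 3 layout_cast operations, which ultimately yields a speedup on shufflenetv2.

Note: with only the GPU version of shuffle_channel, running shufflenetv2 on the 835 drops from 43 ms to 41 ms. The change therefore only pays off when both shuffle_channel and split have GPU implementations (1+1>2, haha).

【TODO】
Unit test for split. The OpenCL unit tests are going to be migrated to the tests folder, which involves many related changes, so the split unit test will be added in a separate PR.
shufflenetv2 is slower on OpenCL than on ARM; a detailed timing analysis is needed.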
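To make concrete why split is described as a pure data copy with no computation, here is a minimal reference sketch in plain Python. It is not the OpenCL image kernel from this PR. The function name `split_channels`, the flat NCHW buffer layout, and the `sections` parameter are assumptions chosen for illustration only.

```python
def split_channels(data, shape, sections):
    """Split a flat NCHW buffer along the channel axis (illustrative only).

    data:     flat list of length N*C*H*W, stored in NCHW order
    shape:    (N, C, H, W)
    sections: channel count of each output; must sum to C
    Returns a list of (flat_buffer, shape) pairs, one per output.
    """
    n, c, h, w = shape
    if len(data) != n * c * h * w:
        raise ValueError("data length does not match shape")
    if sum(sections) != c:
        raise ValueError("sections must sum to the channel count")

    plane = h * w  # number of elements in one channel of one batch item
    outputs = []
    c_start = 0
    for c_out in sections:
        out = []
        for batch in range(n):
            # Channels [c_start, c_start + c_out) of one batch item are
            # contiguous in NCHW, so each output is built from slice copies.
            begin = (batch * c + c_start) * plane
            out.extend(data[begin:begin + c_out * plane])
        outputs.append((out, (n, c_out, h, w)))
        c_start += c_out
    return outputs


# Usage: split a 1x4x1x2 tensor into two 2-channel halves.
halves = split_channels(list(range(8)), (1, 4, 1, 2), [2, 2])
```

Every output element is copied unchanged from the input and no arithmetic is applied to the values, which is consistent with the observation that moving split to the GPU only helps when it removes io_copy and layout_cast steps around it.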