Add support for bucketize #18040

Open · wants to merge 4 commits into main

Conversation

kavin-sai-krishna (Contributor)

This PR adds support for the bucketize op, which is used in many vision models such as Phi4 and SmolVLM.
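
For context, the op follows torch.bucketize semantics: each value is mapped to the index of the first boundary greater than or equal to it (with right=False, the default). A minimal illustration of the reference behavior (not part of this PR's code):

```python
import torch

boundaries = torch.tensor([0, 2, 4, 6, 8, 10])
values = torch.tensor([-1, 1, 3, 9, 11])

# Each value maps to the index of the first boundary >= value
# (right=False, the default).
print(torch.bucketize(values, boundaries))
# tensor([0, 1, 2, 5, 6])
```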

tqchen (Member) commented Jun 5, 2025

For operators like this one, we also need a legalization rule so that we know how to lower them. We don't want to end up in a situation where we have the ops but cannot lower/compile them. cc @tlopex

kavin-sai-krishna (Contributor, Author)

> For operators like this one, we also need a legalization rule so that we know how to lower them

@tqchen I came across the PyTorch implementation of this operation and noticed that it uses searchsorted. Following that approach, I've used topi.searchsorted to lower the operation. I also tested the implementation numerically with boundaries = [0, 2, 4, 6, 8, 10] and input = [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], and the results appear to be correct.
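
This is not the PR's actual test code, but the check described above can be reproduced with NumPy, whose searchsorted with side="left" uses the same convention as torch.bucketize with right=False:

```python
import numpy as np

boundaries = np.array([0, 2, 4, 6, 8, 10])
inp = np.array([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])

# side="left": each value maps to the index of the first boundary >= value,
# matching torch.bucketize(..., right=False).
print(np.searchsorted(boundaries, inp, side="left"))
# [0 0 1 1 2 2 3 3 4 4 5 5 6]
```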

tqchen (Member) commented Jun 5, 2025

Thanks. It would be good to go through the checklist below for adding a new op.

  • C0: Can the operator be decomposed into smaller ops? If so, it may not be necessary to introduce the high-level op. Reusing existing ops when possible means we can skip C1/C2; otherwise we need both.
  • C1: Introduce legalization rules so the op at least compiles on CPU (see the sketch after this list).
  • C2: Make sure the op compiles on CUDA.
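
For C1, a minimal sketch of what a legalization rule could look like, assuming the op is registered as relax.bucketize with a boolean right attribute and is lowered through topi.searchsorted (the names and attr layout here are illustrative, not necessarily what this PR uses):

```python
from tvm import topi
from tvm.relax.transform.legalize_ops.common import register_legalize


@register_legalize("relax.bucketize")
def _bucketize(bb, call):
    # Reuse topi.searchsorted: it returns, for each value, the insertion
    # index into the sorted boundaries, which is exactly bucketize.
    values, boundaries = call.args  # assumed argument order
    return bb.call_te(
        topi.searchsorted,
        boundaries,  # sorted_sequence
        values,
        right=call.attrs.right,  # assumed attribute mirroring torch.bucketize
    )
```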

tqchen (Member) commented Jun 5, 2025

I think the main question is how to make sure it runs on CUDA.

kavin-sai-krishna (Contributor, Author)

@tqchen Thank you. I understand the high-level idea you suggested, but I have a few specific questions about the design choices:

  • Q1: What’s the difference between decomposing an op using Relax ops vs. TOPI ops vs. TIR?
    How does the abstraction level impact performance or correctness?

  • Q2: If the op compiles on CUDA, is numerical verification still required?

  • Q3: Are there nightly tests that check numerical correctness?
    I ask because I found a case (fmod) where the op ran but didn't match PyTorch output.

tqchen (Member) commented Jun 5, 2025

> Q1: What’s the difference between decomposing an op using Relax ops vs. TOPI ops vs. TIR?

Given this is the Relax importer, we can choose either option as long as correctness matches. When possible, decomposing via Relax ops and then legalizing gives the most flexibility in choosing a lowering path. We should aim to keep the total number of core Relax ops small. A concrete sketch of the decomposition route is below.
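
To make the tradeoff concrete, here is a hedged sketch of the C0 route, decomposing bucketize entirely into existing Relax ops as a broadcast compare-and-count (assumes a dense O(n) comparison against the boundaries is acceptable):

```python
from tvm.relax.op import astype, expand_dims, less, less_equal, sum as reduce_sum


def decompose_bucketize(values, boundaries, right=False):
    # bucketize(v) with right=False is the number of boundaries strictly
    # less than v; with right=True it is the number of boundaries <= v.
    v = expand_dims(values, axis=-1)  # (..., 1), broadcasts against (n,)
    cmp = less_equal(boundaries, v) if right else less(boundaries, v)
    return reduce_sum(astype(cmp, "int64"), axis=-1)
```

This avoids introducing a new core op, but it replaces the binary search in searchsorted with a dense comparison over all boundaries, which is why lowering through topi.searchsorted may still be preferable when the boundary tensor is large.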

> Q2/Q3

Yes, ideally we should have a nightly test validating the correctness.

We can add such tests under https://github.com/apache/tvm/tree/main/tests/python/nightly, e.g. as nightly/relax/test_relax_op_numeric.py.
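
A hedged sketch of what such a nightly numeric test could look like, using torch.bucketize as the reference. Here relax.op.bucketize and its argument order are assumptions about this PR's API, and the build plumbing is illustrative:

```python
import numpy as np
import torch
import tvm
from tvm import relax


def test_bucketize_numeric():
    boundaries_np = np.array([0, 2, 4, 6, 8, 10], dtype="float32")
    values_np = np.random.uniform(-2, 12, size=(64,)).astype("float32")

    # Build a one-op Relax module around the new operator.
    bb = relax.BlockBuilder()
    b = relax.Var("boundaries", relax.TensorStructInfo((6,), "float32"))
    v = relax.Var("values", relax.TensorStructInfo((64,), "float32"))
    with bb.function("main", [b, v]):
        out = bb.emit(relax.op.bucketize(v, b))  # assumed op name/signature
        bb.emit_func_output(out)

    mod = relax.transform.LegalizeOps()(bb.get())
    ex = relax.build(mod, target="llvm")
    vm = relax.VirtualMachine(ex, tvm.cpu())
    result = vm["main"](tvm.nd.array(boundaries_np), tvm.nd.array(values_np))

    # Reference result from PyTorch.
    expected = torch.bucketize(
        torch.from_numpy(values_np), torch.from_numpy(boundaries_np)
    ).numpy()
    np.testing.assert_array_equal(result.numpy(), expected)
```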

kavin-sai-krishna (Contributor, Author)

@tqchen Thanks for your response. I'll make sure the checklist items are satisfied, but I'm not sure what I should do if C2 is not met.

kavin-sai-krishna (Contributor, Author)

@tqchen I've updated the op to compile and run on CUDA as you requested. Could you please review it?
