Feature Request: Add Min-P sampling layer #1154

Open · aikitoria opened this issue Feb 25, 2024 · 11 comments
Labels: feature request (New feature or request)

@aikitoria

It would be very nice if the library supported Min-P sampling as an alternative to Top-P/Top-K. This became popular for local LLMs in the past few months because it provides significantly more useful results, or at least feels like it does. More info here: https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/

Most other libraries already support it, examples:
turboderp/exllamav2@0d436d7
ggerganov/llama.cpp#3841

This only requires a single parameter: consider all tokens whose probability is greater than the probability of the most likely token scaled down by that parameter.
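As an illustrative sketch of that rule (the helper name here is hypothetical, not from the library): with min_p = 0.1 and a top-token probability of 0.6, everything below 0.06 would be discarded.

import torch


def min_p_keep_mask(logits: torch.Tensor, min_p: float) -> torch.Tensor:
    # Convert logits to probabilities over the vocabulary.
    probs = torch.softmax(logits, dim=-1)
    # The cutoff scales with the probability of the most likely token,
    # e.g. min_p = 0.1 with top prob 0.6 gives a cutoff of 0.06.
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    # True for tokens that survive min-p filtering.
    return probs >= threshold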

@byshiue byshiue added the feature request New feature or request label Feb 27, 2024
@aikitoria (Author)

Forgot to mention: this sampling method should be applied before temperature.
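As a sketch of that ordering (hypothetical helper, assuming 1-D logits; whether a given engine orders it this way is up to the implementation), the mask would be computed on the unscaled distribution before temperature is applied:

import torch


def sample_with_min_p(logits: torch.Tensor, min_p: float, temperature: float) -> torch.Tensor:
    # Min-p mask computed on the raw, pre-temperature distribution.
    probs = torch.softmax(logits, dim=-1)
    keep = probs >= min_p * probs.max(dim=-1, keepdim=True).values
    filtered = logits.masked_fill(~keep, -float("inf"))
    # Temperature is applied only to the surviving tokens, then we sample.
    scaled = filtered / temperature
    return torch.multinomial(torch.softmax(scaled, dim=-1), num_samples=1)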

@0xymoro (Contributor) commented Mar 11, 2024

@ncomly-nvidia seconded on adding min-p: it makes a noticeable impact in production and doesn't seem too bad to implement compared to some other samplers.

@aikitoria (Author)

@byshiue any chance of this being added soon? 👀

Most other engines have it now; it's in vLLM, and HF transformers is also adding it.

@nivekgnaij

^^ has a huge impact on production

@pathorn commented May 2, 2024

@aikitoria I found that it was far easier and more performant to implement in decodingCommon.cu, since the same math used for logprobs can be used to calculate the relative threshold for min-p sampling. See my PR in #1536.

I'll need to review that things are done in the correct order: I'm still grappling with the codebase, but I assumed it should be doing min-p before sampling.

@byshiue (Collaborator) commented May 29, 2024

Also mentioned in this issue #1683.

@cyr0930 commented May 29, 2024

I hope this feature gets added soon!

@DreamGenX

You can implement this as a logit processor as far as I can tell:

import torch


def _get_min_p_fn(min_p: float):
    def _fn(
        logits: torch.Tensor,
    ) -> torch.Tensor:
        # Probabilities over the vocabulary (softmax over the last dim).
        probs = torch.softmax(logits, dim=-1)
        # Scale the cutoff by the most likely token's probability.
        top_prob = probs.max()
        scaled_min_p = min_p * top_prob
        # Mask out every token that falls below the scaled cutoff.
        tokens_to_remove = probs < scaled_min_p
        logits = logits.masked_fill(tokens_to_remove, -float("inf"))

        return logits

    return _fn
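For example, applied to a standalone tensor (values chosen purely for illustration):

fn = _get_min_p_fn(min_p=0.05)
logits = torch.tensor([2.0, 1.0, -3.0])
filtered = fn(logits)  # the lowest-probability token is masked to -inf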

@aikitoria (Author) commented May 29, 2024

Is that not much slower than it would be if properly implemented in the CUDA sampling layers?

Just saw the PR above as well. That's an interesting approach. I highly doubt NVIDIA would accept it, given it seems more like a hack, but it gives us something to experiment with...

@DreamGenX

@aikitoria Yeah, since it's computed per request and not in a batch (see also #1681). But if you are already using other logit processors, it might not have as big of an effect.
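A batched variant would be a straightforward sketch (the function name and the [batch_size, vocab_size] shape are assumptions), masking the whole batch at once instead of looping per request:

import torch


def min_p_filter_batched(logits: torch.Tensor, min_p: float) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)
    # Per-row top probability, shape [batch_size, 1].
    top = probs.max(dim=-1, keepdim=True).values
    # Mask every token whose probability falls below its row's cutoff.
    return logits.masked_fill(probs < min_p * top, -float("inf"))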
