Sampling optimizations #152
Conversation
Code Metrics Report
───────────────────────────────────────────────────────────────
Language   Files   Lines   Blanks   Comments   Code    Complexity
Rust       60      20036   1439     820        17777   1130
───────────────────────────────────────────────────────────────
Total      60      20036   1439     820        17777   1130
───────────────────────────────────────────────────────────────
Estimated Cost to Develop: 54,557
Estimated Schedule Effort: 10.99 months
Estimated People Required: 4.48
───────────────────────────────────────────────────────────────
Processed 677983 bytes, 0.678 megabytes (SI)
  // Sort by descending probability.
- argsort_indices.sort_by(|&i, &j| probs[j].partial_cmp(&probs[i]).unwrap());
+ argsort_indices.sort_unstable_by(|&i, &j| probs[j].partial_cmp(&probs[i]).unwrap());
This sort is unstable (i.e., may reorder equal elements), in-place (i.e., does not allocate), and O(n * log(n)) worst-case.
The benchmarks in the repository below show the difference between stable and unstable sorting. (I don't think actually adding the library is required.)
https://github.com/orlp/glidesort
This saves 500ms per token.
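As a minimal standalone sketch of the pattern in the diff (the function name `argsort_desc` is hypothetical, not from the PR): build an index vector, then sort the indices by descending probability with `sort_unstable_by`, which is in-place and allocation-free.

```rust
// Hypothetical sketch of the argsort-by-descending-probability step.
fn argsort_desc(probs: &[f32]) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..probs.len()).collect();
    // sort_unstable_by may reorder equal elements, which is harmless
    // for sampling; in exchange it avoids the allocation a stable
    // merge sort would perform.
    indices.sort_unstable_by(|&i, &j| probs[j].partial_cmp(&probs[i]).unwrap());
    indices
}

fn main() {
    let probs = [0.1_f32, 0.7, 0.2];
    let order = argsort_desc(&probs);
    assert_eq!(order, vec![1, 2, 0]);
}
```

Note that `partial_cmp(...).unwrap()` panics on NaN logits, matching the behavior of the original `sort_by` line.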
  for (token_id, logit) in logits.iter_mut().enumerate() {
      let count = context.iter().filter(|x| **x as usize == token_id).count();
      *logit = *logit
          - count as f32 * frequency_penalty
          - if count > 0 { 1. } else { 0. } * presence_penalty;
  }
  let logits_len = logits.len();
- Tensor::from_vec(logits, logits_len, device)
+ Ok(logits)
Since this reads the context, we can't pre-allocate the tensor.
Allocating the tensor here was triggering a synchronization that cost 500ms.
Thank you!
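A minimal standalone sketch of the penalty step in the diff, operating on a plain `Vec<f32>` so no tensor allocation (and hence no device synchronization) happens in the hot loop. The function name `apply_penalties` is hypothetical, not from the PR:

```rust
// Hypothetical sketch of the frequency/presence penalty loop.
// Subtracts `frequency_penalty` once per prior occurrence of the token
// in `context`, plus `presence_penalty` once if it occurred at all.
fn apply_penalties(
    logits: &mut [f32],
    context: &[u32],
    frequency_penalty: f32,
    presence_penalty: f32,
) {
    for (token_id, logit) in logits.iter_mut().enumerate() {
        // Count how often this token already appeared in the context.
        let count = context.iter().filter(|&&x| x as usize == token_id).count();
        *logit -= count as f32 * frequency_penalty
            + if count > 0 { presence_penalty } else { 0.0 };
    }
}

fn main() {
    let mut logits = vec![1.0_f32, 1.0, 1.0];
    // Token 1 appeared twice: 1.0 - 2 * 0.5 - 0.25 = -0.25.
    apply_penalties(&mut logits, &[1, 1], 0.5, 0.25);
    assert_eq!(logits, vec![1.0, -0.25, 1.0]);
}
```

Returning the penalized `Vec` directly (as `Ok(logits)` does in the diff) defers tensor creation, keeping the sampling path free of blocking allocations.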
Makes sampling fully async, removing synchronizations.
Generation goes from 54 t/s to 59 t/s using:
./target/profiling/mistralrs-server --prompt "Tell me 3 jokes." mistral-gguf
Same generation speed as Llama.cpp 😄