
Async sampling #198
Merged · 21 commits · Apr 28, 2024

Conversation

@lucasavila00 (Contributor) commented Apr 23, 2024:

./target/profiling/mistralrs-bench -p 0 -g 64 -r 1 -c 8  gguf -t mistralai/Mistral-7B-Instruct-v0.1 -m TheBloke/Mistral-7B-Instruct-v0.1-GGUF -f mistral-7b-instruct-v0.1.Q4_K_M.gguf

Master: [NVIDIA profiler screenshot]

This PR: [NVIDIA profiler screenshot]

github-actions bot commented Apr 23, 2024

Code Metrics Report
  ───────────────────────────────────────────────────────────────────────────────
Language                 Files     Lines   Blanks  Comments     Code Complexity
───────────────────────────────────────────────────────────────────────────────
Rust                        70     23237     1543       508    21186       1278
───────────────────────────────────────────────────────────────────────────────
Total                       70     23237     1543       508    21186       1278
───────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop: 66,725
Estimated Schedule Effort: 11.79 months
Estimated People Required: 5.02
───────────────────────────────────────────────────────────────────────────────
Processed 764768 bytes, 0.765 megabytes (SI)
───────────────────────────────────────────────────────────────────────────────
  

@EricLBuehler (Owner) commented:

@lucasavila00, thanks for your work here. I think we definitely need to address the throughput decrease as the batch size increases. I like the idea of incorporating async; did this not work out?

@lucasavila00 (Contributor, Author) commented:

@EricLBuehler It had mixed results.

On bs=8 it improved from 5ms to 2ms.

But on bs=1 it got worse, from 0.8ms to 1.2ms.

Clippy raised an issue about a non-async mutex being held across an await point.

The best fix would be to use an async-aware mutex? Or to drop the lock and re-lock whenever it is needed again?

I'm still learning async Rust and don't feel confident enough to work on this PR yet, as it requires defining the overall async structure for the engine.

Also, profiling CPU code became harder (e.g. with the samply profiler). Only the NVIDIA profiler showed interpretable results.
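For context, a minimal sketch of the pattern Clippy flags here (the clippy::await_holding_lock lint) and of the async-aware-mutex option; the State type and function names are hypothetical, not the actual engine types:

use std::sync::Mutex;
use tokio::sync::Mutex as AsyncMutex;

struct State { pending: usize }

async fn some_async_work() {}

// Clippy complains: a std::sync::MutexGuard is held across the .await,
// which can stall or deadlock the executor if another task on the same
// thread tries to take the lock while this future is suspended.
async fn flagged(state: &Mutex<State>) {
    let mut guard = state.lock().unwrap();
    guard.pending += 1;
    some_async_work().await; // guard is still alive here
}

// Option 1: an async-aware mutex, whose guard is fine to hold across .await
// because lock() yields to the executor instead of blocking the thread.
async fn with_async_mutex(state: &AsyncMutex<State>) {
    let mut guard = state.lock().await;
    guard.pending += 1;
    some_async_work().await;
}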

@EricLBuehler (Owner) commented Apr 26, 2024:

> On bs=8 it improved from 5ms to 2ms.

Great! Perhaps we could profile it to see at what point the performance gains go away, and only use it beyond that point.

> Clippy raised an issue about a non-async mutex being held across an await point.

We could use this type.

I think this sort of structure is very interesting; I'll take a look in the next few days. Thanks for working on it.

@lucasavila00 reopened this on Apr 26, 2024
@lucasavila00 (Contributor, Author) commented:

I'll leave it open so it picks up my commits.

I just fixed the clippy issue by not holding the lock across await points.
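That fix looks roughly like this (a sketch with hypothetical names, not the actual engine code): the guard is confined to a block so it is dropped before the await point, and the mutex is re-locked afterwards only if needed.

use std::sync::Mutex;

struct State { pending: usize }

async fn some_async_work() {}

async fn fixed(state: &Mutex<State>) {
    {
        let mut guard = state.lock().unwrap();
        guard.pending += 1;
    } // guard dropped here, before the await point
    some_async_work().await;
    let pending_now = state.lock().unwrap().pending; // re-lock when needed
    let _ = pending_now;
}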

@lucasavila00 (Contributor, Author) commented:

@EricLBuehler, cargo test fails because it can't download the Mistral tokenizer from HF.

Any chance the CI account does not have access to the model? It was recently gated (like Llama, where one needs to request access).

@lucasavila00 (Contributor, Author) commented:

> Great! Perhaps we could profile it to see at what point the performance gains go away, and only use it beyond that point.

I implemented it in the last commit. It only uses the async pool when there is more than one sequence in the batch; with that, the performance loss is gone.

This means the async code is free when it is not used.
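Roughly the shape of that dispatch, as a sketch with hypothetical names rather than the actual mistralrs-core code: sample in-line when there is a single sequence, and only fan work out onto the async pool when there are several.

use tokio::task::JoinSet;

// Stand-in for the real per-sequence sampler (plain argmax over the logits).
async fn sample_one(logits: Vec<f32>) -> u32 {
    logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
        .unwrap_or(0)
}

async fn sample_batch(per_seq_logits: Vec<Vec<f32>>) -> Vec<u32> {
    if per_seq_logits.len() == 1 {
        // bs = 1: stay on the current task, so the async pool adds no overhead.
        vec![sample_one(per_seq_logits.into_iter().next().unwrap()).await]
    } else {
        // bs > 1: spawn one task per sequence and collect the results in order.
        let mut set = JoinSet::new();
        for (idx, logits) in per_seq_logits.into_iter().enumerate() {
            set.spawn(async move { (idx, sample_one(logits).await) });
        }
        let mut results = Vec::new();
        while let Some(joined) = set.join_next().await {
            results.push(joined.unwrap());
        }
        results.sort_by_key(|(idx, _)| *idx);
        results.into_iter().map(|(_, tok)| tok).collect()
    }
}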

@lucasavila00 (Contributor, Author) commented Apr 26, 2024:

I added two NVIDIA profiler screenshots to the main PR body, comparing master to this PR and showing the improvements.

@lucasavila00 marked this pull request as ready for review on Apr 26, 2024 06:12
@lucasavila00 changed the title from "Add async sampling POC" to "Async sampling" on Apr 26, 2024
@lucasavila00 mentioned this pull request on Apr 26, 2024
@EricLBuehler (Owner) commented:

@lucasavila00

> cargo test fails because it can't download the Mistral tokenizer from HF.

I set the HF_TOKEN secret during CI:

TESTS_HF_TOKEN: ${{ secrets.HF_TOKEN }}
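For completeness, a hedged sketch of how a test could pick that variable up; whether the actual tests read it exactly this way is an assumption:

// Hypothetical helper: read the token the workflow exposes as TESTS_HF_TOKEN,
// falling back to anonymous access when it is absent (e.g. in local runs).
fn hf_token() -> Option<String> {
    std::env::var("TESTS_HF_TOKEN")
        .ok()
        .filter(|token| !token.is_empty())
}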

@EricLBuehler (Owner) commented:

@lucasavila00, it looks like there are some merge conflicts.

@lucasavila00 (Contributor, Author) commented:

> @lucasavila00, it looks like there are some merge conflicts.

I'm fixing it.

) -> Result<()> {
    let seqs_len = seqs.len();
-   let logits_seq = logits.chunk(seqs_len, 0).unwrap();
+   let logits_seq = logits.to_device(&Device::Cpu)?.chunk(seqs_len, 0)?;
@lucasavila00 (Contributor, Author) commented on this diff:

Now that we do a synchronization before we start sampling, we can get statistics about sampling speed.

This basically reverts #151.
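In other words, the copy to the CPU makes the device finish the forward pass before the sampler runs, so a wall-clock timer around sampling measures only sampling. A rough sketch of the idea (the candle calls mirror the diff above; the function name and timing code are illustrative):

use candle_core::{Device, Result, Tensor};
use std::time::Instant;

fn chunk_and_time(logits: &Tensor, seqs_len: usize) -> Result<Vec<Tensor>> {
    // Moving the logits to the CPU synchronizes the device, so the timer
    // below no longer includes GPU execution time from the forward pass.
    let logits_seq = logits.to_device(&Device::Cpu)?.chunk(seqs_len, 0)?;

    let start = Instant::now();
    // ... run the sampler over each chunk in logits_seq ...
    let sampling_time = start.elapsed();
    let _ = sampling_time; // e.g. accumulate into per-request statistics

    Ok(logits_seq)
}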

@lucasavila00 (Contributor, Author) commented:

Master

+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s          | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 58.712±0.644 | 17.034±0.190 |           1 |    58.712006 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 45.956±0.805 | 21.766±0.380 |           2 |      91.9128 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 29.011±0.321 | 34.473±0.388 |           4 |     116.0458 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 15.469±0.388 | 64.685±1.639 |           8 |    123.75505 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+

This PR

+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| model                              | backend | test   | t/s          | ms/t         | concurrency | throughput/s |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 58.752±0.501 | 17.022±0.147 |           1 |    58.752266 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 48.121±0.124 | 20.781±0.053 |           2 |     96.24125 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 30.030±0.020 | 33.300±0.022 |           4 |    120.12018 |
| mistralai/Mistral-7B-Instruct-v0.1 | CUDA    | tg 128 | 16.839±0.008 | 59.384±0.027 |           8 |    134.71562 |
+------------------------------------+---------+--------+--------------+--------------+-------------+--------------+

@EricLBuehler added this to the Version 0.1.0 milestone on Apr 26, 2024
@EricLBuehler (Owner) commented:

@lucasavila00, this looks good. However, I think there is one more merge conflict.

@EricLBuehler (Owner) left a review:

This looks good. I think I made some mistakes when doing the conflict resolution, so I've marked them.

Two review comments on mistralrs-core/src/pipeline/mod.rs (outdated, resolved)
@EricLBuehler modified the milestones: 0.1.0, 0.2.0 on Apr 27, 2024
@EricLBuehler (Owner) left a review:

Looks good, thank you for adding this!

@EricLBuehler (Owner) commented:

@lucasavila00, I think there are unfortunately still some merge conflicts.

@EricLBuehler mentioned this pull request on Apr 28, 2024
@lucasavila00 (Contributor, Author) commented:

@EricLBuehler, please consider squashing this PR, or let me know if I should squash it.

@EricLBuehler merged commit 73e4acf into EricLBuehler:master on Apr 28, 2024
8 of 11 checks passed
@EricLBuehler (Owner) commented:

@lucasavila00, thank you for adding this!
