Implement Speculative Decoding #242

Merged
merged 79 commits into from
May 11, 2024
e9610e4
Temp
EricLBuehler Apr 28, 2024
097b8e0
Merge branch 'master' into speculative
EricLBuehler Apr 28, 2024
63c4c86
Temp
EricLBuehler Apr 28, 2024
3871180
Merge branch 'master' into speculative
EricLBuehler Apr 30, 2024
fa01657
Merge branch 'master' into speculative
EricLBuehler Apr 30, 2024
945e985
Use arc for chat template
EricLBuehler Apr 30, 2024
462dea3
Update reset_non_granular_state
EricLBuehler Apr 30, 2024
aeb2460
Merge branch 'master' into speculative
EricLBuehler Apr 30, 2024
ec24cf2
Begin abstraction of sampling
EricLBuehler Apr 30, 2024
e7f36ca
Begin abstraction of sampling
EricLBuehler Apr 30, 2024
72b5755
Abstract sampling process
EricLBuehler May 1, 2024
4dc5db1
Abstract sampling process
EricLBuehler May 1, 2024
cc79226
Almost there!
EricLBuehler May 1, 2024
ae4877a
It compiles
EricLBuehler May 1, 2024
07caf41
Implement the rest of the todos
EricLBuehler May 1, 2024
5fb731a
Implement the rest of the todos
EricLBuehler May 1, 2024
5adef60
Use cache instructions
EricLBuehler May 1, 2024
d228757
Clippy
EricLBuehler May 1, 2024
254e405
Add to server api
EricLBuehler May 2, 2024
6bb0971
Update cache manager
EricLBuehler May 2, 2024
1bf81b2
Remove manual rt
EricLBuehler May 2, 2024
1a468bf
Partially working
EricLBuehler May 2, 2024
0da46fa
Rewrite it
EricLBuehler May 2, 2024
303eb5d
Set cache to none
EricLBuehler May 2, 2024
bf548a8
Disable prefix cache here
EricLBuehler May 2, 2024
880b0f1
Format
EricLBuehler May 2, 2024
73b0cd8
Clippy
EricLBuehler May 2, 2024
9a8d22e
Merge branch 'master' into speculative
EricLBuehler May 2, 2024
dd4fa6a
Narrow target model kv cache
EricLBuehler May 2, 2024
1e625cc
Only narrow if non prompt
EricLBuehler May 2, 2024
f2af3ef
Small fixes
EricLBuehler May 2, 2024
690cec1
Set tmp tok
EricLBuehler May 2, 2024
0d78dfb
Clamp
EricLBuehler May 2, 2024
311f48c
Merge
EricLBuehler May 2, 2024
621cf5d
Fix adding and removing the last token
EricLBuehler May 4, 2024
72e0cb5
Update
EricLBuehler May 4, 2024
50dc127
So close
EricLBuehler May 4, 2024
86e7188
Update with no kv cache for target
EricLBuehler May 4, 2024
4d40453
Got it to work!
EricLBuehler May 4, 2024
7da675b
Clippy and add kv cache
EricLBuehler May 5, 2024
a2c8684
Rewind kv cache of draft model
EricLBuehler May 5, 2024
4a0feca
Clone in and out draft cache
EricLBuehler May 5, 2024
bb41f77
Proper source and dst for cache ops
EricLBuehler May 5, 2024
6f37f47
Fix
EricLBuehler May 5, 2024
fb2b6fd
Fix
EricLBuehler May 5, 2024
94a2fb0
Fix it
EricLBuehler May 5, 2024
6033230
Fix n not accepted
EricLBuehler May 5, 2024
085fe95
Slight refactor and add new entrypt
EricLBuehler May 5, 2024
a5b6180
Refactor causal mask generation
EricLBuehler May 5, 2024
72c43a9
Fix dtypes
EricLBuehler May 5, 2024
d75722c
Fix deadlock
EricLBuehler May 5, 2024
6dd27ec
Sample with gumbel for speculative step
EricLBuehler May 6, 2024
5cbbc53
Use argmax
EricLBuehler May 6, 2024
7777648
Fix len and broadcast div
EricLBuehler May 6, 2024
8f311f2
Fix seqlen when using tmp toks
EricLBuehler May 6, 2024
f67b9e9
Narrow caches correctly
EricLBuehler May 6, 2024
7875fd8
Add toml selecter interface
EricLBuehler May 6, 2024
fdaa712
Add initial config file
EricLBuehler May 6, 2024
3bc6405
Fix deser
EricLBuehler May 6, 2024
d350121
Fix filename
EricLBuehler May 6, 2024
21e6d3e
Fixes
EricLBuehler May 6, 2024
cc2f60a
Add same gguf toml
EricLBuehler May 6, 2024
f5c9970
Merge
EricLBuehler May 8, 2024
9fe7591
Merge branch 'master' into speculative
EricLBuehler May 8, 2024
b382602
Cache last (not working yet
EricLBuehler May 9, 2024
19afd58
Merge branch 'master' into speculative
EricLBuehler May 9, 2024
482859f
Use causal masker
EricLBuehler May 9, 2024
bf8878a
Merge branch 'master' into speculative
EricLBuehler May 9, 2024
4adbd6d
Merge
EricLBuehler May 9, 2024
281040a
Merge branch 'master' into speculative
EricLBuehler May 9, 2024
2ab9d3b
Merge branch 'master' into speculative
EricLBuehler May 10, 2024
830478e
It works
EricLBuehler May 10, 2024
9d78b4e
It works
EricLBuehler May 10, 2024
8fa467b
Add speculative api to runner
EricLBuehler May 10, 2024
7b8ac2a
Docs
EricLBuehler May 11, 2024
cfc9a0c
Fix
EricLBuehler May 11, 2024
74a9d0d
Fix
EricLBuehler May 11, 2024
0b3ba2c
Fix deadlock
EricLBuehler May 11, 2024
d630c4a
More masking fixes
EricLBuehler May 11, 2024
5 changes: 2 additions & 3 deletions README.md
@@ -48,6 +48,7 @@ Mistral.rs is a fast LLM inference platform supporting inference on a variety of
**Powerful**:
- Fast LoRA support with weight merging.
- First X-LoRA inference platform with first class support.
+- Speculative Decoding: Mix supported models as the draft model or the target model


This is a demo of interactive mode with streaming running Mistral GGUF:
@@ -121,9 +122,7 @@ OpenAI API compatible API server

**Llama Index integration**

-- [Source](integrations/llama_index_integration.py).
-- [Example](examples/llama_index/xlora_gguf.py)
-- [Cookbook](examples/llama_index/cookbook.ipynb)
+- Docs: https://docs.llamaindex.ai/en/stable/examples/llm/mistral_rs/

---
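
For readers unfamiliar with the draft/target terminology in the new README bullet, here is a minimal sketch of the verification step in greedy speculative decoding. It is not the code from this PR: the function name `verify_greedy` and the plain `u32` token slices are hypothetical, chosen for illustration, and it assumes both models decode greedily (argmax) rather than sampling.

```rust
/// Greedy verification step of speculative decoding (illustrative only;
/// `verify_greedy` is a hypothetical helper, not part of mistral.rs).
///
/// `draft` holds the k tokens proposed by the small draft model.
/// `target` holds the target model's argmax choice at each of the k + 1
/// positions it scored in a single forward pass: `target[i]` is the token
/// the target model would emit after the prefix plus `draft[..i]`.
///
/// Returns the tokens to append: the longest prefix of `draft` that the
/// target agrees with, followed by exactly one token from the target.
fn verify_greedy(draft: &[u32], target: &[u32]) -> Vec<u32> {
    assert_eq!(
        target.len(),
        draft.len() + 1,
        "target must score k + 1 positions"
    );

    // Count how many leading draft tokens match the target's choices.
    let n_accepted = draft
        .iter()
        .zip(target.iter())
        .take_while(|(d, t)| d == t)
        .count();

    // Keep the accepted draft tokens...
    let mut out = draft[..n_accepted].to_vec();
    // ...then add the target's own token: a correction if a draft token was
    // rejected, or a bonus token if every draft token was accepted.
    out.push(target[n_accepted]);
    out
}

fn main() {
    let draft = [5, 9, 2, 7];
    let target = [5, 9, 4, 1, 3];
    // The first two draft tokens match; the third is replaced by the
    // target's choice, and the remaining target entries are discarded.
    println!("{:?}", verify_greedy(&draft, &target)); // prints [5, 9, 4]
}
```

Each call yields between 1 and k + 1 tokens for one target-model forward pass, which is where the speedup comes from. After a rejection, any cached state computed for the discarded draft tokens has to be rolled back before the next step.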
