# Speculative Decoding
Small models are very fast and can reach a high tokens/second rate even on consumer-grade hardware, but their output quality is not always good enough, especially for specific use cases. Larger models, by contrast, offer better quality but suffer from high latency and slow response times. Speculative Decoding is a technique for increasing the speed and reducing the latency of larger models.

The underlying idea is that many tokens are very easy to predict. Think, for example, of all the articles and connecting words that sit between facts. If a cheap model can "speculate", that is, guess the next K tokens, and the large model only has to confirm those guesses, we can potentially reduce inference time.

This works because, in standard autoregressive decoding, every new token requires a full forward pass through all the layers of the transformer. The generated token is then appended to the input sequence and the process is repeated, so generating N tokens takes N sequential forward passes. Not everything has to be recomputed each time: optimizations such as the KV cache avoid recalculating attention keys and values for the tokens already processed. Regardless of this, the key point is that a forward pass that generates a single token takes almost the same time as a forward pass that scores a short sequence of new tokens in order to "verify" it, because decoding is typically limited by memory bandwidth rather than by compute. (This is only approximately true, since verifying several tokens does more work than generating one, but let's assume it holds for simplicity.)

![spec_1.png](images/spec_1.png)

The algorithm is as follows:
- Generate the next K tokens with the Draft Model (the fast model).
- Run the target model once on the input sequence plus the K newly drafted tokens (the speculative tokens). A single forward pass yields the probability distributions at all of the new positions in parallel, which follows directly from how a forward pass works in a transformer architecture.
- The next step is Rejection Sampling, in which we decide whether to accept or reject each individual token produced by the Draft Model. Tokens are checked sequentially and each accepted token is appended to the output. As soon as a token is rejected, all subsequent drafted tokens are discarded and the loop exits. Since the target model's probability distributions up to that point have already been computed, the correct token can be sampled from them directly, and the process continues from there.
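The steps above can be sketched in a few lines of Python. This is a minimal sketch of the greedy special case only (a drafted token is accepted when it matches the target's greedy choice), not the llama.cpp implementation. The `speculative_step` and `generate` functions and the toy `target` and `draft` models are hypothetical names introduced for illustration, and the toy target is called once per position where a real transformer would obtain all predictions from a single forward pass.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int) -> List[int]:
    """Run one draft-and-verify round and return the newly produced tokens."""
    # 1. Draft k tokens autoregressively with the fast model.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(tokens + drafted))

    # 2. Ask the target what it would emit at each drafted position (and one more).
    #    A real transformer gets these k + 1 predictions from one forward pass.
    expected = [target(tokens + drafted[:i]) for i in range(k + 1)]

    # 3. Accept drafted tokens sequentially until the first mismatch.
    accepted: List[int] = []
    for i, tok in enumerate(drafted):
        if tok != expected[i]:
            break
        accepted.append(tok)

    # 4. The target's own prediction at the first unverified position is kept,
    #    so every round produces at least one token.
    accepted.append(expected[len(accepted)])
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int) -> List[int]:
    """Generate n_new tokens after the prompt using repeated speculative rounds."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_new:
        tokens += speculative_step(target, draft, tokens, k)
    return tokens[:len(prompt) + n_new]


# Toy models: the target counts upwards; the draft agrees except after tokens
# whose value is 3 modulo 4, where it skips a number.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 2 if seq[-1] % 4 == 3 else seq[-1] + 1

print(speculative_step(target, draft, [0, 1, 2], k=3))
print(generate(target, draft, [0], n_new=10, k=3))
```

Whatever the draft model proposes, the output of `generate` matches what the target would have produced on its own; the draft only affects how many target passes are needed.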

A drafted token is accepted outright if the target model assigns it a probability at least as high as the Draft Model did. If the target's probability is lower, the token is accepted only with a certain probability, which depends on the Speculative Sampling variant in use; in the most common formulation it is the ratio between the target probability and the draft probability.
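As an illustration of this acceptance rule, here is a minimal sketch assuming the commonly described variant: accept with probability min(1, p/q), where p is the target probability and q the draft probability, and on rejection resample from the normalized positive part of p - q. The function name `verify_token` and the toy distributions are hypothetical, and other variants may use a different rule.

```python
import random
from typing import Dict, Tuple


def verify_token(token: str,
                 p_target: Dict[str, float],
                 q_draft: Dict[str, float],
                 rng: random.Random) -> Tuple[str, bool]:
    """Return (token to emit, whether the drafted token was accepted)."""
    p = p_target.get(token, 0.0)
    q = q_draft[token]  # > 0, because the draft model sampled this token

    # Accept outright when p >= q, otherwise with probability p / q.
    if rng.random() < min(1.0, p / q):
        return token, True

    # Rejected: resample from the residual distribution max(0, p - q), normalized.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values())
    candidates = list(residual)
    weights = [residual[t] / total for t in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0], False


# Toy next-token distributions over a three-word vocabulary (assumed values).
p_target = {"the": 0.6, "a": 0.3, "cat": 0.1}
q_draft = {"the": 0.5, "a": 0.1, "cat": 0.4}

rng = random.Random(0)
print(verify_token("the", p_target, q_draft, rng))  # always accepted: p >= q
print(verify_token("cat", p_target, q_draft, rng))  # accepted about 25% of the time
```

With this rule the emitted tokens follow the target distribution, which is why the draft model changes speed but not output quality.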

![spec_2.png](images/spec_2.png)

I won't go into too much detail about the algorithm, but for a clear understanding of how it works, I suggest watching the following YouTube video by Efficient NLP: https://www.youtube.com/watch?v=S-8yr_RibJ4



The best-case scenario occurs when all K drafted tokens are accepted: we will have generated K tokens at the speed of the draft model plus a single forward pass of the target model.

The worst-case scenario occurs when every drafted token is rejected. In that case, the time the target model needs to generate a single token is increased by the time the draft model spent generating K tokens (plus the overhead of the verification step). This shows how important a good hit rate is (the percentage of drafted tokens that are accepted), and it also tells us that having a good Draft Model is crucial.
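To make the two extremes concrete, here is a rough cost model, a sketch under simplifying assumptions rather than a prediction: verifying K tokens is assumed to cost the same as one target forward pass, and each round is assumed to produce the accepted tokens plus one token from the target. The function `seconds_per_token` is a hypothetical helper, and the per-token times are derived from the standalone speeds measured further down in this notebook (87.02 and 2.01 tokens/second).

```python
def seconds_per_token(t_draft: float, t_target: float, k: int, accepted: int) -> float:
    """Average seconds per generated token for one round with `accepted` of k drafts kept."""
    if not 0 <= accepted <= k:
        raise ValueError("accepted must be between 0 and k")
    round_time = k * t_draft + t_target  # k draft passes plus one target pass
    produced = accepted + 1              # accepted drafts plus the target's own token
    return round_time / produced


T_DRAFT = 1 / 87.02   # seconds per token, draft model alone
T_TARGET = 1 / 2.01   # seconds per token, target model alone
K = 3

for accepted in range(K + 1):
    spt = seconds_per_token(T_DRAFT, T_TARGET, K, accepted)
    print(f"{accepted}/{K} accepted: {1 / spt:.2f} tokens/second")
```

In practice the gain is smaller than this idealized model suggests, since verification is not free, particularly when part of the target model runs on the CPU.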

## Let's start

Run this if you haven't already done so:
```
!git clone https://github.com/ggerganov/llama.cpp.git
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt
```

Idea: can we use a smaller (and faster) quantized model as the draft model for the bigger one?
Let's see.

### Download the bigger model here
Let's test using the newest Llama-3-70B-Instruct.

In [29]:
!wget https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/resolve/main/Meta-Llama-3-70B-Instruct-Q3_K_S.gguf



### Download the smaller one here

Our draft model will be a Q3_K_L version of Llama-3-8B-Instruct. For even better results, I suggest picking something even faster.
The hit rate will be crucial for the gain in performance. The greater the speed difference between the two models, the greater the potential gain. But the draft model must also be good enough to have a good hit rate.

In [2]:
!wget https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q3_K_L.gguf


## Let's test how fast the larger model is on its own

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)



In [2]:
messages = [
    {"role": "system", "content": "You are a polite chatbot who always responds the user requests"},
    {"role": "user", "content": "Make me a summary of the Napoleon life"},
]

prompt = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
)
prompt

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a polite chatbot who always responds the user requests<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nMake me a summary of the Napoleon life<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'

We offload 45 layers to the GPU.
Keep in mind that Llama-3-70B has 80 layers. Loading all 80 of them onto a single 4090 is not possible, even with the Q3_K_S quantized version. We also need to leave some VRAM free to load the draft model onto the GPU.

In [19]:
%%time
#model_path = os.getcwd() + "/Meta-Llama-3-8B-Instruct.Q8_0.gguf"
model_path = os.getcwd() + "/Meta-Llama-3-70B-Instruct-Q3_K_S.gguf"
!./llama.cpp/main -m {model_path} -n 250 --color --temp 0.0 -ngl 45 --top-k 1 -e --temp -1 -p "{prompt}"

Log start
main: build = 2697 (9958c81b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1713783078
llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from /home/nlp/grimaldian/llms-lab/pratical-llms/Meta-Llama-3-70B-Instruct-Q3_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = ..
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   5:                          llama.block_count u32     


### 2.01 tokens/second

## Now let's try the smaller model on its own

In [20]:
model_path = os.getcwd() + "/Meta-Llama-3-8B-Instruct.Q3_K_L.gguf"
!./llama.cpp/main -m {model_path} -n 250 --color --temp 0.0 -ngl 99 --top-k 1 -p "{prompt}"

Log start
main: build = 2697 (9958c81b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1713783238
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/nlp/grimaldian/llms-lab/pratical-llms/Meta-Llama-3-8B-Instruct.Q3_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32       

The smaller model is pretty fast.
### 87.02 tokens/second

## Now let's try Speculative decoding

In [17]:
larger_model_path = os.getcwd() + "/Meta-Llama-3-70B-Instruct-Q3_K_S.gguf"
smaller_model_path = os.getcwd() + "/Meta-Llama-3-8B-Instruct.Q3_K_L.gguf"

In [21]:
%%time
!./llama.cpp/speculative -ngl 45 -ngld 99 --color -m  {larger_model_path} --model-draft {smaller_model_path} -e --temp -1 -p "{prompt}" --temp 0.0 -n 250 -s 1 --top-k 1 --draft 3

Log start
llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from /home/nlp/grimaldian/llms-lab/pratical-llms/Meta-Llama-3-70B-Instruct-Q3_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = ..
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   5:                          llama.block_count u32              = 80
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - 

```
encoded   37 tokens in    4.655 seconds, speed:    7.949 t/s
decoded  254 tokens in   98.596 seconds, speed:    2.576 t/s

n_draft   = 3
n_predict = 254
n_drafted = 243
n_accept  = 172
accept    = 70.782%
```

### Speculative decoding runs at 2.576 tokens/second with 70.782% of drafted tokens accepted
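The figures in the summary can be re-derived from the raw counters, as a quick sanity check. The sketch below assumes that every round drafted exactly `n_draft` tokens, which is consistent with the counters printed above but is not stated explicitly by the tool; the variable names simply mirror the log.

```python
# Counters copied from the run summary above.
n_draft = 3            # tokens drafted per round (--draft 3)
n_predict = 254        # tokens generated in total
n_drafted = 243        # tokens proposed by the draft model
n_accept = 172         # drafted tokens accepted by the target model
decode_seconds = 98.596

accept_rate = n_accept / n_drafted          # fraction of drafts accepted
speed = n_predict / decode_seconds          # tokens per second

# Assumption: every round drafted exactly n_draft tokens.
rounds = n_drafted // n_draft
tokens_per_round = n_predict / rounds       # tokens produced per target pass

print(f"accept rate:      {accept_rate:.3%}")
print(f"speed:            {speed:.3f} t/s")
print(f"rounds:           {rounds}")
print(f"tokens per round: {tokens_per_round:.2f}")
```

Under that assumption, each target forward pass yields a little over three tokens on average instead of one.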

## Wrap up the results

The table below summarizes the three runs:

| Model | GPU layers offloaded | Tokens/second | Gain |
|-------|----------------------|---------------|------|
| Meta-Llama-3-70B-Instruct-Q3_K_S.gguf | 45/80 | 2.01 | - |
| Meta-Llama-3-8B-Instruct.Q3_K_L.gguf | 32/32 | 87.02 | - |
| Speculative Sampling | 45/80 (Model) and 32/32 (Draft Model) | 2.58 | 28% |

Speculative sampling has allowed us to sacrifice some VRAM (to load and run the draft model) to achieve a 28% increase in the number of tokens/second in output. 
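For reference, the gain figure is simply the relative increase in throughput over the target model alone. A small sketch, where `gain_percent` is a hypothetical helper name:

```python
def gain_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput increase of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0


# 2.01 t/s for the 70B model alone, 2.576 t/s with speculative decoding.
gain = gain_percent(2.01, 2.576)
print(f"{gain:.0f}%")
```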

Was it worth it? It depends on you and your scenario.

As a rule of thumb, if you have more VRAM than you need to load the Target Model, it is worth sacrificing some of it to speed up your inference with this technique.