# Speculative Decoding
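Speculative decoding speeds up generation from a large "target" model by letting a small, fast "draft" model propose several tokens ahead. The target model then checks all proposed tokens in a single batched forward pass, keeps the longest prefix it agrees with, and supplies its own token at the first mismatch. With greedy decoding (`--temp 0.0`, as used below), the output should match what the target model would produce on its own; only the speed changes.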

Run this if you haven't already cloned and built llama.cpp:
```
!git clone https://github.com/ggerganov/llama.cpp.git
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt
```

Idea: can we use a smaller (and faster) quantized model as the draft model for the bigger one?
Let's see.
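
To make the idea concrete before touching llama.cpp, here is a minimal toy sketch of greedy speculative decoding. It is not how llama.cpp implements it: the two "models" are hypothetical Python functions over integer tokens, and `speculative_generate`, `target` and `draft` are names made up for this illustration. The point is only to show the draft/verify loop and that the result matches plain greedy decoding with the target.

```python
from typing import Callable, List, Tuple

# A toy "model" maps a token sequence to its next token (greedy decoding).
NextToken = Callable[[List[int]], int]

def speculative_generate(target: NextToken, draft: NextToken, prompt: List[int],
                         n_new: int, n_draft: int = 3) -> Tuple[List[int], int]:
    """Greedy speculative decoding. Returns (tokens, number of target passes)."""
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1) The cheap draft model proposes n_draft tokens one by one.
        proposal: List[int] = []
        for _ in range(n_draft):
            proposal.append(draft(tokens + proposal))
        # 2) The target checks every proposed position. A real implementation
        #    scores them in one batched forward pass, so we count one pass.
        target_passes += 1
        accepted: List[int] = []
        for i in range(n_draft):
            expected = target(tokens + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])
            else:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
        else:
            # Every draft token was accepted: the same pass yields a bonus token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new], target_passes

# Hypothetical toy models: the target counts modulo 10, the draft is wrong after a 4.
def target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10

def draft(seq: List[int]) -> int:
    return 0 if seq[-1] % 5 == 4 else (seq[-1] + 1) % 10

out, passes = speculative_generate(target, draft, prompt=[0], n_new=12, n_draft=3)
print(out, passes)
```

In this toy run the 12 new tokens need only 4 target passes instead of 12, because most draft tokens are accepted. The `n_draft=3` default mirrors the `--draft 3` flag used later in this notebook.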

### Download the bigger model here
Let's test with the newest Llama-3-70B-Instruct.

In [29]:
!wget https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/resolve/main/Meta-Llama-3-70B-Instruct-Q3_K_S.gguf


--2024-04-19 17:17:30--  https://huggingface.co/NousResearch/Meta-Llama-3-70B-Instruct-GGUF/resolve/main/Meta-Llama-3-70B-Instruct-Q3_K_S.gguf
Resolving huggingface.co (huggingface.co)... 108.138.189.74, 108.138.189.57, 108.138.189.70, ...
Connecting to huggingface.co (huggingface.co)|108.138.189.74|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.huggingface.co/repos/2a/b0/2ab04cb3294326d82544e8b8ccdd51bdcf0b3e243e3f715a528f2fbaae0d8f47/8e6224569b0c43c15b0f75d4e03bbce38e856de623758c332d8972e9bbf9163b?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27Meta-Llama-3-70B-Instruct-Q3_K_S.gguf%3B+filename%3D%22Meta-Llama-3-70B-Instruct-Q3_K_S.gguf%22%3B&Expires=1713799050&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMzc5OTA1MH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzJhL2IwLzJhYjA0Y2IzMjk0MzI2ZDgyNTQ0ZThiOGNjZGQ1MWJkY2YwYjNlMjQzZTNmNzE1YTUyOGYyZmJ

### Download the smaller one here

Our draft model will be a Q3_K_L version of Llama-3-8B-Instruct. For even better results, I suggest picking something even faster.
The hit rate (the fraction of drafted tokens that the larger model accepts) is crucial for the performance gain. The greater the speed difference between the two models, the greater the potential gain, but the draft model must also be good enough to achieve a good hit rate.
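
The trade-off between hit rate and draft speed can be sketched with a common back-of-the-envelope model. The assumptions are mine, not measured here: every drafted token is accepted independently with probability `alpha`, `k` tokens are drafted per round, `cost_ratio` is the time of one draft step divided by the time of one target step, and verifying `k + 1` positions costs the same as one target step. That last assumption is optimistic when part of the target model runs on the CPU.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass (accepted drafts plus one)."""
    if alpha >= 1.0:
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def expected_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding with the target model alone."""
    # One round costs k draft steps plus one target pass (in target-step units).
    return expected_tokens_per_pass(alpha, k) / (k * cost_ratio + 1.0)

# Hypothetical values, for illustration only.
for alpha in (0.5, 0.8):
    print(alpha, round(expected_speedup(alpha, k=3, cost_ratio=0.05), 2))
```

With these made-up numbers, raising the hit rate from 0.5 to 0.8 moves the estimate from roughly 1.6x to roughly 2.6x, while a slower draft model (larger `cost_ratio`) pulls it back down.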

In [2]:
!wget https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q3_K_L.gguf

--2024-04-19 16:43:46--  https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q3_K_L.gguf
Resolving huggingface.co (huggingface.co)... 108.138.189.57, 108.138.189.70, 108.138.189.96, ...
Connecting to huggingface.co (huggingface.co)|108.138.189.57|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.huggingface.co/repos/79/f2/79f21025e377180e4ec0e3968bca4612bb9c99fa84e70cb7815186c42a858124/1411591a3b405ef45313e92560e7a28920114a2a11a6e7ad79a36d9b58cc0084?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27Meta-Llama-3-8B-Instruct.Q3_K_L.gguf%3B+filename%3D%22Meta-Llama-3-8B-Instruct.Q3_K_L.gguf%22%3B&Expires=1713797026&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMzc5NzAyNn19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzc5L2YyLzc5ZjIxMDI1ZTM3NzE4MGU0ZWMwZTM5NjhiY2E0NjEyYmI5Yzk5ZmE4NGU3MGNiNzgxNTE4NmM0MmE

## Let's test how fast the larger model is on its own

In [30]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import os

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [31]:
messages = [
    {"role": "system", "content": "You are a polite chatbot who always responds in Italian!"},
    {"role": "user", "content": "Make me a summary of the Napoleon life"},
]

prompt = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
)
prompt

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a polite chatbot who always responds in Italian!<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nMake me a summary of the Napoleon life<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'

We offload 45 layers to the GPU.
It is important to remember that Llama-3-70B has 80 layers (see `llama.block_count` in the loader output). Loading all 80 layers is not possible on a single 4090, even with the Q3_K_S quantized version. We also need to leave some space on the GPU for the draft model.
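
A rough way to pick a starting value for `-ngl` is sketched below. It is only an estimate under loud assumptions: every layer is treated as an equal share of the GGUF file, and the KV cache and scratch buffers are ignored, so the real limit will be lower. The function name `max_gpu_layers` and all sizes in the example are hypothetical; plug in the actual file size (for example via `os.path.getsize(model_path)`) and your own VRAM budget.

```python
GIB = 1024 ** 3

def max_gpu_layers(model_bytes: int, n_layers: int,
                   vram_bytes: int, reserved_bytes: int) -> int:
    """Rough upper bound on the number of layers that fit in VRAM."""
    per_layer = model_bytes / n_layers            # assume equally sized layers
    usable = max(vram_bytes - reserved_bytes, 0)  # keep room for the draft model
    return min(n_layers, int(usable // per_layer))

# Hypothetical sizes: a 30 GiB file with 80 layers, a 24 GiB card,
# and 6 GiB kept free for the draft model and buffers.
print(max_gpu_layers(30 * GIB, 80, 24 * GIB, 6 * GIB))
```

In practice, start from such an estimate and lower `-ngl` until both models load without out-of-memory errors.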

In [46]:
%%time
#model_path = os.getcwd() + "/Meta-Llama-3-8B-Instruct.Q8_0.gguf"
model_path = os.getcwd() + "/Meta-Llama-3-70B-Instruct-Q3_K_S.gguf"
!./llama.cpp/main -m {model_path} -n 100 --color --temp 0.0 -ngl 50 -p "{prompt}"

Log start
main: build = 2697 (9958c81b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1713772474
llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from /home/nlp/grimaldian/llms-lab/pratical-llms/Meta-Llama-3-70B-Instruct-Q3_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = ..
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   5:                          llama.block_count u32     

```
llama_print_timings:        load time =   22809.92 ms
llama_print_timings:      sample time =      11.13 ms /   100 runs   (    0.11 ms per token,  8986.34 tokens per second)
llama_print_timings: prompt eval time =    1921.27 ms /    37 tokens (   51.93 ms per token,    19.26 tokens per second)
llama_print_timings:        eval time =   47780.49 ms /    99 runs   (  482.63 ms per token,     2.07 tokens per second)
llama_print_timings:       total time =   49936.35 ms /   136 tokens
```

A total of 136 tokens in 49.9 seconds.
### 2.72 tokens/second
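
The figure above is simply total tokens divided by total time. Instead of computing it by hand for each run, a small helper can pull it out of the timing log. This sketch assumes the `llama_print_timings` format printed by the build used in this notebook; other builds may print a different format, and `total_tokens_per_second` is a name made up here.

```python
import re

# Matches e.g. "total time =   49936.35 ms /   136 tokens"
TOTAL_RE = re.compile(r"total time\s*=\s*([\d.]+)\s*ms\s*/\s*(\d+)\s*tokens")

def total_tokens_per_second(log_text: str) -> float:
    """Tokens/second from the last 'total time' line of a llama.cpp timing log."""
    matches = TOTAL_RE.findall(log_text)
    if not matches:
        raise ValueError("no 'total time' line found")
    ms, tokens = matches[-1]  # in a speculative run the last block is the target
    return int(tokens) / (float(ms) / 1000.0)

log = "llama_print_timings:       total time =   49936.35 ms /   136 tokens"
print(round(total_tokens_per_second(log), 2))  # 2.72
```

Note that this total includes the prompt tokens and the prompt evaluation time, exactly like the number reported above.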

## Now let's try the smaller model on its own

In [47]:
model_path = os.getcwd() + "/Meta-Llama-3-8B-Instruct.Q3_K_L.gguf"
!./llama.cpp/main -m {model_path} -n 100 --color --temp 0.0 -ngl 99 -ngld 100 -p "{prompt}"

Log start
main: build = 2697 (9958c81b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1713772666
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/nlp/grimaldian/llms-lab/pratical-llms/Meta-Llama-3-8B-Instruct.Q3_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32       

```
llama_print_timings:        load time =    3522.16 ms
llama_print_timings:      sample time =      10.60 ms /   100 runs   (    0.11 ms per token,  9435.74 tokens per second)
llama_print_timings: prompt eval time =      45.31 ms /    37 tokens (    1.22 ms per token,   816.60 tokens per second)
llama_print_timings:        eval time =    1162.92 ms /    99 runs   (   11.75 ms per token,    85.13 tokens per second)
llama_print_timings:       total time =    1285.67 ms /   136 tokens
```

The smaller model is pretty fast: a total of 136 tokens in 1.28 seconds.
### 105.83 tokens/second

## Now let's try Speculative decoding

In [49]:
larger_model_path = os.getcwd() + "/Meta-Llama-3-70B-Instruct-Q3_K_S.gguf"
smaller_model_path = os.getcwd() + "/Meta-Llama-3-8B-Instruct.Q3_K_L.gguf"

In [50]:
%%time
!./llama.cpp/speculative -ngl 45 -ngld 100 --color -m  {larger_model_path} --model-draft {smaller_model_path}  -p "{prompt}" --temp 0.0 -n 100 -s 1 --top-k 0 --top-p 1 --repeat-last-n 0 --repeat-penalty 1.0 --draft 3

Log start
llama_model_loader: loaded meta data with 22 key-value pairs and 723 tensors from /home/nlp/grimaldian/llms-lab/pratical-llms/Meta-Llama-3-70B-Instruct-Q3_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = ..
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 8192
llama_model_loader: - kv   5:                          llama.block_count u32              = 80
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 28672
llama_model_loader: - 

```
draft:

llama_print_timings:        load time =    3226.00 ms
llama_print_timings:      sample time =     577.99 ms /     1 runs   (  577.99 ms per token,     1.73 tokens per second)
llama_print_timings: prompt eval time =   37783.48 ms /   104 tokens (  363.30 ms per token,     2.75 tokens per second)
llama_print_timings:        eval time =    1043.41 ms /    68 runs   (   15.34 ms per token,    65.17 tokens per second)
llama_print_timings:       total time =   43304.43 ms /   172 tokens

target:

llama_print_timings:        load time =   23264.80 ms
llama_print_timings:      sample time =      10.31 ms /   101 runs   (    0.10 ms per token,  9793.46 tokens per second)
llama_print_timings: prompt eval time =   40794.61 ms /   173 tokens (  235.81 ms per token,     4.24 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   46833.24 ms /   174 tokens
```

A total of 174 tokens in 46.8 seconds.
### 3.71 tokens/second

## Wrap up the results

Speculative decoding gives about a 36% gain in total tokens/second over the 70B model alone in this setup.

| Model | GPU layers offloaded | Total tokens/second | Gain % |
|-------|----------------------|---------------------|--------|
| Meta-Llama-3-70B-Instruct-Q3_K_S.gguf | 45/80 | 2.72 | - |
| Meta-Llama-3-8B-Instruct.Q3_K_L.gguf | 32/32 | 105.83 | - |
| Speculative decoding (70B target + 8B draft) | 45/80 and 32/32 | 3.71 | 36% |
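
For completeness, the gain column is the relative increase in throughput over the 70B baseline. A tiny sketch of that arithmetic, using the two totals from the table (`gain_percent` is just an illustrative helper name):

```python
def gain_percent(baseline_tps: float, new_tps: float) -> float:
    """Relative throughput gain of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0

print(round(gain_percent(2.72, 3.71)))  # 36
```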