# Running Llama 3 (Full, 8-bit Quant, GGUF Q8)
Let's start with the original meta-llama version.

In [1]:
!nvidia-smi

Fri Apr 19 11:21:48 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.30                 Driver Version: 546.09       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0  On |                  Off |
| 31%   37C    P8              11W / 450W |    535MiB / 24564MiB |     32%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os, time

def time_it(start,end):
    nano = end-start
    return nano/1e9



## Full Model

You can run this model if your GPU has about 24 GB of VRAM (the bfloat16 weights alone take roughly 16 GB). If that is not the case, skip to the next section.
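
As a rough sanity check on that figure, the weight memory can be estimated from the parameter count and the number of bits per parameter. The sketch below assumes roughly 8.03 billion parameters for the 8B model (an approximation, not a number taken from this notebook) and ignores activations, the KV cache and CUDA overhead, which all add to the total.

```python
# Rough weight-memory estimate; ignores activations, KV cache and CUDA overhead.
# N_PARAMS is an assumed, approximate parameter count for the 8B model.
N_PARAMS = 8.03e9

def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 2**30

for name, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8)]:
    print(f"{name:>8}: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

The same helper gives a first guess for the quantized variants further down by changing `bits_per_param`.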

In [47]:
!pip install flash-attn --no-build-isolation

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting flash-attn
  Using cached flash_attn-2.5.7.tar.gz (2.5 MB)
  Preparing metadata (setup.py) ... done
Collecting einops (from flash-attn)
  Using cached einops-0.7.0-py3-none-any.whl.metadata (13 kB)
Collecting ninja (from flash-attn)
  Using cached ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Using cached einops-0.7.0-py3-none-any.whl (44 kB)
Using cached ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl (307 kB)
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... done
  Created wheel for flash-attn: filename=flash_attn-2.5.7-cp310-cp310-linux_x86_64.whl size=120853563 sha256=bbe6f77fd0899f8a125a5bdcf734b660c4c88e81c9b51c7ce98ebeba44dc6fa0
  Stored in directory: /home/nlp/.cache/pip/wheels/13/96/ed/bcac89c56b606421f99b45b16a94db5d0f2b6b4eaf8bac4d01
Successfully built flash-attn
Installing collected packages: ninja, einops, flash-attn
Successfu
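
The `huggingface/tokenizers` warning at the top of this output shows up whenever a `!` shell command forks the process after the tokenizer has already been used. As the message itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly should silence it. A minimal sketch, assuming it runs before the tokenizer is first used in the session:

```python
import os

# Must run before the tokenizer is first used in the session.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```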

#### Remember to request access to the gated repo first

In [3]:
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
device = "cuda:0"

In [26]:
from huggingface_hub import login

access_token_read = "[HF_TOKEN_HERE]"
login(token = access_token_read)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/nlp/.cache/huggingface/token
Login successful
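
Rather than pasting the token into the notebook, it can be read from an environment variable so it never ends up in a saved cell. A small sketch, assuming the token was exported as `HF_TOKEN` before starting Jupyter; `read_hf_token` is a hypothetical helper, not part of `huggingface_hub`:

```python
import os

def read_hf_token(env=os.environ, var_name="HF_TOKEN"):
    """Fetch the access token from the environment, failing loudly if missing."""
    token = env.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return token

# Usage in the cell above (not executed here):
# login(token=read_hf_token())
```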


### Run the following commands in your terminal:
```shell
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
```

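When prompted, enter your Hugging Face username and, as the password, your access token.

Alternatively, run the following cell to download the model through the Hugging Face interface directly: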

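```python
# Load in bfloat16: the default float32 weights would need about 32 GB.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(f"Model size: {model.get_memory_footprint():,} bytes")
```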

If you cloned the repo, point directly to the download path:

In [4]:
model_path = os.getcwd() + "/../Meta-Llama-3-8B-Instruct"
model_path

'/home/nlp/grimaldian/llms-lab/pratical-llms/../Meta-Llama-3-8B-Instruct'

#### Let's load it

In [6]:
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map=device,
    attn_implementation="flash_attention_2",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

Loading checkpoint shards: 100%|██████████| 4/4 [00:23<00:00,  5.92s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [7]:
messages = [
    {"role": "system", "content": "You are a polite chatbot who always responds in Italian!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
)
prompt

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a polite chatbot who always responds in Italian!<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'
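
To make the structure of that string easier to see, here is a hand-rolled approximation of what the chat template does for this example. It is only an illustration that reproduces the output above; `build_llama3_prompt` is a made-up helper, and in practice `tokenizer.apply_chat_template` should be used because it reads the template shipped with the model.

```python
# Hand-rolled approximation of the Llama 3 chat format, for illustration only.
def build_llama3_prompt(messages, add_generation_prompt=True):
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content'].strip()}<|eot_id|>")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from there.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a polite chatbot who always responds in Italian!"},
    {"role": "user", "content": "Who are you?"},
]
print(repr(build_llama3_prompt(messages)))
```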

In [8]:
%%time
max_token=100
inputs = tokenizer(prompt, return_tensors="pt").to(device)
start = time.time_ns()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=max_token)
end = time.time_ns()
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
t = time_it(start,end)
print("Seconds:",t)
# len(outputs[0]) counts prompt + generated tokens, so this rate includes the prompt
print("Token/s",len(outputs[0])/t)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


system

You are a polite chatbot who always responds in Italian!user

Who are you?assistant

Sono un'intelligenza artificiale educata per rispondere in italiano! Sono qui per aiutarti e conversare con te in modo cordiale. Come posso esserti utile oggi?assistant

Sono felice di conoscerti! Sono un'intelligenza artificiale progettata per fornire informazioni e assistenza in italiano. Posso aiutarti a rispondere alle tue domande, a
Seconds: 3.142835537
Token/s 41.68210472923643
CPU times: user 2.74 s, sys: 361 ms, total: 3.1 s
Wall time: 3.15 s
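
Notice that the model does not stop after its first answer: it emits another `assistant` header and keeps going until `max_new_tokens` runs out. A likely cause is that generation only stops on `eos_token_id` (128001 in the log), while the chat format ends each turn with `<|eot_id|>`. One way to handle this is to pass both ids as stop tokens. A minimal sketch, where `llama3_terminators` is a hypothetical helper name:

```python
def llama3_terminators(tokenizer):
    """Collect the token ids that should end an assistant turn."""
    ids = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]
    # Drop missing ids and duplicates while keeping the order stable.
    return [i for n, i in enumerate(ids) if i is not None and i not in ids[:n]]

# Usage with the objects defined earlier in this notebook (not executed here):
# outputs = model.generate(
#     **inputs,
#     max_new_tokens=max_token,
#     eos_token_id=llama3_terminators(tokenizer),
# )
```

With fewer tokens generated, the timing numbers would of course change as well.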


##### About 16 GB of VRAM. With a 24 GB GPU such as an L4 or an RTX 4090 you can run this smoothly; on the RTX 4090 used here we get about 42 tokens/s (prompt tokens included)
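
The timing cell above divides `len(outputs[0])` by the elapsed time, and `outputs[0]` holds the prompt tokens as well as the generated ones. A small sketch of separating the two, using the numbers printed above. The token counts (131 in total, 100 of them new) are inferred from the reported rate and from `max_new_tokens`; the notebook does not print them.

```python
def tokens_per_second(total_tokens: int, prompt_tokens: int, seconds: float) -> dict:
    """Report throughput with and without the prompt tokens."""
    new_tokens = total_tokens - prompt_tokens
    return {
        "including_prompt": total_tokens / seconds,
        "generated_only": new_tokens / seconds,
    }

# Inferred values: the reported rate times the elapsed time gives about 131
# tokens in total, of which max_new_tokens=100 were generated.
rates = tokens_per_second(total_tokens=131, prompt_tokens=31, seconds=3.142835537)
print(rates)

# In the notebook itself the prompt length is inputs["input_ids"].shape[1],
# so the number of new tokens is len(outputs[0]) minus that value.
```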

Before starting the next section, I suggest restarting the kernel or at least running the following cell:

In [11]:
import gc
model = None
gc.collect()
torch.cuda.empty_cache()

In [12]:
!nvidia-smi

Fri Apr 19 11:48:04 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.30                 Driver Version: 546.09       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0  On |                  Off |
|  0%   38C    P2              55W / 450W |   1367MiB / 24564MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    


## 8-bit Quantization
You will need at least 10 GB of VRAM.
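
To give an intuition for where the savings come from, here is a deliberately simplified sketch of absmax 8-bit quantization on a single row of weights. It is not the bitsandbytes implementation, which among other things treats outlier values separately; the function names are made up for illustration.

```python
def quantize_absmax_int8(row):
    """Map floats to integers in [-127, 127] using one scale per row."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in row], scale

def dequantize(q_row, scale):
    """Recover approximate floats from the int8 values and the scale."""
    return [q * scale for q in q_row]

weights = [0.12, -0.5, 0.33, 0.0, 0.49]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

# Each weight now fits in one byte; the rounding error is at most scale / 2.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Storing one byte per weight instead of two is what roughly halves the footprint compared with bfloat16.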

In [13]:
import bitsandbytes as bnb
from transformers import BitsAndBytesConfig

In [14]:
# 8-bit (LLM.int8) loading; the compute-dtype option only exists for 4-bit loading
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quantization_config,
    attn_implementation="flash_attention_2",
)
print(f"Model size: {model.get_memory_footprint():,} bytes")

`low_cpu_mem_usage` was None, now set to True since model is quantized.
Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.01s/it]

Model size: 9,215,426,560 bytes





In [15]:
!nvidia-smi

Fri Apr 19 11:48:53 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.30                 Driver Version: 546.09       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:01:00.0  On |                  Off |
|  0%   37C    P8              10W / 450W |  10028MiB / 24564MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    


#### About 9 GB for the weights: now we are good
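
The `nvidia-smi` tables before and after loading can also be compared programmatically. A minimal sketch that pulls the Memory-Usage column out of the two rows shown above; the regular expression assumes the `used MiB / total MiB` layout printed by this driver version, and `parse_memory_mib` is a hypothetical helper.

```python
import re

def parse_memory_mib(smi_line: str):
    """Return (used, total) in MiB from an nvidia-smi table row, or None."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_line)
    return (int(match.group(1)), int(match.group(2))) if match else None

# Rows copied from the nvidia-smi outputs before and after loading the 8-bit model.
before = "|  0%   38C    P2              55W / 450W |   1367MiB / 24564MiB |      1%      Default |"
after = "|  0%   37C    P8              10W / 450W |  10028MiB / 24564MiB |      7%      Default |"

used_before, total = parse_memory_mib(before)
used_after, _ = parse_memory_mib(after)
print(f"Extra memory after loading: {used_after - used_before} MiB of {total} MiB")
```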

In [16]:
messages = [
    {"role": "system", "content": "You are a polite chatbot who always responds in Italian!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
)
prompt

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a polite chatbot who always responds in Italian!<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'

In [17]:
%%time
max_token = 100
inputs = tokenizer(prompt, return_tensors="pt").to(device)
start = time.time_ns()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=max_token)
end = time.time_ns()
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
t = time_it(start,end)
print("Seconds:",t)
# len(outputs[0]) counts prompt + generated tokens, so this rate includes the prompt
print("Token/s",len(outputs[0])/t)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


system

You are a polite chatbot who always responds in Italian!user

Who are you?assistant

Sono un'intelligenza artificiale educata per rispondere in italiano! Sono qui per aiutarti e conversare con te in modo cordiale. Come posso esserti utile oggi?assistant

Sono felice di conoscerti! Sono un'intelligenza artificiale progettata per fornire informazioni e assistenza in italiano. Posso aiutarti a rispondere alle tue domande, a
Seconds: 12.09136903
Token/s 10.834174333359174
CPU times: user 10.7 s, sys: 1.38 s, total: 12.1 s
Wall time: 12.1 s


With 8-bit quantization we get about 11 tokens/second using roughly 10 GB of VRAM.

In [18]:
import gc
model = None
gc.collect()
torch.cuda.empty_cache()

## GGUF 8bit

You will need at least 10 GB of VRAM to run it fully on the GPU, but llama.cpp can also split the model between CPU and GPU.
Let's download a GGUF version that is already available on the Hugging Face Hub.
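
As a rough guide to the download size: in llama.cpp's `Q8_0` format, weights are stored in blocks of 32 signed 8-bit values plus one 16-bit scale per block, which works out to 8.5 bits per weight. The sketch below turns that into a size estimate, assuming about 8.03 billion parameters (an approximation) and ignoring metadata and any tensors kept at higher precision.

```python
BLOCK_SIZE = 32          # weights per Q8_0 block
BLOCK_BYTES = 2 + 32     # one 16-bit scale + 32 int8 values

def q8_0_size_gb(n_params: float) -> float:
    """Estimate the size of Q8_0-quantized weights in decimal gigabytes."""
    n_blocks = n_params / BLOCK_SIZE
    return n_blocks * BLOCK_BYTES / 1e9

bits_per_weight = BLOCK_BYTES * 8 / BLOCK_SIZE
print(f"{bits_per_weight} bits per weight")
print(f"~{q8_0_size_gb(8.03e9):.1f} GB for an assumed 8.03e9 parameters")
```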

In [19]:
!wget https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q8_0.gguf

--2024-04-19 11:58:15--  https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q8_0.gguf
Resolving huggingface.co (huggingface.co)... 108.138.189.57, 108.138.189.74, 108.138.189.70, ...
Connecting to huggingface.co (huggingface.co)|108.138.189.57|:443... connected.
HTTP request sent, awaiting response... 


302 Found
Location: https://cdn-lfs-us-1.huggingface.co/repos/79/f2/79f21025e377180e4ec0e3968bca4612bb9c99fa84e70cb7815186c42a858124/8c966a9ec25ba7be0f9252de4e6894dc40526b289b69525172e35087b83451e2?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27Meta-Llama-3-8B-Instruct.Q8_0.gguf%3B+filename%3D%22Meta-Llama-3-8B-Instruct.Q8_0.gguf%22%3B&Expires=1713779895&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMzc3OTg5NX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzc5L2YyLzc5ZjIxMDI1ZTM3NzE4MGU0ZWMwZTM5NjhiY2E0NjEyYmI5Yzk5ZmE4NGU3MGNiNzgxNTE4NmM0MmE4NTgxMjQvOGM5NjZhOWVjMjViYTdiZTBmOTI1MmRlNGU2ODk0ZGM0MDUyNmIyODliNjk1MjUxNzJlMzUwODdiODM0NTFlMj9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSoifV19&Signature=WXM53oolfbAQJoTfxN2MG9Q3meBLejhA54th7RmhhjVaCMapD-H4lijABEXbaBge-503c2hSG5wS%7EOiGA5LxB6SP46YBYRdIW1BEDgtR17-lSX4KoIc3YNR%7ERA4sMfIMfifxK2Cqx6ph03hwY7FykFYBwsimpBzPrPDFdcbwr%7EzTd0vAofWJPF4r91Ms38CZj1N1zSodgsoM

In [20]:
!git clone https://github.com/ggerganov/llama.cpp.git
!cd llama.cpp && git pull && make clean && LLAMA_CUBLAS=1 make
!pip install -r llama.cpp/requirements.txt

Cloning into 'llama.cpp'...



remote: Enumerating objects: 22458, done.
remote: Counting objects: 100% (6965/6965), done.
remote: Compressing objects: 100% (459/459), done.
remote: Total 22458 (delta 6744), reused 6549 (delta 6506), pack-reused 15493
Receiving objects: 100% (22458/22458), 26.54 MiB | 10.37 MiB/s, done.
Resolving deltas: 100% (15908/15908), done.



Already up to date.
I ccache not found. Consider installing it for faster compilation.
I llama.cpp build info: 
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -pthread -march=native -mtune=native -Wdouble-promotion 
I CXXFLAGS:  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -march=native -mtune=native -Wno-array-bounds -Wno-format-truncation -Wextra-semi -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG 
I NVCCFLAGS: -std=c++11 -O3 
I LDFLAGS:    
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

rm -vrf *.o tests/*.o *.so *.a *.dll benchmark-matmult lookup


Collecting numpy~=1.24.4 (from -r llama.cpp/./requirements/requirements-convert.txt (line 1))
  Downloading numpy-1.24.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting sentencepiece~=0.1.98 (from -r llama.cpp/./requirements/requirements-convert.txt (line 2))
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)
Collecting gguf>=0.1.0 (from -r llama.cpp/./requirements/requirements-convert.txt (line 4))
  Downloading gguf-0.6.0-py3-none-any.whl.metadata (3.2 kB)
Collecting protobuf<5.0.0,>=4.21.0 (from -r llama.cpp/./requirements/requirements-convert.txt (line 5))
  Using cached protobuf-4.25.3-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)
Collecting torch~=2.1.1 (from -r llama.cpp/./requirements/requirements-convert-hf-to-gguf.txt (line 2))
  Using cached torch-2.1.2-cp310-cp310-manylinux1_x86_64.whl.metadata (25 kB)
Collecting nvidia-nccl-cu12==2.18.1 (from torch~=2.1.1->-r llam

### Inference

In [32]:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [22]:
messages = [
    {"role": "system", "content": "You are a polite chatbot who always responds in Italian!"},
    {"role": "user", "content": "Who are you?"},
]

prompt = tokenizer.apply_chat_template(
        messages, 
        tokenize=False, 
        add_generation_prompt=True
)
prompt

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a polite chatbot who always responds in Italian!<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'

In [28]:
model_path = os.getcwd() + "/Meta-Llama-3-8B-Instruct.Q8_0.gguf"

In [31]:
!./llama.cpp/main -m {model_path} -n 500 --color -ngl 35 -p "{prompt}"

Log start
main: build = 2697 (9958c81b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1713522282
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/nlp/grimaldian/llms-lab/pratical-llms/Meta-Llama-3-8B-Instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = .
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                          llama.block_count u32         


llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_p

## We are now running 68 tokens/second with 10GB of memory!
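
The prompt contains newlines and characters such as `|` and `<`, which are easy to mangle when interpolated into a shell string. One alternative, sketched below, is to build the argument list in Python and hand it to `subprocess.run`. The flags are the ones used in the cell above (`-m` model file, `-n` tokens to generate, `-ngl` layers offloaded to the GPU, `-p` prompt). `build_llama_cpp_command` is a hypothetical helper, and the binary path assumes the build location used in this notebook.

```python
import subprocess

def build_llama_cpp_command(model_path, prompt, n_predict=500, gpu_layers=35,
                            binary="./llama.cpp/main"):
    """Assemble the argument list for the llama.cpp CLI without shell quoting."""
    return [
        binary,
        "-m", str(model_path),
        "-n", str(n_predict),
        "-ngl", str(gpu_layers),
        "-p", prompt,
    ]

cmd = build_llama_cpp_command(
    "Meta-Llama-3-8B-Instruct.Q8_0.gguf",
    "<|start_header_id|>user<|end_header_id|>\n\nWho are you?<|eot_id|>",
)
print(cmd)

# To actually run it (requires the compiled binary and the model file):
# subprocess.run(cmd, check=True)
```

Since the log above reports `n_layer = 32`, `-ngl 35` should be enough to offload every layer. A lower value keeps part of the model on the CPU, which is the CPU + GPU split mentioned at the start of this section.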