# Running a very large LLM on a consumer-grade GPU

In [4]:
!free && sync
!egrep "MemTotal|MemFree|Cached" /proc/meminfo

              total        used        free      shared  buff/cache   available
Mem:       32387632     2673296     2707528        1236    27006808    29235828
Swap:             0           0           0
MemTotal:       32387632 kB
MemFree:         2706968 kB
Cached:         26426084 kB
SwapCached:            0 kB


In [9]:
import psutil

def memory():
    # Take a single snapshot so all three figures describe the same instant.
    memory_usage = psutil.virtual_memory()
    print(f"Total Memory:{memory_usage.total/(1024 * 1024)} MB")
    print(f"Available Memory:{memory_usage.available/(1024 * 1024)} MB")
    print(f"Memory Usage: {memory_usage.percent}%")
memory()

Total Memory:31628.546875 MB
Available Memory:27564.16015625 MB
Memory Usage: 12.9%


In [10]:
!nvidia-smi

Tue Apr 16 13:33:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   22C    P8               9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

On this machine I have a T4 with 15 GB of VRAM and 32 GB of RAM (about 25 GB of which seems to be usable), for a nominal total of 47 GB of memory.
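
The same budget can be computed instead of read off by eye. The sketch below is only an illustration: `parse_gpu_memory_mib`, `query_gpu_memory_mib` and `memory_budget_gib` are hypothetical helper names, and it assumes `nvidia-smi` is queried with `--query-gpu=memory.total,memory.used --format=csv,noheader,nounits`, which prints one `total, used` line per GPU in MiB. The RAM figure is the "Available Memory" value printed by psutil above.

```python
import subprocess

def parse_gpu_memory_mib(csv_line: str) -> tuple[int, int]:
    """Parse one 'total, used' line (in MiB) from nvidia-smi's CSV output."""
    total, used = (int(field.strip()) for field in csv_line.split(","))
    return total, used

def query_gpu_memory_mib() -> tuple[int, int]:
    """Ask nvidia-smi for total/used memory of the first GPU (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory_mib(out.splitlines()[0])

def memory_budget_gib(vram_free_mib: float, ram_available_mib: float) -> float:
    """Combined free VRAM + available RAM, in GiB."""
    return (vram_free_mib + ram_available_mib) / 1024

# Values copied from the nvidia-smi and psutil outputs above;
# on a GPU machine, query_gpu_memory_mib() would return the same pair.
total_mib, used_mib = parse_gpu_memory_mib("15360, 0")
budget = memory_budget_gib(total_mib - used_mib, 27564.16)
print(f"Budget: {budget:.1f} GiB")
```

This is an upper bound: the OS, the notebook kernel and the CUDA context all take their share, which is why a more conservative RAM figure is a safer planning number.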
### Let's choose something from the Chatbot Arena that fits in this memory and squeeze it in

I like Qwen/Qwen1.5-32B-Chat, which seems to be pretty good for its size. Someone has kindly quantized it to GGUF.

In [11]:
model_id = "Qwen/Qwen1.5-32B-Chat-GGUF"
base_model = "Qwen/Qwen1.5-32B-Chat" ## The model from which this one was quantized

How can I know how much memory this model requires in its quantized versions?

In [12]:
def calc_model_size(parameters: float, quant: float) -> float:
    # Weights-only size in bytes: parameters * bits per weight / 8 bits per byte.
    return parameters * quant // 8

def calc_compute_buffer_size(model_config, context: int) -> float:
    return (
        (context / 1024 * 2 + 0.75) * model_config["num_attention_heads"] * 1024 * 1024
    )

def calc_input_buffer_size(model_config, context: int) -> float:
    return 4096 + 2048 * model_config["hidden_size"] + context * 4 + context * 2048

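def calc_context_size(model_config, context: int) -> float:
    # KV cache size in bytes, accounting for grouped-query attention (GQA).
    n_gqa = model_config["num_attention_heads"] / model_config["num_key_value_heads"]
    n_embd_gqa = model_config["hidden_size"] / n_gqa  # K (or V) width per token per layer
    n_elements = n_embd_gqa * (model_config["num_hidden_layers"] * context)
    return 2 * n_elements * 2  # keys and values, 2 bytes (fp16) per element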

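def get_model_config(model_config, model_size):
    # Return a copy of the config with the parameter count added
    # (model_size is in bytes; the weights use 2 bytes per parameter).
    config = dict(model_config)
    config["parameters"] = model_size / 2
    return config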

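def calc(model_config, model_size, context, quant_size):
    # model_size is the checkpoint size in bytes; bf16 weights use 2 bytes per parameter.
    model_config['parameters'] = model_size / 2
    try:
        # quant_size may be given directly as bits per weight...
        quant_bpw = float(quant_size)
    except (TypeError, ValueError):
        # ...or as a quantization name such as "Q5_K_M", looked up in the `quants` map.
        quant_bpw = quants[quant_size]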

    model_size = round(
        calc_model_size(model_config["parameters"], quant_bpw) / 1000 / 1000 / 1000, 2
    )
    context_size = round(
        (
            calc_input_buffer_size(model_config, context)
            + calc_context_size(model_config, context)
            + calc_compute_buffer_size(model_config, context)
        )
        / 1000
        / 1000
        / 1000,
        2,
    )

    return model_size, context_size, round(model_size + context_size, 2)

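The following are the config.json taken from the official HF repo of Qwen/Qwen1.5-32B-Chat and the metadata from model.safetensors.index.json (or pytorch_model.bin.index.json).
From the index file we only need "total_size", the size of the weights in bytes. Since the weights are stored in bfloat16 (2 bytes per parameter), half of that value is the parameter count.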

In [13]:
import json
config = json.loads("""{
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 27392,
  "max_position_embeddings": 32768,
  "max_window_layers": 35,
  "model_type": "qwen2",
  "num_attention_heads": 40,
  "num_hidden_layers": 64,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.37.2",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}""")

metadata = json.loads("""{
  "metadata": {
    "total_size": 65024436224
  }}""")
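
As a quick sanity check on that "total_size", here is a minimal sketch of how it maps to a parameter count. `BYTES_PER_PARAM` and `estimate_parameters` are names made up for this example; the only facts used are the byte widths of the dtypes and the two values from the JSON above.

```python
# Bytes used to store one parameter for a few common checkpoint dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

def estimate_parameters(total_size_bytes: int, torch_dtype: str) -> int:
    """Estimate the parameter count from the checkpoint size in bytes."""
    return total_size_bytes // BYTES_PER_PARAM[torch_dtype]

# "total_size" and "torch_dtype" as they appear in the JSON above.
n_params = estimate_parameters(65024436224, "bfloat16")
print(f"{n_params / 1e9:.2f}B parameters")
```

The result is about 32.5 billion, in line with the "32B" in the model name.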



For example, let's assume we want the Q5_K_M version. The map below gives the approximate number of bits used per parameter for each quantization type.

In [14]:
quants = {
    "Q2_K": 3.35,
    "Q3_K_S": 3.5,
    "Q3_K_M": 3.91,
    "Q3_K_L": 4.27,
    "Q4_0": 4.55,
    "Q4_K_S": 4.58,
    "Q4_K_M": 4.85,
    "Q5_0": 5.54,
    "Q5_K_S": 5.54,
    "Q5_K_M": 5.69,
    "Q6_K": 6.59,
    "Q8_0": 8.5,
}
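
To get a feel for what these bits-per-weight figures mean for this model, the sketch below turns a few of them into weights-only sizes. It is a simplified restatement of the model-size part of the estimate (parameters times bits, divided by 8), not a replacement for it; `N_PARAMS`, `BITS_PER_WEIGHT` and `quantized_size_gb` are names chosen for this example.

```python
N_PARAMS = 32_512_218_112  # total_size / 2 bytes per bfloat16 parameter

# A few entries copied from the bits-per-weight map above.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5}

def quantized_size_gb(n_params: int, bits_per_weight: float) -> float:
    """Weights-only size in decimal GB: parameters * bits / 8 bits per byte."""
    return round(n_params * bits_per_weight / 8 / 1e9, 2)

sizes = {name: quantized_size_gb(N_PARAMS, bpw) for name, bpw in BITS_PER_WEIGHT.items()}
for name, size in sizes.items():
    print(f"{name}: {size} GB")
```

These numbers cover the weights only; the context (KV cache and buffers) comes on top.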


In [15]:
context_size = 16000

model_size_gb, context_size_gb, total_size_gb = calc(model_config=config,model_size=metadata['metadata']['total_size'],context=context_size,quant_size=quants['Q5_K_M'])
print(f"Model Size GB:{model_size_gb}")
print(f"Context Size GB:{context_size_gb}")
print(f"Total Size GB:{total_size_gb}")

Model Size GB:23.12
Context Size GB:5.58
Total Size GB:28.7


It seems that we need about 29 GB of memory (RAM plus VRAM) to run the Q5_K_M version with a 16k context length. Let's verify it.
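
Most of the context figure is the KV cache. The sketch below recomputes just that term by hand, using the values from the config.json shown earlier; it assumes an fp16 cache (2 bytes per element), which is also what the estimate above assumes. The constant names are chosen for this example.

```python
# Values from the Qwen1.5-32B-Chat config.json shown earlier.
HIDDEN_SIZE = 5120
NUM_ATTENTION_HEADS = 40
NUM_KEY_VALUE_HEADS = 8
NUM_HIDDEN_LAYERS = 64
CONTEXT = 16000
BYTES_PER_ELEMENT = 2  # assumption: fp16 KV cache

head_dim = HIDDEN_SIZE // NUM_ATTENTION_HEADS  # size of one attention head
kv_width = head_dim * NUM_KEY_VALUE_HEADS      # K (or V) values per token per layer

# Keys and values are both cached, hence the leading factor 2.
kv_cache_bytes = 2 * NUM_HIDDEN_LAYERS * CONTEXT * kv_width * BYTES_PER_ELEMENT
print(f"KV cache: {kv_cache_bytes / 1e9:.2f} GB")
```

Because the model uses grouped-query attention (8 key/value heads for 40 attention heads), the cache is five times smaller than it would be with one key/value head per attention head.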

In [6]:
import os
os.chdir(os.getcwd()+"/pratical-llms")

In [2]:
!wget https://huggingface.co/Qwen/Qwen1.5-32B-Chat-GGUF/resolve/main/qwen1_5-32b-chat-q5_k_m.gguf

--2024-04-16 13:05:08--  https://huggingface.co/Qwen/Qwen1.5-32B-Chat-GGUF/resolve/main/qwen1_5-32b-chat-q5_k_m.gguf?download=true
Resolving huggingface.co (huggingface.co)... 18.154.227.67, 18.154.227.87, 18.154.227.7, ...
Connecting to huggingface.co (huggingface.co)|18.154.227.67|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://cdn-lfs-us-1.huggingface.co/repos/96/88/96886dc2b365648803a6c4c31290ebe172329dddf1699fca4b5fd6f0a9868aca/9f7f066e1ef9453f0d38f08af0795af6022caf81bb1abad38893d79b3c373b4c?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27qwen1_5-32b-chat-q5_k_m.gguf%3B+filename%3D%22qwen1_5-32b-chat-q5_k_m.gguf%22%3B&Expires=1713531908&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcxMzUzMTkwOH19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzk2Lzg4Lzk2ODg2ZGMyYjM2NTY0ODgwM2E2YzRjMzEyOTBlYmUxNzIzMjlkZGRmMTY5OWZjYTRiNWZkNmYwYTk4NjhhY2EvOWY3ZjA2NmUxZWY5NDUzZj

In [8]:
model_path = os.getcwd()+"/qwen1_5-32b-chat-q5_k_m.gguf"
model_path

'/teamspace/studios/this_studio/pratical-llms/qwen1_5-32b-chat-q5_k_m.gguf'

In [26]:
from transformers import AutoTokenizer



In [28]:
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-32B-Chat")

prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
prompt

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nGive me a short introduction to large language model.<|im_end|>\n<|im_start|>assistant\n'
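
The string above follows the ChatML layout. As an illustration of what the template does, here is a hand-rolled sketch that reproduces it for this simple case. `to_chatml` is a hypothetical helper, not part of transformers, and the tokenizer's own chat template remains the authoritative source (it may handle cases this sketch ignores).

```python
def to_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render chat messages in the ChatML layout seen in the output above."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me a short introduction to large language model."},
]
chatml_prompt = to_chatml(messages)
print(chatml_prompt)
```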

In [29]:
!./llama.cpp/main -m {model_path} -n 500 --color -ngl 35 -p "{prompt}"

Log start
main: build = 2675 (17e98d4c)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed  = 1713275233
llama_model_loader: loaded meta data with 20 key-value pairs and 771 tensors from /teamspace/studios/this_studio/pratical-llms/qwen1_5-32b-chat-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = Qwen2-beta-32B-Chat-AWQ-fp16
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 64
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv   5:                  qwen2.feed_forward_

llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  16:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  18:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q5_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type   

## Great, it works!
Here is the output:

assistant

Large language models are artificial intelligence systems designed to process and generate human-like language at an unprecedented scale. These models are built using deep learning techniques, specifically deep neural networks, and are trained on massive amounts of text data, often spanning billions or trillions of words. By learning patterns and relationships within this data, they acquire the ability to understand and generate text across a wide range of topics and styles.

The key characteristic of large language models is their size, which refers to the number of parameters they have learned during training. This vast parameter count enables them to capture intricate language structures and nuances, allowing them to perform a variety of natural language processing tasks, such as language translation, question-answering, text summarization, sentiment analysis, and even creative writing.

Prominent examples of large language models include GPT (Generative Pre-trained Transformer), created by OpenAI, and BERT (Bidirectional Encoder Representations from Transformers), developed by Google. These models have pushed the boundaries of what is possible with AI-generated language, achieving remarkable results in terms of fluency, coherence, and context awareness. However, they also present challenges, such as the potential for bias in their training data and the ethical considerations around their use. [end of text]

llama_print_timings:        load time =   59691.71 ms
llama_print_timings:      sample time =      34.31 ms /   247 runs   (    0.14 ms per token,  7198.23 tokens per second)
llama_print_timings: prompt eval time =   12905.83 ms /    29 tokens (  445.03 ms per token,     2.25 tokens per second)
llama_print_timings:        eval time =  157831.47 ms /   246 runs   (  641.59 ms per token,     1.56 tokens per second)
llama_print_timings:       total time =  171284.84 ms /   275 tokens
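
The throughput figures can be recomputed from the raw timings. The sketch below parses a `llama_print_timings` line in the format shown above and derives tokens per second; the regular expression and the `tokens_per_second` helper are assumptions written for this specific log format and may need adjusting for other llama.cpp builds.

```python
import re

# Matches lines such as "llama_print_timings: eval time = 157831.47 ms / 246 runs (...)".
TIMING_LINE = re.compile(
    r"llama_print_timings:\s+(?P<stage>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms /\s+"
    r"(?P<count>\d+) (?:runs|tokens)"
)

def tokens_per_second(line: str) -> tuple[str, float]:
    """Return (stage, tokens per second) for one timing line that has a count."""
    match = TIMING_LINE.search(line)
    if match is None:
        raise ValueError(f"not a timing line with a count: {line!r}")
    seconds = float(match["ms"]) / 1000
    return match["stage"], int(match["count"]) / seconds

# The eval line copied from the output above.
eval_line = ("llama_print_timings:        eval time =  157831.47 ms /   246 runs"
             "   (  641.59 ms per token,     1.56 tokens per second)")
stage, rate = tokens_per_second(eval_line)
print(f"{stage}: {rate:.2f} tokens/s")
```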

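#### 1.56 tokens/s is not that fast, but it works.
Let's look more closely at RAM and VRAM usage. The model has 64 transformer layers, which llama.cpp counts as 65 offloadable layers once the output layer is included. Let's offload as many as we can while it still runs.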

In [9]:
!./llama.cpp/server --threads 1 --threads-batch 1 --threads-http 1 -m {model_path} -n 16000 --batch-size 1 -ngl 45 --main-gpu 0 --port 8080

{"tid":"139882769641472","timestamp":1713276081,"level":"INFO","function":"main","line":2921,"msg":"build info","build":2675,"commit":"17e98d4c"}
{"tid":"139882769641472","timestamp":1713276081,"level":"INFO","function":"main","line":2926,"msg":"system info","n_threads":1,"n_threads_batch":1,"total_threads":8,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | "}
llama_model_loader: loaded meta data with 20 key-value pairs and 771 tensors from /teamspace/studios/this_studio/pratical-llms/qwen1_5-32b-chat-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.n

llama_model_loader: - kv  14:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  15:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  16:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  18:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  19:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q5_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
llm_load_vocab: special tokens definition check successful ( 421/152064 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type   
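
Once the server is up, it can be queried over HTTP. The following is a minimal sketch using only the standard library; it assumes the server's `/completion` endpoint, which accepts a JSON body with `prompt` and `n_predict` and returns the generated text under `content`. The port matches the `--port 8080` used above, and `build_completion_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request

def build_completion_request(prompt: str, n_predict: int = 128,
                             base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build (but do not send) a POST request for the /completion endpoint."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt: str, n_predict: int = 128) -> str:
    """Send the request to a running server and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt, n_predict)) as response:
        return json.loads(response.read())["content"]

# Example (requires the server started above to be running):
# print(complete("<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n", n_predict=32))
```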

As you can see from the following images, we are now using almost 15 GB of GPU VRAM and 13 GB of RAM, for a total of 28 GB, which is consistent with the estimate.

![vram_gpu.png](images/vram_gpu.png)
![mem_before.png](images/before_mem.png)
![mem_after.png](images/mem_after.png)
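
As a rough cross-check of the GPU side of that split, the sketch below divides the weights evenly across the 65 offloadable layers mentioned above and counts the 45 placed on the GPU with `-ngl 45`. The even split is a simplifying assumption (the output layer does not have the same size as a transformer block), and `estimate_split_gb` is a name made up for this example.

```python
def estimate_split_gb(model_gb: float, total_layers: int, gpu_layers: int) -> tuple[float, float]:
    """Split the weights between GPU and CPU, assuming every layer has the same size."""
    per_layer = model_gb / total_layers
    on_gpu = round(per_layer * gpu_layers, 2)
    return on_gpu, round(model_gb - on_gpu, 2)

# 23.12 GB is the weights-only estimate for Q5_K_M computed earlier.
gpu_gb, cpu_gb = estimate_split_gb(model_gb=23.12, total_layers=65, gpu_layers=45)
print(f"GPU: {gpu_gb} GB, CPU: {cpu_gb} GB")
```

The GPU share comes out at roughly 16 decimal GB, which is about 14.9 GiB and so close to the T4's 15360 MiB limit. The rest of the weights stay in RAM, together with whatever buffers and caches are not placed on the GPU.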
