# TODO: Make headers/sections

In this blog post I write about some things I learned recently at the [LLM conference](https://maven.com/parlance-labs/fine-tuning?utm_campaign=2848bd&utm_medium=partner&utm_source=instructor0) I took part in.
Some of the details come from [Jonathan Whitaker's](https://x.com/johnowhitaker) [talk](https://x.com/HamelHusain/status/1798353336145674483).
 



You have probably heard the saying, "In the computer it's all 0's and 1's". Well, those 0's and 1's are called **bits**. A **bit** is the smallest
unit of storage and simply stores a 0 or a 1. A **byte** is a group of 8 **bits** together.

- 1 **byte** **=** 8 **bits**.
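
To make this concrete, here is a small Python sketch (the variable names are only illustrative). It shows that 8 bits give 2**8 = 256 distinct values and what a single byte looks like when written out as bits.

```python
# A byte is 8 bits, so it can represent 2**8 = 256 distinct values (0..255).
BITS_PER_BYTE = 8
values_per_byte = 2 ** BITS_PER_BYTE

# Write the number 5 as the 8 bits of a single byte.
five_as_bits = format(5, "08b")

# Python's bytes type stores raw bytes; each element is an integer in 0..255.
raw = bytes([5, 255])
total_bits = len(raw) * BITS_PER_BYTE

print(values_per_byte)  # 256
print(five_as_bits)     # 00000101
print(total_bits)       # 16
```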

All storage is measured in **bytes**. You're probably very familiar with these units:

| Unit      | Abbreviation | Approximate Size        |
|-----------|--------------|-------------------------|
| Kilobyte  | KB           | about 1 thousand bytes  |
| Megabyte  | MB           | about 1 million bytes   |
| Gigabyte  | GB           | about 1 billion bytes   |
| Terabyte  | TB           | about 1 trillion bytes  |
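
The "about" in the table matters in practice. Later in this post the PyTorch byte counts are divided by `1e9` (decimal gigabytes), while `nvidia-smi` prints MiB (binary units of 2**20 bytes). Here is a minimal sketch of the conversion; the helper name `mib_to_gb` is made up for illustration.

```python
# Decimal (SI) units are powers of 1000; binary units are powers of 1024.
GB = 1000 ** 3   # gigabyte
MiB = 1024 ** 2  # mebibyte, the unit nvidia-smi prints
GiB = 1024 ** 3  # gibibyte


def mib_to_gb(mib):
    """Convert a size in MiB to decimal gigabytes."""
    return mib * MiB / GB


print(GiB)                        # 1073741824
print(round(mib_to_gb(4475), 2))  # 4.69
```

Keep in mind that `nvidia-smi` also counts GPU memory that is not held by PyTorch tensors, such as the CUDA context, so its number is expected to be somewhat higher than what `torch.cuda.max_memory_allocated` reports.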


This [Hugging Face blog post](https://huggingface.co/blog/hf-bitsandbytes-integration) explains the common data types used in machine learning: `float16`, `float32`, `bfloat16`, and `int8`.
This [article](https://huggingface.co/blog/4bit-transformers-bitsandbytes) then goes further into `bitsandbytes`, quantization, and `QLoRA`. `bitsandbytes` is a Python library that is used heavily for quantizing LLMs.




In [1]:
# | warning: false
import torch
import gc
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

device = "cuda"


def cleanup():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats(device)


def print_memory_stats():
    """Print two different measures of GPU memory usage"""
    print(f"Max memory allocated: {torch.cuda.max_memory_allocated(device)/1e9:.2f}GB")
    # reserved (aka 'max_memory_cached') is ~the allocated memory plus pre-cached memory
    print(f"Max memory reserved: {torch.cuda.max_memory_reserved(device)/1e9:.2f}GB")


print_memory_stats()

cleanup()

Max memory allocated: 0.00GB
Max memory reserved: 0.00GB


Let's load the model `TinyLlama/TinyLlama-1.1B-Chat-v1.0`.
By default, it will load in `float32`. 

In [2]:
# | output: false
model_ckpt = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
model = AutoModelForCausalLM.from_pretrained(model_ckpt, device_map=device)
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

We can check the `dtype` of the parameters and indeed see that they are stored in `float32`.

In [3]:
# | warning: false
set([(x.dtype) for x in model.parameters()])

{torch.float32}

Since the parameters are in `float32`, we can estimate that each parameter takes up 32/8 = 4 bytes of memory.
For a 1.1B parameter model that is about 4.4GB. Let's see if our rough back-of-the-napkin calculation is correct.
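
The same estimate can be written as a tiny helper. This is only a sketch: `estimate_weight_memory_gb` is a made-up name, 1.1e9 is an approximate parameter count, and the function counts the weights alone, ignoring activations, the CUDA context, and any other overhead.

```python
def estimate_weight_memory_gb(n_params, bits_per_param):
    """Rough memory needed just to hold the weights, in decimal GB."""
    bytes_per_param = bits_per_param / 8
    return n_params * bytes_per_param / 1e9


n_params = 1.1e9  # approximate parameter count of TinyLlama-1.1B (assumption)

for name, bits in [("float32", 32), ("bfloat16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>8}: {estimate_weight_memory_gb(n_params, bits):.2f} GB")
# float32 -> 4.40 GB, bfloat16 -> 2.20 GB, int8 -> 1.10 GB, 4-bit -> 0.55 GB
```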

In [4]:
# | warning: false
!nvidia-smi

Fri Jun 21 11:16:19 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA RTX A6000               Off | 00000000:1A:00.0 Off |                  Off |
| 30%   33C    P2              66W / 300W |   4475MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                         

In [5]:
# | warning: false
print_memory_stats()

Max memory allocated: 4.40GB
Max memory reserved: 4.40GB


Yes, PyTorch reports 4.40GB, which matches our estimate.

Let's run some inference with the model.

In [6]:
# | warning: false
def inference(messages):
    # Format the messages with the model's chat template and tokenize them
    tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
    # Greedy decoding (do_sample=False) keeps the outputs comparable across precisions
    with torch.no_grad():
        outputs = model.generate(input_ids=tokenized_chat.to(device), max_new_tokens=128, do_sample=False)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0])


inference([{"role": "user", "content": "How many bytes are in one gigabyte?"}])

<|user|>
How many bytes are in one gigabyte? 
<|assistant|>
Yes, there are 1,000,000,000 bytes in a gigabyte (GB).


Now let's load the model in a lower precision.
The model config tells us which precision to use.

In [7]:
# | warning: false
model.config.torch_dtype

torch.bfloat16

In [8]:
# | warning: false
del model
cleanup()
print_memory_stats()
!nvidia-smi

Max memory allocated: 0.01GB
Max memory reserved: 0.02GB



Fri Jun 21 11:16:21 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA RTX A6000               Off | 00000000:1A:00.0 Off |                  Off |
| 30%   34C    P2              89W / 300W |    351MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                         

Now let's load the model in `bfloat16`. We estimate that each parameter will use 16/8 = 2 bytes of memory.
For the same model, that is roughly half the memory as before, about 2.2GB.
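
Where do the 2 bytes come from? `bfloat16` keeps the sign bit and the 8 exponent bits of `float32` but only 7 of its 23 mantissa bits, so it is effectively the top half of a `float32`. The sketch below illustrates this with the standard library only. The helper names are made up, and it truncates the low bits for simplicity, whereas real conversions typically round.

```python
import struct


def float32_bits(x):
    """Return the 32-bit IEEE 754 pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16_bits(x):
    """Keep only the top 16 bits: sign, 8 exponent bits, 7 mantissa bits.
    Truncation is a simplification of what real libraries do."""
    return float32_bits(x) >> 16


def from_bfloat16_bits(b):
    """Expand 16 bfloat16 bits back to a float by zero-padding the mantissa."""
    return struct.unpack(">f", struct.pack(">I", b << 16))[0]


x = 3.14159
b = to_bfloat16_bits(x)
print(format(b, "016b"))      # 0100000001001001
print(from_bfloat16_bits(b))  # 3.140625
```

The value survives with about two to three significant decimal digits, while the exponent range stays the same as in `float32`.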

In [9]:
# | output: false
model = AutoModelForCausalLM.from_pretrained(model_ckpt, torch_dtype=torch.bfloat16, device_map=device)

In [10]:
# | warning: false
print_memory_stats()

Max memory allocated: 2.21GB
Max memory reserved: 2.32GB


In [11]:
# | warning: false
set([(x.dtype) for x in model.parameters()])

{torch.bfloat16}

In [12]:
# | warning: false
inference([{"role": "user", "content": "How many bytes are in one gigabyte?"}])

<|user|>
How many bytes are in one gigabyte? 
<|assistant|>
Yes, there are 1,000,000,000 bytes in a gigabyte (GB).


This is exactly the same output we got before. Since most models are currently trained using `bfloat16`, there is no need to load them in full `float32` precision. In this example, using `float32` does not improve the inference results compared to `bfloat16`.

In [13]:
# | warning: false
del model
cleanup()
print_memory_stats()

Max memory allocated: 0.01GB
Max memory reserved: 0.02GB


Now let's try loading a quantized model using `BitsAndBytesConfig`.
Here we specify that the model should be loaded in 4bit and that the computations should be done in bf16.
The model weights are stored in 4bit precision, but *the computation happens in a higher precision*.
During inference, as well as training, the weights of the model are constantly being dequantized (from 4bit to bf16).
This can be done for specific layers at a time during the forward and backward passes, to keep memory requirements low.

Since we are using 4bit, we expect each parameter to use 4/8 = 0.5 bytes, so we should be using less than a GB of memory.
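
Half a byte per parameter simply means that two 4-bit values share one byte. The sketch below shows that storage arithmetic in plain Python. It is only an illustration with made-up helper names, not the actual memory layout used by `bitsandbytes`.

```python
def pack_4bit(values):
    """Pack integers in 0..15 two per byte (high nibble first)."""
    values = list(values)
    if len(values) % 2:
        values.append(0)  # pad to an even count
    return bytes((hi << 4) | lo for hi, lo in zip(values[0::2], values[1::2]))


def unpack_4bit(packed, count):
    """Recover `count` 4-bit integers from the packed bytes."""
    out = []
    for byte in packed:
        out.extend([byte >> 4, byte & 0x0F])
    return out[:count]


codes = [3, 15, 0, 9, 7, 1]
packed = pack_4bit(codes)
print(len(codes), "values ->", len(packed), "bytes")  # 6 values -> 3 bytes
print(unpack_4bit(packed, len(codes)) == codes)        # True
```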

In [14]:
# | warning: false
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=bnb_config,
    device_map='auto'
)



In [15]:
print_memory_stats()

Max memory allocated: 0.84GB
Max memory reserved: 0.89GB


Now we expect the model inference results to be different:

In [16]:
inference([{"role": "user", "content": "How many bytes are in one gigabyte?"}])

<|user|>
How many bytes are in one gigabyte? 
<|assistant|>
Yes, one gigabyte (GB) is equal to 1,073,741,824 bytes. A byte is a unit of information storage in the binary system, which is the basis for digital computing. In binary, each byte has a value of 10, with each bit representing a single binary digit. So, one gigabyte is equivalent to 1,073,741,824 bytes, which is approximately 1,000,000,000 bytes.


You can experiment with different types of quantization.
In this next example we load the model using:

- NF4 quantization
- `bnb_4bit_use_double_quant`, which applies a second quantization to the quantization constants from the first one
- bfloat16 for computation
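
To get a feel for what double quantization saves, here is the back-of-the-envelope arithmetic described in the QLoRA paper, written as a sketch. The block sizes (64 weights per block, 256 constants per second-level block) and the 32-bit and 8-bit constant widths are assumptions taken from that description; they are not verified against the `bitsandbytes` defaults used in this post.

```python
BLOCK_SIZE = 64          # weights sharing one quantization constant (assumed)
SECOND_BLOCK_SIZE = 256  # constants sharing one second-level constant (assumed)

# Without double quantization: one 32-bit constant per block of weights.
overhead_single = 32 / BLOCK_SIZE

# With double quantization: 8-bit constants, plus one 32-bit constant
# for every SECOND_BLOCK_SIZE of those.
overhead_double = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_BLOCK_SIZE)

saved_bits = overhead_single - overhead_double
n_params = 1.1e9  # approximate parameter count (assumption)
saved_gb = n_params * saved_bits / 8 / 1e9

print(f"{overhead_single:.3f} bits/param")  # 0.500 bits/param
print(f"{overhead_double:.3f} bits/param")  # 0.127 bits/param
print(f"saves about {saved_gb:.3f} GB")     # saves about 0.051 GB
```

That is the same order of magnitude as the drop in allocated memory measured in this post (0.84GB to 0.80GB), although the two numbers are not expected to match exactly.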

In [17]:
del model
cleanup()
print_memory_stats()

Max memory allocated: 0.01GB
Max memory reserved: 0.02GB


In [18]:
# | warning: false
nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=nf4_config,
    device_map='auto'
)

In [19]:
print_memory_stats()

Max memory allocated: 0.80GB
Max memory reserved: 0.84GB


In [20]:
inference([{"role": "user", "content": "How many bytes are in one gigabyte?"}])

<|user|>
How many bytes are in one gigabyte? 
<|assistant|>
Yes, I can provide you with the answer to your question. A gigabyte (GB) is a unit of measurement for data storage. It is equal to 1,000 bytes. So, 1 GB is equal to 1,000,000,000 bytes.


I really like the high-level explanation of quantization from this [post](https://huggingface.co/blog/optimize-llm) by Patrick von Platen.
In general, when running inference with quantized models the steps are:

- Quantize all the weights of the model (for example to 4bit) and load it.
- Pass the input sequence through the model in bf16.
- Dynamically dequantize the weights to bf16, layer by layer, during the forward pass.
- Quantize the weights back to 4bit after the computation.

So if we want to do $Y = X W$ where $W$ and $X$ are the weights and input sequence respectively, then for each matrix multiplication we do:

$Y = X \cdot \text{dequantize}(W)$ ; $\text{quantize}(W);$
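
A toy version of this round trip can be sketched in plain Python. It uses a simple absmax scheme with signed 4-bit integer codes, which is an assumption made for illustration and not the NF4 block-wise scheme that `bitsandbytes` uses. The weights and inputs are made-up values.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to signed integer codes in [-qmax, qmax] using one scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


W = [0.7, -0.3, 0.1, -0.02]  # one weight column (toy values)
X = [1.0, 2.0, 3.0, 4.0]     # one input row (toy values)

codes, scale = quantize_absmax(W)  # stored form: small integers plus a scale
W_hat = dequantize(codes, scale)   # dequantized just before the matmul

y_exact = sum(x * w for x, w in zip(X, W))
y_quant = sum(x * w for x, w in zip(X, W_hat))

print(codes)  # [7, -3, 1, 0]
print(y_exact, y_quant)
```

The smallest weight collapses to zero, which is why `y_quant` differs slightly from `y_exact`. This is the output-quality side of the tradeoff.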



For this reason, inference with quantized models is usually not faster; it is often slower. It is good to remember that quantization is a tradeoff between memory usage and output quality, and possibly inference time as well.


# TODO:

Talk about (not my words): For shorter text inputs (less than 1024 tokens), the memory requirement for inference is very much dominated by the memory requirement to load the weights. Therefore, for now, let's assume that the memory requirement for inference is equal to the memory requirement to load the model into the GPU VRAM.

# Resources/Links

- [Basics of bits and bytes](https://web.stanford.edu/class/cs101/bits-bytes.html)
- https://huggingface.co/blog/4bit-transformers-bitsandbytes
- https://huggingface.co/blog/hf-bitsandbytes-
- https://huggingface.co/blog/optimize-llm
- https://sebastianraschka.com/blog/2023/llm-mixed-precision-copy.html
- https://huggingface.co/docs/accelerate/en/usage_guides/model_size_estimator