<a href="https://colab.research.google.com/github/Taaniya/LLM-compression-optimization/blob/main/Loading_LLM_T5_with_int8_quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# HuggingFace meets `bitsandbytes` for lighter models on GPU for inference

## Running T5-11b on Google Colab

 <center>
 <img src="https://s3.amazonaws.com/moonup/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png">
 </center>


You can load any HuggingFace 🤗 model in 8-bit with just a few lines of code. This notebook shows how to do it with a `T5` model that would usually require 12GB of GPU RAM.
Install the dependencies below first!


In [None]:
!pip install --quiet bitsandbytes
!pip install --quiet --upgrade transformers # Install latest version of transformers
!pip install --quiet --upgrade accelerate
!pip install --quiet sentencepiece


In [None]:
!pip install transformers bitsandbytes accelerate

## Choose your model

Rerun this cell if you want to change the model!

In [None]:
model_name = "t5-3b-sharded" #@param ["t5-11b-sharded", "t5-3b-sharded"]

This model has sharded checkpoints: a single folder holds several files, each of which is a checkpoint containing a partial state dict, i.e., a subset of the model weights.

The sharded checkpoint files are listed on the model card [here](https://huggingface.co/ybelkada/t5-3b-sharded/tree/main).

Accelerate is used to load large sharded models when the entire model cannot fit into RAM. For more details on how Accelerate enables this, see [here](https://huggingface.co/docs/accelerate/usage_guides/big_modeling) and [here](https://huggingface.co/docs/transformers/main/en/main_classes/model#large-model-loading).
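
To make the sharded layout concrete, here is a minimal sketch of how a loader could use the index file that accompanies a sharded checkpoint (`pytorch_model.bin.index.json`, whose `weight_map` maps each parameter name to the shard that stores it). The parameter names, shard file names and `total_size` below are made up for illustration, and `group_params_by_shard` is a hypothetical helper, not a library function.

```python
import json
from collections import defaultdict

# Hypothetical, heavily shortened index in the style of the
# `pytorch_model.bin.index.json` file of a sharded checkpoint.
# All names and sizes here are illustrative assumptions.
index_json = """
{
  "metadata": {"total_size": 1000},
  "weight_map": {
    "shared.weight": "pytorch_model-00001-of-00002.bin",
    "encoder.block.0.layer.0.SelfAttention.q.weight": "pytorch_model-00001-of-00002.bin",
    "decoder.block.0.layer.0.SelfAttention.q.weight": "pytorch_model-00002-of-00002.bin",
    "lm_head.weight": "pytorch_model-00002-of-00002.bin"
  }
}
"""


def group_params_by_shard(index):
    """Return {shard file: [parameter names stored in that shard]}."""
    shards = defaultdict(list)
    for param_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(param_name)
    return dict(shards)


index = json.loads(index_json)
shards = group_params_by_shard(index)
for shard_file in sorted(shards):
    # A loader can open one shard at a time, copy these tensors into the
    # model, and free the shard before moving on to the next one.
    print(shard_file, "->", len(shards[shard_file]), "tensors")
```

Because only one shard needs to be in memory at a time, peak CPU RAM stays close to the size of the largest shard rather than the size of the whole model.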

## Use 8bit models with `t5-3b-sharded` 🤗

In [None]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# T5-3b and T5-11B are supported!
# We need sharded weights otherwise we get CPU OOM errors
model_id=f"ybelkada/{model_name}"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)

By passing `device_map="auto"`, we tell Accelerate to determine automatically where to put each layer of the model depending on the available resources: GPU RAM, CPU RAM or even disk.

We can inspect how the model was split across devices by looking at its `hf_device_map` attribute. In this case, the model was loaded on a single GPU on Colab.

In [None]:
model_8bit.hf_device_map

{'': 0}
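
The idea behind automatic placement can be sketched as a greedy fill: walk through the layers in order and put each one on the first device that still has room, falling back from GPU to CPU to disk. This is a deliberately simplified illustration, not Accelerate's actual algorithm; `greedy_device_map`, the layer names and the sizes are all hypothetical.

```python
def greedy_device_map(layer_sizes, budgets):
    """Assign layers, in order, to the first device that still has room.

    layer_sizes: {layer name: size}, in model order
    budgets:     {device: capacity}, in priority order (same unit as sizes)
    """
    remaining = dict(budgets)
    devices = list(budgets)
    current = 0  # index of the device currently being filled
    device_map = {}
    for name, size in layer_sizes.items():
        # Move on to the next device once the current one cannot fit the layer.
        while current < len(devices) and remaining[devices[current]] < size:
            current += 1
        if current == len(devices):
            raise MemoryError(f"no device has room for {name}")
        device = devices[current]
        device_map[name] = device
        remaining[device] -= size
    return device_map


# Toy sizes and budgets in arbitrary units (assumptions for illustration).
layers = {"shared": 4, "encoder": 6, "decoder": 6, "lm_head": 4}
budgets = {0: 12, "cpu": 8, "disk": 100}  # GPU 0 first, then CPU, then disk
device_map = greedy_device_map(layers, budgets)
print(device_map)
```

When the whole model fits on the first GPU, as in this notebook, every layer lands on device `0`, which is what the single entry `{'': 0}` expresses.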

Let's check the memory footprint of this model! 🪶

In [None]:
model_8bit.get_memory_footprint()

5300543488

We use quantization when loading the model for inference: passing `load_in_8bit=True` loads the model weights at lower precision, as 8-bit integers, which reduces the memory footprint.

Quantization is a form of lossy compression: weights are represented with lower-precision data types.

[This Hugging Face blog post](https://huggingface.co/blog/hf-bitsandbytes-integration) explains it in detail, including how it is enabled with the help of bitsandbytes.
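
To illustrate why quantization is lossy, here is a minimal pure-Python sketch of absmax quantization, one of the basic schemes for mapping floats to 8-bit integers. It is not the bitsandbytes implementation, which adds further steps such as special handling of outlier values; the function names and example weights are assumptions made for illustration.

```python
def absmax_quantize(values):
    """Map floats to the int8 range [-127, 127] using a single scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs else 1.0  # avoid dividing by zero
    quantized = [round(v / scale) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the int8 values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 0.9]  # toy weights, not taken from a real model
q, scale = absmax_quantize(weights)
restored = dequantize(q, scale)

for original, approx in zip(weights, restored):
    # The rounding error is at most half a quantization step (scale / 2).
    print(f"{original:+.4f} -> {approx:+.4f}")
```

The smallest weight rounds to zero, which is the kind of information loss that makes the compression lossy, while larger weights are recovered almost exactly.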

For `t5-3b`, the int8 model is about 2.9GB, whereas the original model takes 11GB. For `t5-11b`, the int8 model is about 11GB, versus 42GB for the original model.
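
As a rough sanity check on these figures, weight memory can be estimated as the number of parameters times the bytes per parameter. The sketch below uses nominal parameter counts read off the model names (3 and 11 billion), which is an approximation, so the results only roughly match the sizes quoted above; `estimate_weight_gb` is a hypothetical helper.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def estimate_weight_gb(num_params, dtype):
    """Rough weight-only footprint in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal parameter counts inferred from the model names (an assumption).
nominal_params = {"t5-3b": 3e9, "t5-11b": 11e9}

for name, n in nominal_params.items():
    fp32 = estimate_weight_gb(n, "float32")
    int8 = estimate_weight_gb(n, "int8")
    print(f"{name}: ~{fp32:.0f} GB in float32, ~{int8:.0f} GB in int8")
```
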
Now let's generate and see the qualitative results of the 8bit model!

In [None]:
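max_new_tokens = 50

# Tokenize the prompt and move the input IDs to the device that holds the model
input_ids = tokenizer(
    "translate English to German: Hello my name is Younes and I am a Machine Learning Engineer at Hugging Face",
    return_tensors="pt",
).input_ids.to(model_8bit.device)

# Generate up to max_new_tokens tokens and decode them back to text
outputs = model_8bit.generate(input_ids, max_new_tokens=max_new_tokens)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))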



Hallo mein Name ist Younes und ich bin ein Machine Learning Ingenieur bei Hugging Face


Let's see the effect of quantization on the memory footprint with the T5-base model.

In [None]:
model_name = "t5-base"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


In [None]:
model.get_memory_footprint() / (1024 * 1024)    # 850 MB

850.3095703125

In [None]:
# Check the precision of the model parameters. This is full precision (32-bit).

for param in model.parameters():
    print(param.dtype)

torch.float32
torch.float32
torch.float32
... (every remaining line shown is also torch.float32; output truncated)

In [None]:
# Load the model with 8-bit quantization.
# Note: weights dispatched to the CPU are not converted to 8-bit;
# they are kept in float32. See
# https://huggingface.co/docs/transformers/main_classes/quantization#offload-between-cpu-and-gpu

model_8bit = AutoModelForSeq2SeqLM.from_pretrained(model_name, device_map="auto",
                                                   load_in_8bit=True)

Some weights of T5ForConditionalGeneration were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model_8bit.hf_device_map

{'': 0}

In [None]:
model_8bit.get_memory_footprint() / (1024 * 1024)    # 398 MB

398.15478515625
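
The two measurements can be compared directly. The sketch below uses only the footprints printed in this notebook and assumes the float32 footprint consists almost entirely of 4-byte parameters; the variable names are illustrative.

```python
MIB = 1024 * 1024

fp32_mib = 850.3095703125   # footprint measured for the float32 model
int8_mib = 398.15478515625  # footprint measured for the 8-bit model

# Every float32 weight takes 4 bytes, so the footprint implies a parameter
# count (assuming buffers and other overhead are negligible).
num_params = fp32_mib * MIB / 4

ratio = int8_mib / fp32_mib     # fraction of the original footprint that remains
ideal_int8_mib = fp32_mib / 4   # footprint if every weight were stored as int8

print(f"parameters: ~{num_params / 1e6:.1f}M")
print(f"8-bit footprint is {ratio:.0%} of the float32 footprint")
print(f"all-int8 lower bound: {ideal_int8_mib:.1f} MiB")
```

The measured footprint stays well above the all-int8 lower bound because not every parameter is stored as int8: some remain in float16 or float32.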

In [None]:
# Check the precision of the model parameters. Most are quantized to 8-bit,
# some are kept in half precision (16-bit) and the rest in full precision (32-bit).

for param in model_8bit.parameters():
    print(param.dtype)

torch.float16
torch.int8
torch.int8
torch.int8
torch.int8
torch.float16
torch.float16
torch.int8
torch.float32
torch.float16
torch.int8
torch.int8
torch.int8
torch.int8
torch.float16
torch.int8
torch.float32
... (the last eight-line pattern repeats; output truncated)
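
Printing one line per parameter gets long quickly. A more compact alternative is to count parameters per dtype. The sketch below is a hypothetical helper, `summarize_dtypes`, demonstrated on stand-in objects so that it runs without a model; the stand-in dtypes mirror the first nine lines of the output above.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_dtypes(params):
    """Count parameter tensors per dtype instead of printing one line each."""
    return Counter(str(p.dtype) for p in params)


# Stand-ins for parameter tensors (an assumption for illustration). With a
# real model you would pass model_8bit.parameters() instead.
fake_params = [
    SimpleNamespace(dtype=d)
    for d in [
        "torch.float16", "torch.int8", "torch.int8", "torch.int8", "torch.int8",
        "torch.float16", "torch.float16", "torch.int8", "torch.float32",
    ]
]

summary = summarize_dtypes(fake_params)
for dtype, count in summary.most_common():
    print(f"{dtype}: {count}")
```

Note that this counts tensors, not individual weights, so it shows which precisions are present rather than how much memory each one uses.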

#### References

* What is sharding - https://docs.graphcore.ai/projects/tf-model-parallelism/en/latest/sharding.html
* Large model loading - https://huggingface.co/docs/transformers/main/en/main_classes/model#large-model-loading
* Handling big models for inference - https://huggingface.co/docs/accelerate/usage_guides/big_modeling
* Hugging Face blog post on the 8-bit bitsandbytes integration - https://huggingface.co/blog/hf-bitsandbytes-integration
* Hugging Face blog post on 4-bit quantization with bitsandbytes - https://huggingface.co/blog/4bit-transformers-bitsandbytes
* Quantization - https://huggingface.co/docs/accelerate/usage_guides/quantization
* Quantize Transformers models, bitsandbytes integration - https://huggingface.co/docs/transformers/main_classes/quantization
* Data parallelism in Amazon SageMaker for faster training - https://aws.amazon.com/blogs/machine-learning/enable-faster-training-with-amazon-sagemaker-data-parallel-library/