# HuggingFace meets `bitsandbytes` for lighter models on GPU for inference

## Running T5-11b on Google Colab 

 <center>
 <img src="https://s3.amazonaws.com/moonup/production/uploads/1659861207959-62441d1d9fdefb55a0b7d12c.png">
 </center>


You can run any Hugging Face 🤗 model in 8-bit with just a few lines of code. This notebook shows how to do it with a `T5` model that would usually require 12GB of GPU RAM.
Install the dependencies below first!


In [1]:
!pip install --quiet bitsandbytes
!pip install --quiet --upgrade transformers # Install latest version of transformers
!pip install --quiet --upgrade accelerate
!pip install --quiet sentencepiece

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
trlx 0.5.0 requires tritonclient, which is not installed.

In [None]:
!pip install transformers bitsandbytes accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Choose your model

Rerun this cell if you want to change the model!

model_name = "t5-3b-sharded" #@param ["t5-11b-sharded", "t5-3b-sharded"]

## Use 8bit models with `t5-3b-sharded` 🤗

In [2]:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# T5-3b and T5-11B are supported!
# We need sharded weights, otherwise we get CPU OOM errors.
# Note: this cell loads the checkpoint below directly and does not use `model_name`.
model_id = "google/flan-t5-xxl"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(
    model_id, 
    device_map="auto", 
    load_in_8bit=True,
)




Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
CUDA SETUP: CUDA runtime path found: /home/mila/g/gagnonju/.anaconda3/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/mila/g/gagnonju/.anaconda3/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]
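For some intuition about what storing weights in 8 bits involves, here is a minimal sketch of absmax quantization in plain Python. It is a simplified illustration only, not the `bitsandbytes` implementation (which is more sophisticated), and the function names `absmax_quantize` and `dequantize` are hypothetical helpers introduced for this example.

```python
def absmax_quantize(values):
    """Quantize floats to signed 8-bit codes with absmax scaling (illustrative only)."""
    # Scale so that the largest magnitude maps to 127, the largest int8 value.
    max_abs = max(abs(v) for v in values)
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    codes = [round(v * scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map the 8-bit codes back to approximate float values."""
    return [c / scale for c in codes]


# Toy "weights" chosen for illustration; real weight tensors are far larger.
weights = [0.4, -1.0, 0.25, 0.0]
codes, scale = absmax_quantize(weights)
restored = dequantize(codes, scale)

print("int8 codes:", codes)
print("restored:  ", restored)
```

Each weight is stored as a one-byte code plus a shared scale, and the restored values differ from the originals by a small rounding error.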

Let's check the memory footprint of this model! 🪶

In [4]:
model_8bit.get_memory_footprint() // 1000 ** 3

17
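Note that the floor division in the cell above truncates the result to whole decimal gigabytes. The sketch below shows how the same byte count reads with and without truncation, and in GiB. The helper `footprint_in_units` and the example byte count are hypothetical, chosen only for illustration and not measured from the model.

```python
def footprint_in_units(num_bytes):
    """Return a byte count as truncated GB, fractional GB, and fractional GiB."""
    whole_gb = num_bytes // 1000 ** 3  # same expression as the cell above (truncates)
    gb = num_bytes / 1000 ** 3         # decimal gigabytes (10^9 bytes)
    gib = num_bytes / 1024 ** 3        # binary gibibytes (2^30 bytes)
    return whole_gb, gb, gib


# Hypothetical byte count for illustration, not a measurement.
example_bytes = 17_900_000_000
whole_gb, gb, gib = footprint_in_units(example_bytes)
print(f"{whole_gb} GB (truncated), {gb:.2f} GB, {gib:.2f} GiB")
```

With a real model, you could pass `model_8bit.get_memory_footprint()` in place of `example_bytes`.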

For `t5-3b`, the int8 model takes about 2.9GB, whereas the original model takes 11GB. For `t5-11b`, the int8 model takes about 11GB versus 42GB for the original model. The value printed above is for the `google/flan-t5-xxl` checkpoint loaded in this run.
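These figures can be sanity-checked with back-of-the-envelope arithmetic: weight memory is roughly the number of parameters times the bytes per parameter. The sketch below assumes rounded parameter counts taken from the model names (3 billion and 11 billion) and 4 bytes per fp32 weight versus 1 byte per int8 weight. The helper `estimate_weight_gib` is hypothetical, and the results are estimates, not measurements.

```python
# Bytes needed to store one parameter in each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def estimate_weight_gib(num_params, dtype):
    """Rough weight-only memory estimate in GiB (ignores activations and buffers)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 ** 3


# Assumption: rounded parameter counts implied by the model names.
rounded_params = {"t5-3b": 3_000_000_000, "t5-11b": 11_000_000_000}

for name, num_params in rounded_params.items():
    fp32 = estimate_weight_gib(num_params, "fp32")
    int8 = estimate_weight_gib(num_params, "int8")
    print(f"{name}: ~{fp32:.1f} GiB in fp32, ~{int8:.1f} GiB in int8")
```

The estimates land in the same range as the numbers quoted above. Exact footprints differ because the true parameter counts are not round and some modules may be kept in higher precision.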
Now let's generate and see the qualitative results of the 8bit model!

In [7]:
max_new_tokens = 50

input_ids = tokenizer(
    "translate English to French: Hello my name is Younes "
    "and I am a Machine Learning Engineer at Hugging Face", 
    return_tensors="pt",
).input_ids  

outputs = model_8bit.generate(
    input_ids.to(model_8bit.device), max_new_tokens=max_new_tokens
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Hello, mon nom est Younes et je suis un ingénieur de apprentissage d'applications à Hugging Face.
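The prompt above starts with a task prefix (`translate English to French:`) followed by the input text. To try other sentences or language pairs, a small helper can build prompts in the same shape. This is a sketch only: `build_translation_prompt` is a hypothetical function, and whether a given language pair works well depends on the checkpoint.

```python
def build_translation_prompt(text, source_lang="English", target_lang="French"):
    """Build a prompt in the same 'translate X to Y: text' shape used above."""
    return f"translate {source_lang} to {target_lang}: {text.strip()}"


sentences = [
    "Hello my name is Younes",
    "  The model fits on a single GPU.  ",
]
prompts = [build_translation_prompt(s) for s in sentences]
for prompt in prompts:
    print(prompt)
```

Each resulting string can be passed to `tokenizer(...)` exactly as in the generation cell above.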
