<a href="https://www.kaggle.com/code/aisuko/lighter-models-on-gpu-for-inference?scriptVersionId=163025567" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Overview

Earlier notebooks covered [Introduction to 8-bit matrix multiplication](https://www.kaggle.com/code/aisuko/introduction-to-8-bit-matrix-multiplication) and [Zero degradation matrix multiplication](https://www.kaggle.com/code/aisuko/zero-degradation-matrix-multiplication). Building on those, 8-bit weights let a model fit on a GPU with less memory. Note that 8-bit tensor cores are not supported on the CPU, so this notebook needs a GPU runtime. Here we load a model in 8-bit and compare its memory footprint with the natively loaded model.
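
As a reminder of the idea behind 8-bit weights, here is a minimal pure-Python sketch of absmax quantization, one common scheme. It is an illustration only and not necessarily what `bitsandbytes` does internally; the helper names `absmax_quantize` and `dequantize` and the sample `weights` are hypothetical.

```python
def absmax_quantize(values):
    """Map floats to the int8 range [-127, 127] using one absmax scale."""
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale works.
    scale = 127.0 / max_abs if max_abs else 1.0
    quantized = [max(-127, min(127, round(v * scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the int8 values."""
    return [q / scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.5, -1.2, 0.03, 2.4]
q, scale = absmax_quantize(weights)
restored = dequantize(q, scale)

# Rounding loses at most half a quantization step per value.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)
print(max_error)
```

Each value now needs one byte plus a shared scale, at the cost of a small rounding error bounded by half a quantization step.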

In [1]:
%%capture
!pip install transformers==4.37.2
!pip install bitsandbytes==0.42.0
!pip install accelerate==0.27.2

# Use 8-bit model with PyTorch

In [2]:
model_name="ft-t5-small-with-opusbook"

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model_id=f"aisuko/{model_name}"

tokenizer=AutoTokenizer.from_pretrained(model_id)
model_8bit=AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto",load_in_8bit=True)


## Memory footprint comparison

> Translation converts a sequence of text from one language to another. It is one of several tasks that can be formulated as a sequence-to-sequence problem. For more detail, see [Translation (NLP)](https://www.kaggle.com/code/aisuko/translation-nlp).



In [3]:
model=AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto",torch_dtype="auto")

In [4]:
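# Footprint (in bytes) of the model loaded with torch_dtype="auto", i.e. in the checkpoint's stored dtype.
# The training notebook describes the model as fp16, see https://www.kaggle.com/code/aisuko/translation-nlp?scriptVersionId=154119620&cellId=20
# Note: 242,026,496 bytes over T5-small's roughly 60.5M parameters is 4 bytes each, which points to fp32 weights despite the variable name.
model_fp16=model.get_memory_footprint()
model_fp16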

242026496

In [5]:
model_int8=model_8bit.get_memory_footprint()
model_int8

114721792

In [6]:
print("Memory footprint ratio (native / int8): {}".format(model_fp16/model_int8))

Memory footprint ratio (native / int8): 2.109681968705649
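
To put the two footprints in context, the sketch below converts them to bytes per parameter and to the absolute saving. The byte counts are copied from the outputs above; the parameter count is an assumption (the commonly reported size of T5-small) and can be checked with `sum(p.numel() for p in model.parameters())`. The helper `bytes_per_param` is a hypothetical name.

```python
# Footprints in bytes, copied from the notebook outputs above.
native_bytes = 242_026_496
int8_bytes = 114_721_792

# Assumption: T5-small parameter count; verify it on the loaded model with
# sum(p.numel() for p in model.parameters()).
assumed_param_count = 60_506_624


def bytes_per_param(total_bytes, param_count):
    """Average storage per parameter, in bytes."""
    return total_bytes / param_count


native_bpp = bytes_per_param(native_bytes, assumed_param_count)
int8_bpp = bytes_per_param(int8_bytes, assumed_param_count)
saved_mib = (native_bytes - int8_bytes) / 2**20

print(f"native: {native_bpp:.2f} bytes/param")
print(f"int8:   {int8_bpp:.2f} bytes/param")
print(f"saved:  {saved_mib:.1f} MiB")
```

Under that assumption, the 8-bit model averages more than one byte per parameter, which suggests that some weights are kept in a higher precision rather than stored as int8.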


In [7]:
max_new_tokens=50
prompt="translate English to German: Hello my name is Kaggle"

# Tokenize the prompt and move the ids to the same device as the 8-bit model
input_ids=tokenizer(
    prompt, return_tensors="pt"
).input_ids.to(model_8bit.device)

outputs=model_8bit.generate(input_ids, max_new_tokens=max_new_tokens)



In [8]:
tokenizer.decode(outputs[0], skip_special_tokens=True)

'Hallo mein Name ist Kaggle'

# Use 8-bit model with pipeline

In [9]:
from transformers import pipeline

pipe=pipeline(model=model_id, model_kwargs={"device_map":"auto", "load_in_8bit":True}, max_new_tokens=20)

In [10]:
pipe(prompt)

[{'generated_text': 'Hallo mein Name ist Kaggle'}]

# References List

* https://www.kaggle.com/code/aisuko/translation-nlp
* https://huggingface.co/aisuko/ft-t5-small-with-opusbook