<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/text_generation_pipeline_conversational.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation example with conversational model and model quantization

This is a brief example of how to run text generation with a conversational (chat, instruct) causal language model, `pipeline`, and model quantization.

[Model quantization](https://huggingface.co/docs/accelerate/usage_guides/quantization) is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
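
To make the idea concrete, here is a minimal sketch of one simple scheme, absmax int8 quantization, in plain Python. It is illustration only and is not what `bitsandbytes` does internally: the 4-bit NF4 format used later in this notebook works block-wise with a non-uniform set of levels. The function names and the example weights are made up for this sketch.

```python
def quantize_absmax_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    # The largest magnitude maps to 127; guard against an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
codes, scale = quantize_absmax_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:", codes)
print(f"scale: {scale:.4f}")
print(f"max round-trip error: {max_error:.6f}")
```

Each weight is stored as one byte plus a shared scale, and the round-trip error is bounded by half the scale. This is the trade-off quantization makes: less memory for a small loss of precision.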

In [1]:
!pip install --quiet transformers accelerate "bitsandbytes>0.37.0"

Import `AutoTokenizer`, `AutoModelForCausalLM`, `pipeline`, and `BitsAndBytesConfig`. The first two classes support loading tokenizers and generative models from the [Hugging Face Hub](https://huggingface.co/models), and the `pipeline` function wraps a tokenizer and a model for convenience. `BitsAndBytesConfig` holds the settings for model quantization.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig

Load a generative model and its tokenizer. You can substitute any other generative model name here, but note that Colab may have issues running larger models.

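The `google/gemma-3-1b-it` model requires authentication. Here, `userdata.get('my_access_token')` reads a secret access token that I created for my Hugging Face account and stored in Colab under the name `my_access_token`; substitute the name of your own secret.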

In [3]:
from google.colab import userdata
import torch

MODEL_NAME = 'google/gemma-3-1b-it'

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side='left', token=userdata.get('my_access_token'))
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, token=userdata.get('my_access_token'), device_map="auto", quantization_config=quantization_config)

Instantiate a text generation pipeline using the tokenizer and model.

In [4]:

pipe = pipeline("text-generation",
    model=model,
    tokenizer=tokenizer
)

Device set to use cuda:0


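We can now call the pipeline with a conversation, given as a list of messages with a `role` and `content`. The pipeline takes care of applying the model's chat template, tokenizing, encoding, generation, and decoding: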

In [5]:
# We have only one message here, but the starting point could also be
# a conversation with multiple user/assistant turns.
conversation = [{"role": "user", "content": "What is the capital of Finland?"}]

output = pipe(conversation, max_new_tokens=50)

for c in output[0]["generated_text"]:
  print(f"{c['role']}: {c['content']}")


user: What is the capital of Finland?
assistant: The capital of Finland is **Helsinki**.
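
Because `generated_text` holds the whole conversation including the new assistant reply, a follow-up question can be asked by appending another user message and calling the pipeline again. The sketch below shows only the list handling, using a hard-coded stand-in for the pipeline output so that it runs without a model; `add_user_turn` is a hypothetical helper name, not part of `transformers`.

```python
def add_user_turn(history, user_message):
    """Return a new message list with one more user turn appended."""
    return history + [{"role": "user", "content": user_message}]


# Stand-in for output[0]["generated_text"] from the cell above.
history = [
    {"role": "user", "content": "What is the capital of Finland?"},
    {"role": "assistant", "content": "The capital of Finland is **Helsinki**."},
]

follow_up = add_user_turn(history, "How many people live there?")

for message in follow_up:
    print(f"{message['role']}: {message['content']}")

# With the pipeline loaded, the next call would be:
# output = pipe(follow_up, max_new_tokens=50)
```

Returning a new list instead of mutating `history` keeps the earlier conversation state available, which is convenient when trying several different follow-ups from the same starting point.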



We can also call the pipeline with any arguments that the model's `generate` method supports. For details on text generation using `transformers`, see e.g. [this tutorial](https://huggingface.co/blog/how-to-generate).

Example with sampling and a high `temperature` parameter to generate more chaotic output:

In [6]:
output = pipe(conversation,
              do_sample=True,
              temperature=2.0,
              max_new_tokens=50)

for c in output[0]["generated_text"]:
  print(f"{c['role']}: {c['content']}")


user: What is the capital of Finland?
assistant: The capital of Finland is **Stockholm**.

Interestingly, this is a long and somewhat convoluted story! Finland was formerly part of Sweden, and as such, the capital of Sweden was Stockholm. In the 16th century, the Finnish Tsar
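
Why does a high temperature make the output more chaotic? During sampling, the model's scores (logits) for the next token are divided by the temperature before the softmax, so a temperature above 1 flattens the distribution and gives unlikely tokens a larger share. The sketch below illustrates this with three made-up logits; it is a plain-Python illustration, not the `transformers` implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens.
logits = [4.0, 2.0, 1.0]

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    formatted = ", ".join(f"{p:.3f}" for p in probs)
    print(f"temperature={temperature}: [{formatted}]")
```

As the temperature rises, the probability of the top-scoring token drops and the others gain, so the sampler picks low-ranked tokens more often.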


In [7]:
# print memory usage

import sys

def report_memory_usage(message, out=sys.stderr):
    """Print the peak GPU memory allocated by PyTorch on each CUDA device."""
    print(f'max memory allocation {message}:', file=out)
    total = 0
    for i in range(torch.cuda.device_count()):
        mem = torch.cuda.max_memory_allocated(i)
        print(f'  cuda:{i}: {mem/2**30:.1f}G', file=out)
        total += mem
    print(f'  TOTAL: {total/2**30:.1f}G', file=out)

report_memory_usage("4bit")

# Values from earlier runs, for comparison:
#
# max memory allocation (original):
#   cuda:0: 5.8G
#   TOTAL: 5.8G
#
# max memory allocation (4bit):
#   cuda:0: 1.2G
#   TOTAL: 1.2G

max memory allocation 4bit:
  cuda:0: 1.2G
  TOTAL: 1.2G
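
A rough back-of-the-envelope estimate helps interpret numbers like these: the memory needed for the weights alone is the parameter count times the bits per parameter. The sketch below assumes a round figure of one billion parameters, which is an assumption for illustration rather than the exact size of the model used here. Measured peak usage will be higher than the estimate, because it also covers activations, the generation cache, and any layers that are kept in higher precision.

```python
def weight_memory_gib(num_parameters, bits_per_parameter):
    """Estimate the memory needed for model weights alone, in GiB."""
    return num_parameters * bits_per_parameter / 8 / 2**30


# Assumed round parameter count, for illustration only.
ASSUMED_PARAMETERS = 1_000_000_000

for label, bits in [("float32", 32), ("float16", 16), ("int8", 8), ("4-bit", 4)]:
    estimate = weight_memory_gib(ASSUMED_PARAMETERS, bits)
    print(f"{label:>8}: {estimate:.2f} GiB")
```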
