<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/text_generation_pipeline_conversational.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text generation example with conversational model and model quantization

This is a brief example of how to run text generation with a conversational (chat, instruct) causal language model, `pipeline`, and model quantization.

[Model quantization](https://huggingface.co/docs/accelerate/usage_guides/quantization) is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8) instead of the usual 32-bit floating point (float32).
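
To make the idea concrete, here is a minimal sketch of one simple scheme, absmax int8 quantization, written in plain Python. It is only an illustration: the function names and weight values are made up, and the 4-bit NF4 format used later in this notebook is more elaborate than this.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale (absmax)."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.123, -0.507, 0.334, 0.0, 1.27]  # made-up example weights
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is at most half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)
print(f"scale={scale:.4f}, max error={max_error:.4f}")
```

Each weight is stored as a small integer plus one shared scale, which is where the memory saving comes from; the price is the rounding error measured above.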

In [1]:
!pip install --quiet transformers accelerate "bitsandbytes>0.37.0"

Import `AutoTokenizer`, `AutoModelForCausalLM`, and `pipeline`. The first two are classes that load tokenizers and generative models from the [Hugging Face Hub](https://huggingface.co/models), and the `pipeline` function wraps a tokenizer and a model for convenience. `BitsAndBytesConfig` is used to configure model quantization.

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig

Load a generative model and its tokenizer. You can substitute any other generative model name here, but note that Colab may have issues running larger models.
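
As a rough guide to whether a model will fit, the memory needed for the weights alone can be estimated from the parameter count and the bytes per parameter. The sketch below assumes a parameter count of about 7.2 billion for a 7B-class model (an approximate figure, not read from the model itself) and ignores activations, the attention cache, and the small overhead of quantization constants.

```python
N_PARAMS = 7.2e9  # assumed approximate parameter count for a 7B-class model

# Storage cost per parameter for a few common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "4-bit": 0.5}


def weight_memory_gib(n_params, bytes_per_param):
    """Estimate the memory taken by the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


for dtype, size in BYTES_PER_PARAM.items():
    print(f"{dtype:>8}: {weight_memory_gib(N_PARAMS, size):5.1f} GiB")
```

Under these assumptions the weights take roughly 27 GiB in float32 but only a bit over 3 GiB at 4 bits, which can be compared against the GPU memory of your runtime.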

The `Mistral-7B-Instruct-v0.2` model requires authentication. `userdata.get('my_access_token')` reads the secret access token that I created for my Hugging Face account and stored in Colab's secrets under the name `my_access_token`.

In [4]:
from google.colab import userdata
import torch

MODEL_NAME = 'mistralai/Mistral-7B-Instruct-v0.2'

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4-bit precision
    bnb_4bit_quant_type="nf4",             # use the 4-bit NormalFloat (NF4) data type
    bnb_4bit_compute_dtype=torch.float16,  # run computations in float16
)

# Read the access token once and reuse it for both downloads
hf_token = userdata.get('my_access_token')

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side='left', token=hf_token)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    token=hf_token,
    device_map="auto",
    quantization_config=quantization_config,
)

Instantiate a text generation pipeline using the tokenizer and model.

In [6]:

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

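We can now call the pipeline with a conversation, given as a list of messages that each have a `role` and a `content`. The pipeline takes care of applying the model's chat template, tokenizing, generation, and decoding: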

In [15]:
# We have only one message here, but the conversation could also start
# with multiple alternating user/assistant turns.
conversation = [{"role": "user", "content": "What is the capital of Finland?"}]

output = pipe(conversation, max_new_tokens=50)

for c in output[0]["generated_text"]:
  print(f"{c['role']}: {c['content']}")


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


user: What is the capital of Finland?
assistant:  The capital city of Finland is Helsinki. It is the largest city in Finland and is located in the southern part of the country, on the shores of the Gulf of Finland. Helsinki is known for its beautiful architecture,
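
Before generation, the list of messages has to be turned into a single prompt string. The sketch below approximates an `[INST]`-style format in plain Python to show the idea with a multi-turn conversation. The function `format_mistral_style` and its exact spacing are assumptions made for illustration; the authoritative template ships with the tokenizer, and `tokenizer.apply_chat_template(conversation, tokenize=False)` should show what the model really receives.

```python
def format_mistral_style(conversation, bos="<s>", eos="</s>"):
    """Approximate an [INST]-style chat prompt; not the tokenizer's real template."""
    parts = [bos]
    for i, message in enumerate(conversation):
        # Turns are expected to alternate, starting with the user.
        expected_role = "user" if i % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("roles must alternate user/assistant, starting with user")
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            # An end-of-sequence token closes each assistant turn.
            parts.append(f"{message['content']}{eos}")
    return "".join(parts)


multi_turn = [
    {"role": "user", "content": "What is the capital of Finland?"},
    {"role": "assistant", "content": "Helsinki."},
    {"role": "user", "content": "And of Sweden?"},
]
prompt = format_mistral_style(multi_turn)
print(prompt)
```

The model then continues the prompt after the last `[/INST]`, and the pipeline returns that continuation as a new assistant message.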


We can also call the pipeline with any arguments that the model's `generate` method supports. For details on text generation using `transformers`, see e.g. [this tutorial](https://huggingface.co/blog/how-to-generate).

Example with sampling and a high `temperature` parameter to generate more chaotic output:

In [31]:
output = pipe(conversation,
              do_sample=True,
              temperature=3.0,
              max_new_tokens=50)

for c in output[0]["generated_text"]:
  print(f"{c['role']}: {c['content']}")


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


user: What is the capital of Finland?
assistant:  The capital city of Finland is called Helsinki. It's where the government administration center – named as both Parliament of Finland and office of the presidente de la repubica (PresidENCY of FInLand) in Finland English
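
To see why a high temperature produces more chaotic text, it helps to look at how temperature reshapes the next-token distribution: the model's scores (logits) are divided by the temperature before the softmax. The sketch below uses made-up logits for four candidate tokens and is an illustration of the principle, not the library's implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0, 0.0]  # made-up scores for four candidate tokens

for temperature in (0.5, 1.0, 3.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature rises, probability mass moves from the top-scoring token to the less likely ones, so sampling picks unlikely tokens more often.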
