<a href="https://colab.research.google.com/github/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/generation_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generation basics

This notebook illustrates some of the steps that go into text generation, starting from the `pipeline` abstraction in the `transformers` library and then looking at what happens underneath it.

First install the [transformers](https://huggingface.co/docs/transformers/index) package.

In [1]:
!pip install --quiet transformers

Next, load a `pipeline` for text generation with a small model.

In [2]:
from transformers import pipeline

MODEL_NAME = 'HuggingFaceTB/SmolLM-135M'

pipe = pipeline('text-generation', model=MODEL_NAME)

Device set to use cpu


We can conveniently generate text using the high-level abstraction that `pipeline` provides:

In [4]:
prompt = 'The capital of Finland is'

print(pipe(prompt)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


The capital of Finland is the capital of Finland, Helsinki. The country is known for the beautiful Finnish countryside, natural beauty, and the unique way of life. Finland has a rich cultural heritage and is home to a vibrant and diverse population.
The country is a great place to explore. There are several attractions in Finland that are popular with tourists. The country has a long history and is one of the oldest countries in Europe. The country is also famous for its natural beauty, which is reflected in its stunning landscapes.
The country is a great place to experience the local culture. The Finnish people have a rich and unique culture that is reflected in their traditions, music, and art. The country is also known for its food and cuisine. The country is a great place to try new foods and enjoy traditional Finnish dishes.
The country is also a great place to learn about the history of Finland. The country was established in 1510, and it has a long history that dates back to the

For simplicity, let's look at generating a single token using greedy decoding, i.e. simply selecting the token that the model considers most likely.

In [5]:
params = {
    'do_sample': False,
    'max_new_tokens': 1,
}

print(pipe(prompt, **params)[0]['generated_text'])

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


The capital of Finland is Helsinki


Now, let's look at what's going on behind the `pipeline` abstraction. First, here's the model:

In [6]:
model = pipe.model
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 576)
    (layers): ModuleList(
      (0-29): 30 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=576, out_features=576, bias=False)
          (k_proj): Linear(in_features=576, out_features=192, bias=False)
          (v_proj): Linear(in_features=576, out_features=192, bias=False)
          (o_proj): Linear(in_features=576, out_features=576, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
          (up_proj): Linear(in_features=576, out_features=1536, bias=False)
          (down_proj): Linear(in_features=1536, out_features=576, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((576,), eps=1e-05)
    (rotary_emb): LlamaRotaryEm

The model doesn't actually deal with text directly, but rather with token indices. The mapping between running text and token indices is implemented by a tokenizer.

In [7]:
tokenizer = pipe.tokenizer

Let's have a look at the mapping between our prompt and the token indices.

In [8]:
input_ = tokenizer(prompt)

print(input_)

{'input_ids': [504, 3575, 282, 17446, 314], 'attention_mask': [1, 1, 1, 1, 1]}


That's the actual input to the model, and `input_ids` are the token indices. (We can ignore the `attention_mask` here.)

The tokenizer can map these indices back to token strings:

In [9]:
print(tokenizer.convert_ids_to_tokens(input_.input_ids))

['The', 'Ġcapital', 'Ġof', 'ĠFinland', 'Ġis']


(The `Ġ` there encodes a space; this representation is a quirk of the byte-level BPE tokenization introduced with GPT-2, which this tokenizer also uses.)
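To see where `Ġ` comes from, here is a minimal sketch that re-implements the byte-to-character table of GPT-2's byte-level BPE, modeled on the `bytes_to_unicode` helper in the original GPT-2 encoder. It assumes that the tokenizer loaded above uses the same table, which the `Ġ` in its output suggests; `show_bytes` is a name made up for this illustration.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a visible Unicode character."""
    # Bytes that are already printable, non-space characters map to themselves.
    printable = (
        list(range(0x21, 0x7F))      # '!' .. '~'
        + list(range(0xA1, 0xAD))    # '¡' .. '¬'
        + list(range(0xAE, 0x100))   # '®' .. 'ÿ'
    )
    mapping = {b: chr(b) for b in printable}
    # All remaining bytes (control characters, space, ...) are shifted to
    # code points starting at 256 so that they become visible.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()


def show_bytes(text):
    """Render text the way a byte-level BPE vocabulary displays it."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


with_space = show_bytes(" capital")   # 'Ġcapital'
newline = show_bytes("\n")            # 'Ċ'
```

The space is byte 32, the 33rd byte without a printable form, so it lands on code point 256 + 32 = 288, which is `Ġ`.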

Because the token IDs represent both visible characters and whitespace, the full input string can be reconstructed exactly:

In [10]:
print(tokenizer.decode(input_.input_ids))

The capital of Finland is


We can invoke the model directly with the encoded input. The model expects PyTorch tensors with a batch dimension, so we ask the tokenizer to return those with `return_tensors='pt'`; the information content is the same.

In [11]:
input_ = tokenizer(prompt, return_tensors='pt')

print(input_)

{'input_ids': tensor([[  504,  3575,   282, 17446,   314]]), 'attention_mask': tensor([[1, 1, 1, 1, 1]])}


In [12]:
output = model(**input_)
print(output)

CausalLMOutputWithPast(loss=None, logits=tensor([[[ -3.5625, -11.6875, -11.6875,  ...,  -7.3750,  -7.0312,  -9.8750],
         [  2.0938,  -8.3125,  -8.4375,  ...,  -7.7812,  -5.3125,  -9.5000],
         [-10.4375, -20.6250, -20.7500,  ..., -16.1250, -14.0625, -19.2500],
         [ 13.8750,  -2.8750,  -3.1250,  ...,   3.8125,   3.5781,  -2.6719],
         [ -0.5078, -13.4375, -13.6250,  ...,  -6.4375,  -3.9062, -14.1875]]],
       dtype=torch.bfloat16, grad_fn=<UnsafeViewBackward0>), past_key_values=DynamicCache(layers=[<transformers.cache_utils.DynamicLayer object at 0x7d13b45f5c70>, <transformers.cache_utils.DynamicLayer object at 0x7d13b45fce60>, <transformers.cache_utils.DynamicLayer object at 0x7d13b45fd280>, <transformers.cache_utils.DynamicLayer object at 0x7d13b45fcc50>, <transformers.cache_utils.DynamicLayer object at 0x7d13b45ffc80>, <transformers.cache_utils.DynamicLayer object at 0x7d13b45ffd40>, <transformers.cache_utils.DynamicLayer object at 0x7d13b45fe7b0>, <transformer

The primary output of the model is the logits: unnormalized scores over the vocabulary for each input position. We're interested in the scores at the last position, which are used to predict the next token. (The first dimension here is for the batch, and we have a batch of one.)

In [15]:
logits = output.logits[0][-1]
print(logits.shape)
print(logits)

torch.Size([49152])
tensor([ -0.5078, -13.4375, -13.6250,  ...,  -6.4375,  -3.9062, -14.1875],
       dtype=torch.bfloat16, grad_fn=<SelectBackward0>)
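Logits become a probability distribution over the vocabulary once they are passed through a softmax. The following is a minimal sketch in plain Python on a made-up five-entry logit vector; the values and the helper name `softmax` are illustrative and not taken from the model output above.

```python
import math


def softmax(logits):
    """Turn unnormalized scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp() without changing the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [-0.5, -13.4, 2.0, -6.4, -3.9]   # made-up values
probs = softmax(toy_logits)

# The position of the largest value is the same before and after softmax.
best_logit = max(range(len(toy_logits)), key=toy_logits.__getitem__)
best_prob = max(range(len(probs)), key=probs.__getitem__)
```

Because softmax preserves the ordering of the scores, the highest logit is also the highest probability. With the tensors in this notebook, the equivalent call would be along the lines of `torch.softmax(logits.float(), dim=-1)`.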


For greedy decoding, we can just take the argmax, which gives us the index of the most likely next token.

In [16]:
logits.argmax()

tensor(42398)

In [17]:
tokenizer.convert_ids_to_tokens([logits.argmax()])

['ĠHelsinki']

If we wanted to generate more than one token, we would append this index to `input_ids` and invoke the model again.
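That loop can be sketched as follows. To keep the example self-contained, the model is replaced by a hypothetical stand-in (`toy_model`, with a four-token vocabulary) that returns next-token scores based only on the last token; `greedy_generate` and its parameters are likewise names chosen for this illustration, not part of `transformers`.

```python
def greedy_generate(next_token_logits, input_ids, max_new_tokens, eos_id=None):
    """Repeatedly append the highest-scoring token to the sequence."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)   # scores for the next position
        next_id = max(range(len(logits)), key=logits.__getitem__)   # argmax
        ids.append(next_id)
        if next_id == eos_id:             # optional early stop
            break
    return ids


# Hypothetical stand-in for a model: the scores depend only on the last token.
TOY_TABLE = {
    0: [0.1, 2.0, 0.3, 0.0],   # after token 0, token 1 scores highest
    1: [0.0, 0.1, 3.0, 0.2],   # after token 1, token 2 scores highest
    2: [0.0, 0.0, 0.1, 1.5],   # after token 2, token 3 scores highest
    3: [0.0, 0.0, 0.0, 0.0],
}


def toy_model(ids):
    return TOY_TABLE[ids[-1]]


generated = greedy_generate(toy_model, [0], max_new_tokens=5, eos_id=3)
# generated == [0, 1, 2, 3]
```

With the objects from this notebook, `next_token_logits` could be roughly `lambda ids: model(input_ids=torch.tensor([ids])).logits[0, -1].tolist()`. This sketch re-runs the model on the whole sequence at every step; the `past_key_values` cache in the model output exists so that real generation can avoid that recomputation.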

---

### Embeddings

The model operates on continuous representations rather than discrete token identifiers. To map the `input_ids` to a continuous representation, the first step in applying the model is to look up a learned (context-independent) embedding for each token. These are similar to the embeddings generated by methods such as `word2vec`.

(We already saw an instance of the reverse mapping, from continuous representations to discrete IDs, on the output side through `argmax`.)

Let's have a quick look at the embeddings for this model.

In [18]:
embedding = model.get_input_embeddings()
print(embedding)

Embedding(49152, 576)


Here the first value is the number of tokens in the model vocabulary and the second the hidden dimension of the model.
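Conceptually, such an embedding layer is a lookup table: a matrix with one row per token ID, where looking up a token means selecting its row. A minimal sketch with a made-up 4 x 3 table (the values and the helper names `lookup` and `one_hot_matmul` are illustrative only):

```python
# A made-up embedding table: 4 tokens in the vocabulary, hidden dimension 3.
toy_table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def lookup(table, token_ids):
    """Return one embedding row per token ID."""
    return [table[i] for i in token_ids]


def one_hot_matmul(table, token_id):
    """The same lookup for one token, written as one-hot vector times matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(table))]
    dim = len(table[0])
    return [
        sum(one_hot[i] * table[i][j] for i in range(len(table)))
        for j in range(dim)
    ]


vectors = lookup(toy_table, [2, 0, 2])   # three tokens in, three rows out
```

The same token ID always yields the same row, which is what "context-independent" means here.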

Let's write a few functions for getting IDs and embeddings for individual words and look at the IDs and embeddings for the words "dog", "cat", and "hat".

In [19]:
def get_id(word):
  ids = tokenizer(' ' + word).input_ids    # add initial space
  assert len(ids) == 1, f'multiple tokens for {word}'
  return ids[0]

dog_id = get_id('dog')
cat_id = get_id('cat')
hat_id = get_id('hat')

The IDs themselves are arbitrary and meaningless outside of their reference to the embedding matrix:

In [20]:
print(dog_id, cat_id, hat_id)

2767 2644 9968


Let's grab the corresponding embeddings:

In [21]:
import torch

def get_embedding(id_):
  return embedding(torch.tensor(id_))

dog_emb = get_embedding(dog_id)
cat_emb = get_embedding(cat_id)
hat_emb = get_embedding(hat_id)

These are vectors with the hidden dimensionality of the model:

In [22]:
print(dog_emb.shape)

torch.Size([576])


The individual values cannot be interpreted in isolation; they are only "understood" by the model:

In [23]:
print(dog_emb[:100])

tensor([-0.0126,  0.0339,  0.1357, -0.0303,  0.0640,  0.1357,  0.0044,  0.1660,
         0.1196, -0.0820, -0.1094, -0.0918, -0.1719,  0.2021, -0.0713, -0.0491,
         0.1533,  0.0172,  0.1177, -0.1426, -0.0664, -0.0469,  0.0298,  0.1660,
         0.1348,  0.1289, -0.1240, -0.1221, -0.1099, -0.0742,  0.1660, -0.0938,
        -0.0608,  0.1162,  0.0154,  0.1611, -0.0398,  0.1357, -0.2617,  0.1758,
         0.0874, -0.0184,  0.0603,  0.0850,  0.0588,  0.1514,  0.0903,  0.0605,
         0.1279,  0.1641,  0.0153, -0.3066,  0.0222,  0.2021, -0.0488, -0.1211,
        -0.0459,  0.0334,  0.0146,  0.1230,  0.0115,  0.1211,  0.0396,  0.0884,
         0.0267, -0.1680, -0.0400,  0.0087, -0.2715, -0.1445,  0.0471,  0.0447,
         0.0618,  0.0437, -0.0138,  0.0211,  0.0723,  0.0952, -0.1328, -0.1953,
        -0.0854,  0.1543, -0.0014, -0.1748,  0.0698, -0.0952, -0.0011,  0.0081,
        -0.1245,  0.1533, -0.1514, -0.1846, -0.1787, -0.1963, -0.0219,  0.2754,
         0.0227, -0.1445,  0.0167, -0.13

The embeddings can also be used e.g. to compare the similarity of (context-free) word representations:

In [24]:
from torch.nn.functional import cosine_similarity

def compare(emb1, emb2):
  return cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()

print(compare(dog_emb, cat_emb))
print(compare(cat_emb, hat_emb))

0.5703125
0.1640625
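Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). The following plain-Python sketch spells out the computation on made-up two-dimensional vectors; the helper name `cosine` is chosen for this illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel: close to 1
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular: 0
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # reversed: close to -1
```

Read this way, the values above say that the "cat" embedding points in a direction much closer to "dog" than to "hat".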
