<a href="https://colab.research.google.com/github/TurkuNLP/Deep_Learning_in_LangTech_course/blob/master/generation_basics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generation basics

This notebook illustrates some of the steps that go into text generation, starting from the `pipeline` function of the `transformers` library.

First install the [transformers](https://huggingface.co/docs/transformers/index) package.

In [None]:
!pip install --quiet transformers

Next, load a `pipeline` for text generation with a small model.

In [None]:
from transformers import pipeline

MODEL_NAME = 'HuggingFaceTB/SmolLM-135M'

pipe = pipeline('text-generation', model=MODEL_NAME)

We can conveniently generate text using the high-level abstraction that `pipeline` provides:

In [None]:
prompt = 'The capital of Finland is'

print(pipe(prompt)[0]['generated_text'])

For simplicity, let's look at generating a single token using greedy decoding, i.e. simply selecting the token that's most likely according to the model.

In [None]:
params = {
    'do_sample': False,
    'max_new_tokens': 1,
}

print(pipe(prompt, **params)[0]['generated_text'])

Now, let's look at what's going on behind the `pipeline` abstraction. First, here's the model:

In [None]:
model = pipe.model
print(model)

The model doesn't actually deal with text directly, but rather with token indices. The mapping between running text and token indices is implemented by a tokenizer.

In [None]:
tokenizer = pipe.tokenizer

Let's have a look at the mapping between our prompt and the token indices.

In [None]:
input_ = tokenizer(prompt)

print(input_)

That's the actual input to the model, and `input_ids` are the token indices. (We can ignore the `attention_mask` here.)

The tokenizer can map these indices back to their token strings:

In [None]:
print(tokenizer.convert_ids_to_tokens(input_.input_ids))

(The `Ġ` there encodes a space; this representation is a minor quirk of GPT-2-style byte-level tokenizers.)

Because the token ids represent both visible characters and spaces, the full input string can be reconstructed accurately:

In [None]:
print(tokenizer.decode(input_.input_ids))
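To make the round trip concrete, here is a minimal sketch of a toy tokenizer. The vocabulary, `toy_encode`, and `toy_decode` are hypothetical and word-level, made up for illustration; the real tokenizer works on subword units and has a far larger vocabulary. The sketch only shows why marking spaces inside the tokens makes decoding lossless.

```python
# Toy illustration only: a hypothetical word-level vocabulary, not the model's real one.
SPACE_MARK = 'Ġ'  # marks a token that begins with a space

toy_vocab = ['The', 'Ġcapital', 'Ġof', 'ĠFinland', 'Ġis']
token_to_id = {tok: i for i, tok in enumerate(toy_vocab)}

def toy_encode(text):
    # Split on spaces, then mark every word after the first as space-prefixed.
    words = text.split(' ')
    tokens = [words[0]] + [SPACE_MARK + w for w in words[1:]]
    return [token_to_id[t] for t in tokens]

def toy_decode(ids):
    # Concatenate the tokens and turn the marker back into a real space.
    return ''.join(toy_vocab[i] for i in ids).replace(SPACE_MARK, ' ')

toy_ids = toy_encode('The capital of Finland is')
print(toy_ids)
print(toy_decode(toy_ids))
```

Because the space is part of the token itself, plain concatenation is enough to recover the original string.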

We can invoke the model directly with the encoded input. Here we need to ask the tokenizer to return PyTorch tensors, since that is what the model expects as input, but the information content is the same.

In [None]:
input_ = tokenizer(prompt, return_tensors='pt')

print(input_)

In [None]:
output = model(**input_)
print(output)

The primary output of the model is the logits, which are unnormalized scores for each token in the vocabulary, given at every input position. We're interested in the scores at the last position, which are used to predict the next token. (The first dimension here is for the batch, and we have a batch of one.)

In [None]:
logits = output.logits[0][-1]
print(logits.shape)
print(logits)
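Since the logits are unnormalized, they can be turned into a probability distribution with a softmax. The following is a minimal pure-Python sketch on made-up scores for a hypothetical four-token vocabulary; `softmax`, `toy_logits`, and `toy_probs` are illustrative names, not part of the notebook.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 0.5, -1.0, 0.0]   # made-up scores for a 4-token vocabulary
toy_probs = softmax(toy_logits)

print(toy_probs)
print(sum(toy_probs))
```

Softmax preserves the ordering of the scores, so the highest logit is also the highest probability. On the notebook's tensor, `torch.softmax(logits, dim=-1)` should give the corresponding distribution over the full vocabulary.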

For greedy decoding, we can just take the argmax, which gives us the index of the most likely next token.

In [None]:
logits.argmax()

In [None]:
tokenizer.convert_ids_to_tokens([logits.argmax()])

If we wanted to continue generating more than one token, we would simply append this index to `input_ids` and invoke the model again.
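The append-and-repeat loop can be sketched as follows. To keep the example self-contained, a hypothetical `toy_next_token_logits` function with made-up scores stands in for the model; unlike a real language model, it looks only at the last token id rather than the whole sequence.

```python
# Hypothetical stand-in for the model: made-up next-token scores that
# depend only on the last token id. A real model uses the whole sequence.
TOY_TABLE = {
    0: [0.1, 2.0, 0.3, 0.2],
    1: [0.0, 0.1, 1.5, 0.4],
    2: [0.2, 0.1, 0.3, 3.0],
    3: [1.0, 0.2, 0.1, 0.5],
}

def toy_next_token_logits(ids):
    return TOY_TABLE[ids[-1]]

def greedy_generate(ids, max_new_tokens):
    ids = list(ids)                      # copy so the caller's list is untouched
    for _ in range(max_new_tokens):
        scores = toy_next_token_logits(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)  # argmax
        ids.append(next_id)              # the new token becomes part of the input
    return ids

print(greedy_generate([0], max_new_tokens=3))
```

With the real model, the same loop would call the model on the extended `input_ids` at each step and take the argmax of the logits at the last position.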

---

### Embeddings

The model operates with continuous representations rather than discrete word identifiers. To map the `input_ids` to a continuous representation, the first step in the model application is to look up a learned (context-independent) embedding. These are similar to the embeddings generated by methods such as `word2vec`.

(We already saw an instance of mapping from continuous representations back to discrete IDs on the output side, through `argmax`.)

Let's have a quick look at the embeddings for this model.

In [None]:
embedding = model.get_input_embeddings()
print(embedding)

Here the first value is the number of tokens in the model vocabulary and the second the hidden dimension of the model.
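Conceptually, an embedding lookup is just row selection from a matrix with one row per vocabulary entry. A minimal sketch with a made-up table (4 tokens, hidden dimension 3); `toy_table` and `toy_embed` are hypothetical names, and the values are arbitrary rather than learned:

```python
# Toy embedding table: 4 tokens in the vocabulary, hidden dimension 3.
# The values are made up; a real model learns them during training.
toy_table = [
    [0.1, -0.2, 0.3],
    [0.0, 0.5, -0.5],
    [0.9, 0.1, 0.0],
    [-0.3, 0.3, 0.7],
]

def toy_embed(ids):
    # Looking up an embedding is just selecting the row with that index.
    return [toy_table[i] for i in ids]

vectors = toy_embed([2, 0, 2])
print(vectors)
```

Note that the same id always yields the same vector regardless of its position or neighbours, which is what "context-independent" means here.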

Let's write a few functions for getting IDs and embeddings for individual words and look at the IDs and embeddings for the words "dog", "cat", and "hat".

In [None]:
def get_id(word):
  ids = tokenizer(' ' + word).input_ids    # add initial space
  assert len(ids) == 1, f'multiple tokens for {word}'
  return ids[0]

dog_id = get_id('dog')
cat_id = get_id('cat')
hat_id = get_id('hat')

The IDs themselves are arbitrary and meaningless outside of their reference to the embedding matrix:

In [None]:
print(dog_id, cat_id, hat_id)

Let's grab the corresponding embeddings:

In [None]:
import torch

def get_embedding(id_):
  return embedding(torch.tensor(id_))

dog_emb = get_embedding(dog_id)
cat_emb = get_embedding(cat_id)
hat_emb = get_embedding(hat_id)

These are vectors with the hidden dimensionality of the model:

In [None]:
print(dog_emb.shape)

The individual values of an embedding cannot be interpreted in isolation, but they are "understood" by the model. Here are the first 100 values for "dog":

In [None]:
print(dog_emb[:100])

The embeddings can also be used e.g. to compare the similarity of (context-free) word representations:

In [None]:
from torch.nn.functional import cosine_similarity

def compare(emb1, emb2):
  return cosine_similarity(emb1.unsqueeze(0), emb2.unsqueeze(0)).item()

print(compare(dog_emb, cat_emb))
print(compare(cat_emb, hat_emb))
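For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction and ignores magnitude. Below is a minimal pure-Python sketch on made-up two-dimensional vectors; `toy_cosine` is an illustrative helper, not the PyTorch implementation used above.

```python
import math

def toy_cosine(u, v):
    # Dot product divided by the product of the Euclidean norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(toy_cosine([1.0, 0.0], [2.0, 0.0]))   # same direction
print(toy_cosine([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(toy_cosine([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Values near 1 indicate vectors pointing the same way, values near 0 indicate unrelated directions, and negative values indicate opposing directions.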