# Text-Generation

Transformer language models have an uncanny ability to generate text that is almost indistinguishable from human writing. This ability emerges without any explicit supervised learning: the model simply learns to predict the next word from context across millions of web pages. Through pretraining alone, LLMs acquire a broad set of skills and pattern-recognition abilities that can be activated with different kinds of prompts.

![pretraining-sequence-of-tasks](https://github.com/JpChii/nlp-with-hugging-face/blob/main/notes/images/5-text-generation/pretraining-model-sequence-of-tasks.png?raw=1)

The image shows that addition, unscrambling, and translation are some of the sequence tasks an LLM is exposed to during pretraining. This knowledge is transferred during fine-tuning (or, for larger models, at inference time). These tasks are not chosen ahead of time; they occur naturally in huge corpora.

The advent of GPT-4 and the open-sourced Llama 2 has given rise to many applications that have an LLM's text generation capacity at their core.

In the [5-text-generation.ipynb](../notebooks/5-text-generation.ipynb) notebook we'll cover how text generation works with LLMs and how different decoding strategies impact the generated text.

In [3]:
!pip install accelerate==0.21.0

Collecting accelerate==0.21.0
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m244.2/244.2 kB[0m [31m2.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: accelerate
  Attempting uninstall: accelerate
    Found existing installation: accelerate 0.5.1
    Uninstalling accelerate-0.5.1:
      Successfully uninstalled accelerate-0.5.1
Successfully installed accelerate-0.21.0


## The Challenge with Generating Coherent Text

Until now in this series of notebooks, we used a model body with a fine-tuned head to get logits. We then applied argmax to the logits to get a predicted class, or softmax to get the prediction probabilities for each token. By contrast, converting the model's probabilistic output to text requires a *decoding method*, which introduces a few challenges unique to text generation:

* Decoding is done *iteratively* and therefore requires more compute than passing the inputs through a single forward pass.
* The *quality* and *diversity* of the generated text depend on the decoding method and its associated hyperparameters.

To understand how this decoding process works, let's start by examining how GPT-2 is pretrained and subsequently applied to generate text.

Like other *autoregressive* or *causal language models*, GPT-2 is pretrained to estimate the probability P(**y**|**x**) of a sequence of tokens **y** = y1, y2, ..., yt, given some initial context **x** = x1, x2, ..., xk. Since it's impractical to acquire enough training data to estimate P(**y**|**x**) directly, the chain rule of probability is used to factorize it as a product of *conditional probabilities*.

*Intuition: a conditional probability is the probability of token c given that tokens a and b come before it.*

![conditional-probability](https://github.com/JpChii/nlp-with-hugging-face/blob/main/notes/images/5-text-generation/llm-product-of-conditional-probabalities.png?raw=1)

The note above describes exactly the probability calculation on the right-hand side. This pretraining objective is quite different from BERT's, which utilizes both past and future contexts to predict a masked token.
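The factorization can be sketched with a toy example. The conditional table below is entirely made up, and it conditions only on the previous token to stay small, whereas a real language model conditions on the whole prefix. The sketch only illustrates that the probability of a sequence is the product of its per-token conditional probabilities, accumulated in log space for numerical stability.

```python
import math

# Hypothetical toy distributions P(next token | previous token).
# Assumption: a bigram table stands in for a real model, which would
# condition on the entire prefix.
COND = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.5, "b": 0.5},
}

def sequence_log_prob(tokens, start="<s>"):
    """Sum the log conditional probabilities (chain rule in log space)."""
    log_prob = 0.0
    prev = start
    for tok in tokens:
        log_prob += math.log(COND[prev][tok])
        prev = tok
    return log_prob

# P(a, b, a) = P(a | <s>) * P(b | a) * P(a | b) = 0.6 * 0.9 * 0.5
prob = math.exp(sequence_log_prob(["a", "b", "a"]))
print(round(prob, 4))
```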

We can generate text by predicting the next token, appending it to the sequence, and using the extended sequence as the new input to predict the following token. This iterative process continues until the model produces a special end-of-sequence token.

An example of this process is shown below:
![text-generation](https://github.com/JpChii/nlp-with-hugging-face/blob/main/notes/images/5-text-generation/text-generation.png?raw=1)
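The same loop can be sketched without a real model. In the snippet below, `NEXT_TOKEN` and `toy_generate` are hypothetical names, and a lookup table plays the role of the language model. The point is only the control flow: predict, append, feed the longer sequence back in, and stop at the end-of-sequence token (a maximum number of steps is added here as a safety limit).

```python
# Hypothetical stand-in for a language model: maps the last token to the next one.
NEXT_TOKEN = {
    "<s>": "Transformers",
    "Transformers": "are",
    "are": "great",
    "great": "<eos>",
}

def toy_generate(prompt, max_new_tokens=10, eos_token="<eos>"):
    """Autoregressive loop: append one predicted token per iteration."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN[tokens[-1]]  # "predict" the next token
        if next_token == eos_token:          # stop at end of sequence
            break
        tokens.append(next_token)            # extended sequence becomes the new input
    return tokens

print(toy_generate(["<s>"]))
```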

> **Note:** Since the output sequence is *conditioned* on the choice of input prompt, this type of text generation is often called *conditional text generation*.

At the heart of this process lies the decoding method that determines which token is selected at each time step.

A language model produces a logit for each token in the vocabulary at each time step; applying softmax to these logits gives a probability distribution over the next possible token.

![next-token-softmax](https://github.com/JpChii/nlp-with-hugging-face/blob/main/notes/images/5-text-generation/next-token-softmax.png?raw=1)
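As a small illustration of this step, the sketch below applies softmax to a handful of made-up logits. The four-token vocabulary and the logit values are assumptions chosen for readability; a real model such as GPT-2 produces one logit per entry of its full vocabulary.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical four-token vocabulary with made-up logits
vocab = ["the", "most", "only", "best"]
logits = [1.0, 3.0, 2.0, 0.5]

probs = softmax(logits)
# Print tokens from most to least probable
for token, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{token!r}: {p:.3f}")
```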

The goal of most decoding methods is to search for the most likely overall sequence by picking a y_hat such that:

![next-token-argmax](https://github.com/JpChii/nlp-with-hugging-face/blob/main/notes/images/5-text-generation/next-token-argmax.png?raw=1)


Finding y_hat directly would involve evaluating every possible sequence with the language model. Since no algorithm can do this in a reasonable amount of time, we rely on approximations instead. In this note, we'll explore a few of these approximation methods and gradually build up toward smarter and more complex algorithms that can generate high-quality text.
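A quick back-of-the-envelope calculation shows why exhaustive search is out of reach. The sketch assumes GPT-2's vocabulary of 50,257 tokens and an illustrative generation length of 8 tokens; the number of candidate sequences is the vocabulary size raised to the number of steps.

```python
vocab_size = 50257  # size of GPT-2's vocabulary
n_steps = 8         # illustrative number of tokens to generate

# Every position can hold any token, so the candidates multiply at each step
num_sequences = vocab_size ** n_steps
print(f"{num_sequences:.2e} candidate sequences for {n_steps} tokens")
```

Even for such a short continuation the count is astronomically large, which is why decoding methods only explore a tiny part of the search space.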

## Greedy Search Decoding

The simplest decoding method to get discrete tokens from a model's continuous output is to greedily select the token with the highest probability at each time step.

*Greedy search decoding argmax*
![alt](https://github.com/JpChii/nlp-with-hugging-face/blob/main/notes/images/5-text-generation/greedy-search-decoding.png?raw=1)
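Before moving to a real model, a toy sketch may help. The conditional table below is hypothetical and conditions only on the previous token. It compares greedy decoding with a brute-force search over all sequences, which is feasible here only because the vocabulary has two tokens. In this constructed example the greedy choice at the first step leads to a sequence that is less likely overall, which illustrates that greedy search is an approximation.

```python
import itertools

# Hypothetical toy model: P(next token | previous token), vocabulary {"A", "B"}
COND = {
    None: {"A": 0.6, "B": 0.4},
    "A": {"A": 0.55, "B": 0.45},
    "B": {"A": 0.9, "B": 0.1},
}

def sequence_prob(tokens):
    """Probability of a sequence as a product of conditional probabilities."""
    prob, prev = 1.0, None
    for tok in tokens:
        prob *= COND[prev][tok]
        prev = tok
    return prob

def greedy_decode(n_steps):
    """Pick the single most probable token at every step."""
    tokens, prev = [], None
    for _ in range(n_steps):
        dist = COND[prev]
        prev = max(dist, key=dist.get)
        tokens.append(prev)
    return tokens

def exhaustive_decode(n_steps):
    """Score every possible sequence (only feasible for tiny vocabularies)."""
    return list(max(itertools.product("AB", repeat=n_steps), key=sequence_prob))

greedy = greedy_decode(2)
best = exhaustive_decode(2)
print("greedy:    ", greedy, sequence_prob(greedy))
print("exhaustive:", best, sequence_prob(best))
```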

To see how greedy search works, let's load GPT-2 with a language modeling head. The code below uses the `gpt2-medium` checkpoint (roughly 355 million parameters) rather than the 1.5 billion-parameter `gpt2-xl`.

In [10]:
from accelerate import init_empty_weights
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


Now let's generate some text! Although Transformers provides a `generate()` function for autoregressive models like GPT-2, we'll implement this decoding method ourselves to understand what's going on under the hood.

In [11]:
import torch  # needed here, since this cell runs before the later import

device = "cuda" if torch.cuda.is_available() else "cpu"

In [12]:
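import torch
import pandas as pd

# Keep the model on the same device as the inputs and disable dropout
model = model.to(device)
model.eval()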

input_txt = "Transformers are the"
input_ids = tokenizer(input_txt, return_tensors="pt")["input_ids"].to(device)
# List to store dicts of the input context and the top 5 most probable next tokens
iterations = []
# Number of steps to generate tokens
n_steps = 8
# Number of choices
choices_per_step = 5

with torch.no_grad():
  # Loop to generate tokens for n_steps
  for _ in range(n_steps):
    iteration = dict()
    iteration["Input"] = tokenizer.decode(input_ids[0])

    # Get model outputs
    outputs = model(input_ids=input_ids)

    # Logits --> probabilities --> token ids sorted by descending probability
    next_token_logits = outputs.logits[0, -1, :]  # logits at the last position (-1) of the first sequence in the batch (0)
    next_token_probs = torch.softmax(next_token_logits, dim=-1)
    sorted_ids = torch.argsort(next_token_probs, dim=-1, descending=True)

    for choice_idx in range(choices_per_step):
      token_id = sorted_ids[choice_idx]
      token_prob = next_token_probs[token_id].cpu().numpy()
      # Format the choice as a plain string, e.g. " most (8.37%)"
      token_choice = f"{tokenizer.decode(token_id)} ({100 * token_prob:.2f}%)"
      iteration[f"Choice {choice_idx}"] = token_choice

    # Append the most probable token (the greedy choice) to the input sequence
    input_ids = torch.cat([input_ids, sorted_ids[None, 0, None]], dim=-1)
    iterations.append(iteration)

pd.DataFrame(iterations)

| | Input | Choice 0 | Choice 1 | Choice 2 | Choice 3 | Choice 4 |
|---|---|---|---|---|---|---|
| 0 | Transformers are the | most (8.37%) | only (3.35%) | best (2.75%) | first (2.54%) | ultimate (2.20%) |
| 1 | Transformers are the most | powerful (20.77%) | common (7.09%) | popular (5.09%) | important (3.29%) | advanced (2.72%) |
| 2 | Transformers are the most powerful | beings (9.43%) | and (8.35%) | of (4.61%) | Transformers (4.34%) | , (3.83%) |
| 3 | Transformers are the most powerful beings | in (56.16%) | on (18.99%) | known (3.12%) | of (3.09%) | to (2.18%) |
| 4 | Transformers are the most powerful beings in | the (72.89%) | existence (11.20%) | all (3.40%) | creation (1.81%) | Transformers (1.18%) |
| 5 | Transformers are the most powerful beings in the | universe (67.94%) | Universe (5.41%) | Marvel (4.40%) | Transformers (3.49%) | mult (3.47%) |
| 6 | Transformers are the most powerful beings in t... | . (35.28%) | , (34.16%) | and (12.94%) | ; (1.55%) | ! (1.34%) |
| 7 | Transformers are the most powerful beings in t... | They (32.09%) | \n (4.97%) | Their (4.93%) | The (3.85%) | But (2.88%) |


With this simple method we were able to generate the sentence "Transformers are the most powerful beings in the universe". Interestingly, this indicates that GPT-2 has internalized some knowledge about the Transformers media franchise, which was created by two companies (Hasbro and Takara Tomy).

We can also see other possible continuations at each step, highlighting the iterative nature of text generation. Unlike sequence classification tasks, where a single forward pass suffices to generate the predictions, text generation requires decoding the output tokens one at a time.

Let's try out the Transformers `generate()` function, which we can later use to explore more sophisticated decoding strategies.

In [14]:
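# `input_ids` still holds the prompt plus the n_steps tokens appended by the
# manual greedy loop above, so generate() continues from that longer sequence.
# To restart from the bare prompt, pass `input_tokens` instead.
input_tokens = tokenizer(
    input_txt,
    return_tensors="pt"
)["input_ids"].to(device)
output = model.generate(input_ids, max_new_tokens=n_steps, do_sample=False)
print(tokenizer.decode(output[0]))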

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Transformers are the most powerful beings in the universe. They are the creators of the universe, and
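The warning above asks for an explicit `attention_mask`, and it reports that `pad_token_id` was set to `eos_token_id`. One possible way to address both is sketched below. The helper name `greedy_generate` is hypothetical and is not part of Transformers. It assumes `model` and `tokenizer` are a causal language model and its tokenizer, as loaded earlier, and it relies on the tokenizer returning both `input_ids` and `attention_mask`.

```python
def greedy_generate(model, tokenizer, prompt, max_new_tokens=8):
    """Greedy decoding with generate(), passing the attention mask explicitly."""
    # The tokenizer output contains `input_ids` and `attention_mask`
    encoded = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **encoded,                            # forwards input_ids and attention_mask
        max_new_tokens=max_new_tokens,
        do_sample=False,                      # greedy search
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
    return tokenizer.decode(output[0])

# Usage, assuming `model` and `tokenizer` are the objects loaded earlier:
# print(greedy_generate(model, tokenizer, "Transformers are the"))
```

Because the helper tokenizes the prompt itself, it always starts from the bare prompt rather than from a sequence extended by an earlier cell.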
