### Notebook will share how different text generation functions work

- How to generate Text

- Avoid common pitfall

- Getting most out of the llm

=> setting up the token selection step and the stopping condition is essential to make your model behave as you’d expect on your task



In [9]:
from transformers import AutoModelForCausalLM

# model_path = "metallama/Llama-2-7b"
# model_path = "google/flan-t5-base"  # facing error with AutoModelForCausalLM
model_path = "facebook/opt-350m"
# local_model = "/home/kamal/.cache/huggingface/hub/models--google--flan-t5-base/"
local_model = "/home/kamal/.cache/huggingface/hub/models--facebook--opt-350m/refs/main"

model = AutoModelForCausalLM.from_pretrained(model_path,
                                             device_map='auto',
                                             load_in_4bit=True,)

In [11]:
model.device

device(type='cuda', index=0)

In [12]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left") 
# reason for left is explained below
model_ins = tokenizer(["A list of colored donuts: red, pink"], 
                      return_tensors='pt').to('cuda')

In [13]:
model_ins

{'input_ids': tensor([[    2,   250,   889,     9, 20585,   218,  7046,    35,  1275,     6,
          6907]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

In [14]:
generated_outs = model.generate(**model_ins)
generated_outs



tensor([[    2,   250,   889,     9, 20585,   218,  7046,    35,  1275,     6,
          6907,     6,  2272,     6,  2440,     6,     8,  5718,     4, 50118]],
       device='cuda:0')

In [15]:
tokenizer.batch_decode(generated_outs, skip_special_tokens=True)

['A list of colored donuts: red, pink, green, blue, and yellow.\n']

In [16]:
tokenizer.pad_token = tokenizer.eos_token
model_ins = tokenizer(
    ["there are lots of fighter", "Challenger Space"], return_tensors='pt', padding=True
).to('cuda')
gen_outs = model.generate(**model_ins)
tokenizer.batch_decode(gen_outs, skip_special_tokens=True)



['there are lots of fighter jets in the air, but they are not the only ones. ',
 'Challenger Space Program\n\nThe Challenger Space Program was a space program that was launched']

In [17]:
# We highly recommend manually setting max_new_tokens in your generate call to 
# control the maximum number of new tokens it can return.

tokenizer.pad_token = tokenizer.eos_token

model_ins = tokenizer(
    ["there are lots of fighter", "Challenger Space"], return_tensors='pt', padding=True
).to('cuda')

gen_outs = model.generate(**model_ins, max_new_tokens=25)  # max_new_tokens is set

tokenizer.batch_decode(gen_outs, skip_special_tokens=True)

['there are lots of fighter jets in the air, but they are not the only ones.            ',
 'Challenger Space Program\n\nThe Challenger Space Program was a space program that was launched by the Challenger Space Shuttle in 1986. The program was']

In [19]:
# Input-grounded tasks like audio transcription or translation benefit from greedy decoding
# Creative tasks / writing essays will suffer from greedy decoding

from transformers import set_seed
set_seed(42)

model_inputs = tokenizer(["I am a bot"],
                         return_tensors='pt').to('cuda')

generated_outs = model.generate(**model_inputs,
                                max_new_tokens=30)

print(tokenizer.batch_decode(generated_outs, skip_special_tokens=True)[0])

generated_outs = model.generate(**model_inputs,
                                do_sample=True,
                                max_new_tokens=30)
print('creative out')
print(tokenizer.batch_decode(generated_outs, skip_special_tokens=True)[0])


I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/oldpeoplefacebook
creative out
I am a bot, why do you have a thread in r/NBA?
The bot that gets the top posts out to the front page doesn't have this thread


In [21]:
# decoder only archs continue to iterate over your input prompts. 
# reason for left padding is to llm are not trained to continue from pad tokens

mod_ins = tokenizer(
    ['1, 2, 3', 'a, b, k, d, e'], padding=True, return_tensors="pt"
)
gen_outs = model.generate(**mod_ins, max_new_tokens=36)

print(tokenizer.batch_decode(gen_outs, skip_special_tokens=True)[0])



1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21


In [22]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a thug",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]

In [24]:
model_chats = tokenizer.apply_chat_template(messages,
                                            # add_generation_prompt=True,
                                            return_tensors='pt',).to('cuda')
in_length = model_chats.shape[1]
in_length

Using sep_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.


28

In [31]:
gen_ids = model.generate(model_chats, do_sample=True, max_new_tokens=50)
print(tokenizer.batch_decode(gen_ids[:, in_length:], skip_special_tokens=True)[0])

Djokovic, Novak to defend Paris opener
Serbian Open tennis star Novak Djokovic made a big statement before his Paris Open debut, defeating France's Laurent Grosjean 6-3 4-6 6-4,


In [32]:
# continue with the text generation strategies
# https://huggingface.co/docs/transformers/generation_strategies

This guide describe


default generation conf
    
    - The default generation configuration limits the size of the output combined with the input prompt to a maximum of 20 tokens to avoid running into resource limitations. 
    
    - The default decoding strategy is greedy search, which is the simplest decoding strategy that picks a token with the highest probability as the next token. 
    
    - For many tasks and small output sizes this works well

n
common decoding strategies and their main paramet

rs
saving and sharing custom generation configurations with your fine-tuned model on 🤗 Hub

In [34]:
model.generation_config

GenerationConfig {
  "bos_token_id": 2,
  "eos_token_id": 2,
  "pad_token_id": 1
}

## Four params of generate method

**max_new_tokens:** the maximum number of tokens to generate. Does not including the tokens in the prompt. Alternative to stopping criteria
**
num_beams**: by specifying a number of beams higher than 1, you are effectively switching from greedy search to beam search. This has the** advantage of identifying high-probability sequence**s that start with a lower probability initial tokens and would’ve been ignored by the greedy searc

.
do_sample: if set to True, this parameter enables decoding strategies such 

    - s multinomial sampling
    
    - , beam-search multinomial sampling
    
    - , Top-K sampling a
    
    - d Top-p samplig

s.
num_return_sequences: the number of sequence candidates to return for each input. This option is only available for the decoding strategies that support multiple sequence candidates, e.g. variations of beam search and samplin
g. Decoding strategies like greedy search and contrastive search return a single output sequence

In [35]:
from transformers import GenerationConfig

generation_config = GenerationConfig(
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    eos_token_id=model.config.eos_token_id
)

generation_config.save_pretrained("/home/kamal/training_files/generation_config")

In [36]:
translation_generation_config = GenerationConfig(
    num_beams=4,
    early_stopping=True,
    decoder_start_token_id=0,
    eos_token_id=model.config.eos_token_id,
    pad_token=model.config.pad_token_id,
)

translation_generation_config.save_pretrained("/home/kamal/training_files/translation_config")

In [38]:
inputs = tokenizer("translate English to French: Configuration files are easy to use!", 
                   return_tensors="pt")

outputs = model.generate(**inputs,
                         generation_config=translation_generation_config)
# need to use a different model, nt opt-350m
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


RuntimeError: "log_softmax_lastdim_kernel_impl" not implemented for 'Half'

In [42]:
from transformers import TextStreamer

model = AutoModelForCausalLM.from_pretrained("gpt2")

tok = AutoTokenizer.from_pretrained("gpt2")

inputs = tok(["An incrementing seq: one"], return_tensors='pt')

streamer = TextStreamer(tok)

_ = model.generate(**inputs, streamer=streamer, max_new_tokens=100)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


An incrementing seq: one of the following:

1. The first element of the seq is the first element of the seq.

2. The second element of the seq is the second element of the seq.

3. The third element of the seq is the third element of the seq.

4. The fourth element of the seq is the fourth element of the seq.

5. The fifth element of the seq is the fifth element of the seq.

6. The sixth


In [45]:
# generate uses greedy search decoding by default so you don’t have to pass any parameters to enable 

# two main parameters that enable and control the behavior of contrastive search 
# are penalty_alpha and top_k:

inputs = tok(["Apple Iphone is"], return_tensors='pt')

outputs = model.generate(**inputs,
                         penalty_alpha=0.6,
                         top_k=4,
                         max_new_tokens=100)

tokenizer.batch_decode(outputs, skip_special_tokens=True)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[' illicit leftpr Ifu main west in m newsr Sou main west for nationalr Sou main west for nationalr Sou main west for nationalr Sou main west for nationalr Sou main west for nationalr Sou main west for nationalr Sou main west for nationalr Sou main west for nationalr Sou main west for nationalr Sou main west for nationalr Sou main west for nationalr Sou main west for nationalr Sou main west for nationalr Sou']

As opposed to greedy search that always chooses a token with the highest probability as the next token, multinomial sampling (also called ancestral sampling) randomly selects the next token based on the probability distribution over the entire vocabulary given by the model.


In [18]:
outputs = model.generate(**inputs,
                         do_sample=True,
                         num_beams=1,
                         max_new_tokens=100,
                        temperature=0.3)

tokenizer.batch_decode(outputs, skip_special_tokens=True)

['. These design principles allow us to design the most efficient and sustainable human habitation and food production systems...... are. that can be applied to any location, climate and culture.. and and, combines and reflects the best of these disciplines. and.... Each design principle itself is a complete conceptual framework based on sound scientific principles...']

In [19]:
# In Top-K sampling, the K most likely next words are filtered and the probability mass is 
# redistributed among only those K next words. 
outputs = model.generate(**inputs,
                         do_sample=True,
                         num_beams=1,
                         max_new_tokens=100,
                        temperature=0.3,
                        top_k=5)

tokenizer.batch_decode(outputs, skip_special_tokens=True)

['. Each design principle embodies a set of universal design principles that can be applied to any location, climate and culture..... are a set of universal design principles... that can be applied to any.. and, and and. The design principles are a set of universal design principles that.. design principles that are. into one.']

In [22]:
#  Top-p sampling chooses from the smallest possible set of 
# words whose cumulative probability exceeds the probability p.
# deactivate top_k sampling and sample only from 92% most likely words
sample_output = model.generate(
    **inputs, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

Output:
----------------------------------------------------------------------------------------------------
. The Permaculture Design Principles are a set of. We call these a “Plant System” because they are all parts of a plant, and every element is a “plant”..


In [47]:
# Unlike greedy search, beam-search decoding keeps several hypotheses at each time step and eventually 
# chooses the hypothesis that has the overall highest probability for the entire sequence.

outputs = model.generate(**inputs,
                         num_beams=5,
                         max_new_tokens=50)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[' illicit leftpr If claimedz morning- Jordanum for around around learn There laterF industrial never stronger in04 detailsó family supported water for around around learn There laterF industrial never stronger in04 detailsó family supported water for target There laterF industrial never stronger in04']


In [14]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "t5-large"
prompt = "It is astonishing how one can"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).to('cuda')

outputs = model.generate(**inputs,
                         num_beams=3,
                         do_sample=True,
                         max_new_tokens=100)

In [None]:
beam_output = model.generate(
    **inputs, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
     

In [None]:
# set return_num_sequences > 1 and num_return_sequence <= num_beams
beam_outputs = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, 
    num_return_sequences=5, 
    early_stopping=True
)

for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

In [15]:
outputs

tensor([[    0, 32099,   103,    24,     5,     1]], device='cuda:0')

In [17]:
prompt = (
    "The Permaculture Design Principles are a set of universal design principles "
    "that can be applied to any location, climate and culture, and they allow us to design "
    "the most efficient and sustainable human habitation and food production systems. "
    "Permaculture is a design system that encompasses a wide variety of disciplines, such "
    "as ecology, landscape design, environmental science and energy conservation, and the "
    "Permaculture design principles are drawn from these various disciplines. Each individual "
    "design principle itself embodies a complete conceptual framework based on sound "
    "scientific principles. When we bring all these separate  principles together, we can "
    "create a design system that both looks at whole systems, the parts that these systems "
    "consist of, and how those parts interact with each other to create a complex, dynamic, "
    "living system. Each design principle serves as a tool that allows us to integrate all "
    "the separate parts of a design, referred to as elements, into a functional, synergistic, "
    "whole system, where the elements harmoniously interact and work together in the most "
    "efficient way possible."
)

inputs = tokenizer(prompt, return_tensors="pt").to('cuda')

outputs = model.generate(**inputs,
                         num_beams=5,
                         num_beam_groups=5,
                         max_new_tokens=30,
                         diversity_penalty=1.0)

tokenizer.decode(outputs[0], skip_special_tokens=True)

'. The Permaculture Design Principles are a set of universal design principles that can be applied to any location, climate and culture.'

In [27]:
import torch
# set seed to reproduce results. Feel free to change the seed though to get different results

# set top_k = 50 and set top_p = 0.95 and num_return_sequences = 3
sample_outputs = model.generate(
    **inputs,
    do_sample=True, 
    max_length=50, 
    top_k=50, 
    top_p=0.95, 
    num_return_sequences=3
)

print("Output:\n" + 100 * '-')
for i, sample_output in enumerate(sample_outputs):
  print("{}: {}".format(i, tokenizer.decode(sample_output, skip_special_tokens=True)))

Output:
----------------------------------------------------------------------------------------------------
0: , or to. The Permaculture Design Principles are a set of universal design principles.. The Permaculture Design Principles are a set of universal design principles that can be applied to any location
1: . Each design principle is used to help us create systems that are effective and sustainable. The Permaculture Design Principles are a set of universal design principles that can be applied to any location, climate and culture..
2: . Each design principle serves as a tool that allows us to integrate all the separate parts of a design, referred to as elements, into a functional, dynamic, whole system.....


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
set_seed(42)  # For reproducibility

prompt = "Alice and Bob"
checkpoint = "EleutherAI/pythia-1.4b-deduped"
assistant_checkpoint = "EleutherAI/pythia-160m-deduped"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained(checkpoint)
assistant_model = AutoModelForCausalLM.from_pretrained(assistant_checkpoint)
outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.5)
tokenizer.batch_decode(outputs, skip_special_tokens=True)