# Demonstrating Different Decoding Strategies with Hugging Face Transformers
In this notebook, we will explore various decoding strategies for text generation using a small encoder-decoder model from Hugging Face's Transformers library. We'll apply these strategies to two different datasets:

- Translation task, where deterministic strategies are expected to perform better.
- Story generation task, where stochastic strategies might yield more diverse and creative outputs.

The decoding strategies we'll test include:

1. Greedy Search
2. Beam Search
3. Temperature Sampling
4. Top-k Sampling
5. Top-p (Nucleus) Sampling
6. Contrastive Search

We'll define a custom ```generate_text``` function to apply these strategies and evaluate their performance using appropriate metrics for each dataset.
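
To make the difference between these strategies concrete, here is a minimal, library-free sketch of how a single next token could be chosen from a toy distribution. The vocabulary, the logits, and the helper names (`softmax`, `top_k_filter`, `top_p_filter`, `sample`) are illustrative assumptions, not part of Transformers, whose actual implementation works on tensors and differs in detail.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(dist, rng):
    """Draw one token index from a {index: probability} mapping."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy vocabulary and logits (made up for illustration)
vocab = ["the", "a", "cat", "dog", "zebra"]
logits = [2.0, 1.5, 1.0, 0.5, -1.0]

probs = softmax(logits)
greedy_token = vocab[max(range(len(probs)), key=lambda i: probs[i])]  # greedy: argmax
sharper = softmax(logits, temperature=0.5)   # temperature < 1 concentrates mass on the top token
top_k_dist = top_k_filter(probs, k=2)        # only the 2 most likely tokens survive
top_p_dist = top_p_filter(probs, p=0.8)      # smallest set covering at least 80% of the mass

rng = random.Random(0)
sampled_token = vocab[sample(top_p_dist, rng)]
print(greedy_token, sorted(top_k_dist), sorted(top_p_dist), sampled_token)
```

Greedy search always takes the argmax, whereas the sampling strategies first reshape or truncate the distribution and then draw from it, which is where the extra diversity comes from.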

Here are some useful links you might want to check:
- [Auto Classes](https://huggingface.co/docs/transformers/model_doc/auto)
- [transformers.AutoModel](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModel)
- [transformers.AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer)
- [Google's T5](https://arxiv.org/pdf/1910.10683)
- [T5-small](https://huggingface.co/google-t5/t5-small)
- [Flan-T5-small](https://huggingface.co/google/flan-t5-small)

In [2]:
# !pip install transformers datasets sacrebleu rouge_score evaluate --quiet

In [3]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset
import random
import evaluate



### Load the Pre-trained Model and Tokenizer

We'll use the ```flan-t5-small``` model, a small encoder-decoder model suitable for both translation and summarization tasks. It is based on Google's ```t5-small``` model, but fine-tuned on more than 1000 additional tasks that also cover more languages.

In [4]:
model_name = 'google/flan-t5-small'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)




In [5]:
def generate_text(input_text, strategy='greedy', max_length=50, **kwargs):
    """Generates text based on the specified decoding strategy.

    Args:
        input_text (str): The input text to be processed by the model.
        strategy (str, optional): The decoding strategy to use. Defaults to 'greedy'.
            Options include:
            - 'greedy': Greedy search decoding.
            - 'beam': Beam search decoding.
            - 'temperature': Temperature sampling.
            - 'top-k': Top-k sampling.
            - 'top-p': Top-p (nucleus) sampling.
            - 'contrastive': Contrastive search decoding.
        max_length (int, optional): The maximum length of the generated text. Defaults to 50.
        **kwargs: Additional keyword arguments specific to the decoding strategy.

    Keyword Args:
        num_beams (int, optional): Number of beams for beam search. Defaults to 5.
            Used when `strategy='beam'`.
        temperature (float, optional): Sampling temperature. Defaults to 1.0.
            Used when `strategy='temperature'`.
        top_k (int, optional): The number of highest probability vocabulary tokens to keep for top-k filtering.
            Defaults to 50 when `strategy='top-k'` and to 4 when `strategy='contrastive'`.
        top_p (float, optional): Cumulative probability for nucleus sampling. Defaults to 0.95.
            Used when `strategy='top-p'`.
        penalty_alpha (float, optional): Contrastive search penalty factor. Defaults to 0.6.
            Used when `strategy='contrastive'`.

    Returns:
        str: The generated text based on the decoding strategy.

    Raises:
        ValueError: If an unknown decoding strategy is specified.
    """
    input_ids = tokenizer.encode(input_text, return_tensors='pt')
    model.eval()
    with torch.no_grad():
        if strategy == 'greedy':
            output_ids = model.generate(input_ids, max_length=max_length)
        elif strategy == 'beam':
            num_beams = kwargs.get('num_beams', 5)
            output_ids = model.generate(
                input_ids, max_length=max_length, num_beams=num_beams, early_stopping=True
            )
        elif strategy == 'temperature':
            temperature = kwargs.get('temperature', 1.0)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, temperature=temperature
            )
        elif strategy == 'top-k':
            top_k = kwargs.get('top_k', 50)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, top_k=top_k
            )
        elif strategy == 'top-p':
            top_p = kwargs.get('top_p', 0.95)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, top_p=top_p
            )
        elif strategy == 'contrastive':
            penalty_alpha = kwargs.get('penalty_alpha', 0.6)
            top_k = kwargs.get('top_k', 4)
            output_ids = model.generate(
                input_ids, max_length=max_length, penalty_alpha=penalty_alpha, top_k=top_k
            )
        else:
            raise ValueError("Unknown strategy: {}".format(strategy))

    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output_text
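
As a design note, the `if`/`elif` chain above can also be expressed as a lookup table that maps each strategy name to the keyword arguments passed to `model.generate`. The sketch below is one possible arrangement under that assumption; `build_generation_kwargs` is a hypothetical helper, not a Transformers function, and it only builds a dictionary without calling the model.

```python
def build_generation_kwargs(strategy, **overrides):
    """Return the keyword arguments to forward to `model.generate` for a strategy.

    Hypothetical helper: the default values mirror those used in `generate_text`.
    """
    defaults = {
        'greedy': {},
        'beam': {'num_beams': 5, 'early_stopping': True},
        'temperature': {'do_sample': True, 'temperature': 1.0},
        'top-k': {'do_sample': True, 'top_k': 50},
        'top-p': {'do_sample': True, 'top_p': 0.95},
        'contrastive': {'penalty_alpha': 0.6, 'top_k': 4},
    }
    if strategy not in defaults:
        raise ValueError(f"Unknown strategy: {strategy}")

    kwargs = dict(defaults[strategy])
    # Only overrides that belong to the chosen strategy are applied; the rest
    # are ignored, matching how generate_text reads just the keys it needs.
    for name, value in overrides.items():
        if name in kwargs:
            kwargs[name] = value
    return kwargs

print(build_generation_kwargs('beam', num_beams=3, temperature=0.6))
print(build_generation_kwargs('top-p', top_p=0.8))
```

With such a helper, the body of `generate_text` could reduce to a single call along the lines of `model.generate(input_ids, max_length=max_length, **build_generation_kwargs(strategy, **kwargs))`. Note that `generate` may also apply defaults of its own when sampling, depending on the library version and the model's generation config, so it is worth inspecting `model.generation_config` before attributing a result to a single parameter.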


### Use case 1: Neural Machine Translation
The WMT16 English-German dataset is a collection of parallel sentences in English and German used for machine translation tasks. It is part of the Conference on Machine Translation (WMT) shared tasks, which are benchmarks for evaluating machine translation systems. The dataset contains professionally translated sentences and covers a variety of topics, making it ideal for training and evaluating translation models.

We'll use a subset (1% of the test split) of the dataset for evaluation purposes.

In [6]:
dataset_nmt = load_dataset('wmt16', 'de-en', split='test[:1%]')


In [7]:
# Randomly select a sample from the dataset
sample = random.choice(dataset_nmt['translation'])
input_text = sample['de']
target_text = sample['en']

print("Input Text:")
print(input_text)
print("\nTarget Text:")
print(target_text)


Input Text:
Die von den Ermittlern an beiden Enden des Staates veröffentlichten Details, wie auch das, was Studenten und Mitarbeiter, die ihn kannten, aussagten, half dabei, ein Bild von einem talentierten, aber möglicherweise schwierigen Lehrer zu zeichnen.

Target Text:
The details released by investigators at both ends of the state as well as students and staff who knew him helped paint a picture of a talented but possibly troubled teacher.


We need to preprocess the dataset to prepare it for input into the T5 model. T5 expects input in a specific format that includes a task prefix: because it is a multi-purpose model that can perform several tasks, we need to tell it which one to carry out.

In [8]:
def preprocess_translation(examples: list[dict]) -> dict:
    """Preprocesses a list of examples for the translation task.

    Args:
        examples (list[dict]): A list of dictionaries with 'de' and 'en' keys holding the German and English sentences.

    Returns:
        dict[str, list[str]]: A dictionary with 'src_texts' and 'tgt_texts' keys for model inputs and targets.

    The function:
    - Adds the task prefix 'translate German to English: ' to the German sentence.
    - Stores the result in 'src_texts'.
    - Copies the English sentence to 'tgt_texts'.
    """
    # Empty dictionary
    texts = {}
    
    # T5 expects a "translate German to English: " prefix
    texts['src_texts'] = ['translate German to English: ' + ex['de'] for ex in examples]
    texts['tgt_texts'] = [ex['en'] for ex in examples]
    return texts

dataset_nmt_preproc = preprocess_translation(dataset_nmt["translation"])


Let's see how to generate a translation with the T5 model. For a random sentence from the dataset, we create translations using each of the implemented strategies:

In [21]:
# Obtain source and target texts
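# randrange excludes the upper bound, so the index is always valid
# (randint includes it and could raise an IndexError)
random_index = random.randrange(len(dataset_nmt_preproc['src_texts']))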
random_src_sentence = dataset_nmt_preproc['src_texts'][random_index]
random_tgt_sentence = dataset_nmt_preproc['tgt_texts'][random_index]

# Generate a translation with each strategy
translation_greedy = generate_text(random_src_sentence, strategy='greedy')
translation_beam_search = generate_text(random_src_sentence, strategy='beam', num_beams=5)
translation_temperature = generate_text(random_src_sentence, strategy='temperature', temperature=0.7)
translation_top_k = generate_text(random_src_sentence, strategy='top-k', top_k=50)
translation_top_p = generate_text(random_src_sentence, strategy='top-p', top_p=0.95)
translation_contrastive = generate_text(random_src_sentence, strategy='contrastive', top_k=4, penalty_alpha=0.6)

print(random_src_sentence)
print(f"Original translation: \n{random_tgt_sentence}")
print(f"Greedy Search: \n{translation_greedy}")
print(f"Beam Search: \n{translation_beam_search}")
print(f"Temperature Sampling: \n{translation_temperature}")
print(f"Top-k Sampling: \n{translation_top_k}")
print(f"Top-p Sampling: \n{translation_top_p}")
print(f"Contrastive Search: \n{translation_contrastive}")

translate German to English: Im März hatte Netanyahu auf Einladung der Republikaner vor dem US-Kongress eine umstrittene Rede gehalten, die teils als Affront gegen Obama gewertet wurde.
Original translation: 
In March, at the invitation of the Republicans, Netanyahu made a controversial speech to the US Congress, which was partly seen as an affront to Obama.
Greedy Search: 
In March, Netanyahu was invited to the US Congress to participate in a re-election campaign, which was part of the Obama campaign.
Beam Search: 
In March, Netanyahu was invited to the US Congress to represent the US Congress, which was part of the Obama campaign.
Temperature Sampling: 
In March, Netanyahu held the president's invitation to the US Congress, which included part of the White House voting against Obama.
Top-k Sampling: 
In March, Netanyahu had been presented with an invitation of the former congress members to the State Board of Representatives by the US Congress. In March, the re-elections were held am

- Do you see something different between the deterministic and the stochastic strategies? Try different random sentences.
>>> Yes. Deterministic methods tend to produce more coherent translations, although they sometimes get stuck repeating the same words, while stochastic methods produce more diverse outputs that sometimes hallucinate content. Contrastive search appears to strike a balance between the two.

### Evaluation Metric: BLEU Score
#### What is BLEU Score?
The BLEU (Bilingual Evaluation Understudy) score is a metric for evaluating the quality of text that has been machine-translated from one language to another. It compares a candidate translation to one or more reference translations and calculates a score based on the overlap of n-grams (contiguous sequences of words).

BLEU-4: Considers up to 4-gram matches between the candidate and reference translations. It provides a balance between precision (matching words) and fluency (maintaining the structure of the language). Using BLEU-4 allows us to capture not just individual word matches (unigrams) but also phrases of up to four words. This makes the evaluation more sensitive to the quality of the translation in terms of both accuracy and fluency.

We'll use the ```sacrebleu``` implementation for a standardized BLEU score calculation (which is BLEU-4). You can check the details [here](https://aclanthology.org/W14-3346.pdf).
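
To illustrate what the n-gram overlap means, the following is a simplified sketch of sentence-level BLEU with a single reference, whitespace tokenization, and no smoothing. It is not the ```sacrebleu``` implementation, which applies its own tokenization, works at corpus level, supports smoothing, and reports scores on a 0-100 scale; the helper names `ngrams`, `modified_precision`, and `simple_bleu` are assumptions made for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: a candidate n-gram counts at most as often
    as it appears in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed, single-reference BLEU on whitespace tokens, in [0, 1]."""
    cand, ref = candidate.split(), reference.split()
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    # Geometric mean of the 1- to max_n-gram precisions
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: penalize candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_mean)

reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))
print(simple_bleu("the cat sat on a mat", reference))
```

Changing a single word in the candidate lowers every n-gram precision at once, which is why BLEU-4 is sensitive to word order and phrasing and not only to word choice.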

In [10]:
bleu_metric = evaluate.load('sacrebleu')

Downloading builder script: 8.15kB [00:00, 2.63MB/s]


In [38]:
strategies = ['greedy', 'beam', 'temperature', 'top-k', 'top-p', 'contrastive']
results = {}

hyperparameters = {
    'num_beams': 5,
    'temperature': 1.0,
    'top_k': 50,
    'top_p': 0.95,
    'penalty_alpha': 0.6
}

new_hyperparameters = {
    'num_beams': 3,
    'temperature': 0.6,
    'top_k': 10,
    'top_p': 0.8,
    'penalty_alpha': 0.4
}

for strategy in strategies:
    print(f"Evaluating strategy: {strategy}")
    predictions = []
    references = []
    for src_text, tgt_text in zip(dataset_nmt_preproc["src_texts"], dataset_nmt_preproc["tgt_texts"]): 
        pred = generate_text(src_text, strategy=strategy, **hyperparameters)
        predictions.append(pred)
        references.append([tgt_text])  # SacreBLEU expects a list of references

    # Compute BLEU-4 score
    bleu = bleu_metric.compute(predictions=predictions, references=references, smooth_method='exp')
    results[strategy] = bleu['score']
    print(f"BLEU-4 score for {strategy}: {bleu['score']:.2f}")


Evaluating strategy: greedy
BLEU-4 score for greedy: 20.99
Evaluating strategy: beam
BLEU-4 score for beam: 21.14
Evaluating strategy: temperature
BLEU-4 score for temperature: 11.37
Evaluating strategy: top-k
BLEU-4 score for top-k: 12.30
Evaluating strategy: top-p
BLEU-4 score for top-p: 14.38
Evaluating strategy: contrastive
BLEU-4 score for contrastive: 19.34


In [None]:
# Hyperparameter tuning code
best_results = {}

def get_hyperparameter_grid(strategy):
    if strategy == 'greedy':
        return [{}]  # No hyperparameters to tune
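    elif strategy == 'beam':
        # Note: generate_text as defined above only reads 'num_beams', so
        # 'length_penalty' has no effect unless it is forwarded to model.generate.
        return [{'num_beams': n, 'length_penalty': lp} 
                for n in [3,5,7] for lp in [0.8,1.0,1.2]]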
    elif strategy == 'temperature':
        return [{'temperature': t} for t in [0.6,0.7,0.8,1.0,1.2]]
    elif strategy == 'top-k':
        return [{'top_k': k} for k in [10,20,50,100]]
    elif strategy == 'top-p':
        return [{'top_p': p} for p in [0.8,0.9,0.95]]
    elif strategy == 'contrastive':
        return [{'top_k': k, 'penalty_alpha': a} 
                for k in [3,5,7] for a in [0.4,0.6,0.7]]

for strategy in strategies:
    best_bleu = 0
    best_params = {}
    param_grid = get_hyperparameter_grid(strategy)  

    for params in param_grid:
        predictions, references = [], []
        for src, tgt in zip(dataset_nmt_preproc['src_texts'], dataset_nmt_preproc['tgt_texts']):
            pred = generate_text(src, strategy=strategy, **params)
            predictions.append(pred)
            references.append([tgt])
        bleu = bleu_metric.compute(predictions=predictions, references=references, smooth_method='exp')['score']

        if bleu > best_bleu:
            best_bleu = bleu
            best_params = params

    best_results[strategy] = {'BLEU': best_bleu, 'params': best_params}

In [36]:
for key in best_results.keys():
    print(best_results[key])

{'BLEU': 20.98821693903096, 'params': {}}
{'BLEU': 21.940264780375937, 'params': {'num_beams': 3, 'length_penalty': 0.8}}
{'BLEU': 17.905244283153074, 'params': {'temperature': 0.6}}
{'BLEU': 13.402206825495583, 'params': {'top_k': 10}}
{'BLEU': 17.41548491202558, 'params': {'top_p': 0.8}}
{'BLEU': 20.289416727605808, 'params': {'top_k': 7, 'penalty_alpha': 0.4}}


- Seeing the above translations and the BLEU score of the different strategies, which strategy would you choose for this use case?
>>> I would definitely choose the deterministic strategies: they achieve higher BLEU scores while producing coherent outputs.

### Use case 2: Story generation

The ```WritingPrompts``` dataset is a collection of imaginative prompts and corresponding stories from the Reddit community. It contains over 300,000 stories written in response to various prompts, making it suitable for training and evaluating models on creative text generation tasks.

We'll use a subset of the dataset for evaluation purposes.

In [13]:
dataset_st_gen = load_dataset('llm-aes/writing-prompts', split='train[:1%]')

Generating train split: 100%|██████████| 232360/232360 [00:00<00:00, 2817128.66 examples/s]
Generating sample_length_10_to_292 split: 100%|██████████| 1000/1000 [00:00<00:00, 657930.04 examples/s]


In [14]:
# Randomly select a sample from the dataset
sample = random.choice(dataset_st_gen['prompt'])

print("Sample Text:")
print(sample)

Sample Text:
 Even with all the stars on the sky , the night will always remain dark .



Now we have to tell the T5 model to generate text that continues from the input prompt.

In [16]:
def preprocess_story_generation(examples: list[dict]) -> dict:
    """Preprocesses a list of examples for the story generation task.

    Args:
        examples (list[dict]): An iterable of dictionaries containing a 'prompt' key.

    Returns:
        dict[str, list[str]]: A dictionary with a 'src_texts' key for model input.

    The function:
    - Adds the task prefix 'Write a story based on: ' to the prompt.
    """
    # Empty dictionary
    texts = {}
    
    # T5 expects a task prefix
    texts['src_texts'] = ['Write a story based on: ' + ex['prompt'] for ex in examples]
    return texts

dataset_st_gen_preproc = preprocess_story_generation(dataset_st_gen)


As we have no reference story against which to measure how well the model generates a story, you should compare the decoding strategies by reading some randomly chosen stories:

In [41]:
# Pick a random prompt (randrange excludes the upper bound, so the index is always valid)
random_index = random.randrange(len(dataset_st_gen_preproc['src_texts']))
random_src_sentence = dataset_st_gen_preproc['src_texts'][random_index]

# Generate a story with each strategy
story_gen_greedy = generate_text(random_src_sentence, max_length=300, strategy='greedy')
story_gen_beam_search = generate_text(random_src_sentence, max_length=300, strategy='beam', num_beams=7)
story_gen_temperature = generate_text(random_src_sentence, max_length=300, strategy='temperature', temperature=0.85)
story_gen_top_k = generate_text(random_src_sentence, max_length=300, strategy='top-k', top_k=75)
story_gen_top_p = generate_text(random_src_sentence, max_length=300, strategy='top-p', top_p=0.98)
story_gen_contrastive = generate_text(random_src_sentence, max_length=300, strategy='contrastive', top_k=6, penalty_alpha=0.6)

print(random_src_sentence)
print(f"Greedy Search: \n{story_gen_greedy}")
print(f"Beam Search: \n{story_gen_beam_search}")
print(f"Temperature Sampling: \n{story_gen_temperature}")
print(f"Top-k Sampling: \n{story_gen_top_k}")
print(f"Top-p Sampling: \n{story_gen_top_p}")
print(f"Contrastive Search: \n{story_gen_contrastive}")

Write a story based on:  `` I do n't care about the million dollars you stole , just tell me where the hell my pet snail is ! ''

Greedy Search: 
i'm a huge pet snail , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and i'm a huge slob , and 
Beam Search: 
 I do n't care about the million dollars you stole , just tell me where the hell my pet snail is !  I do n't care about the million dollars you stole , just tell me where the hell my pet snail is !  I do n't care about the million dollars you stole , just tell me where the hell my pet snail is ! 
Temperature Sampling: 
julia '
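
Since there are no reference stories, the repetition visible in outputs like the greedy and beam search ones above can be quantified with a reference-free measure. The sketch below computes the share of unique n-grams in a text (often called distinct-n); the function name `distinct_n` and the two example strings are assumptions for illustration, and the notebook itself relies on manual inspection.

```python
def distinct_n(text, n=2):
    """Fraction of n-grams in `text` that are unique (1.0 means no n-gram repeats)."""
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return len(set(grams)) / len(grams)

# A looping output in the style of the greedy result, and a made-up varied sentence
looping = "i'm a huge slob , and " * 10
varied = "the snail left a silver trail across the kitchen floor before dawn"

print(f"looping: {distinct_n(looping):.2f}")
print(f"varied:  {distinct_n(varied):.2f}")
```

A low value flags degenerate loops, but a high value does not guarantee a good story: random tokens would also score close to 1.0, so this number complements reading the outputs instead of replacing it.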

- Seeing the above generated stories of the different strategies, which strategy would you choose for this use case?
>>> Once more, the deterministic strategies produce more rigid text that repeats itself, and contrastive search now also appears to get stuck in loops. In this case, the best option seems to be the stochastic strategies, since the randomness they introduce results in more creative and better stories than the other strategies.

### Sensitivity Analysis

Try different hyperparameter values for the decoding strategies, try to optimize the BLEU score for the Neural Machine Translation case and generate better stories in the second use case.
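
One way to organize such a search is to separate the grid expansion from the scoring function. The sketch below uses `itertools.product` for the expansion; `expand_grid`, `grid_search`, and especially `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for "generate the translations and compute BLEU", so its numbers carry no meaning.

```python
from itertools import product

def expand_grid(grid):
    """Expand {'a': [1, 2], 'b': [3]} into [{'a': 1, 'b': 3}, {'a': 2, 'b': 3}].

    An empty grid yields a single empty configuration, which suits greedy search.
    """
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

def grid_search(score_fn, grid):
    """Return (best_score, best_params) for the highest-scoring configuration.

    On ties, the first configuration encountered is kept.
    """
    best_score, best_params = float('-inf'), None
    for params in expand_grid(grid):
        score = score_fn(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Made-up scoring function standing in for an expensive BLEU evaluation
def toy_score(num_beams, length_penalty):
    return 20.0 - abs(num_beams - 5) - 2.0 * abs(length_penalty - 1.0)

best_score, best_params = grid_search(
    toy_score, {'num_beams': [3, 5, 7], 'length_penalty': [0.8, 1.0, 1.2]}
)
print(best_score, best_params)
```

For the stochastic strategies, a single run per configuration is noisy, so averaging the score over a few random seeds before comparing configurations would make the ranking more trustworthy.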

- Which optimal configuration have you found for use case 1? Which are your conclusions based on your analysis?
>>> I looped over different hyperparameter values to find the best ones, which are: num_beams = 3, temperature = 0.6, top_k = 10, top_p = 0.8, and penalty_alpha = 0.4. The BLEU scores show that beam search achieved the highest score (21.94), ahead of greedy decoding (20.99) and contrastive search (20.29), while the stochastic strategies performed worse. My conclusion is that beam search is the most effective strategy for this first use case, as it balances exploration and exploitation.

- Which optimal configuration have you found for use case 2? Which are your conclusions based on your analysis?
>>> I tried several hyperparameter values and found that beam search with num_beams=7 produced repetitive outputs, while the sampling strategies, with temperature=0.85, top_k=75, and top_p=0.98 respectively, generated more diverse and creative stories. Contrastive search with top_k=6 and penalty_alpha=0.6 still produced loops. In conclusion, stochastic strategies offer the best balance between creativity and coherence in this case, making them the best choice for generating a story.