# Demonstrating Different Decoding Strategies with Hugging Face Transformers
In this notebook, we will explore various decoding strategies for text generation using a small encoder-decoder model from Hugging Face's Transformers library. We'll apply these strategies to two different datasets:

- A translation task, where deterministic strategies are expected to perform better.
- A story generation task, where stochastic strategies might yield more diverse and creative outputs.

The decoding strategies we'll test include:

1. Greedy Search
2. Beam Search
3. Temperature Sampling
4. Top-k Sampling
5. Top-p (Nucleus) Sampling
6. Contrastive Search

We'll define a custom ```generate_text``` function to apply these strategies and evaluate their performance using appropriate metrics for each dataset.
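
To make the sampling strategies listed above more concrete, here is a minimal pure-Python sketch of how temperature, top-k, and top-p reshape a next-token distribution before sampling. The vocabulary, logits, and helper names (`softmax`, `top_k_filter`, `top_p_filter`) are hypothetical and only illustrate the idea; they are not the Transformers implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Hypothetical next-token logits over a 5-token vocabulary
vocab = ["the", "a", "cat", "dog", "zebra"]
logits = [2.0, 1.5, 0.5, 0.0, -2.0]

probs = softmax(logits)
greedy_token = vocab[max(range(len(probs)), key=lambda i: probs[i])]  # greedy: argmax
top2 = top_k_filter(probs, k=2)        # top-k: sample only among the 2 best tokens
nucleus = top_p_filter(probs, p=0.9)   # top-p: sample among tokens covering 90% of the mass
sharper = softmax(logits, temperature=0.5)

print("greedy:", greedy_token)
print("top-k candidates:", [vocab[i] for i in top2])
print("top-p candidates:", [vocab[i] for i in nucleus])
```

Greedy search always takes the single most probable token, while the sampling strategies draw from the filtered (and renormalized) distribution, which is where their variability comes from.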

Here are some useful links you might want to check:
- [Auto Classes](https://huggingface.co/docs/transformers/model_doc/auto)
- [transformers.AutoModel](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModel)
- [transformers.AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer)
- [Google's T5](https://arxiv.org/pdf/1910.10683)
- [T5-small](https://huggingface.co/google-t5/t5-small)
- [Flan-T5-small](https://huggingface.co/google/flan-t5-small)

In [2]:
#!pip install transformers datasets sacrebleu rouge_score evaluate --quiet

In [3]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset
import random
import evaluate

### Load the Pre-trained Model and Tokenizer

We'll use the ```flan-T5-small``` model, a small encoder-decoder model suitable for both translation and story generation tasks. It is based on Google's ```t5-small``` model, but fine-tuned on more than 1,000 additional tasks that also cover more languages.

In [4]:
model_name = 'google/flan-t5-small'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)




In [5]:
def generate_text(input_text, strategy='greedy', max_length=50, **kwargs):
    """Generates text based on the specified decoding strategy.

    Args:
        input_text (str): The input text to be processed by the model.
        strategy (str, optional): The decoding strategy to use. Defaults to 'greedy'.
            Options include:
            - 'greedy': Greedy search decoding.
            - 'beam': Beam search decoding.
            - 'temperature': Temperature sampling.
            - 'top-k': Top-k sampling.
            - 'top-p': Top-p (nucleus) sampling.
            - 'contrastive': Contrastive search decoding.
        max_length (int, optional): The maximum length of the generated text. Defaults to 50.
        **kwargs: Additional keyword arguments specific to the decoding strategy.

    Keyword Args:
        num_beams (int, optional): Number of beams for beam search. Defaults to 5.
            Used when `strategy='beam'`.
        temperature (float, optional): Sampling temperature. Defaults to 1.0.
            Used when `strategy='temperature'`.
        top_k (int, optional): The number of highest probability vocabulary tokens to keep for top-k filtering.
            Defaults to 50 when `strategy='top-k'` and to 4 when `strategy='contrastive'`.
        top_p (float, optional): Cumulative probability for nucleus sampling. Defaults to 0.95.
            Used when `strategy='top-p'`.
        penalty_alpha (float, optional): Contrastive search penalty factor. Defaults to 0.6.
            Used when `strategy='contrastive'`.

    Returns:
        str: The generated text based on the decoding strategy.

    Raises:
        ValueError: If an unknown decoding strategy is specified.
    """
    input_ids = tokenizer.encode(input_text, return_tensors='pt')
    model.eval()
    with torch.no_grad():
        if strategy == 'greedy':
            output_ids = model.generate(input_ids, max_length=max_length)
        elif strategy == 'beam':
            num_beams = kwargs.get('num_beams', 5)
            output_ids = model.generate(
                input_ids, max_length=max_length, num_beams=num_beams, early_stopping=True
            )
        elif strategy == 'temperature':
            temperature = kwargs.get('temperature', 1.0)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, temperature=temperature
            )
        elif strategy == 'top-k':
            top_k = kwargs.get('top_k', 50)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, top_k=top_k
            )
        elif strategy == 'top-p':
            top_p = kwargs.get('top_p', 0.95)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, top_p=top_p
            )
        elif strategy == 'contrastive':
            penalty_alpha = kwargs.get('penalty_alpha', 0.6)
            top_k = kwargs.get('top_k', 4)
            output_ids = model.generate(
                input_ids, max_length=max_length, penalty_alpha=penalty_alpha, top_k=top_k
            )
        else:
            raise ValueError("Unknown strategy: {}".format(strategy))

    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output_text
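
To illustrate why beam search can find a better sequence than greedy search, here is a small self-contained sketch on a toy "model". The probability table `TOY_MODEL` and the `beam_search` helper are hypothetical. This simplified version omits the length normalization and early stopping that `model.generate` applies.

```python
import math

# Hypothetical toy "model": next-token probabilities given the prefix so far
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, num_beams, steps):
    """Return the best (sequence, log-probability) found after `steps` tokens.

    With num_beams=1 this reduces to greedy search.
    """
    beams = [((), 0.0)]  # each beam is (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, prob in model[seq].items():
                candidates.append((seq + (token,), score + math.log(prob)))
        # Keep only the num_beams highest-scoring partial sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams[0]

greedy_seq, greedy_score = beam_search(TOY_MODEL, num_beams=1, steps=2)
beam_seq, beam_score = beam_search(TOY_MODEL, num_beams=2, steps=2)

print("greedy:", greedy_seq, round(math.exp(greedy_score), 2))
print("beam:  ", beam_seq, round(math.exp(beam_score), 2))
```

Greedy search commits to `"A"` because it is the most probable first token, whereas keeping two beams lets the search recover the sequence starting with `"B"`, whose overall probability is higher.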


### Use case 1: Neural Machine Translation
The WMT16 English-German dataset is a collection of parallel sentences in English and German used for machine translation tasks. It is part of the Conference on Machine Translation (WMT) shared tasks, which are benchmarks for evaluating machine translation systems. The dataset contains professionally translated sentences and covers a variety of topics, making it well suited for training and evaluating translation models.

We'll use a subset (1% of the test split) of the dataset for evaluation purposes.

In [6]:
dataset_nmt = load_dataset('wmt16', 'de-en', split='test[:1%]')

Generating train split: 100%|██████████| 4548885/4548885 [00:02<00:00, 1650860.99 examples/s]
Generating validation split: 100%|██████████| 2169/2169 [00:00<00:00, 746732.77 examples/s]
Generating test split: 100%|██████████| 2999/2999 [00:00<00:00, 929553.48 examples/s]


In [7]:
# Randomly select a sample from the dataset
sample = random.choice(dataset_nmt['translation'])
input_text = sample['de']
target_text = sample['en']

print("Input Text:")
print(input_text)
print("\nTarget Text:")
print(target_text)


Input Text:
Lamb war es wichtig zu betonen, dass sein "süßer Hund" aber noch lebe und wahrscheinlich aufgeregt sei, und er sagte, die Familienkontakte der toten Frau könnten auf dem Handy gefunden werden.

Target Text:
Lamb made a point to say his "sweet dog" was there alive and probably upset, and said the dead woman's family contacts could be found on her phone.


We need to preprocess the dataset to prepare it for input into the T5 model. T5 expects its input in a specific format that includes a task prefix: because it is a multi-purpose model that can perform several tasks, we need to tell it which one to carry out.

In [8]:
def preprocess_translation(examples: list[dict]) -> dict:
    """Preprocesses a list of examples for the translation task.

    Args:
        examples (list[dict]): A list of dictionaries containing 'de' and 'en' keys with German and English sentences.

    Returns:
        dict[str, list]: A dictionary of lists with 'src_texts' and 'tgt_texts' keys for model input and target.

    The function:
    - Adds the task prefix 'translate German to English: ' to the German sentence.
    - Stores the result in 'src_texts'.
    - Copies the English sentence to 'tgt_texts'.
    """
    # Empty dictionary
    texts = {}
    
    # T5 expects a "translate German to English: " prefix
    texts['src_texts'] = ['translate German to English: ' + ex['de'] for ex in examples]
    texts['tgt_texts'] = [ex['en'] for ex in examples]
    return texts

dataset_nmt_preproc = preprocess_translation(dataset_nmt["translation"])


Let's see how to generate a translation with the T5 model. For a random sentence from the dataset, we'll create translations using each of the implemented strategies:

In [9]:
# Obtain source and target texts
# randrange excludes the upper bound; randint(0, len(...)) could index out of range
random_index = random.randrange(len(dataset_nmt_preproc['src_texts']))
random_src_sentence = dataset_nmt_preproc['src_texts'][random_index]
random_tgt_sentence = dataset_nmt_preproc['tgt_texts'][random_index]

# Obtain the translated sentence
translation_greedy = generate_text(random_src_sentence, strategy='greedy')
translation_beam_search = generate_text(random_src_sentence, strategy='beam', num_beams=5)
translation_temperature = generate_text(random_src_sentence, strategy='temperature', temperature=0.7)
translation_top_k = generate_text(random_src_sentence, strategy='top-k', top_k=50)
translation_top_p = generate_text(random_src_sentence, strategy='top-p', top_p=0.95)
translation_contrastive = generate_text(random_src_sentence, strategy='contrastive', top_k=4, penalty_alpha=0.6)

print(random_src_sentence)
print(f"Original translation: \n{random_tgt_sentence}")
print(f"Greedy Search: \n{translation_greedy}")
print(f"Beam Search: \n{translation_beam_search}")
print(f"Temperature Sampling: \n{translation_temperature}")
print(f"Top-k Sampling: \n{translation_top_k}")
print(f"Top-p Sampling: \n{translation_top_p}")
print(f"Contrastive Search: \n{translation_contrastive}")

translate German to English: Bei der Begegnung soll es aber auch um den Konflikt mit den Palästinensern und die diskutierte Zwei-Staaten-Lösung gehen.
Original translation: 
The meeting was also planned to cover the conflict with the Palestinians and the disputed two state solution.
Greedy Search: 
The onset of the conflict should also be a part of the conflict with the Palestinians and the debated two-seat solution.
Beam Search: 
However, it will also be a conflict with the Palästinians and the debated two-staaten solution.
Temperature Sampling: 
Although the initiative is an important step but also a step in the fight against a Palestinian and the debated two-staaten solution.
Top-k Sampling: 
The conclusion should therefore make it possible for the conflict to proceed with the Palestinians and to appoint the two-step solution.
Top-p Sampling: 
When it comes to the introduction, this may be a combination of mediation, social networking and debate.
Contrastive Search: 
The decision sh

- Do you see any difference between the deterministic and the stochastic strategies? Try different random sentences.

Deterministic strategies attempt more literal translations, which sometimes makes them less natural or introduces direct translation errors. Stochastic strategies, on the other hand, generate more varied translations, offering different alternatives that are sometimes less accurate.

### Evaluation Metric: BLEU Score
#### What is BLEU Score?
The BLEU (Bilingual Evaluation Understudy) score is a metric for evaluating the quality of text that has been machine-translated from one language to another. It compares a candidate translation to one or more reference translations and calculates a score based on the overlap of n-grams (contiguous sequences of words).

BLEU-4: Considers up to 4-gram matches between the candidate and reference translations. It provides a balance between precision (matching words) and fluency (maintaining the structure of the language). Using BLEU-4 allows us to capture not just individual word matches (unigrams) but also phrases of up to four words. This makes the evaluation more sensitive to the quality of the translation in terms of both accuracy and fluency.

We'll use the ```sacrebleu``` implementation for a standardized BLEU score calculation (which is BLEU-4). You can check the details [here](https://aclanthology.org/W14-3346.pdf).
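
As a rough illustration of the n-gram overlap idea described above, the following sketch computes a simplified sentence-level BLEU-4. It assumes whitespace tokenization, a single reference, and no smoothing, so its scores will differ from those of ```sacrebleu```, which applies its own tokenization, smoothing, and corpus-level aggregation. The function name `simple_bleu` and the example sentences are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: clipped n-gram precisions plus brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipping: a candidate n-gram counts at most as often as it occurs in the reference
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, a zero precision makes the whole score zero
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled to 0-100
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
exact = simple_bleu("the cat sat on the mat", reference)
one_word_off = simple_bleu("the cat sat on a mat", reference)
no_overlap = simple_bleu("dogs bark loudly at night today", reference)

print(f"exact match:  {exact:.2f}")
print(f"one word off: {one_word_off:.2f}")
print(f"no overlap:   {no_overlap:.2f}")
```

A single wrong word breaks several higher-order n-grams at once, which is why BLEU-4 drops much faster than the unigram precision alone would suggest.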

In [10]:
bleu_metric = evaluate.load('sacrebleu')

Downloading builder script: 8.15kB [00:00, 13.0MB/s]


In [None]:
strategies = ['greedy', 'beam', 'temperature', 'top-k', 'top-p', 'contrastive']
results = {}

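# Note: these values are passed to every strategy, so top_k=50 also applies
# to contrastive search here (overriding its default of 4 in generate_text).
hyperparameters = {
    'num_beams': 8,
    'temperature': 1.0,
    'top_k': 50,
    'top_p': 0.95,
    'penalty_alpha': 0.6
}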

for strategy in strategies:
    print(f"Evaluating strategy: {strategy}")
    predictions = []
    references = []
    for src_text, tgt_text in zip(dataset_nmt_preproc["src_texts"], dataset_nmt_preproc["tgt_texts"]): 
        pred = generate_text(src_text, strategy=strategy, **hyperparameters)
        predictions.append(pred)
        references.append([tgt_text])  # SacreBLEU expects a list of references

    # Compute BLEU-4 score
    bleu = bleu_metric.compute(predictions=predictions, references=references, smooth_method='exp')
    results[strategy] = bleu['score']
    print(f"BLEU-4 score for {strategy}: {bleu['score']:.2f}")
    


Evaluating strategy: greedy
BLEU-4 score for greedy: 20.99
Evaluating strategy: beam
BLEU-4 score for beam: 21.86
Evaluating strategy: temperature
BLEU-4 score for temperature: 9.55
Evaluating strategy: top-k
BLEU-4 score for top-k: 9.46
Evaluating strategy: top-p
BLEU-4 score for top-p: 12.80
Evaluating strategy: contrastive
BLEU-4 score for contrastive: 19.34


In [17]:
# Search for the best num_beams for beam search
best_bleu = 0
best_num_beams = 1
bleu_scores = []

for num_beams in range(1, 11):
    print(f"Evaluating num_beams={num_beams}")
    predictions = []
    references = []
    for src_text, tgt_text in zip(dataset_nmt_preproc["src_texts"], dataset_nmt_preproc["tgt_texts"]):
        pred = generate_text(src_text, strategy='beam', num_beams=num_beams)
        predictions.append(pred)
        references.append([tgt_text])
    bleu = bleu_metric.compute(predictions=predictions, references=references, smooth_method='exp')
    bleu_score = bleu['score']
    bleu_scores.append((num_beams, bleu_score))
    print(f"num_beams={num_beams}, BLEU-4={bleu_score:.2f}")
    if bleu_score > best_bleu:
        best_bleu = bleu_score
        best_num_beams = num_beams

print(f"\nThe best num_beams is {best_num_beams} with BLEU-4={best_bleu:.2f}")
# To see all the results:
print("\nBLEU-4 results for each num_beams:")
for nb, score in bleu_scores:
    print(f"num_beams={nb}: BLEU-4={score:.2f}")

Evaluating num_beams=1
num_beams=1, BLEU-4=20.99
Evaluating num_beams=2
num_beams=2, BLEU-4=21.56
Evaluating num_beams=3
num_beams=3, BLEU-4=21.94
Evaluating num_beams=4
num_beams=4, BLEU-4=21.04
Evaluating num_beams=5
num_beams=5, BLEU-4=21.14
Evaluating num_beams=6
num_beams=6, BLEU-4=20.80
Evaluating num_beams=7
num_beams=7, BLEU-4=21.17
Evaluating num_beams=8
num_beams=8, BLEU-4=21.86
Evaluating num_beams=9
num_beams=9, BLEU-4=22.53
Evaluating num_beams=10
num_beams=10, BLEU-4=21.65

The best num_beams is 9 with BLEU-4=22.53

BLEU-4 results for each num_beams:


- Given the translations above and the BLEU scores of the different strategies, which strategy would you choose for this use case?

For machine translation, I would choose beam search. It obtains the highest BLEU-4 among the strategies compared (21.86 with num_beams=8) and produces more coherent and accurate translations than the stochastic strategies, which generate more variety but sacrifice accuracy and quality in this context.

### Use case 2: Story generation

The ```WritingPrompts``` dataset is a collection of imaginative prompts and corresponding stories from the Reddit community. It contains over 300,000 stories written in response to various prompts, making it suitable for training and evaluating models on creative text generation tasks.

We'll use a subset of the dataset for evaluation purposes.

In [12]:
dataset_st_gen = load_dataset('llm-aes/writing-prompts', split='train[:1%]')

Generating train split: 100%|██████████| 232360/232360 [00:00<00:00, 2287863.05 examples/s]
Generating sample_length_10_to_292 split: 100%|██████████| 1000/1000 [00:00<00:00, 1072712.02 examples/s]


In [13]:
# Randomly select a sample from the dataset
sample = random.choice(dataset_st_gen['prompt'])

print("Sample Text:")
print(sample)

Sample Text:
 Science has found the key to immortality , but there 's a catch : it can only be administered at birth . You are a member of the last mortal generation .



Next, we add a task prefix so that the T5 model knows it should generate a story from the input prompt.

In [14]:
def preprocess_story_generation(examples: list[dict]) -> dict:
    """Preprocesses a list of examples for the story generation task.

    Args:
        examples (list[dict]): A list of dictionaries containing a 'prompt' key.

    Returns:
        dict[str, list]: A dictionary with a 'src_texts' list for model input.

    The function:
    - Adds the task prefix 'Write a story based on: ' to the prompt.
    """
    # Empty dictionary
    texts = {}
    
    # T5 expects a task prefix
    texts['src_texts'] = ['Write a story based on: ' + ex['prompt'] for ex in examples]
    return texts

dataset_st_gen_preproc = preprocess_story_generation(dataset_st_gen)


Since there are no reference stories against which to score the model's output, compare the decoding strategies by reading some randomly chosen generated stories:

In [18]:
# Pick a random source prompt
# randrange excludes the upper bound; randint(0, len(...)) could index out of range
random_index = random.randrange(len(dataset_st_gen_preproc['src_texts']))
random_src_sentence = dataset_st_gen_preproc['src_texts'][random_index]

# Generate a story with each strategy
story_gen_greedy = generate_text(random_src_sentence, max_length=300, strategy='greedy')
story_gen_beam_search = generate_text(random_src_sentence, max_length=300, strategy='beam', num_beams=5)
story_gen_temperature = generate_text(random_src_sentence, max_length=300, strategy='temperature', temperature=0.7)
story_gen_top_k = generate_text(random_src_sentence, max_length=300, strategy='top-k', top_k=50)
story_gen_top_p = generate_text(random_src_sentence, max_length=300, strategy='top-p', top_p=0.92)
story_gen_contrastive = generate_text(random_src_sentence, max_length=300, strategy='contrastive', top_k=4, penalty_alpha=0.6)

print(random_src_sentence)
print(f"Greedy Search: \n{story_gen_greedy}")
print(f"Beam Search: \n{story_gen_beam_search}")
print(f"Temperature Sampling: \n{story_gen_temperature}")
print(f"Top-k Sampling: \n{story_gen_top_k}")
print(f"Top-p Sampling: \n{story_gen_top_p}")
print(f"Contrastive Search: \n{story_gen_contrastive}")

Write a story based on:  The floor is lava .

Greedy Search: 
The floor is lava. It is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like substance that is a lava-like s

- Given the stories generated above by the different strategies, which strategy would you choose for this use case?

For story generation, stochastic strategies such as top-p, top-k, or temperature sampling seem to work better. Although they can produce less coherent results, they generate more creative and varied text. In this case, the deterministic strategies tend to repeat phrases or get stuck, while top-p and top-k offer more original and less repetitive stories. I would therefore choose top-p or top-k for this use case.
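
Because there is no reference story to score against, one rough way to quantify the repetition seen in the greedy output is the fraction of unique n-grams in a text. The sketch below is an assumption-laden illustration: the `distinct_n` helper is hypothetical, the `looping` string mimics the greedy output shown above, and the `varied` string is an invented example rather than actual model output.

```python
def distinct_n(text, n=2):
    """Fraction of n-grams in `text` that are unique (1.0 means no repeated n-gram)."""
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return len(set(grams)) / len(grams)

# Mimics the looping pattern of the greedy output shown above
looping = "The floor is lava. It " + "is a lava-like substance that " * 10
# Invented example of a non-repetitive continuation (not model output)
varied = "The floor is lava. We jumped from the sofa to the table and laughed."

print(f"looping text: distinct-2 = {distinct_n(looping):.2f}")
print(f"varied text:  distinct-2 = {distinct_n(varied):.2f}")
```

A low value flags degenerate repetition, but a high value does not guarantee a coherent story, so this kind of score complements reading the outputs rather than replacing it.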

### Sensitivity Analysis

Try different hyperparameter values for the decoding strategies. Aim to optimize the BLEU score for the neural machine translation case and to generate better stories in the second use case.

- Which optimal configuration have you found for use case 1? What are your conclusions based on your analysis?

For use case 1, the best configuration was num_beams = 9 with BLEU-4 = 22.53. However, num_beams = 3 could be used instead, since it lowers BLEU by only about 0.6 points (BLEU-4 = 21.94) while reducing the computational cost. The stochastic strategies, and the tuning of temperature, top-k, or top-p, did not surpass beam search in BLEU. Beam search is therefore the best option for accurate and stable translation.
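
The trade-off between BLEU and computational cost can be made explicit with a small helper that picks the smallest beam width whose score stays within a tolerance of the best one. This is only a sketch: the scores are copied from the num_beams sweep output above, while the `cheapest_within` helper and the 0.6-point tolerance are hypothetical choices.

```python
# BLEU-4 scores copied from the num_beams sweep output above
bleu_scores = [(1, 20.99), (2, 21.56), (3, 21.94), (4, 21.04), (5, 21.14),
               (6, 20.80), (7, 21.17), (8, 21.86), (9, 22.53), (10, 21.65)]

def cheapest_within(scores, tolerance):
    """Smallest num_beams whose BLEU is within `tolerance` points of the best score."""
    best = max(score for _, score in scores)
    return min(nb for nb, score in scores if best - score <= tolerance)

best_nb = max(bleu_scores, key=lambda item: item[1])[0]   # highest BLEU overall
cheap_nb = cheapest_within(bleu_scores, tolerance=0.6)    # hypothetical tolerance

print(f"best num_beams: {best_nb}")
print(f"cheapest num_beams within 0.6 BLEU of the best: {cheap_nb}")
```

Since the evaluation uses only 1% of the 2,999 test sentences (roughly 30 examples), differences of about one BLEU point between beam widths may well be within noise, which is another reason to prefer the cheaper setting.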

- Which optimal configuration have you found for use case 2? What are your conclusions based on your analysis?

For use case 2 (story generation), the best results in terms of creativity and coherence were obtained with top-p sampling using top_p=0.92 and max_length=300, although any top_p between 0.9 and 0.95 gives varied but comprehensible stories. Top-k also works well with values between 40 and 60. For creative tasks, top-p or top-k with intermediate values generates more interesting and less repetitive text.