# Demonstrating Different Decoding Strategies with Hugging Face Transformers
In this notebook, we will explore various decoding strategies for text generation using a small encoder-decoder model from Hugging Face's Transformers library. We'll apply these strategies to two different datasets:

- Translation task where deterministic strategies are expected to perform better.
- Summarization task where stochastic strategies might yield more diverse and informative outputs.

The decoding strategies we'll test include:

1. Greedy Search
2. Beam Search
3. Temperature Sampling
4. Top-k Sampling
5. Top-p (Nucleus) Sampling

We'll define a custom ```generate_text``` function to apply these strategies and evaluate their performance using appropriate metrics for each dataset.

Here are some useful links you might want to check:
- [Auto Classes](https://huggingface.co/docs/transformers/model_doc/auto)
- [transformers\AutoModel](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModel)
- [transformers\AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer)
- [Google's T5](https://arxiv.org/pdf/1910.10683)
- [T5-small](https://huggingface.co/google-t5/t5-small)
- [Flan-T5-small](https://huggingface.co/google/flan-t5-small)

In [27]:
!pip install transformers datasets sacrebleu rouge_score evaluate --quiet

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [28]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset
import random
import evaluate

### Load the Pre-trained Model and Tokenizer

We'll use the ```flan-T5-small``` model, which is a small encoder-decoder model suitable for both story_gen and summarization tasks. This model is based on Google's ```t5-small``` model, but fine-tuned on more than 1000 additional tasks covering also more languages.

In [29]:
model_name = 'google/flan-t5-small'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


In [37]:
def generate_text(input_text, strategy='greedy', max_length=50, **kwargs):
    """Generates text based on the specified decoding strategy.

    Args:
        input_text (str): The input text to be processed by the model.
        strategy (str, optional): The decoding strategy to use. Defaults to 'greedy'.
            Options include:
            - 'greedy': Greedy search decoding.
            - 'beam': Beam search decoding.
            - 'temperature': Temperature sampling.
            - 'top-k': Top-k sampling.
            - 'top-p': Top-p (nucleus) sampling.
            - 'contrastive': Contrastive search decoding.
        max_length (int, optional): The maximum length of the generated text. Defaults to 50.
        **kwargs: Additional keyword arguments specific to the decoding strategy.

    Keyword Args:
        num_beams (int, optional): Number of beams for beam search. Defaults to 5.
            Used when `strategy='beam'`.
        temperature (float, optional): Sampling temperature. Defaults to 1.0.
            Used when `strategy='temperature'`.
        top_k (int, optional): The number of highest probability vocabulary tokens to keep for top-k filtering. Defaults to 50.
            Used when `strategy='top-k'` or when `strategy='contrastive'`.
        top_p (float, optional): Cumulative probability for nucleus sampling. Defaults to 0.95.
            Used when `strategy='top-p'`.
        penalty_alpha (float, optional): Contrastive search penalty factor. Defaults to 0.6.
            Used when `strategy='contrastive'`.

    Returns:
        str: The generated text based on the decoding strategy.

    Raises:
        ValueError: If an unknown decoding strategy is specified.
    """
    input_ids = tokenizer.encode(input_text, return_tensors='pt')
    model.eval()
    with torch.no_grad():
        if strategy == 'greedy':
            output_ids = model.generate(input_ids, max_length=max_length)
        elif strategy == 'beam':
            num_beams = kwargs.get('num_beams', 5)
            output_ids = model.generate(
                input_ids, max_length=max_length, num_beams=num_beams, early_stopping=True
            )
        elif strategy == 'temperature':
            temperature = kwargs.get('temperature', 1.0)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, temperature=temperature
            )
        elif strategy == 'top-k':
            top_k = kwargs.get('top_k', 50)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, top_k=top_k
            )
        elif strategy == 'top-p':
            top_p = kwargs.get('top_p', 0.95)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, top_p=top_p
            )
        elif strategy == 'contrastive':
            top_k = kwargs.get('top_k', 4)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, top_k=top_k
            )
        else:
            raise ValueError("Unknown strategy: {}".format(strategy))

    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output_text


### Use case 1: Neural Machine Translation
The WMT16 English-German dataset is a collection of parallel sentences in English and German used for machine translation tasks. It is part of the Conference on Machine story_gen (WMT) shared tasks, which are benchmarks for evaluating machine story_gen systems. The dataset contains professionally translated sentences and covers a variety of topics, making it ideal for training and evaluating story_gen models.

We'll use a subset (1% of the test split) of the dataset for evaluation purposes.

In [38]:
dataset_nmt = load_dataset('wmt16', 'de-en', split='test[:1%]')

In [39]:
# Randomly select a sample from the dataset
sample = random.choice(dataset_nmt['translation'])
input_text = sample['de']
target_text = sample['en']

print("Input Text:")
print(input_text)
print("\nTarget Text:")
print(target_text)


Input Text:
Lamb war es wichtig zu betonen, dass sein "süßer Hund" aber noch lebe und wahrscheinlich aufgeregt sei, und er sagte, die Familienkontakte der toten Frau könnten auf dem Handy gefunden werden.

Target Text:
Lamb made a point to say his "sweet dog" was there alive and probably upset, and said the dead woman's family contacts could be found on her phone.


We need to preprocess the dataset to prepare it for input into the T5 model. The T5 model expects input in a specific format, including a task prefix. This is because the T5 model is a multi-purpose model, being able to perform several task, so we need to tell it what it must do.

In [40]:
def preprocess_translation(examples: list[dict]) -> dict:
    """Preprocesses a single example for the translation task.

    Args:
        examples list[dict]: A list of dictionaries containing 'de' and 'en' keys with English and German sentences.

    Returns:
        dict[list]: A dictionary of lists with added 'src_texts' and 'tgt_texts' keys for model input and target.

    The function:
    - Adds the task prefix 'translate German to English: ' to the German sentence.
    - Stores the result in 'src_texts'.
    - Copies the English sentence to 'tgt_texts'.
    """
    # Empty dictionary
    texts = {}
    
    # T5 expects a "translate German to English: " prefix
    texts['src_texts'] = ['translate German to English: ' + ex['de'] for ex in examples]
    texts['tgt_texts'] = [ex['en'] for ex in examples]
    return texts

dataset_nmt_preproc = preprocess_translation(dataset_nmt["translation"])


Let's see how we can generate a translation using the t5 model. From a random dataset, we are going to create translations using the different implemented strategies:

In [41]:
# Obtain source and target texts
random_index = random.randrange(len(dataset_nmt_preproc['src_texts']))
random_src_sentence = dataset_nmt_preproc['src_texts'][random_index]
random_tgt_sentence = dataset_nmt_preproc['tgt_texts'][random_index]

# Obtain the translated sentence
translation_greedy= generate_text(random_src_sentence, strategy='greedy')
translation_beam_search = generate_text(random_src_sentence, strategy='beam', num_beams=5)
translation_temperature = generate_text(random_src_sentence, strategy='temperature', temperature=0.7)
translation_top_k = generate_text(random_src_sentence, strategy='top-k', top_k=50)
translation_top_p = generate_text(random_src_sentence, strategy='top-p', top_p=0.95)
translation_contrastive = generate_text(random_src_sentence, strategy='contrastive', top_k=4, penalty_alpha=0.6)

print(random_src_sentence)
print(f"Original translation: \n{random_tgt_sentence}")
print(f"Greedy Search: \n{translation_greedy}")
print(f"Beam Search: \n{translation_beam_search}")
print(f"Temperature Sampling: \n{translation_temperature}")
print(f"Top-k Sampling: \n{translation_top_k}")
print(f"Top-p Sampling: \n{translation_top_p}")
print(f"Contrastive Search: \n{translation_contrastive}")

translate German to English: Innerhalb des Hauses fanden die Beamten die Leiche von Amy Prentiss und eine handgeschriebene Notiz, die auf einen weißen Block gekritzelt war: "Mir tut es so leid, ich wollte, ich könnte es rückgängig machen, ich liebte Amy und sie ist die einzige Frau, die mich jemals liebte". Dies stand nach Angaben der Behörden in dem Brief, und er war von Lamb unterzeichnet.
Original translation: 
Inside the home, officers found Amy Prentiss' body and a hand-written note scribbled on a white legal pad: "I am so very sorry I wish I could take it back I loved Amy and she is the only woman who ever loved me," read the letter authorities say was signed by Lamb.
Greedy Search: 
In the House, the officers of Amy Prentiss and a handwritten note, which was on a white block, said, "I would like to make it a bit more difficult, I would like to make it more
Beam Search: 
Within the house, the officers of Amy Prentiss and a handwritten note, which was written on a white block, sai

- Do you see something different between the deterministic and the stochastic strategies? Try different random sentences.
>>> Las estrategias deterministas como Greedy Search y Beam Search generan siempre la misma salida para una misma entrada y tienden a producir traducciones coherentes pero a veces truncadas o repetitivas, como se observa en tu ejemplo con frases incompletas o palabras extrañas. En cambio, las estrategias estocásticas como Temperature, Top-k, Top-p y Contrastive Search introducen aleatoriedad, produciendo resultados diferentes en cada ejecución; esto permite generar traducciones más diversas y creativas, aunque a veces menos precisas o con construcciones inusuales. En general, las estrategias deterministas priorizan coherencia y seguridad, mientras que las estocásticas favorecen diversidad y exploración.

### Evaluation Metric: BLEU Score
#### What is BLEU Score?
The BLEU (Bilingual Evaluation Understudy) score is a metric for evaluating the quality of text that has been machine-translated from one language to another. It compares a candidate translation to one or more reference translations and calculates a score based on the overlap of n-grams (contiguous sequences of words).

BLEU-4: Considers up to 4-gram matches between the candidate and reference translations. It provides a balance between precision (matching words) and fluency (maintaining the structure of the language). Using BLEU-4 allows us to capture not just individual word matches (unigrams) but also phrases of up to four words. This makes the evaluation more sensitive to the quality of the translation in terms of both accuracy and fluency.

We'll use the ```sacrebleu``` implementation for a standardized BLEU score calculation (which is BLEU-4). You can check the details [here](https://aclanthology.org/W14-3346.pdf).

In [42]:
bleu_metric = evaluate.load('sacrebleu')

Downloading builder script: 8.15kB [00:00, 1.39MB/s]



In [43]:
strategies = ['greedy', 'beam', 'temperature', 'top-k', 'top-p', 'contrastive']
results = {}

hyperparameters = {
    'num_beams': 5,
    'temperature': 1.0,
    'top_k': 50,
    'top_p': 0.95,
    'penalty_alpha': 0.6
}

for strategy in strategies:
    print(f"Evaluating strategy: {strategy}")
    predictions = []
    references = []
    for src_text, tgt_text in zip(dataset_nmt_preproc["src_texts"], dataset_nmt_preproc["tgt_texts"]): 
        pred = generate_text(src_text, strategy=strategy, **hyperparameters)
        predictions.append(pred)
        references.append([tgt_text])  # SacreBLEU expects a list of references

    # Compute BLEU-4 score
    bleu = bleu_metric.compute(predictions=predictions, references=references, smooth_method='exp')
    results[strategy] = bleu['score']
    print(f"BLEU-4 score for {strategy}: {bleu['score']:.2f}")


Evaluating strategy: greedy
BLEU-4 score for greedy: 20.99
Evaluating strategy: beam
BLEU-4 score for greedy: 20.99
Evaluating strategy: beam
BLEU-4 score for beam: 21.14
Evaluating strategy: temperature
BLEU-4 score for beam: 21.14
Evaluating strategy: temperature
BLEU-4 score for temperature: 9.55
Evaluating strategy: top-k
BLEU-4 score for temperature: 9.55
Evaluating strategy: top-k
BLEU-4 score for top-k: 10.96
Evaluating strategy: top-p
BLEU-4 score for top-k: 10.96
Evaluating strategy: top-p
BLEU-4 score for top-p: 13.91
Evaluating strategy: contrastive
BLEU-4 score for top-p: 13.91
Evaluating strategy: contrastive
BLEU-4 score for contrastive: 9.13
BLEU-4 score for contrastive: 9.13


- Seeing the above translations and the BLEU score of the different strategies, which strategy would you choose for this use case?
>>> Según los resultados de BLEU-4, las estrategias deterministas como Beam Search y Greedy Search obtienen puntuaciones significativamente más altas (Beam: 21.14, Greedy: 20.99) en comparación con las estrategias estocásticas (Temperature: 9.55, Top-k: 10.96, Top-p: 13.91, Contrastive: 9.13). Esto indica que, para este caso de uso de traducción, donde la precisión y fidelidad al texto original son prioritarias, lo más recomendable sería usar Beam Search, ya que logra el BLEU más alto y produce traducciones más coherentes y precisas que las estrategias basadas en muestreo aleatorio.

### Use case 2: Story generation

The ```WritingPrompts``` dataset is a collection of imaginative prompts and corresponding stories from the Reddit community. It contains over 300,000 stories written in response to various prompts, making it suitable for training and evaluating models on creative text generation tasks.

We'll use a subset of the dataset for evaluation purposes.

In [44]:
dataset_st_gen = load_dataset('llm-aes/writing-prompts', split='train[:1%]')

Generating train split: 100%|██████████| 232360/232360 [00:03<00:00, 66780.41 examples/s] 
Generating sample_length_10_to_292 split: 100%|██████████| 1000/1000 [00:00<00:00, 93198.47 examples/s]
Generating train split: 100%|██████████| 232360/232360 [00:03<00:00, 66780.41 examples/s] 
Generating sample_length_10_to_292 split: 100%|██████████| 1000/1000 [00:00<00:00, 93198.47 examples/s]


In [45]:
# Randomly select a sample from the dataset
sample = random.choice(dataset_st_gen['prompt'])

print("Sample Text:")
print(sample)

Sample Text:
 Your child is the ( next ) Messiah . They have come to you for guidance before setting off . It 's your last chance to speak to them as a parent .



Now, we have to tell the t5 model to generate text after the input sentence.

In [46]:
def preprocess_story_generation(examples: list[dict]) -> dict:
    """Preprocesses a single example for the story generation task.

    Args:
        examples (list[dict]): A list of dictionaries containing 'prompt' key.

    Returns:
        dict[list]: A dictionary of list with added 'src_texts' for model input.

    The function:
    - Adds the task prefix 'Write a story based on: ' to the prompt.
    """
    # Empty dictionary
    texts = {}
    
    # T5 expects a task prefix
    texts['src_texts'] = ['Write a story based on: ' + ex['prompt'] for ex in examples]
    return texts

dataset_st_gen_preproc = preprocess_story_generation(dataset_st_gen)


As we have no original story to compare how good the model generates a story, you should compare the different decoding strategies by looking at some random stories: 

In [47]:
# Obtain source and target texts
random_index = random.randrange(len(dataset_st_gen_preproc['src_texts']))
random_src_sentence = dataset_st_gen_preproc['src_texts'][random_index]

# Obtain the translated sentence
story_gen_greedy= generate_text(random_src_sentence, max_length=300, strategy='greedy')
story_gen_beam_search = generate_text(random_src_sentence, max_length=300, strategy='beam', num_beams=5)
story_gen_temperature = generate_text(random_src_sentence, max_length=300, strategy='temperature', temperature=0.7)
story_gen_top_k = generate_text(random_src_sentence, max_length=300, strategy='top-k', top_k=50)
story_gen_top_p = generate_text(random_src_sentence, max_length=300, strategy='top-p', top_p=0.95)
story_gen_contrastive = generate_text(random_src_sentence, max_length=300, strategy='contrastive', top_k=4, penalty_alpha=0.6)

print(random_src_sentence)
print(f"Greedy Search: \n{story_gen_greedy}")
print(f"Beam Search: \n{story_gen_beam_search}")
print(f"Temperature Sampling: \n{story_gen_temperature}")
print(f"Top-k Sampling: \n{story_gen_top_k}")
print(f"Top-p Sampling: \n{story_gen_top_p}")
print(f"Contrastive Search: \n{story_gen_contrastive}")

Write a story based on:  You live in a world where humans have nine lives . Each of these nine lives represents a certain aspect of a person ’ s life . When a person is killed , only the aspect of them that was active during their death dies . You have one life left .

Greedy Search: 
You live in a world where humans have nine lives . Each of these nine lives represents a certain aspect of a person ’s life . When a person is killed , only the aspect of them that was active during their death dies . You have one life left .
Beam Search: 
You live in a world where humans have nine lives . Each of these nine lives represents a certain aspect of a person ’ s life . When a person is killed , only the aspect of them that was active during their death dies . You have one life left .
Temperature Sampling: 
You live in a world where humans have nine lives . Each of these nine lives represent a certain aspect of a person ’ s life . When a person is killed , only the aspect of them that was activ

- Seeing the above generated stories of the different strategies, which strategy would you choose for this use case?
>>> Mirando los resultados, se nota que las estrategias deterministas como Greedy y Beam Search mantienen la historia clara y coherente, siguiendo exactamente la premisa que se les dio, sin inventar cosas raras. Las estrategias estocásticas como Temperature, Top-k o Top-p son más creativas y añaden detalles o giros, pero a veces se vuelven confusas o repiten ideas.

Para este caso, donde queremos que la historia tenga sentido y siga bien la idea inicial, lo mejor sería usar Beam Search (o Greedy), porque genera un texto coherente y fiel a lo que pedimos.

### Sensitivity Analysis

Try different hyperparameter values for the decoding strategies, try to optimize the BLEU score for the Neural Machine Translation case and generate better stories in the second use case.

- Which optimal configuration have you found for use case 1? Which are your conclusions based on your analysis?
>>> La configuración óptima encontrada fue Beam Search, ya que obtuvo la puntuación BLEU más alta (21.14) y produjo traducciones coherentes y precisas, sin las incoherencias o errores que aparecían en las estrategias estocásticas como Top-k, Top-p o Temperature. La conclusión es que, para tareas de traducción donde la fidelidad al texto original es importante, usar estrategias deterministas garantiza resultados más fiables.

- Which optimal configuration have you found for use case 2? Which are your conclusions bsaed on your analysis?
>>> La configuración óptima fue también Beam Search (o Greedy), porque generó historias claras y coherentes que respetan la premisa inicial. Las estrategias estocásticas podían introducir más creatividad, pero a costa de generar incoherencias o repeticiones. La conclusión es que, cuando el objetivo es mantener coherencia narrativa y seguir fielmente la idea inicial, las estrategias deterministas son las más recomendables.