# Demonstrating Different Decoding Strategies with Hugging Face Transformers
In this notebook, we will explore various decoding strategies for text generation using a small encoder-decoder model from Hugging Face's Transformers library. We'll apply these strategies to two different datasets:

- Translation task where deterministic strategies are expected to perform better.
- Summarization task where stochastic strategies might yield more diverse and informative outputs.

The decoding strategies we'll test include:

1. Greedy Search
2. Beam Search
3. Temperature Sampling
4. Top-k Sampling
5. Top-p (Nucleus) Sampling

We'll define a custom ```generate_text``` function to apply these strategies and evaluate their performance using appropriate metrics for each dataset.

Here are some useful links you might want to check:
- [Auto Classes](https://huggingface.co/docs/transformers/model_doc/auto)
- [transformers\AutoModel](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModel)
- [transformers\AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer)
- [Google's T5](https://arxiv.org/pdf/1910.10683)
- [T5-small](https://huggingface.co/google-t5/t5-small)
- [Flan-T5-small](https://huggingface.co/google/flan-t5-small)

In [10]:
# !pip install transformers datasets sacrebleu rouge_score evaluate --quiet

In [2]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset
import random
import evaluate

### Load the Pre-trained Model and Tokenizer

We'll use the ```flan-T5-small``` model, which is a small encoder-decoder model suitable for both story_gen and summarization tasks. This model is based on Google's ```t5-small``` model, but fine-tuned on more than 1000 additional tasks covering also more languages.

In [3]:
model_name = 'google/flan-t5-small'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [4]:
def generate_text(input_text, strategy='greedy', max_length=50, **kwargs):
    """Generates text based on the specified decoding strategy.

    Args:
        input_text (str): The input text to be processed by the model.
        strategy (str, optional): The decoding strategy to use. Defaults to 'greedy'.
            Options include:
            - 'greedy': Greedy search decoding.
            - 'beam': Beam search decoding.
            - 'temperature': Temperature sampling.
            - 'top-k': Top-k sampling.
            - 'top-p': Top-p (nucleus) sampling.
            - 'contrastive': Contrastive search decoding.
        max_length (int, optional): The maximum length of the generated text. Defaults to 50.
        **kwargs: Additional keyword arguments specific to the decoding strategy.

    Keyword Args:
        num_beams (int, optional): Number of beams for beam search. Defaults to 5.
            Used when `strategy='beam'`.
        temperature (float, optional): Sampling temperature. Defaults to 1.0.
            Used when `strategy='temperature'`.
        top_k (int, optional): The number of highest probability vocabulary tokens to keep for top-k filtering. Defaults to 50.
            Used when `strategy='top-k'` or when `strategy='contrastive'`.
        top_p (float, optional): Cumulative probability for nucleus sampling. Defaults to 0.95.
            Used when `strategy='top-p'`.
        penalty_alpha (float, optional): Contrastive search penalty factor. Defaults to 0.6.
            Used when `strategy='contrastive'`.

    Returns:
        str: The generated text based on the decoding strategy.

    Raises:
        ValueError: If an unknown decoding strategy is specified.
    """
    input_ids = tokenizer.encode(input_text, return_tensors='pt')
    model.eval()
    with torch.no_grad():
        if strategy == 'greedy':
            output_ids = model.generate(input_ids, max_length=max_length)
        elif strategy == 'beam':
            num_beams = kwargs.get('num_beams', 5)
            output_ids = model.generate(
                input_ids, max_length=max_length, num_beams=num_beams, early_stopping=True
            )
        elif strategy == 'temperature':
            temperature = kwargs.get('temperature', 1.0)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, temperature=temperature
            )
        elif strategy == 'top-k':
            top_k = kwargs.get('top_k', 50)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, top_k=top_k
            )
        elif strategy == 'top-p':
            top_p = kwargs.get('top_p', 0.95)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, top_p=top_p
            )
        elif strategy == 'contrastive':
            penalty_alpha = kwargs.get('penalty_alpha', 0.6)
            top_k = kwargs.get('top_k', 4)
            output_ids = model.generate(
                input_ids, max_length=max_length, penalty_alpha=penalty_alpha, top_k=top_k
            )
        else:
            raise ValueError("Unknown strategy: {}".format(strategy))

    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output_text


### Use case 1: Neural Machine Translation
The WMT16 English-German dataset is a collection of parallel sentences in English and German used for machine translation tasks. It is part of the Conference on Machine story_gen (WMT) shared tasks, which are benchmarks for evaluating machine story_gen systems. The dataset contains professionally translated sentences and covers a variety of topics, making it ideal for training and evaluating story_gen models.

We'll use a subset (1% of the test split) of the dataset for evaluation purposes.

In [4]:
dataset_nmt = load_dataset('wmt16', 'de-en', split='test[:1%]')

In [15]:
# Randomly select a sample from the dataset
sample = random.choice(dataset_nmt['translation'])
input_text = sample['de']
target_text = sample['en']

print("Input Text:")
print(input_text)
print("\nTarget Text:")
print(target_text)


Input Text:
Die von den Ermittlern an beiden Enden des Staates veröffentlichten Details, wie auch das, was Studenten und Mitarbeiter, die ihn kannten, aussagten, half dabei, ein Bild von einem talentierten, aber möglicherweise schwierigen Lehrer zu zeichnen.

Target Text:
The details released by investigators at both ends of the state as well as students and staff who knew him helped paint a picture of a talented but possibly troubled teacher.


We need to preprocess the dataset to prepare it for input into the T5 model. The T5 model expects input in a specific format, including a task prefix. This is because the T5 model is a multi-purpose model, being able to perform several task, so we need to tell it what it must do.

In [5]:
def preprocess_translation(examples: list[dict]) -> dict:
    """Preprocesses a single example for the translation task.

    Args:
        examples list[dict]: A list of dictionaries containing 'de' and 'en' keys with English and German sentences.

    Returns:
        dict[list]: A dictionary of lists with added 'src_texts' and 'tgt_texts' keys for model input and target.

    The function:
    - Adds the task prefix 'translate German to English: ' to the German sentence.
    - Stores the result in 'src_texts'.
    - Copies the English sentence to 'tgt_texts'.
    """
    # Empty dictionary
    texts = {}
    
    # T5 expects a "translate German to English: " prefix
    texts['src_texts'] = ['translate German to English: ' + ex['de'] for ex in examples]
    texts['tgt_texts'] = [ex['en'] for ex in examples]
    return texts

dataset_nmt_preproc = preprocess_translation(dataset_nmt["translation"])


Let's see how we can generate a translation using the t5 model. From a random dataset, we are going to create translations using the different implemented strategies:

In [17]:
# Obtain source and target texts
random_index = random.randint(0, len(dataset_nmt_preproc['src_texts']))
random_src_sentence = dataset_nmt_preproc['src_texts'][random_index]
random_tgt_sentence = dataset_nmt_preproc['tgt_texts'][random_index]

# Obtain the translated sentence
translation_greedy= generate_text(random_src_sentence, strategy='greedy')
translation_beam_search = generate_text(random_src_sentence, strategy='beam', num_beams=5)
translation_temperature = generate_text(random_src_sentence, strategy='temperature', temperature=0.7)
translation_top_k = generate_text(random_src_sentence, strategy='top-k', top_k=50)
translation_top_p = generate_text(random_src_sentence, strategy='top-p', top_p=0.95)
translation_contrastive = generate_text(random_src_sentence, strategy='contrastive', top_k=4, penalty_alpha=0.6)

print(random_src_sentence)
print(f"Original translation: \n{random_tgt_sentence}\n")
print(f"Greedy Search: \n{translation_greedy}\n")
print(f"Beam Search: \n{translation_beam_search}\n")
print(f"Temperature Sampling: \n{translation_temperature}\n")
print(f"Top-k Sampling: \n{translation_top_k}\n")
print(f"Top-p Sampling: \n{translation_top_p}\n")
print(f"Contrastive Search: \n{translation_contrastive}\n")

translate German to English: Lamb hatte zuvor die Delta State University um eine Beurlaubung aus gesundheitlichen Gründen gebeten und dabei gesagt, dass er irgendein gesundheitliches Problem habe.
Original translation: 
Lamb had earlier asked Delta State University for a medical leave of absence, saying he had a health issue of some sort.

Greedy Search: 
Lamb has been a member of Delta State University for a year and said that he has any health problem.

Beam Search: 
Lamb has been at Delta State University for several years, and he said that he has any health problem.

Temperature Sampling: 
Lamb sat there for a time to remember a health issue and said that he had any health issue.

Top-k Sampling: 
Lamb had previously seen the Delta State University canceling a trip to South Africa and said that he any concern he raised.

Top-p Sampling: 
Lamb has always been a student at Delta State University, and he said that his health has been a mystery.

Contrastive Search: 
Lamb has been a me

- Do you see something different between the deterministic and the stochastic strategies? Try different random sentences.
>>> Write your answer here

Yes, there is a clear difference. I have sampled different sequences and I can summarize the followinf statements:

- The deterministic strategies (Greedy Search, Beam Search, Contrastive Search) tend to produce translations that are more consistent and closer to the meaning of the original sentence, even if not exact. For example, in the example above, we can see that three sequences whose meaning is closer to the original are these three sequences, althoung their meanings are not exactly to the original. They usually stay on track and do not invent completely unrelated content.

- On the other hand, the stochastic (sampling-based) strategies (Temperature, Top-k, Top-p) generate much more variation. Sometimes they produce sentences that diverge significantly from the original meaning, adding or changing details that were not in the source text. As we can see in the example above, the meaning of the sentences produced by these methods are quite/higher different than the original one. Top-k sampling has changed the meaning completly

We can conclude that for this type of tasks, deterministic methods are quite better than sampling methods, because the first ones try to produce translations closer to the original, while the second ones try to generate sequences more diverse


### Evaluation Metric: BLEU Score
#### What is BLEU Score?
The BLEU (Bilingual Evaluation Understudy) score is a metric for evaluating the quality of text that has been machine-translated from one language to another. It compares a candidate translation to one or more reference translations and calculates a score based on the overlap of n-grams (contiguous sequences of words).

BLEU-4: Considers up to 4-gram matches between the candidate and reference translations. It provides a balance between precision (matching words) and fluency (maintaining the structure of the language). Using BLEU-4 allows us to capture not just individual word matches (unigrams) but also phrases of up to four words. This makes the evaluation more sensitive to the quality of the translation in terms of both accuracy and fluency.

We'll use the ```sacrebleu``` implementation for a standardized BLEU score calculation (which is BLEU-4). You can check the details [here](https://aclanthology.org/W14-3346.pdf).

In [6]:
bleu_metric = evaluate.load('sacrebleu')

In [19]:
strategies = ['greedy', 'beam', 'temperature', 'top-k', 'top-p', 'contrastive']
results = {}

hyperparameters = {
    'num_beams': 5,
    'temperature': 1.0,
    'top_k': 50,
    'top_p': 0.95,
    'penalty_alpha': 0.6
}

for strategy in strategies:
    print(f"Evaluating strategy: {strategy}")
    predictions = []
    references = []
    for src_text, tgt_text in zip(dataset_nmt_preproc["src_texts"], dataset_nmt_preproc["tgt_texts"]): 
        pred = generate_text(src_text, strategy=strategy, **hyperparameters)
        predictions.append(pred)
        references.append([tgt_text])  # SacreBLEU expects a list of references

    # Compute BLEU-4 score
    bleu = bleu_metric.compute(predictions=predictions, references=references, smooth_method='exp')
    results[strategy] = bleu['score']
    print(f"BLEU-4 score for {strategy}: {bleu['score']:.2f}")


Evaluating strategy: greedy
BLEU-4 score for greedy: 20.99
Evaluating strategy: beam
BLEU-4 score for beam: 21.14
Evaluating strategy: temperature
BLEU-4 score for temperature: 12.83
Evaluating strategy: top-k
BLEU-4 score for top-k: 13.10
Evaluating strategy: top-p
BLEU-4 score for top-p: 11.70
Evaluating strategy: contrastive
BLEU-4 score for contrastive: 19.34


- Seeing the above translations and the BLEU score of the different strategies, which strategy would you choose for this use case?
>>> Write your answer here

As we can see, this clearly reflects what was mentioned before: the deterministic methods (greedy, beam seacr and constractive) have higher scores than the sampling ones, as expected, since the sampling strategies completely changed the meaning of the sentence in the example above.
 
The deterministic ones have similar scores. I would discard greedy because it is too rigid in its choices, and since beam search gives me a better chance to find the global optimum, I would prefer it over greedy (also beam search has a higher score).

Then, between beam search and contrastive, I would be in doubt. Although in this case beam search has a higher score, I think contrastive is also a very good method because it combines the probability of the model with a degeneration penalty, which can be very useful in case the sentence grows too much.

What I would do is test both methods and, after several experiments, choose the one that gives the best results. If I have to choose one right now, it would be beam search, because its score is very high and it is also less computationally expensive than contrastive.

### Use case 2: Story generation

The ```WritingPrompts``` dataset is a collection of imaginative prompts and corresponding stories from the Reddit community. It contains over 300,000 stories written in response to various prompts, making it suitable for training and evaluating models on creative text generation tasks.

We'll use a subset of the dataset for evaluation purposes.

In [5]:
dataset_st_gen = load_dataset('llm-aes/writing-prompts', split='train[:1%]')

In [22]:
# Randomly select a sample from the dataset
sample = random.choice(dataset_st_gen['prompt'])

print("Sample Text:")
print(sample)

Sample Text:
 You 've been sentenced to die by being locked in a cage in the middle of a hot empty desert .



Now, we have to tell the t5 model to generate text after the input sentence.

In [8]:
def preprocess_story_generation(examples: list[dict]) -> dict:
    """Preprocesses a single example for the story generation task.

    Args:
        examples (list[dict]): A list of dictionaries containing 'prompt' key.

    Returns:
        dict[list]: A dictionary of list with added 'src_texts' for model input.

    The function:
    - Adds the task prefix 'Write a story based on: ' to the prompt.
    """
    # Empty dictionary
    texts = {}
    
    # T5 expects a task prefix
    texts['src_texts'] = ['Write a story based on: ' + ex['prompt'] for ex in examples]
    return texts

dataset_st_gen_preproc = preprocess_story_generation(dataset_st_gen)


As we have no original story to compare how good the model generates a story, you should compare the different decoding strategies by looking at some random stories: 

In [24]:
# Obtain source and target texts
random_index = random.randint(0, len(dataset_st_gen_preproc['src_texts']))
random_src_sentence = dataset_st_gen_preproc['src_texts'][random_index]

# Obtain the translated sentence
story_gen_greedy= generate_text(random_src_sentence, max_length=300, strategy='greedy')
story_gen_beam_search = generate_text(random_src_sentence, max_length=300, strategy='beam', num_beams=5)
story_gen_temperature = generate_text(random_src_sentence, max_length=300, strategy='temperature', temperature=0.7)
story_gen_top_k = generate_text(random_src_sentence, max_length=300, strategy='top-k', top_k=50)
story_gen_top_p = generate_text(random_src_sentence, max_length=300, strategy='top-p', top_p=0.95)
story_gen_contrastive = generate_text(random_src_sentence, max_length=300, strategy='contrastive', top_k=4, penalty_alpha=0.6)

print(random_src_sentence)
print(f"Greedy Search: \n{story_gen_greedy}\n")
print(f"Beam Search: \n{story_gen_beam_search}\n")
print(f"Temperature Sampling: \n{story_gen_temperature}\n")
print(f"Top-k Sampling: \n{story_gen_top_k}\n")
print(f"Top-p Sampling: \n{story_gen_top_p}\n")
print(f"Contrastive Search: \n{story_gen_contrastive}\n")

Write a story based on:  The narrator is fluently telling the story when he suddenly realizes the rest of the script is gone .

Greedy Search: 
The narrator is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a sailor . He is a

Beam Search: 
The narrator is fluently telling the story when he suddenly realizes the rest of the script is gone .

Temperature Sampling: 
The narrator is a bit confused . . . the script is gone . The narrator is fluently telling the story when he suddenly realizes the rest of the script is gone .

Top-k Sampling: 
The main character is ench

- Seeing the above generated stories of the different strategies, which strategy would you choose for this use case?
>>> Write your answer here

En este tipo de tareas, podemos observar como las estrategias deterministas actúan peor que las estocásticas.
Fijándonos en greedy, vemos como queda atrapado en su mayor problema, secuencias repetitivas, donde en bucle se queda dieciendo "él es un marinero" infinitas veces, texto que no tiene sentido.

Por otro lado, el beam search se ha limitado a copiar el argumento de la historia y no a generar una nueva. Esto se debe a que las estrategias deterministas tienden a buscar la opción más probable en cada paso, lo que reduce la diversidad del texto generado.

Para terminar con los deterministas, vemos como con la busqueda constractiva pasa lo mismo, se que pillado en una palabra y ya no sale de ahí.

Para este tipo de tareas se ve claramente como las estocásticas muestran resulltados mucho más naturales y variados, como sería el caso de Top-p, top-k y temeperature sampling. Aunque pueden introducir cierta aleatoriedad, logran generar textos más creativos, coherentes y con continuidad narrativa.

Por ejemplo, Top-p Sampling mantiene una buena fluidez en la historia, con un balance entre coherencia y originalidad, evitando tanto la repetición como la incoherencia total.

Por todo ello, aunque me quedaría con cualquiera de las estocástica, en este caso elegiría el de Top-p, por su naturaleza y por cómo está definido, que mejora las otras dos estrategias estocásticas y en este caso el resultado generado por ella es muy bueno.

### Sensitivity Analysis

Try different hyperparameter values for the decoding strategies, try to optimize the BLEU score for the Neural Machine Translation case and generate better stories in the second use case.

- Which optimal configuration have you found for use case 1? Which are your conclusions based on your analysis?
>>> Write your answer here

- Which optimal configuration have you found for use case 2? Which are your conclusions bsaed on your analysis?
>>> Write your answer here

Tras probar con diferentes configuraciones, obtenido estas conclusiones finales: 

TOP CONFIGURACIONES:

{'strategy': 'beam', 'params': {'num_beams': 10}, 'BLEU': 0.4881172979721856}

{'strategy': 'beam', 'params': {'num_beams': 8}, 'BLEU': 0.4566655607105938}

{'strategy': 'beam', 'params': {'num_beams': 5}, 'BLEU': 0.4376870123133287}

{'strategy': 'beam', 'params': {'num_beams': 3}, 'BLEU': 0.36616517117616154}

{'strategy': 'beam', 'params': {'num_beams': 4}, 'BLEU': 0.326340088513953}

Aunque haya salido que la estartegia mejor para este caso sea beam seacrh, voy a detallar todos los resultados obtenidos con los diferentes valores de los hiperparámetros para cada estartegia (se puede ver debajo del todo la ejecución):

- Beam search: Vemos como comparando el BLEU value con todas las demás opciones, los mejores valores se obtienen con la estratgia de Beam Search, alcanzando un BLEU máximo de 0.4881 con 10 beams. Esto se debe a que beam search explora múltiples trayectorias posibles y selecciona la más probable globalmente, lo que favorece a la coherencia y gramática. Sin embargo, como se observa en los resultados, al aumentar demasaidao el número de beams, tiene a producir textos más deterministas y menos creativos. Al disminuir el número de bemas, se va acercando cada vez más a greedy que puede dar textos incoherentes.Es una estrategia buena cuando queremos conseguir consistencia, claridad y fluidez narrativa. 

- Temperature Sampling: Probando para diferentes valores, he obtenido el mejor valor de BLEU para T=0.7, con valor de BLEU de 0.3235, obteniendo valores demasiados malos de BLUE para los otros valores con 0.0142 para T01 por ejemplo, lo que nos lleva a concluir que BLEU desciende drásticamente con temperaturas altas (<1.0). Es una estrategia que genera textos más creativos, pero menos consiestentes. Una conclusión que obtengo con lo svalores de BLEU obtenidos es que esta estrategia, para temperaturas bajas (0.7) mantiene la salida con una estrcutura y creatividad controlada, pero a medida que vamos aumentando la temperatura, el modelo elije más tokens aleatorios, perdiendo lógica en el texto generado. Es una estrategia buena para generar historias diversas y originales pero mtiene mala calidad con la medida de BLUE.

- Top-k Sampling: Probando con diferentes valores, el mejor hiperparámetro ha sido k=30 con un valor de BLEU de 0.0909. Éste es un modelo con alta aleatoridad, donde pierde coherencia con el prompt. Vemos que la medida de BLEU para todas las configraciones es bastante mala, siendo la máxima de 0.0909; a medida que k va creciendo, el modelo tiene más opciones de palabras en cada ejecución, por lo que puede llevar a un aumento de entropía y de incoherencia. Si buscamos diversidad y creatividad literaria es un modelo bueno, pero no es óptima para precisión semántica. Produce historias más randoms pero que se ajustan menos al prompt original.

- Top-p: éste es un método un poco más controlado que el anterior, pero menos preciso que el de beam search tal y como se puede ver en los valores de BLEU obtenidos, donde el mejor valor de hiperparámetro ha sido p=0.6 con un BLEU value de 0.2344. Analiando los resultados, se ve que al aumentar p (0.9 o 0.95en mi ejemplo), los valores empeoran mucho ya que el modelo está permitineod elegir tokens de baja probabilidad a ser muestreados. Un p más pequeño, como mi óptimo de 0.6, fuerza al modelo a conentrarse en opciones más probables. Por lo que puedo concluir, que los resulatdos de esta estrategia está entre los de beam search y temperature sampling, sin llegar a ser lo precisa que es beam search

- Constractive Search: Este es un método que combina la parte determinista y aleatoria, bsuscando un equilibrio entre diversidad y coherencia. El mejor valor ha sido com k=4 y alpha=0.8, obteniendo un BLEU value de 0.2815. Éste metodo penaliza la repetición y favorece frases más informativas y diversas. Su rendimiento ha sudo mejor que top-p y top-k pero peor que beam secarch en cuenato a vlores de BLEU obtenidos. 

Conclusión final Caso 2: Tras probar distintas estrategias de decodificación en el modelo Flan-T5-small para generación de historias, se observa que Beam Search con 10 beams ofrece la mejor calidad global, alcanzando un BLEU ≈ 0.49. Este método genera textos más coherentes, gramaticalmente correctos y fieles al prompt original. Por otro lado, los métodos de muestreo introducen mayor diversidad, pero sacrifican precisión semántica y estructura narrativa, reflejado en BLEU bajos.

Por tanto, la configuración óptima para este caso es beam seacrh con un numero de beams de 10, que maximiza la coherencia y la calidad medida por BLEU, confirmando que las estrategias deterministas son las más adecuadas en contextos de generación controlada.

In [6]:
# Codigo anterior para que no de error
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Cargar modelo y tokenizer
model_name = 'google/flan-t5-small'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def generate_text(input_text, strategy='greedy', max_length=150, **kwargs):
    input_ids = tokenizer.encode(input_text, return_tensors='pt')
    model.eval()
    with torch.no_grad():
        if strategy == 'greedy':
            output_ids = model.generate(input_ids, max_length=max_length)
        elif strategy == 'beam':
            num_beams = kwargs.get('num_beams', 5)
            output_ids = model.generate(input_ids, max_length=max_length, num_beams=num_beams, early_stopping=True)
        elif strategy == 'temperature':
            temperature = kwargs.get('temperature', 1.0)
            output_ids = model.generate(input_ids, max_length=max_length, do_sample=True, temperature=temperature)
        elif strategy == 'top-k':
            top_k = kwargs.get('top_k', 50)
            output_ids = model.generate(input_ids, max_length=max_length, do_sample=True, top_k=top_k)
        elif strategy == 'top-p':
            top_p = kwargs.get('top_p', 0.95)
            output_ids = model.generate(input_ids, max_length=max_length, do_sample=True, top_p=top_p)
        elif strategy == 'contrastive':
            penalty_alpha = kwargs.get('penalty_alpha', 0.6)
            top_k = kwargs.get('top_k', 4)
            output_ids = model.generate(
                input_ids, max_length=max_length, penalty_alpha=penalty_alpha, top_k=top_k
            )
        else:
            raise ValueError(f"Unknown strategy: {strategy}")
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


In [7]:
# Caso 2

def preprocess_story_generation(examples: list[dict]) -> dict:
    """Preprocesses a single example for the story generation task.

    Args:
        examples (list[dict]): A list of dictionaries containing 'prompt' key.

    Returns:
        dict[list]: A dictionary of list with added 'src_texts' for model input.

    The function:
    - Adds the task prefix 'Write a story based on: ' to the prompt.
    """
    # Empty dictionary
    texts = {}
    
    # T5 expects a task prefix
    texts['src_texts'] = ['Write a story based on: ' + ex['prompt'] for ex in examples]
    return texts

dataset_st_gen_preproc = preprocess_story_generation(dataset_st_gen)

# Seleccionar aleatoriamente un subconjunto pequeño para evaluación
sample_prompts = random.sample(dataset_st_gen_preproc['src_texts'], 5)

# Hiperparámetros
param_grid = {
    'beam': [{'num_beams': n} for n in [3, 4, 5, 7, 8, 10]],
    'temperature': [{'temperature': t} for t in [0.7, 1.0, 1.3, 1.5]],
    'top-k': [{'top_k': k} for k in [15, 30, 50, 80, 100]],
    'top-p': [{'top_p': p} for p in [0.6, 0.7, 0.8, 0.9, 0.95]],
    'contrastive': [{'top_k': k, 'penalty_alpha': pa} for k in [4, 8] for pa in [0.4, 0.6, 0.8]]
}

# Bleu Evaluation
results = []
smooth = SmoothingFunction().method1

for strategy, configs in param_grid.items():
    for config in configs:
        print(f"\n - Strategy: {strategy} | Params: {config}")
        bleu_scores = []
        for prompt in sample_prompts:
            generated = generate_text(prompt, strategy=strategy, max_length=200, **config)
            reference = [prompt.lower().split()]
            candidate = generated.lower().split()
            bleu = sentence_bleu(reference, candidate, smoothing_function=smooth)
            bleu_scores.append(bleu)
        avg_bleu = sum(bleu_scores) / len(bleu_scores)
        results.append({'strategy': strategy, 'params': config, 'BLEU': avg_bleu})
        print(f"-> Average BLEU: {avg_bleu:.4f}")

results_sorted = sorted(results, key=lambda x: x['BLEU'], reverse=True)
print("\n TOP CONFIGURACIONES:")
for r in results_sorted[:5]:
    print(r)



 - Strategy: beam | Params: {'num_beams': 3}
-> Average BLEU: 0.3662

 - Strategy: beam | Params: {'num_beams': 4}
-> Average BLEU: 0.3263

 - Strategy: beam | Params: {'num_beams': 5}
-> Average BLEU: 0.4377

 - Strategy: beam | Params: {'num_beams': 7}
-> Average BLEU: 0.2953

 - Strategy: beam | Params: {'num_beams': 8}
-> Average BLEU: 0.4567

 - Strategy: beam | Params: {'num_beams': 10}
-> Average BLEU: 0.4881

 - Strategy: temperature | Params: {'temperature': 0.7}
-> Average BLEU: 0.3235

 - Strategy: temperature | Params: {'temperature': 1.0}
-> Average BLEU: 0.0142

 - Strategy: temperature | Params: {'temperature': 1.3}
-> Average BLEU: 0.0075

 - Strategy: temperature | Params: {'temperature': 1.5}
-> Average BLEU: 0.0049

 - Strategy: top-k | Params: {'top_k': 15}
-> Average BLEU: 0.0314

 - Strategy: top-k | Params: {'top_k': 30}
-> Average BLEU: 0.0909

 - Strategy: top-k | Params: {'top_k': 50}
-> Average BLEU: 0.0084

 - Strategy: top-k | Params: {'top_k': 80}
-> Aver