# Demonstrating Different Decoding Strategies with Hugging Face Transformers
In this notebook, we will explore various decoding strategies for text generation using a small encoder-decoder model from Hugging Face's Transformers library. We'll apply these strategies to two different datasets:

- A translation task, where deterministic strategies are expected to perform better.
- A story generation task, where stochastic strategies might yield more diverse and creative outputs.

The decoding strategies we'll test include:

1. Greedy Search
2. Beam Search
3. Temperature Sampling
4. Top-k Sampling
5. Top-p (Nucleus) Sampling
6. Contrastive Search

We'll define a custom ```generate_text``` function to apply these strategies and evaluate their performance using appropriate metrics for each dataset.
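
To make the sampling-based strategies concrete before calling the real model, here is a minimal sketch of what greedy selection, temperature scaling, top-k filtering and top-p filtering do to a single next-token distribution. The vocabulary, the logits and the helper names (`softmax`, `top_k_filter`, `top_p_filter`) are invented for illustration; this is not how `model.generate` is implemented internally, only the idea behind each option.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature < 1 sharpens, > 1 flattens."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize the rest to zero."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

# Hypothetical vocabulary and logits for one decoding step (made up for illustration).
vocab = ["the", "a", "cat", "dog", "zebra"]
logits = [2.0, 1.5, 1.0, 0.5, -1.0]

probs = softmax(logits)
greedy_token = vocab[probs.index(max(probs))]   # greedy: always the most probable token
sharp = softmax(logits, temperature=0.5)        # more mass on the top token
flat = softmax(logits, temperature=2.0)         # mass spread more evenly
top2 = top_k_filter(probs, k=2)                 # only the two best tokens survive
nucleus = top_p_filter(probs, p=0.8)            # tokens covering at least 80% of the mass

rng = random.Random(0)  # seeded so the sampled token is repeatable
sampled_token = rng.choices(vocab, weights=nucleus, k=1)[0]
print("greedy:", greedy_token, "| sampled from nucleus:", sampled_token)
```

Greedy search always returns the same token for the same distribution, whereas the sampled token depends on the random state, which is why the stochastic strategies give different outputs from run to run unless a seed is fixed.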

Here are some useful links you might want to check:
- [Auto Classes](https://huggingface.co/docs/transformers/model_doc/auto)
- [transformers.AutoModel](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoModel)
- [transformers.AutoTokenizer](https://huggingface.co/docs/transformers/model_doc/auto#transformers.AutoTokenizer)
- [Google's T5](https://arxiv.org/pdf/1910.10683)
- [T5-small](https://huggingface.co/google-t5/t5-small)
- [Flan-T5-small](https://huggingface.co/google/flan-t5-small)

In [1]:
# !pip install transformers datasets sacrebleu rouge_score evaluate --quiet

In [2]:
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
from datasets import load_dataset
import random
import evaluate

### Load the Pre-trained Model and Tokenizer

We'll use the ```flan-t5-small``` model, a small encoder-decoder model suitable for both translation and summarization tasks. It is based on Google's ```t5-small``` model, but has been fine-tuned on more than 1,000 additional tasks that also cover more languages.

In [3]:
model_name = 'google/flan-t5-small'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)




In [4]:
def generate_text(input_text, strategy='greedy', max_length=50, **kwargs):
    """Generates text based on the specified decoding strategy.

    Args:
        input_text (str): The input text to be processed by the model.
        strategy (str, optional): The decoding strategy to use. Defaults to 'greedy'.
            Options include:
            - 'greedy': Greedy search decoding.
            - 'beam': Beam search decoding.
            - 'temperature': Temperature sampling.
            - 'top-k': Top-k sampling.
            - 'top-p': Top-p (nucleus) sampling.
            - 'contrastive': Contrastive search decoding.
        max_length (int, optional): The maximum length of the generated text. Defaults to 50.
        **kwargs: Additional keyword arguments specific to the decoding strategy.

    Keyword Args:
        num_beams (int, optional): Number of beams for beam search. Defaults to 5.
            Used when `strategy='beam'`.
        temperature (float, optional): Sampling temperature. Defaults to 1.0.
            Used when `strategy='temperature'`.
        top_k (int, optional): The number of highest probability vocabulary tokens to keep for top-k filtering.
            Defaults to 50 when `strategy='top-k'` and to 4 when `strategy='contrastive'`.
        top_p (float, optional): Cumulative probability for nucleus sampling. Defaults to 0.95.
            Used when `strategy='top-p'`.
        penalty_alpha (float, optional): Contrastive search penalty factor. Defaults to 0.6.
            Used when `strategy='contrastive'`.

    Returns:
        str: The generated text based on the decoding strategy.

    Raises:
        ValueError: If an unknown decoding strategy is specified.
    """
    input_ids = tokenizer.encode(input_text, return_tensors='pt')
    model.eval()
    with torch.no_grad():
        if strategy == 'greedy':
            output_ids = model.generate(input_ids, max_length=max_length)
        elif strategy == 'beam':
            num_beams = kwargs.get('num_beams', 5)
            output_ids = model.generate(
                input_ids, max_length=max_length, num_beams=num_beams, early_stopping=True
            )
        elif strategy == 'temperature':
            temperature = kwargs.get('temperature', 1.0)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, temperature=temperature
            )
        elif strategy == 'top-k':
            top_k = kwargs.get('top_k', 50)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, top_k=top_k
            )
        elif strategy == 'top-p':
            top_p = kwargs.get('top_p', 0.95)
            output_ids = model.generate(
                input_ids, max_length=max_length, do_sample=True, top_p=top_p
            )
        elif strategy == 'contrastive':
            penalty_alpha = kwargs.get('penalty_alpha', 0.6)
            top_k = kwargs.get('top_k', 4)
            output_ids = model.generate(
                input_ids, max_length=max_length, penalty_alpha=penalty_alpha, top_k=top_k
            )
        else:
            raise ValueError("Unknown strategy: {}".format(strategy))

    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return output_text


### Use case 1: Neural Machine Translation
The WMT16 English-German dataset is a collection of parallel sentences in English and German used for machine translation tasks. It is part of the Conference on Machine Translation (WMT) shared tasks, which are benchmarks for evaluating machine translation systems. The dataset contains professionally translated sentences and covers a variety of topics, making it well suited to training and evaluating translation models.

We'll use a subset (1% of the test split) of the dataset for evaluation purposes.

In [5]:
dataset_nmt = load_dataset('wmt16', 'de-en', split='test[:1%]')

In [6]:
# Randomly select a sample from the dataset
sample = random.choice(dataset_nmt['translation'])
input_text = sample['de']
target_text = sample['en']

print("Input Text:")
print(input_text)
print("\nTarget Text:")
print(target_text)


Input Text:
In diesem Jahr unterrichtete er nur zwei Online-Kurse.

Target Text:
This year, he was only teaching two online classes.


We need to preprocess the dataset to prepare it for input into the T5 model. T5 expects input in a specific format that includes a task prefix: because it is a multi-purpose model that can perform several tasks, we need to tell it which one to carry out.

In [7]:
def preprocess_translation(examples: list[dict]) -> dict:
    """Preprocesses a list of examples for the translation task.

    Args:
        examples (list[dict]): A list of dictionaries with 'de' and 'en' keys holding the German and English sentences.

    Returns:
        dict: A dictionary with 'src_texts' and 'tgt_texts' keys, each mapping to a list of strings (model inputs and targets).

    The function:
    - Adds the task prefix 'translate German to English: ' to the German sentence.
    - Stores the result in 'src_texts'.
    - Copies the English sentence to 'tgt_texts'.
    """
    # Empty dictionary
    texts = {}
    
    # T5 expects a "translate German to English: " prefix
    texts['src_texts'] = ['translate German to English: ' + ex['de'] for ex in examples]
    texts['tgt_texts'] = [ex['en'] for ex in examples]
    return texts

dataset_nmt_preproc = preprocess_translation(dataset_nmt["translation"])


Let's see how we can generate a translation using the T5 model. For a randomly chosen sentence from the dataset, we are going to create translations with each of the implemented strategies:

In [8]:
# Obtain source and target texts
# randrange excludes the upper bound; randint(0, len(...)) could return an out-of-range index
random_index = random.randrange(len(dataset_nmt_preproc['src_texts']))
random_src_sentence = dataset_nmt_preproc['src_texts'][random_index]
random_tgt_sentence = dataset_nmt_preproc['tgt_texts'][random_index]

# Generate a translation with each strategy
translation_greedy = generate_text(random_src_sentence, strategy='greedy')
translation_beam_search = generate_text(random_src_sentence, strategy='beam', num_beams=5)
translation_temperature = generate_text(random_src_sentence, strategy='temperature', temperature=0.7)
translation_top_k = generate_text(random_src_sentence, strategy='top-k', top_k=50)
translation_top_p = generate_text(random_src_sentence, strategy='top-p', top_p=0.95)
translation_contrastive = generate_text(random_src_sentence, strategy='contrastive', top_k=4, penalty_alpha=0.6)

print(random_src_sentence)
print(f"Original translation: \n{random_tgt_sentence}")
print(f"Greedy Search: \n{translation_greedy}")
print(f"Beam Search: \n{translation_beam_search}")
print(f"Temperature Sampling: \n{translation_temperature}")
print(f"Top-k Sampling: \n{translation_top_k}")
print(f"Top-p Sampling: \n{translation_top_p}")
print(f"Contrastive Search: \n{translation_contrastive}")

translate German to English: Bei der Begegnung soll es aber auch um den Konflikt mit den Palästinensern und die diskutierte Zwei-Staaten-Lösung gehen.
Original translation: 
The meeting was also planned to cover the conflict with the Palestinians and the disputed two state solution.
Greedy Search: 
The onset of the conflict should also be a part of the conflict with the Palestinians and the debated two-seat solution.
Beam Search: 
However, it will also be a conflict with the Palästinians and the debated two-staaten solution.
Temperature Sampling: 
To conclude, however, we must also take for granted the conflict with Palästina and the debated two-seat solution.
Top-k Sampling: 
When agreed the question will be, but it will also come to the end of the conflict with Palestinians and the debated second-seat solution.
Top-p Sampling: 
Nevertheless, peace talks may also play out for peace and with it, in other areas including a Palestinian-only dispute.
Contrastive Search: 
The decision shou

- Do you see something different between the deterministic and the stochastic strategies? Try different random sentences.
>>> Yes, there are clear differences: **Deterministic strategies** (Greedy and Beam Search) produce consistent, repeatable outputs but struggle with complex sentences. **Stochastic strategies** (Temperature, Top-k, Top-p) generate different outputs each time due to randomness, showing more variation but often less coherent results. In this translation example, all strategies struggle, but deterministic methods maintain consistency in their approach while stochastic methods add unpredictability that doesn't improve translation quality.

### Evaluation Metric: BLEU Score
#### What is BLEU Score?
The BLEU (Bilingual Evaluation Understudy) score is a metric for evaluating the quality of text that has been machine-translated from one language to another. It compares a candidate translation to one or more reference translations and calculates a score based on the overlap of n-grams (contiguous sequences of words).

BLEU-4: Considers up to 4-gram matches between the candidate and reference translations. It provides a balance between precision (matching words) and fluency (maintaining the structure of the language). Using BLEU-4 allows us to capture not just individual word matches (unigrams) but also phrases of up to four words. This makes the evaluation more sensitive to the quality of the translation in terms of both accuracy and fluency.

We'll use the ```sacrebleu``` implementation for a standardized BLEU score calculation (which is BLEU-4). You can check the details [here](https://aclanthology.org/W14-3346.pdf).
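
As a rough illustration of the n-gram overlap idea described above, the following sketch computes an unsmoothed, sentence-level BLEU-4 on whitespace tokens. It is a simplification under stated assumptions, not the ```sacrebleu``` implementation: ```sacrebleu``` applies its own tokenization, aggregates counts over the whole corpus and supports smoothing. The helper names (`ngrams`, `modified_precision`, `simple_bleu`) and the two candidate sentences are made up for the example; the reference is the target sentence from the sample above, lowercased and without punctuation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped precision: a candidate n-gram counts at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU on whitespace tokens, scaled to 0-100."""
    cand, ref = candidate.split(), reference.split()
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, a single zero precision zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: candidates shorter than the reference are penalized
    brevity_penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * brevity_penalty * math.exp(log_avg)

reference = "this year he was only teaching two online classes"
close = "this year he was only teaching two classes"        # hypothetical near-match
paraphrase = "this year he only taught two online classes"  # hypothetical paraphrase

print(f"identical:  {simple_bleu(reference, reference):.2f}")
print(f"close:      {simple_bleu(close, reference):.2f}")
print(f"paraphrase: {simple_bleu(paraphrase, reference):.2f}")
```

The paraphrase shares no 4-gram with the reference, so its unsmoothed score collapses to zero even though most words match. This is the kind of case that smoothing (such as the `smooth_method='exp'` option used below) is meant to soften.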

In [9]:
bleu_metric = evaluate.load('sacrebleu')

In [10]:
strategies = ['greedy', 'beam', 'temperature', 'top-k', 'top-p', 'contrastive']
results = {}

hyperparameters = {
    'num_beams': 5,
    'temperature': 1.0,
    'top_k': 50,
    'top_p': 0.95,
    'penalty_alpha': 0.6
}

for strategy in strategies:
    print(f"Evaluating strategy: {strategy}")
    predictions = []
    references = []
    for src_text, tgt_text in zip(dataset_nmt_preproc["src_texts"], dataset_nmt_preproc["tgt_texts"]): 
        pred = generate_text(src_text, strategy=strategy, **hyperparameters)
        predictions.append(pred)
        references.append([tgt_text])  # SacreBLEU expects a list of references

    # Compute BLEU-4 score
    bleu = bleu_metric.compute(predictions=predictions, references=references, smooth_method='exp')
    results[strategy] = bleu['score']
    print(f"BLEU-4 score for {strategy}: {bleu['score']:.2f}")


Evaluating strategy: greedy
BLEU-4 score for greedy: 20.99
Evaluating strategy: beam
BLEU-4 score for beam: 21.14
Evaluating strategy: temperature
BLEU-4 score for temperature: 10.29
Evaluating strategy: top-k
BLEU-4 score for top-k: 11.41
Evaluating strategy: top-p
BLEU-4 score for top-p: 10.97
Evaluating strategy: contrastive
BLEU-4 score for contrastive: 19.34


- Seeing the above translations and the BLEU score of the different strategies, which strategy would you choose for this use case?
>>> I would choose **Beam Search** (BLEU: 21.14), which achieves the highest score. The deterministic strategies (Beam Search: 21.14, Greedy: 20.99, Contrastive Search: 19.34) clearly outperform the stochastic ones (Temperature: 10.29, Top-k: 11.41, Top-p: 10.97 in the run shown above; sampled scores change from run to run). This makes sense for translation, where accuracy and consistency matter more than creativity. Beam search keeps several probable sequences in play at each step instead of committing to the single best token, which here gives a slightly higher BLEU score than greedy search.
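
The difference between greedy and beam search can be sketched on a tiny hand-made model. The probability table `TOY_MODEL` and the function `beam_search` below are hypothetical and exist only for this illustration; the sketch also ignores end-of-sequence handling and length normalization, both of which a real decoder has to deal with.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("nice",): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("dog",): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("car",): {"is": 0.6, "drives": 0.3, "turns": 0.1},
}

def beam_search(model, num_beams, steps):
    """Keep the num_beams highest-scoring prefixes at every step (scores are log-probabilities)."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, prob in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

# With a single beam the procedure reduces to greedy search.
greedy_seq, greedy_logp = beam_search(TOY_MODEL, num_beams=1, steps=2)[0]
beam_seq, beam_logp = beam_search(TOY_MODEL, num_beams=2, steps=2)[0]

print("greedy:", greedy_seq, f"p={math.exp(greedy_logp):.2f}")
print("beam-2:", beam_seq, f"p={math.exp(beam_logp):.2f}")
```

Greedy search commits to the locally best first token and misses the sequence with the higher overall probability, which the two-beam search still finds because it keeps the second-best first token alive for one more step.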

### Use case 2: Story generation

The ```WritingPrompts``` dataset is a collection of imaginative prompts and corresponding stories from the Reddit community. It contains over 300,000 stories written in response to various prompts, making it suitable for training and evaluating models on creative text generation tasks.

We'll use a subset of the dataset for evaluation purposes.

In [11]:
dataset_st_gen = load_dataset('llm-aes/writing-prompts', split='train[:1%]')

In [12]:
# Randomly select a sample from the dataset
sample = random.choice(dataset_st_gen['prompt'])

print("Sample Text:")
print(sample)

Sample Text:
 You are walking down the street when a stranger bumps into you and shoves a piece of paper into your hand , it appears to be an assassination order detailing the prescribed time , location , and method of death , the only problem is you are n't an assassin .



Now we need to tell the T5 model to generate text that continues the input prompt.

In [13]:
def preprocess_story_generation(examples: list[dict]) -> dict:
    """Preprocesses a collection of examples for the story generation task.

    Args:
        examples (list[dict]): An iterable of dictionaries, each containing a 'prompt' key.

    Returns:
        dict: A dictionary with a 'src_texts' key mapping to the list of model inputs.

    The function:
    - Adds the task prefix 'Write a story based on: ' to the prompt.
    """
    # Empty dictionary
    texts = {}
    
    # T5 expects a task prefix
    texts['src_texts'] = ['Write a story based on: ' + ex['prompt'] for ex in examples]
    return texts

dataset_st_gen_preproc = preprocess_story_generation(dataset_st_gen)


Since we have no reference story against which to measure the generated ones, compare the decoding strategies by reading the stories they produce for a few random prompts:

In [14]:
# Obtain a random source prompt
# randrange excludes the upper bound; randint(0, len(...)) could return an out-of-range index
random_index = random.randrange(len(dataset_st_gen_preproc['src_texts']))
random_src_sentence = dataset_st_gen_preproc['src_texts'][random_index]

# Generate a story with each strategy
story_gen_greedy = generate_text(random_src_sentence, max_length=300, strategy='greedy')
story_gen_beam_search = generate_text(random_src_sentence, max_length=300, strategy='beam', num_beams=5)
story_gen_temperature = generate_text(random_src_sentence, max_length=300, strategy='temperature', temperature=0.7)
story_gen_top_k = generate_text(random_src_sentence, max_length=300, strategy='top-k', top_k=50)
story_gen_top_p = generate_text(random_src_sentence, max_length=300, strategy='top-p', top_p=0.95)
story_gen_contrastive = generate_text(random_src_sentence, max_length=300, strategy='contrastive', top_k=4, penalty_alpha=0.6)

print(random_src_sentence)
print(f"Greedy Search: \n{story_gen_greedy}")
print(f"Beam Search: \n{story_gen_beam_search}")
print(f"Temperature Sampling: \n{story_gen_temperature}")
print(f"Top-k Sampling: \n{story_gen_top_k}")
print(f"Top-p Sampling: \n{story_gen_top_p}")
print(f"Contrastive Search: \n{story_gen_contrastive}")

Write a story based on:  You are awoken in the dead of night by a call from 666-666-6666 . You answer to find the devil is drunk dialing you .

Greedy Search: 
The devil is a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage, a savage 
Beam Search: 
You are awoken in the dead of night by a call from 666-666-6666. You answer to find the devil is drunk dialing you. You answer to find the devil is drunk dialing you. You answer to find the devil is drunk dialing you.
Temperature Sampling: 
You are feeling a bit woken in the dead of night by a call from 666-666-6666. You answer to find the devil is drunk dialing you. You answer to find the devil is drunk dialing you. You answer to find the

- Seeing the above generated stories of the different strategies, which strategy would you choose for this use case?
>>> I would choose **Temperature Sampling** or **Top-p Sampling** for story generation. **Deterministic strategies** (Greedy and Beam Search) get stuck in repetitive loops and fail to develop creative narratives. **Contrastive Search** also shows repetition issues. **Stochastic strategies** perform better: Temperature Sampling produces concise, coherent responses, while Top-k and Top-p show more creativity and diversity. For creative writing tasks, the randomness introduced by stochastic methods is beneficial for generating varied and interesting content.
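
Since there is no reference story, one rough way to put a number on the repetition noted above is the share of distinct n-grams in a generated text. The helper `distinct_n` and both example strings are assumptions made for this sketch (the first mimics the looping greedy output, the second is a made-up non-repeating sentence); the measure only captures repetition, not coherence or story quality.

```python
def distinct_n(text, n=2):
    """Fraction of n-grams in the text that are unique (1.0 means no n-gram is repeated)."""
    tokens = text.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

# Illustrative strings, not actual model outputs.
looping = "a savage, " * 10
varied = "The narrator writes a letter to you asking why you got a job offer."

print(f"looping text: distinct-2 = {distinct_n(looping):.2f}")
print(f"varied text:  distinct-2 = {distinct_n(varied):.2f}")
```

A score close to zero signals a text stuck in a loop, so applying the same function to the outputs of each strategy gives a quick sanity check to accompany the manual reading.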

### Sensitivity Analysis

Try different hyperparameter values for the decoding strategies: optimize the BLEU score for the neural machine translation case and generate better stories in the second use case.

- What is the optimal configuration you found for use case 1? What conclusions do you draw from your analysis?
>>> **Optimal Configuration for Neural Machine Translation:**
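The best configuration found is **Beam Search with num_beams=8**, achieving a BLEU score of **31.46** on the 10-sentence subset used for the sensitivity analysis. Scores on this subset are not directly comparable to the full-evaluation scores above, and differences between settings are noisy at this sample size.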

**Key findings:**
1. **Beam Search optimization**: Performance improves significantly from 3 beams (26.24) to 8 beams (31.46), but degrades slightly at 10 beams (27.44), suggesting 8 beams is the sweet spot.
2. **Temperature Sampling**: Lower temperatures work better (temp=0.3: 26.81 in the run shown below, falling to 4.24 at temp=1.2) as they reduce randomness, but they still underperform the best beam search configuration.
3. **Top-p Sampling**: Moderate values (top_p=0.95: 18.00) work best, avoiding both too restrictive (0.8) and too permissive (0.99) settings.
4. **Top-k Sampling**: Small values (top_k=10: 14.69) and larger values (top_k=100: 14.94) perform similarly, but mid-range values perform worse.

**Conclusion**: For translation tasks, deterministic strategies with proper beam width optimization significantly outperform stochastic methods, confirming that accuracy and consistency are more valuable than diversity for this task.

- What is the optimal configuration you found for use case 2? What conclusions do you draw from your analysis?
>>> **Optimal Configuration for Story Generation:**
The best configurations found are **Temperature Sampling (0.7-1.0)** and **Top-p Sampling (0.9-0.95)**, with **combined strategies (Temperature=0.8, Top-p=0.95)** showing the most promising results.

**Key findings:**
1. **Temperature effects**: Lower temperatures (0.3-0.7) produce more coherent but repetitive stories, while higher temperatures (1.3-1.5) generate more creative but less coherent content. The sweet spot is around 0.7-1.0.
2. **Top-p performance**: Values around 0.9-0.95 provide the best balance between creativity and coherence, avoiding too restrictive (0.8) or too permissive (0.99) settings.
3. **Top-k issues**: All tested values show significant problems with repetition and getting stuck in loops, making it less suitable for creative tasks.
4. **Combined strategies**: Temperature + Top-p combinations (especially 0.8/0.95) produce the most diverse and coherent stories while avoiding repetition issues.

**Conclusion**: For creative text generation, stochastic strategies with moderate randomness levels work best. The combination of temperature and top-p sampling provides superior control over creativity vs. coherence trade-offs compared to single-parameter strategies.

In [15]:
# Sensitivity Analysis for Use Case 1: Neural Machine Translation
# Testing different hyperparameters to optimize BLEU score

print("=== SENSITIVITY ANALYSIS FOR NEURAL MACHINE TRANSLATION ===\n")

# Test different beam sizes for beam search
beam_sizes = [3, 5, 8, 10]
print("Testing different beam sizes:")
for num_beams in beam_sizes:
    predictions = []
    references = []
    for src_text, tgt_text in zip(dataset_nmt_preproc["src_texts"][:10], dataset_nmt_preproc["tgt_texts"][:10]): 
        pred = generate_text(src_text, strategy='beam', num_beams=num_beams)
        predictions.append(pred)
        references.append([tgt_text])
    
    bleu = bleu_metric.compute(predictions=predictions, references=references, smooth_method='exp')
    print(f"Beam Search (num_beams={num_beams}): BLEU = {bleu['score']:.2f}")

print("\n" + "="*50 + "\n")

# Test different temperature values for temperature sampling
temperatures = [0.3, 0.5, 0.7, 1.0, 1.2]
print("Testing different temperature values:")
for temp in temperatures:
    predictions = []
    references = []
    for src_text, tgt_text in zip(dataset_nmt_preproc["src_texts"][:10], dataset_nmt_preproc["tgt_texts"][:10]): 
        pred = generate_text(src_text, strategy='temperature', temperature=temp)
        predictions.append(pred)
        references.append([tgt_text])
    
    bleu = bleu_metric.compute(predictions=predictions, references=references, smooth_method='exp')
    print(f"Temperature Sampling (temp={temp}): BLEU = {bleu['score']:.2f}")

print("\n" + "="*50 + "\n")

# Test different top-p values
top_p_values = [0.8, 0.9, 0.95, 0.99]
print("Testing different top-p values:")
for top_p in top_p_values:
    predictions = []
    references = []
    for src_text, tgt_text in zip(dataset_nmt_preproc["src_texts"][:10], dataset_nmt_preproc["tgt_texts"][:10]): 
        pred = generate_text(src_text, strategy='top-p', top_p=top_p)
        predictions.append(pred)
        references.append([tgt_text])
    
    bleu = bleu_metric.compute(predictions=predictions, references=references, smooth_method='exp')
    print(f"Top-p Sampling (top_p={top_p}): BLEU = {bleu['score']:.2f}")

print("\n" + "="*50 + "\n")

# Test different top-k values
top_k_values = [10, 20, 50, 100]
print("Testing different top-k values:")
for top_k in top_k_values:
    predictions = []
    references = []
    for src_text, tgt_text in zip(dataset_nmt_preproc["src_texts"][:10], dataset_nmt_preproc["tgt_texts"][:10]): 
        pred = generate_text(src_text, strategy='top-k', top_k=top_k)
        predictions.append(pred)
        references.append([tgt_text])
    
    bleu = bleu_metric.compute(predictions=predictions, references=references, smooth_method='exp')
    print(f"Top-k Sampling (top_k={top_k}): BLEU = {bleu['score']:.2f}")

=== SENSITIVITY ANALYSIS FOR NEURAL MACHINE TRANSLATION ===

Testing different beam sizes:
Beam Search (num_beams=3): BLEU = 26.24
Beam Search (num_beams=5): BLEU = 27.18
Beam Search (num_beams=8): BLEU = 31.46
Beam Search (num_beams=10): BLEU = 27.44


Testing different temperature values:
Temperature Sampling (temp=0.3): BLEU = 26.81
Temperature Sampling (temp=0.5): BLEU = 23.23
Temperature Sampling (temp=0.7): BLEU = 16.56
Temperature Sampling (temp=1.0): BLEU = 5.60
Temperature Sampling (temp=1.2): BLEU = 4.24


Testing different top-p values:

In [16]:
# Sensitivity Analysis for Use Case 2: Story Generation
# Testing different hyperparameters for creative text generation

print("=== SENSITIVITY ANALYSIS FOR STORY GENERATION ===\n")

# Select a few prompts for consistent testing
test_prompts = dataset_st_gen_preproc['src_texts'][:5]

print("Testing different temperature values for story generation:")
temperatures = [0.3, 0.7, 1.0, 1.3, 1.5]
for temp in temperatures:
    print(f"\n--- Temperature = {temp} ---")
    story = generate_text(test_prompts[0], strategy='temperature', temperature=temp, max_length=100)
    print(f"Generated story: {story}")

print("\n" + "="*50 + "\n")

print("Testing different top-p values for story generation:")
top_p_values = [0.8, 0.9, 0.95, 0.99]
for top_p in top_p_values:
    print(f"\n--- Top-p = {top_p} ---")
    story = generate_text(test_prompts[0], strategy='top-p', top_p=top_p, max_length=100)
    print(f"Generated story: {story}")

print("\n" + "="*50 + "\n")

print("Testing different top-k values for story generation:")
top_k_values = [10, 30, 50, 100]
for top_k in top_k_values:
    print(f"\n--- Top-k = {top_k} ---")
    story = generate_text(test_prompts[0], strategy='top-k', top_k=top_k, max_length=500)
    print(f"Generated story: {story}")

print("\n" + "="*50 + "\n")

print("Testing combined strategies (Temperature + Top-p):")
combined_configs = [
    {'temperature': 0.7, 'top_p': 0.9},
    {'temperature': 0.8, 'top_p': 0.95},
    {'temperature': 1.0, 'top_p': 0.9},
]

for config in combined_configs:
    print(f"\n--- Temperature = {config['temperature']}, Top-p = {config['top_p']} ---")
    # Manually implement combined sampling
    input_ids = tokenizer.encode(test_prompts[0], return_tensors='pt')
    output_ids = model.generate(
        input_ids, 
        max_length=100, 
        do_sample=True, 
        temperature=config['temperature'], 
        top_p=config['top_p']
    )
    story = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    print(f"Combined strategy story: {story}")

=== SENSITIVITY ANALYSIS FOR STORY GENERATION ===

Testing different temperature values for story generation:

--- Temperature = 0.3 ---
Generated story: Suddenly, Death appears before you, hands you a business card, and says,  When you realize living forever sucks, call this number, I've got a job offer for you. '' Suddenly, Death appears before you, hands you a business card, and says,  When you realize living forever sucks, call this number,  I 

--- Temperature = 0.7 ---
Generated story: Those who live forever are able to do just that. The narrator writes a letter to you asking why you've gotten a job offer for you. The letter is written to you 