# 🎛️ Key Parameters in Text Generation

This notebook explores the **key parameters** that control how Large Language Models (LLMs) generate text. Understanding these parameters is essential for getting the desired output from generative AI models.

## What You'll Learn

- **Temperature**: Controls randomness and creativity
- **Top-p (Nucleus Sampling)**: Samples from the smallest set of tokens whose cumulative probability reaches `p`
- **Top-k**: Limits token choices to top k candidates
- **Output Length**: Controls how much text is generated

---

## 📦 Setup

First, let's import the necessary libraries. We'll use the `transformers` library from Hugging Face, which provides easy access to pre-trained language models.

In [None]:
from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')

## 🤖 Loading the Model

We'll use **GPT-2**, a popular open-source language model from OpenAI. While smaller than modern models like GPT-4, it's perfect for demonstrating these concepts.

The `pipeline` function creates an easy-to-use interface for text generation.

In [None]:
generator = pipeline('text-generation', model='gpt2')

---

## 🌡️ Temperature

**Temperature** is one of the most important parameters in text generation. It controls the **randomness** of the model's output.

### How Temperature Works

When a model generates text, it predicts probabilities for each possible next token (word/subword). Temperature modifies these probabilities:

| Temperature | Effect | Use Case |
|-------------|--------|----------|
| **Low (0.1-0.3)** | More deterministic, focused, repetitive | Factual content, code, precise answers |
| **Medium (0.5-0.7)** | Balanced creativity and coherence | General writing, conversations |
| **High (0.8-1.5)** | More random, creative, diverse | Creative writing, brainstorming |

### Mathematical Intuition

Temperature $T$ divides the logits $z_i$ (raw model outputs) before softmax is applied, so each token's probability becomes $p_i = \exp(z_i / T) / \sum_j \exp(z_j / T)$:
- **Low temperature** → Sharpens the distribution (high-probability tokens become even more likely)
- **High temperature** → Flattens the distribution (token probabilities move closer together)
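
The effect can be checked by hand with a minimal sketch in plain Python. The function name `softmax_with_temperature` and the three logits are made up for illustration; this is not how `transformers` implements it internally.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtracting the max keeps exp() numerically stable without changing the result.
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.1, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At `T=0.1` nearly all of the probability lands on the first token, while at `T=1.5` the three probabilities sit closer together.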

---

### Example: Low Temperature (0.1)

With a very low temperature, the model becomes **highly deterministic**. It will almost always choose the most probable next token, leading to predictable (and sometimes repetitive) output.

In [None]:
# Low temperature = More focused, deterministic output
result = generator(
    "Here is a suggestion on my new coffee shop name. The name should be:",
    temperature=0.1,
    max_new_tokens=50,
    do_sample=True
)
print("Temperature 0.1 (Low - Deterministic):")
print(result[0]['generated_text'])

### Example: High Temperature (1.2)

With a higher temperature, the model becomes **more creative and diverse**. It's more willing to choose less probable tokens, leading to more varied (but potentially less coherent) output.

In [None]:
# High temperature = More creative, diverse output
result = generator(
    "Here is a suggestion on my new coffee shop name. The name should be:",
    temperature=1.2,
    max_new_tokens=50,
    do_sample=True
)
print("Temperature 1.2 (High - Creative):")
print(result[0]['generated_text'])

### 💡 Key Takeaway

- **Low temperature** → Safe, predictable, may repeat
- **High temperature** → Creative, diverse, may be incoherent
- **Sweet spot** → Often around 0.7-0.9 for general-purpose tasks, though the best value depends on the model and the task

---

## 🎯 Top-p (Nucleus Sampling)

**Top-p** (also called **nucleus sampling**) is an alternative to temperature for controlling randomness. Instead of reshaping the whole probability distribution, it **limits which tokens can be selected**.

### How Top-p Works

1. Sort all possible next tokens by probability (highest first)
2. Add tokens to the "nucleus" until their cumulative probability reaches `p`
3. Sample only from tokens in the nucleus (after renormalizing their probabilities so they sum to 1)
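
The three steps above can be sketched in a few lines of plain Python. The helper name `top_p_filter` and the example probabilities are hypothetical; real libraries operate on logits tensors, but the idea is the same.

```python
def top_p_filter(probs, p):
    """Return {token_index: renormalized_prob} for the smallest set of
    top-ranked tokens whose cumulative probability reaches p."""
    # Step 1: sort tokens by probability, highest first.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    # Step 2: grow the nucleus until its cumulative probability reaches p.
    nucleus, cumulative = [], 0.0
    for index, prob in ranked:
        nucleus.append((index, prob))
        cumulative += prob
        if cumulative >= p:
            break
    # Step 3 samples from the nucleus, so renormalize it to sum to 1.
    total = sum(prob for _, prob in nucleus)
    return {index: prob / total for index, prob in nucleus}

# Hypothetical next-token probabilities for a five-token vocabulary.
probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]
print(top_p_filter(probs, 0.2))  # only the top token survives
print(top_p_filter(probs, 0.9))  # four tokens are needed to reach 0.9
```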

| Top-p Value | Effect | Tokens Considered |
|-------------|--------|-------------------|
| **0.1** | Very restrictive | Only top ~10% probability mass |
| **0.5** | Moderate | Top ~50% probability mass |
| **0.9** | Permissive | Top ~90% probability mass |
| **1.0** | No filtering | All tokens considered |

### Why Use Top-p?

Top-p is **adaptive**: it automatically adjusts how many tokens to consider based on the shape of the probability distribution. When the model is confident, a few tokens already cover `p`, so fewer are considered. When it is uncertain, more tokens are included.
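
To make the adaptivity concrete, here is a small sketch that counts how many tokens end up in the nucleus for a peaked versus a flat distribution. Both distributions and the helper name `nucleus_size` are invented for illustration.

```python
def nucleus_size(probs, p):
    """Count how many top-ranked tokens are needed to reach cumulative probability p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    return len(probs)

confident = [0.92, 0.04, 0.02, 0.01, 0.01]  # one token dominates
uncertain = [0.2, 0.2, 0.2, 0.2, 0.2]       # all tokens equally likely

print(nucleus_size(confident, 0.9))  # 1
print(nucleus_size(uncertain, 0.9))  # 5
```

The same `p=0.9` keeps a single token in the confident case and all five in the uncertain one, which a fixed cutoff cannot do.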

---

### Example: Low Top-p (0.2)

With `top_p=0.2`, only tokens that make up the top 20% of probability mass are considered. This leads to **more predictable** output.

In [None]:
# Low top-p = Restrictive, predictable output
result = generator(
    "The cat sat on",
    max_new_tokens=10,
    top_p=0.2,
    top_k=0,  # turn off top-k filtering (the library default is typically 50) so only top-p applies
    do_sample=True
)
print("Top-p 0.2 (Restrictive):")
print(result[0]['generated_text'])

### Example: High Top-p (0.9)

With `top_p=0.9`, tokens making up 90% of the probability mass are considered. This allows for **more diverse** completions.

In [None]:
# High top-p = More diverse output
result = generator(
    "The cat sat on",
    max_new_tokens=10,
    top_p=0.9,
    top_k=0,  # turn off top-k filtering (the library default is typically 50) so only top-p applies
    do_sample=True
)
print("Top-p 0.9 (Diverse):")
print(result[0]['generated_text'])

### 💡 Key Takeaway

- **Low top-p (0.1-0.3)** → Conservative, sticks to high-probability tokens
- **High top-p (0.8-0.95)** → More variety, includes less likely tokens
- **Common default** → 0.9 or 0.95

---

## 🔢 Top-k

**Top-k** is a simpler alternative to top-p. Instead of using probability thresholds, it simply **limits selection to the k most likely tokens**.

### How Top-k Works

1. Sort all possible next tokens by probability
2. Keep only the top `k` tokens
3. Sample from these k tokens
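
As a rough sketch of these steps in plain Python (the helper names `top_k_filter` and `sample_top_k` are hypothetical, and the probabilities are made up):

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept = ranked[:k]
    total = sum(prob for _, prob in kept)
    return {index: prob / total for index, prob in kept}

def sample_top_k(probs, k, rng=random):
    """Draw one token index from the top-k candidates."""
    filtered = top_k_filter(probs, k)
    indices = list(filtered)
    weights = list(filtered.values())
    return rng.choices(indices, weights=weights, k=1)[0]

# Hypothetical next-token probabilities for a three-token vocabulary.
probs = [0.1, 0.6, 0.3]
print(top_k_filter(probs, 2))  # tokens 1 and 2 remain, rescaled to sum to 1
print(sample_top_k(probs, 1))  # with k=1 only token 1 can be chosen
```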

| Top-k Value | Effect |
|-------------|--------|
| **1** | Greedy decoding (always pick the most likely) |
| **5-10** | Very focused, limited variety |
| **20-50** | Moderate variety |
| **100+** | High variety |

### Top-k vs Top-p

| Aspect | Top-k | Top-p |
|--------|-------|-------|
| **Selection** | Fixed number of tokens | Dynamic based on probability |
| **Adaptivity** | Not adaptive | Adapts to confidence |
| **Simplicity** | Simpler to understand | More nuanced |

---

### Example: Top-k = 1 (Greedy Decoding)

With `top_k=1`, the model **always picks the single most likely token**, even though `do_sample=True`, because only one candidate is left to sample from. This is equivalent to greedy decoding and produces the same output on every run.

In [None]:
# Top-k = 1 (Greedy - always pick most likely)
result = generator(
    "The cat sat on",
    max_new_tokens=10,
    top_k=1,
    do_sample=True
)
print("Top-k 1 (Greedy - Most Deterministic):")
print(result[0]['generated_text'])

### Example: Top-k = 50 (More Variety)

With `top_k=50`, the model can choose from the **50 most likely tokens**, allowing for more diverse output.

In [None]:
# Top-k = 50 (More variety)
result = generator(
    "The cat sat on",
    max_new_tokens=10,
    top_k=50,
    do_sample=True
)
print("Top-k 50 (More Variety):")
print(result[0]['generated_text'])

### 💡 Key Takeaway

- **Top-k = 1** → Greedy, completely deterministic
- **Top-k = 10-50** → Good balance for most tasks
- **Top-k is simpler** but less adaptive than top-p

---

## 📏 Output Length

Controlling the **length of generated text** is crucial for practical applications. There are two main parameters:

| Parameter | Description |
|-----------|-------------|
| **max_new_tokens** | Maximum number of NEW tokens to generate |
| **max_length** | Maximum TOTAL length (input + output) |

### Important Notes

- `max_new_tokens` is generally preferred as it's more intuitive
- If both are set, `max_new_tokens` takes precedence
- Generation may stop early if the model produces an end-of-sequence token
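
The difference between the two limits comes down to simple arithmetic, sketched below. The helper `new_token_budget` is hypothetical and only mirrors the rules in the notes above; it does not reproduce the library's warnings or error handling.

```python
def new_token_budget(prompt_tokens, max_new_tokens=None, max_length=None):
    """Upper bound on newly generated tokens under the rules listed above."""
    if max_new_tokens is not None:
        # max_new_tokens counts only generated tokens and takes precedence.
        return max_new_tokens
    if max_length is not None:
        # max_length counts prompt + generated tokens together.
        return max(max_length - prompt_tokens, 0)
    raise ValueError("set max_new_tokens or max_length")

# Assume a prompt that tokenizes to 4 tokens.
print(new_token_budget(4, max_new_tokens=10))                 # 10
print(new_token_budget(4, max_length=50))                     # 46
print(new_token_budget(4, max_new_tokens=10, max_length=50))  # 10
```

This is an upper bound only: generation can still end sooner if the model emits an end-of-sequence token.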

---

### Example: Short Output

In [None]:
# Generate up to 10 new tokens
result = generator(
    "The cat sat on",
    max_new_tokens=10
)
print("Short output (10 tokens):")
print(result[0]['generated_text'])

### Example: Longer Output

In [None]:
# Generate up to 100 new tokens
result = generator(
    "The cat sat on",
    max_new_tokens=100
)
print("Longer output (100 tokens):")
print(result[0]['generated_text'])

---

## 🔄 Combining Parameters

In practice, you'll often **combine multiple parameters** to achieve the desired output. Here are some common combinations:

### Factual/Precise Output
```python
temperature=0.3, top_p=0.9, top_k=40
```

### Creative Writing
```python
temperature=0.9, top_p=0.95, top_k=100
```

### Code Generation
```python
temperature=0.2, top_p=0.9, top_k=20
```
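
One convenient way to reuse these combinations is to keep them in a dictionary and unpack them into the call. The names `PRESETS` and `generation_kwargs` below are hypothetical helpers, not part of `transformers`; the values are the ones listed above.

```python
# Sampling presets taken from the combinations listed above.
PRESETS = {
    "factual": {"temperature": 0.3, "top_p": 0.9, "top_k": 40},
    "creative": {"temperature": 0.9, "top_p": 0.95, "top_k": 100},
    "code": {"temperature": 0.2, "top_p": 0.9, "top_k": 20},
}

def generation_kwargs(task, max_new_tokens=50):
    """Build keyword arguments for a sampling-based generation call."""
    if task not in PRESETS:
        raise KeyError(f"unknown task {task!r}; choose from {sorted(PRESETS)}")
    kwargs = dict(PRESETS[task])  # copy so the preset itself is not modified
    kwargs.update(do_sample=True, max_new_tokens=max_new_tokens)
    return kwargs

print(generation_kwargs("code", max_new_tokens=20))

# Usage with the `generator` pipeline created earlier in this notebook:
# result = generator("Once upon a time", **generation_kwargs("creative"))
```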

In [None]:
# Example: Balanced creative output
result = generator(
    "Once upon a time in a magical forest,",
    max_new_tokens=50,
    temperature=0.8,
    top_p=0.92,
    top_k=50,
    do_sample=True
)
print("Balanced Creative Output:")
print(result[0]['generated_text'])
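
To see how the three sampling parameters interact, here is a minimal single-step sketch in plain Python. The function `sample_next_token` and the logits are made up, and the order used (temperature, then top-k, then top-p) is an assumption of this sketch; a given library may order or implement the steps differently.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick one token index using temperature, then top-k, then top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # 1. Temperature: rescale the logits, then apply a stable softmax.
    scaled = [x / temperature for x in logits]
    highest = max(scaled)
    exps = [math.exp(x - highest) for x in scaled]
    total = sum(exps)
    ranked = sorted(
        ((i, e / total) for i, e in enumerate(exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # 2. Top-k: keep only the k most likely tokens (0 means no limit).
    if top_k > 0:
        ranked = ranked[:top_k]
    # 3. Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    mass = sum(prob for _, prob in ranked)
    kept, cumulative = [], 0.0
    for index, prob in ranked:
        kept.append((index, prob))
        cumulative += prob / mass
        if cumulative >= top_p:
            break
    # 4. Sample from what is left; random.choices accepts unnormalized weights.
    indices = [index for index, _ in kept]
    weights = [prob for _, prob in kept]
    return rng.choices(indices, weights=weights, k=1)[0]

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]
print(sample_next_token(logits, temperature=0.8, top_k=2, top_p=0.95))
```

Because the filters are applied in sequence, the strictest one dominates: with `top_k=2`, only the two highest-scoring tokens can ever be returned, whatever `top_p` is set to.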

---

## 📊 Summary

| Parameter | What It Controls | Low Value | High Value |
|-----------|------------------|-----------|------------|
| **Temperature** | Randomness | Deterministic, focused | Creative, diverse |
| **Top-p** | Cumulative probability threshold | Conservative | Permissive |
| **Top-k** | Number of candidates | Very focused | More variety |
| **max_new_tokens** | Output length | Short responses | Long responses |

### Best Practices

1. **Start with defaults** and adjust based on results
2. **Use temperature OR top-p**, not both at extreme values
3. **Match parameters to your use case** (factual vs creative)
4. **Experiment!** Different tasks benefit from different settings

---

## 📚 Further Reading

- [Hugging Face Text Generation Documentation](https://huggingface.co/docs/transformers/main_classes/text_generation)
- [The Curious Case of Neural Text Degeneration](https://arxiv.org/abs/1904.09751) (Top-p paper)
- [How to Generate Text with Transformers](https://huggingface.co/blog/how-to-generate)