<center><a href="https://www.nvidia.cn/training/"><img src="https://dli-lms.s3.amazonaws.com/assets/general/DLI_Header_White.png" width="400" height="186" /></a></center>

# Appendix: Exploring Hyperparameters

For interested learners, this appendix notebook explores how the hyperparameters `temperature`, `top_p`, and `top_k` affect LLM token sampling.

---

## Imports

In [None]:
import os
import random
import math

---

## Top K

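The `top_k` parameter restricts the model's choice to the `k` most likely next tokens. When `top_k` is set to 1, the model always selects the single most likely token, so given exactly the same prompt its output is always the same. This is called **greedy decoding**.

When `top_k` is set to a value greater than 1, the model can consider several possible next tokens rather than only the most likely one.

Up to this point, we have kept `top_k` at its default of 1.

Here is some sample code to help you understand `top_k`. Imagine these are the candidate next words generated by an LLM.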

In [None]:
# List of words with their likelihood of being the next word, sorted by likelihood.
words_and_likelihoods_of_being_next_sorted = [
    ('apple', 0.4),
    ('dragonfruit', 0.2),
    ('marita', 0.1)
]

The following function takes `top_k` into account when choosing the next word.

In [None]:
def get_next_word(words_sorted_by_likelihood_of_being_next, top_k):
    # Limit the choices to the top_k words.
    words_available_to_be_next = words_sorted_by_likelihood_of_being_next[:top_k]
    # Separate the words and their probabilities.
    words, probabilities = zip(*words_available_to_be_next)
    # Choose one word based on the probabilities as weights.
    return random.choices(words, weights=probabilities, k=1)[0]

By iterating over several `top_k` values, we can see how it affects the next word that is generated.

In [None]:
top_ks = [1, 2, 3]

for top_k in top_ks:
    print(f'Setting top_k to {top_k}.')
    for _ in range(10):
        next_word = get_next_word(words_and_likelihoods_of_being_next_sorted, top_k)
        print(next_word)
    print()
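Because ten random draws can be noisy, it may help to look at the exact distribution that each `top_k` value produces. The following is a minimal sketch, not part of the original notebook: the helper `top_k_distribution` and the variable `candidates` are hypothetical names introduced here, and the sketch assumes that the truncated likelihoods are simply renormalized, which is what weighted sampling with `random.choices` does implicitly.

```python
def top_k_distribution(words_sorted_by_likelihood, top_k):
    """Return the renormalized sampling distribution after keeping the top_k words."""
    # Keep only the top_k most likely words (the list is already sorted).
    kept = words_sorted_by_likelihood[:top_k]
    # Renormalize so the kept likelihoods sum to 1.
    total = sum(likelihood for _, likelihood in kept)
    return {word: likelihood / total for word, likelihood in kept}


# Same toy candidates as in the cells above.
candidates = [('apple', 0.4), ('dragonfruit', 0.2), ('marita', 0.1)]

for top_k in [1, 2, 3]:
    distribution = top_k_distribution(candidates, top_k)
    formatted = ', '.join(f'{word}: {prob:.3f}' for word, prob in distribution.items())
    print(f'top_k={top_k} -> {formatted}')
```

With `top_k` set to 1, all of the probability mass falls on `apple`, which is why greedy decoding always returns the same word.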

---

## Temperature

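When `top_k` is set to 1, the `temperature` parameter has no effect. When `top_k` is greater than 1, however, we can also pass the model a `temperature` value between `0.0` and `1.0`.

**Temperature** controls the randomness of word selection: a higher **temperature** raises the probability of choosing less likely words, which adds variety to the text, while a lower **temperature** makes the model's choices more predictable.

For example, when `top_k` is set to 2, the model chooses between the two most likely next tokens. As the temperature rises, the probability distribution becomes more uniform, giving the second most likely token a better chance of being selected. A lower temperature makes the model favor the more likely of the two tokens.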
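The flattening effect described above can be made concrete with a short calculation. This is a sketch that assumes temperature is applied as a power transform on the probabilities followed by renormalization, which is how the `apply_temperature` function in this notebook behaves once weighted sampling normalizes its output. The symbols $p_i$ (original probability), $q_i$ (adjusted probability), and $T$ (temperature) are notation introduced here for illustration.

$$
q_i = \frac{p_i^{1/T}}{\sum_j p_j^{1/T}}
\qquad\Longrightarrow\qquad
\frac{q_1}{q_2} = \left(\frac{p_1}{p_2}\right)^{1/T}
$$

With `top_k` set to 2 and toy probabilities $p_1 = 0.4$ and $p_2 = 0.2$, the ratio is $2^{1/T}$. At $T = 1$ the ratio is 2, so $q_1 = 2/3 \approx 0.667$. At $T = 0.5$ the ratio is 4, so $q_1 = 0.8$. At $T = 0.1$ the ratio is 1024, so $q_1 \approx 0.999$ and the choice is close to greedy.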

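The following code example can help you understand this concept. Once again, we provide a set of words and their probabilities of being the next word in text generation.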

In [None]:
# List of words with their likelihood of being the next word, sorted by likelihood.
words_and_likelihoods_of_being_next_sorted = [
    ('apple', 0.4),
    ('dragonfruit', 0.2),
    ('marita', 0.1)
]

The `apply_temperature` function updates the next-word probabilities according to the temperature.

In [None]:
def apply_temperature(probabilities, temperature):
    # Ensure temperature is within the valid range for your model
    if temperature <= 0 or temperature > 1:
        raise ValueError("Temperature must be greater than 0 and less than or equal to 1")
    # Apply temperature to probabilities
    adjusted_probabilities = [pow(p, 1 / temperature) for p in probabilities]
    return adjusted_probabilities

The `get_next_word_temperature` function takes the temperature value into account when choosing the next word.

In [None]:
def get_next_word_temperature(words_sorted_by_likelihood_of_being_next, temperature):
    # Separate the words and their original probabilities.
    words, probabilities = zip(*words_sorted_by_likelihood_of_being_next)
    # Adjust the probabilities by applying temperature.
    adjusted_probabilities = apply_temperature(probabilities, temperature)
    # Choose one word based on the adjusted probabilities as weights.
    return random.choices(words, weights=adjusted_probabilities, k=1)[0]

By iterating over several `temperature` values, we can see how it affects which words are likely to appear in the generation.

In [None]:
temperatures = [0.01, 0.5, 1.0]

for temperature in temperatures:
    print(f'Setting temperature to {temperature}.')
    for _ in range(10):
        next_word = get_next_word_temperature(words_and_likelihoods_of_being_next_sorted, temperature)
        print(next_word)
    print()
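The weights returned by `apply_temperature` are not normalized, so their effect is easier to read once they are rescaled to sum to 1. The following is a minimal sketch under that assumption: `tempered_distribution` is a hypothetical helper introduced here, and it repeats the same power transform and the same range check as the notebook's function.

```python
def tempered_distribution(probabilities, temperature):
    """Raise each probability to 1/temperature, then renormalize to sum to 1."""
    # Same range check as the notebook's apply_temperature function.
    if temperature <= 0 or temperature > 1:
        raise ValueError('Temperature must be greater than 0 and less than or equal to 1')
    weights = [p ** (1 / temperature) for p in probabilities]
    total = sum(weights)
    return [weight / total for weight in weights]


# Same toy words and likelihoods as in the cells above.
words = ['apple', 'dragonfruit', 'marita']
probabilities = [0.4, 0.2, 0.1]

for temperature in [0.01, 0.5, 1.0]:
    distribution = tempered_distribution(probabilities, temperature)
    formatted = ', '.join(f'{word}: {prob:.3f}' for word, prob in zip(words, distribution))
    print(f'temperature={temperature} -> {formatted}')
```

At a temperature of 0.01 nearly all of the mass sits on `apple`, while at 1.0 the distribution is just the original likelihoods rescaled to sum to 1.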

---

## Top P

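When generating text with a language model, `top_p`, also known as "nucleus sampling", selects a set of candidate next tokens whose cumulative probability just reaches or exceeds the threshold specified by `top_p`, a floating-point value between 0.0 and 1.0. It works as follows:
- The model computes the probability of each possible next token and sorts the tokens in descending order of probability.
- Starting with the most likely token, tokens are added to a subset one at a time until their combined probability reaches or exceeds the `top_p` threshold.
- The model then randomly selects the next token from this subset only.

The code example below helps you understand this concept. We again provide a set of words and their probabilities of being the next word in text generation.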

In [None]:
# List of words with their likelihood of being the next word, sorted by likelihood.
words_and_likelihoods_of_being_next_sorted = [
    ('apple', 0.4),
    ('dragonfruit', 0.2),
    ('marita', 0.1)
]

The `get_next_word_top_p` function takes the value of `p` into account when choosing the next word.

In [None]:
def get_next_word_top_p(words_sorted_by_likelihood_of_being_next, p):
    # Initialize the cumulative probability.
    cumulative = 0
    # List to hold words and probabilities up to the cumulative probability p.
    words_available_to_be_next = []
    # Add words and their probabilities to the list until the cumulative probability reaches p.
    for word, likelihood in words_sorted_by_likelihood_of_being_next:
        cumulative += likelihood
        words_available_to_be_next.append((word, likelihood))
        if cumulative >= p:
            break
    # Separate the words and their probabilities.
    words, probabilities = zip(*words_available_to_be_next)
    # Choose one word based on the probabilities as weights.
    return random.choices(words, weights=probabilities, k=1)[0]

By iterating over several `top_p` values, we can see how it affects which words can appear in the generation.

In [None]:
top_ps = [0.1, 0.6, 1.0]

for top_p in top_ps:
    print(f'Setting top_p to {top_p}.')
    for _ in range(10):
        next_word = get_next_word_top_p(words_and_likelihoods_of_being_next_sorted, top_p)
        print(next_word)
    print()
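To see which words are eligible at each threshold, independent of the random draws, the subset itself can be printed. The following is a minimal sketch: `nucleus` is a hypothetical helper introduced here that mirrors the cumulative loop in `get_next_word_top_p` but returns the kept words instead of sampling from them.

```python
def nucleus(words_sorted_by_likelihood, p):
    """Return the shortest prefix of words whose cumulative likelihood reaches p."""
    cumulative = 0
    kept = []
    for word, likelihood in words_sorted_by_likelihood:
        cumulative += likelihood
        kept.append(word)
        # Stop as soon as the running total reaches the threshold.
        if cumulative >= p:
            break
    return kept


# Same toy candidates as in the cells above.
candidates = [('apple', 0.4), ('dragonfruit', 0.2), ('marita', 0.1)]

for top_p in [0.1, 0.6, 1.0]:
    print(f'top_p={top_p} -> {nucleus(candidates, top_p)}')
```

Note that the toy likelihoods sum to only 0.7, so a `top_p` of 1.0 is never reached and every word stays in the subset.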

---

## Using Them Together

Below, we reuse the list of candidate words to experiment with how combinations of these parameters affect the choice of the next word.

In [None]:
# List of words with their likelihood of being the next word, sorted by likelihood.
words_and_likelihoods_of_being_next_sorted = [
    ('apple', 0.4),
    ('dragonfruit', 0.2),
    ('marita', 0.1)
]

The `get_next_word_combined` function takes `top_k`, `p`, and `temperature` into account.

In [None]:
def get_next_word_combined(words_sorted_by_likelihood_of_being_next, top_k, p, temperature):
    # Apply top_k to limit the choices.
    words_available_to_be_next = words_sorted_by_likelihood_of_being_next[:top_k]
    
    # Initialize the cumulative probability.
    cumulative = 0
    # List to hold words and probabilities after applying top_p.
    top_p_words = []
    # Add words to the list based on top_p criteria.
    for word, likelihood in words_available_to_be_next:
        cumulative += likelihood
        top_p_words.append((word, likelihood))
        if cumulative >= p:
            break
            
    # Separate the words and their probabilities after applying top_p.
    words, probabilities = zip(*top_p_words)
    # Adjust the probabilities by applying temperature.
    adjusted_probabilities = apply_temperature(probabilities, temperature)
    
    # Choose one word based on the adjusted probabilities as weights.
    return random.choices(words, weights=adjusted_probabilities, k=1)[0]

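By iterating over combinations of `top_k`, `top_p`, and `temperature` values, we can see how they jointly restrict which words can appear in the generation.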

In [None]:
top_ks = [2, 3]
top_ps = [0.6, 1]
temperatures = [0.1, 1]

for top_k in top_ks:
    for top_p in top_ps:
        for temperature in temperatures:
            print(f'Setting top_k to {top_k}, top_p to {top_p}, temperature to {temperature}.')
            for _ in range(10):
                next_word = get_next_word_combined(
                    words_and_likelihoods_of_being_next_sorted, 
                    top_k, 
                    top_p, 
                    temperature
                )
                print(next_word)
            print()
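The exact distribution behind each combination can also be computed directly, which makes it easier to see which parameter is doing the limiting. The following is a minimal sketch: `combined_distribution` is a hypothetical helper introduced here, and it assumes the same order of operations as `get_next_word_combined` (top_k, then top_p, then temperature) with a final renormalization.

```python
import itertools


def combined_distribution(words_sorted_by_likelihood, top_k, p, temperature):
    """Apply top_k, then top_p, then temperature, and return normalized probabilities."""
    # Step 1: keep only the top_k most likely words.
    kept = words_sorted_by_likelihood[:top_k]
    # Step 2: keep the shortest prefix whose cumulative likelihood reaches p.
    cumulative = 0
    subset = []
    for word, likelihood in kept:
        cumulative += likelihood
        subset.append((word, likelihood))
        if cumulative >= p:
            break
    # Step 3: apply temperature as a power transform, then renormalize.
    weights = [likelihood ** (1 / temperature) for _, likelihood in subset]
    total = sum(weights)
    return {word: weight / total for (word, _), weight in zip(subset, weights)}


# Same toy candidates as in the cells above.
candidates = [('apple', 0.4), ('dragonfruit', 0.2), ('marita', 0.1)]

for top_k, top_p, temperature in itertools.product([2, 3], [0.6, 1], [0.1, 1]):
    distribution = combined_distribution(candidates, top_k, top_p, temperature)
    formatted = ', '.join(f'{word}: {prob:.3f}' for word, prob in distribution.items())
    print(f'top_k={top_k}, top_p={top_p}, temperature={temperature} -> {formatted}')
```

With these toy values, a `top_p` of 0.6 cuts the list to two words whether `top_k` is 2 or 3, so `marita` can only appear when `top_k` is 3 and `top_p` is 1.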