<br>
<a href="https://www.nvidia.cn/training/">
    <div style="width: 55%; background-color: white; margin-top: 50px;">
    <img src="https://dli-lms.s3.amazonaws.com/assets/general/nvidia-logo.png"
         width="400"
         height="186"
         style="margin: 0px -25px -5px; width: 300px"/>
    </div>
</a>
<h1 style="line-height: 1.4;"><font color="#76b900"><b>Building AI Agents with Large Language Models (LLMs)</b></font></h1>
<h2><b>Exercise 2:</b> Generating Metadata</h2>
<br>

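**Welcome to the second exercise!**

This exercise is short and is meant to reinforce the idea of structured output as a way to work with course materials and even long-form markdown. Specifically, we will look at how to generate realistic metadata, and then use the tools from the previous section to write an actual notebook.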

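### **Learning Objectives:**
**In this notebook, we will:**

- Look at a more involved structured-output example that can be applied directly to synthetic content (*when used sensibly*).
- Go beyond the generative priors of your LLM system to improve a long document iteratively.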

### **Setup**

Before we begin, let's load the setup from the previous notebook:

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_nvidia import ChatNVIDIA
from langchain_openai import ChatOpenAI
from functools import partial

from course_utils import chat_with_chain

llm = ChatNVIDIA(model="meta/llama-3.1-8b-instruct", base_url="http://nim-llm:8000/v1")

## Minimum Viable Invocation
# print(llm.invoke("How is it going? 1 sentence response.").content)

## Back-and-forth loop
prompt = ChatPromptTemplate.from_messages([
    ("system",
         "You are a helpful instructor assistant for NVIDIA Deep Learning Institute (DLI). "
         " Please help to answer user questions about the course. The first message is your context."
         " Restart from there, and strongly rely on it as your knowledge base. Do not refer to your 'context' as 'context'."
    ),
    ("user", "<context>\n{context}</context>"),
    ("ai", "Thank you. I will not restart the conversation and will abide by the context."),
    ("placeholder", "{messages}")
])

## An LCEL chain to pass into chat_with_chain
chat_chain = prompt | llm | StrOutputParser()

with open("simple_long_context.txt", "r") as f:
    full_context = f.read()

long_context_state = {
    "messages": [],
    "context": full_context,
}

# chat = partial(chat_with_chain, chain=chat_chain)
# chat(long_context_state)

<hr><br>

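### **Part 1:** Generating Simple Metadata

In the previous notebook, we picked up some tricks for generating data by asking politely and enforcing a style. That approach leans on the model's priors, and we noted that every model has limits in this regard. To keep things simple, let's start with a use case that is genuinely production-ready, and where even an 8B model can shine: **short-form data extraction**.

The workshop dataset has plenty of natural-language descriptions, and we have a website frontend that needs some formatted fields. It would be nice if an LLM could initialize those values for us!

Naturally, we can define a schema to help generate them: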

In [None]:
from pydantic import BaseModel, Field
from typing import List

class MetaCreator(BaseModel):
    short_abstract: str = Field(description=(
        "A concise, SEO-optimized summary (1-2 sentences) of the course for students."
        " Ensure accuracy and relevance without overstating the workshop's impact."
    ))
    topics_covered: List[str] = Field(description=(
        "A natural-language list of key topics, techniques, and technologies covered."
        " Should start with 'This workshop' and follow a structured listing format that lists at least 4 points."
    ))
    abstract_body: str = Field(description=(
        "A detailed expansion of the short abstract, providing more context and information."
    ))
    long_abstract: str = Field(description=(
        "An extended version of the short abstract, followed by the objectives."
        " The first paragraph should introduce the topic with a strong hook and highlight its relevance."
    ))
    objectives: List[str] = Field(description=(
        "Key learning outcomes that students will achieve, emphasizing big-picture goals rather than specific notebook content."
    ))
    outline: List[str] = Field(description=(
        "A structured sequence of key topics aligned with major course sections, providing a clear learning path."
    ))
    on_completion: str = Field(description=(
        "A brief summary of what students will be able to accomplish upon completing the workshop."
    ))
    prerequisites: List[str] = Field(description=(
        "Essential prior knowledge and skills expected from students before taking the course."
    ))

def get_schema_hint(schema):
    schema = getattr(schema, "model_json_schema", lambda: None)() or schema
    return ( # Adapted from PydanticOutputParser(pydantic_object=Obj).get_format_instructions()
        'The output should be formatted as a JSON instance that conforms to the JSON schema below.\n\nAs an example, for the schema'
        ' {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}},'
        ' "required": ["foo"]}\nthe object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema.'
        ' The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.\n\nHere is the output schema:\n```\n' + str(schema) + '\n```'
    )

schema_hint = get_schema_hint(MetaCreator)
# schema_hint

<br>

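Then, if we bind the LLM client to this schema, it should be able to generate a matching object. 
The code below does that, and also shows how to stream the data and even filter it.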

In [None]:
structured_llm = llm.with_structured_output(schema=MetaCreator.model_json_schema(), strict=True)

meta_chain = prompt | structured_llm
meta_gen_directive = (
    # f"Can you generate a course entry on the Earth-2 course? {schema_hint}"
    # f"Can you combine the topics of the Earth-2 course and the NeRF/3DGS courses and generate a compelling course entry? {schema_hint}"
    f"Can you combine the topics of the Earth-2 course and the NeRF/3DGS courses and generate a compelling course entry? Make sure to explain how they combine. {schema_hint}"
) 
meta_gen_state = {
    "messages": [("user", meta_gen_directive)],
    "context": full_context,
}

# answer = meta_chain.invoke(meta_gen_state)
# print(answer)

from IPython.display import clear_output

answer = {}
for chunk in meta_chain.stream(meta_gen_state):
    clear_output(wait=True)
    for key, value in chunk.items():
        print(f"{key}: {value}", end="\n\n")
        answer[key] = value

# llm._client.last_response.json()

<br>
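
As a rough, offline illustration of the streaming-and-filtering idea, the sketch below merges partial dictionaries and then reports which expected fields never arrived. It assumes, as the loop above does, that every chunk is a dictionary whose values replace earlier ones. `fake_stream`, `collect_stream`, `missing_fields`, and the shortened field list are hypothetical names made up for this example and are not part of the course utilities.

```python
from typing import Dict, Iterable, List

# Hypothetical subset of the MetaCreator fields, used only for this illustration.
REQUIRED_FIELDS = ["short_abstract", "topics_covered", "objectives"]

def fake_stream() -> Iterable[Dict[str, object]]:
    """Stand-in for meta_chain.stream(...); assumes each chunk is a partial dict."""
    yield {"short_abstract": "Learn to"}
    yield {"short_abstract": "Learn to render climate data.", "topics_covered": ["NeRF"]}
    yield {"short_abstract": "Learn to render climate data.", "topics_covered": ["NeRF", "Earth-2"]}

def collect_stream(stream: Iterable[Dict[str, object]], keep: List[str]) -> Dict[str, object]:
    """Merge streamed chunks, keeping only the requested keys (latest value wins)."""
    answer: Dict[str, object] = {}
    for chunk in stream:
        for key, value in chunk.items():
            if key in keep:
                answer[key] = value
    return answer

def missing_fields(answer: Dict[str, object], required: List[str]) -> List[str]:
    """List required keys that never arrived, e.g. because generation was cut off."""
    return [key for key in required if key not in answer]

collected = collect_stream(fake_stream(), keep=REQUIRED_FIELDS)
print(missing_fields(collected, REQUIRED_FIELDS))
```

A check like this is a cheap way to decide whether a streamed result is complete enough to reuse in a follow-up prompt.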

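Alright, not bad! It shows the limitations we discussed in the course, but it does seem to be making good use of the context (rather than drifting into nonsense). Maybe we can ask it to improve on this?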

In [None]:
## TODO: See if you can't prompt-engineer this solution to lead to an improved autoregression.
meta_gen_state = {
    "messages": [
        ("user", meta_gen_directive),
        ("ai", str(answer)),
        ("user", "Great! Can you please correct any mistakes and flesh out some vagueness?")
    ],
    # "context": full_context,  ## Maybe we don't need the full context
    "context": "",
}

answer2 = {}
for chunk in meta_chain.stream(meta_gen_state):
    clear_output(wait=True)
    for key, value in chunk.items():
        print(f"{key}: {value}", end="\n\n")
        answer2[key] = value

<br>

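**Hmm... it can get better, up to a point.** 
- If we include the chat history, we will quickly run into problems as the model approaches its context limit.
- If we leave it out, we can still squeeze some customization out of the LLM and get a reasonably better or longer outline.

For this use case the model is actually decent, but with slightly longer content the limitations start to show clearly...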
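
One common way to soften the context-limit problem is to keep only the most recent turns that fit a budget. The sketch below is a minimal illustration of that idea, not something the course code does. `trim_history` is a hypothetical helper, and it uses character count as a crude stand-in for tokens, which a real system would replace with the model's tokenizer.

```python
from typing import List, Tuple

Message = Tuple[str, str]  # (role, content), the same shape used in the states above

def trim_history(messages: List[Message], max_chars: int) -> List[Message]:
    """Keep the most recent messages whose combined length fits in max_chars.

    Character count approximates token count here (an assumption of this sketch).
    """
    kept: List[Message] = []
    used = 0
    for role, content in reversed(messages):   # walk from newest to oldest
        if used + len(content) > max_chars:
            break                              # older messages are dropped
        kept.append((role, content))
        used += len(content)
    return list(reversed(kept))                # restore chronological order

history = [
    ("user", "a" * 50),
    ("ai", "b" * 30),
    ("user", "c" * 10),
]
trimmed = trim_history(history, max_chars=45)
print([role for role, _ in trimmed])
```

Dropping the oldest turns first is only one possible policy. Summarizing them instead would preserve more information at the cost of an extra model call.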

In [None]:
## Pick your preferred option
final_answer = answer2

<hr><br>

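### **Part 2:** Generating a Notebook

While generating metadata we saw some vague limitations. Let's see whether more obvious problems appear when we get more ambitious. The following shows one attempt at generating a notebook with a GPT-4o model: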

In [None]:
from IPython.display import display, Markdown, Latex
with open("chats/make_me_a_notebook/input.txt", "r") as f:
    notebook_input_full = f.read()
    notebook_input_prompt = notebook_input_full.split("\n\n")[-1]
# print(notebook_input_full)
print(notebook_input_prompt)
# !cat chats/make_me_a_notebook/output.txt
# display(Markdown("chats/make_me_a_notebook/output.txt"))
# display(Markdown("<hr><br><br>"))

<br>

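The notebook in [`chats/make_me_a_notebook/output.txt`](./chats/make_me_a_notebook/output.txt) is the first-attempt output I got when I asked GPT-4o to generate a notebook from the contents of [`chats/make_me_a_notebook/input.txt`](./chats/make_me_a_notebook/input.txt). For such a vague input the output is serviceable, and it can be improved **to some extent** simply by asking for better output, critiquing it, and supplying enough information.

The saying "garbage in, garbage out" may come to mind here, since an LLM simply mirrors the style of a plausible output for your particular input. But because of the conversational nature of its training, output steered through chat prompts will often fall short and prove insufficiently accurate for many advanced use cases.

Let's see whether we can improve this output by giving our LLM a style reference and asking it to rewrite the notebook a little: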

In [None]:
import json

def notebook_to_markdown(path: str) -> str:
    """Load a Jupyter notebook from a given path and convert it to Markdown format."""
    with open(path, 'r', encoding='utf-8') as file:
        notebook = json.load(file)
    markdown_content = []
    for cell in notebook['cells']:
        if cell['cell_type'] == 'code':          # Combine code into one block
            markdown_content += [f'```python\n{"".join(cell["source"])}\n```']
        elif cell['cell_type'] == 'markdown':    # Directly append markdown source
            markdown_content += ["".join(cell["source"])]
        # for output in cell.get('outputs', []):   # Optionally, you can include cell outputs
        #     if output['output_type'] == 'stream':
        #         markdown_content.append(f'```\n{"".join(output["text"])}\n```')
    return '\n\n'.join(markdown_content)

notebook_example = notebook_to_markdown("extra_utils/general_representative_notebook.ipynb")

context = str(final_answer)
# context = (
#     f"THE FOLLOWING IS AN EXAMPLE NOTEBOOK FOR STYLE ONLY: \n\n{notebook_example}"
#     "\n\n=========\n\n"
#     f"THE FOLLOWING IS THE TOPIC COURSE THAT WE ARE DISCUSSING:\n\n{final_answer}\n\n"
# )

long_context_state = {
    "messages": [],
    "context": context,
}

chat = partial(chat_with_chain, chain=chat_chain)
chat(long_context_state)

## Option: Can you please construct a good notebook in markdown format?
## Option: That's great, but there is no code. Can you please flesh out each section within an end-to-end narrative?

<br>

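The model in this example is quite small, and we are also restricting it to relatively short inputs and outputs (for its own good), so what it can generate is very limited. That said, this limitation shows up in all real-world settings regardless of model quality. For any modern LLM:
- Direct decoding works in some cases, but it cannot scale to arbitrarily large inputs or outputs.
- As sequences get longer, the length over which output stays high-quality is usually shorter than the length over which input is handled well. This is enforced during training, and it helps reduce generation cost and context accumulation.

In other words, **the space of content you can give to or expect from an LLM $\gg$ the space of content the LLM can actually understand well $\gg$ the space of content the LLM can actually output well.** *($\gg$ = "much greater than")*

With this insight, we can see that forcing an LLM to generate a notebook in one shot may lead to overall incoherence. Still, the result looks decent at least to some degree, so the approach may have some value.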

<hr><br>

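### **Part 3:** Using an Agent Canvas

When we find that we cannot directly output what we want, the next question is "can we take in what we want?"
- Given only the premise as input, the LLM seems to roughly follow along, but once we give it a representative example it starts to elaborate in detail.
- It may also be able to actually improve your notebook through conversation, so perhaps that is a place to start.

**The Canvas approach:** Instead of having the model predict the whole document, treat the document as an environment and give the LLM one of the following instructions: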
> - ***"Please propose a modification that will improve the state of the document. Here are your options. Pick one/several and they will be done."***
> - ***"Here is the whole state, and you are tasked with improving JUST THIS SUBSECTION OF IT. Please output your update to that section. No other sections will be modified."***
> - ***"This is the whole document. This section is bad because of one or more of the following: {criticisms}. Replace it with an improved version."***

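If the model can understand the whole environment and the instruction, it can directly autoregress a small section, or even a strategic modification to the output. Combine this approach with structured output or chain-of-thought, and you will likely get a result that, while imperfect, brings the potential output length close to the model's potential input length.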
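
To make the third instruction style concrete, here is a small sketch of how such a prompt could be assembled from a document, one of its sections, and a list of criticisms. `build_critique_prompt` and the `<<<DOCUMENT>>>` markers are hypothetical choices for this illustration. Only the `<<<SECTION>>>` markers mirror the prompt used later in this notebook.

```python
from typing import List

def build_critique_prompt(document: str, section: str, criticisms: List[str]) -> str:
    """Assemble a canvas-style prompt that targets one section of a document."""
    bullet_list = "\n".join(f"- {item}" for item in criticisms)
    return (
        "This is the whole document:\n\n"
        f"<<<DOCUMENT>>>\n{document}\n<<</DOCUMENT>>>\n\n"
        "This section is bad because of one or more of the following:\n"
        f"{bullet_list}\n\n"
        f"<<<SECTION>>>\n{section}\n<<</SECTION>>>\n\n"
        "Replace it with an improved version. Output only the new section."
    )

prompt_text = build_critique_prompt(
    document="# Intro\nSome text.\n# Setup\nInstall things.",
    section="# Setup\nInstall things.",
    criticisms=["too vague", "no code example"],
)
print(prompt_text)
```

Keeping the whole document and the target section in separate, clearly marked blocks is one way to make it explicit which part the model is allowed to change.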

In [None]:
## TODO: Insert a notebook of choice
STARTING_NOTEBOOK = """

""".strip()

In [None]:
prompt = ChatPromptTemplate.from_messages([
    ("system",
         "You are a helpful instructor assistant for NVIDIA Deep Learning Institute (DLI). "
         " Please help to answer user questions about the course. The first message is your context."
         " Restart from there, and strongly rely on it as your knowledge base. Do not refer to your 'context' as 'context'."
    ),
    ("user", "<context>\n{context}</context>"),
    ("ai", "Thank you. I will not restart the conversation and will abide by the context."),
    ("user", (
        "The following section needs some improvements:\n\n<<<SECTION>>>\n\n{section}\n\n<<</SECTION>>>\n\n"
        "Please propose an upgrade that would improve the overall notebook quality."
        " Later sections will follow and will be adapted by other efforts."
        " You may only output modifications to the section provided here, no later or earlier sections."
        " Follow best style practices, and assume the sections before this one are more enforcing than the later ones."
        " Make sure to number your section, continuing from the previous ones."
    )),
])

## An LCEL chain that rewrites one section per call
sub_chain = prompt | llm | StrOutputParser()

delimiter = "###"  ## TODO: Pick a delimiter that works for your notebook
WORKING_NOTEBOOK = STARTING_NOTEBOOK.split(delimiter)
for i in range(len(WORKING_NOTEBOOK)):
    chunk = WORKING_NOTEBOOK[i]
    output = ""  ## Reset for each section so earlier rewrites don't leak into this one
    ## TODO: Knowing that the state needs "context" and "section" values,
    ## can you construct your input state?
    chunk_refinement_state = {
        "context": None,
        "section": None,
    }
    for token in sub_chain.stream(chunk_refinement_state):
        print(token, end="", flush=True)
        output += token
    WORKING_NOTEBOOK[i] = output
    print("\n\n" + "#" * 64 + "\n\n")

<br>

<details><summary><b>Solution</b></summary>

```python
chunk_refinement_state = {
    "context": delimiter.join(WORKING_NOTEBOOK),  # rejoin with the same delimiter used to split
    "section": chunk,
}
```
    
</details>
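
The overall shape of this loop can be tried without a model. The sketch below is a simplified stand-in, assuming the document is split and rejoined with the same delimiter: each section is rewritten in turn while the rewriter sees the latest full document. `canvas_pass` and `shout` are hypothetical names, and `shout` merely upper-cases text where the real loop would call the LLM chain.

```python
from typing import Callable, List

def canvas_pass(document: str, delimiter: str, rewrite: Callable[[str, str], str]) -> str:
    """Rewrite each section in turn, always showing the latest full document."""
    sections: List[str] = document.split(delimiter)
    for i, section in enumerate(sections):
        current_doc = delimiter.join(sections)       # includes earlier rewrites
        sections[i] = rewrite(current_doc, section)  # only section i changes
    return delimiter.join(sections)

def shout(context: str, section: str) -> str:
    """Hypothetical stand-in for the LLM call: upper-case the section."""
    return section.upper()

doc = "intro\n### part one\n### part two\n"
print(canvas_pass(doc, "###", shout))
```

Because the document is rejoined with the delimiter it was split on, a rewriter that returns each section unchanged reproduces the original document exactly, which makes the loop easy to sanity-check before plugging in a model.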

<hr><br>

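### **Part 4:** Reflecting on This Exercise

As we have seen, this approach is promising because it extends the model's output to a much larger context through local modifications. It quickly pushes the 8B model outside its training distribution, and with vague input it likely also starts hallucinating quickly, but a larger model can iterate for longer, and you can even add some error-correction or randomization efforts to stabilize the process.

This technique is also used in practice to implement codebase modifications and collaborative document editing (e.g., OpenAI Canvas). Moreover, even small modifications to this approach can help you implement some surprisingly efficient solutions:
- **Find-replace canvas:** Instead of autoregressing sections of the document, generate find-replace pairs. Running this process over the chunks gives you a safer formulation and a footprint that is easier to track. This kind of system can be used to implement AI-powered spell checkers and other strategic error correction.
- **Document translation:** More generally, this approach can also be used to translate a document section by section from one format to another. With a method like the one above, you can translate a document from one language to another while injecting some context to help the translation model's workflow produce better style.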
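
As a minimal sketch of the find-replace variant, the code below applies a list of (find, replace) pairs and rejects any pair whose `find` text does not occur exactly once. That uniqueness rule is an assumption of this example, chosen to keep edits unambiguous and easy to audit. `apply_edits` is a hypothetical helper, and the edit list is hard-coded here where a real system would have the LLM propose it.

```python
from typing import List, Tuple

Edit = Tuple[str, str]  # (text to find, replacement text)

def apply_edits(document: str, edits: List[Edit]) -> Tuple[str, List[Edit]]:
    """Apply each edit only if its `find` text occurs exactly once; collect the rest."""
    rejected: List[Edit] = []
    for find, replace in edits:
        if document.count(find) == 1:
            document = document.replace(find, replace)
        else:
            rejected.append((find, replace))   # missing or ambiguous match
    return document, rejected

doc = "The modle reads the the context."
edits = [("modle", "model"), ("the the", "the"), ("contxt", "context")]
fixed, rejected = apply_edits(doc, edits)
print(fixed)
print(rejected)
```

Rejected edits can be logged or sent back to the model for another attempt, which gives the easier-to-track footprint mentioned above.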

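Note that although we call this process *"canvasing,"* you may also encounter the same or similar ideas under the name *"iterative refinement."* They are almost the same, except that the latter is more general and can technically apply to any LLM-driven loop that turns input into output over multiple iterations. Canvasing more strongly signals that you are using the current environment as a playground and can make strategic modifications to improve its state.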

----

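In any case, we have now tested how a small model can genuinely help us do some fairly interesting things, and reflected on its obvious limitations. This concludes the "simple workflow" exercises of this course. In the next section, we will take the basic elements learned so far and start using a real agent framework, while continuing to use the very limited but surprisingly flexible Llama-8B model.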