# 🚀 Llama3-8b-Instruct ReAct Agent

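The implementation follows the [mistral-ReAct-Agent-with-function-tool-call](./mistral-ReAct-Agent-with-function-tool-call) example. The main differences are:
0. The model is switched from Mixtral-8x22b-Instruct to Llama3-8b-Instruct
1. The prompt is written in Chinese to see how well the model handles it
2. vLLM replaces llama.cpp as the inference backend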

In [5]:
from vllm import LLM, SamplingParams

llm = LLM(
    model="/data/hf/Meta-Llama-3-8B-Instruct",
    trust_remote_code=True,
    tensor_parallel_size=2,
)
tokenizer = llm.get_tokenizer()

2024-04-19 23:01:06,294	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


INFO 04-19 23:01:06 config.py:407] Custom all-reduce kernels are temporarily disabled due to stability issues. We will re-enable them once the issues are resolved.


2024-04-19 23:01:08,243	INFO worker.py:1724 -- Started a local Ray instance.


INFO 04-19 23:01:09 llm_engine.py:79] Initializing an LLM engine with config: model='/data/hf/Meta-Llama-3-8B-Instruct', tokenizer='/data/hf/Meta-Llama-3-8B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


INFO 04-19 23:01:20 llm_engine.py:337] # GPU blocks: 12465, # CPU blocks: 4096
INFO 04-19 23:01:21 model_runner.py:666] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 04-19 23:01:21 model_runner.py:670] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(RayWorkerVllm pid=273778) INFO 04-19 23:01:21 model_runner.py:666] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=273778) INFO 04-19 23:01:21 model_runner.py:670] CUDA graphs can take additional 1~3 GiB memory per

(RayWorkerVllm pid=273778) INFO 04-19 23:01:27 model_runner.py:738] Graph capturing finished in 7 secs.


In [3]:
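def search_weather(city: str) -> str:
    """Look up the weather for a city (a stub that returns canned temperatures).

    Args:
        city: name of the city to query, e.g. "北京" (Beijing) or "巴黎" (Paris)
    """
    # The return values stay in Chinese because they are fed back into the Chinese prompt.
    if city == "北京":  # Beijing
        return "北京气温36度"
    elif city == "巴黎":  # Paris
        return "巴黎气温24度"
    else:
        return f"{city}气温20度"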

func_desc = """function: search_weather
  description: 一个输入城市名称查询天气的函数
  args:
    city (str): 你想查询的城市名称
"""

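The description above is written by hand and has to be kept in sync with the function. As a minimal sketch, assuming a Google-style docstring (one summary line followed by an `Args:` section), the same text could be derived from the signature and docstring. The helper name `describe_function` is hypothetical and not part of the original notebook; the Chinese docstring is kept here so the result matches the Chinese prompt.

```python
import inspect


def describe_function(func) -> str:
    """Build a prompt-ready description from a function's signature and docstring.

    Assumes a Google-style docstring: a summary line, then an "Args:" section
    with "name: description" entries.
    """
    doc = inspect.getdoc(func) or ""
    lines = doc.splitlines()
    summary = lines[0].strip() if lines else ""

    # Collect "name: description" pairs that follow the "Args:" line.
    arg_docs = {}
    in_args = False
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "Args:":
            in_args = True
        elif in_args and ":" in stripped:
            name, desc = stripped.split(":", 1)
            arg_docs[name.strip()] = desc.strip()

    out = [f"function: {func.__name__}", f"  description: {summary}", "  args:"]
    for name, param in inspect.signature(func).parameters.items():
        ann = param.annotation
        if ann is inspect.Parameter.empty:
            type_name = "any"
        else:
            type_name = getattr(ann, "__name__", str(ann))
        out.append(f"    {name} ({type_name}): {arg_docs.get(name, '')}")
    return "\n".join(out) + "\n"


def search_weather(city: str):
    """一个输入城市名称查询天气的函数

    Args:
        city: 你想查询的城市名称
    """
    return f"{city}气温20度"


print(describe_function(search_weather))
```

For this function the generated text should be identical to the handwritten `func_desc`, so it can be passed to the agent unchanged.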
In [38]:
REACT_PROMPT = '''<|begin_of_text|><|start_header_id|>system<|end_header_id|>

你是一个AI智能，能够使用各种函数来回答问题。

### 可以访问的函数

{{func_desc}}

### 回复格式
使用如下格式进行回复：

思考： 一步步思考解决问题，将你思考的过程放在这里
行动：
```json
{
    "function": $FUNCTION_NAME,
    "args": $FUNCTION_ARGS
}
```
观察：得到函数返回结果
...(这里思考/行动/观察可以重复n次) 
思考：现在我知道最终答案了
答案：写出最终答案

$FUNCTION_NAME 是函数名。$FUNCTION_ARGS 是符合函数要求的字典输入。<|eot_id|><|start_header_id|>user<|end_header_id|>

问题：{{question}} <|eot_id|><|start_header_id|>assistant<|end_header_id|>

''' # The prompt must end with two newlines (\n\n) after the assistant header
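`REACT_PROMPT` hard-codes the Llama 3 chat markers, which makes the layout easy to get wrong. The sketch below assembles the same single-turn layout from a system and a user message, so the header tokens and the trailing blank line are produced in one place. The helper name `build_llama3_prompt` and the sample messages are assumptions for illustration only.

```python
def build_llama3_prompt(system: str, user: str) -> str:
    """Assemble a single-turn Llama 3 Instruct prompt by hand."""

    def turn(role: str, content: str) -> str:
        # Each turn is: header, blank line, content, end-of-turn token.
        return f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"

    return (
        "<|begin_of_text|>"
        + turn("system", system)
        + turn("user", user)
        # Open the assistant turn; the header is followed by two newlines
        # and the model continues from there.
        + "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_llama3_prompt("You are a helpful assistant.", "问题：北京现在多少度？")
print(prompt)
```

When the Hugging Face tokenizer for the model ships a chat template, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` is the usual way to get an equivalent string without writing the special tokens by hand.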

In [36]:
import json
import re

def react_agent(model, question, function, function_desc, max_rounds=3):
    print(f"Question: {question}")
    prompt = REACT_PROMPT.replace("{{question}}", question).replace("{{func_desc}}", function_desc)

    tokenizer = model.get_tokenizer()
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=1024,
        # KEY POINT: Llama 3 ends a turn with <|eot_id|>, so stop on it as well as on EOS.
        stop_token_ids=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],
        # Stop before the model invents its own observation ("观察：").
        stop=["观察："],
    )

    output = ""
    try:
        for _ in range(max_rounds):
            # Thought step: generate until the model requests an observation or ends its turn.
            response = model.generate([prompt], sampling_params=sampling_params)[0].outputs[0].text
            output += response
            prompt += response
            print(response, end="")

            # A final answer ("答案：") ends the ReAct loop.
            if "答案：" in response:
                break

            if "行动：" in response:
                # Parse the JSON action block; json.loads tolerates whitespace on its own,
                # so spaces inside argument values are preserved.
                action = json.loads(re.findall(r"```json([\s\S]*?)```", response)[0])
                obs = function(**action["args"])
                # Feed the observation back and cue the next thought ("思考：").
                obs = f"观察：{obs}\n思考："
                output += obs
                prompt += obs
                print(obs, end="")
    except Exception as e:
        print(f"ERROR: {e}")
    return output
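The loop above never reads the `"function"` field of the action and always calls the single function it was given. With more than one tool, the action would need to be validated and routed by name. The following is a rough sketch of that step; `parse_action`, `dispatch`, and the `tools` registry are hypothetical names, not part of the original notebook. The fence marker is built with `"`" * 3` only so the example can sit inside a fenced block.

```python
import json
import re

FENCE = "`" * 3  # the three-backtick marker that wraps the JSON action
ACTION_RE = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def parse_action(response: str) -> dict:
    """Extract and validate the first JSON action block in a model response."""
    match = ACTION_RE.search(response)
    if match is None:
        raise ValueError("no json action block found in response")
    action = json.loads(match.group(1))
    if (
        not isinstance(action, dict)
        or "function" not in action
        or not isinstance(action.get("args"), dict)
    ):
        raise ValueError(f"malformed action: {action!r}")
    return action


def dispatch(action: dict, tools: dict) -> str:
    """Call the tool named in the action, or report an unknown name."""
    name = action["function"]
    if name not in tools:
        # Returning the error as text lets it be fed back as an observation.
        return f"unknown function: {name}"
    return str(tools[name](**action["args"]))


response = (
    "思考：先查询天气。\n行动：\n"
    + FENCE
    + 'json\n{"function": "search_weather", "args": {"city": "New York"}}\n'
    + FENCE
)
tools = {"search_weather": lambda city: f"{city}气温20度"}
print(dispatch(parse_action(response), tools))
```

Because the JSON is parsed without stripping whitespace first, an argument such as `"New York"` reaches the tool with its space intact.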

    

In [37]:
# "Which is hotter right now, Beijing or Paris?" (kept in Chinese to match the Chinese prompt)
question = "北京和巴黎现在哪个地方更热？"
output = react_agent(llm, question, search_weather, func_desc)

Question: 北京和巴黎现在哪个地方更热？


Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.27it/s]


思考：首先，我需要知道北京和巴黎现在的天气情况。

行动：
```json
{
    "function": "search_weather",
    "args": {
        "city": "北京"
    }
}
```
观察：北京气温36度
思考：

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.37it/s]


然后，我需要知道巴黎现在的天气情况。

行动：
```json
{
    "function": "search_weather",
    "args": {
        "city": "巴黎"
    }
}
```
观察：巴黎气温24度
思考：

Processed prompts: 100%|██████████| 1/1 [00:00<00:00,  1.99it/s]

现在我知道了北京和巴黎的天气情况，可以比较气温来确定哪个地方更热。

答案：北京更热。
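`react_agent` returns the whole trace, including every thought, action, and observation. A caller that only needs the final answer could split on the answer marker, as in this small sketch. The helper name `extract_answer` is an assumption, and the sample trace is copied from the last round of the output above.

```python
from typing import Optional


def extract_answer(trace: str, marker: str = "答案：") -> Optional[str]:
    """Return the text after the last answer marker, or None if there is none."""
    _, found, tail = trace.rpartition(marker)
    return tail.strip() if found else None


trace = "现在我知道了北京和巴黎的天气情况，可以比较气温来确定哪个地方更热。\n\n答案：北京更热。\n"
print(extract_answer(trace))
```

A `None` result signals that the loop hit `max_rounds` or an error before the model produced an answer.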


