🎉 News

[TODO]: Update data and code.
[03.2024] The xLAM model is released! Try it with the AgentLite benchmark or other benchmarks; its performance is comparable to GPT-4.
[02.2024] Initial Release of AgentOhana and xLAM paper!

This repo is for research purposes only.

Autonomous agents powered by large language models (LLMs) have garnered significant research attention. However, fully harnessing the potential of LLMs for agent-based tasks presents inherent challenges due to the heterogeneous nature of diverse data sources featuring multi-turn trajectories.

This repo introduces xLAM, which aggregates agent trajectories from distinct environments spanning a wide array of scenarios. It standardizes and unifies these trajectories into a consistent format, which simplifies building a generic data loader optimized for agent training. Building on this unified data, our training pipeline balances the different data sources and preserves independent randomness across devices during dataset partitioning and model training.
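As a rough illustration of what "unifying trajectories" can mean, the sketch below converts two differently shaped raw records into one common layout. The raw layouts, the converter names, and the field names (`source`, `task`, `steps`, `observation`, `action`) are assumptions made up for this example; they are not the schema used in this repo.

```python
def from_webshop(record):
    """Hypothetical raw layout: parallel lists of observations and actions."""
    steps = [
        {"observation": obs, "action": act}
        for obs, act in zip(record["observations"], record["actions"])
    ]
    return {"source": "webshop", "task": record["instruction"], "steps": steps}


def from_hotpotqa(record):
    """Hypothetical raw layout: a list of (thought, action, observation) tuples."""
    steps = [
        {"observation": obs, "action": act}
        for _thought, act, obs in record["turns"]
    ]
    return {"source": "hotpotqa", "task": record["question"], "steps": steps}


# Two made-up raw records with different shapes.
webshop_raw = {
    "instruction": "Find a red mug under $10",
    "observations": ["search page", "result list"],
    "actions": ["search[red mug]", "click[item 3]"],
}
hotpotqa_raw = {
    "question": "Who wrote the novel?",
    "turns": [("I should search.", "search[novel]", "The novel was written by ...")],
}

unified = [from_webshop(webshop_raw), from_hotpotqa(hotpotqa_raw)]
for trajectory in unified:
    print(trajectory["source"], len(trajectory["steps"]))
```

Once every source yields the same keys, a single data loader can consume all of them without source-specific branches.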

Model Instruction

If you already know Mixtral, xLAM-v0.1 is a significant upgrade and better at many things. With the same number of parameters, the model has been fine-tuned across a wide range of agent tasks and scenarios while preserving the capabilities of the original model.

xLAM-v0.1-r is version 0.1 of the Large Action Model series; the "-r" suffix indicates that it is tagged for research. The model is compatible with the vLLM and FastChat platforms.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model; device_map="auto" places the weights on the available devices.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/xLAM-v0.1-r")
model = AutoModelForCausalLM.from_pretrained("Salesforce/xLAM-v0.1-r", device_map="auto")

# A multi-turn conversation in the chat-message format.
messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
    {"role": "user", "content": "Do you have mayonnaise recipes?"}
]

# Apply the chat template and move the input IDs to the model's device
# (hard-coding "cuda" can mismatch the placement chosen by device_map="auto").
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

# Generate up to 512 new tokens and decode the full sequence.
outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Note: You may need to tune the temperature setting for different applications. A lower temperature typically helps on tasks that require deterministic outcomes. For tasks that demand adherence to a specific format or to function calls, explicitly include the formatting instructions in the prompt.
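To make the effect of temperature concrete, here is a minimal, library-free sketch of how temperature rescales next-token scores before sampling. The logits and the helper name `softmax_with_temperature` are invented for illustration; real decoders apply the same idea over the full vocabulary.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into sampling probabilities at the given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.2)  # low temperature: near-deterministic
flat = softmax_with_temperature(logits, 1.5)   # high temperature: more diverse

print("T=0.2:", [round(p, 3) for p in sharp])
print("T=1.5:", [round(p, 3) for p in flat])
```

With the generation snippet above, sampling temperature is passed to Transformers as `model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.2)`; the value 0.2 is only an example.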

Framework

A unified data formatting and streaming loader.

from fm_datasets import webshop_multi_turn_v2
from fm_utils.seed_random import init_device_seed
from fm_utils.interleave_datasets import interleave_data

# `tokenizer` and `script_args` must be defined beforehand.
sft_webshop_multi_turn = webshop_multi_turn_v2.SFTWebShopMultiTurnV2(tokenizer, script_args)

# Initialize the random seed for the current device.
seed = init_device_seed(seed=42)

# Interleave the data sources according to their sampling probabilities.
train_dataset, eval_dataset = interleave_data(
    data_objects=[sft_webshop_multi_turn],
    sample_probs=[1.0],
    return_type="prompt_answer",
    seq_length=4096,
    seed=seed)
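The idea of sampling from several sources with fixed probabilities, using a different seed on each device, can be sketched without any of the repo's modules. The functions `device_seed` and `interleave` below are hypothetical stand-ins written for this example; they do not reproduce the behavior of `init_device_seed` or `interleave_data`.

```python
import random
from itertools import islice


def device_seed(base_seed, rank):
    """Assumption: derive a distinct, reproducible seed for each device rank."""
    return base_seed + rank


def interleave(sources, probs, seed):
    """Stream samples, choosing the next source at random according to probs."""
    rng = random.Random(seed)
    iterators = [iter(source) for source in sources]
    while True:
        index = rng.choices(range(len(iterators)), weights=probs, k=1)[0]
        try:
            yield next(iterators[index])
        except StopIteration:
            return  # stop as soon as the chosen source runs out


def make_sources():
    # Two made-up data sources; real ones would yield tokenized trajectories.
    webshop = [f"webshop-{i}" for i in range(1000)]
    hotpotqa = [f"hotpotqa-{i}" for i in range(1000)]
    return [webshop, hotpotqa]


probs = [0.7, 0.3]
mixed_rank0 = list(islice(interleave(make_sources(), probs, device_seed(42, 0)), 200))
mixed_rank1 = list(islice(interleave(make_sources(), probs, device_seed(42, 1)), 200))

share = sum(s.startswith("webshop") for s in mixed_rank0) / len(mixed_rank0)
print(f"webshop share on rank 0: {share:.2f}")
print("first samples on rank 0:", mixed_rank0[:3])
print("first samples on rank 1:", mixed_rank1[:3])
```

Each rank draws a different but reproducible sequence, while the long-run mix of sources stays close to the requested probabilities.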

Supervised fine-tuning and DPO fine-tuning.

from fm_utils.derived_data_collator import DataCollatorForPromptAnswer
from fm_trainers.sft_foundation_trainer import SFTFoundationTrainer

# `instruction_template_ids`, `response_template_ids`, `base_model`, `peft_config`,
# and `training_args` must be defined beforehand.
collator = DataCollatorForPromptAnswer(
    instruction_template=instruction_template_ids,
    response_template=response_template_ids,
    tokenizer=tokenizer,
    mlm=False)

# Build the SFT trainer on the interleaved train/eval datasets.
trainer = SFTFoundationTrainer(
    model=base_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=peft_config,
    packing=False,
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=collator,
)

trainer.train()
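A prompt-answer collator that takes a response template is commonly used to compute the loss only on answer tokens. The sketch below shows that masking idea on plain token-ID lists. It is an assumption about what such a collator does, not the implementation of `DataCollatorForPromptAnswer`; the token IDs and helper names are invented, and only a single-turn example is handled.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def find_subsequence(sequence, pattern):
    """Return the start index of the first occurrence of pattern, or -1."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_labels(input_ids, response_template_ids):
    """Copy input_ids to labels, ignoring everything up to and including the template."""
    labels = list(input_ids)
    start = find_subsequence(input_ids, response_template_ids)
    if start < 0:
        # No answer marker found: exclude the whole example from the loss.
        return [IGNORE_INDEX] * len(labels)
    answer_start = start + len(response_template_ids)
    for i in range(answer_start):
        labels[i] = IGNORE_INDEX
    return labels


# Made-up token IDs: prompt tokens, a two-token response marker, then the answer.
input_ids = [1, 11, 12, 13, 90, 91, 21, 22, 2]
response_template_ids = [90, 91]

labels = mask_prompt_labels(input_ids, response_template_ids)
print(labels)
```

Multi-turn conversations would need the same masking repeated for every instruction/response pair, which is where an instruction template becomes useful.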

Installation

You can use our preconfigured Docker environment, gcr.io/salesforce-research-internal/xlam-2024-02-14; an example YAML file is provided in envs_config. Inside that environment, run pip install -e . --no-dependencies to install the package.

Alternatively, run pip install -e . directly. Note that this may cause errors, depending on how your environment is configured.

Train

Refer to the complete example scripts for more details.

Or simply run the following bash command for a quick start with our example:

nohup accelerate launch --config_file xLAM/train/scripts/multi_gpu.yaml xLAM/train/scripts/sft_mixtral8X7B_accelerator.py --model_name mistralai/Mixtral-8x7B-Instruct-v0.1 --seq_length 4096 --run_name sft_mixtral8X7B_v2_02072024 --output_dir {path} > sft_mixtral8X7B_v2_02072024.nohup 2>&1 &

Benchmarks

BOLAA

Webshop

LLM Name	ZS	ZST	ReAct	PlanAct	PlanReAct	BOLAA
Llama-2-70B-chat	0.0089	0.0102	0.4273	0.2809	0.3966	0.4986
Vicuna-33B	0.1527	0.2122	0.1971	0.3766	0.4032	0.5618
Mixtral-8x7B-Instruct-v0.1	0.4634	0.4592	0.5638	0.4738	0.3339	0.5342
GPT-3.5-Turbo	0.4851	0.5058	0.5047	0.4930	0.5436	0.6354
GPT-3.5-Turbo-Instruct	0.3785	0.4195	0.4377	0.3604	0.4851	0.5811
GPT-4-0613	0.5002	0.4783	0.4616	0.7950	0.4635	0.6129
xLAM-v0.1-r	0.5201	0.5268	0.6486	0.6573	0.6611	0.6556

HotpotQA

LLM Name	ZS	ZST	ReAct	PlanAct	PlanReAct
Mixtral-8x7B-Instruct-v0.1	0.3912	0.3971	0.3714	0.3195	0.3039
GPT-3.5-Turbo	0.4196	0.3937	0.3868	0.4182	0.3960
GPT-4-0613	0.5801	0.5709	0.6129	0.5778	0.5716
xLAM-v0.1-r	0.5492	0.4776	0.5020	0.5583	0.5030

AgentLite

Please note: All prompts provided by AgentLite are considered "unseen prompts" for xLAM-v0.1-r, meaning the model has not been trained with data related to these prompts.

Webshop

LLM Name	Act	ReAct	BOLAA
GPT-3.5-Turbo-16k	0.6158	0.6005	0.6652
GPT-4-0613	0.6989	0.6732	0.7154
xLAM-v0.1-r	0.6563	0.6640	0.6854

HotpotQA

LLM Name	Easy F1 Score	Easy Accuracy	Medium F1 Score	Medium Accuracy	Hard F1 Score	Hard Accuracy
GPT-3.5-Turbo-16k-0613	0.410	0.350	0.330	0.25	0.283	0.20
GPT-4-0613	0.611	0.47	0.610	0.480	0.527	0.38
xLAM-v0.1-r	0.532	0.45	0.547	0.46	0.455	0.36

ToolBench

LLM Name	Unseen Insts & Same Set	Unseen Tools & Seen Cat	Unseen Tools & Unseen Cat
TooLlama V2	0.4385	0.4300	0.4350
GPT-3.5-Turbo-0125	0.5000	0.5150	0.4900
GPT-4-0125-preview	0.5462	0.5450	0.5050
xLAM-v0.1-r	0.5077	0.5650	0.5200

MINT-BENCH

LLM Name	1-step	2-step	3-step	4-step	5-step
GPT-4-0613	-	-	-	-	69.45
Claude-Instant-1	12.12	32.25	39.25	44.37	45.90
xLAM-v0.1-r	4.10	28.50	36.01	42.66	43.96
Claude-2	26.45	35.49	36.01	39.76	39.93
Lemur-70b-Chat-v1	3.75	26.96	35.67	37.54	37.03
GPT-3.5-Turbo-0613	2.73	16.89	24.06	31.74	36.18
AgentLM-70b	6.48	17.75	24.91	28.16	28.67
CodeLlama-34b	0.17	16.21	23.04	25.94	28.16
Llama-2-70b-chat	4.27	14.33	15.70	16.55	17.92

Tool-Query

LLM Name	Success Rate	Progress Rate
xLAM-v0.1-r	0.533	0.766
DeepSeek-67B	0.400	0.714
GPT-3.5-Turbo-0613	0.367	0.627
GPT-3.5-Turbo-16k	0.317	0.591
Lemur-70B	0.283	0.720
CodeLlama-13B	0.250	0.525
CodeLlama-34B	0.133	0.600
Mistral-7B	0.033	0.510
Vicuna-13B-16K	0.033	0.343
Llama-2-70B	0.000	0.483

Acknowledgement

We would like to acknowledge the works that have contributed to our paper and to the agent research community! If you find our work useful, please consider citing:

@article{zhang2024agentohana,
  title={AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning},
  author={Zhang, Jianguo and Lan, Tian and Murthy, Rithesh and Liu, Zhiwei and Yao, Weiran and Tan, Juntao and Hoang, Thai and Yang, Liangwei and Feng, Yihao and Liu, Zuxin and others},
  journal={arXiv preprint arXiv:2402.15506},
  year={2024}
}
