ToolSword

ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages

Data for paper ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages

Junjie Ye

jjye23@m.fudan.edu.cn

Feb. 16, 2024

Introduction

Tool learning is widely acknowledged as a foundational approach or deploying large language models (LLMs) in real-world scenarios. While current research primarily emphasizes leveraging tools to augment LLMs, it frequently neglects emerging safety considerations tied to their application. To fill this gap, we present ToolSword, a comprehensive framework dedicated to meticulously investigating safety issues linked to LLMs in tool learning. Specifically, ToolSword delineates six safety scenarios for LLMs in tool learning, encompassing malicious queries and jailbreak attacks in the input stage, noisy misdirection and risky cues in the execution stage, and harmful feedback and error conflicts in the output stage. Experiments conducted on 11 open-source and closed-source LLMs reveal enduring safety challenges in tool learning, such as handling harmful queries, employing risky tools, and delivering detrimental feedback, which even GPT-4 is susceptible to. Moreover, we conduct further studies with the aim of fostering research on tool learning safety.

What's New

[2024.02.19] Release the data for ToolSword.
[2024.02.19] Paper available on Arxiv.

Results in the Input Stage

We manually evaluate the performance of various LLMs in four safety scenarios during the input stage by tallying their attack success rate (ASR), which represents the percentage of non-secure queries that are inaccurately recognized and not rejected.

Results in the Execution Stage

In the execution stage, we manually assess the performance of various LLMs in two safety scenarios. This assessment entails monitoring the tool selection error rate, which signifies the percentage of incorrectly chosen tools.

Results in the Output Stage

In the output stage, we manually evaluate various LLMs in two safety scenarios. We gauge LLMs performance by calculating the ratio of unsafe output.

Citation

If you find this project useful in your research, please cite:

@misc{ye2024toolsword,
      title={ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages}, 
      author={Junjie Ye and Sixian Li and Guanyu Li and Caishuang Huang and Songyang Gao and Yilong Wu and Qi Zhang and Tao Gui and Xuanjing Huang},
      year={2024},
      eprint={2402.10753},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
Figures		Figures
data		data
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Figures

Figures

data

data

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

Repository files navigation

ToolSword

ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages

Introduction

What's New

Results in the Input Stage

Results in the Execution Stage

Results in the Output Stage

Citation

About

Releases

Packages

License

Junjie-Ye/ToolSword

Folders and files

Latest commit

History

Repository files navigation

ToolSword

ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages

Introduction

What's New

Results in the Input Stage

Results in the Execution Stage

Results in the Output Stage

Citation

About

Resources

License

Stars

Watchers

Forks