INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning


Authors: Yutao Zhu, Peitian Zhang, Chenghao Zhang, Yifei Chen, Binyu Xie, Zhicheng Dou, Zheng Liu, and Ji-Rong Wen

📃 ArXiv Paper • 📚 Dataset

🤗 HuggingFace Model List

Model                   Backbone Model
INTERS-LLaMA-7b-Chat    LLaMA-2-7b-chat
INTERS-LLaMA-7b-Base    LLaMA-2-7b
INTERS-Mistral-7b       Mistral-7b
INTERS-Minima-3b        Minima-2-3b
INTERS-Falcon-1b        Falcon-rw-1b
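
The released checkpoints can be loaded directly with Hugging Face transformers. The snippet below is only a minimal sketch: the repository id your-hf-namespace/INTERS-LLaMA-7b-Chat is a placeholder for whichever checkpoint above you download (or a local path), and the prompt is an arbitrary example rather than one of the official instruction templates.

# Minimal sketch: load an INTERS checkpoint and generate a completion.
# "your-hf-namespace/INTERS-LLaMA-7b-Chat" is a placeholder id; replace it
# with the actual Hugging Face repository id or a local model path.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-hf-namespace/INTERS-LLaMA-7b-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # assumes a GPU; use torch.float32 on CPU
    device_map="auto",           # requires the accelerate package
)

# An arbitrary query-understanding style prompt, not an official INTERS template.
prompt = "Generate a clarifying question for the query: apple price"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))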

News

  • Feb. 2024: We have released the dataset, instruction templates, fine-tuned models, and evaluation scripts.

Introduction

Large language models (LLMs) have demonstrated impressive capabilities in various natural language processing tasks. Despite this, their application to information retrieval (IR) tasks is still challenging due to the infrequent occurrence of many IR-specific concepts in natural language. While prompt-based methods can provide task descriptions to LLMs, they often fall short in facilitating a comprehensive understanding and execution of IR tasks, thereby limiting LLMs' applicability. To address this gap, in this work, we explore the potential of instruction tuning to enhance LLMs' proficiency in IR tasks. We introduce a novel instruction tuning dataset, INTERS, encompassing 20 tasks across three fundamental IR categories: query understanding, document understanding, and query-document relationship understanding. The data are derived from 43 distinct datasets with manually written templates. Our empirical results reveal that INTERS significantly boosts the performance of various publicly available LLMs, such as LLaMA, Mistral, and Phi, in IR tasks. Furthermore, we conduct extensive experiments to analyze the effects of instruction design, template diversity, few-shot demonstrations, and the volume of instructions on performance.

Tasks & Datasets

We consider tasks under the categories of query understanding, document understanding, and query-document relationship understanding. Our dataset consists of 20 tasks derived from 43 datasets. All tasks and datasets we used are shown in the figure below.

Dataset Construction

General Performance

Zero-shot Evaluation

The evaluation scripts are under the evaluation directory.

Required packages

torch               2.0.0
transformers        4.36.2
numpy               1.26.3
tqdm                4.66.1
scikit-learn        1.4.0
rouge_score         0.1.2
nltk                3.8.1
accelerate          0.26.1
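
If you want to confirm that your environment matches these pins, a small check like the one below works; it only assumes that the packages were installed under the distribution names listed above.

# Compare installed package versions against the pinned ones listed above.
from importlib.metadata import version, PackageNotFoundError

pinned = {
    "torch": "2.0.0",
    "transformers": "4.36.2",
    "numpy": "1.26.3",
    "tqdm": "4.66.1",
    "scikit-learn": "1.4.0",
    "rouge_score": "0.1.2",
    "nltk": "3.8.1",
    "accelerate": "0.26.1",
}

for name, wanted in pinned.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        installed = "not installed"
    marker = "" if installed == wanted else "  <-- differs from the pinned version"
    print(f"{name:15s} {installed}{marker}")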

For query understanding tasks and document understanding tasks (qu-du-tasks)

This evaluation script uses PyTorch DDP (DistributedDataParallel) for text generation; a minimal sketch of this pattern is included after the steps below.

  1. Download the test data and save it to data/in-domain/zero_shot/. The directory structure should look like this:
qu-du-tasks
├── eval_sampling.py
├── inference_dataset.py
├── inference_qu_du.py
├── inference_tasks
│   ├── conversational_qa.py
│   ├── fact_verification.py
│   └── ...
└── data
    └── in-domain
        └── zero-shot
            ├── conversational_qa_coqa.zero_shot.test.jsonl
            ├── conversational_qa_quac.zero_shot.test.jsonl
            ├── fact_verification_climate_fever.zero_shot.test.jsonl
            ├── fact_verification_fever.zero_shot.test.jsonl
            ├── fact_verification_scifact.zero_shot.test.jsonl
            └── ...
  2. If you choose to place the test files in other directories, modify the path in each task file under the inference_tasks directory (in the get_path() function).

  3. Run the evaluation as:

TOKENIZERS_PARALLELISM=True python3 inference_qu_du.py \
    --model_name_or_path your/model/path \
    --tokenizer_name your/tokenizer/path \
    --setting in-domain \
    --n_shots zero_shot
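
For reference, the following is a rough sketch of the DDP generation pattern mentioned above: each process loads the model on its own GPU, handles a shard of a test file, and writes its own predictions. It is an illustration only, not the repository's inference_qu_du.py; the test file name is taken from the tree above, and the "prompt" field name is an assumption.

# Illustrative DDP sharded generation (launch with torchrun --nproc_per_node=N).
# This is NOT the repository's inference_qu_du.py, just the general pattern.
import json
import os
import torch
import torch.distributed as dist
from transformers import AutoModelForCausalLM, AutoTokenizer

def main():
    dist.init_process_group("nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    tokenizer = AutoTokenizer.from_pretrained("your/tokenizer/path")
    model = AutoModelForCausalLM.from_pretrained(
        "your/model/path", torch_dtype=torch.float16
    ).cuda(local_rank).eval()

    # Shard the test set: rank i takes every world_size-th example.
    path = "data/in-domain/zero-shot/fact_verification_fever.zero_shot.test.jsonl"
    with open(path) as f:
        examples = [json.loads(line) for i, line in enumerate(f) if i % world_size == rank]

    predictions = []
    for ex in examples:
        inputs = tokenizer(ex["prompt"], return_tensors="pt").to(f"cuda:{local_rank}")  # field name assumed
        with torch.no_grad():
            output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        predictions.append(tokenizer.decode(new_tokens, skip_special_tokens=True))

    # Each rank writes its own shard; merge the files afterwards for scoring.
    with open(f"predictions.rank{rank}.jsonl", "w") as f:
        for pred in predictions:
            f.write(json.dumps({"prediction": pred}) + "\n")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()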

For query-document relationship understanding tasks (qdu-tasks)

  1. Download the test data and save it to data/. The directory structure should look like this:
qdu-tasks
├── cqa.sh
├── eval_rank.py
├── postprocess_cqa.py
├── run_eval.sh
└── data
    ├── cqadupstack
    │   ├── android
    │   │   └── test.pt.key.do-not-overwrite.json
    │   ├── english
    │   │   └── test.pt.key.do-not-overwrite.json
    │   └── ...
    ├── arguana.bm25.100.jsonl
    ├── climate_fever.bm25.100.jsonl
    └── ...
  2. For datasets other than cqadupstack, modify the paths in run_eval.sh, then run the script:
MODEL_PATH="your/model/path"
TOKENIZER_PATH="your/tokenizer/path"
RESULT_PATH="your/result/path"
EVAL_DATA_PATH="data"

bash run_eval.sh
  3. For the cqadupstack dataset, modify the paths in cqa.sh, then run the script:
MODEL_PATH="your/model/path"
TOKENIZER_PATH="your/tokenizer/path"
RESULT_PATH="your/result/path"

bash cqa.sh
  4. The scripts support pointwise, pairwise, and listwise reranking methods. To switch between them, modify the parameters passed to eval_rerank.py in run_eval.sh or cqa.sh (a sketch of the listwise sliding-window pattern follows this list):
# pointwise:  (default)
--rerank_method pointwise

# pairwise:
--rerank_method pairwise

# listwise:
--rerank_method listwise \
--listwise_window 5 \
--listwise_stride 5
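
To make the listwise options concrete, here is a sketch of how a sliding window with --listwise_window 5 and --listwise_stride 5 walks over a candidate list. It is not the repository's implementation; rerank_window stands in for whatever routine asks the LLM to order the documents inside one window.

# Sketch of sliding-window listwise reranking (illustration only).
from typing import Callable, List

def listwise_rerank(query: str,
                    docs: List[str],
                    rerank_window: Callable[[str, List[str]], List[str]],
                    window: int = 5,
                    stride: int = 5) -> List[str]:
    # Walk the candidate list from the bottom up, reranking window-sized
    # slices and moving stride positions at a time.
    docs = list(docs)
    start = max(len(docs) - window, 0)
    while True:
        end = start + window
        docs[start:end] = rerank_window(query, docs[start:end])
        if start == 0:
            break
        start = max(start - stride, 0)
    return docs

With the window equal to the stride (the defaults shown above), the candidate list is reranked in disjoint chunks of five; a smaller stride makes the windows overlap so that documents can move upward across chunk boundaries.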

Citation

Please kindly cite our paper if it helps your research:

@article{INTERS,
  author       = {Yutao Zhu and
                  Peitian Zhang and
                  Chenghao Zhang and
                  Yifei Chen and
                  Binyu Xie and
                  Zhicheng Dou and
                  Zheng Liu and
                  Ji{-}Rong Wen},
  title        = {{INTERS:} Unlocking the Power of Large Language Models in Search with
                  Instruction Tuning},
  journal      = {CoRR},
  volume       = {abs/2401.06532},
  year         = {2024},
  url          = {https://doi.org/10.48550/arXiv.2401.06532},
  doi          = {10.48550/ARXIV.2401.06532},
  eprinttype    = {arXiv},
  eprint       = {2401.06532}
}

