API-BLEND

A Comprehensive Corpora for Training and Benchmarking API LLMs.

Paper Link: https://arxiv.org/abs/2402.15491

API-BLEND is a collection of 10 datasets for training and systematically testing tool-augmented LLMs. The datasets mimic real-world scenarios involving API tasks such as API/tool detection, slot filling, and sequencing of the detected APIs. Of the 10 datasets, we curated 6 from existing datasets; the other 4 we use off-the-shelf for out-of-distribution (OOD) tests.
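To make the three API tasks concrete, here is a minimal sketch in Python. The utterance, API names, and slot names below are invented for illustration; they are not drawn from any of the ten datasets.

```python
# Hypothetical illustration of the three API tasks: detection, slot filling,
# and sequencing. All names below are made up for this sketch.
utterance = "Book a table for two in Rome and then get me a taxi there"

# 1) API detection: which tools/APIs does the utterance require?
detected_apis = ["BookRestaurant", "GetTaxi"]

# 2) Slot filling: which argument values go with each detected API?
slot_filled = {
    "BookRestaurant": {"party_size": "two", "city": "Rome"},
    "GetTaxi": {"destination": "restaurant"},
}

# 3) Sequencing: the order in which the detected APIs must be called.
sequence = ["BookRestaurant", "GetTaxi"]

for api in sequence:
    args = ", ".join(f"{k}={v}" for k, v in slot_filled[api].items())
    print(f"{api}[{args}]")
```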

Note: We are currently in the process of obtaining license clearance to release the curated datasets directly. For the time being, we have outlined the steps involved in curating them from the raw datasets.

Install Dependencies

conda create --name api-blend python=3.9
conda activate api-blend
pip install -r requirements.txt
python -m spacy download en_core_web_sm

For LLM-based data generation we have used the IBM Generative AI Python SDK. Please follow the instructions to generate a unique API key to access IBM Generative AI.

pip install --upgrade ibm-generative-ai==2.2.0
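The generation scripts read their credentials from the GENAI_API and GENAI_KEY environment variables (see the export lines further down). A small stdlib-only sanity check, so a long generation job fails fast if either variable is missing:

```python
import os


def check_genai_env() -> list[str]:
    """Return the names of any missing IBM Generative AI env vars.

    The generation scripts read GENAI_API and GENAI_KEY from the
    environment; run this before kicking off a long generation job.
    """
    return [name for name in ("GENAI_API", "GENAI_KEY") if not os.environ.get(name)]


if __name__ == "__main__":
    missing = check_genai_env()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("IBM Generative AI credentials found.")
```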

Datasets Curation

1. LLM-based Data Generation

  • Raw Data:

We have curated SeqSGD (from SGD) and SeqMultiWOZ (from MultiWOZ) using an LLM. Please download the raw data from the following links.

  • Generate Data:

Here is an example to generate SeqMultiWOZ, where data/raw/MultiWOZ_2.2 is the raw data directory. The same codebase works for SeqSGD.

export GENAI_API=<your API url>
export GENAI_KEY=<your API key>

python llm-based-generation/llm-data-gen.py \
	--data_dir data/raw/MultiWOZ_2.2 \
	--save_dir data/processed/SeqMultiWOZ \
	--dataset_name multiwoz \
	--model google/flan-t5-xxl

2. Grammar-based Data Generation

  • Raw Data:

Using grammar-based generation, we have generated 4 datasets. Please download the raw datasets from the following links.

  • Generate Data:
    • SeqATIS and SeqSNIPS: Please download the SNIPS and ATIS datasets from the above links. Run the script below to generate the SeqSNIPS and SeqATIS datasets, where data/raw/ contains the raw SNIPS and ATIS data.
    python grammar-based-generation/SeqSNIPS_SeqATIS-data-gen.py \
        --data_dir data/raw/ \
        --save_dir data/processed/
    
    • SeqTopV2: Please download TopV2 following the above link. Run the script below to generate the SeqTopV2 dataset, where data/raw/TOPv2_Dataset is the raw data.
    python grammar-based-generation/SeqTopV2-data-gen.py \
        --data_dir data/raw/TOPv2_Dataset \
        --save_dir data/processed/SeqTopV2
    
    • SeqToolQA: The original ToolQA dataset does not contain the APIs; it comes with only a question and a final answer. So we have used their data-generation code for each template to generate the intermediate APIs, which are of utmost importance to us. Here is an example, where the apis key is generated by us following their codebase.
{
	"qid": "easy-flight-0034",
	"input": "How long was the different between the CRS recorded departure time and actual departure time of the AA529 flight from LAX to MIA on 2022-05-03?",
	"apis": [
	    "LoadDB[flights]",
	    "FilterDB[Origin=LAX, Dest=MIA, Flight_Number_Marketing_Airline=529, IATA_Code_Marketing_Airline=AA]",
	    "GetValue[DepTime]",
	    "GetValue[CRSDepTime]",
	    "PythonInterpreter(abs(DepTime - CRSDepTime))"
	],
	"output": "45"
}
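The API strings above follow a simple name[...] (or, for PythonInterpreter, name(...)) pattern. A small stdlib sketch of our own (not part of the SeqToolQA codebase) for splitting such a string into an API name and its arguments:

```python
import re


def parse_api_call(call: str) -> tuple[str, list[str]]:
    """Split a SeqToolQA-style API string, e.g.
    'FilterDB[Origin=LAX, Dest=MIA]' -> ('FilterDB', ['Origin=LAX', 'Dest=MIA']).

    Handles both the bracketed form name[...] and the parenthesised
    PythonInterpreter(...) form seen in the example above. The comma split
    is deliberately naive; nested commas inside arguments are not handled.
    """
    match = re.match(r"^(\w+)[\[(](.*)[\])]$", call)
    if not match:
        raise ValueError(f"Unrecognised API call: {call!r}")
    name, body = match.groups()
    args = [part.strip() for part in body.split(",")] if body else []
    return name, args


name, args = parse_api_call("FilterDB[Origin=LAX, Dest=MIA]")
print(name, args)  # FilterDB ['Origin=LAX', 'Dest=MIA']
```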