API-BLEND

A Comprehensive Corpora for Training and Benchmarking API LLMs.

Paper Link: https://arxiv.org/abs/2402.15491

API-BLEND is a collection of 10 datasets for training and systematically testing tool-augmented LLMs. The datasets mimic real-world scenarios involving API tasks such as API/tool detection, slot filling, and sequencing of the detected APIs. Of the 10 datasets, we curated 6 from existing datasets and used the other 4 off-the-shelf (for OOD tests).

Note: We are currently obtaining license clearance to release the curated datasets directly. For the time being, we outline the steps for curating them from the raw datasets.

Install Dependencies

conda create --name api-blend python=3.9
conda activate api-blend
pip install -r requirements.txt
python -m spacy download en_core_web_sm
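
After installation, a quick sanity check can confirm that the environment is complete. The sketch below is not part of the repository: the helper name `missing_modules` is hypothetical, and it assumes the spaCy model downloaded above is installed as an importable package named `en_core_web_sm`.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed check list: spaCy itself and the model downloaded above.
    missing = missing_modules(["spacy", "en_core_web_sm"])
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Environment looks complete.")
```

Run it inside the activated `api-blend` environment; anything it reports as missing should be reinstalled before moving on to data generation.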

For LLM-based data generation, we used the IBM Generative AI Python SDK. Please follow the instructions to generate a unique API key for accessing IBM Generative AI.

pip install --upgrade ibm-generative-ai==2.2.0

Datasets Curation

1. LLM-based Data Generation

Raw Data:

We curated SeqSGD (from SGD) and SeqMultiWOZ (from MultiWOZ) using an LLM. Please download the raw data from the following links:
- SGD: Link
- MultiWOZ: Link

Generate Data:

Here is an example that generates SeqMultiWOZ, where data/raw/MultiWOZ_2.2 is the raw data directory. The same codebase works for SeqSGD.

export GENAI_API=<your API url>
export GENAI_KEY=<your API key>

python llm-based-generation/llm-data-gen.py \
	--data_dir data/raw/MultiWOZ_2.2 \
	--save_dir data/processed/SeqMultiWOZ \
	--dataset_name multiwoz \
	--model google/flan-t5-xxl
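
Because the command above exports `GENAI_API` and `GENAI_KEY` first, it can help to verify both variables before starting a long generation run. This is a minimal sketch, not code from the repository: the helper name `missing_env_vars` is hypothetical, and it assumes the generation script reads these two variables from the environment.

```python
import os


def missing_env_vars(required=("GENAI_API", "GENAI_KEY"), env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    # Fall back to the real process environment when no mapping is given.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Please export: " + ", ".join(missing))
    print("Credentials are set.")
```

Passing `env` explicitly keeps the check easy to test without touching the real environment.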

2. Grammar-based Data Generation

Raw Data:

Using grammar-based generation, we generated 4 datasets. Please download the raw datasets from the following links:
- ATIS: Link
- SNIPS: Link
- TopV2: Link
- ToolQA: Link

Generate Data:
- SeqATIS and SeqSNIPS: Download the SNIPS and ATIS datasets from the links above. Run the script below to generate the SeqSNIPS and SeqATIS datasets, where data/raw/ contains the raw SNIPS and ATIS datasets.
```
python grammar-based-generation/SeqSNIPS_SeqATIS-data-gen.py \
    --data_dir data/raw/ \
    --save_dir data/processed/
```
- SeqTopV2: Download the TopV2 dataset from the link above. Run the script below to generate the SeqTopV2 dataset, where data/raw/TOPv2_Dataset is the raw data directory.
```
python grammar-based-generation/SeqTopV2-data-gen.py \
    --data_dir data/raw/TOPv2_Dataset \
    --save_dir data/processed/SeqTopV2
```
- SeqToolQA: The original ToolQA dataset does not contain the APIs; it only comes with a question and a final answer. We therefore used their data-generation code for each template to generate the intermediate APIs, which are essential for our tasks. Here is an example, where the apis key was generated by us following their codebase.

{
	"qid": "easy-flight-0034",
	"input": "How long was the different between the CRS recorded departure time and actual departure time of the AA529 flight from LAX to MIA on 2022-05-03?",
	"apis": [
	    "LoadDB[flights]",
	    "FilterDB[Origin=LAX, Dest=MIA, Flight_Number_Marketing_Airline=529, IATA_Code_Marketing_Airline=AA]",
	    "GetValue[DepTime]",
	    "GetValue[CRSDepTime]",
	    "PythonInterpreter(abs(DepTime - CRSDepTime))"
	],
	"output": "45"
}
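
To work with the `apis` field, each string can be split into an API name (detection), its slots (slot filling), and its position in the list (sequencing). The sketch below is illustrative only and is not taken from the repository: `parse_call` and the `"_arg"` key are hypothetical, and the pattern is inferred solely from the example above.

```python
import re

# Matches "Name[args]" or "Name(args)" as in the example above.
# It does not check that the opening and closing brackets pair up.
_CALL = re.compile(r"^(?P<name>\w+)[\[(](?P<args>.*)[\])]$")


def parse_call(call):
    """Split one API string into (name, slots).

    Arguments written as key=value pairs become dict entries; any other
    argument is kept verbatim under the hypothetical key "_arg".
    """
    match = _CALL.match(call.strip())
    if match is None:
        raise ValueError(f"Unrecognized API string: {call!r}")
    name, args = match.group("name"), match.group("args")
    parts = [part.strip() for part in args.split(",")]
    if all("=" in part for part in parts):
        slots = dict(part.split("=", 1) for part in parts)
    else:
        slots = {"_arg": args}
    return name, slots


apis = [
    "LoadDB[flights]",
    "FilterDB[Origin=LAX, Dest=MIA, Flight_Number_Marketing_Airline=529, IATA_Code_Marketing_Airline=AA]",
    "GetValue[DepTime]",
    "GetValue[CRSDepTime]",
    "PythonInterpreter(abs(DepTime - CRSDepTime))",
]

# The list order is the API sequence; names and slots come from each call.
sequence = [parse_call(call) for call in apis]
names = [name for name, _ in sequence]
```

Here `names` holds the API sequence in order, and `sequence[1][1]` holds the four `FilterDB` slots as a dictionary. Expressions that contain `=` or commas inside a single argument would need a stricter parser than this one.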
