A Comprehensive Corpora for Training and Benchmarking API LLMs.
Paper Link: https://arxiv.org/abs/2402.15491
API-BLEND is a collection of 10 datasets for training and systematically testing tool-augmented LLMs. The datasets mimic real-world scenarios involving API tasks such as API/tool detection, slot filling, and sequencing of the detected APIs. We curated 6 of the 10 datasets from existing datasets and use the other 4 off the shelf (for OOD tests).
Note: We are currently obtaining license clearance to release the curated datasets directly. For the time being, we outline the steps needed to curate them from the raw datasets.
conda create --name api-blend python=3.9
conda activate api-blend
pip install -r requirements.txt
python -m spacy download en_core_web_sm
For LLM-based data generation, we used the IBM Generative AI Python SDK. Please follow the instructions to generate a unique API key for accessing IBM Generative AI.
pip install --upgrade ibm-generative-ai==2.2.0
We curated SeqSGD (from SGD) and SeqMultiWOZ (from MultiWOZ) using an LLM. Please download the raw data from the following links.
-
Here is an example that generates SeqMultiWOZ, where `data/raw/MultiWOZ_2.2` is the raw data directory. The same codebase works for SeqSGD.
export GENAI_API=<your API url>
export GENAI_KEY=<your API key>
python llm-based-generation/llm-data-gen.py \
--data_dir data/raw/MultiWOZ_2.2 \
--save_dir data/processed/SeqMultiWOZ \
--dataset_name multiwoz \
--model google/flan-t5-xxl
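Before launching the generation script, it can help to confirm that the two environment variables and the raw data directory are in place. The following is a minimal pre-flight sketch and is not part of this repository; the helper name `check_setup` is hypothetical, and it relies only on the variable names and the path shown in the command above.

```python
import os
from pathlib import Path


def check_setup(data_dir, env=None):
    """Return a list of human-readable problems; an empty list means ready.

    `check_setup` is a hypothetical helper, not a function from the repo.
    """
    env = os.environ if env is None else env
    problems = []
    # GENAI_API and GENAI_KEY are the variables exported before the command.
    for name in ("GENAI_API", "GENAI_KEY"):
        if not env.get(name):
            problems.append(f"environment variable {name} is not set")
    # The raw data directory must exist before generation can start.
    if not Path(data_dir).is_dir():
        problems.append(f"raw data directory not found: {data_dir}")
    return problems


if __name__ == "__main__":
    for problem in check_setup("data/raw/MultiWOZ_2.2"):
        print("WARNING:", problem)
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.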
Using grammar-based generation, we generated 4 datasets. Please download the raw datasets from the following links.
-
- SeqATIS and SeqSNIPS:
Please download the SNIPS and ATIS datasets from the above link. Run the script below to generate the SeqSNIPS and SeqATIS datasets, where `data/raw/` contains the raw `SNIPS` and `ATIS` datasets.
python grammar-based-generation/SeqSNIPS_SeqATIS-data-gen.py \
--data_dir data/raw/ \
--save_dir data/processed/
- SeqTopV2:
Please download the TopV2 dataset from the above link. Run the script below to generate the SeqTopV2 dataset, where `data/raw/TOPv2_Dataset` is the raw data.
python grammar-based-generation/SeqTopV2-data-gen.py \
--data_dir data/raw/TOPv2_Dataset \
--save_dir data/processed/SeqTopV2
- SeqToolQA: The original ToolQA dataset does not contain the APIs; it only comes with a question and a final answer. We therefore used their data-generation code for each template to generate the intermediate APIs, which are essential for our purposes. Here is an example, where the `apis` key is generated by us following their codebase.
{
  "qid": "easy-flight-0034",
  "input": "How long was the different between the CRS recorded departure time and actual departure time of the AA529 flight from LAX to MIA on 2022-05-03?",
  "apis": [
    "LoadDB[flights]",
    "FilterDB[Origin=LAX, Dest=MIA, Flight_Number_Marketing_Airline=529, IATA_Code_Marketing_Airline=AA]",
    "GetValue[DepTime]",
    "GetValue[CRSDepTime]",
    "PythonInterpreter(abs(DepTime - CRSDepTime))"
  ],
  "output": "45"
}
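To make the structure of the `apis` list concrete, here is a minimal parsing sketch. It assumes that every entry follows one of the two shapes visible in the example above, `Name[args]` or `Name(args)`, and that slots are written as comma-separated `key=value` pairs; other SeqToolQA entries may not fit this pattern. The helpers `parse_api_call` and `parse_slots` are hypothetical names and are not part of the ToolQA or API-BLEND codebases.

```python
import re

# Matches "Name[args]" or "Name(args)", the two call styles in the example.
# This pattern is an assumption drawn from that single example.
CALL_PATTERN = re.compile(r"^(\w+)(?:\[(.*)\]|\((.*)\))$")


def parse_api_call(call):
    """Split one entry of the `apis` list into (api_name, argument_string)."""
    match = CALL_PATTERN.match(call.strip())
    if match is None:
        raise ValueError(f"unrecognized API call: {call!r}")
    name, bracket_args, paren_args = match.groups()
    return name, bracket_args if bracket_args is not None else paren_args


def parse_slots(argument_string):
    """Turn "k1=v1, k2=v2" into a dict; parts without "=" are skipped."""
    slots = {}
    for part in argument_string.split(","):
        if "=" in part:
            key, value = part.split("=", 1)
            slots[key.strip()] = value.strip()
    return slots


apis = [
    "LoadDB[flights]",
    "FilterDB[Origin=LAX, Dest=MIA, Flight_Number_Marketing_Airline=529, IATA_Code_Marketing_Airline=AA]",
    "GetValue[DepTime]",
    "GetValue[CRSDepTime]",
    "PythonInterpreter(abs(DepTime - CRSDepTime))",
]

parsed = [parse_api_call(call) for call in apis]
api_sequence = [name for name, _ in parsed]  # ordered API names
filter_slots = parse_slots(parsed[1][1])     # slot values of the FilterDB call
print(api_sequence)
print(filter_slots)
```

Read this way, the ordered names in `api_sequence` roughly correspond to the detection and sequencing tasks, and `filter_slots` to slot filling.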