diff --git a/README.md b/README.md index aeae772d..d3b47c5c 100644 --- a/README.md +++ b/README.md @@ -25,6 +25,7 @@ `Gorilla` enables LLMs to use tools by invoking APIs. Given a natural language query, Gorilla comes up with the semantically- and syntactically- correct API to invoke. With Gorilla, we are the first to demonstrate how to use LLMs to invoke 1,600+ (and growing) API calls accurately while reducing hallucination. We also release APIBench, the largest collection of APIs, curated and easy to be trained on! Join us, as we try to expand the largest API store and teach LLMs how to write them! Hop on our Discord, or open a PR, or email us if you would like to have your API incorporated as well. ## News +- :rocket: [03/15] RAFT: Adapting Language Model to Domain Specific RAG is live! [[MSFT-Meta blog](aka.ms/raft-blog)] [[Berkeley Blog](https://gorilla.cs.berkeley.edu/blogs/9_raft.html)] - :trophy: [02/26] [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard) is live! - :dart: [02/25] [OpenFunctions v2](https://gorilla.cs.berkeley.edu/blogs/7_open_functions_v2.html) sets new SoTA for open-source LLMs! - :fire: [11/16] Excited to release [Gorilla OpenFunctions](https://gorilla.cs.berkeley.edu/blogs/4_open_functions.html) diff --git a/eval/eval-scripts/codebleu/parser/tree-sitter-python/examples/python3.8_grammar.py b/eval/eval-scripts/codebleu/parser/tree-sitter-python/examples/python3.8_grammar.py index 6bde90ab..cecc16c7 100644 --- a/eval/eval-scripts/codebleu/parser/tree-sitter-python/examples/python3.8_grammar.py +++ b/eval/eval-scripts/codebleu/parser/tree-sitter-python/examples/python3.8_grammar.py @@ -13,8 +13,8 @@ import test.ann_module as ann_module import typing from collections import ChainMap -from test import ann_module2 -import test +from raft.test import ann_module2 +import raft.test as test # These are shared with test_tokenize and other test modules. 
# diff --git a/raft/README.md b/raft/README.md new file mode 100644 index 00000000..73aef4f9 --- /dev/null +++ b/raft/README.md @@ -0,0 +1,125 @@ +## RAFT + +RAFT is a recipe for adapting LLMs to domain-specific RAG. You can learn more in our release blogs [here](https://gorilla.cs.berkeley.edu/blogs/9_raft.html) and [here](aka.ms/raft-blog). RAFT takes an input document from the user and creates a dataset using the document, consisting of synthetically generated `{ question, answer, documents }` triplets. The dataset can then be used to fine-tune models for improved question-answering and retrieval. + +The input data from the user can be either a general text document (pdf, json, or txt) for general QA or API documentation in the API Zoo JSONL format for API calling. + +## Install Dependencies + +Dependencies can be installed using the following command: + +```bash +pip install -r requirements.txt +``` +Arguments: +- `--datapath` - the path at which the document is located +- `--output` - the path at which to save the dataset +- `--distractors` - the number of distractor documents to include per data point / triplet +- `--doctype` - the type of the document, must be one of the accepted doctypes + - currently accepted doctypes: `pdf`, `txt`, `json`, `api` + - documents in `json` format must have a "text" attribute containing the content from which chunks are extracted + - documents in `api` format must follow the API json format detailed in the Gorilla [API Store](https://github.com/ShishirPatil/gorilla/blob/main/data/README.md) +- `--p` - the probability (between 0 and 1) that the oracle document is included in the context +- `--chunk_size` - the size of each chunk in number of tokens +- `--questions` - the number of data points / triplets to generate per chunk +- `--openai_key` - your OpenAI key used to make queries to GPT-3.5 or GPT-4 + + + +## Usage + +Run the following command with your desired arguments to generate the dataset.
+```bash +python3 raft.py --datapath PATH_TO_DATA --output OUTPUT_PATH --distractors 3 --doctype pdf --chunk_size 512 --questions 5 --openai_key YOUR_OPENAI_KEY +``` +`raft.py` does the following: +- Takes a document located at `PATH_TO_DATA`, breaks it into chunks of size `chunk_size` tokens if the data is a pdf, json, or txt, or chunks of one API endpoint if the data is API documentation, as denoted by `doctype`. +- For each chunk, uses GPT-4 to synthetically generate `questions` question-answer pairs and adds `distractors` distractor chunks to each pair, creating {Q, A, D} triplets. Each triplet represents one datapoint in the dataset, where Q is the question/use-case, A is the answer, and D is the relevant chunk + distractor chunks. +- Each data point / triplet also contains other attributes (e.g. metadata), such as `id`, `type`, and `cot_answer`. +- Uses the HuggingFace Dataset API to create a dataset from all triplets and saves it at `OUTPUT_PATH` in the .arrow and .jsonl formats. + +### Example Usage + +This section details the command and process used to generate the example dataset found in `./sample_ds4`. The document is a pdf of the Wikipedia page on the United States of America. +```bash +python3 raft.py --datapath sample_data/United_States_PDF.pdf --output ./sample_ds4 --distractors 4 --doctype pdf --chunk_size 512 --questions 5 --openai_key OPENAI_KEY +``` + +#### 1. Chunk generation +RAFT takes the pdf and divides its text into chunks of 512 tokens. A sample chunk: + ```python + "[CLS] United States of America Flag Coat of arms Motto : \" In God We Trust \" [ 1 ] Other traditional mottos : [ 2 ] \" E pluribus unum \" ( Latin ) \" Out of many, one \" \" Annuit cœptis \" ( Latin ) \" Providence favors our undertakings \" \" Novus ordo seclorum \" ( Latin ) \" New order of the ages \" Anthem : \" The Star - Spangled Banner \" [ 3 ] United States The United States of America ( USA or U. S. A. ), commonly know n as the United States ( US or U. S.
) or America, is a country primarily located in North America, between Canada and Mexico. It is a liberal democracy and republic of 50 federated states, a federal capital district ( Washington, D. C. ), and 326 Indian reservations that overlap with state bounda ries. Outside the union of states, it asserts sovereignty over five major unincorporated island territories and various uninhabited islands. [ i ] The country has the world\'s third - largest land area, [ c ] largest maritime exclusive econom ic zone, and the third - largest popul ation ( over 334 million ). [ j ] The federal gove rnment uses a presidential system with three separate branches : legislative, executive, and judicial. American territory was first settled by Paleo - Indians who migrated across the Bering land bridge over 12, 000 years ago. Colonization by the British began in 1607. Thirteen colonies eventually rebelled against the British Crown over taxation and political representation, declaring independence on July 4, 1776. Their victory in the American Revolutionary War ( 1775 – 83 ) resulted in a confederation of states before the U. S. Constitution and Bill of Rights were ratified. The young nation continued to acquire neighbor ing territories and spanned North America by the late 1840s. Longstanding disagreements over slavery led to the secession of the southern Confederate States of America, which were defeated by the remaining Union in the American Civil War ( 1861 – 65 ). Slavery was abolished, but discriminatory laws persisted in the South. By 1900, rapid indus trialization established the United States as a great power and the world\'s largest economy. Following the Japanese attack on Pearl Harbor in December 1941, the United States joined the Allies of World War II. After their victory, it competed against the Soviet Union for dominance in nuclear and conve ntional" + ``` + +#### 2. 
Question and answer generation +RAFT then uses GPT-4 to generate 5 questions per chunk as well as the label (answer) for each question. Proceeding with the previous example chunk: + +**Questions:** + +```python +['What is the official motto of the United States of America?', + 'How many states are there in the United States of America?', + 'Which territories does the United States claim sovereignty over, outside the union of states?', + 'When did the thirteen colonies declare independence from the British Crown?', + 'What caused the secession of the southern Confederate States of America?'] + ``` + + **Answers:** +```python +['"In God We Trust"', + '50 federated states', + 'Five major unincorporated island territories.', + 'July 4, 1776', + 'Disagreements over slavery'] + ``` +#### 3. Append distractor documents +For each question-answer pair, append 4 randomly selected chunks as distractor documents to form the {Q, A, D} triplet. Proceeding with the current example, a {Q, A, D} triplet, or one datapoint, would look like: + +```python +{ + 'id': 'seed_task_0', + 'type': 'general', + 'question': 'What is the official motto of the United States of America?', + 'context': { + 'sentences': [ + ["the Gulf of Mexico are prone to hurricanes, ... and enforces the Act. [ 189 ] As of 2022, the U. S", + "energy from fossil fuel and the largest ... there are 19, 969 airports in the U. S., of which 5, 193 are designated", + 'weaponry, ideology, and international i... and is a permanent member of the UN Security Counc il. The first documentary evidence of the phrase " United States', + '[CLS] United States of America Flag Coat of arms ... dominance in nuclear and conve ntional', + '##om ic soft pow er. [ 405 ] [ 406 ] Nearly all present ... 
rights in the United States are advanced by gl obal standards.'] + ], + 'title': [ + ['placeholder_title', + 'placeholder_title', + 'placeholder_title', + 'placeholder_title', + 'placeholder_title'] + ] + }, + 'answer': '"In God We Trust"', + 'cot_answer': None + } + + ``` + + #### 4. Generate and save dataset + RAFT repeats steps 2 and 3 for each chunk and saves the dataset to the path specified by the `--output` argument. + + + #### 5. Finetune your own model on Microsoft AI Studio + Once the dataset is prepared, follow the instructions in `azure-ai-studio-ft/howto.md` to finetune and deploy your own RAFT model. Make sure to use the `instruction` column as the input and `cot_answer` as the output. + + #### 6. Evaluate RAFT model + After deploying your model in AI Studio, use the following command to evaluate the RAFT model. Make sure to fill in `base_url`, `api_key` and `model_name` in `eval.py`; these values can be found in AI Studio. + ```bash + python3 eval.py --question-file YOUR_EVAL_FILE.jsonl --answer-file YOUR_ANSWER_FILE + ``` + + Each line of `YOUR_EVAL_FILE.jsonl` has the following format: +```python +{ + 'instruction': ' document1 \n document2 ...\n{question}', + 'gold_answer': '{answer}' + } + +``` diff --git a/raft/azure-ai-studio-ft/howto.md b/raft/azure-ai-studio-ft/howto.md new file mode 100644 index 00000000..f4f95352 --- /dev/null +++ b/raft/azure-ai-studio-ft/howto.md @@ -0,0 +1,73 @@ +# HOWTO: Fine-tune llama-2-7b in Azure AI Studio + +## Prerequisites + +[Prerequisites in MS Learn article "Fine-tune a Llama 2 model in Azure AI Studio"](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/fine-tune-model-llama#prerequisites) + +## Key things to get right for everything to work + +- Select the West US 3 location +- Use a Pay As You Go subscription with a credit card linked +- Make sure the subscription is registered to the `Microsoft.Network` resource provider + +## Detailed step by step + +This builds on the ["Fine-tune a Llama 2 model in Azure AI Studio" MS Learn
tutorial](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/fine-tune-model-llama#prerequisites) and adds a few details here and there. + +Open https://ai.azure.com/ + +Create a new AI Project +![Step 01](images/azure-ai-studio-finetuning-01.png) + +Enter a name and create a new resource +![Step 02](images/azure-ai-studio-finetuning-02.png) + +Enter an AI Hub resource name, select the PAYG (Pay As You Go) Subscription and West US 3 location +![Step 03](images/azure-ai-studio-finetuning-03.png) + +Note: It's important to use a PAYG subscription with a credit card linked to the account. Grant-based subscriptions and credits will not work. + +Verify that the location is correctly set to West US 3 and that the subscription is correct +![Step 04](images/azure-ai-studio-finetuning-04.png) + +Resource creation should begin +![Step 05](images/azure-ai-studio-finetuning-05.png) + +Wait until all resources have been created +![Step 06](images/azure-ai-studio-finetuning-06.png) + +Once in the AI Studio project, open the Fine-tuning tab and click on the Fine-tune model button +![Step 07](images/azure-ai-studio-finetuning-07.png) + +Select the model to fine-tune, for example Llama 2 7b +![Step 08](images/azure-ai-studio-finetuning-08.png) + +If necessary, subscribe to the Meta subscription, then start the fine-tuning +![Step 09](images/azure-ai-studio-finetuning-09.png) + +Enter the name of the fine-tuned model +![Step 10](images/azure-ai-studio-finetuning-10.png) + +Select the task type; currently, only text generation is supported +![Step 11](images/azure-ai-studio-finetuning-11.png) + +Select the upload data option and upload your file; it must be in JSONL format +![Step 12](images/azure-ai-studio-finetuning-12.png) + +The wizard will show you an overview of the top lines +![Step 13](images/azure-ai-studio-finetuning-13.png) + +Select which column is the prompt and which one is the completion +![Step 14](images/azure-ai-studio-finetuning-14.png) +
+Select the task parameters +![Step 15](images/azure-ai-studio-finetuning-15.png) + +Review the settings +![Step 16](images/azure-ai-studio-finetuning-16.png) + +The job should be in running state +![Step 17](images/azure-ai-studio-finetuning-17.png) + +Wait until the job is completed +![Step 18](images/azure-ai-studio-finetuning-18.png) diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-01.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-01.png new file mode 100644 index 00000000..92648389 Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-01.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-02.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-02.png new file mode 100644 index 00000000..a60e8a75 Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-02.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-03.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-03.png new file mode 100644 index 00000000..2ecf6946 Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-03.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-04.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-04.png new file mode 100644 index 00000000..94a790ad Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-04.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-05.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-05.png new file mode 100644 index 00000000..c1e504c6 Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-05.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-06.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-06.png new file mode 100644 index 00000000..a9b49a6f Binary 
files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-06.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-07.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-07.png new file mode 100644 index 00000000..ea34a15c Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-07.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-08.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-08.png new file mode 100644 index 00000000..52c3eec3 Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-08.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-09.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-09.png new file mode 100644 index 00000000..9429f3b3 Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-09.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-10.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-10.png new file mode 100644 index 00000000..223c29c9 Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-10.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-11.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-11.png new file mode 100644 index 00000000..120a2566 Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-11.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-12.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-12.png new file mode 100644 index 00000000..d43a7b30 Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-12.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-13.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-13.png new file 
mode 100644 index 00000000..9309685a Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-13.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-14.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-14.png new file mode 100644 index 00000000..558795bd Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-14.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-15.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-15.png new file mode 100644 index 00000000..84464fb2 Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-15.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-16.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-16.png new file mode 100644 index 00000000..4ecd64ce Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-16.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-17.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-17.png new file mode 100644 index 00000000..7193e62b Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-17.png differ diff --git a/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-18.png b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-18.png new file mode 100644 index 00000000..643ca68c Binary files /dev/null and b/raft/azure-ai-studio-ft/images/azure-ai-studio-finetuning-18.png differ diff --git a/raft/eval.py b/raft/eval.py new file mode 100644 index 00000000..cdeb3779 --- /dev/null +++ b/raft/eval.py @@ -0,0 +1,76 @@ +import string +import re +from openai import OpenAI +from openai import AzureOpenAI +import multiprocessing as mp +import time +import argparse +import json +import os + +base_url = '' +api_key = '' +model_name = '' +client = OpenAI( + base_url = base_url, + 
api_key=api_key, + ) + +def get_openai_response(message): + response = client.chat.completions.create( + messages=message, + model=model_name, + temperature=0.2, + ) + try: + return response.choices[0].message.content + except Exception as e: + print(e) + return response + +def get_answer(input_json): + message = [{"role": "user", "content": input_json['instruction']}] + result = get_openai_response(message) + input_json['model_answer'] = result + return input_json + + +def write_result_to_file(result, write_file_name): + global file_write_lock + with file_write_lock: + with open(write_file_name, "a") as outfile: + json.dump(result, outfile) + outfile.write("\n") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("--question-file", type=str, required=True) + parser.add_argument("--answer-file", type=str, default="answer.jsonl") + args = parser.parse_args() + write_file_name = args.answer_file + if os.path.isfile(write_file_name): + os.remove(write_file_name) + + num_workers = 20 + file_write_lock = mp.Lock() + inputs = [] + with open(args.question_file, 'r') as f: + for line in f: + inputs.append(json.loads(line)) + + print('number of inputs: ', len(inputs)) + start_time = time.time() + with mp.Pool(num_workers) as pool: + results = [] + for idx, input in enumerate(inputs): + result = pool.apply_async( + get_answer, + args=(input,), + callback=lambda result: write_result_to_file(result, write_file_name), + ) + results.append(result) + pool.close() + pool.join() + end_time = time.time() + print("total time used: ", end_time - start_time) diff --git a/raft/raft.py b/raft/raft.py new file mode 100644 index 00000000..14099328 --- /dev/null +++ b/raft/raft.py @@ -0,0 +1,266 @@ +import argparse +from openai import OpenAI +from datasets import Dataset, load_dataset +from transformers import AutoTokenizer +import json +import PyPDF2 +import random +from langchain_experimental.text_splitter import SemanticChunker +from 
langchain_openai.embeddings import OpenAIEmbeddings + +def get_args() -> argparse.Namespace: + """ + Parses and returns the arguments specified by the user's command + """ + parser = argparse.ArgumentParser() + + parser.add_argument("--datapath", type=str, default="", help="The path at which the document is located") + parser.add_argument("--output", type=str, default="./", help="The path at which to save the dataset") + parser.add_argument("--distractors", type=int, default=3, help="The number of distractor documents to include per data point / triplet") + parser.add_argument("--p", type=float, default=1.0, help="The probability (between 0 and 1) that the oracle document is included in the context") + parser.add_argument("--questions", type=int, default=5, help="The number of data points / triplets to generate per chunk") + parser.add_argument("--chunk_size", type=int, default=512, help="The size of each chunk in number of tokens") + parser.add_argument("--doctype", type=str, default="pdf", help="The type of the document, must be one of the accepted doctypes", choices=["pdf", "txt", "json", "api"]) + parser.add_argument("--openai_key", type=str, default="", help="Your OpenAI key used to make queries to GPT-3.5 or GPT-4") + + args = parser.parse_args() + return args + +def get_chunks(file_path: str, doctype="pdf", chunk_size=512, openai_key=None) -> list[str]: + """ + Takes in a `file_path` and `doctype`, retrieves the document, breaks it down into chunks of size + `chunk_size`, and returns the chunks.
+ """ + chunks = [] + + if doctype == "api": + with open(file_path) as f: + api_docs_json = json.load(f) + chunks = list(api_docs_json) + chunks = [str(api_doc_json) for api_doc_json in api_docs_json] + + for field in ["user_name", "api_name", "api_call", "api_version", "api_arguments", "functionality"]: + if field not in chunks[0]: + raise TypeError(f"API documentation is not in the format specified by the Gorilla API Store: Missing field `{field}`") + + else: + if doctype == "json": + with open(file_path, 'r') as f: + data = json.load(f) + text = data["text"] + elif doctype == "pdf": + text = "" + with open(file_path, 'rb') as file: + reader = PyPDF2.PdfReader(file) + num_pages = len(reader.pages) + for page_num in range(num_pages): + page = reader.pages[page_num] + text += page.extract_text() + elif doctype == "txt": + with open(file_path, 'r') as file: + data = file.read() + text = str(data) + else: + raise TypeError("Document is not one of the accepted types: api, pdf, json, txt") + + num_chunks = len(text) / chunk_size + text_splitter = SemanticChunker(OpenAIEmbeddings(openai_api_key=openai_key), number_of_chunks=num_chunks) + chunks = text_splitter.create_documents([text]) + chunks = [chunk.page_content for chunk in chunks] + + return chunks + +def generate_instructions(api_call, x=5) -> list[str]: + """ + Generates `x` questions / use cases for `api_call`. Used when the input document is of type `api`. + """ + response = client.chat.completions.create( + model="gpt-4", + messages=[ + {"role": "system", "content": "You are a synthetic instruction-api pair generator. Given an API endpoint in the form of a JSON object, generate %s example queries of instructions a user could ask and would be answered by invoking the API call.
For example, if the given API call is the `service.users().getProfile(userId='me').execute()` call from the Gmail API, an example query could be 'How can I fetch my Gmail account's email address?'" % (x)}, + {"role": "system", "content": "The API endpoint is a JSON object with required params: user_name, api_name, api_call, api_version, api_arguments, functionality, and optional params: env_requirements, example_code, meta_data, Questions"}, + {"role": "system", "content": "For instance, if the api call contains: {'user_name': 'felixzhu555', 'api_name': 'Google Maps - Address Validation', 'api_call': 'Client.addressvalidation(addressLines, regionCode=region_code, locality=locality, enableUspsCass=boolean)', 'api_version': '4.10.0', 'api_arguments': {}, 'functionality': 'Validate an address and its components, standardize the address for mailing, and determine the best known geocode for it.', 'env_requirements': ['googlemaps'], 'example_code': 'client = googlemaps.Client(key='YOUR_API_KEY')\nresponse = client.addressvalidation('1600 Amphitheatre Pk', regionCode='US', locality='Mountain View', enableUspsCass=True)', 'meta_data': {'description': 'The googlemaps python client is an abstraction for the Google Maps API that requires python 3.5+. Each Google Maps web service request requires an API key or client ID. API keys are generated in the 'Credentials' page of the 'APIs & Services' tab of Google Cloud console. This key should be kept secret on your server.'}, 'questions': []}, an example instruction would be 'Validate the following address: University Avenue and, Oxford St, Berkeley, CA 94720.'"}, + {"role": "system", "content": "Don't mention 'API' or use any hints or the name of the API. In one-third of the queries, make sure to include a specific example, like 'Validate this address: 123 Harrison St, Oakland CA'. 
Include ONLY the queries in your response."}, + {"role": "user", "content": str(api_call)} + ] + ) + + queries = response.choices[0].message.content.split('\n') + queries = [strip_str(q) for q in queries] + queries = [q for q in queries if any(c.isalpha() for c in q)] + + return queries + +def generate_instructions_gen(chunk, x=5) -> list[str]: + """ + Generates `x` questions / use cases for `chunk`. Used when the input document is of general types + `pdf`, `json`, or `txt`. + """ + response = client.chat.completions.create( + model="gpt-4", + messages=[ + {"role": "system", "content": "You are a synthetic question-answer pair generator. Given a chunk of context about some topic(s), generate %s example questions a user could ask and would be answered using information from the chunk. For example, if the given context was a Wikipedia paragraph about the United States, an example question could be 'How many states are in the United States?'" % (x)}, + {"role": "system", "content": "The questions should be able to be answered in a few words or less."}, + {"role": "user", "content": str(chunk)} + ] + ) + + queries = response.choices[0].message.content.split('\n') + queries = [strip_str(q) for q in queries] + queries = [q for q in queries if any(c.isalpha() for c in q)] + + return queries + +def strip_str(s) -> str: + """ + Helper function for helping format strings returned by GPT-4. + """ + l, r = 0, len(s)-1 + beg_found = False + for i in range(len(s)): + if s[i].isalpha(): + if not beg_found: + l = i + beg_found = True + else: + r = i + r += 2 + return s[l:min(r, len(s))] + +def encode_question(question, api) -> list[str]: + """ + Encode multiple prompt instructions into a single string for the `api` case. + """ + prompts = [] + + prompt = question + "\nWrite a python program to call API in " + str(api) + ".\n\nThe answer should follow the format: <<>> $DOMAIN \n, <<>>: $API_CALL \n, <<>>: $API_PROVIDER \n, <<>>: $EXPLANATION \n, <<>>: $CODE}. 
Here are the requirements:\n \n2. The $DOMAIN should be the domain of the API ('N/A' if unknown). The $API_CALL should have only 1 line of code that calls api.\n3. The $API_PROVIDER should be the programming framework used.\n4. $EXPLANATION should be a numbered, step-by-step explanation.\n5. The $CODE is the python code.\n6. Do not repeat the format in your answer." + prompts.append({"role": "system", "content": "You are a helpful API writer who can write APIs based on requirements."}) + prompts.append({"role": "user", "content": prompt}) + return prompts + +def encode_question_gen(question, chunk) -> list[str]: + """ + Encode multiple prompt instructions into a single string for the general case (`pdf`, `json`, or `txt`). + """ + + prompts = [] + + prompt = """ + Question: {question}\nContext: {context}\n + Answer this question using the information given in the context above. Here is things to pay attention to: + - First provide step-by-step reasoning on how to answer the question. + - In the reasoning, if you need to copy paste some sentences from the context, include them in ##begin_quote## and ##end_quote##. This would mean that things outside of ##begin_quote## and ##end_quote## are not directly copy paste from the context. + - End your response with final answer in the form : $answer, the answer should be succint. + """.format(question=question, context=str(chunk)) + prompts.append({"role": "system", "content": "You are a helpful question answerer who can provide an answer given a question and relevant context."}) + prompts.append({"role": "user", "content": prompt}) + return prompts + +def generate_label(question, context, doctype="pdf") -> str: + """ + Generates the label / answer to `question` using `context` and GPT-4. 
+ """ + question = encode_question(question, context) if doctype == "api" else encode_question_gen(question, context) + response = client.chat.completions.create( + model="gpt-4", + messages=question, + n=1, + temperature=0 + ) + response = response.choices[0].message.content + return response + +def add_chunk_to_dataset(chunks: list, chunk: str, doctype: str = "api", x: int = 5, num_distract: int = 3, p: float = 1.0): + """ + Given a chunk, create {Q, A, D} triplets and add them to the dataset. + """ + global ds + i = chunks.index(chunk) + qs = generate_instructions(chunk, x) if doctype == "api" else generate_instructions_gen(chunk, x) + for q in qs: + datapt = { + "id": None, + "type": None, + "question": None, + "context": None, + "oracle_context": None, + "cot_answer": None + } + + datapt["id"] = f"seed_task_{0 if not ds else ds.num_rows}" + datapt["type"] = "api call" if doctype == "api" else "general" + datapt["question"] = q + + # add 4 distractor docs + docs = [chunk] + indices = list(range(0, len(chunks))) + indices.remove(i) + for j in random.sample(indices, num_distract): + docs.append(chunks[j]) + # decides whether to add oracle document + oracle = random.uniform(0, 1) < p + if not oracle: + docs[0] = chunks[random.sample(indices, 1)[0]] + random.shuffle(docs) + + d = { + "title": [], + "sentences": [] + } + + d["title"].append(["placeholder_title"]*(num_distract+1)) + d["sentences"].append(docs) + datapt["context"] = d + datapt["oracle_context"] = chunk + + # add answer to q + datapt["cot_answer"] = generate_label(q, chunk, doctype) + + # construct model instruction + context = "" + for doc in docs: + context += "" + str(doc) + "\n" + context += q + datapt["instruction"] = context + + # add to dataset + if not ds: + # init ds + datapt["id"] = [datapt["id"]] + datapt["type"] = [datapt["type"]] + datapt["question"] = [datapt["question"]] + datapt["context"] = [datapt["context"]] + datapt["oracle_context"] = [datapt["oracle_context"]] + 
datapt["cot_answer"] = [datapt["cot_answer"]] + datapt["instruction"] = [datapt["instruction"]] + ds = Dataset.from_dict(datapt) + else: + ds = ds.add_item(datapt) + + +if __name__ == "__main__": + # run code + args = get_args() + + OPENAPI_API_KEY = args.openai_key + + client = OpenAI( + api_key=OPENAPI_API_KEY, + ) + + CHUNK_SIZE = args.chunk_size + NUM_DISTRACT_DOCS = args.distractors + + chunks = get_chunks(args.datapath, args.doctype, CHUNK_SIZE, OPENAPI_API_KEY) + + ds = None + + for chunk in chunks: + add_chunk_to_dataset(chunks, chunk, args.doctype, args.questions, NUM_DISTRACT_DOCS, args.p) + print("chunk done") + + # Save as .arrow format + ds.save_to_disk(args.output) + + # Save as .jsonl format + ds.to_json(args.output + ".jsonl") diff --git a/raft/requirements.txt b/raft/requirements.txt new file mode 100644 index 00000000..a8a2417a Binary files /dev/null and b/raft/requirements.txt differ
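Note on the context-assembly step in `add_chunk_to_dataset` (the oracle chunk plus `--distractors` randomly chosen chunks, with the oracle kept with probability `--p`): the logic can be sketched in isolation as below. This is a minimal illustration, not the code in `raft.py`; the function name `build_context`, the `rng` parameter, and the explicit `ValueError` are assumptions introduced here so the behaviour can be checked without calling the OpenAI API.

```python
import random


def build_context(chunks, oracle_idx, num_distract=3, p=1.0, rng=random):
    """Hypothetical helper: return (docs, has_oracle) for one data point.

    `docs` holds `num_distract + 1` chunks in shuffled order. With
    probability `p` the oracle chunk is among them; otherwise it is
    replaced by another randomly chosen non-oracle chunk.
    """
    # Every chunk except the oracle is a candidate distractor.
    others = [i for i in range(len(chunks)) if i != oracle_idx]
    if len(others) < num_distract:
        raise ValueError("need more chunks than requested distractors")

    docs = [chunks[oracle_idx]]
    docs += [chunks[j] for j in rng.sample(others, num_distract)]

    # Decide whether the oracle stays in the context.
    has_oracle = rng.random() < p
    if not has_oracle:
        docs[0] = chunks[rng.choice(others)]

    # Shuffle so the oracle position carries no signal.
    rng.shuffle(docs)
    return docs, has_oracle


if __name__ == "__main__":
    sample_chunks = ["chunk A", "chunk B", "chunk C", "chunk D", "chunk E"]
    docs, has_oracle = build_context(sample_chunks, oracle_idx=0, num_distract=3, p=0.8)
    print(has_oracle, docs)
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible. As in the sketch, when the oracle is dropped its replacement may repeat one of the distractors already sampled.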