RAFT + readme + small sample dataset #218

Merged: 25 commits, Mar 15, 2024

Changes from 24 commits
1 change: 1 addition & 0 deletions README.md
@@ -25,6 +25,7 @@
`Gorilla` enables LLMs to use tools by invoking APIs. Given a natural language query, Gorilla comes up with the semantically and syntactically correct API to invoke. With Gorilla, we are the first to demonstrate how to use LLMs to invoke 1,600+ (and growing) API calls accurately while reducing hallucination. We also release APIBench, the largest collection of APIs, curated and easy to train on! Join us as we try to expand the largest API store and teach LLMs how to write them! Hop on our Discord, open a PR, or email us if you would like to have your API incorporated as well.

## News
- :rocket: [03/15] RAFT: Adapting Language Model to Domain Specific RAG is live! [[MSFT-Meta blog](https://aka.ms/raft-blog)] [[Berkeley Blog](https://gorilla.cs.berkeley.edu/blogs/9_raft.html)]
- :trophy: [02/26] [Berkeley Function Calling Leaderboard](https://gorilla.cs.berkeley.edu/leaderboard) is live!
- :dart: [02/25] [OpenFunctions v2](https://gorilla.cs.berkeley.edu/blogs/7_open_functions_v2.html) sets new SoTA for open-source LLMs!
- :fire: [11/16] Excited to release [Gorilla OpenFunctions](https://gorilla.cs.berkeley.edu/blogs/4_open_functions.html)
@@ -13,8 +13,8 @@
import test.ann_module as ann_module
import typing
from collections import ChainMap
-from test import ann_module2
-import test
+from raft.test import ann_module2
+import raft.test as test

# These are shared with test_tokenize and other test modules.
#
125 changes: 125 additions & 0 deletions raft/README.md
@@ -0,0 +1,125 @@
## RAFT

RAFT is a recipe for adapting LLMs to domain-specific RAG. You can learn more in our release blogs [here](https://gorilla.cs.berkeley.edu/blogs/9_raft.html) and [here](https://aka.ms/raft-blog). RAFT takes an input document from the user and uses it to create a dataset of synthetically generated `{ question, answer, documents }` triplets. The dataset can then be used to fine-tune models for improved question-answering and retrieval.

The input data can be either a general text document (pdf, json, or txt) for general QA, or API documentation in the API Zoo JSONL format for API calling.

## Install Dependencies

Dependencies can be installed using the following command:

```bash
pip install -r requirements.txt
```
## Usage

Run the following command with your desired arguments to generate the dataset.

```bash
python3 raft.py --datapath PATH_TO_DATA --output OUTPUT_PATH --distractors 3 --doctype pdf --chunk_size 512 --questions 5 --openai_key YOUR_OPENAI_KEY
```

Arguments:
- `--datapath` - the path at which the document is located
- `--output` - the path at which to save the dataset
- `--distractors` - the number of distractor documents to include per data point / triplet
- `--doctype` - the type of the document; must be one of the accepted doctypes
  - currently accepted doctypes: `pdf`, `txt`, `json`, `api`
  - documents in `json` format must have a "text" attribute containing the content from which chunks are extracted (see the sketch after this list)
  - documents in `api` format must follow the API JSON format detailed in the Gorilla [API Store](https://github.com/ShishirPatil/gorilla/blob/main/data/README.md)
- `--p` - the percentage of data points for which the oracle document is included in the context
- `--chunk_size` - the size of each chunk, in tokens
- `--questions` - the number of data points / triplets to generate per chunk
- `--openai_key` - your OpenAI key, used to make queries to GPT-3.5 or GPT-4
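For reference, a minimal sketch of an input document for `--doctype json`: the `"text"` attribute is the only requirement noted above; the extra field and the file path are illustrative.

```python
# Minimal sketch of a --doctype json input document.
# Only the "text" attribute is required; "title" and the output path are illustrative.
import json

doc = {
    "title": "United States",  # illustrative extra field
    "text": "The United States of America is a country primarily located in North America...",
}

with open("sample_data/us_overview.json", "w") as f:
    json.dump(doc, f)
```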
`raft.py` does the following:
- Takes a document located at `PATH_TO_DATA` and, depending on `doctype`, breaks it into chunks of `chunk_size` tokens (for pdf, json, or txt documents) or into one chunk per API endpoint (for API documentation).
- For each chunk, uses GPT-4 to synthetically generate `questions` question-answer pairs and adds `distractors` distractor chunks to each pair, creating {Q, A, D} triplets. Each triplet represents one datapoint in the dataset, where Q is the question/use-case, A is the answer, and D is the relevant chunk + distractor chunks.
- Each data point / triplet also contains other attributes (e.g. metadata), such as `id`, `type`, and `cot_answer`.
- Uses the HuggingFace Dataset API to create a dataset from all triplets and saves it at `OUTPUT_PATH` in the .arrow and .jsonl formats.

### Example Usage

The following details the command and process used to generate the example dataset found in `./sample_ds4`. The input document is a PDF of the Wikipedia page on the United States of America.
```bash
python3 raft.py --datapath sample_data/United_States_PDF.pdf --output ./sample_ds4 --distractors 4 --doctype pdf --chunk_size 512 --questions 5 --openai_key OPENAI_KEY
```

#### 1. Chunk generation
RAFT takes the PDF and divides its text into chunks of 512 tokens each. A sample chunk:
```python
"[CLS] United States of America Flag Coat of arms Motto : \" In God We Trust \" [ 1 ] Other traditional mottos : [ 2 ] \" E pluribus unum \" ( Latin ) \" Out of many, one \" \" Annuit cœptis \" ( Latin ) \" Providence favors our undertakings \" \" Novus ordo seclorum \" ( Latin ) \" New order of the ages \" Anthem : \" The Star - Spangled Banner \" [ 3 ] United States The United States of America ( USA or U. S. A. ), commonly know n as the United States ( US or U. S. ) or America, is a country primarily located in North America, between Canada and Mexico. It is a liberal democracy and republic of 50 federated states, a federal capital district ( Washington, D. C. ), and 326 Indian reservations that overlap with state bounda ries. Outside the union of states, it asserts sovereignty over five major unincorporated island territories and various uninhabited islands. [ i ] The country has the world\'s third - largest land area, [ c ] largest maritime exclusive econom ic zone, and the third - largest popul ation ( over 334 million ). [ j ] The federal gove rnment uses a presidential system with three separate branches : legislative, executive, and judicial. American territory was first settled by Paleo - Indians who migrated across the Bering land bridge over 12, 000 years ago. Colonization by the British began in 1607. Thirteen colonies eventually rebelled against the British Crown over taxation and political representation, declaring independence on July 4, 1776. Their victory in the American Revolutionary War ( 1775 – 83 ) resulted in a confederation of states before the U. S. Constitution and Bill of Rights were ratified. The young nation continued to acquire neighbor ing territories and spanned North America by the late 1840s. Longstanding disagreements over slavery led to the secession of the southern Confederate States of America, which were defeated by the remaining Union in the American Civil War ( 1861 – 65 ). Slavery was abolished, but discriminatory laws persisted in the South. By 1900, rapid indus trialization established the United States as a great power and the world\'s largest economy. Following the Japanese attack on Pearl Harbor in December 1941, the United States joined the Allies of World War II. After their victory, it competed against the Soviet Union for dominance in nuclear and conve ntional"
```
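For illustration, token-based chunking can be sketched as below with `tiktoken`; this is a simplified version and not necessarily the exact logic in `raft.py` (the tokenizer and chunk boundaries may differ).

```python
# Simplified sketch of fixed-size token chunking; raft.py's actual implementation may differ.
import tiktoken

def chunk_text(text: str, chunk_size: int = 512) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-3.5/GPT-4
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]
```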

#### 2. Question and answer generation
RAFT then uses GPT-4 to generate 5 questions per chunk as well as the label (answer) for each question. Proceeding with the previous example chunk:

**Questions:**

```python
['What is the official motto of the United States of America?',
'How many states are there in the United States of America?',
'Which territories does the United States claim sovereignty over, outside the union of states?',
'When did the thirteen colonies declare independence from the British Crown?',
'What caused the secession of the southern Confederate States of America?']
```

**Answers:**
```python
['"In God We Trust"',
'50 federated states',
'Five major unincorporated island territories.',
'July 4, 1776',
'Disagreements over slavery']
```
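As a rough sketch of this step (the prompt and parsing here are illustrative, not the exact ones used by `raft.py`), question generation is a chat-completion call per chunk:

```python
# Rough sketch of per-chunk question generation; the prompt and parsing are illustrative.
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_KEY")  # placeholder key

def generate_questions(chunk: str, n: int = 5) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": f"Write {n} questions that can be answered from the given context. "
                        "Return one question per line."},
            {"role": "user", "content": chunk},
        ],
    )
    # One question per line, blank lines dropped.
    return [q.strip() for q in response.choices[0].message.content.split("\n") if q.strip()]
```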
#### 3. Append distractor documents
For each question-answer pair, append 4 randomly selected chunks as distractor documents to form the {Q, A, D} triplet. Proceeding with the current example, a {Q, A, D} triplet, or one datapoint, would look like:

```python
{
'id': 'seed_task_0',
'type': 'general',
'question': 'What is the official motto of the United States of America?',
'context': {
'sentences': [
["the Gulf of Mexico are prone to hurricanes, ... and enforces the Act. [ 189 ] As of 2022, the U. S",
"energy from fossil fuel and the largest ... there are 19, 969 airports in the U. S., of which 5, 193 are designated",
'weaponry, ideology, and international i... and is a permanent member of the UN Security Counc il. The first documentary evidence of the phrase " United States',
'[CLS] United States of America Flag Coat of arms ... dominance in nuclear and conve ntional',
'##om ic soft pow er. [ 405 ] [ 406 ] Nearly all present ... rights in the United States are advanced by gl obal standards.']
],
'title': [
['placeholder_title',
'placeholder_title',
'placeholder_title',
'placeholder_title',
'placeholder_title']
]
},
'answer': '"In God We Trust"',
'cot_answer': None
}

```
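A minimal sketch of this step (illustrative; the exact sampling and oracle-inclusion logic in `raft.py`, including the `--p` behavior, may differ): the oracle chunk is combined with randomly sampled other chunks.

```python
# Illustrative distractor sampling: combine the oracle chunk with randomly chosen other chunks.
import random

def add_distractors(oracle_idx: int, chunks: list[str], num_distract: int = 4) -> list[str]:
    candidates = [i for i in range(len(chunks)) if i != oracle_idx]
    distractor_ids = random.sample(candidates, num_distract)
    docs = [chunks[oracle_idx]] + [chunks[i] for i in distractor_ids]
    random.shuffle(docs)  # avoid the oracle document always appearing first
    return docs
```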

#### 4. Generate and save dataset
RAFT repeats steps 2 and 3 for each chunk and saves the dataset to the path specified by the `--output` argument.
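For example, the `.jsonl` output can be loaded back with the HuggingFace `datasets` library; this is a sketch, and the exact output filename under your `--output` path is an assumption here.

```python
# Sketch: load the generated dataset; the filename below is an assumption based on --output ./sample_ds4.
from datasets import load_dataset

ds = load_dataset("json", data_files="./sample_ds4/raft.jsonl", split="train")
print(ds[0]["question"])
print(ds[0]["answer"])
```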


#### 5. Finetune your own model on Microsoft AI Studio
Once the dataset is prepared, follow the instructions in `azure-ai-studio-ft/howto.md` to finetune and deploy your own RAFT model. Make sure to use the `instruction` column as the input (prompt) and the `cot_answer` column as the output (completion), as in the sketch below.
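A sketch of preparing the fine-tuning file, assuming the generated `.jsonl` contains the `instruction` and `cot_answer` columns mentioned above; the filenames are illustrative.

```python
# Sketch: convert the RAFT dataset to a two-column (prompt/completion) JSONL for fine-tuning.
# Filenames are illustrative; adjust them to your --output path.
import json

with open("./sample_ds4/raft.jsonl") as fin, open("raft_ft.jsonl", "w") as fout:
    for line in fin:
        row = json.loads(line)
        fout.write(json.dumps({
            "prompt": row["instruction"],     # question plus oracle/distractor documents
            "completion": row["cot_answer"],  # chain-of-thought style answer
        }) + "\n")
```

In the AI Studio wizard (step 14 of the how-to), you would then select `prompt` as the prompt column and `completion` as the completion column.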

#### 6. Evaluate RAFT model
After deploying your model in AI Studio, use the following command to evaluate the RAFT model. Make sure to fill in `base_url`, `api_key`, and `model_name` in `eval.py`; these can be found in AI Studio.
```bash
python3 eval.py --question-file YOUR_EVAL_FILE.jsonl --answer-file YOUR_ANSWER_FILE
```

`YOUR_EVAL_FILE.jsonl` contains one JSON object per line, in the following format:
```python
{
    'instruction': '<DOCUMENT> document1 </DOCUMENT>\n<DOCUMENT> document2 </DOCUMENT> ...\n{question}',
    'gold_answer': '{answer}'
}

```
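A sketch of producing such a file from a held-out portion of the generated dataset, assuming it contains `instruction` and `answer` fields; the filenames and the `answer` to `gold_answer` mapping are illustrative.

```python
# Sketch: build an eval file with 'instruction' and 'gold_answer' fields.
# Filenames and the field mapping are illustrative, not prescribed by raft.py.
import json

with open("./sample_ds4/raft.jsonl") as fin, open("eval.jsonl", "w") as fout:
    for line in fin:
        row = json.loads(line)
        fout.write(json.dumps({
            "instruction": row["instruction"],
            "gold_answer": row["answer"],
        }) + "\n")
```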
73 changes: 73 additions & 0 deletions raft/azure-ai-studio-ft/howto.md
@@ -0,0 +1,73 @@
# HOWTO: Fine-tune llama-2-7b in Azure AI Studio

## Prerequisites

[Prerequisites in MS Learn article "Fine-tune a Llama 2 model in Azure AI Studio"](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/fine-tune-model-llama#prerequisites)

## Key things to get right for everything to work

- Select the West US 3 location
- Use a Pay As You Go subscription with a credit card linked
- Make sure the subscription is registered to the `Microsoft.Network` resource provider

## Detailed step by step

This builds on the ["Fine-tune a Llama 2 model in Azure AI Studio" MS Learn tutorial](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/fine-tune-model-llama#prerequisites) and adds a few details here and there.

Open https://ai.azure.com/

Create a new AI Project
![Step 01](images/azure-ai-studio-finetuning-01.png)

Enter a name and create a new resource
![Step 02](images/azure-ai-studio-finetuning-02.png)

Enter an AI Hub resource name, select the PAYG (Pay As You Go) Subscription and West US 3 location
![Step 03](images/azure-ai-studio-finetuning-03.png)

Note: It's important to use a PAYG subscription with a credit card linked to the account. Grant-based subscriptions and credits will not work.

Review that the location is correctly set to West US 3 and that the subscription is correct
![Step 04](images/azure-ai-studio-finetuning-04.png)

Resource creation should begin
![Step 05](images/azure-ai-studio-finetuning-05.png)

Wait until all resources have been created
![Step 06](images/azure-ai-studio-finetuning-06.png)

Once in the AI Studio project, open the Fine-tuning tab and click on the Fine-tune model button
![Step 07](images/azure-ai-studio-finetuning-07.png)

Select the model to fine-tune, for example Llama 2 7b
![Step 08](images/azure-ai-studio-finetuning-08.png)

Subscribe if necessary to the Meta subscription and start the fine-tuning
![Step 09](images/azure-ai-studio-finetuning-09.png)

Enter the name of the fine-tuned model
![Step 10](images/azure-ai-studio-finetuning-10.png)

Select the task type; currently, only text generation is supported
![Step 11](images/azure-ai-studio-finetuning-11.png)

Select the upload data option and upload your file; it must be in JSONL format
![Step 12](images/azure-ai-studio-finetuning-12.png)

The wizard will show you an overview of the top lines
![Step 13](images/azure-ai-studio-finetuning-13.png)

Select which column is the prompt and which one is the completion
![Step 14](images/azure-ai-studio-finetuning-14.png)

Select the task parameters
![Step 15](images/azure-ai-studio-finetuning-15.png)

Review the settings
![Step 16](images/azure-ai-studio-finetuning-16.png)

The job should now be in the running state
![Step 17](images/azure-ai-studio-finetuning-17.png)

Wait until the job is completed
![Step 18](images/azure-ai-studio-finetuning-18.png)
76 changes: 76 additions & 0 deletions raft/eval.py
@@ -0,0 +1,76 @@
import string
import re
from openai import OpenAI
from openai import AzureOpenAI
import multiprocessing as mp
import time
import argparse
import json
import os

# Configuration: fill these in with the values from your AI Studio deployment.
base_url = ''
api_key = ''
model_name = ''

client = OpenAI(
    base_url=base_url,
    api_key=api_key,
)


def get_openai_response(message):
    """Query the deployed model with a list of chat messages."""
    response = client.chat.completions.create(
        messages=message,
        model=model_name,
        temperature=0.2,
    )
    try:
        return response.choices[0].message.content
    except Exception as e:
        print(e)
        return response


def get_answer(input_json):
    """Answer a single eval example and attach the model's answer to it."""
    message = [{"role": "user", "content": input_json['instruction']}]
    result = get_openai_response(message)
    input_json['model_answer'] = result
    return input_json


def write_result_to_file(result, write_file_name):
    """Append one result to the answer file, serialized by a lock."""
    global file_write_lock
    with file_write_lock:
        with open(write_file_name, "a") as outfile:
            json.dump(result, outfile)
            outfile.write("\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--question-file", type=str, required=True)
    parser.add_argument("--answer-file", type=str, default="answer.jsonl")
    args = parser.parse_args()

    # Start from a clean answer file.
    write_file_name = args.answer_file
    if os.path.isfile(write_file_name):
        os.remove(write_file_name)

    num_workers = 20
    file_write_lock = mp.Lock()

    # Load the eval questions (one JSON object per line).
    inputs = []
    with open(args.question_file, 'r') as f:
        for line in f:
            inputs.append(json.loads(line))

    print('number of inputs: ', len(inputs))
    start_time = time.time()
    with mp.Pool(num_workers) as pool:
        results = []
        for item in inputs:
            # Each worker answers one question; the callback writes it out as soon as it completes.
            result = pool.apply_async(
                get_answer,
                args=(item,),
                callback=lambda result: write_result_to_file(result, write_file_name),
            )
            results.append(result)
        pool.close()
        pool.join()
    end_time = time.time()
    print("total time used: ", end_time - start_time)