# Synthetic dataset generation using Llama 3.1 405B and RAFT

This recipe will walk you through using [Meta Llama 3.1 405B](https://aka.ms/c/model/Meta-Llama-3.1-405B-Instruct) deployed on Azure AI to generate a synthetic dataset using UC Berkeley's Gorilla project RAFT method (see [blog post](https://aka.ms/raft-blog)).

## What is RAFT?

RAFT stands for Retrieval Augmented Fine Tuning. The general principle is to use a large LLM such as Llama 3.1 405B to analyse a set of documents and generate a dataset of questions and answers that users might want to ask about those documents. We can then use that QA dataset to fine-tune a smaller model such as Llama 3.1 8B. The fine-tuned model should therefore be better at answering questions about those documents.

### Analogy: How to prepare an LLM for an Exam? 📝

*Note: This description was copied from the [UC Berkeley RAFT blog post](https://aka.ms/raft-blog-ucb).*

RAFT is a general recipe to finetune a pretrained LLM to your domain-specific RAG settings. This is a common scenario where you want your LLM to answer questions grounded on a set of documents, e.g., private files in an enterprise. Such a setting is different from the general RAG where the LLM does not know which domain (of documents) it will be tested on. To better illustrate this setting, let's draw an analogy between deploying and using an LLM and the real-world setting of preparing for an exam.

![RAFT Open book principle](./doc/raft_openbook.png "RAFT Open book principle")

#### Closed-Book Exam

A closed book exam often refers to the scenario where the LLMs do not have access to any additional documents or references to answer the questions during the exam. For LLMs, this is equivalent to the scenario, for example, in which the LLM is used as a chatbot. In this scenario the LLM draws from the knowledge baked in during pre-training and supervised-finetuning to respond to the users' prompt.

#### Open-Book Exam

In contrast, we liken the open-book exam setting to the scenario in which the LLM can refer to external sources of information (e.g., a website or a book chapter). In such scenarios, the LLM is typically paired with a retriever which retrieves k documents (or specific segments of the documents) that are appended to the users' prompt. It is only through these retrieved documents that the LLM gains access to new knowledge. As a result, we argue that the LLM's performance in these settings, where it is trained as a general-purpose LLM, is largely dependent on the quality of the retriever and how accurately the retriever can identify the most relevant piece of information.

#### RAFT

RAFT focuses on a narrower but increasingly popular domain than the general open book exam, called the domain-specific open-book exam. In the domain-specific open book exam, we know a priori the domain in which the LLM will be tested (i.e., used for inference). The LLM can respond to the users' prompt using any and all information from this specific domain, which it has been fine-tuned on. Examples of such domains include enterprise documents, latest news, code repositories belonging to an organization, etc. In all these scenarios, the LLM will be used to respond to questions whose answers can be found within a collection of documents (a small practical domain). The retrieval technique itself has little to no impact on the mechanism (though it may impact the accuracy). This paper mainly studies this domain-specific open-book setting and how to adapt a pretrained LLM to this specific domain, including how to make it more robust to a varying number of retrieved documents and distractors.

### RAFT Process: from domain documents to Q/A/CoT dataset splits

The process is the following. RAFT takes a set of documents as input, splits them into chunks, and for each chunk generates a list of questions and Chain of Thought answers, together with a selection of relevant and irrelevant context chunks.

![RAFT](./doc/raft.png "RAFT")
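
The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of `raft.py`: the helpers `chunk_text` and `build_sample`, the character-based chunking and the hard-coded question are all hypothetical. In the actual pipeline, the questions and Chain of Thought answers are generated by the LLM.

```python
import random


def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_sample(chunks: list[str], oracle_idx: int, question: str,
                 num_distractors: int, rng: random.Random) -> dict:
    """Combine the relevant (oracle) chunk with randomly chosen distractor chunks."""
    others = [c for i, c in enumerate(chunks) if i != oracle_idx]
    distractors = rng.sample(others, min(num_distractors, len(others)))
    context = distractors + [chunks[oracle_idx]]
    rng.shuffle(context)  # avoid the oracle always sitting at the same position
    return {
        "question": question,
        "oracle_context": chunks[oracle_idx],
        "context": context,
    }


rng = random.Random(0)
chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=5)
# Hypothetical question; in RAFT it would be generated by the LLM from the chunk.
sample = build_sample(chunks, oracle_idx=2, question="What follows 'j'?",
                      num_distractors=3, rng=rng)
print(len(sample["context"]))  # 4: one oracle chunk plus three distractors
```

Mixing the oracle chunk with distractors is what lets the fine-tuned model learn to ignore irrelevant retrieved context.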

## Running time and cost

The RAFT script usually takes a few minutes on the default sample document but can take days on bigger domains depending on the number and size of documents and the number of questions being generated for each chunk.

The cost of running this RAFT script on the sample document should be a few dollars. But beware, running it on bigger domains can cost hundreds of dollars, if not more. It is nevertheless safe to run this notebook multiple times, because the costly part, running the `raft.py` script, is only executed if the dataset doesn't exist yet.
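
To get a feel for how the dataset size grows with the inputs, here is a back-of-the-envelope sketch. It assumes that the chunk size is counted in tokens and that chunks are equally sized and non-overlapping, which may not match what `raft.py` actually does; `estimate_sample_count` is a hypothetical helper, and no pricing is assumed.

```python
import math


def estimate_sample_count(total_tokens: int, chunk_size: int, questions_per_chunk: int) -> int:
    """Rough number of Q/A samples: one per question per chunk."""
    num_chunks = math.ceil(total_tokens / chunk_size)
    return num_chunks * questions_per_chunk


# Hypothetical 10,000-token document, 512-token chunks, 1 question per chunk.
print(estimate_sample_count(total_tokens=10_000, chunk_size=512, questions_per_chunk=1))  # 20
# Same document with 5 questions per chunk.
print(estimate_sample_count(total_tokens=10_000, chunk_size=512, questions_per_chunk=5))  # 100
```

Since each sample requires LLM generation, running time and cost scale roughly with this count.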

## Pre-requisites

Before running this notebook, let's make sure your environment is ready.

### 1. Deploy Meta Llama 3.1 405B Instruct as a serverless endpoint.

This model will be used to generate the synthetic dataset.

You can either use [Azure ML Studio](https://aka.ms/raft-llama-31-learn-deploy-405b) or [Azure AI Studio](https://aka.ms/raft-llama-31-learn-deploy-405b-ai-studio).

**Note**: an Azure ML Workspace is the same as an Azure AI Hub, so you can go back and forth between the two transparently.

### 2. Deploy OpenAI's `text-embedding-ada-002` as a serverless endpoint.

This model will be used to create the chunk embeddings.

You can follow the same procedure as for the Meta Llama model deployment.

### 3. Set up your environment variables

Copy the `.env.sample` file to `.env` and update it according to your Azure AI project configuration and deployed endpoints.
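
Before launching the costly generation step, it can help to check that no required variable was left empty. The sketch below is a minimal, standard-library-only illustration: `parse_env` and `missing_keys` are hypothetical helpers, and the variable names are placeholders, not the ones defined in `.env.sample`.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


# Placeholder variable names, for illustration only; use the ones from `.env.sample`.
example = """
# completion endpoint
EXAMPLE_COMPLETION_ENDPOINT=https://example.invalid/v1
EXAMPLE_COMPLETION_KEY=
"""
required = ["EXAMPLE_COMPLETION_ENDPOINT", "EXAMPLE_COMPLETION_KEY"]
print(missing_keys(parse_env(example), required))  # ['EXAMPLE_COMPLETION_KEY']
```

To check a real file, pass the content of `.env` (for instance read with `pathlib.Path(".env").read_text()`) to `parse_env`.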

## Set up the RAFT repository
 
This script will check out a shallow and narrow clone of the UC Berkeley Gorilla RAFT repository locally so that this notebook can invoke the RAFT script and util functions. It can safely be run multiple times.

In [None]:
! ./setup_raft.sh

## Install requirements

In [None]:
! pip install -r requirements.txt

## Select the documents

#### Notebook parameters

*Note: Parameters are typed as indicated for Papermill introspection*

In [None]:
ds_name: str = "vampire-DEMO"
doc_path: str = "sample_data/vampires/Vampire - Wikipedia.pdf"
format: str = "completion"

In [None]:
import pandas as pd
from utils import update_state

ds_path = f"dataset/{ds_name}"
ds_output_file = f"{ds_path}.jsonl"
update_state("DATASET_NAME", ds_name)
print("Creating dataset: " + ds_name)

### Overview of PDF

In [None]:
from utils import get_pdf_image
from pathlib import Path

pdf_image = None
if Path(doc_path).exists() and Path(doc_path).is_file() and Path(doc_path).suffix == ".pdf":
    pdf_image = get_pdf_image(doc_path)
pdf_image

### Generate Q/A/CoT fine-tuning dataset using RAFT from the domain specific documents

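The `--completion_model` and `--embedding_model` parameters take the names of your model deployments in Azure, which may differ from the example values below. For `--completion_model`, use the name you gave to the Meta Llama 3.1 405B Instruct deployment created in the pre-requisites.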

In [None]:
! if [ ! -f "$ds_output_file" ]; then \
    python3 .gorilla/raft/raft.py \
        --datapath "$doc_path" \
        --output "$ds_path" \
        --distractors 3 \
        --doctype pdf \
        --chunk_size 512 \
        --questions 1 \
        --workers 2 \
        --system-prompt-key llama \
        --completion_model Meta-Llama-3-70B-Instruct \
        --embedding_model text-embedding-ada-002; \
  else \
    echo "Dataset already generated, skipping generation."; \
  fi

*Note*: The shell logic wrapping the Python script call skips the generation if the dataset has already been generated, so it is safe to run this notebook multiple times.
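
The same "run only if the output is missing" guard can be written in Python. This is a sketch of the idea rather than what the notebook executes: `run_once` is a hypothetical helper, and a stand-in command that merely creates the output file replaces the real `raft.py` invocation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_once(output_file: Path, command: list[str]) -> bool:
    """Run command only if output_file is missing; return True when it ran."""
    if output_file.is_file():
        print("Dataset already generated, skipping generation.")
        return False
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return True


# Stand-in command that just creates the output file, instead of calling raft.py.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo.jsonl"
    command = [sys.executable, "-c", f"open({str(target)!r}, 'w').close()"]
    first = run_once(target, command)
    second = run_once(target, command)
print(first, second)  # True False
```

Using `check=True` means a failed generation raises an error instead of being mistaken for an already generated dataset.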

## Prepare training, validation and evaluation splits

Let's define variables for the different files we will need throughout this notebook.

In [None]:
raft_arrow_file = f"{ds_path}/data-00000-of-00001.arrow"
dataset_path = f"{ds_path}-files/{ds_name}-full.jsonl"
dataset_path_hf = f"{ds_path}-files/{ds_name}-hf.full.jsonl"

dataset_path_hf_train = f"{ds_path}-files/{ds_name}-hf.train.jsonl"
dataset_path_hf_valid = f"{ds_path}-files/{ds_name}-hf.valid.jsonl"
dataset_path_hf_eval = f"{ds_path}-files/{ds_name}-hf.eval.jsonl"

dataset_path_ft_train = f"{ds_path}-files/{ds_name}-ft.train.jsonl"
dataset_path_ft_valid = f"{ds_path}-files/{ds_name}-ft.valid.jsonl"

print(f"Reading arrow file {raft_arrow_file}")

### Export dataset to JSONL

Let's export the Apache Arrow file to JSONL, which is easier to manipulate.

In [None]:
! python .gorilla/raft/format.py \
    --input $raft_arrow_file \
    --output $dataset_path_hf \
    --output-format hf

In [None]:
hf_full_df = pd.read_json(dataset_path_hf, lines=True)
hf_full_df.head(5)

## Let's look at a sample

In [None]:
from IPython.display import display, Markdown
from random import randint

sample_idx = randint(0, len(hf_full_df) - 1)
sample = hf_full_df.iloc[sample_idx]
instruction_md = sample.instruction.replace("<DOCUMENT>", "`<DOCUMENT>`").replace("</DOCUMENT>", "`</DOCUMENT>`")
oracle_context_md = sample.oracle_context.replace("<DOCUMENT>", "`<DOCUMENT>`").replace("</DOCUMENT>", "`</DOCUMENT>`")
sample_answer_md = sample.cot_answer.replace("<ANSWER>", "`<ANSWER>`").replace("##begin_quote##", "`##begin_quote##`").replace("##end_quote##", "`##end_quote##`")
display(Markdown(f"## Oracle Context\n{oracle_context_md}\n\n## Question\n{sample.question}\n\n## CoT Answer\n{sample_answer_md}\n\n## Instruction\n{instruction_md}"))

### Split the dataset into train / validation / evaluation

In [None]:
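# Split the dataset into 80% train / 10% validation / 10% evaluation.
# Positional slicing is used instead of np.split, which emits a
# deprecation warning on DataFrames in recent pandas versions.
samples_count = len(hf_full_df)
train_end = int(0.8 * samples_count)
valid_end = int(0.9 * samples_count)
hf_train_df = hf_full_df.iloc[:train_end]
hf_valid_df = hf_full_df.iloc[train_end:valid_end]
hf_eval_df = hf_full_df.iloc[valid_end:]

# Write each split as one JSON record per line.
hf_train_df.to_json(dataset_path_hf_train, orient="records", lines=True)
hf_valid_df.to_json(dataset_path_hf_valid, orient="records", lines=True)
hf_eval_df.to_json(dataset_path_hf_eval, orient="records", lines=True)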
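
Because the boundaries are truncated to integers, small datasets can end up with an empty split. The sketch below reproduces the boundary arithmetic of the cell above in a hypothetical helper, `split_sizes`, to show the resulting split sizes for a few sample counts.

```python
def split_sizes(samples_count: int) -> tuple[int, int, int]:
    """Sizes of the train/validation/evaluation splits for the 80/10/10 boundaries."""
    train_end = int(0.8 * samples_count)
    valid_end = int(0.9 * samples_count)
    return train_end, valid_end - train_end, samples_count - valid_end


for n in (5, 10, 37):
    print(n, split_sizes(n))
# 5 (4, 0, 1)
# 10 (8, 1, 1)
# 37 (29, 4, 4)
```

With only 5 samples the validation split is empty, so it is worth checking `len(hf_valid_df)` and `len(hf_eval_df)` when working with a very small document.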

### Export training and validation splits into the fine-tuning JSONL format

In [None]:
! python .gorilla/raft/format.py \
    --input $dataset_path_hf_train \
    --input-type jsonl \
    --output $dataset_path_ft_train \
    --output-format $format \
    --output-completion-prompt-column text \
    --output-completion-completion-column ground_truth

In [None]:
! python .gorilla/raft/format.py \
    --input $dataset_path_hf_valid \
    --input-type jsonl \
    --output $dataset_path_ft_valid \
    --output-format $format \
    --output-completion-prompt-column text \
    --output-completion-completion-column ground_truth

In [None]:
dataset_path_ft_valid_df = pd.read_json(dataset_path_ft_valid, lines=True)
dataset_path_ft_valid_df.head(2)

### Keep the evaluation split aside

We don't need to format the evaluation dataset for now.

In [None]:
pd.read_json(dataset_path_hf_eval, lines=True).head(2)