# Synthetic dataset generation using Llama 3.1 405B and RAFT

This recipe walks you through using Meta Llama 3.1 405B deployed on Azure AI to generate a synthetic dataset with RAFT, a method from UC Berkeley's Gorilla project (see the [blog post](https://aka.ms/raft-blog)).

Before running this notebook, make sure your environment is ready:
- Deploy Meta Llama 3.1 405B Instruct as a serverless endpoint. See the [Learn](https://aka.ms/raft-llama-31-learn-deploy-405b) article.
- Deploy OpenAI's `text-embedding-ada-002` as a serverless endpoint.
- Copy the `.env.sample` file to `.env` and update it to match your Azure AI project configuration and deployed endpoints.

## Set up the RAFT repository

This script checks out a shallow, narrow clone of the UC Berkeley Gorilla RAFT repository locally so that this notebook can invoke the RAFT script and utility functions. It can safely be run multiple times.

In [None]:
! ./setup_raft.sh

## Install requirements

In [None]:
! pip install -r requirements.txt

## Synthetic data generation phase

RAFT stands for Retrieval Augmented Fine Tuning. The general principle is to use a large LLM such as Llama 3.1 405B to analyse a set of documents and generate a dataset of questions and answers that users might ask about those documents. We can then use that QA dataset to fine-tune a smaller model such as Llama 3.1 8B. The fine-tuned model should therefore be better at answering questions about those documents.


The process is as follows. RAFT takes a set of documents as input, splits them into chunks, and for each chunk generates a list of questions and Chain-of-Thought answers, together with a selection of relevant and irrelevant context chunks.

<div>
<img src="./doc/raft.png" width="75%"/>
</div>
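
To make the chunking and context-selection steps concrete, here is a minimal sketch in plain Python. It is not the RAFT implementation: the helper names `chunk_text` and `build_context` are hypothetical, chunking is done by character count as a simplification, the oracle chunk is always included, and the question and answer generation (which calls the LLM) is left out.

```python
import random


def chunk_text(text: str, chunk_size: int) -> list:
    """Split text into consecutive chunks of at most chunk_size characters.

    Assumption: character-based chunking, used here only for illustration.
    """
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_context(chunks: list, oracle_idx: int, num_distractors: int,
                  rng: random.Random) -> list:
    """Mix the relevant (oracle) chunk with randomly chosen irrelevant chunks."""
    others = [c for i, c in enumerate(chunks) if i != oracle_idx]
    distractors = rng.sample(others, min(num_distractors, len(others)))
    context = distractors + [chunks[oracle_idx]]
    rng.shuffle(context)  # the oracle chunk should not sit at a fixed position
    return context


# Placeholder text standing in for the content of a real document.
text = "Vampires appear in folklore across many cultures. " * 8
chunks = chunk_text(text, chunk_size=64)
context = build_context(chunks, oracle_idx=0, num_distractors=3,
                        rng=random.Random(0))
print(len(chunks), "chunks;", len(context), "context chunks for the first sample")
```

Each generated question would then be paired with such a context list, so the fine-tuned model learns to answer from the relevant chunk while ignoring the distractors.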

### Select the documents

In [None]:
import pandas as pd
from utils import update_state
ds_name = "vampire-DEMO"
doc_path = "sample_data/vampires/Vampire - Wikipedia.pdf"
ds_path = f"dataset/{ds_name}"
update_state("DATASET_NAME", ds_name)
print("Creating dataset: " + ds_name)

### Overview of PDF

In [None]:
from utils import get_pdf_image
from pathlib import Path
pdf_image = None
doc = Path(doc_path)
# Only render a preview when the document is an existing PDF file
if doc.is_file() and doc.suffix == ".pdf":
    pdf_image = get_pdf_image(doc_path)
pdf_image

### Clean up the DEMO folder

In [None]:
# Clean up the output folders only if this is a DEMO dataset
if ds_path.endswith("DEMO"):
    import shutil
    # Dataset folder, checkpoints folder and exported files folder
    for suffix in ("", "-checkpoints", "-files"):
        folder = ds_path + suffix
        print(f"Cleaning demo folder {folder}")
        shutil.rmtree(folder, ignore_errors=True)

### Generate a Q/A/CoT fine-tuning dataset using RAFT from the domain-specific documents

In [None]:
import os
os.environ["HF_DATASETS_CACHE"] = ".cache/huggingface/datasets"

In [None]:
! python3 .gorilla/raft/raft.py \
    --datapath "$doc_path" \
    --output $ds_path \
    --distractors 3 \
    --doctype pdf \
    --chunk_size 512 \
    --questions 1 \
    --workers 2 \
    --system-prompt-key llama \
    --completion_model Meta-Llama-3-70B-Instruct \
    --embedding_model text-embedding-ada-002

## Prepare training, validation and evaluation splits

In [None]:
raft_arrow_file = f"{ds_path}/data-00000-of-00001.arrow"
dataset_path = f"{ds_path}-files/{ds_name}-full.jsonl"
dataset_path_hf = f"{ds_path}-files/{ds_name}-hf.full.jsonl"

dataset_path_hf_train = f"{ds_path}-files/{ds_name}-hf.train.jsonl"
dataset_path_hf_valid = f"{ds_path}-files/{ds_name}-hf.valid.jsonl"
dataset_path_hf_eval = f"{ds_path}-files/{ds_name}-hf.eval.jsonl"

dataset_path_ft_train = f"{ds_path}-files/{ds_name}-ft.train.jsonl"
dataset_path_ft_valid = f"{ds_path}-files/{ds_name}-ft.valid.jsonl"

print(f"Reading arrow file {raft_arrow_file}")

### Export dataset to JSONL

In [None]:
! python .gorilla/raft/format.py \
    --input $raft_arrow_file \
    --output $dataset_path_hf \
    --output-format hf

In [None]:
hf_full_df = pd.read_json(dataset_path_hf, lines=True)
hf_full_df.head(5)

### Let's look at a sample

In [None]:
from IPython.display import display, Markdown
from random import randint
sample_idx = randint(0, len(hf_full_df) - 1)
sample = hf_full_df.iloc[sample_idx]
instruction_md = sample.instruction.replace("<DOCUMENT>", "`<DOCUMENT>`").replace("</DOCUMENT>", "`</DOCUMENT>`")
oracle_context_md = sample.oracle_context.replace("<DOCUMENT>", "`<DOCUMENT>`").replace("</DOCUMENT>", "`</DOCUMENT>`")
sample_answer_md = sample.cot_answer.replace("<ANSWER>", "`<ANSWER>`").replace("##begin_quote##", "`##begin_quote##`").replace("##end_quote##", "`##end_quote##`")
display(Markdown(f"## Oracle Context\n{oracle_context_md}\n\n## Question\n{sample.question}\n\n## CoT Answer\n{sample_answer_md}\n\n## Instruction\n{instruction_md}"))
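
The cell above only escapes the `<ANSWER>`, `##begin_quote##` and `##end_quote##` markers for display. If you want to inspect them programmatically, the following sketch pulls the quoted passages and the final answer out of a CoT answer. The example string and the helper names `extract_quotes` and `extract_final_answer` are hypothetical, and the regular expression assumes the final answer follows the `<ANSWER>` marker, optionally after a colon.

```python
import re
from typing import List, Optional

# Hypothetical CoT answer using the markers shown in the cell above.
cot_answer = (
    "The context says ##begin_quote##garlic is said to repel vampires"
    "##end_quote##, so garlic is the relevant item.\n"
    "<ANSWER>: garlic"
)


def extract_quotes(answer: str) -> List[str]:
    """Return every passage wrapped in ##begin_quote## ... ##end_quote##."""
    return re.findall(r"##begin_quote##(.*?)##end_quote##", answer, flags=re.DOTALL)


def extract_final_answer(answer: str) -> Optional[str]:
    """Return the text after the <ANSWER> marker, or None if it is missing."""
    match = re.search(r"<ANSWER>:?\s*(.*)", answer, flags=re.DOTALL)
    return match.group(1).strip() if match else None


print(extract_quotes(cot_answer))
print(extract_final_answer(cot_answer))
```

With the real dataset, the same helpers could be applied to `sample.cot_answer` to check that each answer quotes its context and ends with a final answer.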

### Split the dataset into train / validation / evaluation

In [None]:
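# Split the dataset into 80% train / 10% validation / 10% evaluation.
# Positional slicing with iloc avoids np.split, which relies on the deprecated DataFrame.swapaxes.
samples_count = len(hf_full_df)
train_end, valid_end = int(.8 * samples_count), int(.9 * samples_count)
hf_train_df = hf_full_df.iloc[:train_end]
hf_valid_df = hf_full_df.iloc[train_end:valid_end]
hf_eval_df = hf_full_df.iloc[valid_end:]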
hf_train_df.to_json(dataset_path_hf_train, orient="records", lines=True)
hf_valid_df.to_json(dataset_path_hf_valid, orient="records", lines=True)
hf_eval_df.to_json(dataset_path_hf_eval, orient="records", lines=True)
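
As a quick check of what the 80%/10%/10% boundaries produce, this sketch applies the same index arithmetic to a few dataset sizes. The helper name `split_sizes` is hypothetical, and the sizes are illustrative rather than taken from the demo dataset.

```python
from typing import Tuple


def split_sizes(samples_count: int) -> Tuple[int, int, int]:
    """Return (train, validation, evaluation) sizes for 80%/10%/10% boundaries."""
    train_end = int(.8 * samples_count)
    valid_end = int(.9 * samples_count)
    return train_end, valid_end - train_end, samples_count - valid_end


for n in (5, 10, 25, 100):
    print(n, split_sizes(n))
```

Because the boundaries are truncated to integers, a very small dataset can end up with an empty validation split, which is worth checking before fine-tuning.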

### Convert the training and validation splits to the completion format (JSONL)

In [None]:
! python .gorilla/raft/format.py \
    --input $dataset_path_hf_train \
    --input-type jsonl \
    --output $dataset_path_ft_train \
    --output-format completion \
    --output-completion-prompt-column text \
    --output-completion-completion-column ground_truth

In [None]:
! python .gorilla/raft/format.py \
    --input $dataset_path_hf_valid \
    --input-type jsonl \
    --output $dataset_path_ft_valid \
    --output-format completion \
    --output-completion-prompt-column text \
    --output-completion-completion-column ground_truth

In [None]:
dataset_path_ft_valid_df = pd.read_json(dataset_path_ft_valid, lines=True)
dataset_path_ft_valid_df.head(2)

### Keep the evaluation split aside

We don't need to convert the evaluation split to the fine-tuning format for now, so it stays in the exported `hf` JSONL format.

In [None]:
pd.read_json(dataset_path_hf_eval, lines=True).head(2)