# Generating a Finetuning Dataset

We will first create a high-quality, instruction-based dataset for finetuning a smaller language model. 

**Process:**
1.  **Load the source documents:** We'll load every PDF found in `data/source_documents/`.
2.  **Chunk the document:** Break the document into smaller, manageable text chunks.
3.  **Use a generator model:** For each chunk, we prompt a larger LLM (Gemini in this notebook) to generate a question-and-answer pair grounded in that chunk.
4.  **Format and save:** The generated pairs are wrapped in an instruction template and written to a `.jsonl` file (one JSON object per line), a common format for finetuning.
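To make the target format concrete, here is a minimal sketch of step 4 using only the standard library. The Q&A pairs, the helper names (`to_record`, `write_jsonl`, `read_jsonl`), and the temporary file path are illustrative assumptions; the `[INST]` template matches the one used later in this notebook.

```python
import json
import os
import tempfile

# Hypothetical Q&A pairs standing in for real model output.
qa_pairs = [
    {"question": "What is chunking?", "answer": "Splitting a document into smaller pieces."},
    {"question": "What is JSONL?", "answer": "One JSON object per line."},
]


def to_record(pair):
    """Wrap a Q&A pair in the single-field instruction template."""
    return {"text": f"<s>[INST] {pair['question']} [/INST] {pair['answer']} </s>"}


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read the file back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip through a throwaway file (assumed location, not the notebook's real path).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
write_jsonl([to_record(p) for p in qa_pairs], demo_path)
records = read_jsonl(demo_path)
print(len(records), "records")
print(records[0]["text"])
```

Because every line is an independent JSON object, the file can be streamed or appended to without parsing the whole dataset.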

In [35]:
import os
import yaml
import json
from tqdm import tqdm

from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

import sys
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), "..")))

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader

In [36]:
# Load configuration
with open("../config/config.yaml", "r") as f:
    config = yaml.safe_load(f) 

# Load documents
directory_path = "../"+config['data']['source_documents_path']
print(f"Loading documents from: {directory_path}")
loader = DirectoryLoader(
    directory_path,
    glob="**/*.pdf",
    loader_cls=PyPDFLoader,
    show_progress=True,
    use_multithreading=True
)
source_docs = loader.load()
if not source_docs:
    print(f"Warning: No documents loaded from {directory_path}. Make sure there are PDF files in the directory.")
else:
    # Count distinct source files from the page metadata so the figure matches the recursive glob above.
    num_pdfs = len({doc.metadata.get("source") for doc in source_docs})
    print(f"Successfully loaded {num_pdfs} PDFs => {len(source_docs)} documents (pages).")

# Split documents into chunks
chunk_size = config['data']['chunk_size']
chunk_overlap = config['data']['chunk_overlap']
print(f"Splitting {len(source_docs)} documents into chunks of size {chunk_size} with overlap {chunk_overlap}...")
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    length_function=len,
    add_start_index=True,
)
doc_chunks = text_splitter.split_documents(source_docs)
print(f"Successfully split documents into {len(doc_chunks)} chunks.")

Loading documents from: ../data/source_documents


  0%|          | 0/2 [00:00<?, ?it/s]

100%|██████████| 2/2 [00:00<00:00,  7.12it/s]

Successfully loaded 2 PDFs => 3 documents (pages).
Splitting 3 documents into chunks of size 1000 with overlap 100...
Successfully split documents into 13 chunks.
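To build intuition for how `chunk_size` and `chunk_overlap` interact, the following is a deliberately simplified sketch: a fixed-window character splitter. It is not how `RecursiveCharacterTextSplitter` works internally (the real splitter first tries to break on paragraph, line, and word boundaries), and the function name `sliding_chunks` is an assumption made for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Returns (start_index, chunk) tuples, loosely mirroring the start
    index recorded when add_start_index=True is set on the real splitter.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for start, piece in pieces:
    print(start, piece)
```

Each window repeats the last `chunk_overlap` characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when it is shorter than the overlap.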





In [37]:
# For dataset generation, we need a powerful and creative model.
# We explicitly set the provider to Gemini for this task.
# Ensure the API key is set correctly.
try:
    print("Loading Gemini model...")
    api_key = os.getenv("GEMINI_API_KEY")
    if not api_key:
        api_key = config['llm']['gemini']['gemini_api_key']
        if api_key == "YOUR_GEMINI_API_KEY_HERE":
            raise ValueError("Gemini API key not found. Set the GEMINI_API_KEY environment variable or add the key to config/config.yaml.")
    model_name = config['llm']['gemini']['model_name']
    generator_llm = ChatGoogleGenerativeAI(model=model_name, google_api_key=api_key)
    print("Successfully initialised Gemini for dataset generation.")
except Exception as e:
    print(f"Error: {e}")
    print("Please ensure config.yaml has 'provider: gemini' and a valid API key.")
    raise  # Stop here: later cells depend on generator_llm being defined.

Loading Gemini model...
Successfully initialised Gemini for dataset generation.
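The key lookup in the cell above follows a simple precedence: environment variable first, then the config file, with the placeholder value treated as missing. A minimal sketch of that logic as a standalone function is shown below. The helper name `resolve_api_key` and the `env` parameter are assumptions introduced for testability, not part of the notebook.

```python
import os

PLACEHOLDER = "YOUR_GEMINI_API_KEY_HERE"


def resolve_api_key(config, env=None):
    """Return the Gemini API key, preferring the environment over the config."""
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY")
    if key:
        return key
    # Fall back to the config file; tolerate missing sections.
    key = config.get("llm", {}).get("gemini", {}).get("gemini_api_key")
    if not key or key == PLACEHOLDER:
        raise ValueError("Gemini API key not found in the environment or config.")
    return key


# Illustrative config dict shaped like the keys read in this notebook.
demo_config = {"llm": {"gemini": {"gemini_api_key": "key-from-config"}}}

print(resolve_api_key(demo_config, env={"GEMINI_API_KEY": "key-from-env"}))
print(resolve_api_key(demo_config, env={}))
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.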


In [38]:
# Create a generation prompt that instructs the LLM to act as a data creator.
# It must generate a question and a corresponding detailed answer based only on the provided text chunk.
generation_prompt_template = """
You are an expert data scientist creating a dataset for instruction finetuning.
Your task is to generate a single, high-quality question-and-answer pair based ONLY on the following text chunk.
The question should be something a user would realistically ask about the document.
The answer must be detailed, comprehensive, and derived exclusively from the provided text.

TEXT CHUNK:
----------
{context}
----------

Generate the Q&A pair in the following JSON format:
{{
    "question": "<Your generated question here>",
    "answer": "<Your generated answer here>"
}}
"""

generation_prompt = PromptTemplate(
    template=generation_prompt_template, 
    input_variables=["context"]
)
generation_chain = generation_prompt | generator_llm | StrOutputParser()

# Iterate through each document chunk, invoke the generation chain, parse the JSON output, and collect the results
generated_data = []
print(f"Starting dataset generation from {len(doc_chunks)} chunks...")
for chunk in tqdm(doc_chunks):
    try:
        response_json_str = generation_chain.invoke({"context": chunk.page_content})
        clean_json_str = response_json_str.strip().replace('```json', '').replace('```', '')
        qa_pair = json.loads(clean_json_str)
        if 'question' in qa_pair and 'answer' in qa_pair:
            formatted_entry = {
                "text": f"<s>[INST] {qa_pair['question']} [/INST] {qa_pair['answer']} </s>"
            }
            generated_data.append(formatted_entry)
    except json.JSONDecodeError:
        print(f"\nWarning: Could not decode JSON from LLM response: {clean_json_str}")
    except Exception as e:
        print(f"\nAn unexpected error occurred: {e}")
print(f"\nSuccessfully generated {len(generated_data)} Q&A pairs.")

# Save the generated dataset to the path specified in the config
dataset_path = "../"+config['finetuning']['dataset_path']
output_dir = os.path.dirname(dataset_path)
os.makedirs(output_dir, exist_ok=True)
with open(dataset_path, 'w') as f:
    for entry in generated_data:
        f.write(json.dumps(entry) + '\n')
print(f"Dataset successfully saved to: {dataset_path}")

Starting dataset generation from 13 chunks...


100%|██████████| 13/13 [00:32<00:00,  2.47s/it]


Successfully generated 13 Q&A pairs.
Dataset successfully saved to: ../data/generated_dataset/finetuning_data.jsonl
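The loop above cleans the model response by removing Markdown code-fence markers before calling `json.loads`. Models sometimes also add a sentence before or after the JSON, so a slightly more defensive parser can help. The sketch below is one possible approach, not the notebook's method: the helper name `extract_qa_pair` is hypothetical, and it assumes the response contains at most one top-level JSON object.

```python
import json


def extract_qa_pair(raw):
    """Return a dict with non-empty 'question' and 'answer' strings, or None."""
    # Take everything from the first '{' to the last '}' and ignore the rest.
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        candidate = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(candidate, dict):
        return None
    for key in ("question", "answer"):
        value = candidate.get(key)
        if not isinstance(value, str) or not value.strip():
            return None
    return candidate


fence = "`" * 3  # built at runtime to avoid a literal fence inside this example
fenced = fence + 'json\n{"question": "Q?", "answer": "A."}\n' + fence
chatty = 'Sure, here it is: {"question": "Q?", "answer": "A."} Hope this helps.'

print(extract_qa_pair(fenced))
print(extract_qa_pair(chatty))
print(extract_qa_pair("no json here"))
```

Returning `None` instead of raising lets the calling loop count and skip bad responses without a separate `except` branch for each failure mode.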





### Next Steps

With our dataset created, we can move on to the next notebook, `2_Finetuning_with_LoRA.ipynb`, where we will use this data to train our own specialised model.