# Goal

We will test whether fine-tuning our reranker improves the [recall metrics from week 1](https://github.com/567-labs/systematically-improving-rag/tree/main/week1_bootstrap_evals).

We will fine-tune the reranker on ~1000 `(question, review)` pairs. In each pair, the question is the query given to the model and the review is the passage that should rank highly for it. Our goal is for the fine-tuned reranker to improve our recall metrics.

To avoid leakage, we must use separate data for fine-tuning vs evaluating retrieval quality. This notebook generates the fine-tuning data. In practice, you will use your real data rather than generating synthetic data for fine-tuning.
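
One way to guard against leakage is to check that no question appears in both datasets. The snippet below is a minimal sketch of such a check, not part of the original notebook: the helper names `normalize` and `find_leaked` and the toy questions are hypothetical, and exact matching after normalization will not catch paraphrases.

```python
from typing import Iterable, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_leaked(train_texts: Iterable[str], eval_texts: Iterable[str]) -> Set[str]:
    """Return the normalized texts that appear in both the training and eval sets."""
    train_set = {normalize(t) for t in train_texts}
    eval_set = {normalize(t) for t in eval_texts}
    return train_set & eval_set


# Hypothetical toy data, for illustration only.
train_questions = ["How long does the battery last?", "Is the handle comfortable?"]
eval_questions = ["is the handle   comfortable?", "Does it come with a case?"]

leaked = find_leaked(train_questions, eval_questions)
print(leaked)
```

An empty result means no exact overlap was found. The same function could be applied to the reviews as well as the questions.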

To make the fine-tuning data differ slightly from the recall eval data, as it would in a realistic setting, we generate it with Claude 3.5 Sonnet, whereas the recall eval data was generated with gpt-4o.

## Load Products and Create Some Reviews

We will load the products database created in week 1. If you haven't run the week 1 code yet, run it first so that the database exists.

This cell will then create reviews of these products.

In [1]:
import json
import lancedb
import pandas as pd
import instructor
from anthropic import AsyncAnthropic
from pydantic import BaseModel
import asyncio
from typing import List

pd.set_option("display.max_colwidth", 160)


class Product(BaseModel):
    title: str
    description: str


class Review(BaseModel):
    review: str


class AllObjectInfo(BaseModel):
    title: str
    description: str
    review: str


async_client = instructor.from_anthropic(AsyncAnthropic())


db = lancedb.connect("../week1_bootstrap_evals/lancedb")
products = db.open_table("products").to_pandas()


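# The semaphore is a required argument. A default of asyncio.Semaphore(1) would be
# created once at definition time, shared by every call, and would serialize requests.
async def make_reviews(
    product: Product, n: int, semaphore: asyncio.Semaphore
) -> List[AllObjectInfo]: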
    async with semaphore:
        prompt = f"""
        Write {n} realistic but detailed/specific product reviews that might show up on a hardware store's website.

        The reviews should be about the following product:
        Product Title: {product.title}
        Product Description: {product.description}
        
        Add many relevant and concrete facts about the products (this is for synthetic data generation, make up facts about each product as necessary).

        To see the format of a possible review, here is a review for a saw:
        ```
        I've enjoyed using this saw. It is lightweight and the battery lasts longer than other brands.
        I've been using it for 3 years now and it has been very durable. It was twice as expensive as the PX-500. But
        it is comfortable to hold because of the light weight.
        ```

        Respond only with the reviews, and nothing else.
        """

        try:
            result = await async_client.messages.create(
                model="claude-3-5-sonnet-20240620",
                response_model=List[Review],
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0,
                max_tokens=1000,
            )
            return [
                AllObjectInfo(
                    title=product.title,
                    description=product.description,
                    review=r.review,
                )
                for r in result
            ]

        except Exception as e:
            print(f"Error: {str(e)}")
            return []


async def make_reviews_batch(
    max_concurrency: int = 10, reviews_per_product: int = 4
) -> List[AllObjectInfo]:
    out = []
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [
        make_reviews(Product(**o), reviews_per_product, semaphore)
        for _, o in products.iterrows()
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    for r in results:
        if isinstance(r, Exception):
            print(f"Error encountered: {r}")  # Print out any exceptions
        else:
            out.extend(r)
    return out


reviews = await make_reviews_batch(reviews_per_product=4)
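
The cell above caps the number of simultaneous API requests with an `asyncio.Semaphore`. The following is a minimal sketch of that pattern in isolation, with `asyncio.sleep` standing in for the API call. The names `fake_api_call` and `run_all` and the `state` counter are hypothetical and exist only to show that no more than `max_concurrency` calls run at once.

```python
import asyncio
from typing import Dict, List, Tuple


async def fake_api_call(i: int, semaphore: asyncio.Semaphore, state: Dict[str, int]) -> int:
    # Only `max_concurrency` coroutines can be inside this block at a time.
    async with semaphore:
        state["running"] += 1
        state["peak"] = max(state["peak"], state["running"])
        await asyncio.sleep(0.01)  # stand-in for a network request
        state["running"] -= 1
        return i * 2


async def run_all(n_tasks: int, max_concurrency: int) -> Tuple[List[int], int]:
    # Create the semaphore inside the running event loop.
    semaphore = asyncio.Semaphore(max_concurrency)
    state = {"running": 0, "peak": 0}
    tasks = [fake_api_call(i, semaphore, state) for i in range(n_tasks)]
    results = await asyncio.gather(*tasks)  # results keep the order of `tasks`
    return results, state["peak"]


# In a notebook, use `results, peak = await run_all(20, 3)` instead of asyncio.run.
results, peak = asyncio.run(run_all(n_tasks=20, max_concurrency=3))
print(f"{len(results)} calls finished, at most {peak} ran concurrently")
```

Raising `max_concurrency` speeds up generation but makes rate-limit errors from the API more likely.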

## Questions

Make question <-> review pairs in which we know which review answers each question. We do this by giving a review to the LLM and asking for questions that the review answers.

In [2]:
class Question(BaseModel):
    question: str


class TrainingPair(BaseModel):
    question: str
    review: str


async def generate_ft_pairs(
    obj_info: AllObjectInfo, n_questions: int, semaphore: asyncio.Semaphore
) -> List[TrainingPair]:

    prompt = f"""
        Generate `{n_questions}` question-answer pairs about a {obj_info.title}. The answers should primarily be derived from information in this product review:

        <content>
        {obj_info.review}
        </content>

        While they should contain information from the product review, you may also find it helpful context to see a product description:
        <content>
        {obj_info.description}
        </content>

        Example questions to consider when forming your question:
        - "What are the product's strengths?",
        - "What are the product's weaknesses?",
        - "What features or quirks stood out?",

        Provide a concise and specific answer for each question.
        Do not use the exact example questions. Use them only as inspiration for the types of more specific questions to generate.
        Do not include answers that are not in the content.
        Questions should ask about product characteristics (e.g. durability) and answers should refer to product characteristics without referring to the reviewer specifically.
        Stylistically, the questions should resemble what people would ask a RAG-based answer bot on a retailer's website. So they can be a little informal, messy or scattered.
        """

    async with semaphore:
        try:
            questions = await async_client.messages.create(
                model="claude-3-5-sonnet-20240620",
                response_model=List[Question],
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0,
                max_tokens=600,
            )
            return [
                TrainingPair(question=q.question, review=obj_info.review)
                for q in questions
            ]
        except Exception as e:
            print(f"Error generating evals: {str(e)}")
            return []

In [3]:
async def make_all_ft_data(
    reviews: List[AllObjectInfo],
    n_questions: int,
    max_concurrency: int = 10,
) -> List[TrainingPair]:
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [generate_ft_pairs(review, n_questions, semaphore) for review in reviews]
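    # return_exceptions=True returns exceptions as results instead of raising the first one.
    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Flatten the per-review lists; any exception objects in `results` are skipped.
    dataset = [item for r in results if isinstance(r, list) for item in r]
    return dataset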


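ft_dataset = await make_all_ft_data(reviews, n_questions=3)

Save the data as JSON Lines. Each line holds one training pair, with the question under `query` and the review in a one-element `relevant_passages` list.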

In [4]:
def save_dataset(dataset: List[TrainingPair], filename: str):
    with open(filename, "w") as f:
        for item in dataset:
            to_write = {
                "query": item.question,
                "relevant_passages": [item.review],
            }
            f.write(json.dumps(to_write) + "\n")


save_dataset(ft_dataset, "ft_dataset.jsonl")

print(f"Generated {len(ft_dataset)} Training Pairs.")
print("Dataset saved as 'ft_dataset.jsonl'")

Generated 1029 Training Pairs.
Dataset saved as 'ft_dataset.jsonl'
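
Before fine-tuning, it can be worth reading the file back to confirm that every record has the expected shape. The snippet below is a minimal sketch under that assumption: `load_dataset` is a hypothetical helper, and the example writes a toy record to a temporary file so that it runs on its own.

```python
import json
import os
import tempfile
from typing import Dict, List


def load_dataset(filename: str) -> List[Dict]:
    """Read a JSONL file and check each record for the keys written by save_dataset."""
    records = []
    with open(filename) as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            query = record.get("query")
            if not isinstance(query, str) or not query.strip():
                raise ValueError(f"line {line_number}: missing or empty 'query'")
            passages = record.get("relevant_passages")
            if not isinstance(passages, list) or not passages:
                raise ValueError(f"line {line_number}: missing 'relevant_passages'")
            records.append(record)
    return records


# Toy record in the same format as ft_dataset.jsonl, for illustration only.
sample = {
    "query": "does the battery hold up over time?",
    "relevant_passages": ["The battery lasts longer than other brands."],
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.jsonl")
    with open(path, "w") as f:
        f.write(json.dumps(sample) + "\n")
    loaded = load_dataset(path)

print(f"Loaded {len(loaded)} valid record(s).")
```

In the notebook itself, the equivalent call would be `load_dataset("ft_dataset.jsonl")`.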
