# RAGatouille + Instructor: Finetuning ColBERT(v2) with no annotated data

In the [previous example](https://github.com/bclavie/RAGatouille/blob/main/examples/02-basic_training.ipynb), we experimented with training our own ColBERT model. The process was straightforward, but our resulting model wasn't very useful: we didn't have any of the annotated data needed, so we created meaningless `[query, relevant_passage]` pairs that the model couldn't learn anything from!

Getting annotated data is expensive! Thankfully, the literature in retrieval has recently shown that synthetic data can yield similar, if not better, performance when fine-tuning retrieval models. This means we can fine-tune to our target domain without **needing pricey and time-consuming annotations**.

In this tutorial, we'll show how easily we can leverage Jason Liu's [instructor library](https://github.com/jxnl/instructor) for structured extraction with OpenAI's function calling API to generate meaningful query-passage pairs.

First, we'll need to install the required dependencies:

In [1]:
!pip install openai instructor

To continue with this tutorial, you'll need to set your OpenAI API key as an environment variable. If you're not planning on sharing this notebook with anyone, you can do so by uncommenting the line below and filling in your own API key before running the next cell!

In [2]:
import os
# os.environ['OPENAI_API_KEY'] = "YOUR_KEY_HERE"

Now that our API key is set up, we'll load our OpenAI client and set up `instructor`:

In [3]:
import instructor
# If you're using llamaindex 0.10 or above, these need to be imported from llama_index.core instead!
from llama_index import Document
from llama_index.text_splitter import SentenceSplitter
from openai import OpenAI
from typing import List
from pydantic import BaseModel, Field

client = instructor.patch(OpenAI(api_key=os.environ["OPENAI_API_KEY"]))

Perfect, now all we need is some data! Just like in the last tutorial, we'll be getting a few Wikipedia pages. First, let's define the helper function again:

In [4]:
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.
    
    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    # Custom User-Agent header to comply with Wikipedia's best practices
    headers = {
        "User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"
    }

    response = requests.get(URL, params=params, headers=headers)
    data = response.json()

    # Extracting page content
    page = next(iter(data['query']['pages'].values()))
    return page['extract'] if 'extract' in page else None
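
One thing to watch for: `get_wikipedia_page` returns `None` when the response contains no extract (for example, for a misspelled title), and a `None` entry in the corpus would likely cause trouble later on. A minimal sketch of a guard, where `drop_missing_pages` is a hypothetical helper (not part of RAGatouille) and the sample values are hand-written stand-ins for real API results:

```python
from typing import Dict, List, Optional, Tuple


def drop_missing_pages(pages: Dict[str, Optional[str]]) -> Tuple[List[str], List[str]]:
    """Split fetched pages into usable texts and the titles that came back empty.

    `pages` maps a page title to whatever get_wikipedia_page returned for it.
    """
    texts, missing = [], []
    for title, text in pages.items():
        if text:  # skips both None and empty strings
            texts.append(text)
        else:
            missing.append(title)
    return texts, missing


# Hand-written stand-ins for real API results, for illustration only.
fetched = {
    "Studio_Ghibli": "Studio Ghibli, Inc. is a Japanese animation studio.",
    "Not_A_Real_Page_Title": None,
}
texts, missing = drop_missing_pages(fetched)
print(f"kept {len(texts)} page(s), missing: {missing}")
```

In the notebook, you could build the dictionary from the titles you request and pass only `texts` on to the corpus processor.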

Now, let's load up a few pages and pre-process them into smaller chunks using `CorpusProcessor`:

In [5]:
my_full_corpus = [get_wikipedia_page("Hayao_Miyazaki")]
my_full_corpus += [get_wikipedia_page("Studio_Ghibli")]
my_full_corpus += [get_wikipedia_page("Toei_Animation")]

from ragatouille.data import CorpusProcessor, llama_index_sentence_splitter

corpus_processor = CorpusProcessor(document_splitter_fn=llama_index_sentence_splitter)
documents = corpus_processor.process_corpus(my_full_corpus, chunk_size=180)

How you proceed next depends on your primary use case. In many situations, you'll have a lot of documents but expect only some of them to be relevant to user queries.

We're going to make the same assumption here, so we'll randomly choose a few chunks as "relevant passages", for which we'll generate queries, and the others will serve as our full corpus to generate negatives from:

In [6]:
import random

random.seed(42)
relevant_documents = random.sample(documents, 32)
irrelevant_documents = list(set(documents) - set(relevant_documents))
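
Note that the `set` difference in the cell above drops duplicate chunks and does not preserve corpus order, so the order of `irrelevant_documents` can change between runs even with a fixed seed. If reproducibility matters, one option is to sample indices instead. A minimal sketch, assuming the chunks are plain strings; `split_corpus` is a hypothetical helper:

```python
import random
from typing import List, Tuple


def split_corpus(chunks: List[str], n_relevant: int, seed: int = 42) -> Tuple[List[str], List[str]]:
    """Pick n_relevant chunks at random; return (relevant, remaining) in corpus order."""
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    chosen = set(rng.sample(range(len(chunks)), n_relevant))
    relevant = [c for i, c in enumerate(chunks) if i in chosen]
    remaining = [c for i, c in enumerate(chunks) if i not in chosen]
    return relevant, remaining


# Toy chunks standing in for the real `documents` list.
toy_chunks = [f"chunk {i}" for i in range(10)]
relevant, remaining = split_corpus(toy_chunks, n_relevant=3)
print(len(relevant), len(remaining))
```

Because the split works on positions rather than values, duplicate chunks are kept and the same seed always yields the same two lists.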

### Creating our synthetic query set

Now that we have our list of relevant documents, we need to generate queries for them to be able to train our model! Unlike in the last tutorial, we want actual, useful queries that will produce a model we can use.

To do so, we'll define a `pydantic` model for `instructor` to use when calling OpenAI:

In [7]:
class QueryForPassage(BaseModel):
    hypothetical_questions: List[str] = Field(
        default_factory=list,
        description="A wide variety of hypothetical questions that this document could answer.",
    )
    hypothetical_queries: List[str] = Field(
        default_factory=list,
        description="A wide variety of hypothetical queries that this document would be relevant to, in the context of a search engine or a retrieval pipeline.",
    )

We're defining two styles of queries here: the variety might help prepare the model for the different ways users may phrase their queries! Feel free to experiment with the descriptions, add your own, or even request that the model also assign a few keywords to the document.

For this example, we'll only run it on a single passage to keep the output short (note the `break` at the end of the loop):

In [8]:
candidate_queries = []
for doc in relevant_documents:
    candidate = client.chat.completions.create(
        model="gpt-4-1106-preview",
        response_model=QueryForPassage,
        messages=[
            {
                "role": "system",
                "content": """You are an expert AI assisting us in creating a high quality, diverse synthetic dataset to train Information Retrieval models. 
Your role is to analyse the document chunk given to you and provide us with high quality potential queries.""",
            },
            {"role": "user", "content": doc},
        ],
    )
    candidate_queries.append(candidate)
    break

print("Document: ")
print(relevant_documents[0])
print("Generated queries: ")
candidate_queries[0].model_dump()

Document: 
In April 1984, Miyazaki opened his own office in Suginami Ward, naming it Nibariki.


=== Studio Ghibli ===


==== Early films (1985–1996) ====
In June 1985, Miyazaki, Takahata, Tokuma and Suzuki founded the animation production company Studio Ghibli, with funding from Tokuma Shoten. Studio Ghibli's first film, Laputa: Castle in the Sky (1986), employed the same production crew of Nausicaä. Miyazaki's designs for the film's setting were inspired by Greek architecture and "European urbanistic templates".
Generated queries: 


{'hypothetical_questions': ["What was Studio Ghibli's first film?",
  "Where was Miyazaki's office located when he opened it?",
  'Who were the founders of Studio Ghibli?',
  'What year was Studio Ghibli founded?',
  "What inspired Miyazaki's designs for the setting of Laputa: Castle in the Sky?",
  'Which production crew did Laputa: Castle in the Sky employ?'],
 'hypothetical_queries': ['First film produced by Studio Ghibli',
  "Location of Miyazaki's office in 1984",
  'Founders of Studio Ghibli',
  'Year Studio Ghibli was established',
  "Inspirations for Miyazaki's film Laputa",
  'Production crew for Laputa: Castle in the Sky']}

Obviously, in a real setting, you'd generate queries for many, many more of your documents to populate your training data! Feel free to experiment with the descriptions you give the model, and even sway it towards certain topics if you think those are particularly important to your users.
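
When scaling this up to many passages, API calls will occasionally fail, so it helps to retry and to skip passages that keep failing rather than lose the whole run. The sketch below is one way to structure that loop. `generate_all` and `generate_fn` are hypothetical names, with `generate_fn` standing in for the `client.chat.completions.create(...)` call from the cell above, and the retry policy is an assumption rather than anything `instructor` prescribes:

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def generate_all(
    docs: List[str],
    generate_fn: Callable[[str], T],
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> Dict[int, T]:
    """Call generate_fn on every doc, retrying failures with linear backoff.

    Returns a dict mapping each doc's index to its result; docs that fail
    max_attempts times are left out so the caller can see which ones to redo.
    """
    results: Dict[int, T] = {}
    for idx, doc in enumerate(docs):
        for attempt in range(1, max_attempts + 1):
            try:
                results[idx] = generate_fn(doc)
                break
            except Exception as exc:  # broad on purpose: this is only a sketch
                print(f"doc {idx}, attempt {attempt} failed: {exc}")
                if attempt < max_attempts:
                    time.sleep(backoff_seconds * attempt)
    return results


# Stand-in generator that fails once, to exercise the retry path without any API call.
_calls = {"n": 0}


def flaky_generate(doc: str) -> List[str]:
    _calls["n"] += 1
    if _calls["n"] == 1:
        raise RuntimeError("simulated transient error")
    return [f"What is '{doc}' about?"]


generated = generate_all(["passage A", "passage B"], flaky_generate, backoff_seconds=0.0)
print(generated)
```

Keying the results by index makes it easy to match each result back to its passage when some passages were skipped.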

Now that we've generated the queries, we need to create training pairs. In the example below, we arbitrarily decide that we want to keep two potential queries for each relevant document: one from `hypothetical_queries` and one from `hypothetical_questions`:

In [9]:
pairs = []
num_questions = 1
num_queries = 1
random.seed(42)

for candidates, doc in zip(candidate_queries, relevant_documents):
    candidates = candidates.model_dump()
    queries = random.sample(candidates['hypothetical_questions'], num_questions)
    queries += random.sample(candidates['hypothetical_queries'], num_queries)
    for q in queries:
        pairs.append([q, doc])

And just like that, we've generated annotated training examples in just a few lines of code, thanks to the magic of `pydantic` harnessed by `instructor` and OpenAI's function calling API!
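
One caveat with the pairing cell above: both fields use `default_factory=list`, so the model can return fewer items than requested (or none at all), and `random.sample` raises a `ValueError` when asked for more items than the list holds. A minimal sketch of a more forgiving version, using a hypothetical `build_pairs` helper that takes already-dumped dictionaries:

```python
import random
from typing import Dict, List


def build_pairs(
    candidates: List[Dict[str, List[str]]],
    docs: List[str],
    num_questions: int = 1,
    num_queries: int = 1,
    seed: int = 42,
) -> List[List[str]]:
    """Build [query, passage] pairs, sampling at most the requested number per field."""
    rng = random.Random(seed)
    pairs = []
    for cand, doc in zip(candidates, docs):
        questions = cand.get("hypothetical_questions", [])
        queries = cand.get("hypothetical_queries", [])
        # Cap the sample size at what is actually available to avoid a ValueError.
        picked = rng.sample(questions, min(num_questions, len(questions)))
        picked += rng.sample(queries, min(num_queries, len(queries)))
        pairs.extend([q, doc] for q in picked)
    return pairs


# Toy input: the second passage came back with no search-style queries.
toy_candidates = [
    {"hypothetical_questions": ["Who founded Studio Ghibli?"],
     "hypothetical_queries": ["Founders of Studio Ghibli"]},
    {"hypothetical_questions": ["What was Studio Ghibli's first film?"],
     "hypothetical_queries": []},
]
toy_pairs = build_pairs(toy_candidates, ["passage one", "passage two"])
print(len(toy_pairs))
```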

Now, we just need to get our trainer:

In [10]:
from ragatouille import RAGTrainer

trainer = RAGTrainer(model_name="GhibliColBERTv2.0", pretrained_model_name="colbert-ir/colbertv2.0")



Now, we'll get our trainer to prepare our training data. We'll pass it both the pairs we've created, as well as the full document corpus, so it can generate hard negatives via `SimpleMiner`:


In [11]:
trainer.prepare_training_data(
    raw_data=pairs,
    all_documents=documents,
    num_new_negatives=10,
    mine_hard_negatives=True,
)

Loading Hard Negative SimpleMiner dense embedding model BAAI/bge-small-en-v1.5...
Building hard negative index for 156 documents...
All documents embedded, now adding to index...
save_index set to False, skipping saving hard negative index
Hard negative index generated
mining
mining


'./data/'
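
To give some intuition for what hard-negative mining does: for each query, the miner looks for passages that are similar to the query but are not its labelled positive, which tend to be more informative training examples than randomly chosen ones. The toy sketch below illustrates the idea with bag-of-words cosine similarity in place of the dense `BAAI/bge-small-en-v1.5` embeddings mentioned in the log above. The `mine_hard_negatives` function here is hypothetical and is not RAGatouille's implementation, and the corpus sentences are made up for the example:

```python
import math
from collections import Counter
from typing import List


def _bow(text: str) -> Counter:
    """Lower-cased bag-of-words vector; a crude stand-in for a dense embedding."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mine_hard_negatives(query: str, positive: str, corpus: List[str], k: int = 2) -> List[str]:
    """Return the k corpus passages most similar to the query, excluding the positive."""
    q_vec = _bow(query)
    candidates = [doc for doc in corpus if doc != positive]
    ranked = sorted(candidates, key=lambda doc: _cosine(q_vec, _bow(doc)), reverse=True)
    return ranked[:k]


toy_corpus = [
    "studio ghibli was founded in 1985",
    "studio ghibli films are popular around the world",
    "toei animation was founded in 1948",
    "the weather in tokyo is mild in spring",
]
negatives = mine_hard_negatives(
    query="when was studio ghibli founded",
    positive=toy_corpus[0],
    corpus=toy_corpus,
    k=2,
)
print(negatives)
```

The two passages that share words with the query are picked as negatives, while the unrelated weather sentence is left out: it would be too easy to tell apart from the positive to teach the model much.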

Our trainer is now ready to `train()`, just like in the [previous tutorial](https://github.com/bclavie/RAGatouille/blob/main/examples/02-basic_training.ipynb). We won't actually train a model here, since we only generated queries for a single passage, but you've now got all you need to fine-tune your own ColBERT on your data, without any annotations!


**You're now done with the RAGatouille 0.0.1 examples!**


In a future tutorial, we'll demonstrate how to leverage [DSPy](https://github.com/stanfordnlp/dspy) to perform this kind of data-generation using open-source language models, in case you want to avoid relying on API calls!

In a future RAGatouille version, we'll further improve on this approach by supporting [UDAPDR](https://arxiv.org/abs/2303.00807), the state-of-the-art method to adapt retrievers to any domain, using LLMs.