## Search: Retrieval & Ranking

Based on [Q&A using Embeddings](https://cookbook.openai.com/examples/question_answering_using_embeddings) from the OpenAI Cookbook.

### Why do we even search? 

We search whenever the model needs information that it did not see during training. This is common when something new has happened in the world, or when the model is used in a new context, e.g. with your company's internal data.

> GPT can learn knowledge in two ways:
>
> - Via model weights (i.e., fine-tune the model on a training set)
> - Via model inputs (i.e., insert the knowledge into an input message)
> 
> Although fine-tuning can feel like the more natural option—training on data is how GPT learned all of its other knowledge, after all—we generally do not recommend it as a way to teach the model knowledge. Fine-tuning is better suited to teaching specialized tasks or styles, and is less reliable for factual recall.
> 
> As an analogy, model weights are like long-term memory. When you fine-tune a model, it's like studying for an exam a week away. When the exam arrives, the model may forget details, or misremember facts it never read.
> 
> In contrast, message inputs are like short-term memory. When you insert knowledge into a message, it's like taking an exam with open notes. With notes in hand, the model is more likely to arrive at correct answers.

### Retrieval Augmented Generation

While we will cover this in more detail in later chapters, it's worth mentioning here that a model can be combined with a retrieval step: the system first retrieves relevant information from a database and then generates a response based on that information. This is called retrieval augmented generation (RAG).

![](../assets/Retrieval%20Augemented%20Generation.gif)

The steps to perform retrieval augmented generation are as follows:

#### Retrieval
1. Prepare search data: Prepare a dataset of documents that you want to search through.
2. Create embeddings: Create embeddings for each document in the dataset.
3. Prepare index: Create an index of the embeddings. This will allow you to search through the documents quickly.
4. Search: Search through the documents using the embeddings.

#### Generation
5. Generate: Use the retrieved documents to generate a response.
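
The five steps above can be sketched end to end without any external service. This is a toy illustration only: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the "index" is a plain Python list, and `generate` merely assembles the prompt that a real system would send to a chat model.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Hypothetical stand-in for an embedding model: L2-normalised word counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


def similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Dot product of two normalised sparse vectors, i.e. their cosine similarity."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def search(query: str, index: List[Tuple[str, Dict[str, float]]], k: int = 1) -> List[str]:
    """Return the k documents whose vectors are most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: similarity(query_vector, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def generate(query: str, context: List[str]) -> str:
    """Assemble the prompt; a real system would send this to a chat model."""
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {query}"


# 1. Prepare search data
documents = [
    "The cat sat on the mat.",
    "Paris is the capital of France.",
    "Embeddings map text to vectors.",
]
# 2-3. Create embeddings and keep them in a (very naive) index
index = [(doc, embed(doc)) for doc in documents]
# 4. Search
top_docs = search("What is the capital of France?", index, k=1)
# 5. Generate
prompt = generate("What is the capital of France?", top_docs)
print(prompt)
```

The rest of this section replaces each toy piece with a real component: OpenAI embeddings for `embed`, Qdrant for the index and search, and a chat completion for `generate`.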

Next, we'll walk through a simplified view of retrieval using the OpenAI API:

1. Prepare search data: You need to prepare a dataset of documents that you want to search through. This could be a list of documents, a list of paragraphs, or a list of sentences.

In [1]:
import json
from pathlib import Path
from typing import List

import tiktoken
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # Load environment variables from .env file

client = OpenAI()

MODEL = "gpt-3.5-turbo-0125"

In [2]:
text = Path("../data/paul_graham/paul_graham_essay.txt").read_text()

In [3]:
def ask(query: str, model: str = MODEL) -> str:
    """Return the response to a query."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": query},
        ],
        temperature=0,
    )
    return response.choices[0].message.content


ask("What did Paul Graham do in Summer of 2016?")

'In the summer of 2016, Paul Graham, the co-founder of Y Combinator, likely continued his work with the startup accelerator program and may have also been involved in various speaking engagements, writing articles, and advising startups. Unfortunately, I do not have specific details about his activities during that time period.'

## Input Processing: Chunking

In [4]:
def num_tokens(text: str, model: str = MODEL) -> int:
    """Return the number of tokens in a string."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))


num_tokens(text)

16534

Our model accepts a maximum of 16,385 tokens, which is less than the 16,534 tokens in the document. We need to chunk the document into smaller pieces.
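
The arithmetic behind that statement can be made explicit. The sketch below is illustrative: `fits_in_context` and its `reserved_for_output` allowance are hypothetical names, and the two constants are simply the numbers reported above.

```python
CONTEXT_WINDOW = 16_385   # maximum number of tokens stated above for the model
DOCUMENT_TOKENS = 16_534  # token count returned by num_tokens(text)


def fits_in_context(prompt_tokens: int, context_window: int, reserved_for_output: int = 0) -> bool:
    """Return True if the prompt plus the tokens reserved for the reply fit in the window."""
    return prompt_tokens + reserved_for_output <= context_window


overflow = DOCUMENT_TOKENS - CONTEXT_WINDOW
print(fits_in_context(DOCUMENT_TOKENS, CONTEXT_WINDOW))  # False
print(overflow)  # 149
```

In practice the window must also hold the instructions, the question, and the reply, so the real shortfall is larger than the 149 tokens computed here.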

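Here, we'll simply split the document into chunks of at most 1024 characters each, with an overlap of 96 characters. Note that the splitter below uses `length_function=len`, so sizes are counted in characters rather than tokens. The value 1024 is borrowed from empirical experiments [here](https://blog.llamaindex.ai/evaluating-the-ideal-chunk-size-for-a-rag-system-using-llamaindex-6207e5d3fec5) by the good folks at LlamaIndex, whose chunk sizes are measured in tokens.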

We'd also recommend using [ChunkViz](https://chunkviz.up.railway.app/) to build your intuition about the chunking process.

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # chunk_size and chunk_overlap are measured in characters, because length_function=len.
    chunk_size=1024,
    chunk_overlap=96,
    length_function=len,
    is_separator_regex=False,
)
texts = text_splitter.create_documents([text])
context_text = [t.page_content for t in texts]
len(context_text), context_text[0]

(101,
 'What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.')
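
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a minimal fixed-size sliding-window splitter. It is a simplified sketch, not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to break on separators such as paragraph and line boundaries), and `split_with_overlap` is a hypothetical helper.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters, where each chunk
    starts chunk_overlap characters before the end of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears intact in at least one chunk when the overlap is large enough.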

In [6]:
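from openai.types import CreateEmbeddingResponse

EMBEDDING_MODEL = "text-embedding-3-small"

def create_embeddings(texts: List[str], model: str = EMBEDDING_MODEL) -> CreateEmbeddingResponse:
    """Return the embeddings API response for a list of texts.

    The vectors themselves are available as ``response.data[i].embedding``.
    """
    return client.embeddings.create(
        input=texts,
        model=model,
    )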

response = create_embeddings(context_text)
print(f"Number of documents: {len(response.data)}")

Number of documents: 101


## Searching with Qdrant

In [7]:
# Prepare a list of embeddings from the response object

vectors = [d.embedding for d in response.data]
len(vectors), len(vectors[0])

(101, 1536)

In [22]:
from qdrant_client import QdrantClient, models

# Run Qdrant in memory; nothing is persisted to disk
qdrant_client = QdrantClient(":memory:")
qdrant_client.recreate_collection(
    collection_name="demo",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
)
# Prepare the points for uploading into Qdrant; each payload stores the chunk text
points = [
    models.PointStruct(
        id=i,
        vector=vectors[i],
        payload={"content": texts[i].page_content},
    )
    for i in range(len(vectors))
]
qdrant_client.upload_points(collection_name="demo", points=points)

In [23]:
query_vector = create_embeddings(["What did Paul Graham do in Summer of 2016?"]).data[0].embedding

In [26]:
# Find the two chunks closest to the query
search_results = qdrant_client.search(
    collection_name="demo",
    query_vector=query_vector,
    limit=2,
)
neighbors = [hit.id for hit in search_results]
# With the cosine metric, hit.score is a similarity: higher means closer
scores = [hit.score for hit in search_results]
print(neighbors)  # ids of the best-matching chunks
print(scores)     # their cosine similarity scores

[72, 71]
[0.45065986325349416, 0.44844746084954556]
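
Conceptually, a cosine search scores every stored vector against the query and returns the ids with the highest scores. The brute-force sketch below illustrates that idea with tiny made-up vectors. It is not how Qdrant is implemented internally, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], stored: List[Sequence[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (id, score) pairs for the k stored vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, vector)) for i, vector in enumerate(stored)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up 2-dimensional vectors; real embeddings here have 1536 dimensions
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.1]
results = top_k(toy_query, toy_vectors, k=2)
print([i for i, _ in results])  # [0, 2]
```

A brute-force scan like this is linear in the number of stored vectors, which is fine for 101 chunks; a vector database becomes valuable as the collection grows.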


In [27]:
def ask_with_context(query: str, context: List[str], model: str = MODEL) -> str:
    """Answer a query using only the supplied context chunks."""
    introduction = (
        "Use the below writing from Paul Graham to answer the subsequent question. "
        'If the answer cannot be found in the writing, write "I could not find an answer."'
    )
    context_block = "\n\n".join(context)
    question = f"\n\nQuestion: {query}"
    return ask(f"{introduction}\n\n{context_block}{question}", model)

selected_context = [context_text[neighbors[0]]]
ask_with_context("What did Paul Graham do in Summer of 2016?", selected_context)

'I could not find an answer.'

In this example, the response without retrieval read as more helpful than the response with retrieved context: the top-ranked chunk did not lead the model to an answer, so it replied "I could not find an answer." In many cases, though, the retrieved context will make the response more helpful. We encourage you to improve retrieval and try different search strategies to see how the model responds.
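
One simple strategy to try is passing several retrieved chunks to the model instead of only the top one, while dropping weak matches and capping the total size. The sketch below assumes the hits are available as `(chunk_id, score)` pairs, like the ids and scores printed above. `select_context`, `min_score`, and `max_chars` are hypothetical names, and the threshold values are arbitrary.

```python
from typing import List, Sequence, Tuple


def select_context(
    hits: Sequence[Tuple[int, float]],
    chunks: Sequence[str],
    min_score: float = 0.0,
    max_chars: int = 4000,
) -> List[str]:
    """Pick chunks in descending score order, skipping weak matches and
    any chunk that would push the total length past max_chars."""
    selected: List[str] = []
    used = 0
    for chunk_id, score in sorted(hits, key=lambda hit: hit[1], reverse=True):
        if score < min_score:
            break  # all remaining hits score even lower
        chunk = chunks[chunk_id]
        if used + len(chunk) > max_chars:
            continue  # too large for the remaining budget; try the next hit
        selected.append(chunk)
        used += len(chunk)
    return selected


# Made-up chunks and scores for illustration
toy_chunks = ["alpha " * 10, "beta " * 10, "gamma " * 10]
toy_hits = [(2, 0.45), (0, 0.44), (1, 0.10)]
context = select_context(toy_hits, toy_chunks, min_score=0.3, max_chars=150)
print(len(context))  # 2
```

The resulting list could be passed as the `context` argument of `ask_with_context` in place of the single top-ranked chunk.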