If you have experimented with asking one of the chat bots we created previously about its vector store, you may have found it hard to get answers drawn only from that vector store. We therefore need a way of narrowing the chat bot's focus to a specific set of information. To do this we can use what is known as a RAG (retrieval augmented generation) pipeline. It splits the source of information into several parts known as "chunks". These chunks are then "embedded", which means they are turned into large vectors and stored (I briefly touched on this in the file supported jupyter notebook). We then embed the user's prompt, compare its vector against each chunk's vector, and see how similar they are. Finally, we feed the most similar chunks back to the chat bot and ask it to answer the prompt in light of this information. We start as we usually do, by defining our input query:

In [None]:
#pip install openai

In [None]:
query = "how do i find the minimum chi squared value?"

In [None]:
from openai import OpenAI
import numpy as np

# Paste your OpenAI API key between the quotes
openai_api_key = ""

client = OpenAI(api_key=openai_api_key)

In [None]:
assistant = client.beta.assistants.create(
  name="Discovery Skills",
  description="You are a factual education AI Assistant dedicated to providing accurate, useful information. Your primary task is to assist me by providing reliable and clear responses to my questions. Only ever use information from file search as your source; this knowledge base is ______. You are reluctant to make any claims unless they are stated or supported by the knowledge base.",
  instructions = "if you make any code you should always run it",
  model="gpt-4o",
  tools=[{"type": "file_search"}, {"type": "code_interpreter"}],
  top_p=0.1,
)

For this to work we need to locate the folder that contains the reference document. Assuming your reference document is in the same folder as this jupyter notebook, run the box below, replace the file_path string with its output, and then run the file_path box. If your reference document is in another location, run the box below in a jupyter notebook in the same folder as the document, then come back to this one and swap the file_path string for the output you received.

In [None]:
import os
current_directory = os.getcwd()
print(current_directory)

In [None]:
file_path = "the above file path"
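As an alternative to copying the folder by hand, the sketch below builds the full path to the document with `pathlib` and checks whether the file is actually there. The name `file_name.txt` is a placeholder, and `folder` and `document_path` are names chosen only for this illustration.

```python
from pathlib import Path

folder = Path.cwd()                        # the same folder that os.getcwd() reports
document_path = folder / "file_name.txt"   # placeholder file name, replace with your own

print(document_path)
if document_path.is_file():
    print("found the reference document")
else:
    print("not found, check the folder and the file name")
```
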

This is our chunking method. We first define our chunks list and then define a load_large_document function. This function opens our reference document as a file, reads up to chunk_size characters at a time, and yields each piece as a string. We use yield instead of return because it turns the function into a generator, which lets us loop over the chunks one at a time with a simple for loop. The for loop below uses load_large_document to break the document into chunks of up to 2000 characters, which are then added to our chunks list. We can then print the chunks list to see the entire document.

Using a smaller chunk size increases the time this process takes, but it makes the final answer more specific. You do not want to go too low, however, as the retrieved chunks would then not contain enough information to give the user a useful answer. A bigger chunk size makes the process quicker but makes the final answer less specific.

In [None]:
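import os

chunks = []

def load_large_document(file_path, chunk_size=2000):
    # Read the document chunk_size characters at a time
    with open(file_path, 'r') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:  # an empty string means we reached the end of the file
                break
            yield chunk

# file_path is the folder found above; replace file_name.txt with your document's name
for chunk in load_large_document(os.path.join(file_path, 'file_name.txt')):
    chunks.append(chunk)

print(chunks)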
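To get a feel for the chunk-size trade-off, the sketch below applies the same fixed-size splitting to a string held in memory instead of a file. The `split_text` helper and the sample text are made up for illustration; only the slicing idea mirrors load_large_document.

```python
def split_text(text, chunk_size):
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# 6000 characters of placeholder text standing in for a real document
sample_text = "chi squared " * 500

for size in (500, 2000, 4000):
    pieces = split_text(sample_text, size)
    print(f"chunk_size={size}: {len(pieces)} chunks, "
          f"last chunk has {len(pieces[-1])} characters")
```

Smaller chunks mean more embeddings to compute and compare, while each chunk carries less surrounding context.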


We then call client.embeddings.create to embed the chunks we have just created. We specify the model to use (this is the most general-purpose one, but there are others), the input to embed, which in this case is our list of chunks, and the encoding format, which decides how the vectors are returned. The alternative encoding format is base64.

In [None]:
db = client.embeddings.create(
  model="text-embedding-ada-002",
  input=chunks,
  encoding_format="float"
)

We'll do the same thing to embed our query, except this time we'll also convert the embedding into a flat NumPy array so that our query is a single vector.

In [None]:
query_response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=[query],
    encoding_format="float"
)

query_embedding = query_response.data[0].embedding
query_embedding = np.array(query_embedding).flatten()

print(query_embedding)

Now we'll compare our query's embedding with the embedding of every chunk. We'll do this using the cosine of the angle between two vectors, which is their dot product divided by the product of their lengths. The closer the cosine is to 1, the more relevant a given chunk is to our query. We refer to this number as a similarity score, and we'll create a list of them:

In [None]:
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

similarity_scores = []
for embedding_data in db.data:
    chunk_embedding = embedding_data.embedding

    chunk_embedding = np.array(chunk_embedding).flatten()

    score = cosine_similarity(query_embedding, chunk_embedding)
    similarity_scores.append(score)
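To see what the similarity score means, here is a toy sketch with hand-made two-dimensional vectors in place of real embeddings. The vectors and their names are invented for illustration; the function is the same cosine_similarity as above.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

query_vec = np.array([1.0, 0.0])
same_direction = np.array([3.0, 0.0])   # parallel to the query, but longer
perpendicular = np.array([0.0, 2.0])    # at 90 degrees to the query
opposite = np.array([-1.0, 0.0])        # pointing the other way

print(cosine_similarity(query_vec, same_direction))  # 1.0
print(cosine_similarity(query_vec, perpendicular))   # 0.0
print(cosine_similarity(query_vec, opposite))        # -1.0
```

The length of a vector does not affect the score; only its direction does.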

Next we'll pair each chunk with its similarity score and sort the pairs from highest similarity to lowest. We'll then take the 5 chunks most similar to our query and combine them into a single string. By printing it we can see which chunks have been chosen.

In [None]:
chunk_scores = list(zip(chunks, similarity_scores))
sorted_chunks = sorted(chunk_scores, key=lambda x: x[1], reverse=True)

top_n = 5
combined_string = "All documents sourced from: file_name.txt\n\n"
# Slicing avoids an IndexError when the document has fewer than top_n chunks
for chunk, score in sorted_chunks[:top_n]:
    combined_string += chunk + "\n\n"

print(combined_string)
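The ranking step can also be sketched as a small helper that works on any list of chunks and scores. The name `select_top_chunks` and the toy data are hypothetical and not part of the notebook; the scores are made-up numbers, not real similarity values.

```python
def select_top_chunks(chunks, scores, top_n=5):
    """Return up to top_n chunks, ordered from most to least similar."""
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]

toy_chunks = ["chunk A", "chunk B", "chunk C"]
toy_scores = [0.71, 0.93, 0.80]   # made-up similarity scores

# Asking for 5 chunks when only 3 exist simply returns all 3
best = select_top_chunks(toy_chunks, toy_scores, top_n=5)
context = "\n\n".join(best)
print(context)
```

Joining with a blank line keeps the boundaries between chunks visible in the final context string.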

Now that we have both the refined source material and our query, we can combine them into a single prompt and give it to the chat bot. We will use the same event handler as we did in the code_interpreter jupyter notebook.

In [None]:
prompt = f"Answer the question based only on this context: {combined_string}. Answer the question based on the above context: {query}. You should explain as if I do not have access to this context, and always quote _____ as your source. Run any code you create, and quote explicitly any equations."

In [None]:
thread = client.beta.threads.create(
  messages=[
    {
      "role": "user",
      "content": prompt,        
    }
  ]
)

In [None]:
from typing_extensions import override
from openai import AssistantEventHandler
from PIL import Image
import io
import requests

class EventHandler(AssistantEventHandler):    
    @override
    def on_text_created(self, text) -> None:
        print(f"\nassistant > ", end="", flush=True)
      
    @override
    def on_text_delta(self, delta, snapshot):
        print(delta.value, end="", flush=True)
      
    def on_tool_call_created(self, tool_call):
        print(f"\nassistant > {tool_call.type}\n", flush=True)
  
    def on_tool_call_delta(self, delta, snapshot):
        if delta.type == 'code_interpreter':
            if delta.code_interpreter.input:
                print(delta.code_interpreter.input, end="", flush=True)
            if delta.code_interpreter.outputs:
                print(f"\n\noutput >", flush=True)
                for output in delta.code_interpreter.outputs:
                    if output.type == "logs":
                        print(f"\n{output.logs}", flush=True)
                    elif output.type == "image":
                        # Fetch the image data using the file_id
                        file_id = output.image.file_id
                        image_data = self.download_image(file_id)
                        if image_data:
                            image = Image.open(io.BytesIO(image_data))
                            image.show()
  
    def download_image(self, file_id):
        url = f"https://api.openai.com/v1/files/{file_id}/content"
        headers = {
            "Authorization": f"Bearer {openai_api_key}",
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.content
        else:
            print(f"Failed to download image: {response.status_code} {response.text}")
            return None

In [None]:
with client.beta.threads.runs.stream(
  thread_id=thread.id,
  assistant_id=assistant.id,
  instructions="",
  event_handler=EventHandler(),
) as stream:
  stream.until_done()

Because the chat bot also has access to the entire textbook, it can still answer questions about where the content comes from:

In [None]:
thread_message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="what chapter could i find more about this stuff in?"
)

with client.beta.threads.runs.stream(
  thread_id=thread.id,
  assistant_id=assistant.id,
  instructions="",
  event_handler=EventHandler(),
) as stream:
  stream.until_done()