This Jupyter notebook demonstrates the use of a fine-tuned model designed to guide students towards answers by asking questions rather than simply giving the answers.

In [None]:
#!pip install openai
#!pip install PyPDF2
#!pip install numpy pillow requests

In [None]:
query = "how would i make a curve fitting program?"

In [None]:
from openai import OpenAI
import numpy as np
import os
from PyPDF2 import PdfReader

openai_api_key = ""

client = OpenAI(api_key=openai_api_key)

Here we've created an assistant as usual, but this time we have specified a custom fine-tuned model. To create a fine-tuned model, follow these steps. Go to the OpenAI dashboard and select fine-tuning in the left-hand menu. Click the +create button in the top right to begin customising a fine-tuning job. Select your base model (if it's text based, I'd recommend one of the gpt models), upload your training data (I will go over this in more detail shortly), and provide a suffix (this is essentially a name for the model). I'd recommend leaving everything else at its original setting (auto/random). If you find the fine-tuned model unresponsive to your training data, you can raise the learning rate multiplier to fit the model more closely to your data; increasing it too much can lead to overfitting, and for this example I used a value of 2. You can retrieve the name of the model from the same place you created it, under the "Output Model" section. This name should then be passed as the model when creating the assistant, as done below.
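The same job can also be started from code instead of the dashboard. The following is a minimal sketch, assuming the `files.create` and `fine_tuning.jobs.create` calls of the OpenAI Python client and that the installed client version still accepts a `hyperparameters` dictionary. The function name `start_fine_tune`, the file name in the usage comment, and the default base model and suffix are illustrative choices, not something this notebook relies on.

```python
def start_fine_tune(client, training_path, base_model="gpt-3.5-turbo-0125",
                    suffix="socratic", learning_rate_multiplier=2):
    """Upload a .jsonl training file and start a fine-tuning job.

    `client` is assumed to be an openai.OpenAI instance. The returned job
    object carries the job id and status.
    """
    # Upload the training data with the purpose the fine-tuning API expects
    with open(training_path, "rb") as f:
        training_file = client.files.create(file=f, purpose="fine-tune")

    # Start the job; hyperparameters that are not listed stay on automatic
    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model=base_model,
        suffix=suffix,
        hyperparameters={"learning_rate_multiplier": learning_rate_multiplier},
    )
    return job

# Example usage (needs a valid API key, so it is left commented out):
# job = start_fine_tune(client, "training_data.jsonl")
# print(job.id, job.status)
```

Once the job has finished, the name of the fine-tuned model can be copied from the dashboard as described above.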

Regarding training data, in the same folder as this file you should find a .jsonl file. You can open it with Notepad (or a text editor of your choice), and inside you will find a large list of conversations formatted like this:

{"messages": [{"role": "system", "content": "You are a factual chatbot that helps students learn by asking them questions"}, {"role": "user", "content": "What is the area of a rectangle with a length of 5 and a width of 3?"}, {"role": "assistant", "content": "How do you calculate the area of a rectangle?"}, {"role": "user", "content": "You multiply the length by the width."}, {"role": "assistant", "content": "That's correct. So, what is 5 multiplied by 3?"}, {"role": "user", "content": "5 multiplied by 3 is 15."}, {"role": "assistant", "content": "Great! So, what is the area of the rectangle?"}, {"role": "user", "content": "The area is 15."}, {"role": "assistant", "content": "Exactly. Well done!"}]}

Each line that starts with {"messages": ... is an example conversation between an assistant and a user, and each line is a complete JSON object on its own. After "messages": we specify a list of messages. The first message in the list is always a system message. We define roles with "role": followed by the role of the entity giving the message, as we have done previously when creating messages in threads. Likewise, after specifying the role we specify the content of the message given by that role; in this case the first system message is "You are a factual chatbot that helps students learn by asking them questions". The remaining messages follow the same format but contain different content and use the user and assistant roles; together they form the example conversation between the user and the assistant. To make this clearer, I've reformatted the example above as follows:

{
"messages": 

[

{"role": "system", "content": "You are a factual chatbot that helps students learn by asking them questions"}, 

{"role": "user", "content": "What is the area of a rectangle with a length of 5 and a width of 3?"},

{"role": "assistant", "content": "How do you calculate the area of a rectangle?"}, 

{"role": "user", "content": "You multiply the length by the width."}, 

{"role": "assistant", "content": "That's correct. So, what is 5 multiplied by 3?"},

{"role": "user", "content": "5 multiplied by 3 is 15."}, 

{"role": "assistant", "content": "Great! So, what is the area of the rectangle?"},

{"role": "user", "content": "The area is 15."},

{"role": "assistant", "content": "Exactly. Well done!"}

]

}

Several of these conversations can then be uploaded to OpenAI in a .jsonl file to serve as training data. OpenAI recommends around 50-100 example conversations to begin seeing the behaviour you want, but more is generally better. Remember we're not training the knowledge we want here, we're training the type of conversation we want.
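A training file in this format can be written with the standard `json` module. The sketch below assumes the conversations are held as Python lists of message dictionaries; the helper name `write_training_file` and the output file name are made up for illustration.

```python
import json

SYSTEM_MESSAGE = ("You are a factual chatbot that helps students learn "
                  "by asking them questions")

def write_training_file(conversations, path):
    """Write one {"messages": [...]} object per line (the JSONL format)."""
    with open(path, "w", encoding="utf-8") as f:
        for turns in conversations:
            # Every example starts with the same system message
            messages = [{"role": "system", "content": SYSTEM_MESSAGE}] + turns
            f.write(json.dumps({"messages": messages}) + "\n")

example_turns = [
    {"role": "user", "content": "What is the area of a rectangle with a length of 5 and a width of 3?"},
    {"role": "assistant", "content": "How do you calculate the area of a rectangle?"},
]
write_training_file([example_turns], "example_training_data.jsonl")
```

Using `json.dumps` rather than typing the lines by hand avoids quoting mistakes, which are easy to make in long single-line conversations.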

It is important in our case that the first system message is the same in every conversation. When we create an assistant based on this model, we need to give it instructions that are identical to that system message, and we need to make sure these instructions are never overwritten. By giving the model identical instructions we are essentially telling it to mimic the behaviour it has seen in the .jsonl conversations. Any instructions you want to give the model should therefore be specified in the .jsonl file from the start. If you want to give more instructions on the fly, I'd recommend putting them in the prompt text or in the assistant's description, although the latter is limited to 512 characters.
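Before uploading, it can be worth checking that every conversation really does begin with the same system message. This is a small sketch of such a check using only the standard library; the function name `check_training_file` is hypothetical, and the checks cover only the points discussed here rather than everything OpenAI validates.

```python
import json

def check_training_file(path):
    """Return a list of problems found in a .jsonl training file."""
    problems = []
    system_messages = set()
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                messages = json.loads(line)["messages"]
            except (json.JSONDecodeError, KeyError, TypeError):
                problems.append(f"line {line_number}: not a valid example")
                continue
            first = messages[0] if isinstance(messages, list) and messages else None
            if not isinstance(first, dict) or first.get("role") != "system":
                problems.append(f"line {line_number}: first message is not a system message")
                continue
            system_messages.add(first.get("content"))
    # All conversations should share one system message
    if len(system_messages) > 1:
        problems.append(f"{len(system_messages)} different system messages found")
    return problems
```

An empty list means the file passed these checks; the single system message it contains is then the text to use as the assistant's instructions.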

In [None]:
assistant = client.beta.assistants.create(
  name="Coding Teacher",
  instructions="You are a factual chatbot that helps students learn by asking them questions",
  model="ft:gpt-3.5-turbo-0125:personal:socratic4:9eLlcKvZ",
  tools=[{"type": "file_search"}, {"type": "code_interpreter"}],
  tool_resources={"file_search": {"vector_store_ids": ["vs_jqaQHX3axAiwMZDTWATBHdQL"]}},
  top_p=0.1,
)

I'll also note here that I created a new vector store without code; as with fine-tuning above, I created it in the OpenAI dashboard. To do this, follow these steps. Go to the OpenAI dashboard and select storage in the left-hand menu, then switch to the Vector Stores tab. Press the +create button in the top right (if there are any pre-existing vector stores, make sure you switch to the one you just created before the next step, as it is not selected automatically). You can then press the +add files button to add files to the vector store. A list of the file types that a vector store accepts can be found here: https://platform.openai.com/docs/assistants/tools/file-search/supported-files. Once we have created a vector store and uploaded our files, we can retrieve the vector store ID from the same window, next to where it says "ID". Above I have manually put this into our tool resources. In this example I'm using a series of .py files from L1 that teach students how to code (for privacy reasons these couldn't be shared on GitHub; please use a repository of knowledge that you find useful for your use case). Below I've specified the name of this source as a string that I can use later.

In [None]:
source = "Year 1 Python Handbook"

Change file_path to the folder where the knowledge base files are stored locally.

In [None]:
current_directory = os.getcwd()
print(current_directory)

In [None]:
file_path = ""

As well as creating the vector store with all the files, the custom search that we created needs an update to handle more files and more file types. To handle several files we define get_all_files_in_folder, which retrieves the paths of all the files in a folder and returns them as a list. We can print this list to make sure it's finding all of them. Then, to save ourselves some headache in the future, we update load_large_document so that it can extract text from PDFs (a later Jupyter notebook covers how to extract images). The original load_large_document logic is kept intact for other files. A similar approach can be taken for other file types that the original load_large_document cannot read. We then define a function called process_files, which calls load_large_document for each of the files found by get_all_files_in_folder and collects the resulting chunks of text.

In [None]:
def get_all_files_in_folder(folder_path):
    try:
        # List all entries in the directory
        entries = os.listdir(folder_path)
        
        # Filter out the files and include the full path
        files = [os.path.join(folder_path, entry) for entry in entries if os.path.isfile(os.path.join(folder_path, entry))]
        
        return files
    except FileNotFoundError:
        # Return an empty list so later code can still iterate over the result
        print(f"The folder path does not exist: {folder_path}")
        return []
    
files = get_all_files_in_folder(file_path)

print(files)

chunks = []

def load_large_document(file_path, chunk_size=2000):
    if file_path.endswith('.pdf'):
        try:
            reader = PdfReader(file_path)
            text = ''
            for page in reader.pages:
                # Guard against pages that yield no extractable text
                text += page.extract_text() or ''
                
            for i in range(0, len(text), chunk_size):
                yield text[i:i + chunk_size]
        except FileNotFoundError:
            print(f"File not found: {file_path}")
        except Exception as e:
            print(f"An error occurred while reading the file {file_path}: {e}")

    else:
        try:
            with open(file_path, 'r', encoding='utf-8') as file:
                while True:
                    chunk = file.read(chunk_size)
                    if not chunk:
                        break
                    yield chunk
        except FileNotFoundError:
            print(f"File not found: {file_path}")
        except Exception as e:
            print(f"An error occurred while reading the file {file_path}: {e}")

def process_files(file_paths, chunk_size=2000):
    for file_path in file_paths:
        for chunk in load_large_document(file_path, chunk_size):
            chunks.append(chunk)

process_files(files)

print(chunks[0])


Our embedding and similarity checks are the same as before. One note for future optimisation: although the query embedding needs to be created every time there is a new query, the embeddings for the chunks do not; these can be stored and retrieved. To make the final product more efficient, you can create the chunk embeddings once, when the knowledge base documents are first uploaded.
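One way to store and retrieve the chunk embeddings is a small cache file on disk. The sketch below is only an illustration of that idea: the function name `load_or_create_embeddings`, the `embed_fn` parameter, and the cache file name are assumptions, and JSON is used purely because it needs no extra libraries.

```python
import json
import os

def load_or_create_embeddings(chunks, embed_fn, cache_path="chunk_embeddings.json"):
    """Return one embedding per chunk, reusing a cached copy when possible.

    embed_fn is any function that maps a list of strings to a list of
    vectors (hypothetical; here it would wrap client.embeddings.create).
    """
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cached = json.load(f)
        # Only trust the cache if it was built from exactly these chunks
        if cached.get("chunks") == chunks:
            return cached["embeddings"]

    embeddings = embed_fn(chunks)
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump({"chunks": chunks, "embeddings": embeddings}, f)
    return embeddings

# Possible wrapper around the API call used in this notebook
# (commented out because it needs a valid API key):
# def embed_fn(texts):
#     response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
#     return [item.embedding for item in response.data]
```

With a cache like this, only the query embedding has to be requested for each new question, as long as the knowledge base files have not changed.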

In [None]:
db = client.embeddings.create(
  model="text-embedding-ada-002",
  input=chunks,
  encoding_format="float"
)

In [None]:
query_response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=[query],
    encoding_format="float"
)

query_embedding = query_response.data[0].embedding
query_embedding = np.array(query_embedding).flatten()

print(query_embedding)

In [None]:
def cosine_similarity(vec1, vec2):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

similarity_scores = []
for embedding_data in db.data:
    chunk_embedding = embedding_data.embedding

    chunk_embedding = np.array(chunk_embedding).flatten()

    score = cosine_similarity(query_embedding, chunk_embedding)
    similarity_scores.append(score)

In [None]:
chunk_scores = list(zip(chunks, similarity_scores))
sorted_chunks = sorted(chunk_scores, key=lambda x: x[1], reverse=True)

top_n = 5
combined_string = f"All documents sourced from: {source}\n"
# Slice rather than index so that fewer than top_n chunks does not raise an error
for chunk, score in sorted_chunks[:top_n]:
    combined_string += chunk + "\n"
    
print(combined_string)
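For larger knowledge bases, the per-chunk loop can be replaced by a single matrix operation. The following is a sketch of that alternative, not the method used in this notebook; the function name `top_n_chunks` and the tiny two-dimensional vectors in the demo are made up for illustration.

```python
import numpy as np

def top_n_chunks(query_embedding, chunk_embeddings, chunks, n=5):
    """Return the n chunks most similar to the query, best first."""
    matrix = np.asarray(chunk_embeddings, dtype=float)  # shape (num_chunks, dim)
    query = np.asarray(query_embedding, dtype=float)    # shape (dim,)

    # Cosine similarity of the query against every chunk at once
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))

    # Indices of the highest scores; slicing copes with fewer than n chunks
    best = np.argsort(scores)[::-1][:n]
    return [(chunks[i], float(scores[i])) for i in best]

# Made-up example with 2-dimensional "embeddings"
demo = top_n_chunks([1.0, 0.0],
                    [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
                    ["a", "b", "c"], n=2)
print(demo)
```

In this demo the chunk "b" points in the same direction as the query, so it is ranked first, followed by "c".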

We've updated the prompt a little to bring it more in line with the fine-tuned model we created, specifying that it should use the "Socratic method", which is a shorthand way of saying: don't give answers, only ask questions.

In [None]:
prompt = f"Answer the question based only on this context: {combined_string}. Answer the question based on the above context: {query}, you should explain as if I do not have access to this context and any source documents, this is your source always quote it when you use it: {source}, run any code you create, quote explicitly any equations in latex format, always use $ when writing latex, employ the socratic method for teaching in your answer, speak as if you're talking with a student."

In [None]:
thread = client.beta.threads.create(
  messages=[
    {
      "role": "user",
      "content": prompt,
    },
    
  ]
)

In [None]:
from typing_extensions import override
from openai import AssistantEventHandler
from PIL import Image
from IPython.display import display, Markdown, Latex, HTML
import io
import requests

class EventHandler(AssistantEventHandler):
    
    def __init__(self):
        super().__init__()
        self.buffer = ""  # Buffer to collect text output
        
    @override
    def on_text_created(self, text) -> None:
        # The text itself arrives through on_text_delta, so nothing is added here;
        # appending in both callbacks would duplicate the first piece of text
        pass

    @override
    def on_text_delta(self, delta, snapshot):
        # Append every delta as it arrives; repeated tokens are legitimate output
        if delta.value:
            self.buffer += delta.value

    @override
    def on_tool_call_created(self, tool_call):
        print(f"\nassistant > {tool_call.type}\n", flush=True)
            
    @override
    def on_tool_call_delta(self, delta, snapshot):
        if delta.type == 'code_interpreter':
            if delta.code_interpreter.input:
                print(delta.code_interpreter.input, end="", flush=True)
            if delta.code_interpreter.outputs:
                print(f"\n\noutput >", flush=True)
                for output in delta.code_interpreter.outputs:
                    if output.type == "logs":
                        print(f"\n{output.logs}", flush=True)
                    elif output.type == "image":
                        # Fetch the image data using the file_id
                        file_id = output.image.file_id
                        image_data = self.download_image(file_id)
                        if image_data:
                            image = Image.open(io.BytesIO(image_data))
                            image.show()
  
    def download_image(self, file_id):
        url = f"https://api.openai.com/v1/files/{file_id}/content"
        headers = {
            "Authorization": f"Bearer {openai_api_key}",
        }
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.content
        else:
            print(f"Failed to download image: {response.status_code} {response.text}")
            return None
        
    def display_output(self):

        # Process the buffer to format LaTeX and code blocks
        processed_content = self.format_buffer(self.buffer)

        # Display as Markdown to correctly render LaTeX and plain text
        display(Markdown(processed_content))

    def format_buffer(self, buffer):
        # Format the buffer to handle LaTeX and code blocks appropriately

        # Split buffer into lines for processing
        lines = buffer.split("\n")
        formatted_lines = []

        in_code_block = False

        for line in lines:
            if line.strip().startswith("code_interpreter"):
                in_code_block = True
                formatted_lines.append(line)
            elif line.strip() == "```":
                in_code_block = False
                formatted_lines.append(line)
            elif in_code_block:
                formatted_lines.append(line)
            else:
                # Check for LaTeX patterns and wrap them in delimiters
                line = line.replace(r'\(', '$').replace(r'\)', '$')
                line = line.replace(r'\[', '$$').replace(r'\]', '$$')
                formatted_lines.append(line)
        
        return "\n".join(formatted_lines)
            
    

Here we can see that the chatbot should ask us questions to help us towards the answer rather than giving it straight away.

In [None]:
event_handler = EventHandler()

with client.beta.threads.runs.stream(
  thread_id=thread.id,
  assistant_id=assistant.id,
  event_handler=event_handler,
) as stream:
  stream.until_done()

event_handler.display_output()

As usual, we can continue the conversation using the code below.

In [None]:
thread_message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="im not sure"
)

event_handler2 = EventHandler()

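with client.beta.threads.runs.stream(
  thread_id=thread.id,
  assistant_id=assistant.id,
  # No instructions argument here: passing one would override the assistant's
  # instructions, which must stay identical to the training system message
  event_handler=event_handler2,
) as stream:
  stream.until_done()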

event_handler2.display_output()