# Build RAG baseline system with OpenAI


 <a target="_blank" href="https://colab.research.google.com/drive/1OudluwP8er680a7adzKbUKF15XrZuitS?usp=sharing">
        <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" />
      </a>

1) Create an [OpenAI account](https://chatgpt.com/)

2) Add a [credit balance](https://platform.openai.com/settings/organization/billing/overview) to your account

3) Install and import libraries

In [None]:
!pip install openai
from openai import OpenAI

Collecting openai
  Downloading openai-1.37.0-py3-none-any.whl (337 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-py3-none-any.whl (58 kB)
Installing collected packages: h11, httpcore, httpx, openai
Successfully installed h11-0.14.0 httpcore-1.0.5 ht

3) Create and set your OpenAI API key

In [None]:
import os

# Set the OpenAI API key
os.environ['OPENAI_API_KEY'] = 'sk-...'
client = OpenAI()
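
Hardcoding the key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key may already be exported as `OPENAI_API_KEY` and falling back to a hidden prompt otherwise. The helper name `resolve_api_key` is hypothetical and is not part of the `openai` package:

```python
import os
from getpass import getpass


def resolve_api_key(env=os.environ, prompt=getpass):
    """Return the key from the environment, or ask for it without echoing it."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        key = prompt("OpenAI API key: ").strip()
    if not key:
        raise ValueError("No OpenAI API key provided")
    return key


# Usage in the notebook (uncomment):
# os.environ["OPENAI_API_KEY"] = resolve_api_key()
# client = OpenAI()
```

The `env` and `prompt` parameters exist only so the function can be exercised without a real key or an interactive session.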

3) Create [Vector store](https://platform.openai.com/docs/assistants/tools/file-search)

In [None]:
import os
import openai

# Create a vector store called "mouse_brain"
vector_store = client.beta.vector_stores.create(name="mouse_brain")

# Directory containing the files
directory_path = "dataset/"

# List all files in the directory
file_paths = [os.path.join(directory_path, file) for file in os.listdir(directory_path) if os.path.isfile(os.path.join(directory_path, file))]

# Ready the files for upload to OpenAI
file_streams = [open(path, "rb") for path in file_paths]

# Use the upload and poll SDK helper to upload the files, add them to the vector store,
# and poll the status of the file batch for completion.
file_batch = client.beta.vector_stores.file_batches.upload_and_poll(
  vector_store_id=vector_store.id, files=file_streams
)

# Close the file handles once the upload has finished
for stream in file_streams:
    stream.close()

# You can print the status and the file counts of the batch to see the result of this operation.
print(file_batch.status)
print(file_batch.file_counts)
print("Vector store created! id : ", vector_store.id)

completed
FileCounts(cancelled=0, completed=287, failed=0, in_progress=0, total=287)
Vector store created! id :  vs_D4IIKl065YOBLeZRirTPPHdg


4) Create Assistant

In [None]:
assistant = client.beta.assistants.create(
  name="Mouse brain assistant",
  instructions="You are an expert in medical imaging. Use your knowledge base to answer requests.",
  model="gpt-4o-mini",
  tools=[{"type": "file_search"}],
  tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)

print("Assistant created! id: ", assistant.id)

Assistant created! id:  asst_N09oOPhNRijJloSAtwsR9MwH


5) Prompt the assistant

In [None]:
# Create a thread containing the user's question
thread = client.beta.threads.create(
  messages=[
    {
      "role": "user",
      "content": "Joël Lefebvre is the author of which scientific paper",
    }
  ]
)

# Use the create and poll SDK helper to create a run and poll the status of
# the run until it's in a terminal state.

run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant.id
)

messages = list(client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id))

message_content = messages[0].content[0].text
annotations = message_content.annotations
citations = []
for index, annotation in enumerate(annotations):
    message_content.value = message_content.value.replace(annotation.text, f"[{index}]")
    if file_citation := getattr(annotation, "file_citation", None):
        cited_file = client.files.retrieve(file_citation.file_id)
        citations.append(f"[{index}] {cited_file.filename}")

print(message_content.value)
print("\n".join(citations))

Joël Lefebvre is the author of several scientific papers. Notable ones include:

1. **"Fully automated dual-resolution serial optical coherence tomography aimed at diffusion MRI validation in whole mouse brains"** (2018) - Co-authored with Patrick Delafontaine-Martel, Philippe Pouliot, Hélène Girouard, Maxime Descoteaux, and Frédéric Lesage[0].

2. **"Whole mouse brain imaging using optical coherence tomography: reconstruction, normalization, segmentation, and comparison with diffusion MRI"** (2017) - Co-authored with Alexandre Castonguay, Philippe Pouliot, Maxime Descoteaux, and Frédéric Lesage[1].

3. **"Whole brain vascular imaging in a mouse model of Alzheimer’s disease with two-photon microscopy"** (2018) - Co-authored with Patrick Delafontaine-Martel, Pier-Luc Tardif, Bernard I. Lévy, Philippe Pouliot, and Frédéric Lesage[2].

4. **"Comparing three-dimensional serial optical coherence tomography histology to MRI imaging in the entire mouse brain"** (2018) - Co-authored with Alexa

6) Create a question dataset

In [None]:
assistant = client.beta.assistants.create(
  name="Question Answer maker",
  instructions="You are an assistant specialized in Multimodal RAG tasks.",
  model="gpt-4o-mini",
  tools=[{"type": "file_search"}]
)
print("Assistant created! id: ", assistant.id)

Assistant created! id:  asst_DCgJVrZ12fDogGIZQtVHr0eV


Get the list of file ids.

In [None]:
import requests

# Define the API key and vector store ID
api_key = 'sk-...'
vector_store_id = "vs_..."

# Function to list all files in the vector store
def list_all_files(vector_store_id):
    all_files = []
    limit = 100  # The number of files to retrieve per request
    last_id = ""

    headers = {
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json',
        'OpenAI-Beta': 'assistants=v2'
    }

    params =  {
        'limit': limit,
    }

    while True:

        if last_id :
            params['after'] = last_id

        response = requests.get(
            f'https://api.openai.com/v1/vector_stores/{vector_store_id}/files',
            headers=headers,
            params=params
        )

        if response.status_code != 200:
            print(f"Error: {response.status_code}, {response.text}")
            break

        data = response.json()

        all_files.extend(data['data'])

        if data["has_more"]:
            last_id=data["last_id"]
        else:
            break

    return all_files

# Retrieve all files
files = list_all_files(vector_store_id)
print(len(files))
print(files[0]['id'])

287
file-Yo5CMB1viixnFQIzp3WK2qsa
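
The loop above follows the cursor fields `has_more` and `last_id` returned by the endpoint. The same logic can be sketched independently of the HTTP call, which makes it checkable offline. Here `fetch_page` is a hypothetical callable standing in for the `requests.get` call, and the fake listing uses made-up IDs:

```python
def iter_all_items(fetch_page, limit=100):
    """Yield every item from a cursor-paginated listing.

    fetch_page(limit, after) must return a dict with "data", "has_more"
    and "last_id" keys, mirroring the response shape used above.
    """
    after = None
    while True:
        page = fetch_page(limit=limit, after=after)
        yield from page["data"]
        if not page.get("has_more"):
            break
        after = page["last_id"]


# Offline demonstration with a fake listing of five items (made-up IDs).
def fake_fetch(limit, after):
    items = [{"id": f"file-{n}"} for n in range(5)]
    if after is None:
        start = 0
    else:
        start = next(i for i, item in enumerate(items) if item["id"] == after) + 1
    chunk = items[start:start + limit]
    return {
        "data": chunk,
        "has_more": start + limit < len(items),
        "last_id": chunk[-1]["id"] if chunk else None,
    }


ids = [item["id"] for item in iter_all_items(fake_fetch, limit=2)]
print(ids)
```

Keeping the cursor handling in one generator means the stopping condition is written once, whatever function performs the request.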


In [None]:
files_data={}
# Retrieve and print file names
for file in files:
    file_metadata = client.files.retrieve(file_id=file['id'])
    files_data[file_metadata.id] = file_metadata.filename
    # print(f"File ID: {file_metadata.id}, File Name: {file_metadata.filename}")

In [None]:
first_key = next(iter(files_data))
first_value = files_data[first_key]
print(first_key)
print(first_value)

file-Yo5CMB1viixnFQIzp3WK2qsa
fnana-09-00047.pdf


In [None]:
vector_store_file = client.beta.vector_stores.create(
  name="vs-file-...",
  file_ids=[first_key]
)

In [None]:
thread = client.beta.threads.create(
  messages=[ { "role": "user", "content": 'Give me 10 relevant questions with their answers about this scientific paper, as if I am a student who needs to prepare for an exam. Give me the response in JSON, in the following format and NOTHING ELSE: { "questions": [{"question": "XXXXXX","answer": "YYYYYY"}, ... ]} where "XXXXXX" is the question and "YYYYYY" is the corresponding answer, which can be as long as needed. Focus on making relevant questions concerning the document.'}],
  tool_resources={
    "file_search": {
      "vector_store_ids": [vector_store_file.id]
    }
  }
)
print("Thread created! id: ", thread.id)

Thread created! id:  thread_mQp1EiJDmXFzd8yGyoRo1bqm


In [None]:
import csv
import os
import json

# Reuse the thread and the assistant created above
thread_id = thread.id
assistant_id = assistant.id

# Create a run and poll until it reaches a terminal state
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread_id, assistant_id=assistant_id
)

messages = list(client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id))

message_content = messages[0].content[0].text
annotations = message_content.annotations

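# Remove a surrounding Markdown code fence (three backticks plus "json"), if any.
# Replacing every "json" substring would also corrupt answers containing that word.
cleaned_content = message_content.value.strip()
if cleaned_content.startswith("```"):
    cleaned_content = cleaned_content.strip("`").strip()
    cleaned_content = cleaned_content.removeprefix("json").strip()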

# Read json
data = json.loads(cleaned_content)
print(data)
qas = data["questions"] # list

# Prepare the data for CSV
qa_data = []
for qa in qas:
  question = qa["question"]
  answer = qa["answer"]
  source_type = "pdf"
  source = first_value
  qa_data.append([source, source_type, question, answer])

# Save to qa.csv
file_path = 'data/qa.csv'

# Check if the file exists
file_exists = os.path.isfile(file_path)

# Open the file in append mode if it exists, write mode if it doesn't
with open(file_path, mode='a' if file_exists else 'w', newline='') as file:
    writer = csv.writer(file)
    if not file_exists:
        writer.writerow(["source", "source_type", "question", "answer"])
    writer.writerows(qa_data)

print("qa.csv file created/updated successfully.")

{'questions': [{'question': "What is the primary focus of the paper 'Role of developmental factors in hypothalamic function'?", 'answer': 'The primary focus of the paper is to summarize the roles of various developmental factors, such as transcription factors and neuropeptides, in the development of the hypothalamus and how these factors contribute to its function in adulthood, using zebrafish and mouse models to explore these regulatory mechanisms.'}, {'question': 'What are the implications of developmental abnormalities in the hypothalamus according to the authors?', 'answer': 'The authors suggest that developmental abnormalities in the hypothalamus can lead to serious health issues, including obesity, sleep disorders, anxiety, depression, and autism, highlighting the importance of proper neuronal development and connectivity for normal hypothalamic function.'}, {'question': 'Which transcription factors are discussed in the paper, and what roles do they play in hypothalamic developme
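
The model is asked to return JSON "and NOTHING ELSE", but replies sometimes arrive wrapped in a code fence or preceded by a short sentence. A minimal sketch of a more tolerant parser, assuming the reply contains a single top-level JSON object. `extract_json_object` is a hypothetical helper and the sample reply is made up:

```python
import json


def extract_json_object(text):
    """Parse the span from the first '{' to the last '}'; return None on failure."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None


# Made-up reply wrapped in a Markdown code fence, with "json" inside an answer.
fence = "`" * 3
reply = (
    "Here you go:\n" + fence + "json\n"
    '{"questions": [{"question": "Q1", "answer": "Stored as json."}]}\n' + fence
)
parsed = extract_json_object(reply)
print(parsed["questions"][0]["answer"])  # Stored as json.
```

Returning `None` instead of raising lets the caller skip a document and move on, which matters once the same step runs over hundreds of files.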

Do this for all the documents

In [None]:
import csv
import os
import json

for i,(file_id, file_name) in enumerate(files_data.items()):
  # Create a new store
  vector_store_file = client.beta.vector_stores.create(
    name="vs-...",
    file_ids=[file_id]
  )
  # Create a thread
  thread = client.beta.threads.create(
    messages=[ { "role": "user", "content": 'Give me 10 relevant questions with their answers about this scientific paper, as if I am a student who needs to prepare for an exam. Give me the response in JSON, in the following format and NOTHING ELSE: { "questions": [{"question": "XXXXXX","answer": "YYYYYY"}, ... ]} where "XXXXXX" is the question and "YYYYYY" is the corresponding answer, which can be as long as needed. Focus on making relevant questions concerning the document.'}],
    tool_resources={
      "file_search": {
        "vector_store_ids": [vector_store_file.id]
      }
    }
  )

  # Reuse the thread created above and the assistant from step 6
  thread_id = thread.id
  assistant_id = assistant.id

  # Create a run and poll until it reaches a terminal state
  run = client.beta.threads.runs.create_and_poll(
      thread_id=thread_id, assistant_id=assistant_id
  )

  messages = list(client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id))

  message_content = messages[0].content[0].text
  annotations = message_content.annotations

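  # Remove a surrounding Markdown code fence (three backticks plus "json"), if any.
  # Replacing every "json" substring would also corrupt answers containing that word.
  cleaned_content = message_content.value.strip()
  if cleaned_content.startswith("```"):
    cleaned_content = cleaned_content.strip("`").strip()
    cleaned_content = cleaned_content.removeprefix("json").strip()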

  # Read json
  try :
    data = json.loads(cleaned_content)
  except Exception as e:
    print("ERROR while parsing: ", e, cleaned_content)
    continue

  if data:
    qas = data["questions"] # list

    # Prepare the data for CSV
    qa_data = []
    for qa in qas:
      question = qa["question"]
      answer = qa["answer"]
      source_type = "pdf"
      source = file_name  # name of the file being processed, not the first file
      qa_data.append([source, source_type, question, answer])

    # Save to qa.csv
    file_path = 'data/qa.csv'

    # Check if the file exists
    file_exists = os.path.isfile(file_path)

    # Open the file in append mode if it exists, write mode if it doesn't
    with open(file_path, mode='a' if file_exists else 'w', newline='') as file:
        writer = csv.writer(file)
        if not file_exists:
            writer.writerow(["source", "source_type", "question", "answer"])
        writer.writerows(qa_data)

    print(f"{i+1}/{len(files_data)}, {file_id}, {file_name}, Q&A added to the csv,", data)

print("qa.csv file created/updated successfully.")

1/287, file-Yo5CMB1viixnFQIzp3WK2qsa, fnana-09-00047.pdf, Q&A added to the csv, {'questions': [{'question': 'What is the primary focus of the research conducted by Biran et al. regarding hypothalamic development?', 'answer': 'The research by Biran et al. primarily focuses on the roles of various transcription factors, specifically Orthopedia (Otp) and Sim1, in the development and functioning of the hypothalamus. The study explores how these factors influence the differentiation and migration of neuropeptide-producing neurons, impacting physiological responses to environmental challenges and homeostasis.'}, {'question': 'How do Otp and Sim1 collaborate during hypothalamic development?', 'answer': 'Otp and Sim1 function together to regulate the expression of key neuropeptides within the neurosecretory preoptic area and paraventricular nucleus. Their interaction is crucial for the proper differentiation of various neuroendocrine cell types, as both transcription factors are essential for 

IndexError: list index out of range
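
The traceback is cut off, so the failing line is not shown. One plausible cause, offered here as an assumption, is `messages[0]` being evaluated on an empty list, for example when a run ends without producing an assistant message. A minimal sketch of a guard follows. `first_text_value` is a hypothetical helper, and the `SimpleNamespace` objects only imitate the attribute layout (`content[0].text.value`) used above:

```python
from types import SimpleNamespace


def first_text_value(messages):
    """Return the text of the first message, or None when there is none."""
    if not messages:
        return None
    content = getattr(messages[0], "content", None)
    if not content:
        return None
    text = getattr(content[0], "text", None)
    return getattr(text, "value", None)


# Stand-in objects with the same attribute layout as the messages above.
fake_message = SimpleNamespace(
    content=[SimpleNamespace(text=SimpleNamespace(value="hello"))]
)
print(first_text_value([fake_message]))  # hello
print(first_text_value([]))              # None
```

In the loop, a `None` result can be logged together with `run.status` and the file skipped, instead of letting one document stop the whole batch.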

In [None]:
import csv
import os
import json

def process_files(start_index=0):
    for i, (file_id, file_name) in enumerate(files_data.items()):
        if i < start_index:
            continue

        try:
            # Create a new store
            vector_store_file = client.beta.vector_stores.create(
                name="vs-file-...",
                file_ids=[file_id]
            )
            # Create a thread
            thread = client.beta.threads.create(
                messages=[{
                    "role": "user",
                    "content": 'Give me 10 relevant questions with their answers about this scientific paper, as if I am a student who needs to prepare for an exam. Give me the response in JSON, in the following format and NOTHING ELSE: { "questions": [{"question": "XXXXXX","answer": "YYYYYY"}, ... ]} where "XXXXXX" is the question and "YYYYYY" is the corresponding answer, which can be as long as needed. Focus on making relevant questions concerning the document.'
                }],
                tool_resources={
                    "file_search": {
                        "vector_store_ids": [vector_store_file.id]
                    }
                }
            )

            # Reuse the thread created above and the assistant from step 6
            thread_id = thread.id
            assistant_id = assistant.id

            # Create a run and poll until it reaches a terminal state
            run = client.beta.threads.runs.create_and_poll(
                thread_id=thread_id, assistant_id=assistant_id
            )

            messages = list(client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id))

            message_content = messages[0].content[0].text
            annotations = message_content.annotations

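            # Remove a surrounding Markdown code fence (three backticks plus "json"), if any.
            # Replacing every "json" substring would also corrupt answers containing that word.
            cleaned_content = message_content.value.strip()
            if cleaned_content.startswith("```"):
                cleaned_content = cleaned_content.strip("`").strip()
                cleaned_content = cleaned_content.removeprefix("json").strip()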

            # Read json
            try:
                data = json.loads(cleaned_content)
            except Exception as e:
                print("ERROR while parsing: ", e, cleaned_content)
                continue

            if data:
                qas = data["questions"]  # list

                # Prepare the data for CSV
                qa_data = []
                for qa in qas:
                    question = qa["question"]
                    answer = qa["answer"]
                    source_type = "pdf"
                    source = file_name  # name of the file currently being processed
                    qa_data.append([source, source_type, question, answer])

                # Save to qa.csv
                file_path = 'data/qa.csv'

                # Check if the file exists
                file_exists = os.path.isfile(file_path)

                # Open the file in append mode if it exists, write mode if it doesn't
                with open(file_path, mode='a' if file_exists else 'w', newline='') as file:
                    writer = csv.writer(file)
                    if not file_exists:
                        writer.writerow(["source", "source_type", "question", "answer"])
                    writer.writerows(qa_data)

                print(f"{i+1}/{len(files_data)}, {file_id}, {file_name}, Q&A added to the csv,", data)

        except Exception as e:
            print(f"An error occurred with file {file_id}: {e}")
            continue

    print("qa.csv file created/updated successfully.")

# Call the function with the desired start index
process_files(start_index=72)

73/287, file-p0rlgb5lUwxhWJoajFi1JZvw, fnana-09-00080.pdf, Q&A added to the csv, {'questions': [{'question': 'What are the main types of viral vectors discussed in the paper?', 'answer': 'The paper discusses several types of viral vectors, including lentiviruses, adeno-associated viruses (AAV), adenoviruses, rabies viruses (RABV), and vesicular stomatitis viruses (VSV). Each vector has unique properties that make them suitable for different applications in neuroanatomy.'}, {'question': 'How do AAVs perform in terms of genome size and application?', 'answer': 'Adeno-associated viruses (AAV) can package a genome of approximately 4.8 Kb, which limits the size of the transgenes that can be expressed. However, AAVs are known for their ability to achieve long-term, stable gene expression in the central nervous system (CNS) and have low immunogenicity, making them ideal for many applications in neuroanatomy.'}, {'question': 'What limitations do rabies viruses pose when used for transsynaptic 
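
`start_index=72` resumes the loop by position, which relies on `files_data` keeping the same order between runs. A sketch of an alternative that resumes by content, assuming the `source` column of `qa.csv` holds the file name as written by the function above. `processed_sources` is a hypothetical helper:

```python
import csv
import os


def processed_sources(csv_path):
    """Return the set of values found in the 'source' column of csv_path."""
    if not os.path.isfile(csv_path):
        return set()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return {row["source"] for row in csv.DictReader(handle) if row.get("source")}


# Usage inside the loop (sketch, not run here):
# done = processed_sources("data/qa.csv")
# for file_id, file_name in files_data.items():
#     if file_name in done:
#         continue  # questions for this file are already in the CSV
#     ...
```

This also avoids appending a second set of questions for a file when the cell is re-run from an earlier index.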

In [None]:
# convert csv to xlsx
import pandas as pd

# Define the file paths
csv_file_path = 'data/qa.csv'
xlsx_file_path = 'data/qa.xlsx'

# Read the CSV file
df = pd.read_csv(csv_file_path)

# Write to an XLSX file
df.to_excel(xlsx_file_path, index=False)

print("CSV file has been successfully converted to XLSX format.")

CSV file has been successfully converted to XLSX format.


Curate questions

In [None]:
# Add questions that are unrelated to the documents.

In [None]:
# Add columns: topic, difficulty, relevancy (bad or good question), ambiguous.
# -> Test a prompt with gpt-4o-mini that returns JSON to evaluate the quality of each question.
# Clear questions like:
# - What limitations did the authors acknowledge in their study?
# - How does this study contribute to the field of neuroscience?
# - What is the main objective of the study?
# - What future research directions does the paper suggest?
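
The listed questions could be asked of almost any paper, so in a retrieval dataset they do not point to one document. Assuming the note means such questions should be flagged for removal, a minimal sketch of an exact-match filter is shown here. `GENERIC_QUESTIONS`, `normalize`, and `is_generic` are hypothetical names, and a real curation pass would likely need fuzzier matching:

```python
import re

# Generic templates taken from the note above; they fit almost any paper.
GENERIC_QUESTIONS = [
    "What limitations did the authors acknowledge in their study?",
    "How does this study contribute to the field of neuroscience?",
    "What is the main objective of the study?",
    "What future research directions does the paper suggest?",
]


def normalize(question):
    """Lowercase and drop punctuation so near-identical wordings compare equal."""
    return re.sub(r"[^a-z0-9 ]+", "", question.lower()).strip()


GENERIC_KEYS = {normalize(q) for q in GENERIC_QUESTIONS}


def is_generic(question):
    """True when the question matches one of the generic templates."""
    return normalize(question) in GENERIC_KEYS


print(is_generic("what is the main objective of the study"))  # True
print(is_generic("How do Otp and Sim1 collaborate during hypothalamic development?"))  # False
```

With pandas, the same check could populate a new column, for example `df["generic"] = df["question"].map(is_generic)`, before any rows are dropped.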

7) Evaluate the model

# other

In [None]:
# Create a new assistant without the context.
assistant_qa = client.beta.assistants.create(
  name="Question Answer maker",
  instructions="You are an assistant specialized in Multimodal RAG tasks.",
  model="gpt-4o-mini",
  tools=[{"type": "file_search"}],
)

print("Assistant created! id: ", assistant_qa.id)

Assistant created! id:  asst_s54yZlhphyxqMHnsKH2bOK3s


In [None]:
# Upload the user provided file to OpenAI
message_file = client.files.create(
  file=open("dataset/045004_1.pdf", "rb"), purpose="assistants"
)


In [None]:
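# Create a thread and attach the file to the message.
# NOTE: prompt_to_create_questions is not defined anywhere in this notebook; it is
# assumed to hold the question-generation prompt used in step 6.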
thread = client.beta.threads.create(
  messages=[
    {
      "role": "user",
      "content": prompt_to_create_questions,
      # Attach the new file to the message.
      "attachments": [
        { "file_id": message_file.id, "tools": [{"type": "file_search"}] }
      ],
    }
  ]
)

# The thread now has a vector store with that file in its tool resources.
print(thread.tool_resources.file_search)

ToolResourcesFileSearch(vector_store_ids=['vs_clrpgxJbbLXv9N9FuqV81TlC'])


In [None]:
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant_qa.id
)

messages = list(client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id))

message_content = messages[0].content[0].text
annotations = message_content.annotations
citations = []
for index, annotation in enumerate(annotations):
    message_content.value = message_content.value.replace(annotation.text, f"[{index}]")
    if file_citation := getattr(annotation, "file_citation", None):
        cited_file = client.files.retrieve(file_citation.file_id)
        citations.append(f"[{index}] {cited_file.filename}")

print(message_content.value)
# print("\n".join(citations))

It seems that there's an issue with retrieving the content of the document you uploaded. Please try uploading the document again, or provide its content directly so I can assist you in generating relevant questions and answers.


In [None]:
del message_file, thread, run, messages, message_content