
# OpenProBono RAG Evaluation

This notebook was created using https://huggingface.co/learn/cookbook/en/rag_evaluation as a guide.

### 0: Install and import dependencies.

In [3]:
!pip install -q torch transformers langchain sentence-transformers tqdm openpyxl openai pandas datasets

In [11]:
import json
from pathlib import Path
from typing import List, Optional, Tuple

import openai
import pandas as pd
from tqdm.auto import tqdm

import milvusdb

pd.set_option("display.max_colwidth", None)

### 1: Load our knowledge base

For this step, we have already loaded the data we wish to evaluate into a `Collection` in our Milvus vector database.

#### 1.1: Load sources
 
We keep a list of sources and use it to filter the collection, so that we can generate questions about one document at a time. The sources can be files or URLs. For this example, we load the list of sources from a file named `urls`.

In [7]:
with Path("urls").open() as f:
    # One source per line; skip blank lines so we never query with an empty URL.
    urls = [line.strip() for line in f if line.strip()]

#### 1.2: Load documents

Now that we have loaded our source URLs, we can write a function that gets the chunks associated with a given source from Milvus. We use a boolean expression filter to select only the chunks whose metadata URL matches.

In [10]:
collection_name = milvusdb.COURTROOM5

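def load_url_documents(url: str):
    # Select only the chunks whose metadata URL matches this source.
    expr = f"metadata['url']=='{url}'"
    hits = milvusdb.get_expr(collection_name, expr)["result"]
    for hit in hits:
        # Promote the URL to a top-level field, then drop the primary key and nested metadata.
        hit["url"] = hit["metadata"]["url"]
        del hit["pk"]
        del hit["metadata"]
    return hits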
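
To make the reshaping step concrete, here is a minimal sketch that applies the same transformation to a hand-written hit, without touching Milvus. The helper name `flatten_hits`, the sample hit, and its `text` field are assumptions for illustration only; the real hits may carry different fields. Unlike the function above, this version returns copies rather than modifying the hits in place.

```python
from typing import Any, Dict, List


def flatten_hits(hits: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the hits with `url` promoted to the top level."""
    flattened = []
    for hit in hits:
        # Keep every field except the primary key and the nested metadata.
        doc = {k: v for k, v in hit.items() if k not in ("pk", "metadata")}
        doc["url"] = hit["metadata"]["url"]
        flattened.append(doc)
    return flattened


# Hypothetical hit: the "text" field name is an assumption, not the real schema.
sample_hits = [
    {"pk": 1, "text": "Example chunk.", "metadata": {"url": "https://example.com/a"}},
]
docs = flatten_hits(sample_hits)
print(docs)
```

Each resulting dictionary keeps the chunk fields plus a top-level `url`, which is the shape the later steps work with.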

#### 1.3: Set up question generation LLM

We will use `gpt-3.5-turbo` for question generation.

In [13]:
model = "gpt-3.5-turbo"

def call_llm(client: openai.OpenAI, prompt: str):
    prompt_msg = {"role": "system", "content": prompt}
    response = client.chat.completions.create(
        model=model,
        messages=[prompt_msg],
        max_tokens=1000,
    )
    return response.choices[0].message.content
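
Before spending API calls, it can help to check the request that `call_llm` builds. The sketch below repeats the function so it runs on its own and passes it a stand-in client; `FakeCompletions` and `fake_client` are hypothetical test helpers, not part of the `openai` package.

```python
from types import SimpleNamespace

model = "gpt-3.5-turbo"


def call_llm(client, prompt: str):
    # Same body as the cell above, repeated so this sketch is self-contained.
    prompt_msg = {"role": "system", "content": prompt}
    response = client.chat.completions.create(
        model=model,
        messages=[prompt_msg],
        max_tokens=1000,
    )
    return response.choices[0].message.content


class FakeCompletions:
    """Hypothetical stand-in that records the request and returns a canned reply."""

    def __init__(self, reply: str):
        self.reply = reply
        self.last_request = None

    def create(self, **kwargs):
        # Remember the keyword arguments so they can be inspected afterwards.
        self.last_request = kwargs
        message = SimpleNamespace(content=self.reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


completions = FakeCompletions("canned reply")
fake_client = SimpleNamespace(chat=SimpleNamespace(completions=completions))

result = call_llm(fake_client, "Say hello.")
print(result)
print(completions.last_request["messages"])
```

With a real client, `openai.OpenAI()` reads the API key from the `OPENAI_API_KEY` environment variable unless one is passed explicitly.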

In [None]:
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: {context}\n
Output:::"""
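
The prompt asks the model to reply with `Factoid question:` and `Answer:` markers, so the completion can be split on those markers. Below is a minimal sketch of filling in the `{context}` placeholder and parsing a reply. The helper `parse_qa_output`, the shortened `prompt_template`, and the sample reply are assumptions for illustration; a real completion may deviate from the requested format, which is why the helper returns `None` when a marker is missing.

```python
from typing import Optional, Tuple

# Shortened stand-in for QA_generation_prompt, only to keep the sketch self-contained.
prompt_template = "Context: {context}\nOutput:::"


def parse_qa_output(output: str) -> Optional[Tuple[str, str]]:
    """Split a completion into (question, answer); return None if a marker is missing."""
    if "Factoid question: " not in output or "Answer: " not in output:
        return None
    # The question sits between the two markers; the answer follows the second one.
    question = output.split("Factoid question: ")[-1].split("Answer: ")[0].strip()
    answer = output.split("Answer: ")[-1].strip()
    return question, answer


# Fill the placeholder with a chunk of text (hand-written here).
prompt = prompt_template.format(context="Example chunk text.")

# Hand-written reply in the requested format, not real model output.
sample_output = "Factoid question: What color is the sky?\nAnswer: Blue."
parsed = parse_qa_output(sample_output)
print(parsed)
```

Returning `None` for malformed replies lets the caller skip them instead of storing a broken question and answer pair.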