<a href="https://colab.research.google.com/github/MoritzLaurer/rag-demo/blob/master/rag_langchain_ai_law.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Evaluating a RAG pipeline with LangChain and Hugging Face Endpoints or OpenAI

This notebook provides a quick demo for creating and evaluating a Retrieval Augmented Generation (RAG) pipeline with LangChain and Hugging Face Endpoints or OpenAI.

The demo has the following main steps:
1. Create an example vector database: The demo downloads 440 position paper PDFs which stakeholders had submitted to the EU public consultation on the EU White Paper on AI in 2020. These PDFs are processed and ingested in a vector database.
2. Generate test questions: We automatically generate test questions about a sample of the texts with an LLM.
3. Run the RAG pipeline: We create a RAG pipeline and feed the generated test questions into it as user queries.
4. RAG evaluation:
  - Retriever quality: If we pose a generated question to the RAG pipeline, does the pipeline's retriever retrieve the original text that was used to generate the question? This provides an indication of retriever (and reranker) quality. Note that this indicator is imperfect, as the retriever could also retrieve other texts that help the RAG pipeline generate good answers beyond only the original text used for generating the test question.
  - Answer quality: We also use an LLM to evaluate answer quality more broadly. This is particularly important for RAG systems, as RAG outputs are unstructured text and these are hard to evaluate with standard metrics like ROUGE, BERTScore etc. Standard metrics require a reference "gold" answer, which is expensive to create at scale.


## Install packages

In [1]:
%%bash
pip install --upgrade pip -q
pip install langchain~=0.0.352
pip install langchainhub~=0.1.14
pip install openai~=1.6.0
pip install tiktoken~=0.5.2
pip install "transformers>=4.35.2"  # quoted so that bash does not treat ">" as an output redirect
pip install huggingface_hub~=0.20.1
pip install sentence_transformers~=2.2.2
pip install qdrant-client~=1.7.0
pip install PyMuPDF~=1.23.7


Collecting urllib3<3,>=1.21.1 (from requests<3,>=2->langchainhub~=0.1.14)
  Downloading urllib3-2.1.0-py3-none-any.whl.metadata (6.4 kB)
Downloading urllib3-2.1.0-py3-none-any.whl (104 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 104.6/104.6 kB 2.2 MB/s eta 0:00:00
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.18
    Uninstalling urllib3-1.26.18:
      Successfully uninstalled urllib3-1.26.18
Successfully installed urllib3-2.1.0
Collecting urllib3<2.0.0,>=1.26.14 (from qdrant-client~=1.7.0)
  Using cached urllib3-1.26.18-py2.py3-none-any.whl.metadata (48 kB)
Using cached urllib3-1.26.18-py2.py3-none-any.whl (143 kB)
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 2.1.0
    Uninstalling urllib3-2.1.0:
      Successfully uninstalled urllib3-2.1.0
Successfully installed urllib3-1.26.18


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
qdrant-client 1.7.0 requires urllib3<2.0.0,>=1.26.14, but you have urllib3 2.1.0 which is incompatible.
tensorboard 2.15.1 requires protobuf<4.24,>=3.19.6, but you have protobuf 4.25.1 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorboard 2.15.1 requires protobuf<4.24,>=3.19.6, but you have protobuf 4.25.1 which is incompatible.
types-requests 2.31.0.20240106 requires urllib3>=2, but you have urllib3 1.26.18 which is incompatible.


In [2]:
import os
from google.colab import userdata
from huggingface_hub import login

# for using hugging face models
login(token=userdata.get('HF_TOKEN'))

# for using OAI models
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_KEY')

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Prepare example data

### Download PDF data

In [3]:
## download PDF data
import os
import zipfile
import requests
from io import BytesIO

# URL of the zip file in your GitHub repo (make sure it's the raw file URL)
zip_url = 'https://github.com/MoritzLaurer/rag-demo/blob/master/data/position-papers-pdfs.zip?raw=true'

# Download the zip file
print("Downloading zip file...")
response = requests.get(zip_url)
response.raise_for_status()  # stop early if the download failed
zip_content = BytesIO(response.content)

# Define the extraction path
extract_path = '/content/data'

# Create directory if it doesn't exist
os.makedirs(extract_path, exist_ok=True)

# Extract the zip file
print("Extracting zip file...")
with zipfile.ZipFile(zip_content, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print("Extraction completed.")

file_paths = [f for f in os.listdir(extract_path) if os.path.isfile(os.path.join(extract_path, f))]
print(f"{len(file_paths)} PDF files downloaded.")


Downloading zip file...
Extracting zip file...
Extraction completed.
440 PDF files downloaded.


### Process data

In [4]:
# parse the raw PDFs into machine-readable docs
from langchain.document_loaders import PyMuPDFLoader
from tqdm.notebook import tqdm

directory = extract_path  # the folder the PDFs were extracted to in the previous cell

docs = []
for pdf_path in tqdm(os.listdir(directory)):
  try:
    docs.append(PyMuPDFLoader(os.path.join(directory, pdf_path)).load())
  except Exception as e:
    print("Exception: ", e)


  0%|          | 0/440 [00:00<?, ?it/s]

Exception:  cannot open broken document
Exception:  cannot open broken document


In [5]:
# split the docs into shorter chunks that fit into LLM context windows
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

# text splitter based on the tokenizer of a model of your choosing,
# so that each chunk fits the transformer's context window
text_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=48, tokens_per_chunk=256, model_name='sentence-transformers/all-mpnet-base-v2'
)
# alternative faster text splitter
#text_splitter = RecursiveCharacterTextSplitter(
#    chunk_size=1000, chunk_overlap=100, add_start_index=True, separators=["\n\n", "\n", ".", " ", ""]
#)

docs_processed = [text_splitter.split_documents(doc) for doc in docs]
docs_processed = [item for sublist in docs_processed for item in sublist]

print(len(docs_processed))

docs_processed[:1]

14726


[Document(page_content='francisco javier diez dept. artificial intelligence computer science school - uned juan del rosal, 16. 28040 madrid. spain office phone : + 34 - 913. 987. 161 mobile phone : + 34 - 646. 794. 342 email : fjdiez @ dia. uned. es www. ia. uned. es / ~ fjdiez madrid, june 8th 2020 additional comments about the white paper on artificial intelligence - a european approach ai should not be restricted to data - driven applications. human expertise has played a significant role in ai since its inception, and will continue do so in the future. there - fore, the definition at the top of page 2 should say : “ simply put, ai is a collection of technologies that combine knowledge, human expertise, data, algorithms and com - puting power. ” the document does not mention how ai will impact the doctor - patient relation, for better or worse. there is a risk of dehumanizing medicine, but also the potential of liberating health staff from tedious tasks and reducing human errors, an
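
To make the effect of `tokens_per_chunk` and `chunk_overlap` more concrete, here is a minimal sketch of overlapping sliding-window chunking. It is not how `SentenceTransformersTokenTextSplitter` is implemented: it assumes whitespace-separated words as stand-in "tokens", and the helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(tokens, tokens_per_chunk=256, chunk_overlap=48):
    """Split a list of tokens into overlapping windows (illustrative helper)."""
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    # each new window starts this many tokens after the previous one
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + tokens_per_chunk])
        # stop once the current window has reached the end of the text
        if start + tokens_per_chunk >= len(tokens):
            break
    return chunks


# toy example: whitespace "tokens" instead of a real tokenizer (assumption)
toy_tokens = "a b c d e f g h i j".split()
toy_chunks = sliding_window_chunks(toy_tokens, tokens_per_chunk=4, chunk_overlap=1)
print(toy_chunks)
```

Each chunk repeats the last `chunk_overlap` tokens of the previous one, so a sentence cut at a chunk boundary still appears in full context in at least one chunk.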

### Sample data to reduce embedding and generation costs

In [6]:
import random
random.seed(42)

# sample corpus for embedding
n_sample_texts = 100
index_random = random.sample(range(len(docs_processed)), n_sample_texts)
docs_samp = [docs_processed[i] for i in index_random]

# sample a smaller set of texts to generate questions from
n_questions = 5
docs_for_q_generation = docs_samp[:n_questions]
docs_for_q_generation = [doc.page_content for doc in docs_for_q_generation]


## Automatic question generation for evaluation

This section generates questions which users could ask about a specific text in the database. This allows us to assess:  
- If we pose a generated question to the RAG pipeline, does the pipeline's retriever retrieve the same text that was used to generate the question? This provides an indication of retriever (and reranker) quality.
- Beyond the original text used for generating the question, the retriever might retrieve other texts that also help the RAG pipeline generate good answers. We therefore also use an LLM to evaluate answer quality more broadly.

In [None]:
# create a Hugging Face Inference Endpoint to run any LLM
# intro: https://www.philschmid.de/inference-endpoints-iac
# docs: https://huggingface.co/docs/huggingface_hub/v0.20.1/en/package_reference/hf_api#huggingface_hub.HfApi.create_inference_endpoint
from huggingface_hub import create_inference_endpoint
from huggingface_hub import HfApi
api = HfApi()

create_new_endpoint = False
model_for_endpoint = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # alternative: "HuggingFaceH4/zephyr-7b-beta"
endpoint_name = "mixtral-8x7b-instruct-v0-1-test"

if create_new_endpoint:
  # define TGI as custom image
  custom_image = {
      "health_route": "/health",  # Health route for TGI
      "env": {
          "MAX_BATCH_PREFILL_TOKENS": "2048", # can be adjusted to your needs
          "MAX_INPUT_LENGTH": "1024", # can be adjusted to your needs
          "MAX_TOTAL_TOKENS": "1512", # can be adjusted to your needs
          "MODEL_ID": "/repository",  # IE will save the model in /repository
      },
      "url": "ghcr.io/huggingface/text-generation-inference:1.3.3",
  }

  # Create Inference Endpoint to run the model selected in model_for_endpoint
  print("Creating Inference Endpoint")
  hf_endpoint = create_inference_endpoint(
      name=endpoint_name,
      repository=model_for_endpoint,
      framework="pytorch",
      task="text-generation",
      vendor="aws",
      region="us-east-1",
      type="protected",
      instance_size="2xlarge",  #"medium",
      instance_type="p4de",  #"g5.2xlarge",  # A10G GPU. Pricing: https://huggingface.co/pricing#endpoints
      accelerator="gpu",
      namespace="HF-test-lab",  # your user or organisation name on the HF hub
      custom_image=custom_image,
  )
  #curl https://api.endpoints.huggingface.cloud/v2/endpoint/MoritzLaurer \ -X POST \ -d '{"compute":{"accelerator":"gpu","instanceSize":"2xlarge","instanceType":"p4de","scaling":{"maxReplica":1,"minReplica":0}},"model":{"framework":"pytorch","image":{"custom":{"health_route":"/health","env":{"MAX_BATCH_PREFILL_TOKENS":"2048","MAX_INPUT_LENGTH":"1024","MAX_TOTAL_TOKENS":"1512","QUANTIZE":"bitsandbytes","MODEL_ID":"/repository"},"url":"ghcr.io/huggingface/text-generation-inference:1.3.4"}},"repository":"mistralai/Mixtral-8x7B-Instruct-v0.1","task":"text-generation"},"name":"aws-mixtral-8x7b-instruct-v0-1","provider":{"region":"us-east-1","vendor":"aws"},"type":"protected"}' \ -H "Content-Type: application/json" \ -H "Authorization: Bearer XXXXX"
  print("Waiting for endpoint to be deployed")
  hf_endpoint.wait()

  print("Endpoint ready")


else:
  print("Waiting for endpoint to be resumed")
  hf_endpoint = api.get_inference_endpoint(name=endpoint_name, namespace="HF-test-lab")
  hf_endpoint.resume()  # resume only works if endpoint was explicitly paused. If endpoint scaled to 0, need to send a request to wake it up
  hf_endpoint.wait()
  print("Endpoint ready")

  # to manage an existing endpoint, use:
  #hf_endpoint.resume()
  #hf_endpoint.pause()
  #hf_endpoint.delete()
  # Endpoints should automatically scale to 0 after 15 minutes to avoid unnecessary costs
  # But you can delete it manually just to be safe

In [8]:
from langchain.chat_models import ChatOpenAI
from langchain.llms import HuggingFaceEndpoint
from langchain.prompts import ChatPromptTemplate, PromptTemplate

provider_for_question_generation = "HF"


if provider_for_question_generation == "HF":
  chat_model = HuggingFaceEndpoint(
    endpoint_url=hf_endpoint.url,  #"https://ytjpei7t003tedav.us-east-1.aws.endpoints.huggingface.cloud",
    task="text-generation",
    huggingfacehub_api_token=userdata.get('HF_TOKEN'),
    model_kwargs={}
  )

elif provider_for_question_generation == "OAI":
  # https://platform.openai.com/docs/api-reference/chat
  chat_model = ChatOpenAI(
      model="gpt-3.5-turbo-1106",  # alternative: "gpt-4-1106-preview"
      temperature=0.2, max_tokens=1024,
      n=1, top_p=0.95,
      frequency_penalty=0.0,  # Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
      presence_penalty=0.0,  # Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
      #response_format={ "type": "json_object" },
      seed=42,
  )



In [9]:
import ast
import numpy as np

# we generate both a question and answer
# having an answer which according to the LLM follows from the question makes it easier to judge the quality of the question
instruction_qa_gen = """\
Your task is to write a factoid question and an answer given a context.

Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine. \
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

After writing the factoid question, also write the corresponding answer that is clearly grounded in the context.

Always answer in this JSON response format: {{"question": "...", "answer": "..."}}

context: {context}\n
JSON response: """



prompt_question_gen = ChatPromptTemplate.from_template(instruction_qa_gen)

chain_question_gen = prompt_question_gen | chat_model

question_answer_dic_lst = []
for context in docs_for_q_generation:
  print("Context:\n", context)

  output_question_dic = chain_question_gen.invoke({"context": context})

  if provider_for_question_generation == "OAI":
    output_question_dic = output_question_dic.content

  try:
    output_question_judge_dic = ast.literal_eval(output_question_dic)
  except (ValueError, SyntaxError):  # the model output could not be parsed as a dict literal
    output_question_judge_dic = {"question": np.nan, "answer": np.nan}

  question_answer_dic_lst.append(output_question_judge_dic)
  print("\nGenerated question with answer:\n", output_question_judge_dic, "\n")

question_lst = [dic["question"] for dic in question_answer_dic_lst]
answer_lst = [dic["answer"] for dic in question_answer_dic_lst]


Context:
 regulatory framework for artificial intelligence in the european union non - paper of the czech republic 26 november 2019 the czech republic recognizes the importance of technologies commonly known as artificial intelligence and their increasing impact on our everyday lives. the european commission, with the support of the european parliament, has expressed its intention to introduce a new ai regulatory framework. so far, the eu has identified several issues related to ai, namely questions of safety and liability, privacy protection, data, copyright ( iprs ), consumer protection as well as protection of fundamental human rights. it puts a strong emphasis on setting up an ethical background for the use of ai. this approach is referred to as human - centric ai and should be based on implementing european values into the research and development of ai systems from the very beginning to maintain a high level of protection of human rights and democracy. thirty years after the fall

In [10]:
# good alternative critique prompts: https://github.com/A-Roucher/RAG_cookbook/blob/master/retrieval_augmented_generation.ipynb

instruction_question_judge = """\
Your task is to score the quality of a question that has been written based on a specific context.

Your scoring criteria for assessing the question are:
- ambiguity: Can the question be clearly, unambiguously answered with the given context?
- form and verbosity: Is the question formulated like a question that a user could ask a search engine? The question should not be accompanied by an answer or other text that users would not include in a search query

Your quality score should be in the range of 0 to 100. \
100 means a very good question, 0 means a very bad question, 50 means a mediocre question.

First briefly reason step-by-step to assess the extent to which the question fulfills these criteria. Your reasoning should be short.
Then return the quality score.

Always answer in this JSON evaluation format: {{"reason": "...", "score": "..."}}

context: "{context}"\n
question: "{question}"\n
JSON evaluation: """


prompt_question_judge = ChatPromptTemplate.from_template(instruction_question_judge)

chain = prompt_question_judge | chat_model

question_judgement_lst = []
for qa_dic, context in zip(question_answer_dic_lst, docs_for_q_generation):
  print("Question:", qa_dic["question"])
  print("Context:", context)

  output_question_judgement = chain.invoke({"question": qa_dic["question"].strip().replace("\n", " "), "context": context.strip()})

  if provider_for_question_generation == "OAI":
    output_question_judgement = output_question_judgement.content

  question_judgement_lst.append(output_question_judgement)
  print("\nJudgement:\n", output_question_judgement, "\n")




Question: What is the Czech Republic's view on the focus of AI deployment?
Context: regulatory framework for artificial intelligence in the european union non - paper of the czech republic 26 november 2019 the czech republic recognizes the importance of technologies commonly known as artificial intelligence and their increasing impact on our everyday lives. the european commission, with the support of the european parliament, has expressed its intention to introduce a new ai regulatory framework. so far, the eu has identified several issues related to ai, namely questions of safety and liability, privacy protection, data, copyright ( iprs ), consumer protection as well as protection of fundamental human rights. it puts a strong emphasis on setting up an ethical background for the use of ai. this approach is referred to as human - centric ai and should be based on implementing european values into the research and development of ai systems from the very beginning to maintain a high leve

In [11]:
# parsing the JSON output can lead to errors
# with open-source models, which don't enforce JSON as well as OAI
import ast
import numpy as np

#output_question_judge_dic = []
output_question_score = []
output_question_reason = []
for output in question_judgement_lst:
  try:
    output_question_judge_dic = ast.literal_eval(output)
    output_question_score.append(int(output_question_judge_dic["score"]))
    output_question_reason.append(output_question_judge_dic["reason"])

  except (ValueError, SyntaxError, KeyError, TypeError):  # unparsable output or missing/invalid fields
    print("This JSON output could not be parsed: ", output)
    #output_question_judge_dic.append(np.nan)
    output_question_score.append(np.nan)
    output_question_reason.append(np.nan)
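
Since open-source models often wrap the requested JSON in extra text, a slightly more tolerant parser can reduce the number of unparsable judgements. The following is a minimal sketch, not part of the original pipeline: the helper name `parse_llm_json` is hypothetical, and it assumes the model returns at most one JSON object per response.

```python
import json
import re


def parse_llm_json(raw_output, required_keys=("reason", "score")):
    """Parse the text between the first "{" and the last "}" as JSON.

    Returns the parsed dict, or None if no valid object with the
    required keys can be found.
    """
    match = re.search(r"\{.*\}", raw_output, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    if not all(key in parsed for key in required_keys):
        return None
    return parsed


# toy example: the model added chatter around the JSON object
example_output = 'Sure! {"reason": "clear question", "score": "95"}\nHope this helps.'
print(parse_llm_json(example_output))
print(parse_llm_json("no json here"))
```

Unlike `ast.literal_eval`, `json.loads` also accepts JSON literals such as `true` and `null`, which can appear in model outputs.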



In [12]:
import pandas as pd

df_questions = pd.DataFrame({
  "question": question_lst,
  "answer": answer_lst,
  "score_question": output_question_score,
  "score_reason": output_question_reason,
  "context": docs_for_q_generation,
})

df_questions

| | question | answer | score_question | score_reason | context |
|---|---|---|---|---|---|
| 0 | What is the Czech Republic's view on the focus... | The Czech Republic believes that the important... | 95 | The question is unambiguous and clearly refers... | regulatory framework for artificial intelligen... |
| 1 | How is control over an AI system's decisions a... | Control is assessed by considering the level o... | 95 | The question is unambiguous and directly relat... | 38 classifying an ai ’ s application context 3... |
| 2 | What approach should be followed in the applic... | The EU whitepaper suggests following an interd... | 95 | The question is unambiguous and clearly relate... | ( in terms of research, startups, infrastructu... |
| 3 | When did the evolution of facial recognition d... | The evolution of facial recognition databases ... | 100 | The question is unambiguous and can be clearly... | other databases have been made public so that ... |
| 4 | What study shows that AI may not recognize peo... | An empirical study, as mentioned in the contex... | 95 | The question is unambiguous and formulated lik... | is mainly confronted with people of white skin... |


In [13]:
# run critique prompt and save question with context etc.
# to csv file that can be loaded downstream
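
The step described in the comments above could look roughly like the following sketch. It uses only the standard library, the helper name `filter_and_write_questions` is hypothetical, and the threshold of 80 is an arbitrary assumption rather than a value from this notebook.

```python
import csv
import io
import math
import numbers


def filter_and_write_questions(rows, file_handle, min_score=80):
    """Write rows with a numeric score_question >= min_score as CSV; return the count."""
    fieldnames = ["question", "answer", "score_question", "context"]
    writer = csv.DictWriter(file_handle, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    n_written = 0
    for row in rows:
        score = row.get("score_question")
        # skip rows whose judgement could not be parsed (missing or NaN score)
        if not isinstance(score, numbers.Real) or math.isnan(score):
            continue
        if score >= min_score:
            writer.writerow(row)
            n_written += 1
    return n_written


# toy rows standing in for the generated questions (made-up values)
toy_rows = [
    {"question": "q1", "answer": "a1", "score_question": 95, "context": "c1"},
    {"question": "q2", "answer": "a2", "score_question": float("nan"), "context": "c2"},
    {"question": "q3", "answer": "a3", "score_question": 40, "context": "c3"},
]
buffer = io.StringIO()  # an opened file (with newline="") would work the same way
n_kept = filter_and_write_questions(toy_rows, buffer, min_score=80)
print(n_kept)
print(buffer.getvalue())
```

In this notebook, the rows could come from `df_questions.to_dict("records")`, and the resulting file could be loaded downstream with `pd.read_csv`.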

## RAG pipeline

### Retrieval

Optimization potential: different retrievers, different rerankers, multi-retrievers

In [14]:
# detailed RAG docs: https://python.langchain.com/docs/use_cases/question_answering/
# FAISS cookbook: https://python.langchain.com/docs/expression_language/cookbook/retrieval
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceEmbeddings, HuggingFaceInferenceAPIEmbeddings
from langchain.schema import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma, Qdrant
from langchain_core.runnables import RunnablePassthrough
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams


In [18]:
# ! issue: langchain vector store wrappers don't seem to allow adjustment to dimensions, only accept OAI default 1.5k
# using qdrant directly instead of langchain wrapper

provider_retrieval_model = "HF"

client_path = "./vectorstore"
collection_name = "collection"

if provider_retrieval_model == "HF":
  qdrantClient = QdrantClient(path=client_path, prefer_grpc=True)

  embeddings = HuggingFaceInferenceAPIEmbeddings(
      api_key=userdata.get('HF_TOKEN'), model_name="sentence-transformers/all-MiniLM-l6-v2"
  )

  dim = 384

elif provider_retrieval_model == "OAI":

  qdrantClient = QdrantClient(path=client_path, prefer_grpc=True)

  embeddings = OpenAIEmbeddings(
          model="text-embedding-ada-002",
          openai_api_key=os.getenv("OPENAI_API_KEY"),
  )

  dim = 1536


qdrantClient.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
)

vectorstore = Qdrant(
    client=qdrantClient,
    collection_name=collection_name,
    embeddings=embeddings,
)

vectorstore.add_documents(docs_samp)

['657e22287c7043e8b5a72335ac98d8ec',
 '249b39850b6a48bc85bd2b0bf16c978f',
 '205f6124599b4b07871a7a8dad95af03',
 '9cf4e924a18f4c569643d973fca8b8f9',
 'a030a62e2fa7407da855f8f3f3eb56f2',
 '3376375246c243b9a5860d97f6aa1563',
 '45fb7ea8f2934f739bd2d9eb12d56c80',
 'ed2da93c017942f19ee5d909f6ddd215',
 'e91e3ad2c10b4a0689c8713f34bc84b4',
 '5c5a3c541e7945b98f2930428148601a',
 '6b7cd0c2fe0f485b8310cc5266e2283d',
 '399468f318254a6a98f7095fc2f99a08',
 '928971f4247b418e9bc7d247191dbb81',
 '6f396122933e4caa89234d94e908a07a',
 'daa022fc82dc4da68e5b73db871bf068',
 '0f4fdfa1a98b4ebdbb1ee939b226635c',
 '362980c3c58d4aa3bdbd6351a2eeb169',
 'cc7c1e9756294caab35dc40555fb7864',
 '466ef8c24a3049a8a684d63a57d4c134',
 '06615430f5ae46838776983a30b89c22',
 '93bfbd39abd04a3b9a3a2174bba076ea',
 '31a24aea41e24878a41d40faee7f3df5',
 'cc2b030a4a384e2b9de99632b66a8f8c',
 'bbde5fa814bc44f59b3bad1145ec843e',
 'c22519e6bf4046a99fa2a7be059e23a2',
 'b991a549f1a1427d95a478bc6212dece',
 '4783d4c99cdc45c7bb5df9f18df08902',
 

In [19]:

# build the retriever and the formatting helper once, outside the loop
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1}
)

def format_docs(docs):
    # join the page contents of all retrieved documents into one string
    return "\n\n".join(doc.page_content for doc in docs)

context_retrieved_lst = []
for question in question_lst:
  context_retrieved = retriever.get_relevant_documents(question)
  context_retrieved_lst.append(format_docs(context_retrieved))
  #print(context_retrieved)


In [20]:
# check if retrieved context for question is same as context used for generating the question
# note that this is an imperfect measure, because the retriever might
# retrieve other texts that are equally relevant as the text used for generating the question
context_for_q_generation = [doc for doc in docs_for_q_generation]
correct_context_retrieved = [a == b for a, b in zip(context_for_q_generation, context_retrieved_lst)]

retrieval_accuracy = sum(correct_context_retrieved) / len(correct_context_retrieved)
print(retrieval_accuracy)


1.0
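
The exact-match check above only works for `k=1`, where the retrieved context is a single text. If the retriever returns several texts per question, a hit rate at k is one possible generalisation. The sketch below assumes the retrieved texts are kept as a list of strings per question (for example the `page_content` of each document) instead of being joined into one string; the helper name `hit_rate_at_k` is hypothetical.

```python
def hit_rate_at_k(source_texts, retrieved_texts_per_question):
    """Share of questions whose source text appears among the retrieved texts."""
    if len(source_texts) != len(retrieved_texts_per_question):
        raise ValueError("Need one list of retrieved texts per source text")
    if not source_texts:
        return 0.0
    hits = [
        source in retrieved
        for source, retrieved in zip(source_texts, retrieved_texts_per_question)
    ]
    return sum(hits) / len(hits)


# toy example with k=2 retrieved texts per question (made-up strings)
toy_sources = ["text a", "text b", "text c"]
toy_retrieved = [["text a", "text x"], ["text y", "text z"], ["text q", "text c"]]
toy_hit_rate = hit_rate_at_k(toy_sources, toy_retrieved)
print(toy_hit_rate)
```

As with the exact-match check, this remains an imperfect indicator: a miss does not necessarily mean the retrieved texts were unhelpful for answering the question.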


In [21]:
# add reranking step
# challenge: reranking with HF models not implemented in langchain
# only cohere reranker seems implemented: https://python.langchain.com/docs/integrations/retrievers/cohere-reranker

"""import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base')
model.eval()

context_question_pairs_lst = []
for question in question_lst:
  context_question_pairs_lst.append([[question, context] for context in context_retrieved_lst])

#pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]

for context_question_pair in context_question_pairs_lst:
  with torch.no_grad():
      inputs = tokenizer(context_question_pair, padding=True, truncation=True, return_tensors='pt', max_length=512)
      scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
      print(scores)
"""

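The reranking idea from the cell above can be sketched independently of any model: retrieve several candidate contexts per question (`k > 1`), score each (question, context) pair, and keep the best ones. In the sketch below, `token_overlap_score` is a toy stand-in for a cross-encoder such as `BAAI/bge-reranker-base`; both helper names are hypothetical.

```python
def token_overlap_score(question, context):
    """Toy relevance score: fraction of question tokens that also occur in the context."""
    question_tokens = set(question.lower().split())
    context_tokens = set(context.lower().split())
    if not question_tokens:
        return 0.0
    return len(question_tokens & context_tokens) / len(question_tokens)


def rerank(question, candidate_contexts, score_fn=token_overlap_score, top_n=1):
    """Score each (question, context) pair and return the top_n contexts, best first."""
    scored = [(score_fn(question, context), context) for context in candidate_contexts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [context for _, context in scored[:top_n]]


# toy example: candidates as they might come back from a retriever with k=3
toy_question = "what is a giant panda"
toy_candidates = [
    "hi",
    "the giant panda is a bear species endemic to china",
    "a regulatory framework for artificial intelligence",
]
toy_best = rerank(toy_question, toy_candidates, top_n=2)
print(toy_best)
```

Note that each question should be paired only with its own retrieved candidates. To use a real reranker, `score_fn` could be replaced by a function that returns the model's logit for the pair.
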
### Answer generation

Optimization potential: different LLMs, different prompt templates

In [22]:

prompt_qa_template = """\
Your task is to answer a question based on a context.
Your answer should be concise and you should only return your answer.

context: {context}
question: {question}
answer: """

prompt_qa_template = PromptTemplate.from_template(prompt_qa_template)


In [23]:
from langchain.llms import HuggingFaceEndpoint

provider_answer_model = "HF"

if provider_answer_model == "HF":
  llm_qa = HuggingFaceEndpoint(
    endpoint_url=hf_endpoint.url,  #"https://nqoa2is3qe7y82ww.us-east-1.aws.endpoints.huggingface.cloud",
    task="text-generation",
    huggingfacehub_api_token=userdata.get('HF_TOKEN'),
    model_kwargs={}
  )
elif provider_answer_model == "OAI":
  llm_qa = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

chain = prompt_qa_template | llm_qa | StrOutputParser()

answer_lst = []
for question, context in zip(question_lst , context_retrieved_lst):
  answer = chain.invoke({"context": context, "question": question})
  answer_lst.append(answer)


### Automatic LLM evaluation of generated answer

In [24]:
# this scoring prompt can be freely adapted to evaluation criteria
# of different use-cases

instruction_judge_answer = """\
Your task is to score the quality of an answer to a question in a given context.

Your scoring criteria for assessing the answer are:
- pertinence: Does the answer directly answer the question?
- context grounding: Is the answer clearly grounded in the context? To be well grounded, the answer does not need to explicitly reference the context.
- conciseness: Is the answer concise without unnecessary verbosity?

Your quality score should be in the range of 0 to 100. \
100 means a very good answer, 0 means a very bad answer, 50 means a mediocre answer.

First briefly reason step-by-step to assess the extent to which the answer fulfills these criteria. Your reasoning should be short.
Then return the quality score.

Always answer in this JSON evaluation format: {{"reason": "...", "score": "..."}}

context: {context}\n
question: "{question}"\n
answer: "{answer}"\n
JSON evaluation: """

instruction_judge_answer = ChatPromptTemplate.from_template(instruction_judge_answer)

# currently need to use OAI here, because it enforces JSON very well
llm_evaluation = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

chain = instruction_judge_answer | llm_evaluation


output_quality_lst = []
for answer, question, context_retrieved in zip(answer_lst, question_lst, context_retrieved_lst):

  output_quality = chain.invoke({
      "context": context_retrieved.strip(),
      "question": question.strip().replace("\n", " "),
      "answer": answer.strip().replace("\n", " ")
  })

  output_quality_lst.append(output_quality.content)





In [25]:
# parsing the JSON output can lead to errors
# with open-source models, which don't enforce JSON as well as OAI
import ast

output_quality_dic = [ast.literal_eval(output) for output in output_quality_lst]
output_quality_score = [int(dic["score"]) for dic in output_quality_dic]
output_quality_reason = [dic["reason"] for dic in output_quality_dic]


## Results

In [26]:
import pandas as pd

df_results = pd.DataFrame({
    "question": question_lst,
    "answer": answer_lst,
    "answer_quality_score": output_quality_score,
    "answer_quality_reason": output_quality_reason,
    "correct_context": [a == b for a, b in zip(context_for_q_generation, context_retrieved_lst)],
    "context_retrieved": context_retrieved_lst,
    "context_for_q_generation": context_for_q_generation
})

mean_answer_score = df_results["answer_quality_score"].mean()
retrieval_accuracy = sum(df_results["correct_context"]) / len(df_results["correct_context"])

print(f"Retrieval accuracy: {retrieval_accuracy}")
print(f"Mean answer score: {mean_answer_score}")
print("\n")

df_results


Retrieval accuracy: 1.0
Mean answer score: 89.0




| | question | answer | answer_quality_score | answer_quality_reason | correct_context | context_retrieved | context_for_q_generation |
|---|---|---|---|---|---|---|---|
| 0 | What is the Czech Republic's view on the focus... | \nThe Czech Republic believes that the importa... | 100 | The answer directly answers the question by st... | True | regulatory framework for artificial intelligen... | regulatory framework for artificial intelligen... |
| 1 | How is control over an AI system's decisions a... | \nControl over an AI system's decisions and ac... | 90 | The answer directly answers the question by ex... | True | 38 classifying an ai ’ s application context 3... | 38 classifying an ai ’ s application context 3... |
| 2 | What approach should be followed in the applic... | \nThe EU whitepaper suggests following an inte... | 100 | The answer directly answers the question by su... | True | ( in terms of research, startups, infrastructu... | ( in terms of research, startups, infrastructu... |
| 3 | When did the evolution of facial recognition d... | 1994 | 75 | The answer directly answers the question and i... | True | other databases have been made public so that ... | other databases have been made public so that ... |
| 4 | What study shows that AI may not recognize peo... | \tAn empirical study shows that AI may not rec... | 80 | The answer directly answers the question and i... | True | is mainly confronted with people of white skin... | is mainly confronted with people of white skin... |
