<a href="https://colab.research.google.com/github/MoritzLaurer/rag-demo/blob/master/rag_langchain_ai_law.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Evaluating a RAG pipeline with LangChain and Hugging Face Endpoints or OpenAI

This notebook provides a quick demo for creating and evaluating a Retrieval Augmented Generation (RAG) pipeline with LangChain and Hugging Face Endpoints or OpenAI.

The demo has the following main steps:
1. Create an example vector database: The demo downloads 440 position paper PDFs which stakeholders had submitted to the EU public consultation on the EU White Paper on AI in 2020. These PDFs are processed and ingested in a vector database.
2. Generate test questions: An LLM automatically generates test questions about a sample of the texts.
3. Build the RAG pipeline: The generated test questions are fed into the RAG pipeline as user queries.
4. RAG evaluation:
  - Retriever quality: If we ask the RAG pipeline a generated question, does its retriever return the same original text that was used to generate the question? This provides an indication of retriever (and reranker) quality. Note that this indicator is imperfect, as the retriever could also return other texts that help the RAG pipeline generate good answers.
  - Answer quality: We also use an LLM to evaluate answer quality more broadly. This is particularly important for RAG systems, because their outputs are unstructured text that is hard to evaluate with standard metrics like ROUGE or BERTScore. These metrics require a reference "gold" answer, which is expensive to create at scale.


## Install packages

In [None]:
%%bash
pip install --upgrade pip -q
pip install langchain~=0.1.0
pip install langchain_mistralai
pip install langchainhub~=0.1.14
pip install openai~=1.6.0
pip install tiktoken~=0.5.2
pip install "transformers>=4.35.2"
pip install huggingface_hub~=0.20.1
pip install sentence_transformers~=2.2.2
pip install qdrant-client~=1.7.0
pip install PyMuPDF~=1.23.7

pip install git+https://github.com/mistralai/client-python


Collecting mistralai<0.0.9,>=0.0.8 (from langchain_mistralai)
  Using cached mistralai-0.0.8-py3-none-any.whl.metadata (1.5 kB)
Using cached mistralai-0.0.8-py3-none-any.whl (14 kB)
Installing collected packages: mistralai
  Attempting uninstall: mistralai
    Found existing installation: mistralai 0.0.1
    Uninstalling mistralai-0.0.1:
      Successfully uninstalled mistralai-0.0.1
Successfully installed mistralai-0.0.8
Collecting urllib3<3,>=1.21.1 (from requests<3,>=2->langchainhub~=0.1.14)
  Using cached urllib3-2.1.0-py3-none-any.whl.metadata (6.4 kB)
Using cached urllib3-2.1.0-py3-none-any.whl (104 kB)
Installing collected packages: urllib3
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.18
    Uninstalling urllib3-1.26.18:
      Successfully uninstalled urllib3-1.26.18
Successfully installed urllib3-2.1.0
Collecting urllib3<2.0.0,>=1.26.14 (from qdrant-client~=1.7.0)
  Using cached urllib3-1.26.18-py2.py3-none-any.whl.metadata (48 kB)
Using cached 

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
qdrant-client 1.7.0 requires urllib3<2.0.0,>=1.26.14, but you have urllib3 2.1.0 which is incompatible.
tensorboard 2.15.1 requires protobuf<4.24,>=3.19.6, but you have protobuf 4.25.2 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorboard 2.15.1 requires protobuf<4.24,>=3.19.6, but you have protobuf 4.25.2 which is incompatible.
types-requests 2.31.0.20240106 requires urllib3>=2, but you have urllib3 1.26.18 which is incompatible.
  Running command git clone --filter=blob:none --quiet https://github.com/mistralai/client-python /tmp/pip-req-build-q8g314su
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. T

In [None]:
import os
from google.colab import userdata
from huggingface_hub import login

# for using hugging face models
login(token=userdata.get('HF_TOKEN'))

# for using OAI models
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_KEY')

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Prepare example data

### Download PDF data

In [None]:
## download PDF data
import os
import zipfile
import requests
from io import BytesIO

# URL of the zip file in your GitHub repo (make sure it's the raw file URL)
zip_url = 'https://github.com/MoritzLaurer/rag-demo/blob/master/data/position-papers-pdfs.zip?raw=true'

# Download the zip file
print("Downloading zip file...")
response = requests.get(zip_url)
response.raise_for_status()  # fail early if the download did not succeed
zip_content = BytesIO(response.content)

# Define the extraction path
extract_path = '/content/data'

# Create directory if it doesn't exist
os.makedirs(extract_path, exist_ok=True)

# Extract the zip file
print("Extracting zip file...")
with zipfile.ZipFile(zip_content, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print("Extraction completed.")

file_paths = [f for f in os.listdir(extract_path) if os.path.isfile(os.path.join(extract_path, f))]
print(f"{len(file_paths)} PDF files downloaded.")


Downloading zip file...
Extracting zip file...
Extraction completed.
440 PDF files downloaded.


### Process data

In [None]:
# parse the raw PDFs into machine-readable docs
from langchain.document_loaders import PyMuPDFLoader
from tqdm.notebook import tqdm

directory = extract_path  # '/content/data', the folder the zip file was extracted to

docs = []
for pdf_path in tqdm(os.listdir(directory)):
  try:
    docs.append(PyMuPDFLoader(os.path.join(directory, pdf_path)).load())
  except Exception as e:
    print("Exception: ", e)


  0%|          | 0/440 [00:00<?, ?it/s]

Exception:  cannot open broken document
Exception:  cannot open broken document


In [None]:
# split the docs into shorter chunks that fit into LLM context windows
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
from transformers import AutoTokenizer

# text splitter based on the tokenizer of a model of your choosing
# to make texts fit exactly a transformer's context window size
# langchain text splitters: https://python.langchain.com/docs/modules/data_connection/document_transformers/
chunk_size = 256
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        AutoTokenizer.from_pretrained("BAAI/bge-small-en-v1.5"),
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size / 10),
        add_start_index=True,
        strip_whitespace=True,
        separators=["\n\n", "\n", ".", " ", ""],
)


docs_processed = [text_splitter.split_documents(doc) for doc in docs]
docs_processed = [item for sublist in docs_processed for item in sublist]

print(len(docs_processed))

docs_processed[:1]

14785


[Document(page_content='1 the role of artificial intelligence within in silico medicine vph institute – avicenna alliance white paper provisional executive summary june 12th 2020 contributions liesbet geris, phd – university of liege & ku leuven ; vph institute ; avicenna alliance cecile f. rousseau, phd - voisin consulting life sciences ; avicenna alliance marco viceconti, phd – alma mater studiorum - university of bologna ; vph institute ; avicenna alliance alfons g. hoekstra, phd – university of amsterdam ; vph institute ; avicenna alliance emmanuelle m. voisin, phd – voisin consulting life sciences ; avicenna alliance markus reiterer, phd – medtronic, plc ; avicenna alliance martha de cunha - burgman, msc – medtronic, plc ; avicenna alliance michael auffret, msc – voisin consulting life sciences ; avicenna alliance payman afshari, phd – johnson and johnson ; avicenna alliance wen - yang chu, msc – virtonomy. io ; avicenna alliance thierry marchal, mecheng, mba – ansys ; avicenna al
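To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size windows with overlap over a list of tokens. It is a simplification, not what `RecursiveCharacterTextSplitter` does internally: the real splitter first tries to break on the configured separators, and the placeholder tokens below stand in for the output of the `BAAI/bge-small-en-v1.5` tokenizer. The helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(tokens, chunk_size=256, chunk_overlap=25):
    """Split a token list into windows of at most chunk_size tokens,
    where consecutive windows share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# assumption: placeholder tokens instead of real tokenizer output
tokens = [f"tok{i}" for i in range(600)]
chunks = sliding_chunks(tokens, chunk_size=256, chunk_overlap=int(256 / 10))
print([len(c) for c in chunks])
```

With the notebook's settings, each window advances by 231 tokens, so the last 25 tokens of one chunk reappear at the start of the next. This keeps sentences that straddle a boundary retrievable from at least one chunk.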

#### Sample data to reduce embedding and generation costs

In [None]:
import random
random.seed(42)

# sample corpus for embedding
n_sample_texts = 100
index_random = random.sample(range(len(docs_processed)), n_sample_texts)
docs_samp = [docs_processed[i] for i in index_random]

# sample a smaller set of texts to generate questions from
n_questions = 5
docs_for_q_generation = docs_samp[:n_questions]
docs_for_q_generation = [doc.page_content for doc in docs_for_q_generation]


## Automatic question generation for evaluation

This section generates questions which users could ask about a specific text in the database. This allows us to assess:  
- If we ask the RAG pipeline a generated question, does its retriever return the same text that was used to generate the question? This provides an indication of retriever (and reranker) quality.
- Beyond the original text used for generating the question, the retriever might return other texts that also help the RAG pipeline generate good answers. We therefore also use an LLM to evaluate answer quality more broadly.

In [None]:
# create a Hugging Face Inference Endpoint to run any LLM
# intro: https://www.philschmid.de/inference-endpoints-iac
# docs: https://huggingface.co/docs/huggingface_hub/v0.20.1/en/package_reference/hf_api#huggingface_hub.HfApi.create_inference_endpoint
from huggingface_hub import create_inference_endpoint
from huggingface_hub import HfApi
api = HfApi()

create_new_endpoint = False
model_for_endpoint = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # alternative: "HuggingFaceH4/zephyr-7b-beta"
endpoint_name = "mixtral-8x7b-instruct-v0-1-test"

if create_new_endpoint:
  # define TGI as custom image
  custom_image = {
      "health_route": "/health",  # Health route for TGI
      "env": {
          "MAX_BATCH_PREFILL_TOKENS": "2048", # can be adjusted to your needs
          "MAX_INPUT_LENGTH": "1024", # can be adjusted to your needs
          "MAX_TOTAL_TOKENS": "1512", # can be adjusted to your needs
          "MODEL_ID": "/repository",  # IE will save the model in /repository
      },
      "url": "ghcr.io/huggingface/text-generation-inference:1.3.3",
  }

  # Create Inference Endpoint to run the model defined in model_for_endpoint
  print("Creating Inference Endpoint")
  hf_endpoint = create_inference_endpoint(
      name=endpoint_name,
      repository=model_for_endpoint,
      framework="pytorch",
      task="text-generation",
      vendor="aws",
      region="us-east-1",
      type="protected",
      instance_size="2xlarge",  #"medium",
      instance_type="p4de",  #"g5.2xlarge",  # A10G GPU. Pricing: https://huggingface.co/pricing#endpoints
      accelerator="gpu",
      namespace="HF-test-lab",  # your user or organisation name on the HF hub
      custom_image=custom_image,
  )
  #curl https://api.endpoints.huggingface.cloud/v2/endpoint/MoritzLaurer \ -X POST \ -d '{"compute":{"accelerator":"gpu","instanceSize":"2xlarge","instanceType":"p4de","scaling":{"maxReplica":1,"minReplica":0}},"model":{"framework":"pytorch","image":{"custom":{"health_route":"/health","env":{"MAX_BATCH_PREFILL_TOKENS":"2048","MAX_INPUT_LENGTH":"1024","MAX_TOTAL_TOKENS":"1512","QUANTIZE":"bitsandbytes","MODEL_ID":"/repository"},"url":"ghcr.io/huggingface/text-generation-inference:1.3.4"}},"repository":"mistralai/Mixtral-8x7B-Instruct-v0.1","task":"text-generation"},"name":"aws-mixtral-8x7b-instruct-v0-1","provider":{"region":"us-east-1","vendor":"aws"},"type":"protected"}' \ -H "Content-Type: application/json" \ -H "Authorization: Bearer XXXXX"
  print("Waiting for endpoint to be deployed")
  hf_endpoint.wait()

  print("Endpoint ready")


else:
  print("Waiting for endpoint to be resumed")
  hf_endpoint = api.get_inference_endpoint(name=endpoint_name, namespace="HF-test-lab")
  hf_endpoint.resume()  # resume only works if endpoint was explicitly paused. If endpoint scaled to 0, need to send a request to wake it up
  hf_endpoint.wait()
  print("Endpoint ready")

  # to manage an existing endpoint, use:
  #hf_endpoint.resume()
  #hf_endpoint.pause()
  #hf_endpoint.delete()
  # Endpoints should automatically scale to 0 after 15 minutes to avoid unnecessary costs
  # But you can delete it manually just to be safe

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.llms import HuggingFaceEndpoint
from langchain_mistralai.chat_models import ChatMistralAI
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.messages import HumanMessage


provider_for_question_generation = "MISTRAL"


if provider_for_question_generation == "HF":
  chat_model = HuggingFaceEndpoint(
    endpoint_url=hf_endpoint.url,  #"https://ytjpei7t003tedav.us-east-1.aws.endpoints.huggingface.cloud",
    task="text-generation",
    huggingfacehub_api_token=userdata.get('HF_TOKEN'),
    model_kwargs={}
  )

elif provider_for_question_generation == "OAI":
  # https://platform.openai.com/docs/api-reference/chat
  chat_model = ChatOpenAI(
      model="gpt-3.5-turbo-1106",  # alternative: "gpt-4-1106-preview"
      temperature=0.2, max_tokens=1024,
      n=1, top_p=0.95,
      frequency_penalty=0.0,  # Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
      presence_penalty=0.0,  # Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
      #response_format={ "type": "json_object" },
      seed=42,
  )

elif provider_for_question_generation == "MISTRAL":
  # source: https://github.com/langchain-ai/langchain/blob/9b3962fc2521ec0d6ef2ea7c0a40b9c32977671a/libs/partners/mistralai/langchain_mistralai/chat_models.py#L156C6-L156C6
  # docs: https://docs.mistral.ai/platform/client/  or https://python.langchain.com/docs/integrations/chat/mistralai
  chat_model = ChatMistralAI(
      mistral_api_key=userdata.get('MISTRAL_KEY'),
      max_retries=5,
      timeout=60,
      max_concurrent_requests=2,
      model="mistral-small",
      temperature=0.2,
      max_tokens=1024,
      top_p=0.95,  #Decode using nucleus sampling: consider the smallest set of tokens whose probability sum is at least top_p. Must be in the closed interval [0.0, 1.0].
      random_seed=42,
      safe_mode=False,
  )



In [None]:
import ast
import numpy as np

# we generate both a question and answer
# having an answer which according to the LLM follows from the question makes it easier to judge the quality of the question
instruction_qa_gen = """\
Your task is to write a factoid question and an answer given a context.

Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine. \
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

After writing the factoid question, also write the corresponding answer that is clearly grounded in the context.

Always answer in this JSON response format: {{"question": "...", "answer": "..."}}

context: {context}\n
JSON response: """



prompt_question_gen = ChatPromptTemplate.from_template(instruction_qa_gen)

chain_question_gen = prompt_question_gen | chat_model

question_answer_dic_lst = []
for context in docs_for_q_generation:
  print("Context:\n", context)

  output_question_dic = chain_question_gen.invoke({"context": context})

  if provider_for_question_generation in ["OAI", "MISTRAL"]:
    output_question_dic = output_question_dic.content

  try:
    output_question_judge_dic = ast.literal_eval(output_question_dic)
  except (ValueError, SyntaxError, TypeError):
    # the model did not return a parseable dict literal
    output_question_judge_dic = {"question": np.nan, "answer": np.nan}

  question_answer_dic_lst.append(output_question_judge_dic)
  print("\nGenerated question with answer:\n", output_question_judge_dic, "\n")

question_lst = [dic["question"] for dic in question_answer_dic_lst]
answer_lst = [dic["answer"] for dic in question_answer_dic_lst]


Context:
 framework to assess the potential benefits. such opportunities include but are not 14 limited to : improvements to fairness, health, privacy, equity or efficiency. 15 3. the framework could assess the risk of tasks instead of sectors. the framework proposes to assess risks based on the industry sector. we suggest that an alternative basis should be considered and we suggest “ tasks ” as such an alternative. the 16 motivation to assess the risks of tasks instead of sectors is that sectors differ greatly internally with respect to the risk that ai tools pose. for example, although the health care sector appears to exhibit greater risks than municipal garbage collection, this need not be the case. as municipal garbage collection transitions to autonomous vehicle technology, very mundane driving decisions, such as whether the vehicles should avoid left turns, can have a significant negative impact on population safety in the aggregate. 17 likewise, accounting may as a whole appea

In [None]:
# good alternative critique prompts: https://github.com/A-Roucher/RAG_cookbook/blob/master/retrieval_augmented_generation.ipynb

instruction_question_judge = """\
Your task is to score the quality of a question that has been written based on a specific context.

Your scoring criteria for assessing the question are:
- ambiguity: Can the question be clearly, unambiguously answered with the given context?
- form and verbosity: Is the question formulated like a question that a user could ask to a search engine? The question should not be accompanied by an answer or other text that users would not ask in a search query

Your quality score should be in the range of 0 to 100. \
100 means a very good question, 0 means a very bad question, 50 means a mediocre question.

First briefly reason step-by-step to assess the extent to which the question fulfills these criteria. Your reasoning should be short.
Then return the quality score.

Always answer in this JSON evaluation format: {{"reason": "...", "score": "..."}}

context: "{context}"\n
question: "{question}"\n
JSON evaluation: """


prompt_question_judge = ChatPromptTemplate.from_template(instruction_question_judge)

chain = prompt_question_judge | chat_model

question_judgement_lst = []
for qa_dic, context in zip(question_answer_dic_lst, docs_for_q_generation):
  print("Question:", qa_dic["question"])
  print("Context:", context)

  output_question_judgement = chain.invoke({"question": qa_dic["question"].strip().replace("\n", " "), "context": context.strip()})

  # chat models (OAI, MISTRAL) return a message object; keep only its text content
  if provider_for_question_generation in ["OAI", "MISTRAL"]:
    output_question_judgement = output_question_judgement.content

  question_judgement_lst.append(output_question_judgement)
  print("\nJudgement:\n", output_question_judgement, "\n")




Question: How does the framework suggest assessing AI risks differently?
Context: framework to assess the potential benefits. such opportunities include but are not 14 limited to : improvements to fairness, health, privacy, equity or efficiency. 15 3. the framework could assess the risk of tasks instead of sectors. the framework proposes to assess risks based on the industry sector. we suggest that an alternative basis should be considered and we suggest “ tasks ” as such an alternative. the 16 motivation to assess the risks of tasks instead of sectors is that sectors differ greatly internally with respect to the risk that ai tools pose. for example, although the health care sector appears to exhibit greater risks than municipal garbage collection, this need not be the case. as municipal garbage collection transitions to autonomous vehicle technology, very mundane driving decisions, such as whether the vehicles should avoid left turns, can have a significant negative impact on populati

In [None]:
# parsing the JSON output can lead to errors
# with open-source models, which don't enforce JSON as well as OAI
import ast
import numpy as np

output_question_score = []
output_question_reason = []
for output in question_judgement_lst:
  try:
    output_question_judge_dic = ast.literal_eval(output)
    score = int(output_question_judge_dic["score"])
    reason = output_question_judge_dic["reason"]
  except (ValueError, SyntaxError, TypeError, KeyError):
    print("This JSON output could not be parsed: ", output)
    score, reason = np.nan, np.nan

  # append once per output so that both lists stay aligned
  output_question_score.append(score)
  output_question_reason.append(reason)



This JSON output could not be parsed:  content='{"reason": "The question is clear and unambiguous, and it directly relates to the context provided. However, the question could be improved by specifying which framework is being referred to, as there are two different frameworks mentioned in the context. Despite this, the question is still specific enough to be answered. The form of the question is also appropriate for a search engine query. The score is slightly reduced due to the ambiguity regarding which framework is being referred to.", "score": "85"}'
This JSON output could not be parsed:  content='{"reason": "The question is formulated clearly and unambiguously, using proper grammar and verb structure. It refers specifically to \'measure 3\' and asks about the \'approach to innovations centers\' which is mentioned in the context. The question does not contain any ambiguity and is formulated like a question that a user could ask to a search engine. Therefore, it fulfills the criteri
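The failures above show the raw message object (`content='...'`) reaching the parser instead of a plain string. A more forgiving parser can be sketched as follows. This is a minimal sketch under assumptions: the helper name `parse_llm_json` and the `FakeMessage` class are hypothetical, and the approach (take the text from a `.content` attribute if present, cut out the span between the first `{` and the last `}`, then parse it with `json.loads`) is one option among several.

```python
import json
import re


def parse_llm_json(raw, required_keys=("reason", "score")):
    """Return the JSON object found in `raw` as a dict, or None if parsing fails."""
    # chat models return a message object; fall back to its .content attribute
    text = getattr(raw, "content", raw)
    if not isinstance(text, str):
        return None
    # keep the span from the first "{" to the last "}" to drop surrounding chatter
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict) or not all(k in parsed for k in required_keys):
        return None
    return parsed


class FakeMessage:  # hypothetical stand-in for a chat model's message object
    content = '{"reason": "ok", "score": "70"}'


example = 'Sure! {"reason": "Clear and specific.", "score": "85"} Hope this helps.'
result = parse_llm_json(example)
print(result)
print(parse_llm_json(FakeMessage()))
print(parse_llm_json("no json here"))
```

Returning `None` instead of raising lets the calling loop decide how to record a failed judgement, for example as `np.nan`.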

In [None]:
import pandas as pd

df_questions = pd.DataFrame({
  "question": question_lst,
  "answer": answer_lst,
  "score_question": output_question_score,
  "score_reason": output_question_reason,
  "context": docs_for_q_generation,
})

df_questions

| | question | answer | score_question | score_reason | context |
|---|---|---|---|---|---|
| 0 | How does the framework suggest assessing AI ri... | The framework suggests assessing AI risks base... | NaN | NaN | framework to assess the potential benefits. su... |
| 1 | Which approach to innovations centers is consi... | Measure 3 emphasizes the crucial role of inclu... | NaN | NaN | zivilgesellschaft eingebunden werden sollen un... |
| 2 | What is EDRI's stance on the necessity and pro... | EDRI argues that using facial recognition syst... | NaN | NaN | cnil, for permission to use a facial recogniti... |
| 3 | How can artificial intelligence impact researc... | Artificial intelligence has the potential to s... | NaN | NaN | le domaine de l ’ ia pourrait etre a meme de s... |
| 4 | What article in GDPR outlines the data subject... | Article 22(3) of GDPR, 2016/679, gives the dat... | NaN | NaN | i. e., which data was influential ), and to fa... |


In [None]:
# run critique prompt and save question with context etc.
# to csv file that can be loaded downstream

## RAG pipeline

### Retrieval

Optimization potential: different retrievers, different rerankers, multi-retrievers
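As an illustration of one of these options, reranking means retrieving a larger candidate set first and then reordering it with a second, usually more expensive, relevance score. The sketch below is deliberately model-free: `rerank` and `token_overlap_score` are hypothetical helpers, and the token-overlap score is only a toy stand-in for a real reranker such as a cross-encoder.

```python
def rerank(question, candidates, score_fn, top_n=1):
    """Sort candidate texts by score_fn(question, text), best first, and keep top_n."""
    ranked = sorted(candidates, key=lambda text: score_fn(question, text), reverse=True)
    return ranked[:top_n]


def token_overlap_score(question, text):
    """Toy relevance score: fraction of question tokens that occur in the text."""
    q_tokens = set(question.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / max(len(q_tokens), 1)


# assumption: these candidates would come from a retriever with k > 1
candidates = [
    "the consultation received many responses",
    "the framework could assess the risk of tasks instead of sectors",
    "facial recognition in schools",
]
best = rerank("should the framework assess tasks or sectors", candidates, token_overlap_score)
print(best)
```

Because the scoring function is passed in as an argument, the toy score can be swapped for a model-based one without changing the rest of the pipeline.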

In [None]:
# detailed RAG docs: https://python.langchain.com/docs/use_cases/question_answering/
# FAISS cookbook: https://python.langchain.com/docs/expression_language/cookbook/retrieval
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceEmbeddings, HuggingFaceInferenceAPIEmbeddings
from langchain.schema import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma, Qdrant
from langchain_core.runnables import RunnablePassthrough
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams


In [None]:
# ! issue: langchain vector store wrappers don't seem to allow adjustment to dimensions, only accept OAI default 1.5k
# using qdrant directly instead of langchain wrapper

provider_retrieval_model = "HF"

client_path = "./vectorstore"
collection_name = "collection"

if provider_retrieval_model == "HF":
  qdrantClient = QdrantClient(path=client_path, prefer_grpc=True)

  embeddings = HuggingFaceInferenceAPIEmbeddings(
      api_key=userdata.get('HF_TOKEN'), model_name="sentence-transformers/all-MiniLM-l6-v2"
  )

  dim = 384

elif provider_retrieval_model == "OAI":

  qdrantClient = QdrantClient(path=client_path, prefer_grpc=True)

  embeddings = OpenAIEmbeddings(
          model="text-embedding-ada-002",
          openai_api_key=os.getenv("OPENAI_API_KEY"),
  )

  dim = 1536


qdrantClient.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
)

vectorstore = Qdrant(
    client=qdrantClient,
    collection_name=collection_name,
    embeddings=embeddings,
)

vectorstore.add_documents(docs_samp)

In [None]:

def format_docs(docs):
  return "\n\n".join(doc.page_content for doc in docs)

# k=1: only the single most similar chunk is retrieved per question
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1}
)

context_retrieved_lst = []
for question in question_lst:
  context_retrieved = retriever.get_relevant_documents(question)
  context_retrieved_lst.append(format_docs(context_retrieved))


In [None]:
# check if retrieved context for question is same as context used for generating the question
# note that this is an imperfect measure, because the retriever might
# retrieve other texts that are equally relevant as the text used for generating the question
context_for_q_generation = [doc for doc in docs_for_q_generation]
correct_context_retrieved = [a == b for a, b in zip(context_for_q_generation, context_retrieved_lst)]

retrieval_accuracy = sum(correct_context_retrieved) / len(correct_context_retrieved)
print(retrieval_accuracy)


0.2
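The exact-match check above only works for `k=1`, because with more retrieved chunks the joined string no longer equals the source text. A common generalisation is a hit rate at k: a question counts as a hit if its source chunk appears anywhere among the top-k retrieved chunks. The sketch below uses toy data and a hypothetical helper `hit_rate_at_k`; in the notebook, the per-question lists would be the `page_content` values returned by the retriever, kept as lists instead of being joined.

```python
def hit_rate_at_k(source_texts, retrieved_texts_per_question):
    """Share of questions whose source chunk appears among the retrieved chunks."""
    hits = [
        source in retrieved
        for source, retrieved in zip(source_texts, retrieved_texts_per_question)
    ]
    return sum(hits) / len(hits)


# toy data (hypothetical): three questions with their top-2 retrieved chunks
sources = ["chunk a", "chunk b", "chunk c"]
retrieved = [
    ["chunk a", "chunk x"],  # hit at rank 1
    ["chunk y", "chunk b"],  # hit at rank 2, missed when k=1
    ["chunk z", "chunk w"],  # miss
]
top2 = hit_rate_at_k(sources, retrieved)
top1 = hit_rate_at_k(sources, [r[:1] for r in retrieved])
print(top2, top1)
```

Comparing the hit rate at different values of k shows whether the source chunk is ranked just below the cut-off or is not found at all.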


In [None]:
# add reranking step
# challenge: reranking with HF models not implemented in langchain
# only cohere reranker seems implemented: https://python.langchain.com/docs/integrations/retrievers/cohere-reranker

"""import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base')
model.eval()

context_question_pairs_lst = []
for question in question_lst:
  context_question_pairs_lst.append([[question, context] for context in context_retrieved_lst])

#pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]

for context_question_pair in context_question_pairs_lst:
  with torch.no_grad():
      inputs = tokenizer(context_question_pair, padding=True, truncation=True, return_tensors='pt', max_length=512)
      scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
      print(scores)
"""

### Answer generation

Optimization potential: different LLMs, different prompt templates

In [None]:

prompt_qa_template = """\
Your task is to answer a question based on a context.
Your answer should be concise and you should only return your answer.

context: {context}
question: {question}
answer: """

prompt_qa_template = PromptTemplate.from_template(prompt_qa_template)


In [None]:
from langchain.llms import HuggingFaceEndpoint

provider_answer_model = "MISTRAL"


if provider_answer_model == "HF":
  llm_qa = HuggingFaceEndpoint(
    endpoint_url=hf_endpoint.url,  #"https://nqoa2is3qe7y82ww.us-east-1.aws.endpoints.huggingface.cloud",
    task="text-generation",
    huggingfacehub_api_token=userdata.get('HF_TOKEN'),
    model_kwargs={}
  )

elif provider_answer_model == "OAI":
  llm_qa = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

elif provider_answer_model == "MISTRAL":
  # source: https://github.com/langchain-ai/langchain/blob/9b3962fc2521ec0d6ef2ea7c0a40b9c32977671a/libs/partners/mistralai/langchain_mistralai/chat_models.py#L156C6-L156C6
  # docs: https://docs.mistral.ai/platform/client/  or https://python.langchain.com/docs/integrations/chat/mistralai
  llm_qa = ChatMistralAI(
      mistral_api_key=userdata.get('MISTRAL_KEY'),
      max_retries=5,
      timeout=60,
      max_concurrent_requests=2,
      model="mistral-small",
      temperature=0.2,
      max_tokens=1024,
      top_p=0.95,  #Decode using nucleus sampling: consider the smallest set of tokens whose probability sum is at least top_p. Must be in the closed interval [0.0, 1.0].
      random_seed=42,
      safe_mode=False,
  )


chain = prompt_qa_template | llm_qa | StrOutputParser()

answer_lst = []
for question, context in zip(question_lst , context_retrieved_lst):
  answer = chain.invoke({"context": context, "question": question})
  answer_lst.append(answer)


### Automatic LLM evaluation of generated answer

In [None]:
# this scoring prompt can be freely adapted to evaluation criteria
# of different use-cases

instruction_judge_answer = """\
Your task is to score the quality of an answer to a question in a given context.

Your scoring criteria for assessing the answer are:
- pertinence: Does the answer directly answer the question?
- context grounding: Is the answer clearly grounded in the context? To be well grounded, the answer does not need to explicitly reference the context.
- conciseness: Is the answer concise without unnecessary verbosity?

Your quality score should be in the range of 0 to 100. \
100 means a very good answer, 0 means a very bad answer, 50 means a mediocre answer.

First briefly reason step-by-step to assess the extent to which the answer fulfills these criteria. Your reasoning should be short.
Then return the quality score.

Always answer in this JSON evaluation format: {{"reason": "...", "score": "..."}}

context: {context}\n
question: "{question}"\n
answer: "{answer}"\n
JSON evaluation: """

instruction_judge_answer = ChatPromptTemplate.from_template(instruction_judge_answer)

# currently need to use OAI here, because it enforces JSON very well
llm_evaluation = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

chain = instruction_judge_answer | llm_evaluation


output_quality_lst = []
for answer, question, context_retrieved in zip(answer_lst, question_lst, context_retrieved_lst):

  output_quality = chain.invoke({
      "context": context_retrieved.strip(),
      "question": question.strip().replace("\n", " "),
      "answer": answer.strip().replace("\n", " ")
  })

  output_quality_lst.append(output_quality.content)





In [None]:
# parsing the JSON output can lead to errors
# with open-source models, which don't enforce JSON as well as OAI
import ast

output_quality_dic = [ast.literal_eval(output) for output in output_quality_lst]
output_quality_score = [int(dic["score"]) for dic in output_quality_dic]
output_quality_reason = [dic["reason"] for dic in output_quality_dic]


## Results

In [None]:
import pandas as pd

df_results = pd.DataFrame({
    "question": question_lst,
    "answer": answer_lst,
    "answer_quality_score": output_quality_score,
    "answer_quality_reason": output_quality_reason,
    "correct_context": [a == b for a, b in zip(context_for_q_generation, context_retrieved_lst)],
    "context_retrieved": context_retrieved_lst,
    "context_for_q_generation": context_for_q_generation
})

mean_answer_score = df_results["answer_quality_score"].mean()
retrieval_accuracy = sum(df_results["correct_context"]) / len(df_results["correct_context"])

print(f"Retrieval accuracy: {retrieval_accuracy}")
print(f"Mean answer score: {mean_answer_score}")
print("\n")

df_results


Retrieval accuracy: 0.2
Mean answer score: 96.0




| | question | answer | answer_quality_score | answer_quality_reason | correct_context | context_retrieved | context_for_q_generation |
|---|---|---|---|---|---|---|---|
| 0 | How does the framework suggest assessing AI ri... | The framework suggests assessing AI risks by f... | 100 | The answer directly answers the question by ex... | False | the possible harm caused by the ai system is p... | framework to assess the potential benefits. su... |
| 1 | Which approach to innovations centers is consi... | The approach to innovation centered on industr... | 100 | The answer directly answers the question by st... | False | and be applicable without prejudice to cultura... | zivilgesellschaft eingebunden werden sollen un... |
| 2 | What is EDRI's stance on the necessity and pro... | EDRI believes that using facial recognition sy... | 100 | The answer directly addresses the question by ... | True | cnil, for permission to use a facial recogniti... | cnil, for permission to use a facial recogniti... |
| 3 | How can artificial intelligence impact researc... | Artificial intelligence can impact research in... | 90 | The answer directly answers the question by ex... | False | response to the public consultation on the eur... | le domaine de l ’ ia pourrait etre a meme de s... |
| 4 | What article in GDPR outlines the data subject... | Article 22 of GDPR outlines the data subject's... | 90 | The answer directly answers the question by st... | False | , to express his or her point of view and to c... | i. e., which data was influential ), and to fa... |
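One way to read the two headline numbers together is to split the answer scores by whether the source chunk was retrieved. The sketch below copies the five (score, correct_context) pairs from the results table and uses only the standard library. With five questions this is an illustration of the bookkeeping, not a statistically meaningful result.

```python
from statistics import mean

# (answer_quality_score, correct_context) pairs copied from the results table
rows = [(100, False), (100, False), (100, True), (90, False), (90, False)]

score_hit = mean(score for score, correct in rows if correct)
score_miss = mean(score for score, correct in rows if not correct)

print(f"Mean score when the source chunk was retrieved: {score_hit}")
print(f"Mean score when another chunk was retrieved:    {score_miss}")
```

The split is 100 versus 95, so the judge scored answers highly even when a different chunk was retrieved. This is consistent with the caveat that exact-match retrieval accuracy is an imperfect indicator, but it could equally mean that the judge rewards answers grounded in whatever context was retrieved, since the judge prompt only sees the retrieved context.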
