<a href="https://colab.research.google.com/github/MoritzLaurer/rag-demo/blob/master/rag_langchain_ai_law.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Evaluating a RAG pipeline with LangChain and Hugging Face or OpenAI

This notebook provides a quick demo for creating and evaluating a Retrieval Augmented Generation (RAG) pipeline with LangChain and Hugging Face or OpenAI.

The demo has the following main steps:
1. Create an example vector database: The demo downloads 440 position paper PDFs that stakeholders submitted to the EU public consultation on the EU White Paper on AI in 2020. These PDFs are processed and ingested into a vector database.
2. Generate questions: An LLM automatically generates questions about a sample of the texts.
3. Build the RAG pipeline: The generated questions are fed into the RAG pipeline as user queries.
4. RAG evaluation:
  - Retriever quality: If we ask the RAG pipeline a generated question, does the pipeline's retriever retrieve the same original text that was used to generate the question? This provides an indication of retriever (and reranker) quality. This indicator is, however, imperfect, as the retriever could also retrieve other texts that help the RAG pipeline generate good answers beyond only the original text used for generating the question.
  - Answer quality: We also use an LLM to evaluate answer quality more broadly.


## Install packages

In [1]:
%%bash
pip install --upgrade pip -q
pip install langchain~=0.0.352
pip install langchainhub~=0.1.14
pip install openai~=1.6.0
pip install tiktoken~=0.5.2
#pip install chromadb~=0.4.21
#pip install faiss-cpu~=1.7.4
pip install qdrant-client~=1.7.0
pip install PyMuPDF~=1.23.7
pip install sentence_transformers~=2.2.2


In [2]:
from google.colab import userdata
import os
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_KEY')

## Prepare example data

#### Download PDF data

In [3]:
## download PDF data
import os
import zipfile
import requests
from io import BytesIO

# URL of the zip file in your GitHub repo (make sure it's the raw file URL)
zip_url = 'https://github.com/MoritzLaurer/rag-demo/blob/master/data/position-papers-pdfs.zip?raw=true'

# Download the zip file and fail early on HTTP errors
print("Downloading zip file...")
response = requests.get(zip_url)
response.raise_for_status()
zip_content = BytesIO(response.content)

# Define the extraction path
extract_path = '/content/data'

# Create directory if it doesn't exist
os.makedirs(extract_path, exist_ok=True)

# Extract the zip file
print("Extracting zip file...")
with zipfile.ZipFile(zip_content, 'r') as zip_ref:
    zip_ref.extractall(extract_path)

print("Extraction completed.")

file_paths = [f for f in os.listdir(extract_path) if os.path.isfile(os.path.join(extract_path, f))]
print(f"{len(file_paths)} PDF files downloaded.")


Downloading zip file...
Extracting zip file...
Extraction completed.
440 PDF files downloaded.


#### Process data

In [4]:
from langchain.document_loaders import PyMuPDFLoader
from tqdm.notebook import tqdm

# relative path to the extracted PDFs; assumes the working directory is /content (the Colab default)
directory = "./data"

docs = []
for pdf_path in tqdm(os.listdir(directory)):
  try:
    docs.append(PyMuPDFLoader(os.path.join(directory, pdf_path)).load())
  except Exception as e:
    print("Exception: ", e)


  0%|          | 0/440 [00:00<?, ?it/s]

Exception:  cannot open broken document
Exception:  cannot open broken document
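
The loop above reports that two files could not be opened, but not which ones. A minimal sketch for narrowing this down, assuming that at least some broken files simply lack the `%PDF-` header that PDF files normally start with (the helper names `looks_like_pdf` and `find_suspect_files` are hypothetical and the demo files are made up):

```python
import os
import tempfile


def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF magic bytes '%PDF-'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


def find_suspect_files(directory: str) -> list[str]:
    """List files in `directory` that do not start with the PDF header."""
    suspects = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and not looks_like_pdf(path):
            suspects.append(name)
    return suspects


# Self-contained demo with two made-up files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "ok.pdf"), "wb") as f:
        f.write(b"%PDF-1.7\n...")
    with open(os.path.join(tmp, "broken.pdf"), "wb") as f:
        f.write(b"<html>not a pdf</html>")
    demo_suspects = find_suspect_files(tmp)

print(demo_suspects)
```

This check is only a heuristic: a file can carry a valid header and still be corrupt further down, so it complements rather than replaces the `try`/`except` around the loader.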


In [5]:
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

"""text_splitter = CharacterTextSplitter(
    separator = " ",
    chunk_size = 1000,
    chunk_overlap  = 30,
    length_function = len,
    is_separator_regex = False,
)"""

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)

docs_processed = [text_splitter.split_documents(doc) for doc in docs]

docs_processed = [item for sublist in docs_processed for item in sublist]
print(len(docs_processed))

docs_processed[:1]

16932


[Document(page_content='EC White Paper: Consultation Response \nJune 2020 \n© 2020, Loughborough University, UKRI Project REF: ES/S010416/1  \n \n1 of 26 \nResponse \nto \nthe \nEuropean \nCommission’s \nConsultation on Artificial Intelligence: A European \napproach to excellence and trust \nThis document is a response to the European Commission’s Consultation on Artificial Intelligence, from \nLoughborough University systems engineering researchers Dr Melanie King and Paul Timms, written as part \nof the TECHNGI Academic Research Project (UKRI Project Ref: ES/S010416/1).  TECHNGI (Technology \nDriven Next Generation Insurance) is a cross-disciplinary research project investigating the opportunities and \nchallenges for the UK insurance industry arising from the application of new AI technologies, including machine \nlearning, distributed ledger, automated processing, and the explosion of available dataa. \nWe provide both general comments on the white paper [1], and address more speci
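
To build intuition for what `chunk_size` and `chunk_overlap` do, here is a simplified sliding-window sketch in plain Python. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to split on separators such as paragraph breaks. The function name `sliding_window_chunks` is hypothetical. Each chunk is returned together with its start offset, similar in spirit to `add_start_index=True`.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Split `text` into fixed-size chunks where consecutive chunks share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        # stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


demo_text = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(demo_text, chunk_size=10, chunk_overlap=4)
for start, chunk in demo_chunks:
    print(start, chunk)
```

In the demo, every chunk begins with the last four characters of the previous chunk. The overlap in the notebook (200 of 1000 characters) serves the same purpose: a sentence cut at a chunk boundary still appears intact in one of the two neighbouring chunks.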

#### Sample data to reduce embedding and generation costs

In [52]:
import random
random.seed(42)

# sample corpus for embedding
index_random = random.sample(range(len(docs_processed)), 100)
docs_processed_samp = [docs_processed[i] for i in index_random]

# sample some contexts to generate questions from
docs_processed_for_q_generation = docs_processed_samp[:5]


## Automatic question generation for evaluation

This section generates questions which users could ask about a specific text in the database. This allows us to assess:  
- If we ask the RAG pipeline a generated question, does the pipeline's retriever retrieve the same text that was used to generate the question? This provides an indication of retriever (and reranker) quality.
- Beyond the original text used for generating the question, the retriever might retrieve other texts that also help the RAG pipeline generate good answers. We therefore also use an LLM to evaluate answer quality more broadly.

In [57]:
from langchain.chat_models import ChatOpenAI
from langchain.llms import HuggingFaceEndpoint
from langchain.prompts import ChatPromptTemplate, PromptTemplate

provider_for_question_generation = "OAI"


if provider_for_question_generation == "HF":
  chat_model = HuggingFaceEndpoint(
    endpoint_url="https://ytjpei7t003tedav.us-east-1.aws.endpoints.huggingface.cloud",
    task="text-generation",
    huggingfacehub_api_token=userdata.get('hf_api_key'),
    model_kwargs={}
  )

elif provider_for_question_generation == "OAI":
  # https://platform.openai.com/docs/api-reference/chat
  chat_model = ChatOpenAI(
      model="gpt-3.5-turbo-1106",  # alternative: "gpt-4-1106-preview"
      temperature=0.2, max_tokens=1024,
      n=1, top_p=0.95,
      frequency_penalty=0.0,  # Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model's likelihood to repeat the same line verbatim.
      presence_penalty=0.0,  # Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics.
      #response_format={ "type": "json_object" },
      seed=42,
  )




In [56]:

instruction_question_gen = """\
Your task is to write a factoid question given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine. \
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

context: {context}\n
factoid question: """


prompt_question_gen = ChatPromptTemplate.from_template(instruction_question_gen)

chain = prompt_question_gen | chat_model

questions_lst = []
for context in docs_processed_for_q_generation:
  print("Context:\n", context.page_content)
  output_question = chain.invoke({"context": context.page_content})
  if provider_for_question_generation == "OAI":
    output_question = output_question.content
  print("\nGenerated question:\n", output_question, "\n")
  questions_lst.append(output_question)


Context:
 access to school-made teaching material covering the achieved or 
partially achieved curriculum. Therefore teachers need telecom 
support and home-working status as well as authorship 
recognition. 
"A School", as we put it, "is a seamless process for transferring 
knowledge and experimenting newly acquired knowledge." 
From the user point of view, this is a crucial issue. For Education 
Bodies, it is a critical mission. For multimedia technologists, it is a 
real challenge. 
As Professor Ruberti used to say: "Information Technology and 
Telecommunications are the only ways today to help teachers 
upgrading the Education System's performances." 
Are all the players in this area convinced that they have to 
proceed this way ? 
What do think of it our newest regulatory bodies? 
Are they ready to enforce the introduction of Isdn as a bottom 
line for the Universal Service definition ? 
And what about our governments ? 
Would it be possible to ask them yearly accounts of their

G

## RAG pipeline

### Retrieval

Optimization potential: different retrievers, different rerankers, multi-retrievers

In [58]:
# detailed RAG docs: https://python.langchain.com/docs/use_cases/question_answering/
# FAISS cookbook: https://python.langchain.com/docs/expression_language/cookbook/retrieval

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings, HuggingFaceInferenceAPIEmbeddings
from langchain.schema import StrOutputParser
from langchain.vectorstores import Qdrant

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

#vectorstore = FAISS.from_documents(docs_processed, OpenAIEmbeddings())


In [None]:
# ! issue: langchain vector store wrappers don't seem to allow adjusting the embedding dimensions, they only accept the OAI default of 1536

retriever_model = "OAI"

client_path = "./vectorstore"
collection_name = "collection"

if retriever_model == "HF":
  qdrantClient = QdrantClient(path=client_path, prefer_grpc=True)

  embeddings = HuggingFaceInferenceAPIEmbeddings(
      api_key=userdata.get('hf_api_key'), model_name="sentence-transformers/all-MiniLM-l6-v2"
  )

  dim = 384

elif retriever_model == "OAI":

  qdrantClient = QdrantClient(path=client_path, prefer_grpc=True)

  embeddings = OpenAIEmbeddings(
          model="text-embedding-ada-002",
          openai_api_key=os.getenv("OPENAI_API_KEY"),
  )

  dim = 1536


qdrantClient.create_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=dim, distance=Distance.COSINE),
)

vectorstore = Qdrant(
    client=qdrantClient,
    collection_name=collection_name,
    embeddings=embeddings,
)

vectorstore.add_documents(docs_processed_samp)

In [59]:

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# with k=1, only the single most similar chunk is retrieved per question
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1}
)

context_retrieved_lst = []
for question in questions_lst:
  context_retrieved = retriever.get_relevant_documents(question)
  context_retrieved_lst.append(format_docs(context_retrieved))
  #print(context_retrieved)


In [60]:
# check if retrieved context for question is same as context used for generating the question
# note that this is an imperfect measure, because the retriever might
# retrieve other texts that are equally relevant as the text used for generating the question
context_for_q_generation = [doc.page_content for doc in docs_processed_for_q_generation]
correct_context_retrieved = [a == b for a, b in zip(context_for_q_generation, context_retrieved_lst)]

retrieval_accuracy = sum(correct_context_retrieved) / len(correct_context_retrieved)
print(retrieval_accuracy)


1.0
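
The exact-match check above relies on `k=1`: once the retriever returns several chunks that are joined into one string, the joined string no longer equals the source text. A minimal sketch of a hit rate at k, assuming the retrieved chunks are kept as a list of `page_content` strings per question instead of one formatted string (the function name `hit_rate_at_k` and the toy data are hypothetical):

```python
def hit_rate_at_k(source_contexts: list[str], retrieved_contexts: list[list[str]]) -> float:
    """Share of questions whose source context appears among the retrieved chunks."""
    hits = [
        source in retrieved
        for source, retrieved in zip(source_contexts, retrieved_contexts)
    ]
    return sum(hits) / len(hits) if hits else 0.0


# Toy data: three questions, two retrieved chunks each (k=2)
demo_sources = ["chunk a", "chunk b", "chunk c"]
demo_retrieved = [
    ["chunk a", "chunk x"],  # hit at rank 1
    ["chunk y", "chunk b"],  # hit at rank 2
    ["chunk x", "chunk y"],  # miss
]
demo_hit_rate = hit_rate_at_k(demo_sources, demo_retrieved)
print(demo_hit_rate)
```

Like the exact-match accuracy, this remains a lower bound on retriever quality, because other retrieved chunks may be just as useful for answering the question.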


In [None]:
# add reranking step
# challenge: reranking with HF models not implemented in langchain
# only cohere reranker seems implemented: https://python.langchain.com/docs/integrations/retrievers/cohere-reranker

"""import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-base')
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-base')
model.eval()

context_question_pairs_lst = []
for question in questions_lst:
  context_question_pairs_lst.append([[question, context] for context in context_retrieved_lst])

#pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]

for context_question_pair in context_question_pairs_lst:
  with torch.no_grad():
      inputs = tokenizer(context_question_pair, padding=True, truncation=True, return_tensors='pt', max_length=512)
      scores = model(**inputs, return_dict=True).logits.view(-1, ).float()
      print(scores)
"""

### Answer generation

Optimization potential: different LLMs, different prompt templates

In [61]:

prompt_qa_template = """\
Your task is to answer a question based on a context.
Your answer should be concise and you should only return your answer.

context: {context}
question: {question}
answer: """

prompt_qa_template = PromptTemplate.from_template(prompt_qa_template)


In [62]:
from langchain.llms import HuggingFaceEndpoint

qa_model = "OAI"

if qa_model == "HF":
  llm_qa = HuggingFaceEndpoint(
    endpoint_url="https://ytjpei7t003tedav.us-east-1.aws.endpoints.huggingface.cloud",  #"https://nqoa2is3qe7y82ww.us-east-1.aws.endpoints.huggingface.cloud",
    task="text-generation",
    huggingfacehub_api_token=userdata.get('hf_api_key'),
    model_kwargs={}
  )
elif qa_model == "OAI":
  llm_qa = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

chain = prompt_qa_template | llm_qa | StrOutputParser()

answer_lst = []
for question, context in zip(questions_lst , context_retrieved_lst):
  answer = chain.invoke({"context": context, "question": question})
  answer_lst.append(answer)


### Automatic LLM evaluation of generated answer

In [75]:
# this scoring prompt can be freely adapted to evaluation criteria
# of different use-cases

instruction_judge_answer = """\
Your task is to score the quality of an answer to a question in a given context.

Your scoring criteria for assessing the answer are:
- pertinence: Does the answer directly answer the question?
- context grounding: Is the answer clearly grounded in the context? To be well grounded, the answer does not need to explicitly reference the context.
- conciseness: Is the answer concise without unnecessary verbosity?

Your quality score should be in the range of 0 to 100. \
100 means a very good answer, 0 means a very bad answer, 50 means a mediocre answer.

First briefly reason step-by-step to assess the extent to which the answer fulfills these criteria. Your reasoning should be short.
Then return the quality score.

Always answer in this JSON evaluation format: {{"reason": "...", "score": "..."}}

context: {context}\n
question: {question}\n
answer: {answer}\n
JSON evaluation: """

instruction_judge_answer = ChatPromptTemplate.from_template(instruction_judge_answer)

# currently need to use OAI here, because it enforces JSON very well
llm_evaluation = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

chain = instruction_judge_answer | llm_evaluation


output_quality_lst = []
for answer, question, context_retrieved in zip(answer_lst, questions_lst, context_retrieved_lst):

  output_quality = chain.invoke({
      "context": context_retrieved,
      "question": question,
      "answer": answer
  })

  output_quality_lst.append(output_quality.content)
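
The doubled braces around the JSON format in the scoring prompt are needed because single braces mark template variables. A minimal sketch of this escaping with plain `str.format`, assuming LangChain's default template format follows the same escaping rules (the shortened template and the example values are made up for illustration):

```python
# Shortened, made-up version of the scoring prompt
demo_template = (
    'Always answer in this JSON evaluation format: {{"reason": "...", "score": "..."}}\n'
    "\n"
    "question: {question}\n"
    "answer: {answer}\n"
    "JSON evaluation: "
)

# {{ and }} become literal braces, {question} and {answer} are substituted
demo_filled = demo_template.format(
    question="What is the capital of France?",
    answer="Paris",
)
print(demo_filled)
```

With single braces around the JSON example, formatting would fail because `"reason"` would be interpreted as a template variable.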



In [76]:
# parsing the JSON output can lead to errors
# with open-source models, which don't enforce JSON as well as OAI
import json

# json.loads also handles JSON literals such as true, false and null, which ast.literal_eval rejects
output_quality_dic = [json.loads(output) for output in output_quality_lst]
output_quality_score = [int(dic["score"]) for dic in output_quality_dic]
output_quality_reason = [dic["reason"] for dic in output_quality_dic]
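
Since the parsing step can fail when a model wraps the JSON in extra text or returns a malformed score, a more defensive parser may help. This is a minimal sketch, not part of the original pipeline: the function name `parse_judge_output` and the example strings are hypothetical, and unparseable outputs are mapped to `(None, None)` instead of raising.

```python
import json
import re


def parse_judge_output(raw: str):
    """Extract (score, reason) from a judge output, or (None, None) if parsing fails."""
    # take everything from the first "{" to the last "}" to ignore surrounding chatter
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return None, None
    try:
        parsed = json.loads(match.group(0))
        score = int(parsed["score"])
        reason = str(parsed["reason"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None, None
    return score, reason


demo_outputs = [
    '{"reason": "Direct and grounded.", "score": "90"}',
    'Sure! Here is my evaluation: {"reason": "Too verbose.", "score": 40} Hope this helps.',
    "I cannot score this.",
    '{"reason": "Unclear.", "score": "high"}',
]
demo_parsed = [parse_judge_output(output) for output in demo_outputs]
print(demo_parsed)
```

Rows with a `None` score can then be dropped or re-requested before computing the mean answer score.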


## Results

In [77]:
import pandas as pd

df_results = pd.DataFrame({
    "question": questions_lst,
    "answer": answer_lst,
    "answer_quality_score": output_quality_score,
    "answer_quality_reason": output_quality_reason,
    "correct_context": [a == b for a, b in zip(context_for_q_generation, context_retrieved_lst)],
    "context_retrieved": context_retrieved_lst,
    "context_for_q_generation": context_for_q_generation
})

mean_answer_score = df_results["answer_quality_score"].mean()
retrieval_accuracy = sum(df_results["correct_context"]) / len(df_results["correct_context"])

print(f"Mean retrieval accuracy: {retrieval_accuracy}")
print(f"Mean answer score: {mean_answer_score}")
print("\n")

df_results


Mean retrieval accuracy: 1.0
Mean answer score: 94.0




|  | question | answer | answer_quality_score | answer_quality_reason | correct_context | context_retrieved | context_for_q_generation |
|---|---|---|---|---|---|---|---|
| 0 | \nWhat is the role of Information Technology a... | \nAccording to Professor Ruberti, the roles of... | 70 | The answer directly answers the question by st... | True | access to school-made teaching material coveri... | access to school-made teaching material coveri... |
| 1 | \n"What percentage of engineers say they would... | 48% of engineers say they would benefit from t... | 100 | The answer directly answers the question by st... | True | got materials engineering that’s coming in. An... | got materials engineering that’s coming in. An... |
| 2 | \nWhat is the recommended approach for ensurin... | \nThe recommended approach for ensuring the pr... | 100 | The answer directly answers the question by st... | True | wymogów należy ograniczyć do zastosowań SI wys... | wymogów należy ograniczyć do zastosowań SI wys... |
| 3 | \n\nwhat is the legal stance of deployers rega... | \n\nDeployers in the given context can often a... | 100 | The answer directly answers the question by st... | True | deployer can often absolve themselves for any ... | deployer can often absolve themselves for any ... |
| 4 | \n\nWhat is the EU's stance on regulating AI, ... | \n\nThe EU's stance on regulating AI is to all... | 100 | The answer directly answers the question by ex... | True | should allow the benefits of AI to be realized... | should allow the benefits of AI to be realized... |
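
Several questions and answers in the results above start with newline characters or a stray quotation mark. A minimal sketch of a cleanup step that could be applied to the generated strings before they are sent to the retriever or shown in the results (the function name `clean_generation` is hypothetical and the example strings are made up):

```python
def clean_generation(text: str) -> str:
    """Strip surrounding whitespace and one pair of matching surrounding quotes."""
    cleaned = text.strip()
    if len(cleaned) >= 2 and cleaned[0] == cleaned[-1] and cleaned[0] in "\"'":
        cleaned = cleaned[1:-1].strip()
    return cleaned


demo_generations = [
    '\n"What percentage of engineers benefit?"',
    "\n\nWhat is the EU's stance?",
    "plain question",
]
demo_cleaned = [clean_generation(text) for text in demo_generations]
print(demo_cleaned)
```

Only a matching pair of quotes at both ends is removed, so apostrophes inside a question are left untouched.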
