# Evaluation with RAGAS and Advanced Retrieval Methods Using LangChain

In [1]:
!pip install -U -q langchain openai ragas arxiv pymupdf pypdf chromadb wandb tiktoken


In [2]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please enter your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

### Data Collection

We're going to use a local PDF of scikit-learn material (`scikit22.pdf`) as our context today.

We can load it rather straightforwardly with the `PyPDFLoader` document loader from LangChain, which returns one `Document` per page.

If you would rather pull papers from arXiv, LangChain's `ArxivLoader` works in a similar way:

- [`ArxivLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.arxiv.ArxivLoader.html)

In [60]:
from langchain.document_loaders import PyPDFLoader

# Specify the path to your local PDF file
pdf_path = "./scikit22.pdf"

# Load the PDF document
loader = PyPDFLoader(pdf_path)
base_docs = loader.load()

# Check the number of documents loaded
print(f"Number of pages loaded: {len(base_docs)}")


Number of pages loaded: 13


In [61]:
for doc in base_docs:
  print(doc.metadata)

{'source': './scikit22.pdf', 'page': 0}
{'source': './scikit22.pdf', 'page': 1}
{'source': './scikit22.pdf', 'page': 2}
{'source': './scikit22.pdf', 'page': 3}
{'source': './scikit22.pdf', 'page': 4}
{'source': './scikit22.pdf', 'page': 5}
{'source': './scikit22.pdf', 'page': 6}
{'source': './scikit22.pdf', 'page': 7}
{'source': './scikit22.pdf', 'page': 8}
{'source': './scikit22.pdf', 'page': 9}
{'source': './scikit22.pdf', 'page': 10}
{'source': './scikit22.pdf', 'page': 11}
{'source': './scikit22.pdf', 'page': 12}


### Creating an Index

Let's use a naive index creation strategy of just using `RecursiveCharacterTextSplitter` on our documents and embedding each into our `VectorStore` using `OpenAIEmbeddings()`.

- [`RecursiveCharacterTextSplitter()`](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)
- [`Chroma`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html?highlight=chroma#langchain.vectorstores.chroma.Chroma)
- [`OpenAIEmbeddings()`](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html?highlight=openaiembeddings#langchain-embeddings-openai-openaiembeddings)
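
To make the chunking step concrete, here is a minimal pure-Python sketch of separator-aware splitting. It is a simplified stand-in and not LangChain's implementation: `split_text` and its separator list are hypothetical names chosen for this example, and chunk overlap is ignored.

```python
def split_text(text: str, chunk_size: int = 250,
               separators: tuple = ("\n\n", "\n", " ")) -> list:
    """Split text into chunks of at most chunk_size characters.

    Tries the coarsest separator first and only falls back to finer
    ones (or a hard cut) for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    for sep in separators:
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = piece if not current else current + sep + piece
            if len(candidate) <= chunk_size:
                current = candidate          # keep packing this chunk
                continue
            if current:
                chunks.append(current)       # close the chunk we were building
            if len(piece) > chunk_size:
                # The piece alone is too long: split it with finer separators
                chunks.extend(split_text(piece, chunk_size, separators))
                current = ""
            else:
                current = piece
        if current:
            chunks.append(current)
        return chunks

    # No separator present at all: fall back to fixed-width slices
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "para one.\n\n" + "word " * 100
sample_chunks = split_text(sample, chunk_size=50)
print(len(sample_chunks), max(len(c) for c in sample_chunks))
```

Smaller chunks give more precise matches but less context per hit, which is the trade-off the later retrievers in this notebook try to address.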

In [62]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=250)

docs = text_splitter.split_documents(base_docs)

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

In [63]:
len(docs)

175

In [64]:
print(max([len(chunk.page_content) for chunk in docs]))

249


In [65]:
# convert our `Chroma` vectorstore into a retriever with the `.as_retriever()` method.

base_retriever = vectorstore.as_retriever(search_kwargs={"k" : 2})

In [66]:
relevant_docs = base_retriever.get_relevant_documents("What is regression?")

In [67]:
len(relevant_docs)
relevant_docs

[Document(metadata={'page': 12, 'source': './scikit2.pdf'}, page_content='Regression\n▶ Classiﬁcation vs. Regression 1:\n▶ Classify for categorical output\n▶ Regression: predicting continuous-valued attribute(s)\n▶ Can be ”by-products” of classiﬁcation methods, e.g.:\nRandomForestClassifier and RandomForestRegressor, or'),
 Document(metadata={'page': 5, 'source': './scikit22.pdf'}, page_content='1.5.2. Regression\nThe class SGDRegressor implements a plain stochastic gradient descent learning routine which\nsupports different loss functions and penalties to fit linear regression models. SGDRegressor is')]
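
As a rough illustration of what a dense retriever with `k=2` does, the sketch below ranks toy vectors by cosine similarity and keeps the top two. The vectors, the `cosine` and `top_k` helpers, and the choice of cosine similarity are assumptions made for this example; the distance function Chroma actually applies depends on how the collection is configured.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


# Hypothetical 2-dimensional "embeddings" for three chunks
toy_docs = {
    "regression": [1.0, 0.1],
    "clustering": [0.0, 1.0],
    "sgd": [0.8, 0.6],
}
toy_hits = top_k([1.0, 0.0], toy_docs, k=2)
print(toy_hits)
```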

## Creating a Retrieval Augmented Generation Prompt

Now we can set up a prompt template that will be used to provide the LLM with the necessary contexts, user query, and instructions!

In [68]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
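
To see what the LLM ends up receiving, the sketch below fills the same template with plain `str.format`. It is only an approximation of what `ChatPromptTemplate` does: `render_prompt` is a hypothetical helper, and joining the chunk texts with blank lines is a choice made for this example (the chain in this notebook passes the retrieved `Document` list into `{context}` as-is).

```python
TEMPLATE = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""


def render_prompt(contexts, question):
    """Fill the template with retrieved chunk texts and the user question."""
    context_block = "\n\n".join(contexts)
    return TEMPLATE.format(context=context_block, question=question)


rendered = render_prompt(
    ["SGD approximates the true gradient.", "SGDRegressor fits linear models."],
    "What is SGD?",
)
print(rendered)
```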

### Setting Up our Basic QA Chain

In [69]:
from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # Invoke the chain with: {"question": "<<SOME USER QUESTION>>"}
    # "context"  : the value of the "question" key, piped into base_retriever
    # "question" : the value of the "question" key, passed along unchanged
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    # Passes the dict through; re-assigning "context" to itself changes nothing
    # and is kept only to make the data flow explicit
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : "context" and "question" format the prompt, which is piped into the LLM
    # "context"  : carried over from the previous step so we can inspect what was retrieved
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)

In [70]:
question = "Stochastic gradient descent is an optimization method for..."

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result)

{'response': AIMessage(content='unconstrained optimization problems.', additional_kwargs={}, response_metadata={'token_usage': {'completion_tokens': 6, 'prompt_tokens': 190, 'total_tokens': 196, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-68617c9f-dd08-4442-8583-403ca478452f-0'), 'context': [Document(metadata={'page': 11, 'source': './scikit22.pdf'}, page_content='Stochastic gradient descent is an optimization method for unconstrained optimization problems.\nIn contrast to (batch) gradient descent, SGD approximates the true gradient of  by\nconsidering a single training example at a time.'), Document(metadata={'page': 0, 'source': './scikit22.pdf'}, page_content='1.5. Stochastic Gradient Descent\nStochastic Gradient D

### Evaluating RAG Pipelines

The evaluation questions live in `scikit.csv`, which has `question`, `context`, and `ground_truth` columns. We load it with pandas and keep it as a `DataFrame`.

If you're using Colab to do this notebook - please ensure you add the file to your session files.

In [71]:
import pandas as pd
from datasets import Dataset

# Read the CSV with an explicit encoding (the default UTF-8 decoding may fail on this file)
eval_dataset = pd.read_csv("./scikit.csv", encoding='ISO-8859-1')  # or encoding='Windows-1252'

print(f"Dataset loaded with {len(eval_dataset)} rows")


Dataset loaded with 20 rows


In [72]:
eval_dataset

Unnamed: 0,question,context,ground_truth
0,Logistic regression classifier has different s...,Algorithm:\n\nLogistic Regression ('sag'/'saga...,Logistic Regression with 'sag'/'saga' solvers ...
1,How gaussian mixture models work?\n\nI am give...,For your example:\nGiven: Two normal distribut...,Gaussian Mixture Models (GMMs) estimate parame...
2,I'm working in a sklearn homework and I don't ...,When preprocessing:\n\nTraining set: Compute t...,Standardizing and normalizing the test data wi...
3,I have a collection of very large images. I te...,Spectral clustering involves creating a simila...,The Nystroem approximation can significantly r...
4,how does sklearn's gaussian mixture model avoi...,"When working with Gaussian Mixture Models, com...",Scikit-learn's Gaussian Mixture Model (GMM) av...
5,Procedure for selecting optimal number of feat...,The response is based on the general understan...,"The approach of using PCA, SelectFromModel, an..."
6,I have trained a classifier on 'Rocks and Mine...,This answer is based on the understanding of c...,It is likely that your model is overfitting or...
7,After reading sklearn manual it was not very o...,This explanation is based on the CalibratedCla...,Isotonic regression in the context of probabil...
8,How do you choose the feature selection algori...,Based on scikit-learn's feature selection me...,Feature selection depends on the dataset and m...
9,I have a dataset of images that I would like t...,"In scikit-learn, PCA is commonly used for meas...","In scikit-learn, non-linear dimensionality red..."


### Evaluation Using RAGAS

Now we can evaluate using RAGAS!

The set-up is fairly straightforward - we simply need to create a dataset with our generated answers and our contexts, and then evaluate using the framework.
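
Before handing anything to the framework, it can help to check that every record has the shape we expect. The sketch below does that in plain Python for the four fields this notebook builds (`question`, `answer`, `contexts`, `reference`). `validate_record` is a hypothetical helper written for this example and is not part of RAGAS.

```python
REQUIRED_FIELDS = {
    "question": str,
    "answer": str,
    "contexts": list,
    "reference": str,
}


def validate_record(record):
    """Return a list of human-readable problems; an empty list means the record looks fine."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field} should be {expected.__name__}, got {actual}")
    contexts = record.get("contexts")
    if isinstance(contexts, list) and not all(isinstance(c, str) for c in contexts):
        problems.append("contexts should contain only strings")
    return problems


good_record = {
    "question": "What is regression?",
    "answer": "Predicting a continuous-valued attribute.",
    "contexts": ["Regression: predicting continuous-valued attribute(s)"],
    "reference": "Regression predicts continuous values.",
}
# A missing cell in a CSV typically arrives as a float NaN rather than a string
bad_record = dict(good_record, reference=float("nan"))

print(validate_record(good_record))
print(validate_record(bad_record))
```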

In [73]:
import ragas.metrics
print(dir(ragas.metrics))


['AgentGoalAccuracyWithReference', 'AgentGoalAccuracyWithoutReference', 'AnswerCorrectness', 'AnswerRelevancy', 'AnswerSimilarity', 'AspectCritic', 'BleuScore', 'ContextEntityRecall', 'ContextPrecision', 'ContextRecall', 'ContextUtilization', 'DataCompyScore', 'DistanceMeasure', 'ExactMatch', 'FactualCorrectness', 'Faithfulness', 'FaithfulnesswithHHEM', 'InstanceRubrics', 'LLMContextPrecisionWithReference', 'LLMContextPrecisionWithoutReference', 'LLMContextRecall', 'LLMSQLEquivalence', 'Metric', 'MetricOutputType', 'MetricType', 'MetricWithEmbeddings', 'MetricWithLLM', 'MultiModalFaithfulness', 'MultiModalRelevance', 'MultiTurnMetric', 'NoiseSensitivity', 'NonLLMContextPrecisionWithReference', 'NonLLMContextRecall', 'NonLLMStringSimilarity', 'ResponseRelevancy', 'RougeScore', 'RubricsScore', 'SemanticSimilarity', 'SimpleCriteriaScore', 'SingleTurnMetric', 'StringPresence', 'SummarizationScore', 'ToolCallAccuracy', 'TopicAdherenceScore', '__all__', '__builtins__', '__cached__', '__doc__

In [133]:
from tqdm import tqdm
import pandas as pd
from datasets import Dataset
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    answer_correctness,
    answer_similarity,
)
from ragas import evaluate


def create_ragas_dataset(rag_pipeline, eval_dataset):
    """
    Create a dataset compatible with Ragas evaluation from the retrieval-augmented QA pipeline.
    """
    rag_dataset = []
    for _, row in tqdm(eval_dataset.iterrows(), desc="Processing dataset"):
        try:
            # Get the response from the pipeline
            answer = rag_pipeline.invoke({"question": row["question"]})

            # Validate pipeline output
            if not isinstance(answer, dict) or "response" not in answer or "context" not in answer:
                raise ValueError("Invalid pipeline output structure.")

            # Ensure proper structure for Ragas dataset
            rag_dataset.append(
                {
                    "question": row["question"],
                    "answer": answer["response"].content if "response" in answer else "",
                    "contexts": [context.page_content for context in answer.get("context", [])],
                    "reference": row["ground_truth"] if "ground_truth" in row else "",
                }
            )
        except Exception as e:
            print(f"Error processing row with question: {row.get('question', 'N/A')}\n{e}")
            continue
    return Dataset.from_pandas(pd.DataFrame(rag_dataset))


def evaluate_ragas_dataset(ragas_dataset):
    """
    Evaluate the RAG dataset using specified metrics and handle errors gracefully.
    """
    try:
        # Perform the evaluation
        result = evaluate(
            ragas_dataset,
            metrics=[
                context_precision,
                faithfulness,
                answer_relevancy,
                context_recall,
                answer_correctness,
                answer_similarity,
            ],
        )
        return result
    except Exception as e:
        print(f"Error during evaluation: {e}")
        for i, row in enumerate(ragas_dataset):
            print(f"Row {i}: {row}")
        return None


Let's create our dataset first:

In [136]:
# Step 2: Check the structure of the dataset
print(type(eval_dataset))
for row in eval_dataset.iterrows():
    print(row)
    break

<class 'pandas.core.frame.DataFrame'>
(0, question        Logistic regression classifier has different s...
context         Algorithm:\n\nLogistic Regression ('sag'/'saga...
ground_truth    Logistic Regression with 'sag'/'saga' solvers ...
Name: 0, dtype: object)


In [125]:
from tqdm import tqdm
import pandas as pd

basic_qa_ragas_dataset = create_ragas_dataset(retrieval_augmented_qa_chain, eval_dataset)

Processing dataset: 20it [00:16,  1.23it/s]


In [126]:
basic_qa_ragas_dataset

Dataset({
    features: ['question', 'answer', 'contexts', 'reference'],
    num_rows: 20
})

In [127]:
for key, value in basic_qa_ragas_dataset[0].items():
    print(f"{key}: {type(value)}")


question: <class 'str'>
answer: <class 'str'>
contexts: <class 'list'>
reference: <class 'str'>


In [128]:
basic_qa_ragas_dataset[1]

{'question': 'How gaussian mixture models work?\n\nI am given an example:\n\nSuppose 1000 observations are drawn from N(0,1) and N(5,2) with mixing parameters ?1=0.2 and ?2=0.8 respectively. Suppose we only know ? and want to estimate ? and ?. How does one go about using Gaussian Mixture models to estimate these parameters? I know I have to use the EM algorithm but I do not know where to start. I want to use this simple example to get a better understanding of how it works.',
 'answer': "I don't know.",
 'contexts': ['We cannot answer that instantly, but consider the following requirements:\n▶ How much training data do you have?\n▶ Is your problem continuous or discrete?\n▶ What is the ratio # features and #samples ?\n▶ Do you need a sparse model?',
  'What Method is the Best for Me?\nWe cannot answer that instantly, but consider the following requirements:\n▶ How much training data do you have?\n▶ Is your problem continuous or discrete?\n▶ What is the ratio # features and #samples ?']

Save it for later:

In [129]:
basic_qa_ragas_dataset.to_csv("basic_qa_ragas_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 80.58ba/s]


27811

And finally - evaluate how it did!

In [132]:
print(basic_qa_ragas_dataset[0])


{'question': "Logistic regression classifier has different solvers and one of them is 'sgd'\n\nIt also has a different classifier 'SGDClassifier' and the loss parameter can be mentioned as 'log' for logistic regression.\n\nAre they essentially the same or different? If they are different, how different is the implementation between two? And how do you decide which one to use given the problem of logistic regression?\n", 'answer': 'They are essentially the same. The only difference is that LogisticRegression can be fitted using different solvers, while SGDClassifier specifically fits logistic regression using stochastic gradient descent. The decision on which one to use may depend on the size of the dataset and computational resources available.', 'contexts': ["LogisticRegression which is fitted via SGD instead of being fitted by one of the other solvers in\nLogisticRegression. Similarly, SGDRegressor(loss='squared_error', penalty='l2') and\nRidge solve the same optimization problem, vi

In [131]:
basic_qa_result = evaluate_ragas_dataset(basic_qa_ragas_dataset)

Evaluating:  24%|██▍       | 29/120 [00:10<00:27,  3.26it/s]No statements were generated from the answer.
Evaluating:  64%|██████▍   | 77/120 [00:26<00:17,  2.49it/s]No statements were generated from the answer.
Evaluating:  89%|████████▉ | 107/120 [00:36<00:02,  5.01it/s]No statements were generated from the answer.
Evaluating: 100%|██████████| 120/120 [00:53<00:00,  2.23it/s]


Error during evaluation: 'float' object is not subscriptable
Row 0: {'question': "Logistic regression classifier has different solvers and one of them is 'sgd'\n\nIt also has a different classifier 'SGDClassifier' and the loss parameter can be mentioned as 'log' for logistic regression.\n\nAre they essentially the same or different? If they are different, how different is the implementation between two? And how do you decide which one to use given the problem of logistic regression?\n", 'answer': 'They are essentially the same. The only difference is that LogisticRegression can be fitted using different solvers, while SGDClassifier specifically fits logistic regression using stochastic gradient descent. The decision on which one to use may depend on the size of the dataset and computational resources available.', 'contexts': ["LogisticRegression which is fitted via SGD instead of being fitted by one of the other solvers in\nLogisticRegression. Similarly, SGDRegressor(loss='squared_erro

TypeError: 'float' object is not subscriptable

In [111]:
basic_qa_result

### Testing Other Retrievers

Now we can test how changing our retriever impacts our RAGAS evaluation!

We'll build a simple `create_qa_chain` factory that produces standardized QA chains in which the only component that differs is the retriever.

In [29]:
def create_qa_chain(retriever):
  primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
  created_qa_chain = (
    {"context": itemgetter("question") | retriever,
     "question": itemgetter("question")
    }
    | RunnablePassthrough.assign(
        context=itemgetter("context")
      )
    | {
         "response": prompt | primary_qa_llm,
         "context": itemgetter("context"),
      }
  )

  return created_qa_chain

#### Parent Document Retriever

One of the easier ways to improve a retriever is to embed our documents as small chunks, so that matches are precise, and then return the larger parent chunk that "surrounds" each match, so that the LLM sees more context.

You can read more about this method [here](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever)!

The basic outline of this retrieval method is as follows:

1. Obtain User Question
2. Retrieve child documents using Dense Vector Retrieval
3. Merge the child documents based on their parents. If they have the same parents - they become merged.
4. Replace the child documents with their respective parent documents from an in-memory-store.
5. Use the parent documents to augment generation.
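
The steps above can be sketched in plain Python. This is a toy model and not the `ParentDocumentRetriever` implementation: word overlap stands in for dense vector similarity, and `build_children`, `overlap_score`, and `retrieve_parents` are hypothetical names used only here.

```python
from collections import Counter


def overlap_score(query, text):
    """Toy relevance score: number of shared lowercase words."""
    q_words = Counter(query.lower().split())
    t_words = Counter(text.lower().split())
    return sum((q_words & t_words).values())


def build_children(parents, child_size):
    """Split each parent into child chunks of child_size words, remembering the parent id."""
    children = []
    for parent_id, text in parents.items():
        words = text.split()
        for i in range(0, len(words), child_size):
            children.append((parent_id, " ".join(words[i:i + child_size])))
    return children


def retrieve_parents(query, children, parents, k=2):
    """Find the k best children, then return their parents without duplicates."""
    best = sorted(children, key=lambda c: overlap_score(query, c[1]), reverse=True)[:k]
    seen, result = set(), []
    for parent_id, _ in best:
        if parent_id not in seen:      # children of the same parent are merged
            seen.add(parent_id)
            result.append(parents[parent_id])
    return result


toy_parents = {
    "p1": "stochastic gradient descent is an optimization method for unconstrained problems",
    "p2": "random forests combine many decision trees",
}
toy_children = build_children(toy_parents, child_size=4)
toy_result = retrieve_parents("gradient descent", toy_children, toy_parents, k=2)
print(toy_result)
```

Both of the top children here come from the same parent, so only one parent document is returned.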

In [30]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

vectorstore = Chroma(collection_name="split_parents", embedding_function=OpenAIEmbeddings())

store = InMemoryStore()


In [31]:
parent_document_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

In [32]:
parent_document_retriever.add_documents(base_docs)

Let's create, test, and then evaluate our new chain!

In [33]:
parent_document_retriever_qa_chain = create_qa_chain(parent_document_retriever)

In [36]:
parent_document_retriever_qa_chain.invoke({"question" : "How to get scikit-learn?"})["response"].content

'Answer: Open Source (BSD License) available on Github. Current version: 0.24.2. Easy install via PIP or Conda for Windows, macOS and Linux.'

In [37]:
pdr_qa_ragas_dataset = create_ragas_dataset(parent_document_retriever_qa_chain, eval_dataset)

NameError: name 'create_ragas_dataset' is not defined

In [None]:
pdr_qa_ragas_dataset.to_csv("pdr_qa_ragas_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

55620

In [None]:
pdr_qa_result = evaluate_ragas_dataset(pdr_qa_ragas_dataset)

evaluating with [context_precision]


100%|██████████| 1/1 [01:01<00:00, 61.80s/it]


evaluating with [faithfulness]


100%|██████████| 1/1 [01:09<00:00, 69.23s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [01:04<00:00, 64.76s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:06<00:00,  6.54s/it]


evaluating with [context_relevancy]


100%|██████████| 1/1 [00:04<00:00,  4.02s/it]


evaluating with [answer_correctness]


100%|██████████| 1/1 [00:06<00:00,  6.83s/it]


evaluating with [answer_similarity]


100%|██████████| 1/1 [00:00<00:00,  1.54it/s]


In [None]:
pdr_qa_result

{'context_precision': 0.6972, 'faithfulness': 0.3500, 'answer_relevancy': 0.9439, 'context_recall': 1.0000, 'context_relevancy': 0.0134, 'answer_correctness': 0.6000, 'answer_similarity': 1.0000}

#### Ensemble Retrieval

Next let's look at ensemble retrieval!

You can read more about this [here](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble)!

The basic idea is as follows:

1. Obtain User Question
2. Hit the Retriever Pair
    - Retrieve Documents with BM25 Sparse Vector Retrieval
    - Retrieve Documents with Dense Vector Retrieval Method
3. Collect and "fuse" the retrieved docs based on their weighting using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm into a single ranked list.
4. Use those documents to augment our generation.

Ensure your `weights` list - the relative weighting of each retriever - sums to 1!
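
The fusion step can be sketched as weighted Reciprocal Rank Fusion: each retriever contributes `weight / (c + rank)` for every document it returns, and documents are sorted by their summed score. `weighted_rrf` is a hypothetical helper written for this example, the document ids are made up, and `c=60` is the constant suggested in the RRF paper; treat the details as an approximation of what `EnsembleRetriever` does.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document ids into one ranked list.

    rankings: one list of ids per retriever, best match first
    weights:  one weight per retriever
    c:        smoothing constant that dampens the effect of top ranks
    """
    if len(rankings) != len(weights):
        raise ValueError("Need exactly one weight per ranking")

    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)

    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["a", "b"]          # sparse retriever, k=2
dense_hits = ["b", "c", "a"]    # dense retriever, k=3
fused = weighted_rrf([bm25_hits, dense_hits], weights=[0.75, 0.25])
print(fused)
```

With the sparse retriever weighted at 0.75, its top hit stays first even though the dense retriever ranked it last.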

In [None]:
!pip install -q -U rank_bm25

In [None]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

text_splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=75)
docs = text_splitter.split_documents(base_docs)

bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2

embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embedding)
chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, chroma_retriever], weights=[0.75, 0.25])

In [None]:
ensemble_retriever_qa_chain = create_qa_chain(ensemble_retriever)

In [None]:
ensemble_retriever_qa_chain.invoke({"question" : "What is RAG?"})["response"].content

'RAG stands for Retrieval-Augmented Generation.'

In [None]:
ensemble_qa_ragas_dataset = create_ragas_dataset(ensemble_retriever_qa_chain, eval_dataset)

100%|██████████| 10/10 [00:20<00:00,  2.07s/it]


In [None]:
ensemble_qa_ragas_dataset.to_csv("ensemble_qa_ragas_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

22820

In [None]:
ensemble_qa_result = evaluate_ragas_dataset(ensemble_qa_ragas_dataset)

evaluating with [context_precision]


100%|██████████| 1/1 [01:01<00:00, 61.76s/it]


evaluating with [faithfulness]


100%|██████████| 1/1 [01:08<00:00, 68.62s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:05<00:00,  5.37s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:11<00:00, 11.67s/it]


evaluating with [context_relevancy]


100%|██████████| 1/1 [01:02<00:00, 62.45s/it]


evaluating with [answer_correctness]


100%|██████████| 1/1 [00:08<00:00,  9.00s/it]


evaluating with [answer_similarity]


100%|██████████| 1/1 [00:00<00:00,  1.57it/s]


In [None]:
ensemble_qa_result

{'context_precision': 0.8858, 'faithfulness': 0.7000, 'answer_relevancy': 0.8918, 'context_recall': 0.9800, 'context_relevancy': 0.0192, 'answer_correctness': 0.7750, 'answer_similarity': 1.0000}

### Conclusion

Observe your results in a table!

In [None]:
basic_qa_result

{'context_precision': 0.5000, 'faithfulness': 0.4000, 'answer_relevancy': 0.9535, 'context_recall': 1.0000, 'context_relevancy': 0.0559, 'answer_correctness': 0.6167, 'answer_similarity': 1.0000}

In [None]:
pdr_qa_result

{'context_precision': 0.6972, 'faithfulness': 0.3500, 'answer_relevancy': 0.9439, 'context_recall': 1.0000, 'context_relevancy': 0.0134, 'answer_correctness': 0.6000, 'answer_similarity': 1.0000}

In [None]:
ensemble_qa_result

{'context_precision': 0.8858, 'faithfulness': 0.7000, 'answer_relevancy': 0.8918, 'context_recall': 0.9800, 'context_relevancy': 0.0192, 'answer_correctness': 0.7750, 'answer_similarity': 1.0000}

We can also zoom in on each result and find specific information about each of the questions and answers.

In [None]:
ensemble_qa_result_df = ensemble_qa_result.to_pandas()

In [None]:
ensemble_qa_result_df

Unnamed: 0,question,contexts,answer,ground_truths,context_precision,faithfulness,answer_relevancy,context_recall,context_relevancy,answer_correctness,answer_similarity
0,What is the focus of this paper?,[has to make an important career decision.\nNe...,The focus of this paper is on a framework call...,[The focus of this paper is on retrieval-augme...,1.0,0.666667,0.784617,1.0,0.0,0.5,True
1,What is the title of the paper?,[of War. The game was released worldwide in\nG...,"Title: Self-RAG: Learning to Retrieve, Generat...",[The title of the paper is 'A Survey on Retrie...,0.5,1.0,0.976911,1.0,0.0,0.5,True
2,What is the aim of this paper?,[has to make an important career decision.\nNe...,The aim of this paper is to introduce a new fr...,[The aim of this paper is to conduct a compreh...,1.0,0.333333,0.800732,1.0,0.078947,0.75,True
3,What is the main focus of the paper 'A Survey ...,[A Survey on Retrieval-Augmented Text Generati...,The main focus of the paper 'A Survey on Retri...,[The main focus of the paper 'A Survey on Retr...,0.679167,1.0,0.982435,0.8,0.017857,1.0,True
4,What is the main focus of this paper?,[example of completions of the prompt by diffe...,I don't know.,[The main focus of this paper is to conduct a ...,1.0,0.0,0.742601,1.0,0.0,0.5,True
5,What is the main focus of this paper?,[example of completions of the prompt by diffe...,I don't know.,[The main focus of this paper is to conduct a ...,1.0,0.0,0.742601,1.0,0.0,0.5,True
6,What are the advantages of retrieval-augmented...,[attracted increasing attention of the compu-\...,The advantages of retrieval-augmented text gen...,[The advantages of retrieval-augmented text ge...,1.0,1.0,0.968712,1.0,0.025641,1.0,True
7,What is the main focus of the paper 'A Survey ...,[A Survey on Retrieval-Augmented Text Generati...,The main focus of the paper 'A Survey on Retri...,[The main focus of the paper 'A Survey on Retr...,0.679167,1.0,0.982422,1.0,0.017857,1.0,True
8,What are the advantages of retrieval-augmented...,[attracted increasing attention of the compu-\...,The advantages of retrieval-augmented text gen...,[The advantages of retrieval-augmented text ge...,1.0,1.0,0.968731,1.0,0.025641,1.0,True
9,What are the advantages of retrieval-augmented...,[attracted increasing attention of the compu-\...,The advantages of retrieval-augmented text gen...,[The advantages of retrieval-augmented text ge...,1.0,1.0,0.968692,1.0,0.025641,1.0,True


We'll also combine the results into a single table so we can compare the pipelines side by side!

In [None]:
def create_df_dict(pipeline_name, pipeline_items):
  df_dict = {"name" : pipeline_name}
  for name, score in pipeline_items:
    df_dict[name] = score
  return df_dict

In [None]:
basic_rag_df_dict = create_df_dict("basic_rag", basic_qa_result.items())

In [None]:
pdr_rag_df_dict = create_df_dict("pdr_rag", pdr_qa_result.items())

In [None]:
ensemble_rag_df_dict = create_df_dict("ensemble_rag", ensemble_qa_result.items())

In [None]:
results_df = pd.DataFrame([basic_rag_df_dict, pdr_rag_df_dict, ensemble_rag_df_dict])

In [None]:
results_df.sort_values("answer_correctness", ascending=False)

Unnamed: 0,name,context_precision,faithfulness,answer_relevancy,context_recall,context_relevancy,answer_correctness,answer_similarity
2,ensemble_rag,0.885833,0.7,0.891845,0.98,0.019158,0.775,1.0
0,basic_rag,0.5,0.4,0.953475,1.0,0.055904,0.616667,1.0
1,pdr_rag,0.697222,0.35,0.943909,1.0,0.013386,0.6,1.0


### ❓QUESTION❓

What conclusions can you draw about the above results?

Describe in your own words what the metrics are expressing.

In [None]:
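from langchain.schema.runnable import RunnableParallel

# Parse the chat model's message into a plain string
parser = StrOutputParser()

retrieval_augmented_qa_chain = (
    RunnableParallel({
        'context': itemgetter('question') | base_retriever,
        # Pull out the question string; RunnablePassthrough() would forward the whole input dict
        'question': itemgetter('question')
    }) | {
        # With the parser attached, "response" is a str rather than an AIMessage
        'response': prompt | primary_qa_llm | parser,
        'context': itemgetter('context')
    }
)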