# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In this notebook, we'll look at how [Ragas](https://github.com/explodinggradients/ragas) can help you evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 5: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

> NOTE: Please skip the pip install commands if you are running the notebook locally.

In [1]:
!pip install -qU ragas==0.2.10


In [2]:
!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0


We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [3]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

Please enter your OpenAI API key!··········


**OPTIONALLY**:

We can also provide a Ragas API key - which you can sign-up for [here](https://app.ragas.io/).

In [None]:
os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas API key!")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a repeatable way to measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll start by downloading the webpages that will serve as our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [4]:
!mkdir data

In [5]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html



In [6]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html



Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [7]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

Building the knowledge graph step by step is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [8]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [9]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [10]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What advancements has OpenAI made in the field...,[Prompt driven app generation is a commodity a...,"In 2024, OpenAI's GPT-4 model, which was once ...",single_hop_specifc_query_synthesizer
1,Wht is Anthropic's cheepest model and its cost?,"[gets you OpenAI’s most expensive model, o1. G...","Anthropic’s cheapest model is Claude 3 Haiku, ...",single_hop_specifc_query_synthesizer
2,What was the initial challenge with OpenAI's W...,[feed with the model and talk about what you c...,OpenAI started with a WebSocket API that was q...,single_hop_specifc_query_synthesizer
3,What does it mean when someone says a prompt w...,[dependent on AGI itself. A model that’s robus...,"A prompt without the evals, models, and especi...",single_hop_specifc_query_synthesizer
4,How has the increased energy efficiency of AI ...,[<1-hop>\n\nPrompt driven app generation is a ...,The increased energy efficiency of AI models h...,multi_hop_abstract_query_synthesizer
5,"How do the criticisms of LLMs, particularly re...",[<1-hop>\n\nPrompt driven app generation is a ...,"The criticisms of LLMs, especially concerning ...",multi_hop_abstract_query_synthesizer
6,How has the concept of universal access to AI ...,[<1-hop>\n\nPrompt driven app generation is a ...,"In 2024, the concept of universal access to AI...",multi_hop_abstract_query_synthesizer
7,Why LLMs need better criticism and what are th...,[<1-hop>\n\nPrompt driven app generation is a ...,LLMs need better criticism because there are s...,multi_hop_abstract_query_synthesizer
8,How does the Claude 3.5 Sonnet compare to othe...,[<1-hop>\n\nthat. DeepSeek v3 is a huge 685B p...,Claude 3.5 Sonnet is benchmarked alongside oth...,multi_hop_specific_query_synthesizer
9,How have the advancements in Llama 3.2 models ...,[<1-hop>\n\neasy to follow. The rest of the do...,Recent advancements in Llama 3.2 models have d...,multi_hop_specific_query_synthesizer


In [11]:
dataset.to_pandas().to_csv('ragas_data_2.csv')


#### OPTIONAL:

If you've provided your Ragas API key - you can use this web interface to look at the created data!

In [None]:
dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/45d7742f-e0c6-4e85-9b66-6e819adfaec3


'https://app.ragas.io/dashboard/alignment/testset/45d7742f-e0c6-4e85-9b66-6e819adfaec3'

## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [11]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

74

#### ❓ Question:

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

### ANSWER
It specifies the number of characters shared between consecutive text chunks when splitting documents. This overlap helps maintain contextual continuity between chunks, so that a sentence cut at a chunk boundary still appears whole in at least one chunk.
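
To make the overlap concrete, here's a minimal sketch of fixed-size character chunking with overlap. It's a simplified illustration, not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences), and `chunk_text` is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows sharing `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last 4 characters of the previous one, which is the same idea as `chunk_overlap=200` on 1000-character chunks in the cell above.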

Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [13]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in-memory Qdrant vector store.

In [14]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=embeddings,
)
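
The collection above uses `Distance.COSINE`, and `size=1536` matches the dimensionality of `text-embedding-3-small` vectors. As a rough illustration of what cosine similarity measures, here's a plain-Python sketch on toy 3-dimensional vectors. The vectors are made up for the example, and `cosine_similarity` is a hypothetical helper rather than anything Qdrant exposes.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]   # same direction, different length
orthogonal = [0.0, 1.0, 0.0]       # shares no component with the query

print(cosine_similarity(query_vec, same_direction))
print(cosine_similarity(query_vec, orthogonal))
```

Cosine similarity depends only on direction, so the longer vector pointing the same way still scores as a near-perfect match.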

We can now add our documents to our vector store.

In [15]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [16]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

Now we can produce a node for retrieval!

In [17]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [18]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-mini` to avoid using the same model as our judge model.

In [19]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

Then we can create a `generate` node!

In [20]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [21]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [22]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
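
Conceptually, a sequence of nodes works like this: each node receives the current state and returns a partial update, which is merged into the state before the next node runs. The sketch below mimics that flow in plain Python. It isn't LangGraph's implementation, and `run_sequence`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins for illustration.

```python
from typing import Callable

def run_sequence(nodes: list[Callable[[dict], dict]], initial_state: dict) -> dict:
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(initial_state)  # copy so the caller's dict is not mutated
    for node in nodes:
        state.update(node(state))
    return state

def fake_retrieve(state: dict) -> dict:
    # Stand-in for the real retriever: returns only the "context" key.
    return {"context": [f"doc about {state['question']}"]}

def fake_generate(state: dict) -> dict:
    # Stand-in for the LLM call: reads "context", returns only the "response" key.
    return {"response": f"Answer based on {len(state['context'])} document(s)"}

final_state = run_sequence([fake_retrieve, fake_generate], {"question": "agents"})
print(final_state)
```

This is why `retrieve` and `generate` only return the keys they produce: the rest of the state is carried along for them.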

Let's do a test to make sure it's doing what we'd expect.

In [23]:
response = graph.invoke({"question" : "How are LLM agents useful?"})

In [24]:
response["response"]

'LLM agents are useful in several ways, particularly in the realm of software development and code generation. Here are some key points:\n\n1. **Ease of Building**: LLMs can be built with relatively little code—just a few hundred lines of Python—provided there is sufficient quality and quantity of training data. This makes them more accessible than one might initially think.\n\n2. **Local Operation**: Recent advancements have allowed LLMs to be run on personal devices, making them more practical for individual users without needing expensive server setups.\n\n3. **Effective Code Generation**: LLMs excel at writing code due to the simpler grammar rules of programming languages compared to natural languages. They can generate code effectively, and tools like ChatGPT Code Interpreter can execute and test this code, handling errors and correcting issues in real-time.\n\n4. **Automation of Tasks**: LLM agents are seen as potential systems that can perform tasks on behalf of users, although 

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated using SDG above through our application to get context and responses.

In [None]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [27]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What advancements has OpenAI made in the field...,[OpenAI are not the only game in town here. Go...,[Prompt driven app generation is a commodity a...,"In 2024, OpenAI has made significant advanceme...","In 2024, OpenAI's GPT-4 model, which was once ...",single_hop_specifc_query_synthesizer
1,Wht is Anthropic's cheepest model and its cost?,[Today $30/mTok gets you OpenAI’s most expensi...,"[gets you OpenAI’s most expensive model, o1. G...","Anthropic's cheapest model is Claude 3 Haiku, ...","Anthropic’s cheapest model is Claude 3 Haiku, ...",single_hop_specifc_query_synthesizer
2,What was the initial challenge with OpenAI's W...,[Did you know ChatGPT has two entirely differe...,[feed with the model and talk about what you c...,The initial challenge with OpenAI's WebSocket ...,OpenAI started with a WebSocket API that was q...,single_hop_specifc_query_synthesizer
3,What does it mean when someone says a prompt w...,[It’s become abundantly clear over the course ...,[dependent on AGI itself. A model that’s robus...,When someone says a prompt without evals is li...,"A prompt without the evals, models, and especi...",single_hop_specifc_query_synthesizer
4,How has the increased energy efficiency of AI ...,[The much bigger problem here is the enormous ...,[<1-hop>\n\nPrompt driven app generation is a ...,The increased energy efficiency of AI models h...,The increased energy efficiency of AI models h...,multi_hop_abstract_query_synthesizer
5,"How do the criticisms of LLMs, particularly re...",[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nPrompt driven app generation is a ...,"The criticisms of LLMs, particularly regarding...","The criticisms of LLMs, especially concerning ...",multi_hop_abstract_query_synthesizer
6,How has the concept of universal access to AI ...,"[In 2024, almost every significant model vendo...",[<1-hop>\n\nPrompt driven app generation is a ...,"In 2024, the concept of universal access to AI...","In 2024, the concept of universal access to AI...",multi_hop_abstract_query_synthesizer
7,Why LLMs need better criticism and what are th...,[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nPrompt driven app generation is a ...,LLMs (Large Language Models) need better criti...,LLMs need better criticism because there are s...,multi_hop_abstract_query_synthesizer
8,How does the Claude 3.5 Sonnet compare to othe...,[Getting back to models that beat GPT-4: Anthr...,[<1-hop>\n\nthat. DeepSeek v3 is a huge 685B p...,The Claude 3.5 Sonnet model is noted for its a...,Claude 3.5 Sonnet is benchmarked alongside oth...,multi_hop_specific_query_synthesizer
9,How have the advancements in Llama 3.2 models ...,[Another common technique is to use larger mod...,[<1-hop>\n\neasy to follow. The rest of the do...,Recent advancements in Llama 3.2 models and th...,Recent advancements in Llama 3.2 models have d...,multi_hop_specific_query_synthesizer


In [27]:
dataset.to_pandas().to_csv('ragas_data_2_with response.csv')

Then we can convert that table into an `EvaluationDataset`, which will make the process of evaluation smoother.

In [28]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluation_dataset

EvaluationDataset(features=['user_input', 'retrieved_contexts', 'reference_contexts', 'response', 'reference'], len=12)

We'll need to select a judge model - in this case we're using the same model that was used to generate our Synthetic Data.

In [29]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

Next up - we simply evaluate on our desired metrics!

In [30]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[24]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-TUDVNCQITlowbOaFB0KsnZu8 on tokens per min (TPM): Limit 30000, Used 29272, Requested 2315. Please try again in 3.174s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[1]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-TUDVNCQITlowbOaFB0KsnZu8 on tokens per min (TPM): Limit 30000, Used 28955, Requested 2442. Please try again in 2.794s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[7]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-TUDVNCQITlowbOaFB0KsnZ

{'context_recall': 0.6495, 'faithfulness': 0.6651, 'factual_correctness': 0.5242, 'answer_relevancy': 0.9525, 'context_entity_recall': 0.3639, 'noise_sensitivity_relevant': 0.1310}

In [31]:
result_df = result.to_pandas()
result_df

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,factual_correctness,answer_relevancy,context_entity_recall,noise_sensitivity_relevant
0,What advancements has OpenAI made in the field...,[OpenAI are not the only game in town here. Go...,[Prompt driven app generation is a commodity a...,"In 2024, OpenAI has made significant advanceme...","In 2024, OpenAI's GPT-4 model, which was once ...",0.0,,0.15,0.966099,0.583333,
1,Wht is Anthropic's cheepest model and its cost?,[Today $30/mTok gets you OpenAI’s most expensi...,"[gets you OpenAI’s most expensive model, o1. G...","Anthropic's cheapest model is Claude 3 Haiku, ...","Anthropic’s cheapest model is Claude 3 Haiku, ...",1.0,,1.0,0.945477,0.666667,0.0
2,What was the initial challenge with OpenAI's W...,[Did you know ChatGPT has two entirely differe...,[feed with the model and talk about what you c...,The initial challenge with OpenAI's WebSocket ...,OpenAI started with a WebSocket API that was q...,1.0,,0.5,1.0,1.0,0.333333
3,What does it mean when someone says a prompt w...,[It’s become abundantly clear over the course ...,[dependent on AGI itself. A model that’s robus...,When someone says a prompt without evals is li...,"A prompt without the evals, models, and especi...",1.0,,0.71,1.0,0.0,0.0
4,How has the increased energy efficiency of AI ...,[The much bigger problem here is the enormous ...,[<1-hop>\n\nPrompt driven app generation is a ...,The increased energy efficiency of AI models h...,The increased energy efficiency of AI models h...,,,0.59,0.945216,,
5,"How do the criticisms of LLMs, particularly re...",[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nPrompt driven app generation is a ...,"The criticisms of LLMs, particularly regarding...","The criticisms of LLMs, especially concerning ...",,,0.37,0.913569,0.083333,
6,How has the concept of universal access to AI ...,"[In 2024, almost every significant model vendo...",[<1-hop>\n\nPrompt driven app generation is a ...,"In 2024, the concept of universal access to AI...","In 2024, the concept of universal access to AI...",0.428571,,0.17,0.96022,,
7,Why LLMs need better criticism and what are th...,[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nPrompt driven app generation is a ...,LLMs (Large Language Models) need better criti...,LLMs need better criticism because there are s...,,,0.5,0.909795,,
8,How does the Claude 3.5 Sonnet compare to othe...,[Getting back to models that beat GPT-4: Anthr...,[<1-hop>\n\nthat. DeepSeek v3 is a huge 685B p...,The Claude 3.5 Sonnet model is noted for its a...,Claude 3.5 Sonnet is benchmarked alongside oth...,0.666667,,0.53,0.945543,0.3,
9,How have the advancements in Llama 3.2 models ...,[Another common technique is to use larger mod...,[<1-hop>\n\neasy to follow. The rest of the do...,Recent advancements in Llama 3.2 models and th...,Recent advancements in Llama 3.2 models have d...,0.0,,0.36,0.953485,0.0,


## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see how the model improves or doesn't improve!

> NOTE: This will be using Cohere's Rerank model (which was updated fairly [recently](https://docs.cohere.com/v2/changelog/rerank-v3.5)) - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [32]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")

Please enter your Cohere API key!··········


In [33]:
!pip install -qU cohere langchain_cohere



We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [34]:
retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking, or contextual compression, is a technique that uses a reranker to compress the retrieved documents into a smaller set of documents.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.

In [35]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

def retrieve_adjusted(state):
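  # Keep the top 5 reranked documents. The count is set with `top_n` on the reranker;
  # ContextualCompressionRetriever does not define a `search_kwargs` field.
  compressor = CohereRerank(model="rerank-v3.5", top_n=5)
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
  )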
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}
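
The two-stage pattern can be sketched in plain Python: a cheap first pass selects a broad candidate set, then a more expensive scorer reorders those candidates and keeps only the top few. Both scoring functions below are toy stand-ins based on word overlap, not what the vector store or Cohere's reranker actually compute, and all names are hypothetical.

```python
def cheap_score(query: str, doc: str) -> int:
    """First pass: count distinct words shared by the query and the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def expensive_score(query: str, doc: str) -> float:
    """Second pass (toy): fraction of the document's words that appear in the query."""
    query_words = set(query.lower().split())
    doc_words = doc.lower().split()
    return sum(word in query_words for word in doc_words) / len(doc_words)

def retrieve_then_rerank(query: str, docs: list[str], k_candidates: int, k_final: int) -> list[str]:
    # Stage 1: broad, cheap selection.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:k_candidates]
    # Stage 2: more careful scoring, applied only to the candidates.
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:k_final]

toy_docs = [
    "llm agents write code",
    "llm agents are discussed at length in many long blog posts about llm tooling",
    "agents",
    "cooking recipes for dinner",
]
top_docs = retrieve_then_rerank("llm agents", toy_docs, k_candidates=3, k_final=2)
print(top_docs)
```

The second stage can change the order the first stage produced, which is the point of retrieving 20 documents and then keeping only the best few.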

We can simply rebuild our graph with the new retriever!

In [36]:
class State(TypedDict):
  question: str
  context: List[Document]
  response: str

graph_builder = StateGraph(State).add_sequence([retrieve_adjusted, generate])
graph_builder.add_edge(START, "retrieve_adjusted")
graph = graph_builder.compile()

In [37]:
response = graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

'LLM agents are considered useful primarily for their capability in writing code, as they can effectively handle programming languages with simpler grammar rules compared to natural languages. However, there is skepticism regarding their overall utility due to their inherent gullibility; they cannot reliably distinguish truth from fiction, which poses challenges for tasks that require meaningful decision-making. The excitement around AI agents, which are often described as systems that can operate on behalf of users, has not yet translated into widespread, practical applications. Many remain concerned about the potential negative impacts of LLMs, such as environmental concerns, ethical issues related to training data, and the reliability of the outputs. Thus, while LLMs have their strengths, especially in coding, there are significant reservations about their broader usefulness and the need for critical evaluation of the technology.'

In [38]:
import numpy as np
for test_row in dataset:
  test_row.eval_sample.response = np.nan
  test_row.eval_sample.retrieved_contexts = np.nan
dataset.to_pandas()


  Expected `list[str]` but got `float` with value `nan` - serialized value may not be as expected
  Expected `str` but got `float` with value `nan` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(


Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What advancements has OpenAI made in the field...,,[Prompt driven app generation is a commodity a...,,"In 2024, OpenAI's GPT-4 model, which was once ...",single_hop_specifc_query_synthesizer
1,Wht is Anthropic's cheepest model and its cost?,,"[gets you OpenAI’s most expensive model, o1. G...",,"Anthropic’s cheapest model is Claude 3 Haiku, ...",single_hop_specifc_query_synthesizer
2,What was the initial challenge with OpenAI's W...,,[feed with the model and talk about what you c...,,OpenAI started with a WebSocket API that was q...,single_hop_specifc_query_synthesizer
3,What does it mean when someone says a prompt w...,,[dependent on AGI itself. A model that’s robus...,,"A prompt without the evals, models, and especi...",single_hop_specifc_query_synthesizer
4,How has the increased energy efficiency of AI ...,,[<1-hop>\n\nPrompt driven app generation is a ...,,The increased energy efficiency of AI models h...,multi_hop_abstract_query_synthesizer
5,"How do the criticisms of LLMs, particularly re...",,[<1-hop>\n\nPrompt driven app generation is a ...,,"The criticisms of LLMs, especially concerning ...",multi_hop_abstract_query_synthesizer
6,How has the concept of universal access to AI ...,,[<1-hop>\n\nPrompt driven app generation is a ...,,"In 2024, the concept of universal access to AI...",multi_hop_abstract_query_synthesizer
7,Why LLMs need better criticism and what are th...,,[<1-hop>\n\nPrompt driven app generation is a ...,,LLMs need better criticism because there are s...,multi_hop_abstract_query_synthesizer
8,How does the Claude 3.5 Sonnet compare to othe...,,[<1-hop>\n\nthat. DeepSeek v3 is a huge 685B p...,,Claude 3.5 Sonnet is benchmarked alongside oth...,multi_hop_specific_query_synthesizer
9,How have the advancements in Llama 3.2 models ...,,[<1-hop>\n\neasy to follow. The rest of the do...,,Recent advancements in Llama 3.2 models have d...,multi_hop_specific_query_synthesizer


In [39]:
import time

for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.
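
A fixed `time.sleep` between calls is the simplest option. A common alternative is retrying with exponential backoff, sketched below in plain Python. `call_with_backoff` and `flaky_call` are hypothetical helpers, and in practice you would catch your client's specific rate-limit exception rather than `RuntimeError`.

```python
import time

def call_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds and try again."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RuntimeError:
            if attempt == max_retries:
                raise  # out of retries: surface the error
            sleep(base_delay * 2 ** attempt)

# Toy function that fails twice before succeeding.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

delays = []  # record the waits instead of actually sleeping
outcome = call_with_backoff(flaky_call, sleep=delays.append)
print(outcome, delays)
```

Passing `sleep=delays.append` keeps the demo instant. Leaving the default makes the helper really wait between attempts.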

In [40]:
dataset.to_pandas()


Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What advancements has OpenAI made in the field...,[Simon Willison’s Weblog\n\nSubscribe\n\nThing...,[Prompt driven app generation is a commodity a...,"In 2024, OpenAI made significant advancements ...","In 2024, OpenAI's GPT-4 model, which was once ...",single_hop_specifc_query_synthesizer
1,Wht is Anthropic's cheepest model and its cost?,[Today $30/mTok gets you OpenAI’s most expensi...,"[gets you OpenAI’s most expensive model, o1. G...","Anthropic's cheapest model is Claude 3 Haiku, ...","Anthropic’s cheapest model is Claude 3 Haiku, ...",single_hop_specifc_query_synthesizer
2,What was the initial challenge with OpenAI's W...,[These abilities are just a few weeks old at t...,[feed with the model and talk about what you c...,The initial challenge with OpenAI's WebSocket ...,OpenAI started with a WebSocket API that was q...,single_hop_specifc_query_synthesizer
3,What does it mean when someone says a prompt w...,[It’s become abundantly clear over the course ...,[dependent on AGI itself. A model that’s robus...,When someone says a prompt without evals is li...,"A prompt without the evals, models, and especi...",single_hop_specifc_query_synthesizer
4,How has the increased energy efficiency of AI ...,"[I think this means that, as individual users,...",[<1-hop>\n\nPrompt driven app generation is a ...,The increased energy efficiency of AI models h...,The increased energy efficiency of AI models h...,multi_hop_abstract_query_synthesizer
5,"How do the criticisms of LLMs, particularly re...",[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nPrompt driven app generation is a ...,The criticisms of large language models (LLMs)...,"The criticisms of LLMs, especially concerning ...",multi_hop_abstract_query_synthesizer
6,How has the concept of universal access to AI ...,[Simon Willison’s Weblog\n\nSubscribe\n\nThing...,[<1-hop>\n\nPrompt driven app generation is a ...,"In 2024, the concept of universal access to AI...","In 2024, the concept of universal access to AI...",multi_hop_abstract_query_synthesizer
7,Why LLMs need better criticism and what are th...,[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nPrompt driven app generation is a ...,LLMs (Large Language Models) need better criti...,LLMs need better criticism because there are s...,multi_hop_abstract_query_synthesizer
8,How does the Claude 3.5 Sonnet compare to othe...,[Getting back to models that beat GPT-4: Anthr...,[<1-hop>\n\nthat. DeepSeek v3 is a huge 685B p...,The Claude 3.5 Sonnet model is noted for its p...,Claude 3.5 Sonnet is benchmarked alongside oth...,multi_hop_specific_query_synthesizer
9,How have the advancements in Llama 3.2 models ...,[“Agents” still haven’t really happened yet\n\...,[<1-hop>\n\neasy to follow. The rest of the do...,Recent advancements in Llama 3.2 models and th...,Recent advancements in Llama 3.2 models have d...,multi_hop_specific_query_synthesizer


In [41]:
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

In [None]:
result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[65]: TimeoutError()
Exception raised in Job[71]: TimeoutError()


{'context_recall': 0.7610, 'faithfulness': 0.8309, 'factual_correctness': 0.4275, 'answer_relevancy': 0.8605, 'context_entity_recall': 0.5109, 'noise_sensitivity_relevant': 0.3846}

In [42]:
result2 = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result2

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[24]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-TUDVNCQITlowbOaFB0KsnZu8 on tokens per min (TPM): Limit 30000, Used 29031, Requested 1852. Please try again in 1.766s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[1]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-TUDVNCQITlowbOaFB0KsnZu8 on tokens per min (TPM): Limit 30000, Used 29129, Requested 2045. Please try again in 2.348s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
ERROR:ragas.executor:Exception raised in Job[13]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-TUDVNCQITlowbOaFB0Ksn

{'context_recall': 0.7542, 'faithfulness': 0.7047, 'factual_correctness': 0.5700, 'answer_relevancy': 0.8582, 'context_entity_recall': 0.3597, 'noise_sensitivity_relevant': 0.3261}
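
The `RateLimitError` entries above are 429 responses from exceeding the tokens-per-minute limit, and each failed job leaves a NaN in the per-sample scores. One way to reduce them is to lower the evaluation concurrency through `RunConfig`. The following is a minimal sketch: `throttled_run_config` is a hypothetical name and the values are illustrative assumptions, not tuned settings.

```python
from ragas import RunConfig

# Sketch only: the values below are illustrative assumptions, not tuned settings.
throttled_run_config = RunConfig(
    timeout=360,     # seconds allowed per job, same as custom_run_config
    max_workers=4,   # fewer concurrent jobs than the default, to stay under the TPM limit
    max_retries=10,  # retry attempts on failure
    max_wait=60,     # upper bound, in seconds, on the wait between retries
)
```

Passing this as `run_config=throttled_run_config` to `evaluate` trades a longer wall-clock run for fewer failed jobs.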

In [43]:
result_df2 = result2.to_pandas()
result_df2

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,factual_correctness,answer_relevancy,context_entity_recall,noise_sensitivity_relevant
0,What advancements has OpenAI made in the field...,[Simon Willison’s Weblog\n\nSubscribe\n\nThing...,[Prompt driven app generation is a commodity a...,"In 2024, OpenAI made significant advancements ...","In 2024, OpenAI's GPT-4 model, which was once ...",0.2,,,0.966099,0.333333,
1,Wht is Anthropic's cheepest model and its cost?,[Today $30/mTok gets you OpenAI’s most expensi...,"[gets you OpenAI’s most expensive model, o1. G...","Anthropic's cheapest model is Claude 3 Haiku, ...","Anthropic’s cheapest model is Claude 3 Haiku, ...",1.0,0.666667,1.0,0.888445,0.666667,0.666667
2,What was the initial challenge with OpenAI's W...,[These abilities are just a few weeks old at t...,[feed with the model and talk about what you c...,The initial challenge with OpenAI's WebSocket ...,OpenAI started with a WebSocket API that was q...,1.0,,0.5,0.966881,1.0,0.666667
3,What does it mean when someone says a prompt w...,[It’s become abundantly clear over the course ...,[dependent on AGI itself. A model that’s robus...,When someone says a prompt without evals is li...,"A prompt without the evals, models, and especi...",1.0,,0.4,1.0,0.0,0.0
4,How has the increased energy efficiency of AI ...,"[I think this means that, as individual users,...",[<1-hop>\n\nPrompt driven app generation is a ...,The increased energy efficiency of AI models h...,The increased energy efficiency of AI models h...,,1.0,,0.965649,0.545455,
5,"How do the criticisms of LLMs, particularly re...",[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nPrompt driven app generation is a ...,The criticisms of large language models (LLMs)...,"The criticisms of LLMs, especially concerning ...",0.833333,0.516129,0.32,0.852971,0.083333,
6,How has the concept of universal access to AI ...,[Simon Willison’s Weblog\n\nSubscribe\n\nThing...,[<1-hop>\n\nPrompt driven app generation is a ...,"In 2024, the concept of universal access to AI...","In 2024, the concept of universal access to AI...",,0.421053,0.54,0.95477,0.384615,0.125
7,Why LLMs need better criticism and what are th...,[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nPrompt driven app generation is a ...,LLMs (Large Language Models) need better criti...,LLMs need better criticism because there are s...,0.75,,0.63,0.895417,0.125,
8,How does the Claude 3.5 Sonnet compare to othe...,[Getting back to models that beat GPT-4: Anthr...,[<1-hop>\n\nthat. DeepSeek v3 is a huge 685B p...,The Claude 3.5 Sonnet model is noted for its p...,Claude 3.5 Sonnet is benchmarked alongside oth...,,,0.47,0.0,0.5,
9,How have the advancements in Llama 3.2 models ...,[“Agents” still haven’t really happened yet\n\...,[<1-hop>\n\neasy to follow. The rest of the do...,Recent advancements in Llama 3.2 models and th...,Recent advancements in Llama 3.2 models have d...,,0.647059,0.56,0.953485,0.0,


In [44]:
result_df

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,faithfulness,factual_correctness,answer_relevancy,context_entity_recall,noise_sensitivity_relevant
0,What advancements has OpenAI made in the field...,[OpenAI are not the only game in town here. Go...,[Prompt driven app generation is a commodity a...,"In 2024, OpenAI has made significant advanceme...","In 2024, OpenAI's GPT-4 model, which was once ...",0.0,,0.15,0.966099,0.583333,
1,Wht is Anthropic's cheepest model and its cost?,[Today $30/mTok gets you OpenAI’s most expensi...,"[gets you OpenAI’s most expensive model, o1. G...","Anthropic's cheapest model is Claude 3 Haiku, ...","Anthropic’s cheapest model is Claude 3 Haiku, ...",1.0,,1.0,0.945477,0.666667,0.0
2,What was the initial challenge with OpenAI's W...,[Did you know ChatGPT has two entirely differe...,[feed with the model and talk about what you c...,The initial challenge with OpenAI's WebSocket ...,OpenAI started with a WebSocket API that was q...,1.0,,0.5,1.0,1.0,0.333333
3,What does it mean when someone says a prompt w...,[It’s become abundantly clear over the course ...,[dependent on AGI itself. A model that’s robus...,When someone says a prompt without evals is li...,"A prompt without the evals, models, and especi...",1.0,,0.71,1.0,0.0,0.0
4,How has the increased energy efficiency of AI ...,[The much bigger problem here is the enormous ...,[<1-hop>\n\nPrompt driven app generation is a ...,The increased energy efficiency of AI models h...,The increased energy efficiency of AI models h...,,,0.59,0.945216,,
5,"How do the criticisms of LLMs, particularly re...",[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nPrompt driven app generation is a ...,"The criticisms of LLMs, particularly regarding...","The criticisms of LLMs, especially concerning ...",,,0.37,0.913569,0.083333,
6,How has the concept of universal access to AI ...,"[In 2024, almost every significant model vendo...",[<1-hop>\n\nPrompt driven app generation is a ...,"In 2024, the concept of universal access to AI...","In 2024, the concept of universal access to AI...",0.428571,,0.17,0.96022,,
7,Why LLMs need better criticism and what are th...,[LLMs need better criticism\n\nA lot of people...,[<1-hop>\n\nPrompt driven app generation is a ...,LLMs (Large Language Models) need better criti...,LLMs need better criticism because there are s...,,,0.5,0.909795,,
8,How does the Claude 3.5 Sonnet compare to othe...,[Getting back to models that beat GPT-4: Anthr...,[<1-hop>\n\nthat. DeepSeek v3 is a huge 685B p...,The Claude 3.5 Sonnet model is noted for its a...,Claude 3.5 Sonnet is benchmarked alongside oth...,0.666667,,0.53,0.945543,0.3,
9,How have the advancements in Llama 3.2 models ...,[Another common technique is to use larger mod...,[<1-hop>\n\neasy to follow. The rest of the do...,Recent advancements in Llama 3.2 models and th...,Recent advancements in Llama 3.2 models have d...,0.0,,0.36,0.953485,0.0,


#### ❓ Question:

Which system performed better, on what metrics, and why?

### ANSWER

#### Result without reranker
{'context_recall': 0.6495, 'faithfulness': 0.6651, 'factual_correctness': 0.5242, 'answer_relevancy': 0.9525, 'context_entity_recall': 0.3639, 'noise_sensitivity_relevant': 0.1310}
#### Result with reranker
{'context_recall': 0.7542, 'faithfulness': 0.7047, 'factual_correctness': 0.5700, 'answer_relevancy': 0.8582, 'context_entity_recall': 0.3597, 'noise_sensitivity_relevant': 0.3261}
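
To make the comparison explicit, the two dictionaries above can be diffed metric by metric. This is a minimal sketch: `compare` and `LOWER_IS_BETTER` are hypothetical names, and it assumes every metric is higher-is-better except noise sensitivity, where a lower score is better.

```python
# Aggregate scores copied from the two result dictionaries above.
baseline = {
    "context_recall": 0.6495, "faithfulness": 0.6651, "factual_correctness": 0.5242,
    "answer_relevancy": 0.9525, "context_entity_recall": 0.3639,
    "noise_sensitivity_relevant": 0.1310,
}
reranked = {
    "context_recall": 0.7542, "faithfulness": 0.7047, "factual_correctness": 0.5700,
    "answer_relevancy": 0.8582, "context_entity_recall": 0.3597,
    "noise_sensitivity_relevant": 0.3261,
}

# Assumption: all metrics are "higher is better" except noise sensitivity.
LOWER_IS_BETTER = {"noise_sensitivity_relevant"}

def compare(before, after):
    """Return {metric: (delta, improved)} for metrics present in both runs."""
    report = {}
    for metric, old_score in before.items():
        if metric not in after:
            continue
        delta = round(after[metric] - old_score, 4)
        improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
        report[metric] = (delta, improved)
    return report

report = compare(baseline, reranked)
for metric, (delta, improved) in report.items():
    label = "improved" if improved else "not improved"
    print(f"{metric:28s} {delta:+.4f}  {label}")
```

Keeping the metric direction in one place (`LOWER_IS_BETTER`) avoids misreading a rise in noise sensitivity as a gain.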


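These results are not fully reliable: many per-sample scores are NaN because evaluation jobs failed with timeouts or rate-limit errors, so each aggregate is averaged over a different subset of samples. Taken at face value, the run with the reranker scores higher on context_recall, faithfulness, and factual_correctness, lower on answer_relevancy, and about the same on context_entity_recall. Its noise_sensitivity_relevant score is also higher, which is a regression because lower is better for that metric. An overall improvement is what we would expect, since a reranker improves the quality and relevance of the chunks given to the LLM.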
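
How much a given aggregate can be trusted depends on how many samples actually received a score. As a minimal sketch, the snippet below counts the valid entries in one metric column. The values are the faithfulness scores copied from the ten `result_df2` rows displayed above, and `coverage` is a hypothetical helper name.

```python
import math

# Faithfulness scores for the ten displayed reranker rows (result_df2).
# NaN marks a sample whose evaluation job failed.
nan = float("nan")
faithfulness = [nan, 0.666667, nan, nan, 1.0, 0.516129, 0.421053, nan, nan, 0.647059]

def coverage(scores):
    """Return (number of non-NaN scores, total number of scores)."""
    valid = [s for s in scores if not math.isnan(s)]
    return len(valid), len(scores)

n_valid, n_total = coverage(faithfulness)
print(f"faithfulness: {n_valid}/{n_total} samples scored")
```

On the full DataFrames, `result_df2.isna().sum()` gives the same count for every column at once.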

### Full Code

In [None]:
# Load data
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

# Define LLM and Embedder for Synthetic Data Generation using RAGAS
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Generate Questions dataset
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)
dataset.to_pandas()

################### Build RAG with LangChain

# Define embedder
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Define Vector store
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=embeddings,
)

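# Split documents into chunks before indexing
# NOTE: chunk_size and chunk_overlap are assumed values; use the ones from the splitting step earlier in the notebook
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)

# Add chunks to Vector store
_ = vector_store.add_documents(documents=split_documents)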

# Set Vector store as a retriever
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

# Define Retrieval node
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

# Create prompt
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

# Define LLM model for RAG
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

# Define Generation node
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

####################### Build LangGraph

# Define State
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

# Add nodes as a sequence, add START and compile graph
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

# Invoke Graph
response = graph.invoke({"question" : "How are LLM agents useful?"})

############################

# Generate answers for test set questions and add them to the dataset
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

# Make the dataset as a RAGAS Evaluation Dataset
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

# Choose the evaluator LLM (LangchainLLMWrapper and ChatOpenAI are already imported above)
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

# Evaluate using RAGAS
from ragas import evaluate, RunConfig
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity

# Allow up to 360 seconds per evaluation job before it times out
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

############################################### Add Reranker

# Redefine retriever to retrieve 20 chunks instead of 5
retriever = vector_store.as_retriever(search_kwargs={"k": 20})

# Redefine the retrieval node and add a reranker
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

def retrieve_adjusted(state):
  # top_n sets how many reranked chunks are kept (CohereRerank defaults to 3).
  # ContextualCompressionRetriever does not define a search_kwargs field,
  # so the limit of 5 is set on the reranker instead.
  compressor = CohereRerank(model="rerank-v3.5", top_n=5)
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
  )
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

# Redefine state and build a new graph using the new retriever
class State(TypedDict):
  question: str
  context: List[Document]
  response: str

graph_builder = StateGraph(State).add_sequence([retrieve_adjusted, generate])
graph_builder.add_edge(START, "retrieve_adjusted")
graph = graph_builder.compile()

# Regenerate responses using the new retriever with reranker
import time

for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

# Evaluate the reranker pipeline; use a new name so the baseline `result` is kept for comparison
result2 = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result2