# Lab | Langchain Evaluation

## Intro

Pick different sets of data and re-run this notebook. The point is for you to understand all the steps involved and the many different ways one can and should evaluate LLM applications.

What did you learn? - Let's discuss that in class

## LangChain: Evaluation

### Outline:

* Example generation
* Manual evaluation (and debugging)
* LLM-assisted evaluation

In [1]:
from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY') 
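
If the key is missing, `os.getenv` returns `None` without complaint and the failure only surfaces later, at the first OpenAI call. A minimal sketch of an early check; `require_env` is a hypothetical helper, not part of `python-dotenv`:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value

# Usage (after load_dotenv has run):
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```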

### Example 1

#### Create our QandA application

In [2]:
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI
from langchain.llms import OpenAI
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.document_loaders import CSVLoader, TextLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch
from langchain.chains import LLMChain
from tqdm.autonotebook import trange, tqdm



In [3]:
file = './data/OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()

In [4]:
#!pip install --upgrade --force-reinstall sentence-transformers

In [5]:
#pip install docarray

In [6]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs = {'device': 'cpu'})
).from_loaders([loader])



In [7]:
llm = ChatOpenAI(temperature = 0.0)
qa = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=index.vectorstore.as_retriever(), 
    verbose=True,
    chain_type_kwargs = {
        "document_separator": "<<<<>>>>>"
    }
)

#### Coming up with test datapoints

In [8]:
data[10]

Document(metadata={'source': './data/OutdoorClothingCatalog_1000.csv', 'row': 10}, page_content=": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: Perfect for lounging, this striped knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out.\n\nSize & Fit\n- Pants are Favorite Fit: Sits lower on the waist.\n- Relaxed Fit: Our most generous fit sits farthest from the body.\n\nFabric & Care\n- In the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features\n- Relaxed fit top with raglan sleeves and rounded hem.\n- Pull-on pants have a wide elastic waistband and drawstring, side pockets and a modern slim leg.\n\nImported.")

In [9]:
data[11]

Document(metadata={'source': './data/OutdoorClothingCatalog_1000.csv', 'row': 11}, page_content=': 11\nname: Ultra-Lofty 850 Stretch Down Hooded Jacket\ndescription: This technical stretch down jacket from our DownTek collection is sure to keep you warm and comfortable with its full-stretch construction providing exceptional range of motion. With a slightly fitted style that falls at the hip and best with a midweight layer, this jacket is suitable for light activity up to 20° and moderate activity up to -30°. The soft and durable 100% polyester shell offers complete windproof protection and is insulated with warm, lofty goose down. Other features include welded baffles for a no-stitch construction and excellent stretch, an adjustable hood, an interior media port and mesh stash pocket and a hem drawcord. Machine wash and dry. Imported.')

#### Hard-coded examples

In [10]:
from langchain.prompts import PromptTemplate

In [11]:
from langchain.prompts import PromptTemplate
from langchain.schema import BaseOutputParser
from pydantic import BaseModel, Field

examples = [
    {
        "query": "Do the Cozy Comfort Pullover Set\
        have side pockets?",
        "answer": "Yes"
    },
    {
        "query": "What collection is the Ultra-Lofty \
        850 Stretch Down Hooded Jacket from?",
        "answer": "The DownTek collection"
    }
]

# Define the prompt template
prompt_template = PromptTemplate(
    input_variables=["query"],
    template="Examples:\n"
             "1. Query: Do the Cozy Comfort Pullover Set have side pockets?\n"
             "   Answer: Yes\n"
             "2. Query: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?\n"
             "   Answer: The DownTek collection\n"
             "Query: {query}\n"
             "Answer:"
)

# Define the output model
class Answer(BaseModel):
    answer: str = Field(description="The answer to the query")

# Create the output parser
class AnswerOutputParser(BaseOutputParser):
    def parse(self, text: str) -> Answer:
        # Split the response to get the answer
        answer = text.strip().split("Answer:")[-1].strip()
        return Answer(answer=answer)

# Initialize the LLM
# llm = OpenAI()
llm = ChatOpenAI()

# Create the LLMChain
llm_chain = LLMChain(
    llm=llm,
    prompt=prompt_template,
    output_parser=AnswerOutputParser()
)

# Example query
query = "Is the Cozy Comfort Pullover Set available in different colors?"

# Run the chain
result = llm_chain.invoke({"query": query})

# Print the result
print(result)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


{'query': 'Is the Cozy Comfort Pullover Set available in different colors?', 'text': Answer(answer='Yes, it is available in black, gray, and white.')}
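
The parser above keeps whatever follows the last `Answer:` marker in the model output. The same string logic can be tried without calling a model; `extract_answer` is a hypothetical stand-in for the body of `AnswerOutputParser.parse`, and the input strings are made up:

```python
def extract_answer(text: str) -> str:
    """Mirror AnswerOutputParser.parse: keep the text after the last 'Answer:' marker."""
    return text.strip().split("Answer:")[-1].strip()

# If the model echoes the marker, only the trailing answer is kept.
print(extract_answer("Answer: Yes, it has side pockets."))
# If the marker is absent, the whole stripped text is returned.
print(extract_answer("  The DownTek collection \n"))
```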


#### LLM-Generated examples

In [12]:
from langchain.evaluation.qa import QAGenerateChain

In [13]:
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI())

In [14]:
llm_chain = LLMChain(llm=llm, prompt=prompt_template)

In [15]:
new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in data[:5]]
)



In [16]:
new_examples[0]

{'qa_pairs': {'query': "What are the key features of the Women's Campside Oxfords as described in the document?",
  'answer': "The key features of the Women's Campside Oxfords include a super-soft canvas material for a broken-in feel, comfortable EVA innersole with antimicrobial odor control, moderate arch contour, EVA foam midsole for cushioning and support, and a chain-tread-inspired molded rubber outsole with a modified chain-tread pattern."}}

In [17]:
data[0]

Document(metadata={'source': './data/OutdoorClothingCatalog_1000.csv', 'row': 0}, page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.")

In [18]:
d_flattened = [data['qa_pairs'] for data in new_examples]
d_flattened

[{'query': "What are the key features of the Women's Campside Oxfords as described in the document?",
  'answer': "The key features of the Women's Campside Oxfords include a super-soft canvas material for a broken-in feel, comfortable EVA innersole with antimicrobial odor control, moderate arch contour, EVA foam midsole for cushioning and support, and a chain-tread-inspired molded rubber outsole with a modified chain-tread pattern."},
 {'query': 'What are the dimensions of the small Recycled Waterhog Dog Mat in the Chevron Weave design?',
  'answer': 'The dimensions of the small Recycled Waterhog Dog Mat in the Chevron Weave design are 18" x 28".'},
 {'query': "What are some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece as described in the document?",
  'answer': 'The key features of the swimsuit include bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover

#### Combine examples

In [19]:
# examples += new_example
examples += d_flattened

In [20]:
examples[0]

{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'answer': 'Yes'}
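
The run of spaces inside the query comes from the backslash line continuation within the string literal: the indentation of the following source line becomes part of the string. A small sketch of one way to normalize such queries before using them; `normalize_query` is a hypothetical helper:

```python
def normalize_query(query: str) -> str:
    """Collapse any run of whitespace into a single space and trim the ends."""
    return " ".join(query.split())

# Same text as examples[0]["query"], including the accidental indentation.
raw = "Do the Cozy Comfort Pullover Set        have side pockets?"
print(repr(normalize_query(raw)))
```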

In [21]:
qa.invoke(examples[0]["query"])



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'result': 'Yes, the Cozy Comfort Pullover Set has side pockets.'}

### Manual Evaluation - Fun part

In [22]:
import langchain
langchain.debug = True

In [23]:
qa.invoke(examples[0]["query"])

[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA] Entering Chain run with input:
[0m{
  "query": "Do the Cozy Comfort Pullover Set        have side pockets?"
}
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain] Entering Chain run with input:
[0m[inputs]
[32;1m[1;3m[chain/start][0m [1m[chain:RetrievalQA > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
[0m{
  "question": "Do the Cozy Comfort Pullover Set        have side pockets?",
  "context": ": 73\nname: Cozy Cuddles Knit Pullover Set\ndescription: Perfect for lounging, this knit set lives up to its name. We used ultrasoft fabric and an easy design that's as comfortable at bedtime as it is when we have to make a quick run out. \n\nSize & Fit \nPants are Favorite Fit: Sits lower on the waist. \nRelaxed Fit: Our most generous fit sits farthest from the body. \n\nFabric & Care \nIn the softest blend of 63% polyester, 35% rayon and 2% spandex.\n\nAdditional Features \

{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
 'result': 'Yes, the Cozy Comfort Pullover Set has side pockets.'}
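
In the debug trace, the first retrieved entry is the "Cozy Cuddles Knit Pullover Set" (row 73), not the "Cozy Comfort Pullover Set" the question asks about. The trace is truncated, so the remaining retrieved entries are not visible here. One way to make this check explicit is to list the product names found in the context string. A minimal sketch, assuming the context follows the `name:` line layout shown above and the `<<<<>>>>>` separator configured for the chain; `names_in_context` is a hypothetical helper and the second document in the toy context is illustrative only:

```python
def names_in_context(context: str, separator: str = "<<<<>>>>>") -> list[str]:
    """Return the 'name:' value of each document stuffed into the prompt context."""
    names = []
    for chunk in context.split(separator):
        for line in chunk.splitlines():
            if line.startswith("name: "):
                names.append(line[len("name: "):].strip())
                break  # only the first name line per document
    return names

# Toy context with two documents in the same layout as the catalog rows.
context = (
    ": 73\nname: Cozy Cuddles Knit Pullover Set\ndescription: ..."
    "<<<<>>>>>"
    ": 10\nname: Cozy Comfort Pullover Set, Stripe\ndescription: ..."
)
found = names_in_context(context)
print(found)
print(any("Cozy Comfort Pullover Set" in name for name in found))
```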

In [24]:
# Turn off the debug mode
langchain.debug = False

### LLM-assisted evaluation

In [25]:
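# Note: d_flattened was already appended in In [19]; appending it again
# duplicates the generated examples (2 hard-coded + 5 + 5 = 12 entries below).
examples += d_flattened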

In [26]:
examples

[{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
  'answer': 'Yes'},
 {'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection'},
 {'query': "What are the key features of the Women's Campside Oxfords as described in the document?",
  'answer': "The key features of the Women's Campside Oxfords include a super-soft canvas material for a broken-in feel, comfortable EVA innersole with antimicrobial odor control, moderate arch contour, EVA foam midsole for cushioning and support, and a chain-tread-inspired molded rubber outsole with a modified chain-tread pattern."},
 {'query': 'What are the dimensions of the small Recycled Waterhog Dog Mat in the Chevron Weave design?',
  'answer': 'The dimensions of the small Recycled Waterhog Dog Mat in the Chevron Weave design are 18" x 28".'},
 {'query': "What are some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece as describ

In [27]:
predictions = qa.batch(examples)



> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...
> Entering new RetrievalQA chain...

> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.
> Finished chain.


In [28]:
predictions

[{'query': 'Do the Cozy Comfort Pullover Set        have side pockets?',
  'answer': 'Yes',
  'result': 'Yes, the Cozy Comfort Pullover Set does have side pockets.'},
 {'query': 'What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?',
  'answer': 'The DownTek collection',
  'result': 'The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.'},
 {'query': "What are the key features of the Women's Campside Oxfords as described in the document?",
  'answer': "The key features of the Women's Campside Oxfords include a super-soft canvas material for a broken-in feel, comfortable EVA innersole with antimicrobial odor control, moderate arch contour, EVA foam midsole for cushioning and support, and a chain-tread-inspired molded rubber outsole with a modified chain-tread pattern.",
  'result': "I'm sorry, but there is no information provided about the Women's Campside Oxfords in the given context."},
 {'query': 'What are the dimensions of the smal

In [29]:
from langchain.evaluation.qa import QAEvalChain

In [30]:
llm = ChatOpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

In [31]:
graded_outputs = eval_chain.evaluate(examples, predictions)

In [32]:
graded_outputs

[{'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'},
 {'results': 'CORRECT'}]
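
With more than a handful of examples, a summary is easier to read than the raw list. A minimal sketch that tallies the grades, assuming each graded output is a dict with a `results` key as in the output above; `summarize_grades` is a hypothetical helper and the sample list is illustrative:

```python
from collections import Counter

def summarize_grades(graded_outputs, key="results"):
    """Count each grade label and compute the share graded CORRECT."""
    counts = Counter(g[key].strip().upper() for g in graded_outputs)
    total = sum(counts.values())
    accuracy = counts["CORRECT"] / total if total else float("nan")
    return counts, accuracy

sample = [{"results": "CORRECT"}, {"results": "INCORRECT"}, {"results": "CORRECT"}]
counts, accuracy = summarize_grades(sample)
print(dict(counts), accuracy)
```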

In [33]:
for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    # print("Predicted Grade: " + graded_outputs[i]['results'])  # the key is 'results', see graded_outputs above
    print()

Example 0:
Question: Do the Cozy Comfort Pullover Set        have side pockets?
Real Answer: Yes
Predicted Answer: Yes, the Cozy Comfort Pullover Set does have side pockets.

Example 1:
Question: What collection is the Ultra-Lofty         850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.

Example 2:
Question: What are the key features of the Women's Campside Oxfords as described in the document?
Real Answer: The key features of the Women's Campside Oxfords include a super-soft canvas material for a broken-in feel, comfortable EVA innersole with antimicrobial odor control, moderate arch contour, EVA foam midsole for cushioning and support, and a chain-tread-inspired molded rubber outsole with a modified chain-tread pattern.
Predicted Answer: I'm sorry, but there is no information provided about the Women's Campside Oxfords in the given context.

Example 3:
Question: Wh
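
Note that Example 2 was graded CORRECT even though the predicted answer states that no information was found, so the LLM grades are worth spot-checking by hand. A rough sketch of a keyword heuristic that flags such predictions for manual review; the phrase list and `flag_refusals` are assumptions for illustration, not part of LangChain:

```python
REFUSAL_PHRASES = ("no information", "i don't know", "i do not know", "not provided")

def flag_refusals(predictions, graded_outputs, key="results"):
    """Return indices where the grade is CORRECT but the answer looks like a refusal."""
    flagged = []
    for i, (pred, grade) in enumerate(zip(predictions, graded_outputs)):
        answer = pred["result"].lower()
        looks_like_refusal = any(phrase in answer for phrase in REFUSAL_PHRASES)
        if looks_like_refusal and grade[key].strip().upper() == "CORRECT":
            flagged.append(i)
    return flagged

predictions_sample = [
    {"result": "Yes, the Cozy Comfort Pullover Set does have side pockets."},
    {"result": "I'm sorry, but there is no information provided about the "
               "Women's Campside Oxfords in the given context."},
]
grades_sample = [{"results": "CORRECT"}, {"results": "CORRECT"}]
print(flag_refusals(predictions_sample, grades_sample))
```

A keyword match like this is crude; it is meant to point you at rows to read, not to replace the grader.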

### Example 2
You can also evaluate your QA chains with the metrics offered by ragas.

In [34]:
from langchain_huggingface import HuggingFaceEmbeddings
loader = TextLoader("./data/nyc_text.txt")
index = VectorstoreIndexCreator(embedding=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs = {'device': 'mps'})).from_loaders([loader])


llm = ChatOpenAI(temperature= 0)
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=index.vectorstore.as_retriever(),
    return_source_documents=True,
)



In [35]:
# testing it out

question = "How did New York City get its name?"
result = qa_chain.invoke({"query": question})
result["result"]

'New York City was originally named New Amsterdam by Dutch colonists in 1626. When the city came under British control in 1664, it was renamed New York after King Charles II of England granted the lands to his brother, the Duke of York. The city has been continuously named New York since November 1674.'

In [36]:
result

{'query': 'How did New York City get its name?',
 'result': 'New York City was originally named New Amsterdam by Dutch colonists in 1626. When the city came under British control in 1664, it was renamed New York after King Charles II of England granted the lands to his brother, the Duke of York. The city has been continuously named New York since November 1674.',
 'source_documents': [Document(metadata={'source': './data/nyc_text.txt'}, page_content='The city and its metropolitan area constitute the premier gateway for legal immigration to the United States. As many as 800 languages are spoken in New York, making it the most linguistically diverse city in the world. New York City is home to more than 3.2 million residents born outside the U.S., the largest foreign-born population of any city in the world as of 2016.New York City traces its origins to a trading post founded on the southern tip of Manhattan Island by Dutch colonists in approximately 1624. The settlement was named New Ams

To evaluate the QA system, we need a few relevant questions with reference answers. We've generated a few questions for you, but feel free to add any you want.

In [37]:
eval_questions = [
    "What is the population of New York City as of 2020?",
    "Which borough of New York City has the highest population?",
    "What is the economic significance of New York City?",
    "How did New York City get its name?",
    "What is the significance of the Statue of Liberty in New York City?",
]

eval_answers = [
    "8,804,190",
    "Brooklyn",
    "New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter.",
    "New York City got its name when it came under British control in 1664. King Charles II of England granted the lands to his brother, the Duke of York, who named the city New York in his own honor.",
    "The Statue of Liberty in New York City holds great significance as a symbol of the United States and its ideals of liberty and peace. It greeted millions of immigrants who arrived in the U.S. by ship in the late 19th and early 20th centuries, representing hope and freedom for those seeking a better life. It has since become an iconic landmark and a global symbol of cultural diversity and freedom.",
]

examples = [
    {"query": q, "ground_truths": [eval_answers[i]]}
    for i, q in enumerate(eval_questions)
]

In [38]:
examples

[{'query': 'What is the population of New York City as of 2020?',
  'ground_truths': ['8,804,190']},
 {'query': 'Which borough of New York City has the highest population?',
  'ground_truths': ['Brooklyn']},
 {'query': 'What is the economic significance of New York City?',
  'ground_truths': ["New York City's economic significance is vast, as it serves as the global financial capital, housing Wall Street and major financial institutions. Its diverse economy spans technology, media, healthcare, education, and more, making it resilient to economic fluctuations. NYC is a hub for international business, attracting global companies, and boasts a large, skilled labor force. Its real estate market, tourism, cultural industries, and educational institutions further fuel its economic prowess. The city's transportation network and global influence amplify its impact on the world stage, solidifying its status as a vital economic player and cultural epicenter."]},
 {'query': 'How did New York City

#### Introducing RagasEvaluatorChain

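`RagasEvaluatorChain` creates a wrapper around the metrics ragas provides (documented [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md)), making it easier to run these evaluations with LangChain and LangSmith. In the ragas version used in the cells below, the class is imported as `EvaluatorChain` from `ragas.integrations.langchain`.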

The evaluator chain has the following APIs:

- `__call__()`: call the `RagasEvaluatorChain` directly on the result of a QA chain.
- `evaluate()`: evaluate a list of examples (with the input queries) and predictions (outputs from the QA chain).
- `evaluate_run()`: called by LangSmith evaluators to evaluate LangSmith datasets.

Let's see each of them in action.

In [39]:
result = qa_chain.invoke({"query": eval_questions[1]})
result["result"]

'Manhattan (New York County) has the highest population density of any borough in New York City.'

In [40]:
key_mapping = {
    "query": "question",
    "result": "answer",
    "source_documents": "contexts"
}

result_updated = {}
for old_key, new_key in key_mapping.items():
    if old_key in result:
        result_updated[new_key] = result[old_key]


In [41]:
result_updated

{'question': 'Which borough of New York City has the highest population?',
 'answer': 'Manhattan (New York County) has the highest population density of any borough in New York City.',
 'contexts': [Document(metadata={'source': './data/nyc_text.txt'}, page_content="New York City is the most populous city in the United States, with 8,804,190 residents incorporating more immigration into the city than outmigration since the 2010 United States census. More than twice as many people live in New York City as compared to Los Angeles, the second-most populous U.S. city; and New York has more than three times the population of Chicago, the third-most populous U.S. city. New York City gained more residents between 2010 and 2020 (629,000) than any other U.S. city, and a greater amount than the total sum of the gains over the same decade of the next four largest U.S. cities, Los Angeles, Chicago, Houston, and Phoenix, Arizona combined. New York City's population is about 44% of New York State's p

In [42]:
#!pip install --no-cache-dir recordclass

In [43]:
#!pip install ragas==0.1.9

In [44]:
from ragas.integrations.langchain import EvaluatorChain
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
)

# Create evaluation chains
faithfulness_chain = EvaluatorChain(metric=faithfulness)
answer_rel_chain = EvaluatorChain(metric=answer_relevancy)
context_rel_chain = EvaluatorChain(metric=context_relevancy)
context_recall_chain = EvaluatorChain(metric=context_recall)

1. `__call__()`

Directly run the evaluation chain with the results from the QA chain. Do note that metrics like context_relevancy and faithfulness require the `source_documents` to be present.

In [45]:
from ragas.integrations.langchain import EvaluatorChain 
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
)
from langchain.schema import Document

# Create evaluation chains
faithfulness_chain   = EvaluatorChain(metric=faithfulness)
answer_rel_chain     = EvaluatorChain(metric=answer_relevancy)
context_rel_chain    = EvaluatorChain(metric=context_relevancy)
context_recall_chain = EvaluatorChain(metric=context_recall)

# Print the structure and content of result_updated
print("Keys in result_updated:", result_updated.keys())
print("Content of result_updated:", result_updated)

# Transform the input to the required structure
eval_input = {
    "question": result_updated["question"],
    "answer": result_updated["answer"],
    "contexts": [doc.page_content for doc in result_updated["contexts"]]
}

async def evaluate_metrics():
    # Evaluate faithfulness
    try:
        eval_result = await faithfulness_chain.ainvoke(eval_input)
        faithfulness_score = eval_result.get("faithfulness", "No score generated")
        print(f"Faithfulness score: {faithfulness_score}")
    except Exception as e:
        print(f"Error in faithfulness evaluation: {str(e)}")

    # Create and evaluate the fake result for faithfulness
    fake_result = {
        "question": result_updated["question"],
        "answer": "we are the champions",
        "contexts": [doc.page_content for doc in result_updated["contexts"]]
    }
    try:
        eval_result = await faithfulness_chain.ainvoke(fake_result)
        fake_faithfulness_score = eval_result.get("faithfulness", "No score generated")
        print(f"Fake result faithfulness score: {fake_faithfulness_score}")
    except Exception as e:
        print(f"Error in fake result evaluation: {str(e)}")

    # Evaluate context recall
    context_recall_input = eval_input.copy()
    context_recall_input["ground_truth"] = context_recall_input["answer"]  # Using the answer as ground truth
    try:
        eval_result = await context_recall_chain.ainvoke(context_recall_input)
        context_recall_score = eval_result.get("context_recall", "No score generated")
        print(f"Context recall score: {context_recall_score}")
    except Exception as e:
        print(f"Error in context recall evaluation: {str(e)}")

    # Evaluate answer relevancy
    try:
        eval_result = await answer_rel_chain.ainvoke(eval_input)
        answer_relevancy_score = eval_result.get("answer_relevancy", "No score generated")
        print(f"Answer relevancy score: {answer_relevancy_score}")
    except Exception as e:
        print(f"Error in answer relevancy evaluation: {str(e)}")

    # Evaluate context relevancy
    try:
        eval_result = await context_rel_chain.ainvoke(eval_input)
        context_relevancy_score = eval_result.get("context_relevancy", "No score generated")
        print(f"Context relevancy score: {context_relevancy_score}")
    except Exception as e:
        print(f"Error in context relevancy evaluation: {str(e)}")

# Run the async function
await evaluate_metrics()

Keys in result_updated: dict_keys(['question', 'answer', 'contexts'])
Content of result_updated: {'question': 'Which borough of New York City has the highest population?', 'answer': 'Manhattan (New York County) has the highest population density of any borough in New York City.', 'contexts': [Document(metadata={'source': './data/nyc_text.txt'}, page_content="New York City is the most populous city in the United States, with 8,804,190 residents incorporating more immigration into the city than outmigration since the 2010 United States census. More than twice as many people live in New York City as compared to Los Angeles, the second-most populous U.S. city; and New York has more than three times the population of Chicago, the third-most populous U.S. city. New York City gained more residents between 2010 and 2020 (629,000) than any other U.S. city, and a greater amount than the total sum of the gains over the same decade of the next four largest U.S. cities, Los Angeles, Chicago, Housto

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Faithfulness score: 0.5


No statements were generated from the answer.


Fake result faithfulness score: nan
Context recall score: 1.0
Answer relevancy score: 0.9717103434530753
Context relevancy score: 0.045454545454545456
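
The top-level `await` in the cell above works because Jupyter already runs an event loop; in a plain Python script you would wrap the call in `asyncio.run`. The metrics are also independent of each other, so they could be awaited concurrently. A minimal sketch with a stand-in coroutine: `fake_metric` is hypothetical and only simulates an evaluator call, and the scores are placeholder values.

```python
import asyncio

async def fake_metric(name: str, score: float) -> dict:
    """Stand-in for an evaluator chain's ainvoke call; returns a canned score."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {name: score}

async def evaluate_all() -> dict:
    """Run several metric coroutines concurrently and merge their results."""
    results = await asyncio.gather(
        fake_metric("faithfulness", 0.5),
        fake_metric("context_recall", 1.0),
        return_exceptions=True,  # a failing metric is returned, not raised
    )
    merged = {}
    for item in results:
        if isinstance(item, Exception):
            print(f"Metric failed: {item}")
        else:
            merged.update(item)
    return merged

if __name__ == "__main__":
    # In a script (not a notebook), asyncio.run starts the event loop.
    print(asyncio.run(evaluate_all()))
```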


**Faithfulness**

In [46]:
# eval_result = faithfulness_chain(result_updated)
# eval_result["faithfulness_score"]

async def run_faithfulness():
    eval_result = await faithfulness_chain.ainvoke(eval_input)
    faithfulness_score = eval_result.get("faithfulness", "No score generated")
    print(f"Faithfulness score: {faithfulness_score}")

await run_faithfulness()

Faithfulness score: 0.5


A high faithfulness score means that the answer is factually consistent with the source documents.

You can check lower faithfulness scores by changing the result (answer from LLM) or source_documents to something else.
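
As a rough mental model, ragas describes faithfulness as the fraction of statements extracted from the answer that can be inferred from the retrieved contexts. The sketch below only reproduces that arithmetic; in ragas the statement extraction and the verdicts come from an LLM, and the `verdicts` lists here are made-up inputs. Under this model, a score of 0.5 would correspond to, for example, one supported statement out of two.

```python
import math

def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of answer statements supported by the context (nan if none)."""
    if not verdicts:
        # Nothing to score, as when no statements are generated from the answer.
        return float("nan")
    return sum(verdicts) / len(verdicts)

print(faithfulness_score([True, False]))    # one of two statements supported
print(math.isnan(faithfulness_score([])))   # no statements extracted
```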

In [47]:
import asyncio
from langchain.schema import Document

async def run_fake_faithfulness():
    fake_result = result_updated.copy()
    fake_result["result"] = "we are the champions"
    
    # Convert Document objects to their page_content
    contexts = [
        doc.page_content if isinstance(doc, Document) else doc
        for doc in fake_result.get("contexts", [])
    ]
    
    eval_input = {
        "question": fake_result["question"],
        "answer": fake_result["result"],
        "contexts": contexts
    }
    
    eval_result = await faithfulness_chain.ainvoke(eval_input)
    fake_faithfulness_score = eval_result.get("faithfulness", "No score generated")
    print(f"Fake result faithfulness score: {fake_faithfulness_score}")

await run_fake_faithfulness()

No statements were generated from the answer.


Fake result faithfulness score: nan


**Context Recall**

In [50]:
# Run context recall chain
# eval_result = context_recall_chain(result)
# eval_result["context_recall_score"]

import asyncio
from langchain.schema import Document

async def run_context_recall():
    # Ensure result_updated is available and has the necessary keys
    if 'question' not in result_updated or 'answer' not in result_updated or 'contexts' not in result_updated:
        print("Error: result_updated is missing required keys.")
        return

    # Prepare the context_recall_input
    context_recall_input = {
        "question": result_updated["question"],
        "answer": result_updated["answer"],
        "contexts": [
            doc.page_content if isinstance(doc, Document) else doc
            for doc in result_updated.get("contexts", [])
        ],
        "ground_truth": result_updated["answer"]  # Using the answer as ground truth
    }

    try:
        eval_result = await context_recall_chain.ainvoke(context_recall_input)
        context_recall_score = eval_result.get("context_recall", "No score generated")
        print(f"Context recall score: {context_recall_score}")
    except Exception as e:
        print(f"Error in context recall evaluation: {str(e)}")

await run_context_recall()

Context recall score: 1.0


High context_recall_score means that the ground truth is present in the source documents.

You can check lower context recall scores by changing the source_documents to something else.
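
The cells in this section pass the model's own answer as `ground_truth`, which can inflate context recall because the answer was generated from those same contexts. A sketch of attaching the reference answer from `eval_answers` instead; `add_ground_truth` is a hypothetical helper, the `ground_truth` key follows the usage in the cells above, and the sample input is abbreviated:

```python
# One question/answer pair copied from eval_questions and eval_answers.
eval_questions = ["Which borough of New York City has the highest population?"]
eval_answers = ["Brooklyn"]
truth_by_question = dict(zip(eval_questions, eval_answers))

def add_ground_truth(eval_input: dict, truth_by_question: dict) -> dict:
    """Return a copy of eval_input with the reference answer attached."""
    question = eval_input["question"]
    if question not in truth_by_question:
        raise KeyError(f"No reference answer for: {question!r}")
    return {**eval_input, "ground_truth": truth_by_question[question]}

sample_input = {
    "question": eval_questions[0],
    "answer": "Manhattan (New York County) has the highest population density.",
    "contexts": [],
}
print(add_ground_truth(sample_input, truth_by_question)["ground_truth"])
```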

In [53]:
import asyncio
from ragas.integrations.langchain import EvaluatorChain 
from ragas.metrics import context_recall
from langchain.schema import Document

# Create context recall chain
context_recall_chain = EvaluatorChain(metric=context_recall)

async def run_context_recall_evaluation():
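    # Copy the result and convert the contexts to plain strings.
    # Note: the contexts are not actually replaced here, so the score matches the real result.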
    fake_result = result_updated.copy()
    fake_result["contexts"] = [doc.page_content for doc in result_updated["contexts"]]
    fake_result["ground_truth"] = fake_result["answer"]
    
    try:
        eval_result = await context_recall_chain.ainvoke(fake_result)
        context_recall_score = eval_result.get("context_recall", "No score generated")
        print(f"Fake result with new source documents context recall score: {context_recall_score}")
    except Exception as e:
        print(f"Error in fake result with new source documents evaluation: {str(e)}")

# Run the async function
await run_context_recall_evaluation()

Fake result with new source documents context recall score: 1.0


2. `evaluate()`

Evaluate a list of inputs/queries and the outputs/predictions from the QA chain.

In [54]:
# Assuming examples is prepared with 'query' key instead of 'question'
examples = [
    {"query": "Which borough of New York City has the highest population?"},
    {"query": "What is the capital of France?"},
    # Add more examples as needed
]

# Run the queries as a batch for efficiency
predictions = qa_chain.batch(examples)

In [57]:
import asyncio

async def run_faithfulness_evaluation():
    try:
        # Print the structure of examples and predictions for debugging
        print("Structure of examples:", examples[0].keys())
        print("Structure of predictions:", predictions[0].keys())
        
        # Safely get the required fields
        question = examples[0].get("query", "No question available")
        answer = predictions[0].get("answer", predictions[0].get("result", "No answer available"))
        contexts = predictions[0].get("contexts", [])
        
        # If contexts are Document objects, extract their page_content
        if contexts and isinstance(contexts[0], Document):
            contexts = [doc.page_content for doc in contexts]
        
        eval_input = {
            "question": question,
            "answer": answer,
            "contexts": contexts,
        }
        
        # Print eval_input for debugging
        print("Evaluation input:", eval_input)
        
        eval_result = await faithfulness_chain.ainvoke(eval_input)
        faithfulness_score = eval_result.get("faithfulness", "No score generated")
        print(f"Faithfulness score for example 1: {faithfulness_score}")
    except Exception as e:
        print(f"Error in faithfulness evaluation for example 1: {str(e)}")
        # Print more details about the error
        import traceback
        traceback.print_exc()

# Run the async function
await run_faithfulness_evaluation()

Structure of examples: dict_keys(['query'])
Structure of predictions: dict_keys(['query', 'result', 'source_documents'])
Evaluation input: {'question': 'Which borough of New York City has the highest population?', 'answer': 'Manhattan (New York County) has the highest population density of any borough in New York City.', 'contexts': []}
Faithfulness score for example 1: 0.5
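
The printed keys explain the empty `contexts` list: the chain returns the retrieved documents under `source_documents`, not `contexts`, so faithfulness was scored against no context at all. A minimal sketch of a mapping step, assuming the prediction dict has the `query`, `result`, and `source_documents` keys printed above. `to_eval_input` is a hypothetical helper, and `FakeDoc` only stands in for LangChain's `Document` so the example runs on its own; the sample data is made up.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for LangChain's Document (only page_content is used)."""
    page_content: str


def to_eval_input(prediction):
    """Map a QA-chain output dict to question/answer/contexts keys."""
    # Prefer 'contexts' if present, otherwise fall back to 'source_documents'.
    docs = prediction.get("contexts") or prediction.get("source_documents") or []
    # Accept both Document-like objects and plain strings.
    contexts = [getattr(doc, "page_content", doc) for doc in docs]
    return {
        "question": prediction.get("query", ""),
        "answer": prediction.get("answer", prediction.get("result", "")),
        "contexts": contexts,
    }


# Made-up prediction, for illustration only.
prediction = {
    "query": "Which borough of New York City has the highest population?",
    "result": "Brooklyn",
    "source_documents": [FakeDoc("Brooklyn is the most populous borough.")],
}
eval_input = to_eval_input(prediction)
print(eval_input["contexts"])
```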


In [58]:
import asyncio

async def run_context_recall_evaluation():
    try:
        # Print the structure of examples and predictions for debugging
        print("Structure of examples:", examples[0].keys())
        print("Structure of predictions:", predictions[0].keys())
        
        # Safely get the required fields
        question = examples[0].get("query", "No question available")
        prediction = predictions[0].get("answer", predictions[0].get("result", "No prediction available"))
        ground_truth = "Actual answer or ground truth"  # Replace with actual ground truth
        contexts = predictions[0].get("contexts", [])
        
        # If contexts are Document objects, extract their page_content
        if contexts and isinstance(contexts[0], Document):
            contexts = [doc.page_content for doc in contexts]
        
        eval_input = {
            "question": question,
            "prediction": prediction,
            "ground_truth": ground_truth,
            "contexts": contexts,
        }
        
        # Print eval_input for debugging
        print("Evaluation input:", eval_input)
        
        eval_result = await context_recall_chain.ainvoke(eval_input)
        context_recall_score = eval_result.get("context_recall", "No score generated")
        print(f"Context recall score for example 1: {context_recall_score}")
    except Exception as e:
        print(f"Error in context recall evaluation for example 1: {str(e)}")
        # Print more details about the error
        import traceback
        traceback.print_exc()

# Run the async function
await run_context_recall_evaluation()


Structure of examples: dict_keys(['query'])
Structure of predictions: dict_keys(['query', 'result', 'source_documents'])
Evaluation input: {'question': 'Which borough of New York City has the highest population?', 'prediction': 'Manhattan (New York County) has the highest population density of any borough in New York City.', 'ground_truth': 'Actual answer or ground truth', 'contexts': []}


Failed to parse output. Returning None.


Context recall score for example 1: nan
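
The `nan` most likely comes from the inputs rather than from the metric itself: `contexts` is empty and `ground_truth` is still the placeholder string, so there is nothing meaningful to compare. A small pre-flight check can catch this before spending an LLM call. This is a sketch under assumptions: `check_eval_input` and the placeholder set are hypothetical and not part of ragas.

```python
# Strings treated as "no real ground truth" (assumption for this sketch).
PLACEHOLDERS = {"", "Actual answer or ground truth"}


def check_eval_input(eval_input):
    """Return a list of problems; an empty list means the input looks usable."""
    problems = []
    if not eval_input.get("question"):
        problems.append("question is missing")
    if not eval_input.get("contexts"):
        problems.append("contexts is empty")
    if str(eval_input.get("ground_truth", "")).strip() in PLACEHOLDERS:
        problems.append("ground_truth is missing or still a placeholder")
    return problems


# Mirrors the evaluation input printed above.
bad_input = {
    "question": "Which borough of New York City has the highest population?",
    "ground_truth": "Actual answer or ground truth",
    "contexts": [],
}
for problem in check_eval_input(bad_input):
    print("Skipping evaluation:", problem)
```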


# **FINAL THOUGHTS**

Having to deal with so many basic errors, I had to change many cells just to get them working, which left me little time to work properly on the lab itself.
This one might be a little messy; my apologies, I didn't have time to clean it up properly.

P.S. - asyncio gave me a huge headache, but I managed to work around it.

https://stackoverflow.com/questions/55409641/asyncio-run-cannot-be-called-from-a-running-event-loop-when-using-jupyter-no