# Evaluation

Metric definitions follow the Ragas documentation: https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/context_precision/


## Definitions

---

### 1. **Context Precision**
- **Focus on _retrieved_ context**: what share of the retrieved context chunks are actually relevant to answering the question, with relevant chunks expected near the top of the ranking.
- **Precision@k**: the fraction of relevant chunks among the top _k_ retrieved chunks. Context Precision averages Precision@k over the ranks _k_ that hold a relevant chunk.
- **Types**:
  - **Without reference**  
    - Compares the retrieved context with the **response**.
    - An LLM compares each item in `retrieved_contexts` with the response to judge how well the retrieved content supports the generated answer.

  - **With reference**  
    - Compares the retrieved context with the **reference** (gold answer).
    - An LLM compares each item in `retrieved_contexts` with the reference to judge how relevant or helpful that context is in supporting the reference answer.
- **Output**:
  - `1.0`: Good — Retrieved context is highly relevant and supports the answer very well.
  - `0.0`: Bad — Retrieved context is completely irrelevant to the answer.
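
As a rough illustration of how the rank-weighted score comes about, the sketch below computes Context Precision@K from a list of binary relevance verdicts, following the formula in the Ragas documentation (the mean of Precision@k over the ranks that hold a relevant chunk). The function name `context_precision_at_k` and the hard-coded verdicts are assumptions for illustration; in Ragas the verdicts come from an LLM judge.

```python
from typing import List


def context_precision_at_k(verdicts: List[int]) -> float:
    """Mean of Precision@k over the ranks k that hold a relevant chunk.

    verdicts[i] is 1 if the chunk at rank i + 1 was judged relevant, else 0.
    """
    hits = 0
    weighted_sum = 0.0
    for k, relevant in enumerate(verdicts, start=1):
        if relevant:
            hits += 1
            weighted_sum += hits / k  # Precision@k, counted only at relevant ranks
    return weighted_sum / hits if hits else 0.0


# Hypothetical verdicts: the same two relevant chunks, ranked differently
print(context_precision_at_k([1, 0, 1]))  # relevant chunks at ranks 1 and 3
print(context_precision_at_k([0, 1, 1]))  # relevant chunks at ranks 2 and 3
```

The first ranking scores higher than the second even though both contain two relevant chunks, which is why the metric rewards placing relevant context early.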

---

### 2. **Context Recall**
- **Focus on _retrieved_ context**: How many parts of the gold answer (**reference**) can be found or supported in the retrieved context?
- **Output**:
  - High recall: Good — You retrieved most or all of the relevant documents.
  - Low recall: Bad — You missed many relevant pieces.
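
A minimal sketch of the recall idea, assuming the reference has already been split into claims: recall is the share of reference claims that the retrieved context supports. The helper `naive_context_recall` is hypothetical, and its case-insensitive substring match is only a stand-in for the LLM judgment that Ragas uses.

```python
from typing import List


def naive_context_recall(reference_claims: List[str], retrieved_contexts: List[str]) -> float:
    """Fraction of reference claims found in the retrieved contexts.

    ASSUMPTION: a case-insensitive substring match replaces the LLM judge.
    """
    if not reference_claims:
        return 0.0
    joined = " ".join(retrieved_contexts).lower()
    supported = sum(1 for claim in reference_claims if claim.lower() in joined)
    return supported / len(reference_claims)


# Hypothetical example: only the first claim appears in the context
claims = ["The Eiffel Tower is in Paris", "It was completed in 1889"]
contexts = ["The Eiffel Tower is in Paris, France."]
print(naive_context_recall(claims, contexts))
```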

---

### 3. **Response Relevancy**
- **Focus on _response_**: How relevant a generated response is to the original **user input** (the question).
- **Output**:
  - Higher score: Good — The response closely matches the intent and content of the user's question.
  - Lower score: Bad — May indicate the response is off-topic, incomplete, or includes unnecessary info.
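
According to the Ragas documentation, this score is obtained by asking an LLM to generate several questions from the response and averaging the cosine similarity between their embeddings and the embedding of the original question. The sketch below shows only that averaging step; the function names and the toy 2-D vectors are assumptions, and real embeddings would come from an embedding model.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def response_relevancy(question_vec: Sequence[float],
                       generated_question_vecs: List[Sequence[float]]) -> float:
    """Mean cosine similarity between the user question and the generated questions."""
    if not generated_question_vecs:
        return 0.0
    total = sum(cosine(question_vec, g) for g in generated_question_vecs)
    return total / len(generated_question_vecs)


# Hypothetical embeddings: one generated question matches the user question, one is unrelated
print(response_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```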

---

### 4. **Faithfulness**
- **Focus on _response_**: How factually consistent the response is with the **retrieved context**, that is, whether each claim in the response can be inferred from that context.
- **Output**:
  - `1.0`: Good — Fully faithful — all claims are supported by the context.
  - `0.0`: Bad — Completely unfaithful — no claim can be verified from the context.
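
Following the definition in the Ragas documentation, the score can be written as a ratio. The claim counts come from an LLM that first splits the response into individual claims and then checks each one against the retrieved context:

$$
\text{Faithfulness} = \frac{\text{number of claims in the response supported by the retrieved context}}{\text{total number of claims in the response}}
$$

For example, with hypothetical numbers, a response containing two claims of which one is supported would score $1/2 = 0.5$.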


## Implementation

In [1]:
!pip install -q ragas langchain openai


In [2]:
import os
import getpass

open_ai_key = getpass.getpass('Enter your OPENAI API Key')
os.environ['OPENAI_API_KEY'] = open_ai_key

Enter your OPENAI API Key··········


In [3]:
from google.colab import drive
drive.mount('/content/drive')
%cd  /content/drive/MyDrive/ECE1508_Project/Codes

Mounted at /content/drive
/content/drive/MyDrive/ECE1508_Project/Codes


In [4]:
from typing import Optional, List
from ragas import SingleTurnSample
from ragas.metrics import (
    LLMContextPrecisionWithReference,
    LLMContextRecall,
    ResponseRelevancy,
    Faithfulness
)
# langchain_openai replaces the deprecated langchain.chat_models / langchain.embeddings imports
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

In [5]:
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-3.5-turbo"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())



In [15]:
# Define the evaluation function
async def evaluate_all_metrics(
    user_input: Optional[str],
    response: Optional[str],
    reference: Optional[str],
    retrieved_contexts: Optional[List[str]]
) -> dict:
    """Score one sample with the four Ragas metrics.

    A metric whose required inputs are missing is left as None.
    """
    results = {
        "Context_Precision": None,
        "Context_Recall": None,
        "Response_Relevancy": None,
        "Faithfulness": None,
    }

    # Skip evaluation entirely if the response or the retrieved contexts are missing
    if not response or not retrieved_contexts:
        return results

    sample = SingleTurnSample(
        user_input=user_input or "",
        response=response,
        reference=reference or "",
        retrieved_contexts=retrieved_contexts
    )

    # Context metrics compare the retrieved contexts with the reference
    if reference:
        context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)
        results["Context_Precision"] = round(await context_precision.single_turn_ascore(sample), 4)

        context_recall = LLMContextRecall(llm=evaluator_llm)
        results["Context_Recall"] = round(await context_recall.single_turn_ascore(sample), 4)

    # Response relevancy compares the response with the user's question
    if user_input:
        response_relevancy = ResponseRelevancy(llm=evaluator_llm, embeddings=evaluator_embeddings)
        results["Response_Relevancy"] = round(await response_relevancy.single_turn_ascore(sample), 4)

    # Faithfulness checks the response against the retrieved contexts,
    # both of which are guaranteed to be present at this point
    faithfulness = Faithfulness(llm=evaluator_llm)
    results["Faithfulness"] = round(await faithfulness.single_turn_ascore(sample), 4)

    return results

### Single Test

In [16]:

result = await evaluate_all_metrics(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is located in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris."]
)

print(result)

{'Context_Precision': 1.0, 'Context_Recall': 1.0, 'Response_Relevancy': np.float64(1.0), 'Faithfulness': 1.0}


### Complete Evaluation

In [17]:
import json
from datetime import datetime
import time
import tiktoken
encoding = tiktoken.encoding_for_model("gpt-4")

In [19]:
async def evaluate_all(in_data):
    for item in in_data:
        user_input = item.get("input_question")
        response = item.get("response")

        # Combine long and short answers as the reference; strip so that an
        # empty gold answer stays empty instead of becoming " "
        gold = item.get("gold_answer", {})
        long_answer = gold.get("long_answer", "")
        short_answers = gold.get("short_answers", [])
        combined_reference = (long_answer + " " + " ".join(short_answers)).strip()

        retrieved_contexts = item.get("retrieved_contexts")
        evaluation = await evaluate_all_metrics(user_input, response, combined_reference, retrieved_contexts)
        item["Evaluation"] = evaluation
    return in_data


In [20]:
# Load the file to be evaluated
test_file_name = './evaluation/run_results_baseline.json'
with open(test_file_name, "r", encoding="utf-8") as f:
    result_to_be_evaluated = json.load(f)

In [21]:
start = time.time()
eval_result = await evaluate_all(result_to_be_evaluated)
end = time.time()

print(f"Evaluation of {test_file_name} took {end - start:.4f} seconds to run.")

Evaluation of ./evaluation/run_results_baseline.json took 383.4580 seconds to run.
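
The loop in `evaluate_all` awaits one sample at a time, so the LLM calls never overlap. One possible way to overlap them is to run the samples under `asyncio.gather` with a semaphore that caps the number in flight. This is only a sketch: `evaluate_concurrently`, `fake_score`, and the `max_concurrency` value are assumptions, `fake_score` stands in for `evaluate_all_metrics`, and any real speed-up depends on the API rate limits.

```python
import asyncio
from typing import Awaitable, Callable, List


async def evaluate_concurrently(
    items: List[dict],
    score_item: Callable[[dict], Awaitable[dict]],
    max_concurrency: int = 5,
) -> List[dict]:
    """Store score_item(item) under "Evaluation", with at most max_concurrency items in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(item: dict) -> None:
        async with semaphore:
            item["Evaluation"] = await score_item(item)

    await asyncio.gather(*(worker(item) for item in items))
    return items


async def fake_score(item: dict) -> dict:
    # Stand-in for evaluate_all_metrics: sleeps instead of calling an LLM
    await asyncio.sleep(0.01)
    return {"Faithfulness": 1.0}


demo = asyncio.run(evaluate_concurrently([{"id": i} for i in range(10)], fake_score))
print(len(demo), demo[0]["Evaluation"])
```

Inside a notebook an event loop is already running, so the last call would be written as `demo = await evaluate_concurrently(...)` rather than wrapped in `asyncio.run`.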


In [22]:
!ls

 Baseline.ipynb		 'L1_Process_Chunk&Save.ipynb'	 Proposition_Complete.ipynb
 Baseline_vector	  L1_vector			 Proposition_Light.ipynb
 evaluation		  L1_vector_test		 Proposition_Sample.ipynb
 Evaluation.ipynb	  L1_vector_test_2		 rag_sw_ver2.ipynb
 gold_test_file_30.json   L2_vector_prop		 test_single_doc.json


In [24]:
# Save the evaluation result
today = datetime.today().strftime("%Y-%m-%d")
eval_result_file_name = f'./evaluation/eval_run_results_baseline_{today}.json'
with open(eval_result_file_name, "w", encoding="utf-8") as f:
    json.dump(eval_result, f, indent=4, ensure_ascii=False)
print(f"Saved evaluated results to {eval_result_file_name}")

Saved evaluated results to ./evaluation/eval_run_results_baseline_2025-04-05.json
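
To compare runs, the per-item scores can be reduced to one mean per metric. The sketch below assumes each item carries the `Evaluation` dictionary written above and skips metrics that were left as `None`; the helper name `summarize` and the sample items are hypothetical and are not results from this run.

```python
from statistics import mean
from typing import Dict, List, Optional

METRICS = ["Context_Precision", "Context_Recall", "Response_Relevancy", "Faithfulness"]


def summarize(items: List[dict]) -> Dict[str, Optional[float]]:
    """Mean score per metric over the items where that metric was computed."""
    summary: Dict[str, Optional[float]] = {}
    for name in METRICS:
        scores = [
            item["Evaluation"][name]
            for item in items
            if item.get("Evaluation", {}).get(name) is not None
        ]
        summary[name] = round(mean(scores), 4) if scores else None
    return summary


# Hypothetical items for illustration only
sample_items = [
    {"Evaluation": {"Context_Precision": 1.0, "Context_Recall": 0.5,
                    "Response_Relevancy": None, "Faithfulness": 1.0}},
    {"Evaluation": {"Context_Precision": 0.5, "Context_Recall": 0.5,
                    "Response_Relevancy": None, "Faithfulness": 0.0}},
]
print(summarize(sample_items))
```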
