# Evaluation for Redbox RAG chat  <a class="anchor" id="title"></a>

## Table of Contents <a class="anchor" id="toc"></a>
* [Overview](#one-section)
* [Metrics](#two-section)
    - [Retrieval Metrics - to add]()
    - [Faithfulness]()
    - [Answer Relevancy]()
    - [Hallucination]()
* [Notebook Setup](#setup)
* [Load Evaluation Dataset](#three-section)
* [Run Redbox Locally](#run-redbox)
* [Generate `actual_output` using RAG and evaluation dataset](#four-section)
* [Evaluate RAG pipeline](#five-section)
    - [Retrieval Evaluation Metrics](#six-section)
    - [Generation Evaluation Metrics](#seven-section)
* [#TODO](#todo)


## Overview <a class="anchor" id="one-section"></a>

When it comes to optimising the generation part of our RAG system, the only things that we can modify are the `RAG prompts` that are passed to the LLM along with the context. Other components certainly affect the overall generation evaluation score, for example whether the retrieved context is of high quality, but the levers for changing them sit further upstream in the RAG pipeline and are evaluated in the Retrieval Evaluation and E2E Evaluation notebooks. These other components are also slower to change than prompts, which are just natural language!

We want to avoid using the /chat/rag endpoint for quick experimentation with `RAG prompts`, because rebuilding the core_api Docker image, stopping and starting the container, and so on would really slow down development. Changing prompts is very quick to do, so we want equally quick evaluation of how these prompt changes affect generation.

For this reason, the /chat/rag endpoint function is in this notebook, so prompts can be changed in a single place and feedback is much quicker. If your prompt experiments look good, i.e. they improve generation evaluation metrics, then you can consider making these changes in the `core_api` service. Information on where to make the corresponding changes in the `core_api` service is at the bottom of this notebook. Once you make changes in `core_api` and rebuild, these changes will be reflected in the deployed /chat/rag endpoint.

We will evaluate RAG generation using metrics described in the next section.

[Back to top](#title)

---------------

## Metrics <a class="anchor" id="two-section"></a>

#TODO: Add retrieval metrics too

Start by using 3 DeepEval metrics:
- Faithfulness
- Answer Relevancy **(what are we taking as 'input'? Raw question or refined question?)**
- Hallucination

### Faithfulness

The faithfulness metric measures the quality of your RAG pipeline's generator by evaluating whether the `actual_output` factually aligns with the contents of your `retrieval_context`. `deepeval`'s faithfulness metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

##### Required Arguments
To use the `FaithfulnessMetric`, you need to provide the following arguments when creating an LLMTestCase:

- `input`
- `actual_output`
- `retrieval_context`

[Back to top](#title)

### Answer Relevancy
The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the `actual_output` of your LLM application is compared to the provided `input`. `deepeval`'s answer relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

##### Required Arguments
To use the `AnswerRelevancyMetric`, you'll have to provide the following arguments when creating an LLMTestCase:

- `input`
- `actual_output`

[Back to top](#title)

### Hallucination
The hallucination metric determines whether your LLM generates factually correct information by comparing the `actual_output` to the provided `context`.

##### Required Arguments
To use the `HallucinationMetric`, you'll have to provide the following arguments when creating an LLMTestCase:

- `input`
- `actual_output`
- `context`
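
The required arguments for the three metrics can be collected into a small lookup, which makes it easy to check a test case before spending LLM calls on it. This is only a sketch: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names that are not part of `deepeval`, the example test case is made up, and the Hallucination entry assumes the metric reads `context`, as its description states.

```python
# Hypothetical helper, not part of deepeval: required LLMTestCase fields per metric.
REQUIRED_FIELDS = {
    "Faithfulness": ("input", "actual_output", "retrieval_context"),
    "Answer Relevancy": ("input", "actual_output"),
    # Assumption: Hallucination compares against `context`, per its description.
    "Hallucination": ("input", "actual_output", "context"),
}


def missing_fields(metric_name, test_case):
    """Return the required fields that are absent or empty in a test-case dict."""
    return [
        field
        for field in REQUIRED_FIELDS[metric_name]
        if not test_case.get(field)
    ]


# Illustrative test case only; the values are placeholders.
case = {"input": "example question", "actual_output": "example answer"}
print(missing_fields("Answer Relevancy", case))  # []
print(missing_fields("Faithfulness", case))      # ['retrieval_context']
```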

[Back to top](#title)

-------

## Notebook Setup <a class="anchor" id="five-section"></a>

In [2]:
# Add autoreload
%reload_ext autoreload
%autoreload 2

#TODO: Check this autoreload works in vscode

[Back to top](#title)

-----------------

## Start Redbox Running locally

Run the following commands in a terminal.

Start a Docker runtime (the example below uses Colima):
```bash
colima start --memory 8
``` 

-------------

#### First-time setup

First time users need to do the following

```bash
poetry install
```

Ensure your `.env` file contains your OpenAI API key and has the following settings:

```bash
# === Object Storage ===

MINIO_HOST=minio
MINIO_PORT=9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
AWS_ACCESS_KEY=minioadmin
AWS_SECRET_KEY=minioadmin

AWS_REGION=eu-west-2

# minio or s3
OBJECT_STORE=minio
BUCKET_NAME=redbox-storage-dev
```

Build redbox docker images (this takes several minutes)

```bash
docker compose build
```

------

If changes are made to the app, e.g. changes pulled in from main, you may need to rebuild the Docker images:

```bash
docker compose build --no-cache
```

**Every time you start Redbox for evaluation (no Django frontend required), please run the following command**

```bash
make eval_backend
```

The above command will bring up everything you need for the backend (`core-api`, `worker`, `minio`, `elasticsearch` and `redis`), then create the MinIO bucket needed to store raw files.

[Back to top](#title)

----------

## Generate `actual_output` using RAG and evaluation dataset

#### First, upload the files that we are going to 'RAG with'

In [4]:
from jose import jwt
from uuid import UUID
import requests
import json

**Use the printed out bearer token below to Authorize if you ever want to use the Swagger UI docs**

In [5]:
bearer_token = jwt.encode({"user_uuid": str(UUID("aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"))}, key="your-secret-key", algorithm="HS512")
print(bearer_token)

eyJhbGciOiJIUzUxMiIsInR5cCI6IkpXVCJ9.eyJ1c2VyX3V1aWQiOiJhYWFhYWFhYS1hYWFhLWFhYWEtYWFhYS1hYWFhYWFhYWFhYWEifQ.kwzm-8i8SveeqYqvsRUm4FiB7nd3I43aI70ImljgdudKM4xrDw9z3CUpEBRwqqh6D3ZghB2T-Lu7BlV36VR5sg
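
To confirm what a bearer token carries before pasting it into Swagger, the payload segment can be decoded with the standard library alone. This is a sketch for inspection only: `jwt_payload` is a hypothetical helper, and it does not verify the signature, so it must not be used for authentication.

```python
import base64
import json


def jwt_payload(token):
    """Decode the payload segment of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    # base64url segments in JWTs drop their '=' padding, so restore it first.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

Calling `jwt_payload(bearer_token)` on the token printed above should return a dictionary containing the `user_uuid` claim that was passed to `jwt.encode`.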


#### Set evaluation version you are using

In [2]:
version = "0.1.0"

Get absolute paths for all files to be used for evaluation.

**Please just update the directory variable below (if required) to the directory containing all your files**

In [6]:
# Specify the directory you want to scan
directory = f"./data/evaluation_data_v{version}"
# directory = './data/evaluation_files'

In [None]:
from pathlib import Path

directory_path = Path(directory)
absolute_paths = [str(file.resolve()) for file in directory_path.iterdir() if file.name != '.gitkeep']

**Only if you haven't uploaded files already** uncomment and run cell below

In [None]:
# url = 'http://127.0.0.1:5002/file/upload'

# headers={
#     'accept': 'application/json',
#     "Authorization": f"Bearer {bearer_token}"
# }
# for file in absolute_paths:
#     # Use a context manager so each file handle is closed after the upload
#     with open(file, 'rb') as f:
#         upload_file_response = requests.post(url, headers=headers, files={'file': f})

#     #TODO: Add some logic in the loop to deal with status codes != 200
#     # if upload_file_response.status_code != 200:
#     #     print("Failed to upload data:", upload_file_response.status_code)
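
One way to address the status-code TODO is to collect failures while uploading instead of printing inside the loop. The sketch below is an assumption about how that could look, not the notebook's implementation: `upload_files` is a hypothetical helper, and the HTTP call is passed in as `post` so that `requests.post` (or a stub) can be supplied by the caller.

```python
def upload_files(paths, url, headers, post):
    """Upload each file and return (path, status_code) for every non-200 response.

    `post` is any callable that accepts the same arguments as `requests.post`.
    """
    failures = []
    for path in paths:
        # The context manager closes the file handle even if the request raises.
        with open(path, "rb") as f:
            response = post(url, headers=headers, files={"file": f})
        if response.status_code != 200:
            failures.append((path, response.status_code))
    return failures
```

In the notebook this would be called as `upload_files(absolute_paths, url, headers, post=requests.post)`, and an empty list would mean every upload returned 200.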

List files uploaded to server

In [10]:
url = 'http://127.0.0.1:5002/file/'

headers={
    'accept': 'application/json',
    "Authorization": f"Bearer {bearer_token}"
}

file_list_response = requests.get(url, headers=headers)

View JSON response

In [11]:
if file_list_response.status_code == 200:
    # Parse JSON from the response
    data = file_list_response.json()
    
    # Pretty-print the JSON data
    pretty_json = json.dumps(data, indent=4)
    print(pretty_json)
else:
    print("Failed to retrieve data:", file_list_response.status_code)

[
    {
        "uuid": "f52a6c97-c234-40e5-a9af-94daab9035c1",
        "created_datetime": "2024-05-21T11:54:52.633856",
        "creator_user_uuid": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
        "key": "Frontier AI Taskforce_ second progress report - GOV.UK.pdf",
        "bucket": "redbox-storage-dev",
        "model_type": "File"
    },
    {
        "uuid": "6bb3be69-5b5b-4ad5-9623-0eb783d3502a",
        "created_datetime": "2024-05-21T11:54:52.707323",
        "creator_user_uuid": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
        "key": "Prime Minister's speech on AI_ 26 October 2023 - GOV.UK.pdf",
        "bucket": "redbox-storage-dev",
        "model_type": "File"
    },
    {
        "uuid": "961985cf-c5a1-4eeb-b472-25a06c8ef5dd",
        "created_datetime": "2024-05-21T11:54:52.778875",
        "creator_user_uuid": "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa",
        "key": "The_impact_of_AI_on_UK_jobs_and_training_report.pdf",
        "bucket": "redbox-storage-dev",
        "model

Get a list of UUIDs

In [12]:
uuid_list = []

for item in data:
    if 'uuid' in item:
        uuid_list.append({'uuid': item['uuid']})

Get file status

In [13]:
status_url_list = []
for uuid in uuid_list:
    url = f"http://127.0.0.1:5002/file/{uuid['uuid']}/status"
    status_url_list.append(url)

status_responses = []
for url in status_url_list:
    status_response = requests.get(url, headers=headers)
    status_responses.append(status_response)

for status in status_responses:
    data = status.json()
    pretty_json = json.dumps(data, indent=4)
    print(pretty_json)

{
    "file_uuid": "f52a6c97-c234-40e5-a9af-94daab9035c1",
    "processing_status": "complete",
    "chunk_statuses": [
        {
            "chunk_uuid": "e652bc3d-7dfa-4421-9e04-33dc734d19ef",
            "embedded": true
        },
        {
            "chunk_uuid": "d46e17f4-d056-480d-891c-1824461dedfb",
            "embedded": true
        },
        {
            "chunk_uuid": "fb513f99-45f3-4dba-9a56-8debbec29749",
            "embedded": true
        },
        {
            "chunk_uuid": "23dde749-896d-42ea-8581-db49691801d1",
            "embedded": true
        },
        {
            "chunk_uuid": "e0149426-45c0-439d-8713-0f552b60fb79",
            "embedded": true
        },
        {
            "chunk_uuid": "82fe3e93-e13c-499a-a840-c6b904b368c0",
            "embedded": true
        },
        {
            "chunk_uuid": "9d44e108-33f7-4efa-832b-4bdaa0de2af3",
            "embedded": true
        },
        {
            "chunk_uuid": "37c41e96-9789-4a6c-896d-8611a49

------------

**Please ensure all embeddings have been completed before proceeding!**

Keep calm and go for a tea break!
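
Rather than reading the status JSON by eye, the check can be automated. The sketch below assumes each status has the shape printed above (`processing_status` plus a `chunk_statuses` list with an `embedded` flag) and that `"complete"` is the final status value. `pending_chunks` and `all_embedded` are hypothetical helper names.

```python
def pending_chunks(status):
    """Return the UUIDs of chunks that are not yet embedded for one file status."""
    return [
        chunk["chunk_uuid"]
        for chunk in status.get("chunk_statuses", [])
        if not chunk.get("embedded")
    ]


def all_embedded(statuses):
    """True when every file is 'complete' and has no pending chunks."""
    return all(
        s.get("processing_status") == "complete" and not pending_chunks(s)
        for s in statuses
    )
```

With the responses collected earlier, `all_embedded([r.json() for r in status_responses])` returning `True` would indicate it is safe to proceed.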

--------------

#### Generate `actual_output` using `RAG` endpoint

In [11]:
import pandas as pd

In [12]:
df = pd.read_csv(f'./data/synthetic_data/ragas_synthetic_data_v{version}.csv')
inputs = df['input'].tolist()

In [13]:
retrieval_context = []
actual_output = []

headers = {
    'accept': 'application/json',
    'Authorization': 'Bearer ' + bearer_token,
    'Content-Type': 'application/json',
}

url = 'http://127.0.0.1:5002/chat/rag'

for input in inputs:
    data = {
        "message_history": [
            {
                "role": "system",
                "text": "You are a helpful AI Assistant"
            },
            {
                "role": "user",
                "text": input
            }
        ]
    }
    
    response = requests.post(url, headers=headers, data=json.dumps(data))
    data = response.json()

    retrieval_context.append(data['source_documents'])
    actual_output.append(data['output_text'])

In [86]:
# Assuming retrieval_context and actual_output are your lists
# rag_output = [{'retrieval_context': rc, 'actual_output': ao} for rc, ao in zip(retrieval_context, actual_output)]

In [14]:
df['actual_output'] = actual_output
df['retrieval_context'] = retrieval_context
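
The CSV import later in this notebook splits `retrieval_context` on `;`, whereas each entry stored here is a list of document dictionaries. A possible way to bridge the two is to flatten each list into one delimited string before saving. This is a sketch under assumptions: `join_page_contents` is a hypothetical helper, and it assumes every source document is a dict with a `page_content` key, as the DataFrame preview in this notebook suggests.

```python
def join_page_contents(source_documents, delimiter=";"):
    """Join the `page_content` of each retrieved document into one delimited string."""
    parts = []
    for doc in source_documents:
        text = doc.get("page_content", "")
        # Replace the delimiter inside the text so it cannot split a chunk in two.
        parts.append(text.replace(delimiter, ","))
    return delimiter.join(parts)
```

It could be applied with `df['retrieval_context'] = [join_page_contents(docs) for docs in retrieval_context]`. Swapping `;` for `,` inside the text slightly alters the retrieved passages, so a delimiter that never occurs in the documents may be a better choice.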

In [None]:
## Uncomment to check df now contains the actual_output and retrieval_context
# df.head()

In [15]:
# df.to_csv('./data/synthetic_data/complete_ragas_synthetic_data_30.csv', index=False)

#### Remove rows containing NaN to prevent Pydantic validation errors

In [45]:
df = df.dropna()

In [47]:
df.to_csv(f'./data/synthetic_data/complete_ragas_synthetic_data_v{version}.csv', index=False)

#### Validate evaluation data before running DeepEval metrics
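
A lightweight check can report which rows have blank values before they reach DeepEval. The sketch below is an illustration, not an existing Redbox or DeepEval utility: `find_invalid_rows`, `is_blank` and `REQUIRED_COLUMNS` are hypothetical names, and the choice of required columns is an assumption that should be adjusted to the metrics being run.

```python
import math

# Assumption: these columns must be populated for the generation metrics.
REQUIRED_COLUMNS = ("input", "actual_output", "retrieval_context")


def is_blank(value):
    """True for None, NaN, or empty/whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def find_invalid_rows(records, required=REQUIRED_COLUMNS):
    """Return {row_position: [blank columns]} for rows that would fail validation."""
    problems = {}
    for position, row in enumerate(records):
        blank = [col for col in required if is_blank(row.get(col))]
        if blank:
            problems[position] = blank
    return problems
```

It takes a list of dictionaries, so it can be called as `find_invalid_rows(df.to_dict("records"))`. The keys are row positions, which may differ from the DataFrame index after `dropna()`.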

[Back to top](#title)

----

## Dev only

Get just a few test cases from the first rows of the saved DataFrame

In [55]:
df1 = pd.read_csv('./data/synthetic_data/complete_ragas_synthetic_data_30.csv')

In [56]:
df1 = df1.iloc[0:3]

In [None]:
df1.to_csv('./data/synthetic_data/complete_ragas_synthetic_data_3.csv', index=False)

In [39]:
df10 = df.iloc[10:12]

In [40]:
df10.head()

| Unnamed: 0 | input | context | expected_output | actual_output | retrieval_context |
|---|---|---|---|---|---|
| 10 | `How is the exposure of occupations to AI measu...` | `[' occupation. As each training route may be a...` | | `**The exposure of occupations to AI is measure...` | `[{'page_content': '1.4 Calculating occupationa...` |
| 11 | `What is the recommendation for assessing incre...` | `[' Child Payment. However, we would recommend ...` | `Increases to the Child Payment for low-income ...` | `**The recommendation for assessing increases t...` | `[{'page_content': 'ineffective if the wider ef...` |


In [41]:
df10.to_csv('./data/synthetic_data/complete_ragas_synthetic_data_item10.csv', index=False)

----------

## Load Evaluation Dataset <a class="anchor" id="three-section"></a>

Put the CSV file that you want to use for evaluation into the `/notebooks/evaluation/data/synthetic_data/` directory

Import test cases from CSV

In [48]:
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_test_cases_from_csv_file(
    file_path=f"./data/synthetic_data/complete_ragas_synthetic_data_v{version}.csv",
    input_col_name="input",
    actual_output_col_name="actual_output",
    expected_output_col_name="expected_output",
    context_col_name="context",
    context_col_delimiter= ";",
    retrieval_context_col_name="retrieval_context",
    retrieval_context_col_delimiter= ";"
)

In [35]:
dataset

EvaluationDataset(test_cases=[LLMTestCase(input=None, actual_output=None, expected_output=None, context=[], retrieval_context=[], additional_metadata=None, comments=None), LLMTestCase(input=None, actual_output=None, expected_output=None, context=[], retrieval_context=[], additional_metadata=None, comments=None), LLMTestCase(input=None, actual_output=None, expected_output=None, context=[], retrieval_context=[], additional_metadata=None, comments=None), LLMTestCase(input=None, actual_output=None, expected_output=None, context=[], retrieval_context=[], additional_metadata=None, comments=None), LLMTestCase(input=None, actual_output=None, expected_output=None, context=[], retrieval_context=[], additional_metadata=None, comments=None)], goldens=[], conversational_goldens=[], _alias=None, _id=None)
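
The repr above shows `input=None` and `actual_output=None` for every test case, which suggests the CSV columns were not read as intended. A quick guard before running any metrics can catch this early. This is a sketch: `empty_test_cases` is a hypothetical helper that only relies on the `input` and `actual_output` attributes visible in the repr.

```python
def empty_test_cases(test_cases):
    """Return the positions of test cases whose input or actual_output is None."""
    return [
        i
        for i, case in enumerate(test_cases)
        if getattr(case, "input", None) is None
        or getattr(case, "actual_output", None) is None
    ]
```

For example, `assert not empty_test_cases(dataset.test_cases)` would stop the notebook before any paid LLM calls are made on an empty dataset.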

[Back to top](#title)

---------

## Evaluate RAG pipeline

In [1]:
#TODO: Add code to handle rate limits - particularly for metrics using GPT-4: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb
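
A common pattern for rate limits is retrying with exponential backoff and jitter. The sketch below is a generic illustration of that pattern rather than anything DeepEval provides: `call_with_backoff` is a hypothetical helper, and it retries on every exception, which in real use should be narrowed to the client's rate-limit error.

```python
import random
import time


def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `func()` and retry with exponential backoff plus jitter on any exception."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...) plus random jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

The `sleep` parameter exists so the waiting behaviour can be replaced in tests. Wrapping the smallest call that hits the limit keeps retries cheap, since retrying a whole evaluation run would repeat work that already succeeded.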

In [49]:
from deepeval import evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)

Instantiate retrieval and generation evaluation metrics

In [50]:
# Instantiate retrieval metrics
contextual_precision = ContextualPrecisionMetric(
    threshold=0.5, # default is 0.5
    model="gpt-4o",
    include_reason=True
)

contextual_recall = ContextualRecallMetric(
    threshold=0.5, # default is 0.5
    model="gpt-4o",
    include_reason=True
)

contextual_relevancy = ContextualRelevancyMetric(
    threshold=0.5, # default is 0.5
    model="gpt-4o",
    include_reason=True
)

In [None]:
# Instantiate generation metrics
answer_relevancy = AnswerRelevancyMetric(
    threshold=0.5, # default is 0.5
    model="gpt-4o",
    include_reason=True
)

faithfulness = FaithfulnessMetric(
    threshold=0.5, # default is 0.5
    model="gpt-4o",
    include_reason=True
)

hallucination = HallucinationMetric(
    threshold=0.5, # default is 0.5
    model="gpt-4o",
    include_reason=True
)

In [36]:
dataset.test_cases

[LLMTestCase(input=None, actual_output=None, expected_output=None, context=[], retrieval_context=[], additional_metadata=None, comments=None),
 LLMTestCase(input=None, actual_output=None, expected_output=None, context=[], retrieval_context=[], additional_metadata=None, comments=None),
 LLMTestCase(input=None, actual_output=None, expected_output=None, context=[], retrieval_context=[], additional_metadata=None, comments=None),
 LLMTestCase(input=None, actual_output=None, expected_output=None, context=[], retrieval_context=[], additional_metadata=None, comments=None),
 LLMTestCase(input=None, actual_output=None, expected_output=None, context=[], retrieval_context=[], additional_metadata=None, comments=None)]

#### Retrieval Evaluation
Separate retrieval and generation evaluation results, as retrieval evaluation can take some time

In [51]:
retrieval_eval_results = evaluate(
    test_cases=dataset,
    metrics=[
        contextual_precision,
        contextual_recall,
        contextual_relevancy,
    ]
)

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...

Metrics Summary

  - ✅ Contextual Precision (score: 0.6388888888888888, threshold: 0.5, strict: False, evaluation model: gpt-4-turbo, reason: The score is 0.64 because the first node, which discusses parental supervision rather than direct findings on behavioural disorders, is ranked higher than the more relevant nodes. This node should have been ranked lower given that it does not directly address the primary query about the prevalence of behavioral disorders among EBCI children who received UBI-like payments. The other nodes, ranked second, third, and fourth, are more relevant as they explicitly mention significant findings related to the query, such as "levels of psychiatric symptoms fell significantly", "emotional disorders (-37% of a standard deviation)", and "a reduction in the prevalence of symptoms of behavioural disorders (-23% of a standard deviation)". Therefore, the ranking of the first node adversely affects the score by positioning less relevant content above more direc



#### Save retrieval evaluation results

In [52]:
import pickle 
with open(f'./data/results/retrieval_eval_results_v{version}', 'wb') as f:
    pickle.dump(retrieval_eval_results, f)

#### Generation Evaluation

In [None]:
generation_eval_results = evaluate(
    test_cases=dataset,
    metrics=[
        answer_relevancy,
        faithfulness,
        hallucination
    ]
)

#### Save generation evaluation results

In [83]:
import pickle 
with open(f'./data/results/generation_eval_results_v{version}', 'wb') as f:
    pickle.dump(generation_eval_results, f)

#### Load saved evaluation results

To load saved evaluation results, uncomment and run the cell below

In [73]:
# with open(f"./data/results/generation_eval_results_v{version}", 'rb') as f:
#     generation_eval_results = pickle.load(f)

#### Access evaluation results and metrics from the TestResult object

In [48]:
generation_eval_results[0].success

True

In [49]:
generation_eval_results[0].metrics

[<deepeval.metrics.answer_relevancy.answer_relevancy.AnswerRelevancyMetric at 0x14757b750>]

In [53]:
generation_eval_results[0].metrics[0].score

1.0

In [54]:
generation_eval_results[0].metrics[0].reason

'The score is 1.00 because the response fully addresses the query about the impact of variations in UBI-like payments on childhood obesity rates without any irrelevant information. Great job on maintaining focus and relevancy!'
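
To compare prompt experiments, the per-test-case scores can be averaged per metric. This sketch assumes only what the cells above show, namely that each result has a `metrics` list whose items expose a `score`. `mean_scores` is a hypothetical helper, and grouping by class name is a design choice rather than a DeepEval convention.

```python
from collections import defaultdict


def mean_scores(results):
    """Average `score` per metric class name across a list of TestResult-like objects."""
    scores = defaultdict(list)
    for result in results:
        for metric in result.metrics:
            # Skip metrics that did not produce a score.
            if metric.score is not None:
                scores[type(metric).__name__].append(metric.score)
    return {name: sum(values) / len(values) for name, values in scores.items()}
```

Calling `mean_scores(generation_eval_results)` would return a dictionary keyed by metric class name, which can be compared between runs with different prompts.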

[Back to top](#title)

-------------

## #TODO <a class="anchor" id="todo"></a>

In [3]:
#TODO: Add code to handle rate limits in the generator - particularly for the critic using GPT-4: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb

In [4]:
#TODO: Can we add cost estimate to DeepEval tests?

In [None]:
#TODO: Investigate open source (free!) evaluation models with DeepEval: 
# https://christophergs.com/blog/ai-engineering-evaluation-with-deepeval-and-open-source-models#closing

[Back to top](#title)

-------