# RAG QA Review

This example demonstrates how to use Label Studio to review and evaluate the outputs of a Retrieval-Augmented Generation (RAG) system.

It leverages the [RAGBench dataset](https://huggingface.co/datasets/rungalileo/ragbench) (specifically the `emanual` subset, containing questions and answers based on consumer electronics manuals) and the [RAGAS](https://github.com/explodinggradients/ragas) framework for automated evaluation metrics.

![RAG QA Review Screenshot](./images/rag_review.png)


## Load the Dataset
First, we'll load the RAGBench dataset, which contains pre-computed RAGAS metrics for a collection of question-answering examples. The dataset includes:
- Questions and answers
- Retrieved context documents
- Evaluation metrics like RAGAS faithfulness and context relevance

In this example, we'll use the EManual subset, described as "a question answer dataset comprising consumer electronic device manuals and realistic questions about them composed by human annotators."

In [16]:
import pandas as pd
import numpy as np
from datasets import load_dataset
import matplotlib.pyplot as plt
import seaborn as sns
import json

# Load the dataset
dataset = load_dataset("rungalileo/ragbench", 'emanual')

# Convert to pandas DataFrame for easier analysis
df = pd.DataFrame(dataset['test'])

In [17]:
# Display basic information about the dataset
print("Dataset Info:")
print(df.info())
print("\nSample of columns:")
print(df.columns.tolist())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 132 entries, 0 to 131
Data columns (total 26 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   id                                  132 non-null    object 
 1   question                            132 non-null    object 
 2   documents                           132 non-null    object 
 3   response                            132 non-null    object 
 4   generation_model_name               132 non-null    object 
 5   annotating_model_name               132 non-null    object 
 6   dataset_name                        132 non-null    object 
 7   documents_sentences                 132 non-null    object 
 8   response_sentences                  132 non-null    object 
 9   sentence_support_information        132 non-null    object 
 10  unsupported_response_sentence_keys  132 non-null    object 
 11  adherence_score                

In [18]:
df.head()

| | id | question | documents | response | generation_model_name | annotating_model_name | dataset_name | documents_sentences | response_sentences | sentence_support_information | ... | trulens_groundedness | trulens_context_relevance | ragas_faithfulness | ragas_context_relevance | gpt3_adherence | gpt3_context_relevance | gpt35_utilization | relevance_score | utilization_score | completeness_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | emanual_467 | I want to enter into Ambient mode. How can I ... | [Changing the content and settings for Ambient... | To enter into Ambient Mode, you need to press ... | gpt-3.5-turbo-0125 | gpt-4-turbo-2024-04-09 | emanual_test | [[[0a, Changing the content and settings for A... | [[a, To enter into Ambient Mode, you need to p... | [{'explanation': 'This sentence is supported, ... | ... | 0.0 | 0.9 | 0.5 | 0.020408 | 1.0 | 0.061224 | 0.061224 | 0.020408 | 0.040816 | 1.0 |
| 1 | emanual_413 | Where do I find signal information ? | [Checking digital channel signal info and stre... | You can find signal information under "Setting... | gpt-3.5-turbo-0125 | gpt-4-turbo-2024-04-09 | emanual_test | [[[0a, Checking digital channel signal info an... | [[a, You can find signal information under "Se... | [{'explanation': 'The response correctly ident... | ... | 0.0 | 0.8 | 1.0 | 0.111111 | 1.0 | 0.222222 | 0.222222 | 0.111111 | 0.111111 | 1.0 |
| 2 | emanual_32 | How can I view the channels that are serached ... | [Scanning for available channels. Settings Bro... | To view the channels that are searched by the ... | gpt-3.5-turbo-0125 | gpt-4-turbo-2024-04-09 | emanual_test | [[[0a, Scanning for available channels.], [0b,... | [[a, To view the channels that are searched by... | [{'explanation': 'This sentence is supported b... | ... | 0.36 | 0.8 | 1.0 | 0.263158 | 1.0 | 0.052632 | 0.052632 | 0.421053 | 0.157895 | 0.375 |
| 3 | emanual_490 | Can I configure Tint? | [Inverting the screen color. Settings General ... | Yes, you can configure Tint by accessing the P... | gpt-3.5-turbo-0125 | gpt-4-turbo-2024-04-09 | emanual_test | [[[0a, Inverting the screen color.], [0b, Sett... | [[a, Yes, you can configure Tint by accessing ... | [{'explanation': 'The sentence is supported by... | ... | 0.0 | 0.8 | 1.0 | 0.0 | 1.0 | 0.032258 | 0.032258 | 0.032258 | 0.032258 | 1.0 |
| 4 | emanual_426 | How do I fix the missing/wrong color issue ? | [Testing the picture. Settings Support Self Di... | To fix the missing/wrong color issue on your T... | gpt-3.5-turbo-0125 | gpt-4-turbo-2024-04-09 | emanual_test | [[[0a, Testing the picture.], [0b, Settings Su... | [[a, To fix the missing/wrong color issue on y... | [{'explanation': 'This sentence is a general i... | ... | 0.76875 | 0.9 | 1.0 | 0.160714 | 1.0 | 0.107143 | 0.089286 | 0.089286 | 0.178571 | 0.4 |


## Identify Critical Samples

The `identify_critical_samples` function analyzes two main RAGAS metrics:

1. **Faithfulness**: Measures how well the response is grounded in the source documents (lower scores indicate potential hallucinations)
2. **Context Relevance**: Measures how relevant the retrieved documents are to the query

The function identifies problematic examples by:
- Finding cases where either faithfulness or context relevance falls below specified thresholds
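- Calculating a severity score that combines both metrics with equal weights: `1 - (0.5 * faithfulness + 0.5 * context_relevance)`, so lower metric values give a higher severity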
- Sorting examples by severity score to prioritize the most problematic cases for human review

This helps focus human evaluation efforts on the most critical examples where the RAG system might be failing.

In [19]:
def identify_critical_samples(df, thresholds):
    """
    Identify examples that need human evaluation based on RAGAS metrics
    """

    # Identify problematic examples
    critical_samples = df[
        (df['ragas_faithfulness'] < thresholds['ragas_faithfulness']) |
        (df['ragas_context_relevance'] < thresholds['ragas_context_relevance'])
    ].copy()
    
    # Severity is 1 minus the equally weighted mean of faithfulness and context relevance,
    # so lower metric values produce a higher severity score
    critical_samples['severity_score'] = 1 - (
        critical_samples['ragas_faithfulness'] * 0.5 +
        critical_samples['ragas_context_relevance'] * 0.5
    )
    # Most severe examples first, so they are prioritized for human review
    return critical_samples.sort_values('severity_score', ascending=False)
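
To make the filter and the severity formula concrete, here is a minimal sketch on a hand-made three-row DataFrame. The rows and the helper name `flag_and_score` are hypothetical and only illustrate the logic described above; they are not taken from RAGBench.

```python
import pandas as pd

# Hypothetical toy data; values are chosen only to illustrate the logic.
toy = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "ragas_faithfulness": [0.9, 0.5, 1.0],
    "ragas_context_relevance": [0.8, 0.3, 0.0],
})
toy_thresholds = {"ragas_faithfulness": 0.7, "ragas_context_relevance": 0.6}


def flag_and_score(frame, thresholds):
    """Keep rows below either threshold and attach a severity score."""
    below_faithfulness = frame["ragas_faithfulness"] < thresholds["ragas_faithfulness"]
    below_relevance = frame["ragas_context_relevance"] < thresholds["ragas_context_relevance"]
    flagged = frame[below_faithfulness | below_relevance].copy()
    # Severity is 1 minus the equally weighted mean of the two metrics.
    flagged["severity_score"] = 1 - (
        0.5 * flagged["ragas_faithfulness"] + 0.5 * flagged["ragas_context_relevance"]
    )
    # Most severe rows first.
    return flagged.sort_values("severity_score", ascending=False)


toy_flagged = flag_and_score(toy, toy_thresholds)
print(toy_flagged[["question", "severity_score"]])
```

In this toy data, `q1` passes both thresholds and is dropped. `q3` has perfect faithfulness but is still flagged, because the two conditions are combined with OR and its context relevance is below the cutoff.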

In [20]:
# Cutoff values: a sample is flagged if either metric falls below its threshold
thresholds = {
    'ragas_faithfulness': 0.7,       # below this, the response may contain hallucinations
    'ragas_context_relevance': 0.6,  # below this, the retrieved context may be off-topic
}
    
# Get critical samples
critical_samples = identify_critical_samples(df, thresholds)

# Display top examples needing human evaluation
print("\nTop Examples Requiring Human Evaluation:")
display_columns = [
    'question', 
    'response',
    'documents',
    'ragas_faithfulness',
    'ragas_context_relevance',
    'severity_score'
]
critical_samples[display_columns].sort_values('severity_score', ascending=False).head(10)


Top Examples Requiring Human Evaluation:


| | question | response | documents | ragas_faithfulness | ragas_context_relevance | severity_score |
|---|---|---|---|---|---|---|
| 0 | I want to enter into Ambient mode. How can I ... | To enter into Ambient Mode, you need to press ... | [Changing the content and settings for Ambient... | 0.5 | 0.020408 | 0.739796 |
| 65 | How to create new account in SmartThings ? | To create a new account in SmartThings, you ne... | [Creating and managing your Samsung account. Y... | 0.571429 | 0.03125 | 0.698661 |
| 69 | Can I configure Tint? | Yes, according to the provided context, you ca... | [Inverting the screen color. Settings General ... | 0.666667 | 0.0 | 0.666667 |
| 86 | What is decor and how to set wallpaper of the ... | According to the context provided:\n\n1. Decor... | [Changing the content and settings for Ambient... | 0.571429 | 0.102041 | 0.663265 |
| 126 | Why my TV is making a popping noise? | According to the provided context, the TV maki... | [Other Issues. Use these procedures to resolve... | 0.666667 | 0.013699 | 0.659817 |
| 26 | What is the function of 'Learn TV Remote'? | The function of 'Learn TV Remote' is to help i... | [Learning about the remote control (for the vi... | 0.666667 | 0.023256 | 0.655039 |
| 9 | Where do I find the list of my favorite channels? | You can find the list of your favorite channel... | [Creating a Personal Favorites List. Designate... | 0.8 | 0.0 | 0.6 |
| 24 | How can I change Antenna type? | To change the antenna type on your TV, you can... | [Connecting an Antenna. You can connect an ant... | 0.5 | 0.3 | 0.6 |
| 39 | Can I fix powering on issue? | Yes, you can fix the powering on issue by foll... | [Turning on the TV with a mobile device. Setti... | 0.833333 | 0.066667 | 0.55 |
| 23 | What is source and how to serch data for chann... | Source refers to the input signal that is curr... | [Changing the input signal. Source When you se... | 1.0 | 0.0 | 0.5 |


## Converting Critical Samples to Label Studio Tasks

In the following cells, we'll prepare our critical samples for human evaluation by converting them into Label Studio tasks, structuring the data in a way that Label Studio can import and use for annotation.

This preparation ensures that human evaluators can efficiently review and assess the RAG system's outputs, with clear visibility of:
- The original query
- The system's response 
- The source documents used
- Key metrics like faithfulness and context relevance scores
- Overall severity scores for prioritization

The source documents are rendered as HTML and the metrics as a table so that reviewers can scan each task quickly in the Label Studio interface.


In [21]:
def format_metrics_table(row):
    """Format RAGAS metrics into a table structure that Label Studio can properly display"""
    return {
        "RAGAS Faithfulness": f"{float(row['ragas_faithfulness']):.3f}",
        "RAGAS Context Relevance": f"{float(row['ragas_context_relevance']):.3f}"
    }


In [22]:
def format_for_label_studio(df):
    """
    Format RAGBench data for Label Studio import (JSON format)
    """
    label_studio_data = []
    
    for _, row in df.iterrows():
        # Format source documents
        documents_html = []
        for i, doc in enumerate(row['documents']):
            doc_html = f"""
            <div style='margin-bottom: 15px; padding: 10px; border: 1px solid #ddd; border-radius: 4px;'>
                <div style='font-weight: bold; margin-bottom: 5px;'>Source {i+1}</div>
                <div style='margin-bottom: 8px;'>{doc}</div>
            </div>
            """
            documents_html.append(doc_html)
        
        # Reuse the severity score already computed by identify_critical_samples
        severity_score = row['severity_score']
        
        # Create task data with only necessary fields
        task_data = {
            "id": str(row['id']),
            "data": {
                "user_query": row['question'],
                "assistant_response": row['response'],
                "sources_html": "".join(documents_html),
                "metrics": format_metrics_table(row),
                "severity_score": f"{severity_score:.3f}",
                "ragas_faithfulness": f"{float(row['ragas_faithfulness']):.3f}",
                "ragas_context_relevance": f"{float(row['ragas_context_relevance']):.3f}"
            }
        }
        
        label_studio_data.append(task_data)
    
    # Sort by severity score to prioritize more problematic examples
    label_studio_data.sort(key=lambda x: float(x['data']['severity_score']), reverse=True)
    
    return label_studio_data
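
The formatter above interpolates each document string directly into HTML. If a manual excerpt contains characters such as `<` or `&`, the rendered source panel could be distorted. One possible safeguard, sketched here with a hypothetical helper `render_source_html`, is to escape the text with the standard-library `html.escape` before embedding it. This is an optional variation, not part of the original notebook.

```python
from html import escape


def render_source_html(index, doc):
    """Return one source card with the document text HTML-escaped."""
    safe_doc = escape(str(doc))  # converts &, <, > and quotes to entities
    return (
        "<div style='margin-bottom: 15px; padding: 10px; "
        "border: 1px solid #ddd; border-radius: 4px;'>"
        f"<div style='font-weight: bold; margin-bottom: 5px;'>Source {index + 1}</div>"
        f"<div style='margin-bottom: 8px;'>{safe_doc}</div>"
        "</div>"
    )


# Hypothetical document text containing markup-like characters.
sample_html = render_source_html(0, "Press <Select> & hold for 3 seconds.")
print(sample_html)
```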

In [23]:
review_tasks = format_for_label_studio(critical_samples)

In [24]:
# Sample of a task
print(json.dumps(review_tasks[0], indent=2))

{
  "id": "emanual_467",
  "data": {
    "user_query": "I want to  enter into Ambient mode. How can I do that?",
    "assistant_response": "To enter into Ambient Mode, you need to press the specified button on the remote control. In the provided context, it mentions that to enter Ambient Mode, you should press the button on the Samsung TV remote control.",
    "sources_html": "\n            <div style='margin-bottom: 15px; padding: 10px; border: 1px solid #ddd; border-radius: 4px;'>\n                <div style='font-weight: bold; margin-bottom: 5px;'>Source 1</div>\n                <div style='margin-bottom: 8px;'>Changing the content and settings for Ambient Mode. When you press the button in Ambient Mode , the Ambient Mode browser screen appears. In the Ambient Mode browser screen, you can select content and change the Ambient Mode settings. Setting up the content for Ambient Mode The Ambient Mode browser screen displays content at the top and categories at the bottom. Use the left o

In [25]:
# Optional: Save tasks as JSON

# with open('rag_evaluation_tasks.json', 'w', encoding='utf-8') as f:
#     json.dump(review_tasks, f, ensure_ascii=False, indent=2)
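
If the tasks are saved to a file, a quick round-trip check can confirm that the file is valid JSON and that every task carries the fields the labeling interface reads. The sketch below uses a hypothetical one-task list (`sample_tasks`) and a temporary directory; the field names mirror the `data` keys built above.

```python
import json
import os
import tempfile

# Hypothetical stand-in for review_tasks with the same shape.
sample_tasks = [
    {
        "id": "example_1",
        "data": {
            "user_query": "How do I change the input source?",
            "assistant_response": "Open the Source menu.",
            "sources_html": "<div>Changing the input signal.</div>",
            "metrics": {
                "RAGAS Faithfulness": "1.000",
                "RAGAS Context Relevance": "0.000",
            },
            "severity_score": "0.500",
        },
    }
]
required_fields = {
    "user_query", "assistant_response", "sources_html", "metrics", "severity_score",
}

# Write the tasks to disk and read them back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "rag_evaluation_tasks.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_tasks, f, ensure_ascii=False, indent=2)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

# Collect (task id, missing field names) for any incomplete task.
missing = [
    (task["id"], sorted(required_fields - set(task["data"])))
    for task in reloaded
    if required_fields - set(task["data"])
]
print("Tasks with missing fields:", missing)
```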

## Import Data into Label Studio
The ultimate goal of this exercise is to identify patterns in our RAG system's failures. With the evaluation tasks prepared, reviewers assess each one in Label Studio along three dimensions:

- **Response Quality Assessment** - identify if the RAG system is providing accurate and complete answers
  - Overall Accuracy: Evaluates if the response is factually correct
  - Response Completeness: Checks if the answer fully addresses the query using available information
- **Context Evaluation** - identify retrieval issues and gaps in the knowledge base
  - Source Coverage: Assesses if the response is properly supported by the retrieved documents
  - Key Information Missing: Identifies specific types of missing information (dates, names, numbers, etc.)
  - Missing Context Details: Allows detailed feedback about what information is needed
- **Metric Agreement** - validate the effectiveness of automated metrics and identify edge cases
  - Agreement with Automated Metrics: Compares human judgment with RAGAS scores
  - Disagreement Notes: Captures why human judgment differs from automated metrics

To import the tasks into Label Studio for manual review, we will:

- Connect to Label Studio (running locally in this case)
- Create a project with our custom labeling interface
- Import the review tasks using the Label Studio SDK



In [26]:
from label_studio_sdk.client import LabelStudio

# Define the URL where Label Studio is accessible and the API key for your user account
LABEL_STUDIO_URL = 'http://localhost:8080'
API_KEY = '<LABEL_STUDIO_API_TOKEN>'

# Connect to the Label Studio API
client = LabelStudio(base_url=LABEL_STUDIO_URL, api_key=API_KEY)

In [27]:
# Define the Label Studio configuration
label_config = """
<View>
    <Header value="RAG Evaluation Task" size="4" />
    <Style>
        .ls-panel { margin-bottom: 1em; }
        .ls-section {
            padding: 1em;
            border-radius: 8px;
            margin-bottom: 1em;
            border: 1px solid #e0e0e0; /* Subtle border for all sections */
        }
        .ls-query { background: #fafafa; border-color: #eeeeee; } /* Off-white */
        .ls-response { background: #e0f7fa; border-color: #b2ebf2; } /* Very Light Cyan */
        .ls-sources { background: #ffffff; border-color: #e0e0e0; } /* White with border */
        .ls-quality { background: #e8f5e9; border-color: #c8e6c9; } /* Very Light Green */
        .ls-context { background: #e0f2f1; border-color: #b2dfdb; } /* Very Light Teal */
        .ls-agreement { background: #f3e5f5; border-color: #e1bee7; } /* Very Light Purple */

        /* Base header style */
        .ls-section .ant-typography {
             font-weight: 500;
             color: #333;
             margin-bottom: 0.6em;
        }
        /* Make main section headers slightly more prominent (Targeting Headers with size="5") */
        .ls-section .ant-typography[role="heading"][aria-level="5"] {
             font-weight: 600; /* Slightly bolder */
             /* font-size: 1.1em; /* Optional: Slightly larger */
             border-bottom: 1px solid #e0e0e0; /* Optional: add a subtle underline */
             padding-bottom: 0.3em;
        }
        /* Style for sub-headers (size="6") - keep underline if desired */
        .ls-section .ant-typography[role="heading"][aria-level="6"] {
             font-weight: 500;
             margin-top: 0.5em; /* Add some space above sub-headers */
        }

        /* Ensure TextAreas are always readable */
        textarea {
            background-color: #ffffff !important; /* Force white background */
            color: #222222 !important; /* Force dark text */
            border: 1px solid #cccccc; /* Ensure border is visible */
        }
    </Style>

    <!-- User Query Section -->
    <View className="ls-section ls-query">
        <Header value="User Query" size="5" />
        <Text name="user_query" value="$user_query" />
    </View>

    <!-- Automated Metrics (Collapsible) -->
    <View className="ls-panel">
        <Collapse bordered="true" defaultActiveKey="['1']">
            <Panel value="Automated Metrics (Severity Score: $severity_score)">
                <Text name="metrics_explanation" value="RAGAS Faithfulness measures factual consistency with context. Context Relevance measures how relevant the retrieved context is to the query. Higher is better (0-1 scale). Severity is 1 - avg(metrics). "/>
                <View style="padding: 10px; background: #f8f9fa; border-radius: 4px; margin-top: 10px;">
                    <Table name="metrics_table" value="$metrics" />
                </View>
            </Panel>
        </Collapse>
    </View>

    <!-- Two-column layout for Response and Sources -->
    <View style="display: grid; grid-template-columns: 1fr 1fr; gap: 1em; margin-bottom: 1em;">
        <!-- Left column: Assistant Response -->
        <View className="ls-section ls-response">
            <Header value="Assistant Response" size="5" />
            <Text name="assistant_response" value="$assistant_response" />
        </View>

        <!-- Right column: Source Documents -->
        <View className="ls-section ls-sources">
            <Header value="Source Documents" size="5" />
            <HyperText name="sources_html" value="$sources_html" />
        </View>
    </View>

    <!-- Evaluation Sections -->

    <!-- Response Quality Evaluation -->
    <View className="ls-section ls-quality">
        <Header value="Response Quality Evaluation" size="5" />
        <View style="display: grid; grid-template-columns: 1fr 1fr; gap: 1em; margin-bottom: 10px;">
            <View>
                <Header value="Overall Accuracy" size="6" underline="true"/>
                <Choices name="accuracy" toName="assistant_response" choice="single" required="true">
                    <Choice value="accurate" html="&lt;span style='color: green;'&gt;Accurate&lt;/span&gt;" />
                    <Choice value="partially_accurate" html="&lt;span style='color: orange;'&gt;Partially Accurate&lt;/span&gt;" />
                    <Choice value="inaccurate" html="&lt;span style='color: red;'&gt;Inaccurate&lt;/span&gt;" />
                </Choices>
            </View>
            <View>
                <Header value="Response Completeness" size="6" underline="true"/>
                <Choices name="response_completeness" toName="assistant_response" choice="single" required="true">
                    <Choice value="complete" html="&lt;span style='color: green;'&gt;Complete (based on sources)&lt;/span&gt;" />
                    <Choice value="partial" html="&lt;span style='color: orange;'&gt;Partial (info in sources missed)&lt;/span&gt;" />
                    <Choice value="incomplete_sources" html="&lt;span style='color: blue;'&gt;Incomplete (due to sources)&lt;/span&gt;" />
                </Choices>
            </View>
        </View>
         <TextArea name="quality_notes" toName="assistant_response"
                  placeholder="Notes on accuracy or completeness..."
                  rows="2" maxSubmissions="1" editable="true" />
    </View>

    <!-- Context Evaluation -->
    <View className="ls-section ls-context">
        <Header value="Context &amp; Source Evaluation" size="5" />
        <View style="margin-bottom: 10px;">
            <Header value="Source Coverage" size="6" underline="true"/>
            <Choices name="source_coverage" toName="assistant_response" choice="single" required="true">
                <Choice value="complete" html="&lt;span style='color: green;'&gt;Sources fully support response&lt;/span&gt;" />
                <Choice value="partial" html="&lt;span style='color: orange;'&gt;Sources partially support response&lt;/span&gt;" />
                <Choice value="insufficient" html="&lt;span style='color: red;'&gt;Sources insufficiently support response (Hallucination?)&lt;/span&gt;" />
            </Choices>
        </View>
        <View style="margin-bottom: 10px;">
            <Header value="Key Information Missing from Sources?" size="6" underline="true"/>
            <Choices name="missing_info_type" toName="assistant_response" choice="multiple" showInline="true">
                <Choice value="dates_times" html="Dates/Times" />
                <Choice value="names" html="Names/Entities" />
                <Choice value="numbers" html="Numbers/Stats" />
                <Choice value="definitions" html="Definitions" />
                <Choice value="steps" html="Instructions/Steps" />
                <Choice value="context" html="Background Context" />
                <Choice value="other" html="Other (Specify Below)" />
                <Choice value="none" html="Nothing Missing" />
            </Choices>
        </View>
        <TextArea name="missing_context_details" toName="assistant_response"
                  placeholder="If info is missing or sources are incomplete/irrelevant, please elaborate here..."
                  rows="3" maxSubmissions="1" editable="true" />
    </View>

    <!-- Metric Agreement Section -->
    <View className="ls-section ls-agreement">
        <Header value="Metric Agreement" size="5" />
        <View style="margin-bottom: 10px;">
            <Header value="Do your evaluations agree with the automated metrics?" size="6" underline="true"/>
            <Choices name="metric_agreement" toName="assistant_response" choice="single" required="true">
                <Choice value="agrees" html="&lt;span style='color: green;'&gt;Yes, metrics seem correct&lt;/span&gt;"/>
                <Choice value="partially_agrees" html="&lt;span style='color: orange;'&gt;Partially (e.g., one metric ok, other not)&lt;/span&gt;"/>
                <Choice value="disagrees" html="&lt;span style='color: red;'&gt;No, metrics seem incorrect&lt;/span&gt;"/>
            </Choices>
            <TextArea name="disagreement_notes" toName="assistant_response"
                     placeholder="If disagreeing or partially agreeing, please explain why (e.g., faithfulness score too high, relevance too low...)"
                     rows="2" editable="true"/>
        </View>
    </View>

</View>
    """

In [28]:
# Create a new project
multi_turn_project = client.projects.create(
    title='EManual RAG Evaluation',
    color='#ECB800',
    description='Review samples from the EManual dataset, comprising consumer electronic device manuals and realistic questions about them composed by human annotators.',
    label_config=label_config
)

In [29]:
# Import tasks into the project
for task in review_tasks:
    client.tasks.create(
        project=multi_turn_project.id,
        data=task['data']
    )
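
Creating tasks one request at a time is simple but can be slow for larger review sets. One option is to group the tasks into batches and send each batch in a single request. The sketch below only demonstrates the batching step with a hypothetical helper `chunked`; the commented-out call assumes the SDK exposes a bulk `client.projects.import_tasks` method, which should be checked against the installed SDK version.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical payloads standing in for [task['data'] for task in review_tasks].
sample_payloads = [{"user_query": f"question {i}"} for i in range(5)]
batches = list(chunked(sample_payloads, 2))
print([len(batch) for batch in batches])

# Assumed usage with the client and project created above (not executed here):
# for batch in chunked([task['data'] for task in review_tasks], 50):
#     client.projects.import_tasks(id=multi_turn_project.id, request=batch)
```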

### We're now ready to review our RAG data in Label Studio
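
Once reviewers have submitted annotations, the results can be exported from Label Studio as JSON and tallied to look for failure patterns. The sketch below assumes the standard JSON export shape, in which each task has an `annotations` list whose `result` items carry `from_name` and `value.choices`. The `sample_export` data and the helper `tally_choice` are hypothetical and do not represent real review results.

```python
from collections import Counter

# Hypothetical export; the choices below are made up for illustration.
sample_export = [
    {
        "data": {"severity_score": "0.740"},
        "annotations": [{"result": [
            {"from_name": "metric_agreement", "to_name": "assistant_response",
             "type": "choices", "value": {"choices": ["agrees"]}},
        ]}],
    },
    {
        "data": {"severity_score": "0.500"},
        "annotations": [{"result": [
            {"from_name": "metric_agreement", "to_name": "assistant_response",
             "type": "choices", "value": {"choices": ["disagrees"]}},
        ]}],
    },
    {
        "data": {"severity_score": "0.660"},
        "annotations": [{"result": [
            {"from_name": "metric_agreement", "to_name": "assistant_response",
             "type": "choices", "value": {"choices": ["agrees"]}},
        ]}],
    },
]


def tally_choice(export, control_name):
    """Count the selected choices for one control across all annotations."""
    counts = Counter()
    for task in export:
        for annotation in task.get("annotations", []):
            for item in annotation.get("result", []):
                if item.get("from_name") == control_name:
                    counts.update(item.get("value", {}).get("choices", []))
    return counts


agreement_counts = tally_choice(sample_export, "metric_agreement")
print(dict(agreement_counts))
```

The same helper can be pointed at other controls from the labeling configuration, such as `accuracy` or `source_coverage`.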