# Evaluation of Chainlit-doc Copilot RAG system


In this notebook, we will evaluate the performance of the Generation step in a RAG system. The following steps are performed:
1. Initialize Literal AI SDK
2. Create a dataset from Threads in Literal AI
3. Evaluate Generation with RAGAS on Answer Relevancy and Faithfulness
4. Persist experiment to Literal AI

When you evaluate a RAG system, you should evaluate the Retrieval step and the Generation step separately. This notebook focuses on evaluating the Generation step only.

## 1. Initialize the Literal AI SDK

In [None]:
import os

from openai import OpenAI
from dotenv import load_dotenv
from literalai import LiteralClient

load_dotenv()

openai_client = OpenAI()

literal_client = LiteralClient(api_key=os.getenv("LITERAL_API"))
literal_client.instrument_openai()

## 2. Create a Dataset

In [None]:
DATASET_NAME = "RAG-evaluation"

dataset = literal_client.api.get_dataset(name=DATASET_NAME)

In [None]:
number_of_threads = 2

if not dataset:
    dataset = literal_client.api.create_dataset(name=DATASET_NAME)
    
    threads = literal_client.api.get_threads(first=number_of_threads).data
    
    rag_steps = []
    for thread in threads:
        rag_steps.extend([step for step in thread.steps if step.name == "RAG Agent"])
    
    for step in rag_steps:
        dataset.add_step(step.id)
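
The step selection in the cell above can be tried without a Literal AI project by using stand-in objects. This is only a sketch: the threads, step ids, and the variable `rag_step_ids` below are made up for illustration and do not come from the SDK.

```python
from types import SimpleNamespace

# Hypothetical threads, shaped like the objects iterated over above:
# each thread has a list of steps, and each step has an id and a name.
fake_threads = [
    SimpleNamespace(steps=[
        SimpleNamespace(id="s1", name="RAG Agent"),
        SimpleNamespace(id="s2", name="Documentation Retrieval"),
    ]),
    SimpleNamespace(steps=[
        SimpleNamespace(id="s3", name="RAG Agent"),
    ]),
]

# Keep only the top-level "RAG Agent" steps, as the dataset cell does.
rag_step_ids = [
    step.id
    for thread in fake_threads
    for step in thread.steps
    if step.name == "RAG Agent"
]
```

Only the steps named "RAG Agent" are kept; the retrieval step is skipped here because it is later read from each item's intermediary steps.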

## 3. Evaluate with Ragas

#### Prepare Ragas data samples

In [None]:
import ast

items = dataset.items

# Get the retrieved contexts for each question.
contexts = []
message_histories = []

for item in items:
    context = []
    message_history = item.intermediary_steps

    for step in item.intermediary_steps:
        if step["name"] == "Cookbooks Retrieval" or step["name"] == "Documentation Retrieval":
            context.extend(ast.literal_eval(step["output"]["content"])) # convert string to list
      
    contexts.append(context)
    message_histories.append(message_history)

# Data samples, in the format expected by Ragas. No ground truth is needed since we evaluate answer relevancy and faithfulness.
data_samples = {
    'question': [item.input["content"]["args"][0] for item in items],
    'answer': [item.expected_output["content"] if item.expected_output else "" for item in items],
    'contexts': contexts,
    'ground_truth': [""]*len(items),
    'messages': message_histories
}
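
The retrieval steps store their output as a string, which is why the cell above passes it through `ast.literal_eval`. A minimal sketch of that conversion, assuming the string holds a Python list literal; the step dictionary and its content below are hypothetical, not taken from the dataset:

```python
import ast

# Hypothetical retrieval step, shaped like the intermediary steps read above.
fake_step = {
    "name": "Documentation Retrieval",
    "output": {"content": "['first retrieved chunk', 'second retrieved chunk']"},
}

# literal_eval only parses Python literals (lists, strings, numbers, ...),
# so unlike eval() it does not execute arbitrary code found in the string.
retrieved_chunks = ast.literal_eval(fake_step["output"]["content"])
```

If a step's output is not a valid literal, `ast.literal_eval` raises an exception (typically `ValueError` or `SyntaxError`), so malformed outputs surface here rather than during the evaluation.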

#### Run the evaluation

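We will evaluate two generation metrics. Answer Relevancy checks how pertinent the generated answer is to the user's question; incomplete or redundant answers score lower. Faithfulness checks whether the claims made in the answer can be inferred from the retrieved contexts.

Both scores are expected to fall between 0 and 1, with higher values being better.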
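
As a rough illustration of the two metrics this notebook computes, Faithfulness and Answer Relevancy: the sketch below reproduces only the final arithmetic. It assumes, as a simplification, that faithfulness is the share of answer claims supported by the contexts and that answer relevancy is the mean cosine similarity between the question and questions regenerated from the answer. Ragas relies on an LLM and an embedding model for those steps; here they are replaced by hand-written booleans and toy vectors, and all function names are hypothetical.

```python
import math

def faithfulness_score(claim_supported):
    """Fraction of answer claims that are supported by the contexts."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def answer_relevancy_score(question_vec, generated_question_vecs):
    """Mean similarity between the question and the regenerated questions."""
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims)

# Toy inputs: 3 of 4 claims are supported; one regenerated question matches
# the original exactly and one is orthogonal to it.
toy_faithfulness = faithfulness_score([True, True, False, True])
toy_relevancy = answer_relevancy_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

With these toy inputs the faithfulness is 0.75 and the relevancy is 0.5, which shows how a single unsupported claim or an off-topic regenerated question lowers the score.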

In [None]:
from datasets import Dataset

from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

results = evaluate(Dataset.from_dict(data_samples), metrics=[answer_relevancy, faithfulness]).to_pandas()

In [None]:
results

In [None]:
results.faithfulness

## 4. Persist experiment to Literal AI

In [None]:
prompt = literal_client.api.get_prompt(name="RAG prompt - Tooled")

experiment = dataset.create_experiment(
    name="Experiment RAG",
    prompt_id=prompt.id
)

In [None]:

# Log each experiment result.
for index, row in results.iterrows():
    # Scores outside [0, 1] (including NaN) are logged as 0.
    scores = [{
        "name": answer_relevancy.name,
        "type": "AI",
        "value": row[answer_relevancy.name] if 0 <= row[answer_relevancy.name] <= 1 else 0
    }, {
        "name": faithfulness.name,
        "type": "AI",
        "value": row[faithfulness.name] if 0 <= row[faithfulness.name] <= 1 else 0
    }]

    experiment_item = {
        "datasetItemId": items[index].id,
        "scores": scores,
        "input": { "question": row["question"], "messages": row["messages"].tolist(), "retrieval": row["contexts"].tolist()},
        "output": { "output": row["answer"] }
    }
    
    experiment.log(experiment_item)
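
The inline conditionals in the logging loop replace any value outside [0, 1] with 0. This also covers NaN, since comparisons with NaN are always false. The same rule could be factored into a small helper; `sanitize_score` is a hypothetical name used for illustration and is not part of Literal AI or Ragas.

```python
import math

def sanitize_score(value):
    """Return value if it is a number in [0, 1], otherwise 0."""
    if value is None or math.isnan(value):
        return 0
    return value if 0 <= value <= 1 else 0

# Example values: a valid score, a failed (NaN) score, and out-of-range scores.
cleaned = [sanitize_score(v) for v in [0.8, float("nan"), 1.2, -0.1]]
```

Note that logging 0 for a failed metric makes it indistinguishable from a genuinely poor score, so it may be worth counting how many values were replaced before interpreting the experiment.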