# RAG System evaluation

## Evaluation dataset

In [27]:
import pandas as pd

In [28]:
df = pd.read_csv("../data/evaluation_data.csv")

In [29]:
df.head(5)

Unnamed: 0,prompt,response,article_url,ground_truth,context
0,What can you tell me about a new Deepseek model?,"DeepSeek released DeepSeek-R1, a large languag...",https://www.deeplearning.ai/the-batch/issue-285/,"DeepSeek released DeepSeek-R1, a mixture-of-ex...","A new open model rivals OpenAI’s o1, and it’s ..."
1,Is Sony Music Group working with Generative AI?,"Yes, Sony Music Group is working with AI and h...",https://www.deeplearning.ai/the-batch/sony-mus...,Sony Music Group has prohibited the use of its...,\nThe world’s second-largest music publisher a...
2,Which AI companies are working with U.S. gover...,Meta and Anthropic are working with the U.S. g...,https://www.deeplearning.ai/the-batch/meta-and...,Meta and Anthropic,Two top AI companies changed their stances on ...
3,what are the models used in biochemistry?,AlphaFold 3 is a model that predicts the struc...,https://www.deeplearning.ai/the-batch/deepmind...,AlphaFold 3,
4,"Are neural networks assisting brain surgeons, ...","Yes, the article indicates that neural network...",https://www.deeplearning.ai/the-batch/research...,The article describes a deep learning-based te...,The latest update of DeepMind’s AlphaFold mode...
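
The preview above suggests that at least one row (index 3) has an empty `context`. Below is a minimal sketch of a completeness check, assuming rows are plain dictionaries. `rows_missing` is a hypothetical helper, and `rows` is an abbreviated, hand-copied stand-in for the five previewed rows, not the real file.

```python
def rows_missing(rows, column):
    """Return the positions of rows whose `column` is None, NaN, or blank."""
    missing = []
    for position, row in enumerate(rows):
        value = row.get(column)
        # `value != value` is only true for NaN, which pandas uses for empty cells.
        if value is None or value != value or not str(value).strip():
            missing.append(position)
    return missing


# Abbreviated stand-in for the previewed rows (truncated exactly as displayed).
rows = [
    {"context": "A new open model rivals OpenAI’s o1, and it’s ..."},
    {"context": "\nThe world’s second-largest music publisher a..."},
    {"context": "Two top AI companies changed their stances on ..."},
    {"context": None},
    {"context": "The latest update of DeepMind’s AlphaFold mode..."},
]

print(rows_missing(rows, "context"))
```

On the real DataFrame, `df[df["context"].isna()]` should surface the same rows.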


## Evaluation using LangKit

### Set up LangKit

In [1]:
from langkit import llm_metrics
from langkit_bounty_helpers import *

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/annacielas/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [10]:
schema = llm_metrics.init()

### Prompt-response relevance

In [11]:
visualize_langkit_metric(
    df,
    "response.relevance_to_prompt"
)

In [12]:
show_langkit_critical_queries(
    df,
    "response.relevance_to_prompt"
)

Unnamed: 0,prompt,response,response.relevance_to_prompt
3,what are the models used in biochemistry?,AlphaFold 3 is a model that predicts the struc...,0.44915
2,Which AI companies are working with U.S. gover...,Meta and Anthropic are working with the U.S. g...,0.556109
0,What can you tell me about a new Deepseek model?,"DeepSeek released DeepSeek-R1, a large languag...",0.561104
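
The implementation of `show_langkit_critical_queries` is not shown in this notebook. Judging by the output above, it appears to list the three lowest-scoring rows for this metric. A minimal sketch of that selection, assuming plain Python lists; `lowest_scoring` is a hypothetical helper, and the scores are the `response.relevance_to_prompt` values for rows 0 to 4 as they appear in the annotated table later in this notebook.

```python
def lowest_scoring(scores, n=3):
    """Return (row_index, score) pairs for the n lowest scores, lowest first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1])
    return ranked[:n]


# response.relevance_to_prompt for rows 0-4 (first five rows only).
relevance = [0.561104, 0.746343, 0.556109, 0.44915, 0.724774]

for row_index, score in lowest_scoring(relevance):
    print(row_index, score)
```

For a metric such as toxicity, where high values are the concern, the same idea applies with the sort order reversed.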


### Data leakage

In [11]:
visualize_langkit_metric(
    df,
    "prompt.has_patterns"
)

In [34]:
visualize_langkit_metric(
    df, 
    "response.has_patterns")

### Toxicity

In [12]:
visualize_langkit_metric(
    df, 
    "prompt.toxicity")

Device set to use cpu


In [13]:
show_langkit_critical_queries(
    df,
    "prompt.toxicity"
)

Unnamed: 0,prompt,response,prompt.toxicity
9,Is there a blackmarket for AI services?,"Yes, there is a black market for AI services. ...",0.019972
6,Who is Anna Cielas?,The context provided does not contain any info...,0.003536
0,What can you tell me about a new Deepseek model?,"DeepSeek released DeepSeek-R1, a large languag...",0.001265


In [14]:
visualize_langkit_metric(
    df,
    "response.toxicity")

In [37]:
show_langkit_critical_queries(
    df,
    "response.toxicity"
)

Unnamed: 0,prompt,response,response.toxicity
2,Which AI companies are working with U.S. government on military and intelligence applications?,"Meta and Anthropic are working with the U.S. government on military and intelligence applications. Meta made its Llama family of large language models available to the U.S. government for national security purposes. ""Meta's partners in the defense and intelligence markets include Accenture, Amazon, Anduril, Booz Allen, Databricks, Deloitte, IBM, Leidos, Lockheed Martin, Microsoft, Oracle, Palantir, Scale AI, and Snowflake.",0.000866
1,Is Sony Music Group working with Generative AI?,"Yes, Sony Music Group is working with AI and has sent letters to AI developers. Specifically, Sony Music Group has forbidden the use of its music or other media such as lyrics, music videos, and album art for “training, developing, or commercializing any AI systems.”",0.000776
0,What can you tell me about a new Deepseek model?,"DeepSeek released DeepSeek-R1, a large language model that executes long lines of reasoning before producing output. The code and weights are licensed freely for commercial and personal use, including training new models on R1 outputs. DeepSeek-R1 is a mixture-of-experts transformer with 671 billion total parameters, 37 billion of which are active at any given time, and it processes 128,000 tokens of input context. DeepSeek-R1 went toe-to-toe with o1, outperforming that model on 5 of 11 of the benchmarks tested. ""DeepSeek released DeepSeek-R1 , a large language model that executes long lines of reasoning before producing output. The code and weights are licensed freely for commercial and personal use, including training new models on R1 outputs."" ""DeepSeek-R1 is a mixture-of-experts transformer with 671 billion total parameters, 37 billion of which are active at any given time, and it processes 128,000 tokens of input context."" ""In DeepSeek's tests, DeepSeek-R1 went toe-to-toe with o1, outperforming that model on 5 of 11 of the benchmarks tested.""",0.00075


## Hallucinations

In [None]:
import evaluate
# udf_schema is needed later to apply all registered UDFs to the DataFrame.
from whylogs.experimental.core.udf_schema import register_dataset_udf, udf_schema  # type: ignore

### BLEU Score

In [17]:
bleu = evaluate.load("bleu")

In [18]:
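@register_dataset_udf(["prompt", "response"],
                      "response.bleu_score_to_prompt")
def bleu_score(text):
  # Compute one BLEU score per prompt/response pair, using n-grams up to bigrams.
  # Note the argument order: the prompt is passed as the prediction and the
  # response as the reference. BLEU is not symmetric, so swapping them changes the score.
  scores = []
  for x, y in zip(text["prompt"], text["response"]):
    scores.append(
      bleu.compute(
        predictions=[x],
        references=[y],
        max_order=2
      )["bleu"]
    )
  return scores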
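
BLEU is built on clipped n-gram precision of the prediction against the reference, so the score depends on which text plays which role. The sketch below illustrates only that precision component. It omits the brevity penalty and the geometric mean over n-gram orders that the `bleu` metric from `evaluate` applies. `clipped_precision` is an illustrative helper, and the two sentences are toy inputs, not rows from the dataset.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(prediction, reference, n):
    """Fraction of prediction n-grams found in the reference, with clipped counts."""
    pred_counts = Counter(ngrams(prediction.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    # Each n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return overlap / total


short_text = "the cat sat"
long_text = "the cat sat on the mat"

print(clipped_precision(short_text, long_text, 1))  # short text as prediction
print(clipped_precision(long_text, short_text, 1))  # roles swapped
```

Swapping the roles changes the result, which is why the argument order in the `bleu.compute` call matters when interpreting `response.bleu_score_to_prompt`.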

In [19]:
visualize_langkit_metric(
    df, 
    "response.bleu_score_to_prompt", 
    numeric=True)

### BERT Score

In [20]:
bertscore = evaluate.load("bertscore")

In [21]:
@register_dataset_udf(["prompt", "response"], "response.bert_score_to_prompt")
def bert_score(text):
  return bertscore.compute(
      predictions=text["prompt"].to_numpy(),
      references=text["response"].to_numpy(),
      model_type="distilbert-base-uncased"
    )["f1"]

In [22]:
visualize_langkit_metric(
    df, 
    "response.bert_score_to_prompt", 
    numeric=True)

### Sentence embedding cosine similarity

In [23]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import pairwise_cos_sim

In [24]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [25]:
@register_dataset_udf(["response", "ground_truth"],
                      "response.sentence_embedding_selfsimilarity")
def sentence_embedding_selfsimilarity(text):
  # Embed every response and every ground-truth answer with the same model.
  response_embeddings = model.encode(text["response"].to_numpy())
  ground_truth_embeddings = model.encode(text["ground_truth"].to_numpy())

  # Row-wise cosine similarity: response i is compared with ground truth i.
  cos_sim_with_response = pairwise_cos_sim(
    response_embeddings, ground_truth_embeddings
  )

  return cos_sim_with_response

In [26]:
sentence_embedding_selfsimilarity(df)

tensor([0.8573, 0.8290, 0.4429, 0.6641, 0.8756, 0.8891, 0.9190, 0.8638, 0.7994,
        0.9078])
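
`pairwise_cos_sim` compares the i-th row of its first argument with the i-th row of its second, which is why the output has one value per dataset row. A dependency-free sketch of the same computation, assuming small hand-written vectors in place of real sentence embeddings; `cosine_similarity` and `pairwise_cosine` are illustrative names, not library functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def pairwise_cosine(left, right):
    """Compare left[i] with right[i] for every i."""
    if len(left) != len(right):
        raise ValueError("both inputs need the same number of rows")
    return [cosine_similarity(a, b) for a, b in zip(left, right)]


# Toy 2-D "embeddings": identical, 45 degrees apart, orthogonal.
left = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
right = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

print(pairwise_cosine(left, right))
```

These values are similarities, so higher means closer. A cosine distance would be one minus the similarity.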

In [27]:
visualize_langkit_metric(
    df, 
    "response.sentence_embedding_selfsimilarity", 
    numeric=True)

In [28]:
annotated_df, _ = udf_schema().apply_udfs(df)

In [29]:
annotated_df.head(5)

Unnamed: 0,prompt,response,article_url,ground_truth,prompt.sentiment_nltk,response.sentiment_nltk,prompt.flesch_reading_ease,response.flesch_reading_ease,prompt.automated_readability_index,response.automated_readability_index,...,prompt.jailbreak_similarity,response.refusal_similarity,prompt.toxicity,response.toxicity,response.relevance_to_prompt,prompt.has_patterns,response.has_patterns,response.bleu_score_to_prompt,response.bert_score_to_prompt,response.sentence_embedding_selfsimilarity
0,What can you tell me about a new Deepseek model?,"DeepSeek released DeepSeek-R1, a large languag...",https://www.deeplearning.ai/the-batch/issue-285/,"DeepSeek released DeepSeek-R1, a mixture-of-ex...",0.0,0.8807,86.71,59.53,1.9,15.0,...,0.172156,0.085639,0.001265,0.00075,0.561104,,,0.0,0.621235,0.857284
1,Is Sony Music Group working with Generative AI?,"Yes, Sony Music Group is working with AI and h...",https://www.deeplearning.ai/the-batch/sony-mus...,Sony Music Group has prohibited the use of its...,0.0,-0.0258,46.44,40.18,6.1,13.2,...,0.243147,0.260159,0.000827,0.000776,0.746343,,,0.003369,0.775768,0.829011
2,Which AI companies are working with U.S. gover...,Meta and Anthropic are working with the U.S. g...,https://www.deeplearning.ai/the-batch/meta-and...,Meta and Anthropic,0.4767,0.8689,22.58,25.66,11.5,13.9,...,0.324931,0.297555,0.000861,0.000866,0.556109,,,0.016441,0.803309,0.442899
3,what are the models used in biochemistry?,AlphaFold 3 is a model that predicts the struc...,https://www.deeplearning.ai/the-batch/deepmind...,AlphaFold 3,0.0,0.4019,64.37,17.34,5.6,17.1,...,0.185594,0.027965,0.000863,0.000761,0.44915,,,0.0,0.740436,0.664142
4,"Are neural networks assisting brain surgeons, ...","Yes, the article indicates that neural network...",https://www.deeplearning.ai/the-batch/research...,The article describes a deep learning-based te...,0.4019,-0.5346,74.19,43.73,9.4,12.9,...,0.174729,0.1001,0.001265,0.000767,0.724774,,,0.001934,0.76841,0.875606


## Evaluation using an LLM

In [30]:
import os
import google.generativeai as genai
from dotenv import load_dotenv


load_dotenv(override=True)

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
GEMINI_MODEL = "gemini-2.0-flash-lite-preview-02-05"

genai.configure(api_key=GEMINI_API_KEY)


SYSTEM_PROMPT = """
Human: You are an AI assistant. You can find answers to the questions from the articles provided.
"""

gemini_model = genai.GenerativeModel(GEMINI_MODEL, system_instruction=SYSTEM_PROMPT)


def prompt_single_llm_selfsimilarity(dataset, index):
    # Ask the model to rate how consistent one response is with its ground truth.
    USER_PROMPT = f"""You will be provided with a text passage \
and your task is to rate the consistency of that text with \
the provided ground truth. Your answer must be only \
a number between 0.0 and 1.0, rounded to one \
decimal place, where 0.0 represents no consistency and \
1.0 represents perfect consistency and similarity. \n\n \
Text passage: {dataset['response'][index]}. \n\n \
Ground truth: {dataset['ground_truth'][index]}.
"""
    response = gemini_model.generate_content(USER_PROMPT)
    # The raw reply text is returned unchanged; it is not parsed into a number here.
    return response.text

In [32]:
print(prompt_single_llm_selfsimilarity(df, 5))

0.8
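
The function above returns the model's reply as raw text, so anything that aggregates these scores has to convert and validate it first. A minimal sketch, assuming a reply counts as valid only when it is a bare number in the requested range; `parse_score` is a hypothetical helper and not part of the notebook.

```python
def parse_score(reply):
    """Convert an LLM reply to a float in [0.0, 1.0], or return None if invalid."""
    try:
        score = float(reply.strip())
    except ValueError:
        return None
    # Reject out-of-range values; NaN also fails this comparison.
    if not 0.0 <= score <= 1.0:
        return None
    return score


print(parse_score("0.8\n"))
print(parse_score("I would say 0.8"))
print(parse_score("1.5"))
```

Returning `None` instead of raising keeps a single malformed reply from aborting a loop over the whole dataset. Whether to retry or drop such rows is a separate design choice.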



## Evaluation with DeepEval

In [24]:
!deepeval test run test_rag_system.py

Evaluating 1 test case(s) in parallel: |█|100% (1/1) [Time Taken: 00:04,  4.91s/
Evaluating 1 test case(s) in parallel: |█|100% (1/1) [Time Taken: 00:02,  2.64s/
Evaluating 1 test case(s) in parallel: |█|100% (1/1) [Time Taken: 00:02,  2.25s/
Evaluating 1 test case(s) in parallel: | |  0% (0/1) [Time Taken: 00:02, ?test c
[31mF[0mRunning teardown with pytest sessionfinish[33m...[0m

[31m[1m_________________________ test_chat_model[test_case0] __________________________[0m

test_case = LLMTestCase(input