In [1]:
# !pip install -U deepeval

In [2]:
# `!export` runs in a subshell and does not persist in the notebook; use %env instead
# %env OPENAI_API_KEY=<your api key>

In [3]:
# !deepeval login --confident-api-key <your_api_key>

In [4]:
# Quote the token so the shell does not treat < and > as redirection
# !huggingface-cli login --token "<your_token>"


In [5]:
from peft import AutoPeftModelForCausalLM
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# model_path_or_id = "mistralai/Mistral-7B-v0.1"
model_path_or_id = "meta-llama/Llama-2-13b-chat-hf"
lora_path = None

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

if lora_path:
    # Load the base LLM together with its PEFT (LoRA) adapter
    model = AutoPeftModelForCausalLM.from_pretrained(
        lora_path,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        attn_implementation="flash_attention_2",
        quantization_config=bnb_config
    )
    tokenizer = AutoTokenizer.from_pretrained(lora_path)
else:
    # Load only the 4-bit quantized base model
    model = AutoModelForCausalLM.from_pretrained(
        model_path_or_id,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        attn_implementation="flash_attention_2",
        quantization_config=bnb_config
    )
    tokenizer = AutoTokenizer.from_pretrained(model_path_or_id)


# Alternative: load in half precision without quantization
# model = AutoModelForCausalLM.from_pretrained(
#     model_path_or_id,
#     low_cpu_mem_usage=True,
#     torch_dtype=torch.float16,
#     attn_implementation="flash_attention_2"
# ).cuda()
# tokenizer = AutoTokenizer.from_pretrained(model_path_or_id)


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [6]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from deepeval.models.base_model import DeepEvalBaseLLM

class Llama13B(DeepEvalBaseLLM):
    def __init__(
        self,
        model,
        tokenizer
    ):
        self.model = model
        self.tokenizer = tokenizer

    def load_model(self):
        return self.model

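    def generate(self, prompt: str) -> str:
        model = self.load_model()
        # Tokenize the prompt and move the ids to the GPU
        input_ids = self.tokenizer([prompt], return_tensors="pt").input_ids.cuda()
        # Sample up to 1024 new tokens
        generated_ids = model.generate(input_ids=input_ids, max_new_tokens=1024, do_sample=True)
        # Note: this decodes the full sequence, so the returned string
        # contains the prompt and special tokens such as <s> and </s>
        return self.tokenizer.batch_decode(generated_ids)[0]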

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return "LLama 2 13B"

judger = Llama13B(model=model, tokenizer=tokenizer)
judger.generate("Write me a joke")

"<s> Write me a joke about a chicken\n\nHere's a joke about a chicken:\n\nWhy did the chicken go to the doctor?\n\nBecause she had fowl breath!\n\nI hope you found that joke to be egg-cellent!</s>"
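The string above starts with the prompt and is wrapped in `<s>` and `</s>` because the whole generated sequence is decoded. For a decoder-only model, `model.generate` returns the prompt ids followed by the new ids, so one possible way to keep only the completion is to slice off the prompt length before decoding. The sketch below shows the slicing step on plain lists; the helper name `completion_ids` and the toy ids are hypothetical and not part of the notebook.

```python
from typing import List, Sequence


def completion_ids(generated_ids: Sequence[int], prompt_length: int) -> List[int]:
    """Return only the ids that were generated after the prompt."""
    return list(generated_ids[prompt_length:])


# Toy ids for illustration only (not real Llama token ids)
prompt_ids = [1, 10, 11, 12]
generated = prompt_ids + [20, 21, 22, 2]

new_ids = completion_ids(generated, len(prompt_ids))
print(new_ids)
```

In the `generate` method, the equivalent slice would be `generated_ids[0][input_ids.shape[1]:]`, and passing `skip_special_tokens=True` to the tokenizer's `decode` call would drop the `<s>` and `</s>` markers.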

### Answer Relevancy Metric
The AnswerRelevancyMetric score is calculated according to the following equation:

***Answer Relevancy = Number of Relevant Statements / Total Number of Statements***

The AnswerRelevancyMetric first uses an LLM to extract all statements made in the actual_output, then uses the same LLM to classify whether each statement is relevant to the input.
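As a rough illustration of the equation, the ratio can be computed from a list of per-statement verdicts. This is a minimal sketch, not deepeval's implementation; the function name `answer_relevancy` and the verdict list are hypothetical.

```python
from typing import Sequence


def answer_relevancy(verdicts: Sequence[bool]) -> float:
    """Fraction of extracted statements judged relevant to the input."""
    if not verdicts:
        # Choice made for this sketch: an empty list is treated as an error
        raise ValueError("at least one statement is required")
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for four extracted statements, one of them irrelevant
verdicts = [True, True, True, False]
score = answer_relevancy(verdicts)
print(score)  # 0.75
```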



In [9]:
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

metric = AnswerRelevancyMetric(
    threshold=0.5,
    model=judger,
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

0.75
<s> Given the answer relevancy score, the list of reasons of irrelevant statements made in the actual output, and the input, provide a CONCISE reason for the score. Explain why it is not higher, but also why it is at its current score.
The irrelevant statements represent things in the actual output that is irrelevant to addressing whatever is asked/talked about in the input.
If there is nothing irrelevant, just say something positive with an upbeat encouraging tone (but don't overdo it otherwise it gets annoying).

Answer Relevancy Score:
0.75

Reasons why the score can't be higher based on irrelevant statements in the actual output:
["The 'Shoes.' statement made in the actual output is completely irrelevant to the input, which asks about what to do in the event of an earthquake."]

Input:
What if these shoes don't fit?

Example:
The score is <answer_relevancy_score> because <your_reason>.

Reason:
The irrelevant statement "The 'Shoes.' statement made in the actual output is compl



Metrics Summary

  - ✅ Answer Relevancy (score: 0.75, threshold: 0.5, strict: False, evaluation model: LLama 2 13B, reason: <s> Given the answer relevancy score, the list of reasons of irrelevant statements made in the actual output, and the input, provide a CONCISE reason for the score. Explain why it is not higher, but also why it is at its current score.
The irrelevant statements represent things in the actual output that is irrelevant to addressing whatever is asked/talked about in the input.
If there is nothing irrelevant, just say something positive with an upbeat encouraging tone (but don't overdo it otherwise it gets annoying).

Answer Relevancy Score:
0.75

Reasons why the score can't be higher based on irrelevant statements in the actual output:
["The 'Shoes.' statement made in the actual output is completely irrelevant to the input, which asks about what to do in the event of an earthquake."]

Input:
What if these shoes don't fit?

Example:
The score is <answer_relevancy_s

[TestResult(success=True, metrics=[<deepeval.metrics.answer_relevancy.answer_relevancy.AnswerRelevancyMetric object at 0x7fcf580b9400>], input="What if these shoes don't fit?", actual_output='We offer a 30-day full refund at no extra cost.', expected_output=None, context=None, retrieval_context=None)]

In [12]:
print(metric.score)

0.75


### Hallucination Metric

The HallucinationMetric score is calculated according to the following equation:

***Hallucination = Number of Contradicted Contexts / Total Number of Contexts***
 
The HallucinationMetric uses an LLM to determine, for each context in contexts, whether the actual_output contradicts it. The score ranges from 0 to 1, and lower is better.
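The equation can be sketched the same way, with one flag per context. This is an illustrative sketch rather than deepeval's code: the function name and flags are hypothetical, and the pass condition (score at or below the threshold) is an assumption based on lower scores being better.

```python
from typing import Sequence


def hallucination_score(contradicted: Sequence[bool]) -> float:
    """Fraction of contexts that the actual output contradicts."""
    if not contradicted:
        # Choice made for this sketch: an empty list is treated as an error
        raise ValueError("at least one context is required")
    return sum(contradicted) / len(contradicted)


# Hypothetical flags for two contexts: one contradicted, one not
flags = [True, False]
score = hallucination_score(flags)

# Assumption: lower is better, so the test passes when score <= threshold
threshold = 0.5
passed = score <= threshold
print(score, passed)  # 0.5 True
```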

In [14]:
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual documents that you are passing as input to your LLM.
context = ["A man with blond-hair and a brown shirt drinking out of a public water fountain."]

# Replace this with the actual output from your LLM application
actual_output = "A blond drinking water in public."

test_case = LLMTestCase(
    input="What was the blond doing?",
    actual_output=actual_output,
    context=context
)
metric = HallucinationMetric(
    threshold=0.5,
    model=judger
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

0.5
<s> Given a list of factual alignments and contradictions, which highlights alignment/contradictions between the `actual output` and `contexts, use it to provide a reason for the hallucination score in a CONCISELY. Note that The hallucination score ranges from 0 - 1, and the lower the better.

Factual Alignments:
['The actual output contradicts the provided context which states that Einstein won the Nobel Prize in 1968, not 1969.']

Contradictions:
['The actual output agrees with the provided context which states that Einstein won the Nobel Prize for his discovery of the photoelectric effect.']

Hallucination Score:
0.50

Example:
The score is <hallucination_score> because <your_reason>.

Reason:
The hallucination score is 0.50 because there is a factual alignment between the actual output and the context that states Einstein won the Nobel Prize for his discovery of the photoelectric effect, but a contradiction in the actual output regarding the year he won the prize.</s>


### Contextual Relevancy

The ContextualRelevancyMetric score is calculated according to the following equation:

***Contextual Relevancy = Number of Relevant Statements / Total Number of Statements***
 
The calculation is similar to that of the AnswerRelevancyMetric, except that the ContextualRelevancyMetric uses an LLM to extract all statements made in the retrieval_context (rather than the actual_output), then uses the same LLM to classify whether each statement is relevant to the input.
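Because retrieval_context is a list, the statements from every retrieved chunk are pooled before the ratio is taken. The sketch below assumes that pooling and uses hypothetical names and verdicts; it is not taken from deepeval's source.

```python
from typing import List


def contextual_relevancy(verdicts_per_context: List[List[bool]]) -> float:
    """Fraction of statements, pooled over all retrieved chunks, judged relevant."""
    flat = [v for verdicts in verdicts_per_context for v in verdicts]
    if not flat:
        # Choice made for this sketch: no statements is treated as an error
        raise ValueError("at least one statement is required")
    return sum(flat) / len(flat)


# Hypothetical verdicts: one statement from the first chunk, two from the second
verdicts_per_context = [[True], [False, False]]
score = contextual_relevancy(verdicts_per_context)

# Here higher is better, so a score below the threshold fails
threshold = 0.5
passed = score >= threshold
print(round(score, 2), passed)  # 0.33 False
```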

In [18]:
from deepeval import evaluate
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["All customers are eligible for a 30 day full refund at no extra cost."]

metric = ContextualRelevancyMetric(
    threshold=0.5,
    model=judger,
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output,
    retrieval_context=retrieval_context
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

0.0
<s> Based on the given input, reasons for why the retrieval context is irrelevant to the input, and the contextual relevancy score (the closer to 1 the better), please generate a CONCISE reason for the score.
In your reason, you should quote data provided in the reasons for irrelevancy to support your point.

Contextual Relevancy Score:
0.00

Input:
What if these shoes don't fit?

Reasons for why the retrieval context is irrelevant to the input:
[None]

Example:
The score is <contextual_relevancy_score> because <your_reason>.

** 
IMPORTANT:
If the score is 1, keep it short and say something positive with an upbeat encouraging tone (but don't overdo it otherwise it gets annoying).
**

Reason:
The contextual relevancy score is 0.00 because there are no reasons provided for why the retrieval context is irrelevant to the input.</s>
Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ❌ Contextual Relevancy (score: 0.0, threshold: 0.5, strict: False, evaluation model: LLama 2 13B, reason: <s> Based on the given input, reasons for why the retrieval context is irrelevant to the input, and the contextual relevancy score (the closer to 1 the better), please generate a CONCISE reason for the score.
In your reason, you should quote data provided in the reasons for irrelevancy to support your point.

Contextual Relevancy Score:
0.00

Input:
What if these shoes don't fit?

Reasons for why the retrieval context is irrelevant to the input:
[None]

Example:
The score is <contextual_relevancy_score> because <your_reason>.

** 
IMPORTANT:
If the score is 1, keep it short and say something positive with an upbeat encouraging tone (but don't overdo it otherwise it gets annoying).
**

Reason:
The contextual relevancy score is 0.00 because there are no reasons provided for why the retrieval context is irrelevant to the input.</s>)

For test case:

  - input: W

[TestResult(success=False, metrics=[<deepeval.metrics.contextual_relevancy.contextual_relevancy.ContextualRelevancyMetric object at 0x7fcf2ba164c0>], input="What if these shoes don't fit?", actual_output='We offer a 30-day full refund at no extra cost.', expected_output=None, context=None, retrieval_context=['All customers are eligible for a 30 day full refund at no extra cost.'])]