### Required packages
- bert-score
- evaluate
    - rouge-score
- deepeval (its docs say it is integrated with Ollama, but I wasn't able to make it work)

## Deepeval
I am plugging a custom LLM into deepeval for evaluation: llama3.2:latest served by Ollama. If you are doing the same, run the following command in your CLI.

- `deepeval set-ollama llama3.2:latest`

Once you are done, you can reset the configuration with the following command.
- `deepeval unset-ollama`

In [None]:
predictions = ["You have 30 days to get a full refund at no extra cost."]
references = ["We offer a 30-day full refund at no extra costs."]

## BERTScore

Paper - https://arxiv.org/pdf/1904.09675

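High-level definition
- Extract a contextual embedding for each token in the LLM answer (the candidate) and in the reference answer
- Compute the cosine similarity between every candidate token and every reference token
- Precision: for each token in the LLM answer, take its maximum similarity to any reference token, then average over the answer tokens
- Recall: for each token in the reference answer, take its maximum similarity to any answer token, then average over the reference tokens
- F1 is the harmonic mean of precision and recall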
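
The greedy matching described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the `bert_score` implementation: the 2-dimensional vectors are made-up stand-ins for real contextual embeddings, `greedy_match` is a hypothetical helper name, and the optional idf weighting and baseline rescaling from the paper are left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match(cand_emb, ref_emb):
    """Return (precision, recall, f1) from token-level greedy matching."""
    # sim[i][j] = similarity of candidate token i to reference token j
    sim = [[cosine(c, r) for r in ref_emb] for c in cand_emb]
    # Precision: best reference match for each candidate token, averaged
    precision = sum(max(row) for row in sim) / len(cand_emb)
    # Recall: best candidate match for each reference token, averaged
    recall = sum(
        max(sim[i][j] for i in range(len(cand_emb)))
        for j in range(len(ref_emb))
    ) / len(ref_emb)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy embeddings (assumed values): 3 candidate tokens, 2 reference tokens
cand_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref_emb = [[1.0, 0.0], [0.0, 1.0]]

p, r, f = greedy_match(cand_emb, ref_emb)
print(f"Precision: {p:.4f}, Recall: {r:.4f}, F1: {f:.4f}")
```

Every reference token here has an exact match among the candidate tokens, so recall is 1.0, while the third candidate token has no exact counterpart and pulls precision below 1.0.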


In [None]:
from bert_score import score
P, R, F1 = score(predictions, references, lang='en', verbose=True)
print(f"Precision: {P.mean():.4f}, Recall: {R.mean():.4f}, F1: {F1.mean():.4f}")

  from .autonotebook import tqdm as notebook_tqdm
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.


100%|██████████| 1/1 [00:01<00:00,  1.53s/it]


computing greedy matching.


100%|██████████| 1/1 [00:00<00:00, 118.54it/s]

done in 1.55 seconds, 0.65 sentences/sec
Precision: 0.9305, Recall: 0.9501, F1: 0.9402





## ROUGE score
Recall-Oriented Understudy for Gisting Evaluation

Paper - https://aclanthology.org/W04-1013.pdf

- For unigrams (ROUGE-1), recall = number of matching words / number of words in the reference sentence
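
As a rough check of the unigram definition above, here is a hand-rolled ROUGE-1 sketch. It assumes a simple tokenizer (lowercase, split on anything that is not a letter or digit, no stemming), and `tokenize` and `rouge1` are hypothetical helper names rather than part of `evaluate` or `rouge-score`. It returns precision and F1 alongside recall; for this sentence pair the F1 works out to 7/12, roughly 0.5833, which is the same value as the `rouge1` entry printed by the `evaluate` cell below.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters/digits (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rouge1(prediction, reference):
    """Return (precision, recall, f1) based on unigram overlap."""
    pred_counts = Counter(tokenize(prediction))
    ref_counts = Counter(tokenize(reference))
    # Clipped overlap: each word counts at most as often as it appears in both
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

prediction = "You have 30 days to get a full refund at no extra cost."
reference = "We offer a 30-day full refund at no extra costs."

precision, recall, f1 = rouge1(prediction, reference)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")
```

Seven words overlap ("30", "a", "full", "refund", "at", "no", "extra") out of 13 prediction tokens and 11 reference tokens, so precision is 7/13 and recall is 7/11.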


In [16]:
from evaluate import load
rouge = load('rouge')
scores = rouge.compute(predictions=predictions, references=references)
print(scores)

{'rouge1': np.float64(0.5833333333333334), 'rouge2': np.float64(0.3636363636363636), 'rougeL': np.float64(0.4999999999999999), 'rougeLsum': np.float64(0.4999999999999999)}


------

## Writing custom evaluation from local ollama model (llama 3.2 latest)

## How Is G-Eval Calculated?

G-Eval is a two-step algorithm that uses chain of thought (CoT) for better evaluation. In deepeval, the first step generates a series of `evaluation_steps` from the given criteria using CoT, and the second step uses those steps to determine the final score from the parameters presented in an `LLMTestCase`.

When you provide `evaluation_steps` yourself, the `GEval` metric skips the first step and uses the provided steps to determine the final score, which makes it more reliable across different runs. If you don't have clear `evaluation_steps`, what we've found useful is to first write a criteria, which can be extremely short, and then use the `evaluation_steps` generated by `GEval` for subsequent evaluation and fine-tuning of the criteria.
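
A minimal sketch of passing fixed steps, assuming `GEval` accepts an `evaluation_steps` list in place of `criteria` as described above. The wording of the steps is hypothetical and written only for illustration, and `build_correctness_metric` is a made-up helper name. Constructing the metric also requires an evaluation model to be configured (for example via the `deepeval set-ollama` command shown earlier).

```python
# Hypothetical evaluation steps, written for illustration only.
EVALUATION_STEPS = [
    "Check whether any fact in 'actual output' contradicts 'expected output'.",
    "Penalize the omission of details that appear in 'expected output'.",
    "Accept differences in wording or tone when the meaning is the same.",
]

def build_correctness_metric():
    """Build a GEval metric that uses fixed steps instead of generating them."""
    # Imported lazily so the step list can be inspected without deepeval installed.
    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCaseParams

    return GEval(
        name="Correctness",
        evaluation_steps=EVALUATION_STEPS,  # skips the CoT step-generation phase
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        threshold=0.5,
    )
```

Keeping the steps in a plain list makes them easy to version and review separately from the test code.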

In [16]:
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_case():
    # G-Eval metric that compares the actual output against the expected output
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    # Named `case` so the local variable does not shadow the function name
    case = LLMTestCase(
        input="What if these shoes don't fit?",
        # Replace this with the actual output from your LLM application
        actual_output="You have 30 days to get a full refund at no extra cost.",
        expected_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."]
    )
    # Raises an AssertionError if the metric score falls below the threshold
    assert_test(case, [correctness_metric])

In [20]:
# test_case()