<a href="https://colab.research.google.com/github/DJCordhose/practical-llm/blob/main/GEval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# G-Eval

* https://arxiv.org/abs/2303.16634
* https://docs.confident-ai.com/docs/metrics-llm-evals
* https://docs.confident-ai.com/docs/guides-using-custom-llms
* https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct

In [1]:
!nvidia-smi

Fri Aug 23 21:06:08 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   75C    P0              33W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [2]:
%%time

!pip install -q deepeval

CPU times: user 39.1 ms, sys: 1.98 ms, total: 41.1 ms
Wall time: 6.06 s


In [3]:
%%time

!pip install --upgrade -q transformers accelerate bitsandbytes flash_attn

CPU times: user 34.1 ms, sys: 5.92 ms, total: 40 ms
Wall time: 4.84 s


In [4]:
!pip install lm-format-enforcer



In [5]:
from google.colab import userdata

import os
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [6]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
  )

In [7]:
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="The dog chased the cat up the tree, who ran up the tree?",
    actual_output="It depends, some might consider the cat, while others might argue the dog.",
    expected_output="The cat."
)

correctness_metric.measure(test_case)
print(correctness_metric.score)
print(correctness_metric.reason)

Output()

0.15751306658782038
The actual output introduces ambiguity by suggesting different interpretations, whereas the expected output clearly states the cat ran up the tree. It omits the clear fact that the cat ran up the tree.


In [8]:
from google.colab import userdata

In [9]:
!huggingface-cli login --token {userdata.get('HF_TOKEN')}

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [10]:
import warnings
warnings.filterwarnings("ignore")

In [11]:
import transformers
import torch
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model_name = "microsoft/Phi-3.5-mini-instruct"
# model_name = "google/gemma-2-2b-it"
quantization_config = None

model = AutoModelForCausalLM.from_pretrained(
  model_name,
  device_map="cuda",
  torch_dtype=torch.bfloat16,
  quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [12]:
!nvidia-smi

Fri Aug 23 21:07:16 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P0              32W /  70W |   7393MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [13]:
import json
from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import (
    build_transformers_prefix_allowed_tokens_fn,
)

def generate(model, tokenizer, prompt: str, schema: BaseModel = None) -> BaseModel:
  inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
  if schema:
    parser = JsonSchemaParser(schema.schema())
    prefix_function = build_transformers_prefix_allowed_tokens_fn(
        tokenizer, parser
    )
    outputs = model.generate(
      **inputs,
      max_new_tokens=200,
      prefix_allowed_tokens_fn=prefix_function,
    )
    output_dict = tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):]
    # print(f"Generated JSON: {output_dict}", flush=True)
    json_result = json.loads(output_dict)
    return schema(**json_result)
  else:
    outputs = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [14]:
# generate(model, tokenizer, "Tell a joke")

In [15]:
from deepeval.models import DeepEvalBaseLLM

class CustomLlama3_1_8B(DeepEvalBaseLLM):
    def __init__(self, model, tokenizer):
      self.model = model
      self.tokenizer = tokenizer

    def load_model(self):
        return self.model

    def generate(self, prompt: str, schema: BaseModel=None) -> BaseModel:
      # print(f"Prompt: {prompt}", flush=True)
      # if schema: print(f"Schema: {schema.schema()}", flush=True)
      return generate(self.model, self.tokenizer, prompt, schema)

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        return self.generate(prompt, schema)

    def get_model_name(self):
        return "Custom Model"

custom_llm = CustomLlama3_1_8B(model, tokenizer)

In [16]:
# print(custom_llm.generate("Write me a joke"))

In [17]:
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    model = custom_llm,
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
  )

In [18]:
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="The dog chased the cat up the tree, who ran up the tree?",
    actual_output="It depends, some might consider the cat, while others might argue the dog.",
    expected_output="The cat."
)

correctness_metric.measure(test_case)
print(correctness_metric.score)
print(correctness_metric.reason)

Output()

You are not running the flash-attention implementation, expect numerical differences.


0.5
The actual output acknowledges the ambiguity in the question, which aligns with the evaluation steps. However, it lacks specific detail about the scenario, which is penalized according to the steps.


In [19]:
from deepeval.metrics import AnswerRelevancyMetric

relevancy_metric = AnswerRelevancyMetric(model=custom_llm)

In [20]:
test_case = LLMTestCase(
    input="The dog chased the cat up the tree, who ran up the tree?",
    actual_output="It depends, some might consider the cat, while others might argue the dog."
)

relevancy_metric.measure(test_case)
print(relevancy_metric.score)
print(relevancy_metric.reason)

Output()

1.0
The score is 1.00 because the actual output directly answers the question asked in the input without any irrelevant statements. The question is about which animal ran up the tree, and the output clearly states that it was the cat.


In [21]:
weird_anwser_test_case = LLMTestCase(
    input="The dog chased the cat up the tree, who ran up the tree?",
    actual_output="The morning knows the answer",
)
relevancy_metric.measure(weird_anwser_test_case)
print(relevancy_metric.score)
print(relevancy_metric.reason)

Output()

0.0
The score is 0.00 because the actual output does not answer the question 'who ran up the tree?' and instead includes a metaphorical statement 'The morning knows the answer' which is irrelevant to the input question.
