<a href="https://colab.research.google.com/github/DJCordhose/practical-llm/blob/main/Eval4pptx.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hands on: Eval - small LLM as a judge

Goals
* see how LLM-as-a-judge works in principle
* get an introduction to the G-Eval algorithm ([G-Eval on arXiv](https://arxiv.org/abs/2303.16634))
* see how the algorithm uses prompts to generate the actual eval prompt
* try out the [DeepEval library](https://docs.confident-ai.com/docs/guides-using-custom-llms)
* see the limitations of small LLMs as a judge
* optional: compare to GPT-4o

# Setup: create an *llm_run* function using a small LLM

* load & quantize a small model from Hugging Face
* define a simple **llm_run** function that calls the loaded model
* try out llm_run

=> same setup as in the Assessment notebook

In [None]:
!nvidia-smi

Tue Aug 27 19:54:42 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   78C    P0              32W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### **Important:**
Ensure that no GPU memory is allocated yet (in the case of a T4 look for "0MiB / 15360MiB").
If GPU memory is already allocated use Runtime/Manage Sessions to delete all active sessions.
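
The same check can be done programmatically. The following is a minimal sketch, not part of the original setup: `parse_gpu_memory` is a hypothetical helper that reads the "used / total" column from the table that `nvidia-smi` prints, here applied to the T4 row shown above.

```python
import re

def parse_gpu_memory(nvidia_smi_text):
    """Return (used_mib, total_mib) for the first GPU row in nvidia-smi's table output."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", nvidia_smi_text)
    if match is None:
        raise ValueError("no 'used / total' memory column found")
    return int(match.group(1)), int(match.group(2))

# Sample row copied from the T4 output above
sample = "| N/A   78C    P0              32W /  70W |      0MiB / 15360MiB |      0%      Default |"
used, total = parse_gpu_memory(sample)
status = "free" if used == 0 else "already allocated"
print(f"{used} MiB of {total} MiB in use; GPU memory is {status}")
```

In Colab, the text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of the hard-coded sample.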

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
%%time

!pip install --upgrade -q transformers accelerate flash_attn torch bitsandbytes
!pip install lm-format-enforcer -q
!pip install deepeval==1.1.1 -q

CPU times: user 111 ms, sys: 17.8 ms, total: 129 ms
Wall time: 19 s


#### => you may need to restart the session

In [None]:
from google.colab import userdata

# Configure HuggingFace token as a Colab Secret, use key symbol on the left panel
!huggingface-cli login --token {userdata.get('HF_TOKEN')}

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Load & quantize Model (yielding: model_id, model, tokenizer)

In [None]:
# kind = 'Lllama_3.1_8B_4bit'
kind = 'Lllama_3.1_8B_8bit'
# kind = 'Lllama_3.1_8B_16bit'  # too large for T4
# kind = 'Phi-3.5-MoE_4bit'     # No module named 'triton' ???
# kind = "Phi-3.5-mini_16bit"   # not "strong" enough

if "Lllama_3.1_8B" in kind:
  model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
elif "Phi-3.5-MoE" in kind:
  model_id = "microsoft/Phi-3.5-MoE-instruct"
else:
  model_id = "microsoft/Phi-3.5-mini-instruct"

print(kind)
print(model_id)

Lllama_3.1_8B_8bit
meta-llama/Meta-Llama-3.1-8B-Instruct


***Note:*** run `watch -n 0.5 nvidia-smi` in a terminal to watch the GPU usage and to see when the model is loaded onto the GPU.

In [None]:
%%time

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

torch_dtype = None
quantization_config = None

if "8bit" in kind:
  print("Using 8Bit quantization")
  quantization_config = BitsAndBytesConfig(load_in_8bit=True)
elif "4bit" in kind:
  print("Using 4Bit quantization")
  quantization_config = BitsAndBytesConfig(load_in_4bit=True)
else:
  print("Using Full Resolution")
  torch_dtype = torch.bfloat16

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
    device_map="cuda",
    trust_remote_code=True
)

Using 8Bit quantization


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

CPU times: user 20.2 s, sys: 16.3 s, total: 36.5 s
Wall time: 1min 39s


In [None]:
!nvidia-smi

Tue Aug 27 19:56:43 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   77C    P0              32W /  70W |   8825MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
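
A rough back-of-the-envelope calculation helps to interpret the memory usage above and the "too large for T4" remark. This is only a sketch: the parameter count of about 8.03 billion is an assumption, and the estimate covers the weights alone, not activations, CUDA overhead, or layers that stay unquantized.

```python
def estimate_weight_mib(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in MiB."""
    return n_params * bits_per_param / 8 / 2**20

N_PARAMS = 8.03e9      # assumption: approximate parameter count of Llama 3.1 8B
T4_TOTAL_MIB = 15360   # total memory reported by nvidia-smi above

for bits in (4, 8, 16):
    mib = estimate_weight_mib(N_PARAMS, bits)
    print(f"{bits:>2}-bit: ~{mib:,.0f} MiB ({mib / T4_TOTAL_MIB:.0%} of the T4)")
```

The 8-bit estimate lands somewhat below the 8825MiB observed, which is plausible given the overhead that the estimate ignores. The 16-bit estimate takes up nearly all of the T4's memory, leaving no room for inference.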
                                                                    

In [None]:
from transformers import AutoTokenizer

# Load the tokenizer once instead of on every call
tokenizer = AutoTokenizer.from_pretrained(model_id)

def llm_run(messages):
  # Accept a plain string as shorthand for a single user message
  if isinstance(messages, str):
    messages = [{"role": "user", "content": messages}]
  # Stop at the regular EOS token or at Llama 3's end-of-turn token
  terminators = [
      tokenizer.eos_token_id,
      tokenizer.convert_tokens_to_ids("<|eot_id|>")
  ]

  input_token_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
  ).to(model.device)

  outputs = model.generate(
      input_token_ids,
      max_new_tokens=512,
      eos_token_id=terminators,
      pad_token_id=tokenizer.eos_token_id,
      do_sample=False  # greedy decoding (no sampling)
  )
  # Drop the prompt tokens and decode only the newly generated part
  output_token_ids = outputs[0][input_token_ids.shape[-1]:]
  result = tokenizer.decode(output_token_ids, skip_special_tokens=True)
  return result

Try out our model:

In [None]:
%%time
print(model_id)
llm_run("who are you ?")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


CPU times: user 9.01 s, sys: 729 ms, total: 9.74 s
Wall time: 16 s


'I\'m an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."'

In [None]:
from IPython.display import Markdown

messages = [
    {"role": "system", "content": "You are an English-speaking, competent expert in the field of sanitary piping systems."},
    {"role": "user", "content": "What are waste-water pipes made out of?"}
  ]

#answer = llm_run(messages)
#Markdown(answer)

CPU times: user 9 µs, sys: 0 ns, total: 9 µs
Wall time: 11.9 µs


# LLM-as-a-judge: in principle

In [None]:
llm_output="Witing texts is painful, caus im making mitakes."

In [None]:
simple_eval_prompt = f'''
You are an expert on the English language, grading a student's text with a score between 0 and 10.
A text written in proper English, in a fluent style, containing no grammatical or syntax errors is graded 10.
A text written in a different language or with spelling errors gets a low score.
Also give a detailed explanation of why the score was chosen.
Do not repeat the student's text in your explanation.

Always answer in the following json format:
{{
    "score": 8,
    "reason": "some reason"
}}

Examples
1. Student Text: Pipes are cylindrical conduits used to transport fluids or gases, typically made of materials like metal, plastic, or concrete.
   Answer:
   {{
    "score": 8,
    "reason": "The text is written in English and does not contain any syntactical or grammatical errors"
   }}
2. Student Text: Zwischen Neonlichtern und Straßenlärm träum ich leise von Freiheit.
   Answer:
   {{
    "score": 2,
    "reason": "The text is written in German and not in English."
   }}

Student Text: {llm_output}
Answer:
'''

In [None]:
%%time
import json

answer=llm_run(simple_eval_prompt)

json.loads(answer)

CPU times: user 15.4 s, sys: 84.8 ms, total: 15.4 s
Wall time: 15.6 s


{'score': 1,
 'reason': "The text contains multiple spelling errors, such as 'Witing' instead of 'Writing', 'caus' instead of 'because', and'mitakes' instead of'mistakes'. Additionally, the text is written in a non-standard dialect of English, which affects its clarity and coherence."}
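
Calling `json.loads` directly works here because the model returned nothing but JSON. Small models often wrap the JSON in extra prose, which makes `json.loads` fail. The following is a minimal sketch of a more tolerant parser; `extract_json` and the sample answer are hypothetical and not part of the notebook or of DeepEval.

```python
import json

def extract_json(text):
    """Parse the first JSON object found in text, ignoring any prose around it."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at the given index
            obj, _ = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model answer")

# Hypothetical answer with chatter before and after the JSON
noisy_answer = 'Here is my grading:\n{"score": 1, "reason": "Several spelling errors."}\nHope this helps!'
print(extract_json(noisy_answer))
```

With such a helper, `extract_json(answer)` could replace `json.loads(answer)` in the cell above.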

# LLM-as-a-judge: G-Eval in principle

### G-Eval Step 1

In [None]:
input="What is a pipe ?"
actual_output = "Pipes are beautiful, black and round. Because they are round they are very convenient and don't have any edges."

In [None]:
%%time
criteria="Determine how concise the actual output is"

geval_step1_prompt = f'''Given an evaluation criteria which outlines how you should judge the Actual Output, generate
3-4 concise evaluation steps based on the criteria below. You MUST make it clear how to evaluate Actual Output in
relation to one another.

Evaluation Criteria:
{criteria}

**
IMPORTANT: Please make sure to only return in JSON format, with the "steps" key as a list of strings. No words or
explanation is needed.
Example JSON:
{{
    "steps": <list_of_strings>
}}
**

JSON:
'''

answer = llm_run(geval_step1_prompt)
json_answer=json.loads(answer)
json_answer


CPU times: user 17.2 s, sys: 35.2 ms, total: 17.2 s
Wall time: 17.6 s


{'steps': ['Compare the length of the actual output to the length of the expected output.',
  'Evaluate the actual output for any unnecessary words, phrases, or sentences.',
  'Assess the actual output for clarity and concision in relation to the expected output.',
  'Determine if the actual output is more concise than the expected output.']}

### G-Eval Step 2

In [None]:
steps="\n".join(f"{index+1}. {step}" for index, step in enumerate(json_answer['steps']))
print(steps)

1. Compare the length of the actual output to the length of the expected output.
2. Evaluate the actual output for any unnecessary words, phrases, or sentences.
3. Assess the actual output for clarity and concision in relation to the expected output.
4. Determine if the actual output is more concise than the expected output.


In [None]:
geval_step2_prompt = f'''
Given the evaluation steps, return a JSON with two keys:
1) a `score` key ranging from 0 - 10, with 10 being that it follows the criteria outlined in the steps and 0 being that it does not, and
2) a `reason` key, a reason for the given score, but DO NOT QUOTE THE SCORE in your reason.
Please mention specific information from Actual Output and Input in your reason, but be very concise with it!

Evaluation Steps:
{steps}

Actual Output:
{actual_output}

Input:
{input}



**
IMPORTANT: Please make sure to only return in JSON format, with the "score" and "reason" key. No words or explanation is needed.

Example JSON:
{{
    "score": 0,
    "reason": "The text does not follow the evaluation steps provided."
}}
**

JSON:

'''
print(geval_step2_prompt)


Given the evaluation steps, return a JSON with two keys: 
1) a `score` key ranging from 0 - 10, with 10 being that it follows the criteria outlined in the steps and 0 being that it does not, and 
2) a `reason` key, a reason for the given score, but DO NOT QUOTE THE SCORE in your reason. 
Please mention specific information from Actual Output and Input in your reason, but be very concise with it!

Evaluation Steps:
1. Compare the length of the actual output to the length of the expected output.
2. Evaluate the actual output for any unnecessary words, phrases, or sentences.
3. Assess the actual output for clarity and concision in relation to the expected output.
4. Determine if the actual output is more concise than the expected output.

Actual Output:
Pipes are beautiful, black and round. Because they are round they are very convenient and don't have any edges.

Input:
What is a pipe ?



**
IMPORTANT: Please make sure to only return in JSON format, with the "score" and "reason" key. N

In [None]:
answer=llm_run(geval_step2_prompt)
json.loads(answer)

{'score': 0,
 'reason': 'The actual output is more verbose than the expected output, contains unnecessary words and phrases, and lacks clarity and concision.'}
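
The two steps above return a single integer score because `llm_run` decodes greedily. The G-Eval paper additionally proposes weighting each candidate score by the probability the judge assigns to it, which yields finer-grained, fractional scores. The following is a minimal sketch of that weighting, assuming the log-probabilities of the score tokens are available; the numbers are made up for illustration, and `llm_run` does not expose them.

```python
import math

def weighted_score(score_logprobs):
    """Expected score: sum of score * probability, renormalised over the candidate scores."""
    probs = {score: math.exp(logprob) for score, logprob in score_logprobs.items()}
    total = sum(probs.values())
    return sum(score * p for score, p in probs.items()) / total

# Hypothetical log-probabilities of the judge emitting the score tokens 0, 1 and 2
example = {0: math.log(0.2), 1: math.log(0.5), 2: math.log(0.3)}
print(round(weighted_score(example), 3))
```

Here the judge's most likely score is 1, but the weighted score also reflects how much probability mass sits on the neighbouring scores.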

# G-Eval Implementation by DeepEval

see: https://docs.confident-ai.com/docs/guides-using-custom-llms


In [None]:
from deepeval.models import DeepEvalBaseLLM

log_output=""
llm_log = True

def log(log_message):
    global log_output

    if llm_log:
        log_output += log_message + "\n"

# wrapper calling llm_run using the global members model_id, model
class CustomDeepEvalLlm(DeepEvalBaseLLM):
    def __init__(self):
        super().__init__()
        self.generate_count = 0

    def load_model(self):
        return model

    def generate(self, prompt: str) -> str:
        self.generate_count += 1
        count = self.generate_count
        log(f'[{count}] ********************** deepEval LLM Generate BEGIN ********************************************')
        log(f'[{count}] ***** Prompt         : ' + prompt)
        result = llm_run(prompt)
        log(f'[{count}] ***** Answer         : ' + result)
        log(f'[{count}] ********************** deepEval LLM Generate END   ')
        log(f'[{count}] ')
        return result

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return model_id

deepeval_custom_model = CustomDeepEvalLlm()

Try out the wrapper:

In [None]:
deepeval_custom_model.generate('who are you ?')

'I\'m an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."'

In [None]:
print(log_output)

[1] ********************** deepEval LLM Generate BEGIN ********************************************
[1] ***** Prompt         : who are you ?
[1] ***** Answer         : I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."
[1] ********************** deepEval LLM Generate END   
[1] 



In [None]:
import deepeval
import deepeval.metrics
import deepeval.test_case
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

def geval_run(name, criteria, question, answer):
    test_case = deepeval.test_case.LLMTestCase(
        input=question,
        actual_output=answer
      )

    metric = GEval(
        name=name,
        criteria=criteria,
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.INPUT],
        model=deepeval_custom_model
    )

    eval_result = deepeval.evaluate(
        test_cases=[test_case],
        metrics=[metric]
    )
    return eval_result

In [None]:
input="What is a pipe ?"
actual_output_concise="A pipe is a tubular conduit used to transport fluids or sometimes solids. Pipes are typically made of materials like metal, plastic, or concrete."
actual_output_inconcise="Pipes are beautiful, black and round. Because they are round they are very convenient and don't have any edges."

In [None]:
%%time
log_output=""
r = geval_run("Conciseness",
              "Determine how concise the actual output is.",
              input,
              actual_output_inconcise )


Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ❌ Conciseness (GEval) (score: 0.0, threshold: 0.5, strict: False, evaluation model: meta-llama/Meta-Llama-3.1-8B-Instruct, reason: The actual output is not concise, contains unnecessary information and redundant details, and is not relevant to the input., error: None)

For test case:

  - input: What is a pipe ?
  - actual output: Pipes are beautiful, black and round. Because they are round they are very convenient and don't have any edges.
  - expected output: None
  - context: None
  - retrieval context: None


Overall Metric Pass Rates

Conciseness (GEval): 0.00% pass rate




CPU times: user 29.2 s, sys: 311 ms, total: 29.5 s
Wall time: 30.6 s


In [None]:
def print_metrics_data(deep_eval_result):
    for testcase in deep_eval_result:
      print("input        :", testcase.input)
      print("actual_output:", testcase.actual_output)
      for metric in testcase.metrics_data:
        print("name         :",metric.name)
        print("score        :",metric.score)
        print("reason       :",metric.reason)
        print("model        :",metric.evaluation_model)
        print()
      print("-----------")

In [None]:
print_metrics_data(r)
r

input        : What is a pipe ?
actual_output: Pipes are beautiful, black and round. Because they are round they are very convenient and don't have any edges.
name         : Conciseness (GEval)
score        : 0.0
reason       : The actual output is not concise, contains unnecessary information and redundant details, and is not relevant to the input.
model        : meta-llama/Meta-Llama-3.1-8B-Instruct

-----------


[TestResult(success=False, metrics_data=[MetricData(name='Conciseness (GEval)', threshold=0.5, success=False, score=0.0, reason='The actual output is not concise, contains unnecessary information and redundant details, and is not relevant to the input.', strict_mode=False, evaluation_model='meta-llama/Meta-Llama-3.1-8B-Instruct', error=None, evaluation_cost=None, verbose_logs='Criteria:\nDetermine how concise the actual output is. \n \nEvaluation Steps:\n[\n    "Compare the actual output to the expected output to determine if it is concise.",\n    "Evaluate the actual output\'s length and content to ensure it is brief and to the point.",\n    "Assess the actual output\'s relevance to the input and the task at hand.",\n    "Determine if the actual output is free from unnecessary information and redundant details."\n]')], input='What is a pipe ?', actual_output="Pipes are beautiful, black and round. Because they are round they are very convenient and don't have any edges.", expected_outp

In [None]:
print(log_output)

[2] ********************** deepEval LLM Generate BEGIN ********************************************
[2] ***** Prompt         : Given an evaluation criteria which outlines how you should judge the Actual Output and Input, generate 3-4 concise evaluation steps based on the criteria below. You MUST make it clear how to evaluate Actual Output and Input in relation to one another.

Evaluation Criteria:
Determine how concise the actual output is.

**
IMPORTANT: Please make sure to only return in JSON format, with the "steps" key as a list of strings. No words or explanation is needed.
Example JSON:
{
    "steps": <list_of_strings>
}
**

JSON:

[2] ***** Answer         : {
  "steps": [
    "Compare the actual output to the expected output to determine if it is concise.",
    "Evaluate the actual output's length and content to ensure it is brief and to the point.",
    "Assess the actual output's relevance to the input and the task at hand.",
    "Determine if the actual output is free from un

In [None]:
%%time
log_output=""
r = geval_run("Grammar", "Determine the english syntax and grammar of the actual output. Do not rely on the input or on the expected output.", input, actual_output_inconcise )




Metrics Summary

  - ❌ Grammar (GEval) (score: 0.0, threshold: 0.5, strict: False, evaluation model: meta-llama/Meta-Llama-3.1-8B-Instruct, reason: The Actual Output does not follow standard English syntax, as it uses a comma after 'beautiful' instead of a period, and the sentence structure is not clear., error: None)

For test case:

  - input: What is a pipe ?
  - actual output: Pipes are beautiful, black and round. Because they are round they are very convenient and don't have any edges.
  - expected output: None
  - context: None
  - retrieval context: None


Overall Metric Pass Rates

Grammar (GEval): 0.00% pass rate




CPU times: user 34 s, sys: 395 ms, total: 34.4 s
Wall time: 36 s


In [None]:
print(log_output)

[4] ********************** deepEval LLM Generate BEGIN ********************************************
[4] ***** Prompt         : Given an evaluation criteria which outlines how you should judge the Actual Output and Input, generate 3-4 concise evaluation steps based on the criteria below. You MUST make it clear how to evaluate Actual Output and Input in relation to one another.

Evaluation Criteria:
Determine the english syntax and grammar of the actual output. Do not rely on the input or on the expected output.

**
IMPORTANT: Please make sure to only return in JSON format, with the "steps" key as a list of strings. No words or explanation is needed.
Example JSON:
{
    "steps": <list_of_strings>
}
**

JSON:

[4] ***** Answer         : {
  "steps": [
    "Compare the Actual Output to the standard rules of English syntax and grammar.",
    "Evaluate the Actual Output independently, without considering the Input or Expected Output.",
    "Assess the Actual Output for grammatical correctnes

# DeepEval: AnswerRelevancy, Toxicity, ...

Check out some other metrics: [https://docs.confident-ai.com/docs/metrics-introduction](https://docs.confident-ai.com/docs/metrics-introduction)

In [None]:
deepeval_custom_model = CustomDeepEvalLlm()

In [None]:
from deepeval.metrics import AnswerRelevancyMetric, ToxicityMetric

def metrics_run(question, answer):
    test_case = deepeval.test_case.LLMTestCase(
        input=question,
        actual_output=answer
      )

    conciseness_metric = GEval(
        name="Conciseness",
        criteria="Determine how concise the actual output is",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.INPUT],
        model=deepeval_custom_model
    )

    metrics = [
        conciseness_metric,
        AnswerRelevancyMetric(model=deepeval_custom_model),
       # ToxicityMetric(model=deepeval_custom_model) # Lllama_3.1_8B_8bit not "strong" enough
    ]

    eval_result = deepeval.evaluate(
        test_cases=[test_case],
        metrics=metrics,
    )
    return eval_result
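
As far as the DeepEval documentation describes it, `AnswerRelevancyMetric` lets the judge split the actual output into statements, classifies each statement as relevant or irrelevant to the input, and reports the share of relevant statements. The following sketch shows only that final ratio; the verdict list is hypothetical, and the handling of an empty list is an assumption rather than DeepEval's documented behaviour.

```python
def answer_relevancy(verdicts):
    """Fraction of statements judged relevant; verdicts is a list of booleans."""
    if not verdicts:
        return 1.0  # assumption: nothing was said, so nothing is irrelevant
    return sum(verdicts) / len(verdicts)

# Hypothetical judge verdicts for four statements extracted from an answer
verdicts = [True, True, False, True]
print(answer_relevancy(verdicts))
```

The quality of this score depends entirely on how well the judge extracts and classifies the statements, which is where small models tend to struggle.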

In [None]:
log_output=""
metrics_run(input, actual_output_inconcise)


Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...


In [None]:
print(log_output)


[6] ********************** deepEval LLM Generate BEGIN ********************************************
[6] ***** Prompt         : Given an evaluation criteria which outlines how you should judge the Actual Output and Input, generate 3-4 concise evaluation steps based on the criteria below. You MUST make it clear how to evaluate Actual Output and Input in relation to one another.

Evaluation Criteria:
Determine how concise the actual output is

**
IMPORTANT: Please make sure to only return in JSON format, with the "steps" key as a list of strings. No words or explanation is needed.
Example JSON:
{
    "steps": <list_of_strings>
}
**

JSON:

[6] ***** Answer         : {
  "steps": [
    "Compare the actual output to the expected output to determine if it is concise.",
    "Evaluate the actual output for unnecessary information, such as extra words or details.",
    "Assess the actual output in relation to the actual input to ensure it is a direct and concise response.",
    "Check if the ac

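## Switch from "local" Llama to OpenAI GPT-4o

To use OpenAI you need an API key. DeepEval defaults to OpenAI if no model is set, so setting `deepeval_custom_model` to `None` makes `metrics_run` fall back to it. The Toxicity result in the summary below only appears if the `ToxicityMetric` line in `metrics_run` is re-enabled.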

In [None]:
import os

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
deepeval_custom_model = None

In [None]:
log_output=""
r=metrics_run(input, actual_output_inconcise)


Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ❌ Conciseness (GEval) (score: 0.1539260661173085, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The Actual Output does not address the question in the Input and adds irrelevant details about the characteristics of pipes., error: None)
  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because the answer fully addresses the question without any irrelevant statements. Great job!, error: None)
  - ✅ Toxicity (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 0.00 because the actual output is entirely non-toxic and contains no harmful language., error: None)

For test case:

  - input: What is a pipe ?
  - actual output: Pipes are beautiful, black and round. Because they are round they are very convenient and don't have any edges.
  - expected output: None
  - context: None
  - retrieval context: None


Overall Metric Pass Rates

Conciseness 

In [None]:
print_metrics_data(r)