<a href="https://colab.research.google.com/github/DJCordhose/practical-llm/blob/main/Eval4pptx.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hands on: Eval - small LLM as a judge

Goal
* see how LLM-as-a-judge works in principle
* get an introduction to the G-Eval algorithm ([G-Eval on arXiv](https://arxiv.org/abs/2303.16634))
* see how the algorithm uses prompts to generate the actual eval prompt
* try out the [DeepEval library](https://docs.confident-ai.com/docs/guides-using-custom-llms)
* see the limitations of small LLMs as a judge
* optional: compare to GPT-4o

# Setup: create an *llm_run* function using a small LLM

* load & quantize a small model from Hugging Face
* define a simple **llm_run** function that calls the loaded model
* try out llm_run

=> same setup as in the Assessment notebook

In [1]:
!nvidia-smi

Tue Sep  3 11:03:08 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0              28W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

### **Important:**
Ensure that no GPU memory is allocated yet (on a T4, look for "0MiB / 15360MiB").
If GPU memory is already allocated, use Runtime/Manage Sessions to delete all active sessions.

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
%%time

!pip install --upgrade -q transformers accelerate flash_attn torch bitsandbytes
!pip install lm-format-enforcer -q
!pip install deepeval==1.1.1 -q

CPU times: user 35.5 ms, sys: 5.14 ms, total: 40.6 ms
Wall time: 8.34 s


#### => you may need to restart the session

In [4]:
from google.colab import userdata

# Configure HuggingFace token as a Colab Secret, use key symbol on the left panel
!huggingface-cli login --token {userdata.get('HF_TOKEN')}

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### Load & quantize Model (yielding: model_id, model, tokenizer)

In [5]:
# kind = 'Lllama_3.1_8B_4bit'
kind = 'Lllama_3.1_8B_8bit'
# kind = 'Lllama_3.1_8B_16bit'  # too large for T4
# kind = 'Phi-3.5-MoE_4bit'     # No module named 'triton' ???
# kind = "Phi-3.5-mini_16bit"   # not "strong" enough

if "Lllama_3.1_8B" in kind:
  model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
elif "Phi-3.5-MoE" in kind:
  model_id = "microsoft/Phi-3.5-MoE-instruct"
else:
  model_id = "microsoft/Phi-3.5-mini-instruct"

print(kind)
print(model_id)

Lllama_3.1_8B_8bit
meta-llama/Meta-Llama-3.1-8B-Instruct


***Note:*** run `watch -n 0.5 nvidia-smi` in a terminal to watch the GPU usage and see when the model is loaded onto it.

In [6]:
%%time

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

torch_dtype = None
quantization_config = None

if "8bit" in kind:
  print("Using 8Bit quantization")
  quantization_config = BitsAndBytesConfig(load_in_8bit=True)
elif "4bit" in kind:
  print("Using 4Bit quantization")
  quantization_config = BitsAndBytesConfig(load_in_4bit=True)
else:
  print("Using Full Resolution")
  torch_dtype = torch.bfloat16

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    torch_dtype=torch_dtype,
    device_map="cuda",
    trust_remote_code=True
)

Using 8Bit quantization


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

CPU times: user 18.9 s, sys: 16.9 s, total: 35.9 s
Wall time: 1min 31s


In [7]:
!nvidia-smi

Tue Sep  3 11:04:50 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P0              28W /  70W |   8825MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
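
The memory now in use is in the range a back-of-the-envelope estimate would suggest. A minimal sketch, assuming roughly 8 billion parameters for the 8B model and an arbitrary 1 GiB of headroom for the CUDA context, activations and KV cache (both numbers are assumptions, not measurements). It counts weights only, which is one reason the measured value is higher:

```python
# Rough weight-only memory estimate. PARAMS and HEADROOM_MIB are assumptions.
PARAMS = 8.0e9          # assumed: about 8 billion parameters
HEADROOM_MIB = 1024     # assumed: CUDA context, activations, KV cache
T4_MEMORY_MIB = 15360   # total memory reported by nvidia-smi for the T4

BYTES_PER_PARAM = {"4bit": 0.5, "8bit": 1.0, "16bit": 2.0}

def weights_mib(kind: str) -> float:
    """Estimated size of the model weights in MiB for a given precision."""
    return PARAMS * BYTES_PER_PARAM[kind] / 2**20

def fits_on_t4(kind: str) -> bool:
    """True if weights plus the assumed headroom fit into the T4's memory."""
    return weights_mib(kind) + HEADROOM_MIB <= T4_MEMORY_MIB

for kind in BYTES_PER_PARAM:
    verdict = "fits" if fits_on_t4(kind) else "does not fit"
    print(f"{kind:>5}: ~{weights_mib(kind):,.0f} MiB of weights -> {verdict}")
```

Under these assumptions the 16-bit weights alone nearly fill the card, which is consistent with the "too large for T4" remark in the model selection cell.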
                                                                    

In [8]:
from transformers import AutoTokenizer

# Load the tokenizer once instead of on every call
tokenizer = AutoTokenizer.from_pretrained(model_id)

def llm_run(messages):
  # A plain string is treated as a single user message
  if isinstance(messages, str):
    messages = [{"role": "user", "content": messages}]
  # Stop at the regular EOS token or at the Llama 3 end-of-turn token
  terminators = [
      tokenizer.eos_token_id,
      tokenizer.convert_tokens_to_ids("<|eot_id|>")
  ]

  input_token_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
  ).to(model.device)

  # Greedy decoding (do_sample=False) keeps the judge's answers repeatable
  outputs = model.generate(
      input_token_ids,
      max_new_tokens=512,
      eos_token_id=terminators,
      pad_token_id=tokenizer.eos_token_id,
      do_sample=False
  )
  # Keep only the newly generated tokens, dropping the prompt
  output_token_ids = outputs[0][input_token_ids.shape[-1]:]
  result = tokenizer.decode(output_token_ids, skip_special_tokens=True)
  return result
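
llm_run accepts either a plain string or a chat-style list of messages. That normalisation step can be looked at in isolation. A minimal sketch, where `to_messages` and its `system` parameter are hypothetical and not used elsewhere in this notebook:

```python
def to_messages(prompt, system=None):
    """Return a chat-style message list for a string or an existing list."""
    if isinstance(prompt, str):
        messages = [{"role": "user", "content": prompt}]
    else:
        messages = list(prompt)  # copy, so the caller's list is not modified
    # Prepend a system message only if one was requested and none exists yet
    if system is not None and not any(m["role"] == "system" for m in messages):
        messages.insert(0, {"role": "system", "content": system})
    return messages

print(to_messages("who are you ?", system="You answer briefly."))
```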

Try out our model:

In [41]:
%%time
print(model_id)
llm_run("who are you ?")

meta-llama/Meta-Llama-3.1-8B-Instruct
CPU times: user 5.47 s, sys: 38 ms, total: 5.51 s
Wall time: 5.82 s


'I\'m an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."'

In [42]:
from IPython.display import Markdown

messages = [
    {"role": "system", "content": "You are an English-speaking, competent expert in the field of sanitary piping systems."},
    {"role": "user", "content": "What are waste-water pipes made out of ?"}
  ]

#answer = llm_run(messages)
#Markdown(answer)

# LLM-as-a-judge: in principle

In [11]:
llm_output="Witing texts is painful, caus im making mitakes."

In [43]:
simple_eval_prompt = f'''
You are an expert on english language, grading a students text with scores between 0 and 10.
A text written in proper english, in a fluent style, containing no grammatical or syntax errors is graded 10.
A text written in a different language or with spelling errors gets a low score.
Also give a detailed explanation why the score was chosen.
Do not repeat the students text in your explanation.

Always answer in the following json format:
{{
    "score": 8,
    "reason": "some reason"
}}

Examples
1. Student Text: Pipes are cylindrical conduits used to transport fluids or gases, typically made of materials like metal, plastic, or concrete.
   Answer:
   {{
    "score": 8,
    "reason": "The text is written in english and does not contain any syntactical or grammatical erros"
  }}
2. Student Text: Zwischen Neonlichtern und Straßenlärm träum ich leise von Freiheit.
   Answer:
   {{
    "score": 2,
    "reason": "The text is written in german and not in english."
  }}

Student Text: {llm_output}
Answer:
'''

In [44]:
%%time

print("***** Prompt         :")
print(simple_eval_prompt)

answer = llm_run(simple_eval_prompt)

print("***** Answer         :")
print(answer)

***** Prompt         :

You are an expert on english language, grading a students text with scores between 0 and 10.
A text written in proper english, in a fluent style, containing no grammatical or syntax errors is graded 10.
A text written in a different language or with spelling errors gets a low score.
Also give a detailed explanation why the score was chosen.
Do not repeat the students text in your explanation.

Always answer in the following json format:
{
    "score": 8,
    "reason": "some reason"
}

Examples
1. Student Text: Pipes are cylindrical conduits used to transport fluids or gases, typically made of materials like metal, plastic, or concrete.
   Answer:
   {
    "score": 8,
    "reason": "The text is written in english and does not contain any syntactical or grammatical erros"
  }
2. Student Text: Zwischen Neonlichtern und Straßenlärm träum ich leise von Freiheit.
   Answer:
   {
    "score": 2,
    "reason": "The text is written in german and not in english."
  }

Stu
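
The doubled braces in the prompt template are needed because a single `{` or `}` inside an f-string starts or ends a placeholder. An alternative is to build the prompt from data and let `json.dumps` render the JSON parts. A minimal sketch, in which `build_judge_prompt` and `FEW_SHOT` are hypothetical names and the instructions are shortened:

```python
import json

# Few-shot examples as (student text, expected answer) pairs
FEW_SHOT = [
    ("Pipes are cylindrical conduits used to transport fluids or gases.",
     {"score": 8, "reason": "Written in English without grammatical errors."}),
    ("Zwischen Neonlichtern und Straßenlärm träum ich leise von Freiheit.",
     {"score": 2, "reason": "Written in German, not in English."}),
]

def build_judge_prompt(student_text, examples=FEW_SHOT):
    """Assemble a few-shot grading prompt; json.dumps renders the JSON parts."""
    lines = [
        "You are an expert on the English language, grading a student's text "
        "with scores between 0 and 10.",
        'Always answer in JSON with the keys "score" and "reason".',
        "",
        "Examples",
    ]
    for number, (text, answer) in enumerate(examples, start=1):
        lines.append(f"{number}. Student Text: {text}")
        lines.append(f"   Answer: {json.dumps(answer, ensure_ascii=False)}")
    lines += ["", f"Student Text: {student_text}", "Answer:"]
    return "\n".join(lines)

print(build_judge_prompt("Witing texts is painful, caus im making mitakes."))
```

Adding or removing examples then only changes the list, not the template.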

# LLM-as-a-judge: G-Eval in principle

### **Idea:** given only a criterion, use an LLM to generate a detailed evaluation prompt.

https://arxiv.org/pdf/2303.16634
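
Besides generating the evaluation steps, the G-Eval paper proposes weighting each possible score by the probability the judge assigns to it, rather than taking the single generated score. A minimal sketch of that weighting follows. The probabilities are made up for illustration, and obtaining real ones requires a model API that exposes token probabilities, which `llm_run` as defined here does not.

```python
def weighted_score(score_probs):
    """Expected score: sum of score * probability, divided by the total probability."""
    total = sum(score_probs.values())
    if total == 0:
        raise ValueError("no probability mass on any score")
    return sum(score * p for score, p in score_probs.items()) / total

# Made-up probabilities for the score tokens 6, 7 and 8
probs = {6: 0.2, 7: 0.5, 8: 0.3}
print(weighted_score(probs))
```

The result is a fractional score, which can separate texts more finely than integer grades alone.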

### G-Eval Phase 1: generate evaluation steps based on the criteria

In [14]:
criteria="Grade the english grammar and syntax"

In [45]:
%%time
geval_phase1_prompt = f'''Given an evaluation criteria which outlines how you should judge the Actual Output, generate
3-4 concise evaluation steps based on the criteria below. You MUST make it clear how to evaluate Actual Output in
relation to one another.

Evaluation Criteria:
{criteria}

**
IMPORTANT: Please make sure to only return in JSON format, with the "steps" key as a list of strings. No words or
explanation is needed.
Example JSON:
{{
    "steps": <list_of_strings>
}}
**

JSON:
'''

print("***** Prompt         :")
print(geval_phase1_prompt)

answer_phase1 = llm_run(geval_phase1_prompt)

print("***** Answer         :")
print(answer_phase1)


***** Prompt         :
Given an evaluation criteria which outlines how you should judge the Actual Output, generate
3-4 concise evaluation steps based on the criteria below. You MUST make it clear how to evaluate Actual Output in
relation to one another.

Evaluation Criteria:
Grade the english grammar and syntax

**
IMPORTANT: Please make sure to only return in JSON format, with the "steps" key as a list of strings. No words or
explanation is needed.
Example JSON:
{
    "steps": <list_of_strings>
}
**

JSON:

***** Answer         :
{
  "steps": [
    "Compare the subject-verb agreement in the Actual Output with the expected output.",
    "Evaluate the tense consistency and correct use of verb forms in the Actual Output.",
    "Check for correct use of pronouns, articles, and prepositions in the Actual Output.",
    "Assess the overall sentence structure and flow of the Actual Output for grammatical correctness."
  ]
}
CPU times: user 17.5 s, sys: 61.7 ms, total: 17.6 s
Wall time: 17.8 

In [46]:
import json

json_answer_step1=json.loads(answer_phase1)
steps="\n".join(f"{index+1}. {step}" for index, step in enumerate(json_answer_step1['steps']))
print(steps)

1. Compare the subject-verb agreement in the Actual Output with the expected output.
2. Evaluate the tense consistency and correct use of verb forms in the Actual Output.
3. Check for correct use of pronouns, articles, and prepositions in the Actual Output.
4. Assess the overall sentence structure and flow of the Actual Output for grammatical correctness.
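
`json.loads` only succeeds when the model returns nothing but JSON. Small models sometimes wrap the answer in extra text. A hedged sketch of a more forgiving parser; `extract_json` is a hypothetical helper that simply takes the text between the first `{` and the last `}`:

```python
import json

def extract_json(answer: str) -> dict:
    """Parse the outermost {...} span in an LLM answer, ignoring surrounding text."""
    start, end = answer.find("{"), answer.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in answer")
    return json.loads(answer[start:end + 1])

noisy = 'Here are the steps:\n{"steps": ["Check spelling.", "Check tense."]}\nDone.'
steps_list = extract_json(noisy)["steps"]
print("\n".join(f"{i}. {step}" for i, step in enumerate(steps_list, start=1)))
```

This still fails on malformed JSON. Constrained decoding would be a stricter option and is not shown here.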


### G-Eval Phase 2: evaluate the llm_output using the generated steps

In [17]:
llm_input="Why do you dislike writing texts ?"
llm_output="Witing texts is painful, caus im making mitakes."

In [47]:
geval_phase2_prompt = f'''
Given the evaluation steps, return a JSON with two keys:
1) a `score` key ranging from 0 - 10, with 10 being that it follows the criteria outlined in the steps and 0 being that it does not, and
2) a `reason` key, a reason for the given score, but DO NOT QUOTE THE SCORE in your reason.
Please mention specific information from Actual Output and Input in your reason, but be very concise with it!

Evaluation Steps:
{steps}

Actual Output:
{llm_output}

Input:
{llm_input}



**
IMPORTANT: Please make sure to only return in JSON format, with the "score" and "reason" key. No words or explanation is needed.

Example JSON:
{{
    "score": 0,
    "reason": "The text does not follow the evaluation steps provided."
}}
**

JSON:

'''


In [48]:
print("***** Prompt         :")
print(geval_phase2_prompt)

answer_phase2=llm_run(geval_phase2_prompt)
print("***** Answer         :")
print(answer_phase2)

***** Prompt         :

Given the evaluation steps, return a JSON with two keys:
1) a `score` key ranging from 0 - 10, with 10 being that it follows the criteria outlined in the steps and 0 being that it does not, and
2) a `reason` key, a reason for the given score, but DO NOT QUOTE THE SCORE in your reason.
Please mention specific information from Actual Output and Input in your reason, but be very concise with it!

Evaluation Steps:
1. Compare the subject-verb agreement in the Actual Output with the expected output.
2. Evaluate the tense consistency and correct use of verb forms in the Actual Output.
3. Check for correct use of pronouns, articles, and prepositions in the Actual Output.
4. Assess the overall sentence structure and flow of the Actual Output for grammatical correctness.

Actual Output:
Witing texts is painful, caus im making mitakes.

Input:
What is a pipe used for ?



**
IMPORTANT: Please make sure to only return in JSON format, with the "score" and "reason" key. No w
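
The phase 2 prompt asks for a score from 0 to 10, while the DeepEval reports further down show scores between 0 and 1 with a threshold of 0.5. A minimal sketch of turning a phase 2 answer into such a result, assuming the score is simply divided by 10. The `judge_result` helper and the sample answer are hypothetical.

```python
import json

def judge_result(answer_json: str, threshold: float = 0.5) -> dict:
    """Normalise a 0-10 judge score to 0-1 and compare it with a threshold."""
    data = json.loads(answer_json)
    raw = float(data["score"])
    if not 0 <= raw <= 10:
        raise ValueError(f"score out of range: {raw}")
    score = raw / 10
    return {"score": score, "passed": score >= threshold, "reason": data["reason"]}

# Hypothetical judge answer, not taken from a real run
sample = '{"score": 2, "reason": "Several spelling errors."}'
print(judge_result(sample))
```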

# G-Eval Implementation by DeepEval

A small adapter that connects our **llm_run** function to the DeepEval library (and does some logging).

See: https://docs.confident-ai.com/docs/guides-using-custom-llms

In [51]:
from deepeval.models import DeepEvalBaseLLM

log_output=""
llm_log = True

def log(log_message):
    global log_output

    if llm_log:
        log_output += log_message + "\n"

# wrapper calling llm_run using the global members model_id, model
class CustomDeepEvalLlm(DeepEvalBaseLLM):
    def __init__(self):
        super().__init__()
        self.generate_count = 0

    def load_model(self):
        return model

    def generate(self, prompt: str) -> str:
        self.generate_count += 1
        count = self.generate_count
        log(f'[{count}] ********************** deepEval LLM Generate BEGIN ********************************************')
        log(f'[{count}] ***** Prompt         : ' + prompt)

        result = llm_run(prompt)

        log(f'[{count}] ***** Answer         : ' + result)
        log(f'[{count}] ********************** deepEval LLM Generate END   ')
        log(f'[{count}] ')
        return result

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self):
        return model_id

deepeval_custom_model = CustomDeepEvalLlm()
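
`a_generate` above calls the blocking `generate` directly, so the event loop cannot do anything else while the model runs. One possible alternative is `asyncio.to_thread` (Python 3.9+), sketched here with a stand-in for `llm_run`; `slow_llm` is hypothetical. Whether concurrent calls into a single GPU model are safe or useful is not addressed here.

```python
import asyncio
import time

def slow_llm(prompt: str) -> str:
    """Stand-in for a blocking model call such as llm_run."""
    time.sleep(0.1)
    return f"answer to: {prompt}"

async def a_generate(prompt: str) -> str:
    # Run the blocking call in a worker thread so the event loop stays responsive
    return await asyncio.to_thread(slow_llm, prompt)

async def main():
    # gather returns the results in the order the awaitables were passed
    return await asyncio.gather(a_generate("first"), a_generate("second"))

results = asyncio.run(main())
print(results)
```

Inside a notebook, where an event loop is already running, use `await main()` instead of `asyncio.run(main())`.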

Try out the wrapper and check the log output:

In [52]:
deepeval_custom_model.generate('who are you ?')

'I\'m an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."'

In [53]:
print(log_output)

[1] ********************** deepEval LLM Generate BEGIN ********************************************
[1] ***** Prompt         : who are you ?
[1] ***** Answer         : I'm an artificial intelligence model known as Llama. Llama stands for "Large Language Model Meta AI."
[1] ********************** deepEval LLM Generate END   
[1] 



**geval_run** calls DeepEval's G-Eval implementation, passing our criteria, llm_input and llm_output.

In [55]:
import deepeval
import deepeval.test_case
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

def geval_run(name, criteria, input, output):
    result = deepeval.evaluate(

        test_cases=[deepeval.test_case.LLMTestCase(input=input, actual_output=output )],

        metrics=[ GEval(
              name=name,
              criteria=criteria,
              evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT,LLMTestCaseParams.INPUT],
              model=deepeval_custom_model
        )]
    )
    return result

In [56]:
%%time
log_output=""
r = geval_run("Language", "Grade the english grammar and syntax.", llm_input, llm_output )

Output()

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ❌ Language (GEval) (score: 0.0, threshold: 0.5, strict: False, evaluation model: meta-llama/Meta-Llama-3.1-8B-Instruct, reason: The text contains grammatical errors, such as 'Witing' instead of 'Writing', 'caus' instead of 'because', and'mitakes' instead of'mistakes'. The sentence structure and punctuation are also incorrect., error: None)

For test case:

  - input: What is a pipe used for ?
  - actual output: Witing texts is painful, caus im making mitakes.
  - expected output: None
  - context: None
  - retrieval context: None


Overall Metric Pass Rates

Language (GEval): 0.00% pass rate




CPU times: user 43.1 s, sys: 533 ms, total: 43.6 s
Wall time: 51.7 s


In [57]:
def print_metrics_data(deep_eval_result):
    for testcase in deep_eval_result:
      print("input        :", testcase.input)
      print("actual_output:", testcase.actual_output)
      print()
      for metric in testcase.metrics_data:
        print("name         :",metric.name)
        print("score        :",metric.score)
        print("reason       :",metric.reason)
        print("model        :",metric.evaluation_model)
        print()
      print("-----------")
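
For comparing several runs it can help to flatten the results into rows instead of printing them. A sketch using stand-in objects that carry only the attributes read in `print_metrics_data`. `metrics_rows` is a hypothetical helper, and the sample values mirror the Language run above.

```python
from types import SimpleNamespace

def metrics_rows(results):
    """Flatten test results into (input, metric name, score) tuples."""
    return [
        (case.input, metric.name, metric.score)
        for case in results
        for metric in case.metrics_data
    ]

# Stand-in for a DeepEval result, with only the attributes used above
fake = [SimpleNamespace(
    input="What is a pipe used for ?",
    metrics_data=[SimpleNamespace(name="Language (GEval)", score=0.0)],
)]
for row in metrics_rows(fake):
    print(row)
```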

In [58]:
print_metrics_data(r)
r

input        : What is a pipe used for ?
actual_output: Witing texts is painful, caus im making mitakes.

name         : Language (GEval)
score        : 0.0
reason       : The text contains grammatical errors, such as 'Witing' instead of 'Writing', 'caus' instead of 'because', and'mitakes' instead of'mistakes'. The sentence structure and punctuation are also incorrect.
model        : meta-llama/Meta-Llama-3.1-8B-Instruct

-----------


[TestResult(success=False, metrics_data=[MetricData(name='Language (GEval)', threshold=0.5, success=False, score=0.0, reason="The text contains grammatical errors, such as 'Witing' instead of 'Writing', 'caus' instead of 'because', and'mitakes' instead of'mistakes'. The sentence structure and punctuation are also incorrect.", strict_mode=False, evaluation_model='meta-llama/Meta-Llama-3.1-8B-Instruct', error=None, evaluation_cost=None, verbose_logs='Criteria:\nGrade the english grammar and syntax. \n \nEvaluation Steps:\n[\n    "Compare the Actual Output to the Input to identify any grammatical errors or inconsistencies.",\n    "Evaluate the Actual Output\'s sentence structure, verb tense, and subject-verb agreement in relation to the Input.",\n    "Assess the Actual Output\'s punctuation, capitalization, and spelling accuracy in comparison to the Input.",\n    "Determine the overall grammatical correctness of the Actual Output by considering its alignment with the Input\'s intended mea

In [59]:
print(log_output)

[2] ********************** deepEval LLM Generate BEGIN ********************************************
[2] ***** Prompt         : Given an evaluation criteria which outlines how you should judge the Actual Output and Input, generate 3-4 concise evaluation steps based on the criteria below. You MUST make it clear how to evaluate Actual Output and Input in relation to one another.

Evaluation Criteria:
Grade the english grammar and syntax.

**
IMPORTANT: Please make sure to only return in JSON format, with the "steps" key as a list of strings. No words or explanation is needed.
Example JSON:
{
    "steps": <list_of_strings>
}
**

JSON:

[2] ***** Answer         : {
  "steps": [
    "Compare the Actual Output to the Input to identify any grammatical errors or inconsistencies.",
    "Evaluate the Actual Output's sentence structure, verb tense, and subject-verb agreement in relation to the Input.",
    "Assess the Actual Output's punctuation, capitalization, and spelling accuracy in comparison

# Evaluating multiple metrics: Conciseness, Answer Relevancy, Toxicity, ...

Check out some other metrics: [https://docs.confident-ai.com/docs/metrics-introduction](https://docs.confident-ai.com/docs/metrics-introduction)

In [29]:
deepeval_custom_model = CustomDeepEvalLlm()

In [65]:
from deepeval.metrics import AnswerRelevancyMetric, ToxicityMetric

def metrics_run(input, output):
    test_case = deepeval.test_case.LLMTestCase(
        input=input,
        actual_output=output
      )

    conciseness_metric = GEval(
        name="Conciseness",
        criteria="Determine how concise the actual output is. Ignore the input.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.INPUT],
        model=deepeval_custom_model
    )

    language_metric = GEval(
        name="Language",
        criteria="Grade the english grammar and syntax. Ignore the input.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.INPUT],
        model=deepeval_custom_model
    )

    metrics = [
        conciseness_metric,
        language_metric,
        AnswerRelevancyMetric(model=deepeval_custom_model),
        # ToxicityMetric(model=deepeval_custom_model) # Lllama_3.1_8B_8bit not "strong" enough
    ]

    eval_result = deepeval.evaluate(
        test_cases=[test_case],
        metrics=metrics,
    )
    return eval_result

In [66]:
llm_input="What is a pipe used for ?"
llm_output_concise="A pipe is a tubular conduit used to transport fluids or sometimes solids."
llm_output_inconcise="Pipes are beautiful, black and round. Because they are round they are very convenient and don't have any edges."

In [69]:
log_output=""
deepeval_custom_model = CustomDeepEvalLlm()
r=metrics_run(llm_input, llm_output_concise)
print_metrics_data(r)

Output()

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ✅ Conciseness (GEval) (score: 0.8, threshold: 0.5, strict: False, evaluation model: meta-llama/Meta-Llama-3.1-8B-Instruct, reason: The actual output is concise and conveys necessary information without unnecessary details, but it could be more clear in its definition of a pipe., error: None)
  - ✅ Language (GEval) (score: 0.9, threshold: 0.5, strict: False, evaluation model: meta-llama/Meta-Llama-3.1-8B-Instruct, reason: The text follows the evaluation steps, except for a minor capitalization error in the first word., error: None)
  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: meta-llama/Meta-Llama-3.1-8B-Instruct, reason: The score is 1.00 because the actual output directly addresses the question about the use of a pipe, making it highly relevant., error: None)

For test case:

  - input: What is a pipe used for ?
  - actual output: A pipe is a tubular conduit used to transport fluids or sometimes solids.
  - expected output

input        : What is a pipe used for ?
actual_output: A pipe is a tubular conduit used to transport fluids or sometimes solids.

name         : Conciseness (GEval)
score        : 0.8
reason       : The actual output is concise and conveys necessary information without unnecessary details, but it could be more clear in its definition of a pipe.
model        : meta-llama/Meta-Llama-3.1-8B-Instruct

name         : Language (GEval)
score        : 0.9
reason       : The text follows the evaluation steps, except for a minor capitalization error in the first word.
model        : meta-llama/Meta-Llama-3.1-8B-Instruct

name         : Answer Relevancy
score        : 1.0
reason       : The score is 1.00 because the actual output directly addresses the question about the use of a pipe, making it highly relevant.
model        : meta-llama/Meta-Llama-3.1-8B-Instruct

-----------


In [62]:
#print(log_output)

# Optional: Compare with OpenAI GPT-4o

To use OpenAI you need an API key. DeepEval defaults to OpenAI if no model is set.
We use an explicit wrapper implementation to reduce non-determinism,
but OpenAI is still not deterministic.

In [36]:
!pip install openai



In [63]:
log_output=""
deepeval_custom_model = None # deep_eval defaults to openai
r=metrics_run(llm_input, llm_output_concise)
print_metrics_data(r)

Output()

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ✅ Conciseness (GEval) (score: 0.672558390477256, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The Actual Output is slightly longer than the Input and adds some detail about 'sometimes solids' that is not explicitly requested but retains essential information., error: None)
  - ✅ Language (GEval) (score: 0.6851780405955414, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The grammar and syntax are correct, but the Actual Output does not match the question format of the Input., error: None)
  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because the answer is perfectly relevant and addresses the question directly. Great job!, error: None)

For test case:

  - input: What is a pipe used for ?
  - actual output: A pipe is a tubular conduit used to transport fluids or sometimes solids.
  - expected output: None
  - context: None
  - retrieval context: None


Ove

input        : What is a pipe used for ?
actual_output: A pipe is a tubular conduit used to transport fluids or sometimes solids.

name         : Conciseness (GEval)
score        : 0.672558390477256
reason       : The Actual Output is slightly longer than the Input and adds some detail about 'sometimes solids' that is not explicitly requested but retains essential information.
model        : gpt-4o

name         : Language (GEval)
score        : 0.6851780405955414
reason       : The grammar and syntax are correct, but the Actual Output does not match the question format of the Input.
model        : gpt-4o

name         : Answer Relevancy
score        : 1.0
reason       : The score is 1.00 because the answer is perfectly relevant and addresses the question directly. Great job!
model        : gpt-4o

-----------
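
The two judges can be compared metric by metric. A small sketch using the scores printed above for the concise answer, with the GPT-4o values rounded to three decimals. Since OpenAI is not deterministic, the gaps are only indicative. `score_gaps` is a hypothetical helper.

```python
# Scores copied from the two runs above (GPT-4o values rounded)
llama_scores = {"Conciseness": 0.8, "Language": 0.9, "Answer Relevancy": 1.0}
gpt4o_scores = {"Conciseness": 0.673, "Language": 0.685, "Answer Relevancy": 1.0}

def score_gaps(a, b):
    """Absolute score difference per metric, for metrics present in both."""
    return {name: round(abs(a[name] - b[name]), 3) for name in a if name in b}

for name, gap in score_gaps(llama_scores, gpt4o_scores).items():
    print(f"{name:<17} gap: {gap:.3f}")
```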


In [64]:
#print(log_output)