<a href="https://colab.research.google.com/github/DJCordhose/llm-from-prototype-to-production/blob/main/Eval4pptx.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hands-on: LLM-as-a-judge

Goals:
* see how LLM-as-a-judge works in principle
* get an introduction to the G-Eval algorithm ([G-Eval on arXiv](https://arxiv.org/abs/2303.16634))
* see how the algorithm uses prompts to generate the actual evaluation prompt
* try out the [DeepEval library](https://docs.confident-ai.com/docs/guides-using-custom-llms)


Requirements:
* OpenAI API key

# Setup: create an *llm_run* function that calls an OpenAI GPT model

* define a simple **llm_run** function that calls the GPT model
* try out llm_run

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
%%time

!pip install openai
!pip install deepeval==1.1.1 -q

CPU times: user 85 ms, sys: 9.43 ms, total: 94.4 ms
Wall time: 12.4 s


In [3]:
import os
from google.colab import userdata
from openai import OpenAI

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [4]:
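def llm_run(messages):
  """Send a prompt string or a list of chat messages to the model and return the reply text."""
  # Accept a plain string and wrap it as a single user message
  if isinstance(messages, str):
    messages = [{"role": "user", "content": messages}]

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
  )
  result = completion.choices[0].message.content
  return result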

Try out our model:

In [5]:
llm_run("who are you ?")

'I am an AI language model created by OpenAI, designed to assist with providing information, answering questions, and engaging in conversation on a wide range of topics. How can I help you today?'

# LLM-as-a-judge: in principle

In [6]:
llm_output="Witing texts is painful, caus im making mitakes."

In [7]:
simple_eval_prompt = f'''
You are an expert on english language, grading a students text with scores between 0 and 10.
A text written in proper english, in a fluent style, containing no grammatical or syntax errors is graded 10.
A text written in a different language or with spelling errors gets a low score.
Also give a detailed explanation why the score was chosen.
Do not repeat the students text in your explanation.

Always answer in the following json format:
{{
    "score": 8,
    "reason": "some reason"
}}

Examples
1. Student Text: Pipes are cylindrical conduits used to transport fluids or gases, typically made of materials like metal, plastic, or concrete.
   Answer:
   {{
    "score": 8,
    "reason": "The text is written in english and does not contain any syntactical or grammatical erros"
  }}
2. Student Text: Zwischen Neonlichtern und Straßenlärm träum ich leise von Freiheit.
   Answer:
   {{
    "score": 2,
    "reason": "The text is written in german and not in english."
  }}

Student Text: {llm_output}
Answer:
'''

In [8]:
%%time

print("***** Prompt         :")
print(simple_eval_prompt)

answer = llm_run(simple_eval_prompt)

print("***** Answer         :")
print(answer)

***** Prompt         :

You are an expert on english language, grading a students text with scores between 0 and 10.
A text written in proper english, in a fluent style, containing no grammatical or syntax errors is graded 10.
A text written in a different language or with spelling errors gets a low score.
Also give a detailed explanation why the score was chosen.
Do not repeat the students text in your explanation.

Always answer in the following json format:
{
    "score": 8,
    "reason": "some reason"
}

Examples
1. Student Text: Pipes are cylindrical conduits used to transport fluids or gases, typically made of materials like metal, plastic, or concrete.
   Answer:
   {
    "score": 8,
    "reason": "The text is written in english and does not contain any syntactical or grammatical erros"
  }
2. Student Text: Zwischen Neonlichtern und Straßenlärm träum ich leise von Freiheit.
   Answer:
   {
    "score": 2,
    "reason": "The text is written in german and not in english."
  }

Stu
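The judge's answer is plain text that is expected to contain JSON, so it still has to be parsed before the score can be used. The following is a minimal sketch of that step, assuming the model may occasionally add text around the JSON object. The helper name `parse_judge_answer` and the example reply are hypothetical and not part of the notebook.

```python
import json
import re

def parse_judge_answer(answer: str) -> tuple[int, str]:
    """Extract (score, reason) from a judge reply that should contain one JSON object."""
    # Take everything from the first "{" to the last "}" so that any
    # text surrounding the JSON object is ignored.
    match = re.search(r"\{.*\}", answer, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge answer")
    data = json.loads(match.group(0))
    score = int(data["score"])
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return score, str(data["reason"])

# Hypothetical reply, made up for illustration (not an actual model output)
example_answer = 'Here is my grade:\n{"score": 3, "reason": "Several spelling errors."}'
print(parse_judge_answer(example_answer))
```

In the notebook, `parse_judge_answer(answer)` could be applied to the string returned by `llm_run(simple_eval_prompt)`.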

# LLM-as-a-judge: G-Eval in principle

### **Idea:** given only the evaluation criteria, use an LLM to generate a detailed evaluation prompt.

https://arxiv.org/pdf/2303.16634

### G-Eval Phase 1: generate evaluation steps based on the criteria

In [9]:
criteria="Grade the english grammar and syntax"

In [10]:
%%time
geval_phase1_prompt = f'''Given an evaluation criteria which outlines how you should judge the Actual Output, generate
3-4 concise evaluation steps based on the criteria below. You MUST make it clear how to evaluate Actual Output in
relation to one another.

Evaluation Criteria:
{criteria}

**
IMPORTANT: Please make sure to only return in JSON format, with the "steps" key as a list of strings. No words or
explanation is needed.
Example JSON:
{{
    "steps": <list_of_strings>
}}
**

JSON:
'''

print("***** Prompt         :")
print(geval_phase1_prompt)

answer_phase1 = llm_run(geval_phase1_prompt)

print("***** Answer         :")
print(answer_phase1)


***** Prompt         :
Given an evaluation criteria which outlines how you should judge the Actual Output, generate
3-4 concise evaluation steps based on the criteria below. You MUST make it clear how to evaluate Actual Output in
relation to one another.

Evaluation Criteria:
Grade the english grammar and syntax

**
IMPORTANT: Please make sure to only return in JSON format, with the "steps" key as a list of strings. No words or
explanation is needed.
Example JSON:
{
    "steps": <list_of_strings>
}
**

JSON:

***** Answer         :
{
    "steps": [
        "Assess the overall grammatical correctness of each output, noting any errors in sentence structure, punctuation, or verb tense consistency.",
        "Evaluate the clarity and coherence of the writing, ensuring that ideas are logically organized and effectively communicated.",
        "Compare the complexity and variety of the vocabulary used, determining if higher-level language is appropriately implemented.",
        "Check for co

In [11]:
import json

json_answer_step1=json.loads(answer_phase1)
steps="\n".join(f"{index+1}. {step}" for index, step in enumerate(json_answer_step1['steps']))
print(steps)

1. Assess the overall grammatical correctness of each output, noting any errors in sentence structure, punctuation, or verb tense consistency.
2. Evaluate the clarity and coherence of the writing, ensuring that ideas are logically organized and effectively communicated.
3. Compare the complexity and variety of the vocabulary used, determining if higher-level language is appropriately implemented.
4. Check for consistency in tone and style, ensuring that the output maintains an appropriate voice for the intended audience.


### G-Eval Phase 2: evaluate the llm_output using the generated steps

In [12]:
llm_input="Why do you dislike writing texts ?"
llm_output="Witing texts is painful, caus im making mitakes."

In [13]:
geval_phase2_prompt = f'''
Given the evaluation steps, return a JSON with two keys:
1) a `score` key ranging from 0 - 10, with 10 being that it follows the criteria outlined in the steps and 0 being that it does not, and
2) a `reason` key, a reason for the given score, but DO NOT QUOTE THE SCORE in your reason.
Please mention specific information from Actual Output and Input in your reason, but be very concise with it!

Evaluation Steps:
{steps}

Actual Output:
{llm_output}

Input:
{llm_input}



**
IMPORTANT: Please make sure to only return in JSON format, with the "score" and "reason" key. No words or explanation is needed.

Example JSON:
{{
    "score": 0,
    "reason": "The text does not follow the evaluation steps provided."
}}
**

JSON:

'''


In [14]:
print("***** Prompt         :")
print(geval_phase2_prompt)

answer_phase2=llm_run(geval_phase2_prompt)
print("***** Answer         :")
print(answer_phase2)

***** Prompt         :

Given the evaluation steps, return a JSON with two keys:
1) a `score` key ranging from 0 - 10, with 10 being that it follows the criteria outlined in the steps and 0 being that it does not, and
2) a `reason` key, a reason for the given score, but DO NOT QUOTE THE SCORE in your reason.
Please mention specific information from Actual Output and Input in your reason, but be very concise with it!

Evaluation Steps:
1. Assess the overall grammatical correctness of each output, noting any errors in sentence structure, punctuation, or verb tense consistency.
2. Evaluate the clarity and coherence of the writing, ensuring that ideas are logically organized and effectively communicated.
3. Compare the complexity and variety of the vocabulary used, determining if higher-level language is appropriately implemented.
4. Check for consistency in tone and style, ensuring that the output maintains an appropriate voice for the intended audience.

Actual Output:
Witing texts is pa
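The two phases can be combined into one small pipeline. The sketch below is a simplified illustration, not the notebook's or DeepEval's implementation: the prompts are shortened, the function names `generate_steps` and `judge` are hypothetical, and `fake_llm` is a stand-in with canned replies so that the example runs without an API key.

```python
import json
from typing import Callable

def generate_steps(criteria: str, llm: Callable[[str], str]) -> list[str]:
    """Phase 1: ask the model to turn short criteria into evaluation steps."""
    prompt = (
        "Generate 3-4 concise evaluation steps for the criteria below.\n"
        f"Evaluation Criteria:\n{criteria}\n"
        'Return only JSON with a "steps" key holding a list of strings.\nJSON:\n'
    )
    return json.loads(llm(prompt))["steps"]

def judge(steps: list[str], llm_input: str, llm_output: str,
          llm: Callable[[str], str]) -> dict:
    """Phase 2: ask the model to score the output by following the steps."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    prompt = (
        'Given the evaluation steps, return JSON with a "score" (0-10) and a "reason".\n'
        f"Evaluation Steps:\n{numbered}\n\n"
        f"Actual Output:\n{llm_output}\n\nInput:\n{llm_input}\n\nJSON:\n"
    )
    return json.loads(llm(prompt))

def fake_llm(prompt: str) -> str:
    """Stand-in for llm_run so the sketch runs offline; the replies are canned."""
    if "Evaluation Criteria" in prompt:
        return '{"steps": ["Check spelling.", "Check sentence structure."]}'
    return '{"score": 2, "reason": "Canned reply for illustration."}'

steps_demo = generate_steps("Grade the english grammar and syntax", fake_llm)
verdict = judge(steps_demo, "Why do you dislike writing texts ?",
                "Witing texts is painful, caus im making mitakes.", fake_llm)
print(steps_demo)
print(verdict)
```

Passing `llm_run` instead of `fake_llm` would send both prompts to the real model.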

# G-Eval Implementation by DeepEval

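DeepEval can be connected to a custom model, such as our **llm_run** function, through a small adapter (which can also do some logging).

See: https://docs.confident-ai.com/docs/guides-using-custom-llms

**geval_run** calls DeepEval's G-Eval implementation, passing our criteria, llm_input and llm_output. It sets `model=None`, so DeepEval falls back to its default OpenAI model (gpt-4o in the output below).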

In [15]:
import deepeval
import deepeval.metrics
import deepeval.test_case
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

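def geval_run(name, criteria, input, output):
    """Evaluate one input/output pair with DeepEval's GEval metric."""
    test_case = deepeval.test_case.LLMTestCase(input=input, actual_output=output)

    metric = GEval(
        name=name,
        criteria=criteria,
        # The parts of the test case that the judge gets to see
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.INPUT],
        model=None  # None lets DeepEval use its default OpenAI model
    )
    result = deepeval.evaluate(test_cases=[test_case], metrics=[metric])
    return result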

In [16]:
log_output=""
r = geval_run("Language", "Grade the english grammar and syntax.", llm_input, llm_output )

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...


Output()



Metrics Summary

  - ❌ Language (GEval) (score: 0.2238696496947669, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: Numerous grammatical errors in Actual Output including 'Witing' instead of 'Writing', 'caus' instead of 'because', and 'mitakes' instead of 'mistakes'. Poor punctuation and capitalization affect readability., error: None)

For test case:

  - input: Why do you dislike writing texts ?
  - actual output: Witing texts is painful, caus im making mitakes.
  - expected output: None
  - context: None
  - retrieval context: None


Overall Metric Pass Rates

Language (GEval): 0.00% pass rate




In [17]:
def print_metrics_data(deep_eval_result):
    for testcase in deep_eval_result:
      print("input        :", testcase.input)
      print("actual_output:", testcase.actual_output)
      print()
      for metric in testcase.metrics_data:
        print("name         :",metric.name)
        print("score        :",metric.score)
        print("reason       :",metric.reason)
        print("model        :",metric.evaluation_model)
        print()
      print("-----------")

In [18]:
print_metrics_data(r)
r

input        : Why do you dislike writing texts ?
actual_output: Witing texts is painful, caus im making mitakes.

name         : Language (GEval)
score        : 0.2238696496947669
reason       : Numerous grammatical errors in Actual Output including 'Witing' instead of 'Writing', 'caus' instead of 'because', and 'mitakes' instead of 'mistakes'. Poor punctuation and capitalization affect readability.
model        : gpt-4o

-----------


[TestResult(success=False, metrics_data=[MetricData(name='Language (GEval)', threshold=0.5, success=False, score=0.2238696496947669, reason="Numerous grammatical errors in Actual Output including 'Witing' instead of 'Writing', 'caus' instead of 'because', and 'mitakes' instead of 'mistakes'. Poor punctuation and capitalization affect readability.", strict_mode=False, evaluation_model='gpt-4o', error=None, evaluation_cost=0.00425, verbose_logs='Criteria:\nGrade the english grammar and syntax. \n \nEvaluation Steps:\n[\n    "Compare the grammatical structure of Input and Actual Output, ensuring they follow standard English grammar rules.",\n    "Evaluate the syntax in both the Input and Actual Output, checking for proper sentence construction and coherence.",\n    "Identify and note any grammatical errors or inconsistencies in the Actual Output compared to the Input.",\n    "Assess the use of punctuation, capitalization, and overall readability in both the Input and Actual Output."\n]')]
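The score above is fractional (0.2238...) even though the judge prompt asks for a whole number between 0 and 10. The G-Eval paper proposes weighting each candidate score by the probability the model assigns to its token and summing the results, which yields such fine-grained values. The sketch below illustrates that arithmetic under stated assumptions: the log-probabilities are made up, the function name `weighted_score` is hypothetical, and the division by 10 to reach a 0 to 1 scale is an assumption about how the result is normalized.

```python
import math

def weighted_score(score_logprobs: dict[int, float], max_score: int = 10) -> float:
    """Probability-weighted average of candidate scores, rescaled to the range 0..1."""
    # Convert log-probabilities to probabilities
    probs = {score: math.exp(lp) for score, lp in score_logprobs.items()}
    total = sum(probs.values())
    # Renormalize, because only a few candidate tokens are considered
    expected = sum(score * p for score, p in probs.items()) / total
    return expected / max_score

# Hypothetical log-probabilities for the score tokens "2" and "3"
made_up_logprobs = {2: math.log(0.75), 3: math.log(0.25)}
print(round(weighted_score(made_up_logprobs), 4))
```

With these made-up values the expected score is 2 × 0.75 + 3 × 0.25 = 2.25, which becomes 0.225 after rescaling.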

In [19]:
print(log_output)




# Evaluating multiple metrics: Conciseness, Answer Relevancy, Toxicity, ...

Check out some other metrics: [https://docs.confident-ai.com/docs/metrics-introduction](https://docs.confident-ai.com/docs/metrics-introduction)

In [20]:
from deepeval.metrics import AnswerRelevancyMetric, ToxicityMetric

def metrics_run(input, output):
    test_case = deepeval.test_case.LLMTestCase(
        input=input,
        actual_output=output
      )

    conciseness_metric = GEval(
        name="Conciseness",
        criteria="Determine how concise the actual output is. Ignore the input.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.INPUT],
    )

    language_metric = GEval(
        name="Language",
        criteria="Grade the english grammar and syntax. Ignore the input.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.INPUT],
    )

    metrics = [
        conciseness_metric,
        language_metric,
        AnswerRelevancyMetric(),
        # ToxicityMetric()
    ]

    eval_result = deepeval.evaluate(
        test_cases=[test_case],
        metrics=metrics,
    )
    return eval_result

In [21]:
llm_input="What is a pipe used for ?"
llm_output_concise="A pipe is a tubular conduit used to transport fluids or sometimes solids."
llm_output_inconcise="Pipes are beautiful, black and round. Because they are round they are very convenient and don't have any edges."

In [22]:
log_output=""
r=metrics_run(llm_input, llm_output_concise)
print_metrics_data(r)

Output()

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ✅ Conciseness (GEval) (score: 0.8523812670840514, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The output is succinct but includes 'or sometimes solids,' which is not strictly necessary given the input., error: None)
  - ✅ Language (GEval) (score: 0.9974042643560711, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The text is grammatically correct, has proper syntax, and follows standard English sentence construction rules., error: None)
  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because the response is fully relevant and directly addresses the question without any irrelevant information. Great job!, error: None)

For test case:

  - input: What is a pipe used for ?
  - actual output: A pipe is a tubular conduit used to transport fluids or sometimes solids.
  - expected output: None
  - context: None
  - retrieval context: None


Overall Metric Pass Ra

input        : What is a pipe used for ?
actual_output: A pipe is a tubular conduit used to transport fluids or sometimes solids.

name         : Conciseness (GEval)
score        : 0.8523812670840514
reason       : The output is succinct but includes 'or sometimes solids,' which is not strictly necessary given the input.
model        : gpt-4o

name         : Language (GEval)
score        : 0.9974042643560711
reason       : The text is grammatically correct, has proper syntax, and follows standard English sentence construction rules.
model        : gpt-4o

name         : Answer Relevancy
score        : 1.0
reason       : The score is 1.00 because the response is fully relevant and directly addresses the question without any irrelevant information. Great job!
model        : gpt-4o

-----------
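Each metric in the summary is reported with a threshold of 0.5 and a pass or fail mark. The sketch below reproduces that bookkeeping with plain Python, using the scores copied from the output above. It assumes that a metric passes when its score reaches the threshold; the names `results`, `THRESHOLD` and `passed` are hypothetical and not DeepEval API.

```python
# (name, score) pairs copied from the output above
results = [
    ("Conciseness (GEval)", 0.8523812670840514),
    ("Language (GEval)", 0.9974042643560711),
    ("Answer Relevancy", 1.0),
]
THRESHOLD = 0.5  # threshold shown in the metrics summary

def passed(score: float, threshold: float = THRESHOLD) -> bool:
    """Assumption: a metric passes when its score reaches the threshold."""
    return score >= threshold

# Share of metrics that passed for this single test case
pass_rate = sum(passed(score) for _, score in results) / len(results)
for name, score in results:
    print(f"{name:22s} {score:.2f} {'pass' if passed(score) else 'fail'}")
print(f"pass rate: {pass_rate:.0%}")
```

Under the same assumption, the earlier Language score of about 0.22 fails the 0.5 threshold, which matches the 0.00% pass rate reported for that run.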


In [23]:
#print(log_output)