<a href="https://colab.research.google.com/github/DJCordhose/llm-from-prototype-to-production/blob/main/Eval4pptx.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Hands-on: LLM as a judge

Goals:
* see how LLM-as-a-judge works in principle
* get an introduction to the G-Eval algorithm ([G-Eval on arXiv](https://arxiv.org/abs/2303.16634))
* see how the algorithm uses prompts to generate the actual evaluation prompt
* try out the [DeepEval library](https://docs.confident-ai.com/docs/guides-using-custom-llms)


Requirements:
* OpenAI API key

# Setup: create an *llm_run* function calling OpenAI GPT

* define a simple **llm_run** function that calls the GPT model
* try out llm_run

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
%%time

!pip install openai
!pip install deepeval==1.1.1 -q


In [13]:
import os
from google.colab import userdata
from openai import OpenAI

os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [14]:
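def llm_run(messages):
  # Accept a plain string as shorthand for a single user message
  if isinstance(messages, str):
    messages = [{"role": "user", "content": messages}]

  client = OpenAI()  # reads OPENAI_API_KEY from the environment
  completion = client.chat.completions.create(
    model="gpt-4o",
    messages=messages
  )
  # Return only the text of the first choice
  return completion.choices[0].message.content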

Try out our model:

In [15]:
llm_run("who are you ?")

"I'm an artificial intelligence called Assistant, created by OpenAI. I'm here to help you with any questions or information you might need. How can I assist you today?"

# LLM-as-a-judge: in principle

In [16]:
llm_output="Witing texts is painful, caus im making mitakes."

In [17]:
simple_eval_prompt = f'''
You are an expert on the English language, grading a student's text with a score between 0 and 10.
A text written in proper English, in a fluent style, containing no grammatical or syntax errors is graded 10.
A text written in a different language or with spelling errors gets a low score.
Also give a detailed explanation of why the score was chosen.
Do not repeat the student's text in your explanation.

Always answer in the following json format:
{{
    "score": 8,
    "reason": "some reason"
}}

Examples
1. Student Text: Pipes are cylindrical conduits used to transport fluids or gases, typically made of materials like metal, plastic, or concrete.
   Answer:
   {{
    "score": 8,
    "reason": "The text is written in English and does not contain any syntactical or grammatical errors"
  }}
2. Student Text: Zwischen Neonlichtern und Straßenlärm träum ich leise von Freiheit.
   Answer:
   {{
    "score": 2,
    "reason": "The text is written in german and not in english."
  }}

Student Text: {llm_output}
Answer:
'''

In [18]:
llm_run(simple_eval_prompt)

'{\n    "score": 3,\n    "reason": "The text contains multiple spelling errors and informal contractions: \'Witing\' should be \'Writing\', \'caus\' should be \'because\', and \'im\' should be \'I\'m\'. It is understandable but not well-written."\n}'
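
The judge returns its verdict as a JSON string, so it has to be parsed before the score can be used for anything. Below is a minimal sketch, assuming the reply is plain JSON as in the output above. `parse_verdict` and `sample_reply` are hypothetical names introduced for illustration, and the sample reason is shortened.

```python
import json

def parse_verdict(raw, min_score=0, max_score=10):
    """Parse a judge reply of the form {"score": ..., "reason": ...}."""
    verdict = json.loads(raw)
    score = verdict["score"]
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"score is not a number: {score!r}")
    if not min_score <= score <= max_score:
        raise ValueError(f"score {score} outside [{min_score}, {max_score}]")
    return score, verdict["reason"]

# Hypothetical reply, shaped like the one returned above
sample_reply = '{\n    "score": 3,\n    "reason": "The text contains multiple spelling errors."\n}'
score, reason = parse_verdict(sample_reply)
print(score, reason)
```

Validating the range catches a judge that drifts from the requested 0 to 10 scale before the number ends up in a report.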

# LLM-as-a-judge: G-Eval in principle

### **Idea:** given just a criterion, use an LLM to generate a detailed evaluation prompt.

https://arxiv.org/pdf/2303.16634

### G-Eval Phase 1: generate evaluation steps based on the criteria

In [19]:
criteria="Grade the english grammar and syntax"

In [26]:
geval_phase1_prompt = f'''Given an evaluation criteria which outlines how you should judge the Actual Output, generate
3-4 concise evaluation steps based on the criteria below. You MUST make it clear how to evaluate Actual Output in
relation to one another.

Evaluation Criteria:
{criteria}

**
IMPORTANT: Please make sure to only return in JSON format, with the "steps" key as a list of strings. No words or
explanation is needed.

Example JSON:
{{
    "steps": <list_of_strings>
}}
**

Answer:
'''
answer_phase1 = llm_run(geval_phase1_prompt)
print(answer_phase1)

{
    "steps": [
        "Check for grammatical errors such as subject-verb agreement, verb forms, and correct usage of articles and prepositions.",
        "Evaluate sentence structure for clarity, coherence, and variety in sentence length and type.",
        "Assess the punctuation to ensure it enhances readability and accurately conveys the intended message.",
        "Verify the proper use of capitalization, spelling accuracy, and correct syntax throughout the text."
    ]
}


In [27]:
import json

json_answer_step1=json.loads(answer_phase1)
steps="\n".join(f"{index+1}. {step}" for index, step in enumerate(json_answer_step1['steps']))
print(steps)

1. Check for grammatical errors such as subject-verb agreement, verb forms, and correct usage of articles and prepositions.
2. Evaluate sentence structure for clarity, coherence, and variety in sentence length and type.
3. Assess the punctuation to ensure it enhances readability and accurately conveys the intended message.
4. Verify the proper use of capitalization, spelling accuracy, and correct syntax throughout the text.


### G-Eval Phase 2: evaluate the llm_output using the generated steps

In [28]:
llm_input="Why do you dislike writing texts ?"
llm_output="Witing texts is painful, caus im making mitakes."

In [29]:
geval_phase2_prompt = f'''
Given the evaluation steps, return a JSON with two keys:
1) a `score` key ranging from 0 - 10, with 10 being that it follows the criteria outlined in the steps and 0 being that it does not, and
2) a `reason` key, a reason for the given score, but DO NOT QUOTE THE SCORE in your reason.
Please mention specific information from Actual Output and Input in your reason, but be very concise with it!

Evaluation Steps:
{steps}

Actual Output:
{llm_output}

Input:
{llm_input}

**
IMPORTANT: Please make sure to only return in JSON format, with the "score" and "reason" key. No words or explanation is needed.

Example JSON:
{{
    "score": 0,
    "reason": "The text does not follow the evaluation steps provided."
}}
**

JSON:
'''

In [30]:
answer_phase2=llm_run(geval_phase2_prompt)
print(answer_phase2)

```json
{
    "score": 2,
    "reason": "Multiple grammatical errors such as 'witing' and 'mitakes'; improper sentence structure and lack of clarity; poor punctuation; inconsistent capitalization and spelling mistakes present."
}
```
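
Note that this reply is wrapped in a Markdown code fence, although the prompt asked for JSON only, so passing it straight to `json.loads` would fail. A minimal sketch of a tolerant parser follows; `extract_json` is a hypothetical helper, and the sample reply uses a shortened reason.

```python
import json
import re

# Three backticks, built at runtime so this example's own fence stays intact
FENCE = "`" * 3

def extract_json(raw):
    """Parse a JSON reply, with or without a surrounding Markdown fence."""
    match = re.search(r"`{3}(?:json)?\s*(.*?)\s*`{3}", raw, flags=re.DOTALL)
    payload = match.group(1) if match else raw
    return json.loads(payload)

# Hypothetical reply, shaped like the fenced output above
fenced_reply = f'{FENCE}json\n{{\n    "score": 2,\n    "reason": "Multiple spelling errors."\n}}\n{FENCE}'
result = extract_json(fenced_reply)
print(result["score"])
```

The same helper accepts unfenced replies, so it can be used for the phase 1 answer as well.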


# G-Eval Implementation by DeepEval

**geval_run** calls DeepEval's G-Eval implementation, passing our criteria, llm_input and llm_output.

In [32]:
import deepeval
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def geval_run(name, criteria, input, output):
    # Wrap the prompt/response pair in a single DeepEval test case
    test_case = LLMTestCase(input=input, actual_output=output)
    metric = GEval(
        name=name,
        criteria=criteria,
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.INPUT],
        model=None,  # DeepEval defaults to OpenAI
    )
    return deepeval.evaluate(test_cases=[test_case], metrics=[metric])

In [33]:
log_output=""
r = geval_run("Language", "Grade the english grammar and syntax.", llm_input, llm_output )

Output()

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ❌ Language (GEval) (score: 0.1922801796419053, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The Actual Output has multiple grammatical errors such as 'Witing' and 'caus im making mitakes', which do not align with the Input's correct grammar. The syntax structure is also poor compared to the Input, and there are inconsistencies in tense and subject-verb agreement., error: None)

For test case:

  - input: Why do you dislike writing texts ?
  - actual output: Witing texts is painful, caus im making mitakes.
  - expected output: None
  - context: None
  - retrieval context: None


Overall Metric Pass Rates

Language (GEval): 0.00% pass rate
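
The score reported here (0.19...) is a fraction between 0 and 1 rather than an integer between 0 and 10. The G-Eval paper proposes weighting each candidate score by the probability the model assigns to it, instead of taking the single sampled score, which would explain such fractional values. Here is a minimal sketch of that weighting. The probabilities are made up for illustration, and dividing by 10 to reach the 0 to 1 range is an assumption about the normalization, not taken from DeepEval's source.

```python
def weighted_score(score_probs, max_score=10):
    """Probability-weighted mean of candidate scores, scaled to 0..1.

    score_probs maps a candidate integer score to the probability the judge
    assigns to it (hypothetical values in this sketch).
    """
    total = sum(score_probs.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    # Renormalize in case only the top few candidates are available
    expected = sum(s * p for s, p in score_probs.items()) / total
    return expected / max_score

# Hypothetical distribution: the judge mostly hesitates between 1, 2 and 3
example_probs = {1: 0.25, 2: 0.5, 3: 0.25}
print(weighted_score(example_probs))  # 0.2
```

Compared with a single integer score, the weighted value is finer-grained and less sensitive to sampling noise.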




# Evaluating multiple metrics: Conciseness, Answer Relevancy, Toxicity, ...

Check out some other metrics: [https://docs.confident-ai.com/docs/metrics-introduction](https://docs.confident-ai.com/docs/metrics-introduction)

In [39]:
from deepeval.metrics import AnswerRelevancyMetric, ToxicityMetric

def metrics_run(input, output):
    test_case = deepeval.test_case.LLMTestCase(
        input=input,
        actual_output=output
      )

    conciseness_metric = GEval(
        name="Conciseness",
        criteria="Determine how concise the actual output is. Ignore the input.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.INPUT],
    )

    language_metric = GEval(
        name="Language",
        criteria="Grade the english grammar and syntax. Ignore the input.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.INPUT],
    )

    metrics = [
        conciseness_metric,
        language_metric,
        AnswerRelevancyMetric(),
        ToxicityMetric()
    ]

    eval_result = deepeval.evaluate(
        test_cases=[test_case],
        metrics=metrics,
    )
    return eval_result

In [41]:
r=metrics_run(llm_input, llm_output)

Output()

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ❌ Conciseness (GEval) (score: 0.44643985836491035, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The actual output conveys the message but contains spelling errors and redundancies like 'caus' and 'mitakes' which should be corrected for clarity., error: None)
  - ❌ Language (GEval) (score: 0.22637047689837106, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: There are punctuation and capitalization errors, such as 'Witing' and 'caus im'. The sentence structure is mostly understandable but contains grammatical errors like 'caus im' and 'mitakes'. Overall fluency and readability are low due to these issues., error: None)
  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 1.00 because the output is perfectly relevant and addresses the input directly with no irrelevant statements. Great job!, error: None)
  - ✅ Toxicity (score: 0.0, threshold: 0.5, strict: False, evaluat

In [42]:
llm_input="What is a pipe used for ?"
llm_output_concise="A pipe is a tubular conduit used to transport fluids or sometimes solids."
llm_output_inconcise="Pipes are beautiful, black and round. Because they are round they are very convenient and don't have any edges."

In [46]:
r=metrics_run(llm_input, llm_output_inconcise)

Output()

Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...




Metrics Summary

  - ❌ Conciseness (GEval) (score: 0.14139483425727736, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The actual output does not answer the input question about the use of a pipe and contains superfluous details about its shape and color., error: None)
  - ✅ Language (GEval) (score: 0.7120739590084814, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The Actual Output mostly follows the criteria, but the sentence 'Because they are round they are very convenient and don't have any edges.' could be clearer with better punctuation (e.g., a comma after 'round')., error: None)
  - ❌ Answer Relevancy (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: The score is 0.00 because the statements provided discuss the appearance and shape of pipes rather than their use, which does not address the actual question., error: None)
  - ✅ Toxicity (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-4o, reason: T
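
The summaries above suggest that the threshold is read in a different direction depending on the metric: Answer Relevancy fails at 0.0, while Toxicity passes at 0.0, both with a threshold of 0.5. A minimal sketch of that rule, inferred only from the printed results; `passes` and `higher_is_better` are hypothetical names, and the exact boundary handling (`>=` versus `>`) is an assumption.

```python
def passes(score, threshold=0.5, higher_is_better=True):
    """Return True if a metric score clears its threshold.

    For quality metrics (e.g. Language) a higher score is better;
    for Toxicity a lower score is better.
    """
    if higher_is_better:
        return score >= threshold
    return score <= threshold

# Values taken from the last run above
print(passes(0.7120739590084814))                  # Language
print(passes(0.14139483425727736))                 # Conciseness
print(passes(0.0))                                 # Answer Relevancy
print(passes(0.0, higher_is_better=False))         # Toxicity
```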