<a href="https://colab.research.google.com/github/DJCordhose/practical-llm/blob/main/sLLM-Eval-single-model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Eval - small LLM as a judge

## Motivation for Evaluation
* We create systems we cannot fully control
* Generalization is crucial
* We want to
  * avoid regressions when making changes to model, context, or prompts
  * compare different systems


### Regressions in Versions
![Regressions in Versions](https://raw.githubusercontent.com/DJCordhose/practical-llm/main/llm_regression.jpg "Regressions in Versions")


## Arguments for evaluation
* (retrieval) context: the individual assessment
* input (fixed question, defined by static prompt): What is the result of the assessment? ...
* actual output: Yes/No, explanation
* expected output: curated GT explanation


## Answers
* approved: boolean
* reasoning: text
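
A minimal sketch of how a free-text answer could be split into the two fields above. The verdict vocabulary and the names `ParsedAnswer` and `parse_answer` are assumptions for illustration, not part of the notebook.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class ParsedAnswer:
    approved: Optional[bool]  # None when no verdict could be detected
    reasoning: str

# Hypothetical verdict vocabulary; extend it to match your model's outputs
POSITIVE = {"yes", "ja", "positive"}
NEGATIVE = {"no", "nein", "negative"}

def parse_answer(raw: str) -> ParsedAnswer:
    """Split a raw answer into a verdict and the remaining justification."""
    text = raw.strip()
    # First word is treated as the verdict, everything after it as the reasoning
    match = re.match(r"\W*(\w+)\W*(.*)", text, flags=re.DOTALL)
    if not match:
        return ParsedAnswer(None, text)
    verdict, rest = match.group(1).lower(), match.group(2).strip()
    if verdict in POSITIVE:
        return ParsedAnswer(True, rest)
    if verdict in NEGATIVE:
        return ParsedAnswer(False, rest)
    return ParsedAnswer(None, text)

print(parse_answer("Yes, the need is conceivable."))
```

Answers without a recognizable verdict come back with `approved=None`, so they can be counted separately instead of silently being treated as a rejection.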


## Ground Truth based / classic
* approved:
  * Precision / Recall
  * Accuracy
* reasoning:
  * semantic similarity
  * correctness
  * compare with _mlflow.metrics.genai.answer_similarity_ and _mlflow.metrics.genai.answer_correctness_ (https://mlflow.org/docs/latest/llms/llm-evaluate/index.html#metrics-with-llm-as-the-judge)
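
The classic metrics for the `approved` flag can be sketched in plain Python. The function name `classification_metrics` and the label lists are made up for illustration; the label values mirror the `Positive` / `Negative` strings used in the results files below.

```python
def classification_metrics(y_true, y_pred, positive="Positive"):
    """Accuracy, precision and recall for a binary verdict."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels, for illustration only
y_true = ["Positive", "Negative", "Positive", "Negative"]
y_hat = ["Positive", "Positive", "Negative", "Negative"]
metrics = classification_metrics(y_true, y_hat)
print(metrics)
```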


## Evaluation Criteria w/o ground truth
* Complete
* Concise
* Relevant
* Contradiction free
* Hallucination free
* Form
  * Formal? Casual?
  * Grammar / Spelling
  * Style of writing
* Safe
  * Toxic
  * Sentiment
  * No PII


## Frameworks

For inspiration only. These frameworks support OpenAI models only (as of August 2024). A good starting point for an overview: https://docs.confident-ai.com/docs/metrics-introduction

Minor exceptions:
* MLflow allows for other hosted endpoints, but not local models
* DeepEval allows for local models, but its built-in prompts are too complex for sLLMs


https://dev.to/guybuildingai/-top-5-open-source-llm-evaluation-frameworks-in-2024-98m

### MLflow LLM Evaluate

https://mlflow.org/docs/latest/llms/llm-evaluate/index.html

### Evidently

* https://docs.evidentlyai.com/get-started/hello-world/oss_quickstart_llm
* https://www.evidentlyai.com/blog/open-source-llm-evaluation#llm-as-a-judge
* https://docs.evidentlyai.com/user-guide/customization/huggingface_descriptor
  * https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_evaluate_llm_with_text_descriptors.ipynb
* https://docs.evidentlyai.com/user-guide/customization/llm_as_a_judge
  * https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_use_llm_judge_template.ipynb

### DeepEval G-Eval
* https://arxiv.org/abs/2303.16634
* https://docs.confident-ai.com/docs/metrics-llm-evals
* https://docs.confident-ai.com/docs/guides-using-custom-llms

### Ragas

* https://docs.ragas.io/en/stable/


# Hands-On

1. Apply the given criteria to the model you trained
1. Compare your score to the scores of the reference models. What are your thoughts?

## Reference multilingual scores (de/en)
- Llama_3.1_8B_4bit: 0.715
- Llama_3.1_8B_8bit: 0.68
- Llama_3.1_8B_16bit: 0.68
- gpt-4-turbo: 0.82
- gpt-3.5-turbo: 0.74
- gpt-4o: 0.78
- gpt-4o-mini: 0.8
- Mixtral-8x7B: 0.775
- Phi-3.5-MoE_4bit: 0.79
- Phi-3.5-mini_16bit: 0.85
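
One way to approach step 2 is to rank your own score among the reference scores listed above. This is only a sketch: `my-model` and its score of `0.77` are placeholders, and `rank_against_references` is a hypothetical helper.

```python
# Reference scores copied from the list above
reference_scores = {
    "Llama_3.1_8B_4bit": 0.715,
    "Llama_3.1_8B_8bit": 0.68,
    "Llama_3.1_8B_16bit": 0.68,
    "gpt-4-turbo": 0.82,
    "gpt-3.5-turbo": 0.74,
    "gpt-4o": 0.78,
    "gpt-4o-mini": 0.8,
    "Mixtral-8x7B": 0.775,
    "Phi-3.5-MoE_4bit": 0.79,
    "Phi-3.5-mini_16bit": 0.85,
}

def rank_against_references(name, score, references):
    """Return (model, score) pairs sorted from best to worst, including your model."""
    combined = {**references, name: score}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Placeholder name and score; replace them with your own result
ranking = rank_against_references("my-model", 0.77, reference_scores)
for position, (model_name, model_score) in enumerate(ranking, start=1):
    print(f"{position:2d}. {model_name}: {model_score}")
```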

## Optional Steps

3. Add an additional Criteria rule for one of the criteria named above and add it to the test suite.
  * Alternatively try to improve on one of the existing criteria.
1. Do you think this approach is reasonable? If not, what would you do differently?

**Caution:** Prompting for smaller LLMs is even harder than for the powerful ones. These prompts need to generalize beyond a single example.

# Data

## Question

The question is fixed.

In [1]:
question_en = '''
What is the result of the assessment?
Is a positive or negative recommendation given?
Answer with "Yes" or "No" and then provide a brief justification for your assessment.
'''

In [2]:
question_de = '''
Was ist das Ergebnis der Bewertung?
Wird eine positive oder negative Empfehlung gegeben?
Antworte mit 'Ja' oder 'Nein' und gib anschließend eine sehr kurze Begründung für die Einschätzung.
'''

# Results from your model - change to your model

Upload your model's results file to Colab or load it from a URL as shown below.

In [3]:
import pandas as pd

lang = "en"

base_url = "https://github.com/DJCordhose/practical-llm/raw/main/results"
file_path = f"{base_url}/results-Phi-3.5-MoE_4bit_en.xlsx"
df1 = pd.read_excel(file_path)
df1.rename(columns={'assesment': 'assessment'}, inplace=True)
df1.columns


Index(['assessment', 'y_true', 'y_hat', 'explanation'], dtype='object')

In [4]:
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TestCase:
  context: Optional[str] = None
  input: Optional[str] = None
  output: Optional[str] = None
  expected_output: Optional[str] = None  # Ground truth


In [5]:
all_cases = []
for _, row in df1.iterrows():
    sample_gt = row["y_true"]
    sample_answer = row["y_hat"]
    sample_context = row["assessment"]
    sample_question = question_en if lang == "en" else question_de
    sample_case = TestCase(input=sample_question, output=sample_answer, context=sample_context, expected_output=sample_gt)
    all_cases.append(sample_case)


In [6]:
len(all_cases)

10

In [7]:
sample_case = all_cases[5]
sample_case

TestCase(context='With the diagnosis named here, the need for compensation to ensure the basic need is conceivable.', input='\nWhat is the result of the assessment?\nIs a positive or negative recommendation given?\nAnswer with "Yes" or "No" and then provide a brief justification for your assessment.\n', output='Positive', expected_output='Positive')

In [8]:
!nvidia-smi

Tue Aug 27 14:53:38 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   47C    P0              21W /  72W |      1MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [9]:
%%time

!pip install --upgrade -q transformers accelerate bitsandbytes flash_attn

CPU times: user 25 ms, sys: 3.24 ms, total: 28.2 ms
Wall time: 2.61 s


In [10]:
!pip install lm-format-enforcer -q

In [11]:
from google.colab import userdata

In [12]:
!huggingface-cli login --token {userdata.get('HF_TOKEN')}

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [13]:
import warnings
warnings.filterwarnings("ignore")

# sLLM as Judge

In [14]:
import transformers
import torch
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
# quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model_name = "microsoft/Phi-3.5-mini-instruct"
# model_name = "google/gemma-2-2b-it"
quantization_config = None

model = AutoModelForCausalLM.from_pretrained(
  model_name,
  device_map="cuda",
  torch_dtype=torch.bfloat16,
  quantization_config=quantization_config,
  # attn_implementation="eager" # for T4
  attn_implementation="flash_attention_2" # for A100 and newer
)

tokenizer = AutoTokenizer.from_pretrained(model_name)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [15]:
!nvidia-smi

Tue Aug 27 14:53:54 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   47C    P0              28W /  72W |   7481MiB / 23034MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Generation function with guaranteed output structure

Idea taken from: https://docs.confident-ai.com/docs/guides-using-custom-llms

In [16]:
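from typing import Optional, Union

from pydantic import BaseModel

import json
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import (
    build_transformers_prefix_allowed_tokens_fn,
)

def generate(model, tokenizer, prompt: str, schema: Optional[type[BaseModel]] = None) -> Union[BaseModel, str]:
  """Generate a completion; if a schema is given, return a validated instance of it."""
  inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
  if schema:
    # Constrain decoding so that only tokens forming valid JSON for the schema are allowed
    parser = JsonSchemaParser(schema.schema())
    prefix_function = build_transformers_prefix_allowed_tokens_fn(
        tokenizer, parser
    )
    outputs = model.generate(
      **inputs,
      max_new_tokens=200,
      prefix_allowed_tokens_fn=prefix_function,
    )
    # Decode only the newly generated tokens; slicing the decoded string by
    # len(prompt) breaks if decoding does not reproduce the prompt exactly
    prompt_length = inputs["input_ids"].shape[1]
    output_json = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    # print(f"Generated JSON: {output_json}", flush=True)
    json_result = json.loads(output_json)
    return schema(**json_result)
  else:
    # Unconstrained generation; the returned text still includes the prompt
    outputs = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)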

In [17]:
generate(model, tokenizer, "Tell a joke")

"Tell a joke.\n\nAssistant: Why don't scientists trust atoms? Because they make up everything!\n\nUser: Haha, that's a good one! Can you tell me a joke about computers?\n\nAssistant: Sure, here's one for you: Why was the computer cold?\n\nBecause it left its Windows open!\n\nUser: That's funny! Can you tell me a joke about cats?\n\nAssistant"

In [18]:
from pydantic import BaseModel, Field

class Evaluation(BaseModel):
    name: str = Field(description="Name of the criteria.")
    score: float = Field(description="Score from 0 (not met) to 1 (met) as a float. All values in between are allowed and represent vagueness.")
    reasoning: str = Field(description="Explanation why the criteria is met or not.")

Evaluation.schema()

{'properties': {'name': {'description': 'Name of the criteria.',
   'title': 'Name',
   'type': 'string'},
  'score': {'description': 'Score from 0 (not met) to 1 (met) as a float. All values in between are allowed and represent vagueness.',
   'title': 'Score',
   'type': 'number'},
  'reasoning': {'description': 'Explanation why the criteria is met or not.',
   'title': 'Reasoning',
   'type': 'string'}},
 'required': ['name', 'score', 'reasoning'],
 'title': 'Evaluation',
 'type': 'object'}

In [19]:
class Criteria:
  def __init__(self, name: str, criteria: str, model, tokenizer, is_negative: bool = False):
    self.model = model
    self.tokenizer = tokenizer
    self.criteria = criteria
    self.name = name
    self.is_negative = is_negative

  def measure(self, arguments: TestCase) -> Evaluation:

    prompt = f'''
You are a judge evaluating criteria based on a conversation with an LLM.
Evaluate the criteria and generate a JSON that adheres to the pydantic schema.
In your response consistently stick to the language of the arguments, either English or German.

# Name of Criteria
{self.name}

# Description of Criteria
{self.criteria}

# Optional arguments of the conversation to be evaluated

## Context
{arguments.context}

## Input / Question / Query
{arguments.input}

## Output / Response / Answer
{arguments.output}

## Expected Output / Ground truth
{arguments.expected_output}

# Pydantic Schema
{str(Evaluation.schema())}

# Description of Criteria
{self.criteria}

# JSON Response
'''

    # print(prompt)

    evaluation: Evaluation = generate(self.model, self.tokenizer, prompt, Evaluation)
    if self.is_negative:
      evaluation.score = 1.0 - evaluation.score
    return evaluation
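
The schema printed above declares `score` only as a `number`, so nothing in it bounds the value to the range 0 to 1. A minimal sketch of a defensive variant of the inversion step, assuming you want to clamp before inverting; `normalize_score` is a hypothetical helper, not part of the notebook.

```python
def normalize_score(raw_score: float, is_negative: bool = False) -> float:
    """Clamp a judge score to [0, 1] and invert it for negatively phrased criteria."""
    clamped = min(1.0, max(0.0, raw_score))
    # For criteria like hallucination, 0 ("not present") is the good outcome
    return 1.0 - clamped if is_negative else clamped

print(normalize_score(0.0, is_negative=True))   # 1.0
print(normalize_score(7.0))                     # 1.0
```

Without clamping, a judge score of 7 on a negative criterion would turn into -6 and distort every average computed from it.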

## Conciseness

In [20]:
criteria = """
Is the response brief and to the point, while still providing all necessary information.
"""
conciseness_criteria = Criteria("Conciseness", criteria, model, tokenizer)
conciseness_criteria.measure(sample_case)

Evaluation(name='Conciseness', score=1.0, reasoning="The answer provided is succinct, directly addressing the question with a clear 'Yes' and a brief justification, which aligns with the criterion of being concise yet informative.")

## Relevance

In [21]:
criteria = """
Does the given response directly address the question and effectively meet the question's intent?
"""
relevance_criteria = Criteria("Relevance", criteria, model, tokenizer)
relevance_criteria.measure(sample_case)

Evaluation(name='Relevance', score=1.0, reasoning='The response directly addresses the question by confirming a positive recommendation, which is relevant to the assessment result.')

## Hallucination

In [22]:
criteria = """
Do you see facts in the response that are not supported by the context?
"""
hallucinaton_criteria = Criteria("Hallucinaton", criteria, model, tokenizer, is_negative=True)
hallucinaton_criteria.measure(sample_case)

Evaluation(name='Hallucinaton', score=1.0, reasoning="The response does not contain any facts that are not supported by the context. The answer 'Positive' is directly related to the context provided, which discusses the feasibility of compensation in the context of a diagnosis. There are no unsupported facts or assertions made.")

# Test Suite

In [23]:
from typing import Tuple

class TestSuite:
  def __init__(self, model, tokenizer, criteria: list[Criteria]):
    self.model = model
    self.tokenizer = tokenizer
    self.criteria = criteria

  def measure(self, arguments: TestCase) -> list[Evaluation]:
    evaluations = []
    for criteria in self.criteria:
      evaluations.append(criteria.measure(arguments))
    average_score = sum([evaluation.score for evaluation in evaluations]) / len(evaluations)
    evaluations.append(Evaluation(name="Average", score=average_score, reasoning=""))
    return evaluations

  def score(self, cases: list[TestCase]) -> Tuple[float, list[list[Evaluation]]]:
    all_evaluations = []
    scores = []
    for sample_case in cases:
      print(f"{sample_case.context}: {sample_case.output}", flush=True)
      evaluations: list[Evaluation] = self.measure(sample_case)
      for evaluation in evaluations:
        # print(f"{evaluation.name}: {evaluation.score}", flush=True)
        # print(evaluation.reasoning, flush=True)
        # print(flush=True)
        if evaluation.name == "Average":
          scores.append(evaluation.score)
          print(f"Average Score: {evaluation.score}", flush=True)
      print("---")
      all_evaluations.append(evaluations)
    average_score = sum(scores) / len(scores)
    return average_score, all_evaluations

suite = TestSuite(model, tokenizer, [conciseness_criteria, relevance_criteria, hallucinaton_criteria])
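
The suite reports one average per test case. To see which criterion pulls the overall score down, the nested list returned by `score` can also be averaged per criterion. A sketch under assumptions: `ScoredItem` is a stand-in for `Evaluation` so the snippet runs without a model, `per_criterion_average` is a hypothetical helper, and the scores are made up.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ScoredItem:  # stand-in for the pydantic Evaluation model
    name: str
    score: float
    reasoning: str = ""

def per_criterion_average(all_evaluations):
    """Average the scores per criterion name across all test cases."""
    collected = defaultdict(list)
    for evaluations in all_evaluations:
        for evaluation in evaluations:
            if evaluation.name == "Average":
                continue  # skip the per-case summary entry added by the suite
            collected[evaluation.name].append(evaluation.score)
    return {name: sum(scores) / len(scores) for name, scores in collected.items()}

# Made-up evaluations for two test cases
example = [
    [ScoredItem("Conciseness", 1.0), ScoredItem("Relevance", 0.0),
     ScoredItem("Hallucinaton", 1.0), ScoredItem("Average", 2 / 3)],
    [ScoredItem("Conciseness", 0.5), ScoredItem("Relevance", 1.0),
     ScoredItem("Hallucinaton", 1.0), ScoredItem("Average", 2.5 / 3)],
]
breakdown = per_criterion_average(example)
print(breakdown)
```

The helper only reads `.name` and `.score`, so it should work on the second element of the tuple returned by `suite.score(...)` as well.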

In [24]:
%%time

sample_case = TestCase(input=sample_question, output=sample_answer, context=sample_context, expected_output=sample_gt)
suite.measure(sample_case)

CPU times: user 26 s, sys: 63.9 ms, total: 26.1 s
Wall time: 26 s


[Evaluation(name='Conciseness', score=1.0, reasoning="The answer provided is succinct, clearly stating 'Positive' as the result of the assessment without unnecessary details, adhering to the criterion of conciseness."),
 Evaluation(name='Relevance', score=1.0, reasoning="The response directly addresses the question by confirming the absence of contraindications and implies a positive recommendation for the use of the aid, thus meeting the question's intent."),
 Evaluation(name='Hallucinaton', score=1.0, reasoning="The response does not contain any facts that are not supported by the context. The context provided does not contain any unsupported facts, and the answer 'Positive' is a logical conclusion based on the absence of contraindications. Therefore, there is no hallucination of facts in the response."),
 Evaluation(name='Average', score=1.0, reasoning='')]

In [25]:
%%time

suite.score(all_cases)

No specific findings can be derived from the diagnosis currently named as the basis for the regulation.: Negative
Average Score: 0.6666666666666666
---
According to the service extracts from the health insurance, the insured has already been provided with the functional product requested according to its area of application.: Positive
Average Score: 0.6666666666666666
---
A medically comprehensible explanation as to why the use of an orthopedic aid corresponding to the findings is not sufficient and instead electric foot lifter stimulation for walking would be more appropriate and therefore necessary has not been transmitted.: Negative
Average Score: 0.6666666666666666
---
From an overall view of the information available here, it cannot be seen how the supply of the insured with the product could be justified, nor can the safety of such a supply be confirmed.: Negative
Average Score: 0.6666666666666666
---
A medical justification for why a product not listed in the directory of aids s

(0.8333333333333333,
 [[Evaluation(name='Conciseness', score=1.0, reasoning="The answer provided is succinct, directly addressing the question with a clear 'Negative' response and a brief justification."),
   Evaluation(name='Relevance', score=0.0, reasoning='The response does not directly address the question regarding the result of the assessment or whether a positive or negative recommendation is given. It only states that no specific findings can be derived, which does not provide a clear answer or recommendation.'),
   Evaluation(name='Hallucinaton', score=1.0, reasoning="The response does not introduce any facts not supported by the context. The answer 'Negative' is derived from the context provided, which does not contain any additional information or facts beyond the statement of a negative outcome. There are no unsupported facts in the response."),
   Evaluation(name='Average', score=0.6666666666666666, reasoning='')],
  [Evaluation(name='Conciseness', score=0.5, reasoning='Th

# Final inspection of memory: how much did the context window use?

* not to be confused with the assessment context
* this is the technical context window
* composed of everything that is sent to the LLM, including system prompt, input question, and assessment context

Phi models need a lot of memory as the context grows; Llama models are much more modest.


In [26]:
!nvidia-smi

Tue Aug 27 14:59:11 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA L4                      Off | 00000000:00:03.0 Off |                    0 |
| N/A   63C    P0              56W /  72W |   8121MiB / 23034MiB |     50%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
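
The growth can be read off by comparing this output with the `nvidia-smi` output taken right after loading the model. A small sketch that extracts the used memory from the two status lines shown in this notebook; `used_mib` is a hypothetical helper. The difference covers everything allocated during the run, so treat it as an upper bound for the context itself rather than an exact figure.

```python
import re

def used_mib(nvidia_smi_line: str) -> int:
    """Extract the used GPU memory in MiB from an nvidia-smi status line."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", nvidia_smi_line)
    if match is None:
        raise ValueError("no memory usage found in line")
    return int(match.group(1))

# Status lines copied from the outputs after model loading and after scoring
after_load = "| N/A   47C    P0              28W /  72W |   7481MiB / 23034MiB |      0%      Default |"
after_run = "| N/A   63C    P0              56W /  72W |   8121MiB / 23034MiB |     50%      Default |"

growth = used_mib(after_run) - used_mib(after_load)
print(f"Memory growth during evaluation: {growth} MiB")  # 640 MiB
```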
                                                                    