# Judge 2 Gold Score Test

This judge evaluates the reasoning a model gives when answering general questions.
The judge assigns a score as follows:
- 0 when the reason does not make sense and the answer is wrong.
- 1 when the reason makes sense but does not lead to the expected answer, and the answer is wrong.
- 2 when the reason makes sense and leads to the expected answer, but the answer is wrong.
- 3 when the reason makes sense and leads to the expected answer, and the answer is correct and consistent with the reason.
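
The scoring scale can be read as a small decision table. The sketch below is only an illustration of that reading, not part of the judge itself: the function name `rubric_score` and its three boolean inputs are hypothetical, and combinations the scale does not describe (for example, a correct answer with a nonsensical reason) return `None` rather than a guessed score.

```python
from typing import Optional


def rubric_score(
    reason_makes_sense: bool,
    reason_leads_to_expected: bool,
    answer_correct: bool,
) -> Optional[int]:
    """Map three yes/no judgments onto the 0-3 scale (hypothetical helper)."""
    if answer_correct:
        # Only the fully consistent case is described for correct answers.
        if reason_makes_sense and reason_leads_to_expected:
            return 3
        return None  # Not covered by the scale above.
    if not reason_makes_sense:
        return 0
    # The answer is wrong but the reason makes sense.
    return 2 if reason_leads_to_expected else 1
```

For instance, an answer of "four" justified by "there are three white ceramic mugs" has a sensible reason that leads to the expected answer, but the answer itself is wrong, so this reading gives it a 2.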

In [15]:
# Imports
import json
import logging
import statistics
import sys
from typing import List

import openai
from openai.types.chat import (
    chat_completion,
    chat_completion_message,
)
import sklearn.metrics

from martian_apart_hack_sdk import exceptions, judge_specs, martian_client, utils
from martian_apart_hack_sdk.models import judge_evaluation, llm_models, router_constraints

## Load Credentials
You must have a .env file with the following values set:

1. `MARTIAN_API_URL` - withmartian.com/api
1. `MARTIAN_API_KEY` - your personal API key
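
Before creating the client, it can help to confirm that both values are present. The following is a minimal sketch that parses `.env`-style text and reports missing keys. It assumes simple `KEY=VALUE` lines; the helper names `parse_env_text` and `missing_keys` are hypothetical, and this is not how `utils.load_config()` is implemented.

```python
from typing import Dict, List

REQUIRED_KEYS = ("MARTIAN_API_URL", "MARTIAN_API_KEY")


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

Calling `missing_keys(parse_env_text(open(".env").read()))` should return an empty list when the file is set up as described.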

In [16]:
# Load the config and make a client.
config = utils.load_config()
client = martian_client.MartianClient(
    api_url=config.api_url,
    api_key=config.api_key,
)

In [17]:
# One quick thing we can do with the client is confirm we have credits.
credit_balance = client.organization.get_credit_balance()
print(credit_balance)

OrganizationBalance(credits=50.0)


## Judging
This cell defines the rubric prompt for the judge model and the judge spec that uses it.

In [23]:
rubric = """
You are tasked with evaluating whether a model's answer to the question is correct and whether its reason makes sense. Read the REASON part of the answer and evaluate it against the question to see whether the reason makes sense and whether it actually leads to the answer, then check whether the ANSWER part corresponds to the reason.
The scoring is as follows:
- 0 when the reason does not make sense and the answer is wrong.
- 1 when the reason is correct according to the context but has nothing to do with the question, and the answer is wrong.
- 2 when the reason makes sense and leads to the expected answer, but the answer is wrong.
- 3 when the reason makes sense and leads to the expected answer, and the answer is correct and corresponds to the reason.
Show the chain of logic that led to your judgment.
"""

rubric_judge_spec = judge_specs.RubricJudgeSpec(
	model_type="rubric_judge",
	rubric=rubric,
	model="openai/openai/gpt-4o",
	min_score=0,
	max_score=3,
)

In [19]:
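# Run the judge spec.
# Note: this function shadows Python's built-in eval(); the name is kept because later cells call it.
def eval(question, answer):
	"""Score one question/answer pair with the rubric judge and return the integer score."""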
	chat_request_text = question
	chat_response_text = answer

	completion_request = {
		"model": llm_models.GPT_4O_MINI,
		"messages": [{"role": "user", "content": chat_request_text}],
	}

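	# Wrap the answer text in an OpenAI-style ChatCompletion so it can be passed to the judge.
	# The id, created, and model fields are placeholder values.
	chat_completion_response = chat_completion.ChatCompletion(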
		id="123",
		choices=[
			chat_completion.Choice(
				finish_reason="stop",
				index=0,
				message=chat_completion_message.ChatCompletionMessage(
					role="assistant",
					content=chat_response_text,
				),
			)
		],
		created=0,
		model="gpt-4o",
		object="chat.completion",
		service_tier=None,
	)

	evaluation_result = client.judges.evaluate_using_judge_spec(
		rubric_judge_spec.to_dict(),
		completion_request=completion_request,
		completion_response=chat_completion_response,
	)

	print(f"Evaluation result: {evaluation_result}")
	return int(evaluation_result.score)

## The Gold Score Test
This section measures the inter-annotator agreement (IAA) between the judge's scores and the expected (gold) scores.

In [20]:
example_prompt = [
    """Original text: This compact study features a single, worn brown leather armchair facing a substantial oak desk. Three white ceramic mugs, filled with various pens and pencils, stand neatly in the desk's left corner, contrasting with the dark wood.
    What is the colour of the leather armchair?
    Answer in the language of the original text, answer in the following format:
    ANSWER: your answer.
    REASON: provide your reason.""",
    """Original text: This compact study features a single, worn brown leather armchair facing a substantial oak desk. Three white ceramic mugs, filled with various pens and pencils, stand neatly in the desk's left corner, contrasting with the dark wood.
    How many white ceramic mugs are there?
    Answer in the language of the original text, answer in the following format:
    ANSWER: your answer.
    REASON: provide your reason.""",
    """Original text: This compact study features a single, worn brown leather armchair facing a substantial oak desk. Three white ceramic mugs, filled with various pens and pencils, stand neatly in the desk's left corner, contrasting with the dark wood.
    How many armchairs are there in the study?
    Answer in the language of the original text, answer in the following format:
    ANSWER: your answer.
    REASON: provide your reason."""
]
example_answer = [
    # expected score: 3
    """ANSWER: brown
    REASON: We know the answer from this sentence in the original text: worn brown leather armchair facing a substantial oak desk""",
    # expected score: 2
    """ANSWER: four
    REASON: The text mentioned that there are three white ceramic mugs filled with various pens and pencils""",
    # expected score: 0
    """ANSWER: the study shows armchairs are better for human bodies
    REASON: The study claims that armchairs are better for human bodies in long-term uses"""
]
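
Because every prompt asks for the `ANSWER:` / `REASON:` layout, it can be useful to split a response into its two parts, for example to inspect them separately. This is a minimal sketch under the assumption that `ANSWER:` comes before `REASON:` and each label appears once; `split_answer_reason` is a hypothetical helper, not part of the SDK.

```python
import re
from typing import Optional, Tuple

# ANSWER text is matched lazily so it stops at the first REASON: label.
_ANSWER_REASON = re.compile(
    r"ANSWER:\s*(?P<answer>.*?)\s*REASON:\s*(?P<reason>.*)",
    re.DOTALL,
)


def split_answer_reason(text: str) -> Optional[Tuple[str, str]]:
    """Return (answer, reason), or None if the text does not follow the format."""
    match = _ANSWER_REASON.search(text)
    if match is None:
        return None
    return match.group("answer").strip(), match.group("reason").strip()
```

A response that ignores the requested format yields `None`, which a caller could treat as a formatting failure before sending anything to the judge.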

In [None]:
import pandas as pd

# Create the client.
openai_client = openai.OpenAI(
    api_key=config.api_key,
    base_url=config.api_url + "/openai/v2"
)
# Tally how many examples received each score.
score = {"0": 0, "1": 0, "2": 0, "3": 0}
for prompt, answer in zip(example_prompt, example_answer):
    evaluation_model_1 = eval(prompt, answer)

    # Count only scores within the rubric's 0-3 range.
    key = str(evaluation_model_1)
    if key in score:
        score[key] += 1
print(score)
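
The tally shows the distribution of scores but not how well the judge agrees with the expected scores. One common agreement statistic is Cohen's kappa. The sketch below is a plain-Python version for illustration; `cohen_kappa` is a hypothetical helper, and the judge scores in the usage note are made-up values, not results from this notebook. `sklearn.metrics.cohen_kappa_score` computes the same statistic if a library implementation is preferred.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(gold: Sequence[int], judged: Sequence[int]) -> float:
    """Cohen's kappa between gold scores and judge scores (unweighted)."""
    if len(gold) != len(judged) or len(gold) == 0:
        raise ValueError("Both score lists must be non-empty and of equal length.")
    n = len(gold)
    # Fraction of items on which the two raters agree.
    observed = sum(g == j for g, j in zip(gold, judged)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    gold_counts = Counter(gold)
    judged_counts = Counter(judged)
    expected = sum(gold_counts[label] * judged_counts[label] for label in gold_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch chooses to report perfect agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

With the expected scores `[3, 2, 0]` from the examples and hypothetical judge scores `[3, 1, 0]`, `cohen_kappa([3, 2, 0], [3, 1, 0])` is 4/7, roughly 0.57. Three examples are far too few for a stable estimate, so a larger gold set would be needed before drawing conclusions.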
