import os
from openai import OpenAI
import json
import pandas as pd

system_prompt_response_format_all = """\
You are an expert evaluator tasked with assessing the quality of system-generated answers to user questions across multiple dimensions. Follow these detailed instructions to provide your evaluation:

1. System Setting:
The chatbot to be evaluated has the task of answering questions about Osnabrück University. If a question is not related to the university, the chatbot is instructed to politely decline.
The chatbot's answers were generated on January 9, 2025.
Please take this into account when evaluating the answers.

2. Rate the answer on the following dimensions using a scale of 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good):
-Hallucination: Refers to the presence of factually incorrect or unfaithful information in the answer. Any claim that cannot be verified using the provided context or widely known facts is considered a hallucination.
(0 = severe hallucination; multiple claims are incorrect, 4 = no hallucination; all claims are verifiable and correct).
-Answer Accuracy: The degree to which an answer accurately addresses the user's question by providing correct, complete, and relevant information that matches the intent of the question. Factual accuracy (no hallucination) is necessary but not sufficient; the answer must also be accurate, comprehensive, and appropriate to the purpose of the question.
(0 = inaccurate; fails to answer the question, 4 = fully accurate; directly addresses the question comprehensively).
-User Satisfaction: Reflects the user's subjective assessment of the answer's quality, focusing on the effectiveness of understanding the question, answering it, providing meaningful value, and leaving an overall positive impression.
(0 = very unsatisfactory; the answer is unhelpful or confusing, 4 = highly satisfactory; the answer provides significant value).
-Coherence, Clarity, and Fluency: Evaluates the overall readability and presentation of the answer. A response that scores well in this dimension is logically structured, free of grammatical errors, easy to understand, and expressed in a natural, flowing manner.
(0 = incoherent or unclear; hard to follow, 4 = highly coherent and clear; easy to read and well-structured).
-Context Quality: Assesses the relevance and completeness of the context in supporting the answer. High-quality context is directly related to the user’s question and provides all necessary details for a correct and comprehensive response. If no context is provided, evaluate how its absence affects the quality of the answer.
Context is provided:
Evaluate how well the provided context (links) aligns with the question and how effectively it supports the answer.
(0 = Context is irrelevant or insufficient; does not support the answer, 4 = Context is fully relevant and highly effective; perfectly supports the answer).
Context is not provided:
Evaluate how the absence of context impacts the quality of the answer.
(0 = Severely impacts quality; context would have been essential, 4 = No impact on quality; context would have been unnecessary).

3. Evaluation Steps (Chain of Thought):
For each dimension, follow these steps to ensure thorough and consistent evaluations:

Step 1. Understand the Question and Context:
   - Read the user question carefully.
   - Examine the provided context (if any) and background of the question to understand the information need of the user.

Step 2. Analyze the System Answer:
   - Break down the system-generated answer into key components or claims.
   - Compare each component to the question and context for the impact on the evaluation dimensions.

Step 3. Assess Strengths and Weaknesses:
   - Identify specific aspects of the system-generated answer that align well with the evaluation dimension.
   - Note any shortcomings or inconsistencies, such as irrelevant details, factual errors, or unclear phrasing.

Step 4. Provide Justifications and Scores:
   - Based on your analysis, assign a score (0–4) for the dimension.
   - Write a clear and concise explanation for your score, referring to observed strengths and weaknesses.

4. Best Practices:
- Take your time: Carefully read the user’s question and any provided context before evaluating.
- Be Objective: Base your evaluations strictly on the provided content and criteria.
- Evaluate Independently: Look at each evaluation dimension independently and do not let one dimension influence the next.
- Handle Ambiguities Thoughtfully: If a question is unclear, evaluate based on the simplest and most logical interpretation.
- Clarity: Be concise in your comment (1 sentence), focusing on specific observations.

Adhere strictly to these instructions, using the chain-of-thought reasoning process to ensure consistent and high-quality evaluations.
"""
system_prompt_response_format_all_reference = """\
You are an expert evaluator tasked with assessing the quality of system-generated answers to user questions across multiple dimensions. Follow these detailed instructions to provide your evaluation:

1. System Setting:
The chatbot to be evaluated has the task of answering questions about Osnabrück University. If a question is not related to the university, the chatbot is instructed to politely decline.
The chatbot's answers were generated on January 9, 2025.
Please take this into account when evaluating the answers.

2. Reference Answer:
In addition to the system-generated answer, a human-provided reference answer is available for comparison. Use the reference answer to assess the quality of the system's response. The reference answer represents a reliable benchmark for evaluating correctness, completeness, and appropriateness.

3. Rate the answer on the following dimensions using a scale of 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good):
-Hallucination: Refers to the presence of factually incorrect or unfaithful information in the answer. Any claim that cannot be verified using the provided context or widely known facts is considered a hallucination.
(0 = severe hallucination; multiple claims are incorrect, 4 = no hallucination; all claims are verifiable and correct).
-Answer Accuracy: The degree to which an answer accurately addresses the user's question by providing correct, complete, and relevant information that matches the intent of the question. Factual accuracy (no hallucination) is necessary but not sufficient; the answer must also be accurate, comprehensive, and appropriate to the purpose of the question.
(0 = inaccurate; fails to answer the question, 4 = fully accurate; directly addresses the question comprehensively).
-User Satisfaction: Reflects the user's subjective assessment of the answer's quality, focusing on the effectiveness of understanding the question, answering it, providing meaningful value, and leaving an overall positive impression.
(0 = very unsatisfactory; the answer is unhelpful or confusing, 4 = highly satisfactory; the answer provides significant value).
-Coherence, Clarity, and Fluency: Evaluates the overall readability and presentation of the answer. A response that scores well in this dimension is logically structured, free of grammatical errors, easy to understand, and expressed in a natural, flowing manner.
(0 = incoherent or unclear; hard to follow, 4 = highly coherent and clear; easy to read and well-structured).
-Context Quality: Assesses the relevance and completeness of the context in supporting the answer. High-quality context is directly related to the user’s question and provides all necessary details for a correct and comprehensive response. If no context is provided, evaluate how its absence affects the quality of the answer.
Context is provided:
Evaluate how well the provided context (links) aligns with the question and how effectively it supports the answer.
(0 = Context is irrelevant or insufficient; does not support the answer, 4 = Context is fully relevant and highly effective; perfectly supports the answer).
Context is not provided:
Evaluate how the absence of context impacts the quality of the answer.
(0 = Severely impacts quality; context would have been essential, 4 = No impact on quality; context would have been unnecessary).

4. Evaluation Steps (Chain of Thought):
For each dimension, follow these steps to ensure thorough and consistent evaluations:

Step 1. Understand the Question and Context:
   - Read the user question carefully.
   - Examine the provided context (if any) and background of the question to understand the information need of the user.

Step 2. Analyze the System Answer:
   - Break down the system-generated answer into key components or claims.
   - Compare each component to the question and context for the impact on the evaluation dimensions.

Step 3. Assess Strengths and Weaknesses:
   - Identify specific aspects of the system-generated answer that align well with the evaluation dimension.
   - Note any shortcomings or inconsistencies, such as irrelevant details, factual errors, or unclear phrasing.

Step 4. Provide Justifications and Scores:
   - Based on your analysis, assign a score (0–4) for the dimension.
   - Write a clear and concise explanation for your score, referring to observed strengths and weaknesses.

5. Best Practices:
- Take your time: Carefully read the user’s question and any provided context before evaluating.
- Be Objective: Base your evaluations strictly on the provided content and criteria.
- Evaluate Independently: Look at each evaluation dimension independently and do not let one dimension influence the next.
- Handle Ambiguities Thoughtfully: If a question is unclear, evaluate based on the simplest and most logical interpretation.
- Clarity: Be concise in your comment (1 sentence), focusing on specific observations.

Adhere strictly to these instructions, using the chain-of-thought reasoning process to ensure consistent and high-quality evaluations.
"""

system_prompt_response_format_hallucination = """\
You are an expert evaluator tasked with assessing the quality of system-generated answers to user questions across the dimension 'Hallucination'. Follow these detailed instructions to provide your evaluation:

1. System Setting:
The chatbot to be evaluated has the task of answering questions about Osnabrück University. If a question is not related to the university, the chatbot is instructed to politely decline.
The chatbot's answers were generated on January 9, 2025.
Please take this into account when evaluating the answers.

2. Rate the answer on the following dimension using a scale of 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good):
-Hallucination: Refers to the presence of factually incorrect or unfaithful information in the answer. Any claim that cannot be verified using the provided context or widely known facts is considered a hallucination.
(0 = severe hallucination; multiple claims are incorrect, 4 = no hallucination; all claims are verifiable and correct).

3. Evaluation Steps (Chain of Thought):
Follow these steps to ensure thorough and consistent evaluation of Hallucination:

Step 1. Understand the Question and Context:
   - Read the user question carefully.
   - Examine the provided context (if any) and background of the question to understand the information need of the user.

Step 2. Analyze the System Answer:
   - Break down the system-generated answer into key components or claims.
   - Compare each component to the question and context for the impact on Hallucination.

Step 3. Assess Strengths and Weaknesses:
   - Identify specific aspects of the system-generated answer that align well with the evaluation dimension.
   - Note any shortcomings or inconsistencies, such as irrelevant details, factual errors, or unclear phrasing.

Step 4. Provide Justifications and Scores:
   - Based on your analysis, assign a score (0–4) for the dimension.
   - Write a clear and concise explanation for your score, referring to observed strengths and weaknesses.

4. Best Practices:
- Take your time: Carefully read the user’s question and any provided context before evaluating.
- Be Objective: Base your evaluations strictly on the provided content and criteria.
- Handle Ambiguities Thoughtfully: If a question is unclear, evaluate based on the simplest and most logical interpretation.
- Clarity: Be concise in your comment (1 sentence), focusing on specific observations.

Adhere strictly to these instructions, using the chain-of-thought reasoning process to ensure a consistent and high-quality evaluation.
"""
system_prompt_response_format_hallucination_reference = """\
You are an expert evaluator tasked with assessing the quality of system-generated answers to user questions across the dimension 'Hallucination'. Follow these detailed instructions to provide your evaluation:

1. System Setting:
The chatbot to be evaluated has the task of answering questions about Osnabrück University. If a question is not related to the university, the chatbot is instructed to politely decline.
The chatbot's answers were generated on January 9, 2025.
Please take this into account when evaluating the answers.

2. Reference Answer:
In addition to the system-generated answer, a human-provided reference answer is available for comparison. Use the reference answer to assess the quality of the system's response. The reference answer represents a reliable benchmark for evaluating correctness, completeness, and appropriateness.

3. Rate the answer on the following dimension using a scale of 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good):
-Hallucination: Refers to the presence of factually incorrect or unfaithful information in the answer. Any claim that cannot be verified using the provided context or widely known facts is considered a hallucination.
(0 = severe hallucination; multiple claims are incorrect, 4 = no hallucination; all claims are verifiable and correct).

4. Evaluation Steps (Chain of Thought):
Follow these steps to ensure thorough and consistent evaluation of Hallucination:

Step 1. Understand the Question and Context:
   - Read the user question carefully.
   - Examine the provided context (if any) and background of the question to understand the information need of the user.

Step 2. Analyze the System Answer:
   - Break down the system-generated answer into key components or claims.
   - Compare each component to the question and context for the impact on Hallucination.

Step 3. Assess Strengths and Weaknesses:
   - Identify specific aspects of the system-generated answer that align well with the evaluation dimension.
   - Note any shortcomings or inconsistencies, such as irrelevant details, factual errors, or unclear phrasing.

Step 4. Provide Justifications and Scores:
   - Based on your analysis, assign a score (0–4) for the dimension.
   - Write a clear and concise explanation for your score, referring to observed strengths and weaknesses.

5. Best Practices:
- Take your time: Carefully read the user’s question and any provided context before evaluating.
- Be Objective: Base your evaluations strictly on the provided content and criteria.
- Handle Ambiguities Thoughtfully: If a question is unclear, evaluate based on the simplest and most logical interpretation.
- Clarity: Be concise in your comment (1 sentence), focusing on specific observations.

Adhere strictly to these instructions, using the chain-of-thought reasoning process to ensure a consistent and high-quality evaluation.
"""

system_prompt_response_format_accuracy = """\
You are an expert evaluator tasked with assessing the quality of system-generated answers to user questions across the dimension 'Answer Accuracy'. Follow these detailed instructions to provide your evaluation:

1. System Setting:
The chatbot to be evaluated has the task of answering questions about Osnabrück University. If a question is not related to the university, the chatbot is instructed to politely decline.
The chatbot's answers were generated on January 9, 2025.
Please take this into account when evaluating the answers.

2. Rate the answer on the following dimension using a scale of 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good):
-Answer Accuracy: The degree to which an answer accurately addresses the user's question by providing correct, complete, and relevant information that matches the intent of the question. Factual accuracy (no hallucination) is necessary but not sufficient; the answer must also be accurate, comprehensive, and appropriate to the purpose of the question.
(0 = inaccurate; fails to answer the question, 4 = fully accurate; directly addresses the question comprehensively).

3. Evaluation Steps (Chain of Thought):
Follow these steps to ensure thorough and consistent evaluation of Answer Accuracy:

Step 1. Understand the Question and Context:
   - Read the user question carefully.
   - Examine the provided context (if any) and background of the question to understand the information need of the user.

Step 2. Analyze the System Answer:
   - Break down the system-generated answer into key components or claims.
   - Compare each component to the question and context for the impact on Answer Accuracy.

Step 3. Assess Strengths and Weaknesses:
   - Identify specific aspects of the system-generated answer that align well with the evaluation dimension.
   - Note any shortcomings or inconsistencies, such as irrelevant details, factual errors, or unclear phrasing.

Step 4. Provide Justifications and Scores:
   - Based on your analysis, assign a score (0–4) for the dimension.
   - Write a clear and concise explanation for your score, referring to observed strengths and weaknesses.

4. Best Practices:
- Take your time: Carefully read the user’s question and any provided context before evaluating.
- Be Objective: Base your evaluations strictly on the provided content and criteria.
- Handle Ambiguities Thoughtfully: If a question is unclear, evaluate based on the simplest and most logical interpretation.
- Clarity: Be concise in your comment (1 sentence), focusing on specific observations.

Adhere strictly to these instructions, using the chain-of-thought reasoning process to ensure a consistent and high-quality evaluation.
"""
system_prompt_response_format_accuracy_reference = """\
You are an expert evaluator tasked with assessing the quality of system-generated answers to user questions across the dimension 'Answer Accuracy'. Follow these detailed instructions to provide your evaluation:

1. System Setting:
The chatbot to be evaluated has the task of answering questions about Osnabrück University. If a question is not related to the university, the chatbot is instructed to politely decline.
The chatbot's answers were generated on January 9, 2025.
Please take this into account when evaluating the answers.

2. Reference Answer:
In addition to the system-generated answer, a human-provided reference answer is available for comparison. Use the reference answer to assess the quality of the system's response. The reference answer represents a reliable benchmark for evaluating correctness, completeness, and appropriateness.

3. Rate the answer on the following dimension using a scale of 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good):
-Answer Accuracy: The degree to which an answer accurately addresses the user's question by providing correct, complete, and relevant information that matches the intent of the question. Factual accuracy (no hallucination) is necessary but not sufficient; the answer must also be accurate, comprehensive, and appropriate to the purpose of the question.
(0 = inaccurate; fails to answer the question, 4 = fully accurate; directly addresses the question comprehensively).

4. Evaluation Steps (Chain of Thought):
Follow these steps to ensure thorough and consistent evaluation of Answer Accuracy:

Step 1. Understand the Question and Context:
   - Read the user question carefully.
   - Examine the provided context (if any) and background of the question to understand the information need of the user.

Step 2. Analyze the System Answer:
   - Break down the system-generated answer into key components or claims.
   - Compare each component to the question and context for the impact on Answer Accuracy.

Step 3. Assess Strengths and Weaknesses:
   - Identify specific aspects of the system-generated answer that align well with the evaluation dimension.
   - Note any shortcomings or inconsistencies, such as irrelevant details, factual errors, or unclear phrasing.

Step 4. Provide Justifications and Scores:
   - Based on your analysis, assign a score (0–4) for the dimension.
   - Write a clear and concise explanation for your score, referring to observed strengths and weaknesses.

5. Best Practices:
- Take your time: Carefully read the user’s question and any provided context before evaluating.
- Be Objective: Base your evaluations strictly on the provided content and criteria.
- Handle Ambiguities Thoughtfully: If a question is unclear, evaluate based on the simplest and most logical interpretation.
- Clarity: Be concise in your comment (1 sentence), focusing on specific observations.

Adhere strictly to these instructions, using the chain-of-thought reasoning process to ensure a consistent and high-quality evaluation.
"""

system_prompt_response_format_satisfaction = """\
You are an expert evaluator tasked with assessing the quality of system-generated answers to user questions across the dimension 'User Satisfaction'. Follow these detailed instructions to provide your evaluation:

1. System Setting:
The chatbot to be evaluated has the task of answering questions about Osnabrück University. If a question is not related to the university, the chatbot is instructed to politely decline.
The chatbot's answers were generated on January 9, 2025.
Please take this into account when evaluating the answers.

2. Rate the answer on the following dimension using a scale of 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good):
-User Satisfaction: Reflects the user's subjective assessment of the answer's quality, focusing on the effectiveness of understanding the question, answering it, providing meaningful value, and leaving an overall positive impression.
(0 = very unsatisfactory; the answer is unhelpful or confusing, 4 = highly satisfactory; the answer provides significant value).

3. Evaluation Steps (Chain of Thought):
Follow these steps to ensure thorough and consistent evaluation of User Satisfaction:

Step 1. Understand the Question and Context:
   - Read the user question carefully.
   - Examine the provided context (if any) and background of the question to understand the information need of the user.

Step 2. Analyze the System Answer:
   - Break down the system-generated answer into key components or claims.
   - Compare each component to the question and context for the impact on User Satisfaction.

Step 3. Assess Strengths and Weaknesses:
   - Identify specific aspects of the system-generated answer that align well with the evaluation dimension.
   - Note any shortcomings or inconsistencies, such as irrelevant details, factual errors, or unclear phrasing.

Step 4. Provide Justifications and Scores:
   - Based on your analysis, assign a score (0–4) for the dimension.
   - Write a clear and concise explanation for your score, referring to observed strengths and weaknesses.

4. Best Practices:
- Take your time: Carefully read the user’s question and any provided context before evaluating.
- Be Objective: Base your evaluations strictly on the provided content and criteria.
- Handle Ambiguities Thoughtfully: If a question is unclear, evaluate based on the simplest and most logical interpretation.
- Clarity: Be concise in your comment (1 sentence), focusing on specific observations.

Adhere strictly to these instructions, using the chain-of-thought reasoning process to ensure a consistent and high-quality evaluation.
"""
system_prompt_response_format_satisfaction_reference = """\
You are an expert evaluator tasked with assessing the quality of system-generated answers to user questions across the dimension 'User Satisfaction'. Follow these detailed instructions to provide your evaluation:

1. System Setting:
The chatbot to be evaluated has the task of answering questions about Osnabrück University. If a question is not related to the university, the chatbot is instructed to politely decline.
The chatbot's answers were generated on January 9, 2025.
Please take this into account when evaluating the answers.

2. Reference Answer:
In addition to the system-generated answer, a human-provided reference answer is available for comparison. Use the reference answer to assess the quality of the system's response. The reference answer represents a reliable benchmark for evaluating correctness, completeness, and appropriateness.

3. Rate the answer on the following dimension using a scale of 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good):
-User Satisfaction: Reflects the user's subjective assessment of the answer's quality, focusing on the effectiveness of understanding the question, answering it, providing meaningful value, and leaving an overall positive impression.
(0 = very unsatisfactory; the answer is unhelpful or confusing, 4 = highly satisfactory; the answer provides significant value).

4. Evaluation Steps (Chain of Thought):
Follow these steps to ensure thorough and consistent evaluation of User Satisfaction:

Step 1. Understand the Question and Context:
   - Read the user question carefully.
   - Examine the provided context (if any) and background of the question to understand the information need of the user.

Step 2. Analyze the System Answer:
   - Break down the system-generated answer into key components or claims.
   - Compare each component to the question and context for the impact on User Satisfaction.

Step 3. Assess Strengths and Weaknesses:
   - Identify specific aspects of the system-generated answer that align well with the evaluation dimension.
   - Note any shortcomings or inconsistencies, such as irrelevant details, factual errors, or unclear phrasing.

Step 4. Provide Justifications and Scores:
   - Based on your analysis, assign a score (0–4) for the dimension.
   - Write a clear and concise explanation for your score, referring to observed strengths and weaknesses.

5. Best Practices:
- Take your time: Carefully read the user’s question and any provided context before evaluating.
- Be Objective: Base your evaluations strictly on the provided content and criteria.
- Handle Ambiguities Thoughtfully: If a question is unclear, evaluate based on the simplest and most logical interpretation.
- Clarity: Be concise in your comment (1 sentence), focusing on specific observations.

Adhere strictly to these instructions, using the chain-of-thought reasoning process to ensure a consistent and high-quality evaluation.
"""

system_prompt_response_format_coherence = """\
You are an expert evaluator tasked with assessing the quality of system-generated answers to user questions across the dimension 'Coherence, Clarity, and Fluency'. Follow these detailed instructions to provide your evaluation:

1. System Setting:
The chatbot to be evaluated has the task of answering questions about Osnabrück University. If a question is not related to the university, the chatbot is instructed to politely decline.
The chatbot's answers were generated on January 9, 2025.
Please take this into account when evaluating the answers.

2. Rate the answer on the following dimension using a scale of 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good):
-Coherence, Clarity, and Fluency: Evaluates the overall readability and presentation of the answer. A response that scores well in this dimension is logically structured, free of grammatical errors, easy to understand, and expressed in a natural, flowing manner.
(0 = incoherent or unclear; hard to follow, 4 = highly coherent and clear; easy to read and well-structured).

3. Evaluation Steps (Chain of Thought):
Follow these steps to ensure thorough and consistent evaluation of Coherence, Clarity, and Fluency:

Step 1. Understand the Question and Context:
   - Read the user question carefully.
   - Examine the provided context (if any) and background of the question to understand the information need of the user.

Step 2. Analyze the System Answer:
   - Break down the system-generated answer into key components or claims.
   - Compare each component to the question and context for the impact on Coherence, Clarity, and Fluency.

Step 3. Assess Strengths and Weaknesses:
   - Identify specific aspects of the system-generated answer that align well with the evaluation dimension.
   - Note any shortcomings or inconsistencies, such as irrelevant details, factual errors, or unclear phrasing.

Step 4. Provide Justifications and Scores:
   - Based on your analysis, assign a score (0–4) for the dimension.
   - Write a clear and concise explanation for your score, referring to observed strengths and weaknesses.

4. Best Practices:
- Take your time: Carefully read the user’s question and any provided context before evaluating.
- Be Objective: Base your evaluations strictly on the provided content and criteria.
- Handle Ambiguities Thoughtfully: If a question is unclear, evaluate based on the simplest and most logical interpretation.
- Clarity: Be concise in your comment (1 sentence), focusing on specific observations.

Adhere strictly to these instructions, using the chain-of-thought reasoning process to ensure a consistent and high-quality evaluation.
"""
system_prompt_response_format_coherence_reference = """\
You are an expert evaluator tasked with assessing the quality of system-generated answers to user questions across the dimension 'Coherence, Clarity, and Fluency'. Follow these detailed instructions to provide your evaluation:

1. System Setting:
The chatbot to be evaluated has the task of answering questions about Osnabrück University. If a question is not related to the university, the chatbot is instructed to politely decline.
The chatbot's answers were generated on January 9, 2025.
Please take this into account when evaluating the answers.

2. Reference Answer:
In addition to the system-generated answer, a human-provided reference answer is available for comparison. Use the reference answer to assess the quality of the system's response. The reference answer represents a reliable benchmark for evaluating correctness, completeness, and appropriateness.

3. Rate the answer on the following dimension using a scale of 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good):
-Coherence, Clarity, and Fluency: Evaluates the overall readability and presentation of the answer. A response that scores well in this dimension is logically structured, free of grammatical errors, easy to understand, and expressed in a natural, flowing manner.
(0 = incoherent or unclear; hard to follow, 4 = highly coherent and clear; easy to read and well-structured).

4. Evaluation Steps (Chain of Thought):
Follow these steps to ensure thorough and consistent evaluation of Coherence, Clarity, and Fluency:

Step 1. Understand the Question and Context:
   - Read the user question carefully.
   - Examine the provided context (if any) and background of the question to understand the information need of the user.

Step 2. Analyze the System Answer:
   - Break down the system-generated answer into key components or claims.
   - Compare each component to the question and context for the impact on Coherence, Clarity, and Fluency.

Step 3. Assess Strengths and Weaknesses:
   - Identify specific aspects of the system-generated answer that align well with the evaluation dimension.
   - Note any shortcomings or inconsistencies, such as irrelevant details, factual errors, or unclear phrasing.

Step 4. Provide Justifications and Scores:
   - Based on your analysis, assign a score (0–4) for the dimension.
   - Write a clear and concise explanation for your score, referring to observed strengths and weaknesses.

5. Best Practices:
- Take your time: Carefully read the user’s question and any provided context before evaluating.
- Be Objective: Base your evaluations strictly on the provided content and criteria.
- Handle Ambiguities Thoughtfully: If a question is unclear, evaluate based on the simplest and most logical interpretation.
- Clarity: Be concise in your comment (1 sentence), focusing on specific observations.

Adhere strictly to these instructions, using the chain-of-thought reasoning process to ensure a consistent and high-quality evaluation.
"""

system_prompt_response_format_context = """\
You are an expert evaluator tasked with assessing the quality of system-generated answers to user questions across the dimension 'Context Quality'. Follow these detailed instructions to provide your evaluation:

1. System Setting:
The chatbot to be evaluated has the task of answering questions about Osnabrück University. If a question is not related to the university, the chatbot is instructed to politely decline.
The chatbot's answers were generated on January 9, 2025.
Please take this into account when evaluating the answers.

2. Rate the answer on the following dimension using a scale of 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good):
-Context Quality: Assesses the relevance and completeness of the context in supporting the answer. High-quality context is directly related to the user’s question and provides all necessary details for a correct and comprehensive response. If no context is provided, evaluate how its absence affects the quality of the answer.
Context is provided:
Evaluate how well the provided context (links) aligns with the question and how effectively it supports the answer.
(0 = Context is irrelevant or insufficient; does not support the answer, 4 = Context is fully relevant and highly effective; perfectly supports the answer).
Context is not provided:
Evaluate how the absence of context impacts the quality of the answer.
(0 = Severely impacts quality; context would have been essential, 4 = No impact on quality; context would have been unnecessary).

3. Evaluation Steps (Chain of Thought):
Follow these steps to ensure thorough and consistent evaluation of Context Quality:

Step 1. Understand the Question and Context:
   - Read the user question carefully.
   - Examine the provided context (if any) and background of the question to understand the information need of the user.

Step 2. Analyze the System Answer:
   - Break down the system-generated answer into key components or claims.
   - Compare each component to the question and context for the impact on Context Quality.

Step 3. Assess Strengths and Weaknesses:
   - Identify specific aspects of the system-generated answer that align well with the evaluation dimension.
   - Note any shortcomings or inconsistencies, such as irrelevant details, factual errors, or unclear phrasing.

Step 4. Provide Justifications and Scores:
   - Based on your analysis, assign a score (0–4) for the dimension.
   - Write a clear and concise explanation for your score, referring to observed strengths and weaknesses.

4. Best Practices:
- Take your time: Carefully read the user’s question and any provided context before evaluating.
- Be Objective: Base your evaluations strictly on the provided content and criteria.
- Handle Ambiguities Thoughtfully: If a question is unclear, evaluate based on the simplest and most logical interpretation.
- Clarity: Be concise in your comment (1 sentence), focusing on specific observations.

Adhere strictly to these instructions, using the chain-of-thought reasoning process to ensure a consistent and high-quality evaluation.
"""
system_prompt_response_format_context_reference = """\
You are an expert evaluator tasked with assessing the quality of system-generated answers to user questions across the dimension 'Context Quality'. Follow these detailed instructions to provide your evaluation:

1. System Setting:
The chatbot to be evaluated has the task of answering questions about Osnabrück University. If a question is not related to the university, the chatbot is instructed to politely decline.
The chatbot's answers were generated on January 9, 2025.
Please take this into account when evaluating the answers.

2. Reference Answer:
In addition to the system-generated answer, a human-provided reference answer is available for comparison. Use the reference answer to assess the quality of the system's response. The reference answer represents a reliable benchmark for evaluating correctness, completeness, and appropriateness.

3. Rate the answer on the following dimension using a scale of 0 to 4 (0 = Very Bad, 1 = Bad, 2 = Neutral, 3 = Good, 4 = Very Good):
-Context Quality: Assesses the relevance and completeness of the context in supporting the answer. High-quality context is directly related to the user’s question and provides all necessary details for a correct and comprehensive response. If no context is provided, evaluate how its absence affects the quality of the answer.
Context is provided:
Evaluate how well the provided context (links) aligns with the question and how effectively it supports the answer.
(0 = Context is irrelevant or insufficient; does not support the answer, 4 = Context is fully relevant and highly effective; perfectly supports the answer).
Context is not provided:
Evaluate how the absence of context impacts the quality of the answer.
(0 = Severely impacts quality; context would have been essential, 4 = No impact on quality; context would have been unnecessary).

4. Evaluation Steps (Chain of Thought):
Follow these steps to ensure thorough and consistent evaluation of Context Quality:

Step 1. Understand the Question and Context:
   - Read the user question carefully.
   - Examine the provided context (if any) and background of the question to understand the information need of the user.

Step 2. Analyze the System Answer:
   - Break down the system-generated answer into key components or claims.
   - Compare each component to the question and context for the impact on Context Quality.

Step 3. Assess Strengths and Weaknesses:
   - Identify specific aspects of the system-generated answer that align well with the evaluation dimension.
   - Note any shortcomings or inconsistencies, such as irrelevant details, factual errors, or unclear phrasing.

Step 4. Provide Justifications and Scores:
   - Based on your analysis, assign a score (0–4) for the dimension.
   - Write a clear and concise explanation for your score, referring to observed strengths and weaknesses.

5. Best Practices:
- Take your time: Carefully read the user’s question and any provided context before evaluating.
- Be Objective: Base your evaluations strictly on the provided content and criteria.
- Handle Ambiguities Thoughtfully: If a question is unclear, evaluate based on the simplest and most logical interpretation.
- Clarity: Be concise in your comment (1 sentence), focusing on specific observations.

Adhere strictly to these instructions, using the chain-of-thought reasoning process to ensure a consistent and high-quality evaluation.
"""
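
The per-dimension prompts above share most of their sections and differ mainly in the dimension name, the rating rubric, and whether a "Reference Answer" section is present. A minimal sketch of how such prompts could be assembled from shared parts follows. `build_dimension_prompt` and `SECTIONS` are hypothetical names, and the section bodies are placeholders rather than the notebook's actual texts.

```python
# Placeholder section bodies (assumption): in practice these would hold the
# full text of the corresponding prompt sections.
SECTIONS = {
    "System Setting": "<system setting text>",
    "Reference Answer": "<reference answer text>",
    "Evaluation Steps (Chain of Thought)": "<evaluation steps text>",
    "Best Practices": "<best practices text>",
}


def build_dimension_prompt(dimension: str, rubric: str, with_reference: bool) -> str:
    """Assemble a numbered system prompt for one evaluation dimension."""
    intro = (
        "You are an expert evaluator tasked with assessing the quality of "
        "system-generated answers to user questions across the dimension "
        f"'{dimension}'."
    )
    # Collect (title, body) pairs in the order used by the prompts above.
    parts = [("System Setting", SECTIONS["System Setting"])]
    if with_reference:
        parts.append(("Reference Answer", SECTIONS["Reference Answer"]))
    parts.append(("Rating", rubric))
    for title in ("Evaluation Steps (Chain of Thought)", "Best Practices"):
        parts.append((title, SECTIONS[title]))
    # Number the sections so that the numbering adapts to the optional part.
    numbered = [
        f"{i}. {title}:\n{body}" for i, (title, body) in enumerate(parts, start=1)
    ]
    return "\n\n".join([intro, *numbered])


prompt_with_ref = build_dimension_prompt("Context Quality", "<rubric text>", True)
prompt_no_ref = build_dimension_prompt("Context Quality", "<rubric text>", False)
```

With a reference the "Best Practices" section is numbered 5, without one it is numbered 4, which matches the numbering of the hand-written prompts.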


In [2]:
from pydantic import BaseModel
from typing import List
from tqdm import tqdm

# Define the schema using pydantic
class DimensionScore(BaseModel):
    score: int  # 0-4
    comment: str

class EvaluationOutput_all(BaseModel):
    hallucination: DimensionScore
    answer_accuracy: DimensionScore
    user_satisfaction: DimensionScore
    coherence_clarity_fluency: DimensionScore
    context_quality: DimensionScore

class EvaluationOutput_hallucination(BaseModel):
    hallucination: DimensionScore

class EvaluationOutput_accuracy(BaseModel):
    answer_accuracy: DimensionScore

class EvaluationOutput_satisfaction(BaseModel):
    user_satisfaction: DimensionScore

class EvaluationOutput_coherence(BaseModel):
    coherence_clarity_fluency: DimensionScore

class EvaluationOutput_context(BaseModel):
    context_quality: DimensionScore
    
def evaluate_with_llm_as_judge_structured(
    context: str,
    user_question: str,
    system_answer: str,
    reference_answer: str,
    client: OpenAI,
    model_name: str,
    evaluation_style: str,
):
    """
    Calls the OpenAI API with structured outputs, enforcing the defined schema.

    If evaluation_style == 'together', make a single call with the multi-dimension
    prompt and schema. If evaluation_style == 'separate', make five calls (one per
    dimension) and merge the results into one EvaluationOutput_all.

    In both styles, the reference-based prompts are used when `reference_answer`
    is non-empty; otherwise the default prompts are used.

    Returns:
        tuple: (parsed EvaluationOutput_all, messages sent). For 'separate', the
        second element is a list with one message list per dimension.
    """

    # -------------------------------------------------------
    # 1) EVALUATION STYLE: "TOGETHER"
    # -------------------------------------------------------
    if evaluation_style == "together":
        # If a reference answer is provided and not empty
        if reference_answer:
            # Use the "all dimensions with reference" prompt
            messages = [
                {
                    "role": "system",
                    "content": system_prompt_response_format_all_reference
                },
                {
                    "role": "user",
                    "content": f"""
                        Context:
                        <context>
                        {context}
                        </context>

                        User Question:
                        <user_question>
                        {user_question}
                        </user_question>

                        Reference Answer:
                        <reference_answer>
                        {reference_answer}
                        </reference_answer>

                        System Answer:
                        <system_answer>
                        {system_answer}
                        </system_answer>
                        """.strip()
                }
            ]
        else:
            # No reference answer => use the default "all-dimensions" prompt
            messages = [
                {
                    "role": "system",
                    "content": system_prompt_response_format_all
                },
                {
                    "role": "user",
                    "content": f"""
                        Context:
                        <context>
                        {context}
                        </context>

                        User Question:
                        <user_question>
                        {user_question}
                        </user_question>

                        System Answer:
                        <system_answer>
                        {system_answer}
                        </system_answer>
                        """.strip()
                }
            ]

        # Single call, enforcing the multi-dimension schema
        completion = client.beta.chat.completions.parse(
            model=model_name,
            messages=messages,
            temperature=0.0,
            response_format=EvaluationOutput_all
        )
        return completion.choices[0].message.parsed, messages

    # -------------------------------------------------------
    # 2) EVALUATION STYLE: "SEPARATE"
    # -------------------------------------------------------
    elif evaluation_style == "separate":
        dimension_prompts_and_schemas_no_ref = {
            "hallucination": (
                system_prompt_response_format_hallucination,
                EvaluationOutput_hallucination
            ),
            "answer_accuracy": (
                system_prompt_response_format_accuracy,
                EvaluationOutput_accuracy
            ),
            "user_satisfaction": (
                system_prompt_response_format_satisfaction,
                EvaluationOutput_satisfaction
            ),
            "coherence_clarity_fluency": (
                system_prompt_response_format_coherence,
                EvaluationOutput_coherence
            ),
            "context_quality": (
                system_prompt_response_format_context,
                EvaluationOutput_context
            ),
        }

        # Reference versions
        dimension_prompts_and_schemas_ref = {
            "hallucination": (
                system_prompt_response_format_hallucination_reference, 
                EvaluationOutput_hallucination
            ),
            "answer_accuracy": (
                system_prompt_response_format_accuracy_reference,
                EvaluationOutput_accuracy
            ),
            "user_satisfaction": (
                system_prompt_response_format_satisfaction_reference,
                EvaluationOutput_satisfaction
            ),
            "coherence_clarity_fluency": (
                system_prompt_response_format_coherence_reference,
                EvaluationOutput_coherence
            ),
            "context_quality": (
                system_prompt_response_format_context_reference,
                EvaluationOutput_context
            ),
        }

        # Decide which dictionary to use
        if reference_answer:
            dimension_prompts_and_schemas = dimension_prompts_and_schemas_ref
        else:
            dimension_prompts_and_schemas = dimension_prompts_and_schemas_no_ref

        combined_dimensions = {}
        all_messages = []

        # Loop over each dimension
        for dim_name, (dim_prompt, dim_schema) in dimension_prompts_and_schemas.items():
            # Build the user content
            user_content = f"""
                    Context:
                    <context>
                    {context}
                    </context>

                    User Question:
                    <user_question>
                    {user_question}
                    </user_question>
                    """.strip()

            # If we do have a reference, include it in the prompt.
            # Each .strip() removes the leading newline of its block, so the
            # separator is added explicitly to keep the sections apart.
            if reference_answer:
                user_content += "\n\n" + f"""
                    Reference Answer:
                    <reference_answer>
                    {reference_answer}
                    </reference_answer>
                    """.strip()

            # Always include the system answer
            user_content += "\n\n" + f"""
                System Answer:
                <system_answer>
                {system_answer}
                </system_answer>
                """.strip()

            messages = [
                {"role": "system", "content": dim_prompt},
                {"role": "user", "content": user_content}
            ]
            #print(messages)
            completion = client.beta.chat.completions.parse(
                model=model_name,
                messages=messages,
                temperature=0.0,
                response_format=dim_schema
            )
            parsed_output = completion.choices[0].message.parsed
            all_messages.append(messages)

            # E.g., parsed_output might be {"hallucination": {"score": ..., "comment": ...}}
            dimension_score_obj = getattr(parsed_output, dim_name)
            combined_dimensions[dim_name] = dimension_score_obj

        # Combine into a single instance of EvaluationOutput_all
        combined_parsed = EvaluationOutput_all(
            hallucination=combined_dimensions["hallucination"],
            answer_accuracy=combined_dimensions["answer_accuracy"],
            user_satisfaction=combined_dimensions["user_satisfaction"],
            coherence_clarity_fluency=combined_dimensions["coherence_clarity_fluency"],
            context_quality=combined_dimensions["context_quality"]
        )

        return combined_parsed, all_messages

    # -------------------------------------------------------
    # 3) UNKNOWN EVALUATION STYLE
    # -------------------------------------------------------
    else:
        raise ValueError(
            f"Unknown evaluation_style '{evaluation_style}'; expected 'together' or 'separate'."
        )

    


def compute_overall_score(evaluation_dict):
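    """
    Compute the overall score as a weighted sum of the five dimension scores.

    All weights are currently equal, so the result is the arithmetic mean.
    Adjust the weights below to emphasise individual dimensions:
    hallucination:              0.2
    answer_accuracy:            0.2
    user_satisfaction:          0.2
    coherence_clarity_fluency:  0.2
    context_quality:            0.2
    """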
    h = evaluation_dict["hallucination"]["score"]
    a = evaluation_dict["answer_accuracy"]["score"]
    s = evaluation_dict["user_satisfaction"]["score"]
    c_c_f = evaluation_dict["coherence_clarity_fluency"]["score"]
    c_qual  = evaluation_dict["context_quality"]["score"]

    weighted = (
        0.2 * h +
        0.2 * a +
        0.2 * s +
        0.2 * c_c_f +
        0.2 * c_qual
    )
    return weighted

import tiktoken

def calculate_api_call_cost(input_text: str, output_text: str, encoding: tiktoken.Encoding, model_name: str) -> float:
    """
    Estimate the cost of an API call by tokenizing the input and output texts.

    Args:
        input_text (str): The input text to be tokenized (other types are cast via str()).
        output_text (str): The output text to be tokenized (other types are cast via str()).
        encoding (tiktoken.Encoding): The tokenizer used to count tokens.
        model_name (str): The model name used to look up the token prices.

    Returns:
        float: The estimated total cost of the API call, or NaN if no prices
               are defined for `model_name`.
    """
    # Cast the input and output texts to strings
    input_text = str(input_text)
    output_text = str(output_text)
    # Define the cost per million tokens
    if model_name == "gpt-4o-2024-08-06":
        input_cost_per_million = 2.50
        output_cost_per_million = 10.00
    elif model_name == "gpt-4o-mini-2024-07-18":
        input_cost_per_million = 0.15
        output_cost_per_million = 0.6
    else:
        # Unknown model: return NaN so that the cost column stays numeric
        return float("nan")

    # Tokenize the input and output texts
    input_tokens = encoding.encode(input_text)
    output_tokens = encoding.encode(output_text)

    # Calculate the number of tokens
    num_input_tokens = len(input_tokens)
    num_output_tokens = len(output_tokens)

    # Calculate the cost for input and output tokens
    input_cost = (num_input_tokens / 1_000_000) * input_cost_per_million
    output_cost = (num_output_tokens / 1_000_000) * output_cost_per_million

    # Total cost is the sum of input and output costs
    total_cost = input_cost + output_cost

    return total_cost


def run_llm_judge_evaluation_structured(
    df: pd.DataFrame,
    context_col: str,
    question_col: str,
    system_answer_col: str,
    question_id_col: str,
    output_csv_path: str,
    evaluation_style: str,
    client: OpenAI,
    reference_answer_col: str = None,
    model: str = "gpt-4o-2024-08-06"
) -> pd.DataFrame:
    """
    Evaluates each row in `df` with the LLM judge and saves the results to
    `output_csv_path`. With evaluation_style 'together' this is one API call per
    row; with 'separate' it is five calls per row (one per dimension).

    Args:
        df (pd.DataFrame): DataFrame containing the rows to evaluate.
        context_col (str): The column name in `df` that holds the 'context'.
        question_col (str): The column name for the user question.
        system_answer_col (str): The column name for the system's answer.
        question_id_col (str): The column name for a unique question or row ID.
        output_csv_path (str): Where to save the final evaluations as CSV.
        evaluation_style (str): The evaluation style to use. One of 'together' or 'separate'.
        client (OpenAI): The OpenAI client instance.
        reference_answer_col (str, optional): The column name for the human reference
            answer. If None, the prompts without a reference answer are used.
        model (str): The model name to use for the evaluation.

    Returns:
        pd.DataFrame: A DataFrame with the evaluation scores and comments,
                      plus the weighted overall score and the estimated API cost.
    """
    # Initialize the tokenizer for the specified model
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        raise ValueError(f"Model '{model}' not found. Please provide a valid model name.")

    all_results = []

    for _, row in tqdm(df.iterrows(), total=len(df), desc="Evaluating rows"):
        context = row[context_col]
        user_question = row[question_col]
        system_answer = row[system_answer_col]
        question_id = row[question_id_col]

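        if reference_answer_col:
            reference_answer = row[reference_answer_col]
            # Treat missing values (NaN) as "no reference"; NaN would otherwise be truthy
            if pd.isna(reference_answer):
                reference_answer = None
        else:
            reference_answer = None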

        # Call the API with structured outputs
        eval_output, input_message = evaluate_with_llm_as_judge_structured(
            context=context,
            user_question=user_question,
            system_answer=system_answer,
            reference_answer=reference_answer,
            client=client,
            model_name=model,
            evaluation_style=evaluation_style
        )
        # Flatten the output into score and comment
        eval_dict = eval_output.dict()  # Convert the Pydantic model to a dictionary
        flattened_results = {}

        for dim, value in eval_dict.items():
            # Ensure nested dictionaries for each dimension are properly handled
            if isinstance(value, dict):  # Check if the value is a dictionary
                flattened_results[f"{dim}_score"] = value.get("score")
                flattened_results[f"{dim}_comment"] = value.get("comment")
            else:
                raise ValueError(f"Unexpected value type for dimension {dim}: {type(value)}")

        # Compute weighted score
        overall_score = compute_overall_score(eval_dict)

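        # Estimate the API cost by tokenizing the stringified messages and parsed output
        # (an approximation; the token usage reported by the API is not consulted)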
        total_cost = calculate_api_call_cost(input_message, eval_output, encoding, model)

        # Append results
        all_results.append({
            question_id_col: question_id,
            **flattened_results,
            "overall_score": overall_score,
            "api_call_cost": total_cost
        })

    # Convert to DataFrame and save
    df_eval = pd.DataFrame(all_results)
    df_eval.to_csv(output_csv_path, index=False, quoting=1)
    print(f"Saved LLM judge evaluation to: {output_csv_path}")

    return df_eval
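
To make the post-processing in `run_llm_judge_evaluation_structured` concrete, here is a minimal sketch of the flattening and scoring steps on a plain dictionary, without any API call. The helper names `flatten_evaluation` and `weighted_overall_score` are hypothetical, and the scores and comments in `example` are made up for illustration.

```python
def flatten_evaluation(eval_dict: dict) -> dict:
    """Turn {"dim": {"score": ..., "comment": ...}} into flat CSV-style columns."""
    flat = {}
    for dim, value in eval_dict.items():
        flat[f"{dim}_score"] = value["score"]
        flat[f"{dim}_comment"] = value["comment"]
    return flat


def weighted_overall_score(eval_dict: dict, weights: dict = None) -> float:
    """Weighted sum of the dimension scores; equal weights by default."""
    if weights is None:
        weights = {dim: 1 / len(eval_dict) for dim in eval_dict}
    return sum(weights[dim] * eval_dict[dim]["score"] for dim in eval_dict)


# Illustrative values only (assumption), not taken from any evaluation run.
example = {
    "hallucination": {"score": 4, "comment": "illustrative comment"},
    "answer_accuracy": {"score": 3, "comment": "illustrative comment"},
    "user_satisfaction": {"score": 3, "comment": "illustrative comment"},
    "coherence_clarity_fluency": {"score": 4, "comment": "illustrative comment"},
    "context_quality": {"score": 2, "comment": "illustrative comment"},
}

flat_row = flatten_evaluation(example)
overall = weighted_overall_score(example)
```

With five equally weighted dimensions the overall score is the arithmetic mean of the scores, here (4 + 3 + 3 + 4 + 2) / 5 = 3.2 up to floating-point rounding.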

In [3]:
# Both datasets, TOGETHER config, no references
# Create the OpenAI client; the API key is read from the environment
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
model_name = "gpt-4o-2024-08-06"
# model_name = "gpt-4o-mini-2024-07-18"
evaluation_style = "together"
# evaluation_style = "separate"

# Load your dataset
df_en = pd.read_csv("../../data/short_dataset_en.csv")
# df_en = df_en.head(2).copy()  # For testing
df_eval_en = run_llm_judge_evaluation_structured(
    df=df_en,
    context_col="chatbot_context_seen_by_agent_en",
    question_col="english_question_text_q",
    system_answer_col="chatbot_answer_en",
    question_id_col="question_id_q",
    output_csv_path="../../data/eval/llm_judge_together_no_ref_en.csv",
    evaluation_style=evaluation_style,
    client=client,
    reference_answer_col=None,
    model=model_name,
)


# Load your dataset
df_de = pd.read_csv("../../data/short_dataset_de.csv")
# df_de = df_de.head(2).copy()  # For testing
df_eval_de = run_llm_judge_evaluation_structured(
    df=df_de,
    context_col="chatbot_context_seen_by_agent_de",
    question_col="german_question_text_q",
    system_answer_col="chatbot_answer_de",
    #reference_answer_col="human_answer_de",
    question_id_col="question_id_q",
    output_csv_path="../../data/eval/llm_judge_together_no_ref_de.csv",
    evaluation_style=evaluation_style,
    client=client,
    reference_answer_col=None,
    model=model_name,
)
# Calculate the mean overall score and append it to the summary file
df_mean = pd.read_csv("../../data/eval/mean_eval.csv")
# English
mean_llm_en = df_eval_en["overall_score"].mean()
# add new row with metric == 'llm_judge_together_no_ref_en', value == mean_llm_en
row = {"metric": 'llm_judge_together_no_ref_en', "value": mean_llm_en}
df_mean = pd.concat([df_mean, pd.DataFrame([row])], ignore_index=True)

# German
mean_llm_de = df_eval_de["overall_score"].mean()
row = {"metric": 'llm_judge_together_no_ref_de', "value": mean_llm_de}
df_mean = pd.concat([df_mean, pd.DataFrame([row])], ignore_index=True)

df_mean.to_csv("../../data/eval/mean_eval.csv", index=False)

Evaluating rows: 100%|██████████| 33/33 [01:47<00:00,  3.26s/it]


Saved LLM judge evaluation to: ../../data/eval/llm_judge_together_no_ref_en.csv


Evaluating rows: 100%|██████████| 33/33 [01:44<00:00,  3.17s/it]

Saved LLM judge evaluation to: ../../data/eval/llm_judge_together_no_ref_de.csv
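
Note that the cell above appends its two rows to `mean_eval.csv` every time it runs, so re-running it leaves duplicate metric rows in the file. One way to avoid this is an "upsert": drop any existing row for the metric before adding the new one. The sketch below shows the idea on a plain list of rows; `upsert_metric` is a hypothetical helper and the values are illustrative.

```python
def upsert_metric(rows: list, metric: str, value: float) -> list:
    """Return a new list of rows in which `metric` appears exactly once."""
    # Keep every row that belongs to a different metric ...
    kept = [row for row in rows if row["metric"] != metric]
    # ... then add the fresh value for this metric.
    kept.append({"metric": metric, "value": value})
    return kept


# Illustrative values only (assumption).
rows = [{"metric": "some_other_metric", "value": 0.5}]
rows = upsert_metric(rows, "llm_judge_together_no_ref_en", 3.0)
rows = upsert_metric(rows, "llm_judge_together_no_ref_en", 3.1)  # simulated re-run
```

With pandas, the same effect can be had by filtering `df_mean[df_mean["metric"] != name]` before the `pd.concat` call.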





In [5]:
# Both datasets, SEPARATE config, no references
# Create the OpenAI client; the API key is read from the environment
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
model_name = "gpt-4o-2024-08-06"
# model_name = "gpt-4o-mini-2024-07-18"
# evaluation_style = "together"
evaluation_style = "separate"

# Load your dataset
df_en = pd.read_csv("../../data/short_dataset_en.csv")
# df_en = df_en.head(2).copy()  # For testing
df_eval_en = run_llm_judge_evaluation_structured(
    df=df_en,
    context_col="chatbot_context_seen_by_agent_en",
    question_col="english_question_text_q",
    system_answer_col="chatbot_answer_en",
    question_id_col="question_id_q",
    output_csv_path="../../data/eval/llm_judge_seperate_no_ref_en.csv",
    evaluation_style=evaluation_style,
    client=client,
    reference_answer_col=None,
    model=model_name,
)


# Load your dataset
df_de = pd.read_csv("../../data/short_dataset_de.csv")
# df_de = df_de.head(2).copy()  # For testing
df_eval_de = run_llm_judge_evaluation_structured(
    df=df_de,
    context_col="chatbot_context_seen_by_agent_de",
    question_col="german_question_text_q",
    system_answer_col="chatbot_answer_de",
    #reference_answer_col="human_answer_de",
    question_id_col="question_id_q",
    output_csv_path="../../data/eval/llm_judge_seperate_no_ref_de.csv",
    evaluation_style=evaluation_style,
    client=client,
    reference_answer_col=None,
    model=model_name,
)
# Calculate the mean overall score and append it to the summary file
df_mean = pd.read_csv("../../data/eval/mean_eval.csv")
# English
mean_llm_en = df_eval_en["overall_score"].mean()
# add new row with metric == 'llm_judge_seperate_no_ref_en', value == mean_llm_en
row = {"metric": 'llm_judge_seperate_no_ref_en', "value": mean_llm_en}
df_mean = pd.concat([df_mean, pd.DataFrame([row])], ignore_index=True)

# German
mean_llm_de = df_eval_de["overall_score"].mean()
row = {"metric": 'llm_judge_seperate_no_ref_de', "value": mean_llm_de}
df_mean = pd.concat([df_mean, pd.DataFrame([row])], ignore_index=True)

df_mean.to_csv("../../data/eval/mean_eval.csv", index=False)

Evaluating rows: 100%|██████████| 33/33 [05:23<00:00,  9.80s/it]


Saved LLM judge evaluation to: ../../data/eval/llm_judge_seperate_no_ref_en.csv


Evaluating rows: 100%|██████████| 33/33 [05:06<00:00,  9.30s/it]

Saved LLM judge evaluation to: ../../data/eval/llm_judge_seperate_no_ref_de.csv





In [27]:
# Both datasets, TOGETHER config, with references
# Create the OpenAI client; the API key is read from the environment
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
model_name = "gpt-4o-2024-08-06"
# model_name = "gpt-4o-mini-2024-07-18"
evaluation_style = "together"
# evaluation_style = "separate"

# Load your dataset
df_en = pd.read_csv("../../data/short_dataset_en.csv")
# df_en = df_en.head(2).copy()  # For testing
df_eval_en = run_llm_judge_evaluation_structured(
    df=df_en,
    context_col="chatbot_context_seen_by_agent_en",
    question_col="english_question_text_q",
    system_answer_col="chatbot_answer_en",
    question_id_col="question_id_q",
    output_csv_path="../../data/eval/llm_judge_together_with_ref_en.csv",
    evaluation_style=evaluation_style,
    client=client,
    reference_answer_col="human_answer_en",
    model=model_name,
)


# Load your dataset
df_de = pd.read_csv("../../data/short_dataset_de.csv")
# df_de = df_de.head(2).copy()  # For testing
df_eval_de = run_llm_judge_evaluation_structured(
    df=df_de,
    context_col="chatbot_context_seen_by_agent_de",
    question_col="german_question_text_q",
    system_answer_col="chatbot_answer_de",
    question_id_col="question_id_q",
    output_csv_path="../../data/eval/llm_judge_together_with_ref_de.csv",
    evaluation_style=evaluation_style,
    client=client,
    reference_answer_col="human_answer_de",
    model=model_name,
)
# Calculate the mean overall score and append it to the summary file
df_mean = pd.read_csv("../../data/eval/mean_eval.csv")
# English
mean_llm_en = df_eval_en["overall_score"].mean()
# add new row with metric == 'llm_judge_together_with_ref_en', value == mean_llm_en
row = {"metric": 'llm_judge_together_with_ref_en', "value": mean_llm_en}
df_mean = pd.concat([df_mean, pd.DataFrame([row])], ignore_index=True)

# German
mean_llm_de = df_eval_de["overall_score"].mean()
row = {"metric": 'llm_judge_together_with_ref_de', "value": mean_llm_de}
df_mean = pd.concat([df_mean, pd.DataFrame([row])], ignore_index=True)

df_mean.to_csv("../../data/eval/mean_eval.csv", index=False)

Evaluating rows: 100%|██████████| 33/33 [03:03<00:00,  5.58s/it]


Saved LLM judge evaluation to: ../../data/eval/llm_judge_together_with_ref_en.csv


Evaluating rows: 100%|██████████| 33/33 [02:38<00:00,  4.79s/it]

Saved LLM judge evaluation to: ../../data/eval/llm_judge_together_with_ref_de.csv
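
The runs with references send an extra "Reference Answer" section to the judge. A minimal sketch of assembling such a user message, with the reference section included only when a reference is available and all sections separated by blank lines, could look like this. `build_user_message` is a hypothetical helper, not the notebook's function, and it omits the indentation that the notebook's f-strings carry.

```python
from typing import Optional


def build_user_message(
    context: str,
    user_question: str,
    system_answer: str,
    reference_answer: Optional[str] = None,
) -> str:
    """Assemble the tagged user message for the LLM judge."""
    sections = [
        ("Context", "context", context),
        ("User Question", "user_question", user_question),
    ]
    # The reference section is optional and sits before the system answer.
    if reference_answer:
        sections.append(("Reference Answer", "reference_answer", reference_answer))
    sections.append(("System Answer", "system_answer", system_answer))
    # Join with blank lines so a closing tag never touches the next header.
    return "\n\n".join(
        f"{title}:\n<{tag}>\n{body}\n</{tag}>" for title, tag, body in sections
    )


# Illustrative inputs only (assumption).
with_ref = build_user_message("ctx", "question?", "answer", "reference")
without_ref = build_user_message("ctx", "question?", "answer")
```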





In [28]:
# Both datasets, SEPARATE config, with references
# Create the OpenAI client; the API key is read from the environment
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
model_name = "gpt-4o-2024-08-06"
# model_name = "gpt-4o-mini-2024-07-18"
# evaluation_style = "together"
evaluation_style = "separate"

# Load your dataset
df_en = pd.read_csv("../../data/short_dataset_en.csv")
# df_en = df_en.head(2).copy()  # For testing
df_eval_en = run_llm_judge_evaluation_structured(
    df=df_en,
    context_col="chatbot_context_seen_by_agent_en",
    question_col="english_question_text_q",
    system_answer_col="chatbot_answer_en",
    question_id_col="question_id_q",
    output_csv_path="../../data/eval/llm_judge_seperate_with_ref_en.csv",
    evaluation_style=evaluation_style,
    client=client,
    reference_answer_col="human_answer_en",
    model=model_name,
)


# Load your dataset
df_de = pd.read_csv("../../data/short_dataset_de.csv")
# df_de = df_de.head(2).copy()  # For testing
df_eval_de = run_llm_judge_evaluation_structured(
    df=df_de,
    context_col="chatbot_context_seen_by_agent_de",
    question_col="german_question_text_q",
    system_answer_col="chatbot_answer_de",
    question_id_col="question_id_q",
    output_csv_path="../../data/eval/llm_judge_seperate_with_ref_de.csv",
    evaluation_style=evaluation_style,
    client=client,
    reference_answer_col="human_answer_de",
    model=model_name,
)
# Calculate the mean overall score and append it to the summary file
df_mean = pd.read_csv("../../data/eval/mean_eval.csv")
# English
mean_llm_en = df_eval_en["overall_score"].mean()
# add new row with metric == 'llm_judge_seperate_with_ref_en', value == mean_llm_en
row = {"metric": 'llm_judge_seperate_with_ref_en', "value": mean_llm_en}
df_mean = pd.concat([df_mean, pd.DataFrame([row])], ignore_index=True)

# German
mean_llm_de = df_eval_de["overall_score"].mean()
row = {"metric": 'llm_judge_seperate_with_ref_de', "value": mean_llm_de}
df_mean = pd.concat([df_mean, pd.DataFrame([row])], ignore_index=True)

df_mean.to_csv("../../data/eval/mean_eval.csv", index=False)

Evaluating rows: 100%|██████████| 33/33 [08:11<00:00, 14.91s/it]


Saved LLM judge evaluation to: ../../data/eval/llm_judge_seperate_with_ref_en.csv


Evaluating rows: 100%|██████████| 33/33 [05:46<00:00, 10.51s/it]

Saved LLM judge evaluation to: ../../data/eval/llm_judge_seperate_with_ref_de.csv
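
The four run cells above differ only in the evaluation style, the reference column, and the language. As a sketch, the eight configurations could be enumerated once and then looped over. `build_run_configs` is a hypothetical helper; it only builds the keyword arguments and file names and makes no API calls. The label "seperate" deliberately mirrors the spelling of the existing output files.

```python
from itertools import product

# File-name label per evaluation style ("seperate" follows the existing paths).
STYLE_LABELS = {"together": "together", "separate": "seperate"}
QUESTION_COLS = {"en": "english_question_text_q", "de": "german_question_text_q"}


def build_run_configs(eval_dir: str = "../../data/eval") -> list:
    """Enumerate all (reference, style, language) combinations in notebook order."""
    configs = []
    for with_ref, style, lang in product([False, True], STYLE_LABELS, ["en", "de"]):
        ref_label = "with_ref" if with_ref else "no_ref"
        name = f"llm_judge_{STYLE_LABELS[style]}_{ref_label}_{lang}"
        configs.append({
            "metric": name,
            "evaluation_style": style,
            "context_col": f"chatbot_context_seen_by_agent_{lang}",
            "question_col": QUESTION_COLS[lang],
            "system_answer_col": f"chatbot_answer_{lang}",
            "reference_answer_col": f"human_answer_{lang}" if with_ref else None,
            "output_csv_path": f"{eval_dir}/{name}.csv",
        })
    return configs


configs = build_run_configs()
```

Each entry could then be passed to `run_llm_judge_evaluation_structured` together with the matching DataFrame, client, and model name.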





In [14]:
# load the different evaluation files
df_together_no_ref_en = pd.read_csv("../../data/eval/llm_judge_together_no_ref_en.csv")
df_together_no_ref_de = pd.read_csv("../../data/eval/llm_judge_together_no_ref_de.csv")
df_seperate_no_ref_en = pd.read_csv("../../data/eval/llm_judge_seperate_no_ref_en.csv")
df_seperate_no_ref_de = pd.read_csv("../../data/eval/llm_judge_seperate_no_ref_de.csv")
df_together_with_ref_en = pd.read_csv("../../data/eval/llm_judge_together_with_ref_en.csv")
df_together_with_ref_de = pd.read_csv("../../data/eval/llm_judge_together_with_ref_de.csv")
df_seperate_with_ref_en = pd.read_csv("../../data/eval/llm_judge_seperate_with_ref_en.csv")
df_seperate_with_ref_de = pd.read_csv("../../data/eval/llm_judge_seperate_with_ref_de.csv")

# calculate the cost via the 'api_call_cost' column for every evaluation
cost_together_no_ref_en = df_together_no_ref_en["api_call_cost"].sum()
cost_together_no_ref_de = df_together_no_ref_de["api_call_cost"].sum()
cost_seperate_no_ref_en = df_seperate_no_ref_en["api_call_cost"].sum()
cost_seperate_no_ref_de = df_seperate_no_ref_de["api_call_cost"].sum()
cost_together_with_ref_en = df_together_with_ref_en["api_call_cost"].sum()
cost_together_with_ref_de = df_together_with_ref_de["api_call_cost"].sum()
cost_seperate_with_ref_en = df_seperate_with_ref_en["api_call_cost"].sum()
cost_seperate_with_ref_de = df_seperate_with_ref_de["api_call_cost"].sum()
# print the cost (the prices in calculate_api_call_cost are USD per million tokens)
print(f"Cost for together_no_ref_en: {cost_together_no_ref_en:.2f} USD for {len(df_together_no_ref_en)} pairs")
print(f"Cost for together_no_ref_de: {cost_together_no_ref_de:.2f} USD for {len(df_together_no_ref_de)} pairs")
print(f"Cost for seperate_no_ref_en: {cost_seperate_no_ref_en:.2f} USD for {len(df_seperate_no_ref_en)} pairs")
print(f"Cost for seperate_no_ref_de: {cost_seperate_no_ref_de:.2f} USD for {len(df_seperate_no_ref_de)} pairs")
print(f"Cost for together_with_ref_en: {cost_together_with_ref_en:.2f} USD for {len(df_together_with_ref_en)} pairs")
print(f"Cost for together_with_ref_de: {cost_together_with_ref_de:.2f} USD for {len(df_together_with_ref_de)} pairs")
print(f"Cost for seperate_with_ref_en: {cost_seperate_with_ref_en:.2f} USD for {len(df_seperate_with_ref_en)} pairs")
print(f"Cost for seperate_with_ref_de: {cost_seperate_with_ref_de:.2f} USD for {len(df_seperate_with_ref_de)} pairs")

Cost for together_no_ref_en: 0.72 USD for 33 pairs
Cost for together_no_ref_de: 0.58 USD for 33 pairs
Cost for seperate_no_ref_en: 3.23 USD for 33 pairs
Cost for seperate_no_ref_de: 2.55 USD for 33 pairs
Cost for together_with_ref_en: 0.74 USD for 33 pairs
Cost for together_with_ref_de: 0.61 USD for 33 pairs
Cost for seperate_with_ref_en: 3.35 USD for 33 pairs
Cost for seperate_with_ref_de: 2.68 USD for 33 pairs
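
The totals above follow from the per-call arithmetic in `calculate_api_call_cost`: token count divided by one million, times the price per million tokens. The sketch below repeats that arithmetic on explicit token counts and then compares the two evaluation styles using the printed totals. `cost_from_token_counts` and `separate_to_together_ratio` are hypothetical helpers, and the token counts in the example call are made up.

```python
# Prices per million tokens (input, output), as hard-coded in the notebook.
PRICES = {
    "gpt-4o-2024-08-06": (2.50, 10.00),
    "gpt-4o-mini-2024-07-18": (0.15, 0.60),
}


def cost_from_token_counts(num_input_tokens: int, num_output_tokens: int, model_name: str) -> float:
    """Cost of one call, given token counts instead of raw text."""
    input_price, output_price = PRICES[model_name]
    input_cost = num_input_tokens / 1_000_000 * input_price
    output_cost = num_output_tokens / 1_000_000 * output_price
    return input_cost + output_cost


# Totals copied from the output printed above: (style, reference, language).
TOTALS = {
    ("together", "no_ref", "en"): 0.72, ("together", "no_ref", "de"): 0.58,
    ("separate", "no_ref", "en"): 3.23, ("separate", "no_ref", "de"): 2.55,
    ("together", "with_ref", "en"): 0.74, ("together", "with_ref", "de"): 0.61,
    ("separate", "with_ref", "en"): 3.35, ("separate", "with_ref", "de"): 2.68,
}


def separate_to_together_ratio(ref: str, lang: str) -> float:
    """How many times more the 'separate' style costs than 'together'."""
    return TOTALS[("separate", ref, lang)] / TOTALS[("together", ref, lang)]


# Hypothetical token counts for a single call (assumption).
single_call = cost_from_token_counts(1200, 150, "gpt-4o-2024-08-06")
ratios = {
    (ref, lang): separate_to_together_ratio(ref, lang)
    for ref in ("no_ref", "with_ref")
    for lang in ("en", "de")
}
total_spend = sum(TOTALS.values())
```

On these figures the separate style costs roughly 4.4 to 4.5 times as much as the together style, which is consistent with it making five calls per row that each resend the context, question, and answer.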
