# In this notebook, we explicitly create the essential information for each of the datasets we aim to use.

## Introduction

For the purpose of this project, we will be using the following datasets:
1. USR
2. SumPubMed
3. SummEval
4. HelpSteer2

### Definement

For the metric definement (definition) phase, we need the following materials:
1. Task description (used when asking the LLM to generate metrics)
2. Dataset description
3. Main data (when using samples in generation)
4. An indicator of the column that will act as the main quality label
5. An indicator of the content column
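
The five materials above can be thought of as one record per dataset. The following is a minimal sketch only: the notebook itself pickles each value separately, and `DatasetConfig`, its field names, and the placeholder description text are hypothetical, chosen to mirror the variable names used in the dataset cells.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical container for the per-dataset definement materials."""
    task_description: str           # prompt used when asking the LLM to generate metrics
    dataset_description: str        # short description of the dataset
    data_path: str                  # location of the main data (CSV)
    overall_score_column_name: str  # column acting as the main quality label
    content_column_name: str        # column holding the content being responded to


# Column names and file name follow the USR cells; the description texts are placeholders.
usr_config = DatasetConfig(
    task_description="Your task is to define clear and independent metrics ...",
    dataset_description="(placeholder) dialogue responses with human quality ratings",
    data_path="usr.csv",
    overall_score_column_name="Overall",
    content_column_name="context",
)

# asdict() gives a plain dict, which is convenient for JSON or pickle storage.
print(asdict(usr_config)["overall_score_column_name"])
```

Keeping the values in one frozen record makes it harder to mix up columns between datasets, at the cost of departing from the one-file-per-variable layout used here.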

### Scoring

In addition to the materials above, this phase also uses:
1. Test sample
2. Ground-truth metrics

## Datasets and Their Information

### USR

#### Definement

In [None]:
content_column_name = "context"
response_column_name = "response"
overall_score_column_name = "Overall"
default_task_description = "You will be given a conversation between two individuals. " \
                           "You will then be given several potential responses for the next turn in the conversation. " \
                           "These responses all concern an interesting fact, which will be provided as well. Your task is " \
                           "to rate each of the responses on several metrics. The response for one metric should not influence " \
                           "the other metrics. For example, if a response is not understandable or has grammatical errors " \
                           "you should try to ignore this when considering whether it maintains context or if it is interesting. " \
                           "Please make sure you read and understand these instructions carefully. Feel free to ask if you require " \
                           "clarification. Please keep this document open while reviewing, and refer to it as needed."
definement_task_description = "We provided a task where contestants were asked to generate their ideal responses/continuation " \
                              "to a conversation, considering the context and an interesting fact provided to them." \
                              " Your task is to define clear and independent metrics that could be used to evaluate " \
                              "the quality of these generated continuations."
scoring_task_description = "You will be given a conversation between two individuals. " \
                           "You will then be given one potential response for the next turn in the conversation. " \
                           "The response concerns an interesting fact, which will be provided as well. Your task " \
                           "is to rate the responses on one metric."

In [None]:
%cd USR
import pickle

# Save the variables to their corresponding .pkl files
with open('content_column_name.pkl', 'wb') as f:
    pickle.dump(content_column_name, f)

with open('default_task_description.pkl', 'wb') as f:
    pickle.dump(default_task_description, f)

with open('definement_task_description.pkl', 'wb') as f:
    pickle.dump(definement_task_description, f)

with open('overall_score_column_name.pkl', 'wb') as f:
    pickle.dump(overall_score_column_name, f)

with open('scoring_task_description.pkl', 'wb') as f:
    pickle.dump(scoring_task_description, f)

with open('response_column_name.pkl', 'wb') as f:
    pickle.dump(response_column_name, f)
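
The six `pickle.dump` calls above follow one pattern, so they could be collapsed into a loop. The sketch below is one possible way to do that, not the notebook's own code: `save_artifacts` and `load_artifacts` are hypothetical helpers, and a temporary directory stands in for the `USR` folder. Passing the directory explicitly also avoids depending on `%cd`.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifacts(directory, artifacts):
    """Pickle each value to <directory>/<name>.pkl and return the written paths."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, value in artifacts.items():
        path = directory / f"{name}.pkl"
        with open(path, "wb") as f:
            pickle.dump(value, f)
        paths.append(path)
    return paths


def load_artifacts(directory):
    """Load every .pkl file in <directory> into a {name: value} dict."""
    loaded = {}
    for path in sorted(Path(directory).glob("*.pkl")):
        with open(path, "rb") as f:
            loaded[path.stem] = pickle.load(f)
    return loaded


# Round trip in a temporary directory (a stand-in for the USR folder).
with tempfile.TemporaryDirectory() as tmp:
    save_artifacts(tmp, {
        "content_column_name": "context",
        "response_column_name": "response",
        "overall_score_column_name": "Overall",
    })
    restored = load_artifacts(tmp)

print(restored["content_column_name"])
```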

#### Scoring

In [None]:
import pandas as pd

# Load the data
usr_data = pd.read_csv('usr.csv')

# Select a subset of size 50 with random seed 42
test_data = usr_data.sample(n=50, random_state=42)

# Save the subset as test.csv
test_data.to_csv('test.csv', index=False)

In [None]:
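class Metric:
    """An evaluation metric: a name, a description, a score scale, and optional rater instructions."""

    def __init__(self, name, description, scale, instruction=""):
        self.name = name
        self.description = description
        self.scale = scale
        self.instruction = instruction

    def set_instruction(self, instruction):
        """Replace the rater instructions for this metric."""
        self.instruction = instruction

    def to_str(self):
        """Return a one-line summary of the metric, without the instructions."""
        return f"[Metric]: Name: {self.name}, Description: {self.description}, Scale: {self.scale} [Metric]\n"

    def to_inst(self):
        """Return the one-line summary followed by the rater instructions."""
        return f"[Metric]: Name: {self.name}, Description: {self.description}, Scale: {self.scale} [Metric]\n{self.instruction}"


def serialize_metric(metric):
    """Convert a Metric into a JSON-serializable dict."""
    return {
        "name": metric.name,
        "description": metric.description,
        "scale": metric.scale,
        "instruction": metric.instruction
    }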
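
`serialize_metric` only covers the write direction. The sketch below shows what a matching read direction might look like; `deserialize_metric` is a hypothetical helper that does not appear in this notebook, and the dataclass `Metric` here is a stand-in with the same four attributes as the class above, used so the block runs on its own.

```python
import json
from dataclasses import dataclass


@dataclass
class Metric:
    """Stand-in with the same four attributes as the notebook's Metric class."""
    name: str
    description: str
    scale: str
    instruction: str = ""


def serialize_metric(metric):
    """Convert a Metric into a JSON-serializable dict (same keys as above)."""
    return {
        "name": metric.name,
        "description": metric.description,
        "scale": metric.scale,
        "instruction": metric.instruction,
    }


def deserialize_metric(record):
    """Hypothetical inverse of serialize_metric; a missing instruction defaults to ''."""
    return Metric(
        name=record["name"],
        description=record["description"],
        scale=record["scale"],
        instruction=record.get("instruction", ""),
    )


original = Metric(
    name="Uses Knowledge",
    description="Given the fact that the response is conditioned on, "
                "how well does the response use that fact?",
    scale="0-1",
)

# Serialize to a JSON string and read it back, as a file round trip would.
payload = json.dumps([serialize_metric(original)], indent=4)
restored = [deserialize_metric(record) for record in json.loads(payload)]
assert restored[0] == original
```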

In [None]:
# Create instances of the Metric class
understandable_metric = Metric(
    name="Understandable",
    description="Is the response understandable given the previous context?",
    scale="0-1",
    instruction="Is the response understandable in the context of the history? (Not if it's on topic, but for example if it uses pronouns they should make sense)\n"
                "A score of 0 (no) means that the response is difficult to understand. You do not know what the person is trying to say.\n"
                "Example: 'I didn’t know that. I love to watch the movie Inception, it’s also the first racing movie to be a woman haha. I guess the movie was originally titled “Inception” awesome movie!'\n"
                "Context: 'In my religion, there is no star. How about you?'\n"
                "Response: 'Yeah it was back in 1975.'\n"
                "A score of 1 (yes) means that the response is understandable. You know what the person is trying to say.\n"
                "Example: 'My favorite role would have to be quarterback. It is such an interesting role.'\n"
                "Response: 'That is true. I think LeBron is the highest paid celebrity, I wonder if he will be in the Space Jam sequel.'"
)

natural_metric = Metric(
    name="Natural",
    description="Does the response seem to be something that a person would naturally say?",
    scale="1-3",
    instruction="Is the response naturally written?\n"
                "A score of 1 (bad) means that the response is unnatural.\n"
                "Context: A: wow. do you believe in stars of the zodiac? what is your star?\n"
                "B: in my religion, there is no star. how about you?\n"
                "Response: yeah, it was back in 1975.\n"
                "i think he is, he is a great teacher and he also taught ellie kemper, she is a great teacher\n"
                "A score of 2 (ok) means the response is strange, but not entirely unnatural.\n"
                "Context: A: wow. do you believe in stars of the zodiac? what is your star?\n"
                "B: in my religion, there is no star. how about you?\n"
                "Response: i read it sometimes for the fun of it.\n"
                "A score of 3 (good) means that the response is natural.\n"
                "i think it’s funny that the soviet union sent a spacecraft to venus"
)
maintains_context_metric = Metric(
    name="Maintains Context",
    description="Does the response serve as a valid continuation of the preceding conversation?",
    scale="1-3",
    instruction="Does the response serve as a valid continuation of the conversation history?\n"
                "A score of 1 (no) means that the response drastically changes topic or ignores the conversation history.\n"
                "Context: A: wow. do you believe in stars of the zodiac? what is your star?\n"
                "B: in my religion, there is no star. how about you?\n"
                "Response: i think it’s funny that the soviet union sent a spacecraft to venus.\n"
                "A score of 2 (somewhat) means the response refers to the conversation history in a limited capacity (e.g., in a generic way) and shifts the conversation topic.\n"
                "Context: i do like some drama stuff, yeah he was awesome in that.\n"
                "Response: yeah. do you like jon hamm?\n"
                "Context: i believe that! he would have played longer i’m sure if he did the granny style approach to shooting free throws!\n"
                "Response: i agree. did you know that space jam is the highest grossing basketball movie of all time?\n"
                "A score of 3 (yes) means the response is on topic and strongly acknowledges the conversation history.\n"
                "Context: B: wow, that’s great. especially because more than 60% of NBA players go broke 5 years after retirement.\n"
                "A: i believe that! he would have played longer i’m sure if he did the granny style approach to shooting free throws!\n"
                "Response: a lot of players can make money by starring in movies. did you know space jam is the highest grossing movie of all time? maybe one of the broke retired players can be in the sequel!\n"
                "Context: B: you like drama? patrick stewart teaches classes now. i loved him in star trek.\n"
                "A: i do like some drama stuff, yeah he was awesome in that.\n"
                "Response: jon hamm was also a drama teacher. he taught erin from the office."
)

interesting_metric = Metric(
    name="Interesting",
    description="Is the response dull or interesting?",
    scale="1-3",
    instruction="Is the response dull/interesting?\n"
                "A score of 1 (dull) means that the response is generic and dull.\n"
                "Example: 'that's true. i agree.'\n"
                "A score of 2 (somewhat interesting) means the response is somewhat interesting and could engage you in the conversation (e.g., an opinion, thought).\n"
                "Example: 'my favorite role would have to be quarterback. it is such an interesting role.'\n"
                "'i love tom brady. i love tom brady.'\n"
                "A score of 3 (interesting) means the response is very interesting or presents an interesting fact.\n"
                "Example: 'i agree. did you know that space jam is the highest grossing basketball movie of all time?'\n"
                "'a lot of players can make money by starring in movies. did you know space jam is the highest grossing movie of all time? maybe one of the broke retired players can be in the sequel!'"
)

uses_knowledge_metric = Metric(
    name="Uses Knowledge",
    description="Given the fact that the response is conditioned on, how well does the response use that fact?",
    scale="0-1",
    instruction="Given the interesting fact that the response is conditioned on, how well does the response use the fact?\n"
                "A score of 0 (no) means the response does not mention or refer to the fact at all.\n"
                "A score of 1 (yes) means the response uses the fact well."
)

overall_quality_metric = Metric(
    name="Overall Quality",
    description="Given your answers above, what is your overall impression of the quality of this utterance?",
    scale="1-5",
    instruction="Given your answers above, what is your overall impression of this utterance?\n"
                "A score of 1 (very bad) means this is a completely invalid response. It would be difficult to recover the conversation after this.\n"
                "A score of 2 (bad) means this is a valid response, but otherwise poor in quality.\n"
                "A score of 3 (neutral) means this response is neither good nor bad. This response has no negative qualities, but no positive ones either.\n"
                "A score of 4 (good) means this is a good response, but falls short of being perfect because of a key flaw.\n"
                "A score of 5 (very good) means this response is good and does not have any strong flaws."
)

metrics = [understandable_metric, natural_metric, maintains_context_metric, interesting_metric, uses_knowledge_metric,
           overall_quality_metric]

In [None]:
import json

# Serialize the ground-truth metrics and write them to disk
results = [serialize_metric(metric) for metric in metrics]
with open('ground_truth_metrics_set.json', 'w') as f:
    json.dump(results, f, indent=4)

### SumPubMed

We designed a task where contestants were asked to generate summaries for biomedical research papers. These documents are sourced from diverse literature, including medline, life science journals, and online books. Moreover, these documents were related to medicine, pharmacy, nursing, dentistry, health care, health services, etc. Now we want to assess these provided summaries. Based on this information, define a set of metrics that we should use to assess the answers provided by contestants.

#### Definement

In [None]:
content_column_name = "text"
response_column_name = "shorter_abstract"
overall_score_column_name = "IOF"
default_task_description = "We designed a task where contestants were asked to generate " \
                           "summaries for biomedical research papers. These documents are sourced from diverse literature, " \
                           "including medline, life science journals, and online books. Moreover, these documents were related " \
                           "to medicine, pharmacy, nursing, dentistry, health care, health services, etc. Now we want to assess " \
                           "these provided summaries. Your task is to rate each of the abstracts on several metrics. The response " \
                           "for one metric should not influence the other metrics. For example, if an abstract is not readable enough " \
                           "or has grammatical errors you should try to ignore this when considering whether it is informative or " \
                           "if it is coherent. Please make sure you read and understand these instructions carefully. Feel free to " \
                           "ask if you require clarification. Please keep this document open while reviewing, and refer to it as needed."

definement_task_description = "We designed a task where contestants were asked to generate " \
                              "summaries for biomedical research papers. These documents are sourced from diverse literature, " \
                              "including medline, life science journals, and online books. Moreover, these documents were related " \
                              "to medicine, pharmacy, nursing, dentistry, health care, health services, etc. Now we want to assess " \
                              "these provided summaries. Your task is to define clear and independent metrics that could be used to evaluate " \
                              "the quality of these generated summaries."

scoring_task_description = "We designed a task where contestants were asked to generate " \
                           "summaries for biomedical research papers. These documents are sourced from diverse literature, " \
                           "including medline, life science journals, and online books. Moreover, these documents were related " \
                           "to medicine, pharmacy, nursing, dentistry, health care, health services, etc. Now we want to assess " \
                           "these provided summaries. Your task is to rate the responses on one metric."

In [None]:
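# %cd is relative to the current directory: run this from the project root
# (e.g. `%cd ..` first if a previous dataset cell already changed directory)
%cd SumPubMed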
import pickle

# Save the variables to their corresponding .pkl files
with open('content_column_name.pkl', 'wb') as f:
    pickle.dump(content_column_name, f)

with open('default_task_description.pkl', 'wb') as f:
    pickle.dump(default_task_description, f)

with open('definement_task_description.pkl', 'wb') as f:
    pickle.dump(definement_task_description, f)

with open('overall_score_column_name.pkl', 'wb') as f:
    pickle.dump(overall_score_column_name, f)

with open('scoring_task_description.pkl', 'wb') as f:
    pickle.dump(scoring_task_description, f)

with open('response_column_name.pkl', 'wb') as f:
    pickle.dump(response_column_name, f)
#### Scoring
import pandas as pd

# Load the data
spm_data = pd.read_csv('sumpubmed.csv')

# Save the full dataset as test.csv (no subset is sampled here)
spm_data.to_csv('test.csv', index=False)

non_re_metric = Metric(
    name="Non-Repetition and no factual Redundancy",
    description="There should not be redundancy in the factual information, and no repetition of sentences is allowed.",
    scale="1-10"
)

coherence_metric = Metric(
    name="Coherence",
    description="Coherence means 'continuity of sense'. The arguments have to be connected sensibly so that the reader can see consecutive sentences as being about one (or a related) concept.",
    scale="1-10"
)

readability_metric = Metric(
    name="Readability",
    description="Consideration of general readability criteria such as good spelling, correct grammar, understandability, etc. in the summaries.",
    scale="1-10"
)

informativeness_metric = Metric(
    name="Informativeness, Overlap and Focus",
    description="How much information is covered by the summary. The goal is to find the common pieces of information via matching the same keywords (or key phrases), such as 'Nematodes', across the summary. For overlaps, annotators compare the keywords’ (or key-phrases) occurrence frequency and ensure the summaries are on the same topic.",
    scale="1-10"
)

overall_quality_metric = Metric(
    name="Overall Quality",
    description="Given your answers above, what is your overall impression of the quality of this summary?",
    scale="1-10"
)

metrics = [non_re_metric, coherence_metric, readability_metric, informativeness_metric, overall_quality_metric]
import json

# Serialize the ground-truth metrics and write them to disk
results = [serialize_metric(metric) for metric in metrics]
with open('ground_truth_metrics_set.json', 'w') as f:
    json.dump(results, f, indent=4)

### SummEval

#### Definement

In [None]:
content_column_name = "article"
response_column_name = "decoded"
overall_score_column_name = "expert_relevance"
default_task_description = "In this task you will evaluate the quality of summaries written " \
                           "for a news article. To correctly solve the task, follow these steps:\n" \
                           "1. Carefully read the news article, be aware of the information it contains.\n" \
                           "2. Read the proposed summary.\n" \
                           "3. Rate the summary on a scale from 1 (worst) to 5 (best) by its relevance, consistency, fluency, and coherence."
definement_task_description = "We provided a task where contestants were asked to generate their ideal summary " \
                              "for a news article." \
                              " Your task is to define clear and independent metrics that could be used to evaluate " \
                              "the quality of these generated summaries."
scoring_task_description = "We provided a task where contestants were asked to generate their ideal summary " \
                           "for a news article. Your task is to rate the responses on one metric."

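# %cd is relative to the current directory: run this from the project root
# (e.g. `%cd ..` first if a previous dataset cell already changed directory)
%cd SummEval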
import pickle

# Save the variables to their corresponding .pkl files
with open('content_column_name.pkl', 'wb') as f:
    pickle.dump(content_column_name, f)

with open('default_task_description.pkl', 'wb') as f:
    pickle.dump(default_task_description, f)

with open('definement_task_description.pkl', 'wb') as f:
    pickle.dump(definement_task_description, f)

with open('overall_score_column_name.pkl', 'wb') as f:
    pickle.dump(overall_score_column_name, f)

with open('scoring_task_description.pkl', 'wb') as f:
    pickle.dump(scoring_task_description, f)

with open('response_column_name.pkl', 'wb') as f:
    pickle.dump(response_column_name, f)

#### Scoring
import pandas as pd

# Load the data
summeval_data = pd.read_csv('summeval.csv')

# Sampling is disabled here; uncomment to select a subset of size 50 with random seed 42
#test_data = summeval_data.sample(n=50, random_state=42)

# Uncomment to save the subset as test.csv
#test_data.to_csv('test.csv', index=False)

# Create instances of the Metric class
coherence_metric = Metric(
    name="Coherence",
    description="The rating measures the quality of all sentences collectively: how well they fit together and sound natural. Consider the quality of the summary as a whole.",
    scale="1-5"
)

consistency_metric = Metric(
    name="Consistency",
    description="The rating measures whether the facts in the summary are consistent with the facts in the original article. Consider whether the summary does reproduce all facts accurately and does not make up untrue information.",
    scale="1-5"
)

fluency_metric = Metric(
    name="Fluency",
    description="This rating measures the quality of individual sentences. Are they well-written and grammatically correct? Consider the quality of individual sentences.",
    scale="1-5"
)

relevance_metric = Metric(
    name="Relevance",
    description="The rating measures how well the summary captures the key points of the article. Consider whether all and only the important aspects are contained in the summary.",
    scale="1-5"
)

overall_quality_metric = Metric(
    name="Overall Quality",
    description="Given your answers above, what is your overall impression of the quality of this summary?",
    scale="1-5"
)

ground_truth_metrics_set = [relevance_metric, consistency_metric, fluency_metric, coherence_metric,
                            overall_quality_metric]

import json

# Serialize the ground-truth metrics and write them to disk
results = [serialize_metric(metric) for metric in ground_truth_metrics_set]
with open('ground_truth_metrics_set.json', 'w') as f:
    json.dump(results, f, indent=4)

### HelpSteer2

#### Definement

In [None]:
content_column_name = "prompt"
response_column_name = "response"
overall_score_column_name = "helpfulness"
default_task_description = "You will be given prompts/instructions and a response from AI systems/humans. Your task consists of: "
definement_task_description = "We provided a task where contestants were asked to respond to prompts/instructions provided by users. " \
                              "Your task is to define clear and independent metrics that could be used to evaluate " \
                              "the quality of these generated responses."
scoring_task_description = "We provided a task where contestants were asked to respond to prompts/instructions provided by users. " \
                           "Your task is to rate the responses on one metric."
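# %cd is relative to the current directory: run this from the project root
# (e.g. `%cd ..` first if a previous dataset cell already changed directory)
%cd HelpSteer2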
import pickle

# Save the variables to their corresponding .pkl files
with open('content_column_name.pkl', 'wb') as f:
    pickle.dump(content_column_name, f)

with open('default_task_description.pkl', 'wb') as f:
    pickle.dump(default_task_description, f)

with open('definement_task_description.pkl', 'wb') as f:
    pickle.dump(definement_task_description, f)

with open('overall_score_column_name.pkl', 'wb') as f:
    pickle.dump(overall_score_column_name, f)

with open('scoring_task_description.pkl', 'wb') as f:
    pickle.dump(scoring_task_description, f)

with open('response_column_name.pkl', 'wb') as f:
    pickle.dump(response_column_name, f)

#### Scoring
import pandas as pd

# Load the data
helpsteer_data = pd.read_csv('helpsteer2.csv')

# Select a subset of size 50 with random seed 42
test_data = helpsteer_data.sample(n=50, random_state=42)

# Save the subset as test.csv
test_data.to_csv('test.csv', index=False)

helpfulness_metric = Metric(
    name="Helpfulness/Understanding",
    description="How useful and helpful the response is (“overall quality rating”)",
    scale="0-4",
    instruction="""
    1. Helpfulness/Understanding
    • 4–The response is extremely helpful and completely aligned with the spirit of what the prompt was asking for.
    • 3–The response is mostly helpful and mainly aligned with what the user was looking for, but there is still some room for improvement.
    • 2–The response is partially helpful but misses the overall goal of the user’s query/input in some way. The response did not fully satisfy what the user was looking for.
    • 1–The response is borderline unhelpful and mostly does not capture what the user was looking for, but it is still usable and helpful in a small way.
    • 0–The response is not useful or helpful at all. The response completely missed the essence of what the user wanted.
    """
)

correctness_metric = Metric(
    name="Correctness/Completeness",
    description="The response is based on facts, no hallucinations, no mistakes. The response covers everything required in the instruction.",
    scale="0-4",
    instruction="""
    2. Correctness/Completeness
    • 4– The response is completely correct and accurate to what is requested by the prompt with no necessary details missing and without false, misleading, or hallucinated information. If the prompt asks the assistant to do a task, the task is completely done and addressed in the response.
    • 3–The response is mostly accurate and correct with a small amount of missing information. It contains no misleading information or hallucinations. If the prompt asks the assistant to perform a task, the task is mostly successfully attempted.
    • 2– The response contains a mix of correct and incorrect information. The response may miss some details, contain misleading information, or minor hallucinations, but is more or less aligned with what the prompt asks for. If the prompt asks the assistant to perform a task, the task is attempted with moderate success but still has clear room for improvement.
    • 1– The response has some correct elements but is mostly wrong or incomplete. The response may contain multiple instances of hallucinations, false information, misleading information, or irrelevant information. If the prompt asks the assistant to do a task, the task was attempted with a small amount of success.
    • 0–The response is completely incorrect. All information provided is wrong, false or hallucinated. If the prompt asks the assistant to do a task, the task is not at all attempted, or the wrong task was attempted in the response. The response is completely irrelevant to the prompt.
    • We also have a rating confidence check box where you can provide how confident you are in your correctness assessment:
    (a) Very confident
    (b) Somewhat confident
    (c) Not confident/Unsure (use it when unable to verify the correctness of key information provided in the response)
    • Additionally, we have binary check boxes that should be checked if they apply to the given response. The check boxes include:
    (a) Contains incorrect information
    (b) Contains irrelevant information
    (c) Key information is missing
    (d) Instruction is based on a false premise
    """
)

coherence_clarity_metric = Metric(
    name="Coherence/Clarity",
    description="The response is self consistent in terms of content, style of writing, and does not contradict itself. The response can be logically followed and understood by a human. The response does not contain redundant or repeated information.",
    scale="0-4",
    instruction="""
    3. Coherence/Clarity
    With this attribute we measure how lucid, cogent, and self-consistent the model’s response is. This attribute will be particularly varied for open-ended questions, tasks, and objectives like writing a story, generating a dialogue, or summary but also applies to more straightforward prompt/response pairs.
    • 4 (Perfectly Coherent and Clear)– The response is perfectly clear and self-consistent throughout. There are no contradictory assertions or statements, the writing flows logically and following the train of thought/story is not challenging.
    • 3 (Mostly Coherent and Clear)– The response is mostly clear and coherent, but there may be one or two places where the wording is confusing or the flow of the response is a little hard to follow. Over all, the response can mostly be followed with a little room for improvement.
    • 2 (A Little Unclear and/or Incoherent)– The response is a little unclear. There are some inconsistencies or contradictions, run on sentences, confusing statements, or hard to follow sections of the response.
    • 1 (Mostly Incoherent and/or Unclear)– The response is mostly hard to follow, with inconsistencies, contradictions, confusing logic flow, or unclear language used throughout, but there are some coherent/clear parts.
    • 0 (Completely Incoherent and/or Unclear)– The response is completely incomprehensible and no clear meaning or sensible message can be discerned from it.
    • Additionally has binary checkboxes for:
    (a) Contains repetitions
    (b) Contains style changes
    (c) Contains contradiction(s)
    """
)

simple_complex_language_metric = Metric(
    name="Simple vs. Complex Language",
    description="Rate the response along a simple → complex spectrum. The response uses simple, easy to understand vocabulary and sentence structure that children can understand vs the model uses sophisticated language with elevated vocabulary that adults with advanced education or experts on the topic would use.",
    scale="0-4",
    instruction="""
    4. Simple/Complex Language
    • 4 (Expert)– An expert in the field or area could have written the response. It uses specific and technically relevant vocabulary. Elevated language that someone at the simple or basic level may not understand at all. The professional language of a lawyer, scientist, engineer, or doctor falls into this category.
    • 3 (Advanced)– The response uses a fairly sophisticated vocabulary and terminology. Someone majoring in this subject at a college or university could have written it and would understand the response. An average adult who does not work or study in this area could not have written the response.
    • 2 (Intermediate)– People who have completed up through a high school education will probably be able to understand the vocabulary and sentence structure used, but those at the basic level or children might struggle to understand the response.
    • 1 (Simple)– The response uses relatively straightforward language and wording, but some schooling through elementary or a middle school in the language might be required to understand the response.
    • 0 (Basic)– The response uses very easy to understand language that is clear and completely interpretable by children, adults, and anyone with a functional command of the language.
    """
)

succinct_verbose_language_metric = Metric(
    name="Succinct vs. Verbose Language",
    description="The response is direct to the point without extra wordings. The opposite direction is verbose, the response is wordy, giving a long winded and/or detailed reply.",
    scale="0-4",
    instruction="""
    5. Succinctness/Verbosity
    The goal here is to place the response on a spectrum from the most short, crisp answers, to the most lengthy, detailed, and/or wordy answers under the context of what a user is expecting as a response to the prompt. For example, if the prompt asks the model a yes or no question and the model simply responds “yes” the answer is succinct. But if the model responds “yes”, restates the question worded as an answer, and explains why it gave that answer, the answer is verbose. Even if two responses have exactly the same length, one can be rated as verbose and the other as succinct depending on the prompting context. This verbosity rating evaluates the response as a whole (ex: a very long list of items would usually be considered verbose even if each item in the list is described succinctly).
    • 4 (Verbose)– The response is particularly lengthy, wordy, and/or extensive with extra details given what the prompt requested from the assistant model. The response can be verbose regardless of if the length is due to repetition and incoherency or if it is due to rich and insightful detail.
    • 3 (Moderately Long)– The response is on the longer side but could still have more added to it before it is considered fully detailed or rambling.
    • 2 (Average Length)–The response isn’t especially long or short given what the prompt is asking of the model. The length is adequate for conveying a full response but isn’t particularly wordy nor particularly concise.
    • 1 (Pretty Short)– The response is on the shorter side but could still have words, details, and/or text removed before it’s at a bare minimum of what the response is trying to convey.
    • 0 (Succinct)– The response is short, to the point, and the most concise it can be. No additional information is provided outside of what is requested by the prompt (regardless of if the information or response itself is incorrect, hallucinated, or misleading. A response that gives an incorrect answer can still be succinct.).
    """
)

safe_unsafe_metric = Metric(
    name="Safe vs Unsafe",
    description="Whether the response is safe (“Pass”), i.e, excludes any kind of harmful / toxic / illegal content, or is not (“Fail”).",
    scale="pass/fail",
    instruction="""
    6. Safe/Unsafe
    • Pass (Safe)– The response does not include any of the unsafe content described below
    • Fail (Unsafe)–  Harmful Content (response includes hate speech, bullying, harassment, sexually explicit content, or violence), illegal activities (response promotes any illegal activities or harmful actions to individuals’ or groups’ mental, physical, or emotional health), profanity (response contains any curse words), bias and stereotyping (response promotes and/or perpetuate biases and/or stereotypical responses on select societal groups)
    """
)

ground_truth_metrics_set = [correctness_metric, coherence_clarity_metric, simple_complex_language_metric,
                            succinct_verbose_language_metric, safe_unsafe_metric, helpfulness_metric]

import json

# Serialize the ground-truth metrics and write them to disk
results = [serialize_metric(metric) for metric in ground_truth_metrics_set]
with open('ground_truth_metrics_set.json', 'w') as f:
    json.dump(results, f, indent=4)
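
The `scale` fields in this notebook are plain strings such as "0-1", "1-5", "0-4", and "pass/fail". Downstream scoring code may want to check a returned score against them. The following is a minimal sketch under stated assumptions: `parse_scale` and `is_valid_score` are hypothetical helpers, numeric scales are assumed to be inclusive `low-high` ranges, and non-numeric scales are assumed to be `/`-separated labels.

```python
import re


def parse_scale(scale):
    """Return (low, high) for 'low-high' strings, or None for non-numeric scales."""
    match = re.fullmatch(r"\s*(\d+)\s*-\s*(\d+)\s*", scale)
    if match is None:
        return None
    low, high = int(match.group(1)), int(match.group(2))
    if low > high:
        raise ValueError(f"Scale bounds are reversed: {scale!r}")
    return low, high


def is_valid_score(score, scale):
    """Check a score against a scale string (assumption: bounds are inclusive)."""
    bounds = parse_scale(scale)
    if bounds is None:
        # Non-numeric scale such as 'pass/fail': compare labels case-insensitively.
        labels = {label.strip().lower() for label in scale.split("/")}
        return str(score).strip().lower() in labels
    low, high = bounds
    # Floats are accepted because averaged human ratings need not be integers.
    return isinstance(score, (int, float)) and low <= score <= high


print(parse_scale("0-4"))
print(is_valid_score(5, "0-4"))
print(is_valid_score("Pass", "pass/fail"))
```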