
Add new binary metrics #69

Merged: 11 commits merged into main on Mar 7, 2024
Conversation

@ethan-tonic (Collaborator) commented Feb 23, 2024

This pull request adds new binary metrics. The added metrics include:

  • BinaryMetric: Lets you pass in an arbitrary function that evaluates to true or false (see the sketch after this list)

  • RegexMetric: Checks whether the response matches a given regex

  • AnswerMatchMetric: Checks whether the response exactly matches a given answer

  • DuplicationMetric: Checks whether there is duplicate information in the response

  • ResponseLengthMetric: Checks that the response is within a certain number of characters

  • ContextLengthMetric: Checks that the context length is within a certain range

  • ContainsTextMetric: Checks whether the response contains the given text
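For illustration, here is a minimal sketch of how a custom BinaryMetric might be set up. The import paths, the callback signature, and the llm_answer attribute are assumptions inferred from this diff rather than confirmed API:

# Sketch only: the import paths, callback signature, and llm_answer attribute
# below are assumptions inferred from this PR's diff, not confirmed API.
from tonic_validate.classes.llm_response import LLMResponse        # assumed path
from tonic_validate.metrics.binary_metric import BinaryMetric      # assumed path
from tonic_validate.services.openai_service import OpenAIService   # assumed path

def always_refuses(llm_response: LLMResponse, openai_service: OpenAIService) -> bool:
    # Arbitrary true/false check; the callback is assumed to take the same
    # arguments as Metric.score() in this diff.
    answer = llm_response.llm_answer  # assumed attribute holding the generated answer
    return answer.strip().lower().startswith("sorry, i can't share that")

# BinaryMetric is assumed to take a name plus the callback; score() would then
# map the boolean result to 1.0 or 0.0.
refusal_metric = BinaryMetric("always_refuses", always_refuses)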

@ethan-tonic ethan-tonic marked this pull request as draft February 26, 2024 22:29
@ethan-tonic ethan-tonic marked this pull request as ready for review March 4, 2024 22:48
@joeferraratonic (Collaborator):

How do you envision these metrics being used with ValidateScorer and logged to the UI? Or do you not envision that?

I ask because a lot of the metrics would depend on the question, and so couldn't be used on a batch of questions and answers the way ValidateScorer uses the existing metrics. For the regex, answer match, and contains text metrics, I'd imagine you'd want the metric to change for each question.

The following review thread is on these lines of the diff:

self._name = name
self.callback = callback

def score(self, llm_response: LLMResponse, openai_service: OpenAIService) -> float:
Collaborator:

We need some way to use the metrics that don't require an OpenAIService without passing one in. My suggestion would be to make this parameter optional and default it to None.
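For concreteness, a sketch of what that suggestion could look like; the import paths and class body are illustrative assumptions, not the code in this PR:

# Sketch of the reviewer's suggestion, not the merged implementation: make
# openai_service optional so metrics that never call an LLM can be scored
# without instantiating an OpenAIService.
from typing import Optional

from tonic_validate.classes.llm_response import LLMResponse        # assumed path
from tonic_validate.services.openai_service import OpenAIService   # assumed path

class ContainsTextExample:  # illustrative stand-in, not the PR's ContainsTextMetric
    def __init__(self, text: str):
        self._name = "contains_text_example"
        self.text = text

    def score(self, llm_response: LLMResponse, openai_service: Optional[OpenAIService] = None) -> float:
        # No LLM evaluator needed here: a plain substring check maps to 1.0 or 0.0,
        # so openai_service can stay None when the metric is used on its own.
        # (llm_answer is an assumed attribute name.)
        return 1.0 if self.text in llm_response.llm_answer else 0.0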

Collaborator Author:

All the metrics are called by our ValidateScorer. If we allow it to be None, then we'd need to change the logic in the scorer to not pass it for metrics that don't need it. That becomes harder to do when people are writing their own metrics (e.g. a custom binary metric), since we don't know whether they're using OpenAI or not. Sure, we could have them set something like self.uses_openai = True, but that's a bad pattern since it relies on the user remembering to change it. It's easier to just pass it in regardless, in which case OpenAIService will never be None anyway. Plus, if we make it optional, we'll have to rewrite all of our existing code to check that it's not None in order to satisfy typechecking.

Collaborator:

The case I'm thinking of is when a user is using one of the metrics outside of ValidateScorer. In that case, they shouldn't need to instantiate and pass an OpenAIService when it's not needed.

If it's optional, then ValidateScorer can still pass all the metrics its OpenAIService without any issues.

Collaborator Author:

I don't think that's worth it: it's a really niche use case, and it would lead to us having to add checks for it everywhere in the code (as well as in metrics that people write themselves).

Collaborator:

I think whether a metric actually needs an LLM evaluator is an important distinction to make, and it should be reflected in the metric. I can imagine users thinking, "I want to create a bunch of tests that use metrics that don't need an LLM evaluator, because those tests are way cheaper." In that case they don't need an OpenAIService, and passing a model evaluator to ValidateScorer is unnecessary.

What do you think about putting the logic into the classes, so we would have one metric base class that uses LLM-assisted evaluation and one that does not? The metrics that don't use LLM-assisted evaluation would then not need an OpenAIService.
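A rough sketch of what such a split could look like follows; the class names are purely illustrative, since the actual design was deferred to a follow-up issue:

# Illustrative sketch of the proposed split between LLM-assisted and non-LLM
# metrics; the class names here are hypothetical, not the project's API.
from abc import ABC, abstractmethod
from typing import Any

class Metric(ABC):
    """Common interface: every metric exposes a name."""

    @property
    @abstractmethod
    def name(self) -> str: ...

class LLMAssistedMetric(Metric):
    """Metrics that need an LLM evaluator receive an OpenAIService-like object."""

    @abstractmethod
    def score(self, llm_response: Any, openai_service: Any) -> float: ...

class NonLLMMetric(Metric):
    """Metrics that can be scored without any LLM calls (and so cost nothing extra)."""

    @abstractmethod
    def score(self, llm_response: Any) -> float: ...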

Collaborator Author:

Yeah, I can add a subclass for this. That would make sense, but I'll probably do it in a separate PR since it's a bit out of scope here.

@joeferraratonic (Collaborator) left a comment:

These metrics look great! I have three main comments:

  1. The main question in my mind is, "How is the user going to use these metrics and log them to the Validate UI?" Are they going to use ValidateScorer?
  2. An example Jupyter notebook using these metrics would be valuable.
  3. There are some small changes I requested in other comments.

1 and 2 can be left for a separate PR, but I'd like to hear your thoughts on 1.

@ethan-tonic (Collaborator Author):

> How do you envision these metrics being used with ValidateScorer and logged to the UI? Or do you not envision that?
>
> I ask because a lot of the metrics would depend on the question, and so couldn't be used on a batch of questions and answers the way ValidateScorer uses the existing metrics. For the regex, answer match, and contains text metrics, I'd imagine you'd want the metric to change for each question.

Yeah, they should all be compatible with the UI. I didn't really think much about it changing per question, though, simply because the way I imagined it, you have some format you want your LLM responses to follow, and the text-matching metrics let you check that the response meets that format. The answer match metric is probably the least useful for checking format, but it could still be used to make sure that, given a certain question (e.g. asking the RAG system for sensitive info), the RAG system always returns the same refusal answer. Anyhow, I can definitely see the usefulness of changing it question by question (which we should allow in the future).

@joeferraratonic (Collaborator):


In this case, we need clear documentation in the docs, and perhaps in docstrings, on the intended use of these metrics. It's not clear at the moment, because I imagined the regex, answer match, and contains text metrics would change what they're looking for from question to question.

@ethan-tonic (Collaborator Author):

Re docs: Adam mentioned that, given the number of metrics, we should just add them to the GitBook docs rather than the README itself. So when Janice does the docs for that, I think she can mention it there.

@joeferraratonic (Collaborator) left a comment:

Looks good, thanks for making the changes from the PR feedback. I made issue #86 about the metric subclasses.

@ethan-tonic merged commit afb9f82 into main on Mar 7, 2024