Add new binary metrics #69
Conversation
How do you envision these metrics being used? I ask because a lot of the metrics would depend on the question, and so could not be used on a batch of questions and answers like how …
```python
self._name = name
self.callback = callback

def score(self, llm_response: LLMResponse, openai_service: OpenAIService) -> float:
```
We need some way to be able to use these metrics without passing an OpenAIService for the metrics that do not require one. My suggestion would be to make this parameter optional and default to None.
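A minimal sketch of that suggestion, assuming the names from the diff snippet above (`LLMResponse`, `OpenAIService`, and a length-based metric are stand-ins here, not the library's real classes):

```python
from typing import Optional


class OpenAIService:
    """Stand-in for the real OpenAIService (name assumed from the diff)."""


class LLMResponse:
    """Stand-in for the real LLMResponse; only a text field is assumed."""

    def __init__(self, text: str):
        self.text = text


class ResponseLengthMetric:
    """Hypothetical metric that needs no LLM evaluator."""

    def __init__(self, max_length: int):
        self.max_length = max_length

    def score(self, llm_response: LLMResponse,
              openai_service: Optional[OpenAIService] = None) -> float:
        # The service defaults to None, so callers outside the scorer
        # never have to instantiate one.
        return float(len(llm_response.text) <= self.max_length)


# Scoring works without ever constructing an OpenAIService.
print(ResponseLengthMetric(10).score(LLMResponse("short")))  # 1.0
```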
All the metrics are called by our ValidateScorer. If we allow it to be None, then we'd need to change the logic in the scorer to not pass it for metrics that don't need it. That becomes harder to do when people write their own metrics (e.g. a custom binary metric), since we don't know whether they're using OpenAI or not. Sure, we could have them set something like self.uses_openai = True, but that's a bad pattern since it relies on the user remembering to change it. It's easier just to pass it in regardless, in which case OpenAIService will never be None anyhow. Plus, if we make it optional, then we will have to rewrite all of our existing code to check that it's not None in order to satisfy typechecking.
The case I'm thinking of is when a user is using one of the metrics outside of ValidateScorer. In that case, they should not need to instantiate and pass an OpenAIService when it's not needed. If it's optional, then ValidateScorer can still pass all the metrics its OpenAIService without any issues.
I don't think that's worth it: it's a really niche use case, and it would lead to us adding None checks everywhere in the code (as well as in the metrics people write themselves).
I think whether a metric actually needs an LLM evaluator is an important distinction to make, and it should be reflected in the metric. I can imagine users thinking: I want to create a bunch of tests that use metrics that do not need an LLM evaluator (because those tests are much cheaper to run). Then they don't need an OpenAIService, and passing a model evaluator to ValidateScorer is unnecessary.
What do you think about putting the logic into the classes, so we would have one metric base class that uses LLM-assisted evaluation and one that does not? The metrics that do not use LLM-assisted evaluation would then not need an OpenAIService.
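One way the two-base-class idea could look, sketched with stand-in types (class names here are illustrative assumptions, not the library's actual API):

```python
from abc import ABC, abstractmethod


class OpenAIService:
    """Stand-in for the real LLM evaluator service."""


class LLMResponse:
    """Stand-in response type; only a text field is assumed."""

    def __init__(self, text: str):
        self.text = text


class Metric(ABC):
    """Base class for metrics that need no LLM evaluator."""

    @abstractmethod
    def score(self, llm_response: LLMResponse) -> float: ...


class LLMAssistedMetric(ABC):
    """Base class for metrics that call out to an LLM evaluator."""

    @abstractmethod
    def score(self, llm_response: LLMResponse,
              openai_service: OpenAIService) -> float: ...


class ContainsTextMetric(Metric):
    """Cheap, non-LLM metric: 1.0 if the response contains the text."""

    def __init__(self, text: str):
        self.text = text

    def score(self, llm_response: LLMResponse) -> float:
        return float(self.text in llm_response.text)


# A scorer could dispatch with isinstance(metric, LLMAssistedMetric)
# and only pass the service to metrics that actually need it.
print(ContainsTextMetric("yes").score(LLMResponse("yes, done")))  # 1.0
```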
Yeah, I can add a subclass for this. That would make sense, but I'll probably do it in a separate PR since it's a bit out of scope.
These metrics look great! I have three main comments:
1. The main question in my mind is, "how is the user going to use these metrics and log them to the Validate UI?" Are they going to use ValidateScorer?
2. An example Jupyter notebook using these metrics would be valuable.
3. There are some small changes I requested in other comments.

1 and 2 can be left for a separate PR, but I'd like to hear your thoughts on 1.
Yeah, they should all be compatible with the UI. I didn't really think much about it changing per question, though, simply because the way I imagined it is that you have some format you want for your LLM responses, and those text-matching metrics let you check that a response meets your desired format. The answer match is probably the least useful for checking format, but it could still be used to make sure that, given a certain question (e.g. asking the RAG system to give you sensitive info), the RAG system always returns the same refusal answer. Anyhow, I can definitely see the usefulness of changing it question by question (which we should allow in the future).
In that case, we need clear documentation in the docs, and perhaps in docstrings, for the intended use of these metrics. It's not clear at the moment, because I imagined the regex, answer match, and contains text metrics would change what they are looking for from question to question.
Re docs: Adam mentioned that, given the number of metrics, we should just add them to the GitBook docs and not to the README itself. So when Janice does the docs for that, I think she can mention it.
Looks good, thanks for addressing the PR feedback. I made issue #86 about the metric subclasses.
This pull request adds new binary metrics:
- BinaryMetric: lets you pass in an arbitrary function that evaluates to true or false
- RegexMetric: checks whether the response matches a given regex
- AnswerMatchMetric: checks whether the response exactly matches a given answer
- DuplicationMetric: checks whether there is duplicate information in the response
- ResponseLengthMetric: checks that the response is within a certain number of characters
- ContextLengthMetric: checks that the context length is within a certain range
- ContainsTextMetric: checks whether the response contains the given text
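To make the intent of these metrics concrete, here is a hedged sketch of how two of them might behave. The constructor signatures and the `LLMResponse` stand-in are assumptions for illustration, not the PR's actual API:

```python
import re


class LLMResponse:
    """Stand-in for the library's response object; only text is assumed."""

    def __init__(self, text: str):
        self.text = text


class RegexMetric:
    """Illustrative version: 1.0 if the regex matches the response."""

    def __init__(self, pattern: str):
        self.pattern = re.compile(pattern)

    def score(self, llm_response: LLMResponse) -> float:
        return float(bool(self.pattern.search(llm_response.text)))


class ResponseLengthMetric:
    """Illustrative version: 1.0 if the response length is in range."""

    def __init__(self, min_length: int, max_length: int):
        self.min_length = min_length
        self.max_length = max_length

    def score(self, llm_response: LLMResponse) -> float:
        return float(self.min_length <= len(llm_response.text) <= self.max_length)


resp = LLMResponse("Order ID: 12345")
print(RegexMetric(r"\d{5}").score(resp))        # 1.0: contains five digits
print(ResponseLengthMetric(5, 40).score(resp))  # 1.0: 15 chars, in range
```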