
LLM Rater improvements#70

Merged
viditchopra1500 merged 1 commit into main from
llm-rater-improvements
Oct 21, 2024

Conversation

@viditchopra1500
Collaborator

LLM Rater can now handle the following cases:

  1. Dealing with missing DISTINCT cases.
  2. Being more lenient with calculated values.
  3. And paying less attention to column names.
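Case 2 ("more lenient with calculated values") is enforced through the rater's prompt rather than in code, but the intended behavior can be illustrated with a small programmatic sketch. The helper name and tolerance below are hypothetical and not part of this PR:

```python
import math

def values_match(expected, actual, rel_tol=1e-3):
    """Hypothetical helper: lenient comparison of calculated values.

    Numeric cells are compared with a relative tolerance instead of exact
    equality, so small rounding differences (e.g. 0.3333 vs 1/3) still match.
    """
    try:
        return math.isclose(float(expected), float(actual), rel_tol=rel_tol)
    except (TypeError, ValueError):
        # Non-numeric cells fall back to exact string comparison.
        return str(expected) == str(actual)

print(values_match(0.3333, 1 / 3))    # rounding difference within tolerance
print(values_match("Paris", "Paris"))
```

The tolerance value is a design choice; too tight and legitimate rounding in computed aggregates gets penalized, too loose and genuinely wrong answers pass.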

Colab Experiments Link: Notebook

@viditchopra1500 force-pushed the llm-rater-improvements branch 5 times, most recently from 6613794 to fdc45b2 on October 21, 2024 12:20
import vertexai
from vertexai.preview.generative_models import GenerationConfig, GenerativeModel

from ratelimit import limits
Collaborator Author

@viditchopra1500 Oct 21, 2024


Rearranged the imports and removed the unused imports.

self.model = GenerativeModel(self.config["model"])

@staticmethod
def remove_duplicates(output_list: list) -> list:
Collaborator Author


Removes duplicate rows; this is part of preprocessing before feeding the execution outputs into the prompt.
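A minimal sketch of what such a deduplication step might look like, assuming rows can themselves be (unhashable) lists. The body below is illustrative, not the PR's actual implementation:

```python
@staticmethod
def remove_duplicates(output_list: list) -> list:
    """Order-preserving deduplication of execution-output rows.

    Rows may be lists (unhashable), so each row is keyed by its repr()
    rather than inserted into a set directly.
    """
    seen = set()
    result = []
    for row in output_list:
        key = repr(row)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result
```

Order preservation matters here: the deduplicated rows are rendered into the prompt, so keeping the first occurrence of each row avoids reshuffling the output the LLM sees.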

3. Compare the data in the mapped columns between OUTPUT #1 and OUTPUT #2. The data should be
an exact match; there should be no extra or missing data in OUTPUT #2.
3. Compare the data within each mapped column pair between OUTPUT #1 and OUTPUT #2.
Ensure that OUTPUT #2 contains all the data from OUTPUT #1, with no missing or extra rows.
Collaborator Author


Removed the exact-match requirement from here; the comparison logic is now part of the rules section instead.


1. Map all columns in OUTPUT #1 to the same columns in OUTPUT #2. Column names might differ as
long as the data they contain represents the same information.
1. Ensure that every column in OUTPUT #1 has a corresponding column in OUTPUT #2 that represents
Collaborator Author


Rephrased this a bit to emphasize that columns should be mapped even when their names differ.
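The mapping itself is performed by the LLM following this prompt rule, but a programmatic analogue can illustrate the idea of matching columns by their data rather than their names. The function and column names below are made up for illustration:

```python
from collections import Counter

def map_columns(out1: dict, out2: dict) -> dict:
    """Illustrative sketch: map each column of OUTPUT #1 to a column of
    OUTPUT #2 whose values match as a multiset, ignoring column names."""
    mapping = {}
    used = set()
    for name1, vals1 in out1.items():
        for name2, vals2 in out2.items():
            # Counter comparison ignores row order but respects duplicates.
            if name2 not in used and Counter(vals1) == Counter(vals2):
                mapping[name1] = name2
                used.add(name2)
                break
    return mapping

print(map_columns(
    {"country": ["FR", "DE"], "population": [68, 84]},
    {"nation": ["FR", "DE"], "pop_millions": [68, 84]},
))
```

A real rater would also have to break ties when two columns contain identical data, which is exactly the kind of ambiguity the LLM-based comparison is better suited to resolve.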

@hardikgu23
Collaborator

Here are a few results from local bird runs:

Just curious: are these local runs using the 1.5-002 model? If yes, were we able to get a similar level of improvement with the 001 model, or did the 002 model clearly outperform 001?

@viditchopra1500
Collaborator Author

Hi @hardikgu23, yes, these local run results use the 1.5-002 model.

Based on my experiments (for this bug: b/373040746): the old prompt for the LLM rater already contained the instruction to allow mapping of columns with different names.
I gave the same prompt to the 1.5-002 model, and it followed that instruction on all of the test cases I was using, while the 1.5-001 model failed on half of them. Based on this experiment, and seeing no regressions on the new model, I chose to upgrade to it.

@viditchopra1500 merged commit 6a66b09 into main on Oct 21, 2024