Merged
Description
Title: Factuality Test Implementation for Language Models
Overview
This pull request introduces an implementation of the Factuality Test for large language models (LLMs). The Factuality Test evaluates how well LLMs determine the factuality of statements within summaries, focusing on the accuracy of LLM-generated summaries and on potential biases in the models' judgments.
Test Objective
The primary goal of the Factuality Test is to assess how well LLMs can identify the factual accuracy of summary sentences. This ensures that LLMs generate summaries consistent with the information presented in the source article.
Data Source
For this test, we use the Factual-Summary-Pairs dataset, sourced from the Factual-Summary-Pairs GitHub repository.
Methodology
Our test methodology draws inspiration from a reference article titled "LLAMA-2 is about as factually accurate as GPT-4 for summaries and is 30x cheaper".
Bias Identification
We identify bias in the model's responses based on specific response patterns.
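The description does not enumerate the patterns themselves, but following the A/B comparison setup of the referenced article, a minimal sketch of one such check (positional bias) might look like this. The `verdicts` list and the 80% threshold are illustrative assumptions, not part of this PR:

```python
from collections import Counter

# Hypothetical model verdicts from A/B factuality comparisons.
# A strong skew toward one position can indicate positional bias
# rather than a genuine factuality judgment.
verdicts = ["A", "B", "A", "A", "B", "A", "A", "A", "B", "A"]

counts = Counter(verdicts)
total = len(verdicts)
for option, n in counts.items():
    print(f"{option}: {n / total:.0%}")  # A: 70%, B: 30%

# Flag a potential positional bias if either option dominates
# (the 0.8 cutoff here is an arbitrary illustrative threshold).
biased = max(counts.values()) / total > 0.8
print("Possible positional bias:", biased)  # -> False
```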
Accuracy Assessment
Accuracy is assessed from the "pass" column: a value of True indicates a correct response, and False indicates an incorrect one.
Notebook
Type of change
Usage
Checklist:
- Use pydantic for typing when/where necessary.

Screenshots (if appropriate):
These results were obtained after running the model on a set of 50 records.