
feature/Factuality test#767

Merged
ArshaanNazir merged 16 commits into release/1.5.0 from factuality-test
Sep 18, 2023

Conversation

Contributor

@Prikshit7766 Prikshit7766 commented Sep 15, 2023

Description

Title: Factuality Test Implementation for Language Models

Overview

This pull request introduces the implementation of the Factuality Test for large language models (LLMs). The Factuality Test evaluates the ability of LLMs to determine the factuality of statements within summaries, focusing both on the accuracy of LLM-generated summaries and on potential positional biases in their judgments.

Test Objective

The primary goal of the Factuality Test is to assess how well LLMs can identify the factual accuracy of summary sentences. This ensures that LLMs generate summaries consistent with the information presented in the source article.

Data Source

For this test, we utilize the Factual-Summary-Pairs dataset, which is sourced from the Factual-Summary-Pairs GitHub repository.

Methodology

Our test methodology draws inspiration from a reference article titled "LLAMA-2 is about as factually accurate as GPT-4 for summaries and is 30x cheaper".

Bias Identification

Each example is evaluated twice: once in the original order (where the factual summary is presented as option "A") and once with the two summaries swapped (so the factual summary becomes option "B"). We identify bias in the responses based on the following patterns:

  • Bias Towards A: Both the "result" and "swapped_result" are "A." The model favors position A regardless of content; since the swapped case is answered incorrectly, the example is marked as False.
  • Bias Towards B: Both the "result" and "swapped_result" are "B." The model favors position B regardless of content; since the original case is answered incorrectly, the example is marked as False.
  • No Bias (incorrect): The "result" is "B" and the "swapped_result" is "A." The model shows no positional bias but picks the wrong summary in both orderings, so the example is marked as False.
  • No Bias (correct): The "result" is "A" and the "swapped_result" is "B." The model shows no positional bias and picks the factual summary in both orderings, so the example is marked as True.
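The four cases above can be sketched as a small predicate. This is a hypothetical illustration of the described pass/fail logic, not the langtest source; the function name `is_pass` and its signature are assumptions:

```python
def is_pass(result: str, swapped_result: str) -> bool:
    """Return True only when the model picks the factual summary in both
    orderings: "A" in the original order and "B" after the swap.

    All other combinations are marked False, either because of a
    positional bias (A/A or B/B) or because the model is consistently
    wrong without bias (B/A).
    """
    if result == swapped_result:
        return False  # positional bias towards A (A/A) or towards B (B/B)
    # No positional bias: correct only if the factual summary was chosen
    # in both orderings.
    return result == "A" and swapped_result == "B"
```

Note that the two "no bias" cases are distinguished purely by which summary was chosen, since the factual summary changes position between runs.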

Accuracy Assessment

Accuracy is assessed by examining the "pass" column: True indicates a correct response and False an incorrect one, so overall accuracy is the fraction of examples marked True.
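As a minimal sketch of that computation (assuming the "pass" column has been extracted into a list of booleans; the helper name `accuracy` is hypothetical):

```python
def accuracy(passes: list[bool]) -> float:
    """Fraction of examples whose "pass" value is True."""
    if not passes:
        return 0.0  # avoid division by zero on an empty result set
    return sum(passes) / len(passes)
```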

Notebook

Type of change

Please delete options that are not relevant.

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Usage

import os

from langtest import Harness

os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

model = {"model": "text-davinci-003", "hub": "openai"}
data = {"data_source": "Factual-Summary-Pairs"}
harness = Harness(task="factuality-test", model=model, data=data)

Checklist:

  • I've added Google style docstrings to my code.
  • I've used pydantic for typing when/where necessary.
  • I have linted my code.
  • I have added tests to cover my changes.

Screenshots (if appropriate):

(two screenshots of the factuality test results)

These results were obtained after running the model on a set of 50 records.

@Prikshit7766 Prikshit7766 added the v2.1.0 Issue or request to be done in v2.1.0 release label Sep 15, 2023
@Prikshit7766 Prikshit7766 linked an issue Sep 15, 2023 that may be closed by this pull request
Collaborator

@chakravarthik27 chakravarthik27 left a comment


LGTM 😄

@ArshaanNazir ArshaanNazir merged commit aebd842 into release/1.5.0 Sep 18, 2023
@ArshaanNazir ArshaanNazir mentioned this pull request Sep 19, 2023
@ArshaanNazir ArshaanNazir deleted the factuality-test branch September 21, 2023 07:41

Labels

v2.1.0 Issue or request to be done in v2.1.0 release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add new Factuality test

3 participants