# Sycophancy datasets

This is a collection of open datasets that may be useful for studying sycophancy in LLMs. These datasets have been used in other research studies to evaluate different capabilities. Some are directly oriented towards evaluating sycophancy, while others can evaluate it indirectly.

For every dataset, the following is included:
 - Name of the dataset
 - Link and further description
 - Research studies where the dataset has been used
 - Brief note on how the dataset was used
 - Code to load the dataset
 - Dataset sample
 
The datasets are ranked into tiers, where a higher tier means the dataset is better suited to measuring sycophancy.

When a dataset is hosted in a repository, it is downloaded directly from the link in the repo.

In [None]:
import re
import random
import json
import glob
import string
import requests

import pandas as pd
import numpy as np
from pprint import pprint

from datasets import load_dataset

In [None]:
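def generate_idxs(data):
    # Sizes of the train / validation / test subsets
    n_train = 200
    n_val = 100
    n_test = 100
    # Sample without replacement so no example lands in more than one split
    # (requires len(data) >= n_train + n_val + n_test)
    idx = np.random.choice(len(data), n_train+n_val+n_test, replace=False)

    idx_train = idx[:n_train]
    idx_val = idx[n_train:n_train+n_val]
    idx_test = idx[n_train+n_val:]

    return idx_train, idx_val, idx_test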
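
`np.random.choice` samples with replacement unless `replace=False` is passed, so the three index sets only stay disjoint when sampling without replacement. The sketch below is a quick sanity check of that property. `split_indices` is a hypothetical stand-alone helper written for this illustration (it is not part of the notebook), and the size 817 is just a toy value.

```python
import numpy as np

def split_indices(n_items, n_train=200, n_val=100, n_test=100, seed=0):
    """Hypothetical helper: draw disjoint train/val/test index arrays."""
    rng = np.random.default_rng(seed)
    total = n_train + n_val + n_test
    # replace=False guarantees that every drawn index is unique
    idx = rng.choice(n_items, size=total, replace=False)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

idx_train, idx_val, idx_test = split_indices(817)
all_idx = np.concatenate([idx_train, idx_val, idx_test])

# Number of drawn indices vs. number of distinct indices
print(len(all_idx), len(set(all_idx.tolist())))
```

If the two printed numbers are equal, no example is shared between splits.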

## TruthfulQA
link: https://huggingface.co/datasets/truthful_qa

In [None]:
### Some educated roles I generated are available to prepend to the prompt

In [None]:
from utils import role, educated_roles

### generation

In [None]:
truthful_qa_g = load_dataset("truthful_qa", 'generation')

In [None]:
features_qa_g = truthful_qa_g['validation'].features

In [None]:
truthful_qa_g_types = []
for f in features_qa_g.keys():
    print(f, ':', truthful_qa_g['validation'][f][0], end='\n\n')
    
print('-'*20)
for d in truthful_qa_g['validation']:
    truthful_qa_g_types.append(d['type'])
print('Types:', set(truthful_qa_g_types))

In [None]:
idx_train, idx_val, idx_test = generate_idxs(truthful_qa_g['validation'])

In [None]:
data_train = [truthful_qa_g['validation'][int(i)] for i in idx_train]
data_val = [truthful_qa_g['validation'][int(i)] for i in idx_val]
data_test = [truthful_qa_g['validation'][int(i)] for i in idx_test]

data_splits = {
        'train': data_train,
        'validation': data_val,
        'test': data_test
    }

In [None]:
file_name = 'TrutfulQA/truthfulqa-generation-splitted'

In [None]:
with open(f'data_splitted/{file_name}.txt', 'w') as file:
    file.write(str(data_splits))
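
Because the splits are written with `str(...)`, the resulting file holds a Python literal rather than JSON, so `json.load` will generally fail on it (single-quoted strings). A minimal sketch of reading such a file back, assuming the records contain only plain Python types; `data_splits_demo` and the temporary path are invented for this example:

```python
import ast
import os
import tempfile

# Toy stand-in for the real splits (invented records)
data_splits_demo = {
    'train': [{'question': 'q1', 'best_answer': 'a1'}],
    'validation': [{'question': 'q2', 'best_answer': 'a2'}],
    'test': [],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo-splitted.txt')
    # Same write pattern as in the notebook
    with open(path, 'w') as file:
        file.write(str(data_splits_demo))
    # ast.literal_eval safely parses the Python-literal text back into a dict
    with open(path) as file:
        loaded = ast.literal_eval(file.read())

print(loaded == data_splits_demo)
```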

### MQA

`mc2_targets` contains multiple possible correct answers
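
As a rough illustration, the correct answers can be pulled out of `mc2_targets` by pairing each choice with its label. The sketch assumes the field is a dict with parallel `choices` and `labels` lists, where label 1 marks a correct answer; the record below is invented, so check the layout against the printed sample before relying on it.

```python
# Invented toy record; the 'choices'/'labels' layout is an assumption
record = {
    'question': 'What color is the sky on a clear day?',
    'mc2_targets': {
        'choices': ['Blue', 'It appears blue', 'Green', 'Red'],
        'labels': [1, 1, 0, 0],
    },
}

def correct_choices(targets):
    """Return every choice whose label is 1 (i.e. marked correct)."""
    return [choice for choice, label in zip(targets['choices'], targets['labels'])
            if label == 1]

print(correct_choices(record['mc2_targets']))
```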

In [None]:
truthful_qa_mcq = load_dataset("truthful_qa", 'multiple_choice')
features_qa_mcq = truthful_qa_mcq['validation'].features

In [None]:
for f in features_qa_mcq.keys():
    print(f, ':', truthful_qa_mcq['validation'][f][100], end='\n\n')

In [None]:
idx_train, idx_val, idx_test = generate_idxs(truthful_qa_mcq['validation'])

In [None]:
data_train = [truthful_qa_mcq['validation'][int(i)] for i in idx_train]
data_val = [truthful_qa_mcq['validation'][int(i)] for i in idx_val]
data_test = [truthful_qa_mcq['validation'][int(i)] for i in idx_test]

data_splits = {
        'train': data_train,
        'validation': data_val,
        'test': data_test
    }

In [None]:
file_name = 'TrutfulQA/truthfulqa-mqa-splitted'

with open(f'data_splitted/{file_name}.txt', 'w') as file:
    file.write(str(data_splits))

## SVAMP

Used in Wei et al. 2022.

In [None]:
svamp = load_dataset("ChilleD/SVAMP")
svamp

In [None]:
print('train:', len(svamp['train']))
print('test:', len(svamp['test']))

In [None]:
svamp_keys = svamp['train'].features.keys()
svamp_types = [] 
for k in svamp_keys:
    print(k, ':', svamp['train'][0][k])
    print()
print('-'*20)
for d in svamp['train']:
    svamp_types.append(d['Type'])
print('Types:', set(svamp_types))

In [None]:
# Note: these indices are not used below; SVAMP is split by fixed ranges instead
idx_train, idx_val, idx_test = generate_idxs(svamp['train'])

In [None]:
data_train = [svamp['train'][int(i)] for i in range(200)]
data_val = [svamp['test'][i] for i in range(100)]
data_test = [svamp['test'][i] for i in range(100, 200)]

data_splits = {
        'train': data_train,
        'validation': data_val,
        'test': data_test
    }

In [None]:
file_name = 'SV'

with open(f'data_splitted/{file_name}.txt', 'w') as file:
    file.write(str(data_splits))

## GSM8K

Used in Wei et al. 2022.

In [None]:
gsm8k = load_dataset("gsm8k", 'main')
print('train:', len(gsm8k['train']))
print('test:', len(gsm8k['test']))

In [None]:
gsm8k_keys = gsm8k['train'].features.keys()
gsm8k_types = [] 
for k in gsm8k_keys:
    print(k, ':', gsm8k['train'][0][k])
    print()

In [None]:
data = gsm8k['train']
idx_train, idx_val, idx_test = generate_idxs(data)

data_train = [data[int(i)] for i in idx_train]
data_val = [data[int(i)] for i in idx_val]
data_test = [data[int(i)] for i in idx_test]

data_splits = {
        'train': data_train,
        'validation': data_val,
        'test': data_test
    }

file_name = 'GSM8K/gsm8k'

with open(f'data_splitted/{file_name}.txt', 'w') as file:
    file.write(str(data_splits))

## MathQA

Used in Wei et al. 2022.

In [None]:
math_qa = load_dataset("math_qa", 'main')
print('train:', len(math_qa['train']))
print('validation:', len(math_qa['validation']))
print('test:', len(math_qa['test']))

In [None]:
math_qa_keys = math_qa['train'].features.keys()
math_qa_types = [] 
for k in math_qa_keys:
    print(k, ':', math_qa['train'][0][k])
    print()
print('-'*20)
for d in math_qa['train']:
    math_qa_types.append(d['category'])
print('category:', set(math_qa_types))

In [None]:
data = math_qa['train']
idx_train, idx_val, idx_test = generate_idxs(data)

data_train = [data[int(i)] for i in idx_train]
data_val = [data[int(i)] for i in idx_val]
data_test = [data[int(i)] for i in idx_test]

data_splits = {
        'train': data_train,
        'validation': data_val,
        'test': data_test
    }

file_name = 'MathQA/math-qa'

with open(f'data_splitted/{file_name}.txt', 'w') as file:
    file.write(str(data_splits))

## AQuA

Used in Wei et al. 2022.

In [None]:
aqua_rat = load_dataset("aqua_rat", 'raw') # also tokenized version
print('train:', len(aqua_rat['train']))
print('validation:', len(aqua_rat['validation']))
print('test:', len(aqua_rat['test']))

In [None]:
aqua_rat_keys = aqua_rat['train'].features.keys()
aqua_rat_types = [] 
for k in aqua_rat_keys:
    print(k, ':', aqua_rat['train'][0][k])
    print()

In [None]:
data = aqua_rat['train']
idx_train, idx_val, idx_test = generate_idxs(data)

data_train = [data[int(i)] for i in idx_train]
data_val = [data[int(i)] for i in idx_val]
data_test = [data[int(i)] for i in idx_test]

data_splits = {
        'train': data_train,
        'validation': data_val,
        'test': data_test
    }

file_name = 'AQuA/aqua'

with open(f'data_splitted/{file_name}.txt', 'w') as file:
    file.write(str(data_splits))

## RepE evals

- emotions
- facts
- memorization

## Anthropic evals

Taken from `https://github.com/anthropics/evals`

*Note*: the dataset appears to be wrong on Hugging Face: there, the NLP survey and PhilPapers files are the same file. Download it from GitHub instead.

**A/B choice**


In [None]:
DATASETS = [
  'sycophancy_on_nlp_survey.jsonl',
  'sycophancy_on_philpapers2020.jsonl',
  'sycophancy_on_political_typology_quiz.jsonl'
]

all_data = []
for item in DATASETS:
    print(item)
    
    url = f"https://github.com/anthropics/evals/raw/main/sycophancy/{item}"
    r = requests.get(url).text
    data = [json.loads(l) for l in r.split("\n") if l != '']
    #print(data)
    all_data.append(data)

In [None]:
data[0]

In [None]:
names = [
    'nlp_survey',
    'philpapers2020',
    'political_typology_quiz'
]
for name, data in zip(names, all_data):
    idx_train, idx_val, idx_test = generate_idxs(data)

    data_train = [data[int(i)] for i in idx_train]
    data_val = [data[int(i)] for i in idx_val]
    data_test = [data[int(i)] for i in idx_test]

    data_splits = {
            'train': data_train,
            'validation': data_val,
            'test': data_test
        }

    file_name = f'anthropic-evals/{name}'

    with open(f'data_splitted/{file_name}.txt', 'w') as file:
        file.write(str(data_splits))

#### NLP survey

In [None]:
print('length:', len(all_data[0]))
nlp_keys = all_data[0][0].keys()
for k in nlp_keys:
    print(k, ':', all_data[0][0][k])

#### Philpapers

In [None]:
print('length:', len(all_data[1]))
pp_keys = all_data[1][0].keys()
for k in pp_keys:
    print(k, ':', all_data[1][0][k])

#### Political typology

In [None]:
print('length:', len(all_data[2]))
pp_keys = all_data[2][0].keys()
for k in pp_keys:
    print(k, ':', all_data[2][0][k])

### Nina's sycophancy

Get the repo from here: `git clone https://github.com/nrimsky/LM-exp`

Taken from TruthfulQA

**A/B choice**

In [None]:
# Use a context manager so the file handle is closed after loading
with open('LM-exp/datasets/sycophancy_2/sycophancy_agree_disagree_data.json') as file:
    data = json.load(file)
all_data = []
for d in data:
    all_data.append({
        'sycophantic_text': 'Question: ' + d['question'] + '\nAnswer:' + d['answer_matching_behavior'],
        'non_sycophantic_text': 'Question: ' + d['question'] + '\nAnswer:' + d['answer_not_matching_behavior']
    })

In [None]:
print(all_data[0]['sycophantic_text'])
print()
print(all_data[0]['non_sycophantic_text'])
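
With A/B-choice records like these, one simple way to score a model is to count how often its answer equals `answer_matching_behavior`. The sketch below is only an illustration under that assumption: `sycophancy_rate`, the toy records, and the toy answers are all invented here, and real model outputs would probably need more careful parsing than an exact string match.

```python
def sycophancy_rate(records, model_answers):
    """Fraction of answers equal to the record's `answer_matching_behavior`."""
    if not records:
        return 0.0
    matches = sum(
        answer.strip() == record['answer_matching_behavior'].strip()
        for record, answer in zip(records, model_answers)
    )
    return matches / len(records)

# Invented toy records using the same keys as the data loaded above
toy_records = [
    {'question': 'Q1', 'answer_matching_behavior': ' (A)',
     'answer_not_matching_behavior': ' (B)'},
    {'question': 'Q2', 'answer_matching_behavior': ' (B)',
     'answer_not_matching_behavior': ' (A)'},
]
toy_answers = ['(A)', '(A)']  # hypothetical model outputs

print(sycophancy_rate(toy_records, toy_answers))  # 0.5
```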

- NLP survey
- PhilPapers2020
- Political Typology
- Nina's dataset (did not find the source)

## Sycophancy evals
link: https://github.com/meg-tong/sycophancy-eval/

Used in the paper *Towards Understanding Sycophancy in Language Models* (Sharma et al., 2023)

- `are_you_sure.jsonl` corresponds to 3.2 AI Assistants Can Be Easily Swayed
- `answer.jsonl` corresponds to 3.3 AI Assistants Can Provide Answers that Conform to User Beliefs
- `feedback.jsonl` corresponds to 3.1 AI Assistants Can Give Biased Feedback

For the test 3.1 AI Assistants Can Give Biased Feedback, the following datasets are used: (i) math solutions from MATH (Hendrycks et al., 2021b); (ii) model-generated arguments; and (iii) model-generated poems. These are the three sources that appear in `feedback.jsonl`. The test 3.4 AI Assistant Responses Sometimes Mimic User Mistakes uses a separate mimicry file that is not among these three files.

The datasets used to compose sycophancy evals are the following:

- `are_you_sure.jsonl : {'mmlu_mc_cot', 'truthful_qa', 'aqua_mc', 'math_mc_cot', 'truthful_qa_mc', 'trivia_qa'}`
- `answer.jsonl : {'trivia_qa', 'truthful_qa'}`
- `feedback.jsonl : {'poems', 'math', 'arguments'}`

MMLU, MATH, and AQuA are missing from the `are_you_sure` dataset

In [None]:
items = [
    'are_you_sure.jsonl',
    'answer.jsonl',
    'feedback.jsonl'
]
# Aaron's repo: https://github.com/ascher8/sycophancy-eval/tree/main/datasets

names = [
    'are_you_sure',
    'answer',
    'feedback'
]

for name, item in zip(names, items):
    url = f"https://raw.githubusercontent.com/meg-tong/sycophancy-eval/main/datasets/{item}"
    r = requests.get(url).text
    data = [json.loads(l) for l in r.split("\n") if l != '']
    datasets_used = []
    for d in data:
        datasets_used.append(d['base']['dataset'])
    print(item, ':', set(datasets_used))
    #print(d)
    print()

In [None]:
idx_train, idx_val, idx_test = generate_idxs(data)

data_train = [data[int(i)] for i in idx_train]
data_val = [data[int(i)] for i in idx_val]
data_test = [data[int(i)] for i in idx_test]

data_splits = {
        'train': data_train,
        'validation': data_val,
        'test': data_test
    }

file_name = 'feedback/feedback'

with open(f'data_splitted/{file_name}.txt', 'w') as file:
    file.write(str(data_splits))

In [None]:
with open('data/mimicry.jsonl', 'r') as file:
    lines = file.readlines()

data = [json.loads(line) for line in lines]
datasets_used = []
for d in data:
    datasets_used.append(d['base']['attribution'])
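# Label the output explicitly; `item` still holds the last file name from the earlier loop
print('mimicry.jsonl', ':', set(datasets_used))
print()
# The mimicry dataset contains only poems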