# Generalising Deception: Do 'Lying Vectors' Transfer Across Datasets?

### By Tinuade Margaret, Gergely Kiss, Alex McKenzie

# Introduction

What are we trying to do?

- We’re interested in understanding what a model is thinking when it lies
- Is “lying” a linear feature learned by the model?
    - i.e. is it represented as a single direction in activation space?
- Is “lying” generalisable?
    - i.e. does the same direction do the same thing in different contexts?


Why do we care?
- Lying capability is highly relevant to deception
- Interventions could help make models more honest
    - Always good to have more model steering options!

<img src="imgs/shoggoth.webp" width="500" alt="Shoggoth lying">

# Setup

- We used the instruction-tuned Gemma 2 9B model
- We fairly closely followed the [Function Vectors](https://arena3-chapter1-transformer-interp.streamlit.app/[1.4.2]_Function_Vectors_&_Model_Steering) course material, and the [“Geometry of Truth” paper by Samuel Marks & Max Tegmark](https://arxiv.org/abs/2310.06824).
- We only investigated lying _upon prompting_, rather than due to fine-tuning, instrumental goals, etc.


# Can Gemma 2 even lie?

Gemma 2 9B is RLHF-ed not to lie, but it's very easy to get around that with prompting.

<img src="imgs/lie.png" width="500" alt="Chat logs of Gemma-2-9b lying">

# Can Gemma 2 lie on a multiple-choice question?

Yes.

![](imgs/mcq_lie.png)

# Methodology & Results

## Dataset Generation

- We used GPT-4 and Claude 3.5 Sonnet to generate datasets:
    - Multiple-choice questions at two difficulty levels
        - For five-year-olds ("")
        - For twelve-year-olds ("")
    - True or false statements ("")

## Dataset Generation (cont.)

Can Gemma 2 9B answer these questions correctly?

Can it successfully lie, i.e. give the incorrect answer, when prompted to do so?

Answer: yes

(insert bar chart here)

## Investigating Hidden State Activations

- Does it make sense to try to extract directions for lying?
- Let's see if the hidden-state activations while lying & being honest are linearly separable
- Turns out they are

(Insert PCA visualisation here)
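The separability check above can be sketched as follows. This is a minimal illustration on toy data, not our actual pipeline: the activations are random stand-ins, `d_model = 16` and the size of the shift between the two groups are made-up values, and `pca_project` is a hypothetical helper name.

```python
import numpy as np

def pca_project(activations, n_components=2):
    """Project rows of `activations` onto their top principal components."""
    centred = activations - activations.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centred data.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

# Toy stand-ins for hidden states, shape (n_examples, d_model).
rng = np.random.default_rng(0)
d_model = 16                       # assumption: tiny width for illustration
offset = np.zeros(d_model)
offset[0] = 10.0                   # assumption: lying shifts one coordinate
honest = rng.normal(size=(50, d_model))
lying = rng.normal(size=(50, d_model)) + offset

projected = pca_project(np.vstack([honest, lying]))
pc1_honest, pc1_lying = projected[:50, 0], projected[50:, 0]

# The groups are linearly separable along PC1 if their ranges do not overlap.
separable = (pc1_honest.max() < pc1_lying.min()
             or pc1_lying.max() < pc1_honest.min())
```

With real activations, `honest` and `lying` would be the hidden states collected under the two prompts, and a scatter plot of `projected` would give the PCA visualisation.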

## Generation of Lying Vectors

We split our "12-year-old multiple choice question" dataset into train and test splits (75:25).

We compute a "lying vector" at layer $\ell$ as follows:
1. Let $V^{(\ell)}_{\textit{honest}}$ be the mean activation at layer $\ell$ over our train split, when prompted for honesty;
2. Let $V^{(\ell)}_{\textit{lie}}$ be the mean activation at layer $\ell$ over the same split, when prompted for dishonesty;
3. Then our lying vector at layer $\ell$ is $V^{(\ell)}_{\textit{lie}} - V^{(\ell)}_{\textit{honest}}$
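The three steps above, together with the 75:25 split, can be sketched as follows. This is a minimal sketch on toy data: it assumes one activation vector per example (which token position to read is not specified here), and `train_test_split` and `lying_vector` are hypothetical helper names.

```python
import numpy as np

def train_test_split(n_items, train_fraction=0.75, seed=0):
    """Return shuffled index arrays for a train/test split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_train = int(round(train_fraction * n_items))
    return order[:n_train], order[n_train:]

def lying_vector(honest_acts, lie_acts):
    """Difference of means: mean(lie) - mean(honest) over the example axis."""
    return lie_acts.mean(axis=0) - honest_acts.mean(axis=0)

# Toy stand-ins for layer-l activations, shape (n_examples, d_model).
rng = np.random.default_rng(0)
n_examples, d_model = 40, 8        # assumption: toy sizes
honest_acts = rng.normal(size=(n_examples, d_model))
lie_acts = honest_acts + 2.0       # assumption: lying shifts every coordinate by 2

train_idx, test_idx = train_test_split(n_examples)
vec = lying_vector(honest_acts[train_idx], lie_acts[train_idx])
```

In this toy setup `vec` recovers the constant shift of 2 in every coordinate; the held-out `test_idx` examples are the ones the vector would later be evaluated on.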

Why this method? 
- It's used in the Marks & Tegmark paper
- We didn't have time to try anything else :(

## Choosing a Layer and Intervention Strength

Which layers should we use, and how strongly should we intervene?

- For different “intervention coefficients” and different layers, we try adding the lying vector, scaled by the coefficient, when running the test set through the model
- We look at the “normalised indirect effect”, i.e. how big the difference between the wrong-answer and right-answer log probs is, normalised by the (absolute) difference without intervention
- Heat map -> use layer 21 and coefficient ??
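One reading of the coefficient sweep and the normalised indirect effect described above is sketched below. Everything model-related is a toy assumption: `readout` is a made-up two-option linear map standing in for the rest of the network, and `steer` and `normalised_indirect_effect` are hypothetical names. The metric follows the verbal definition literally, which may differ in detail from what we actually computed.

```python
import numpy as np

def steer(hidden, lying_vector, coeff):
    """Add the scaled lying vector to the hidden state."""
    return hidden + coeff * lying_vector

def normalised_indirect_effect(logp_wrong, logp_right,
                               base_logp_wrong, base_logp_right, eps=1e-8):
    """(wrong - right) log-prob gap after steering, over |gap| without steering."""
    steered_gap = logp_wrong - logp_right
    base_gap = np.abs(base_logp_wrong - base_logp_right)
    return steered_gap / (base_gap + eps)

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

# Toy readout: row 0 scores the right answer, row 1 the wrong answer.
readout = np.array([[1.0, 0.0], [0.0, 1.0]])
hidden = np.array([2.0, 0.0])          # unsteered state favours the right answer
lying_vector = np.array([-1.0, 1.0])   # pushes towards the wrong answer

base = log_softmax(readout @ hidden)
nie_by_coeff = {}
for coeff in [0.0, 1.0, 2.0, 4.0]:
    logp = log_softmax(readout @ steer(hidden, lying_vector, coeff))
    nie_by_coeff[coeff] = float(
        normalised_indirect_effect(logp[1], logp[0], base[1], base[0])
    )
```

Under this definition the unsteered value is about -1 (the model prefers the right answer), and the value rises as the coefficient grows. Repeating the sweep per layer gives the grid behind a layer-by-coefficient heat map.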
## Effect of the Intervention Coefficient

- As the coefficient increases, the chance that the model outputs a lie in the wrong format increases (it “can’t think about anything other than lying”)
    - Plot: one line each for “truth”, “lie” and “incorrect format” at layer 21 as the coefficient varies
- This is a somewhat interesting finding by itself
- But we are interested in lying even when the model formats its answer incorrectly, so we delegate to a model to decide whether the model has answered correctly
    - Plot: the same thing, but with a model as the judge
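The “truth” / “lie” / “incorrect format” breakdown can be sketched with a simple rule-based classifier. This is an illustrative sketch only: it assumes the expected answer is a single option letter (optionally followed by `.` or `)`), that there are two options, and the function names are hypothetical. It does not cover the model-as-judge variant, which exists precisely because rules like this miss lies given in free text.

```python
import re
from collections import Counter

def classify_response(response, correct_letter, options=("A", "B")):
    """Label a response as 'truth', 'lie' or 'incorrect format'."""
    match = re.fullmatch(r"\s*([A-Za-z])[\.\)]?\s*", response)
    if match is None:
        return "incorrect format"
    letter = match.group(1).upper()
    if letter not in options:
        return "incorrect format"
    return "truth" if letter == correct_letter else "lie"

def outcome_rates(responses, correct_letters):
    """Fraction of responses in each category."""
    counts = Counter(
        classify_response(r, c) for r, c in zip(responses, correct_letters)
    )
    total = sum(counts.values())
    return {label: counts[label] / total
            for label in ("truth", "lie", "incorrect format")}

# Made-up responses, for illustration only.
rates = outcome_rates(["A", "b.", "Lies lies lies"], ["A", "A", "B"])
```

Computing `outcome_rates` once per coefficient gives the three lines of the plot.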
## Generalisation to Other Datasets

Does this lying direction generalise to other datasets?

- Dataset 1: the same format, but different (easier) questions
    - Result: ???
- Dataset 2: the same (test) dataset, but with the prompts changed from “A. \<answer\> B. \<answer\>” to “1. \<answer\> 2. \<answer\>”
    - Result: ???
- Dataset 3: true-or-false statements
    - Result: ???
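The relabelling used for Dataset 2 amounts to re-rendering the same questions with different option labels. A minimal sketch, assuming each question is stored as a question string plus a list of answers; `format_question`, the example question, and the one-option-per-line layout are all illustrative assumptions rather than our exact prompt template.

```python
def format_question(question, answers, labels):
    """Render a multiple-choice prompt with the given option labels."""
    lines = [question]
    lines += [f"{label}. {answer}" for label, answer in zip(labels, answers)]
    return "\n".join(lines)

# Made-up example question.
question = "What colour is the sky on a clear day?"
answers = ["Blue", "Green"]

lettered = format_question(question, answers, labels=("A", "B"))
numbered = format_question(question, answers, labels=("1", "2"))
```

Only the labels change between `lettered` and `numbered`, so any drop in the steering effect on the numbered version would point at the vector encoding something format-specific rather than lying in general.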
# Conclusion

- The lying direction somewhat generalises (??) to other datasets, but our results aren’t very robust