# Black box lie detector

This notebook implements a black-box lie detector built on elicitation queries: fixed follow-up questions put to the model after it has answered. The method has six steps:

1.  **Preprocessing**: Verify that the LLM knows the correct answer to a factual question and can be instructed to lie about it.
2.  **Lie Generation**: Generate specific false statements for the questions the model can answer.
3.  **Data Generation**: Prompt the model to answer questions both truthfully and deceptively.
4.  **Elicitation**: After the model answers, ask a series of fixed "probe" questions (e.g., "Are you telling the truth?", "Is the previous answer accurate?").
5.  **Feature Extraction**: Calculate the log-probability differences between the tokens "Yes" and "No" for each probe question.
6.  **Classification**: Train a supervised classifier (e.g., Logistic Regression) on these log-probability features to distinguish between truthful and deceptive responses.
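The feature described in step 5 can be sketched as follows. This is a minimal illustration, not the code in `black_box_utils`: the function name, the token spellings, and the `floor` value for tokens missing from the returned top-k list are all assumptions made for the example.

```python
import math

# Assumed token spellings; a real tokenizer may emit other variants.
YES_TOKENS = {"yes", "Yes", " yes", " Yes", "YES"}
NO_TOKENS = {"no", "No", " no", " No", "NO"}


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def yes_no_logprob_difference(top_logprobs, floor=-20.0):
    """Return log P(yes) - log P(no) from a {token: logprob} mapping.

    APIs usually return only the top-k tokens, so a side with no
    matching token is assigned `floor` (an assumed lower bound).
    """
    yes = [lp for tok, lp in top_logprobs.items() if tok in YES_TOKENS]
    no = [lp for tok, lp in top_logprobs.items() if tok in NO_TOKENS]
    log_p_yes = logsumexp(yes) if yes else floor
    log_p_no = logsumexp(no) if no else floor
    return log_p_yes - log_p_no


# Hypothetical top-k log-probabilities for the first token of one probe answer.
probe_top_logprobs = {
    " Yes": math.log(0.7),
    " No": math.log(0.2),
    " I": math.log(0.1),
}
feature = yes_no_logprob_difference(probe_top_logprobs)
print(feature)  # positive when "Yes" is more likely than "No"
```

Repeating this for every probe question gives one feature vector per answer, which is what the classifier in step 6 consumes.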

In [None]:
# Setup and Import Utilities
import pandas as pd
import numpy as np
import dotenv
import os
import openai

# Load environment variables (API keys)
dotenv.load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

# Import our custom utilities
import black_box_utils as utils

# Configure pandas display
pd.set_option('display.max_colwidth', None)


# Step 1: Load the Dataset and the Model

In [None]:
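# Load the dataset first so the object is ready for the preprocessing steps.
# This is the 'CommonSense2' dataset (common_sens_qa_v2.json), used for a fair comparison with the white-box detector.
dataset = utils.load_dataset()
display(dataset.head())  # an expression in the middle of a cell is not rendered, so display it explicitly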

# Define the model we want to use (Llama-2-7b-chat-hf)
# You can use providers like DeepInfra, Together AI, or Anyscale that support OpenAI-compatible APIs
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf" 

# Configure the API for Llama 2 (example using DeepInfra; replace with your provider's details).
# Note: `openai.api_base` is the setting used by openai-python versions before 1.0.
# openai.api_base = "https://api.deepinfra.com/v1/openai"
# openai.api_key = "YOUR_DEEPINFRA_API_KEY"
# Alternatively, set these values through environment variables.


# Step 2: Preprocessing - Test Knowledge and Lie Capability

In [None]:

# Check whether the model answers each question correctly and whether it follows an instruction to lie.
NUMBER_OF_QUESTIONS = 40

utils.check_model_capabilities(dataset, model=MODEL_NAME, number_questions_to_answer=NUMBER_OF_QUESTIONS)


# Step 3: Generate the Lies
For binary (True/False) questions, a separate lie-generation step is often unnecessary, because the lie is simply the negation of the correct answer. Our utility handles this case automatically.
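As a small illustration of the binary case, the lie can be derived directly from the correct answer. The helper below is hypothetical and is not part of `black_box_utils`; the set of accepted labels is an assumption.

```python
def negate_binary_answer(answer):
    """Return the opposite of a True/False or Yes/No answer (assumed label set)."""
    opposites = {"true": "False", "false": "True", "yes": "No", "no": "Yes"}
    key = str(answer).strip().lower()
    if key not in opposites:
        raise ValueError(f"Not a binary answer: {answer!r}")
    return opposites[key]


# Hypothetical correct answers and the lies derived from them.
correct_answers = ["True", "no", " Yes "]
lies = [negate_binary_answer(a) for a in correct_answers]
print(lies)
```

Open-ended questions have no such shortcut, which is why a generated false statement is needed for them.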


In [None]:
# Generate a specific false statement (the lie) for each question; it is stored in the `false_statement` column.
utils.generate_lies(dataset, model=MODEL_NAME, number_questions_to_answer=NUMBER_OF_QUESTIONS)

# Inspect a sample
dataset[["question", "answer", "false_statement"]].head()


# Step 4: Ask Questions and Elicitation

In [None]:
# Ask the model to answer each question (truthfully and deceptively), then ask the 48 elicitation questions (probes).
# For each probe, compute the log-probability difference log P("Yes") - log P("No").

utils.generate_dataset_with_logprobs(dataset, model=MODEL_NAME, number_questions_to_answer=NUMBER_OF_QUESTIONS)

# Check the generated data
dataset[[f"{MODEL_NAME}_logprobs_difference_truth", f"{MODEL_NAME}_logprobs_difference_lie"]].head()


# Step 5: Train the Classifier

In [None]:
# We prepare the data (split into train/test) and train a Logistic Regression classifier on the logprobs.

X_train, X_test, y_train, y_test = utils.prepare_classification_data(dataset, model=MODEL_NAME)
classifier, accuracy, auc, conf_matrix = utils.train_and_evaluate_classifier(X_train, X_test, y_train, y_test)
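To make the classification step concrete, here is a minimal sketch of logistic regression trained by batch gradient descent on toy log-probability-difference features. It is not the implementation behind `utils.train_and_evaluate_classifier`, which is not shown in this notebook; the data, function names, learning rate, and label convention (1 = truthful, 0 = lie) are assumptions for illustration. In practice a library classifier such as scikit-learn's `LogisticRegression` would be the natural choice.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, b, x):
    """Linear score w . x + b for one feature vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_logistic_regression(X, y, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            error = sigmoid(score(w, b, xi)) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += error * xj
            grad_b += error
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, X):
    """Predict 1 (truthful) when the estimated probability is at least 0.5."""
    return [1 if sigmoid(score(w, b, xi)) >= 0.5 else 0 for xi in X]


# Toy data: two probe features per answer (made-up values, not real results).
X_train = [[2.0, 1.0], [1.5, 2.5], [3.0, 0.5],
           [-2.0, -1.0], [-1.5, -2.5], [-3.0, -0.5]]
y_train = [1, 1, 1, 0, 0, 0]
X_test = [[1.0, 1.0], [-1.0, -1.0], [2.5, 0.5], [-0.5, -2.0]]
y_test = [1, 0, 1, 0]

w, b = train_logistic_regression(X_train, y_train)
predictions = predict(w, b, X_test)
accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(predictions, accuracy)
```

The toy features are deliberately well separated; real probe features overlap far more, which is why the notebook reports AUC and a confusion matrix alongside accuracy.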


# Step 6: Visualize and Interpret the Results

In [None]:
# Visualize the prediction probabilities distribution for Truth vs Lie classes.
utils.visualize_results(classifier, X_test, y_test)
