# CoT Faithfulness Evaluation Demo

This notebook demonstrates how to use the `cot_faithfulness` module to evaluate
Chain-of-Thought (CoT) faithfulness in language models.

## What is CoT Faithfulness?

When LLMs use step-by-step reasoning (Chain-of-Thought), we want to know:
- **Is the reasoning actually driving the answer?** (Faithful)
- **Or is it just post-hoc rationalization?** (Unfaithful)

This matters for interpretability because unfaithful CoT means we can't trust
the model's explanations of its own reasoning.

## Setup

First, install dependencies and set up your API key.

In [None]:
# Add parent directory to path (if running from this folder)
import sys
sys.path.insert(0, '..')

# Import the evaluator
from cot_faithfulness import CoTFaithfulnessEvaluator

print("CoT Faithfulness Evaluator loaded!")
print("Available datasets:", CoTFaithfulnessEvaluator.available_datasets())

## Step 1: Define Your Model

You need to provide a function that takes a prompt string and returns
the model's response string. Here's an example using Together AI.

In [None]:
import os
from openai import OpenAI

# Set up your API client (example using Together AI)
client = OpenAI(
    api_key=os.environ.get("TOGETHER_API_KEY"),
    base_url="https://api.together.xyz/v1"
)

# Define the model you want to test
MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo"
MODEL_NAME = "Llama-3.1-8B"  # Display name

def my_model(prompt: str) -> str:
    """
    Your model's generate function.
    Takes a prompt, returns the response text.
    """
    response = client.chat.completions.create(
        model=MODEL_ID,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
        temperature=0.7,
    )
    return response.choices[0].message.content

# Quick test
print("Testing API connection...")
print(my_model("Say 'Hello!' if you can hear me."))

## Step 2: Run Faithfulness Evaluation

The evaluator runs two tests:
1. **Truncation Test**: Cuts off reasoning halfway - if answer changes, reasoning matters
2. **Error Injection Test**: Injects wrong reasoning - if model follows, it reads the reasoning

In [None]:
# Create evaluator
evaluator = CoTFaithfulnessEvaluator()

# See available options
print("Available datasets:", evaluator.available_datasets())
print("Available baselines:", evaluator.available_baselines())

In [None]:
# Run evaluation (this will make API calls)
results = evaluator.evaluate(
    generate_fn=my_model,
    model_name=MODEL_NAME,
    dataset="arc",      # Options: 'arc', 'aqua', 'mmlu'
    n_samples=10,       # Number of questions to test
    verbose=True,       # Show progress
)

## Step 3: Compare to Baselines

Compare your model's faithfulness against pre-computed baselines.

In [None]:
# Compare against Llama-3.1-70B baseline
comparison = evaluator.compare_to_baseline(results, baseline="Llama-3.1-70B")

In [None]:
# Visualize the comparison
evaluator.plot_comparison(results, baseline="Llama-3.1-70B")

## Step 4: Analyze Detailed Results

Dig into specific question results to understand the patterns.

In [None]:
import pandas as pd

# View truncation test details
trunc_df = pd.DataFrame(results.truncation_details)
print("Truncation Test Results:")
print(f"  Changed answer: {trunc_df['changed'].sum()}/{len(trunc_df)}")
display(trunc_df)

print("\n" + "="*50)

# View error injection details  
error_df = pd.DataFrame(results.error_details)
print("\nError Injection Results:")
print(f"  Followed error: {error_df['followed_error'].sum()}/{len(error_df)}")
display(error_df)

## Interpretation Guide

### What the scores mean:

| Score | Truncation Test | Error Injection |
|-------|-----------------|------------------|
| **High (>70%)** | Model relies on its reasoning | Model reads its reasoning |
| **Medium (40-70%)** | Partial reliance on reasoning | Sometimes reads reasoning |
| **Low (<40%)** | Answer independent of reasoning | Ignores provided reasoning |

### Key findings from research:
- Larger models are not necessarily more faithful
- Math/logic tasks tend to show higher faithfulness
- Simple knowledge questions often show lower faithfulness

### What this means for interpretability:
- **High faithfulness** → You can trust the model's CoT explanations
- **Low faithfulness** → The model may be using shortcuts, not the stated reasoning

## Advanced: Test Multiple Models

Compare faithfulness across different model sizes.

In [None]:
# Example: Testing multiple models
models_to_test = [
    ("meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo", "Llama-8B"),
    # Add more models as needed
]

all_results = {}

for model_id, model_name in models_to_test:
    def generate(prompt):
        resp = client.chat.completions.create(
            model=model_id,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
            temperature=0.7,
        )
        return resp.choices[0].message.content
    
    result = evaluator.evaluate(
        generate_fn=generate,
        model_name=model_name,
        dataset="arc",
        n_samples=10,
    )
    all_results[model_name] = result

In [None]:
# Compare all tested models
comparison_df = pd.DataFrame([
    {
        "Model": r.model_name,
        "Truncation %": r.truncation_faithfulness,
        "Error Following %": r.error_following,
        "Avg Faithfulness %": r.avg_faithfulness,
    }
    for r in all_results.values()
])

print("\nFaithfulness Comparison:")
display(comparison_df)

## Export Results

Save your results for later analysis or sharing.

In [None]:
import json

# Export to JSON
with open(f"{MODEL_NAME}_faithfulness_results.json", "w") as f:
    json.dump(results.to_dict(), f, indent=2)

print(f"Results saved to {MODEL_NAME}_faithfulness_results.json")