# Violation State Analysis Notebook

This notebook provides an interactive analysis of the Violation State study data.

## Study Overview

We investigate whether a copyright-related refusal at the start of a conversation leads to persistent refusals of subsequent, benign image generation requests in ChatGPT Web.

**Hypothesis**: Copyright trigger → safety state contamination → elevated refusal rate

## Final Results Summary

**30 contaminated sessions** (120 image prompts total):
- 116 refusals (96.67%)
- 4 successes (3.33%)
- Breakthroughs in threads: 12, 16, 25, 27

**10 control sessions** (40 image prompts total):
- 0 refusals (0%)
- 40 successes (100%)

**Coffee cup paradox**: The coffee cup prompt was refused 100% of the time (30/30) in contaminated sessions, despite having zero semantic connection to copyright or real estate.
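
As a quick arithmetic check on the counts above, the sketch below recomputes both refusal rates and attaches a 95% Wilson score interval to each. The interval is not part of the original analysis, the helper name `wilson_interval` is ours, and treating every prompt as an independent trial is a simplifying assumption (prompts within one session may be correlated).

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Return (low, high) Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries
    return max(0.0, center - half), min(1.0, center + half)

# Counts copied from the summary above ("success" here means a refusal was observed)
contaminated_rate = 116 / 120
control_rate = 0 / 40
contaminated_ci = wilson_interval(116, 120)
control_ci = wilson_interval(0, 40)

print(f"Contaminated: {contaminated_rate:.2%}, 95% CI {contaminated_ci[0]:.2%} to {contaminated_ci[1]:.2%}")
print(f"Control:      {control_rate:.2%}, 95% CI {control_ci[0]:.2%} to {control_ci[1]:.2%}")
```

The Wilson interval is used here because it stays well behaved when the observed proportion is 0 or close to 1, which is the situation in both conditions.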

In [None]:
# Install dependencies (for Google Colab)
!pip install -q pandas numpy matplotlib scipy

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import fisher_exact

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# Plotting style
plt.style.use('seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'default')
%matplotlib inline

## Load Processed Data

Load the parsed turns and thread summary data generated by `run_analysis.py`.
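
The cells below index the two frames by a handful of column names. A minimal sketch of a schema check that could be run right after loading is shown here. The required column sets are inferred from how this notebook uses the frames, the helper `check_columns` is hypothetical, and the toy rows are illustrative values, not study data.

```python
import pandas as pd

# Columns the later cells read (inferred from this notebook, not from run_analysis.py)
REQUIRED_TURN_COLS = {'thread_id', 'condition', 'prompt_id', 'response_class'}
REQUIRED_SUMMARY_COLS = {'thread_id', 'condition'}

def check_columns(df, required, name):
    """Raise a ValueError listing any required columns missing from df."""
    missing = sorted(required - set(df.columns))
    if missing:
        raise ValueError(f"{name} is missing columns: {missing}")

# Toy frames standing in for the real CSVs (illustrative values only)
toy_turns = pd.DataFrame({
    'thread_id': ['thread_01', 'thread_01'],
    'condition': ['control', 'control'],
    'prompt_id': ['I1_KITCHEN', 'I4_COFFEE'],
    'response_class': ['image_success', 'image_success'],
})
toy_summary = pd.DataFrame({'thread_id': ['thread_01'], 'condition': ['control']})

check_columns(toy_turns, REQUIRED_TURN_COLS, 'turns_df')
check_columns(toy_summary, REQUIRED_SUMMARY_COLS, 'summary_df')
print("Schema check passed")
```

With the real data, the same two calls would take `turns_df` and `summary_df` in place of the toy frames.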

In [None]:
# Load data
turns_df = pd.read_csv('../data/processed/parsed_turns.csv')
summary_df = pd.read_csv('../data/processed/thread_summary.csv')

print(f"Total conversation turns: {len(turns_df)}")
print(f"Total threads: {len(summary_df)}")
print(f"  Control: {len(summary_df[summary_df['condition'] == 'control'])}")
print(f"  Contaminated: {len(summary_df[summary_df['condition'] == 'contaminated'])}")

## Response Classification Distribution

In [None]:
# Count response types by condition
image_prompts = ['I1_KITCHEN', 'I2_BEDROOM', 'I3_ABSTRACT', 'I4_COFFEE']
image_turns = turns_df[turns_df['prompt_id'].isin(image_prompts)]

print("Response classifications for image prompts (I1-I4):")
print()
for condition in ['control', 'contaminated']:
    print(f"{condition.capitalize()}:")
    cond_turns = image_turns[image_turns['condition'] == condition]
    print(cond_turns['response_class'].value_counts())
    print()

## Primary Analysis: Final Outcomes Per Thread

For each thread and each prompt (I1-I4), we determine the final outcome:
- If any retry succeeded → count as success
- If all attempts failed (refusal or rate limit) → count as refusal

This gives us one outcome per thread-prompt pair.
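
To make the collapse rule concrete, here is a small sketch on toy data using a `groupby`. The rows and thread ids are made up for illustration, and the helper `final_outcome` is ours. "Last attempt" assumes the rows appear in chronological order.

```python
import pandas as pd

# Toy turns: thread t_01 retries the kitchen prompt and eventually succeeds;
# thread t_02 hits a rate limit and then a policy refusal.
toy_turns = pd.DataFrame({
    'thread_id': ['t_01', 't_01', 't_01', 't_02', 't_02'],
    'prompt_id': ['I1_KITCHEN', 'I1_KITCHEN', 'I4_COFFEE', 'I1_KITCHEN', 'I1_KITCHEN'],
    'response_class': ['policy_refusal', 'image_success', 'policy_refusal',
                       'rate_limit', 'policy_refusal'],
})

def final_outcome(classes):
    """Success if any attempt succeeded, otherwise the last attempt's class."""
    if (classes == 'image_success').any():
        return 'image_success'
    return classes.iloc[-1]

# One row per (thread, prompt) pair; groupby keeps the original row order within each group
toy_outcomes = (
    toy_turns.groupby(['thread_id', 'prompt_id'], sort=False)['response_class']
    .apply(final_outcome)
    .reset_index(name='final_outcome')
)
print(toy_outcomes)
```

Five turns collapse to three thread-prompt outcomes: one success (the retry that worked) and two refusals.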

In [None]:
# Build final outcomes dataset
final_outcomes = []

for _, thread_row in summary_df.iterrows():
    thread_id = thread_row['thread_id']
    condition = thread_row['condition']
    thread_turns = turns_df[turns_df['thread_id'] == thread_id]
    
    for prompt in image_prompts:
        prompt_turns = thread_turns[thread_turns['prompt_id'] == prompt]
        if len(prompt_turns) > 0:
            # Success if any attempt succeeded
            has_success = (prompt_turns['response_class'] == 'image_success').any()
            
            if has_success:
                final_class = 'image_success'
            else:
                # Failed: use the last attempt's classification
                # (assumes rows are in chronological order within the thread)
                final_class = prompt_turns.iloc[-1]['response_class']
            
            final_outcomes.append({
                'thread_id': thread_id,
                'condition': condition,
                'prompt_id': prompt,
                'final_outcome': final_class
            })

outcomes_df = pd.DataFrame(final_outcomes)
print(f"Total outcomes (one per thread-prompt): {len(outcomes_df)}")
print(f"Expected: {len(summary_df)} threads × 4 prompts = {len(summary_df) * 4}")
outcomes_df.head()

In [None]:
# Calculate statistics
control_outcomes = outcomes_df[outcomes_df['condition'] == 'control']
contaminated_outcomes = outcomes_df[outcomes_df['condition'] == 'contaminated']

# Count refusals: policy_refusal, capability_refusal, or a rate_limit that never succeeded
refusal_classes = ['policy_refusal', 'capability_refusal', 'rate_limit']
control_refusals = int(control_outcomes['final_outcome'].isin(refusal_classes).sum())
contaminated_refusals = int(contaminated_outcomes['final_outcome'].isin(refusal_classes).sum())

control_success = len(control_outcomes[control_outcomes['final_outcome'] == 'image_success'])
contaminated_success = len(contaminated_outcomes[contaminated_outcomes['final_outcome'] == 'image_success'])

control_total = len(control_outcomes)
contaminated_total = len(contaminated_outcomes)

control_refusal_rate = control_refusals / control_total if control_total > 0 else 0
contaminated_refusal_rate = contaminated_refusals / contaminated_total if contaminated_total > 0 else 0

print("="*70)
print("FINAL OUTCOMES ANALYSIS")
print("="*70)
print("\nControl:")
print(f"  Total: {control_total}")
print(f"  Success: {control_success}")
print(f"  Refused: {control_refusals}")
print(f"  Refusal Rate: {control_refusal_rate:.2%} ({control_refusals}/{control_total})")

print("\nContaminated:")
print(f"  Total: {contaminated_total}")
print(f"  Success: {contaminated_success}")
print(f"  Refused: {contaminated_refusals}")
print(f"  Refusal Rate: {contaminated_refusal_rate:.2%} ({contaminated_refusals}/{contaminated_total})")

## Statistical Tests
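
With the counts reported in the summary, the control row has no refusals, so the contingency table used in this section contains a zero cell and the sample odds ratio is infinite. A minimal sketch of one common workaround, the Haldane-Anscombe correction (add 0.5 to every cell), is shown below. This correction is not part of the original analysis, and the helper name `corrected_odds_ratio` is ours.

```python
def corrected_odds_ratio(table, correction=0.5):
    """Odds ratio for a 2x2 table after adding `correction` to every cell."""
    (a, b), (c, d) = table
    a, b, c, d = (x + correction for x in (a, b, c, d))
    return (a * d) / (b * c)

# Rows: control, contaminated; columns: success, refusal (counts from the summary)
table = [[40, 0], [4, 116]]
or_corrected = corrected_odds_ratio(table)
print(f"Haldane-Anscombe corrected odds ratio: {or_corrected:.0f}")
```

The corrected value is only a finite stand-in for reporting. The p-value from Fisher's exact test does not depend on it.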

In [None]:
# Fisher's exact test (two-sided, SciPy's default)
# Rows: control, contaminated; columns: success, refusal
contingency_table = [
    [control_success, control_refusals],
    [contaminated_success, contaminated_refusals]
]

odds_ratio, p_value = fisher_exact(contingency_table)

print("Fisher's Exact Test:")
print(f"  Contingency table: {contingency_table}")
# Note: the sample odds ratio is degenerate (0 or inf) when the table has a zero cell
print(f"  Odds ratio: {odds_ratio}")
print(f"  p-value: {p_value:.2e}")

# Cohen's h
def cohen_h(p1, p2):
    phi1 = 2 * np.arcsin(np.sqrt(p1))
    phi2 = 2 * np.arcsin(np.sqrt(p2))
    return abs(phi1 - phi2)

h = cohen_h(contaminated_refusal_rate, control_refusal_rate)
print(f"\nEffect Size (Cohen's h): {h:.2f}")
# Cohen's conventional benchmarks: 0.2 = small, 0.5 = medium, 0.8 = large
print(f"  Interpretation: {'negligible' if h < 0.2 else 'small' if h < 0.5 else 'medium' if h < 0.8 else 'large'} effect")

## Visualization: Refusal Rates

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

conditions = ['Control', 'Contaminated']
rates = [control_refusal_rate * 100, contaminated_refusal_rate * 100]
colors = ['#2ecc71', '#e74c3c']

bars = ax.bar(conditions, rates, color=colors, alpha=0.8, edgecolor='black', linewidth=1.5)

ax.set_ylabel('Refusal Rate (%)', fontsize=12, fontweight='bold')
ax.set_xlabel('Condition', fontsize=12, fontweight='bold')
ax.set_title('Image Generation Refusal Rates\n(Final Outcomes)', fontsize=14, fontweight='bold', pad=20)
ax.set_ylim(0, 105)

# Add value labels
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 2,
            f'{height:.2f}%',
            ha='center', va='bottom', fontsize=11, fontweight='bold')

ax.grid(axis='y', alpha=0.3, linestyle='--')
ax.set_axisbelow(True)

plt.tight_layout()
plt.show()

## Per-Prompt Breakdown

In [None]:
# Analyze by prompt type
print("Refusal rates by prompt type:")
print("="*70)

# Final outcomes that count as a refusal (same rule as the primary analysis)
refusal_classes = ['policy_refusal', 'capability_refusal', 'rate_limit']

for prompt in image_prompts:
    prompt_data = outcomes_df[outcomes_df['prompt_id'] == prompt]
    
    control_p = prompt_data[prompt_data['condition'] == 'control']
    contam_p = prompt_data[prompt_data['condition'] == 'contaminated']
    
    control_refused = int(control_p['final_outcome'].isin(refusal_classes).sum())
    contam_refused = int(contam_p['final_outcome'].isin(refusal_classes).sum())
    
    # Guard against a prompt with no recorded outcomes in one condition
    control_pct = control_refused / len(control_p) * 100 if len(control_p) else float('nan')
    contam_pct = contam_refused / len(contam_p) * 100 if len(contam_p) else float('nan')
    
    prompt_name = prompt.split('_', 1)[1].title()
    print(f"\n{prompt_name}:")
    print(f"  Control refused: {control_refused}/{len(control_p)} ({control_pct:.0f}%)")
    print(f"  Contaminated refused: {contam_refused}/{len(contam_p)} ({contam_pct:.0f}%)")
    if len(contam_p) > 0 and contam_refused == len(contam_p):
        print("  ⚠️  100% refusal rate in contaminated condition!")

## Breakthrough Analysis

Identify which threads had successful image generation despite contamination:

In [None]:
# Find breakthrough cases
breakthroughs = contaminated_outcomes[contaminated_outcomes['final_outcome'] == 'image_success']

print(f"Total breakthroughs: {len(breakthroughs)}")
print("\nBreakthrough details:")
print("="*70)

for _, row in breakthroughs.iterrows():
    thread_num = row['thread_id'].split('_')[1]
    prompt_name = row['prompt_id'].split('_', 1)[1].title()
    print(f"Thread {thread_num}: {prompt_name} succeeded")

## Conclusion

The analysis shows a very large violation state effect in this dataset:

### Key Findings

1. **Large Refusal-Rate Gap**: 96.67% refusal rate (116/120) in contaminated sessions vs. 0% (0/40) in control
2. **Strong Statistical Evidence**: p < 0.0001 (Fisher's exact test), Cohen's h = 2.77
3. **Coffee Cup Paradox**: 100% refusal (30/30) despite no semantic connection to copyright
4. **Only 4 Breakthroughs**: 4 of 120 image prompts succeeded in contaminated sessions (threads 12, 16, 25, 27)
5. **Perfect Control**: 40/40 image requests succeeded without contamination
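
The reported effect size can be checked by hand from the two refusal rates, using the same formula as the `cohen_h` function in the Statistical Tests cell, with $p_1 = 116/120 \approx 0.9667$ and $p_2 = 0$:

$$
h = \left| 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2} \right|
  = \left| 2\arcsin\sqrt{0.9667} - 2\arcsin\sqrt{0} \right|
  \approx \left| 2.774 - 0 \right|
  \approx 2.77
$$

For scale, Cohen's conventional benchmarks are 0.2 (small), 0.5 (medium), and 0.8 (large), and the largest value $h$ can take is $\pi \approx 3.14$.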

### Implications

This suggests ChatGPT's safety system maintains a persistent "violation state" after a copyright-related refusal. This state:
- Persists across multiple conversation turns
- Affects semantically unrelated requests (coffee cup)
- Cannot be easily reset within the same conversation
- Leads to systematic over-blocking of benign content

Within this study the effect is large and consistent across sessions, and it represents a significant UX and safety system design challenge.