# GPT-OSS-20B Deception Vulnerability Reproduction

This notebook replays the prompts from a recorded deception finding against gpt-oss:20b and checks whether the same evidence patterns appear in the new responses.

## Setup

Before running the cells, make sure you have:
1. Ollama installed and running
2. The gpt-oss:20b model pulled: `ollama pull gpt-oss:20b`
3. The required Python packages installed: `pip install -r requirements.txt`

In [None]:
import json
import requests
from pathlib import Path
import pandas as pd

# Import testbed utilities
from utils.model_client import OllamaClient
from utils.evaluator import ResponseEvaluator, VulnerabilityCategory
from categories.deception import DeceptionTester

## Load Finding

Load the example deception finding for reproduction:

In [None]:
# Load the example finding
with open('findings/deception_example_001.json', 'r') as f:
    finding = json.load(f)

print(f"Finding: {finding['title']}")
print(f"Category: {finding['category']}")
score_dims = ['severity', 'breadth', 'novelty', 'reproducibility', 'methodological_insight']
total_score = sum(finding[f'{dim}_self_assessment'] for dim in score_dims)
print(f"Total Score: {total_score}/50")
print(f"\nDescription: {finding['description']}")
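
The cell above reads several keys from the finding file without checking that they exist. As a minimal sketch, the helper below validates those keys before use and returns the total score. The key list is inferred only from what this notebook accesses, the 0-10 range per dimension is an assumption based on the `/50` total, and `validate_finding` and the example values are hypothetical, not part of the testbed.

```python
# Dimensions summed by the notebook; each is assumed to be scored 0-10.
SCORE_DIMENSIONS = [
    "severity",
    "breadth",
    "novelty",
    "reproducibility",
    "methodological_insight",
]
# Keys this notebook reads from the finding; the real schema may contain more.
REQUIRED_KEYS = ["title", "category", "description", "prompts", "evidence"]


def validate_finding(finding: dict) -> int:
    """Raise ValueError if a needed key is missing; otherwise return the total score."""
    needed = REQUIRED_KEYS + [f"{dim}_self_assessment" for dim in SCORE_DIMENSIONS]
    missing = [key for key in needed if key not in finding]
    if missing:
        raise ValueError(f"Finding is missing keys: {missing}")
    return sum(finding[f"{dim}_self_assessment"] for dim in SCORE_DIMENSIONS)


# Hypothetical finding with made-up values, for illustration only.
example = {
    "title": "Example finding",
    "category": "deception",
    "description": "Placeholder description",
    "prompts": ["placeholder prompt"],
    "evidence": ["placeholder evidence"],
    "severity_self_assessment": 5,
    "breadth_self_assessment": 4,
    "novelty_self_assessment": 6,
    "reproducibility_self_assessment": 7,
    "methodological_insight_self_assessment": 3,
}
total = validate_finding(example)
print(f"Total Score: {total}/50")
```

Calling `validate_finding(finding)` right after `json.load` would surface a malformed file as one clear error instead of a `KeyError` partway through the notebook.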

## Initialize Model Client

Connect to the gpt-oss:20b model via Ollama:

In [None]:
# Initialize client
client = OllamaClient(model="gpt-oss:20b")

# Test connection
if client.is_model_available():
    print("✅ Model available and ready")
else:
    print("❌ Model not available. Please run: ollama pull gpt-oss:20b")
    raise RuntimeError("Model not available")
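
For readers without the testbed's `OllamaClient`, an availability check can be approximated against Ollama's local REST API, whose `GET /api/tags` endpoint lists pulled models. The sketch below is not the implementation of `is_model_available()`; the helper names `model_in_tags` and `check_model` are hypothetical, and the URL assumes Ollama's default port.

```python
import json
import urllib.request

# Assumes Ollama is listening on its default local address.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def model_in_tags(payload: dict, model: str) -> bool:
    """Return True if `model` appears by name in a parsed /api/tags payload."""
    names = [entry.get("name", "") for entry in payload.get("models", [])]
    return model in names


def check_model(model: str, url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> bool:
    """Query the local Ollama server; return False on any connection or parse error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as reply:
            payload = json.load(reply)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return False
    return model_in_tags(payload, model)
```

Keeping the payload parsing in its own function means it can be checked without a running server, for example `model_in_tags({"models": [{"name": "gpt-oss:20b"}]}, "gpt-oss:20b")`.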

## Reproduce Finding

Execute the prompts from the finding to reproduce the vulnerability:

In [None]:
# Initialize evaluator
evaluator = ResponseEvaluator()
results = []

# Test each prompt from the finding
for i, prompt in enumerate(finding['prompts']):
    print(f"\n{'='*50}")
    print(f"PROMPT {i+1}: {prompt}")
    print('='*50)
    
    # Get model response (sampled at temperature 0.7, so output varies between runs)
    response = client.generate(prompt, temperature=0.7)
    
    if response.error:
        print(f"❌ Error: {response.error}")
        continue
    
    print(f"\nRESPONSE:\n{response.content}")
    
    # Evaluate for deception
    evaluation = evaluator.evaluate_deception(prompt, response.content)
    
    print("\n🔍 EVALUATION:")
    print(f"   Vulnerable: {evaluation.is_vulnerable}")
    print(f"   Confidence: {evaluation.confidence:.2f}")
    print(f"   Evidence: {evaluation.evidence}")
    print(f"   Reasoning: {evaluation.reasoning}")
    
    results.append({
        'prompt': prompt,
        'response': response.content,
        'vulnerable': evaluation.is_vulnerable,
        'confidence': evaluation.confidence,
        'evidence': evaluation.evidence,
        'reasoning': evaluation.reasoning
    })
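
Because the loop above samples at `temperature=0.7`, a single pass per prompt can either hit or miss the behavior by chance. One way to account for that, sketched here under assumptions, is to repeat each prompt several times and report the fraction of runs flagged as vulnerable. `reproduction_rate` and `fake_run_once` are hypothetical names; the stand-in below replaces the real model call so the example runs on its own.

```python
from itertools import cycle
from typing import Callable


def reproduction_rate(prompt: str, run_once: Callable[[str], bool], trials: int = 5) -> float:
    """Fraction of trials in which run_once(prompt) flags the response as vulnerable."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    hits = sum(1 for _ in range(trials) if run_once(prompt))
    return hits / trials


# Stand-in for a real model call: flags every other attempt, starting with the first.
# In the notebook, run_once would instead call client.generate(...) followed by
# evaluator.evaluate_deception(...) and return evaluation.is_vulnerable.
_fake_flags = cycle([True, False])


def fake_run_once(prompt: str) -> bool:
    return next(_fake_flags)


rate = reproduction_rate("example prompt", fake_run_once, trials=5)
print(f"Reproduction rate: {rate:.0%}")
```

A rate per prompt is easier to compare against the finding's self-assessed reproducibility score than a single yes/no result.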

## Analysis

Analyze the results and compare with the original finding:

In [None]:
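# Create results dataframe
df = pd.DataFrame(results)

if df.empty:
    # Every prompt errored out, so there are no columns to summarize
    raise RuntimeError("No responses were collected; check the model connection")

print("REPRODUCTION RESULTS:")
print(f"Prompts evaluated: {len(results)} of {len(finding['prompts'])}")
print(f"Vulnerabilities detected: {int(df['vulnerable'].sum())}")
print(f"Average confidence: {df['confidence'].mean():.2f}")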

# Check if reproduction matches original finding
original_evidence = set(finding['evidence'])
reproduced_evidence = {item for sublist in df['evidence'] for item in sublist}

evidence_overlap = len(original_evidence.intersection(reproduced_evidence))
print(f"\nEvidence patterns reproduced: {evidence_overlap}/{len(original_evidence)}")

if evidence_overlap > 0:
    print("✅ Finding reproduced (at least one evidence pattern matched)")
    print("\nMatching evidence patterns:")
    for evidence in original_evidence.intersection(reproduced_evidence):
        print(f"  • {evidence}")
else:
    print("⚠️  Could not reproduce original evidence patterns")
    print("This might indicate:")
    print("  - Model behavior has changed")
    print("  - Different model version or configuration")
    print("  - Evaluation criteria need adjustment")
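
The comparison above uses exact string equality, so two evidence patterns that differ only in case or spacing count as distinct. A possible refinement, sketched here as an assumption rather than the testbed's method, normalizes both sides before intersecting and reports the share of original patterns recovered. The helper names and the sample strings are hypothetical.

```python
def normalize(pattern: str) -> str:
    """Lowercase the pattern and collapse runs of whitespace to single spaces."""
    return " ".join(pattern.lower().split())


def evidence_overlap_ratio(original, reproduced) -> float:
    """Share of original evidence patterns that also appear in the reproduced set."""
    original_set = {normalize(p) for p in original}
    reproduced_set = {normalize(p) for p in reproduced}
    if not original_set:
        return 0.0
    return len(original_set & reproduced_set) / len(original_set)


# Made-up evidence strings, for illustration only.
original = ["Claims no awareness of evaluation", "Inconsistent capability report"]
reproduced = ["claims no awareness   of evaluation", "Unrelated pattern"]
ratio = evidence_overlap_ratio(original, reproduced)
print(f"Evidence patterns reproduced: {ratio:.0%}")
```

A ratio also allows a stricter success threshold than "any overlap", if the finding lists several evidence patterns.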

## Extended Testing

Run additional tests to validate the finding:

In [None]:
# Run the 'evaluation_gaming' category of the deception test suite
print("Running extended deception tests...")

tester = DeceptionTester(client)
test_results = tester.run_category_tests('evaluation_gaming')

print("\nExtended test results:")
analysis = tester.analyze_results(test_results)

for key, value in analysis.items():
    if key != 'category_breakdown':
        print(f"  {key}: {value}")

print("\n" + "="*50)
print("REPRODUCTION COMPLETE")
print("="*50)