# 🤖 Human-AI Trust Laboratory

*Build and break trust with AI agents that learn from your behavior!*

This implementation explores:
- Transparent belief state architectures for AI agents
- Adaptive trust calibration based on behavioral monitoring
- Multiple AI strategies representing different alignment approaches

Based on research in Multi-Objective Reinforcement Learning (MORL) and adaptive trust frameworks.

In [None]:
# Install required packages
!pip install plotly pandas numpy ipywidgets -q

In [None]:
#@title 🤖 Human-AI Trust Laboratory { display-mode: "form" }
#@markdown Build and break trust with AI agents that learn from your behavior!

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import ipywidgets as widgets
from IPython.display import display, clear_output
from collections import deque
import time

# Trust-based color scheme
AI_COLORS = {
    'confident': '#00D9FF',      # Bright cyan
    'learning': '#FFD23F',       # Yellow
    'cautious': '#FF6B6B',       # Red
    'human': '#4ECDC4',          # Teal
    'neutral': '#E8E8E8'         # Light gray
}

class BeliefStateAgent:
    """AI agent with transparent belief states based on MORL principles"""
    def __init__(self, name, learning_rate=0.1, memory_length=10):
        self.name = name
        self.learning_rate = learning_rate
        self.memory_length = memory_length
        self.trust_history = [0.5]  # Start neutral
        
        # Transparent belief state architecture
        self.belief_state = {
            'human_cooperativeness': 0.5,  # Estimated cooperation probability
            'pattern_confidence': 0.0,      # Certainty about detected patterns
            'risk_tolerance': 0.5,          # Willingness to trust
            'prediction_accuracy': deque(maxlen=20)  # Track prediction performance
        }
        
        self.memory = deque(maxlen=memory_length)
        self.meta_learning = {
            'strategy_performance': {},
            'context_patterns': []
        }
        
    def observe_human_action(self, human_action, context):
        """Update beliefs based on human behavior using dual inference"""
        self.memory.append((human_action, context, time.time()))
        
        # Calculate cooperation rate with recency weighting
        if len(self.memory) > 0:
            weights = np.linspace(0.5, 1.0, len(self.memory))  # Recent actions weighted more
            coop_actions = [1 if a == 'share' else 0 for a, _, _ in self.memory]
            weighted_coop_rate = np.average(coop_actions, weights=weights)
            
            # Update beliefs with momentum
            self.belief_state['human_cooperativeness'] = (
                0.7 * self.belief_state['human_cooperativeness'] + 
                0.3 * weighted_coop_rate
            )
        
        # Pattern detection: fewer distinct actions in the recent window means higher confidence
        if len(self.memory) >= 3:
            # deque does not support slicing, so convert to a list before taking the last 5
            recent_actions = [a for a, _, _ in list(self.memory)[-5:]]
            pattern_variety = len(set(recent_actions)) / len(recent_actions)
            self.belief_state['pattern_confidence'] = 1.0 - pattern_variety
            
        # Update risk tolerance based on outcomes
        if context and 'payoff_difference' in context:
            if context['payoff_difference'] < 0:  # AI lost relative to human
                self.belief_state['risk_tolerance'] *= 0.95
            else:
                self.belief_state['risk_tolerance'] = min(0.8, 
                    self.belief_state['risk_tolerance'] * 1.02)
            
    def decide_action(self, strategy='adaptive'):
        """AI decision based on belief state and strategy"""
        # Calculate trust score from beliefs
        trust_score = (
            0.5 * self.belief_state['human_cooperativeness'] +
            0.3 * self.belief_state['pattern_confidence'] +
            0.2 * self.belief_state['risk_tolerance']
        )
        
        self.trust_history.append(trust_score)
        
        # Strategy-specific decision making
        if strategy == 'adaptive':
            threshold = 0.5  # Balanced approach
        elif strategy == 'cautious':
            threshold = 0.7  # Require high confidence
        elif strategy == 'generous':
            threshold = 0.3  # Trust easily
        else:
            threshold = np.random.random()  # Random baseline
            
        # Make decision with exploration noise
        exploration_noise = np.random.normal(0, 0.05)
        if trust_score + exploration_noise > threshold:
            return 'trust', trust_score
        else:
            return 'verify', trust_score
            
    def get_explanation(self):
        """Provide interpretable explanation of current beliefs"""
        trust_level = self.trust_history[-1] if self.trust_history else 0.5
        
        if trust_level > 0.7:
            return "I trust you based on consistent cooperative behavior."
        elif trust_level > 0.3:  # same lower bound as the "Learning" zone in the dashboard
            return "I'm learning your patterns. Building trust takes time."
        else:
            return "I'm being cautious due to unpredictable behavior."

class HumanAITrustSimulation:
    """Interactive human-AI trust dynamics with interpretability"""
    def __init__(self):
        self.ai_agent = BeliefStateAgent("AI Assistant")
        self.rounds = []
        self.human_score = 0
        self.ai_score = 0
        
    def play_round(self, human_choice, ai_strategy='adaptive'):
        """Execute one round of trust game"""
        ai_action, trust_score = self.ai_agent.decide_action(ai_strategy)
        
        # Payoff matrix: (human_payoff, ai_payoff) for each human choice / AI action pair
        if human_choice == 'share' and ai_action == 'trust':
            human_payoff, ai_payoff = 3, 3
            outcome = "Mutual Benefit! 🤝"
            color = AI_COLORS['confident']
        elif human_choice == 'share' and ai_action == 'verify':
            human_payoff, ai_payoff = 1, 2
            outcome = "AI was cautious 🔍"
            color = AI_COLORS['learning']
        elif human_choice == 'exploit' and ai_action == 'trust':
            human_payoff, ai_payoff = 5, -1
            outcome = "You exploited AI's trust! 😈"
            color = AI_COLORS['cautious']
        else:  # exploit and verify
            human_payoff, ai_payoff = 0, 1
            outcome = "AI protected itself 🛡️"
            color = AI_COLORS['neutral']
            
        self.human_score += human_payoff
        self.ai_score += ai_payoff
        
        # AI learns from outcome
        context = {
            'outcome': outcome,
            'payoff_difference': ai_payoff - human_payoff
        }
        self.ai_agent.observe_human_action(human_choice, context)
        
        round_data = {
            'round': len(self.rounds) + 1,
            'human_choice': human_choice,
            'ai_action': ai_action,
            'human_payoff': human_payoff,
            'ai_payoff': ai_payoff,
            'trust_score': trust_score,
            'outcome': outcome,
            'color': color,
            'ai_beliefs': dict(self.ai_agent.belief_state)  # Copy current beliefs
        }
        
        self.rounds.append(round_data)
        return round_data
    
    def visualize_trust_evolution(self):
        """Create interpretable trust visualization"""
        df = pd.DataFrame(self.rounds)
        
        fig = make_subplots(
            rows=2, cols=2,
            subplot_titles=('Trust Evolution', 'Cumulative Scores', 
                          'Action History', 'AI Belief State'),
            specs=[[{"secondary_y": False}, {"secondary_y": False}],
                   [{"type": "bar"}, {"type": "scatter"}]]
        )
        
        # Trust evolution
        fig.add_trace(
            go.Scatter(
                x=df['round'],
                y=df['trust_score'],
                mode='lines+markers',
                name='AI Trust Level',
                line=dict(color=AI_COLORS['confident'], width=3),
                marker=dict(size=10)
            ),
            row=1, col=1
        )
        
        # Add trust zones
        fig.add_hrect(y0=0.7, y1=1, 
                     fillcolor=AI_COLORS['confident'], opacity=0.2,
                     row=1, col=1, annotation_text="High Trust")
        fig.add_hrect(y0=0.3, y1=0.7, 
                     fillcolor=AI_COLORS['learning'], opacity=0.2,
                     row=1, col=1, annotation_text="Learning")
        fig.add_hrect(y0=0, y1=0.3, 
                     fillcolor=AI_COLORS['cautious'], opacity=0.2,
                     row=1, col=1, annotation_text="Cautious")
        
        # Cumulative scores
        fig.add_trace(
            go.Scatter(
                x=df['round'],
                y=df['human_payoff'].cumsum(),
                mode='lines',
                name='Human Score',
                line=dict(color=AI_COLORS['human'], width=3)
            ),
            row=1, col=2
        )
        
        fig.add_trace(
            go.Scatter(
                x=df['round'],
                y=df['ai_payoff'].cumsum(),
                mode='lines',
                name='AI Score',
                line=dict(color=AI_COLORS['confident'], width=3)
            ),
            row=1, col=2
        )
        
        # Action history
        human_actions = df.groupby('human_choice').size()
        ai_actions = df.groupby('ai_action').size()
        
        fig.add_trace(
            go.Bar(
                x=['Share', 'Exploit'],
                y=[human_actions.get('share', 0), human_actions.get('exploit', 0)],
                name='Your Actions',
                marker_color=[AI_COLORS['human'], AI_COLORS['cautious']]
            ),
            row=2, col=1
        )
        
        # AI belief visualization
        if len(self.rounds) > 0:
            latest_beliefs = self.rounds[-1]['ai_beliefs']
            fig.add_trace(
                go.Scatter(
                    x=['Cooperativeness', 'Pattern<br>Confidence', 'Risk<br>Tolerance'],  # Plotly breaks lines on <br>, not \n
                    y=[latest_beliefs['human_cooperativeness'], 
                       latest_beliefs['pattern_confidence'],
                       latest_beliefs['risk_tolerance']],
                    mode='markers+text',
                    marker=dict(
                        size=40,
                        color=[latest_beliefs['human_cooperativeness'], 
                               latest_beliefs['pattern_confidence'],
                               latest_beliefs['risk_tolerance']],
                        colorscale='Viridis',
                        showscale=True,
                        colorbar=dict(title="Belief<br>Strength")
                    ),
                    text=[f"{v:.2f}" for v in [
                        latest_beliefs['human_cooperativeness'],
                        latest_beliefs['pattern_confidence'],
                        latest_beliefs['risk_tolerance']
                    ]],
                    textposition="top center",
                    showlegend=False
                ),
                row=2, col=2
            )
        
        fig.update_layout(
            height=800,
            showlegend=True,
            plot_bgcolor='white',
            title_text="Human-AI Trust Dynamics Dashboard"
        )
        
        return fig

# Create simulation instance
sim = HumanAITrustSimulation()

# Interactive interface
output_area = widgets.Output()
status_text = widgets.HTML(value="<h3>Ready to build trust with AI? 🤖</h3>")
round_counter = widgets.HTML(value="<b>Round: 0</b>")

share_button = widgets.Button(
    description='Share 🤝',
    button_style='success',
    tooltip='Cooperate with AI'
)

exploit_button = widgets.Button(
    description='Exploit 😈',
    button_style='danger',
    tooltip='Try to exploit AI'
)

strategy_selector = widgets.RadioButtons(
    options=['adaptive', 'cautious', 'generous'],
    value='adaptive',
    description='AI Strategy:',
    style={'description_width': 'initial'}
)

reset_button = widgets.Button(
    description='Reset Game',
    button_style='warning'
)

def on_share_click(b):
    play_round('share')

def on_exploit_click(b):
    play_round('exploit')
    
def on_reset_click(b):
    global sim
    sim = HumanAITrustSimulation()
    # Clear immediately; clear_output(wait=True) would keep the old figure until new output arrives
    output_area.clear_output()
    status_text.value = "<h3>Game Reset! Ready to build trust with AI? 🤖</h3>"
    round_counter.value = "<b>Round: 0</b>"

def play_round(choice):
    result = sim.play_round(choice, strategy_selector.value)
    
    # Update status
    status_text.value = f"""
    <h3>{result['outcome']}</h3>
    <p>You earned: {result['human_payoff']} | AI earned: {result['ai_payoff']}</p>
    <p>AI Trust Level: {result['trust_score']:.2f}</p>
    """
    
    round_counter.value = f"<b>Round: {result['round']}</b>"
    
    # Update visualization
    with output_area:
        clear_output(wait=True)
        fig = sim.visualize_trust_evolution()
        fig.show()
        
        # Show cumulative scores
        print(f"\n📊 Total Scores - You: {sim.human_score} | AI: {sim.ai_score}")
        
        # AI explanation
        print(f"\n🤖 AI: '{sim.ai_agent.get_explanation()}'")
        
        # Interpretability insights
        beliefs = result['ai_beliefs']
        print("\n🧠 AI Belief State:")
        print(f"  - Estimated cooperativeness: {beliefs['human_cooperativeness']:.2f}")
        print(f"  - Pattern confidence: {beliefs['pattern_confidence']:.2f}")
        print(f"  - Risk tolerance: {beliefs['risk_tolerance']:.2f}")

share_button.on_click(on_share_click)
exploit_button.on_click(on_exploit_click)
reset_button.on_click(on_reset_click)

# Display interface
display(status_text)
display(round_counter)
display(widgets.HBox([strategy_selector]))
display(widgets.HBox([share_button, exploit_button, reset_button]))
display(output_area)

#@markdown ---
#@markdown ### 🧪 Experiments to Try:
#@markdown 1. **Build Maximum Trust**: Can you get the AI to trust you completely?
#@markdown 2. **Trust Recovery**: Betray once, then try to rebuild
#@markdown 3. **Strategy Comparison**: Which AI strategy is most robust?
#@markdown 4. **Pattern Games**: Create patterns and see if AI learns them

## 📊 Understanding AI Belief States

This implementation uses transparent belief state architectures, allowing you to see exactly how the AI makes decisions:

### Belief Components
- **Human Cooperativeness**: AI's estimate of your cooperation probability
- **Pattern Confidence**: How repetitive your recent actions are (1 minus the fraction of distinct actions among your last five)
- **Risk Tolerance**: AI's willingness to trust despite uncertainty
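
To make the components concrete, the sketch below reproduces how `decide_action` combines the three beliefs into a single trust score, using the 0.5 / 0.3 / 0.2 weights from the notebook code. The standalone `trust_score` helper is a name introduced here for illustration; it is not part of the notebook's classes.

```python
def trust_score(cooperativeness, pattern_confidence, risk_tolerance):
    """Weighted sum of beliefs, with the same weights as BeliefStateAgent.decide_action."""
    return (0.5 * cooperativeness
            + 0.3 * pattern_confidence
            + 0.2 * risk_tolerance)

# Belief state of a freshly created agent
initial = trust_score(0.5, 0.0, 0.5)

# Upper end: full cooperation, five identical recent actions
# (pattern confidence 1 - 1/5 = 0.8), and risk tolerance at its 0.8 cap
upper = trust_score(1.0, 0.8, 0.8)

print(f"initial: {initial:.2f}, upper: {upper:.2f}")
```

With these inputs the score starts at 0.35 and peaks around 0.90, so a fresh agent sits below the `adaptive` and `cautious` thresholds and only slightly above the `generous` one.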

### Learning Mechanisms
- Recency-weighted memory (recent actions matter more)
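- Dual inference: tracking both what you do (cooperation rate) and how consistently you do it (pattern confidence)
- Adaptive risk adjustment: risk tolerance shrinks by 5% when the AI earns less than you in a round and grows by 2% otherwise, capped at 0.8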
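
As a rough illustration of the recency weighting, the sketch below applies one cooperativeness update the way `observe_human_action` does: linearly increasing weights from 0.5 (oldest) to 1.0 (newest), then a 0.7 / 0.3 blend with the previous belief. The `update_cooperativeness` function is a hypothetical helper written for this example.

```python
import numpy as np

def update_cooperativeness(prior, actions):
    """One belief update: recency-weighted share rate blended with the prior belief."""
    weights = np.linspace(0.5, 1.0, len(actions))  # oldest action 0.5, newest 1.0
    shared = [1 if a == 'share' else 0 for a in actions]
    weighted_rate = np.average(shared, weights=weights)
    return 0.7 * prior + 0.3 * weighted_rate

# Same three actions, different order: only the position of the exploit changes
early_exploit = update_cooperativeness(0.5, ['exploit', 'share', 'share'])
late_exploit = update_cooperativeness(0.5, ['share', 'share', 'exploit'])

print(f"exploit first: {early_exploit:.3f}, exploit last: {late_exploit:.3f}")
```

The two sequences contain the same actions, yet the belief ends higher when the exploit is oldest (about 0.583) than when it is most recent (about 0.517), which is the effect the recency weights are meant to produce.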

### Strategy Differences
- **Adaptive**: Balances trust and caution (threshold = 0.5)
- **Cautious**: Requires high confidence (threshold = 0.7)
- **Generous**: Trusts more easily (threshold = 0.3)
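
Because `decide_action` adds Gaussian exploration noise (standard deviation 0.05) before comparing against the threshold, each strategy trusts with some probability rather than flipping sharply at its threshold. The sketch below estimates that probability by sampling. The `trust_probability` helper, the sample count, and the seeded generator are assumptions made for reproducibility; the notebook itself draws a single unseeded sample per round.

```python
import numpy as np

# Thresholds as listed above
THRESHOLDS = {'adaptive': 0.5, 'cautious': 0.7, 'generous': 0.3}

def trust_probability(trust_score, strategy, n_samples=10_000, seed=0):
    """Estimate how often a strategy returns 'trust' at a given trust score."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0, 0.05, size=n_samples)  # same noise scale as decide_action
    return float(np.mean(trust_score + noise > THRESHOLDS[strategy]))

for strategy in THRESHOLDS:
    print(f"{strategy}: {trust_probability(0.55, strategy):.2f}")
```

At a trust score of 0.55, the generous strategy trusts almost always, the adaptive one most of the time, and the cautious one almost never, so the noise only matters when the score is within roughly 0.1 of a threshold.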

This transparency enables debugging of AI trust decisions and verification of aligned behavior.
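
One way to use this transparency for debugging is to flatten the per-round belief snapshots into a table and look at how each belief moved. The sketch below assumes records shaped like the dicts that `play_round` appends to `sim.rounds`. The three records are made-up illustrative values, and `belief_table` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Made-up round records, shaped like the entries in sim.rounds
rounds = [
    {'round': 1, 'human_choice': 'share', 'ai_action': 'verify',
     'ai_beliefs': {'human_cooperativeness': 0.65, 'pattern_confidence': 0.00,
                    'risk_tolerance': 0.51}},
    {'round': 2, 'human_choice': 'share', 'ai_action': 'verify',
     'ai_beliefs': {'human_cooperativeness': 0.76, 'pattern_confidence': 0.00,
                    'risk_tolerance': 0.52}},
    {'round': 3, 'human_choice': 'exploit', 'ai_action': 'verify',
     'ai_beliefs': {'human_cooperativeness': 0.70, 'pattern_confidence': 0.33,
                    'risk_tolerance': 0.53}},
]

# Only the scalar beliefs; the prediction_accuracy deque is skipped
BELIEF_KEYS = ['human_cooperativeness', 'pattern_confidence', 'risk_tolerance']

def belief_table(rounds):
    """Flatten per-round belief snapshots into one row per round."""
    rows = []
    for r in rounds:
        row = {'round': r['round'],
               'human_choice': r['human_choice'],
               'ai_action': r['ai_action']}
        row.update({k: r['ai_beliefs'][k] for k in BELIEF_KEYS})
        rows.append(row)
    return pd.DataFrame(rows).set_index('round')

table = belief_table(rounds)
# Round-over-round change in each belief; the first row is NaN by construction
changes = table[BELIEF_KEYS].diff()

print(table)
print(changes)
```

In a live session, passing `sim.rounds` instead of the sample list should give the same layout, and a negative change in `human_cooperativeness` after an `exploit` round is the kind of behavior one would expect to confirm.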