# Notebook 3: Multi-Annotator Consensus for Committee Decisions

**Objective**: Handle approval scenarios with multiple approvers (committee decisions):
- Find consensus among disagreeing annotators
- Identify problematic approvers (always yes/no)
- Map annotator agreement to YRSN components

---

## Flow Diagram

```mermaid
flowchart TD
    subgraph Input["📥 Committee Decisions"]
        A[Request Data]
        B[Multiple Approver Votes]
        C[Final Decision]
    end

    subgraph Cleanlab["🧹 Cleanlab Multi-Annotator"]
        D[get_label_quality_multiannotator]
        E[get_majority_vote_label]
        F[get_active_learning_scores]
        G[Annotator Agreement Matrix]
    end

    subgraph Analysis["🔍 Annotator Analysis"]
        H[Per-Annotator Quality]
        I[Agreement Patterns]
        J[Controversial Cases]
    end

    subgraph YRSN["🎯 YRSN Mapping"]
        K{Agreement Level?}
        L[High Agreement → R High]
        M[Low Agreement, Good Consensus → S Moderate]
        N[Low Agreement, Bad Consensus → N High]
        O[CLASH Collapse Type]
    end

    subgraph Routing["🚦 Temperature Routing"]
        P[Compute τ from Consensus Quality]
        Q{Stream Decision}
        R[🟢 GREEN: Clear consensus]
        S[🟡 YELLOW: Needs tie-breaker]
        T[🔴 RED: Escalate to senior]
    end

    A --> D
    B --> D
    B --> E
    B --> F
    D --> G
    G --> H
    G --> I
    G --> J
    D --> K
    I --> K
    K --> L
    K --> M
    K --> N
    N --> O
    L --> P
    M --> P
    N --> P
    P --> Q
    Q -->|Strong consensus| R
    Q -->|Weak consensus| S
    Q -->|No consensus| T

    style Cleanlab fill:#e1f5fe
    style Analysis fill:#fff3e0
    style YRSN fill:#e8f5e9
    style Routing fill:#fce4ec
```

---

**Collapse Type Focus**: CLASH (approvers disagree)

**Difficulty**: ⭐⭐ Medium

## 1. Setup

In [None]:
# Install dependencies
!pip install cleanlab scikit-learn pandas numpy --quiet

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Cleanlab multi-annotator functions
from cleanlab.multiannotator import (
    get_label_quality_multiannotator,
    get_majority_vote_label,
    get_active_learning_scores
)

# Import YRSN adapter
import sys
sys.path.append('../src')
from yrsn_iars.adapters.cleanlab_adapter import CleanlabAdapter, YRSNResult

print("Dependencies loaded successfully")

## 2. Generate Committee Decision Data

Simulate a scenario where 5 approvers vote on each request, with varying levels of agreement.

In [None]:
np.random.seed(42)
n_samples = 500
n_annotators = 5  # 5 committee members
n_classes = 3  # approve, reject, defer

# Annotator characteristics (some are biased)
annotator_biases = {
    'approver_A': {'approve': 0.5, 'reject': 0.3, 'defer': 0.2},  # Balanced
    'approver_B': {'approve': 0.7, 'reject': 0.2, 'defer': 0.1},  # Lenient
    'approver_C': {'approve': 0.2, 'reject': 0.6, 'defer': 0.2},  # Strict
    'approver_D': {'approve': 0.4, 'reject': 0.4, 'defer': 0.2},  # Balanced
    'approver_E': {'approve': 0.3, 'reject': 0.3, 'defer': 0.4},  # Often defers
}

# Generate "true" labels (what the decision should be)
true_labels = np.random.choice([0, 1, 2], n_samples, p=[0.4, 0.4, 0.2])  # 0=approve, 1=reject, 2=defer

# Generate annotator labels (with noise based on biases)
labels_multiannotator = np.full((n_samples, n_annotators), np.nan)

annotator_names = list(annotator_biases.keys())
class_names = ['approve', 'reject', 'defer']

for i in range(n_samples):
    true_label = true_labels[i]
    
    for j, (name, bias) in enumerate(annotator_biases.items()):
        # Probability of agreeing with true label
        agree_prob = 0.7  # Base agreement rate
        
        if np.random.random() < agree_prob:
            # Agree with true label
            labels_multiannotator[i, j] = true_label
        else:
            # Vote according to annotator's bias
            probs = [bias['approve'], bias['reject'], bias['defer']]
            labels_multiannotator[i, j] = np.random.choice([0, 1, 2], p=probs)
        
        # Some annotators miss some requests (10% NaN rate)
        if np.random.random() < 0.1:
            labels_multiannotator[i, j] = np.nan

print(f"Generated {n_samples} committee decisions with {n_annotators} annotators")
print(f"Missing annotation rate: {np.isnan(labels_multiannotator).mean()*100:.1f}%")

In [None]:
# Create request metadata
requests_df = pd.DataFrame({
    'request_id': [f'REQ-{i:04d}' for i in range(n_samples)],
    'amount': np.random.randint(1000, 100000, n_samples),
    'category': np.random.choice(['software', 'travel', 'vendor', 'equipment'], n_samples),
    'true_label': true_labels
})

# Add annotator votes as columns
for j, name in enumerate(annotator_names):
    requests_df[name] = labels_multiannotator[:, j]

requests_df.head(10)
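
As a rough sanity check on the simulation above, each approver's expected accuracy against `true_labels` can be worked out in closed form: a vote matches the true label with probability 0.7, and otherwise is drawn from the approver's bias distribution, which may still land on the true label by chance. The sketch below restates the simulation's constants; `expected_accuracy` is a helper introduced here for illustration and is not part of the notebook or of Cleanlab.

```python
import numpy as np

# Constants restated from the simulation cell
class_prior = np.array([0.4, 0.4, 0.2])  # P(true label) for approve, reject, defer
agree_prob = 0.7                          # base agreement rate

annotator_biases = {
    "approver_A": [0.5, 0.3, 0.2],
    "approver_B": [0.7, 0.2, 0.1],
    "approver_C": [0.2, 0.6, 0.2],
    "approver_D": [0.4, 0.4, 0.2],
    "approver_E": [0.3, 0.3, 0.4],
}

def expected_accuracy(bias, prior=class_prior, agree=agree_prob):
    """Hypothetical helper: P(vote == true label) under the simulation.

    With probability `agree` the vote copies the true label; otherwise it is
    drawn from `bias`, matching the true label c with probability bias[c].
    """
    bias = np.asarray(bias, dtype=float)
    return agree + (1.0 - agree) * float(prior @ bias)

expected = {name: round(expected_accuracy(b), 3) for name, b in annotator_biases.items()}
print(expected)
```

Under these assumptions all five approvers land within about two percentage points of each other (roughly 0.80 to 0.81), so the biases are more likely to show up in vote distributions than in accuracy. Missing votes are dropped independently of the vote itself, so they should not change these figures.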

## 3. Cleanlab Multi-Annotator Analysis

In [None]:
# Get majority vote labels
majority_vote = get_majority_vote_label(labels_multiannotator)

print("Majority Vote Distribution:")
print(pd.Series(majority_vote).value_counts().rename({0: 'approve', 1: 'reject', 2: 'defer'}))
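
To make the majority-vote step concrete, here is a minimal NumPy sketch of counting votes per class while skipping missing (`NaN`) annotations. `simple_majority_vote` is a name made up for this illustration, and its tie-breaking (lowest class index wins) is a simplification that may differ from what `get_majority_vote_label` does.

```python
import numpy as np

def simple_majority_vote(labels, n_classes):
    """Illustrative majority vote; ties go to the lowest class index."""
    labels = np.asarray(labels, dtype=float)
    # NaN never equals a class index, so missing votes are not counted.
    counts = np.stack([(labels == k).sum(axis=1) for k in range(n_classes)], axis=1)
    return counts.argmax(axis=1), counts

# Three toy requests, five approvers each (0=approve, 1=reject, 2=defer)
votes = np.array([
    [0, 0, 1, np.nan, 0],            # clear majority for approve
    [1, 2, 2, 2, np.nan],            # clear majority for defer
    [0, 1, np.nan, np.nan, 2],       # three-way tie
])

winners, counts = simple_majority_vote(votes, n_classes=3)
print(winners)
print(counts)
```

The third row is the interesting one: with one vote per class there is no real majority, so whichever label is returned depends entirely on the tie-breaking rule.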

In [None]:
# Get label quality scores for multi-annotator setting
# Note: This requires pred_probs from a trained model

# First, train a simple model on majority vote labels
from sklearn.preprocessing import StandardScaler

# Use simple features
X = np.column_stack([
    requests_df['amount'].values,
    pd.get_dummies(requests_df['category']).values
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train classifier
clf = LogisticRegression(max_iter=1000)
clf.fit(X_scaled, majority_vote)

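# Get predicted probabilities
# Note: these are in-sample predictions, used here for simplicity. Cleanlab
# recommends out-of-sample pred_probs (e.g., from cross-validation).
pred_probs = clf.predict_proba(X_scaled)

print("Classifier trained on majority vote labels")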
print(f"Prediction shape: {pred_probs.shape}")
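
Because the classifier above is scored on the same rows it was trained on, its probabilities may be overconfident. One common alternative is to obtain out-of-sample probabilities with scikit-learn's `cross_val_predict`. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are placeholders, not the notebook's features) and is one possible setup rather than the notebook's method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the request features and majority-vote labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = rng.integers(0, 3, size=300)

# Putting the scaler inside the pipeline refits it on each training fold,
# so held-out rows never influence the scaling.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Each row's probabilities come from a model that did not see that row.
oos_pred_probs = cross_val_predict(
    model, X_demo, y_demo, cv=5, method="predict_proba"
)
print(oos_pred_probs.shape)
```

In the notebook, the same pattern would apply with the unscaled `X` and `majority_vote` in place of the stand-ins.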

In [None]:
# Get multi-annotator label quality
multiannotator_results = get_label_quality_multiannotator(
    labels_multiannotator=labels_multiannotator,
    pred_probs=pred_probs,
    verbose=False
)

print("Multi-annotator analysis results:")
print(f"Keys: {multiannotator_results.keys()}")

In [None]:
# Extract key metrics
consensus_label = multiannotator_results['label_quality']['consensus_label']
consensus_quality = multiannotator_results['label_quality']['consensus_quality_score']

# Per-annotator quality scores (one value per annotator, not per example)
annotator_quality = multiannotator_results.get('annotator_stats', {}).get('annotator_quality', None)

# Add to dataframe
requests_df['consensus_label'] = consensus_label
requests_df['consensus_quality'] = consensus_quality

print("Consensus Quality Statistics:")
print(requests_df['consensus_quality'].describe())

## 4. Compute Annotator Agreement

In [None]:
def compute_agreement_score(row_labels):
    """Compute agreement score for a single example's annotator labels."""
    valid = row_labels[~np.isnan(row_labels)]
    if len(valid) <= 1:
        return 1.0  # Disagreement is undefined with fewer than two votes
    
    # Agreement = fraction of annotators who voted for the majority
    majority = pd.Series(valid).mode()[0]
    agreement = (valid == majority).sum() / len(valid)
    return agreement

# Compute agreement for each request
requests_df['annotator_agreement'] = [
    compute_agreement_score(labels_multiannotator[i])
    for i in range(n_samples)
]

# Count annotations per example
requests_df['n_annotations'] = (~np.isnan(labels_multiannotator)).sum(axis=1)

print("Agreement Statistics:")
print(requests_df['annotator_agreement'].describe())
print(f"\nHigh agreement (>=0.8): {(requests_df['annotator_agreement'] >= 0.8).sum()}")
print(f"Low agreement (<0.6): {(requests_df['annotator_agreement'] < 0.6).sum()}")
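
The 0.8 and 0.6 thresholds are easier to read once you notice that the agreement score can only take a few values for a given number of valid votes. The sketch below enumerates them for three classes; `achievable_agreement` is a hypothetical helper written for this note.

```python
from math import ceil

def achievable_agreement(n_votes, n_classes=3):
    """List the agreement scores possible with `n_votes` valid votes."""
    if n_votes <= 1:
        return [1.0]  # mirrors the convention used in compute_agreement_score
    # By the pigeonhole principle the most common label gets at least
    # ceil(n_votes / n_classes) votes.
    smallest_majority = ceil(n_votes / n_classes)
    return [m / n_votes for m in range(smallest_majority, n_votes + 1)]

for k in range(1, 6):
    print(k, [round(a, 2) for a in achievable_agreement(k)])
```

With all five votes present, "high agreement" (at least 0.8) means at most one dissenter, and "low agreement" (below 0.6) means a 2-2-1 split. With fewer valid votes the same thresholds correspond to different splits, for example 3 of 4 (0.75) falls just short of the high-agreement band.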

## 5. YRSN Decomposition from Multi-Annotator

In [None]:
# Use the YRSN adapter for multi-annotator
adapter = CleanlabAdapter()

yrsn_df = adapter.from_multiannotator(
    labels_multiannotator=labels_multiannotator,
    consensus_label=requests_df['consensus_label'].values,
    annotator_agreement=requests_df['annotator_agreement'].values,
    quality_of_consensus=requests_df['consensus_quality'].values,
    num_annotations=requests_df['n_annotations'].values
)

# Merge with requests
requests_df['R'] = yrsn_df['R'].values
requests_df['S'] = yrsn_df['S'].values
requests_df['N'] = yrsn_df['N'].values
requests_df['collapse_type'] = yrsn_df['collapse_type'].values

print("YRSN Statistics:")
print(requests_df[['R', 'S', 'N']].describe())
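
The internals of `CleanlabAdapter.from_multiannotator` are not shown in this notebook. Purely to build intuition for the mapping in the flow diagram (high agreement gives high R; low agreement with good consensus gives moderate S; low agreement with bad consensus gives high N), here is a toy decomposition. The formulas and the name `toy_yrsn` are assumptions made for illustration and should not be read as the adapter's actual logic.

```python
import numpy as np

def toy_yrsn(agreement, consensus_quality):
    """Hypothetical mapping from (agreement, consensus quality) to (R, S, N).

    Assumed formulas, chosen only so that the three components sum to 1:
      R = agreement * quality
      S = (1 - agreement) * quality
      N = 1 - quality
    """
    a = np.clip(np.asarray(agreement, dtype=float), 0.0, 1.0)
    q = np.clip(np.asarray(consensus_quality, dtype=float), 0.0, 1.0)
    return a * q, (1.0 - a) * q, 1.0 - q

# Three illustrative cases matching the three branches of the diagram
R, S, N = toy_yrsn(agreement=[1.0, 0.4, 0.4], consensus_quality=[0.9, 0.8, 0.2])
for r, s, n in zip(R, S, N):
    print(f"R={r:.2f}  S={s:.2f}  N={n:.2f}")
```

Comparing the real `R`, `S`, and `N` columns against `annotator_agreement` and `consensus_quality` (for example with a scatter plot) is a more reliable way to see how the adapter actually behaves.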

## 6. Identify CLASH Cases (High Disagreement)

In [None]:
# Find CLASH collapse cases
clash_cases = requests_df[requests_df['collapse_type'] == 'clash'].nlargest(15, 'S')

print("Top 15 CLASH Cases (Approver Disagreement):")
print("="*80)
for _, row in clash_cases.iterrows():
    votes = [row[name] for name in annotator_names]
    vote_str = ', '.join([f"{name}: {class_names[int(v)]}" if not np.isnan(v) else f"{name}: -" 
                          for name, v in zip(annotator_names, votes)])
    
    print(f"\n[{row['request_id']}] Amount: ${row['amount']:,}")
    print(f"R={row['R']:.2f}, S={row['S']:.2f}, N={row['N']:.2f}")
    print(f"Agreement: {row['annotator_agreement']:.2f}, Consensus Quality: {row['consensus_quality']:.2f}")
    print(f"Votes: {vote_str}")
    print(f"Consensus: {class_names[int(row['consensus_label'])]}")

## 7. Temperature-Based Routing for Committee Decisions

In [None]:
from yrsn_iars.adapters.temperature import compute_temperature

# Compute temperature
requests_df['temperature'] = requests_df['R'].apply(compute_temperature)

# Routing logic for committee decisions
def route_committee_decision(row):
    agreement = row['annotator_agreement']
    consensus_q = row['consensus_quality']
    tau = row['temperature']
    
    # Strong consensus: auto-process
    if agreement >= 0.8 and consensus_q >= 0.7 and tau < 1.5:
        return 'green'
    
    # Moderate consensus: needs review
    elif agreement >= 0.6 or consensus_q >= 0.5:
        return 'yellow'
    
    # No consensus or clash: escalate
    else:
        return 'red'

requests_df['stream'] = requests_df.apply(route_committee_decision, axis=1)

print("Routing Distribution:")
print(requests_df['stream'].value_counts())
print(f"\nAuto-processing rate (green stream): {100 * (requests_df['stream'] == 'green').mean():.1f}%")
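
The same routing rules can be written without a row-wise `apply` by using `np.select`, which also makes the precedence of the conditions explicit. The sketch below runs on a small hand-made frame; the temperature values are invented for the example, since the scale returned by `compute_temperature` is not shown here.

```python
import numpy as np
import pandas as pd

# Hand-made examples; temperature values are placeholders
demo = pd.DataFrame({
    "annotator_agreement": [1.0, 0.8, 0.6, 0.4, 0.4],
    "consensus_quality":   [0.9, 0.6, 0.3, 0.55, 0.2],
    "temperature":         [1.0, 1.0, 2.0, 2.0, 3.0],
})

is_green = (
    (demo["annotator_agreement"] >= 0.8)
    & (demo["consensus_quality"] >= 0.7)
    & (demo["temperature"] < 1.5)
)
# Note the OR: either signal alone is enough to avoid escalation.
is_yellow = (demo["annotator_agreement"] >= 0.6) | (demo["consensus_quality"] >= 0.5)

# Conditions are checked in order, so green takes precedence over yellow.
demo["stream"] = np.select([is_green, is_yellow], ["green", "yellow"], default="red")
print(demo)
```

The fourth row shows a consequence of the `or` in the yellow rule: a request with only 0.4 agreement still avoids the red stream because its consensus quality is above 0.5. If that is not the intended behavior, changing `|` to `&` would make the yellow band stricter.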

## 8. Analyze Per-Annotator Quality

In [None]:
# Compute per-annotator statistics
annotator_stats = []

for j, name in enumerate(annotator_names):
    votes = labels_multiannotator[:, j]
    valid_mask = ~np.isnan(votes)
    valid_votes = votes[valid_mask]
    valid_consensus = requests_df['consensus_label'].values[valid_mask]
    
    # Agreement with consensus
    agrees_with_consensus = (valid_votes == valid_consensus).mean()
    
    # Vote distribution
    vote_dist = pd.Series(valid_votes).value_counts(normalize=True)
    
    annotator_stats.append({
        'annotator': name,
        'n_votes': len(valid_votes),
        'agrees_with_consensus': agrees_with_consensus,
        'approve_rate': vote_dist.get(0, 0),
        'reject_rate': vote_dist.get(1, 0),
        'defer_rate': vote_dist.get(2, 0)
    })

annotator_df = pd.DataFrame(annotator_stats)

print("Per-Annotator Quality:")
print(annotator_df.round(3))

# Flag problematic annotators
print("\n" + "="*50)
print("Annotator Flags:")
for _, row in annotator_df.iterrows():
    flags = []
    if row['agrees_with_consensus'] < 0.6:
        flags.append("LOW CONSENSUS AGREEMENT")
    if row['approve_rate'] > 0.65:
        flags.append("OFTEN APPROVES")
    if row['reject_rate'] > 0.55:
        flags.append("OFTEN REJECTS")
    if row['defer_rate'] > 0.35:
        flags.append("OFTEN DEFERS")
    
    if flags:
        print(f"{row['annotator']}: {', '.join(flags)}")
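
Agreement with the consensus tells you how each approver relates to the group, but not which approvers tend to side with each other. A pairwise agreement matrix, as hinted at in the flow diagram, is one way to look at that. The sketch below is a plain NumPy/pandas version on toy votes; `pairwise_agreement` is a helper invented for this example, not a Cleanlab function.

```python
import numpy as np
import pandas as pd

def pairwise_agreement(labels, names):
    """Fraction of co-annotated examples on which each pair of annotators agrees."""
    labels = np.asarray(labels, dtype=float)
    n = labels.shape[1]
    matrix = np.full((n, n), np.nan)  # stays NaN if a pair shares no examples
    for a in range(n):
        for b in range(n):
            both = ~np.isnan(labels[:, a]) & ~np.isnan(labels[:, b])
            if both.any():
                matrix[a, b] = (labels[both, a] == labels[both, b]).mean()
    return pd.DataFrame(matrix, index=names, columns=names)

# Toy votes: rows are requests, columns are approvers A, B, C
demo_votes = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [2, np.nan, 2],
    [0, 0, np.nan],
])

agreement_matrix = pairwise_agreement(demo_votes, ["A", "B", "C"])
print(agreement_matrix.round(2))
```

In the notebook, the same function could be called with `labels_multiannotator` and `annotator_names`. Raw agreement does not correct for chance, so a chance-corrected statistic such as Cohen's kappa may be preferable when class frequencies are very uneven.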

## 9. Export Results

In [None]:
# Save results
output_cols = ['request_id', 'amount', 'category', 'consensus_label',
               'annotator_agreement', 'consensus_quality', 'n_annotations',
               'R', 'S', 'N', 'collapse_type', 'temperature', 'stream']

requests_df[output_cols].to_csv('committee_yrsn_results.csv', index=False)
annotator_df.to_csv('annotator_quality.csv', index=False)

print(f"Saved {len(requests_df)} committee decisions with YRSN analysis")
print("Saved annotator quality metrics")

## Summary

In this notebook we:
1. Generated committee decision data with 5 approvers per request
2. Used Cleanlab multi-annotator functions to find consensus
3. Computed annotator agreement and consensus quality
4. Mapped to YRSN: Low agreement → high S (CLASH collapse)
5. Applied temperature-based routing for committee decisions
6. Analyzed per-annotator quality and flagged problematic approvers

**Key Insight**: When approvers disagree, S increases (contentious case), which raises temperature and routes to yellow/red for additional review or tie-breaker.

**Next**: Notebook 4 - RAG/Retrieval Context Quality