# Task 3: Saliency Mapping & Model Interpretability
## Interpreting DistilBERT + LoRA AI Text Classifier

**Objective**: Perform post-hoc interpretability analysis on the fine-tuned DistilBERT model to understand which tokens/features drive AI vs Human classification decisions.

**Methods**:
- **Captum Integrated Gradients** - Token-level attribution analysis
- **HTML Heatmap Visualization** - Color-coded importance scores
- **Error Analysis** - Examining misclassified or borderline cases

**Key Questions**:
1. Does the model rely on specific lexical markers ("AI-isms")?
2. Does it capture broader structural patterns (rhythm, coherence)?
3. What causes false positives (Human → AI)?

---

## 1. Setup & Imports

**⚠️ IMPORTANT - If you get numpy/pandas import errors:**

This is caused by binary incompatibility between numpy and pandas versions.

**Steps to fix:**
1. **Run Cell 3** (installation cell) - it will reinstall compatible versions
2. **Restart Kernel**: Click Session → Restart Session (or circular arrow ⟳)
3. **Skip Cell 3** and **run from Cell 4** onwards

The installation cell uses:
- `numpy==1.26.4` 
- `pandas==2.1.4`

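These two versions are tested and compatible with each other. pip may still report conflicts with unrelated preinstalled packages (as in the resolver output below), none of which this notebook imports.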
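
After restarting the kernel, it can help to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch using only the standard library; `check_pins` and `PINS` are hypothetical names introduced here, not part of the notebook.

```python
from importlib import metadata

# Pins copied from the installation cell
PINS = {"numpy": "1.26.4", "pandas": "2.1.4"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, installed)} for every package that misses its pin."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


mismatches = check_pins(PINS)
if mismatches:
    for name, (expected, installed) in mismatches.items():
        print(f"{name}: expected {expected}, found {installed}")
else:
    print("All pinned versions match.")
```

An empty result means both pins are satisfied; anything else suggests the installation cell needs to be rerun before restarting the kernel.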

In [1]:
# Force reinstall compatible versions to fix binary incompatibility
!pip uninstall numpy pandas -y -q
!pip install numpy==1.26.4 pandas==2.1.4 captum -q
print("✓ Compatible packages installed.")

[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 2.26.0 requires google-cloud-bigquery-storage<3.0.0,>=2.30.0, which is not installed.
nilearn 0.13.0 requires pandas>=2.2.0, but you have pandas 2.1.4 which is incompatible.
cesium 0.12.4 requires numpy<3.0,>=2.0, but you have numpy 1.26.4 which is incompatible.
google-colab

In [2]:
# Core imports
import json
import torch
import numpy as np
import pandas as pd
from pathlib import Path
from IPython.display import HTML, display
import warnings
warnings.filterwarnings('ignore')

# Transformers & PEFT
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from peft import PeftModel

# Captum for interpretability
from captum.attr import IntegratedGradients, LayerIntegratedGradients
from captum.attr import visualization as viz

# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Set MAX_LENGTH for tokenization
MAX_LENGTH = 256

# Print versions after successful import
import sys
print("✓ All imports successful\n")
print(f"Python: {sys.version.split()[0]}")
print(f"PyTorch: {torch.__version__}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Device: {device}")
print(f"Max sequence length: {MAX_LENGTH}")


✓ All imports successful

Python: 3.12.12
PyTorch: 2.8.0+cu126
NumPy: 1.26.4
Pandas: 2.1.4
Device: cuda
Max sequence length: 256


---
## 2. Load Fine-Tuned DistilBERT + LoRA Model

In [3]:
import os
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForSequenceClassification

IN_KAGGLE = os.path.exists('/kaggle/input')

if IN_KAGGLE:
    # Update these paths to match the EXACT dataset folders in your sidebar.
    # Tip: Click the 'copy path' icon on the folder in the sidebar to be sure.
    MODEL_PATH = Path('/kaggle/input/tier-c-lora-model/tier_c_lora_model')
    DATA_PATH = Path('/kaggle/input/precog-novels-data/precog-novels-data')
else:
    MODEL_PATH = Path('../output/tier_c_models/tier_c_lora_model')
    DATA_PATH = Path('../output')

print(f"Loading model from: {MODEL_PATH}")

# Fail fast if the folder is missing. This avoids the HFValidationError from
# from_pretrained and a NameError on `model` in later cells.
if not MODEL_PATH.exists():
    raise FileNotFoundError(
        f"Model folder not found at {MODEL_PATH}. "
        "Check the dataset path in your sidebar."
    )

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)

# Load model (binary classifier: 0=Human, 1=AI)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_PATH,
    num_labels=2
)
model.to(device)
model.eval()  # inference mode: disables dropout
print("✓ Model and Tokenizer loaded successfully!")

Loading model from: /kaggle/input/tier-c-lora-model/tier_c_lora_model


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✓ Model and Tokenizer loaded successfully!
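
The warning above names `distilbert-base-uncased`, which suggests the base weights were loaded before anything stored in the checkpoint folder. One way to see what a folder actually holds is to look for the file names that `save_pretrained` conventionally writes: `config.json` plus `model.safetensors` or `pytorch_model.bin` for a full model, and `adapter_config.json` for a PEFT adapter. The sketch below assumes those conventions; `describe_checkpoint` is a hypothetical helper, and the demo runs on a throwaway folder rather than the real `MODEL_PATH`.

```python
import tempfile
from pathlib import Path


def describe_checkpoint(folder):
    """Classify a saved-model folder by the files it contains."""
    names = {p.name for p in Path(folder).iterdir()}
    has_adapter = "adapter_config.json" in names
    has_full = "config.json" in names and bool(
        names & {"model.safetensors", "pytorch_model.bin"}
    )
    if has_adapter and has_full:
        return "full model + adapter"
    if has_adapter:
        return "adapter only"
    if has_full:
        return "full model"
    return "unknown"


# Demo on a temporary folder that mimics an adapter-only save
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "adapter_config.json").write_text("{}")
    (Path(tmp) / "adapter_model.safetensors").write_bytes(b"")
    kind = describe_checkpoint(tmp)

print(kind)
```

Calling `describe_checkpoint(MODEL_PATH)` in the notebook would show which case applies. If the folder turns out to be adapter-only, the classifier weights are expected to come from the adapter rather than from the base checkpoint.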


---
## 3. Load Test Samples

Load some AI and Human samples from the test set for analysis.

In [4]:
# Load sample AI-generated texts
AI_SAMPLES_PATH = DATA_PATH / 'class2'

sample_ai_texts = []
for jsonl_file in AI_SAMPLES_PATH.glob('*.jsonl'):
    with open(jsonl_file, 'r', encoding='utf-8') as f:
        for line in f:
            try:
                entry = json.loads(line.strip())
                text = entry.get('text') or entry.get('paragraph') or entry.get('content', '')
                if text and len(text.split()) >= 50:
                    sample_ai_texts.append(text)
                    if len(sample_ai_texts) >= 10:
                        break
            except (json.JSONDecodeError, AttributeError):
                # Skip malformed lines and JSON values that are not objects
                continue
    if len(sample_ai_texts) >= 10:
        break

# Load sample Human-written texts
HUMAN_SAMPLES_PATH = DATA_PATH / 'class1'

sample_human_texts = []
for txt_file in HUMAN_SAMPLES_PATH.glob('*_cleaned.txt'):
    with open(txt_file, 'r', encoding='utf-8') as f:
        text = f.read()
        # Split into consecutive 200-word chunks
        words = text.split()
        for i in range(0, min(len(words), 2000), 200):  # Sample first 10 chunks
            chunk = ' '.join(words[i:i+200])
            if len(chunk.split()) >= 50:
                sample_human_texts.append(chunk)
                if len(sample_human_texts) >= 10:
                    break
    if len(sample_human_texts) >= 10:
        break

print(f"✓ Loaded {len(sample_ai_texts)} AI-generated samples")
print(f"✓ Loaded {len(sample_human_texts)} Human-written samples")

# Preview samples
print(f"\n{'='*70}")
print("SAMPLE AI TEXT (first 300 chars):")
print(sample_ai_texts[0][:300] + "...")
print(f"\n{'='*70}")
print("SAMPLE HUMAN TEXT (first 300 chars):")
print(sample_human_texts[0][:300] + "...")

✓ Loaded 10 AI-generated samples
✓ Loaded 10 Human-written samples

SAMPLE AI TEXT (first 300 chars):
The elemental fury of nature often strikes with a stark, indiscriminate power that dwarfs human endeavors. Whether it be the tempestuous rage of a hurricane, the earth-shattering violence of an earthquake, or the relentless gnawing of a drought, these forces operate beyond our comprehension or contr...

SAMPLE HUMAN TEXT (first 300 chars):
Captain MacWhirr, of the steamer Nan-Shan, had a physiognomy that, in the order of material appearances, was the exact counterpart of his mind: it presented no marked characteristics of firmness or stupidity; it had no pronounced characteristics whatever; it was simply ordinary, irresponsive, and un...


---
## 4. Integrated Gradients Attribution Implementation

In [5]:
# Helpers for prediction, Layer Integrated Gradients attribution,
# and aggregation of WordPiece subwords into words

def predict_with_confidence(text, model, tokenizer, device):
    """
    Get model prediction and confidence for a text sample.
    
    Returns:
        tuple: (prediction_class, confidence, probabilities)
    """
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=MAX_LENGTH, padding='max_length')
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    
    with torch.no_grad():
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        probs = torch.softmax(logits, dim=1)
        prediction = torch.argmax(probs, dim=1).item()
        confidence = probs[0, prediction].item()
    
    return prediction, confidence, probs[0].cpu().numpy()


def compute_attributions(text, model, tokenizer, device, target_class=1, n_steps=50):
    """
    Compute token-level attributions using Layer Integrated Gradients.
    
    Args:
        text: Input text to analyze
        model: Fine-tuned model
        tokenizer: Tokenizer
        device: torch device
        target_class: Class to explain (0=Human, 1=AI)
        n_steps: Number of integration steps
    
    Returns:
        tuple: (tokens, attributions, prediction, confidence, delta)
    """
    # Tokenize
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=MAX_LENGTH, padding='max_length')
    input_ids = inputs['input_ids'].to(device)
    attention_mask = inputs['attention_mask'].to(device)
    
    # Get prediction
    prediction, confidence, probs = predict_with_confidence(text, model, tokenizer, device)
    
    # Wrap the model so that forward() takes only input_ids. The attention mask is
    # held fixed inside the wrapper, which keeps the LayerIntegratedGradients call simple.
    class ModelWrapper(torch.nn.Module):
        def __init__(self, model, attention_mask, target_class):
            super().__init__()
            self.model = model
            self.attention_mask = attention_mask
            self.target_class = target_class
        
        def forward(self, input_ids):
            # Simply pass through to the model with the fixed attention mask
            outputs = self.model(input_ids=input_ids, attention_mask=self.attention_mask)
            return outputs.logits[:, self.target_class]
    
    # Create the wrapper
    wrapped_model = ModelWrapper(model, attention_mask, target_class)
    
    # Initialize Layer Integrated Gradients on embedding layer
    lig = LayerIntegratedGradients(wrapped_model, model.distilbert.embeddings.word_embeddings)
    
    # Baseline: all [PAD] tokens (ID 0 in DistilBERT's vocabulary)
    baseline_ids = torch.full_like(input_ids, tokenizer.pad_token_id)
    
    # Compute attributions (the wrapper holds the mask, so no additional_forward_args are needed)
    attributions, delta = lig.attribute(
        inputs=input_ids,
        baselines=baseline_ids,
        return_convergence_delta=True,
        n_steps=n_steps
    )
    
    # Sum attributions across embedding dimension
    attributions = attributions.sum(dim=-1).squeeze(0)
    attributions = attributions.cpu().detach().numpy()
    
    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0].cpu().numpy())
    
    return tokens, attributions, prediction, confidence, delta


def aggregate_subword_attributions(tokens, attributions):
    """
    Aggregate WordPiece subword tokens to word-level attributions.
    
    DistilBERT uses ## prefix for subword continuations.
    
    Returns:
        list of tuples: [(word, attribution_score), ...]
    """
    words = []
    word_scores = []
    current_word = ""
    current_score = 0
    
    for token, score in zip(tokens, attributions):
        # Skip special tokens
        if token in ['[CLS]', '[SEP]', '[PAD]']:
            continue
        
        # Check if subword continuation
        if token.startswith('##'):
            current_word += token[2:]
            current_score += score
        else:
            # Save previous word
            if current_word:
                words.append(current_word)
                word_scores.append(current_score)
            # Start new word
            current_word = token
            current_score = score
    
    # Add final word
    if current_word:
        words.append(current_word)
        word_scores.append(current_score)
    
    return list(zip(words, word_scores))


print("✓ Attribution functions defined")

✓ Attribution functions defined
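
To make the attribution step less of a black box, here is a minimal sketch of Integrated Gradients on a toy function with a known gradient. It uses a midpoint Riemann sum over `n_steps` points on the straight path from baseline to input. The toy function `f`, its gradient `grad_f`, and the helper name are hypothetical, and Captum's own integration rule may differ from the midpoint rule used here.

```python
def integrated_gradients(grad_fn, x, baseline, n_steps=50):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    grad_sum = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grads = grad_fn(point)
        grad_sum = [s + g for s, g in zip(grad_sum, grads)]
    # Scale the average gradient along the path by (input - baseline)
    return [(xi - b) * s / n_steps for xi, b, s in zip(x, baseline, grad_sum)]


# Toy "model" (illustration only): f(x) = x0 * x1 + x2^2
def f(x):
    return x[0] * x[1] + x[2] ** 2


def grad_f(x):
    return [x[1], x[0], 2 * x[2]]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

attrs = integrated_gradients(grad_f, x, baseline)

# Completeness check: attributions should sum to f(x) - f(baseline)
delta = sum(attrs) - (f(x) - f(baseline))
print("attributions:", attrs)
print("completeness gap:", delta)
```

In `compute_attributions` the same idea is applied to the word-embedding layer, with the all-`[PAD]` sequence as the baseline and the target-class logit as the function being explained.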


---
## 5. HTML Heatmap Visualization

In [6]:
def create_html_heatmap(words, scores, prediction, confidence, title="Saliency Map"):
    """
    Create an HTML heatmap visualization.
    
    Positive scores (red) = push toward AI
    Negative scores (blue) = push toward Human
    """
    # Normalize scores for coloring
    max_abs_score = max(abs(min(scores)), abs(max(scores)))
    if max_abs_score == 0:
        max_abs_score = 1
    
    html = f"""
    <div style="font-family: Arial, sans-serif; padding: 20px; background-color: #f5f5f5; border-radius: 10px;">
        <h3 style="color: #333;">{title}</h3>
        <p style="margin: 10px 0;">
            <strong>Prediction:</strong> <span style="color: {'#d32f2f' if prediction == 1 else '#1976d2'};">
                {'AI-generated' if prediction == 1 else 'Human-written'}
            </span> 
            ({confidence:.1%} confidence)
        </p>
        <div style="margin-top: 15px; line-height: 2.2; font-size: 16px;">
    """
    
    # Imported locally because the variable `html` above shadows the module name
    from html import escape

    for word, score in zip(words, scores):
        # Normalize score to [-1, 1]
        norm_score = score / max_abs_score
        
        # Color mapping: fade from white (score 0) toward red or blue as |score| grows
        shade = 255 - int(abs(norm_score) * 200)
        if norm_score > 0:
            # Red for AI
            bg_color = f"rgb(255, {shade}, {shade})"
        else:
            # Blue for Human
            bg_color = f"rgb({shade}, {shade}, 255)"
        
        # White text only on the darkest blue backgrounds; black elsewhere for readability
        text_color = "#fff" if norm_score < -0.75 else "#000"
        
        html += f'''
            <span style="
                background-color: {bg_color};
                color: {text_color};
                padding: 3px 6px;
                margin: 2px;
                border-radius: 4px;
                display: inline-block;
                font-weight: {'bold' if abs(norm_score) > 0.3 else 'normal'};
            " title="Score: {score:.4f}">
                {escape(word)}
            </span>
        '''
    
    html += """
        </div>
        <div style="margin-top: 20px; font-size: 12px; color: #666;">
            <strong>Legend:</strong>
            <span style="background-color: #ffcccc; padding: 2px 8px; margin: 0 5px; border-radius: 3px;">Red = AI signal</span>
            <span style="background-color: #ccccff; padding: 2px 8px; margin: 0 5px; border-radius: 3px;">Blue = Human signal</span>
            <span style="color: #999; margin-left: 10px;">(Hover over words for exact scores)</span>
        </div>
    </div>
    """
    
    return html


print("✓ Visualization function defined")

✓ Visualization function defined


---
## 6. Analysis: AI-Generated Text Sample

In [7]:
# Analyze AI-generated sample
ai_text = sample_ai_texts[0]

print("="*70)
print("ANALYZING AI-GENERATED TEXT")
print("="*70)
print(f"\nText:\n{ai_text[:400]}...\n")

# Compute attributions
tokens, attrs, pred, conf, delta = compute_attributions(
    ai_text, model, tokenizer, device, target_class=1
)

# Aggregate to words
word_attrs = aggregate_subword_attributions(tokens, attrs)
words, scores = zip(*word_attrs)

print(f"Prediction: {'AI' if pred == 1 else 'Human'} ({conf:.2%} confidence)")
print(f"Convergence delta: {delta.item():.6f}")
print(f"\nTop 10 words pushing toward AI:")
sorted_words = sorted(word_attrs, key=lambda x: x[1], reverse=True)
for i, (w, s) in enumerate(sorted_words[:10], 1):
    print(f"  {i:2d}. {w:20s} {s:+.4f}")

# Visualize
html_viz = create_html_heatmap(
    words, scores, pred, conf,
    title="Saliency Map: AI-Generated Text"
)
display(HTML(html_viz))

ANALYZING AI-GENERATED TEXT

Text:
The elemental fury of nature often strikes with a stark, indiscriminate power that dwarfs human endeavors. Whether it be the tempestuous rage of a hurricane, the earth-shattering violence of an earthquake, or the relentless gnawing of a drought, these forces operate beyond our comprehension or control. They are not malicious, nor are they benevolent; they simply are, driven by vast, indifferent ge...

Prediction: AI (99.99% confidence)
Convergence delta: -0.055787

Top 10 words pushing toward AI:
   1. often                +0.1907
   2. whether              +0.1600
   3. or                   +0.1479
   4. can                  +0.1395
   5. processes            +0.0962
   6. .                    +0.0942
   7. the                  +0.0911
   8. unbridled            +0.0829
   9. by                   +0.0824
  10. are                  +0.0814
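
The convergence delta reported above can be read against the completeness property of Integrated Gradients. In standard notation (the symbols are introduced here for illustration: $F$ is the target-class logit returned by the wrapper, $x$ the input embeddings, and $x'$ the embeddings of the all-`[PAD]` baseline):

```latex
\mathrm{IG}_i(x) = (x_i - x'_i) \int_0^1
  \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha ,
\qquad
\sum_i \mathrm{IG}_i(x) = F(x) - F(x') .
```

With exact integration, the attributions summed over all tokens and embedding dimensions equal the logit gap between input and baseline. The delta measures how far the numerical approximation falls short of that identity, so a delta that is small relative to the logit gap indicates a trustworthy approximation, and increasing `n_steps` is the usual way to shrink it.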


---
## 7. Analysis: Human-Written Text Sample

In [8]:
# Analyze Human-written sample
human_text = sample_human_texts[0]

print("="*70)
print("ANALYZING HUMAN-WRITTEN TEXT")
print("="*70)
print(f"\nText:\n{human_text[:400]}...\n")

# Compute attributions (still target AI class to see what prevents AI classification)
tokens, attrs, pred, conf, delta = compute_attributions(
    human_text, model, tokenizer, device, target_class=1
)

# Aggregate to words
word_attrs = aggregate_subword_attributions(tokens, attrs)
words, scores = zip(*word_attrs)

print(f"Prediction: {'AI' if pred == 1 else 'Human'} ({conf:.2%} confidence)")
print(f"Convergence delta: {delta.item():.6f}")
print(f"\nTop 10 words pushing AWAY from AI (toward Human):")
sorted_words = sorted(word_attrs, key=lambda x: x[1])
for i, (w, s) in enumerate(sorted_words[:10], 1):
    print(f"  {i:2d}. {w:20s} {s:+.4f}")

# Visualize
html_viz = create_html_heatmap(
    words, scores, pred, conf,
    title="Saliency Map: Human-Written Text"
)
display(HTML(html_viz))

ANALYZING HUMAN-WRITTEN TEXT

Text:
Captain MacWhirr, of the steamer Nan-Shan, had a physiognomy that, in the order of material appearances, was the exact counterpart of his mind: it presented no marked characteristics of firmness or stupidity; it had no pronounced characteristics whatever; it was simply ordinary, irresponsive, and unruffled. The only thing his aspect might have been said to suggest, at times, was bashfulness; becau...

Prediction: Human (100.00% confidence)
Convergence delta: -0.038063

Top 10 words pushing AWAY from AI (toward Human):
   1. macwhirr             -0.1900
   2. physiognomy          -0.1331
   3. captain              -0.0936
   4. his                  -0.0744
   5. his                  -0.0662
   6. he                   -0.0633
   7. as                   -0.0613
   8. ;                    -0.0613
   9. .                    -0.0604
  10. carroty              -0.0603
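
One caveat when reading these scores: the wrapper in `compute_attributions` explains the AI logit alone, while the predicted probability in a two-class softmax depends only on the difference between the two logits. The sketch below checks that identity numerically with hypothetical logit values; it does not use the model.

```python
import math


def softmax_ai_prob(z_human, z_ai):
    """P(AI) from a two-class softmax, computed stably."""
    m = max(z_human, z_ai)
    e_human = math.exp(z_human - m)
    e_ai = math.exp(z_ai - m)
    return e_ai / (e_human + e_ai)


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


# Hypothetical logits for a confidently "Human" sample
z_human, z_ai = 4.2, -5.1

p_softmax = softmax_ai_prob(z_human, z_ai)
p_sigmoid = sigmoid(z_ai - z_human)  # same probability from the logit difference

print(p_softmax, p_sigmoid)
```

Because of this, a token that lowers the AI logit may lower the Human logit as well, and its net effect on the decision is not visible from the AI logit alone. Attributing the logit difference instead would be one possible variant, at the cost of changing the scores reported here.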


---
## 8. Find Borderline/Misclassified Cases

Let's find human texts that are classified as AI or have low confidence, for error analysis.

In [9]:
# Find borderline or misclassified human samples
print("Searching for borderline Human samples (predicted as AI or low confidence)...\n")

borderline_cases = []

for i, text in enumerate(sample_human_texts):
    pred, conf, probs = predict_with_confidence(text, model, tokenizer, device)
    
    # Look for:
    # 1. Misclassified (Human predicted as AI)
    # 2. Low confidence correct predictions
    # 3. High AI probability even if correctly classified
    
    ai_prob = probs[1]  # Probability of AI class
    
    if pred == 1 or conf < 0.7 or ai_prob > 0.3:
        borderline_cases.append({
            'text': text,
            'prediction': pred,
            'confidence': conf,
            'ai_probability': ai_prob,
            'type': 'MISCLASSIFIED' if pred == 1 else 'BORDERLINE'
        })
    
    if len(borderline_cases) >= 5:
        break

print(f"Found {len(borderline_cases)} borderline/misclassified cases\n")
print("="*70)

for i, case in enumerate(borderline_cases[:3], 1):  # Show top 3
    print(f"\nCase {i} - {case['type']}")
    print(f"Prediction: {'AI' if case['prediction'] == 1 else 'Human'}")
    print(f"Confidence: {case['confidence']:.2%}")
    print(f"AI Probability: {case['ai_probability']:.2%}")
    print(f"Text preview: {case['text'][:200]}...")
    print("-"*70)

Searching for borderline Human samples (predicted as AI or low confidence)...

Found 0 borderline/misclassified cases
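
The three selection criteria in the cell above can be written as a small pure function, which also makes one property visible: with only two classes, a correct prediction with confidence below 0.7 is the same thing as an AI probability above 0.3, and a misclassification always implies an AI probability above 0.5. The sketch below is illustrative; `flag_human_sample` is a hypothetical name and the probabilities are made up.

```python
def flag_human_sample(p_human, p_ai, min_conf=0.7, max_ai_prob=0.3):
    """Label a human-written sample as 'MISCLASSIFIED', 'BORDERLINE', or None."""
    pred = 1 if p_ai > p_human else 0  # ties go to class 0, as with argmax
    conf = max(p_human, p_ai)
    if pred == 1:
        return "MISCLASSIFIED"
    if conf < min_conf or p_ai > max_ai_prob:
        return "BORDERLINE"
    return None


# Made-up probability pairs (p_human, p_ai)
examples = [(0.99, 0.01), (0.65, 0.35), (0.50, 0.50), (0.20, 0.80)]

for p_human, p_ai in examples:
    label = flag_human_sample(p_human, p_ai)
    # For two classes the whole filter reduces to a single threshold on p_ai
    same_as_threshold = (label is not None) == (p_ai > 0.3)
    print(p_human, p_ai, label, same_as_threshold)
```

In practice this means the filter is a single cut at 30% AI probability. With all ten human samples classified at very high confidence, an empty result is what that cut would be expected to give.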



---
## 9. Error Analysis: Detailed Investigation of 3 Cases

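Next, we run a detailed saliency analysis on up to 3 borderline/misclassified cases. None were found among the 10 human samples above, so the analysis loop below produces no output in this run. For each case, the analysis looks for: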
1. **Lexical repetition** - Repeated words or phrases
2. **Stylistic regularity** - Overly formal or structured language
3. **AI-isms** - Words like "tapestry", "delve", "testament", "furthermore", "consequently"

In [15]:
# Known AI-ism markers
AI_ISMS = [
    'tapestry', 'delve', 'testament', 'furthermore', 'consequently', 
    'moreover', 'additionally', 'paradigm', 'framework', 'multifaceted',
    'nuanced', 'comprehensive', 'facilitate', 'leverage', 'optimize',
    'intricate', 'pivotal', 'underscores', 'endeavor', 'myriad',
    'seamlessly', 'robust', 'dynamic', 'innovative', 'strategic'
]

def analyze_case(text, case_num):
    """
    Perform detailed error analysis on a single case.
    """
    print(f"\n{'='*70}")
    print(f"ERROR ANALYSIS - CASE {case_num}")
    print(f"{'='*70}\n")
    
    # Get attribution
    tokens, attrs, pred, conf, delta = compute_attributions(
        text, model, tokenizer, device, target_class=1
    )
    word_attrs = aggregate_subword_attributions(tokens, attrs)
    words_list, scores_list = zip(*word_attrs)
    
    # 1. Check for AI-isms
    print("1. AI-ISM DETECTION:")
    print("-" * 70)
    found_isms = []
    for word, score in word_attrs:
        if word.lower() in AI_ISMS:
            found_isms.append((word, score))
    
    if found_isms:
        print(f"Found {len(found_isms)} AI-ism marker(s):")
        for word, score in found_isms:
            print(f"  • '{word}' (attribution: {score:+.4f})")
    else:
        print("  No known AI-ism markers detected")
    
    # 2. Lexical repetition analysis
    print(f"\n2. LEXICAL REPETITION:")
    print("-" * 70)
    word_counts = {}
    for word in words_list:
        word_lower = word.lower()
        if len(word_lower) > 3:  # Only meaningful words
            word_counts[word_lower] = word_counts.get(word_lower, 0) + 1
    
    repeated = [(w, c) for w, c in word_counts.items() if c > 2]
    if repeated:
        repeated.sort(key=lambda x: x[1], reverse=True)
        print(f"Found {len(repeated)} repeated word(s) (>2 occurrences):")
        for word, count in repeated[:5]:
            print(f"  • '{word}' appears {count} times")
    else:
        print("  No significant repetition detected")
    
    # 3. Top AI-pushing words
    print(f"\n3. TOP WORDS PUSHING TOWARD AI CLASSIFICATION:")
    print("-" * 70)
    sorted_attrs = sorted(word_attrs, key=lambda x: x[1], reverse=True)
    for i, (word, score) in enumerate(sorted_attrs[:8], 1):
        indicator = ""
        if word.lower() in AI_ISMS:
            indicator = "  [AI-ISM]"
        if score > 0.01:  # Strong signal
            print(f"  {i}. {word:20s} {score:+.4f}{indicator}")
    
    # 4. Stylistic analysis
    print(f"\n4. STYLISTIC ANALYSIS:")
    print("-" * 70)
    
    # Check for formal connectives
    formal_connectives = ['furthermore', 'moreover', 'consequently', 'thus', 
                         'therefore', 'hence', 'nevertheless', 'nonetheless']
    found_connectives = [w for w in words_list if w.lower() in formal_connectives]
    
    if found_connectives:
        print(f"  Formal connectives found: {', '.join(found_connectives)}")
    
    # Check attribution distribution
    positive_attrs = [s for s in scores_list if s > 0]
    negative_attrs = [s for s in scores_list if s < 0]
    
    if positive_attrs:
        avg_positive = np.mean(positive_attrs)
        max_positive = max(positive_attrs)
        print(f"  Positive attributions: {len(positive_attrs)} words, avg={avg_positive:.4f}, max={max_positive:.4f}")
    if negative_attrs:
        avg_negative = np.mean(negative_attrs)
        min_negative = min(negative_attrs)
        print(f"  Negative attributions: {len(negative_attrs)} words, avg={avg_negative:.4f}, min={min_negative:.4f}")
    
    # 5. Prediction summary
    print(f"\n5. PREDICTION SUMMARY:")
    print("-" * 70)
    print(f"  Model prediction: {'AI-generated' if pred == 1 else 'Human-written'}")
    print(f"  Confidence: {conf:.2%}")
    print(f"  True label: Human-written")
    print(f"  Status: {'MISCLASSIFIED' if pred == 1 else 'BORDERLINE (low confidence)'}")
    
    # Visualize
    print(f"\n6. SALIENCY HEATMAP:")
    print("-" * 70)
    html_viz = create_html_heatmap(
        words_list, scores_list, pred, conf,
        title=f"Error Analysis Case {case_num} - Human Text"
    )
    display(HTML(html_viz))
    
    return word_attrs


# Analyze up to 3 borderline cases
for i, case in enumerate(borderline_cases[:3], 1):
    analyze_case(case['text'], i)

In [14]:
# Analyze multiple samples to find common patterns
print("="*70)
print("BATCH ANALYSIS: Finding Common AI Markers")
print("="*70)

# Collect attributions from multiple AI samples
ai_word_scores = {}  # {word: [scores]}

print("\nAnalyzing AI samples...")
for i, text in enumerate(sample_ai_texts[:5], 1):
    tokens, attrs, pred, conf, _ = compute_attributions(text, model, tokenizer, device, target_class=1)
    word_attrs = aggregate_subword_attributions(tokens, attrs)
    
    for word, score in word_attrs:
        word_lower = word.lower()
        if word_lower not in ai_word_scores:
            ai_word_scores[word_lower] = []
        ai_word_scores[word_lower].append(score)
    
    print(f"  Sample {i}: {pred} prediction ({conf:.2%} confidence)")

# Find consistently high-scoring words (at least 3 occurrences in total across the samples)
avg_scores = {word: np.mean(scores) for word, scores in ai_word_scores.items() if len(scores) >= 3}
top_ai_markers = sorted(avg_scores.items(), key=lambda x: x[1], reverse=True)[:20]

print(f"\n{'='*70}")
print("TOP 20 MOST CONSISTENT AI MARKERS (across 5 samples):")
print("="*70)
for i, (word, avg_score) in enumerate(top_ai_markers, 1):
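    # Note: `count` is the total number of occurrences across all samples,
    # so it can exceed the number of samples analyzed.
    count = len(ai_word_scores[word])
    ism_flag = " [AI-ISM]" if word in AI_ISMS else ""
    print(f"{i:2d}. {word:20s} avg={avg_score:+.4f} (appeared in {count} samples){ism_flag}")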

# Same for Human samples
print(f"\n{'='*70}")
print("BATCH ANALYSIS: Finding Common Human Markers")
print("="*70)

human_word_scores = {}

print("\nAnalyzing Human samples...")
for i, text in enumerate(sample_human_texts[:5], 1):
    tokens, attrs, pred, conf, _ = compute_attributions(text, model, tokenizer, device, target_class=1)
    word_attrs = aggregate_subword_attributions(tokens, attrs)
    
    for word, score in word_attrs:
        word_lower = word.lower()
        if word_lower not in human_word_scores:
            human_word_scores[word_lower] = []
        human_word_scores[word_lower].append(score)
    
    print(f"  Sample {i}: {pred} prediction ({conf:.2%} confidence)")

# Find consistently negative-scoring words (Human markers, at least 3 occurrences in total)
avg_human_scores = {word: np.mean(scores) for word, scores in human_word_scores.items() if len(scores) >= 3}
top_human_markers = sorted(avg_human_scores.items(), key=lambda x: x[1])[:20]

print(f"\n{'='*70}")
print("TOP 20 MOST CONSISTENT HUMAN MARKERS (across 5 samples):")
print("="*70)
for i, (word, avg_score) in enumerate(top_human_markers, 1):
    count = len(human_word_scores[word])
    print(f"{i:2d}. {word:20s} avg={avg_score:+.4f} (appeared in {count} samples)")

BATCH ANALYSIS: Finding Common AI Markers

Analyzing AI samples...
  Sample 1: 1 prediction (99.99% confidence)
  Sample 2: 1 prediction (100.00% confidence)
  Sample 3: 1 prediction (99.99% confidence)
  Sample 4: 1 prediction (100.00% confidence)
  Sample 5: 1 prediction (100.00% confidence)

TOP 20 MOST CONSISTENT AI MARKERS (across 5 samples):
 1. world                avg=+0.1151 (appeared in 3 samples)
 2. often                avg=+0.0965 (appeared in 6 samples)
 3. can                  avg=+0.0859 (appeared in 4 samples)
 4. human                avg=+0.0786 (appeared in 5 samples)
 5. or                   avg=+0.0702 (appeared in 8 samples)
 6. by                   avg=+0.0696 (appeared in 3 samples)
 7. is                   avg=+0.0670 (appeared in 9 samples)
 8. individuals          avg=+0.0638 (appeared in 3 samples)
 9. are                  avg=+0.0555 (appeared in 5 samples)
10. when                 avg=+0.0529 (appeared in 3 samples)
11. not                  avg=+0.0529 (ap