# Part 1: Artifact Analysis - ELECTRA on SNLI

**Professional Research Project**: Systematic identification and characterization of dataset artifacts

**Environment**: Google Colab with GPU (A100 recommended)

**Prerequisite**: Complete baseline training (colab_training.ipynb)

**Objective**: Identify spurious correlations and dataset biases that models exploit

---

## Analysis Components

1. **Hypothesis-Only Baseline** - Test whether labels are predictable from the hypothesis alone (~67% expected vs. 33.3% chance)
2. **Error Characterization** - Systematic analysis of model failures
3. **Lexical Overlap Analysis** - Correlation between word overlap and predictions
4. **Statistical Artifact Detection** - Length bias, word frequency patterns
5. **Contrast Sets** - Robustness to minimal perturbations

---

In [None]:
# Check GPU availability
!nvidia-smi

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("⚠️ WARNING: No GPU detected.")

## Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

import os
PROJECT_DIR = '/content/drive/MyDrive/electra-artifact-analysis'

# Verify baseline model exists
BASELINE_MODEL = f"{PROJECT_DIR}/models/baseline_snli"
if not os.path.exists(BASELINE_MODEL):
    print("❌ ERROR: Baseline model not found!")
    print(f"   Expected: {BASELINE_MODEL}")
    print("   → Run colab_training.ipynb first")
else:
    print(f"✓ Baseline model found: {BASELINE_MODEL}")

# Create analysis output directory
os.makedirs(f"{PROJECT_DIR}/analysis_results", exist_ok=True)
os.makedirs(f"{PROJECT_DIR}/figures", exist_ok=True)
print("✓ Analysis directories ready")

## Clone Repository & Setup

In [None]:
# Clone repository
!git clone https://github.com/TimFrenzel/electra-nlp-artifact-analysis.git /content/electra-nlp-artifact-analysis
%cd /content/electra-nlp-artifact-analysis

# Install dependencies
!pip install -q -r requirements.txt

print("✓ Repository cloned and dependencies installed")

## Part 1.1: Hypothesis-Only Baseline

**Research Question**: Does the model exploit hypothesis-only bias?

**Expected Results**:
- Random baseline: 33.3% (3-class classification)
- Biased baseline: ~67% (indicates severe artifacts)
- Full model: ~89%

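**Interpretation**: If hypothesis-only accuracy exceeds 60%, the hypotheses alone carry substantial label information, so a model trained on full inputs can likely exploit these spurious correlations.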
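
The 33.3% reference point assumes perfectly balanced classes. A slightly more careful comparison uses the majority-class baseline computed from the evaluation labels themselves. The sketch below is a minimal illustration: the helper names are hypothetical and the labels are toy values, not SNLI statistics.

```python
from collections import Counter

def majority_class_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

def gain_over_majority(accuracy, labels):
    """Absolute accuracy gain over the majority-class baseline."""
    return accuracy - majority_class_baseline(labels)

# Toy labels for illustration only (0=entailment, 1=neutral, 2=contradiction).
toy_labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
print(majority_class_baseline(toy_labels))            # 0.4
print(round(gain_over_majority(0.67, toy_labels), 2))  # 0.27
```

Reporting the gain over the majority class rather than over 33.3% keeps the comparison honest if the evaluation split is not exactly balanced.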

In [None]:
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Load SNLI dataset
print("Loading SNLI dataset...")
snli = load_dataset("snli")

# Filter invalid labels
snli_valid = snli.filter(lambda x: x["label"] != -1)
print(f"✓ Dataset loaded: {len(snli_valid['validation'])} validation examples")

In [None]:
# Train hypothesis-only model
print("\n" + "="*60)
print("HYPOTHESIS-ONLY BASELINE TRAINING")
print("="*60)

# Prepare hypothesis-only dataset
tokenizer = AutoTokenizer.from_pretrained("google/electra-small-discriminator")

def tokenize_hypothesis_only(examples):
    return tokenizer(
        examples["hypothesis"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

# Tokenize
train_hyp_only = snli_valid["train"].map(tokenize_hypothesis_only, batched=True)
eval_hyp_only = snli_valid["validation"].map(tokenize_hypothesis_only, batched=True)

# Format for PyTorch
train_hyp_only = train_hyp_only.rename_column("label", "labels")
eval_hyp_only = eval_hyp_only.rename_column("label", "labels")
train_hyp_only.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
eval_hyp_only.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

print(f"✓ Hypothesis-only dataset prepared")
print(f"   Train: {len(train_hyp_only)} examples")
print(f"   Eval: {len(eval_hyp_only)} examples")

In [None]:
# Train hypothesis-only model
hyp_model = AutoModelForSequenceClassification.from_pretrained(
    "google/electra-small-discriminator",
    num_labels=3
)

training_args = TrainingArguments(
    output_dir=f"{PROJECT_DIR}/models/hypothesis_only",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    evaluation_strategy="epoch",  # newer transformers releases rename this argument to eval_strategy
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=True,
    logging_steps=100,
)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": accuracy_score(labels, predictions)}

trainer = Trainer(
    model=hyp_model,
    args=training_args,
    train_dataset=train_hyp_only,
    eval_dataset=eval_hyp_only,
    compute_metrics=compute_metrics,
)

print("\n🚀 Training hypothesis-only model...")
print("   Expected time: 30-60 minutes with A100\n")

trainer.train()

print("\n✓ Hypothesis-only training complete")

In [None]:
# Evaluate hypothesis-only model
print("\n" + "="*60)
print("HYPOTHESIS-ONLY BASELINE RESULTS")
print("="*60)

hyp_results = trainer.evaluate()
hyp_accuracy = hyp_results["eval_accuracy"]

print(f"\nHypothesis-only accuracy: {hyp_accuracy:.2%}")
print(f"Random baseline: 33.3%")
print(f"Expected biased baseline: ~67%")

# Interpret results
if hyp_accuracy >= 0.65:
    artifact_severity = "SEVERE"
    color = "🔴"
    interpretation = "Model heavily exploits hypothesis-only bias. Strong artifacts present."
elif hyp_accuracy >= 0.55:
    artifact_severity = "MODERATE"
    color = "🟡"
    interpretation = "Moderate hypothesis-only bias detected. Some artifacts present."
elif hyp_accuracy >= 0.45:
    artifact_severity = "MILD"
    color = "🟢"
    interpretation = "Mild hypothesis-only bias. Model uses some context."
else:
    artifact_severity = "MINIMAL"
    color = "✅"
    interpretation = "Minimal hypothesis-only bias. Model relies on full context."

print(f"\n{color} ARTIFACT SEVERITY: {artifact_severity}")
print(f"   {interpretation}")
print("="*60)

# Save results
import json
hyp_summary = {
    "hypothesis_only_accuracy": hyp_accuracy,
    "random_baseline": 0.333,
    "artifact_severity": artifact_severity,
    "interpretation": interpretation,
}

with open(f"{PROJECT_DIR}/analysis_results/hypothesis_only_results.json", 'w') as f:
    json.dump(hyp_summary, f, indent=2)

print(f"\n✓ Results saved to: {PROJECT_DIR}/analysis_results/hypothesis_only_results.json")

## Part 1.2: Per-Class Hypothesis-Only Analysis

Analyze which classes are most predictable from hypothesis alone.
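
Per-class predictability corresponds to the diagonal of the row-normalized confusion matrix, i.e. per-class recall. As a minimal sketch of that quantity in plain Python (the function name and toy labels are illustrative assumptions, not part of the notebook):

```python
def per_class_recall(true_labels, pred_labels, num_classes=3):
    """Fraction of each true class that was predicted correctly."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for t, p in zip(true_labels, pred_labels):
        total[t] += 1
        correct[t] += int(t == p)
    # Classes with no examples get NaN rather than a misleading 0.
    return [c / n if n else float("nan") for c, n in zip(correct, total)]

# Toy labels for illustration only (0=entailment, 1=neutral, 2=contradiction).
toy_true = [0, 0, 1, 1, 2, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 2, 2, 0]
print(per_class_recall(toy_true, toy_pred))  # [0.5, 1.0, 0.75]
```

A class whose recall is far above the others under hypothesis-only input is the one whose hypotheses leak the most label information.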

In [None]:
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Get predictions
predictions = trainer.predict(eval_hyp_only)
pred_labels = np.argmax(predictions.predictions, axis=1)
true_labels = predictions.label_ids

# Classification report
label_names = ["entailment", "neutral", "contradiction"]
print("\nPer-Class Performance (Hypothesis-Only):")
print(classification_report(true_labels, pred_labels, target_names=label_names))

# Confusion matrix
cm = confusion_matrix(true_labels, pred_labels)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

plt.figure(figsize=(8, 6))
sns.heatmap(cm_normalized, annot=True, fmt=".2%", cmap="YlOrRd", 
            xticklabels=label_names, yticklabels=label_names)
plt.title("Hypothesis-Only Confusion Matrix (Normalized)", fontsize=14, weight='bold')
plt.ylabel("True Label")
plt.xlabel("Predicted Label")
plt.tight_layout()
plt.savefig(f"{PROJECT_DIR}/figures/hypothesis_only_confusion_matrix.png", dpi=300)
print(f"\n✓ Confusion matrix saved to figures/hypothesis_only_confusion_matrix.png")
plt.show()

## Part 1.3: Lexical Overlap Analysis

**Hypothesis**: High word overlap between premise and hypothesis correlates with entailment predictions.
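
Word overlap can be measured in more than one way, and the choice changes the numbers. Jaccard similarity divides the shared words by the union of both word sets, while hypothesis coverage divides by the hypothesis word set only. The sketch below contrasts the two on a made-up sentence pair; the function names are illustrative assumptions.

```python
def token_set(text):
    """Lowercased whitespace tokens as a set (no punctuation handling)."""
    return set(text.lower().split())

def jaccard_similarity(premise, hypothesis):
    """|P ∩ H| / |P ∪ H|"""
    p, h = token_set(premise), token_set(hypothesis)
    union = p | h
    return len(p & h) / len(union) if union else 0.0

def hypothesis_coverage(premise, hypothesis):
    """|P ∩ H| / |H|: share of hypothesis words found in the premise."""
    p, h = token_set(premise), token_set(hypothesis)
    return len(p & h) / len(h) if h else 0.0

# Made-up example pair for illustration.
premise = "A man is playing a guitar"
hypothesis = "A man is sleeping"
print(jaccard_similarity(premise, hypothesis))   # 0.5
print(hypothesis_coverage(premise, hypothesis))  # 0.75
```

Because SNLI premises tend to be longer than hypotheses, Jaccard similarity is pulled down by premise-only words, whereas coverage is not.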

In [None]:
# Load baseline model predictions
baseline_tokenizer = AutoTokenizer.from_pretrained(BASELINE_MODEL)
baseline_model = AutoModelForSequenceClassification.from_pretrained(BASELINE_MODEL)

# Prepare full context dataset
def tokenize_full_context(examples):
    return baseline_tokenizer(
        examples["premise"],
        examples["hypothesis"],
        truncation=True,
        padding="max_length",
        max_length=128
    )

eval_full = snli_valid["validation"].map(tokenize_full_context, batched=True)
eval_full = eval_full.rename_column("label", "labels")
eval_full.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Get baseline predictions
baseline_trainer = Trainer(
    model=baseline_model,
    args=TrainingArguments(output_dir="/tmp", per_device_eval_batch_size=32),
    compute_metrics=compute_metrics,
)

baseline_predictions = baseline_trainer.predict(eval_full)
baseline_pred_labels = np.argmax(baseline_predictions.predictions, axis=1)

print(f"✓ Baseline model predictions obtained")
print(f"   Baseline accuracy: {accuracy_score(true_labels, baseline_pred_labels):.2%}")

In [None]:
# Compute lexical overlap for each example
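def compute_lexical_overlap(premise, hypothesis):
    """Compute the share of unique hypothesis words that also appear in the premise.

    Note: this is hypothesis coverage (|P ∩ H| / |H|), not Jaccard similarity.
    """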
    p_words = set(premise.lower().split())
    h_words = set(hypothesis.lower().split())
    
    if len(h_words) == 0:
        return 0.0
    
    overlap = len(p_words & h_words)
    return overlap / len(h_words)

# Calculate overlap for validation set
val_examples = snli_valid["validation"]
overlaps = []

for i in range(len(val_examples)):
    overlap = compute_lexical_overlap(
        val_examples[i]["premise"],
        val_examples[i]["hypothesis"]
    )
    overlaps.append(overlap)

overlaps = np.array(overlaps)
print(f"✓ Lexical overlap computed for {len(overlaps)} examples")
print(f"   Mean overlap: {overlaps.mean():.2%}")
print(f"   Std overlap: {overlaps.std():.2%}")

In [None]:
# Analyze correlation between overlap and predictions
overlap_df = pd.DataFrame({
    'overlap': overlaps,
    'true_label': true_labels,
    'baseline_pred': baseline_pred_labels,
    'baseline_correct': (baseline_pred_labels == true_labels).astype(int)
})

# Stratify by overlap level
overlap_df['overlap_bin'] = pd.cut(overlap_df['overlap'], 
                                     bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0],
                                     labels=['0-20%', '20-40%', '40-60%', '60-80%', '80-100%'],
                                     include_lowest=True)  # keep zero-overlap examples in the first bin

# Accuracy by overlap bin
accuracy_by_overlap = overlap_df.groupby('overlap_bin')['baseline_correct'].mean()

print("\n" + "="*60)
print("LEXICAL OVERLAP ANALYSIS")
print("="*60)
print("\nAccuracy by Lexical Overlap Level:")
print(accuracy_by_overlap.to_string())
print("\nInterpretation:")
if accuracy_by_overlap.iloc[-1] > accuracy_by_overlap.iloc[0] + 0.1:
    print("⚠️ STRONG LEXICAL OVERLAP BIAS: High overlap → higher accuracy")
    print("   Model likely exploits superficial word matching")
else:
    print("✓ MINIMAL LEXICAL OVERLAP BIAS: Model not over-relying on overlap")
print("="*60)

In [None]:
# Visualize overlap distribution by label
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Overlap distribution by true label
for label_idx, label_name in enumerate(label_names):
    label_overlaps = overlaps[true_labels == label_idx]
    axes[0].hist(label_overlaps, bins=20, alpha=0.6, label=label_name)
    
axes[0].set_xlabel("Lexical Overlap (share of hypothesis words in premise)")
axes[0].set_ylabel("Frequency")
axes[0].set_title("Lexical Overlap Distribution by True Label", weight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Plot 2: Accuracy by overlap bin
accuracy_by_overlap.plot(kind='bar', ax=axes[1], color='steelblue')
axes[1].set_xlabel("Lexical Overlap Bin")
axes[1].set_ylabel("Accuracy")
axes[1].set_title("Model Accuracy vs. Lexical Overlap", weight='bold')
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=45)
axes[1].grid(axis='y', alpha=0.3)
axes[1].axhline(y=accuracy_by_overlap.mean(), color='red', linestyle='--', label='Mean of bin accuracies')
axes[1].legend()

plt.tight_layout()
plt.savefig(f"{PROJECT_DIR}/figures/lexical_overlap_analysis.png", dpi=300)
print(f"\n✓ Overlap analysis saved to figures/lexical_overlap_analysis.png")
plt.show()

## Part 1.4: Length Bias Analysis

Test if hypothesis length correlates with predictions.
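
A length artifact shows up when the label mix shifts across length bins. As a rough, dependency-free sketch of that check (helper names and the four example hypotheses are made up for illustration; the bin edges mirror the ones used in this section):

```python
from collections import Counter, defaultdict

def length_bin(n_words):
    """Map a word count to one of the bins 1-5, 6-10, 11-15, 16+."""
    if n_words <= 5:
        return "1-5"
    if n_words <= 10:
        return "6-10"
    if n_words <= 15:
        return "11-15"
    return "16+"

def label_distribution_by_length(hypotheses, labels):
    """Per-bin label proportions, keyed by length bin."""
    counts = defaultdict(Counter)
    for hyp, label in zip(hypotheses, labels):
        counts[length_bin(len(hyp.split()))][label] += 1
    return {
        bin_name: {label: n / sum(ctr.values()) for label, n in ctr.items()}
        for bin_name, ctr in counts.items()
    }

# Made-up hypotheses for illustration only.
toy_hyps = [
    "A man sleeps",
    "A dog runs outside",
    "The woman is waiting for her friend to arrive",
    "Nobody is here",
]
toy_labels = ["entailment", "entailment", "neutral", "contradiction"]
dist = label_distribution_by_length(toy_hyps, toy_labels)
print(dist["6-10"])  # {'neutral': 1.0}
```

If every bin showed roughly the same proportions, hypothesis length would carry little label information.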

In [None]:
# Compute hypothesis lengths
hyp_lengths = [len(val_examples[i]["hypothesis"].split()) for i in range(len(val_examples))]
hyp_lengths = np.array(hyp_lengths)

length_df = pd.DataFrame({
    'length': hyp_lengths,
    'true_label': true_labels,
    'baseline_pred': baseline_pred_labels,
    'baseline_correct': (baseline_pred_labels == true_labels).astype(int)
})

# Stratify by length
length_df['length_bin'] = pd.cut(length_df['length'],
                                  bins=[0, 5, 10, 15, 100],
                                  labels=['1-5', '6-10', '11-15', '16+'])

# Accuracy by length
accuracy_by_length = length_df.groupby('length_bin')['baseline_correct'].mean()

print("\n" + "="*60)
print("LENGTH BIAS ANALYSIS")
print("="*60)
print("\nAccuracy by Hypothesis Length (words):")
print(accuracy_by_length.to_string())

# Label distribution by length
print("\nLabel Distribution by Length:")
label_dist_by_length = pd.crosstab(length_df['length_bin'], length_df['true_label'], normalize='index')
label_dist_by_length.columns = label_names
print(label_dist_by_length.to_string())
print("="*60)

In [None]:
# Visualize length bias
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Label distribution by length
label_dist_by_length.plot(kind='bar', stacked=True, ax=axes[0], 
                          color=['#2ecc71', '#f39c12', '#e74c3c'])
axes[0].set_xlabel("Hypothesis Length (words)")
axes[0].set_ylabel("Proportion")
axes[0].set_title("Label Distribution by Hypothesis Length", weight='bold')
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=45)
axes[0].legend(title="Label")

# Plot 2: Accuracy by length
accuracy_by_length.plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_xlabel("Hypothesis Length (words)")
axes[1].set_ylabel("Accuracy")
axes[1].set_title("Model Accuracy by Hypothesis Length", weight='bold')
axes[1].set_xticklabels(axes[1].get_xticklabels(), rotation=45)
axes[1].grid(axis='y', alpha=0.3)
axes[1].axhline(y=accuracy_by_length.mean(), color='red', linestyle='--', label='Mean of bin accuracies')
axes[1].legend()

plt.tight_layout()
plt.savefig(f"{PROJECT_DIR}/figures/length_bias_analysis.png", dpi=300)
print(f"\n✓ Length analysis saved to figures/length_bias_analysis.png")
plt.show()

## Part 1.5: Error Analysis - What Does the Model Get Wrong?

Systematic characterization of failure modes.
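
One simple way to characterize failure modes is to count each (true, predicted) pair among the misclassified examples. The following is a minimal sketch with a hypothetical helper and toy labels, not output from the baseline model:

```python
from collections import Counter

def count_error_types(true_labels, pred_labels):
    """Count (true, predicted) pairs, keeping only misclassifications."""
    return Counter(
        (t, p) for t, p in zip(true_labels, pred_labels) if t != p
    )

# Toy labels for illustration only.
toy_true = ["neutral", "neutral", "entailment", "contradiction", "neutral"]
toy_pred = ["entailment", "entailment", "entailment", "neutral", "neutral"]

for (t, p), n in count_error_types(toy_true, toy_pred).most_common():
    print(f"{t} -> {p}: {n}")
# neutral -> entailment: 2
# contradiction -> neutral: 1
```

Asymmetric counts (for example, many neutral examples predicted as entailment but few the other way) point at a directional bias rather than general confusion between two classes.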

In [None]:
# Identify errors
errors_df = pd.DataFrame({
    'premise': [val_examples[i]['premise'] for i in range(len(val_examples))],
    'hypothesis': [val_examples[i]['hypothesis'] for i in range(len(val_examples))],
    'true_label': [label_names[l] for l in true_labels],
    'pred_label': [label_names[l] for l in baseline_pred_labels],
    'correct': baseline_pred_labels == true_labels,
    'overlap': overlaps,
    'hyp_length': hyp_lengths
})

errors = errors_df[~errors_df['correct']]

print("\n" + "="*60)
print("ERROR ANALYSIS")
print("="*60)
print(f"\nTotal errors: {len(errors)} / {len(errors_df)} ({len(errors)/len(errors_df):.1%})")
print(f"\nError distribution by true label:")
print(errors['true_label'].value_counts())

print(f"\nError distribution by predicted label:")
print(errors['pred_label'].value_counts())

print(f"\nMost common error types (true → predicted):")
error_types = errors.groupby(['true_label', 'pred_label']).size().sort_values(ascending=False)
print(error_types.head(10))
print("="*60)

In [None]:
# Sample representative errors
print("\n" + "="*60)
print("SAMPLE ERRORS (for qualitative analysis)")
print("="*60)

# Sample errors from each category
for true_label in label_names:
    class_errors = errors[errors['true_label'] == true_label]
    # Fixed seed so the sampled errors are reproducible across runs
    label_errors = class_errors.sample(min(3, len(class_errors)), random_state=42)
    
    print(f"\n{true_label.upper()} misclassified as:")
    for idx, row in label_errors.iterrows():
        print(f"\n  Premise: {row['premise'][:100]}...")
        print(f"  Hypothesis: {row['hypothesis']}")
        print(f"  Predicted: {row['pred_label']} (overlap: {row['overlap']:.2f}, length: {row['hyp_length']})")
        print("  " + "-"*50)

print("\n✓ Sample errors displayed for qualitative analysis")

## Part 1.6: Summary Statistics for Report

Aggregate all findings for inclusion in technical report.
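
When aggregating results into JSON, note that `json.dump` rejects dictionaries with tuple keys, which is what a grouping over two columns typically produces. A minimal sketch of one way to flatten such keys before saving (the helper name and the example dictionary are assumptions for illustration):

```python
import json

def to_json_safe(obj):
    """Recursively convert tuple keys to 'a -> b' strings; other keys become str."""
    if isinstance(obj, dict):
        return {
            (" -> ".join(map(str, k)) if isinstance(k, tuple) else str(k)):
                to_json_safe(v)
            for k, v in obj.items()
        }
    if isinstance(obj, (list, tuple)):
        return [to_json_safe(v) for v in obj]
    return obj

# Toy summary for illustration only.
summary = {"top_error_types": {("neutral", "entailment"): 2}}
print(json.dumps(to_json_safe(summary)))
# {"top_error_types": {"neutral -> entailment": 2}}
```

Flattening keys up front keeps the saved file readable and avoids a `TypeError` at the very last step of a long run.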

In [None]:
# Compile comprehensive analysis summary
analysis_summary = {
    "baseline_performance": {
        "accuracy": float(accuracy_score(true_labels, baseline_pred_labels)),
        "per_class_accuracy": {
            label_names[i]: float(accuracy_score(
                true_labels[true_labels == i],
                baseline_pred_labels[true_labels == i]
            )) for i in range(3)
        }
    },
    "hypothesis_only_bias": {
        "accuracy": float(hyp_accuracy),
        "severity": artifact_severity,
        "interpretation": interpretation,
        "vs_random_baseline": float(hyp_accuracy - 0.333),
    },
    "lexical_overlap_bias": {
        "mean_overlap": float(overlaps.mean()),
        "std_overlap": float(overlaps.std()),
        "accuracy_by_overlap": accuracy_by_overlap.to_dict(),
        "correlation": "Strong" if accuracy_by_overlap.iloc[-1] > accuracy_by_overlap.iloc[0] + 0.1 else "Weak"
    },
    "length_bias": {
        "mean_length": float(hyp_lengths.mean()),
        "accuracy_by_length": accuracy_by_length.to_dict(),
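    },
    "error_analysis": {
        "total_errors": int(len(errors)),
        "error_rate": float(len(errors) / len(errors_df)),
        "errors_by_true_label": errors['true_label'].value_counts().to_dict(),
        # MultiIndex keys are tuples, which json.dump cannot serialize; flatten to strings
        "top_error_types": {
            f"{true} -> {pred}": int(count)
            for (true, pred), count in error_types.head(5).items()
        },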
    }
}

# Save comprehensive summary
with open(f"{PROJECT_DIR}/analysis_results/part1_analysis_summary.json", 'w') as f:
    json.dump(analysis_summary, f, indent=2)

print("\n" + "="*60)
print("PART 1 ANALYSIS COMPLETE")
print("="*60)
print(f"\n📊 Results Summary:")
print(f"   Baseline Accuracy: {analysis_summary['baseline_performance']['accuracy']:.2%}")
print(f"   Hypothesis-Only: {analysis_summary['hypothesis_only_bias']['accuracy']:.2%} ({artifact_severity})")
print(f"   Lexical Overlap: {analysis_summary['lexical_overlap_bias']['correlation']} correlation")
print(f"   Error Rate: {analysis_summary['error_analysis']['error_rate']:.1%}")

print(f"\n📁 Saved Files:")
print(f"   {PROJECT_DIR}/analysis_results/part1_analysis_summary.json")
print(f"   {PROJECT_DIR}/analysis_results/hypothesis_only_results.json")
print(f"   {PROJECT_DIR}/figures/hypothesis_only_confusion_matrix.png")
print(f"   {PROJECT_DIR}/figures/lexical_overlap_analysis.png")
print(f"   {PROJECT_DIR}/figures/length_bias_analysis.png")

print(f"\n📝 For Technical Report:")
print(f"   Section 3 (Analysis): Use findings from this notebook")
print(f"   Figures: Include generated visualizations")
print(f"   Tables: Use accuracy_by_overlap and accuracy_by_length")

print(f"\n🎯 Next Steps:")
print(f"   → Run Part 2: Mitigation (colab_mitigation_part2.ipynb)")
print(f"   → Implement debiasing methods based on identified artifacts")
print("="*60)

---

## Analysis Complete ✓

### Key Findings:
1. **Hypothesis-Only Bias**: Quantified severity of artifact exploitation
2. **Lexical Overlap**: Tested correlation with model predictions
3. **Length Bias**: Analyzed impact of hypothesis length
4. **Error Patterns**: Identified systematic failure modes

### For Technical Report (Part 1):
- **Section 3.1**: Baseline performance and setup
- **Section 3.2**: Hypothesis-only baseline analysis
- **Section 3.3**: Lexical overlap and length bias
- **Section 3.4**: Error characterization
- **Figures**: Include all generated visualizations

### Next: Part 2 - Mitigation
Design and implement debiasing methods to address identified artifacts.