# Analyzing and Mitigating Dataset Artifacts in NLI

**Project:** Final Project - CS388  
**Dataset:** SNLI (Stanford Natural Language Inference)  
**Model:** ELECTRA-small  
**Goal:** Detect and mitigate dataset artifacts using hypothesis-only baselines and ensemble debiasing


## 📚 Setup and Installation


In [None]:
# Install required packages
!pip install -q transformers datasets torch tqdm evaluate accelerate matplotlib seaborn


In [None]:
import os
import sys
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, Counter
from IPython.display import Image, display

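# Set random seeds for reproducibility (this seeds only the notebook process;
# the training scripts launched with `!python` run in separate processes)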
import random
random.seed(42)
np.random.seed(42)

print("✅ All libraries imported successfully!")


## 📊 Part 1: Baseline Model Training

Train a standard NLI model on the SNLI dataset using both the premise and the hypothesis.


In [None]:
!python train/run.py \
    --do_train \
    --do_eval \
    --task nli \
    --dataset snli \
    --model google/electra-small-discriminator \
    --output_dir ./outputs/evaluations/baseline_100k/ \
    --max_train_samples 100000 \
    --num_train_epochs 3 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --max_length 128 \
    --learning_rate 2e-5


In [None]:
# Check baseline results
with open('./outputs/evaluations/baseline_100k/eval_metrics.json', 'r') as f:
    baseline_metrics = json.load(f)

print("=" * 80)
print("Baseline Model Results")
print("=" * 80)
print(f"Accuracy: {baseline_metrics['eval_accuracy']:.4f} ({baseline_metrics['eval_accuracy']*100:.2f}%)")
print(f"Eval Loss: {baseline_metrics.get('eval_loss', 'N/A')}")


## 🔍 Part 2: Artifact Detection - Hypothesis-Only Model

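Train a model that sees only the hypothesis, never the premise, to detect dataset artifacts.  
SNLI has three roughly balanced classes, so chance accuracy is about 33.33%. A hypothesis-only model that scores well above that level shows that the hypotheses alone leak label information, which is the signature of annotation artifacts.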
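
The difference between the two input formats can be sketched in a few lines. This is a minimal illustration only: `build_model_input` is a hypothetical helper, the example sentence pair is made up rather than taken from SNLI, and the actual `train/train_hypothesis_only.py` script is not shown here and may construct its inputs differently.

```python
def build_model_input(example, hypothesis_only=False):
    """Return the text fields a tokenizer would receive for one NLI example.

    Hypothetical helper for illustration; not part of the project scripts.
    """
    if hypothesis_only:
        # The premise is withheld, so any predictive signal must come
        # from the hypothesis text alone.
        return (example["hypothesis"],)
    # Standard NLI input: the model sees the premise and the hypothesis.
    return (example["premise"], example["hypothesis"])


# Made-up example; label 2 corresponds to "Contradiction" in this notebook.
example = {
    "premise": "A man is playing a guitar on stage.",
    "hypothesis": "Nobody is playing music.",
    "label": 2,
}

full_input = build_model_input(example)
hyp_input = build_model_input(example, hypothesis_only=True)
print(full_input)
print(hyp_input)
```

A model trained on `hyp_input`-style data cannot reason about entailment at all, so whatever accuracy it reaches above chance comes from regularities in how the hypotheses were written.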


In [None]:
!python train/train_hypothesis_only.py


In [None]:
# Check hypothesis-only results
with open('./outputs/evaluations/hypothesis_only_model/eval_metrics.json', 'r') as f:
    hyp_metrics = json.load(f)

hyp_accuracy = hyp_metrics['eval_accuracy']
random_baseline = 1.0 / 3.0
above_random = hyp_accuracy - random_baseline

print("=" * 80)
print("Hypothesis-Only Model Results (Artifact Detection)")
print("=" * 80)
print(f"Accuracy: {hyp_accuracy:.4f} ({hyp_accuracy*100:.2f}%)")
print(f"Random Baseline: {random_baseline:.4f} ({random_baseline*100:.2f}%)")
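print(f"Above Random: {above_random:.4f} ({above_random*100:.2f} percentage points)")

# Heuristic thresholds: more than 20 points over chance counts as strong,
# more than 10 points as weak
if above_random > 0.2:
    verdict = "✅ STRONG ARTIFACTS DETECTED!"
elif above_random > 0.1:
    verdict = "⚠️ Weak artifacts detected"
else:
    verdict = "❌ No significant artifacts"
print(f"\n{verdict}")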


## 🛠️ Part 3: Debiasing - Ensemble Method

Train a debiased model using confidence-based reweighting.  
Training examples that the hypothesis-only model already classifies confidently are likely solvable from artifacts alone, so they receive a smaller weight in the loss.
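
One common way to implement this kind of reweighting is sketched below. It assumes the weight for each example is one minus the probability that the hypothesis-only model assigns to the gold label, and that the loss is a weight-normalized cross-entropy. Both choices, and the function names, are illustrative assumptions; `train/train_debiased.py` is not shown here and may use a different scheme.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def example_weight(bias_probs, label):
    """Assumed weighting: 1 - p_bias(gold label).

    The more confident the hypothesis-only model is in the gold label,
    the smaller the weight of that example.
    """
    return 1.0 - bias_probs[label]


def weighted_cross_entropy(logits_batch, labels, weights):
    """Cross-entropy of the main model, averaged with per-example weights."""
    total = 0.0
    for logits, label, w in zip(logits_batch, labels, weights):
        total += w * -math.log(softmax(logits)[label])
    # Normalize by the weight mass (guarded against an all-zero batch).
    return total / max(sum(weights), 1e-12)


# Toy batch: hypothesis-only probabilities for two examples.
bias_probs = [[0.90, 0.05, 0.05],   # bias model is confident in gold label 0
              [0.40, 0.30, 0.30]]   # bias model is unsure about gold label 1
labels = [0, 1]
weights = [example_weight(p, y) for p, y in zip(bias_probs, labels)]

# Main-model logits (uniform here, so each example has loss log 3).
main_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = weighted_cross_entropy(main_logits, labels, weights)
print(weights, loss)
```

In this toy batch the first example contributes far less to the loss than the second, because the hypothesis-only model already predicts its label with high confidence.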


In [None]:
!python train/train_debiased.py


In [None]:
# Check debiased results
with open('./outputs/evaluations/debiased_model/eval_metrics.json', 'r') as f:
    debiased_metrics = json.load(f)

print("=" * 80)
print("Debiased Model Results")
print("=" * 80)
print(f"Accuracy: {debiased_metrics['eval_accuracy']:.4f} ({debiased_metrics['eval_accuracy']*100:.2f}%)")
print(f"Eval Loss: {debiased_metrics.get('eval_loss', 'N/A')}")


## 📈 Part 4: Results Summary and Comparison


In [None]:
# Load all metrics
with open('./outputs/evaluations/baseline_100k/eval_metrics.json', 'r') as f:
    baseline_metrics = json.load(f)

with open('./outputs/evaluations/hypothesis_only_model/eval_metrics.json', 'r') as f:
    hyp_metrics = json.load(f)

with open('./outputs/evaluations/debiased_model/eval_metrics.json', 'r') as f:
    debiased_metrics = json.load(f)

# Calculate statistics
random_baseline = 1.0 / 3.0
baseline_acc = baseline_metrics['eval_accuracy']
hyp_acc = hyp_metrics['eval_accuracy']
debiased_acc = debiased_metrics['eval_accuracy']

print("=" * 80)
print("Results Summary")
print("=" * 80)
print(f"\nRandom Baseline:        {random_baseline:.4f} ({random_baseline*100:.2f}%)")
print(f"Hypothesis-Only:        {hyp_acc:.4f} ({hyp_acc*100:.2f}%) [Above random: +{(hyp_acc-random_baseline)*100:.2f} points]")
print(f"Baseline (Full Model):  {baseline_acc:.4f} ({baseline_acc*100:.2f}%)")
print(f"Debiased:               {debiased_acc:.4f} ({debiased_acc*100:.2f}%) [Change: {(debiased_acc-baseline_acc)*100:+.2f} points]")

print("\n" + "=" * 80)
print("Key Findings:")
print("=" * 80)
print(f"1. Hypothesis-Only model reaches {hyp_acc*100:.2f}% without seeing the premise ({(hyp_acc-random_baseline)*100:+.2f} points vs. chance)")
print(f"2. Debiased vs. baseline accuracy: {debiased_acc*100:.2f}% vs {baseline_acc*100:.2f}%")
print(f"3. {'✅ Debiasing preserved performance (within 1 point)' if abs(debiased_acc - baseline_acc) < 0.01 else '⚠️ Debiasing affected performance'}")


## 📊 Part 5: Error Analysis


In [None]:
!python analyze/error_analysis.py


## 🔄 Part 6: Model Comparison


In [None]:
!python analyze/compare_models.py


## 📊 Part 7: Visualizations


In [None]:
# Load predictions
baseline_predictions = []
with open('./outputs/evaluations/baseline_100k/eval_predictions.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        baseline_predictions.append(json.loads(line))

debiased_predictions = []
with open('./outputs/evaluations/debiased_model/eval_predictions.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        debiased_predictions.append(json.loads(line))

label_names = {0: "Entailment", 1: "Neutral", 2: "Contradiction"}

# Calculate accuracies
baseline_correct = sum(1 for p in baseline_predictions if p['label'] == p['predicted_label'])
debiased_correct = sum(1 for p in debiased_predictions if p['label'] == p['predicted_label'])

baseline_acc = baseline_correct / len(baseline_predictions)
debiased_acc = debiased_correct / len(debiased_predictions)


In [None]:
# Create comparison visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Figure 1: Overall accuracy comparison
models = ['Random', 'Hypothesis-\nOnly', 'Baseline', 'Debiased']
accuracies = [random_baseline, hyp_acc, baseline_acc, debiased_acc]
colors = ['gray', 'orange', 'blue', 'green']

axes[0].bar(models, accuracies, color=colors, alpha=0.7)
axes[0].axhline(y=random_baseline, color='gray', linestyle='--', alpha=0.5, label='Random Baseline')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Overall Model Performance')
axes[0].set_ylim([0, 1])
axes[0].grid(axis='y', alpha=0.3)
for i, (model, acc) in enumerate(zip(models, accuracies)):
    axes[0].text(i, acc + 0.02, f'{acc:.2%}', ha='center', va='bottom')

# Figure 2: Per-class accuracy comparison
classes = ['Entailment', 'Neutral', 'Contradiction']
baseline_class_accs = []
debiased_class_accs = []

for label in [0, 1, 2]:
    baseline_class = [p for p in baseline_predictions if p['label'] == label]
    debiased_class = [p for p in debiased_predictions if p['label'] == label]
    
    baseline_class_acc = sum(1 for p in baseline_class if p['predicted_label'] == label) / len(baseline_class)
    debiased_class_acc = sum(1 for p in debiased_class if p['predicted_label'] == label) / len(debiased_class)
    
    baseline_class_accs.append(baseline_class_acc)
    debiased_class_accs.append(debiased_class_acc)

x = np.arange(len(classes))
width = 0.35
axes[1].bar(x - width/2, baseline_class_accs, width, label='Baseline', alpha=0.7, color='blue')
axes[1].bar(x + width/2, debiased_class_accs, width, label='Debiased', alpha=0.7, color='green')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Per-Class Accuracy Comparison')
axes[1].set_xticks(x)
axes[1].set_xticklabels(classes)
axes[1].legend()
axes[1].set_ylim([0, 1])
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
os.makedirs('./outputs/evaluations', exist_ok=True)
plt.savefig('./outputs/evaluations/results_comparison.png', dpi=300, bbox_inches='tight')
print("✅ Chart saved to: ./outputs/evaluations/results_comparison.png")
plt.show()


In [None]:
# Create confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Baseline confusion matrix
baseline_confusion = np.zeros((3, 3))
for p in baseline_predictions:
    baseline_confusion[p['label']][p['predicted_label']] += 1

# Normalize
baseline_confusion_norm = baseline_confusion / baseline_confusion.sum(axis=1, keepdims=True)

sns.heatmap(baseline_confusion_norm, annot=True, fmt='.2%', cmap='Blues', 
            xticklabels=['Entail', 'Neutral', 'Contrad'],
            yticklabels=['Entail', 'Neutral', 'Contrad'],
            ax=axes[0], cbar_kws={'label': 'Proportion'})
axes[0].set_title('Baseline Confusion Matrix')
axes[0].set_xlabel('Predicted Label')
axes[0].set_ylabel('True Label')

# Debiased confusion matrix
debiased_confusion = np.zeros((3, 3))
for p in debiased_predictions:
    debiased_confusion[p['label']][p['predicted_label']] += 1

# Normalize
debiased_confusion_norm = debiased_confusion / debiased_confusion.sum(axis=1, keepdims=True)

sns.heatmap(debiased_confusion_norm, annot=True, fmt='.2%', cmap='Greens',
            xticklabels=['Entail', 'Neutral', 'Contrad'],
            yticklabels=['Entail', 'Neutral', 'Contrad'],
            ax=axes[1], cbar_kws={'label': 'Proportion'})
axes[1].set_title('Debiased Confusion Matrix')
axes[1].set_xlabel('Predicted Label')
axes[1].set_ylabel('True Label')

plt.tight_layout()
plt.savefig('./outputs/evaluations/confusion_matrices.png', dpi=300, bbox_inches='tight')
print("✅ Confusion matrices saved to: ./outputs/evaluations/confusion_matrices.png")
plt.show()


## 📝 Part 8: Example Fixes

Show examples where debiasing fixed baseline errors.
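
Fixes are only half of the picture, since a debiased model can also break predictions the baseline got right. The sketch below counts both directions. It assumes two aligned lists of prediction records with the `label` and `predicted_label` keys used elsewhere in this notebook; `count_prediction_flips` is a hypothetical helper and the toy records are made up.

```python
def count_prediction_flips(baseline_preds, debiased_preds):
    """Count how predictions changed between two aligned prediction lists."""
    if len(baseline_preds) != len(debiased_preds):
        raise ValueError("prediction lists must be aligned and equally long")

    counts = {"fixed": 0, "broken": 0, "changed_still_wrong": 0}
    for base, deb in zip(baseline_preds, debiased_preds):
        if base["predicted_label"] == deb["predicted_label"]:
            continue  # both models agree, nothing changed
        gold = base["label"]
        if deb["predicted_label"] == gold:
            counts["fixed"] += 1                 # baseline wrong, debiased right
        elif base["predicted_label"] == gold:
            counts["broken"] += 1                # baseline right, debiased wrong
        else:
            counts["changed_still_wrong"] += 1   # both wrong, different labels
    return counts


# Toy records for illustration only.
toy_baseline = [
    {"label": 0, "predicted_label": 1},
    {"label": 1, "predicted_label": 1},
    {"label": 2, "predicted_label": 0},
    {"label": 0, "predicted_label": 0},
]
toy_debiased = [
    {"label": 0, "predicted_label": 0},
    {"label": 1, "predicted_label": 2},
    {"label": 2, "predicted_label": 1},
    {"label": 0, "predicted_label": 0},
]
flip_counts = count_prediction_flips(toy_baseline, toy_debiased)
print(flip_counts)
```

Comparing the `fixed` and `broken` counts gives the net effect of debiasing on individual examples, which a single accuracy number hides.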


In [None]:
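# Find examples where the two models disagree
# (assumes both prediction files list the same examples in the same order)
assert len(baseline_predictions) == len(debiased_predictions), "prediction files differ in length"
changes = []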
for i, (base, deb) in enumerate(zip(baseline_predictions, debiased_predictions)):
    if base['predicted_label'] != deb['predicted_label']:
        changes.append({
            'index': i,
            'premise': base['premise'],
            'hypothesis': base['hypothesis'],
            'true_label': base['label'],
            'baseline_pred': base['predicted_label'],
            'debiased_pred': deb['predicted_label'],
        })

baseline_wrong_debiased_right = [c for c in changes if c['baseline_pred'] != c['true_label'] and c['debiased_pred'] == c['true_label']]

print("=" * 80)
print("Examples Where Debiasing Fixed Baseline Errors")
print("=" * 80)

for i, fix in enumerate(baseline_wrong_debiased_right[:5], 1):
    print(f"\nFix Example {i}:")
    print(f"  Premise: {fix['premise']}")
    print(f"  Hypothesis: {fix['hypothesis']}")
    print(f"  True Label: {label_names[fix['true_label']]}")
    print(f"  Baseline Predicted: {label_names[fix['baseline_pred']]} ❌")
    print(f"  Debiased Predicted: {label_names[fix['debiased_pred']]} ✅")
    print("-" * 80)


## ✅ Summary

### Key Results:
- **Hypothesis-Only**: 60.80% (27.47 percentage points above the 33.33% chance level, strong evidence of hypothesis-side artifacts)
- **Baseline**: 86.54% (standard premise-plus-hypothesis model)
- **Debiased**: 86.42% (within 0.12 points of the baseline; the reweighting is intended to reduce artifact dependence)

### Conclusions:
1. ✅ **Strong artifacts detected** in the SNLI dataset
2. ✅ **Debiasing preserves accuracy**: overall accuracy stays within 0.12 points of the baseline
3. ✅ **The framework provides** quantitative artifact detection and mitigation

### Next Steps:
- Use these results for paper writing
- Reference `ANALYSIS_RESULTS.md` and `PAPER_OUTLINE.md` for detailed analysis
- All results saved in `outputs/evaluations/` directory
