# Analyzing and Mitigating Dataset Artifacts in NLI

**Project:** Final Project - CS388  
**Dataset:** SNLI (Stanford Natural Language Inference)  
**Model:** ELECTRA-small  
**Goal:** Detect and mitigate dataset artifacts using hypothesis-only baselines and ensemble debiasing

## Project Structure
- **Part 1: Analysis** - Detect artifacts and analyze model errors
- **Part 2: Fix** - Implement and evaluate debiasing method


## Setup and Installation


In [1]:
# Connecting using personal token

import os
from google.colab import userdata

os.environ['gituser'] = userdata.get('gituser')
os.environ['gitpw'] = userdata.get('gitpw')
os.environ['REPO'] = 'fp-dataset-artifacts'

!git clone https://$gituser:$gitpw@github.com/$gituser/$REPO.git

Cloning into 'fp-dataset-artifacts'...
remote: Enumerating objects: 149, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 149 (delta 0), reused 0 (delta 0), pack-reused 148 (from 2)
Receiving objects: 100% (149/149), 8.17 MiB | 17.61 MiB/s, done.
Resolving deltas: 100% (54/54), done.


In [2]:
%cd fp-dataset-artifacts/

/content/fp-dataset-artifacts


In [3]:
# Install required packages
%pip install -q -r requirements.txt

## Part 1: Analysis

### Part 1.1: Baseline Model Training

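Train a standard NLI model on the SNLI dataset using both premise and hypothesis. This run fine-tunes `google/electra-small-discriminator` on 100,000 training examples for 3 epochs.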


In [4]:
!python train/run.py --do_train --do_eval --task nli --dataset snli --model google/electra-small-discriminator --output_dir ./outputs/evaluations/baseline_100k/ --max_train_samples 100000 --num_train_epochs 3 --per_device_train_batch_size 32 --per_device_eval_batch_size 32 --max_length 128 --learning_rate 2e-5

2025-12-05 15:21:03.668225: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-12-05 15:21:03.685375: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764948063.707115    1780 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764948063.713630    1780 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1764948063.730286    1780 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [5]:
# Check baseline results
import json
with open(os.path.join('outputs', 'evaluations', 'baseline_100k', 'eval_metrics.json'), 'r') as f:
    baseline_metrics = json.load(f)

print("=" * 80)
print("Baseline Model Results")
print("=" * 80)
print(f"Accuracy: {baseline_metrics['eval_accuracy']:.4f} ({baseline_metrics['eval_accuracy']*100:.2f}%)")
print(f"Eval Loss: {baseline_metrics.get('eval_loss', 'N/A')}")


Baseline Model Results
Accuracy: 0.8492 (84.92%)
Eval Loss: 0.4070534110069275


### Part 1.2: Artifact Detection - Hypothesis-Only Model

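Train a model that sees only the hypothesis (never the premise) to detect dataset artifacts.  
With three roughly balanced labels, chance accuracy is about 33.33%. A hypothesis-only model that scores well above that level must be exploiting cues in the hypothesis alone, which indicates annotation artifacts.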
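
Chance level depends on the label distribution, so it is worth checking that 33.33% is the right reference point. The sketch below is a minimal illustration and not part of the project scripts: it uses the evaluation-set label counts that `analyze/error_analysis.py` reports in Part 1.3, and the helper name `majority_baseline` is hypothetical.

```python
# Evaluation-set label counts as reported by the error analysis in Part 1.3.
label_counts = {"entailment": 3329, "neutral": 3235, "contradiction": 3278}

def majority_baseline(counts):
    """Accuracy obtained by always predicting the most frequent label."""
    return max(counts.values()) / sum(counts.values())

# Uniform guessing over the three labels.
uniform_chance = 1.0 / len(label_counts)
majority = majority_baseline(label_counts)

print(f"Uniform chance:    {uniform_chance:.4f}")
print(f"Majority baseline: {majority:.4f}")
```

The two reference points differ by about half a percentage point, so a uniform 33.33% is a reasonable chance level for this split.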


In [6]:
!python train/train_hypothesis_only.py


2025-12-05 15:31:02.105538: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-12-05 15:31:02.123057: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764948662.144336    4484 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764948662.150890    4484 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1764948662.167142    4484 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [7]:
# Check hypothesis-only results
with open(os.path.join('outputs', 'evaluations', 'hypothesis_only_model', 'eval_metrics.json'), 'r') as f:
    hyp_metrics = json.load(f)

hyp_accuracy = hyp_metrics['eval_accuracy']
random_baseline = 1.0 / 3.0
above_random = hyp_accuracy - random_baseline

print("=" * 80)
print("Hypothesis-Only Model Results (Artifact Detection)")
print("=" * 80)
print(f"Accuracy: {hyp_accuracy:.4f} ({hyp_accuracy*100:.2f}%)")
print(f"Random Baseline: {random_baseline:.4f} ({random_baseline*100:.2f}%)")
print(f"Above Random: {above_random:.4f} ({above_random*100:.2f}%)")
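# Heuristic thresholds on the margin over chance: > 0.2 is "strong", > 0.1 is "weak"
if above_random > 0.2:
    verdict = 'STRONG ARTIFACTS DETECTED!'
elif above_random > 0.1:
    verdict = 'Weak artifacts detected'
else:
    verdict = 'No significant artifacts'
print(f"\n{verdict}")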


Hypothesis-Only Model Results (Artifact Detection)
Accuracy: 0.6080 (60.80%)
Random Baseline: 0.3333 (33.33%)
Above Random: 0.2747 (27.47%)

STRONG ARTIFACTS DETECTED!


### Part 1.3: Baseline Error Analysis

Analyze the baseline model's errors and confusion patterns, and identify artifact-related mistakes.


In [8]:
!python analyze/error_analysis.py


Loading predictions...
Total examples: 9842

Overall Accuracy: 84.92% (8358/9842)

Correct predictions: 8358
Incorrect predictions: 1484 (15.1%)

=== LABEL DISTRIBUTION ===
Entailment: 3329 (33.8%)
Neutral: 3235 (32.9%)
Contradiction: 3278 (33.3%)

=== CONFUSION MATRIX ===
Rows = True Label, Columns = Predicted Label
                         Entail    Neutral    Contrad
Entailment                2961       244       124
Neutral                    290      2563       382
Contradiction              130       314      2834

=== PER-CLASS ACCURACY ===
Entailment     : 88.95% (2961/3329)
Neutral        : 79.23% (2563/3235)
Contradiction  : 86.46% (2834/3278)

=== HYPOTHESIS-ONLY ARTIFACT ANALYSIS ===
Testing if model learns patterns from hypothesis words alone...

Hypotheses with negation words: 441
Hypotheses without negation: 9401

True label distribution for hypotheses WITH negation:
  Entailment: 110 (24.9%)
  Neutral: 119 (27.0%)
  Contradiction: 212 (48.1%)

Predicted label distributi
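
The script reports per-class accuracy, which is the recall of each class. Precision can be read off the same confusion matrix by dividing each diagonal entry by its column total. The sketch below copies the matrix from the output above; the function name `per_class_metrics` is hypothetical and not taken from `analyze/error_analysis.py`.

```python
# Confusion matrix copied from the error-analysis output
# (rows = true label, columns = predicted label).
labels = ["Entailment", "Neutral", "Contradiction"]
confusion = [
    [2961, 244, 124],
    [290, 2563, 382],
    [130, 314, 2834],
]

def per_class_metrics(matrix):
    """Return (precision, recall) lists for a square confusion matrix."""
    n = len(matrix)
    # Recall: correct predictions divided by the row total (true count).
    recall = [matrix[i][i] / sum(matrix[i]) for i in range(n)]
    # Precision: correct predictions divided by the column total (predicted count).
    col_totals = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    precision = [matrix[i][i] / col_totals[i] for i in range(n)]
    return precision, recall

precision, recall = per_class_metrics(confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / sum(map(sum, confusion))

for name, p, r in zip(labels, precision, recall):
    print(f"{name:15}: precision={p:.2%}, recall={r:.2%}")
print(f"Overall accuracy: {accuracy:.2%}")
```

The recall values reproduce the per-class accuracies printed by the script. Neutral has both the lowest recall and the lowest precision of the three classes, and its largest single confusion is with Contradiction (382 examples).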

### Part 1.4: Visualizations - Baseline Model

Plot the baseline model's confusion matrix and per-class accuracy.


In [9]:
!python analyze/visualize_baseline.py

Loading baseline predictions...
Creating confusion matrix...
Confusion matrix saved to: /content/fp-dataset-artifacts/outputs/evaluations/baseline_confusion_matrix.png
Creating per-class accuracy chart...
Per-class accuracy chart saved to: /content/fp-dataset-artifacts/outputs/evaluations/baseline_per_class_accuracy.png
Baseline visualizations completed!


## Part 2: Fix - Debiasing Implementation

### Part 2.1: Train Debiased Model

Train a debiased model using confidence-based reweighting.  
Examples on which the hypothesis-only model is confident (and which are therefore likely solvable from artifacts alone) are downweighted during training.
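
To make the idea concrete, here is a minimal sketch of one common reweighting scheme. The contents of `train/train_debiased.py` are not shown in this notebook, so everything below is an assumption: the weight formula `1 - p_bias(gold)`, the floor `min_weight`, and the function names are illustrative, and the numbers are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def example_weight(bias_probs, gold, min_weight=0.1):
    """Assumed scheme: weight = 1 - p_bias(gold), floored at min_weight."""
    return max(min_weight, 1.0 - bias_probs[gold])

def reweighted_loss(main_logits, bias_probs, labels):
    """Weighted mean cross-entropy of the main model over a batch."""
    weighted, total_weight = 0.0, 0.0
    for logits, probs, gold in zip(main_logits, bias_probs, labels):
        w = example_weight(probs, gold)
        ce = -math.log(softmax(logits)[gold])
        weighted += w * ce
        total_weight += w
    return weighted / total_weight

# Toy batch (made-up numbers): 0=Entailment, 1=Neutral, 2=Contradiction.
main_logits = [[0.2, 0.1, 1.5], [1.0, 0.8, 0.1]]
bias_probs = [[0.03, 0.02, 0.95], [0.40, 0.35, 0.25]]  # hypothesis-only probabilities
labels = [2, 0]

weights = [example_weight(p, y) for p, y in zip(bias_probs, labels)]
print("weights:", weights)
print("loss:", reweighted_loss(main_logits, bias_probs, labels))
```

In this toy batch the first example, which the hypothesis-only model already gets right with high confidence, contributes far less to the loss than the second. Other definitions of confidence (for example the maximum class probability) would lead to different weights.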


In [10]:
!python train/train_debiased.py


2025-12-05 15:48:28.312550: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-12-05 15:48:28.329758: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1764949708.350691    9073 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764949708.357033    9073 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1764949708.372975    9073 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking 

In [11]:
# Check debiased results
import json
with open(os.path.join('outputs', 'evaluations', 'debiased_model', 'eval_metrics.json'), 'r') as f:
    debiased_metrics = json.load(f)

print("=" * 80)
print("Debiased Model Results")
print("=" * 80)
print(f"Accuracy: {debiased_metrics['eval_accuracy']:.4f} ({debiased_metrics['eval_accuracy']*100:.2f}%)")
print(f"Eval Loss: {debiased_metrics.get('eval_loss', 'N/A')}")


Debiased Model Results
Accuracy: 0.8642 (86.42%)
Eval Loss: 0.24399055540561676


### Part 2.2: Results Comparison

Compare the baseline and debiased models on overall accuracy, per-class accuracy, and per-example prediction changes.


In [12]:
!python analyze/compare_results.py


Results Comparison - Baseline vs Debiased

Random Baseline:        0.3333 (33.33%)
Hypothesis-Only:        0.6080 (60.80%) [Above random: +27.47%]
Baseline (Full Model):  0.8492 (84.92%)
Debiased:               0.8642 (86.42%) [Change: +1.49%]

Key Findings:
1. Hypothesis-Only model achieves 60.80%, proving strong artifacts exist!
2. Debiasing maintains performance: 86.42% vs 84.92%
3. Debiasing affected performance

Per-Class Accuracy Comparison
Entailment     : Baseline=88.95%, Debiased=89.31%, Change=+0.36%
Neutral        : Baseline=79.23%, Debiased=82.38%, Change=+3.15%
Contradiction  : Baseline=86.46%, Debiased=87.46%, Change=+1.01%

Prediction Changes
Total predictions changed: 760 (7.7%)
Baseline wrong -> Debiased correct (FIXES): 425
Baseline correct -> Debiased wrong (BREAKS): 278
Net improvement: +147

Top 10 fixes saved to: /content/fp-dataset-artifacts/outputs/evaluations/fixes_examples.json
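
The fix and break counts above come from comparing the two models example by example. The sketch below shows one way to compute them, assuming both prediction lists are aligned by index and use the `label` and `predicted_label` fields that appear in `eval_predictions.jsonl`. The function name `compare_predictions` and the toy records are hypothetical, and `analyze/compare_results.py` may be implemented differently.

```python
def compare_predictions(baseline, debiased):
    """Count fixes, breaks, and changed predictions between two aligned runs."""
    fixes = breaks = changed = 0
    for b, d in zip(baseline, debiased):
        if b["predicted_label"] != d["predicted_label"]:
            changed += 1
        b_ok = b["predicted_label"] == b["label"]
        d_ok = d["predicted_label"] == d["label"]
        if not b_ok and d_ok:
            fixes += 1      # baseline wrong -> debiased correct
        elif b_ok and not d_ok:
            breaks += 1     # baseline correct -> debiased wrong
    return {"fixes": fixes, "breaks": breaks, "changed": changed, "net": fixes - breaks}

# Tiny hand-made example (not project data).
toy_baseline = [
    {"label": 1, "predicted_label": 2},  # wrong
    {"label": 0, "predicted_label": 0},  # correct
    {"label": 2, "predicted_label": 1},  # wrong
    {"label": 1, "predicted_label": 1},  # correct
]
toy_debiased = [
    {"label": 1, "predicted_label": 1},  # fixed
    {"label": 0, "predicted_label": 1},  # broken
    {"label": 2, "predicted_label": 0},  # still wrong, but changed
    {"label": 1, "predicted_label": 1},  # unchanged
]

summary = compare_predictions(toy_baseline, toy_debiased)
print(summary)
```

A changed prediction that moves from one wrong label to another is neither a fix nor a break, which is why 425 + 278 is smaller than the 760 changed predictions reported above. The net gain of 147 out of 9,842 examples is about 1.49 percentage points, consistent with the reported change in accuracy.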


### Part 2.3: Visualizations - Comparison

Create visualizations comparing baseline and debiased models.


In [13]:
!python analyze/visualize_comparison.py


Loading metrics...
Loading predictions...
Creating comparison charts...
Comparison chart saved to: /content/fp-dataset-artifacts/outputs/evaluations/baseline_vs_debiased_comparison.png
Comparison visualizations completed!


In [14]:
!python analyze/show_fixes.py


Examples Where Debiasing Fixed Baseline Errors

Fix Example 1:
  Premise: Two young children in blue jerseys, one with the number 9 and one with the number 2 are standing on wooden steps in a bathroom and washing their hands in a sink.
  Hypothesis: Two kids at a ballgame wash their hands.
  True Label: Neutral
  Baseline Predicted: Contradiction [WRONG]
  Debiased Predicted: Neutral [CORRECT]
--------------------------------------------------------------------------------

Fix Example 2:
  Premise: A small ice cream stand with two people standing near it.
  Hypothesis: Two people in line to buy icecream.
  True Label: Neutral
  Baseline Predicted: Contradiction [WRONG]
  Debiased Predicted: Neutral [CORRECT]
--------------------------------------------------------------------------------

Fix Example 3:
  Premise: Number 916 is hoping that he is going to win the race.
  Hypothesis: A person is betting that he will win  the race.
  True Label: Neutral
  Baseline Predicted: Entailment [

### Part 2.4: Negation Word Analysis and Visualization

Analyze the correlation between negation words in the hypothesis and model predictions.  
This helps identify whether the models learn spurious correlations (e.g., negation → contradiction).


In [15]:
# Comprehensive Negation Word Analysis with Visualizations
import json
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import os

# Load predictions
print("Loading predictions...")
baseline_predictions = []
with open(os.path.join('outputs', 'evaluations', 'baseline_100k', 'eval_predictions.jsonl'), 'r', encoding='utf-8') as f:
    for line in f:
        baseline_predictions.append(json.loads(line))

debiased_predictions = []
with open(os.path.join('outputs', 'evaluations', 'debiased_model', 'eval_predictions.jsonl'), 'r', encoding='utf-8') as f:
    for line in f:
        debiased_predictions.append(json.loads(line))

label_names = {0: "Entailment", 1: "Neutral", 2: "Contradiction"}

# Define negation words
negation_words = ['no', 'not', 'never', 'nobody', 'nothing', 'nowhere', 'neither', 'none', "n't", 'nor']

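def has_negation(text):
    """Check if text contains negation words.

    Note: this is a substring test, so short entries such as 'no' also match
    inside longer words (e.g. 'snow', 'another').
    """
    text_lower = text.lower()
    return any(neg in text_lower for neg in negation_words)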

# Analyze negation for baseline
baseline_with_neg = [p for p in baseline_predictions if has_negation(p['hypothesis'])]
baseline_without_neg = [p for p in baseline_predictions if not has_negation(p['hypothesis'])]

# Analyze negation for debiased
debiased_with_neg = [p for p in debiased_predictions if has_negation(p['hypothesis'])]
debiased_without_neg = [p for p in debiased_predictions if not has_negation(p['hypothesis'])]

print("=" * 80)
print("NEGATION WORD ANALYSIS")
print("=" * 80)
print(f"\nTotal examples: {len(baseline_predictions)}")
print(f"Examples with negation: {len(baseline_with_neg)} ({len(baseline_with_neg)/len(baseline_predictions):.1%})")
print(f"Examples without negation: {len(baseline_without_neg)} ({len(baseline_without_neg)/len(baseline_predictions):.1%})")

# Calculate accuracy on negation examples
baseline_neg_acc = sum(1 for p in baseline_with_neg if p['label'] == p['predicted_label']) / len(baseline_with_neg)
debiased_neg_acc = sum(1 for p in debiased_with_neg if p['label'] == p['predicted_label']) / len(debiased_with_neg)

baseline_no_neg_acc = sum(1 for p in baseline_without_neg if p['label'] == p['predicted_label']) / len(baseline_without_neg)
debiased_no_neg_acc = sum(1 for p in debiased_without_neg if p['label'] == p['predicted_label']) / len(debiased_without_neg)

print(f"\n{'='*80}")
print("ACCURACY ON NEGATION EXAMPLES")
print("=" * 80)
print(f"Baseline - With negation: {baseline_neg_acc:.2%}")
print(f"Baseline - Without negation: {baseline_no_neg_acc:.2%}")
print(f"Debiased - With negation: {debiased_neg_acc:.2%}")
print(f"Debiased - Without negation: {debiased_no_neg_acc:.2%}")
print(f"\nChange on negation examples: {(debiased_neg_acc - baseline_neg_acc)*100:+.2f}%")

# Label distribution for negation examples
print(f"\n{'='*80}")
print("TRUE LABEL DISTRIBUTION (Hypotheses WITH Negation)")
print("=" * 80)
neg_true_labels = Counter(p['label'] for p in baseline_with_neg)
for label in [0, 1, 2]:
    count = neg_true_labels[label]
    pct = count / len(baseline_with_neg)
    print(f"{label_names[label]:15}: {count:4} ({pct:.1%})")

print(f"\n{'='*80}")
print("PREDICTED LABEL DISTRIBUTION (Hypotheses WITH Negation)")
print("=" * 80)
print("Baseline:")
baseline_neg_preds = Counter(p['predicted_label'] for p in baseline_with_neg)
for label in [0, 1, 2]:
    count = baseline_neg_preds[label]
    pct = count / len(baseline_with_neg)
    print(f"  {label_names[label]:15}: {count:4} ({pct:.1%})")

print("\nDebiased:")
debiased_neg_preds = Counter(p['predicted_label'] for p in debiased_with_neg)
for label in [0, 1, 2]:
    count = debiased_neg_preds[label]
    pct = count / len(debiased_with_neg)
    print(f"  {label_names[label]:15}: {count:4} ({pct:.1%})")

# Check if model over-predicts contradiction for negation
true_contrad_pct = neg_true_labels[2] / len(baseline_with_neg)
baseline_pred_contrad_pct = baseline_neg_preds[2] / len(baseline_with_neg)
debiased_pred_contrad_pct = debiased_neg_preds[2] / len(debiased_with_neg)

print(f"\n{'='*80}")
print("NEGATION → CONTRADICTION CORRELATION")
print("=" * 80)
print(f"True Contradiction rate (with negation): {true_contrad_pct:.1%}")
print(f"Baseline predicted Contradiction rate: {baseline_pred_contrad_pct:.1%}")
print(f"Debiased predicted Contradiction rate: {debiased_pred_contrad_pct:.1%}")
print(f"\nBaseline over-predicts Contradiction by: {(baseline_pred_contrad_pct - true_contrad_pct)*100:+.1f}%")
print(f"Debiased over-predicts Contradiction by: {(debiased_pred_contrad_pct - true_contrad_pct)*100:+.1f}%")


Loading predictions...
NEGATION WORD ANALYSIS

Total examples: 9842
Examples with negation: 441 (4.5%)
Examples without negation: 9401 (95.5%)

ACCURACY ON NEGATION EXAMPLES
Baseline - With negation: 80.73%
Baseline - Without negation: 85.12%
Debiased - With negation: 85.49%
Debiased - Without negation: 86.46%

Change on negation examples: +4.76%

TRUE LABEL DISTRIBUTION (Hypotheses WITH Negation)
Entailment     :  110 (24.9%)
Neutral        :  119 (27.0%)
Contradiction  :  212 (48.1%)

PREDICTED LABEL DISTRIBUTION (Hypotheses WITH Negation)
Baseline:
  Entailment     :  108 (24.5%)
  Neutral        :  104 (23.6%)
  Contradiction  :  229 (51.9%)

Debiased:
  Entailment     :  121 (27.4%)
  Neutral        :  102 (23.1%)
  Contradiction  :  218 (49.4%)

NEGATION → CONTRADICTION CORRELATION
True Contradiction rate (with negation): 48.1%
Baseline predicted Contradiction rate: 51.9%
Debiased predicted Contradiction rate: 49.4%

Baseline over-predicts Contradiction by: +3.9%
Debiased over-pr
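
The negation counts in this section depend on how negation words are matched. `has_negation` uses a substring test, so `'no'` also matches inside words like "snow" or "piano". The sketch below shows a word-boundary alternative for comparison. It is an illustration only: the name `has_negation_token` is hypothetical, the regular expression is one possible choice, and the numbers in this notebook were not recomputed with it.

```python
import re

# Same word list as the notebook, with "n't" handled separately as a suffix.
NEGATION_WORDS = ["no", "not", "never", "nobody", "nothing",
                  "nowhere", "neither", "none", "nor"]

# \b anchors stop "no" from matching inside "snow" or "piano".
NEGATION_PATTERN = re.compile(
    r"\b(?:" + "|".join(NEGATION_WORDS) + r")\b|n't\b",
    re.IGNORECASE,
)

def has_negation_token(text):
    """True if the text contains a negation word as a whole token."""
    return NEGATION_PATTERN.search(text) is not None

def has_negation_substring(text):
    """Substring test equivalent to the notebook's has_negation."""
    text_lower = text.lower()
    return any(neg in text_lower for neg in NEGATION_WORDS + ["n't"])

examples = [
    "The dog is not running.",
    "She isn't home.",
    "A man is playing in the snow.",
    "A piano sits in the north hall.",
]
for sentence in examples:
    print(f"{sentence!r}: substring={has_negation_substring(sentence)}, "
          f"token={has_negation_token(sentence)}")
```

The first two sentences are flagged by both checks, while the last two are flagged only by the substring test. Swapping in the token-based check would therefore shrink the "with negation" subset, so the figures above should be read with that caveat.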

In [None]:
# Create visualizations for negation analysis using the improved script
# This uses visualize_negation.py with professional colors and optimized two-column layout
!python analyze/visualize_negation.py



Creating negation analysis visualizations...
✓ Negation analysis chart saved to: outputs/evaluations/negation_analysis.png
✓ Negation-contradiction correlation chart saved to: outputs/evaluations/negation_contradiction_correlation.png

VISUALIZATION COMPLETE!

Saved visualizations:
  1. outputs/evaluations/negation_analysis.png
  2. outputs/evaluations/negation_contradiction_correlation.png


## Update

In [17]:
!git config --global user.name "DinaberryPi"
!git config --global user.email "dinahenrykyy@gmail.com"
!git status

On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   outputs/evaluations/baseline_100k/eval_metrics.json
	modified:   outputs/evaluations/baseline_100k/eval_predictions.jsonl

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	outputs/evaluations/baseline_100k/checkpoint-1000/
	outputs/evaluations/baseline_100k/checkpoint-1500/
	outputs/evaluations/baseline_100k/checkpoint-2000/
	outputs/evaluations/baseline_100k/checkpoint-2500/
	outputs/evaluations/baseline_100k/checkpoint-3000/
	outputs/evaluations/baseline_100k/checkpoint-3500/
	outputs/evaluations/baseline_100k/checkpoint-4000/
	outputs/evaluations/baseline_100k/checkpoint-4500/
	outputs/evaluations/baseline_100k/checkpoint-500/
	

In [18]:
!git add .

In [19]:
!git status

On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/config.json
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/model.safetensors
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/optimizer.pt
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/rng_state.pth
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/scheduler.pt
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/special_tokens_map.json
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/tokenizer.json
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/tokenizer_config.json
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/trainer_state.json
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/trai

In [20]:
!git commit -m "update"

[main dd32320] update
 237 files changed, 1239367 insertions(+), 9843 deletions(-)
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/config.json
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/model.safetensors
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/optimizer.pt
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/rng_state.pth
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/scheduler.pt
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/special_tokens_map.json
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/tokenizer.json
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/tokenizer_config.json
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/trainer_state.json
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/training_args.bin
 create mode 100644 outputs/evaluations/baseline_

In [21]:
!git push origin main

Enumerating objects: 157, done.