# Analyzing and Mitigating Dataset Artifacts in NLI

**Project:** Final Project - CS388  
**Dataset:** SNLI (Stanford Natural Language Inference)  
**Model:** ELECTRA-small  
**Goal:** Detect and mitigate dataset artifacts using hypothesis-only baselines and ensemble debiasing

## Project Structure
- **Part 1: Analysis** - Detect artifacts and analyze the baseline model's errors
- **Part 2: Fix** - Implement and evaluate a debiasing method


## Setup and Installation


In [1]:
# Clone the project repository using a personal access token stored in Colab secrets

import os
from google.colab import userdata

os.environ['gituser'] = userdata.get('gituser')
os.environ['gitpw'] = userdata.get('gitpw')
os.environ['REPO'] = 'fp-dataset-artifacts'

!git clone https://$gituser:$gitpw@github.com/$gituser/$REPO.git

Cloning into 'fp-dataset-artifacts'...
remote: Enumerating objects: 139, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 139 (delta 15), reused 10 (delta 10), pack-reused 118 (from 3)
Receiving objects: 100% (139/139), 8.13 MiB | 15.68 MiB/s, done.
Resolving deltas: 100% (55/55), done.


In [2]:
%cd fp-dataset-artifacts/

/content/fp-dataset-artifacts


In [3]:
# Install required packages
%pip install -q -r requirements.txt

## Part 1: Analysis

### Part 1.1: Baseline Model Training

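Train a standard NLI model on the SNLI dataset using both the premise and the hypothesis. This run fine-tunes `google/electra-small-discriminator` on 100,000 training examples for 3 epochs.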


In [None]:
!python train/run.py --do_train --do_eval \
    --task nli --dataset snli \
    --model google/electra-small-discriminator \
    --output_dir ./outputs/evaluations/baseline_100k/ \
    --max_train_samples 100000 --num_train_epochs 3 \
    --per_device_train_batch_size 32 --per_device_eval_batch_size 32 \
    --max_length 128 --learning_rate 2e-5


In [4]:
# Check baseline results
import json
with open(os.path.join('outputs', 'evaluations', 'baseline_100k', 'eval_metrics.json'), 'r') as f:
    baseline_metrics = json.load(f)

print("=" * 80)
print("Baseline Model Results")
print("=" * 80)
print(f"Accuracy: {baseline_metrics['eval_accuracy']:.4f} ({baseline_metrics['eval_accuracy']*100:.2f}%)")
print(f"Eval Loss: {baseline_metrics.get('eval_loss', 'N/A')}")


Baseline Model Results
Accuracy: 0.8654 (86.54%)
Eval Loss: 0.3831678628921509


### Part 1.2: Artifact Detection - Hypothesis-Only Model

Train a model that sees only the hypothesis (never the premise) to detect dataset artifacts.  
With three roughly balanced labels, chance accuracy is about 33.33%. Accuracy well above that level means the hypotheses alone leak label information, which indicates annotation artifacts.
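As a minimal sketch of the idea, assume each example is a dict with `premise`, `hypothesis`, and `label` keys (the field names in the Hugging Face SNLI release). The premise can then be blanked out so a classifier sees only the hypothesis. `train/train_hypothesis_only.py` may prepare its inputs differently; the helper names and toy examples below are hypothetical.

```python
from collections import Counter

def to_hypothesis_only(example):
    """Return a copy of the example with the premise blanked out."""
    stripped = dict(example)
    stripped["premise"] = ""  # the classifier never sees the premise
    return stripped

def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy examples for illustration only (not taken from SNLI).
examples = [
    {"premise": "A man is sleeping.", "hypothesis": "Nobody is awake.", "label": 1},
    {"premise": "A dog runs.", "hypothesis": "An animal is moving.", "label": 0},
    {"premise": "A woman sings.", "hypothesis": "The woman is silent.", "label": 2},
]

hyp_only = [to_hypothesis_only(ex) for ex in examples]
chance = majority_class_accuracy([ex["label"] for ex in examples])

print(hyp_only[0])
print(f"Chance level: {chance:.4f}")
```

Comparing the hypothesis-only accuracy against the majority-class rate, rather than a fixed 1/3, guards against mild label imbalance in the evaluation split.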


In [None]:
!python train/train_hypothesis_only.py



In [8]:
# Check hypothesis-only results
with open(os.path.join('outputs', 'evaluations', 'hypothesis_only_model', 'eval_metrics.json'), 'r') as f:
    hyp_metrics = json.load(f)

hyp_accuracy = hyp_metrics['eval_accuracy']
random_baseline = 1.0 / 3.0
above_random = hyp_accuracy - random_baseline

print("=" * 80)
print("Hypothesis-Only Model Results (Artifact Detection)")
print("=" * 80)
print(f"Accuracy: {hyp_accuracy:.4f} ({hyp_accuracy*100:.2f}%)")
print(f"Random Baseline: {random_baseline:.4f} ({random_baseline*100:.2f}%)")
print(f"Above Random: {above_random:.4f} ({above_random*100:.2f}%)")
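# Heuristic cut-offs: more than 0.2 above chance is "strong", more than 0.1 is "weak"
if above_random > 0.2:
    verdict = 'STRONG ARTIFACTS DETECTED!'
elif above_random > 0.1:
    verdict = 'Weak artifacts detected'
else:
    verdict = 'No significant artifacts'
print(f"\n{verdict}")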


Hypothesis-Only Model Results (Artifact Detection)
Accuracy: 0.6080 (60.80%)
Random Baseline: 0.3333 (33.33%)
Above Random: 0.2747 (27.47%)

STRONG ARTIFACTS DETECTED!


### Part 1.3: Baseline Error Analysis

Analyze the baseline model's errors and confusion patterns, and identify artifact-related mistakes.
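The core of such an analysis is a confusion count over (gold, predicted) label pairs. The sketch below assumes integer labels in the order 0 = Entailment, 1 = Neutral, 2 = Contradiction, as in the Hugging Face SNLI release. The function names and toy predictions are illustrative and are not taken from `analyze/error_analysis.py`.

```python
from collections import Counter

LABELS = ["Entailment", "Neutral", "Contradiction"]  # assumed label ids 0, 1, 2

def confusion_counts(gold, predicted):
    """Count how often each (gold, predicted) label pair occurs."""
    return Counter(zip(gold, predicted))

def per_class_accuracy(gold, predicted):
    """Fraction of examples of each gold class that were predicted correctly."""
    confusion = confusion_counts(gold, predicted)
    totals = Counter(gold)
    return {label: confusion[(label, label)] / totals[label] for label in totals}

# Toy predictions for illustration only.
gold = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 1]

confusion = confusion_counts(gold, pred)
for g in range(len(LABELS)):
    row = [confusion[(g, p)] for p in range(len(LABELS))]
    print(f"{LABELS[g]:<14}", row)

class_accuracy = per_class_accuracy(gold, pred)
print(class_accuracy)
```

Off-diagonal cells with large counts (for example, gold Entailment predicted as Neutral) point to the confusion patterns worth inspecting by hand.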


In [None]:
!python analyze/error_analysis.py


### Part 1.4: Visualizations - Baseline Model

Create visualizations to show error patterns and confusion matrices.


In [None]:
!python analyze/visualize_baseline.py

## Part 2: Fix - Debiasing Implementation

### Part 2.1: Train Debiased Model

Train a debiased model using confidence-based reweighting.  
Examples where the hypothesis-only model is confident (likely artifacts) are downweighted.
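One common form of confidence-based reweighting scales each example's loss by `1 - p_bias(y)`, where `p_bias(y)` is the probability the hypothesis-only model assigns to the gold label. The sketch below assumes that scheme. The weighting actually used by `train/train_debiased.py` is not shown in this notebook and may differ. Function names and toy probabilities are illustrative.

```python
import math

def example_weight(bias_probs, gold_label):
    """Down-weight examples the bias model already predicts confidently.

    Assumed scheme: weight = 1 - p_bias(gold label).
    """
    return 1.0 - bias_probs[gold_label]

def weighted_cross_entropy(main_probs, bias_probs, gold_labels):
    """Mean over the batch of weight * negative log-likelihood."""
    total = 0.0
    for main, bias, gold in zip(main_probs, bias_probs, gold_labels):
        total += example_weight(bias, gold) * -math.log(main[gold])
    return total / len(gold_labels)

# Toy batch: the bias model is very confident on the first example only.
main_probs = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]]
bias_probs = [[0.9, 0.05, 0.05], [0.3, 0.4, 0.3]]
gold_labels = [0, 1]

weights = [example_weight(b, g) for b, g in zip(bias_probs, gold_labels)]
loss = weighted_cross_entropy(main_probs, bias_probs, gold_labels)

print(weights)
print(round(loss, 4))
```

Under this scheme the first example contributes far less to the loss than the second, so the main model gets little reward for fitting examples that are solvable from the hypothesis alone.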


In [None]:
!python train/train_debiased.py


In [12]:
# Check debiased results
import json
with open(os.path.join('outputs', 'evaluations', 'debiased_model', 'eval_metrics.json'), 'r') as f:
    debiased_metrics = json.load(f)

print("=" * 80)
print("Debiased Model Results")
print("=" * 80)
print(f"Accuracy: {debiased_metrics['eval_accuracy']:.4f} ({debiased_metrics['eval_accuracy']*100:.2f}%)")
print(f"Eval Loss: {debiased_metrics.get('eval_loss', 'N/A')}")


Debiased Model Results
Accuracy: 0.8642 (86.42%)
Eval Loss: 0.24399055540561676


In [11]:
!python analyze/compare_results.py


Results Comparison - Baseline vs Debiased

Random Baseline:        0.3333 (33.33%)
Hypothesis-Only:        0.6080 (60.80%) [Above random: +27.47%]
Baseline (Full Model):  0.8654 (86.54%)
Debiased:               0.8642 (86.42%) [Change: -0.12%]

Key Findings:
1. Hypothesis-Only model achieves 60.80%, proving strong artifacts exist!
2. Debiasing maintains performance: 86.42% vs 86.54%
3. Debiasing preserved performance

Per-Class Accuracy Comparison
Entailment     : Baseline=89.28%, Debiased=89.31%, Change=+0.03%
Neutral        : Baseline=82.97%, Debiased=82.38%, Change=-0.59%
Contradiction  : Baseline=87.28%, Debiased=87.46%, Change=+0.18%

Prediction Changes
Total predictions changed: 605 (6.1%)
Baseline wrong -> Debiased correct (FIXES): 270
Baseline correct -> Debiased wrong (BREAKS): 282
Net improvement: -12

Top 10 fixes saved to: /content/fp-dataset-artifacts/outputs/evaluations/fixes_examples.json
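The prediction-change counts can be reproduced from three aligned label lists. This is a sketch of the bookkeeping, not the code in `analyze/compare_results.py`; the function name and toy labels are hypothetical. Note that changed predictions can outnumber fixes plus breaks, because some changes move from one wrong label to another. In the numbers above, 605 changed against 270 + 282 = 552, leaving 53 such wrong-to-wrong changes, and 270 - 282 gives the reported net of -12.

```python
def compare_predictions(gold, baseline, debiased):
    """Count predictions that changed and whether each change helped or hurt."""
    changed = fixes = breaks = 0
    for g, b, d in zip(gold, baseline, debiased):
        if b == d:
            continue  # both models agree, nothing changed
        changed += 1
        if b != g and d == g:
            fixes += 1   # baseline wrong -> debiased correct
        elif b == g and d != g:
            breaks += 1  # baseline correct -> debiased wrong
        # otherwise both are wrong, but with different labels
    return {"changed": changed, "fixes": fixes, "breaks": breaks, "net": fixes - breaks}

# Toy labels for illustration only.
gold     = [0, 1, 2, 1, 0]
baseline = [0, 2, 2, 0, 1]
debiased = [1, 1, 2, 2, 0]

summary = compare_predictions(gold, baseline, debiased)
print(summary)
```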


### Part 2.3: Visualizations - Comparison

Create visualizations comparing baseline and debiased models.


In [13]:
!python analyze/visualize_comparison.py


Loading metrics...
Loading predictions...
Creating comparison charts...
Comparison chart saved to: /content/fp-dataset-artifacts/outputs/evaluations/baseline_vs_debiased_comparison.png
Comparison visualizations completed!


In [14]:
!python analyze/show_fixes.py


Examples Where Debiasing Fixed Baseline Errors

Fix Example 1:
  Premise: A man selling donuts to a customer during a world exhibition event held in the city of Angeles
  Hypothesis: A man selling donuts to a customer.
  True Label: Entailment
  Baseline Predicted: Neutral [WRONG]
  Debiased Predicted: Entailment [CORRECT]
--------------------------------------------------------------------------------

Fix Example 2:
  Premise: A senior is waiting at the window of a restaurant that serves sandwiches.
  Hypothesis: A man is waiting in line for the bus.
  True Label: Contradiction
  Baseline Predicted: Neutral [WRONG]
  Debiased Predicted: Contradiction [CORRECT]
--------------------------------------------------------------------------------

Fix Example 3:
  Premise: Street performer with bowler hat and high boots performs outside.
  Hypothesis: The man is performing a magic act.
  True Label: Neutral
  Baseline Predicted: Contradiction [WRONG]
  Debiased Predicted: Neutral [CORRECT]


## Update the Repository

In [None]:
!git config --global user.name "DinaberryPi"
!git config --global user.email "dinahenrykyy@gmail.com"
!git status

On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   outputs/evaluations/baseline_100k/eval_metrics.json
	modified:   outputs/evaluations/baseline_100k/eval_predictions.jsonl

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	outputs/evaluations/baseline_100k/checkpoint-1000/
	outputs/evaluations/baseline_100k/checkpoint-1500/
	outputs/evaluations/baseline_100k/checkpoint-2000/
	outputs/evaluations/baseline_100k/checkpoint-2500/
	outputs/evaluations/baseline_100k/checkpoint-3000/
	outputs/evaluations/baseline_100k/checkpoint-3500/
	outputs/evaluations/baseline_100k/checkpoint-4000/
	outputs/evaluations/baseline_100k/checkpoint-4500/
	outputs/evaluations/baseline_100k/checkpoint-500/

In [None]:
!git add .

In [None]:
!git status

On branch main
Your branch is up to date with 'origin/main'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/config.json
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/model.safetensors
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/optimizer.pt
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/rng_state.pth
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/scheduler.pt
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/special_tokens_map.json
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/tokenizer.json
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/tokenizer_config.json
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/trainer_state.json
	new file:   outputs/evaluations/baseline_100k/checkpoint-1000/trai

In [None]:
!git commit -m "update"

[main fac3e4b] update
 229 files changed, 1238491 insertions(+), 9843 deletions(-)
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/config.json
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/model.safetensors
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/optimizer.pt
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/rng_state.pth
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/scheduler.pt
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/special_tokens_map.json
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/tokenizer.json
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/tokenizer_config.json
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/trainer_state.json
 create mode 100644 outputs/evaluations/baseline_100k/checkpoint-1000/training_args.bin
 create mode 100644 outputs/evaluations/baseline_

In [None]:
!git push origin main

Enumerating objects: 150, done.