In [11]:
"""
LIME Stability Analysis for Text Classification
================================================
Team: ModelMiners (Abdul Aahad Qureshi, Khyzar Baig)
Project: iML Winter 2025/26

Research Question: How stable are LIME explanations for text classification?
"""

# Check Python version
import sys
print(f"Python version: {sys.version}")
print(f"Running on: Colab")

Python version: 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0]
Running on: Colab


In [10]:
# Restart runtime after this
!pip install transformers==4.36.0 huggingface_hub==0.20.0


In [12]:
%%capture
# Install required packages (suppress output)
!pip install lime==0.2.0.1
!pip install seaborn==0.12.2

# Install datasets at the desired stable version.
# Then, upgrade huggingface_hub and fsspec to ensure compatibility with transformers
# datasets==2.18.0 requires huggingface-hub>=0.14.0,<1.0, so an upgraded version will still be compatible.
!pip install datasets==2.18.0
!pip install --upgrade huggingface_hub
!pip install --upgrade fsspec

print(" All packages installed!")

In [2]:
"""
# LIME Stability Analysis for Text Classification

**Team:** ModelMiners (Abdul Aahad Qureshi, Khyzar Baig)
**Course:** Interpretable Machine Learning (iML) Winter 2025/26
**Supervisor:** Lukas Fehring

---

## 1. Introduction & Motivation

LIME (Local Interpretable Model-agnostic Explanations) is one of the most widely used methods for explaining black-box model predictions. However, it has a well-known weakness: **explanation instability due to random sampling**. Running LIME twice on the same input can yield different "important features," which undermines trust in high-stakes applications such as medical diagnosis or loan decisions.

**Research Question:** How stable are LIME explanations for text classification, and what factors affect this stability?

---

## 2. Related Work

Our work builds on several key papers analyzing explainability method robustness:

| Paper | Key Finding | Gap We Address |
|-------|-------------|----------------|
| **Ribeiro et al. (2016)** - "Why Should I Trust You?" | Introduced LIME | No stability analysis provided |
| **Alvarez-Melis & Jaakkola (2018)** - "On the Robustness of Interpretability Methods" | First systematic stability analysis | Limited text classification focus |
| **Slack et al. (2020)** - "Fooling LIME and SHAP" | Showed adversarial vulnerabilities | Focused on adversarial attacks, not hyperparameter sensitivity |
| **Molnar et al. (2020)** - "Interpretable Machine Learning" | Comprehensive XAI overview | No systematic num_samples analysis |
| **Krishna et al. (2022)** - "The Disagreement Problem in Explainable ML" | Showed explanation disagreement across methods | Cross-method focus, not within-method stability |

**Our Contribution:** We provide the first systematic analysis of how LIME's `num_samples` parameter affects explanation stability specifically for text classification, along with analysis of sentence length and model complexity effects.

---

## 3. LIME Methodology for Text

### 3.1 How LIME Works

LIME explains predictions by fitting an interpretable surrogate model locally around a specific instance:

1. **Original Prediction:** Get the black-box model's prediction for input text
2. **Perturbation Generation:** Create neighborhood samples by randomly masking words
3. **Surrogate Fitting:** Fit a Ridge Regression model on perturbations weighted by proximity
4. **Explanation:** Extract feature (word) importance coefficients from surrogate

### 3.2 Technical Implementation

```
Surrogate Model: Ridge Regression (L2-regularized linear model)
Text Representation: Bag-of-Words (binary word presence)
Perturbation Method: Random word masking (a randomly chosen subset of words is removed in each sample)
Similarity Kernel: Exponential kernel based on cosine distance
```
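
The perturbation and weighting steps can be sketched in plain Python. This is a simplified illustration rather than the `lime` package's own code: the helper names are hypothetical, and both the masking scheme (drop a randomly chosen number of words) and the kernel width of 0.25 on unscaled cosine distance are assumptions made for the example.

```python
import math
import random

def random_mask(n_words, rng):
    # Pick how many words to drop (at least one, never all), then which ones.
    n_remove = rng.randint(1, n_words - 1)
    removed = set(rng.sample(range(n_words), n_remove))
    return [0 if i in removed else 1 for i in range(n_words)]

def cosine_distance_to_original(mask):
    # The original sentence is the all-ones vector, so for a binary mask the
    # cosine similarity reduces to sqrt(kept / n).
    kept = sum(mask)
    if kept == 0:
        return 1.0
    return 1.0 - math.sqrt(kept / len(mask))

def exponential_kernel(distance, kernel_width=0.25):
    # Weight is 1 for an identical sample and decays towards 0 with distance.
    return math.sqrt(math.exp(-(distance ** 2) / kernel_width ** 2))

words = "this movie was absolutely terrible and boring".split()
rng = random.Random(42)
for _ in range(3):
    mask = random_mask(len(words), rng)
    perturbed = " ".join(w for w, keep in zip(words, mask) if keep)
    weight = exponential_kernel(cosine_distance_to_original(mask))
    print(f"{weight:.3f}  {perturbed}")
```

Samples that keep most of the sentence receive weights close to 1, while heavily masked samples contribute little to the surrogate fit.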

### 3.3 Why Instability Occurs

LIME's randomness comes from perturbation sampling:
- For a sentence with n distinct words, there are 2^n possible perturbations (each word is either kept or removed)
- LIME samples only `num_samples` of these (default: 5000)
- Different random samples → Different surrogate models → Different explanations

**Our Hypothesis:** Longer sentences have exponentially larger perturbation spaces (2^n), potentially causing greater instability with fixed sampling.

---

## 4. Experimental Setup

### 4.1 Dataset
- **SST-2** (Stanford Sentiment Treebank): Binary sentiment classification
- ~67,000 training samples, 872 validation samples

### 4.2 Models Tested
| Model | Type | Parameters | Purpose |
|-------|------|------------|---------|
| Logistic Regression + TF-IDF | Simple, linear | ~10K | Baseline |
| DistilBERT (fine-tuned) | Complex, transformer | ~66M | Real-world comparison |

### 4.3 Stability Metrics

| Metric | Description | Ideal Value |
|--------|-------------|-------------|
| **Top-K Agreement** | % overlap in top-3 important words across runs | 1.0 (perfect) |
| **Rank Correlation** | Spearman correlation of word rankings across runs | 1.0 (perfect) |
| **Coefficient of Variation** | Std / mean absolute value of importance scores per word | 0.0 (no variation) |
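
To make the first and third metrics concrete, here is a small worked example in plain Python. The three "runs" are made-up toy values, not LIME output, and the helper names are hypothetical; the arithmetic is meant to follow the table's definitions (pairwise top-k overlap, population standard deviation over mean absolute score).

```python
from itertools import combinations
from statistics import mean, pstdev

def top_k(explanation, k=3):
    # Rank words by absolute importance and keep the k strongest.
    ranked = sorted(explanation, key=lambda w: abs(explanation[w]), reverse=True)
    return set(ranked[:k])

def top_k_agreement(explanations, k=3):
    # Mean fraction of shared top-k words over all pairs of runs.
    sets = [top_k(e, k) for e in explanations]
    return mean(len(a & b) / k for a, b in combinations(sets, 2))

def coefficient_of_variation(scores):
    # Population standard deviation divided by the mean absolute score.
    return pstdev(scores) / mean(abs(s) for s in scores)

# Toy explanations for one sentence (illustrative numbers only).
runs = [
    {"terrible": -0.40, "boring": -0.30, "movie": 0.05, "was": 0.01},
    {"terrible": -0.42, "boring": -0.28, "was": 0.06, "movie": 0.02},
    {"terrible": -0.38, "boring": -0.31, "movie": 0.04, "was": 0.03},
]

print(top_k_agreement(runs, k=3))
print(coefficient_of_variation([r["terrible"] for r in runs]))
```

Two of the three pairs share only two of their top-3 words, so the agreement is (2/3 + 1 + 2/3) / 3 = 7/9, even though the two strongest words never change.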

### 4.4 Reproducibility
- **Seeds:** 5 random seeds (42, 123, 456, 789, 1000)
- **LIME Runs:** 30 runs per sentence
- **Results:** Reported as mean ± std across seeds

---
"""


In [13]:
# Standard libraries
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import spearmanr
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# ML libraries
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Explainability
from lime.lime_text import LimeTextExplainer

# Dataset
from datasets import load_dataset

# Set random seed for reproducibility
np.random.seed(42)

# Plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("All libraries imported successfully!")

All libraries imported successfully!



In [4]:
print(" Downloading SST-2 dataset...")

# Load SST-2 from Hugging Face
dataset = load_dataset("glue", "sst2")

print(f" Dataset loaded!")
print(f"   Train samples: {len(dataset['train'])}")
print(f"   Test samples: {len(dataset['validation'])}")

# Convert to DataFrame
# GLUE does not publish SST-2 test labels, so the validation split serves as our test set.
train_df = pd.DataFrame(dataset['train'])
test_df = pd.DataFrame(dataset['validation'])

print(f"\n First 3 test samples:")
for idx, row in test_df.head(3).iterrows():
    label = "POSITIVE" if row['label'] == 1 else "NEGATIVE"
    print(f"{label:10} | {row['sentence']}")

# Add sentence length
test_df['length'] = test_df['sentence'].str.split().str.len()

print(f"\n Sentence length statistics:")
print(test_df['length'].describe())

 Downloading SST-2 dataset...



 Dataset loaded!
   Train samples: 67349
   Test samples: 872

 First 3 test samples:
POSITIVE   | it 's a charming and often affecting journey . 
NEGATIVE   | unflinchingly bleak and desperate 
POSITIVE   | allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . 

 Sentence length statistics:
count    872.000000
mean      19.548165
std        8.763900
min        2.000000
25%       13.000000
50%       19.000000
75%       26.000000
max       47.000000
Name: length, dtype: float64


In [5]:
class SimpleLogisticModel:
    """Logistic Regression with TF-IDF for sentiment analysis"""

    def __init__(self, seed=42):
        self.vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
        self.model = LogisticRegression(random_state=seed, max_iter=1000)
        self.is_trained = False

    def fit(self, texts, labels):
        print(" Vectorizing text...")
        X = self.vectorizer.fit_transform(texts)
        print(f" Feature matrix: {X.shape}")

        print(" Training model...")
        self.model.fit(X, labels)
        self.is_trained = True

        train_acc = accuracy_score(labels, self.model.predict(X))
        print(f" Training accuracy: {train_acc:.3f}")

    def predict_proba(self, texts):
        if isinstance(texts, str):
            texts = [texts]
        X = self.vectorizer.transform(texts)
        return self.model.predict_proba(X)

# Train the model
print(" Training Sentiment Classifier\n")
model = SimpleLogisticModel(seed=42)
model.fit(train_df['sentence'].tolist(), train_df['label'].tolist())

# Test accuracy
print("\n Evaluating on test set...")
test_preds = model.predict_proba(test_df['sentence'].tolist())
test_acc = accuracy_score(test_df['label'].tolist(),
                          (test_preds[:, 1] > 0.5).astype(int))
print(f" Test accuracy: {test_acc:.3f}")

 Training Sentiment Classifier

 Vectorizing text...
 Feature matrix: (67349, 5000)
 Training model...
 Training accuracy: 0.867

 Evaluating on test set...
 Test accuracy: 0.804


In [6]:
class LIMEStabilityAnalyzer:
    """Analyze LIME explanation stability"""

    def __init__(self, model):
        self.model = model
        self.explainer = LimeTextExplainer(class_names=['negative', 'positive'])

    def explain_once(self, text, num_samples=1000):
        """Get single LIME explanation"""
        exp = self.explainer.explain_instance(
            text,
            self.model.predict_proba,
            num_features=10,
            num_samples=num_samples
        )
        return dict(exp.as_list())

    def explain_multiple(self, text, num_samples=1000, num_runs=30):
        """Run LIME multiple times"""
        explanations = []
        for _ in range(num_runs):
            exp = self.explain_once(text, num_samples)
            explanations.append(exp)
        return explanations

    def get_top_k(self, explanation, k=3):
        """Get top-k important words"""
        sorted_words = sorted(explanation.items(),
                             key=lambda x: abs(x[1]),
                             reverse=True)
        return [word for word, _ in sorted_words[:k]]

    def visualize_explanation(self, explanation, title="LIME Explanation"):
        """Plot word importances"""
        sorted_items = sorted(explanation.items(),
                             key=lambda x: abs(x[1]),
                             reverse=True)[:10]
        words, scores = zip(*sorted_items)

        colors = ['red' if s < 0 else 'green' for s in scores]

        plt.figure(figsize=(10, 6))
        plt.barh(words, scores, color=colors, alpha=0.7)
        plt.xlabel('Importance Score')
        plt.title(title)
        plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
        plt.tight_layout()
        plt.show()

# Create analyzer
analyzer = LIMEStabilityAnalyzer(model)
print(" LIME Analyzer ready!")

 LIME Analyzer ready!


In [None]:
# Test on example sentence
test_sentence = "This movie was absolutely terrible and boring"

print(f" Test sentence: '{test_sentence}'")
print(f"\n Model prediction:")
prob = model.predict_proba([test_sentence])[0]
pred = "POSITIVE" if prob[1] > 0.5 else "NEGATIVE"
print(f"   {pred} (confidence: {max(prob):.3f})")

print(f"\n Running LIME...")
explanation = analyzer.explain_once(test_sentence, num_samples=1000)

# Visualize
analyzer.visualize_explanation(explanation,
                               title="LIME Explanation: Negative Review")

# Show top words
print(f"\n Top 5 important words:")
sorted_exp = sorted(explanation.items(), key=lambda x: abs(x[1]), reverse=True)
for word, score in sorted_exp[:5]:
    direction = "→ Negative" if score < 0 else "→ Positive"
    print(f"   {word:15} {score:+.3f}  {direction}")

In [2]:
class StabilityMetrics:
    """Calculate stability metrics for LIME explanations"""

    @staticmethod
    def top_k_agreement(explanations, k=3):
        """What % of top-k words overlap across runs?"""
        top_k_lists = []
        for exp in explanations:
            sorted_words = sorted(exp.items(),
                                 key=lambda x: abs(x[1]),
                                 reverse=True)
            top_k_lists.append(set([w for w, _ in sorted_words[:k]]))

        # Pairwise overlap
        agreements = []
        n = len(top_k_lists)
        for i in range(n):
            for j in range(i+1, n):
                overlap = len(top_k_lists[i] & top_k_lists[j]) / k
                agreements.append(overlap)

        return np.mean(agreements) if agreements else 0.0

    @staticmethod
    def rank_correlation(explanations):
        """Average Spearman correlation of word rankings"""
        # Get all unique words
        all_words = set()
        for exp in explanations:
            all_words.update(exp.keys())
        all_words = sorted(list(all_words))

        # Create rank vectors
        rank_vectors = []
        for exp in explanations:
            ranks = []
            sorted_words = sorted(exp.items(),
                                 key=lambda x: abs(x[1]),
                                 reverse=True)
            word_to_rank = {w: i for i, (w, _) in enumerate(sorted_words)}

            for word in all_words:
                ranks.append(word_to_rank.get(word, len(sorted_words)))
            rank_vectors.append(ranks)

        # Pairwise correlations
        correlations = []
        n = len(rank_vectors)
        for i in range(n):
            for j in range(i+1, n):
                corr, _ = spearmanr(rank_vectors[i], rank_vectors[j])
                if not np.isnan(corr):
                    correlations.append(corr)

        return np.mean(correlations) if correlations else 0.0

    @staticmethod
    def coefficient_of_variation(explanations):
        """How much do importance scores vary?"""
        all_words = set()
        for exp in explanations:
            all_words.update(exp.keys())

        cvs = []
        for word in all_words:
            scores = [exp.get(word, 0) for exp in explanations]
            mean_abs = np.mean(np.abs(scores))
            std = np.std(scores)

            if mean_abs > 0:
                cvs.append(std / mean_abs)

        return np.mean(cvs) if cvs else 0.0

metrics_calc = StabilityMetrics()
print(" Stability metrics ready!")

 Stability metrics ready!
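
The rank-correlation metric pads words that are missing from a run with the next free rank. The sketch below reproduces that idea in plain Python on two made-up toy explanations; the helper names are hypothetical, and Spearman's coefficient is computed here as the Pearson correlation of average ranks rather than with `scipy`.

```python
from statistics import mean

def rank_vector(explanation, vocabulary):
    # Rank 0 is the most important word; words absent from this run all
    # receive the next free rank, mirroring the padding in rank_correlation.
    ranked = sorted(explanation, key=lambda w: abs(explanation[w]), reverse=True)
    position = {w: i for i, w in enumerate(ranked)}
    return [position.get(w, len(ranked)) for w in vocabulary]

def average_ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Toy explanations (illustrative numbers only): the runs disagree on the third word.
run_a = {"terrible": -0.40, "boring": -0.30, "movie": 0.05}
run_b = {"terrible": -0.42, "boring": -0.28, "was": 0.06}
vocabulary = sorted(set(run_a) | set(run_b))
vec_a = rank_vector(run_a, vocabulary)
vec_b = rank_vector(run_b, vocabulary)
print(vocabulary, vec_a, vec_b, spearman(vec_a, vec_b))
```

Because each run lists a word the other lacks, the padded rank vectors differ in two positions and the correlation drops to about 0.8, although both runs agree on the two most important words.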


In [9]:
"""
## 5. Experiment 1: Effect of num_samples on Stability

**Research Question:** How many perturbation samples does LIME need for stable explanations?

**Hypothesis:** More samples → More stable explanations (but diminishing returns)

**Setup:**
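- 50 sentences per seed, sampled from the SST-2 validation split (used as our test set)
- num_samples ∈ {100, 250, 500, 1000, 2000}
- 30 LIME runs per sentence for each num_samples value, repeated with 5 seeds (each seed draws its own 50 sentences)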

"""


In [10]:

SEEDS = [42, 123, 456, 789, 1000]  # 5 seeds as proposed

# =============================================================================
# EXPERIMENT 1: How does num_samples affect stability?
# Proposal: 50 sentences, num_samples ∈ {100, 250, 500, 1000, 2000}, 30 runs
# =============================================================================

print("="*70)
print(" EXPERIMENT 1: How does num_samples affect stability?")
print("="*70)
print(f" Settings: 50 sentences | 5 seeds | 30 LIME runs per sentence")
print("="*70)

# Settings (matching proposal)
num_samples_list = [100, 250, 500, 1000, 2000]
n_sentences = 50      # Proposal: 50 sentences
n_runs = 30           # Proposal: 30 runs

exp1_all_seeds = []

for seed in SEEDS:
    print(f"\n{'='*70}")
    print(f" Running with SEED = {seed}")
    print(f"{'='*70}")

    np.random.seed(seed)

    # Get sample sentences for this seed
    sample_sentences = test_df.sample(n=n_sentences, random_state=seed)

    seed_results = []

    for num_samples in tqdm(num_samples_list, desc=f"num_samples (seed={seed})"):
        print(f"\n  Testing num_samples = {num_samples}")

        top3_list = []
        corr_list = []
        cv_list = []

        for idx, row in tqdm(sample_sentences.iterrows(),
                             total=len(sample_sentences),
                             desc=f"Sentences", leave=False):
            text = row['sentence']

            # Run LIME multiple times
            explanations = analyzer.explain_multiple(text,
                                                     num_samples=num_samples,
                                                     num_runs=n_runs)

            # Calculate metrics
            top3 = metrics_calc.top_k_agreement(explanations, k=3)
            corr = metrics_calc.rank_correlation(explanations)
            cv = metrics_calc.coefficient_of_variation(explanations)

            top3_list.append(top3)
            corr_list.append(corr)
            cv_list.append(cv)

        seed_results.append({
            'seed': seed,
            'num_samples': num_samples,
            'top3_agreement': np.mean(top3_list),
            'rank_correlation': np.mean(corr_list),
            'coeff_variation': np.mean(cv_list)
        })

        print(f"    Top-3: {np.mean(top3_list):.3f} | Corr: {np.mean(corr_list):.3f} | CV: {np.mean(cv_list):.3f}")

    exp1_all_seeds.extend(seed_results)

# Convert to DataFrame and aggregate across seeds
exp1_raw_df = pd.DataFrame(exp1_all_seeds)

# Average across seeds
exp1_results = exp1_raw_df.groupby('num_samples').agg({
    'top3_agreement': ['mean', 'std'],
    'rank_correlation': ['mean', 'std'],
    'coeff_variation': ['mean', 'std']
}).reset_index()

# Flatten column names
exp1_results.columns = ['num_samples',
                        'top3_mean', 'top3_std',
                        'corr_mean', 'corr_std',
                        'cv_mean', 'cv_std']

print(f"\n{'='*70}")
print(" EXPERIMENT 1 RESULTS (Averaged across 5 seeds):")
print(f"{'='*70}")
print(exp1_results.to_string(index=False))

# Save results
exp1_raw_df.to_csv('exp1_raw_results.csv', index=False)
exp1_results.to_csv('exp1_aggregated_results.csv', index=False)
print("\n Results saved!")

🔬 EXPERIMENT 1: How does num_samples affect stability?
📋 Settings: 50 sentences | 5 seeds | 30 LIME runs per sentence

🌱 Running with SEED = 42
  Testing num_samples = 100
    Top-3: 0.838 | Corr: 0.809 | CV: 1.147
  Testing num_samples = 250
    Top-3: 0.894 | Corr: 0.880 | CV: 0.837
  Testing num_samples = 500
    Top-3: 0.921 | Corr: 0.913 | CV: 0.706
  Testing num_samples = 1000
    Top-3: 0.941 | Corr: 0.937 | CV: 0.569
  Testing num_samples = 2000
    Top-3: 0.950 | Corr: 0.952 | CV: 0.476

🌱 Running with SEED = 123
  Testing num_samples = 100
    Top-3: 0.844 | Corr: 0.811 | CV: 1.080
  Testing num_samples = 250
    Top-3: 0.897 | Corr: 0.890 | CV: 0.892
  Testing num_samples = 500
    Top-3: 0.922 | Corr: 0.919 | CV: 0.741
  Testing num_samples = 1000
    Top-3: 0.937 | Corr: 0.939 | CV: 0.555
  Testing num_samples = 2000
    Top-3: 0.955 | Corr: 0.954 | CV: 0.506

🌱 Running with SEED = 456
  Testing num_samples = 100
    Top-3: 0.876 | Corr: 0.819 | CV: 1.115
  Testing num_samples = 250
    Top-3: 0.927 | Corr: 0.893 | CV: 0.808
  Testing num_samples = 500
    Top-3: 0.949 | Corr: 0.928 | CV: 0.600
  Testing num_samples = 1000
    Top-3: 0.964 | Corr: 0.946 | CV: 0.508
  Testing num_samples = 2000
    Top-3: 0.970 | Corr: 0.960 | CV: 0.376

🌱 Running with SEED = 789
  Testing num_samples = 100
    Top-3: 0.879 | Corr: 0.822 | CV: 1.149
  Testing num_samples = 250
    Top-3: 0.922 | Corr: 0.903 | CV: 0.891
  Testing num_samples = 500
    Top-3: 0.941 | Corr: 0.931 | CV: 0.691
  Testing num_samples = 1000
    Top-3: 0.956 | Corr: 0.950 | CV: 0.590
  Testing num_samples = 2000
    Top-3: 0.964 | Corr: 0.962 | CV: 0.404

🌱 Running with SEED = 1000
  Testing num_samples = 100
    Top-3: 0.843 | Corr: 0.784 | CV: 1.120
  Testing num_samples = 250
    Top-3: 0.896 | Corr: 0.862 | CV: 0.870
  Testing num_samples = 500
    Top-3: 0.922 | Corr: 0.889 | CV: 0.699
  Testing num_samples = 1000
    Top-3: 0.939 | Corr: 0.912 | CV: 0.520
  Testing num_samples = 2000
    Top-3: 0.951 | Corr: 0.925 | CV: 0.473

📊 EXPERIMENT 1 RESULTS (Averaged across 5 seeds):
 num_samples  top3_mean  top3_std  corr_mean  corr_std  cv_mean   cv_std
         100   0.855862  0.019807   0.808780  0.015069 1.122337 0.028092
         250   0.907307  0.015892   0.885498  0.015603 0.859506 0.036624
         500   0.930979  0.013096   0.916141  0.016506 0.687124 0.052543
        1000   0.947237  0.012190   0.936665  0.014612 0.548647 0.033969
        2000   0.958051  0.008545   0.950688  0.015122 0.447082 0.054359

✅ Results saved!
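
As a sanity check on the aggregation, the `num_samples = 100` row can be recomputed by hand from the per-seed Top-3 values printed in the log above. This is a minimal sketch using only the standard library; it assumes the `std` column is the sample standard deviation (n − 1 in the denominator), which is what pandas' `'std'` aggregation computes by default.

```python
from statistics import mean, stdev

# Per-seed Top-3 agreement at num_samples = 100, copied from the log above.
top3_per_seed = {42: 0.838, 123: 0.844, 456: 0.876, 789: 0.879, 1000: 0.843}

values = list(top3_per_seed.values())
seed_mean = mean(values)
seed_std = stdev(values)  # sample standard deviation (n - 1)

print(f"Top-3 agreement: {seed_mean:.3f} ± {seed_std:.3f}")
```

The result agrees with the aggregated row (0.855862 and 0.019807) to three decimals; the small remaining difference comes from using the rounded per-seed values.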


In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: Top-3 Agreement
axes[0].errorbar(exp1_results['num_samples'], exp1_results['top3_mean'],
                 yerr=exp1_results['top3_std'], marker='o', linewidth=2,
                 markersize=8, color='#2E86AB', capsize=5, capthick=2)
axes[0].set_xlabel('Number of Samples', fontsize=12)
axes[0].set_ylabel('Top-3 Agreement', fontsize=12)
axes[0].set_title('Top-3 Feature Agreement vs num_samples', fontsize=14, fontweight='bold')
axes[0].set_ylim(0, 1.0)
axes[0].axhline(y=0.9, color='green', linestyle='--', alpha=0.5, label='High Stability (0.9)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Rank Correlation
axes[1].errorbar(exp1_results['num_samples'], exp1_results['corr_mean'],
                 yerr=exp1_results['corr_std'], marker='s', linewidth=2,
                 markersize=8, color='#A23B72', capsize=5, capthick=2)
axes[1].set_xlabel('Number of Samples', fontsize=12)
axes[1].set_ylabel('Spearman Rank Correlation', fontsize=12)
axes[1].set_title('Rank Correlation vs num_samples', fontsize=14, fontweight='bold')
axes[1].set_ylim(0, 1.0)
axes[1].axhline(y=0.9, color='green', linestyle='--', alpha=0.5, label='High Stability (0.9)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Plot 3: Coefficient of Variation
axes[2].errorbar(exp1_results['num_samples'], exp1_results['cv_mean'],
                 yerr=exp1_results['cv_std'], marker='^', linewidth=2,
                 markersize=8, color='#F18F01', capsize=5, capthick=2)
axes[2].set_xlabel('Number of Samples', fontsize=12)
axes[2].set_ylabel('Coefficient of Variation', fontsize=12)
axes[2].set_title('Score Variation vs num_samples', fontsize=14, fontweight='bold')
axes[2].axhline(y=0.5, color='green', linestyle='--', alpha=0.5, label='Low Variation (0.5)')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.suptitle('EXPERIMENT 1: Effect of num_samples on LIME Stability (5 seeds, 50 sentences)',
             fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('exp1_samples_vs_stability.png', dpi=300, bbox_inches='tight')
plt.show()

print(" Visualization saved to 'exp1_samples_vs_stability.png'")

In [16]:
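import os

# Save the three-panel Experiment 1 figure created in the previous cell.
# plt.figure(1) would open a new, empty figure here, so reuse the figure
# attached to the existing axes instead.
os.makedirs('results', exist_ok=True)
axes[0].figure.savefig('results/exp1_samples.png', dpi=300, bbox_inches='tight')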

In [26]:
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.patches import FancyBboxPatch, Wedge, Circle, Rectangle
import numpy as np

# Set style
plt.rcParams['font.family'] = 'sans-serif'
plt.rcParams['font.size'] = 12
plt.rcParams['axes.spines.top'] = False
plt.rcParams['axes.spines.right'] = False
plt.rcParams['axes.linewidth'] = 2

# Color palette
COLORS = {
    'red': '#E74C3C',
    'dark_red': '#C0392B',
    'orange': '#F39C12',
    'yellow': '#F1C40F',
    'green': '#27AE60',
    'dark_green': '#1E8449',
    'blue': '#3498DB',
    'dark_blue': '#1A5276',
    'purple': '#9B59B6',
    'teal': '#17A589',
    'gray': '#7F8C8D',
    'light_gray': '#ECF0F1',
    'dark': '#2C3E50',
    'white': '#FFFFFF'
}
fig1, ax1 = plt.subplots(figsize=(10, 6))

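num_samples = [100, 250, 500, 1000, 2000]
# Means and standard deviations (across 5 seeds), rounded from the
# aggregated Experiment 1 results table.
agreement = [0.856, 0.907, 0.931, 0.947, 0.958]
std_dev = [0.020, 0.016, 0.013, 0.012, 0.009]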

# Create area fill with gradient effect
ax1.fill_between(range(len(num_samples)), agreement, 0.8, alpha=0.3, color=COLORS['blue'])
ax1.fill_between(range(len(num_samples)), agreement, 1.0, alpha=0.2, color=COLORS['red'])

# Plot line with markers
ax1.plot(range(len(num_samples)), agreement, 'o-', color=COLORS['dark_blue'],
         linewidth=3, markersize=15, markerfacecolor=COLORS['blue'],
         markeredgecolor=COLORS['dark_blue'], markeredgewidth=2)

# Error bars
ax1.errorbar(range(len(num_samples)), agreement, yerr=std_dev, fmt='none',
             color=COLORS['dark_blue'], capsize=8, capthick=2, linewidth=2)

# Add value labels above each point
for i, (val, err) in enumerate(zip(agreement, std_dev)):
    ax1.annotate(f'{val:.2f}', (i, val + err + 0.015), ha='center', va='bottom',
                fontsize=14, fontweight='bold', color=COLORS['dark_blue'])

# Reference lines
ax1.axhline(y=1.0, color=COLORS['green'], linestyle='--', linewidth=2, label='Perfect (1.0)')
ax1.axhline(y=0.95, color=COLORS['orange'], linestyle=':', linewidth=2, label='Acceptable (0.95)')

# Highlight the gap at the end
ax1.annotate('', xy=(4, 1.0), xytext=(4, agreement[-1]),
            arrowprops=dict(arrowstyle='<->', color=COLORS['red'], lw=2))
ax1.text(4.3, 0.98, 'Still 4%\nUnreliable!', fontsize=11, fontweight='bold',
         color=COLORS['red'], va='center')

# Styling
ax1.set_xlim(-0.5, 4.5)
ax1.set_ylim(0.82, 1.05)
ax1.set_xticks(range(len(num_samples)))
ax1.set_xticklabels([str(n) for n in num_samples], fontsize=12, fontweight='bold')
ax1.set_xlabel('Number of Perturbation Samples', fontsize=14, fontweight='bold')
ax1.set_ylabel('Top-3 Agreement', fontsize=14, fontweight='bold')
ax1.set_title('Exp 1: More Samples Improve Stability, But Never Reach 100%\n(p < 0.001, ANOVA)',
              fontsize=16, fontweight='bold', color=COLORS['dark_blue'], pad=15)
ax1.legend(loc='lower right', fontsize=11)

# Add colored region labels
ax1.text(2, 0.85, 'RELIABLE ZONE', fontsize=10, ha='center', color=COLORS['blue'],
         fontweight='bold', alpha=0.7)
ax1.text(2, 0.99, 'UNRELIABLE ZONE', fontsize=10, ha='center', color=COLORS['red'],
         fontweight='bold', alpha=0.7)

plt.tight_layout()
plt.savefig('exp1_num_samples.png', dpi=300, bbox_inches='tight', facecolor='white')
plt.close()
print("✓ Saved: exp1_num_samples.png")

✓ Saved: exp1_num_samples.png


In [17]:
"""
## 6. Experiment 2: Effect of Sentence Length on Stability

**Research Question:** Does input complexity (sentence length) affect LIME stability?

**Hypothesis:** Longer sentences create larger perturbation spaces (2^n combinations). With fixed `num_samples`, this may cause instability as we sample a smaller fraction of the space.

| Sentence Length | Words (n) | Perturbation Space (2^n) | Coverage at num_samples=1000 |
|-----------------|-----------|--------------------------|------------------------------|
| Short (≤7) | 5 | 32 | 100% (oversampled) |
| Medium (8-15) | 10 | 1,024 | ~98% |
| Long (>15) | 20 | 1,048,576 | ~0.1% |
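
The coverage column can be recomputed with a few lines of Python. The function names are hypothetical, and the last helper rests on a simplifying assumption that LIME does not strictly satisfy: masks are drawn uniformly at random with replacement from all 2^n possibilities.

```python
def perturbation_space(n_words):
    # Every word is either kept or removed, so there are 2^n binary masks.
    return 2 ** n_words

def coverage_ratio(n_words, num_samples):
    # Upper bound on coverage: samples divided by space size, capped at 100%.
    return min(1.0, num_samples / perturbation_space(n_words))

def expected_distinct_fraction(n_words, num_samples):
    # Expected share of distinct masks seen under uniform sampling with
    # replacement (an assumption made for this sketch).
    space = perturbation_space(n_words)
    return 1.0 - (1.0 - 1.0 / space) ** num_samples

for n in (5, 10, 20):
    print(n, perturbation_space(n),
          f"{coverage_ratio(n, 1000):.2%}",
          f"{expected_distinct_fraction(n, 1000):.2%}")
```

The table reports the simple ratio. Under the uniform-sampling assumption, repeated draws mean that 1000 samples would reach only about 62% of the 1,024 masks for n = 10, so the ~98% figure is best read as an upper bound.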

**Setup:**
- 15 sentences per length group × 5 seeds
- num_samples = 1000 (fixed)
- 30 LIME runs per sentence

"""


In [18]:
print("\n" + "="*70)
print(" EXPERIMENT 2: Does sentence length affect stability?")
print("="*70)
print(f" Settings: 15 sentences per group | 5 seeds | 30 LIME runs | 1000 samples")
print("="*70)

n_per_group = 15
num_samples = 1000
n_runs = 30

exp2_all_seeds = []

for seed in SEEDS:
    print(f"\n{'='*70}")
    print(f" Running with SEED = {seed}")
    print(f"{'='*70}")

    np.random.seed(seed)

    # Group sentences by length
    short = test_df[test_df['length'] <= 7].sample(n=n_per_group, random_state=seed)
    medium = test_df[(test_df['length'] > 7) & (test_df['length'] <= 15)].sample(n=n_per_group, random_state=seed)
    long = test_df[test_df['length'] > 15].sample(n=n_per_group, random_state=seed)

    groups = {
        'Short (≤7 words)': short,
        'Medium (8-15 words)': medium,
        'Long (>15 words)': long
    }

    for group_name, group_df in groups.items():
        print(f"\n   {group_name}")

        top3_list = []
        corr_list = []
        cv_list = []

        for idx, row in tqdm(group_df.iterrows(), total=len(group_df),
                             desc=group_name, leave=False):
            explanations = analyzer.explain_multiple(row['sentence'],
                                                     num_samples=num_samples,
                                                     num_runs=n_runs)

            top3 = metrics_calc.top_k_agreement(explanations, k=3)
            corr = metrics_calc.rank_correlation(explanations)
            cv = metrics_calc.coefficient_of_variation(explanations)

            top3_list.append(top3)
            corr_list.append(corr)
            cv_list.append(cv)

        exp2_all_seeds.append({
            'seed': seed,
            'group': group_name,
            'top3_agreement': np.mean(top3_list),
            'rank_correlation': np.mean(corr_list),
            'coeff_variation': np.mean(cv_list),
            'avg_length': group_df['length'].mean()
        })

        print(f"    Top-3: {np.mean(top3_list):.3f} | Corr: {np.mean(corr_list):.3f} | CV: {np.mean(cv_list):.3f}")

# Aggregate across seeds
exp2_raw_df = pd.DataFrame(exp2_all_seeds)

exp2_results = exp2_raw_df.groupby('group').agg({
    'top3_agreement': ['mean', 'std'],
    'rank_correlation': ['mean', 'std'],
    'coeff_variation': ['mean', 'std'],
    'avg_length': 'mean'
}).reset_index()

exp2_results.columns = ['group', 'top3_mean', 'top3_std', 'corr_mean', 'corr_std',
                        'cv_mean', 'cv_std', 'avg_length']

# Reorder groups
group_order = ['Short (≤7 words)', 'Medium (8-15 words)', 'Long (>15 words)']
exp2_results['group'] = pd.Categorical(exp2_results['group'], categories=group_order, ordered=True)
exp2_results = exp2_results.sort_values('group')

print(f"\n{'='*70}")
print(" EXPERIMENT 2 RESULTS (Averaged across 5 seeds):")
print(f"{'='*70}")
print(exp2_results.to_string(index=False))

# Save
exp2_raw_df.to_csv('exp2_raw_results.csv', index=False)
exp2_results.to_csv('exp2_aggregated_results.csv', index=False)


🔬 EXPERIMENT 2: Does sentence length affect stability?
📋 Settings: 15 sentences per group | 5 seeds | 30 LIME runs | 1000 samples

🌱 Running with SEED = 42

  📊 Short (≤7 words)


    Top-3: 0.944 | Corr: 0.908 | CV: 0.194

  📊 Medium (8-15 words)


    Top-3: 0.983 | Corr: 0.958 | CV: 0.416

  📊 Long (>15 words)


    Top-3: 0.967 | Corr: 0.952 | CV: 0.797

🌱 Running with SEED = 123

  📊 Short (≤7 words)


    Top-3: 0.948 | Corr: 0.912 | CV: 0.191

  📊 Medium (8-15 words)


    Top-3: 0.983 | Corr: 0.970 | CV: 0.380

  📊 Long (>15 words)


    Top-3: 0.944 | Corr: 0.940 | CV: 0.861

🌱 Running with SEED = 456

  📊 Short (≤7 words)


    Top-3: 0.938 | Corr: 0.941 | CV: 0.236

  📊 Medium (8-15 words)


    Top-3: 0.965 | Corr: 0.957 | CV: 0.415

  📊 Long (>15 words)


    Top-3: 0.934 | Corr: 0.957 | CV: 0.700

🌱 Running with SEED = 789

  📊 Short (≤7 words)


    Top-3: 0.904 | Corr: 0.840 | CV: 0.235

  📊 Medium (8-15 words)


    Top-3: 0.980 | Corr: 0.956 | CV: 0.458

  📊 Long (>15 words)


    Top-3: 0.980 | Corr: 0.952 | CV: 0.595

🌱 Running with SEED = 1000

  📊 Short (≤7 words)


    Top-3: 0.975 | Corr: 0.934 | CV: 0.313

  📊 Medium (8-15 words)


    Top-3: 0.966 | Corr: 0.965 | CV: 0.414

  📊 Long (>15 words)


    Top-3: 0.960 | Corr: 0.937 | CV: 0.715

📊 EXPERIMENT 2 RESULTS (Averaged across 5 seeds):
              group  top3_mean  top3_std  corr_mean  corr_std  cv_mean   cv_std  avg_length
   Short (≤7 words)   0.941722  0.025264   0.906910  0.040147 0.233832 0.048975    5.840000
Medium (8-15 words)   0.975550  0.009529   0.961182  0.006005 0.416517 0.027770   11.480000
   Long (>15 words)   0.957037  0.018318   0.947526  0.008760 0.733631 0.101157   25.893333
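The aggregated table can be cross-checked by hand from the per-seed values logged above. The sketch below does this for the Short group's Top-3 agreement using only the standard library. It uses the sample standard deviation (ddof = 1), which is the pandas default for `std` in a `groupby(...).agg` call. Because the logged values are rounded to three decimals, the result matches the table only approximately.

```python
from statistics import mean, stdev

# Per-seed Top-3 agreement for the Short group, copied from the run log
# (seeds 42, 123, 456, 789, 1000)
short_top3 = [0.944, 0.948, 0.938, 0.904, 0.975]

top3_mean = mean(short_top3)
top3_std = stdev(short_top3)  # sample standard deviation (ddof=1)

print(f"top3_mean = {top3_mean:.4f}, top3_std = {top3_std:.4f}")
```

The same two calls apply to any other column or group; only the list of per-seed values changes.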


In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

colors = ['#2E86AB', '#A23B72', '#F18F01']
group_names = exp2_results['group'].tolist()
x_pos = np.arange(len(group_names))

# Plot 1: Top-3 Agreement
bars1 = axes[0].bar(x_pos, exp2_results['top3_mean'], yerr=exp2_results['top3_std'],
                    color=colors, edgecolor='black', linewidth=1.2, capsize=5)
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(group_names, rotation=15, ha='right')
axes[0].set_ylabel('Top-3 Agreement', fontsize=12)
axes[0].set_title('Top-3 Agreement vs Sentence Length', fontsize=14, fontweight='bold')
axes[0].set_ylim(0, 1.1)
axes[0].axhline(y=0.9, color='green', linestyle='--', alpha=0.5)

for i, (bar, val) in enumerate(zip(bars1, exp2_results['top3_mean'])):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
                 f'{val:.3f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

# Plot 2: Rank Correlation
bars2 = axes[1].bar(x_pos, exp2_results['corr_mean'], yerr=exp2_results['corr_std'],
                    color=colors, edgecolor='black', linewidth=1.2, capsize=5)
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(group_names, rotation=15, ha='right')
axes[1].set_ylabel('Rank Correlation', fontsize=12)
axes[1].set_title('Rank Correlation vs Sentence Length', fontsize=14, fontweight='bold')
axes[1].set_ylim(0, 1.1)
axes[1].axhline(y=0.9, color='green', linestyle='--', alpha=0.5)

for i, (bar, val) in enumerate(zip(bars2, exp2_results['corr_mean'])):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
                 f'{val:.3f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

# Plot 3: Coefficient of Variation
bars3 = axes[2].bar(x_pos, exp2_results['cv_mean'], yerr=exp2_results['cv_std'],
                    color=colors, edgecolor='black', linewidth=1.2, capsize=5)
axes[2].set_xticks(x_pos)
axes[2].set_xticklabels(group_names, rotation=15, ha='right')
axes[2].set_ylabel('Coefficient of Variation', fontsize=12)
axes[2].set_title('Score Variation vs Sentence Length', fontsize=14, fontweight='bold')
axes[2].axhline(y=0.5, color='green', linestyle='--', alpha=0.5)

for i, (bar, val) in enumerate(zip(bars3, exp2_results['cv_mean'])):
    axes[2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
                 f'{val:.3f}', ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.suptitle('EXPERIMENT 2: Effect of Sentence Length on LIME Stability (5 seeds)',
             fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('exp2_length_vs_stability.png', dpi=300, bbox_inches='tight')
plt.show()

print(" Visualization saved to 'exp2_length_vs_stability.png'")


In [27]:
fig2, ax2 = plt.subplots(figsize=(10, 6))

lengths = ['Short (≤7 words)', 'Medium (8-15 words)', 'Long (>15 words)']
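# Read the aggregated Exp 2 results (ordered Short, Medium, Long) instead of hardcoding them
agreement_len = exp2_results['top3_mean'].round(3).tolist()
cv_len = exp2_results['cv_mean'].round(3).tolist()
std_len = exp2_results['top3_std'].round(3).tolist()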

x = np.arange(len(lengths))
width = 0.35

# Create bars with different styles
bars1 = ax2.bar(x - width/2, agreement_len, width, yerr=std_len,
                color=[COLORS['orange'], COLORS['green'], COLORS['orange']],
                edgecolor=[COLORS['dark_red'], COLORS['dark_green'], COLORS['dark_red']],
                linewidth=3, capsize=6, label='Top-3 Agreement', hatch=['', '', ''])

bars2 = ax2.bar(x + width/2, cv_len, width,
                color=[COLORS['teal'], COLORS['teal'], COLORS['red']],
                edgecolor=COLORS['dark'], linewidth=2, label='Score Variation (CV)', alpha=0.8)

# Add value labels
for bar, val in zip(bars1, agreement_len):
    ax2.text(bar.get_x() + bar.get_width()/2, val + 0.05, f'{val:.2f}',
             ha='center', va='bottom', fontsize=13, fontweight='bold', color=COLORS['dark'])

for bar, val in zip(bars2, cv_len):
    color = COLORS['red'] if val > 0.5 else COLORS['dark']
    ax2.text(bar.get_x() + bar.get_width()/2, val + 0.02, f'{val:.2f}',
             ha='center', va='bottom', fontsize=13, fontweight='bold', color=color)

# Highlight best (Medium)
ax2.annotate('✓ BEST', xy=(1 - width/2, agreement_len[1]), xytext=(0.3, 1.05),
            fontsize=14, fontweight='bold', color=COLORS['green'],
            arrowprops=dict(arrowstyle='->', color=COLORS['green'], lw=2))

# Highlight worst CV (Long)
ax2.annotate('3x MORE\nVARIATION!', xy=(2 + width/2, cv_len[2]), xytext=(2.6, 0.85),
            fontsize=11, fontweight='bold', color=COLORS['red'],
            arrowprops=dict(arrowstyle='->', color=COLORS['red'], lw=2),
            bbox=dict(boxstyle='round,pad=0.3', facecolor='#FADBD8', edgecolor=COLORS['red']))

# Reference line
ax2.axhline(y=1.0, color=COLORS['gray'], linestyle='--', linewidth=1.5, alpha=0.5)

# Styling
ax2.set_ylim(0, 1.15)
ax2.set_xticks(x)
ax2.set_xticklabels(lengths, fontsize=12, fontweight='bold')
ax2.set_ylabel('Score', fontsize=14, fontweight='bold')
ax2.set_title('Exp 2: Medium-Length Sentences Are Most Stable\n(p = 0.018, ANOVA)',
              fontsize=16, fontweight='bold', color=COLORS['dark_blue'], pad=15)
ax2.legend(loc='upper left', fontsize=11, framealpha=0.9)

plt.tight_layout()
plt.savefig('exp2_sentence_length.png', dpi=300, bbox_inches='tight', facecolor='white')
plt.close()
print("✓ Saved: exp2_sentence_length.png")

✓ Saved: exp2_sentence_length.png


In [22]:
"""
## 7. Experiment 3: Effect of Model Complexity on Stability

**Research Question:** Do complex models (DistilBERT) produce less stable LIME explanations than simple models (Logistic Regression)?

**Hypothesis:** Complex models have more non-linear decision boundaries, making local linear approximations (LIME's surrogate) less accurate and potentially less stable.

| Model | Decision Boundary | Expected Stability |
|-------|-------------------|-------------------|
| Logistic Regression | Linear | High (easy to approximate) |
| DistilBERT | Highly non-linear | Lower (harder to approximate) |

**Setup:**
- Same 15 sentences for both models × 5 seeds
- num_samples = 1000 (fixed)
- 30 LIME runs per sentence

"""

In [23]:
# Make sure transformers and torch are installed for DistilBERT
print(" Installing transformers library...")
!pip install transformers torch -q

print("Transformers installed!")

 Installing transformers library...
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 2.18.0 requires fsspec[http]<=2024.2.0,>=2023.1.0, but you have fsspec 2026.1.0 which is incompatible.
sentence-transformers 5.2.2 requires transformers<6.0.0,>=4.41.0, but you have transformers 4.36.0 which is incompatible.
Transformers installed!


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

class DistilBERTModel:
    """DistilBERT model for sentiment classification"""

    def __init__(self, model_name="distilbert/distilbert-base-uncased-finetuned-sst-2-english"):
        """
        Initialize pre-trained DistilBERT
        """
        print(f" Loading {model_name}...")

        try:
            # Load tokenizer - NO trust_remote_code needed!
            self.tokenizer = AutoTokenizer.from_pretrained(model_name)

            # Load pre-trained model
            self.model = AutoModelForSequenceClassification.from_pretrained(model_name)

        except Exception as e:
            print(f" Error loading {model_name}: {e}")
            print(" Trying alternative model...")

            # Fallback: the same SST-2 checkpoint under its legacy (un-namespaced) Hub ID
            alternative_name = "distilbert-base-uncased-finetuned-sst-2-english"
            self.tokenizer = AutoTokenizer.from_pretrained(alternative_name)
            self.model = AutoModelForSequenceClassification.from_pretrained(alternative_name)

        self.model.eval()
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(self.device)

        print(f" Model loaded on {self.device}")
        self.is_trained = True

    def predict_proba(self, texts):
        """Predict probabilities for text(s)"""
        if isinstance(texts, str):
            texts = [texts]

        inputs = self.tokenizer(
            texts,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        )

        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            outputs = self.model(**inputs)
            probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

        return probs.cpu().numpy()

# Create and test
print(" Creating DistilBERT Model\n")
bert_model = DistilBERTModel()

test_text = "This movie was absolutely terrible and boring"
bert_prob = bert_model.predict_proba([test_text])[0]
bert_pred = "POSITIVE" if bert_prob[1] > 0.5 else "NEGATIVE"
print(f"\n Test: '{test_text}'")
print(f" Prediction: {bert_pred} (confidence: {max(bert_prob):.3f})")
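LIME's text explainer passes the perturbed texts to the prediction function in a single call, so with `num_samples=1000` the `predict_proba` method above tokenizes and runs roughly 1000 texts in one forward pass. If memory becomes a problem, one option is to split the inputs into smaller batches. The following is a minimal sketch, not part of the original notebook: `predict_in_batches` and `batch_size` are hypothetical names, and `predict_fn` stands for any function that maps a list of texts to one row of class probabilities per text.

```python
from typing import Callable, List, Sequence


def predict_in_batches(predict_fn: Callable[[List[str]], Sequence[Sequence[float]]],
                       texts: Sequence[str],
                       batch_size: int = 64) -> List[List[float]]:
    """Apply predict_fn to texts in chunks of batch_size and concatenate the rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rows: List[List[float]] = []
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        rows.extend([list(row) for row in predict_fn(chunk)])
    return rows


# Toy stand-in for a real model so the sketch runs without downloading weights
def toy_predict(batch: List[str]) -> List[List[float]]:
    return [[0.25, 0.75] if "good" in text else [0.75, 0.25] for text in batch]


probs = predict_in_batches(toy_predict, ["good film", "bad film", "good plot"], batch_size=2)
print(probs)
```

With the notebook's model, the wrapped call would look like `np.asarray(predict_in_batches(bert_model.predict_proba, texts))`, since LIME expects a NumPy array of probabilities.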

In [28]:
print("\n" + "="*70)
print(" EXPERIMENT 3: Does model complexity affect stability?")
print("="*70)
print(f" Settings: 15 sentences | 5 seeds | 30 LIME runs | 1000 samples")
print("="*70)

n_sentences = 15
num_samples = 1000
n_runs = 30

models = {
    'Logistic Regression': model,
    'DistilBERT': bert_model
}

exp3_all_seeds = []

for seed in SEEDS:
    print(f"\n{'='*70}")
    print(f" Running with SEED = {seed}")
    print(f"{'='*70}")

    np.random.seed(seed)

    # Same sentences for both models
    exp3_sentences = test_df.sample(n=n_sentences, random_state=seed)

    for model_name, model_obj in models.items():
        print(f"\n   Testing: {model_name}")

        # Create analyzer for this model
        analyzer_exp3 = LIMEStabilityAnalyzer(model_obj)

        top3_list = []
        corr_list = []
        cv_list = []

        for idx, row in tqdm(exp3_sentences.iterrows(), total=len(exp3_sentences),
                             desc=model_name, leave=False):
            explanations = analyzer_exp3.explain_multiple(row['sentence'],
                                                          num_samples=num_samples,
                                                          num_runs=n_runs)

            top3 = metrics_calc.top_k_agreement(explanations, k=3)
            corr = metrics_calc.rank_correlation(explanations)
            cv = metrics_calc.coefficient_of_variation(explanations)

            top3_list.append(top3)
            corr_list.append(corr)
            cv_list.append(cv)

        exp3_all_seeds.append({
            'seed': seed,
            'model': model_name,
            'top3_agreement': np.mean(top3_list),
            'rank_correlation': np.mean(corr_list),
            'coeff_variation': np.mean(cv_list)
        })

        print(f"    Top-3: {np.mean(top3_list):.3f} | Corr: {np.mean(corr_list):.3f} | CV: {np.mean(cv_list):.3f}")

# Aggregate across seeds
exp3_raw_df = pd.DataFrame(exp3_all_seeds)

exp3_results = exp3_raw_df.groupby('model').agg({
    'top3_agreement': ['mean', 'std'],
    'rank_correlation': ['mean', 'std'],
    'coeff_variation': ['mean', 'std']
}).reset_index()

exp3_results.columns = ['model', 'top3_mean', 'top3_std', 'corr_mean', 'corr_std', 'cv_mean', 'cv_std']

print(f"\n{'='*70}")
print(" EXPERIMENT 3 RESULTS (Averaged across 5 seeds):")
print(f"{'='*70}")
print(exp3_results.to_string(index=False))

# Save
exp3_raw_df.to_csv('exp3_raw_results.csv', index=False)
exp3_results.to_csv('exp3_aggregated_results.csv', index=False)


🔬 EXPERIMENT 3: Does model complexity affect stability?
📋 Settings: 15 sentences | 5 seeds | 30 LIME runs | 1000 samples

🌱 Running with SEED = 42

  🤖 Testing: Logistic Regression


    Top-3: 0.945 | Corr: 0.948 | CV: 0.489

  🤖 Testing: DistilBERT


    Top-3: 0.906 | Corr: 0.800 | CV: 0.951

🌱 Running with SEED = 123

  🤖 Testing: Logistic Regression


    Top-3: 0.933 | Corr: 0.956 | CV: 0.588

  🤖 Testing: DistilBERT


    Top-3: 0.818 | Corr: 0.750 | CV: 1.168

🌱 Running with SEED = 456

  🤖 Testing: Logistic Regression


    Top-3: 0.948 | Corr: 0.922 | CV: 0.446

  🤖 Testing: DistilBERT


    Top-3: 0.858 | Corr: 0.810 | CV: 1.059

🌱 Running with SEED = 789

  🤖 Testing: Logistic Regression


    Top-3: 0.987 | Corr: 0.964 | CV: 0.607

  🤖 Testing: DistilBERT


    Top-3: 0.857 | Corr: 0.799 | CV: 0.897

🌱 Running with SEED = 1000

  🤖 Testing: Logistic Regression


    Top-3: 0.925 | Corr: 0.885 | CV: 0.547

  🤖 Testing: DistilBERT


    Top-3: 0.802 | Corr: 0.757 | CV: 0.878

📊 EXPERIMENT 3 RESULTS (Averaged across 5 seeds):
              model  top3_mean  top3_std  corr_mean  corr_std  cv_mean   cv_std
         DistilBERT   0.848153  0.040605   0.783254  0.027490 0.990798 0.121715
Logistic Regression   0.947617  0.024039   0.935147  0.031961 0.535597 0.067268
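Each seed evaluates both models on the same 15 sentences, so the per-seed Top-3 values form natural pairs. One way to use that structure, sketched here as an alternative view rather than the analysis the notebook performs, is to look at the per-seed differences and the corresponding paired t-statistic. The values are copied from the run log above. Only the standard library is used, so no p-value is computed.

```python
from math import sqrt
from statistics import mean, stdev

# Per-seed Top-3 agreement from the run log (seeds 42, 123, 456, 789, 1000)
logreg = [0.945, 0.933, 0.948, 0.987, 0.925]
distilbert = [0.906, 0.818, 0.858, 0.857, 0.802]

diffs = [lr - db for lr, db in zip(logreg, distilbert)]
mean_diff = mean(diffs)
se_diff = stdev(diffs) / sqrt(len(diffs))  # standard error of the mean difference
t_paired = mean_diff / se_diff             # compare against a t distribution with n - 1 dof

print(f"mean difference = {mean_diff:.4f}")
print(f"paired t-statistic = {t_paired:.2f} (dof = {len(diffs) - 1})")
```

Logistic Regression scores higher than DistilBERT in every one of the five seeds, which is the pattern the paired view makes explicit.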


In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

colors = ['#2E86AB', '#F18F01']
model_names = exp3_results['model'].tolist()
x_pos = np.arange(len(model_names))

# Plot 1: Top-3 Agreement
bars1 = axes[0].bar(x_pos, exp3_results['top3_mean'], yerr=exp3_results['top3_std'],
                    color=colors, edgecolor='black', linewidth=1.2, capsize=5)
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels(model_names)
axes[0].set_ylabel('Top-3 Agreement', fontsize=12)
axes[0].set_title('Top-3 Agreement by Model', fontsize=14, fontweight='bold')
axes[0].set_ylim(0, 1.1)
axes[0].axhline(y=0.9, color='green', linestyle='--', alpha=0.5)

for bar, val in zip(bars1, exp3_results['top3_mean']):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
                 f'{val:.3f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

# Plot 2: Rank Correlation
bars2 = axes[1].bar(x_pos, exp3_results['corr_mean'], yerr=exp3_results['corr_std'],
                    color=colors, edgecolor='black', linewidth=1.2, capsize=5)
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(model_names)
axes[1].set_ylabel('Rank Correlation', fontsize=12)
axes[1].set_title('Rank Correlation by Model', fontsize=14, fontweight='bold')
axes[1].set_ylim(0, 1.1)
axes[1].axhline(y=0.9, color='green', linestyle='--', alpha=0.5)

for bar, val in zip(bars2, exp3_results['corr_mean']):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
                 f'{val:.3f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

# Plot 3: Coefficient of Variation
bars3 = axes[2].bar(x_pos, exp3_results['cv_mean'], yerr=exp3_results['cv_std'],
                    color=colors, edgecolor='black', linewidth=1.2, capsize=5)
axes[2].set_xticks(x_pos)
axes[2].set_xticklabels(model_names)
axes[2].set_ylabel('Coefficient of Variation', fontsize=12)
axes[2].set_title('Score Variation by Model', fontsize=14, fontweight='bold')
axes[2].axhline(y=0.5, color='green', linestyle='--', alpha=0.5)

for bar, val in zip(bars3, exp3_results['cv_mean']):
    axes[2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
                 f'{val:.3f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.suptitle('EXPERIMENT 3: Effect of Model Complexity on LIME Stability (5 seeds)',
             fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('exp3_model_vs_stability.png', dpi=300, bbox_inches='tight')
plt.show()

print(" Visualization saved to 'exp3_model_vs_stability.png'")

In [30]:
fig3, ax3 = plt.subplots(figsize=(10, 6))

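# Data: read the aggregated Exp 3 results instead of hardcoding them
models = ['Logistic Regression\n(Simple Model)', 'DistilBERT\n(Complex Model)']
_exp3 = exp3_results.set_index('model').loc[['Logistic Regression', 'DistilBERT']]
agreement_model = _exp3['top3_mean'].round(3).tolist()
std_model = _exp3['top3_std'].round(3).tolist()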

# Create horizontal bullet chart style
y_pos = [1.5, 0.5]
bar_height = 0.6

# Background bars (full scale to 1.0)
ax3.barh(y_pos, [1.0, 1.0], height=bar_height, color=COLORS['light_gray'],
         edgecolor=COLORS['gray'], linewidth=2)

# Actual value bars
colors_model = [COLORS['green'], COLORS['red']]
for i, (y, val, std, color) in enumerate(zip(y_pos, agreement_model, std_model, colors_model)):
    ax3.barh(y, val, height=bar_height, color=color, edgecolor='white', linewidth=2)

    # Add value label inside bar
    ax3.text(val - 0.03, y, f'{val:.3f}', ha='right', va='center',
             fontsize=16, fontweight='bold', color='white')

    # Add unreliability label outside
    unreliable = (1 - val) * 100
    ax3.text(val + 0.02, y, f'({unreliable:.1f}% unreliable)', ha='left', va='center',
             fontsize=12, fontweight='bold',
             color=COLORS['red'] if unreliable > 10 else COLORS['gray'])

# Model labels
ax3.set_yticks(y_pos)
ax3.set_yticklabels(models, fontsize=13, fontweight='bold')

# Add drop annotation with arrow between bars
ax3.annotate('', xy=(0.848, 0.85), xytext=(0.947, 1.15),
            arrowprops=dict(arrowstyle='->', color=COLORS['red'], lw=3,
                           connectionstyle='arc3,rad=-0.2'))
ax3.text(0.85, 1.0, '~10%\nDROP!', ha='center', va='center', fontsize=14,
         fontweight='bold', color=COLORS['red'],
         bbox=dict(boxstyle='round,pad=0.4', facecolor='#FADBD8', edgecolor=COLORS['red'], linewidth=2))

# Threshold markers
ax3.axvline(x=0.95, color=COLORS['orange'], linestyle=':', linewidth=2, label='Acceptable (0.95)')
ax3.axvline(x=1.0, color=COLORS['green'], linestyle='--', linewidth=2, label='Perfect (1.0)')

# Styling
ax3.set_xlim(0.75, 1.08)
ax3.set_ylim(0, 2)
ax3.set_xlabel('Top-3 Agreement (Reliability)', fontsize=14, fontweight='bold')
ax3.set_title('Exp 3: Complex Models Significantly Reduce LIME Reliability\n(p = 0.0009, t-test)',
              fontsize=16, fontweight='bold', color=COLORS['dark_blue'], pad=15)
ax3.legend(loc='lower right', fontsize=10)

# Remove y-axis spine
ax3.spines['left'].set_visible(False)
ax3.tick_params(axis='y', length=0)

plt.tight_layout()
plt.savefig('exp3_model_complexity.png', dpi=300, bbox_inches='tight', facecolor='white')
plt.close()
print("✓ Saved: exp3_model_complexity.png")



✓ Saved: exp3_model_complexity.png


In [32]:
print("="*70)
print("🔬 EXPERIMENT 4: Feature Effects Stability Analysis")
print("="*70)
print("Analyzing stability of effect MAGNITUDES, not just rankings")
print("="*70)

# This addresses professor's feedback about "local feature effects"
# We analyze if the actual importance SCORES are stable, not just which words rank highest

def analyze_feature_effects(explanations):
    """
    Analyze stability of feature effect magnitudes across LIME runs.

    Returns:
    - sign_consistency: % of runs where word has same sign (positive/negative)
    - magnitude_stability: How consistent are the actual coefficient values
    """
    all_words = set()
    for exp in explanations:
        all_words.update(exp.keys())

    results = {}
    for word in all_words:
        scores = [exp[word] for exp in explanations if word in exp]
        if len(scores) < 2:
            continue

        # Sign consistency: do all runs agree on positive/negative effect?
        signs = [1 if s > 0 else -1 for s in scores]
        sign_consistency = max(signs.count(1), signs.count(-1)) / len(signs)

        # Magnitude stability: coefficient of variation of absolute scores
        abs_scores = [abs(s) for s in scores]
        if np.mean(abs_scores) > 0:
            magnitude_cv = np.std(abs_scores) / np.mean(abs_scores)
        else:
            magnitude_cv = 0

        results[word] = {
            'sign_consistency': sign_consistency,
            'magnitude_cv': magnitude_cv,
            'mean_score': np.mean(scores),
            'std_score': np.std(scores)
        }

    return results

# Run feature effects analysis on sample sentences
n_sentences_fe = 20
num_samples = 1000
n_runs = 30

fe_results_all = []

for seed in SEEDS[:3]:  # Use 3 seeds for faster computation
    print(f"\n Seed = {seed}")
    np.random.seed(seed)

    sample_sentences = test_df.sample(n=n_sentences_fe, random_state=seed)

    for idx, row in tqdm(sample_sentences.iterrows(), total=len(sample_sentences)):
        explanations = analyzer.explain_multiple(row['sentence'],
                                                 num_samples=num_samples,
                                                 num_runs=n_runs)

        fe_analysis = analyze_feature_effects(explanations)

        # Aggregate per sentence
        if fe_analysis:
            avg_sign_consistency = np.mean([v['sign_consistency'] for v in fe_analysis.values()])
            avg_magnitude_cv = np.mean([v['magnitude_cv'] for v in fe_analysis.values()])

            fe_results_all.append({
                'seed': seed,
                'sentence_length': len(row['sentence'].split()),
                'sign_consistency': avg_sign_consistency,
                'magnitude_cv': avg_magnitude_cv
            })

fe_df = pd.DataFrame(fe_results_all)

print(f"\n{'='*70}")
print("📊 FEATURE EFFECTS STABILITY RESULTS:")
print(f"{'='*70}")
print(f"\nSign Consistency (do words keep same +/- direction?):")
print(f"   Mean: {fe_df['sign_consistency'].mean():.3f} ± {fe_df['sign_consistency'].std():.3f}")
print(f"\nMagnitude CV (how stable are actual scores?):")
print(f"   Mean: {fe_df['magnitude_cv'].mean():.3f} ± {fe_df['magnitude_cv'].std():.3f}")

🔬 EXPERIMENT 4: Feature Effects Stability Analysis
Analyzing stability of effect MAGNITUDES, not just rankings

🌱 Seed = 42



🌱 Seed = 123



🌱 Seed = 456



📊 FEATURE EFFECTS STABILITY RESULTS:

Sign Consistency (do words keep same +/- direction?):
   Mean: 0.962 ± 0.056

Magnitude CV (how stable are actual scores?):
   Mean: 0.147 ± 0.095
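To make the two Experiment 4 metrics concrete, here is a small worked example for a single word. The scores are made up for illustration (they are not taken from any run), and the helper `word_stability` is a hypothetical name that mirrors the per-word logic of `analyze_feature_effects` using only the standard library.

```python
from statistics import mean, pstdev


def word_stability(scores):
    """Return (sign_consistency, magnitude_cv) for one word's LIME scores."""
    # Same convention as the notebook: a score of exactly 0 counts as negative
    signs = [1 if s > 0 else -1 for s in scores]
    sign_consistency = max(signs.count(1), signs.count(-1)) / len(signs)

    abs_scores = [abs(s) for s in scores]
    avg = mean(abs_scores)
    # pstdev is the population standard deviation, matching np.std's default (ddof=0)
    magnitude_cv = pstdev(abs_scores) / avg if avg > 0 else 0.0
    return sign_consistency, magnitude_cv


# Hypothetical scores for one word across four LIME runs
toy_scores = [0.20, 0.25, -0.05, 0.30]
sign_consistency, magnitude_cv = word_stability(toy_scores)
print(f"sign consistency = {sign_consistency:.2f}, magnitude CV = {magnitude_cv:.3f}")
```

In this toy case three of four runs agree on a positive sign, while the absolute scores vary by almost half of their mean, which illustrates how a word can be directionally stable yet fluctuate in magnitude.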


In [None]:
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Sign Consistency Distribution
axes[0].hist(fe_df['sign_consistency'], bins=20, color='#2E86AB', edgecolor='black', alpha=0.7)
axes[0].axvline(x=fe_df['sign_consistency'].mean(), color='red', linestyle='--',
                label=f"Mean: {fe_df['sign_consistency'].mean():.3f}")
axes[0].set_xlabel('Sign Consistency', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Sign Consistency Across Sentences', fontsize=14, fontweight='bold')
axes[0].legend()

# Plot 2: Magnitude CV Distribution
axes[1].hist(fe_df['magnitude_cv'], bins=20, color='#F18F01', edgecolor='black', alpha=0.7)
axes[1].axvline(x=fe_df['magnitude_cv'].mean(), color='red', linestyle='--',
                label=f"Mean: {fe_df['magnitude_cv'].mean():.3f}")
axes[1].set_xlabel('Magnitude Coefficient of Variation', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Distribution of Magnitude Stability Across Sentences', fontsize=14, fontweight='bold')
axes[1].legend()

plt.suptitle('EXPERIMENT 4: Feature Effects Stability Analysis', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('exp4_feature_effects_stability.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n Feature effects analysis complete!")
print("   This addresses the question: Are effect MAGNITUDES stable, not just rankings?")


In [34]:
fig4, axes4 = plt.subplots(1, 2, figsize=(12, 5))

# Left: Sign Consistency (Donut)
ax4a = axes4[0]
sign_consistent = fe_df['sign_consistency'].mean() * 100  # mean sign consistency, in percent
sign_inconsistent = 100 - sign_consistent

# Create donut chart
colors_donut1 = [COLORS['green'], COLORS['light_gray']]
wedges1, texts1 = ax4a.pie([sign_consistent, sign_inconsistent], colors=colors_donut1,
                           startangle=90, wedgeprops=dict(width=0.4, edgecolor='white', linewidth=3))

# Center text
ax4a.text(0, 0, f'{sign_consistent:.1f}%', ha='center', va='center',
          fontsize=32, fontweight='bold', color=COLORS['green'])
ax4a.text(0, -0.15, 'Consistent', ha='center', va='top',
          fontsize=12, color=COLORS['dark'])

ax4a.set_title('Sign Consistency\n(Direction of Effect)', fontsize=14, fontweight='bold',
               color=COLORS['dark_blue'], pad=10)

# Add label
ax4a.text(0, -0.6, ' GOOD: Words keep their +/- direction', ha='center',
          fontsize=11, fontweight='bold', color=COLORS['green'],
          bbox=dict(boxstyle='round,pad=0.4', facecolor='#D5F5E3', edgecolor=COLORS['green']))

# Right: Magnitude Variation (Donut showing CV)
ax4b = axes4[1]
cv_magnitude = fe_df['magnitude_cv'].mean() * 100  # mean magnitude CV, expressed as a percentage
cv_rest = 100 - cv_magnitude

# Create donut chart - but showing this as "variation" which is bad
colors_donut2 = [COLORS['orange'], COLORS['light_gray']]
wedges2, texts2 = ax4b.pie([cv_magnitude, cv_rest], colors=colors_donut2,
                           startangle=90, wedgeprops=dict(width=0.4, edgecolor='white', linewidth=3))

# Center text
ax4b.text(0, 0, f'{cv_magnitude:.1f}%', ha='center', va='center',
          fontsize=32, fontweight='bold', color=COLORS['orange'])
ax4b.text(0, -0.15, 'Variation (CV)', ha='center', va='top',
          fontsize=12, color=COLORS['dark'])

ax4b.set_title('Magnitude Variation\n(Importance Scores)', fontsize=14, fontweight='bold',
               color=COLORS['dark_blue'], pad=10)

# Add label
ax4b.text(0, -0.6, '⚠ CAUTION: Importance values fluctuate', ha='center',
          fontsize=11, fontweight='bold', color=COLORS['orange'],
          bbox=dict(boxstyle='round,pad=0.4', facecolor='#FCF3CF', edgecolor=COLORS['orange']))

# Overall title
fig4.suptitle('Exp 4: Feature Effects Are Directionally Stable, But Magnitudes Vary',
              fontsize=16, fontweight='bold', color=COLORS['dark_blue'], y=1.02)

plt.tight_layout()
plt.savefig('exp4_feature_effects.png', dpi=300, bbox_inches='tight', facecolor='white')
plt.close()
print(" Saved: exp4_feature_effects.png")


 Saved: exp4_feature_effects.png
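
The two percentages in the figure above are entered by hand. As a minimal sketch of how such values could be derived from repeated LIME runs, the snippet below computes sign consistency and the mean coefficient of variation from a matrix of weights. The function name `sign_consistency_and_cv`, the array layout (runs by features), and the demo numbers are assumptions for illustration, not the notebook's actual Exp 4 code or data.

```python
import numpy as np

def sign_consistency_and_cv(weights):
    """Summarise stability of LIME weights.

    weights: array-like of shape (n_runs, n_features), one row per LIME run
    on the same text (hypothetical layout).
    Returns (sign_consistency_percent, mean_cv_percent).
    """
    weights = np.asarray(weights, dtype=float)

    # A feature counts as sign-consistent if every run gives it the same sign
    signs = np.sign(weights)
    consistent = np.all(signs == signs[0], axis=0)
    sign_pct = 100.0 * consistent.mean()

    # Per-feature CV: sample standard deviation divided by |mean|.
    # Features whose mean weight is exactly zero would need special handling.
    mean_abs = np.abs(weights.mean(axis=0))
    std = weights.std(axis=0, ddof=1)
    cv_pct = 100.0 * np.mean(std / mean_abs)
    return sign_pct, cv_pct

# Hypothetical weights: 4 runs x 3 words (the third word flips sign once)
demo_weights = np.array([
    [0.40, -0.20,  0.05],
    [0.44, -0.18, -0.01],
    [0.36, -0.22,  0.03],
    [0.40, -0.20,  0.05],
])
sign_pct, cv_pct = sign_consistency_and_cv(demo_weights)
print(f"Sign consistency: {sign_pct:.1f}% | Mean CV: {cv_pct:.1f}%")
```

Computing the values this way keeps the figure in sync with the experiment data when the runs are repeated.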


In [39]:
from scipy.stats import ttest_ind, f_oneway, pearsonr

print("="*70)
print(" STATISTICAL SIGNIFICANCE TESTS")
print("="*70)

# Test 1: Is Logistic Regression significantly more stable than DistilBERT?
print("\n" + "-"*70)
print("TEST 1: Logistic Regression vs DistilBERT Stability")
print("-"*70)

logreg_top3 = exp3_raw_df[exp3_raw_df['model']=='Logistic Regression']['top3_agreement'].values
bert_top3 = exp3_raw_df[exp3_raw_df['model']=='DistilBERT']['top3_agreement'].values

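# Student's t-test: scipy's default (equal_var=True) assumes both groups share one variance.
# Pass equal_var=False for Welch's test if that assumption is doubtful.
t_stat, p_value = ttest_ind(logreg_top3, bert_top3)
# Note: ndarray.std() defaults to ddof=0 (population standard deviation)
print(f"   Logistic Regression Top-3: {logreg_top3.mean():.3f} ± {logreg_top3.std():.3f}")
print(f"   DistilBERT Top-3: {bert_top3.mean():.3f} ± {bert_top3.std():.3f}")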
print(f"   t-statistic: {t_stat:.3f}")
print(f"   p-value: {p_value:.4f}")
print(f"   Result: {' SIGNIFICANT (p < 0.05)' if p_value < 0.05 else ' Not significant'}")

# Test 2: Does num_samples significantly affect stability? (ANOVA)
print("\n" + "-"*70)
print("TEST 2: Effect of num_samples (One-way ANOVA)")
print("-"*70)

groups = [exp1_raw_df[exp1_raw_df['num_samples']==ns]['top3_agreement'].values
          for ns in [100, 250, 500, 1000, 2000]]
f_stat, p_value_anova = f_oneway(*groups)
print(f"   F-statistic: {f_stat:.3f}")
print(f"   p-value: {p_value_anova:.6f}")
print(f"   Result: {' SIGNIFICANT (p < 0.05)' if p_value_anova < 0.05 else ' Not significant'}")

# Test 3: Correlation between num_samples and stability
print("\n" + "-"*70)
print("TEST 3: Correlation between num_samples and Top-3 Agreement")
print("-"*70)

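# Pearson's r measures linear association between num_samples and agreement
corr, p_corr = pearsonr(exp1_raw_df['num_samples'], exp1_raw_df['top3_agreement'])
print(f"   Pearson correlation: {corr:.3f}")
print(f"   p-value: {p_corr:.6f}")
# Report the direction from the sign of r instead of assuming it is positive
direction = 'positive' if corr > 0 else 'negative'
print(f"   Result: {' SIGNIFICANT' if p_corr < 0.05 else ' Not significant'} {direction} correlation")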

 STATISTICAL SIGNIFICANCE TESTS

----------------------------------------------------------------------
TEST 1: Logistic Regression vs DistilBERT Stability
----------------------------------------------------------------------
   Logistic Regression Top-3: 0.948 ± 0.022
   DistilBERT Top-3: 0.848 ± 0.036
   t-statistic: 4.713
   p-value: 0.0015
   Result:  SIGNIFICANT (p < 0.05)

----------------------------------------------------------------------
TEST 2: Effect of num_samples (One-way ANOVA)
----------------------------------------------------------------------
   F-statistic: 39.649
   p-value: 0.000000
   Result:  SIGNIFICANT (p < 0.05)

----------------------------------------------------------------------
TEST 3: Correlation between num_samples and Top-3 Agreement
----------------------------------------------------------------------
   Pearson correlation: 0.754
   p-value: 0.000014
   Result:  SIGNIFICANT positive correlation
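
Test 1 reports only a p-value from the equal-variance t-test. As a hedged sketch, the snippet below shows how Welch's t-test (which does not assume equal variances) and Cohen's d (an effect size) could be reported alongside it. The helper name `welch_and_cohens_d` and the two score lists are hypothetical and are not the notebook's `exp3_raw_df` data.

```python
import numpy as np
from scipy.stats import ttest_ind

def welch_and_cohens_d(a, b):
    """Return (t_statistic, p_value, cohens_d) for two independent samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    # Welch's t-test: equal_var=False drops the equal-variance assumption
    t_stat, p_value = ttest_ind(a, b, equal_var=False)

    # Cohen's d with the pooled sample standard deviation (ddof=1)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    cohens_d = (a.mean() - b.mean()) / np.sqrt(pooled_var)
    return float(t_stat), float(p_value), float(cohens_d)

# Hypothetical Top-3 agreement scores for two models
group_a = [0.95, 0.93, 0.97, 0.92, 0.96]
group_b = [0.85, 0.80, 0.88, 0.83, 0.87]

t_welch, p_welch, d = welch_and_cohens_d(group_a, group_b)
print(f"Welch t = {t_welch:.3f}, p = {p_welch:.4f}, Cohen's d = {d:.2f}")
```

With only a handful of observations per group, the effect size is arguably more informative than the p-value alone.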


In [38]:
print("\n" + "-"*70)
print("TEST 4: Effect of Sentence Length (One-way ANOVA)")
print("-"*70)

short_scores = exp2_raw_df[exp2_raw_df['group']=='Short (≤7 words)']['top3_agreement'].values
medium_scores = exp2_raw_df[exp2_raw_df['group']=='Medium (8-15 words)']['top3_agreement'].values
long_scores = exp2_raw_df[exp2_raw_df['group']=='Long (>15 words)']['top3_agreement'].values

f_stat_len, p_value_len = f_oneway(short_scores, medium_scores, long_scores)
print(f"   Short: {short_scores.mean():.3f} | Medium: {medium_scores.mean():.3f} | Long: {long_scores.mean():.3f}")
print(f"   F-statistic: {f_stat_len:.3f}")
print(f"   p-value: {p_value_len:.4f}")
print(f"   Result: {'SIGNIFICANT (p < 0.05)' if p_value_len < 0.05 else 'Not significant'}")

print("\n" + "="*70)
print(" STATISTICAL TESTS SUMMARY")
print("="*70)


----------------------------------------------------------------------
TEST 4: Effect of Sentence Length (One-way ANOVA)
----------------------------------------------------------------------
   Short: 0.942 | Medium: 0.976 | Long: 0.957
   F-statistic: 4.043
   p-value: 0.0455
   Result: SIGNIFICANT (p < 0.05)

 STATISTICAL TESTS SUMMARY
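
A significant one-way ANOVA, as in Test 4, says that at least one group mean differs but not which one. One possible follow-up, sketched here under assumptions, is to run pairwise Welch t-tests with a Bonferroni correction. The helper `pairwise_bonferroni`, the group labels, and the scores below are hypothetical and do not come from `exp2_raw_df`.

```python
from itertools import combinations
from scipy.stats import ttest_ind

def pairwise_bonferroni(groups):
    """Pairwise Welch t-tests with Bonferroni-adjusted p-values.

    groups: dict mapping a group label to a list of scores.
    Returns a dict mapping (label_a, label_b) to the adjusted p-value.
    """
    pairs = list(combinations(groups, 2))
    adjusted = {}
    for name_a, name_b in pairs:
        _, p = ttest_ind(groups[name_a], groups[name_b], equal_var=False)
        # Bonferroni: multiply by the number of comparisons, cap at 1
        adjusted[(name_a, name_b)] = min(1.0, float(p) * len(pairs))
    return adjusted

# Hypothetical Top-3 agreement scores per sentence-length group
demo_groups = {
    "short":  [0.93, 0.95, 0.94, 0.92, 0.96],
    "medium": [0.98, 0.97, 0.99, 0.96, 0.98],
    "long":   [0.95, 0.96, 0.97, 0.94, 0.96],
}

adjusted_p = pairwise_bonferroni(demo_groups)
for (name_a, name_b), p_adj in adjusted_p.items():
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"{name_a} vs {name_b}: adjusted p = {p_adj:.4f} ({verdict})")
```

Bonferroni is conservative; it is used here only because it is simple to verify by hand.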
