# Statistical Comparison of Llama-2-70B vs Flan-T5-XXL Performance

This analysis compares task-level performance scores between two large language models:
- **Llama-2-70B-chat-hf** (Meta)
- **Flan-T5-XXL** (Google)

The evaluation covers **46 different benchmark tasks** across multiple domains including:
- Big Bench Hard (BBH) tasks (reasoning, logic, language understanding)
- GPQA (Graduate-level science questions)
- IFEval (Instruction following)
- MATH (Mathematical problem solving)
- MMLU Pro (Multitask language understanding)
- MuSR (Multistep soft reasoning)

In [2]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Data Loading

In [3]:
# Task-level scores for Llama-2-70B
# Each value is the normalized accuracy on one benchmark task
llama = np.array([
    0.302204, 0.724000, 0.540107, 0.212000, 0.304000, 0.472000,
    0.004000, 0.552000, 0.200000, 0.152000, 0.336000, 0.168000,
    0.420000, 0.076000, 0.150685, 0.096000, 0.188000, 0.140000,
    0.483146, 0.628000, 0.276000, 0.208000, 0.152000, 0.332000,
    0.488000, 0.264262, 0.247475, 0.269231, 0.265625,
    0.443623, 0.547962, 0.534196, 0.631894,
    0.029456, 0.048860, 0.000000, 0.000000, 0.017857, 0.025974,
    0.067358, 0.014815,
    0.243268, 0.367725, 0.516000, 0.250000, 0.340000
])

# Task-level scores for Flan-T5-XXL
flan = np.array([
    0.503906, 0.516000, 0.604278, 0.540000, 0.676000, 0.548000,
    0.180000, 0.704000, 0.552000, 0.608000, 0.712000, 0.608000,
    0.608000, 0.420000, 0.417808, 0.600000, 0.552000, 0.504000,
    0.764045, 0.700000, 0.292000, 0.152000, 0.112000, 0.268000,
    0.520000, 0.270134, 0.257576, 0.278388, 0.265625,
    0.157116, 0.282974, 0.158965, 0.282974,
    0.010574, 0.019544, 0.016260, 0.000000, 0.000000, 0.025974,
    0.010363, 0.000000,
    0.234292, 0.420635, 0.516000, 0.281250, 0.468000
])


In [4]:
task_names = [
    'BBH', 'BBH Boolean Expr', 'BBH Causal Judge', 'BBH Date Understand',
    'BBH Disambiguation', 'BBH Formal Fallacies', 'BBH Geometric Shapes',
    'BBH Hyperbaton', 'BBH Logic Deduct 5obj', 'BBH Logic Deduct 7obj',
    'BBH Logic Deduct 3obj', 'BBH Movie Rec', 'BBH Navigate', 'BBH Object Count',
    'BBH Penguins Table', 'BBH Colored Objects', 'BBH Ruin Names',
    'BBH Translation Error', 'BBH Snarks', 'BBH Sports', 'BBH Temporal Seq',
    'BBH Track Shuffle 5obj', 'BBH Track Shuffle 7obj', 'BBH Track Shuffle 3obj',
    'BBH Web of Lies', 'GPQA', 'GPQA Diamond', 'GPQA Extended', 'GPQA Main',
    'IFEval Prompt Strict', 'IFEval Inst Strict', 'IFEval Prompt Loose',
    'IFEval Inst Loose', 'MATH Hard', 'MATH Algebra', 'MATH Counting/Prob',
    'MATH Geometry', 'MATH Intermediate', 'MATH Number Theory', 'MATH Prealgebra',
    'MATH Precalculus', 'MMLU Pro', 'MuSR', 'MuSR Murder Mysteries',
    'MuSR Object Placement', 'MuSR Team Allocation'
]

print(f"Total tasks compared: {len(llama)}")

Total tasks compared: 46
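The two score arrays are aligned with `task_names` purely by position, so a length check and a per-task difference view can catch misalignment early. The following is a minimal sketch, not part of the original notebook: the helper name `task_differences` is hypothetical, and the four tasks shown are a subset copied from the arrays above.

```python
def task_differences(names, scores_a, scores_b):
    """Map each task name to (scores_b - scores_a), checking alignment first."""
    if not (len(names) == len(scores_a) == len(scores_b)):
        raise ValueError("names and both score lists must have the same length")
    return {name: b - a for name, a, b in zip(names, scores_a, scores_b)}

# Subset of the data above (same order in all three lists)
subset_names = ["BBH Boolean Expr", "BBH Geometric Shapes",
                "GPQA Main", "IFEval Inst Loose"]
llama_subset = [0.724000, 0.004000, 0.265625, 0.631894]
flan_subset = [0.516000, 0.180000, 0.265625, 0.282974]

diffs = task_differences(subset_names, llama_subset, flan_subset)

# Print tasks from "Llama ahead" (negative) to "Flan ahead" (positive)
for name, diff in sorted(diffs.items(), key=lambda item: item[1]):
    print(f"{name:<22} Flan - Llama = {diff:+.3f}")
```

Applied to the full arrays, the same helper would list all 46 differences, which are the quantities the paired test in Section 3 operates on.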


## 2. Descriptive Statistics

In [5]:
print("DESCRIPTIVE STATISTICS")


print(f"\nLlama-2-70B:")
print(f"  Mean accuracy: {llama.mean()} ({llama.mean()*100}%)")
print(f"  Median accuracy: {np.median(llama)}")
print(f"  Std deviation: {llama.std(ddof=1)}")
print(f"  Min: {llama.min()}, Max: {llama.max()}")
print(f"  Range: {llama.max() - llama.min()}")

print(f"\nFlan-T5-XXL:")
print(f"  Mean accuracy: {flan.mean()} ({flan.mean()*100}%)")
print(f"  Median accuracy: {np.median(flan)}")
print(f"  Std deviation: {flan.std(ddof=1)}")
print(f"  Min: {flan.min()}, Max: {flan.max()}")
print(f"  Range: {flan.max() - flan.min()}")

print(f"\nRaw difference (Flan - Llama):")
print(f"  Mean: {flan.mean() - llama.mean()} ({(flan.mean() - llama.mean())*100} percentage points)")

DESCRIPTIVE STATISTICS

Llama-2-70B:
  Mean accuracy: 0.2767331086956522 (27.67331086956522%)
  Median accuracy: 0.257131
  Std deviation: 0.19691738024756902
  Min: 0.0, Max: 0.724
  Range: 0.724

Flan-T5-XXL:
  Mean accuracy: 0.36127567391304344 (36.127567391304346%)
  Median accuracy: 0.354904
  Std deviation: 0.2347930886058384
  Min: 0.0, Max: 0.764045
  Range: 0.764045

Raw difference (Flan - Llama):
  Mean: 0.08454256521739123 (8.454256521739122 percentage points)
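The two print blocks above repeat the same calculations for each model, so they could be folded into one helper. This is a sketch under assumptions: `summarize` is a hypothetical name, and it uses the standard-library `statistics` module rather than NumPy so that it runs on its own. `statistics.stdev` is the sample standard deviation, matching `std(ddof=1)` in the cell above. The demo input is the four Llama-2-70B IFEval scores from the data-loading cell.

```python
import statistics

def summarize(scores):
    """Return the summary statistics reported above for one list of scores."""
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "std": statistics.stdev(scores),  # sample std, like np.std(ddof=1)
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
    }

# Llama-2-70B IFEval scores copied from the data-loading cell
llama_ifeval = [0.443623, 0.547962, 0.534196, 0.631894]
summary = summarize(llama_ifeval)

for key, value in summary.items():
    print(f"{key:>6}: {value:.4f}")
```

Rounding in the format string (`:.4f`) also keeps the printed values readable compared with full floating-point precision.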


## 3. Statistical Hypothesis Testing

We use a **paired t-test** because:
- The same 46 tasks are evaluated for both models, so the observations are paired
- We want to test whether the mean per-task difference in performance is zero
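A paired t-test is equivalent to a one-sample t-test on the per-task differences: the t-statistic is the mean difference divided by its standard error, with n - 1 degrees of freedom. The sketch below spells that out with the standard library only. The function name `paired_t_statistic` and the toy scores are illustrative assumptions, not data from this analysis.

```python
import math
import statistics

def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for paired samples, using first - second."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample std, ddof=1)
    std_error = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / std_error, n - 1

# Toy scores (hypothetical): the second model is ahead on every task
toy_first = [0.2, 0.4, 0.6, 0.8]
toy_second = [0.3, 0.6, 0.7, 1.0]

t_toy, df_toy = paired_t_statistic(toy_first, toy_second)
print(f"t = {t_toy:.4f}, df = {df_toy}")
```

The sign follows the argument order (first minus second), so a negative t-statistic means the second sample scored higher on average.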

**Hypotheses:**
- H₀ (Null): The mean per-task difference in performance between the models is zero
- H₁ (Alternative): The mean per-task difference in performance is not zero (two-sided)
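The t-test assumes the per-task differences are roughly normally distributed. One way to probe the same null hypothesis without that assumption is a sign-flip permutation test: under H₀ each difference is equally likely to be positive or negative, so the observed mean is compared against the means obtained by flipping signs. The sketch below is illustrative and not part of the original analysis. `sign_flip_p_value` is a hypothetical helper, the differences are toy values, and it enumerates every sign pattern, which is only feasible for small samples (for 46 tasks, random sampling of sign patterns would be needed instead).

```python
from itertools import product

def sign_flip_p_value(diffs, tol=1e-12):
    """Exact two-sided sign-flip p-value for the mean of paired differences."""
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Enumerate all 2**n ways of flipping the sign of each difference
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - tol:
            extreme += 1
    return extreme / total

# Toy per-task differences (hypothetical), all in the same direction
toy_diffs = [-0.1, -0.2, -0.1, -0.2]
p_toy = sign_flip_p_value(toy_diffs)
print(f"sign-flip p-value: {p_toy}")
```

With only four pairs, the smallest two-sided p-value this test can produce is 2/16, which is why small samples cannot reach significance at the 0.05 level under this approach.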

In [6]:
t_stat, p_value = stats.ttest_rel(llama, flan)

print("\n Statistical Comparison Results ")
print(f"\n  - The calculated t-statistic is: {t_stat}")
print(f"  - The p-value obtained is: {p_value}")
print(f"  - Our significance level (alpha) is set at: 0.05")

if p_value < 0.05:
    print(f"✓ Conclusion: With a p-value of {p_value} (which is less than 0.05), we can confidently say that there IS a statistically significant difference in performance between the two models.")
    print(f"  This means it's unlikely this difference happened by chance. Flan-T5-XXL tends to perform differently than Llama-2-70B on these tasks.")
else:
    print(f"✘ Conclusion: Our p-value of {p_value} is not lower than our significance level (0.05).")
    print(f"  Therefore, we do NOT have enough evidence to claim a statistically significant difference in performance between Llama-2-70B and Flan-T5-XXL based on this test.")


 Statistical Comparison Results 

  - The calculated t-statistic is: -2.7125787869472053
  - The p-value obtained is: 0.009424124700017885
  - Our significance level (alpha) is set at: 0.05
✓ Conclusion: With a p-value of 0.009424124700017885 (which is less than 0.05), we can confidently say that there IS a statistically significant difference in performance between the two models.
  This means it's unlikely this difference happened by chance. Flan-T5-XXL tends to perform differently than Llama-2-70B on these tasks.
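A p-value indicates whether a difference is detectable, not how large it is. For a paired design, a standardized effect size (Cohen's d for paired samples, often written d_z) can be recovered directly from the reported numbers as t / √n, and the standard error of the mean difference as mean difference / t. The sketch below uses only the values printed above. The variable names are illustrative, and the mean difference is negated to match the Llama-minus-Flan order used by `stats.ttest_rel(llama, flan)`.

```python
import math

# Values reported in the outputs above
t_stat = -2.7125787869472053        # from stats.ttest_rel(llama, flan)
n_tasks = 46
mean_diff = -0.08454256521739123    # Llama - Flan (negated "Flan - Llama" mean)

# Cohen's d for paired samples: mean difference / std of differences = t / sqrt(n)
cohens_dz = t_stat / math.sqrt(n_tasks)

# Standard error of the mean difference, and the std of the differences
std_error = mean_diff / t_stat
sd_diff = std_error * math.sqrt(n_tasks)

print(f"Cohen's d_z:            {cohens_dz:.3f}")
print(f"SE of mean difference:  {std_error:.4f}")
print(f"Std of differences:     {sd_diff:.4f}")
```

A d_z of about -0.4 falls between Cohen's conventional "small" (0.2) and "medium" (0.5) thresholds, in favor of Flan-T5-XXL. Those thresholds are rules of thumb and should be read alongside the 8.45 percentage point raw difference.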
