# LLM-as-a-Judge Inter-rater Reliability Analysis

## Framework Validation: Human-LLM Agreement Study

This notebook presents the second component of the framework validation: assessing the reliability of LLM-based evaluation compared to human expert judgments. Following the human-to-human reliability analysis, this experiment evaluates whether LLMs can consistently assess job listing quality according to expert-defined standards.

### Research Context
- **Objective**: Validate LLM-as-a-judge reliability against human expert evaluations
- **Models**: GPT-4o and Claude-3.5-Sonnet as independent evaluators
- **Methodology**: Gwet's $AC_1$ coefficient and raw agreement analysis
- **Significance**: Establishes automated evaluation capabilities for the framework
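
For reference, the two-rater, two-category form of Gwet's $AC_1$ used with binarized ratings can be written as below. The symbols $p_a$, $p_e$, $\pi_1$, $\pi_2$, and $\pi$ are notation introduced here for illustration and do not appear in the notebook's code.

$$
AC_1 = \frac{p_a - p_e}{1 - p_e}, \qquad p_e = 2\pi(1 - \pi), \qquad \pi = \frac{\pi_1 + \pi_2}{2}
$$

Here $p_a$ is the proportion of listings that both raters classify identically (the raw agreement), and $\pi_1$ and $\pi_2$ are the proportions of listings that rater 1 and rater 2 classify as acceptable.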

### Dual-Model Approach
The experiment employs two leading LLMs to mitigate potential self-enhancement bias and provide robust comparative analysis of automated evaluation performance.

In [155]:
library(pwr)
library(irr)
library(readxl)
library(dplyr)
library(irrCAC)
library(knitr)
library(kableExtra)

## 1. Dutch Sample Analysis

### Data Loading and Preprocessing
Loading expert evaluation data alongside LLM judgments for Dutch job listings. The dataset contains human expert ratings and corresponding LLM evaluations using identical prompts and scoring criteria.

**Data Structure:**
- Human expert ratings: Two recruitment professionals
- LLM evaluations: GPT-4o and Claude-3.5 assessments
- Evaluation criteria: Same 8 quality metrics used in human-only analysis
- Binary classification: ≥3 threshold for acceptable quality

In [156]:
df_dutch <- read_excel("IRR_testing_5.xlsx")

New names:
• `` -> `...1`


In [157]:
df_dutch <- df_dutch %>%
  rename(
    rater1 = R_8q7mN6gv7GC7sxd,
    rater2 = R_2wT7oFicvVDdz9L
  )

df_dutch <- df_dutch %>%
  mutate(
    rater1 = as.numeric(rater1),
    rater2 = as.numeric(rater2)
  )



# Ratings of 3 or higher count as acceptable quality (1); lower ratings as 0.
binarize <- function(x) ifelse(x >= 3, 1, 0)
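
A small sketch of how the threshold behaves, including on missing values. The `scores` vector is made up for illustration; the point is that an `NA` rating stays `NA` after binarization and turns a plain `mean()` into `NA` unless `na.rm = TRUE` is set.

```r
binarize <- function(x) ifelse(x >= 3, 1, 0)

# Hypothetical ratings on the 1-5 scale, with one missing value
scores <- c(1, 2, 3, 4, 5, NA)

binary <- binarize(scores)
binary                                        # 0 0 1 1 1 NA

share_acceptable_raw   <- mean(binary)                # NA: the missing rating propagates
share_acceptable_clean <- mean(binary, na.rm = TRUE)  # 0.6: computed over the 5 observed ratings
```

If the rating columns can contain missing values, it may be worth checking for them before computing prevalence or agreement.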

### Comprehensive Reliability Analysis Framework

The analysis framework computes multiple reliability metrics to assess LLM-as-a-judge performance:

**Core Metrics:**
- **AC1 Coefficients**: Primary reliability measure between raters
- **Raw Agreement**: Proportion of identical classifications
- **Bias Index**: Systematic rating differences between evaluators
- **Sample Size Requirements**: Statistical power calculations for validation

**Comparison Strategy:**
- Human-to-human baseline reliability
- Each LLM compared individually against human experts
- Direct LLM-to-LLM agreement assessment
- Performance-based model selection criteria

In [158]:
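# Human-human reliability per group (one group per value of `group_var`).
# All metrics are computed on binarized ratings (>= 3 counts as acceptable).
calculate_all_metrics <- function(df, group_var) {
 df %>%
   group_by(!!sym(group_var)) %>%
   summarise(
     # Average share of "acceptable" ratings across the two raters
     prevalence = {
       r1 <- binarize(rater1)
       r2 <- binarize(rater2)
       (mean(r1) + mean(r2)) / 2
     },
     # Prevalence- and bias-adjusted kappa: 2 * observed agreement - 1
     pabak = {
       r1 <- binarize(rater1)
       r2 <- binarize(rater2)
       po <- mean(r1 == r2)
       2 * po - 1
     },
     # Unweighted Cohen's kappa (irr::kappa2)
     cohens_kappa = {
       r1 <- binarize(rater1)
       r2 <- binarize(rater2)
       ratings <- data.frame(r1, r2)
       k <- kappa2(ratings, "unweighted")
       k$value
     },
     # Gwet's AC1 (irrCAC::gwet.ac1.raw)
     gwet_ac1 = {
       r1 <- binarize(rater1)
       r2 <- binarize(rater2)
       ratings <- data.frame(r1, r2)
       ac1 <- gwet.ac1.raw(ratings)
       ac1$est$coeff.val
     },
     .groups = 'drop'
   )
}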

results_dutch <- calculate_all_metrics(df_dutch, "ResponseId")
results_dutch_display <- results_dutch %>%
 mutate(
   prevalence = sprintf("%.2f", prevalence),
   pabak = sprintf("%.2f", pabak),
   cohens_kappa = sprintf("%.2f", cohens_kappa),
   gwet_ac1 = sprintf("%.2f", gwet_ac1)
 )

cat("=== DUTCH RESULTS ===\n\n")
print(results_dutch_display)

=== DUTCH RESULTS ===

# A tibble: 8 × 5
  ResponseId prevalence pabak cohens_kappa gwet_ac1
       <dbl> <chr>      <chr> <chr>        <chr>   
1          1 0.93       0.70  -0.07        0.83    
2          2 0.90       0.60  -0.11        0.76    
3          3 0.80       0.20  0.00         0.41    
4          4 0.70       0.60  0.55         0.66    
5          5 0.70       0.60  0.53         0.66    
6          6 0.80       0.40  0.12         0.56    
7          7 0.88       0.50  0.00         0.68    
8          8 0.78       0.70  0.57         0.77    

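The gap between Cohen's kappa and $AC_1$ at high prevalence can be reproduced by hand. The following base-R sketch uses ten made-up binarized ratings (not the study data) in which both raters mark almost everything as acceptable. The two coefficients differ only in how chance agreement is estimated, and the variable names are chosen here for illustration.

```r
# Hypothetical binarized ratings for 10 items (1 = acceptable, 0 = not)
r1 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
r2 <- c(1, 1, 1, 1, 1, 1, 1, 1, 0, 1)

po <- mean(r1 == r2)   # observed (raw) agreement
p1 <- mean(r1)         # rater 1's share of "acceptable"
p2 <- mean(r2)         # rater 2's share of "acceptable"

# Cohen's kappa: chance agreement from the product of the raters' marginals
pe_kappa    <- p1 * p2 + (1 - p1) * (1 - p2)
cohen_kappa <- (po - pe_kappa) / (1 - pe_kappa)

# Gwet's AC1: chance agreement from the average prevalence
prev   <- (p1 + p2) / 2
pe_ac1 <- 2 * prev * (1 - prev)
ac1    <- (po - pe_ac1) / (1 - pe_ac1)

# PABAK depends on observed agreement only
pabak <- 2 * po - 1

round(c(prevalence = prev, pabak = pabak, kappa = cohen_kappa, ac1 = ac1), 2)
# prevalence 0.90, pabak 0.60, kappa -0.11, ac1 0.76
```

With 80% raw agreement, kappa turns negative because its chance-agreement term (0.82) exceeds the observed agreement, while $AC_1$ estimates chance agreement at 0.18. This is the usual motivation for preferring $AC_1$ when one category dominates.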

### Dutch Sample Results

**Performance Summary:**
The tables below present comprehensive reliability metrics for Dutch job listing evaluations, comparing human expert agreement with LLM-based assessments across all quality dimensions.

**Interpretation Guidelines:**
- AC1 ≥ 0.61: Substantial agreement (framework deployment ready)
- Low Bias Index: Minimal systematic rating differences
- Best LLM determination: Based on agreement and bias performance
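
The 0.61 cut-off matches the "substantial" band of the Landis and Koch benchmark scale. As a rough aid for reading the tables, a coefficient can be mapped to those bands with a helper like the one below. The function name `interpret_ac1` is hypothetical and not part of the notebook, and applying the kappa benchmark scale to $AC_1$ is assumed here as a convention.

```r
# Map agreement coefficients to Landis-Koch style labels.
# Intervals are right-closed, so 0.60 is "Moderate" and 0.61 is "Substantial";
# a value of exactly 0 falls into "Poor" under this binning.
interpret_ac1 <- function(x) {
  cut(
    x,
    breaks = c(-Inf, 0, 0.20, 0.40, 0.60, 0.80, 1),
    labels = c("Poor", "Slight", "Fair", "Moderate",
               "Substantial", "Almost perfect"),
    right = TRUE
  )
}

labels <- as.character(interpret_ac1(c(0.83, 0.66, 0.41, -0.07)))
labels
# "Almost perfect" "Substantial" "Moderate" "Poor"
```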

In [163]:
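# Proportion of items on which two raters give the same (binarized) rating.
# Returns NA if the vectors differ in length; pairs with a missing rating are ignored.
raw_agreement <- function(rater1, rater2) {
 if (length(rater1) != length(rater2)) return(NA_real_)
 mean(rater1 == rater2, na.rm = TRUE)
}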

calculate_ac1_metrics <- function(df, group_var) {
 df %>%
   group_by(!!sym(group_var)) %>%
   summarise(
     ac1_humans = {
       r1 <- binarize(rater1)
       r2 <- binarize(rater2)
       gwet.ac1.raw(data.frame(r1, r2))$est$coeff.val
     },
     
     raw_agreement_humans = raw_agreement(binarize(rater1), binarize(rater2)),
     
     ac1_openai_vs_average_humans = mean(c(
       {
         r1 <- binarize(rater1)
         r2 <- binarize(LLM_judge_openai)
         gwet.ac1.raw(data.frame(r1, r2))$est$coeff.val
       },
       {
         r1 <- binarize(rater2)
         r2 <- binarize(LLM_judge_openai)
         gwet.ac1.raw(data.frame(r1, r2))$est$coeff.val
       }
     ), na.rm = TRUE),
     
     raw_agreement_openai_vs_average_humans = mean(c(
       raw_agreement(binarize(rater1), binarize(LLM_judge_openai)),
       raw_agreement(binarize(rater2), binarize(LLM_judge_openai))
     ), na.rm = TRUE),
     
     sample_size_openai_vs_humans = kappaSize.ac1(k0 = 0, k1 = ac1_openai_vs_average_humans, alpha = 0.05, power = 0.8)$N,
     
     bias_openai_vs_average_humans = (mean(binarize(rater1)) + mean(binarize(rater2))) / 2 - mean(binarize(LLM_judge_openai)),
     
     ac1_claude_vs_average_humans = mean(c(
       {
         r1 <- binarize(rater1)
         r2 <- binarize(LLM_judge_claude)
         gwet.ac1.raw(data.frame(r1, r2))$est$coeff.val
       },
       {
         r1 <- binarize(rater2)
         r2 <- binarize(LLM_judge_claude)
         gwet.ac1.raw(data.frame(r1, r2))$est$coeff.val
       }
     ), na.rm = TRUE),
     
     raw_agreement_claude_vs_average_humans = mean(c(
       raw_agreement(binarize(rater1), binarize(LLM_judge_claude)),
       raw_agreement(binarize(rater2), binarize(LLM_judge_claude))
     ), na.rm = TRUE),
     
     bias_claude_vs_average_humans = (mean(binarize(rater1)) + mean(binarize(rater2))) / 2 - mean(binarize(LLM_judge_claude)),
     
     ac1_openai_vs_claude = {
       r1 <- binarize(LLM_judge_openai)
       r2 <- binarize(LLM_judge_claude)
       gwet.ac1.raw(data.frame(r1, r2))$est$coeff.val
     },
     
     raw_agreement_openai_vs_claude = raw_agreement(binarize(LLM_judge_openai), binarize(LLM_judge_claude)),
     
     bias_openai_vs_claude = mean(binarize(LLM_judge_openai)) - mean(binarize(LLM_judge_claude)),
     
     best_llm_by_agreement = case_when(
       is.na(ac1_openai_vs_average_humans) & is.na(ac1_claude_vs_average_humans) ~ "Tie (both NA)",
       is.na(ac1_openai_vs_average_humans) ~ "Claude",
       is.na(ac1_claude_vs_average_humans) ~ "OpenAI",
       ac1_openai_vs_average_humans > ac1_claude_vs_average_humans ~ "OpenAI",
       ac1_claude_vs_average_humans > ac1_openai_vs_average_humans ~ "Claude",
       TRUE ~ "Tie"
     ),
     
     best_llm_by_bias = case_when(
       is.na(bias_openai_vs_average_humans) & is.na(bias_claude_vs_average_humans) ~ "Tie (both NA)",
       is.na(bias_openai_vs_average_humans) ~ "Claude",
       is.na(bias_claude_vs_average_humans) ~ "OpenAI",
       abs(bias_openai_vs_average_humans) < abs(bias_claude_vs_average_humans) ~ "OpenAI",
       abs(bias_claude_vs_average_humans) < abs(bias_openai_vs_average_humans) ~ "Claude",
       TRUE ~ "Tie"
     ),
     
     best_llm_overall = case_when(
       best_llm_by_agreement == best_llm_by_bias & !grepl("Tie", best_llm_by_agreement) ~ best_llm_by_agreement,
       grepl("Tie", best_llm_by_agreement) & grepl("Tie", best_llm_by_bias) ~ "Tie",
       grepl("Tie", best_llm_by_agreement) & !grepl("Tie", best_llm_by_bias) ~ best_llm_by_bias,
       !grepl("Tie", best_llm_by_agreement) & grepl("Tie", best_llm_by_bias) ~ best_llm_by_agreement,
       TRUE ~ "Tie"
     ),
     
     .groups = "drop"
   )
}
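
As a minimal base-R sketch of two of the LLM-versus-human summaries computed above (raw agreement averaged over the two experts, and the bias index defined as mean human prevalence minus LLM prevalence), consider the made-up ratings below. The vectors `h1`, `h2`, and `llm` are hypothetical and not taken from the study data.

```r
binarize <- function(x) ifelse(x >= 3, 1, 0)

# Hypothetical 1-5 ratings for six listings on one criterion
h1  <- c(4, 3, 2, 5, 3, 1)   # expert 1
h2  <- c(4, 2, 2, 5, 4, 3)   # expert 2
llm <- c(4, 4, 3, 5, 3, 2)   # LLM judge

b1 <- binarize(h1)
b2 <- binarize(h2)
bl <- binarize(llm)

# Raw agreement of the LLM with each expert, then averaged
agreement <- mean(c(mean(b1 == bl), mean(b2 == bl)))

# Bias index: positive means the LLM accepts less often than the experts,
# negative means it accepts more often
bias <- (mean(b1) + mean(b2)) / 2 - mean(bl)

round(c(agreement = agreement, bias = bias), 2)
# agreement 0.67, bias -0.17
```

In this toy case the LLM agrees with the experts on two thirds of the classifications on average and is slightly more lenient than they are, which is why the sign of the bias is worth reporting alongside its absolute value.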

In [164]:
results_dutch <- calculate_ac1_metrics(df_dutch, "ResponseId")

In [167]:
results_dutch %>%
  select(ResponseId, ac1_humans, ac1_openai_vs_average_humans, 
         ac1_claude_vs_average_humans, ac1_openai_vs_claude, best_llm_overall) %>%
  knitr::kable(
    col.names = c("Response ID", "Human-Human", "OpenAI vs Humans", 
                  "Claude vs Humans", "OpenAI vs Claude", "Best LLM"),
    caption = "AC1 Agreement Coefficients",
    align = c("c", "c", "c", "c", "c", "l"),
    digits = 2
  )


results_dutch %>%
  select(ResponseId, raw_agreement_humans, raw_agreement_openai_vs_average_humans,
         raw_agreement_claude_vs_average_humans, raw_agreement_openai_vs_claude) %>%
  knitr::kable(
    col.names = c("Response ID", "Human-Human", "OpenAI vs Humans", 
                  "Claude vs Humans", "OpenAI vs Claude"),
    caption = "Raw Agreement Proportions",
    align = c("c", "c", "c", "c", "c"),
    digits = 2
  )


results_dutch %>%
  mutate(
    abs_bias_openai = abs(bias_openai_vs_average_humans),
    abs_bias_claude = abs(bias_claude_vs_average_humans)
  ) %>%
  select(ResponseId, 
         raw_agreement_openai_vs_average_humans, raw_agreement_claude_vs_average_humans,
         abs_bias_openai, abs_bias_claude,
         best_llm_by_agreement, best_llm_by_bias, best_llm_overall) %>%
  knitr::kable(
    col.names = c("Response ID", 
                  "OpenAI Raw Agreement", "Claude Raw Agreement",
                  "OpenAI |Bias|", "Claude |Bias|",
                  "Agreement Winner", "Bias Winner", "Overall Winner"),
    caption = "LLM Performance Comparison and Decision Rationale (Raw Agreement)",
    align = c("c", "c", "c", "c", "c", "l", "l", "l"),
    digits = 2
  )



Table: AC1 Agreement Coefficients

| Response ID | Human-Human | OpenAI vs Humans | Claude vs Humans | OpenAI vs Claude |Best LLM |
|:-----------:|:-----------:|:----------------:|:----------------:|:----------------:|:--------|
|      1      |    0.83     |       0.92       |       0.92       |       1.00       |Tie      |
|      2      |    0.76     |       0.89       |       0.89       |       1.00       |Tie      |
|      3      |    0.41     |       0.71       |       0.71       |       1.00       |Tie      |
|      4      |    0.66     |       0.58       |       0.58       |       1.00       |Tie      |
|      5      |    0.66     |       0.59       |       0.59       |       1.00       |Tie      |
|      6      |    0.56     |       0.79       |       0.71       |       0.87       |Tie      |
|      7      |    0.68     |       0.84       |       0.84       |       1.00       |Tie      |
|      8      |    0.77     |       0.80       |       0.79       |       0.91       |Open



Table: Raw Agreement Proportions

| Response ID | Human-Human | OpenAI vs Humans | Claude vs Humans | OpenAI vs Claude |
|:-----------:|:-----------:|:----------------:|:----------------:|:----------------:|
|      1      |    0.85     |       0.92       |       0.92       |       1.00       |
|      2      |    0.80     |       0.90       |       0.90       |       1.00       |
|      3      |    0.60     |       0.80       |       0.80       |       1.00       |
|      4      |    0.80     |       0.70       |       0.70       |       1.00       |
|      5      |    0.80     |       0.70       |       0.70       |       1.00       |
|      6      |    0.70     |       0.85       |       0.80       |       0.90       |
|      7      |    0.75     |       0.88       |       0.88       |       1.00       |
|      8      |    0.85     |       0.88       |       0.88       |       0.95       |



Table: LLM Performance Comparison and Decision Rationale (Raw Agreement)

| Response ID | OpenAI Raw Agreement | Claude Raw Agreement | OpenAI &#124;Bias&#124; | Claude &#124;Bias&#124; |Agreement Winner |Bias Winner |Overall Winner |
|:-----------:|:--------------------:|:--------------------:|:-----------------------:|:-----------------------:|:----------------|:-----------|:--------------|
|      1      |         0.92         |         0.92         |          0.07           |          0.07           |Tie              |Tie         |Tie            |
|      2      |         0.90         |         0.90         |          0.10           |          0.10           |Tie              |Tie         |Tie            |
|      3      |         0.80         |         0.80         |          0.20           |          0.20           |Tie              |Tie         |Tie            |
|      4      |         0.70         |         0.70         |          0.30           |          0.30           |Tie   

## 2. English Sample Analysis

### Cross-Language Validation
Extending the LLM-as-a-judge analysis to English job listings to assess framework generalizability across languages and cultural contexts.

**Methodological Consistency:**
- Identical evaluation procedures as Dutch analysis
- Same LLM models and prompt structures
- Equivalent human expert selection criteria
- Consistent binary classification thresholds

In [168]:
df_english <- read_excel("IRR_testing_en_1.xlsx")

New names:
• `` -> `...1`


In [169]:
df_english <- df_english %>%
  rename(
    rater1 = R_2F4IWrRogs2MEvL,
    rater2 = R_2rkb0tBGzURC0Ia
  )

df_english <- df_english %>%
  mutate(
    rater1 = as.numeric(rater1),
    rater2 = as.numeric(rater2)
  )

In [170]:
results_english <- calculate_ac1_metrics(df_english, "ResponseId")

### English Sample Results

**Language-Specific Performance:**
The following analysis examines whether LLM-based evaluation maintains reliability across different languages, addressing framework scalability for international recruitment contexts.

**Comparative Focus:**
- English vs. Dutch reliability patterns
- Model performance consistency across languages
- Cultural/linguistic bias detection in automated evaluation

In [171]:
results_english %>%
  select(ResponseId, ac1_humans, ac1_openai_vs_average_humans, 
         ac1_claude_vs_average_humans, ac1_openai_vs_claude, best_llm_overall) %>%
  knitr::kable(
    col.names = c("Response ID", "Human-Human", "OpenAI vs Humans", 
                  "Claude vs Humans", "OpenAI vs Claude", "Best LLM"),
    caption = "AC1 Agreement Coefficients",
    align = c("c", "c", "c", "c", "c", "l"),
    digits = 2
  )

  
results_english %>%
  select(ResponseId, raw_agreement_humans, raw_agreement_openai_vs_average_humans,
         raw_agreement_claude_vs_average_humans, raw_agreement_openai_vs_claude) %>%
  knitr::kable(
    col.names = c("Response ID", "Human-Human", "OpenAI vs Humans", 
                  "Claude vs Humans", "OpenAI vs Claude"),
    caption = "Raw Agreement Proportions",
    align = c("c", "c", "c", "c", "c"),
    digits = 2
  )

results_english %>%
  mutate(
    abs_bias_openai = abs(bias_openai_vs_average_humans),
    abs_bias_claude = abs(bias_claude_vs_average_humans)
  ) %>%
  select(ResponseId, 
         raw_agreement_openai_vs_average_humans, raw_agreement_claude_vs_average_humans,
         abs_bias_openai, abs_bias_claude,
         best_llm_by_agreement, best_llm_by_bias, best_llm_overall) %>%
  knitr::kable(
    col.names = c("Response ID", 
                  "OpenAI Raw Agreement", "Claude Raw Agreement",
                  "OpenAI |Bias|", "Claude |Bias|",
                  "Agreement Winner", "Bias Winner", "Overall Winner"),
    caption = "LLM Performance Comparison and Decision Rationale (Raw Agreement)",
    align = c("c", "c", "c", "c", "c", "l", "l", "l"),
    digits = 2
  )



Table: AC1 Agreement Coefficients

| Response ID | Human-Human | OpenAI vs Humans | Claude vs Humans | OpenAI vs Claude |Best LLM |
|:-----------:|:-----------:|:----------------:|:----------------:|:----------------:|:--------|
|      1      |    0.60     |       0.81       |       0.81       |       1.00       |Tie      |
|      2      |    0.41     |       0.74       |       0.74       |       1.00       |Tie      |
|      3      |    0.36     |       0.66       |       0.68       |       0.95       |Tie      |
|      4      |    0.04     |       0.53       |       0.45       |       0.92       |OpenAI   |
|      5      |    0.65     |       0.75       |       0.73       |       0.92       |OpenAI   |
|      6      |    0.51     |       0.79       |       0.69       |       0.83       |Tie      |
|      7      |    -0.10    |       0.48       |       0.48       |       1.00       |Tie      |
|      8      |    0.32     |       0.65       |       0.65       |       1.00       |Tie 



Table: Raw Agreement Proportions

| Response ID | Human-Human | OpenAI vs Humans | Claude vs Humans | OpenAI vs Claude |
|:-----------:|:-----------:|:----------------:|:----------------:|:----------------:|
|      1      |    0.70     |       0.85       |       0.85       |       1.00       |
|      2      |    0.60     |       0.80       |       0.80       |       1.00       |
|      3      |    0.60     |       0.75       |       0.75       |       0.95       |
|      4      |    0.50     |       0.75       |       0.70       |       0.95       |
|      5      |    0.75     |       0.82       |       0.82       |       0.95       |
|      6      |    0.65     |       0.82       |       0.78       |       0.85       |
|      7      |    0.40     |       0.65       |       0.65       |       1.00       |
|      8      |    0.65     |       0.82       |       0.82       |       1.00       |



Table: LLM Performance Comparison and Decision Rationale (Raw Agreement)

| Response ID | OpenAI Raw Agreement | Claude Raw Agreement | OpenAI &#124;Bias&#124; | Claude &#124;Bias&#124; |Agreement Winner |Bias Winner |Overall Winner |
|:-----------:|:--------------------:|:--------------------:|:-----------------------:|:-----------------------:|:----------------|:-----------|:--------------|
|      1      |         0.85         |         0.85         |          0.15           |          0.15           |Tie              |Tie         |Tie            |
|      2      |         0.80         |         0.80         |          0.20           |          0.20           |Tie              |Tie         |Tie            |
|      3      |         0.75         |         0.75         |          0.20           |          0.25           |Claude           |OpenAI      |Tie            |
|      4      |         0.75         |         0.70         |          0.10           |          0.15           |OpenAI

## 3. Combined Analysis: Cross-Language Framework Validation

### Pooled Sample Assessment
Combining Dutch and English samples to establish overall framework reliability estimates and provide robust statistical foundation for deployment recommendations.

**Statistical Advantages:**
- Increased sample size for reliable coefficient estimation
- Cross-language generalizability assessment
- Comprehensive model comparison across diverse contexts
- Framework-level performance validation

In [153]:
df_combined <- bind_rows(df_dutch, df_english)
results_combined <- calculate_ac1_metrics(df_combined, "ResponseId")


results_combined %>%
  select(ResponseId, ac1_humans, ac1_openai_vs_average_humans, 
         ac1_claude_vs_average_humans, ac1_openai_vs_claude, best_llm_overall) %>%
  knitr::kable(
    col.names = c("Response ID", "Human-Human", "OpenAI vs Humans", 
                  "Claude vs Humans", "OpenAI vs Claude", "Best LLM"),
    caption = "AC1 Agreement Coefficients",
    align = c("c", "c", "c", "c", "c", "l"),
    digits = 2
  )

  
results_combined %>%
  select(ResponseId, raw_agreement_humans, raw_agreement_openai_vs_average_humans,
         raw_agreement_claude_vs_average_humans, raw_agreement_openai_vs_claude) %>%
  knitr::kable(
    col.names = c("Response ID", "Human-Human", "OpenAI vs Humans", 
                  "Claude vs Humans", "OpenAI vs Claude"),
    caption = "Raw Agreement Proportions",
    align = c("c", "c", "c", "c", "c"),
    digits = 2
  )



Table: AC1 Agreement Coefficients

| Response ID | Human-Human | OpenAI vs Humans | Claude vs Humans | OpenAI vs Claude |Best LLM |
|:-----------:|:-----------:|:----------------:|:----------------:|:----------------:|:--------|
|      1      |    0.72     |       0.87       |       0.87       |       1.00       |Tie      |
|      2      |    0.60     |       0.82       |       0.82       |       1.00       |Tie      |
|      3      |    0.39     |       0.69       |       0.70       |       0.97       |Tie      |
|      4      |    0.36     |       0.54       |       0.51       |       0.97       |OpenAI   |
|      5      |    0.65     |       0.67       |       0.66       |       0.97       |Tie      |
|      6      |    0.53     |       0.79       |       0.69       |       0.84       |Tie      |
|      7      |    0.33     |       0.67       |       0.67       |       1.00       |Tie      |
|      8      |    0.55     |       0.72       |       0.72       |       0.95       |Open



Table: Raw Agreement Proportions

| Response ID | Human-Human | OpenAI vs Humans | Claude vs Humans | OpenAI vs Claude |
|:-----------:|:-----------:|:----------------:|:----------------:|:----------------:|
|      1      |    0.78     |       0.89       |       0.89       |       1.00       |
|      2      |    0.70     |       0.85       |       0.85       |       1.00       |
|      3      |    0.60     |       0.78       |       0.78       |       0.98       |
|      4      |    0.65     |       0.72       |       0.70       |       0.98       |
|      5      |    0.78     |       0.76       |       0.76       |       0.98       |
|      6      |    0.68     |       0.84       |       0.79       |       0.88       |
|      7      |    0.58     |       0.76       |       0.76       |       1.00       |
|      8      |    0.75     |       0.85       |       0.85       |       0.98       |

## 4. Summary and Framework Implications

### LLM-as-a-Judge Validation Outcomes

This analysis provides empirical validation of LLM-based evaluation reliability for the job listing quality assessment framework.

**Key Findings:**
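1. **Model Performance**: Both GPT-4o and Claude-3.5 reach substantial agreement with human experts (AC1 ≥ 0.61) on most criteria; in the pooled sample only criterion 4 falls below the threshold (0.54 for GPT-4o, 0.51 for Claude-3.5)
2. **Cross-Language Consistency**: LLM-human agreement holds in both languages but is lower for English (AC1 0.45–0.81) than for Dutch (0.58–0.92); criteria 4 and 5 in the Dutch sample and criteria 4 and 7 in the English sample fall below 0.61
3. **Automated Scalability**: LLM-human AC1 matches or exceeds human-human AC1 on all eight criteria in the English and pooled samples; the exceptions are Dutch criteria 4 and 5, where human-human agreement (0.66) is higher than LLM-human agreement (0.58 and 0.59)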
4. **Practical Deployment**: Results support automated evaluation integration with minimal human oversight

**Framework Integration:**
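- **Primary Evaluator**: GPT-4o selected based on marginal performance advantages; the two models agree closely with each other (pooled AC1 0.84–1.00), and GPT-4o is named best overall on the few criteria where a winner is declared (for example, criterion 4 in the English and pooled samples)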
- **Quality Assurance**: Dual-model validation for critical assessments
- **Threshold Validation**: Binary classification approach confirmed for practical implementation
- **Scalability Confirmation**: Framework ready for large-scale job listing evaluation

**Next Steps:**
These validation results enable progression to full framework implementation and real-world deployment testing in recruitment workflows.