# Inter-rater Reliability Between Human Experts

## Framework Validation: Pilot Study

This notebook presents the pilot validation of empirically derived quality metrics for job listing evaluation through human expert inter-rater reliability (IRR) analysis. The analysis validates whether recruitment experts can consistently apply the eight quality metrics across four dimensions (Clarity, Relevance, Correctness, Completeness) derived from requirements elicitation.

### Research Context
- **Objective**: Validate metric interpretability among recruitment experts
- **Sample**: Job listings rated by human experts (Dutch and English)
- **Methodology**: Gwet's $AC_1$ coefficient for reliability assessment
- **Significance**: Establishes foundation for LLM-based evaluation framework

In [408]:
library(pwr)
library(irr)
library(kappaSize)
library(readxl)
library(dplyr)
library(irrCAC)
library(knitr)

## 1. Sample Size Determination

**Statistical Parameters for Pilot Study:**


---

### Gwet's $AC_1$ Interpretation Framework:
| $AC_1$ Range | Interpretation           |
|--------------|--------------------------|
| < 0          | Poor agreement           |
| 0.00–0.20    | Slight agreement         |
| 0.21–0.40    | Fair agreement           |
| 0.41–0.60    | Moderate agreement       |
| 0.61–0.80    | Substantial agreement    |
| 0.81–1.00    | Almost perfect agreement |
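The bands above can be applied programmatically. The helper below is a minimal sketch, not part of the original notebook: the name `interpret_ac1` and the example values are hypothetical, and it assumes upper bounds are inclusive (so 0.20 counts as slight and 0.40 as fair).

```r
# Hypothetical helper: map AC1 values to the interpretation bands in the table.
# Assumption: upper bounds are inclusive, e.g. 0.20 -> "Slight agreement".
interpret_ac1 <- function(ac1) {
  labels <- c(
    "Poor agreement", "Slight agreement", "Fair agreement",
    "Moderate agreement", "Substantial agreement", "Almost perfect agreement"
  )
  upper_bounds <- c(0.20, 0.40, 0.60, 0.80)
  # left.open = TRUE gives intervals of the form (lower, upper]
  band <- findInterval(ac1, upper_bounds, left.open = TRUE) + 2
  # Negative coefficients fall in the lowest band
  band[which(ac1 < 0)] <- 1
  labels[band]
}

# Illustrative values only, not study results
example_ac1 <- c(-0.10, 0.15, 0.39, 0.55, 0.72, 0.95)
data.frame(ac1 = example_ac1, interpretation = interpret_ac1(example_ac1))
```

Values that sit exactly on a band boundary after rounding are best interpreted from the unrounded coefficient.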

Gwet's $AC_1$ is preferred over Cohen's kappa due to its stability with extreme prevalence distributions, which is relevant for job listing quality ratings where most listings may receive positive evaluations.
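The prevalence effect can be illustrated with a small worked example. The counts below are hypothetical (20 listings, most rated acceptable by both raters) and are not taken from the study data; the sketch uses base R only and the standard two-rater, binary-outcome definitions of both coefficients.

```r
# Hypothetical 2 x 2 agreement counts for 20 listings (illustrative only)
both_acceptable   <- 16
only_rater1       <- 2   # rater 1 acceptable, rater 2 unacceptable
only_rater2       <- 2   # rater 2 acceptable, rater 1 unacceptable
both_unacceptable <- 0
n <- both_acceptable + only_rater1 + only_rater2 + both_unacceptable

# Observed agreement and each rater's proportion of "acceptable" ratings
p_agree <- (both_acceptable + both_unacceptable) / n
p1 <- (both_acceptable + only_rater1) / n
p2 <- (both_acceptable + only_rater2) / n

# Cohen's kappa: chance agreement from the product of the raters' marginals
pe_kappa <- p1 * p2 + (1 - p1) * (1 - p2)
kappa <- (p_agree - pe_kappa) / (1 - pe_kappa)

# Gwet's AC1: chance agreement from the average prevalence
pi_hat <- (p1 + p2) / 2
pe_ac1 <- 2 * pi_hat * (1 - pi_hat)
ac1 <- (p_agree - pe_ac1) / (1 - pe_ac1)

round(c(raw_agreement = p_agree, kappa = kappa, ac1 = ac1), 2)
```

With 80% raw agreement, kappa comes out slightly negative (about -0.11) while $AC_1$ is about 0.76, because kappa's chance-agreement term grows large when nearly all ratings fall in one category.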

## 2. Data Loading and Preprocessing

### Survey Data
Loading expert evaluation data from Qualtrics surveys containing ratings on the eight empirically derived quality metrics. Expert raters evaluated job listings using 5-point Likert scales across both Dutch and English samples.

**Data Structure:**
- Dutch sample: Recruitment experts rating Dutch job listings
- English sample: Recruitment experts rating English job listings
- Variables: Expert ratings on 8 quality metrics
- Rating scale: 1 (Strongly Disagree) to 5 (Strongly Agree)

In [410]:
df_dutch <- read_excel("Dutch_surveys_ICR_2.xlsx")
df_english <- read_excel("English_surveys_ICR_2.xlsx")

### Data Preparation and Standardization

**Variable Standardization:**
Renaming Qualtrics-generated response IDs to standardized rater identifiers for analysis consistency. The numeric conversion ensures proper statistical computation for reliability coefficients.

**Binary Classification Function:**
The `binarize()` function implements the threshold confirmed by all participating recruitment companies: ratings ≥3 represent "acceptable" quality, while ratings <3 represent "unacceptable" quality. This binary classification enables clearer interpretation of agreement patterns and aligns with practical decision-making in recruitment contexts.
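As a quick illustration of the threshold, the snippet below applies the same one-line function to each point of the 5-point scale. The definition is restated so the snippet runs on its own; the input vector is illustrative, and the `NA` entry is included only to show how a missing rating propagates.

```r
# Same definition as used in the notebook, repeated for a self-contained example
binarize <- function(x) ifelse(x >= 3, 1, 0)

# Illustrative input: every scale point plus one missing rating
likert <- c(1, 2, 3, 4, 5, NA)

# Ratings of 3 or higher map to 1 (acceptable), lower ratings to 0;
# a missing rating stays NA
binarize(likert)
```

Note that the neutral midpoint (3) is counted as acceptable under this threshold.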

In [411]:
df_dutch <- df_dutch %>%
  rename(
    rater1 = R_8q7mN6gv7GC7sxd,
    rater2 = R_2wT7oFicvVDdz9L
  )

df_english <- df_english %>%
  rename(
    rater1 = R_2F4IWrRogs2MEvL,
    rater2 = R_2rkb0tBGzURC0Ia
  )

df_dutch <- df_dutch %>%
  mutate(
    rater1 = as.numeric(rater1),
    rater2 = as.numeric(rater2)
  )

df_english <- df_english %>%
  mutate(
    rater1 = as.numeric(rater1),
    rater2 = as.numeric(rater2)
  )

# Collapse 5-point Likert ratings: >= 3 -> 1 (acceptable), < 3 -> 0 (unacceptable)
binarize <- function(x) ifelse(x >= 3, 1, 0)

## 3. Descriptive Analysis by Response Category

### Individual Job Listing Analysis
Computing descriptive statistics for each job listing (ResponseId) stratified by generation method (Generated: 0=Human-written, 1=LLM-generated). This analysis provides insight into rating patterns across different job listings and generation methods.

**Metrics Calculated:**
- Central tendency: Mean, median
- Variability: Standard deviation, IQR
- Range: Minimum, maximum values

These descriptive statistics inform the subsequent reliability analysis by revealing potential systematic differences in rating patterns.

In [412]:
evaluation_metrics <- function(data) {
  data %>%
    group_by(Generated, ResponseId) %>%
    summarise(
      mean_rater1 = mean(rater1, na.rm = TRUE),
      mean_rater2 = mean(rater2, na.rm = TRUE),
      sd_rater1 = sd(rater1, na.rm = TRUE),
      sd_rater2 = sd(rater2, na.rm = TRUE),
      median_rater1 = median(rater1, na.rm = TRUE),
      median_rater2 = median(rater2, na.rm = TRUE),
      min_rater1 = min(rater1, na.rm = TRUE),
      min_rater2 = min(rater2, na.rm = TRUE),
      max_rater1 = max(rater1, na.rm = TRUE),
      max_rater2 = max(rater2, na.rm = TRUE),
      iqr_rater1 = IQR(rater1, na.rm = TRUE),
      iqr_rater2 = IQR(rater2, na.rm = TRUE),
      .groups = "drop"
    )
}


eval_dutch <- evaluation_metrics(df_dutch)
eval_english <- evaluation_metrics(df_english)

print(eval_dutch)
print(eval_english)

# A tibble: 16 × 14
   Generated ResponseId mean_rater1 mean_rater2 sd_rater1 sd_rater2
       <dbl>      <dbl>       <dbl>       <dbl>     <dbl>     <dbl>
 1         0          1         3.5         3.6     0.707     1.51 
 2         0          2         2.9         3.4     0.316     0.966
 3         0          3         3.7         2.7     0.823     1.57 
 4         0          4         3.2         2.7     1.23      1.57 
 5         0          5         2.6         3.4     0.516     0.966
 6         0          6         3.7         3.3     0.823     1.42 
 7         0          7         3.7         2.8     0.675     1.23 
 8         0          8         3.1         2.9     1.45      1.60 
 9         1          1         4.2         3.8     0.632     0.632
10      

## 4. Comparative Analysis: Human vs. LLM-Generated Content

### Generation Method Comparison
Aggregating ratings by generation method to examine systematic differences between human-written and LLM-generated job listings. This analysis addresses the sub-research question regarding quality differences between generation methods.

**Key Comparisons:**
- Mean rating differences between generation methods
- Variability patterns across human vs. LLM content
- Consistency of expert evaluations by generation type

The results inform both the validation of the generation framework and the reliability of expert evaluations across different content sources.

In [413]:
summary_generated_vs_human <- function(df) {
  df %>%
    mutate(
      rater1 = as.numeric(rater1),
      rater2 = as.numeric(rater2)
    ) %>%
    group_by(Generated) %>%
    summarise(
      mean_rater1 = mean(rater1, na.rm = TRUE),
      mean_rater2 = mean(rater2, na.rm = TRUE),
      sd_rater1   = sd(rater1, na.rm = TRUE),
      sd_rater2   = sd(rater2, na.rm = TRUE),
      median_rater1 = median(rater1, na.rm = TRUE),
      median_rater2 = median(rater2, na.rm = TRUE),
      iqr_rater1 = IQR(rater1, na.rm = TRUE),
      iqr_rater2 = IQR(rater2, na.rm = TRUE),
      min_rater1 = min(rater1, na.rm = TRUE),
      min_rater2 = min(rater2, na.rm = TRUE),
      max_rater1 = max(rater1, na.rm = TRUE),
      max_rater2 = max(rater2, na.rm = TRUE),
      .groups = "drop"
    )
}

# Run for both datasets
summary_dutch   <- summary_generated_vs_human(df_dutch)
summary_english <- summary_generated_vs_human(df_english)

print(summary_dutch)
print(summary_english)

# A tibble: 2 × 13
  Generated mean_rater1 mean_rater2 sd_rater1 sd_rater2 median_rater1
      <dbl>       <dbl>       <dbl>     <dbl>     <dbl>         <dbl>
1         0        3.3         3.1      0.933      1.36             3
2         1        3.89        3.54     0.914      1.07             4
# ℹ 7 more variables: median_rater2 <dbl>, iqr_rater1 <dbl>, iqr_rater2 <dbl>,
#   min_rater1 <dbl>, min_rater2 <dbl>, max_rater1 <dbl>, max_rater2 <dbl>
# A tibble: 2 × 13
  Generated mean_rater1 mean_rater2 sd_rater1 sd_rater2 median_rater1
      <dbl>       <dbl>       <dbl>     <dbl>     <dbl>         <dbl>
1         0        3.08        2.94      1.37      1.14             3
2         

## 5. Gwet's $AC_1$ Analysis and Sample Size Estimation

### Advanced Reliability Assessment
Gwet's $AC_1$ coefficient provides more stable reliability estimates than Cohen's kappa, which matters for quality evaluation studies where a high prevalence of positive ratings is expected.

**Combined Language Analysis:**
- Pooling Dutch and English samples for robust reliability estimates
- Response-level analysis across all eight job listing categories
- Sample size estimation for future large-scale validation studies

**Practical Implications:**
The $AC_1$ coefficients inform which quality metrics demonstrate sufficient inter-rater reliability for inclusion in the automated evaluation framework. Sample size estimations guide the planning of full-scale validation experiments.
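For reference, the two-rater, binary-outcome form of the coefficient can be written as follows. This is the standard definition rather than something stated in the notebook; the symbols $p_a$, $p_e$, $p_1$, $p_2$ and $\pi$ are introduced here for illustration.

$$
AC_1 = \frac{p_a - p_e}{1 - p_e}, \qquad p_e = 2\pi(1 - \pi), \qquad \pi = \frac{p_1 + p_2}{2}
$$

Here $p_a$ is the proportion of rows on which the two binarized ratings agree, and $p_1$ and $p_2$ are the proportions rated acceptable by rater 1 and rater 2. The average $\pi$ corresponds to the `prevalence` value computed in the code cell for this section. Because $p_e$ cannot exceed 0.5 and shrinks as $\pi$ approaches 0 or 1, the coefficient stays close to the raw agreement when nearly all ratings are positive.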

In [418]:
calculate_all_metrics <- function(df, group_var) {
  df %>%
    # group_var is a column name passed as a string, e.g. "ResponseId"
    group_by(!!sym(group_var)) %>%
    summarise(
      # Average share of "acceptable" ratings across the two raters
      prevalence = {
        r1 <- binarize(rater1)
        r2 <- binarize(rater2)
        (mean(r1) + mean(r2)) / 2
      },
      # Gwet's AC1 on the binarized ratings (irrCAC::gwet.ac1.raw)
      gwet_ac1 = {
        r1 <- binarize(rater1)
        r2 <- binarize(rater2)
        ratings <- data.frame(r1, r2)
        ac1 <- gwet.ac1.raw(ratings)
        ac1$est$coeff.val
      },
      # Sample size estimate with k0 = 0 as the null value and the observed AC1 as k1
      sample_size = {
        r1 <- binarize(rater1)
        r2 <- binarize(rater2)
        ratings <- data.frame(r1, r2)
        ac1_val <- gwet.ac1.raw(ratings)$est$coeff.val
        kappaSize.ac1(k0 = 0, k1 = ac1_val, alpha = 0.05, power = 0.8)$N
      },
      .groups = "drop"
    )
}

df_dutch <- df_dutch %>% mutate(language = "Dutch")
df_english <- df_english %>% mutate(language = "English")
df_combined <- bind_rows(df_dutch, df_english)
results_combined <- calculate_all_metrics(df_combined, "ResponseId")

results_combined_display <- results_combined %>%
  mutate(
    prevalence = sprintf("%.2f", prevalence),
    gwet_ac1 = sprintf("%.2f", gwet_ac1)
  )

cat("=== COMBINED RESULTS ===\n")
kable(results_combined_display, format = "simple", align = "l")

=== COMBINED RESULTS ===




ResponseId   prevalence   gwet_ac1   sample_size 
-----------  -----------  ---------  ------------
1            0.89         0.72       156         
2            0.85         0.60       185         
3            0.78         0.39       183         
4            0.65         0.36       177         
5            0.76         0.65       176         
6            0.81         0.53       192         
7            0.76         0.33       171         
8            0.68         0.55       190         

## 6. Basic Agreement Statistics by Language and Response

### Raw Agreement and Bias Analysis
Computing fundamental agreement metrics to complement the $AC_1$ reliability coefficients. These statistics provide additional insight into the nature of disagreements between expert raters.

**Key Metrics:**
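- **Raw Agreement**: Proportion of rows on which both raters give the same binarized (acceptable/unacceptable) rating
- **Bias Index**: Absolute difference between the two raters' proportions of "acceptable" ratings, indicating a systematic difference in rating tendencies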

**Analytical Value:**
Raw agreement provides an intuitive measure of rater consensus, while bias index identifies systematic rating differences that may indicate varying interpretation of quality standards or response style differences between expert raters.
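A toy example may make the two statistics concrete. The ratings below are hypothetical, not study data, and `binarize()` is restated so the snippet runs on its own.

```r
# Same threshold as in the notebook, repeated for a self-contained example
binarize <- function(x) ifelse(x >= 3, 1, 0)

# Hypothetical Likert ratings from two raters on five items
rater1 <- c(4, 3, 2, 5, 3)
rater2 <- c(4, 2, 2, 4, 1)

r1 <- binarize(rater1)  # 1 1 0 1 1
r2 <- binarize(rater2)  # 1 0 0 1 0

# Items 1, 3 and 4 match after binarization, so raw agreement is 3/5
raw_agreement <- mean(r1 == r2)

# Rater 1 marks 4/5 items acceptable, rater 2 marks 2/5
bias_index <- abs(mean(r1) - mean(r2))

c(raw_agreement = raw_agreement, bias_index = bias_index)
```

Item 4 shows that raters can differ on the Likert scale (5 versus 4) and still agree after binarization, while the bias index of 0.4 reflects that rater 1 is more lenient overall.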

In [421]:
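calculate_basic_agreement_simplified <- function(r1, r2) {
  # Collapse the Likert ratings to acceptable (1) / unacceptable (0)
  r1_bin <- binarize(r1)
  r2_bin <- binarize(r2)

  # Raw agreement: share of rows where both binarized ratings match
  raw_agreement <- mean(r1_bin == r2_bin)

  # Bias index: absolute gap between the raters' proportions of "acceptable"
  prevalence_r1 <- mean(r1_bin)
  prevalence_r2 <- mean(r2_bin)
  bias_index <- abs(prevalence_r1 - prevalence_r2)

  # One-row data frame so summarise() unpacks it into two columns
  return(data.frame(
    raw_agreement = raw_agreement,
    bias_index = bias_index
  ))
}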

basic_agreement_results <- df_combined %>%
  group_by(language, ResponseId) %>%
  summarise(
    calculate_basic_agreement_simplified(rater1, rater2),
    .groups = "drop"
  )

basic_agreement_display <- basic_agreement_results %>%
  mutate(
    raw_agreement = sprintf("%.2f", raw_agreement),
    bias_index = sprintf("%.2f", bias_index)
  )

cat("=== BASIC AGREEMENT STATISTICS BY LANGUAGE AND RESPONSE ID ===\n")
kable(basic_agreement_display, format = "simple", align = "l",
      col.names = c("Language", "Response ID", "Raw Agreement", "Bias Index"))

=== BASIC AGREEMENT STATISTICS BY LANGUAGE AND RESPONSE ID ===




Language   Response ID   Raw Agreement   Bias Index 
---------  ------------  --------------  -----------
Dutch      1             0.85            0.05       
Dutch      2             0.80            0.00       
Dutch      3             0.60            0.40       
Dutch      4             0.80            0.20       
Dutch      5             0.80            0.10       
Dutch      6             0.70            0.20       
Dutch      7             0.75            0.25       
Dutch      8             0.85            0.05       
English    1             0.70            0.20       
English    2             0.60            0.20       
English    3             0.60            0.10       
English    4             0.50            0.20       
English    5             0.75            0.15       
English    6             0.65            0.05       
English    7             0.40            0.30       
English    8             0.65            0.05       

## 7. Summary and Implications

### Pilot Study Findings
This analysis provides initial validation of the empirically derived quality metrics through human expert inter-rater reliability assessment. The results inform the refinement of metrics and guide the design of full-scale validation experiments.

**Key Implications for Framework Development:**
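1. **Metric Reliability**: $AC_1$ coefficients ranged from 0.33 to 0.72 across the eight ResponseId groups in the pooled sample, indicating which quality dimensions demonstrate sufficient expert consensus
2. **Sample Size Planning**: Estimated sample sizes (156 to 192 per group) guide future validation study design
3. **Language Effects**: Raw agreement on the binarized ratings was lower or equal for the English sample (0.40–0.75) compared with the Dutch sample (0.60–0.85) in every group, which may indicate a need for language-specific calibration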
4. **Generation Method Performance**: Preliminary evidence of LLM-generated content quality relative to human-written baselines

**Next Steps:**
These pilot results inform the development of the LLM-as-a-judge evaluation component and guide possible refinements to the quality assessment framework before large-scale implementation.