# Inter-rater reliability between humans
1. Sample Size Determination
2. Inter-rater Reliability Testing



In [16]:
from scipy.stats import norm
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

## 1. Sample Size Determination

**Parameters based on thesis methodology:**
- Significance level (α): 0.05  
- Power: 0.8  
- Expected κ (kappa): 0.7 *(substantial agreement)*  
- Null hypothesis κ₀: 0.3 *(fair agreement)*
- Test: two-sided, since there is no strong prior evidence that agreement is lower or higher than κ₀.
---

### First option: Fleiss Kappa
Use Fleiss’ kappa (3 raters: 2 humans + 1 LLM).


#### Fleiss Kappa Interpretation Guide:
| Kappa (κ)         | Interpretation          |
|-------------------|--------------------------|
| < 0               | Poor agreement           |
| 0.00–0.20         | Slight agreement         |
| 0.21–0.40         | Fair agreement           |
| 0.41–0.60         | Moderate agreement       |
| 0.61–0.80         | Substantial agreement    |
| 0.81–1.00         | Almost perfect agreement |
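
As a small convenience, the bands in the table can be turned into a lookup so that a computed κ is labelled consistently. This is only a sketch: the function name `interpret_kappa` is hypothetical, and the boundary handling (inclusive upper bounds, κ = 0 counted as slight agreement) is an assumption, not something the table specifies.

```python
def interpret_kappa(kappa):
    """Map a kappa value to the verbal label used in the interpretation table."""
    if kappa < 0:
        return "Poor agreement"
    # (upper bound, label) pairs; upper bounds are treated as inclusive (assumption)
    bands = [
        (0.20, "Slight agreement"),
        (0.40, "Fair agreement"),
        (0.60, "Moderate agreement"),
        (0.80, "Substantial agreement"),
        (1.00, "Almost perfect agreement"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


print(interpret_kappa(0.7))  # expected kappa from the parameters above
print(interpret_kappa(0.3))  # null-hypothesis kappa from the parameters above
```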



### Second option: individual comparison per human
It is also possible to compute pairwise Cohen’s kappa between the evaluation pipeline and each human rater, using the subset of items that both have rated.

If Fleiss’ kappa falls below 0.6, this pairwise comparison can help identify whether a specific company or sector lowers the score more than others.
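
A minimal, dependency-free sketch of the pairwise comparison is shown below. The function name `cohen_kappa` and the label lists are hypothetical and serve only to illustrate the calculation; they are not data from the study. In practice the same computation would be repeated per human rater, and per company or sector, on the items both raters labelled.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same, non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's own label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)


# Hypothetical labels for ten overlapping items (illustration only)
pipeline = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
human_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]

print(round(cohen_kappa(pipeline, human_a), 4))  # 0.7826
```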

In [25]:
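def fleiss_kappa_sample_size(kappa_0, kappa_1, n_raters, p=0.5, alpha=0.05, power=0.8):
    """Approximate number of items needed to distinguish kappa_1 from kappa_0 (binary ratings)."""
    # Chance agreement for two categories with prevalence p
    pe = p**2 + (1-p)**2
    
    # Observed agreement implied by the null value kappa_0
    po = pe + kappa_0 * (1 - pe)
    # Variance term for kappa, evaluated at kappa_0
    var_kappa_0 = (2 * (po * (1 - po) + (n_raters - 1) * (pe * (1 - pe) - (po - pe)**2))) / (n_raters * (n_raters - 1) * (1 - pe)**2)
    
    # Normal quantiles for a two-sided test at level alpha and for the target power
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    
    n = ((z_alpha + z_beta) ** 2 * var_kappa_0) / ((kappa_1 - kappa_0) ** 2)
    
    # Round up to a whole number of items
    return int(np.ceil(n))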


In [26]:
n_required = fleiss_kappa_sample_size(
    kappa_0=0.3, 
    kappa_1=0.7, 
    n_raters=3,  
    p=0.5  
)
print("Required sample size:", n_required, "if using Fleiss' Kappa")

Required sample size: 45 if using Fleiss' Kappa
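
To make the result easier to check, the arithmetic behind this output can be traced by plugging the call's values into the function as written. Here $m$ stands for `n_raters` (notation introduced only for this trace), and the z-values are the standard normal quantiles at 0.975 and 0.80, rounded to three decimals:

$$
\begin{aligned}
p_e &= 0.5^2 + 0.5^2 = 0.5 \\
p_o &= p_e + \kappa_0 (1 - p_e) = 0.5 + 0.3 \cdot 0.5 = 0.65 \\
\mathrm{Var}_0 &= \frac{2\left[p_o(1-p_o) + (m-1)\left(p_e(1-p_e) - (p_o-p_e)^2\right)\right]}{m(m-1)(1-p_e)^2}
  = \frac{2\,(0.2275 + 2 \cdot 0.2275)}{6 \cdot 0.25} = 0.91 \\
n &= \frac{(z_{1-\alpha/2} + z_{\text{power}})^2 \, \mathrm{Var}_0}{(\kappa_1 - \kappa_0)^2}
  \approx \frac{(1.960 + 0.842)^2 \cdot 0.91}{0.4^2} \approx 44.6
  \;\Rightarrow\; \lceil n \rceil = 45
\end{aligned}
$$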


## 2. Inter-rater Reliability Testing


In [30]:


# Example: 9 items, each rated by 3 raters; each row holds the number of raters per category (2 categories)
ratings = np.array([
    [1, 2],  
    [0, 3],  
    [0, 3],  
    [1, 2],
    [3, 0],  
    [0, 3],  
    [0, 3],  
    [0, 3],
    [3, 0]   
])

kappa = fleiss_kappa(ratings)
print("Fleiss' kappa:", kappa)

Fleiss' kappa: 0.6447368421052628
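
Because `fleiss_kappa` expects per-item category counts rather than raw labels, it may help to see how such a table is built and how the statistic follows from it. The sketch below is dependency-free and only illustrative: the raw labels are hypothetical, reconstructed so that their counts match the `ratings` table above under the assumption that column 0 counts label 0 and column 1 counts label 1. The helper names `to_count_table` and `fleiss_kappa_from_counts` are likewise made up for this example. For real data, statsmodels offers `aggregate_raters` in the same module for the aggregation step.

```python
from collections import Counter

# Hypothetical raw labels: one tuple per item, one entry per rater.
# Assumption: column 0 of the count table is label 0, column 1 is label 1.
raw_labels = [
    (0, 1, 1), (1, 1, 1), (1, 1, 1), (0, 1, 1), (0, 0, 0),
    (1, 1, 1), (1, 1, 1), (1, 1, 1), (0, 0, 0),
]


def to_count_table(rows, categories):
    """Count, for every item, how many raters chose each category."""
    return [[Counter(row)[c] for c in categories] for row in rows]


def fleiss_kappa_from_counts(table):
    """Fleiss' kappa for an items x categories table of rater counts."""
    n_items = len(table)
    n_raters = sum(table[0])  # assumes every item has the same number of ratings
    # Per-item agreement: share of rater pairs that agree on the item
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(per_item) / n_items
    # Chance agreement from the overall category proportions
    totals = [sum(col) for col in zip(*table)]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)


count_table = to_count_table(raw_labels, categories=(0, 1))
kappa_check = fleiss_kappa_from_counts(count_table)
print(count_table[:2])        # [[1, 2], [0, 3]]
print(round(kappa_check, 4))  # 0.6447
```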
