# Defacing pre-registration - Statistical analysis on manual ratings in R

## Load simulated manual ratings

In [5]:
df_full <- readRDS(file="SimulatedData/SimulatedDefacedRatings_noMissing.Rda")
#df_missing <- readRDS(file="SimulatedDefacedRatings_10%Missing.Rda")

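# Keep only raters 01 to 04; %in% tests membership, whereas == would recycle the vector element-wise
df_3 <- subset(df_full, rater %in% sprintf('rater%02d', 1:4))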

#Define the number of raters in the dataset
n_rater = length(unique(df_full$rater))

The simulated data were generated by running the `SimulateDefacedRatings.ipynb` notebook.

## Linear mixed effect regression

Because the continuation ratio model implementation could not handle missing values, we switch to a linear mixed-effects regression model. Applying linear regression to ordinal data has been shown to be acceptable as long as the probabilities of belonging to each category are far from the extremes.
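
One way to inspect this condition is to tabulate the share of ratings in each category. The sketch below is illustrative only: it uses a hypothetical data frame `toy` with made-up ratings on an assumed 1 to 4 scale, not the simulated dataset. The column names mirror `df_full`, but every value is invented.

```r
# Hypothetical toy data; column names mirror df_full, values are made up.
set.seed(1)
toy <- data.frame(
  defaced = factor(rep(c("original", "defaced"), each = 200),
                   levels = c("original", "defaced")),
  ratings = factor(sample(1:4, 400, replace = TRUE), levels = 1:4, ordered = TRUE)
)

# Proportion of ratings in each category, overall and per condition.
prop_overall <- prop.table(table(toy$ratings))
prop_by_condition <- prop.table(table(toy$defaced, toy$ratings), margin = 1)

print(round(prop_overall, 3))
print(round(prop_by_condition, 3))
```

Proportions close to 0 or 1 in any category would suggest that the linear approximation deserves a closer look.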

In [8]:
library(lme4)
fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df_full)
summary(fm1)
ranef(fm1)

boundary (singular) fit: see ?isSingular



Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: as.numeric(ratings) ~ defaced + (defaced | rater)
   Data: df_full

REML criterion at convergence: 41816.5

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.02978 -0.76932  0.09164  0.93353  1.40823 

Random effects:
 Groups   Name           Variance  Std.Dev. Corr
 rater    (Intercept)    0.0002946 0.01716      
          defaceddefaced 0.0446172 0.21123  1.00
 Residual                1.1767307 1.08477      
Number of obs: 13920, groups:  rater, 12

Fixed effects:
               Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)     2.49397    0.01391 44.77454 179.233  < 2e-16 ***
defaceddefaced  0.33937    0.06369 11.02948   5.329 0.000239 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr)
defaceddfcd 0.150 
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingul

$rater
          (Intercept) defaceddefaced
rater01 -1.924076e-02   -0.236796985
rater02 -2.157676e-02   -0.265546301
rater03 -1.996710e-02   -0.245736045
rater04 -1.222424e-02   -0.150444321
rater05 -7.744713e-04   -0.009531457
rater06 -4.384571e-03   -0.053961134
rater07  9.021668e-05    0.001110301
rater08  6.395524e-03    0.078710032
rater09  5.054708e-03    0.062208544
rater10  1.157294e-02    0.142428727
rater11  2.736127e-02    0.336736508
rater12  2.769324e-02    0.340822132

with conditional variances for “rater” 

### Test for significance

To test whether defacing significantly biases the human ratings, we compute the p-value associated with its regression coefficient, using the Satterthwaite approximation for the degrees of freedom. The justification for choosing this approximation is given in the following paper: https://link.springer.com/article/10.3758/s13428-016-0809-y#appendices.
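
For reference, the model specified by `as.numeric(ratings) ~ defaced + (defaced | rater)` can be written out as below. The notation is an assumption introduced here, not taken from the notebook: $y_{ij}$ is the $i$-th rating given by rater $j$, and $d_{ij}$ equals 1 for a defaced image and 0 for an original one.

$$
y_{ij} = \beta_0 + \beta_1 d_{ij} + u_{0j} + u_{1j} d_{ij} + \varepsilon_{ij},
\qquad (u_{0j}, u_{1j})^\top \sim \mathcal{N}(0, \Sigma),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

$$
H_0 : \beta_1 = 0 \quad \text{versus} \quad H_1 : \beta_1 \neq 0,
\qquad t = \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)}
$$

In this notation, $\beta_1$ corresponds to the `defaceddefaced` row of the fixed-effects table, and the $t$ statistic is compared against a $t$ distribution whose degrees of freedom are approximated by Satterthwaite's method.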

In [9]:
library(lmerTest)
fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df_full)
anova(fm1)
summary(fm1)

boundary (singular) fit: see ?isSingular



|         | Sum Sq  | Mean Sq | NumDF | DenDF    | F value  | Pr(>F)       |
|---------|---------|---------|-------|----------|----------|--------------|
|         | \<dbl\> | \<dbl\> | \<int\> | \<dbl\> | \<dbl\>  | \<dbl\>      |
| defaced | 33.4114 | 33.4114 | 1     | 11.02948 | 28.39341 | 0.0002394224 |


Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: as.numeric(ratings) ~ defaced + (defaced | rater)
   Data: df_full

REML criterion at convergence: 41816.5

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.02978 -0.76932  0.09164  0.93353  1.40823 

Random effects:
 Groups   Name           Variance  Std.Dev. Corr
 rater    (Intercept)    0.0002946 0.01716      
          defaceddefaced 0.0446172 0.21123  1.00
 Residual                1.1767307 1.08477      
Number of obs: 13920, groups:  rater, 12

Fixed effects:
               Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)     2.49397    0.01391 44.77454 179.233  < 2e-16 ***
defaceddefaced  0.33937    0.06369 11.02948   5.329 0.000239 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr)
defaceddfcd 0.150 
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingul

### Do I need interaction terms in my model?

We use an interaction plot to check whether an interaction between the defacing status and the rater is present. If the lines are not parallel, an interaction exists. The plot construction is based on https://stattrek.com/multiple-regression/interaction.

In [None]:
#Compute mean rating for each rater and each condition
mean_defaced = c()
mean_original = c()
for (i in 1:n_rater){
    df_small_defaced <- subset(df_full, defaced == 'defaced' & rater == sprintf('rater%02d', i))
    df_small_original <- subset(df_full, defaced == 'original' & rater == sprintf('rater%02d', i))
    mean_defaced[i] <- mean(as.numeric(df_small_defaced$ratings))
    mean_original[i] <- mean(as.numeric(df_small_original$ratings))
}

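#Interaction plot: draw the first rater in blue, then add one line per remaining rater
#(indexing with i here would reuse the last loop value and skip rater 1)
plot(c(0,1), c(mean_original[1], mean_defaced[1]),
    ylab="Mean rating",
    xlab="Defaced status",
    main ='Defaced * rater',
    ylim = c(2.4,3.3),
    type="o",
    col="blue")
for (i in 2:n_rater){ 
    lines(c(0,1), c(mean_original[i], mean_defaced[i]), type='o')
}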

In the plot above, each line shows how the mean rating of one rater changes between the original and the defaced condition. The lines are not parallel, so we have to include the interaction term in the model. I interpret this as meaning that different raters are biased differently by the defacing process.
However, we need to reduce the number of raters: too many raters means too many coefficients to estimate, and the model then fails to converge.
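
The per-rater means behind such a plot can also be computed without an explicit loop. The following is a minimal sketch on a hypothetical data frame `toy` (three raters, made-up ratings on an assumed 1 to 4 scale, a hypothetical `subject` column); it is not the notebook's own code, only an alternative way of getting the same kind of summary with `tapply`.

```r
# Hypothetical toy data: 3 raters, 2 conditions, ratings on an assumed 1-4 scale.
set.seed(42)
toy <- expand.grid(
  subject = 1:50,
  rater   = sprintf("rater%02d", 1:3),
  defaced = factor(c("original", "defaced"), levels = c("original", "defaced"))
)
toy$ratings <- sample(1:4, nrow(toy), replace = TRUE)

# Mean rating per rater (rows) and condition (columns), without an explicit loop.
means <- tapply(toy$ratings, list(toy$rater, toy$defaced), mean)

# Change in mean rating after defacing, one value per rater.
# Equal values correspond to parallel lines in the interaction plot.
slopes <- means[, "defaced"] - means[, "original"]

print(round(means, 3))
print(round(slopes, 3))
```

With `means` in this shape, `matplot(c(0, 1), t(means), type = "o")` draws one line per rater, and the spread of `slopes` gives a rough numerical counterpart to how far the lines are from parallel.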

### How to deal with missing values

Check which option R currently uses to handle missing values. The built-in choices are `na.omit`, `na.exclude`, `na.fail` and `na.pass`.

In [None]:
getOption("na.action")
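
To illustrate what the choice of `na.action` changes, the sketch below compares `na.omit` and `na.exclude` on a hypothetical data frame `toy` with two missing ratings. It uses base R `lm()` as a stand-in for `lmer()`, which is an assumption made to keep the example self-contained; `lmer()` accepts the same `na.action` argument, but its exact behaviour is worth confirming in the lme4 documentation.

```r
# Hypothetical toy data with two missing ratings; lm() stands in for lmer().
toy <- data.frame(
  defaced = rep(c(0, 1), times = 5),
  ratings = c(2, 3, NA, 4, 2, 3, 1, NA, 3, 4)
)

fit_omit    <- lm(ratings ~ defaced, data = toy, na.action = na.omit)
fit_exclude <- lm(ratings ~ defaced, data = toy, na.action = na.exclude)

# Both fits use the same 8 complete rows, so the coefficients are identical.
print(all.equal(coef(fit_omit), coef(fit_exclude)))

# na.omit drops the incomplete rows from residuals();
# na.exclude pads them with NA so the result lines up with the input rows.
n_resid_omit    <- length(residuals(fit_omit))
n_resid_exclude <- length(residuals(fit_exclude))
print(c(omit = n_resid_omit, exclude = n_resid_exclude))
```

The estimates do not depend on which of the two options is used; the difference only shows up when residuals or fitted values are matched back to the original rows.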

In [None]:
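library(lme4)
# df_missing must be loaded first; its readRDS() call is commented out in the first cell
# Random-intercept model fitted on the ratings that contain missing values
fm1 <- lmer(as.numeric(ratings) ~ defaced + (1 | rater), data=df_missing, na.action=na.exclude)
summary(fm1)
ranef(fm1)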