# Defacing pre-registration - Statistical analysis in R

## Load simulated data

In [2]:
df_full <- readRDS(file="SimulatedData/SimulatedDefacedRatings_noMissing.Rda")
#df_missing <- readRDS(file="SimulatedDefacedRatings_10%Missing.Rda")

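#Keep only the listed raters; %in% tests membership, whereas == would recycle the vector row by row
df_3 <- subset(df_full, rater %in% sprintf('rater%02d', 1:4))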

#Define the number of raters in the dataset
n_rater <- length(unique(df_full$rater))

The simulated data were generated by running the `SimulateDefacedRatings.ipynb` notebook.
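When several raters are selected at once, note that `==` and `%in%` behave differently in R: `==` compares element by element and recycles the shorter vector, whereas `%in%` tests membership. A minimal sketch on a toy data frame (the `toy` and `wanted` objects and their values are made up for illustration and are not the simulated dataset):

```r
# Toy data: 4 raters, 3 ratings each (hypothetical values, for illustration only)
toy <- data.frame(
  rater   = rep(sprintf("rater%02d", 1:4), each = 3),
  ratings = rep(1:3, times = 4)
)
wanted <- sprintf("rater%02d", 1:2)

# `==` recycles `wanted` along the rows, so only rows that happen to line up match
n_eq <- nrow(subset(toy, rater == wanted))

# `%in%` keeps every row whose rater is one of the wanted raters
n_in <- nrow(subset(toy, rater %in% wanted))

print(c(equal = n_eq, in_set = n_in))
```

With `==`, some rows of the wanted raters are silently dropped, which is why `%in%` is the safer choice for selecting a subset of raters.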

## Linear mixed effect regression

Because the continuation ratio model implementation cannot handle missing values, we switch to a linear mixed-effects regression model. Applying linear regression to ordinal data has been shown to be acceptable as long as the probabilities of belonging to each category are far from the extremes, that is, not close to 0 or 1.
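One way to check this condition is to look at the empirical proportion of ratings in each category. The sketch below uses a made-up vector of ratings on an assumed 1 to 4 scale, and the 0.05 cut-off is an arbitrary choice for illustration, not a threshold taken from the literature:

```r
# Hypothetical ordinal ratings on a 1-4 scale (made-up values, not the simulated data)
ratings <- factor(c(1, 2, 2, 3, 3, 3, 4, 4, 2, 3), levels = 1:4, ordered = TRUE)

# Proportion of observations falling in each category
props <- prop.table(table(ratings))
print(round(props, 2))

# Flag categories whose proportion is close to 0 or 1 (arbitrary 0.05 cut-off)
extreme <- props < 0.05 | props > 0.95
print(any(extreme))
```

On the real data, the same check could be run on `df_full$ratings`.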

In [3]:
library(lme4)
fm1 <- lmer(as.numeric(ratings) ~ defaced + (1 | rater), data=df_full)
summary(fm1)
ranef(fm1)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: as.numeric(ratings) ~ defaced + (1 | rater)
   Data: df_full

REML criterion at convergence: 41931.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8718 -0.7702  0.1403  1.0063  1.5259 

Random effects:
 Groups   Name        Variance Std.Dev.
 rater    (Intercept) 0.01502  0.1226  
 Residual             1.18701  1.0895  
Number of obs: 13920, groups:  rater, 12

Fixed effects:
                Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)    2.494e+00  3.771e-02 1.245e+01   66.13   <2e-16 ***
defaceddefaced 3.394e-01  1.847e-02 1.391e+04   18.38   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr)
defaceddfcd -0.245

$rater
         (Intercept)
rater01 -0.103171623
rater02 -0.156438850
rater03 -0.136261870
rater04 -0.101557465
rater05  0.005784068
rater06 -0.038605288
rater07 -0.008743358
rater08  0.070350403
rater09  0.013854859
rater10  0.051787581
rater11  0.197061836
rater12  0.205939707

with conditional variances for “rater” 

### Test for significance

To test whether defacing significantly biases the human ratings, we compute the p-value associated with its regression coefficient, using the Satterthwaite approximation for the degrees of freedom. The justification for choosing the Satterthwaite approximation is given in the following paper: https://link.springer.com/article/10.3758/s13428-016-0809-y#appendices.

In [4]:
library(lmerTest)
fm1 <- lmer(as.numeric(ratings) ~ defaced + (1 | rater), data=df_full)
anova(fm1)
summary(fm1)

| | Sum Sq | Mean Sq | NumDF | DenDF | F value | Pr(>F) |
|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` |
| defaced | 400.7934 | 400.7934 | 1 | 13907 | 337.6501 | 1.57672e-74 |


Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: as.numeric(ratings) ~ defaced + (1 | rater)
   Data: df_full

REML criterion at convergence: 41931.6

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8718 -0.7702  0.1403  1.0063  1.5259 

Random effects:
 Groups   Name        Variance Std.Dev.
 rater    (Intercept) 0.01502  0.1226  
 Residual             1.18701  1.0895  
Number of obs: 13920, groups:  rater, 12

Fixed effects:
                Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)    2.494e+00  3.771e-02 1.245e+01   66.13   <2e-16 ***
defaceddefaced 3.394e-01  1.847e-02 1.391e+04   18.38   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
            (Intr)
defaceddfcd -0.245
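
As a quick consistency check on the two outputs above: for a term with a single numerator degree of freedom, the F statistic reported by `anova()` is the square of the t statistic reported by `summary()`. The sketch below only reuses the printed values (the variable names are introduced here for illustration):

```r
# Values copied from the anova() and summary() outputs above
f_value <- 337.6501
t_value <- 18.38

# With one numerator degree of freedom, F = t^2, so sqrt(F) should match t up to rounding
t_from_f <- sqrt(f_value)
print(round(t_from_f, 2))
```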

### Do I need interaction terms in my model?

We use an interaction plot to check whether there is an interaction between the defacing status and the rater. If the lines are not parallel, an interaction exists. The plot construction is based on https://stattrek.com/multiple-regression/interaction.

In [None]:
#Compute mean rating for each rater and each condition
mean_defaced = c()
mean_original = c()
for (i in 1:n_rater){
    df_small_defaced <- subset(df_full, defaced == 'defaced' & rater == sprintf('rater%02d', i))
    df_small_original <- subset(df_full, defaced == 'original' & rater == sprintf('rater%02d', i))
    mean_defaced[i] <- mean(as.numeric(df_small_defaced$ratings))
    mean_original[i] <- mean(as.numeric(df_small_original$ratings))
}

#Interaction plot: draw the first rater's line, then add the remaining raters in the loop below
plot(c(0,1), c(mean_original[1], mean_defaced[1]),
    ylab="Mean rating",
    xlab="Defaced status",
    main ='Defaced * rater',
    ylim = c(2.4,3.3),
    type="o",
    col="blue")
for (i in 2:n_rater){ 
    lines(c(0,1), c(mean_original[i], mean_defaced[i]), type='o')
}

In the plot above, each line shows how the mean rating of one rater changes between the original and the defaced condition. The lines are not parallel, so the interaction term has to be included in the model. I interpret this as different raters being biased differently by the defacing process.
However, the number of raters needs to be reduced first: too many raters means too many coefficients to estimate, and the model then fails to converge.
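
The per-rater means and the plot can also be obtained without an explicit loop, using `tapply()` and the base-R helper `interaction.plot()` from the `stats` package. This is a sketch on a toy data frame that assumes the same column names as the notebook (`rater`, `defaced`, `ratings`); the values are made up, and on the real data the ratings would first be converted with `as.numeric()`:

```r
# Toy data with the assumed column names (values are made up for illustration)
toy <- data.frame(
  rater   = rep(c("rater01", "rater02"), each = 4),
  defaced = factor(rep(c("original", "original", "defaced", "defaced"), times = 2),
                   levels = c("original", "defaced")),
  ratings = c(2, 3, 3, 4, 2, 2, 2, 3)
)

# Mean rating per rater and condition (rows: raters, columns: conditions)
cell_means <- with(toy, tapply(ratings, list(rater, defaced), mean))
print(cell_means)

# Difference defaced - original per rater; unequal differences mean non-parallel lines
bias <- cell_means[, "defaced"] - cell_means[, "original"]
print(bias)

# One line per rater
with(toy, interaction.plot(x.factor = defaced, trace.factor = rater,
                           response = ratings,
                           ylab = "Mean rating", xlab = "Defaced status"))
```

Looking at the `bias` vector alongside the plot gives a numeric view of how far the lines are from being parallel.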

### How to deal with missing values

First, check which `na.action` R currently uses by default to deal with missing values.

In [None]:
getOption("na.action")

In [None]:
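library(lme4)
#df_missing must be loaded first (see the commented-out readRDS call in the data-loading cell)
#na.exclude leaves rows with missing values out of the fit but keeps their positions in residuals and fitted values
fm1 <- lmer(as.numeric(ratings) ~ defaced + (1 | rater), data=df_missing, na.action=na.exclude)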
summary(fm1)
ranef(fm1)
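
To illustrate what `na.exclude` changes compared with `na.omit`, here is a minimal sketch using `lm()` from base R rather than `lmer()`, so that it stays self-contained. The toy data are made up, and the assumption is that `lmer()` follows the same `na.action` convention:

```r
# Toy data with one missing response (made-up values)
toy <- data.frame(
  x = c(0, 1, 0, 1, 0, 1),
  y = c(2, 3, NA, 4, 2, 3)
)

fit_omit    <- lm(y ~ x, data = toy, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = toy, na.action = na.exclude)

# Both fits use the same complete rows, so the coefficients are identical
print(all.equal(coef(fit_omit), coef(fit_exclude)))

# na.omit drops the incomplete row from residuals(); na.exclude pads it with NA
print(length(residuals(fit_omit)))     # one value per complete row
print(length(residuals(fit_exclude)))  # one value per original row
```

Padding with `NA` keeps residuals and fitted values aligned with the rows of the original data frame, which is convenient when adding them back as columns.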