# Defacing pre-registration

## Statistical analysis in R

In [10]:
library(GLMMadaptive)

### Simulate data

Simulate 12 raters, each rating `n_rated` original images (set to 580 in the code below). The ratings are sampled at random from {1,2,3,4} and randomly distributed across subjects. To introduce a bias in the ratings of defaced images, we add +1 to a predefined percentage of each rater's ratings on the original images, clipping the result at 4. The percentage of scans affected varies between raters.

In [11]:
set.seed(1234)

n_rated <- 580 #nbr of subjects rated per rater
n_sub <- 580 #nbr of subjects available in the dataset
n_rater <- 12 #nbr of raters

manual_original <- matrix(NA, nrow = n_sub, ncol = n_rater)
manual_defaced <- matrix(NA, nrow = n_sub, ncol = n_rater)

#Define for each rater the percentage of biased ratings
perc_biased <- c(2,10,10,30,40,40,50,50,60,70,90,90)
for (i in 1:n_rater) {
    #Each rater rates n_rated subjects picked at random
    ind_sub <- sample(1:n_sub, n_rated, replace = FALSE)
    #n_rated random original ratings sampled from {1,2,3,4}
    ratings <- sample(1:4, n_rated, replace = TRUE)
    manual_original[ind_sub, i] <- ratings
    
    #Improve the ratings of a percentage of the original scans
    ind_rat <- sample(1:n_rated, round(n_rated*perc_biased[i]/100), replace = FALSE)
    ratings_biased <- ratings
    ratings_biased[ind_rat] <- ratings_biased[ind_rat] + 1
    #The scale stops at 4, so clip higher values to 4
    ratings_biased[ratings_biased == 5] <- 4
    manual_defaced[ind_sub, i] <- ratings_biased
}

manual_original_vec <- c(t(manual_original))
manual_defaced_vec <- c(t(manual_defaced))
defaced <- c(rep(c('original'), times = n_rater*n_sub), rep(c('defaced'), times = n_rater*n_sub))
rater <- rep(1:n_rater, times = n_sub)
sub <- rep(1:n_sub, each=n_rater, length.out=n_rater*n_sub*2)

#Convert to dataframe to use in regression
df <- data.frame(sub = sub, rater = rater , defaced = defaced)
df$ratings <- factor(c(manual_original_vec, manual_defaced_vec), levels = 1:4, labels = c("excluded", "poor", "good", "excellent"))
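
A quick sanity check on the simulation can be useful before fitting any model. The sketch below is a minimal, self-contained re-implementation of the per-rater step for a single rater; the helper name `simulate_rater` and the seed are assumptions and not part of the analysis above. It illustrates that, because ratings of 4 cannot increase, the fraction of ratings that actually change is smaller than the nominal percentage (about three quarters of it when the original ratings are uniform on {1,2,3,4}).

```r
# Hypothetical helper: simulate one rater's original and defaced ratings,
# mirroring the loop body above (add +1 to a percentage of ratings, clip at 4).
simulate_rater <- function(n_rated, perc_biased) {
  original <- sample(1:4, n_rated, replace = TRUE)
  biased_idx <- sample(seq_len(n_rated), round(n_rated * perc_biased / 100))
  defaced <- original
  defaced[biased_idx] <- pmin(defaced[biased_idx] + 1, 4)  # clip at 4
  data.frame(original = original, defaced = defaced)
}

set.seed(1)  # arbitrary seed, chosen only for this illustration
sim <- simulate_rater(n_rated = 580, perc_biased = 50)

# Cross-tabulation: rows = original rating, columns = defaced rating
tab <- table(original = sim$original, defaced = sim$defaced)
print(tab)

# Fraction of ratings that actually changed once clipping is applied
changed <- mean(sim$defaced != sim$original)
print(changed)
```

The same cross-tabulation can be applied column by column to `manual_original` and `manual_defaced` to check each simulated rater.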

### Continuation ratio mixed effects regression

Because the ratings are ordinal, we use a continuation ratio mixed-effects regression from the GLMMadaptive package in R, with a random intercept per rater to model between-rater variability.

In [25]:
cr_vals <- cr_setup(df$ratings, direction = "backward")
cr_data <- df[cr_vals$subs, ]
cr_data$ratings_new <- cr_vals$y
cr_data$cohort <- cr_vals$cohort

fm <- mixed_model(ratings_new ~ defaced + rater, random = ~ 1 | rater, data = cr_data, family = binomial())
fm

“In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
 extra argument ‘family’ will be disregarded”



Call:
lm(formula = ratings_new ~ defaced + rater, data = cr_data, family = binomial())

Coefficients:
    (Intercept)  defacedoriginal            rater  
       0.366716        -0.094922         0.009251  
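
The continuation ratio approach recasts the ordinal outcome as a series of binary comparisons. In the backward formulation, each cohort k compares Y = k against Y < k among the observations with Y <= k. The sketch below illustrates that expansion in base R on four toy ratings. It is not GLMMadaptive's implementation: the function name `expand_backward_cr` and the integer cohort labels are assumptions, and the labels and coding produced by `cr_setup` may differ.

```r
# Illustrative backward continuation-ratio expansion (not GLMMadaptive's code).
# For an ordered outcome with K levels, an observation with value y joins every
# cohort k in K, K-1, ..., 2 for which y <= k; the binary response is 1 if y == k.
expand_backward_cr <- function(y) {
  y_int <- as.integer(y)
  n_levels <- nlevels(y)
  rows <- lapply(seq_along(y_int), function(i) {
    ks <- seq(n_levels, max(y_int[i], 2))
    data.frame(subs = i, cohort = ks, y_new = as.integer(y_int[i] == ks))
  })
  do.call(rbind, rows)
}

# Toy example: one observation per rating level
ratings <- factor(c("excluded", "poor", "good", "excellent"),
                  levels = c("excluded", "poor", "good", "excellent"))
expanded <- expand_backward_cr(ratings)
print(expanded)
```

Each original observation contributes one row per cohort it belongs to, which is why the expanded data set has more rows than `df` and why a cohort indicator is stored alongside the new binary response.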


### Ordered logistic regression

Before increasing the complexity with mixed_model, we first fit a plain ordered logistic regression. Adapted from https://stats.oarc.ucla.edu/r/dae/ordinal-logistic-regression/

In [34]:
library(MASS)
m <- polr(ratings ~ defaced + rater, data = df, Hess=TRUE, method = "logistic")
m

Call:
polr(formula = ratings ~ defaced + rater, data = df, Hess = TRUE, 
    method = "logistic")

Coefficients:
defacedoriginal           rater 
    -0.54861742      0.05227948 

Intercepts:
 excluded|poor      poor|good good|excellent 
    -1.3730969     -0.1424015      0.9006451 

Residual Deviance: 37781.85 
AIC: 37791.85 

One way to calculate a p-value in this case is to compare the t-value against the standard normal distribution, like a z-test. Strictly, this is only valid with infinite degrees of freedom, but it is a reasonable approximation for large samples and becomes increasingly biased as the sample size decreases.

In [36]:
## Store table
ctable <- coef(summary(m))

## calculate and store p values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2

## combined table
(ctable <- cbind(ctable, "p value" = p))

| | Value | Std. Error | t value | p value |
|---|---|---|---|---|
| defacedoriginal | -0.54861742 | 0.03072127 | -17.857904 | 2.508829e-71 |
| rater | 0.05227948 | 0.00443805 | 11.779831 | 4.959221e-32 |
| excluded\|poor | -1.37309692 | 0.03900182 | -35.205974 | 1.62005e-271 |
| poor\|good | -0.14240145 | 0.03691166 | -3.857899 | 0.000114366 |
| good\|excellent | 0.90064511 | 0.03770962 | 23.883699 | 4.524274e-126 |


We can also get confidence intervals for the parameter estimates. These can be obtained either by profiling the likelihood function or by using the standard errors and assuming a normal distribution. Note that profiled CIs are not symmetric (although they are usually close to symmetric). If the 95% CI does not cross 0, the parameter estimate is statistically significant.

In [37]:
(ci <- confint(m)) # default method gives profiled CIs
confint.default(m) # CIs assuming normality

Waiting for profiling to be done...



| | 2.5 % | 97.5 % |
|---|---|---|
| defacedoriginal | -0.60886781 | -0.48844059 |
| rater | 0.04358562 | 0.06098284 |


| | 2.5 % | 97.5 % |
|---|---|---|
| defacedoriginal | -0.60882999 | -0.4884048 |
| rater | 0.04358106 | 0.0609779 |


Both the p-values and the confidence intervals indicate that defacing and rater each significantly improve the fit of the model.
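
The normal-approximation interval can be reproduced by hand as estimate ± z × standard error. The sketch below does this in base R for the `defacedoriginal` coefficient, using the estimate and standard error reported in the coefficient table above; the variable names are assumptions introduced for illustration.

```r
# Values copied from the coefficient table above (defacedoriginal row)
estimate  <- -0.54861742
std_error <- 0.03072127

# 97.5% quantile of the standard normal, for a two-sided 95% interval
z <- qnorm(0.975)

# Wald (normal-approximation) confidence interval
wald_ci <- estimate + c(-1, 1) * z * std_error
names(wald_ci) <- c("2.5 %", "97.5 %")
print(wald_ci)
```

Up to rounding, the result agrees with the `confint.default(m)` interval for `defacedoriginal`, and it lies entirely below zero.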

### Test significance of predictors

We test the significance of the predictors using a Wald test. Specifically, we want to test whether the coefficient associated with the fixed effect of defacing is significantly different from zero. Note that the use of this test is possible because we have a balanced, nested GLMM.


In [None]:
library(aod)

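#Perform a Wald test of whether the defacing coefficient is zero.
#For a mixed_model fit, coef() returns rater-specific coefficients, so use the
#fixed effects and their covariance matrix instead.
#Terms = 2 selects the defacing coefficient (Terms = 1 is the intercept).
wald.test(Sigma = vcov(fm, parm = "fixed-effects"), b = fixef(fm), Terms = 2)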

If the p-value is greater than 0.05, we fail to reject the null hypothesis that the defacing coefficient is zero; that is, the data provide no evidence that defacing affects the ratings. In that case the predictor can be dropped from the model, because it does not significantly improve the fit.
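
For a single coefficient, the Wald statistic is the squared ratio of the estimate to its standard error, compared against a chi-squared distribution with one degree of freedom. The sketch below works this through in base R using, purely for illustration, the `defacedoriginal` estimate and standard error reported for the `polr` fit earlier (not the mixed model); the variable names are assumptions.

```r
# Illustrative values: defacedoriginal row of the polr coefficient table
estimate  <- -0.54861742
std_error <- 0.03072127

# z statistic and single-coefficient Wald statistic
z <- estimate / std_error
wald_stat <- z^2

# p-value from the chi-squared distribution with 1 degree of freedom
p_value <- pchisq(wald_stat, df = 1, lower.tail = FALSE)

print(c(z = z, wald = wald_stat, p = p_value))

# Decision at the 5% level: TRUE means the null hypothesis is rejected
reject_null <- p_value < 0.05
print(reject_null)
```

With one degree of freedom, this p-value coincides with the two-sided z-test p-value computed earlier from the t-value.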