# Defacing pre-registration - Statistical analysis in R

In this notebook, we vary all dataset parameters (i.e., number of subjects, number of raters, rating levels) in search of a model that does not converge to a singular fit. The model is based on the formula `defaced + (defaced|rater)`. We did not find such a model.
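For reference, the formula above can be written out as follows. The notation is ours, not taken from the pre-registration: $i$ indexes ratings, $j$ indexes raters, and $d_{ij}$ is 1 for a defaced scan and 0 otherwise.

$$
y_{ij} = \beta_0 + \beta_1 d_{ij} + u_{0j} + u_{1j} d_{ij} + \varepsilon_{ij},
\qquad
\begin{pmatrix} u_{0j} \\ u_{1j} \end{pmatrix} \sim \mathcal{N}\!\left(0, \Sigma\right),
\qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
$$

$$
\Sigma = \begin{pmatrix} \tau_0^2 & \rho\,\tau_0\tau_1 \\ \rho\,\tau_0\tau_1 & \tau_1^2 \end{pmatrix}
$$

In this notation, a fit is singular when the estimate of $\Sigma$ lies on the boundary of its parameter space, that is, when $\tau_0 = 0$, $\tau_1 = 0$, or $|\rho| = 1$.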

### Function to simulate data with missing values

In [1]:
source("simulate_data.R")

## Starting model

We automatically drop ratings to test whether missing values help the model converge to a non-singular fit.
We fix the percentage of biased scans per rater at around 50%. This percentage was informed by our preliminary study: for both raters, ~50% of the scans were rated differently between the defaced and non-defaced conditions.
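`simulate_data.R` is not shown in this notebook. As a rough illustration of the kind of data described above, here is a hypothetical stand-in written in base R. The function name `simulate_ratings_sketch`, the column names, the `"original"` level and the clipping rule are all assumptions made for this sketch; the real `simulate_data` may work differently.

```r
# Hypothetical stand-in for simulate_data(); NOT the function sourced above.
simulate_ratings_sketch <- function(n_rated, n_sub, n_rater, perc_biased,
                                    ratings_range = 1:4, bias = 1) {
  stopifnot(length(perc_biased) == n_rater, n_rated <= n_sub)
  rows <- lapply(seq_len(n_rater), function(r) {
    # Rating of each scan in the non-defaced condition
    base <- sample(ratings_range, n_sub, replace = TRUE)
    # Scans that this rater scores differently once defaced
    n_biased <- round(n_sub * perc_biased[r] / 100)
    is_biased <- seq_len(n_sub) %in% sample.int(n_sub, n_biased)
    # Biased scans are shifted by `bias` and clipped to the rating scale
    shifted <- pmax(pmin(base + bias * is_biased, max(ratings_range)),
                    min(ratings_range))
    # Subjects beyond the first n_rated are left unrated (missing)
    keep <- seq_len(n_sub) <= n_rated
    base[!keep] <- NA
    shifted[!keep] <- NA
    data.frame(sub = rep(seq_len(n_sub), times = 2),
               rater = r,
               defaced = rep(c("original", "defaced"), each = n_sub),
               ratings = c(base, shifted))
  })
  df <- do.call(rbind, rows)
  df$rater <- factor(df$rater)
  df$defaced <- factor(df$defaced, levels = c("original", "defaced"))
  df
}

# Example: 4 raters, 580 subjects, 100 of them left unrated
set.seed(1)
df_demo <- simulate_ratings_sketch(480, 580, 4, c(2, 40, 60, 80))
table(df_demo$rater, missing = is.na(df_demo$ratings))
```

Each subject appears twice per rater (once per condition), which matches the 4640 observations reported for 580 subjects and 4 raters in the output below.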

In [2]:
n_sub <- 580 #nbr of subjects available in the dataset
n_drop <- 100
n_rater <- 4 #nbr of raters
#Define for each rater the percentage of biased ratings
perc_biased <- c(2,40,60,80)
ratings_range <- 1:4
labels <- c('excluded','poor','good','excellent')
bias <- 1

library(coefplot2)

for (j in seq(0, n_sub, by=n_drop)){
    print(sprintf("_______________%.02f missing values__________", j*100/n_sub))
    
    df <- simulate_data(n_sub-j, n_sub, n_rater, perc_biased, ratings_range=ratings_range, bias=bias)
    
    library(lme4)
    fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df, na.action=na.omit, REML = TRUE)
    
    print(summary(fm1)) 
    
}

Loading required package: coda



[1] "_______________0.00 missing values__________"


Loading required package: Matrix



Linear mixed model fit by REML ['lmerMod']
Formula: as.numeric(ratings) ~ defaced + (1 | rater)
   Data: df

REML criterion at convergence: 14045.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.7936 -0.7645  0.1469  1.0583  1.4661 

Random effects:
 Groups   Name        Variance Std.Dev.
 rater    (Intercept) 0.01095  0.1047  
 Residual             1.20384  1.0972  
Number of obs: 4640, groups:  rater, 4

Fixed effects:
               Estimate Std. Error t value
(Intercept)     2.48966    0.05707   43.62
defaceddefaced  0.34267    0.03221   10.64

Correlation of Fixed Effects:
            (Intr)
defaceddfcd -0.282
[1] "_______________17.24 missing values__________"
Linear mixed model fit by REML ['lmerMod']
Formula: as.numeric(ratings) ~ defaced + (1 | rater)
   Data: df

REML criterion at convergence: 11579.2

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.77721 -0.85740  0.05973  0.97686  1.53129 

Random effects:
 Groups   Name        Variance 

## Modify the number of raters

In [3]:
n_sub <- 580 #nbr of subjects available in the dataset
n_drop <- 100
n_rater <- 100 #nbr of raters
#Define for each rater the percentage of biased ratings
perc_biased <- rep(c(2,20,40,40,50,50,60,60,60,80), times = n_rater/10)
ratings_range <- 1:4
labels <- c('excluded','poor','good','excellent')
bias <- 1

library(coefplot2)

for (j in seq(0, n_sub, by=n_drop)){
    print(sprintf("_______________%.02f missing values__________", j*100/n_sub))
    
    df <- simulate_data(n_sub-j, n_sub, n_rater, perc_biased, ratings_range=ratings_range, bias=bias)
    
    library(lme4)
    fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df, na.action=na.omit, REML = TRUE)
    
    print(summary(fm1)) 
    
}

[1] "_______________0.00 missing values__________"
Linear mixed model fit by REML ['lmerMod']
Formula: as.numeric(ratings) ~ defaced + (1 | rater)
   Data: df

REML criterion at convergence: 349334.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8400 -0.7936  0.1220  1.0172  1.5529 

Random effects:
 Groups   Name        Variance Std.Dev.
 rater    (Intercept) 0.006315 0.07947 
 Residual             1.187448 1.08970 
Number of obs: 116000, groups:  rater, 100

Fixed effects:
               Estimate Std. Error t value
(Intercept)    2.498897   0.009145  273.26
defaceddefaced 0.346345   0.006399   54.12

Correlation of Fixed Effects:
            (Intr)
defaceddfcd -0.350
[1] "_______________17.24 missing values__________"
Linear mixed model fit by REML ['lmerMod']
Formula: as.numeric(ratings) ~ defaced + (1 | rater)
   Data: df

REML criterion at convergence: 289196.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8487 -0.7994  0.1180  1.0155  1.6006 



In [4]:
n_sub <- 580 #nbr of subjects available in the dataset
n_drop <- 100
n_rater <- 20 #nbr of raters
#Define for each rater the percentage of biased ratings
perc_biased <- rep(c(2,20,40,40,50,50,60,60,60,80), times = n_rater/10)
ratings_range <- 1:4
labels <- c('excluded','poor','good','excellent')
bias <- 1

library(coefplot2)

for (j in seq(0, n_sub, by=n_drop)){
    print(sprintf("_______________%.02f missing values__________", j*100/n_sub))
    
    df <- simulate_data(n_sub-j, n_sub, n_rater, perc_biased, ratings_range=ratings_range, bias=bias)
    
    library(lme4)
    fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df, na.action=na.omit, REML = TRUE)
    
    print(summary(fm1)) 
    
}

[1] "_______________0.00 missing values__________"
Linear mixed model fit by REML ['lmerMod']
Formula: as.numeric(ratings) ~ defaced + (1 | rater)
   Data: df

REML criterion at convergence: 69587.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8094 -0.7906  0.1237  1.0177  1.5692 

Random effects:
 Groups   Name        Variance Std.Dev.
 rater    (Intercept) 0.007452 0.08632 
 Residual             1.172669 1.08290 
Number of obs: 23200, groups:  rater, 20

Fixed effects:
               Estimate Std. Error t value
(Intercept)     2.48897    0.02176   114.4
defaceddefaced  0.34974    0.01422    24.6

Correlation of Fixed Effects:
            (Intr)
defaceddfcd -0.327
[1] "_______________17.24 missing values__________"
Linear mixed model fit by REML ['lmerMod']
Formula: as.numeric(ratings) ~ defaced + (1 | rater)
   Data: df

REML criterion at convergence: 57776.5

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8252 -0.8043  0.1209  1.0189  1.5896 

Rand

## Modify number of subjects

In [5]:
n_sub <- 3000 #nbr of subjects available in the dataset
n_drop <- 100
n_rater <- 20 #nbr of raters
#Define for each rater the percentage of biased ratings
perc_biased <- rep(c(2,20,40,40,50,50,60,60,60,80), times = n_rater/10)
ratings_range <- 1:4
labels <- c('excluded','poor','good','excellent')
bias <- 1

library(coefplot2)

for (j in seq(0, n_sub, by=n_drop)){
    print(sprintf("_______________%.02f missing values__________", j*100/n_sub))
    
    df <- simulate_data(n_sub-j, n_sub, n_rater, perc_biased, ratings_range=ratings_range, bias=bias)
    
    library(lme4)
    fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df, na.action=na.omit, REML = TRUE)
    
    print(summary(fm1)) 
    
}

[1] "_______________0.00 missing values__________"
Linear mixed model fit by REML ['lmerMod']
Formula: as.numeric(ratings) ~ defaced + (1 | rater)
   Data: df

REML criterion at convergence: 361374.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.7971 -0.7960  0.1212  1.0224  1.5269 

Random effects:
 Groups   Name        Variance Std.Dev.
 rater    (Intercept) 0.005698 0.07549 
 Residual             1.188752 1.09030 
Number of obs: 120000, groups:  rater, 20

Fixed effects:
               Estimate Std. Error t value
(Intercept)    2.496800   0.017456  143.03
defaceddefaced 0.346517   0.006295   55.05

Correlation of Fixed Effects:
            (Intr)
defaceddfcd -0.180
[1] "_______________3.33 missing values__________"
Linear mixed model fit by REML ['lmerMod']
Formula: as.numeric(ratings) ~ defaced + (1 | rater)
   Data: df

REML criterion at convergence: 349181.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.7983 -0.7982  0.1178  1.0191  1.5436 

Ra

ERROR: Error in lme4::lFormula(formula = as.numeric(ratings) ~ defaced + (1 | : 0 (non-NA) cases
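
The error above most likely comes from the last iteration: with `n_sub <- 3000` and `n_drop <- 100`, `seq(0, n_sub, by=n_drop)` ends at `j = n_sub`, so `simulate_data` is asked for zero rated subjects. A minimal sketch of a grid that stops one step earlier (the names `drops` and `n_rated_grid` are illustrative, not from the notebook):

```r
n_sub <- 3000
n_drop <- 100

# Stop at n_sub - n_drop so that at least n_drop subjects remain rated
drops <- seq(0, n_sub - n_drop, by = n_drop)

# Number of rated subjects passed to the simulation at each step
n_rated_grid <- n_sub - drops
range(n_rated_grid)
```

The earlier cells do not hit this case because 580 is not a multiple of 100, so their sequence stops at `j = 500`.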


In [None]:
n_sub <- 20 #nbr of subjects available in the dataset
n_drop <- 6
n_rater <- 20 #nbr of raters
#Define for each rater the percentage of biased ratings
perc_biased <- rep(c(2,20,40,40,50,50,60,60,60,80), times = n_rater/10)
ratings_range <- 1:4
labels <- c('excluded','poor','good','excellent')
bias <- 1

library(coefplot2)

for (j in seq(0, n_sub, by=n_drop)){
    print(sprintf("_______________%.02f missing values__________", j*100/n_sub))
    
    df <- simulate_data(n_sub-j, n_sub, n_rater, perc_biased, ratings_range=ratings_range, bias=bias)
    
    library(lme4)
    fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df, na.action=na.omit, REML = TRUE)
    
    print(summary(fm1)) 
    
}

## Change ratings levels

### Make ratings closer to continuous

In [None]:
n_sub <- 580 #nbr of subjects available in the dataset
n_drop <- 100
n_rater <- 20 #nbr of raters
#Define for each rater the percentage of biased ratings
perc_biased <- rep(c(2,20,40,40,50,50,60,60,60,80), times = n_rater/10)
ratings_range <- seq(0,1,length.out=11)
labels <- c('excluded','0.1','poor','0.3','acceptable','0.5','good','0.7','very good','0.9','excellent')
bias <- 0.1

library(coefplot2)

for (j in seq(0, n_sub, by=n_drop)){
    print(sprintf("_______________%.02f missing values__________", j*100/n_sub))
    
    df <- simulate_data(n_sub-j, n_sub, n_rater, perc_biased, ratings_range=ratings_range, bias=bias)
    
    library(lme4)
    fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df, na.action=na.omit, REML = TRUE)
    
    print(summary(fm1)) 
    
}

In [None]:
n_sub <- 580 #nbr of subjects available in the dataset
n_drop <- 100
n_rater <- 20 #nbr of raters
#Define for each rater the percentage of biased ratings
perc_biased <- rep(c(2,20,40,40,50,50,60,60,60,80), times = n_rater/10)
ratings_range <- seq(0,1,length.out=51)
bias <- 0.1

library(coefplot2)

for (j in seq(0, n_sub, by=n_drop)){
    print(sprintf("_______________%.02f missing values__________", j*100/n_sub))
    
    df <- simulate_data(n_sub-j, n_sub, n_rater, perc_biased, ratings_range=ratings_range, bias=bias)
    
    library(lme4)
    fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df, na.action=na.omit, REML = TRUE)
    
    print(summary(fm1)) 
    
}

### Change the ratings' levels

In [None]:
n_sub <- 580 #nbr of subjects available in the dataset
n_drop <- 100
n_rater <- 20 #nbr of raters
#Define for each rater the percentage of biased ratings
perc_biased <- rep(c(2,20,40,40,50,50,60,60,60,80), times = n_rater/10)
ratings_range <- seq(0,10,length.out=11)
bias <- 1

library(coefplot2)

for (j in seq(0, n_sub, by=n_drop)){
    print(sprintf("_______________%.02f missing values__________", j*100/n_sub))
    
    df <- simulate_data(n_sub-j, n_sub, n_rater, perc_biased, ratings_range=ratings_range, bias=bias)
    
    library(lme4)
    fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df, na.action=na.omit, REML = TRUE)
    
    print(summary(fm1)) 
    
}

In [None]:
n_sub <- 580 #nbr of subjects available in the dataset
n_drop <- 100
n_rater <- 20 #nbr of raters
#Define for each rater the percentage of biased ratings
perc_biased <- rep(c(2,20,40,40,50,50,60,60,60,80), times = n_rater/10)
ratings_range <- seq(0,1000,length.out=11)
bias <- 100

library(coefplot2)

for (j in seq(0, n_sub, by=n_drop)){
    print(sprintf("_______________%.02f missing values__________", j*100/n_sub))
    
    df <- simulate_data(n_sub-j, n_sub, n_rater, perc_biased, ratings_range=ratings_range, bias=bias)
    
    library(lme4)
    fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced| rater), data=df, na.action=na.omit, REML = TRUE)
    
    print(summary(fm1)) 
    
}

## All raters rate the same subset of data

In [None]:
## Load data
n_rated <- 130
n_sub <- 130 #nbr of subjects available in the dataset
n_rater <- 20 #nbr of raters
#perc_biased <- c(2,40,60,90) #4 raters + bias
perc_biased <- rep(c(2,20,40,40,40,40,60,60,60,80), times = n_rater/10)

df_4 <- simulate_data(n_rated, n_sub, n_rater, perc_biased)

## Fit model
library(lme4)
fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df_4)
summary(fm1)
ranef(fm1)

In [None]:
## Load data
n_rated <- 50
n_sub <- 50 #nbr of subjects available in the dataset
n_rater <- 20 #nbr of raters
#perc_biased <- c(2,40,60,90) #4 raters + bias
perc_biased <- rep(c(2,20,40,40,40,40,60,60,60,80), times = n_rater/10)

df_4 <- simulate_data(n_rated, n_sub, n_rater, perc_biased)

## Fit model
library(lme4)
fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df_4)
summary(fm1)
ranef(fm1)

In [None]:
## Load data
n_rated <- 1000
n_sub <- 1000 #nbr of subjects available in the dataset
n_rater <- 20 #nbr of raters
#perc_biased <- c(2,40,60,90) #4 raters + bias
perc_biased <- rep(c(2,20,40,40,40,40,60,60,60,80), times = n_rater/10)

df_4 <- simulate_data(n_rated, n_sub, n_rater, perc_biased)

## Fit model
library(lme4)
fm1 <- lmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df_4)
summary(fm1)
ranef(fm1)

## Use a partially Bayesian method

The help page for `?isSingular` suggests trying the blme package: "Use a partially Bayesian method that produces maximum a posteriori (MAP) estimates using regularizing priors to force the estimated random-effects variance-covariance matrices away from singularity."
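
Singularity can also be checked programmatically instead of being read off the printed summary. The sketch below uses lme4's `isSingular()` and `VarCorr()` on the `sleepstudy` data that ships with lme4, purely as a stand-in for the simulated ratings; the object names `fit`, `singular` and `vc` are illustrative.

```r
library(lme4)

# `sleepstudy` is bundled with lme4 and replaces the simulated ratings here
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy, REML = TRUE)

# TRUE when the estimated random-effects covariance matrix is numerically
# on the boundary (a zero variance or a correlation of +/-1)
singular <- isSingular(fit, tol = 1e-4)

# Standard deviations and correlation of the random effects, to see which
# term collapsed when the fit is singular
vc <- as.data.frame(VarCorr(fit))
print(singular)
print(vc[, c("grp", "var1", "var2", "sdcor")])
```

The same two calls should apply to `fm1` and `fm1_bay` in this notebook, since `blmer` returns an object that extends lme4's fitted-model class.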

In [None]:
library(blme)
n_rated <- 130
n_sub <- 580 #nbr of subjects available in the dataset
n_rater <- 20 #nbr of raters
#perc_biased <- c(2,40,60,80) #4 raters + bias
perc_biased <- rep(c(2,20,40,40,40,40,60,60,60,80), times = n_rater/10)
ratings_range <- seq(0,1,length.out=11)
labels <- c('excluded','0.1','poor','0.3','acceptable','0.5','good','0.7','very good','0.9','excellent')
bias <- 0.1

df_bay <- simulate_data(n_rated, n_sub, n_rater, perc_biased, ratings_range=ratings_range, bias=bias)

fm1_bay <- blmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df_bay)
summary(fm1_bay)

In [None]:
library(blme)
n_rated <- 580
n_sub <- 580 #nbr of subjects available in the dataset
n_rater <- 4 #nbr of raters
perc_biased <- c(2,40,60,80) #4 raters + bias
ratings_range <- seq(0,1,length.out=11)
labels <- c('excluded','0.1','poor','0.3','acceptable','0.5','good','0.7','very good','0.9','excellent')
bias <- 0.1

df_bay <- simulate_data(n_rated, n_sub, n_rater, perc_biased, ratings_range=ratings_range, bias=bias)

fm1_bay <- blmer(as.numeric(ratings) ~ defaced + (defaced | rater), data=df_bay)
summary(fm1_bay)

## Conclusion

The model always converges to a singular fit, no matter which dataset parameters are changed. The partially Bayesian extension of the model does not converge either.

One last lead to explore is a fully Bayesian model. However, it takes a long time to run and the framework is completely different.

In [None]:
library(brms)

n_rated <- 20
n_sub <- 20 #nbr of subjects available in the dataset
n_rater <- 6 #nbr of raters
perc_biased <- c(2,40,50,50,60,90) #6 raters + bias

df_bay <- simulate_data(n_rated, n_sub, n_rater, perc_biased)

fm1_bay <- brm(as.numeric(ratings) ~ defaced + (defaced | rater), data=df_bay, family=cratio("logit"))
summary(fm1_bay)
plot(fm1_bay)