This notebook contains the R models used to analyze the number of review comments before and after the introduction of Travis CI. It examines whether a boolean variable, `IsAfterTravisIntroduction` (which appears as `IsAfter` in the model formulas below), can be used to predict the number of review comments on a pull request.

In [75]:
filename <- 'generated/num_of_review_comments.csv'

NumOfReviewCommentsData <- read.csv(file=filename, header=TRUE, sep=",")

summary(NumOfReviewCommentsData)

 ReviewComments    ShareReviewComments GeneralComments  
 Min.   :  0.000   Min.   :  0.0       Min.   :  0.000  
 1st Qu.:  0.000   1st Qu.:  0.0       1st Qu.:  0.000  
 Median :  0.000   Median :  0.0       Median :  1.000  
 Mean   :  1.465   Mean   : 10.6       Mean   :  3.142  
 3rd Qu.:  0.000   3rd Qu.:  0.0       3rd Qu.:  3.000  
 Max.   :404.000   Max.   :100.0       Max.   :220.000  
                                                        
 GeneralCommentsDiscussingBuild   Additions         Deletions        
 Min.   : 0.0000                Min.   :      0   Min.   :      0.0  
 1st Qu.: 0.0000                1st Qu.:      1   1st Qu.:      0.0  
 Median : 0.0000                Median :      8   Median :      2.0  
 Mean   : 0.0947                Mean   :   4358   Mean   :    701.8  
 3rd Qu.: 0.0000                3rd Qu.:     68   3rd Qu.:     17.0  
 Max.   :37.0000                Max.   :4146796   Max.   :1186576.0  
                                                      

In [76]:
library(lmerTest)
library(MuMIn)
library(VIF)

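vif.mer <- function (fit) {
    ## variance inflation factors for the fixed effects of a mixed model
    ## adapted from rms::vif
    
    ## covariance matrix and names of the fixed-effect estimates
    v <- vcov(fit)
    nam <- names(fixef(fit))

    ## exclude intercepts
    ns <- sum(1 * (nam == "Intercept" | nam == "(Intercept)"))
    if (ns > 0) {
        v <- v[-(1:ns), -(1:ns), drop = FALSE]
        nam <- nam[-(1:ns)]
    }
    
    ## scale the covariance matrix to a correlation matrix, invert it,
    ## and read the VIFs off the diagonal
    d <- diag(v)^0.5
    v <- diag(solve(v/(d %o% d)))
    names(v) <- nam
    v
}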
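
The computation inside `vif.mer` can be checked on a simpler model. The following is a minimal sketch, not part of the original analysis: it uses an ordinary `lm` fit on R's built-in `mtcars` data (a stand-in dataset chosen only for illustration) and compares the "invert the correlation matrix of the coefficient estimates" approach with the textbook definition 1 / (1 - R^2).

```r
# Illustration only: mtcars and the chosen predictors are stand-ins,
# not data or variables from this notebook.
predictors <- c("wt", "hp", "disp")
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Same steps as vif.mer: drop the intercept, scale the covariance
# matrix of the estimates to a correlation matrix, invert, take the diagonal.
v <- vcov(fit)[-1, -1, drop = FALSE]
d <- sqrt(diag(v))
vif_from_vcov <- diag(solve(v / (d %o% d)))

# Textbook definition: regress each predictor on the others
# and compute 1 / (1 - R^2).
vif_classic <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
  1 / (1 - r2)
})

# The two approaches should agree up to floating-point error.
print(all.equal(unname(vif_from_vcov), unname(vif_classic)))
```

For an ordinary least-squares fit with an intercept the two quantities coincide; for the mixed models below, `vif.mer` applies the same matrix computation to the fixed-effect estimates.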

In [77]:
modelNumberReviewComments = lmer(ReviewComments ~ 
            Additions +
            Deletions +
            IsMerged +
            Commits +
            Assignees + 
            ChangedFiles + 
            PrOpenedDaysAfterProjectStart +
            IsAfter + 
            (1|ProjectLanguage) +
            (1|ProjectName),
          data= NumOfReviewCommentsData, 
          REML=FALSE)
summary(modelNumberReviewComments)
r.squaredGLMM(modelNumberReviewComments)
vif.mer(modelNumberReviewComments)
anova(modelNumberReviewComments)

“Some predictor variables are on very different scales: consider rescaling”

Linear mixed model fit by maximum likelihood t-tests use Satterthwaite
  approximations to degrees of freedom [lmerMod]
Formula: ReviewComments ~ Additions + Deletions + IsMerged + Commits +  
    Assignees + ChangedFiles + PrOpenedDaysAfterProjectStart +  
    IsAfter + (1 | ProjectLanguage) + (1 | ProjectName)
   Data: NumOfReviewCommentsData

      AIC       BIC    logLik  deviance  df.resid 
 260416.0  260518.9 -130196.0  260392.0     38870 

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-2.125 -0.233 -0.108  0.009 57.751 

Random effects:
 Groups          Name        Variance Std.Dev.
 ProjectName     (Intercept)  1.279   1.131   
 ProjectLanguage (Intercept)  0.000   0.000   
 Residual                    47.289   6.877   
Number of obs: 38882, groups:  ProjectName, 34; ProjectLanguage, 15

Fixed effects:
                                Estimate Std. Error         df t value Pr(>|t|)
(Intercept)                    4.323e-02  2.178e-01  4.800e+01   0.199    0.843
Additions

“Some predictor variables are on very different scales: consider rescaling”

                                    Sum Sq       Mean Sq  NumDF  DenDF  F.value  Pr(>F)
Additions                         103.0091      103.0091      1
Deletions                         69.71227      69.71227      1
IsMerged                           50.0988       50.0988      1
Commits                            30030.4       30030.4      1
Assignees                          10839.2       10839.2      1
ChangedFiles                      2999.412      2999.412      1
PrOpenedDaysAfterProjectStart       9830.6        9830.6      1
IsAfter                          0.9699131     0.9699131      1
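
The warning "Some predictor variables are on very different scales: consider rescaling" is consistent with the summary above, where `Additions` reaches into the millions while other predictors stay small. One possible response, sketched here under assumptions, is to z-score the numeric predictors before fitting. The helper `rescale_predictors`, the toy data frame, and the choice of which columns count as numeric are all hypothetical and not part of the original notebook.

```r
# Hypothetical toy rows standing in for NumOfReviewCommentsData
# (the values are made up for illustration).
toy <- data.frame(
  Additions = c(0, 8, 68, 4000),
  Deletions = c(0, 2, 17, 700),
  Commits = c(1, 2, 3, 10),
  ChangedFiles = c(1, 1, 4, 40),
  PrOpenedDaysAfterProjectStart = c(10, 200, 900, 1500)
)

# Hypothetical helper: centre each named column on 0 and scale it to
# unit standard deviation, leaving all other columns untouched.
rescale_predictors <- function(data, cols) {
  data[cols] <- lapply(data[cols], function(x) as.numeric(scale(x)))
  data
}

# Assumed list of numeric predictors; adjust to the real column types.
numeric_predictors <- c("Additions", "Deletions", "Commits",
                        "ChangedFiles", "PrOpenedDaysAfterProjectStart")
toy_scaled <- rescale_predictors(toy, numeric_predictors)

# Each rescaled column now has mean 0 and standard deviation 1.
print(round(sapply(toy_scaled, mean), 10))
print(sapply(toy_scaled, sd))
```

Applied to the real data, the rescaled data frame would be passed as `data=` to the same `lmer` call. The coefficients would then be expressed per standard deviation of each predictor rather than per raw unit, which changes their interpretation but not the model's fit.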


In [78]:
modelShareReviewComments = lmer(ShareReviewComments ~ 
            Additions +
            Deletions +
            IsMerged +
            Commits +
            Assignees + 
            ChangedFiles + 
            PrOpenedDaysAfterProjectStart +
            IsAfter + 
            (1|ProjectLanguage) +
            (1|ProjectName),
          data= NumOfReviewCommentsData, 
          REML=FALSE)
summary(modelShareReviewComments)
r.squaredGLMM(modelShareReviewComments)
vif.mer(modelShareReviewComments)
anova(modelShareReviewComments)

“Some predictor variables are on very different scales: consider rescaling”

Linear mixed model fit by maximum likelihood t-tests use Satterthwaite
  approximations to degrees of freedom [lmerMod]
Formula: ShareReviewComments ~ Additions + Deletions + IsMerged + Commits +  
    Assignees + ChangedFiles + PrOpenedDaysAfterProjectStart +  
    IsAfter + (1 | ProjectLanguage) + (1 | ProjectName)
   Data: NumOfReviewCommentsData

      AIC       BIC    logLik  deviance  df.resid 
 356752.7  356855.6 -178364.4  356728.7     38870 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.7189 -0.5238 -0.2814 -0.0285  4.2328 

Random effects:
 Groups          Name        Variance  Std.Dev. 
 ProjectName     (Intercept) 3.660e+01 6.050e+00
 ProjectLanguage (Intercept) 1.167e-12 1.080e-06
 Residual                    5.630e+02 2.373e+01
Number of obs: 38882, groups:  ProjectName, 34; ProjectLanguage, 15

Fixed effects:
                                Estimate Std. Error         df t value Pr(>|t|)
(Intercept)                    2.374e+00  1.093e+00  4.000e+01   2

“Some predictor variables are on very different scales: consider rescaling”

                                    Sum Sq       Mean Sq  NumDF  DenDF  F.value  Pr(>F)
Additions                         9131.111      9131.111      1
Deletions                          4654.44       4654.44      1
IsMerged                         20198.518     20198.518      1
Commits                          42254.301     42254.301      1
Assignees                        61900.669     61900.669      1
ChangedFiles                     11109.393     11109.393      1
PrOpenedDaysAfterProjectStart    468437.02     468437.02      1
IsAfter                            10346.4       10346.4      1


In [79]:
modelGeneralCommentsDiscussingBuild = lmer(GeneralCommentsDiscussingBuild ~ 
            Additions +
            Deletions +
            IsMerged +
            Commits +
            Assignees + 
            ChangedFiles + 
            PrOpenedDaysAfterProjectStart +
            IsAfter + 
            (1|ProjectLanguage) +
            (1|ProjectName),
          data= NumOfReviewCommentsData, 
          REML=FALSE)
summary(modelGeneralCommentsDiscussingBuild)
r.squaredGLMM(modelGeneralCommentsDiscussingBuild)
vif.mer(modelGeneralCommentsDiscussingBuild)
anova(modelGeneralCommentsDiscussingBuild)

“Some predictor variables are on very different scales: consider rescaling”

Linear mixed model fit by maximum likelihood t-tests use Satterthwaite
  approximations to degrees of freedom [lmerMod]
Formula: GeneralCommentsDiscussingBuild ~ Additions + Deletions + IsMerged +  
    Commits + Assignees + ChangedFiles + PrOpenedDaysAfterProjectStart +  
    IsAfter + (1 | ProjectLanguage) + (1 | ProjectName)
   Data: NumOfReviewCommentsData

     AIC      BIC   logLik deviance df.resid 
 66921.8  67024.6 -33448.9  66897.8    38870 

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-1.000 -0.215 -0.105 -0.034 63.960 

Random effects:
 Groups          Name        Variance  Std.Dev.
 ProjectName     (Intercept) 0.0096781 0.09838 
 ProjectLanguage (Intercept) 0.0008534 0.02921 
 Residual                    0.3261783 0.57112 
Number of obs: 38882, groups:  ProjectName, 34; ProjectLanguage, 15

Fixed effects:
                                Estimate Std. Error         df t value Pr(>|t|)
(Intercept)                    8.348e-02  2.115e-02  2.200e+01   3.947 0.000710

“Some predictor variables are on very different scales: consider rescaling”

                                    Sum Sq       Mean Sq  NumDF  DenDF  F.value  Pr(>F)
Additions                      0.002568378   0.002568378      1
Deletions                       2.06255694    2.06255694      1
IsMerged                      19.196757725  19.196757725      1
Commits                        29.69137587   29.69137587      1
Assignees                     23.064327819  23.064327819      1
ChangedFiles                   0.010566784   0.010566784      1
PrOpenedDaysAfterProjectStart 43.676630602  43.676630602      1
IsAfter                       15.237216054  15.237216054      1
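
The "Random effects" tables of the three models can be compared by expressing each variance component as a share of the total variance. The following is a small sketch of that arithmetic, using only the variances printed above; the data frame layout and column names are chosen here for illustration.

```r
# Variance components copied from the "Random effects" tables above.
variance_components <- data.frame(
  model = c("ReviewComments", "ShareReviewComments",
            "GeneralCommentsDiscussingBuild"),
  project = c(1.279, 3.660e+01, 0.0096781),      # ProjectName (Intercept)
  language = c(0.000, 1.167e-12, 0.0008534),     # ProjectLanguage (Intercept)
  residual = c(47.289, 5.630e+02, 0.3261783)     # Residual
)

# Share of the total variance attributed to each grouping factor.
total <- with(variance_components, project + language + residual)
variance_components$share_project <- variance_components$project / total
variance_components$share_language <- variance_components$language / total

print(variance_components[, c("model", "share_project", "share_language")],
      digits = 3)
```

In all three models the project-level share is a single-digit percentage, and the `ProjectLanguage` share is zero or close to zero, so most of the variance remains in the residual term.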
