This notebook contains the R models used to analyze the number of review comments before and after the introduction of Travis CI. It examines whether a boolean variable, `IsAfterTravisIntroduction` (presumably the `IsAfter` term in the model formulas below), helps predict the number of review comments on a pull request.

In [82]:
filename <- 'generated/num_of_review_comments.csv'

NumOfReviewCommentsData <- read.csv(file=filename, header=TRUE, sep=",")

summary(NumOfReviewCommentsData)

 ReviewComments    ShareReviewComments GeneralComments   
 Min.   :  0.000   Min.   :  0.00      Min.   :   0.000  
 1st Qu.:  0.000   1st Qu.:  0.00      1st Qu.:   0.000  
 Median :  0.000   Median :  0.00      Median :   1.000  
 Mean   :  1.435   Mean   : 11.78      Mean   :   2.773  
 3rd Qu.:  0.000   3rd Qu.:  0.00      3rd Qu.:   3.000  
 Max.   :404.000   Max.   :100.00      Max.   :1035.000  
                                                         
 GeneralCommentsDiscussingBuild   Additions         Deletions      
 Min.   :0                      Min.   :      0   Min.   :      0  
 1st Qu.:0                      1st Qu.:      1   1st Qu.:      0  
 Median :0                      Median :     10   Median :      3  
 Mean   :0                      Mean   :   2168   Mean   :    628  
 3rd Qu.:0                      3rd Qu.:     68   3rd Qu.:     20  
 Max.   :0                      Max.   :4146796   Max.   :1186576  
                                                            

In [83]:
library(lmerTest)
library(MuMIn)
library(VIF)
library(sqldf)

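vif.mer <- function (fit) {
    ## adapted from rms::vif
    ## variance inflation factors for the fixed effects of a mixed model

    v <- vcov(fit)
    nam <- names(fixef(fit))

    ## exclude intercepts
    ns <- sum(1 * (nam == "Intercept" | nam == "(Intercept)"))
    if (ns > 0) {
        v <- v[-(1:ns), -(1:ns), drop = FALSE]
        nam <- nam[-(1:ns)]
    }

    ## scale the covariance matrix to a correlation matrix and invert it;
    ## the diagonal of the inverse holds the VIFs
    d <- diag(v)^0.5
    v <- diag(solve(v/(d %o% d)))
    names(v) <- nam
    v  ## return the named VIF vector (it was commented out, so nothing was returned visibly)
}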
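
As a rough illustration of what `vif.mer` computes, the sketch below applies the same "diagonal of an inverse correlation matrix" idea to plain simulated predictors and cross-checks one value against the textbook definition 1 / (1 - R^2). The data and the names `x1`, `x2`, `x3`, `vif_from_cor` and `vif_x1_textbook` are made up for this example and are not part of the study.

```r
# Simulated predictors (hypothetical data, not from the study)
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # strongly correlated with x1
x3 <- rnorm(n)                                # roughly independent of both
X  <- cbind(x1 = x1, x2 = x2, x3 = x3)

# VIFs are the diagonal of the inverse of the predictor correlation matrix
vif_from_cor <- diag(solve(cor(X)))

# Cross-check for x1 against the textbook definition 1 / (1 - R^2),
# where R^2 comes from regressing x1 on the other predictors
r2_x1 <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1_textbook <- 1 / (1 - r2_x1)

print(round(vif_from_cor, 3))
print(round(vif_x1_textbook, 3))
```

Values well above 1 for `x1` and `x2` reflect their built-in correlation, while `x3` should stay close to 1.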

In [84]:
hasReviewComments <- sqldf("select *
                      from 'NumOfReviewCommentsData' 
                      where ReviewComments > 0")

hasGeneralComments <- sqldf("select *
                      from 'NumOfReviewCommentsData' 
                      where GeneralComments > 0")
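
The two queries keep only rows with at least one comment, presumably because the models below take `log(ReviewComments)` and `log(GeneralComments)` without adding 1, and `log(0)` is `-Inf`. A minimal base R sketch of the same filter on a toy data frame (the values and the `toy*` names are hypothetical):

```r
# Toy stand-in for NumOfReviewCommentsData (hypothetical values)
toy <- data.frame(
  ReviewComments  = c(0, 3, 0, 12),
  GeneralComments = c(1, 0, 0, 5)
)

# Base R equivalents of the two sqldf queries
toyHasReviewComments  <- subset(toy, ReviewComments > 0)
toyHasGeneralComments <- subset(toy, GeneralComments > 0)

# log(0) is -Inf, so zero rows would break a model fitted on log(count)
print(log(toy$ReviewComments))
print(log(toyHasReviewComments$ReviewComments))
```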

In [97]:
modelNumberReviewComments = lmer(log(ReviewComments) ~ 
            log(Additions + 1) +
            log(Deletions + 1) +
            IsMerged +
            log(Commits + 1) +
            log(Assignees + 1) + 
            log(ChangedFiles + 1) + 
            log(NumOfUniqueUsers + 1) +    
            log(PRsOpened + 1) +
            log(TotalBuilds + 1) +
            NewContributor + 
            #log(PrOpenedDaysAfterProjectStart + 1) +
            IsAfter + 
            (1|ProjectLanguage) +
            (1|ProjectName),
          data= hasReviewComments,
          REML=FALSE)
summary(modelNumberReviewComments)
r.squaredGLMM(modelNumberReviewComments)
vif.mer(modelNumberReviewComments)
anova(modelNumberReviewComments)

Linear mixed model fit by maximum likelihood t-tests use Satterthwaite
  approximations to degrees of freedom [lmerMod]
Formula: log(ReviewComments) ~ log(Additions + 1) + log(Deletions + 1) +  
    IsMerged + log(Commits + 1) + log(Assignees + 1) + log(ChangedFiles +  
    1) + log(NumOfUniqueUsers + 1) + log(PRsOpened + 1) + log(TotalBuilds +  
    1) + NewContributor + IsAfter + (1 | ProjectLanguage) + (1 |  
    ProjectName)
   Data: hasReviewComments

     AIC      BIC   logLik deviance df.resid 
 73156.8  73281.0 -36563.4  73126.8    29108 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9710 -0.6781 -0.0389  0.6529  5.2212 

Random effects:
 Groups          Name        Variance Std.Dev.
 ProjectName     (Intercept) 0.042604 0.20641 
 ProjectLanguage (Intercept) 0.002023 0.04498 
 Residual                    0.715137 0.84566 
Number of obs: 29123, groups:  ProjectName, 107; ProjectLanguage, 25

Fixed effects:
                            Estimate Std. Error        

                                Sum Sq      Mean Sq NumDF     DenDF      F.value  Pr(>F)
log(Additions + 1)          490.771187   490.771187     1  29059.99   686.261393  0.0
log(Deletions + 1)            2.615343     2.615343     1  29105.78     3.657119  0.05583989
IsMerged                     27.856005    27.856005     1  29119.79    38.951963  4.403442e-10
log(Commits + 1)            570.442646   570.442646     1  29054.05   797.668598  0.0
log(Assignees + 1)           53.295796    53.295796     1  26975.25    74.525253  0.0
log(ChangedFiles + 1)        21.475412    21.475412     1  29096.18     30.02977  4.289894e-08
log(NumOfUniqueUsers + 1)  3100.149661  3100.149661     1  27786.46  4335.040605  0.0
log(PRsOpened + 1)           57.935594    57.935594     1   28633.3    81.013234  0.0
log(TotalBuilds + 1)           8.42596      8.42596     1  28955.13    11.782295  0.0005988028
NewContributor               13.054998    13.054998     1  29113.87     18.25523  1.938078e-05
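
The "Random effects" block of the review-comment model can be read as a variance partition: how much of the variance not explained by the fixed effects sits at the project level, the language level, or in the residual. The sketch below only redoes that arithmetic on the three variances printed above; the variable names are chosen here for illustration.

```r
# Variance components copied from the "Random effects" output above
var_project  <- 0.042604  # ProjectName (Intercept)
var_language <- 0.002023  # ProjectLanguage (Intercept)
var_residual <- 0.715137  # Residual

var_total <- var_project + var_language + var_residual

# Share of the non-fixed-effect variance attributable to each grouping level
share_project  <- var_project  / var_total
share_language <- var_language / var_total

print(round(c(ProjectName = share_project,
              ProjectLanguage = share_language), 4))
```

Both shares are small, which suggests that most of the remaining variation is between pull requests rather than between projects or languages.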


In [86]:
modelNumberGeneralComments = lmer(log(GeneralComments) ~ 
            log(Additions + 1) +
            log(Deletions + 1) +
            IsMerged +
            log(Commits + 1) +
            log(Assignees + 1) + 
            log(ChangedFiles + 1) + 
            log(NumOfUniqueUsers + 1) +    
            log(PRsOpened + 1) +
            log(TotalBuilds + 1) +
            NewContributor + 
            #log(PrOpenedDaysAfterProjectStart + 1) +
            IsAfter + 
            (1|ProjectLanguage) +
            (1|ProjectName),
          data= hasGeneralComments, 
          REML=FALSE)
summary(modelNumberGeneralComments)
r.squaredGLMM(modelNumberGeneralComments)
vif.mer(modelNumberGeneralComments)
anova(modelNumberGeneralComments)

Linear mixed model fit by maximum likelihood t-tests use Satterthwaite
  approximations to degrees of freedom [lmerMod]
Formula: log(GeneralComments) ~ log(Additions + 1) + log(Deletions + 1) +  
    IsMerged + log(Commits + 1) + log(Assignees + 1) + log(ChangedFiles +  
    1) + log(NumOfUniqueUsers + 1) + log(PRsOpened + 1) + log(TotalBuilds +  
    1) + NewContributor + IsAfter + (1 | ProjectLanguage) + (1 |  
    ProjectName)
   Data: hasGeneralComments

     AIC      BIC   logLik deviance df.resid 
127672.0 127813.8 -63821.0 127642.0    94499 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-6.2078 -0.4821 -0.1226  0.4330  8.7643 

Random effects:
 Groups          Name        Variance  Std.Dev.
 ProjectName     (Intercept) 0.0155769 0.12481 
 ProjectLanguage (Intercept) 0.0004271 0.02067 
 Residual                    0.2250216 0.47436 
Number of obs: 94514, groups:  ProjectName, 107; ProjectLanguage, 25

Fixed effects:
                            Estimate Std. Error  

                                 Sum Sq       Mean Sq NumDF     DenDF      F.value  Pr(>F)
log(Additions + 1)             7.318281      7.318281     1  94512.43     32.52258  1.181658e-08
log(Deletions + 1)              2.78604       2.78604     1  94454.48     12.38122  0.0004338786
IsMerged                       53.68845      53.68845     1  94493.77     238.5925  0.0
log(Commits + 1)               120.2159      120.2159     1  94513.02     534.2418  0.0
log(Assignees + 1)             42.82611      42.82611     1  92868.44       190.32  0.0
log(ChangedFiles + 1)      0.0002518088  0.0002518088     1  94502.86  0.001119043  0.9733141
log(NumOfUniqueUsers + 1)      40011.61      40011.61     1  94383.33     177812.4  0.0
log(PRsOpened + 1)             14.33749      14.33749     1  93929.62     63.71609  1.332268e-15
log(TotalBuilds + 1)           8.999497      8.999497     1  94510.41     39.99393  2.558864e-10
NewContributor                 2.383812      2.383812     1  94513.46     10.59371  0.001135132
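
Because the responses are log-transformed, a coefficient `b` on a 0/1 predictor such as `IsAfter` corresponds to a multiplicative change of `exp(b)` in the response. The coefficient estimates are not shown in this export, so the sketch below uses simulated data and a plain `lm` (no random effects) purely to illustrate the back-transformation; `is_after`, `true_b` and `log_y` are hypothetical.

```r
set.seed(42)
n <- 2000
is_after <- rbinom(n, 1, 0.5)   # hypothetical 0/1 indicator
true_b   <- 0.10                # hypothetical effect on the log scale
log_y    <- 1 + true_b * is_after + rnorm(n, sd = 0.5)

# Plain linear model; the notebook's models add random intercepts on top
fit <- lm(log_y ~ is_after)
b   <- coef(fit)[["is_after"]]

# exp(b) is the multiplicative change in the (geometric-mean) response
ratio          <- exp(b)
percent_change <- 100 * (ratio - 1)

print(c(estimate = b, ratio = ratio, percent_change = percent_change))
```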


In [87]:
modelShareReviewComments = lmer(log(ShareReviewComments + 1) ~ 
            log(Additions + 1) +
            log(Deletions + 1) +
            IsMerged +
            log(Commits + 1) +
            log(Assignees + 1) + 
            log(ChangedFiles + 1) +             
            log(NumOfUniqueUsers + 1) +    
            log(PRsOpened + 1) +
            log(TotalBuilds + 1) +
            NewContributor + 
            IsAfter + 
            (1|ProjectLanguage) +
            (1|ProjectName),
          data= NumOfReviewCommentsData, 
          REML=FALSE)
summary(modelShareReviewComments)
r.squaredGLMM(modelShareReviewComments)
vif.mer(modelShareReviewComments)
anova(modelShareReviewComments)

Linear mixed model fit by maximum likelihood t-tests use Satterthwaite
  approximations to degrees of freedom [lmerMod]
Formula: log(ShareReviewComments + 1) ~ log(Additions + 1) + log(Deletions +  
    1) + IsMerged + log(Commits + 1) + log(Assignees + 1) + log(ChangedFiles +  
    1) + log(NumOfUniqueUsers + 1) + log(PRsOpened + 1) + log(TotalBuilds +  
    1) + NewContributor + IsAfter + (1 | ProjectLanguage) + (1 |  
    ProjectName)
   Data: NumOfReviewCommentsData

      AIC       BIC    logLik  deviance  df.resid 
 483006.5  483154.4 -241488.3  482976.5    141045 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.6530 -0.6937 -0.1612  0.3529  4.0587 

Random effects:
 Groups          Name        Variance  Std.Dev. 
 ProjectName     (Intercept) 1.218e-01 3.490e-01
 ProjectLanguage (Intercept) 2.532e-13 5.032e-07
 Residual                    1.791e+00 1.338e+00
Number of obs: 141060, groups:  ProjectName, 107; ProjectLanguage, 25

Fixed effects:
                     

                                Sum Sq      Mean Sq NumDF     DenDF       F.value  Pr(>F)
log(Additions + 1)          3634.25421   3634.25421     1  141057.7   2028.912265  0.0
log(Deletions + 1)            15.30642     15.30642     1  141000.8      8.545188  0.003464918
IsMerged                     2648.5899    2648.5899     1  141059.9   1478.640797  0.0
log(Commits + 1)            2307.53834   2307.53834     1  141055.5   1288.240328  0.0
log(Assignees + 1)           550.31403    550.31403     1  138776.4    307.226413  0.0
log(ChangedFiles + 1)        985.60979    985.60979     1  141050.7    550.241031  0.0
log(NumOfUniqueUsers + 1)  53221.94166  53221.94166     1  140672.9  29712.464769  0.0
log(PRsOpened + 1)            22.78643     22.78643     1  140289.1     12.721086  0.000361676
log(TotalBuilds + 1)         185.20968    185.20968     1  141048.5    103.397883  0.0
NewContributor               193.12756    193.12756     1  141059.7    107.818233  0.0


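# Caution: summary() above reports GeneralCommentsDiscussingBuild as 0 in every row,
# so this response is constant in the current data.
modelBuildDiscussionComments = lmer(log(GeneralCommentsDiscussingBuild + 1) ~ 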
            log(Additions + 1) +
            log(Deletions + 1) +
            IsMerged +
            log(Commits + 1) +
            log(Assignees + 1) + 
            log(ChangedFiles + 1) + 
            log(PrOpenedDaysAfterProjectStart + 1) +
            IsAfter + 
            (1|ProjectLanguage) +
            (1|ProjectName),
          data= hasGeneralComments, 
          REML=FALSE)
summary(modelBuildDiscussionComments)
r.squaredGLMM(modelBuildDiscussionComments)
vif.mer(modelBuildDiscussionComments)
anova(modelBuildDiscussionComments)

In [91]:
library(lme4)

print(sprintf("R2c of review comments is %f", r.squaredGLMM(modelNumberReviewComments)[['R2c']]))
print(sprintf("R2c of share review comments is %f", r.squaredGLMM(modelShareReviewComments)[['R2c']]))
print(sprintf("R2c of general comments is %f", r.squaredGLMM(modelNumberGeneralComments)[['R2c']]))

[1] "R2c of review comments is 0.326004"
[1] "R2c of share review comments is 0.317835"
[1] "R2c of general comments is 0.724985"
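
For orientation, `r.squaredGLMM` from MuMIn reports a marginal and a conditional R² in the sense of Nakagawa and Schielzeth. For a Gaussian model with random intercepts, as fitted here, the two are usually written as below. The symbols are notation introduced for this note: σ_f² is the variance of the fixed-effect predictions, σ_l² the variance of each random intercept (here `ProjectName` and `ProjectLanguage`), and σ_ε² the residual variance.

```latex
R^2_{m} = \frac{\sigma_f^2}{\sigma_f^2 + \sum_{l} \sigma_l^2 + \sigma_\varepsilon^2},
\qquad
R^2_{c} = \frac{\sigma_f^2 + \sum_{l} \sigma_l^2}{\sigma_f^2 + \sum_{l} \sigma_l^2 + \sigma_\varepsilon^2}
```

The `R2c` values printed above are therefore the share of variance explained by fixed and random effects together, not by the fixed effects alone.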
