In [70]:
library(lmerTest)
library(MuMIn)
library(VIF)
library(sqldf)

filename <- 'generated/metrics_for_time_series.csv'

TimeSeriesData <- read.csv(file=filename, header=TRUE, sep=",")

summary(TimeSeriesData)

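# Drop period 13, the first period past the Period > 12 cutoff used below.
TimeSeriesData <- sqldf("select *
                      from 'TimeSeriesData' 
                      where Period != 13")

# Interrupted time series coding: a level-shift indicator and a counter of
# periods elapsed since the cutoff (0 before it).
TimeSeriesData$Intervention <- TimeSeriesData$Period > 12
TimeSeriesData$TimeAfterIntervention <- ifelse(TimeSeriesData$Period > 12, TimeSeriesData$Period - 12, 0)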

                         ProjectName         Language    ShareReviewComments
 activemerchant/active_merchant:  25   Ruby      :1200   Min.   :0.0000     
 adobe/brackets                :  25   Python    :1125   1st Qu.:0.0000     
 AFNetworking/AFNetworking     :  25   JavaScript: 925   Median :0.1186     
 airbnb/javascript             :  25   PHP       : 800   Mean   :0.1855     
 ajaxorg/ace                   :  25   C++       : 775   3rd Qu.:0.3090     
 alexreisner/geocoder          :  25   Java      : 700   Max.   :1.0000     
 (Other)                       :7450   (Other)   :2075                      
   Additions        Deletions       ChangedFiles      Assignees      
 Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.: 1.586   1st Qu.:0.8815   1st Qu.:0.6931   1st Qu.:0.00000  
 Median : 2.382   Median :1.4915   Median :1.0104   Median :0.00000  
 Mean   : 2.423   Mean   :1.5884   Mean   :1.0583   Mean   :0.04135  
 3rd Qu.: 3.200   3rd Qu.:2.1859  
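
The two derived columns in the cell above follow the usual coding for an interrupted time series (segmented regression). A minimal sketch on a toy vector of periods, assuming the same cutoff of 12 and the same exclusion of period 13. The data frame `toy` is hypothetical, and the range of 25 periods is an assumption based on the 25 rows per project in the summary:

```r
# Hypothetical toy data: one row per period (assumed range 1 to 25).
toy <- data.frame(Period = 1:25)

# Same filter as the sqldf call: drop period 13.
toy <- toy[toy$Period != 13, , drop = FALSE]

# Level-change indicator and slope-change counter, coded as in the notebook.
toy$Intervention <- toy$Period > 12
toy$TimeAfterIntervention <- ifelse(toy$Period > 12, toy$Period - 12, 0)

# Show the rows around the cutoff.
print(toy[toy$Period %in% 11:15, ], row.names = FALSE)
```

Because period 13 is dropped, `TimeAfterIntervention` takes the value 2 in period 14, the first post-cutoff period that remains.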

In [40]:
vif.mer <- function (fit) {
    ## adapted from rms::vif
    
    v <- vcov(fit)
    nam <- names(fixef(fit))

    ## exclude intercepts
    ns <- sum(1 * (nam == "Intercept" | nam == "(Intercept)"))
    if (ns > 0) {
        v <- v[-(1:ns), -(1:ns), drop = FALSE]
        nam <- nam[-(1:ns)]
    }
    
    d <- diag(v)^0.5
    v <- diag(solve(v/(d %o% d)))
    names(v) <- nam
    v
}
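
`vif.mer` rescales the covariance matrix of the fixed-effect estimates to a correlation matrix, inverts it, and reads off the diagonal. For an ordinary linear model the same computation reproduces the textbook variance inflation factor, which may help when judging the mixed-model values. A small check on the built-in `mtcars` data; the choice of dataset and predictors is an assumption made purely for illustration:

```r
# Ordinary linear model on a built-in dataset (illustrative choice only).
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Same steps as vif.mer: drop the intercept, rescale the covariance matrix
# of the estimates to a correlation matrix, invert, take the diagonal.
v <- vcov(fit)[-1, -1, drop = FALSE]
vif_from_vcov <- diag(solve(cov2cor(v)))

# Textbook VIF: 1 / (1 - R^2) from regressing each predictor on the others.
predictors <- c("wt", "hp", "disp")
vif_classic <- sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = mtcars))$r.squared
    1 / (1 - r2)
})

# The two rows should agree up to numerical error.
print(rbind(vif_from_vcov, vif_classic))
```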

In [41]:
projects <- sqldf("select count(distinct(ProjectName)) from TimeSeriesData")

print(projects)

  count(distinct(ProjectName))
1                          304


In [76]:
modelNumberReviewComments = lmer(ReviewComments ~ 
            Additions+
            Deletions  +
            Commits  +
            #CommitsAfterCreate  +
            Assignees  + 
            ChangedFiles + 
            TotalPrs +
            Intervention +
            Period + 
            TimeAfterIntervention +
            (1|Language) +
            (1+Intervention|ProjectName),
          data= TimeSeriesData, 
          REML=FALSE)
summary(modelNumberReviewComments)
r.squaredGLMM(modelNumberReviewComments)
vif.mer(modelNumberReviewComments)
anova(modelNumberReviewComments)

Linear mixed model fit by maximum likelihood t-tests use Satterthwaite
  approximations to degrees of freedom [lmerMod]
Formula: ReviewComments ~ Additions + Deletions + Commits + Assignees +  
    ChangedFiles + TotalPrs + Intervention + Period + TimeAfterIntervention +  
    (1 | Language) + (1 + Intervention | ProjectName)
   Data: TimeSeriesData

     AIC      BIC   logLik deviance df.resid 
 -1428.9  -1325.4    729.4  -1458.9     7281 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.0023 -0.4528 -0.0819  0.3420 12.0942 

Random effects:
 Groups      Name             Variance Std.Dev. Corr 
 ProjectName (Intercept)      0.03424  0.18504       
             InterventionTRUE 0.01666  0.12907  -0.09
 Language    (Intercept)      0.00179  0.04231       
 Residual                     0.03997  0.19991       
Number of obs: 7296, groups:  ProjectName, 304; Language, 36

Fixed effects:
                        Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)  

Term,Sum Sq,Mean Sq,NumDF,DenDF,F.value,Pr(>F)
Additions,4.99089458,4.99089458,1,7145.084,124.8786956,0.0
Deletions,0.02682426,0.02682426,1,7085.995,0.6711779,0.4126688
Commits,8.94443838,8.94443838,1,7216.46,223.8015208,0.0
Assignees,0.01654348,0.01654348,1,6434.702,0.4139395,0.5199997
ChangedFiles,1.22701357,1.22701357,1,7185.965,30.7014809,3.115933e-08
TotalPrs,0.24375318,0.24375318,1,5678.215,6.0990227,0.01355486
Intervention,0.08423634,0.08423634,1,1202.994,2.1077032,0.1468199
Period,2.11071035,2.11071035,1,6705.739,52.8127274,4.085621e-13
TimeAfterIntervention,0.57356183,0.57356183,1,6687.504,14.3512654,0.0001530087
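
The three time terms form the usual segmented-regression parameterisation: `Period` carries the baseline trend, `Intervention` a level shift at the cutoff, and `TimeAfterIntervention` a change in slope afterwards. A minimal sketch of how the fixed-effect trend is assembled from them. The coefficient values in `b` are hypothetical placeholders, not estimates from the model above, and `trend` is an illustrative helper:

```r
# Hypothetical coefficients, for illustration only (NOT the fitted values).
b <- c(intercept = 0.20, period = -0.004, intervention = 0.03, time_after = 0.005)

# Fixed-effect trend at a given period, holding all other covariates at zero.
trend <- function(period, cutoff = 12) {
    after <- period > cutoff
    unname(b["intercept"] +
           b["period"] * period +
           b["intervention"] * after +
           b["time_after"] * ifelse(after, period - cutoff, 0))
}

# Slope before the cutoff is b["period"];
# after it, b["period"] + b["time_after"].
slope_before <- trend(10) - trend(9)
slope_after  <- trend(20) - trend(19)
print(c(before = slope_before, after = slope_after))
```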


In [75]:
modelGeneralComments = lmer(GeneralComments  ~ 
            Additions  +
            Deletions + 
            Commits + 
            Assignees +
            ChangedFiles +
            TotalPrs +
            #CommitsAfterCreate +
            Intervention +
            Period + 
            TimeAfterIntervention +
            (1|Language) +
            (1+Intervention|ProjectName),
          data= TimeSeriesData, 
          REML=FALSE)
summary(modelGeneralComments)
r.squaredGLMM(modelGeneralComments)
vif.mer(modelGeneralComments)
anova(modelGeneralComments)

Linear mixed model fit by maximum likelihood t-tests use Satterthwaite
  approximations to degrees of freedom [lmerMod]
Formula: GeneralComments ~ Additions + Deletions + Commits + Assignees +  
    ChangedFiles + TotalPrs + Intervention + Period + TimeAfterIntervention +  
    (1 | Language) + (1 + Intervention | ProjectName)
   Data: TimeSeriesData

     AIC      BIC   logLik deviance df.resid 
  4337.3   4440.7  -2153.6   4307.3     7281 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.3346 -0.5442 -0.0226  0.5107  6.4213 

Random effects:
 Groups      Name             Variance Std.Dev. Corr 
 ProjectName (Intercept)      0.11103  0.3332        
             InterventionTRUE 0.05460  0.2337   -0.26
 Language    (Intercept)      0.00000  0.0000        
 Residual                     0.08589  0.2931        
Number of obs: 7296, groups:  ProjectName, 304; Language, 36

Fixed effects:
                        Estimate Std. Error         df t value Pr(>|t|)    
(Intercept) 

Term,Sum Sq,Mean Sq,NumDF,DenDF,F.value,Pr(>F)
Additions,5.185921,5.185921,1,7059.5324,60.38043,8.881784e-15
Deletions,1.2085223,1.2085223,1,7011.7991,14.071,0.0001774564
Commits,14.1554108,14.1554108,1,7150.2411,164.8135,0.0
Assignees,4.1669915,4.1669915,1,6886.0245,48.51689,3.577583e-12
ChangedFiles,1.5311067,1.5311067,1,7106.6024,17.8269,2.449551e-05
TotalPrs,1.2847543,1.2847543,1,6345.4737,14.95858,0.0001109965
Intervention,0.9192769,0.9192769,1,906.5334,10.70327,0.001109841
Period,3.8164149,3.8164149,1,6705.4045,44.43507,2.838574e-11
TimeAfterIntervention,1.5782344,1.5782344,1,6690.4157,18.37561,1.839104e-05


In [77]:
modelEffectiveComments = lmer(EffectiveComments ~ 
            Additions + 
            Deletions + 
            Commits + 
            Assignees + 
            ChangedFiles + 
            #GeneralComments + 
            TotalPrs +
            Intervention +
            Period + 
            TimeAfterIntervention +
            (1|Language) +
            (1+Intervention|ProjectName),
          data= TimeSeriesData, 
          REML=FALSE)
summary(modelEffectiveComments)
r.squaredGLMM(modelEffectiveComments)
vif.mer(modelEffectiveComments)
anova(modelEffectiveComments)

Linear mixed model fit by maximum likelihood t-tests use Satterthwaite
  approximations to degrees of freedom [lmerMod]
Formula: EffectiveComments ~ Additions + Deletions + Commits + Assignees +  
    ChangedFiles + TotalPrs + Intervention + Period + TimeAfterIntervention +  
    (1 | Language) + (1 + Intervention | ProjectName)
   Data: TimeSeriesData

     AIC      BIC   logLik deviance df.resid 
-10200.7 -10097.3   5115.4 -10230.7     7281 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.9855 -0.3924 -0.0927  0.2136 21.5281 

Random effects:
 Groups      Name             Variance  Std.Dev. Corr 
 ProjectName (Intercept)      0.0053491 0.07314       
             InterventionTRUE 0.0032844 0.05731  -0.29
 Language    (Intercept)      0.0002394 0.01547       
 Residual                     0.0125468 0.11201       
Number of obs: 7296, groups:  ProjectName, 304; Language, 36

Fixed effects:
                        Estimate Std. Error         df t value Pr(>|t|)    
(Inte

Term,Sum Sq,Mean Sq,NumDF,DenDF,F.value,Pr(>F)
Additions,1.29695796,1.29695796,1,7275.612,103.3698477,0.0
Deletions,0.363749784,0.363749784,1,7223.17,28.9915024,7.498454e-08
Commits,5.310722393,5.310722393,1,7250.522,423.2739855,0.0
Assignees,0.010527377,0.010527377,1,4204.098,0.8390506,0.359721
ChangedFiles,0.239565637,0.239565637,1,7269.723,19.0938058,1.261778e-05
TotalPrs,0.005036171,0.005036171,1,3616.386,0.4013917,0.5264114
Intervention,0.006585597,0.006585597,1,1622.521,0.5248837,0.4688703
Period,0.391250747,0.391250747,1,6711.094,31.1833778,2.439582e-08
TimeAfterIntervention,0.267699412,0.267699412,1,6686.646,21.3361175,3.926174e-06


In [79]:
modelCommitsAfterCreate = lmer(CommitsAfterCreate ~ 
            Additions + 
            Deletions + 
            Commits + 
            Assignees + 
            ChangedFiles + 
            TotalPrs +
            GeneralComments + 
            Intervention +
            Period + 
            TimeAfterIntervention +
            (1|Language) +
            (1+Intervention|ProjectName),
          data= TimeSeriesData, 
          REML=FALSE)
summary(modelCommitsAfterCreate)
r.squaredGLMM(modelCommitsAfterCreate)
vif.mer(modelCommitsAfterCreate)
anova(modelCommitsAfterCreate)

Linear mixed model fit by maximum likelihood t-tests use Satterthwaite
  approximations to degrees of freedom [lmerMod]
Formula: CommitsAfterCreate ~ Additions + Deletions + Commits + Assignees +  
    ChangedFiles + TotalPrs + GeneralComments + Intervention +  
    Period + TimeAfterIntervention + (1 | Language) + (1 + Intervention |  
    ProjectName)
   Data: TimeSeriesData

     AIC      BIC   logLik deviance df.resid 
 -4958.7  -4848.3   2495.3  -4990.7     7280 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-10.8614  -0.4345  -0.0085   0.4173  11.3471 

Random effects:
 Groups      Name             Variance Std.Dev. Corr 
 ProjectName (Intercept)      0.009398 0.09694       
             InterventionTRUE 0.006582 0.08113  -0.62
 Language    (Intercept)      0.000640 0.02530       
 Residual                     0.026253 0.16203       
Number of obs: 7296, groups:  ProjectName, 304; Language, 36

Fixed effects:
                        Estimate Std. Error        

Term,Sum Sq,Mean Sq,NumDF,DenDF,F.value,Pr(>F)
Additions,1.166546,1.166546,1,7230.328,44.43454,2.823652e-11
Deletions,0.04543434,0.04543434,1,7249.877,1.730626,0.1883726
Commits,105.467,105.467,1,6999.949,4017.311,0.0
Assignees,0.2730551,0.2730551,1,2443.285,10.40086,0.001276184
ChangedFiles,1.348665,1.348665,1,7130.359,51.37158,8.41105e-13
TotalPrs,0.03365115,0.03365115,1,2125.484,1.281796,0.2576931
GeneralComments,16.90303,16.90303,1,5072.985,643.8482,0.0
Intervention,0.002374365,0.002374365,1,1645.993,0.09044123,0.763655
Period,0.1556662,0.1556662,1,6735.611,5.929434,0.01491581
TimeAfterIntervention,0.0672195,0.0672195,1,6699.351,2.560438,0.1096154
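
The models for `ReviewComments`, `GeneralComments`, and `EffectiveComments` share one right-hand side and differ only in the response, so their formulas could be generated rather than retyped. A sketch using base R's `reformulate`; the helper `make_formula` and the list `formulas` are illustrative names, and the fitting step is left commented out because it needs `lmerTest` and the data loaded earlier:

```r
# Shared fixed and random terms, copied from the model cells above.
fixed <- c("Additions", "Deletions", "Commits", "Assignees", "ChangedFiles",
           "TotalPrs", "Intervention", "Period", "TimeAfterIntervention")
random <- c("(1 | Language)", "(1 + Intervention | ProjectName)")
responses <- c("ReviewComments", "GeneralComments", "EffectiveComments")

# Build one formula per response (illustrative helper).
make_formula <- function(response) {
    reformulate(c(fixed, random), response = response)
}
formulas <- lapply(setNames(nm = responses), make_formula)
print(formulas$GeneralComments)

# With lmerTest and TimeSeriesData available, the fits would follow as:
# models <- lapply(formulas, function(f) lmer(f, data = TimeSeriesData, REML = FALSE))
```

The `CommitsAfterCreate` model adds `GeneralComments` as a predictor, so it would need its own term list.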


# What happens if we don't apply any aggregation?

In [29]:
filename_no_aggr <- 'generated/metrics_for_time_series_no_aggr.csv'

TimeSeriesDataNoAggr <- read.csv(file=filename_no_aggr, header=TRUE, sep=",")

summary(TimeSeriesDataNoAggr)

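# Same preprocessing as for the aggregated data: drop period 13, then code
# the level-shift indicator and the post-cutoff counter.
TimeSeriesDataNoAggr <- sqldf("select *
                      from 'TimeSeriesDataNoAggr' 
                      where Period != 13")


TimeSeriesDataNoAggr$Intervention <- TimeSeriesDataNoAggr$Period > 12
TimeSeriesDataNoAggr$TimeAfterIntervention <- ifelse(TimeSeriesDataNoAggr$Period > 12, TimeSeriesDataNoAggr$Period - 12, 0)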

                       ProjectName       Period      EffectiveComments
 adobe/brackets              : 638   Min.   : 1.00   Min.   :  1.000  
 caskdata/cdap               : 633   1st Qu.: 7.00   1st Qu.:  1.000  
 DynamoDS/Dynamo             : 420   Median :13.00   Median :  2.000  
 symfony/symfony-docs        : 370   Mean   :12.93   Mean   :  6.055  
 AnalyticalGraphicsInc/cesium: 261   3rd Qu.:19.00   3rd Qu.:  5.000  
 cakephp/docs                : 253   Max.   :25.00   Max.   :190.000  
 (Other)                     :1966                                    
 ReviewThreads    GeneralComments     Additions          Deletions       
 Min.   :  1.00   Min.   :  0.000   Min.   :     0.0   Min.   :     0.0  
 1st Qu.:  2.00   1st Qu.:  2.000   1st Qu.:    13.0   1st Qu.:     1.0  
 Median :  5.00   Median :  4.000   Median :    87.0   Median :    12.0  
 Mean   : 12.55   Mean   :  8.498   Mean   :   819.5   Mean   :   322.8  
 3rd Qu.: 13.00   3rd Qu.:  9.000   3rd Qu.:   347.0   3rd Qu.

In [30]:
projects <- sqldf("select count(distinct(ProjectName)) from TimeSeriesDataNoAggr")

print(projects)

  count(distinct(ProjectName))
1                           18


In [31]:
modelEffectiveCommentsNoAggr = lmer(log(EffectiveComments + 1) ~ 
            log(Additions + 1) +
            log(Deletions + 1) +
            log(Commits + 1) +
            log(Assignees + 1) + 
            log(ChangedFiles + 1) + 
            IsMerged + 
            Intervention +
            Period + 
            TimeAfterIntervention +
            (1+Intervention|ProjectName),
          data= TimeSeriesDataNoAggr, 
          REML=FALSE)
summary(modelEffectiveCommentsNoAggr)
r.squaredGLMM(modelEffectiveCommentsNoAggr)
vif.mer(modelEffectiveCommentsNoAggr)
anova(modelEffectiveCommentsNoAggr)

Linear mixed model fit by maximum likelihood t-tests use Satterthwaite
  approximations to degrees of freedom [lmerMod]
Formula: log(EffectiveComments + 1) ~ log(Additions + 1) + log(Deletions +  
    1) + log(Commits + 1) + log(Assignees + 1) + log(ChangedFiles +  
    1) + IsMerged + Intervention + Period + TimeAfterIntervention +  
    (1 + Intervention | ProjectName)
   Data: TimeSeriesDataNoAggr

     AIC      BIC   logLik deviance df.resid 
  9171.4   9260.6  -4571.7   9143.4     4317 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.5157 -0.6843 -0.1494  0.5464  4.4667 

Random effects:
 Groups      Name             Variance Std.Dev. Corr
 ProjectName (Intercept)      0.02523  0.1588       
             InterventionTRUE 0.01784  0.1336   0.00
 Residual                     0.47627  0.6901       
Number of obs: 4331, groups:  ProjectName, 18

Fixed effects:
                        Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)            3.615e-01  

Term,Sum Sq,Mean Sq,NumDF,DenDF,F.value,Pr(>F)
log(Additions + 1),75.04443963,75.04443963,1,4321.977,157.5661724,0.0
log(Deletions + 1),57.90042174,57.90042174,1,4317.63,121.56993748,0.0
log(Commits + 1),349.96244096,349.96244096,1,4319.894,734.79451083,0.0
log(Assignees + 1),4.23595771,4.23595771,1,3574.08,8.89397863,0.00288045
log(ChangedFiles + 1),2.82422692,2.82422692,1,4328.159,5.92985474,0.01492659
IsMerged,0.15024479,0.15024479,1,4303.772,0.31545971,0.5743783
Intervention,0.02744097,0.02744097,1,60.608,0.05761611,0.81111424
Period,2.46823576,2.46823576,1,4261.915,5.18240209,0.02286613
TimeAfterIntervention,2.09320811,2.09320811,1,4274.361,4.39497972,0.03610353
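
The unaggregated model applies `log(x + 1)` to the response and to the count predictors, which keeps zero counts defined. Base R's `log1p` computes the same transform and `expm1` inverts it. A small sketch on hypothetical counts, not taken from the data:

```r
# Hypothetical per-pull-request counts, including a zero.
counts <- c(0, 1, 12, 87, 347)

# log(x + 1) and log1p(x) agree; log1p is the more accurate form for tiny x.
transformed <- log(counts + 1)
print(all.equal(transformed, log1p(counts)))

# expm1 maps values on the model scale back to the count scale.
recovered <- expm1(transformed)
print(recovered)
```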
