# A comparison of multivariate models for count data (ZZ17)

To illustrate the variety of splitting models, we considered two datasets used in the literature to illustrate models for count data.
We focus here on the second one, denoted by `D` in the following, which consists of simulated data mimicking data obtained from sequencing technologies such as RNA-seq [[ZZZS17](https://doi.org/10.1080/10618600.2016.1154063)].
Since the goal is to compare distribution and regression models, comparisons were performed using either all of the covariates or none of them.
Note that variable selection (e.g., using regularization methods [[ZZZS17](https://doi.org/10.1080/10618600.2016.1154063)]) is possible, but is beyond the scope of this paper.

First, we need:

* to import the `bivpois` [[KN05](https://www.jstatsoft.org/article/view/v014i10)], `MGLM` [[ZZ17](https://cran.r-project.org/web/packages/MGLM/index.html)] and `MASS` [[VR02](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/glm.nb.html)] libraries.

In [19]:
library(MGLM)
library(bivpois)
library(MASS)

* to load the dataset.

In [20]:
data('rnaseq')
rnaseq['totalReads'] = log(rnaseq['totalReads'])
D = rnaseq

The `MGLM` library requires the response and explanatory variables to be separated.
`Y` and `X` correspond respectively to the response and explanatory datasets built from `D`.

In [21]:
Y = as.matrix(D[, 1:6])
X = as.matrix(D[, 7:10])

Similarly, `S` contains the sample totals, i.e., the row sums of `Y`.

In [22]:
S = rowSums(Y)

Considering the `bivpois` and `MGLM` libraries, we have only 2 *sensu stricto* multivariate distributions to infer on both datasets:

1. the multivariate Poisson distribution,
2. the negative multinomial distribution.

The maximum likelihood estimates (MLEs) of:

* the multivariate Poisson distribution or regression cannot be obtained since multivariate Poisson MLEs in `bivpois` do not handle more than 2 response variables.

* the negative multinomial distribution is obtained as follows,

In [23]:
NMD = MGLMfit(Y, dist='NegMN')
print(NMD)

        estimate           SE
p_X1  0.31148586 0.0013619556
p_X2  0.10649139 0.0008502964
p_X3  0.09837308 0.0008192291
p_X4  0.35049562 0.0014253376
p_X5  0.09426337 0.0008029165
p_X6  0.02122027 0.0003891465
phi  12.23256918 1.2292533050

Distribution: Negative Multinomial
Log-likelihood: -20673.71
BIC: 41384.52
AIC: 41361.43
LRT test p value: NA
Iterations: 3


* the negative multinomial regression is obtained as follows,

In [24]:
NMR = MGLMreg(Y ~ X, dist='NegMN')
print(NMR)

“The algorithm doesn't converge within 35 iterations. The norm of the gradient is  200.051475894594  Please interpret hessian matrix and MLE with caution. 
”

Call: MGLMreg(formula = Y ~ X, dist = "NegMN")

Coefficients:
$alpha
                       X1            X2            X3            X4
(Intercept) -13.587376137 -1.352181e+01 -22.380090898 -14.131338928
XtotalReads   0.907715672  8.504124e-01   1.242922188   0.915257796
Xtreatment   -0.753112730 -7.735067e-01   2.014641268   0.675195341
Xgender      -0.060696325  7.021728e-03  -0.069502706   0.012498726
Xage          0.002582547 -9.376888e-04   0.002915974  -0.002140646
                       X5            X6
(Intercept) -1.450769e+01 -18.526416540
XtotalReads  8.999176e-01   1.020359613
Xtreatment  -6.382956e-01  -0.730410081
Xgender     -8.368181e-02  -0.093397437
Xage        -8.241306e-04   0.008765665

$phi
         
31.60591 


Hypothesis test: 
             wald value    Pr(>wald)
(Intercept)   385.75443 3.224960e-80
XtotalReads   368.08192 2.020898e-76
Xtreatment  18377.53053 0.000000e+00
Xgender        54.84664 4.978064e-10
Xage           79.70908 4.103032e-15

Distribution: 
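The information criteria printed by `MGLMfit` can be recomputed by hand from the log-likelihood, which helps when comparing fits obtained with different libraries. The following is a minimal sketch, assuming the usual definitions AIC = -2 log L + 2k and BIC = -2 log L + k log(n), with k = 7 free parameters for the negative multinomial distribution (six `p_X*` and `phi`) and n = 200 observations (an assumption, inferred from the 199 degrees of freedom reported for the intercept-only models of the totals). The helper name `info_criteria` is hypothetical.

```r
# Recompute AIC and BIC from a log-likelihood (hypothetical helper).
info_criteria <- function(loglik, k, n) {
  c(AIC = -2 * loglik + 2 * k,
    BIC = -2 * loglik + k * log(n))
}

# Log-likelihood copied from the printed negative multinomial distribution fit;
# k = 7 parameters and n = 200 observations are assumptions stated in the text.
nmd <- info_criteria(loglik = -20673.71, k = 7, n = 200)
print(round(nmd, 2))
```

Up to the rounding of the printed log-likelihood, the result agrees with the AIC and BIC reported for the negative multinomial distribution.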

But, as stated in the article, the `MGLM` library provides the MLEs of:

* the singular multinomial distribution obtained as follows,

In [25]:
MND = MGLMfit(Y, dist='MN')
print(MND)

           estimate         SE
alpha_X1 0.31787518 0.03292647
alpha_X2 0.11246732 0.02234037
alpha_X3 0.09729872 0.02095611
alpha_X4 0.35093667 0.03374760
alpha_X5 0.09895906 0.02111471
alpha_X6 0.02246305 0.01047818

Distribution: Multinomial
Log-likelihood: -19370.71
BIC: 38767.91
AIC: 38751.42
LRT test p value: 
Iterations: 


* the singular multinomial regression (note that this estimation does not work properly in `R`, so the cell has been disabled; to re-enable it, change the metadata of the next cell from `Raw NbConvert` to `Code`)

In [26]:
MNR = MGLMreg(Y ~ X, dist='MN')
print(MNR)

Call: MGLMreg(formula = Y ~ X, dist = "MN")

Coefficients:
                      X1           X2           X3          X4           X5
(Intercept)  4.942732041  5.009649150 -3.792216067  4.43543383  4.027689144
XtotalReads -0.112845142 -0.170222056  0.219276996 -0.10726028 -0.120928058
Xtreatment  -0.022654652 -0.043099372  2.745276755  1.40574185  0.092245537
Xgender      0.032675527  0.100389059  0.020663497  0.10385929  0.009513895
Xage        -0.006187245 -0.009708565 -0.005907363 -0.01094467 -0.009599361

Hypothesis test: 
             wald value    Pr(>wald)
(Intercept)   144.88789 1.634268e-29
XtotalReads    69.92572 1.061922e-13
Xtreatment  18364.13260 0.000000e+00
Xgender        52.33670 4.601575e-10
Xage           79.91023 8.762650e-16

Distribution: Multinomial
Log-likelihood: -7506.393
BIC: 15145.24
AIC: 15062.79
Iterations: 6


* the singular Dirichlet multinomial distribution

In [27]:
DMD = MGLMfit(Y, dist='DM')
print(DMD)

          estimate         SE
alpha_X1 6.1281170 0.32788775
alpha_X2 2.4136468 0.13967619
alpha_X3 1.6256266 0.09942399
alpha_X4 6.8229292 0.36256781
alpha_X5 2.2142361 0.12923252
alpha_X6 0.7840283 0.05136868

Distribution: Dirichlet Multinomial
Log-likelihood: -4968.666
BIC: 9969.121
AIC: 9949.331
LRT test p value: <0.0001
Iterations: 6


* the singular Dirichlet multinomial regression

In [28]:
DMR = MGLMreg(Y ~ X, dist='DM')
print(DMR)

Call: MGLMreg(formula = Y ~ X, dist = "DM")

Coefficients:
                     X1           X2           X3           X4           X5
(Intercept) -0.89584994 -1.096920833 -8.997413980 -1.736871487 -1.774226549
XtotalReads  0.22198795  0.186918573  0.536571928  0.252679409  0.216671706
Xtreatment  -0.67929126 -0.686880554  1.835585178  0.707954021 -0.546469027
Xgender     -0.02617726  0.040244490 -0.052841946  0.023177795 -0.058339470
Xage         0.01024465  0.005226678  0.009134239  0.004252438  0.006090452
                     X6
(Intercept) -5.64682236
XtotalReads  0.34727128
Xtreatment  -0.54313366
Xgender     -0.03913945
Xage         0.01164235

Hypothesis test: 
             wald value  Pr(>wald)
(Intercept)   14.579069 0.02379579
XtotalReads    8.502549 0.20354699
Xtreatment  1851.437449 0.00000000
Xgender        4.133364 0.65863419
Xage          13.131512 0.04099442

Distribution: Dirichlet Multinomial
Log-likelihood: -4386.941
BIC: 8932.831
AIC: 8833.882
Iterations: 9


* the singular generalized Dirichlet multinomial distribution

In [29]:
GDMD = MGLMfit(Y, dist='GDM')
print(GDMD)

          estimate        SE
alpha_X1  3.741846 0.3670877
alpha_X2  2.400909 0.8154314
alpha_X3  1.558396 0.2331360
alpha_X4  6.988354 1.1647638
alpha_X5 20.689398 0.1492792
beta_X1   8.026379 0.9665023
beta_X2  11.038376 0.7259782
beta_X3   8.961428 0.2645201
beta_X4   2.702723 2.8717176
beta_X5   4.854816 0.6482710

Distribution: Generalized Dirichlet Multinomial
Log-likelihood: -4841.231
BIC: 9735.446
AIC: 9702.463
LRT test p value: <0.0001
Iterations: 59


* the singular generalized Dirichlet multinomial regression

In [30]:
GDMR = MGLMreg(Y ~ X, dist='GDM')
print(GDMR)

Call: MGLMreg(formula = Y ~ X, dist = "GDM")

Coefficients:
                alpha_X1     alpha_X2    alpha_X3      alpha_X4    alpha_X5
(Intercept)  5.987992852 -7.056673213  0.45608810 -10.120737500  2.63939594
XtotalReads -0.215098743  0.555972697  0.03955285   0.720358215 -0.01612137
Xtreatment  -0.047690534 -0.329319924  0.97935928   0.099957634  0.06339341
Xgender      0.233005806  0.374837886 -0.18642022  -0.202417040  0.14428946
Xage         0.006661015 -0.004342996  0.01936118   0.008173279  0.01239665
                 beta_X1       beta_X2    beta_X3     beta_X4     beta_X5
(Intercept)  4.661089454 -9.7891269676  7.0950614 -9.53000784 -1.68761454
XtotalReads -0.140896121  0.7138189416 -0.2229844  0.74314582  0.13398453
Xtreatment   0.628878471  0.7461977539 -1.5916297 -0.92371233 -0.04244096
Xgender      0.212071256  0.2732563514 -0.2331213 -0.27042831  0.12206235
Xage         0.003223835  0.0004533068  0.0159450  0.01254059  0.01918839

Hypothesis test: 
            wald valu
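The singular fits above can be compared through their BIC (lower is better). The following is a minimal sketch with the values copied by hand from the printed outputs. The generalized Dirichlet multinomial regression is left out because its printed output stops before the BIC line, and the vector names are hypothetical labels.

```r
# BIC values copied from the printed MGLM outputs
# (distribution fits first, then regression fits).
bic <- c(MN = 38767.91, DM = 9969.121, GDM = 9735.446,
         MN_reg = 15145.24, DM_reg = 8932.831)

# Sort in increasing order: the first entry has the lowest BIC.
ranking <- sort(bic)
print(ranking)
best <- names(ranking)[1]
```

Keep in mind that the multinomial regression entry comes from an estimation flagged above as not working properly, so its rank should be read with caution.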

In the context of splitting distributions, each of these singular distributions can be combined with a univariate model of the sum to define a discrete multivariate model.

For example, we can use the usual MLEs or method-of-moments estimates (MMEs) of:

* Poisson distribution,

In [31]:
UPD = glm(S~1, family="poisson")
summary(UPD)
BIC(UPD)


Call:
glm(formula = S ~ 1, family = "poisson")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-15.8076   -6.4047   -0.8109    4.6919   23.0403  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 6.522137   0.002712    2405   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 11405  on 199  degrees of freedom
Residual deviance: 11405  on 199  degrees of freedom
AIC: 13071

Number of Fisher Scoring iterations: 4


* Poisson regression,

In [32]:
UPR = glm(S~X, family="poisson")
summary(UPR)
BIC(UPR)


Call:
glm(formula = S ~ X, family = "poisson")

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-13.3031   -3.3875   -0.6941    2.7034   14.5858  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -9.5840042  0.2054558 -46.648  < 2e-16 ***
XtotalReads  0.9296642  0.0117171  79.342  < 2e-16 ***
Xtreatment  -0.0067466  0.0055118  -1.224    0.221    
Xgender     -0.0315672  0.0054328  -5.810 6.23e-09 ***
Xage         0.0002586  0.0002773   0.933    0.351    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 11405  on 199  degrees of freedom
Residual deviance:  4663  on 195  degrees of freedom
AIC: 6336.6

Number of Fisher Scoring iterations: 4


* Binomial distribution (since the MLE of the index parameter of the binomial distribution is not available in **R**, we use the MME),

In [33]:
index = max(round((mean(S) ^ 2)/(mean(S) - var(S))), max(S))
BS = cbind(S, index - S)
BD = glm(BS~1, family="binomial")
summary(BD)
print(BIC(BD))


Call:
glm(formula = BS ~ 1, family = "binomial")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-20.486   -8.709   -1.139    6.840   43.575  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.005777   0.003829  -1.509    0.131

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24962  on 199  degrees of freedom
Residual deviance: 24962  on 199  degrees of freedom
AIC: 26471

Number of Fisher Scoring iterations: 3


[1] 26474.38
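The moment estimate of the index used above follows from the binomial moments: with mean m = n p and variance v = n p (1 - p), solving for n gives n = m^2 / (m - v). A minimal sketch of the same computation on toy samples (the function name `binomial_index_mme` and the toy values are hypothetical) shows the role of the `max(..., max(S))` guard: when the sample variance exceeds the sample mean, the moment estimate is negative and the largest observed total is used instead.

```r
# Method-of-moments estimate of the binomial index, bounded below by the
# largest observation (same expression as in the cell above).
binomial_index_mme <- function(s) {
  m <- mean(s)
  v <- var(s)
  n_hat <- round(m^2 / (m - v))
  max(n_hat, max(s))
}

# Underdispersed toy sample (variance < mean): the moment estimate is used.
under <- binomial_index_mme(c(2, 3, 5, 6, 4, 4))

# Overdispersed toy sample (variance > mean): falls back to max(s).
over <- binomial_index_mme(c(1, 10, 3, 20))

print(c(under = under, over = over))
```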


* Binomial regression (the index parameter of the binomial regression is assumed to be known and equal to the MME for binomial distribution),

In [34]:
BR = glm(BS~X, family="binomial")
summary(BR)
print(BIC(BR))


Call:
glm(formula = BS ~ X, family = "binomial")

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-19.178   -4.857   -1.009    3.639   34.692  

Coefficients:
              Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -3.233e+01  2.938e-01 -110.024   <2e-16 ***
XtotalReads  1.870e+00  1.682e-02  111.139   <2e-16 ***
Xtreatment  -7.446e-03  7.955e-03   -0.936    0.349    
Xgender     -7.541e-02  7.871e-03   -9.580   <2e-16 ***
Xage         1.491e-04  4.035e-04    0.370    0.712    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 24962  on 199  degrees of freedom
Residual deviance: 11465  on 195  degrees of freedom
AIC: 12983

Number of Fisher Scoring iterations: 4


[1] 12999.47


* Negative binomial distribution,

In [35]:
NBD = glm.nb(S~1)
summary(NBD)
BIC(NBD)


Call:
glm.nb(formula = S ~ 1, init.theta = 12.2325956, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.3841  -0.8895  -0.1084   0.6066   2.7329  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   6.5221     0.0204   319.7   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(12.2326) family taken to be 1)

    Null deviance: 202.61  on 199  degrees of freedom
Residual deviance: 202.61  on 199  degrees of freedom
AIC: 2671.9

Number of Fisher Scoring iterations: 1


              Theta:  12.23 
          Std. Err.:  1.23 

 2 x log-likelihood:  -2667.949 

* Negative binomial regression,

In [36]:
NBR = glm.nb(S~X)
summary(NBR)
BIC(NBR)


Call:
glm.nb(formula = S ~ X, init.theta = 31.28138802, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6477  -0.7406  -0.1661   0.5601   2.7498  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -9.7226656  0.9499844 -10.235   <2e-16 ***
XtotalReads  0.9375435  0.0543613  17.247   <2e-16 ***
Xtreatment  -0.0136052  0.0262552  -0.518    0.604    
Xgender     -0.0317750  0.0259402  -1.225    0.221    
Xage         0.0003732  0.0013314   0.280    0.779    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(31.2814) family taken to be 1)

    Null deviance: 503.72  on 199  degrees of freedom
Residual deviance: 200.74  on 195  degrees of freedom
AIC: 2494.5

Number of Fisher Scoring iterations: 1


              Theta:  31.28 
          Std. Err.:  3.26 

 2 x log-likelihood:  -2482.506 
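Under a splitting model, the joint probability factorizes as the singular distribution of `Y` given its total times the univariate distribution of the total, so the two log-likelihoods add. As a minimal sketch under that assumption (and assuming both libraries report full log-likelihoods, normalizing constants included), the Dirichlet multinomial distribution can be combined with the negative binomial distribution of `S` using values copied from the outputs above. Here `n = 200` is inferred from the 199 degrees of freedom of the intercept-only fits, and the variable names are hypothetical.

```r
# Log-likelihoods copied from the printed outputs.
ll_singular <- -4968.666      # Dirichlet multinomial, MGLMfit(Y, dist='DM')
ll_sum      <- -2667.949 / 2  # negative binomial, from "2 x log-likelihood"

# Parameter counts: 6 alphas for the singular part,
# intercept and theta for the model of the sum.
k <- 6 + 2
n <- 200  # assumed number of observations

# Joint log-likelihood and BIC of the splitting model.
ll_joint  <- ll_singular + ll_sum
bic_joint <- -2 * ll_joint + k * log(n)
print(c(logLik = ll_joint, BIC = bic_joint))
```

The same recipe applies to any pairing of a singular model with a univariate model of the sum, provided both were fitted on the same observations.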