# A comparison of multivariate models for count data (KN05)

To illustrate the variety of splitting models, we considered two datasets used in the literature to illustrate models for count data.
Here we focus on the first one, denoted by `D` in the following, which consists of outcomes of football games [[KN05](https://www.jstatsoft.org/article/view/v014i10)].
Since the goal is to compare distribution and regression models, comparisons were performed using either all covariates or none of them.
Note that variable selection (e.g., using regularization methods [[ZZZS17](https://doi.org/10.1080/10618600.2016.1154063)]) is possible, but is beyond the scope of this paper.

First, we need:

* to import the `bivpois` [[KN05](https://www.jstatsoft.org/article/view/v014i10)], `MGLM` [[ZZ17](https://cran.r-project.org/web/packages/MGLM/index.html)] and `MASS` [[VR02](https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/glm.nb.html)] libraries.

In [None]:
library(MGLM)
library(bivpois)
library(MASS)

* to load the dataset.

In [None]:
data('ex4.ita91')
D = ex4.ita91

The `MGLM` library requires the responses and the explanatory variables to be stored separately.
`Y` and `X` correspond respectively to the response and explanatory datasets built from `D`.

In [None]:
Y = as.matrix(D[,c(1,2)])
X = as.matrix(model.matrix(~team1+team2, data=D)[,-c(1)])
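To make the construction of `X` concrete, here is a minimal sketch on a made-up data frame (the `toy` object and its team labels are hypothetical and are not taken from `D`). It shows that `model.matrix` expands each factor into dummy columns and that dropping the first column removes the intercept, presumably because the regression formulas used later add their own.

```r
# Hypothetical toy data: three games between three made-up teams.
toy <- data.frame(team1 = factor(c("A", "B", "C")),
                  team2 = factor(c("B", "C", "A")))

# Default treatment contrasts: one dummy column per non-reference level.
M <- model.matrix(~ team1 + team2, data = toy)
colnames(M)
# "(Intercept)" "team1B" "team1C" "team2B" "team2C"

# Dropping the first column removes the intercept, as done for X above.
X_toy <- as.matrix(M[, -1])
colnames(X_toy)
# "team1B" "team1C" "team2B" "team2C"
```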

Similarly, `S` contains the sample totals, i.e. the row sums of `Y`.

In [None]:
S = rowSums(Y)

Note that for the `MGLM` library, rows where the sample total is `0` must be removed.

In [None]:
Y = Y[S > 0,]
X = X[S > 0,]
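The filtering step can be checked on a small made-up example (the matrices `Y_toy` and `X_toy` below are hypothetical). The sketch applies the same logical index to responses and covariates so that their rows stay aligned, and uses `drop = FALSE`, an optional safeguard that keeps the result a matrix even if a single row or column remains.

```r
# Hypothetical toy responses: the second game ended 0-0.
Y_toy <- rbind(c(1, 0),
               c(0, 0),
               c(2, 3))
# Hypothetical single covariate, one value per game.
X_toy <- cbind(x = c(10, 20, 30))

S_toy <- rowSums(Y_toy)   # sample totals: 1, 0, 5
keep  <- S_toy > 0        # TRUE, FALSE, TRUE

# Same index for both objects, so rows remain aligned.
Y_kept <- Y_toy[keep, , drop = FALSE]
X_kept <- X_toy[keep, , drop = FALSE]
```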

Considering the `bivpois` and the `MGLM` libraries, we have only two *sensu stricto* multivariate distributions to fit to both datasets:

1. the multivariate Poisson distribution,
2. the negative multinomial distribution.

The maximum likelihood estimate (MLE) of:

* the multivariate Poisson distribution is obtained as follows,

In [None]:
MPD = lm.bp(g1~1, g2~1, l1l2=~1, data=D)
print(MPD)
print(MPD$BIC)

* the multivariate Poisson regression is obtained as follows,

In [None]:
MPR = lm.bp(g1~team1+team2, g2~team1+team2, l1l2=~team1+team2, data=D)
print(MPR)
print(MPR$BIC)

* the negative multinomial distribution is obtained as follows,

In [None]:
NMD = MGLMfit(Y, dist='NegMN')
print(NMD)

* the negative multinomial regression is obtained as follows,

In [None]:
NMR = MGLMreg(Y ~ X, dist='NegMN')
print(NMR)

However, as stated in the article, the `MGLM` library also provides the MLEs of:

* the singular multinomial distribution obtained as follows,

In [None]:
MND = MGLMfit(Y, dist='MN')
print(MND)

* the singular multinomial regression (note that this estimation does not work properly in `R`, so the corresponding cell has been disabled; to enable it, change the metadata of the next cell from `Raw NbConvert` to `Code`)

* the singular Dirichlet multinomial distribution (note that the singular generalized Dirichlet multinomial distribution is not considered hereafter: with only two response variables, generalized Dirichlet multinomial models are equivalent to Dirichlet multinomial models)

In [None]:
DMD = MGLMfit(Y, dist='DM')
print(DMD)

* the singular Dirichlet multinomial regression (note that the singular generalized Dirichlet multinomial regression is not considered hereafter: with only two response variables, generalized Dirichlet multinomial models are equivalent to Dirichlet multinomial models)

In [None]:
DMR = MGLMreg(Y ~ X, dist='DM')
print(DMR)

In the context of splitting distributions, each of these singular distributions can be combined with a univariate model of the sum to define a discrete multivariate model.
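As a sketch of why the two parts can be handled separately (the notation is ours and is only assumed here: $|y| = \sum_j y_j$ is the sample total stored in `S`, $\psi$ denotes the parameters of the sum model and $\theta$ those of the singular model), the combined model factorizes as

$$
P(Y = y) = P(|Y| = n; \psi) \, P(Y = y \mid |Y| = n; \theta), \qquad n = |y|,
$$

so that, for games $i = 1, \dots, N$ with totals $n_i$, the log-likelihood is a sum of two terms:

$$
\ell(\psi, \theta) = \sum_{i=1}^{N} \log P(|Y_i| = n_i; \psi) + \sum_{i=1}^{N} \log P(Y_i = y_i \mid |Y_i| = n_i; \theta).
$$

When $\psi$ and $\theta$ share no component, each term can be maximized on its own. A game with $n_i = 0$ contributes $\log 1 = 0$ to the second term, which is consistent with removing zero-total rows before fitting the singular models.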

In [None]:
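# Rebuild X on all rows of D: S still has one entry per game,
# whereas X was restricted above to the rows with S > 0.
X = as.matrix(model.matrix(~team1+team2, data=D)[,-c(1)])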

For example, we can use the usual MLEs or method-of-moments estimates (MMEs) of:

* Poisson distribution,

In [None]:
UPD = glm(S~1, family="poisson")
summary(UPD)
BIC(UPD)

* Poisson regression,

In [None]:
UPR = glm(S~X, family="poisson")
summary(UPR)
BIC(UPR)

* Binomial distribution (since the MLE of the binomial index parameter is not available in **R**, we use the MME),

In [None]:
index = max(round((mean(S) ^ 2)/(mean(S) - var(S))), max(S))
BS = cbind(S, index - S)
BD = glm(BS~1, family="binomial")
summary(BD)
print(BIC(BD))
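The moment estimate of the index used above can be unpacked as follows. This is a minimal sketch: the helper name `binomial_index_mme` and the toy vectors are hypothetical and are not part of the original analysis. Under a binomial model with index $n$ and probability $p$, the mean is $np$ and the variance is $np(1-p)$, which gives $n = \text{mean}^2 / (\text{mean} - \text{variance})$.

```r
# Hypothetical helper reproducing the moment estimate of the binomial index.
binomial_index_mme <- function(s) {
  m <- mean(s)
  v <- var(s)
  # Solve m = n * p and v = n * p * (1 - p) for n, then round to an integer.
  n_hat <- round(m^2 / (m - v))
  # The index cannot be smaller than the largest observed total; this also
  # covers overdispersed samples (v > m), where n_hat is negative.
  max(n_hat, max(s))
}

toy_under <- c(2, 3, 3, 4, 3, 2, 4, 3)  # made-up totals, variance < mean
toy_over  <- c(0, 5, 0, 5)              # made-up totals, variance > mean

binomial_index_mme(toy_under)  # 4
binomial_index_mme(toy_over)   # 5, i.e. falls back to max(s)
```

In the overdispersed case the estimate simply falls back to the largest observed total, which is the behaviour of the `max(...)` call in the cell above.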

* Binomial regression (the index parameter of the binomial regression is assumed to be known and equal to the MME for the binomial distribution),

In [None]:
index = max(round((mean(S) ^ 2)/(mean(S) - var(S))), max(S))
BS = cbind(S, index - S)
BR = glm(BS~X, family="binomial")
summary(BR)
print(BIC(BR))

* Negative binomial distribution,

In [None]:
NBD = glm.nb(S~1)
summary(NBD)
BIC(NBD)

* Negative binomial regression,

In [None]:
NBR = glm.nb(S~X)
summary(NBR)
BIC(NBR)