# HurdleDMR from Python

HurdleDMR.jl is a Julia implementation of the Hurdle Distributed Multinomial Regression (HDMR), as described in:

Kelly, Bryan, Asaf Manela, and Alan Moreira (2018). Text Selection. [Working paper](http://apps.olin.wustl.edu/faculty/manela/kmm/textselection/).

It includes a Julia implementation of the Distributed Multinomial Regression (DMR) model of [Taddy (2015)](https://arxiv.org/abs/1311.6139).

This tutorial explains how to use this package from Python via the [PyJulia](https://github.com/JuliaPy/pyjulia) package.

## Setup

### Install Julia

First, install Julia itself. The easiest way is to download it from https://julialang.org/downloads/. An alternative is to install JuliaPro from https://juliacomputing.com.

Once installed, open Julia in a terminal (or in Juno), press `]` to enter the package manager, and add the following packages:
```
pkg> add HurdleDMR GLM Lasso
```

### Install PyJulia
See the documentation [here](https://pyjulia.readthedocs.io/en/stable/) for installation instructions.
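
Before moving on, it can help to confirm that both pieces are visible from the Python interpreter you plan to use. The following is a minimal sketch using only the standard library; `check_setup` is a hypothetical helper name, and it only checks that a `julia` executable is on `PATH` and that a `julia` Python module can be found, not that the two are configured to work together.

```python
import importlib.util
import shutil


def check_setup():
    """Report whether the Julia binary and the PyJulia module are discoverable."""
    return {
        # Full path of the `julia` executable, or None if it is not on PATH.
        "julia_executable": shutil.which("julia"),
        # True if `import julia` would find a module (PyJulia installs as `julia`).
        "pyjulia_importable": importlib.util.find_spec("julia") is not None,
    }


if __name__ == "__main__":
    for name, value in check_setup().items():
        print(f"{name}: {value}")
```

If either entry comes back as `None` or `False`, revisit the corresponding installation step before continuing.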

Because I use miniconda, I also had to run the following, but you might not:

In [1]:
from julia.api import Julia
jl = Julia(compiled_modules=False)

### Add parallel workers and make HurdleDMR package available to workers

In [2]:
jl.eval("using Distributed")
from julia.Distributed import addprocs
addprocs(4)

from julia import HurdleDMR as hd
jl.eval("@everywhere using HurdleDMR")

### Example Data

Set up your data as an n-by-p `covars` matrix and a (sparse) n-by-d `counts` matrix. Here we generate some random data.

In [3]:
import numpy as np
from scipy import sparse

n = 100
p = 3
d = 4

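np.random.seed(123)
# Total count per observation (at least 1)
m = 1 + np.random.poisson(5,n)
covars = np.random.uniform(0,1,(n,p))

# Unnormalized weight of category j for observation i; category 0 always gets weight 0
q = [[0 + j*sum(covars[i,:]) for j in range(d)] for i in range(n)]
#rowsums = [sum(q[i]) for i in range(n)]
# Normalize each row into multinomial probabilities
q = [q[i]/sum(q[i]) for i in range(n)]

# Draw each row of counts from a multinomial with m[i] trials
#counts = sparse.csr_matrix(np.concatenate([[np.random.multinomial(m[i],q[i]) for i in range(n)]]))
counts = np.concatenate([[np.random.multinomial(m[i],q[i]) for i in range(n)]])
counts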

array([[0, 2, 3, 3],
       [0, 2, 2, 2],
       [0, 1, 2, 2],
       [0, 1, 1, 7],
       [0, 1, 1, 3],
       [0, 1, 1, 7],
       [0, 1, 2, 5],
       [0, 1, 0, 5],
       [0, 1, 5, 4],
       [0, 0, 1, 4],
       [0, 1, 2, 1],
       [0, 0, 3, 2],
       [0, 2, 3, 3],
       [0, 0, 2, 7],
       [0, 1, 5, 0],
       [0, 0, 2, 5],
       [0, 1, 0, 4],
       [0, 0, 1, 3],
       [0, 1, 3, 4],
       [0, 1, 3, 5],
       [0, 1, 1, 3],
       [0, 1, 0, 6],
       [0, 0, 2, 5],
       [0, 1, 3, 4],
       [0, 1, 3, 3],
       [0, 0, 2, 4],
       [0, 0, 1, 2],
       [0, 1, 0, 3],
       [0, 0, 2, 2],
       [0, 0, 2, 6],
       [0, 1, 1, 3],
       [0, 1, 0, 6],
       [0, 1, 2, 3],
       [0, 1, 2, 4],
       [0, 2, 0, 5],
       [0, 1, 1, 2],
       [0, 0, 2, 3],
       [0, 1, 1, 3],
       [0, 2, 2, 3],
       [0, 1, 6, 0],
       [0, 0, 0, 5],
       [0, 0, 2, 3],
       [0, 0, 3, 0],
       [0, 0, 2, 3],
       [0, 0, 2, 3],
       [0, 2, 2, 0],
       [0, 0, 2, 1],
       [0, 3,

## Distributed Multinomial Regression (DMR)

The Distributed Multinomial Regression (DMR) model of Taddy (2015) is a highly scalable
approximation to the Multinomial using distributed (independent, parallel)
Poisson regressions, one for each of the d categories (columns) of a large `counts` matrix,
on the `covars`.
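
To see why independent Poisson regressions can stand in for a multinomial, consider the simplest case with no covariates. The sketch below assumes, following Taddy (2015), that each Poisson regression uses the log of the row total as a fixed offset. With an intercept only, the Poisson maximum likelihood estimate for each category then has a closed form, and the implied category probabilities sum to one, matching the pooled multinomial shares. The toy counts and the helper name `poisson_intercepts` are illustrative and not part of HurdleDMR.

```python
import math

# Toy n-by-d counts matrix (made up for illustration).
counts = [
    [0, 2, 3, 3],
    [1, 0, 2, 2],
    [0, 1, 2, 4],
]


def poisson_intercepts(counts):
    """Intercept-only Poisson MLE per column with offset log(m_i).

    Maximizing sum_i [c_ij * (a_j + log m_i) - m_i * exp(a_j)] over a_j gives
    exp(a_j) = sum_i c_ij / sum_i m_i, where m_i is the total count of row i.
    """
    totals = [sum(row) for row in counts]
    grand_total = sum(totals)
    d = len(counts[0])
    column_sums = [sum(row[j] for row in counts) for j in range(d)]
    # Each column is handled independently, which is what makes the fit parallelizable.
    return [math.log(s / grand_total) for s in column_sums]


alphas = poisson_intercepts(counts)
probs = [math.exp(a) for a in alphas]
print(probs)       # pooled category shares
print(sum(probs))  # approximately 1.0
```

With covariates the closed form no longer applies and each column needs an iterative fit, but the columns remain independent of one another.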

To fit a DMR:

In [4]:
m = hd.dmr(covars, counts)

We can get the coefficient matrix (an intercept row followed by one row per covariate, with one column per category) as usual with

In [5]:
hd.coef(m)

array([[ 0.        , -1.94507425, -1.28828706, -0.59041306],
       [ 0.        ,  0.1056461 ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.1268672 ,  0.        ]])

By default we only return the AICc-minimizing coefficients.
To also get back the entire regularization paths, run

In [6]:
paths = hd.dmrpaths(covars, counts)
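
The AICc-based default can be sketched as follows. For each segment of a regularization path with deviance `dev` and `k` degrees of freedom, the small-sample corrected criterion is AICc = dev + 2k + 2k(k+1)/(n-k-1), and the segment with the smallest value is kept. The deviances and degrees of freedom below are invented for illustration, and `aicc` and `select_segment` are hypothetical helpers written here in plain Python, not functions from Lasso.jl or HurdleDMR.

```python
def aicc(deviance, k, n):
    """Corrected AIC for n observations and k degrees of freedom."""
    if n - k - 1 <= 0:
        return float("inf")  # correction undefined; treat the segment as inadmissible
    return deviance + 2 * k + 2 * k * (k + 1) / (n - k - 1)


def select_segment(deviances, dofs, n):
    """Return the index of the path segment with the smallest AICc."""
    scores = [aicc(dev, k, n) for dev, k in zip(deviances, dofs)]
    return min(range(len(scores)), key=scores.__getitem__)


# Hypothetical path: deviance falls as the penalty relaxes and more coefficients enter.
deviances = [120.0, 104.0, 101.5, 100.9]
dofs = [1, 2, 3, 4]
best = select_segment(deviances, dofs, n=100)
print(best)
```

The criterion trades the drop in deviance against the penalty on model size, so the chosen segment is usually neither the first nor the last on the path.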

We can now select, for example, the coefficients that minimize the 10-fold cross-validation MSE (this takes a while):

In [7]:
jl.eval("using Lasso: MinCVmse")
from julia import Lasso
gen = jl.eval("MinCVKfold{MinCVmse}(10)")
hd.coef(paths, gen)

array([[ 0.00000000e+00, -1.89167038e+00, -1.22050226e+00,
        -5.90413062e-01],
       [ 0.00000000e+00,  3.18787704e-11,  0.00000000e+00,
         0.00000000e+00],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
         0.00000000e+00],
       [ 0.00000000e+00,  0.00000000e+00,  2.97862348e-07,
         0.00000000e+00]])

## Hurdle Distributed Multinomial Regression (HDMR)

For highly sparse counts, as is often the case with text that is selected for
various reasons, the Hurdle Distributed Multinomial Regression (HDMR) model of
Kelly, Manela, and Moreira (2018) may be superior to the DMR. It approximates
a higher-dispersion Multinomial using distributed (independent, parallel)
Hurdle regressions, one for each of the d categories (columns) of a large `counts` matrix,
on the `covars`. It allows potentially different sets of covariates to explain
category inclusion ($h=1\{c>0\}$) and repetition ($c>0$).

Both the model for zeros and the model for positive counts are regularized by default,
using `GammaLassoPath`, picking the AICc-optimal segment of the regularization
path.
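
The two parts of the hurdle model correspond to two views of each column of `counts`. The sketch below is plain Python for illustration only (it does not call HurdleDMR, and `hurdle_split` is a hypothetical helper). It builds the inclusion indicator $h$ for one category, collects the positive counts that feed the repetition model, and reports the share of zeros, which is the kind of sparsity that motivates HDMR.

```python
def hurdle_split(column):
    """Split one category's counts into the two hurdle-model targets.

    Returns (h, positives): h[i] is 1 if observation i has a positive count
    and 0 otherwise; positives lists (i, c) for observations with c > 0.
    """
    h = [1 if c > 0 else 0 for c in column]
    positives = [(i, c) for i, c in enumerate(column) if c > 0]
    return h, positives


# Third category (column index 2) of the first eight rows of `counts` shown earlier.
column = [3, 2, 2, 1, 1, 1, 2, 0]
h, positives = hurdle_split(column)
zero_share = 1 - sum(h) / len(h)
print(h)           # inclusion indicators
print(positives)   # (row index, count) pairs with c > 0
print(zero_share)  # fraction of observations with a zero count
```

The `inzero` and `inpos` arguments used in the fitting call select which covariates enter the model for `h` and the model for the positive counts, respectively.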

HDMR can be fitted:

In [8]:
m = hd.hdmr(covars, counts, inpos=[1,2], inzero=[1,2,3])

We can get the coefficient matrices (an intercept row followed by one row per included covariate) for the positive-counts and zeros models as usual with

In [9]:
coefspos, coefszero = hd.coef(m)
print("coefspos:\n", coefspos)
print("coefszero:\n", coefszero)

coefspos:
 [[ 0.         -2.18288411 -1.18060442 -0.41828599]
 [ 0.          0.33062404  0.          0.        ]
 [ 0.          0.          0.02997958  0.08338104]]
coefszero:
 [[ 0.          0.04616614  1.36309252  3.07912436]
 [ 0.          0.          0.         -0.67010761]
 [ 0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.        ]]


By default we only return the AICc-minimizing coefficients.
To also get back the entire regularization paths, run

In [10]:
paths = hd.hdmrpaths(covars, counts)

hd.coef(paths, Lasso.AllSeg())

(array([[[ 0.00000000e+00, -2.02133392e+00, -1.16575159e+00,
          -3.76235470e-01],
         [ 0.00000000e+00,  2.90768861e-08,  0.00000000e+00,
           0.00000000e+00],
         [ 0.00000000e+00,  0.00000000e+00,  1.88213201e-12,
           1.36664985e-10],
         [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
           0.00000000e+00]],
 
        [[ 0.00000000e+00, -2.05931209e+00, -1.17116227e+00,
          -3.80396163e-01],
         [ 0.00000000e+00,  7.93643648e-02,  0.00000000e+00,
           0.00000000e+00],
         [ 0.00000000e+00,  0.00000000e+00,  1.09392297e-02,
           8.30076060e-03],
         [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00,
           0.00000000e+00]],
 
        [[ 0.00000000e+00, -2.09423553e+00, -1.17609960e+00,
          -3.84191708e-01],
         [ 0.00000000e+00,  1.51431219e-01,  0.00000000e+00,
           0.00000000e+00],
         [ 0.00000000e+00,  0.00000000e+00,  2.09033261e-02,
           1.58632356e-02],
         [ 0.00

## Sufficient reduction projection

A sufficient reduction projection summarizes the counts, much like a sufficient
statistic, and is useful for reducing the d-dimensional counts to a potentially
much lower-dimensional matrix `z`.

To get a sufficient reduction projection in the direction of `vy`, taken here to be
the first column of `covars`, for the above example:

In [11]:
z = hd.srproj(m,counts,1,1)
z

array([[ 0.08265601, -0.2233692 ,  8.        ,  3.        ],
       [ 0.11020801, -0.2233692 ,  6.        ,  3.        ],
       [ 0.06612481, -0.2233692 ,  5.        ,  3.        ],
       [ 0.036736  , -0.2233692 ,  9.        ,  3.        ],
       [ 0.06612481, -0.2233692 ,  5.        ,  3.        ],
       [ 0.036736  , -0.2233692 ,  9.        ,  3.        ],
       [ 0.04132801, -0.2233692 ,  8.        ,  3.        ],
       [ 0.05510401, -0.33505381,  6.        ,  2.        ],
       [ 0.0330624 , -0.2233692 , 10.        ,  3.        ],
       [ 0.        , -0.33505381,  5.        ,  2.        ],
       [ 0.08265601, -0.2233692 ,  4.        ,  3.        ],
       [ 0.        , -0.33505381,  5.        ,  2.        ],
       [ 0.08265601, -0.2233692 ,  8.        ,  3.        ],
       [ 0.        , -0.33505381,  9.        ,  2.        ],
       [ 0.05510401,  0.        ,  6.        ,  2.        ],
       [ 0.        , -0.33505381,  7.        ,  2.        ],
       [ 0.06612481, -0.

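Here, the first column is the SR projection from the model for positive counts, the second is the SR projection from the model for hurdle crossing (zeros), and the third is the total count for each observation. In the output above, the fourth column equals the number of categories with a positive count in each observation.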
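
The numbers in `z` can be reproduced by hand from the outputs printed above. The sketch below assumes the projection for positive counts is the coefficient-weighted sum of counts divided by the row total, and the projection for zeros is the coefficient-weighted sum of inclusion indicators divided by the number of categories with positive counts. These formulas are inferred from the displayed values rather than taken from the package source. The coefficient rows are copied from the `coefspos` and `coefszero` output for the first covariate, and `srproj_row` is a hypothetical helper.

```python
# Coefficient rows for the first covariate, copied from the HDMR output above.
phi_pos = [0.0, 0.33062404, 0.0, 0.0]
phi_zero = [0.0, 0.0, 0.0, -0.67010761]


def srproj_row(c, phi_pos, phi_zero):
    """Return [z_pos, z_zero, m, d] for one row of counts c."""
    h = [1 if cj > 0 else 0 for cj in c]  # inclusion indicators
    m = sum(c)                            # total count
    d = sum(h)                            # number of categories with positive counts
    z_pos = sum(p * cj for p, cj in zip(phi_pos, c)) / m if m > 0 else 0.0
    z_zero = sum(p * hj for p, hj in zip(phi_zero, h)) / d if d > 0 else 0.0
    return [z_pos, z_zero, m, d]


# Rows 1, 8 and 15 of `counts` as printed earlier.
rows = [[0, 2, 3, 3], [0, 1, 0, 5], [0, 1, 5, 0]]
projections = [srproj_row(c, phi_pos, phi_zero) for c in rows]
for proj in projections:
    print(proj)
```

Up to rounding, the three printed rows agree with rows 1, 8 and 15 of the `z` array above.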

## Counts Inverse Regression (CIR)

Counts inverse regression allows us to predict a covariate with the counts and other covariates.
Here we use `hdmr` for the backward regression and another model for the forward regression.
This can be accomplished with a single command, by fitting a `CIR{HDMR,FM}` where the forward model is `FM <: RegressionModel`.

In [12]:
jl.eval("using GLM: LinearModel")
spec = jl.eval("CIR{HDMR,LinearModel}")
cir = hd.fit(spec,covars,counts,1, 
             select=Lasso.MinBIC(), nocounts=True)
cir

<PyCall.jlwrap CIR{HDMR,LinearModel}(1, [1, 2], HDMRCoefs{InclusionRepetition,Array{Float64,2},Lasso.MinBIC,UnitRange{Int64}}([0.0 -2.18288 -1.1806 -0.415995; 0.0 0.330624 0.0 0.0; 0.0 0.0 0.0299796 0.0788679; 0.0 0.0 0.0 0.0], [0.0 0.0461661 1.36309 3.07912; 0.0 0.0 0.0 -0.670108; 0.0 0.0 0.0 0.0; 0.0 0.0 0.0 0.0], true, 100, 4, 1:3, 1:3, Lasso.MinBIC()), LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}}:

Coefficients:
───────────────────────────────────────────────────────────────────────
       Estimate  Std. Error    t value  Pr(>|t|)   Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
x1   0.887227     0.23313     3.80572     0.0003   0.424277   1.35018  
x2  -0.00275704   0.101541   -0.027152    0.9784  -0.204397   0.198883 
x3  -0.132434     0.101139   -1.30943     0.1936  -0.333275   0.0684076
x4   0.393855     0.697003    0.565069    0.5734  -0.990255   1.77797  
x5 

where `nocounts=True` means we also fit a benchmark model without counts,
and `select=Lasso.MinBIC()` selects the BIC-minimizing Lasso segment for each category.

We can get the backward and forward model coefficients with `coefbwd` and `coeffwd`:

In [13]:
hd.coefbwd(cir)

(array([[ 0.        , -2.18288411, -1.18060442, -0.41599536],
        [ 0.        ,  0.33062404,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.02997958,  0.07886791],
        [ 0.        ,  0.        ,  0.        ,  0.        ]]),
 array([[ 0.        ,  0.04616614,  1.36309252,  3.07912436],
        [ 0.        ,  0.        ,  0.        , -0.67010761],
        [ 0.        ,  0.        ,  0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ]]))

In [14]:
hd.coeffwd(cir)

array([ 0.88722732, -0.00275704, -0.13243369,  0.39385493,  0.23609827,
       -0.00979175, -0.08174069])
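
The seven forward coefficients are consistent with a forward regression of the target covariate on an intercept, the two remaining covariates, and the four sufficient reduction columns (1 + 2 + 4 = 7). The sketch below assembles one such design row in plain Python. The column ordering and the helper `forward_design_row` are assumptions made for illustration, not taken from the HurdleDMR source, and the covariate values are made up.

```python
def forward_design_row(covars_row, z_row, projdir):
    """Build [1, covariates other than the target, SR projection columns].

    projdir is the 0-based index of the covariate being predicted.
    """
    others = [x for j, x in enumerate(covars_row) if j != projdir]
    return [1.0] + others + list(z_row)


# Hypothetical covariate values; z_row borrows the first row of `z` shown
# earlier purely as example values.
covars_row = [0.7, 0.3, 0.2]
z_row = [0.08265601, -0.2233692, 8.0, 3.0]
row = forward_design_row(covars_row, z_row, projdir=0)
print(len(row))
print(row)
```

Dropping the four projection columns leaves the intercept and the other covariates, which corresponds to the benchmark model without counts.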

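The fitted model can be used to predict the target covariate (`vy`, here the first column of `covars`) with new data. In Python's 0-based indexing, `range(1,10)` selects rows 1 through 9, so nine predictions are returned: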

In [15]:
hd.predict(cir, covars[range(1,10),:], counts[range(1,10),:])

array([0.55374217, 0.44339513, 0.41348183, 0.44716221, 0.42806591,
       0.46489547, 0.48025307, 0.45795602, 0.56790706])

We can also predict using only the other covariates, which in this case
is just a linear regression:

In [16]:
hd.predict(cir, covars[range(1,10),:], counts[range(1,10),:], nocounts=True)

array([0.55869446, 0.46528463, 0.48635113, 0.4689304 , 0.49773856,
       0.51824815, 0.45533037, 0.53886851, 0.55713808])

Kelly, Manela, and Moreira (2018) show that the differences between DMR and HDMR can be substantial in some cases, especially when the counts data is highly sparse.

Please refer to the paper for additional details and example applications.