Skip to content

Commit

Permalink
Negative Binomial regression support (#238)
Browse files Browse the repository at this point in the history
* minor testing

* working on adding negative Binomial support, debugging now

* debugging a division by 0 error

* done with negative Binomial with dispersion parameter fixed. still need to construct more elaborate test cases because conversion may be an issue for non-traditional link functions. the next step is to estimate the dispersion parameter too

* fixed the bug caused by mis-calculating the deviance for negative binomial

* added a test case

* a small test case

* fixed typos

* fixed typos

* added RDatasets as a required package for testing

* added another test to clarify the loglink vs the NB canonical link

* added more details on the two test cases

* done with estimating theta too

* removed redundant struct

* removed redundant println

* took care of PR revew comments

* took care of review comments

* took care of PR review comments

* minor

* addressed more PR comments

* changed 1.0 to 1

* fixed the reparameterization issue

* minor

* added oftype(v, NaN) for type safety

* added oftype(v, NaN) for type safety

* trying latex with md

* adding theta in html

* added negative binomial regression section

* added negative binomial regression section

* put a listing in alphabetical order

* added NegativeBinomialLink to the list

* added InverseGaussian to the doc because it is already supported
  • Loading branch information
hung-q-ngo authored and dmbates committed Jul 4, 2018
1 parent 886f4fc commit fc5baf3
Show file tree
Hide file tree
Showing 7 changed files with 264 additions and 9 deletions.
65 changes: 59 additions & 6 deletions docs/src/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,9 +24,14 @@ The [RDatasets package](https://github.com/johnmyleswhite/RDatasets.jl) is usefu
To fit a Generalized Linear Model (GLM), use the function, `glm(formula, data, family, link)`, where,
- `formula`: uses column symbols from the DataFrame data, for example, if `names(data)=[:Y,:X1,:X2]`, then a valid formula is `@formula(Y ~ X1 + X2)`
- `data`: a DataFrame which may contain NA values, any rows with NA values are ignored
- `family`: chosen from `Bernoulli()`, `Binomial()`, `Gamma()`, `Normal()`, or `Poisson()`
- `family`: chosen from `Bernoulli()`, `Binomial()`, `Gamma()`, `Normal()`, `Poisson()`, or `NegativeBinomial(θ)`
- `link`: chosen from the list below, for example, `LogitLink()` is a valid link for the `Binomial()` family

The `NegativeBinomial` distribution belongs to the exponential family only if θ (the shape
parameter) is fixed, thus θ has to be provided if we use `glm` with `NegativeBinomial` family.
If one would like to also estimate θ, then `negbin(formula, data, link)` should be
used instead.

An intercept is included in any GLM by default.

## Methods applied to fitted models
Expand Down Expand Up @@ -109,6 +114,46 @@ X -3.64399e-19 0.91665 -3.97534e-19 1.0000
```

### Negative Binomial Regression:
```jldoctest
julia> using GLM, RDatasets
julia> quine = dataset("MASS", "quine")
julia> nbrmodel = glm(@formula(Days ~ Eth+Sex+Age+Lrn), quine, NegativeBinomial(2.0), LogLink())
StatsModels.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.NegativeBinomial{Float64},GLM.LogLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: Days ~ 1 + Eth + Sex + Age + Lrn
Coefficients:
Estimate Std.Error z value Pr(>|z|)
(Intercept) 2.88645 0.227144 12.7076 <1e-36
Eth: N -0.567515 0.152449 -3.72265 0.0002
Sex: M 0.0870771 0.159025 0.547568 0.5840
Age: F1 -0.445076 0.239087 -1.86157 0.0627
Age: F2 0.0927999 0.234502 0.395731 0.6923
Age: F3 0.359485 0.246586 1.45785 0.1449
Lrn: SL 0.296768 0.185934 1.59609 0.1105
julia> nbrmodel = negbin(@formula(Days ~ Eth+Sex+Age+Lrn), quine, LogLink())
StatsModels.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.NegativeBinomial{Float64},GLM.LogLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: Days ~ 1 + Eth + Sex + Age + Lrn
Coefficients:
Estimate Std.Error z value Pr(>|z|)
(Intercept) 2.89453 0.227415 12.728 <1e-36
Eth: N -0.569341 0.152656 -3.72957 0.0002
Sex: M 0.0823881 0.159209 0.517485 0.6048
Age: F1 -0.448464 0.238687 -1.87888 0.0603
Age: F2 0.0880506 0.235149 0.374445 0.7081
Age: F3 0.356955 0.247228 1.44383 0.1488
Lrn: SL 0.292138 0.18565 1.57359 0.1156
julia> println("Estimated theta = ", nbrmodel.model.rr.d.r)
Estimated theta = 1.2748930396601978
```

## Other examples

Expand Down Expand Up @@ -314,23 +359,30 @@ julia> deviance(gm1)
Typical distributions for use with `glm` and their canonical link
functions are

Bernoulli (LogitLink)
Binomial (LogitLink)
Gamma (InverseLink)
Normal (IdentityLink)
Poisson (LogLink)
Bernoulli (LogitLink)
Binomial (LogitLink)
Gamma (InverseLink)
InverseGaussian (InverseSquareLink)
NegativeBinomial (LogLink)
Normal (IdentityLink)
Poisson (LogLink)

Currently the available Link types are

CauchitLink
CloglogLink
IdentityLink
InverseLink
InverseSquareLink
LogitLink
LogLink
NegativeBinomialLink
ProbitLink
SqrtLink

Note that the canonical link for negative binomial regression is `NegativeBinomialLink`, but
in practice one typically uses `LogLink`.

## Separation of response object and predictor object

The general approach in this code is to separate functionality related
Expand Down Expand Up @@ -429,6 +481,7 @@ InverseLink
InverseSquareLink
LogitLink
LogLink
NegativeBinomialLink
ProbitLink
SqrtLink
linkfun
Expand Down
4 changes: 4 additions & 0 deletions src/GLM.jl
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,8 @@ module GLM
LogitLink,
LogLink,
LmResp,
NegativeBinomial,
NegativeBinomialLink,
Normal,
Poisson,
ProbitLink,
Expand All @@ -67,6 +69,7 @@ module GLM
logit,
mueta, # derivative of inverse link
mustart, # derive starting values for the mu vector
negbin, # interface to fitting genative binomial regression
nobs, # total number of observations
predict, # make predictions
updateμ!, # update the response type from the linear predictor
Expand Down Expand Up @@ -97,5 +100,6 @@ module GLM
include("glmtools.jl")
include("glmfit.jl")
include("ftest.jl")
include("negbinfit.jl")

end # module
18 changes: 17 additions & 1 deletion src/glmfit.jl
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,7 @@ the expression for the working weights will cancel.
"""
cancancel(::GlmResp) = false
cancancel(::GlmResp{V,D,LogitLink}) where {V,D<:Union{Bernoulli,Binomial}} = true
cancancel(::GlmResp{V,D,NegativeBinomialLink}) where {V,D<:NegativeBinomial} = true
cancancel(::GlmResp{V,D,IdentityLink}) where {V,D<:Normal} = true
cancancel(::GlmResp{V,D,LogLink}) where {V,D<:Poisson} = true

Expand Down Expand Up @@ -118,6 +119,20 @@ function updateμ!(r::GlmResp{V,D,L}) where {V<:FPVector,D<:Union{Bernoulli,Bino
end
end

function updateμ!(r::GlmResp{V,D,L}) where {V<:FPVector,D<:NegativeBinomial,L<:NegativeBinomialLink}
y, η, μ, wrkres, wrkwt, dres = r.y, r.eta, r.mu, r.wrkresid, r.wrkwt, r.devresid

@inbounds for i in eachindex(y, η, μ, wrkres, wrkwt, dres)
θ = r.d.r # the shape parameter of the negative binomial distribution
μi, dμdη, μomμ = inverselink(L(θ), η[i])
μ[i] = μi
yi = y[i]
wrkres[i] = (yi - μi) / dμdη
wrkwt[i] = dμdη
dres[i] = devresid(r.d, yi, μi)
end
end

"""
wrkresp(r::GlmResp)
Expand Down Expand Up @@ -308,10 +323,11 @@ d::UnivariateDistribution,
l::Link=canonicallink(d); kwargs...) where {M<:AbstractGLM} =
fit(M, float(X), float(y), d, l; kwargs...)

glm(X, y, args...; kwargs...) = fit(GeneralizedLinearModel, X, y, args...; kwargs...)
glm(F, D, args...; kwargs...) = fit(GeneralizedLinearModel, F, D, args...; kwargs...)

GLM.Link(mm::AbstractGLM) = mm.l
GLM.Link(r::GlmResp{T,D,L}) where {T,D,L} = L()
GLM.Link(r::GlmResp{T,D,L}) where {T,D<:NegativeBinomial,L<:NegativeBinomialLink} = L(r.d.r)
GLM.Link(m::GeneralizedLinearModel) = Link(m.rr)

Distributions.Distribution(r::GlmResp{T,D,L}) where {T,D,L} = D
Expand Down
33 changes: 33 additions & 0 deletions src/glmtools.jl
Original file line number Diff line number Diff line change
Expand Up @@ -66,6 +66,16 @@ The canonical [`Link`](@ref) for [`Distributions.Poisson`](https://juliastats.gi
"""
mutable struct LogLink <: Link end

"""
NegativeBinomialLink
The canonical [`Link`](@ref) for [`Distributions.NegativeBinomial`](@ref) distribution, defined as `η = log(μ/(μ+θ))`.
The shape parameter θ has to be fixed for the distribution to belong to the exponential family.
"""
mutable struct NegativeBinomialLink <: Link
θ::Float64
end

"""
ProbitLink
Expand Down Expand Up @@ -230,6 +240,15 @@ function inverselink(::LogLink, η)
μ, μ, oftype(μ, NaN)
end

linkfun(nbl::NegativeBinomialLink, μ) = log/+ nbl.θ))
linkinv(nbl::NegativeBinomialLink, η) = e^η * nbl.θ / (1-e^η)
mueta(nbl::NegativeBinomialLink, η) = e^η * nbl.θ / (1-e^η)
function inverselink(nbl::NegativeBinomialLink, η)
μ = e^η * nbl.θ / (1-e^η)
deriv = μ * (1 + μ / nbl.θ)
μ, deriv, oftype(μ, NaN)
end

linkfun(::ProbitLink, μ) = -sqrt2 * erfcinv(2μ)
linkinv(::ProbitLink, η) = erfc(-η / sqrt2) / 2
mueta(::ProbitLink, η) = exp(-abs2(η) / 2) / sqrt2π
Expand All @@ -248,6 +267,7 @@ canonicallink(::Bernoulli) = LogitLink()
canonicallink(::Binomial) = LogitLink()
canonicallink(::Gamma) = InverseLink()
canonicallink(::InverseGaussian) = InverseSquareLink()
canonicallink(d::NegativeBinomial) = NegativeBinomialLink(d.r)
canonicallink(::Normal) = IdentityLink()
canonicallink(::Poisson) = LogLink()

Expand Down Expand Up @@ -277,6 +297,7 @@ function glmvar end
glmvar(::Union{Bernoulli,Binomial}, μ) = μ * (1 - μ)
glmvar(::Gamma, μ) = abs2(μ)
glmvar(::InverseGaussian, μ) = μ^3
glmvar(d::NegativeBinomial, μ) = μ * (1 + μ/d.r)
glmvar(::Normal, μ) = one(μ)
glmvar(::Poisson, μ) = μ

Expand Down Expand Up @@ -306,6 +327,7 @@ function mustart end
mustart(::Bernoulli, y, wt) = (y + oftype(y, 1/2)) / 2
mustart(::Binomial, y, wt) = (wt * y + oftype(y, 1/2)) / (wt + one(y))
mustart(::Union{Gamma, InverseGaussian}, y, wt) = y == 0 ? oftype(y, 1/10) : y
mustart(::NegativeBinomial, y, wt) = y == 0 ? y + oftype(y, 1/6) : y
mustart(::Normal, y, wt) = y
mustart(::Poisson, y, wt) = y + oftype(y, 1/10)

Expand Down Expand Up @@ -352,6 +374,11 @@ function devresid(::Binomial, y, μ)
end
devresid(::Gamma, y, μ) = -2 * (log(y / μ) - (y - μ) / μ)
devresid(::InverseGaussian, y, μ) = abs2(y - μ) / (y * abs2(μ))
function devresid(d::NegativeBinomial, y, μ)
θ = d.r
v = 2 * (xlogy(y, y / μ) + xlogy(y + θ, (μ + θ)/(y + θ)))
return μ == 0 ? oftype(v, NaN) : v
end
devresid(::Normal, y, μ) = abs2(y - μ)
devresid(::Poisson, y, μ) = 2 * (xlogy(y, y / μ) - (y - μ))

Expand Down Expand Up @@ -391,3 +418,9 @@ loglik_obs(::Gamma, y, μ, wt, ϕ) = wt*logpdf(Gamma(inv(ϕ), μ*ϕ), y)
loglik_obs(::InverseGaussian, y, μ, wt, ϕ) = wt*logpdf(InverseGaussian(μ, inv(ϕ)), y)
loglik_obs(::Normal, y, μ, wt, ϕ) = wt*logpdf(Normal(μ, sqrt(ϕ)), y)
loglik_obs(::Poisson, y, μ, wt, ϕ) = wt*logpdf(Poisson(μ), y)
# We use the following parameterization for the Negative Binomial distribution:
# (Γ(θ+y) / (Γ(θ) * y!)) * μ^y * θ^θ / (μ+θ)^{θ+y}
# The parameterization of NegativeBinomial(r=θ, p) in Distributions.jl is
# Γ(θ+y) / (y! * Γ(θ)) * p^θ(1-p)^y
# Hence, p = θ/(μ+θ)
loglik_obs(d::NegativeBinomial, y, μ, wt, ϕ) = wt*logpdf(NegativeBinomial(d.r, d.r/+d.r)), y)
85 changes: 85 additions & 0 deletions src/negbinfit.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
function mle_for_θ(y::AbstractVector, μ::AbstractVector, wts::AbstractVector;
maxIter=30, convTol=1.e-6)
function first_derivative::Real)
tmp(yi, μi) = (yi+θ)/(μi+θ) + log(μi+θ) - 1 - log(θ) - digamma+yi) + digamma(θ)
unit_weights ? sum(tmp(yi, μi) for (yi, μi) in zip(y, μ)) :
sum(wti * tmp(yi, μi) for (wti, yi, μi) in zip(wts, y, μ))
end
function second_derivative::Real)
tmp(yi, μi) = -(yi+θ)/(μi+θ)^2 + 2/(μi+θ) - 1/θ - trigamma+yi) + trigamma(θ)
unit_weights ? sum(tmp(yi, μi) for (yi, μi) in zip(y, μ)) :
sum(wti * tmp(yi, μi) for (wti, yi, μi) in zip(wts, y, μ))
end

unit_weights = length(wts) == 0
if unit_weights
n = length(y)
θ = n / sum((yi/μi - 1)^2 for (yi, μi) in zip(y, μ))
else
n = sum(wts)
θ = n / sum(wti * (yi/μi - 1)^2 for (wti, yi, μi) in zip(wts, y, μ))
end
δ, converged = one(θ), false

for t = 1:maxIter
θ = abs(θ)
δ = first_derivative(θ) / second_derivative(θ)
if abs(δ) <= convTol
converged = true
break
end
θ = θ - δ
end
converged || throw(ConvergenceException(maxIter))
θ
end

function negbin(F, D, args...;
initialθ::Real=Inf, maxIter::Integer=30, convTol::Real=1.e-6,
verbose::Bool=false, kwargs...)
maxIter >= 1 || throw(ArgumentError("maxIter must be positive"))
convTol > 0 || throw(ArgumentError("convTol must be positive"))
initialθ > 0 || throw(ArgumentError("initialθ must be positive"))

# fit a Poisson regression model if the user does not specify an initial θ
if isinf(initialθ)
regmodel = glm(F, D, Poisson(), args...;
maxIter=maxIter, convTol=convTol, verbose=verbose, kwargs...)
else
regmodel = glm(F, D, NegativeBinomial(initialθ), args...;
maxIter=maxIter, convTol=convTol, verbose=verbose, kwargs...)
end

μ = regmodel.model.rr.mu
y = regmodel.model.rr.y
wts = regmodel.model.rr.wts
lw, ly = length(wts), length(y)
if lw != ly && lw != 0
throw(ArgumentError("length of wts must be either $ly or 0 but was $lw"))
end

θ = mle_for_θ(y, μ, wts)
d = sqrt(2 * max(1, deviance(regmodel)))
δ = one(θ)
ll = loglikelihood(regmodel)
ll0 = ll + 2 * d

converged = false
for i = 1:maxIter
if abs(ll0 - ll)/d + abs(δ) <= convTol
converged = true
break
end
verbose && println("[ Alternating iteration ", i, ", θ = ", θ, " ]")
regmodel = glm(F, D, NegativeBinomial(θ), args...;
maxIter=maxIter, convTol=convTol, verbose=verbose, kwargs...)
μ = regmodel.model.rr.mu
prevθ = θ
θ = mle_for_θ(y, μ, wts; maxIter=maxIter, convTol=convTol)
δ = prevθ - θ
ll0 = ll
ll = loglikelihood(regmodel)
end
converged || throw(ConvergenceException(maxIter))
regmodel
end
3 changes: 2 additions & 1 deletion test/REQUIRE
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
CategoricalArrays 0.3.0
Compat 0.36.0
DataFrames 0.11.0
CSV 0.2.0
CSV 0.2.0
RDatasets 0.4.0
Loading

0 comments on commit fc5baf3

Please sign in to comment.