Draft of fweights, pweights, aweights for glms #194
Conversation
@@ -116,6 +116,8 @@ Base.cholfact!{T}(p::SparsePredChol{T}) = p.chol

invchol(x::DensePred) = inv(cholfact!(x))
invchol(x::SparsePredChol) = cholfact!(x) \ eye(size(x.X, 2))
# TODO: support fweights, pweights, and aweights
# This vcov represents σ^2 X'X
vcov(x::LinPredModel) = scale!(invchol(x.pp), dispersion(x, true))
This needs to be updated to invchol(x.pp * sqrt(wts))
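A brief note on why scaling by sqrt(wts) gives the right quantity (standard weighted least squares algebra, not spelled out in the thread): with a diagonal weight matrix W, multiplying each row of X by the square root of its weight turns the plain cross-product into the weighted one, so the Cholesky-based inverse then yields the weighted covariance:

```latex
\operatorname{vcov}(\hat\beta) = \hat\sigma^{2} (X^\top W X)^{-1},
\qquad X^\top W X = (W^{1/2} X)^\top (W^{1/2} X),
\qquad W = \operatorname{diag}(w_1, \dots, w_n)
```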
Need to check that dispersion(x, true) will use nobs, which will use the right behavior for fweights vs pweights.
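To make the fweights-vs-pweights distinction concrete, here is a toy illustration (dispersion_est is a made-up stand-in, not the actual GLM.jl function): the dispersion estimate divides a residual sum of squares by a degrees-of-freedom term derived from nobs, so the two weight types produce different denominators.

```julia
# Hypothetical stand-in for the dispersion computation, only to show why nobs matters.
dispersion_est(rss, n, p) = rss / (n - p)

rss, p = 10.0, 2
w = [2, 3, 1]                        # three rows of data with weights

dispersion_est(rss, sum(w), p)       # fweights: nobs = sum(w) = 6, giving 10/4
dispersion_est(rss, length(w), p)    # pweights: nobs = number of rows = 3, giving 10/1
```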
"`wrkwt`: working case weights for the Iteratively Reweighted Least Squares (IRLS) algorithm" | ||
wrkwt::V | ||
"`wrkresid`: working residuals for IRLS" | ||
wrkresid::V | ||
end | ||
|
||
function GlmResp{V<:FPVector, D, L}(y::V, d::D, l::L, η::V, μ::V, off::V, wts::V) | ||
function GlmResp{V<:FPVector, D, L}(y::V, d::D, l::L, η::V, μ::V, off::V, wts::V, |
Maybe the constructor doesn't need a wts anymore, just fweights, pweights and aweights
Just some preliminary comments, mostly minor style things.
n = length(y)
length(mu) == n || error("mismatched lengths of mu and y")
ll = length(off)
ll == 0 || ll == n || error("length of offset is $ll, must be $n or 0")
ll = length(wts)
ll == 0 || ll == n || error("length of wts is $ll, must be $n or 0")
new{V}(mu, off, wts, y)
ll = length(fweights)
ll == 0 || ll == n || error("length of fweights is $ll, must be $n or 0")
In general it's better to use typed exceptions whenever possible rather than error. In this case I think throw(ArgumentError("...")) would be appropriate.
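A minimal sketch of that typed-exception style applied to one of the length checks above (the helper name check_weight_length is invented for illustration):

```julia
# Hypothetical helper; the point is throw(ArgumentError(...)) instead of error(...).
function check_weight_length(fweights::AbstractVector, n::Integer)
    ll = length(fweights)
    ll == 0 || ll == n ||
        throw(ArgumentError("length of fweights is $ll, must be $n or 0"))
    return nothing
end
```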
nf = length(fweights)
np = length(pweights)
na = length(aweights)
if (nf != 0 && np == 0 && na == 0)
ifs in Julia don't need outer parentheses, and indeed that style is generally discouraged.
@@ -98,3 +98,22 @@ loglik_obs(::Binomial, y, μ, wt, ϕ) = logpdf(Binomial(Int(wt), μ), Int(y*wt))
loglik_obs(::Gamma, y, μ, wt, ϕ) = wt*logpdf(Gamma(1/ϕ, μ*ϕ), y)
loglik_obs(::Normal, y, μ, wt, ϕ) = wt*logpdf(Normal(μ, sqrt(ϕ)), y)
loglik_obs(::Poisson, y, μ, wt, ϕ) = wt*logpdf(Poisson(μ), y)

# TODO: combine fweights, pweights, aweights
function normalizeWeights(fweights::FrequencyWeights, pweights::ProbabilityWeights,
Camel case isn't used in Base Julia and the JuliaStats packages tend to match the Base style as closely as possible. For function names, typically it's all lowercase with no separation if the name isn't too long, otherwise snake case. For this function, I'd probably go with normalize_weights.
np = length(pweights)
na = length(aweights)
if (nf != 0 && np == 0 && na == 0)
    return fweights |> collect
Another minor stylistic point, but it's generally better to use standard calling conventions rather than piping. In this case you could actually just assign the result of the conditional to a variable (e.g. result = if ...) then return collect(result).
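A rough sketch of what that could look like for the function in this diff (illustrative only, using the nf/np/na length variables from above and the snake-case name suggested earlier):

```julia
function normalize_weights(fweights, pweights, aweights)
    nf, np, na = length(fweights), length(pweights), length(aweights)
    # Assign the result of the `if` expression, then call collect() directly.
    result = if nf != 0 && np == 0 && na == 0
        fweights
    elseif nf != 0 && np != 0
        fweights .* pweights
    else
        Float64[]           # no weights supplied: fall back to an empty vector
    end
    return collect(result)
end
```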
@@ -150,10 +152,11 @@ For linear and generalized linear models, returns the number of rows, or,
when prior weights are specified, the sum of weights.
"""
function nobs(obj::LinPredModel)
    if isempty(obj.rr.wts)
    if (length(obj.rr.fweights) == 0)
Is isempty not defined for the weight objects in StatsBase?
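For what it's worth, the StatsBase weight types subtype AbstractVector, so the generic isempty from the AbstractArray interface should apply to them; a quick check (assuming a StatsBase version that exports the lowercase fweights constructor):

```julia
using StatsBase

isempty(fweights(Int[]))       # true: empty frequency weights
isempty(fweights([2, 3, 1]))   # false
```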
end
end
convert{V<:FPVector}(::Type{LmResp{V}}, y::V) =
    LmResp{V}(zeros(y), similar(y, 0), similar(y, 0), y)
    LmResp{V}(zeros(y), similar(y, 0), similar(y, 0), y,
        similar(y, 0), similar(y, 0), similar(y, 0))
Should be aligned vertically with the arguments above it
type LmResp{V<:FPVector} <: ModResp  # response in a linear model
    mu::V                            # mean response
    offset::V                        # offset added to linear predictor (may have length 0)
    wts::V                           # prior weights (may have length 0)
    fweights::FrequencyWeights
    pweights::ProbabilityWeights
    aweights::AnalyticWeights
I'm not sure we should carry around three different types of weights in the response object. Perhaps just a single AbstractWeights?
Yes, we should store the weights internally as a single vector (maybe keep the existing wts field), and only keep track of the necessary additional information like the actual number of observations (which is either the sum of weights or the vector length depending on the weights type).
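One possible shape for that, sketched with an invented helper name (weights_nobs is not an existing GLM.jl or StatsBase function): keep a single weights field and let the observation count dispatch on the weights type.

```julia
using StatsBase: AbstractWeights, FrequencyWeights

# Hypothetical helper: the effective number of observations for a given weights vector.
weights_nobs(wts::FrequencyWeights, n::Integer) = sum(wts)   # weights count repeated observations
weights_nobs(wts::AbstractWeights, n::Integer)  = n          # pweights/aweights keep the row count
weights_nobs(wts::AbstractVector, n::Integer)   = isempty(wts) ? n : sum(wts)  # legacy plain wts
```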
@@ -14,6 +14,7 @@ module GLM
import Base: (\), cholfact, convert, cor, show, size
import StatsBase: coef, coeftable, confint, deviance, nulldeviance, dof, dof_residual, loglikelihood, nullloglikelihood, nobs, stderr, vcov, residuals, predict, fit, model_response, r2, r², adjr2, adjr², PValue
import StatsFuns: xlogy
import StatsBase: FrequencyWeights, ProbabilityWeights, AnalyticWeights
You'll need to bump the lower bound on StatsBase in the REQUIRE file to the version where these types were introduced.
Thanks!
I wonder how useful the ability to pass multiple weight vectors and have GLM combine them automatically really is. Is that a common need? Couldn't we require users to do that manually first (and maybe provide convenience functions for that in StatsBase)? Are there examples of existing software which support this? I'm asking because it will make the code more complex and will affect the APIs, so we should evaluate the benefits.
elseif (nf != 0 && np != 0)
    return (fweights .* pweights) |> collect
else
    return zeros(Float64, 0)
We should probably handle this more efficiently by returning an empty vector. The existing code uses that trick.
Actually the empty vector gets overridden at some point by ones(n). I haven't checked exactly where or why it was done. I suspect it was so that a quantity could be normalized by sum(weights).
OK. Looks like a good use case for a UnitWeights or Ones type which would implement the AbstractArray interface and always return 1 (JuliaStats/StatsBase.jl#135). But that's not a requirement for a first implementation.
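For illustration, a bare-bones version of what such a type could look like (written in current Julia syntax rather than the 0.6-era syntax used in this diff, and not taken from StatsBase):

```julia
# A length-n vector whose elements are all one(T), without storing them.
struct Ones{T<:Real} <: AbstractVector{T}
    len::Int
end
Ones(n::Integer) = Ones{Float64}(n)

Base.size(o::Ones) = (o.len,)
Base.getindex(o::Ones{T}, i::Int) where {T} = one(T)   # bounds checks omitted for brevity
Base.sum(o::Ones{T}) where {T} = o.len * one(T)        # keeps sum(weights) cheap

w = Ones(5)
sum(w)        # 5.0
collect(w)    # [1.0, 1.0, 1.0, 1.0, 1.0]
```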
I think if we only supported Frequency Weights and Probability Weights this PR would be a lot easier for me. I am completely unfamiliar with Analytic Weights, and how they affect formulas for covariance of the glm parameters. It would also mean that we could easily simplify the API by only passing in one vector of weights.
Retrospectively, by limiting the scope to only Fweights and Pweights, I think we only needed a way to define
Sure, let's go with the simpler version for now. I don't know of other software accepting multiple types of weights at the same time, and we can always add support for analytic weights later.
@jeffwong about your comment on Analytic Weights: They represent multiple observations, like fweights, but instead of representing multiple identical obs., they represent distinct observations that were averaged out. This happens with variables such as "average grade by classroom", "accident rate per state", etc. Since groups with more individuals have less noisy means, aweights lead to heteroskedasticity. That said, on linear models pweights are just "aweights with robust standard errors", so if the code supports pweights then aweights come for free.
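To spell out the heteroskedasticity point with the standard textbook identity (not from this thread): if observation i is the average of w_i underlying measurements with common variance σ², its variance shrinks with w_i, which is exactly what analytic weights encode.

```latex
y_i = \frac{1}{w_i} \sum_{j=1}^{w_i} y_{ij},
\qquad
\operatorname{Var}(y_i) = \frac{\sigma^2}{w_i}
```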
What's the status on this?
Superseded by #487.
This is a code outline for #186. I am new to contributing to Julia, so please feel free to leave feedback on how this feature should be implemented.