
Add wts keyword to lm #341

Merged: 22 commits into JuliaStats:master on Jan 15, 2020

Conversation

@pdeffebach (Contributor)

Addresses #258. It definitely doesn't work right now, as I haven't been able to track down all the upstream functions that need to use the weights of a LmResp object.

@pdeffebach (Contributor, Author)

I spent more time on this tonight, trying to make lm seem more like glm. I realized that the GlmResp holds a lot of things that are necessary for the weights to work. I think we should just dispatch to glm(..., Normal(), wts = wts) when the user uses wts with lm rather than alter the LmResp object and algorithm to use weights.

Does that sound good?

@nalimilan (Member)

I'd rather keep lm always returning a LinearModel. Otherwise people who dispatch on it will get in trouble when using weights, ftest won't work, p-values won't be based on the t distribution, etc.

@pdeffebach (Contributor, Author)

Thanks for the feedback. That's a valid concern.

If you have time, could you verify that we would need to add more fields to the LmResp object to accommodate actually running OLS with weights? At the moment it's hard to imagine enabling LinearModel to work with weights without basically making it look exactly like GeneralizedLinearModel.

@nalimilan (Member)

Well LmResp already has a wts field, and other ones seem irrelevant (linked to the IRLS estimation algorithm). What would you need to add?

@pdeffebach (Contributor, Author)

After a decent amount of time inspecting functions to figure out what to change, I think I made the necessary changes.

julia> df = DataFrame(y = rand(10), x = rand(10), w = float.(rand(Bool, 10)))
julia> lm(@formula(y ~ x), df, wts = df.w)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x

Coefficients:
────────────────────────────────────────────────────────────────────────────
              Estimate  Std. Error   t value  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────────
(Intercept)   0.870868    0.447738   1.94504    0.1912   -1.05559    2.79733
x            -0.493512    0.902051  -0.5471     0.6392   -4.37472    3.3877 
────────────────────────────────────────────────────────────────────────────
julia> glm(@formula(y ~ x), df, Normal(), wts = df.w)
StatsModels.TableRegressionModel{GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Normal{Float64},IdentityLink},GLM.DensePredChol{Float64,LinearAlgebra.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

y ~ 1 + x

Coefficients:
──────────────────────────────────────────────────────────────────────────────
              Estimate  Std. Error   z value  Pr(>|z|)    Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)   0.870868    0.447738   1.94504    0.0518  -0.00668183    1.74842
x            -0.493512    0.902051  -0.5471     0.5843  -2.2615        1.27448
──────────────────────────────────────────────────────────────────────────────

Actually it looks like the confidence intervals are all off, which will require some tweaks. @nalimilan can you let me know if I'm heading in the right direction in terms of the way I changed the updatemu! method?

@nalimilan (Member) left a comment

Yes, looks OK AFAICT. Regarding the confidence intervals, better compare with what you get by replicating each row according to its weight, as by definition the results must be equivalent. You can also check with another package (e.g. Stata's fweights). It's hard to tell whether what you showed is correct, since for a small number of observations the F distribution will be very different from the Normal distribution I think (for larger samples it should converge towards it).
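The replication check described here can be sketched in plain Julia with toy data, no GLM.jl required: with integer frequency weights, weighted least squares must match an unweighted fit on the "long" dataset where row i is repeated wts[i] times.

```julia
using LinearAlgebra

# Toy check of the equivalence: weighted normal equations vs. an
# unweighted fit on row-replicated data. Data here is made up.
X = [ones(6) collect(1.0:6.0)]
y = [1.0, 2.2, 2.9, 4.1, 5.0, 6.2]
w = [1, 2, 1, 3, 1, 2]                        # frequency weights

W = Diagonal(float.(w))
beta_wls = (X' * W * X) \ (X' * W * y)        # weighted least squares

idx = reduce(vcat, [fill(i, w[i]) for i in eachindex(w)])
Xl, yl = X[idx, :], y[idx]                    # replicated ("long") data
beta_long = (Xl' * Xl) \ (Xl' * yl)

beta_wls ≈ beta_long                          # holds by definition
```

Coefficients agree exactly; as noted below, statistics that depend on the degrees of freedom (p-values, confidence intervals) only agree when n is taken as the sum of the weights.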

Manifest.toml Outdated
@@ -0,0 +1,232 @@
# This file is machine-generated - editing it directly is not advised
Member:

Remove this file.

src/lm.jl Outdated

LmResp(y::AbstractVector{T}) where T<:Real = LmResp(float(y))
LmResp(y::FPVector; wts::FPVector = similar(y, 0)) = LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y)
Member:

Make wts a positional argument for consistency with GlmResp.

Also, no spaces around = for arguments, and break lines at 92 chars.

Contributor (Author):

I didn't want to make wts a positional argument because there are 4 fields in LmResp and wts is like the 4th argument. So it seemed weird to have the arguments taken be different from the fields of the object.

Member:

Maybe but then GlmResp should be changed too... Not sure it's worth it.

src/lm.jl Outdated

function updateμ!(r::LmResp{V}, linPr::V) where V<:FPVector
LmResp(y::AbstractVector{T}; wts::AbstractVector{T} = similar(y, 0)) where T<:Real = LmResp(float(y), float(wts))
Member:

Suggested change
LmResp(y::AbstractVector{T}; wts::AbstractVector{T} = similar(y, 0)) where T<:Real = LmResp(float(y), float(wts))
LmResp(y::AbstractVector{<:Real}; wts::AbstractVector{<:Real}=similar(y, 0)) = LmResp(float(y), float(wts))

src/lm.jl Outdated
@@ -126,24 +127,30 @@ end
LinearAlgebra.cholesky(x::LinearModel) = cholesky(x.pp)

function StatsBase.fit!(obj::LinearModel)
installbeta!(delbeta!(obj.pp, obj.rr.y))
if length(obj.rr.wts) == 0
installbeta!(delbeta!(obj.pp, obj.rr.y))
Member:

installbeta!(obj.pp) can be moved outside of the if AFAICT.

src/lm.jl Outdated
allowrankdeficient::Bool=false)
fit!(LinearModel(LmResp(y), cholpred(X, allowrankdeficient)))
function fit(::Type{LinearModel}, X::AbstractMatrix, y::V,
allowrankdeficient::Bool=false; wts::V = similar(y, 0)) where {V<:FPVector}
Member:

Why make this signature more restrictive?

src/lm.jl Outdated
end

"""
lm(X, y, allowrankdeficient::Bool=false)
lm(X, y, allowrankdeficient::Bool=false; wts = nothing)
Member:

Suggested change
lm(X, y, allowrankdeficient::Bool=false; wts = nothing)
lm(X, y, allowrankdeficient::Bool=false; wts=similar(y, 0))

@nalimilan (Member)

Do you think this is ready? If so, could you add a test?

@pdeffebach (Contributor, Author)

No, it's not ready yet. I still have to figure out why the confidence intervals don't match, as well as things like nobs, which I think is in StatsBase.

@nalimilan (Member)

OK. FWIW, as I noted, I don't think confidence intervals should match with glm.

@pdeffebach (Contributor, Author)

Oh I see your point now. lm is more robust than glm with Normal() in that sense. I'll double check with R and Stata.

@nalimilan (Member)

You don't even need other software: just fit the same model without weights but with replicated rows. R doesn't actually support frequency weights so I don't think you can reproduce the p-values. With Stata you can with fweights.

@codecov-io

codecov-io commented Dec 23, 2019

Codecov Report

Merging #341 into master will increase coverage by 11.67%.
The diff coverage is 100%.


@@             Coverage Diff             @@
##           master     #341       +/-   ##
===========================================
+ Coverage   69.49%   81.17%   +11.67%     
===========================================
  Files           6        7        +1     
  Lines         531      632      +101     
===========================================
+ Hits          369      513      +144     
+ Misses        162      119       -43
Impacted Files Coverage Δ
src/lm.jl 95.12% <100%> (+22.89%) ⬆️
src/glmfit.jl 81.9% <100%> (+6.31%) ⬆️
src/negbinfit.jl 92.72% <0%> (-1.03%) ⬇️
src/GLM.jl 100% <0%> (ø)
src/ftest.jl 98.43% <0%> (+0.47%) ⬆️
src/linpred.jl 63.15% <0%> (+11.39%) ⬆️
src/glmtools.jl 69.29% <0%> (+21.67%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c97277b...f700891.

@pdeffebach (Contributor, Author)

This PR is ready for merging, with a caveat.

Currently issue #84 requires lm to support y and X having different types, while GLM doesn't, so it makes sense to allow the wts keyword argument to be AbstractVector{<:Real} rather than the same type as y. Also note that canonically frequency weights are integers, so there are statistical reasons for wts to be flexible.

Currently the constructors for GlmResp in the glm function are much stricter: wts must be the same type as y. So in my tests I have weights be a vector of floats, but in reality a vector of integers works as wts for lm but not for glm.

I propose merging this and then, in a separate PR, adding a new constructor method for GlmResp that allows more type flexibility.
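The looser constructor pattern described here (accept any Real eltype and convert, as this PR already does for LmResp) can be sketched with a hypothetical stand-in function; `respargs` is illustrative, not GLM.jl source:

```julia
# Accept any Real-eltype y and wts and convert both to floating point,
# mirroring the PR's LmResp(y, wts) = LmResp(float(y), float(wts)) pattern.
respargs(y::AbstractVector{<:Real}, wts::AbstractVector{<:Real}=similar(y, 0)) =
    (float(y), float(wts))

y, w = [1, 2, 3], [1, 2, 1]        # integer inputs are fine
yf, wf = respargs(y, w)
(eltype(yf), eltype(wf))           # (Float64, Float64)
```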

src/lm.jl Outdated

LmResp(y::AbstractVector{T}) where T<:Real = LmResp(float(y))
function LmResp(y::FPVector, wts::FPVector=similar(y, 0))
Member:

Why not keep short function syntax? Same below.

src/lm.jl Outdated

function updateμ!(r::LmResp{V}, linPr::V) where V<:FPVector
function updateμ!(r::LmResp{V}, linPr::V) where {V<:FPVector}
Member:

This isn't needed.

src/lm.jl Outdated (resolved)
src/lm.jl Outdated (resolved)
test/runtests.jl Outdated
@testset "lm with weights" begin
df = dataset("quantreg", "engel")
N = nrow(df)
df.weights = repeat(1.0:5.0, Int(N/5))
Member:

Probably not worth repeating this test every time: now that you know you got the correct values, just hardcode them in the test (just like we do above).

Also test other statistics like p-values, confidence intervals, R², deviance, log-likelihood, etc. (see examples in the file).

@nalimilan (Member)

I propose merging this and then in a separate PR adding a new constructor method for GlmResp which allows for more type flexibility.

Sure.

By the way, we could rename wts to weights, which is a much more explicit name. It would be nice to document clearly that these are frequency weights (in the docstrings and in the manual).

pdeffebach and others added 3 commits December 30, 2019 15:22
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
@pdeffebach
Copy link
Contributor Author

Thanks for requiring a test of the R^2: there is a bug. We should be using the weighted mean to calculate the nulldeviance of the model, that is, the sum of squared deviations from the weighted mean rather than from the un-weighted mean.

I don't have Stata open to check that this is what fweights does, but it is what lfe does in R. The following code shows that the R^2 from lfe with weights is equivalent to the R^2 of the "long" dataset.

library("quantreg")
library("tidyverse")
library("lfe")
data(engel)
df = as_tibble(engel)
df$weights = rep(1:5, nrow(df) / 5)
felm_model = felm(foodexp ~ income, data = df, weights = df$weights)  # assign the fit
r2 = summary(felm_model)$r.squared  # get r-squared
print(r2)
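The null-deviance fix can also be checked in plain Julia with toy numbers: the weighted sum of squared deviations about the weighted mean equals the unweighted computation on replicated rows.

```julia
using Statistics

# Null deviance with frequency weights: deviations are taken from the
# *weighted* mean of y. With integer weights this matches the plain
# computation on the "long" (row-replicated) data. Toy values.
y = [1.0, 2.0, 4.0, 8.0]
w = [1, 3, 2, 1]
mu_w = sum(w .* y) / sum(w)                  # weighted mean
nulldev = sum(w .* (y .- mu_w).^2)           # weighted null deviance

idx = reduce(vcat, [fill(i, w[i]) for i in eachindex(w)])
yl = y[idx]                                  # replicated data
nulldev ≈ sum((yl .- mean(yl)).^2)           # true
```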

@pdeffebach
Copy link
Contributor Author

As an addendum to the above post: I realized that lfe doesn't document the way weights are used when calculating the variance-covariance matrix. So I broke out Stata on my other computer and confirmed that Stata also uses the weighted mean when calculating the R^2.

I will go through and just use Stata to ensure everything is correct rather than mess with felm.

@pdeffebach
Copy link
Contributor Author

@nalimilan could you please confirm that the calculation of the loglikelihood is correct for lm with weights?

Stata assumes normality of errors when calculating the log-likelihood. This means that

reg foodexp income [fweight = weight]

gives you an e(ll) = -1445, which matches the loglikelihood from the Julia model

glm(@formula(foodexp ~ income), df, Normal(), wts = df.weight)

However the loglikelihood of

lm(@formula(foodexp ~ income), df, wts = df.weight)

is different, due to the implementation here. If you could expand on why it's written like that so we know it's right I'd appreciate it.

@nalimilan
Copy link
Member

Thanks for checking against Stata.

I don't remember the details of why I implemented loglikelihood that way for linear models, and that method doesn't seem to be covered by tests with weights, so maybe it's wrong. I'm not sure why the sum of weights appears in the formula separately from n: actually I would just replace n with the sum of weights, or just use n = nobs(m), which does the right thing whether there are weights or not. (Maybe I incorrectly took the formula for analytic weights...)

If you can't find the solution, you could also use the same approach as for GLMs, which should be a bit less efficient, but we don't care that much.
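The suggestion above amounts to the standard Gaussian MLE log-likelihood with n taken as the sum of the weights; a plain-Julia check with made-up residuals shows it matches the replicated-rows version exactly:

```julia
# Gaussian log-likelihood with frequency weights: n = sum(wts) and the
# deviance is the weighted residual sum of squares, giving
#   ll = -n/2 * (log(2π * deviance / n) + 1).
r = [0.5, -1.0, 0.25, 0.75]                  # toy residuals
w = [2, 1, 3, 1]
n = sum(w)
dev = sum(w .* r.^2)
ll_w = -n/2 * (log(2π * dev / n) + 1)

idx = reduce(vcat, [fill(i, w[i]) for i in eachindex(w)])
rl = r[idx]                                  # replicated residuals
ll_long = -length(rl)/2 * (log(2π * sum(rl.^2) / length(rl)) + 1)
ll_w ≈ ll_long                               # true
```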

@pdeffebach (Contributor, Author)

I didn't end up figuring out why the log-likelihood was implemented the way it was; it didn't match aweights either. I implemented the glm version for lm and made sure it matched Stata.

This is a weird PR because there exists a lot of code for lm with weights that has never been exercised, since lm has never supported weights until now. For some reason the residuals function for lm multiplied each residual (i.e. x_i' * beta - y_i) by its weight, which is wrong and I have now fixed.
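The residuals bug can be seen with a toy computation: frequency weights should enter weighted sums of squares once, as Σ wᵢ rᵢ², which matches the plain RSS on replicated rows; pre-scaling each residual gives Σ (wᵢ rᵢ)², which does not.

```julia
y = [1.0, 3.0, 2.0]; yhat = [1.5, 2.0, 2.5]; w = [3, 1, 2]
r = y .- yhat                                 # unscaled residuals (correct)

idx = reduce(vcat, [fill(i, w[i]) for i in eachindex(w)])
sum(w .* r.^2) ≈ sum(r[idx].^2)               # true: weight enters the sum once
sum((w .* r).^2) ≈ sum(r[idx].^2)             # false: pre-scaling double-counts
```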

@nalimilan (Member)

Indeed, multiplying the residual for each observation by its weight doesn't make sense.

src/lm.jl Outdated

LmResp(y::AbstractVector{T}) where T<:Real = LmResp(float(y))
LmResp(y::FPVector, wts::FPVector=similar(y, 0)) = LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y)
Member:

Suggested change
LmResp(y::FPVector, wts::FPVector=similar(y, 0)) = LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y)
LmResp(y::FPVector, wts::FPVector=similar(y, 0)) =
LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y)

src/lm.jl Outdated
LmResp(y::AbstractVector{T}) where T<:Real = LmResp(float(y))
LmResp(y::FPVector, wts::FPVector=similar(y, 0)) = LmResp{typeof(y)}(fill!(similar(y), 0), similar(y, 0), wts, y)

LmResp(y::AbstractVector{<:Real}, wts::AbstractVector{<:Real}=similar(y, 0)) = LmResp(float(y), float(wts))
Member:

Suggested change
LmResp(y::AbstractVector{<:Real}, wts::AbstractVector{<:Real}=similar(y, 0)) = LmResp(float(y), float(wts))
LmResp(y::AbstractVector{<:Real}, wts::AbstractVector{<:Real}=similar(y, 0)) =
LmResp(float(y), float(wts))

src/lm.jl Outdated
sw += log(w)
y = r.y
if isempty(wts)
@show mu = mean(y)
Member:

Suggested change
@show mu = mean(y)
mu = mean(y)

src/lm.jl Outdated
ll
end

function residuals(r::LmResp)
Member:

Make this a one-liner?

src/lm.jl Outdated
@@ -126,24 +145,31 @@ end
LinearAlgebra.cholesky(x::LinearModel) = cholesky(x.pp)

function StatsBase.fit!(obj::LinearModel)
installbeta!(delbeta!(obj.pp, obj.rr.y))
if length(obj.rr.wts) == 0
Member:

Suggested change
if length(obj.rr.wts) == 0
if isempty(obj.rr.wts)

src/lm.jl Outdated
"""
lm(X, y, allowrankdeficient::Bool=false) = fit(LinearModel, X, y, allowrankdeficient)
lm(X, y, allowrankdeficient::Bool=false; kwargs...) = fit(LinearModel, X, y, allowrankdeficient; kwargs...)
Member:

Suggested change
lm(X, y, allowrankdeficient::Bool=false; kwargs...) = fit(LinearModel, X, y, allowrankdeficient; kwargs...)
lm(X, y, allowrankdeficient::Bool=false; kwargs...) =
fit(LinearModel, X, y, allowrankdeficient; kwargs...)

test/runtests.jl Outdated
@@ -46,6 +46,26 @@ linreg(x::AbstractVecOrMat, y::AbstractVector) = qr!(simplemm(x)) \ y
@test isapprox(coef(lm1), coef(lm2) .* [1., 10.])
end

@testset "weights" begin
Member:

Suggested change
@testset "weights" begin
@testset "linear model with weights" begin

test/runtests.jl (resolved)
src/lm.jl (resolved)
@pdeffebach (Contributor, Author)

I can't find any reference to Pseudo-R^2. Is that something that lives in GLM? Not sure how to test it.

@nalimilan (Member)

@pdeffebach (Contributor, Author)

r2 doesn't work for GLM, unfortunately. It says mss is not defined for Generalized.... It should be a one-liner to implement, though, the way it is for lm.

@nalimilan (Member)

nalimilan commented Jan 5, 2020

I mean r²(m, :McFadden). EDIT: sorry, it used to work, but it doesn't since it moved from using nulldeviance to using nullloglikelihood (which is correct). So we need to implement that.
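For reference, McFadden's pseudo-R² is defined from exactly the two quantities mentioned here, the model and null log-likelihoods; a one-line sketch with made-up values:

```julia
# McFadden's pseudo-R²: 1 - loglikelihood(m) / nullloglikelihood(m)
mcfadden(ll, ll0) = 1 - ll / ll0
mcfadden(-100.0, -150.0)   # ≈ 0.3333
```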

@nalimilan (Member) left a comment

Thanks!

@nalimilan nalimilan merged commit c3fc336 into JuliaStats:master Jan 15, 2020
@getzze

getzze commented Jan 16, 2020

The definition of loglikelihood (and nullloglikelihood) seems incorrect to me. As defined in StatsBase (https://juliastats.org/StatsBase.jl/latest/statmodels/#StatsBase.deviance), the deviance is 2*(l_S - l), where l is the loglikelihood of the model and l_S the loglikelihood of the saturated model, that is often 0. So I would expect that the loglikelihood is proportional to the deviance.

Also for GLMs, the definition of the loglikelihood is a bit confusing to me, I don't get why the deviance is used as an estimate of the second parameter of the distribution. Is the deviance supposed to be an estimate of the variance?

[EDIT] Sorry, I just realized that this was not introduced by this PR, I will open an issue. #356
