Taking weighting seriously #487

gragusa · 2022-07-15T16:07:11Z

This PR addresses several problems with the current GLM implementation.

Current status
In master, GLM/LM only accepts weights through the keyword wts. These weights are implicitly frequency weights.

With this PR
FrequencyWeights, AnalyticWeights, and ProbabilityWeights are possible. The API is the following

## Frequency Weights
lm(@formula(y~x), df; wts=fweights(df.wts)
## Analytic Weights
lm(@formula(y~x), df; wts=aweights(df.wts)
## ProbabilityWeights
lm(@formula(y~x), df; wts=pweights(df.wts)

The old behavior -- passing a vector wts=df.wts is deprecated and for the moment, the array os coerced df.wts to FrequencyWeights.

To allow dispatching on the weights, CholPred takes a parameter T<:AbstractWeights. The unweighted LM/GLM has UnitWeights as the parameter for the type.

This PR also implements residuals(r::RegressionModel; weighted::Bool=false) and modelmatrix(r::RegressionModel; weighted::Bool = false). The new signature for these two methods is pending in StatsApi.

There are many changes that I had to make to make everything work. Tests are passing, but some new feature needs new tests. Before implementing them, I wanted to ensure that the approach taken was liked.

I have also implemented momentmatrix, which returns the estimating function of the estimator. I arrived to the conclusion that it does not make sense to have a keyword argument weighted. Thus I will amend JuliaStats/StatsAPI.jl#16 to remove such a keyword from the signature.

Update

I think I covered all the suggestions/comments with this exception as I have to think about it. Maybe this can be addressed later. The new standard errors (the one for ProbabilityWeights) also work in the rank deficient case (and so does cooksdistance).

Tests are passing and I think they cover everything that I have implemented. Also, added a section in the documentation about using Weights and updated jldoc with the new signature of CholeskyPivoted.

To do:

Deal with weighted standard errors with rank deficient designs
Document the new API
Improve testing

Closes #186.

…liaStats-master

codecov-commenter · 2022-07-16T08:43:43Z

Codecov Report

Patch coverage: 84.08% and project coverage change: -2.67 ⚠️

Comparison is base (c13577e) 90.48% compared to head (fa63a9a) 87.82%.

❗ Current head fa63a9a differs from pull request most recent head 807731a. Consider uploading reports for the commit 807731a to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
- Coverage   90.48%   87.82%   -2.67%     
==========================================
  Files           8        8              
  Lines        1125     1191      +66     
==========================================
+ Hits         1018     1046      +28     
- Misses        107      145      +38

Impacted Files	Coverage Δ
src/GLM.jl	`50.00% <ø> (-10.00%)`	⬇️
src/glmfit.jl	`81.41% <79.80%> (-0.51%)`	⬇️
src/lm.jl	`89.06% <82.35%> (-5.35%)`	⬇️
src/glmtools.jl	`93.49% <85.71%> (-0.96%)`	⬇️
src/linpred.jl	`91.41% <90.32%> (-6.92%)`	⬇️

... and 1 file with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

lrnv · 2022-07-20T07:45:33Z

Hey,

Would that fix the issue I am having, which is that if rows of the data contains missing values, GLM discard those rows, but does not discard the corresponding values of df.weights and then yells that there are too many weights ?

I think the interfacing should allow for a DataFrame input of weights, that would take care of such things (like it does for the other variables).

gragusa · 2022-07-20T17:14:41Z

Would that fix the issue I am having, which is that if rows of the data contains missing values, GLM discard those rows, but does not discard the corresponding values of df.weights and then yells that there are too many weights ?

not really. But it would be easy to make this a feature. But before digging further on this I would like to know whether there is consensus on the approach of this PR.

alecloudenback · 2022-08-14T19:14:57Z

FYI this appears to fix #420; a PR was started in #432 and the author closed for lack of time on their part to investigate CI failures.

Here's the test case pulled from #432 which passes with the in #487.

@testset "collinearity and weights" begin
    rng = StableRNG(1234321)
    x1 = randn(100)
    x1_2 = 3 * x1
    x2 = 10 * randn(100)
    x2_2 = -2.4 * x2
    y = 1 .+ randn() * x1 + randn() * x2 + 2 * randn(100)
    df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5],50))
    f = @formula(y ~ x1 + x2 + x3 + x4)
    lm_model = lm(f, df, wts = df.weights)#, dropcollinear = true)
    X = [ones(length(y)) x1_2 x2_2]
    W = Diagonal(df.weights)
    coef_naive = (X'W*X)\X'W*y
    @test lm_model.model.pp.chol isa CholeskyPivoted
    @test rank(lm_model.model.pp.chol) == 3
    @test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end

Can this test set be added?

Is there any other feedback for @gragusa ? It would be great to get this merged if good to go.

nalimilan · 2022-08-28T18:27:50Z

Sorry for the long delay, I hadn't realized you were waiting for feedback. Looks great overall, please feel free to finish it! I'll try to find the time to make more specific comments.

nalimilan

I've read the code. Lots of comments, but all of these are minor. The main one is mostly stylistic: in most cases it seems that using if wts isa UnitWeights inside a single method (like the current structure) gives simpler code than defining several methods. Otherwise the PR looks really clean!

What are you thoughts regarding testing? There are a lot of combinations to test and it's not easy to see how to integrate that into the current organization of tests. One way would be to add code for each kind of test to each @testset that checks a given model family (or a particular case, like collinear variables). There's also the issue of testing the QR factorization, which isn't used by default.

src/GLM.jl

src/glmfit.jl

src/lm.jl

test/runtests.jl

bkamins · 2022-08-31T08:49:28Z

A very nice PR. In the tests can we have some test set that compares the results of aweights, fweights, and pweights for the same set of data (coeffs, predictions, covariance matrix of the estimates, p-values etc.).

nalimilan · 2022-11-20T12:02:10Z

CI failures on Julia 1.0 can be fixed by requiring Julia 1.6 (more and more packages have started doing that).

alecloudenback · 2022-11-22T17:25:20Z

Sorry for the noise, but thank you @gragusa and reviewers for this big PR. As a user I've been watching for weighting for a while and appreciate the technical expertise and dedication to quality here.

gragusa · 2022-11-22T19:29:54Z

@nalimilan let’s give this a final push. Should I rebase this PR against #339 ? (rhetorical question!) What’s the most efficient way?

nalimilan · 2022-11-23T18:41:45Z

Yes the PR needs to be rebased against master -- or, simpler, merge master into the branch. Most conflicts seem relatively simple to resolve. You can try doing this online on GitHub, though there's always a chance that it won't be 100% correct the first time. Otherwise you can do that locally with git fetch; git merge origin/master. Or I can do it in a few days if you want.

gragusa · 2022-11-23T18:43:02Z

Don’t worry .. I am already on it. Sent from Outlook for iOS<https://aka.ms/o0ukef>

…

________________________________ From: Milan Bouchet-Valat ***@***.***> Sent: Wednesday, November 23, 2022 7:41:56 PM To: JuliaStats/GLM.jl ***@***.***> Cc: Giuseppe Ragusa ***@***.***>; Mention ***@***.***> Subject: Re: [JuliaStats/GLM.jl] Taking weighting seriously (PR #487) Yes the PR needs to be rebased against master -- or, simpler, merge master into the branch. Most conflicts seem relatively simple to resolve. You can try doing this online on GitHub, though there's always a chance that it won't be 100% correct the first time. Otherwise you can do that locally with git fetch; git merge origin/master. Or I can do it in a few days if you want. — Reply to this email directly, view it on GitHub<#487 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAD5DAUXTDGDDJXXCRUZLHTWJZQPJANCNFSM53WBWMMQ>. You are receiving this because you were mentioned.Message ID: ***@***.***>

nalimilan

Thanks for rebasing! I have more comments, as @bkamins had made a few ones above too.

src/lm.jl

test/runtests.jl

nalimilan · 2022-11-20T12:13:39Z

test/runtests.jl

+                      1.8686815106332157 0.0 0.0 0.0 1.8686815106332157; 
+                      0.010149793505874801 0.010149793505874801 0.0 0.0 0.010149793505874801; 
+                     -1.8788313148033928 -0.0 -1.8788313148033928 -0.0 -1.8788313148033928]
+        @test mm0_pois ≈  GLM.momentmatrix(gm_pois) atol=1e-06


Remove double space here and elsewhere.

nalimilan · 2022-11-20T12:14:30Z

test/runtests.jl

+        f = @formula(admit ~ 1 + rank)        
+        gm_bin = fit(GeneralizedLinearModel, f, admit_agr, Binomial(); rtol=1e-8)
+        gm_binw = fit(GeneralizedLinearModel, f, admit_agr, Binomial(), 
+                      wts=aweights(admit_agr.count); rtol=1e-08)


Any reason to use analytic weights rather than frequency weights? Here I think the latter make more sense for this dataset.

src/lm.jl

nalimilan · 2022-12-24T18:13:46Z

docs/src/index.md

+- `FrequencyWeights` describe the inverse of the sampling probability for each observation,
+  providing a correction mechanism for under- or over-sampling certain population groups.
+  These weights may also be referred to as sampling weights.
+- `ProbabilityWeights` describe how the sample can be scaled back to the population.
+  Usually are the reciprocals of sampling probabilities.


Let's use the same wording as in StatsBase for simplicity. If we want to improve it, we'll change it everywhere.

Suggested change

- `FrequencyWeights` describe the inverse of the sampling probability for each observation,

providing a correction mechanism for under- or over-sampling certain population groups.

These weights may also be referred to as sampling weights.

- `ProbabilityWeights` describe how the sample can be scaled back to the population.

Usually are the reciprocals of sampling probabilities.

- `FrequencyWeights` describe the number of times (or frequency) each observation was seen.

These weights may also be referred to as case weights or repeat weights.

- `ProbabilityWeights` represent the inverse of the sampling probability for each observation,

providing a correction mechanism for under- or over-sampling certain population groups.

These weights may also be referred to as sampling weights.

nalimilan · 2022-12-24T18:17:18Z

src/GLM.jl

           fitted, fit, fit!, model_response, response, modelmatrix, r2, r², adjr2, adjr²,
-           cooksdistance, hasintercept, dispersion
+           cooksdistance, hasintercept, dispersion, weights, AnalyticWeights, ProbabilityWeights, FrequencyWeights, 
+           UnitWeights, uweights, fweights, pweights, aweights, leverage



Add the description of weights types to COMMON_FIT_KWARGS_DOCS below.

jeremiedb · 2023-03-01T04:21:34Z

@nalimilan were there remaining fixes to have this PR completed? I was worried to have the important work brought by this PR loose its momentum.

gragusa · 2023-03-01T09:02:39Z

Mostly time 🤣 I think there few things to fix (addressing all the comments of @nalimilan) and making few decisions … I could Find some time in the next week to finish it if @nalimilan has some time to support me. Sent from Outlook for iOS<https://aka.ms/o0ukef>

…

________________________________ From: jeremiedb ***@***.***> Sent: Wednesday, March 1, 2023 5:21:44 AM To: JuliaStats/GLM.jl ***@***.***> Cc: Giuseppe Ragusa ***@***.***>; Mention ***@***.***> Subject: Re: [JuliaStats/GLM.jl] Taking weighting seriously (PR #487) @nalimilan<https://github.com/nalimilan> were there remaining fixes to have this PR completed? I was worried to have the important work brought by this PR loose its momentum. — Reply to this email directly, view it on GitHub<#487 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAD5DAXJBMS64DMYJSZ573TWZ3FFRANCNFSM53WBWMMQ>. You are receiving this because you were mentioned.Message ID: ***@***.***>

nalimilan · 2023-03-01T12:43:57Z

Sure !

bkamins · 2023-03-05T14:07:45Z

While we are at weights my question is if we should not update the ftest implementation also? The issue is that currently ftest assumes that nobs and dof are integer (which does not have to hold with weights).

ParadaCarleton · 2023-03-05T14:42:04Z

While we are at weights my question is if we should not update the ftest implementation also? The issue is that currently ftest assumes that nobs and dof are integer (which does not have to hold with weights).

ftest, in its original form, won’t work here, I think. Mentioning @smishr who might know more about the specifics here.

bkamins · 2023-03-05T15:26:07Z

ftest, in its original form, won’t work here

I agree that we would need to carefully consider all cases of weights. I have not thought about probabilistic weights. However, for frequency weights and analytical weights, assuming we produce correct deviance and dof_residual things should be correct.

I have just checked against examples in Wooldridge, chapter 8 and properly scaled analytical weights produce correct result.

Also, in general, I think we should ensure that every function in GLM.jl that accepts model estimated with weighting should either:

error if it does not produce a correct result, or
work correctly

(it does not have to be in this PR, but if we are taking weighting seriously, I think we should ensure this property when we make a release)

Thank you for working on this!

smishr · 2023-03-31T17:37:15Z

In survey datasets weights are commonly calibrated to sum upto an (integral) population size.

While we are at weights my question is if we should not update the ftest implementation also? The issue is that currently ftest assumes that nobs and dof are integer (which does not have to hold with weights).

While most applications of F-test have integral dof, the F distribution is continuous and well-defined for non-integral values of dof.

In R:

> df(1.2, df1 = 10, df2 = 20)
[1] 0.5626125
> df(1.2, df1 = 10, df2 = 20.1)
[1] 0.5630353

Julia and R agree

julia> using Distributions

julia> d = FDist(10, 20)
FDist{Float64}(ν1=10.0, ν2=20.0)

julia> pdf(d, 1.2)
0.5626124566227022

julia> d = FDist(10, 20.1)
FDist{Float64}(ν1=10.0, ν2=20.1)

julia> pdf(d, 1.2)
0.5630352744353205

There is this StackExchange post discussing non-integral dof for t-tests and for GAMs in this post.

ftest, in its original form, won’t work here, I think. Mentioning @smishr who might know more about the specifics here.

The F-test is essentially the ratio of two variances. For the weighted GLM case, variances based on the weighted Least Squares could be used to calculate test statistic.

Note: whether doing an (adjusted) F-Test for comparing weighted GLM models is the right approach, that is up for debate...

ParadaCarleton · 2023-04-01T01:35:01Z

Hmm, did any of the people who worked on Survey.jl leave comments here? @iuliadmtru @aviks

…-master

gragusa · 2023-06-16T10:14:39Z

I finally found the time to rebase this PR against the latest main repository. Tests pass locally; let's see whether they pass on the CI.

I have a few days of "free" time and would like to finish this. @nalimilan It is difficult to track the comments and which ones were addressed by the various commit. On my side, the primary decision is about weight scaling. But before engaging in a conversation, I will add documentation so that whoever will contribute to the discussion can do it coherently.

Test passed!

nalimilan · 2023-06-24T19:57:44Z

Cool. Do you need any input from my side?

SamuelMathieu-code · 2024-01-31T16:37:38Z

Hi there! I wonder what will happen to this PR? As I understand, one review from a person with write access is needed?

gragusa · 2024-02-12T15:43:44Z

Just wanted to give a quick update on the PR. The PR was almost ready to go, but now, with more PR being merged, there are a few things that need to be straightened out. I should be able to work on it again next week to make sure everything's in good shape. Then I hope somebody will help get this merged. From: Samuel Mathieu ***@***.***> Date: Wednesday, January 31, 2024 at 17:37 To: JuliaStats/GLM.jl ***@***.***> Cc: Giuseppe Ragusa ***@***.***>, Mention ***@***.***> Subject: Re: [JuliaStats/GLM.jl] Taking weighting seriously (PR #487) Hi there! I wonder what will happen to this PR? As I understand, one review from a person with write access is needed? — Reply to this email directly, view it on GitHub<#487 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAD5DAVWZZIQDXOS5AQHBSLYRJXN5AVCNFSM53WBWMM2U5DIOJSWCZC7NNSXTN2JONZXKZKDN5WW2ZLOOQ5TCOJRHE2DOOJYGM4A>. You are receiving this because you were mentioned.Message ID: ***@***.***>

gragusa added 20 commits June 10, 2022 20:53

WIP

1754cbd

WIP

1d778a5

WIP

12121a3

Taking weights seriously

4363ba4

WIP

ca702dc

Taking weights seriously

e2b2d12

Merge branch 'master' of https://github.com/JuliaStats/GLM.jl into Ju…

bc8709a

…liaStats-master

Add depwarn for passing wts with Vector

84cd990

Cosmettic changes

cbc329f

WIP

23d67f5

Fix loglik for weighted models

f4d90a9

Fix remaining issues

6b7d95c

Final commit

c236b82

Merge branch 'master'

d4bd0c2

Fix merge

8bdfb55

Fix nulldeviance

3eb2ca4

Bypass crossmodelmatrix drom StatsAPI

63c8358

Delete momentmatrix.jl

e93a919

Delete scratch.jl

7bb0959

Delete settings.json

ded17a8

ararslan requested review from andreasnoack and nalimilan August 15, 2022 19:54

nalimilan mentioned this pull request Aug 28, 2022

Fixed linear model with perfectly collinear rhs variables and weights #432

Open

nalimilan reviewed Aug 31, 2022

View reviewed changes

gragusa added 8 commits November 15, 2022 16:58

Follow reviewer suggestions [Batch 1]

831f280

Follow reviewer's suggestions [Batch 2]

b00dc16

probability weights vcov uses momentmatrix

0825324

Fix ProbabilityWeights vcov and tests

48d15fb

Use leverage from StasAPI

3338eab

Merge branch 'master' into JuliaStats-master

c27c749

Rebase against master

970e26e

Fix test

8832e9d

gragusa added 2 commits December 20, 2022 17:52

Merge remote-tracking branch 'origin/master' into JuliaStats-master

9eb2390

Test on 1.6

587c129

nalimilan reviewed Dec 24, 2022

View reviewed changes

Address reviwer comments

fa63a9a

Merge branch 'master' of github.com:JuliaStats/GLM.jl into JuliaStats…

807731a

…-master

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Taking weighting seriously #487

Taking weighting seriously #487

gragusa commented Jul 15, 2022 •

edited

Loading

codecov-commenter commented Jul 16, 2022 •

edited by codecov bot

Loading

lrnv commented Jul 20, 2022

gragusa commented Jul 20, 2022

alecloudenback commented Aug 14, 2022 •

edited

Loading

nalimilan commented Aug 28, 2022

nalimilan left a comment

bkamins commented Aug 31, 2022

nalimilan commented Nov 20, 2022 •

edited

Loading

alecloudenback commented Nov 22, 2022

gragusa commented Nov 22, 2022

nalimilan commented Nov 23, 2022

gragusa commented Nov 23, 2022 via email

nalimilan left a comment

nalimilan Nov 20, 2022

nalimilan Nov 20, 2022

nalimilan Dec 24, 2022

nalimilan Dec 24, 2022

jeremiedb commented Mar 1, 2023

gragusa commented Mar 1, 2023 via email

nalimilan commented Mar 1, 2023

bkamins commented Mar 5, 2023

ParadaCarleton commented Mar 5, 2023 •

edited

Loading

bkamins commented Mar 5, 2023

smishr commented Mar 31, 2023 •

edited

Loading

ParadaCarleton commented Apr 1, 2023

gragusa commented Jun 16, 2023 •

edited

Loading

nalimilan commented Jun 24, 2023

SamuelMathieu-code commented Jan 31, 2024

gragusa commented Feb 12, 2024 via email

Taking weighting seriously #487

Are you sure you want to change the base?

Taking weighting seriously #487

Conversation

gragusa commented Jul 15, 2022 • edited Loading

codecov-commenter commented Jul 16, 2022 • edited by codecov bot Loading

Codecov Report

lrnv commented Jul 20, 2022

gragusa commented Jul 20, 2022

alecloudenback commented Aug 14, 2022 • edited Loading

nalimilan commented Aug 28, 2022

nalimilan left a comment

Choose a reason for hiding this comment

bkamins commented Aug 31, 2022

nalimilan commented Nov 20, 2022 • edited Loading

alecloudenback commented Nov 22, 2022

gragusa commented Nov 22, 2022

nalimilan commented Nov 23, 2022

gragusa commented Nov 23, 2022 via email

nalimilan left a comment

Choose a reason for hiding this comment

nalimilan Nov 20, 2022

Choose a reason for hiding this comment

nalimilan Nov 20, 2022

Choose a reason for hiding this comment

nalimilan Dec 24, 2022

Choose a reason for hiding this comment

nalimilan Dec 24, 2022

Choose a reason for hiding this comment

jeremiedb commented Mar 1, 2023

gragusa commented Mar 1, 2023 via email

nalimilan commented Mar 1, 2023

bkamins commented Mar 5, 2023

ParadaCarleton commented Mar 5, 2023 • edited Loading

bkamins commented Mar 5, 2023

smishr commented Mar 31, 2023 • edited Loading

ParadaCarleton commented Apr 1, 2023

gragusa commented Jun 16, 2023 • edited Loading

nalimilan commented Jun 24, 2023

SamuelMathieu-code commented Jan 31, 2024

gragusa commented Feb 12, 2024 via email

gragusa commented Jul 15, 2022 •

edited

Loading

codecov-commenter commented Jul 16, 2022 •

edited by codecov bot

Loading

alecloudenback commented Aug 14, 2022 •

edited

Loading

nalimilan commented Nov 20, 2022 •

edited

Loading

ParadaCarleton commented Mar 5, 2023 •

edited

Loading

smishr commented Mar 31, 2023 •

edited

Loading

gragusa commented Jun 16, 2023 •

edited

Loading