
Fixed linear model with perfectly collinear rhs variables and weights #432

Open · wants to merge 11 commits into base: master

Conversation

dwinkler1

This is my (probably not very efficient) suggestion to fix #420.
Suggestions very welcome!

@dwinkler1
Author

Please let me know if I should change anything in the PR. :)

@andreasnoack
Member

Looks right to me. Could you please add a test case as well?

@dwinkler1
Author

I tried to add a testset, but for me the earlier testset "linear model with weights" already errors. I feel like it is related to the PR, but I am not sure how we even end up with a CholeskyPivoted in this case.

This would be my testset

@testset "collinearity and weights" begin
    rng = StableRNG(1234321)
    x1 = randn(rng, 100)
    x1_2 = 3 * x1
    x2 = 10 * randn(rng, 100)
    x2_2 = -2.4 * x2
    y = 1 .+ randn(rng) * x1 + randn(rng) * x2 + 2 * randn(rng, 100)
    df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5],50))
    f = @formula(y ~ x1 + x2 + x3 + x4)
    lm_model = lm(f, df, wts = df.weights)#, dropcollinear = true)
    X = [ones(length(y)) x1_2 x2_2]
    W = Diagonal(df.weights)
    coef_naive = (X'W*X)\X'W*y
    @test lm_model.model.pp.chol isa CholeskyPivoted
    @test rank(lm_model.model.pp.chol) == 3
    @test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end
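The `coef_naive` line in the testset is the textbook closed-form weighted least-squares estimate, beta = (X'WX)⁻¹ X'Wy with W = diag(weights), which the fitted model is compared against. As a language-agnostic illustration of just that formula (a pure-Python sketch, not GLM.jl or the test code; the 2-parameter example data are invented):

```python
# Hedged sketch of the "naive" weighted least-squares reference:
#     beta = (X' W X)^{-1} X' W y,  W = diag(w)
# mirroring the Julia line `coef_naive = (X'W*X)\X'W*y`.
# Pure Python, 2-parameter case only; the example data are made up.

def wls(X, y, w):
    """Solve the 2-parameter weighted normal equations (X'WX) beta = X'Wy."""
    n = len(y)
    # A = X'WX (2x2) and b = X'Wy (length 2), accumulated entry by entry
    A = [[sum(w[i] * X[i][r] * X[i][c] for i in range(n)) for c in range(2)]
         for r in range(2)]
    b = [sum(w[i] * X[i][r] * y[i] for i in range(n)) for r in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]  # 2x2 solve via Cramer's rule
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

# y = 1 + 2x exactly, so the weights cannot change the answer: beta ~ [1, 2]
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
w = [1.0, 0.5, 1.0, 0.5]
beta = wls(X, y, w)
```

Because the example response is exactly linear in x, any choice of positive weights recovers the same coefficients, which makes it a convenient sanity check for the formula itself.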

This is what goes wrong earlier

julia>     lm_model = lm(f, df, wts = df.weights)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

FoodExp ~ 1 + Income

Coefficients:
──────────────────────────────────────────────────────────────────────────────
                     Coef.  Std. Error      t  Pr(>|t|)   Lower 95%  Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)  -2794.53        2.92818e5  -0.01    0.9924  -5.77698e5  5.72109e5
Income           3.63222e6   3.36375e8   0.01    0.9914  -6.56787e8  6.64051e8
──────────────────────────────────────────────────────────────────────────────

@nalimilan
Member

I tried to add a testset, but for me the earlier testset "linear model with weights" already errors. I feel like it is related to the PR, but I am not sure how we even end up with a CholeskyPivoted in this case.

We use dropcollinear=true by default, so you get a CholeskyPivoted unless you pass dropcollinear=false explicitly. From a quick check, it appears that the new failure is due to changes this PR makes for the if rnk == length(ch.p) case. Do you really need to change the code in that branch?
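For readers unfamiliar with the mechanism: `CholeskyPivoted` in Julia wraps a pivoted Cholesky factorization (LAPACK's `pstrf`), and the point where the factorization's remaining diagonal hits zero is the detected numerical rank, which determines which coefficients get pinned to zero. A minimal pure-Python sketch of the idea (an illustration only, not the actual GLM.jl or LAPACK code; the tolerance and demo matrix are invented):

```python
# Hedged sketch of a pivoted Cholesky factorization, the algorithm behind
# Julia's CholeskyPivoted (LAPACK pstrf). It factors a symmetric positive
# semidefinite matrix as P' A P = L L' and stops once the largest remaining
# diagonal entry is numerically zero -- that stopping point is the rank.
# Pure Python; the tolerance and demo matrix are made up for illustration.

def pivoted_cholesky(A, tol=1e-10):
    """Return (L, piv, rank) with A[piv[i]][piv[j]] ~= (L @ L')[i][j]."""
    n = len(A)
    A = [row[:] for row in A]              # work on a copy
    piv = list(range(n))
    rank = n
    for k in range(n):
        # Pivot: move the largest remaining diagonal entry to position k.
        j = max(range(k, n), key=lambda i: A[i][i])
        if A[j][j] <= tol:                 # trailing block is numerically zero
            rank = k
            break
        for row in A:                      # swap columns k and j
            row[k], row[j] = row[j], row[k]
        A[k], A[j] = A[j], A[k]            # swap rows k and j
        piv[k], piv[j] = piv[j], piv[k]
        A[k][k] = A[k][k] ** 0.5
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
        for i in range(k + 1, n):          # rank-1 update of the trailing block
            for c in range(k + 1, i + 1):
                A[i][c] -= A[i][k] * A[c][k]
    L = [[A[i][c] if c <= i and c < rank else 0.0 for c in range(n)]
         for i in range(n)]
    return L, piv, rank

# Gram matrix X'X of two exactly collinear columns x and 3x with x = [1, 2, 3]:
# the detected rank is 1, so one of the two coefficients would be dropped.
G = [[14.0, 42.0], [42.0, 126.0]]
L, piv, rank = pivoted_cholesky(G)
```

In the weighted case discussed in this PR, the same factorization is applied to X'WX rather than X'X, which is why the collinearity handling and the weights interact.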

@dwinkler1 dwinkler1 changed the title WIP: Fixed linear model with perfectly collinear rhs variables and weights Fixed linear model with perfectly collinear rhs variables and weights Feb 4, 2022
@dwinkler1
Author

dwinkler1 commented Feb 4, 2022

@nalimilan you were of course correct. I fixed that and added tests (all tests passing now). However, I made a mess with autoformatting in VS Code. What would be the best way to get just the changes I made in, without the unnecessary indentation changes? Should I just fork again?
Sorry, my experience with git pull requests is quite limited.

EDIT: I gave it a try let me know if that is ok!

@codecov-commenter

codecov-commenter commented Feb 6, 2022

Codecov Report

Base: 87.08% // Head: 87.27% // Increases project coverage by +0.19% 🎉

Coverage data is based on head (b8eb789) compared to base (7299d18).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #432      +/-   ##
==========================================
+ Coverage   87.08%   87.27%   +0.19%     
==========================================
  Files           7        7              
  Lines         929      943      +14     
==========================================
+ Hits          809      823      +14     
  Misses        120      120              
Impacted Files Coverage Δ
src/linpred.jl 85.29% <100.00%> (+1.68%) ⬆️



@nalimilan
Member

Thanks. There are still some failures which may well be random given that they happen on different versions. Any idea what may be going on? Can you reproduce these locally?

@dwinkler1
Author

dwinkler1 commented Feb 7, 2022

Looks like numerical instability to me. I'll investigate; maybe I can write a more stable version of the test case. The failures do not happen on my machine (Ubuntu with Julia 1.6).

@nalimilan
Member

Bump. Maybe try a different set of values?

@alecloudenback

Bump on this PR. I am encountering the issue in #420 as well.

On my local machine (Julia 1.8-rc4, Mac M1) the tests pass. I agree with @danielw2904 that this appears to just be a numerical difference, which stands out because the failing tests were printing ±0.0.

All four failing test cases in the "F test for model comparison" testset differed only by ±0.0, for example:

[Screen recording attachment: Screen.Recording.2022-08-13.at.11.53.45.PM.mov]

@dwinkler1
Author

Sorry, I am unable to work on this anymore.

@dwinkler1 dwinkler1 closed this Aug 14, 2022
@alecloudenback alecloudenback mentioned this pull request Aug 14, 2022
@nalimilan nalimilan reopened this Aug 28, 2022
@nalimilan
Member

Unfortunately it seems to me that something more serious is going on. Locally, tests always fail, even on Julia 1.7. Maybe this has to do with BLAS and the CPU? The following models give completely different coefficients here:

julia> using GLM, StableRNGs

julia> rng = StableRNG(1234321);

julia> x1 = randn(rng, 100);

julia> x1_2 = 3 * x1;

julia> x2 = 10 * randn(rng, 100);

julia> x2_2 = -2.4 * x2;

julia> y = 1 .+ randn(rng) * x1 + randn(rng) * x2 + 2 * randn(rng, 100);

julia> df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5],50));

julia> lm_collinear = lm(@formula(y ~ x1 + x2 + x3 + x4), df, wts=df.weights)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x1 + x2 + x3 + x4

Coefficients:
───────────────────────────────────────────────────────────────────────────
                  Coef.  Std. Error       t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)  0.00389006    7.74397     0.00    0.9996  -15.4334    15.4412
x1           0.0         NaN         NaN       NaN     NaN        NaN
x2           0.594857      2.78844     0.21    0.8317   -4.96379    6.1535
x3           0.0         NaN         NaN       NaN     NaN        NaN
x4           4.11564       0.351608   11.71    <1e-17    3.41472    4.81656
───────────────────────────────────────────────────────────────────────────

julia> lm_reduced = lm(@formula(y ~ x2 + x4), df, wts=df.weights)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

y ~ 1 + x2 + x4

Coefficients:
────────────────────────────────────────────────────────────────────────
                Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
────────────────────────────────────────────────────────────────────────
(Intercept)  1.29678    0.265978    4.88    <1e-05  0.766565    1.827
x2           0.215149   0.0973285   2.21    0.0302  0.0211286   0.40917
x4           0.648579   0.0124034  52.29    <1e-58  0.623853    0.673305
────────────────────────────────────────────────────────────────────────

julia> deviance(lm_collinear)
410484.4884591147

julia> deviance(lm_reduced)
355.2246843374244

On current master, both models have the same deviance and the same predictions, so it seems that this PR introduces a regression at least in some cases.

@nalimilan
Member

Though if #487 (comment) fixes it, then maybe we don't need this PR after all?

@nalimilan
Member

@alecloudenback If you're interested in making a PR, you could extract the relevant part from #487 so that we can fix this bug without waiting for the more general improvements of weighting.

Development

Successfully merging this pull request may close these issues.

OLS with collinearity and weights