
add Grouping contrasts #339

Merged
merged 7 commits into from Jul 15, 2020

Conversation

kleinschmidt (Member)

This is in lieu of having these defined in StatsModels.jl; as we've discussed,
they may be a bit too specialized for the specific demands of grouping variables
in MixedModels, without defining enough functionality to be of use in other
settings, so it makes sense to have them here until some kind of sensible
handling of grouping is incorporated into StatsModels.

The Grouping contrasts are a "pseudocontrast" that never defines a contrasts
matrix; it just keeps track of the unique values and an "inverse index" that
maps levels to a numeric index. This lets you get around the OutOfMemoryError
that's triggered when a grouping variable has a very large number of levels.
The catch is that, at the moment, you have to manually specify the grouping
contrasts when constructing the model. I've added a guard that catches OOM
errors during schema construction and prints a warning, but we could also
throw a different error there instead to be clearer. The downside of that
would be that if the OOM is NOT caused by the ranef grouping term, users would
get useless advice and no useful stacktrace, which is why I went with printing
a warning.
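The manual workaround described in the PR looks roughly like this. It's a minimal sketch: the data frame, column names, and sizes are made up, but `Grouping()` and the `contrasts` keyword are the actual MixedModels API:

```julia
using DataFrames, MixedModels, Random

# Simulated data with a high-cardinality grouping variable (names and
# sizes are illustrative). With enough levels, letting the schema build
# a dense contrasts matrix for `subj` can trigger an OutOfMemoryError.
Random.seed!(1)
df = DataFrame(subj = string.(rand(1:5_000, 50_000)),
               y    = randn(50_000))

# Manually marking `subj` as a pure grouping variable skips the
# contrasts matrix entirely; only the levels and the inverse index
# (level => numeric index) are kept.
m = fit(MixedModel, @formula(y ~ 1 + (1 | subj)), df;
        contrasts = Dict(:subj => Grouping()))
```

Without the `contrasts` entry, schema construction falls back to the default categorical-contrasts machinery, which is what blows up for very many levels.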

@kleinschmidt (Member Author)

kleinschmidt commented Jul 13, 2020

based on JuliaStats/StatsModels.jl#181

@codecov

codecov bot commented Jul 13, 2020

Codecov Report

Merging #339 into master will decrease coverage by 0.12%.
The diff coverage is 54.54%.


@@            Coverage Diff             @@
##           master     #339      +/-   ##
==========================================
- Coverage   95.75%   95.63%   -0.13%     
==========================================
  Files          22       23       +1     
  Lines        1485     1512      +27     
==========================================
+ Hits         1422     1446      +24     
- Misses         63       66       +3     
Impacted Files Coverage Δ
src/MixedModels.jl 100.00% <ø> (ø)
src/linearmixedmodel.jl 99.15% <50.00%> (-0.85%) ⬇️
src/grouping.jl 60.00% <60.00%> (ø)
src/remat.jl 94.94% <0.00%> (-0.06%) ⬇️
src/linalg/rankUpdate.jl 98.43% <0.00%> (+0.76%) ⬆️
src/arraytypes.jl 100.00% <0.00%> (+4.87%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6fabb48...4356ad1.

@palday (Member) left a comment

The catch is that, at the moment, you have to manually specify the
grouping contrasts when constructing the model. I've added a guard to catch OOM
errors during schema construction which prints a warning, but we could also
throw a different error there instead to be clearer.

I think this is the way to go for now, but we should consider automatically setting the contrasts on grouping variables, handling them separately from the fixed-effects contrasts. That would allow us a bit of magic that could be good or bad: namely, having a discrete yet numeric fixed effect that also appears as a grouping variable. You can think of this as an approximation to a GAM (it's not quite the same as a GAM, but GAMs can indeed be rewritten as mixed models). Right now, you can of course still do this by having two copies of the same variable, one categorical and one numeric. For our planned neuro time-series work, this is actually surprisingly relevant, because we treat time as a discrete numeric variable and expect nonlinear temporal trends.
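The "two copies" workaround mentioned here can be sketched as follows. This is a hypothetical example: the data and the column names `time`/`time_cat` are invented for illustration:

```julia
using DataFrames, MixedModels, Random

# Numeric `time` gives the linear fixed-effect trend; a categorical
# copy, used only as a grouping variable, absorbs nonlinear deviations
# from that trend, a bit like a (very rough) GAM smooth.
Random.seed!(2)
df = DataFrame(time = repeat(1:50, inner = 20))
df.y = 0.1 .* df.time .+ sin.(df.time ./ 5) .+ randn(nrow(df))
df.time_cat = string.(df.time)   # categorical copy of numeric time

m = fit(MixedModel, @formula(y ~ 1 + time + (1 | time_cat)), df;
        contrasts = Dict(:time_cat => Grouping()))
```

The random intercepts for `time_cat` are shrunk toward zero, so the nonlinear part is regularized rather than fit freely, which is the sense in which this approximates a smooth.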

src/grouping.jl (review thread, resolved)
Comment on lines +58 to +61
if isa(e, OutOfMemoryError)
@warn "Random effects grouping variables with many levels can cause out-of-memory errors. Try manually specifying `Grouping()` contrasts for those variables."
end
rethrow(e)
@palday (Member)

I think this is the way to go right now, but I'm thinking about whether we should automatically set Grouping contrasts for grouping variables.

Co-authored-by: Phillip Alday <palday@users.noreply.github.com>
@kleinschmidt (Member Author)

Yeah, I agree @palday, it would be nice to be able to do this automatically. It actually may be possible now, but it would take some annoying fiddling: you'd have to unpack the FunctionTerm{typeof(|)} terms, or convert them to ReTerms, in order to figure out what the grouping variables are before constructing the schema, because that's when the contrasts are needed to avoid the OOM. This would basically amount to circumventing the normal StatsModels pipeline, which is why I'd proposed somewhere (JuliaStats/StatsModels.jl#154) to add an apply_context stage where you don't yet have the schema but you do have the context in which the formula will be used, so you could do things like create all the ReTerms you need before creating the schema.
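The "annoying fiddling" might look something like the following sketch, which walks a formula's right-hand side looking for `|` calls before any schema exists. This assumes the StatsModels 0.6-era layout, in which an unparsed `FunctionTerm{typeof(|)}` stores its parsed arguments in `args_parsed`; that field name is version-dependent, so treat this as illustrative only:

```julia
using StatsModels

# Collect the symbols of grouping variables from (x | g) terms in a
# formula, before the schema (and thus any contrasts matrix) is built.
function grouping_symbols(f::FormulaTerm)
    rhs = f.rhs isa Tuple ? f.rhs : (f.rhs,)
    syms = Symbol[]
    for t in rhs
        if t isa FunctionTerm{typeof(|)}
            g = t.args_parsed[2]   # second argument of `|` is the grouping expression
            g isa Term && push!(syms, g.sym)
        end
    end
    unique(syms)
end
```

Something like `grouping_symbols(@formula(y ~ 1 + x + (1 | subj)))` would then yield the grouping symbols, which could be used to pre-populate the contrasts Dict with `Grouping()` entries automatically.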

@kleinschmidt (Member Author)

I think at least some of the CI failures are related to this PR, but only in that they have to do with how the OS handles OOM. The macOS job failed before too, apparently during the tests for the added functionality where I intentionally trigger an OOM error. The cancelled jobs might be related to this as well, though I'm not sure; the passing tests on at least one Windows build tell me that OOM doesn't always result in a cancelled job. I guess the thing to do is to remove the tests that intentionally cause an OOM, because otherwise we'll always have failing CI, which... defeats the purpose.

@palday (Member)

palday commented Jul 14, 2020

@kleinschmidt The failures on Future are a known issue. For Windows, I'll have to look at the details of the GitHub Action and its Docker image to figure out why it just cancels. For OS X: do you have a local Mac?

@kleinschmidt (Member Author)

I'm afraid not.

@kliegl

kliegl commented Jul 14, 2020

I'm afraid not. @palday @kleinschmidt

I have many local Macs.

@kleinschmidt (Member Author)

@kliegl can you check out this PR branch and run the tests? The procedure is to

] dev MixedModels
] activate ~/.julia/dev/MixedModels
; git checkout dfk/grouping
] test

@kliegl

kliegl commented Jul 14, 2020

@kliegl can you check out this PR branch and run the tests? The procedure is to

] dev MixedModels
] activate ~/.julia/dev/MixedModels
; git checkout dfk/grouping
] test

The `; git checkout dfk/grouping` step fails. Do I first have to clone this?

@kleinschmidt (Member Author)

Ah yeah, you'll first need to go to that directory, with

; cd .julia/dev/MixedModels
; git fetch origin && git checkout dfk/grouping

@kliegl

kliegl commented Jul 14, 2020

It is running now. I just had to break it into two commands.

@kliegl

kliegl commented Jul 14, 2020

ERROR: Package MixedModels errored during testing (received signal: KILL)

Test_2020-07-14.txt

@kleinschmidt (Member Author)

Yeah, it looks from that like the process is just killed when it hits the OOM tests. I'll push a change that only tests the good path (I'm sufficiently convinced that trying to construct a 1,000,000×1,000,000 dense matrix will OOM...).

@kliegl

kliegl commented Jul 14, 2020

Just ran it again:
Testing MixedModels tests passed

Do you need the report?

@kleinschmidt (Member Author)

Looks like the CI now is failing because of a problem with the artifacts...

@palday (Member)

palday commented Jul 15, 2020

@kleinschmidt Yeah, I'm still sorting out the artifact link stuff. Right now, the default link always points to the newest version, but that of course doesn't match the SHA for older versions. I'll see if I can't fix that as part of #341

@palday (Member)

palday commented Jul 15, 2020

@kleinschmidt assuming that all the CIs (potentially except Future) pass, I think this is good to go.

@dmbates dmbates merged commit 2f9d423 into master Jul 15, 2020
palday pushed a commit that referenced this pull request Aug 6, 2020
* add Grouping contrasts

* export and add a getproperty patch to work around statsmodels

* warn users to use Grouping when OOM during schema call

* Update src/grouping.jl

Co-authored-by: Phillip Alday <phillip.alday@mpi.nl>
(cherry picked from commit 2f9d423)
palday added a commit that referenced this pull request Aug 12, 2020
* add Grouping contrasts (#339)

Co-authored-by: Dave Kleinschmidt <dave.f.kleinschmidt@gmail.com>
@palday palday deleted the dfk/grouping branch November 19, 2020 17:14