Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect Default Coding of Union{Missing, Int or Float64} #145

Open
clintonTE opened this issue Aug 28, 2019 · 8 comments · May be fixed by #153 or #295
Open

Incorrect Default Coding of Union{Missing, Int or Float64} #145

clintonTE opened this issue Aug 28, 2019 · 8 comments · May be fixed by #153 or #295
Milestone

Comments

@clintonTE
Copy link

Pretty straight forward. I would expect the results of these to be the same:

function mwe(N=10)
  df = DataFrame(x = rand(N),
    zm = Vector{Union{Missing, Int}}(rand(1:4,N)),
    z = Vector{Int}(rand(1:4,N)))

  println("W/out missings")
  f = @eval(@formula(x ~ z))
  f = apply_schema(f, schema(f,df), StatisticalModel)
  m = modelcols(f.rhs, df)
  display(m)

  println("With missings")
  f = @eval(@formula(x ~ zm))
  f = apply_schema(f, schema(f,df), StatisticalModel)
  m = modelcols(f.rhs, df)
  display(m)
end

mwe()

OUTPUT:

W/out missings
10×2 Array{Float64,2}:
 1.0  3.0
 1.0  3.0
 1.0  2.0
 1.0  2.0
 1.0  1.0
 1.0  2.0
 1.0  4.0
 1.0  1.0
 1.0  4.0

With missings
10×4 Array{Float64,2}:
 1.0  0.0  1.0  0.0
 1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0
 1.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0
 1.0  0.0  0.0  0.0
 1.0  0.0  1.0  0.0
 1.0  1.0  0.0  0.0
 1.0  0.0  0.0  1.0
 1.0  0.0  1.0  0.0
@kleinschmidt
Copy link
Member

Note that if you pass ContinuousTerm as a hint to the schema you should get the desired result:

julia> modelcols(apply_schema(f, schema(f, df, Dict(:zm => ContinuousTerm)), StatisticalModel), df)[2]
10×2 Array{Float64,2}:
 1.0  1.0
 1.0  2.0
 1.0  2.0
 1.0  3.0
 1.0  3.0
 1.0  3.0
 1.0  3.0
 1.0  2.0
 1.0  3.0
 1.0  1.0

That being said, it might make sense to widen the criterion for whether something is continuous or not to be Union{Missing, <:Real} instead of just <:Real. And leave it up to the user to decide how to handle missings in the model matrix that results...

@kleinschmidt
Copy link
Member

kleinschmidt commented Sep 9, 2019

Actually, if there are any missing values it will error:

julia> df.zm[2] = missing
missing

julia> modelcols(apply_schema(f, schema(f, df, Dict(:zm => ContinuousTerm)), StatisticalModel), df)[2]
ERROR: MethodError: no method matching copy(::Missing)

@matthieugomez
Copy link
Contributor

matthieugomez commented Sep 9, 2019

Still, the issue is what happens when the variable is Vector{Union{<:Real, Missing} but does not contain missing values. This is a pretty common situation, for instance:

using DataFrames
df = DataFrame(y = [missing, 1, 2])
idx = completecases(df)
subdf = df[idx, :]

@clintonTE
Copy link
Author

Still, the issue is what happens when the variable is Vector{Union{<:Real, Missing} but does not contain missing values. This is a pretty common situation, for instance:

using DataFrames
df = DataFrame(y = [missing, 1, 2])
idx = completecases(df)
subdf = df[idx, :]

This. A lot of my regression code follows a similar pattern before getting the model matrix. Also worth pointing out that the old pattern of ModelMatrix(ModelFrame(.)) treated Vector{Float64} and Vector{Union{Float64,Missing}} the same (as continuous), at least in the original use case.

Moreover, I really like how StatsModels has had common-sense rules for continuous variables in most reasonable cases. Some intuition for ideal behavior: A typical "raw" dataframe model and formula will involve Dates, Ints, Symbols, Strings, and Float64s, and the various permutations combined with Missing. Except for T<:Union{Number,Missing}, I would expect Dummy Coding to be default and for T<:Union{Number, Missing} I would think it would be Continuous by default. I guess I would prefer that as a bright-line rule than relying on behind the scenes inference based on the assortment and uniqueness of values. (Of course, this implies that if the column in the DataFrame is already designated as a CategoricalValue or similar, that would also be obeyed.)

@kleinschmidt
Copy link
Member

I think it makes sense to treat Union{Missing, <:Number} as continuous. I'll do a PR later this week or happily review one if someone else wants to make one before then!

@matthieugomez
Copy link
Contributor

Note that for now

using StatsModels
N = 10_000_000
x = rand(N) + rand([0, missing], N)
df = (x = x,)
schema(Term(:x), df)
#julia(20271,0x116991dc0) malloc: can't allocate region
#:*** mach_vm_map(size=199992320073728, flags: 60000100) failed (error code=3)
#julia(20271,0x116991dc0) malloc: *** set a breakpoint in malloc_error_break to debug
#141 ERROR: OutOfMemoryError(

@palday
Copy link
Member

palday commented Apr 23, 2020

@matthieugomez That's not too surprising given the underlying issue (and how it expands out), but thanks for the example.

@matthieugomez
Copy link
Contributor

yes, just connecting it to FixedEffects/FixedEffectModels.jl#99

@kleinschmidt kleinschmidt added this to the 0.7 milestone Dec 18, 2020
@palday palday linked a pull request Jun 10, 2023 that will close this issue
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants