Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should coeffnames be Symbols? #113

Open
oxinabox opened this issue Jun 13, 2019 · 21 comments
Open

Should coeffnames be Symbols? #113

oxinabox opened this issue Jun 13, 2019 · 21 comments

Comments

@oxinabox
Copy link
Contributor

It might be a bit breaking,
but Symbol is the type generally used for The Name of a Thing

and DataFrame column names are Symbols,
as are ColumnTables

@Nosferican
Copy link
Contributor

Another thing to consider is whether we want the type to change depending on the inputs,

data = [(x = 0,)]
coefnames(apply_schema(@formula(0 ~ 0), schema(data)))
coefnames(apply_schema(@formula(0 ~ 1), schema(data)))
coefnames(apply_schema(@formula(0 ~ 1 + x), schema(data)))
coefnames(apply_schema(@formula(1 ~ 0), schema(data)))
coefnames(apply_schema(@formula(1 ~ 1), schema(data)))
coefnames(apply_schema(@formula(1 ~ 1 + x), schema(data)))

@Tokazama
Copy link

I really like the idea of using Symbol for coefnames instead of strings but what would be used for interaction terms? Would we just want to currently developed string name and convert it to a symbol?

@oxinabox
Copy link
Contributor Author

Sure, why not?

@nalimilan
Copy link
Member

As noted by @kleinschmidt at #169, it's not clear what would be the advantage of doing this. Yet it's quite disruptive. Can you develop?

@Tokazama
Copy link

Tokazama commented Feb 13, 2020

Just to add to what @oxinabox said in the OP, terms are also using symbols as names so it seems very odd that the only place we are using strings to refer to variables is the coefficients.

The actually reason I started looking at this was because it's easier to speed up operations on symbols than strings. I was looking at ways to query coefficients within models and formulas.

I think that if we were discussing this in the absence of other package dependencies the proposed solution would be a pretty obvious improvement. I consider statistical modeling a pretty important feature of Julia so I understand that disrupting the existing ecosystem isn't a trivial thing.

@oxinabox
Copy link
Contributor Author

To repreat my original point: it puts us inline with how Tables.jl represents things.

@Tokazama
Copy link

We should probably also address what we'd miss out on if we no longer used strings

@nalimilan
Copy link
Member

I see the point of consistency with Tables. OTOH, coefficients names are not exactly like terms, as you won't type them in a Julia-like syntax and they are often not valid Julia identifiers. This is clearly visible in the PR's tests: very often coefficient names cannot be typed nor printed directly using the : syntax. Compare:

julia> ["Intercept", "x1p: 6", "x1p: 7", "x1p: 8"]
4-element Array{String,1}:
 "Intercept"
 "x1p: 6"
 "x1p: 7"
 "x1p: 8"

julia> [:Intercept, Symbol("x1p: 6"), Symbol("x1p: 7"), Symbol("x1p: 8")]
4-element Array{Symbol,1}:
 :Intercept
 Symbol("x1p: 6")
 Symbol("x1p: 7")
 Symbol("x1p: 8")

Can you explain in what context does the performance of symbols compared to strings really matters?

@Nosferican
Copy link
Contributor

Aye. I would not be opposed to it if they mapped to the table, but something like

insure: Indemnity ~ (Intercept)   1.28694     0.59232     2.17271     0.0302   0.123689   2.4502   
insure: Indemnity ~ age           0.00779612  0.0114418   0.681372    0.4959  -0.0146743  0.0302666
insure: Indemnity ~ male         -0.451848    0.367486   -1.22957     0.2193  -1.17355    0.269855 
insure: Indemnity ~ nonwhite     -0.217059    0.425636   -0.509965    0.6103  -1.05296    0.618843 
insure: Indemnity ~ site: 2       1.21152     0.470506    2.57493     0.0103   0.287497   2.13554  
insure: Indemnity ~ site: 3       0.207813    0.366293    0.56734     0.5707  -0.511547   0.927172 
insure: Prepaid ~ (Intercept)     1.55666     0.596327    2.61041     0.0093   0.385533   2.72778  
insure: Prepaid ~ age            -0.00394887  0.0115993  -0.340439    0.7336  -0.0267287  0.018831 
insure: Prepaid ~ male            0.109846    0.365187    0.300793    0.7637  -0.607343   0.827035 
insure: Prepaid ~ nonwhite        0.757718    0.419575    1.80592     0.0714  -0.0662835  1.58172  
insure: Prepaid ~ site: 2         1.32456     0.469789    2.81947     0.0050   0.401941   2.24717  
insure: Prepaid ~ site: 3        -0.380175    0.372819   -1.01973     0.3083  -1.11235    0.352001 

has both the response and the predictor so it wouldn't map to a table. An even simpler case,

(Intercept)   8.23837      0.13135      62.721      <1e-99   7.98093       8.49581    
Educatn       0.000538588  0.000138739   3.88203    0.0001   0.000266663   0.000810512
Age           0.0389003    0.0068679     5.66408    <1e-7    0.0254394     0.0523613  
Age ^ 2      -0.000191192  8.86411e-5   -2.15692    0.0310  -0.000364926  -1.74575e-5 

(Intercept) is not a valid table column nor is Age ^ 2.

@Tokazama
Copy link

Can you explain in what context does the performance of symbols compared to strings really matters?

My point with performance was that using a string to index vs symbol should have a similar performance hit as this:

julia> using BenchmarkTools

julia> x = [:a, :b, :c, :d, :e, :f, :g];

julia> y = string.(x);

julia> @btime findfirst(==(:e), x)
@  82.534 ns (1 allocation: 16 bytes)
5

julia> @btime findfirst(==("e"), y)
  111.964 ns (1 allocation: 16 bytes)
5

@Tokazama
Copy link

Well, if we're not using the traditional table interface then why do we need to use a String or Symbol for the rowname? Just use a dictionary with the terms.

@nalimilan
Copy link
Member

My point with performance was that using a string to index vs symbol should have a similar performance hit as this:

But in practice, in what concrete cases does the 30ns overhead really matter?

Well, if we're not using the traditional table interface then why do we need to use a String or Symbol for the rowname? Just use a dictionary with the terms.

Sorry, I don't get what you mean.

@Tokazama
Copy link

But in practice, in what concrete cases does the 30ns overhead really matter?

I imagine it would only matter if a particular algorithm required frequently looking up coefficients via a Symbol. I don't think it' a common problem for something like a GLM but imagine it could become a source of overhead for something like a mixed effects model with many interacting random effects.

Sorry, I don't get what you mean.

This was in reference to comments about the Symbol("...") syntax not being as idiomatic as using strings. The point of the comment was that if the only reason to keep using strings is because it's easier to reference coefficients with them and print their names in a table then we may as well just use a Term. Those already print out the way we want them to and have a more direct relationship to the actual coefficient than a string does.

I understand these may seem like a lot of trivial points, but I still don't understand the benefit of using a string at all. Perhaps using a Symbol isn't the right thing to do, but I can at least see some benefit to using it over the current method, but maybe I'm totally missing something.

@nalimilan
Copy link
Member

But coefficient names aren't terms. There's a one-to-one relationship between coefficients and terms only for simple continuous terms, but not for categorical terms nor for splines, etc. We could introduce another type which would also store a reference to the term if that's useful, but coefnames would still have to return something that's name-like (strings or symbols).

The performance advantage doesn't sound like a strong motivation to me, given that nobody seems to have a real use case for it. Compared to fitting a model, the cost is probably not that high.

@Tokazama
Copy link

The performance thing was more of an example that there is some benefit to using symbols. I apologize for taking too much time on that.

We could introduce another type which would also store a reference to the term if that's useful

This might actually be very useful. My original interest in this started with trying to get plots for statistic models (JuliaPlots/StatsPlots.jl#290).

@oxinabox
Copy link
Contributor Author

It is practically very common for me to make actual tables where the column names at the coefnames
so I can eyeball it and check it did what i wanted
before using it to fit a model

We don't provide a good interface for that yet.
We could provide that interface which would convert the coefnames to symbols.

@kleinschmidt
Copy link
Member

We've run into a very similar use case recently, wanting to generate table-like things from the outputs of simulations where there's one entry in a vector per coef, and want to make a table to collect the results. It's easy enough to do Symbol.(coefnames(f)), but it could be nice to agree on a default. I don't know how breaking it would be to use symbols instead of strings from external perspective; internally we use the string concatenation syntax to (for instance) generate coefnames for categorical variables, but that's not a huge thing to change.

@nalimilan
Copy link
Member

I guess that's a tradeoff depending on the use case. But in many instances strings will be more convenient than symbols since (as I noted) there's no direct way to enter symbols that are not valid identifiers using :.

Maybe we could make it easier to create tables from names as string, automatically converting to symbols?

@Tokazama
Copy link

It seems like the greatest friction here is being caused by the extra syntax that using symbols would require? This is really only a problem for categorical and interacting terms, which the user never has to actually type in themselves unless searching for them using coefnames. Perhaps this seems could be solved using some special syntax with terms like term(:categories)[:g1] to refer to group one of the categories variable.

@Nosferican
Copy link
Contributor

Not really. Any FunctionTerm would have that issue (e.g., Age ^ 2). I don’t see what would be gained exactly. What would the change help?

@Tokazama
Copy link

Sorry, for some reason I thought you could do term(:Age)^2 already. I was just trying to think of ways to take advantage of the more recent terms interface to solve the problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants