Should coeffnames be Symbols? #113

oxinabox · 2019-06-13T13:22:22Z

It might be a bit breaking,
but Symbol is the type generally used for The Name of a Thing

and DataFrame column names are Symbols,
as are the column names of a Tables.jl ColumnTable


Nosferican · 2019-06-27T02:54:29Z

Another thing to consider is whether we want the type to change depending on the inputs:

using StatsModels  # provides @formula, schema, apply_schema, coefnames

data = [(x = 0,)]
coefnames(apply_schema(@formula(0 ~ 0), schema(data)))
coefnames(apply_schema(@formula(0 ~ 1), schema(data)))
coefnames(apply_schema(@formula(0 ~ 1 + x), schema(data)))
coefnames(apply_schema(@formula(1 ~ 0), schema(data)))
coefnames(apply_schema(@formula(1 ~ 1), schema(data)))
coefnames(apply_schema(@formula(1 ~ 1 + x), schema(data)))

Tokazama · 2020-01-16T14:55:19Z

I really like the idea of using Symbol for coefnames instead of strings, but what would be used for interaction terms? Would we just want to take the currently developed string name and convert it to a symbol?

oxinabox · 2020-01-16T18:14:44Z

Sure, why not?

nalimilan · 2020-02-13T21:12:18Z

As noted by @kleinschmidt at #169, it's not clear what the advantage of doing this would be. Yet it's quite disruptive. Can you elaborate?

Tokazama · 2020-02-13T21:20:32Z

Just to add to what @oxinabox said in the OP, terms are also using symbols as names so it seems very odd that the only place we are using strings to refer to variables is the coefficients.

The actual reason I started looking at this was that it's easier to speed up operations on symbols than on strings. I was looking at ways to query coefficients within models and formulas.

I think that if we were discussing this in the absence of other package dependencies the proposed solution would be a pretty obvious improvement. I consider statistical modeling a pretty important feature of Julia so I understand that disrupting the existing ecosystem isn't a trivial thing.

oxinabox · 2020-02-14T20:59:34Z

To repeat my original point: it puts us in line with how Tables.jl represents things.

Tokazama · 2020-02-15T19:09:41Z

We should probably also address what we'd miss out on if we no longer used strings

nalimilan · 2020-02-16T17:04:13Z

I see the point of consistency with Tables. OTOH, coefficient names are not exactly like terms, as you won't type them in a Julia-like syntax and they are often not valid Julia identifiers. This is clearly visible in the PR's tests: very often coefficient names can be neither typed nor printed directly using the : syntax. Compare:

julia> ["Intercept", "x1p: 6", "x1p: 7", "x1p: 8"]
4-element Array{String,1}:
 "Intercept"
 "x1p: 6"
 "x1p: 7"
 "x1p: 8"

julia> [:Intercept, Symbol("x1p: 6"), Symbol("x1p: 7"), Symbol("x1p: 8")]
4-element Array{Symbol,1}:
 :Intercept
 Symbol("x1p: 6")
 Symbol("x1p: 7")
 Symbol("x1p: 8")

Can you explain in what context the performance of symbols compared to strings really matters?
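The printing difference above can be checked with Base alone. This is a minimal sketch, not StatsModels code: `coef_strings` is an illustrative variable holding the names from the comparison, and `Base.isidentifier` is an unexported Base function used here only to flag which names could be written with the `:name` literal syntax.

```julia
# Coefficient names as plain strings (copied from the comparison above).
coef_strings = ["Intercept", "x1p: 6", "x1p: 7", "x1p: 8"]

# Converting to symbols is a one-liner via broadcasting.
coef_symbols = Symbol.(coef_strings)

# Only names that are valid Julia identifiers can be typed as `:name`.
is_valid = Base.isidentifier.(coef_strings)

# `repr` shows how each symbol is displayed at the REPL.
for (sym, ok) in zip(coef_symbols, is_valid)
    println(repr(sym), " => valid identifier: ", ok)
end
```

Only the first name passes the identifier check; the other three fall back to the `Symbol("...")` form when printed.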

Nosferican · 2020-02-16T17:08:31Z

Aye. I would not be opposed to it if they mapped to the table, but something like

insure: Indemnity ~ (Intercept)   1.28694     0.59232     2.17271     0.0302   0.123689   2.4502   
insure: Indemnity ~ age           0.00779612  0.0114418   0.681372    0.4959  -0.0146743  0.0302666
insure: Indemnity ~ male         -0.451848    0.367486   -1.22957     0.2193  -1.17355    0.269855 
insure: Indemnity ~ nonwhite     -0.217059    0.425636   -0.509965    0.6103  -1.05296    0.618843 
insure: Indemnity ~ site: 2       1.21152     0.470506    2.57493     0.0103   0.287497   2.13554  
insure: Indemnity ~ site: 3       0.207813    0.366293    0.56734     0.5707  -0.511547   0.927172 
insure: Prepaid ~ (Intercept)     1.55666     0.596327    2.61041     0.0093   0.385533   2.72778  
insure: Prepaid ~ age            -0.00394887  0.0115993  -0.340439    0.7336  -0.0267287  0.018831 
insure: Prepaid ~ male            0.109846    0.365187    0.300793    0.7637  -0.607343   0.827035 
insure: Prepaid ~ nonwhite        0.757718    0.419575    1.80592     0.0714  -0.0662835  1.58172  
insure: Prepaid ~ site: 2         1.32456     0.469789    2.81947     0.0050   0.401941   2.24717  
insure: Prepaid ~ site: 3        -0.380175    0.372819   -1.01973     0.3083  -1.11235    0.352001

has both the response and the predictor so it wouldn't map to a table. An even simpler case,

(Intercept)   8.23837      0.13135      62.721      <1e-99   7.98093       8.49581    
Educatn       0.000538588  0.000138739   3.88203    0.0001   0.000266663   0.000810512
Age           0.0389003    0.0068679     5.66408    <1e-7    0.0254394     0.0523613  
Age ^ 2      -0.000191192  8.86411e-5   -2.15692    0.0310  -0.000364926  -1.74575e-5

(Intercept) is not a valid table column nor is Age ^ 2.

Tokazama · 2020-02-16T17:16:00Z

> Can you explain in what context the performance of symbols compared to strings really matters?

My point with performance was that using a string to index vs symbol should have a similar performance hit as this:

julia> using BenchmarkTools

julia> x = [:a, :b, :c, :d, :e, :f, :g];

julia> y = string.(x);

julia> @btime findfirst(==(:e), x)
  82.534 ns (1 allocation: 16 bytes)
5

julia> @btime findfirst(==("e"), y)
  111.964 ns (1 allocation: 16 bytes)
5
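The comparison above is a linear scan over a short vector. If repeated lookups were the concern, one alternative is to build a name-to-position index once and reuse it. This is a rough sketch assuming the names are unique; `build_index` is a hypothetical helper, not part of StatsModels, and no timing claims are made for it.

```julia
# Hypothetical helper (not part of StatsModels): map each name to its position.
build_index(names) = Dict(name => i for (i, name) in enumerate(names))

sym_names = [:a, :b, :c, :d, :e, :f, :g]
str_names = string.(sym_names)

# Build each index once, up front.
sym_index = build_index(sym_names)
str_index = build_index(str_names)

# Later lookups go through the Dict instead of scanning the vector.
println(sym_index[:e])
println(str_index["e"])
```

The same helper works for either element type, so it does not by itself settle the String versus Symbol question.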

Tokazama · 2020-02-16T17:22:14Z

Well, if we're not using the traditional table interface then why do we need to use a String or Symbol for the rowname? Just use a dictionary with the terms.

nalimilan · 2020-02-17T12:44:52Z

> My point with performance was that using a string to index vs symbol should have a similar performance hit as this:

But in practice, in what concrete cases does the 30ns overhead really matter?

> Well, if we're not using the traditional table interface then why do we need to use a String or Symbol for the rowname? Just use a dictionary with the terms.

Sorry, I don't get what you mean.

Tokazama · 2020-02-17T13:14:32Z

> But in practice, in what concrete cases does the 30ns overhead really matter?

I imagine it would only matter if a particular algorithm required frequently looking up coefficients via a Symbol. I don't think it's a common problem for something like a GLM, but I imagine it could become a source of overhead for something like a mixed effects model with many interacting random effects.

> Sorry, I don't get what you mean.

This was in reference to comments about the Symbol("...") syntax not being as idiomatic as using strings. The point of the comment was that if the only reason to keep using strings is because it's easier to reference coefficients with them and print their names in a table then we may as well just use a Term. Those already print out the way we want them to and have a more direct relationship to the actual coefficient than a string does.

I understand these may seem like a lot of trivial points, but I still don't understand the benefit of using a string at all. Perhaps using a Symbol isn't the right thing to do, but I can at least see some benefit to using it over the current method, but maybe I'm totally missing something.

nalimilan · 2020-02-17T13:19:19Z

But coefficient names aren't terms. There's a one-to-one relationship between coefficients and terms only for simple continuous terms, but not for categorical terms nor for splines, etc. We could introduce another type which would also store a reference to the term if that's useful, but coefnames would still have to return something that's name-like (strings or symbols).
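The idea of a type that keeps a name while also pointing back to its term could look roughly like the following. This is only a sketch: `CoefLabel` is a hypothetical type, not part of StatsModels, and it assumes a categorical `site` variable that yields two coefficients, with names borrowed from the table earlier in the thread.

```julia
# Hypothetical type, not part of StatsModels: a coefficient name plus the
# symbol of the term that generated it.
struct CoefLabel
    name::String
    term::Symbol
end

# Assumed example: categorical `site` contributes two coefficients,
# continuous `age` contributes one.
labels = [
    CoefLabel("age", :age),
    CoefLabel("site: 2", :site),
    CoefLabel("site: 3", :site),
]

# A name-like view, which `coefnames` would still have to return.
label_names = [l.name for l in labels]

# All coefficient positions that belong to the `site` term.
site_positions = findall(l -> l.term == :site, labels)

println(label_names)
println(site_positions)
```

The two `site` entries show the one-to-many relationship between a term and its coefficients, which is why a term alone cannot stand in for a coefficient name.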

The performance advantage doesn't sound like a strong motivation to me, given that nobody seems to have a real use case for it. Compared to fitting a model, the cost is probably not that high.

Tokazama · 2020-02-17T13:26:44Z

The performance thing was more of an example that there is some benefit to using symbols. I apologize for taking too much time on that.

> We could introduce another type which would also store a reference to the term if that's useful

This might actually be very useful. My original interest in this started with trying to get plots for statistical models (JuliaPlots/StatsPlots.jl#290).

oxinabox · 2020-02-17T13:27:47Z

In practice, it is very common for me to make actual tables where the column names are the coefnames
so I can eyeball it and check it did what I wanted
before using it to fit a model

We don't provide a good interface for that yet.
We could provide that interface which would convert the coefnames to symbols.

kleinschmidt · 2020-02-17T14:29:26Z

We've run into a very similar use case recently: we wanted to generate table-like things from the outputs of simulations where there's one entry in a vector per coef, and to make a table to collect the results. It's easy enough to do Symbol.(coefnames(f)), but it could be nice to agree on a default. I don't know how breaking it would be to use symbols instead of strings from an external perspective; internally we use the string concatenation syntax to (for instance) generate coefnames for categorical variables, but that's not a huge thing to change.
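A minimal sketch of that conversion using only Base: `coef_strings` stands in for the output of `coefnames(f)`, with names taken from the simpler table earlier in the thread, and `estimates` holds illustrative values rounded from the same table. Neither variable comes from StatsModels.

```julia
# Stand-in for the output of `coefnames(f)`.
coef_strings = ["(Intercept)", "Educatn", "Age", "Age ^ 2"]

# Illustrative values only, one entry per coefficient.
estimates = [8.238, 0.00054, 0.0389, -0.00019]

# Build a NamedTuple keyed by symbols, one field per coefficient.
keys_as_symbols = Tuple(Symbol.(coef_strings))
row = NamedTuple{keys_as_symbols}(Tuple(estimates))

# Valid identifiers support dot access; other names need an explicit Symbol.
println(row.Age)
println(row[Symbol("Age ^ 2")])
```

A vector of such named tuples, one per simulation, can then be collected as a row table.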

nalimilan · 2020-02-17T16:05:50Z

I guess that's a tradeoff depending on the use case. But in many instances strings will be more convenient than symbols since (as I noted) there's no direct way to enter symbols that are not valid identifiers using :.

Maybe we could make it easier to create tables from names as strings, automatically converting to symbols?

Tokazama · 2020-02-17T16:37:52Z

It seems like the greatest friction here is being caused by the extra syntax that using symbols would require? This is really only a problem for categorical and interacting terms, which the user never has to actually type in themselves unless searching for them using coefnames. Perhaps this could be solved using some special syntax with terms, like term(:categories)[:g1] to refer to group one of the categories variable.

Nosferican · 2020-02-17T16:41:33Z

Not really. Any FunctionTerm would have that issue (e.g., Age ^ 2). I don’t see what would be gained exactly. What would the change help?

Tokazama · 2020-02-18T23:34:02Z

Sorry, for some reason I thought you could do term(:Age)^2 already. I was just trying to think of ways to take advantage of the more recent terms interface to solve the problem.

Tokazama mentioned this issue Jan 20, 2020

coefnames returns symbols #169

Open
