cut into Intervals #404

Open
jariji opened this issue Sep 15, 2023 · 8 comments
Comments

@jariji

jariji commented Sep 15, 2023

Currently cut returns String-valued categories but Interval-valued categories from IntervalSets.jl could be nicer. What do you think about this?

@bkamins
Member

bkamins commented Sep 15, 2023

What we would need to do is to allow types from IntervalSets.jl to be category levels (now only char, string and number are allowed). But I think it would make sense to allow them. IntervalSets.jl is a relatively light package.

@nalimilan - what do you think?

@nalimilan
Member

I'm not sure it would really be useful. Strings are convenient and flexible, and cut is intended to allow naming classes like "20-24 years", "Q1" or "Low". Do you have a concrete use case in mind?

IntervalSets could provide another function to cut a numeric variable into proper intervals. If you have intervals, you probably don't need a CategoricalArray since intervals are naturally sorted in the correct order (contrary to string classes).

Also in terms of implementation, allowing Interval values in CategoricalArrays would require taking a dependency on IntervalSets. I don't think an extension would work unfortunately. It's annoying we have to limit supported types like this but it's required to limit invalidations...

@jariji
Author

jariji commented Sep 16, 2023

> Do you have a concrete use case in mind?

I want to be able to use the IntervalSets.jl functions. At the moment I am just using Interval manually without CategoricalArrays, which is fine, but I am

  • extracting the leftendpoint and rightendpoint
  • checking which bin a new value belongs in, using findfirst(∋(x), intervals)
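A minimal sketch of the manual approach described above, assuming IntervalSets.jl is installed. The 90-day `edges` and the helper name `bin_index` are hypothetical and only for illustration. A lambda is used in place of `∋(x)`, which should behave the same way.

```julia
using IntervalSets

# Hypothetical 90-day bins, closed on the left and open on the right.
edges = 0:90:360
intervals = [Interval{:closed,:open}(lo, hi)
             for (lo, hi) in zip(edges[1:end-1], edges[2:end])]

# Endpoints of the second bin, [90, 180).
leftendpoint(intervals[2])   # 90
rightendpoint(intervals[2])  # 180

# Index of the first bin containing x, or `nothing` if x lies outside every bin.
bin_index(x, intervals) = findfirst(i -> x in i, intervals)

bin_index(100, intervals)  # 2
bin_index(400, intervals)  # nothing
```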

@nalimilan
Member

That's not what I call a concrete use case. ;-) What kind of data are you processing? For what goal?

@bkamins
Member

bkamins commented Sep 16, 2023

I assume that a basic use case is that:

  1. originally, you have a continuous variable.
  2. you bin it (e.g. for an input into some model)
  3. later you get new data (continuous), and you want to find out which level of the categorical (binned) variable each new value would fall into.

Of course, all this can be done without CategoricalArrays.jl, but I assume that @jariji wants some convenience and consistency here.
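As a rough illustration of doing this without either package, the sketch below uses only Julia's Base. The `breaks` values, the sample data, and the helper name `bin_of` are hypothetical; bins are assumed to be closed on the left and open on the right.

```julia
# Hypothetical break points, fixed when the original data was binned.
breaks = [0, 90, 180, 270, 360]

# Bin i covers [breaks[i], breaks[i+1]); values outside all bins give `nothing`.
function bin_of(x, breaks)
    i = searchsortedlast(breaks, x)  # index of the last break <= x, 0 if none
    (i == 0 || i == length(breaks)) ? nothing : i
end

original = [12, 95, 200, 359]  # step 1: continuous variable
new_data = [45, 180, 400]      # step 3: new observations arriving later

bin_of.(original, Ref(breaks))  # step 2: bin the original data
bin_of.(new_data, Ref(breaks))  # step 3: reuse the same breaks for new data
```

Reusing the same `breaks` for both calls is what keeps the bin assignment of new values consistent with the original binning.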

@jariji
Author

jariji commented Sep 16, 2023

I have duration data and I'm partitioning it into intervals for a model where each interval has a parameter, something like 90-day-period fixed effects.

Btw I'm using numbers for time durations but could imagine using Dates.Days or something.

> IntervalSets could provide another function to cut a numeric variable into proper intervals.

This might be the best solution, though I guess IntervalSets.jl doesn't have the concept of a partition, so it doesn't know that some interval v_1 is part of a partition v of intervals dividing a space.

@nalimilan
Member

I see. Do you then get a single new value to fit into existing intervals? Or a vector of new values?

@jariji
Author

jariji commented Sep 16, 2023

I'm just doing it on the array, so I don't need any special "categorize a new value" function -- I'm basically just implementing cut with findfirst. But I wouldn't put too much weight on whatever I happen to be doing at the moment.


3 participants