Skip to content
This repository has been archived by the owner on May 4, 2019. It is now read-only.

Type-instability in PooledDataArray's? #57

Open
johnmyleswhite opened this issue Jan 16, 2014 · 7 comments
Open

Type-instability in PooledDataArray's? #57

johnmyleswhite opened this issue Jan 16, 2014 · 7 comments
Labels

Comments

@johnmyleswhite
Copy link
Member

While trying to clean up the PDA code, I realized that our current approach, which allows PDA's to change the type of their references field, might be introducing type-instability into our code. I don't know if there are specific cases where this is a problem yet, but it seems worth starting to debate.

@johnmyleswhite
Copy link
Member Author

Following up on this, I'd really like to stop having the refs field for a PDA vary depending on the size of the pool. It makes the code for PDA's much more complex without gaining us that much -- and potentially creates subtle potential performance problems associated with type instability if we're going to ensure that setindex! operations will always maintain referential integrity in the pool.

Can we agree to only use Uint32 again, as was originally done in the first draft of the PDA code? This takes no more than 4x as much space in the worst case scenario. It removes a lot of complexity from the codebase. And we wouldn't overflow Uint32 for most datasets.

For example, the smallest vector of Int64 that could overflow a Uint32 pool would be a vector containing 2^32 - 1 distinct values. By my calculation, this vector takes up more than 32 GB of RAM just to be represented in raw Array form. If people are constructing vectors with so many unique values, I'd encourage them not to use PDA's and stick with DA's.

@nalimilan
Copy link
Member

Is it really so complex to handle? I think it would be too bad to lose the ability to use a compact storage when the number of levels is under 255, which is typically the case. A factor of 4 is not something to disregard. Maybe with some hacks it would even be possible to use only 4 bits for PDAs with less than 16 levels, which is quite common too. I think people working with DNA sequences would appreciate this kind of efficient storage.

Without evidence that the type instability of the refs is a performance issue in some places, I wouldn't remove this feature. And even if a few places suffered from this, without making the code too ugly it should be possible to write kernels as separate functions which would be specialized on the type.

@johnmyleswhite
Copy link
Member Author

@nalimilan: I hope this doesn't come off as rude, but some of your comments occasionally make me feel as if you don't realize how large a chunk of my life has been and is spent maintaining this codebase. On almost every day of my life for the last year and a half, I wake up, go to work and then, once I get home, work on this codebase until I go to bed. Every additional line of code that I have to maintain is a meaningful degradation in my quality of life.

That said, I'm still open to debating this issue. But it would make my days a lot more enjoyable if you kept in mind how much stress I already have to deal with each day to keep Julia moving forward.

@nalimilan
Copy link
Member

Yeah, I'm sorry if my comments sound negative and like I'm nagging you. ;-)

Unfortunately, I can't spend that much time working on Julia. Since my primary work is my PhD thesis, I can only help with relatively small things, and give the point of view of somebody doing social science in case it is useful. Do what you want of my remarks; I'm just underlining what seems important for users like me, and what doesn't, but in the end of course the one who does the work gets to decide what are the priorities. Also, you have set the bar high, that's why I'm always asking for more! :-)

@johnmyleswhite
Copy link
Member Author

No worries at all. I really appreciate your comments, since you've both very astute and very polite. And, even though it often create more works for me to do, you're usually dead on accurate about a lot of things that I haven't thought out well enough. So please keep letting us know where you'd like Julia to go. You're doing a good job of keeping me honest. And I really appreciate the work you've been doing to build a Julia RPM.

I mostly wanted to communicate to you that I occasionally feel burnt out by the amount of work that's still left to do on Julia. To keep going despite that, I find it helpful to let myself get away with less than ideal design decisions at times. Offering fewer features and less flexibility to end users makes it much easier for me to get the core infrastructure designed correctly.

@nalimilan
Copy link
Member

Be careful, if you encourage me you're going to get even more feature requests. ;-)

Regarding the issue at hand, it may still be reasonable to only support Uint32 for now, and note somewhere that in the future shorter storage types may be allowed. More generally, I think you could alleviate your burden by writing somewhere public a list of things which are considered useful in the long-term, but lower priority, and which could be implemented without breaking the API (or not breaking it too much). That way, you can make people aware of how great the final system is going to be, and that they can help in that regard if they particularly need a given feature; and then you can concentrate on getting the essential parts working.

@johnmyleswhite
Copy link
Member Author

I am working towards a roadmap for DataArrays, which should serve that purpose.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

2 participants