
Bump PosLen capacity up to 80 bits #98

Open · wants to merge 2 commits into main
Conversation

quinnj (Member) commented Oct 22, 2021

cc: @nickrobinson251 @bkamins @nalimilan

I'm exploring solutions to JuliaData/CSV.jl#935. If you look at the first commit ("code"), I have two PosLen types: the normal PosLen with 64 bits, and PosLen80 with 80 bits. As I started looking into how we would use that in CSV.jl, it got real messy real fast. We'd have to allow switching based on the PosLen type and pass it through everywhere, and it would be hard to ensure performance stays correct throughout. I'm not saying it's impossible, just some real non-trivial work. There are additional complications because PosLenString/PosLenStringVector are hard-coded for PosLen right now, so we'd either have to parameterize them on an AbstractPosLen subtype, or disallow stringtype=PosLenString to avoid the mess there.
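To make that concrete, here's an illustrative sketch of the two-type approach and the parameterization it forces downstream (names and layouts are hypothetical, not the actual commit):

abstract type AbstractPosLen end
primitive type PosLen   <: AbstractPosLen 64 end   # current 64-bit layout
primitive type PosLen80 <: AbstractPosLen 80 end   # wider 80-bit layout

# Consumers like PosLenString could then no longer hard-code PosLen and
# would need a type parameter threaded through every call site:
struct PosLenString{PL <: AbstractPosLen} <: AbstractString
    buf::Vector{UInt8}
    poslen::PL
end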

Alternatively, in the second commit, I redefine the existing PosLen to have 80 bits, which results in a maximum pos of ~70TB and a maximum single-cell length of ~4GB. There are only 2-3 failing tests in Parsers.jl with this change, mainly from the removal of constants that are referenced in tests. I also ran CSV.jl's tests, and with some minor changes to its use of internal Parsers.jl consts, the tests pass there as well. So the question is: are we OK with making the PosLen size go from 64 bits to 80 bits everywhere? Obviously that will result in more memory use in the stringtype=PosLenString case, but surprisingly the impact would be pretty minimal otherwise. We use PosLen in several places in the "detection" code, but that's only looking at a small sample of rows here and there or parsing column names, so it's unlikely to make a noticeable difference in memory usage.
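For concreteness, here's a standalone sketch of what a single 80-bit PosLen could look like, with field widths implied by those maxima (46-bit pos, 2 flag bits, 32-bit len); illustrative only, not the actual commit:

# Hypothetical layout: pos (46 bits) | flags (2 bits) | len (32 bits),
# split across a UInt64 and a UInt16. Flag bits are left zero in this sketch.
struct PosLen
    hi::UInt64   # pos, flag bits, and the top 16 bits of len
    lo::UInt16   # low 16 bits of len
end

PosLen(pos::Integer, len::Integer) =
    PosLen((UInt64(pos) << 18) | (UInt64(len) >> 16), UInt16(len & 0xffff))

getpos(x::PosLen) = x.hi >> 18                      # recover the 46-bit pos
getlen(x::PosLen) = ((x.hi & 0xffff) << 16) | x.lo  # recover the 32-bit len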

Why 80 bits? It seemed like the smallest increase that gives us the largest boost in pos/len values. Note that the extra 16 bits vs. the current 64-bit PosLen are allocated as 12 bits to len and only 4 to pos. pos already had a maximum of around ~4TB, which seems like a pretty reasonable ceiling already. With 12 extra bits for len, we go from a maximum single-cell size of ~1MB to ~4GB, which also seems like a pretty generous maximum. 80 bits also seems reasonable from an alignment standpoint, since it's at least a multiple of 16, which I believe is the default alignment value for Julia arrays/strings.
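For reference, the arithmetic behind those maxima (assuming the remaining two bits in each layout go to flags):

julia> 2^42   # current max pos (42 bits), ~4TB
4398046511104

julia> 2^20   # current max len (20 bits), ~1MB
1048576

julia> 2^46   # proposed max pos (46 bits), ~70TB
70368744177664

julia> 2^32   # proposed max len (32 bits), ~4GB
4294967296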

Anyway, I'm going to let this simmer in my mind for a bit and mull things over, but I'm leaning towards going forward with it.

bkamins (Member) commented Oct 22, 2021

I think switching to 80 is OK.

nalimilan (Member) commented

Sounds reasonable, but I'd double-check the alignment issues, as AFAICT a Vector of 80-bit values takes as much space as a Vector of 128-bit values:

julia> sizeof(fill((1, Int16(1)), 1000))/1000   # (Int64, Int16): 10 bytes of data, padded to 16
16.0

julia> sizeof(fill((1, 1), 1000))/1000          # (Int64, Int64): genuinely 16 bytes
16.0

Also FWIW it seems that arrays are 64-bit aligned when they get large.

quinnj (Member, Author) commented Oct 26, 2021

Yeah, upon further investigation, there are some very foundational alignment requirements, even for primitive types, that mean PosLen will always take up 128 bits. Hmmm, I'll have to think about this a little more; unfortunately, 128 bits means we effectively "waste" quite a lot of bits for what is, IMO, a rare use case. Perhaps there's an alternative solution in CSV.jl where we can avoid using PosLen for parsing large strings, at the user's request.
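For example, even a true 80-bit primitive type ends up occupying a 16-byte slot per array element (illustrative; expected behavior on a 64-bit build, since element strides get rounded up to a power-of-two alignment):

julia> primitive type PosLen80 80 end

julia> sizeof(PosLen80)   # 10 bytes of actual data...
10

julia> sizeof(Vector{PosLen80}(undef, 1000)) / 1000   # ...but 16 bytes per element
16.0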
