Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible bug in unique for partial InlineString #32

Closed
nilshg opened this issue Jun 27, 2022 · 7 comments · Fixed by #33
Closed

Possible bug in unique for partial InlineString #32

nilshg opened this issue Jun 27, 2022 · 7 comments · Fixed by #33

Comments

@nilshg
Copy link

nilshg commented Jun 27, 2022

Just came across this when working with a column of prices in British pounds, all of which were prepended with GBP. In trying to check whether indeed all entries in the (Vector{String15}) column start with GBP, I did unique(first.(my_col, 3)) and was surprised to get back a length 196 vector with all entries showing GBP.

A reproducer is:

julia> using InlineStrings

julia> x = inlinestrings(["GBP$i" for i ∈ rand(1.0:0.01:500.0, 1_000)]);

julia> length(unique(first.(x, 3)))
994

julia> length(unique(string.(first.(x, 3))))
1

The fact that there are 994 (rather than 1,000) unique results suggests to me that where the entire string is the same (because the same "price" was randomly chosen twice in setup of the xample), the first three characters are seen as the same by unique as well.

Is this intended, i.e. is there some fundamental issue with working with "SubInlineStrings" that means I shouldn't be doing what I'm doing or is this a bug?

@bkamins
Copy link

bkamins commented Jun 27, 2022

It is a bug.

@nickrobinson251
Copy link
Collaborator

similar bug exists for last:

julia> using InlineStrings

julia> z = inlinestrings(["$i GBP" for i  rand(1.0:0.01:500.0, 1_000)]);

julia> unique(last.(z, 3))
4-element Vector{String15}:
 "GBP"
 "GBP"
 "GBP"
 "GBP"

@nickrobinson251
Copy link
Collaborator

I think the bug is in the interaction between first/last and the definition of == for inline string of the same type.

julia> abc = InlineString3("abc")
"abc"

julia> first(abc, 2) == "ab"
true

julia> first(abc, 2) == InlineString15("ab")
true

julia> first(abc, 2) == InlineString3("ab")  # whoops!
false

I think because first only changes the final byte which tracks the length of the string, whereas our definition of ==/cmp assumes that if the final byte says the string has length 2 then only the first 2 bytes are non-zero

julia> bitstring(abc)
"01100001011000100110001100000011"

julia> bitstring(first(abc, 2))
"01100001011000100110001100000010"

julia> bitstring(InlineString3("ab"))
"01100001011000100000000000000010"

So i suppose we could change first to zero-out all the unused bits, or change our definition of ==/cmp

It seems this bug is fixed, and all current tests pass, if we just remove this definitions:

(==)(x::T, y::T) where {T <: InlineString} = Base.eq_int(x, y)

Base.cmp(a::T, b::T) where {T <: InlineString} =
Base.eq_int(a, b) ? 0 : Base.ult_int(a, b) ? -1 : 1

quinnj added a commit that referenced this issue Jun 27, 2022
Fixes #32. The core issue here is we're taking a few shortcuts in some operations
like chop, chomp, first, last where we just shuffle the bits around and OR the
new length. The problem is there can be "extra bits" in the inline string that
can then affect operations like `==`. So we need to ensure in these optimized
"modifying" operations, these extra bits get zeroed out to ensure a consistent
bit representation.
@quinnj
Copy link
Member

quinnj commented Jun 27, 2022

Thanks for the report @nilshg! PR up to fix: #33. I started reworking the internal representation a bit ago, but didn't get it finished, so I'm going to try and reprioritize that do some more stress testing on all the inline strings. We need to improve its support across the ecosystem as well.

quinnj added a commit that referenced this issue Jun 27, 2022
* Ensure shuffled bits get cleared out in "modifying" operations

Fixes #32. The core issue here is we're taking a few shortcuts in some operations
like chop, chomp, first, last where we just shuffle the bits around and OR the
new length. The problem is there can be "extra bits" in the inline string that
can then affect operations like `==`. So we need to ensure in these optimized
"modifying" operations, these extra bits get zeroed out to ensure a consistent
bit representation.

* remove old code

* fix tests
@nilshg
Copy link
Author

nilshg commented Jun 28, 2022

Brilliant, thanks a lot all! Now we just need a new release ;)

@nickrobinson251
Copy link
Collaborator

Now we just need a new release ;)

New release with the fix should be out 🤞 https://github.com/JuliaStrings/InlineStrings.jl/releases/tag/v1.1.3

@nilshg
Copy link
Author

nilshg commented Jun 28, 2022

Hooray! Thanks so much - I use the package for data analysis in the context of litigation where code has to be disclosed in court, so it's always a bit awkward to depend on unreleased versions of packages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants