Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added string radix sort and sortperm + tests #27

Closed
wants to merge 17 commits into from
Closed

added string radix sort and sortperm + tests #27

wants to merge 17 commits into from

Conversation

xiaodaigh
Copy link

I have added StringRadixSortAlg so that (LSD) radix sort can be applied to strings

The default is `UInt`, so on a 64 bit machine it loads 64 bits (8 bytes) at a time.
If the `String` is shorter than 8 bytes then it's padded with 0.

Assumes machine is little-endian
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that true even with the ntoh calls?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not true. Have removed

return ntoh(unsafe_load(Ptr{T}(pointer(s, skipbytes+1))))
else
ns = (sizeof(T) - min(sizeof(T), n - skipbytes))*8
h = unsafe_load(Ptr{T}(pointer(s, skipbytes+1)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this segfault if you read past the end of the string? e.g. if the string ends on a page boundary and the next page is inaccessible?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made a segfault safe version

h = unsafe_load(Ptr{T}(pointer(s, skipbytes+1)))
h = h << ns
h = h >> ns
return ntoh(h)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to do return (ntoh(unsafe_load(...)) >> ns) << ns, i.e. do ntoh first and flip the order of the shifts, so that it still works on bigendian machines?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep. but supplanted by the segfault safe version now

if lo >= hi; return svec; end

# find the maximum string length
lens = reduce((x,y) -> max(x,sizeof(y)),0, svec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use maximum(sizeof, svec)

- `s`: a `String`
- `skipbytes`: how many bytes to skip e.g. load_bits("abc", 1) will load "bc" as bits
"""
load_bits(s::String, skipbytes = 0) = load_bits(UInt, s, skipbytes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear why this method is needed, since you always pass a type to load_bits below

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have removed

# are all values the same at this radix?
if bin[idx,j] == len; continue; end

cbin = cumsum(bin[:,j])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use cumsum! with a preallocated array. Note that bin[:,j] also allocates a copy. You might want to just write a loop to do the sum with no copies.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Currently radixsot
"""
function sortperm_radixsort(svec::AbstractVector{String}; rev::Union{Bool,Void}=nothing, order::Ordering=Forward)
siv = StringIndexVector(copy(svec), collect(1:length(svec)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually you would define a sortperm_radixsort! function that doesn't make a copy of svec, and then define sortperm_radixsort(svec, ...) = sortperm_radixsort!(copy(svec), ...)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@xiaodaigh
Copy link
Author

xiaodaigh commented Jan 4, 2018

The speed advantage of radixsort vs Base.sort seems to have vanished after implementing the segfault safe version. The performance is so bad that it's better to wait till I come up with a performant version.


function load_bits_with_padding(::Type{UInt128}, s::String, skipbytes = 0)
n = sizeof(s)
T = UInt128
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the compiler is smart enough to constant-fold a type that you assign to a local variable like this. Does it improve performance if you just paste UInt128 directly in the code?

(I think it should be possible to write this function in a way that works for all T simultaneously with no penalty by making sure the branches get optimized out, but easier to optimize the case for a single T first.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I will just replace wiht
Actually I think a macro will work better here; just generate the code for all types. It's basically the same anyway. Might look into that.

I actually initial I have one function for all types. But it keeps on crashing. I think it has to do with the Base.zext_int(UInt64, ::UInt128) which doesn't make sense; even that branch will never be reached, the compiler still complained and crashed the program every time.

if o == Reverse
sorttwo!(.~load_bits.(UInt128, svec, skipbytes), svec)
else
sorttwo!(load_bits.(UInt128, svec, skipbytes), svec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this allocates arrays over and over; would be better to allocate a bits array once and overwrite it in-place with bits .= load_bits.(...)

@xiaodaigh
Copy link
Author

more comprehensive test cases to come to test all branches

@xiaodaigh
Copy link
Author

Bump!

String radix sort is 3x faster than base sort and passes all tests.

using SortingAlgorithms
x = rand([randstring(8) for i=1:1_000_000], 100_000_00);
@time sort(x) |> issorted #11s
@time sort(x, alg=StringRadixSort) |> issorted # 3.7s

@nalimilan
Copy link
Contributor

Any reason why this cannot be just the String implementation of the RadixSort algorithm, rather than a separate StringRadixSort one?

@xiaodaigh
Copy link
Author

@nalimilan I tried to use RadixSort to start with but then I found that sort doesn't dispatch on alg, so I can't send it down the path I want. One can solve this by overloading sort! with this signiture sort!(x::Vector{String}, ..., ::RadixSort). So that can be solved.

But the real issue is sortperm as that function by default transforms the data using the Perm ordering and calls sort!(indexes(x),...Perm(x), RadixSort) so it goes down the RadixSort function! But in that function it will fail because it will try to sort order.data which is of string and not bits type so it fails! The only way to solve this is to overload sort!(x,....,RadixSort, o::Base.Perm{String}) but I believe there is no Perm{String} type. So the only way to solve this is to overload sortperm(x::Vector{String}). But I think the idea is not to overload sortperm and sort but to overload the ! function.

@nalimilan
Copy link
Contributor

Sorry, I'm not familiar with that machinery, but these implementation issues should definitely be fixed in some way. It would be too bad to have RadixSort fail for string vectors and require people to choose a different algorithm manually just because of implementation problems. For example, it could be useful for StatsBase.countmap to have RadixSort support strings.

@xiaodaigh
Copy link
Author

I think a way to solve it is some kind of method-piracy. Can overload the sort method in Base with sort(Vector{String}) and then check if alg == RadixSort, if it is then apply the fast string algorithm if not then call Base.sort. But notice there is no sort method in SortingAlgorithms.jl atm, so this will be an "innovation".

@nalimilan
Copy link
Contributor

Hmm, that wouldn't be great. Maybe we need to change something in Base so that it's possible to dispatch on the algorithm?

@xiaodaigh
Copy link
Author

I can't find it now but I was involved in a discussion, basically, either base adds in a sortperm(x,alg) signature or Julia can dispatch on named arguments so we can overload sortperm(x::Vector{String}, alg=RadixSort). If not then sortperm() calls sort!(x, alg = RadixSort) which will fail as there is no RadixSort method for Strings.

@nalimilan
Copy link
Contributor

Maybe JuliaLang/julia#24770?

Anyway, I'd say it would be fine to make a PR against Base to add sortperm(x,alg). Could even happen before 0.7 if that's not too disruptive.

@xiaodaigh
Copy link
Author

I recall Chris R saying that the real fix is to dispatch on named arguments... can't find that discussion now.

@nalimilan
Copy link
Contributor

You mean on keyword arguments? That won't happen in 1.0, and maybe never, so we'd better find a solution until then... :-)

@nalimilan
Copy link
Contributor

Could you file an issue in Base to discuss the problem with sortperm?

@ViralBShah
Copy link

Is this something we want to try and get merged?

"""
sorttwo!(vs, index)

Sort both the `vs` and reorder `index` at the same. This allows for faster sortperm
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds similar to what I experimented at #33.

@xiaodaigh xiaodaigh closed this Sep 27, 2020
@nalimilan
Copy link
Contributor

Why close this? Can we try to find a way forward by discussing the needed changes in Base?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants