Sorting vectors and dataframes with nils #39

Closed
gnilrets opened this issue Dec 12, 2015 · 9 comments

@gnilrets
Contributor

I was starting to work on figuring out a solution for sorting vectors and data frames with nil values. I noticed you've got a pending test - https://github.com/v0dro/daru/blob/master/spec/vector_spec.rb#L488.

I'd argue that nils should be placed at the beginning of the vector.

  1. Several SQL databases (for example MySQL, SQL Server, and SQLite) sort NULLs first in ascending order by default.
  2. Empty ('') and blank (' ') strings are functionally similar to nil when dealing with datasets, and they sort before all other strings. It would be strange to have empty/blank strings at the top and nils at the bottom.
  3. nil data is usually "special" and if it's at the top of a sort, you're more likely to notice.
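
A minimal sketch of the nils-first ordering on a plain Ruby Array. This is not Daru's implementation, and the helper name `sort_nils_first` is hypothetical:

```ruby
# Hypothetical helper: sort a plain Array so that nils come first.
# nil cannot be compared with other values via <=>, so the nils are
# split off before the remaining values are sorted.
def sort_nils_first(values)
  nils, present = values.partition(&:nil?)
  nils + present.sort
end

sorted = sort_nils_first([3, nil, 1, nil, 2])
```

Splitting the nils off first avoids the `ArgumentError` that `Array#sort` raises when it tries to compare `nil` with an Integer or String.
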
@v0dro
Member

v0dro commented Dec 13, 2015

Hmmm that makes sense. I'm good with placing nils at the top.

Additionally, sorting currently uses a handmade Ruby variant of quicksort, which makes it slower than, say, Array#sort. If you have a way of doing this faster (maybe by using Array#sort somewhere so things happen at the C level and are fast) while keeping the functionality the same, I'm all ears!

@gnilrets
Contributor Author

Why do you use a handmade variant of quicksort rather than Ruby's native sort?

@v0dro
Member

v0dro commented Feb 28, 2016

The problem was that I couldn't figure out a way to reorder the index along with the data, so I went ahead with a handmade quicksort to keep control over the index objects as well.

Now that I think of it, I think zipping the indexes and data into tuples would do the trick, but I'm not sure if this can be extended to DataFrame.
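
A rough sketch of the zipping idea on plain Ruby arrays, assuming the data and index are parallel arrays of equal length. The method name `sort_with_index` is hypothetical and this is not the code that ended up in Daru:

```ruby
# Hypothetical helper: sort `data` and carry the matching index labels along.
# Array#sort_by is not guaranteed to be stable, so the original position is
# added to the sort key to keep equal values in their original order.
def sort_with_index(data, index)
  return [[], []] if data.empty?

  pairs = data.zip(index) # [[value, label], ...]
  sorted = pairs.each_with_index
                .sort_by { |(value, _label), pos| [value, pos] }
                .map(&:first)
  sorted.transpose # => [sorted_data, reordered_index]
end

sorted_data, sorted_index = sort_with_index([30, 10, 20, 10], [:a, :b, :c, :d])
```

Here `sorted_data` is `[10, 10, 20, 30]` and `sorted_index` is `[:b, :d, :c, :a]`, so each label still sits next to its value.
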

@lokeshh
Member

lokeshh commented Feb 28, 2016

I think sort_by could be used to replace the handmade variant, couldn't it?

@v0dro
Member

v0dro commented Feb 28, 2016

Maybe only for Daru::Vector. DataFrame requires sorting by only specific columns, and in a nested manner.

lokeshh mentioned this issue Feb 29, 2016
@lokeshh
Member

lokeshh commented Feb 29, 2016

> Now that I think of it, I think zipping the indexes and data into tuples would do the trick, but I'm not sure if this can be extended to DataFrame.

I went ahead and took the liberty of implementing your idea. There's a speedup.

> DataFrame requires sorting by only specific columns, and in a nested manner.

I am not sure what that means.

@lokeshh
Member

lokeshh commented Mar 1, 2016

> DataFrame requires sorting by only specific columns, and in a nested manner.

Oh, I get what it means now. I am thinking of sorting the DataFrame multiple times, once per column given. Say the columns to sort by are [:a, :b]. First we sort by :b and then by :a, and we'll have the sorted DataFrame. Would that be fast?
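
A sketch of this multi-pass idea on an array of row hashes rather than a real DataFrame. It only works if every pass is stable, and Ruby's `sort_by` is not guaranteed to be, so the hypothetical helper `stable_sort_by` adds the row position as a tie-breaker:

```ruby
# Hypothetical helper: a stable sort_by built by adding the original
# position to the sort key.
def stable_sort_by(rows)
  rows.each_with_index
      .sort_by { |row, pos| [yield(row), pos] }
      .map(&:first)
end

rows = [
  { a: 2, b: 1 },
  { a: 1, b: 2 },
  { a: 2, b: 0 },
  { a: 1, b: 1 }
]

# Sort by the least significant column first, then by the most significant.
by_b = stable_sort_by(rows) { |row| row[:b] }
by_a_then_b = stable_sort_by(by_b) { |row| row[:a] }
```

Because the second pass keeps ties in the order left by the first, rows with equal `:a` stay ordered by `:b`. Each extra sort column costs one more full pass over the data.
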

@v0dro
Member

v0dro commented Mar 1, 2016

Yeah, but when you sort by :a first, only the elements in :b that correspond to equal elements in :a should be sorted. The sort cascades onto successive vectors, and the order can change depending on the use case.
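
One way to picture the cascade is a single pass with a comparator that only consults a later column when the earlier ones tie. The sketch below uses row hashes and hypothetical names (`compare_rows`, `nested_sort`, and an `order` hash mapping each column to true for ascending). It assumes no nils in the sort columns and is not the code from the linked pull requests:

```ruby
# Hypothetical comparator: walk the columns in priority order and return
# the first non-zero comparison, flipped for descending columns.
def compare_rows(x, y, order)
  order.each do |column, ascending|
    cmp = x[column] <=> y[column]
    next if cmp.zero?
    return ascending ? cmp : -cmp
  end
  0 # rows are equal on every sort column
end

def nested_sort(rows, order)
  rows.sort { |x, y| compare_rows(x, y, order) }
end

rows = [
  { a: 2, b: 1 },
  { a: 1, b: 2 },
  { a: 2, b: 0 },
  { a: 1, b: 1 }
]

# :a ascending, then :b descending within equal values of :a.
nested = nested_sort(rows, { a: true, b: false })
```

The `order` hash relies on Ruby hashes preserving insertion order, so the first key is the most significant column.
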

@v0dro
Member

v0dro commented Mar 9, 2016

Solved in #67 and #63.

v0dro closed this as completed Mar 9, 2016