New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing values #208

Merged
merged 31 commits into from Aug 16, 2016

Conversation

Projects
None yet
3 participants
@lokeshh
Contributor

lokeshh commented Aug 3, 2016

Addresses #137

Show outdated Hide outdated spec/vector_spec.rb
let(:dv) { Daru::Vector.new [1, 2, 3, 1, 2, nil, nil] }
it { expect(dv.count_values 1, 2).to eq 4 }
it { expect(dv.count_values nil).to 2 }
it { expect(dv.count_values 3, Float::NAN).to 1 }

This comment has been minimized.

@lokeshh

lokeshh Aug 3, 2016

Contributor

Is this sort of testing fine, i.e. writing multiple tests in the same context?

@lokeshh

lokeshh Aug 3, 2016

Contributor

Is this sort of testing fine, i.e. writing multiple tests in the same context?

This comment has been minimized.

@zverok

zverok Aug 4, 2016

Collaborator

Absolutely.

@zverok

zverok Aug 4, 2016

Collaborator

Absolutely.

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 6, 2016

Contributor

I'm not clear about is whether we are removing set_missing_positions or not? If we remove it then its not possible to do the caching for nil and Float::NAN and if we do not remove it then, how will we get rid of lazy update which makes Daru slow.

Contributor

lokeshh commented Aug 6, 2016

I'm not clear about is whether we are removing set_missing_positions or not? If we remove it then its not possible to do the caching for nil and Float::NAN and if we do not remove it then, how will we get rid of lazy update which makes Daru slow.

@v0dro

This comment has been minimized.

Show comment
Hide comment
@v0dro

v0dro Aug 6, 2016

Member

Let set_missing_positions remain for now. This will also involve keeping lazy update, but I really don't see a fast way of updating the missing values cache for now. We'll probably need to port it to C or something funky like that.

Member

v0dro commented Aug 6, 2016

Let set_missing_positions remain for now. This will also involve keeping lazy update, but I really don't see a fast way of updating the missing values cache for now. We'll probably need to port it to C or something funky like that.

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 6, 2016

Contributor

We can also do something as follows:

We will store the missing values positions in two variables, one for nil and one for Float::NAN. They will store the positions of the missing data. Whenever an update is made instead of calling set_missing_positions we will set @missing_positions_outdated = false. And whenever in the future we require operation regarding missing value we will check @missing_positions_outdated, if its false, we need not recompute the missing positions and if its true we need to recompute the missing positions and update the appropriate variables.

WDYT about this approach?

Contributor

lokeshh commented Aug 6, 2016

We can also do something as follows:

We will store the missing values positions in two variables, one for nil and one for Float::NAN. They will store the positions of the missing data. Whenever an update is made instead of calling set_missing_positions we will set @missing_positions_outdated = false. And whenever in the future we require operation regarding missing value we will check @missing_positions_outdated, if its false, we need not recompute the missing positions and if its true we need to recompute the missing positions and update the appropriate variables.

WDYT about this approach?

@v0dro

This comment has been minimized.

Show comment
Hide comment
@v0dro

v0dro Aug 6, 2016

Member

We will store the missing values positions in two variables, one for nil and one for Float::NAN.

Why do nil and Float::NAN need to be in separate variables?

Whenever an update is made instead of calling set_missing_positions we will set @missing_positions_outdated = false.

Update to what? The vector or the missing positions cache?

And whenever in the future we require operation regarding missing value we will check @missing_positions_outdated, if its false, we need not recompute the missing positions and if its true we need to recompute the missing positions and update the appropriate variables.

Works! Make sure you reflect this everywhere in the code, though.

Member

v0dro commented Aug 6, 2016

We will store the missing values positions in two variables, one for nil and one for Float::NAN.

Why do nil and Float::NAN need to be in separate variables?

Whenever an update is made instead of calling set_missing_positions we will set @missing_positions_outdated = false.

Update to what? The vector or the missing positions cache?

And whenever in the future we require operation regarding missing value we will check @missing_positions_outdated, if its false, we need not recompute the missing positions and if its true we need to recompute the missing positions and update the appropriate variables.

Works! Make sure you reflect this everywhere in the code, though.

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 7, 2016

Contributor

Why do nil and Float::NAN need to be in separate variables?

User might ask for reject_values(nil) or reject_values(Float::NAN).

Update to what? The vector or the missing positions cache?

I meant update to vector.

Works! Make sure you reflect this everywhere in the code, though.

Also we won't require lazy_update anymore now.

Contributor

lokeshh commented Aug 7, 2016

Why do nil and Float::NAN need to be in separate variables?

User might ask for reject_values(nil) or reject_values(Float::NAN).

Update to what? The vector or the missing positions cache?

I meant update to vector.

Works! Make sure you reflect this everywhere in the code, though.

Also we won't require lazy_update anymore now.

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 7, 2016

Contributor

I'm deprecating the functions has_missing_data? etc., so they will be working but I'm getting rid of their functionality of working with arbitrary assigned missing values like 1, 2, etc. which are other than nil, Float::NAN. In other words missing_values = [1, 2] would have no effect. This makes it easy for me to remove the lazy_update and set_missing_positions. lazy_update and set_missing_positions have to removed to see the expected speed up. Let me know if its fine.

Contributor

lokeshh commented Aug 7, 2016

I'm deprecating the functions has_missing_data? etc., so they will be working but I'm getting rid of their functionality of working with arbitrary assigned missing values like 1, 2, etc. which are other than nil, Float::NAN. In other words missing_values = [1, 2] would have no effect. This makes it easy for me to remove the lazy_update and set_missing_positions. lazy_update and set_missing_positions have to removed to see the expected speed up. Let me know if its fine.

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 7, 2016

Contributor

Kindly have a look. I've made some radical changes like removed some methods like exist?, missing_values=, etc. and finally got rid of lazy_update, set_missing_positions to speed up Daru. Some common methods like has_missing_data?, only_valid, etc have been deprecated and been kept though. It might cause inconvenience for the next Daru release for methods like missing_values won't be working but I insist removing them as they will lead to better performance. Now I will carry on caching the missing positions of nil and Float::NAN.

Contributor

lokeshh commented Aug 7, 2016

Kindly have a look. I've made some radical changes like removed some methods like exist?, missing_values=, etc. and finally got rid of lazy_update, set_missing_positions to speed up Daru. Some common methods like has_missing_data?, only_valid, etc have been deprecated and been kept though. It might cause inconvenience for the next Daru release for methods like missing_values won't be working but I insist removing them as they will lead to better performance. Now I will carry on caching the missing positions of nil and Float::NAN.

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 7, 2016

Contributor

Here are some benchmarks. The performance gains are staggering.

I ran below code for this branch and master branch:

require 'benchmark'

n = 1000
dv = Daru::Vector.new 1..100000

Benchmark.bm do |x|
  x.report { 1000.times { dv[0, 10, 20, 30] = nil } }
end

Result for current branch:

       user     system      total        real
   0.010000   0.000000   0.010000 (  0.004940)

Result for master branch:

       user     system      total        real
  36.160000   0.260000  36.420000 ( 39.420752)
Contributor

lokeshh commented Aug 7, 2016

Here are some benchmarks. The performance gains are staggering.

I ran below code for this branch and master branch:

require 'benchmark'

n = 1000
dv = Daru::Vector.new 1..100000

Benchmark.bm do |x|
  x.report { 1000.times { dv[0, 10, 20, 30] = nil } }
end

Result for current branch:

       user     system      total        real
   0.010000   0.000000   0.010000 (  0.004940)

Result for master branch:

       user     system      total        real
  36.160000   0.260000  36.420000 ( 39.420752)
@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok Aug 7, 2016

Collaborator

Here are some benchmarks. The performance gains are staggering.

Fascinating! It's a giant leap forward. Now it looks like we can even think about replacing where with normal select :)

Bravo, @lokeshh!

Collaborator

zverok commented Aug 7, 2016

Here are some benchmarks. The performance gains are staggering.

Fascinating! It's a giant leap forward. Now it looks like we can even think about replacing where with normal select :)

Bravo, @lokeshh!

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 8, 2016

Contributor

Thanks @zverok.

I have been running benchmarks on implementation with caching of nil's positions and implementation with no caching of nils and surprisingly I'm roughly the same time with caching. I couldn't figure out whats going on. The code seems correct. I'm looking at this in more detail. Is there some gem to profile which lines in the method are taking a long time and which not?

Contributor

lokeshh commented Aug 8, 2016

Thanks @zverok.

I have been running benchmarks on implementation with caching of nil's positions and implementation with no caching of nils and surprisingly I'm roughly the same time with caching. I couldn't figure out whats going on. The code seems correct. I'm looking at this in more detail. Is there some gem to profile which lines in the method are taking a long time and which not?

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 8, 2016

Contributor

It seems to me that its because operations such as #at, #[] are more expensive such that finding missing values in a vector is almost free, therefore caching of nil and Float::NAN is not helping.

Update: Here's the benchmark I used to compare the performance of caching with non-caching implementation:

require 'benchmark'

n = 10000
dv = Daru::Vector.new 1..n


Benchmark.bm do |x|
  x.report do
    dv.reject_values nil
  end

  x.report do
    100.times { dv.reject_values nil }
  end
end

Result with caching:

       user     system      total        real
   0.020000   0.010000   0.030000 (  0.034508)
   1.040000   0.020000   1.060000 (  1.059687)

Result without caching:

       user     system      total        real
   0.030000   0.010000   0.040000 (  0.034908)
   1.330000   0.010000   1.340000 (  1.344577)

I expected to see the second report in caching implementation way less than the second report in non-caching implementation because after first call to reject_values, the positions of nils are getting cached in implementation with caching. But this is not the case, there's very less improvement.

Contributor

lokeshh commented Aug 8, 2016

It seems to me that its because operations such as #at, #[] are more expensive such that finding missing values in a vector is almost free, therefore caching of nil and Float::NAN is not helping.

Update: Here's the benchmark I used to compare the performance of caching with non-caching implementation:

require 'benchmark'

n = 10000
dv = Daru::Vector.new 1..n


Benchmark.bm do |x|
  x.report do
    dv.reject_values nil
  end

  x.report do
    100.times { dv.reject_values nil }
  end
end

Result with caching:

       user     system      total        real
   0.020000   0.010000   0.030000 (  0.034508)
   1.040000   0.020000   1.060000 (  1.059687)

Result without caching:

       user     system      total        real
   0.030000   0.010000   0.040000 (  0.034908)
   1.330000   0.010000   1.340000 (  1.344577)

I expected to see the second report in caching implementation way less than the second report in non-caching implementation because after first call to reject_values, the positions of nils are getting cached in implementation with caching. But this is not the case, there's very less improvement.

@v0dro

This comment has been minimized.

Show comment
Hide comment
@v0dro

v0dro Aug 9, 2016

Member

Hmmm it's interesting to see that caching isn't really helping much. Can you try with profiling and see what's actually going on? I think you can try ruby-prof for this.

Member

v0dro commented Aug 9, 2016

Hmmm it's interesting to see that caching isn't really helping much. Can you try with profiling and see what's actually going on? I think you can try ruby-prof for this.

@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok Aug 9, 2016

Collaborator

I'm looking at this in more detail. Is there some gem to profile which lines in the method are taking a long time and which not?

Yeah, like @v0dro says, ruby-prof is pretty easy to use. You just do something like this:

RubyProf.start
(code you want to profile)
result = RubyProf.stop
printer = RubyProf::GraphHtmlPrinter.new(result)
printer.print(File.open('profile.html', 'w'))

And then that file would have pretty list of what've taken how many time, cross-linked to other methods it uses and stuff.

Collaborator

zverok commented Aug 9, 2016

I'm looking at this in more detail. Is there some gem to profile which lines in the method are taking a long time and which not?

Yeah, like @v0dro says, ruby-prof is pretty easy to use. You just do something like this:

RubyProf.start
(code you want to profile)
result = RubyProf.stop
printer = RubyProf::GraphHtmlPrinter.new(result)
printer.print(File.open('profile.html', 'w'))

And then that file would have pretty list of what've taken how many time, cross-linked to other methods it uses and stuff.

@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok Aug 11, 2016

Collaborator

Typically it is not a very good practice to do so (wrap entire test set in each). Sometimes it unavoidable (like trying several different backends in Daru), but for testing two different types of vector it looks like code smell. Either their behavior depends on type (and this should be different test sets) or it doesn't (and this should be one test set for generic vector, maybe with one-two tests in the end demonstratings "for category vector it is all the same").

Collaborator

zverok commented on spec/shared/missing_values_spec.rb in b14b452 Aug 11, 2016

Typically it is not a very good practice to do so (wrap entire test set in each). Sometimes it unavoidable (like trying several different backends in Daru), but for testing two different types of vector it looks like code smell. Either their behavior depends on type (and this should be different test sets) or it doesn't (and this should be one test set for generic vector, maybe with one-two tests in the end demonstratings "for category vector it is all the same").

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 11, 2016

Contributor

Ok... So how should we implement tests cases in this case? Test cases for category module have to be exhausted like in ordinary vector since they will have separate reject_values, include_values?, etc.

Contributor

lokeshh replied Aug 11, 2016

Ok... So how should we implement tests cases in this case? Test cases for category module have to be exhausted like in ordinary vector since they will have separate reject_values, include_values?, etc.

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 11, 2016

Contributor

@zverok I'm making the test set for Categorical Vector separate because there are some behavioral differences like inreject values in case of categorical vector, reject_values should also remove the rejected categories and also there shouldn't be specs for caching in Categorical Vector. So, I think its best to keep the testing separate as per what you have said and some behavioral differences that are there. Do you agree?

Contributor

lokeshh replied Aug 11, 2016

@zverok I'm making the test set for Categorical Vector separate because there are some behavioral differences like inreject values in case of categorical vector, reject_values should also remove the rejected categories and also there shouldn't be specs for caching in Categorical Vector. So, I think its best to keep the testing separate as per what you have said and some behavioral differences that are there. Do you agree?

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok Aug 12, 2016

Collaborator

There is two options: either inclusion of category module changes behavior about that, or doesn't change. If it doesn't, there could be only one spec to check this (in this file, for ex.), just to be sure. If it does, it should be specified properly.

OK, there is third case: behaviors are exactly the same, but implementations are independent, and each should be tested. In this case I'd recommend make large shared examples group, and then include it into tests for category/non-category vectors.

Collaborator

zverok replied Aug 12, 2016

There is two options: either inclusion of category module changes behavior about that, or doesn't change. If it doesn't, there could be only one spec to check this (in this file, for ex.), just to be sure. If it does, it should be specified properly.

OK, there is third case: behaviors are exactly the same, but implementations are independent, and each should be tested. In this case I'd recommend make large shared examples group, and then include it into tests for category/non-category vectors.

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 12, 2016

Contributor

@zverok In our case, the behavior is not exactly not the same, so I guess I will go with writing different specs. Right?

Contributor

lokeshh commented Aug 12, 2016

@zverok In our case, the behavior is not exactly not the same, so I guess I will go with writing different specs. Right?

@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok Aug 12, 2016

Collaborator

@lokeshh Yep, while extracting the same behaviors into shared example groups.

Collaborator

zverok commented Aug 12, 2016

@lokeshh Yep, while extracting the same behaviors into shared example groups.

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 12, 2016

Contributor

@lokeshh Yep, while extracting the same behaviors into shared example groups.

@zverok Do you mean extracting the same behaviors in common specs and then using include_context?

Contributor

lokeshh commented Aug 12, 2016

@lokeshh Yep, while extracting the same behaviors into shared example groups.

@zverok Do you mean extracting the same behaviors in common specs and then using include_context?

@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok Aug 12, 2016

Collaborator

Yep.

Collaborator

zverok commented Aug 12, 2016

Yep.

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 12, 2016

Contributor

Alright. Thanks!

Contributor

lokeshh commented Aug 12, 2016

Alright. Thanks!

@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok Aug 14, 2016

Collaborator

rename_hash = [*old_values].map...., and you'll don't need line 677

Collaborator

zverok commented on lib/daru/category.rb in 3f7ef06 Aug 14, 2016

rename_hash = [*old_values].map...., and you'll don't need line 677

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 14, 2016

Contributor

Nice trick.

Contributor

lokeshh replied Aug 14, 2016

Nice trick.

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 15, 2016

Contributor

Can't use because [[*nil]] gives [[]].

[14] pry(main)> [[*nil]]
=> [[]]
Contributor

lokeshh replied Aug 15, 2016

Can't use because [[*nil]] gives [[]].

[14] pry(main)> [[*nil]]
=> [[]]
@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok Aug 14, 2016

Collaborator

👍

Collaborator

zverok commented on 3f7ef06 Aug 14, 2016

👍

Show outdated Hide outdated spec/vector_spec.rb
# its(:'index.to_a') { is_expected.to eq [11, 13, 14, 15, 17, 18] }
include_context 'reject values checker',
[dv, [nil]] => [[1, 3, :a, Float::NAN, Float::NAN, 1],
[11, 13, 14, 15, 17, 18]]
end

This comment has been minimized.

@lokeshh

lokeshh Aug 15, 2016

Contributor

@zverok This is driving me nuts. I'm trying to make this work but it isn't. It gives error something like 'dv' is not available on an example group. Could you please have a look?

Another thing I'm thinking is that this code doesn't look readable at all. Passing the expected index, along with expected values of the vector and also passing along the type makes it hard to read what's going on. What do you say?

@lokeshh

lokeshh Aug 15, 2016

Contributor

@zverok This is driving me nuts. I'm trying to make this work but it isn't. It gives error something like 'dv' is not available on an example group. Could you please have a look?

Another thing I'm thinking is that this code doesn't look readable at all. Passing the expected index, along with expected values of the vector and also passing along the type makes it hard to read what's going on. What do you say?

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 15, 2016

Contributor

The PR is complete. I will start adding documentation. @zverok I will take care of extracting the common behavior later on.

Here's the short summary:

Methods added in Daru::Vector (and category):

  • reject_values
  • include_values?
  • indexes
  • count_values
  • replace values

Methods added in Daru::DataFrame:

  • reject_values
  • include_values?
  • replace_values

Here are the benchmarks.

Methods removed in Daru::Vector:

  • missing_values
  • missing_values=
  • update
  • exists?
  • set_missing_positions

and other methods have been deprecated.

Contributor

lokeshh commented Aug 15, 2016

The PR is complete. I will start adding documentation. @zverok I will take care of extracting the common behavior later on.

Here's the short summary:

Methods added in Daru::Vector (and category):

  • reject_values
  • include_values?
  • indexes
  • count_values
  • replace values

Methods added in Daru::DataFrame:

  • reject_values
  • include_values?
  • replace_values

Here are the benchmarks.

Methods removed in Daru::Vector:

  • missing_values
  • missing_values=
  • update
  • exists?
  • set_missing_positions

and other methods have been deprecated.

@lokeshh lokeshh changed the title from [WIP] Missing values to Missing values Aug 16, 2016

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 16, 2016

Contributor

All done. This is ready to be merged.

Contributor

lokeshh commented Aug 16, 2016

All done. This is ready to be merged.

@v0dro

This comment has been minimized.

Show comment
Hide comment
@v0dro

v0dro Aug 16, 2016

Member

Works. Something will probably break in statsample so you might need to fix it.

Member

v0dro commented Aug 16, 2016

Works. Something will probably break in statsample so you might need to fix it.

@v0dro v0dro merged commit dadebb0 into SciRuby:master Aug 16, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh Aug 18, 2016

Contributor

@v0dro. Done. Modified the Statsample to work with this PR.

Contributor

lokeshh commented Aug 18, 2016

@v0dro. Done. Modified the Statsample to work with this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment