New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Categorical Index and Categorical Data #134

Merged
merged 238 commits into from Jul 8, 2016

Conversation

Projects
None yet
4 participants
@lokeshh
Contributor

lokeshh commented May 23, 2016

The coding begins!

This PR adds CategoricalIndex to Daru.

@lokeshh lokeshh changed the title from Categorical Index to [WIP] Categorical Index May 23, 2016

Show outdated Hide outdated spec/index_spec.rb
it "retrives all elements belonging to that category" do
# The idea is to use #cat when retrieving numeric categories
expect(@idx2.cat[0]).to eq([0, 2, 3])
end

This comment has been minimized.

@lokeshh

lokeshh May 23, 2016

Contributor

Given a categorical index idx, idx[:a] returns all positional values which have value :a and idx[0, 1] returns [0, 1].

One point of discussion is what should happen when the category is numeric. I've chose to use #cat to identify that category is being mentioned and not positional index when the categories are numeric. When dealing with non-numeric categories #cat would work similar to #[].

@lokeshh

lokeshh May 23, 2016

Contributor

Given a categorical index idx, idx[:a] returns all positional values which have value :a and idx[0, 1] returns [0, 1].

One point of discussion is what should happen when the category is numeric. I've chose to use #cat to identify that category is being mentioned and not positional index when the categories are numeric. When dealing with non-numeric categories #cat would work similar to #[].

This comment has been minimized.

@agisga

agisga May 23, 2016

Hm.. not sure what might be a better solution. One other possibility that I can think of is to use idx["0"] vs idx[0] to denote category or positional index respectively.

@agisga

agisga May 23, 2016

Hm.. not sure what might be a better solution. One other possibility that I can think of is to use idx["0"] vs idx[0] to denote category or positional index respectively.

This comment has been minimized.

@v0dro

v0dro May 23, 2016

Member

Or the #[] function can handle generic data (i.e. it will only refer to the categories that already there in the data irrespective of numeric or non-numeric) and a separate function called #position will return positional index.

@v0dro

v0dro May 23, 2016

Member

Or the #[] function can handle generic data (i.e. it will only refer to the categories that already there in the data irrespective of numeric or non-numeric) and a separate function called #position will return positional index.

This comment has been minimized.

@v0dro

v0dro May 23, 2016

Member

And have a #pos method as shorthand.

@v0dro

v0dro May 23, 2016

Member

And have a #pos method as shorthand.

This comment has been minimized.

@v0dro

v0dro May 23, 2016

Member

If the category does not contain numerical indexes, specifying a number in #[] will return the positional index.

If the category has numerical indexes, specifying a number in #[] will return the actual categories.

If a user is sure that he must have the positional index, he should use the #position method. This is how it works in Daru::Vector#[] (thought the #position function has not been implemented yet so specifying a number in case of a numerical index returns the position pointed to by the index.)

@v0dro

v0dro May 23, 2016

Member

If the category does not contain numerical indexes, specifying a number in #[] will return the positional index.

If the category has numerical indexes, specifying a number in #[] will return the actual categories.

If a user is sure that he must have the positional index, he should use the #position method. This is how it works in Daru::Vector#[] (thought the #position function has not been implemented yet so specifying a number in case of a numerical index returns the position pointed to by the index.)

This comment has been minimized.

@lokeshh

lokeshh May 23, 2016

Contributor

If the category does not contain numerical indexes, specifying a number in #[] will return the positional index.

If the category has numerical indexes, specifying a number in #[] will return the actual categories.

@v0dro Is this for Daru::Vector#[] or Daru::Index#[]?

@lokeshh

lokeshh May 23, 2016

Contributor

If the category does not contain numerical indexes, specifying a number in #[] will return the positional index.

If the category has numerical indexes, specifying a number in #[] will return the actual categories.

@v0dro Is this for Daru::Vector#[] or Daru::Index#[]?

This comment has been minimized.

@lokeshh

lokeshh May 23, 2016

Contributor

I'm confused. @v0dro Could you summarize for Daru::Index#[] what you said?

Currently this is how Daru::Index works:

[27] pry(main)> a = Daru::Index.new [1, :a, :b]
=> #<Daru::Index:0x0000000327f350
 @keys=[1, :a, :b],
 @relation_hash={1=>0, :a=>1, :b=>2},
 @size=3>
[28] pry(main)> a[:a]   # Case I
=> 1
[29] pry(main)> a[1]    # Case II
=> 0
[30] pry(main)> a[0]    # Case III
=> 0
[31] pry(main)> a[3]    # Case IV
IndexError: Specified index 3 does not exist
from /home/ubuntu/workspace/daru/lib/daru/index.rb:94:in `[]'

To summarize, if value is an index value, then it returns its corresponding positional index (Case I and II), otherwise it checks if it is a valid positional index or not. In case it is a valid positional index, it returns the value as it is (Case III); otherwise it raises an exception (Case IV).

@lokeshh

lokeshh May 23, 2016

Contributor

I'm confused. @v0dro Could you summarize for Daru::Index#[] what you said?

Currently this is how Daru::Index works:

[27] pry(main)> a = Daru::Index.new [1, :a, :b]
=> #<Daru::Index:0x0000000327f350
 @keys=[1, :a, :b],
 @relation_hash={1=>0, :a=>1, :b=>2},
 @size=3>
[28] pry(main)> a[:a]   # Case I
=> 1
[29] pry(main)> a[1]    # Case II
=> 0
[30] pry(main)> a[0]    # Case III
=> 0
[31] pry(main)> a[3]    # Case IV
IndexError: Specified index 3 does not exist
from /home/ubuntu/workspace/daru/lib/daru/index.rb:94:in `[]'

To summarize, if value is an index value, then it returns its corresponding positional index (Case I and II), otherwise it checks if it is a valid positional index or not. In case it is a valid positional index, it returns the value as it is (Case III); otherwise it raises an exception (Case IV).

This comment has been minimized.

@lokeshh

lokeshh May 23, 2016

Contributor

Or the #[] function can handle generic data (i.e. it will only refer to the categories that already there in the data irrespective of numeric or non-numeric) and a separate function called #position(now #at) will return positional index.

I didn't get it. Why do we need #at? Say idx = Daru::CategoricalIndex.new [:a, :b, :a]. What should idx.at[0] return?

@lokeshh

lokeshh May 23, 2016

Contributor

Or the #[] function can handle generic data (i.e. it will only refer to the categories that already there in the data irrespective of numeric or non-numeric) and a separate function called #position(now #at) will return positional index.

I didn't get it. Why do we need #at? Say idx = Daru::CategoricalIndex.new [:a, :b, :a]. What should idx.at[0] return?

This comment has been minimized.

@v0dro

v0dro May 24, 2016

Member

If the category does not contain numerical indexes, specifying a number in #[] will return the positional index.

If the category has numerical indexes, specifying a number in #[] will return the actual categories.

@v0dro Is this for Daru::Vector#[] or Daru::Index#[]?

This is for Daru::Vector. I'm not very concerned with Index since users don't interact with it very often.

@v0dro

v0dro May 24, 2016

Member

If the category does not contain numerical indexes, specifying a number in #[] will return the positional index.

If the category has numerical indexes, specifying a number in #[] will return the actual categories.

@v0dro Is this for Daru::Vector#[] or Daru::Index#[]?

This is for Daru::Vector. I'm not very concerned with Index since users don't interact with it very often.

This comment has been minimized.

@v0dro

v0dro May 24, 2016

Member

Or the #[] function can handle generic data (i.e. it will only refer to the categories that already there in the data irrespective of numeric or non-numeric) and a separate function called #position(now #at) will return positional index.
I didn't get it. Why do we need #at? Say idx = Daru::CategoricalIndex.new [:a, :b, :a]. What should idx.at[0] return?

idx.at(0) should return element at positional index 0 i.e. :a.

@v0dro

v0dro May 24, 2016

Member

Or the #[] function can handle generic data (i.e. it will only refer to the categories that already there in the data irrespective of numeric or non-numeric) and a separate function called #position(now #at) will return positional index.
I didn't get it. Why do we need #at? Say idx = Daru::CategoricalIndex.new [:a, :b, :a]. What should idx.at[0] return?

idx.at(0) should return element at positional index 0 i.e. :a.

Show outdated Hide outdated spec/index_spec.rb
end
it "raises exception given wrong positional index" do
expect { @indx1[5] }.to raise_error

This comment has been minimized.

@agisga

agisga May 23, 2016

indx1 should be idx1?

@agisga

agisga May 23, 2016

indx1 should be idx1?

Show outdated Hide outdated spec/index_spec.rb
end
it "raises exception given wrong positional indexes" do
expect { @indx1[0, 1, 5] }.to raise_error

This comment has been minimized.

@agisga

agisga May 23, 2016

indx1 (same typo as above)

@agisga

agisga May 23, 2016

indx1 (same typo as above)

Show outdated Hide outdated spec/vector_spec.rb
it "returns vector of elements belonging to given categories" do
expect(@dv2[:a, :c]).to eq(Daru::Vector.new(
['a', 'c', 'd', 'e'], index: Daru::Index.new(
[:a, :a, :a, :e])))

This comment has been minimized.

@agisga

agisga May 23, 2016

shouldn't it be [:a, :a, :a, :c] instead of [:a, :a, :a, :e]?

@agisga

agisga May 23, 2016

shouldn't it be [:a, :a, :a, :c] instead of [:a, :a, :a, :e]?

This comment has been minimized.

@lokeshh

lokeshh May 23, 2016

Contributor

Yes

@lokeshh

lokeshh May 23, 2016

Contributor

Yes

Show outdated Hide outdated spec/vector_spec.rb
end
it "returns vector of elements belonging to given categories" do
expect(@dv2[:a, :c]).to eq(Daru::Vector.new(

This comment has been minimized.

@agisga

agisga May 23, 2016

shouldn't this be @dv1[:a, :c] because dv2 has numerical indices?

@agisga

agisga May 23, 2016

shouldn't this be @dv1[:a, :c] because dv2 has numerical indices?

Show outdated Hide outdated spec/index_spec.rb
describe Daru::CategoricalIndex do
before do
@idx = Daru::CategoricalIndex.new(

This comment has been minimized.

@agisga

agisga May 23, 2016

In all of the code below you refer as idx1 to what is defined here as idx, as far as I can tell

@agisga

agisga May 23, 2016

In all of the code below you refer as idx1 to what is defined here as idx, as far as I can tell

This comment has been minimized.

@zverok

zverok May 23, 2016

Collaborator

I strongly propose to make it "right" from the beginning of new specs and use subject and let, with readable names of subjects/lets. Let the old spec have instance variables for now, but new ones should use modern RSpec features. @v0dro, WDYT?

Also, I think it is better to have new large classes/features specs in separate files.

@zverok

zverok May 23, 2016

Collaborator

I strongly propose to make it "right" from the beginning of new specs and use subject and let, with readable names of subjects/lets. Let the old spec have instance variables for now, but new ones should use modern RSpec features. @v0dro, WDYT?

Also, I think it is better to have new large classes/features specs in separate files.

This comment has been minimized.

@v0dro

v0dro May 23, 2016

Member

Yes we should use modern RSpec.

@v0dro

v0dro May 23, 2016

Member

Yes we should use modern RSpec.

This comment has been minimized.

@lokeshh

lokeshh May 23, 2016

Contributor

@zverok I see that I can use "let" to replace before..do block but when and where to use "subject"? Its not clear to me. Please let me know.

@lokeshh

lokeshh May 23, 2016

Contributor

@zverok I see that I can use "let" to replace before..do block but when and where to use "subject"? Its not clear to me. Please let me know.

This comment has been minimized.

@agisga

agisga May 23, 2016

@lokeshh I found this resource very helpful for questions about rspec during my GSoC last summer. From there it seems that unlike let, subject can also be unnamed (which I actually find is more confusing, but I'm no expert). But also as a word, I find that subject conveys a different intent than let.

@agisga

agisga May 23, 2016

@lokeshh I found this resource very helpful for questions about rspec during my GSoC last summer. From there it seems that unlike let, subject can also be unnamed (which I actually find is more confusing, but I'm no expert). But also as a word, I find that subject conveys a different intent than let.

This comment has been minimized.

@zverok

zverok May 24, 2016

Collaborator

@lokeshh With pleasure.
Look.

  1. The first usage of subject is for semantics: modern rspec implies small contexts, each of them testing exactly one test subject (if possible, of course, which is not always the case). Like this:
describe Daru::CategoricalIndex do
  let(:idx) { Daru::CategoricalIndex.new(...) } # <- that's some data we use...
  context '#[]' do
    subject { idx[:a] } # that's the value we will be checking in this context
    # or, subject could be named:
    subject(:one_category) { idx[:a] } # that's the value we will be checking in this context
  end
end
  1. In many cases, when you have subject, you can write simpler its, like this (assuming we are inside context described above):
it { is_expected.to be_a Daru::Index }
it { is_expected.to eq Daru::Index.new([0, 2, 3]) }

^ rspec even wil format that properly, without any text description (if such test fails, it will write something like "In test 'expected to be an instance of Daru::Index class'")

  1. With rspec-its gem (which I also strongly suggest -- in any case, I'm adding it in my "refactor" branch), the "implicit subject" tests can be even more powerful, like this:
its(:size) { is_expected.to eq 3 }
its(:first) { is_expected.to eq 0 }
....

Such "atomic" checks in many cases are easier to write and rerun, and find the problems (in current Daru specs, it is sometimes complicated fails like "(this huge dataframe) is expected to be equal to (this huge dataframe), go figure the difference!")

So, summing it up: a) semantics of specs (focus on "we test this specific value") + b) more concise and DRY specs

@zverok

zverok May 24, 2016

Collaborator

@lokeshh With pleasure.
Look.

  1. The first usage of subject is for semantics: modern rspec implies small contexts, each of them testing exactly one test subject (if possible, of course, which is not always the case). Like this:
describe Daru::CategoricalIndex do
  let(:idx) { Daru::CategoricalIndex.new(...) } # <- that's some data we use...
  context '#[]' do
    subject { idx[:a] } # that's the value we will be checking in this context
    # or, subject could be named:
    subject(:one_category) { idx[:a] } # that's the value we will be checking in this context
  end
end
  1. In many cases, when you have subject, you can write simpler its, like this (assuming we are inside context described above):
it { is_expected.to be_a Daru::Index }
it { is_expected.to eq Daru::Index.new([0, 2, 3]) }

^ rspec even wil format that properly, without any text description (if such test fails, it will write something like "In test 'expected to be an instance of Daru::Index class'")

  1. With rspec-its gem (which I also strongly suggest -- in any case, I'm adding it in my "refactor" branch), the "implicit subject" tests can be even more powerful, like this:
its(:size) { is_expected.to eq 3 }
its(:first) { is_expected.to eq 0 }
....

Such "atomic" checks in many cases are easier to write and rerun, and find the problems (in current Daru specs, it is sometimes complicated fails like "(this huge dataframe) is expected to be equal to (this huge dataframe), go figure the difference!")

So, summing it up: a) semantics of specs (focus on "we test this specific value") + b) more concise and DRY specs

Show outdated Hide outdated spec/vector_spec.rb
@idx1 = Daru::CategoricalIndex [:a, :b, :a, :a, :c]
@idx2 = Daru::CategoricalIndex [0, 1, 0, 0, 2]
@dv1 = Daru::Vector.new 'a'..'e', index: @idx1
@dv2 = Daru::Vector.new 'a'..'e', index: @idx2

This comment has been minimized.

@agisga

agisga May 23, 2016

I don't see any tests for @dv2.

@agisga

agisga May 23, 2016

I don't see any tests for @dv2.

This comment has been minimized.

@lokeshh

lokeshh May 23, 2016

Contributor

Sorry for the confusion. I haven't add specs for them yet.

@lokeshh

lokeshh May 23, 2016

Contributor

Sorry for the confusion. I haven't add specs for them yet.

This comment has been minimized.

@v0dro

v0dro May 24, 2016

Member

Shouldn't that be Daru::CategoricalIndex.new [0, 1, 0, 0, 2]?

@v0dro

v0dro May 24, 2016

Member

Shouldn't that be Daru::CategoricalIndex.new [0, 1, 0, 0, 2]?

@v0dro

This comment has been minimized.

Show comment
Hide comment
@v0dro

v0dro May 24, 2016

Member

@agisga I don't think it's appropriate to use the strings in #[] like you suggested. (i.e using idx.["0"] for specifying index).

It would lead to confusion and counter intuitive behaviour.

Imagine yourself doing this: you want to perform some computation using categories and you use a number instead of a string since your categories are all numbers. This yields wrong output and until you stumble across the docs that say that it must be specified as a string, you will be searching your whole code base for that one wrong operation. APIs should ideally resemble the standards set by the language or other libraries and one must not deviate unless that's the only way to do it.

Member

v0dro commented May 24, 2016

@agisga I don't think it's appropriate to use the strings in #[] like you suggested. (i.e using idx.["0"] for specifying index).

It would lead to confusion and counter intuitive behaviour.

Imagine yourself doing this: you want to perform some computation using categories and you use a number instead of a string since your categories are all numbers. This yields wrong output and until you stumble across the docs that say that it must be specified as a string, you will be searching your whole code base for that one wrong operation. APIs should ideally resemble the standards set by the language or other libraries and one must not deviate unless that's the only way to do it.

Show outdated Hide outdated spec/index/categorical_spec.rb
it { is_expected.to eq Daru::Index.new [0, 2, 3, 4] }
end

This comment has been minimized.

@lokeshh

lokeshh May 24, 2016

Contributor

@zverok I am not sure how to use subject when dealing with multiple scenarios inside #[] like these. Is this the correct use of subject?

@lokeshh

lokeshh May 24, 2016

Contributor

@zverok I am not sure how to use subject when dealing with multiple scenarios inside #[] like these. Is this the correct use of subject?

This comment has been minimized.

@zverok

zverok May 24, 2016

Collaborator

Yep, pretty good (it may seem "too complicated" for now, but eventually you'll love it :)).

For simple scenarios, it is sometimes acceptable to just expect( idx1[:a, :c] ).to eq .., so you can start the simplest form of specs with it, and then introduce subject when there is more than one check. Like "write as less code as possible, but not lesser" :)

@zverok

zverok May 24, 2016

Collaborator

Yep, pretty good (it may seem "too complicated" for now, but eventually you'll love it :)).

For simple scenarios, it is sometimes acceptable to just expect( idx1[:a, :c] ).to eq .., so you can start the simplest form of specs with it, and then introduce subject when there is more than one check. Like "write as less code as possible, but not lesser" :)

This comment has been minimized.

@v0dro

v0dro May 24, 2016

Member

Damn! This takes unit testing to a whole new level :O

Thanks for this Victor!

@v0dro

v0dro May 24, 2016

Member

Damn! This takes unit testing to a whole new level :O

Thanks for this Victor!

Show outdated Hide outdated spec/index/categorical_spec.rb
describe Daru::CategoricalIndex do
let(:idx1) { Daru::CategoricalIndex.new [:a, :b, :a, :a, :c] }
let(:idx2) { Daru::CategoricalIndex.new [0, 1, 0, 0, 2] }

This comment has been minimized.

@zverok

zverok May 24, 2016

Collaborator

a) informative names! something like symbols_idx and numbers_idx
b) under that common describe, you can use described_class.new instead of Daru::CategoricalIndex.new (sometimes it looks cleaner)

@zverok

zverok May 24, 2016

Collaborator

a) informative names! something like symbols_idx and numbers_idx
b) under that common describe, you can use described_class.new instead of Daru::CategoricalIndex.new (sometimes it looks cleaner)

Show outdated Hide outdated spec/vector_spec.rb
it "returns vector of elements belonging to given categories" do
expect(@dv1[:a, :c]).to eq(Daru::Vector.new(
['a', 'c', 'd', 'e'], index: Daru::CategoricalIndex.new(
[:a, :a, :a, :c])))

This comment has been minimized.

@zverok

zverok May 24, 2016

Collaborator

BTW, this case is already pretty good example when subject+its may be clearer:

subject { dv[:a, :c] }
it { is_expected.to be_a Daru::Vector }
its(:size) { is_expected.to eq 4 }
its(:to_a) { is_expected.to eq  ['a', 'c', 'd', 'e'] }
its(:index) { is_expected.to eq Daru::CategoricalIndex.new([:a, :a, :a, :c]) }
# or even
its(:index) { is_expected.to be_a Daru::CategoricalIndex }
its(:'index.to_a') { is_expected.to eq [:a, :a, :a, :c] }

Why? Because this way you can see clearer what part of specification is failing (for example, first you implement data part of sliced vector, then implement its index, and then do refactoring -- and only index order fails, and you see something like expected [:a, :a, :a, :c], got [:a, :c] instead of two large unequal vectors you need to scan by eyes to find what went sideways).

Though, I should say that such things are mostly the matter of personal preferences, so it is just a suggestion to consider. I will not bother you with those things too frequent :)

@zverok

zverok May 24, 2016

Collaborator

BTW, this case is already pretty good example when subject+its may be clearer:

subject { dv[:a, :c] }
it { is_expected.to be_a Daru::Vector }
its(:size) { is_expected.to eq 4 }
its(:to_a) { is_expected.to eq  ['a', 'c', 'd', 'e'] }
its(:index) { is_expected.to eq Daru::CategoricalIndex.new([:a, :a, :a, :c]) }
# or even
its(:index) { is_expected.to be_a Daru::CategoricalIndex }
its(:'index.to_a') { is_expected.to eq [:a, :a, :a, :c] }

Why? Because this way you can see clearer what part of specification is failing (for example, first you implement data part of sliced vector, then implement its index, and then do refactoring -- and only index order fails, and you see something like expected [:a, :a, :a, :c], got [:a, :c] instead of two large unequal vectors you need to scan by eyes to find what went sideways).

Though, I should say that such things are mostly the matter of personal preferences, so it is just a suggestion to consider. I will not bother you with those things too frequent :)

This comment has been minimized.

@lokeshh

lokeshh May 26, 2016

Contributor

Though, I should say that such things are mostly the matter of personal preferences, so it is just a suggestion to consider. I will not bother you with those things too frequent :)

I have very less experience with this. Please let me know of any suggestion you have :)

@lokeshh

lokeshh May 26, 2016

Contributor

Though, I should say that such things are mostly the matter of personal preferences, so it is just a suggestion to consider. I will not bother you with those things too frequent :)

I have very less experience with this. Please let me know of any suggestion you have :)

@agisga

This comment has been minimized.

Show comment
Hide comment
@agisga

agisga May 24, 2016

@v0dro I know that was a horrible idea, especially compared to the at-idea.
But I think that the concern you raised, still applies in a lesser form if daru uses at. For a vector dv with non-numeric categorical index, dv[0] would return its first element and so would dv.at(0). But when the user has the number 0 as a category in the index for the first time, he/she will have to go through documentation to figure out that now dv.at(0) has to be used to get the first vector element and dv[0] does something else. (I hope I'm not misunderstanding the discussion so far)

I don't think that's really unexpected behavior though. But still it would be good to somehow make it so that the user does not have to adjust their code (i.e., change dv[ind] to dv.at ind), once they suddenly realize that some categories in their code might also be numeric.

agisga commented May 24, 2016

@v0dro I know that was a horrible idea, especially compared to the at-idea.
But I think that the concern you raised, still applies in a lesser form if daru uses at. For a vector dv with non-numeric categorical index, dv[0] would return its first element and so would dv.at(0). But when the user has the number 0 as a category in the index for the first time, he/she will have to go through documentation to figure out that now dv.at(0) has to be used to get the first vector element and dv[0] does something else. (I hope I'm not misunderstanding the discussion so far)

I don't think that's really unexpected behavior though. But still it would be good to somehow make it so that the user does not have to adjust their code (i.e., change dv[ind] to dv.at ind), once they suddenly realize that some categories in their code might also be numeric.

@agisga

This comment has been minimized.

Show comment
Hide comment
@agisga

agisga May 24, 2016

By the way, I just wanted to check how R data.frame handles something like categorical index, and as far as I can see, it does not have any equivalent to the functionality of categorical index proposed here (at least not in the base R installation; the closest would be to add an additional column with the indices to the data frame). So, that makes daru and Ruby stand out! 👍

agisga commented May 24, 2016

By the way, I just wanted to check how R data.frame handles something like categorical index, and as far as I can see, it does not have any equivalent to the functionality of categorical index proposed here (at least not in the base R installation; the closest would be to add an additional column with the indices to the data frame). So, that makes daru and Ruby stand out! 👍

Show outdated Hide outdated spec/vector_spec.rb
let (:dv) { Daru::Vector.new 'a'..'e', index: idx1 }
context "multiple positional indexes" do
subject { dv[0, 1, 2] }

This comment has been minimized.

@agisga

agisga May 24, 2016

do you mean dv.at [0, 1, 2] ?

@agisga

agisga May 24, 2016

do you mean dv.at [0, 1, 2] ?

Show outdated Hide outdated spec/vector_spec.rb
end
context "single positional index" do
subject { dv[1] }

This comment has been minimized.

@agisga

agisga May 24, 2016

do you mean dv.at 1 ?

@agisga

agisga May 24, 2016

do you mean dv.at 1 ?

@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok May 24, 2016

Collaborator

For a vector dv with non-numeric categorical index, dv[0] would return its first element and so would dv.at(0). But when the user has the number 0 as a category in the index for the first time, he/she will have to go through documentation to figure out that now dv.at(0) has to be used to get the first vector element and dv[0] does something else. (I hope I'm not misunderstanding the discussion so far)

That's pretty valid point (and that's why I, for one, dislike too "multi-functional" #[] method, I'm not sure it should fallback to search-by-position when search-by-index fails).

By the way, I just wanted to check how R data.frame handles something like categorical index, and as far as I can see, it does not have any equivalent to the functionality of categorical index proposed here

Pandas have :) And as far as I can understand, it is, in fact, the example by which @lokeshh is inspired.

Collaborator

zverok commented May 24, 2016

For a vector dv with non-numeric categorical index, dv[0] would return its first element and so would dv.at(0). But when the user has the number 0 as a category in the index for the first time, he/she will have to go through documentation to figure out that now dv.at(0) has to be used to get the first vector element and dv[0] does something else. (I hope I'm not misunderstanding the discussion so far)

That's pretty valid point (and that's why I, for one, dislike too "multi-functional" #[] method, I'm not sure it should fallback to search-by-position when search-by-index fails).

By the way, I just wanted to check how R data.frame handles something like categorical index, and as far as I can see, it does not have any equivalent to the functionality of categorical index proposed here

Pandas have :) And as far as I can understand, it is, in fact, the example by which @lokeshh is inspired.

@v0dro v0dro referenced this pull request May 24, 2016

Merged

Large codebase refactoring #123

@v0dro

This comment has been minimized.

Show comment
Hide comment
@v0dro

v0dro May 24, 2016

Member

@agisga:
The #[] functionality in Daru::Vector is ported from pandas. So if you have numerical category, and you specify idx[0], it will return elements that have categories as 0 and ignore positional indexing. So basically positional index has lower precedence than actual index that's stored in an object of kind Daru::Index. 0 is first searched for in the index, then it's checked if the 0th position exists (i.e. the positional index is checked) and if neither conditions are satisfied an error is raised.

And I agree that a confusion might exist about the usage of #at or #[] when it comes to numerical data, but I've not heard from anybody about it so far and it would be really impractical to have separate methods just for handling numerical indexes. Lesser the interfaces lesser the confusion.

Also, dv[0] will do something else only if the 0th category is not at the 0th position.

Member

v0dro commented May 24, 2016

@agisga:
The #[] functionality in Daru::Vector is ported from pandas. So if you have numerical category, and you specify idx[0], it will return elements that have categories as 0 and ignore positional indexing. So basically positional index has lower precedence than actual index that's stored in an object of kind Daru::Index. 0 is first searched for in the index, then it's checked if the 0th position exists (i.e. the positional index is checked) and if neither conditions are satisfied an error is raised.

And I agree that a confusion might exist about the usage of #at or #[] when it comes to numerical data, but I've not heard from anybody about it so far and it would be really impractical to have separate methods just for handling numerical indexes. Lesser the interfaces lesser the confusion.

Also, dv[0] will do something else only if the 0th category is not at the 0th position.

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh May 25, 2016

Contributor

I can be wrong but I think the only functions required for Daru::Vector and Daru::DataFrame to interact with Daru::Index are these two. Implementing these two functions for all index classes should be sufficient:

  • #pos I've implemented this and it's role is to return positional index. If the argument is an index it will return the corresponding positional index and if the argument is a valid positional index, it will return it as it is.
  • #sub This is a new method which I am proposing. It will produce new index from given index. This will be useful to create new index when Daru::Vector#[] and similar functions are called.

I believe implementing these two above methods are sufficient to implement all major functionalities of Daru::Vector and Daru::DataFrame that has to do with Daru::Index.

The advantage here is that I won't be messing up with #[] method of other index classes and they can be changed later. In the mean time, all index classes including CategoricalIndex would be working with Daru::Vector without incurring conditions on type of Index class.

After Daru::Vector and Daru::DataFrame are integrated with Categorical Index we can go on implementing other features of Daru::CategoricalIndex.

Any concerns or am I missing something?

Contributor

lokeshh commented May 25, 2016

I can be wrong but I think the only functions required for Daru::Vector and Daru::DataFrame to interact with Daru::Index are these two. Implementing these two functions for all index classes should be sufficient:

  • #pos I've implemented this and it's role is to return positional index. If the argument is an index it will return the corresponding positional index and if the argument is a valid positional index, it will return it as it is.
  • #sub This is a new method which I am proposing. It will produce new index from given index. This will be useful to create new index when Daru::Vector#[] and similar functions are called.

I believe implementing these two above methods are sufficient to implement all major functionalities of Daru::Vector and Daru::DataFrame that has to do with Daru::Index.

The advantage here is that I won't be messing up with #[] method of other index classes and they can be changed later. In the mean time, all index classes including CategoricalIndex would be working with Daru::Vector without incurring conditions on type of Index class.

After Daru::Vector and Daru::DataFrame are integrated with Categorical Index we can go on implementing other features of Daru::CategoricalIndex.

Any concerns or am I missing something?

@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok May 25, 2016

Collaborator

I'm not sure what #sub would do, can you please show (some short example)?..
And, about #pos -- isn't it (guessing from specs) exactly the same current Index's #[] do?.. What would be the difference?..

Collaborator

zverok commented May 25, 2016

I'm not sure what #sub would do, can you please show (some short example)?..
And, about #pos -- isn't it (guessing from specs) exactly the same current Index's #[] do?.. What would be the difference?..

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh May 25, 2016

Contributor

#pos differs from #[] for Daru::MultiIndex and Daru::DateTimeIndex. For example,

[7] pry(main)> mi = Daru::MultiIndex.from_tuples([
[7] pry(main)*     [2000, 'M'],        
[7] pry(main)*     [2000, 'F'],        
[7] pry(main)*     [2001, 'M'],        
[7] pry(main)*     [2001, 'F']        
[7] pry(main)* ])          
=> Daru::MultiIndex:20672840 (levels: [[2000, 2001], ["F", "M"]]
labels: [[0, 0, 1, 1], [1, 0, 1, 0]])
[8] pry(main)> mi[2000]
=> Daru::MultiIndex:18288840 (levels: [[2000], ["F", "M"]]
labels: [[0, 0], [1, 0]])

But mi.pos 2000 will return [0, 1]

And regarding #sub, it will do the following:

a = Daru::Index.new [:a, :b, :c]

a.sub(:a, :b) will return Daru::Index.new [:a, :b]
and a.sub(0, 1) will return Daru::Index.new [:a, :b]

And here's how these methods will be used. Let's take the example of Daru::Vector#[]:

def Daru::Vector#[] *args
  pos = @index.pos (*args)
  idx = @index.sub (*args)
  return @index.class.new pos.map { |i| @data[i] }, index: idx
end
Contributor

lokeshh commented May 25, 2016

#pos differs from #[] for Daru::MultiIndex and Daru::DateTimeIndex. For example,

[7] pry(main)> mi = Daru::MultiIndex.from_tuples([
[7] pry(main)*     [2000, 'M'],        
[7] pry(main)*     [2000, 'F'],        
[7] pry(main)*     [2001, 'M'],        
[7] pry(main)*     [2001, 'F']        
[7] pry(main)* ])          
=> Daru::MultiIndex:20672840 (levels: [[2000, 2001], ["F", "M"]]
labels: [[0, 0, 1, 1], [1, 0, 1, 0]])
[8] pry(main)> mi[2000]
=> Daru::MultiIndex:18288840 (levels: [[2000], ["F", "M"]]
labels: [[0, 0], [1, 0]])

But mi.pos 2000 will return [0, 1]

And regarding #sub, it will do the following:

a = Daru::Index.new [:a, :b, :c]

a.sub(:a, :b) will return Daru::Index.new [:a, :b]
and a.sub(0, 1) will return Daru::Index.new [:a, :b]

And here's how these methods will be used. Let's take the example of Daru::Vector#[]:

def Daru::Vector#[] *args
  pos = @index.pos (*args)
  idx = @index.sub (*args)
  return @index.class.new pos.map { |i| @data[i] }, index: idx
end
@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh May 25, 2016

Contributor

We are using "Daru::Vector#at" to retrieve elements by positional index. How can we do assignment using positional index? Is vector.at(0, 1) = 10 supported by ruby?

Contributor

lokeshh commented May 25, 2016

We are using "Daru::Vector#at" to retrieve elements by positional index. How can we do assignment using positional index? Is vector.at(0, 1) = 10 supported by ruby?

end
context "single instance" do
subject { dv[:c] }

This comment has been minimized.

@lokeshh

lokeshh May 25, 2016

Contributor

@zverok Would the let statements be re-executed after "multiple instances" context?

@lokeshh

lokeshh May 25, 2016

Contributor

@zverok Would the let statements be re-executed after "multiple instances" context?

This comment has been minimized.

@zverok

zverok May 25, 2016

Collaborator

let statements are re-executed for each example (it), and only when let-ed thing is called (otherwise, they are not executed at all).

@zverok

zverok May 25, 2016

Collaborator

let statements are re-executed for each example (it), and only when let-ed thing is called (otherwise, they are not executed at all).

@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok May 25, 2016

Collaborator

That seems reasonable to me (if you really see how it will simplify code), I'm just as usual concerned about names. Let's look maybe for how "similar" things are named in Ruby's collections, like Array and Hash?.. Item-to-position seems like Array#index to me -- ok, Index#index is a bit tautological, but seems to have a great sense, all-in-all?..

And your sub description is pretty like replace, though I'm not sure.

Collaborator

zverok commented May 25, 2016

That seems reasonable to me (if you really see how it will simplify code), I'm just as usual concerned about names. Let's look maybe for how "similar" things are named in Ruby's collections, like Array and Hash?.. Item-to-position seems like Array#index to me -- ok, Index#index is a bit tautological, but seems to have a great sense, all-in-all?..

And your sub description is pretty like replace, though I'm not sure.

@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok May 25, 2016

Collaborator

Is vector.at(:a, :b) = 10 supported by ruby?

Nope. I think "clever looking" code is not always the virtue, simple method call will do.

Collaborator

zverok commented May 25, 2016

Is vector.at(:a, :b) = 10 supported by ruby?

Nope. I think "clever looking" code is not always the virtue, simple method call will do.

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh May 25, 2016

Contributor

Nope. I think "clever looking" code is not always the virtue, simple method call will do.

I didn't get it! Could you please explain? What will simple method call do?

Contributor

lokeshh commented May 25, 2016

Nope. I think "clever looking" code is not always the virtue, simple method call will do.

I didn't get it! Could you please explain? What will simple method call do?

@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok May 25, 2016

Collaborator

What will simple method call do?

Something like vector.set_at([:a, :b], 10). Though, I'm not 100% sure about its usability (AND usefullness)

Collaborator

zverok commented May 25, 2016

What will simple method call do?

Something like vector.set_at([:a, :b], 10). Though, I'm not 100% sure about its usability (AND usefullness)

Show outdated Hide outdated spec/vector_spec.rb
context "single category" do
context "multiple instances" do
subject { dv }
dv[:a] = 'x'

This comment has been minimized.

@zverok

zverok May 25, 2016

Collaborator

This will not work (specs logic, I mean). You should place this assignment in before{} block:

subject { dv }
before { dv[:a] = 'x' }
it  { ... }
@zverok

zverok May 25, 2016

Collaborator

This will not work (specs logic, I mean). You should place this assignment in before{} block:

subject { dv }
before { dv[:a] = 'x' }
it  { ... }
Show outdated Hide outdated spec/vector_spec.rb
subject { dv }
dv[:a] = 'x'
it { is_expected.to be_a Daru::Vector }

This comment has been minimized.

@zverok

zverok May 25, 2016

Collaborator

Is this check necessary for each spec? I don't think dv could change its class after #[]= even in case of hugest error possible, so, what exactly we a specifying here?.. :)

@zverok

zverok May 25, 2016

Collaborator

Is this check necessary for each spec? I don't think dv could change its class after #[]= even in case of hugest error possible, so, what exactly we a specifying here?.. :)

This comment has been minimized.

@lokeshh

lokeshh May 25, 2016

Contributor

Haha. True.

@lokeshh

lokeshh May 25, 2016

Contributor

Haha. True.

@v0dro

This comment has been minimized.

Show comment
Hide comment
@v0dro

v0dro May 25, 2016

Member

Daru::Index#[] also produces a new index when given multiple inputs:

DateTimeIndex:

2.2.1 :027 > a
 => #<DateTimeIndex:80567600 offset=D periods=367 data=[2012-01-01T00:00:00+00:00...2013-01-01T00:00:00+00:00]> 
2.2.1 :028 > a['2012-2']
 => #<DateTimeIndex:80402730 offset=D periods=29 data=[2012-02-01T00:00:00+00:00...2012-02-29T00:00:00+00:00]>

MultiIndex:

w = Daru::MultiIndex.from_tuples([[:a,0,1], [:a,0,3], [:b,1,4]])
 => Daru::MultiIndex:80214130 (levels: [[:a, :b], [0, 1], [1, 3, 4]]
labels: [[0, 0, 1], [0, 0, 1], [0, 1, 2]]) 
2.2.1 :036 > w[:a]
 => Daru::MultiIndex:80195680 (levels: [[:a], [0], [1, 3]]
labels: [[0, 0], [0, 0], [0, 1]]) 

Index:

2.2.1 :037 > q = Daru::Index.new([:a, :b,:e,:q])
 => #<Daru::Index:0x98edc7c @relation_hash={:a=>0, :b=>1, :e=>2, :q=>3}, @keys=[:a, :b, :e, :q], @size=4> 
2.2.1 :038 > q[:q]
 => 3 
2.2.1 :039 > q[:q,:e]
 => #<Daru::Index:0x98e61fc @relation_hash={:q=>0, :e=>1}, @keys=[:q, :e], @size=2> 

Can't see where #sub would come into the picture from what you describe.

Member

v0dro commented May 25, 2016

Daru::Index#[] also produces a new index when given multiple inputs:

DateTimeIndex:

2.2.1 :027 > a
 => #<DateTimeIndex:80567600 offset=D periods=367 data=[2012-01-01T00:00:00+00:00...2013-01-01T00:00:00+00:00]> 
2.2.1 :028 > a['2012-2']
 => #<DateTimeIndex:80402730 offset=D periods=29 data=[2012-02-01T00:00:00+00:00...2012-02-29T00:00:00+00:00]>

MultiIndex:

w = Daru::MultiIndex.from_tuples([[:a,0,1], [:a,0,3], [:b,1,4]])
 => Daru::MultiIndex:80214130 (levels: [[:a, :b], [0, 1], [1, 3, 4]]
labels: [[0, 0, 1], [0, 0, 1], [0, 1, 2]]) 
2.2.1 :036 > w[:a]
 => Daru::MultiIndex:80195680 (levels: [[:a], [0], [1, 3]]
labels: [[0, 0], [0, 0], [0, 1]]) 

Index:

2.2.1 :037 > q = Daru::Index.new([:a, :b,:e,:q])
 => #<Daru::Index:0x98edc7c @relation_hash={:a=>0, :b=>1, :e=>2, :q=>3}, @keys=[:a, :b, :e, :q], @size=4> 
2.2.1 :038 > q[:q]
 => 3 
2.2.1 :039 > q[:q,:e]
 => #<Daru::Index:0x98e61fc @relation_hash={:q=>0, :e=>1}, @keys=[:q, :e], @size=2> 

Can't see where #sub would come into the picture from what you describe.

@v0dro

This comment has been minimized.

Show comment
Hide comment
@v0dro

v0dro May 25, 2016

Member

@zverok Array#replace is more replacing the contents of self (Array) but his proposed #sub method is more of creating a subset of the index from the previous index.

Member

v0dro commented May 25, 2016

@zverok Array#replace is more replacing the contents of self (Array) but his proposed #sub method is more of creating a subset of the index from the previous index.

Show outdated Hide outdated spec/index/multi_spec.rb
@@ -1,123 +1,5 @@
require 'spec_helper.rb'

This comment has been minimized.

@v0dro

v0dro May 25, 2016

Member

rename file to multi_index_spec.rb

@v0dro

v0dro May 25, 2016

Member

rename file to multi_index_spec.rb

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh May 25, 2016

Contributor

Can't see where #sub would come into the picture from what you describe.

#sub will be useful when positional index is specified. #sub would form Index object from positional index while #[] won't. For example,

>> a = Daru::Index.new [:a, :b, :c]
=> #<Daru::Index:0x000000027bfd48 @relation_hash={:a=>0, :b=>1, :c=>2}, @keys=[:a, :b, :c], @size=3>
?> a[0, 1]
=> #<Daru::Index:0x000000027bc8c8 @relation_hash={0=>0, 1=>1}, @keys=[0, 1], @size=2>
?> a.sub 0, 1
=> #<Daru::Index:0x000000027b4718 @relation_hash={:a=>0, :b=>1}, @keys=[:a, :b], @size=2>
Contributor

lokeshh commented May 25, 2016

Can't see where #sub would come into the picture from what you describe.

#sub will be useful when positional index is specified. #sub would form Index object from positional index while #[] won't. For example,

>> a = Daru::Index.new [:a, :b, :c]
=> #<Daru::Index:0x000000027bfd48 @relation_hash={:a=>0, :b=>1, :c=>2}, @keys=[:a, :b, :c], @size=3>
?> a[0, 1]
=> #<Daru::Index:0x000000027bc8c8 @relation_hash={0=>0, 1=>1}, @keys=[0, 1], @size=2>
?> a.sub 0, 1
=> #<Daru::Index:0x000000027b4718 @relation_hash={:a=>0, :b=>1}, @keys=[:a, :b], @size=2>
@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh May 25, 2016

Contributor

@zverok I haven't looked upon how it's done but I think having vector.at(:a, :b) = 10 is possible to implement because df.row[0] = 1..5 is made possible in Daru::DataFrame. What do you say @v0dro ?

Contributor

lokeshh commented May 25, 2016

@zverok I haven't looked upon how it's done but I think having vector.at(:a, :b) = 10 is possible to implement because df.row[0] = 1..5 is made possible in Daru::DataFrame. What do you say @v0dro ?

@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok May 26, 2016

Collaborator

I haven't looked upon how it's done but I think having vector.at(:a, :b) = 10 is possible to implement because df.row[0] = 1..5 is made possible in Daru::DataFrame.

Note the different kinds of parenthesis in two cases. df.row[0] = is implemented like this:

  • df.row method returns an instance of special Daru::Accessors::DataFrameByRow object,
  • which defines [] method, which allows doing the trick.

I'm not completely sure that this kind of trick everywhere in code is always a good thing. Basically, if you want to do vector.at[:a, :b] = 10 100 times, using the technique described above, you'll generate 100 "accessor objects" during this process.

I'd suggest to remain on "just plain Ruby" API, and only after the main work is done, try to think how to make code cleaner and smarter.

Collaborator

zverok commented May 26, 2016

I haven't looked upon how it's done but I think having vector.at(:a, :b) = 10 is possible to implement because df.row[0] = 1..5 is made possible in Daru::DataFrame.

Note the different kinds of parenthesis in two cases. df.row[0] = is implemented like this:

  • df.row method returns an instance of special Daru::Accessors::DataFrameByRow object,
  • which defines [] method, which allows doing the trick.

I'm not completely sure that this kind of trick everywhere in code is always a good thing. Basically, if you want to do vector.at[:a, :b] = 10 100 times, using the technique described above, you'll generate 100 "accessor objects" during this process.

I'd suggest to remain on "just plain Ruby" API, and only after the main work is done, try to think how to make code cleaner and smarter.

@lokeshh

This comment has been minimized.

Show comment
Hide comment
@lokeshh

lokeshh May 26, 2016

Contributor

I'd suggest to remain on "just plain Ruby" API, and only after the main work is done, try to think how to make code cleaner and smarter.

Point noted. I've almost completed specs regarding #[] and #[]= for Daru::Vector and Daru::DataFrame. Do you like the functionality presented by these specs?

Contributor

lokeshh commented May 26, 2016

I'd suggest to remain on "just plain Ruby" API, and only after the main work is done, try to think how to make code cleaner and smarter.

Point noted. I've almost completed specs regarding #[] and #[]= for Daru::Vector and Daru::DataFrame. Do you like the functionality presented by these specs?

@v0dro

This comment has been minimized.

Show comment
Hide comment
@v0dro

v0dro May 26, 2016

Member

Functions look good to me so far. And yes, don't use #at[].

Member

v0dro commented May 26, 2016

Functions look good to me so far. And yes, don't use #at[].

Show outdated Hide outdated spec/dataframe_spec.rb
@@ -696,7 +696,7 @@
}, index: idx)
end
context "modify" do
context "modify exiting row" do

This comment has been minimized.

@v0dro

v0dro May 26, 2016

Member

existing?

@v0dro

v0dro May 26, 2016

Member

existing?

Show outdated Hide outdated spec/dataframe_spec.rb
its(:a) { Daru::Vector.new ['x', 'b', 'x', 'x', 'e'] }
its(:b) { Daru::Vector.new ['y', 2, 'y', 'y', 5] }
end

This comment has been minimized.

@lokeshh

lokeshh May 27, 2016

Contributor

Is this way of checking an expected Daru::DataFrame fine?

@lokeshh

lokeshh May 27, 2016

Contributor

Is this way of checking an expected Daru::DataFrame fine?

This comment has been minimized.

@zverok

zverok May 27, 2016

Collaborator

Looks good to me.

@zverok

zverok May 27, 2016

Collaborator

Looks good to me.

@v0dro

This comment has been minimized.

Show comment
Hide comment
@v0dro

v0dro Jul 2, 2016

Member

@zverok @agisga could you have one last look before I merge this?

Member

v0dro commented Jul 2, 2016

@zverok @agisga could you have one last look before I merge this?

end
context 'bar' do
let(:plot) { instance_double 'Gruff::Bar' }

This comment has been minimized.

@v0dro

v0dro Jul 2, 2016

Member

Can you explain what instance_double does exactly?

@v0dro

v0dro Jul 2, 2016

Member

Can you explain what instance_double does exactly?

This comment has been minimized.

@lokeshh

lokeshh Jul 3, 2016

Contributor

instance_double here creates a fake instance of Gruff::Bar so we can test methods called on it.

@lokeshh

lokeshh Jul 3, 2016

Contributor

instance_double here creates a fake instance of Gruff::Bar so we can test methods called on it.

it 'plots bar graph with block' do
expect(plot).to receive :labels=
expect(plot).to receive(:data).exactly(3).times
expect(plot).to receive :title=

This comment has been minimized.

@v0dro

v0dro Jul 2, 2016

Member

Why are these specs necessary? From what I can see you're writing specs for Gruff here, not daru.

@v0dro

v0dro Jul 2, 2016

Member

Why are these specs necessary? From what I can see you're writing specs for Gruff here, not daru.

This comment has been minimized.

@lokeshh

lokeshh Jul 3, 2016

Contributor

These specs checks that Daru calls appropriate methods of Gruff at correct number of times. For example, in a bar plot Daru should use make use of labels= to label the x axis while this shoudn't be the case with scatter plot. Also data should be called thrice because there are three different bars to be plotted for three vectors.

These tests are also helpful to make sure that the code executes with any errors. While refactoring the code to solve the rubocop offenses, I accidentally broke some code and the specs were failing signifying where's the problem.

@lokeshh

lokeshh Jul 3, 2016

Contributor

These specs checks that Daru calls appropriate methods of Gruff at correct number of times. For example, in a bar plot Daru should use make use of labels= to label the x axis while this shoudn't be the case with scatter plot. Also data should be called thrice because there are three different bars to be plotted for three vectors.

These tests are also helpful to make sure that the code executes with any errors. While refactoring the code to solve the rubocop offenses, I accidentally broke some code and the specs were failing signifying where's the problem.

This comment has been minimized.

@v0dro

v0dro Jul 3, 2016

Member

ok works.

@v0dro

v0dro Jul 3, 2016

Member

ok works.

@zverok

This comment has been minimized.

Show comment
Hide comment
@zverok

zverok Jul 3, 2016

Collaborator

@v0dro OK, reviewing it.

Collaborator

zverok commented Jul 3, 2016

@v0dro OK, reviewing it.

else
raise ArgumentError, "Unsupported library #{lib}"
end
end

This comment has been minimized.

@zverok

zverok Jul 3, 2016

Collaborator

I'm not sure about this whole thing with plotting library selection.

Is there no case when you want Gruff graphs (pretty & static) and Nyaplot graphs (unstable yet dynamic) at the same presentation? And what if we'll add one more graphing library? Always select one and only one? Switch this setting on any action?

Maybe something like dataframe.nyaplot.plot(blah) (and .gruff.plot(blah)) would look better?
Or even not try to outsmart ourselves and just use Daru::Plotting::Gruff.new(dataframe)?..
The latter approach allows to have other plotting libraries (or even nyaplot and gruff) as separate gems, and any number of Daru::Plotting::MyCoolPlotter.new(dataframe) variants.

@zverok

zverok Jul 3, 2016

Collaborator

I'm not sure about this whole thing with plotting library selection.

Is there no case when you want Gruff graphs (pretty & static) and Nyaplot graphs (unstable yet dynamic) at the same presentation? And what if we'll add one more graphing library? Always select one and only one? Switch this setting on any action?

Maybe something like dataframe.nyaplot.plot(blah) (and .gruff.plot(blah)) would look better?
Or even not try to outsmart ourselves and just use Daru::Plotting::Gruff.new(dataframe)?..
The latter approach allows to have other plotting libraries (or even nyaplot and gruff) as separate gems, and any number of Daru::Plotting::MyCoolPlotter.new(dataframe) variants.

This comment has been minimized.

@lokeshh

lokeshh Jul 3, 2016

Contributor

Is there no case when you want Gruff graphs (pretty & static) and Nyaplot graphs (unstable yet dynamic) at the same presentation? And what if we'll add one more graphing library? Always select one and only one? Switch this setting on any action?

I have provided an option #plotting_library with dataframes and vectors also. So one can use, for example dv.plotting_library=:gruff if he/she wants to use gruff for this particular vector. When we set the library using Daru.plotting_library, it means all the vectors or dataframes created after that will follow this plotting library for the plots but you can change setting for particular vectors and dataframes using dv.plotting_library or df.plotting_library.

So what do you say? Is it still a problem?

@lokeshh

lokeshh Jul 3, 2016

Contributor

Is there no case when you want Gruff graphs (pretty & static) and Nyaplot graphs (unstable yet dynamic) at the same presentation? And what if we'll add one more graphing library? Always select one and only one? Switch this setting on any action?

I have provided an option #plotting_library with dataframes and vectors also. So one can use, for example dv.plotting_library=:gruff if he/she wants to use gruff for this particular vector. When we set the library using Daru.plotting_library, it means all the vectors or dataframes created after that will follow this plotting library for the plots but you can change setting for particular vectors and dataframes using dv.plotting_library or df.plotting_library.

So what do you say? Is it still a problem?

This comment has been minimized.

@v0dro

v0dro Jul 3, 2016

Member

I think the plotting_library thing actually makes sense. Since it exists for particular DFs or Vectors it's also flexible at the same time being specific. You can change your plotting library at any time and switch between interactive and static graphs.

I think this functionality will come in handy when we eventually create a library neutral plotting interface for daru, where setting plotting_library might somehow allow users to leverage specific library features.

@v0dro

v0dro Jul 3, 2016

Member

I think the plotting_library thing actually makes sense. Since it exists for particular DFs or Vectors it's also flexible at the same time being specific. You can change your plotting library at any time and switch between interactive and static graphs.

I think this functionality will come in handy when we eventually create a library neutral plotting interface for daru, where setting plotting_library might somehow allow users to leverage specific library features.

This comment has been minimized.

@v0dro

v0dro Jul 3, 2016

Member

I think the plotting_library thing actually makes sense. Since it exists for particular DFs or Vectors it's also flexible at the same time being specific. You can change your plotting library at any time and switch between interactive and static graphs.

I think this functionality will come in handy when we eventually create a library neutral plotting interface for daru, where setting plotting_library might somehow allow users to leverage specific library features.

@v0dro

v0dro Jul 3, 2016

Member

I think the plotting_library thing actually makes sense. Since it exists for particular DFs or Vectors it's also flexible at the same time being specific. You can change your plotting library at any time and switch between interactive and static graphs.

I think this functionality will come in handy when we eventually create a library neutral plotting interface for daru, where setting plotting_library might somehow allow users to leverage specific library features.

This comment has been minimized.

@v0dro

v0dro Jul 3, 2016

Member

I think the plotting_library thing actually makes sense. Since it exists for particular DFs or Vectors it's also flexible at the same time being specific. You can change your plotting library at any time and switch between interactive and static graphs.

I think this functionality will come in handy when we eventually create a library neutral plotting interface for daru, where setting plotting_library might somehow allow users to leverage specific library features.

@v0dro

v0dro Jul 3, 2016

Member

I think the plotting_library thing actually makes sense. Since it exists for particular DFs or Vectors it's also flexible at the same time being specific. You can change your plotting library at any time and switch between interactive and static graphs.

I think this functionality will come in handy when we eventually create a library neutral plotting interface for daru, where setting plotting_library might somehow allow users to leverage specific library features.

This comment has been minimized.

@zverok

zverok Jul 4, 2016

Collaborator

OK, I will not argue about it more, noting, however, that for me concern separation is broken here.
Though, it is a part of a larger (and harder) discussion: whether Daru should be looked at as a "new data structure" at a first place + infrastructure (of plotting and analytics and reading/writing) around it; or should it be seen as a main entry point for all data processing functions.

The former seems more future-proof for me, but current development tends to the latter, as far as I can understand.

@zverok

zverok Jul 4, 2016

Collaborator

OK, I will not argue about it more, noting, however, that for me concern separation is broken here.
Though, it is a part of a larger (and harder) discussion: whether Daru should be looked at as a "new data structure" at a first place + infrastructure (of plotting and analytics and reading/writing) around it; or should it be seen as a main entry point for all data processing functions.

The former seems more future-proof for me, but current development tends to the latter, as far as I can understand.

This comment has been minimized.

@v0dro

v0dro Jul 7, 2016

Member

Hmmm yes you have a point. The former will make it more extensible and probably more usable. I think we can have a further look at this when we make a uniform plotting interface.

Is your main concern the plotting_library variable in DF/Vector since it's pulling in plotting functionality into core daru data structures?

@v0dro

v0dro Jul 7, 2016

Member

Hmmm yes you have a point. The former will make it more extensible and probably more usable. I think we can have a further look at this when we make a uniform plotting interface.

Is your main concern the plotting_library variable in DF/Vector since it's pulling in plotting functionality into core daru data structures?

This comment has been minimized.

@zverok

zverok Jul 8, 2016

Collaborator

Is your main concern the plotting_library variable in DF/Vector since it's pulling in plotting functionality into core daru data structures?

Yep, speaking about current work, it is.

@zverok

zverok Jul 8, 2016

Collaborator

Is your main concern the plotting_library variable in DF/Vector since it's pulling in plotting functionality into core daru data structures?

Yep, speaking about current work, it is.

Show outdated Hide outdated lib/daru/category.rb
File.expand_path('../iruby/templates/vector.html.erb', __FILE__)
end
ERB.new(File.read(path).strip).result(binding)
end

This comment has been minimized.

@zverok

zverok Jul 3, 2016

Collaborator

Why do we need it?.. The definition seems to be exactly the same as in Vector

@zverok

zverok Jul 3, 2016

Collaborator

Why do we need it?.. The definition seems to be exactly the same as in Vector

This comment has been minimized.

@lokeshh

lokeshh Jul 3, 2016

Contributor

Ah... Sorry about that. Didn't thought about that. Removing it now.

@lokeshh

lokeshh Jul 3, 2016

Contributor

Ah... Sorry about that. Didn't thought about that. Removing it now.

Show outdated Hide outdated lib/daru/category.rb
# See #reindex!
def reindex index
dup.reindex! index
end

This comment has been minimized.

@zverok

zverok Jul 3, 2016

Collaborator

Also should be defined in Vector, don't it?..

@zverok

zverok Jul 3, 2016

Collaborator

Also should be defined in Vector, don't it?..

This comment has been minimized.

@lokeshh

lokeshh Jul 3, 2016

Contributor

Removing it.

BTW One question, I have written specs for reindex!, should I also write specs for reindex?

@lokeshh

lokeshh Jul 3, 2016

Contributor

Removing it.

BTW One question, I have written specs for reindex!, should I also write specs for reindex?

This comment has been minimized.

@zverok

zverok Jul 4, 2016

Collaborator

should I also write specs for reindex?

It is pretty trivial, so I think could be skipped.
But, ideally, I'd write specs testing that new value is reindexed, and old value still intact.

@zverok

zverok Jul 4, 2016

Collaborator

should I also write specs for reindex?

It is pretty trivial, so I think could be skipped.
But, ideally, I'd write specs testing that new value is reindexed, and old value still intact.

Show outdated Hide outdated lib/daru/category.rb
min_category: @cat_hash.keys.min_by { |cat| @cat_hash[cat].size }
}
Daru::Vector.new values

This comment has been minimized.

@zverok

zverok Jul 3, 2016

Collaborator

Maybe just:

def summary
  Daru::Vector.new(
    size: size,
    categories: categories.size,
    max_freq: @cat_hash.values.map(&:size).max,
    max_category: @cat_hash.keys.max_by { |cat| @cat_hash[cat].size },
    min_freq: @cat_hash.values.map(&:size).min,
    min_category: @cat_hash.keys.min_by { |cat| @cat_hash[cat].size }
  )

WDYT?

@zverok

zverok Jul 3, 2016

Collaborator

Maybe just:

def summary
  Daru::Vector.new(
    size: size,
    categories: categories.size,
    max_freq: @cat_hash.values.map(&:size).max,
    max_category: @cat_hash.keys.max_by { |cat| @cat_hash[cat].size },
    min_freq: @cat_hash.values.map(&:size).min,
    min_category: @cat_hash.keys.min_by { |cat| @cat_hash[cat].size }
  )

WDYT?

This comment has been minimized.

@lokeshh

lokeshh Jul 3, 2016

Contributor

Yeah. Its better. Thanks

@lokeshh

lokeshh Jul 3, 2016

Contributor

Yeah. Its better. Thanks

Show outdated Hide outdated lib/daru/category.rb
Daru::Vector.new values
end
alias_method :describe, :summary

This comment has been minimized.

@zverok

zverok Jul 3, 2016

Collaborator

Are we sure we want to spend not even one, but two of such rich and common method names for this particular functionality? @v0dro ?

@zverok

zverok Jul 3, 2016

Collaborator

Are we sure we want to spend not even one, but two of such rich and common method names for this particular functionality? @v0dro ?

This comment has been minimized.

@lokeshh

lokeshh Jul 3, 2016

Contributor

You are right. I have renamed this function to describe and removed the name summary. It can be used for some other function in the future.

@lokeshh

lokeshh Jul 3, 2016

Contributor

You are right. I have renamed this function to describe and removed the name summary. It can be used for some other function in the future.

Show outdated Hide outdated lib/daru/category.rb
# type category
# @return [TypeError] Raises an exception
def to_category
raise TypeError, 'The vector is already of type category.'

This comment has been minimized.

@zverok

zverok Jul 3, 2016

Collaborator

Typically, in such cases return self is more appropriate solution:

[1,2,3].to_a
# => [1, 2, 3] 
{a: 1, b: 2}.to_h
# => {:a=>1, :b=>2} 

I mean, when we are raising an exception in this case, what kind of error we want to signalize about?..

@zverok

zverok Jul 3, 2016

Collaborator

Typically, in such cases return self is more appropriate solution:

[1,2,3].to_a
# => [1, 2, 3] 
{a: 1, b: 2}.to_h
# => {:a=>1, :b=>2} 

I mean, when we are raising an exception in this case, what kind of error we want to signalize about?..

This comment has been minimized.

@lokeshh

lokeshh Jul 3, 2016

Contributor

Yeah. Makes sense. Thanks. Changing it to return self.

@lokeshh

lokeshh Jul 3, 2016

Contributor

Yeah. Makes sense. Thanks. Changing it to return self.

Show outdated Hide outdated lib/daru/category.rb
end
def assert_ordered operation
# Change ArgumentError to something more expressive

This comment has been minimized.

@zverok

zverok Jul 3, 2016

Collaborator

It is ok to leave this kind of comments, but please add TODO or FIXME tag to it.

@zverok

zverok Jul 3, 2016

Collaborator

It is ok to leave this kind of comments, but please add TODO or FIXME tag to it.

This comment has been minimized.

@lokeshh

lokeshh Jul 3, 2016

Contributor

Oops... I've made this mistake twice. Won't happen again :)

@lokeshh

lokeshh Jul 3, 2016

Contributor

Oops... I've made this mistake twice. Won't happen again :)

Show outdated Hide outdated lib/daru/dataframe.rb
def split_by_category cat_name
cat_dv = self[cat_name]
raise ArguementError, "#{cat_name} is not a category vector" if
cat_dv.type != :category

This comment has been minimized.

@zverok

zverok Jul 3, 2016

Collaborator

Maybe cat_dv.category? could look better for such cases?

@zverok

zverok Jul 3, 2016

Collaborator

Maybe cat_dv.category? could look better for such cases?

This comment has been minimized.

@lokeshh

lokeshh Jul 3, 2016

Contributor

Yep. Implementing it.

@lokeshh

lokeshh Jul 3, 2016

Contributor

Yep. Implementing it.

Show outdated Hide outdated lib/daru/plotting/dataframe.rb
.select { |a| a =~ Regexp.new("\\A#{opt}") }
.sort
.map { |a| options[a] }
module GruffLibrary

This comment has been minimized.

@zverok

zverok Jul 3, 2016

Collaborator

Two separate files for nyaplot and gruff?

@zverok

zverok Jul 3, 2016

Collaborator

Two separate files for nyaplot and gruff?

Show outdated Hide outdated lib/daru/plotting/vector.rb
end
end
module GruffLibrary

This comment has been minimized.

@zverok

zverok Jul 3, 2016

Collaborator

And, further, it'll be reasonable to structure files now this way:

nyaplot.rb
nyaplot/
+-vector.rb
+-dataframe.rb
gruff.rb
gruff/
+-vector.rb
+-dataframe.rb
@zverok

zverok Jul 3, 2016

Collaborator

And, further, it'll be reasonable to structure files now this way:

nyaplot.rb
nyaplot/
+-vector.rb
+-dataframe.rb
gruff.rb
gruff/
+-vector.rb
+-dataframe.rb

This comment has been minimized.

@lokeshh

lokeshh Jul 3, 2016

Contributor

What should be in nyaplot.rb and gruff.rb?

@lokeshh

lokeshh Jul 3, 2016

Contributor

What should be in nyaplot.rb and gruff.rb?

This comment has been minimized.